Data Sources

Connect vector search backends for retrieval-augmented generation (RAG). Data sources provide semantic search over external knowledge bases.

Quick Start

builder.Services
    .AddCoreAIServices()
    .AddCoreAIOrchestration()

    // Add one or both backends:
    .AddCoreElasticsearchServices(
        builder.Configuration.GetSection("CrestApps:Elasticsearch"))
    .AddCoreAzureAISearchServices(
        builder.Configuration.GetSection("CrestApps:AzureAISearch"));

If you are new to AI-powered search, here is a brief primer on the concepts that data sources rely on.

Embeddings

An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Two texts with similar meanings produce vectors that are close together in mathematical space — even if they use completely different words. For example, "How do I reset my password?" and "I forgot my login credentials" produce similar embeddings.

Embeddings are generated by specialized AI models (e.g., text-embedding-ada-002 from OpenAI or an equivalent model in Azure AI).
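
"Close together" is commonly measured with cosine similarity. The following is a minimal sketch using made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions). `EmbeddingMath` is a hypothetical helper written for this example, not part of the framework.

```csharp
using System;

// Toy vectors standing in for real embeddings (values are invented for illustration).
float[] resetPassword = { 0.9f, 0.1f, 0.2f };
float[] forgotLogin = { 0.8f, 0.2f, 0.3f };
float[] refundPolicy = { 0.1f, 0.9f, 0.4f };

// The first pair points in nearly the same direction, so it scores higher than the second.
Console.WriteLine(EmbeddingMath.CosineSimilarity(resetPassword, forgotLogin));
Console.WriteLine(EmbeddingMath.CosineSimilarity(resetPassword, refundPolicy));

// Hypothetical helper, not a framework type.
public static class EmbeddingMath
{
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
        {
            throw new ArgumentException("Vectors must have the same length.");
        }

        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        // A zero vector has no direction; treat it as unrelated to everything.
        if (normA == 0 || normB == 0)
        {
            return 0;
        }

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

A score near 1 means the vectors point in almost the same direction; a score near 0 means they are unrelated.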

Given a user's question, vector search converts it into an embedding and finds the documents whose embeddings are closest. This is fundamentally different from keyword search:

| | Keyword Search | Vector Search |
|---|---|---|
| Query: "password reset" | Matches documents containing "password" and "reset" | Matches documents about authentication help, even if those words don't appear |
| Handles synonyms? | No | Yes |
| Handles typos? | Limited | Yes |
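
The ranking step of vector search can be sketched as a brute-force scan over an in-memory list. This is only an illustration: production backends such as Elasticsearch or Azure AI Search use dedicated vector indexes, and the documents, two-dimensional vectors, and `ToyVectorSearch` type below are all invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Invented documents with invented 2-D "embeddings".
var index = new List<(string Text, float[] Vector)>
{
    ("Steps to recover account access", new[] { 0.9f, 0.1f }),
    ("Refund policy for annual plans", new[] { 0.1f, 0.9f }),
    ("Changing your sign-in details", new[] { 0.8f, 0.3f }),
};

// Pretend this is the embedding of the query "password reset".
float[] query = { 1.0f, 0.0f };

foreach (var hit in ToyVectorSearch.TopN(index, query, 2))
{
    Console.WriteLine($"{hit.Score:F3}  {hit.Text}");
}

// Hypothetical helper, not a framework type.
public static class ToyVectorSearch
{
    // Scores every document against the query and keeps the topN best matches.
    public static List<(string Text, double Score)> TopN(
        IEnumerable<(string Text, float[] Vector)> index,
        float[] query,
        int topN)
    {
        return index
            .Select(d => (d.Text, Score: Cosine(d.Vector, query)))
            .OrderByDescending(r => r.Score)
            .Take(topN)
            .ToList();
    }

    private static double Cosine(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        return normA == 0 || normB == 0 ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Neither of the two returned documents contains the words "password" or "reset"; they rank first only because their vectors lie close to the query vector.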

Retrieval-Augmented Generation (RAG)

RAG combines vector search with generative AI. Instead of asking the AI model to answer from its training data alone, you retrieve relevant documents and inject them into the prompt so the model can ground its response in your actual data. This typically reduces hallucinations and lets the model answer questions about private or recent information.
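
The "inject them into the prompt" step can be sketched as plain string assembly. The prompt wording, the `[doc:N]` citation markers, and the `RagPrompt` helper are assumptions made for this example; the framework's own handlers may format the injected context differently.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

var chunks = new List<string>
{
    "Refunds are available within 30 days of purchase.",
    "Annual plans are refunded on a pro rata basis.",
};

Console.WriteLine(RagPrompt.BuildSystemMessage("You are a support assistant.", chunks));

// Hypothetical helper, not a framework type.
public static class RagPrompt
{
    // Appends retrieved chunks to the base instructions so the model can ground its answer.
    public static string BuildSystemMessage(string baseInstructions, IReadOnlyList<string> chunks)
    {
        var builder = new StringBuilder(baseInstructions);

        // Without retrieved context, the base instructions are returned unchanged.
        if (chunks.Count == 0)
        {
            return builder.ToString();
        }

        builder.AppendLine().AppendLine();
        builder.AppendLine("Answer using only the context below and cite sources as [doc:N].");

        for (var i = 0; i < chunks.Count; i++)
        {
            builder.AppendLine($"[doc:{i + 1}] {chunks[i]}");
        }

        return builder.ToString();
    }
}
```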

RAG Pipeline

Here is the end-to-end flow when a user sends a query to a profile with data sources configured:

1. User Query: "What is our refund policy?"
        │
        ▼
2. Embedding Generation
   └── The query is converted to a vector using an embedding model
        │
        ▼
3. Vector Search (IVectorSearchService)
   └── The vector is compared against indexed document chunks
   └── Top-N most similar chunks are returned (e.g., top 3)
        │
        ▼
4. Context Enrichment
   └── `DataSourceAICompletionContextBuilderHandler` selects the attached data source
   └── `DataSourceOrchestrationHandler` injects availability/tool guidance
   └── `DataSourcePreemptiveRagHandler` injects retrieved chunks as system context
        │
        ▼
5. AI Completion
   └── The model generates a response grounded in the retrieved documents
        │
        ▼
6. Response with Citations
   └── The response references the source documents

Tip: The number of document chunks retrieved (Top-N) is configurable per profile. A higher value provides more context but uses more tokens. The default is 3.

Architecture

Data sources integrate with the orchestration pipeline through three shared framework components:

  1. DataSourceAICompletionContextBuilderHandler copies the selected data source id into the completion context
  2. DataSourceOrchestrationHandler injects data-source availability instructions and keeps the search tool in scope
  3. DataSourcePreemptiveRagHandler performs preemptive retrieval and injects matching chunks into the system message

The same shared framework layer now also exposes IAIDataSourceIndexingService for keeping knowledge-base indexes synchronized with their source indexes. MVC uses that service to rebuild data sources on demand, react to source-content changes, and run periodic background alignment.

User Query
    │
    ▼
DataSourceAICompletionContextBuilderHandler
    │ (attaches data source id)
    ▼
DataSourceOrchestrationHandler
    │ (adds data-source availability + tool guidance)
    ▼
DataSourcePreemptiveRagHandler
    │ (queries vector store)
    ▼
Completion Context (enriched with relevant documents)
    │
    ▼
AI Model (grounds response in retrieved data)
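
The order of the three steps can be sketched with stand-in types. Everything here is hypothetical: `ToyContext`, `ToyPipeline`, the message wording, and the early exit when no data source is attached are assumptions for illustration only. The real handlers implement framework interfaces that this page does not show.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

var context = ToyPipeline.Run(
    dataSourceId: "kb-support",
    retrieve: id => new[] { "Refunds are available within 30 days of purchase." });

foreach (var message in context.SystemMessages)
{
    Console.WriteLine(message);
}

// Hypothetical stand-in for the framework's completion context.
public sealed class ToyContext
{
    public string? DataSourceId { get; set; }

    public List<string> SystemMessages { get; } = new();
}

// Hypothetical pipeline that mimics the three handlers in order.
public static class ToyPipeline
{
    public static ToyContext Run(string? dataSourceId, Func<string, IEnumerable<string>> retrieve)
    {
        var context = new ToyContext();

        // 1. Context builder step: copy the selected data source id into the context.
        context.DataSourceId = dataSourceId;

        // Assumption: the later steps have nothing to do without a data source.
        if (context.DataSourceId is null)
        {
            return context;
        }

        // 2. Orchestration step: tell the model that a data source is available.
        context.SystemMessages.Add("A knowledge base is attached; use the search tool for follow-up lookups.");

        // 3. Preemptive RAG step: retrieve chunks and add them as system context.
        foreach (var chunk in retrieve(context.DataSourceId))
        {
            context.SystemMessages.Add($"Context: {chunk}");
        }

        return context;
    }
}
```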

Common Services (Keyed by Provider Name)

Each data source backend registers these services, keyed by its provider name:

| Service | Purpose |
|---|---|
| IDataSourceContentManager | Manages content in data source indices |
| IDataSourceDocumentReader | Reads documents from data source indices |
| IAIDataSourceIndexingService | Rebuilds and repairs knowledge-base indexes from source indexes |
| IODataFilterTranslator | Translates OData filters to backend-native queries |
| ISearchIndexManager | Creates, deletes, and manages search indices |
| ISearchDocumentManager | Indexes and removes documents in search indices |
| IVectorSearchService | Performs vector similarity search |

Available Backends

| Backend | Extension | Provider Name | Documentation |
|---|---|---|---|
| Elasticsearch | AddCoreElasticsearchServices() | "Elasticsearch" | Elasticsearch |
| Azure AI Search | AddCoreAzureAISearchServices() | "AzureAISearch" | Azure AI Search |
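
Keyed registration means a consumer picks an implementation by provider name at runtime. The sketch below shows that mechanism with the keyed-service API in Microsoft.Extensions.DependencyInjection (version 8.0 or later). `IToySearch` and its two implementations are hypothetical stand-ins for the real services; how the framework itself resolves them internally is not shown on this page.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

Console.WriteLine(KeyedDemo.ResolveName("Elasticsearch"));
Console.WriteLine(KeyedDemo.ResolveName("AzureAISearch"));

// Hypothetical stand-ins for a provider-specific service.
public interface IToySearch
{
    string Name { get; }
}

public sealed class ElasticToySearch : IToySearch
{
    public string Name => "Elasticsearch implementation";
}

public sealed class AzureToySearch : IToySearch
{
    public string Name => "Azure AI Search implementation";
}

public static class KeyedDemo
{
    public static string ResolveName(string providerName)
    {
        var services = new ServiceCollection();

        // Same service type, two implementations, distinguished by key.
        services.AddKeyedScoped<IToySearch, ElasticToySearch>("Elasticsearch");
        services.AddKeyedScoped<IToySearch, AzureToySearch>("AzureAISearch");

        using var provider = services.BuildServiceProvider();
        using var scope = provider.CreateScope();

        // Throws InvalidOperationException when nothing is registered under the key.
        var search = scope.ServiceProvider.GetRequiredKeyedService<IToySearch>(providerName);
        return search.Name;
    }
}
```

Consumers that receive dependencies through constructor injection can use the `[FromKeyedServices("Elasticsearch")]` attribute on a parameter instead of resolving from the provider directly.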

Key Interfaces Deep Dive

IVectorSearchService

Performs vector similarity search against an index. This is the core search operation used during RAG.

public interface IVectorSearchService
{
    Task<IEnumerable<DocumentChunkSearchResult>> SearchAsync(
        IIndexProfileInfo indexProfile,
        float[] embedding,
        string referenceId,
        string referenceType,
        int topN,
        CancellationToken cancellationToken = default);
}

The embedding parameter is the vector representation of the user's query. The topN parameter controls how many chunks to return.

ISearchDocumentManager

Manages the lifecycle of documents within a search index — adding, updating, and removing documents.

public interface ISearchDocumentManager
{
    Task<bool> AddOrUpdateAsync(
        IIndexProfileInfo profile,
        IReadOnlyCollection<IndexDocument> documents,
        CancellationToken cancellationToken = default);

    Task DeleteAsync(
        IIndexProfileInfo profile,
        IEnumerable<string> documentIds,
        CancellationToken cancellationToken = default);

    Task DeleteAllAsync(
        IIndexProfileInfo profile,
        CancellationToken cancellationToken = default);
}

ISearchIndexManager

Creates and manages search indexes themselves (not the documents within them).

public interface ISearchIndexManager
{
    Task<bool> ExistsAsync(string indexFullName, CancellationToken cancellationToken = default);

    Task CreateAsync(
        IIndexProfileInfo profile,
        IReadOnlyCollection<SearchIndexField> fields,
        CancellationToken cancellationToken = default);

    Task DeleteAsync(string indexFullName, CancellationToken cancellationToken = default);
}

IDataSourceContentManager

A higher-level service that searches for document chunks with optional OData filtering and manages data source content.

public interface IDataSourceContentManager
{
    Task<IEnumerable<DataSourceSearchResult>> SearchAsync(
        IIndexProfileInfo indexProfile,
        float[] embedding,
        string dataSourceId,
        int topN,
        string filter = null,
        CancellationToken cancellationToken = default);

    Task<long> DeleteByDataSourceIdAsync(
        IIndexProfileInfo indexProfile,
        string dataSourceId,
        CancellationToken cancellationToken = default);
}

The optional filter parameter accepts an OData expression that is translated to the backend-native query language via IODataFilterTranslator.

IAIDataSourceIndexingService

Coordinates full and partial synchronization between a source index and its AI knowledge-base index.

public interface IAIDataSourceIndexingService
{
    Task SyncAllAsync(CancellationToken cancellationToken = default);
    Task SyncDataSourceAsync(AIDataSource dataSource, CancellationToken cancellationToken = default);
    Task SyncSourceDocumentsAsync(IEnumerable<string> documentIds, CancellationToken cancellationToken = default);
    Task RemoveSourceDocumentsAsync(IEnumerable<string> documentIds, CancellationToken cancellationToken = default);
    Task DeleteDataSourceDocumentsAsync(AIDataSource dataSource, CancellationToken cancellationToken = default);
}

SyncDataSourceAsync() performs a full rebuild for a data source by deleting that data source's existing chunk documents and re-reading the mapped source index through IDataSourceDocumentReader. SyncSourceDocumentsAsync() and RemoveSourceDocumentsAsync() are intended for source-level handlers so article or catalog updates can keep the knowledge-base index aligned without waiting for the next scheduled full sync.
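
A source-level handler along these lines could forward content changes to the indexing service. To keep the sample compilable on its own, `IToyIndexingService` mirrors only the two relevant members of IAIDataSourceIndexingService; `ArticleChangeHandler`, its method names, and `RecordingIndexingService` are hypothetical and exist only for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

var indexing = new RecordingIndexingService();
var handler = new ArticleChangeHandler(indexing);

await handler.OnPublishedAsync(new[] { "article-1", "article-2" });
await handler.OnDeletedAsync(new[] { "article-3" });

foreach (var entry in indexing.Log)
{
    Console.WriteLine(entry);
}

// Local copy of two members of IAIDataSourceIndexingService, for illustration only.
public interface IToyIndexingService
{
    Task SyncSourceDocumentsAsync(IEnumerable<string> documentIds, CancellationToken cancellationToken = default);

    Task RemoveSourceDocumentsAsync(IEnumerable<string> documentIds, CancellationToken cancellationToken = default);
}

// Hypothetical handler that reacts to source-content changes.
public sealed class ArticleChangeHandler
{
    private readonly IToyIndexingService _indexing;

    public ArticleChangeHandler(IToyIndexingService indexing)
    {
        _indexing = indexing;
    }

    // Published or updated articles are re-synchronized into the knowledge-base index.
    public Task OnPublishedAsync(IEnumerable<string> articleIds, CancellationToken cancellationToken = default)
        => _indexing.SyncSourceDocumentsAsync(articleIds, cancellationToken);

    // Deleted articles have their chunks removed.
    public Task OnDeletedAsync(IEnumerable<string> articleIds, CancellationToken cancellationToken = default)
        => _indexing.RemoveSourceDocumentsAsync(articleIds, cancellationToken);
}

// Fake implementation that records calls instead of touching an index.
public sealed class RecordingIndexingService : IToyIndexingService
{
    public List<string> Log { get; } = new();

    public Task SyncSourceDocumentsAsync(IEnumerable<string> documentIds, CancellationToken cancellationToken = default)
    {
        Log.Add("sync: " + string.Join(",", documentIds));
        return Task.CompletedTask;
    }

    public Task RemoveSourceDocumentsAsync(IEnumerable<string> documentIds, CancellationToken cancellationToken = default)
    {
        Log.Add("remove: " + string.Join(",", documentIds));
        return Task.CompletedTask;
    }
}
```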

IDataSourceDocumentReader

Reads raw documents from a source index, typically used during re-indexing or migration.

public interface IDataSourceDocumentReader
{
    IAsyncEnumerable<KeyValuePair<string, SourceDocument>> ReadAsync(
        IIndexProfileInfo indexProfile,
        string keyFieldName,
        string titleFieldName,
        string contentFieldName,
        CancellationToken cancellationToken = default);

    IAsyncEnumerable<KeyValuePair<string, SourceDocument>> ReadByIdsAsync(
        IIndexProfileInfo indexProfile,
        IEnumerable<string> documentIds,
        string keyFieldName,
        string titleFieldName,
        string contentFieldName,
        CancellationToken cancellationToken = default);
}
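
Because both methods return IAsyncEnumerable, callers can stream documents with `await foreach` instead of loading a whole index into memory. The sketch below groups a stream into fixed-size batches, which is one plausible way to feed an embedding or indexing step. `ToyDocument`, `FakeReader`, and `BatchReader` are hypothetical; the real SourceDocument type is not shown on this page.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

var batchNumber = 0;
await foreach (var batch in BatchReader.InBatchesAsync(FakeReader.ReadAsync(5), batchSize: 2))
{
    batchNumber++;
    Console.WriteLine($"Batch {batchNumber}: {batch.Count} document(s)");
}

// Hypothetical stand-in for the framework's SourceDocument.
public sealed record ToyDocument(string Title, string Content);

// Fake reader that yields generated documents asynchronously.
public static class FakeReader
{
    public static async IAsyncEnumerable<KeyValuePair<string, ToyDocument>> ReadAsync(int count)
    {
        for (var i = 1; i <= count; i++)
        {
            await Task.Yield(); // Simulates asynchronous I/O.
            yield return new KeyValuePair<string, ToyDocument>(
                $"doc-{i}",
                new ToyDocument($"Title {i}", $"Content {i}"));
        }
    }
}

public static class BatchReader
{
    // Groups a document stream into lists of at most batchSize items.
    public static async IAsyncEnumerable<List<KeyValuePair<string, ToyDocument>>> InBatchesAsync(
        IAsyncEnumerable<KeyValuePair<string, ToyDocument>> source,
        int batchSize,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        if (batchSize <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(batchSize));
        }

        var batch = new List<KeyValuePair<string, ToyDocument>>(batchSize);

        await foreach (var item in source.WithCancellation(cancellationToken))
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<KeyValuePair<string, ToyDocument>>(batchSize);
            }
        }

        // Emit the final, possibly smaller, batch.
        if (batch.Count > 0)
        {
            yield return batch;
        }
    }
}
```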

IODataFilterTranslator

Translates OData $filter expressions into backend-native query syntax. Each backend (Elasticsearch, Azure AI Search) has its own implementation.

public interface IODataFilterTranslator
{
    string Translate(string odataFilter);
}

For example, the Elasticsearch translator converts category eq 'support' into an Elasticsearch query DSL filter targeting the filters.category field.
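
To make the idea concrete, here is a deliberately tiny translator that handles only a single `field eq 'value'` comparison and emits an Elasticsearch `term` query. It is a sketch, not the framework's implementation: the built-in translators presumably support far more of the OData grammar, and the choice of a `term` query is an assumption. `ToyODataTranslator` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

Console.WriteLine(ToyODataTranslator.Translate("category eq 'support'"));

// Hypothetical translator, not a framework type.
public static class ToyODataTranslator
{
    // Matches exactly one comparison of the form: field eq 'value'
    private static readonly Regex EqPattern =
        new(@"^\s*(?<field>\w+)\s+eq\s+'(?<value>[^']*)'\s*$", RegexOptions.Compiled);

    public static string Translate(string odataFilter)
    {
        var match = EqPattern.Match(odataFilter);
        if (!match.Success)
        {
            throw new NotSupportedException($"Unsupported filter: {odataFilter}");
        }

        // Filterable fields are assumed to live under the "filters" object.
        var field = "filters." + match.Groups["field"].Value;

        var query = new Dictionary<string, object>
        {
            ["term"] = new Dictionary<string, string> { [field] = match.Groups["value"].Value },
        };

        return JsonSerializer.Serialize(query);
    }
}
```

For the input above, the sketch prints `{"term":{"filters.category":"support"}}`. Anything beyond a single equality test (`and`, `or`, `gt`, functions) throws NotSupportedException.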

Adding a Custom Backend

To add a custom vector store backend (e.g., Pinecone, Qdrant, Weaviate), implement all six keyed services:

const string providerName = "MyBackend";

builder.Services.AddKeyedScoped<IVectorSearchService, MyVectorSearchService>(providerName);
builder.Services.AddKeyedScoped<ISearchIndexManager, MySearchIndexManager>(providerName);
builder.Services.AddKeyedScoped<ISearchDocumentManager, MySearchDocumentManager>(providerName);
builder.Services.AddKeyedScoped<IDataSourceContentManager, MyDataSourceContentManager>(providerName);
builder.Services.AddKeyedScoped<IDataSourceDocumentReader, MyDataSourceDocumentReader>(providerName);
builder.Services.AddKeyedScoped<IODataFilterTranslator, MyODataFilterTranslator>(providerName);

Example: Custom Vector Search Implementation

// Illustrative sketch: PineconeClient and QueryRequest stand in for your vector store's
// client types. Check the SDK you use for the actual API.
public sealed class PineconeVectorSearchService : IVectorSearchService
{
    private readonly PineconeClient _client;

    public PineconeVectorSearchService(PineconeClient client)
    {
        _client = client;
    }

    public async Task<IEnumerable<DocumentChunkSearchResult>> SearchAsync(
        IIndexProfileInfo indexProfile,
        float[] embedding,
        string referenceId,
        string referenceType,
        int topN,
        CancellationToken cancellationToken = default)
    {
        // referenceId and referenceType are not applied in this simplified example.
        var response = await _client.QueryAsync(new QueryRequest
        {
            Vector = embedding,
            TopK = topN,
            Namespace = indexProfile.IndexFullName,
            IncludeMetadata = true,
        }, cancellationToken);

        // Map each match to the framework's result type.
        return response.Matches.Select(match => new DocumentChunkSearchResult
        {
            DocumentId = match.Id,
            Score = match.Score,
            Content = match.Metadata["content"].ToString(),
        });
    }
}

Warning: All six services must be registered with the same providerName key. The framework resolves them by key at runtime based on the data source configuration.

Configuration Guide

Data source backends are configured in appsettings.json under the CrestApps section. Each backend has its own configuration section, and its path must match the path passed to GetSection in the registration code:

{
  "CrestApps": {
    "Elasticsearch": {
      "Url": "https://localhost:9200",
      "Username": "elastic",
      "Password": "your-password"
    },
    "AzureAISearch": {
      "Endpoint": "https://my-search.search.windows.net",
      "ApiKey": "your-admin-api-key"
    }
  }
}

The configuration section is passed to the provider registration extension method:

// Bind from configuration
builder.Services.AddCoreElasticsearchServices(
    builder.Configuration.GetSection("CrestApps:Elasticsearch"));

// Or bind Azure AI Search
builder.Services.AddCoreAzureAISearchServices(
    builder.Configuration.GetSection("CrestApps:AzureAISearch"));
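
GetSection returns an empty section rather than throwing when the path does not match the JSON layout, so a mistyped path can go unnoticed until the first search fails. A startup check along these lines can surface the problem early. `ConfigCheck.RequireSection` is a hypothetical helper, and the in-memory settings stand in for appsettings.json.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;

// In-memory settings standing in for appsettings.json.
var configuration = new ConfigurationBuilder()
    .AddInMemoryCollection(new Dictionary<string, string?>
    {
        ["CrestApps:Elasticsearch:Url"] = "https://localhost:9200",
        ["CrestApps:Elasticsearch:Username"] = "elastic",
    })
    .Build();

var section = ConfigCheck.RequireSection(configuration, "CrestApps:Elasticsearch");
Console.WriteLine(section["Url"]);

// Hypothetical helper, not a framework type.
public static class ConfigCheck
{
    // Returns the section at the given path, or throws when it is missing or empty.
    public static IConfigurationSection RequireSection(IConfiguration configuration, string path)
    {
        var section = configuration.GetSection(path);

        if (!section.Exists())
        {
            throw new InvalidOperationException(
                $"Configuration section '{path}' was not found or is empty.");
        }

        return section;
    }
}
```

Passing the checked section to the registration method (for example `AddCoreElasticsearchServices(section)`) keeps the validation and the binding on the same path.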

See the individual backend pages for detailed configuration options: Elasticsearch | Azure AI Search