AI Documents
A complete document management pipeline that reads uploaded files, splits them into chunks, generates vector embeddings, and makes the content searchable via semantic similarity — enabling retrieval-augmented generation (RAG) in AI conversations.
Quick Start
builder.Services
.AddCoreAIServices()
.AddCoreAIOrchestration()
.AddCoreAIChatInteractions()
.AddCoreAIDocumentProcessing()
.AddCoreAIOpenAI();
// Register document and chunk stores
builder.Services.AddScoped<IAIDocumentStore, YesSqlAIDocumentStore>();
builder.Services.AddScoped<IAIDocumentChunkStore, YesSqlAIDocumentChunkStore>();
AddCoreAIDocumentProcessing() is shipped by CrestApps.Core.AI.Documents.
Upload a file and process it:
public sealed class DocumentUploadController(
IAIDocumentProcessingService processingService,
IAIClientFactory aiClientFactory,
IAIDeploymentManager deploymentManager,
IAIDocumentStore documentStore) : Controller
{
[HttpPost]
public async Task<IActionResult> Upload(
IFormFile file,
string referenceId,
string referenceType)
{
var embeddingDeployment =
await deploymentManager.ResolveOrDefaultAsync(AIDeploymentPurpose.Embedding);
var embeddingGenerator = embeddingDeployment is null
? null
: await aiClientFactory.CreateEmbeddingGeneratorAsync(embeddingDeployment);
var result = await processingService.ProcessFileAsync(
file, referenceId, referenceType, embeddingGenerator);
return result.Succeeded ? Ok(result) : BadRequest(result);
}
}
Problem & Solution
Users upload documents (PDFs, Word files, spreadsheets, text files) and expect the AI to answer questions about them. This requires a multi-stage pipeline:
- Reading: Extract plain text from diverse file formats (.pdf, .docx, .xlsx, .csv, .txt, .md, and more)
- Chunking: Split large documents into segments small enough to embed
- Embedding: Convert each chunk into a vector representation using a configured embedding model
- Indexing: Store embeddings in a vector search index (Elasticsearch or Azure AI Search)
- Searching: At query time, perform semantic similarity search to find the most relevant chunks
- Tabular processing: CSV and Excel files receive special treatment with structured, batch-oriented queries
The document processing system handles this full pipeline from upload to retrieval, while the built-in document tools make the content available to the AI during orchestration.
When a chat deployment also supports the Vision purpose, chat interaction and chat session uploads can include supported image formats (.bmp, .gif, .jpeg, .jpg, .png, .webp) alongside standard document files. Those images are stored as AIDocument records, analyzed at upload time by IImageAnalysisService to extract a structured summary (caption, OCR text, detected entities), and the results are persisted as AIDocumentChunk records — exactly like text documents. This makes image content available through the same read_document and search_documents tools used for regular documents.
For cases where the text analysis is insufficient (e.g., reading fine text, comparing visual elements, or understanding spatial layout), the inspect_image tool provides on-demand raw image inspection by sending the original bytes to a vision model in a one-shot call. This approach eliminates the cost of attaching raw image bytes to every chat request while preserving full visual inspection capability when needed.
The ChatDocumentsOptions.AnalyzeImagesAtUpload setting controls whether analysis runs at upload time, and MaxInspectImageCallsPerRequest limits how many costly raw-image inspections the model can perform per turn.
Creating a chat client to describe an image
When you want to call a vision-capable model directly, resolve the deployment, create an IChatClient, and send a multimodal user message:
public sealed class ImageDescriptionService(
IAIDeploymentManager deploymentManager,
IAIClientFactory clientFactory)
{
public async Task<string> DescribeImageAsync(
string imagePath,
string chatDeploymentName,
CancellationToken cancellationToken = default)
{
var deployment = await deploymentManager.ResolveOrDefaultAsync(
AIDeploymentPurpose.Chat,
deploymentName: chatDeploymentName,
cancellationToken: cancellationToken);
if (deployment?.Purpose.Supports(AIDeploymentPurpose.Vision) != true)
{
throw new InvalidOperationException("The selected chat deployment does not support vision.");
}
var chatClient = await clientFactory.CreateChatClientAsync(deployment);
var imageBytes = await File.ReadAllBytesAsync(imagePath, cancellationToken);
var mediaType = MediaTypeHelper.InferMediaType(Path.GetExtension(imagePath));
var response = await chatClient.GetResponseAsync(
[
new ChatMessage(
ChatRole.User,
[
new TextContent("Describe this image in detail."),
new DataContent(imageBytes, mediaType),
]),
],
cancellationToken: cancellationToken);
return response.Text;
}
}
Architecture Overview
┌─────────────┐
│ User Upload │
└──────┬──────┘
▼
┌──────────────────────────────┐
│ IAIDocumentProcessingService │ ← Orchestrates the pipeline
├──────────────────────────────┤
│ 1. Store document record │ → IAIDocumentStore
│ 2. Read file content │ → IngestionDocumentReader (keyed by extension)
│ 3. Normalize & chunk text │ → RagTextNormalizer
│ 4. Store chunks │ → IAIDocumentChunkStore
│ 5. Generate embeddings │ → IEmbeddingGenerator<string, Embedding<float>>
│ 6. Index in vector store │ → ISearchDocumentManager (Elasticsearch / Azure AI)
└──────────────────────────────┘
┌─────────────────────────────────────┐
│ During Conversation │
├─────────────────────────────────────┤
│ DocumentOrchestrationHandler │
│ detects documents on the session │
│ and injects document tools: │
│ │
│ • SearchDocumentsTool (vector RAG) │
│ • ReadDocumentTool (full text read) │
│ • ReadTabularDataTool (CSV/Excel) │
│ • InspectImageTool (vision on-demand)│
└──────────────┬──────────────────────┘
▼
┌─────────────────────────────────────┐
│ AI Model calls tools as needed │
│ to answer user questions about │
│ the uploaded documents │
└─────────────────────────────────────┘
Core Interfaces
| Interface | Package | Purpose |
|---|---|---|
| IAIDocumentStore | CrestApps.Core.AI.Documents | CRUD for document records |
| IAIDocumentChunkStore | CrestApps.Core.AI.Documents | CRUD for document chunks |
| IAIDocumentProcessingService | CrestApps.Core.AI.Documents | Orchestrates file → chunk → embed → index |
| IDocumentFileStore | CrestApps.Core.AI.Documents | Persists uploaded document files to a swappable storage backend |
| ISearchDocumentManager | CrestApps.Core.AI | Manages documents in the vector search index |
| IVectorSearchService | CrestApps.Core.AI | Performs vector similarity search at query time |
| ITabularBatchProcessor | CrestApps.Core.AI.Documents | Splits and processes CSV/Excel batch queries |
| ITabularBatchResultCache | CrestApps.Core.AI.Documents | Caches tabular query results |
| IngestionDocumentReader | CrestApps.Core.AI.Documents | Abstract base for format-specific file readers |
AddCoreAIDocumentProcessing() registers a default FileSystemFileStore automatically. By default it stores uploaded files under App_Data\Documents, and each upload gets a new GUID-based stored file name while AIDocument.FileName keeps the original user upload name.
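The split between a GUID-based stored name and the original upload name can be pictured with a small sketch. StoredFileNameSketch is a hypothetical helper, not a framework type, and keeping the lower-cased extension on the stored name is an assumption made here for illustration; the source only states that the stored name is GUID-based.

```csharp
using System;
using System.IO;

// Hypothetical helper: illustrates a GUID-based stored name.
// The real FileSystemFileStore naming scheme may differ.
public static class StoredFileNameSketch
{
    public static string Create(string originalFileName)
    {
        // Path.GetExtension returns "" when the name has no extension.
        var extension = Path.GetExtension(originalFileName).ToLowerInvariant();

        // "N" formats the GUID as 32 hex digits without dashes.
        return Guid.NewGuid().ToString("N") + extension;
    }
}
```

A stored name produced this way never collides with another upload of the same file, while the original name stays available for display through AIDocument.FileName.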
Configure a different local base path:
builder.Services.Configure<DocumentFileSystemFileStoreOptions>(options =>
{
options.BasePath = "App_Data/CustomDocuments";
});
Hosts can replace IDocumentFileStore entirely to change where uploaded files are written:
builder.Services.AddSingleton<IDocumentFileStore, AzureBlobDocumentFileStore>();
IDocumentFileStore extends the general IFileStore abstraction. AIDocument.StoredFileName and AIDocument.StoredFilePath preserve the backing file-store location so hosts can trace and delete the physical file later.
Document Processing Pipeline
Step 1 — Upload and Store
When a file is uploaded, a new AIDocument record is created in IAIDocumentStore:
public sealed class AIDocument : CatalogItem
{
public string ReferenceId { get; set; } // Owning resource (e.g., chat interaction ID)
public string ReferenceType { get; set; } // Resource type (e.g., "chatinteraction")
public string FileName { get; set; } // Original file name
public string ContentType { get; set; } // MIME type
public long FileSize { get; set; } // Size in bytes
public DateTime UploadedUtc { get; set; } // Upload timestamp
}
The ReferenceId and ReferenceType pair ties the document to an owning resource. Common reference types include:
| Constant | Value | Meaning |
|---|---|---|
| AIReferenceTypes.Document.Profile | "profile" | Document attached to an AI profile |
| AIReferenceTypes.Document.ChatInteraction | "chatinteraction" | Document attached to a chat interaction |
| AIReferenceTypes.Document.ChatSession | "chatsession" | Document attached to a chat session |
Hosts can layer extra behavior on top of this shared pipeline, but the default follow-up indexing step now lives in the framework as DefaultAIDocumentIndexingService. Hosts can call that service after persisting AIDocument and AIDocumentChunk records so uploaded chunks are mirrored into the configured AI Documents vector index without duplicating provider-specific index management code.
Step 2 — Read File Content
An IngestionDocumentReader is resolved as a keyed service using the file extension. The reader extracts plain text from the file:
public abstract class IngestionDocumentReader
{
public abstract Task<IngestionDocument> ReadAsync(
Stream source,
string identifier,
string mediaType,
CancellationToken cancellationToken = default);
}
Step 3 — Normalize and Chunk
The extracted text is normalized (whitespace, encoding) and split into chunks. Each chunk becomes an AIDocumentChunk:
public sealed class AIDocumentChunk : CatalogItem
{
public string AIDocumentId { get; set; } // Parent document ID
public string ReferenceId { get; set; } // Denormalized from parent
public string ReferenceType { get; set; } // Denormalized from parent
public string Content { get; set; } // Chunk text
public float[] Embedding { get; set; } // Vector embedding
public int Index { get; set; } // Chunk order within the document
}
The ReferenceId and ReferenceType are denormalized from the parent document for efficient query access without joins.
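The chunking strategy used by the framework is not spelled out here, so the following is only a minimal sketch of one common approach: fixed-size character windows with a small overlap. ChunkingSketch, maxChars, and overlap are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: fixed-size chunking with overlap.
// This is not the framework's actual chunker.
public static class ChunkingSketch
{
    public static List<(int Index, string Content)> Split(string text, int maxChars, int overlap)
    {
        if (maxChars <= 0) throw new ArgumentOutOfRangeException(nameof(maxChars));
        if (overlap < 0 || overlap >= maxChars) throw new ArgumentOutOfRangeException(nameof(overlap));

        var chunks = new List<(int Index, string Content)>();
        var step = maxChars - overlap; // how far each window advances

        for (var start = 0; start < text.Length; start += step)
        {
            var length = Math.Min(maxChars, text.Length - start);

            // Index mirrors AIDocumentChunk.Index: the chunk's order in the document.
            chunks.Add((chunks.Count, text.Substring(start, length)));

            if (start + length >= text.Length)
            {
                break; // this window already reached the end of the text
            }
        }

        return chunks;
    }
}
```

The overlap means text cut at a window boundary still appears whole in the neighboring chunk, at the cost of embedding a few characters twice.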
Step 4 — Generate Embeddings
If the file extension is embeddable (see Built-in Document Readers), each chunk is converted to a vector via IEmbeddingGenerator<string, Embedding<float>>:
var embeddingGenerator =
await processingService.CreateEmbeddingGeneratorAsync("OpenAI", "default");
The generator is created from the configured provider and connection. Embeddings are stored on the chunk itself (Embedding property) so they survive index rebuilds.
Step 5 — Index in Vector Store
Chunks with embeddings are pushed to the search index via ISearchDocumentManager:
public interface ISearchDocumentManager
{
Task<bool> AddOrUpdateAsync(
IIndexProfileInfo profile,
IReadOnlyCollection<IndexDocument> documents,
CancellationToken cancellationToken = default);
Task DeleteAsync(
IIndexProfileInfo profile,
IEnumerable<string> documentIds,
CancellationToken cancellationToken = default);
Task DeleteAllAsync(
IIndexProfileInfo profile,
CancellationToken cancellationToken = default);
}
Implementations are registered as keyed services by provider name (e.g., "Elasticsearch", "AzureAISearch").
Step 6 — Query-Time Retrieval
During a conversation, SearchDocumentsTool calls IVectorSearchService to find the most relevant chunks:
public interface IVectorSearchService
{
Task<IEnumerable<DocumentChunkSearchResult>> SearchAsync(
IIndexProfileInfo indexProfile,
float[] embedding,
string referenceId,
string referenceType,
int topN,
CancellationToken cancellationToken = default);
}
The user's query is embedded, and the resulting vector is compared against indexed chunks using cosine similarity.
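In practice the search index (Elasticsearch or Azure AI Search) performs this comparison. The sketch below only illustrates the math in memory, and CosineSketch and its members are hypothetical names, not framework types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: cosine similarity and a brute-force top-N ranking.
public static class CosineSketch
{
    public static double Similarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
        {
            throw new ArgumentException("Vectors must have the same dimension.");
        }

        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        // A zero vector has no direction; treat it as unrelated.
        if (normA == 0 || normB == 0) return 0;

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Scores every chunk against the query and keeps the best topN.
    public static IEnumerable<(string Content, double Score)> TopN(
        float[] query,
        IEnumerable<(string Content, float[] Embedding)> chunks,
        int topN) =>
        chunks
            .Select(c => (c.Content, Score: Similarity(query, c.Embedding)))
            .OrderByDescending(r => r.Score)
            .Take(topN);
}
```

A score near 1 means the chunk points in almost the same direction as the query vector, and a score near 0 means the two are unrelated.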
For uploaded chat-interaction and chat-session documents, the framework now switches between two context-loading strategies automatically:
- targeted questions continue to use semantic chunk retrieval (SearchDocumentsTool and preemptive RAG)
- whole-document tasks such as summarizing, reviewing, rewriting, translating, or extracting complete information from an attached file inject the full document text instead of a few chunks
That keeps RAG efficient for lookup-style questions while avoiding partial-context answers for requests that depend on the entire uploaded file.
Built-in Document Readers
| Reader | Extensions | Embeddable | Notes |
|---|---|---|---|
| PlainTextIngestionDocumentReader | .txt, .md, .json, .xml, .html, .htm, .log, .yaml, .yml | Yes | UTF-8 stream reader |
| PlainTextIngestionDocumentReader | .csv | No | Tabular, processed via ReadTabularDataTool |
| OpenXmlIngestionDocumentReader | .docx, .pptx | Yes | Uses DocumentFormat.OpenXml SDK |
| OpenXmlIngestionDocumentReader | .xlsx | No | Tabular, processed via ReadTabularDataTool |
| PdfIngestionDocumentReader | .pdf | Yes | Uses UglyToad.PdfPig with DocstrumBoundingBoxes |
Embeddable means the content is chunked and vector-embedded for semantic search. Non-embeddable (tabular) formats are instead handled by the ReadTabularDataTool which reads and parses them directly.
Custom Document Reader
Register a reader for additional file formats:
builder.Services.AddCrestAppsIngestionDocumentReader<RtfIngestionDocumentReader>(
new ExtractorExtension(".rtf", embeddable: true));
Implement the reader:
public sealed class RtfIngestionDocumentReader : IngestionDocumentReader
{
public override async Task<IngestionDocument> ReadAsync(
Stream source,
string identifier,
string mediaType,
CancellationToken cancellationToken = default)
{
// Parse the RTF stream into plain text
using var reader = new StreamReader(source);
var rawContent = await reader.ReadToEndAsync(cancellationToken);
var plainText = StripRtfFormatting(rawContent);
return new IngestionDocument
{
Content = plainText,
Identifier = identifier,
};
}
}
ExtractorExtension
The ExtractorExtension type defines a file extension and whether its content is embeddable:
public sealed class ExtractorExtension
{
public string Extension { get; } // Normalized with leading dot (e.g., ".rtf")
public bool Embeddable { get; } // Whether embeddings should be generated
public ExtractorExtension(string extension, bool embeddable = true);
}
There is an implicit conversion from string to ExtractorExtension (with embeddable: true by default), so you can pass bare strings for embeddable extensions:
// These are equivalent:
services.AddCrestAppsIngestionDocumentReader<MyReader>(".rtf");
services.AddCrestAppsIngestionDocumentReader<MyReader>(new ExtractorExtension(".rtf", true));
// For non-embeddable extensions, use the explicit constructor:
services.AddCrestAppsIngestionDocumentReader<MyReader>(new ExtractorExtension(".tsv", false));
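For readers unfamiliar with implicit conversions, here is a minimal sketch of how a type like this could be written. SketchExtension is a hypothetical stand-in for ExtractorExtension; the lower-casing and the exact normalization rules are assumptions, since the source only states that the extension is normalized with a leading dot.

```csharp
using System;

// Hypothetical stand-in for ExtractorExtension; normalization details are assumed.
public sealed class SketchExtension
{
    public string Extension { get; }

    public bool Embeddable { get; }

    public SketchExtension(string extension, bool embeddable = true)
    {
        if (string.IsNullOrWhiteSpace(extension))
        {
            throw new ArgumentException("An extension is required.", nameof(extension));
        }

        // Assumed normalization: trim, lower-case, and ensure a leading dot.
        var trimmed = extension.Trim().ToLowerInvariant();
        Extension = trimmed.StartsWith('.') ? trimmed : "." + trimmed;
        Embeddable = embeddable;
    }

    // Lets callers pass a bare string wherever a SketchExtension is expected.
    public static implicit operator SketchExtension(string extension) => new(extension);
}
```

Because the conversion always uses the constructor's default, a bare string can only ever produce an embeddable extension, which is why non-embeddable extensions need the explicit constructor.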
AddCrestAppsIngestionDocumentReader<T>
public static IServiceCollection AddCrestAppsIngestionDocumentReader<T>(
this IServiceCollection services,
params ExtractorExtension[] supportedExtensions)
where T : IngestionDocumentReader;
This method:
- Registers the reader as a singleton
- Registers a keyed singleton for each extension (used to resolve the right reader at runtime)
- Adds the extensions to ChatDocumentsOptions
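The first two registration steps above can be sketched with the keyed-service APIs in Microsoft.Extensions.DependencyInjection 8.0 or later. SketchReader, SketchRtfReader, AddSketchReader, and Resolve are hypothetical names for this example; the framework's own extension method may be implemented differently, and the options update is omitted.

```csharp
using System;
using System.IO;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical reader types standing in for IngestionDocumentReader implementations.
public abstract class SketchReader
{
    public abstract string Describe();
}

public sealed class SketchRtfReader : SketchReader
{
    public override string Describe() => "rtf reader";
}

public static class KeyedReaderSketch
{
    public static IServiceCollection AddSketchReader<T>(
        this IServiceCollection services,
        params string[] extensions)
        where T : SketchReader
    {
        // One shared instance of the concrete reader.
        services.AddSingleton<T>();

        // One keyed registration per extension, all pointing at that instance.
        foreach (var extension in extensions)
        {
            services.AddKeyedSingleton<SketchReader>(
                extension,
                (provider, _) => provider.GetRequiredService<T>());
        }

        return services;
    }

    // Returns null when no reader is registered for the file's extension.
    public static SketchReader? Resolve(IServiceProvider provider, string fileName)
    {
        var extension = Path.GetExtension(fileName).ToLowerInvariant();
        return provider.GetKeyedService<SketchReader>(extension);
    }
}
```

Keying by extension keeps the lookup a single dictionary-style resolve at upload time, with no switch statement to maintain as readers are added.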
Document Tools
Three system tools are automatically available when documents are attached to a session. They are registered with AIToolPurposes.DocumentProcessing and injected by DocumentOrchestrationHandler.
SearchDocumentsTool
Name: search_documents (SystemToolNames.SearchDocuments)
Performs semantic vector search across all uploaded documents for the current session and returns the most relevant text chunks.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| query | string | Yes | The search query to find relevant content |
| top_n | integer | No | Number of top matching chunks to return (default: 3) |
ReadDocumentTool
Name: read_document (SystemToolNames.ReadDocument)
Reads the full text content of a specific uploaded document. Truncates output to 50 KB maximum.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| document_id | string | Yes | The unique identifier of the document to read |
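How the 50 KB cap on read_document output is enforced is not described here. As a rough sketch, assuming the limit is measured in UTF-8 bytes and that 50 KB means 50 × 1024 bytes, truncation could avoid splitting a multi-byte character like this (TruncationSketch is a hypothetical helper):

```csharp
using System;
using System.Text;

// Hypothetical sketch: truncate text to a UTF-8 byte budget.
// Both the byte-based measure and the 1024 multiplier are assumptions.
public static class TruncationSketch
{
    public const int MaxBytes = 50 * 1024;

    public static string Truncate(string text, int maxBytes = MaxBytes)
    {
        if (Encoding.UTF8.GetByteCount(text) <= maxBytes)
        {
            return text; // already within budget
        }

        var used = 0;
        var builder = new StringBuilder();

        // Walk whole Unicode scalar values so no character is cut in half.
        foreach (var rune in text.EnumerateRunes())
        {
            used += rune.Utf8SequenceLength;
            if (used > maxBytes)
            {
                break;
            }

            builder.Append(rune.ToString());
        }

        return builder.ToString();
    }
}
```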
ReadTabularDataTool
Name: read_tabular_data (SystemToolNames.ReadTabularData)
Reads tabular data from CSV, TSV, or Excel files and returns formatted rows suitable for analysis.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| document_id | string | Yes | The unique identifier of the tabular document |
| max_rows | integer | No | Maximum number of data rows to return (default: 100) |
Supported extensions: .csv, .tsv, .xlsx, .xls
Implementing Stores
The framework defines two store interfaces. You must provide implementations for your persistence layer.
IAIDocumentStore
public interface IAIDocumentStore : ICatalog<AIDocument>
{
Task<IReadOnlyCollection<AIDocument>> GetDocumentsAsync(
string referenceId,
string referenceType);
}
Inherits CRUD operations from ICatalog<T>:
| Method | Description |
|---|---|
| CreateAsync(T) | Insert a new document record |
| UpdateAsync(T) | Update an existing document record |
| DeleteAsync(T) | Delete a document record |
| FindByIdAsync(string) | Find a document by its ItemId |
| GetAllAsync() | Retrieve all documents |
| GetAsync(IEnumerable<string>) | Retrieve documents by IDs |
| PageAsync(int, int, TQuery) | Paginated query |
| SaveChangesAsync() | Flush pending changes |
IAIDocumentChunkStore
public interface IAIDocumentChunkStore : ICatalog<AIDocumentChunk>
{
Task<IReadOnlyCollection<AIDocumentChunk>> GetChunksByAIDocumentIdAsync(string documentId);
Task<IReadOnlyCollection<AIDocumentChunk>> GetChunksByReferenceAsync(
string referenceId, string referenceType);
Task DeleteByDocumentIdAsync(string documentId);
}
Registration
Register your implementations with the DI container:
builder.Services.AddScoped<IAIDocumentStore, YesSqlAIDocumentStore>();
builder.Services.AddScoped<IAIDocumentChunkStore, YesSqlAIDocumentChunkStore>();
See Data Storage for more on the catalog pattern and YesSql index conventions.
Orchestration Integration
DocumentOrchestrationHandler implements IOrchestrationContextBuilderHandler and is registered automatically by AddCoreAIDocumentProcessing().
public sealed class DocumentOrchestrationHandler : IOrchestrationContextBuilderHandler
{
public Task BuildingAsync(OrchestrationContextBuildingContext context);
public Task BuiltAsync(OrchestrationContextBuiltContext context);
}
During context building, the handler:
- Checks if the current session has documents (via ReferenceId/ReferenceType)
- If documents exist, sets AICompletionContextKeys.HasDocuments = true
- Discovers all tools with purpose AIToolPurposes.DocumentProcessing and adds them to the tool set
- Enriches the system message with document metadata so the model knows what content is available
This means document tools are only injected when the session actually has documents — no wasted tokens on tool descriptions when there are no documents.
Tabular Data
CSV, TSV, and Excel files are marked as non-embeddable and receive special processing.
ITabularBatchProcessor
Splits large tabular content into batches, processes each batch with the LLM, and merges results:
public interface ITabularBatchProcessor
{
IList<TabularBatch> SplitIntoBatches(string content, string fileName);
Task<IList<TabularBatchResult>> ProcessBatchesAsync(
IList<TabularBatch> batches,
string userPrompt,
TabularBatchContext context,
CancellationToken cancellationToken = default);
string MergeResults(IList<TabularBatchResult> results, bool includeHeader = true);
}
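To make the splitting step concrete, here is a minimal sketch that repeats the header row at the top of every batch so each batch can be understood by the model on its own. TabularSplitSketch and rowsPerBatch are hypothetical; the real TabularBatchProcessor returns TabularBatch objects and may size batches differently. The naive line split also ignores quoted fields that contain newlines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: split CSV text into header-prefixed batches.
public static class TabularSplitSketch
{
    public static List<string> SplitIntoBatches(string csv, int rowsPerBatch)
    {
        if (rowsPerBatch <= 0) throw new ArgumentOutOfRangeException(nameof(rowsPerBatch));

        // Normalize line endings and drop blank lines.
        var lines = csv.Split('\n')
            .Select(line => line.TrimEnd('\r'))
            .Where(line => line.Length > 0)
            .ToList();

        var batches = new List<string>();
        if (lines.Count < 2)
        {
            return batches; // header only (or empty): nothing to process
        }

        var header = lines[0];
        for (var i = 1; i < lines.Count; i += rowsPerBatch)
        {
            var rows = lines.Skip(i).Take(rowsPerBatch);
            batches.Add(header + "\n" + string.Join("\n", rows));
        }

        return batches;
    }
}
```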
ITabularBatchResultCache
Caches batch results to avoid re-processing identical queries:
public interface ITabularBatchResultCache
{
string GenerateCacheKey(string interactionId, string documentContentHash, string prompt);
string ComputeDocumentContentHash(IEnumerable<(string FileName, string Content)> documents);
TabularBatchCacheEntry TryGet(string cacheKey);
void Set(string cacheKey, TabularBatchCacheEntry entry, TimeSpan? expiration = null);
void Remove(string cacheKey);
void InvalidateForInteraction(string interactionId);
}
When documents are added or removed from an interaction, call InvalidateForInteraction to clear stale cache entries.
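One plausible way to derive such keys, shown only as a sketch, is to hash the document set and the prompt with SHA-256 and prefix the key with the interaction ID so that all entries for an interaction share a common prefix. CacheKeySketch and its key layout are assumptions; the actual TabularBatchResultCache may build keys differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch of cache-key derivation; the key layout is assumed.
public static class CacheKeySketch
{
    private static string Sha256Hex(string value) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(value)));

    public static string ComputeDocumentContentHash(
        IEnumerable<(string FileName, string Content)> documents)
    {
        var builder = new StringBuilder();

        // Sort by file name so upload order does not change the hash.
        foreach (var (fileName, content) in documents.OrderBy(d => d.FileName, StringComparer.Ordinal))
        {
            // The '\0' separators keep ("ab", "c") distinct from ("a", "bc").
            builder.Append(fileName).Append('\0').Append(content).Append('\0');
        }

        return Sha256Hex(builder.ToString());
    }

    public static string GenerateCacheKey(
        string interactionId, string documentContentHash, string prompt) =>
        $"{interactionId}:{documentContentHash}:{Sha256Hex(prompt.Trim())}";
}
```

With this layout, invalidating an interaction amounts to removing every entry whose key starts with the interaction ID followed by a colon.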
Configuration
ChatDocumentsOptions
Controls which file types can be uploaded and how they are processed:
services.Configure<ChatDocumentsOptions>(options =>
{
// Add an embeddable extension
options.Add(".rtf", embeddable: true);
// Add a tabular (non-embeddable) extension
options.Add(".tsv", embeddable: false);
});
| Property | Type | Description |
|---|---|---|
| AllowedFileExtensions | IReadOnlySet<string> | Complete set of uploadable file extensions |
| EmbeddableFileExtensions | IReadOnlySet<string> | Subset that gets vector-embedded |
| MaxVisionInputBytesPerRequest | long | Maximum total image bytes attached to one multimodal request; set 0 or less to disable the limit |
Extensions not in EmbeddableFileExtensions are still allowed for upload and can be read by ReadDocumentTool or ReadTabularDataTool, but they are not chunked and embedded.
InteractionDocumentSettings
Per-interaction settings for document search:
public sealed class InteractionDocumentSettings
{
public string IndexProfileName { get; set; } // Index profile for embedding and search
public int TopN { get; set; } = 3; // Top matching chunks to include in context
}
Limits
- Maximum 25,000 characters total for embedding per session
- ReadDocumentTool truncates output to 50 KB
- ReadTabularDataTool defaults to 100 rows maximum
Services Registered by AddCoreAIDocumentProcessing()
| Service | Implementation | Lifetime | Purpose |
|---|---|---|---|
| IAIDocumentProcessingService | DefaultAIDocumentProcessingService | Scoped | Orchestrates document processing |
| ITabularBatchProcessor | TabularBatchProcessor | Scoped | Processes CSV/Excel batch queries |
| ITabularBatchResultCache | TabularBatchResultCache | Singleton | Caches tabular query results |
| DocumentOrchestrationHandler | - | Scoped | Injects document context into orchestration |
| PlainTextIngestionDocumentReader | - | Singleton | .txt, .csv, .md, .json, .xml, .html, .htm, .log, .yaml, .yml |
| OpenXmlIngestionDocumentReader | - | Singleton | .docx, .xlsx, .pptx |
| PdfIngestionDocumentReader | - | Singleton | .pdf |
| SearchDocumentsTool | - | System tool | Semantic vector search |
| ReadDocumentTool | - | System tool | Full document read |
| ReadTabularDataTool | - | System tool | Tabular data queries |