Document Processing
Reads, chunks, and indexes uploaded documents so the AI can search and reference them during conversations.
Quick Start
builder.Services.AddCrestAppsCore(crestApps => crestApps
    .AddAISuite(ai => ai
        .AddMarkdown()
        .AddChatInteractions()
        .AddDocumentProcessing(documentProcessing => documentProcessing
            .AddOpenXml()
            .AddPdf())
        .AddOpenAI()));
Problem & Solution
Users upload documents (PDFs, Word files, spreadsheets) and expect the AI to answer questions about them. This requires:
- Reading diverse file formats into plain text
- Chunking large documents into embeddable segments
- Embedding chunks into vector space for semantic search
- Searching for relevant chunks at query time (RAG)
- Tabular processing for CSV/Excel data with structured queries
The document processing system handles the full pipeline from upload to retrieval.
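The chunking step above can be illustrated with a minimal, self-contained sketch. It is not the framework's chunker: TextChunker, maxChars, and overlap are hypothetical names, and it uses a plain fixed-size character window, whereas a real pipeline may split on sentence or Markdown boundaries instead.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of CrestApps.Core.
public static class TextChunker
{
    // Splits text into windows of at most maxChars characters. Each window
    // starts (maxChars - overlap) characters after the previous one, so
    // neighbouring chunks share `overlap` characters of context.
    public static List<string> Chunk(string text, int maxChars, int overlap)
    {
        ArgumentNullException.ThrowIfNull(text);
        if (maxChars <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxChars));
        if (overlap < 0 || overlap >= maxChars)
            throw new ArgumentOutOfRangeException(nameof(overlap));

        var chunks = new List<string>();
        int step = maxChars - overlap;

        for (int start = 0; start < text.Length; start += step)
        {
            int length = Math.Min(maxChars, text.Length - start);
            chunks.Add(text.Substring(start, length));

            // Stop once a window has reached the end of the text.
            if (start + length >= text.Length)
                break;
        }

        return chunks;
    }
}
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of embedding some characters twice.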
Services Registered by AddCoreAIDocumentProcessing()
| Service | Implementation | Lifetime | Purpose |
|---|---|---|---|
| IAIDocumentProcessingService | DefaultAIDocumentProcessingService | Scoped | Reads, chunks, and materializes AIDocument / AIDocumentChunk records |
| ITabularBatchProcessor | TabularBatchProcessor | Scoped | Processes CSV/Excel batch queries |
| ITabularBatchResultCache | TabularBatchResultCache | Singleton | Caches tabular query results |
| DocumentOrchestrationHandler | - | Scoped | Injects document context into orchestration |
Built-in Document Readers
AddDocumentProcessing(...) registers the plain-text and tabular readers. OpenXml and PDF readers now live in the dedicated CrestApps.Core.AI.OpenXml and CrestApps.Core.AI.Pdf packages, so hosts opt into those dependencies explicitly with the nested builder calls AddOpenXml() and AddPdf() or, if they prefer the raw IServiceCollection surface, AddCoreAIOpenXmlDocumentProcessing() and AddCoreAIPdfDocumentProcessing().
Markdown-aware normalization now also lives in its own CrestApps.Core.AI.Markdown package. AddAISuite(...) does not register it automatically, so hosts that want Markdig-backed normalization and chunking must opt in with AddMarkdown() or AddCoreAIMarkdown().
| Reader | Supported Extensions | Embeddable |
|---|---|---|
| PlainTextIngestionDocumentReader | .txt, .md, .json, .xml, .html, .htm, .log, .yaml, .yml | Yes |
| PlainTextIngestionDocumentReader | .csv | No (tabular) |
| OpenXmlIngestionDocumentReader | .docx, .pptx | Yes |
| OpenXmlIngestionDocumentReader | .xlsx | No (tabular) |
| PdfIngestionDocumentReader | .pdf | Yes |
System Tools for Documents
These tools are automatically available to the orchestrator when documents are attached:
| Tool | Purpose |
|---|---|
| SearchDocumentsTool | Semantic vector search across uploaded documents |
| ReadDocumentTool | Reads full text of a specific document |
| ReadTabularDataTool | Reads and parses CSV/TSV/Excel data |
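To make "semantic vector search" concrete, the sketch below ranks chunks by cosine similarity to a query vector. This document does not describe how SearchDocumentsTool is implemented, so treat this as an illustration of the general technique only: VectorSearch, CosineSimilarity, and TopK are hypothetical names, and the brute-force scan stands in for whatever index a real vector store provides.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of CrestApps.Core.
public static class VectorSearch
{
    // Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero vectors.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        if (normA == 0 || normB == 0)
            return 0;

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Scores every chunk against the query and returns the k best matches,
    // highest score first.
    public static List<(string Text, double Score)> TopK(
        float[] query,
        IEnumerable<(string Text, float[] Vector)> chunks,
        int k)
    {
        return chunks
            .Select(c => (c.Text, Score: CosineSimilarity(query, c.Vector)))
            .OrderByDescending(r => r.Score)
            .Take(k)
            .ToList();
    }
}
```

In a RAG flow, the query vector would come from the same embedding generator that embedded the chunks; mixing models makes the scores meaningless.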
Key Interfaces
IAIDocumentProcessingService
Processes an uploaded file after the host has resolved any embedding generator it wants to use.
public interface IAIDocumentProcessingService
{
    Task<DocumentProcessingResult> ProcessFileAsync(
        IFormFile file,
        string referenceId,
        string referenceType,
        IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator);
}
The framework no longer asks IAIDocumentProcessingService to create embedding generators. Hosts resolve the embedding deployment through IAIDeploymentManager and create the generator through IAIClientFactory, then pass it into ProcessFileAsync(...). That keeps deployment selection and AI client creation in the shared client/deployment runtime instead of duplicating that logic inside the document processor.
Adding a Custom Document Reader
Register a reader for additional file formats:
builder.Services.AddCrestAppsIngestionDocumentReader<MyCustomReader>(".custom", ".myformat");
Implement the reader:
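public sealed class MyCustomReader : IngestionDocumentReader
{
    public override Task<IngestionDocument> ReadAsync(
        Stream stream,
        string fileName,
        CancellationToken cancellationToken = default)
    {
        // Parse the stream into sections and elements, then return the
        // resulting IngestionDocument. This placeholder keeps the stub compilable.
        throw new NotImplementedException();
    }
}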
Configuration
ChatDocumentsOptions
Controls which file types can be uploaded and processed.
services.Configure<ChatDocumentsOptions>(options =>
{
    // Add a new embeddable extension
    options.Add(".rtf", embeddable: true);

    // Add a tabular (non-embeddable) extension
    options.Add(".tsv", embeddable: false);
});
- AllowedFileExtensions — Complete set of uploadable extensions
- EmbeddableFileExtensions — Subset that gets vector-embedded (non-embeddable files use direct read tools instead)
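The relationship between the two sets can be pictured with a small stand-in class. This is a sketch under assumptions, not the real ChatDocumentsOptions: SketchDocumentsOptions is a hypothetical name, and the case-insensitive matching and leading-dot normalization are guesses about reasonable behavior rather than documented facts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for illustration only; not the real ChatDocumentsOptions.
public sealed class SketchDocumentsOptions
{
    private readonly HashSet<string> _allowed = new(StringComparer.OrdinalIgnoreCase);
    private readonly HashSet<string> _embeddable = new(StringComparer.OrdinalIgnoreCase);

    // Every extension that may be uploaded.
    public IReadOnlyCollection<string> AllowedFileExtensions => _allowed;

    // The subset of allowed extensions that is vector-embedded.
    public IReadOnlyCollection<string> EmbeddableFileExtensions => _embeddable;

    public void Add(string extension, bool embeddable)
    {
        if (string.IsNullOrWhiteSpace(extension))
            throw new ArgumentException("Extension is required.", nameof(extension));

        // Assumed normalization: treat "rtf" and ".rtf" as the same extension.
        var normalized = extension.StartsWith('.') ? extension : "." + extension;

        _allowed.Add(normalized);

        if (embeddable)
            _embeddable.Add(normalized);
        else
            _embeddable.Remove(normalized);
    }
}
```

Because every call to Add writes to the allowed set first, the embeddable set can never contain an extension that is not also uploadable.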
Use the registered option values to drive your upload UI as well as validation:
- AI Profile / AI Template knowledge uploads should use EmbeddableFileExtensions
- Chat interaction / chat session uploads should use AllowedFileExtensions
That keeps file pickers, visible supported-format text, and server-side processing aligned with the readers actually registered in the app.
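One way to apply the same extension set to both the file picker and server-side validation is sketched below. UploadExtensionPolicy and its methods are hypothetical helpers, and the sketch assumes the option values can be enumerated as strings with a leading dot (for example ".pdf"); check the actual property types before relying on it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration only; not part of CrestApps.Core.
public static class UploadExtensionPolicy
{
    // Server-side check: true when the file name's extension is in the
    // allowed set, compared case-insensitively.
    public static bool IsAllowed(string fileName, IEnumerable<string> allowedExtensions)
    {
        var allowed = new HashSet<string>(allowedExtensions, StringComparer.OrdinalIgnoreCase);
        return allowed.Contains(Path.GetExtension(fileName));
    }

    // UI side: builds the value for an HTML file input's accept attribute,
    // such as ".pdf,.txt", from the same set.
    public static string ToAcceptAttribute(IEnumerable<string> allowedExtensions)
        => string.Join(",", allowedExtensions.OrderBy(e => e, StringComparer.OrdinalIgnoreCase));
}
```

A knowledge-upload form would pass the embeddable set to both methods, while a chat upload form would pass the full allowed set, so the picker never offers a file the server would reject.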
Limits
- Maximum 25,000 characters total for embedding per session
- Results are cached via IDistributedCache for batch tabular queries
Storage
Document metadata and chunks require store implementations:
builder.Services.AddScoped<IAIDocumentStore, YesSqlAIDocumentStore>();
builder.Services.AddScoped<IAIDocumentChunkStore, YesSqlAIDocumentChunkStore>();