Document Processing

Reads, chunks, and indexes uploaded documents so the AI can search and reference them during conversations.

Quick Start

builder.Services.AddCrestAppsCore(crestApps => crestApps
    .AddAISuite(ai => ai
        .AddMarkdown()
        .AddChatInteractions()
        .AddDocumentProcessing(documentProcessing => documentProcessing
            .AddOpenXml()
            .AddPdf())
        .AddOpenAI()));

Problem & Solution

Users upload documents (PDFs, Word files, spreadsheets) and expect the AI to answer questions about them. This requires:

Reading diverse file formats into plain text
Chunking large documents into embeddable segments
Embedding chunks into vector space for semantic search
Searching relevant chunks at query time (RAG)
Processing tabular CSV/Excel data with structured queries

The document processing system handles the full pipeline from upload to retrieval.
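To make the chunking step concrete, here is a minimal sketch of splitting extracted text into overlapping, fixed-size segments. `ChunkText`, `maxChars`, and `overlap` are hypothetical names used only for illustration; the framework's own chunker may split on paragraphs, Markdown structure, or tokens instead.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the framework): splits text into
// fixed-size chunks where each chunk repeats the last `overlap`
// characters of the previous one, so context is not lost at boundaries.
static List<string> ChunkText(string text, int maxChars, int overlap)
{
    if (maxChars <= 0)
    {
        throw new ArgumentOutOfRangeException(nameof(maxChars));
    }

    if (overlap < 0 || overlap >= maxChars)
    {
        throw new ArgumentOutOfRangeException(nameof(overlap));
    }

    var result = new List<string>();
    var step = maxChars - overlap;

    for (var start = 0; start < text.Length; start += step)
    {
        var length = Math.Min(maxChars, text.Length - start);
        result.Add(text.Substring(start, length));

        // Stop once the current chunk reaches the end of the text.
        if (start + length >= text.Length)
        {
            break;
        }
    }

    return result;
}

var sample = new string('a', 250);
var chunks = ChunkText(sample, maxChars: 100, overlap: 20);
Console.WriteLine(chunks.Count); // 3 chunks: 100, 100, and 90 characters
```

Each resulting segment would then be passed to an embedding generator and stored alongside its vector for retrieval at query time.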

Services Registered by `AddCoreAIDocumentProcessing()`

Service	Implementation	Lifetime	Purpose
`IAIDocumentProcessingService`	`DefaultAIDocumentProcessingService`	Scoped	Reads, chunks, and materializes `AIDocument` / `AIDocumentChunk` records
`ITabularBatchProcessor`	`TabularBatchProcessor`	Scoped	Processes CSV/Excel batch queries
`ITabularBatchResultCache`	`TabularBatchResultCache`	Singleton	Caches tabular query results
`DocumentOrchestrationHandler`	—	Scoped	Injects document context into orchestration

Built-in Document Readers

`AddDocumentProcessing(...)` registers the plain-text and tabular readers. The OpenXml and PDF readers live in the dedicated `CrestApps.Core.AI.OpenXml` and `CrestApps.Core.AI.Pdf` packages, so hosts opt into those dependencies explicitly: either with the nested builder calls `AddOpenXml()` and `AddPdf()` or, on the raw `IServiceCollection` surface, with `AddCoreAIOpenXmlDocumentProcessing()` and `AddCoreAIPdfDocumentProcessing()`.

Markdown-aware normalization lives in its own `CrestApps.Core.AI.Markdown` package. `AddAISuite(...)` does not register it automatically, so hosts that want Markdig-backed normalization and chunking must opt in with `AddMarkdown()` or `AddCoreAIMarkdown()`.

Reader	Supported Extensions	Embeddable
`PlainTextIngestionDocumentReader`	`.txt`, `.md`, `.json`, `.xml`, `.html`, `.htm`, `.log`, `.yaml`, `.yml`	Yes
`PlainTextIngestionDocumentReader`	`.csv`	No (tabular)
`OpenXmlIngestionDocumentReader`	`.docx`, `.pptx`	Yes
`OpenXmlIngestionDocumentReader`	`.xlsx`	No (tabular)
`PdfIngestionDocumentReader`	`.pdf`	Yes

System Tools for Documents

These tools are automatically available to the orchestrator when documents are attached:

Tool	Purpose
`SearchDocumentsTool`	Semantic vector search across uploaded documents
`ReadDocumentTool`	Reads full text of a specific document
`ReadTabularDataTool`	Reads and parses CSV/TSV/Excel data

Key Interfaces

`IAIDocumentProcessingService`

Processes an uploaded file after the host has resolved any embedding generator it wants to use.

public interface IAIDocumentProcessingService
{
    Task<DocumentProcessingResult> ProcessFileAsync(
        IFormFile file,
        string referenceId,
        string referenceType,
        IEmbeddingGenerator<string, Embedding<float>> embeddingGenerator);
}

The framework no longer asks `IAIDocumentProcessingService` to create embedding generators. Hosts resolve the embedding deployment through `IAIDeploymentManager`, create the generator through `IAIClientFactory`, and then pass it into `ProcessFileAsync(...)`. This keeps deployment selection and AI client creation in the shared client/deployment runtime instead of duplicating that logic inside the document processor.

Adding a Custom Document Reader

Register the reader for the file extensions it handles:

builder.Services.AddCrestAppsIngestionDocumentReader<MyCustomReader>(".custom", ".myformat");

Implement the reader:

public sealed class MyCustomReader : IngestionDocumentReader
{
    public override Task<IngestionDocument> ReadAsync(
        Stream stream,
        string fileName,
        CancellationToken cancellationToken = default)
    {
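        // Parse the stream into sections and elements, then return the document.
        // Placeholder so the class compiles until parsing is implemented.
        throw new NotImplementedException();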
    }
}

Configuration

`ChatDocumentsOptions`

Controls which file types can be uploaded and processed.

builder.Services.Configure<ChatDocumentsOptions>(options =>
{
    // Add a new embeddable extension
    options.Add(".rtf", embeddable: true);

    // Add a tabular (non-embeddable) extension
    options.Add(".tsv", embeddable: false);
});

`AllowedFileExtensions`: the complete set of uploadable extensions
`EmbeddableFileExtensions`: the subset that gets vector-embedded (non-embeddable files use direct read tools instead)

Use the registered option values to drive your upload UI as well as validation:

AI Profile / AI Template knowledge uploads should use EmbeddableFileExtensions
Chat interaction / chat session uploads should use AllowedFileExtensions

That keeps file pickers, visible supported-format text, and server-side processing aligned with the readers actually registered in the app.
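The sketch below illustrates how a host might apply the two sets when validating an upload. The `allowed` and `embeddable` sets are local stand-ins with illustrative values rather than the real `ChatDocumentsOptions` instance, and `Classify` is a hypothetical helper; in an application the sets would come from the configured options.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Stand-ins for AllowedFileExtensions and EmbeddableFileExtensions.
// The values here are illustrative assumptions, not the framework defaults.
var allowed = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    ".txt", ".pdf", ".docx", ".csv", ".xlsx",
};
var embeddable = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    ".txt", ".pdf", ".docx",
};

// Hypothetical helper: knowledge uploads accept only embeddable files,
// while chat uploads accept everything in the allowed set.
string Classify(string fileName, bool knowledgeUpload)
{
    var extension = Path.GetExtension(fileName);
    var accepted = knowledgeUpload ? embeddable : allowed;

    if (!accepted.Contains(extension))
    {
        return "rejected";
    }

    // Embeddable files are vector-indexed; the rest use direct read tools.
    return embeddable.Contains(extension) ? "embed" : "direct-read";
}

Console.WriteLine(Classify("report.PDF", knowledgeUpload: true));  // embed
Console.WriteLine(Classify("data.csv", knowledgeUpload: true));    // rejected
Console.WriteLine(Classify("data.csv", knowledgeUpload: false));   // direct-read
```

Driving both the file picker filter and the server-side check from the same sets avoids offering a format in the UI that the server then refuses.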

Limits

A maximum of 25,000 characters in total can be embedded per session
Batch tabular query results are cached via IDistributedCache
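How the framework enforces the character limit is not described here. As one possible host-side pre-check, the sketch below keeps chunks in order until the 25,000-character budget would be exceeded. `TakeWithinBudget` and `MaxEmbeddingChars` are hypothetical names introduced for illustration only.

```csharp
using System;
using System.Collections.Generic;

// The per-session limit stated above; the constant name is an assumption.
const int MaxEmbeddingChars = 25_000;

// Hypothetical helper: accepts chunks in order and stops at the first
// chunk that would push the running total over the budget.
static List<string> TakeWithinBudget(IEnumerable<string> chunks, int budget)
{
    var result = new List<string>();
    var used = 0;

    foreach (var chunk in chunks)
    {
        if (used + chunk.Length > budget)
        {
            break;
        }

        result.Add(chunk);
        used += chunk.Length;
    }

    return result;
}

var candidates = new[]
{
    new string('x', 10_000),
    new string('y', 10_000),
    new string('z', 10_000),
};

var kept = TakeWithinBudget(candidates, MaxEmbeddingChars);
Console.WriteLine(kept.Count); // 2: the third chunk would exceed 25,000 characters
```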

Storage

Document metadata and chunks require store implementations:

builder.Services.AddScoped<IAIDocumentStore, YesSqlAIDocumentStore>();
builder.Services.AddScoped<IAIDocumentChunkStore, YesSqlAIDocumentChunkStore>();
