AI Documents

A complete document management pipeline that reads uploaded files, splits them into chunks, generates vector embeddings, and makes the content searchable via semantic similarity — enabling retrieval-augmented generation (RAG) in AI conversations.

Quick Start

builder.Services
    .AddCoreAIServices()
    .AddCoreAIOrchestration()
    .AddCoreAIChatInteractions()
    .AddCoreAIDocumentProcessing()
    .AddCoreAIOpenAI();

// Register document and chunk stores
builder.Services.AddScoped<IAIDocumentStore, YesSqlAIDocumentStore>();
builder.Services.AddScoped<IAIDocumentChunkStore, YesSqlAIDocumentChunkStore>();

AddCoreAIDocumentProcessing() is shipped by CrestApps.Core.AI.Documents.

Upload a file and process it:

public sealed class DocumentUploadController(
    IAIDocumentProcessingService processingService,
    IAIClientFactory aiClientFactory,
    IAIDeploymentManager deploymentManager,
    IAIDocumentStore documentStore) : Controller
{
    [HttpPost]
    public async Task<IActionResult> Upload(
        IFormFile file,
        string referenceId,
        string referenceType)
    {
        var embeddingDeployment =
            await deploymentManager.ResolveOrDefaultAsync(AIDeploymentPurpose.Embedding);
        var embeddingGenerator = embeddingDeployment is null
            ? null
            : await aiClientFactory.CreateEmbeddingGeneratorAsync(embeddingDeployment);

        var result = await processingService.ProcessFileAsync(
            file, referenceId, referenceType, embeddingGenerator);

        return result.Succeeded ? Ok(result) : BadRequest(result);
    }
}

Problem & Solution

Users upload documents (PDFs, Word files, spreadsheets, text files) and expect the AI to answer questions about them. This requires a multi-stage pipeline:

Reading — Extract plain text from diverse file formats (.pdf, .docx, .xlsx, .csv, .txt, .md, and more)
Chunking — Split large documents into segments small enough to embed
Embedding — Convert each chunk into a vector representation using a configured embedding model
Indexing — Store embeddings in a vector search index (Elasticsearch or Azure AI Search)
Searching — At query time, perform semantic similarity search to find the most relevant chunks
Tabular processing — CSV and Excel files receive special treatment with structured, batch-oriented queries

The document processing system handles this full pipeline from upload to retrieval, while the built-in document tools make the content available to the AI during orchestration.

When a chat deployment also supports the Vision purpose, chat interaction and chat session uploads can include supported image formats (.bmp, .gif, .jpeg, .jpg, .png, .webp) alongside standard document files. Those images are stored as AIDocument records, analyzed at upload time by IImageAnalysisService to extract a structured summary (caption, OCR text, detected entities), and the results are persisted as AIDocumentChunk records — exactly like text documents. This makes image content available through the same read_document and search_documents tools used for regular documents.

For cases where the text analysis is insufficient (e.g., reading fine text, comparing visual elements, or understanding spatial layout), the inspect_image tool provides on-demand raw image inspection by sending the original bytes to a vision model in a one-shot call. This approach eliminates the cost of attaching raw image bytes to every chat request while preserving full visual inspection capability when needed.

The ChatDocumentsOptions.AnalyzeImagesAtUpload setting controls whether analysis runs at upload time, and MaxInspectImageCallsPerRequest limits how many costly raw-image inspections the model can perform per turn.

Creating a chat client to describe an image

When you want to call a vision-capable model directly, resolve the deployment, create an IChatClient, and send a multimodal user message:

public sealed class ImageDescriptionService(
    IAIDeploymentManager deploymentManager,
    IAIClientFactory clientFactory)
{
    public async Task<string> DescribeImageAsync(
        string imagePath,
        string chatDeploymentName,
        CancellationToken cancellationToken = default)
    {
        var deployment = await deploymentManager.ResolveOrDefaultAsync(
            AIDeploymentPurpose.Chat,
            deploymentName: chatDeploymentName,
            cancellationToken: cancellationToken);

        if (deployment?.Purpose.Supports(AIDeploymentPurpose.Vision) != true)
        {
            throw new InvalidOperationException("The selected chat deployment does not support vision.");
        }

        var chatClient = await clientFactory.CreateChatClientAsync(deployment);
        var imageBytes = await File.ReadAllBytesAsync(imagePath, cancellationToken);
        var mediaType = MediaTypeHelper.InferMediaType(Path.GetExtension(imagePath));

        var response = await chatClient.GetResponseAsync(
            [
                new ChatMessage(
                    ChatRole.User,
                    [
                        new TextContent("Describe this image in detail."),
                        new DataContent(imageBytes, mediaType),
                    ]),
            ],
            cancellationToken: cancellationToken);

        return response.Text;
    }
}

Architecture Overview

┌─────────────┐
│  User Upload │
└──────┬──────┘
       ▼
┌──────────────────────────────┐
│  IAIDocumentProcessingService │  ← Orchestrates the pipeline
├──────────────────────────────┤
│  1. Store document record     │  → IAIDocumentStore
│  2. Read file content         │  → IngestionDocumentReader (keyed by extension)
│  3. Normalize & chunk text    │  → RagTextNormalizer
│  4. Store chunks              │  → IAIDocumentChunkStore
│  5. Generate embeddings       │  → IEmbeddingGenerator<string, Embedding<float>>
│  6. Index in vector store     │  → ISearchDocumentManager (Elasticsearch / Azure AI)
└──────────────────────────────┘

       ┌─────────────────────────────────────┐
       │          During Conversation         │
       ├─────────────────────────────────────┤
       │  DocumentOrchestrationHandler        │
       │  detects documents on the session    │
       │  and injects document tools:         │
       │                                      │
       │  • SearchDocumentsTool (vector RAG)  │
       │  • ReadDocumentTool (full text read) │
       │  • ReadTabularDataTool (CSV/Excel)   │
       │  • InspectImageTool (vision on-demand)│
       └──────────────┬──────────────────────┘
                      ▼
       ┌─────────────────────────────────────┐
       │  AI Model calls tools as needed      │
       │  to answer user questions about      │
       │  the uploaded documents              │
       └─────────────────────────────────────┘

Core Interfaces

Interface	Package	Purpose
`IAIDocumentStore`	`CrestApps.Core.AI.Documents`	CRUD for document records
`IAIDocumentChunkStore`	`CrestApps.Core.AI.Documents`	CRUD for document chunks
`IAIDocumentProcessingService`	`CrestApps.Core.AI.Documents`	Orchestrates file → chunk → embed → index
`IDocumentFileStore`	`CrestApps.Core.AI.Documents`	Persists uploaded document files to a swappable storage backend
`ISearchDocumentManager`	`CrestApps.Core.AI`	Manages documents in the vector search index
`IVectorSearchService`	`CrestApps.Core.AI`	Performs vector similarity search at query time
`ITabularBatchProcessor`	`CrestApps.Core.AI.Documents`	Splits and processes CSV/Excel batch queries
`ITabularBatchResultCache`	`CrestApps.Core.AI.Documents`	Caches tabular query results
`IngestionDocumentReader`	`CrestApps.Core.AI.Documents`	Abstract base for format-specific file readers

AddCoreAIDocumentProcessing() registers a default FileSystemFileStore automatically. By default it stores uploaded files under App_Data\Documents, and each upload gets a new GUID-based stored file name while AIDocument.FileName keeps the original user upload name.

Configure a different local base path:

builder.Services.Configure<DocumentFileSystemFileStoreOptions>(options =>
{
    options.BasePath = "App_Data/CustomDocuments";
});

Hosts can replace IDocumentFileStore entirely to change where uploaded files are written:

builder.Services.AddSingleton<IDocumentFileStore, AzureBlobDocumentFileStore>();

IDocumentFileStore extends the general IFileStore abstraction. AIDocument.StoredFileName and AIDocument.StoredFilePath preserve the backing file-store location so hosts can trace and delete the physical file later.

Document Processing Pipeline

Step 1 — Upload and Store

When a file is uploaded, a new AIDocument record is created in IAIDocumentStore:

public sealed class AIDocument : CatalogItem
{
    public string ReferenceId { get; set; }    // Owning resource (e.g., chat interaction ID)
    public string ReferenceType { get; set; }  // Resource type (e.g., "chatinteraction")
    public string FileName { get; set; }       // Original file name
    public string ContentType { get; set; }    // MIME type
    public long FileSize { get; set; }         // Size in bytes
    public DateTime UploadedUtc { get; set; }  // Upload timestamp
}

The ReferenceId and ReferenceType pair ties the document to an owning resource. Common reference types include:

Constant	Value	Meaning
`AIReferenceTypes.Document.Profile`	`"profile"`	Document attached to an AI profile
`AIReferenceTypes.Document.ChatInteraction`	`"chatinteraction"`	Document attached to a chat interaction
`AIReferenceTypes.Document.ChatSession`	`"chatsession"`	Document attached to a chat session

Hosts can layer extra behavior on top of this shared pipeline, but the default follow-up indexing step now lives in the framework as DefaultAIDocumentIndexingService. Hosts can call that service after persisting AIDocument and AIDocumentChunk records so uploaded chunks are mirrored into the configured AI Documents vector index without duplicating provider-specific index management code.

Step 2 — Read File Content

An IngestionDocumentReader is resolved as a keyed service using the file extension. The reader extracts plain text from the file:

public abstract class IngestionDocumentReader
{
    public abstract Task<IngestionDocument> ReadAsync(
        Stream source,
        string identifier,
        string mediaType,
        CancellationToken cancellationToken = default);
}

Step 3 — Normalize and Chunk

The extracted text is normalized (whitespace, encoding) and split into chunks. Each chunk becomes an AIDocumentChunk:

public sealed class AIDocumentChunk : CatalogItem
{
    public string AIDocumentId { get; set; }   // Parent document ID
    public string ReferenceId { get; set; }    // Denormalized from parent
    public string ReferenceType { get; set; }  // Denormalized from parent
    public string Content { get; set; }        // Chunk text
    public float[] Embedding { get; set; }     // Vector embedding
    public int Index { get; set; }             // Chunk order within the document
}

The ReferenceId and ReferenceType are denormalized from the parent document for efficient query access without joins.

Step 4 — Generate Embeddings

If the file extension is embeddable (see Built-in Document Readers), each chunk is converted to a vector via IEmbeddingGenerator<string, Embedding<float>>:

var embeddingGenerator =
    await processingService.CreateEmbeddingGeneratorAsync("OpenAI", "default");

The generator is created from the configured provider and connection. Embeddings are stored on the chunk itself (Embedding property) so they survive index rebuilds.

Step 5 — Index in Vector Store

Chunks with embeddings are pushed to the search index via ISearchDocumentManager:

public interface ISearchDocumentManager
{
    Task<bool> AddOrUpdateAsync(
        IIndexProfileInfo profile,
        IReadOnlyCollection<IndexDocument> documents,
        CancellationToken cancellationToken = default);

    Task DeleteAsync(
        IIndexProfileInfo profile,
        IEnumerable<string> documentIds,
        CancellationToken cancellationToken = default);

    Task DeleteAllAsync(
        IIndexProfileInfo profile,
        CancellationToken cancellationToken = default);
}

Implementations are registered as keyed services by provider name (e.g., "Elasticsearch", "AzureAISearch").

Step 6 — Query-Time Retrieval

During a conversation, SearchDocumentsTool calls IVectorSearchService to find the most relevant chunks:

public interface IVectorSearchService
{
    Task<IEnumerable<DocumentChunkSearchResult>> SearchAsync(
        IIndexProfileInfo indexProfile,
        float[] embedding,
        string referenceId,
        string referenceType,
        int topN,
        CancellationToken cancellationToken = default);
}

The user's query is embedded, and the resulting vector is compared against indexed chunks using cosine similarity.

For uploaded chat-interaction and chat-session documents, the framework now switches between two context-loading strategies automatically:

targeted questions continue to use semantic chunk retrieval (SearchDocumentsTool and preemptive RAG)
whole-document tasks such as summarizing, reviewing, rewriting, translating, or extracting complete information from an attached file inject the full document text instead of a few chunks

That keeps RAG efficient for lookup-style questions while avoiding partial-context answers for requests that depend on the entire uploaded file.

Built-in Document Readers

Reader	Extensions	Embeddable	Notes
`PlainTextIngestionDocumentReader`	`.txt`, `.md`, `.json`, `.xml`, `.html`, `.htm`, `.log`, `.yaml`, `.yml`	Yes	UTF-8 stream reader
`PlainTextIngestionDocumentReader`	`.csv`	No	Tabular — processed via `ReadTabularDataTool`
`OpenXmlIngestionDocumentReader`	`.docx`, `.pptx`	Yes	Uses `DocumentFormat.OpenXml` SDK
`OpenXmlIngestionDocumentReader`	`.xlsx`	No	Tabular — processed via `ReadTabularDataTool`
`PdfIngestionDocumentReader`	`.pdf`	Yes	Uses `UglyToad.PdfPig` with DocstrumBoundingBoxes

Embeddable means the content is chunked and vector-embedded for semantic search. Non-embeddable (tabular) formats are instead handled by the ReadTabularDataTool which reads and parses them directly.

Custom Document Reader

builder.Services.AddCrestAppsIngestionDocumentReader<RtfIngestionDocumentReader>(
    new ExtractorExtension(".rtf", embeddable: true));

Implement the reader:

public sealed class RtfIngestionDocumentReader : IngestionDocumentReader
{
    public override async Task<IngestionDocument> ReadAsync(
        Stream source,
        string identifier,
        string mediaType,
        CancellationToken cancellationToken = default)
    {
        // Parse the RTF stream into plain text
        using var reader = new StreamReader(source);
        var rawContent = await reader.ReadToEndAsync(cancellationToken);
        var plainText = StripRtfFormatting(rawContent);

        return new IngestionDocument
        {
            Content = plainText,
            Identifier = identifier,
        };
    }
}

`ExtractorExtension`

The ExtractorExtension type defines a file extension and whether its content is embeddable:

public sealed class ExtractorExtension
{
    public string Extension { get; }   // Normalized with leading dot (e.g., ".rtf")
    public bool Embeddable { get; }    // Whether embeddings should be generated

    public ExtractorExtension(string extension, bool embeddable = true);
}

There is an implicit conversion from string to ExtractorExtension (with embeddable: true by default), so you can pass bare strings for embeddable extensions:

// These are equivalent:
services.AddCrestAppsIngestionDocumentReader<MyReader>(".rtf");
services.AddCrestAppsIngestionDocumentReader<MyReader>(new ExtractorExtension(".rtf", true));

// For non-embeddable extensions, use the explicit constructor:
services.AddCrestAppsIngestionDocumentReader<MyReader>(new ExtractorExtension(".tsv", false));

`AddCrestAppsIngestionDocumentReader<T>`

public static IServiceCollection AddCrestAppsIngestionDocumentReader<T>(
    this IServiceCollection services,
    params ExtractorExtension[] supportedExtensions)
    where T : IngestionDocumentReader;

This method:

Registers the reader as a singleton
Registers a keyed singleton for each extension (used to resolve the right reader at runtime)
Adds the extensions to ChatDocumentsOptions

Document Tools

Three system tools are automatically available when documents are attached to a session. They are registered with AIToolPurposes.DocumentProcessing and injected by DocumentOrchestrationHandler.

`SearchDocumentsTool`

Name: search_documents (SystemToolNames.SearchDocuments)

Performs semantic vector search across all uploaded documents for the current session and returns the most relevant text chunks.

Parameters:

Parameter	Type	Required	Description
`query`	`string`	Yes	The search query to find relevant content
`top_n`	`integer`	No	Number of top matching chunks to return (default: 3)

`ReadDocumentTool`

Name: read_document (SystemToolNames.ReadDocument)

Reads the full text content of a specific uploaded document. Truncates output to 50 KB maximum.

Parameters:

Parameter	Type	Required	Description
`document_id`	`string`	Yes	The unique identifier of the document to read

`ReadTabularDataTool`

Name: read_tabular_data (SystemToolNames.ReadTabularData)

Reads tabular data from CSV, TSV, or Excel files and returns formatted rows suitable for analysis.

Parameters:

Parameter	Type	Required	Description
`document_id`	`string`	Yes	The unique identifier of the tabular document
`max_rows`	`integer`	No	Maximum number of data rows to return (default: 100)

Supported extensions: .csv, .tsv, .xlsx, .xls

Implementing Stores

The framework defines two store interfaces. You must provide implementations for your persistence layer.

`IAIDocumentStore`

public interface IAIDocumentStore : ICatalog<AIDocument>
{
    Task<IReadOnlyCollection<AIDocument>> GetDocumentsAsync(
        string referenceId,
        string referenceType);
}

Inherits CRUD operations from ICatalog<T>:

Method	Description
`CreateAsync(T)`	Insert a new document record
`UpdateAsync(T)`	Update an existing document record
`DeleteAsync(T)`	Delete a document record
`FindByIdAsync(string)`	Find a document by its `ItemId`
`GetAllAsync()`	Retrieve all documents
`GetAsync(IEnumerable<string>)`	Retrieve documents by IDs
`PageAsync(int, int, TQuery)`	Paginated query
`SaveChangesAsync()`	Flush pending changes

`IAIDocumentChunkStore`

public interface IAIDocumentChunkStore : ICatalog<AIDocumentChunk>
{
    Task<IReadOnlyCollection<AIDocumentChunk>> GetChunksByAIDocumentIdAsync(string documentId);
    Task<IReadOnlyCollection<AIDocumentChunk>> GetChunksByReferenceAsync(
        string referenceId, string referenceType);
    Task DeleteByDocumentIdAsync(string documentId);
}

Registration

builder.Services.AddScoped<IAIDocumentStore, YesSqlAIDocumentStore>();
builder.Services.AddScoped<IAIDocumentChunkStore, YesSqlAIDocumentChunkStore>();

See Data Storage for more on the catalog pattern and YesSql index conventions.

Orchestration Integration

DocumentOrchestrationHandler implements IOrchestrationContextBuilderHandler and is registered automatically by AddCoreAIDocumentProcessing().

public sealed class DocumentOrchestrationHandler : IOrchestrationContextBuilderHandler
{
    public Task BuildingAsync(OrchestrationContextBuildingContext context);
    public Task BuiltAsync(OrchestrationContextBuiltContext context);
}

During context building, the handler:

Checks if the current session has documents (via ReferenceId / ReferenceType)
If documents exist, sets AICompletionContextKeys.HasDocuments = true
Discovers all tools with purpose AIToolPurposes.DocumentProcessing and adds them to the tool set
Enriches the system message with document metadata so the model knows what content is available

This means document tools are only injected when the session actually has documents — no wasted tokens on tool descriptions when there are no documents.

Tabular Data

CSV, TSV, and Excel files are marked as non-embeddable and receive special processing.

`ITabularBatchProcessor`

Splits large tabular content into batches, processes each batch with the LLM, and merges results:

public interface ITabularBatchProcessor
{
    IList<TabularBatch> SplitIntoBatches(string content, string fileName);

    Task<IList<TabularBatchResult>> ProcessBatchesAsync(
        IList<TabularBatch> batches,
        string userPrompt,
        TabularBatchContext context,
        CancellationToken cancellationToken = default);

    string MergeResults(IList<TabularBatchResult> results, bool includeHeader = true);
}

`ITabularBatchResultCache`

Caches batch results to avoid re-processing identical queries:

public interface ITabularBatchResultCache
{
    string GenerateCacheKey(string interactionId, string documentContentHash, string prompt);
    string ComputeDocumentContentHash(IEnumerable<(string FileName, string Content)> documents);
    TabularBatchCacheEntry TryGet(string cacheKey);
    void Set(string cacheKey, TabularBatchCacheEntry entry, TimeSpan? expiration = null);
    void Remove(string cacheKey);
    void InvalidateForInteraction(string interactionId);
}

When documents are added or removed from an interaction, call InvalidateForInteraction to clear stale cache entries.

Configuration

`ChatDocumentsOptions`

Controls which file types can be uploaded and how they are processed:

services.Configure<ChatDocumentsOptions>(options =>
{
    // Add an embeddable extension
    options.Add(".rtf", embeddable: true);

    // Add a tabular (non-embeddable) extension
    options.Add(".tsv", embeddable: false);
});

Property	Type	Description
`AllowedFileExtensions`	`IReadOnlySet<string>`	Complete set of uploadable file extensions
`EmbeddableFileExtensions`	`IReadOnlySet<string>`	Subset that gets vector-embedded
`MaxVisionInputBytesPerRequest`	`long`	Maximum total image bytes attached to one multimodal request; set `0` or less to disable the limit

Extensions not in EmbeddableFileExtensions are still allowed for upload and can be read by ReadDocumentTool or ReadTabularDataTool, but they are not chunked and embedded.

`InteractionDocumentSettings`

Per-interaction settings for document search:

public sealed class InteractionDocumentSettings
{
    public string IndexProfileName { get; set; }  // Index profile for embedding and search
    public int TopN { get; set; } = 3;             // Top matching chunks to include in context
}

Limits

Maximum 25,000 characters total for embedding per session
ReadDocumentTool truncates output to 50 KB
ReadTabularDataTool defaults to 100 rows maximum

Services Registered by `AddCoreAIDocumentProcessing()`

Service	Implementation	Lifetime	Purpose
`IAIDocumentProcessingService`	`DefaultAIDocumentProcessingService`	Scoped	Orchestrates document processing
`ITabularBatchProcessor`	`TabularBatchProcessor`	Scoped	Processes CSV/Excel batch queries
`ITabularBatchResultCache`	`TabularBatchResultCache`	Singleton	Caches tabular query results
`DocumentOrchestrationHandler`	—	Scoped	Injects document context into orchestration
`PlainTextIngestionDocumentReader`	—	Singleton	`.txt`, `.csv`, `.md`, `.json`, `.xml`, `.html`, `.htm`, `.log`, `.yaml`, `.yml`
`OpenXmlIngestionDocumentReader`	—	Singleton	`.docx`, `.xlsx`, `.pptx`
`PdfIngestionDocumentReader`	—	Singleton	`.pdf`
`SearchDocumentsTool`	—	System tool	Semantic vector search
`ReadDocumentTool`	—	System tool	Full document read
`ReadTabularDataTool`	—	System tool	Tabular data queries

Quick Start​

Problem & Solution​

Creating a chat client to describe an image​

Architecture Overview​

Core Interfaces​

Document Processing Pipeline​

Step 1 — Upload and Store​

Step 2 — Read File Content​

Step 3 — Normalize and Chunk​

Step 4 — Generate Embeddings​

Step 5 — Index in Vector Store​

Step 6 — Query-Time Retrieval​

Built-in Document Readers​

Custom Document Reader​

ExtractorExtension​

AddCrestAppsIngestionDocumentReader<T>​

Document Tools​

SearchDocumentsTool​

ReadDocumentTool​

ReadTabularDataTool​

Implementing Stores​

IAIDocumentStore​

IAIDocumentChunkStore​

Registration​

Orchestration Integration​

Tabular Data​

ITabularBatchProcessor​

ITabularBatchResultCache​

Configuration​

ChatDocumentsOptions​

InteractionDocumentSettings​

Limits​

Services Registered by AddCoreAIDocumentProcessing()​

Quick Start

Problem & Solution

Creating a chat client to describe an image

Architecture Overview

Core Interfaces

Document Processing Pipeline

Step 1 — Upload and Store

Step 2 — Read File Content

Step 3 — Normalize and Chunk

Step 4 — Generate Embeddings

Step 5 — Index in Vector Store

Step 6 — Query-Time Retrieval

Built-in Document Readers

Custom Document Reader

`ExtractorExtension`

`AddCrestAppsIngestionDocumentReader<T>`

Document Tools

`SearchDocumentsTool`

`ReadDocumentTool`

`ReadTabularDataTool`

Implementing Stores

`IAIDocumentStore`

`IAIDocumentChunkStore`

Registration

Orchestration Integration

Tabular Data

`ITabularBatchProcessor`

`ITabularBatchResultCache`

Configuration

`ChatDocumentsOptions`

`InteractionDocumentSettings`

Limits

Services Registered by `AddCoreAIDocumentProcessing()`