LIBRAIN

Multi-Agent RAG System for Scientific Discovery

Open-source, built in .NET 10. Ingests arXiv papers, embeds them in a vector store, and generates citation-grounded hypotheses through a Reader → Synthesis → Evaluator pipeline.

github.com/erennmutlu1/librain

5 arXiv papers ingested

218 chunks indexed in Qdrant

< 200ms vector search latency

Why this exists

Most RAG and AI-agent tutorials are written in Python. Production engineering organizations that run on .NET (banks, insurers, government) hear that "AI requires Python", which weakens the case for adopting LLM-driven features in their existing stack. LIBRAIN is a deliberate counter-example: a complete multi-agent retrieval pipeline built end-to-end in .NET 10 on Microsoft-blessed primitives (Semantic Kernel, Application Insights), Anthropic Claude, OpenAI embeddings, and a managed vector store.

It's also a portfolio artifact aligned with the Microsoft AI-200 (Azure AI Cloud Developer) certification syllabus. The production deployment target is Azure Cosmos DB for NoSQL with DiskANN vector indexing, deferred to Phase 3 once the dev pipeline is proven against local Qdrant.

Architecture

Three logical agents plus a cross-cutting audit logger. Every step emits a structured event with a correlation ID so any output is fully traceable from query back to source chunks.

Reader

Ingests PDFs, chunks recursively, embeds, and persists to the vector store with citation metadata.

→

Synthesis

Retrieves the most relevant chunks by vector search and prompts Claude to generate citation-grounded hypotheses.

→

Evaluator

LLM-as-a-Judge filter scoring plausibility, novelty, and clarity to drop hallucinated outputs.

Tech stack

Locked stack — chosen up-front, no swaps mid-build. Each pick has a stated rationale documented in the project plan.

.NET 10 · ASP.NET Core Minimal APIs · Anthropic Claude · OpenAI text-embedding-3-small · Qdrant · Microsoft Semantic Kernel · PdfPig · Application Insights · xUnit v3

API surface

The Reader Agent track exposes three endpoints today. Synthesis (POST /api/query) and the audit-trail endpoint (GET /api/audit/{id}) ship in Phase 2.

POST /api/papers/ingest: Multipart PDF upload. Streams through the full pipeline (extract → chunk → embed → persist) and returns paperId, chunk count, and a correlation ID for the audit trail.

GET /api/papers: Lists ingested papers with title, chunk count, and ingestion timestamp. Aggregated from the chunk collection via Qdrant scroll + in-memory dedupe; sub-second at MVP scale.

GET /health: Liveness probe. Returns 200 with a single status field.

Pipeline walkthrough

1. PDF parsing

PdfPig with the ContentOrderTextExtractor reconstructs glyph order with proper word spacing. The default extractor drops inter-word spaces on column-layout PDFs — an issue that only surfaced on real arXiv papers, not synthetic test input.

Catch: the synthetic tests passed, but real PDFs came back with 0/15 sections detected because of the spacing bug. After the fix, 17/18 sections were detected.

2. Recursive chunking

Paragraph → sentence → hard-cut fallback. Target 512 tokens (~2,000 chars), max 1,024 tokens, 15% overlap (~75 tokens). Each chunk carries its absolute offset, page number, and best-effort section heading for downstream citation tracking.

Boundary: never split a paragraph if a paragraph break exists in the [target, max] window.

3. Embedding

OpenAI text-embedding-3-small (1,536-dim) with token-aware batching: ≤100 inputs per batch, ≤35K tokens per batch, plus a 1.5-second pacing gap to stay clear of Tier 1's 40K-TPM rolling-window limit. Batch sizes are matched to the rate tier the account is actually on.

Catch: an early 250K-token-per-batch limit assumed Tier 2 and rejected 2 of 5 papers on a fresh Tier 1 account.
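The batching rule above can be sketched as a greedy packer. This is a minimal illustration, not LIBRAIN's actual code: the class name EmbeddingBatcher, the method names, and the chars/4 token estimate are all assumptions made for the example. Only the two caps (100 inputs, 35K tokens) come from the text.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: names and the chars/4 token estimate are assumptions,
// not LIBRAIN's actual implementation.
public static class EmbeddingBatcher
{
    public const int MaxInputsPerBatch = 100;     // per-batch input cap from the text
    public const int MaxTokensPerBatch = 35_000;  // per-batch token cap from the text

    // Rough estimate of ~4 characters per token (assumption; a real tokenizer is more accurate).
    public static int EstimateTokens(string text) => Math.Max(1, (text.Length + 3) / 4);

    // Greedily packs inputs, in order, into batches that respect both caps.
    public static List<List<string>> Plan(IEnumerable<string> inputs)
    {
        var batches = new List<List<string>>();
        var current = new List<string>();
        int currentTokens = 0;

        foreach (string input in inputs)
        {
            int tokens = EstimateTokens(input);
            if (tokens > MaxTokensPerBatch)
                throw new ArgumentException("A single input exceeds the per-batch token cap.");

            // Close the current batch if adding this input would break either cap.
            bool full = current.Count == MaxInputsPerBatch
                        || currentTokens + tokens > MaxTokensPerBatch;
            if (full && current.Count > 0)
            {
                batches.Add(current);
                current = new List<string>();
                currentTokens = 0;
            }

            current.Add(input);
            currentTokens += tokens;
        }

        if (current.Count > 0) batches.Add(current);
        return batches;
    }
}
```

A caller would send each planned batch in turn and could apply the pacing gap between requests with await Task.Delay(TimeSpan.FromSeconds(1.5)). Sizing the batches up front keeps the rate-limit logic in one place instead of spreading it across retry handlers.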

4. Vector search

Qdrant cosine similarity with deterministic UUID v5 point IDs (RFC 4122 §4.3, SHA-1 over a project-private namespace plus paperId-chunkIndex). Re-ingesting a paper upserts cleanly, overwriting the same points instead of duplicating them. Collection bootstrap sits behind a Lazy<Task>, so the app can start before Qdrant is reachable.

Boundary: top-K results carry full citation metadata — paperId, chunkIndex, page, section — so callers render footnotes without a second query.
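For readers unfamiliar with name-based UUIDs, here is a minimal sketch of how a version 5 ID can be derived in .NET 8 or later. The class and method names (DeterministicId, CreateV5, ForChunk) are hypothetical, and the "paperId-chunkIndex" key format is inferred from the description above rather than taken from the LIBRAIN source.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch of RFC 4122 name-based (version 5) UUID generation.
public static class DeterministicId
{
    // SHA-1 over the namespace bytes (network byte order) followed by the UTF-8 name.
    public static Guid CreateV5(Guid namespaceId, string name)
    {
        byte[] ns = namespaceId.ToByteArray(bigEndian: true);   // big-endian overload, .NET 8+
        byte[] nameBytes = Encoding.UTF8.GetBytes(name);

        byte[] input = new byte[ns.Length + nameBytes.Length];
        ns.CopyTo(input, 0);
        nameBytes.CopyTo(input, ns.Length);

        byte[] hash = SHA1.HashData(input);                      // 20 bytes; only the first 16 are used

        Span<byte> uuid = stackalloc byte[16];
        hash.AsSpan(0, 16).CopyTo(uuid);
        uuid[6] = (byte)((uuid[6] & 0x0F) | 0x50);               // set version to 5
        uuid[8] = (byte)((uuid[8] & 0x3F) | 0x80);               // set the RFC 4122 variant bits

        return new Guid(uuid, bigEndian: true);
    }

    // Assumed key format "paperId-chunkIndex", following the description in the text.
    public static Guid ForChunk(Guid projectNamespace, string paperId, int chunkIndex)
        => CreateV5(projectNamespace, $"{paperId}-{chunkIndex}");
}
```

Calling DeterministicId.ForChunk with the same namespace, paperId, and chunkIndex always yields the same Guid, which is the property that lets a re-ingest upsert over existing points rather than create new ones.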

Inside the chunker

The recursive chunker is the foundation of citation tracking: every chunk's offset and length must be deterministic so the synthesis layer can map a hypothesis citation back to a verifiable source span. The fallback ladder keeps every chunk within the size limit, never splitting mid-paragraph when a paragraph break is reachable and never splitting mid-sentence when a sentence break is.

private static int ChooseEnd(string text, int start)
{
    int remaining = text.Length - start;
    if (remaining <= MaxChars)
    {
        return text.Length;
    }

    int targetEnd = start + TargetChars;
    int maxEnd = start + MaxChars;

    // Prefer paragraph boundary in [target, max]
    int paraEnd = LatestMatchIndex(ParagraphBreakRegex(), text, targetEnd, maxEnd);
    if (paraEnd > start) return paraEnd;

    // Fall back to sentence boundary
    int sentEnd = LatestMatchIndex(SentenceBreakRegex(), text, targetEnd, maxEnd);
    if (sentEnd > start) return sentEnd;

    // Hard cut as last resort
    return maxEnd;
}

From LIBRAIN/Reading/RecursiveChunker.cs. Constants are TargetChars = 2000 and MaxChars = 4000. Each emitted chunk also rewinds OverlapChars = 300 characters to preserve context across boundaries, which matches the 15% overlap target (300 / 2,000 chars).
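The excerpt above omits the LatestMatchIndex helper and the loop that applies the overlap. The sketch below shows one way those pieces could fit together. It is an assumption-laden illustration, not the repository's code: the regex patterns, the -1 "not found" convention, the Chunk record, and the Split method are all hypothetical, and ChooseEnd is restated only so the block is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical chunk shape: index, absolute offset, and text.
public sealed record Chunk(int Index, int Offset, string Text);

public static class ChunkerSketch
{
    public const int TargetChars = 2000;
    public const int MaxChars = 4000;
    public const int OverlapChars = 300;

    // Assumed patterns: a blank line ends a paragraph; '.', '!' or '?' plus whitespace ends a sentence.
    private static readonly Regex ParagraphBreak = new(@"\n\s*\n", RegexOptions.Compiled);
    private static readonly Regex SentenceBreak = new(@"[.!?]\s+", RegexOptions.Compiled);

    // Returns the end index of the last match lying fully inside [from, to], or -1 if there is none.
    public static int LatestMatchIndex(Regex regex, string text, int from, int to)
    {
        int latest = -1;
        Match m = regex.Match(text, from);
        while (m.Success && m.Index + m.Length <= to)
        {
            latest = m.Index + m.Length;
            m = m.NextMatch();
        }
        return latest;
    }

    // Same ladder as the excerpt: paragraph break, then sentence break, then hard cut.
    public static int ChooseEnd(string text, int start)
    {
        if (text.Length - start <= MaxChars) return text.Length;

        int targetEnd = start + TargetChars;
        int maxEnd = start + MaxChars;

        int paraEnd = LatestMatchIndex(ParagraphBreak, text, targetEnd, maxEnd);
        if (paraEnd > start) return paraEnd;

        int sentEnd = LatestMatchIndex(SentenceBreak, text, targetEnd, maxEnd);
        if (sentEnd > start) return sentEnd;

        return maxEnd;
    }

    // Emits chunks in order, rewinding OverlapChars after each one except the last.
    public static List<Chunk> Split(string text)
    {
        var chunks = new List<Chunk>();
        int start = 0;
        while (start < text.Length)
        {
            int end = ChooseEnd(text, start);
            chunks.Add(new Chunk(chunks.Count, start, text[start..end]));
            if (end >= text.Length) break;

            // Rewind for overlap, but always advance so the loop terminates.
            start = Math.Max(end - OverlapChars, start + 1);
        }
        return chunks;
    }
}
```

Because every Offset is an absolute index into the extracted text, text.Substring(chunk.Offset, chunk.Text.Length) reproduces the chunk exactly, which is the determinism the citation layer depends on.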

Results

Smoke-tested end-to-end on five real arXiv papers covering RAG, agentic AI surveys, and scientific-discovery pipelines.

5 papers

Including Lewis et al. (2005.11401), the original RAG paper

218 chunks

218 × 1,536-dim vectors stored in the Qdrant librain_chunks collection

< 200ms

Vector search latency at MVP scale. The collection is well below the size at which Qdrant builds an HNSW index, so queries run as an exact brute-force scan that stays fast.

Papers ingested in the smoke test

Five papers chosen for representativeness — the foundational RAG paper, two recent agentic-AI surveys, and two longer-form scientific-discovery papers — covering both the pipeline's depth (long surveys with many chunks) and topical relevance (RAG, agents, retrieval-grounded generation).

2005.11401 Lewis et al. — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (NeurIPS 2020). The original RAG paper. 21 chunks.

2503.08979 Gridach et al. — Agentic AI for Scientific Discovery: A Survey of Progress, Challenges, and Future Directions (ICLR 2025). 18 chunks.

2504.05496 arXiv 2025 — agentic-AI methodology preprint. 12 chunks.

2505.04651 arXiv 2025 — long-form survey on RAG and retrieval pipelines. 80 chunks across 60 pages.

2508.14111 From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery. 87 chunks across 84 pages.

Vector search smoke test: the query "How does retrieval augmented generation work?" returned all 5 top hits from 2005.11401, as distinct chunks from pages 1, 2, 8, and 9, with cosine similarities of 0.52–0.61. Semantic recall is validated end-to-end: the top results come from the paper that introduced RAG.

What I learned

Synthetic tests don't replace smoke tests on real artifacts

The chunker's seven xUnit cases passed cleanly on synthetic strings. A 30-line throwaway smoke run on one real arXiv PDF caught the PdfPig word-spacing bug and the section-detection regex's blind spots — issues that would have shipped silently and degraded retrieval quality across every paper.

Rate-limit handling is honest sizing, not just defensive retries

OpenAI's Tier 1 limit is 40K tokens per minute on text-embedding-3-small. The original 250K-token batch, sized for Tier 2, worked everywhere except where it mattered. The fix was to lower the per-batch cap to 35K and add proportional 1.5s pacing: no retry logic, no backoff trees, just batch sizes that respect the actual constraint.

Lazy provisioning decouples app boot from external infrastructure

Qdrant collection creation behind a Lazy<Task> means the app starts even if Qdrant is offline. The first ingest call awaits a single idempotent CollectionExistsAsync + CreateCollectionAsync sequence, and concurrent callers share that one task, so there is no creation race. No explicit init command, no startup migration step, no operator runbook for first-deploy.
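A minimal sketch of this pattern follows. The ICollectionAdmin interface and the LazyCollectionBootstrap class are hypothetical stand-ins written for this example; they borrow the two method names mentioned above but are not the Qdrant client's actual types or LIBRAIN's code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical abstraction over the vector store; not the Qdrant client's actual interface.
public interface ICollectionAdmin
{
    Task<bool> CollectionExistsAsync(string name);
    Task CreateCollectionAsync(string name);
}

public sealed class LazyCollectionBootstrap
{
    private readonly Lazy<Task> _ensureCollection;

    public LazyCollectionBootstrap(ICollectionAdmin admin, string collectionName)
    {
        // Nothing runs here, so constructing this at startup never touches the network.
        // ExecutionAndPublication means the factory runs at most once, even under concurrency.
        _ensureCollection = new Lazy<Task>(
            () => EnsureAsync(admin, collectionName),
            LazyThreadSafetyMode.ExecutionAndPublication);
    }

    // Every caller awaits the same task; only the first call triggers provisioning.
    public Task EnsureReadyAsync() => _ensureCollection.Value;

    private static async Task EnsureAsync(ICollectionAdmin admin, string name)
    {
        if (!await admin.CollectionExistsAsync(name))
        {
            await admin.CreateCollectionAsync(name);
        }
    }
}
```

An ingest handler would call await bootstrap.EnsureReadyAsync() before its first upsert. One caveat of this particular sketch: a Lazy<Task> caches a faulted task too, so if the first attempt fails while Qdrant is offline, later calls see the same failure until the Lazy is replaced. A production version would likely need a reset-on-failure step.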

Atomic commits with no scope parens read better than scoped ones

Started with feat(reading): add PDF text extraction; ended with feat: add PDF text extraction service via PdfPig. The bare type: form keeps git log --oneline visually uniform on a small repo. Scopes are CI-machine-friendly but recruiter-unfriendly when the column widths don't align.

What's next

Phase 1 (Reader Agent track) shipped — ingest, chunk, embed, vector-search end-to-end. The next two phases are scoped, time-boxed, and documented in the repo's PROJECT_PLAN.md.

✓

Phase 1: Reader Agent (May 2026)

PDF ingestion, recursive chunking, OpenAI embeddings, Qdrant vector store, REST endpoints. Shipped. 22 atomic commits, 5 papers smoke-tested, sub-200ms vector search.

2

Phase 2: Synthesis & Evaluation (June 2026)

Synthesis Agent (Claude prompt → citation-grounded hypotheses), Evaluator Agent (LLM-as-a-Judge plausibility/novelty/clarity scoring), POST /api/query endpoint with full audit trail and citation validation.

3

Phase 3: Polish & Showcase (July 2026)

Minimal Next.js frontend on Azure Static Web Apps, API on Azure Container Apps with production Cosmos DB vector store, companion arXiv preprint with implementation results, blog post on .NET RAG patterns, AI-200 (Azure AI Cloud Developer) certification.

Get in touch

Open to senior .NET / AI-engineering roles in the EU (Netherlands, Germany, Ireland) with visa sponsorship.

GitHub: github.com/erennmutlu1/librain · Email: erennmutlu@outlook.com