LIBRAIN
Multi-Agent RAG System for Scientific Discovery
Open-source, built in .NET 10. Ingests arXiv papers, embeds them in a vector store, and generates citation-grounded hypotheses through a Reader → Synthesis → Evaluator pipeline.
github.com/erennmutlu1/librain
Why this exists
Most RAG and AI-agent tutorials are written in Python. Production engineering organizations that run on .NET — banks, insurers, government — hear the message that "AI requires Python", which weakens the case for adopting LLM-driven features in their existing stack. LIBRAIN is a deliberate counter-example: a complete multi-agent retrieval pipeline built end-to-end in .NET 10 on Microsoft-blessed primitives (Semantic Kernel, Application Insights), Anthropic Claude, OpenAI embeddings, and a managed vector store.
It's also a portfolio artifact aligned with the Microsoft AI-200 (Azure AI Cloud Developer) certification syllabus — the production deployment target is Azure Cosmos DB for NoSQL with DiskANN vector indexing, deferred to Phase 3 once the dev pipeline is proven against local Qdrant.
Architecture
Three logical agents plus a cross-cutting audit logger. Every step emits a structured event with a correlation ID, so any output is fully traceable from query back to source chunks (a logging sketch follows the agent list).
Reader
Ingests PDFs, chunks recursively, embeds, and persists to the vector store with citation metadata.
Synthesis
Vector-searches retrieved chunks and prompts Claude to generate citation-grounded hypotheses.
Evaluator
LLM-as-a-Judge filter scoring plausibility, novelty, and clarity to drop hallucinated outputs.
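A minimal sketch of the audit pattern with ILogger scopes; the class and method names are illustrative assumptions, not the repo's actual types.

using Microsoft.Extensions.Logging;

public sealed class AuditLogger(ILogger<AuditLogger> logger)
{
    // Illustrative shape: every agent step logs inside a scope carrying the
    // correlation ID, so Application Insights can join all events of one run.
    public void LogStep(Guid correlationId, string agent, string step)
    {
        using var scope = logger.BeginScope(new Dictionary<string, object>
        {
            ["CorrelationId"] = correlationId,
        });
        logger.LogInformation("{Agent} completed {Step}", agent, step);
    }
}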
Tech stack
Locked stack, chosen up-front with no swaps mid-build; each pick has a stated rationale documented in the project plan. The picks, as named throughout this page: .NET 10 with Semantic Kernel for orchestration, Anthropic Claude for synthesis, OpenAI text-embedding-3-small for embeddings, Qdrant as the dev vector store (Azure Cosmos DB for NoSQL with DiskANN in production), PdfPig for PDF extraction, xUnit for tests, and Application Insights for telemetry.
API surface
The Reader Agent track exposes three endpoints today. Synthesis (POST /api/query) and the audit-trail endpoint (GET /api/audit/{id}) ship in Phase 2.
/api/papers/ingest
Multipart PDF upload. Streams through the full pipeline (extract → chunk → embed → persist) and returns paperId, chunk count, and a correlation ID for the audit trail. A client sketch follows the endpoint list.
/api/papers
Lists ingested papers with title, chunk count, and ingestion timestamp. Aggregated from the chunk collection via Qdrant scroll + in-memory dedupe — sub-second at MVP scale.
/health
Liveness probe. Returns 200 with a single status field.
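A minimal client sketch for the ingest endpoint, assuming a local dev host; the base address, form field name, and exact response field names are assumptions, not repo guarantees.

// Hedged sketch: uploading a PDF to POST /api/papers/ingest with HttpClient.
using var http = new HttpClient { BaseAddress = new Uri("http://localhost:5000") };

using var form = new MultipartFormDataContent();
var pdf = new ByteArrayContent(await File.ReadAllBytesAsync("2005.11401.pdf"));
pdf.Headers.ContentType = new System.Net.Http.Headers.MediaTypeHeaderValue("application/pdf");
form.Add(pdf, name: "file", fileName: "2005.11401.pdf");

var response = await http.PostAsync("/api/papers/ingest", form);
response.EnsureSuccessStatusCode();
// Expected shape per the docs above: paperId, chunk count, correlation ID.
Console.WriteLine(await response.Content.ReadAsStringAsync());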
Pipeline walkthrough
PDF parsing
PdfPig with the ContentOrderTextExtractor reconstructs glyph order with proper word spacing. The default extractor drops inter-word spaces on column-layout PDFs — an issue that only surfaced on real arXiv papers, not synthetic test input.
Catch: the synthetic tests passed, yet real PDFs surfaced the spacing bug, with 0/15 sections detected before the fix and 17/18 after.
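What the fix looks like in practice, as a minimal PdfPig sketch (the file path is illustrative):

// Content-order extraction preserves inter-word spacing on column-layout
// PDFs where the default per-page text extraction does not.
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor;

using var document = PdfDocument.Open("2005.11401.pdf");
foreach (var page in document.GetPages())
{
    string text = ContentOrderTextExtractor.GetText(page);
    Console.WriteLine($"--- page {page.Number} ---");
    Console.WriteLine(text);
}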
Recursive chunking
Paragraph → sentence → hard-cut fallback. Target 512 tokens (~2,000 chars), max 1,024 tokens, 15% overlap (~75 tokens). Each chunk carries its absolute offset, page number, and best-effort section heading for downstream citation tracking.
Boundary: never split a paragraph if a paragraph break exists in the [target, max] window.
Embedding
OpenAI text-embedding-3-small (1,536-dim) with token-aware batching: ≤100 inputs per batch, ≤35K tokens, plus a 1.5-second pacing gap to stay clear of Tier 1's 40K-TPM rolling-window limit. Batch sizes are sized for the rate tier the account is actually on, not an assumed one.
Catch: an early 250K-token-per-batch limit assumed Tier 2 and rejected 2 of 5 papers on a fresh Tier 1 account.
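A hedged sketch of the batching rule; countTokens and embedBatch are stand-in delegates, not the repo's actual names.

// Greedy token-aware batching under the stated caps (≤100 inputs, ≤35K tokens),
// with a 1.5s pacing gap between batches to respect the 40K-TPM rolling window.
static async Task EmbedAllAsync(
    IReadOnlyList<string> chunks,
    Func<string, int> countTokens,                // stand-in for a real tokenizer
    Func<IReadOnlyList<string>, Task> embedBatch) // stand-in for the embeddings call
{
    const int MaxInputs = 100;
    const int MaxTokens = 35_000;

    var batch = new List<string>();
    int batchTokens = 0;

    foreach (var chunk in chunks)
    {
        int tokens = countTokens(chunk);
        if (batch.Count > 0 && (batch.Count == MaxInputs || batchTokens + tokens > MaxTokens))
        {
            await embedBatch(batch);
            await Task.Delay(TimeSpan.FromSeconds(1.5));
            batch = new List<string>();
            batchTokens = 0;
        }
        batch.Add(chunk);
        batchTokens += tokens;
    }

    if (batch.Count > 0) await embedBatch(batch);
}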
Vector search
Qdrant cosine similarity with deterministic UUID v5 point IDs (RFC 4122 §4.3, SHA-1 over a project-private namespace plus paperId-chunkIndex). Re-ingesting a paper upserts cleanly; lazy collection bootstrap behind a Lazy<Task> means the app starts before Qdrant is reachable.
Boundary: top-K results carry full citation metadata — paperId, chunkIndex, page, section — so callers render footnotes without a second query.
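A sketch of the deterministic ID scheme per RFC 4122 §4.3; the algorithm is standard, and the name format mirrors the paperId-chunkIndex convention above.

// RFC 4122 §4.3 name-based UUID (version 5, SHA-1). Deterministic: the same
// namespace + name always yields the same point ID, so re-ingest upserts.
using System.Security.Cryptography;
using System.Text;

static Guid CreateUuidV5(Guid namespaceId, string name)
{
    // RFC 4122 hashes the namespace in network (big-endian) byte order;
    // Guid.ToByteArray() is mixed-endian, so swap the first three fields.
    byte[] ns = namespaceId.ToByteArray();
    SwapGuidByteOrder(ns);

    byte[] input = [.. ns, .. Encoding.UTF8.GetBytes(name)];
    byte[] hash = SHA1.HashData(input);

    byte[] id = hash[..16];
    id[6] = (byte)((id[6] & 0x0F) | 0x50); // version 5
    id[8] = (byte)((id[8] & 0x3F) | 0x80); // RFC 4122 variant
    SwapGuidByteOrder(id);                 // back to .NET's mixed-endian layout
    return new Guid(id);
}

static void SwapGuidByteOrder(byte[] g)
{
    (g[0], g[3]) = (g[3], g[0]); (g[1], g[2]) = (g[2], g[1]); // 4-byte field
    (g[4], g[5]) = (g[5], g[4]);                              // 2-byte field
    (g[6], g[7]) = (g[7], g[6]);                              // 2-byte field
}

// Usage (namespace GUID is a placeholder, not the project-private one):
// var pointId = CreateUuidV5(ProjectNamespace, $"{paperId}-{chunkIndex}");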
Inside the chunker
The recursive chunker is the foundation of citation tracking — every chunk's offset and length must be deterministic so the synthesis layer can map a hypothesis citation back to a verifiable source span. The fallback ladder ensures size limits are respected without ever splitting mid-sentence when a paragraph break is reachable.
private static int ChooseEnd(string text, int start)
{
int remaining = text.Length - start;
if (remaining <= MaxChars)
{
return text.Length;
}
int targetEnd = start + TargetChars;
int maxEnd = start + MaxChars;
// Prefer paragraph boundary in [target, max]
int paraEnd = LatestMatchIndex(ParagraphBreakRegex(), text, targetEnd, maxEnd);
if (paraEnd > start) return paraEnd;
// Fall back to sentence boundary
int sentEnd = LatestMatchIndex(SentenceBreakRegex(), text, targetEnd, maxEnd);
if (sentEnd > start) return sentEnd;
// Hard cut as last resort
return maxEnd;
}
From LIBRAIN/Reading/RecursiveChunker.cs. Constants are TargetChars = 2000, MaxChars = 4000. Each emitted chunk also rewinds OverlapChars = 300 backward to preserve context across boundaries, hitting the ~15% overlap commonly recommended for RAG chunking.
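For context, a sketch of the loop that might drive ChooseEnd with the overlap rewind; the loop shape is an assumption, not the repo's exact code.

// Emit [start, end), then rewind by OverlapChars so trailing context repeats
// at the start of the next chunk.
var chunks = new List<string>();
int start = 0;
while (start < text.Length)
{
    int end = ChooseEnd(text, start);
    chunks.Add(text[start..end]);
    if (end >= text.Length) break;
    // Rewind for overlap, but always move forward to guarantee termination.
    start = Math.Max(end - OverlapChars, start + 1);
}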
Results
Smoke-tested end-to-end on five real arXiv papers covering RAG, agentic AI surveys, and scientific-discovery pipelines.
Papers ingested in the smoke test
Five papers chosen for representativeness — the foundational RAG paper, two recent agentic-AI surveys, and two longer-form scientific-discovery papers — covering both the pipeline's depth (long surveys with many chunks) and topical relevance (RAG, agents, retrieval-grounded generation).
Vector search smoke test: the query "How does retrieval augmented generation work?" returned all 5 top hits from 2005.11401, with distinct chunks from pages 1, 2, 8, and 9 and cosine similarities of 0.52–0.61. Semantic recall validated end-to-end against the paper that introduced RAG itself.
What I learned
Synthetic tests don't replace smoke tests on real artifacts
The chunker's seven xUnit cases passed cleanly on synthetic strings. A 30-line throwaway smoke run on one real arXiv PDF caught the PdfPig word-spacing bug and the section-detection regex's blind spots — issues that would have shipped silently and degraded retrieval quality across every paper.
Rate-limit handling is honest sizing, not just defensive retries
OpenAI's Tier 1 limit is 40K tokens per minute on text-embedding-3-small. The original 250K-token batch, sized for Tier 2, worked everywhere except where it mattered. Lower the per-batch cap to 35K and add proportional 1.5s pacing: no retry logic, no backoff trees, just batch sizes that respect the actual constraint.
Lazy provisioning decouples app boot from external infrastructure
Qdrant collection creation behind a Lazy<Task> means the app starts even if Qdrant is offline. The first ingest call awaits a single idempotent bootstrap (CollectionExistsAsync, then CreateCollectionAsync if needed); Lazy<Task> makes it race-safe, running exactly once even under concurrent first requests. No explicit init command, no startup migration step, no operator runbook for first-deploy. A sketch of the pattern follows.
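A minimal sketch of the pattern with the Qdrant .NET client; the class shape and collection name are illustrative, and the vector parameters mirror the values above.

using Qdrant.Client;
using Qdrant.Client.Grpc;

public sealed class QdrantBootstrap(QdrantClient client)
{
    // Lazy<Task> (default ExecutionAndPublication mode) runs the factory exactly
    // once, even when several first requests race to await it.
    private readonly Lazy<Task> _ensureCollection = new(async () =>
    {
        if (!await client.CollectionExistsAsync("papers"))
        {
            await client.CreateCollectionAsync("papers",
                new VectorParams { Size = 1536, Distance = Distance.Cosine });
        }
    });

    // Every ingest/search path awaits this before touching the collection.
    public Task EnsureReadyAsync() => _ensureCollection.Value;
}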
Atomic commits with no scope parens read better than scoped ones
Started with feat(reading): add PDF text extraction; ended with feat: add PDF text extraction service via PdfPig. The bare type: form keeps git log --oneline visually uniform on a small repo. Scopes are CI-machine-friendly but recruiter-unfriendly when the column widths don't align.
What's next
Phase 1 (Reader Agent track) shipped — ingest, chunk, embed, vector-search end-to-end. The next two phases are scoped, time-boxed, and documented in the repo's PROJECT_PLAN.md.
Phase 1 — Reader Agent May 2026
PDF ingestion, recursive chunking, OpenAI embeddings, Qdrant vector store, REST endpoints. Shipped. 22 atomic commits, 5 papers smoke-tested, sub-200ms vector search.
Phase 2 — Synthesis & Evaluation June 2026
Synthesis Agent (Claude prompt → citation-grounded hypotheses), Evaluator Agent (LLM-as-a-Judge plausibility/novelty/clarity scoring), POST /api/query endpoint with full audit trail and citation validation.
Phase 3 — Polish & Showcase July 2026
Minimal Next.js frontend on Azure Static Web Apps, API on Azure Container Apps with production Cosmos DB vector store, companion arXiv preprint with implementation results, blog post on .NET RAG patterns, AI-200 (Azure AI Cloud Developer) certification.
Get in touch
Open to senior .NET / AI-engineering roles in the EU (Netherlands, Germany, Ireland) with visa sponsorship.