LIBRAIN
Multi-Agent RAG System for Scientific Discovery
Open-source, built in .NET 10. A Reader → Synthesis → Evaluator pipeline plus a parallel Discovery Mode that fences speculative content into a structured novelClaim field. The claim-level validation contract eliminated a 3-of-5 hallucination signal under three-rater blinded replication on a 41-paper cross-domain corpus (AI, drug discovery, climate, neuroscience). Under model substitution and adversarial prompting the citation contract still admits zero fabricated citations, a by-construction guarantee. Companion paper (Zenodo, DOI) · 2026.
github.com/erennmutlu1/librain
doi.org/10.5281/zenodo.20745782
41
arXiv papers ingested
1,766
chunks indexed in Qdrant
< 200ms
vector search latency
Why this exists
Most RAG and AI-agent tutorials are written in Python. Engineering organizations that run production systems on .NET (banks, insurers, government) hear that "AI requires Python", which weakens the case for adopting LLM-driven features in their existing stack. LIBRAIN is a deliberate counter-example: a complete multi-agent retrieval pipeline built end-to-end in .NET 10 on Microsoft-blessed primitives (Semantic Kernel, Application Insights), Anthropic Claude, OpenAI embeddings, and a managed vector store.
Architecture
Two pipelines share a Reader Agent and a Qdrant vector store. The Phase 2 query path runs Synthesis to Evaluator on retrieved chunks. The Phase 2.5 Discovery path runs the Discovery Agent into three concurrent scorers (NoveltyScorer, Discovery Evaluator, ClaimValidator) under a single Task.WhenAll dispatch, then aggregates to a four-axis score. Every step emits a structured event with a correlation ID so any output is fully traceable from query back to source chunks.
Reader
Ingests PDFs, chunks recursively, embeds with OpenAI text-embedding-3-small, and persists to Qdrant with citation metadata and deterministic UUIDv5 IDs.
→
Synthesis
Claude Sonnet 4.6 via structured tool use. Vector-searches retrieved chunks and generates citation-grounded hypotheses (Phase 2 query path).
→
Evaluator
Claude Haiku 4.5, T=0.0. LLM-as-a-Judge scoring groundedness, relevance, and completeness; aggregate quality computed in C# to resist halo effects.
↘
parallel branch
from Reader
→
Discovery
Claude Sonnet 4.6 with an inverted prompt that invites extrapolation beyond cited evidence. The speculative portion is returned as a flagged novelClaim field, exempt from citation validation by contract. The synthesis-side model is config-driven (Models:SynthesisModel), so it swaps to Haiku 4.5 for the robustness sweep without touching source.
Task.WhenAll parallel dispatch
Discovery output (hypothesis + supportingEvidence + novelClaim) is dispatched to the three concurrent scorers below. Wall-clock cost is bounded by the slowest single call, not the sum.
NoveltyScorer
Deterministic. Embeds novelClaim with text-embedding-3-small, runs top-1 cosine search against the 41-paper Qdrant corpus, returns 1 - similarity as novelty.
Discovery Evaluator
Claude Haiku 4.5, T=0.0. LLM-judged plausibility (does novelClaim follow from cited evidence) and structural coherence (well-formed testable hypothesis).
ClaimValidator
Claude Haiku 4.5, T=0.0. Per-sentence labels over novelClaim in {GROUNDED, EXTRAPOLATED, RISKY} with hallucination probability; max-aggregate risk in C# (halo-resistant). Added in Phase 3.A.5.
all three converge into the aggregator
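The dispatch pattern can be sketched in a few lines. This is a minimal illustration, not the project's code: the three scorer methods are hypothetical stand-ins that simulate latency with Task.Delay and return fixed values, and only Task.WhenAll itself is taken from the description above.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelScoringSketch
{
    // Hypothetical stand-ins for the three scorers; each simulates a remote call.
    private static async Task<double> ScoreNoveltyAsync(string novelClaim)
    {
        await Task.Delay(50);
        return 0.36;
    }

    private static async Task<double> ScorePlausibilityAsync(string novelClaim)
    {
        await Task.Delay(120);
        return 0.70;
    }

    private static async Task<string> ValidateClaimAsync(string novelClaim)
    {
        await Task.Delay(80);
        return "EXTRAPOLATED";
    }

    public static async Task<(double Novelty, double Plausibility, string RiskLabel)> ScoreAsync(string novelClaim)
    {
        // Start all three calls before awaiting any of them so they overlap.
        Task<double> novelty = ScoreNoveltyAsync(novelClaim);
        Task<double> plausibility = ScorePlausibilityAsync(novelClaim);
        Task<string> risk = ValidateClaimAsync(novelClaim);

        // Completes once the slowest of the three has finished.
        await Task.WhenAll(novelty, plausibility, risk);

        // The tasks are already complete here, so these awaits return immediately.
        return (await novelty, await plausibility, await risk);
    }
}
```

Because the tasks are started before the first await, the simulated wall-clock cost is close to the slowest delay (120 ms here) rather than the 250 ms sum.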
Four-axis scoring
Quality is the arithmetic mean, computed in C# (halo-resistant by construction), of novelty, plausibility, and structural coherence. The four axis scores, plus the ClaimValidator risk label, are returned to the API caller alongside the hypothesis and novelClaim.
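The deterministic parts of the scoring are small enough to sketch. The class and method names below are illustrative assumptions, not the project's actual types. The formulas follow the descriptions on this page: novelty as 1 - similarity, quality as the mean of novelty, plausibility, and coherence, and risk as the maximum per-sentence probability.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DiscoveryScoringSketch
{
    // Cosine similarity between two equal-length embedding vectors.
    public static double CosineSimilarity(IReadOnlyList<float> a, IReadOnlyList<float> b)
    {
        if (a.Count != b.Count) throw new ArgumentException("Vectors must have the same length.");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Count; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        // Assumption: a zero vector is treated as having no similarity.
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Novelty is one minus the top-1 similarity against the corpus.
    public static double Novelty(double topSimilarity) => 1.0 - topSimilarity;

    // Quality as the plain arithmetic mean of the three scored axes.
    public static double Quality(double novelty, double plausibility, double coherence)
        => (novelty + plausibility + coherence) / 3.0;

    // Max-aggregation: one risky sentence sets the risk for the whole claim.
    public static double AggregateRisk(IEnumerable<double> sentenceRisks)
        => sentenceRisks.DefaultIfEmpty(0.0).Max();
}
```

As a check against the three-system table, Quality(0.403, 0.505, 0.707) evaluates to about 0.538, which matches the LIBRAIN row.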
Tech stack
Locked stack, chosen up-front, no swaps mid-build. Each pick has a stated rationale documented in the project plan.
.NET 10
ASP.NET Core Minimal APIs
Anthropic Claude
OpenAI text-embedding-3-small
Qdrant
Microsoft Semantic Kernel
PdfPig
Application Insights
xUnit v3
API surface
Seven endpoints across Reader, Synthesis, Discovery, and Baseline tracks: the Phase 1 ingest endpoints, the Phase 2 query endpoint that runs the full embed → search → synthesise → evaluate pipeline, the Phase 3 discovery endpoint for cross-domain hypothesis generation, the Phase 5 baseline endpoints (Naive-RAG and Single-LLM ablations) that feed the companion paper's three-system comparison, and a liveness probe.
POST
/api/papers/ingest
Multipart PDF upload. Streams through the full pipeline (extract → chunk → embed → persist) and returns paperId, chunk count, and a correlation ID for the audit trail.
GET
/api/papers
Lists ingested papers with title, chunk count, and ingestion timestamp. Aggregated from the chunk collection via Qdrant scroll + in-memory dedupe; sub-second at MVP scale.
POST
/api/query
Citation-grounded hypothesis generation. Embeds the query, runs vector search (top-K), synthesises a hypothesis via Claude Sonnet 4.6 with forced tool-use for citations, and evaluates it via Claude Haiku 4.5 on three dimensions (groundedness, relevance, completeness). Returns hypothesis, citations, per-dimension scores, and a correlation ID.
POST
/api/discover
Discovery Mode. Takes one or two topics and proposes a hypothesis that goes beyond the cited sources, returning the speculative portion as a novelClaim field. Each sentence of novelClaim is annotated by an extrapolation_basis contract (generalisation / analogy / pure speculation) and re-scored by a secondary ClaimValidator agent (Haiku, per-sentence GROUNDED / EXTRAPOLATED / RISKY classification). Scored on the 4-axis rubric (novelty, plausibility, structural coherence, quality).
POST
/api/naive-rag
Baseline #1. Retrieval plus a single Claude Sonnet 4.6 call with structured tool-use, but no citation-validation contract. Claimed citations are recorded verbatim and resolved against the retrieved set post-hoc, giving the fabrication-rate measurement surface used in the companion paper's three-system comparison.
POST
/api/single-llm
Baseline #2. Single Claude Sonnet 4.6 call on topic(s) only. No retrieval, no agent decomposition, no tool-use. The lower-bound ablation for the companion paper's three-system comparison.
GET
/health
Liveness probe. Returns 200 with a single status field.
Pipeline walkthrough
1
PDF parsing
PdfPig with the ContentOrderTextExtractor reconstructs glyph order with proper word spacing. The default extractor drops inter-word spaces on column-layout PDFs, an issue that only surfaced on real arXiv papers, not synthetic test input.
Catch: the synthetic tests passed while real PDFs came back with 0/15 sections detected, which revealed the spacing bug. 17/18 sections were detected after the fix.
2
Recursive chunking
Paragraph → sentence → hard-cut fallback. Target 512 tokens (~2,000 chars), max 1,024 tokens, 15% overlap (~75 tokens). Each chunk carries its absolute offset, page number, and best-effort section heading for downstream citation tracking.
Boundary: never split a paragraph if a paragraph break exists in the [target, max] window.
3
Embedding
OpenAI text-embedding-3-small (1,536-dim) with token-aware batching: ≤100 inputs per batch, ≤35K tokens per batch, plus a 1.5-second pacing gap to stay clear of Tier 1's 40K-TPM rolling-window limit. The batch size matches the rate tier the account is actually on.
Catch: an early 250K-token-per-batch limit assumed Tier 2 and rejected 2 of 5 papers on a fresh Tier 1 account.
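A minimal sketch of token-aware batching under the two caps above (100 inputs, 35K tokens). The class name and the four-characters-per-token estimate are assumptions for illustration; a real implementation would presumably count tokens with the model's tokenizer.

```csharp
using System;
using System.Collections.Generic;

public static class EmbeddingBatcherSketch
{
    public const int MaxInputsPerBatch = 100;
    public const int MaxTokensPerBatch = 35_000;

    // Rough heuristic (assumption): about 4 characters per token, rounded up.
    public static int EstimateTokens(string text) => (text.Length + 3) / 4;

    public static List<List<string>> Batch(IEnumerable<string> inputs)
    {
        var batches = new List<List<string>>();
        var current = new List<string>();
        int currentTokens = 0;

        foreach (string input in inputs)
        {
            int tokens = EstimateTokens(input);
            bool wouldOverflow = current.Count >= MaxInputsPerBatch
                || currentTokens + tokens > MaxTokensPerBatch;

            // Close the current batch before adding an input that would break either cap.
            // A single input larger than the token cap still gets a batch of its own.
            if (current.Count > 0 && wouldOverflow)
            {
                batches.Add(current);
                current = new List<string>();
                currentTokens = 0;
            }

            current.Add(input);
            currentTokens += tokens;
        }

        if (current.Count > 0) batches.Add(current);
        return batches;
    }
}
```

With 2,000-character chunks (about 500 estimated tokens each), the token cap binds before the input cap, so batches hold 70 chunks. The caller would then wait the 1.5-second pacing gap between batches, for example with await Task.Delay(TimeSpan.FromSeconds(1.5)).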
4
Vector search
Qdrant cosine similarity with deterministic UUID v5 point IDs (RFC 4122, SHA-1 over a project-private namespace plus paperId-chunkIndex). Re-ingesting a paper upserts cleanly; lazy collection bootstrap behind a Lazy<Task> means the app starts before Qdrant is reachable.
Boundary: top-K results carry full citation metadata, paperId, chunkIndex, page, section, so callers render footnotes without a second query.
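The deterministic point IDs can be sketched as a name-based UUID (version 5). This is an illustration under assumptions, not the project's code: the class and method names are hypothetical and the project's private namespace value is not shown on this page. Only the paperId-chunkIndex name format is taken from the description above. The big-endian Guid overloads used here need .NET 8 or later.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class DeterministicIdSketch
{
    public static Guid CreateV5(Guid namespaceId, string name)
    {
        // RFC 4122 hashes the namespace in network (big-endian) byte order, followed by the name.
        byte[] nsBytes = namespaceId.ToByteArray(bigEndian: true);
        byte[] nameBytes = Encoding.UTF8.GetBytes(name);

        byte[] input = new byte[nsBytes.Length + nameBytes.Length];
        nsBytes.CopyTo(input, 0);
        nameBytes.CopyTo(input, nsBytes.Length);

        byte[] hash = SHA1.HashData(input);

        // Keep the first 16 bytes, then stamp the version (5) and the RFC 4122 variant bits.
        byte[] uuid = new byte[16];
        Array.Copy(hash, uuid, 16);
        uuid[6] = (byte)((uuid[6] & 0x0F) | 0x50);
        uuid[8] = (byte)((uuid[8] & 0x3F) | 0x80);

        return new Guid(uuid, bigEndian: true);
    }

    // Same namespace, paperId, and chunkIndex always yield the same point ID.
    public static Guid ChunkPointId(Guid projectNamespace, string paperId, int chunkIndex)
        => CreateV5(projectNamespace, $"{paperId}-{chunkIndex}");
}
```

Because the ID depends only on its inputs, re-ingesting a paper produces the same point IDs, which is what lets an upsert overwrite existing points instead of duplicating them.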
Discovery Mode in action
Five end-to-end runs, captured on the project's earlier 13-paper corpus, spanning AI, drug discovery, climate, and neuroscience. Each card shows live POST /api/discover output: the generated hypothesis, the speculative portion flagged as novelClaim, citation-resolved supporting evidence from retrieved chunks, and 4-axis Discovery Evaluator scores (Claude Haiku, temperature 0.0).
In-corpus
retrieval-augmented generation ⨯ hypothesis generation in scientific discovery
Hypothesis
RAG systems, by combining updatable non-parametric memory with generative models, provide a natural substrate for automated scientific hypothesis generation. Specifically, the RAG-Token mechanism's ability to draw on different retrieved documents for each output token could enable the synthesis of cross-disciplinary evidence into novel, testable hypotheses that span previously siloed fields. Furthermore, integrating hypothesis quality filters (novelty and feasibility scoring) directly into the RAG retrieval objective, so that the retriever is rewarded for surfacing documents that maximize hypothesis novelty while maintaining feasibility, could yield a self-improving discovery loop that surpasses both purely parametric LLMs and static literature-based discovery systems.
The Discovery novelClaim · exempt from citation validation
What if the retriever learned to surface papers that produce novel hypotheses, not just relevant ones?
integrating hypothesis quality filters (novelty and feasibility scoring) directly into the RAG retrieval objective, so that the retriever is rewarded for surfacing documents that maximize hypothesis novelty while maintaining feasibility, could yield a self-improving discovery loop that surpasses both purely parametric LLMs and static literature-based discovery systems.
A hypothesis seed for investigation, not a verified discovery. The model generated this beyond the retrieved evidence; the chunks below support the framing, not the claim itself.
Supporting evidence 5 citations, 2 papers, all resolved
Cross-domain
retrieval-augmented generation ⨯ protein folding dynamics
Hypothesis
RAG architectures, which combine parametric and non-parametric memory through differentiable retrieval and end-to-end marginalization over latent documents, could be directly applied to protein folding dynamics by indexing structural and biophysical databases as the non-parametric memory, enabling agentic systems to dynamically retrieve and integrate folding pathway evidence at inference time rather than encoding it statically in model weights. Such a RAG-augmented protein folding agent would be capable of discovering length-dependent or context-dependent folding phenomena, analogous to the mechanical crossover in peptide unfolding force uncovered by Sparks, that purely parametric structure-prediction models systematically miss because they cannot update their knowledge base without retraining.
The Discovery novelClaim · exempt from citation validation
What if protein-folding AI could retrieve fresh evidence at inference time, the way RAG retrieves documents?
Such a RAG-augmented protein folding agent would be capable of discovering length-dependent or context-dependent folding phenomena, analogous to the mechanical crossover in peptide unfolding force uncovered by Sparks, that purely parametric structure-prediction models systematically miss because they cannot update their knowledge base without retraining.
A hypothesis seed for investigation, not a verified discovery. The model generated this beyond the retrieved evidence; the chunks below support the framing, not the claim itself.
Supporting evidence 5 citations, 2 papers, all resolved
Cross-domain
retrieval-augmented generation ⨯ de novo molecular design
Hypothesis
RAG architectures, which combine a differentiable retriever over a hot-swappable non-parametric document index with a parametric generative model, can be directly adapted for de novo molecular design by replacing the text document index with a structured chemical knowledge base, enabling a generative chemistry agent to retrieve relevant reaction precedents, binding interaction reports, and molecular fragments at inference time. This retrieval-grounded generation paradigm could allow the molecular generator to marginalize over multiple retrieved chemical contexts (analogous to RAG-Token), producing candidate molecules that are simultaneously novel, synthesizable, and target-aware without requiring full retraining when new chemical knowledge becomes available. Crucially, such a system would self-evolve over successive design cycles by accumulating an experience database of past docking results and interaction reports, allowing the retriever to progressively sharpen its relevance signal toward high-affinity, drug-like chemical space.
The Discovery novelClaim · exempt from citation validation
What if the retriever itself got better every design cycle, learning from past docking results to aim squarely at high-affinity, drug-like chemical space?
Crucially, such a system would self-evolve over successive design cycles by accumulating an experience database of past docking results and interaction reports, allowing the retriever to progressively sharpen its relevance signal toward high-affinity, drug-like chemical space.
A hypothesis seed for investigation, not a verified discovery. The model generated this beyond the retrieved evidence; the chunks below support the framing, not the claim itself.
Supporting evidence 6 citations, 3 papers, all resolved
Cross-domain
weather foundation models ⨯ renewable energy planning
Hypothesis
Weather foundation models like Aurora and GraphCast, trained autoregressively on decades of global reanalysis data, can produce high-resolution multi-day trajectories of wind speed, wave height, and solar irradiance at orders-of-magnitude lower computational cost than NWP systems. These trajectory ensembles could be directly ingested by renewable energy planning pipelines as probabilistic resource atlases, replacing expensive Monte Carlo simulations currently used for site selection and grid balancing. Fine-tuning such models on domain-specific renewable energy variables, such as hub-height wind profiles or photovoltaic irradiance, could unlock a new class of AI-native energy planning tools that co-optimize forecast accuracy and infrastructure investment decisions in a single differentiable framework.
The Discovery novelClaim · exempt from citation validation
What if fine-tuning a weather foundation model on wind and solar variables gave one differentiable tool that co-optimizes forecasts and infrastructure investment?
Fine-tuning such models on domain-specific renewable energy variables, such as hub-height wind profiles or photovoltaic irradiance, could unlock a new class of AI-native energy planning tools that co-optimize forecast accuracy and infrastructure investment decisions in a single differentiable framework.
A hypothesis seed for investigation, not a verified discovery. The model generated this beyond the retrieved evidence; the chunks below support the framing, not the claim itself.
Supporting evidence 5 citations, 3 papers, all resolved
Cross-domain
drug discovery ⨯ climate adaptation
Hypothesis
Autonomous multi-agent AI systems, proven effective at decomposing and executing complex sequential pipelines in drug discovery, could be directly adapted to climate adaptation workflows by treating extreme-weather event prediction, impact assessment, and adaptive intervention design as analogous pipeline stages. Just as PharmAgents uses LLM-driven agents with self-evolving capabilities to iteratively refine drug candidates across target discovery, lead optimization, and preclinical evaluation, a climate-adaptation counterpart could iteratively refine regional adaptation strategies, such as infrastructure hardening or crop-switching recommendations, by incorporating real-time extreme-heat and atmospheric-river forecasts from models like GraphCast as dynamic environmental inputs. This cross-domain transfer would enable the first fully autonomous, interpretable pipeline that closes the loop from probabilistic climate hazard forecasting to actionable, location-specific adaptation policy generation.
The Discovery novelClaim · exempt from citation validation
What if the agentic pipeline built for drug discovery were turned on climate adaptation, refining regional strategies as live extreme-weather forecasts arrive?
A climate-adaptation counterpart could iteratively refine regional adaptation strategies, such as infrastructure hardening or crop-switching recommendations, by incorporating real-time extreme-heat and atmospheric-river forecasts from models like GraphCast as dynamic environmental inputs. This cross-domain transfer would enable the first fully autonomous, interpretable pipeline that closes the loop from probabilistic climate hazard forecasting to actionable, location-specific adaptation policy generation.
A hypothesis seed for investigation, not a verified discovery. Here the per-sentence claim-validator labeled the second sentence RISKY (sentence hallucination probability 0.75, which is also the run's aggregate risk). That is the contract doing its job rather than failing. The chunks below support the framing, not the claim itself.
Supporting evidence 6 citations, 4 papers, all resolved
- arXiv:2503.22164 Gao et al., PharmAgents, chunks 1, 2 direct
- arXiv:2508.14111, From AI for Science to Agentic Science, chunk 33 (Protein Science section) direct
- arXiv:2212.12794 Lam et al., GraphCast, chunks 8, 65 (Atmospheric Rivers section) direct
- arXiv:2405.13063 Bodnar et al., Aurora, chunk 16 (Training methods section) analogous
Inside the chunker
The recursive chunker is the foundation of citation tracking: every chunk's offset and length must be deterministic so the synthesis layer can map a hypothesis citation back to a verifiable source span. The fallback ladder ensures size limits are respected without ever splitting mid-sentence when a paragraph break is reachable.
private static int ChooseEnd(string text, int start)
{
    // Characters left between the chunk start and the end of the text.
    int remaining = text.Length - start;
    if (remaining <= MaxChars)
    {
        // The tail fits in a single chunk.
        return text.Length;
    }

    int targetEnd = start + TargetChars;
    int maxEnd = start + MaxChars;

    // Prefer paragraph boundary in [target, max]
    int paraEnd = LatestMatchIndex(ParagraphBreakRegex(), text, targetEnd, maxEnd);
    if (paraEnd > start) return paraEnd;

    // Fall back to sentence boundary
    int sentEnd = LatestMatchIndex(SentenceBreakRegex(), text, targetEnd, maxEnd);
    if (sentEnd > start) return sentEnd;

    // Hard cut as last resort
    return maxEnd;
}
From LIBRAIN/Reading/RecursiveChunker.cs. Constants are TargetChars = 2000, MaxChars = 4000. Each emitted chunk also rewinds OverlapChars = 300 backward to preserve context across boundaries, which is the 15% overlap target (300 of 2,000 characters).
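The listing calls a LatestMatchIndex helper that is not shown. One plausible shape for it is sketched below, assuming it returns the index just past the last boundary match inside the window and -1 when there is none, which is consistent with the paraEnd > start checks. The two regex patterns are also assumptions, not the project's actual ParagraphBreakRegex and SentenceBreakRegex.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundarySearchSketch
{
    // Hypothetical patterns: a blank line for paragraphs, whitespace after . ! ? for sentences.
    public static readonly Regex ParagraphBreak = new Regex(@"\n\s*\n", RegexOptions.Compiled);
    public static readonly Regex SentenceBreak = new Regex(@"(?<=[.!?])\s+", RegexOptions.Compiled);

    // Returns the index just past the last match that starts at or after `from`
    // and ends at or before `to`, or -1 when no such match exists.
    public static int LatestMatchIndex(Regex pattern, string text, int from, int to)
    {
        int latest = -1;
        int lower = Math.Min(Math.Max(from, 0), text.Length);
        int upper = Math.Min(to, text.Length);

        // Matches come back in ascending order, so the scan can stop at the first overshoot.
        for (Match m = pattern.Match(text, lower); m.Success; m = m.NextMatch())
        {
            int end = m.Index + m.Length;
            if (end > upper) break;
            latest = end;
        }
        return latest;
    }
}
```

Returning the index after the break means the next chunk starts on the first character of the following paragraph or sentence rather than on whitespace.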
Results
Run end-to-end on a 41-paper cross-domain corpus spanning RAG and agentic methods, drug discovery and proteins, weather / climate / energy, agriculture, clinical trials, and cognition / neuroscience.
41 papers
Including Lewis et al. (RAG), Lam et al. (GraphCast), Bodnar et al. (Aurora), Gao et al. (PharmAgents), Binz et al. (Centaur), Asai et al. (Self-RAG)
1,766 chunks
1,766 × 1,536-dim vectors stored in the Qdrant librain_chunks collection
< 200ms
Vector search latency at MVP scale; below Qdrant's 10K-point HNSW threshold so brute-force scan stays fast
Papers in the corpus
The 41-paper corpus is built across two ingestion rounds. The first round established the RAG and agentic-AI foundation that validates the pipeline; the second round broadened reach into drug discovery and proteins, weather / climate / energy, agriculture, clinical trials, and cognition / neuroscience so Discovery Mode has at least two strongly-relevant papers on both sides of every cross-domain topic pair.
Seed corpus (5 papers, 218 chunks)
The foundational RAG paper, two agentic-AI surveys, and two long-form scientific-discovery surveys, chosen for pipeline depth (long surveys produce many chunks) and topical relevance (RAG, agents, retrieval-grounded generation).
2005.11401
Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (NeurIPS 2020). The original RAG paper. 21 chunks.
2503.08979
Gridach et al., Agentic AI for Scientific Discovery: A Survey of Progress, Challenges, and Future Directions (ICLR 2025). 18 chunks.
2504.05496
arXiv 2025, agentic-AI methodology preprint. 12 chunks.
2505.04651
arXiv 2025, long-form survey on RAG and retrieval pipelines. 80 chunks across 60 pages.
2508.14111
From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery. 87 chunks across 84 pages.
Seed-corpus vector search smoke test: query "How does retrieval augmented generation work?" returned all 5 top hits from 2005.11401, distinct chunks from pages 1, 2, 8, and 9, with cosine similarities 0.52-0.61. Semantic recall validated end-to-end against the paper that introduced RAG itself.
Domain expansion (19 papers, 1,133 chunks)
Selected for balanced coverage so every cross-domain topic pair has at least two strongly-relevant papers per side, grouped below by domain.
RAG & agentic methods
2310.11511
Asai et al., Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (ICLR 2024). 32 chunks.
Drug discovery & proteins
2503.22164
Gao et al., PharmAgents: Building a Virtual Pharma with Large Language Model Agents. 20 chunks.
2305.17100
Zhang et al., BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks. 45 chunks.
1801.10193
Öztürk et al., DeepDTA: Deep Drug-Target Binding Affinity Prediction. 15 chunks.
2402.08703
Tang et al., A Survey of Generative AI for de novo Drug Design. 54 chunks.
2503.13522
Zhang et al., Advanced Deep Learning Methods for Protein Structure Prediction and Design. 178 chunks.
Weather, climate & energy
2212.12794
Lam et al., GraphCast: Learning skillful medium-range global weather forecasting (DeepMind, Nature 2023). 72 chunks.
2211.02556
Bi et al., Pangu-Weather: A 3D High-Resolution Model for Fast and Accurate Global Weather Forecast (Huawei). 30 chunks.
2405.13063
Bodnar et al., Aurora: A Foundation Model for the Earth System (Microsoft). 73 chunks.
1906.05433
Rolnick et al., Tackling Climate Change with Machine Learning. 110 chunks.
2302.01236
Vilgalys et al., A Machine Learning Approach to Measuring Climate Adaptation. 34 chunks.
2405.14472
Depoortere et al., SolNet: Open-source deep learning models for photovoltaic power forecasting. 24 chunks.
Agriculture & clinical
2007.10882
de Freitas Cunha et al., Estimating crop yields with remote sensing and deep learning. 11 chunks.
2502.06062
Yewle et al., Multi-modal Data Fusion and Deep Ensemble Learning for Accurate Crop Yield Prediction. 28 chunks.
2506.04293
Liu et al., AutoCT: Automating Interpretable Clinical Trial Prediction with LLM Agents. 24 chunks.
Cognition, neuroscience & education
2410.20268
Binz et al., Centaur: a foundation model of human cognition. 108 chunks across 140+ pages.
2410.07507
Mishra et al., Thought2Text: Text Generation from EEG Signal using Large Language Models. 15 chunks.
1705.06963
Schuman et al., A Survey of Neuromorphic Computing and Neural Networks in Hardware. 223 chunks.
2405.13001
Xu et al., Large Language Models for Education: A Survey. 37 chunks.
Robustness
The citation-validation contract is enforced in C#: every cited chunk in a hypothesis must resolve to the retrieval set, or it is dropped before the response is returned. That makes citation fidelity a property of the architecture rather than the model. To test it, I swept one locked topic pair (weather foundation models × renewable energy planning) along three axes and measured whether the contract held.
topK 3 → 10
Retrieval-budget sweep. Plausibility holds steady (~0.62) and validated-evidence count scales with K; outputs degrade gracefully, never collapse.
Sonnet → Haiku
Swapping the synthesis model to the cheaper Claude Haiku 4.5 (via Models:SynthesisModel) keeps the four-axis profile comparable; quality isn't tied to the strongest model.
0 fabricated
Under an adversarial prompt explicitly demanding 3 out-of-corpus citations, fabricated citations stayed at zero, under both Sonnet 4.6 and Haiku 4.5.
Headline: across retrieval-budget variation, a Sonnet → Haiku model substitution, and adversarial prompting, the citation contract admitted zero fabricated citations in every configuration, a by-construction guarantee, not a model-dependent one. Reproducible via dotnet run --project LIBRAIN.Experiments -- robustness. (Single run per cell on one topic pair; the LLM-judged axes carry run-to-run noise, the fabrication count does not.)
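The reason the fabrication count is deterministic is that the contract is a set-membership check rather than a model judgment. A minimal sketch, with a hypothetical Citation record and method name standing in for the project's types:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical citation shape: a paper ID plus the chunk index within that paper.
public sealed record Citation(string PaperId, int ChunkIndex);

public static class CitationContractSketch
{
    // Splits claimed citations into those that resolve to the retrieval set and those that do not.
    public static (IReadOnlyList<Citation> Kept, IReadOnlyList<Citation> Dropped) Validate(
        IEnumerable<Citation> claimed, IEnumerable<Citation> retrieved)
    {
        // Records compare by value, so the HashSet matches on (PaperId, ChunkIndex).
        var retrievalSet = new HashSet<Citation>(retrieved);
        var kept = new List<Citation>();
        var dropped = new List<Citation>();

        foreach (Citation citation in claimed)
        {
            if (retrievalSet.Contains(citation)) kept.Add(citation);
            else dropped.Add(citation);
        }
        return (kept, dropped);
    }
}
```

Only the Kept list would reach the response. The Dropped list is the same quantity a baseline run counts as fabricated citations, so the check serves as both the guard and the measurement.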
Citation contract: a measured benefit
The contract's value is easy to state but was hard to measure: with a strong model on a clean corpus, an unconstrained baseline rarely fabricates, so the contract looks like a guarantee with nothing to guard. To measure the delta I swept three conditions that actually induce fabrication: free-text (prose) citations, weaker generators, and starved retrieval. The sweep covered 160 cells (gpt-4o-mini and gpt-4o) and counted citations that don't resolve to the retrieval set.
67 vs 0
Contract-free baseline surfaced 67 fabricated citations (14 clean, 53 under starved retrieval); the contracted pipeline leaked 0, caught 100%.
free-text ≳ structured
36 fabrications in prose-citation mode, the mode most deployed RAG systems actually use, vs 31 in structured mode.
cross-family
Demonstrated on GPT-4o-mini and GPT-4o (a different model family than the paper's Claude), so the effect isn't model-specific.
GPT-4o cross-family measurement (fabrication count is a deterministic C# check). Reproducible via dotnet run --project LIBRAIN.Experiments -- fabrication-delta.
Best cross-domain discoveries
The full Discovery pipeline (novelClaim fence + citation contract + per-sentence claim validation + four-axis scoring) was run end-to-end across 30 cross-domain topic pairs and ranked by quality. The strongest, well-grounded bridges rise to the top; loose analogies fall away.
| Discovery (topic × topic) | Quality | novelty / plausibility / coherence |
| graph neural networks × epidemic spread modeling | 0.686 | 0.36 / 0.80 / 0.90 |
| reinforcement learning × de novo molecular design | 0.640 | 0.42 / 0.70 / 0.80 |
| weather foundation models × renewable energy planning | 0.621 | 0.36 / 0.70 / 0.80 |
| deep-learning weather × agricultural yield | 0.603 | 0.31 / 0.70 / 0.80 |
| knowledge-graph embeddings × hypothesis generation | 0.591 | 0.27 / 0.70 / 0.80 |
Top discovery: GNNs can be adapted to improve epidemic-spread modeling by incorporating spatio-temporal data such as human-mobility patterns…, citing gcn + covid-stgnn (cross-domain, 0 dropped by the contract); the speculative sentence is fenced and correctly labeled EXTRAPOLATED (not a hallucination). GPT-4o cross-family run; quality = mean(novelty, plausibility, coherence). Reproducible via ... -- discover-run --provider openai.
Measurement validity
ρ = 0.41
Cosine novelty vs human novelty (pooled Spearman, n=45). Novelty is a reproducible measurement surface, not a control surface.
ρ 0.81 to 1.00
Inter-rater novelty Spearman across two independent three-rater panels (n=30): panel 1 at 0.81 / 0.82 / 0.94, panel 2 at 1.00 / 0.94 / 0.94. Computed in code (Cohen / Fleiss / Krippendorff) with explicit zero-variance handling.
0 / 30
Post-fix hallucination flags for LIBRAIN across n=30 (two panels). The unconstrained Naive-RAG baseline drew 3/30: one output flagged by all three raters for a factual error. This is a qualitative contrast on the same data.
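The rank correlations above are cheap to recompute by hand. The sketch below shows one way to compute Spearman's ρ with average ranks for ties and an explicit guard for zero variance. The class name and the choice to return null when ρ is undefined are assumptions, not the project's implementation.

```csharp
using System;
using System.Linq;

public static class SpearmanSketch
{
    // Average ranks (1-based): tied values share the mean of the positions they occupy.
    public static double[] Rank(double[] values)
    {
        int n = values.Length;
        int[] order = Enumerable.Range(0, n).OrderBy(i => values[i]).ToArray();
        double[] ranks = new double[n];
        int pos = 0;
        while (pos < n)
        {
            int end = pos;
            while (end + 1 < n && values[order[end + 1]] == values[order[pos]]) end++;
            double averageRank = (pos + end) / 2.0 + 1.0;
            for (int k = pos; k <= end; k++) ranks[order[k]] = averageRank;
            pos = end + 1;
        }
        return ranks;
    }

    // Spearman's rho is the Pearson correlation of the ranks; null when it is undefined.
    public static double? Rho(double[] x, double[] y)
    {
        if (x.Length != y.Length) throw new ArgumentException("Inputs must have the same length.");
        if (x.Length < 2) return null;

        double[] rx = Rank(x), ry = Rank(y);
        double mx = rx.Average(), my = ry.Average();
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < rx.Length; i++)
        {
            cov += (rx[i] - mx) * (ry[i] - my);
            vx += (rx[i] - mx) * (rx[i] - mx);
            vy += (ry[i] - my) * (ry[i] - my);
        }

        // Zero variance: a rater who gave every item the same score has no defined correlation.
        if (vx == 0 || vy == 0) return null;
        return cov / Math.Sqrt(vx * vy);
    }
}
```

The zero-variance guard matters on small panels, where one rater giving every output the same score would otherwise produce a division by zero.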
Three-system baseline (pre-registered)
The same ten cross-domain topic pairs run through all three systems, scored on the four-axis rubric by the pre-registered Claude Haiku judge. Retrieval drives the largest plausibility gain (Naive-RAG over Single-LLM); LIBRAIN's distinctive move is novelty under speculation-fencing, about 31% higher than Naive-RAG at a deliberate plausibility cost.
| System (Claude Haiku-judged) | novelty | plausibility | coherence | quality |
| LIBRAIN | 0.403 | 0.505 | 0.707 | 0.538 |
| Naive-RAG | 0.307 | 0.622 | 0.792 | 0.574 |
| Single-LLM | 0.389 | 0.357 | 0.709 | 0.485 |
Read: LIBRAIN leads on novelty (0.403 vs 0.307, +31% over Naive-RAG) by fencing speculation into novelClaim, trading roughly 19% plausibility for it by design. These are the pre-registered numbers; reproducible via dotnet run --project LIBRAIN.Experiments -- baseline. Scored on the Claude Haiku four-axis rubric, the track the cross-judge check below deliberately departs from.
Cross-judge check (honest caveat)
Re-scoring all three systems with an independent GPT-4o judge (instead of the paper's Claude-Haiku judge) shifts the ranking. LIBRAIN reads lower on overall quality here because this judge rewards safe, well-structured output. On this judge LIBRAIN does not lead on novelty either; that lead shows on the Claude-Haiku track and the human ratings, not here:
| System (GPT-4o-judged) | novelty | plausibility | quality |
| LIBRAIN | 0.328 | 0.590 | 0.526 |
| Naive-RAG | 0.308 | 0.650 | 0.576 |
| Single-LLM | 0.353 | 0.690 | 0.614 |
A judge-substitution robustness check, not comparable to the pre-registered Claude-judged results, included for transparency. LIBRAIN's edge is traceability (fenced speculation, validated citations) plus the novelty win on the Claude-Haiku track (Table 7) and the human ratings across two independent three-rater panels (n=30: 4.47 vs 1.80 vs 2.47), which this conservative judge does not reward.
What I learned
Synthetic tests don't replace smoke tests on real artifacts
The chunker's seven xUnit cases passed cleanly on synthetic strings. A 30-line throwaway smoke run on one real arXiv PDF caught the PdfPig word-spacing bug and the section-detection regex's blind spots, issues that would have slipped through silently and degraded retrieval quality across every paper.
Rate-limit handling is honest sizing, not just defensive retries
OpenAI's tier-1 limit is 40K tokens per minute on text-embedding-3-small. The original 250K-token batch, sized for tier 2, worked everywhere except where it mattered. The fix lowered the per-batch cap to 35K tokens and added a 1.5s pacing gap: no retry logic, no backoff trees, just batch sizes that respect the actual constraint.
Lazy provisioning decouples app boot from external infrastructure
Qdrant collection creation behind a Lazy<Task> means the app starts even if Qdrant is offline. The first ingest call awaits a single idempotent CollectionExistsAsync + CreateCollectionAsync race-safely. No explicit init command, no startup migration step, no operator runbook for first-deploy.
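The lazy bootstrap pattern can be sketched without any Qdrant-specific calls. In this illustration the ensureCollection delegate is a hypothetical stand-in for the exists-then-create step, and the class name and run counter exist only to make the behavior observable.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class LazyBootstrapSketch
{
    private readonly Lazy<Task> _bootstrap;
    private int _runs;

    // Number of times the bootstrap delegate has actually been invoked.
    public int Runs => Volatile.Read(ref _runs);

    public LazyBootstrapSketch(Func<Task> ensureCollection)
    {
        // Nothing runs at construction time, so the app can start while the store is offline.
        // ExecutionAndPublication makes concurrent first callers share one factory invocation.
        _bootstrap = new Lazy<Task>(async () =>
        {
            Interlocked.Increment(ref _runs);
            await ensureCollection();
        }, LazyThreadSafetyMode.ExecutionAndPublication);
    }

    public async Task IngestAsync(string paperId)
    {
        // The first caller triggers the bootstrap; later callers await the same cached task.
        await _bootstrap.Value;
        // The upsert of chunks for paperId would follow here.
    }
}
```

One design caveat: Lazy<Task> caches the task even when it faults, so a failed first bootstrap would be replayed to every later caller unless the Lazy instance is replaced.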
Atomic commits with no scope parens read better than scoped ones
Started with feat(reading): add PDF text extraction; ended with feat: add PDF text extraction service via PdfPig. The bare type: form keeps git log --oneline visually uniform on a small repo. Scopes are CI-machine-friendly but recruiter-unfriendly when the column widths don't align.
What's next
Phases 1 through 5 are complete. That covers the full Reader, Synthesis, and Evaluator pipeline; parallel Discovery Mode; a 41-paper cross-domain corpus; the static portfolio showcase you are reading right now; the three-system baseline (Naive-RAG and Single-LLM ablations); Anthropic prompt caching across all agents; parallelised scoring via Task.WhenAll; and the claim-level hallucination mitigation. Two independent three-rater blinded panels (n=30) confirmed zero LIBRAIN hallucinations against a Naive-RAG baseline that drew one all-rater-flagged factual error, and a robustness sweep showed the citation contract holds (zero fabricated citations) under model substitution and adversarial prompting. Phase 6 (interactive frontend, cloud deployment, scaling the human-eval pilot beyond n=30) is deferred and documented in the companion paper.
Project timeline, January to September (legend: Complete, Recent, Deferred / planned)
Phase 1, Data Preparation: 5 papers · 218 chunks
Phase 2, System Development: Synthesis + Evaluator
Phase 3, Discovery Mode: 4-axis judging
Phase 4, Showcase: 41 papers
Phase 5, Hallucination Mitigation: claim guard
Phase 6, Deployment & Performance: Cloud + Frontend
Phase 7, Documentation: arXiv preprint
✓ Phase 1, Reader Agent · May 2026
PDF ingestion, recursive chunking, OpenAI embeddings, Qdrant vector store, REST endpoints. Complete.
✓ Phase 2, Synthesis & Evaluation · May 2026
Synthesis Agent (Claude Sonnet with forced tool use for citation-grounded hypotheses), Evaluator Agent (Claude Haiku, multi-dimensional scoring on groundedness, relevance, and completeness), and POST /api/query orchestrating embed → search → synthesise → evaluate. Complete.
✓ Phase 3, Discovery Mode · May 2026
Parallel pipeline for cross-domain hypothesis generation. Discovery Agent extrapolates beyond cited evidence with the speculative portion flagged as a novelClaim; Discovery Evaluator scores on a 4-axis rubric (novelty, plausibility, structural coherence, quality). Complete. Empirical validation in the companion paper.
✓ Phase 4, Showcase · May 2026
Corpus expansion from 5 to 41 papers across drug discovery and proteins, weather / climate / energy, agriculture, clinical trials, and cognition / neuroscience. Static demonstration set on this portfolio page (5 cross-domain Discovery Mode outputs with full citation traces and 4-axis scores). The companion paper picked up two new sections covering the extended demos in this phase. Complete.
✓ Phase 5, Hallucination Mitigation · May 2026
Four follow-ups that close the empirical gaps the companion paper flagged.
Baselines. Naive-RAG and Single-LLM agents shipped as /api/naive-rag and /api/single-llm, feeding the three-system comparison in the companion paper. Both reuse the same Discovery Evaluator and NoveltyScorer, so cross-system scoring isolates pipeline structure from evaluator implementation. Reproduction run: 0 of 44 claimed citations fabricated by Naive-RAG under Sonnet 4.6, confirming the citation-validation contract behaves as a structural guarantee against a strong baseline.
Prompt caching. PromptCacheType.AutomaticToolsAndSystem wired on all seven LLM-backed agents. After the first call within a 5-minute TTL, roughly 80% of the system and tool prefix tokens hit the prompt cache. Cache-read and cache-create token counts are now part of every audit log line.
Parallel scoring. NoveltyScorer, ClaimValidator, and Discovery Evaluator now run via Task.WhenAll. The post-synthesis stage is bounded by the slowest single Haiku call instead of three sequential round-trips.
Claim-level validation. A new extrapolation_basis tool-schema field forces the Discovery Agent to label every sentence of novelClaim as generalisation, analogy, or pure speculation. A secondary ClaimValidatorAgent (Haiku, T=0.0) re-classifies each sentence as GROUNDED, EXTRAPOLATED, or RISKY against the retrieved chunks. This addresses the companion paper's finding that 3 of 5 LIBRAIN outputs were rater-flagged for factually-framed speculation inside novelClaim.
Rater re-scoring result (two independent three-rater panels over 10 pre-registered pairs, 30 outputs total, Latin-square blinding):
- Hallucination flags (rater 1, BEFORE → AFTER): 3 / 5 → 0 / 5 (target ≤ 1 / 5, PASS)
- Hallucination flags (n=30, two panels): LIBRAIN 0 / 30 and Single-LLM 0 / 30; the unconstrained Naive-RAG baseline drew 3 / 30, all from a single output that all three raters flagged for a factual error
- Novelty pairwise Spearman ρ: panel 1 at 0.81 / 0.82 / 0.94, panel 2 at 1.00 / 0.94 / 0.94 (n=30 across two panels)
- Pooled novelty (n=30): LIBRAIN-with-fix 4.47 vs Naive-RAG 1.80 vs Single-LLM 2.47
The original pilot (panel 1, n=15) was extended with a second independent three-rater panel on five further pre-registered pairs (panel 2, n=15), all using the full hypothesis scope. The zero-hallucination result for LIBRAIN held across all 30 outputs, while the Naive-RAG baseline produced one factually wrong output that every rater flagged. LIBRAIN's novelty advantage persisted. All pre-registered gate criteria pass.
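The parallel scoring step listed under Phase 5 can be sketched as below. This is an illustration under stated assumptions, not the project's code: DiscoveryScores and ParallelScoring are hypothetical names, and the three delegates stand in for the NoveltyScorer, Discovery Evaluator, and ClaimValidator calls, whose real signatures are not shown on this page.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical result shape; the real pipeline aggregates to a
// four-axis score.
public sealed record DiscoveryScores(
    double Novelty, string Evaluation, string ClaimVerdict);

public static class ParallelScoring
{
    public static async Task<DiscoveryScores> ScoreAsync(
        string hypothesis,
        Func<string, Task<double>> scoreNovelty,
        Func<string, Task<string>> evaluate,
        Func<string, Task<string>> validateClaims)
    {
        // Start all three calls before awaiting any of them, so the
        // round-trips overlap instead of running back to back.
        var novelty = scoreNovelty(hypothesis);
        var evaluation = evaluate(hypothesis);
        var claims = validateClaims(hypothesis);

        // Completes when the slowest of the three has finished.
        await Task.WhenAll(novelty, evaluation, claims);

        // The tasks are already complete here, so these awaits only
        // unwrap the results.
        return new DiscoveryScores(
            await novelty, await evaluation, await claims);
    }
}
```

The key detail is that each delegate is invoked before the first await. Awaiting each call inline would serialise the three requests again.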
Phase 6, Deployment & Performance · deferred
Planned scope: a minimal Next.js or Blazor frontend on Azure Static Web Apps for interactive demonstration, the API on Azure Container Apps, response streaming for sub-5-second p95 latency, a larger human-evaluation pilot beyond n=30, and a blog post on .NET RAG patterns. Two independent three-rater panels already confirm the result, so further scaling would mainly tighten the confidence intervals. Deferred. The static demos above provide the read-only showcase a reviewer needs without operating the system.
Get in touch
Open to senior .NET / AI-engineering roles globally: remote, on-site in Turkey, or relocating to the EU with visa sponsorship.