LIBRAIN

Multi-Agent RAG System for Scientific Discovery

Open-source, built in .NET 10. A Reader → Synthesis → Evaluator pipeline plus a parallel Discovery Mode that fences speculative content into a structured novelClaim field. The claim-level validation contract eliminated a 3-of-5 hallucination signal under three-rater blinded replication on a 13-paper cross-domain corpus (AI, drug discovery, climate, neuroscience). Companion paper · May 2026.

github.com/erennmutlu1/librain

13 arXiv papers ingested

610 chunks indexed in Qdrant

< 200ms vector search latency

Why this exists

Most RAG and AI-agent tutorials are written in Python. Production engineering organizations that run on .NET (banks, insurers, government) hear that "AI requires Python", and that message weakens the case for adopting LLM-driven features in their existing stack. LIBRAIN is a deliberate counter-example: a complete multi-agent retrieval pipeline built end-to-end in .NET 10 on Microsoft-blessed primitives (Semantic Kernel, Application Insights), Anthropic Claude, OpenAI embeddings, and a managed vector store.

Architecture

Two pipelines share a Reader Agent and a Qdrant vector store. The Phase 2 query path runs Synthesis, then the Evaluator, on retrieved chunks. The Phase 3 Discovery path runs the Discovery Agent, fans its output out to three concurrent scorers (NoveltyScorer, Discovery Evaluator, ClaimValidator) under a single Task.WhenAll dispatch, then aggregates to a four-axis score. Every step emits a structured event with a correlation ID, so any output can be traced from the query back to its source chunks.

Reader

Ingests PDFs, chunks recursively, embeds with OpenAI text-embedding-3-small, and persists to Qdrant with citation metadata and deterministic UUIDv5 IDs.

→

Synthesis

Claude Sonnet 4.6 via structured tool use. Generates citation-grounded hypotheses from the chunks returned by vector search (Phase 2 query path).

→

Evaluator

Claude Haiku 4.5, T=0.0. LLM-as-a-Judge scoring groundedness, relevance, and completeness; aggregate quality computed in C# to resist halo effects.

↘ parallel branch from Reader

→

Discovery

Claude Sonnet 4.6 with an inverted prompt that invites extrapolation beyond cited evidence. The speculative portion is returned as a flagged novelClaim field, exempt from citation validation by contract.

Task.WhenAll parallel dispatch

Discovery output (hypothesis + supportingEvidence + novelClaim) is dispatched to the three concurrent scorers below. Wall-clock cost is bounded by the slowest single call, not the sum.
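A minimal sketch of this dispatch pattern, not LIBRAIN's actual code. The three scorer functions are hypothetical stand-ins that only wait and return fixed values. The point is that awaiting Task.WhenAll makes elapsed time track the slowest call (about 300 ms here) rather than the 600 ms sum.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical stand-ins for the three scorers: each waits, then returns a fixed value.
async Task<double> ScoreNoveltyAsync(string claim) { await Task.Delay(100); return 0.40; }
async Task<double> JudgePlausibilityAsync(string claim) { await Task.Delay(300); return 0.42; }
async Task<string> ValidateClaimAsync(string claim) { await Task.Delay(200); return "EXTRAPOLATED"; }

async Task<(double Novelty, double Plausibility, string Risk, TimeSpan Elapsed)> ScoreInParallelAsync(string claim)
{
    var clock = Stopwatch.StartNew();

    // Start all three calls before awaiting any of them.
    Task<double> novelty = ScoreNoveltyAsync(claim);
    Task<double> plausibility = JudgePlausibilityAsync(claim);
    Task<string> risk = ValidateClaimAsync(claim);

    // Completes when the slowest of the three has finished.
    await Task.WhenAll(novelty, plausibility, risk);

    return (await novelty, await plausibility, await risk, clock.Elapsed);
}

var result = await ScoreInParallelAsync("example novelClaim");
Console.WriteLine($"{result.Risk} after {result.Elapsed.TotalMilliseconds:F0} ms");
```

If any scorer throws, Task.WhenAll surfaces the failure when awaited, so a real orchestrator would need to decide whether one failed scorer fails the whole request.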

NoveltyScorer

Deterministic. Embeds novelClaim with text-embedding-3-small, runs top-1 cosine search against the 13-paper Qdrant corpus, returns 1 - similarity as novelty.
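The novelty formula can be sketched in a few lines. This is an illustration under stated assumptions: the vectors are tiny toy values rather than 1,536-dim embeddings, the search is an in-memory scan rather than a Qdrant query, and the function names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length) throw new ArgumentException("Vectors must have equal length.");
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += (double)a[i] * b[i];
        normA += (double)a[i] * a[i];
        normB += (double)b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Novelty = 1 - similarity to the single closest corpus vector (top-1).
double Novelty(float[] claimVector, IReadOnlyList<float[]> corpusVectors)
{
    double top1 = corpusVectors.Max(chunk => CosineSimilarity(claimVector, chunk));
    return 1.0 - top1;
}

var corpus = new List<float[]> { new[] { 0f, 1f }, new[] { 1f, 1f } };
var claim = new[] { 1f, 0f };
double novelty = Novelty(claim, corpus);
Console.WriteLine(novelty.ToString("F4", CultureInfo.InvariantCulture)); // about 0.29
```

A claim identical to an indexed chunk scores 0, and a claim orthogonal to everything in the corpus scores 1.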

Discovery Evaluator

Claude Haiku 4.5, T=0.0. LLM-judged plausibility (does novelClaim follow from cited evidence) and structural coherence (well-formed testable hypothesis).

ClaimValidator

Claude Haiku 4.5, T=0.0. Per-sentence labels over novelClaim in {GROUNDED, EXTRAPOLATED, RISKY} with hallucination probability; max-aggregate risk in C# (halo-resistant). Added in Phase 3.A.5.
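A sketch of what max-aggregation over per-sentence labels could look like. The tuple shape, the function names, and the severity ordering GROUNDED < EXTRAPOLATED < RISKY are assumptions made for illustration; the source only states that the maximum is taken in C# rather than by the model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumed severity ordering; not confirmed by the source.
int Severity(string label) => label switch
{
    "GROUNDED" => 0,
    "EXTRAPOLATED" => 1,
    "RISKY" => 2,
    _ => throw new ArgumentException($"Unknown label: {label}")
};

// The overall verdict is the worst sentence: highest-severity label, highest probability.
(string Label, double Probability) AggregateRisk(
    IReadOnlyList<(string Sentence, string Label, double HallucinationProbability)> sentences)
{
    if (sentences.Count == 0) throw new ArgumentException("At least one sentence is required.");
    string worstLabel = sentences.MaxBy(s => Severity(s.Label)).Label;
    double maxProbability = sentences.Max(s => s.HallucinationProbability);
    return (worstLabel, maxProbability);
}

var sentences = new List<(string Sentence, string Label, double HallucinationProbability)>
{
    ("First sentence.", "GROUNDED", 0.05),
    ("Second sentence.", "EXTRAPOLATED", 0.30),
    ("Third sentence.", "GROUNDED", 0.10),
};
var overall = AggregateRisk(sentences);
Console.WriteLine($"{overall.Label} {overall.Probability}");
```

Taking the maximum means a single risky sentence cannot be averaged away by several grounded ones.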

Four-axis scoring

The four axes are novelty, plausibility, structural coherence, and quality. Aggregation is an arithmetic mean computed in C# (halo-resistant by construction). The scores, plus the ClaimValidator risk label, are returned to the API caller alongside the hypothesis and novelClaim.
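As a worked check, the scores on the five demo cards further down this page are consistent with quality being the mean of the other three axes, rounded to two decimals. The sketch below treats that as an assumption inferred from the published numbers, not as a statement of the actual implementation.

```csharp
using System;
using System.Globalization;

// Assumption: quality = mean(novelty, plausibility, coherence), rounded to 2 decimals.
double Quality(double novelty, double plausibility, double coherence) =>
    Math.Round((novelty + plausibility + coherence) / 3.0, 2);

// (novelty, plausibility, coherence) from the five demo cards on this page.
var cards = new[]
{
    (0.40, 0.42, 0.68),
    (0.48, 0.42, 0.68),
    (0.49, 0.68, 0.82),
    (0.51, 0.72, 0.78),
    (0.41, 0.62, 0.75),
};

foreach (var (novelty, plausibility, coherence) in cards)
{
    double quality = Quality(novelty, plausibility, coherence);
    Console.WriteLine(quality.ToString("F2", CultureInfo.InvariantCulture));
}
// prints 0.50, 0.53, 0.66, 0.67, 0.59 (one per line)
```

Because the aggregate is computed outside the model, the judge never sees or adjusts the combined number.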

Tech stack

Locked stack — chosen up-front, no swaps mid-build. Each pick has a stated rationale documented in the project plan.

.NET 10 · ASP.NET Core Minimal APIs · Anthropic Claude · OpenAI text-embedding-3-small · Qdrant · Microsoft Semantic Kernel · PdfPig · Application Insights · xUnit v3

API surface

Seven endpoints across the Reader, Synthesis, Discovery, and Baseline tracks: the Phase 1 ingest endpoints, the Phase 2 query endpoint that runs the full embed → search → synthesise → evaluate pipeline, the Phase 3 discovery endpoint for cross-domain hypothesis generation, the Phase 5 baseline endpoints (Naive-RAG and Single-LLM ablations) that feed the companion paper's three-system comparison, and a liveness probe.

POST /api/papers/ingest

Multipart PDF upload. Streams through the full pipeline (extract → chunk → embed → persist) and returns paperId, chunk count, and a correlation ID for the audit trail.

GET /api/papers

Lists ingested papers with title, chunk count, and ingestion timestamp. Aggregated from the chunk collection via Qdrant scroll + in-memory dedupe; sub-second at MVP scale.

POST /api/query

Citation-grounded hypothesis generation. Embeds the query, runs vector search (top-K), synthesises a hypothesis via Claude Sonnet 4.6 with forced tool-use for citations, and evaluates it via Claude Haiku 4.5 on three dimensions (groundedness, relevance, completeness). Returns hypothesis, citations, per-dimension scores, and a correlation ID.

POST /api/discover

Discovery Mode. Takes one or two topics and proposes a hypothesis that goes beyond the cited sources, returning the speculative portion as a novelClaim field. Each sentence of novelClaim is annotated by an extrapolation_basis contract (generalisation / analogy / pure speculation) and re-scored by a secondary ClaimValidator agent (Haiku, per-sentence GROUNDED / EXTRAPOLATED / RISKY classification). Scored on the 4-axis rubric (novelty, plausibility, structural coherence, quality).

POST /api/naive-rag

Baseline #1. Retrieval plus a single Claude Sonnet 4.6 call with structured tool-use, but no citation-validation contract. Claimed citations are recorded verbatim and resolved against the retrieved set post-hoc, giving the fabrication-rate measurement surface used in the companion paper's three-system comparison.

POST /api/single-llm

Baseline #2. Single Claude Sonnet 4.6 call on topic(s) only. No retrieval, no agent decomposition, no tool-use. The lower-bound ablation for the companion paper's three-system comparison.

GET /health

Liveness probe. Returns 200 with a single status field.

Pipeline walkthrough

1. PDF parsing

PdfPig with the ContentOrderTextExtractor reconstructs glyph order with proper word spacing. The default extractor drops inter-word spaces on column-layout PDFs, an issue that only surfaced on real arXiv papers, not synthetic test input.

Catch: the synthetic tests passed, but real PDFs came back with 0/15 sections detected, which exposed the spacing bug. 17/18 sections were detected after the fix.

2. Recursive chunking

Paragraph → sentence → hard-cut fallback. Target 512 tokens (~2,000 chars), max 1,024 tokens, 15% overlap (~75 tokens). Each chunk carries its absolute offset, page number, and best-effort section heading for downstream citation tracking.

Boundary: never split a paragraph if a paragraph break exists in the [target, max] window.

3. Embedding

OpenAI text-embedding-3-small (1,536-dim) with token-aware batching: ≤100 inputs per batch, ≤35K tokens, plus a 1.5-second pacing gap to stay clear of Tier 1's 40K-TPM rolling-window limit. Batch sizes are matched to the rate tier the account is actually on.

Catch: an early 250K-token-per-batch limit assumed Tier 2 and rejected 2 of 5 papers on a fresh Tier 1 account.
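The two batch caps can be sketched as a small planner. This is an illustrative sketch, not the project's batching code: PlanBatches is a hypothetical name, token counts are assumed to be known up front, and the uniform 512-token chunks are made up for the example. The 1.5-second pacing gap would sit between sending consecutive batches and is only noted in a comment.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Groups chunk indices into batches that respect both caps from the text above.
List<List<int>> PlanBatches(IReadOnlyList<int> tokenCounts, int maxInputs = 100, int maxTokens = 35_000)
{
    var batches = new List<List<int>>();
    var current = new List<int>();
    int currentTokens = 0;

    for (int i = 0; i < tokenCounts.Count; i++)
    {
        int tokens = tokenCounts[i];
        if (tokens > maxTokens)
            throw new ArgumentException($"Chunk {i} alone exceeds the per-batch token cap.");

        // Close the current batch if adding this chunk would break either cap.
        bool wouldOverflow = current.Count == maxInputs || currentTokens + tokens > maxTokens;
        if (wouldOverflow && current.Count > 0)
        {
            batches.Add(current);
            current = new List<int>();
            currentTokens = 0;
        }

        current.Add(i);
        currentTokens += tokens;
    }

    if (current.Count > 0) batches.Add(current);
    return batches;
}

// Illustrative input: 610 chunks of exactly 512 tokens each.
var plan = PlanBatches(Enumerable.Repeat(512, 610).ToList());
Console.WriteLine($"{plan.Count} batches, first has {plan[0].Count} chunks");
// A sender would await Task.Delay(TimeSpan.FromSeconds(1.5)) between batches.
```

With uniform 512-token chunks the token cap binds first: 68 chunks come to 34,816 tokens, so the 100-input cap is never reached.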

4. Vector search

Qdrant cosine similarity with deterministic UUID v5 point IDs (RFC 4122, SHA-1 over a project-private namespace plus paperId-chunkIndex). Re-ingesting a paper upserts cleanly; lazy collection bootstrap behind a Lazy<Task> means the app starts before Qdrant is reachable.

Boundary: top-K results carry full citation metadata — paperId, chunkIndex, page, section — so callers render footnotes without a second query.
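.NET has no built-in UUID v5 factory, so deterministic point IDs need a small helper. The sketch below follows the RFC 4122 name-based algorithm and relies on the big-endian Guid overloads available from .NET 8. The project's namespace is private, so the standard DNS namespace stands in for it here, and the "2005.11401-10" name format is a hypothetical rendering of paperId-chunkIndex.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

Guid CreateUuidV5(Guid namespaceId, string name)
{
    // RFC 4122 hashes the namespace in network (big-endian) byte order, followed by the name.
    byte[] ns = namespaceId.ToByteArray(bigEndian: true);
    byte[] nameBytes = Encoding.UTF8.GetBytes(name);
    byte[] input = new byte[ns.Length + nameBytes.Length];
    ns.CopyTo(input, 0);
    nameBytes.CopyTo(input, ns.Length);

    byte[] hash = SHA1.HashData(input);   // 20 bytes; only the first 16 are used
    byte[] uuid = hash[..16];

    uuid[6] = (byte)((uuid[6] & 0x0F) | 0x50);   // version 5
    uuid[8] = (byte)((uuid[8] & 0x3F) | 0x80);   // RFC 4122 variant

    return new Guid(uuid, bigEndian: true);
}

// Stand-in namespace: the RFC 4122 DNS namespace, NOT the project's private one.
Guid dnsNamespace = Guid.Parse("6ba7b810-9dad-11d1-80b4-00c04fd430c8");

Guid first = CreateUuidV5(dnsNamespace, "2005.11401-10");
Guid second = CreateUuidV5(dnsNamespace, "2005.11401-10");
Console.WriteLine(first == second); // True: same input, same ID, so re-ingest upserts
```

Because the ID depends only on the namespace and the name, re-ingesting a paper overwrites the same points instead of duplicating them.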

Discovery Mode in action

Five end-to-end runs against an expanded 13-paper corpus spanning AI, drug discovery, climate, and neuroscience. Each card shows live POST /api/discover output: the generated hypothesis, the speculative portion flagged as novelClaim, citation-resolved supporting evidence from retrieved chunks, and 4-axis Discovery Evaluator scores (Claude Haiku, temperature 0.0).

In-corpus

retrieval-augmented generation ⨯ hypothesis generation in scientific discovery

Hypothesis

RAG systems, by combining updatable non-parametric memory with generative models, provide a natural substrate for automated scientific hypothesis generation. Specifically, the RAG-Token mechanism's ability to draw on different retrieved documents for each output token could enable the synthesis of cross-disciplinary evidence into novel, testable hypotheses that span previously siloed fields. Furthermore, integrating hypothesis quality filters (novelty and feasibility scoring) directly into the RAG retrieval objective, so that the retriever is rewarded for surfacing documents that maximize hypothesis novelty while maintaining feasibility, could yield a self-improving discovery loop that surpasses both purely parametric LLMs and static literature-based discovery systems.

The Discovery novelClaim · exempt from citation validation

What if the retriever learned to surface papers that produce novel hypotheses, not just relevant ones?

integrating hypothesis quality filters (novelty and feasibility scoring) directly into the RAG retrieval objective, so that the retriever is rewarded for surfacing documents that maximize hypothesis novelty while maintaining feasibility, could yield a self-improving discovery loop that surpasses both purely parametric LLMs and static literature-based discovery systems.

A hypothesis seed for investigation, not a verified discovery. The model generated this beyond the retrieved evidence; the chunks below support the framing, not the claim itself.

Supporting evidence: 5 citations, 2 papers, all resolved

arXiv:2005.11401 Lewis et al. — RAG — chunks 2, 10 direct
arXiv:2505.04651 — chunks 0, 14, 23 direct

Novelty 0.40 · Plausibility 0.42 · Coherence 0.68 · Quality 0.50

Cross-domain

retrieval-augmented generation ⨯ protein folding dynamics

Hypothesis

RAG architectures, which combine parametric and non-parametric memory through differentiable retrieval and end-to-end marginalization over latent documents, could be directly applied to protein folding dynamics by indexing structural and biophysical databases as the non-parametric memory, enabling agentic systems to dynamically retrieve and integrate folding pathway evidence at inference time rather than encoding it statically in model weights. Such a RAG-augmented protein folding agent would be capable of discovering length-dependent or context-dependent folding phenomena — analogous to the mechanical crossover in peptide unfolding force uncovered by Sparks — that purely parametric structure-prediction models systematically miss because they cannot update their knowledge base without retraining.

The Discovery novelClaim · exempt from citation validation

What if protein-folding AI could retrieve fresh evidence at inference time, the way RAG retrieves documents?

Such a RAG-augmented protein folding agent would be capable of discovering length-dependent or context-dependent folding phenomena — analogous to the mechanical crossover in peptide unfolding force uncovered by Sparks — that purely parametric structure-prediction models systematically miss because they cannot update their knowledge base without retraining.

A hypothesis seed for investigation, not a verified discovery. The model generated this beyond the retrieved evidence; the chunks below support the framing, not the claim itself.

Supporting evidence: 5 citations, 2 papers, all resolved

arXiv:2005.11401 Lewis et al. — RAG — chunks 0, 2, 10 direct
arXiv:2508.14111 — chunk 33 (Protein Science section) direct; chunk 32 (Genomics section) analogous

Novelty 0.48 · Plausibility 0.42 · Coherence 0.68 · Quality 0.53

Cross-domain

retrieval-augmented generation ⨯ de novo molecular design

Hypothesis

RAG architectures, which dynamically retrieve and marginalize over external non-parametric knowledge to reduce hallucination in language generation, could be directly adapted for de novo molecular design by replacing the text document index with a structured chemical knowledge base, enabling a generative chemistry agent to retrieve relevant molecular scaffolds, reaction precedents, and binding interaction reports at each design step, thereby grounding novel molecule proposals in experimentally validated chemical knowledge. This RAG-augmented molecular design loop would allow the retrieval index to be hot-swapped as new experimental results accumulate, enabling continuous self-improvement of the design agent without retraining the generative model. Such a system would outperform purely parametric molecular generators by reducing chemically implausible hallucinations in the same way RAG reduces factual hallucinations in NLP.

The Discovery novelClaim · exempt from citation validation

What if RAG's hallucination-reduction trick (grounding outputs in retrieved evidence) also stopped chemically implausible molecules from being generated?

Such a system would outperform purely parametric molecular generators by reducing chemically implausible hallucinations in the same way RAG reduces factual hallucinations in NLP.

A hypothesis seed for investigation, not a verified discovery. The model generated this beyond the retrieved evidence; the chunks below support the framing, not the claim itself.

Supporting evidence: 6 citations, 3 papers, all resolved

arXiv:2005.11401 Lewis et al. — RAG — chunks 0, 10 direct; chunk 2 analogous
arXiv:2503.22164 Gao et al. — PharmAgents — chunks 6, 7 direct
arXiv:2508.14111 — From AI for Science to Agentic Science — chunk 38 direct

Novelty 0.49 · Plausibility 0.68 · Coherence 0.82 · Quality 0.66

Cross-domain

weather foundation models ⨯ renewable energy planning

Hypothesis

Weather foundation models like Aurora and GraphCast, which already demonstrate fine-tunable skill across diverse Earth system variables (wind, waves, air quality) at orders-of-magnitude lower computational cost than NWP, could be directly integrated into renewable energy planning pipelines as probabilistic scenario engines that generate spatially and temporally coherent multi-week wind and solar resource trajectories, simultaneously accounting for correlated atmospheric variables (e.g., wind speed, temperature, cloud cover) across entire grid regions. Such integration would enable grid operators to optimize long-term renewable capacity placement and dispatch schedules under physically consistent uncertainty ensembles, a capability that current energy planning tools — which rely on independent, lower-fidelity meteorological inputs — cannot provide.

The Discovery novelClaim · exempt from citation validation

What if global weather AI could double as a probabilistic scenario engine for sizing the next decade of wind and solar infrastructure?

Such integration would enable grid operators to optimize long-term renewable capacity placement and dispatch schedules under physically consistent uncertainty ensembles, a capability that current energy planning tools — which rely on independent, lower-fidelity meteorological inputs — cannot provide.

A hypothesis seed for investigation, not a verified discovery. The model generated this beyond the retrieved evidence; the chunks below support the framing, not the claim itself.

Supporting evidence: 4 citations, 2 papers, all resolved

arXiv:2405.13063 Bodnar et al. — Aurora — chunks 0, 5 direct
arXiv:2212.12794 Lam et al. — GraphCast — chunks 16, 19 direct

Novelty 0.51 · Plausibility 0.72 · Coherence 0.78 · Quality 0.67

Cross-domain

drug discovery ⨯ climate adaptation

Hypothesis

Autonomous multi-agent AI systems, already demonstrated to accelerate the full drug discovery pipeline from target identification to preclinical evaluation, could be directly repurposed for climate adaptation drug discovery by integrating high-resolution extreme weather forecasts — such as those produced by GraphCast — as dynamic environmental inputs that steer target selection and molecule optimization toward climate-sensitive diseases. Specifically, the self-evolving, experience-accumulating architecture of such systems would allow them to continuously update therapeutic priorities as AI weather models detect shifting disease-relevant climate patterns (e.g., expanding atmospheric river corridors or heat-wave frequency), effectively closing the loop between climate prediction and pharmaceutical response. This integration could enable a proactive, climate-aware drug discovery paradigm in which the epidemiological target landscape is updated in near-real-time from weather model outputs rather than relying on static historical disease burden data.

The Discovery novelClaim · exempt from citation validation

What if drug discovery agents could read climate forecasts to anticipate emerging disease patterns before they reach the clinic?

This integration could enable a proactive, climate-aware drug discovery paradigm in which the epidemiological target landscape is updated in near-real-time from weather model outputs rather than relying on static historical disease burden data.

A hypothesis seed for investigation, not a verified discovery. The model generated this beyond the retrieved evidence; the chunks below support the framing, not the claim itself.

Supporting evidence: 6 citations, 4 papers, all resolved

arXiv:2503.22164 Gao et al. — PharmAgents — chunks 1, 2 direct
arXiv:2508.14111 — From AI for Science to Agentic Science — chunk 33 (Protein Science section) direct
arXiv:2212.12794 Lam et al. — GraphCast — chunks 8, 65 (Atmospheric Rivers section) direct
arXiv:2405.13063 Bodnar et al. — Aurora — chunk 16 (Training methods section) analogous

Novelty 0.41 · Plausibility 0.62 · Coherence 0.75 · Quality 0.59

A cross-run consistency study (N=5 per pair) confirms plausibility varies meaningfully across runs (std 0.14 in-corpus / 0.12 cross-domain), a genuine signal rather than a single-shot anchor. Structural coherence sits closer to a well-formedness baseline. Full empirical breakdown in the companion paper.

Inside the chunker

The recursive chunker is the foundation of citation tracking — every chunk's offset and length must be deterministic so the synthesis layer can map a hypothesis citation back to a verifiable source span. The fallback ladder ensures size limits are respected without ever splitting mid-sentence when a paragraph break is reachable.

private static int ChooseEnd(string text, int start)
{
    int remaining = text.Length - start;
    if (remaining <= MaxChars)
    {
        return text.Length;
    }

    int targetEnd = start + TargetChars;
    int maxEnd = start + MaxChars;

    // Prefer paragraph boundary in [target, max]
    int paraEnd = LatestMatchIndex(ParagraphBreakRegex(), text, targetEnd, maxEnd);
    if (paraEnd > start) return paraEnd;

    // Fall back to sentence boundary
    int sentEnd = LatestMatchIndex(SentenceBreakRegex(), text, targetEnd, maxEnd);
    if (sentEnd > start) return sentEnd;

    // Hard cut as last resort
    return maxEnd;
}

From LIBRAIN/Reading/RecursiveChunker.cs. Constants are TargetChars = 2000, MaxChars = 4000. Each emitted chunk also rewinds OverlapChars = 300 characters to preserve context across boundaries, which matches the 15% overlap target (300 of 2,000 target characters).
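The excerpt calls a LatestMatchIndex helper that is not shown. One plausible shape for it is sketched below, purely as an assumption: it returns the position just after the last regex match that ends inside the window, or -1 when there is none, which is compatible with the `> start` checks in the excerpt. The paragraph-break pattern is also a guess.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical reconstruction: position just after the last match ending in [from, to], else -1.
int LatestMatchIndex(Regex regex, string text, int from, int to)
{
    int latest = -1;
    Match match = regex.Match(text, from);
    while (match.Success && match.Index + match.Length <= to)
    {
        latest = match.Index + match.Length;
        match = match.NextMatch();
    }
    return latest;
}

// Assumed pattern for a paragraph break: a blank line, possibly containing whitespace.
var paragraphBreak = new Regex(@"\n\s*\n");
string sample = "aaaa\n\nbbbb\n\ncccc";

Console.WriteLine(LatestMatchIndex(paragraphBreak, sample, 0, 16)); // 12: after the second break
Console.WriteLine(LatestMatchIndex(paragraphBreak, sample, 0, 8));  // 6: only the first break fits
Console.WriteLine(LatestMatchIndex(paragraphBreak, sample, 7, 9));  // -1: no break in the window
```

Returning the latest match rather than the first keeps chunks as close to the maximum size as the boundary rules allow.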

Results

Smoke-tested end-to-end on 13 arXiv papers covering RAG, agentic AI surveys, drug discovery, climate forecasting, and computational neuroscience.

13 papers

Including Lewis et al. (RAG), Lam et al. (GraphCast), Bodnar et al. (Aurora), Gao et al. (PharmAgents), Binz et al. (Centaur)

610 chunks

610 × 1,536-dim vectors stored in the Qdrant librain_chunks collection

< 200ms

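Vector search latency at MVP scale. 610 vectors (about 3.7 MB of float32 data) sit below Qdrant's default HNSW full-scan threshold, which is configured as 10,000 KB of vector data rather than as a point count, so a brute-force scan stays fast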

Papers in the corpus

The 13-paper corpus is built across two ingestion rounds. The first round established the RAG and agentic-AI foundation that validates the pipeline; the second round broadened reach into drug discovery, climate forecasting, and computational neuroscience so Discovery Mode can produce cross-domain hypotheses.

Seed corpus (5 papers, 218 chunks)

The foundational RAG paper, two agentic-AI surveys, and two long-form scientific-discovery surveys — chosen for pipeline depth (long surveys produce many chunks) and topical relevance (RAG, agents, retrieval-grounded generation).

2005.11401 Lewis et al. — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (NeurIPS 2020). The original RAG paper. 21 chunks.

2503.08979 Gridach et al. — Agentic AI for Scientific Discovery: A Survey of Progress, Challenges, and Future Directions (ICLR 2025). 18 chunks.

2504.05496 arXiv 2025 — agentic-AI methodology preprint. 12 chunks.

2505.04651 arXiv 2025 — long-form survey on RAG and retrieval pipelines. 80 chunks across 60 pages.

2508.14111 From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery. 87 chunks across 84 pages.

Seed-corpus vector search smoke test: the query "How does retrieval augmented generation work?" returned all 5 top hits from 2005.11401, as distinct chunks from pages 1, 2, 8, and 9 with cosine similarities of 0.52–0.61. Semantic recall validated end-to-end against the paper that introduced RAG itself.

Domain expansion (8 papers, 392 chunks)

Drug discovery (3 papers), climate & weather (3 papers), and computational neuroscience (2 papers). Selected by popularity and topical recognition — well-known systems (GraphCast, Pangu-Weather, Aurora, ChemCrow, Centaur) plus comprehensive recent surveys.

2304.05376 Bran et al. — ChemCrow: Augmenting large-language models with chemistry tools. Drug discovery agent. 27 chunks.

2503.02104 Liu et al. — Foundation Model in Biomedicine. Biomedical foundation model survey. 39 chunks.

2503.22164 Gao et al. — PharmAgents: Building a Virtual Pharma with Large Language Model Agents. Multi-agent drug pipeline. 20 chunks.

2212.12794 Lam et al. — GraphCast: Learning skillful medium-range global weather forecasting (DeepMind, Nature 2023). 72 chunks.

2211.02556 Bi et al. — Pangu-Weather: A 3D High-Resolution Model for Fast and Accurate Global Weather Forecast (Huawei). 30 chunks.

2405.13063 Bodnar et al. — Aurora: A Foundation Model for the Earth System (Microsoft). 73 chunks.

2410.20268 Binz et al. — Centaur: a foundation model of human cognition. 108 chunks across 140+ pages.

2506.06353 Babu & Mathew — Large Language Models for EEG: A Comprehensive Survey and Taxonomy. 23 chunks.

What I learned

Synthetic tests don't replace smoke tests on real artifacts

The chunker's seven xUnit cases passed cleanly on synthetic strings. A 30-line throwaway smoke run on one real arXiv PDF caught the PdfPig word-spacing bug and the section-detection regex's blind spots — issues that would have slipped through silently and degraded retrieval quality across every paper.

Rate-limit handling is honest sizing, not just defensive retries

OpenAI's Tier 1 limit is 40K tokens per minute on text-embedding-3-small. The original 250K-token batch, sized for Tier 2, worked everywhere except where it mattered. The fix was to lower the per-batch cap to 35K and add proportional 1.5s pacing: no retry logic, no backoff trees, just batch sizes that respect the actual constraint.

Lazy provisioning decouples app boot from external infrastructure

Qdrant collection creation behind a Lazy<Task> means the app starts even if Qdrant is offline. The first ingest call awaits a single idempotent, race-safe CollectionExistsAsync + CreateCollectionAsync sequence. No explicit init command, no startup migration step, no operator runbook for first-deploy.
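A minimal sketch of the Lazy<Task> pattern, with a counter and a delay standing in for the real Qdrant calls. EnsureCollectionAsync and IngestAsync are hypothetical names. Three concurrent callers share one initialization because Lazy<T> runs its factory at most once under its default thread-safety mode.

```csharp
using System;
using System.Threading.Tasks;

int bootstrapRuns = 0;

// Stand-in for the CollectionExistsAsync + CreateCollectionAsync sequence.
async Task EnsureCollectionAsync()
{
    bootstrapRuns++;
    await Task.Delay(50);
}

// Nothing runs at construction time, so app startup never touches the vector store.
var bootstrap = new Lazy<Task>(() => EnsureCollectionAsync());

async Task IngestAsync(string paperId)
{
    await bootstrap.Value;   // the first caller triggers the work; later callers await the same Task
    // ... extract, chunk, embed, and upsert would follow here
}

await Task.WhenAll(IngestAsync("a"), IngestAsync("b"), IngestAsync("c"));
Console.WriteLine(bootstrapRuns); // 1
```

One caveat of this simple form: if the first attempt fails, the faulted Task stays cached, so production code may want to reset the Lazy or retry inside the factory.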

Atomic commits with no scope parens read better than scoped ones

Started with feat(reading): add PDF text extraction; ended with feat: add PDF text extraction service via PdfPig. The bare type: form keeps git log --oneline visually uniform on a small repo. Scopes are CI-machine-friendly but recruiter-unfriendly when the column widths don't align.

What's next

Phases 1 through 5 are complete: full Reader, Synthesis, and Evaluator pipeline, parallel Discovery Mode, a 13-paper corpus spanning AI, drug discovery, climate, and neuroscience, the static portfolio showcase you are reading right now, plus the three-system baseline (Naive-RAG and Single-LLM ablations), Anthropic prompt caching across all agents, parallelised scoring via Task.WhenAll, the claim-level hallucination mitigation, and a three-rater blinded replication of the AFTER-FIX pilot that confirms zero hallucinations across all fifteen outputs. Phase 6 (interactive frontend, cloud deployment, scaling the human-eval pilot beyond n = 15) is deferred and documented in the companion paper.

Timeline (Jan to Sep)

Phase 1: Data Preparation (5 papers · 218 chunks)
Phase 2: System Development (Synthesis + Evaluator)
Phase 3: Discovery Mode (4-axis judging)
Phase 4: Showcase (13 papers)
Phase 5: Hallucination Mitigation (claim guard)
Phase 6: Deployment & Performance (Cloud + Frontend)
Phase 7: Documentation (arXiv preprint)

Legend: Complete · Recent · Deferred / planned

✓

Phase 1 — Reader Agent May 2026

PDF ingestion, recursive chunking, OpenAI embeddings, Qdrant vector store, REST endpoints. Complete.

✓

Phase 2 — Synthesis & Evaluation May 2026

Synthesis Agent (Claude Sonnet with forced tool-use for citation-grounded hypotheses), Evaluator Agent (Claude Haiku, multi-dim scoring on groundedness/relevance/completeness), POST /api/query orchestrating embed → search → synthesise → evaluate. Complete.

✓

Phase 3 — Discovery Mode May 2026

Parallel pipeline for cross-domain hypothesis generation. Discovery Agent extrapolates beyond cited evidence with the speculative portion flagged as a novelClaim; Discovery Evaluator scores on a 4-axis rubric (novelty, plausibility, structural coherence, quality). Complete. Empirical validation in the companion paper.

✓

Phase 4 — Showcase May 2026

Corpus expansion from 5 to 13 papers across drug discovery, climate forecasting, and computational neuroscience. Static demonstration set on this portfolio page (5 cross-domain Discovery Mode outputs with full citation traces and 4-axis scores). The companion paper picked up two new sections covering the extended demos in this phase. Complete.

✓

Phase 5 — Hallucination Mitigation May 2026

Four follow-ups that close the empirical gaps the companion paper flagged.

Baselines. Naive-RAG and Single-LLM agents shipped as /api/naive-rag and /api/single-llm, feeding the three-system comparison in the companion paper. Both reuse the same Discovery Evaluator and NoveltyScorer, so cross-system scoring isolates pipeline structure from evaluator implementation. Reproduction run: 0 of 44 claimed citations fabricated by Naive-RAG under Sonnet 4.6, confirming the citation-validation contract behaves as a structural guarantee against a strong baseline.

Prompt caching. PromptCacheType.AutomaticToolsAndSystem wired on all seven LLM-backed agents. After the first call within a 5-minute TTL, roughly 80% of the system and tool prefix tokens hit the prompt cache. Cache-read and cache-create token counts are now part of every audit log line.

Parallel scoring. NoveltyScorer, ClaimValidator, and Discovery Evaluator now run via Task.WhenAll. The post-synthesis stage is bounded by the slowest single Haiku call instead of three sequential round-trips.

Claim-level validation. A new extrapolation_basis tool-schema field forces the Discovery Agent to label every sentence of novelClaim as generalisation, analogy, or pure speculation. A secondary ClaimValidatorAgent (Haiku, T=0.0) re-classifies each sentence as GROUNDED, EXTRAPOLATED, or RISKY against the retrieved chunks. This addresses the companion paper's finding that 3 of 5 LIBRAIN outputs were rater-flagged for factually-framed speculation inside novelClaim.

Rater re-scoring result (5 pairs × 3 systems = 15 outputs, Latin-square blinding, three independent raters):

Hallucination flags (rater 1, BEFORE → AFTER): 3 / 5 → 0 / 5 (target ≤ 1 / 5, PASS)
Hallucination flags (three-rater replication): 0 of 15 from every rater, 15 / 15 pairwise agreement on the binary flag
Novelty pairwise Spearman ρ: R1–R2 = 0.805, R1–R3 = 0.820, R2–R3 = 0.937, all 95% CIs exclude zero at n = 15
Three-rater pooled novelty: LIBRAIN-with-fix 4.53 vs Naive-RAG 1.60 vs Single-LLM 2.13

Two additional independent raters re-scored the same fifteen outputs without seeing rater 1's scores and confirmed the zero-hallucination result on every output. The novelty advantage persists across all three raters and across both rubric scopes (rater 1 used novelClaim only as specified in the AFTER-FIX rubric; raters 2 and 3 used the full hypothesis). All pre-registered gate criteria pass.
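For readers who want to reproduce the agreement statistic, Spearman's ρ is Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below uses made-up ratings for illustration; they are not the study data, and the function names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// 1-based ranks; tied values receive the average of the positions they span.
double[] AverageRanks(IReadOnlyList<double> values)
{
    int n = values.Count;
    int[] order = Enumerable.Range(0, n).OrderBy(i => values[i]).ToArray();
    var ranks = new double[n];
    int pos = 0;
    while (pos < n)
    {
        int end = pos;
        while (end + 1 < n && values[order[end + 1]] == values[order[pos]]) end++;
        double average = (pos + end) / 2.0 + 1.0;
        for (int k = pos; k <= end; k++) ranks[order[k]] = average;
        pos = end + 1;
    }
    return ranks;
}

double Pearson(double[] x, double[] y)
{
    double meanX = x.Average(), meanY = y.Average();
    double cov = 0, varX = 0, varY = 0;
    for (int i = 0; i < x.Length; i++)
    {
        double dx = x[i] - meanX, dy = y[i] - meanY;
        cov += dx * dy;
        varX += dx * dx;
        varY += dy * dy;
    }
    return cov / Math.Sqrt(varX * varY);
}

double SpearmanRho(IReadOnlyList<double> a, IReadOnlyList<double> b)
{
    if (a.Count != b.Count) throw new ArgumentException("Both raters must score the same items.");
    return Pearson(AverageRanks(a), AverageRanks(b));
}

// Made-up ratings from two hypothetical raters over five outputs.
var raterA = new double[] { 1, 2, 3, 4, 5 };
var raterB = new double[] { 5, 6, 7, 8, 7 };
double rho = SpearmanRho(raterA, raterB);
Console.WriteLine(rho.ToString("F3", System.Globalization.CultureInfo.InvariantCulture)); // 0.821
```

Rank-based correlation only asks whether raters order the outputs the same way, which is why it tolerates raters using different parts of the scale.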

6

Phase 6 — Deployment & Performance deferred

Minimal Next.js or Blazor frontend on Azure Static Web Apps for interactive demonstration, API on Azure Container Apps, response streaming for sub-5-second p95 latency, scaling the human-evaluation pilot beyond fifteen outputs (the three-rater replication confirms the result but the rho confidence intervals stay wide at n = 15), and a blog post on .NET RAG patterns. Deferred. The static demos above provide the read-only showcase a reviewer needs without operating the system.

Get in touch

Open to senior .NET / AI-engineering roles globally — remote, on-site in Turkey, or relocating to the EU with visa sponsorship.

View on GitHub: github.com/erennmutlu1/librain · erennmutlu@outlook.com