From RAG to Search Agents: Three Tensions in Retrieval, Synthetic Data, and Evaluation

Author: NeatGuyCoding


Once large-model products make “search the web” a default capability, engineering teams tend to hit the same three tensions: do users want a short answer or a long report? Should the training budget go to LLM tokens or to search API calls? Should evaluation optimize for correctness or for trajectory efficiency? This article is organized around a working definition of a search agent (a model that uses search and browse tools to answer multi-hop questions on the open web), public benchmarks such as BrowseComp-Plus, synthetic-data routes such as WebShaper, and the engineering judgments that guest Nandan Thakur (BEIR / MIRACL co-author) gave in a podcast interview. We do not force a single conclusion; any figure that does not appear in a paper or on a leaderboard is labeled speaker view or unverified.


Problem space: three axes, not one “smarter RAG”
#

Classic RAG is often “retrieve once → stuff into the prompt → generate.” Agentic search turns retrieval into a multi-turn, branchable process: the model decides when to search, what to search for, and whether to keep reading documents. In the public literature, BrowseComp-Plus operationalizes Deep-Research agents as LLM + search tools, measured on a fixed corpus by Accuracy and Search Calls.

At the same time, industry pushes a deep research product narrative—users often expect report-style output, arbitrary tool orchestration, and longer test-time compute. The guest’s split is: a search agent is anchored on “answer one question”; deep research is a wider umbrella (speaker view). Both product lines can share the same retrieval stack, but optimization targets and harnesses are not the same.



Search agent vs. deep research: naming sets product expectations
#

Why
#

Naming sets expectations for the same “model that can search the web.” Marketed as deep research, it primes users for long documents, many tools, and downloadable reports; marketed as a search agent, it sits closer to web QA and the BrowseComp family (speaker view).

Mechanisms and constraints
#

BrowseComp-Plus evaluates on roughly 100K fixed documents for reproducible retrieval stacks (BM25, DiskANN dense indexes, etc.)—not directly comparable numerically to commercial agents on live, full-web APIs.

How to (mental model)
#

Decompose the harness into plan → search → read → answer; first clarify whether the deliverable is a span-level short answer or a nugget-level long answer (the latter is closer to TREC RAG nugget / support evaluation).

Common pitfalls
#

Judging BrowseComp-style short-answer tasks with report-product UX, or holding multi-turn agents to a single-shot RAG latency SLA.

Split-screen interview: left background shows Weaviate podcast branding; right side is guest Nandan Thakur.



Evaluation lineage: from BEIR to BrowseComp+
#

Why
#

BEIR helped the community iterate retrievers across 18 zero-shot datasets; MIRACL brought heterogeneous languages under the same qrel mindset. Search agents need a similar “community orbits the benchmark” pull, but IR metrics and agent metrics are not the same random process (speaker view).

Mechanisms and constraints
#

| Layer | Typical metrics | Stability (literature / common sense) |
| --- | --- | --- |
| Retriever | NDCG@k, MRR | Fixed qrels; reruns have tiny variance |
| Agent end-to-end | Accuracy, Search Calls | Same model, multiple rollouts: high variance |
| Long-report RAG | Nugget recall, Support 0–2 | Different family from BrowseComp Accuracy |
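To make the retriever row concrete, here is a minimal sketch of NDCG@k and reciprocal rank for one query. It assumes linear gain and a log2 discount (some tools use exponential gain instead), and the qrels and ranking are made-up illustration values. Given fixed qrels and a fixed ranking, the result is fully deterministic, which is why reruns show almost no variance.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], qrels: Dict[str, int], k: int) -> float:
    """NDCG@k for one query: DCG of the system ranking over DCG of the ideal ranking."""
    gains = [float(qrels.get(doc_id, 0)) for doc_id in ranked_ids]
    ideal = sorted((float(g) for g in qrels.values()), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranked_ids: List[str], qrels: Dict[str, int]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if qrels.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


# Hypothetical judgments and system ranking for a single query.
qrels = {"d1": 2, "d4": 1}
ranking = ["d3", "d1", "d4", "d2"]

ndcg = ndcg_at_k(ranking, qrels, k=3)
rr = reciprocal_rank(ranking, qrels)
print(f"NDCG@3={ndcg:.4f}  RR={rr:.2f}")
```

MRR is the mean of `reciprocal_rank` over all queries; the per-query form is shown to keep the sketch short.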

The BrowseComp-Plus paper reports GPT-5 + BM25 at 55.9%, rising to about 70.1% with Qwen3-Embedding-8B; a footnote on the project page gives GPT-5 + Google Search API at about 59.9%. The guest’s oral claim that “top models are at 90–95%” does not match the August 2025 public table. It may refer to another benchmark, a private leaderboard, or an unverified BrowseComp (non-Plus) setup; this article lists both figures and does not substitute the interview number for the leaderboard.

The guest describes BrowseComp+ sub-clauses with a filter / funnel mental model: many candidates are narrowed step by step as each constraint is applied (speaker view). On the paper side there are human-verified supporting docs, mined negatives, and a GPT-4o sub-query decomposition pipeline. The two are directionally aligned, but the guest’s order of magnitude (10³–10⁵ candidates narrowed to single digits) does not appear verbatim in the paper.

How to
#

Reproduce BrowseComp-Plus with Tevatron on the fixed corpus before claiming that swapping the retriever raises Accuracy; trajectories from commercial APIs need a separate private eval.

Common pitfalls
#

Equating NDCG gains on BEIR with Accuracy gains on BrowseComp; Search-R1 + BM25 at only ~3.86% on BrowseComp-Plus shows “keyword-style queries are enough” has not held on that hard benchmark.

PhD thesis title slide fragment: RETRIEVAL AND RETRIEVAL-AUGMENTED GENERATION ON HETEROGENEOUS DOMAINS AND LANGUAGES; author Nandan Thakur, advisor Prof. Jimmy Lin, University of Waterloo Faculty of Mathematics.



Synthetic data: Orbit, WebShaper, and “chained” multi-hop
#

Why
#

Search-R1 and similar work train on NQ, HotpotQA, etc., where task difficulty misaligns with the BrowseComp family; many dataset papers do not release training data (speaker view). Synthetic pipelines try to build hard cases with an intersection pattern: seed → retrieve → extract facts → hide → retrieve again.

Mechanisms and constraints
#

  • WebShaper (verifiable): formalization-driven, Knowledge Projections + agentic Expander; dataset WebShaperQA.
  • Orbit (unverified): guest describes ~20k BrowseComp-style four- to five-layer riddles, DeepSeek generation + self-verification + external search-agent verification, months of continuous runs on a consumer laptop (speaker view). Public search found no paper/repo under the name Orbit; treat as project oral history until published.
  • Difficulty contrast (speaker view): BrowseComp+ leans filter (many sub-clause candidates, funnel step by step); Orbit is closer to A→B→C→D chaining, self-assessed as not yet reaching Plus-level filter strength.

How to (minimal pipeline)
#

seed_entity → web_search → extract_facts → mask_entity → repeat → QA_pair

Quality gate: validate solvability at each hop with independent retrieval results, not LLM self-consistency alone.
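The pipeline and its quality gate can be sketched as follows. This is a toy under stated assumptions, not the Orbit or WebShaper implementation: `build_chain`, `Hop`, the three callables, and the tiny in-memory corpora are all hypothetical stand-ins for a real search API, a fact extractor, and a second, independent index. The answer to the generated item is the seed entity, and each clue masks the entity it should lead back to.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MASK = "[MASKED]"


@dataclass
class Hop:
    clue: str           # fact sentence with the current entity masked out
    hidden_entity: str  # entity the clue should lead a solver back to


def build_chain(
    seed_entity: str,
    search: Callable[[str], List[str]],
    extract_fact: Callable[[str, List[str]], Optional[Tuple[str, str]]],
    independent_search: Callable[[str], List[str]],
    max_hops: int = 3,
) -> Optional[Tuple[List[Hop], str]]:
    """seed_entity -> web_search -> extract_facts -> mask_entity -> repeat -> QA pair."""
    hops: List[Hop] = []
    entity = seed_entity
    for _ in range(max_hops):
        found = extract_fact(entity, search(entity))
        if found is None:
            break
        fact, next_entity = found
        clue = fact.replace(entity, MASK)
        # Quality gate: the clue alone, sent to an independent retriever,
        # must surface a document that names the hidden entity.
        probe = clue.replace(MASK, "").strip()
        if not any(entity in doc for doc in independent_search(probe)):
            break
        hops.append(Hop(clue, entity))
        entity = next_entity
    return (hops, seed_entity) if hops else None


# Toy data: PRIMARY is the generation index, MIRROR the independent one.
PRIMARY = [
    "Acme Labs was founded by Dana Reyes.",
    "Dana Reyes studied at Northfield College.",
]
MIRROR = ["Founded by Dana Reyes, Acme Labs builds sensors."]
LINKS = {"Acme Labs": "Dana Reyes", "Dana Reyes": "Northfield College"}


def toy_search(query: str) -> List[str]:
    return [doc for doc in PRIMARY if query in doc]


def toy_extract_fact(entity: str, docs: List[str]) -> Optional[Tuple[str, str]]:
    next_entity = LINKS.get(entity)
    if next_entity is None:
        return None
    for doc in docs:
        if entity in doc and next_entity in doc:
            return doc, next_entity
    return None


def toy_independent_search(query: str) -> List[str]:
    terms = set(re.findall(r"\w+", query.lower()))
    return [d for d in MIRROR if len(terms & set(re.findall(r"\w+", d.lower()))) >= 2]


result = build_chain("Acme Labs", toy_search, toy_extract_fact, toy_independent_search)
print(result)
```

In this toy run the second hop is dropped because the independent index has no document that supports it, which is the behavior the gate is meant to enforce.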

Common pitfalls
#

Synthetic items that let the model guess the answer from a single clue (the guest cites Emoji Movie–style shortcuts, speaker view); the ideal agent should verify the constraints one by one, todo-list style.
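One way to screen for single-clue shortcuts is to record, for each clue, the set of candidates it admits, then reject items where any one clue already pins down the answer. The sketch below assumes such candidate sets are available (for example from retrieval per clue); the function names and the toy candidate sets are hypothetical.

```python
from typing import Dict, List, Set


def funnel(clue_candidates: Dict[str, Set[str]]) -> Set[str]:
    """Intersect candidate sets clue by clue, as in the filter / funnel pattern."""
    sets = list(clue_candidates.values())
    if not sets:
        return set()
    remaining = set(sets[0])
    for candidates in sets[1:]:
        remaining &= candidates
    return remaining


def shortcut_clues(clue_candidates: Dict[str, Set[str]], answer: str) -> List[str]:
    """Clues whose candidate set is exactly {answer}, i.e. guessable from one clue."""
    return [clue for clue, cands in clue_candidates.items() if cands == {answer}]


def is_acceptable(clue_candidates: Dict[str, Set[str]], answer: str) -> bool:
    """Keep an item only if all clues together isolate the answer and no single clue does."""
    return funnel(clue_candidates) == {answer} and not shortcut_clues(clue_candidates, answer)


# Toy items with placeholder candidate names.
good_item = {"clue_1": {"a", "b", "c"}, "clue_2": {"b", "c", "d"}, "clue_3": {"c", "e"}}
bad_item = {"clue_1": {"a", "b", "c"}, "clue_2": {"c"}}

print(is_acceptable(good_item, "c"), shortcut_clues(bad_item, "c"))
```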

Split screen around the 17-minute mark: while discussing the Orbit synthesis formula and examples, Weaviate podcast branding remains on the back wall.



Training economics: GRPO, rollouts, and “search costs more than the LLM”
#

Why
#

Search-R1 uses outcome-based reward and retrieved-token masking, reporting roughly +41% / +20% over naive RAG on seven QA sets for Qwen2.5-7B/3B (paper tables). The next competitive axis is often stated as: same Accuracy with fewer tokens and search calls (speaker view).

Mechanisms and constraints
#

  • GRPO (DeepSeekMath): sample G outputs per problem, group-normalized reward for advantage; G is a hyperparameter—the paper does not fix it at 8. Training where “one of several trajectories succeeds still yields positive advantage” is mechanistically compatible with the guest’s “signal if 1 of 8 is right,” but do not conflate training group size G with inference pass@k under one symbol.
  • API billing (speaker view): if 6 turns × 8 rollouts × 1 search/turn, back-of-envelope ~48 searches / training sample; BrowseComp-Plus reports strong models averaging >20 searches per question. When search is billed per call and the LLM per token, search bill can exceed LLM in agent training—opposite the intuition that “tokens only get cheaper.”
  • Sequential long horizon vs. parallel: pass@K works at inference; cross-rollout credit in RL for training is too expensive—common path is strong teacher rollouts → SFT distillation (speaker view).
  • Context rot: Chroma’s research report shows performance falls as input grows under controlled difficulty—tension with “eval keeps full trajectories” vs. “engineering compresses context.”
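The group-normalized advantage in the GRPO bullet can be illustrated with a few lines of arithmetic. This sketch assumes outcome rewards of 0 or 1, a population standard deviation, and a small `eps` for numerical safety; real implementations differ in these details, and G = 8 is used only because it matches the guest's example.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Advantage of each rollout relative to its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One success among G = 8 rollouts for the same question.
one_success = group_advantages([1.0] + [0.0] * 7)

# All eight rollouts fail: every advantage is zero, so there is no learning signal.
all_fail = group_advantages([0.0] * 8)

print([round(a, 3) for a in one_success])
print(all_fail)
```

The single successful rollout gets a positive advantage and the seven failures get small negative ones, which is the sense in which "1 of 8 is right" still produces signal.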

How to
#

Start with the PPO / GRPO / REINFORCE switches in Search-R1’s codebase on a small corpus, then scale G and the number of turns; log search_calls and tokens as separate cost columns.
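A minimal sketch of such a two-column ledger follows. `CostLedger` and `searches_per_sample` are hypothetical names, and the prices and per-turn token count are placeholders chosen only to show the arithmetic; which bill dominates depends entirely on the real prices you plug in.

```python
from dataclasses import dataclass


def searches_per_sample(turns: int, rollouts: int, searches_per_turn: int = 1) -> int:
    """Back-of-envelope search calls for one training sample."""
    return turns * rollouts * searches_per_turn


@dataclass
class CostLedger:
    search_calls: int = 0
    llm_tokens: int = 0

    def log_turn(self, search_calls: int, llm_tokens: int) -> None:
        """Accumulate the two cost drivers separately, never as one blended number."""
        self.search_calls += search_calls
        self.llm_tokens += llm_tokens

    def bill(self, usd_per_search: float, usd_per_1k_tokens: float) -> dict:
        return {
            "search_usd": self.search_calls * usd_per_search,
            "llm_usd": self.llm_tokens / 1000 * usd_per_1k_tokens,
        }


ledger = CostLedger()
for _rollout in range(8):          # G = 8 rollouts
    for _turn in range(6):         # 6 turns per rollout
        ledger.log_turn(search_calls=1, llm_tokens=1_000)  # token count is made up

# Placeholder prices, not quotes from any provider.
bill = ledger.bill(usd_per_search=0.005, usd_per_1k_tokens=0.0005)
print(ledger.search_calls, ledger.llm_tokens, bill)
```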

Common pitfalls
#

Writing GRPO’s G as pass@8; counting repeated identical queries in BrowseComp+ trajectories as useful exploration (guest criticizes as waste, speaker view).
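To keep the two notions apart, it can help to compute pass@k explicitly. The sketch below uses the standard unbiased estimator from the code-generation evaluation literature (not from the interview): given n sampled attempts of which c are correct, it estimates the chance that at least one of k draws is correct. It is an inference-time metric; GRPO's G is only the number of rollouts sampled per question during training.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1. math.comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Same situation as "1 of 8 is right", read as an inference metric.
for k in (1, 4, 8):
    print(k, pass_at_k(n=8, c=1, k=k))
```

With one correct attempt out of eight, pass@8 is 1.0 while pass@1 is 0.125; neither number says anything about how a training group of size G = 8 weights its rollouts.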

Around 20 minutes: natural light on the guest’s window, Weaviate podcast bookshelf on the left—segment on training cost and rollouts.

Around 40 minutes: guest facing camera, wood shelves and window light; same segment touches harness and retrieval interfaces.


Retrieval interface: snippets, full text, and self-hosted stacks
#

Why
#

Commercial search APIs often return snippet + offset, like classic reader spans; self-hosted BM25 / ANN can start at billion-scale corpora, but agents sometimes need snippet-level reproducibility or a full-text tool (speaker view).

Mechanisms and constraints
#

The guest favors a two-stage pattern: search → pick doc → document tool pulls full text; end-to-end full-text RL was not done much in training because APIs are slow/expensive (speaker view). Academic alternatives include FineWeb / ClueWeb indexes + internal search API; BrowseComp-Plus with Tevatron gives reproducible BM25/dense retrieval. The proper name “Deep Research Gym” had no independent project page in public URL checks (verified 2026-05); in writing, point to the Tevatron/BrowseComp-Plus ecosystem or mark unverified.

Guest hypothesis (no experimental numbers): after SFT/RL, the agent might only need to emit BM25 keywords for a lexical retriever—not yet supported by public results on BrowseComp-Plus; keep as a research hypothesis.
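To see what "emit BM25 keywords" would ask of the agent, here is a minimal pure-Python BM25 scorer. It is a sketch, not Tevatron's or any production implementation: it assumes the common Okapi form with a Lucene-style IDF, default-like parameters k1 = 1.2 and b = 0.75, a naive regex tokenizer, and three made-up documents. The point is that only exact term overlap scores, so the wording of the agent's query matters directly.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


class BM25:
    def __init__(self, docs: Dict[str, str], k1: float = 1.2, b: float = 0.75):
        if not docs:
            raise ValueError("docs must not be empty")
        self.k1, self.b = k1, b
        self.doc_tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
        self.n_docs = len(docs)
        self.avgdl = sum(len(t) for t in self.doc_tokens.values()) / self.n_docs
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for toks in self.doc_tokens.values() for term in set(toks))

    def idf(self, term: str) -> float:
        n = self.df.get(term, 0)
        return math.log(1 + (self.n_docs - n + 0.5) / (n + 0.5))

    def score(self, query: str, doc_id: str) -> float:
        toks = self.doc_tokens[doc_id]
        tf = Counter(toks)
        norm = self.k1 * (1 - self.b + self.b * len(toks) / self.avgdl)
        return sum(
            self.idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in tokenize(query)
            if t in tf
        )

    def search(self, query: str, top_k: int = 10) -> List[Tuple[str, float]]:
        scored = [(doc_id, self.score(query, doc_id)) for doc_id in self.doc_tokens]
        scored = [(doc_id, s) for doc_id, s in scored if s > 0]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Made-up documents for illustration.
docs = {
    "d1": "BM25 is a lexical ranking function",
    "d2": "Dense retrievers embed queries and documents",
    "d3": "Agents issue keyword queries to a lexical retriever",
}
bm25 = BM25(docs)
hits = bm25.search("lexical ranking")
print(hits)
```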

How to
#

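# Concept harness (not production code). llm, retriever, corpus and
# need_full_text are caller-supplied stand-ins, not a real library API.
def run_search_agent(llm, retriever, corpus, need_full_text, state, max_turns):
    for turn in range(max_turns):
        q = llm.plan_query(state)
        hits = retriever.search(q, top_k=10)      # each hit: snippet + doc_id
        doc = None                                # reset so a stale doc is never reused
        if hits and need_full_text(hits):
            doc = corpus.fetch_full(hits[0].doc_id)  # second stage: pull full text
        state = llm.update(state, hits, doc)
    return llm.finalize(state)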

Common pitfalls
#

Replacing multi-turn search with a single vector top-k. Also easy to miss: the REPLUG-era “retrieval vs. LM division of labor” reappears in agents as summarization or external-folder memory (e.g., Claude Code–style), not only as the Databricks-style “compress and retrieve, trained end-to-end together” approach (host paraphrase; official benchmark page not verified as of 2026-05).


Split screen around 16 minutes: Weaviate podcast branding and shelves on the left, guest on the right—near synthetic data and verification topics.


Harness: memory, compression, and the eval loop
#

Why
#

Multi-turn search makes context length a second bottleneck; themes from RAG a few years ago—prepend retrieved docs (REPLUG), chunk trimming, nugget report evaluation—show up again in agent harnesses (speaker view).

Mechanisms and constraints
#

  • Modular: search first, full-document tool on demand; coexists with “retrieval agent + compressor trained end-to-end”; guest personally leans modular (speaker view).
  • TREC RAG (verifiable): 2024 guide includes Nugget, Support, Fluency, Retrieval; Support 0–2 scores grounding. Guest judgment that “fluency weight should drop, grounding matters more” is a participant view.
  • Tool-use bar: ~six months ago SFT+RL was still needed for stable tool use; now mid/post-training has distilled it into large models—the race shifts to efficiency and Pareto (speaker view, time anchor is oral around recording).
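As one illustration of the modular option, the sketch below shortens the working notes of older turns while always keeping each turn's query and doc_ids, so the citation trail stays auditable. Everything here is hypothetical: `Turn` and `compress` are illustrative names, and plain truncation stands in for whatever summarizer or compressor a real harness would use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    query: str
    doc_ids: List[str]  # citations to keep for audit
    notes: str          # model's working notes or text it read this turn


def compress(turns: List[Turn], keep_last: int = 2, max_note_chars: int = 80) -> List[Turn]:
    """Shorten notes of older turns; never drop queries or doc_ids."""
    cutoff = max(len(turns) - keep_last, 0)
    compressed: List[Turn] = []
    for index, turn in enumerate(turns):
        if index < cutoff and len(turn.notes) > max_note_chars:
            short = turn.notes[:max_note_chars] + " [truncated]"
            compressed.append(Turn(turn.query, list(turn.doc_ids), short))
        else:
            compressed.append(turn)
    return compressed


# Toy trajectory with long notes on every turn.
trajectory = [
    Turn("q1", ["d1", "d2"], "x" * 200),
    Turn("q2", ["d3"], "y" * 200),
    Turn("q3", ["d4"], "z" * 200),
]
shorter = compress(trajectory, keep_last=1, max_note_chars=50)
print([len(t.notes) for t in shorter], [t.doc_ids for t in shorter])
```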

Common pitfalls
#

Dropping audit-friendly citation trajectories to save length; applying IR’s “single deterministic ranking” mindset to stochastic rollouts.
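For the second pitfall, a simple habit is to report agent results as a mean and spread over several rollouts instead of a single run. The sketch below assumes each rollout yields one (correct, search_calls) pair per question; the function name and the toy outcomes are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

Outcome = Tuple[bool, int]  # (answer correct?, number of search calls)


def summarize_rollouts(runs: List[List[Outcome]]) -> Dict[str, float]:
    """Per-run Accuracy and Search Calls, then mean and sample std across runs."""
    per_run_acc = [mean([1.0 if ok else 0.0 for ok, _ in run]) for run in runs]
    per_run_calls = [mean([calls for _, calls in run]) for run in runs]
    return {
        "n_rollouts": len(runs),
        "acc_mean": mean(per_run_acc),
        "acc_std": stdev(per_run_acc) if len(runs) > 1 else 0.0,
        "calls_mean": mean(per_run_calls),
    }


# Toy data: the same model run three times over the same four questions.
runs = [
    [(True, 5), (False, 9), (True, 3), (True, 7)],
    [(True, 4), (False, 12), (False, 6), (True, 6)],
    [(True, 5), (True, 10), (True, 3), (True, 6)],
]
summary = summarize_rollouts(runs)
print(summary)
```

A retriever metric computed against fixed qrels would give the same number on each of these three runs; the agent's Accuracy does not, so a single-run comparison between two systems can be noise.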


Split screen around 14 minutes: Weaviate podcast wall on the left—segment on GRPO and pass@K.


Unresolved conclusions (deliberately side by side)
#

  1. Task definition: BrowseComp-Plus’s operational Deep-Research agent definition can coexist with the guest’s “search agent ⊂ deep research umbrella”; product naming still shapes user expectations (speaker view).
  2. Data: WebShaper is published; Orbit scale and pipeline await public verification.
  3. SOTA: public BrowseComp-Plus ≤70.1% (fixed-corpus setting) conflicts with guest 90–95% oral claim—the article can only list both, not merge them.
  4. Training: GRPO’s within-group relative reward supports “sparse success”; 48 API calls/sample is lab back-of-envelope, not a theorem.
  5. Next benchmarks (speaker view): multilingual/multimodal riddles, FreshStack / CRAG family plus agentic search, harder filters—aligned with Omar Khattab–style “we need harder benchmarks,” specific metrics not developed in the podcast.

If you are shipping something
#

  1. Pin deliverable and eval family first: short answers → BrowseComp-Plus-style Accuracy + Search Calls; long reports → TREC nugget/support—do not mix tables.
  2. Split the cost ledger: log search_calls and llm_tokens separately; simulate budget with small G and few turns before training (GRPO’s G per DeepSeekMath).
  3. Harden synthetic data gates: external retrieval check per hop + anti-shortcut rules (no single-clue guessing); see WebShaper’s agentic Expander pattern.
  4. Prefer reproducible retrieval: reproduce on BrowseComp-Plus fixed corpus before swapping commercial APIs; when you need full text, add an explicit fetch_full(doc_id) tool—do not assume snippets suffice for training.
  5. Align context strategy with eval: if eval needs full trajectories, training can compress/distill separately; see context rot on length–accuracy tradeoffs.
