Multi-Stage Language Programs and Automatic Prompt Optimization: From DSPy to MIPRO
When RAG, agents, and “compound AI systems” stack into pipelines of a dozen steps or more, what often blocks teams is not swapping in another vector database—it is who optimizes prompts, demos, and module composition at what granularity, and how an end-to-end metric assigns credit to each step. This article walks through the mechanics of DSPy and MIPRO (Multiprompt Instruction PRoposal Optimizer), what the literature supports, and where engineering judgment still applies. Visual material is interview split-screen footage; no readable architecture diagrams or results tables appear on screen—quantitative claims should be taken from the paper and official API docs.

Problem space: multi-stage programs, compound systems, and the evaluation bottleneck
Why: Retrieval augmentation, tool calling, and multi-role orchestration all reduce to multiple LM calls plus intermediate state. Web agents acting step by step, retrievers rewriting queries, and writing pipelines generating section by section each depend on external state, so a single structured output in one context cannot easily replace them (speaker view).
Mechanisms/constraints: Industry usage often blurs multi-stage, compound, and multi-agent; in the DSPy paper narrative, the emphasis is compilable multi-module programs and program-level metrics, not how many chat bubbles appear in the UI. CrewAI-style multi-role teams are another kind of multi-stage LM system, but marketing-style tasks often lack latency metrics you can optimize online—evaluation design is the bottleneck (speaker view).
Common pitfall: Treating “agent frameworks” and “prompt optimizers” as opposites; optimizers target instruction/demo parameters for each LM module inside a program. Intermediate retrieval and function calls can remain black boxes as long as the metric is computable at the program exit (see P09 boundary below).
If you already run RAG on Weaviate or another vector store, treat it as a Retrieve step or custom Python module in the program: the optimizer will not change your embedding model, but it will change the instruction and demos for “how to answer using retrieved results.” The host’s “generative feedback loop”—writing generated content back into the store for later retrieval or fine-tuning—is orthogonal to multi-stage programs and belongs to data-flywheel design (speaker view), requiring its own consistency and permission policies.
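To make the boundary concrete, here is a minimal plain-Python sketch (no DSPy dependency) of a two-step program in which the retrieval step stays a black box and only the answering instruction is searched. The names `search_store`, `fake_lm`, `exit_metric`, and the instruction strings are hypothetical stand-ins, not part of any library.

```python
from typing import List

# Hypothetical stand-in for a vector-store query (e.g. a Weaviate client call).
# The optimizer never looks inside this function.
def search_store(question: str, k: int = 2) -> List[str]:
    corpus = {
        "capital of france": ["Paris is the capital of France."],
        "author of hamlet": ["Hamlet was written by William Shakespeare."],
    }
    return corpus.get(question.lower().rstrip("?"), [])[:k]

# Hypothetical stand-in for an LM call: it only grounds its answer in the
# passages when the instruction mentions the context.
def fake_lm(instruction: str, passages: List[str], question: str) -> str:
    if "context" in instruction.lower() and passages:
        return passages[0]
    return "I don't know."

def run_program(instruction: str, question: str) -> str:
    passages = search_store(question)                # black-box middleware
    return fake_lm(instruction, passages, question)  # tunable LM step

def exit_metric(answer: str, gold: str) -> float:
    # Computed only at the program exit; no per-step labels are needed.
    return float(gold.lower() in answer.lower())

devset = [("Capital of France?", "Paris"), ("Author of Hamlet?", "Shakespeare")]

def score(instruction: str) -> float:
    total = sum(exit_metric(run_program(instruction, q), g) for q, g in devset)
    return total / len(devset)

# The "optimizer" here is a plain argmax over two candidate instructions.
candidates = ["Answer the question.", "Answer using only the retrieved context."]
best = max(candidates, key=score)
print(best, score(best))
```

The retriever is untouched throughout; only the instruction string changes, and the choice is driven entirely by the score at the program exit.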

DSPy: declaring programs instead of hand-tuning prompts
Why: Hand-crafted prompts can be extremely strong on individual cases (guest’s self-reported EHR project outperforming limited-data fine-tuning—speaker view, not generalizable), but maintenance cost rises when you change models or expand tasks.
Mechanisms/constraints: DSPy expresses pipelines as Module + Signature + Metric; teleprompters / optimizers search each module’s instructions and demonstrations on a training set, optimizing program-level scores rather than backpropagating a single-point loss.
How to proceed (minimal):

    import dspy

    class QA(dspy.Signature):
        """Answer from context."""
        context = dspy.InputField()
        question = dspy.InputField()
        answer = dspy.OutputField()

    class RAG(dspy.Module):
        def __init__(self):
            super().__init__()  # register sub-modules with dspy.Module
            self.retrieve = dspy.Retrieve(k=3)  # uses the retriever configured in dspy.settings
            self.generate = dspy.ChainOfThought(QA)

        def forward(self, question):
            context = self.retrieve(question).passages
            return self.generate(context=context, question=question)

    # metric + trainset → optimizer.compile(program, trainset=...)

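The metric slot can be filled with an ordinary Python function. DSPy metrics conventionally take `(example, pred, trace=None)`; the sketch below follows that shape but runs without DSPy. The `SimpleNamespace` objects, `toy_program`, and `program_score` are illustrative stand-ins for real examples, a real program, and an evaluation helper.

```python
from types import SimpleNamespace

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting does not count as an error.
    return " ".join(text.lower().strip().split())

def exact_match(example, pred, trace=None) -> bool:
    """Return True when the predicted answer matches the gold answer after normalization."""
    return normalize(example.answer) == normalize(pred.answer)

def program_score(program, devset, metric) -> float:
    """Average the metric over a dev set by running the whole program on each example."""
    results = [metric(ex, program(ex.question)) for ex in devset]
    return sum(results) / len(results)

# Toy stand-ins so the sketch runs without an LM: the "program" is a lookup table.
devset = [
    SimpleNamespace(question="2+2?", answer="4"),
    SimpleNamespace(question="Capital of France?", answer="Paris"),
]

def toy_program(question: str) -> SimpleNamespace:
    return SimpleNamespace(answer={"2+2?": "4"}.get(question, "unknown"))

print(program_score(toy_program, devset, exact_match))
```

The score is computed on the program's final output, which is the quantity an optimizer would try to raise.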
Common pitfall: Treating DSPy as “a better prompt template library”; its core is a parameterized program you can optimize, combinable with OPRO-style “LM as optimizer” ideas, but credit assignment is harder with multiple modules (next section).
In the interview, Krista describes “writing DSPy programs with GPT-4” as “pretty good” because the API feels PyTorch-like (speaker view). That lowers topology setup cost; it does not replace metric design or optimization budget. The vision-level “user just says optimize” still depends on task examples, criteria, and compute (speaker view). A more robust engineering path: hand-build the skeleton → automatically search prompts/demos → then consider per-stage model swaps or prompt routing (speaker view).


MIPRO: proposal, bootstrapping, and combinatorial search
Why: In multi-stage programs, each module’s prompt includes instruction and few-shot demos; the search space grows exponentially with module count. You need dedicated handling for “how to propose good candidates” and “who deserves credit for the metric” (paper §3.1–§3.2: proposal challenge, credit assignment challenge—arXiv:2406.11695).
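A quick calculation shows how fast the space grows. The candidate counts used here (10 instructions and 10 demo sets per module) are illustrative assumptions, not figures from the paper.

```python
def num_configurations(modules: int, instructions_per_module: int, demo_sets_per_module: int) -> int:
    """Count joint configurations when every module picks one instruction and one demo set."""
    per_module = instructions_per_module * demo_sets_per_module
    return per_module ** modules

for m in (1, 2, 3, 4):
    print(m, num_configurations(m, 10, 10))
```

With these assumed counts, a three-module program already has a million joint configurations, which is why exhaustive evaluation is off the table and candidates must be proposed and searched selectively.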
Mechanisms/constraints:
| Narrative source | Stage breakdown |
|---|---|
| Interview oral account | Two phases: propose instructions/demos, then combinatorial search (speaker view) |
| DSPy MIPROv2 docs | Three phases: bootstrapping → grounded proposal → discrete search (surrogate + BO) |
These are not contradictory: the literature lists bootstrap separately; the interview discusses “proposal + search” together. The main loop can be summarized as:

- Few-shot source (P02, verified): Run the full program on training samples; when the metric marks a run as successful, keep its step-by-step trace as demos for the corresponding modules. This is consistent with BootstrapFewShot and the Initialize step in the MIPRO paper.
- Instruction proposal (P03, verified): GroundedProposer can ingest a dataset summary, a program summary (DescribeProgram / DescribeModule), bootstrapped I/O examples, and categorical tips (the source includes high_stakes, among others). The guest says “letting the LM see its own program structure” improves proposals (speaker view, aligned with the paper’s grounding design).
- Combinatorial search (P04, partially verified): Treat each instruction × demo combination as an expensive black-box evaluation; MIPRO uses Optuna’s TPE sampler as the surrogate-based optimizer and scores candidates on mini-batches, with the surrogate helping to absorb the resulting noise (paper §4.3; Snoek et al. 2012). MIPROv2 defaults to minibatch_size=35 (source code). The paper’s budget of “20–50 full evaluation trials” is a trial count, not a batch size; do not read it as batch=20 (flagged in the verification report).
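The bootstrapping step can be sketched in plain Python. This is a simplified illustration of the idea (run the program, keep traces only from runs the metric accepts), not DSPy's implementation; `toy_program`, the trace format, and `max_per_module` are hypothetical.

```python
from typing import Dict, List, Tuple

Trace = List[Tuple[str, dict, dict]]  # (module_name, inputs, outputs)

def toy_program(question: str, trace: Trace) -> str:
    # Step 1: a "rewrite" module normalizes the question.
    query = question.lower().rstrip("?")
    trace.append(("rewrite", {"question": question}, {"query": query}))
    # Step 2: an "answer" module that only knows one fact.
    answer = "paris" if "france" in query else "unknown"
    trace.append(("answer", {"query": query}, {"answer": answer}))
    return answer

def bootstrap_demos(program, trainset, metric, max_per_module: int = 4) -> Dict[str, List[dict]]:
    """Collect per-module demos from full-program runs that pass the metric."""
    demos: Dict[str, List[dict]] = {}
    for question, gold in trainset:
        trace: Trace = []
        prediction = program(question, trace)
        if not metric(gold, prediction):
            continue  # discard every step of a failed run
        for module_name, inputs, outputs in trace:
            bucket = demos.setdefault(module_name, [])
            if len(bucket) < max_per_module:
                bucket.append({**inputs, **outputs})
    return demos

trainset = [("Capital of France?", "paris"), ("Capital of Peru?", "lima")]
demos = bootstrap_demos(toy_program, trainset, lambda gold, pred: gold == pred)
print(demos)
```

Only the France example survives, and it yields one demo for each module. No intermediate labels were needed: the end-to-end metric alone decides which traces become demonstrations.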
How to proceed:
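
    # Placeholder model ids; recent DSPy releases expect LM objects (e.g. dspy.LM) here, not bare strings.
    tp = dspy.MIPROv2(metric=my_metric, prompt_model=dspy.LM("openai/gpt-4o-mini"), task_model=dspy.LM("llama-3-8b"))
    # Some versions only accept an explicit num_trials when auto=None is set; check the MIPROv2 API reference.
    optimized = tp.compile(my_program, trainset=train, num_trials=30, minibatch=True)
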
Common pitfall: Assuming MIPRO only tweaks one system sentence; it searches instruction and demonstration combinations, with v2 emphasizing fast probes on small batches.
Paper §3.2 also discusses greedy, surrogate, and history-based credit assignment: surrogate matches TPE in the MIPRO main loop, estimating candidate potential from observed (combination, mini-batch score) pairs; greedy locks in module by module—simple but prone to local optima. The interview does not compare algorithm names line by line; in practice follow dspy.MIPROv2 defaults and on a dev set contrast “optimize only the final generator” vs “search all modules together.”
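The local-optimum risk of greedy, module-by-module selection can be shown on a toy score table. The scores below are invented for illustration, and the exhaustive `joint_search` is a stand-in for surrogate-guided search such as TPE, which would sample the joint space rather than enumerate it.

```python
from itertools import product

# Hypothetical program scores for two modules with two candidates each.
# The candidates only pay off when changed together.
SCORES = {(0, 0): 0.5, (1, 0): 0.4, (0, 1): 0.4, (1, 1): 0.9}

def evaluate(config) -> float:
    return SCORES[tuple(config)]

def greedy_search(num_modules: int, num_candidates: int):
    """Pick the best candidate for one module at a time, holding the others fixed."""
    config = [0] * num_modules
    for module in range(num_modules):
        best_choice = max(
            range(num_candidates),
            key=lambda c: evaluate(config[:module] + [c] + config[module + 1:]),
        )
        config[module] = best_choice
    return tuple(config), evaluate(config)

def joint_search(num_modules: int, num_candidates: int):
    """Evaluate every joint configuration (feasible only for tiny spaces)."""
    best = max(product(range(num_candidates), repeat=num_modules), key=evaluate)
    return best, evaluate(best)

print("greedy:", greedy_search(2, 2))
print("joint: ", joint_search(2, 2))
```

Greedy stays at the starting configuration because each single-module change looks worse in isolation, while the joint search finds the combination that scores highest.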
Counterintuitive (with source labels): Optimizers often find instruction wording humans would not invent (the host cites anecdotes in the style of “the unreasonable effectiveness of prompts”; Krista agrees). Tips emphasizing high_stakes contexts appear in the source and the paper appendix, but why they work lacks a mechanistic explanation (speaker view). More few-shot demos are not monotonically helpful: the guest warns that long demos can confuse models, and that Llama-3-8B-scale task models sometimes do better with instructions only (speaker view). This sits in tension with industry enthusiasm for many-shot prompting. When citing Google’s original many-shot work, check the arXiv ID first; verification found that 2402.04326 is unrelated.




Credit assignment: module-level OPRO vs program-level OPRO
Why: When multiple modules share one end-to-end metric, you must determine which stage and which variables raised or dragged down the score (paper §3.2 credit assignment).
Mechanisms/constraints:
- Module-level OPRO: Feed only one module’s historical instructions plus that module’s relevant scores to the LM to propose the next version (extends OPRO single-prompt setting).
- Program-level OPRO: Feed all program instructions and a single program score to the LM for global inference—paper §4.5 experiments show no extra gain; guest says current models are not yet good enough for global Bayesian-style credit assignment; module-level is more realistic (speaker view, consistent with paper).
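The difference between the two variants is mostly what goes into the proposer's context. The sketch below builds both meta-prompts from the same trial history; the prompt wording, the `History` format, and the use of the program score as the module's score are assumptions for illustration, not the paper's templates.

```python
from typing import Dict, List, Tuple

# Each trial records the instruction used by every module and the program-level score.
History = List[Tuple[Dict[str, str], float]]

def module_level_prompt(history: History, module: str, top_k: int = 3) -> str:
    """Show the proposer only one module's past instructions with their scores."""
    ranked = sorted(history, key=lambda item: item[1], reverse=True)[:top_k]
    lines = [f"Past instructions for module '{module}', best first:"]
    for instructions, score in ranked:
        lines.append(f"- score={score:.2f}: {instructions[module]}")
    lines.append(f"Propose a better instruction for '{module}'.")
    return "\n".join(lines)

def program_level_prompt(history: History, top_k: int = 3) -> str:
    """Show the proposer every module's instructions with one shared score per trial."""
    ranked = sorted(history, key=lambda item: item[1], reverse=True)[:top_k]
    lines = ["Past full-program configurations, best first:"]
    for instructions, score in ranked:
        joined = "; ".join(f"{name}: {text}" for name, text in sorted(instructions.items()))
        lines.append(f"- score={score:.2f}: {joined}")
    lines.append("Propose better instructions for every module.")
    return "\n".join(lines)

history: History = [
    ({"retrieve": "Write a search query.", "answer": "Answer briefly."}, 0.42),
    ({"retrieve": "Write a keyword query.", "answer": "Answer using the context."}, 0.61),
]
print(module_level_prompt(history, "answer"))
print(program_level_prompt(history))
```

In the program-level prompt the LM has to infer on its own which module's change moved the score, which is the credit-assignment burden the paper and the guest describe.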

Common pitfall: Confusing with ORPO (preference alignment)—similar abbreviations, unrelated problems (verified).
For code generation with deterministic unit tests, the guest considers “execution results + self-improvement loop” strongest (speaker view); open-domain writing and marketing copy still rely on LLM-judge or human spot checks. DSPy can parameterize the metric chain too, but if the judge misaligns with real business goals, optimization overfits the wrong objective—the same class of problem as “retrieval metric ≠ user satisfaction” in RAG.
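Before optimizing against an LLM judge, it helps to measure how well the judge agrees with human spot checks. The following sketch computes the raw agreement rate and Cohen's kappa on binary pass/fail labels; the label lists are made-up sample data.

```python
from typing import List

def agreement_rate(judge: List[int], human: List[int]) -> float:
    """Fraction of items on which the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

def cohens_kappa(judge: List[int], human: List[int]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    labels = set(judge) | set(human)
    expected = sum((judge.count(label) / n) * (human.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)

# Hypothetical spot-check sample: 1 = acceptable, 0 = not acceptable.
judge_labels = [1, 1, 1, 0, 1, 0, 1, 1]
human_labels = [1, 0, 1, 0, 1, 0, 0, 1]

print(agreement_rate(judge_labels, human_labels))
print(cohens_kappa(judge_labels, human_labels))
```

A judge that looks acceptable on raw agreement can still be weak once chance agreement is removed, so tracking both numbers over time is one way to notice drift before the optimizer exploits it.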


Meta-proposer, judges, and “non-differentiable” middleware
Meta-proposer (P08, verified): 0-Shot MIPRO++ uses BO to choose discrete switches—whether to give the proposer dataset summary, program summary, tips, temperature, demos inside the meta-prompt, etc. (paper §4.4). Guest caution: fully tuning the meta-prompt consumes optimization rounds; with tight budget, trials may be better spent on main search (speaker view); paper Lesson 5 discusses high/low budget tradeoffs—no unified formula.
LLM-as-judge: Tasks without gold standards (summary quality, tweet style) often rely on judge LMs; Who Validates the Validators? discusses alignment with humans. DSPy can make the metric itself an optimizable program—speaker view—watch for judge drift.
Tool / retrieval middleware (P09, partially verified): The optimization target remains the program-exit metric. The paper’s six tasks focus on QA, classification, and retrieval pipelines and include no strong tool-use optimization benchmark. The guest says LangProBe will extend coverage to tool scenarios, but a public repository named LangProBe was not found on GitHub or in the DSPy README during this pass (the paper §5.1 benchmarks are verifiable; the product name is unverified).
Proposer vs task model (P10, partially verified): Paper §5.2 main experiments use Llama-3-8B task model and GPT-3.5 proposer (not interview’s GPT-4 default); guest says gap between 8B proposer and GPT-4 is “smaller than expected” (speaker view). Structurally proposer calls are usually far fewer than task LM rollouts on the training set, so biasing budget toward a stronger proposer is common (speaker view).
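The budget argument can be checked with rough arithmetic. The accounting below is a simplified assumption (one proposer call per candidate instruction, one task-LM call per module per rollout, bootstrapping rollouts ignored), not how DSPy counts calls internally; only minibatch_size=35 comes from the text above, and the other numbers are illustrative.

```python
def proposer_calls(modules: int, candidates_per_module: int, summary_calls: int = 2) -> int:
    """Assumed: one call per proposed instruction plus a few dataset/program summaries."""
    return modules * candidates_per_module + summary_calls

def task_calls(modules: int, trials: int, minibatch_size: int, full_evals: int, devset_size: int) -> int:
    """Assumed: every rollout invokes each module once."""
    minibatch_rollouts = trials * minibatch_size
    full_rollouts = full_evals * devset_size
    return (minibatch_rollouts + full_rollouts) * modules

p = proposer_calls(modules=2, candidates_per_module=10)
t = task_calls(modules=2, trials=30, minibatch_size=35, full_evals=3, devset_size=300)
print(p, t, round(t / p))
```

Under these assumptions the task model is called a couple of orders of magnitude more often than the proposer, which is the structural reason spending more per call on the proposer can be affordable.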


Benchmarks, LangProBe, and reproducible experiments
Why: Without aligned task sets, you cannot tell whether BootstrapFewShot, MIPRO, or hand-tuned prompts make better use of a given compute budget.
Mechanisms/constraints: MIPRO paper §5.1 lists six tasks (HotPotQA, HotPotQA Conditional, Iris, Heart Disease, ScoNe, HoVer), each with a dataset, a metric, and a fixed LM program topology; the main table reports accuracy or task-specific metrics (e.g. HoVer Recall@21). The abstract states that the new optimizer and benchmark are released in the DSPy ecosystem.
How to proceed: To reproduce Table 2, align with paper §5.2: Llama-3-8B task model, GPT-3.5 proposer (specific tasks use GPT-4o as bootstrap teacher), optimizer budget 20–50 full eval trials, and record extra trials from minibatch probing.
Common pitfall: Treating interview LangProBe as a verified PyPI/GitHub package—this search did not find that string in stanfordnlp/dspy (name unverified); the paper’s six tasks are safer ground truth. Host’s “DSPy ImageNet moment / ~20 point margin” (Bo Wang group) unverified for specific benchmark—do not use for production decisions.

Boundaries vs RAG, agent UI, and fine-tuning (unsettled conclusions)
| Topic | Common practice | Guest/literature tension | Evidence boundary |
|---|---|---|---|
| Single JSON vs multi-stage | Structured output in one shot | External state forces multiple rounds | Speaker view |
| Many-shot ICL | More demos always help | Long context may confuse; small models or instruction-only | Guest has not read Google many-shot original; do not cite wrong arXiv ID |
| Chat single entry vs multi-agent UI | Multiple specialized bots | Lean toward consolidation to cut choice cost; also acknowledges Slack-style collaboration | Speaker view |
| Prompt routing | One best prompt for full test set | Bullish on per-example routing | Speaker view |
| Generative feedback loop | Vector DB read-only retrieval | Writing generations back can feed later training/retrieval | Weaviate-relevant; implementation unverified |
| Joint prompt + weight optimization | Prompt first, then LoRA | Colleague work explores searching both (Stanford) | Speaker view, no unified public recipe |
| Senior prompt engineers | Manual iteration wins | Guest reports being “surprised” by automatic optimization | Speaker view |
Fine-tuning: With very little data and stable task distribution, fine-tuning may still win; but pipelines with retrieval, tools, and multiple modules can sometimes save labeling via end-to-end prompt optimization (EHR case is speaker view). A practical combo: DSPy optimizes prompt/demo → collect high-scoring traces → then decide on distillation or fine-tuning a smaller model.
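The "collect high-scoring traces" step of that combo can be as simple as filtering and serializing. In this sketch the trace fields (`prompt`, `completion`, `score`), the threshold, and the JSONL layout are hypothetical; the actual format depends on the fine-tuning toolkit you target.

```python
import json
from typing import Iterable, List

def select_traces(traces: Iterable[dict], min_score: float) -> List[dict]:
    """Keep traces at or above the threshold, best first, one per distinct prompt."""
    seen = set()
    kept = []
    for trace in sorted(traces, key=lambda t: t["score"], reverse=True):
        if trace["score"] >= min_score and trace["prompt"] not in seen:
            seen.add(trace["prompt"])
            kept.append(trace)
    return kept

def to_jsonl(traces: List[dict]) -> str:
    """Serialize prompt/completion pairs, one JSON object per line."""
    return "\n".join(
        json.dumps({"prompt": t["prompt"], "completion": t["completion"]}) for t in traces
    )

# Made-up traces from an optimized program, scored by the program metric.
traces = [
    {"prompt": "Q: capital of France?", "completion": "Paris", "score": 1.0},
    {"prompt": "Q: capital of France?", "completion": "It is Paris.", "score": 0.8},
    {"prompt": "Q: capital of Peru?", "completion": "Quito", "score": 0.0},
    {"prompt": "Q: author of Hamlet?", "completion": "Shakespeare", "score": 0.9},
]
selected = select_traces(traces, min_score=0.8)
jsonl = to_jsonl(selected)
print(jsonl)
```

Filtering on the same metric used for optimization keeps the distilled dataset aligned with what the optimizer rewarded, so any flaw in the metric carries over into the fine-tuned model.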


If you are shipping this
- Fix program topology and metric first: On STORM-style multi-stage writing or RAG pipelines, define an automatically computable metric on a dev set; for fuzzy tasks add LLM-judge and plan human spot checks.
- Start with BootstrapFewShot + small-budget MIPROv2: Align num_trials with the paper’s 20–50 full eval scale; with minibatch=True, note the default minibatch_size and how trials are counted (docs and mipro_optimizer_v2.py).
- Credit assignment per module: Prefer module-level strategies; do not expect stable gains from program-level OPRO on current models (paper §4.5).
- Configure prompt_model and task_model separately: Verify proposer scale against the paper; local 8B task + stronger cloud proposer is a common cost saver (paper: GPT-3.5 proposer + Llama-3-8B task).
- Build benchmarks for tool-use and retrieval: MIPRO paper’s six tasks do not cover complex tool chains; before production, fill gaps with your own traces and pass/fail metrics (speaker view: optimizers should stay agnostic to middleware, but benchmark gaps are real).

References and further reading
- DSPy official documentation
- DSPy optimizer tutorial (including MIPROv2 three phases)
- MIPROv2 API reference
- BootstrapFewShot API
- MIPRO paper: Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
- DSPy paper: Compiling Declarative Language Model Calls
- OPRO: Large Language Models as Optimizers
- ORPO (unrelated to OPRO; preference alignment)
- Practical Bayesian Optimization (Snoek et al.)
- Who Validates the Validators? (LLM judge alignment)
- STORM: multi-stage writing pipeline example
- DSPy Tools / ReAct programming guide
- GroundedProposer source
- MIPROv2 implementation (mipro_optimizer_v2.py)
- stanfordnlp/dspy GitHub repository



