Prompt Strategy Comparison Walkthrough
This walkthrough demonstrates the umbrella package doing real work with a live
model-backed agent while following the comparison-study recipe/reporting APIs
landing on the April 2026 sibling-library branches. It uses a real packaged
problem from design_research.problems, a managed
prompt-mode design_research.agents.Workflow,
design_research.agents.PromptWorkflowAgent, the
design_research.experiments.build_strategy_comparison_study scaffold, and
the newer condition-comparison helpers from design_research.analysis.
What This Covers

This walkthrough:

- resolves a real packaged problem through design_research.problems
- resolves that problem through the sibling-owned design_research.experiments.resolve_problem interop API so packaged evaluations normalize cleanly into experiment rows
- builds the study from design_research.experiments.build_strategy_comparison_study with a recipe-first benchmark bundle containing a random baseline, a neutral prompt, and a profit-focused prompt
- runs the live study through design_research.experiments.run_study
- exports the canonical study artifacts plus a markdown summary report built from render_markdown_summary, render_methods_scaffold, render_codebook, and render_significance_brief
- validates the exported event rows through design_research.analysis
- computes ordered one-sided condition-pair permutation tests from the exported runs.csv and evaluations.csv tables via build_condition_metric_table and compare_condition_pairs
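The one-sided permutation tests that compare_condition_pairs performs can be sketched in a few self-contained lines. This is an illustrative sketch of the statistic only, not the analysis package's implementation; the function name and sample data below are invented for the example:

```python
import random


def permutation_p_value(a, b, n_permutations=2000, seed=17):
    """One-sided Monte Carlo permutation test of mean(a) > mean(b)."""
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and re-split them into two groups of the
        # original sizes, simulating the null of no condition effect.
        rng.shuffle(pooled)
        perm_a = pooled[: len(a)]
        perm_b = pooled[len(a):]
        diff = sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b)
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical per-replicate metric values for two conditions.
profit = [5.2, 5.0, 4.9, 5.1, 5.3, 4.8, 5.0, 5.2]
neutral = [1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9]
```

With clearly separated samples like these the p-value for the ordered comparison is small; reversing the pair order makes it approach 1, which is why the ordered pair list in the study matters.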
Branch Alignment
This local walkthrough intentionally tracks the April 2026 release-branch APIs
from design-research-agents, design-research-experiments, and
design-research-analysis. If you run it against older releases of those
sibling packages, it will fail fast with a clear upgrade message instead of
silently drifting from the new workflow/recipe/reporting surface.
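A fail-fast guard of this kind can be approximated with a small version check at import time. The helper below is a hypothetical illustration only; the sibling packages own the real check, and the minimum version shown is a placeholder, not the actual April 2026 pin:

```python
from importlib import metadata

# Placeholder minimum -- NOT the real pinned version of the sibling package.
MINIMUM_VERSIONS = {
    "design-research-experiments": (2026, 4),
}


def check_sibling_versions(minimums=MINIMUM_VERSIONS) -> None:
    """Raise a clear upgrade message when a sibling package is missing or old."""
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            raise RuntimeError(f"Install {package} to run this walkthrough.")
        # Compare only the leading numeric components of the version string.
        parts = tuple(int(p) for p in installed.split(".")[:2] if p.isdigit())
        if parts < minimum:
            raise RuntimeError(
                f"{package} {installed} is too old; upgrade to "
                f">= {'.'.join(str(v) for v in minimum)}."
            )
```

The point of raising immediately is the one made above: an explicit upgrade message is preferable to silently running against a drifted API surface.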
During local development, the umbrella test harness can point subprocess runs at adjacent sibling worktrees so the examples stay validated against the same public APIs owned by the sibling libraries themselves.
Run It
python -m pip install "llama-cpp-python[server]" huggingface-hub
make run-example
Optionally point the walkthrough at a specific local GGUF file:
export LLAMA_CPP_MODEL=/path/to/model.gguf
make run-example
The default configuration uses 50 replicates per condition (DEFAULT_REPLICATES_PER_CONDITION in the script). To run with a different sample size, set the replicate count explicitly:
export PROMPT_STUDY_REPLICATES=12
make run-example
The example writes canonical exports to
artifacts/examples/prompt_strategy_comparison_study and writes a markdown
summary report to
artifacts/examples/prompt_strategy_comparison_study/artifacts/prompt_strategy_summary.md.
It prints condition means, a condition-comparison brief, a significance brief,
the summary-report path, exported artifact paths, and the event-table
validation summary. The script intentionally has no deterministic fallback path
for the live-agent conditions: it expects a real llama.cpp runtime.
If LLAMA_CPP_MODEL is not set, the client falls back to its built-in model
defaults and Hugging Face repo settings. The first run may therefore download a
model before the walkthrough executes, which is why the setup above includes
huggingface-hub.
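The precedence is: an explicit LLAMA_CPP_MODEL path wins, and an empty or unset variable falls back to the built-in default file name, which the Hugging Face repo settings then supply on first run. A minimal sketch of that resolution (the function name is invented for illustration; the script's own version lives in llama_cpp_runtime_config below):

```python
import os

DEFAULT_GGUF = "Qwen2.5-1.5B-Instruct-Q4_K_M.gguf"


def resolve_model_source(env=os.environ) -> str:
    """Return the model path/file name, falling back to the built-in default."""
    # An unset, empty, or whitespace-only override falls back to the default.
    return env.get("LLAMA_CPP_MODEL", DEFAULT_GGUF).strip() or DEFAULT_GGUF
```

Passing a plain dict for env makes the fallback behavior easy to exercise without touching the real environment.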
The script is intentionally written in a linear, step-by-step style so it can
double as training material and as the literal-included documentation example.
The only local callbacks left in place are the small workflow request/response
adapters and the condition-specific prompt builders passed into
PromptWorkflowAgent(...).
Code
examples/prompt_framing_study.py

"""Canonical live strategy-comparison walkthrough for the umbrella package."""

from __future__ import annotations

import csv
import importlib.util
import json
import os
from pathlib import Path

import design_research as dr

# These constants keep the live walkthrough readable: one packaged problem, one
# study id, stable artifact paths, and the statistical settings used in the
# pairwise comparisons later on.
BASELINE_AGENT_ID = "SeededRandomBaselineAgent"
PROBLEM_ID = "decision_laptop_design_profit_maximization"
STUDY_ID = "prompt_strategy_comparison_study"
OUTPUT_DIR = Path("artifacts") / "examples" / STUDY_ID
SUMMARY_REPORT_NAME = "prompt_strategy_summary.md"
DEFAULT_REPLICATES_PER_CONDITION = 50
SIGNIFICANCE_ALPHA = 0.05
EXACT_PERMUTATION_THRESHOLD = 250_000
MONTE_CARLO_PERMUTATIONS = 20_000
PERMUTATION_TEST_SEED = 17
STRATEGY_ORDER = (BASELINE_AGENT_ID, "neutral_prompt", "profit_focus_prompt")
PAIRWISE_COMPARISONS = (
    ("profit_focus_prompt", "neutral_prompt"),
    ("neutral_prompt", BASELINE_AGENT_ID),
    ("profit_focus_prompt", BASELINE_AGENT_ID),
)


def main() -> None:
    """Run the live strategy-comparison walkthrough with managed llama.cpp."""
    # Read runtime settings from the environment and apply the example's default
    # replicate count when the user does not override it.
    runtime = llama_cpp_runtime_config(default_replicates=DEFAULT_REPLICATES_PER_CONDITION)

    # Load the packaged decision problem and derive the JSON candidate schema the
    # model-based agents should emit.
    packaged_problem = dr.problems.get_problem(PROBLEM_ID)
    candidate_schema = decision_candidate_schema(packaged_problem)

    # Build the recipe-defined study and then materialize its conditions. The
    # conditions encode one row per strategy/replicate combination.
    study = _build_study(replicates=int(runtime["replicates"]))
    conditions = dr.experiments.build_design(study)

    # Resolve the packaged problem once so every run pulls from the same
    # normalized problem packet.
    problem_registry = {PROBLEM_ID: dr.experiments.resolve_problem(PROBLEM_ID)}

    # Start a managed llama.cpp server client for the duration of the study.
    # The context manager handles startup/shutdown around the live run.
    with dr.agents.LlamaCppServerLLMClient(
        model=str(runtime["model_source"]),
        hf_model_repo_id=runtime["model_repo"],
        api_model=str(runtime["model_name"]),
        host=str(runtime["host"]),
        port=int(runtime["port"]),
        context_window=int(runtime["context_window"]),
    ) as llm_client:
        # Each `agent_id` in the strategy bundle maps either to a public agent
        # id resolved directly by experiments or to one explicit binding that
        # returns a prompt-driven workflow agent.
        agent_bindings = {
            # The neutral condition uses the live model but keeps the instruction
            # framing generic.
            "neutral_prompt": lambda _condition: dr.agents.PromptWorkflowAgent(
                workflow=build_json_model_workflow(
                    llm_client=llm_client,
                    candidate_schema=candidate_schema,
                    study_id=STUDY_ID,
                    problem_id=PROBLEM_ID,
                    fallback_model_name=str(runtime["model_name"]),
                    fallback_provider=str(runtime["provider_name"]),
                ),
                prompt_builder=lambda problem_packet, _run_spec, _condition: _strategy_prompt(
                    problem_packet,
                    instruction=(
                        "Condition: neutral prompt. Choose the best overall candidate using the "
                        "packaged demand and feasibility information."
                    ),
                ),
            ),
            # The profit-focused condition swaps only the framing instruction so
            # the study isolates prompt strategy rather than model identity.
            "profit_focus_prompt": lambda _condition: dr.agents.PromptWorkflowAgent(
                workflow=build_json_model_workflow(
                    llm_client=llm_client,
                    candidate_schema=candidate_schema,
                    study_id=STUDY_ID,
                    problem_id=PROBLEM_ID,
                    fallback_model_name=str(runtime["model_name"]),
                    fallback_provider=str(runtime["provider_name"]),
                ),
                prompt_builder=lambda problem_packet, _run_spec, _condition: _strategy_prompt(
                    problem_packet,
                    instruction=(
                        "Condition: profit-focus prompt. Prioritize choices that maximize "
                        "market share proxy and expected demand."
                    ),
                ),
            ),
        }

        # Execute the full study while the managed llama.cpp client is running.
        results = dr.experiments.run_study(
            study,
            conditions=conditions,
            agent_bindings=agent_bindings,
            problem_registry=problem_registry,
            checkpoint=False,
            show_progress=False,
        )

    # Export the standard analysis tables so the next steps can work from the
    # same artifacts users would inspect after their own runs.
    artifact_paths = dr.experiments.export_analysis_tables(
        study,
        conditions=conditions,
        run_results=results,
        output_dir=OUTPUT_DIR,
    )

    # Load only the CSVs we need for the walkthrough's reporting and statistical
    # comparison steps.
    exported_rows = load_analysis_exports(
        artifact_paths,
        names=("conditions.csv", "runs.csv", "evaluations.csv"),
    )

    # Confirm that the event-level export is structurally valid before building
    # downstream tables from it.
    validation_report = validate_exported_events(artifact_paths)

    # Build one condition-by-metric table for the primary outcome we care about
    # and another for a secondary business-facing metric.
    primary_metric_rows = dr.analysis.build_condition_metric_table(
        exported_rows["runs.csv"],
        metric="market_share_proxy",
        condition_column="agent_id",
        conditions=exported_rows["conditions.csv"],
        evaluations=exported_rows["evaluations.csv"],
    )
    demand_metric_rows = dr.analysis.build_condition_metric_table(
        exported_rows["runs.csv"],
        metric="expected_demand_units",
        condition_column="agent_id",
        conditions=exported_rows["conditions.csv"],
        evaluations=exported_rows["evaluations.csv"],
    )

    # Compare the strategy pairs using the analysis package's pairwise
    # permutation test helper.
    comparison_report = dr.analysis.compare_condition_pairs(
        primary_metric_rows,
        condition_pairs=PAIRWISE_COMPARISONS,
        alternative="greater",
        alpha=SIGNIFICANCE_ALPHA,
        exact_threshold=EXACT_PERMUTATION_THRESHOLD,
        n_permutations=MONTE_CARLO_PERMUTATIONS,
        seed=PERMUTATION_TEST_SEED,
    )

    # Convert the statistical report into rows that the experiments reporting
    # helpers can render alongside the study summary.
    significance_rows = comparison_report.to_significance_rows()

    # Write one consolidated markdown report that includes the study summary,
    # methods scaffold, variable codebook, and the pairwise comparison brief.
    summary_path = dr.experiments.write_markdown_report(
        study.output_dir,
        SUMMARY_REPORT_NAME,
        "\n\n".join(
            [
                dr.experiments.render_markdown_summary(study, results),
                dr.experiments.render_methods_scaffold(study),
                dr.experiments.render_codebook(study, conditions),
                comparison_report.render_brief(),
                dr.experiments.render_significance_brief(significance_rows),
            ]
        ),
    )

    # Collapse the metric tables to per-strategy means for a concise console
    # summary after the run finishes.
    primary_means = condition_means(primary_metric_rows)
    demand_means = condition_means(demand_metric_rows)
    successful_results = [result for result in results if result.status.value == "success"]

    # Fail loudly if the live walkthrough did not actually produce usable data.
    if not successful_results:
        raise RuntimeError("The live walkthrough completed without any successful runs.")
    if validation_report.errors:
        raise RuntimeError(
            "Unified event table validation failed:\n- " + "\n- ".join(validation_report.errors)
        )

    # Print a guided end-of-run summary so the console output doubles as a quick
    # tour of the artifacts and the headline comparison result.
    print("Problem:", PROBLEM_ID)
    print("Study:", study.study_id)
    print("Live provider:", runtime["provider_name"])
    print("Live model API name:", runtime["model_name"])
    print("Model source:", runtime["model_source"])
    print("Replicates per condition:", runtime["replicates"])
    print("Conditions:", len(conditions))
    print("Runs:", len(results), f"({len(successful_results)} success)")
    print("Condition means:")
    for strategy_name in STRATEGY_ORDER:
        print(
            f"  - agent_id={strategy_name}: "
            f"mean_market_share_proxy={primary_means.get(strategy_name, 0.0):.4f}, "
            f"mean_expected_demand_units={demand_means.get(strategy_name, 0.0):.0f}"
        )
    print(comparison_report.render_brief())
    print(dr.experiments.render_significance_brief(significance_rows))
    print("Event rows valid:", validation_report.is_valid, f"(rows={validation_report.n_rows})")
    print("Summary report:", summary_path)
    print("Artifacts:", artifact_names(artifact_paths))


def _build_study(*, replicates: int) -> object:
    """Build the live strategy-comparison recipe study."""
    # The recipe builder captures the study in one config object. The bundle says
    # which packaged problems and agent strategies participate; the run budget
    # says how many replicates to execute.
    return dr.experiments.build_strategy_comparison_study(
        dr.experiments.StrategyComparisonConfig(
            study_id=STUDY_ID,
            title="Prompt Strategy Comparison Study",
            description=(
                "Compare a seeded random baseline, a neutral prompt, and a profit-focused "
                "prompt on a packaged laptop-design decision problem."
            ),
            bundle=dr.experiments.BenchmarkBundle(
                bundle_id="live-strategy-comparison",
                name="Live Strategy Comparison Bundle",
                description="Packaged decision problem with three strategy bindings.",
                problem_ids=(PROBLEM_ID,),
                agent_specs=STRATEGY_ORDER,
            ),
            run_budget=dr.experiments.RunBudget(replicates=replicates, parallelism=1),
            output_dir=OUTPUT_DIR,
        )
    )


def _strategy_prompt(problem_packet: object, *, instruction: str) -> str:
    """Render one complete strategy prompt from the normalized problem packet."""
    # Compose the final prompt from a few readable pieces instead of one giant
    # literal string. That makes it easy to see which lines stay fixed across
    # conditions and which line changes with the strategy framing.
    return "\n".join(
        [
            "You are solving a packaged design-research decision problem.",
            "Read the problem brief and return exactly one JSON object candidate.",
            instruction,
            "",
            str(getattr(problem_packet, "brief", "")).strip(),
            "",
            "Return JSON only with no markdown fences and no extra commentary.",
        ]
    )


def read_csv_rows(path: Path) -> list[dict[str, str]]:
    """Read one exported CSV table into a list of row dictionaries."""
    with path.open("r", encoding="utf-8", newline="") as file_obj:
        return list(csv.DictReader(file_obj))


def load_analysis_exports(
    artifact_paths: dict[str, Path],
    *,
    names: tuple[str, ...],
) -> dict[str, list[dict[str, str]]]:
    """Load selected exported CSV artifacts into memory."""
    return {name: read_csv_rows(artifact_paths[name]) for name in names}


def validate_exported_events(artifact_paths: dict[str, Path]) -> object:
    """Validate the exported canonical event table through the analysis layer."""
    return dr.analysis.integration.validate_experiment_events(artifact_paths["events.csv"])


def artifact_names(artifact_paths: dict[str, Path]) -> str:
    """Return exported artifact filenames in stable sorted order."""
    return ", ".join(sorted(path.name for path in artifact_paths.values()))


def condition_means(rows: list[dict[str, object]]) -> dict[str, float]:
    """Compute one mean per condition label from normalized rows."""
    grouped: dict[str, list[float]] = {}
    for row in rows:
        grouped.setdefault(str(row["condition"]), []).append(float(row["value"]))
    return {
        condition: (sum(values) / len(values) if values else 0.0)
        for condition, values in grouped.items()
    }


def decision_candidate_schema(problem: object) -> dict[str, object]:
    """Build a JSON schema for discrete decision-factor candidates."""
    properties: dict[str, object] = {}
    required: list[str] = []
    for factor in getattr(problem, "option_factors", ()):
        levels = tuple(getattr(factor, "levels", ()))
        key = str(getattr(factor, "key", ""))
        if not key or not levels:
            continue
        properties[key] = {"type": "number", "enum": list(levels)}
        required.append(key)

    if not required:
        raise RuntimeError("Expected a packaged decision problem with explicit option factors.")

    return {
        "type": "object",
        "properties": properties,
        "required": required,
        "additionalProperties": False,
    }


def llama_cpp_runtime_config(*, default_replicates: int) -> dict[str, object]:
    """Resolve runtime configuration and fail fast on missing live dependencies."""
    missing_runtime = [
        module_name
        for module_name in ("llama_cpp", "fastapi", "uvicorn")
        if importlib.util.find_spec(module_name) is None
    ]
    if missing_runtime:
        raise RuntimeError(
            "Install llama-cpp-python[server] before running the live walkthrough. Missing: "
            + ", ".join(sorted(missing_runtime))
        )

    model_source = (
        os.getenv("LLAMA_CPP_MODEL", "Qwen2.5-1.5B-Instruct-Q4_K_M.gguf").strip()
        or "Qwen2.5-1.5B-Instruct-Q4_K_M.gguf"
    )
    model_repo = (
        os.getenv("LLAMA_CPP_HF_MODEL_REPO_ID", "bartowski/Qwen2.5-1.5B-Instruct-GGUF").strip()
        or None
    )
    if (
        model_repo
        and not Path(model_source).expanduser().exists()
        and importlib.util.find_spec("huggingface_hub") is None
    ):
        raise RuntimeError(
            "Install huggingface-hub or point LLAMA_CPP_MODEL at a local GGUF file before "
            "running the live walkthrough."
        )

    replicates = int(os.getenv("PROMPT_STUDY_REPLICATES", str(default_replicates)))
    if replicates < 2:
        raise RuntimeError("PROMPT_STUDY_REPLICATES must be at least 2.")

    return {
        "provider_name": "llama-cpp",
        "model_source": model_source,
        "model_name": os.getenv("LLAMA_CPP_API_MODEL", "qwen2-1.5b-q4").strip() or "qwen2-1.5b-q4",
        "model_repo": model_repo,
        "host": os.getenv("LLAMA_CPP_HOST", "127.0.0.1").strip() or "127.0.0.1",
        "port": int(os.getenv("LLAMA_CPP_PORT", "8001")),
        "context_window": int(os.getenv("LLAMA_CPP_CONTEXT_WINDOW", "4096")),
        "replicates": replicates,
    }


def build_json_model_workflow(
    *,
    llm_client: object,
    candidate_schema: dict[str, object],
    study_id: str,
    problem_id: str,
    fallback_model_name: str,
    fallback_provider: str,
) -> object:
    """Build one reusable prompt-mode workflow that returns structured JSON."""

    def request_builder(context: dict[str, object]) -> object:
        """Build one structured LLM request from the workflow context."""
        return dr.agents.LLMRequest(
            messages=[
                dr.agents.LLMMessage(
                    role="system",
                    content=(
                        "You are a careful study participant. Return valid JSON only and match "
                        "the requested schema exactly."
                    ),
                ),
                dr.agents.LLMMessage(role="user", content=str(context["prompt"])),
            ],
            temperature=0.0,
            max_tokens=400,
            response_schema=candidate_schema,
            metadata={"study_id": study_id, "problem_id": problem_id},
        )

    def response_parser(response: object, _context: dict[str, object]) -> dict[str, object]:
        """Parse one model response into workflow output, metrics, and events."""
        model_text = strip_markdown_fences(str(getattr(response, "text", "")).strip())
        candidate = json.loads(model_text)
        if not isinstance(candidate, dict):
            raise RuntimeError("Expected the live workflow to return one JSON object candidate.")
        provider = str(getattr(response, "provider", "") or fallback_provider)
        model_name = str(getattr(response, "model", "") or fallback_model_name)
        return {
            "final_output": candidate,
            "metrics": usage_metrics(getattr(response, "usage", None)),
            "events": [
                {
                    "event_type": "model_response",
                    "actor_id": "agent",
                    "text": model_text,
                    "meta_json": {"provider": provider, "model_name": model_name},
                }
            ],
        }

    return dr.agents.Workflow(
        steps=(
            dr.agents.ModelStep(
                step_id="select_candidate",
                llm_client=llm_client,
                request_builder=request_builder,
                response_parser=response_parser,
            ),
        ),
        output_schema=candidate_schema,
        default_request_id_prefix=study_id,
    )


def strip_markdown_fences(text: str) -> str:
    """Strip one optional fenced-code wrapper from a model response."""
    if not text.startswith("```"):
        return text
    lines = text.splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines).strip()


def usage_metrics(usage: object) -> dict[str, object]:
    """Normalize usage payloads into canonical metric names."""
    metrics: dict[str, object] = {"cost_usd": 0.0}
    if isinstance(usage, dict):
        prompt_tokens = usage.get("prompt_tokens")
        completion_tokens = usage.get("completion_tokens")
    else:
        prompt_tokens = getattr(usage, "prompt_tokens", None)
        completion_tokens = getattr(usage, "completion_tokens", None)
    if isinstance(prompt_tokens, int):
        metrics["input_tokens"] = prompt_tokens
    if isinstance(completion_tokens, int):
        metrics["output_tokens"] = completion_tokens
    return metrics


if __name__ == "__main__":
    main()
When To Go Direct
Use the umbrella package when you want one stable import surface for the ecosystem. Install a sibling package directly when you only need one layer or want package-specific internals. See Compatibility And Start Here for the tested version combination and install guidance.