Prompt Strategy Comparison Walkthrough#

This walkthrough demonstrates the umbrella package doing real work with a live model-backed agent, following the comparison-study recipe and reporting APIs landing on the April 2026 sibling-library branches. It uses a real packaged problem from design_research.problems, a prompt-mode design_research.agents.Workflow wrapped in a design_research.agents.PromptWorkflowAgent, the design_research.experiments.build_strategy_comparison_study scaffold, and the newer condition-comparison helpers from design_research.analysis.

Flow diagram showing a packaged problem feeding a live workflow agent, then study execution, artifact export, and event-table validation.

What This Covers#

  • resolves a real packaged problem through design_research.problems

  • resolves that problem through the sibling-owned design_research.experiments.resolve_problem interop API so packaged evaluations normalize cleanly into experiment rows

  • builds the study from design_research.experiments.build_strategy_comparison_study with a recipe-first benchmark bundle containing a random baseline, a neutral prompt, and a profit-focused prompt

  • runs the live study through design_research.experiments.run_study

  • exports the canonical study artifacts plus a markdown summary report built from render_markdown_summary, render_methods_scaffold, render_codebook, and render_significance_brief

  • validates the exported event rows through design_research.analysis

  • computes ordered one-sided condition-pair permutation tests from the exported runs.csv and evaluations.csv tables via build_condition_metric_table and compare_condition_pairs
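The ordered pairwise comparisons in the last bullet are owned by design_research.analysis, but the underlying procedure is a plain one-sided two-sample permutation test. A minimal Monte Carlo sketch of that idea, using only the standard library (the function name and default settings here are illustrative, not the library's API):

```python
import random
from statistics import mean


def one_sided_permutation_pvalue(
    treatment: list[float],
    control: list[float],
    n_permutations: int = 20_000,
    seed: int = 17,
) -> float:
    """Estimate P(mean(treatment) - mean(control) >= observed) under random relabeling."""
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(treatment) + list(control)
    n_treatment = len(treatment)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and relabel the first block as "treatment".
        rng.shuffle(pooled)
        permuted = mean(pooled[:n_treatment]) - mean(pooled[n_treatment:])
        if permuted >= observed - 1e-12:
            hits += 1
    # Add-one smoothing keeps the Monte Carlo estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)
```

A clearly separated pair of samples should yield a small p-value, while identical samples should not; the real helper additionally switches to exact enumeration below its permutation-count threshold.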

Branch Alignment#

This local walkthrough intentionally tracks the April 2026 release-branch APIs from design-research-agents, design-research-experiments, and design-research-analysis. If you run it against older releases of those sibling packages, it will fail fast with a clear upgrade message instead of silently drifting from the new workflow/recipe/reporting surface.

During local development, the umbrella test harness can point subprocess runs at adjacent sibling worktrees so the examples stay validated against the same public APIs owned by the sibling libraries themselves.

Run It#

python -m pip install "llama-cpp-python[server]" huggingface-hub
make run-example

Optionally point the walkthrough at a specific local GGUF file:

export LLAMA_CPP_MODEL=/path/to/model.gguf
make run-example

The default configuration uses eight replicates per condition. To push to a larger sample size, raise the replicate count explicitly:

export PROMPT_STUDY_REPLICATES=12
make run-example

The example writes canonical exports to artifacts/examples/prompt_strategy_comparison_study, along with a markdown summary report at artifacts/examples/prompt_strategy_comparison_study/artifacts/prompt_strategy_summary.md. It prints condition means, a condition-comparison brief, a significance brief, the summary-report path, the exported artifact paths, and the event-table validation summary. The script intentionally has no deterministic fallback path for the live-agent conditions: it expects a real llama.cpp runtime.
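The exported tables can also be inspected after the fact with nothing but the standard library. This sketch recomputes per-condition means from runs.csv; it assumes the metric appears as a plain numeric column and that agent_id labels the condition, matching how the example itself queries the export (the helper name is illustrative, not part of any package):

```python
import csv
from collections import defaultdict
from pathlib import Path


def mean_metric_by_condition(
    runs_csv: Path,
    metric: str,
    condition_column: str = "agent_id",
) -> dict[str, float]:
    """Average one numeric metric column per condition label in an exported runs table."""
    grouped: defaultdict[str, list[float]] = defaultdict(list)
    with runs_csv.open(newline="", encoding="utf-8") as file_obj:
        for row in csv.DictReader(file_obj):
            cell = row.get(metric, "")
            if cell:  # skip runs that did not record this metric
                grouped[row[condition_column]].append(float(cell))
    return {label: sum(values) / len(values) for label, values in grouped.items()}
```

For the real comparison workflow, prefer the build_condition_metric_table helper used in the example, which also joins conditions and evaluations.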

If LLAMA_CPP_MODEL is not set, the client falls back to its built-in model defaults and Hugging Face repo settings. The first run may therefore download a model before the walkthrough executes, which is why the setup above includes huggingface-hub.

The script is intentionally written in a linear, step-by-step style so it can double as training material and as the literal-included documentation example. The only local callbacks left in place are the small workflow request/response adapters and the condition-specific prompt builders passed into PromptWorkflowAgent(...).

Code#

examples/prompt_framing_study.py#
"""Canonical live strategy-comparison walkthrough for the umbrella package."""

from __future__ import annotations

import csv
import importlib.util
import json
import os
from pathlib import Path

import design_research as dr

# These constants keep the live walkthrough readable: one packaged problem, one
# study id, stable artifact paths, and the statistical settings used in the
# pairwise comparisons later on.
BASELINE_AGENT_ID = "SeededRandomBaselineAgent"
PROBLEM_ID = "decision_laptop_design_profit_maximization"
STUDY_ID = "prompt_strategy_comparison_study"
OUTPUT_DIR = Path("artifacts") / "examples" / STUDY_ID
SUMMARY_REPORT_NAME = "prompt_strategy_summary.md"
DEFAULT_REPLICATES_PER_CONDITION = 8
SIGNIFICANCE_ALPHA = 0.05
EXACT_PERMUTATION_THRESHOLD = 250_000
MONTE_CARLO_PERMUTATIONS = 20_000
PERMUTATION_TEST_SEED = 17
STRATEGY_ORDER = (BASELINE_AGENT_ID, "neutral_prompt", "profit_focus_prompt")
PAIRWISE_COMPARISONS = (
    ("profit_focus_prompt", "neutral_prompt"),
    ("neutral_prompt", BASELINE_AGENT_ID),
    ("profit_focus_prompt", BASELINE_AGENT_ID),
)


def main() -> None:
    """Run the live strategy-comparison walkthrough with managed llama.cpp."""
    # Read runtime settings from the environment and apply the example's default
    # replicate count when the user does not override it.
    runtime = llama_cpp_runtime_config(default_replicates=DEFAULT_REPLICATES_PER_CONDITION)

    # Load the packaged decision problem and derive the JSON candidate schema the
    # model-based agents should emit.
    packaged_problem = dr.problems.get_problem(PROBLEM_ID)
    candidate_schema = decision_candidate_schema(packaged_problem)

    # Build the recipe-defined study and then materialize its conditions. The
    # conditions encode one row per strategy/replicate combination.
    study = _build_study(replicates=int(runtime["replicates"]))
    conditions = dr.experiments.build_design(study)

    # Resolve the packaged problem once so every run pulls from the same
    # normalized problem packet.
    problem_registry = {PROBLEM_ID: dr.experiments.resolve_problem(PROBLEM_ID)}

    # Start a managed llama.cpp server client for the duration of the study.
    # The context manager handles startup/shutdown around the live run.
    with dr.agents.LlamaCppServerLLMClient(
        model=str(runtime["model_source"]),
        hf_model_repo_id=runtime["model_repo"],
        api_model=str(runtime["model_name"]),
        host=str(runtime["host"]),
        port=int(runtime["port"]),
        context_window=int(runtime["context_window"]),
    ) as llm_client:
        # Each `agent_id` in the strategy bundle maps either to a public agent
        # id resolved directly by experiments or to one explicit binding that
        # returns a prompt-driven workflow agent.
        agent_bindings = {
            # The neutral condition uses the live model but keeps the instruction
            # framing generic.
            "neutral_prompt": lambda _condition: dr.agents.PromptWorkflowAgent(
                workflow=build_json_model_workflow(
                    llm_client=llm_client,
                    candidate_schema=candidate_schema,
                    study_id=STUDY_ID,
                    problem_id=PROBLEM_ID,
                    fallback_model_name=str(runtime["model_name"]),
                    fallback_provider=str(runtime["provider_name"]),
                ),
                prompt_builder=lambda problem_packet, _run_spec, _condition: _strategy_prompt(
                    problem_packet,
                    instruction=(
                        "Condition: neutral prompt. Choose the best overall candidate using the "
                        "packaged demand and feasibility information."
                    ),
                ),
            ),
            # The profit-focused condition swaps only the framing instruction so
            # the study isolates prompt strategy rather than model identity.
            "profit_focus_prompt": lambda _condition: dr.agents.PromptWorkflowAgent(
                workflow=build_json_model_workflow(
                    llm_client=llm_client,
                    candidate_schema=candidate_schema,
                    study_id=STUDY_ID,
                    problem_id=PROBLEM_ID,
                    fallback_model_name=str(runtime["model_name"]),
                    fallback_provider=str(runtime["provider_name"]),
                ),
                prompt_builder=lambda problem_packet, _run_spec, _condition: _strategy_prompt(
                    problem_packet,
                    instruction=(
                        "Condition: profit-focus prompt. Prioritize choices that maximize "
                        "market share proxy and expected demand."
                    ),
                ),
            ),
        }

        # Execute the full study while the managed llama.cpp client is running.
        results = dr.experiments.run_study(
            study,
            conditions=conditions,
            agent_bindings=agent_bindings,
            problem_registry=problem_registry,
            checkpoint=False,
            show_progress=False,
        )

    # Export the standard analysis tables so the next steps can work from the
    # same artifacts users would inspect after their own runs.
    artifact_paths = dr.experiments.export_analysis_tables(
        study,
        conditions=conditions,
        run_results=results,
        output_dir=OUTPUT_DIR,
    )

    # Load only the CSVs we need for the walkthrough's reporting and statistical
    # comparison steps.
    exported_rows = load_analysis_exports(
        artifact_paths,
        names=("conditions.csv", "runs.csv", "evaluations.csv"),
    )

    # Confirm that the event-level export is structurally valid before building
    # downstream tables from it.
    validation_report = validate_exported_events(artifact_paths)

    # Build one condition-by-metric table for the primary outcome we care about
    # and another for a secondary business-facing metric.
    primary_metric_rows = dr.analysis.build_condition_metric_table(
        exported_rows["runs.csv"],
        metric="market_share_proxy",
        condition_column="agent_id",
        conditions=exported_rows["conditions.csv"],
        evaluations=exported_rows["evaluations.csv"],
    )
    demand_metric_rows = dr.analysis.build_condition_metric_table(
        exported_rows["runs.csv"],
        metric="expected_demand_units",
        condition_column="agent_id",
        conditions=exported_rows["conditions.csv"],
        evaluations=exported_rows["evaluations.csv"],
    )

    # Compare the strategy pairs using the analysis package's pairwise
    # permutation test helper.
    comparison_report = dr.analysis.compare_condition_pairs(
        primary_metric_rows,
        condition_pairs=PAIRWISE_COMPARISONS,
        alternative="greater",
        alpha=SIGNIFICANCE_ALPHA,
        exact_threshold=EXACT_PERMUTATION_THRESHOLD,
        n_permutations=MONTE_CARLO_PERMUTATIONS,
        seed=PERMUTATION_TEST_SEED,
    )

    # Convert the statistical report into rows that the experiments reporting
    # helpers can render alongside the study summary.
    significance_rows = comparison_report.to_significance_rows()

    # Write one consolidated markdown report that includes the study summary,
    # methods scaffold, variable codebook, and the pairwise comparison brief.
    summary_path = dr.experiments.write_markdown_report(
        study.output_dir,
        SUMMARY_REPORT_NAME,
        "\n\n".join(
            [
                dr.experiments.render_markdown_summary(study, results),
                dr.experiments.render_methods_scaffold(study),
                dr.experiments.render_codebook(study, conditions),
                comparison_report.render_brief(),
                dr.experiments.render_significance_brief(significance_rows),
            ]
        ),
    )

    # Collapse the metric tables to per-strategy means for a concise console
    # summary after the run finishes.
    primary_means = condition_means(primary_metric_rows)
    demand_means = condition_means(demand_metric_rows)
    successful_results = [result for result in results if result.status.value == "success"]

    # Fail loudly if the live walkthrough did not actually produce usable data.
    if not successful_results:
        raise RuntimeError("The live walkthrough completed without any successful runs.")
    if validation_report.errors:
        raise RuntimeError(
            "Unified event table validation failed:\n- " + "\n- ".join(validation_report.errors)
        )

    # Print a guided end-of-run summary so the console output doubles as a quick
    # tour of the artifacts and the headline comparison result.
    print("Problem:", PROBLEM_ID)
    print("Study:", study.study_id)
    print("Live provider:", runtime["provider_name"])
    print("Live model API name:", runtime["model_name"])
    print("Model source:", runtime["model_source"])
    print("Replicates per condition:", runtime["replicates"])
    print("Conditions:", len(conditions))
    print("Runs:", len(results), f"({len(successful_results)} success)")
    print("Condition means:")
    for strategy_name in STRATEGY_ORDER:
        print(
            f"  - agent_id={strategy_name}: "
            f"mean_market_share_proxy={primary_means.get(strategy_name, 0.0):.4f}, "
            f"mean_expected_demand_units={demand_means.get(strategy_name, 0.0):.0f}"
        )
    print(comparison_report.render_brief())
    print(dr.experiments.render_significance_brief(significance_rows))
    print("Event rows valid:", validation_report.is_valid, f"(rows={validation_report.n_rows})")
    print("Summary report:", summary_path)
    print("Artifacts:", artifact_names(artifact_paths))


def _build_study(*, replicates: int) -> object:
    """Build the live strategy-comparison recipe study."""
    # The recipe builder captures the study in one config object. The bundle says
    # which packaged problems and agent strategies participate; the run budget
    # says how many replicates to execute.
    return dr.experiments.build_strategy_comparison_study(
        dr.experiments.StrategyComparisonConfig(
            study_id=STUDY_ID,
            title="Prompt Strategy Comparison Study",
            description=(
                "Compare a seeded random baseline, a neutral prompt, and a profit-focused "
                "prompt on a packaged laptop-design decision problem."
            ),
            bundle=dr.experiments.BenchmarkBundle(
                bundle_id="live-strategy-comparison",
                name="Live Strategy Comparison Bundle",
                description="Packaged decision problem with three strategy bindings.",
                problem_ids=(PROBLEM_ID,),
                agent_specs=STRATEGY_ORDER,
            ),
            run_budget=dr.experiments.RunBudget(replicates=replicates, parallelism=1),
            output_dir=OUTPUT_DIR,
        )
    )


def _strategy_prompt(problem_packet: object, *, instruction: str) -> str:
    """Render one complete strategy prompt from the normalized problem packet."""
    # Compose the final prompt from a few readable pieces instead of one giant
    # literal string. That makes it easy to see which lines stay fixed across
    # conditions and which line changes with the strategy framing.
    return "\n".join(
        [
            "You are solving a packaged design-research decision problem.",
            "Read the problem brief and return exactly one JSON object candidate.",
            instruction,
            "",
            str(getattr(problem_packet, "brief", "")).strip(),
            "",
            "Return JSON only with no markdown fences and no extra commentary.",
        ]
    )


def read_csv_rows(path: Path) -> list[dict[str, str]]:
    """Read one exported CSV table into a list of row dictionaries."""
    with path.open("r", encoding="utf-8", newline="") as file_obj:
        return list(csv.DictReader(file_obj))


def load_analysis_exports(
    artifact_paths: dict[str, Path],
    *,
    names: tuple[str, ...],
) -> dict[str, list[dict[str, str]]]:
    """Load selected exported CSV artifacts into memory."""
    return {name: read_csv_rows(artifact_paths[name]) for name in names}


def validate_exported_events(artifact_paths: dict[str, Path]) -> object:
    """Validate the exported canonical event table through the analysis layer."""
    return dr.analysis.integration.validate_experiment_events(artifact_paths["events.csv"])


def artifact_names(artifact_paths: dict[str, Path]) -> str:
    """Return exported artifact filenames in stable sorted order."""
    return ", ".join(sorted(path.name for path in artifact_paths.values()))


def condition_means(rows: list[dict[str, object]]) -> dict[str, float]:
    """Compute one mean per condition label from normalized rows."""
    grouped: dict[str, list[float]] = {}
    for row in rows:
        grouped.setdefault(str(row["condition"]), []).append(float(row["value"]))
    return {
        condition: (sum(values) / len(values) if values else 0.0)
        for condition, values in grouped.items()
    }


def decision_candidate_schema(problem: object) -> dict[str, object]:
    """Build a JSON schema for discrete decision-factor candidates."""
    properties: dict[str, object] = {}
    required: list[str] = []
    for factor in getattr(problem, "option_factors", ()):
        levels = tuple(getattr(factor, "levels", ()))
        key = str(getattr(factor, "key", ""))
        if not key or not levels:
            continue
        properties[key] = {"type": "number", "enum": list(levels)}
        required.append(key)

    if not required:
        raise RuntimeError("Expected a packaged decision problem with explicit option factors.")

    return {
        "type": "object",
        "properties": properties,
        "required": required,
        "additionalProperties": False,
    }


def llama_cpp_runtime_config(*, default_replicates: int) -> dict[str, object]:
    """Resolve runtime configuration and fail fast on missing live dependencies."""
    missing_runtime = [
        module_name
        for module_name in ("llama_cpp", "fastapi", "uvicorn")
        if importlib.util.find_spec(module_name) is None
    ]
    if missing_runtime:
        raise RuntimeError(
            "Install llama-cpp-python[server] before running the live walkthrough. Missing: "
            + ", ".join(sorted(missing_runtime))
        )

    model_source = (
        os.getenv("LLAMA_CPP_MODEL", "Qwen2.5-1.5B-Instruct-Q4_K_M.gguf").strip()
        or "Qwen2.5-1.5B-Instruct-Q4_K_M.gguf"
    )
    model_repo = (
        os.getenv("LLAMA_CPP_HF_MODEL_REPO_ID", "bartowski/Qwen2.5-1.5B-Instruct-GGUF").strip()
        or None
    )
    if (
        model_repo
        and not Path(model_source).expanduser().exists()
        and importlib.util.find_spec("huggingface_hub") is None
    ):
        raise RuntimeError(
            "Install huggingface-hub or point LLAMA_CPP_MODEL at a local GGUF file before "
            "running the live walkthrough."
        )

    replicates = int(os.getenv("PROMPT_STUDY_REPLICATES", str(default_replicates)))
    if replicates < 2:
        raise RuntimeError("PROMPT_STUDY_REPLICATES must be at least 2.")

    return {
        "provider_name": "llama-cpp",
        "model_source": model_source,
        "model_name": os.getenv("LLAMA_CPP_API_MODEL", "qwen2-1.5b-q4").strip() or "qwen2-1.5b-q4",
        "model_repo": model_repo,
        "host": os.getenv("LLAMA_CPP_HOST", "127.0.0.1").strip() or "127.0.0.1",
        "port": int(os.getenv("LLAMA_CPP_PORT", "8001")),
        "context_window": int(os.getenv("LLAMA_CPP_CONTEXT_WINDOW", "4096")),
        "replicates": replicates,
    }


def build_json_model_workflow(
    *,
    llm_client: object,
    candidate_schema: dict[str, object],
    study_id: str,
    problem_id: str,
    fallback_model_name: str,
    fallback_provider: str,
) -> object:
    """Build one reusable prompt-mode workflow that returns structured JSON."""

    def request_builder(context: dict[str, object]) -> object:
        """Build one structured LLM request from the workflow context."""
        return dr.agents.LLMRequest(
            messages=[
                dr.agents.LLMMessage(
                    role="system",
                    content=(
                        "You are a careful study participant. Return valid JSON only and match "
                        "the requested schema exactly."
                    ),
                ),
                dr.agents.LLMMessage(role="user", content=str(context["prompt"])),
            ],
            temperature=0.0,
            max_tokens=400,
            response_schema=candidate_schema,
            metadata={"study_id": study_id, "problem_id": problem_id},
        )

    def response_parser(response: object, _context: dict[str, object]) -> dict[str, object]:
        """Parse one model response into workflow output, metrics, and events."""
        model_text = strip_markdown_fences(str(getattr(response, "text", "")).strip())
        candidate = json.loads(model_text)
        if not isinstance(candidate, dict):
            raise RuntimeError("Expected the live workflow to return one JSON object candidate.")
        provider = str(getattr(response, "provider", "") or fallback_provider)
        model_name = str(getattr(response, "model", "") or fallback_model_name)
        return {
            "final_output": candidate,
            "metrics": usage_metrics(getattr(response, "usage", None)),
            "events": [
                {
                    "event_type": "model_response",
                    "actor_id": "agent",
                    "text": model_text,
                    "meta_json": {"provider": provider, "model_name": model_name},
                }
            ],
        }

    return dr.agents.Workflow(
        steps=(
            dr.agents.ModelStep(
                step_id="select_candidate",
                llm_client=llm_client,
                request_builder=request_builder,
                response_parser=response_parser,
            ),
        ),
        output_schema=candidate_schema,
        default_request_id_prefix=study_id,
    )


def strip_markdown_fences(text: str) -> str:
    """Strip one optional fenced-code wrapper from a model response."""
    if not text.startswith("```"):
        return text
    lines = text.splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines).strip()


def usage_metrics(usage: object) -> dict[str, object]:
    """Normalize usage payloads into canonical metric names."""
    metrics: dict[str, object] = {"cost_usd": 0.0}
    if isinstance(usage, dict):
        prompt_tokens = usage.get("prompt_tokens")
        completion_tokens = usage.get("completion_tokens")
    else:
        prompt_tokens = getattr(usage, "prompt_tokens", None)
        completion_tokens = getattr(usage, "completion_tokens", None)
    if isinstance(prompt_tokens, int):
        metrics["input_tokens"] = prompt_tokens
    if isinstance(completion_tokens, int):
        metrics["output_tokens"] = completion_tokens
    return metrics


if __name__ == "__main__":
    main()
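The small pure helpers at the bottom of the listing can be sanity-checked in isolation. This snippet reproduces strip_markdown_fences and exercises it on a fenced and an unfenced model reply; the backtick literals are assembled indirectly only so the snippet itself nests cleanly inside this page:

```python
FENCE = "`" * 3  # a literal triple backtick, built indirectly for docs nesting


def strip_markdown_fences(text: str) -> str:
    """Strip one optional fenced-code wrapper from a model response."""
    if not text.startswith(FENCE):
        return text
    lines = text.splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].startswith(FENCE):
        lines = lines[:-1]
    return "\n".join(lines).strip()


# A fenced reply is unwrapped to its JSON body; unfenced text passes through.
fenced_reply = "\n".join([FENCE + "json", '{"cpu_tier": 2}', FENCE])
assert strip_markdown_fences(fenced_reply) == '{"cpu_tier": 2}'
assert strip_markdown_fences('{"cpu_tier": 2}') == '{"cpu_tier": 2}'
```

Checks like these are useful before a live run, since the response parser raises on anything that does not decode to one JSON object.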

When To Go Direct#

Use the umbrella package when you want one stable import surface for the ecosystem. Install a sibling package directly when you only need one layer or want package-specific internals. See Compatibility And Start Here for the tested version combination and install guidance.