# LlamaCppServerLLMClient

`LlamaCppServerLLMClient` runs a managed local `llama_cpp.server` process.
## Default behavior

- Default GGUF artifact: `Qwen2.5-1.5B-Instruct-Q4_K_M.gguf`
- Default API model name exposed to requests: `qwen2-1.5b-q4`
- Local execution (no hosted API requirement)
## Constructor-first usage

```python
from design_research_agents import LlamaCppServerLLMClient
from design_research_agents.llm import LLMMessage, LLMRequest

with LlamaCppServerLLMClient() as client:
    response = client.generate(
        LLMRequest(
            messages=(LLMMessage(role="user", content="Summarize this paragraph."),),
            model=client.default_model(),
        )
    )
```
Prefer the context-manager form so the managed local server always shuts down deterministically. `close()` remains available for explicit lifecycle control.
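When the client's lifetime cannot be scoped to a single `with` block, `contextlib.closing` from the standard library gives the same deterministic shutdown for any object exposing `close()`. A minimal sketch with a hypothetical stand-in class (the real `LlamaCppServerLLMClient` would be used the same way):

```python
from contextlib import closing


class ManagedProcess:
    """Hypothetical stand-in for any client exposing close(),
    such as LlamaCppServerLLMClient."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


# closing() wraps any object with a close() method in a context manager,
# guaranteeing close() runs even if the body raises.
proc = ManagedProcess()
with closing(proc):
    pass  # issue requests here

print(proc.closed)  # -> True
```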
## Dependencies and environment

Install the llama.cpp backend extras:

```shell
pip install -e ".[llama_cpp]"
```

Ensure local model download/runtime prerequisites are available.
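As a sanity check after installing the extra, you can verify that the `llama_cpp` package is importable before starting the managed server. This helper is illustrative, not part of the client API:

```python
import importlib.util


def backend_available() -> bool:
    """Return True when the llama_cpp package (from the [llama_cpp] extra)
    is importable in the current environment."""
    return importlib.util.find_spec("llama_cpp") is not None


# True in an environment where the extra is installed, False otherwise.
print(backend_available())
```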
## Model notes for local runs

- Smaller quantized GGUF models (for example, 1B-3B at 4-bit) are best for fast iteration on laptops.
- Increase `context_window` and model size only when your RAM/latency budget supports it.
- Use Model Selection to enforce local-only behavior plus cost/latency constraints consistently across workflows.
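For a rough RAM budget, quantized weight size is approximately parameters × bits-per-weight / 8. This is an approximation only: GGUF files add metadata, K-quants mix precisions, and the KV cache grows with `context_window` on top of the weights.

```python
def approx_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Back-of-envelope GGUF weight size in GB (decimal),
    ignoring file metadata and the KV cache."""
    # params_billions * 1e9 params * (bits / 8) bytes, expressed in GB.
    return params_billions * bits_per_weight / 8


# A 1.5B model at ~4.5 bits/weight (in the ballpark of Q4_K_M quantization)
# is on the order of 0.8 GB of weights, which is why such models suit
# laptop iteration.
print(round(approx_weight_gb(1.5, 4.5), 2))  # -> 0.84
```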
## Examples

- `examples/clients/llama_cpp_server_client.py`
## Attribution

- Docs: llama.cpp server usage
- Homepage: llama.cpp GitHub