LlamaCppServerLLMClient

LlamaCppServerLLMClient runs a managed local llama_cpp.server process.

Default behavior

  • Default GGUF artifact: Qwen2.5-1.5B-Instruct-Q4_K_M.gguf

  • Default API model name exposed to requests: qwen2-1.5b-q4

  • Local execution (no hosted API requirement)

Constructor-first usage

from design_research_agents import LlamaCppServerLLMClient
from design_research_agents.llm import LLMMessage, LLMRequest

with LlamaCppServerLLMClient() as client:
    response = client.generate(
        LLMRequest(
            messages=(LLMMessage(role="user", content="Summarize this paragraph."),),
            model=client.default_model(),
        )
    )

Prefer the context-manager form so the managed local server always shuts down deterministically. close() remains available for explicit lifecycle control.
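To make the two lifecycle styles concrete, here is a minimal sketch of the pattern. It uses a stand-in class rather than the real client (which needs a local GGUF model on disk); the stand-in's generate()/close()/__enter__/__exit__ shape mirrors the lifecycle described above but is illustrative, not the library's actual implementation.

```python
class _StubClient:
    """Stand-in for LlamaCppServerLLMClient: tracks whether shutdown ran."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt):
        return f"echo: {prompt}"

    def close(self):
        # The real client would terminate the managed llama_cpp.server process here.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
        return False


# Explicit lifecycle: try/finally guarantees close() even if generate() raises.
client = _StubClient()
try:
    out = client.generate("hello")
finally:
    client.close()

# Context-manager form: shutdown happens automatically in __exit__.
with _StubClient() as c:
    c.generate("hello")
```

Both paths end with the server stopped; the with-block simply moves the try/finally bookkeeping into the client itself, which is why it is the preferred form.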

Dependencies and environment

  • Install llama.cpp backend extras: pip install -e ".[llama_cpp]"

  • Ensure local model download/runtime prerequisites are available.

Model notes for local runs

  • Smaller quantized GGUF models (for example, 1B-3B parameters at 4-bit quantization) are best for fast iteration on laptops.

  • Increase context_window and model size only when your RAM/latency budget supports it.

  • Use Model Selection to enforce local-only behavior plus cost/latency constraints consistently across workflows.
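The RAM cost of raising context_window is dominated by the KV cache, which grows linearly with context length. A rough back-of-envelope sketch follows; the layer/head counts below are illustrative values for a small grouped-query-attention model, not numbers read from the default GGUF.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem=2):
    """Approximate KV-cache size: one K and one V tensor per layer, f16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem


# Illustrative small-model shape (assumed values, not the GGUF's actual config).
mib = kv_cache_bytes(n_layers=28, n_kv_heads=2, head_dim=128, ctx=4096) / 2**20
print(f"~{mib:.0f} MiB of KV cache at a 4096-token context")
```

Doubling ctx doubles this figure, which is why the notes above recommend growing the context window only when the RAM/latency budget supports it.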

Examples

  • examples/clients/llama_cpp_server_client.py

Attribution