LlamaCppServerLLMClient

LlamaCppServerLLMClient runs a managed local llama_cpp.server process.

Default behavior

  • Default GGUF artifact: Qwen2.5-1.5B-Instruct-Q4_K_M.gguf

  • Default API model name exposed to requests: qwen2-1.5b-q4

  • Local execution (no hosted API requirement)

Constructor-first usage

from design_research_agents import LlamaCppServerLLMClient
from design_research_agents.llm import LLMMessage, LLMRequest

with LlamaCppServerLLMClient() as client:
    response = client.generate(
        LLMRequest(
            messages=(LLMMessage(role="user", content="Summarize this paragraph."),),
            model=client.default_model(),
        )
    )

Prefer the context-manager form so the managed local server always shuts down deterministically. close() remains available for explicit lifecycle control.
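To make the two lifecycle styles concrete, here is a minimal sketch of the pattern. It uses a stand-in class rather than the real client (which needs a local GGUF model on disk); the stand-in's generate()/close()/__enter__/__exit__ shape mirrors the lifecycle described above but is illustrative, not the library's actual implementation.

```python
class _StubClient:
    """Stand-in for LlamaCppServerLLMClient: tracks whether shutdown ran."""

    def __init__(self):
        self.closed = False

    def generate(self, prompt):
        return f"echo: {prompt}"

    def close(self):
        # The real client would terminate the managed llama_cpp.server process here.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
        return False


# Explicit lifecycle: try/finally guarantees close() even if generate() raises.
client = _StubClient()
try:
    out = client.generate("hello")
finally:
    client.close()

# Context-manager form: shutdown happens automatically in __exit__.
with _StubClient() as c:
    c.generate("hello")
```

Both paths end with the server stopped; the with-block simply moves the try/finally bookkeeping into the client itself, which is why it is the preferred form.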

Dependencies and environment

  • Install llama.cpp backend extras: pip install -e ".[llama_cpp]"

  • Ensure local model download/runtime prerequisites are available.

Model notes for local runs

  • Smaller quantized GGUF models (for example, 1B-3B parameters at 4-bit quantization) are best for fast iteration on laptops.

  • Increase context_window and model size only when your RAM/latency budget supports it.

  • Use Model Selection to enforce local-only behavior plus cost/latency constraints consistently across workflows.
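The RAM cost of raising context_window is dominated by the KV cache, which grows linearly with context length. A rough back-of-envelope sketch follows; the layer/head counts below are illustrative values for a small grouped-query-attention model, not numbers read from the default GGUF.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem=2):
    """Approximate KV-cache size: one K and one V tensor per layer, f16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem


# Illustrative small-model shape (assumed values, not the GGUF's actual config).
mib = kv_cache_bytes(n_layers=28, n_kv_heads=2, head_dim=128, ctx=4096) / 2**20
print(f"~{mib:.0f} MiB of KV cache at a 4096-token context")
```

Doubling ctx doubles this figure, which is why the notes above recommend growing the context window only when the RAM/latency budget supports it.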

Examples

  • examples/clients/llama_cpp_server_client.py

Attribution