LLM Modules
This page documents the stable public LLM facades and client entry points.
- class design_research_agents.llm.AnthropicServiceLLMClient(*, name='anthropic', default_model='claude-3-5-haiku-latest', api_key_env='ANTHROPIC_API_KEY', api_key=None, base_url=None, max_retries=2, model_patterns=None)[source]
Client for the official Anthropic API backend.
Initialize an Anthropic service client with sensible defaults.
- class design_research_agents.llm.AzureOpenAIServiceLLMClient(*, name='azure-openai', default_model='gpt-4o-mini', api_key_env='AZURE_OPENAI_API_KEY', api_key=None, azure_endpoint_env='AZURE_OPENAI_ENDPOINT', azure_endpoint=None, api_version_env='AZURE_OPENAI_API_VERSION', api_version=None, max_retries=2, model_patterns=None)[source]
Client for the Azure OpenAI API via the official OpenAI SDK.
Initialize an Azure OpenAI service client with sensible defaults.
- class design_research_agents.llm.GeminiServiceLLMClient(*, name='gemini', default_model='gemini-2.5-flash', api_key_env='GOOGLE_API_KEY', api_key=None, max_retries=2, model_patterns=None)[source]
Client for the official Gemini API backend.
Initialize a Gemini service client with sensible defaults.
- class design_research_agents.llm.GroqServiceLLMClient(*, name='groq', default_model='llama-3.1-8b-instant', api_key_env='GROQ_API_KEY', api_key=None, base_url=None, max_retries=2, model_patterns=None)[source]
Client for the official Groq API backend.
Initialize a Groq service client with sensible defaults.
- class design_research_agents.llm.LLMMessage(*, role, content, name=None, tool_call_id=None, tool_name=None)[source]
One chat message in the provider-neutral completion format.
- content
Plain-text message content.
- name
Optional participant name, when supported by the provider.
- role
Message role used by chat-compatible backends.
- tool_call_id
Tool call identifier for tool-response messages.
- tool_name
Tool name associated with a tool-response message.
- class design_research_agents.llm.LLMRequest(*, messages, model=None, temperature=None, max_tokens=None, tools=(), response_schema=None, response_format=None, metadata=<factory>, provider_options=<factory>, task_profile=None)[source]
Provider-neutral request payload for LLM generation.
- max_tokens
Maximum output token limit.
- messages
Ordered conversation/messages sent to the model.
- metadata
Caller metadata forwarded for tracing and diagnostics.
- model
Explicit model identifier override for this request.
- provider_options
Backend/provider-specific low-level options.
- response_format
Provider-specific response-format hints.
- response_schema
Optional schema for structured output validation.
- task_profile
Optional routing profile used by selector-aware clients.
- temperature
Sampling temperature override.
- tools
Tool specifications exposed for model tool-calling.
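The request fields above map almost one-to-one onto an OpenAI-style chat payload. The sketch below is illustrative only: the mapping function and the payload layout are assumptions about one common backend shape, not this library's internals.

```python
def to_chat_payload(request: dict) -> dict:
    """Illustrative mapping from the provider-neutral request fields to an
    OpenAI-style chat-completions payload. Hypothetical; not library code."""
    payload = {
        "model": request.get("model") or "default-model",
        "messages": [
            {"role": m["role"], "content": m["content"]}
            for m in request["messages"]
        ],
    }
    # Optional overrides are forwarded only when the caller set them.
    if request.get("temperature") is not None:
        payload["temperature"] = request["temperature"]
    if request.get("max_tokens") is not None:
        payload["max_tokens"] = request["max_tokens"]
    if request.get("tools"):
        payload["tools"] = list(request["tools"])
    return payload

payload = to_chat_payload({
    "messages": [
        {"role": "system", "content": "You are terse."},
        {"role": "user", "content": "Say hi."},
    ],
    "temperature": 0.2,
    "max_tokens": 64,
})
```

Unset fields stay out of the payload entirely, which matches the "override" framing of the documented fields.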
- class design_research_agents.llm.LLMResponse(*, text, model=None, provider=None, finish_reason=None, usage=None, latency_ms=None, raw_output=None, tool_calls=(), raw=None, provenance=None)[source]
Normalized non-streaming response payload returned by a backend.
- finish_reason
Provider-specific completion reason.
- latency_ms
End-to-end latency in milliseconds.
- model
Model identifier reported by the backend.
- provenance
Execution provenance metadata for auditability.
- provider
Provider/backend name that produced this response.
- raw
Canonical raw backend payload snapshot.
- raw_output
Legacy/raw backend payload for debugging.
- text
Primary response text emitted by the model.
- tool_calls
Tool calls requested by the model in this response.
- usage
Token usage counters when available.
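Conversely, normalizing a raw backend payload into the fields documented above could look like the following sketch. The field names on the raw side assume an OpenAI-style completion; other backends differ.

```python
def normalize_response(raw: dict, provider: str) -> dict:
    """Extract the documented response fields (text, model, provider,
    finish_reason, usage, raw) from an OpenAI-style payload.
    Illustrative only; not this library's normalizer."""
    choice = raw["choices"][0]
    return {
        "text": choice["message"]["content"],
        "model": raw.get("model"),
        "provider": provider,
        "finish_reason": choice.get("finish_reason"),
        "usage": raw.get("usage"),
        "raw": raw,  # keep the canonical raw snapshot for auditability
    }

resp = normalize_response(
    {
        "model": "gpt-4o-mini",
        "choices": [{"message": {"content": "hi"}, "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 9, "completion_tokens": 1},
    },
    provider="openai",
)
```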
- class design_research_agents.llm.LlamaCppServerLLMClient(*, name='llama-local', model='Qwen2.5-1.5B-Instruct-Q4_K_M.gguf', hf_model_repo_id='bartowski/Qwen2.5-1.5B-Instruct-GGUF', api_model='qwen2-1.5b-q4', host='127.0.0.1', port=8001, context_window=4096, startup_timeout_seconds=60.0, request_timeout_seconds=60.0, poll_interval_seconds=0.25, python_executable='/opt/hostedtoolcache/Python/3.12.13/x64/bin/python3', extra_server_args=(), max_retries=2, model_patterns=None)[source]
Client for a managed local llama_cpp.server backend.
Initialize a local llama-cpp client with sensible defaults.
- Parameters:
name – Logical name for this client instance, used in logging and provenance.
model – Local model identifier or path for llama_cpp.server to load.
hf_model_repo_id – Optional Hugging Face repo ID to auto-download the model from if not found locally.
api_model – The model name to report in API responses, which can differ from the local model name.
host – Host interface for the local server to bind to.
port – Port for the local server to listen on.
context_window – Context window size (n_ctx) to configure the llama_cpp.server with.
startup_timeout_seconds – Max time to wait for the server process to start and become healthy.
request_timeout_seconds – HTTP timeout for generate and stream requests.
poll_interval_seconds – Time interval between health check polls during startup.
python_executable – Python executable to use for running the server process.
extra_server_args – Additional command-line arguments to pass when starting the server process.
max_retries – Number of times to retry a request in case of failure before giving up.
model_patterns – Optional tuple of model name patterns supported by this client, used for routing decisions. If None, defaults to (api_model,).
- close()[source]
Stop the managed local server process.
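The startup_timeout_seconds / poll_interval_seconds pair describes a standard readiness-polling loop. A generic sketch of that loop follows; the probe callable is a placeholder, not this client's actual health check.

```python
import time

def wait_until_healthy(probe, timeout_s: float, interval_s: float) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses.
    Mirrors the documented startup_timeout_seconds /
    poll_interval_seconds semantics; illustrative only."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

# A fake probe that becomes healthy on its third poll.
calls = {"n": 0}
def fake_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

ok = wait_until_healthy(fake_probe, timeout_s=5.0, interval_s=0.01)
```

The same loop shape applies to the other managed-server clients (Ollama, SGLang, vLLM), which expose the same two parameters.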
- class design_research_agents.llm.MLXLocalLLMClient(*, name='mlx-local', model_id='mlx-community/Qwen2.5-1.5B-Instruct-4bit', default_model='mlx-community/Qwen2.5-1.5B-Instruct-4bit', quantization='none', max_retries=2, model_patterns=None)[source]
Client for Apple MLX local inference.
Initialize an MLX local client with sensible defaults.
- Parameters:
name – Logical name for this client instance, used in logging and provenance.
model_id – Identifier for the MLX model to load (e.g. “mlx-community/Qwen2.5-1.5B-Instruct-4bit”).
default_model – Default model name for prompts that don’t specify one.
quantization – Quantization level to use when loading the model (e.g. “4-bit”, “8-bit”, “fp16”).
max_retries – Number of times to retry a request in case of failure before giving up.
model_patterns – Optional tuple of model name patterns supported by this client, used for routing decisions. If None, defaults to (default_model,).
- class design_research_agents.llm.OllamaLLMClient(*, name='ollama-local', default_model='qwen2.5:1.5b-instruct', host='127.0.0.1', port=11434, manage_server=True, ollama_executable='ollama', auto_pull_model=False, startup_timeout_seconds=60.0, poll_interval_seconds=0.25, request_timeout_seconds=60.0, max_retries=2, model_patterns=None)[source]
Client for local or self-hosted Ollama chat inference.
Initialize an Ollama client in managed-server or connect mode.
- Parameters:
name – Logical name for this client instance.
default_model – Default model id used when requests omit model.
host – Host interface used in managed mode or connect mode.
port – TCP port used in managed mode or connect mode.
manage_server – Whether this client manages the ollama serve lifecycle.
ollama_executable – Executable used to invoke ollama commands.
auto_pull_model – Whether to pull default_model after startup.
startup_timeout_seconds – Maximum startup wait time in managed mode.
poll_interval_seconds – Delay between readiness probes in managed mode.
request_timeout_seconds – HTTP timeout for generate and stream requests.
max_retries – Number of retries for retryable provider/transport errors.
model_patterns – Optional tuple of model patterns for routing decisions.
- close()[source]
Stop the managed Ollama daemon when present.
- class design_research_agents.llm.OpenAICompatibleHTTPLLMClient(*, name='openai-compatible', base_url='http://127.0.0.1:8001/v1', default_model='qwen2-1.5b-q4', api_key_env='OPENAI_API_KEY', api_key=None, max_retries=2, model_patterns=None)[source]
Client for OpenAI-compatible HTTP endpoints.
Initialize an OpenAI-compatible HTTP client with sensible defaults.
- class design_research_agents.llm.OpenAIServiceLLMClient(*, name='openai', default_model='gpt-4o-mini', api_key_env='OPENAI_API_KEY', api_key=None, base_url=None, max_retries=2, model_patterns=None)[source]
Client for the official OpenAI API backend.
Initialize an OpenAI service client with sensible defaults.
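Every client above accepts a max_retries count for retryable provider/transport errors. One plausible reading of that parameter is sketched here; the actual retry and backoff policy is internal and may differ, and the RuntimeError stand-in is an assumption.

```python
def call_with_retries(fn, max_retries: int):
    """Attempt `fn` once, then retry up to `max_retries` more times on
    failure, so max_retries=2 allows three attempts total.
    Illustrative only; not the library's retry policy."""
    last_exc = None
    for _attempt in range(max_retries + 1):
        try:
            return fn()
        except RuntimeError as exc:  # stand-in for a retryable error class
            last_exc = exc
    raise last_exc

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

result = call_with_retries(flaky, max_retries=2)
```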
- class design_research_agents.llm.SGLangServerLLMClient(*, name='sglang-local', model='Qwen/Qwen2.5-1.5B-Instruct', host='127.0.0.1', port=30000, manage_server=True, startup_timeout_seconds=90.0, poll_interval_seconds=0.5, python_executable='/opt/hostedtoolcache/Python/3.12.13/x64/bin/python3', extra_server_args=(), base_url=None, request_timeout_seconds=60.0, max_retries=2, model_patterns=None)[source]
Client for local or self-hosted SGLang OpenAI-compatible inference.
Initialize an SGLang client in managed-server or connect mode.
- Parameters:
name – Logical name for this client instance.
model – Model identifier passed to managed SGLang server startup.
host – Host interface used in managed mode.
port – TCP port used in managed mode.
manage_server – Whether this client manages the SGLang server lifecycle.
startup_timeout_seconds – Maximum startup wait time in managed mode.
poll_interval_seconds – Delay between readiness probes in managed mode.
python_executable – Python executable used to launch managed SGLang process.
extra_server_args – Additional CLI flags forwarded to SGLang server.
base_url – Optional connect-mode endpoint URL. Required only for remote/self-managed deployments; defaults to http://{host}:{port}/v1.
request_timeout_seconds – HTTP timeout for generate and stream requests.
max_retries – Number of retries for retryable provider/transport errors.
model_patterns – Optional tuple of model patterns for routing decisions.
- Raises:
ValueError – If manage_server and base_url are both configured.
- close()[source]
Stop the managed SGLang server process when present.
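The base_url default and the ValueError above amount to a small configuration rule, sketched below with a hypothetical helper (not library code): managed mode and an explicit base_url are mutually exclusive, and connect mode falls back to http://{host}:{port}/v1.

```python
def resolve_base_url(manage_server: bool, base_url, host: str, port: int) -> str:
    """Resolve the endpoint URL under the documented rule.
    Illustrative only."""
    if manage_server and base_url is not None:
        raise ValueError("manage_server and base_url are mutually exclusive")
    return base_url or f"http://{host}:{port}/v1"

default_url = resolve_base_url(True, None, "127.0.0.1", 30000)
```

The VLLMServerLLMClient below documents the same rule with its own defaults (port 8002).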
- class design_research_agents.llm.TransformersLocalLLMClient(*, name='transformers-local', model_id='distilgpt2', default_model='distilgpt2', device='auto', dtype='auto', quantization='none', trust_remote_code=False, revision=None, max_retries=2, model_patterns=None)[source]
Client for in-process Transformers local inference.
Initialize a local Transformers client with sensible defaults.
- Parameters:
name – Logical name for this client instance, used in logging and provenance.
model_id – Identifier for the model to load (e.g. “distilgpt2” or a Hugging Face repo ID like “gpt2”).
default_model – Default model name for prompts that don’t specify one.
device – Device to load the model on (e.g. “cpu”, “cuda”, “mps”, or “auto” to automatically select based on availability).
dtype – Data type to use for model weights (e.g. “float16”, “bfloat16”, “int8”, or “auto” to automatically select based on device).
quantization – Quantization level to use when loading the model (e.g. “4-bit”, “8-bit”, “fp16”, or “none” for no quantization).
trust_remote_code – Whether to allow execution of custom code from remote repositories when loading models, which may be required for some models but can be a security risk.
revision – Optional model revision to load (e.g. a git branch, tag, or commit hash), if the model is being loaded from a Hugging Face repository that has multiple revisions.
max_retries – Number of times to retry a request in case of failure before giving up.
model_patterns – Optional tuple of model name patterns supported by this client, used for routing decisions. If None, defaults to (default_model,).
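Several clients document model_patterns as patterns "used for routing decisions", defaulting to a one-element tuple containing the default model. One way such matching could work is with fnmatch-style globs; the glob semantics are an assumption, not a documented behavior.

```python
from fnmatch import fnmatch

def supports_model(model: str, model_patterns, default_model: str) -> bool:
    """Return True when `model` matches one of the client's patterns.
    Falls back to (default_model,) when model_patterns is None, as the
    parameter docs describe. The glob matching rule is an assumption."""
    patterns = model_patterns if model_patterns is not None else (default_model,)
    return any(fnmatch(model, p) for p in patterns)

hit = supports_model("distilgpt2", None, "distilgpt2")
miss = supports_model("gpt2-xl", None, "distilgpt2")
glob_hit = supports_model("qwen2.5:1.5b-instruct", ("qwen*",), "x")
```

A selector-aware router could call such a predicate per client to pick a backend for a given request's model.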
- class design_research_agents.llm.VLLMServerLLMClient(*, name='vllm-local', model='Qwen/Qwen2.5-1.5B-Instruct', api_model='qwen2.5-1.5b-instruct', host='127.0.0.1', port=8002, manage_server=True, startup_timeout_seconds=90.0, poll_interval_seconds=0.5, python_executable='/opt/hostedtoolcache/Python/3.12.13/x64/bin/python3', extra_server_args=(), base_url=None, request_timeout_seconds=60.0, max_retries=2, model_patterns=None)[source]
Client for local or self-hosted vLLM OpenAI-compatible inference.
Initialize a vLLM client in managed-server or connect mode.
- Parameters:
name – Logical name for this client instance.
model – Model identifier passed to managed vLLM server startup.
api_model – Model alias exposed by vLLM OpenAI-compatible API.
host – Host interface used in managed mode.
port – TCP port used in managed mode.
manage_server – Whether this client manages the vLLM server lifecycle.
startup_timeout_seconds – Maximum startup wait time in managed mode.
poll_interval_seconds – Delay between readiness probes in managed mode.
python_executable – Python executable used to launch managed vLLM process.
extra_server_args – Additional CLI flags forwarded to vLLM server.
base_url – Optional connect-mode endpoint URL. Required only for remote/self-managed deployments; defaults to http://{host}:{port}/v1.
request_timeout_seconds – HTTP timeout for generate and stream requests.
max_retries – Number of retries for retryable provider/transport errors.
model_patterns – Optional tuple of model patterns for routing decisions.
- Raises:
ValueError – If manage_server and base_url are both configured.
- close()[source]
Stop the managed vLLM server process when present.
Stable public LLM client classes backed by internal provider wrappers.
- class design_research_agents.llm.clients.AnthropicServiceLLMClient(*, name='anthropic', default_model='claude-3-5-haiku-latest', api_key_env='ANTHROPIC_API_KEY', api_key=None, base_url=None, max_retries=2, model_patterns=None)[source]
Client for the official Anthropic API backend.
Initialize an Anthropic service client with sensible defaults.
- class design_research_agents.llm.clients.AzureOpenAIServiceLLMClient(*, name='azure-openai', default_model='gpt-4o-mini', api_key_env='AZURE_OPENAI_API_KEY', api_key=None, azure_endpoint_env='AZURE_OPENAI_ENDPOINT', azure_endpoint=None, api_version_env='AZURE_OPENAI_API_VERSION', api_version=None, max_retries=2, model_patterns=None)[source]
Client for the Azure OpenAI API via the official OpenAI SDK.
Initialize an Azure OpenAI service client with sensible defaults.
- class design_research_agents.llm.clients.GeminiServiceLLMClient(*, name='gemini', default_model='gemini-2.5-flash', api_key_env='GOOGLE_API_KEY', api_key=None, max_retries=2, model_patterns=None)[source]
Client for the official Gemini API backend.
Initialize a Gemini service client with sensible defaults.
- class design_research_agents.llm.clients.GroqServiceLLMClient(*, name='groq', default_model='llama-3.1-8b-instant', api_key_env='GROQ_API_KEY', api_key=None, base_url=None, max_retries=2, model_patterns=None)[source]
Client for the official Groq API backend.
Initialize a Groq service client with sensible defaults.
- class design_research_agents.llm.clients.LlamaCppServerLLMClient(*, name='llama-local', model='Qwen2.5-1.5B-Instruct-Q4_K_M.gguf', hf_model_repo_id='bartowski/Qwen2.5-1.5B-Instruct-GGUF', api_model='qwen2-1.5b-q4', host='127.0.0.1', port=8001, context_window=4096, startup_timeout_seconds=60.0, request_timeout_seconds=60.0, poll_interval_seconds=0.25, python_executable='/opt/hostedtoolcache/Python/3.12.13/x64/bin/python3', extra_server_args=(), max_retries=2, model_patterns=None)[source]
Client for a managed local llama_cpp.server backend.
Initialize a local llama-cpp client with sensible defaults.
- Parameters:
name – Logical name for this client instance, used in logging and provenance.
model – Local model identifier or path for llama_cpp.server to load.
hf_model_repo_id – Optional Hugging Face repo ID to auto-download the model from if not found locally.
api_model – The model name to report in API responses, which can differ from the local model name.
host – Host interface for the local server to bind to.
port – Port for the local server to listen on.
context_window – Context window size (n_ctx) to configure the llama_cpp.server with.
startup_timeout_seconds – Max time to wait for the server process to start and become healthy.
request_timeout_seconds – HTTP timeout for generate and stream requests.
poll_interval_seconds – Time interval between health check polls during startup.
python_executable – Python executable to use for running the server process.
extra_server_args – Additional command-line arguments to pass when starting the server process.
max_retries – Number of times to retry a request in case of failure before giving up.
model_patterns – Optional tuple of model name patterns supported by this client, used for routing decisions. If None, defaults to (api_model,).
- close()[source]
Stop the managed local server process.
- class design_research_agents.llm.clients.MLXLocalLLMClient(*, name='mlx-local', model_id='mlx-community/Qwen2.5-1.5B-Instruct-4bit', default_model='mlx-community/Qwen2.5-1.5B-Instruct-4bit', quantization='none', max_retries=2, model_patterns=None)[source]
Client for Apple MLX local inference.
Initialize an MLX local client with sensible defaults.
- Parameters:
name – Logical name for this client instance, used in logging and provenance.
model_id – Identifier for the MLX model to load (e.g. “mlx-community/Qwen2.5-1.5B-Instruct-4bit”).
default_model – Default model name for prompts that don’t specify one.
quantization – Quantization level to use when loading the model (e.g. “4-bit”, “8-bit”, “fp16”).
max_retries – Number of times to retry a request in case of failure before giving up.
model_patterns – Optional tuple of model name patterns supported by this client, used for routing decisions. If None, defaults to (default_model,).
- class design_research_agents.llm.clients.OllamaLLMClient(*, name='ollama-local', default_model='qwen2.5:1.5b-instruct', host='127.0.0.1', port=11434, manage_server=True, ollama_executable='ollama', auto_pull_model=False, startup_timeout_seconds=60.0, poll_interval_seconds=0.25, request_timeout_seconds=60.0, max_retries=2, model_patterns=None)[source]
Client for local or self-hosted Ollama chat inference.
Initialize an Ollama client in managed-server or connect mode.
- Parameters:
name – Logical name for this client instance.
default_model – Default model id used when requests omit model.
host – Host interface used in managed mode or connect mode.
port – TCP port used in managed mode or connect mode.
manage_server – Whether this client manages the ollama serve lifecycle.
ollama_executable – Executable used to invoke ollama commands.
auto_pull_model – Whether to pull default_model after startup.
startup_timeout_seconds – Maximum startup wait time in managed mode.
poll_interval_seconds – Delay between readiness probes in managed mode.
request_timeout_seconds – HTTP timeout for generate and stream requests.
max_retries – Number of retries for retryable provider/transport errors.
model_patterns – Optional tuple of model patterns for routing decisions.
- close()[source]
Stop the managed Ollama daemon when present.
- class design_research_agents.llm.clients.OpenAICompatibleHTTPLLMClient(*, name='openai-compatible', base_url='http://127.0.0.1:8001/v1', default_model='qwen2-1.5b-q4', api_key_env='OPENAI_API_KEY', api_key=None, max_retries=2, model_patterns=None)[source]
Client for OpenAI-compatible HTTP endpoints.
Initialize an OpenAI-compatible HTTP client with sensible defaults.
- class design_research_agents.llm.clients.OpenAIServiceLLMClient(*, name='openai', default_model='gpt-4o-mini', api_key_env='OPENAI_API_KEY', api_key=None, base_url=None, max_retries=2, model_patterns=None)[source]
Client for the official OpenAI API backend.
Initialize an OpenAI service client with sensible defaults.
- class design_research_agents.llm.clients.SGLangServerLLMClient(*, name='sglang-local', model='Qwen/Qwen2.5-1.5B-Instruct', host='127.0.0.1', port=30000, manage_server=True, startup_timeout_seconds=90.0, poll_interval_seconds=0.5, python_executable='/opt/hostedtoolcache/Python/3.12.13/x64/bin/python3', extra_server_args=(), base_url=None, request_timeout_seconds=60.0, max_retries=2, model_patterns=None)[source]
Client for local or self-hosted SGLang OpenAI-compatible inference.
Initialize an SGLang client in managed-server or connect mode.
- Parameters:
name – Logical name for this client instance.
model – Model identifier passed to managed SGLang server startup.
host – Host interface used in managed mode.
port – TCP port used in managed mode.
manage_server – Whether this client manages the SGLang server lifecycle.
startup_timeout_seconds – Maximum startup wait time in managed mode.
poll_interval_seconds – Delay between readiness probes in managed mode.
python_executable – Python executable used to launch managed SGLang process.
extra_server_args – Additional CLI flags forwarded to SGLang server.
base_url – Optional connect-mode endpoint URL. Required only for remote/self-managed deployments; defaults to http://{host}:{port}/v1.
request_timeout_seconds – HTTP timeout for generate and stream requests.
max_retries – Number of retries for retryable provider/transport errors.
model_patterns – Optional tuple of model patterns for routing decisions.
- Raises:
ValueError – If manage_server and base_url are both configured.
- close()[source]
Stop the managed SGLang server process when present.
- class design_research_agents.llm.clients.TransformersLocalLLMClient(*, name='transformers-local', model_id='distilgpt2', default_model='distilgpt2', device='auto', dtype='auto', quantization='none', trust_remote_code=False, revision=None, max_retries=2, model_patterns=None)[source]
Client for in-process Transformers local inference.
Initialize a local Transformers client with sensible defaults.
- Parameters:
name – Logical name for this client instance, used in logging and provenance.
model_id – Identifier for the model to load (e.g. “distilgpt2” or a Hugging Face repo ID like “gpt2”).
default_model – Default model name for prompts that don’t specify one.
device – Device to load the model on (e.g. “cpu”, “cuda”, “mps”, or “auto” to automatically select based on availability).
dtype – Data type to use for model weights (e.g. “float16”, “bfloat16”, “int8”, or “auto” to automatically select based on device).
quantization – Quantization level to use when loading the model (e.g. “4-bit”, “8-bit”, “fp16”, or “none” for no quantization).
trust_remote_code – Whether to allow execution of custom code from remote repositories when loading models, which may be required for some models but can be a security risk.
revision – Optional model revision to load (e.g. a git branch, tag, or commit hash), if the model is being loaded from a Hugging Face repository that has multiple revisions.
max_retries – Number of times to retry a request in case of failure before giving up.
model_patterns – Optional tuple of model name patterns supported by this client, used for routing decisions. If None, defaults to (default_model,).
- class design_research_agents.llm.clients.VLLMServerLLMClient(*, name='vllm-local', model='Qwen/Qwen2.5-1.5B-Instruct', api_model='qwen2.5-1.5b-instruct', host='127.0.0.1', port=8002, manage_server=True, startup_timeout_seconds=90.0, poll_interval_seconds=0.5, python_executable='/opt/hostedtoolcache/Python/3.12.13/x64/bin/python3', extra_server_args=(), base_url=None, request_timeout_seconds=60.0, max_retries=2, model_patterns=None)[source]
Client for local or self-hosted vLLM OpenAI-compatible inference.
Initialize a vLLM client in managed-server or connect mode.
- Parameters:
name – Logical name for this client instance.
model – Model identifier passed to managed vLLM server startup.
api_model – Model alias exposed by vLLM OpenAI-compatible API.
host – Host interface used in managed mode.
port – TCP port used in managed mode.
manage_server – Whether this client manages the vLLM server lifecycle.
startup_timeout_seconds – Maximum startup wait time in managed mode.
poll_interval_seconds – Delay between readiness probes in managed mode.
python_executable – Python executable used to launch managed vLLM process.
extra_server_args – Additional CLI flags forwarded to vLLM server.
base_url – Optional connect-mode endpoint URL. Required only for remote/self-managed deployments; defaults to http://{host}:{port}/v1.
request_timeout_seconds – HTTP timeout for generate and stream requests.
max_retries – Number of retries for retryable provider/transport errors.
model_patterns – Optional tuple of model patterns for routing decisions.
- Raises:
ValueError – If manage_server and base_url are both configured.
- close()[source]
Stop the managed vLLM server process when present.