LLM Modules
This page documents the stable public LLM facades and client entry points.
- class design_research_agents.llm.AnthropicServiceLLMClient(*, name='anthropic', default_model='claude-3-5-haiku-latest', api_key_env='ANTHROPIC_API_KEY', api_key=None, base_url=None, max_retries=2, model_patterns=None)[source]
Client for the official Anthropic API backend.
Initialize an Anthropic service client with sensible defaults.
- class design_research_agents.llm.AzureOpenAIServiceLLMClient(*, name='azure-openai', default_model='gpt-4o-mini', api_key_env='AZURE_OPENAI_API_KEY', api_key=None, azure_endpoint_env='AZURE_OPENAI_ENDPOINT', azure_endpoint=None, api_version_env='AZURE_OPENAI_API_VERSION', api_version=None, max_retries=2, model_patterns=None)[source]
Client for the Azure OpenAI API via the official OpenAI SDK.
Initialize an Azure OpenAI service client with sensible defaults.
- class design_research_agents.llm.GeminiServiceLLMClient(*, name='gemini', default_model='gemini-2.5-flash', api_key_env='GOOGLE_API_KEY', api_key=None, max_retries=2, model_patterns=None)[source]
Client for the official Gemini API backend.
Initialize a Gemini service client with sensible defaults.
- class design_research_agents.llm.GroqServiceLLMClient(*, name='groq', default_model='llama-3.1-8b-instant', api_key_env='GROQ_API_KEY', api_key=None, base_url=None, max_retries=2, model_patterns=None)[source]
Client for the official Groq API backend.
Initialize a Groq service client with sensible defaults.
- class design_research_agents.llm.LLMMessage(*, role, content, name=None, tool_call_id=None, tool_name=None)[source]
One chat message in the provider-neutral completion format.
- content
Plain-text message content.
- name
Optional participant name, when supported by the provider.
- role
Message role used by chat-compatible backends.
- tool_call_id
Tool call identifier for tool-response messages.
- tool_name
Tool name associated with a tool-response message.
- class design_research_agents.llm.LLMRequest(*, messages, model=None, temperature=None, max_tokens=None, tools=(), response_schema=None, response_format=None, metadata=<factory>, provider_options=<factory>, task_profile=None)[source]
Provider-neutral request payload for LLM generation.
- max_tokens
Maximum output token limit.
- messages
Ordered conversation/messages sent to the model.
- metadata
Caller metadata forwarded for tracing and diagnostics.
- model
Explicit model identifier override for this request.
- provider_options
Backend/provider-specific low-level options.
- response_format
Provider-specific response-format hints.
- response_schema
Optional schema for structured output validation.
- task_profile
Optional routing profile used by selector-aware clients.
- temperature
Sampling temperature override.
- tools
Tool specifications exposed for model tool-calling.
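The request fields above map almost one-to-one onto an OpenAI-style chat payload. The sketch below is illustrative only: the mapping function and the payload layout are assumptions about one common backend shape, not this library's internals.

```python
def to_chat_payload(request: dict) -> dict:
    """Illustrative mapping from the provider-neutral request fields to an
    OpenAI-style chat-completions payload. Hypothetical; not library code."""
    payload = {
        "model": request.get("model") or "default-model",
        "messages": [
            {"role": m["role"], "content": m["content"]}
            for m in request["messages"]
        ],
    }
    # Optional overrides are forwarded only when the caller set them.
    if request.get("temperature") is not None:
        payload["temperature"] = request["temperature"]
    if request.get("max_tokens") is not None:
        payload["max_tokens"] = request["max_tokens"]
    if request.get("tools"):
        payload["tools"] = list(request["tools"])
    return payload

payload = to_chat_payload({
    "messages": [
        {"role": "system", "content": "You are terse."},
        {"role": "user", "content": "Say hi."},
    ],
    "temperature": 0.2,
    "max_tokens": 64,
})
```

Unset fields stay out of the payload entirely, which matches the "override" framing of the documented fields.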
- class design_research_agents.llm.LLMResponse(*, text, model=None, provider=None, finish_reason=None, usage=None, latency_ms=None, raw_output=None, tool_calls=(), raw=None, provenance=None)[source]
Normalized non-streaming response payload returned by a backend.
- finish_reason
Provider-specific completion reason.
- latency_ms
End-to-end latency in milliseconds.
- model
Model identifier reported by the backend.
- provenance
Execution provenance metadata for auditability.
- provider
Provider/backend name that produced this response.
- raw
Canonical raw backend payload snapshot.
- raw_output
Legacy/raw backend payload for debugging.
- text
Primary response text emitted by the model.
- tool_calls
Tool calls requested by the model in this response.
- usage
Token usage counters when available.
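Conversely, normalizing a raw backend payload into the fields documented above could look like the following sketch. The field names on the raw side assume an OpenAI-style completion; other backends differ.

```python
def normalize_response(raw: dict, provider: str) -> dict:
    """Extract the documented response fields (text, model, provider,
    finish_reason, usage, raw) from an OpenAI-style payload.
    Illustrative only; not this library's normalizer."""
    choice = raw["choices"][0]
    return {
        "text": choice["message"]["content"],
        "model": raw.get("model"),
        "provider": provider,
        "finish_reason": choice.get("finish_reason"),
        "usage": raw.get("usage"),
        "raw": raw,  # keep the canonical raw snapshot for auditability
    }

resp = normalize_response(
    {
        "model": "gpt-4o-mini",
        "choices": [{"message": {"content": "hi"}, "finish_reason": "stop"}],
        "usage": {"prompt_tokens": 9, "completion_tokens": 1},
    },
    provider="openai",
)
```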
- class design_research_agents.llm.LlamaCppServerLLMClient(*, name='llama-local', model='Qwen2.5-1.5B-Instruct-Q4_K_M.gguf', hf_model_repo_id='bartowski/Qwen2.5-1.5B-Instruct-GGUF', api_model='qwen2-1.5b-q4', host='127.0.0.1', port=8001, context_window=4096, startup_timeout_seconds=60.0, request_timeout_seconds=60.0, poll_interval_seconds=0.25, python_executable='/opt/hostedtoolcache/Python/3.12.13/x64/bin/python3', extra_server_args=(), max_retries=2, model_patterns=None)[source]
Client for a managed local llama_cpp.server backend.
Initialize a local llama-cpp client with sensible defaults.
- Parameters:
name – Logical name for this client instance, used in logging and provenance.
model – Local model identifier or path for llama_cpp.server to load.
hf_model_repo_id – Optional Hugging Face repo ID to auto-download the model from if not found locally.
api_model – The model name to report in API responses, which can differ from the local model name.
host – Host interface for the local server to bind to.
port – Port for the local server to listen on.
context_window – Context window size (n_ctx) to configure the llama_cpp.server with.
startup_timeout_seconds – Max time to wait for the server process to start and become healthy.
request_timeout_seconds – HTTP timeout for generate and stream requests.
poll_interval_seconds – Time interval between health check polls during startup.
python_executable – Python executable to use for running the server process.
extra_server_args – Additional command-line arguments to pass when starting the server process.
max_retries – Number of times to retry a request in case of failure before giving up.
model_patterns – Optional tuple of model name patterns supported by this client, used for routing decisions. If None, defaults to (api_model,).
- close()[source]
Stop the managed local server process.
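The startup_timeout_seconds / poll_interval_seconds pair describes a standard readiness-polling loop. A generic sketch of that loop follows; the probe callable is a placeholder, not this client's actual health check.

```python
import time

def wait_until_healthy(probe, timeout_s: float, interval_s: float) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses.
    Mirrors the documented startup_timeout_seconds /
    poll_interval_seconds semantics; illustrative only."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

# A fake probe that becomes healthy on its third poll.
calls = {"n": 0}
def fake_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

ok = wait_until_healthy(fake_probe, timeout_s=5.0, interval_s=0.01)
```

The same loop shape applies to the other managed-server clients (Ollama, SGLang, vLLM), which expose the same two parameters.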
- class design_research_agents.llm.MLXLocalLLMClient(*, name='mlx-local', model_id='mlx-community/Qwen2.5-1.5B-Instruct-4bit', default_model='mlx-community/Qwen2.5-1.5B-Instruct-4bit', quantization='none', max_retries=2, model_patterns=None)[source]
Client for Apple MLX local inference.
Initialize an MLX local client with sensible defaults.
- Parameters:
name – Logical name for this client instance, used in logging and provenance.
model_id – Identifier for the MLX model to load (e.g. “mlx-community/Qwen2.5-1.5B-Instruct-4bit”).
default_model – Default model name for prompts that don’t specify one.
quantization – Quantization level to use when loading the model (e.g. “4-bit”, “8-bit”, “fp16”).
max_retries – Number of times to retry a request in case of failure before giving up.
model_patterns – Optional tuple of model name patterns supported by this client, used for routing decisions. If None, defaults to (default_model,).
- class design_research_agents.llm.OllamaLLMClient(*, name='ollama-local', default_model='qwen2.5:1.5b-instruct', host='127.0.0.1', port=11434, manage_server=True, ollama_executable='ollama', auto_pull_model=False, startup_timeout_seconds=60.0, poll_interval_seconds=0.25, request_timeout_seconds=60.0, max_retries=2, model_patterns=None)[source]
Client for local or self-hosted Ollama chat inference.
Initialize an Ollama client in managed-server or connect mode.
- Parameters:
name – Logical name for this client instance.
default_model – Default model id used when requests omit model.
host – Host interface used in managed mode or connect mode.
port – TCP port used in managed mode or connect mode.
manage_server – Whether this client manages the ollama serve lifecycle.
ollama_executable – Executable used to invoke ollama commands.
auto_pull_model – Whether to pull default_model after startup.
startup_timeout_seconds – Maximum startup wait time in managed mode.
poll_interval_seconds – Delay between readiness probes in managed mode.
request_timeout_seconds – HTTP timeout for generate and stream requests.
max_retries – Number of retries for retryable provider/transport errors.
model_patterns – Optional tuple of model patterns for routing decisions.
- close()[source]
Stop the managed Ollama daemon when present.
- class design_research_agents.llm.OpenAICompatibleHTTPLLMClient(*, name='openai-compatible', base_url='http://127.0.0.1:8001/v1', default_model='qwen2-1.5b-q4', api_key_env='OPENAI_API_KEY', api_key=None, max_retries=2, model_patterns=None)[source]
Client for OpenAI-compatible HTTP endpoints.
Initialize an OpenAI-compatible HTTP client with sensible defaults.
- class design_research_agents.llm.OpenAIServiceLLMClient(*, name='openai', default_model='gpt-4o-mini', api_key_env='OPENAI_API_KEY', api_key=None, base_url=None, max_retries=2, model_patterns=None)[source]
Client for the official OpenAI API backend.
Initialize an OpenAI service client with sensible defaults.
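Every client above accepts a max_retries count for retryable provider/transport errors. One plausible reading of that parameter is sketched here; the actual retry and backoff policy is internal and may differ, and the RuntimeError stand-in is an assumption.

```python
def call_with_retries(fn, max_retries: int):
    """Attempt `fn` once, then retry up to `max_retries` more times on
    failure, so max_retries=2 allows three attempts total.
    Illustrative only; not the library's retry policy."""
    last_exc = None
    for _attempt in range(max_retries + 1):
        try:
            return fn()
        except RuntimeError as exc:  # stand-in for a retryable error class
            last_exc = exc
    raise last_exc

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

result = call_with_retries(flaky, max_retries=2)
```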
- class design_research_agents.llm.SGLangServerLLMClient(*, name='sglang-local', model='Qwen/Qwen2.5-1.5B-Instruct', host='127.0.0.1', port=30000, manage_server=True, startup_timeout_seconds=90.0, poll_interval_seconds=0.5, python_executable='/opt/hostedtoolcache/Python/3.12.13/x64/bin/python3', extra_server_args=(), base_url=None, request_timeout_seconds=60.0, max_retries=2, model_patterns=None)[source]
Client for local or self-hosted SGLang OpenAI-compatible inference.
Initialize an SGLang client in managed-server or connect mode.
- Parameters:
name – Logical name for this client instance.
model – Model identifier passed to managed SGLang server startup.
host – Host interface used in managed mode.
port – TCP port used in managed mode.
manage_server – Whether this client manages the SGLang server lifecycle.
startup_timeout_seconds – Maximum startup wait time in managed mode.
poll_interval_seconds – Delay between readiness probes in managed mode.
python_executable – Python executable used to launch managed SGLang process.
extra_server_args – Additional CLI flags forwarded to SGLang server.
base_url – Optional connect-mode endpoint URL. Required only for remote/self-managed deployments; defaults to http://{host}:{port}/v1.
request_timeout_seconds – HTTP timeout for generate and stream requests.
max_retries – Number of retries for retryable provider/transport errors.
model_patterns – Optional tuple of model patterns for routing decisions.
- Raises:
ValueError – If manage_server and base_url are both configured.
- close()[source]
Stop the managed SGLang server process when present.
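The base_url default and the ValueError above amount to a small configuration rule, sketched below with a hypothetical helper (not library code): managed mode and an explicit base_url are mutually exclusive, and connect mode falls back to http://{host}:{port}/v1.

```python
def resolve_base_url(manage_server: bool, base_url, host: str, port: int) -> str:
    """Resolve the endpoint URL under the documented rule.
    Illustrative only."""
    if manage_server and base_url is not None:
        raise ValueError("manage_server and base_url are mutually exclusive")
    return base_url or f"http://{host}:{port}/v1"

default_url = resolve_base_url(True, None, "127.0.0.1", 30000)
```

The VLLMServerLLMClient below documents the same rule with its own defaults (port 8002).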
- class design_research_agents.llm.TransformersLocalLLMClient(*, name='transformers-local', model_id='distilgpt2', default_model='distilgpt2', device='auto', dtype='auto', quantization='none', trust_remote_code=False, revision=None, max_retries=2, model_patterns=None)[source]
Client for in-process Transformers local inference.
Initialize a local Transformers client with sensible defaults.
- Parameters:
name – Logical name for this client instance, used in logging and provenance.
model_id – Identifier for the model to load (e.g. “distilgpt2” or a Hugging Face repo ID like “gpt2”).
default_model – Default model name for prompts that don’t specify one.
device – Device to load the model on (e.g. “cpu”, “cuda”, “mps”, or “auto” to automatically select based on availability).
dtype – Data type to use for model weights (e.g. “float16”, “bfloat16”, “int8”, or “auto” to automatically select based on device).
quantization – Quantization level to use when loading the model (e.g. “4-bit”, “8-bit”, “fp16”, or “none” for no quantization).
trust_remote_code – Whether to allow execution of custom code from remote repositories when loading models, which may be required for some models but can be a security risk.
revision – Optional model revision to load (e.g. a git branch, tag, or commit hash), if the model is being loaded from a Hugging Face repository that has multiple revisions.
max_retries – Number of times to retry a request in case of failure before giving up.
model_patterns – Optional tuple of model name patterns supported by this client, used for routing decisions. If None, defaults to (default_model,).
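Several clients document model_patterns as patterns "used for routing decisions", defaulting to a one-element tuple containing the default model. One way such matching could work is with fnmatch-style globs; the glob semantics are an assumption, not a documented behavior.

```python
from fnmatch import fnmatch

def supports_model(model: str, model_patterns, default_model: str) -> bool:
    """Return True when `model` matches one of the client's patterns.
    Falls back to (default_model,) when model_patterns is None, as the
    parameter docs describe. The glob matching rule is an assumption."""
    patterns = model_patterns if model_patterns is not None else (default_model,)
    return any(fnmatch(model, p) for p in patterns)

hit = supports_model("distilgpt2", None, "distilgpt2")
miss = supports_model("gpt2-xl", None, "distilgpt2")
glob_hit = supports_model("qwen2.5:1.5b-instruct", ("qwen*",), "x")
```

A selector-aware router could call such a predicate per client to pick a backend for a given request's model.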
- class design_research_agents.llm.VLLMServerLLMClient(*, name='vllm-local', model='Qwen/Qwen2.5-1.5B-Instruct', api_model='qwen2.5-1.5b-instruct', host='127.0.0.1', port=8002, manage_server=True, startup_timeout_seconds=90.0, poll_interval_seconds=0.5, python_executable='/opt/hostedtoolcache/Python/3.12.13/x64/bin/python3', extra_server_args=(), base_url=None, request_timeout_seconds=60.0, max_retries=2, model_patterns=None)[source]
Client for local or self-hosted vLLM OpenAI-compatible inference.
Initialize a vLLM client in managed-server or connect mode.
- Parameters:
name – Logical name for this client instance.
model – Model identifier passed to managed vLLM server startup.
api_model – Model alias exposed by vLLM OpenAI-compatible API.
host – Host interface used in managed mode.
port – TCP port used in managed mode.
manage_server – Whether this client manages the vLLM server lifecycle.
startup_timeout_seconds – Maximum startup wait time in managed mode.
poll_interval_seconds – Delay between readiness probes in managed mode.
python_executable – Python executable used to launch managed vLLM process.
extra_server_args – Additional CLI flags forwarded to vLLM server.
base_url – Optional connect-mode endpoint URL. Required only for remote/self-managed deployments; defaults to http://{host}:{port}/v1.
request_timeout_seconds – HTTP timeout for generate and stream requests.
max_retries – Number of retries for retryable provider/transport errors.
model_patterns – Optional tuple of model patterns for routing decisions.
- Raises:
ValueError – If manage_server and base_url are both configured.
- close()[source]
Stop the managed vLLM server process when present.
Stable public LLM client classes backed by internal provider wrappers.
- class design_research_agents.llm.clients.AnthropicServiceLLMClient(*, name='anthropic', default_model='claude-3-5-haiku-latest', api_key_env='ANTHROPIC_API_KEY', api_key=None, base_url=None, max_retries=2, model_patterns=None)[source]
Client for the official Anthropic API backend.
Initialize an Anthropic service client with sensible defaults.
- class design_research_agents.llm.clients.AzureOpenAIServiceLLMClient(*, name='azure-openai', default_model='gpt-4o-mini', api_key_env='AZURE_OPENAI_API_KEY', api_key=None, azure_endpoint_env='AZURE_OPENAI_ENDPOINT', azure_endpoint=None, api_version_env='AZURE_OPENAI_API_VERSION', api_version=None, max_retries=2, model_patterns=None)[source]
Client for the Azure OpenAI API via the official OpenAI SDK.
Initialize an Azure OpenAI service client with sensible defaults.
- class design_research_agents.llm.clients.GeminiServiceLLMClient(*, name='gemini', default_model='gemini-2.5-flash', api_key_env='GOOGLE_API_KEY', api_key=None, max_retries=2, model_patterns=None)[source]
Client for the official Gemini API backend.
Initialize a Gemini service client with sensible defaults.
- class design_research_agents.llm.clients.GroqServiceLLMClient(*, name='groq', default_model='llama-3.1-8b-instant', api_key_env='GROQ_API_KEY', api_key=None, base_url=None, max_retries=2, model_patterns=None)[source]
Client for the official Groq API backend.
Initialize a Groq service client with sensible defaults.
- class design_research_agents.llm.clients.LlamaCppServerLLMClient(*, name='llama-local', model='Qwen2.5-1.5B-Instruct-Q4_K_M.gguf', hf_model_repo_id='bartowski/Qwen2.5-1.5B-Instruct-GGUF', api_model='qwen2-1.5b-q4', host='127.0.0.1', port=8001, context_window=4096, startup_timeout_seconds=60.0, request_timeout_seconds=60.0, poll_interval_seconds=0.25, python_executable='/opt/hostedtoolcache/Python/3.12.13/x64/bin/python3', extra_server_args=(), max_retries=2, model_patterns=None)[source]
Client for a managed local llama_cpp.server backend.
Initialize a local llama-cpp client with sensible defaults.
- Parameters:
name – Logical name for this client instance, used in logging and provenance.
model – Local model identifier or path for llama_cpp.server to load.
hf_model_repo_id – Optional Hugging Face repo ID to auto-download the model from if not found locally.
api_model – The model name to report in API responses, which can differ from the local model name.
host – Host interface for the local server to bind to.
port – Port for the local server to listen on.
context_window – Context window size (n_ctx) to configure the llama_cpp.server with.
startup_timeout_seconds – Max time to wait for the server process to start and become healthy.
request_timeout_seconds – HTTP timeout for generate and stream requests.
poll_interval_seconds – Time interval between health check polls during startup.
python_executable – Python executable to use for running the server process.
extra_server_args – Additional command-line arguments to pass when starting the server process.
max_retries – Number of times to retry a request in case of failure before giving up.
model_patterns – Optional tuple of model name patterns supported by this client, used for routing decisions. If None, defaults to (api_model,).
- close()[source]
Stop the managed local server process.
- class design_research_agents.llm.clients.MLXLocalLLMClient(*, name='mlx-local', model_id='mlx-community/Qwen2.5-1.5B-Instruct-4bit', default_model='mlx-community/Qwen2.5-1.5B-Instruct-4bit', quantization='none', max_retries=2, model_patterns=None)[source]
Client for Apple MLX local inference.
Initialize an MLX local client with sensible defaults.
- Parameters:
name – Logical name for this client instance, used in logging and provenance.
model_id – Identifier for the MLX model to load (e.g. “mlx-community/Qwen2.5-1.5B-Instruct-4bit”).
default_model – Default model name for prompts that don’t specify one.
quantization – Quantization level to use when loading the model (e.g. “4-bit”, “8-bit”, “fp16”).
max_retries – Number of times to retry a request in case of failure before giving up.
model_patterns – Optional tuple of model name patterns supported by this client, used for routing decisions. If None, defaults to (default_model,).
- class design_research_agents.llm.clients.OllamaLLMClient(*, name='ollama-local', default_model='qwen2.5:1.5b-instruct', host='127.0.0.1', port=11434, manage_server=True, ollama_executable='ollama', auto_pull_model=False, startup_timeout_seconds=60.0, poll_interval_seconds=0.25, request_timeout_seconds=60.0, max_retries=2, model_patterns=None)[source]
Client for local or self-hosted Ollama chat inference.
Initialize an Ollama client in managed-server or connect mode.
- Parameters:
name – Logical name for this client instance.
default_model – Default model id used when requests omit model.
host – Host interface used in managed mode or connect mode.
port – TCP port used in managed mode or connect mode.
manage_server – Whether this client manages the ollama serve lifecycle.
ollama_executable – Executable used to invoke ollama commands.
auto_pull_model – Whether to pull default_model after startup.
startup_timeout_seconds – Maximum startup wait time in managed mode.
poll_interval_seconds – Delay between readiness probes in managed mode.
request_timeout_seconds – HTTP timeout for generate and stream requests.
max_retries – Number of retries for retryable provider/transport errors.
model_patterns – Optional tuple of model patterns for routing decisions.
- close()[source]
Stop the managed Ollama daemon when present.
- class design_research_agents.llm.clients.OpenAICompatibleHTTPLLMClient(*, name='openai-compatible', base_url='http://127.0.0.1:8001/v1', default_model='qwen2-1.5b-q4', api_key_env='OPENAI_API_KEY', api_key=None, max_retries=2, model_patterns=None)[source]
Client for OpenAI-compatible HTTP endpoints.
Initialize an OpenAI-compatible HTTP client with sensible defaults.
- class design_research_agents.llm.clients.OpenAIServiceLLMClient(*, name='openai', default_model='gpt-4o-mini', api_key_env='OPENAI_API_KEY', api_key=None, base_url=None, max_retries=2, model_patterns=None)[source]
Client for the official OpenAI API backend.
Initialize an OpenAI service client with sensible defaults.
- class design_research_agents.llm.clients.SGLangServerLLMClient(*, name='sglang-local', model='Qwen/Qwen2.5-1.5B-Instruct', host='127.0.0.1', port=30000, manage_server=True, startup_timeout_seconds=90.0, poll_interval_seconds=0.5, python_executable='/opt/hostedtoolcache/Python/3.12.13/x64/bin/python3', extra_server_args=(), base_url=None, request_timeout_seconds=60.0, max_retries=2, model_patterns=None)[source]
Client for local or self-hosted SGLang OpenAI-compatible inference.
Initialize an SGLang client in managed-server or connect mode.
- Parameters:
name – Logical name for this client instance.
model – Model identifier passed to managed SGLang server startup.
host – Host interface used in managed mode.
port – TCP port used in managed mode.
manage_server – Whether this client manages the SGLang server lifecycle.
startup_timeout_seconds – Maximum startup wait time in managed mode.
poll_interval_seconds – Delay between readiness probes in managed mode.
python_executable – Python executable used to launch managed SGLang process.
extra_server_args – Additional CLI flags forwarded to SGLang server.
base_url – Optional connect-mode endpoint URL. Required only for remote/self-managed deployments; defaults to http://{host}:{port}/v1.
request_timeout_seconds – HTTP timeout for generate and stream requests.
max_retries – Number of retries for retryable provider/transport errors.
model_patterns – Optional tuple of model patterns for routing decisions.
- Raises:
ValueError – If manage_server and base_url are both configured.
- close()[source]
Stop the managed SGLang server process when present.
- class design_research_agents.llm.clients.TransformersLocalLLMClient(*, name='transformers-local', model_id='distilgpt2', default_model='distilgpt2', device='auto', dtype='auto', quantization='none', trust_remote_code=False, revision=None, max_retries=2, model_patterns=None)[source]
Client for in-process Transformers local inference.
Initialize a local Transformers client with sensible defaults.
- Parameters:
name – Logical name for this client instance, used in logging and provenance.
model_id – Identifier for the model to load (e.g. “distilgpt2” or a Hugging Face repo ID like “gpt2”).
default_model – Default model name for prompts that don’t specify one.
device – Device to load the model on (e.g. “cpu”, “cuda”, “mps”, or “auto” to automatically select based on availability).
dtype – Data type to use for model weights (e.g. “float16”, “bfloat16”, “int8”, or “auto” to automatically select based on device).
quantization – Quantization level to use when loading the model (e.g. “4-bit”, “8-bit”, “fp16”, or “none” for no quantization).
trust_remote_code – Whether to allow execution of custom code from remote repositories when loading models, which may be required for some models but can be a security risk.
revision – Optional model revision to load (e.g. a git branch, tag, or commit hash), if the model is being loaded from a Hugging Face repository that has multiple revisions.
max_retries – Number of times to retry a request in case of failure before giving up.
model_patterns – Optional tuple of model name patterns supported by this client, used for routing decisions. If None, defaults to (default_model,).
- class design_research_agents.llm.clients.VLLMServerLLMClient(*, name='vllm-local', model='Qwen/Qwen2.5-1.5B-Instruct', api_model='qwen2.5-1.5b-instruct', host='127.0.0.1', port=8002, manage_server=True, startup_timeout_seconds=90.0, poll_interval_seconds=0.5, python_executable='/opt/hostedtoolcache/Python/3.12.13/x64/bin/python3', extra_server_args=(), base_url=None, request_timeout_seconds=60.0, max_retries=2, model_patterns=None)[source]
Client for local or self-hosted vLLM OpenAI-compatible inference.
Initialize a vLLM client in managed-server or connect mode.
- Parameters:
name – Logical name for this client instance.
model – Model identifier passed to managed vLLM server startup.
api_model – Model alias exposed by vLLM OpenAI-compatible API.
host – Host interface used in managed mode.
port – TCP port used in managed mode.
manage_server – Whether this client manages the vLLM server lifecycle.
startup_timeout_seconds – Maximum startup wait time in managed mode.
poll_interval_seconds – Delay between readiness probes in managed mode.
python_executable – Python executable used to launch managed vLLM process.
extra_server_args – Additional CLI flags forwarded to vLLM server.
base_url – Optional connect-mode endpoint URL. Required only for remote/self-managed deployments; defaults to http://{host}:{port}/v1.
request_timeout_seconds – HTTP timeout for generate and stream requests.
max_retries – Number of retries for retryable provider/transport errors.
model_patterns – Optional tuple of model patterns for routing decisions.
- Raises:
ValueError – If manage_server and base_url are both configured.
- close()[source]
Stop the managed vLLM server process when present.