LLM Clients#

The framework exposes constructor-first client classes that implement the LLMClient contract. Choose a client based on deployment constraints first, then tune model selection.

Every public client also exposes a shared introspection surface for debugging and diagnostics: default_model(), capabilities(), config_snapshot(), server_snapshot(), and describe(). All public built-in clients also implement close() and support use as context managers (`with`). The examples prefer the context-manager form; close() remains available for explicit lifecycle control.
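
As a minimal sketch of that shared surface, the stub below stands in for a real client (the method bodies and return values here are illustrative, not the framework's actual defaults); it shows the introspection methods plus the `close()`/`with` lifecycle contract:

```python
from contextlib import AbstractContextManager


# Hypothetical stand-in: real clients (e.g. OpenAIServiceLLMClient)
# expose this same introspection and lifecycle surface.
class StubLLMClient(AbstractContextManager):
    """Minimal client sketch implementing the shared surface."""

    def default_model(self) -> str:
        # Model used when a call does not specify one.
        return "stub-model"

    def capabilities(self) -> dict:
        # Illustrative feature flags useful for diagnostics.
        return {"streaming": False, "tools": False}

    def config_snapshot(self) -> dict:
        # Sanitized view of constructor configuration.
        return {"default_model": self.default_model()}

    def server_snapshot(self) -> dict:
        # Runtime/server state; empty for this purely local stub.
        return {}

    def describe(self) -> str:
        return f"{type(self).__name__}(default_model={self.default_model()!r})"

    def close(self) -> None:
        # Release sockets/processes; idempotent by convention.
        pass

    def __exit__(self, exc_type, exc, tb):
        # Context-manager exit delegates to explicit close().
        self.close()
        return False


# Context-manager form, as the examples prefer:
with StubLLMClient() as client:
    print(client.describe())
```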

For install profiles and platform constraints for backend extras, see Dependencies and Extras.

Comparison matrix#

| Client | Execution location | Default model behavior | Setup burden | Privacy / cost / latency profile |
| --- | --- | --- | --- | --- |
| OpenAICompatibleHTTPLLMClient | Remote or local OpenAI-compatible endpoint | Defaults to `qwen2-1.5b-q4` | Low-medium (compatible server + endpoint config) | Flexible privacy/cost posture based on endpoint hosting |
| LlamaCppServerLLMClient | Local managed `llama_cpp.server` process | Defaults to `api_model="qwen2-1.5b-q4"` mapped to a local GGUF | Medium (local runtime + model download) | Strong privacy, lowest marginal cost, variable latency by hardware |
| TransformersLocalLLMClient | Local in-process transformers runtime | Defaults to `model_id="distilgpt2"` (also the default model) | Medium-high (framework + model weights) | Strong privacy, lowest marginal cost, latency depends on device |
| MLXLocalLLMClient | Local Apple MLX runtime | Defaults to `mlx-community/Qwen2.5-1.5B-Instruct-4bit` | Medium (Apple silicon + MLX stack) | Strong privacy, lowest marginal cost, strong local throughput on Apple hardware |
| VLLMServerLLMClient | Local/self-hosted vLLM OpenAI-compatible server | Defaults to `api_model="qwen2.5-1.5b-instruct"` | Medium-high (server runtime + model provisioning) | Strong privacy, strong throughput, ops overhead for service management |
| OllamaLLMClient | Local/self-hosted Ollama daemon | Defaults to `default_model="qwen2.5:1.5b-instruct"` | Low-medium (Ollama runtime + model pull) | Strong privacy, low setup burden, laptop-friendly local serving |
| SGLangServerLLMClient | Local/self-hosted SGLang OpenAI-compatible server | Defaults to `model="Qwen/Qwen2.5-1.5B-Instruct"` | Medium-high (server runtime + model provisioning) | Strong privacy, high throughput, ops overhead for service management |
| OpenAIServiceLLMClient | Remote OpenAI API | Defaults to `gpt-4o-mini` | Low (API key) | Lowest setup effort, network/data egress tradeoff, usage-based cost |
| AzureOpenAIServiceLLMClient | Remote Azure OpenAI API | Defaults to `gpt-4o-mini` (deployment name) | Low-medium (API key + endpoint + API version) | Azure-native hosted option, network/data egress tradeoff, Azure-managed billing |
| AnthropicServiceLLMClient | Remote Anthropic API | Defaults to `claude-3-5-haiku-latest` | Low (API key) | Low setup effort, network/data egress tradeoff, usage-based cost |
| GeminiServiceLLMClient | Remote Gemini API | Defaults to `gemini-2.5-flash` | Low (API key) | Low setup effort, network/data egress tradeoff, usage-based cost |
| GroqServiceLLMClient | Remote Groq API | Defaults to `llama-3.1-8b-instant` | Low (API key) | Low setup effort, network/data egress tradeoff, usage-based cost |

When to choose what#

  1. Need the built-in no-SDK path and already have a compatible endpoint: use OpenAICompatibleHTTPLLMClient.

  2. Need strict data-local execution: start with LlamaCppServerLLMClient, TransformersLocalLLMClient, MLXLocalLLMClient, VLLMServerLLMClient, OllamaLLMClient, or SGLangServerLLMClient.

  3. Need fastest hosted setup and managed model quality: use OpenAIServiceLLMClient, AzureOpenAIServiceLLMClient, AnthropicServiceLLMClient, GeminiServiceLLMClient, or GroqServiceLLMClient.

  4. Need policy-driven choice between local and remote options: use Model Selection.
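
The decision order above can be sketched as a simple chooser. The constraint flags and the function itself are hypothetical illustrations, not framework API; the Model Selection page covers the real policy-driven mechanism:

```python
def pick_client(*, data_local: bool, has_endpoint: bool = False,
                apple_silicon: bool = False) -> str:
    """Illustrative chooser mirroring the decision order above.

    Returns a client class name as a string. The keyword flags are
    hypothetical; they are not part of the framework's API.
    """
    if has_endpoint:
        # 1. Built-in no-SDK path against an existing
        #    OpenAI-compatible endpoint.
        return "OpenAICompatibleHTTPLLMClient"
    if data_local:
        # 2. Strictly data-local execution; any local client from
        #    the matrix applies, picked here by hardware.
        return "MLXLocalLLMClient" if apple_silicon else "LlamaCppServerLLMClient"
    # 3. Fastest hosted setup: any of the service clients.
    return "OpenAIServiceLLMClient"


print(pick_client(data_local=True, apple_silicon=True))  # → MLXLocalLLMClient
```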

See examples#

  • examples/clients/README.md
