LLM Clients
===========

The framework exposes constructor-first client classes that implement the
``LLMClient`` contract. Choose a client based on deployment constraints first,
then tune model selection.

Every public client also exposes a shared introspection surface for debugging
and diagnostics: ``default_model()``, ``capabilities()``,
``config_snapshot()``, ``server_snapshot()``, and ``describe()``.

All public built-in clients also implement ``close()`` and support ``with``.
Examples prefer the context-manager form, while ``close()`` remains available
for explicit lifecycle control.

For install profiles and platform constraints for backend extras, see
:doc:`../dependencies_and_extras`.

Comparison matrix
-----------------

.. list-table::
   :header-rows: 1

   * - Client
     - Execution location
     - Default model behavior
     - Setup burden
     - Privacy / cost / latency profile
   * - ``LlamaCppServerLLMClient``
     - Local managed ``llama_cpp.server`` process
     - Defaults to ``api_model="qwen2-1.5b-q4"`` mapped to a local GGUF
     - Medium (local runtime + model download)
     - Strong privacy, lowest marginal cost, variable latency by hardware
   * - ``TransformersLocalLLMClient``
     - Local in-process transformers runtime
     - Defaults to ``model_id=default_model="distilgpt2"``
     - Medium-high (framework + model weights)
     - Strong privacy, lowest marginal cost, latency depends on device
   * - ``MLXLocalLLMClient``
     - Local Apple MLX runtime
     - Defaults to ``mlx-community/Qwen2.5-1.5B-Instruct-4bit``
     - Medium (Apple silicon + MLX stack)
     - Strong privacy, lowest marginal cost, strong local throughput on Apple hardware
   * - ``VLLMServerLLMClient``
     - Local/self-hosted vLLM OpenAI-compatible server
     - Defaults to ``api_model="qwen2.5-1.5b-instruct"``
     - Medium-high (server runtime + model provisioning)
     - Strong privacy, strong throughput, ops overhead for service management
   * - ``OllamaLLMClient``
     - Local/self-hosted Ollama daemon
     - Defaults to ``default_model="qwen2.5:1.5b-instruct"``
     - Low-medium (Ollama runtime + model pull)
     - Strong privacy, low setup burden, laptop-friendly local serving
   * - ``SGLangServerLLMClient``
     - Local/self-hosted SGLang OpenAI-compatible server
     - Defaults to ``model="Qwen/Qwen2.5-1.5B-Instruct"``
     - Medium-high (server runtime + model provisioning)
     - Strong privacy, high throughput, ops overhead for service management
   * - ``OpenAIServiceLLMClient``
     - Remote OpenAI API
     - Defaults to ``gpt-4o-mini``
     - Low (API key)
     - Lowest setup effort, network/data egress tradeoff, usage-based cost
   * - ``AzureOpenAIServiceLLMClient``
     - Remote Azure OpenAI API
     - Defaults to ``gpt-4o-mini`` (deployment name)
     - Low-medium (API key + endpoint + API version)
     - Azure-native hosted option, network/data egress tradeoff, Azure-managed billing
   * - ``AnthropicServiceLLMClient``
     - Remote Anthropic API
     - Defaults to ``claude-3-5-haiku-latest``
     - Low (API key)
     - Low setup effort, network/data egress tradeoff, usage-based cost
   * - ``GeminiServiceLLMClient``
     - Remote Gemini API
     - Defaults to ``gemini-2.5-flash``
     - Low (API key)
     - Low setup effort, network/data egress tradeoff, usage-based cost
   * - ``GroqServiceLLMClient``
     - Remote Groq API
     - Defaults to ``llama-3.1-8b-instant``
     - Low (API key)
     - Low setup effort, network/data egress tradeoff, usage-based cost
   * - ``OpenAICompatibleHTTPLLMClient``
     - Remote or local OpenAI-compatible endpoint
     - Defaults to ``qwen2-1.5b-q4``
     - Low-medium (compatible server + endpoint config)
     - Flexible privacy/cost posture based on endpoint hosting

When to choose what
-------------------

1. Need strict data-local execution: start with ``LlamaCppServerLLMClient``,
   ``TransformersLocalLLMClient``, ``MLXLocalLLMClient``,
   ``VLLMServerLLMClient``, ``OllamaLLMClient``, or ``SGLangServerLLMClient``.
2. Need fastest onboarding and hosted quality: use ``OpenAIServiceLLMClient``,
   ``AzureOpenAIServiceLLMClient``, ``AnthropicServiceLLMClient``,
   ``GeminiServiceLLMClient``, or ``GroqServiceLLMClient``.
3. Need provider portability or self-hosted OpenAI-compatible infra: use
   ``OpenAICompatibleHTTPLLMClient``.
4. Need policy-driven choice between local and remote options: use
   :doc:`model_selection`.

See examples
------------

- ``examples/clients/README.md``

Pages
-----

- :doc:`model_selection`
- :doc:`llama_cpp_server`
- :doc:`openai_service`
- :doc:`azure_openai_service`
- :doc:`anthropic_service`
- :doc:`gemini_service`
- :doc:`groq_service`
- :doc:`openai_compatible_http`
- :doc:`transformers_local`
- :doc:`mlx_local`
- :doc:`vllm_server`
- :doc:`ollama_local`
- :doc:`sglang_server`

.. toctree::
   :maxdepth: 2
   :hidden:

   model_selection
   llama_cpp_server
   openai_service
   azure_openai_service
   anthropic_service
   gemini_service
   groq_service
   openai_compatible_http
   transformers_local
   mlx_local
   vllm_server
   ollama_local
   sglang_server
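Taken together, the shared introspection surface and the ``close()``/``with``
lifecycle amount to a small, uniform contract across all clients. The sketch
below illustrates only that shape; ``SketchLLMClient`` and every return value
in it are hypothetical stand-ins invented for this page, not the framework's
actual classes or payloads.

```python
# Hypothetical stand-in showing the shared client contract: the five
# introspection methods plus close() and context-manager support.
# Real clients differ in constructors and payloads; this is a sketch only.

class SketchLLMClient:
    """Toy client modeling the introspection + lifecycle surface."""

    def __init__(self, api_model: str = "qwen2-1.5b-q4") -> None:
        self._api_model = api_model
        self._closed = False

    def default_model(self) -> str:
        # The model used when a call does not name one explicitly.
        return self._api_model

    def capabilities(self) -> dict:
        # Illustrative capability flags; real keys are client-specific.
        return {"streaming": False, "tools": False}

    def config_snapshot(self) -> dict:
        # Constructor-derived configuration, frozen for debugging.
        return {"api_model": self._api_model}

    def server_snapshot(self) -> dict:
        # Runtime state of any managed backend process or endpoint.
        return {"running": not self._closed}

    def describe(self) -> dict:
        # Aggregate view combining the other introspection calls.
        return {
            "default_model": self.default_model(),
            "capabilities": self.capabilities(),
            "config": self.config_snapshot(),
            "server": self.server_snapshot(),
        }

    def close(self) -> None:
        # Release backend resources; idempotent by convention.
        self._closed = True

    def __enter__(self) -> "SketchLLMClient":
        return self

    def __exit__(self, *exc) -> bool:
        self.close()
        return False


# Context-manager form preferred by the examples; close() runs on exit.
with SketchLLMClient() as client:
    info = client.describe()
```

Because every public client answers the same five introspection calls, tooling
can log ``describe()`` output uniformly regardless of which backend a
deployment selects.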