Introduction

Crabllm is a high-performance LLM API gateway written in Rust. It sits between your application and LLM providers, exposing an OpenAI-compatible API surface.

One API format. Many providers. Low overhead.

What It Does

You send requests in OpenAI format to crabllm. It routes them to the configured provider — OpenAI, Anthropic, Google Gemini, Azure OpenAI, AWS Bedrock, or Ollama — translating the request and response as needed.

Your application talks to one endpoint. Crabllm handles the rest:

  • Provider translation — Anthropic, Google, and Bedrock have their own API formats. Crabllm translates automatically.
  • Routing — Weighted random selection across multiple providers for the same model. Automatic fallback when a provider fails.
  • Streaming — SSE streaming proxied without buffering.
  • Auth — Virtual API keys with per-key model access control.
  • Extensions — Rate limiting, caching, cost tracking, budget enforcement.

Why Rust

  • Sub-millisecond overhead — no GC pauses, no interpreter startup.
  • Memory safety — without runtime cost.
  • Concurrency — Tokio async runtime handles thousands of concurrent streaming connections efficiently.
  • Deployment — single static binary. No interpreter, no virtualenv, no Docker required.

Feature Comparison

| Feature | LiteLLM | Crabllm |
|---|---|---|
| /chat/completions | yes | yes |
| /embeddings | yes | yes |
| /models | yes | yes |
| OpenAI provider | yes | yes |
| Anthropic provider | yes | yes |
| Google Gemini provider | yes | yes |
| Azure OpenAI provider | yes | yes |
| AWS Bedrock provider | yes | yes |
| Tool/function calling | yes | yes |
| SSE streaming | yes | yes |
| Virtual keys + auth | yes | yes |
| Weighted routing | yes | yes |
| Model aliasing | yes | yes |
| Retry + fallback | yes | yes |
| Rate limiting (RPM/TPM) | yes | yes |
| Cost/usage tracking | yes | yes |
| Budget enforcement | yes | yes |
| Request caching | yes | yes |
| Image/audio endpoints | yes | yes |
| Storage (memory) | yes | yes |
| Storage (persistent) | Postgres | SQLite |
| Redis storage | yes | yes |

Getting Started

Install

cargo install crabllm

Configure

Create a crabllm.toml file:

listen = "0.0.0.0:8080"

[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o", "gpt-4o-mini"]

[providers.anthropic]
kind = "anthropic"
api_key = "${ANTHROPIC_API_KEY}"
models = ["claude-sonnet-4-20250514"]

Environment variables in ${VAR} syntax are expanded at startup.

Run

crabllm --config crabllm.toml

You’ll see:

crabllm listening on 0.0.0.0:8080 (3 models, 2 providers, 0 extensions)

Send a Request

All requests use the OpenAI format, regardless of which provider handles them:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

To use Anthropic, just change the model name:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-20250514",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

The request format is the same. Crabllm translates it to the Anthropic Messages API internally.

Streaming

Add "stream": true to get SSE streaming:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Model Aliasing

Map friendly names to canonical model names:

[aliases]
gpt4 = "gpt-4o"
claude = "claude-sonnet-4-20250514"

Now "model": "gpt4" routes to gpt-4o.

Next Steps

  • Configuration — full reference for all config options
  • Providers — setup guides for each provider
  • Features — routing, auth, extensions, and more

Configuration

Crabllm is configured via a TOML file, passed with --config:

crabllm --config crabllm.toml

The --bind flag overrides the listen address.

Environment Variables

Strings containing ${VAR} are expanded from environment variables at startup. Unknown variables expand to an empty string. Use this for secrets:

api_key = "${OPENAI_API_KEY}"
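The expansion rule — each ${VAR} replaced by the variable's value, with unknown variables becoming empty strings — can be sketched as follows (illustrative Python only; the gateway's actual Rust implementation is not shown here):

```python
import os
import re

def expand_env(value: str) -> str:
    """Replace every ${VAR} with the environment variable's value.

    Unknown variables expand to the empty string, matching the
    startup behavior described above.
    """
    return re.sub(
        r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}",
        lambda m: os.environ.get(m.group(1), ""),
        value,
    )

os.environ["OPENAI_API_KEY"] = "sk-test"
print(expand_env("${OPENAI_API_KEY}"))      # sk-test
print(expand_env("${SOME_UNSET_VAR_XYZ}"))  # empty string
```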

Top-Level Fields

| Field | Type | Default | Description |
|---|---|---|---|
| listen | string | required | Address to bind, e.g. "0.0.0.0:8080" |
| shutdown_timeout | integer | 30 | Graceful shutdown timeout in seconds |

Providers

Each provider is a named entry under [providers]:

[providers.my_openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o", "gpt-4o-mini"]

| Field | Type | Default | Description |
|---|---|---|---|
| kind | string | required | Provider type (see Providers) |
| api_key | string | "" | API key for authentication |
| base_url | string | per-kind | Base URL override |
| models | list | [] | Model names this provider serves |
| weight | integer | 1 | Routing weight for load balancing |
| max_retries | integer | 2 | Max retries on transient errors |
| timeout | integer | 30 | Per-request timeout in seconds |
| api_version | string | — | API version (Azure only) |
| region | string | — | AWS region (Bedrock only) |
| access_key | string | — | AWS access key (Bedrock only) |
| secret_key | string | — | AWS secret key (Bedrock only) |

Virtual Keys

[[keys]]
name = "team-a"
key = "sk-team-a-secret"
models = ["gpt-4o", "claude-sonnet-4-20250514"]

[[keys]]
name = "admin"
key = "sk-admin-secret"
models = ["*"]

| Field | Type | Description |
|---|---|---|
| name | string | Human-readable key name (used in usage tracking) |
| key | string | The bearer token clients send |
| models | list | Allowed models. ["*"] means all |

When no keys are configured, authentication is disabled.

Aliases

[aliases]
gpt4 = "gpt-4o"
claude = "claude-sonnet-4-20250514"

Maps friendly model names to canonical names. Single-hop lookup.
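Single-hop means an alias is resolved at most once — an alias that points at another alias is not chained. A minimal Python sketch of the lookup (illustrative only, not the gateway's code):

```python
def resolve(model: str, aliases: dict) -> str:
    # One lookup, no recursion: an alias pointing at another
    # alias is returned after the first hop.
    return aliases.get(model, model)

aliases = {"gpt4": "gpt-4o", "claude": "claude-sonnet-4-20250514"}
print(resolve("gpt4", aliases))       # gpt-4o
print(resolve("gpt-4o", aliases))     # gpt-4o (not an alias, passes through)
```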

Pricing

[pricing.gpt-4o]
prompt_cost_per_million = 2.50
completion_cost_per_million = 10.00

[pricing.claude-sonnet-4-20250514]
prompt_cost_per_million = 3.00
completion_cost_per_million = 15.00

Per-model token pricing in USD. Used by the budget extension for spend tracking.
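Given these rates, per-request spend is a linear function of token counts. For example (illustrative Python; any rounding the gateway applies internally is not specified here):

```python
def request_cost_usd(prompt_tokens, completion_tokens,
                     prompt_cost_per_million, completion_cost_per_million):
    # Cost scales linearly with each token count.
    return (prompt_tokens * prompt_cost_per_million
            + completion_tokens * completion_cost_per_million) / 1_000_000

# gpt-4o rates from the config above: $2.50 / $10.00 per million tokens.
# 1000 prompt tokens -> $0.0025, 500 completion tokens -> $0.0050.
cost = request_cost_usd(1_000, 500, 2.50, 10.00)
print(cost)  # 0.0075
```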

Extensions

[extensions.cache]
ttl_seconds = 3600

[extensions.rate_limit]
requests_per_minute = 60

[extensions.usage]

[extensions.budget]
default_budget = 10.00

[extensions.logging]
level = "info"

See Extensions for details on each.

Storage

[storage]
kind = "memory"

| Kind | Feature flag | path field |
|---|---|---|
| memory | none (default) | not used |
| sqlite | storage-sqlite | file path, e.g. "crabllm.db" |
| redis | storage-redis | URL, e.g. "redis://127.0.0.1:6379" |

See Storage for details.

Full Example

listen = "0.0.0.0:8080"
shutdown_timeout = 30

[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o", "gpt-4o-mini"]
weight = 2
max_retries = 2
timeout = 30

[providers.anthropic]
kind = "anthropic"
api_key = "${ANTHROPIC_API_KEY}"
models = ["claude-sonnet-4-20250514"]

[providers.ollama]
kind = "ollama"
models = ["llama3.2"]

[aliases]
gpt4 = "gpt-4o"
claude = "claude-sonnet-4-20250514"

[[keys]]
name = "default"
key = "${CRABLLM_API_KEY}"
models = ["*"]

[pricing.gpt-4o]
prompt_cost_per_million = 2.50
completion_cost_per_million = 10.00

[extensions.rate_limit]
requests_per_minute = 100

[extensions.usage]

[extensions.logging]
level = "info"

[storage]
kind = "sqlite"
path = "crabllm.db"

Providers

A provider is an LLM service that crabllm routes requests to. Each provider has its own API format and authentication mechanism. Crabllm translates between the OpenAI-compatible format your application uses and the provider’s native format.

Supported Providers

| Kind | Provider | Translation |
|---|---|---|
| openai | OpenAI, Groq, Together, vLLM, any OpenAI-compatible API | Pass-through |
| anthropic | Anthropic Messages API | Full translation |
| google | Google Gemini | Full translation |
| azure | Azure OpenAI | URL + auth rewrite |
| bedrock | AWS Bedrock Converse API | Full translation + SigV4 signing |
| ollama | Ollama (local models) | Pass-through (OpenAI-compatible) |

Common Fields

Every provider supports these fields:

[providers.name]
kind = "..."           # required
api_key = "..."        # API key (supports ${ENV_VAR})
base_url = "..."       # base URL override
models = ["..."]       # model names this provider serves
weight = 1             # routing weight (higher = more traffic)
max_retries = 2        # retries on transient errors (429, 5xx)
timeout = 30           # per-request timeout in seconds

Multiple Providers for the Same Model

When multiple providers list the same model, crabllm selects between them using weighted random selection. If the selected provider fails, it falls back to the next provider by weight. See Routing.

[providers.openai_primary]
kind = "openai"
api_key = "${OPENAI_KEY_1}"
models = ["gpt-4o"]
weight = 3

[providers.openai_backup]
kind = "openai"
api_key = "${OPENAI_KEY_2}"
models = ["gpt-4o"]
weight = 1

Endpoint Support

| Endpoint | OpenAI | Anthropic | Google | Azure | Bedrock | Ollama |
|---|---|---|---|---|---|---|
| Chat completions | yes | yes | yes | yes | yes | yes |
| Streaming | yes | yes | yes | yes | yes | yes |
| Embeddings | yes | — | — | yes | — | yes |
| Image generation | yes | — | — | yes | — | — |
| Audio speech | yes | — | — | yes | — | — |
| Audio transcription | yes | — | — | yes | — | — |
| Tool/function calling | yes | yes | yes | yes | yes | yes |

OpenAI

The openai provider works with OpenAI and any OpenAI-compatible API (Groq, Together AI, vLLM, etc.). Requests are forwarded as-is with URL and auth rewrite — no translation needed.

Configuration

[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o", "gpt-4o-mini", "text-embedding-3-small"]

Custom Base URL

For OpenAI-compatible services, set base_url:

[providers.groq]
kind = "openai"
api_key = "${GROQ_API_KEY}"
base_url = "https://api.groq.com/openai/v1"
models = ["llama-3.3-70b-versatile"]

[providers.together]
kind = "openai"
api_key = "${TOGETHER_API_KEY}"
base_url = "https://api.together.xyz/v1"
models = ["meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"]

Supported Endpoints

  • Chat completions (streaming and non-streaming)
  • Embeddings
  • Image generation
  • Audio speech (TTS)
  • Audio transcription

Tool Calling

Tool calling works as-is — the request body is forwarded directly:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "parameters": {
          "type": "object",
          "properties": {"location": {"type": "string"}},
          "required": ["location"]
        }
      }
    }]
  }'

Anthropic

The anthropic provider translates OpenAI-format requests to the Anthropic Messages API and back.

Configuration

[providers.anthropic]
kind = "anthropic"
api_key = "${ANTHROPIC_API_KEY}"
models = ["claude-sonnet-4-20250514", "claude-haiku-4-20250514"]

Translation

Crabllm handles the full translation between OpenAI and Anthropic formats:

  • System messages — extracted from the messages array and sent as the Anthropic system parameter.
  • Stop reasons — mapped between formats (end_turn to stop, etc.).
  • Tool calling — fully supported. Tool definitions, tool use responses, and tool result messages are all translated.
  • Streaming — Anthropic’s event stream (message_start, content_block_delta, etc.) is translated to OpenAI-format SSE chunks.

Usage

Send requests in OpenAI format as usual:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-20250514",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'

Limitations

  • Embeddings, image generation, and audio endpoints are not supported by the Anthropic API.

Google Gemini

The google provider translates OpenAI-format requests to the Google Gemini API (generativeai).

Configuration

[providers.google]
kind = "google"
api_key = "${GOOGLE_API_KEY}"
models = ["gemini-2.0-flash", "gemini-2.5-pro"]

Translation

  • System messages — mapped to Gemini’s systemInstruction field.
  • Roles — assistant mapped to model, user stays user.
  • Content — mapped to Gemini’s parts array format.
  • Tool calling — tool definitions mapped to functionDeclarations, tool messages to functionResponse parts, responses extract functionCall parts.
  • Streaming — uses streamGenerateContent?alt=sse and translates the Gemini event stream to OpenAI-format SSE chunks.

Usage

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.0-flash",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Limitations

  • Embeddings, image generation, and audio endpoints are not supported.

Azure OpenAI

The azure provider routes to Azure OpenAI deployments. The request body is OpenAI-format (no translation needed), but the URL pattern and authentication differ.

Configuration

[providers.azure]
kind = "azure"
api_key = "${AZURE_OPENAI_KEY}"
base_url = "https://my-resource.openai.azure.com"
api_version = "2024-02-01"
models = ["gpt-4o"]

  • base_url — your Azure OpenAI resource URL.
  • api_version — the Azure API version string.

How It Works

Crabllm rewrites the URL to Azure’s deployment-based pattern:

POST /openai/deployments/{model}/chat/completions?api-version={api_version}

Authentication uses the api-key header instead of Authorization: Bearer.
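Putting the pattern together, the rewritten URL can be formed like this (illustrative Python; it assumes the deployment name equals the model name, as the pattern above suggests):

```python
def azure_url(base_url: str, model: str, api_version: str) -> str:
    # Azure's deployment-based URL pattern with the api-version
    # query parameter appended.
    return (f"{base_url}/openai/deployments/{model}"
            f"/chat/completions?api-version={api_version}")

url = azure_url("https://my-resource.openai.azure.com",
                "gpt-4o", "2024-02-01")
print(url)
# https://my-resource.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-02-01
```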

Supported Endpoints

  • Chat completions (streaming and non-streaming)
  • Embeddings
  • Image generation
  • Audio speech (TTS)
  • Audio transcription

Usage

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

AWS Bedrock

The bedrock provider translates requests to the AWS Bedrock Converse API with SigV4 request signing. No AWS SDK dependency — signing is handled internally.

Feature Flag

Bedrock support requires the provider-bedrock cargo feature:

cargo install crabllm --features provider-bedrock

Configuration

[providers.bedrock]
kind = "bedrock"
region = "us-east-1"
access_key = "${AWS_ACCESS_KEY_ID}"
secret_key = "${AWS_SECRET_ACCESS_KEY}"
models = ["anthropic.claude-3-5-sonnet-20241022-v2:0"]

Translation

  • System messages — mapped to the Bedrock system field.
  • Tool calling — tool definitions mapped to toolConfig.tools[].toolSpec, tool results to toolResult content blocks.
  • Stop reasons — end_turn to stop, tool_use to tool_calls, max_tokens to length.
  • Streaming — uses ConverseStream with AWS event-stream binary framing.

Usage

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Limitations

  • Embeddings, image generation, and audio endpoints are not supported.

Ollama

The ollama provider connects to a local Ollama instance. Ollama exposes an OpenAI-compatible API, so requests are forwarded as-is.

Configuration

[providers.ollama]
kind = "ollama"
models = ["llama3.2", "mistral"]

The default base URL is http://localhost:11434/v1. Override it if Ollama runs on a different host:

[providers.ollama]
kind = "ollama"
base_url = "http://192.168.1.100:11434/v1"
models = ["llama3.2"]

No API key is needed for local Ollama.

Usage

Start Ollama, pull a model, then send requests through crabllm:

ollama pull llama3.2
crabllm --config crabllm.toml
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Supported Endpoints

  • Chat completions (streaming and non-streaming)
  • Embeddings (if supported by the Ollama model)

Routing

Crabllm decides which provider handles a request based on model name, routing weights, and fallback logic.

Model Resolution

When a request arrives, crabllm looks up the model name in the configured providers. If the model is an alias, it resolves to the canonical name first (single-hop lookup).

Weighted Selection

When multiple providers serve the same model, one is selected via weighted random selection. Higher weight values mean more traffic:

[providers.primary]
kind = "openai"
api_key = "${OPENAI_KEY_1}"
models = ["gpt-4o"]
weight = 3                    # 75% of traffic

[providers.secondary]
kind = "openai"
api_key = "${OPENAI_KEY_2}"
models = ["gpt-4o"]
weight = 1                    # 25% of traffic

Selection is stateless — no shared counters. Each request picks independently.
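The selection rule can be sketched in a few lines (illustrative Python only; the gateway's actual Rust implementation is not shown here):

```python
import random

def pick(deployments):
    """Stateless weighted random pick over (name, weight) pairs.

    Each call is independent -- no shared counters, matching the
    behavior described above.
    """
    total = sum(w for _, w in deployments)
    r = random.uniform(0, total)
    for name, w in deployments:
        r -= w
        if r <= 0:
            return name
    return deployments[-1][0]  # guard against float edge cases

deployments = [("primary", 3), ("secondary", 1)]
# Over many requests, "primary" is chosen roughly 75% of the time.
```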

Retry

When a provider returns a transient error (HTTP 429, 500, 502, 503, 504), crabllm retries the same provider with exponential backoff:

  • Base delay: 100ms, doubling each retry.
  • Full jitter: each sleep is a random duration in [backoff/2, backoff] to prevent thundering herd.
  • Max retries: configurable per provider via max_retries (default 2).

[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o"]
max_retries = 3               # retry up to 3 times

Set max_retries = 0 to disable retry entirely.
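The backoff schedule described above — a 100 ms base that doubles each retry, with each sleep jittered into [backoff/2, backoff] — can be sketched as (illustrative Python, not the gateway's code):

```python
import random

def backoff_delays(max_retries: int, base_ms: float = 100.0):
    """Return the jittered sleep (in ms) before each retry.

    The nominal backoff doubles every attempt; the actual sleep is
    drawn uniformly from [backoff/2, backoff] to spread out retries.
    """
    delays = []
    backoff = base_ms
    for _ in range(max_retries):
        delays.append(random.uniform(backoff / 2, backoff))
        backoff *= 2
    return delays

# With max_retries = 2: first sleep in [50, 100] ms,
# second sleep in [100, 200] ms.
print(backoff_delays(2))
```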

Fallback

When retries are exhausted on a provider, crabllm tries the next provider by descending weight. This continues until a provider succeeds or all providers have been tried.

# Primary provider (tried first)
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o"]
weight = 2

# Fallback provider (tried if primary fails)
[providers.azure]
kind = "azure"
api_key = "${AZURE_KEY}"
base_url = "https://my-resource.openai.azure.com"
api_version = "2024-02-01"
models = ["gpt-4o"]
weight = 1

Timeouts

Each provider call is wrapped in a timeout. If the timeout expires, the request is treated as a transient error (triggers retry/fallback):

[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o"]
timeout = 60                  # seconds (default: 30)

Timeout errors return HTTP 504 Gateway Timeout if all providers time out.

Streaming Behavior

For streaming requests, retry and fallback only apply to connection errors (before the stream starts). Once the first SSE chunk is sent to the client, the connection is committed to that provider.

Streaming

Crabllm supports Server-Sent Events (SSE) streaming for chat completions across all providers. Streams are proxied without buffering — tokens arrive incrementally as the provider generates them.

Usage

Set "stream": true in the request body:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Write a haiku."}],
    "stream": true
  }'

The response is a stream of SSE events:

data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":"An"}}]}

data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":" old"}}]}

data: [DONE]
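A client reassembles the response by concatenating the delta.content fields until the [DONE] sentinel. A minimal Python consumer (illustrative; it assumes well-formed chunks and omits error handling):

```python
import json

def collect_stream(raw: str) -> str:
    """Join content deltas from OpenAI-style SSE lines until [DONE]."""
    parts = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

stream = (
    'data: {"choices":[{"delta":{"content":"An"}}]}\n\n'
    'data: {"choices":[{"delta":{"content":" old"}}]}\n\n'
    'data: [DONE]\n'
)
print(collect_stream(stream))  # An old
```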

Provider Translation

For non-OpenAI providers, crabllm translates the provider’s native streaming format to OpenAI-compatible SSE chunks:

  • Anthropic — message_start, content_block_delta events translated to chat.completion.chunk format.
  • Google Gemini — streamGenerateContent response parts translated to OpenAI chunks.
  • Bedrock — AWS event-stream binary frames decoded and translated.
  • Azure — same SSE format as OpenAI, no translation needed.

Extension Hooks

Extensions can observe each streaming chunk via the on_chunk hook. The rate limiter and budget extension use this to count tokens in real-time as they arrive.

Keep-Alive

SSE connections include automatic keep-alive pings to prevent proxy/load balancer timeouts during long generation pauses.

Error Handling

If an error occurs mid-stream (after the first chunk has been sent), it is delivered as an SSE event with an error payload. The stream then terminates. Retry and fallback only apply before the stream starts.

Authentication

Crabllm supports virtual API keys for client authentication and model access control.

Virtual Keys

Define keys in the config:

[[keys]]
name = "team-frontend"
key = "sk-frontend-abc123"
models = ["gpt-4o-mini"]

[[keys]]
name = "team-backend"
key = "sk-backend-xyz789"
models = ["gpt-4o", "claude-sonnet-4-20250514"]

[[keys]]
name = "admin"
key = "${ADMIN_API_KEY}"
models = ["*"]

Clients send the key in the Authorization header:

curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer sk-frontend-abc123" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hi"}]}'

Model Access Control

The models field controls which models a key can access:

  • ["gpt-4o", "gpt-4o-mini"] — only these models.
  • ["*"] — all models.

Requests for unauthorized models return HTTP 401.
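The check amounts to a key lookup plus an allowlist test, with ["*"] acting as a wildcard. A Python sketch (illustrative only, not crabllm's code):

```python
def authorize(keys: dict, token: str, model: str) -> bool:
    """Return True if the bearer token exists and may use the model."""
    allowed = keys.get(token)
    if allowed is None:
        return False                      # unknown key
    return "*" in allowed or model in allowed

keys = {
    "sk-frontend-abc123": ["gpt-4o-mini"],
    "sk-admin": ["*"],
}
print(authorize(keys, "sk-admin", "gpt-4o"))              # True
print(authorize(keys, "sk-frontend-abc123", "gpt-4o"))    # False -> HTTP 401
```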

No Auth Mode

When no keys are configured, authentication is disabled entirely. All requests pass through without checking the Authorization header.

# No [[keys]] section = auth disabled
listen = "0.0.0.0:8080"

[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o"]

Key Name Tracking

The key name field is used by extensions for per-key tracking:

  • Rate limiting — enforced per key name.
  • Usage tracking — tokens counted per key name.
  • Budget — spend limits per key name.
  • Logging — key name included in log entries.

Extensions

Extensions add functionality to the request pipeline via hooks. They run in-handler (not as middleware), giving direct access to typed request and response data.

Available Extensions

Cache

Caches non-streaming chat completion responses. Cache key is a SHA-256 hash of the serialized request body.

[extensions.cache]
ttl_seconds = 3600           # default: 300 (5 minutes)

Admin route: DELETE /v1/cache — clears all cached entries.
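How such a key might be derived (illustrative Python; crabllm's exact serialization — for example whether object keys are sorted — is not specified here, so sort_keys below is an assumption):

```python
import hashlib
import json

def cache_key(body: dict) -> str:
    # Serialize deterministically, then hash: semantically equal
    # requests map to the same cache entry.
    raw = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(raw).hexdigest()

a = {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hi"}]}
b = {"messages": [{"role": "user", "content": "Hi"}], "model": "gpt-4o"}
print(cache_key(a) == cache_key(b))  # True: key order doesn't matter
```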

Rate Limit

Enforces per-key request and token rate limits using a per-minute sliding window.

[extensions.rate_limit]
requests_per_minute = 60      # required
tokens_per_minute = 100000    # optional

Returns HTTP 429 when limits are exceeded. Token counting uses actual usage from provider responses (both streaming and non-streaming).
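A per-key sliding one-minute window can be sketched as follows (illustrative Python; crabllm's actual bookkeeping in storage may differ):

```python
from collections import deque

class SlidingWindowLimiter:
    """Per-key sliding one-minute request window (sketch only)."""

    def __init__(self, requests_per_minute: int):
        self.limit = requests_per_minute
        self.hits = {}  # key name -> deque of hit timestamps (seconds)

    def allow(self, key, now):
        q = self.hits.setdefault(key, deque())
        while q and now - q[0] >= 60:   # evict hits older than a minute
            q.popleft()
        if len(q) >= self.limit:
            return False                # caller maps this to HTTP 429
        q.append(now)
        return True

limiter = SlidingWindowLimiter(2)
print(limiter.allow("team-a", now=0))    # True
print(limiter.allow("team-a", now=1))    # True
print(limiter.allow("team-a", now=2))    # False: window is full
print(limiter.allow("team-a", now=61))   # True: first hit aged out
```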

Usage Tracker

Accumulates prompt and completion token counts per key and model.

[extensions.usage]

No configuration needed. Admin route: GET /v1/usage — returns JSON array of usage entries with key, model, prompt_tokens, and completion_tokens.

Budget

Enforces per-key spend limits. Requires pricing to be configured for the models in use.

[extensions.budget]
default_budget = 10.00        # USD, required

[extensions.budget.keys.team-a]
budget = 50.00                # USD override for this key

Returns HTTP 429 when a key’s spend exceeds its budget. Admin route: GET /v1/budget — returns JSON array with key, spent_usd, budget_usd, and remaining_usd.

Logging

Structured request logging via the tracing framework.

[extensions.logging]
level = "info"

Logs completed requests (model, provider, key, latency, token counts) and errors. Initializes the tracing_subscriber when enabled.

Hook Pipeline

Extensions run in config order at these points:

  1. on_request — before provider dispatch. Can short-circuit (rate limit, budget).
  2. on_cache_lookup — before provider dispatch for non-streaming. Returns cached response if available.
  3. on_response — after successful non-streaming response.
  4. on_chunk — for each SSE chunk during streaming.
  5. on_error — when a provider call fails.

Combining Extensions

Multiple extensions can be enabled simultaneously:

[extensions.logging]
level = "info"

[extensions.rate_limit]
requests_per_minute = 100

[extensions.usage]

[extensions.cache]
ttl_seconds = 600

[extensions.budget]
default_budget = 100.00

All extensions share the same storage backend.

Storage

Extensions that persist data (cache, rate limits, usage, budget) use a shared storage backend. Three backends are available.

Memory (default)

In-memory storage using concurrent hash maps. Fast, but data is lost on restart.

[storage]
kind = "memory"

This is the default when no [storage] section is present. No feature flag required.

SQLite

Persistent storage using SQLite via async pooled connections.

[storage]
kind = "sqlite"
path = "crabllm.db"

Requires the storage-sqlite feature:

cargo install crabllm --features storage-sqlite

The database file is created automatically if it doesn’t exist. Uses two tables (kv and counters) with atomic increment via INSERT ... ON CONFLICT ... RETURNING.

Redis

Remote persistent storage using Redis async multiplexed connections.

[storage]
kind = "redis"
path = "redis://127.0.0.1:6379"

Requires the storage-redis feature:

cargo install crabllm --features storage-redis

Supports standard Redis URLs. Increment maps to INCRBY, key listing uses SCAN with prefix glob patterns.

How Extensions Use Storage

Each extension namespaces its keys with a 4-byte prefix to avoid collisions:

| Extension | Operations |
|---|---|
| Cache | get/set response JSON with TTL check |
| Rate Limit | increment per-key-per-minute counters |
| Usage | increment per-key-per-model token counters |
| Budget | increment per-key spend in microdollars |

Architecture

Principles

  • Simplicity over abstraction. No trait where a function suffices.
  • Single responsibility. Each crate has one focused job.
  • OpenAI as canonical format. Providers translate to/from it.
  • Streaming first-class. Never buffer a full response when streaming.
  • Configuration-driven. Provider setup and routing from config, not code.
  • Minimal gateway latency. Avoid hot-path allocations.

Workspace Layout

crabllm/
  crates/
    crabllm/   — binary, wires everything together
    core/       — shared types, config, errors
    provider/   — provider enum + translation modules
    proxy/      — HTTP server, routing, extensions
    bench/      — benchmark mock backend

Crates

crabllm

Binary entry point. Loads TOML config, builds the provider registry, initializes the storage backend and extensions, starts the Axum HTTP server. CLI args: --config and --bind.

core

Shared types with no business logic. Contains:

  • Config — GatewayConfig with env var interpolation.
  • Types — OpenAI-compatible wire format structs (request, response, chunk).
  • Error — error enum with transient detection for retry logic.
  • Storage — async KV trait with memory, SQLite, and Redis backends.
  • Extension — hook trait for the request pipeline.

provider

Provider dispatch. The Provider enum has variants for each supported provider. Each variant dispatches to a per-provider module that handles request/response translation. ProviderRegistry maps model names to weighted deployment lists.

proxy

Axum HTTP server. Route handlers implement retry + fallback across deployments. Auth middleware validates virtual keys. Five built-in extensions run as in-handler hooks.

Request Flow

  1. Client sends OpenAI-format request to crabllm.
  2. Auth middleware validates the bearer token.
  3. Handler resolves model name (aliases) and gets deployment list.
  4. Extension on_request hooks run (rate limit, budget check).
  5. Cache lookup for non-streaming requests.
  6. Provider dispatch with retry + fallback.
  7. Provider translates request, calls upstream, translates response.
  8. Extension on_response/on_chunk hooks run (usage, budget, cache store).
  9. Response returned to client.

Benchmarks

Gateway overhead measured against a mock LLM server with instant responses — numbers reflect pure proxy cost.

Latency: P50 / P99 in milliseconds. Lower is better.

Chat Completions

| RPS | direct | crabllm | bifrost | litellm |
|---|---|---|---|---|
| 100 | 0.38 / 0.63 | 1.00 / 1.31 | 1.10 / 1.64 | 5.35 / 10.79 |
| 500 | 0.28 / 0.42 | 0.66 / 1.07 | 0.36 / 0.91 | 168.79 / 223.69 |
| 1000 | 0.15 / 0.31 | 0.44 / 0.83 | 0.27 / 0.46 | 172.00 / 201.55 |
| 2000 | 0.17 / 0.33 | 0.29 / 0.88 | 0.29 / 0.53 | 169.99 / 194.34 |
| 5000 | 0.13 / 0.33 | 0.26 / 0.57 | 0.26 / 0.48 | 159.86 / 492.82 |

Streaming

| RPS | direct | crabllm | bifrost | litellm |
|---|---|---|---|---|
| 100 | 0.45 / 0.62 | 43.53 / 48.14 | 1.51 / 2.20 | 670.25 / 3357.70 |
| 500 | 0.34 / 0.54 | 42.90 / 47.14 | 0.51 / 0.93 | 659.97 / 3569.92 |
| 1000 | 0.22 / 0.42 | 44.18 / 48.30 | 0.45 / 0.98 | 645.59 / 2797.66 |
| 2000 | 44.04 / 48.23 | 44.25 / 48.52 | 44.18 / 48.64 | 596.90 / 2678.08 |
| 5000 | 44.04 / 48.23 | 44.24 / 48.50 | 44.20 / 48.66 | 571.96 / 2563.73 |

Embeddings

| RPS | direct | crabllm | bifrost | litellm |
|---|---|---|---|---|
| 100 | 0.39 / 0.47 | 1.18 / 1.48 | 1.15 / 1.70 | 7.09 / 10.72 |
| 500 | 0.30 / 0.42 | 0.78 / 1.15 | 0.43 / 1.03 | 356.71 / 414.36 |
| 1000 | 0.17 / 0.27 | 0.51 / 0.91 | 0.38 / 0.85 | 332.53 / 6516.44 |
| 2000 | 0.18 / 0.32 | 0.36 / 1.08 | 0.39 / 0.94 | 317.53 / 365.68 |
| 5000 | 0.14 / 0.32 | 0.34 / 0.64 | 0.39 / 1.57 | 305.91 / 8778.06 |

Memory (Peak RSS)

| Gateway | Peak RSS |
|---|---|
| direct | 15.3 MB |
| crabllm | 34.9 MB |
| bifrost | 171.7 MB |
| litellm | 541.8 MB |