Introduction
Crabllm is a high-performance LLM API gateway written in Rust. It sits between your application and LLM providers, exposing an OpenAI-compatible API surface.
One API format. Many providers. Low overhead.
What It Does
You send requests in OpenAI format to crabllm. It routes them to the configured provider — OpenAI, Anthropic, Google Gemini, Azure OpenAI, AWS Bedrock, or Ollama — translating the request and response as needed.
Your application talks to one endpoint. Crabllm handles the rest:
- Provider translation — Anthropic, Google, and Bedrock have their own API formats. Crabllm translates automatically.
- Routing — Weighted random selection across multiple providers for the same model. Automatic fallback when a provider fails.
- Streaming — SSE streaming proxied without buffering.
- Auth — Virtual API keys with per-key model access control.
- Extensions — Rate limiting, caching, cost tracking, budget enforcement.
Why Rust
- Sub-millisecond overhead — no GC pauses, no interpreter startup.
- Memory safety — without runtime cost.
- Concurrency — Tokio async runtime handles thousands of concurrent streaming connections efficiently.
- Deployment — single static binary. No interpreter, no virtualenv, no Docker required.
Feature Comparison
| Feature | LiteLLM | Crabllm |
|---|---|---|
| /chat/completions | yes | yes |
| /embeddings | yes | yes |
| /models | yes | yes |
| OpenAI provider | yes | yes |
| Anthropic provider | yes | yes |
| Google Gemini provider | yes | yes |
| Azure OpenAI provider | yes | yes |
| AWS Bedrock provider | yes | yes |
| Tool/function calling | yes | yes |
| SSE streaming | yes | yes |
| Virtual keys + auth | yes | yes |
| Weighted routing | yes | yes |
| Model aliasing | yes | yes |
| Retry + fallback | yes | yes |
| Rate limiting (RPM/TPM) | yes | yes |
| Cost/usage tracking | yes | yes |
| Budget enforcement | yes | yes |
| Request caching | yes | yes |
| Image/audio endpoints | yes | yes |
| Storage (memory) | yes | yes |
| Storage (persistent) | Postgres | SQLite |
| Redis storage | yes | yes |
Getting Started
Install
cargo install crabllm
Configure
Create a crabllm.toml file:
listen = "0.0.0.0:8080"
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o", "gpt-4o-mini"]
[providers.anthropic]
kind = "anthropic"
api_key = "${ANTHROPIC_API_KEY}"
models = ["claude-sonnet-4-20250514"]
Environment variables in ${VAR} syntax are expanded at startup.
Run
crabllm --config crabllm.toml
You’ll see:
crabllm listening on 0.0.0.0:8080 (3 models, 2 providers, 0 extensions)
Send a Request
All requests use the OpenAI format, regardless of which provider handles them:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello!"}]
}'
To use Anthropic, just change the model name:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-20250514",
"messages": [{"role": "user", "content": "Hello!"}]
}'
The request format is the same. Crabllm translates it to the Anthropic Messages API internally.
Streaming
Add "stream": true to get SSE streaming:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
Model Aliasing
Map friendly names to canonical model names:
[aliases]
gpt4 = "gpt-4o"
claude = "claude-sonnet-4-20250514"
Now "model": "gpt4" routes to gpt-4o.
Next Steps
- Configuration — full reference for all config options
- Providers — setup guides for each provider
- Features — routing, auth, extensions, and more
Configuration
Crabllm is configured via a TOML file, passed with --config:
crabllm --config crabllm.toml
The --bind flag overrides the listen address.
Environment Variables
Strings containing ${VAR} are expanded from environment variables at startup.
Unknown variables expand to an empty string. Use this for secrets:
api_key = "${OPENAI_API_KEY}"
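The expansion rule can be sketched in a few lines of Python (an illustrative helper, not crabllm's actual code; the function name is hypothetical):

```python
import os
import re

def expand_env(value: str) -> str:
    # Replace each ${VAR} with the environment variable's value;
    # unknown variables expand to the empty string, matching crabllm's behavior.
    return re.sub(
        r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}",
        lambda m: os.environ.get(m.group(1), ""),
        value,
    )

os.environ["OPENAI_API_KEY"] = "sk-test"
expand_env("${OPENAI_API_KEY}")         # "sk-test"
expand_env("${DEFINITELY_UNSET_VAR}")   # ""
```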
Top-Level Fields
| Field | Type | Default | Description |
|---|---|---|---|
| listen | string | required | Address to bind, e.g. "0.0.0.0:8080" |
| shutdown_timeout | integer | 30 | Graceful shutdown timeout in seconds |
Providers
Each provider is a named entry under [providers]:
[providers.my_openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o", "gpt-4o-mini"]
| Field | Type | Default | Description |
|---|---|---|---|
| kind | string | required | Provider type (see Providers) |
| api_key | string | "" | API key for authentication |
| base_url | string | per-kind | Base URL override |
| models | list | [] | Model names this provider serves |
| weight | integer | 1 | Routing weight for load balancing |
| max_retries | integer | 2 | Max retries on transient errors |
| timeout | integer | 30 | Per-request timeout in seconds |
| api_version | string | — | API version (Azure only) |
| region | string | — | AWS region (Bedrock only) |
| access_key | string | — | AWS access key (Bedrock only) |
| secret_key | string | — | AWS secret key (Bedrock only) |
Virtual Keys
[[keys]]
name = "team-a"
key = "sk-team-a-secret"
models = ["gpt-4o", "claude-sonnet-4-20250514"]
[[keys]]
name = "admin"
key = "sk-admin-secret"
models = ["*"]
| Field | Type | Description |
|---|---|---|
| name | string | Human-readable key name (used in usage tracking) |
| key | string | The bearer token clients send |
| models | list | Allowed models. ["*"] means all |
When no keys are configured, authentication is disabled.
Aliases
[aliases]
gpt4 = "gpt-4o"
claude = "claude-sonnet-4-20250514"
Maps friendly model names to canonical names. Single-hop lookup.
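Single-hop means an alias resolves directly to a canonical name and the result is not looked up again, so chains of aliases do not resolve. A sketch of the idea (hypothetical helper, not crabllm's actual code):

```python
def resolve_model(name: str, aliases: dict[str, str]) -> str:
    # Single-hop lookup: return the mapped name if one exists,
    # otherwise pass the requested name through unchanged.
    return aliases.get(name, name)

aliases = {"gpt4": "gpt-4o", "claude": "claude-sonnet-4-20250514"}
resolve_model("gpt4", aliases)    # "gpt-4o"
resolve_model("gpt-4o", aliases)  # "gpt-4o" (unchanged)
```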
Pricing
[pricing.gpt-4o]
prompt_cost_per_million = 2.50
completion_cost_per_million = 10.00
[pricing.claude-sonnet-4-20250514]
prompt_cost_per_million = 3.00
completion_cost_per_million = 15.00
Per-model token pricing in USD. Used by the budget extension for spend tracking.
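The cost arithmetic is straightforward: tokens divided by one million, times the per-million rate. A worked example using the gpt-4o rates above (illustrative helper, not crabllm's actual code):

```python
def request_cost_usd(prompt_tokens: int, completion_tokens: int,
                     prompt_per_million: float, completion_per_million: float) -> float:
    # Cost in USD: each token count scaled by its per-million-token rate.
    return (prompt_tokens / 1_000_000 * prompt_per_million
            + completion_tokens / 1_000_000 * completion_per_million)

# 1,000 prompt + 500 completion tokens on gpt-4o at $2.50 / $10.00 per million:
request_cost_usd(1000, 500, 2.50, 10.00)  # 0.0075 USD
```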
Extensions
[extensions.cache]
ttl_seconds = 3600
[extensions.rate_limit]
requests_per_minute = 60
[extensions.usage]
[extensions.budget]
default_budget = 10.00
[extensions.logging]
level = "info"
See Extensions for details on each.
Storage
[storage]
kind = "memory"
| Kind | Feature flag | path field |
|---|---|---|
| memory | none (default) | not used |
| sqlite | storage-sqlite | file path, e.g. "crabllm.db" |
| redis | storage-redis | URL, e.g. "redis://127.0.0.1:6379" |
See Storage for details.
Full Example
listen = "0.0.0.0:8080"
shutdown_timeout = 30
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o", "gpt-4o-mini"]
weight = 2
max_retries = 2
timeout = 30
[providers.anthropic]
kind = "anthropic"
api_key = "${ANTHROPIC_API_KEY}"
models = ["claude-sonnet-4-20250514"]
[providers.ollama]
kind = "ollama"
models = ["llama3.2"]
[aliases]
gpt4 = "gpt-4o"
claude = "claude-sonnet-4-20250514"
[[keys]]
name = "default"
key = "${CRABLLM_API_KEY}"
models = ["*"]
[pricing.gpt-4o]
prompt_cost_per_million = 2.50
completion_cost_per_million = 10.00
[extensions.rate_limit]
requests_per_minute = 100
[extensions.usage]
[extensions.logging]
level = "info"
[storage]
kind = "sqlite"
path = "crabllm.db"
Providers
A provider is an LLM service that crabllm routes requests to. Each provider has its own API format and authentication mechanism. Crabllm translates between the OpenAI-compatible format your application uses and the provider’s native format.
Supported Providers
| Kind | Provider | Translation |
|---|---|---|
| openai | OpenAI, Groq, Together, vLLM, any OpenAI-compatible API | Pass-through |
| anthropic | Anthropic Messages API | Full translation |
| google | Google Gemini | Full translation |
| azure | Azure OpenAI | URL + auth rewrite |
| bedrock | AWS Bedrock Converse API | Full translation + SigV4 signing |
| ollama | Ollama (local models) | Pass-through (OpenAI-compatible) |
Common Fields
Every provider supports these fields:
[providers.name]
kind = "..." # required
api_key = "..." # API key (supports ${ENV_VAR})
base_url = "..." # base URL override
models = ["..."] # model names this provider serves
weight = 1 # routing weight (higher = more traffic)
max_retries = 2 # retries on transient errors (429, 5xx)
timeout = 30 # per-request timeout in seconds
Multiple Providers for the Same Model
When multiple providers list the same model, crabllm selects between them using weighted random selection. If the selected provider fails, it falls back to the next provider by weight. See Routing.
[providers.openai_primary]
kind = "openai"
api_key = "${OPENAI_KEY_1}"
models = ["gpt-4o"]
weight = 3
[providers.openai_backup]
kind = "openai"
api_key = "${OPENAI_KEY_2}"
models = ["gpt-4o"]
weight = 1
Endpoint Support
| Endpoint | OpenAI | Anthropic | Google | Azure | Bedrock | Ollama |
|---|---|---|---|---|---|---|
| Chat completions | yes | yes | yes | yes | yes | yes |
| Streaming | yes | yes | yes | yes | yes | yes |
| Embeddings | yes | — | — | yes | — | yes |
| Image generation | yes | — | — | yes | — | — |
| Audio speech | yes | — | — | yes | — | — |
| Audio transcription | yes | — | — | yes | — | — |
| Tool/function calling | yes | yes | yes | yes | yes | yes |
OpenAI
The openai provider works with OpenAI and any OpenAI-compatible API
(Groq, Together AI, vLLM, etc.). Requests are forwarded as-is with URL and auth
rewrite — no translation needed.
Configuration
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o", "gpt-4o-mini", "text-embedding-3-small"]
Custom Base URL
For OpenAI-compatible services, set base_url:
[providers.groq]
kind = "openai"
api_key = "${GROQ_API_KEY}"
base_url = "https://api.groq.com/openai/v1"
models = ["llama-3.3-70b-versatile"]
[providers.together]
kind = "openai"
api_key = "${TOGETHER_API_KEY}"
base_url = "https://api.together.xyz/v1"
models = ["meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"]
Supported Endpoints
- Chat completions (streaming and non-streaming)
- Embeddings
- Image generation
- Audio speech (TTS)
- Audio transcription
Tool Calling
Tool calling works as-is — the request body is forwarded directly:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {
"type": "object",
"properties": {"location": {"type": "string"}},
"required": ["location"]
}
}
}]
}'
Anthropic
The anthropic provider translates OpenAI-format requests to the Anthropic
Messages API and back.
Configuration
[providers.anthropic]
kind = "anthropic"
api_key = "${ANTHROPIC_API_KEY}"
models = ["claude-sonnet-4-20250514", "claude-haiku-4-20250514"]
Translation
Crabllm handles the full translation between OpenAI and Anthropic formats:
- System messages — extracted from the messages array and sent as the Anthropic `system` parameter.
- Stop reasons — mapped between formats (`end_turn` to `stop`, etc.).
- Tool calling — fully supported. Tool definitions, tool use responses, and tool result messages are all translated.
- Streaming — Anthropic's event stream (`message_start`, `content_block_delta`, etc.) is translated to OpenAI-format SSE chunks.
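The system-message step can be sketched as follows (an illustrative Python helper, not crabllm's Rust implementation; the function name is hypothetical):

```python
def split_system(messages: list[dict]) -> tuple[str, list[dict]]:
    # Anthropic takes system prompts as a separate top-level `system`
    # parameter, so pull them out of the OpenAI-style messages array.
    system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    rest = [m for m in messages if m["role"] != "system"]
    return system, rest

system, rest = split_system([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
# system == "You are a helpful assistant."
# rest   == [{"role": "user", "content": "Hello!"}]
```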
Usage
Send requests in OpenAI format as usual:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-20250514",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
}'
Limitations
- Embeddings, image generation, and audio endpoints are not supported by the Anthropic API.
Google Gemini
The google provider translates OpenAI-format requests to the Google Gemini API
(the Generative Language API).
Configuration
[providers.google]
kind = "google"
api_key = "${GOOGLE_API_KEY}"
models = ["gemini-2.0-flash", "gemini-2.5-pro"]
Translation
- System messages — mapped to Gemini's `systemInstruction` field.
- Roles — `assistant` mapped to `model`; `user` stays `user`.
- Content — mapped to Gemini's `parts` array format.
- Tool calling — tool definitions mapped to `functionDeclarations`, tool messages to `functionResponse` parts; responses extract `functionCall` parts.
- Streaming — uses `streamGenerateContent?alt=sse` and translates the Gemini event stream to OpenAI-format SSE chunks.
Usage
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemini-2.0-flash",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Limitations
- Embeddings, image generation, and audio endpoints are not supported.
Azure OpenAI
The azure provider routes to Azure OpenAI deployments. The request body is
OpenAI-format (no translation needed), but the URL pattern and authentication
differ.
Configuration
[providers.azure]
kind = "azure"
api_key = "${AZURE_OPENAI_KEY}"
base_url = "https://my-resource.openai.azure.com"
api_version = "2024-02-01"
models = ["gpt-4o"]
- `base_url` — your Azure OpenAI resource URL.
- `api_version` — the Azure API version string.
How It Works
Crabllm rewrites the URL to Azure’s deployment-based pattern:
POST /openai/deployments/{model}/chat/completions?api-version={api_version}
Authentication uses the api-key header instead of Authorization: Bearer.
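The URL rewrite can be sketched as simple string assembly (illustrative helper, not crabllm's actual code):

```python
def azure_url(base_url: str, model: str, api_version: str) -> str:
    # Azure routes by deployment name in the path rather than a model
    # field in the request body, and versions the API via a query param.
    return (f"{base_url}/openai/deployments/{model}"
            f"/chat/completions?api-version={api_version}")

azure_url("https://my-resource.openai.azure.com", "gpt-4o", "2024-02-01")
# "https://my-resource.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-02-01"
```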
Supported Endpoints
- Chat completions (streaming and non-streaming)
- Embeddings
- Image generation
- Audio speech (TTS)
- Audio transcription
Usage
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello!"}]
}'
AWS Bedrock
The bedrock provider translates requests to the AWS Bedrock Converse API with
SigV4 request signing. No AWS SDK dependency — signing is handled internally.
Feature Flag
Bedrock support requires the provider-bedrock cargo feature:
cargo install crabllm --features provider-bedrock
Configuration
[providers.bedrock]
kind = "bedrock"
region = "us-east-1"
access_key = "${AWS_ACCESS_KEY_ID}"
secret_key = "${AWS_SECRET_ACCESS_KEY}"
models = ["anthropic.claude-3-5-sonnet-20241022-v2:0"]
Translation
- System messages — mapped to the Bedrock `system` field.
- Tool calling — tool definitions mapped to `toolConfig.tools[].toolSpec`, tool results to `toolResult` content blocks.
- Stop reasons — `end_turn` to `stop`, `tool_use` to `tool_calls`, `max_tokens` to `length`.
- Streaming — uses ConverseStream with AWS event-stream binary framing.
Usage
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "anthropic.claude-3-5-sonnet-20241022-v2:0",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Limitations
- Embeddings, image generation, and audio endpoints are not supported.
Ollama
The ollama provider connects to a local Ollama instance.
Ollama exposes an OpenAI-compatible API, so requests are forwarded as-is.
Configuration
[providers.ollama]
kind = "ollama"
models = ["llama3.2", "mistral"]
The default base URL is http://localhost:11434/v1. Override it if Ollama runs
on a different host:
[providers.ollama]
kind = "ollama"
base_url = "http://192.168.1.100:11434/v1"
models = ["llama3.2"]
No API key is needed for local Ollama.
Usage
Start Ollama, pull a model, then send requests through crabllm:
ollama pull llama3.2
crabllm --config crabllm.toml
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Supported Endpoints
- Chat completions (streaming and non-streaming)
- Embeddings (if supported by the Ollama model)
Routing
Crabllm decides which provider handles a request based on model name, routing weights, and fallback logic.
Model Resolution
When a request arrives, crabllm looks up the model name in the configured providers. If the model is an alias, it resolves to the canonical name first (single-hop lookup).
Weighted Selection
When multiple providers serve the same model, one is selected via weighted random
selection. Higher weight values mean more traffic:
[providers.primary]
kind = "openai"
api_key = "${OPENAI_KEY_1}"
models = ["gpt-4o"]
weight = 3 # 75% of traffic
[providers.secondary]
kind = "openai"
api_key = "${OPENAI_KEY_2}"
models = ["gpt-4o"]
weight = 1 # 25% of traffic
Selection is stateless — no shared counters. Each request picks independently.
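Stateless weighted random selection can be sketched like this (illustrative Python, not crabllm's Rust implementation):

```python
import random

def pick_provider(deployments: list[tuple[str, int]]) -> str:
    # deployments: (name, weight) pairs. Draw a point on [0, total_weight)
    # and walk the list until the cumulative weight passes it. No shared
    # state, so each request picks independently.
    total = sum(w for _, w in deployments)
    r = random.uniform(0, total)
    for name, weight in deployments:
        r -= weight
        if r <= 0:
            return name
    return deployments[-1][0]  # guard against float rounding

# With weights 3 and 1, "primary" wins ~75% of draws.
pick_provider([("primary", 3), ("backup", 1)])
```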
Retry
When a provider returns a transient error (HTTP 429, 500, 502, 503, 504), crabllm retries the same provider with exponential backoff:
- Base delay: 100ms, doubling each retry.
- Full jitter: each sleep is a random duration in `[backoff/2, backoff]` to prevent thundering herd.
- Max retries: configurable per provider via `max_retries` (default 2).
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o"]
max_retries = 3 # retry up to 3 times
Set max_retries = 0 to disable retry entirely.
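The backoff schedule described above can be sketched as (illustrative helper, not crabllm's actual code):

```python
import random

def backoff_delay(attempt: int, base: float = 0.1) -> float:
    # Exponential backoff: 100ms base, doubling each attempt,
    # jittered into [backoff/2, backoff] to spread out retries.
    backoff = base * (2 ** attempt)
    return random.uniform(backoff / 2, backoff)

# attempt 0 -> somewhere in [0.05s, 0.1s]
# attempt 2 -> somewhere in [0.2s, 0.4s]
```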
Fallback
When retries are exhausted on a provider, crabllm tries the next provider by descending weight. This continues until a provider succeeds or all providers have been tried.
# Primary provider (tried first)
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o"]
weight = 2
# Fallback provider (tried if primary fails)
[providers.azure]
kind = "azure"
api_key = "${AZURE_KEY}"
base_url = "https://my-resource.openai.azure.com"
api_version = "2024-02-01"
models = ["gpt-4o"]
weight = 1
Timeouts
Each provider call is wrapped in a timeout. If the timeout expires, the request is treated as a transient error (triggers retry/fallback):
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o"]
timeout = 60 # seconds (default: 30)
Timeout errors return HTTP 504 Gateway Timeout if all providers time out.
Streaming Behavior
For streaming requests, retry and fallback only apply to connection errors (before the stream starts). Once the first SSE chunk is sent to the client, the connection is committed to that provider.
Streaming
Crabllm supports Server-Sent Events (SSE) streaming for chat completions across all providers. Streams are proxied without buffering — tokens arrive incrementally as the provider generates them.
Usage
Set "stream": true in the request body:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Write a haiku."}],
"stream": true
}'
The response is a stream of SSE events:
data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":"An"}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":" old"}}]}
data: [DONE]
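A client consumes this stream by parsing each `data:` line until the `[DONE]` sentinel. A minimal sketch (hypothetical helper operating on already-received lines; a real client would read them off the HTTP response):

```python
import json

def parse_sse(lines: list[str]) -> str:
    # Accumulate delta content from OpenAI-format SSE lines until [DONE].
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)
```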
Provider Translation
For non-OpenAI providers, crabllm translates the provider’s native streaming format to OpenAI-compatible SSE chunks:
- Anthropic — `message_start` and `content_block_delta` events translated to `chat.completion.chunk` format.
- Google Gemini — `streamGenerateContent` response parts translated to OpenAI chunks.
- Bedrock — AWS event-stream binary frames decoded and translated.
- Azure — same SSE format as OpenAI; no translation needed.
Extension Hooks
Extensions can observe each streaming chunk via the on_chunk hook. The rate
limiter and budget extension use this to count tokens in real-time as they arrive.
Keep-Alive
SSE connections include automatic keep-alive pings to prevent proxy/load balancer timeouts during long generation pauses.
Error Handling
If an error occurs mid-stream (after the first chunk has been sent), it is delivered as an SSE event with an error payload. The stream then terminates. Retry and fallback only apply before the stream starts.
Authentication
Crabllm supports virtual API keys for client authentication and model access control.
Virtual Keys
Define keys in the config:
[[keys]]
name = "team-frontend"
key = "sk-frontend-abc123"
models = ["gpt-4o-mini"]
[[keys]]
name = "team-backend"
key = "sk-backend-xyz789"
models = ["gpt-4o", "claude-sonnet-4-20250514"]
[[keys]]
name = "admin"
key = "${ADMIN_API_KEY}"
models = ["*"]
Clients send the key in the Authorization header:
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer sk-frontend-abc123" \
-H "Content-Type: application/json" \
-d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hi"}]}'
Model Access Control
The models field controls which models a key can access:
- `["gpt-4o", "gpt-4o-mini"]` — only these models.
- `["*"]` — all models.
Requests for unauthorized models return HTTP 401.
No Auth Mode
When no keys are configured, authentication is disabled entirely. All requests
pass through without checking the Authorization header.
# No [[keys]] section = auth disabled
listen = "0.0.0.0:8080"
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o"]
Key Name Tracking
The key name field is used by extensions for per-key tracking:
- Rate limiting — enforced per key name.
- Usage tracking — tokens counted per key name.
- Budget — spend limits per key name.
- Logging — key name included in log entries.
Extensions
Extensions add functionality to the request pipeline via hooks. They run in-handler (not as middleware), giving direct access to typed request and response data.
Available Extensions
Cache
Caches non-streaming chat completion responses. Cache key is a SHA-256 hash of the serialized request body.
[extensions.cache]
ttl_seconds = 3600 # default: 300 (5 minutes)
Admin route: DELETE /v1/cache — clears all cached entries.
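The hashing scheme can be sketched as follows (illustrative Python; crabllm's exact serialization may differ, so treat the canonicalization choice here as an assumption):

```python
import hashlib
import json

def cache_key(body: dict) -> str:
    # Hash a canonical serialization so logically identical requests
    # (regardless of JSON key order) map to the same cache entry.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Same request, different key order -> same cache key.
cache_key({"model": "gpt-4o", "messages": []})
```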
Rate Limit
Enforces per-key request and token rate limits using a per-minute sliding window.
[extensions.rate_limit]
requests_per_minute = 60 # required
tokens_per_minute = 100000 # optional
Returns HTTP 429 when limits are exceeded. Token counting uses actual usage from provider responses (both streaming and non-streaming).
Usage Tracker
Accumulates prompt and completion token counts per key and model.
[extensions.usage]
No configuration needed. Admin route: GET /v1/usage — returns JSON array of
usage entries with key, model, prompt_tokens, and completion_tokens.
Budget
Enforces per-key spend limits. Requires pricing to be configured for the models in use.
[extensions.budget]
default_budget = 10.00 # USD, required
[extensions.budget.keys.team-a]
budget = 50.00 # USD override for this key
Returns HTTP 429 when a key’s spend exceeds its budget. Admin route:
GET /v1/budget — returns JSON array with key, spent_usd, budget_usd,
and remaining_usd.
Logging
Structured request logging via the tracing framework.
[extensions.logging]
level = "info"
Logs completed requests (model, provider, key, latency, token counts) and
errors. Initializes the tracing_subscriber when enabled.
Hook Pipeline
Extensions run in config order at these points:
- `on_request` — before provider dispatch. Can short-circuit (rate limit, budget).
- `on_cache_lookup` — before provider dispatch for non-streaming requests. Returns a cached response if available.
- `on_response` — after a successful non-streaming response.
- `on_chunk` — for each SSE chunk during streaming.
- `on_error` — when a provider call fails.
Combining Extensions
Multiple extensions can be enabled simultaneously:
[extensions.logging]
level = "info"
[extensions.rate_limit]
requests_per_minute = 100
[extensions.usage]
[extensions.cache]
ttl_seconds = 600
[extensions.budget]
default_budget = 100.00
All extensions share the same storage backend.
Storage
Extensions that persist data (cache, rate limits, usage, budget) use a shared storage backend. Three backends are available.
Memory (default)
In-memory storage using concurrent hash maps. Fast, but data is lost on restart.
[storage]
kind = "memory"
This is the default when no [storage] section is present. No feature flag
required.
SQLite
Persistent storage using SQLite via async pooled connections.
[storage]
kind = "sqlite"
path = "crabllm.db"
Requires the storage-sqlite feature:
cargo install crabllm --features storage-sqlite
The database file is created automatically if it doesn’t exist. Uses two tables
(kv and counters) with atomic increment via INSERT ... ON CONFLICT ... RETURNING.
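The upsert-increment pattern can be sketched with Python's `sqlite3` (a simplified illustration: it uses `ON CONFLICT ... DO UPDATE` plus a follow-up `SELECT` for portability, where crabllm folds both into one statement with `RETURNING`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (k TEXT PRIMARY KEY, v INTEGER NOT NULL)")

def incr(key: str, delta: int) -> int:
    # Insert the counter if absent, otherwise add delta to the stored value.
    conn.execute(
        "INSERT INTO counters (k, v) VALUES (?, ?) "
        "ON CONFLICT(k) DO UPDATE SET v = v + excluded.v",
        (key, delta),
    )
    return conn.execute(
        "SELECT v FROM counters WHERE k = ?", (key,)
    ).fetchone()[0]
```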
Redis
Remote persistent storage using Redis async multiplexed connections.
[storage]
kind = "redis"
path = "redis://127.0.0.1:6379"
Requires the storage-redis feature:
cargo install crabllm --features storage-redis
Supports standard Redis URLs. Increment maps to INCRBY, key listing uses
SCAN with prefix glob patterns.
How Extensions Use Storage
Each extension namespaces its keys with a 4-byte prefix to avoid collisions:
| Extension | Operations |
|---|---|
| Cache | get/set response JSON with TTL check |
| Rate Limit | increment per-key-per-minute counters |
| Usage | increment per-key-per-model token counters |
| Budget | increment per-key spend in microdollars |
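Tracking spend in integer microdollars (1 USD = 1,000,000) lets the budget extension use the same atomic counter increments as the other extensions, avoiding floating-point drift in storage. A sketch of the conversion (illustrative helper, not crabllm's actual code):

```python
def spend_microdollars(prompt_tokens: int, completion_tokens: int,
                       prompt_per_million: float, completion_per_million: float) -> int:
    # Compute the request cost in USD, then store it as an integer
    # number of microdollars suitable for a counter increment.
    usd = (prompt_tokens / 1_000_000 * prompt_per_million
           + completion_tokens / 1_000_000 * completion_per_million)
    return round(usd * 1_000_000)

# 1,000 prompt + 500 completion tokens at $2.50 / $10.00 per million:
spend_microdollars(1000, 500, 2.50, 10.00)  # 7500 microdollars ($0.0075)
```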
Architecture
Principles
- Simplicity over abstraction. No trait where a function suffices.
- Single responsibility. Each crate has one focused job.
- OpenAI as canonical format. Providers translate to/from it.
- Streaming first-class. Never buffer a full response when streaming.
- Configuration-driven. Provider setup and routing from config, not code.
- Minimal gateway latency. Avoid hot-path allocations.
Workspace Layout
crabllm/
crates/
crabllm/ — binary, wires everything together
core/ — shared types, config, errors
provider/ — provider enum + translation modules
proxy/ — HTTP server, routing, extensions
bench/ — benchmark mock backend
Crates
crabllm
Binary entry point. Loads TOML config, builds the provider registry, initializes
the storage backend and extensions, starts the Axum HTTP server. CLI args:
--config and --bind.
core
Shared types with no business logic. Contains:
- Config — `GatewayConfig` with env var interpolation.
- Types — OpenAI-compatible wire format structs (request, response, chunk).
- Error — error enum with transient detection for retry logic.
- Storage — async KV trait with memory, SQLite, and Redis backends.
- Extension — hook trait for the request pipeline.
provider
Provider dispatch. The Provider enum has variants for each supported provider.
Each variant dispatches to a per-provider module that handles request/response
translation. ProviderRegistry maps model names to weighted deployment lists.
proxy
Axum HTTP server. Route handlers implement retry + fallback across deployments. Auth middleware validates virtual keys. Five built-in extensions run as in-handler hooks.
Request Flow
1. Client sends an OpenAI-format request to crabllm.
2. Auth middleware validates the bearer token.
3. Handler resolves the model name (aliases) and gets the deployment list.
4. Extension `on_request` hooks run (rate limit, budget check).
5. Cache lookup for non-streaming requests.
6. Provider dispatch with retry + fallback.
7. Provider translates the request, calls upstream, translates the response.
8. Extension `on_response` / `on_chunk` hooks run (usage, budget, cache store).
9. Response returned to the client.
Benchmarks
Gateway overhead measured against a mock LLM server with instant responses — numbers reflect pure proxy cost.
Latency: P50 / P99 in milliseconds. Lower is better.
Chat Completions
| RPS | direct | crabllm | bifrost | litellm |
|---|---|---|---|---|
| 100 | 0.38 / 0.63 | 1.00 / 1.31 | 1.10 / 1.64 | 5.35 / 10.79 |
| 500 | 0.28 / 0.42 | 0.66 / 1.07 | 0.36 / 0.91 | 168.79 / 223.69 |
| 1000 | 0.15 / 0.31 | 0.44 / 0.83 | 0.27 / 0.46 | 172.00 / 201.55 |
| 2000 | 0.17 / 0.33 | 0.29 / 0.88 | 0.29 / 0.53 | 169.99 / 194.34 |
| 5000 | 0.13 / 0.33 | 0.26 / 0.57 | 0.26 / 0.48 | 159.86 / 492.82 |
Streaming
| RPS | direct | crabllm | bifrost | litellm |
|---|---|---|---|---|
| 100 | 0.45 / 0.62 | 43.53 / 48.14 | 1.51 / 2.20 | 670.25 / 3357.70 |
| 500 | 0.34 / 0.54 | 42.90 / 47.14 | 0.51 / 0.93 | 659.97 / 3569.92 |
| 1000 | 0.22 / 0.42 | 44.18 / 48.30 | 0.45 / 0.98 | 645.59 / 2797.66 |
| 2000 | 44.04 / 48.23 | 44.25 / 48.52 | 44.18 / 48.64 | 596.90 / 2678.08 |
| 5000 | 44.04 / 48.23 | 44.24 / 48.50 | 44.20 / 48.66 | 571.96 / 2563.73 |
Embeddings
| RPS | direct | crabllm | bifrost | litellm |
|---|---|---|---|---|
| 100 | 0.39 / 0.47 | 1.18 / 1.48 | 1.15 / 1.70 | 7.09 / 10.72 |
| 500 | 0.30 / 0.42 | 0.78 / 1.15 | 0.43 / 1.03 | 356.71 / 414.36 |
| 1000 | 0.17 / 0.27 | 0.51 / 0.91 | 0.38 / 0.85 | 332.53 / 6516.44 |
| 2000 | 0.18 / 0.32 | 0.36 / 1.08 | 0.39 / 0.94 | 317.53 / 365.68 |
| 5000 | 0.14 / 0.32 | 0.34 / 0.64 | 0.39 / 1.57 | 305.91 / 8778.06 |
Memory (Peak RSS)
| Gateway | Peak RSS |
|---|---|
| direct | 15.3 MB |
| crabllm | 34.9 MB |
| bifrost | 171.7 MB |
| litellm | 541.8 MB |