Introduction
Crabllm is a high-performance LLM API gateway written in Rust. It sits between your application and LLM providers, exposing an OpenAI-compatible API surface.
One API format. Many providers. Low overhead.
What It Does
You send requests in OpenAI format to crabllm. It routes them to the configured provider — OpenAI, Anthropic, Google Gemini, Azure OpenAI, AWS Bedrock, or Ollama — translating the request and response as needed.
Your application talks to one endpoint. Crabllm handles the rest:
- Provider translation — Anthropic, Google, and Bedrock have their own API formats. Crabllm translates automatically.
- Routing — Weighted random selection across multiple providers for the same model. Automatic fallback when a provider fails.
- Streaming — SSE streaming proxied without buffering.
- Auth — Virtual API keys with per-key model access control.
- Extensions — Rate limiting, caching, cost tracking, budget enforcement.
Why Rust
- Sub-millisecond overhead — no GC pauses, no interpreter startup.
- Memory safety — without runtime cost.
- Concurrency — Tokio async runtime handles thousands of concurrent streaming connections efficiently.
- Deployment — single static binary. No interpreter, no virtualenv, no Docker required.
Feature Comparison
| Feature | LiteLLM | Crabllm |
|---|---|---|
| /chat/completions | yes | yes |
| /embeddings | yes | yes |
| /models | yes | yes |
| OpenAI provider | yes | yes |
| Anthropic provider | yes | yes |
| Google Gemini provider | yes | yes |
| Azure OpenAI provider | yes | yes |
| AWS Bedrock provider | yes | yes |
| Tool/function calling | yes | yes |
| SSE streaming | yes | yes |
| Virtual keys + auth | yes | yes |
| Weighted routing | yes | yes |
| Model aliasing | yes | yes |
| Retry + fallback | yes | yes |
| Rate limiting (RPM/TPM) | yes | yes |
| Cost/usage tracking | yes | yes |
| Budget enforcement | yes | yes |
| Request caching | yes | yes |
| Image/audio endpoints | yes | yes |
| Storage (memory) | yes | yes |
| Storage (persistent) | Postgres | SQLite |
| Redis storage | yes | yes |
Getting Started
Install
cargo install crabllm
Configure
Create a crabllm.toml file:
listen = "0.0.0.0:8080"
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o", "gpt-4o-mini"]
[providers.anthropic]
kind = "anthropic"
api_key = "${ANTHROPIC_API_KEY}"
models = ["claude-sonnet-4-20250514"]
Environment variables in ${VAR} syntax are expanded at startup.
Run
crabllm --config crabllm.toml
You’ll see:
crabllm listening on 0.0.0.0:8080 (3 models, 2 providers, 0 extensions)
Send a Request
All requests use the OpenAI format, regardless of which provider handles them:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello!"}]
}'
To use Anthropic, just change the model name:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-20250514",
"messages": [{"role": "user", "content": "Hello!"}]
}'
The request format is the same. Crabllm translates it to the Anthropic Messages API internally.
Streaming
Add "stream": true to get SSE streaming:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
Model Aliasing
Map friendly names to canonical model names:
[aliases]
gpt4 = "gpt-4o"
claude = "claude-sonnet-4-20250514"
Now "model": "gpt4" routes to gpt-4o.
Next Steps
- Configuration — full reference for all config options
- Providers — setup guides for each provider
- Features — routing, auth, extensions, and more
Configuration
Crabllm is configured via a TOML file, passed with --config:
crabllm --config crabllm.toml
The --bind flag overrides the listen address.
Environment Variables
Strings containing ${VAR} are expanded from environment variables at startup.
Unknown variables expand to an empty string. Use this for secrets:
api_key = "${OPENAI_API_KEY}"
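The expansion rule can be sketched in a few lines of Python (an illustrative helper, not crabllm's actual code; the function name is hypothetical):

```python
import os
import re

def expand_env(value: str) -> str:
    # Replace each ${VAR} with the environment variable's value;
    # unknown variables expand to the empty string, matching crabllm's behavior.
    return re.sub(
        r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}",
        lambda m: os.environ.get(m.group(1), ""),
        value,
    )

os.environ["OPENAI_API_KEY"] = "sk-test"
expand_env("${OPENAI_API_KEY}")         # "sk-test"
expand_env("${DEFINITELY_UNSET_VAR}")   # ""
```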
Top-Level Fields
| Field | Type | Default | Description |
|---|---|---|---|
| listen | string | required | Address to bind, e.g. "0.0.0.0:8080" |
| shutdown_timeout | integer | 30 | Graceful shutdown timeout in seconds |
Providers
Each provider is a named entry under [providers]:
[providers.my_openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o", "gpt-4o-mini"]
| Field | Type | Default | Description |
|---|---|---|---|
| kind | string | required | Provider type (see Providers) |
| api_key | string | "" | API key for authentication |
| base_url | string | per-kind | Base URL override |
| models | list | [] | Model names this provider serves |
| weight | integer | 1 | Routing weight for load balancing |
| max_retries | integer | 2 | Max retries on transient errors |
| timeout | integer | 30 | Per-request timeout in seconds |
| api_version | string | — | API version (Azure only) |
| region | string | — | AWS region (Bedrock only) |
| access_key | string | — | AWS access key (Bedrock only) |
| secret_key | string | — | AWS secret key (Bedrock only) |
Virtual Keys
[[keys]]
name = "team-a"
key = "sk-team-a-secret"
models = ["gpt-4o", "claude-sonnet-4-20250514"]
[[keys]]
name = "admin"
key = "sk-admin-secret"
models = ["*"]
| Field | Type | Description |
|---|---|---|
| name | string | Human-readable key name (used in usage tracking) |
| key | string | The bearer token clients send |
| models | list | Allowed models. ["*"] means all |
When no keys are configured, authentication is disabled.
Aliases
[aliases]
gpt4 = "gpt-4o"
claude = "claude-sonnet-4-20250514"
Maps friendly model names to canonical names. Single-hop lookup.
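Single-hop means an alias resolves directly to a canonical name and the result is not looked up again, so chains of aliases do not resolve. A sketch of the idea (hypothetical helper, not crabllm's actual code):

```python
def resolve_model(name: str, aliases: dict[str, str]) -> str:
    # Single-hop lookup: return the mapped name if one exists,
    # otherwise pass the requested name through unchanged.
    return aliases.get(name, name)

aliases = {"gpt4": "gpt-4o", "claude": "claude-sonnet-4-20250514"}
resolve_model("gpt4", aliases)    # "gpt-4o"
resolve_model("gpt-4o", aliases)  # "gpt-4o" (unchanged)
```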
Pricing
[pricing.gpt-4o]
prompt_cost_per_million = 2.50
completion_cost_per_million = 10.00
[pricing.claude-sonnet-4-20250514]
prompt_cost_per_million = 3.00
completion_cost_per_million = 15.00
Per-model token pricing in USD. Used by the budget extension for spend tracking.
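The cost arithmetic is straightforward: tokens divided by one million, times the per-million rate. A worked example using the gpt-4o rates above (illustrative helper, not crabllm's actual code):

```python
def request_cost_usd(prompt_tokens: int, completion_tokens: int,
                     prompt_per_million: float, completion_per_million: float) -> float:
    # Cost in USD: each token count scaled by its per-million-token rate.
    return (prompt_tokens / 1_000_000 * prompt_per_million
            + completion_tokens / 1_000_000 * completion_per_million)

# 1,000 prompt + 500 completion tokens on gpt-4o at $2.50 / $10.00 per million:
request_cost_usd(1000, 500, 2.50, 10.00)  # 0.0075 USD
```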
Extensions
[extensions.cache]
ttl_seconds = 3600
[extensions.rate_limit]
requests_per_minute = 60
[extensions.usage]
[extensions.budget]
default_budget = 10.00
[extensions.logging]
level = "info"
See Extensions for details on each.
Storage
[storage]
kind = "memory"
| Kind | Feature flag | path field |
|---|---|---|
| memory | none (default) | not used |
| sqlite | storage-sqlite | file path, e.g. "crabllm.db" |
| redis | storage-redis | URL, e.g. "redis://127.0.0.1:6379" |
See Storage for details.
Full Example
listen = "0.0.0.0:8080"
shutdown_timeout = 30
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o", "gpt-4o-mini"]
weight = 2
max_retries = 2
timeout = 30
[providers.anthropic]
kind = "anthropic"
api_key = "${ANTHROPIC_API_KEY}"
models = ["claude-sonnet-4-20250514"]
[providers.ollama]
kind = "ollama"
models = ["llama3.2"]
[aliases]
gpt4 = "gpt-4o"
claude = "claude-sonnet-4-20250514"
[[keys]]
name = "default"
key = "${CRABLLM_API_KEY}"
models = ["*"]
[pricing.gpt-4o]
prompt_cost_per_million = 2.50
completion_cost_per_million = 10.00
[extensions.rate_limit]
requests_per_minute = 100
[extensions.usage]
[extensions.logging]
level = "info"
[storage]
kind = "sqlite"
path = "crabllm.db"
Providers
A provider is an LLM service that crabllm routes requests to. Each provider has its own API format and authentication mechanism. Crabllm translates between the OpenAI-compatible format your application uses and the provider’s native format.
Supported Providers
| Kind | Provider | Translation |
|---|---|---|
| openai | OpenAI, Groq, Together, vLLM, any OpenAI-compatible API | Pass-through |
| anthropic | Anthropic Messages API | Full translation |
| google | Google Gemini | Full translation |
| azure | Azure OpenAI | URL + auth rewrite |
| bedrock | AWS Bedrock Converse API | Full translation + SigV4 signing |
| ollama | Ollama (local models) | Pass-through (OpenAI-compatible) |
Common Fields
Every provider supports these fields:
[providers.name]
kind = "..." # required
api_key = "..." # API key (supports ${ENV_VAR})
base_url = "..." # base URL override
models = ["..."] # model names this provider serves
weight = 1 # routing weight (higher = more traffic)
max_retries = 2 # retries on transient errors (429, 5xx)
timeout = 30 # per-request timeout in seconds
Multiple Providers for the Same Model
When multiple providers list the same model, crabllm selects between them using weighted random selection. If the selected provider fails, it falls back to the next provider by weight. See Routing.
[providers.openai_primary]
kind = "openai"
api_key = "${OPENAI_KEY_1}"
models = ["gpt-4o"]
weight = 3
[providers.openai_backup]
kind = "openai"
api_key = "${OPENAI_KEY_2}"
models = ["gpt-4o"]
weight = 1
Endpoint Support
| Endpoint | OpenAI | Anthropic | Google | Azure | Bedrock | Ollama |
|---|---|---|---|---|---|---|
| Chat completions | yes | yes | yes | yes | yes | yes |
| Streaming | yes | yes | yes | yes | yes | yes |
| Embeddings | yes | — | — | yes | — | yes |
| Image generation | yes | — | — | yes | — | — |
| Audio speech | yes | — | — | yes | — | — |
| Audio transcription | yes | — | — | yes | — | — |
| Tool/function calling | yes | yes | yes | yes | yes | yes |
OpenAI
The openai provider works with OpenAI and any OpenAI-compatible API
(Groq, Together AI, vLLM, etc.). Requests are forwarded as-is with URL and auth
rewrite — no translation needed.
Configuration
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o", "gpt-4o-mini", "text-embedding-3-small"]
Custom Base URL
For OpenAI-compatible services, set base_url:
[providers.groq]
kind = "openai"
api_key = "${GROQ_API_KEY}"
base_url = "https://api.groq.com/openai/v1"
models = ["llama-3.3-70b-versatile"]
[providers.together]
kind = "openai"
api_key = "${TOGETHER_API_KEY}"
base_url = "https://api.together.xyz/v1"
models = ["meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"]
Supported Endpoints
- Chat completions (streaming and non-streaming)
- Embeddings
- Image generation
- Audio speech (TTS)
- Audio transcription
Tool Calling
Tool calling works as-is — the request body is forwarded directly:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {
"type": "object",
"properties": {"location": {"type": "string"}},
"required": ["location"]
}
}
}]
}'
Anthropic
The anthropic provider translates OpenAI-format requests to the Anthropic
Messages API and back.
Configuration
[providers.anthropic]
kind = "anthropic"
api_key = "${ANTHROPIC_API_KEY}"
models = ["claude-sonnet-4-20250514", "claude-haiku-4-20250514"]
Translation
Crabllm handles the full translation between OpenAI and Anthropic formats:
- System messages — extracted from the messages array and sent as the Anthropic `system` parameter.
- Stop reasons — mapped between formats (`end_turn` to `stop`, etc.).
- Tool calling — fully supported. Tool definitions, tool use responses, and tool result messages are all translated.
- Streaming — Anthropic's event stream (`message_start`, `content_block_delta`, etc.) is translated to OpenAI-format SSE chunks.
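The system-message step can be sketched as follows (an illustrative Python helper, not crabllm's Rust implementation; the function name is hypothetical):

```python
def split_system(messages: list[dict]) -> tuple[str, list[dict]]:
    # Anthropic takes system prompts as a separate top-level `system`
    # parameter, so pull them out of the OpenAI-style messages array.
    system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    rest = [m for m in messages if m["role"] != "system"]
    return system, rest

system, rest = split_system([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
# system == "You are a helpful assistant."
# rest   == [{"role": "user", "content": "Hello!"}]
```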
Usage
Send requests in OpenAI format as usual:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-20250514",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
}'
Limitations
- Embeddings, image generation, and audio endpoints are not supported by the Anthropic API.
Google Gemini
The google provider translates OpenAI-format requests to the Google Gemini API
(the Generative Language API).
Configuration
[providers.google]
kind = "google"
api_key = "${GOOGLE_API_KEY}"
models = ["gemini-2.0-flash", "gemini-2.5-pro"]
Translation
- System messages — mapped to Gemini's `systemInstruction` field.
- Roles — `assistant` mapped to `model`; `user` stays `user`.
- Content — mapped to Gemini's `parts` array format.
- Tool calling — tool definitions mapped to `functionDeclarations`, tool messages to `functionResponse` parts; responses extract `functionCall` parts.
- Streaming — uses `streamGenerateContent?alt=sse` and translates the Gemini event stream to OpenAI-format SSE chunks.
Usage
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemini-2.0-flash",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Limitations
- Embeddings, image generation, and audio endpoints are not supported.
Azure OpenAI
The azure provider routes to Azure OpenAI deployments. The request body is
OpenAI-format (no translation needed), but the URL pattern and authentication
differ.
Configuration
[providers.azure]
kind = "azure"
api_key = "${AZURE_OPENAI_KEY}"
base_url = "https://my-resource.openai.azure.com"
api_version = "2024-02-01"
models = ["gpt-4o"]
- `base_url` — your Azure OpenAI resource URL.
- `api_version` — the Azure API version string.
How It Works
Crabllm rewrites the URL to Azure’s deployment-based pattern:
POST /openai/deployments/{model}/chat/completions?api-version={api_version}
Authentication uses the api-key header instead of Authorization: Bearer.
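The URL rewrite can be sketched as simple string assembly (illustrative helper, not crabllm's actual code):

```python
def azure_url(base_url: str, model: str, api_version: str) -> str:
    # Azure routes by deployment name in the path rather than a model
    # field in the request body, and versions the API via a query param.
    return (f"{base_url}/openai/deployments/{model}"
            f"/chat/completions?api-version={api_version}")

azure_url("https://my-resource.openai.azure.com", "gpt-4o", "2024-02-01")
# "https://my-resource.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-02-01"
```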
Supported Endpoints
- Chat completions (streaming and non-streaming)
- Embeddings
- Image generation
- Audio speech (TTS)
- Audio transcription
Usage
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello!"}]
}'
AWS Bedrock
The bedrock provider translates requests to the AWS Bedrock Converse API with
SigV4 request signing. No AWS SDK dependency — signing is handled internally.
Feature Flag
Bedrock support requires the provider-bedrock cargo feature:
cargo install crabllm --features provider-bedrock
Configuration
[providers.bedrock]
kind = "bedrock"
region = "us-east-1"
access_key = "${AWS_ACCESS_KEY_ID}"
secret_key = "${AWS_SECRET_ACCESS_KEY}"
models = ["anthropic.claude-3-5-sonnet-20241022-v2:0"]
Translation
- System messages — mapped to the Bedrock `system` field.
- Tool calling — tool definitions mapped to `toolConfig.tools[].toolSpec`, tool results to `toolResult` content blocks.
- Stop reasons — `end_turn` to `stop`, `tool_use` to `tool_calls`, `max_tokens` to `length`.
- Streaming — uses ConverseStream with AWS event-stream binary framing.
Usage
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "anthropic.claude-3-5-sonnet-20241022-v2:0",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Limitations
- Embeddings, image generation, and audio endpoints are not supported.
Ollama
The ollama provider connects to a local Ollama instance.
Ollama exposes an OpenAI-compatible API, so requests are forwarded as-is.
Configuration
[providers.ollama]
kind = "ollama"
models = ["llama3.2", "mistral"]
The default base URL is http://localhost:11434/v1. Override it if Ollama runs
on a different host:
[providers.ollama]
kind = "ollama"
base_url = "http://192.168.1.100:11434/v1"
models = ["llama3.2"]
No API key is needed for local Ollama.
Usage
Start Ollama, pull a model, then send requests through crabllm:
ollama pull llama3.2
crabllm --config crabllm.toml
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Supported Endpoints
- Chat completions (streaming and non-streaming)
- Embeddings (if supported by the Ollama model)
Routing
Crabllm decides which provider handles a request based on model name, routing weights, and fallback logic.
Model Resolution
When a request arrives, crabllm looks up the model name in the configured providers. If the model is an alias, it resolves to the canonical name first (single-hop lookup).
Weighted Selection
When multiple providers serve the same model, one is selected via weighted random
selection. Higher weight values mean more traffic:
[providers.primary]
kind = "openai"
api_key = "${OPENAI_KEY_1}"
models = ["gpt-4o"]
weight = 3 # 75% of traffic
[providers.secondary]
kind = "openai"
api_key = "${OPENAI_KEY_2}"
models = ["gpt-4o"]
weight = 1 # 25% of traffic
Selection is stateless — no shared counters. Each request picks independently.
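Stateless weighted random selection can be sketched like this (illustrative Python, not crabllm's Rust implementation):

```python
import random

def pick_provider(deployments: list[tuple[str, int]]) -> str:
    # deployments: (name, weight) pairs. Draw a point on [0, total_weight)
    # and walk the list until the cumulative weight passes it. No shared
    # state, so each request picks independently.
    total = sum(w for _, w in deployments)
    r = random.uniform(0, total)
    for name, weight in deployments:
        r -= weight
        if r <= 0:
            return name
    return deployments[-1][0]  # guard against float rounding

# With weights 3 and 1, "primary" wins ~75% of draws.
pick_provider([("primary", 3), ("backup", 1)])
```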
Retry
When a provider returns a transient error (HTTP 429, 500, 502, 503, 504), crabllm retries the same provider with exponential backoff:
- Base delay: 100ms, doubling each retry.
- Full jitter: each sleep is a random duration in `[backoff/2, backoff]` to prevent thundering herd.
- Max retries: configurable per provider via `max_retries` (default 2).
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o"]
max_retries = 3 # retry up to 3 times
Set max_retries = 0 to disable retry entirely.
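The backoff schedule described above can be sketched as (illustrative helper, not crabllm's actual code):

```python
import random

def backoff_delay(attempt: int, base: float = 0.1) -> float:
    # Exponential backoff: 100ms base, doubling each attempt,
    # jittered into [backoff/2, backoff] to spread out retries.
    backoff = base * (2 ** attempt)
    return random.uniform(backoff / 2, backoff)

# attempt 0 -> somewhere in [0.05s, 0.1s]
# attempt 2 -> somewhere in [0.2s, 0.4s]
```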
Fallback
When retries are exhausted on a provider, crabllm tries the next provider by descending weight. This continues until a provider succeeds or all providers have been tried.
# Primary provider (tried first)
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o"]
weight = 2
# Fallback provider (tried if primary fails)
[providers.azure]
kind = "azure"
api_key = "${AZURE_KEY}"
base_url = "https://my-resource.openai.azure.com"
api_version = "2024-02-01"
models = ["gpt-4o"]
weight = 1
Timeouts
Each provider call is wrapped in a timeout. If the timeout expires, the request is treated as a transient error (triggers retry/fallback):
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o"]
timeout = 60 # seconds (default: 30)
Timeout errors return HTTP 504 Gateway Timeout if all providers time out.
Streaming Behavior
For streaming requests, retry and fallback only apply to connection errors (before the stream starts). Once the first SSE chunk is sent to the client, the connection is committed to that provider.
Streaming
Crabllm supports Server-Sent Events (SSE) streaming for chat completions across all providers. Streams are proxied without buffering — tokens arrive incrementally as the provider generates them.
Usage
Set "stream": true in the request body:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Write a haiku."}],
"stream": true
}'
The response is a stream of SSE events:
data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":"An"}}]}
data: {"id":"...","object":"chat.completion.chunk","choices":[{"delta":{"content":" old"}}]}
data: [DONE]
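A client consumes this stream by parsing each `data:` line until the `[DONE]` sentinel. A minimal sketch (hypothetical helper operating on already-received lines; a real client would read them off the HTTP response):

```python
import json

def parse_sse(lines: list[str]) -> str:
    # Accumulate delta content from OpenAI-format SSE lines until [DONE].
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)
```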
Provider Translation
For non-OpenAI providers, crabllm translates the provider’s native streaming format to OpenAI-compatible SSE chunks:
- Anthropic — `message_start` and `content_block_delta` events translated to `chat.completion.chunk` format.
- Google Gemini — `streamGenerateContent` response parts translated to OpenAI chunks.
- Bedrock — AWS event-stream binary frames decoded and translated.
- Azure — same SSE format as OpenAI; no translation needed.
Extension Hooks
Extensions can observe each streaming chunk via the on_chunk hook. The rate
limiter and budget extension use this to count tokens in real-time as they arrive.
Keep-Alive
SSE connections include automatic keep-alive pings to prevent proxy/load balancer timeouts during long generation pauses.
Error Handling
If an error occurs mid-stream (after the first chunk has been sent), it is delivered as an SSE event with an error payload. The stream then terminates. Retry and fallback only apply before the stream starts.
Authentication
Crabllm supports virtual API keys for client authentication and model access control.
Virtual Keys
Define keys in the config:
[[keys]]
name = "team-frontend"
key = "sk-frontend-abc123"
models = ["gpt-4o-mini"]
[[keys]]
name = "team-backend"
key = "sk-backend-xyz789"
models = ["gpt-4o", "claude-sonnet-4-20250514"]
[[keys]]
name = "admin"
key = "${ADMIN_API_KEY}"
models = ["*"]
Clients send the key in the Authorization header:
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer sk-frontend-abc123" \
-H "Content-Type: application/json" \
-d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hi"}]}'
Model Access Control
The models field controls which models a key can access:
- `["gpt-4o", "gpt-4o-mini"]` — only these models.
- `["*"]` — all models.
Requests for unauthorized models return HTTP 401.
No Auth Mode
When no keys are configured, authentication is disabled entirely. All requests
pass through without checking the Authorization header.
# No [[keys]] section = auth disabled
listen = "0.0.0.0:8080"
[providers.openai]
kind = "openai"
api_key = "${OPENAI_API_KEY}"
models = ["gpt-4o"]
Key Name Tracking
The key name field is used by extensions for per-key tracking:
- Rate limiting — enforced per key name.
- Usage tracking — tokens counted per key name.
- Budget — spend limits per key name.
- Logging — key name included in log entries.
Extensions
Extensions add functionality to the request pipeline via hooks. They run in-handler (not as middleware), giving direct access to typed request and response data.
Available Extensions
Cache
Caches non-streaming chat completion responses. Cache key is a SHA-256 hash of the serialized request body.
[extensions.cache]
ttl_seconds = 3600 # default: 300 (5 minutes)
Admin route: DELETE /v1/cache — clears all cached entries.
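The hashing scheme can be sketched as follows (illustrative Python; crabllm's exact serialization may differ, so treat the canonicalization choice here as an assumption):

```python
import hashlib
import json

def cache_key(body: dict) -> str:
    # Hash a canonical serialization so logically identical requests
    # (regardless of JSON key order) map to the same cache entry.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Same request, different key order -> same cache key.
cache_key({"model": "gpt-4o", "messages": []})
```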
Rate Limit
Enforces per-key request and token rate limits using a per-minute sliding window.
[extensions.rate_limit]
requests_per_minute = 60 # required
tokens_per_minute = 100000 # optional
Returns HTTP 429 when limits are exceeded. Token counting uses actual usage from provider responses (both streaming and non-streaming).
Usage Tracker
Accumulates prompt and completion token counts per key and model.
[extensions.usage]
No configuration needed. Admin route: GET /v1/usage — returns JSON array of
usage entries with key, model, prompt_tokens, and completion_tokens.
Budget
Enforces per-key spend limits. Requires pricing to be configured for the models in use.
[extensions.budget]
default_budget = 10.00 # USD, required
[extensions.budget.keys.team-a]
budget = 50.00 # USD override for this key
Returns HTTP 429 when a key’s spend exceeds its budget. Admin route:
GET /v1/budget — returns JSON array with key, spent_usd, budget_usd,
and remaining_usd.
Logging
Structured request logging via the tracing framework.
[extensions.logging]
level = "info"
Logs completed requests (model, provider, key, latency, token counts) and
errors. Initializes the tracing_subscriber when enabled.
Hook Pipeline
Extensions run in config order at these points:
- `on_request` — before provider dispatch. Can short-circuit (rate limit, budget).
- `on_cache_lookup` — before provider dispatch for non-streaming requests. Returns a cached response if available.
- `on_response` — after a successful non-streaming response.
- `on_chunk` — for each SSE chunk during streaming.
- `on_error` — when a provider call fails.
Combining Extensions
Multiple extensions can be enabled simultaneously:
[extensions.logging]
level = "info"
[extensions.rate_limit]
requests_per_minute = 100
[extensions.usage]
[extensions.cache]
ttl_seconds = 600
[extensions.budget]
default_budget = 100.00
All extensions share the same storage backend.
Storage
Extensions that persist data (cache, rate limits, usage, budget) use a shared storage backend. Three backends are available.
Memory (default)
In-memory storage using concurrent hash maps. Fast, but data is lost on restart.
[storage]
kind = "memory"
This is the default when no [storage] section is present. No feature flag
required.
SQLite
Persistent storage using SQLite via async pooled connections.
[storage]
kind = "sqlite"
path = "crabllm.db"
Requires the storage-sqlite feature:
cargo install crabllm --features storage-sqlite
The database file is created automatically if it doesn’t exist. Uses two tables
(kv and counters) with atomic increment via INSERT ... ON CONFLICT ... RETURNING.
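The upsert-increment pattern can be sketched with Python's `sqlite3` (a simplified illustration: it uses `ON CONFLICT ... DO UPDATE` plus a follow-up `SELECT` for portability, where crabllm folds both into one statement with `RETURNING`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (k TEXT PRIMARY KEY, v INTEGER NOT NULL)")

def incr(key: str, delta: int) -> int:
    # Insert the counter if absent, otherwise add delta to the stored value.
    conn.execute(
        "INSERT INTO counters (k, v) VALUES (?, ?) "
        "ON CONFLICT(k) DO UPDATE SET v = v + excluded.v",
        (key, delta),
    )
    return conn.execute(
        "SELECT v FROM counters WHERE k = ?", (key,)
    ).fetchone()[0]
```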
Redis
Remote persistent storage using Redis async multiplexed connections.
[storage]
kind = "redis"
path = "redis://127.0.0.1:6379"
Requires the storage-redis feature:
cargo install crabllm --features storage-redis
Supports standard Redis URLs. Increment maps to INCRBY, key listing uses
SCAN with prefix glob patterns.
How Extensions Use Storage
Each extension namespaces its keys with a 4-byte prefix to avoid collisions:
| Extension | Operations |
|---|---|
| Cache | get/set response JSON with TTL check |
| Rate Limit | increment per-key-per-minute counters |
| Usage | increment per-key-per-model token counters |
| Budget | increment per-key spend in microdollars |
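Tracking spend in integer microdollars (1 USD = 1,000,000) lets the budget extension use the same atomic counter increments as the other extensions, avoiding floating-point drift in storage. A sketch of the conversion (illustrative helper, not crabllm's actual code):

```python
def spend_microdollars(prompt_tokens: int, completion_tokens: int,
                       prompt_per_million: float, completion_per_million: float) -> int:
    # Compute the request cost in USD, then store it as an integer
    # number of microdollars suitable for a counter increment.
    usd = (prompt_tokens / 1_000_000 * prompt_per_million
           + completion_tokens / 1_000_000 * completion_per_million)
    return round(usd * 1_000_000)

# 1,000 prompt + 500 completion tokens at $2.50 / $10.00 per million:
spend_microdollars(1000, 500, 2.50, 10.00)  # 7500 microdollars ($0.0075)
```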
Architecture
Principles
- Simplicity over abstraction. No trait where a function suffices.
- Single responsibility. Each crate has one focused job.
- OpenAI as canonical format. Providers translate to/from it.
- Streaming first-class. Never buffer a full response when streaming.
- Configuration-driven. Provider setup and routing from config, not code.
- Minimal gateway latency. Avoid hot-path allocations.
Workspace Layout
crabllm/
crates/
crabllm/ — binary, wires everything together
core/ — shared types, config, errors
provider/ — provider enum + translation modules
proxy/ — HTTP server, routing, extensions
bench/ — benchmark mock backend
Crates
crabllm
Binary entry point. Loads TOML config, builds the provider registry, initializes
the storage backend and extensions, starts the Axum HTTP server. CLI args:
--config and --bind.
core
Shared types with no business logic. Contains:
- Config — `GatewayConfig` with env var interpolation.
- Types — OpenAI-compatible wire format structs (request, response, chunk).
- Error — error enum with transient detection for retry logic.
- Storage — async KV trait with memory, SQLite, and Redis backends.
- Extension — hook trait for the request pipeline.
provider
Provider dispatch. The Provider enum has variants for each supported provider.
Each variant dispatches to a per-provider module that handles request/response
translation. ProviderRegistry maps model names to weighted deployment lists.
proxy
Axum HTTP server. Route handlers implement retry + fallback across deployments. Auth middleware validates virtual keys. Five built-in extensions run as in-handler hooks.
Request Flow
1. Client sends an OpenAI-format request to crabllm.
2. Auth middleware validates the bearer token.
3. Handler resolves the model name (aliases) and gets the deployment list.
4. Extension `on_request` hooks run (rate limit, budget check).
5. Cache lookup for non-streaming requests.
6. Provider dispatch with retry + fallback.
7. Provider translates the request, calls upstream, translates the response.
8. Extension `on_response` / `on_chunk` hooks run (usage, budget, cache store).
9. Response returned to the client.
Benchmarks
Gateway overhead measured against a mock LLM server with instant responses — numbers reflect pure proxy cost.
Latency: P50 / P99 in milliseconds. Lower is better.
Chat Completions
| RPS | direct | crabllm | bifrost | litellm |
|---|---|---|---|---|
| 100 | 0.38 / 0.63 | 1.00 / 1.31 | 1.10 / 1.64 | 5.35 / 10.79 |
| 500 | 0.28 / 0.42 | 0.66 / 1.07 | 0.36 / 0.91 | 168.79 / 223.69 |
| 1000 | 0.15 / 0.31 | 0.44 / 0.83 | 0.27 / 0.46 | 172.00 / 201.55 |
| 2000 | 0.17 / 0.33 | 0.29 / 0.88 | 0.29 / 0.53 | 169.99 / 194.34 |
| 5000 | 0.13 / 0.33 | 0.26 / 0.57 | 0.26 / 0.48 | 159.86 / 492.82 |
Streaming
| RPS | direct | crabllm | bifrost | litellm |
|---|---|---|---|---|
| 100 | 0.45 / 0.62 | 43.53 / 48.14 | 1.51 / 2.20 | 670.25 / 3357.70 |
| 500 | 0.34 / 0.54 | 42.90 / 47.14 | 0.51 / 0.93 | 659.97 / 3569.92 |
| 1000 | 0.22 / 0.42 | 44.18 / 48.30 | 0.45 / 0.98 | 645.59 / 2797.66 |
| 2000 | 44.04 / 48.23 | 44.25 / 48.52 | 44.18 / 48.64 | 596.90 / 2678.08 |
| 5000 | 44.04 / 48.23 | 44.24 / 48.50 | 44.20 / 48.66 | 571.96 / 2563.73 |
Embeddings
| RPS | direct | crabllm | bifrost | litellm |
|---|---|---|---|---|
| 100 | 0.39 / 0.47 | 1.18 / 1.48 | 1.15 / 1.70 | 7.09 / 10.72 |
| 500 | 0.30 / 0.42 | 0.78 / 1.15 | 0.43 / 1.03 | 356.71 / 414.36 |
| 1000 | 0.17 / 0.27 | 0.51 / 0.91 | 0.38 / 0.85 | 332.53 / 6516.44 |
| 2000 | 0.18 / 0.32 | 0.36 / 1.08 | 0.39 / 0.94 | 317.53 / 365.68 |
| 5000 | 0.14 / 0.32 | 0.34 / 0.64 | 0.39 / 1.57 | 305.91 / 8778.06 |
Memory (Peak RSS)
| Gateway | Peak RSS |
|---|---|
| direct | 15.3 MB |
| crabllm | 34.9 MB |
| bifrost | 171.7 MB |
| litellm | 541.8 MB |