Product

How SpendProxy works

SpendProxy is a reverse proxy that intercepts AI API traffic, calculates accurate costs, attributes spend to features automatically, and runs optimization engines to cut waste. All inside your infrastructure.

2-minute demo

See it in action

From wrong numbers to accurate cost control in one Docker container.

Architecture

Where SpendProxy sits in your stack

A transparent reverse proxy between your application code and AI providers. Your code changes one URL — everything else stays the same.

Your Application
OpenAI SDK, Anthropic SDK, raw HTTP calls. The only change: baseURL points to SpendProxy.

SpendProxy Container
✓ Intercepts the request
✓ Runs optimization engines
✓ Forwards to the provider
✓ Calculates true cost from the response
✓ Auto-attributes to a feature
✓ Stores in SQLite

AI Providers
OpenAI (GPT-4.1, o3, o4-mini) · Anthropic (Claude 4, 3.5) · Google (Gemini 2.5)

Local Storage
SQLite in ~/.spendproxy/ holding costs, attribution, request logs, and optimization history.

Dashboard
localhost:4100/dashboard, auto-refreshing every 10 seconds.
<2ms added latency (P99)
0 external dependencies
0 lines of code to change
3 providers supported
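The one-URL change from the diagram above can be sketched as follows. This is a minimal sketch, assuming the openai-python SDK's keyword names; the /v1 path is an assumption, and the port matches the dashboard address shown on this page.

```python
# Minimal sketch of pointing an OpenAI-compatible client at SpendProxy.
# Keyword names follow the openai-python SDK; the /v1 path is an assumption.
client_config = {
    "api_key": "sk-...",                     # unchanged; SpendProxy forwards it
    "base_url": "http://localhost:4100/v1",  # was https://api.openai.com/v1
}
# client = OpenAI(**client_config)  # everything else in your code stays the same
```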

Cost accuracy

How SpendProxy calculates true cost

Every provider handles billing differently. SpendProxy implements provider-specific logic instead of using a single formula that gets most cases wrong.

Cached token handling

OpenAI

prompt_tokens includes cached. Subtract cached_tokens before applying the input rate. Cache hits are billed at 50% of the input rate.

Anthropic

input_tokens excludes cached. cache_read_input_tokens and cache_creation_input_tokens are billed at separate rates. Most tools ignore this entirely.

Google

promptTokenCount includes cached. Subtract cachedContentTokenCount before billing. Reasoning tokens (Gemini 2.5) are separated.
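The three rules above can be sketched as provider-specific input-cost functions. The usage field names below are the providers' real response fields; the rates are illustrative placeholders, not actual prices.

```python
# Illustrative per-1M-token rates; NOT real prices.
RATES = {"input": 2.50, "cache_read": 1.25, "cache_write": 3.75}

def openai_input_cost(usage: dict) -> float:
    # prompt_tokens INCLUDES cached tokens: subtract them first,
    # then bill cache hits at 50% of the input rate.
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    fresh = usage["prompt_tokens"] - cached
    return (fresh * RATES["input"] + cached * RATES["input"] * 0.5) / 1_000_000

def anthropic_input_cost(usage: dict) -> float:
    # input_tokens EXCLUDES cached tokens: the cache buckets are
    # separate line items billed at their own rates.
    return (usage["input_tokens"] * RATES["input"]
            + usage.get("cache_read_input_tokens", 0) * RATES["cache_read"]
            + usage.get("cache_creation_input_tokens", 0) * RATES["cache_write"]) / 1_000_000

def google_input_cost(usage: dict) -> float:
    # promptTokenCount INCLUDES cached tokens: subtract before billing,
    # then bill the cached share at a (placeholder) cached-content rate.
    cached = usage.get("cachedContentTokenCount", 0)
    fresh = usage["promptTokenCount"] - cached
    return (fresh * RATES["input"] + cached * RATES["cache_read"]) / 1_000_000
```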

Reasoning token handling

Models like OpenAI o3/o4-mini and Gemini 2.5 Pro use internal reasoning tokens that are billed as output but never shown to the user. SpendProxy extracts reasoning_tokens from the response and bills them at the correct output rate, separate from visible completion tokens. Most tools either miss these entirely or double-count them.
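A minimal sketch of that split, using OpenAI's usage field names (completion_tokens_details.reasoning_tokens); both shares are billed at the output rate.

```python
def split_output_tokens(usage: dict) -> tuple:
    # completion_tokens INCLUDES reasoning tokens, so subtracting once
    # (rather than adding the detail field on top) avoids double-counting.
    reasoning = usage.get("completion_tokens_details", {}).get("reasoning_tokens", 0)
    visible = usage["completion_tokens"] - reasoning
    return visible, reasoning
```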

Streaming accuracy

Full SSE streaming support for all three providers. For OpenAI, SpendProxy injects stream_options.include_usage automatically. For Anthropic, it parses message_start and message_delta events. For Google, it detects streaming endpoints and parses usageMetadata from chunks. Cost data is appended as an SSE comment — your client code doesn't change.
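The OpenAI part of that flow amounts to rewriting the request body before forwarding; a sketch:

```python
def inject_usage_option(body: dict) -> dict:
    # For streaming OpenAI requests, request a final usage chunk so cost
    # can be computed from the stream itself; non-streaming bodies pass through.
    if not body.get("stream"):
        return body
    opts = dict(body.get("stream_options") or {})
    opts.setdefault("include_usage", True)  # respects a caller's explicit value
    return {**body, "stream_options": opts}
```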

Optimization engines

5 engines that reduce spend automatically

Each engine operates independently in off, monitor, or autopilot mode. Monitor first to see the opportunity, then enable autopilot when you're confident.

01

Cache Injection

Detects prompts that would benefit from provider-level caching and enables it automatically. Works with OpenAI's prompt caching, Anthropic's cache control, and Google's cached content. No code changes needed — SpendProxy modifies the request before forwarding.

Typical savings: 50-90% on repetitive prompts
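For Anthropic, for instance, enabling caching means rewriting the system prompt into a content block carrying cache_control. A sketch under that assumption; the length threshold is illustrative, not SpendProxy's actual heuristic.

```python
def inject_anthropic_cache(body: dict, min_chars: int = 4096) -> dict:
    # Mark a long system prompt for Anthropic prompt caching by converting
    # it to a content block with cache_control. Threshold is illustrative.
    system = body.get("system")
    if isinstance(system, str) and len(system) >= min_chars:
        block = {"type": "text", "text": system,
                 "cache_control": {"type": "ephemeral"}}
        return {**body, "system": [block]}
    return body
```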
02

Response Deduplication

Identifies identical in-flight requests — common during frontend retries, parallel component renders, or webhook replays — and serves a single provider response to all callers. Eliminates redundant API calls without affecting your application logic.

Typical savings: 10-30% on high-traffic endpoints
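The core of in-flight deduplication can be sketched like this: identical concurrent requests share one upstream call. Keying on the full request body is an assumption about how duplicates are identified.

```python
import asyncio, hashlib, json

_inflight = {}  # request-body hash -> in-flight upstream call

async def deduped(body: dict, call_provider):
    key = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    if key in _inflight:
        return await _inflight[key]           # piggyback on the in-flight call
    fut = asyncio.ensure_future(call_provider(body))
    _inflight[key] = fut
    try:
        return await fut
    finally:
        _inflight.pop(key, None)              # later requests call upstream again
```

Note the cleanup in `finally`: only truly concurrent duplicates are merged; a repeat after the response has landed goes to the provider normally.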
03

Model Routing

Analyzes request complexity and routes simple tasks to cheaper models automatically. A lookup query doesn't need GPT-4.1 — SpendProxy downgrades it to GPT-4.1-mini transparently. Complex reasoning stays on the model you specified. Includes circuit breaker protection for model availability.

Typical savings: 20-60% on mixed workloads
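In outline, routing is a downgrade map guarded by a complexity check. The heuristic below is a crude stand-in for illustration only; the gpt-4.1 to gpt-4.1-mini pairing is the example from the text.

```python
DOWNGRADES = {"gpt-4.1": "gpt-4.1-mini"}  # pairing taken from the example above

def route_model(body: dict) -> str:
    # Crude complexity heuristic, illustrative only: short prompts with no
    # attached tools are treated as simple lookups and downgraded.
    text = " ".join(str(m.get("content", "")) for m in body.get("messages", []))
    simple = len(text) < 200 and not body.get("tools")
    if simple:
        return DOWNGRADES.get(body["model"], body["model"])
    return body["model"]
```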
04

Budget Guardrails

Set spend limits per feature, team, or project. When a budget is hit, SpendProxy can warn (add a header), downgrade (switch to a cheaper model), or block (return a 429). No more surprise bills from a runaway agent loop.

Protection: hard spend caps with configurable actions
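A guardrail boils down to a limit plus an action. The table below is a hypothetical sketch; the key names are illustrative, not SpendProxy's actual config schema.

```python
# Hypothetical guardrail table; feature names and key names are illustrative.
BUDGETS = {
    "chat":         {"monthly_usd": 500, "action": "warn"},       # add a header
    "search-agent": {"monthly_usd": 200, "action": "downgrade"},  # cheaper model
    "batch-jobs":   {"monthly_usd": 100, "action": "block"},      # return 429
}

def budget_action(feature: str, spent_usd: float) -> str:
    # Return the configured action once spend reaches the cap, else allow.
    rule = BUDGETS.get(feature)
    if rule and spent_usd >= rule["monthly_usd"]:
        return rule["action"]
    return "allow"
```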
05

Retry Storm Suppression

Detects retry cascades during provider outages — when your application hammers a failing API with exponential backoff across hundreds of instances. SpendProxy throttles the storm, serves cached responses where possible, and prevents a 2-minute outage from becoming a $5K bill.

Protection: prevents runaway spend during outages
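The suppression logic is essentially a failure-rate circuit breaker; a minimal sketch with illustrative thresholds:

```python
import time
from collections import deque

class StormSuppressor:
    """Sketch of a failure-rate circuit breaker; thresholds are illustrative."""

    def __init__(self, max_failures=50, window_s=10.0, cooldown_s=30.0):
        self.failures = deque()       # timestamps of recent upstream failures
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.open_until = 0.0         # no forwarding before this timestamp

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures.append(now)
        # Keep only failures inside the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.open_until = now + self.cooldown_s  # trip: stop forwarding

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        return now >= self.open_until
```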

Attribution

Automatic cost attribution without code changes

SpendProxy uses multiple signals from each request to automatically group costs by feature, agent, or workload type.

System prompt fingerprinting

Hashes the system prompt to create a stable fingerprint. Requests with the same system prompt are always grouped together, even across different users or sessions. Shows as chat:p-3f8a21b4 in the dashboard.

Toolset fingerprinting

Hashes the set of function tools attached to a request. Your code-review agent and your search agent get separate cost breakdowns automatically. Shows as agent:p-c72e:t-8f4a.
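Both fingerprints can be sketched as short stable hashes. The 8-character length and the p-/t- label format below mirror the dashboard labels above but are assumptions about the exact scheme.

```python
import hashlib, json

def _fp(text: str) -> str:
    # Short stable hash; the 8-hex-char length is an assumption.
    return hashlib.sha256(text.encode()).hexdigest()[:8]

def attribution_key(system_prompt: str, tools: list) -> str:
    prompt_fp = "p-" + _fp(system_prompt)
    if not tools:
        return "chat:" + prompt_fp
    # Sort tool names so declaration order doesn't change the fingerprint.
    tool_fp = "t-" + _fp(json.dumps(sorted(t["name"] for t in tools)))
    return f"agent:{prompt_fp}:{tool_fp}"
```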

Workload classification

Detects the type of workload from the endpoint and request body: chat completion, agent call, embeddings, image generation, or audio. Each type is tracked separately.

SDK detection

Identifies the calling SDK from the User-Agent header: openai-python, anthropic-ts, langchain, vercel-ai, crewai, and more. Useful for understanding which parts of your stack drive costs.
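In outline, this is substring matching on the User-Agent. The marker list below is a small illustrative subset; real User-Agent formats vary by SDK and version.

```python
def detect_sdk(user_agent: str) -> str:
    # Substring match against a small illustrative subset of SDK markers.
    known = ["openai-python", "anthropic-ts", "langchain", "vercel-ai", "crewai"]
    ua = user_agent.lower()
    return next((sdk for sdk in known if sdk in ua), "unknown")
```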

Optional: explicit headers

You can also use manual headers for explicit control. They take priority over auto-detection.

X-CE-Route Feature or endpoint name
X-CE-Tag Custom tag (team, env)
X-CE-Project Project name
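For example, the headers from the table above attached to an outgoing request; the values here are hypothetical.

```python
# Illustrative values only; the X-CE-* header names are the ones documented above.
headers = {
    "Authorization": "Bearer sk-...",    # forwarded to the provider unchanged
    "X-CE-Route": "checkout-assistant",  # feature or endpoint name (hypothetical)
    "X-CE-Tag": "team:payments",         # custom tag (hypothetical)
    "X-CE-Project": "storefront",        # project name (hypothetical)
}
# Most SDKs accept these as default/extra headers on the client.
```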

Response cost headers

Every proxied response includes cost metadata.

X-CE-Cost Total cost in USD
X-CE-Input-Tokens Input token count
X-CE-Output-Tokens Output token count
X-CE-Cached-Tokens Cached input tokens
X-CE-Model Resolved model name
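Reading that metadata off a proxied response can be sketched as follows; the header names come from the table above, the sample values in the test are made up.

```python
def parse_cost_headers(headers: dict) -> dict:
    # Pull SpendProxy's cost metadata off a proxied response.
    return {
        "cost_usd": float(headers["X-CE-Cost"]),
        "input_tokens": int(headers["X-CE-Input-Tokens"]),
        "output_tokens": int(headers["X-CE-Output-Tokens"]),
        "cached_tokens": int(headers.get("X-CE-Cached-Tokens", "0")),
        "model": headers["X-CE-Model"],
    }
```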

Providers

Supported models

Fuzzy matching handles versioned model IDs automatically (e.g., gpt-4o-2024-08-06 resolves to gpt-4o).
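One plausible rule behind that resolution, sketched below: strip a trailing date-style version suffix. This is an assumption about one case, not necessarily the only matching SpendProxy applies.

```python
import re

def resolve_model(model_id: str) -> str:
    # Strip a trailing -YYYY-MM-DD version suffix (e.g. gpt-4o-2024-08-06).
    return re.sub(r"-\d{4}-\d{2}-\d{2}$", "", model_id)
```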

OpenAI

gpt-4.1
gpt-4.1-mini
gpt-4.1-nano
gpt-4o
gpt-4o-mini
o3
o3-mini
o4-mini

Anthropic

claude-opus-4
claude-sonnet-4
claude-haiku-4
claude-3.5-sonnet
claude-3.5-haiku

Google

gemini-2.5-pro
gemini-2.5-flash
gemini-2.0-flash

Dashboard

Everything you need in one view

Cost breakdowns by model and feature, optimization insights with savings estimates, and full request-level detail. Auto-refreshes every 10 seconds.

Cost breakdown by model and auto-detected feature:

SpendProxy dashboard — cost breakdown by model and feature

Optimization insights with estimated monthly savings:

SpendProxy dashboard — optimization insights showing $13,094 in monthly savings

See it in action

Book a 30-minute technical walkthrough. We'll show you the architecture, the accuracy engine, and how the optimization engines work on real traffic.

Book a Technical Walkthrough