Caching

Hyperion caching reduces duplicate inference calls and lowers response latency. Use L1 exact matching for deterministic reuse and optional L2 semantic matching for near-intent reuse.

The Multi-Tier Strategy

Hyperion uses a two-tier cache to raise hit rates while adding little latency. L1 looks up a deterministic hash of the request, and L2 falls back to a vector similarity search. Repeated requests are served from L1; differently phrased requests with the same intent can be served from L2.

L1: Exact Match (Redis)

L1 is the fastest path. Each request is normalized (whitespace removed, keys sorted) and hashed. If the hash matches an entry in Redis, the cached response is served in under 1 ms.
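
The normalize-then-hash step can be sketched as follows. This is a minimal illustration, not Hyperion's actual implementation: the function names, the SHA-256 choice, the `hyperion:l1:` key prefix, and the exact whitespace rule (collapsing runs of whitespace inside strings) are all assumptions.

```python
import hashlib
import json
import re


def normalize_request(request: dict) -> str:
    """Return a canonical string for a request payload (illustrative rules only)."""

    def clean(value):
        if isinstance(value, str):
            # Assumed rule: collapse runs of whitespace and trim both ends.
            return re.sub(r"\s+", " ", value).strip()
        if isinstance(value, dict):
            return {key: clean(item) for key, item in value.items()}
        if isinstance(value, list):
            return [clean(item) for item in value]
        return value

    # sort_keys gives a stable key order; compact separators drop JSON whitespace.
    return json.dumps(clean(request), sort_keys=True, separators=(",", ":"))


def cache_key(request: dict) -> str:
    """Hash the canonical form into a fixed-length cache key."""
    digest = hashlib.sha256(normalize_request(request).encode("utf-8")).hexdigest()
    return f"hyperion:l1:{digest}"  # hypothetical key prefix
```

Two payloads that differ only in key order or incidental whitespace produce the same key, so the second one can be answered by the entry stored for the first.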

L2: Semantic Match (Vector)

For queries with the same intent but different phrasing (e.g., "What is AI?" vs. "Explain AI"), Hyperion runs a cosine similarity search against a vector database and compares the best score with the request's similarity_threshold.
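
A minimal sketch of a thresholded cosine similarity lookup, assuming the query and cached entries have already been embedded as vectors. The function names and the in-memory list are hypothetical stand-ins for the vector database; the 0.92 default mirrors the `similarity_threshold` value in the configuration sample.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(
    query_vec: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
    threshold: float = 0.92,
) -> Optional[Tuple[str, float]]:
    """Return (cached_response, score) for the best entry at or above threshold."""
    best: Optional[Tuple[str, float]] = None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score >= threshold and (best is None or score > best[1]):
            best = (response, score)
    return best  # None means a semantic miss
```

A linear scan is used here only for clarity; a real vector store would typically use an approximate nearest-neighbor index.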

Configuration

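Configure cache behavior per request using SDK options or headers. In production, use a conservative (higher) similarity_threshold when responses must be strictly correct, because a looser threshold can return an answer cached for a differently worded question.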

response = client.chat.completions.create(
  model="openai/gpt-4.1-mini",
  messages=[{"role": "user", "content": "Summarize this ticket thread"}],
  hyperion={
    "bypass_cache": False,
    "cache_ttl": 3600,
    "similarity_threshold": 0.92
  }
)

Response Metadata

Read cache metadata from response headers or SDK metadata fields.

Caching Metadata

Response headers and SDK fields emitted by the gateway.

cache_status (string)
Cache outcome for this request. Values: HIT, MISS, BYPASS.

cache_type (string)
Tier that served the response. Values: L1_EXACT, L2_SEMANTIC.

similarity_score (float)
Cosine similarity score for semantic hits (0.0 to 1.0).
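
The fields above could be turned into a log-friendly summary along these lines. This is a sketch under assumptions: the metadata is taken to be available as a plain dict keyed by the field names listed here, and `describe_cache_outcome` is a hypothetical helper. The actual header names and SDK attribute path are not specified in this page.

```python
def describe_cache_outcome(meta: dict) -> str:
    """Summarize the cache metadata fields for logging."""
    status = meta.get("cache_status")
    if status != "HIT":
        # MISS and BYPASS both mean the response did not come from the cache.
        return f"not served from cache ({status})"
    if meta.get("cache_type") == "L2_SEMANTIC":
        score = meta.get("similarity_score")
        # Header values may arrive as strings, so coerce before formatting.
        detail = f"{float(score):.2f}" if score is not None else "unknown"
        return f"semantic hit (similarity {detail})"
    return "exact hit"
```

Logging the similarity score for semantic hits makes it easier to judge whether the configured threshold is too loose for a given workload.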


Last updated: Feb 22, 2026