Caching
Hyperion caching reduces duplicate inference calls and lowers response latency. Use L1 exact matching for deterministic reuse and optional L2 semantic matching for near-intent reuse.
The Multi-Tier Strategy
Hyperion employs a two-tier caching architecture to maximize hit rates while keeping latency overhead near zero. By combining deterministic hashing with vector similarity, it serves repeated requests instantly and semantically similar requests without a fresh inference call.
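Conceptually, a lookup cascades through the tiers before falling back to the model. The sketch below shows that flow, using the hypothetical l1_lookup, l1_store, and l2_lookup helpers defined in the tier sections that follow; it is illustrative only, not part of the SDK:

def cached_completion(request: dict, call_model) -> dict:
    # Tier 1: exact-match lookup; sub-millisecond on a hit.
    cached = l1_lookup(request)
    if cached is not None:
        return cached

    # Tier 2: semantic lookup; catches rephrasings of the same intent.
    cached = l2_lookup(request)
    if cached is not None:
        return cached

    # Both tiers missed: run inference, then cache for next time.
    response = call_model(request)
    l1_store(request, response)
    return response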
L1: Exact Match (Redis)
The fastest path. Requests are normalized (whitespace removed, keys sorted) and hashed. If an exact match exists in Redis, it is served in under 1 ms.
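For illustration, exact-match keying can be approximated as follows: serialize the request canonically, hash the result, and use the digest as the Redis key. This is a sketch of the technique rather than Hyperion's internal implementation; the redis client, key prefix, and one-hour default TTL are assumptions.

import hashlib
import json

import redis

r = redis.Redis()

def _cache_key(request: dict) -> str:
    # Canonical form: sorted keys, no insignificant whitespace, so
    # logically identical requests produce the same digest.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return "hyperion:l1:" + hashlib.sha256(canonical.encode()).hexdigest()

def l1_lookup(request: dict):
    hit = r.get(_cache_key(request))
    return json.loads(hit) if hit else None

def l1_store(request: dict, response: dict, ttl: int = 3600) -> None:
    # setex stores the value with an expiry in seconds.
    r.setex(_cache_key(request), ttl, json.dumps(response))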
L2: Semantic Match (Vector)
For queries with identical intent but different phrasing (e.g. "What is AI?" vs "Explain AI"), Hyperion performs a cosine similarity search against a vector database. When the best match scores at or above the configured similarity_threshold, the cached response is reused.
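A toy version of this tier is sketched below, with an in-memory list standing in for the vector database (real deployments would use an ANN index) and a hypothetical embed() helper that returns a unit-normalized vector:

import numpy as np

# In-memory stand-in for the vector database: (embedding, response) pairs.
_semantic_cache: list[tuple[np.ndarray, dict]] = []

def l2_lookup(request: dict, threshold: float = 0.92):
    # embed() is a hypothetical helper returning a unit-normalized vector.
    query = embed(request["messages"][-1]["content"])
    best_score, best_response = 0.0, None
    for vector, response in _semantic_cache:
        # Cosine similarity reduces to a dot product for unit vectors.
        score = float(np.dot(query, vector))
        if score > best_score:
            best_score, best_response = score, response
    # Reuse the cached response only when it clears the threshold.
    return best_response if best_score >= threshold else None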
Configuration
Configure cache behavior per request using SDK options or HTTP headers; a header-based sketch follows the SDK example below. In production, prefer a higher similarity_threshold when response correctness must be strict.
response = client.chat.completions.create(
    model="openai/gpt-4.1-mini",
    messages=[{"role": "user", "content": "Summarize this ticket thread"}],
    hyperion={
        "bypass_cache": False,         # set True to skip both cache tiers
        "cache_ttl": 3600,             # seconds before the entry expires
        "similarity_threshold": 0.92   # minimum cosine score for an L2 hit
    }
)
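For raw HTTP integrations, the same options can travel as request headers. The endpoint URL and header names below are hypothetical placeholders, so check your deployment's API reference for the exact names:

import requests

# Endpoint and header names are hypothetical placeholders.
response = requests.post(
    "https://api.hyperion.example/v1/chat/completions",
    headers={
        "Authorization": "Bearer <HYPERION_API_KEY>",
        "X-Hyperion-Cache-TTL": "3600",             # hypothetical header
        "X-Hyperion-Similarity-Threshold": "0.92",  # hypothetical header
    },
    json={
        "model": "openai/gpt-4.1-mini",
        "messages": [{"role": "user", "content": "Summarize this ticket thread"}],
    },
)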
Response Metadata
Read cache metadata from response headers or SDK metadata fields.
Caching Metadata
- Cache outcome for this request.
- Tier (L1 or L2) that served the response.
- Cosine similarity score for semantic hits (0.0 to 1.0).
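Reading the metadata through the SDK might look like the sketch below; the attribute names (hyperion, cache_status, cache_tier, similarity_score) are hypothetical placeholders for the three fields listed above:

# Attribute names are hypothetical placeholders for the fields above.
meta = response.hyperion
print(meta.cache_status)   # cache outcome, e.g. hit or miss
print(meta.cache_tier)     # tier that served the response
if meta.cache_tier == "l2":
    # The similarity score is only meaningful for semantic (L2) hits.
    print(f"similarity: {meta.similarity_score:.2f}")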