Feature

Caching

Hyperion caching reduces duplicate inference calls and lowers response latency. Use L1 exact matching for deterministic reuse and optional L2 semantic matching for near-intent reuse.

The Multi-Tier Strategy

Hyperion uses a two-tier cache to raise hit rates while adding little latency. The first tier matches requests by deterministic hash, and the second matches them by vector similarity. Repeated requests are served from the first tier, and differently phrased requests with the same intent can be served from the second.

L1: Exact Match (Redis)

This is the fastest path. Each request is normalized (whitespace removed, keys sorted) and then hashed. If Redis holds an entry for that hash, the cached response is served in under 1 ms.
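The normalize-then-hash step can be sketched as follows. This is a minimal illustration, not Hyperion's actual implementation: the helper names `normalize` and `cache_key`, the `hyperion:l1:` key prefix, and the choice of SHA-256 are all assumptions, and "whitespace removed" is interpreted here as trimming and collapsing runs of whitespace inside strings. A plain dict stands in for Redis so the example runs without a server.

```python
import hashlib
import json


def normalize(value):
    """Recursively trim and collapse whitespace in strings (assumed rule)."""
    if isinstance(value, str):
        return " ".join(value.split())
    if isinstance(value, list):
        return [normalize(v) for v in value]
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items()}
    return value


def cache_key(request: dict) -> str:
    """Build a deterministic key: normalize, serialize with sorted keys, hash."""
    canonical = json.dumps(normalize(request), sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "hyperion:l1:" + digest  # hypothetical key prefix


store = {}  # stands in for Redis in this sketch

first = {
    "model": "openai/gpt-4.1-mini",
    "messages": [{"role": "user", "content": "What is AI?"}],
}
# Same request with different key order and extra whitespace.
second = {
    "messages": [{"content": "  What   is AI? ", "role": "user"}],
    "model": "openai/gpt-4.1-mini",
}

store[cache_key(first)] = "cached response"
hit = store.get(cache_key(second))  # both requests map to the same key
```

Because the key depends only on the canonical form, requests that differ in key order or incidental whitespace share one cache entry, while any change to the wording produces a different key and falls through to the next tier.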

L2: Semantic Match (Vector)

For queries with the same intent but different phrasing (e.g. "What is AI?" vs. "Explain AI"), Hyperion runs a cosine similarity search against a vector database. The similarity_threshold option sets how close a stored query must be to count as a match.
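A minimal sketch of a similarity lookup, assuming the query has already been turned into an embedding vector. The function names, the in-memory list of entries, and the hand-written three-dimensional vectors are illustrative only. A real deployment would obtain embeddings from an embedding model and query a vector database, typically with approximate nearest-neighbor search rather than the linear scan shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def semantic_lookup(query_vec, entries, threshold=0.92):
    """Return the response of the most similar entry, or None below threshold."""
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None


# Toy embeddings; real ones have hundreds or thousands of dimensions.
entries = [
    ([1.0, 0.0, 0.0], "cached answer about AI"),
    ([0.0, 1.0, 0.0], "cached answer about billing"),
]

hit = semantic_lookup([0.9, 0.1, 0.0], entries)   # close to the first entry
miss = semantic_lookup([0.5, 0.5, 0.7], entries)  # not close to any entry
```

Raising the threshold makes the lookup stricter: fewer hits, but less chance of reusing an answer to a question that only resembles the cached one.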

Configuration

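Configure cache behavior per request using SDK options or headers. In production, use a conservative (higher) similarity threshold when correctness requirements are strict, since a looser threshold can reuse answers to questions that only look similar.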

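# Assumes `client` is an already-configured Hyperion SDK client.
response = client.chat.completions.create(
  model="openai/gpt-4.1-mini",
  messages=[{"role": "user", "content": "Summarize this ticket thread"}],
  hyperion={
    "bypass_cache": False,         # False: consult the cache before calling the model
    "cache_ttl": 3600,             # lifetime of the cached entry (presumably seconds)
    "similarity_threshold": 0.92,  # minimum cosine similarity for an L2 semantic hit
  },
)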