Engineering · 10 min read · Feb 26, 2026

Native Go ML Inference: Porting Weights to the Core

In the high-stakes world of AI infrastructure, every millisecond counts. When we first built the Hyperion Intelligence layer, we did what most engineering teams do: we built it in Python. It was fast to develop, leveraged the massive scikit-learn ecosystem, and allowed us to iterate on our smart routing models daily.

But as the Hyperion Gateway moved from a prototype into a high-concurrency production engine handling millions of tokens per second, the "Python Tax" became unavoidable. An 18ms overhead for every routing decision might sound small, but in a world of sub-200ms TTFT (Time To First Token) targets, it was an eternity.

"We didn't just want a faster microservice. We wanted the intelligence to be part of the request's atomic execution path. That meant leaving the HTTP network hop behind."

The Bottleneck: HTTP and Serialization

The majority of our latency wasn't the model execution itself; it was the infrastructure surrounding it. A request would hit our Go gateway, get buffered, serialized to JSON, sent over a local network bridge to a Python FastAPI container, deserialized, processed, and then the whole dance would happen in reverse.

Even with optimized Gunicorn workers and local networking, you simply cannot beat in-process memory access. We decided to port the entire inference engine (classification, anomaly detection, and Multi-Armed Bandits) directly into the Go core.

Porting the Brain: From .pkl to weights.json

The core challenge was portability. Scikit-learn models are typically saved as Python pickles, binary blobs that are inherently tied to the Python runtime. To run these in Go without CGO or a heavy ONNX runtime, we had to rethink the "last mile" of our ML pipeline.

We moved to a **statically exported weight model**. Instead of asking Go to "run a model," we taught Go the math of our specific algorithms: probabilistic classification and adaptive routing.

  • Weight Extraction: Our Python training service now acts purely as a compiler. It trains on our massive synthetic and production datasets and then exports the log-probabilities and vocabularies as a plain, versioned weights.json file.
  • Direct Math in Go: We implemented the core statistical classification patterns directly in pure Go. No libraries, no overhead. Just raw, vectorized array operations.
  • Probabilistic Routing: Adaptive routing models, previously a bottleneck, were ported to use Go's native math/rand library for efficient probability-based sampling.
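To make the "direct math" idea concrete, here is a minimal sketch of how exported weights can drive classification in pure Go. The weights.json layout shown (per-class log-priors, per-token log-likelihoods, an out-of-vocabulary fallback) is a hypothetical illustration, not Hyperion's actual schema; scoring is plain multinomial Naive Bayes: sum the log-probabilities and take the argmax.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
	"strings"
)

// Weights mirrors a hypothetical weights.json exported by the Python trainer.
type Weights struct {
	Classes   []string             `json:"classes"`
	LogPriors []float64            `json:"log_priors"`
	LogProbs  []map[string]float64 `json:"log_probs"` // per class: token -> log P(token|class)
	UnkLogP   float64              `json:"unk_log_p"` // fallback for out-of-vocabulary tokens
}

// Classify scores a prompt against every class by summing log-probabilities
// (multinomial Naive Bayes) and returns the highest-scoring class.
func (w *Weights) Classify(prompt string) string {
	best, bestScore := "", math.Inf(-1)
	tokens := strings.Fields(strings.ToLower(prompt))
	for i, class := range w.Classes {
		score := w.LogPriors[i]
		for _, tok := range tokens {
			if lp, ok := w.LogProbs[i][tok]; ok {
				score += lp
			} else {
				score += w.UnkLogP
			}
		}
		if score > bestScore {
			best, bestScore = class, score
		}
	}
	return best
}

func main() {
	// A tiny inline stand-in for a versioned weights.json file.
	raw := `{
	  "classes": ["code", "chat"],
	  "log_priors": [-0.69, -0.69],
	  "log_probs": [
	    {"func": -1.0, "compile": -1.2, "hello": -4.0},
	    {"hello": -1.0, "thanks": -1.1, "func": -4.0}
	  ],
	  "unk_log_p": -6.0
	}`
	var w Weights
	if err := json.Unmarshal([]byte(raw), &w); err != nil {
		panic(err)
	}
	fmt.Println(w.Classify("compile this func"))
	fmt.Println(w.Classify("hello thanks"))
}
```

Because the hot path is just map lookups and float additions, there is nothing to garbage-collect per request beyond the token slice, which is what makes sub-millisecond inference plausible.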

The Performance Delta

|  | Legacy (Python HTTP) | Current (Native Go) |
| --- | --- | --- |
| Latency | ~17.5ms | 0.047ms |
| Source | External microservice | Atomic in-process memory |
| Failure mode | Network/serialization | None (linear logic) |

Statistical Anomaly Detection

We didn't stop at classification. Our "Sentinel" anomaly detector, which prevents malicious or malformed prompts from hitting upstream providers, was previously a complex, compute-intensive ensemble model.

By analyzing our traffic patterns, we realized we could achieve the same "Guardrail" efficacy using a high-performance Z-score statistical filter in Go. This allows us to reject anomalies in microseconds, before they even consume a single goroutine's scheduling slot.
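A Z-score filter of this kind is simple enough to sketch in full. The version below tracks a running mean and variance with Welford's algorithm over a single scalar feature (prompt length is used as an illustrative stand-in; Sentinel's actual features aren't described in the post) and flags anything more than a threshold number of standard deviations from the mean.

```go
package main

import (
	"fmt"
	"math"
)

// ZScoreFilter keeps running statistics (Welford's algorithm) over a
// scalar feature and flags values whose z-score exceeds a threshold.
type ZScoreFilter struct {
	n         int
	mean, m2  float64
	threshold float64
}

func NewZScoreFilter(threshold float64) *ZScoreFilter {
	return &ZScoreFilter{threshold: threshold}
}

// Observe folds a new sample into the running mean and variance.
func (f *ZScoreFilter) Observe(x float64) {
	f.n++
	delta := x - f.mean
	f.mean += delta / float64(f.n)
	f.m2 += delta * (x - f.mean)
}

// IsAnomaly reports whether x deviates from the running mean by more than
// threshold standard deviations. With fewer than two samples there is no
// variance estimate, so nothing is flagged.
func (f *ZScoreFilter) IsAnomaly(x float64) bool {
	if f.n < 2 {
		return false
	}
	std := math.Sqrt(f.m2 / float64(f.n-1))
	if std == 0 {
		return x != f.mean
	}
	return math.Abs(x-f.mean)/std > f.threshold
}

func main() {
	f := NewZScoreFilter(3.0)
	for _, length := range []float64{100, 110, 95, 105, 102, 98, 101, 99} {
		f.Observe(length)
	}
	fmt.Println(f.IsAnomaly(103))   // a typical prompt length
	fmt.Println(f.IsAnomaly(90000)) // a wildly oversized prompt
}
```

The check is two multiplications, a subtraction, and a comparison, so it runs in nanoseconds and never blocks; in practice you would run one filter per feature and per traffic segment, since a single global distribution washes out real anomalies.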

What This Means for the Hyperion Stack

By moving the intelligence layer into the Go core, we've achieved more than just speed. We've simplified the deployment architecture. The Hyperion Gateway is now more resilient; if the intelligence trainer is down, the gateway continues using its last-known-good weights with zero impact on uptime.
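One common way to get that last-known-good behavior, sketched here as an assumption rather than Hyperion's actual implementation, is to hold the active weights behind an atomic pointer: request goroutines read lock-free, and a background loader swaps in a new snapshot only after it parses successfully.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// weightSet is a stand-in for the exported model parameters.
type weightSet struct {
	Version string
}

// Store holds the active weights behind an atomic pointer. The hot
// request path reads without locks; a background loader swaps in new
// versions whenever the trainer publishes them.
type Store struct {
	current atomic.Pointer[weightSet]
}

// Swap installs new weights. In-flight requests keep the snapshot they
// already loaded, so the swap never blocks the request path.
func (s *Store) Swap(w *weightSet) { s.current.Store(w) }

// Active returns the last successfully loaded weights.
func (s *Store) Active() *weightSet { return s.current.Load() }

func main() {
	var s Store
	s.Swap(&weightSet{Version: "v1"})
	// If the trainer is unreachable, the loader simply skips the swap
	// and the gateway keeps serving v1 indefinitely.
	fmt.Println(s.Active().Version)
	s.Swap(&weightSet{Version: "v2"})
	fmt.Println(s.Active().Version)
}
```

Because a failed fetch or a corrupt weight file simply means "don't call Swap," the trainer being down degrades model freshness, never gateway availability.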

This is the philosophy that drives Hyperion: use Python for the heavy-lifting training and experimentation, but trust Go for the mission-critical execution path. The resulting architecture is leaner, faster, and ready for the next order of magnitude in AI scale.

Common Questions

What is the Hyperion AI Gateway?

Hyperion AI Gateway is an enterprise-grade gateway for production LLM applications. It provides a single API layer that routes requests to multiple AI providers, optimizes latency and cost, enforces security policies, and ensures reliability through caching, failover, and load balancing.