KernelBench Level 3 is one of the hardest AI coding benchmarks: take a reference PyTorch implementation of a complete architecture (MobileNet, VGG, MiniGPT, Mamba) and produce a faster GPU kernel with identical outputs. GLM-5.1 delivered a 3.6× geometric mean speedup — more than double what torch.compile achieves even with max-autotune. Here's the full breakdown.
📋 Table of Contents
- 1. What Is KernelBench?
- 2. Level 3: Full-Model Optimization
- 3. GLM-5.1 Results: 3.6× Speedup
- 4. Comparison with Other Models
- 5. Long-Horizon Optimization Trajectories
- 6. Audit & Correctness Verification
- 7. Implications for ML Engineering
- 8. Lushbinary ML Optimization
1. What Is KernelBench?
KernelBench evaluates whether a model can take a reference PyTorch implementation and produce a faster GPU kernel with identical outputs. It's organized into three levels: Level 1 covers single operators, Level 2 covers fused operator sequences, and Level 3 covers full-model end-to-end optimization of complete architectures — the hardest tier.
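Concretely, each problem pairs a reference module with a candidate rewrite that must match its outputs. A minimal sketch of that contract, using a toy linear+ReLU block as a stand-in for a full architecture (the class names and harness here are illustrative, not the benchmark's exact interface):

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    """Stand-in for the PyTorch eager reference (e.g. a MobileNet forward)."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.fc(x))

class ModelNew(nn.Module):
    """Candidate rewrite: same math, restructured (here via a fused addmm)."""
    def __init__(self, ref: Model):
        super().__init__()
        self.weight = ref.fc.weight.detach().clone()
        self.bias = ref.fc.bias.detach().clone()

    def forward(self, x):
        # addmm computes bias + x @ weight.T in a single fused call
        return torch.relu(torch.addmm(self.bias, x, self.weight.t()))

ref = Model()
cand = ModelNew(ref)
x = torch.randn(4, 16)
assert torch.allclose(ref(x), cand(x), atol=1e-4, rtol=1e-4)
```

A real Level 3 submission replaces the entire forward pass of a full model, often with hand-written CUDA, but the acceptance criterion is the same: identical outputs within tolerance, only faster.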
2. Level 3: Full-Model Optimization
Level 3 includes 50 problems spanning architectures like MobileNet, VGG, MiniGPT, and Mamba. Each problem runs in an isolated Docker container with one H100 GPU, limited to 1,200 tool-use turns. Correctness is verified with atol=rtol=1e-4 against the PyTorch eager baseline.
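The atol=rtol=1e-4 check follows torch.allclose semantics: elementwise, |actual − expected| must not exceed atol + rtol·|expected|. A pure-Python sketch of that criterion (the benchmark itself compares full model outputs on the H100):

```python
def within_tolerance(actual, expected, atol=1e-4, rtol=1e-4):
    """Elementwise tolerance check matching torch.allclose semantics."""
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))

# A candidate output that drifts by 5e-5 per element still passes...
assert within_tolerance([1.00005, 2.00005], [1.0, 2.0])
# ...but a 1e-2 drift fails the check.
assert not within_tolerance([1.01, 2.0], [1.0, 2.0])
```

The relative term matters for full models: activations with large magnitudes are allowed proportionally more absolute drift, which is why a fused kernel with slightly different accumulation order can still pass.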
3. GLM-5.1 Results: 3.6× Speedup
GLM-5.1 achieved a 3.6× geometric mean speedup across all 50 problems. For context:
| Approach | Speedup |
|---|---|
| torch.compile (default) | 1.15× |
| torch.compile (max-autotune) | 1.49× |
| GLM-5.1 | 3.6× |
| Claude Opus 4.6 | 4.2× |
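The headline numbers are geometric means over the 50 per-problem speedups, which (unlike an arithmetic mean) keeps a few huge wins from dominating the score. A sketch of the aggregation — the per-problem numbers are not published here, so the inputs below are purely illustrative:

```python
import math

def geomean(speedups):
    """Geometric mean: exp of the mean of the logs."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# One 16x win among otherwise-flat problems barely moves the geomean...
print(round(geomean([1.0, 1.0, 1.0, 16.0]), 2))  # 2.0
# ...whereas the arithmetic mean of the same numbers would report 4.75.
```

This is why a 3.6× geomean across 50 full-model problems is a strong result: it implies broad, consistent gains rather than a handful of outliers.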
4. Comparison with Other Models
The trajectories highlight differences in long-horizon optimization behavior. GLM-5 improves quickly at first but levels off relatively early. Claude Opus 4.5 continues a bit longer but its gains also taper. GLM-5.1 pushes the frontier further at 3.6× and continues making progress late into the run. Claude Opus 4.6 finishes at 4.2× and still shows headroom at the end.
5. Long-Horizon Optimization Trajectories
The key insight from KernelBench is that the rate of improvement slows over time for all models, but the productive horizon varies significantly. GLM-5.1 sustains useful optimization for substantially longer than GLM-5, while the remaining gap with Claude Opus 4.6 shows that long-horizon optimization is still an open frontier.
6. Audit & Correctness Verification
All solutions are independently audited for benchmark exploitation by Claude Opus 4.6 (max effort) and GPT-5.4 (xhigh). Each audit verifies that the optimization does not exploit benchmark-specific behavior, works with arbitrary new inputs, and keeps all computation on the default CUDA stream. The lower of the audited speedups is used for scoring, with a 50× hard cap to limit outlier influence.
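The scoring rule described above is simple to state; a minimal sketch, assuming each auditor reports its own measured speedup for a solution (function name and inputs are hypothetical):

```python
def audited_speedup(audit_measurements, cap=50.0):
    """Take the most conservative audit, then clamp outliers at the cap."""
    return min(min(audit_measurements), cap)

# When two auditors disagree, the lower measurement counts.
assert audited_speedup([3.8, 3.6]) == 3.6
# A pathological 120x measurement is clamped to the 50x hard cap.
assert audited_speedup([120.0, 130.0]) == 50.0
```

Taking the minimum across auditors is the conservative choice: a solution only scores what every auditor could reproduce, and the cap keeps a single degenerate problem from inflating the geometric mean.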
7. Implications for ML Engineering
A 3.6× speedup over PyTorch eager baselines has real production value. For ML teams running inference at scale, this level of kernel optimization can translate directly to reduced GPU costs and lower latency. The fact that an AI model can achieve this autonomously — without human kernel engineering expertise — suggests a future where model-driven optimization becomes a standard part of the ML deployment pipeline.
8. Lushbinary ML Optimization
At Lushbinary, we help ML teams optimize inference performance using frontier AI models. From kernel optimization to deployment pipeline automation, we can help you ship faster and cheaper.
🚀 Free Consultation
Looking to optimize ML inference performance with AI-driven kernel optimization? We help teams ship faster and cheaper.
❓ Frequently Asked Questions
How did GLM-5.1 perform on KernelBench Level 3?
GLM-5.1 achieved 3.6× geometric mean speedup across 50 problems on KernelBench Level 3, which covers full-model end-to-end optimization of architectures like MobileNet, VGG, MiniGPT, and Mamba. For reference, torch.compile achieves 1.15× (default) and 1.49× (max-autotune).
How does GLM-5.1 compare to Claude Opus 4.6 on KernelBench?
Claude Opus 4.6 leads at 4.2× speedup compared to GLM-5.1's 3.6×. However, GLM-5.1 sustains useful optimization for substantially longer than GLM-5, which plateaus early. Both models show headroom at the end of their runs.
📚 Sources
- Z.ai — GLM-5.1: Towards Long-Horizon Tasks (April 7, 2026)
- HuggingFace — GLM-5.1 Model Weights
- GitHub — GLM-5.1 Repository
Benchmark data sourced from official Zhipu AI publications as of April 8, 2026. Pricing and availability may change — always verify on the vendor's website.
Build Smarter, Launch Faster.
Book a free strategy call and explore how Lushbinary can turn your vision into reality.

