Logo
Back to Blog
AI & LLMsJune 13, 202615 min read

Self-Host Kimi K2.7 Code: vLLM & SGLang Deployment Guide

Running the open-source Kimi K2.7 Code weights yourself means full data control and no per-token bill, but a 1T-parameter MoE is a serious infrastructure commitment. This guide covers the real VRAM math, quantization options, vLLM and SGLang serving, and when self-hosting actually beats the Moonshot API.

Lushbinary Team

Lushbinary Team

Cloud & DevOps Solutions

Self-Host Kimi K2.7 Code: vLLM & SGLang Deployment Guide

Kimi K2.7 Code, released June 12, 2026 by Moonshot AI under a modified MIT license, is one of the largest open-weight coding models you can actually download and run yourself. It is a Mixture-of-Experts (MoE) model with 1 trillion total parameters, 32 billion active per token, a 256K context window, multimodal input, and an always-thinking design that uses roughly 30 percent fewer thinking tokens than K2.6. The weights live on Hugging Face, which means self-hosting is on the table. The harder question is whether you should.

The trap with a 1T MoE is the VRAM math. Only 32B parameters fire per token, so it is tempting to size hardware as if you only need to hold 32B. You do not. An MoE keeps every expert weight resident in memory, so the weights footprint is based on the full 1 trillion parameter count. Get that wrong and you provision a node that physically cannot load the model. This guide walks the real numbers, the quantization tradeoffs, GPU sizing, vLLM and SGLang serving, wiring the endpoint into Hermes Agent, and the honest break-even against the Moonshot API.

If you want the capability and pricing overview first, read our Kimi K2.7 Code developer guide. For squeezing cost out of the hosted API, see our cost optimization and token efficiency guide.

1Should You Self-Host Kimi K2.7 Code?

Self-hosting a frontier coding model is a serious infrastructure commitment, not a weekend project. Kimi K2.7 Code is a 1 trillion parameter MoE, and running it well means multi-GPU nodes, a tuned inference server, and ongoing operational ownership. Before you go down that path, get honest about why you want it. There are three good reasons and several bad ones.

The strongest reason is data control and residency. If your code, prompts, and outputs cannot leave your network for regulatory, contractual, or security reasons, self-hosting is the only option that keeps everything inside your boundary. No third party sees your traffic, and you decide retention and logging.

The second reason is cost at sustained high volume. A node you own runs at a fixed hourly rate regardless of how many tokens you push through it, so once your traffic is high enough and steady enough, the per-token economics can beat a metered API. As the break-even section shows, the bar for this is genuinely high, so it only applies to teams with near-constant, well-batched load.

The third reason is control over the serving stack: custom sampling, pinned model versions, predictable latency, and the freedom to colocate the model with your own retrieval and tooling.

The tradeoffs are real and worth naming up front:

  • Capital and operations: multi-GPU nodes are expensive whether owned or rented, and someone has to keep them healthy, patched, and monitored.
  • Utilization risk: an idle node still costs money, so the cost advantage evaporates if traffic is bursty or low.
  • Engineering depth: tensor parallelism, quantization, KV cache tuning, and batching are not trivial, and mistakes show up as out-of-memory crashes or poor throughput.
  • Update cadence: you own model upgrades, security patches, and driver compatibility rather than getting them for free from a managed API.

💡 The honest default

Most teams should start on the Moonshot API or OpenRouter and only move to self-hosting when data-residency rules force it or when measured, sustained volume clears the break-even bar. Self-hosting is a deliberate choice backed by numbers, not a default.

2The Real VRAM Math

This is the section that decides whether your deployment works at all. The single most important fact: although only 32B parameters are active per token, an MoE keeps all expert weights resident in memory. So the weights footprint is computed from the full 1 trillion parameter count, not the 32B active count. Sizing on 32B is the classic mistake that leads to a node that cannot load the model.

The formula for the weights portion of memory is simple:

weights_memory = total_params x bytes_per_param

BF16/FP16 (2 bytes/param): 1e12 x 2 = 2,000 GB  (about 2 TB)
FP8        (1 byte/param):  1e12 x 1 = 1,000 GB  (about 1 TB)
4-bit      (~0.5 byte/param): 1e12 x 0.5 = 500 GB

At full BF16 precision the weights alone are about 2 TB. That is why full-precision self-hosting is impractical for almost everyone: you would need a large multi-node cluster just to hold the weights, before you serve a single request. FP8 halves that to about 1 TB, and 4-bit quantization brings the weights down to about 500 GB.

⚠️ Weights are not the total budget

The numbers above are weights only. The total VRAM budget is weights + KV cache + runtime overhead. At 256K context the KV cache and overhead add tens to well over a hundred GB depending on batch size and sequence length. Never present a weights-only figure as "how much VRAM you need." You must size hardware above the weights-only figure, with concrete headroom for the KV cache and the serving runtime.

So the mental model is: pick a quantization to fix the weights number, then add a real KV-cache-plus-overhead allowance for your context length and batch size, and only then choose hardware whose total VRAM comfortably exceeds that sum. The next two sections turn that into concrete GPU choices.

3Quantization Options: FP8 vs 4-bit vs GGUF

Quantization is the lever that sets your weights footprint, and therefore your hardware bill. Here is how the main options compare for Kimi K2.7 Code, with weights computed from the 1T parameter count.

FormatBytes/paramWeights onlyNotes
BF16 / FP162about 2,000 GB (2 TB)Highest fidelity, impractical for most without a cluster
FP81about 1,000 GB (1 TB)Strong quality, needs Hopper or newer for native FP8
4-bit quant~0.5about 500 GBSmallest footprint, some quality tradeoff
GGUFvaries by quanttracks the chosen bit-widthCommunity quants for llama.cpp-style runtimes

FP8 at 1 byte per parameter is the sweet spot for quality-sensitive production serving on modern data-center GPUs. The weights land at about 1 TB, which a single 8x H200 node can hold with room for the KV cache.

4-bit at roughly 0.5 bytes per parameter is the footprint play. About 500 GB of weights fits comfortably on an 8x H100 node and leaves generous headroom. Expect some quality tradeoff relative to FP8, so validate on your own coding tasks before committing.

GGUF community quants, such as those published at unsloth/Kimi-K2.7-Code-GGUF, target llama.cpp-style runtimes and a range of bit-widths. The weights footprint tracks whichever quant level you pick, so use the same total_params x bytes_per_param formula to size it. The official weights live at moonshotai/Kimi-K2.7-Code.

4GPU Sizing: How Many and Which

GPU count for a model this size is a VRAM requirement first and a parallelism preference second. You need enough aggregate VRAM to hold the weights plus the KV cache plus overhead. Only after that does the tensor-parallel topology come into play. For reference, NVIDIA H100 = 80 GB, H200 = 141 GB, and B200 = 192 GB per GPU.

QuantWeightsNodeTotal VRAMHeadroom over weights
FP8about 1 TB8x H2008 x 141 = 1,128 GBroughly 128 GB for KV cache and overhead
FP8about 1 TB8x H1008 x 80 = 640 GBCannot hold the ~1 TB weights at all
4-bitabout 500 GB8x H1008 x 80 = 640 GBroughly 140 GB for KV cache and overhead

Reading the table as weights-plus-headroom reasoning: FP8 (about 1 TB of weights) fits on an 8x H200 node, where 8 x 141 = 1,128 GB total leaves roughly 128 GB above the weights for the KV cache and runtime overhead. The same FP8 weights do not fit on an 8x H100 node, because 8 x 80 = 640 GB is below the ~1 TB the weights need. There is no clever batching that fixes a node that cannot even load the weights.

4-bit (about 500 GB of weights) is a more comfortable fit on an 8x H100 node: 8 x 80 = 640 GB total leaves roughly 140 GB above the weights for the KV cache and overhead. That headroom is what makes 4-bit attractive when H200s are scarce or expensive.

💡 VRAM required vs parallelism preferred

The 8-GPU count here is driven by the memory budget: that is the VRAM requirement. vLLM tensor-parallel prefers power-of-two sizes (1, 2, 4, 8), so TP=8 is the natural parallelism-preferred size that matches an 8-GPU node. You are not choosing 8 GPUs because the model needs 8-way parallelism for speed; you are choosing 8 GPUs because that is the VRAM you need, and TP=8 is the clean power-of-two that fits it.

5Serving with vLLM

vLLM is a strong default for high-throughput serving. It exposes an OpenAI-compatible API on a port, which is what makes wiring it into agents straightforward later. The key flag is --tensor-parallel-size, set to your GPU count. Keep it a power of two (1, 2, 4, 8); here TP=8 matches the 8-GPU memory budget from the sizing section.

# Serve Kimi K2.7 Code with vLLM (OpenAI-compatible API on :8000)
vllm serve moonshotai/Kimi-K2.7-Code \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --served-model-name kimi-k2.7-code \
  --host 0.0.0.0 \
  --port 8000

# --tensor-parallel-size 8  -> power-of-two TP matching the 8-GPU VRAM budget
# --max-model-len 262144    -> 256K context; raises KV-cache memory, so leave headroom
# --gpu-memory-utilization  -> cap so weights + KV cache + overhead fit per GPU

Tune --gpu-memory-utilization and --max-model-len together. The full 256K context inflates the KV cache, so if you hit out-of-memory errors, lower the max sequence length or the memory utilization cap until the weights, KV cache, and overhead fit inside per-GPU VRAM. This is the weights + KV cache + overhead budget showing up in practice.

6Serving with SGLang

SGLang is a capable alternative server, often favored for its structured generation and scheduling. It also exposes an OpenAI-compatible API, so it slots into the same agent wiring. The tensor-parallel flag is --tp, and the same power-of-two guidance applies.

# Serve Kimi K2.7 Code with SGLang (OpenAI-compatible API on :8000)
python -m sglang.launch_server \
  --model-path moonshotai/Kimi-K2.7-Code \
  --tp 8 \
  --context-length 262144 \
  --mem-fraction-static 0.90 \
  --host 0.0.0.0 \
  --port 8000

# --tp 8            -> power-of-two tensor parallelism matching the 8-GPU VRAM budget
# --context-length  -> 256K; larger context means a larger KV cache, so size for it
# --mem-fraction-static -> reserve fraction for weights so KV cache + overhead still fit

Pick vLLM or SGLang based on which one performs best on your hardware and workload, then benchmark throughput and latency on your real prompts. Both leave you with the same thing: an OpenAI-compatible endpoint on a local port.

7Connecting Your Endpoint to Hermes Agent

Because both vLLM and SGLang expose an OpenAI-compatible API, Hermes Agent from Nous Research can use your self-hosted server as its model backend. Hermes is provider-agnostic and supports OpenAI-compatible endpoints, so you point it at your local base URL and select the served model name. No code changes, just configuration.

# Point Hermes Agent at your self-hosted OpenAI-compatible endpoint
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="local-key"   # any non-empty value for a local server

# Select the served model (the --served-model-name you set above)
hermes model
# choose: kimi-k2.7-code

# Hermes now routes its calls to your local vLLM/SGLang server

The served model name must match what you set on the server (--served-model-name in vLLM, or the model path SGLang reports). For a full agent setup walkthrough, see our Hermes Agent autonomous coding setup guide.

8Self-Host vs Moonshot API: The Break-Even

This is where most self-hosting plans meet reality. The question is how many tokens per month you have to push through an owned node before it costs less than paying Moonshot per token. Here is the math with every assumption stated, so you can re-run it with your own numbers.

Assumptions. Rent an 8x H100 node at about $2 to $3 per GPU-hour (cloud on-demand, varies by provider), so $16 to $24 per hour for 8 GPUs. Over a 730-hour month that is $11,680 on the low end and $17,520 on the high end. Moonshot API pricing is $0.95 per million input tokens and $4.00 per million output tokens, so a blended price at a 50/50 input/output mix is (0.95 + 4.00) / 2 = $2.475 per million tokens.

monthly_self_host_cost = gpu_hourly x hours_per_month
  low:  $16/hr x 730 = $11,680 / month
  high: $24/hr x 730 = $17,520 / month

blended_api_price = (0.95 + 4.00) / 2 = $2.475 per million tokens (50/50 mix)

break_even_tokens = monthly_self_host_cost / blended_api_price
  low:  11,680 / 2.475 = about 4,719 million tokens/month (~4.7 billion)
  high: 17,520 / 2.475 = about 7,079 million tokens/month (~7.1 billion)

So self-hosting an 8x H100 node only beats the API somewhere around 4.7 to 7.1 billion tokens per month. Below that range, the API is cheaper; above it, the owned node wins on raw token cost.

⚠️ Can the hardware actually sustain that volume?

5 billion tokens per month is roughly 5e9 / (730 x 3600) = about 1,900 tokens per second sustained 24/7. That demands near-constant, well-batched utilization of the node. A node sitting idle most of the day will not reach break-even. Benchmark your own sustained throughput before assuming you can hit these volumes. Cache-hit pricing on the API ($0.19 per million) also lowers the API side for cache-heavy workloads, pushing the break-even even higher.

Conclusion. For most teams the Moonshot API (or OpenRouter) is the practical choice. Self-hosting wins mainly on strict data-residency requirements or very high sustained volume that clears the 4.7 to 7.1 billion tokens per month bar with steady, well-batched traffic. If your load is bursty or modest, the API is both cheaper and far less operational work. The Moonshot API uses the OpenAI-compatible base URL platform.moonshot.ai at https://api.moonshot.ai/v1 with model id kimi-k2.7-code, so you can start there and switch to self-hosting later without rewriting your integration.

9Why Lushbinary for Self-Hosted AI

Self-hosting a 1T MoE only pays off when the sizing, serving, and economics are all correct at the same time. That is exactly the work Lushbinary does. We size the GPUs against the real weights + KV cache + overhead budget, tune vLLM or SGLang for your hardware, and wire the endpoint into your agents so the deployment is production-ready, not a demo.

Here is what we bring to a Kimi K2.7 Code deployment:

  • Honest sizing: we compute weights from the full 1T parameter count, add a real KV-cache-plus-overhead allowance for your context length, and pick hardware that actually fits.
  • Serving setup: tuned vLLM or SGLang with the right tensor-parallel topology, memory utilization, and batching for your throughput and latency targets.
  • Break-even modeling: we run your real traffic against the API to tell you whether self-hosting actually saves money before you commit to a node.
  • Agent integration: we connect the OpenAI-compatible endpoint to Hermes Agent or your own tooling and validate it end to end.
  • Data-residency design: network isolation, retention controls, and logging that keep your code inside your boundary.

🚀 Free Consultation

Trying to decide between self-hosting Kimi K2.7 Code and the Moonshot API? Lushbinary will model your break-even, size the GPUs against the real VRAM budget, and recommend the deployment that actually fits your volume and data requirements, with no obligation.

Ready to Build Something Great?

Get a free 30-minute strategy call. We'll map out your project, timeline, and tech stack - no strings attached.

Let's Talk About Your Project

Prefer email? Reach us directly:

Contact Us

❓ Frequently Asked Questions

How much VRAM do you need to self-host Kimi K2.7 Code?

Kimi K2.7 Code is a 1 trillion parameter Mixture-of-Experts model, and an MoE keeps all expert weights resident in memory, so weights are sized on the full 1T count. Using the formula total_params x bytes_per_param, BF16 weights are about 2,000 GB (2 TB), FP8 about 1,000 GB (1 TB), and 4-bit about 500 GB. That is weights only. Total VRAM is weights plus KV cache plus runtime overhead, and at 256K context the KV cache and overhead add tens to well over a hundred GB depending on batch size and sequence length, so you must size hardware above the weights-only figure.

How many GPUs does Kimi K2.7 Code require?

It is a VRAM requirement first. FP8 weights (about 1 TB) do not fit on 8x H100 (8 x 80 = 640 GB) but do fit on 8x H200 (8 x 141 = 1,128 GB), leaving roughly 128 GB for KV cache and overhead. 4-bit weights (about 500 GB) fit comfortably on 8x H100 (640 GB), leaving roughly 140 GB of headroom. vLLM prefers power-of-two tensor-parallel sizes, so TP=8 is the natural parallelism-preferred size that matches the 8-GPU memory budget.

When does self-hosting Kimi K2.7 Code beat the Moonshot API?

Renting an 8x H100 node at about $2 to $3 per GPU-hour is $16 to $24 per hour, or $11,680 to $17,520 over a 730-hour month. With a blended Moonshot price of (0.95 + 4.00) / 2 = $2.475 per million tokens at a 50/50 mix, break-even is monthly_cost / blended_price: 11,680 / 2.475 = about 4,719 million tokens (4.7 billion) on the low end and 17,520 / 2.475 = about 7,079 million tokens (7.1 billion) on the high end per month. That is roughly 1,900 tokens per second sustained 24/7 at 5 billion tokens, so only near-constant, well-batched utilization breaks even.

Can a self-hosted Kimi K2.7 Code endpoint work with Hermes Agent?

Yes. vLLM and SGLang both expose an OpenAI-compatible API, and Hermes Agent from Nous Research is provider-agnostic with OpenAI-compatible endpoint support. Point Hermes at your local base URL (for example http://localhost:8000/v1) and select the served model name, and your self-hosted server becomes the Hermes model backend.

Should most teams self-host Kimi K2.7 Code or use the API?

For most teams the Moonshot API (or OpenRouter) is the practical choice. Self-hosting an 8x H100 node only beats the API somewhere around 4.7 to 7.1 billion tokens per month of sustained, well-batched traffic, and cache-hit pricing of $0.19/M lowers the API side further for cache-heavy workloads. Self-hosting wins mainly on strict data-residency requirements or very high sustained volume. Always benchmark your own throughput and verify GPU pricing with your provider.

📚 Sources

Content was rephrased for compliance with licensing restrictions. Model specs sourced from Moonshot AI and Hugging Face as of June 2026. VRAM and cost figures are computed estimates that depend on quantization, context length, batch size, and current GPU pricing - always benchmark your own deployment and verify GPU pricing with your provider.

Deploy Kimi K2.7 Code on Your Own Infrastructure

We size the GPUs, tune vLLM or SGLang, and wire it into your agents so the economics actually work.

Ready to Build Something Great?

Get a free 30-minute strategy call. We'll map out your project, timeline, and tech stack - no strings attached.

Let's Talk About Your Project

Prefer email? Reach us directly:

Contact Us

Subscribe · Newsletter

Self-Host the Right Way

Infrastructure-grade guides on serving open-weight models, GPU sizing, and the real cost of self-hosting.

  • New deep-dives on AI agents and cloud architecture
  • Engineering teardowns of shipped products
  • No spam, unsubscribe in one click

We respect your inbox. Read our privacy policy.

Exclusive Offer for Lushbinary Readers
WidelAI

One Subscription. Every Flagship AI Model.

Stop juggling multiple AI subscriptions. WidelAI gives you access to Claude, GPT, Gemini, and more - all under a single plan.

Claude Opus & SonnetGPT-5.5 & o3Gemini ProSingle DashboardAPI Access

Use code at checkout for 10% off your subscription:

Kimi K2.7 CodeSelf-HostingvLLMSGLangQuantizationGPU InferenceMoEOpen Source LLMMoonshot AIHermes AgentModel DeploymentInference Optimization

ContactUs