Logo
Back to Blog
Cloud & DevOpsApril 24, 202615 min read

Self-Hosting DeepSeek V4: vLLM Setup, Hardware Requirements & Deployment Guide

DeepSeek V4 ships under MIT license with open weights. We cover hardware requirements for V4-Pro (862GB) and V4-Flash (158GB), vLLM deployment, quantization options, expert parallelism, and cost analysis for self-hosted inference.

Lushbinary Team

Lushbinary Team

Cloud & DevOps Solutions

Self-Hosting DeepSeek V4: vLLM Setup, Hardware Requirements & Deployment Guide

DeepSeek V4 is the most capable model ever released under MIT license. V4-Pro at 1.6T parameters and V4-Flash at 284B both ship with open weights on Hugging Face, meaning you can run frontier-adjacent AI on your own infrastructure with zero vendor lock-in. The question isn't whether you can self-host it, it's whether you should, and what hardware you need.

V4-Flash needs roughly 170-175GB of total VRAM (158GB weights + ~10GB for the full 1M KV cache + overhead) and fits comfortably on 2x H200 or 2x RTX Pro 6000 Blackwell. V4-Pro at ~862GB needs a real GPU cluster. Both require careful deployment planning to get right. This guide covers hardware requirements, vLLM setup, quantization options, expert parallelism, cost analysis, and production deployment patterns.

If you're evaluating self-hosting vs API for DeepSeek V4, this is the guide that gives you the numbers to make that decision.

What This Guide Covers

  1. Why Self-Host DeepSeek V4?
  2. Hardware Requirements: V4-Flash vs V4-Pro
  3. Downloading Weights from Hugging Face
  4. vLLM Deployment Setup
  5. Expert Parallelism & Tensor Parallelism
  6. Quantization Options: FP8, FP4, INT4
  7. 1M Context Window Configuration
  8. AWS Deployment: Instance Types & Costs
  9. Self-Host vs API: Break-Even Analysis
  10. Why Lushbinary for LLM Infrastructure

1Why Self-Host DeepSeek V4?

Three reasons make self-hosting V4 compelling:

  • Data sovereignty: DeepSeek's hosted API routes through Chinese infrastructure. For regulated industries, defense contractors, or teams with strict data residency requirements, self-hosting on your own cloud eliminates this concern entirely.
  • Cost at scale: At very high token volumes (several billion tokens/day), self-hosting V4-Flash can be cheaper than DeepSeek's API. You pay for GPU hours, not per-token. See the break-even math in section 9 before assuming this applies to you.
  • Customization: MIT license means you can fine-tune V4 for your domain, modify the inference pipeline, and integrate it into custom toolchains without restrictions.

The trade-off: self-hosting requires GPU infrastructure expertise, ongoing maintenance, and upfront capital. For teams below multi-billion tokens/day, the API is almost always more cost-effective. Run the math in section 9 against your actual traffic before committing.

2Hardware Requirements: V4-Flash vs V4-Pro

SpecV4-FlashV4-Pro
Weight Size (FP4+FP8)~158GB~862GB
Minimum GPUs2x H200 or 2x RTX Pro 6000 Blackwell8x H200 141GB (single node) or 16x H100 80GB (2 nodes)
Recommended GPUs4x A100 80GB or 2x H200 (power-of-2 for vLLM)8x H200 or DGX H200
System RAM256GB+1TB+
Storage500GB NVMe2TB NVMe
InterconnectNVLink (multi-GPU)NVLink + InfiniBand

V4-Flash is the Self-Hosting Sweet Spot

Total memory footprint is roughly 170-175GB: ~158GB of FP4+FP8 weights, ~10GB for the full 1M-token KV cache (V4 uses only 7% of V3.2's KV cache), plus a few GB of runtime overhead. That fits on hardware a well-funded startup can afford, and V4-Flash delivers 85-95% of V4-Pro's quality on most tasks. Unless you specifically need V4-Pro's superior agentic coding and knowledge capabilities, V4-Flash is the practical self-hosting choice.

Why Some Guides Say 4x A100 (320GB)

vLLM tensor parallelism works best with power-of-two GPU counts (1, 2, 4, 8). Two A100 80GB only provide 160GB, which is below the ~170GB total you need at full 1M context. The next power of two is four A100s, hence the 4x A100 recommendation. The extra 150GB is headroom, not a requirement. On GPUs with more VRAM per card (H200, RTX Pro 6000 Blackwell), two GPUs are enough.

3Downloading Weights from Hugging Face

Both models are available on Hugging Face under the deepseek-ai organization. The Instruct checkpoints (FP4+FP8 mixed precision) are what you want for production deployment:

# Install huggingface-cli if needed

pip install huggingface_hub

# Download V4-Flash (Instruct, ~158GB)

huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \

--local-dir ./deepseek-v4-flash

# Download V4-Pro (Instruct, ~862GB)

huggingface-cli download deepseek-ai/DeepSeek-V4-Pro \

--local-dir ./deepseek-v4-pro

4vLLM Deployment Setup

vLLM is the recommended inference framework for DeepSeek V4. It supports MoE expert parallelism, the hybrid CSA+HCA attention architecture, and efficient KV cache management for 1M-token contexts.

# Install vLLM with MoE support

pip install vllm>=0.8.0

# Serve V4-Flash on 4x A100 80GB (set TP=2 for 2x H200)

python -m vllm.entrypoints.openai.api_server \

--model ./deepseek-v4-flash \

--tensor-parallel-size 4 \

--max-model-len 131072 \

--trust-remote-code \

--port 8000

The vLLM server exposes an OpenAI-compatible API, so you can point any existing OpenAI SDK client at it by changing the base URL. This makes migration from the DeepSeek hosted API to self-hosted seamless.

5Expert Parallelism & Tensor Parallelism

MoE models like V4 benefit from two types of parallelism:

  • Tensor Parallelism (TP): Splits individual layers across GPUs. Use this when a single layer's weights don't fit in one GPU's VRAM. Set --tensor-parallel-size to the number of GPUs.
  • Expert Parallelism (EP): Distributes different expert sub-networks across GPUs. Since MoE only activates a subset of experts per token, EP allows efficient utilization of multi-GPU setups without the communication overhead of TP.

For V4-Flash on 4x A100 or 2x H200, tensor parallelism alone is sufficient. For V4-Pro on 8+ GPUs, a combination of TP and EP gives the best throughput. vLLM handles this automatically when you set the appropriate parallelism flags.

6Quantization Options: FP8, FP4, INT4

DeepSeek ships V4 in two precision formats:

  • FP8 Mixed (Base checkpoints): Most parameters in FP8 precision. Higher quality, larger memory footprint.
  • FP4+FP8 Mixed (Instruct checkpoints): MoE expert parameters in FP4, other parameters in FP8. This is the recommended format: it balances quality and memory usage.

For further compression, community quantizations (GGUF, AWQ, GPTQ) will likely appear within days of launch. INT4 quantization can reduce V4-Flash to ~80GB, potentially fitting on 4x RTX 4090 (96GB total), but expect measurable quality degradation, especially on reasoning-heavy tasks.

⚠️ Quantization Trade-offs

The official FP4+FP8 Instruct checkpoints are already aggressively quantized. Further quantization to INT4 will degrade quality, particularly on math, reasoning, and agentic tasks. For production workloads, stick with the official checkpoints unless VRAM constraints leave no alternative.

71M Context Window Configuration

V4's hybrid CSA+HCA attention reduces KV cache to roughly 7% of V3.2's footprint at 1M context. In practice, a full 1M-token context consumes about 10GB of VRAM on top of the 158GB of weights, plus a few GB of runtime overhead. Total: roughly 170-175GB.

For V4-Flash, that fits on 2x H200 (282GB) or 2x RTX Pro 6000 Blackwell (192GB) with plenty of headroom. On A100 80GB, 2 cards give you only 160GB, which is below the budget once you load the full KV cache. Four A100s (320GB) are the next power-of-two tensor parallel size that vLLM supports, which is why most vLLM guides recommend 4x A100 80GB. The headroom is incidental, not required.

Set --max-model-len in vLLM to your desired context length. Start with 131072 (128K) and increase based on your VRAM headroom. Monitor GPU memory usage during inference to find the sweet spot for your hardware.

8AWS Deployment: Instance Types & Costs

InstanceGPUsVRAMOn-Demand $/hrBest For
p5.48xlarge8x H100 80GB640GB~$55V4-Flash (comfortable)
p5e.48xlarge8x H200 141GB1128GB~$40-$50V4-Pro (single node)
p5en.48xlarge8x H200 141GB (200 Gbps)1128GB~$63V4-Pro (faster fabric)
2x p5.48xlarge16x H100 80GB1280GB~$110V4-Pro (multi-node)

With 1-year reserved instances, costs drop roughly 40%. Spot instances can reduce costs further but aren't suitable for production inference due to interruption risk. Pricing varies by region and changes often, so always check the AWS pricing page before budgeting. For most teams, a single p5.48xlarge running V4-Flash is the cost-effective starting point.

9Self-Host vs API: Break-Even Analysis

The honest answer is that pure-cost break-even against DeepSeek's own API is hard to reach. Here is the math for V4-Flash using AWS us-east-1 on-demand pricing:

  • API cost: $0.14/M input, $0.28/M output (cache miss). Blended 50/50 input/output rate is roughly $0.21/M. At 50M tokens/day, that is about $7-$14/day depending on the mix.
  • Self-host cost: p5.48xlarge at $55.04/hr on-demand is about $1,321/day. A 1-year reserved instance at roughly 40% off drops that to around $790/day.
  • Break-even: To match $790/day of API spend at $0.21/M, you need to serve roughly 3.8 billion tokens/day on one reserved p5.48xlarge. A single 8x H100 node cannot physically sustain that throughput for V4-Flash, so the math almost never favors self-hosting on price alone for a single workload.

For all but the largest serving operations, DeepSeek's API is cheaper. Self-hosting is the right call when data sovereignty, regulatory residency, custom fine-tuning, or consistent latency under unpredictable API quotas outweigh raw cost. Those are legitimate reasons, but they are not a cost argument.

10Why Lushbinary for LLM Infrastructure

Lushbinary deploys self-hosted LLMs on AWS for teams that need data sovereignty, custom fine-tuning, or cost optimization at scale. We handle the full infrastructure stack: GPU instance selection, vLLM configuration, auto-scaling, monitoring, and cost optimization.

🚀 Free Consultation

Want to self-host DeepSeek V4 on your own infrastructure? Lushbinary specializes in GPU cloud deployment and LLM inference optimization. We'll help you choose the right hardware, configure vLLM, and get to production, no obligation.

❓ Frequently Asked Questions

How much VRAM do I need to run DeepSeek V4-Flash?

About 170-175GB total: ~158GB for FP4+FP8 weights, ~10GB for the full 1M-token KV cache, plus a few GB of overhead. That fits on 2x H200 (282GB), 2x RTX Pro 6000 Blackwell (192GB), or 4x A100 80GB. vLLM suggests 4x A100 because it prefers power-of-two GPU counts for tensor parallelism, not because 320GB of VRAM is actually required.

Can I run DeepSeek V4-Pro on a single machine?

Only on a high-memory node. V4-Pro at ~862GB does not fit on 8x H100 80GB (640GB total). Use 8x H200 141GB (1,128GB, single node) or two p5.48xlarge nodes (16x H100 80GB, 1,280GB) with NVLink and InfiniBand for multi-node.

What inference framework should I use?

vLLM is recommended. It supports MoE expert parallelism, the hybrid CSA+HCA attention, and efficient KV cache management. SGLang is a solid alternative.

How much does self-hosting cost on AWS?

V4-Flash on p5.48xlarge costs about $55/hour on-demand or ~$33/hour with a 1-year reserved instance in us-east-1. At DeepSeek's API rates ($0.14/M input, $0.28/M output), break-even with reserved instances only arrives around 3-4 billion tokens/day, which a single 8x H100 node cannot physically serve. Self-host for sovereignty or fine-tuning reasons rather than raw cost.

Is DeepSeek V4 really MIT licensed?

Yes. Both V4-Pro and V4-Flash are MIT licensed on Hugging Face. You can use, modify, fine-tune, and commercially deploy without restrictions.

Sources

Content was rephrased for compliance with licensing restrictions. Hardware specs and pricing sourced from official model cards and AWS pricing pages as of April 24, 2026. Pricing may change, always verify on vendor websites.

Deploy DeepSeek V4 on Your Infrastructure

Lushbinary handles GPU deployment, vLLM configuration, and production optimization for self-hosted LLMs.

Ready to Build Something Great?

Get a free 30-minute strategy call. We'll map out your project, timeline, and tech stack - no strings attached.

Let's Talk About Your Project

Prefer email? Reach us directly:

Contact Us

Exclusive Offer for Lushbinary Readers
WidelAI

One Subscription. Every Flagship AI Model.

Stop juggling multiple AI subscriptions. WidelAI gives you access to Claude, GPT, Gemini, and more - all under a single plan.

Claude Opus & SonnetGPT-5.5 & o3Gemini ProSingle DashboardAPI Access

Use code at checkout for 10% off your subscription:

DeepSeek V4Self-HostingvLLMGPU InfrastructureOpen-Source LLMMIT LicenseModel DeploymentExpert ParallelismQuantizationAI Infrastructure

ContactUs