Cloud & DevOps · April 24, 2026 · 15 min read

Self-Hosting DeepSeek V4: vLLM Setup, Hardware Requirements & Deployment Guide

DeepSeek V4 ships under MIT license with open weights. We cover hardware requirements for V4-Pro (862GB) and V4-Flash (158GB), vLLM deployment, quantization options, expert parallelism, and cost analysis for self-hosted inference.

Lushbinary Team

Cloud & DevOps Solutions

DeepSeek V4 is the most capable model ever released under MIT license. V4-Pro at 1.6T parameters and V4-Flash at 284B both ship with open weights on Hugging Face, meaning you can run frontier-adjacent AI on your own infrastructure with zero vendor lock-in. The question isn't whether you can self-host it — it's whether you should, and what hardware you need.

V4-Flash at ~158GB in FP4+FP8 mixed precision fits on a single H200 node. V4-Pro at ~862GB needs a real GPU cluster. Both require careful deployment planning to get right. This guide covers hardware requirements, vLLM setup, quantization options, expert parallelism, cost analysis, and production deployment patterns.

If you're evaluating self-hosting vs API for DeepSeek V4, this is the guide that gives you the numbers to make that decision.

What This Guide Covers

  1. Why Self-Host DeepSeek V4?
  2. Hardware Requirements: V4-Flash vs V4-Pro
  3. Downloading Weights from Hugging Face
  4. vLLM Deployment Setup
  5. Expert Parallelism & Tensor Parallelism
  6. Quantization Options: FP8, FP4, INT4
  7. 1M Context Window Configuration
  8. AWS Deployment: Instance Types & Costs
  9. Self-Host vs API: Break-Even Analysis
  10. Why Lushbinary for LLM Infrastructure

1. Why Self-Host DeepSeek V4?

Three reasons make self-hosting V4 compelling:

  • Data sovereignty: DeepSeek's hosted API routes through Chinese infrastructure. For regulated industries, defense contractors, or teams with strict data residency requirements, self-hosting on your own cloud eliminates this concern entirely.
  • Cost at scale: with a fully utilized GPU cluster pushing billions of tokens per day, self-hosting V4-Flash can undercut even DeepSeek's already-low API prices. You pay for GPU hours, not per-token.
  • Customization: MIT license means you can fine-tune V4 for your domain, modify the inference pipeline, and integrate it into custom toolchains without restrictions.

The trade-off: self-hosting requires GPU infrastructure expertise, ongoing maintenance, and upfront capital. For teams processing fewer than 10M tokens/day, the API is almost certainly more cost-effective.

2. Hardware Requirements: V4-Flash vs V4-Pro

| Spec | V4-Flash | V4-Pro |
| --- | --- | --- |
| Weight Size (FP4+FP8) | ~158GB | ~862GB |
| Minimum GPUs | 1x H200 or 2x A100 80GB | 8x H100 80GB |
| Recommended GPUs | 2x H200 or 4x A100 80GB | 8x H200 or DGX H100 |
| System RAM | 256GB+ | 1TB+ |
| Storage | 500GB NVMe | 2TB NVMe |
| Interconnect | NVLink (multi-GPU) | NVLink + InfiniBand |

V4-Flash is the Self-Hosting Sweet Spot

At 158GB, V4-Flash fits on hardware that a well-funded startup can afford. It delivers 85–95% of V4-Pro's quality on most tasks. Unless you specifically need V4-Pro's superior agentic coding and knowledge capabilities, V4-Flash is the practical self-hosting choice.

3. Downloading Weights from Hugging Face

Both models are available on Hugging Face under the deepseek-ai organization. The Instruct checkpoints (FP4+FP8 mixed precision) are what you want for production deployment:

```shell
# Install huggingface-cli if needed
pip install huggingface_hub

# Download V4-Flash (Instruct, ~158GB)
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir ./deepseek-v4-flash

# Download V4-Pro (Instruct, ~862GB)
huggingface-cli download deepseek-ai/DeepSeek-V4-Pro \
  --local-dir ./deepseek-v4-pro
```

4. vLLM Deployment Setup

vLLM is the recommended inference framework for DeepSeek V4. It supports MoE expert parallelism, the hybrid CSA+HCA attention architecture, and efficient KV cache management for 1M-token contexts.

```shell
# Install vLLM with MoE support (quote the spec so ">=" isn't treated as a redirect)
pip install "vllm>=0.8.0"

# Serve V4-Flash on 2x A100 80GB
python -m vllm.entrypoints.openai.api_server \
  --model ./deepseek-v4-flash \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --trust-remote-code \
  --port 8000
```

The vLLM server exposes an OpenAI-compatible API, so you can point any existing OpenAI SDK client at it by changing the base URL. This makes migration from the DeepSeek hosted API to self-hosted seamless.
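Because the server speaks the OpenAI wire format, any HTTP client works. Here is a minimal stdlib-only sketch that builds a chat-completions request against a local vLLM server; the port and model path are the ones from the launch command above, and by default vLLM serves the model under its `--model` path unless you override it with `--served-model-name`.

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style /v1/chat/completions request for a vLLM server."""
    payload = {
        "model": model,  # vLLM defaults the served name to the --model path
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.2,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_chat_request("http://localhost:8000", "./deepseek-v4-flash", "Hello")
    print(req.full_url)  # http://localhost:8000/v1/chat/completions
    # Uncomment once the server is up:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request works unchanged through the official OpenAI SDK by pointing `base_url` at the server, which is what makes the API-to-self-hosted migration a one-line change.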

5. Expert Parallelism & Tensor Parallelism

MoE models like V4 benefit from two types of parallelism:

  • Tensor Parallelism (TP): Splits individual layers across GPUs. Use this when a single layer's weights don't fit in one GPU's VRAM. Set --tensor-parallel-size to the number of GPUs.
  • Expert Parallelism (EP): Distributes different expert sub-networks across GPUs. Since MoE only activates a subset of experts per token, EP allows efficient utilization of multi-GPU setups without the communication overhead of TP.

For V4-Flash on 2 GPUs, tensor parallelism is sufficient. For V4-Pro on 8+ GPUs, a combination of TP and EP gives the best throughput. vLLM handles this automatically when you set the appropriate parallelism flags.
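As a sketch, a V4-Pro launch on a single 8-GPU node combining both forms of parallelism might look like the following. Flag names follow recent vLLM releases; verify against `python -m vllm.entrypoints.openai.api_server --help` on your installed version, and treat the model path as an assumption matching the download step above.

```shell
# Sketch: V4-Pro on 8x H200, tensor parallelism across GPUs plus
# expert parallelism for the MoE layers.
python -m vllm.entrypoints.openai.api_server \
  --model ./deepseek-v4-pro \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --trust-remote-code \
  --port 8000
```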

6. Quantization Options: FP8, FP4, INT4

DeepSeek ships V4 in two precision formats:

  • FP8 Mixed (Base checkpoints): Most parameters in FP8 precision. Higher quality, larger memory footprint.
  • FP4+FP8 Mixed (Instruct checkpoints): MoE expert parameters in FP4, other parameters in FP8. This is the recommended format — it balances quality and memory usage.

For further compression, community quantizations (GGUF, AWQ, GPTQ) will likely appear within days of launch. Aggressive low-bit quantization (roughly 2–3 effective bits per parameter; uniform 4-bit of 284B parameters still lands around 142GB) can shrink V4-Flash to ~80GB, potentially fitting on 4x RTX 4090 (96GB total), but expect measurable quality degradation, especially on reasoning-heavy tasks.

⚠️ Quantization Trade-offs

The official FP4+FP8 Instruct checkpoints are already aggressively quantized. Further quantization to INT4 will degrade quality, particularly on math, reasoning, and agentic tasks. For production workloads, stick with the official checkpoints unless VRAM constraints leave no alternative.
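A quick way to sanity-check these footprints is parameters times bits per parameter. The estimator below is a back-of-envelope sketch: it ignores embeddings kept at higher precision, quantization scales, and runtime overhead (KV cache, activations, CUDA context), so treat its output as a lower bound on required VRAM. The 284B and 158GB figures are this article's V4-Flash numbers.

```python
def weight_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Rough checkpoint size: parameter count x bits, in decimal gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# The ~158GB official V4-Flash checkpoint (284B params) works out to about
# 4.45 effective bits/param across the FP4 experts and FP8 remainder:
print(f"{158 / weight_size_gb(284, 1):.2f} effective bits/param")

# Uniform 4-bit would still be ~142GB, so an ~80GB community build implies
# roughly 2.25 effective bits/param:
print(f"{weight_size_gb(284, 4):.0f} GB at uniform 4-bit")
print(f"{80 / weight_size_gb(284, 1):.2f} bits/param implied by an ~80GB build")
```

This is why sub-4-bit community builds trade quality for memory so sharply: the official checkpoint has already spent most of the easy compression budget.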

7. 1M Context Window Configuration

V4's hybrid CSA+HCA attention reduces KV cache to 10% of V3.2's footprint at 1M context. This makes long-context inference practical, but you still need to allocate sufficient memory for the KV cache.

For V4-Flash on 2x A100 80GB, a practical maximum context length is 128K–256K tokens. To use the full 1M context, you need 4x A100 80GB or 2x H200 to accommodate the KV cache alongside the model weights.

Set --max-model-len in vLLM to your desired context length. Start with 131072 (128K) and increase based on your VRAM headroom. Monitor GPU memory usage during inference to find the sweet spot for your hardware.
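For intuition about where that VRAM headroom goes, the standard dense-attention KV cache formula is 2 (K and V) x layers x KV heads x head dim x bytes per element, per token. The layer and head counts below are illustrative placeholders, not published V4 architecture details, and V4's hybrid CSA+HCA attention caches far less than a dense estimate suggests (the article cites ~10% of V3.2's footprint).

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: float = 1.0) -> float:
    """Dense-attention KV cache estimate in decimal GB.

    2x covers the separate K and V tensors; bytes_per_elem=1.0 assumes an
    FP8 cache. Hybrid/sparse attention schemes cache much less than this.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token / 1e9

# Placeholder architecture: 64 layers, 8 KV heads, head_dim 128, FP8 cache.
print(f"{kv_cache_gb(1_000_000, 64, 8, 128):.0f} GB for 1M tokens")   # 131 GB
print(f"{kv_cache_gb(131_072, 64, 8, 128):.0f} GB for 128K tokens")   # 17 GB
```

Even at a tenth of the dense figure, a 1M-token cache is tens of gigabytes on top of the weights, which is why the full context length needs the larger GPU configurations listed above.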

8. AWS Deployment: Instance Types & Costs

| Instance | GPUs | VRAM | On-Demand $/hr | Best For |
| --- | --- | --- | --- | --- |
| p5.48xlarge | 8x H100 80GB | 640GB | ~$98 | V4-Flash (comfortable) |
| p5e.48xlarge | 8x H200 141GB | 1128GB | ~$120 | V4-Pro (single node) |
| 2x p5.48xlarge | 16x H100 80GB | 1280GB | ~$196 | V4-Pro (comfortable) |

With 1-year reserved instances, costs drop roughly 40%. Spot instances can reduce costs further but aren't suitable for production inference due to interruption risk. For most teams, a single p5.48xlarge running V4-Flash is the cost-effective starting point.

9. Self-Host vs API: Break-Even Analysis

The break-even point depends on your daily token volume. Here's the math for V4-Flash:

  • API cost: $0.14/M input + $0.28/M output (cache-miss rates). At 50M tokens/day split evenly between input and output, that's roughly $10–$11/day.
  • Self-host cost: p5.48xlarge at ~$98/hr on-demand = $2,352/day. With 1-year reserved instances: ~$1,400/day.
  • Break-even: at a blended ~$0.21/M, $1,400/day of reserved GPU spend equals roughly 6–7 billion API tokens per day. On raw per-token cost the API wins for almost everyone; self-hosting pencils out only with sustained near-full cluster utilization across many workloads, or when data sovereignty requirements make the API a non-option regardless of cost.

For most startups and mid-size teams, the DeepSeek API is more cost-effective. Self-hosting makes sense for enterprises with high token volumes, strict data residency requirements, or the need to fine-tune the model for specialized domains.
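To run the break-even math for your own volumes, here is a small sketch. Prices and instance costs come from the figures above; the blended $/M price assumes an even input/output split, which you should adjust to your real traffic mix.

```python
def api_cost_per_day(tokens_m_in: float, tokens_m_out: float,
                     in_price: float = 0.14, out_price: float = 0.28) -> float:
    """Daily API spend in USD; token volumes in millions, prices in $/M."""
    return tokens_m_in * in_price + tokens_m_out * out_price

def breakeven_tokens_m_per_day(gpu_cost_per_day: float,
                               blended_price: float = 0.21) -> float:
    """Daily token volume (millions) at which API spend equals GPU spend."""
    return gpu_cost_per_day / blended_price

# p5.48xlarge on-demand, with an assumed ~40% 1-year reserved discount:
reserved_day = 98 * 24 * 0.6
print(f"${reserved_day:,.0f}/day reserved")
print(f"{breakeven_tokens_m_per_day(reserved_day):,.0f}M tokens/day to break even")
```

Swapping in your own discount, instance count, and traffic mix turns this into a quick sensitivity check before committing to reservations.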

10. Why Lushbinary for LLM Infrastructure

Lushbinary deploys self-hosted LLMs on AWS for teams that need data sovereignty, custom fine-tuning, or cost optimization at scale. We handle the full infrastructure stack: GPU instance selection, vLLM configuration, auto-scaling, monitoring, and cost optimization.

🚀 Free Consultation

Want to self-host DeepSeek V4 on your own infrastructure? Lushbinary specializes in GPU cloud deployment and LLM inference optimization. We'll help you choose the right hardware, configure vLLM, and get to production — no obligation.

❓ Frequently Asked Questions

How much VRAM do I need to run DeepSeek V4-Flash?

V4-Flash in FP4+FP8 is approximately 158GB. It fits on a single H200 (141GB HBM3e) or 2x A100 80GB. With aggressive sub-4-bit quantization, it can potentially fit on 4x RTX 4090, but with quality trade-offs.

Can I run DeepSeek V4-Pro on a single machine?

No. V4-Pro at 862GB requires minimum 8x H100 80GB with NVLink. A DGX H100 node or 8x H200 setup is recommended.

What inference framework should I use?

vLLM is recommended. It supports MoE expert parallelism, the hybrid CSA+HCA attention, and efficient KV cache management. SGLang is a solid alternative.

How much does self-hosting cost on AWS?

V4-Flash on p5.48xlarge costs ~$98/hour on-demand or ~$60/hour reserved. Against DeepSeek's API prices (blended ~$0.21/M tokens), pure cost break-even sits in the billions of tokens per day at sustained utilization; most teams self-host for data sovereignty or fine-tuning flexibility rather than savings.

Is DeepSeek V4 really MIT licensed?

Yes. Both V4-Pro and V4-Flash are MIT licensed on Hugging Face. You can use, modify, fine-tune, and commercially deploy without restrictions.

Sources

Content was rephrased for compliance with licensing restrictions. Hardware specs and pricing sourced from official model cards and AWS pricing pages as of April 24, 2026. Pricing may change — always verify on vendor websites.

Deploy DeepSeek V4 on Your Infrastructure

Lushbinary handles GPU deployment, vLLM configuration, and production optimization for self-hosted LLMs.

Ready to Build Something Great?

Get a free 30-minute strategy call. We'll map out your project, timeline, and tech stack — no strings attached.

Let's Talk About Your Project

Contact Us

DeepSeek V4 · Self-Hosting · vLLM · GPU Infrastructure · Open-Source LLM · MIT License · Model Deployment · Expert Parallelism · Quantization · AI Infrastructure