What inference framework should I use for DeepSeek V4?

vLLM is the recommended framework for DeepSeek V4 deployment. It supports MoE expert parallelism, tensor parallelism, and the hybrid CSA+HCA attention architecture. SGLang is an alternative with good MoE support.

How much does it cost to self-host DeepSeek V4 on AWS?

V4-Flash on a single p5.48xlarge (8x H100) costs approximately $55/hour on-demand in us-east-1, or about $33/hour with a 1-year reserved instance. V4-Pro requires an 8x H200 node (p5e.48xlarge around $40-$50/hr) or two p5.48xlarge nodes. Given DeepSeek's API pricing ($0.14/M input, $0.28/M output), break-even with reserved p5.48xlarge only arrives around 3-4 billion tokens/day, which a single 8x H100 node cannot physically serve, so self-hosting is almost always about data sovereignty or fine-tuning, not cost.

DeepSeek V4 is the most capable model ever released under MIT license. V4-Pro at 1.6T parameters and V4-Flash at 284B both ship with open weights on Hugging Face, meaning you can run frontier-adjacent AI on your own infrastructure with zero vendor lock-in. The question isn't whether you can self-host it, it's whether you should, and what hardware you need.

V4-Flash needs roughly 170-175GB of total VRAM (158GB weights + ~10GB for the full 1M KV cache + overhead) and fits comfortably on 2x H200 or 2x RTX Pro 6000 Blackwell. V4-Pro at ~862GB needs a real GPU cluster. Both require careful deployment planning to get right. This guide covers hardware requirements, vLLM setup, quantization options, expert parallelism, cost analysis, and production deployment patterns.

If you're evaluating self-hosting vs API for DeepSeek V4, this is the guide that gives you the numbers to make that decision.

What This Guide Covers

Why Self-Host DeepSeek V4?
Hardware Requirements: V4-Flash vs V4-Pro
Downloading Weights from Hugging Face
vLLM Deployment Setup
Expert Parallelism & Tensor Parallelism
Quantization Options: FP8, FP4, INT4
1M Context Window Configuration
AWS Deployment: Instance Types & Costs
Self-Host vs API: Break-Even Analysis
Why Lushbinary for LLM Infrastructure

1Why Self-Host DeepSeek V4?

Three reasons make self-hosting V4 compelling:

Data sovereignty: DeepSeek's hosted API routes through Chinese infrastructure. For regulated industries, defense contractors, or teams with strict data residency requirements, self-hosting on your own cloud eliminates this concern entirely.
Cost at scale: At very high token volumes (several billion tokens/day), self-hosting V4-Flash can be cheaper than DeepSeek's API. You pay for GPU hours, not per-token. See the break-even math in section 9 before assuming this applies to you.
Customization: MIT license means you can fine-tune V4 for your domain, modify the inference pipeline, and integrate it into custom toolchains without restrictions.

The trade-off: self-hosting requires GPU infrastructure expertise, ongoing maintenance, and upfront capital. For teams below multi-billion tokens/day, the API is almost always more cost-effective. Run the math in section 9 against your actual traffic before committing.

2Hardware Requirements: V4-Flash vs V4-Pro

Spec	V4-Flash	V4-Pro
Weight Size (FP4+FP8)	~158GB	~862GB
Minimum GPUs	2x H200 or 2x RTX Pro 6000 Blackwell	8x H200 141GB (single node) or 16x H100 80GB (2 nodes)
Recommended GPUs	4x A100 80GB or 2x H200 (power-of-2 for vLLM)	8x H200 or DGX H200
System RAM	256GB+	1TB+
Storage	500GB NVMe	2TB NVMe
Interconnect	NVLink (multi-GPU)	NVLink + InfiniBand

V4-Flash is the Self-Hosting Sweet Spot

Total memory footprint is roughly 170-175GB: ~158GB of FP4+FP8 weights, ~10GB for the full 1M-token KV cache (V4 uses only 7% of V3.2's KV cache), plus a few GB of runtime overhead. That fits on hardware a well-funded startup can afford, and V4-Flash delivers 85-95% of V4-Pro's quality on most tasks. Unless you specifically need V4-Pro's superior agentic coding and knowledge capabilities, V4-Flash is the practical self-hosting choice.

Why Some Guides Say 4x A100 (320GB)

vLLM tensor parallelism works best with power-of-two GPU counts (1, 2, 4, 8). Two A100 80GB only provide 160GB, which is below the ~170GB total you need at full 1M context. The next power of two is four A100s, hence the 4x A100 recommendation. The extra 150GB is headroom, not a requirement. On GPUs with more VRAM per card (H200, RTX Pro 6000 Blackwell), two GPUs are enough.

3Downloading Weights from Hugging Face

Both models are available on Hugging Face under the deepseek-ai organization. The Instruct checkpoints (FP4+FP8 mixed precision) are what you want for production deployment:

# Install huggingface-cli if needed

pip install huggingface_hub

# Download V4-Flash (Instruct, ~158GB)

huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \

--local-dir ./deepseek-v4-flash

# Download V4-Pro (Instruct, ~862GB)

huggingface-cli download deepseek-ai/DeepSeek-V4-Pro \

--local-dir ./deepseek-v4-pro

4vLLM Deployment Setup

vLLM is the recommended inference framework for DeepSeek V4. It supports MoE expert parallelism, the hybrid CSA+HCA attention architecture, and efficient KV cache management for 1M-token contexts.

# Install vLLM with MoE support

pip install vllm>=0.8.0

# Serve V4-Flash on 4x A100 80GB (set TP=2 for 2x H200)

python -m vllm.entrypoints.openai.api_server \

--model ./deepseek-v4-flash \

--tensor-parallel-size 4 \

--max-model-len 131072 \

--trust-remote-code \

--port 8000

The vLLM server exposes an OpenAI-compatible API, so you can point any existing OpenAI SDK client at it by changing the base URL. This makes migration from the DeepSeek hosted API to self-hosted seamless.

5Expert Parallelism & Tensor Parallelism

MoE models like V4 benefit from two types of parallelism:

Tensor Parallelism (TP): Splits individual layers across GPUs. Use this when a single layer's weights don't fit in one GPU's VRAM. Set --tensor-parallel-size to the number of GPUs.
Expert Parallelism (EP): Distributes different expert sub-networks across GPUs. Since MoE only activates a subset of experts per token, EP allows efficient utilization of multi-GPU setups without the communication overhead of TP.

For V4-Flash on 4x A100 or 2x H200, tensor parallelism alone is sufficient. For V4-Pro on 8+ GPUs, a combination of TP and EP gives the best throughput. vLLM handles this automatically when you set the appropriate parallelism flags.

6Quantization Options: FP8, FP4, INT4

DeepSeek ships V4 in two precision formats:

FP8 Mixed (Base checkpoints): Most parameters in FP8 precision. Higher quality, larger memory footprint.
FP4+FP8 Mixed (Instruct checkpoints): MoE expert parameters in FP4, other parameters in FP8. This is the recommended format: it balances quality and memory usage.

For further compression, community quantizations (GGUF, AWQ, GPTQ) will likely appear within days of launch. INT4 quantization can reduce V4-Flash to ~80GB, potentially fitting on 4x RTX 4090 (96GB total), but expect measurable quality degradation, especially on reasoning-heavy tasks.

⚠️ Quantization Trade-offs

The official FP4+FP8 Instruct checkpoints are already aggressively quantized. Further quantization to INT4 will degrade quality, particularly on math, reasoning, and agentic tasks. For production workloads, stick with the official checkpoints unless VRAM constraints leave no alternative.

71M Context Window Configuration

V4's hybrid CSA+HCA attention reduces KV cache to roughly 7% of V3.2's footprint at 1M context. In practice, a full 1M-token context consumes about 10GB of VRAM on top of the 158GB of weights, plus a few GB of runtime overhead. Total: roughly 170-175GB.

For V4-Flash, that fits on 2x H200 (282GB) or 2x RTX Pro 6000 Blackwell (192GB) with plenty of headroom. On A100 80GB, 2 cards give you only 160GB, which is below the budget once you load the full KV cache. Four A100s (320GB) are the next power-of-two tensor parallel size that vLLM supports, which is why most vLLM guides recommend 4x A100 80GB. The headroom is incidental, not required.

Set --max-model-len in vLLM to your desired context length. Start with 131072 (128K) and increase based on your VRAM headroom. Monitor GPU memory usage during inference to find the sweet spot for your hardware.

8AWS Deployment: Instance Types & Costs

Instance	GPUs	VRAM	On-Demand $/hr	Best For
p5.48xlarge	8x H100 80GB	640GB	~$55	V4-Flash (comfortable)
p5e.48xlarge	8x H200 141GB	1128GB	~$40-$50	V4-Pro (single node)
p5en.48xlarge	8x H200 141GB (200 Gbps)	1128GB	~$63	V4-Pro (faster fabric)
2x p5.48xlarge	16x H100 80GB	1280GB	~$110	V4-Pro (multi-node)

With 1-year reserved instances, costs drop roughly 40%. Spot instances can reduce costs further but aren't suitable for production inference due to interruption risk. Pricing varies by region and changes often, so always check the AWS pricing page before budgeting. For most teams, a single p5.48xlarge running V4-Flash is the cost-effective starting point.

9Self-Host vs API: Break-Even Analysis

The honest answer is that pure-cost break-even against DeepSeek's own API is hard to reach. Here is the math for V4-Flash using AWS us-east-1 on-demand pricing:

API cost: $0.14/M input, $0.28/M output (cache miss). Blended 50/50 input/output rate is roughly $0.21/M. At 50M tokens/day, that is about $7-$14/day depending on the mix.
Self-host cost: p5.48xlarge at $55.04/hr on-demand is about $1,321/day. A 1-year reserved instance at roughly 40% off drops that to around $790/day.
Break-even: To match $790/day of API spend at $0.21/M, you need to serve roughly 3.8 billion tokens/day on one reserved p5.48xlarge. A single 8x H100 node cannot physically sustain that throughput for V4-Flash, so the math almost never favors self-hosting on price alone for a single workload.

For all but the largest serving operations, DeepSeek's API is cheaper. Self-hosting is the right call when data sovereignty, regulatory residency, custom fine-tuning, or consistent latency under unpredictable API quotas outweigh raw cost. Those are legitimate reasons, but they are not a cost argument.

10Why Lushbinary for LLM Infrastructure

Lushbinary deploys self-hosted LLMs on AWS for teams that need data sovereignty, custom fine-tuning, or cost optimization at scale. We handle the full infrastructure stack: GPU instance selection, vLLM configuration, auto-scaling, monitoring, and cost optimization.

🚀 Free Consultation

Want to self-host DeepSeek V4 on your own infrastructure? Lushbinary specializes in GPU cloud deployment and LLM inference optimization. We'll help you choose the right hardware, configure vLLM, and get to production, no obligation.

❓ Frequently Asked Questions

How much VRAM do I need to run DeepSeek V4-Flash?

About 170-175GB total: ~158GB for FP4+FP8 weights, ~10GB for the full 1M-token KV cache, plus a few GB of overhead. That fits on 2x H200 (282GB), 2x RTX Pro 6000 Blackwell (192GB), or 4x A100 80GB. vLLM suggests 4x A100 because it prefers power-of-two GPU counts for tensor parallelism, not because 320GB of VRAM is actually required.

Can I run DeepSeek V4-Pro on a single machine?

Only on a high-memory node. V4-Pro at ~862GB does not fit on 8x H100 80GB (640GB total). Use 8x H200 141GB (1,128GB, single node) or two p5.48xlarge nodes (16x H100 80GB, 1,280GB) with NVLink and InfiniBand for multi-node.

What inference framework should I use?

vLLM is recommended. It supports MoE expert parallelism, the hybrid CSA+HCA attention, and efficient KV cache management. SGLang is a solid alternative.

How much does self-hosting cost on AWS?

V4-Flash on p5.48xlarge costs about $55/hour on-demand or ~$33/hour with a 1-year reserved instance in us-east-1. At DeepSeek's API rates ($0.14/M input, $0.28/M output), break-even with reserved instances only arrives around 3-4 billion tokens/day, which a single 8x H100 node cannot physically serve. Self-host for sovereignty or fine-tuning reasons rather than raw cost.

Is DeepSeek V4 really MIT licensed?

Yes. Both V4-Pro and V4-Flash are MIT licensed on Hugging Face. You can use, modify, fine-tune, and commercially deploy without restrictions.

Sources

Content was rephrased for compliance with licensing restrictions. Hardware specs and pricing sourced from official model cards and AWS pricing pages as of April 24, 2026. Pricing may change, always verify on vendor websites.

Deploy DeepSeek V4 on Your Infrastructure

Lushbinary handles GPU deployment, vLLM configuration, and production optimization for self-hosted LLMs.

Ready to Build Something Great?

Q: How much VRAM do I need to run DeepSeek V4-Flash?

V4-Flash in FP4+FP8 mixed precision is approximately 158GB of weights. Add roughly 10GB for the full 1M-token KV cache (V4 uses only 7% of V3.2's KV cache) plus a few GB of overhead, so about 170-175GB total. It fits on 2x H200 (282GB), 2x RTX Pro 6000 Blackwell (192GB), or 4x A100 80GB (320GB). vLLM recommends 4x A100 because it prefers power-of-two GPU counts, not because you actually need 320GB.

Q: Can I run DeepSeek V4-Pro on a single machine?

Only on a high-memory node. V4-Pro at ~862GB in FP4+FP8 does not fit on 8x H100 80GB (640GB total). The minimum single-node configuration is 8x H200 141GB (1,128GB total), which accommodates weights plus KV cache at long context. Alternatives are two p5.48xlarge nodes (16x H100 80GB, 1,280GB) or a DGX H200. 8x H100 80GB is only viable if you shard V4-Pro across multiple nodes.

Q: Is DeepSeek V4 really MIT licensed?

Yes. Both V4-Pro and V4-Flash are released under the MIT license on Hugging Face and ModelScope. You can use, modify, fine-tune, and commercially deploy the weights without restrictions. This is the most permissive license available for a frontier-class model.

Get a free 30-minute strategy call. We'll map out your project, timeline, and tech stack - no strings attached.

Let's Talk About Your Project

Prefer email? Reach us directly:

connect@lushbinary.com

Self-Hosting DeepSeek V4: vLLM Setup, Hardware Requirements & Deployment Guide

1Why Self-Host DeepSeek V4?

2Hardware Requirements: V4-Flash vs V4-Pro

3Downloading Weights from Hugging Face

4vLLM Deployment Setup

5Expert Parallelism & Tensor Parallelism

6Quantization Options: FP8, FP4, INT4

71M Context Window Configuration

8AWS Deployment: Instance Types & Costs

9Self-Host vs API: Break-Even Analysis

10Why Lushbinary for LLM Infrastructure

❓ Frequently Asked Questions

How much VRAM do I need to run DeepSeek V4-Flash?

Can I run DeepSeek V4-Pro on a single machine?

What inference framework should I use?

How much does self-hosting cost on AWS?

Is DeepSeek V4 really MIT licensed?

Sources

Deploy DeepSeek V4 on Your Infrastructure

Ready to Build Something Great?

Contact Us

Ship Better Engineering, Every Week

One Subscription. Every Flagship AI Model.

More from the Blog

Build a Food Delivery App Like DoorDash: 2026 MVP Guide

Build an Online Course Platform Like Teachable: MVP Guide

ContactUs

Our Address

Phone

Email