Cloud & DevOps · April 24, 2026 · 15 min read

Self-Hosting DeepSeek V4: vLLM Setup, Hardware Requirements & Deployment Guide

DeepSeek V4 ships under MIT license with open weights. We cover hardware requirements for V4-Pro (862GB) and V4-Flash (158GB), vLLM deployment, quantization options, expert parallelism, and cost analysis for self-hosted inference.

Lushbinary Team

Cloud & DevOps Solutions

DeepSeek V4 is the most capable model ever released under MIT license. V4-Pro at 1.6T parameters and V4-Flash at 284B both ship with open weights on Hugging Face, meaning you can run frontier-adjacent AI on your own infrastructure with zero vendor lock-in. The question isn't whether you can self-host it — it's whether you should, and what hardware you need.

V4-Flash at ~158GB in FP4+FP8 mixed precision fits on a single H200 node. V4-Pro at ~862GB needs a real GPU cluster. Both require careful deployment planning to get right. This guide covers hardware requirements, vLLM setup, quantization options, expert parallelism, cost analysis, and production deployment patterns.

If you're evaluating self-hosting vs API for DeepSeek V4, this is the guide that gives you the numbers to make that decision.

What This Guide Covers

  1. Why Self-Host DeepSeek V4?
  2. Hardware Requirements: V4-Flash vs V4-Pro
  3. Downloading Weights from Hugging Face
  4. vLLM Deployment Setup
  5. Expert Parallelism & Tensor Parallelism
  6. Quantization Options: FP8, FP4, INT4
  7. 1M Context Window Configuration
  8. AWS Deployment: Instance Types & Costs
  9. Self-Host vs API: Break-Even Analysis
  10. Why Lushbinary for LLM Infrastructure

1. Why Self-Host DeepSeek V4?

Three reasons make self-hosting V4 compelling:

  • Data sovereignty: DeepSeek's hosted API routes through Chinese infrastructure. For regulated industries, defense contractors, or teams with strict data residency requirements, self-hosting on your own cloud eliminates this concern entirely.
  • Cost at scale: with a fully utilized GPU cluster pushing billions of tokens per day, self-hosting V4-Flash can undercut even DeepSeek's already-low API prices. You pay for GPU hours, not per-token.
  • Customization: MIT license means you can fine-tune V4 for your domain, modify the inference pipeline, and integrate it into custom toolchains without restrictions.

The trade-off: self-hosting requires GPU infrastructure expertise, ongoing maintenance, and upfront capital. For teams processing fewer than 10M tokens/day, the API is almost certainly more cost-effective.

2. Hardware Requirements: V4-Flash vs V4-Pro

| Spec | V4-Flash | V4-Pro |
| --- | --- | --- |
| Weight Size (FP4+FP8) | ~158GB | ~862GB |
| Minimum GPUs | 1x H200 or 2x A100 80GB | 8x H100 80GB |
| Recommended GPUs | 2x H200 or 4x A100 80GB | 8x H200 or DGX H100 |
| System RAM | 256GB+ | 1TB+ |
| Storage | 500GB NVMe | 2TB NVMe |
| Interconnect | NVLink (multi-GPU) | NVLink + InfiniBand |

V4-Flash is the Self-Hosting Sweet Spot

At 158GB, V4-Flash fits on hardware that a well-funded startup can afford. It delivers 85–95% of V4-Pro's quality on most tasks. Unless you specifically need V4-Pro's superior agentic coding and knowledge capabilities, V4-Flash is the practical self-hosting choice.

3. Downloading Weights from Hugging Face

Both models are available on Hugging Face under the deepseek-ai organization. The Instruct checkpoints (FP4+FP8 mixed precision) are what you want for production deployment:

```shell
# Install huggingface-cli if needed
pip install huggingface_hub

# Download V4-Flash (Instruct, ~158GB)
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir ./deepseek-v4-flash

# Download V4-Pro (Instruct, ~862GB)
huggingface-cli download deepseek-ai/DeepSeek-V4-Pro \
  --local-dir ./deepseek-v4-pro
```

4. vLLM Deployment Setup

vLLM is the recommended inference framework for DeepSeek V4. It supports MoE expert parallelism, the hybrid CSA+HCA attention architecture, and efficient KV cache management for 1M-token contexts.

```shell
# Install vLLM with MoE support (quote the spec so ">=" isn't treated as a redirect)
pip install "vllm>=0.8.0"

# Serve V4-Flash on 2x A100 80GB
python -m vllm.entrypoints.openai.api_server \
  --model ./deepseek-v4-flash \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --trust-remote-code \
  --port 8000
```

The vLLM server exposes an OpenAI-compatible API, so you can point any existing OpenAI SDK client at it by changing the base URL. This makes migration from the DeepSeek hosted API to self-hosted seamless.
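Because the server speaks the OpenAI wire format, any HTTP client works. Here is a minimal stdlib-only sketch that builds a chat-completions request against a local vLLM server; the port and model path are the ones from the launch command above, and by default vLLM serves the model under its `--model` path unless you override it with `--served-model-name`.

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style /v1/chat/completions request for a vLLM server."""
    payload = {
        "model": model,  # vLLM defaults the served name to the --model path
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.2,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_chat_request("http://localhost:8000", "./deepseek-v4-flash", "Hello")
    print(req.full_url)  # http://localhost:8000/v1/chat/completions
    # Uncomment once the server is up:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request works unchanged through the official OpenAI SDK by pointing `base_url` at the server, which is what makes the API-to-self-hosted migration a one-line change.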

5. Expert Parallelism & Tensor Parallelism

MoE models like V4 benefit from two types of parallelism:

  • Tensor Parallelism (TP): Splits individual layers across GPUs. Use this when a single layer's weights don't fit in one GPU's VRAM. Set --tensor-parallel-size to the number of GPUs.
  • Expert Parallelism (EP): Distributes different expert sub-networks across GPUs. Since MoE only activates a subset of experts per token, EP allows efficient utilization of multi-GPU setups without the communication overhead of TP.

For V4-Flash on 2 GPUs, tensor parallelism is sufficient. For V4-Pro on 8+ GPUs, a combination of TP and EP gives the best throughput. vLLM handles this automatically when you set the appropriate parallelism flags.
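As a sketch, a V4-Pro launch on a single 8-GPU node combining both forms of parallelism might look like the following. Flag names follow recent vLLM releases; verify against `python -m vllm.entrypoints.openai.api_server --help` on your installed version, and treat the model path as an assumption matching the download step above.

```shell
# Sketch: V4-Pro on 8x H200, tensor parallelism across GPUs plus
# expert parallelism for the MoE layers.
python -m vllm.entrypoints.openai.api_server \
  --model ./deepseek-v4-pro \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --trust-remote-code \
  --port 8000
```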

6. Quantization Options: FP8, FP4, INT4

DeepSeek ships V4 in two precision formats:

  • FP8 Mixed (Base checkpoints): Most parameters in FP8 precision. Higher quality, larger memory footprint.
  • FP4+FP8 Mixed (Instruct checkpoints): MoE expert parameters in FP4, other parameters in FP8. This is the recommended format — it balances quality and memory usage.

For further compression, community quantizations (GGUF, AWQ, GPTQ) will likely appear within days of launch. Aggressive low-bit quantization (roughly 2–3 effective bits per parameter; uniform 4-bit of 284B parameters still lands around 142GB) can shrink V4-Flash to ~80GB, potentially fitting on 4x RTX 4090 (96GB total), but expect measurable quality degradation, especially on reasoning-heavy tasks.

⚠️ Quantization Trade-offs

The official FP4+FP8 Instruct checkpoints are already aggressively quantized. Further quantization to INT4 will degrade quality, particularly on math, reasoning, and agentic tasks. For production workloads, stick with the official checkpoints unless VRAM constraints leave no alternative.
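A quick way to sanity-check these footprints is parameters times bits per parameter. The estimator below is a back-of-envelope sketch: it ignores embeddings kept at higher precision, quantization scales, and runtime overhead (KV cache, activations, CUDA context), so treat its output as a lower bound on required VRAM. The 284B and 158GB figures are this article's V4-Flash numbers.

```python
def weight_size_gb(params_billion: float, bits_per_param: float) -> float:
    """Rough checkpoint size: parameter count x bits, in decimal gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# The ~158GB official V4-Flash checkpoint (284B params) works out to about
# 4.45 effective bits/param across the FP4 experts and FP8 remainder:
print(f"{158 / weight_size_gb(284, 1):.2f} effective bits/param")

# Uniform 4-bit would still be ~142GB, so an ~80GB community build implies
# roughly 2.25 effective bits/param:
print(f"{weight_size_gb(284, 4):.0f} GB at uniform 4-bit")
print(f"{80 / weight_size_gb(284, 1):.2f} bits/param implied by an ~80GB build")
```

This is why sub-4-bit community builds trade quality for memory so sharply: the official checkpoint has already spent most of the easy compression budget.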

7. 1M Context Window Configuration

V4's hybrid CSA+HCA attention reduces KV cache to 10% of V3.2's footprint at 1M context. This makes long-context inference practical, but you still need to allocate sufficient memory for the KV cache.

For V4-Flash on 2x A100 80GB, a practical maximum context length is 128K–256K tokens. To use the full 1M context, you need 4x A100 80GB or 2x H200 to accommodate the KV cache alongside the model weights.

Set --max-model-len in vLLM to your desired context length. Start with 131072 (128K) and increase based on your VRAM headroom. Monitor GPU memory usage during inference to find the sweet spot for your hardware.
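For intuition about where that VRAM headroom goes, the standard dense-attention KV cache formula is 2 (K and V) x layers x KV heads x head dim x bytes per element, per token. The layer and head counts below are illustrative placeholders, not published V4 architecture details, and V4's hybrid CSA+HCA attention caches far less than a dense estimate suggests (the article cites ~10% of V3.2's footprint).

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: float = 1.0) -> float:
    """Dense-attention KV cache estimate in decimal GB.

    2x covers the separate K and V tensors; bytes_per_elem=1.0 assumes an
    FP8 cache. Hybrid/sparse attention schemes cache much less than this.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token / 1e9

# Placeholder architecture: 64 layers, 8 KV heads, head_dim 128, FP8 cache.
print(f"{kv_cache_gb(1_000_000, 64, 8, 128):.0f} GB for 1M tokens")   # 131 GB
print(f"{kv_cache_gb(131_072, 64, 8, 128):.0f} GB for 128K tokens")   # 17 GB
```

Even at a tenth of the dense figure, a 1M-token cache is tens of gigabytes on top of the weights, which is why the full context length needs the larger GPU configurations listed above.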

8. AWS Deployment: Instance Types & Costs

| Instance | GPUs | VRAM | On-Demand $/hr | Best For |
| --- | --- | --- | --- | --- |
| p5.48xlarge | 8x H100 80GB | 640GB | ~$98 | V4-Flash (comfortable) |
| p5e.48xlarge | 8x H200 141GB | 1128GB | ~$120 | V4-Pro (single node) |
| 2x p5.48xlarge | 16x H100 80GB | 1280GB | ~$196 | V4-Pro (comfortable) |

With 1-year reserved instances, costs drop roughly 40%. Spot instances can reduce costs further but aren't suitable for production inference due to interruption risk. For most teams, a single p5.48xlarge running V4-Flash is the cost-effective starting point.

9. Self-Host vs API: Break-Even Analysis

The break-even point depends on your daily token volume. Here's the math for V4-Flash:

  • API cost: $0.14/M input + $0.28/M output (cache-miss rates). At 50M tokens/day split evenly between input and output, that's roughly $10–$11/day.
  • Self-host cost: p5.48xlarge at ~$98/hr on-demand = $2,352/day. With 1-year reserved instances: ~$1,400/day.
  • Break-even: at a blended ~$0.21/M, $1,400/day of reserved GPU spend equals roughly 6–7 billion API tokens per day. On raw per-token cost the API wins for almost everyone; self-hosting pencils out only with sustained near-full cluster utilization across many workloads, or when data sovereignty requirements make the API a non-option regardless of cost.

For most startups and mid-size teams, the DeepSeek API is more cost-effective. Self-hosting makes sense for enterprises with high token volumes, strict data residency requirements, or the need to fine-tune the model for specialized domains.
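To run the break-even math for your own volumes, here is a small sketch. Prices and instance costs come from the figures above; the blended $/M price assumes an even input/output split, which you should adjust to your real traffic mix.

```python
def api_cost_per_day(tokens_m_in: float, tokens_m_out: float,
                     in_price: float = 0.14, out_price: float = 0.28) -> float:
    """Daily API spend in USD; token volumes in millions, prices in $/M."""
    return tokens_m_in * in_price + tokens_m_out * out_price

def breakeven_tokens_m_per_day(gpu_cost_per_day: float,
                               blended_price: float = 0.21) -> float:
    """Daily token volume (millions) at which API spend equals GPU spend."""
    return gpu_cost_per_day / blended_price

# p5.48xlarge on-demand, with an assumed ~40% 1-year reserved discount:
reserved_day = 98 * 24 * 0.6
print(f"${reserved_day:,.0f}/day reserved")
print(f"{breakeven_tokens_m_per_day(reserved_day):,.0f}M tokens/day to break even")
```

Swapping in your own discount, instance count, and traffic mix turns this into a quick sensitivity check before committing to reservations.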

10. Why Lushbinary for LLM Infrastructure

Lushbinary deploys self-hosted LLMs on AWS for teams that need data sovereignty, custom fine-tuning, or cost optimization at scale. We handle the full infrastructure stack: GPU instance selection, vLLM configuration, auto-scaling, monitoring, and cost optimization.

🚀 Free Consultation

Want to self-host DeepSeek V4 on your own infrastructure? Lushbinary specializes in GPU cloud deployment and LLM inference optimization. We'll help you choose the right hardware, configure vLLM, and get to production — no obligation.

❓ Frequently Asked Questions

How much VRAM do I need to run DeepSeek V4-Flash?

V4-Flash in FP4+FP8 is approximately 158GB. It fits on a single H200 (141GB HBM3e) or 2x A100 80GB. With aggressive sub-4-bit quantization, it can potentially fit on 4x RTX 4090, but with quality trade-offs.

Can I run DeepSeek V4-Pro on a single machine?

No. V4-Pro at 862GB requires minimum 8x H100 80GB with NVLink. A DGX H100 node or 8x H200 setup is recommended.

What inference framework should I use?

vLLM is recommended. It supports MoE expert parallelism, the hybrid CSA+HCA attention, and efficient KV cache management. SGLang is a solid alternative.

How much does self-hosting cost on AWS?

V4-Flash on p5.48xlarge costs ~$98/hour on-demand or ~$60/hour reserved. Against DeepSeek's API prices (blended ~$0.21/M tokens), pure cost break-even sits in the billions of tokens per day at sustained utilization; most teams self-host for data sovereignty or fine-tuning flexibility rather than savings.

Is DeepSeek V4 really MIT licensed?

Yes. Both V4-Pro and V4-Flash are MIT licensed on Hugging Face. You can use, modify, fine-tune, and commercially deploy without restrictions.

Sources

Content was rephrased for compliance with licensing restrictions. Hardware specs and pricing sourced from official model cards and AWS pricing pages as of April 24, 2026. Pricing may change — always verify on vendor websites.

Deploy DeepSeek V4 on Your Infrastructure

Lushbinary handles GPU deployment, vLLM configuration, and production optimization for self-hosted LLMs.

Ready to Build Something Great?

Get a free 30-minute strategy call. We'll map out your project, timeline, and tech stack — no strings attached.

Let's Talk About Your Project

Contact Us

DeepSeek V4 · Self-Hosting · vLLM · GPU Infrastructure · Open-Source LLM · MIT License · Model Deployment · Expert Parallelism · Quantization · AI Infrastructure