Open Source · v0.9.0 · Rust + Python

NVIDIA Dynamo

The operating system for distributed AI inference. A technical deep-dive with interactive simulations and real-world analogies so you actually get it.

The Problem

Why can't you just throw a model on one GPU?

Modern LLMs like DeepSeek-R1 (671B parameters) or Llama-3-70B don't fit on a single GPU. Even an 80GB H100 can only hold ~40B parameters in FP16. So you must split the model across multiple GPUs (tensor parallelism), sometimes across multiple machines (pipeline parallelism).
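
A back-of-the-envelope sketch makes the sizing concrete (weights only; KV cache and activations push the real requirement higher):

python
import math

BYTES_PER_PARAM_FP16 = 2    # two bytes per parameter in FP16
H100_MEMORY_GB = 80         # HBM per H100

def min_gpus_for_weights(params_billions: float) -> int:
    """Minimum GPUs needed just to hold the weights (no KV cache, no activations)."""
    weights_gb = params_billions * BYTES_PER_PARAM_FP16
    return math.ceil(weights_gb / H100_MEMORY_GB)

for name, size_b in [("Llama-3-70B", 70), ("DeepSeek-R1", 671)]:
    print(f"{name}: ~{size_b * BYTES_PER_PARAM_FP16} GB of weights "
          f"-> at least {min_gpus_for_weights(size_b)}x H100 80GB")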

But splitting creates problems:

Imagine a restaurant kitchen that used to have one chef doing everything. Now you have 8 chefs, each handling a different part of the dish. Suddenly you need: a coordinator to route orders, a system to pass half-finished plates between chefs, a scheduler to make sure no one's idle while others are slammed, and a way to cache prep work so you don't re-chop the same onions for every order. That's Dynamo.

Without Dynamo

Every request recomputes from scratch. Prefill and decode compete for the same GPU. No cache reuse. Manual scaling. Wasted compute everywhere.

With Dynamo

Prefill and decode run on separate optimized GPUs. KV cache is reused across requests. Smart routing minimizes recomputation. Auto-scaling matches demand.

15x MoE throughput boost (GB200 + Dynamo vs Hopper)
3x faster time to first token (KV-aware routing)
12x TTFT improvement (KV cache offloading)
2x throughput per GPU (disaggregated serving)
Architecture

How a request flows through Dynamo

Every request goes through these stages. Click the simulation button below to watch a request travel through the pipeline.

Complete Request Lifecycle
User (HTTP client) → Frontend (OpenAI API) → Router (KV-aware) → Prefill GPU (context encoding) → NIXL (KV transfer) → Decode GPU (token generation) → Response (streamed tokens)

1. Frontend — the Rust HTTP server

Receives your /v1/chat/completions request. Handles tokenization, applies chat templates, validates the request, and forwards it to the Router. It's OpenAI-compatible — any OpenAI SDK works.

Technical detail: Written entirely in Rust for zero-copy performance. Exposes /v1/chat/completions, /v1/embeddings, /v1/models, /metrics (Prometheus), and /openapi.json.
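
Because the API is OpenAI-compatible, a stock OpenAI SDK pointed at the Dynamo frontend should just work. A minimal sketch (assumes the Quick Start setup below, with the frontend on localhost:8000 serving Qwen/Qwen3-0.6B):

python
# Minimal sketch: talking to a Dynamo frontend with the standard OpenAI Python SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",                      # whatever model the worker is serving
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    max_tokens=200,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)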

2. Smart Router — the brain

Decides which GPU worker handles your request. This is not random load balancing — it's intelligent.

Think of airport gate assignment. A dumb system assigns any gate. A smart system checks: "This plane came from Tokyo and left its luggage cart at Gate 12 — route it back to Gate 12 so we don't re-transport everything." The Router does this with KV cache. If GPU #3 has 80% of the tokens from a similar previous request cached, route there to skip recomputing them.

It maintains a global radix tree — a data structure tracking which token prefixes are cached on which GPUs across the whole cluster.

Mode | How it works | When to use
kv | Scores each worker by KV cache hit rate via the radix tree, routes to the highest overlap | Production. Always use this.
round-robin | Sequential rotation | Benchmarking baseline
random | Random selection | Testing
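
A stripped-down illustration of what kv mode's scoring boils down to (a sketch, not Dynamo's Rust router):

python
# Illustrative sketch only: score each worker by how many leading tokens of the
# incoming request it already has cached, then route to the highest scorer.

def shared_prefix_len(request: list[int], cached: list[int]) -> int:
    n = 0
    for a, b in zip(request, cached):
        if a != b:
            break
        n += 1
    return n

def pick_worker(request_tokens: list[int], cached_prefixes: dict[str, list[int]]) -> str:
    # cached_prefixes maps worker id -> token prefix resident in that worker's KV cache
    return max(cached_prefixes,
               key=lambda worker: shared_prefix_len(request_tokens, cached_prefixes[worker]))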

3. Prefill Worker — the heavy lifter

Processes your entire prompt in a single forward pass. Compute-intensive — matrix multiplications across every attention layer for every input token simultaneously. Output: the KV cache.

What's KV cache? During self-attention, each token produces a Key and Value vector at each layer. These are stored so decode only needs to compute the new token's attention against existing K/V pairs instead of reprocessing the full sequence. For a 70B model with 4096-token context, this can be several GBs.
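
Its size is easy to estimate (a sketch; the layer count, KV-head count, and head dimension below are illustrative assumptions, and grouped-query attention shrinks the total considerably):

python
# Rough KV-cache sizing: 2 vectors (K and V) per token, per layer, per KV head.
# The parameters below are illustrative assumptions, not exact figures for any model.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# A 70B-class model with full multi-head attention (64 KV heads) at 4096 tokens:
print(kv_cache_bytes(4096, 80, 64, 128) / 1e9, "GB")   # ~10.7 GB
# The same shape with grouped-query attention (8 KV heads):
print(kv_cache_bytes(4096, 80, 8, 128) / 1e9, "GB")    # ~1.3 GB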

4. NIXL — the data highway

Moves the KV cache from prefill GPU to decode GPU using the fastest available transport.

NIXL is like a logistics network that auto-picks the fastest shipping: NVLink for same-machine GPU-to-GPU (conveyor belt), InfiniBand for cross-node (bullet train), or GPUDirect Storage for disk (freight elevator that bypasses the lobby).

Transport | Use case | Bandwidth
NVLink | Same-node GPU ↔ GPU | 900 GB/s (NVL72)
InfiniBand | Cross-node GPU ↔ GPU | 400 Gb/s
UCX (RoCE/TCP) | Flexible network | Variable
GPUDirect Storage | GPU ↔ NVMe/SSD | ~12 GB/s per drive
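
Those peak bandwidths translate directly into KV-cache transfer time. A rough sketch using the table's numbers and an assumed 10 GB cache (real transfers see protocol overhead and lower sustained rates):

python
# How long does it take to move a 10 GB KV cache over each transport?
# Uses the peak figures from the table above; real-world throughput is lower.

KV_CACHE_GB = 10
transports_gb_per_s = {
    "NVLink (NVL72)": 900,
    "InfiniBand (400 Gb/s)": 400 / 8,   # convert gigabits to gigabytes
    "GPUDirect Storage (per drive)": 12,
}
for name, bw in transports_gb_per_s.items():
    print(f"{name}: ~{KV_CACHE_GB / bw * 1000:.0f} ms")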

5. Decode Worker — the token factory

Receives KV cache and generates tokens one at a time. Each new token's attention is computed against all previous KV pairs, then its new K/V are appended. Continues until EOS or max_tokens.

Why separate prefill and decode? Prefill is compute-bound (large batch, fully parallel). Decode is memory-bandwidth-bound (one token at a time, reading lots of KV cache). Mixing them on one GPU = neither runs optimally. Separation lets you tune GPU configs independently.
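
A quick back-of-the-envelope estimate shows why decode hits the bandwidth wall (a sketch assuming every generated token streams all FP16 weights from HBM once, ignoring KV-cache reads, sharding, and batching):

python
# Why decode is memory-bandwidth-bound: generating one token for one sequence
# requires reading roughly every weight once. At batch size 1, HBM bandwidth,
# not FLOPs, sets the ceiling.

WEIGHTS_GB = 70 * 2          # 70B params in FP16
HBM_BANDWIDTH_GB_S = 3350    # H100-class HBM, ~3.35 TB/s

max_tokens_per_s = HBM_BANDWIDTH_GB_S / WEIGHTS_GB
print(f"Upper bound at batch size 1: ~{max_tokens_per_s:.0f} tokens/s")
# Batching many sequences reuses each weight read across requests, which is why
# decode GPUs want large batches and lots of KV-cache headroom.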

Interactive Simulation

Disaggregated Serving in Action

Watch how requests flow differently in aggregated vs disaggregated mode. Click to see 8 requests being processed.

Aggregated vs Disaggregated Serving

Aggregated (traditional): Each GPU does both prefill AND decode. They fight for resources.


Disaggregated (Dynamo): Prefill GPUs blast through prompts. Decode GPUs focus on generation. NIXL transfers KV cache.

Deep Dive

KV-Aware Routing: Why it's a game changer

In multi-turn conversations, each follow-up shares a prefix with the previous message. Without smart routing, a new request lands on a GPU with no cached context, forcing expensive recomputation.

Imagine calling customer support and getting transferred to a new agent every time. Each agent asks you to repeat your entire story. Now imagine a system that always routes you to the agent who already has your file open. That's KV-aware routing. Your "file" is the KV cache, already loaded on a specific GPU.

KV Cache Hit Rate Simulation

3 GPUs with cached prefixes. A new request arrives — the router scores each GPU and picks the best.

GPU 0 cached: "You are a helpful AI assistant. Today we discuss..."
GPU 1 cached: "Write me a Python function that..."
GPU 2 cached: "You are a helpful AI assistant. Today we discuss quantum..."

Incoming request: "You are a helpful AI assistant. Today we discuss quantum computing basics."

GPU 2 shares the longest prefix with the incoming request, so the router sends it there and only the new suffix tokens need to be prefilled.

The radix tree: Dynamo keeps a global prefix tree where each node is a token. When a worker processes tokens, it publishes its cached prefix to the tree (via NATS). The router traverses this tree for each incoming request to find the longest matching prefix per worker, giving each a "hit score". Highest scorer wins.
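
A toy version of that bookkeeping (an illustrative Python sketch, not Dynamo's implementation):

python
# Toy global prefix tree: each node is a token, and nodes remember which workers
# have that prefix cached. Lookup walks the request's tokens and tracks, per worker,
# the deepest node still on the path; that depth is the worker's hit score.

class Node:
    def __init__(self):
        self.children: dict[int, "Node"] = {}
        self.workers: set[str] = set()

class PrefixTree:
    def __init__(self):
        self.root = Node()

    def publish(self, worker: str, tokens: list[int]) -> None:
        """A worker announces that it has this token prefix cached."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())
            node.workers.add(worker)

    def best_worker(self, tokens: list[int]) -> tuple[str | None, int]:
        """Return (worker with the longest cached prefix, matched length)."""
        scores: dict[str, int] = {}
        node, depth = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node, depth = node.children[t], depth + 1
            for w in node.workers:
                scores[w] = depth
        if not scores:
            return None, 0
        best = max(scores, key=scores.get)
        return best, scores[best]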

Memory Architecture

KVBM: 4-tier memory hierarchy for KV cache

GPU memory (HBM) is fast but tiny. With many concurrent users or long conversations, the KV cache overflows it. KVBM creates a 4-tier hierarchy that automatically spills to cheaper storage.

Think of a computer's own memory hierarchy: L1 cache (tiny, blazing fast) → L2 → RAM → SSD. KVBM does the same for KV cache: GPU HBM (fastest) → CPU RAM → Local SSD → Remote storage. Hot data stays in GPU memory; cold data moves down. When needed again, it's fetched back up via NIXL.

G1: GPU HBM (High Bandwidth Memory). Active KV cache for in-flight requests. ~80 GB per GPU, ~3.35 TB/s.
G2: CPU Memory (DRAM). Overflow cache, warm data. ~512 GB-2 TB per node, ~200 GB/s.
G3: Local/Pooled SSDs (NVMe). Cold cache, large capacity. Multi-TB, ~7-12 GB/s per drive.
G4: Remote Storage (Network/Cloud). Archive, persistent. Virtually unlimited, variable (10-100 Gb/s).

3-layer architecture: (1) Runtime connectors hooking into vLLM/TRT-LLM's cache management, (2) Logic layer with block allocation, lifecycle, eviction (LRU), (3) NIXL layer for actual data movement between tiers.
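
A minimal sketch of what the logic layer's job looks like, assuming LRU eviction and a hypothetical transfer callback standing in for the NIXL layer:

python
# Minimal sketch of tiered KV-block management with LRU spill-down.
# `transfer` is a hypothetical stand-in for the NIXL layer; block IDs are opaque.
from collections import OrderedDict

TIERS = ["gpu_hbm", "cpu_dram", "local_ssd", "remote"]
CAPACITY = {"gpu_hbm": 4, "cpu_dram": 16, "local_ssd": 64, "remote": float("inf")}

class TieredKVCache:
    def __init__(self, transfer):
        self.transfer = transfer                          # callable(block_id, src_tier, dst_tier)
        self.tiers = {t: OrderedDict() for t in TIERS}    # per-tier LRU order

    def touch(self, block_id, tier="gpu_hbm"):
        """Mark a block as just used in a tier, spilling cold blocks downward if full."""
        self.tiers[tier][block_id] = True
        self.tiers[tier].move_to_end(block_id)
        while len(self.tiers[tier]) > CAPACITY[tier]:
            cold, _ = self.tiers[tier].popitem(last=False)   # evict least recently used
            lower = TIERS[TIERS.index(tier) + 1]
            self.transfer(cold, tier, lower)
            self.touch(cold, lower)                          # may cascade further down

    def fetch(self, block_id):
        """Bring a block back to GPU HBM from whichever tier currently holds it."""
        for tier in TIERS[1:]:
            if block_id in self.tiers[tier]:
                del self.tiers[tier][block_id]
                self.transfer(block_id, tier, "gpu_hbm")
                break
        self.touch(block_id, "gpu_hbm")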

Autoscaling

The Planner: SLA-driven GPU allocation

Demand fluctuates. At 2 AM you need 2 GPUs; at noon, 32. The Planner monitors metrics and dynamically adjusts how many GPUs run prefill vs decode.

It's a hospital shift manager. Quiet night: 2 surgeons, 4 nurses. ER fills up: reassign staff — more triage (prefill), more surgeons (decode), call backup (scale up). The Planner watches two vital signs: prefill queue depth (requests waiting?) and decode KV utilization (decode GPUs running out of memory?).
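
In rough pseudocode (a simplified sketch of the decision loop; the thresholds are illustrative, not the Planner's actual SLA model):

python
# Simplified sketch of an SLA-driven rebalancing decision, evaluated once per
# adjustment interval. Thresholds here are illustrative, not Dynamo's defaults.

def plan(prefill_queue_depth: int,
         decode_kv_utilization: float,
         prefill_workers: int,
         decode_workers: int,
         max_gpu_budget: int = 8) -> tuple[int, int]:
    """Return the desired (prefill_workers, decode_workers) for the next interval."""
    total = prefill_workers + decode_workers
    # Requests piling up in front of prefill -> first tokens arrive late (TTFT suffers).
    if prefill_queue_depth > 4 and total < max_gpu_budget:
        prefill_workers += 1
    # Decode GPUs running out of KV-cache memory -> tokens stream slowly (ITL suffers).
    elif decode_kv_utilization > 0.85 and total < max_gpu_budget:
        decode_workers += 1
    # Everything quiet -> give a GPU back.
    elif prefill_queue_depth == 0 and decode_kv_utilization < 0.30 and decode_workers > 1:
        decode_workers -= 1
    return prefill_workers, decode_workers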

Planner Autoscaling Simulation

Watch the planner respond to a traffic spike by rebalancing workers.

Backends

Three engines, one framework

Dynamo is engine-agnostic. Your choice depends on your priorities:

SGLang

Best for: High throughput
Strengths: RadixAttention, efficient memory, fast compile
Install:

pip install "ai-dynamo[sglang]"

TensorRT-LLM

Best for: Max perf, GB200
Strengths: WideEP, MTP, speculative decode, Blackwell optimized
Install:

pip install "ai-dynamo[trtllm]"

vLLM

Best for: Broadest features
Strengths: LoRA, prompt embeds, request migration, multimodal
Install:

pip install "ai-dynamo[vllm]"

Feature Matrix

Feature | vLLM | TensorRT-LLM | SGLang
Disaggregated Serving | Yes | Yes | Yes
KV-Aware Routing | Yes | Yes | Yes
SLA-Based Planner | Yes | Yes | Yes
KV Block Manager | Yes | Yes | Planned
Multimodal (Image) | Yes | Yes | Yes
Multimodal (Video) | Yes | - | -
Request Migration | Yes | Limited | Yes
LoRA | Yes | - | -
Tool Calling | Yes | Yes | Yes
Speculative Decoding | Yes | Yes | WIP
WideEP (MoE) | - | Yes | -
GB200 Support | - | Yes | -
Get Running

Quick Start: 0 to inference in 5 minutes

Fastest way — Docker containers with everything pre-installed:

bash — Pull a container
# Pick your backend
docker run --gpus all --network host --rm -it \
  nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1       # SGLang

docker run --gpus all --network host --rm -it \
  nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1   # TensorRT-LLM

docker run --gpus all --network host --rm -it \
  nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1          # vLLM

Start the frontend and a worker:

bash — Terminal 1: Frontend
python3 -m dynamo.frontend --http-port 8000 --store-kv file
bash — Terminal 2: Worker
# SGLang
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --store-kv file

# TensorRT-LLM
python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --store-kv file

# vLLM (note: --model, not --model-path)
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --store-kv file \
  --kv-events-config '{"enable_kv_cache_events": false}'
bash — Test it
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true,
    "max_tokens": 200
  }'

PyPI Install (no Docker)

bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv venv && source venv/bin/activate && uv pip install pip

# Pick one:
uv pip install "ai-dynamo[sglang]"
uv pip install "ai-dynamo[vllm]"
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"

Build from Source

bash — Full build
# 1. System deps
sudo apt install -y build-essential libhwloc-dev libudev-dev \
  pkg-config libclang-dev protobuf-compiler python3-dev cmake

# 2. Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

# 3. Python + build tools
uv venv dynamo && source dynamo/bin/activate
uv pip install pip maturin

# 4. Build Rust bindings
cd lib/bindings/python && maturin develop --uv

# 5. Install
cd $PROJECT_ROOT
uv pip install -e lib/gpu_memory_service
uv pip install -e .
Production

Kubernetes: single GPU to data center

Dynamo uses Grove — a topology-optimized Kubernetes operator that understands GPU relationships and manages the inference graph as one resource.

Normal K8s treats pods as independent. Like scheduling an orchestra by randomly placing musicians in different rooms. Grove understands the violin section must be near the conductor (prefill GPUs near decode GPUs on the same NVLink domain). It schedules with GPU topology awareness.

Kubernetes Architecture
Traffic → K8s Ingress → Frontend Pod → Router Pod → Worker Pods (GPU nodes)
bash — Install Platform
export NAMESPACE=dynamo-system VERSION=0.9.0

helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${VERSION}.tgz
helm install dynamo-crds dynamo-crds-${VERSION}.tgz -n default

helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${VERSION}.tgz
helm install dynamo-platform dynamo-platform-${VERSION}.tgz \
  -n ${NAMESPACE} --create-namespace
yaml — DynamoGraphDeployment
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llama-70b-disagg
spec:
  services:
    Frontend:
      replicas: 2
    PrefillWorker:
      replicas: 4
      resources:
        limits:
          nvidia.com/gpu: "2"
    DecodeWorker:
      replicas: 4
      resources:
        limits:
          nvidia.com/gpu: "2"

Production Recipes

Model | Backend | Mode | GPUs
Llama-3-70B | vLLM | Aggregated | 4x H100
Llama-3-70B | vLLM | Disaggregated | 8-16x H100
DeepSeek-R1 | SGLang | Disaggregated | 16-32x H200
DeepSeek-R1 | TRT-LLM | Disagg + WideEP | 32+4 GB200
Qwen3-32B | vLLM | Disagg + KV Router | 16x H200
Qwen3-235B MoE | TRT-LLM | Agg TP4xEP4 | 16x GPU

Service Discovery

Deployment | etcd | NATS | Notes
Local dev | No | No | --store-kv file
Kubernetes | No | No | K8s-native CRDs
Slurm | Yes | Yes | docker compose -f deploy/docker-compose.yml up -d
API

OpenAI-compatible API (drop-in replacement)

Method | Endpoint | Description
POST | /v1/chat/completions | Chat completion (streaming + non-streaming)
POST | /v1/embeddings | Text embeddings
GET | /v1/models | List models
GET | /openapi.json | OpenAPI 3 spec
GET | /metrics | Prometheus metrics

Tool Calling

Parser | Models
hermes | Qwen2.5, Hermes
mistral | Mistral function-calling
llama3_json | Llama 3.1/3.2
deepseek_v3 | DeepSeek V3/R1
pythonic | Llama 4
phi4 | Phi-4
Configuration

Every knob you can turn

Frontend

Flag | Env Var | Default | What it does
--http-port | - | 8000 | HTTP listen port
--router-mode | DYN_ROUTER_MODE | round-robin | Routing strategy
--store-kv | - | etcd | KV store (file for local)
--kv-events | DYN_KV_EVENTS | false | KV event publishing

Planner

Flag | Default | What it does
--adjustment-interval | 180s | Time between scaling decisions
--ttft | 500ms | Target Time To First Token
--itl | 50ms | Target Inter-Token Latency
--max-gpu-budget | 8 | Maximum total GPUs

Logging

bash
export DYN_LOG=debug                                    # Everything
export DYN_LOG=dynamo_llm=debug,dynamo_runtime=info     # Per-crate
Support Matrix

Hardware and software compatibility

GPU Support

Architecture | Examples | Status
Blackwell | B100, B200, GB200 NVL72 | Full support
Hopper | H100, H200 | Full support
Ada Lovelace | L4, L40, L40S | Full support
Ampere | A100, A10G | Full support

OS and CPU

OS | x86_64 | ARM64
Ubuntu 24.04 | Yes | Yes
Ubuntu 22.04 | Yes | -
CentOS Stream 9 | Experimental | -

Version Compatibility

Dynamo | vLLM | SGLang | TensorRT-LLM | NIXL
main | 0.15.1 | 0.5.8 | 1.3.0rc1 | 0.9.0
v0.8.1 | 0.12.0 | 0.5.6 | 1.2.0rc6 | 0.8.0

NVIDIA Dynamo — Apache 2.0 License — Rust + Python

GitHub · Docs · NVIDIA