The operating system for distributed AI inference. A technical deep-dive with interactive simulations and real analogies so you actually get it.
Modern LLMs like DeepSeek-R1 (671B parameters) or Llama-3-70B don't fit on a single GPU. Even an 80GB H100 can only hold ~40B parameters in FP16. So you must split the model across multiple GPUs (tensor parallelism), sometimes across multiple machines (pipeline parallelism).
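A quick back-of-the-envelope check (weights only; KV cache, activations, and runtime overhead need additional memory on top):

```python
# Rough weight-memory arithmetic for FP16 models (2 bytes per parameter).
# Ignores KV cache, activations, and framework overhead, which add more.
GPU_MEM_GB = 80  # one H100

for name, params_b in [("Llama-3-70B", 70), ("DeepSeek-R1", 671)]:
    weight_gb = params_b * 2             # billions of params x 2 bytes = GB
    gpus = -(-weight_gb // GPU_MEM_GB)   # ceiling division
    print(f"{name}: ~{weight_gb} GB of weights -> at least {gpus} x 80GB GPUs")
# Llama-3-70B: ~140 GB of weights -> at least 2 x 80GB GPUs
# DeepSeek-R1: ~1342 GB of weights -> at least 17 x 80GB GPUs
```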
But splitting creates problems:
Imagine a restaurant kitchen that used to have one chef doing everything. Now you have 8 chefs, each handling a different part of the dish. Suddenly you need: a coordinator to route orders, a system to pass half-finished plates between chefs, a scheduler to make sure no one's idle while others are slammed, and a way to cache prep work so you don't re-chop the same onions for every order. That's Dynamo.
Without Dynamo: Every request recomputes from scratch. Prefill and decode compete for the same GPU. No cache reuse. Manual scaling. Wasted compute everywhere.
With Dynamo: Prefill and decode run on separate, independently optimized GPUs. KV cache is reused across requests. Smart routing minimizes recomputation. Auto-scaling matches demand.
Every request goes through these stages. Click the simulation button below to watch a request travel through the pipeline.
Receives your /v1/chat/completions request. Handles tokenization, applies chat templates, validates it, and forwards to the Router. It's OpenAI-compatible — any OpenAI SDK works.
Technical detail: Written entirely in Rust for zero-copy performance. Exposes /v1/chat/completions, /v1/embeddings, /v1/models, /metrics (Prometheus), and /openapi.json.
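Because the frontend is OpenAI-compatible, the stock OpenAI Python SDK works against it unchanged. A minimal sketch, assuming a frontend listening on localhost:8000 and the Qwen3-0.6B model used in the quick start below:

```python
# Point the standard OpenAI Python SDK at the Dynamo frontend.
# Assumes `pip install openai` and a frontend on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    max_tokens=200,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```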
Decides which GPU worker handles your request. This is not random load balancing — it's intelligent.
Think of airport gate assignment. A dumb system assigns any gate. A smart system checks: "This plane came from Tokyo and left its luggage cart at Gate 12 — route it back to Gate 12 so we don't re-transport everything." The Router does this with KV cache. If GPU #3 has 80% of the tokens from a similar previous request cached, route there to skip recomputing them.
It maintains a global radix tree — a data structure tracking which token prefixes are cached on which GPUs across the whole cluster.
| Mode | How it works | When to use |
|---|---|---|
| kv | Scores each worker by KV cache hit rate via radix tree, routes to highest overlap | Production. Always use this. |
| round-robin | Sequential rotation | Benchmarking baseline |
| random | Random selection | Testing |
Processes your entire prompt in a single forward pass. Compute-intensive — matrix multiplications across every attention layer for every input token simultaneously. Output: the KV cache.
What's KV cache? During self-attention, each token produces a Key and Value vector at each layer. These are stored so decode only needs to compute the new token's attention against existing K/V pairs instead of reprocessing the full sequence. For a 70B model with 4096-token context, this can be several GBs.
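To make that concrete, here is the standard KV-cache size formula, evaluated with Llama-3-70B-like shape assumptions (80 layers, head dim 128, 64 attention heads in full multi-head attention, 8 KV heads under grouped-query attention, FP16):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for Key and Value, stored at every layer for every token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

seq_len = 4096
full_mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=seq_len)
gqa      = kv_cache_bytes(layers=80, kv_heads=8,  head_dim=128, seq_len=seq_len)
print(f"MHA (64 KV heads): {full_mha / 1e9:.1f} GB per sequence")  # ~10.7 GB
print(f"GQA (8 KV heads):  {gqa / 1e9:.1f} GB per sequence")       # ~1.3 GB
# Either way, the cache grows linearly with context length and with
# every concurrent request held on the GPU.
```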
Moves the KV cache from prefill GPU to decode GPU using the fastest available transport.
NIXL is like a logistics network that auto-picks the fastest shipping: NVLink for same-machine GPU-to-GPU (conveyor belt), InfiniBand for cross-node (bullet train), or GPUDirect Storage for disk (freight elevator that bypasses the lobby).
| Transport | Use case | Bandwidth |
|---|---|---|
| NVLink | Same-node GPU ↔ GPU | 900 GB/s per GPU (H100); 1.8 TB/s (GB200 NVL72) |
| InfiniBand | Cross-node GPU ↔ GPU | 400 Gb/s |
| UCX (RoCE/TCP) | Flexible network | Variable |
| GPUDirect Storage | GPU ↔ NVMe/SSD | ~12 GB/s per drive |
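Conceptually the choice boils down to "use the fastest link that actually connects source and destination." A toy sketch of that selection logic (illustrative only, not the NIXL API):

```python
# Conceptual sketch of "pick the fastest available transport" (not NIXL's API).
def pick_transport(src, dst):
    """src/dst are dicts like {"node": "node-1", "kind": "gpu"}."""
    if "storage" in (src["kind"], dst["kind"]):
        return "GPUDirect Storage"     # GPU <-> NVMe/SSD
    if src["node"] == dst["node"]:
        return "NVLink"                # same-node GPU <-> GPU
    return "InfiniBand/UCX"            # cross-node GPU <-> GPU

print(pick_transport({"node": "n1", "kind": "gpu"}, {"node": "n1", "kind": "gpu"}))  # NVLink
print(pick_transport({"node": "n1", "kind": "gpu"}, {"node": "n2", "kind": "gpu"}))  # InfiniBand/UCX
```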
Receives KV cache and generates tokens one at a time. Each new token's attention is computed against all previous KV pairs, then its new K/V are appended. Continues until EOS or max_tokens.
Why separate prefill and decode? Prefill is compute-bound (large batch, fully parallel). Decode is memory-bandwidth-bound (one token at a time, reading lots of KV cache). Mixing them on one GPU = neither runs optimally. Separation lets you tune GPU configs independently.
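A stripped-down sketch of the decode loop (illustrative; real engines batch many sequences and run fused GPU kernels). The point to notice is that every iteration reads the entire KV cache and appends one entry, which is exactly the memory-bandwidth-bound pattern described above:

```python
# Illustrative decode loop: one token per step, reading the whole KV cache.
def decode(kv_cache, step_fn, first_token, eos_id, max_tokens):
    """step_fn(kv_cache, token) -> (next_token, new_kv_entry) stands in for
    one forward pass of the model over the cached context."""
    tokens = [first_token]
    for _ in range(max_tokens):
        # Attend over all cached K/V pairs to predict the next token,
        # and get the K/V entry produced for the current token.
        next_token, new_kv = step_fn(kv_cache, tokens[-1])
        kv_cache.append(new_kv)
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens
```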
Watch how requests flow differently in aggregated vs disaggregated mode. Click to see 8 requests being processed.
Aggregated (traditional): Each GPU does both prefill AND decode. They fight for resources.
Disaggregated (Dynamo): Prefill GPUs blast through prompts. Decode GPUs focus on generation. NIXL transfers KV cache.
In multi-turn conversations, each follow-up shares a prefix with the previous message. Without smart routing, a new request lands on a GPU with no cached context, forcing expensive recomputation.
Imagine calling customer support and getting transferred to a new agent every time. Each agent asks you to repeat your entire story. Now imagine a system that always routes you to the agent who already has your file open. That's KV-aware routing. Your "file" is the KV cache, already loaded on a specific GPU.
3 GPUs with cached prefixes. A new request arrives — the router scores each GPU and picks the best.
The radix tree: Dynamo keeps a global prefix tree where each node is a token. When a worker processes tokens, it publishes its cached prefix to the tree (via NATS). The router traverses this tree for each incoming request to find the longest matching prefix per worker, giving each a "hit score". Highest scorer wins.
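A toy version of that lookup, assuming workers have already published their cached prefixes (the real tree lives in Rust and is updated via NATS events; this is just the idea):

```python
# Toy prefix tree: each node maps token -> child and records which workers
# have the prefix ending at that node cached.
class PrefixNode:
    def __init__(self):
        self.children = {}    # token id -> PrefixNode
        self.workers = set()  # workers that have this prefix cached

class PrefixTree:
    def __init__(self):
        self.root = PrefixNode()

    def publish(self, worker, tokens):
        """Called when a worker reports a cached prefix."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())
            node.workers.add(worker)

    def score(self, tokens):
        """Longest cached prefix length per worker for an incoming request."""
        scores = {}
        node = self.root
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            for w in node.workers:
                scores[w] = depth
        return scores

tree = PrefixTree()
tree.publish("gpu-3", [1, 2, 3, 4, 5])
tree.publish("gpu-1", [1, 2])
best = max(tree.score([1, 2, 3, 4, 9]).items(), key=lambda kv: kv[1])
print(best)  # ('gpu-3', 4): route there, only the last token needs prefill
```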
GPU memory (HBM) is fast but tiny. For many concurrent users or long conversations, KV cache overflows. KVBM creates a 4-tier hierarchy, automatically spilling to cheaper storage.
Think of a computer's own memory hierarchy: L1 cache (tiny, blazing fast) → L2 → RAM → SSD. KVBM does the same for KV cache: GPU HBM (fastest) → CPU RAM → Local SSD → Remote storage. Hot data stays in GPU memory; cold data moves down. When needed again, it's fetched back up via NIXL.
3-layer architecture: (1) Runtime connectors hooking into vLLM/TRT-LLM's cache management, (2) Logic layer with block allocation, lifecycle, eviction (LRU), (3) NIXL layer for actual data movement between tiers.
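A minimal sketch of the spill-down behavior using an LRU per tier (purely illustrative; KVBM operates on fixed-size KV blocks and moves the actual bytes through NIXL):

```python
from collections import OrderedDict

# Illustrative tiered cache: hot blocks stay high, LRU blocks spill downward.
class TieredKVCache:
    def __init__(self, capacities):
        # e.g. {"GPU HBM": 2, "CPU RAM": 4, "Local SSD": 8}
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]

    def put(self, block_id, block, tier=0):
        name, cap, store = self.tiers[tier]
        store[block_id] = block
        store.move_to_end(block_id)                   # mark most recently used
        if len(store) > cap and tier + 1 < len(self.tiers):
            victim, data = store.popitem(last=False)  # evict the LRU block...
            self.put(victim, data, tier + 1)          # ...and spill it one tier down

    def get(self, block_id):
        for tier, (name, cap, store) in enumerate(self.tiers):
            if block_id in store:
                data = store.pop(block_id)
                self.put(block_id, data, 0)           # promote back to the GPU tier
                return name, data
        return None, None

cache = TieredKVCache({"GPU HBM": 2, "CPU RAM": 4, "Local SSD": 8})
for i in range(5):
    cache.put(f"block-{i}", b"kv-bytes")
print(cache.get("block-0"))  # found in CPU RAM, promoted back to GPU HBM
```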
Demand fluctuates. At 2 AM you need 2 GPUs; at noon, 32. The Planner monitors metrics and dynamically adjusts how many GPUs run prefill vs decode.
It's a hospital shift manager. Quiet night: 2 surgeons, 4 nurses. ER fills up: reassign staff — more triage (prefill), more surgeons (decode), call backup (scale up). The Planner watches two vital signs: prefill queue depth (requests waiting?) and decode KV utilization (decode GPUs running out of memory?).
Watch the planner respond to a traffic spike by rebalancing workers.
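The core decision logic is a small control loop over those two signals. A hedged sketch (the thresholds here are invented for illustration; the real planner works against the TTFT/ITL targets listed in the configuration tables below):

```python
# Illustrative planner step: look at two signals, rebalance within a GPU budget.
def plan_step(prefill_queue_depth, decode_kv_utilization,
              prefill_gpus, decode_gpus, max_gpu_budget=8):
    total = prefill_gpus + decode_gpus
    # Requests queueing for prefill -> TTFT suffers, add prefill capacity.
    if prefill_queue_depth > 4 and total < max_gpu_budget:
        prefill_gpus += 1
    # Decode GPUs running out of KV memory -> ITL suffers, add decode capacity.
    elif decode_kv_utilization > 0.9 and total < max_gpu_budget:
        decode_gpus += 1
    # Everything idle -> scale down to save money.
    elif prefill_queue_depth == 0 and decode_kv_utilization < 0.3 and total > 2:
        if prefill_gpus > 1:
            prefill_gpus -= 1
        elif decode_gpus > 1:
            decode_gpus -= 1
    return prefill_gpus, decode_gpus

print(plan_step(prefill_queue_depth=10, decode_kv_utilization=0.5,
                prefill_gpus=1, decode_gpus=3))  # (2, 3): spike, add prefill
```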
Dynamo is engine-agnostic; which backend you pick depends on your priorities:
SGLang
Best for: High throughput
Strengths: RadixAttention, efficient memory, fast compile
Install: uv pip install "ai-dynamo[sglang]"
TensorRT-LLM
Best for: Maximum performance, GB200
Strengths: WideEP, MTP, speculative decoding, Blackwell-optimized
Install: pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
vLLM
Best for: Broadest feature set
Strengths: LoRA, prompt embeds, request migration, multimodal
Install: uv pip install "ai-dynamo[vllm]"
| Feature | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|
| Disaggregated Serving | Yes | Yes | Yes |
| KV-Aware Routing | Yes | Yes | Yes |
| SLA-Based Planner | Yes | Yes | Yes |
| KV Block Manager | Yes | Yes | Planned |
| Multimodal (Image) | Yes | Yes | Yes |
| Multimodal (Video) | Yes | — | — |
| Request Migration | Yes | Limited | Yes |
| LoRA | Yes | — | — |
| Tool Calling | Yes | Yes | Yes |
| Speculative Decoding | Yes | Yes | WIP |
| WideEP (MoE) | — | Yes | — |
| GB200 Support | — | Yes | — |
Fastest way — Docker containers with everything pre-installed:
# Pick your backend
docker run --gpus all --network host --rm -it \
nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1 # SGLang
docker run --gpus all --network host --rm -it \
nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1 # TensorRT-LLM
docker run --gpus all --network host --rm -it \
nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1 # vLLM
Start the frontend and a worker:
python3 -m dynamo.frontend --http-port 8000 --store-kv file
# SGLang
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --store-kv file
# TensorRT-LLM
python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --store-kv file
# vLLM (note: --model, not --model-path)
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --store-kv file \
--kv-events-config '{"enable_kv_cache_events": false}'
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true,
"max_tokens": 200
}'
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv venv && source venv/bin/activate && uv pip install pip
# Pick one:
uv pip install "ai-dynamo[sglang]"
uv pip install "ai-dynamo[vllm]"
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
# 1. System deps
sudo apt install -y build-essential libhwloc-dev libudev-dev \
pkg-config libclang-dev protobuf-compiler python3-dev cmake
# 2. Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
# 3. Python + build tools
uv venv dynamo && source dynamo/bin/activate
uv pip install pip maturin
# 4. Build Rust bindings
cd lib/bindings/python && maturin develop --uv
# 5. Install
cd $PROJECT_ROOT
uv pip install -e lib/gpu_memory_service
uv pip install -e .
Dynamo uses Grove — a topology-optimized Kubernetes operator that understands GPU relationships and manages the inference graph as one resource.
Normal K8s treats pods as independent. Like scheduling an orchestra by randomly placing musicians in different rooms. Grove understands the violin section must be near the conductor (prefill GPUs near decode GPUs on the same NVLink domain). It schedules with GPU topology awareness.
export NAMESPACE=dynamo-system VERSION=0.9.0
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${VERSION}.tgz
helm install dynamo-crds dynamo-crds-${VERSION}.tgz -n default
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${VERSION}.tgz
helm install dynamo-platform dynamo-platform-${VERSION}.tgz \
-n ${NAMESPACE} --create-namespace
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llama-70b-disagg
spec:
  services:
    Frontend:
      replicas: 2
    PrefillWorker:
      replicas: 4
      resources:
        limits:
          nvidia.com/gpu: "2"
    DecodeWorker:
      replicas: 4
      resources:
        limits:
          nvidia.com/gpu: "2"
| Model | Backend | Mode | GPUs |
|---|---|---|---|
| Llama-3-70B | vLLM | Aggregated | 4x H100 |
| Llama-3-70B | vLLM | Disaggregated | 8-16x H100 |
| DeepSeek-R1 | SGLang | Disaggregated | 16-32x H200 |
| DeepSeek-R1 | TRT-LLM | Disagg + WideEP | 32+4 GB200 |
| Qwen3-32B | vLLM | Disagg + KV Router | 16x H200 |
| Qwen3-235B MoE | TRT-LLM | Agg TP4xEP4 | 16x GPU |
| Deployment | etcd required? | NATS required? | Notes |
|---|---|---|---|
| Local dev | No | No | --store-kv file |
| Kubernetes | No | No | K8s-native CRDs |
| Slurm | Yes | Yes | docker compose -f deploy/docker-compose.yml up -d |
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/chat/completions | Chat completion (streaming + non-streaming) |
| POST | /v1/embeddings | Text embeddings |
| GET | /v1/models | List models |
| GET | /openapi.json | OpenAPI 3 spec |
| GET | /metrics | Prometheus metrics |
| Parser | Models |
|---|---|
| hermes | Qwen2.5, Hermes |
| mistral | Mistral function-calling |
| llama3_json | Llama 3.1/3.2 |
| deepseek_v3 | DeepSeek V3/R1 |
| pythonic | Llama 4 |
| phi4 | Phi-4 |
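Tool calling goes through the same OpenAI-compatible endpoint; the configured parser determines how each model family's function-call output is interpreted. A minimal sketch with the OpenAI Python SDK (the tool definition and model are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Standard OpenAI-style tool definition; the parser (e.g. hermes for Qwen
# models) handles how the model emits the call on the server side.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```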
| Flag | Env Var | Default | What it does |
|---|---|---|---|
| --http-port | — | 8000 | HTTP listen port |
| --router-mode | DYN_ROUTER_MODE | round-robin | Routing strategy |
| --store-kv | — | etcd | KV store (file for local) |
| --kv-events | DYN_KV_EVENTS | false | KV event publishing |
| Flag | Default | What it does |
|---|---|---|
| --adjustment-interval | 180s | Time between scaling decisions |
| --ttft | 500ms | Target Time To First Token |
| --itl | 50ms | Target Inter-Token Latency |
| --max-gpu-budget | 8 | Maximum total GPUs |
export DYN_LOG=debug # Everything
export DYN_LOG=dynamo_llm=debug,dynamo_runtime=info # Per-crate
| Architecture | Examples | Status |
|---|---|---|
| Blackwell | B100, B200, GB200 NVL72 | Full support |
| Hopper | H100, H200 | Full support |
| Ada Lovelace | L4, L40, L40S | Full support |
| Ampere | A100, A10G | Full support |
| OS | x86_64 | ARM64 |
|---|---|---|
| Ubuntu 24.04 | Yes | Yes |
| Ubuntu 22.04 | Yes | — |
| CentOS Stream 9 | Experimental | — |
| Dynamo | vLLM | SGLang | TensorRT-LLM | NIXL |
|---|---|---|---|---|
| main | 0.15.1 | 0.5.8 | 1.3.0rc1 | 0.9.0 |
| v0.8.1 | 0.12.0 | 0.5.6 | 1.2.0rc6 | 0.8.0 |
NVIDIA Dynamo — Apache 2.0 License — Rust + Python