The operating system for distributed AI inference. A technical deep-dive with interactive simulations and real analogies so you actually get it.
Modern LLMs like DeepSeek-R1 (671B parameters) or Llama-3-70B don't fit on a single GPU. Even an 80GB H100 can only hold ~40B parameters in FP16. So you must split the model across multiple GPUs (tensor parallelism), sometimes across multiple machines (pipeline parallelism).
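A quick back-of-the-envelope check (weights only; KV cache, activations, and runtime overhead need additional memory on top):

```python
# Rough weight-memory arithmetic for FP16 models (2 bytes per parameter).
# Ignores KV cache, activations, and framework overhead, which add more.
GPU_MEM_GB = 80  # one H100

for name, params_b in [("Llama-3-70B", 70), ("DeepSeek-R1", 671)]:
    weight_gb = params_b * 2             # billions of params x 2 bytes = GB
    gpus = -(-weight_gb // GPU_MEM_GB)   # ceiling division
    print(f"{name}: ~{weight_gb} GB of weights -> at least {gpus} x 80GB GPUs")
# Llama-3-70B: ~140 GB of weights -> at least 2 x 80GB GPUs
# DeepSeek-R1: ~1342 GB of weights -> at least 17 x 80GB GPUs
```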
But splitting creates problems:
Imagine a restaurant kitchen that used to have one chef doing everything. Now you have 8 chefs, each handling a different part of the dish. Suddenly you need: a coordinator to route orders, a system to pass half-finished plates between chefs, a scheduler to make sure no one's idle while others are slammed, and a way to cache prep work so you don't re-chop the same onions for every order. That's Dynamo.
Without Dynamo: Every request recomputes from scratch. Prefill and decode compete for the same GPU. No cache reuse. Manual scaling. Wasted compute everywhere.
With Dynamo: Prefill and decode run on separate, independently optimized GPUs. KV cache is reused across requests. Smart routing minimizes recomputation. Auto-scaling matches demand.
Every request goes through these stages. Click the simulation button below to watch a request travel through the pipeline.
Receives your /v1/chat/completions request. Handles tokenization, applies chat templates, validates it, and forwards to the Router. It's OpenAI-compatible — any OpenAI SDK works.
Technical detail: Written entirely in Rust for zero-copy performance. Exposes /v1/chat/completions, /v1/embeddings, /v1/models, /metrics (Prometheus), and /openapi.json.
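Because the frontend is OpenAI-compatible, the stock OpenAI Python SDK works against it unchanged. A minimal sketch, assuming a frontend listening on localhost:8000 and the Qwen3-0.6B model used in the quick start below:

```python
# Point the standard OpenAI Python SDK at the Dynamo frontend.
# Assumes `pip install openai` and a frontend on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    max_tokens=200,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```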
Decides which GPU worker handles your request. This is not random load balancing — it's intelligent.
Think of airport gate assignment. A dumb system assigns any gate. A smart system checks: "This plane came from Tokyo and left its luggage cart at Gate 12 — route it back to Gate 12 so we don't re-transport everything." The Router does this with KV cache. If GPU #3 has 80% of the tokens from a similar previous request cached, route there to skip recomputing them.
It maintains a global radix tree — a data structure tracking which token prefixes are cached on which GPUs across the whole cluster.
| Mode | How it works | When to use |
|---|---|---|
| kv | Scores each worker by KV cache hit rate via radix tree, routes to highest overlap | Production. Always use this. |
| round-robin | Sequential rotation | Benchmarking baseline |
| random | Random selection | Testing |
Processes your entire prompt in a single forward pass. Compute-intensive — matrix multiplications across every attention layer for every input token simultaneously. Output: the KV cache.
What's KV cache? During self-attention, each token produces a Key and Value vector at each layer. These are stored so decode only needs to compute the new token's attention against existing K/V pairs instead of reprocessing the full sequence. For a 70B model with 4096-token context, this can be several GBs.
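To make that concrete, here is the standard KV-cache size formula, evaluated with Llama-3-70B-like shape assumptions (80 layers, head dim 128, 64 attention heads in full multi-head attention, 8 KV heads under grouped-query attention, FP16):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for Key and Value, stored at every layer for every token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

seq_len = 4096
full_mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=seq_len)
gqa      = kv_cache_bytes(layers=80, kv_heads=8,  head_dim=128, seq_len=seq_len)
print(f"MHA (64 KV heads): {full_mha / 1e9:.1f} GB per sequence")  # ~10.7 GB
print(f"GQA (8 KV heads):  {gqa / 1e9:.1f} GB per sequence")       # ~1.3 GB
# Either way, the cache grows linearly with context length and with
# every concurrent request held on the GPU.
```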
Moves the KV cache from prefill GPU to decode GPU using the fastest available transport.
NIXL is like a logistics network that auto-picks the fastest shipping: NVLink for same-machine GPU-to-GPU (conveyor belt), InfiniBand for cross-node (bullet train), or GPUDirect Storage for disk (freight elevator that bypasses the lobby).
| Transport | Use case | Bandwidth |
|---|---|---|
| NVLink | Same-node GPU ↔ GPU | 900 GB/s per GPU (H100); 1.8 TB/s (GB200 NVL72) |
| InfiniBand | Cross-node GPU ↔ GPU | 400 Gb/s |
| UCX (RoCE/TCP) | Flexible network | Variable |
| GPUDirect Storage | GPU ↔ NVMe/SSD | ~12 GB/s per drive |
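Conceptually the choice boils down to "use the fastest link that actually connects source and destination." A toy sketch of that selection logic (illustrative only, not the NIXL API):

```python
# Conceptual sketch of "pick the fastest available transport" (not NIXL's API).
def pick_transport(src, dst):
    """src/dst are dicts like {"node": "node-1", "kind": "gpu"}."""
    if "storage" in (src["kind"], dst["kind"]):
        return "GPUDirect Storage"     # GPU <-> NVMe/SSD
    if src["node"] == dst["node"]:
        return "NVLink"                # same-node GPU <-> GPU
    return "InfiniBand/UCX"            # cross-node GPU <-> GPU

print(pick_transport({"node": "n1", "kind": "gpu"}, {"node": "n1", "kind": "gpu"}))  # NVLink
print(pick_transport({"node": "n1", "kind": "gpu"}, {"node": "n2", "kind": "gpu"}))  # InfiniBand/UCX
```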
Receives KV cache and generates tokens one at a time. Each new token's attention is computed against all previous KV pairs, then its new K/V are appended. Continues until EOS or max_tokens.
Why separate prefill and decode? Prefill is compute-bound (large batch, fully parallel). Decode is memory-bandwidth-bound (one token at a time, reading lots of KV cache). Mixing them on one GPU = neither runs optimally. Separation lets you tune GPU configs independently.
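A stripped-down sketch of the decode loop (illustrative; real engines batch many sequences and run fused GPU kernels). The point to notice is that every iteration reads the entire KV cache and appends one entry, which is exactly the memory-bandwidth-bound pattern described above:

```python
# Illustrative decode loop: one token per step, reading the whole KV cache.
def decode(kv_cache, step_fn, first_token, eos_id, max_tokens):
    """step_fn(kv_cache, token) -> (next_token, new_kv_entry) stands in for
    one forward pass of the model over the cached context."""
    tokens = [first_token]
    for _ in range(max_tokens):
        # Attend over all cached K/V pairs to predict the next token,
        # and get the K/V entry produced for the current token.
        next_token, new_kv = step_fn(kv_cache, tokens[-1])
        kv_cache.append(new_kv)
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens
```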
Watch how requests flow differently in aggregated vs disaggregated mode. Click to see 8 requests being processed.
Aggregated (traditional): Each GPU does both prefill AND decode. They fight for resources.
Disaggregated (Dynamo): Prefill GPUs blast through prompts. Decode GPUs focus on generation. NIXL transfers KV cache.
In multi-turn conversations, each follow-up shares a prefix with the previous message. Without smart routing, a new request lands on a GPU with no cached context, forcing expensive recomputation.
Imagine calling customer support and getting transferred to a new agent every time. Each agent asks you to repeat your entire story. Now imagine a system that always routes you to the agent who already has your file open. That's KV-aware routing. Your "file" is the KV cache, already loaded on a specific GPU.
3 GPUs with cached prefixes. A new request arrives — the router scores each GPU and picks the best.
The radix tree: Dynamo keeps a global prefix tree where each node is a token. When a worker processes tokens, it publishes its cached prefix to the tree (via NATS). The router traverses this tree for each incoming request to find the longest matching prefix per worker, giving each a "hit score". Highest scorer wins.
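A toy version of that lookup, assuming workers have already published their cached prefixes (the real tree lives in Rust and is updated via NATS events; this is just the idea):

```python
# Toy prefix tree: each node maps token -> child and records which workers
# have the prefix ending at that node cached.
class PrefixNode:
    def __init__(self):
        self.children = {}    # token id -> PrefixNode
        self.workers = set()  # workers that have this prefix cached

class PrefixTree:
    def __init__(self):
        self.root = PrefixNode()

    def publish(self, worker, tokens):
        """Called when a worker reports a cached prefix."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())
            node.workers.add(worker)

    def score(self, tokens):
        """Longest cached prefix length per worker for an incoming request."""
        scores = {}
        node = self.root
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            for w in node.workers:
                scores[w] = depth
        return scores

tree = PrefixTree()
tree.publish("gpu-3", [1, 2, 3, 4, 5])
tree.publish("gpu-1", [1, 2])
best = max(tree.score([1, 2, 3, 4, 9]).items(), key=lambda kv: kv[1])
print(best)  # ('gpu-3', 4): route there, only the last token needs prefill
```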
GPU memory (HBM) is fast but tiny. For many concurrent users or long conversations, KV cache overflows. KVBM creates a 4-tier hierarchy, automatically spilling to cheaper storage.
Think of a computer's own memory hierarchy: L1 cache (tiny, blazing fast) → L2 → RAM → SSD. KVBM does the same for KV cache: GPU HBM (fastest) → CPU RAM → Local SSD → Remote storage. Hot data stays in GPU memory; cold data moves down. When needed again, it's fetched back up via NIXL.
3-layer architecture: (1) Runtime connectors hooking into vLLM/TRT-LLM's cache management, (2) Logic layer with block allocation, lifecycle, eviction (LRU), (3) NIXL layer for actual data movement between tiers.
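A minimal sketch of the spill-down behavior using an LRU per tier (purely illustrative; KVBM operates on fixed-size KV blocks and moves the actual bytes through NIXL):

```python
from collections import OrderedDict

# Illustrative tiered cache: hot blocks stay high, LRU blocks spill downward.
class TieredKVCache:
    def __init__(self, capacities):
        # e.g. {"GPU HBM": 2, "CPU RAM": 4, "Local SSD": 8}
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]

    def put(self, block_id, block, tier=0):
        name, cap, store = self.tiers[tier]
        store[block_id] = block
        store.move_to_end(block_id)                   # mark most recently used
        if len(store) > cap and tier + 1 < len(self.tiers):
            victim, data = store.popitem(last=False)  # evict the LRU block...
            self.put(victim, data, tier + 1)          # ...and spill it one tier down

    def get(self, block_id):
        for tier, (name, cap, store) in enumerate(self.tiers):
            if block_id in store:
                data = store.pop(block_id)
                self.put(block_id, data, 0)           # promote back to the GPU tier
                return name, data
        return None, None

cache = TieredKVCache({"GPU HBM": 2, "CPU RAM": 4, "Local SSD": 8})
for i in range(5):
    cache.put(f"block-{i}", b"kv-bytes")
print(cache.get("block-0"))  # found in CPU RAM, promoted back to GPU HBM
```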
Demand fluctuates. At 2 AM you need 2 GPUs; at noon, 32. The Planner monitors metrics and dynamically adjusts how many GPUs run prefill vs decode.
It's a hospital shift manager. Quiet night: 2 surgeons, 4 nurses. ER fills up: reassign staff — more triage (prefill), more surgeons (decode), call backup (scale up). The Planner watches two vital signs: prefill queue depth (requests waiting?) and decode KV utilization (decode GPUs running out of memory?).
Watch the planner respond to a traffic spike by rebalancing workers.
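The core decision logic is a small control loop over those two signals. A hedged sketch (the thresholds here are invented for illustration; the real planner works against the TTFT/ITL targets listed in the configuration tables below):

```python
# Illustrative planner step: look at two signals, rebalance within a GPU budget.
def plan_step(prefill_queue_depth, decode_kv_utilization,
              prefill_gpus, decode_gpus, max_gpu_budget=8):
    total = prefill_gpus + decode_gpus
    # Requests queueing for prefill -> TTFT suffers, add prefill capacity.
    if prefill_queue_depth > 4 and total < max_gpu_budget:
        prefill_gpus += 1
    # Decode GPUs running out of KV memory -> ITL suffers, add decode capacity.
    elif decode_kv_utilization > 0.9 and total < max_gpu_budget:
        decode_gpus += 1
    # Everything idle -> scale down to save money.
    elif prefill_queue_depth == 0 and decode_kv_utilization < 0.3 and total > 2:
        if prefill_gpus > 1:
            prefill_gpus -= 1
        elif decode_gpus > 1:
            decode_gpus -= 1
    return prefill_gpus, decode_gpus

print(plan_step(prefill_queue_depth=10, decode_kv_utilization=0.5,
                prefill_gpus=1, decode_gpus=3))  # (2, 3): spike, add prefill
```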
Dynamo is engine-agnostic; which backend you pick depends on your priorities:
SGLang
Best for: High throughput
Strengths: RadixAttention, efficient memory, fast compile
Install: uv pip install "ai-dynamo[sglang]"
TensorRT-LLM
Best for: Maximum performance, GB200
Strengths: WideEP, MTP, speculative decoding, Blackwell-optimized
Install: pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
vLLM
Best for: Broadest feature set
Strengths: LoRA, prompt embeds, request migration, multimodal
Install: uv pip install "ai-dynamo[vllm]"
| Feature | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|
| Disaggregated Serving | Yes | Yes | Yes |
| KV-Aware Routing | Yes | Yes | Yes |
| SLA-Based Planner | Yes | Yes | Yes |
| KV Block Manager | Yes | Yes | Planned |
| Multimodal (Image) | Yes | Yes | Yes |
| Multimodal (Video) | Yes | — | — |
| Request Migration | Yes | Limited | Yes |
| LoRA | Yes | — | — |
| Tool Calling | Yes | Yes | Yes |
| Speculative Decoding | Yes | Yes | WIP |
| WideEP (MoE) | — | Yes | — |
| GB200 Support | — | Yes | — |
Fastest way — Docker containers with everything pre-installed:
# Pick your backend
docker run --gpus all --network host --rm -it \
nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1 # SGLang
docker run --gpus all --network host --rm -it \
nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1 # TensorRT-LLM
docker run --gpus all --network host --rm -it \
nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1 # vLLM
Start the frontend and a worker:
python3 -m dynamo.frontend --http-port 8000 --store-kv file
# SGLang
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --store-kv file
# TensorRT-LLM
python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --store-kv file
# vLLM (note: --model, not --model-path)
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --store-kv file \
--kv-events-config '{"enable_kv_cache_events": false}'
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true,
"max_tokens": 200
}'
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv venv && source venv/bin/activate && uv pip install pip
# Pick one:
uv pip install "ai-dynamo[sglang]"
uv pip install "ai-dynamo[vllm]"
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
# 1. System deps
sudo apt install -y build-essential libhwloc-dev libudev-dev \
pkg-config libclang-dev protobuf-compiler python3-dev cmake
# 2. Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
# 3. Python + build tools
uv venv dynamo && source dynamo/bin/activate
uv pip install pip maturin
# 4. Build Rust bindings
cd lib/bindings/python && maturin develop --uv
# 5. Install
cd $PROJECT_ROOT
uv pip install -e lib/gpu_memory_service
uv pip install -e .
Dynamo uses Grove — a topology-optimized Kubernetes operator that understands GPU relationships and manages the inference graph as one resource.
Normal K8s treats pods as independent. Like scheduling an orchestra by randomly placing musicians in different rooms. Grove understands the violin section must be near the conductor (prefill GPUs near decode GPUs on the same NVLink domain). It schedules with GPU topology awareness.
export NAMESPACE=dynamo-system VERSION=0.9.0
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${VERSION}.tgz
helm install dynamo-crds dynamo-crds-${VERSION}.tgz -n default
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${VERSION}.tgz
helm install dynamo-platform dynamo-platform-${VERSION}.tgz \
-n ${NAMESPACE} --create-namespace
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llama-70b-disagg
spec:
  services:
    Frontend:
      replicas: 2
    PrefillWorker:
      replicas: 4
      resources:
        limits:
          nvidia.com/gpu: "2"
    DecodeWorker:
      replicas: 4
      resources:
        limits:
          nvidia.com/gpu: "2"
| Model | Backend | Mode | GPUs |
|---|---|---|---|
| Llama-3-70B | vLLM | Aggregated | 4x H100 |
| Llama-3-70B | vLLM | Disaggregated | 8-16x H100 |
| DeepSeek-R1 | SGLang | Disaggregated | 16-32x H200 |
| DeepSeek-R1 | TRT-LLM | Disagg + WideEP | 32+4 GB200 |
| Qwen3-32B | vLLM | Disagg + KV Router | 16x H200 |
| Qwen3-235B MoE | TRT-LLM | Agg TP4xEP4 | 16x GPU |
| Deployment | etcd required? | NATS required? | Notes |
|---|---|---|---|
| Local dev | No | No | --store-kv file |
| Kubernetes | No | No | K8s-native CRDs |
| Slurm | Yes | Yes | docker compose -f deploy/docker-compose.yml up -d |
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/chat/completions | Chat completion (streaming + non-streaming) |
| POST | /v1/embeddings | Text embeddings |
| GET | /v1/models | List models |
| GET | /openapi.json | OpenAPI 3 spec |
| GET | /metrics | Prometheus metrics |
| Parser | Models |
|---|---|
| hermes | Qwen2.5, Hermes |
| mistral | Mistral function-calling |
| llama3_json | Llama 3.1/3.2 |
| deepseek_v3 | DeepSeek V3/R1 |
| pythonic | Llama 4 |
| phi4 | Phi-4 |
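Tool calling goes through the same OpenAI-compatible endpoint; the configured parser determines how each model family's function-call output is interpreted. A minimal sketch with the OpenAI Python SDK (the tool definition and model are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Standard OpenAI-style tool definition; the parser (e.g. hermes for Qwen
# models) handles how the model emits the call on the server side.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```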
| Flag | Env Var | Default | What it does |
|---|---|---|---|
| --http-port | — | 8000 | HTTP listen port |
| --router-mode | DYN_ROUTER_MODE | round-robin | Routing strategy |
| --store-kv | — | etcd | KV store (file for local) |
| --kv-events | DYN_KV_EVENTS | false | KV event publishing |
| Flag | Default | What it does |
|---|---|---|
| --adjustment-interval | 180s | Time between scaling decisions |
| --ttft | 500ms | Target Time To First Token |
| --itl | 50ms | Target Inter-Token Latency |
| --max-gpu-budget | 8 | Maximum total GPUs |
export DYN_LOG=debug # Everything
export DYN_LOG=dynamo_llm=debug,dynamo_runtime=info # Per-crate
| Architecture | Examples | Status |
|---|---|---|
| Blackwell | B100, B200, GB200 NVL72 | Full support |
| Hopper | H100, H200 | Full support |
| Ada Lovelace | L4, L40, L40S | Full support |
| Ampere | A100, A10G | Full support |
| OS | x86_64 | ARM64 |
|---|---|---|
| Ubuntu 24.04 | Yes | Yes |
| Ubuntu 22.04 | Yes | — |
| CentOS Stream 9 | Experimental | — |
| Dynamo | vLLM | SGLang | TensorRT-LLM | NIXL |
|---|---|---|---|---|
| main | 0.15.1 | 0.5.8 | 1.3.0rc1 | 0.9.0 |
| v0.8.1 | 0.12.0 | 0.5.6 | 1.2.0rc6 | 0.8.0 |
NVIDIA Dynamo — Apache 2.0 License — Rust + Python