Large language models are expensive and slow. A naive deployment can easily cost 10x more than necessary while delivering poor latency. This guide shows how to deploy vLLM for production inference, covering everything from basic server setup to Kubernetes autoscaling and observability.
vLLM has become the de facto standard for self-hosted LLM inference. Companies like Meta, Mistral AI, IBM, and Stripe use it in production; Stripe reported a 73% cost reduction after switching to vLLM. Three innovations make this possible:
- PagedAttention: Non-contiguous KV cache memory management, like virtual memory paging for attention. This eliminates 60-80% of memory waste from fragmentation.
- Continuous Batching: Instead of waiting for the slowest sequence in a batch, vLLM schedules at the iteration level, dynamically replacing completed sequences. This achieves 2-24x throughput improvement.
- OpenAI-Compatible API: Drop-in replacement for OpenAI’s API, making migration trivial.
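The continuous-batching claim is easy to sanity-check with a toy model. Here is a minimal sketch (pure Python, no vLLM required; the one-token-per-iteration cost model is a deliberate simplification) comparing static batching, which waits for the longest sequence in each batch, with iteration-level scheduling, which refills a slot as soon as a sequence finishes:

```python
def static_batch_iters(lengths: list[int], batch_size: int) -> int:
    """Static batching: each batch occupies the GPU until its
    longest sequence finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batch_iters(lengths: list[int], batch_size: int) -> int:
    """Iteration-level scheduling: a finished sequence's slot is
    refilled immediately, so short requests never wait on long ones."""
    pending = list(lengths)
    active: list[int] = []  # remaining tokens per in-flight sequence
    iters = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        step = min(active)  # iterations until the next sequence completes
        iters += step
        active = [r - step for r in active if r > step]
    return iters


# Mixed short/long workload: continuous batching finishes far sooner.
lengths = [1, 10, 1, 10]
print(static_batch_iters(lengths, 2))      # 20 iterations
print(continuous_batch_iters(lengths, 2))  # 12 iterations
```

The gap widens as output lengths become more skewed, which is exactly the regime of real chat traffic.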
We’ll cover:
- Local setup with Docker
- Client patterns (sync, async, streaming)
- Parallel processing for throughput
- Production architecture with load balancing
- Kubernetes deployment with autoscaling
- Cloud deployment patterns (AWS, GCP)
- Performance optimization
- Error handling and resilience
- Monitoring and alerting
Quick Start: Running vLLM Locally
The fastest way to get started is with Docker. Here’s a docker-compose.yml that runs Mistral-7B-Instruct:
# docker-compose.yml for local development
docker_compose = """
version: '3.8'
services:
vllm:
image: vllm/vllm-openai:latest
container_name: vllm-server
runtime: nvidia # Requires NVIDIA Container Toolkit
ports:
- "8000:8000"
environment:
- NVIDIA_VISIBLE_DEVICES=all
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
command: >
--model mistralai/Mistral-7B-Instruct-v0.3
--tensor-parallel-size 1
--gpu-memory-utilization 0.90
--max-model-len 8192
--enable-prefix-caching
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 120s
"""
print(docker_compose)

Key configuration options:
| Flag | Description | Recommended Value |
|---|---|---|
| `--gpu-memory-utilization` | Fraction of GPU memory vLLM may use (weights + activations + KV cache) | 0.85-0.95 |
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism | 1 for 7B models, 2-4 for 70B+ |
| `--max-model-len` | Maximum context length | Based on use case |
| `--enable-prefix-caching` | Cache common prompt prefixes across requests | Always enable |
| `--quantization` | Model quantization (awq, gptq, fp8) | Use if memory constrained |
Data Models
Before diving into client patterns, let’s define our data models. We use Pydantic for validation and dataclasses for configuration.
from dataclasses import dataclass
from pydantic import BaseModel, Field
from typing import Literal
import uuid
@dataclass
class VLLMConfig:
"""Configuration for vLLM client connections."""
base_url: str = "http://localhost:8000/v1"
model: str = "mistralai/Mistral-7B-Instruct-v0.3"
timeout: float = 120.0
max_retries: int = 3
# Generation defaults
max_tokens: int = 1024
temperature: float = 0.7
top_p: float = 0.95
def get_headers(self, api_key: str | None = None) -> dict[str, str]:
"""Get request headers."""
headers = {"Content-Type": "application/json"}
if api_key:
headers["Authorization"] = f"Bearer {api_key}"
return headers
class ChatMessage(BaseModel):
"""A single message in a conversation."""
role: Literal["system", "user", "assistant"]
content: str
class CompletionResponse(BaseModel):
"""Response from completion API."""
request_id: str
content: str
model: str
prompt_tokens: int
completion_tokens: int
total_tokens: int
latency_ms: float
finish_reason: str

Client Pattern 1: Synchronous HTTP
The simplest approach uses synchronous HTTP requests. This is appropriate for scripts, notebooks, and applications where you’re making occasional requests.
import httpx
import time
def create_completion_sync(
config: VLLMConfig,
messages: list[ChatMessage],
max_tokens: int | None = None,
temperature: float | None = None,
) -> CompletionResponse:
"""
Synchronous completion request to vLLM.
Use for simple scripts or when async isn't needed.
Connection is created and closed per request.
"""
start_time = time.perf_counter()
payload = {
"model": config.model,
"messages": [m.model_dump() for m in messages],
"max_tokens": max_tokens or config.max_tokens,
"temperature": temperature or config.temperature,
}
with httpx.Client(timeout=config.timeout) as client:
response = client.post(
f"{config.base_url}/chat/completions",
json=payload,
headers=config.get_headers(),
)
response.raise_for_status()
data = response.json()
latency_ms = (time.perf_counter() - start_time) * 1000
return CompletionResponse(
request_id=data["id"],
content=data["choices"][0]["message"]["content"],
model=data["model"],
prompt_tokens=data["usage"]["prompt_tokens"],
completion_tokens=data["usage"]["completion_tokens"],
total_tokens=data["usage"]["total_tokens"],
latency_ms=latency_ms,
finish_reason=data["choices"][0]["finish_reason"],
)
# Example usage
# config = VLLMConfig()
# messages = [ChatMessage(role="user", content="Explain PagedAttention in one sentence.")]
# response = create_completion_sync(config, messages)
# print(f"Response: {response.content}")
# print(f"Latency: {response.latency_ms:.0f}ms")Client Pattern 2: Async HTTP
For high-throughput applications, async clients are essential. They allow concurrent requests without blocking, maximizing GPU utilization on the server.
import asyncio
async def create_completion_async(
config: VLLMConfig,
messages: list[ChatMessage],
client: httpx.AsyncClient | None = None,
max_tokens: int | None = None,
) -> CompletionResponse:
"""
Async completion request - preferred for high throughput.
Pass a shared client for connection pooling across requests.
This significantly reduces connection overhead.
"""
start_time = time.perf_counter()
payload = {
"model": config.model,
"messages": [m.model_dump() for m in messages],
"max_tokens": max_tokens or config.max_tokens,
"temperature": config.temperature,
}
should_close = client is None
client = client or httpx.AsyncClient(timeout=config.timeout)
try:
response = await client.post(
f"{config.base_url}/chat/completions",
json=payload,
headers=config.get_headers(),
)
response.raise_for_status()
data = response.json()
finally:
if should_close:
await client.aclose()
latency_ms = (time.perf_counter() - start_time) * 1000
return CompletionResponse(
request_id=data["id"],
content=data["choices"][0]["message"]["content"],
model=data["model"],
prompt_tokens=data["usage"]["prompt_tokens"],
completion_tokens=data["usage"]["completion_tokens"],
total_tokens=data["usage"]["total_tokens"],
latency_ms=latency_ms,
finish_reason=data["choices"][0]["finish_reason"],
)

Client Pattern 3: OpenAI SDK
vLLM’s OpenAI-compatible API means you can use the official OpenAI Python SDK. This is the easiest migration path if you’re already using OpenAI’s API.
from openai import OpenAI, AsyncOpenAI
def create_completion_openai_sdk(
config: VLLMConfig,
messages: list[ChatMessage],
) -> CompletionResponse:
"""
Use OpenAI SDK with vLLM - simplest integration path.
vLLM's OpenAI-compatible API means existing OpenAI code
works with minimal changes (just update base_url).
"""
start_time = time.perf_counter()
client = OpenAI(
base_url=config.base_url,
api_key="not-needed", # vLLM doesn't require API key by default
)
response = client.chat.completions.create(
model=config.model,
messages=[{"role": m.role, "content": m.content} for m in messages],
max_tokens=config.max_tokens,
temperature=config.temperature,
)
latency_ms = (time.perf_counter() - start_time) * 1000
return CompletionResponse(
request_id=response.id,
content=response.choices[0].message.content or "",
model=response.model,
prompt_tokens=response.usage.prompt_tokens if response.usage else 0,
completion_tokens=response.usage.completion_tokens if response.usage else 0,
total_tokens=response.usage.total_tokens if response.usage else 0,
latency_ms=latency_ms,
finish_reason=response.choices[0].finish_reason or "unknown",
)
async def create_completion_openai_sdk_async(
config: VLLMConfig,
messages: list[ChatMessage],
) -> CompletionResponse:
"""Async version using OpenAI's AsyncOpenAI client."""
start_time = time.perf_counter()
client = AsyncOpenAI(
base_url=config.base_url,
api_key="not-needed",
)
response = await client.chat.completions.create(
model=config.model,
messages=[{"role": m.role, "content": m.content} for m in messages],
max_tokens=config.max_tokens,
temperature=config.temperature,
)
latency_ms = (time.perf_counter() - start_time) * 1000
return CompletionResponse(
request_id=response.id,
content=response.choices[0].message.content or "",
model=response.model,
prompt_tokens=response.usage.prompt_tokens if response.usage else 0,
completion_tokens=response.usage.completion_tokens if response.usage else 0,
total_tokens=response.usage.total_tokens if response.usage else 0,
latency_ms=latency_ms,
finish_reason=response.choices[0].finish_reason or "unknown",
)

Client Pattern 4: Streaming Responses
For interactive applications, streaming is essential. Users see the response building in real-time, which dramatically improves perceived latency. vLLM supports Server-Sent Events (SSE) for streaming.
from typing import AsyncIterator, Iterator
def stream_completion(
config: VLLMConfig,
messages: list[ChatMessage],
) -> Iterator[str]:
"""
Stream tokens as they're generated.
Essential for interactive applications - users see
the response building in real-time.
"""
client = OpenAI(base_url=config.base_url, api_key="not-needed")
stream = client.chat.completions.create(
model=config.model,
messages=[{"role": m.role, "content": m.content} for m in messages],
max_tokens=config.max_tokens,
stream=True,
)
for chunk in stream:
if chunk.choices and chunk.choices[0].delta.content:
yield chunk.choices[0].delta.content
async def stream_completion_async(
config: VLLMConfig,
messages: list[ChatMessage],
) -> AsyncIterator[str]:
"""Async streaming for concurrent applications."""
client = AsyncOpenAI(base_url=config.base_url, api_key="not-needed")
stream = await client.chat.completions.create(
model=config.model,
messages=[{"role": m.role, "content": m.content} for m in messages],
max_tokens=config.max_tokens,
stream=True,
)
async for chunk in stream:
if chunk.choices and chunk.choices[0].delta.content:
yield chunk.choices[0].delta.content
# Example: Collecting streamed response
# async def get_streamed_response(config: VLLMConfig, messages: list[ChatMessage]) -> str:
# tokens = []
# async for token in stream_completion_async(config, messages):
# tokens.append(token)
# print(token, end="", flush=True)
# return "".join(tokens)Parallel Processing: Batched Concurrent Requests
To maximize throughput, you want to keep multiple requests in flight simultaneously. vLLM’s continuous batching handles the server-side batching automatically, but you need to send concurrent requests to feed it.
async def process_batch_concurrent(
config: VLLMConfig,
requests: list[list[ChatMessage]],
max_concurrency: int = 10,
) -> list[CompletionResponse | Exception]:
"""
Process multiple requests concurrently with controlled parallelism.
This maximizes throughput while respecting server capacity.
The semaphore limits concurrent requests to avoid overwhelming
the server or running out of memory.
Args:
config: vLLM client configuration
requests: List of message lists (one per request)
max_concurrency: Maximum parallel requests
Returns:
List of responses (or exceptions for failed requests)
"""
semaphore = asyncio.Semaphore(max_concurrency)
async def limited_request(
messages: list[ChatMessage],
client: httpx.AsyncClient,
) -> CompletionResponse:
async with semaphore:
return await create_completion_async(config, messages, client)
# Use a single client for connection pooling
async with httpx.AsyncClient(timeout=config.timeout) as client:
tasks = [limited_request(msgs, client) for msgs in requests]
results = await asyncio.gather(*tasks, return_exceptions=True)
return results
def count_successes_and_failures(
results: list[CompletionResponse | Exception],
) -> tuple[int, int]:
"""Count successful and failed requests from batch results."""
successes = sum(1 for r in results if isinstance(r, CompletionResponse))
failures = len(results) - successes
return successes, failures

Parallel Processing: ThreadPoolExecutor for Sync Code
If you’re integrating with a sync codebase that can’t easily adopt async/await, ThreadPoolExecutor provides parallelism.
from concurrent.futures import ThreadPoolExecutor, as_completed
def process_batch_threaded(
config: VLLMConfig,
requests: list[list[ChatMessage]],
max_workers: int = 10,
) -> list[tuple[int, CompletionResponse | Exception]]:
"""
Thread pool executor for parallel processing in sync code.
Use when integrating with sync codebases that can't
easily adopt async/await. Note that threads have
higher overhead than async coroutines.
Returns:
List of (index, response) tuples, sorted by original order
"""
results: list[tuple[int, CompletionResponse | Exception]] = []
with ThreadPoolExecutor(max_workers=max_workers) as executor:
future_to_idx = {
executor.submit(create_completion_sync, config, msgs): idx
for idx, msgs in enumerate(requests)
}
for future in as_completed(future_to_idx):
idx = future_to_idx[future]
try:
response = future.result()
results.append((idx, response))
except Exception as e:
results.append((idx, e))
# Sort by original order
results.sort(key=lambda x: x[0])
return results

Visualizing Client Performance
Let’s visualize the latency characteristics of different client patterns. This simulated data reflects typical production behavior.
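Before charting, it helps to reduce raw latencies to percentiles, since p50/p95/p99 are the numbers worth alerting on. A dependency-free nearest-rank implementation (one reasonable percentile definition among several):

```python
import math


def latency_percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of latency samples (pct in (0, 100])."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: the smallest sample such that pct% of samples are <= it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies = [120.0, 95.0, 180.0, 110.0, 300.0, 105.0, 130.0, 98.0, 250.0, 115.0]
print(latency_percentile(latencies, 50))  # 115.0 (p50)
print(latency_percentile(latencies, 95))  # 300.0 (p95)
```

Note the gap between p50 and p95: LLM latency distributions are heavy-tailed, so averages alone are misleading.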
import altair as alt
import polars as pl
# Simulated latency data for different client patterns
# Based on typical production measurements with Mistral-7B
latency_data = pl.DataFrame({
"pattern": ["Sync"] * 100 + ["Async"] * 100 + ["Async Batched"] * 100,
"latency_ms": (
# Sync: higher latency, more variance due to connection overhead
[150 + i * 2 + (i % 10) * 5 for i in range(100)] +
# Async: lower latency with connection pooling
[120 + i * 1.5 + (i % 10) * 3 for i in range(100)] +
# Async Batched: lowest per-request latency due to server batching
[80 + i * 1 + (i % 10) * 2 for i in range(100)]
),
"request_num": list(range(100)) * 3,
})
chart = alt.Chart(latency_data).mark_line().encode(
x=alt.X("request_num:Q", title="Request Number"),
y=alt.Y("latency_ms:Q", title="Latency (ms)"),
color=alt.Color("pattern:N", title="Client Pattern"),
strokeDash=alt.StrokeDash("pattern:N"),
).properties(
title="Latency by Client Pattern",
width=600,
height=300,
)
chart

Throughput Scaling with Concurrency
Higher concurrency generally means higher throughput, up to the point where the GPU becomes the bottleneck.
# Simulated throughput data showing scaling with concurrency
throughput_data = pl.DataFrame({
"concurrency": [1, 2, 4, 8, 16, 32, 64, 128],
"throughput_rps": [5, 9.5, 18, 32, 55, 85, 110, 115], # Requests per second
"tokens_per_second": [500, 950, 1800, 3200, 5500, 8500, 11000, 11500],
})
base = alt.Chart(throughput_data).encode(
x=alt.X("concurrency:Q", title="Concurrent Requests", scale=alt.Scale(type="log")),
)
line = base.mark_line(color="steelblue", strokeWidth=2).encode(
y=alt.Y("throughput_rps:Q", title="Throughput (requests/sec)"),
)
points = base.mark_point(color="steelblue", size=60).encode(
y=alt.Y("throughput_rps:Q"),
)
# Add annotation for saturation point
annotation = alt.Chart(pl.DataFrame({
"x": [64],
"y": [110],
"text": ["GPU Saturation →"],
})).mark_text(align="right", dx=-10, fontSize=12).encode(
x="x:Q",
y="y:Q",
text="text:N",
)
chart = (line + points + annotation).properties(
title="Throughput Scaling with Concurrency (Mistral-7B on A100)",
width=500,
height=300,
)
chart

Production Architecture
For high availability, deploy multiple vLLM instances behind a load balancer. Here’s the architecture:
┌─────────────────────────────────────────────┐
│ Load Balancer │
│ (nginx/HAProxy/AWS ALB/GCP GLB) │
└─────────────────┬───────────────────────────┘
│
┌───────────────────────────┼───────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ vLLM Pod 1 │ │ vLLM Pod 2 │ │ vLLM Pod 3 │
│ (GPU: A100) │ │ (GPU: A100) │ │ (GPU: A100) │
│ │ │ │ │ │
│ Model: Mistral │ │ Model: Mistral │ │ Model: Mistral │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
└───────────────────────────┼───────────────────────────┘
│
┌─────────────────▼───────────────────────────┐
│ Shared Storage │
│ (Model weights: EFS/GCS/S3) │
└─────────────────────────────────────────────┘
Load Balancer Configuration
nginx with least_conn balancing works well for LLM inference since request durations vary significantly.
nginx_config = """
upstream vllm_backends {
least_conn; # Route to server with fewest active connections
server vllm-1:8000 weight=1 max_fails=3 fail_timeout=30s;
server vllm-2:8000 weight=1 max_fails=3 fail_timeout=30s;
server vllm-3:8000 weight=1 max_fails=3 fail_timeout=30s;
keepalive 32; # Connection pooling to backends
}
server {
listen 80;
# Health check endpoint
location /health {
proxy_pass http://vllm_backends/health;
proxy_connect_timeout 5s;
proxy_read_timeout 10s;
}
# API endpoints
location /v1 {
proxy_pass http://vllm_backends/v1;
# Timeouts for long-running inference
proxy_connect_timeout 10s;
proxy_send_timeout 120s;
proxy_read_timeout 120s;
# Streaming support (disable buffering)
proxy_buffering off;
proxy_http_version 1.1;
proxy_set_header Connection "";
# Request headers
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Request-ID $request_id;
}
}
"""
print(nginx_config)

Health Checks
Proper health checks are critical for load balancer routing and Kubernetes readiness probes.
from typing import Any
async def check_vllm_health(
base_url: str,
timeout: float = 10.0,
) -> dict[str, Any]:
"""
Comprehensive health check for vLLM server.
Checks:
1. Health endpoint responds
2. Model is loaded
3. Inference works (optional, adds latency)
Returns:
Health status dict with individual check results
"""
async with httpx.AsyncClient(timeout=timeout) as client:
health: dict[str, Any] = {"healthy": False, "checks": {}}
# Basic health endpoint
try:
resp = await client.get(f"{base_url}/health")
health["checks"]["health_endpoint"] = resp.status_code == 200
except Exception as e:
health["checks"]["health_endpoint"] = False
health["error"] = str(e)
return health
# Check models are loaded
try:
resp = await client.get(f"{base_url}/v1/models")
data = resp.json()
health["checks"]["models_loaded"] = len(data.get("data", [])) > 0
health["models"] = [m["id"] for m in data.get("data", [])]
except Exception:
health["checks"]["models_loaded"] = False
health["healthy"] = all(health["checks"].values())
return health
async def wait_for_healthy(
base_url: str,
timeout: float = 300.0,
interval: float = 5.0,
) -> bool:
"""
Wait for vLLM server to become healthy.
Useful for startup scripts and integration tests.
"""
start = time.perf_counter()
while time.perf_counter() - start < timeout:
health = await check_vllm_health(base_url)
if health["healthy"]:
return True
await asyncio.sleep(interval)
return False

Kubernetes Deployment
Here’s a production-ready Kubernetes deployment with GPU support, resource limits, and autoscaling.
k8s_manifests = """
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
labels:
app: vllm
spec:
replicas: 3
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8000"
prometheus.io/path: "/metrics"
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
ports:
- containerPort: 8000
name: http
args:
- "--model"
- "mistralai/Mistral-7B-Instruct-v0.3"
- "--tensor-parallel-size"
- "1"
- "--gpu-memory-utilization"
- "0.90"
- "--max-model-len"
- "8192"
- "--enable-prefix-caching"
resources:
limits:
nvidia.com/gpu: 1
memory: "32Gi"
requests:
nvidia.com/gpu: 1
memory: "24Gi"
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120 # Model loading takes time
periodSeconds: 10
failureThreshold: 3
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 180
periodSeconds: 30
failureThreshold: 3
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache-pvc
- name: shm
emptyDir:
medium: Memory
sizeLimit: 16Gi # Shared memory for PyTorch
nodeSelector:
nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
---
# Service
apiVersion: v1
kind: Service
metadata:
name: vllm-service
spec:
selector:
app: vllm
ports:
- port: 80
targetPort: 8000
name: http
type: ClusterIP
---
# HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-server
minReplicas: 2
maxReplicas: 10
metrics:
- type: Pods
pods:
metric:
name: vllm_num_requests_running
target:
type: AverageValue
averageValue: "50"
"""
print(k8s_manifests)

Cloud Deployment: AWS
EKS with Karpenter
Karpenter provides fast, cost-effective GPU node provisioning for EKS.
karpenter_provisioner = """
# Karpenter NodePool for GPU nodes
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: gpu-pool
spec:
template:
spec:
requirements:
- key: kubernetes.io/arch
operator: In
values: ["amd64"]
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"] # Use on-demand for LLM inference stability
- key: node.kubernetes.io/instance-type
operator: In
values:
- p4d.24xlarge # 8x A100 40GB
- p4de.24xlarge # 8x A100 80GB
- p5.48xlarge # 8x H100
nodeClassRef:
name: gpu-node-class
limits:
cpu: 1000
nvidia.com/gpu: 100
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 30m
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: gpu-node-class
spec:
amiFamily: AL2
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 500Gi
volumeType: gp3
throughput: 500
iops: 5000
role: "KarpenterNodeRole"
"""
print(karpenter_provisioner)

SageMaker Endpoint
For managed deployment, SageMaker provides autoscaling and monitoring out of the box.
# SageMaker deployment configuration
def create_sagemaker_endpoint_config(
model_name: str = "mistral-7b-instruct",
instance_type: str = "ml.g5.2xlarge",
initial_instance_count: int = 2,
) -> dict[str, Any]:
"""
Create SageMaker endpoint configuration for vLLM.
Note: Requires custom container with vLLM installed.
"""
return {
"EndpointConfigName": f"{model_name}-config",
"ProductionVariants": [
{
"VariantName": "primary",
"ModelName": model_name,
"InstanceType": instance_type,
"InitialInstanceCount": initial_instance_count,
"ContainerStartupHealthCheckTimeoutInSeconds": 600,
"ModelDataDownloadTimeoutInSeconds": 1200,
"RoutingConfig": {
"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"
},
}
],
}
def create_autoscaling_policy(
endpoint_name: str,
min_capacity: int = 2,
max_capacity: int = 10,
target_invocations_per_instance: int = 50,
) -> dict[str, Any]:
"""
Create autoscaling policy for SageMaker endpoint.
Scales based on invocations per instance.
"""
return {
"PolicyName": f"{endpoint_name}-scaling-policy",
"ServiceNamespace": "sagemaker",
"ResourceId": f"endpoint/{endpoint_name}/variant/primary",
"ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
"PolicyType": "TargetTrackingScaling",
"TargetTrackingScalingPolicyConfiguration": {
"TargetValue": target_invocations_per_instance,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
},
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60,
},
}

Cloud Deployment: GCP
GKE with GPU Node Pools
# GKE GPU node pool configuration (gcloud CLI)
gke_commands = """
# Create GPU node pool
gcloud container node-pools create gpu-pool \\
--cluster=vllm-cluster \\
--zone=us-central1-a \\
--machine-type=a2-highgpu-1g \\
--accelerator=type=nvidia-tesla-a100,count=1 \\
--num-nodes=2 \\
--enable-autoscaling \\
--min-nodes=1 \\
--max-nodes=10 \\
--disk-size=500GB \\
--disk-type=pd-ssd
# Install NVIDIA GPU drivers
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
"""
print(gke_commands)

Cloud Run with GPU (Preview)
Cloud Run now supports GPU workloads, enabling serverless LLM inference.
cloud_run_config = """
# Cloud Run service YAML
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
name: vllm-service
spec:
template:
metadata:
annotations:
run.googleapis.com/execution-environment: gen2
run.googleapis.com/gpu-type: nvidia-l4
autoscaling.knative.dev/minScale: "1" # Avoid cold starts
autoscaling.knative.dev/maxScale: "10"
spec:
containerConcurrency: 100
timeoutSeconds: 300
containers:
- image: gcr.io/PROJECT_ID/vllm-server:latest
ports:
- containerPort: 8000
resources:
limits:
cpu: "8"
memory: 32Gi
nvidia.com/gpu: "1"
env:
- name: MODEL_ID
value: "mistralai/Mistral-7B-Instruct-v0.3"
"""
print(cloud_run_config)

Performance Optimization: KV Cache
The KV cache is often the memory bottleneck. Understanding its memory requirements helps tune --gpu-memory-utilization.
def calculate_kv_cache_memory(
num_layers: int,
hidden_size: int,  # not used in the formula below; for reference, hidden_size = num_heads * head_dim
num_kv_heads: int,
head_dim: int,
max_context_len: int,
max_batch_size: int,
dtype_bytes: int = 2, # FP16 = 2 bytes, FP8 = 1 byte
) -> dict[str, float]:
"""
Calculate KV cache memory requirements.
The KV cache stores key and value tensors for all previous
tokens in all layers. This is often the memory bottleneck.
Args:
num_layers: Number of transformer layers
hidden_size: Model hidden dimension
num_kv_heads: Number of key-value heads (can differ from query heads in GQA)
head_dim: Dimension per attention head
max_context_len: Maximum sequence length
max_batch_size: Maximum concurrent sequences
dtype_bytes: Bytes per element (2 for FP16, 1 for FP8)
Returns:
Dict with memory requirements in GB
"""
# Per-token KV cache: 2 (K+V) * layers * kv_heads * head_dim * dtype_bytes
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
# Total for max batch at max context
total_bytes = bytes_per_token * max_context_len * max_batch_size
total_gb = total_bytes / (1024**3)
return {
"bytes_per_token": bytes_per_token,
"gb_per_sequence_max_context": (bytes_per_token * max_context_len) / (1024**3),
"total_kv_cache_gb": total_gb,
}
# Example: Mistral-7B with GQA (8 KV heads)
mistral_kv = calculate_kv_cache_memory(
num_layers=32,
hidden_size=4096,
num_kv_heads=8, # Grouped Query Attention
head_dim=128,
max_context_len=8192,
max_batch_size=32,
dtype_bytes=2, # FP16
)
print("Mistral-7B KV Cache Requirements:")
for key, value in mistral_kv.items():
if "gb" in key:
print(f" {key}: {value:.2f} GB")
else:
print(f" {key}: {value:,} bytes")Performance Optimization: Quantization
Quantization reduces memory usage and can improve throughput at a small quality cost.
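As a first-order estimate, weight memory scales linearly with bits per parameter. A quick sketch (real quantized checkpoints add a few percent of overhead for scales and zero-points, which this ignores):

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate model weight memory in GB at a given precision."""
    # params (billions) * bits / 8 bits-per-byte = gigabytes
    return params_billions * bits_per_param / 8


for name, bits in [("FP16", 16), ("FP8", 8), ("AWQ/GPTQ 4-bit", 4)]:
    print(f"7B @ {name}: {weight_memory_gb(7, bits):.1f} GB")
# 7B @ FP16: 14.0 GB
# 7B @ FP8: 7.0 GB
# 7B @ AWQ/GPTQ 4-bit: 3.5 GB
```

Every gigabyte freed from weights becomes KV cache, which is why 4-bit quantization often improves throughput as well as fitting larger batches.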
# Quantization comparison data
quantization_data = pl.DataFrame({
"method": ["FP16 (baseline)", "FP8", "AWQ (4-bit)", "GPTQ (4-bit)"],
"memory_reduction_pct": [0, 50, 75, 75],
"throughput_improvement_pct": [0, 15, 40, 30],
"quality_degradation_pct": [0, 0.5, 2, 3],
"vllm_flag": [
"(default)",
"--quantization fp8",
"--quantization awq",
"--quantization gptq",
],
})
# Memory reduction chart
memory_chart = alt.Chart(quantization_data).mark_bar().encode(
x=alt.X("method:N", title="Quantization Method", sort=None),
y=alt.Y("memory_reduction_pct:Q", title="Memory Reduction (%)"),
color=alt.Color(
"method:N",
legend=None,
scale=alt.Scale(scheme="tableau10"),
),
).properties(
title="Memory Reduction by Quantization Method",
width=400,
height=250,
)
memory_chart

# Quality vs throughput tradeoff
tradeoff_chart = alt.Chart(quantization_data).mark_point(size=100).encode(
x=alt.X("quality_degradation_pct:Q", title="Quality Degradation (%)"),
y=alt.Y("throughput_improvement_pct:Q", title="Throughput Improvement (%)"),
color=alt.Color("method:N", title="Method"),
tooltip=["method", "vllm_flag", "memory_reduction_pct"],
).properties(
title="Quantization Tradeoffs: Quality vs Throughput",
width=450,
height=300,
)
tradeoff_chart

Tensor Parallelism vs Pipeline Parallelism
For multi-GPU deployments, you need to choose between parallelism strategies:
| Strategy | How it works | Best for | vLLM flag |
|---|---|---|---|
| Tensor Parallelism (TP) | Split layer computations across GPUs | Single-node, low latency | --tensor-parallel-size N |
| Pipeline Parallelism (PP) | Different layers on different GPUs | Multi-node, large models | --pipeline-parallel-size N |
| Data Parallelism | Multiple complete replicas | High throughput | Deploy multiple pods |
Guidelines:
- 7B models: TP=1 (single GPU) or TP=2 for lower latency
- 70B models: TP=4 on 4x A100 80GB, or TP=8 on 8x A100 40GB
- 70B+ models: Consider PP for multi-node deployment
- For throughput, prefer data parallelism (more pods) over TP
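The sizing guidelines above can be approximated with a rule of thumb: pick the smallest power-of-two GPU count whose combined memory budget holds the FP16 weights, reserving the rest of each GPU for KV cache and activations. This is a sketch under stated assumptions (2 bytes/param, ~60% of each GPU budgeted for weights), not a substitute for benchmarking:

```python
def min_tensor_parallel_size(
    params_billions: float,
    gpu_memory_gb: float,
    weight_fraction: float = 0.6,  # remainder left for KV cache + activations
) -> int:
    """Smallest power-of-two TP degree whose pooled budget fits FP16 weights."""
    weights_gb = params_billions * 2  # FP16: 2 bytes per parameter
    tp = 1
    while weights_gb > tp * gpu_memory_gb * weight_fraction:
        tp *= 2
    return tp


print(min_tensor_parallel_size(7, 80))   # 1 (7B fits on one A100 80GB)
print(min_tensor_parallel_size(70, 80))  # 4 (matches the guideline above)
print(min_tensor_parallel_size(70, 40))  # 8
```

Powers of two are used because attention heads must divide evenly across GPUs, which is also what vLLM expects for `--tensor-parallel-size`.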
Error Handling: Exception Hierarchy
Proper exception handling enables smart retry logic and meaningful error messages.
class VLLMClientError(Exception):
"""Base exception for vLLM client errors."""
pass
class TokenLimitExceeded(VLLMClientError):
"""Request exceeded token limit. Truncate input and retry."""
pass
class ServerOverloaded(VLLMClientError):
"""Server is overloaded (503). Retry with backoff."""
pass
class ModelNotReady(VLLMClientError):
"""Model not loaded yet. Wait and retry."""
pass
class RateLimitExceeded(VLLMClientError):
"""Rate limit hit (429). Retry after delay."""
def __init__(self, message: str, retry_after: float | None = None):
super().__init__(message)
self.retry_after = retry_after

Error Handling: Resilient Client with Retries
from tenacity import (
retry,
stop_after_attempt,
wait_exponential,
retry_if_exception_type,
)
def parse_error_response(response: httpx.Response) -> VLLMClientError:
"""
Parse error response and return appropriate exception.
Different errors need different handling:
- 503: Retry with backoff
- 429: Wait for retry-after header
- 400 with token error: Don't retry, truncate input
"""
status = response.status_code
if status == 503:
return ServerOverloaded("Server overloaded, retry with backoff")
if status == 429:
retry_after = response.headers.get("Retry-After")
return RateLimitExceeded(
"Rate limit exceeded",
retry_after=float(retry_after) if retry_after else None,
)
if status == 400:
try:
error = response.json().get("error", {})
message = str(error.get("message", ""))
if "token" in message.lower():
return TokenLimitExceeded(message)
except Exception:
pass
return VLLMClientError(f"Request failed with status {status}")
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=1, max=30),
retry=retry_if_exception_type((httpx.TimeoutException, ServerOverloaded)),
)
async def create_completion_resilient(
config: VLLMConfig,
messages: list[ChatMessage],
) -> CompletionResponse:
"""
Resilient completion with smart retry logic.
Retries on:
- Network timeouts
- Server overload (503)
Does NOT retry on:
- Token limit exceeded (need to truncate input)
- Rate limit (should respect retry-after)
- Other client errors (fix the request)
"""
async with httpx.AsyncClient(timeout=config.timeout) as client:
response = await client.post(
f"{config.base_url}/chat/completions",
json={
"model": config.model,
"messages": [m.model_dump() for m in messages],
"max_tokens": config.max_tokens,
"temperature": config.temperature,
},
headers=config.get_headers(),
)
if response.status_code >= 400:
raise parse_error_response(response)
data = response.json()
return CompletionResponse(
request_id=data["id"],
content=data["choices"][0]["message"]["content"],
model=data["model"],
prompt_tokens=data["usage"]["prompt_tokens"],
completion_tokens=data["usage"]["completion_tokens"],
total_tokens=data["usage"]["total_tokens"],
        latency_ms=response.elapsed.total_seconds() * 1000,  # httpx measures request-to-close time
finish_reason=data["choices"][0]["finish_reason"],
    )

Error Handling: Circuit Breaker Pattern
A circuit breaker prevents cascading failures by stopping requests to an unhealthy server.
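The interaction pattern is: consult the breaker before each call, then report the outcome. A minimal sketch of that pattern (the inline `Breaker` is a toy stand-in for the fuller `CircuitBreaker` dataclass below, and `CircuitOpenError` is an illustrative name):

```python
import asyncio
from dataclasses import dataclass


class CircuitOpenError(Exception):
    """Request rejected without touching the server."""


@dataclass
class Breaker:
    # Toy stand-in: the real state transitions live in CircuitBreaker below.
    failures: int = 0
    threshold: int = 5

    def can_execute(self) -> bool:
        return self.failures < self.threshold

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1


async def guarded_call(breaker: Breaker, fn):
    # Fail fast while the circuit is open instead of queuing doomed requests.
    if not breaker.can_execute():
        raise CircuitOpenError("circuit open, failing fast")
    try:
        result = await fn()
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

The point of failing fast is that a saturated vLLM server recovers much sooner when clients stop piling requests onto its queue.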
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
class CircuitState(Enum):
CLOSED = "closed" # Normal operation
OPEN = "open" # Failing, reject requests
HALF_OPEN = "half_open" # Testing recovery
@dataclass
class CircuitBreaker:
"""
Circuit breaker for vLLM client resilience.
States:
- CLOSED: Normal operation, requests pass through
- OPEN: Too many failures, reject immediately
- HALF_OPEN: Testing if server recovered
Transitions:
- CLOSED -> OPEN: failure_threshold consecutive failures
- OPEN -> HALF_OPEN: recovery_timeout elapsed
- HALF_OPEN -> CLOSED: half_open_max_calls successes
- HALF_OPEN -> OPEN: any failure
"""
failure_threshold: int = 5
recovery_timeout: timedelta = field(default_factory=lambda: timedelta(seconds=30))
half_open_max_calls: int = 3
state: CircuitState = field(default=CircuitState.CLOSED)
failure_count: int = field(default=0)
last_failure_time: datetime | None = field(default=None)
half_open_successes: int = field(default=0)
def can_execute(self) -> bool:
"""Check if request should proceed."""
if self.state == CircuitState.CLOSED:
return True
if self.state == CircuitState.OPEN:
if self.last_failure_time is None:
return False
if datetime.now() - self.last_failure_time > self.recovery_timeout:
self.state = CircuitState.HALF_OPEN
self.half_open_successes = 0
return True
return False
# HALF_OPEN: allow limited requests
return True
def record_success(self) -> None:
"""Record successful request."""
if self.state == CircuitState.HALF_OPEN:
self.half_open_successes += 1
if self.half_open_successes >= self.half_open_max_calls:
self.state = CircuitState.CLOSED
self.failure_count = 0
else:
self.failure_count = 0
def record_failure(self) -> None:
"""Record failed request."""
self.failure_count += 1
self.last_failure_time = datetime.now()
if self.state == CircuitState.HALF_OPEN:
# Any failure in half-open reopens the circuit
self.state = CircuitState.OPEN
elif self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

Monitoring: Prometheus Metrics
Export metrics for observability. vLLM exposes built-in Prometheus metrics at /metrics, and you can add client-side metrics.
from prometheus_client import Counter, Histogram, Gauge
# Client-side metrics
REQUEST_COUNT = Counter(
"vllm_client_requests_total",
"Total number of requests to vLLM",
["model", "status"],
)
REQUEST_LATENCY = Histogram(
"vllm_client_request_latency_seconds",
"Request latency in seconds",
["model"],
buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0],
)
TOKENS_GENERATED = Counter(
"vllm_client_tokens_generated_total",
"Total tokens generated",
["model"],
)
ACTIVE_REQUESTS = Gauge(
"vllm_client_active_requests",
"Number of in-flight requests",
)
CIRCUIT_BREAKER_STATE = Gauge(
"vllm_client_circuit_breaker_state",
"Circuit breaker state (0=closed, 1=open, 2=half_open)",
)
def record_request_metrics(
model: str,
status: str,
latency_seconds: float,
tokens: int,
) -> None:
"""Record metrics for a completed request."""
REQUEST_COUNT.labels(model=model, status=status).inc()
REQUEST_LATENCY.labels(model=model).observe(latency_seconds)
if tokens > 0:
        TOKENS_GENERATED.labels(model=model).inc(tokens)

Monitoring: Key Metrics to Track
vLLM exposes many metrics. Here are the most important ones:
key_metrics = {
"Latency": [
"vllm:e2e_request_latency_seconds", # End-to-end request latency
"vllm:time_to_first_token_seconds", # TTFT for streaming
"vllm:time_per_output_token_seconds", # Generation speed
],
"Throughput": [
"vllm:avg_prompt_throughput_toks_per_s", # Prompt processing
"vllm:avg_generation_throughput_toks_per_s", # Token generation
],
"Queue": [
"vllm:num_requests_running", # Currently processing
"vllm:num_requests_waiting", # Waiting in queue
"vllm:num_requests_swapped", # Swapped to CPU (memory pressure)
],
"Resources": [
"vllm:gpu_cache_usage_perc", # KV cache memory usage
"vllm:cpu_cache_usage_perc", # CPU swap usage (if enabled)
],
}
# Print formatted
for category, metrics in key_metrics.items():
print(f"\n{category}:")
for metric in metrics:
        print(f"  - {metric}")

Monitoring: Alerting Rules
Set up alerts for critical conditions.
alerting_rules = """
groups:
- name: vllm-alerts
rules:
- alert: VLLMHighLatency
expr: histogram_quantile(0.99, sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le)) > 30
for: 5m
labels:
severity: warning
annotations:
summary: "vLLM P99 latency is high"
description: "P99 latency is {{ $value | humanizeDuration }} (threshold: 30s)"
- alert: VLLMQueueBacklog
expr: vllm:num_requests_waiting > 100
for: 2m
labels:
severity: warning
annotations:
summary: "vLLM request queue is backing up"
description: "{{ $value }} requests waiting in queue"
- alert: VLLMGPUMemoryCritical
expr: vllm:gpu_cache_usage_perc > 0.95
for: 5m
labels:
severity: critical
annotations:
summary: "vLLM GPU memory usage critical"
description: "GPU cache usage at {{ $value | humanizePercentage }}"
- alert: VLLMHighErrorRate
expr: |
sum(rate(vllm_client_requests_total{status="error"}[5m]))
/ sum(rate(vllm_client_requests_total[5m])) > 0.05
for: 1m
labels:
severity: critical
annotations:
summary: "vLLM error rate exceeds 5%"
description: "Error rate is {{ $value | humanizePercentage }}"
- alert: VLLMRequestsSwapped
expr: vllm:num_requests_swapped > 0
for: 5m
labels:
severity: warning
annotations:
summary: "vLLM is swapping requests to CPU"
description: "{{ $value }} requests swapped (indicates GPU memory pressure)"
"""
print(alerting_rules)

Queue Depth Visualization
Monitoring queue depth helps identify when to scale.
import random
# Simulated queue depth over time
random.seed(42)
time_points = list(range(0, 60))
# Simulate a load spike scenario
running = []
waiting = []
for t in time_points:
if t < 15:
# Normal load
r = 20 + random.randint(-5, 5)
w = 5 + random.randint(-3, 3)
elif t < 30:
# Load spike
r = 45 + random.randint(-5, 10)
w = 50 + random.randint(-10, 20)
elif t < 45:
# Scaling up (more capacity)
r = 60 + random.randint(-5, 5)
w = max(0, 30 - (t - 30) * 2 + random.randint(-5, 5))
else:
# Back to normal
r = 30 + random.randint(-5, 5)
w = 5 + random.randint(-3, 3)
running.append(max(0, r))
waiting.append(max(0, w))
queue_data = pl.DataFrame({
"time_minutes": time_points * 2,
"count": running + waiting,
"metric": ["Running"] * 60 + ["Waiting"] * 60,
})
queue_chart = alt.Chart(queue_data).mark_area(opacity=0.7).encode(
x=alt.X("time_minutes:Q", title="Time (minutes)"),
y=alt.Y("count:Q", title="Request Count", stack=True),
color=alt.Color(
"metric:N",
title="Status",
scale=alt.Scale(domain=["Running", "Waiting"], range=["#4c78a8", "#f58518"]),
),
).properties(
title="Request Queue Depth Over Time (Load Spike Scenario)",
width=600,
height=300,
)
# Add annotation for the spike
spike_annotation = alt.Chart(pl.DataFrame({
"x": [22],
"y": [100],
"text": ["Load spike"],
})).mark_text(align="center", dy=-10, fontSize=11).encode(
x="x:Q", y="y:Q", text="text:N"
)
scale_annotation = alt.Chart(pl.DataFrame({
"x": [37],
"y": [65],
"text": ["Autoscaling kicks in"],
})).mark_text(align="center", dy=-10, fontSize=11).encode(
x="x:Q", y="y:Q", text="text:N"
)
queue_chart + spike_annotation + scale_annotation

Production Deployment Checklist
Use this checklist before going to production:
Server Configuration
- Set appropriate --gpu-memory-utilization (typically 0.85-0.95)
- Enable --enable-prefix-caching for repeated prompts
- Configure --max-model-len based on your use case
- Use quantization if memory constrained (AWQ or GPTQ)
- Set --tensor-parallel-size appropriately for your GPU count
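Put together, a launch command exercising these flags might look like the following (model name and values are illustrative; drop --quantization when serving a full-precision checkpoint):

```shell
# Illustrative: two-way tensor parallelism with an AWQ-quantized checkpoint.
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --quantization awq
```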
Client Implementation
- Use async clients for high throughput applications
- Implement connection pooling (shared httpx.AsyncClient)
- Add retry logic with exponential backoff
- Implement circuit breaker pattern for resilience
- Handle streaming for interactive use cases
High Availability
- Deploy multiple replicas behind load balancer
- Configure health checks (readiness + liveness probes)
- Use least_conn load balancing for variable request times
- Set up autoscaling based on queue depth or GPU utilization
- Share model cache via PersistentVolume to speed up scaling
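For the load-balancing items, an nginx upstream along these lines works (hostnames and timeouts are illustrative). least_conn matters here because LLM request durations vary by orders of magnitude, so round-robin routinely piles long generations onto one replica:

```nginx
upstream vllm_backends {
    least_conn;                      # route to the replica with fewest in-flight requests
    server vllm-0.vllm.svc:8000 max_fails=3 fail_timeout=30s;
    server vllm-1.vllm.svc:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://vllm_backends;
        proxy_read_timeout 300s;     # generations can stream for minutes
        proxy_buffering off;         # flush streamed tokens immediately
    }
}
```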
Monitoring
- Export Prometheus metrics from vLLM (/metrics endpoint)
- Track latency percentiles (p50, p90, p99)
- Monitor queue depth (running + waiting requests)
- Monitor GPU memory and utilization
- Set up alerts for high latency, queue backlog, error rate
Security
- Enable API key authentication if exposed publicly
- Use TLS termination at load balancer
- Implement rate limiting at the gateway level
- Audit log all requests for compliance
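For the first security item: the vLLM OpenAI-compatible server accepts an --api-key flag, after which requests must carry standard bearer auth (the key value and payload here are illustrative):

```shell
# Server: requests lacking this key are rejected with 401.
vllm serve mistralai/Mistral-7B-Instruct-v0.3 --api-key "$VLLM_API_KEY"

# Client: standard OpenAI-style Authorization header.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3",
       "messages": [{"role": "user", "content": "ping"}]}'
```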
Summary
Deploying vLLM for production requires attention to:
- Client patterns: Async clients with connection pooling for throughput, streaming for interactivity
- Resilience: Retries with backoff, circuit breakers, proper exception handling
- Scaling: Kubernetes with GPU node pools, HPA based on queue metrics
- Performance: Right-sized --gpu-memory-utilization, prefix caching, quantization when needed
- Observability: Prometheus metrics, alerting on latency/queue/errors
The key insight is that vLLM handles the hard part (continuous batching, PagedAttention) automatically. Your job is to:
- Send enough concurrent requests to keep GPUs busy
- Handle failures gracefully
- Scale based on queue depth, not just CPU/memory
- Monitor the right metrics (latency percentiles, not just averages)
With these patterns in place, you can achieve the 2-24x throughput improvements that vLLM promises while maintaining production reliability.