Large language models are expensive and slow. A naive deployment can easily cost 10x more than necessary while delivering poor latency. This guide shows how to deploy vLLM for production inference, covering everything from basic server setup to Kubernetes autoscaling and observability.
vLLM has become the de facto standard for self-hosted LLM inference. Companies like Meta, Mistral AI, IBM, and Stripe use it in production; Stripe reported a 73% cost reduction after switching to vLLM. Three innovations make this possible:
- PagedAttention: Non-contiguous KV cache memory management, like virtual memory paging for attention. This eliminates 60-80% of memory waste from fragmentation.
- Continuous Batching: Instead of waiting for the slowest sequence in a batch, vLLM schedules at the iteration level, dynamically replacing completed sequences. This achieves 2-24x throughput improvement.
- OpenAI-Compatible API: Drop-in replacement for OpenAI’s API, making migration trivial.
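The continuous-batching claim is easy to sanity-check with a toy model. Here is a minimal sketch (pure Python, no vLLM required; the one-token-per-iteration cost model is a deliberate simplification) comparing static batching, which waits for the longest sequence in each batch, with iteration-level scheduling, which refills a slot as soon as a sequence finishes:

```python
def static_batch_iters(lengths: list[int], batch_size: int) -> int:
    """Static batching: each batch occupies the GPU until its
    longest sequence finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batch_iters(lengths: list[int], batch_size: int) -> int:
    """Iteration-level scheduling: a finished sequence's slot is
    refilled immediately, so short requests never wait on long ones."""
    pending = list(lengths)
    active: list[int] = []  # remaining tokens per in-flight sequence
    iters = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        step = min(active)  # iterations until the next sequence completes
        iters += step
        active = [r - step for r in active if r > step]
    return iters


# Mixed short/long workload: continuous batching finishes far sooner.
lengths = [1, 10, 1, 10]
print(static_batch_iters(lengths, 2))      # 20 iterations
print(continuous_batch_iters(lengths, 2))  # 12 iterations
```

The gap widens as output lengths become more skewed, which is exactly the regime of real chat traffic.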
We’ll cover:
- Local setup with Docker
- Client patterns (sync, async, streaming)
- Parallel processing for throughput
- Production architecture with load balancing
- Kubernetes deployment with autoscaling
- Cloud deployment patterns (AWS, GCP)
- Performance optimization
- Error handling and resilience
- Monitoring and alerting
Quick Start: Running vLLM Locally
The fastest way to get started is with Docker. Here’s a docker-compose.yml that runs Mistral-7B-Instruct:
# docker-compose.yml for local development
docker_compose = """
version: '3.8'
services:
vllm:
image: vllm/vllm-openai:latest
container_name: vllm-server
runtime: nvidia # Requires NVIDIA Container Toolkit
ports:
- "8000:8000"
environment:
- NVIDIA_VISIBLE_DEVICES=all
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
command: >
--model mistralai/Mistral-7B-Instruct-v0.3
--tensor-parallel-size 1
--gpu-memory-utilization 0.90
--max-model-len 8192
--enable-prefix-caching
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 120s
"""
print(docker_compose)

Key configuration options:
| Flag | Description | Recommended Value |
|---|---|---|
| `--gpu-memory-utilization` | Fraction of GPU memory vLLM may use (weights + activations + KV cache) | 0.85-0.95 |
| `--tensor-parallel-size` | Number of GPUs for tensor parallelism | 1 for 7B models, 2-4 for 70B+ |
| `--max-model-len` | Maximum context length | Based on use case |
| `--enable-prefix-caching` | Cache common prompt prefixes across requests | Always enable |
| `--quantization` | Model quantization (awq, gptq, fp8) | Use if memory constrained |
Data Models
Before diving into client patterns, let’s define our data models. We use Pydantic for validation and dataclasses for configuration.
from dataclasses import dataclass
from pydantic import BaseModel, Field
from typing import Literal
import uuid
@dataclass
class VLLMConfig:
"""Configuration for vLLM client connections."""
base_url: str = "http://localhost:8000/v1"
model: str = "mistralai/Mistral-7B-Instruct-v0.3"
timeout: float = 120.0
max_retries: int = 3
# Generation defaults
max_tokens: int = 1024
temperature: float = 0.7
top_p: float = 0.95
def get_headers(self, api_key: str | None = None) -> dict[str, str]:
"""Get request headers."""
headers = {"Content-Type": "application/json"}
if api_key:
headers["Authorization"] = f"Bearer {api_key}"
return headers
class ChatMessage(BaseModel):
"""A single message in a conversation."""
role: Literal["system", "user", "assistant"]
content: str
class CompletionResponse(BaseModel):
"""Response from completion API."""
request_id: str
content: str
model: str
prompt_tokens: int
completion_tokens: int
total_tokens: int
latency_ms: float
finish_reason: str

Client Pattern 1: Synchronous HTTP
The simplest approach uses synchronous HTTP requests. This is appropriate for scripts, notebooks, and applications where you’re making occasional requests.
import httpx
import time
def create_completion_sync(
config: VLLMConfig,
messages: list[ChatMessage],
max_tokens: int | None = None,
temperature: float | None = None,
) -> CompletionResponse:
"""
Synchronous completion request to vLLM.
Use for simple scripts or when async isn't needed.
Connection is created and closed per request.
"""
start_time = time.perf_counter()
payload = {
"model": config.model,
"messages": [m.model_dump() for m in messages],
"max_tokens": max_tokens or config.max_tokens,
"temperature": temperature or config.temperature,
}
with httpx.Client(timeout=config.timeout) as client:
response = client.post(
f"{config.base_url}/chat/completions",
json=payload,
headers=config.get_headers(),
)
response.raise_for_status()
data = response.json()
latency_ms = (time.perf_counter() - start_time) * 1000
return CompletionResponse(
request_id=data["id"],
content=data["choices"][0]["message"]["content"],
model=data["model"],
prompt_tokens=data["usage"]["prompt_tokens"],
completion_tokens=data["usage"]["completion_tokens"],
total_tokens=data["usage"]["total_tokens"],
latency_ms=latency_ms,
finish_reason=data["choices"][0]["finish_reason"],
)
# Example usage
# config = VLLMConfig()
# messages = [ChatMessage(role="user", content="Explain PagedAttention in one sentence.")]
# response = create_completion_sync(config, messages)
# print(f"Response: {response.content}")
# print(f"Latency: {response.latency_ms:.0f}ms")Client Pattern 2: Async HTTP
For high-throughput applications, async clients are essential. They allow concurrent requests without blocking, maximizing GPU utilization on the server.
import asyncio
async def create_completion_async(
config: VLLMConfig,
messages: list[ChatMessage],
client: httpx.AsyncClient | None = None,
max_tokens: int | None = None,
) -> CompletionResponse:
"""
Async completion request - preferred for high throughput.
Pass a shared client for connection pooling across requests.
This significantly reduces connection overhead.
"""
start_time = time.perf_counter()
payload = {
"model": config.model,
"messages": [m.model_dump() for m in messages],
"max_tokens": max_tokens or config.max_tokens,
"temperature": config.temperature,
}
should_close = client is None
client = client or httpx.AsyncClient(timeout=config.timeout)
try:
response = await client.post(
f"{config.base_url}/chat/completions",
json=payload,
headers=config.get_headers(),
)
response.raise_for_status()
data = response.json()
finally:
if should_close:
await client.aclose()
latency_ms = (time.perf_counter() - start_time) * 1000
return CompletionResponse(
request_id=data["id"],
content=data["choices"][0]["message"]["content"],
model=data["model"],
prompt_tokens=data["usage"]["prompt_tokens"],
completion_tokens=data["usage"]["completion_tokens"],
total_tokens=data["usage"]["total_tokens"],
latency_ms=latency_ms,
finish_reason=data["choices"][0]["finish_reason"],
)

Client Pattern 3: OpenAI SDK
vLLM’s OpenAI-compatible API means you can use the official OpenAI Python SDK. This is the easiest migration path if you’re already using OpenAI’s API.
from openai import OpenAI, AsyncOpenAI
def create_completion_openai_sdk(
config: VLLMConfig,
messages: list[ChatMessage],
) -> CompletionResponse:
"""
Use OpenAI SDK with vLLM - simplest integration path.
vLLM's OpenAI-compatible API means existing OpenAI code
works with minimal changes (just update base_url).
"""
start_time = time.perf_counter()
client = OpenAI(
base_url=config.base_url,
api_key="not-needed", # vLLM doesn't require API key by default
)
response = client.chat.completions.create(
model=config.model,
messages=[{"role": m.role, "content": m.content} for m in messages],
max_tokens=config.max_tokens,
temperature=config.temperature,
)
latency_ms = (time.perf_counter() - start_time) * 1000
return CompletionResponse(
request_id=response.id,
content=response.choices[0].message.content or "",
model=response.model,
prompt_tokens=response.usage.prompt_tokens if response.usage else 0,
completion_tokens=response.usage.completion_tokens if response.usage else 0,
total_tokens=response.usage.total_tokens if response.usage else 0,
latency_ms=latency_ms,
finish_reason=response.choices[0].finish_reason or "unknown",
)
async def create_completion_openai_sdk_async(
config: VLLMConfig,
messages: list[ChatMessage],
) -> CompletionResponse:
"""Async version using OpenAI's AsyncOpenAI client."""
start_time = time.perf_counter()
client = AsyncOpenAI(
base_url=config.base_url,
api_key="not-needed",
)
response = await client.chat.completions.create(
model=config.model,
messages=[{"role": m.role, "content": m.content} for m in messages],
max_tokens=config.max_tokens,
temperature=config.temperature,
)
latency_ms = (time.perf_counter() - start_time) * 1000
return CompletionResponse(
request_id=response.id,
content=response.choices[0].message.content or "",
model=response.model,
prompt_tokens=response.usage.prompt_tokens if response.usage else 0,
completion_tokens=response.usage.completion_tokens if response.usage else 0,
total_tokens=response.usage.total_tokens if response.usage else 0,
latency_ms=latency_ms,
finish_reason=response.choices[0].finish_reason or "unknown",
)

Client Pattern 4: Streaming Responses
For interactive applications, streaming is essential. Users see the response building in real-time, which dramatically improves perceived latency. vLLM supports Server-Sent Events (SSE) for streaming.
from typing import AsyncIterator, Iterator
def stream_completion(
config: VLLMConfig,
messages: list[ChatMessage],
) -> Iterator[str]:
"""
Stream tokens as they're generated.
Essential for interactive applications - users see
the response building in real-time.
"""
client = OpenAI(base_url=config.base_url, api_key="not-needed")
stream = client.chat.completions.create(
model=config.model,
messages=[{"role": m.role, "content": m.content} for m in messages],
max_tokens=config.max_tokens,
stream=True,
)
for chunk in stream:
if chunk.choices and chunk.choices[0].delta.content:
yield chunk.choices[0].delta.content
async def stream_completion_async(
config: VLLMConfig,
messages: list[ChatMessage],
) -> AsyncIterator[str]:
"""Async streaming for concurrent applications."""
client = AsyncOpenAI(base_url=config.base_url, api_key="not-needed")
stream = await client.chat.completions.create(
model=config.model,
messages=[{"role": m.role, "content": m.content} for m in messages],
max_tokens=config.max_tokens,
stream=True,
)
async for chunk in stream:
if chunk.choices and chunk.choices[0].delta.content:
yield chunk.choices[0].delta.content
# Example: Collecting streamed response
# async def get_streamed_response(config: VLLMConfig, messages: list[ChatMessage]) -> str:
# tokens = []
# async for token in stream_completion_async(config, messages):
# tokens.append(token)
# print(token, end="", flush=True)
# return "".join(tokens)Parallel Processing: Batched Concurrent Requests
To maximize throughput, you want to keep multiple requests in flight simultaneously. vLLM’s continuous batching handles the server-side batching automatically, but you need to send concurrent requests to feed it.
async def process_batch_concurrent(
config: VLLMConfig,
requests: list[list[ChatMessage]],
max_concurrency: int = 10,
) -> list[CompletionResponse | Exception]:
"""
Process multiple requests concurrently with controlled parallelism.
This maximizes throughput while respecting server capacity.
The semaphore limits concurrent requests to avoid overwhelming
the server or running out of memory.
Args:
config: vLLM client configuration
requests: List of message lists (one per request)
max_concurrency: Maximum parallel requests
Returns:
List of responses (or exceptions for failed requests)
"""
semaphore = asyncio.Semaphore(max_concurrency)
async def limited_request(
messages: list[ChatMessage],
client: httpx.AsyncClient,
) -> CompletionResponse:
async with semaphore:
return await create_completion_async(config, messages, client)
# Use a single client for connection pooling
async with httpx.AsyncClient(timeout=config.timeout) as client:
tasks = [limited_request(msgs, client) for msgs in requests]
results = await asyncio.gather(*tasks, return_exceptions=True)
return results
def count_successes_and_failures(
results: list[CompletionResponse | Exception],
) -> tuple[int, int]:
"""Count successful and failed requests from batch results."""
successes = sum(1 for r in results if isinstance(r, CompletionResponse))
failures = len(results) - successes
return successes, failures

Parallel Processing: ThreadPoolExecutor for Sync Code
If you’re integrating with a sync codebase that can’t easily adopt async/await, ThreadPoolExecutor provides parallelism.
from concurrent.futures import ThreadPoolExecutor, as_completed
def process_batch_threaded(
config: VLLMConfig,
requests: list[list[ChatMessage]],
max_workers: int = 10,
) -> list[tuple[int, CompletionResponse | Exception]]:
"""
Thread pool executor for parallel processing in sync code.
Use when integrating with sync codebases that can't
easily adopt async/await. Note that threads have
higher overhead than async coroutines.
Returns:
List of (index, response) tuples, sorted by original order
"""
results: list[tuple[int, CompletionResponse | Exception]] = []
with ThreadPoolExecutor(max_workers=max_workers) as executor:
future_to_idx = {
executor.submit(create_completion_sync, config, msgs): idx
for idx, msgs in enumerate(requests)
}
for future in as_completed(future_to_idx):
idx = future_to_idx[future]
try:
response = future.result()
results.append((idx, response))
except Exception as e:
results.append((idx, e))
# Sort by original order
results.sort(key=lambda x: x[0])
return results

Visualizing Client Performance
Let’s visualize the latency characteristics of different client patterns. This simulated data reflects typical production behavior.
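Before charting, it helps to reduce raw latencies to percentiles, since p50/p95/p99 are the numbers worth alerting on. A dependency-free nearest-rank implementation (one reasonable percentile definition among several):

```python
import math


def latency_percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of latency samples (pct in (0, 100])."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: the smallest sample such that pct% of samples are <= it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies = [120.0, 95.0, 180.0, 110.0, 300.0, 105.0, 130.0, 98.0, 250.0, 115.0]
print(latency_percentile(latencies, 50))  # 115.0 (p50)
print(latency_percentile(latencies, 95))  # 300.0 (p95)
```

Note the gap between p50 and p95: LLM latency distributions are heavy-tailed, so averages alone are misleading.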
import altair as alt
import polars as pl
# Simulated latency data for different client patterns
# Based on typical production measurements with Mistral-7B
latency_data = pl.DataFrame({
"pattern": ["Sync"] * 100 + ["Async"] * 100 + ["Async Batched"] * 100,
"latency_ms": (
# Sync: higher latency, more variance due to connection overhead
[150 + i * 2 + (i % 10) * 5 for i in range(100)] +
# Async: lower latency with connection pooling
[120 + i * 1.5 + (i % 10) * 3 for i in range(100)] +
# Async Batched: lowest per-request latency due to server batching
[80 + i * 1 + (i % 10) * 2 for i in range(100)]
),
"request_num": list(range(100)) * 3,
})
chart = alt.Chart(latency_data).mark_line().encode(
x=alt.X("request_num:Q", title="Request Number"),
y=alt.Y("latency_ms:Q", title="Latency (ms)"),
color=alt.Color("pattern:N", title="Client Pattern"),
strokeDash=alt.StrokeDash("pattern:N"),
).properties(
title="Latency by Client Pattern",
width=600,
height=300,
)
chart

Throughput Scaling with Concurrency
Higher concurrency generally means higher throughput, up to the point where the GPU becomes the bottleneck.
# Simulated throughput data showing scaling with concurrency
throughput_data = pl.DataFrame({
"concurrency": [1, 2, 4, 8, 16, 32, 64, 128],
"throughput_rps": [5, 9.5, 18, 32, 55, 85, 110, 115], # Requests per second
"tokens_per_second": [500, 950, 1800, 3200, 5500, 8500, 11000, 11500],
})
base = alt.Chart(throughput_data).encode(
x=alt.X("concurrency:Q", title="Concurrent Requests", scale=alt.Scale(type="log")),
)
line = base.mark_line(color="steelblue", strokeWidth=2).encode(
y=alt.Y("throughput_rps:Q", title="Throughput (requests/sec)"),
)
points = base.mark_point(color="steelblue", size=60).encode(
y=alt.Y("throughput_rps:Q"),
)
# Add annotation for saturation point
annotation = alt.Chart(pl.DataFrame({
"x": [64],
"y": [110],
"text": ["GPU Saturation →"],
})).mark_text(align="right", dx=-10, fontSize=12).encode(
x="x:Q",
y="y:Q",
text="text:N",
)
chart = (line + points + annotation).properties(
title="Throughput Scaling with Concurrency (Mistral-7B on A100)",
width=500,
height=300,
)
chart

Production Architecture
For high availability, deploy multiple vLLM instances behind a load balancer. Here’s the architecture:
┌─────────────────────────────────────────────┐
│ Load Balancer │
│ (nginx/HAProxy/AWS ALB/GCP GLB) │
└─────────────────┬───────────────────────────┘
│
┌───────────────────────────┼───────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ vLLM Pod 1 │ │ vLLM Pod 2 │ │ vLLM Pod 3 │
│ (GPU: A100) │ │ (GPU: A100) │ │ (GPU: A100) │
│ │ │ │ │ │
│ Model: Mistral │ │ Model: Mistral │ │ Model: Mistral │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
└───────────────────────────┼───────────────────────────┘
│
┌─────────────────▼───────────────────────────┐
│ Shared Storage │
│ (Model weights: EFS/GCS/S3) │
└─────────────────────────────────────────────┘
Load Balancer Configuration
nginx with least_conn balancing works well for LLM inference since request durations vary significantly.
nginx_config = """
upstream vllm_backends {
least_conn; # Route to server with fewest active connections
server vllm-1:8000 weight=1 max_fails=3 fail_timeout=30s;
server vllm-2:8000 weight=1 max_fails=3 fail_timeout=30s;
server vllm-3:8000 weight=1 max_fails=3 fail_timeout=30s;
keepalive 32; # Connection pooling to backends
}
server {
listen 80;
# Health check endpoint
location /health {
proxy_pass http://vllm_backends/health;
proxy_connect_timeout 5s;
proxy_read_timeout 10s;
}
# API endpoints
location /v1 {
proxy_pass http://vllm_backends/v1;
# Timeouts for long-running inference
proxy_connect_timeout 10s;
proxy_send_timeout 120s;
proxy_read_timeout 120s;
# Streaming support (disable buffering)
proxy_buffering off;
proxy_http_version 1.1;
proxy_set_header Connection "";
# Request headers
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Request-ID $request_id;
}
}
"""
print(nginx_config)

Health Checks
Proper health checks are critical for load balancer routing and Kubernetes readiness probes.
from typing import Any
async def check_vllm_health(
base_url: str,
timeout: float = 10.0,
) -> dict[str, Any]:
"""
Comprehensive health check for vLLM server.
Checks:
1. Health endpoint responds
2. Model is loaded
3. Inference works (optional, adds latency)
Returns:
Health status dict with individual check results
"""
async with httpx.AsyncClient(timeout=timeout) as client:
health: dict[str, Any] = {"healthy": False, "checks": {}}
# Basic health endpoint
try:
resp = await client.get(f"{base_url}/health")
health["checks"]["health_endpoint"] = resp.status_code == 200
except Exception as e:
health["checks"]["health_endpoint"] = False
health["error"] = str(e)
return health
# Check models are loaded
try:
resp = await client.get(f"{base_url}/v1/models")
data = resp.json()
health["checks"]["models_loaded"] = len(data.get("data", [])) > 0
health["models"] = [m["id"] for m in data.get("data", [])]
except Exception:
health["checks"]["models_loaded"] = False
health["healthy"] = all(health["checks"].values())
return health
async def wait_for_healthy(
base_url: str,
timeout: float = 300.0,
interval: float = 5.0,
) -> bool:
"""
Wait for vLLM server to become healthy.
Useful for startup scripts and integration tests.
"""
start = time.perf_counter()
while time.perf_counter() - start < timeout:
health = await check_vllm_health(base_url)
if health["healthy"]:
return True
await asyncio.sleep(interval)
return False

Kubernetes Deployment
Here’s a production-ready Kubernetes deployment with GPU support, resource limits, and autoscaling.
k8s_manifests = """
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
labels:
app: vllm
spec:
replicas: 3
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8000"
prometheus.io/path: "/metrics"
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
ports:
- containerPort: 8000
name: http
args:
- "--model"
- "mistralai/Mistral-7B-Instruct-v0.3"
- "--tensor-parallel-size"
- "1"
- "--gpu-memory-utilization"
- "0.90"
- "--max-model-len"
- "8192"
- "--enable-prefix-caching"
resources:
limits:
nvidia.com/gpu: 1
memory: "32Gi"
requests:
nvidia.com/gpu: 1
memory: "24Gi"
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120 # Model loading takes time
periodSeconds: 10
failureThreshold: 3
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 180
periodSeconds: 30
failureThreshold: 3
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache-pvc
- name: shm
emptyDir:
medium: Memory
sizeLimit: 16Gi # Shared memory for PyTorch
nodeSelector:
nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
---
# Service
apiVersion: v1
kind: Service
metadata:
name: vllm-service
spec:
selector:
app: vllm
ports:
- port: 80
targetPort: 8000
name: http
type: ClusterIP
---
# HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-server
minReplicas: 2
maxReplicas: 10
metrics:
- type: Pods
pods:
metric:
name: vllm_num_requests_running
target:
type: AverageValue
averageValue: "50"
"""
print(k8s_manifests)

Cloud Deployment: AWS
EKS with Karpenter
Karpenter provides fast, cost-effective GPU node provisioning for EKS.
karpenter_provisioner = """
# Karpenter NodePool for GPU nodes
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: gpu-pool
spec:
template:
spec:
requirements:
- key: kubernetes.io/arch
operator: In
values: ["amd64"]
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"] # Use on-demand for LLM inference stability
- key: node.kubernetes.io/instance-type
operator: In
values:
- p4d.24xlarge # 8x A100 40GB
- p4de.24xlarge # 8x A100 80GB
- p5.48xlarge # 8x H100
nodeClassRef:
name: gpu-node-class
limits:
cpu: 1000
nvidia.com/gpu: 100
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 30m
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: gpu-node-class
spec:
amiFamily: AL2
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 500Gi
volumeType: gp3
throughput: 500
iops: 5000
role: "KarpenterNodeRole"
"""
print(karpenter_provisioner)

SageMaker Endpoint
For managed deployment, SageMaker provides autoscaling and monitoring out of the box.
# SageMaker deployment configuration
def create_sagemaker_endpoint_config(
model_name: str = "mistral-7b-instruct",
instance_type: str = "ml.g5.2xlarge",
initial_instance_count: int = 2,
) -> dict[str, Any]:
"""
Create SageMaker endpoint configuration for vLLM.
Note: Requires custom container with vLLM installed.
"""
return {
"EndpointConfigName": f"{model_name}-config",
"ProductionVariants": [
{
"VariantName": "primary",
"ModelName": model_name,
"InstanceType": instance_type,
"InitialInstanceCount": initial_instance_count,
"ContainerStartupHealthCheckTimeoutInSeconds": 600,
"ModelDataDownloadTimeoutInSeconds": 1200,
"RoutingConfig": {
"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"
},
}
],
}
def create_autoscaling_policy(
endpoint_name: str,
min_capacity: int = 2,
max_capacity: int = 10,
target_invocations_per_instance: int = 50,
) -> dict[str, Any]:
"""
Create autoscaling policy for SageMaker endpoint.
Scales based on invocations per instance.
"""
return {
"PolicyName": f"{endpoint_name}-scaling-policy",
"ServiceNamespace": "sagemaker",
"ResourceId": f"endpoint/{endpoint_name}/variant/primary",
"ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
"PolicyType": "TargetTrackingScaling",
"TargetTrackingScalingPolicyConfiguration": {
"TargetValue": target_invocations_per_instance,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
},
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60,
},
}

Cloud Deployment: GCP
GKE with GPU Node Pools
# GKE GPU node pool configuration (gcloud CLI)
gke_commands = """
# Create GPU node pool
gcloud container node-pools create gpu-pool \\
--cluster=vllm-cluster \\
--zone=us-central1-a \\
--machine-type=a2-highgpu-1g \\
--accelerator=type=nvidia-tesla-a100,count=1 \\
--num-nodes=2 \\
--enable-autoscaling \\
--min-nodes=1 \\
--max-nodes=10 \\
--disk-size=500GB \\
--disk-type=pd-ssd
# Install NVIDIA GPU drivers
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
"""
print(gke_commands)

Cloud Run with GPU (Preview)
Cloud Run now supports GPU workloads, enabling serverless LLM inference.
cloud_run_config = """
# Cloud Run service YAML
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
name: vllm-service
spec:
template:
metadata:
annotations:
run.googleapis.com/execution-environment: gen2
run.googleapis.com/gpu-type: nvidia-l4
autoscaling.knative.dev/minScale: "1" # Avoid cold starts
autoscaling.knative.dev/maxScale: "10"
spec:
containerConcurrency: 100
timeoutSeconds: 300
containers:
- image: gcr.io/PROJECT_ID/vllm-server:latest
ports:
- containerPort: 8000
resources:
limits:
cpu: "8"
memory: 32Gi
nvidia.com/gpu: "1"
env:
- name: MODEL_ID
value: "mistralai/Mistral-7B-Instruct-v0.3"
"""
print(cloud_run_config)

Performance Optimization: KV Cache
The KV cache is often the memory bottleneck. Understanding its memory requirements helps tune --gpu-memory-utilization.
def calculate_kv_cache_memory(
num_layers: int,
hidden_size: int,  # not used in the formula below; for reference, hidden_size = num_heads * head_dim
num_kv_heads: int,
head_dim: int,
max_context_len: int,
max_batch_size: int,
dtype_bytes: int = 2, # FP16 = 2 bytes, FP8 = 1 byte
) -> dict[str, float]:
"""
Calculate KV cache memory requirements.
The KV cache stores key and value tensors for all previous
tokens in all layers. This is often the memory bottleneck.
Args:
num_layers: Number of transformer layers
hidden_size: Model hidden dimension
num_kv_heads: Number of key-value heads (can differ from query heads in GQA)
head_dim: Dimension per attention head
max_context_len: Maximum sequence length
max_batch_size: Maximum concurrent sequences
dtype_bytes: Bytes per element (2 for FP16, 1 for FP8)
Returns:
Dict with memory requirements in GB
"""
# Per-token KV cache: 2 (K+V) * layers * kv_heads * head_dim * dtype_bytes
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
# Total for max batch at max context
total_bytes = bytes_per_token * max_context_len * max_batch_size
total_gb = total_bytes / (1024**3)
return {
"bytes_per_token": bytes_per_token,
"gb_per_sequence_max_context": (bytes_per_token * max_context_len) / (1024**3),
"total_kv_cache_gb": total_gb,
}
# Example: Mistral-7B with GQA (8 KV heads)
mistral_kv = calculate_kv_cache_memory(
num_layers=32,
hidden_size=4096,
num_kv_heads=8, # Grouped Query Attention
head_dim=128,
max_context_len=8192,
max_batch_size=32,
dtype_bytes=2, # FP16
)
print("Mistral-7B KV Cache Requirements:")
for key, value in mistral_kv.items():
if "gb" in key:
print(f" {key}: {value:.2f} GB")
else:
print(f" {key}: {value:,} bytes")Performance Optimization: Quantization
Quantization reduces memory usage and can improve throughput at a small quality cost.
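As a first-order estimate, weight memory scales linearly with bits per parameter. A quick sketch (real quantized checkpoints add a few percent of overhead for scales and zero-points, which this ignores):

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate model weight memory in GB at a given precision."""
    # params (billions) * bits / 8 bits-per-byte = gigabytes
    return params_billions * bits_per_param / 8


for name, bits in [("FP16", 16), ("FP8", 8), ("AWQ/GPTQ 4-bit", 4)]:
    print(f"7B @ {name}: {weight_memory_gb(7, bits):.1f} GB")
# 7B @ FP16: 14.0 GB
# 7B @ FP8: 7.0 GB
# 7B @ AWQ/GPTQ 4-bit: 3.5 GB
```

Every gigabyte freed from weights becomes KV cache, which is why 4-bit quantization often improves throughput as well as fitting larger batches.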
# Quantization comparison data
quantization_data = pl.DataFrame({
"method": ["FP16 (baseline)", "FP8", "AWQ (4-bit)", "GPTQ (4-bit)"],
"memory_reduction_pct": [0, 50, 75, 75],
"throughput_improvement_pct": [0, 15, 40, 30],
"quality_degradation_pct": [0, 0.5, 2, 3],
"vllm_flag": [
"(default)",
"--quantization fp8",
"--quantization awq",
"--quantization gptq",
],
})
# Memory reduction chart
memory_chart = alt.Chart(quantization_data).mark_bar().encode(
x=alt.X("method:N", title="Quantization Method", sort=None),
y=alt.Y("memory_reduction_pct:Q", title="Memory Reduction (%)"),
color=alt.Color(
"method:N",
legend=None,
scale=alt.Scale(scheme="tableau10"),
),
).properties(
title="Memory Reduction by Quantization Method",
width=400,
height=250,
)
memory_chart

# Quality vs throughput tradeoff
tradeoff_chart = alt.Chart(quantization_data).mark_point(size=100).encode(
x=alt.X("quality_degradation_pct:Q", title="Quality Degradation (%)"),
y=alt.Y("throughput_improvement_pct:Q", title="Throughput Improvement (%)"),
color=alt.Color("method:N", title="Method"),
tooltip=["method", "vllm_flag", "memory_reduction_pct"],
).properties(
title="Quantization Tradeoffs: Quality vs Throughput",
width=450,
height=300,
)
tradeoff_chart

Tensor Parallelism vs Pipeline Parallelism
For multi-GPU deployments, you need to choose between parallelism strategies:
| Strategy | How it works | Best for | vLLM flag |
|---|---|---|---|
| Tensor Parallelism (TP) | Split layer computations across GPUs | Single-node, low latency | --tensor-parallel-size N |
| Pipeline Parallelism (PP) | Different layers on different GPUs | Multi-node, large models | --pipeline-parallel-size N |
| Data Parallelism | Multiple complete replicas | High throughput | Deploy multiple pods |
Guidelines:
- 7B models: TP=1 (single GPU) or TP=2 for lower latency
- 70B models: TP=4 on 4x A100 80GB, or TP=8 on 8x A100 40GB
- 70B+ models: Consider PP for multi-node deployment
- For throughput, prefer data parallelism (more pods) over TP
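The sizing guidelines above can be approximated with a rule of thumb: pick the smallest power-of-two GPU count whose combined memory budget holds the FP16 weights, reserving the rest of each GPU for KV cache and activations. This is a sketch under stated assumptions (2 bytes/param, ~60% of each GPU budgeted for weights), not a substitute for benchmarking:

```python
def min_tensor_parallel_size(
    params_billions: float,
    gpu_memory_gb: float,
    weight_fraction: float = 0.6,  # remainder left for KV cache + activations
) -> int:
    """Smallest power-of-two TP degree whose pooled budget fits FP16 weights."""
    weights_gb = params_billions * 2  # FP16: 2 bytes per parameter
    tp = 1
    while weights_gb > tp * gpu_memory_gb * weight_fraction:
        tp *= 2
    return tp


print(min_tensor_parallel_size(7, 80))   # 1 (7B fits on one A100 80GB)
print(min_tensor_parallel_size(70, 80))  # 4 (matches the guideline above)
print(min_tensor_parallel_size(70, 40))  # 8
```

Powers of two are used because attention heads must divide evenly across GPUs, which is also what vLLM expects for `--tensor-parallel-size`.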
Error Handling: Exception Hierarchy
Proper exception handling enables smart retry logic and meaningful error messages.
class VLLMClientError(Exception):
"""Base exception for vLLM client errors."""
pass
class TokenLimitExceeded(VLLMClientError):
"""Request exceeded token limit. Truncate input and retry."""
pass
class ServerOverloaded(VLLMClientError):
"""Server is overloaded (503). Retry with backoff."""
pass
class ModelNotReady(VLLMClientError):
"""Model not loaded yet. Wait and retry."""
pass
class RateLimitExceeded(VLLMClientError):
"""Rate limit hit (429). Retry after delay."""
def __init__(self, message: str, retry_after: float | None = None):
super().__init__(message)
self.retry_after = retry_after

Error Handling: Resilient Client with Retries
from tenacity import (
retry,
stop_after_attempt,
wait_exponential,
retry_if_exception_type,
)
def parse_error_response(response: httpx.Response) -> VLLMClientError:
"""
Parse error response and return appropriate exception.
Different errors need different handling:
- 503: Retry with backoff
- 429: Wait for retry-after header
- 400 with token error: Don't retry, truncate input
"""
status = response.status_code
if status == 503:
return ServerOverloaded("Server overloaded, retry with backoff")
if status == 429:
retry_after = response.headers.get("Retry-After")
return RateLimitExceeded(
"Rate limit exceeded",
retry_after=float(retry_after) if retry_after else None,
)
if status == 400:
try:
error = response.json().get("error", {})
message = str(error.get("message", ""))
if "token" in message.lower():
return TokenLimitExceeded(message)
except Exception:
pass
return VLLMClientError(f"Request failed with status {status}")
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=1, max=30),
retry=retry_if_exception_type((httpx.TimeoutException, ServerOverloaded)),
)
async def create_completion_resilient(
config: VLLMConfig,
messages: list[ChatMessage],
) -> CompletionResponse:
"""
Resilient completion with smart retry logic.
Retries on:
- Network timeouts
- Server overload (503)
Does NOT retry on:
- Token limit exceeded (need to truncate input)
- Rate limit (should respect retry-after)
- Other client errors (fix the request)
"""
async with httpx.AsyncClient(timeout=config.timeout) as client:
response = await client.post(
f"{config.base_url}/chat/completions",
json={
"model": config.model,
"messages": [m.model_dump() for m in messages],
"max_tokens": config.max_tokens,
"temperature": config.temperature,
},
headers=config.get_headers(),
)
if response.status_code >= 400:
raise parse_error_response(response)
data = response.json()
return CompletionResponse(
request_id=data["id"],
content=data["choices"][0]["message"]["content"],
model=data["model"],
prompt_tokens=data["usage"]["prompt_tokens"],
completion_tokens=data["usage"]["completion_tokens"],
total_tokens=data["usage"]["total_tokens"],
        latency_ms=response.elapsed.total_seconds() * 1000,  # httpx measures request-to-close time
finish_reason=data["choices"][0]["finish_reason"],
    )

Error Handling: Circuit Breaker Pattern
A circuit breaker prevents cascading failures by stopping requests to an unhealthy server.
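The interaction pattern is: consult the breaker before each call, then report the outcome. A minimal sketch of that pattern (the inline `Breaker` is a toy stand-in for the fuller `CircuitBreaker` dataclass below, and `CircuitOpenError` is an illustrative name):

```python
import asyncio
from dataclasses import dataclass


class CircuitOpenError(Exception):
    """Request rejected without touching the server."""


@dataclass
class Breaker:
    # Toy stand-in: the real state transitions live in CircuitBreaker below.
    failures: int = 0
    threshold: int = 5

    def can_execute(self) -> bool:
        return self.failures < self.threshold

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1


async def guarded_call(breaker: Breaker, fn):
    # Fail fast while the circuit is open instead of queuing doomed requests.
    if not breaker.can_execute():
        raise CircuitOpenError("circuit open, failing fast")
    try:
        result = await fn()
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

The point of failing fast is that a saturated vLLM server recovers much sooner when clients stop piling requests onto its queue.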
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
class CircuitState(Enum):
CLOSED = "closed" # Normal operation
OPEN = "open" # Failing, reject requests
HALF_OPEN = "half_open" # Testing recovery
@dataclass
class CircuitBreaker:
"""
Circuit breaker for vLLM client resilience.
States:
- CLOSED: Normal operation, requests pass through
- OPEN: Too many failures, reject immediately
- HALF_OPEN: Testing if server recovered
Transitions:
- CLOSED -> OPEN: failure_threshold consecutive failures
- OPEN -> HALF_OPEN: recovery_timeout elapsed
- HALF_OPEN -> CLOSED: half_open_max_calls successes
- HALF_OPEN -> OPEN: any failure
"""
failure_threshold: int = 5
recovery_timeout: timedelta = field(default_factory=lambda: timedelta(seconds=30))
half_open_max_calls: int = 3
state: CircuitState = field(default=CircuitState.CLOSED)
failure_count: int = field(default=0)
last_failure_time: datetime | None = field(default=None)
half_open_successes: int = field(default=0)
def can_execute(self) -> bool:
"""Check if request should proceed."""
if self.state == CircuitState.CLOSED:
return True
if self.state == CircuitState.OPEN:
if self.last_failure_time is None:
return False
if datetime.now() - self.last_failure_time > self.recovery_timeout:
self.state = CircuitState.HALF_OPEN
self.half_open_successes = 0
return True
return False
# HALF_OPEN: allow limited requests
return True
def record_success(self) -> None:
"""Record successful request."""
if self.state == CircuitState.HALF_OPEN:
self.half_open_successes += 1
if self.half_open_successes >= self.half_open_max_calls:
self.state = CircuitState.CLOSED
self.failure_count = 0
else:
self.failure_count = 0
def record_failure(self) -> None:
"""Record failed request."""
self.failure_count += 1
self.last_failure_time = datetime.now()
if self.state == CircuitState.HALF_OPEN:
# Any failure in half-open reopens the circuit
self.state = CircuitState.OPEN
elif self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

Monitoring: Prometheus Metrics
Export metrics for observability. vLLM exposes built-in Prometheus metrics at /metrics, and you can add client-side metrics.
from prometheus_client import Counter, Histogram, Gauge
# Client-side metrics
REQUEST_COUNT = Counter(
"vllm_client_requests_total",
"Total number of requests to vLLM",
["model", "status"],
)
REQUEST_LATENCY = Histogram(
"vllm_client_request_latency_seconds",
"Request latency in seconds",
["model"],
buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0],
)
TOKENS_GENERATED = Counter(
"vllm_client_tokens_generated_total",
"Total tokens generated",
["model"],
)
ACTIVE_REQUESTS = Gauge(
"vllm_client_active_requests",
"Number of in-flight requests",
)
CIRCUIT_BREAKER_STATE = Gauge(
"vllm_client_circuit_breaker_state",
"Circuit breaker state (0=closed, 1=open, 2=half_open)",
)
def record_request_metrics(
model: str,
status: str,
latency_seconds: float,
tokens: int,
) -> None:
"""Record metrics for a completed request."""
REQUEST_COUNT.labels(model=model, status=status).inc()
REQUEST_LATENCY.labels(model=model).observe(latency_seconds)
if tokens > 0:
        TOKENS_GENERATED.labels(model=model).inc(tokens)

Monitoring: Key Metrics to Track
vLLM exposes many metrics. Here are the most important ones:
key_metrics = {
"Latency": [
"vllm:e2e_request_latency_seconds", # End-to-end request latency
"vllm:time_to_first_token_seconds", # TTFT for streaming
"vllm:time_per_output_token_seconds", # Generation speed
],
"Throughput": [
"vllm:avg_prompt_throughput_toks_per_s", # Prompt processing
"vllm:avg_generation_throughput_toks_per_s", # Token generation
],
"Queue": [
"vllm:num_requests_running", # Currently processing
"vllm:num_requests_waiting", # Waiting in queue
"vllm:num_requests_swapped", # Swapped to CPU (memory pressure)
],
"Resources": [
"vllm:gpu_cache_usage_perc", # KV cache memory usage
"vllm:cpu_cache_usage_perc", # CPU swap usage (if enabled)
],
}
# Print formatted
for category, metrics in key_metrics.items():
print(f"\n{category}:")
for metric in metrics:
        print(f"  - {metric}")

Monitoring: Alerting Rules
Set up alerts for critical conditions.
alerting_rules = """
groups:
- name: vllm-alerts
rules:
- alert: VLLMHighLatency
expr: histogram_quantile(0.99, sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le)) > 30
for: 5m
labels:
severity: warning
annotations:
summary: "vLLM P99 latency is high"
description: "P99 latency is {{ $value | humanizeDuration }} (threshold: 30s)"
- alert: VLLMQueueBacklog
expr: vllm:num_requests_waiting > 100
for: 2m
labels:
severity: warning
annotations:
summary: "vLLM request queue is backing up"
description: "{{ $value }} requests waiting in queue"
- alert: VLLMGPUMemoryCritical
expr: vllm:gpu_cache_usage_perc > 0.95
for: 5m
labels:
severity: critical
annotations:
summary: "vLLM GPU memory usage critical"
description: "GPU cache usage at {{ $value | humanizePercentage }}"
- alert: VLLMHighErrorRate
expr: |
sum(rate(vllm_client_requests_total{status="error"}[5m]))
/ sum(rate(vllm_client_requests_total[5m])) > 0.05
for: 1m
labels:
severity: critical
annotations:
summary: "vLLM error rate exceeds 5%"
description: "Error rate is {{ $value | humanizePercentage }}"
- alert: VLLMRequestsSwapped
expr: vllm:num_requests_swapped > 0
for: 5m
labels:
severity: warning
annotations:
summary: "vLLM is swapping requests to CPU"
description: "{{ $value }} requests swapped (indicates GPU memory pressure)"
"""
print(alerting_rules)

Queue Depth Visualization
Monitoring queue depth helps identify when to scale.
import random
# Simulated queue depth over time
random.seed(42)
time_points = list(range(0, 60))
# Simulate a load spike scenario
running = []
waiting = []
for t in time_points:
if t < 15:
# Normal load
r = 20 + random.randint(-5, 5)
w = 5 + random.randint(-3, 3)
elif t < 30:
# Load spike
r = 45 + random.randint(-5, 10)
w = 50 + random.randint(-10, 20)
elif t < 45:
# Scaling up (more capacity)
r = 60 + random.randint(-5, 5)
w = max(0, 30 - (t - 30) * 2 + random.randint(-5, 5))
else:
# Back to normal
r = 30 + random.randint(-5, 5)
w = 5 + random.randint(-3, 3)
running.append(max(0, r))
waiting.append(max(0, w))
queue_data = pl.DataFrame({
"time_minutes": time_points * 2,
"count": running + waiting,
"metric": ["Running"] * 60 + ["Waiting"] * 60,
})
queue_chart = alt.Chart(queue_data).mark_area(opacity=0.7).encode(
x=alt.X("time_minutes:Q", title="Time (minutes)"),
y=alt.Y("count:Q", title="Request Count", stack=True),
color=alt.Color(
"metric:N",
title="Status",
scale=alt.Scale(domain=["Running", "Waiting"], range=["#4c78a8", "#f58518"]),
),
).properties(
title="Request Queue Depth Over Time (Load Spike Scenario)",
width=600,
height=300,
)
# Add annotation for the spike
spike_annotation = alt.Chart(pl.DataFrame({
"x": [22],
"y": [100],
"text": ["Load spike"],
})).mark_text(align="center", dy=-10, fontSize=11).encode(
x="x:Q", y="y:Q", text="text:N"
)
scale_annotation = alt.Chart(pl.DataFrame({
"x": [37],
"y": [65],
"text": ["Autoscaling kicks in"],
})).mark_text(align="center", dy=-10, fontSize=11).encode(
x="x:Q", y="y:Q", text="text:N"
)
queue_chart + spike_annotation + scale_annotation

Production Deployment Checklist
Use this checklist before going to production:
Server Configuration
- Set appropriate --gpu-memory-utilization (typically 0.85-0.95)
- Enable --enable-prefix-caching for repeated prompts
- Configure --max-model-len based on your use case
- Use quantization if memory constrained (AWQ or GPTQ)
- Set --tensor-parallel-size appropriately for your GPU count
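Put together, a launch command exercising these flags might look like the following (model name and values are illustrative; drop --quantization when serving a full-precision checkpoint):

```shell
# Illustrative: two-way tensor parallelism with an AWQ-quantized checkpoint.
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --enable-prefix-caching \
  --quantization awq
```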
Client Implementation
- Use async clients for high throughput applications
- Implement connection pooling (shared httpx.AsyncClient)
- Add retry logic with exponential backoff
- Implement circuit breaker pattern for resilience
- Handle streaming for interactive use cases
High Availability
- Deploy multiple replicas behind load balancer
- Configure health checks (readiness + liveness probes)
- Use least_conn load balancing for variable request times
- Set up autoscaling based on queue depth or GPU utilization
- Share model cache via PersistentVolume to speed up scaling
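For the load-balancing items, an nginx upstream along these lines works (hostnames and timeouts are illustrative). least_conn matters here because LLM request durations vary by orders of magnitude, so round-robin routinely piles long generations onto one replica:

```nginx
upstream vllm_backends {
    least_conn;                      # route to the replica with fewest in-flight requests
    server vllm-0.vllm.svc:8000 max_fails=3 fail_timeout=30s;
    server vllm-1.vllm.svc:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://vllm_backends;
        proxy_read_timeout 300s;     # generations can stream for minutes
        proxy_buffering off;         # flush streamed tokens immediately
    }
}
```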
Monitoring
- Export Prometheus metrics from vLLM (/metrics endpoint)
- Track latency percentiles (p50, p90, p99)
- Monitor queue depth (running + waiting requests)
- Monitor GPU memory and utilization
- Set up alerts for high latency, queue backlog, error rate
Security
- Enable API key authentication if exposed publicly
- Use TLS termination at load balancer
- Implement rate limiting at the gateway level
- Audit log all requests for compliance
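For the first security item: the vLLM OpenAI-compatible server accepts an --api-key flag, after which requests must carry standard bearer auth (the key value and payload here are illustrative):

```shell
# Server: requests lacking this key are rejected with 401.
vllm serve mistralai/Mistral-7B-Instruct-v0.3 --api-key "$VLLM_API_KEY"

# Client: standard OpenAI-style Authorization header.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3",
       "messages": [{"role": "user", "content": "ping"}]}'
```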
Summary
Deploying vLLM for production requires attention to:
- Client patterns: Async clients with connection pooling for throughput, streaming for interactivity
- Resilience: Retries with backoff, circuit breakers, proper exception handling
- Scaling: Kubernetes with GPU node pools, HPA based on queue metrics
- Performance: Right-sized --gpu-memory-utilization, prefix caching, quantization when needed
- Observability: Prometheus metrics, alerting on latency/queue/errors
The key insight is that vLLM handles the hard part (continuous batching, PagedAttention) automatically. Your job is to:
- Send enough concurrent requests to keep GPUs busy
- Handle failures gracefully
- Scale based on queue depth, not just CPU/memory
- Monitor the right metrics (latency percentiles, not just averages)
With these patterns in place, you can achieve the 2-24x throughput improvements that vLLM promises while maintaining production reliability.