SentryML

ML Model Deployment: Serving Frameworks, KV Cache, and the Latency Metrics That Matter

Once a model clears staging, the serving stack decision determines whether you hit your latency SLAs or spend a sprint chasing p99 spikes. Here's what to evaluate and what to instrument.

By SentryML Editorial · 8 min read

ML model deployment is where the model meets the load balancer. The model passed staging, it’s registered in MLflow or Weights & Biases, and now you need to decide what actually serves predictions in production. That decision — which serving framework, how many replicas, what hardware — shapes every latency SLA you’ll try to hit for the next six months.

This post covers the serving infrastructure layer: framework selection, the latency metrics that matter, how to configure a production-grade service, and what to watch once traffic is live. For the organizational and process side of ML model deployment (training-serving skew, deployment strategies, monitoring layers), see ML Model Deployment: A Guide to Shipping Models That Stay Healthy.

Picking the Serving Stack

Classic ML models (sklearn pipelines, XGBoost, tabular PyTorch) and large language models are different deployment problems. Running them through the same serving infrastructure is a mistake that costs you either latency or operability.

For classic ML models, the primary concerns are CPU or memory efficiency, schema validation on incoming features, and throughput for batch or synchronous requests. MLflow’s deployment layer handles this cleanly: mlflow models serve launches a FastAPI-backed inference server that reads the model directly from the registry, preserves the dependency environment via logged conda or pip requirements, and exposes a /invocations endpoint. MLflow also supports containerized deployment via mlflow models build-docker, which packages model, code, and dependencies into an OCI image you can ship to Kubernetes, SageMaker, or Azure ML without modifying the artifact. For higher throughput on GPU, Triton Inference Server adds batching, dynamic shapes, and backend switching (ONNXRuntime, TensorRT) without framework-specific wrapper code.
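As a rough illustration of what a client sends to the /invocations endpoint, here is a minimal sketch using only the standard library. It assumes the server accepts MLflow's `dataframe_split` JSON input format; the feature names and the URL in the comment are hypothetical placeholders, not values from a real deployment.

```python
import json
import urllib.request


def build_invocations_payload(columns, rows):
    """Serialize tabular features as a `dataframe_split` JSON body."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("each row must have one value per column")
    body = {"dataframe_split": {"columns": list(columns),
                                "data": [list(r) for r in rows]}}
    return json.dumps(body)


def score(url, columns, rows, timeout=10.0):
    """POST the payload to a scoring endpoint and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=build_invocations_payload(columns, rows).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


# Hypothetical usage against a locally served model (URL is an assumption):
# score("http://127.0.0.1:5000/invocations", ["age", "income"], [[34, 52000.0]])
```

Validating row width on the client is a cheap first line of defense; the model signature logged with the artifact remains the authoritative schema check.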

For LLMs, the constraint is GPU memory and KV cache management. Vanilla HTTP servers lose badly to purpose-built inference engines. The field has largely converged on three options:

  • vLLM: highest raw throughput via PagedAttention, continuous batching, and chunked prefill. Best when tokens/sec is the primary goal and you can absorb deployment complexity.
  • Ray Serve: production-grade lifecycle management (health checks, rolling updates, replica autoscaling) built on Ray’s distributed runtime. The Ray Serve production guide recommends running it on Kubernetes via KubeRay’s RayService custom resource, which handles failure recovery and upgrades automatically. Strong fit for multi-model pipelines and teams already running on Ray.
  • BentoML: developer ergonomics first — one Python class defines the service, BentoML handles containerization, Kubernetes manifests, and OpenAI-compatible endpoints. Lower ops overhead for smaller teams.

For multi-GPU models above 70B parameters, vLLM with tensor parallelism and Ray for inter-node coordination is a well-established stack. For teams running a single model below 70B on one node, BentoML wrapping vLLM reduces operational overhead without a meaningful throughput penalty.

The Latency Metrics That Matter

Aggregate mean response time is a poor signal for LLM serving. Use these four metrics instead:

Time to first token (TTFT) — interval from request received to first token emitted. TTFT is what users experience as responsiveness. A slow TTFT on a streaming endpoint makes the system feel broken even if total generation time is fine. Formula: t_first_token − t_request_received.

Time per output token (TPOT) — average decoding latency per token after the first. Drives the perceived streaming throughput after TTFT.

p99 end-to-end latency — 99th percentile of total response time. Mean latency is dominated by fast requests; p99 surfaces queue pressure and cache miss cases that affect real users. This is the metric to put in your SLA.

KV cache utilization — fraction of KV cache blocks currently in use. vLLM exposes this as a Prometheus metric (vllm:gpu_cache_usage_perc). When KV cache utilization exceeds roughly 85%, new requests start queuing and TTFT spikes. This is the earliest latency warning signal available — it fires before p99 degrades, giving you time to autoscale before users notice.

For classic ML models, replace KV cache utilization with GPU memory utilization and request queue depth. The principle is the same: instrument the resource that saturates first under load.
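The three latency metrics above can be derived from per-request timestamps. The sketch below is one way to do it, assuming you log when the request arrived, when the first and last tokens were emitted, and how many tokens were generated. The `RequestTrace` type and its field names are hypothetical, and the percentile uses the simple nearest-rank method rather than the histogram interpolation Prometheus applies.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTrace:
    t_request_received: float  # seconds
    t_first_token: float       # seconds
    t_last_token: float        # seconds
    output_tokens: int


def ttft(trace):
    """Time to first token: t_first_token - t_request_received."""
    return trace.t_first_token - trace.t_request_received


def tpot(trace):
    """Average decode time per token after the first one."""
    if trace.output_tokens < 2:
        return 0.0
    return (trace.t_last_token - trace.t_first_token) / (trace.output_tokens - 1)


def e2e_latency(trace):
    """Total time from request arrival to the last token."""
    return trace.t_last_token - trace.t_request_received


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```

Calling `percentile([e2e_latency(t) for t in traces], 99)` over a window of traces gives the p99 figure to compare against the SLA.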

Wiring It Up

Ray Serve production deployments are defined in YAML and deployed via serve deploy. The configuration file becomes the source of truth for replica counts, autoscaling policy, and runtime dependencies:

# ray-serve-config.yaml
proxy_location: EveryNode
http_options:
  host: "0.0.0.0"
  port: 8000

applications:
  - name: classifier
    route_prefix: /predict
    import_path: "serving.app:deployment"
    runtime_env:
      pip:
        - torch==2.3.0
        - scikit-learn==1.5.0
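    deployments:
      - name: Classifier
        # num_replicas and autoscaling_config are mutually exclusive in Ray Serve,
        # so the starting replica count is set through initial_replicas instead.
        ray_actor_options:
          num_cpus: 2
          num_gpus: 0.5
        autoscaling_config:
          initial_replicas: 4
          min_replicas: 2
          max_replicas: 12
          target_num_ongoing_requests_per_replica: 10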

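target_num_ongoing_requests_per_replica is the autoscaling target. When the average number of ongoing requests per replica (queued plus executing) stays above 10, Ray Serve adds replicas up to max_replicas. This signal reacts faster than CPU utilization for inference workloads because requests pile up and latency climbs before the CPU saturates. Newer Ray releases also accept the shorter name target_ongoing_requests, so check the autoscaling reference for your version.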
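The scaling decision can be approximated with a few lines of arithmetic. This is a simplified sketch, not Ray Serve's actual implementation: the real autoscaler also applies smoothing and delay settings that are omitted here, and the function name `desired_replicas` is made up for illustration.

```python
import math


def desired_replicas(total_ongoing_requests, target_per_replica,
                     min_replicas, max_replicas):
    """Replicas needed to keep per-replica load at or below the target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(total_ongoing_requests / target_per_replica)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))
```

With the configuration above (target 10, bounds 2 to 12), 95 ongoing requests across the deployment would call for 10 replicas under this simplified model.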

For LLM workloads, BentoML’s service definition wraps vLLM with a single class:

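import bentoml

# Request one GPU per replica; the long timeout covers slow generations.
@bentoml.service(
    resources={"gpu": 1, "memory": "24Gi"},
    traffic={"timeout": 300},
)
class LLMService:
    def __init__(self):
        # Import vLLM lazily so this module can be imported without loading vLLM,
        # for example at build time.
        from vllm import LLM, SamplingParams

        # Load the engine once per replica, not once per request.
        self.model = LLM(
            model="meta-llama/Llama-3.1-8B-Instruct",
            gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
            max_model_len=8192,           # cap on prompt plus generated tokens
            enable_prefix_caching=True,   # reuse KV blocks for shared prompt prefixes
        )
        self.sampling_params = SamplingParams(max_tokens=512)

    @bentoml.api
    def generate(self, prompt: str) -> str:
        # generate() returns one RequestOutput per prompt; take its first completion.
        outputs = self.model.generate(prompt, self.sampling_params)
        return outputs[0].outputs[0].text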

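bentoml build followed by bentoml containerize packages this into an OCI image with the service code and its declared Python dependencies, including vLLM. As written, the service pulls model weights from the Hugging Face Hub at startup rather than baking them into the image, so the container needs network access and credentials for gated models. The resulting image deploys to any Kubernetes cluster or cloud container runtime that provides a compatible NVIDIA GPU and driver.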

What to Watch After Deploy

These four alerts catch the most common post-deployment failures within the first 72 hours:

  1. KV cache utilization > 80%: autoscale signal. TTFT will spike within minutes at this level under sustained load.
  2. p99 TTFT > 2× baseline: queue depth is building or there is a cache miss regression on a new prompt pattern.
  3. Request error rate > 0.1%: catches GPU out-of-memory failures and CUDA errors that surface as 5xx responses.
  4. Input token length distribution shift: a leading indicator of upstream change. If requests suddenly include much longer prompts, KV cache pressure follows within the hour.

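vLLM exports most of these signals via Prometheus out of the box: vllm:num_requests_running, vllm:gpu_cache_usage_perc, vllm:time_to_first_token_seconds, vllm:e2e_request_latency_seconds, and vllm:request_prompt_tokens. Error rate usually comes from the HTTP layer in front of the engine, such as 5xx counts at the proxy or ingress. Scrape the metrics into Prometheus, chart them in Grafana, and set threshold alerts at the Prometheus layer so they fire even when the dashboard isn’t open.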
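To make the four thresholds explicit and testable, here is a small sketch that evaluates them against a metrics snapshot. In production these checks would normally live in Prometheus alerting rules; the `Snapshot` type, its field names, and the 1.5× prompt-length ratio are assumptions for illustration, since the text does not give a numeric threshold for distribution shift.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    kv_cache_utilization: float         # fraction, 0.0 to 1.0
    p99_ttft_seconds: float
    baseline_p99_ttft_seconds: float
    error_rate: float                   # fraction of requests returning errors
    mean_prompt_tokens: float
    baseline_mean_prompt_tokens: float


def fired_alerts(snapshot, prompt_shift_ratio=1.5):
    """Return the names of alerts whose thresholds are exceeded."""
    alerts = []
    if snapshot.kv_cache_utilization > 0.80:
        alerts.append("kv_cache_utilization")
    if snapshot.p99_ttft_seconds > 2 * snapshot.baseline_p99_ttft_seconds:
        alerts.append("p99_ttft")
    if snapshot.error_rate > 0.001:
        alerts.append("error_rate")
    # Crude proxy for distribution shift: mean prompt length vs. baseline.
    if snapshot.mean_prompt_tokens > prompt_shift_ratio * snapshot.baseline_mean_prompt_tokens:
        alerts.append("prompt_length_shift")
    return alerts
```

Comparing mean prompt length is only a coarse stand-in for a distribution test; a histogram comparison would catch shifts that leave the mean unchanged.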

Caveats

Quantization changes the serving math. 8-bit weight quantization (INT8 via llm.int8(), or FP8) cuts weight memory roughly in half relative to FP16, and schemes that are typically 4-bit, such as AWQ and GPTQ, cut it further, which leaves room for more KV cache blocks in the same VRAM. But quantized models can show measurable quality regression, particularly on long-context and reasoning tasks. Run your golden eval set before enabling quantization in production: the quality cost is workload-dependent and not safe to assume acceptable.
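A back-of-the-envelope calculation shows how much KV cache room weight quantization frees up. The sketch below assumes an 8B-parameter model with 32 layers, 8 KV heads, and a head dimension of 128, an FP16 KV cache, and a 24 GiB GPU. All of these are illustrative assumptions, and the estimate ignores activations and runtime overhead, so real capacity will be lower.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_token_capacity(vram_bytes, gpu_memory_utilization,
                      num_params, weight_bytes_per_param, per_token_bytes):
    """Tokens of KV cache that fit after weights, ignoring other overhead."""
    budget = vram_bytes * gpu_memory_utilization
    free = budget - num_params * weight_bytes_per_param
    return max(0, int(free // per_token_bytes))


# Assumed model shape and hardware (hypothetical values, see lead-in).
PER_TOKEN = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, bytes_per_value=2)  # 131072 bytes
VRAM = 24 * 2**30

fp16_tokens = kv_token_capacity(VRAM, 0.90, 8_000_000_000, 2, PER_TOKEN)
int8_tokens = kv_token_capacity(VRAM, 0.90, 8_000_000_000, 1, PER_TOKEN)
```

Under these assumptions, halving weight memory more than doubles the number of tokens the KV cache can hold, because the weights consume most of the budget at FP16.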

Tensor parallelism adds network overhead. Splitting a model across multiple GPUs requires all-reduce operations on every forward pass. Over NVLink this is fast; over PCIe-only GPU interconnects the all-reduce can become the throughput ceiling. Benchmark on your actual hardware before assuming multi-GPU gives linear throughput gains.

Replica count is not a substitute for profiling. Adding replicas helps when you’re request-queue bound. It does not help when the bottleneck is GPU memory bandwidth or a slow KV cache prefill for a specific prompt pattern. Profile first; scale second.

For the security surface of model serving — model theft via inference probing, adversarial input attacks, and supply chain risk on model artifacts — aisec.blog covers the offensive techniques that target production ML infrastructure. For regulatory framing around which models you can deploy and under what governance requirements, neuralwatch.org tracks EU AI Act and NIST AI RMF obligations that affect deployment decisions.

Sources

  1. Ray Serve Production Guide — Ray Documentation
  2. Deploying a Large Language Model with BentoML and vLLM — BentoML Blog
  3. ML Model Serving — MLflow Documentation