A Field Guide to the OpenTelemetry GenAI Semantic Conventions
What the OpenTelemetry GenAI semantic conventions actually standardize — spans, events, and metrics for LLM and agent telemetry — what they don't yet cover, and how to instrument an LLM app against a moving spec without painting yourself into a corner.
If you are instrumenting an LLM application in 2026, you do not have to invent your telemetry schema from scratch. OpenTelemetry has been standardizing one, the GenAI semantic conventions, and adopting it buys you vendor-portable traces, metrics that mean the same thing across tools, and a schema you do not have to defend in a design review every quarter.
It also has a sharp edge: as of this writing the GenAI conventions are still marked Development, not Stable. Attribute names change, the structure of how prompts and completions are captured has already shifted once, and a naive instrumentation against the experimental spec will break on upgrade. This guide covers what the conventions actually standardize today, where they stop, and how to instrument against a moving target without rework. It is written for the engineer who got handed a production LLM deployment and needs telemetry that survives the next six months.
The three signal types
OpenTelemetry models GenAI telemetry as the same three signals it uses everywhere — spans, metrics, and (for the content itself) events/logs. The GenAI conventions are simply an agreed-upon set of names and attributes layered on top.
Spans capture a single GenAI operation — one model call, one tool invocation, one agent step — as a unit of work with a start, end, and structured attributes. Metrics are the aggregatable numbers: token counts, durations. Events carry the high-cardinality content — the prompt and the completion — which you generally do not want as span attributes because they are large and sensitive.
Spans: the unit of an LLM operation
The client span conventions define the attributes that describe a model call. The load-bearing ones, all under the gen_ai.* namespace:
- gen_ai.operation.name: what kind of operation (chat, text_completion, embeddings, execute_tool).
- gen_ai.provider.name: the provider/format flavor (e.g. openai, anthropic, aws.bedrock). This is the discriminator that tells a backend which provider-specific attributes to expect.
- gen_ai.request.model and gen_ai.response.model: what you asked for versus what answered. These can differ (aliasing, routing, fallback), and the gap is worth monitoring on its own.
- gen_ai.request.*: the sampling parameters (temperature, top_p, max_tokens, etc.).
- gen_ai.usage.input_tokens and gen_ai.usage.output_tokens: token accounting on the span.
- gen_ai.response.finish_reasons: why generation stopped (stop, length, tool_calls, content filtering). A finish-reason distribution that shifts toward length or filtering is a real signal.
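As a rough illustration, the attributes listed above can be assembled as a plain dictionary before they are attached to a span. This is a minimal sketch using only the standard library: the helper names build_chat_span_attributes and model_mismatch are hypothetical, the model names and token counts are made-up example values, and a real application would hand the resulting dictionary to its tracing SDK.

```python
from typing import Optional, Sequence


def build_chat_span_attributes(
    provider: str,
    request_model: str,
    response_model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reasons: Sequence[str],
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
) -> dict:
    """Assemble gen_ai.* attributes for one chat call (hypothetical helper)."""
    attrs = {
        "gen_ai.operation.name": "chat",
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }
    # Sampling parameters are optional; record only the ones that were set.
    if temperature is not None:
        attrs["gen_ai.request.temperature"] = temperature
    if max_tokens is not None:
        attrs["gen_ai.request.max_tokens"] = max_tokens
    return attrs


def model_mismatch(attrs: dict) -> bool:
    """True when the model that answered differs from the one requested."""
    return attrs["gen_ai.request.model"] != attrs["gen_ai.response.model"]


# Example values are invented for illustration.
attrs = build_chat_span_attributes(
    provider="openai",
    request_model="example-model",
    response_model="example-model-2026-01-01",
    input_tokens=120,
    output_tokens=48,
    finish_reasons=["stop"],
    temperature=0.2,
)
print(model_mismatch(attrs))
```

A plain inequality will also flag benign cases such as an alias resolving to a dated model name, so in practice the comparison probably wants an allow-list of known alias pairs.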
Span kind is CLIENT for a call to a remote model, and the conventions also define agent and framework spans for multi-step agentic systems — spans that wrap a planning step, a tool call (execute_tool), or an agent invocation. This is the layer that matters most for security and governance: an agent span tree is where you can see what the agent decided to do, not just what tokens it emitted.
Metrics: the aggregatable numbers
The GenAI metrics are defined as histograms (all currently Development status):
Client-side:
- gen_ai.client.token.usage: input and output tokens, unit {token}.
- gen_ai.client.operation.duration: overall operation latency, unit s.
- gen_ai.client.operation.time_to_first_chunk: streaming TTFB, unit s.
- gen_ai.client.operation.time_per_output_chunk: inter-chunk latency for streaming, unit s.
Server-side (if you operate the model server, e.g. vLLM or TGI):
- gen_ai.server.request.duration: model-server latency, unit s.
- gen_ai.server.time_to_first_token: queue + prefill phase, unit s.
- gen_ai.server.time_per_output_token: decode-phase performance, unit s.
The split between time-to-first-token (queue + prefill) and time-per-output-token (decode) is the right mental model for LLM latency. A latency regression that lives in TTFT is a queueing or prefill problem (batch sizing, scheduling); one that lives in time-per-output-token is a decode problem (model size, KV-cache pressure). One aggregate latency number hides which.
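To make the split concrete, here is a small sketch of how the two numbers could be derived on the client from chunk arrival times. The function name streaming_latency and the timestamps are assumptions for illustration. It reports a single mean gap per request, whereas the metrics above are histograms, and a client-side measurement also includes network time, so treat it as an approximation of the server-side prefill/decode split.

```python
from typing import List, Tuple


def streaming_latency(request_start: float, chunk_times: List[float]) -> Tuple[float, float]:
    """Return (time_to_first_chunk, mean_gap_between_later_chunks) in seconds.

    chunk_times holds the arrival timestamp of each streamed chunk, in order.
    """
    if not chunk_times:
        raise ValueError("no chunks received")
    time_to_first = chunk_times[0] - request_start
    if len(chunk_times) == 1:
        # Only one chunk arrived, so there is no inter-chunk gap to report.
        return time_to_first, 0.0
    # Average gap between consecutive chunks after the first one.
    per_chunk = (chunk_times[-1] - chunk_times[0]) / (len(chunk_times) - 1)
    return time_to_first, per_chunk


# Made-up timestamps in seconds: a slow start followed by a steady stream.
ttfc, per_chunk = streaming_latency(10.0, [10.8, 10.85, 10.9, 10.95, 11.0])
print(f"first chunk after {ttfc:.2f}s, then {per_chunk:.3f}s per chunk")
```

In this invented trace almost all of the latency sits before the first chunk, which is the pattern that points at queueing or prefill rather than decode.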
Events: the prompt and completion content
Capturing the actual prompt and response is handled separately, because the content is large, high-cardinality, and frequently contains PII or secrets. The conventions model this as GenAI events/log records rather than fat span attributes. The practical implication is a governance one: capturing message content is opt-in and configurable, and instrumentation libraries gate it behind a setting precisely because you usually do not want raw prompts flowing into your trace backend by default. Decide deliberately what content you retain, redact before export, and treat the content channel as a regulated data flow — not a debug convenience.
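One way to picture the "gate it, redact it, default it off" policy is the sketch below. Everything in it is an assumption for illustration: the function names, the capture_enabled flag, and the two regular expressions, which are deliberately simplistic and not a substitute for a reviewed redaction rule set.

```python
import re
from typing import Optional

# Illustrative patterns only; real redaction needs a reviewed, tested rule set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET = re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9_-]{8,}\b")


def redact(text: str) -> str:
    """Replace email addresses and key-like tokens with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return SECRET.sub("[SECRET]", text)


def content_for_export(text: str, capture_enabled: bool = False) -> Optional[str]:
    """Return redacted content when capture is on, otherwise None.

    Capture defaults to off, so raw prompts never leave the process
    unless someone turns it on deliberately.
    """
    if not capture_enabled:
        return None
    return redact(text)


prompt = "Email alice@example.com the report, auth with sk-abcdef1234567890"
print(content_for_export(prompt))
print(content_for_export(prompt, capture_enabled=True))
```

Returning None rather than an empty string keeps "capture disabled" distinguishable from "the prompt was empty" when someone later audits what was retained.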
What the conventions do not give you
Adopting the spec is necessary, not sufficient. The gaps that matter, especially for anyone using telemetry for security rather than just performance:
- Identity and delegation context. The conventions standardize what model operation happened, not on whose authority. In an agentic system the security-relevant question is which principal an action was taken for and which delegation chain authorized it. That context — user identity, the originating request, the authorization scope of a tool call — is not in the GenAI conventions. You add it via the general OpenTelemetry attributes and your own conventions, and it is the single most valuable thing to put on an agent/tool span.
- Semantic safety signals. Guardrail verdicts, injection-detection scores, PII-classification results on inputs and outputs — none of this is in the GenAI spec. It is application-specific, and you emit it as your own attributes. The conventions give you the skeleton; the safety telemetry is yours to define.
- Stability. Development status means attribute names can change between releases. This is the operational risk, and the next section is how to manage it.
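The first two gaps can be sketched as attributes you attach to a tool span yourself. This is only an illustration: DelegationContext, tool_span_attributes, and every acme.* name are hypothetical, the values are invented, and gen_ai.tool.name is included on the assumption that it matches the upstream spec at the version you emit, which is worth verifying.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DelegationContext:
    """Hypothetical carrier for whose authority an agent action runs under."""
    principal_id: str                  # user or service the action is taken for
    request_id: str                    # originating request
    delegation_chain: Tuple[str, ...]  # who delegated to whom, outermost first
    tool_scope: str                    # authorization scope granted to this call


def tool_span_attributes(
    tool_name: str,
    ctx: DelegationContext,
    injection_score: float,
    guardrail_verdict: str,
) -> dict:
    """Combine the upstream skeleton with your own identity and safety attributes."""
    return {
        # Upstream, portable layer.
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": tool_name,
        # Own namespace: identity and delegation context.
        "acme.llm.principal_id": ctx.principal_id,
        "acme.llm.request_id": ctx.request_id,
        "acme.llm.delegation_chain": list(ctx.delegation_chain),
        "acme.llm.tool_scope": ctx.tool_scope,
        # Own namespace: semantic safety signals.
        "acme.guardrail.injection_score": injection_score,
        "acme.guardrail.verdict": guardrail_verdict,
    }


ctx = DelegationContext(
    principal_id="user:42",
    request_id="req-7f3a",
    delegation_chain=("user:42", "agent:planner"),
    tool_scope="tickets:read",
)
attrs = tool_span_attributes(
    "search_tickets", ctx, injection_score=0.03, guardrail_verdict="allow"
)
print(sorted(attrs))
```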
Instrumenting against a moving spec
The conventions are evolving, and an instrumentation written against last quarter’s experimental attributes will emit names a current backend does not recognize. Four practices keep you from rework:
Use the stability opt-in, deliberately. OpenTelemetry GenAI instrumentation libraries gate behavior behind the OTEL_SEMCONV_STABILITY_OPT_IN environment variable (with values such as gen_ai_latest_experimental). Set it explicitly so you know which convention version you are emitting, rather than inheriting a library default that shifts under you on upgrade. Pin it, and change it as a conscious migration step with a query audit, not as a side effect of a dependency bump.
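A startup check along these lines is one way to make the pin explicit. It is a sketch, not part of any OpenTelemetry library: check_opt_in is a hypothetical helper, and it assumes the variable holds a comma-separated list of values.

```python
import os
from typing import Mapping, Optional, Set

OPT_IN_VAR = "OTEL_SEMCONV_STABILITY_OPT_IN"


def check_opt_in(expected: str, environ: Optional[Mapping[str, str]] = None) -> Set[str]:
    """Fail fast unless the opt-in variable is set and contains `expected`."""
    env = os.environ if environ is None else environ
    raw = env.get(OPT_IN_VAR, "")
    # Assumption: the variable is a comma-separated list of opt-in values.
    values = {value.strip() for value in raw.split(",") if value.strip()}
    if not values:
        raise RuntimeError(f"{OPT_IN_VAR} is not set; refusing to rely on a library default")
    if expected not in values:
        raise RuntimeError(f"{OPT_IN_VAR}={raw!r} does not include {expected!r}")
    return values


# Demonstration with an explicit mapping instead of the real environment.
pinned = check_opt_in(
    "gen_ai_latest_experimental",
    {OPT_IN_VAR: "gen_ai_latest_experimental"},
)
print(pinned)
```

Calling check_opt_in("gen_ai_latest_experimental") with no second argument at service startup would read the real environment and stop the process if the pin is missing.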
Prefer auto-instrumentation, but verify the attribute names. The OpenTelemetry ecosystem and adjacent projects (e.g. the OpenLLMetry instrumentations) emit these conventions for popular SDKs with little code. Use them, but spot-check the emitted span attributes against the current spec on each upgrade, because the auto-instrumentation’s notion of “current” tracks the experimental spec and can move.
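The spot-check can be as simple as diffing one captured span's attribute names against the set you expect. The sketch below assumes you can export a span's attributes as a dictionary. The helper audit_gen_ai_attributes and the sample span are hypothetical, and the expected set is just the load-bearing names from this guide, not an authoritative copy of the spec.

```python
from typing import Dict, FrozenSet, List, Mapping

# Names this guide treats as load-bearing; re-check against the spec on upgrade.
EXPECTED_GEN_AI: FrozenSet[str] = frozenset({
    "gen_ai.operation.name",
    "gen_ai.provider.name",
    "gen_ai.request.model",
    "gen_ai.response.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "gen_ai.response.finish_reasons",
})


def audit_gen_ai_attributes(
    span_attrs: Mapping[str, object],
    expected: FrozenSet[str] = EXPECTED_GEN_AI,
) -> Dict[str, List[str]]:
    """Report expected gen_ai.* names that are absent and emitted ones not expected."""
    emitted = {name for name in span_attrs if name.startswith("gen_ai.")}
    return {
        "missing": sorted(expected - emitted),
        "unexpected": sorted(emitted - expected),
    }


# Invented sample: one token attribute is emitted under a different name.
sample_span = {
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": "openai",
    "gen_ai.request.model": "example-model",
    "gen_ai.response.model": "example-model",
    "gen_ai.usage.prompt_tokens": 120,
    "gen_ai.usage.output_tokens": 48,
    "gen_ai.response.finish_reasons": ["stop"],
    "http.request.method": "POST",
}
report = audit_gen_ai_attributes(sample_span)
print(report)
```

The "unexpected" list is a prompt for review rather than a failure, since it will also contain legitimate attributes you simply have not added to the expected set.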
Layer your own stable namespace for what you depend on. For the attributes your dashboards and alerts depend on — your identity context, your guardrail verdicts, your business dimensions — define them under your own namespace (e.g. acme.llm.principal_id, acme.guardrail.injection_score). The gen_ai.* attributes will eventually stabilize; until they do, do not build a paging alert on top of an attribute name the spec might rename. Build it on a name you control, and treat the upstream conventions as the portable, best-effort layer beneath it.
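A thin alias table is one way to keep dashboards on names you control while upstream names move. In this sketch the acme.* names and the to_stable helper are hypothetical. The older upstream names in the table (gen_ai.system, gen_ai.usage.prompt_tokens, gen_ai.usage.completion_tokens) are recalled from earlier spec revisions and should be treated as an assumption to verify against the semantic-conventions changelog.

```python
from typing import Dict, Mapping, Tuple

# Own stable name -> upstream candidates, current name first, older names after.
ALIASES: Dict[str, Tuple[str, ...]] = {
    "acme.llm.provider": ("gen_ai.provider.name", "gen_ai.system"),
    "acme.llm.input_tokens": ("gen_ai.usage.input_tokens", "gen_ai.usage.prompt_tokens"),
    "acme.llm.output_tokens": ("gen_ai.usage.output_tokens", "gen_ai.usage.completion_tokens"),
}


def to_stable(span_attrs: Mapping[str, object]) -> Dict[str, object]:
    """Copy upstream values onto names you control, tolerating renames."""
    stable: Dict[str, object] = {}
    for own_name, candidates in ALIASES.items():
        for upstream in candidates:
            if upstream in span_attrs:
                # First match wins, so the current upstream name takes priority.
                stable[own_name] = span_attrs[upstream]
                break
    return stable


# A span emitted by an older instrumentation (invented values).
old_span = {
    "gen_ai.system": "openai",
    "gen_ai.usage.prompt_tokens": 10,
    "gen_ai.usage.output_tokens": 5,
}
print(to_stable(old_span))
```

When the spec renames an attribute, the migration is one new entry in the table rather than a rewrite of every alert query.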
Don’t put prompts in span attributes. Even where it is technically allowed, large message content as a span attribute bloats trace storage and leaks sensitive data into a system whose access controls were designed for ops, not for regulated content. Use the content-capture path, gate it, redact it, and default it off.
The minimum useful instrumentation
If you are starting today, the smallest setup that earns its keep:
- Emit client spans with gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.response.model, finish reasons, and token usage. Auto-instrumentation gets you most of this.
- Emit the four client metrics (token usage, operation duration, and the two streaming-latency histograms) so you can separate prefill from decode and watch cost.
- Add agent/tool spans for any agentic flow, and decorate them with your own identity and authorization attributes. This is where security observability actually lives.
- Gate content capture behind an explicit, off-by-default setting with redaction.
- Pin the stability opt-in and review attribute names on every dependency upgrade.
That gets you portable performance and cost telemetry today, the agent-action visibility that security teams need, and a migration path that does not collapse when the conventions move from Development to Stable. The OpenTelemetry GenAI observability writeup is a good companion read for the rationale behind the design.
Sources
- OpenTelemetry, Semantic Conventions for Generative AI Systems: the index for the GenAI conventions; note the Development status banner.
- OpenTelemetry, GenAI Metrics: the exact client and server metric names, units, and histogram definitions.
- OpenTelemetry, GenAI Client Spans: span attributes for model operations.
- OpenTelemetry Blog, Inside the LLM Call: design rationale and a walkthrough of GenAI observability.
- open-telemetry/semantic-conventions (GitHub): the source of truth and changelog; watch this to track when GenAI attributes stabilize.