SentryML
Distributed tracing spans visualized as a waterfall
monitoring

A Field Guide to the OpenTelemetry GenAI Semantic Conventions

What the OpenTelemetry GenAI semantic conventions actually standardize — spans, events, and metrics for LLM and agent telemetry — what they don't yet cover, and how to instrument an LLM app against a moving spec without painting yourself into a corner.

By Priya Anand · · 8 min read

If you are instrumenting an LLM application in 2026, you do not have to invent your telemetry schema from scratch. OpenTelemetry has been standardizing one — the GenAI semantic conventions — and adopting it buys you vendor-portable traces, metrics that mean the same thing across tools, and a schema you do not have to defend in a design review every quarter.

It also has a sharp edge: as of this writing the GenAI conventions are still marked Development, not Stable. Attribute names change, the structure of how prompts and completions are captured has already shifted once, and a naive instrumentation against the experimental spec will break on upgrade. This guide covers what the conventions actually standardize today, where they stop, and how to instrument against a moving target without rework. It is written for the engineer who got handed a production LLM deployment and needs telemetry that survives the next six months.

The three signal types

OpenTelemetry models GenAI telemetry as the same three signals it uses everywhere — spans, metrics, and (for the content itself) events/logs. The GenAI conventions are simply an agreed-upon set of names and attributes layered on top.

Spans capture a single GenAI operation — one model call, one tool invocation, one agent step — as a unit of work with a start, end, and structured attributes. Metrics are the aggregatable numbers: token counts, durations. Events carry the high-cardinality content — the prompt and the completion — which you generally do not want as span attributes because they are large and sensitive.

Spans: the unit of an LLM operation

The client span conventions define the attributes that describe a model call. The load-bearing ones, all under the gen_ai.* namespace:

  • gen_ai.operation.name — what kind of operation (chat, text_completion, embeddings, execute_tool).
  • gen_ai.provider.name — the provider/format flavor (e.g. openai, anthropic, aws.bedrock). This is the discriminator that tells a backend which provider-specific attributes to expect.
  • gen_ai.request.model and gen_ai.response.model — what you asked for versus what answered. These can differ (aliasing, routing, fallback), and the gap is worth monitoring on its own.
  • gen_ai.request.* — the sampling parameters: temperature, top_p, max_tokens, etc.
  • gen_ai.usage.input_tokens and gen_ai.usage.output_tokens — token accounting on the span.
  • gen_ai.response.finish_reasons — why generation stopped (stop, length, tool_calls, content filtering). A finish-reason distribution that shifts toward length or filtering is a real signal.

Span kind is CLIENT for a call to a remote model, and the conventions also define agent and framework spans for multi-step agentic systems — spans that wrap a planning step, a tool call (execute_tool), or an agent invocation. This is the layer that matters most for security and governance: an agent span tree is where you can see what the agent decided to do, not just what tokens it emitted.

Metrics: the aggregatable numbers

The GenAI metrics are defined as histograms (all currently Development status):

Client-side:

  • gen_ai.client.token.usage — input and output tokens, unit {token}.
  • gen_ai.client.operation.duration — overall operation latency, unit s.
  • gen_ai.client.operation.time_to_first_chunk — streaming TTFB, unit s.
  • gen_ai.client.operation.time_per_output_chunk — inter-chunk latency for streaming, unit s.

Server-side (if you operate the model server, e.g. vLLM or TGI):

  • gen_ai.server.request.duration — model-server latency, unit s.
  • gen_ai.server.time_to_first_token — queue + prefill phase, unit s.
  • gen_ai.server.time_per_output_token — decode-phase performance, unit s.

The split between time-to-first-token (queue + prefill) and time-per-output-token (decode) is the right mental model for LLM latency. A latency regression that lives in TTFT is a queueing or prefill problem (batch sizing, scheduling); one that lives in time-per-output-token is a decode problem (model size, KV-cache pressure). One aggregate latency number hides which.

Events: the prompt and completion content

Capturing the actual prompt and response is handled separately, because the content is large, high-cardinality, and frequently contains PII or secrets. The conventions model this as GenAI events/log records rather than fat span attributes. The practical implication is a governance one: capturing message content is opt-in and configurable, and instrumentation libraries gate it behind a setting precisely because you usually do not want raw prompts flowing into your trace backend by default. Decide deliberately what content you retain, redact before export, and treat the content channel as a regulated data flow — not a debug convenience.

What the conventions do not give you

Adopting the spec is necessary, not sufficient. The gaps that matter, especially for anyone using telemetry for security rather than just performance:

  • Identity and delegation context. The conventions standardize what model operation happened, not on whose authority. In an agentic system the security-relevant question is which principal an action was taken for and which delegation chain authorized it. That context — user identity, the originating request, the authorization scope of a tool call — is not in the GenAI conventions. You add it via the general OpenTelemetry attributes and your own conventions, and it is the single most valuable thing to put on an agent/tool span.
  • Semantic safety signals. Guardrail verdicts, injection-detection scores, PII-classification results on inputs and outputs — none of this is in the GenAI spec. It is application-specific, and you emit it as your own attributes. The conventions give you the skeleton; the safety telemetry is yours to define.
  • Stability. Development status means attribute names can change between releases. This is the operational risk, and the next section is how to manage it.

Instrumenting against a moving spec

The conventions are evolving, and an instrumentation written against last quarter’s experimental attributes will emit names a current backend does not recognize. Three practices keep you from rework:

Use the stability opt-in, deliberately. OpenTelemetry GenAI instrumentation libraries gate behavior behind the OTEL_SEMCONV_STABILITY_OPT_IN environment variable (with values such as gen_ai_latest_experimental). Set it explicitly so you know which convention version you are emitting, rather than inheriting a library default that shifts under you on upgrade. Pin it, and change it as a conscious migration step with a query audit, not as a side effect of a dependency bump.

Prefer auto-instrumentation, but verify the attribute names. The OpenTelemetry ecosystem and adjacent projects (e.g. the OpenLLMetry instrumentations) emit these conventions for popular SDKs with little code. Use them — but spot-check the emitted span attributes against the current spec on each upgrade, because the auto-instrumentation’s notion of “current” tracks the experimental spec and can move.

Layer your own stable namespace for what you depend on. For the attributes your dashboards and alerts depend on — your identity context, your guardrail verdicts, your business dimensions — define them under your own namespace (e.g. acme.llm.principal_id, acme.guardrail.injection_score). The gen_ai.* attributes will eventually stabilize; until they do, do not build a paging alert on top of an attribute name the spec might rename. Build it on a name you control, and treat the upstream conventions as the portable, best-effort layer beneath it.

Don’t put prompts in span attributes. Even where it is technically allowed, large message content as a span attribute bloats trace storage and leaks sensitive data into a system whose access controls were designed for ops, not for regulated content. Use the content-capture path, gate it, redact it, and default it off.

The minimum useful instrumentation

If you are starting today, the smallest setup that earns its keep:

  1. Emit client spans with gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.response.model, finish reasons, and token usage. Auto-instrumentation gets you most of this.
  2. Emit the four client metrics (token usage, operation duration, and the two streaming-latency histograms) so you can separate prefill from decode and watch cost.
  3. Add agent/tool spans for any agentic flow, and decorate them with your own identity and authorization attributes. This is where security observability actually lives.
  4. Gate content capture behind an explicit, off-by-default setting with redaction.
  5. Pin the stability opt-in and review attribute names on every dependency upgrade.

That gets you portable performance and cost telemetry today, the agent-action visibility that security teams need, and a migration path that does not collapse when the conventions move from Development to Stable. The OpenTelemetry GenAI observability writeup is a good companion read for the rationale behind the design.

Sources


→ This post is part of the ML Observability Hub — the complete index of ML monitoring and MLOps resources on SentryML. For the metrics you compute on top of this telemetry, see the ML monitoring metrics taxonomy; for translating spans into detection rules, see the MITRE ATLAS detection-engineering runbook.

Sources

  1. OpenTelemetry — Semantic Conventions for Generative AI Systems
  2. OpenTelemetry — GenAI Metrics
  3. OpenTelemetry — GenAI Client Spans
  4. OpenTelemetry Blog — Inside the LLM Call: GenAI Observability
  5. open-telemetry/semantic-conventions (GitHub)
Subscribe

SentryML — in your inbox

ML observability & MLOps — model monitoring, drift detection, debugging in production. — delivered when there's something worth your inbox.

No spam. Unsubscribe anytime.

Related

Comments