OpenTelemetry GenAI Semantic Conventions: Instrument LLM Apps

Q: Spans: the unit of an LLM operation

The client span conventions define the attributes that describe a model call. The load-bearing ones, all under the `gen_ai.*` namespace:

Q: Metrics: the aggregatable numbers

The GenAI metrics are defined as histograms (all currently Development status):

If you are instrumenting an LLM application in 2026, you do not have to invent your telemetry schema from scratch. The OpenTelemetry GenAI semantic conventions ↗ give you a standard vocabulary to instrument LLM apps — vendor-portable traces, metrics that mean the same thing across tools, and a schema you do not have to defend in a design review every quarter.

It also has a sharp edge: as of this writing the GenAI conventions are still marked Development, not Stable. Attribute names change, the structure of how prompts and completions are captured has already shifted once, and a naive instrumentation against the experimental spec will break on upgrade. This guide covers what the conventions actually standardize today, where they stop, and how to instrument against a moving target without rework. It is written for the engineer who got handed a production LLM deployment and needs telemetry that survives the next six months.

The three signal types

OpenTelemetry models GenAI telemetry as the same three signals it uses everywhere — spans, metrics, and (for the content itself) events/logs. The GenAI conventions are simply an agreed-upon set of names and attributes layered on top.

Spans capture a single GenAI operation — one model call, one tool invocation, one agent step — as a unit of work with a start, end, and structured attributes. Metrics are the aggregatable numbers: token counts, durations. Events carry the high-cardinality content — the prompt and the completion — which you generally do not want as span attributes because they are large and sensitive.

Spans: the unit of an LLM operation

The client span conventions ↗ define the attributes that describe a model call. The load-bearing ones, all under the gen_ai.* namespace:

gen_ai.operation.name — what kind of operation (chat, text_completion, embeddings, execute_tool).
gen_ai.provider.name — the provider/format flavor (e.g. openai, anthropic, aws.bedrock). This is the discriminator that tells a backend which provider-specific attributes to expect.
gen_ai.request.model and gen_ai.response.model — what you asked for versus what answered. These can differ (aliasing, routing, fallback), and the gap is worth monitoring on its own.
gen_ai.request.* — the sampling parameters: temperature, top_p, max_tokens, etc.
gen_ai.usage.input_tokens and gen_ai.usage.output_tokens — token accounting on the span.
gen_ai.response.finish_reasons — why generation stopped (stop, length, tool_calls, content filtering). A finish-reason distribution that shifts toward length or filtering is a real signal.

Span kind is CLIENT for a call to a remote model, and the conventions also define agent and framework spans for multi-step agentic systems — spans that wrap a planning step, a tool call (execute_tool), or an agent invocation. This is the layer that matters most for security and governance: an agent span tree is where you can see what the agent decided to do, not just what tokens it emitted.

Metrics: the aggregatable numbers

The GenAI metrics ↗ are defined as histograms (all currently Development status):

Client-side:

gen_ai.client.token.usage — input and output tokens, unit {token}.
gen_ai.client.operation.duration — overall operation latency, unit s.
gen_ai.client.operation.time_to_first_chunk — streaming TTFB, unit s.
gen_ai.client.operation.time_per_output_chunk — inter-chunk latency for streaming, unit s.

Server-side (if you operate the model server, e.g. vLLM or TGI):

gen_ai.server.request.duration — model-server latency, unit s.
gen_ai.server.time_to_first_token — queue + prefill phase, unit s.
gen_ai.server.time_per_output_token — decode-phase performance, unit s.

The split between time-to-first-token (queue + prefill) and time-per-output-token (decode) is the right mental model for LLM latency. A latency regression that lives in TTFT is a queueing or prefill problem (batch sizing, scheduling); one that lives in time-per-output-token is a decode problem (model size, KV-cache pressure). One aggregate latency number hides which.

Events: the prompt and completion content

Capturing the actual prompt and response is handled separately, because the content is large, high-cardinality, and frequently contains PII or secrets. The conventions model this as GenAI events/log records rather than fat span attributes. The practical implication is a governance one: capturing message content is opt-in and configurable, and instrumentation libraries gate it behind a setting precisely because you usually do not want raw prompts flowing into your trace backend by default. Decide deliberately what content you retain, redact before export, and treat the content channel as a regulated data flow — not a debug convenience.

OpenTelemetry GenAI semantic conventions: attribute reference

The table below collects the load-bearing names from the OpenTelemetry GenAI semantic conventions in one place — the span attributes and metrics an instrumentation should emit for an LLM app. Names are qualitative pointers: the conventions are Development status, so exact spelling can change between releases, and the authoritative list is always the current spec ↗. Verify against it on every upgrade.

Name	Signal	What it captures
`gen_ai.operation.name`	Span	Operation type: `chat`, `text_completion`, `embeddings`, `execute_tool`
`gen_ai.provider.name`	Span	Provider/format discriminator (`openai`, `anthropic`, `aws.bedrock`); earlier drafts named this `gen_ai.system`
`gen_ai.request.model`	Span	Model requested by the caller
`gen_ai.response.model`	Span	Model that actually answered (can differ via routing/fallback)
`gen_ai.request.*`	Span	Sampling parameters: `temperature`, `top_p`, `max_tokens`
`gen_ai.usage.input_tokens`	Span	Prompt/input token count
`gen_ai.usage.output_tokens`	Span	Completion/output token count
`gen_ai.response.finish_reasons`	Span	Why generation stopped: `stop`, `length`, `tool_calls`, content filtering
`gen_ai.client.token.usage`	Metric (histogram)	Input + output tokens, unit `{token}`
`gen_ai.client.operation.duration`	Metric (histogram)	Overall operation latency, unit `s`
`gen_ai.server.time_to_first_token`	Metric (histogram)	Server-side queue + prefill latency, unit `s`
`gen_ai.server.time_per_output_token`	Metric (histogram)	Server-side decode-phase latency, unit `s`

Prompt and completion content is deliberately not in this table: the conventions route it through opt-in GenAI events/logs rather than span attributes, so it stays out of a trace backend by default. For the derived metrics computed on top of this raw telemetry, see the ML monitoring metrics taxonomy, and for where this telemetry sits in a production stack, ML model deployment.

What the conventions do not give you

Adopting the spec is necessary, not sufficient. The gaps that matter, especially for anyone using telemetry for security rather than just performance:

Identity and delegation context. The conventions standardize what model operation happened, not on whose authority. In an agentic system the security-relevant question is which principal an action was taken for and which delegation chain authorized it. That context — user identity, the originating request, the authorization scope of a tool call — is not in the GenAI conventions. You add it via the general OpenTelemetry attributes and your own conventions, and it is the single most valuable thing to put on an agent/tool span. We treat this missing layer as its own subject in the agent authority gap as an observability problem.
Semantic safety signals. Guardrail verdicts, injection-detection scores, PII-classification results on inputs and outputs — none of this is in the GenAI spec. It is application-specific, and you emit it as your own attributes. The conventions give you the skeleton; the safety telemetry is yours to define. For why that custom telemetry matters most in agent-to-agent flows, see why embedding-based defenses fail in multi-agent LLMs.
Stability. Development status means attribute names can change between releases. This is the operational risk, and the next section is how to manage it.

Instrumenting against a moving spec

The conventions are evolving, and an instrumentation written against last quarter’s experimental attributes will emit names a current backend does not recognize. Three practices keep you from rework:

Use the stability opt-in, deliberately. OpenTelemetry GenAI instrumentation libraries gate behavior behind the OTEL_SEMCONV_STABILITY_OPT_IN environment variable (with values such as gen_ai_latest_experimental). Set it explicitly so you know which convention version you are emitting, rather than inheriting a library default that shifts under you on upgrade. Pin it, and change it as a conscious migration step with a query audit, not as a side effect of a dependency bump.

Prefer auto-instrumentation, but verify the attribute names. The OpenTelemetry ecosystem and adjacent projects (e.g. the OpenLLMetry instrumentations) emit these conventions for popular SDKs with little code. Use them — but spot-check the emitted span attributes against the current spec ↗ on each upgrade, because the auto-instrumentation’s notion of “current” tracks the experimental spec and can move.

Layer your own stable namespace for what you depend on. For the attributes your dashboards and alerts depend on — your identity context, your guardrail verdicts, your business dimensions — define them under your own namespace (e.g. acme.llm.principal_id, acme.guardrail.injection_score). The gen_ai.* attributes will eventually stabilize; until they do, do not build a paging alert on top of an attribute name the spec might rename. Build it on a name you control, and treat the upstream conventions as the portable, best-effort layer beneath it.

Don’t put prompts in span attributes. Even where it is technically allowed, large message content as a span attribute bloats trace storage and leaks sensitive data into a system whose access controls were designed for ops, not for regulated content. Use the content-capture path, gate it, redact it, and default it off.

How to instrument an LLM app: the minimum useful setup

If you are starting today, the smallest setup that earns its keep to instrument an LLM app:

Emit client spans with gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.response.model, finish reasons, and token usage. Auto-instrumentation gets you most of this.
Emit the four client metrics (token usage, operation duration, and the two streaming-latency histograms) so you can separate prefill from decode and watch cost.
Add agent/tool spans for any agentic flow, and decorate them with your own identity and authorization attributes. This is where security observability actually lives.
Gate content capture behind an explicit, off-by-default setting with redaction.
Pin the stability opt-in and review attribute names on every dependency upgrade.

That gets you portable performance and cost telemetry today, the agent-action visibility that security teams need, and a migration path that does not collapse when the conventions move from Development to Stable. The OpenTelemetry GenAI observability writeup ↗ is a good companion read for the rationale behind the design.

FAQ

What are the OpenTelemetry GenAI semantic conventions? They are OpenTelemetry’s standard vocabulary for describing generative-AI operations as telemetry. They define gen_ai.* span attributes for model calls, tool invocations, and agent steps; a set of client and server metrics for tokens and latency; and an events/logs channel for prompt and completion content. They are currently Development status, so names can still change.

How do you instrument an LLM app with OpenTelemetry? Emit a CLIENT span per model call carrying operation name, provider, requested and responding model, finish reasons, and token usage; record the four client metrics for tokens and latency; wrap agentic flows in agent and tool spans; and gate prompt/completion capture behind an off-by-default, redacted setting. Auto-instrumentation supplies most span and metric emission. See model monitoring tools for the backends that consume it.

Is it gen_ai.system or gen_ai.provider.name? Both name the same idea — the discriminator identifying which provider or format a call used (openai, anthropic, aws.bedrock) so a backend knows which provider-specific attributes to expect. gen_ai.system was the earlier draft name; the conventions moved toward gen_ai.provider.name. Because the spec is evolving, verify the exact attribute against the current conventions on each upgrade.

Are the OpenTelemetry GenAI semantic conventions stable? Not yet. As of writing they carry Development status, meaning attribute names and structure can change between releases, and the prompt/completion capture model has already shifted once. Pin the OTEL_SEMCONV_STABILITY_OPT_IN setting, treat convention changes as conscious migrations, and layer dashboards and alerts on a namespace you control rather than raw gen_ai.* names.

Should prompts and completions be stored as span attributes? No. Message content is large, high-cardinality, and often contains PII or secrets, so the conventions route it through opt-in GenAI events/logs rather than fat span attributes. Keep content capture off by default, redact before export, and treat the content channel as a regulated data flow, not a debugging convenience.

Sources

OpenTelemetry — Semantic Conventions for Generative AI Systems ↗ — the index for the GenAI conventions; note the Development status banner.
OpenTelemetry — GenAI Metrics ↗ — the exact client and server metric names, units, and histogram definitions.
OpenTelemetry — GenAI Client Spans ↗ — span attributes for model operations.
OpenTelemetry Blog — Inside the LLM Call ↗ — design rationale and a walkthrough of GenAI observability.
open-telemetry/semantic-conventions (GitHub) ↗ — the source of truth and changelog; watch this to track when GenAI attributes stabilize.

→ This post is part of the ML Observability Hub — the complete index of ML monitoring ↗ and MLOps resources on SentryML. For the metrics you compute on top of this telemetry, see the ML monitoring metrics taxonomy; for translating spans into detection rules, see the MITRE ATLAS detection-engineering runbook.

OpenTelemetry GenAI Semantic Conventions: Instrument LLM Apps

The three signal types

Spans: the unit of an LLM operation

Metrics: the aggregatable numbers

Events: the prompt and completion content

OpenTelemetry GenAI semantic conventions: attribute reference

What the conventions do not give you

Instrumenting against a moving spec

How to instrument an LLM app: the minimum useful setup

FAQ

Sources

Sources

SentryML — in your inbox

Related

The ML Monitoring Metrics Taxonomy: Drift, Data Quality, and Model Decay

LLM Testing: A Guide to Evals, Metrics, and Production Monitoring

A Lean 4 Stability Proof for Tool-Mediated LLM Agents

Comments