OpenTelemetry for AI Agents: How OpenClaw Is Building Production-Grade Observability
When your AI agent runs a complex multi-tool task, what actually happens? How long did each LLM call take? Which tool execution was the bottleneck? Where did those tokens go?
These questions are hard to answer today. But a significant effort is underway in OpenClaw to bring proper observability to agent runs via OpenTelemetry (OTEL) integration.
The Problem with Agent Observability
Traditional application tracing works well because the call stack is predictable: request comes in, hits a few services, database queries happen, response goes out. You can trace the whole thing.
Agents are different. A single "turn" might involve:
- Multiple LLM inference calls (initial prompt, then follow-ups after tool results)
- Arbitrary numbers of tool executions in varying orders
- Loops where the model retries or refines its approach
- Nested agent spawns (sub-agents calling sub-sub-agents)
The current OpenClaw diagnostics plugin treats the whole turn as one monolithic "model event," which hides where time and tokens actually went. When a complex run takes 45 seconds, you cannot easily see that 30 of those seconds were one slow tool execution.
The New Event Model
Issue #11100 proposes restructuring diagnostics around a proper event stream:
- run.started - the beginning of an agent turn
- model.inference.started - each LLM call (yes, plural per turn)
- model.inference - completion of that call, with duration/TTFT/usage
- tool.execution - each tool invocation with timing and optional I/O
- run.completed - turn finished, with aggregate stats
This means you can finally answer questions like:
- "How many LLM calls did this turn actually make?"
- "What was the time-to-first-token for each inference?"
- "Which tool took longest?"
- "Where did my 50k tokens go?"
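To make this concrete, here is a sketch of how such an event stream could be aggregated to answer those questions. The event names come from the proposal; the payload fields (run_id, duration_s, ttft_s, usage, and the sample values) are illustrative assumptions, not the actual OpenClaw schema.

```python
# Hypothetical event stream for one agent turn, using the event names
# from issue #11100. Payload shapes are assumptions for illustration.
events = [
    {"type": "run.started", "run_id": "turn-1"},
    {"type": "model.inference.started", "call": 1},
    {"type": "model.inference", "call": 1, "duration_s": 2.1,
     "ttft_s": 0.4, "usage": {"input_tokens": 1200, "output_tokens": 300}},
    {"type": "tool.execution", "tool": "read_file", "duration_s": 0.2},
    {"type": "tool.execution", "tool": "run_tests", "duration_s": 30.0},
    {"type": "model.inference.started", "call": 2},
    {"type": "model.inference", "call": 2, "duration_s": 1.8,
     "ttft_s": 0.3, "usage": {"input_tokens": 2500, "output_tokens": 150}},
    {"type": "run.completed", "run_id": "turn-1"},
]

# "How many LLM calls did this turn actually make?"
llm_calls = [e for e in events if e["type"] == "model.inference"]
print(len(llm_calls))  # 2

# "Which tool took longest?"
slowest = max((e for e in events if e["type"] == "tool.execution"),
              key=lambda e: e["duration_s"])
print(slowest["tool"])  # run_tests

# "Where did my tokens go?"
total_tokens = sum(e["usage"]["input_tokens"] + e["usage"]["output_tokens"]
                   for e in llm_calls)
print(total_tokens)  # 4150
```

With the current monolithic model event, only the aggregate totals exist; the per-call and per-tool breakdowns above are exactly what the new event stream makes possible.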
Aligning with GenAI Semantic Conventions
The OpenTelemetry community has been developing GenAI semantic conventions - standardized attribute names for AI/LLM operations. This PR aligns OpenClaw with those conventions:
- gen_ai.system (provider: openai, anthropic, gcp.gemini, etc.)
- gen_ai.request.model / gen_ai.response.model
- gen_ai.client.operation.duration (seconds histogram)
- gen_ai.client.time_to_first_token (TTFT histogram)
- gen_ai.client.token.usage with gen_ai.token.type breakdown
This interoperability matters. If you are already using Datadog, Grafana, Honeycomb, or any OTEL-compatible backend, you will get meaningful visualizations without custom dashboards.
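As a rough illustration of what that alignment looks like in practice, the sketch below maps a hypothetical inference result onto the convention's attribute names. The gen_ai.* keys follow the OTEL GenAI semantic conventions listed above; the shape of the result dict and the helper functions are assumptions for illustration, not OpenClaw's actual code.

```python
# Sketch: translating one inference result into GenAI semantic-convention
# attribute names. The `result` shape is a hypothetical input format.
def genai_attributes(result: dict) -> dict:
    return {
        "gen_ai.system": result["provider"],  # e.g. "anthropic"
        "gen_ai.request.model": result["requested_model"],
        "gen_ai.response.model": result["served_model"],
        "gen_ai.client.operation.duration": result["duration_s"],
    }

# Token usage is recorded per type, with gen_ai.token.type
# distinguishing input from output tokens on the same metric.
def token_usage_points(result: dict) -> list:
    return [
        {"value": result["input_tokens"],
         "attributes": {"gen_ai.token.type": "input"}},
        {"value": result["output_tokens"],
         "attributes": {"gen_ai.token.type": "output"}},
    ]

attrs = genai_attributes({
    "provider": "anthropic",
    "requested_model": "claude-sonnet",
    "served_model": "claude-sonnet",
    "duration_s": 2.1,
})
print(attrs["gen_ai.system"])  # anthropic
```

Because the keys are standardized rather than invented per-project, any OTEL-compatible backend can group, filter, and chart them out of the box.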
Content Capture (Opt-In)
For debugging, you might want to see what messages were sent to the model and what came back. The new diagnostics.otel.captureContent config gates this:
- When enabled: spans include gen_ai.input.messages, gen_ai.output.messages, and tool arguments/results
- When disabled: you still get timing/usage/errors, but no content
This is important for production where you might not want PII or proprietary data flowing to your observability backend, but invaluable for debugging in development.
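The gating logic can be sketched as follows. The diagnostics.otel.captureContent key and the gen_ai.input.messages / gen_ai.output.messages attribute names come from the proposal; the helper function and the sample attribute values are hypothetical.

```python
# Sketch of the captureContent gate: timing/usage attributes always
# survive; message content is attached only when capture is enabled.
CONTENT_KEYS = {"gen_ai.input.messages", "gen_ai.output.messages"}

def build_span_attributes(raw: dict, capture_content: bool) -> dict:
    """Drop content-bearing attributes unless capture is opted in."""
    if capture_content:
        return dict(raw)
    return {k: v for k, v in raw.items() if k not in CONTENT_KEYS}

raw = {
    "gen_ai.client.operation.duration": 2.1,
    "gen_ai.input.messages": [{"role": "user", "content": "proprietary"}],
    "gen_ai.output.messages": [{"role": "assistant", "content": "..."}],
}

# Production default: no content leaves the process.
prod = build_span_attributes(raw, capture_content=False)
print(sorted(prod))  # ['gen_ai.client.operation.duration']
```

In development you would flip capture_content to True and get the full messages on the span for debugging.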
Why This Matters
As agents get more capable, they get more complex. A reasoning model might spawn sub-agents, each making multiple tool calls and LLM inferences. Without proper observability:
- You cannot optimize what you cannot measure
- Debugging failures becomes guesswork
- Cost attribution is impossible
- SLOs for agent response times are meaningless
The work in this PR brings OpenClaw closer to production-readiness for enterprises that need real observability into their AI workloads.
How to Help
This is a substantial PR (17 comments and counting). If you have experience with:
- OpenTelemetry instrumentation
- GenAI semantic conventions
- Observability backends (Jaeger, Zipkin, commercial APMs)
Your review and feedback would be valuable. Check out the full discussion.
Observability for agents is still an emerging discipline. If you have built custom tracing for your OpenClaw setup, I would love to hear what worked and what is missing.