OpenTelemetry for AI Agents: How OpenClaw Is Building Production-Grade Observability
When your AI agent runs a complex multi-tool task, what actually happens? How long did each LLM call take? Which tool execution was the bottleneck? Where did those tokens go?
These questions are hard to answer today. But a significant effort is underway in OpenClaw to bring proper observability to agent runs via OpenTelemetry (OTEL) integration.
The Problem with Agent Observability
Traditional application tracing works well because the call stack is predictable: request comes in, hits a few services, database queries happen, response goes out. You can trace the whole thing.
Agents are different. A single "turn" might involve:
- Multiple LLM inference calls (initial prompt, then follow-ups after tool results)
- Arbitrary numbers of tool executions in varying orders
- Loops where the model retries or refines its approach
- Nested agent spawns (sub-agents calling sub-sub-agents)
The current OpenClaw diagnostics plugin treats the whole turn as one monolithic "model event," which hides where time and tokens actually went. When a complex run takes 45 seconds, you cannot easily see that 30 of those seconds were one slow tool execution.
The New Event Model
Issue #11100 proposes restructuring diagnostics around a proper event stream:
- run.started - the beginning of an agent turn
- model.inference.started - each LLM call (yes, plural per turn)
- model.inference - completion of that call, with duration/TTFT/usage
- tool.execution - each tool invocation with timing and optional I/O
- run.completed - turn finished, with aggregate stats
This means you can finally answer questions like:
- "How many LLM calls did this turn actually make?"
- "What was the time-to-first-token for each inference?"
- "Which tool took longest?"
- "Where did my 50k tokens go?"
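To make this concrete, here is a sketch of how such an event stream could be aggregated to answer those questions. The event names come from the proposal; the payload fields (run_id, duration_s, ttft_s, usage, and the sample values) are illustrative assumptions, not the actual OpenClaw schema.

```python
# Hypothetical event stream for one agent turn, using the event names
# from issue #11100. Payload shapes are assumptions for illustration.
events = [
    {"type": "run.started", "run_id": "turn-1"},
    {"type": "model.inference.started", "call": 1},
    {"type": "model.inference", "call": 1, "duration_s": 2.1,
     "ttft_s": 0.4, "usage": {"input_tokens": 1200, "output_tokens": 300}},
    {"type": "tool.execution", "tool": "read_file", "duration_s": 0.2},
    {"type": "tool.execution", "tool": "run_tests", "duration_s": 30.0},
    {"type": "model.inference.started", "call": 2},
    {"type": "model.inference", "call": 2, "duration_s": 1.8,
     "ttft_s": 0.3, "usage": {"input_tokens": 2500, "output_tokens": 150}},
    {"type": "run.completed", "run_id": "turn-1"},
]

# "How many LLM calls did this turn actually make?"
llm_calls = [e for e in events if e["type"] == "model.inference"]
print(len(llm_calls))  # 2

# "Which tool took longest?"
slowest = max((e for e in events if e["type"] == "tool.execution"),
              key=lambda e: e["duration_s"])
print(slowest["tool"])  # run_tests

# "Where did my tokens go?"
total_tokens = sum(e["usage"]["input_tokens"] + e["usage"]["output_tokens"]
                   for e in llm_calls)
print(total_tokens)  # 4150
```

With the current monolithic model event, only the aggregate totals exist; the per-call and per-tool breakdowns above are exactly what the new event stream makes possible.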
Aligning with GenAI Semantic Conventions
The OpenTelemetry community has been developing GenAI semantic conventions - standardized attribute names for AI/LLM operations. This PR aligns OpenClaw with those conventions:
- gen_ai.system (provider: openai, anthropic, gcp.gemini, etc.)
- gen_ai.request.model / gen_ai.response.model
- gen_ai.client.operation.duration (seconds histogram)
- gen_ai.client.time_to_first_token (TTFT histogram)
- gen_ai.client.token.usage with gen_ai.token.type breakdown
This interoperability matters. If you are already using Datadog, Grafana, Honeycomb, or any OTEL-compatible backend, you will get meaningful visualizations without custom dashboards.
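As a rough illustration of what that alignment looks like in practice, the sketch below maps a hypothetical inference result onto the convention's attribute names. The gen_ai.* keys follow the OTEL GenAI semantic conventions listed above; the shape of the result dict and the helper functions are assumptions for illustration, not OpenClaw's actual code.

```python
# Sketch: translating one inference result into GenAI semantic-convention
# attribute names. The `result` shape is a hypothetical input format.
def genai_attributes(result: dict) -> dict:
    return {
        "gen_ai.system": result["provider"],  # e.g. "anthropic"
        "gen_ai.request.model": result["requested_model"],
        "gen_ai.response.model": result["served_model"],
        "gen_ai.client.operation.duration": result["duration_s"],
    }

# Token usage is recorded per type, with gen_ai.token.type
# distinguishing input from output tokens on the same metric.
def token_usage_points(result: dict) -> list:
    return [
        {"value": result["input_tokens"],
         "attributes": {"gen_ai.token.type": "input"}},
        {"value": result["output_tokens"],
         "attributes": {"gen_ai.token.type": "output"}},
    ]

attrs = genai_attributes({
    "provider": "anthropic",
    "requested_model": "claude-sonnet",
    "served_model": "claude-sonnet",
    "duration_s": 2.1,
})
print(attrs["gen_ai.system"])  # anthropic
```

Because the keys are standardized rather than invented per-project, any OTEL-compatible backend can group, filter, and chart them out of the box.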
Content Capture (Opt-In)
For debugging, you might want to see what messages were sent to the model and what came back. The new diagnostics.otel.captureContent config gates this:
- When enabled: spans include gen_ai.input.messages, gen_ai.output.messages, and tool arguments/results
- When disabled: you still get timing/usage/errors, but no content
This is important for production where you might not want PII or proprietary data flowing to your observability backend, but invaluable for debugging in development.
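The gating logic can be sketched as follows. The diagnostics.otel.captureContent key and the gen_ai.input.messages / gen_ai.output.messages attribute names come from the proposal; the helper function and the sample attribute values are hypothetical.

```python
# Sketch of the captureContent gate: timing/usage attributes always
# survive; message content is attached only when capture is enabled.
CONTENT_KEYS = {"gen_ai.input.messages", "gen_ai.output.messages"}

def build_span_attributes(raw: dict, capture_content: bool) -> dict:
    """Drop content-bearing attributes unless capture is opted in."""
    if capture_content:
        return dict(raw)
    return {k: v for k, v in raw.items() if k not in CONTENT_KEYS}

raw = {
    "gen_ai.client.operation.duration": 2.1,
    "gen_ai.input.messages": [{"role": "user", "content": "proprietary"}],
    "gen_ai.output.messages": [{"role": "assistant", "content": "..."}],
}

# Production default: no content leaves the process.
prod = build_span_attributes(raw, capture_content=False)
print(sorted(prod))  # ['gen_ai.client.operation.duration']
```

In development you would flip capture_content to True and get the full messages on the span for debugging.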
Why This Matters
As agents get more capable, they get more complex. A reasoning model might spawn sub-agents, each making multiple tool calls and LLM inferences. Without proper observability:
- You cannot optimize what you cannot measure
- Debugging failures becomes guesswork
- Cost attribution is impossible
- SLOs for agent response times are meaningless
The work in this PR brings OpenClaw closer to production-readiness for enterprises that need real observability into their AI workloads.
How to Help
This is a substantial PR (17 comments and counting). If you have experience with:
- OpenTelemetry instrumentation
- GenAI semantic conventions
- Observability backends (Jaeger, Zipkin, commercial APMs)
Your review and feedback would be valuable. Check out the full discussion.
Observability for agents is still an emerging discipline. If you have built custom tracing for your OpenClaw setup, I would love to hear what worked and what is missing.