Nobody monitors their agents
Everyone is shipping AI agents. Almost nobody is monitoring them. Observability for agents is where logging was for microservices in 2015. Obviously necessary, universally ignored. According to recent industry data, over 3 million AI agents are operating inside corporations right now, and only about 47% of them are monitored. In manufacturing and telecommunications, the monitored share is even lower. We've built a new category of autonomous software and deployed it at scale with essentially zero visibility into what it actually does. This is not a tooling gap. It's a cultural one. And if you're running agents in production without observability, you're not moving fast. You're just hoping nothing goes wrong.
The black box between input and output
Here's what makes agents different from traditional software: they make decisions. A conventional API takes an input, runs deterministic logic, and returns an output. An agent takes an input, reasons about it, decides which tools to call, interprets intermediate results, and then decides what to do next. The execution path is dynamic. It changes based on context, model behavior, and the state of external systems. Most teams treat their agents like they treat their API endpoints. They log the input, log the output, and call it a day. But the interesting part, the part that breaks, is everything in between. Which tools did the agent call? What did it decide not to do? How much did that run cost? Did it hallucinate a tool argument? Did it retry three times before giving up? Without visibility into that middle layer, debugging an agent failure is like debugging a distributed system with only access logs. You know something went wrong, but you have no idea where or why.
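That middle layer can be made concrete. Here is a minimal sketch of an agent loop that records every intermediate decision, not just the input and output. The `call_model` function, the tool names, and the record shape are hypothetical stand-ins, not any particular framework's API:

```python
import time

def run_agent(user_input, tools, call_model, max_steps=5):
    """Run a hypothetical agent loop, recording every intermediate decision.

    `call_model` is assumed to return either {"tool": name, "args": {...}}
    or {"final": answer}. The returned trace is the record of everything
    that happened between input and output.
    """
    trace = {"input": user_input, "steps": [], "output": None}
    context = user_input
    for step in range(max_steps):
        decision = call_model(context, list(tools))
        record = {"step": step, "decision": decision, "started_at": time.time()}
        if "final" in decision:
            trace["output"] = decision["final"]
            trace["steps"].append(record)
            break
        # Execute the chosen tool and record what it returned and how long it took
        result = tools[decision["tool"]](**decision["args"])
        record["result"] = result
        record["duration_s"] = time.time() - record["started_at"]
        trace["steps"].append(record)
        # Feed the tool result back as context for the next decision
        context = f"{context}\n{decision['tool']} returned: {result}"
    return trace
```

With a trace like this, "which tools did it call, with what arguments, and what came back" is a query, not a forensic reconstruction.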
Traditional APM doesn't work here
If you come from a backend engineering background, your instinct is to reach for your existing observability stack. Datadog, Grafana, Prometheus, whatever you're already running. And those tools are great for what they were designed for: tracking request latency, error rates, CPU utilization, and throughput. But agents don't fail the way services fail. A service either returns a 200 or it doesn't. An agent can return a perfectly formatted response that's completely wrong. It can call the right tool with the wrong arguments. It can succeed on 9 out of 10 steps and then make a catastrophic mistake on step 10. The math here is brutal. An agent with 85% accuracy per step only completes a 10-step workflow successfully about 20% of the time. Every additional step compounds the error rate. Traditional APM has no concept of "decision quality" or "reasoning correctness." It can tell you the request took 3.2 seconds. It can't tell you the agent chose the wrong tool on step 4 and confidently produced garbage from that point forward. Agent observability needs a different set of primitives. Not just metrics, logs, and traces, but evaluations and governance layered on top.
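The compounding math is easy to verify, under the simplifying assumption that each step fails independently:

```python
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent per-step errors."""
    return per_step_accuracy ** steps

# 85% per-step accuracy over a 10-step workflow
print(round(workflow_success_rate(0.85, 10), 3))  # ~0.197, i.e. about 20%
```

The same formula shows why shaving a step off a workflow, or pushing per-step accuracy from 85% to 95%, matters far more than it intuitively seems: 0.95 over 10 steps is roughly 60%, three times the completion rate.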
What agent observability actually looks like
If you were building an observability layer for agents from scratch, here's what you'd need to capture:

Trace every tool call. Every time your agent invokes an external tool, API, or function, you need a structured record of what was called, with what arguments, what it returned, and how long it took. This is the equivalent of distributed tracing for microservices, but applied to the agent's decision graph rather than a service mesh.

Log every decision branch. When an agent decides between two actions, that decision point matters. You need to know what context it had, what options it considered (if your framework exposes that), and what it chose. This is where agent tracing diverges most from traditional tracing. You're not just recording "what happened." You're recording "why."

Track token spend and cost per run. Agents can be expensive. A single complex workflow might involve dozens of LLM calls, each consuming thousands of tokens. Without per-run and per-user cost attribution, you can't make informed decisions about rate limiting, model routing, or pricing. One Reddit user running multi-agent systems in production put it plainly: cost per agent, per tool, and per workflow becomes nearly impossible to track without dedicated tooling.

Alert on anomalous behavior. If an agent suddenly starts calling a tool 10x more than usual, or if its average token consumption spikes, something has changed. Maybe the prompt drifted. Maybe the underlying model was updated. Maybe the agent is stuck in a retry loop. Anomaly detection for agents is still primitive compared to traditional infrastructure monitoring, but it's essential.

Measure output quality over time. This is the hardest part, and the part most teams skip entirely. You need automated evaluation of whether the agent's outputs are actually correct: LLM-as-a-judge evaluators, human review pipelines, regression test suites that run against production traces. Without quality measurement, you're just counting tokens and hoping the outputs are good.
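The first of those primitives, tracing every tool call, can start as a wrapper that records the call, its arguments, its result, and its latency. This is a hand-rolled sketch (the tool function and the in-memory list are illustrative; a real system would ship these records to a trace backend):

```python
import functools
import time

TOOL_CALLS = []  # in a real system this would be emitted to your trace backend

def traced_tool(fn):
    """Wrap a tool function so every invocation produces a structured record."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {"tool": fn.__name__, "args": args, "kwargs": kwargs}
        start = time.perf_counter()
        try:
            record["result"] = fn(*args, **kwargs)
            return record["result"]
        except Exception as exc:
            record["error"] = repr(exc)  # failures are recorded, not swallowed
            raise
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            TOOL_CALLS.append(record)
    return wrapper

# Hypothetical tool for illustration
@traced_tool
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}

order = lookup_order("A1")
print(TOOL_CALLS[0]["tool"], round(TOOL_CALLS[0]["duration_ms"], 3))
```

Ten lines of decorator won't replace a tracing platform, but it's the difference between "the agent did something" and a queryable record of exactly what.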
The security angle nobody talks about
An unmonitored agent with tool access is, functionally, an unmonitored employee with admin credentials. Nobody would accept that for a human. We'd require access logs, audit trails, session recordings, and anomaly alerts. But for some reason, we hand agents the keys to our databases, APIs, and internal systems and then don't bother to watch what they do with them. The data backs up the concern. Over 90% of healthcare organizations have experienced a security or data privacy incident related to AI agents in the past year. In financial services, 88.7% of firms report agent-related security incidents. Researchers have demonstrated that 100% of tested LLMs can be compromised through inter-agent trust exploitation, where one agent manipulates another into executing malicious payloads. A critical vulnerability in a popular open-source agent framework (CVE-2026-25253, with a CVSS score of 8.8) showed that attackers could hijack agent connections through a single webpage, accessing whatever the agent could access: email, documents, code repositories, internal systems. The attack surface for agents is fundamentally different from traditional software. Prompt injection, tool misuse, data exfiltration at machine speed, cascading compromise across multi-agent networks. These aren't theoretical risks. They're happening now. And without observability, you won't know until the damage is done.
Your agent fleet is a liability
Let's be direct: if you're running agents in production without observability, your agent fleet is a liability. Not an asset. A liability. Gartner predicts that over 40% of agentic AI projects will be scrapped by 2027. Not because the models aren't good enough, but because the systems around them weren't engineered for production. The three most common failure patterns (bad context management, brittle tool integrations, and compounding errors) are all problems that observability catches early. Without it, you discover them when a customer complains, or worse, when something quietly breaks and nobody notices for weeks. I've been running 13+ agents across various workflows, and the pattern is consistent. The failures that hurt the most aren't the loud ones. They're the silent ones. The agent that subtly changes its behavior after a model update. The tool call that starts timing out intermittently. The cost that creeps up 30% over a month because the agent found a less efficient reasoning path. None of these show up in traditional monitoring. All of them show up in agent-specific observability.
What to instrument if you're starting today
If you're starting from zero, here's the practical order of operations:
Start with structured tracing. Unstructured logs can't reconstruct the reasoning chain. Use OpenTelemetry's emerging gen_ai semantic conventions to create searchable, filterable span trees. Every agent run should produce a complete trace, not scattered log lines. And sample your AI traces at 100%. Agent runs are span trees, and if your sample rate drops below 1.0, you're losing entire executions, not individual calls.
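What "agent runs are span trees" means can be sketched with a context manager. The attribute names loosely follow OpenTelemetry's `gen_ai` semantic conventions, but this is a hand-rolled illustration of the structure, not the OpenTelemetry SDK:

```python
import contextlib
import time

class Span:
    def __init__(self, name, attributes=None):
        self.name = name
        self.attributes = attributes or {}
        self.children = []
        self.duration_ms = None

_stack = []  # currently open spans, root first

@contextlib.contextmanager
def span(name, **attributes):
    """Open a span; any span opened inside becomes its child."""
    s = Span(name, attributes)
    if _stack:
        _stack[-1].children.append(s)
    _stack.append(s)
    start = time.perf_counter()
    try:
        yield s
    finally:
        s.duration_ms = (time.perf_counter() - start) * 1000
        _stack.pop()

# One agent run produces one tree: LLM and tool calls nested under the run
with span("agent_run", **{"gen_ai.operation.name": "invoke_agent"}) as root:
    with span("llm_call", **{"gen_ai.request.model": "gpt-4o"}):
        pass  # model call would happen here
    with span("tool_call", **{"gen_ai.tool.name": "search"}):
        pass  # tool execution would happen here

print([c.name for c in root.children])  # ['llm_call', 'tool_call']
```

This is also why head sampling below 1.0 is so destructive for agents: the sampling decision is made at the root, so dropping a trace drops the entire tree, not one span.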
Add cost attribution early. Track cost by user, by workflow, and by model. The pre-built dashboards in most observability tools show per-model totals, but you need per-user and per-tier attribution to make business decisions about pricing and resource allocation.
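Per-user and per-workflow attribution is, at its core, a grouped sum over usage records. A minimal sketch (the price table is illustrative, not real provider rates; substitute your own):

```python
from collections import defaultdict

# Illustrative per-1K-token prices; substitute your provider's actual rates
PRICES = {"gpt-4o": {"input": 0.0025, "output": 0.01}}

def cost_of(record):
    """Dollar cost of one run, from its token counts and model."""
    p = PRICES[record["model"]]
    return (record["input_tokens"] / 1000) * p["input"] + \
           (record["output_tokens"] / 1000) * p["output"]

def attribute_costs(records, key):
    """Sum run costs grouped by an attribution key: 'user', 'workflow', 'model'."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += cost_of(r)
    return dict(totals)

runs = [
    {"user": "alice", "workflow": "triage", "model": "gpt-4o",
     "input_tokens": 4000, "output_tokens": 1000},
    {"user": "bob", "workflow": "triage", "model": "gpt-4o",
     "input_tokens": 2000, "output_tokens": 500},
]
print(attribute_costs(runs, "user"))      # cost per user
print(attribute_costs(runs, "workflow"))  # cost per workflow
```

The hard part in practice isn't the arithmetic, it's making sure every LLM call carries the user and workflow identifiers in the first place.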
Implement basic quality evaluation. Even a simple LLM-as-a-judge setup that scores a random sample of agent outputs gives you more signal than nothing. The goal isn't perfect evaluation from day one. It's having any signal at all about whether outputs are getting better or worse over time.
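Even the simplest version is just "sample some runs, score them, track the average." In this sketch, `judge` is a placeholder for whatever scorer you plug in (an LLM-as-a-judge call, a human review queue, a regression check); the keyword judge below is a deliberately crude stand-in:

```python
import random

def sample_and_score(runs, judge, rate=0.1, seed=0):
    """Score a random sample of agent runs; returns (sample_size, mean_score)."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    sample = [r for r in runs if rng.random() < rate]
    if not sample:
        return 0, None
    scores = [judge(r["input"], r["output"]) for r in sample]
    return len(sample), sum(scores) / len(scores)

# Placeholder judge: a real one would call an LLM with a scoring rubric
def keyword_judge(prompt, output):
    return 1.0 if "refund" in output else 0.0

runs = [{"input": f"q{i}", "output": "refund issued" if i % 2 else "unsure"}
        for i in range(100)]
n, mean = sample_and_score(runs, keyword_judge, rate=0.2)
print(f"scored {n} runs, mean quality {mean:.2f}")
```

Run this on a schedule against production traces and plot the mean over time. A falling line is your regression alarm, even before you trust the absolute scores.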
Build alerts on behavioral changes. Monitor tool call frequency, token consumption patterns, error rates, and latency distributions. Set baselines during a stable period, then alert when things deviate. The first time an alert catches a regression before a user reports it, the investment pays for itself.
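Behavioral alerting can start as a z-score check against a baseline window. The 3-standard-deviation threshold is a common starting point, not a universal rule:

```python
import statistics

def deviates(baseline, current, threshold=3.0):
    """True if `current` is more than `threshold` standard deviations
    from the baseline mean. Works for any per-run metric: tool call
    counts, token consumption, latency, error counts."""
    mean = statistics.fmean(baseline)
    stdev = statistics.pstdev(baseline)
    if stdev == 0:
        return current != mean  # flat baseline: any change is a deviation
    return abs(current - mean) / stdev > threshold

# Baseline: roughly 10 tool calls per run during a stable week
baseline_calls = [9, 10, 11, 10, 9, 10, 11, 10]
print(deviates(baseline_calls, 10))   # within normal variation
print(deviates(baseline_calls, 100))  # a 10x spike trips the alert
```

Re-baseline after every intentional change (prompt edit, model swap), or the alert will fire on your own deployments instead of on drift.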
The tooling landscape for this is maturing rapidly. Open-source options like Langfuse and MLflow offer self-hosted tracing and evaluation. Platform solutions like Datadog's LLM Observability, Arize, and Braintrust provide deeper integrations with production monitoring stacks. OpenTelemetry is emerging as the standard for cross-framework agent telemetry. The tools exist. The gap is adoption.
The "Datadog for agents" opportunity
If your deploy script is an agent, and increasingly it is, shouldn't it have the same monitoring as your production services? The market opportunity here is enormous. Whoever builds the definitive agent observability platform, the tool that combines tracing, evaluation, cost tracking, security monitoring, and quality measurement into a single coherent experience, wins a massive category. This is an emerging infrastructure layer, as fundamental to the agent era as APM was to the microservices era. Some incumbents are moving fast. Datadog has shipped an AI Agents Console and LLM Observability product. Splunk added AI Agent Monitoring with LLM-as-a-judge evaluators. Azure AI Foundry is building agent observability directly into its development lifecycle. But the space is still fragmented. Most teams are duct-taping 3-4 tools together and still flying blind when agents start doing unexpected things. The platform that wins will be the one that treats agent observability not as a logging feature bolted onto an existing product, but as a first-class concern designed around how agents actually fail. Decision-level visibility, not just request-level. Quality measurement, not just performance metrics. Security governance, not just access logs.
Stop hoping, start watching
We've been here before. In the early days of microservices, teams shipped distributed systems without centralized logging, without distributed tracing, without health checks or circuit breakers. It worked, until it didn't. And then everyone scrambled to add observability after the fact, always harder and more expensive than building it in from the start. Agents are at that same inflection point. The teams that build observability into their agent architecture from day one will be the teams that successfully scale. Everyone else will learn the hard way that autonomy without visibility isn't innovation. It's negligence. You wouldn't deploy a production service without monitoring. Don't deploy a production agent without it either.
References
- PwC AI Agent Survey on enterprise agent adoption, pwc.com
- Gartner prediction on agentic AI project failure rates, referenced via Medium
- Industry research on unmonitored AI agents and security incidents across healthcare, financial services, and manufacturing sectors, The Daily News Journal
- Research on inter-agent trust exploitation and LLM vulnerabilities, arXiv
- OpenTelemetry AI Agent Observability standards and semantic conventions, opentelemetry.io
- Microsoft Azure AI Foundry agent observability best practices, Microsoft Azure Blog
- Sentry guide on AI agent observability and trace sampling, Sentry Blog
- Braintrust analysis of AI agent observability tools, braintrust.dev
- Datadog AI agent monitoring capabilities, Datadog Blog
- Langfuse open-source LLM observability and tracing, langfuse.com