Latency is the new downtime
Your app doesn't need to go down to lose users. It just needs to be slow. We've spent decades engineering for uptime. We built redundant systems, failover clusters, and health checks that page us at 3 AM. And it worked. Downtime, the kind where your service is genuinely unreachable, is rare now. But somewhere along the way, a new threshold emerged. Users don't wait for error pages anymore. They leave when things feel sluggish. In 2026, 500 milliseconds of latency can do the same damage that a 500 error did in 2016. Latency is the new downtime.
The moment you can feel it
Open a coding agent. Type a request. If the response starts streaming in 100 milliseconds, it feels like the tool is thinking alongside you. There's a flow to it, a rhythm between your intent and the machine's output. Now imagine the same request takes 800 milliseconds before anything appears on screen. The interface hasn't changed. The output will be identical. But the experience is fundamentally different. That 700-millisecond gap is where trust breaks down. This isn't speculation. Research consistently shows that even sub-second delays change how users perceive quality. A 200-millisecond spike in p99 latency during a high-traffic sale has been directly correlated with drops in completed transactions. Every 100 milliseconds of added delay can reduce conversions by roughly 1% at enterprise scale. The human nervous system is remarkably sensitive to response time, and once users notice a delay, they attribute it to the product being broken, not busy.
Streaming trained us to expect instant feedback
Large language models changed what "fast" means. When ChatGPT launched, it introduced millions of people to streaming responses, tokens appearing one by one, almost immediately after hitting enter. That interaction pattern rewired expectations across the entire software industry. Streaming works because of a psychological effect UX researchers call perceived performance: when users see partial output appearing immediately, their brains register the system as fast, even if the total generation time is the same as a non-streaming approach. The key metric is time to first token (TTFT), not total response time. A system that shows you something in 50 milliseconds and finishes in 3 seconds feels dramatically faster than one that shows nothing for 2 seconds and finishes in 2.5. The problem is that streaming set a new baseline. Every non-streaming endpoint now feels archaic by comparison. Users have been conditioned to expect that sense of immediate, progressive feedback. A traditional REST API that returns a complete JSON payload after a full second of processing feels like it's hanging, even if the total wall-clock time is reasonable. The expectation of instant feedback has leaked out of the chat interface and into every product interaction.
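To know which of those two numbers you're actually optimizing, it helps to measure TTFT and total generation time separately. Here is a minimal sketch in Python that wraps any async token stream; the stream below is a stand-in for whatever your LLM client returns, so the names and timings are assumptions rather than any particular provider's API.

```python
import asyncio
import time
from typing import AsyncIterator

async def fake_stream(n: int = 20, delay: float = 0.05) -> AsyncIterator[str]:
    """Stand-in for a real LLM stream: yields a token every `delay` seconds."""
    for i in range(n):
        await asyncio.sleep(delay)
        yield f"token{i} "

async def measure_stream(stream: AsyncIterator[str]) -> dict:
    """Track time to first token (TTFT) and total time for any async token stream."""
    start = time.perf_counter()
    ttft = None
    tokens = 0
    async for _token in stream:
        if ttft is None:
            # The first visible token is the moment users actually "feel".
            ttft = time.perf_counter() - start
        tokens += 1
    return {"ttft_s": ttft, "total_s": time.perf_counter() - start, "tokens": tokens}

print(asyncio.run(measure_stream(fake_stream())))
```

Tracking both numbers side by side makes it obvious when a change improves total throughput but hurts the part users notice first.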
Agentic workflows compound the problem
This matters most for anyone building with AI agents. A single LLM call might have acceptable latency on its own, maybe 200 to 400 milliseconds to first token. But agents don't make single calls. They reason, they call tools, they evaluate results, and they call more tools. Each step in an agentic workflow adds latency, and because the steps run serially, every delay lands on the critical path and compounds. Consider a typical multi-step agent: the user asks a question, the agent reasons about which tools to use, makes a search API call, waits for results, processes them, decides it needs a database query, executes that, synthesizes everything, and finally streams a response. Five tool calls at 300 milliseconds each add up to 1.5 seconds of dead air before the user sees anything meaningful. That's an eternity in a world where people expect tokens to start flowing immediately. The research backs this up. Teams building multi-agent systems consistently report that latency, not accuracy, is their primary bottleneck. One engineering team documented how letting a single agent call twelve tools caused latency to explode well before accuracy degraded. The compounding effect means that architectural decisions about agent design are fundamentally latency decisions. The implication is clear: narrow agents are faster agents. An agent designed to do one thing well, with a focused tool set and minimal reasoning steps, will always outperform a Swiss Army knife agent that needs to deliberate across a dozen capabilities. Constraining scope isn't just good design practice. It's a latency optimization.
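A toy sketch makes the stacking visible. Everything here is hypothetical: the sleeps stand in for real tool and model latencies, and the step names mirror the workflow described above rather than any particular framework.

```python
import asyncio
import time

# Hypothetical stand-ins for real calls; the sleeps approximate per-step latency.
async def reason(question: str) -> str:
    await asyncio.sleep(0.25)
    return "plan"

async def search(plan: str) -> str:
    await asyncio.sleep(0.30)
    return "docs"

async def query_db(docs: str) -> str:
    await asyncio.sleep(0.30)
    return "rows"

async def synthesize(docs: str, rows: str) -> str:
    await asyncio.sleep(0.25)
    return "draft answer"

async def multi_step_agent(question: str) -> str:
    """Every await below sits on the critical path: the user sees nothing
    until the whole chain finishes and the response finally starts streaming."""
    start = time.perf_counter()
    plan = await reason(question)
    docs = await search(plan)
    rows = await query_db(docs)
    answer = await synthesize(docs, rows)
    print(f"dead air before first token: {time.perf_counter() - start:.2f}s")  # ~1.1s
    return answer

asyncio.run(multi_step_agent("summarize last quarter's incidents"))
```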
Average latency is a lie
Most teams track average response time. This is almost useless for understanding actual user experience. If your average latency is 150 milliseconds but your p99 is 2 seconds, one out of every hundred users is having a terrible time. And those users remember. P99 latency, the response time that 99% of requests stay under, captures the experience of the unlucky tail. It's the checkout that hangs just long enough for someone to close the tab. It's the coding agent that freezes for two seconds right when you're in a creative flow. The Google SRE handbook makes this point explicitly: while average latency might read 100 milliseconds at 1,000 requests per second, 1% of those requests might take 5 seconds. The p99 problem is especially acute in AI applications because inference latency has high variance. Model serving under load, cold starts, cache misses on prompt prefixes, and variable input lengths all create long tails. A system that feels snappy 95% of the time but occasionally stalls for multiple seconds will erode user trust faster than one with consistently moderate latency. The discipline of tracking p99, not just averages, changes how you architect systems. You start optimizing for the worst case under normal operating conditions rather than the happy path.
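To see the gap for yourself, a few lines of Python with synthetic numbers are enough. The distribution below is made up, but its shape, mostly fast with a slow tail, is typical of inference under load.

```python
import random
import statistics

# Synthetic request latencies in seconds: 98% fast, 2% stuck in a slow tail.
latencies = [random.gauss(0.12, 0.02) for _ in range(980)] + \
            [random.uniform(1.5, 3.0) for _ in range(20)]

cuts = statistics.quantiles(latencies, n=100)      # 99 percentile cut points
print(f"mean: {statistics.mean(latencies):.3f}s")  # looks healthy
print(f"p50:  {cuts[49]:.3f}s")                    # most users are fine
print(f"p99:  {cuts[98]:.3f}s")                    # what the unlucky tail actually feels
```

The mean barely moves as the tail gets worse; the p99 is the number that tracks the users who are about to leave.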
Practical mitigations that actually work
The good news is that latency is an engineering problem with real solutions, not just a fact of physics.
Edge computing and regional inference are the most straightforward wins. Traditional architectures serve everything from centralized data centers, which means a user in Singapore might round-trip to a server in Virginia for every interaction. Distributing inference to locations geographically closer to users can reduce latency by 60 to 80% for global applications. As inference workloads shift from training-heavy centralized compute to distributed serving, edge deployment is becoming practical in ways it wasn't even two years ago.
Speculative decoding pairs a large target model with a lightweight draft model that quickly proposes several next tokens. The target model then verifies those proposals in a single forward pass. When predictions are accurate (and for many tokens they are, since lots of next tokens are obvious from context), the system generates multiple tokens at once, cutting latency without any impact on output quality. NVIDIA's research shows this can deliver up to 3x faster inference in practice.
Speculative execution for agents extends this idea to agentic workflows. Instead of waiting for the model to decide which tool to call and then executing that tool serially, you can predict likely next actions and launch them in parallel. If the prediction is right, the result is ready instantly. If not, you discard the speculative work. Research on speculative actions in agentic systems has demonstrated up to 55% accuracy in next-action prediction, translating to significant end-to-end latency reductions.
Parallel tool execution is the low-hanging fruit that many teams miss. If an agent needs to call a search API and a database query, and those calls are independent, run them simultaneously. This alone can cut multi-step agent latency in half.
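As a concrete illustration of that last point, here is a sketch using Python's asyncio. The two tools are hypothetical stand-ins for independent calls, and the sleeps approximate network latency.

```python
import asyncio
import time

# Hypothetical, independent tools standing in for real API calls.
async def search_api(query: str) -> str:
    await asyncio.sleep(0.3)  # ~300 ms round trip
    return "search results"

async def db_query(query: str) -> str:
    await asyncio.sleep(0.3)
    return "matching rows"

async def compare(question: str) -> None:
    start = time.perf_counter()
    await search_api(question)        # serial: second call waits on the first
    await db_query(question)
    print(f"serial:   {time.perf_counter() - start:.2f}s")   # ~0.60s

    start = time.perf_counter()
    await asyncio.gather(             # parallel: both calls share the same window
        search_api(question),
        db_query(question),
    )
    print(f"parallel: {time.perf_counter() - start:.2f}s")   # ~0.30s

asyncio.run(compare("quarterly revenue by region"))
```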
The model size trade-off nobody talks about
There's a quiet tension in the AI industry between capability and speed. Larger models are generally more capable, but they're also slower to start generating tokens and more expensive to serve. Smaller, purpose-built models have predictable cost curves, lower TTFT, and can run on more modest infrastructure, including at the edge. For most production use cases, the right answer is to choose fast. A fine-tuned small model that handles your specific task in 50 milliseconds beats a frontier model that takes 500 milliseconds to deliver marginally better output. Users don't grade your responses on a benchmark. They grade them on feel. And feel is dominated by speed. This doesn't mean large models are irrelevant. There are genuine use cases where general knowledge and deep reasoning matter: complex analysis, open-ended creative work, multi-domain synthesis. But for the focused, repetitive tasks that make up the bulk of production AI workloads (classifying tickets, extracting fields, summarizing content, generating structured output), smaller models consistently win on the metric that matters most: how fast the user gets their answer. The practical approach is to match model size to task complexity. Use the smallest model that meets your quality bar, and measure quality in the context of actual user workflows, not abstract benchmarks.
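One way to make that policy concrete is a routing table: measure each candidate model's quality on your own eval set, then always pick the fastest one that clears the bar. The sketch below is illustrative only; the model names, latencies, and quality scores are invented placeholders.

```python
from dataclasses import dataclass

@dataclass
class ModelOption:
    name: str
    ttft_ms: int         # measured time to first token
    task_quality: float  # 0..1, scored on your own workflow's eval set

# Placeholder numbers; substitute measurements from your own evals.
CANDIDATES = [
    ModelOption("small-finetuned", ttft_ms=50,  task_quality=0.92),
    ModelOption("mid-general",     ttft_ms=180, task_quality=0.94),
    ModelOption("frontier",        ttft_ms=500, task_quality=0.96),
]

def pick_model(quality_bar: float, options: list[ModelOption]) -> ModelOption:
    """Return the fastest model whose measured quality clears the bar."""
    viable = [m for m in options if m.task_quality >= quality_bar]
    if not viable:
        # Nothing clears the bar: fall back to the most capable option.
        return max(options, key=lambda m: m.task_quality)
    return min(viable, key=lambda m: m.ttft_ms)

print(pick_model(0.90, CANDIDATES).name)  # -> small-finetuned
```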
First paint matters again
There's a parallel shift happening in web development that reinforces this latency-first mindset. Server-side rendering is making a comeback, and it's not a coincidence that it's happening in the AI era. The reason is straightforward. When your application's primary value comes from AI-generated content, the time to first meaningful paint is everything. A server-rendered page can show the user a complete initial view immediately, while a client-side rendered app needs to download JavaScript, execute it, make API calls, and then render. That extra round trip, which might have been acceptable for a static dashboard, is painful when users expect the instant feedback they get from streaming AI interfaces. Frameworks like Next.js, Nuxt.js, and Phoenix LiveView have made SSR the default again. As one developer put it, server-side rendering never really went away, but the web is finally remembering why it was the default. First paint and SEO are still better when markup comes from the server, and most applications don't need a client router, global state, or a 200-kilobyte hydration bundle. They just need partial HTML swaps. The AI era is accelerating this trend because 47% of mobile traffic now originates from AI-powered search interfaces, and those systems prioritize instantly parseable, structured content. If your content requires JavaScript execution to render, it's effectively invisible to AI discovery.
The new reliability standard
We spent the last two decades building systems that don't go down. That was the right fight, and we mostly won. The next fight is building systems that don't feel slow. This means treating latency budgets with the same seriousness as uptime SLAs. It means tracking p99, not averages. It means choosing narrow agents over general ones, small models over large ones when quality allows it, and edge deployment over centralized inference. It means streaming by default and rendering on the server. The bar for "working" has moved. Availability is table stakes. Speed is the new reliability.
References
- What actually reduced latency in our production systems, System Design with Sage, 2026
- The 2026 Latency War: Why Sub-50ms Delivery Defines the Next Generation of Digital Experience, EdgeNext
- What Is P99 Latency?, Aerospike
- Understanding AI Agent Latency and Performance, MindStudio, 2026
- SLMs vs. LLMs: Why Smaller AI Models Win in Business, The New Stack
- How Web Application Development Is Transforming in 2026, AgileSoft Labs
- SSR for Mobile Apps: AI Discovery and Edge Performance, Expert App Devs, 2026
- Monitoring Distributed Systems, Google SRE Book