2.9s to 300ms: A Systematic Approach to API Latency
A slow API rarely has one dramatic cause. It has a dozen small ones, each invisible on its own, that compound into a request that takes 2.9 seconds when it should take 300 milliseconds. The teams that fix this aren’t the ones who guess the cleverest optimization. They’re the ones who measure first and fix in the right order.
Here’s the method I use to take an endpoint from “uncomfortably slow” to “fast enough that nobody complains” — without rewriting everything.
Measure before you touch anything
The single most common mistake in performance work is optimizing the thing you assume is slow. It’s almost never the thing you assume.
Start with the distribution, not the average. A p50 of 400ms with a p95 of 2.9s is a completely different problem than a flat 2.9s across the board — the first is tail latency (a subset of requests hitting a slow path), the second is systemic. They get fixed differently, and you can’t tell which you have from an average.
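To make the difference concrete, here is a minimal sketch that summarizes two sets of latencies. The sample numbers are invented for illustration, and `summarize` is a hypothetical helper, not part of any particular monitoring tool.

```python
import statistics


def summarize(latencies_ms):
    """Return the mean, p50 and p95 of a list of latencies in milliseconds."""
    ordered = sorted(latencies_ms)
    # With n=100, quantiles() returns 99 cut points:
    # index 49 is the 50th percentile, index 94 is the 95th.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(ordered),
        "p50": cuts[49],
        "p95": cuts[94],
    }


# Invented sample: nine requests in ten are fast, one in ten hits a slow path.
tail_heavy = [400] * 90 + [2900] * 10
# Invented sample: every request is equally slow.
systemic = [2900] * 100

tail_summary = summarize(tail_heavy)
systemic_summary = summarize(systemic)

print("tail-heavy:", tail_summary)
print("systemic:  ", systemic_summary)
```

With these made-up numbers the tail-heavy mean comes out at 650ms, which matches neither the 400ms most callers see nor the 2.9s the unlucky ones get. That is why the average is the wrong place to start.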
Then find out where the time actually goes. Add tracing or timing around the major phases of a request — auth, database, downstream calls, serialization — and look at one real slow request end to end. You’re looking for the single biggest contiguous block of time. That block is your first target, and it’s frequently somewhere nobody suspected.
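If you don't have a tracing system in place yet, a few lines of timing code will get you most of the way. The sketch below is one way to do it, under some assumptions: `RequestTimer` and `handle_request` are hypothetical names, and the `time.sleep` calls stand in for real auth, database and serialization work.

```python
import time
from contextlib import contextmanager


class RequestTimer:
    """Accumulates wall-clock time per named phase of a single request."""

    def __init__(self):
        self.phases = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the phase raised an exception.
            elapsed = time.perf_counter() - start
            self.phases[name] = self.phases.get(name, 0.0) + elapsed

    def biggest(self):
        """Return the name of the phase that took the most time."""
        return max(self.phases, key=self.phases.get)


def handle_request(timer):
    # The sleeps are placeholders for real work in each phase.
    with timer.phase("auth"):
        time.sleep(0.005)
    with timer.phase("database"):
        time.sleep(0.05)
    with timer.phase("serialization"):
        time.sleep(0.01)


timer = RequestTimer()
handle_request(timer)
slowest_phase = timer.biggest()

for name, seconds in timer.phases.items():
    print(f"{name:15s} {seconds * 1000:7.1f} ms")
print("first target:", slowest_phase)
```

A real tracing tool gives you nested spans and cross-service context. This only gives you the per-phase breakdown, which is usually enough to pick the first target.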
The usual suspects, roughly in order of frequency
Once you can see the breakdown, the cause is usually one of a small handful of things:
- N+1 queries. The endpoint runs one query to get a list, then one more query per item. Fifty items, fifty-one round trips. This is the most common single cause of slow APIs, and it hides well because each individual query is fast.
- Serial I/O that could be parallel. Three independent downstream calls made one after another, each waiting on the last, when they could run concurrently and cost only as much as the slowest one.
- Cold or missing caches. Recomputing or re-fetching the same rarely-changing data on every request.
- Oversized payloads and unindexed queries. Returning fields nobody uses or un-paginated lists, or running queries that scan far more rows than they return.
Notice these are mostly structural, not algorithmic. You’re rarely speeding up a hot loop. You’re removing round trips and work that didn’t need to happen.
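The N+1 pattern is easiest to recognize once you count the queries. The sketch below uses an in-memory SQLite database with an invented `orders`/`items` schema. The `run` wrapper and both loader functions are hypothetical, written only to show the query counts side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(1, 51)],
)
conn.executemany(
    "INSERT INTO items (order_id, sku) VALUES (?, ?)",
    [(i, f"sku-{i}") for i in range(1, 51)],
)

stats = {"queries": 0}


def run(sql, params=()):
    """Execute a query and count it as one round trip."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()


def load_n_plus_one():
    # One query for the list, then one more per order.
    orders = run("SELECT id FROM orders ORDER BY id")
    return {
        oid: [sku for (sku,) in run(
            "SELECT sku FROM items WHERE order_id = ? ORDER BY id", (oid,))]
        for (oid,) in orders
    }


def load_batched():
    # One query for the list, one query for all the items.
    ids = [oid for (oid,) in run("SELECT id FROM orders ORDER BY id")]
    placeholders = ",".join("?" * len(ids))
    rows = run(
        f"SELECT order_id, sku FROM items "
        f"WHERE order_id IN ({placeholders}) ORDER BY id",
        ids,
    )
    result = {oid: [] for oid in ids}
    for order_id, sku in rows:
        result[order_id].append(sku)
    return result


stats["queries"] = 0
slow_result = load_n_plus_one()
slow_queries = stats["queries"]

stats["queries"] = 0
fast_result = load_batched()
fast_queries = stats["queries"]

print(slow_queries, "queries vs", fast_queries)
```

Against a local in-memory database the difference in wall-clock time is negligible. Against a database a network hop away, each of those extra queries costs a full round trip, which is where the seconds go. If you use an ORM, it most likely has its own eager-loading mechanism that does the batching for you.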
Fix in order of impact-to-effort
With the breakdown in hand, rank the fixes — not by how interesting they are, but by latency-saved per hour-of-work:
- Eliminate the N+1. Batch the per-item queries into one. Often the largest single win for the least effort.
- Parallelize independent I/O. Fire concurrent downstream calls together instead of in sequence. Collapses three serial waits into one.
- Cache the stable stuff. Put a short-lived cache in front of data that changes rarely. Cheap, high-leverage, but adds an invalidation concern — so do it after the structural fixes, not before.
- Trim the payload and add the index. Return only what the client needs; make sure every query the endpoint runs is backed by an index.
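As an illustration of the parallelization step, here is a minimal sketch using `asyncio.gather`. The three downstream calls and their delays are invented, and `asyncio.sleep` stands in for real network I/O. It assumes the calls are genuinely independent, meaning none needs another's result.

```python
import asyncio
import time

# Hypothetical downstream calls with invented latencies in seconds.
CALLS = [("profile", 0.05), ("orders", 0.08), ("recommendations", 0.03)]


async def fetch(name, delay_s):
    # Placeholder for a real HTTP or RPC call.
    await asyncio.sleep(delay_s)
    return name


async def serial():
    # Each call waits for the previous one to finish.
    return [await fetch(name, delay) for name, delay in CALLS]


async def concurrent():
    # All calls start together; gather returns results in input order.
    return await asyncio.gather(*(fetch(name, delay) for name, delay in CALLS))


def timed(coro_fn):
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start


serial_result, serial_s = timed(serial)
concurrent_result, concurrent_s = timed(concurrent)

print(f"serial:     {serial_s * 1000:.0f} ms")
print(f"concurrent: {concurrent_s * 1000:.0f} ms")
```

The serial version costs roughly the sum of the three delays and the concurrent one roughly the longest. One thing to decide before shipping this is error handling: by default `gather` propagates the first exception, so think about what the endpoint should return when one of the three calls fails.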
Do them one at a time, and re-measure after each. Bundling changes makes it impossible to know which one helped — and which one quietly made something else worse.
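Caching is the fix in that list most likely to quietly change behavior, so it helps to see how small a short-lived cache can be. The sketch below is a hypothetical in-process `TTLCache`, not any library's API. The injectable `clock` is an assumption I added so that expiry can be tested without sleeping.

```python
import time


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_s seconds."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive load
        value = loader()
        self._entries[key] = (now + self.ttl_s, value)
        return value

    def invalidate(self, key):
        """Drop a key explicitly, for example right after a write."""
        self._entries.pop(key, None)


# Demo with a fake clock so expiry is deterministic.
fake_now = [0.0]
load_calls = []


def load_plans():
    # Placeholder for a slow lookup of rarely-changing data.
    load_calls.append(1)
    return ["free", "pro"]


cache = TTLCache(ttl_s=30, clock=lambda: fake_now[0])
first = cache.get_or_load("plans", load_plans)   # miss: calls the loader
second = cache.get_or_load("plans", load_plans)  # hit: served from cache
fake_now[0] = 31.0                               # move past the TTL
third = cache.get_or_load("plans", load_plans)   # expired: loads again

print("loader ran", len(load_calls), "times for 3 reads")
```

This version is not thread-safe and never evicts, so treat it as a shape rather than something to deploy. The TTL is the invalidation policy here: callers can see data up to 30 seconds stale, and whether that is acceptable is a product decision, not a performance one.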
Verify the win without breaking anything
A latency improvement that changes the response, drops data, or only works under light load isn’t an improvement. Before and after each change, confirm two things: the response is byte-for-byte equivalent, apart from any fields you removed on purpose (caching and batching are easy places to accidentally change behavior), and the win holds under realistic concurrency, not just a single warm request on your laptop.
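One way to check equivalence is to replay the same requests against the old and new code paths and compare the serialized output. In the sketch below the three handlers are invented stand-ins, and `find_mismatches` is a hypothetical helper. Comparing canonical JSON (sorted keys) instead of raw bytes is my assumption, so drop the canonicalization if your clients depend on key order.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def canonical(payload):
    """Serialize with sorted keys so key order alone never causes a diff."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def find_mismatches(old_handler, new_handler, requests, workers=8):
    """Replay requests concurrently and return those whose responses differ."""
    def check(request):
        same = canonical(old_handler(request)) == canonical(new_handler(request))
        return None if same else request

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [r for r in pool.map(check, requests) if r is not None]


# Invented handlers: the original, a refactor, and a refactor with a bug.
def old_handler(user_id):
    return {"id": user_id, "name": f"user-{user_id}", "orders": [user_id, user_id + 1]}


def new_handler(user_id):
    # Same data, different key order.
    return {"orders": [user_id, user_id + 1], "name": f"user-{user_id}", "id": user_id}


def buggy_handler(user_id):
    response = new_handler(user_id)
    if user_id == 3:
        response["orders"] = []  # simulates a batching bug that drops rows
    return response


clean = find_mismatches(old_handler, new_handler, range(10))
broken = find_mismatches(old_handler, buggy_handler, range(10))

print("refactor mismatches:", clean)
print("buggy mismatches:   ", broken)
```

The thread pool only makes the replay concurrent. It is not a load test, and the second check (that the win holds under realistic concurrency) still needs a proper load-testing tool pointed at a production-like environment.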
Re-run the same measurement you started with. The number that matters is the p95 you set out to fix — not the p50, which often looked fine the whole time.
Know when to stop
There’s a point where the next 50ms costs more engineering than it’s worth. Once the endpoint is comfortably inside its budget and the remaining time is spread thin across many small things rather than concentrated in one fixable block, stop. Performance work has sharply diminishing returns, and the discipline is knowing when you’ve captured the easy 90% and should move on.
The 2.9s-to-300ms result comes from exactly this: not a heroic rewrite, but measuring honestly, fixing the biggest block first, and verifying each step. It’s repeatable, and most slow APIs have the same handful of problems waiting to be found.
If your API is slow and you’re not sure where the time is going, that’s exactly the kind of thing I help teams diagnose and fix. The first consultation is free — get in touch.