How you follow one request across many services, in plain English, and where it falls short.
In a monolith, when something is slow you read one log. In microservices, a single click can touch a dozen services, and the question "where did the time go?" gets genuinely hard. Distributed tracing is how you answer it. Here is what it is, how it works, and where its limits are.
What is distributed tracing?
Distributed tracing follows a single request as it travels through all the services that handle it, and stitches the whole journey into one picture called a trace. A trace is made of spans, and each span is one unit of work: an API call, a database query, a function call. Every span records when it started, how long it took, and some context, such as descriptive attributes and the ID of its parent span. Spans nest, so you can see that a slow checkout spent most of its time waiting on one database call three services deep. Tracing is one of the three pillars of observability, alongside metrics and logs.
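To make the nesting concrete, here is a minimal sketch of a trace as a tree of spans. Everything in it is hypothetical: the Span class, the service names, and the timings are invented for illustration, and it assumes child spans run one after another rather than in parallel. It is not how any particular tracing backend stores data.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Span:
    """One unit of work inside a trace (hypothetical, simplified model)."""
    name: str
    service: str
    start_ms: float
    duration_ms: float
    children: List["Span"] = field(default_factory=list)

    def self_time_ms(self) -> float:
        # Time spent in this span itself, excluding its children.
        # Assumes children run sequentially, not concurrently.
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span: Span, depth: int = 0) -> Iterator[Tuple[int, Span]]:
    """Yield every span in the tree together with its nesting depth."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


def slowest_self_time(root: Span) -> Span:
    """Return the span that spent the most time doing its own work."""
    return max((s for _, s in walk(root)), key=lambda s: s.self_time_ms())


# A made-up slow checkout: most of the time sits in one database query.
trace = Span("POST /checkout", "web", 0, 950, [
    Span("reserve stock", "inventory", 10, 120),
    Span("charge card", "payments", 140, 790, [
        Span("lookup customer", "accounts", 150, 760, [
            Span("SELECT customers", "accounts-db", 155, 740),
        ]),
    ]),
])

for depth, span in walk(trace):
    print("  " * depth + f"{span.service}: {span.name} ({span.duration_ms} ms)")

culprit = slowest_self_time(trace)
print(f"Most self time: {culprit.name} in {culprit.service}")
```

The top-level span looks slow, but almost all of its time belongs to its children. Subtracting child time from each span is what lets you point at the one hop that actually owns the delay.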
How it actually works
The real trick is context propagation. When a request enters your system it is given a trace identifier, and every service passes that identifier along as it calls the next one, usually in the traceparent HTTP header defined by the W3C Trace Context standard. Each service adds its own spans to the same trace. If one service fails to forward the header, the chain breaks and the trace splits into disconnected fragments. This is why a shared standard like OpenTelemetry matters: it makes propagation work across services written by different teams in different languages.
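Here is a rough sketch of what propagation looks like on the wire, using only the Python standard library. The traceparent layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) comes from the W3C Trace Context standard. The function names are hypothetical, and the sketch assumes header names arrive lowercased in a plain dictionary. In practice an OpenTelemetry SDK would normally do this for you.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# version - trace-id - parent-id - flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random trace ID and random span ID."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes  -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(value: str) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_span_id, flags), or None if invalid."""
    match = TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The standard treats version ff and all-zero IDs as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def outgoing_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Headers for the next hop: same trace ID, this service's span as parent.

    Assumes header names in `incoming` are already lowercased.
    """
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    if parsed is None:
        # No usable context arrived, so this service starts a new trace.
        return {"traceparent": new_traceparent()}
    trace_id, _caller_span_id, flags = parsed
    my_span_id = secrets.token_hex(8)
    return {"traceparent": f"00-{trace_id}-{my_span_id}-{flags}"}


# Example header value taken from the W3C Trace Context specification.
incoming = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
}
print(outgoing_headers(incoming)["traceparent"])
```

The trace ID stays the same from hop to hop while the span ID changes. A service that drops the header falls into the "start a new trace" branch downstream, which is exactly how a single journey ends up as two unrelated fragments.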
Why it matters
It shows the path, not just the symptom. Instead of "the app is slow", you see exactly which hop is slow, and by how much.
It survives complexity. The more services you have, the more you need it. Tracing is what keeps a microservices estate debuggable.
It turns finger-pointing into evidence. The trace shows which service owns the delay, so the conversation is about the data, not whose fault it is.
The honest ledger
It is only as good as your instrumentation. Miss a service, or break context propagation, and you get a partial trace that can mislead more than it helps.
Sampling is a trade-off. Recording every request is expensive, so most teams keep only a fraction of traces. Sample too aggressively and you miss the rare failure you most wanted to see.
It shows where, not always why. A slow span points at the location. You still often need the logs and metrics around it to find the root cause.
It has a cost. High-volume tracing adds runtime overhead and storage volume, and rich span attributes multiply cardinality, which is where bills grow.
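The sampling trade-off in the list above can be put into numbers. The sketch below shows one common approach, deciding from the trace ID itself so that every service reaches the same keep-or-drop answer for the same trace without coordinating. The function names and figures are hypothetical, and this is a simplified illustration rather than the exact algorithm of any particular SDK.

```python
def keep_trace(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic sampling: the same trace ID always gets the same answer.

    Treats the last 8 bytes of the trace ID as a number between 0 and
    2**64 - 1 and keeps the trace if it falls in the lowest `ratio` share.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bucket = int(trace_id_hex[-16:], 16)
    return bucket < ratio * 2**64


def chance_of_catching(rare_failures: int, ratio: float) -> float:
    """Probability that at least one of the rare failures gets sampled.

    Assumes each failing request is sampled independently at `ratio`.
    """
    return 1.0 - (1.0 - ratio) ** rare_failures


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(keep_trace(trace_id, 0.7), keep_trace(trace_id, 0.5))

# Hypothetical scenario: a bug hits 5 requests a day and you sample 1%.
print(f"{chance_of_catching(5, 0.01):.1%} chance of seeing it at all")
```

At a 1% sample rate, a failure that happens five times a day will go unrecorded on most days. That is why teams often keep a low base rate for routine traffic and a much higher one for errors or slow requests.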
So, is distributed tracing for you?
If you run microservices, serverless, or any request that crosses more than one service, distributed tracing is close to essential. The usual way to get it today is OpenTelemetry, in some cases with little or no code change through auto-instrumentation agents or eBPF-based tools. If you run a single, simple application it matters less, though it still helps. Either way, start with your most important user journeys rather than trying to trace everything at once.
Where has tracing saved you, or let you down? I would like to hear it, especially the sampling war stories. Reply, or book a slot and tell me what you found.
If this was useful, Metrics & Mayhem sends one short, practical piece like it to IT operations leaders most weeks. No fluff, no vendor noise.
Sources / further reading
OpenTelemetry observability primer (traces): opentelemetry.io
W3C Trace Context standard: w3.org/TR/trace-context
