Observability Vendor Lock-In: Measure Your Exit Time

We can all recite our resilience numbers. Four nines of uptime. A recovery time objective measured in minutes. A recovery point objective that the business signed off on in a meeting nobody enjoyed. These numbers live on dashboards, in board packs, in the slide that gets dusted off every audit. They are the language we use to say "we are in control".

There is one number we never put on the board. How long would it take to leave the platform that everything else is built on? Not the cost of leaving, although that is bad enough. The time. The elapsed wall-clock months between deciding to go and actually being gone.

I started thinking about this properly after listening to Barry Pilling pull apart the Broadcom, VMware and Tesco licensing dispute on the Tech Leaders Podcast. The legal detail is not my lane, the case is unresolved, and I have no view to offer on who is right. What stayed with me was the shape of the problem underneath it. A customer builds an estate around a platform over many years, the terms of that platform change in a way the customer did not choose, and at that exact moment the most important question in the room is one nobody had measured: how fast can we get out.

The number nobody boards

We are precise about almost everything else. We will argue for an afternoon about whether a service needs three nines or four. We will model failover regions, run game days, and write recovery runbooks for outages that may never happen. All of that is good work, and I am not knocking it.

But ask a leadership team how long it would take to migrate off their core observability platform, or their hypervisor, or their primary cloud, and you get a pause. Then an estimate that is really a hope. Then, if you push, an admission that nobody has actually tried to cost it, because trying to cost it would mean admitting how deep the roots have gone.

Exit time is a resilience number. It belongs on the same board as your RTO and your RPO, because it measures the same thing they do: your exposure when something outside your control goes wrong. The difference is that an outage is an act of weather and a change of terms is an act of will. Someone on the other side of a contract can decide to create your worst day. If you cannot state your exit time, assume the other side has a better estimate of it than you do.

The day the terms change

Resilience planning has a quiet assumption baked into it. The threat is failure: a region falling over, a disk filling, a dependency timing out. So we design for failure, and we are good at it.

The threat that exit time measures is different. It is not that the platform breaks. It is that the platform works exactly as designed, and the commercial reality around it shifts. An acquisition. A new pricing model. A licensing change that lands the week your renewal does. The platform is still up. Your problem is that you cannot afford to stay and cannot quickly leave.

This is where the embedded customer discovers something uncomfortable. Every integration that made the platform more useful also made it harder to leave. The deeper the value, the longer the exit. And the lever that decides what happens next, the speed at which you could credibly walk, is held by the other party in the negotiation, not by you. You optimised for capability and quietly sold off your optionality to pay for it.

Observability lock-in is the exit time you cannot see

I will bring this home to the discipline I actually work in, because observability is a near-perfect case study in exit time growing while nobody watches.

Think about how a mature observability estate gets built. You start with one vendor's agents because they are easy. Then your dashboards encode a year of hard-won operational knowledge in that vendor's query language. Then your alerting logic, the rules that decide who gets the call at three in the morning, lives in their system and nowhere else. Then your data sits in a proprietary format in their storage, priced by ingest and retention in ways that are hard to predict and harder to escape. None of these decisions was wrong on the day you made it. Each one was the sensible, shippable choice. Together they are a set of roots that take years to grow and, if you ever need to, years to pull up.

That is the part that should worry an operator. Exit time in observability is not a line item you can see. It is the sum of a hundred reasonable integrations, each of which added a little capability and a little lock-in, and you only find out the total on the day you try to leave. By then, it is a discovery, not a decision.

The honest version of this is not "never commit to a vendor". Commitment buys you real things, and a heterogeneous estate held together with string has its own failure modes. The honest version is: know the number. Treat the depth of your commitment as a measured quantity, not a vibe.

Designing for exit

If exit time is a resilience number, then you manage it the way you manage the others. You measure it, you review it, and you do the engineering that keeps it inside a range you can live with. A few things make the difference, and none of them is exotic.

Own your data in an open format. The single biggest driver of exit time is data you cannot take with you in a shape anyone else can read. Open formats and standards exist precisely so that your telemetry is portable by default rather than by heroic project. If your signals are stored in a format only one vendor can read, your exit time is whatever that vendor decides it is.

Own your storage, or at least own the decision about where it lives. The further your data sits from a proprietary, all-in-one bundle, the cheaper and faster it is to point a different tool at it. Separating where the data lives from what reads the data is one of the highest-leverage architectural choices you can make for portability.

Test portability before you need it. Resilience that has never been exercised is a guess. We do not trust a backup we have not restored. We should not trust an exit we have not rehearsed. A small, honest extraction test (can we get a month of one signal type out and into something else?) tells you more about your real exit time than any architecture diagram.
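To make the extraction test concrete, here is a minimal sketch in Python. It assumes you have already pulled a sample of one signal type out of your current platform by whatever export route it offers; the sample records, the field names and the choice of newline-delimited JSON as the neutral format are all illustrative assumptions, not a recommendation of any particular tool. The sketch writes the records out, reads them back, and reports what survived the trip.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical fields we expect every exported record to carry.
REQUIRED_FIELDS = ("timestamp", "name", "value", "attributes")


def write_jsonl(records, path):
    """Write records as newline-delimited JSON, one record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, sort_keys=True) + "\n")


def read_jsonl(path):
    """Read newline-delimited JSON back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def extraction_report(source_records, path, required_fields=REQUIRED_FIELDS):
    """Round-trip records through a neutral file and summarise what survived."""
    write_jsonl(source_records, path)
    restored = read_jsonl(path)
    incomplete = [r for r in restored if not all(k in r for k in required_fields)]
    return {
        "exported": len(source_records),
        "restored": len(restored),
        "records_missing_fields": len(incomplete),
        "lossless": restored == source_records,
    }


# Stand-in for a real export. In practice this would be a month of one
# signal type; these two records are made up for illustration.
sample = [
    {"timestamp": "2024-01-01T00:00:00Z", "name": "http.server.duration",
     "value": 0.12, "attributes": {"service": "checkout"}},
    {"timestamp": "2024-01-01T00:01:00Z", "name": "http.server.duration",
     "value": 0.34},  # no attributes: the kind of gap the test should surface
]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(extraction_report(sample, Path(tmp) / "export.jsonl"))
```

The interesting output is rarely the happy path. It is the count of records that arrive without the context you rely on, and the time the export itself took. Both feed directly into your exit time.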

Keep a live migration estimate. Not a perfect one. A current, deliberately rough number, owned by someone, reviewed when your estate changes materially, and reported alongside your other resilience figures. The value is not precision. The value is that the number exists at all, and that someone is watching it move.
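A rough number can be as simple as an inventory multiplied out. The sketch below, in Python, is one way to keep that estimate somewhere reviewable. Every figure in it is an invented placeholder: the item names, the counts, the person-days per item, the team size and the overhead multiplier are assumptions you would replace with your own, and the model deliberately ignores anything it cannot count, such as contract notice periods.

```python
from dataclasses import dataclass


@dataclass
class LockInItem:
    """One category of work that leaving the platform would require."""
    name: str
    count: int
    person_days_each: float


def exit_time_months(items, engineers, working_days_per_month=20.0, overhead=1.5):
    """Rough elapsed months to exit.

    overhead is a fudge factor for coordination, testing and parallel running.
    It is an assumption to be tuned, not a measured constant.
    """
    if engineers <= 0:
        raise ValueError("engineers must be positive")
    total_person_days = sum(i.count * i.person_days_each for i in items)
    return total_person_days * overhead / (engineers * working_days_per_month)


# Hypothetical inventory. Replace with counts from your own estate.
inventory = [
    LockInItem("dashboards to rebuild", 120, 0.5),
    LockInItem("alert rules to port", 300, 0.25),
    LockInItem("services to re-instrument", 40, 2.0),
    LockInItem("data export and backfill", 1, 45.0),
]

if __name__ == "__main__":
    months = exit_time_months(inventory, engineers=3)
    print(f"Estimated exit time: {months:.1f} months")
```

With these made-up inputs the inventory comes to 260 person-days, which the overhead factor and a team of three turn into 6.5 months. The point is not that figure. It is that when someone adds fifty dashboards next quarter, the number moves and somebody sees it move.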

The point of all this is not paranoia about vendors. It is the same instinct that makes us write a runbook for an outage we hope never comes. We are not predicting the bad day. We are making sure that if it arrives, our options are wider than the other party assumes.

Close

You can probably state your uptime to four nines without checking. You can quote your RTO from memory. Those numbers are real, and they matter.

Now try the other one. If a platform you depend on changed its terms tomorrow, how long would it take you to be gone? If you can answer that with a number you have actually tested, you are in a stronger position than most organisations I have seen. If you cannot, the silence is the answer, and it is not a comfortable one.

Resilience was never only about surviving failure. It is about keeping your choices open when someone else would prefer them closed. Exit time is how you measure that. Put it on the board.

I write about this kind of thing, the numbers we should be tracking and the ones we actually do, every week in the Observability Digest. If the idea of exit time as a resilience number is new to you, the newsletter is where I work these arguments out in the open. If you would rather start with the book, the fourth chapter of Metrics and Mayhem is free.

