The migration rollback plan I refuse to write

Every enterprise data migration comes with a rollback plan. Usually it looks something like this:

If the new system fails validation at cutover, we will redirect traffic back to the legacy system. A database snapshot will be restored from T-0 minus 6 hours. Downtime is expected to be under 30 minutes.

It reads like a safety net. It is not a safety net. It is a document that exists to make people in suits feel better during the planning meeting.

I have been through three of these migrations now. I have written that paragraph more than once. I no longer write it, and I want to explain why.

The fiction in the standard rollback plan

Let's take that paragraph apart.

"Redirect traffic back to the legacy system." This assumes the legacy system is still in the state it was in before the cutover. It is not. Under dual-write, every write to the new system is also a write to the legacy system, so the legacy system now holds data that was not there yesterday. You cannot just point users back to it. It is a different system from the one your users left.

"Restore the database from a snapshot." This ignores every transaction that has happened since that snapshot. For a typical insurance or finance system, that is thousands of real customer actions. You cannot quietly delete them. They are attached to people, and those people will notice.

"Downtime is expected to be under 30 minutes." The restore alone takes longer than that on most enterprise data volumes. And this ignores the time to figure out that something is wrong in the first place, the time to decide to pull the trigger, the time to get approvals, and the time to communicate the decision.

In practice, by the time you are thinking about rolling back, rolling back is already the worst option on the table.
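
To put rough numbers on the 30-minute claim, here is a back-of-the-envelope sketch. Every figure in it is a made-up placeholder, not a measurement from any real program: the data volume, the restore throughput, and the minutes spent on detection, decision, approval, and communication are all assumptions you would replace with your own.

```python
def rollback_downtime_minutes(
    data_gb: float,
    restore_mb_per_s: float,
    detect_min: float = 0.0,
    decide_min: float = 0.0,
    approve_min: float = 0.0,
    communicate_min: float = 0.0,
) -> float:
    """Estimate end-to-end rollback downtime in minutes.

    All inputs are hypothetical planning figures supplied by the caller.
    """
    if restore_mb_per_s <= 0:
        raise ValueError("restore_mb_per_s must be positive")
    # Time to restore the snapshot at a sustained throughput.
    restore_seconds = (data_gb * 1024) / restore_mb_per_s
    restore_min = restore_seconds / 60
    # The restore is only one term; the human steps add on top of it.
    return restore_min + detect_min + decide_min + approve_min + communicate_min


if __name__ == "__main__":
    # Hypothetical inputs: 1,000 GB restored at a sustained 500 MB/s.
    restore_only = rollback_downtime_minutes(1000, 500)
    # Same restore, plus assumed minutes for the human steps.
    end_to_end = rollback_downtime_minutes(
        1000, 500, detect_min=20, decide_min=15, approve_min=30, communicate_min=10
    )
    print(f"restore only: {restore_only:.0f} min, end to end: {end_to_end:.0f} min")
```

With those assumed inputs the restore alone comes out at roughly 34 minutes, which is already past the promise before anyone has noticed a problem or approved anything.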

The real question nobody wants to ask

When people ask me to write a rollback plan, the honest answer is:

Past a certain point in the migration, there is no rollback. The only way out is forward.

This is not a clever observation. It is the entire shape of the problem. You spend months setting up a dual-write architecture so that neither side is authoritative for too long. You do shadow reads to check that the new system is returning the same answers as the old one. You reconcile the data row by row and fix drift. You do all of this because by the time you cut over, you should not need a rollback plan.

If you need one, something has gone wrong much earlier.
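
For readers who have not built one, the row-by-row reconciliation step can be sketched in a few lines. This is an illustration under assumptions, not the tooling from any particular program: it assumes both systems can export rows as a mapping from a sortable primary key to a comparable value, and the names `reconcile` and `ReconciliationReport` are mine.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping


@dataclass(frozen=True)
class ReconciliationReport:
    match_rate: float              # fraction of keys that agree on both sides
    missing_in_new: List[Any]      # keys present only in the legacy system
    missing_in_legacy: List[Any]   # keys present only in the new system
    mismatched: List[Any]          # keys present on both sides with different values


def reconcile(legacy: Mapping[Any, Any], new: Mapping[Any, Any]) -> ReconciliationReport:
    """Compare two keyed row sets and report where they have drifted."""
    all_keys = set(legacy) | set(new)
    missing_in_new = sorted(k for k in all_keys if k not in new)
    missing_in_legacy = sorted(k for k in all_keys if k not in legacy)
    mismatched = sorted(k for k in legacy.keys() & new.keys() if legacy[k] != new[k])
    drifted = len(missing_in_new) + len(missing_in_legacy) + len(mismatched)
    # Two empty sides trivially agree.
    match_rate = 1.0 if not all_keys else (len(all_keys) - drifted) / len(all_keys)
    return ReconciliationReport(match_rate, missing_in_new, missing_in_legacy, mismatched)


if __name__ == "__main__":
    # Hypothetical rows: primary key -> (customer, balance).
    legacy_rows = {1: ("alice", 100), 2: ("bob", 250), 3: ("carol", 75)}
    new_rows = {1: ("alice", 100), 2: ("bob", 205), 4: ("dave", 10)}
    print(reconcile(legacy_rows, new_rows))
```

A real reconciliation would stream rows in batches and normalise types before comparing, but the output is the same shape: a match rate, plus a list of keys somebody has to go and fix.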

What I write instead

So here is what I now put in the rollback section of a migration document, more or less word for word:

There is no rollback plan for this migration past the point of first production write.

Instead, we have a set of abort conditions that will pause the cutover before it becomes irreversible. If any of these conditions trip, we stop, diagnose, and do not proceed:

  • Reconciliation match rate below X for four consecutive hours.
  • Read latency at the new system above Y for more than one hour.
  • Any data-loss incident.
  • Regulatory reporting failure on either side.

We also maintain the legacy system in read-only "reference" mode for 90 days after cutover. This is not for rollback; it is for audit and for cross-checking edge cases. Writes during this period go only to the new system.

If the cutover is aborted, we roll back the cutover itself, not the data. We return to the dual-write state we were in before the cutover was attempted, investigate, fix, and schedule another attempt. The data written during the aborted attempt remains valid on both sides.

This is a longer and less comforting paragraph than the traditional one. It also happens to describe what actually happens.
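
The abort conditions above only protect anyone if something evaluates them continuously. Here is a minimal sketch of that check, under stated assumptions: metrics arrive as one sample per hour, "more than one hour" is read as two or more consecutive hourly samples, and the thresholds standing in for X and Y are passed in by the caller because the plan leaves them open. The `HourlySample` type and the condition names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass(frozen=True)
class HourlySample:
    match_rate: float           # reconciliation match rate for the hour, 0.0 to 1.0
    read_latency_ms: float      # read latency observed at the new system
    data_loss_incidents: int    # data-loss incidents recorded in the hour
    regulatory_report_ok: bool  # False if regulatory reporting failed on either side


def longest_run(flags: Iterable[bool]) -> int:
    """Length of the longest consecutive run of True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def tripped_conditions(
    samples: Sequence[HourlySample],
    min_match_rate: float,   # stands in for X; chosen by the program, not by this sketch
    max_latency_ms: float,   # stands in for Y; chosen by the program, not by this sketch
) -> List[str]:
    """Return the names of the abort conditions that have tripped."""
    tripped = []
    # Match rate below X for four consecutive hours.
    if longest_run(s.match_rate < min_match_rate for s in samples) >= 4:
        tripped.append("reconciliation_match_rate")
    # Latency above Y for more than one hour, read here as 2+ consecutive samples.
    if longest_run(s.read_latency_ms > max_latency_ms for s in samples) >= 2:
        tripped.append("read_latency")
    # Any data-loss incident trips immediately.
    if any(s.data_loss_incidents > 0 for s in samples):
        tripped.append("data_loss")
    # Any regulatory reporting failure trips immediately.
    if any(not s.regulatory_report_ok for s in samples):
        tripped.append("regulatory_reporting")
    return tripped


if __name__ == "__main__":
    # Hypothetical thresholds and samples, for illustration only.
    healthy = HourlySample(1.0, 100.0, 0, True)
    drifting = HourlySample(0.99, 100.0, 0, True)
    print(tripped_conditions([healthy, healthy] + [drifting] * 4, 0.999, 250.0))
```

An empty list means the cutover may proceed. Anything else means stop and diagnose, which is the whole point of having the conditions written down before the day itself.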

Why this matters

People treat rollback plans as a kind of insurance. They imagine that because the plan exists, the risk is covered. The plan is the reason they are willing to sign off.

The problem is that if the plan is fiction, it covers nothing. You have not reduced risk. You have pretended to.

The honest version ("there is no rollback, here are the abort conditions, here is our read-only reference period") is harder to sell in a room full of people who want a safety net. But it is the version that matches reality, and it forces the team to spend its energy where it matters: on the dual-write gates, the reconciliation accuracy, and the monitoring that triggers the abort conditions.

Those are the things that actually protect the customer. The paragraph about snapshot restore does not.

The thing I wish I had known earlier

I used to feel uneasy about writing these honest rollback sections. I assumed architects older than me would push back. Some of them do. But most of them are relieved. They already know the standard rollback plan is fiction. They just did not have anyone willing to say it out loud.

Saying it out loud, in writing, in a document that gets circulated, turns out to be the single most effective thing I have done on any of these programs. It changes what people argue about. It changes where the team invests its time. It changes what gets tested.

That is worth a lot more than a paragraph that lets somebody in a suit sleep at night.

Disagree? I would actually like to hear it. Email me.