The Platform Shift: How We Moved 40 Million Customers to the Cloud

Last year we migrated Sage's flagship accounting products to a cloud-native platform, moving infrastructure that serves over 40 million customers worldwide. This is the story of how we did it, what we got wrong first, and what we'd do differently.

Why we needed to move

Our on-premises stack was born in an era of dedicated servers and manual deployments. It worked — for fifteen years it worked remarkably well. But three things changed at once: customer expectations around uptime and feature velocity, the team's ability to recruit engineers comfortable with the old tooling, and our own ambitions for AI-powered features that demanded elastic compute.

We didn't move because it was trendy. We moved because staying put had become the riskier choice.

The decision that changed everything

Early in the planning phase we debated lift-and-shift versus rearchitect. Lift-and-shift was tempting: it was faster, carried lower risk per step, and preserved existing knowledge. We chose to rearchitect: not everything, but the right seams.

The insight that broke the deadlock: identify the blast radius of each service. Anything with a blast radius touching the ledger got lifted first, unchanged. Anything at the edge (reporting, notifications, integrations) got rebuilt on AWS Lambda and ECS. This let us ship value immediately while the core migrated safely underneath.
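To make the blast-radius triage concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the tooling we actually used: the service names, the AFFECTS map, and the helper functions are all hypothetical. The idea is to treat "can this service affect the ledger, directly or transitively?" as a graph reachability question.

```python
from collections import deque

# Hypothetical map: service -> services it can directly affect if it misbehaves.
# Names and edges are illustrative assumptions, not a real service inventory.
AFFECTS = {
    "ledger": [],
    "invoicing": ["ledger"],
    "payments": ["invoicing"],
    "reporting": [],
    "notifications": [],
    "integrations": ["notifications"],
}


def blast_radius(service, affects):
    """Return every service reachable from `service`, including itself."""
    seen = {service}
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for neighbour in affects.get(current, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def migration_strategy(service, affects, core="ledger"):
    """Lift anything whose blast radius touches the core; rebuild the rest."""
    return "lift" if core in blast_radius(service, affects) else "rebuild"


plan = {name: migration_strategy(name, AFFECTS) for name in AFFECTS}
```

With this invented graph, payments lands in the "lift" group because it reaches the ledger through invoicing, while reporting, notifications, and integrations land in "rebuild". The hard part in practice is building an honest dependency map, not the traversal.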

What we got wrong

We underestimated state. Stateless services are easy. The hard part is the fifteen years of implicit state: configuration stored in flat files, session data written to local disk, cron jobs that assumed they ran on a single machine. Audit every assumption before you containerise.
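The single-machine cron assumption is a good example of how this bites: once a job runs in several containers, each replica fires the same schedule. One common remedy is to have replicas claim each run in a shared store before doing the work. The sketch below shows the shape of that technique, not our implementation. It assumes a hypothetical job_runs table, and it uses an in-memory SQLite database purely as a stand-in for whatever shared database the replicas can all reach.

```python
import sqlite3


def make_store():
    """Create the stand-in shared store (in-memory SQLite, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_runs ("
        " job TEXT NOT NULL,"
        " slot TEXT NOT NULL,"   # the scheduled run, e.g. a date string
        " owner TEXT NOT NULL,"  # which replica claimed it
        " PRIMARY KEY (job, slot))"
    )
    return conn


def try_claim(conn, job, slot, owner):
    """Return True if `owner` claimed this run, False if another replica already did."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO job_runs (job, slot, owner) VALUES (?, ?, ?)",
                (job, slot, owner),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists: someone else owns this run.
        return False
```

Each replica calls try_claim before running the job and skips the run if it returns False. The primary key does the coordination, so no replica needs to know how many others exist.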

We over-engineered observability on day one. We spent six weeks building a bespoke metrics pipeline before we'd proven the migration path. The simple answer — CloudWatch, Datadog, done — would have served us for the first twelve months.

We didn't run dark for long enough. Our shadow traffic testing was thorough, but we cut over too quickly. Two weeks of parallel running, not two days, would have caught the timezone edge case that caused a production incident in the first month.
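For readers unfamiliar with shadow testing, the core of it is replaying the same request against the old and new stacks and diffing the answers. A minimal sketch of that comparison follows. The function name, the field names, and the sample payloads are invented for illustration, and the date mismatch shown is a made-up example of the kind of discrepancy a longer parallel run gives you more chances to see, not a description of our actual incident.

```python
def diff_responses(old, new, ignore=()):
    """Return {field: (old_value, new_value)} for every field that differs.

    Fields listed in `ignore` (request IDs, timings) are expected to differ
    and are skipped. A field missing on one side is reported as None.
    """
    fields = (set(old) | set(new)) - set(ignore)
    diffs = {}
    for field in sorted(fields):
        old_value = old.get(field)
        new_value = new.get(field)
        if old_value != new_value:
            diffs[field] = (old_value, new_value)
    return diffs


# Invented sample payloads from the old and new stacks for the same request.
old_response = {"invoice_id": 17, "total": "120.00", "due_date": "2024-03-31", "request_id": "a1"}
new_response = {"invoice_id": 17, "total": "120.00", "due_date": "2024-04-01", "request_id": "b7"}

mismatches = diff_responses(old_response, new_response, ignore=("request_id",))
```

Here mismatches contains only the due_date field, with the old and new values side by side. Discrepancies that appear only at day or month boundaries will not show up unless the parallel run spans those boundaries.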

What we'd do differently

Start with the data migration story on day one. It's the longest pole in the tent and the hardest to accelerate. Every other decision — service boundaries, deployment tooling, observability — can be revisited. The data migration cannot be rushed.

Form a dedicated migration squad separate from the product teams. Shared ownership means no ownership when the migration competes with a quarterly release deadline.

Where we are now

Eighteen months in, 98% of traffic runs on the new platform. Mean time to deploy is down from four hours to eleven minutes. We've shipped three AI features that would have been impossible on the old stack.

The work isn't done — it never is — but the hardest part is behind us. Next up: making the platform invisible to the engineers building on top of it.