Rethinking Cloud Migration: A Personal Reflection After Recent Global CSP Outages

I’ll be honest—lately, I’ve been grappling with a quiet but persistent doubt.

A doubt about one of the most fundamental narratives in modern IT strategy: that public cloud, by design, guarantees superior availability, scalability, and operational resilience.

This belief has shaped so many transformation roadmaps, business cases, and boardroom decisions. Yet over the past months, several global-scale incidents across AWS, Azure, and GCP have forced me to step back and ask: Are we leaning on the right assumptions?

When “Hyperscale” Fails: What Recent Incidents Are Telling Us

Across 2023–2025, we witnessed multiple high-impact outages from leading cloud service providers. A few examples:

AWS Global Event – Route 53 & Kinesis Issues

A cascading failure triggered by a subsystem capacity bug caused widespread disruption, taking down authentication flows, streaming services, and applications that depended on AWS’ internal control plane. Root causes pointed to throttling in internal APIs, highlighting that even hyperscale architectures have centralized choke points.¹

Microsoft Azure – Storage & Identity Outage

Azure Active Directory (now Entra ID) experienced a global disruption caused by a configuration change that propagated incorrectly across regions. Customers lost access not because their workloads failed, but because the identity backbone of Azure itself became unreachable.²

Google Cloud – Networking & Load Balancer Incident

A global networking policy rollout failure caused packet drops and triggered cascading service reliability issues. GCP later confirmed that an automated update to its traffic routing layer propagated without the intended phased validation.³

These weren’t “local zone” failures.

These were foundational disruptions affecting multiple regions and impacting customers who architected “according to best practices.”

And that forced me to rethink a big question:

If hyperscalers can still suffer global control-plane failures, are we overestimating the SLA advantage of cloud?
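
One way to reason about that question is to look at how a shared dependency caps composite availability. The sketch below is a simplified model under loud assumptions: failures are treated as independent, and every availability figure is a made-up placeholder, not a published SLA or a number from any of the incidents above. The helper names (`parallel`, `serial`, `downtime_hours`) are hypothetical.

```python
HOURS_PER_YEAR = 8760


def parallel(availabilities):
    """Redundant components: the group is up if at least one member is up.

    Assumes failures are independent, which real outages often violate.
    """
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


def serial(availabilities):
    """Chained dependencies: the whole is up only if every part is up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def downtime_hours(availability):
    """Expected downtime per year for a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical inputs: two regions at 99.9% each, and one shared
# control-plane or identity dependency at 99.95%.
two_regions = parallel([0.999, 0.999])
with_shared = serial([two_regions, 0.9995])

print(f"two regions only:       {two_regions:.6%}")
print(f"with shared dependency: {with_shared:.6%}")
print(f"expected downtime/year: {downtime_hours(with_shared):.2f} h")
```

With these placeholder inputs, the two-region figure is about 99.9999%, while the figure that includes the shared dependency falls just under 99.95%. Multi-region redundancy cannot lift availability above that of whatever all regions depend on.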

Where the Math Gets Interesting: The TCO Question

Parallel to my doubts about availability, I was involved in a 5-year Total Cost of Ownership (TCO) assessment for a hypothetical organization.

We compared:

Current on-prem infrastructure (modernized)
Full cloud migration to one of the major CSPs
(Cloudification initiatives such as refactoring and managed services were excluded—this was infrastructure lift & shift cost only.)

After calculating, validating, sanity-checking, and applying a 3% margin of error, the results were surprisingly consistent: over 5 years, on-prem remained 35–50% cheaper than a full public cloud migration.

This included:

New hardware investments
Modernization aligned to NBV (Net Book Value)
Software renewals
Support & maintenance
Operational overhead
Data center facility cost (excluding building CAPEX, but including power, cooling, and space lease)

And even after layering future scaling requirements, the numbers barely shifted.
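
For readers who want to run a similar comparison, here is a minimal Python sketch of how a 5-year, infrastructure-only TCO model can be structured. Every name and figure in it (`OnPremCosts`, `on_prem_tco`, `cloud_tco`, `savings_pct`, the cost values, the growth rate) is a hypothetical placeholder chosen for illustration. None of them are numbers from the assessment described above.

```python
from dataclasses import dataclass


@dataclass
class OnPremCosts:
    """Hypothetical on-prem cost inputs, in arbitrary currency units."""
    hardware_capex: float       # one-off new hardware investment
    software_renewals: float    # per year
    support_maintenance: float  # per year
    operations: float           # per year: staff and tooling
    facility: float             # per year: power, cooling, space lease


def on_prem_tco(costs: OnPremCosts, years: int = 5, growth: float = 0.0) -> float:
    """One-off capex plus recurring costs, with recurring costs compounding yearly."""
    recurring = (costs.software_renewals + costs.support_maintenance
                 + costs.operations + costs.facility)
    total = costs.hardware_capex
    for year in range(years):
        total += recurring * (1 + growth) ** year
    return total


def cloud_tco(annual_bill: float, migration_cost: float,
              years: int = 5, growth: float = 0.0) -> float:
    """One-off migration cost plus a yearly bill that compounds with growth."""
    total = migration_cost
    for year in range(years):
        total += annual_bill * (1 + growth) ** year
    return total


def savings_pct(on_prem: float, cloud: float) -> float:
    """How much cheaper on-prem is, as a percentage of the cloud total."""
    return (cloud - on_prem) / cloud * 100


# Placeholder inputs only; replace with your own validated figures.
costs = OnPremCosts(hardware_capex=2000, software_renewals=300,
                    support_maintenance=200, operations=500, facility=200)

on_prem_total = on_prem_tco(costs)
cloud_total = cloud_tco(annual_bill=2600, migration_cost=500)

print(f"on-prem 5-year TCO: {on_prem_total:,.0f}")
print(f"cloud 5-year TCO:   {cloud_total:,.0f}")
print(f"on-prem cheaper by: {savings_pct(on_prem_total, cloud_total):.1f}%")
```

The `growth` parameter is one simple way to layer in future scaling: apply the same rate to both sides and check whether the gap moves. A real model would also need depreciation, refresh cycles, and discounting, all of which this sketch leaves out.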

This raised another uncomfortable question:

If cloud isn’t more available and isn’t cheaper—what strategic advantage are we actually pursuing?

So… Why Migrate? What’s the Actual Value?

Don’t get me wrong—cloud is transformative. But I think the industry has oversimplified the value proposition.

Here’s what remains truly compelling about cloud migration:

1. Speed of Innovation

Cloud-native services (AI/ML, serverless, managed databases, streaming platforms) drastically reduce time-to-market.

2. Elasticity for Unpredictable Workloads

For workloads that spike unpredictably, hyperscale elasticity is unmatched.

3. Global Presence & Instant Reach

A developer in Jakarta can deploy a service to 20+ global regions in minutes.

4. Operational Offloading

Power, cooling, physical security, and part of the platform stack shift away from internal teams.

But if your workloads are:

Predictable
Stable
Enterprise-grade
Centrally located
And you already operate in a well-managed data center

…then the TCO equation changes.

The Overlooked Middle Path: Cloud-Like On-Prem

This is where things get interesting.

By my rough estimate, today’s on-prem solutions can deliver 80–90% of cloud-like capabilities without the long-term cloud cost burden.

Examples:

OpenStack for IaaS orchestration
Ceph for scalable distributed storage
Many COTS platforms
Kubernetes distributions (OpenShift, Rancher, Tanzu) running on bare metal
Software-defined networking and storage
Full API-driven provisioning
Cloud-like observability stacks (Grafana, Loki, Prometheus)

I’ve seen on-prem setups and concepts that provide:

Self-service provisioning
Auto-scaling (within cluster boundaries)
Automated failover
DR orchestration
Immutable infrastructure models
Secure multi-tenancy

So when someone argues that on-prem cannot deliver “cloud experience,” the reality today is far more nuanced.

But Let’s Be Fair: On-Prem Also Has Its Price

On-prem is not free. And not simple. You still need to invest in:

Data center facility (whether leased or built)
Power and cooling
Network backbone
Physical security
Spare parts inventory
Operational workforce
Long-term lifecycle management
Modernization cycles

But these are already accounted for in our TCO calculations, and on-prem still came out cheaper than cloud.

So What Are We Really Optimizing For?

After reflecting on all of this, I’ve started to reframe the conversation away from “Cloud vs On-Prem.”

Instead:

What architecture best supports the organization’s long-term strategy, rather than what is currently fashionable?
Where is the real ROI—not just cost avoidance, but capability enhancement?
Do we need hyperscale elasticity, or do we simply need predictable stability?
What failure domains are we willing to accept? Global control-plane outages? Local power issues? Network routing failures?
What is our actual appetite for vendor lock-in?
How do we ensure technical sovereignty when our most critical workloads depend on someone else’s black box?

Closing Thoughts

I still believe cloud has a big role to play. But recent incidents and our internal financial modeling have taught me something valuable:

Cloud is no longer an assumed “default better option.” It is one architectural choice among many, and each deserves equal scrutiny.

The real challenge today is not choosing where to run workloads. It’s choosing why.

And I think more organizations will start asking that question with greater honesty in the years ahead.

Footnotes

1. AWS Post-Event Summary Reports (PESR), 2023–2024.
2. Microsoft Azure Status History – Global Outage Reports (Entra ID), 2023–2024.
3. Google Cloud Incident Reports – Networking Disruptions, 2023–2024.
4. Gartner – Cloud TCO Modeling and Hybrid Infrastructure Trends, 2023.
5. Uptime Institute – Annual Outage Analysis, 2024.
