I’ll be honest—lately, I’ve been grappling with a quiet but persistent doubt.
A doubt about one of the most fundamental narratives in modern IT strategy: that public cloud, by design, guarantees superior availability, scalability, and operational resilience.
This belief has shaped so many transformation roadmaps, business cases, and boardroom decisions. Yet over the past months, several global-scale incidents across AWS, Azure, and GCP have forced me to step back and ask: Are we leaning on the right assumptions?
When “Hyperscale” Fails: What Recent Incidents Are Telling Us
Across 2023–2025, we witnessed multiple high-impact outages from leading cloud service providers. A few examples:
AWS Global Event – Route 53 & Kinesis Issues
A cascading failure triggered by a subsystem capacity bug caused widespread disruption, taking down authentication flows, streaming services, and applications that depended on AWS’ internal control plane. Root causes pointed to throttling in internal APIs, highlighting that even hyperscale architectures have centralized choke points.¹
Microsoft Azure – Storage & Identity Outage
Azure Active Directory (now Entra ID) experienced a global disruption caused by a configuration change that propagated incorrectly across regions. Customers lost access not because their workloads failed, but because the identity backbone of Azure itself became unreachable.²
Google Cloud – Networking & Load Balancer Incident
A global networking policy rollout failure caused packet drops and triggered cascading service reliability issues. GCP later confirmed that an automated update to its traffic routing layer propagated without the intended phased validation.³
These weren’t “local zone” failures.
These were foundational disruptions affecting multiple regions and impacting customers who architected “according to best practices.”
And that forced me to rethink a big question:
If hyperscalers can still suffer global control-plane failures, are we overestimating the SLA advantage of cloud?
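To make that question concrete, here is a minimal sketch of the availability arithmetic. It assumes independent failures, and every number in it is a hypothetical placeholder of mine, not a published SLA or a measured figure. It shows how a shared global dependency (identity, DNS, control plane) caps the availability of a multi-region design, no matter how many regions sit behind it.

```python
# Illustrative only: all availability figures are hypothetical placeholders,
# not published SLAs or measured values.

HOURS_PER_YEAR = 24 * 365


def parallel(*availabilities):
    """Redundant components: down only if all are down at once
    (assumes failures are independent)."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


def serial(*availabilities):
    """Chain of dependencies: up only if every link is up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def downtime_hours(availability):
    """Expected downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


region = 0.999            # hypothetical single-region availability
shared_identity = 0.9995  # hypothetical global identity/control-plane availability

two_regions = parallel(region, region)
effective = serial(two_regions, shared_identity)

print(f"two regions alone:      {downtime_hours(two_regions):.3f} h/year down")
print(f"with shared dependency: {downtime_hours(effective):.3f} h/year down")
```

Under these assumptions the shared dependency, not the regional redundancy, dominates the expected downtime. That is the pattern the incidents above point to.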
Where the Math Gets Interesting: The TCO Question
Parallel to my doubts about availability, I was involved in a 5-year Total Cost of Ownership (TCO) assessment for a hypothetical organization.
We compared:
- Current on-prem infrastructure (modernized)
- Full cloud migration to one of the major CSPs
Cloudification initiatives such as refactoring and managed services were excluded; this was a lift-and-shift infrastructure cost comparison only.
After calculation, validation, sanity checks, and a 3% margin of error, the results were surprisingly consistent: on-prem remained 35–50% cheaper over 5 years than a full public cloud migration.
This included:
- New hardware investments
- Modernization aligned to NBV (Net Book Value)
- Software renewals
- Support & maintenance
- Operational overhead
- Data center facility cost (excluding building CAPEX, but including power, cooling, and space lease)
And even after layering future scaling requirements, the numbers barely shifted.
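For readers who want to run a similar comparison, the shape of the model can be sketched roughly as follows. All monetary figures, the growth rate, and the function names are hypothetical placeholders of mine. They are not the numbers from the assessment, and the output is not meant to reproduce the 35–50% result. The only things taken from the text are the 5-year horizon, the cost categories, and the 3% margin of error.

```python
# Illustrative only: all monetary figures are hypothetical placeholders,
# not the inputs or results of the assessment described in the article.

YEARS = 5
MARGIN = 0.03  # the 3% margin of error mentioned in the text


def on_prem_tco(hardware, annual_costs, years=YEARS):
    """Up-front hardware plus recurring annual costs over the period."""
    return hardware + years * sum(annual_costs.values())


def cloud_tco(first_year_bill, annual_growth, migration, years=YEARS):
    """One-off migration cost plus a yearly bill growing at a fixed rate."""
    bills = [first_year_bill * (1 + annual_growth) ** y for y in range(years)]
    return migration + sum(bills)


def savings_pct(cheaper, pricier):
    """How much cheaper one option is, as a percentage of the pricier one."""
    return 100.0 * (pricier - cheaper) / pricier


on_prem = on_prem_tco(
    hardware=2_000_000,
    annual_costs={
        "software_renewals": 300_000,
        "support_maintenance": 200_000,
        "operational_overhead": 400_000,
        "facility_power_cooling_space": 250_000,
    },
)
cloud = cloud_tco(first_year_bill=1_800_000, annual_growth=0.05, migration=500_000)

# Worst case for on-prem: its cost is underestimated and cloud is overestimated.
holds_within_margin = on_prem * (1 + MARGIN) < cloud * (1 - MARGIN)

print(f"on-prem 5-year TCO: {on_prem:,.0f}")
print(f"cloud 5-year TCO:   {cloud:,.0f}")
print(f"on-prem saving:     {savings_pct(on_prem, cloud):.1f}%")
print(f"holds within margin of error: {holds_within_margin}")
```

The margin check is the useful part: if the conclusion flips when both estimates move 3% in the unfavorable direction, the comparison is too close to call.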
This raised another uncomfortable question:
If cloud isn’t more available and isn’t cheaper—what strategic advantage are we actually pursuing?
So… Why Migrate? What’s the Actual Value?
Don’t get me wrong—cloud is transformative. But I think the industry has oversimplified the value proposition.
Here’s what remains truly compelling about cloud migration:
1. Speed of Innovation
Cloud-native services (AI/ML, serverless, managed databases, streaming platforms) drastically reduce time-to-market.
2. Elasticity for Unpredictable Workloads
For workloads that spike unpredictably, hyperscale elasticity is unmatched.
3. Global Presence & Instant Reach
A developer in Jakarta can deploy a service to 20+ global regions in minutes.
4. Operational Offloading
Power, cooling, physical security, and part of the platform stack shift away from internal teams.
But if your workloads are:
- Predictable
- Stable
- Enterprise-grade
- Centrally located
- And you already operate in a well-managed data center
…then the TCO equation changes.
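A rough way to see why predictability matters: on-prem capacity has to be sized for peak demand, while cloud is billed on actual usage, usually at a higher unit price. The sketch below uses made-up demand profiles and unit costs (all hypothetical, chosen only to illustrate the effect) to compare the two pricing shapes.

```python
# Illustrative only: demand profiles and unit costs are hypothetical.


def fixed_capacity_cost(demand, unit_cost):
    """On-prem style: pay for peak capacity in every period, used or not."""
    return max(demand) * unit_cost * len(demand)


def pay_per_use_cost(demand, unit_cost):
    """Cloud style: pay only for what each period actually consumes."""
    return sum(demand) * unit_cost


def peak_to_average(demand):
    """How spiky a workload is: 1.0 means perfectly flat."""
    return max(demand) / (sum(demand) / len(demand))


ONPREM_UNIT = 1.0  # hypothetical cost per capacity unit per period
CLOUD_UNIT = 2.0   # hypothetical premium for on-demand capacity

steady = [90, 100, 95, 100, 92, 98]  # predictable workload
spiky = [10, 10, 100, 10, 10, 10]    # one sharp peak

for name, demand in (("steady", steady), ("spiky", spiky)):
    fixed = fixed_capacity_cost(demand, ONPREM_UNIT)
    usage = pay_per_use_cost(demand, CLOUD_UNIT)
    winner = "fixed capacity" if fixed < usage else "pay-per-use"
    print(f"{name}: peak/avg={peak_to_average(demand):.2f}, "
          f"fixed={fixed:.0f}, pay-per-use={usage:.0f} -> {winner}")
```

In this simplified model, pay-per-use wins only when the peak-to-average ratio exceeds the unit price premium (here 2.0). A flat, predictable workload never gets there.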
The Overlooked Middle Path: Cloud-Like On-Prem
This is where things get interesting.
Today’s on-prem solutions can deliver 80–90% of cloud-like capabilities without the long-term cloud cost burden.
Examples:
- OpenStack for IaaS orchestration
- Ceph for scalable distributed storage
- Many commercial off-the-shelf (COTS) platforms
- Kubernetes distributions (OpenShift, Rancher, Tanzu) running on bare metal
- Software-defined networking and storage
- Full API-driven provisioning
- Cloud-like observability stacks (Grafana, Loki, Prometheus)
I’ve seen on-prem setups and concept designs that provide:
- Self-service provisioning
- Auto-scaling (within cluster boundaries)
- Automated failover
- DR orchestration
- Immutable infrastructure models
- Secure multi-tenancy
So when someone argues that on-prem cannot deliver “cloud experience,” the reality today is far more nuanced.
But Let’s Be Fair: On-Prem Also Has Its Price
On-prem is not free. And not simple. You still need to invest in:
- Data center facility (whether lease or build)
- Power and cooling
- Network backbone
- Physical security
- Spare parts inventory
- Operational workforce
- Long-term lifecycle management
- Modernization cycles
But these costs were already accounted for in our TCO calculations, and on-prem still came out cheaper than cloud.
So What Are We Really Optimizing For?
After reflecting on all of this, I’ve started to reframe the conversation away from “Cloud vs On-Prem.”
Instead:
- What architecture best supports the organization’s long-term strategy, rather than what is currently fashionable?
- Where is the real ROI: not just in cost avoidance, but in capability enhancement?
- Do we need hyperscale elasticity, or do we simply need predictable stability?
- What failure domains are we willing to accept? Global control-plane outages? Local power issues? Network routing failures?
- What is our actual appetite for vendor lock-in?
- How do we ensure technical sovereignty when our most critical workloads depend on someone else’s black box?
Closing Thoughts
I still believe cloud has a big role to play. But recent incidents and our internal financial modeling have taught me something valuable:
Cloud is no longer an assumed “default better option.” It is one architectural choice among many, and each deserves equal scrutiny.
The real challenge today is not choosing where to run workloads. It’s choosing why.
And I think more organizations will start asking that question with greater honesty in the years ahead.
Footnotes
1. AWS Post-Event Summary Reports (PESR), 2023–2024.
2. Microsoft Azure Status History – Global Outage Reports (Entra ID), 2023–2024.
3. Google Cloud Incident Reports – Networking Disruptions, 2023–2024.
4. Gartner – Cloud TCO Modeling and Hybrid Infrastructure Trends, 2023.
5. Uptime Institute – Annual Outage Analysis, 2024.