Migrating to the cloud delivers major cost savings and improved ROI. Moving IT spending to a pay-as-you-go, operational expense (OpEx) model significantly reduces capital expenses (CapEx) and brings other benefits as well. That’s why pay-as-you-go cloud services have become the default for most businesses today.
It wasn’t always that way. Traditionally, IT paid for hardware and budgets were CapEx-oriented. Companies would make a large upfront investment in storage, servers, routers, etc., and then leverage it for years. Everything was planned and monitored. The primary benefit of a CapEx model is stability: you know exactly what your costs will be on an annual basis, if not on a longer timeline. But the predictable costs didn’t always mean predictable results.
The cloud migration trend was accelerated by a significant drop in cloud platform pricing in 2013, led by Amazon Web Services (AWS). Companies began to realize that they could both improve operations and save money by migrating to the cloud. I remember the day that year that pushed the company I worked for to move from on-premises VMware to the public cloud. I was talking to a colleague who told me, “I can’t install the new server for you. We don’t have any storage available and the EMC storage we ordered is still stuck in customs…” The frustration drove us to look for more agile alternatives and eventually led us to migrate to the cloud.
The shift to the public cloud gave companies like mine more flexibility without compromising predictability: you still had a fairly accurate estimate of your costs. For example, EC2/VM instances, which used to make up the majority of cloud cost, have fixed pricing: X USD per VM size/region.
In recent years, that has changed: enter “cloud bill shock.” But why? Most applications today run on Kubernetes, Lambda/serverless, RDS, PaaS, and other pay-per-use resources. How much does auto-scaling a Kubernetes cluster cost? How much are you going to pay for Lambda or for BigQuery? Basically, you have NO idea upfront: it depends on your usage. The OpEx approach has a lot of advantages. It gives modern businesses agility and flexibility, and you don’t pay for what you don’t use. However, it also makes cost forecasting much more challenging.
The solution: visibility, forecasting, and governance
Visibility: It’s very challenging to understand what you’re paying for in pay-per-use models. The FinOps team is charged with mapping spend to business needs, usually based on extensive resource tagging. But traditional tagging is manual, error-prone, and extremely inefficient, so automating the tagging is crucial. Terratag is a great open-source solution that performs recursive tagging for Terraform-based provisioning across your entire set of AWS, Azure, and GCP cloud resources.
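To make the value of tagging concrete, here is a minimal Python sketch that groups spend by a tag and surfaces the untagged remainder. It is only an illustration: the line-item shape (a "cost" number and a "tags" dictionary) and the "team" tag key are assumptions made for this example, not the export format of any particular cloud provider or tool.

```python
from collections import defaultdict


def cost_by_tag(line_items, tag_key):
    """Sum cost per value of `tag_key`; items missing the tag land in 'untagged'.

    `line_items` is a hypothetical list of dicts such as
    {"cost": 12.5, "tags": {"team": "payments"}}.
    """
    totals = defaultdict(float)
    for item in line_items:
        tags = item.get("tags") or {}
        owner = tags.get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    items = [
        {"cost": 120.0, "tags": {"team": "payments"}},
        {"cost": 30.0, "tags": {"team": "payments"}},
        {"cost": 50.0, "tags": {}},
        {"cost": 10.0},
    ]
    print(cost_by_tag(items, "team"))
```

The larger the "untagged" bucket, the less of the bill can be attributed to an owner, which is the gap that automated tagging is meant to close.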
Forecasting: Accurate forecasting is crucial when preparing a pay-per-use cloud budget. Companies need an accurate way to analyze their payment history and usage growth rate to create expense projections. Solutions like CloudHealth, CloudCheckr, and Cloudability offer great forecasting tools to meet these needs.
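As a rough illustration of projecting spend from payment history and growth rate, the Python sketch below compounds the average month-over-month growth observed in past bills. The function name and the sample figures are made up for this example, and commercial forecasting tools use far richer models; treat it as a back-of-the-envelope starting point only.

```python
def forecast_monthly_costs(history, months_ahead):
    """Project future monthly spend from the compound monthly growth rate.

    `history` is a list of past monthly bills, oldest first.
    Returns (growth_rate, projections).
    """
    if len(history) < 2:
        raise ValueError("need at least two months of history")
    if any(cost <= 0 for cost in history):
        raise ValueError("monthly costs must be positive")

    # Compound growth rate between the first and last observed month.
    periods = len(history) - 1
    growth = (history[-1] / history[0]) ** (1 / periods) - 1

    projections = []
    cost = history[-1]
    for _ in range(months_ahead):
        cost *= 1 + growth
        projections.append(round(cost, 2))
    return growth, projections


if __name__ == "__main__":
    growth, projections = forecast_monthly_costs([1000, 1100, 1210], 2)
    print(f"monthly growth: {growth:.1%}, next months: {projections}")
```

A simple compound-growth model like this assumes usage keeps growing at the same pace, which is exactly the assumption that pay-per-use services tend to break, so the projection should be revisited every billing cycle.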
Governance: Theoretically, if you can estimate that new resources are too expensive, you should be able to prevent their provisioning and deployment. Open Policy Agent can help you manage this process with basic rules (for example, “do not provision more than 10 new instances,” or “do not provision extra-large EC2 instances”), and dedicated cost-estimation CLIs can give you business-level governance. Terraform-cost-estimation is a great open-source solution, if you use Terraform.
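The two example rules above can be sketched in a few lines of Python. This is not Open Policy Agent (OPA policies are written in Rego); it is a plain-Python stand-in that only shows the shape of such a check. It assumes the input resembles the JSON produced by "terraform show -json" for a plan (a "resource_changes" list with "type", "address", and "change" fields), and the interpretation of "extra-large" as any size ending in "xlarge" is also an assumption.

```python
MAX_NEW_INSTANCES = 10          # rule 1: cap on newly created instances
BLOCKED_SIZE_SUFFIX = "xlarge"  # rule 2: assumed meaning of "extra-large"


def check_plan(plan):
    """Return a list of human-readable policy violations for a plan dict."""
    violations = []

    # Collect EC2 instances that this plan would create.
    new_instances = [
        rc for rc in plan.get("resource_changes", [])
        if rc.get("type") == "aws_instance"
        and "create" in rc.get("change", {}).get("actions", [])
    ]

    if len(new_instances) > MAX_NEW_INSTANCES:
        violations.append(
            f"plan creates {len(new_instances)} instances "
            f"(limit is {MAX_NEW_INSTANCES})"
        )

    for rc in new_instances:
        after = rc["change"].get("after") or {}
        instance_type = after.get("instance_type", "")
        # "m5.2xlarge" -> size part "2xlarge"
        if instance_type.split(".")[-1].endswith(BLOCKED_SIZE_SUFFIX):
            address = rc.get("address", "<unknown>")
            violations.append(
                f"{address}: instance type {instance_type} is not allowed"
            )
    return violations


if __name__ == "__main__":
    plan = {
        "resource_changes": [
            {"address": "aws_instance.web", "type": "aws_instance",
             "change": {"actions": ["create"],
                        "after": {"instance_type": "t3.micro"}}},
            {"address": "aws_instance.db", "type": "aws_instance",
             "change": {"actions": ["create"],
                        "after": {"instance_type": "m5.xlarge"}}},
        ]
    }
    for violation in check_plan(plan):
        print(violation)
```

Run as a step in the deployment pipeline, a check like this can fail the build before anything is provisioned, which is where governance is cheapest to enforce.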
In addition to visibility, forecasting, and governance, it is crucial to shift cloud cost left and give developers the tools to prevent unnecessary cost increases as early as possible.
Shift left to empower developers to control their cloud budgets
Fifteen years ago, latency issues in production were IT’s responsibility, not the developers’. But APM solutions like New Relic, AppDynamics, and later Datadog made it easy for developers to catch problems early, write latency-conscious code, and fix degradations very early in the process. The same was true for security. Companies like Snyk shifted security left and empowered developers by providing them with tools to detect and fix security problems much faster. The same process needs to happen with cloud cost: developers need tools to understand how their code will affect cloud costs.
The problem is even greater with Infrastructure as Code (IaC), since developers actually write and maintain the infrastructure in their Git repositories. A simple “git push” can lead to a major cost increase, but since developers don’t have the tools to take ownership of the process, they usually don’t take cost into account. That’s why I believe that Infrastructure as Code is forcing a revolution in cost management, just as APM did with latency and Snyk and other dev-first security companies did with security. Developers must have a way to see how each of their deployments affects cloud cost. That’s why env0 (disclaimer: I’m the co-founder and CEO of env0) is focusing on automatic cost management, especially for IaC-based deployments.
To summarize, forget about old-school cloud cost management. The cloud has moved spending from CapEx to OpEx, demanding new solutions. Shifting cloud cost management left by empowering developers and giving them the tools to proactively correlate deployments and cloud costs (rather than leaving it to the IT/Ops teams with the current reactive approach) can prevent cost degradations much earlier in the process and boost overall efficiency.