A perfect storm of spiralling Cloud costs!
1. Before the Cloud
If you ever had the opportunity to venture into a corporation’s own 'Pre-Cloud' datacentre, it would have seemed overwhelming. These rooms, sometimes as large as a football pitch, roared with the noise of a thousand servers. The reality was that these huge rooms were quite manageable. Everything was physical, visible and had an implicit lifecycle, i.e. when a server reached the end of its life it would be pulled out to make way for something else. This meant costs were predictable and known up front.
2. The first shift to the Cloud wasn’t a big change
It has been argued that the Cloud really is just the same servers but in someone else’s datacentre. In many cases during the first phase of Cloud adoption this was true. Many companies simply migrated software on servers in their datacentre to new servers in the 'Cloud'.
3. The second phase and no more servers!
This second phase of Cloud adoption involves using the Cloud as a platform of services rather than as another way to host a server. When we use the Cloud as a platform of services, often referred to as 'Platform as a Service' or 'Serverless', we break the link between the functionality we consume and the server it sits on.
Before the Cloud, and even during the first phase of the Cloud, the performance of a database was significantly influenced by the specification of the server it sat on. The infrastructure and DBA teams would use their specialist knowledge to set up and manage multiple servers that could meet the performance and availability requirements of an organisation. In the new Serverless, Cloud-native world all the application engineering team need do is select the performance level on a dashboard and tick a couple of other boxes. Physical network infrastructure and countless other services are now just listings on a screen.
Software applications that would have once been installed on a handful of servers are now being written with a serverless architecture. This means component parts of an application, right down to individual functions, are separately written, deployed, managed, and billed. Not running all this functionality on servers is a significant change. Without the servers to manage, the roles of the infrastructure teams who once oversaw them are significantly diminished.
4. The explosion in the number of components and complexity
In this second phase of Cloud adoption the delivery teams, including DevOps, data scientists and development, all separately provision and manage the Cloud resources they need. Often when you log into the AWS, Azure or Google Cloud dashboards, new services are available. Sometimes provisioning a single service automatically deploys tens of obscurely named, expensive components. The challenge of keeping track of what’s running and relevant is significant. It can be almost impossible for the central support or infrastructure teams to have an appreciation of what is going on.
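To make the tracking problem concrete, here is a minimal Python sketch that totals monthly cost per owning team and shows how much spend has no owner at all. It assumes you already have an inventory export as a list of records; the field names (`tags`, `owner`, `monthly_cost`), the resource names and the figures are all hypothetical and do not come from any particular Cloud provider's API.

```python
from collections import defaultdict


def cost_by_owner(resources):
    """Sum monthly cost per 'owner' tag; untagged spend goes to 'UNOWNED'."""
    totals = defaultdict(float)
    for res in resources:
        # A missing or empty owner tag means nobody is clearly accountable.
        owner = res.get("tags", {}).get("owner") or "UNOWNED"
        totals[owner] += res["monthly_cost"]
    return dict(totals)


# Hypothetical inventory export; names and costs are illustrative only.
inventory = [
    {"name": "build-runner", "tags": {"owner": "devops"}, "monthly_cost": 100.0},
    {"name": "artifact-store", "tags": {"owner": "devops"}, "monthly_cost": 50.0},
    {"name": "training-cluster", "tags": {"owner": "data-science"}, "monthly_cost": 400.0},
    {"name": "vm-a1b2c3", "tags": {}, "monthly_cost": 50.0},
    {"name": "disk-x9y8z7", "monthly_cost": 25.0},
]

totals = cost_by_owner(inventory)
for owner, cost in sorted(totals.items()):
    print(f"{owner}: {cost:.2f} per month")
```

The 'UNOWNED' line is the interesting one: it is the spend a central team would struggle to explain without going back to whoever provisioned it.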
5. Controlling the Cloud
Methodologies around 'DevOps' and 'Infrastructure as Code' do help in managing this new complexity, but they do not in themselves deliver operational efficiency. Implementing overarching Cloud policies and controlled landing zones does lead to better efficiency but can restrict the freedoms engineers have. The challenge for organisations is in finding the sweet spot on the sliding scale between enforcing efficiency and restricting creativity and productivity.
Proofs of concept and other non-production components are often manually deployed and easily forgotten. The chances of having countless inefficient or redundant things running in dark corners of your Cloud have increased significantly. When these zombie workloads existed in the old world, the lifecycle of the servers they were running on ultimately meant they would not be zombies forever. The other reason zombies were not such a problem in the pre-Cloud era was that the infrastructure they were running on would generally already have been paid for.
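As a rough illustration of how zombie candidates might be surfaced, the Python sketch below flags non-production resources that have shown no activity for a set number of days and ranks them by monthly cost. Everything here is an assumption made for the example: the `Resource` fields, the 30-day threshold, the environment labels and the sample inventory are hypothetical, and in practice 'last active' would have to be derived from whatever usage metrics your provider exposes.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Resource:
    name: str
    environment: str      # e.g. "production", "dev", "poc" (hypothetical labels)
    monthly_cost: float
    last_active: date     # last day any meaningful usage was seen


def find_zombie_candidates(resources, today, idle_days=30):
    """Return idle non-production resources, most expensive first."""
    cutoff = today - timedelta(days=idle_days)
    candidates = [
        r for r in resources
        if r.environment != "production" and r.last_active < cutoff
    ]
    return sorted(candidates, key=lambda r: r.monthly_cost, reverse=True)


# Hypothetical inventory; names, dates and costs are illustrative only.
inventory = [
    Resource("checkout-api", "production", 900.0, date(2024, 5, 30)),
    Resource("ml-poc-cluster", "poc", 1200.0, date(2024, 1, 10)),
    Resource("demo-database", "dev", 300.0, date(2024, 2, 1)),
    Resource("feature-test-vm", "dev", 80.0, date(2024, 5, 28)),
]

zombies = find_zombie_candidates(inventory, today=date(2024, 6, 1))
monthly_waste = sum(r.monthly_cost for r in zombies)

for r in zombies:
    print(f"{r.name}: idle since {r.last_active}, {r.monthly_cost:.2f} per month")
print(f"Potential monthly saving: {monthly_waste:.2f}")
```

A list like this is only a starting point for a conversation with the team that built the workload; an idle resource is a candidate for removal, not proof that it is safe to delete.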
6. The second phase of Cloud adoption is both an opportunity and a risk to be managed
Well, this all sounds very gloomy! So, should we retrench from the Cloud and return to the comfort of our own racks of servers and reinstate the role of the traditional infrastructure team gate guardians?
The speed with which requests to IT from the business can now be met increases opportunity and competitive advantage. But 'buyer beware': we need to recognise the cultural shift away from the capital expense procurement model, in which efficiency was ensured by specialist centralised teams. We now have an ongoing Cloud operational expense where the people most able to ensure technical efficiency are the frontline teams who engineered the functionality in the first place. To some degree these people need to be kept engaged to ensure long-term efficiency, regardless of the support model.
FinOps as a discipline already goes a long way to align the business, financial and technical stakeholders so value can be assessed on an ongoing basis. This does mean the engineering teams can be more easily challenged, but they are still very much being relied on to mark their own efficiency homework and are not necessarily given the time to do so. FinOps personnel are great at finding savings, including through commitment purchasing, which is often the biggest single saving to be had. However, savings through resource efficiency and simple architectural choices can be of a similar magnitude and are often not realised to anything like their full extent - but no one knows that!
We need a model where resource savings are identified and realised on an ongoing basis across all Cloud resources that attract cost. Remember, a cost in the Cloud can be a cost forever! Some of the most popular FinOps software solutions are accountancy focused, and so some larger FinOps teams have taken to employing technical personnel who just do technical analysis and engage with engineering. This can work well but is expensive! Other FinOps personnel are more finance focused. In this case, better analysis and remediation-focused tooling, like our own Altocapa, covering the wide range of Cloud resources can aid the conversations and follow-up with engineering teams.
So, the Cloud is not 'just the same servers but in someone else’s datacentre' and it requires a top to bottom cultural shift to make the most of it. FinOps as a discipline goes a long way to address this and covers many bases. The area that does need more focus is architectural and resource efficiency and how to engage to make sure it happens.