My Journey to Efficient Computing, and the Founding of CloudNatix


I am often asked why I started CloudNatix when I talk to customers, investors, and prospective hires. After retelling my story for a couple of years now, I think it is a great time to pick up a pen (or keyboard) and write it down. It’s a story that led me to realize that the future of large-scale cloud management sat at the intersection of several things I’d learned along my career journey.

For three decades I’ve dedicated my career to making computers do more. As a young engineer writing core Unix and Linux kernel subroutines, I learned to write the fastest code possible in the smallest memory footprint. At Intel, the performance engineering teams I ran obsessed over extracting every last clock cycle of performance, at every level of the hardware and software stack. The Hugepages feature in the Linux kernel was one of many tangible results of my work. Collectively, the computing capabilities we developed continue to enable business decisions worth billions of dollars every year.

In the performance-centric Intel lab environment, we had access to bleeding-edge chip and hardware technologies. We ran compute-intensive workloads that pushed these brawny machines to their theoretical limits in short windows of time. Of course, the servers sat idle the rest of the day and night, clearly not the best cost-performance tradeoff. In our quest to eke out the next speed improvement, utilization simply did not matter.

Then, in 2006, I joined a young internet company named Google. Although this was years before the terms “hyperscaler” and “DevOps” were coined, the sheer scale of Google’s cloud infrastructure was already extraordinary. And the proprietary management methodologies Google had developed were beyond impressive. 

I began to investigate how efficiently this massive grid of computing resources could be operated. Our cluster management team gathered data on average utilization across data centers globally. Our analysis confirmed what we expected: a small number of marquee applications had been tuned impressively by skilled engineers to run in a highly performant AND efficient manner. For the most part, however, the rest of the data center fleet had embarrassingly low utilization. This was my first window into the amount of infrastructure waste that even top-notch engineering teams frequently accept.
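
The underlying measurement is simple, even if gathering it at global scale is not. Here is a minimal sketch of the fleet-level question we were asking, with made-up per-machine numbers:

```python
# How much of the capacity we paid for is actually being used?
# The sample data below is invented purely for illustration.
machines = [
    # (allocated_cores, average_cores_used over the day)
    (64, 58.0),  # a well-tuned marquee service
    (64, 6.5),   # a typical box, mostly idle
    (32, 4.0),
    (32, 2.5),
]

used = sum(u for _, u in machines)
allocated = sum(a for a, _ in machines)
print(f"fleet utilization: {used / allocated:.0%}")  # ~37% for this sample
```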

Once we learned to quantify infrastructure waste, our challenge at Google quickly became: how can we eliminate it? The answer seemed straightforward, even if the technologies did not yet exist: 

  1. share the underlying infrastructure among different applications, and 

  2. enable apps to run with soft boundaries and no resource contention (VMs were not an option, for a couple of different reasons); the sketch below shows what such a boundary looks like in practice.
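
To make the second point concrete: on a modern Linux kernel, this kind of soft boundary can be carved out through the cgroup v2 interface. A minimal sketch, assuming root privileges, a cgroup v2 mount at /sys/fs/cgroup, and the cpu and memory controllers enabled for the parent group; the paths and limits are illustrative:

```python
import os

CGROUP = "/sys/fs/cgroup/demo-app"  # hypothetical group for one application
os.makedirs(CGROUP, exist_ok=True)

# A proportional CPU share rather than a hard cap: under contention this
# group gets its weight's worth of CPU, but it may soak up idle cycles
# when its neighbors are quiet. That is the "soft" in soft boundary.
with open(os.path.join(CGROUP, "cpu.weight"), "w") as f:
    f.write("200")  # twice the default weight of 100

# A memory throttle threshold (memory.high) instead of a hard kill limit.
with open(os.path.join(CGROUP, "memory.high"), "w") as f:
    f.write(str(2 * 1024**3))  # 2 GiB

# Move the current process into the group; its children inherit the
# limits, so many applications can share one machine without stepping
# on each other.
with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))
```

Unlike a VM's fixed carve-out, a weight-based share is work-conserving: idle cycles flow to whichever group needs them, which is what makes high utilization compatible with performance.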

This is precisely what led me to co-invent Linux Containers at Google. Over the next few years we created—and perfected—techniques for sharing and managing compute infrastructure at scale. Without sacrificing application performance, my teams developed highly automated solutions for increasing machine utilization. This in turn decreased our hard-dollar compute costs dramatically. The Finance teams were understandably thrilled.

Meanwhile, a funny thing happened on the journey to cloud cost reductions: the better we got at saving money on compute, the more our internal business partners began to value the increased speed at which they could now innovate. Containerization directly accelerated the pace at which business units could innovate and deploy new applications to drive new sources of revenue. Indeed, the road to deploying Containers globally at Google was long and arduous, but the advantages in cost and speed meant there was no looking back.

I spent about 11 years at Google learning not only how large-scale infrastructure works but also how very large application stacks like Display Ads get managed. For five of those eleven years I managed the backend infrastructure of Display Ads. During that time, we improved the performance and utilization of our system by a very large percentage. As we increased efficiency, we sped up the pace of new product feature releases, which in turn had a direct, immediate impact on Google’s top-line revenue.

I’m very proud of my teams that pushed the envelope in building great system-level technologies for business applications, as we learned in real time what those applications need to innovate faster and more efficiently. Today, it is well understood that the success of hyperscale companies derives in large part from the sophistication with which they manage their infrastructure operations.

By early 2019, it became clear to anyone paying attention that Cloud spend (and waste) was emerging as our industry’s greatest challenge. Analysts estimated the total waste at $50 billion or more. In my own conversations with executives, I could see that the financial and operational pains were already impacting businesses in very tangible ways. The public examples of Dropbox and Nvidia reversing course and moving from Cloud back to on-premise highlighted the cost disadvantage of doing business in Cloud over the long run. As a side note, I believe that Dropbox and Nvidia are more the exception than the rule: their usage of computing resources is highly asymmetrical and tied to one specific dimension (storage for Dropbox, GPU-specific compute for Nvidia). This will be a topic for another blog.

Think about those enterprises whose application architectures originated on-premise (basically any company more than a decade old). These organizations found themselves in a hybrid or heterogeneous world, with an increasingly complex mix of on-premise and one or more Clouds. The complexity is not much less even for companies born in the Cloud: managing Cloud operations across a couple of regions and a handful of clusters is, unfortunately, no easy task. As if managing these very different environments from a technical perspective were not hard enough, the need for responsible financial management further exacerbated the problem. On-prem is mostly CAPEX, whereas Cloud is mostly OPEX; essentially, owning vs. leasing. The daily challenges of the two scenarios are mind-numbingly complex and widely divergent.
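
To see why the owning-vs.-leasing framing matters, a back-of-the-envelope sketch helps. Every number here is a hypothetical placeholder, and the model (cloud cost scaling linearly with utilization) is a deliberate simplification:

```python
# Hypothetical amortized CAPEX vs. pay-for-what-you-use OPEX.
SERVER_CAPEX = 12_000                # owned server, amortized over 3 years
ON_PREM_MONTHLY = SERVER_CAPEX / 36  # ~$333/month, busy or idle
CLOUD_MONTHLY_FULL = 700             # comparable instance, fully utilized

for utilization in (0.10, 0.50, 0.90):
    # On-prem you pay the amortized cost no matter what; in the Cloud you
    # can, in principle, pay only for the fraction you actually use.
    cloud_cost = CLOUD_MONTHLY_FULL * utilization
    print(f"util={utilization:.0%}  on-prem=${ON_PREM_MONTHLY:,.0f}/mo"
          f"  cloud=${cloud_cost:,.0f}/mo")
```

At low utilization the elasticity of the Cloud wins; at sustained high utilization, owning wins, which is exactly the regime that Dropbox and Nvidia occupy.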

This is a good moment to pause and point out that the limitless, on-demand world of Cloud is truly a double-edged sword: thousands of new configs at different price points (some changing dynamically) are available to you and your developers instantly, with no practical CSP tools, nor even incentives, to manage the proliferation and contain costs.

While all this was happening, another tectonic shift was beginning in parallel. Companies were embarking upon widespread adoption of Containers and Kubernetes. As cloud-native application development began in earnest, a new IT organizational model appeared. Centralized enterprise architecture teams struggled just to manage old legacy-style clusters, while silos of DevOps engineers emerged to manage Cloud Native clusters that sprouted organically in far-flung business units.

This shift led to entirely new challenges in the executive suite, as Finance, IT, and the Business Units themselves struggled just to determine the true costs of these technology choices, to say nothing of optimizing their spend. A CFO colleague shared with me one specific example, where he challenged his engineering team to cut their rapidly accelerating Cloud spend. Predictably, his request for cloud information generated a massive amount of busy work, as DevOps engineers debugged homegrown scripts and analysts constantly revised error-prone spreadsheets. The task proved so difficult that, in the end, the team simply gave up and moved on to other priorities. Not exactly a project that a finance leader (or any executive) would be proud of. It’s an instructive example: the toil required to pursue cloud optimization manually can cost more than the potential benefits. Automation is the only solution.
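
Even a first step toward that automation changes the conversation. As a minimal sketch, a short script can pull last month's spend grouped by a cost-allocation tag straight from the AWS Cost Explorer API; the tag key "team" and the dates are assumptions, and the tag must already be activated for cost allocation in the billing console:

```python
import boto3

ce = boto3.client("ce")  # AWS Cost Explorer

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2021-01-01", "End": "2021-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],  # hypothetical tag key
)

# One line of spend per team, instead of a hand-maintained spreadsheet.
for group in resp["ResultsByTime"][0]["Groups"]:
    team = group["Keys"][0]  # e.g. "team$checkout"
    cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{team}: ${cost:,.2f}")
```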

The deeper I dove into the subject of Cloud usage and cost, or simply efficiency, the more I realized that the vast majority of enterprises are running blind about the efficacy of their spend. That is, until their monthly AWS or Azure bill becomes the top line item in the Expenses section of their financial statements. By that time, cloud operations have become sufficiently complex that the simplistic CSP tools are of no practical use. One could even argue that they actually increase the pain.

Where does this all leave us today? Simply put,


Today’s urgent need to make computing infrastructure more efficient, and more cost-effective, mirrors the challenge we faced at Google circa 2006, in the early days of the first truly hyperscale internet company.


I started CloudNatix in 2019 to provide a more efficient computing environment to enterprises. The complexity of modern computing infrastructures, running heterogeneous workloads from legacy to Cloud native (and everything in between), demands an end-to-end developer and DevOps management platform. This complexity of operations is the root cause of Cloud waste. Our vision is to build a platform that enables DevOps engineers, SREs, and developers to easily operate at scale. At the same time, through a single management pane, the platform should provide cost, waste, and efficiency data across any organization or any application.

We are starting this journey by providing one-click visibility into the efficiency metrics of Cloud operations across different business units, teams, and projects. Our core ML-based recommendation engine provides key insights to our users on how to reduce their spend. The control portion of the platform allows developers and SREs to identify, debug, and operate different clusters in a single, cohesive fashion. And finally, we very strongly believe that our Auto-pilot engine will dramatically improve those efficiency metrics.
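
To make "insights on how to reduce spend" concrete, here is a deliberately simplistic stand-in for the kind of rightsizing recommendation such an engine surfaces. Our actual engine is ML-based; this percentile rule, the headroom factor, and the usage samples are all illustrative assumptions:

```python
import statistics

def recommend_cpu_request(samples_millicores, headroom=1.15):
    """Recommend a CPU request: observed 95th-percentile usage plus headroom."""
    p95 = statistics.quantiles(samples_millicores, n=20)[18]  # 95th percentile
    return int(p95 * headroom)

# Invented usage samples (millicores) for a workload that requested 4000m.
samples = [220, 180, 310, 400, 260, 290, 350, 240, 300, 275,
           265, 330, 410, 230, 280, 295, 320, 250, 270, 305]

print(f"current request: 4000m, recommended: {recommend_cpu_request(samples)}m")
```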

As important as it is for me to solve the big computing-efficiency challenge plaguing the whole industry, it is equally important to build a team that is humble, respects each other, and is transparent in its decisions. This enables us to stay focused on building world-class products and to have fun doing the heavy lifting of establishing CloudNatix as the leader. If there is any silver lining to Covid, it is that the world has truly flattened and we are able to hire the smartest talent from across the whole world. I’m very proud of the team that we have built so far. Sitting in a meeting surrounded (on screen) by these extremely talented team members makes every day a rewarding experience. And we have several openings in our sales and engineering organizations.

Feel free to reach out to me if you have any comments. In the meantime, please sign up for a demo here.