As companies grow, visibility into cloud infrastructure, monitoring the uptime of services, and collecting and analyzing data from a wide variety of applications all become critical to the success of the organization and its customers. An unfortunate and often unavoidable side effect of this growth, however, is that it puts stress on existing systems, precipitating the need for updated technologies that can grow and scale over time.
At Acquia, years of existing customer growth and new customer acquisitions required us to increase the size of our fleet on an almost hourly basis. From a business perspective this was phenomenal -- an expected side effect of a thriving business -- but from an operational perspective, it was an early indication that we would inevitably reach a size at which our original open-source monitoring service would no longer scale to meet our needs.
With so many servers, services, and applications to monitor, we first started down the path of trying to build our own monitoring services four years ago. While that project was in flight, we also carved out time and resources to build “temporary” solutions to help us filter and handle the growing number of alerts generated by our existing monitoring systems. What we discovered in the years that followed, to put it simply, was that it was exceptionally difficult to dedicate the necessary time, resources, and expertise to such an initiative when there were so many other needs and problems to address across our engineering organization. As a result, we eventually had to stop and ask ourselves, “Is this something we should even be doing?”
By the time we asked ourselves that question, we were no different from many other companies our size: we had more than a dozen different systems across various teams for monitoring and analysis; there were no central management or controls for those systems, leading to inconsistent metrics and interpretations of data across products; and new teams repeatedly found themselves spending weeks on the evaluation, implementation, customization, and maintenance of new monitoring services, all while our primary products and teams continued to use our imperfect legacy solution.
The resulting pain was felt at all levels of Acquia, with teams across the organization and around the world experiencing toil and blockers due to monitoring service limitations and the bandwidth constraints they caused for our engineers. They simply could not find or interpret the data they needed with consistency, efficiency, or ease, and at the same time, we could not even provide our customers with all of the essential server health metrics they needed to optimize the uptime and performance of their applications.
In short -- we needed a new solution.
Choosing the Right Monitoring Service
We didn’t just need a new monitoring service -- we needed one that could handle all of our complex use cases as quickly as possible and on a tight budget. With those things in mind, we identified three possible paths we could take:
- Build a new monitoring service from scratch, in house;
- Take an existing open-source solution and customize it to suit our own needs; or,
- Go the SaaS route and find a company/product that excels in this arena, allowing us to focus on what we do best while they focus on doing what they do best.
All three options had positive and negative attributes. Although Option 1 would allow us to address all of our needs precisely the way we wanted to, we estimated that it would take the most time and money to accomplish, and we would need to permanently dedicate engineering resources to maintaining and improving whatever we ended up building. Option 2 would require less effort than Option 1, but it would still require us to maintain the services and be responsible for upgrading them over time. Option 3, however, represented a current industry trend, where more and more companies are moving away from custom-built, in-house services in favor of plug-and-play solutions.
Option 3 seemed to make the most sense for us. A SaaS offering would provide us with a readily available service with 24/7 support, a guarantee of new features and innovations on a regular cadence, and the ability to customize the service to suit our needs.
Making that decision was the easy part -- figuring out which SaaS monitoring service to entrust with a fleet as large as Acquia’s was a great deal more difficult. When it came to choosing a SaaS monitoring service, we did not want to limit our focus to the technical features and capabilities of an offering -- we also wanted to look at the company behind the services. Everyone claims they can solve your problems, but how do you know who truly is the best fit for your organization?
So when evaluating SaaS companies, we considered the following questions:
- How would we implement this solution, from install and initial customization through to feature configuration?
- How much work would it be to maintain the service long-term?
- What limitations does the service have, and are they deal breakers?
- What is the vendor’s support plan and SLA?
- Are they a startup or an established company?
- What are other people saying about them? Are they often recommended?
- What is the cost?
In our evaluation of more than a dozen possible solutions, we narrowed our options down to three companies with the features, reputations, and price ranges we were looking for. From there, we needed to look at what set each company apart from the others. With more than 15,000 instances in our fleet, our primary concern was that none of these services would be able to ingest the volume of data (millions of data points per minute) we would be sending. So when two of the three vendors were willing to let us test their services on our entire fleet for free, that showed us how confident they were in their services.
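
To give a sense of why ingestion volume was the main worry, here is a rough back-of-the-envelope sketch. Only the fleet size comes from the figures above; the number of metrics per instance and the reporting interval are assumptions picked purely for illustration, not Acquia's actual configuration.

```python
def datapoints_per_minute(instances, metrics_per_instance, interval_seconds):
    """Estimate how many data points a fleet sends to a monitoring service per minute."""
    samples_per_minute = 60 / interval_seconds
    return instances * metrics_per_instance * samples_per_minute


FLEET_SIZE = 15_000                  # "more than 15,000 instances" (from the text)
ASSUMED_METRICS_PER_INSTANCE = 100   # hypothetical: metrics reported by each instance
ASSUMED_INTERVAL_SECONDS = 15        # hypothetical: seconds between reports

estimate = datapoints_per_minute(
    FLEET_SIZE, ASSUMED_METRICS_PER_INSTANCE, ASSUMED_INTERVAL_SECONDS
)
print(f"{estimate:,.0f} data points per minute")
```

Under these assumed values the estimate lands in the millions of data points per minute, the same order of magnitude described above. Changing either assumption scales the result linearly, which is why a full-fleet trial is a more reliable test than a small pilot.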