
Life in the Cloud

While I love life in the Internet Cloud, it was a gray, rainy few days at the end of last week.

While it's easy to point fingers at Amazon Web Services, I'm focused on how Acquia can do better. Our goal is to deliver fantastic end-to-end service and support for our customers' web sites irrespective of problems in the underlying infrastructure. That's for us to worry about, mitigate, and repair, not you.

The View from Acquia
We partnered with Amazon as the leading provider and innovator of Cloud infrastructure. But more importantly, we designed our high-availability architecture to quickly and seamlessly recover from AWS infrastructure problems. Single server failures cause no site downtime, and we've successfully recovered from hundreds of such failures. We Acquians pride ourselves on second-to-none engineering and operations.

However, a little before 4 AM EDT on Thursday, a major incident at one AWS data center rendered most storage inaccessible, which in turn made hundreds of our servers unusable. Acquia had planned for this contingency by backing up all data to multiple data centers. Unfortunately, a second AWS failure made it impossible to access those backup volumes from any data center. Aargh! The impact was felt most keenly by our Drupal Gardens customers, with thousands of sites unavailable. While Dev Cloud was unscathed, the outage impacted 1% of our Managed Cloud customers. Our team worked around the clock to restore service: migrating servers to other regions, finding crafty ways to restore backups, and keeping in constant contact with customers. By the end of Friday (midnight!), we'd recovered all services. Thanks to the redundancy built into our architecture, we lost virtually no customer data.

We're pressing Amazon to do better. For many months, they've promised us EBS storage improvements, and we look forward to seeing those. They must also improve their transparency. AWS is too secretive both in a crisis and on sunnier days. AWS is not a bookseller whose back-office operations have little impact on its customers.

But I don't think that's enough. We're taking action now to redistribute the Drupal Gardens servers among more data centers to minimize the impact of a similar outage, and we're beginning to extend our backup infrastructure to distribute the data to multiple geographic regions. And Acquia will continue to make significant investments in people, technology, and processes to ensure the most worry-free web site hosting available.

The Cloud View
None of this has dampened my enthusiasm for the Cloud. I've managed many data centers over the years, from my basement server rack, to class A facilities with redundant everything, to colo, VPS, and managed hosting. In this era, it simply doesn't make sense, economically or technically, for most organizations to build their own data center and hire and train expert sysadmin staff. The economies of scale in both hardware and people will drive most businesses and organizations to the cloud over the next few years. The important lesson we can never learn too well is that "everything breaks". And nowhere is that more true than on the rapidly evolving Internet. It's our job at Acquia to build resilient architectures that can prevent downtime due to failures, even major ones.

It can be challenging to ensure seamless service and mitigate the Cloud's risks for our customers. I think what keeps all of us going through the long days and nights are the incredible web sites, both big and small, that our customers create. We are working 24x7 to meet and exceed the high standards you're setting by using Drupal to create incredible web experiences.


Posted on by Simon Gardner.

I'm curious when you say "we lost virtually no customer data". Would you be able to reveal what data you did lose?

Posted on by Barry Jaspan.

In Drupal Gardens, we ended up completely losing one pair of HA database disks, forcing us to restore that one database cluster from backup (in all other cases, we lost at most one disk from an HA pair, so no recovery from backup was necessary because we still had an up-to-date instance). We make database backups every hour, so for that database cluster we lost up to one hour of data (I do not know offhand exactly how long it was from the last backup to the crash, so I don't know the exact amount of time that was lost). That database cluster served about 1,200 Drupal Gardens sites.

So, bottom line, we lost between 0 and 60 minutes of data for about 1,200 (out of over 45,000) Drupal Gardens sites.
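
For anyone who wants to check that arithmetic, here is a rough sketch in Python. The function and constant names are purely illustrative and are not taken from any Acquia tooling; the figures are the approximate ones quoted above, and since the total is "over 45,000" the computed share is an upper bound.

```python
def affected_fraction(affected_sites: int, total_sites: int) -> float:
    """Share of all sites that were served by the restored database cluster."""
    return affected_sites / total_sites


def worst_case_loss_minutes(backup_interval_minutes: int) -> int:
    """With a backup every N minutes, at most N minutes of writes can be lost."""
    return backup_interval_minutes


# Approximate figures quoted in the comment above (assumed, not exact).
AFFECTED_SITES = 1200
TOTAL_SITES = 45000  # "over 45,000", so the real share is slightly lower
BACKUP_INTERVAL_MINUTES = 60  # hourly database backups

share = affected_fraction(AFFECTED_SITES, TOTAL_SITES)
loss_window = worst_case_loss_minutes(BACKUP_INTERVAL_MINUTES)

print(f"Affected share of sites: at most {share:.1%}")
print(f"Data loss window per affected site: 0 to {loss_window} minutes")
```

In other words, roughly one site in forty was exposed to any data loss at all, and for those sites the loss was bounded by the backup interval.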
