Venti Drupalccino coming right up! Acquia Hosting's automated provisioning system

Acquia Hosting is designed to be a high-performance, highly reliable, highly scalable, and highly manageable Drupal hosting infrastructure. Our goal was to be able to quickly provision small and large sites, meet the enormous requirements of Drupal Gardens, and keep all of it running reliably with a small engineering and operations team. It is easy to say that "the cloud solves all of that," and indeed cloud technology provides substantial benefits. However, in my recent talk at Drupalcon, Challenges of Hosting Drupal on AWS, I described a number of problems we had to solve to accomplish our goals on top of the cloud.

In that talk, I also mentioned that we had recently provisioned servers for a customer expecting 20 million unique visitors on day one in just a few minutes. The site went live last week, the entire event went off without a hiccup, and some details are available. Today, I thought I'd describe a bit about how Acquia Hosting allows us to provision systems like this so quickly and reliably.



We have a central Acquia Hosting Controller server (also called the "hosting master") with a database describing everything about the Acquia Hosting environment: servers, clusters, sites, SVN repositories, backups, which sites are deployed on which servers, which users are allowed to access each SVN repo and each server, what their SSH keys are, etc. The Controller exports the "Hosting API" for managing this database, and we have a command-line tool called hosting-provision ("h-p" for short) that performs various actions.
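
The post doesn't describe what the Hosting API looks like on the wire, so the following is a purely hypothetical sketch (the host name, endpoint, and authentication are all invented). The point is only that hosting-provision is a thin client and that the Controller's database remains the single source of truth:

# Purely hypothetical; not the real Hosting API. The host name, path, and token
# are invented, only to illustrate "CLI as a thin client of the Controller."
curl -s -u ops:"$HOSTING_API_TOKEN" \
    "https://hosting-master.example.com/api/servers"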

To provision new servers for a customer named "acme_systems", we use h-p to allocate and organize them into functional clusters:

% hosting-provision --server-allocate web 2 --tag acme_systems
Servers web-37, web-38 allocated.
% hosting-provision --server-allocate dbmaster 2 --db-allocate-cluster --tag acme_systems
Servers dbmaster-39, dbmaster-40 allocated.
Database cluster 8 allocated.
% hosting-provision --server-allocate bal 2 --bal-allocate-cluster --tag acme_systems
Servers bal-41, bal-42 allocated.
Balancer cluster 9 allocated.

This shows that we allocated two web nodes, two database servers in a dual-master configuration, and two balancers in a failover cluster. At this stage we could also have specified the size of machines we wanted (EC2 instance type), geographic location (AWS region or availability zone), amount of storage space (EBS volumes), etc. The tag "acme_systems" is just a simple text label that allows other hosting-provision commands to operate on servers as a group.

Once the servers are allocated, we create a new customer site (named "acme1" in this example) and assign it to the servers we tagged for this customer:

% hosting-provision --create-site acme1 --tag acme_systems
Site "acm1" created.

So far, all we've done is run hosting-provision to use the Hosting API to configure the Acquia Hosting Controller's database to represent how we want the environment to look. Now the fun begins! We press the big red "launch!" button and stand back:

% hosting-provision --launcher --tag acme_systems

At this point our automated process takes over:

  1. The h-p launcher command asks the Controller for all allocated-but-unlaunched servers for the tag acme_systems.
  2. For each unlaunched server, the launcher creates the EC2 instance, waits for it to boot, and ssh's into it to install, configure, and start puppet (a rough sketch of this step follows the list).
  3. Puppet contacts our puppetmaster (which happens to run on the Controller) and builds the machine: installs files, packages, and cron jobs, mounts EBS volumes, and performs other fairly static configurations.
  4. Nagios, our monitoring system, learns about the new servers from the Controller and begins monitoring them based on each server's type (puppet installs the correct Nagios client on every server).
  5. Our backup system learns about the new servers from the Controller and begins backing up the databases, SVN repositories, and filesystem data they contain.
  6. Our colo mail server learns about the new servers from the Controller and configures itself to accept relayed mail from them.
  7. After each hosting server is configured by puppet, site configuration begins. The Controller tells each machine what it should be doing (rough sketches of these steps also follow the list):
    1. "You are web-37, so you should have site acme1 deployed." web-37 will find the SVN repository for acme1, check it out, create an Apache virtual host, and reload Apache.
    2. "You are bal-41, so you should be balancing acme1 over web nodes web-37 and web-38." bal-41 will construct an nginx virtual host to load balance over the appropriate web nodes and reload nginx.
    3. "You are dbmaster-39 (or 40), so you should be in dual-master replication with dbmaster-40 (or 39)." The two database masters initialize replication between them, treating dbmaster-39 as the "active" master.
    4. "You are dbmaster-39, so you should have a database named acme1." dbmaster-39 will create the database (and dbmaster-40 will get it via replication).
  8. After about five minutes, the servers are up and the site is being served.
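
To make step 2 concrete, here is a rough sketch of the kind of work the launcher does for each unlaunched server. It uses today's AWS CLI, a stock Ubuntu AMI, and a plain apt/puppet bootstrap purely for illustration; the launcher's actual tooling, AMI, and puppet invocation are not described in this post and certainly differ in detail.

#!/bin/bash
# Illustrative sketch of launcher step 2 (not Acquia's actual code).
# Assumptions: the modern AWS CLI, an Ubuntu AMI reachable over SSH as "ubuntu",
# and a puppetmaster at "puppet.hosting-master.example.com" (an invented name).
set -e
AMI="ami-12345678"                              # placeholder AMI ID
TYPE="m5.large"                                 # placeholder instance type
PUPPETMASTER="puppet.hosting-master.example.com"

# Create the EC2 instance and wait for it to boot.
INSTANCE_ID=$(aws ec2 run-instances --image-id "$AMI" --instance-type "$TYPE" \
    --key-name hosting-ops --query 'Instances[0].InstanceId' --output text)
aws ec2 wait instance-running --instance-ids "$INSTANCE_ID"
HOST=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
    --query 'Reservations[0].Instances[0].PublicDnsName' --output text)

# SSH in, install puppet, and point it at the puppetmaster; from here on,
# puppet builds the machine (packages, files, cron jobs, EBS mounts, and so on).
ssh -o StrictHostKeyChecking=no ubuntu@"$HOST" "
    sudo apt-get -y update &&
    sudo apt-get -y install puppet &&
    sudo puppet agent --server $PUPPETMASTER --waitforcert 60 --onetime --verbose
"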
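
Steps 1 and 2 of the site-configuration phase boil down to "check out the code and write the right virtual host." A minimal sketch under invented assumptions (the repository URL, document root, internal host names, and ports are made up, and the configuration Acquia actually generates is certainly richer than this):

# What web-37 conceptually does for step 1: check out the site and serve it.
svn checkout https://svn.hosting-master.example.com/acme1/trunk /var/www/acme1
cat > /etc/apache2/sites-available/acme1.conf <<'EOF'
<VirtualHost *:80>
    ServerName acme1.example.com
    DocumentRoot /var/www/acme1/docroot
</VirtualHost>
EOF
a2ensite acme1 && apache2ctl graceful

# What bal-41 conceptually does for step 2: an nginx virtual host that
# load balances over the site's web nodes.
cat > /etc/nginx/conf.d/acme1.conf <<'EOF'
upstream acme1_webs {
    server web-37.internal:80;
    server web-38.internal:80;
}
server {
    listen 80;
    server_name acme1.example.com;
    location / {
        proxy_pass http://acme1_webs;
        proxy_set_header Host $host;
    }
}
EOF
nginx -s reload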
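
Steps 3 and 4 are standard MySQL dual-master replication, with one master treated as active for writes at any given time. Another hedged sketch, with invented host names, credentials, and binlog coordinates; a real setup also has to manage server-ids, auto-increment offsets, grants, and failover:

# Conceptual sketch only. Each master already has a distinct server-id,
# binary logging enabled, and non-colliding auto_increment_offset settings
# in my.cnf. Each node is then pointed at the other as its replication master.
# On dbmaster-39:
mysql -e "CHANGE MASTER TO MASTER_HOST='dbmaster-40.internal',
          MASTER_USER='repl', MASTER_PASSWORD='secret',
          MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;
          START SLAVE;"
# On dbmaster-40, the same statement pointing back at dbmaster-39.

# Step 4 then only needs to create the database on the active master;
# replication carries it to the passive one.
mysql -e "CREATE DATABASE acme1;"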

Eventually we might decide that the site needs another web node, perhaps for a temporary burst of traffic. No problem:

% hosting-provision --server-allocate web 1 --tag acme_systems
Servers web-45 allocated.
% hosting-provision --site-set-web acme1 web-45 deploy
Site "acme1" will be deployed on web-45.
% hosting-provision --launcher --tag acme_systems

Within a few minutes, web-45 will be launched, the SVN repo checked out, and Apache configured. bal-41 and bal-42 will add web-45 into rotation for the site as soon as it is ready. When the site no longer needs the additional horsepower, we take it off web-45:
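
In nginx terms, the balancers only need one more server in the pool from the earlier hypothetical sketch, followed by a reload:

# Continuing the hypothetical sketch: the only change on bal-41 and bal-42 is
# one more entry in the acme1_webs upstream pool, then a reload.
#     upstream acme1_webs {
#         server web-37.internal:80;
#         server web-38.internal:80;
#         server web-45.internal:80;   # the newly launched burst capacity
#     }
nginx -s reload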

% hosting-provision --site-set-web acme1 web-45 remove
Site "acme1" will be removed from web-45.

web-45 is still running, available to have other sites assigned to it. If we actually assigned it to another customer we'd remove the "acme_systems" tag.

It took a while to build this system, but it provides a number of benefits:

  • We can relaunch any server, add or remove web nodes and database servers, add filesystem storage, and migrate sites across servers quickly and easily.
  • We know that all of our servers run a consistent configuration without any "secret" changes made by hand on one system that we then forget about when we boot the next one, greatly improving reliability.
  • The automation and consistency allow us to have a lean operations and support staff.
  • The Drupalccinos are very, very tasty.

Comments

Posted by chx (not verified).

two database servers in a dual-master configuration

Active master-master or just a master-slave configuration (with the slave configured to be able to be the master)? I.e., how do you scale writes?

Aside from that, very nice job. :) I guess most sites are mostly read-only so not a big problem.

Posted by Barry Jaspan.

(Note to readers: the author of the previous comment, Karoly Negyesi, is the lead architect for Examiner.com's conversion to Drupal and MongoDB, and is doing amazing work there.)

We use an active/passive dual master system. As we both know you know, neither dual-master nor master/slave configurations with MySQL allow you to increase write capacity; for that you need a faster disk, sharding, or an alternative database technology like, oh... I can't think of any. :-)

Actually, I have an early concept for how to increase burst write capacity, and scale Drupal geographically, without going to the effort of abandoning SQL. Watch this space.