Pipe Dream: Geographically Distributed Drupal

The speed of light is, unfortunately, still a constant. If your Drupal site has users in San Francisco, New York, London, Tokyo, Delhi, and Australia (and whose doesn't?), you've had no good way to give all of them fast access to your site. No matter where you put your master database server, most people have to cross an ocean to access it. Perhaps you can put read-only slave databases with local web servers in locations around the world, but then the remote users still have a long haul when they want to log in and create content---which is, after all, what your Drupal site is for.

I am experimenting with an approach to solving this problem that allows users to log in and create content using web and slave database servers that are geographically close to them, while maintaining a single consistent Drupal site. It does not require multiple active database masters, with all the intractable problems those cause. My system, called Pipe Dream, intercepts database-changing operations at the remote locations and sends them over a message queue (a.k.a. a pipe) to the primary location, where they are replayed.

Shameless plug: I'll be presenting my work on Pipe Dream at Drupalcon Copenhagen assuming my session gets selected. See you there!

The basic idea is simple. First consider the normal, local-only case:

[Figure: a web browser communicating with local servers]

A user's web browser submits a form via HTTP POST to a web server, which then writes the new data into the database. Later, the web server reads from the database to generate pages.

Pipe Dream changes this process to use a message queue between remote and primary servers. A message queue is an asynchronous, "fire and forget" communication channel; it works like sending an email message. The sender can deliver a message into the queue and immediately go on to other tasks; eventually the recipient will get and act on the message. Here's how Pipe Dream works:

[Figure: a web browser communicating with local and remote servers using a message queue]

The web browser submits the form via HTTP POST to the web server. The Pipe Dream module intercepts the form data before it is "submitted" (in the Drupal Form API sense) and redirects it to the primary master server via a message queue. A message queue listener at the primary location receives the message and re-submits it to the master server, which writes the new data into the database. Eventually the new database changes are replicated to the remote MySQL slave servers. In a healthy system, this whole process will probably take only a few seconds, so the new data will be available at all of the remote locations almost immediately.
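To make the flow concrete, here is a minimal sketch of the remote-side interception, assuming Drupal 7 and the core Queue API. The pipedream_* names, the variable, and the queue name are all illustrative, not the actual module code:

```php
<?php

/**
 * Implements hook_form_alter().
 *
 * Sketch only: real forms often attach submit handlers to individual
 * buttons, which is part of the Form API complexity discussed below.
 */
function pipedream_form_alter(&$form, &$form_state, $form_id) {
  // Only intercept on remote (non-primary) locations.
  if (!variable_get('pipedream_is_remote', FALSE)) {
    return;
  }
  // Local validation still runs; submission is redirected to the queue
  // instead of writing to the local, read-only database.
  $form['#submit'] = array('pipedream_enqueue_submit');
}

/**
 * Submit handler: ship the validated form values to the primary location.
 */
function pipedream_enqueue_submit($form, &$form_state) {
  DrupalQueue::get('pipedream_forms')->createItem(array(
    'form_id' => $form_state['build_info']['form_id'],
    'values' => $form_state['values'],
    'uid' => $GLOBALS['user']->uid,
  ));
  drupal_set_message(t('Thank you for your submission. It will appear shortly.'));
}
```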

The advantage of this system is that the remote users only ever talk to the web and database servers near them. When new content is submitted, the POST request is processed and returns right away. The content will appear later, but in the meantime the user experiences a fast web site and can move on to other things.

While the basic idea is simple, unsurprisingly, the details are complex. Here are some of the issues I have identified and how I am thinking of addressing them:

* Delay. Yes, it is true that a user will post a comment and then may not see it immediately when the page reloads. I don't care. Pipe Dream can display a message: "Thank you for your submission. It will appear shortly." We're talking about user-generated content web sites, not a nuclear missile control system.

* User login. The previous item notwithstanding, when a user logs in to a remote location, it won't do to tell them "Thank you for logging in. Please refresh the page a few times until your username appears in the upper-right corner." So the user login form actually has to be processed locally, issuing a local session cookie. Pipe Dream could replay the login at the primary location so the user will be logged in "everywhere" but probably the login can just remain local. Logins from the primary location can still be replicated everywhere, because the session table can have multiple entries for a single user id.

* Caches. Much like user login, it is important for cache tables to be stored locally. It might easily be the case that content viewed frequently in a remote location is rarely viewed in the primary location. This turns out to be a non-issue because Drupal 7 uses a REPLACE INTO query for cache updates: cache clearing and cache entry operations at the primary location will be replicated out, and cache entry operations at the remote locations will simply be stored locally.

* Validation. Form submissions can fail for a number of legitimate reasons such as a missing title field. Pipe Dream will allow forms to be validated locally and only deliver them to the primary location if they pass. Of course, it is possible that a form will pass validation remotely and still fail at the primary location due to a race condition; perhaps a referenced node or term just got deleted. The easiest answer is "I don't care." A better answer is for Pipe Dream to store the form submission in a table of failed submits (which of course gets replicated to the remote locations) and then show the user a block containing forms to be re-submitted. When the user clicks an entry in the block, the form re-appears filled in with validation errors displayed.

* Form API complexity. Drupal 7 forms can have a variety of behaviors and structures, including multiple submit buttons; Pipe Dream will not always know how a form is supposed to behave. For example, node forms have Preview, Add More, Save, and Delete buttons. The first should run locally, the last two should run at the primary location, and Add More is an AJAX operation that I haven't figured out what to do with yet. Pipe Dream will probably be able to handle all "simple" forms that have only a single primary Submit button and no AJAX, plus a selected list of forms for which there is a special-purpose handler. Pipe Dream will probably define a hook that lets modules tell it how to handle their non-standard forms (see the sketch after this list). In the initial implementation, I only plan to support very standard forms; nodes and comments are what everyone is going to care about anyway.

* Images. These might be easy or tricky depending on the way Pipe Dream intercepts form submissions. I have not actually tried it yet.
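As a sketch of the hook mentioned in the Form API bullet above (the hook name and return structure are hypothetical; nothing like this exists yet):

```php
<?php

/**
 * Hypothetical hook: declare, per button, where a form's actions should run.
 * Invented for illustration; not a real Pipe Dream (or Drupal) hook.
 */
function mymodule_pipedream_form_info() {
  return array(
    'article_node_form' => array(
      'local' => array('preview'),              // Runs on the remote web server.
      'primary' => array('submit', 'delete'),   // Queued and replayed at the primary.
      'unsupported' => array('field_add_more'), // AJAX; no strategy yet.
    ),
  );
}
```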

There are several different approaches to implementing Pipe Dream. My initial plan, as described above, is to base it on form interception. chx likes the idea of a new database driver that passes all non-SELECT queries (instead of form submits) to the primary location. There are probably other options.
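For comparison, here is a crude sketch of the query-level approach, again against Drupal 7's database layer. The class name and queue are invented, and the sketch waves away the hard parts: last-insert IDs, transactions, and return values.

```php
<?php

/**
 * Crude sketch of chx's alternative: a connection that runs SELECTs against
 * the local slave and queues everything else for replay at the primary.
 */
class PipeDreamConnection extends DatabaseConnection_mysql {
  public function query($query, array $args = array(), $options = array()) {
    if (preg_match('/^\s*SELECT/i', $query)) {
      // Reads are served by the local slave as usual.
      return parent::query($query, $args, $options);
    }
    // Writes are shipped to the primary location.
    DrupalQueue::get('pipedream_sql')->createItem(array(
      'query' => $query,
      'args' => $args,
    ));
    // The caller gets no meaningful result here, which is exactly why
    // per-query splitting is harder than it looks.
  }
}
```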

As of this writing, Pipe Dream is an early-stage research project. I have it working with basic node forms with title, body, and taxonomy terms, using a dumb SQL-based message queue. I plan to have a functioning demo and code in the Pipe Dream module at drupal.org later this summer.
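Incidentally, Drupal 7's default queue backend is itself SQL-based, so the replay loop at the primary location needs nothing beyond core APIs. A minimal sketch (hypothetical function name; a real version would also need to load the form's build arguments and act as the submitting user):

```php
<?php

/**
 * Sketch of the replay loop at the primary location, e.g. run from cron.
 */
function pipedream_process_queue() {
  $queue = DrupalQueue::get('pipedream_forms');
  while ($item = $queue->claimItem()) {
    $form_state = array('values' => $item->data['values']);
    // Re-submit the form against the master database.
    drupal_form_submit($item->data['form_id'], $form_state);
    if (form_get_errors()) {
      // Record the failure so the remote user can re-submit it later
      // (see the validation discussion above); failure table omitted here.
    }
    $queue->deleteItem($item);
  }
}
```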

Comments

Posted by Greg Knaddison.

I think this accidentally got published on July 13 when it was meant for 3 months and 12 days ago ;)

Is there really demand for this? Can you describe some of the performance benefits you hope this will achieve?

Like: "currently a user in Australia accessing a server in New York sees page load/comment submission times of X seconds and under this architecture it would instead be Y seconds."

It seems like a much better project for Acquia would be to roll out something that can save seconds from every page request for every authenticated user on Acquia Hosting: memcache.

Posted by Moshe Weitzman.

I think that if you say 'I don't care' enough you can get this to work :).

But I agree with Greg's question - what are you trying to solve here? If you want fast read access for distant users, the typical solution is a CDN. A CDN is pretty straightforward for media and anonymous pages, but let's discuss the authenticated case. I think the best practice (still quite new) is for the CDN to assemble the personalized page using Edge Side Includes for the dynamic bits. Since all Drupal Gardens sites run the same codebase, I would think you could solve this once for all those sites. Custom sites need custom engineering to generate the right HTML for each ESI. If you look at http://drupalcontrib.org/api/function/drupal_render_cache_set/7, you will see that D7 has a built-in spot for ESI placeholders to be swapped into the $page array. It's unproven, but Acquia is quite used to shaking out D7 bugs.

Now that I re-read, it sounds like Barry wants faster *write* access for distant users, which is not a use case we typically think about. Assuming you have offloaded all media and aggregated js/css to the CDN, I don't see a better speedup than this solution. It could be that the cure is worse than the disease, though. Just how fast can people submit nodes and comments? Isn't a one second delay between pages acceptable? Or are we trying to isolate the submitter from write delays on the master?

I love the name Pipe Dream!

Posted by Barry Jaspan.

I have two use cases:

* Large organizations using Drupal for internal sites. We have one such customer, with offices in New York, London, and Shanghai. All users are always authenticated, and everyone both reads and creates content. They want the site to be pleasantly usable by everyone. The whole reason they switched to Drupal is that they thought MySQL replication would let them put an active master in each of their locations and have content replicated in a ring; of course, that is a disaster waiting to happen, and it won't wait long. Note that a very large percentage of users of Open Atrium, Drupal Commons, and similar distributions fall into this category.

* Drupal Gardens. Wouldn't you like your site on Gardens to be just as fast for users in San Francisco, New York, Europe, and APAC? We're running on AWS so we can easily spin up servers in all of these locations. By adding location-aware DNS to the system described above, we will be able to serve yoursite.drupalgardens.com from whichever datacenter is closest to each user.

Turning the last example around: who would not want their site to be equally fast from anywhere? If all you need to accomplish it is the Pipe Dream contrib module, an AWS account for using SQS (or the message queue of your choice), and cheapo web hosting accounts in whatever geographic regions you care about, why wouldn't you do it?

Posted by Greg Knaddison.

Sure, but what benefit at what cost?

I mean, R&D and personal technical odysseys are every engineer's dream, so I don't blame you for running down this road, but what is the potential benefit of this system? And what is the cost? Your use cases make sense, but what is the measurable loss in time to your customer in situation 1? Are the people in London and Shanghai waiting an extra 1 second per page request? 2 seconds? And would it not be possible to achieve a similar benefit using more proven and immediately available techniques like the memcache I suggested or the ESI Moshe suggested?

Posted by Barry Jaspan.

Asking employees in remote locations to wait an extra 2 seconds (and it might be higher than that) for every page load when using the intranet is a great way to guarantee that they never use the intranet.

memcached would have absolutely no benefit in this situation because without Pipe Dream the users would still all have to visit the single primary location. The single location might be faster because memcached is installed, but I'm assuming the single location is already "fast enough" and the problem involves far-remote, authenticated users.

Another benefit of Pipe Dream is that it increases the maximum burst content creation rate. In a normal setup, if the master is busy, everyone who tries to create content has to wait for the master to get to their request. With Pipe Dream, all the content creation requests are queued up to be processed when the master has time.

Posted by Joshua Brauer.

Another fun thing to consider here is the possibility of dynamic use of cloud computing. If someone needed additional near-real-time instances of Drupal, they could spin up x clones in disparate locations, knowing that the message queue will make sure the updates and creates eventually get incorporated.

Thanks,
Josh
Acquia Client Advisory Team

Posted by grugnog (not verified).

MySQL Proxy seems like an obvious alternative to consider here - version 0.8 has Lua scripts for read/write splitting that can do exactly what chx proposes. It is still beta, although my guess is that it is already pretty functional, since Drupal doesn't tend to use the problematic MySQL functionality. It would also have a bunch of side benefits, such as being able to balance multiple local slaves and fall back to the master for reads if no local slave is available.

edit: While I think MySQL Proxy might be a quicker way to achieve the end-user latency objectives described, the form-based approach also has some obvious side benefits. Being able to elegantly deal with form submissions and failures in a non-real-time way could be useful for lots of other slow transactions (e.g. booking flights), and even without servers in multiple locations the queue system would really help with very large write-heavy sites (together with sharded/NoSQL-type data storage), particularly since you could (in theory) have several persistent Drupal daemons processing forms on behalf of each user, with no bootstrap cost.

Posted by Barry Jaspan.

If you use read/write splitting with MySQL Proxy, and assuming the reads are going to a local db server, then all the writes are going individually to the primary master server... which means you incur the high latency of the long haul for every write query, not just once per HTTP request. That will end up being much worse than just making everyone always talk to the primary server.
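To put rough, illustrative numbers on it (assumptions, not measurements): with an ~80 ms trans-Atlantic round trip and a node save that issues, say, 30 sequential write queries, per-query splitting spends about 30 × 80 ms ≈ 2.4 seconds in network latency for a single save. Queuing the whole form submission pays that ~80 ms once, and asynchronously at that.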

Posted by grugnog (not verified).

This was an alternative to chx's idea of writing a database driver to do read/write splitting (pointing out that this part is already doable) - they would share the same issues in that regard. That said, once you exclude cache, log, and session table writes (which could be local only), the number of write queries for some operations (comments, polls, etc.) is very small - node creates/updates would be a problem, though. Some of this latency could be avoided if these write queries could be executed directly after the page is served, but before PHP exits - this has been suggested in the past and, given some more work in the database layer, should be reasonably easy for most form submits, but it would introduce some of the same complexities as the form state approach you describe.

Posted by justinrandell (not verified).

Interesting idea - sounds like a fun thing to work on. I think replication is the wrong tool here, though. I think using content UUIDs and messages flowing both ways would scale better (in an application-maintenance sense).

Just guessing, but I think solving consistency issues via versioning and messages will be easier than solving the issues created by data flowing back from the primary site to the secondary site(s) in a non-application-aware way. It would also allow "local" data to show up straight away when that makes sense for the type of content and application. Messaging both ways allows the rules about what makes sense here to be tuned per application.