by Kent Gale
Egypt has been engulfed in constant turmoil over the past decade, including upheavals during the early days of the Arab Spring and again this summer. Al-Masry Al-Youm, Egypt’s leading independent media group, plays a critical role as a vital source of information for Egypt’s population in these turbulent times. Because its website is a key channel of communication, the site’s stability and functionality need to be a top priority for Al-Masry’s IT team.
But what happens when your lead developer can’t get to work because of another series of riots? Acquia’s Support team steps in, going to extraordinary lengths to provide any support necessary to keep the information flowing. While the steps taken were well beyond the normal scope of service, Acquia is sensitive to special customer support needs. Given the extraordinary circumstances, we chose to provide the extra level of service to help Al-Masry get past this crisis.
With no one to manage a significant surge in traffic, the Al-Masry site was suddenly overwhelmed by Egyptians’ need to know. The problem had drawn significant discussion in the support chat that Acquia’s Support team monitors daily. The team examined the site in detail and uncovered several issues that were causing the increased traffic to degrade site performance.
“I like to look at it as we’re attacking the problems from all fronts,” said Acquia Client Advisor Adam Malone. “Our Operations team was working on Varnish (caching) and optimizing the configuration specifically for Al-Masry, while Support dove into the application layer and worked with Drupal.”
Because every Drupal site is modified and extended with its own custom code and configuration, it takes some time to get a feel for a site and understand what is happening. The access logs showed more redirects than successful page hits. Although a redirect delivers no content itself, it still puts load on the web server; by making sure redirects were cached in Varnish, we could free up the web servers to deliver more content.
“We were tailing a number of logs output by Varnish, Apache and Drupal, showing whether users were bypassing the cache, what they were requesting and how the services were responding,” said Malone. “By identifying patterns in these logs, we were able to locate what was slowing the site down to a crawl, and either push it into Varnish or use another workaround to speed things back up.”
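As a rough illustration of the pattern-finding Malone describes, the sketch below tallies status codes from access log lines and lists the paths that redirect most often. It assumes Apache's common/combined log format; the sample lines, IP addresses and paths are hypothetical, and this is not the tooling the team actually used.

```python
import re
from collections import Counter

# Matches the request line and status code of an Apache common/combined log entry.
LOG_PATTERN = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})')


def summarize(lines):
    """Return (200 hits, redirects, top 3 redirected paths) for some log lines."""
    statuses = Counter()
    redirected_paths = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # skip lines that do not look like access log entries
        status = int(match.group("status"))
        statuses[status] += 1
        if 300 <= status < 400:
            redirected_paths[match.group("path")] += 1
    redirects = sum(n for code, n in statuses.items() if 300 <= code < 400)
    return statuses[200], redirects, redirected_paths.most_common(3)


# Hypothetical sample lines (documentation IP range, made-up paths).
SAMPLE_LINES = [
    '198.51.100.7 - - [01/Jan/2000:00:00:00 +0000] "GET /news/123 HTTP/1.1" 200 5120',
    '198.51.100.8 - - [01/Jan/2000:00:00:01 +0000] "GET /images/missing.jpg HTTP/1.1" 302 0',
    '198.51.100.9 - - [01/Jan/2000:00:00:02 +0000] "GET /images/missing.jpg HTTP/1.1" 302 0',
]
```

Calling `summarize(SAMPLE_LINES)` reports one page hit against two redirects, both for the same missing image, which is the kind of imbalance that points at a fix. On a live server the lines could instead come from `tail -f` on the access log piped into such a script.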
With the web servers and database server overtaxed, the Support team wanted to take as much load off them as possible and have Varnish caching on the load balancer handle as much traffic as possible. Executing PHP and running database queries is slow compared to delivering cached content, and with the web servers constantly hammered, the Support team needed to give Drupal a break. Varnish delivers cached content much faster and without involving PHP or MySQL at all, which reduces the strain on the web servers.
Only so much could be done at the Varnish level. Half of the requests passing through the Varnish cache to the web servers were requests that led to redirects because of errors in the code (bad image URLs, for example). We needed Varnish to cache these redirects rather than send every such request back to the web servers, since Varnish can serve thousands more requests than PHP can. After consulting with the rest of the team, we changed the Varnish configuration so that Varnish could hand out the redirects itself without passing the requests to the web servers. The main priority was making sure the Support team could manage the traffic safely so the site wouldn’t crash.
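The effect of that change can be modeled with a toy edge cache written in Python rather than Varnish's own configuration language. In this sketch the `cacheable` parameter stands in for whatever previously kept the redirects out of the cache (the details of Acquia's actual Varnish configuration aren't given here); `EdgeCache`, `slow_backend` and the paths are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Response:
    status: int
    location: str = ""  # redirect target, if any
    body: str = ""


class EdgeCache:
    """Toy stand-in for a caching proxy sitting in front of slow web servers."""

    def __init__(self, backend: Callable[[str], Response], cacheable: Tuple[int, ...]):
        self.backend = backend
        self.cacheable = cacheable  # status codes the cache is allowed to store
        self.store: Dict[str, Response] = {}
        self.backend_calls = 0

    def get(self, path: str) -> Response:
        if path in self.store:
            return self.store[path]  # served from cache: no PHP, no database
        self.backend_calls += 1  # miss: the request goes to the web servers
        response = self.backend(path)
        if response.status in self.cacheable:
            self.store[path] = response
        return response


def slow_backend(path: str) -> Response:
    # Hypothetical backend: broken image URLs get redirected to a placeholder.
    if path.startswith("/images/broken/"):
        return Response(302, location="/images/placeholder.png")
    return Response(200, body=f"page for {path}")


def simulate(cacheable: Tuple[int, ...], requests: int = 1000) -> int:
    """Request one broken image URL repeatedly; return how often the backend ran."""
    cache = EdgeCache(slow_backend, cacheable)
    for _ in range(requests):
        cache.get("/images/broken/photo.jpg")
    return cache.backend_calls
```

With only `200` cacheable, `simulate((200,))` sends all 1,000 requests to the backend; with `(200, 301, 302)` the backend runs once and the cache answers the other 999.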
The site was seeing thousands upon thousands of hits, at one point up to 15,000 established connections. The server logs confirmed that the Support and Operations teams’ actions were taking effect.
“It’s almost like an Easter egg hunt,” said Malone. “Different things pop up all the time, but the requests were cached in Varnish, and our logs looked much happier. A single clue in a log file can lead you to a few lines of code, which can in turn lead you to more code. Working backwards and tracing where errors and cache misses originate can be like detective work.”
Highlights of the response included raising cache lifetimes so that Varnish held on to pages a little longer, and disabling UI modules for the duration of the crisis.
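Raising the cache lifetime (TTL) trades freshness for capacity: a page that is requested continuously gets rebuilt by the backend roughly once per TTL instead of once per request. The sketch below assumes a single popular page requested once per second and a cache that expires entries after a fixed TTL; the function and the numbers are illustrative, not measurements from the Al-Masry site.

```python
def backend_fetches(ttl_seconds: int, duration_seconds: int, interval_seconds: int = 1) -> int:
    """Count cache misses for one URL requested every interval_seconds, given a fixed TTL."""
    fetches = 0
    expires_at = None  # no cached copy yet
    for now in range(0, duration_seconds, interval_seconds):
        if expires_at is None or now >= expires_at:
            fetches += 1  # miss: PHP and the database rebuild the page
            expires_at = now + ttl_seconds
    return fetches
```

Over ten minutes (600 requests), `backend_fetches(60, 600)` gives 10 rebuilds while `backend_fetches(300, 600)` gives 2, so a fivefold longer TTL cuts backend work for that page fivefold, at the cost of readers seeing a copy up to five minutes old.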
Requests are funneled through Varnish: Varnish gets the first opportunity to fulfill a request, and hands it to the web servers only if it can't. The web server then decides what is a PHP request (a page request, or a request for a new copy of an image) and hands the request off to PHP when appropriate. PHP is very expensive compared to the other services, so to mitigate the cost of invalid requests, that is, requests for things that do not exist, we recommend the Fast 404 module, which cuts out much of the PHP time spent responding to them. Because of the changes made during this event, we also ended up storing those 404 pages in Varnish, freeing capacity for other PHP requests.
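The idea behind a fast 404 can be sketched as a cheap check that runs before the expensive application: if a request looks like a static file and that file doesn't exist, answer with a minimal 404 instead of starting PHP and Drupal. This is a Python model of the approach, not the Fast 404 module's code or settings; the extension list, the `/styles/` exclusion (so on-demand image copies still reach the application) and all names are assumptions for illustration.

```python
import re
from typing import Callable, List, Set, Tuple

Response = Tuple[int, str]

# Illustrative pattern for paths that look like static files.
STATIC_FILE = re.compile(r"\.(?:txt|png|gif|jpe?g|css|js|ico)$", re.IGNORECASE)
# Hypothetical exclusion: image copies generated on demand must still reach the app.
GENERATED_IMAGES = re.compile(r"/styles/")

MINIMAL_404: Response = (404, "<h1>Not Found</h1>")


def handle(path: str, existing_files: Set[str], render: Callable[[str], Response]) -> Response:
    """Serve existing files, answer missing static files cheaply, render the rest."""
    if path in existing_files:
        return (200, f"static file {path}")  # served by the web server, no PHP
    if STATIC_FILE.search(path) and not GENERATED_IMAGES.search(path):
        return MINIMAL_404  # fast path: no full application bootstrap
    return render(path)  # slow path: a full PHP/Drupal request


rendered: List[str] = []  # records which paths reached the expensive renderer


def expensive_render(path: str) -> Response:
    rendered.append(path)
    return (200, f"rendered {path}")
```

Requests for a missing `/images/broken.jpg` get the minimal 404 without touching `expensive_render`, while `/news/123` and an image-style path such as `/sites/default/files/styles/thumb/photo.jpg` still go through it. A 404 answered this way is also cheap for the cache layer to store, which fits the decision to keep those pages in Varnish.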
Several different Acquia technical groups participated in this process. Subject matter experts were consulted on an as-needed basis to help guide and direct the primary support personnel through more complex issues. This depth and breadth of expertise is not easy to find outside of Acquia.
In the end, having more than 20 people with decades of combined experience available for expert support was key to keeping Al-Masry running at a crucial time. Acquia has an extensive team of technical resources across customer solutions whom we can consult on any technical topic in a critical support situation. Having those experts in different time zones around the world ensures that we always have someone on the job, keeping an eye on our customers’ sites. While we went to extraordinary lengths to keep Al-Masry up and running, this level of commitment is not out of the ordinary for Acquia Support.