Integrating a CDN with Acquia Cloud [December 20, 2012]
Drupal provides a wide selection of caching solutions. One of the most effective is the use of a Content Delivery Network (CDN). CDNs can provide the best industry solution for app server price, performance, high availability and security. Chris Meyfarth, senior engineer at Akamai, will detail inherent issues on the Internet and will discuss different services such as: push, pull, post, DDOS, error handling, and how original server masking can help. Wim Leers, author of the CDN module and former CDN developer for Facebook, will present an overview of the available CDN options for integrating with Drupal, including technical evaluations of CDNs and asset serving versus full site serving. Alex Jarvis will discuss specific Drupal configurations for different use cases.
In this webinar, you’ll learn:
• When to use the CDN module or write custom CDN code
• How to evaluate the technical properties of a CDN
• How a CDN can provide security and high availability
• How Acquia Cloud can be configured with CDNs in multiple configurations
• About Akamai's test tool for assessing website performance
Moderator: Today’s webinar is Integrating a CDN with Acquia Cloud. Our Acquia presenters are Kieran Lal, Wim Leers and Alex Jarvis. We also have Christopher Meyfarth from Akamai, and we’re very excited to have him on.
Kieran Lal: So a few months ago, actually starting back in June of 2012, I started really exploring how we could go beyond what Acquia Cloud had in terms of performance and the most obvious answer once we had tuned the stack, got the reverse proxy right and tuned the PHP stack and tuned the database and got everything together, we started looking at some of the really great web performance optimization tools that were available and so that got me thinking about using a CDN. So I started talking to a lot of people at Acquia and collecting more and more information, looking at all the different CDNs that had been integrated with Acquia Cloud and started asking people about what were the different scenarios and different use cases that came together. So I sort of had this on the shelf for a while and then we were lucky enough to see that Wim Leers, who I thought was going to work somewhere else, actually ended up joining the Office of the CTO and so that was really exciting and I think within a few days of him joining, I immediately reached out and told him that I wanted to do a webinar somewhere in the future on CDNs. Wim I’ve known for many years and we hung out together when I was over for a conference in Hungary and so I knew about his interest in CDNs and more recently some of the work that he’d been doing through his master’s and/or an internship that he had done at Facebook. So it was really great to have what I consider to be one of the leading experts of CDN integration with Drupal here at Acquia and so we worked together, put together some outlines. I showed him my notes and we planned to do this, and then as we started getting close to the webinar, I was able to reach out to some new partners at Akamai and so I’m really excited to have Christopher Meyfarth from Akamai join us who’s one of the sales engineers, but a real expert in Akamai, and so we’re able to really focus on what is the premier CDN in the space.
Then more recently, I was at BADCamp and then bumped into Alex Jarvis who’s one of our senior technical account managers and he was talking to me about some work that he was doing and we were going back and forth and I brought up the idea of Akamai and so he started telling me about all the expertise and some of the notes and training that he’d been doing internally, and so I was really excited to really bring what I think is one of the strongest teams of technical contributors that we have in our webinar series together to show off the advantages of using a CDN with Acquia Cloud and in particular using Akamai. So we’ll see three different perspectives, but they’ll keep it pretty thorough and we’ll have a lot of great content. I’m really looking forward to some excellent Q&A and even I know a lot of the staff at Acquia are really excited to learn about this because it’s a tier of knowledge that they really want and something that almost 50 people on our support team are keen to learn about so that they can use Akamai and use Chris’s knowledge of CDNs to improve the performance of their sites. So without further delay, I’m going to hand it over to Christopher who will go first and then we’ll have Wim talk about CDNs and some specific expertise that he’s got and then Alex will finish off and talk about specific configurations of Drupal with Akamai, and then we’ll try to pause. If we see questions that are relevant or particularly topical while our speakers are talking, we’ll take the time to answer the questions in context, but most of them, if they’re generic-enough questions, we’ll hold until the end where possible and then try to leave enough time to do a Q&A. So Christopher, without further delay, I’m going to hand over the slide to you.
Christopher Meyfarth: Great. Thank you, Kieran. I’ll just make sure we got the presenter view here. So as Kieran mentioned, I’m Chris Meyfarth. I’m a lead senior sales engineer working with our channel program at Akamai. I’m going to talk a little bit today about the inherent problems of the internet. We’ll talk a little bit about the Akamai technology that we’ve created to help combat those inherent problems and we’re going to look at some test results that we did last week on Acquia.com. So without further ado, I’ll just jump right into it and talk a little bit about the inherent problems of delivering traffic over the public internet. I think that it’s important to understand why you need a CDN. Acquia is spending a lot of time honing Drupal, putting a lot of time into the data centers and regardless if it’s Drupal or an application that you have in-house, you don’t want to spend all that time on the application, all that time on the data center and then just cross your fingers as that information goes out across the wilds of the public internet. When you look at the public internet, it’s a network of networks and the protocols that control the public internet, there are three major protocols: you have BGP, TCP and HTTP. The issue here is BGP first and foremost which controls the route, it’s based on the number of steps that you take, not the performance. TCP of course is an antiquated protocol, it has a slow start and then you have HTTP and HTTPS which is just the sheer volume of information that we’re sending across the line. A great example is with commerce websites now in Akamai, the average commerce customer has about a two-meg footprint for their commerce website. I can remember 10 years ago when we were willing to wait 10 or 15 minutes to download a two-meg MP3, and now according to Forrester and Gartner, we have the two-second rule. We’re looking to download a site in two seconds or else we’re going to start seeing exponential abandonment.
We’re doing in two seconds today what we were willing to wait 10 or 15 minutes to do 10 years ago. So when you step back and look at the public internet, you really have a compounding inefficiency. You have a ton of data going over a pipe that has a slow start, going over a route that isn’t based on performance. It’s just simply based on the number of steps that it needs to take.
So I’m going to talk about some of the newer Akamai technologies. I’m going to leave a lot of the caching and some of the traditional CDN stuff to Wim and to Alex, but to combat this compounding inefficiency, Akamai has developed some technology called SureRoute. SureRoute leverages our 125,000 servers in 2,000 different locations. They basically all talk to each other. They build a virtual weather map of internet traffic conditions, so anytime an end user connects to an Akamai server, we go to that weather map and in real time we query it and we pull down the three fastest routes given his exact location and where he’s going. We’ll actually take it a step further. We’ll actually send a little piece of information across each one of these routes. We’ll run a race, if you would, and the one that comes in first will start passing traffic across that route. Now, we’ll continue to run those races, so if anytime maybe internet traffic conditions change, people are coming back from lunch, there’s congestion on different lines, we continue to run those races, we detect if that’s no longer the fastest route, we failover to the second fastest route and then go back to that weather map to get three new routes.
Now, it’s an extreme example, but to see SureRoute in action, this is a cable cut that happened. So this is, as I said, an extreme example of internet traffic patterns shifting very quickly. What you see in red is the public internet. It took the public internet about two weeks to recover and it took Akamai about 20 minutes. So SureRoute is just one technology that we have; of course Akamai is known for doing caching. We continue to push a lot of that caching logic out to the edge of the internet so that we can make caching decisions based on languages or cookies and we can make those as far away from the application origin as possible. In addition, the last technology feature we’ll talk about is a feature called prefetch. Prefetch is a really interesting technology because as the end user requests an HTML file and that is coming back and being returned, actually our Akamai edge server will hold on to that HTML file and start to parse it. So we will send out requests, we’ll refresh our cache. If anything is expired, we will refresh those cache objects or request them from the origin before the end user has even received that HTML file, so by the time that end user has parsed that HTML file and starts requesting those objects, they’re located on the edge server 10 milliseconds away from them.
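The prefetch idea described above can be illustrated with a minimal Python sketch. This is not Akamai's implementation; it only shows the general technique of parsing an HTML response at the edge and warming a cache before the browser asks for the resources. The names `ResourceCollector`, `prefetch` and `fetch_from_origin` are hypothetical, and the cache is a plain dictionary with no expiry handling.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect the URLs a browser would request after parsing the page."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.urls.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.urls.append(attrs["href"])


def prefetch(html, cache, fetch_from_origin):
    """Warm `cache` with every referenced resource that is not cached yet.

    `fetch_from_origin` is an assumed callable that returns the body for a URL.
    Returns the list of URLs that had to be fetched from the origin.
    """
    collector = ResourceCollector()
    collector.feed(html)
    fetched = []
    for url in collector.urls:
        if url not in cache:
            cache[url] = fetch_from_origin(url)
            fetched.append(url)
    return fetched
```

In this sketch the edge would call `prefetch` while the HTML is still on its way to the end user, so later requests for those resources become cache hits.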
So a quick overview of some of the newer Akamai technologies in action, we did a test last week of Acquia.com. We were looking, so we chose six steps, just generic web pages. We chose the homepage, the Support and Cloud Services, Solutions for Marketers, Network Trial page, Careers page and the PDF. The way we conduct these trials is we’ll actually build a fully functional Akamai configuration and send it out over the public internet, and then we spin up either Gomez or Keynote, one of the major leading performance testing suites and we’ll have them hit both the Akamai configuration as well as Acquia.com as it was being accessed on the public internet. So what we have here is the comparison results. This was over the course of about three days from many cities all over the world and we have the public internet results and then the Akamai results. What’s interesting about this – I mean I’ll let everyone consume the data points on their own, but what really stands out here is step number five. If you look at the average improvement, step five is nowhere close to everyone else, so I wanted to highlight step five and dig in a little bit deeper. Step five is the Careers page and this was a very heavy page that was pulling information from a few different queries and one of the major advantages of Akamai and using Akamai as a CDN is you can build rules into Akamai to handle the caching. So what we did with step five was we said, “Well, let’s rerun these tests, and rather than just cache the images and do the SureRoute acceleration, let’s take a snapshot of that page. Let’s cache the full HTML of that page not for very long, maybe just for a few hours. We want the page to stay dynamic, but in this case we would much rather have maybe a highly performing page that we could clear out from cache on request. Akamai has the ability to interact with an API to purge things out from cache, so let’s cache that full page.” These are the results.
You can see that public internet was about five and a half seconds. We turned on Akamai, it’s about four and a half, but then when we started doing that full page caching, it dropped it down to two. Those results will only get better the more people you have requesting that page. It’ll keep it in cache, it’ll keep it in cache longer and more current.
So before I hand things over to Wim, just to summarize quickly for Akamai, the trial we did for Acquia.com is very easy to do. We do it for many customers, so if you want to get set up with the CDN, we strongly advise you to talk to an Acquia rep. We can do one of those trials, we can set up the steps, build that configuration, send it out over the public internet and in just a couple of weeks we can get some real world results for you. So at that point, if there are any questions, we can open it up quickly or we can pass it over to Wim.
Moderator: There’re no questions at this point, so I’m going to pass it over to Wim.
Christopher Meyfarth: Great. Thanks guys!
Wim Leers: Hi everyone, my name is Wim. As Kieran explained, I’m a senior software engineer with Acquia and I’ll be talking a bit about Drupal plus CDN, but before I really begin, I just want to make sure that everybody actually knows what a CDN is. If you’re not quite sure yet what that is, please raise your hand right now. [Pause] Okay, I do see that at least one or a few people are raising their hand, so a CDN is essentially a short name for content delivery network. In a nutshell, what it does is move content closer to the end user, meaning that to get the data for a website, for example, there will be less physical distance to cover, meaning that the data will be faster at the end user’s computer. So essentially, move the data closer to the end user so that everything on the web becomes faster. I hope that is clear; if not, I recommend the explanation on Wikipedia. So let’s dive into the real stuff.
There are actually two types of – you can use CDN in two big different ways. The first one is use a CDN for serving resources, static files, and the second way is using a CDN for everything in HTML. So the serving of resources should always be the first step since this is the thing that affects 90% of the page load time. It’s easy to do especially with origin-pull CDNs – more about that later – but in any case, the most important aspect is that this will give you the most benefit with the least amount of effort. You can also go for serving everything from a CDN, so this is an optional second step. It makes sense that this is a second step because the first aspect you can do without much changes to your application logic and it’s very easy to do, so it makes sense to always do the resources [Audio Gap], but if you go for the full approach, then you will affect 100% of your page load time since everything is being served from a CDN minus of course third party scripts. The downside is that you will have to carefully plan the integration; it takes quite a bit of effort especially if you have a website in which you are serving authenticated users a lot of content, meaning that if you have for example a news website where the typical user is not logged in, then you can serve the exact same content to millions of users, but imagine the case where you have for example a news website where each user can make his individual preferences for topics of interest. Well, then things become a lot more complex because not the same content can be served to every single user, meaning that you have to do far more complex page building to achieve the same performance. To do this kind of authenticated user, and thus, personalized content through CDN, what is typically used is Edge Side Includes and Akamai is I think the company who pioneered this.
More about that in just a second, but the one thing that you should also know is that in either of these cases, in both scenarios there will be fewer requests to the app server, exactly because static files are no longer served by your app server, your web server. That actually means that there is less load on your web servers. If you have many of those web servers, then it’s possible that you can even retire some of them, meaning that overall your costs for your app server should go down. So CDNs may cost some money, but they may also save some to some degree, at least. So I just mentioned Edge Side Includes which is really important if you want to serve the HTML of your website from a CDN while still having the ability to serve personalized content to every user. In Drupal 5, 6 and 7, it has been pretty difficult or rather complex and not well supported out of the box to do Edge Side Includes, but I’m happy to report that in Drupal 8, it will become much easier because in Drupal 8, there is the Blocks and Layouts Everywhere Initiative, codenamed SCOTCH. The single keyword that is really important here is Blocks Everywhere. Blocks you can regard as the atoms in the Drupal universe, as it were, the atoms of the Drupal webpage. In Drupal 5, 6 and 7, many things are blocks, but not everything is, so if we make everything a block, every single compartmentalized piece of content, if each of those is a block, then we can do much more interesting things.
In Drupal 8, we are adopting Symfony, which is another open source project, so we’re sharing code and sharing the effort there. We’re adopting the HttpKernel component from Symfony. What we’ll be doing is making sure that each part of the page, each block, each atom can be requested individually and when you, for example, would ask a Drupal 8 website to render the front page, what it will do then is use in-process subrequests to build a page. So it will actually fire subrequests within the PHP process to render each of those blocks individually. As you can tell, this very closely resembles ESI and is actually pretty much the exact same thing. Drupal 8 will make it much easier, if not trivial, to support Edge Side Includes. So ESI plus CDN will become trivial in Drupal 8, but please remember that not every CDN supports Edge Side Includes.
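To make the Edge Side Includes idea concrete, here is a minimal Python sketch of what an ESI-capable edge does with a cached page template: it replaces each `<esi:include src="..."/>` tag with a fragment fetched separately, so the shared shell can be cached while personalized blocks are filled in per user. This is only an illustration of the concept, not Drupal's or any CDN's implementation; `assemble` and `fetch_fragment` are hypothetical names, and only the simplest form of the include tag is handled.

```python
import re

# Matches the basic self-closing form: <esi:include src="/some/path"/>
ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')


def assemble(template, fetch_fragment):
    """Return `template` with every ESI include replaced by its fragment.

    `fetch_fragment` is an assumed callable mapping a fragment URL to HTML;
    a real edge server would issue one subrequest per include.
    """
    return ESI_INCLUDE.sub(lambda match: fetch_fragment(match.group(1)), template)
```

For example, a page shell containing includes for `/block/news` and `/block/user` could reuse the same cached news fragment for everyone while fetching the user block per visitor.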
So now we have an idea of the two big different ways of using a CDN and I’ve clarified that one of them will become significantly easier in Drupal 8, so let’s move on to the key properties of a CDN. The single most important property I believe is the geographical spread, the points of presence. PoP stands for “point of presence.” A point of presence is essentially a location where the CDN has edge servers as Chris pointed out already in his presentation, meaning that as a PoP, a CDN connects with local ISPs, so the more connections with local ISPs around the world or in different regions, the closer you are to the end user with that CDN. So what you need to do is make sure that the latency between your [Audio Gap] and your content is as low as possible, globally speaking, meaning that for as many users as possible, you want to have the lowest latency possible because then web pages will be rendered faster. So if you put those things together, what really matters is that you match your CDN to match your audience, so if you’re a company that’s mostly oriented towards European users, for example, then you would probably choose a CDN that has a strong European presence that has many PoPs in Europe. It could also be that you are going with a global CDN provider, provided that they do have a lot of presence in Europe.
So as you can see, it really depends on your needs, on your target audience which CDN is the best choice from a geographical point of view, from a lowest-latency point of view. The second most important property is the way you get files onto the CDN. Actually, there are two types: origin-pull CDNs and push CDNs. For origin-pull CDNs, there is not really a special transfer protocol involved. Your web server obviously speaks HTTP because otherwise it would not be a web server, so what origin-pull CDNs actually do is when they receive a request for a certain file, they will just go back to the origin server which is your web server and go and get the file over there. The upside is of course that there’s virtually no setup. You just do very, very little and everything will work just fine. The downside is exactly because it will work almost automatically, that there is very little flexibility in terms of what kind of preprocessing or optimization you can do before getting the file onto the CDN. Another downside can be redundant traffic. For most websites, this will not be a problem: requesting files again is not really an issue since they’re so small, but imagine a case where you have multi-gigabyte video streams, for example. As Chris already explained, specific edge servers, the PoPs of a CDN, tend to clear out files that have been requested least often in a certain period of time, so it is possible that your video file has only been requested a few times in a given timeframe, meaning that it will be removed from the CDN temporarily until it’s requested again, but that means that your origin server – meaning your web server – will have to serve the file again to your CDN. So multiply that by a number of edge servers and then you might be looking at significant traffic, but it really depends of course on your specific site and your specific goals.
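The origin-pull behavior and the redundant traffic it can cause can be sketched in a few lines of Python. This is a simplified model of a single edge server, not any CDN's actual code: the class name `PullThroughCache` and the `fetch_from_origin` callable are hypothetical, and eviction here is least-recently-used, which stands in for the "requested least often in a certain period" policy mentioned above.

```python
from collections import OrderedDict


class PullThroughCache:
    """One edge server that pulls files from the origin on a cache miss."""

    def __init__(self, fetch_from_origin, capacity):
        self.fetch_from_origin = fetch_from_origin  # assumed: path -> body
        self.capacity = capacity                    # max number of cached files
        self.store = OrderedDict()
        self.origin_requests = 0

    def get(self, path):
        if path in self.store:
            # Cache hit: serve from the edge and mark as recently used.
            self.store.move_to_end(path)
            return self.store[path]
        # Cache miss: go back to the origin web server over plain HTTP.
        self.origin_requests += 1
        body = self.fetch_from_origin(path)
        self.store[path] = body
        if len(self.store) > self.capacity:
            # Evict the least recently used file to make room.
            self.store.popitem(last=False)
        return body
```

If a rarely requested large file is evicted and then requested again, `origin_requests` goes up a second time for the same file; multiplied across many edge servers, that is the redundant traffic described above.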
That’s exactly where a push CDN can be advantageous. Exactly because it’s a push CDN, you have to push the files onto the CDN, so you control when a file is transferred from your origin server onto the CDN, so there is no redundant traffic since exactly it’s you who control that. You also have more flexibility in terms of potential preprocessing exactly because it pushes onto the CDN, but the big downside here is that there is a lot of setup involved. You have to write the script or code or some sync layer, whatever it is going to be. You have to somehow get the file onto the CDN and that might be a lot of work. I failed to add this on the slide, but this is exactly what I tried to solve with my bachelor’s thesis. It’s called FileConveyor.org and it’s a Python daemon that is intended to simplify the setup of a push CDN. Essentially, its job is to abstract away the differences between different CDNs, so it makes it easy to switch from one to the other or to do advanced preprocessing and whatnot, but in any case, this is another component that you have to deal with which you don’t have to do with an origin-pull CDN.
That actually brings us to the last thing which is lock-in. Depending on the kind of CDN that you choose and its features, you may be looking at some or significant lock-in. For example, if you’re using Amazon S3 as a CDN, then you’re actually using a custom transfer protocol which means that if you want to switch to another CDN, you have to rewrite your sync layer, your script, whatever it is, so that is some degree of lock-in. Some CDNs offer unique features – for example, some kind of statistics – and you may not have the same in another CDN and so on, so you have to be careful there. You have to weigh the differences to figure out which things are most important to you.
All of this is about the differences between the different CDNs. So how do you select a CDN? Which CDNs should you choose? The first and foremost reason to go with a CDN is of course performance. That’s the whole reason we’re having this webinar. The most important thing is low latency. This is again about matching your target audience well to your CDN and making sure that the PoPs the CDN has correspond well with the geographical location of your visitors. However, this is not always the single most important thing. If you’re serving a lot of small files, for a typical website, it definitely is, but in the case of serving video streams for example or large software downloads or whatnot, what might matter more is high throughput - some call this bandwidth – meaning that it doesn’t really matter if you have to wait 0.1 or 0.2 seconds for your video to start streaming, but what you really want is that your video stream does not get interrupted. That’s the difference that you have to weigh there.
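The trade-off between latency and throughput can be made concrete with a rough back-of-the-envelope model, assuming that transfer time is simply latency plus size divided by throughput (this ignores TCP slow start, parallel connections and streaming buffers). The function name and the numbers used below are illustrative assumptions, not measurements.

```python
def transfer_time(size_bytes, latency_s, throughput_bytes_per_s):
    """Approximate seconds to deliver one file: wait for the first byte, then stream."""
    return latency_s + size_bytes / throughput_bytes_per_s


# A 20 KB image at 1 MB/s: latency dominates.
small_near = transfer_time(20_000, 0.1, 1_000_000)   # about 0.12 s
small_far = transfer_time(20_000, 0.2, 1_000_000)    # about 0.22 s

# A 1 GB video: throughput dominates, latency is noise.
large_slow = transfer_time(1_000_000_000, 0.1, 1_000_000)  # about 1000.1 s
large_fast = transfer_time(1_000_000_000, 0.2, 2_000_000)  # about 500.2 s
```

Under these assumptions, doubling the latency nearly doubles the time for the small file, while for the large file the higher-latency but higher-throughput option finishes in roughly half the time.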
The other aspect based upon which you might want to choose a CDN is the type, origin-pull versus push, because of the inherent differences in how you get files onto the CDN and how you integrate with them. Also, advanced CDN-specific features such as, for example, automatic lossless image optimization, real time statistics, authentication in the sense that, for example, you have an ecommerce website and you sell digital goods, meaning that you don’t want everybody to be able to access the files, so then you want some kind of signed URL or [Audio Gap]. So it really depends on your use case. You have to make sure that the CDN offers the features that you really, really need.
Finally, of course there are also the support aspects, different CDNs with different kinds of support and the costs. Now, how can you maximally exploit a CDN for resources, meaning for static files? These are simple tricks that can have a significant performance impact, so it’s really recommended that you always use them. The first one is actually the one with probably the least amount of performance impact, but it’s also the simplest one. It’s DNS prefetching; it’s just adding a small tag to your HTML head on every webpage where you use a CDN or even where you don’t use a CDN, and what you do there is reference the DNS, the domain name, for your CDN so that the web browser will already be able to look up the IP address for your CDN. This tag is at the top of the page, and only after that are the scripts, images and so on referenced. So by the time that the browser gets to parsing the actual resources, the DNS lookup will already have happened and you don’t incur the DNS lookup wait time.
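As a small illustration of DNS prefetching, the following Python sketch generates the `<link rel="dns-prefetch">` tags that would go into the HTML head. The helper name `dns_prefetch_tags` and the hostname `cdn.example.com` are placeholders chosen for the example; in Drupal the CDN module takes care of this for you.

```python
from html import escape


def dns_prefetch_tags(cdn_hosts):
    """Return one dns-prefetch <link> tag per CDN hostname, for the HTML <head>."""
    return "\n".join(
        '<link rel="dns-prefetch" href="//{}">'.format(escape(host, quote=True))
        for host in cdn_hosts
    )


# Placeholder hostname; substitute the domain your CDN provider gives you.
print(dns_prefetch_tags(["cdn.example.com"]))
```

The protocol-relative `//` form lets the same tag work on both HTTP and HTTPS pages.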
The second thing is auto-balancing over multiple CDN domain names, also known as domain sharding. This is very useful in cases where you have a lot of resources in a single webpage. For example, say that you have an ecommerce web shop and you have 100 product images on a single webpage. The typical web browser will do between 6 and 10 simultaneous HTTP requests to a single domain name, meaning that only, say, about 10 downloads are happening at the same time. So only when one of those 10 is ready can the next one occur, and so on. So as you can see, there is a lot of waiting going on for no good reason and the way you can solve that is by having multiple CDN domain names, multiple host names and balancing them automatically. For example, say that you would have four CDN domain names. You could automatically balance those 100 images over the four different domain names so that each has approximately the same amount - in this case, 25 – and then what you get is not 10 images being downloaded at the same time, but 40 images being downloaded at the same time. So given this theoretical case, this would accelerate the download of these 100 images fourfold, so that’s a very significant increase, but again, it’s really only useful when you have a lot of images, or in any case, a lot of resources on a single page.
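One way to balance files over several CDN domain names is to hash each file path and use the hash to pick a hostname. The Python sketch below shows that approach under stated assumptions: the function name `shard_url` and the four `cdnN.example.com` hostnames are hypothetical, and CRC32 is just one convenient stable hash. A stable hash matters because the same file should always map to the same hostname; otherwise the browser cache would see it as a different URL on each page view.

```python
import zlib


def shard_url(path, cdn_hosts):
    """Map a file path to one of several CDN hostnames, always the same one."""
    index = zlib.crc32(path.encode("utf-8")) % len(cdn_hosts)
    return "//{}{}".format(cdn_hosts[index], path)


# Placeholder hostnames for illustration.
hosts = ["cdn1.example.com", "cdn2.example.com", "cdn3.example.com", "cdn4.example.com"]
urls = [shard_url("/img/product-{}.jpg".format(i), hosts) for i in range(100)]

# With 10 connections per hostname, 4 hostnames allow up to 40 parallel downloads.
max_parallel = len(hosts) * 10
```

The split across hostnames will be approximately, not exactly, even, which is usually good enough for this purpose.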
The last one is actually the one with the most impact in my experience also for very small websites and it’s called Far Future expiration. It’s actually really simple, really logical. Essentially, browser caches are always faster than a CDN unless you have an extremely, extremely slow hard disk drive. If you needed to get a file from a CDN, then you’re always going to incur some level of latency. You have to download the file, you have to save the file on the disk and so on. Browser caches avoid that. If the file is in the browser cache, you don’t need to go to the CDN. How do you make sure that files remain in the browser cache for as long as possible? Well, you need to mark them to expire many years from now and that’s why it’s called Far Future expiration. The one downside to this is that if you are changing, for example, your web shop logo and then what happens if the file is cached in the browser cache is that it will not be downloaded again exactly because it is still in the browser cache. So now for some reason, some subset of users is still seeing the old logo. The way you can solve that is by using unique file URLs. You need to make sure that each file, whenever its contents change, it’s served from a different unique file URL and that will cause of course the browser cache to retrieve the new file. One funny side aspect or interesting side aspect to this is that actually if you implement this, it will actually cause your CDN costs to go down because there will be fewer requests to the CDN since the browser cache retains the file longer. So yes, these are free tricks that are really useful.
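Far Future expiration combined with unique file URLs can be sketched as follows. This Python example is an illustration of the general technique, not the CDN module's code: it derives the unique part of the URL from a hash of the file contents (one of several possible methods) and pairs it with a one-year `Cache-Control` header. The function names `versioned_url` and `far_future_headers` are hypothetical.

```python
import hashlib

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60


def versioned_url(path, content):
    """Return a URL that changes whenever the file's bytes change."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, dot, ext = path.rpartition(".")
    if not dot:
        # No file extension: just append the digest.
        return "{}.{}".format(path, digest)
    return "{}.{}.{}".format(stem, digest, ext)


def far_future_headers():
    """Response headers telling browsers to keep the file for a year."""
    return {"Cache-Control": "public, max-age={}".format(ONE_YEAR_SECONDS)}
```

When the web shop logo changes, its contents hash changes, so pages reference a new URL and browsers fetch the new file, while the old URL can safely stay cached.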
Now I’ve explained the different types of usage and overall CDN information, so now let’s move on to the Drupal CDN module. I maintain this module and if you have any questions, I look forward to seeing you in the question queue. As you can see, this is the default admin screen for the CDN module and there is a message at the top that says “If you install the advanced help module, the CDN module will provide more and better help.” So if you install it for the first time and you’re still finding your way around it, please install it and it will give you examples and technical background information, but in any case, there is very little to configure as you will soon see. This is the general tab and here you can either disable or enable the integration with CDN or you can enable the testing mode. In testing mode, you can play around with it without harming your actual users. You can give specific users access to the files on the CDN by giving them specific permission that allows them to do so, so this is great for giving it a try.
The second tab is called “Details” and this is where the bulk of the work happens, the bulk of the configuration. Essentially, there are two modes: origin-pull CDNs and File Conveyor for integrating with the project that I mentioned earlier for doing more advanced CDN integration. So we are going to go with origin-pull; this is actually exactly how it’s configured on my personal website. All you have to do really is copy-paste the CDN domain name that you get from your CDN provider, paste it in the CDN mapping field that you can see at the bottom of the current slide and that’s it. Just hit save and you will have CDN integration. I really tried to make it as simple as possible; however, you can also enable Far Future expiration. The CDN module automatically does all the aforementioned maximally-exploiting CDN tricks. Far Future expiration obviously involves the unique file URL aspect and the CDN module ships with several methods for generating unique file identifiers. You can even add your own.
The Drupal CDN module is great, I hope and believe, for simple use cases or for many use cases, even for more advanced ones, and it also does domain sharding, but there are also times when you should not use it. That’s when every millisecond matters. It’s really designed for ease of use and frontend performance, i.e., the serving from a CDN in the most easy manner possible. It’s not designed to have the absolute lowest overhead possible. In that case, you should just write a hook implementation of “hook_file_url_alter” which I added to Drupal 7 and that will allow you to easily do what you need to do without the overhead of an entire Drupal module. It’s also feasible or reasonable to use your own code when you have a very complex CDN mapping, but even the CDN module has support for that in the sense that it has a callback in which you can implement custom logic to determine which CDN should be used for which file.
So now I’ve explained the different ways you can use the CDN, how we can maximally exploit it, how we can use it with Drupal, but what you really want to do is actually prove that the CDN integration that you’ve done is actually having a positive performance impact on your website. That’s what these last two slides will be about. Ideally you are already doing continuous integration for your website or your application which is essentially making sure that no new bugs enter the website and that everything continues to work correctly while you should also, in theory or in the best case possible, have a continuous performance monitoring that makes sure that whenever you add new features or make improvements and so on, you’re not actually making or adding performance regressions. You want to make sure that your website stays as fast or gets faster, or at least if it got slower, that you actually know it got slower.
So there are two ways to do that. The first is synthetic user monitoring, which is essentially a test script, but it only works in a few browsers and it’s a very controlled environment in terms of networking and OS and so on, so it’s not really realistic. It doesn’t give a realistic picture of what your end users are seeing or experiencing, but it is really great as a reference point for tracking the performance of your site as it evolves. Exactly because it is a controlled environment, the only variable is really the changing code, so for internal use, this is excellent. But if you want to know how your website is experienced by your actual visitors and you want to improve that, then what you need is real user monitoring. This is actually measuring your actual visitors, hence in all browsers, hence in real-world environments, and it does give very realistic results. It actually shows you how fast your website is for real users. This also gives you the ability to see in which specific location or in which browser site performance is good or bad, so that you have more useful information for reproducing the problem and improving it for those specific cases.
So if we go look at synthetic user monitoring, I’m going to show you how you can do synthetic user monitoring for free as well as real user monitoring for free. So the first step here is to configure a dev or staging app or web server. Maybe you want to do this on production traffic; that’s also possible, but I’m assuming that you want to do this internally so that you can track the performance internally before it is pushed live. What you need to do is ensure that your pages’ resources are served from the CDN domain. Then you can perform a test with the free and open source WebPageTest.org, using a node, i.e., a test server with a browser, that is far away from your origin app server. Why exactly one that is far away? Well, we want to make sure that the CDN is having a positive performance impact, and if the node is far away from your app server, then by definition the latency to the origin is higher, and in theory, if the CDN is working well, it should have a lower latency. Hence, this should show a significant difference from the case where you’re not using a CDN.
So point three is running the test with the CDN, and in point four, what we’re going to do is use WebPageTest’s scripting engine to point to your origin server instead of the CDN: you make the CDN domain name point to your origin server by remapping it to a different, fixed IP address. Then run the test again, and all you have to do is compare the results. Of course this is a single measurement; you can repeat it, and you have to make sure that what you’re looking at is actually representative of the real world, so I recommend repeating it a few times to at least make sure that the largest variance is mitigated.
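Comparing repeated runs can be as simple as taking the median of each set of load times, which dampens the effect of a single outlier. This is a minimal sketch assuming you have copied the load times (in milliseconds) out of the test results by hand; the function name `compare_runs` and the result keys are made up for the example.

```python
from statistics import median
from typing import Dict, Sequence


def compare_runs(cdn_ms: Sequence[float], origin_ms: Sequence[float]) -> Dict[str, float]:
    """Compare median load times of repeated CDN and origin-only test runs."""
    cdn = median(cdn_ms)
    origin = median(origin_ms)
    return {
        "cdn_median_ms": cdn,
        "origin_median_ms": origin,
        # Positive means the CDN runs were faster than the origin-only runs.
        "improvement_pct": round(100 * (origin - cdn) / origin, 1),
    }
```

The median is used rather than the mean because a single slow run, which is common in network tests, would otherwise skew the comparison.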
Finally, real user monitoring. What you need to do then is ensure your production site’s pages’ resources are served from the CDN domain. Why production? Well, exactly because otherwise you’re not really testing with real users, unless of course you have some kind of mechanism where you’re serving the newest version of your website to a subset of your users; then you can use that instead. But in either of those cases, what you need to do is make sure that you’re using a CDN only for a subset of your users, for example, 50% of the users that you’re testing this against, because we want to compare non-CDN traffic versus CDN traffic. What you need to do then is install some kind of real user monitoring performance measurement tool (New Relic RUM and Torbit Insight have both got free packages, so you can try either of those) and then again compare the results.
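One way to serve the CDN to only a subset of users is to derive the decision from a stable visitor identifier, so that each visitor consistently lands in the same group. The sketch below assumes such an identifier exists (a session or cookie value, for instance); the name `use_cdn` and the 50% default are illustrative, not part of any tool mentioned in the talk.

```python
import hashlib


def use_cdn(visitor_id: str, cdn_share: float = 0.5) -> bool:
    """Deterministically assign a visitor to the CDN group or the non-CDN group."""
    digest = hashlib.sha256(visitor_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < cdn_share
```

Because the assignment is a pure function of the identifier, no extra state has to be stored, and the same visitor sees the same variant on every page view.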
So that’s all I wanted to share with you. I hope it was clear. If there’re any questions, then I’m sure I’ll hear about it later. I’ll now pass it on to Alex who will talk about Drupal plus Acquia.
Alex Jarvis: Hi everyone. Thanks, Wim. My name’s Alex Jarvis. I’m a senior technical account manager at Acquia and it’s been my good fortune over the last two years to help some of Acquia’s biggest customers adopt Akamai and integrate Akamai tools to improve their sites. I’m going to go over, fairly quickly and at a high level, the steps you’ll want to take on a Drupal site to take advantage of the Akamai network.
So the first rule of a CDN and an Akamai integration is really that every site is unique. I mean as Wim was pointing out, there’re a lot of factors and you need to customize the experience to be what your site needs. Really, I’m going to go over some best practices, but work with Akamai or your CDN provider to really understand the feature sets that they’re offering and to figure out how those work for your site because that’s going to be the most important thing. That said, for Akamai in particular, there are certainly some best practices and some initial steps that you need to take to really be able to leverage Akamai integration. At the most basic level, the changes that you need to make are in settings.php. Drupal, as I hope you know, already has some reverse proxy flags for CDNs in settings.php; they’re commented out by default, so you’ll want to uncomment those. In particular, you’ll want to uncomment and change the “reverse_proxy”, “reverse_proxy_header” and “omit_vary_cookie” config lines. What that’s going to do is tell Drupal that it is now behind a reverse proxy, the CDN, so that it will track its traffic differently. In particular, the reverse proxy header setting tells it to look at a different HTTP header for the real IP address of the users that are accessing it.
Akamai, for example, sends the user’s address in a True-Client-IP HTTP header. The reason you want to do that is, for example, in Drupal 7, there’s automatic throttling in place for things like failed login attempts. If you’re behind the CDN, all of your users that are coming through the same edge servers will appear to have the same IP, so if you have a university and a city that are all hitting the same edge network and one person in the city fails to log in, suddenly the users in the university nearby are getting blocked even though they have a different IP, because it all looks the same. So that’s the note there. Similarly for Akamai, and I believe others as well, the CDNs do not handle the Vary headers that Drupal sends out by default. Those are seen as cache busters and they will break cacheability of your site, so setting “omit_vary_cookie” will solve that because the “Vary: Cookie” header will not be included in responses.
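The client-IP logic can be illustrated with a short Python sketch. It is not Drupal’s implementation; it only shows the general rule that a forwarded-address header should be trusted when, and only when, the request really arrives from the proxy. The network range is a documentation range standing in for your CDN’s edge addresses, and `client_ip` is a hypothetical name.

```python
import ipaddress
from typing import Mapping

# Hypothetical range standing in for the CDN's edge servers.
TRUSTED_PROXIES = [ipaddress.ip_network("203.0.113.0/24")]


def client_ip(remote_addr: str, headers: Mapping[str, str]) -> str:
    """Return the visitor's address, honoring True-Client-IP only from trusted proxies."""
    addr = ipaddress.ip_address(remote_addr)
    forwarded = headers.get("True-Client-IP")
    if forwarded and any(addr in net for net in TRUSTED_PROXIES):
        return forwarded
    # Not from the CDN (or header missing): the connecting address is the client.
    return remote_addr
```

Checking the connecting address first matters because anyone who reaches the origin directly could otherwise spoof the header and evade per-IP throttling.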
Finally, another issue that comes up fairly frequently for basic configuration is the base URL for your content where Drupal will try to use the URL that the request is coming into it for all of the static assets. So this is your aggregated CSS and JS and often your embedded content like images as well and it will preserve that URL, but when you’re behind Akamai, when it makes requests back to origin, it makes it against an Akamai name, so the Akamai name may be “origin-akamaidomain.com” and you don’t want users making requests past Akamai. You want them to be going through Akamai for everything and to ensure that that happens, you need to check what the request that’s coming into you is – so that’s the line about forwarded hosts there – and grab the name of the server that is being requested and if that matches that origin domain, that means this request is coming to your origin from Akamai and you want to then rewrite your base URL to be the public Akamai domain so that all those static assets are served from that domain and are coming to Akamai and not to your origin directly.
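The base URL check described here boils down to a small conditional. The Python below is a sketch of that decision only; in a Drupal site it would live in settings.php in PHP, and both host names as well as the function name `base_url_for` are assumptions for the example.

```python
# Hypothetical host names: the name the CDN uses to reach the origin,
# and the public name visitors should always see.
ORIGIN_HOST = "origin.example.com"
PUBLIC_HOST = "www.example.com"


def base_url_for(request_host: str, scheme: str = "http") -> str:
    """Return the base URL to use for generated asset links."""
    # Drop any port and normalize case before comparing (IPv4/host names only).
    host = request_host.split(":")[0].lower()
    if host == ORIGIN_HOST:
        # The request came through the CDN: emit public URLs, never origin ones.
        host = PUBLIC_HOST
    return f"{scheme}://{host}"
```

Requests arriving under any other host name, such as a separate edit domain, keep their own base URL.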
Alright, so we got a quick question here. I think we’ll keep that for a later time. So there are two other best practices for Drupal and Akamai. First, Akamai provides a staging network that they set up, and by default they will point that at your production site along with their production network, so when you’re doing testing, you’re testing against production. A much more helpful configuration is to have Akamai point their staging network at your staging tier so that you can test things, change configurations and work with Akamai to make changes without impacting production at all or having any risk of impacting your users. The second piece of that is that you should really have separate domains: a public, CDN-facing domain, which is where users are coming in, and an edit domain, which is for your administrative users, where the people log in who are really creating the content and administering the site. This is a best practice with a CDN in general in my experience, because you don’t want your administrators coming to the same site that the public is using, for a variety of reasons. Chief among those is that the CDN is specifically designed to cache as much as it can. I mean if you’re optimizing the site, you want it to be as performant as it can be and give the best experience to the users, and that’s not really the same objective that you have for your administrators who, by definition, need to see an un-cached version of the page or have access to really see the latest and greatest that’s happening in the site before anyone else. For that reason, they shouldn’t be going through the front door, as it were, because the objectives of those two experiences are quite different. With Akamai especially, you will have unpredictable experiences where administrators will not see the content that they expect or they’ll be seeing some cached things, or even worse, Akamai or whatever network you’re using can cache administrative content on the public-facing site, so the user comes to a page that the administrator was just on and sees your admin bar or other undesirable content.
Similarly, by splitting out an edit domain from the rest of the site, you can take advantage of a whole bunch of additional security measures to make sure that the only thing talking to you from the public domain is your CDN environment and the only thing talking to your edit domain are your privileged users. I’ll talk about that a little bit more in a minute. Finally, as I already mentioned, you really want to make sure that you’re able to maximize cacheability and that you don’t need to put in any special rules or needlessly complicate your caching strategy by trying to use the same domain for both purposes. As I mentioned a moment ago, since you’re securing access, the way that I typically recommend that we approach that is if at all possible, if your company is running an internal DNS server, to have the edit domains entirely on your internal DNS so that no one outside of your network has access to it and the few people that hopefully do need access are able to set up their own host file entries for accessing that domain. Similarly, you can implement an IP whitelist that says “I don’t allow any access to my edit domain except from this IP range inside my network, my administrator’s home IP address” – hopefully they’re using a VPN and that may not even be necessary – “and my CDN or Akamai edge servers.”
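An IP whitelist of the kind described is straightforward to express. This is a minimal sketch with placeholder ranges: a private range standing in for the internal network and a single documentation address standing in for an administrator’s home IP. The function name `may_access_edit_domain` is hypothetical, and in practice the rule would usually be enforced in the web server or firewall rather than in application code.

```python
import ipaddress

# Placeholder ranges; replace with your own network and administrator addresses.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),       # internal network (assumed)
    ipaddress.ip_network("198.51.100.7/32"),  # one administrator's home address (assumed)
]


def may_access_edit_domain(remote_addr: str) -> bool:
    """Allow access to the edit domain only from whitelisted networks."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in ALLOWED_NETWORKS)
```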
So those are the best practices, and there’s a whole lot more that you can do; I’ll talk briefly about some of the other things that you can take advantage of once you have those split domains. One that I really enjoy, and that adds a lot of security to the site, is a blank users roles table. Drupal by default stores all of the user-to-role mappings (so, user ID 19 has these roles) in the users_roles table. You create a copy of that table and, if you’re using MySQL, you use the BLACKHOLE storage engine for the copy, where basically everything written goes to a black hole and nothing is actually stored. Once you have a separate edit domain, you can add a database prefix of “blank_” for the users_roles table on the public-facing domain, and then it will use this newly BLACKHOLE’d table when it does its lookups. As a result, everyone on the public site will have no roles assigned, so even someone who’s an administrator on the site, if they accidentally log in on the public-facing domain, will not have any roles assigned to them, and there is no risk that they can access administrative content and have that cached for the public. It’s fairly easy to set up and it gives you a really nice benefit, and then you only have to grant access to the content that you have for authenticated users. If maybe you have some content like comments that are only available to authenticated users, they’ll still be authenticated, but they will not have any other privileged roles assigned to them.
Similarly, you can mask that you’re even a Drupal site or that your site allows login or has any administrative functions. By modifying your .htaccess file for the site, you can again check the domain of incoming requests and, if it matches the public-facing site, you can return 403 Forbidden or even 404 Not Found for administrative paths, the user path, and the directories that are in the site by default but are not typically needed for serving site content. If you really want to mask that it’s Drupal, for example, you can disallow all access to node paths and only allow your aliases to be accessed. So this gives you a lot of control over customizing the user experience and securing the site in a way that isn’t generally convenient otherwise.
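The rule set described here would normally be written as .htaccess rewrite rules; the Python below only sketches the decision those rules encode. The host name, the list of blocked path prefixes and the function name `status_for` are assumptions chosen for the example.

```python
# Hypothetical public host and the path prefixes to hide on it.
PUBLIC_HOST = "www.example.com"
BLOCKED_PREFIXES = ("/admin", "/user", "/node")


def status_for(host: str, path: str) -> int:
    """Return 404 for administrative paths on the public domain, else 200."""
    if host.lower() != PUBLIC_HOST:
        return 200  # e.g. the edit domain: nothing is hidden here
    for prefix in BLOCKED_PREFIXES:
        # Match the prefix itself or anything below it, but not look-alikes
        # such as /users-guide.
        if path == prefix or path.startswith(prefix + "/"):
            return 404
    return 200
```

Matching on whole path segments keeps legitimate aliases that merely start with the same letters reachable.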
So some other things, again, this is fairly high level and it’s site-specific, but another cool option that you have with something like Akamai in front of your site, or a similarly good CDN, is that you can work with them and set up exclusion paths for things that may not be safe to cache in your CDN. Things to look out for are login and lockout paths, AJAX callbacks, and obviously any content creation paths; all of those need to be excluded. Another issue that comes up frequently arises if you’re caching content on your site for a long time, with a long TTL (time to live), because maybe certain pieces of content don’t change frequently. Drupal’s aggregated CSS and JS files by default go away after a period of time, and that can be problematic if the CDN still has the content of the page in place but doesn’t have all the static assets in cache anymore, and then comes back to your origin and requests aggregated files that haven’t existed for a week or however long. There are a couple of options for that. In Drupal 7, there’s a new variable, “drupal_stale_file_threshold”, that you can increase to be as long as your maximum TTL on your CDN, and it will ensure that those files are preserved for at least that length of time. On Drupal 6, that variable’s not yet present; however, there is the Advanced Aggregation module, “advagg”, which can similarly be set to preserve these files for an extended period.
Some other things that often come into play are issues around cookie domains, making sure that sessions are handled as you’d expect. You may need to change the cookie domain in your settings. And if you’re leveraging Akamai in particular, Akamai has a lot of other really nice options for using their services, including NetStorage, where you can switch Akamai from being a pull-based system, where it’s coming to origin for its requests, to a push-based system (Wim was giving the benefits of those earlier) where you can push the assets that you want to be served on the edge.
Alex Jarvis: Yes?
Moderator: I’m sorry to interrupt, but I just wanted to say thank you to anyone that has to sign off now. If anyone wants to stay on and finish the presentation and ask questions, we’ll continue.
Alex Jarvis: I’m sorry; we’ve run a little bit over. I’m almost done here, so I’ll wrap it up really quickly. So NetStorage has nice advantages for going from a pull to a push-based system. Similarly for securing the site, Akamai SiteShield allows you to restrict access back to your origin web servers, and if you really harness the advantages of CDNs and Akamai, you can do some crazy things like allow users to sign in, get details about what role they have on a site and then send them over to your Akamai domain and give them customized content as an anonymous user, but content that’s still customized to what roles they have. It’s cool things like that that can be done, that Akamai and Acquia have done together and that we’d be happy to help you with, and Acquia’s cloud is set up to do this. Aside from the initial configuration that I already mentioned, we work out of the box with Akamai on these things, and the sky is the limit for the ways that you can optimize and enhance the experience on your website.
So that’s all I had. Thank you so much and I believe I may give it back to Kieran here and we’ll have some Q&A.
Kieran Lal: Great. Thanks, Alex. That was really great content from everybody. One of the things we seem to have found through the presentation was that people were either asking questions directly of the speakers or maybe there’s something going on with WebEx. As the organizers, we only saw some of the questions, so I’d like to ask the panelists, Chris and Wim and Alex, if you’ve seen any additional questions that have shown up in your windows, to assign them to me. Then we’re going to go ahead and take a couple of questions that were directed at Wim earlier, which Wim has had a chance to read. They were around talking a little bit more about targeting assets that are pushed to a CDN versus ones that are not. So Wim, do you want to take that away?
Wim Leers: Sure. So essentially the question was “Can you talk a bit more about targeting what assets are pushed to CDN versus which ones are not?” Looking at the exact timestamp, I think this question relates to the Drupal CDN module. In general, you can implement any kind of logic you want to determine which files are served by the CDN or not. In the case of a push CDN, you control which files are pushed. In the case of an origin-pull CDN, you control the URL, and if you control the URL, then you determine whether it’s served from the CDN or not. For example, if your CDN domain is mysite.CDN.com and you use that domain name, then the origin-pull CDN will come back to your origin and get the file, so it’s served from the CDN. If you decide to serve it from mysite.com, then it won’t be. So that’s the general answer. In the specific case of the Drupal CDN module, what you do there is essentially say, “Hey, these specific file types should be served from the CDN.” So you list the domain name, then you list file extensions, for example CSS, JS, JPEG, PNG, GIF, whatnot. Those files are then served from the CDN, so it’s based on file extension in the case of the CDN module because that’s the easiest, most understandable and most performant solution there.
The other question was “Can dynamic responses static assets be targeted for CDN?” The answer is yes, but it really depends on what you understand by dynamic responses. So essentially anything is possible; for example, a thumbnail that shows the latest promoted product and that changes every day is something that can definitely be a use case. The tricky thing is always to ensure that it’s not cached for too long, or to make sure that it’s cleared when it needs to be cleared. In the case of Akamai for example, they support explicit purging, so there you can perform a purge command on that specific file so that the CDN will then come and get the new version. That’s one way of doing it, but the most HTTP-standard way is by using the proper Cache-Control headers. Essentially what you do there is say this file can be cached for X seconds or X hours or X days, and if your CDN takes that header into account, then you can rely on it. But you should make sure that your CDN actually honors those headers correctly, because that’s not the case for every single CDN; some may ignore any caching duration that is below, for example, five minutes or one minute, because otherwise the caching doesn’t make sense from a CDN point of view.
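As a small illustration of the Cache-Control approach, the sketch below builds the header value for a given lifetime and models the caveat about CDNs that impose a minimum caching duration. The `max-age` and `public` directives are standard HTTP; the function names and the idea of a fixed per-CDN minimum are assumptions for the example, since actual CDN behavior varies by provider and configuration.

```python
def cache_control_header(max_age_seconds: int) -> str:
    """Build a Cache-Control header value for a response."""
    if max_age_seconds <= 0:
        # Ask caches to revalidate with the origin before reusing the response.
        return "no-cache, must-revalidate"
    return f"public, max-age={max_age_seconds}"


def effective_ttl(max_age_seconds: int, cdn_minimum_ttl: int) -> int:
    """TTL a CDN would apply if it raises short lifetimes to its own minimum."""
    return max(max_age_seconds, cdn_minimum_ttl)
```

For the daily-changing thumbnail mentioned above, a lifetime of one day corresponds to `cache_control_header(24 * 3600)`; comparing the requested lifetime with `effective_ttl` shows when a CDN with a minimum duration would keep the file longer than you asked for.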
So the answer depends there, but if you’re talking about dynamic responses in the sense of page caching, serving the entire website from the CDN, then I think that maybe Alex or Chris from Akamai can answer the question better because I have little experience in that area.
Alex Jarvis: Sure, I’ll pick up on that. I’m going to jump back real quickly to the first question about pushing content to the CDN and just say from an Akamai perspective, for example, if you’re using NetStorage, your Akamai configuration will control what kinds of assets are looked up there and what the lookup order is, so that it looks for these sorts of assets in NetStorage first, and then you explicitly control yourself which files you push up to that environment. Speaking more to the second question about dynamic content, in most cases, as Wim indicated, you have some options there being creative with your cache control. But if it’s something where you really need something to be completely dynamic and it cannot be cached at all, then what you’re usually looking at doing is some form of ESI (Edge Side Includes), where you expose that dynamic piece in some individually chunkable way that can be requested by ESI, so that when the page renders, three-fourths of the page, or hopefully as much of it as possible, comes from the cache, and the dynamic piece makes a callback to some service or some portion of the site that can deliver that dynamic content very efficiently and very quickly, because you’re bypassing the CDN for that. In my experience, it’s not particularly feasible to have the CDN directly cache or directly hold dynamic content, because the CDN is going back to the origin for anything that it doesn’t have itself. It’s not processing things. In order to be fast, it needs to be able to just serve something itself and not be concerned about computing it, so if you need to do that, you really need to be looking at how to provide a way for origin to deliver that quickly.
Kieran Lal: Okay, great. Thanks, Alex. We’ve got one more question and let me just pull it up here so that I can read it. So it was from Kai and he said “We’ve seen examples with which media files are hosted independently from the site itself. When or why would this be used instead of a traditional CDN?”
Alex Jarvis: I guess I’m not quite following the question. It sounds to me like that basically is a CDN, or at least that approach of sharding the assets so that they can be requested separately from the site content. I’d be curious if perhaps I’m misunderstanding the question slightly.
Kieran Lal: Kai, feel free to jump in and clarify your question, but I guess my thought would be when do I put my videos on YouTube and have them embedded and delivered as part of my site from YouTube? Then when do I host them, but have the videos delivered directly from the CDN or something along those lines?
Alex Jarvis: So in a case like that, I mean basically anything that you can offload from your origin is a win. The advantage of running your own CDN versus something like YouTube is that you have a lot more control over where that content is coming from and how it’s being handled. If you’re relying on a third party, then you’re also relying on their content delivery system, and you don’t know how YouTube distributes access to their videos. I mean in this case, with YouTube specifically, we’re talking about Google and we can probably assume that it’s going to be pretty good, but you don’t know precisely what’s happening, whereas with the CDN, you, by definition, have a lot more control even if it’s an abstract flavor. I mean you don’t know where every single edge server in Akamai is for instance, but you do know that you can look at the statistics, you can get the information about what’s being served, you can get a sense of where in the world those things are coming from and how they’re being used. So it really comes down to having a better sense of where your content is, how it’s being accessed and by whom, and how it’s being served. That’s where having a CDN that you’ve set up and you understand empowers you in ways that a third-party service, which may not or won’t provide the same level of information, does not.
Kieran Lal: Great. Thanks, Alex. Kai, if you needed to – okay, so one of the things he was saying, the example he was thinking about was, say, an audio hosted independently, but I think he understands based on your response, so great.
Okay, we’re a good chunk over, but as promised, we had some really outstanding content and some real depth and expertise from all of our speakers, from Chris and from Wim and from Alex, so thanks everybody for attending. Hannah, was there anything else you wanted to say to wrap up?
Moderator: No. Thanks everyone for attending. We’ll send you slides and the recording within the next 48 hours.
Alex Jarvis: Alright. Thank you all very much and happy holidays to everyone.