IT Systems Testing at Scale
Testing your IT systems is valuable to the business, but there's a lot of confusion about why, where, what, how, and when to do it. Testing is a marathon; focus on your goals, pace yourself, and be prepared for change.
Why spend the time to test? There are many types of tests, and many things to test, but the final intent should be very clear from the beginning. Do you want to:
- Ensure code quality?
- Verify functionality?
- Discover performance bottlenecks?
- Look for overcapacity spending?
- Explore alternative workflows?
- Ensure system failures are alerted on?
- Check for security flaws?
If you answered “yes” to any of these, then testing is essential.
Where to Test
Getting started with testing, and even maintaining existing tests, requires a focus on which components of the system offer the most business value. This includes features that are used most often by all users, features used often that provide business insight for customers, features that help improve new customers' experiences, and/or features that help keep existing users. Ask yourself: what are the top 10 most common actions users perform every day? Ensuring that the homepage, user login, user signup, search, and shopping cart of a website are working is much more important than supporting a rarely used feature like updating a user's birthday. When in doubt, you can usually learn this from an analytics tracking system such as Google Analytics.
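If you can export raw events from your analytics system, ranking actions by frequency takes only a few lines. The sketch below is a minimal illustration: the event list, the action names, and the `top_actions` helper are all made up for this example and do not come from any particular analytics product.

```python
from collections import Counter

# Hypothetical event log: (user_id, action) pairs, as might be exported
# from an analytics tool. All names and values are invented for illustration.
events = [
    ("u1", "homepage"), ("u1", "login"), ("u1", "search"),
    ("u2", "homepage"), ("u2", "search"), ("u2", "add_to_cart"),
    ("u3", "homepage"), ("u3", "login"), ("u3", "update_birthday"),
]


def top_actions(events, n=10):
    """Return the n most frequent actions as (action, count) pairs."""
    counts = Counter(action for _user, action in events)
    return counts.most_common(n)


for action, count in top_actions(events, n=3):
    print(f"{action}: {count}")
```

The actions at the top of this ranking are the first candidates for tests; the ones at the bottom, such as the birthday update here, can wait.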
You can test any part of an IT system. Here are some examples of where to start:
- Web interface = JS testing, Mocha and Jasmine
- Browser support = Selenium
- Website end-to-end response = Pingdom
- Load balancers = JMeter
- HTTP cache = Varnishtest
- Web server response = JMeter, Gatling
- Application functionality = Unit Testing
- Database = JMeter, data testing
- Filesystem = Various (see list)
- Message queue = Various (see list)
- Monitoring = ELK, SignalFX, Nagios, Splunk, SumoLogic, Loggly, Datadog
- Alerting = ELK, SignalFX, Datadog, New Relic, Alert Logic
- Breach detection = ThreatStack, Snort, Nagios, OSSEC, TrendMicro, Tripwire
- Code security = Static Code Analysis
- Application security = Dynamic Code Analysis
- Input robustness = American Fuzzy Lop
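To make one row of this list concrete, here is a rough sketch of a web server response check using only the Python standard library. It is not a substitute for JMeter or Gatling, which generate real load; it only shows the shape of a single-request smoke test. The `StubHandler` server, the `check_response` helper, and the one-second budget are assumptions made for the example.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubHandler(BaseHTTPRequestHandler):
    """Stand-in for a real application; always answers 200 with a short body."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the output quiet


def check_response(host, port, path="/", max_seconds=1.0):
    """Send one GET and return (status, elapsed_seconds, within_budget)."""
    start = time.perf_counter()
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        status = response.status
    finally:
        conn.close()
    elapsed = time.perf_counter() - start
    return status, elapsed, elapsed <= max_seconds


# Port 0 asks the operating system for any free port, so the sketch
# does not collide with a service that is already running.
server = HTTPServer(("127.0.0.1", 0), StubHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

status, elapsed, within_budget = check_response(host, port)
print(f"status={status} within_budget={within_budget}")

server.shutdown()
server.server_close()
```

Pointing `check_response` at a real host and running it on a schedule gives you a very small version of what an end-to-end monitor does.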
How and When to Test
Decide what will be tested and what you want to achieve; that will inform how you will test. If you want to test individual components in isolation, you can write simple, small tests using a single tool. Testing the interaction between multiple services will involve multiple testing tools. Performing end-to-end testing will always be confusing, tedious, and long-running, but will provide very high-quality feedback on when and where something has failed. Testing a distributed system will require a distributed approach with multiple simultaneous tests, which adds another layer of complexity, but will also increase the thoroughness of those tests.
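As an illustration of testing one component in isolation, the sketch below replaces a dependency with a mock so the test touches nothing else. The `cart_total` function and the pricing service it calls are hypothetical; only `unittest` and `unittest.mock` are real, and both ship with Python.

```python
import unittest
from unittest.mock import Mock


def cart_total(cart_items, price_service):
    """Sum item prices; price_service is any object with get_price(sku)."""
    return sum(price_service.get_price(sku) * qty for sku, qty in cart_items)


class CartTotalTest(unittest.TestCase):
    def test_total_uses_price_for_each_item(self):
        # The mock stands in for the real pricing service, so this test
        # needs no network, database, or other running component.
        prices = Mock()
        prices.get_price.side_effect = lambda sku: {"apple": 2, "pear": 3}[sku]

        total = cart_total([("apple", 2), ("pear", 1)], prices)

        self.assertEqual(total, 7)
        self.assertEqual(prices.get_price.call_count, 2)


# Run the test case directly instead of calling unittest.main(), which
# would exit the interpreter when it finishes.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CartTotalTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A test like this runs in milliseconds, but it says nothing about whether the real pricing service behaves the way the mock does. That gap is what the multi-service and end-to-end tests are for.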
Testing should be done all day every day. It won't be done daily at first, but you should strive for it. More tests, done quickly, create a positive feedback loop for developers, project managers, sales, and marketing teams to understand the quality and state of their product. If a test takes more than a few minutes to run, and only tests a small part of a system, it's rarely valuable. Initially, all tests can be run within a few minutes. As the system grows, those testing times will increase. It's very common for large organizations to have robust testing suites that are only run once a night. With time and resources, many long-running tests can be cut from hours to minutes by parallelizing and distributing them.
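The effect of parallelizing is easy to see in miniature. In the sketch below, `slow_check` is a made-up stand-in for an independent test that spends its time waiting on I/O, and the eight checks and 0.1 second delay are arbitrary. It assumes the tests share no state, which is the precondition for running them side by side.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_check(name, seconds=0.1):
    """Stand-in for an independent, I/O-bound test: waits, then passes."""
    time.sleep(seconds)
    return name, True


checks = [f"check_{i}" for i in range(8)]


def run_serial(names):
    """Run every check one after another and time the whole batch."""
    start = time.perf_counter()
    results = [slow_check(name) for name in names]
    return results, time.perf_counter() - start


def run_parallel(names, workers=8):
    """Run the checks on a thread pool; map() keeps results in input order."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(slow_check, names))
    return results, time.perf_counter() - start


serial_results, serial_time = run_serial(checks)
parallel_results, parallel_time = run_parallel(checks)
print(f"serial: {serial_time:.2f}s, parallel: {parallel_time:.2f}s")
```

Threads help here because the checks spend their time waiting. CPU-heavy tests would need separate processes or separate machines, which is where distributing the suite comes in.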
What's the difference between testing 10 servers and testing 10,000? With fewer than 100 servers, you can often start with a single testing machine running a single test at a time. As you scale your infrastructure to support more than 10K concurrent users, you'll hit the "C10K problem." Initially you can overcome these issues by scaling vertically with a more powerful server, and by tuning your servers to cache and index better and to use threads more efficiently. Eventually, though, you'll need to scale your servers horizontally by using more total machines. Scaling horizontally adds the complexity of "how do you make sure all of these servers are sharing information correctly?" This is when concepts like load balancing, clustering, sharding, and shared state will be explored. These infrastructure architecture changes will require different testing tools, and different combinations of them.
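One small piece of the "sharing information correctly" question can be tested directly: do all machines agree on where a given key lives? The sketch below assumes a simple hash-modulo sharding scheme, which is only one of many possible designs; `shard_for`, the key format, and the shard count are invented for the example.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index using a hash that is stable across machines.

    Python's built-in hash() is randomized per process for strings, so it
    would give different answers on different servers; SHA-256 does not.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


num_shards = 8
keys = [f"user-{i}" for i in range(10_000)]
assignments = [shard_for(key, num_shards) for key in keys]

# Agreement check: computing the route a second time must give the same
# answer, as it would on any other server running the same function.
assert assignments == [shard_for(key, num_shards) for key in keys]

# Balance check: count how many keys land on each shard.
counts = [assignments.count(shard) for shard in range(num_shards)]
print(counts)
```

Checks like these are cheap to run on every build, and they catch routing mistakes long before a load test would.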
What you do for testing will change dramatically as you scale your infrastructure out. I've seen many companies grow from 100 servers to more than 10,000 in as little as a year. That is dramatic growth for the business, and it also brings a dramatic change in the technology and in the testing needed. Any change to a technology will bring a higher maintenance cost. The best approach to minimize this cost is to:
- Test only the most valuable aspects of your system
- Ensure tests are run quickly
- Store every aspect as code (database schema, testing data, infrastructure schema, service configs, etc.)
- Implement a canary / testing stage
- Make it easy to reset system data between test runs
- Follow common conventions, such as RFC standards
- Reduce the complexity of the system
- Reduce the total number of features
- Remove old, rarely used features
- Dedicate time and resources to testing
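Two items from this list, storing everything as code and resetting data between runs, work well together. The sketch below keeps a schema and seed data in code and rebuilds an in-memory SQLite database before every test. The `users` table and its rows are made up for the example; a real system would apply the same idea to its own database.

```python
import sqlite3
import unittest

# Schema and seed data live in code, so any run can rebuild them from scratch.
SCHEMA = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
SEED = [(1, "alice"), (2, "bob")]


def fresh_database():
    """Create a new in-memory database with the schema and seed rows loaded."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", SEED)
    conn.commit()
    return conn


class UserTableTest(unittest.TestCase):
    def setUp(self):
        # Runs before every test method, so each test sees the same data.
        self.conn = fresh_database()

    def tearDown(self):
        self.conn.close()

    def user_count(self):
        return self.conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

    def test_delete_user(self):
        self.conn.execute("DELETE FROM users WHERE id = 1")
        self.assertEqual(self.user_count(), 1)

    def test_seed_is_intact(self):
        # Passes in any test order, because setUp rebuilt the data.
        self.assertEqual(self.user_count(), 2)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(UserTableTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because no test depends on what another test left behind, tests like these can also be reordered or run in parallel without surprises.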
Testing is really hard, and it only gets harder with scale. There's no one-size-fits-all approach to testing; it is always unique to the project. However, these are common problems shared by many companies, and the IT industry has learned over time the common approaches needed to overcome them. No single testing system will give you everything you need; it takes a combination of multiple tools and approaches. Be patient, focus on high-value features and functions, and build out the thoroughness of your tests slowly over time. Share the testing results with others, particularly outside of the technical teams. Testing data can and should inform your business, and help you succeed.