Scalability Testing: Tutorial & Best Practices
Unit and integration tests, while important, cover only part of what a robust testing strategy requires. Engineering departments also use non-functional testing to predict how their systems and applications will behave under various foreseeable conditions.
In non-functional testing, the goal is to analyze not specific application behaviors but how the underlying system handles various conditions. Performance, load, and scalability tests are among the most commonly used forms of non-functional testing.
Scalability testing is a form of non-functional testing that observes how an application’s performance and stability are impacted as user traffic scales up or down. It is similar to load testing, which simulates variations in stress (in the form of user load) on a system.
Scalability testing has many use cases. It can be used to do the following:
- Simulate sudden changes in user traffic, such as during peak hours or off-hours.
- Predict the user experience for various network conditions, regions, and devices.
- Monitor for potential issues and catch them before release.
- Release new features with confidence that end-users will not experience crashes or slowdowns.
This article looks at eight best practices for scalability testing. We also cover proper testing strategies for maintaining your systems.
Summary of key scalability testing concepts
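The following eight practices, each covered in its own section below, come together to form the foundation of a robust scalability testing suite.
- Run scalability testing regularly: Regular runs catch new issues as they are introduced and build a history of results.
- Simulate real user behavior in your test scripts: Virtual users reproduce the volume and unpredictability of real traffic.
- Choose the right tooling: Pick tools that fit the tech stack and the components that most need testing.
- Use a hosted service: Hosted tools remove the work of building a test environment or coordinating with an outside vendor.
- Track relevant performance metrics: Baseline latency, response time, throughput, and device utilization serve as reference points.
- Test both upward and downward scalability: Ramp load up and down to mirror real traffic and confirm that resources are released.
- Use historical data to observe trends: Recorded test results support reporting, troubleshooting, and planning.
- Monitor CPU and memory usage: Low memory availability or high CPU usage can lead to slowdowns and crashes.
Scalability testing best practices in detail
The following sections elaborate on the best practices presented in the summary above to help your development team leverage the benefits of scalability testing for your application.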
Run scalability testing regularly
Regular runs of a scalability testing suite help catch new potential issues as they are introduced. If a test run yields unexpected errors, the development team can quickly identify root causes or roll back faulty code changes.
Examining 95th and 99th percentile (dubbed “p95” and “p99”) test results also provides substantial benefits. These values mark the response times that only the slowest 5% and 1% of requests exceed, respectively, and they often represent the experience of power customers who bring (or acquire) large datasets.
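To make the percentile idea concrete, here is a minimal sketch that computes p50 and p95 with the nearest-rank method. The response times are made-up sample values, and real tools may interpolate between samples instead, so treat this as an illustration rather than a reference implementation.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Hypothetical response times in milliseconds: mostly fast, two slow outliers.
const responseTimesMs = [12, 15, 11, 14, 13, 16, 12, 18, 250, 900];
const meanMs =
  responseTimesMs.reduce((sum, t) => sum + t, 0) / responseTimesMs.length;

console.log(`mean = ${meanMs.toFixed(1)} ms`);              // mean = 126.1 ms
console.log(`p50  = ${percentile(responseTimesMs, 50)} ms`); // p50  = 14 ms
console.log(`p95  = ${percentile(responseTimesMs, 95)} ms`); // p95  = 900 ms
```

The mean and the median both hide the 900 ms outlier that the p95 value exposes, which is why tail percentiles are worth tracking alongside averages.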
Real-world scenarios for incorporating scalability testing into a team’s workflow might include the following:
- Daily or weekly manual runs, with a scheduled process for recording results
- Integration into a CI/CD pipeline, per deployment or PR
- Testing before major feature releases and system updates
- In a microservices architecture, incorporating non-functional testing into a distributed system with tools like Jaeger, Kubernetes, Terraform, or others
When employing scalability test results to investigate a live issue, it is helpful to have a history of past test runs available. Integrating scalability testing into existing CI/CD pipelines is an ideal way to ensure that your team maintains a consistent history of test data, which allows you to observe trends and perform comparative analysis. If resource constraints do not allow for daily or weekly runs of scalability testing operations, it can still be of use for QA-testing features before upcoming feature releases.
Simulate real user behavior in your test scripts
An effective testing methodology accurately represents real user data and workflows. Two of the most common challenges in simulating real-world scenarios are generating large volumes of user traffic and mimicking the unpredictability of real users at scale. Utilizing virtual users (VUs) for testing can solve both problems.
A VU represents a user or agent who might access a system resource, such as a REST API. Various numbers of VUs can be generated to execute test scripts on the test target concurrently, which simulates the behavior of real users.
One issue that frequently arises in scalability testing (whether or not it is performed using VUs) is coordinated omission. This problem stems from the fact that many testing tools operate by sending one request, awaiting its completion, recording the execution time, and only then starting the loop for the next request. Suppose an average request takes 10 ms to run and the tool is meant to send one every 100 ms. If the service stalls for several seconds, the tool waits along with it, so the requests that should have been sent during the stall are never issued and only one slow response is recorded. These delays could be due to rate limiting, network timeouts, or other issues, and they are underrepresented in the results whenever the test loop is synchronized with the system under test.
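The effect can be illustrated with a toy simulation. The sketch below assumes a hypothetical server that normally answers in 10 ms but stalls between the 1-second and 3-second marks, and it compares a closed loop (wait for each response before sending the next) with a schedule that keeps sending every 100 ms regardless. All constants are illustrative assumptions, not measurements.

```javascript
const INTERVAL_MS = 100;     // intended gap between requests
const SERVICE_MS = 10;       // normal processing time
const STALL_START_MS = 1000; // hypothetical stall window
const STALL_END_MS = 3000;
const TEST_END_MS = 5000;

// Time at which a request sent at `sentAt` completes in this toy model.
function completionTime(sentAt) {
  const stalled = sentAt >= STALL_START_MS && sentAt < STALL_END_MS;
  return (stalled ? STALL_END_MS : sentAt) + SERVICE_MS;
}

// Closed loop: the next request is not sent until the previous one returns.
function closedLoopLatencies() {
  const latencies = [];
  let sendAt = 0;
  while (sendAt < TEST_END_MS) {
    const doneAt = completionTime(sendAt);
    latencies.push(doneAt - sendAt);
    sendAt = Math.max(sendAt + INTERVAL_MS, doneAt);
  }
  return latencies;
}

// Open loop: every request is sent at its intended time, stall or not.
function openLoopLatencies() {
  const latencies = [];
  for (let sendAt = 0; sendAt < TEST_END_MS; sendAt += INTERVAL_MS) {
    latencies.push(completionTime(sendAt) - sendAt);
  }
  return latencies;
}

const countSlow = (latencies) => latencies.filter((ms) => ms > 100).length;

console.log("closed loop slow samples:", countSlow(closedLoopLatencies())); // 1
console.log("open loop slow samples:", countSlow(openLoopLatencies()));     // 20
```

In this model, the closed loop records a single slow request for a two-second outage, while the open schedule records twenty, which is much closer to what real users arriving independently would have experienced.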
Tools like Multiple solve the coordinated omission problem by deploying VUs with no synchronization to one another. They are not launched at specified intervals, and all points during a test run are considered valid times to launch a new VU. That means that generated user loads will be more randomized and organic at scale, and you can better trust the results of test runs.
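One common way to produce unsynchronized arrivals is to draw the gaps between launches from an exponential distribution, which yields a Poisson arrival process. The sketch below illustrates that general technique only; it is not a description of how Multiple or any other tool schedules VUs internally, and the function name and parameters are hypothetical.

```javascript
// Generates launch times (in ms) for a Poisson arrival process with an
// average of `ratePerSecond` launches per second over `durationMs`.
// `random` can be replaced with a seeded generator for repeatable runs.
function poissonArrivalTimes(ratePerSecond, durationMs, random = Math.random) {
  const times = [];
  let t = 0;
  while (true) {
    // Exponentially distributed gap with a mean of 1 / ratePerSecond seconds.
    const gapMs = (-Math.log(1 - random()) / ratePerSecond) * 1000;
    t += gapMs;
    if (t >= durationMs) break;
    times.push(t);
  }
  return times;
}

// Roughly 50 launches per second for 10 seconds, at irregular intervals.
const launchTimes = poissonArrivalTimes(50, 10000);
console.log(`${launchTimes.length} launches scheduled`);
console.log("first gaps (ms):", launchTimes.slice(0, 3).map((t) => t.toFixed(1)));
```

Because the gaps vary, launches sometimes cluster and sometimes thin out, which avoids the fixed rhythm that makes coordinated omission easy to miss.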
Choose the right tooling
There is no one-size-fits-all tool for testing. The ideal tools depend on the tech stack, system requirements, code dependencies, and testing needs. A combination of services often forms the basis of an effective non-functional testing approach.
In choosing a testing tool, the first step is to determine what system components to test. Storage services, hotspot queries, and computationally intensive endpoints are good starting points. Once the crucial components have been identified, choose a tool that easily facilitates testing those components. For example, if your application uses MongoDB, you may wish to connect Mongo directly to your testing tool and run queries on it. In this case, you would want to choose a tool that supports Mongo server connections without extensive setup time.
Use a hosted service
Before the advent of cloud technologies, it was more difficult to set up non-functional tests because servers and test reporting systems would need to be created from scratch. It was—and still is—common to utilize a third-party vendor for non-functional testing.
These vendors provide end-to-end testing services with detailed reports, but this approach comes with its own limitations. First, third-party services may not respond as quickly to changes to infrastructure or CI/CD pipelines, which makes them less compatible with agile methodologies and general business requirements. In addition, a team may require a different infrastructure when testing smaller features and services, and it can be difficult to coordinate third-party vendors’ test runs with these one-off configurations.
More recently, hosted testing tools have emerged to alleviate the difficulties of setting up a test environment or coordinating with an outside vendor. Tools like Multiple allow developers to load any number of VUs and run a large test suite with ease. This saves considerable developer time and helps ensure that test results are not affected by the idiosyncrasies of the testing tool itself.
Track relevant performance metrics
To begin understanding system performance, it is crucial to obtain baseline performance metrics to serve as a reference point as applications and systems grow. The process of obtaining these metrics depends on a system’s core infrastructure, so be sure to consult the documentation of your infrastructure provider before writing test cases.
An engineering department with defined service-level objectives (SLOs) already has insight into its customer base's performance expectations. These SLOs offer quantitative data on product and technical standards, serving as a reference for test scenarios. For instance, an SLO specifying 99.5% server uptime would prompt tests to ensure this uptime under heavy load.
Here are some useful system metrics to be aware of:
- Latency: The time it takes a request to travel across the network from the client to the server that handles it, before any processing begins.
- Response time: Latency plus the total processing or execution time before a response is sent back by the receiving server. Response time provides a more holistic view of the time it takes to complete an entire operation, including all the various delays and processing steps involved.
- Average throughput: The rate at which a system processes requests; in practice, this usually refers to the number of HTTP requests a server can process per unit of time. For example, if we run a system test that generates a load of 1,000 requests per second (RPS) for a specified duration, the throughput would be 1,000 transactions per second (TPS) for that duration, provided the system completes every request. This metric is important when determining horizontal scaling needs: At what point should the number of servers be increased because the current servers cannot take on more load?
- Average device utilization: How much time hardware devices connected to a system spend in use, on average. This metric is typically expressed as a percentage, where higher values reflect higher usage ratios, and it tells teams how efficiently hardware resources such as CPUs, memory, and storage devices are being used. What to measure and how to track it will depend on the nature of your system. In the IoT space, for instance, developers may wish to track CPU utilization on remote hub devices or embedded machines for performance monitoring and capacity planning. Average device utilization is particularly important in domains where hardware resources are critical, such as IoT, control systems, and biomedical applications.
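The relationship between these metrics can be sketched with a few lines of arithmetic. The example below assumes a hypothetical record format with client send time, server receipt time, and client response time in milliseconds (which in practice requires synchronized clocks), and the sample values are invented for illustration.

```javascript
const average = (values) =>
  values.reduce((sum, v) => sum + v, 0) / values.length;

// Summarizes latency, response time, and throughput over a test window.
// Only completed requests count toward throughput.
function summarize(records, windowMs) {
  const completed = records.filter((r) => r.ok);
  return {
    avgLatencyMs: average(completed.map((r) => r.receivedAt - r.sentAt)),
    avgResponseTimeMs: average(completed.map((r) => r.respondedAt - r.sentAt)),
    throughputPerSec: completed.length / (windowMs / 1000),
  };
}

// Share of the window during which a device was busy, as a percentage.
function utilizationPercent(busyMs, windowMs) {
  return (busyMs / windowMs) * 100;
}

// Hypothetical measurements for a 2-second window.
const records = [
  { sentAt: 0, receivedAt: 20, respondedAt: 70, ok: true },
  { sentAt: 500, receivedAt: 530, respondedAt: 590, ok: true },
  { sentAt: 1000, receivedAt: 1010, respondedAt: 1110, ok: true },
  { sentAt: 1500, receivedAt: 1540, respondedAt: 1540, ok: false },
];

console.log(summarize(records, 2000));
// { avgLatencyMs: 20, avgResponseTimeMs: 90, throughputPerSec: 1.5 }
console.log(utilizationPercent(1500, 2000)); // 75
```

Response time is always at least as large as latency because it adds processing time on top of the network trip.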
To establish baseline measures of various performance metrics, developers typically run a set of smoke tests on the system. Smoke tests perform relatively simple operations (such as calling a public API endpoint) to provide reasonable expectations for an application’s baseline performance and fluctuations.
Once the baseline metrics are established and recorded, one or more VUs can be utilized to execute test scripts to evaluate the efficiency of core application logic, establish expectations for how long VU loops should run, and note any fluctuations in response time.
Test both upward and downward scalability
Under real-world conditions, heavy system load rarely appears all at once: user traffic grows and shrinks over time. It could take minutes or hours for an application to reach peak daily traffic and just as long for traffic to return to normal after the end of peak hours.
Scalability testing should mimic these traffic patterns as closely as possible, using different ramp-up times to simulate a wide range of potential scenarios and load distributions. For example, users may become highly active in the first minute following a particular event or in response to a promotional sale or shopping season. Adjusting the ramp-up period and load on scalability tests to match this scenario provides the most accurate insights into how the system will respond to such events.
Downward scalability matters as well. In particular, it is important to ensure that the system does not continue to consume too many resources for an extended period of time once user traffic has decreased. If your system employs autoscaling, ensure it is set up to quickly and efficiently tear down expensive and underutilized resources.
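A load profile with a ramp-up, a hold, and a ramp-down can be expressed as a simple function of elapsed time. The sketch below is one possible shape, with hypothetical parameter names and values, plus a helper that estimates how long a system took to scale back down from a series of made-up instance-count samples.

```javascript
// Target number of virtual users at `elapsedMs` for a trapezoid profile:
// linear ramp-up, hold at peak, linear ramp-down.
function targetVus(elapsedMs, { peakVus, rampUpMs, holdMs, rampDownMs }) {
  if (elapsedMs < 0) return 0;
  if (elapsedMs < rampUpMs) {
    return Math.round((peakVus * elapsedMs) / rampUpMs);
  }
  if (elapsedMs < rampUpMs + holdMs) return peakVus;
  const intoRampDown = elapsedMs - rampUpMs - holdMs;
  if (intoRampDown < rampDownMs) {
    return Math.round(peakVus * (1 - intoRampDown / rampDownMs));
  }
  return 0;
}

// Time between the end of the load and the first sample at or below the
// baseline instance count, or null if the system never scaled back down.
function scaleDownDelayMs(samples, loadEndMs, baselineInstances) {
  const settled = samples.find(
    (s) => s.atMs >= loadEndMs && s.instances <= baselineInstances
  );
  return settled ? settled.atMs - loadEndMs : null;
}

const profile = { peakVus: 200, rampUpMs: 60000, holdMs: 120000, rampDownMs: 60000 };
console.log(targetVus(30000, profile));  // 100 (halfway up the ramp)
console.log(targetVus(90000, profile));  // 200 (holding at peak)
console.log(targetVus(210000, profile)); // 100 (halfway down the ramp)

const instanceSamples = [
  { atMs: 240000, instances: 8 },
  { atMs: 300000, instances: 5 },
  { atMs: 360000, instances: 2 },
];
console.log(scaleDownDelayMs(instanceSamples, 240000, 2)); // 120000
```

Shortening `rampUpMs` models a sudden spike such as a promotional sale, while the scale-down delay gives a number to track for how long expensive resources linger after traffic drops.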
Use historical data to observe trends
Recording system responses to non-functional tests opens the door to ongoing reporting and analysis. This is valuable for daily maintenance, troubleshooting, and stakeholder discussions. From an engineering perspective, daily or weekly runs of a testing suite produce opportunities to identify trends over time. Those trends—and changes in them—provide valuable references in architecture planning, post-mortem assessments, and technical documentation.
Many teams configure a dashboard for monitoring critical business and performance metrics. These teams typically preserve “snapshots” of regular test runs. Snapshots most often take the form of regular exports of metrics collected during the testing period. It is common for various teams to manage their own testing workflows, which makes it important to select tools like Multiple that enable them to collaborate.
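Once snapshots accumulate, even a simple comparison can surface a trend. The sketch below assumes each snapshot has been reduced to a hypothetical `{ date, p95Ms }` record and flags the latest run when its p95 exceeds the median of earlier runs by more than a chosen tolerance. The 20% tolerance and the sample data are arbitrary illustrative values.

```javascript
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Compares the most recent run against the median of all earlier runs.
function detectRegression(history, tolerance = 0.2) {
  if (history.length < 2) return { regressed: false, baselineP95Ms: null };
  const latest = history[history.length - 1];
  const baselineP95Ms = median(history.slice(0, -1).map((run) => run.p95Ms));
  return {
    regressed: latest.p95Ms > baselineP95Ms * (1 + tolerance),
    baselineP95Ms,
    latestP95Ms: latest.p95Ms,
  };
}

// Hypothetical weekly snapshots of p95 response time.
const history = [
  { date: "2024-05-01", p95Ms: 180 },
  { date: "2024-05-08", p95Ms: 190 },
  { date: "2024-05-15", p95Ms: 185 },
  { date: "2024-05-22", p95Ms: 240 },
];

console.log(detectRegression(history));
// { regressed: true, baselineP95Ms: 185, latestP95Ms: 240 }
```

Using the median of past runs rather than the single previous run keeps one noisy result from shifting the baseline.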
Monitor CPU and memory usage
Memory and CPU usage are key resource metrics for judging how reliable and responsive an application is, because low memory availability or high CPU usage can lead to slowdowns, crashes, and bugs.
How best to measure these metrics depends on the underlying application and system infrastructure. For example, a REST API hosted on GCP could be paired with a tool like Google Cloud Profiler to track these metrics with low overhead. As a more general solution, many teams use application performance monitoring (APM) tools.
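For a rough look at what such tools collect, a Node.js process can inspect its own usage with the built-in `process.cpuUsage()` and `process.memoryUsage()` calls. The sketch below wraps a piece of synchronous work and reports CPU time and memory afterward. The `measureResources` helper is a hypothetical name, and a profiler or APM agent would gather this continuously rather than per call.

```javascript
const BYTES_PER_MB = 1024 * 1024;

// Runs `work` and reports the CPU time it consumed plus memory usage afterward.
function measureResources(work) {
  const cpuBefore = process.cpuUsage();
  const wallBefore = process.hrtime.bigint();

  const result = work();

  const cpu = process.cpuUsage(cpuBefore); // microseconds used since cpuBefore
  const wallMs = Number(process.hrtime.bigint() - wallBefore) / 1e6;
  const memory = process.memoryUsage();

  return {
    result,
    wallMs,
    cpuMs: (cpu.user + cpu.system) / 1000,
    heapUsedMb: memory.heapUsed / BYTES_PER_MB,
    rssMb: memory.rss / BYTES_PER_MB,
  };
}

// Example: a CPU-bound loop standing in for an expensive operation.
const report = measureResources(() => {
  let total = 0;
  for (let i = 1; i <= 1000000; i++) total += i % 7;
  return total;
});

console.log(
  `cpu=${report.cpuMs.toFixed(2)} ms, wall=${report.wallMs.toFixed(2)} ms, ` +
    `heap=${report.heapUsedMb.toFixed(1)} MB, rss=${report.rssMb.toFixed(1)} MB`
);
```

CPU time that stays close to wall time indicates CPU-bound work, while a large gap suggests the operation spent most of its time waiting on something else.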
Resolving CPU and memory issues often involves following best coding practices and optimizing key operations. SQL queries, RPC calls, external device interactions, and even filesystem operations are targets for performance optimization, and teams should monitor such operations to ensure that problems are handled quickly.
A scalability testing suite
In the most common form of scalability testing, load parameters are raised or lowered over time to observe how performance changes. Other initial configurations are left unchanged while the test runs. This provides insights into how a system autoscales, at what point to expect system failures, and more. Testing can be conducted to simulate the real conditions expected in production.
It may be tempting to try writing tests for every scenario imaginable, but too much coverage can make interpreting test results more difficult and lead to a costly and time-consuming testing suite. It is necessary to strike a balance between covering different test scenarios and avoiding test bloat.
In general, it is a best practice to ensure that scalability tests cover an application’s most critical components, such as database reads, key API endpoints, and any other resource-intensive operations.
How to test
A robust process for scalability testing looks like the following:
- Run baseline system performance tests by writing test cases that represent one or more typical user journeys. Measure system latency, response times, and throughput.
- Start defining expectations for target performance in various user scenarios.
- Write tests aimed at capturing various user scenarios, simulating expected traffic and conditions.
- Compare expectations to results and make system changes as needed.
- Establish a process to continue testing related to planned releases, features, and upgrades.
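The comparison step in the process above can be automated so that a CI/CD job fails when results miss their targets. The following sketch uses hypothetical target values and field names. In practice, the targets would come from a team's SLOs and the results from the testing tool's export.

```javascript
// Hypothetical performance targets for one user scenario.
const targets = {
  p95ResponseMs: 300,
  minThroughputPerSec: 900,
  maxErrorRate: 0.005,
};

// Returns a pass/fail verdict with a human-readable reason for each miss.
function evaluateRun(result, limits) {
  const failures = [];
  if (result.p95ResponseMs > limits.p95ResponseMs) {
    failures.push(
      `p95 response time ${result.p95ResponseMs} ms exceeds ${limits.p95ResponseMs} ms`
    );
  }
  if (result.throughputPerSec < limits.minThroughputPerSec) {
    failures.push(
      `throughput ${result.throughputPerSec}/s is below ${limits.minThroughputPerSec}/s`
    );
  }
  if (result.errorRate > limits.maxErrorRate) {
    failures.push(`error rate ${result.errorRate} exceeds ${limits.maxErrorRate}`);
  }
  return { passed: failures.length === 0, failures };
}

// Made-up results from a test run: fast enough, but throughput fell short.
const verdict = evaluateRun(
  { p95ResponseMs: 280, throughputPerSec: 850, errorRate: 0.001 },
  targets
);

console.log(verdict.passed ? "PASS" : "FAIL");
verdict.failures.forEach((reason) => console.log(` - ${reason}`));
// In a CI job, a failing verdict could set process.exitCode = 1.
```

Keeping the targets in one object makes it easy to review them whenever expectations change for a planned release or upgrade.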
Setting up a REST API server
To further illustrate how to run scalability tests, we will spin up a GCP Cloud Function that will serve a REST API. Similar logic applies to creating an API in AWS, Azure, or DigitalOcean.
Create a Cloud Function, using SCALABILITY_TEST as the function name.
This resource can be created in the UI using the following steps:
- Go to the Functions page, and click the Create Function action.
- Name the Function SCALABILITY_TEST.
- Select the Allow unauthenticated invocations option to allow public access to the API.
- Go to Source and replace the default server code with the following script:
Alternatively, use the gcloud CLI from Google Cloud Shell. An echo command writes a simple index.js file that configures the API, and the gcloud functions deploy command uses that file as the basis for the Cloud Function.
The URL that will be called in the scalability test appears in the CLI output under the httpsTrigger field.
Either approach creates the API with no authentication needed to access it.
Now that we have created our API, let’s test it. You can obtain the URL for your API from either the UI or your CLI output. Let’s try adding our API to a scalability test using Multiple.
Testing with Multiple
After logging in, click the New Test button. You should see the screen pictured below:
It’s time to make a simple scalability test using axios to call the REST API with a randomly generated query parameter. With Multiple, test code can be separated into distinct “phases” to streamline test setup, execution, and tear down. The function globalInit() is used for global setup (such as environment initialization) while the vuInit() function is used for per-user setup operations (such as user authentication). The core testing logic loop lives in the vuLoop() function. To test the GCP REST API, the vuLoop() function is all that’s needed. Update it with the following code:
In this example test, a fake company name is generated and sent to our REST API as a query parameter via Axios. Using the ctx.metric() function, we record the response time of our server when we get either a success or error response. The ctx.log() function prints the API’s output as well as our fake company name during debug runs.
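The test described above might be sketched as follows. Several details here are assumptions rather than the tool's documented API: the argument order of `ctx.metric()`, the `ctx.env.API_URL` lookup, and the metric names are hypothetical. To keep the sketch free of dependencies, Node's built-in `fetch()` stands in for the axios call and a small word list stands in for a fake-data library.

```javascript
// Hypothetical sketch: ctx.metric(), ctx.log(), and ctx.env shapes are assumed.
const ADJECTIVES = ["Global", "Dynamic", "United"];
const NOUNS = ["Systems", "Logistics", "Foods"];

// Builds a fake company name such as "Dynamic Foods".
function fakeCompanyName(random = Math.random) {
  const pick = (items) => items[Math.floor(random() * items.length)];
  return `${pick(ADJECTIVES)} ${pick(NOUNS)}`;
}

// One iteration of a virtual user's loop: call the API and record timing.
async function vuLoop(ctx) {
  const company = fakeCompanyName();
  const url = `${ctx.env.API_URL}?company=${encodeURIComponent(company)}`;
  const startedAt = Date.now();
  try {
    const response = await fetch(url);
    const body = await response.text();
    ctx.metric("API response time", Date.now() - startedAt, "ms");
    ctx.log(`company=${company} status=${response.status} body=${body}`);
  } catch (error) {
    // Record how long the failed call took so errors show up in the metrics.
    ctx.metric("API error time", Date.now() - startedAt, "ms");
    ctx.log(`company=${company} error=${error.message}`);
  }
}
```

Timing both the success and the error path matters because failed requests are often the slowest ones, and dropping them would make results look better than they are.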
The test depends on NPM packages for the HTTP call (axios) and for generating the fake company name. Add them as dependencies, require them at the top of the test, and set the API URL (with endpoint) as an environment variable.
Check to ensure that the test works as expected with a debug run, which only runs a test loop once with one VU, logging debug output and system call responses (in this case, API responses). This will verify that the test code runs and is error-free.
When ready, run the test in full and generate a load of VUs. Using Multiple, it is easy to adjust the load parameters, ramp-up time, and behavior of the VU fleet.
Conclusion
This article has described how to build a robust scalability testing strategy, from gathering initial baseline performance metrics to monitoring the regular runs of a testing suite. Key takeaways include the following:
- Establishing a system’s baseline performance metrics provides engineering teams with a better picture of the system and its capabilities.
- When testing an application, it is important to simulate real user traffic as closely as possible using techniques like ramp-up testing and organic fake data.
- Monitoring systems with regular scalability tests helps uncover edge case issues and outlier slow operations.
Effective scalability testing depends on developers accurately interpreting and utilizing the data obtained from test runs. Engineering teams should work to regularly monitor expected performance under various conditions and adjust their infrastructure accordingly.
Employing the scalability testing best practices discussed in this article in conjunction with other forms of non-functional testing will provide a robust view of an application’s performance. This, in turn, will inform technical decision-making to guide application growth and deliver optimal user experiences.