AWS Load Testing: Tutorial & Best Practices

March 27, 2024

Load testing is a critical practice for evaluating application performance. From a product perspective, it raises confidence in application performance under unforeseen conditions. From an engineering perspective, it helps steer the direction of technical planning and ensures that performance remains an important consideration throughout the development process.

As applications have migrated to the cloud, developers have had to adjust load-testing strategies accordingly. Hosting applications on cloud resources means sacrificing some control over infrastructure performance and latency in favor of scalability and the many other benefits of cloud computing. In addition, the sheer variety of available computing resources and hosting infrastructures has grown considerably. Regular load testing is needed to guarantee consistent and optimal application performance and evaluate the performance implications of different infrastructure options.

With AWS as the leading cloud service provider in the industry, AWS load testing has become an essential practice for many teams. This article will explore AWS load testing in detail by presenting eight best practices. Through the lens of those best practices, it will also explore test planning, methodologies, and solutions to common issues with practical examples.

Summary of key AWS load testing concepts

The table below summarizes the eight essential AWS load testing best practices covered in this article.

| Best practice | Description |
| --- | --- |
| Define a testing methodology | Adapt your load-testing strategy to the application under test. |
| Consider costs carefully | Plan carefully and configure AWS resources accordingly to prevent unexpected expenses. Set AWS billing alerts to receive notifications as costs or resource consumption increase. |
| Disable external requests in test environments | Use network isolation and network stubbing to focus on internal performance metrics. |
| Use small-scale test environments | Consider the tradeoffs of testing using small-scale replicas of the production environment. |
| Test for back pressure | Include tests that maintain a steady request rate to determine system performance under constant load. |
| Ensure the scalability of load test server configuration | Consider the throughput and bandwidth limits of load test servers when planning a testing strategy, or use a hosted load testing tool. |
| Monitor test results | Use tools such as Amazon CloudWatch to monitor ongoing system performance more effectively. Compare AWS results with results in your testing tool to identify discrepancies. |
| Plan for storage requirements | Have a plan for how to store the results of test suite runs and compare historical test run results. |

In the following sections, we will elaborate on each best practice in the summary table above and provide actionable recommendations for how to integrate AWS load testing into software projects.

Define a testing methodology

Engineering teams need a well-defined strategy for non-functional testing. Upfront planning is essential for effective reporting and cost control.

Establish requirements

The first step of defining a robust testing methodology is deciding what should be tested and when. It is crucial to comprehensively identify and document the system’s functional and non-functional requirements. This involves a detailed analysis of the software specifications, user expectations, and system constraints.

By clearly outlining these requirements, the testing team gains a roadmap for developing test cases that cover the full spectrum of functionalities, ensuring thorough validation and verification of the system. In addition, a well-defined set of requirements serves as a benchmark for evaluating the effectiveness of the testing process. It helps align testing activities with the overall project objectives.

Measure baseline metrics

Once the testing requirements are established, measure baseline metrics in key areas of your system. Standard collected metrics include throughput, response times, and error rates. Teams can collect these metrics from simulations of both typical and peak conditions.

Baseline metrics serve as a foundation for assessing performance and reliability. They can and should be a reference point for insights and comparisons to subsequent tests.
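As a small, self-contained sketch of what baseline collection might look like, the snippet below summarizes a sample of response times and error outcomes using only the Python standard library. The sample values and the simple percentile calculation are illustrative, not a prescribed methodology:

```python
import statistics

def baseline_summary(response_times_ms, errors):
    """Summarize a sample of response times (ms) and error outcomes (0/1)."""
    sorted_times = sorted(response_times_ms)
    # Approximate p95: the sample sitting at the 95% position (0-based index).
    p95_index = max(0, int(0.95 * len(sorted_times)) - 1)
    return {
        "mean_ms": statistics.mean(sorted_times),
        "p95_ms": sorted_times[p95_index],
        "error_rate": sum(errors) / len(errors),
    }

summary = baseline_summary(
    [120, 95, 180, 110, 300, 105, 90, 130, 250, 100],
    errors=[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
)
print(summary)  # mean, p95, and error rate for the baseline run
```

In practice, these numbers would come from a load-testing tool run against typical and peak traffic simulations, then be stored as the reference point for later comparisons.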

Identify testable requests

The next step is to identify testable requests. In this section, we will focus on two types of commonly tested requests: transactional requests and page or endpoint requests.

Transactional requests

Transactional requests involve multiple steps and usually represent critical business processes. Examples include user registrations and financial transactions.

Once transactional requests are identified, they can be tested in isolation and often serve as focal points for catching potential bottlenecks. When testing transactional requests, testing efforts should focus on data integrity, throughput limits, and transitions from one transaction state to another.

Page or endpoint requests

Page or endpoint requests involve individual interactions with the application, such as loading a webpage, submitting a form, or sending data in a POST request. These requests usually represent individual user actions.

Tests for page and endpoint requests should focus on user experience, input validation, proper data ingestion (if applicable), and the performance of system components in isolation.

Acceptable error rates and thresholds

A 100% error-free application is not realistic for any company. Business requirements determine what error rates are acceptable. Taking the following steps will help ensure that your application’s error rates do not negatively impact business goals:

  • Evaluate how the system handles errors and disruptions and ensure graceful degradation in application code.
  • Align error rate expectations with impact on business operations and define thresholds based on user expectations, non-functional requirements, and SLAs.
  • Set alerts when tests hit critical thresholds to ensure timely intervention.
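The threshold logic in the steps above can be sketched in a few lines. The 1% default below is purely illustrative; real thresholds come from SLAs, non-functional requirements, and business needs:

```python
def check_error_threshold(total_requests, failed_requests, threshold=0.01):
    """Return (error_rate, breached) for a test run.

    `threshold` is the maximum acceptable error rate; 0.01 (1%) is an
    illustrative default, not a recommendation.
    """
    error_rate = failed_requests / total_requests
    return error_rate, error_rate > threshold

rate, breached = check_error_threshold(total_requests=50_000, failed_requests=620)
print(f"error rate {rate:.2%}, threshold breached: {breached}")
```

A check like this can run at the end of every test suite and feed an alerting system when `breached` is true.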

It is important to note that error rates will change over time, so expect to adjust expectations and intervention strategies as the application grows and business needs evolve.

Consider costs carefully

One of the major benefits of using a cloud provider is elasticity: the ability to scale quickly and dynamically in response to real-time events. Instead of serving end users from a limited set of machines, cloud computing enables dynamic scaling based on user traffic.

However, engineering teams are responsible for managing how many resources an application consumes. Diligence is required to prevent unforeseen spending on cloud resources.

Forecasting costs

Conduct thorough cost forecasting to estimate resource usage. This has two purposes: preventing unforeseen expenses and identifying services that need quota increases. Estimations can be based on testing scenarios and expected application usage.

Set AWS billing alerts (for example, through AWS Budgets) to receive notifications as costs increase. Administrators can also set alerts on AWS resource consumption. These practices have two key uses. First, they help prevent services from encountering limits that may result in unforeseen costs. Second, when the time comes to scale the system, they provide a record of how many system resources are already being consumed. This provides insights into how much the application can be scaled up without adversely impacting performance or uptime.
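Cost alerts can be configured programmatically. The sketch below builds a budget definition for the AWS Budgets API via boto3; the account ID, dollar amount, and email address are placeholders, and the actual API call is shown commented out since it requires AWS credentials:

```python
# Budget definition for the AWS Budgets API (boto3 "budgets" client).
# The limit, threshold, and email below are placeholder values.
budget = {
    "BudgetName": "load-testing-monthly",
    "BudgetLimit": {"Amount": "500", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
}
notifications = [{
    "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80.0,            # notify at 80% of the budget
        "ThresholdType": "PERCENTAGE",
    },
    "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "team@example.com"}],
}]

# import boto3
# boto3.client("budgets").create_budget(
#     AccountId="123456789012",
#     Budget=budget,
#     NotificationsWithSubscribers=notifications,
# )
print(budget["BudgetName"])
```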

Load testing is a common way to identify services at risk of experiencing overload or significant performance degradation in response to high-traffic periods, such as upcoming releases or events. For example, an e-commerce platform may use load testing to simulate an expected holiday rush. Testing might assume a 40% increase in user traffic, and through load testing, the company can determine which services could exceed resource limits.

Before launching new services or features, commit to non-functional testing not only to forecast unforeseen expenses but also to be proactive in raising limits on system resources as needed.

Disable external requests in test environments

It is important to be proactive in preventing load testing from unintentionally affecting other components outside of the staging environment under test. Interactions with external services can introduce side effects, usage costs, and strains on those services.

For example, consider the side effects of a system making many concurrent requests to a payment gateway during test runs. These services often charge based on the volume of incoming requests, so service fees can become exorbitant. Many invalid requests can also lead to rate limits or service terminations. To prevent these issues, engineers typically want to ensure that calls to interconnected services and external networks are disabled during testing.

However, to thoroughly evaluate application performance, it may be necessary to simulate interactions with external services. In non-functional testing, a common approach is to intercept external network requests. Depending on technical requirements, teams can adopt one of many open-source “interceptor” libraries, such as Nock. Using an interceptor, administrators can define rules globally for how to handle incoming requests.
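Nock is a JavaScript library, but the interception idea is language-agnostic. As a minimal sketch in Python using only the standard library, the hypothetical `charge_payment` helper below has its outbound network call stubbed with a canned response during testing:

```python
import urllib.request
from unittest import mock

def charge_payment(amount_cents):
    # Hypothetical helper: in a real application this would call the
    # payment gateway over the network.
    with urllib.request.urlopen("https://payments.example.com/charge") as resp:
        return resp.status

# During load tests, intercept the network call and return a canned response
# so no request ever leaves the test environment.
with mock.patch("urllib.request.urlopen") as fake_urlopen:
    fake_urlopen.return_value.__enter__.return_value.status = 200
    status = charge_payment(1999)

print(status)
```

Dedicated interceptor libraries add conveniences on top of this pattern, such as global rules, request matching, and failure injection.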

Use small-scale test environments

At a large scale, a common problem is how to represent production environments for accurate testing. Imagine an e-commerce system that serves millions of users and has hundreds of integrations. At this scale, it is unrealistic to maintain an equivalent volume of data in staging and test environments.

A typical solution is to utilize a staging environment that mimics the production environment but at a smaller scale. While testing such an environment does not provide the same data as testing the production system directly, it does provide information that allows teams to make inferences about the production system. In this approach, it is common for host servers to mock external requests (as described in the previous section) for isolation and to prevent unwanted side effects during testing.

However, it is worth noting that utilizing smaller-scale test environments comes with tradeoffs and may not fit every use case. For example, companies should weigh the benefits of gaining more realistic results from running full-scale load tests against the cost savings of running tests with fewer virtual users or on a scaled-down version of the test target. Many companies find value in running small-scale tests frequently while reserving full-scale load tests for special events, such as major releases, or in anticipation of high-traffic periods.

Test for back pressure

Many technical definitions of back pressure give only part of the picture. At first glance, the concept seems relatively straightforward: a constant rate of requests can bring about a different system response than a sudden burst of traffic, and the effects can propagate downstream.

Or take AWS’s technical overview:

“Some effects can be observed only when load is generated over a prolonged period of time. One of the most important effects is back pressure. This means that when a system is too slow to process the number of requests at the speed that they are coming, the performance of its client systems will degrade.”
- AWS Prescriptive Guidance

From these descriptions, it seems reasonable to assume that back pressure is simply a bottleneck that occurs under peak load conditions.

However, the heart of the problem is best described by analogy. Imagine a single pipe with water flowing through it. All else being equal, water will flow from the pipe steadily as it is poured in. This changes when the pipe is full: water begins to flow out of the pipe at a slower rate because of the pressure exerted by the volume of water already in the pipe. Some water may be delayed entering the pipe, and some may even be lost. This concept is the origin of the term "back pressure".

Now imagine water flowing through a complex network of pipes and valves. There could be many points at which backpressure could occur. It would be essential to identify where these points are and test to see how they behave under realistic amounts of load. In general, it can be assumed that a system will behave differently under a sustained amount of load than under normal conditions.

Similarly, in software, a maturing cloud system will likely evolve to have many services, external providers, and client endpoints that interact in complex ways. This introduces many potential performance bottlenecks or points of failure.

Load testing can give insight into this complex system behavior at maximum capacity or under overload, helping engineering teams determine which techniques will improve the system's ability to handle incoming requests efficiently and without bottlenecks. The table below details four standard techniques for improving system performance:

| Technique | Description |
| --- | --- |
| Rate limiting | Sets predefined limits on how many incoming requests will be considered for processing within a given timeframe. Requests beyond the limit are rejected or throttled to ensure they do not overwhelm system resources. |
| Queues | Queue systems, such as Amazon SQS, decouple request intake from processing: requests are buffered and worked off asynchronously at a rate the system can sustain, smoothing out spikes in load. |
| Load shedding | The practice of selectively dropping or delaying lower-priority requests while a system experiences peak load conditions. This helps maintain essential system functionality and prevent system crashes. |
| Auto-scaling | Dynamically adjusting system resources or configuration based on load conditions to accommodate higher or lower levels of traffic. |
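Of these techniques, rate limiting is the simplest to illustrate. Below is a minimal token-bucket sketch; the rate and capacity values are illustrative, and production systems would use a battle-tested library or gateway feature instead:

```python
class TokenBucket:
    """Token-bucket rate limiter: allows bursts up to `capacity`,
    refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        # Refill tokens for the elapsed time, then spend one if available.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=2)  # 2 requests/second, burst of 2
results = [bucket.allow(t) for t in [0.0, 0.1, 0.2, 1.2]]
print(results)
```

The third request arrives before the bucket has refilled and is rejected; by 1.2 seconds, enough tokens have accumulated to admit traffic again.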

How to test for back pressure

Conducting a load test with a ramp-up period, steadily raising the test load until it matches target peak conditions, and then holding that load steady for a specific duration is a realistic approach to assessing the effects of back pressure on a system.

A typical enterprise practice is to hold the load steady for at least 15 to 30 minutes. This duration allows for monitoring system behavior over an extended period. It provides a reasonable indication of whether back pressure starts to affect the system's performance, resource utilization, or response times during sustained peak loads.

However, the exact duration may need to be adjusted based on the nature of the application, the expected usage patterns, and the specific goals of load testing. It is essential to strike a balance between a realistic simulation of peak conditions and the practical constraints of the testing environment.
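The ramp-and-hold profile described above can be expressed as a simple schedule function. The durations and virtual-user counts below are illustrative only:

```python
def ramp_and_hold(t_seconds, ramp_seconds, hold_seconds, peak_users):
    """Target virtual-user count at time t for a ramp-up-then-hold profile."""
    if t_seconds < ramp_seconds:
        # Linear ramp from 0 to peak_users over the ramp period.
        return int(peak_users * t_seconds / ramp_seconds)
    if t_seconds < ramp_seconds + hold_seconds:
        return peak_users  # hold steady at peak
    return 0  # test complete

# Ramp to 1,000 VUs over 5 minutes, then hold for 30 minutes.
profile = [ramp_and_hold(t, 300, 1800, 1000) for t in (0, 150, 300, 1200, 2200)]
print(profile)
```

Most load-testing tools accept an equivalent stage-based configuration directly, so a function like this is mainly useful for reasoning about and documenting the intended profile.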

Multiple enables this testing style by allowing testers to easily configure test parameters according to testing goals. Realistic simulations of sustained peak conditions can be run through a clean interface, and team members can share results.


Ensure the scalability of load test server configuration

A common pain point in load testing is reliably generating the desired level of load, which typically involves heavy bandwidth consumption. If companies utilize a single dedicated server as a load generator, it is common for the load generator itself to become the bottleneck in the load test suite, which can significantly skew test results. Testers must therefore consider the throughput and bandwidth limits of the load test server when planning a testing strategy.

For organizations needing to generate hundreds of requests consistently across many test runs, more than one dedicated load generator is likely necessary. In addition, to scale this type of solution to handle more and more requests, techniques like parallelization, orchestration, or containerization would be needed to ensure that testing servers can handle the load they need to generate over time.
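As a small sketch of the parallelization idea, the snippet below fans request generation out across worker threads with Python's standard library. The `send_request` function is a placeholder that simulates a successful response so the sketch is self-contained; a real generator would issue HTTP requests and also distribute work across machines:

```python
from concurrent.futures import ThreadPoolExecutor

def send_request(i):
    # Placeholder for an HTTP call against the system under test;
    # simulated here so the example has no network dependency.
    return 200

# Fan requests out across worker threads so a single sequential loop
# does not become the bottleneck in the load generator itself.
with ThreadPoolExecutor(max_workers=50) as pool:
    statuses = list(pool.map(send_request, range(500)))

print(len(statuses), statuses.count(200))
```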

Multiple solves this issue by generating containerized virtual users (VUs) and providing the cloud infrastructure to run large load tests. These VUs run in parallel, executing requests against the system under test. Using this approach, users can generate thousands of concurrent requests reliably.

Monitor test results

Telemetry is data used for monitoring and analysis. Two classic, ubiquitous types of telemetry in software are logs and metrics, both of which become incredibly complex and challenging to decipher in the context of a distributed system. An emerging solution to this issue is distributed tracing, which traces the propagation of requests through services and measures the response times between services.

In distributed tracing, a trace is a view of a request as it moves through the system. It consists of several spans, each representing an event related to the request; for instance, a span may represent a database query that took 200 ms. A distributed trace holds a list of these spans across various services, giving a complete picture of the effects of a system request from end to end.
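A minimal illustration of this structure, with hypothetical services and timings, shows how spans compose into a trace and how summary questions can be answered from it:

```python
# A trace as a list of spans, each recording the service, operation,
# and start/end timestamps (ms, relative to the request) of one step.
# Service names and timings below are invented for illustration.
trace = [
    {"service": "api-gateway", "op": "POST /checkout", "start": 0,   "end": 540},
    {"service": "orders",      "op": "create_order",   "start": 20,  "end": 310},
    {"service": "orders-db",   "op": "INSERT orders",  "start": 90,  "end": 290},
    {"service": "payments",    "op": "charge",         "start": 320, "end": 520},
]

# End-to-end latency of the request.
total_ms = max(s["end"] for s in trace) - min(s["start"] for s in trace)
# Slowest downstream span (excluding the root gateway span).
slowest = max(trace[1:], key=lambda s: s["end"] - s["start"])
print(total_ms, slowest["service"])
```

Tracing systems such as Jaeger and X-Ray build this span model automatically from instrumented services and visualize it as a timeline.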

An example trace display

There are several emerging providers for distributed tracing, including the open-source tool Jaeger and AWS's own solution, X-Ray.

Plan for storage requirements

Large load tests generate a large number of data points, and this number grows quickly in distributed cloud systems. This is not a significant issue for a few test runs, but storage is likely to become an issue over time.

It is crucial to handle this problem before it starts. During the planning phase, establish what server size, partitioning strategy, and data expiry approach are required.

In a traditional approach, test run results are exported either to disk or to a storage service like Amazon S3 or Dropbox. This fails to scale as systems grow because managing and offloading large volumes of test data becomes increasingly difficult. For smaller systems or monolithic applications, regularly archiving data can be a sufficient solution to this issue. Amazon S3 is one of the leading solutions for this use case: its lifecycle management options allow data to be expired or transitioned to a low-cost storage class such as S3 Glacier.
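An archiving policy like this can be defined as an S3 lifecycle configuration. The sketch below builds the rule payload for boto3's `put_bucket_lifecycle_configuration`; the bucket name, prefix, and retention periods are placeholders, and the actual API call is commented out since it requires AWS credentials:

```python
# Lifecycle rule: move load test results to Glacier after 30 days and
# delete them after a year. Prefix and durations are placeholder values.
lifecycle = {
    "Rules": [{
        "ID": "archive-load-test-results",
        "Status": "Enabled",
        "Filter": {"Prefix": "load-test-results/"},
        "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": 365},
    }]
}

# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-test-results-bucket",
#     LifecycleConfiguration=lifecycle,
# )
print(lifecycle["Rules"][0]["ID"])
```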

However, this approach eventually fails to scale in larger distributed systems and microservice architectures because many services will be racing to fill the same S3 buckets with data. For this use case, time-series data solutions are seeing increasing adoption.

Weather and stock market data are examples of time-series data: the data is continuous and is ingested constantly. The most popular time-series databases include Prometheus and InfluxDB. InfluxDB is general purpose and used in many disciplines, while Prometheus (often paired with Grafana for visualization) is most commonly used as a dedicated engine for real-time metrics ingestion.

AWS has managed offerings for each: Amazon Timestream for InfluxDB, Amazon Managed Service for Prometheus, and Amazon Managed Grafana. Teams interested in using time-series data should compare options and weigh how each can tie into their existing ecosystem.

As an alternative, load testing tools like Multiple store test results centrally with role-based access control. This effectively alleviates test result storage as a concern for developers maintaining test infrastructure.



The complexity of modern systems hosted on AWS presents challenges to developers when conducting load testing. Teams can overcome these challenges with the right AWS load testing best practices, such as careful planning of test requirements and methodologies, monitoring AWS resource usage and associated costs, choosing an appropriate testing environment, and mocking external dependencies.

In addition, utilizing a tool like Multiple for load testing allows developers to write test scripts easily, configure test parameters, and share results. We hope the information in this article will help your team leverage the many benefits of load testing in current and future software projects.