Soak Test: Tutorial & Best Practices

February 5, 2024
12 min

Soak testing is a type of performance testing that measures a system’s response to continuous and sustained load over an extended duration. Engineers conduct soak tests to evaluate performance and stability under these conditions and to verify that the system remains reliable over time.

Instead of brief bursts of traffic, soak tests challenge systems with extended periods of load, revealing potential issues that might only manifest over time. However, merely running a test for a long duration is not enough—engineers need a comprehensive approach to obtain meaningful results.

This guide delves into the core principles of effective soak testing, providing engineers with a practical roadmap to uncover and address system vulnerabilities.

Summary of key soak test concepts

In this article, we explore the following best practices for writing great soak tests and then take a look at an example test.

Best Practice | Description
Define clear objectives | Defining clear objectives helps clarify the soak test’s purpose. For example, the test may seek to identify memory leaks, ensure system stability, or measure system degradation.
Generate realistic workloads | Soak tests should accurately reflect traffic levels. The more realistic the scenario, the less likely it is for unexpected defects to occur in production.
Automate the tests | Soak tests are long-running. To prevent the need for engineers to be constantly present, it is crucial to automate test execution, monitoring, and reporting.
Monitor continuously | Use monitoring tools to constantly check for changes in the system under test. Measure facets like memory, CPU, database connections, and storage growth.
Set up notifications | During long-running soak tests, notifications free engineers from constant monitoring while alerting them to failures. Notifications also inform engineers if the test needs to be ended prematurely due to a defect or a test condition being met.
Perform post-test analysis | Review test results and analyze them iteratively. Doing so improves test quality over time and helps identify trends.

Soak test best practices in detail

Define clear objectives

Every test begins with a goal in mind, and soak tests are no exception. Defining clear objectives for soak tests provides several benefits. They support a shared understanding among team members of the goal of the test and what is being tested for. They empower all team members to contribute to and improve the tests, and they help with debugging tests by narrowing down the scope of each test. This makes it easier to identify a root cause because fewer areas in the system need to be evaluated.

For instance, if a team’s primary concern is the system’s robustness over extended periods and the application has been crashing with out-of-memory errors, the objective might be to identify potential causes such as memory leaks. If users report crashes after prolonged use, the team can then verify system stability by reviewing application logs and addressing any defects found.

As another example, consider a scenario in which an application’s performance drops steadily by 20% over 48 hours of continuous operation. If the engineering team does not have a clear objective to measure system degradation, this significant issue might remain unnoticed. In response, the team may want to measure the P99 performance metric (explained below) for application response time. Then, by representing the P99 value visually, the team can identify when the degradation started and correlate the degradation to application and infrastructure logs. In many cases, this approach would help identify the source of the issue. If not, more evaluation is required, or perhaps the metrics and test objective itself would need to be reexamined.

Generate realistic workloads

The validity of any soak test largely depends on how well it replicates real-world scenarios. Engineers know that theoretical tests and actual traffic levels can significantly differ—to extract meaningful results, the test must mimic real-world user behavior.

For example, if an application is primarily used during business hours, with peak usage between 10 am and 2 pm, then the soak test should mirror these patterns. Simulating a continuous high load when there are actually periods of low activity might lead to misleading results. For instance, the application might perform poorly under low-volume traffic because the infrastructure’s minimum capacity after scaling down is insufficient; if tests are only run with high-volume traffic, the scale-down event would be missed entirely.
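As a simple illustration, the target load can be derived from the time of day instead of being held constant. The sketch below is a minimal JavaScript example, assuming an illustrative baselineVus value and the 10 am to 2 pm peak window described above; it is not tied to any particular tool’s API.

// Sketch: derive a target VU count from the hour of day so that the soak
// test mirrors a business-hours traffic pattern. The baseline value and
// multipliers are illustrative assumptions.
const baselineVus = 20;

function targetVus(date = new Date()) {
  const hour = date.getHours();

  if (hour >= 10 && hour < 14) return baselineVus * 3; // 10 am to 2 pm peak
  if (hour >= 8 && hour < 18) return baselineVus * 2;  // business hours
  return baselineVus;                                  // overnight low
}

console.log(`Target VUs right now: ${targetVus()}`);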

Test context

In software testing, test context refers to the specific conditions, environment, and constraints under which a software application is tested. This influences the testing process and outcome. In essence, context is the underlying business logic: how the customer uses the product and how a feature or functionality should behave. When the context is understood, it becomes much easier to prioritize tests, identify root causes, and create more meaningful test objectives. For example, testing login functionality is often a high priority, but within the context of a weather app—where most users are anonymous—behavior when not logged in takes precedence.


Automate the tests

Soak tests, by nature, are long-running. It is impractical and inefficient for engineers to oversee the entire test duration manually, so automation is essential. Once test parameters are set, the test should run independently without needing constant intervention. For example, a properly automated soak test running for a typical eight-hour workday should execute successfully without the need for an engineer to manually check infrastructure health throughout the test. Proper automation drastically reduces human intervention time.

In addition, the ability to run soak tests with little or no configuration is a significant advantage because it allows tests to be run more often and by team members who may not be as familiar with the test design. SaaS products like Multiple can assist with configuring and running tests with minimal effort.
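The ability to start runs automatically is also what makes scheduled, recurring soak tests practical. As a rough sketch, the snippet below uses the node-cron package to kick off a nightly run; startSoakTest is a hypothetical placeholder for however your team actually launches a test (a CLI call, an API request to your testing tool, and so on).

// Sketch: start a soak test automatically every night at 02:00 so results
// are ready for morning review. startSoakTest is a hypothetical placeholder.
const cron = require('node-cron');

async function startSoakTest() {
  // Launch the test here, e.g., by calling your test tool's API or CLI.
  console.log(`Soak test started at ${new Date().toISOString()}`);
}

cron.schedule('0 2 * * *', () => {
  startSoakTest().catch((err) => console.error('Soak test failed to start:', err));
});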

Monitor continuously

Systems are dynamic, and various metrics might display unexpected behavior over time. Because of this, engineers should set up continuous monitoring for key performance indicators. Consider a database: Connection numbers might grow over extended periods, or storage might increase unexpectedly. An engineer’s role is to catch these shifts. By monitoring performance anomalies continuously, engineers can intervene when necessary. They can address the issues proactively through automation or reactively by adjusting the application infrastructure under test.

Monitoring goes hand in hand with alerting and notifications: Whether it is an email or through messaging applications, having alerts in place to notify engineers of potential issues prevents the need for constant manual observation.

Finally, it is important to set monitors and alerts based on thresholds or triggers that are meaningful. Too many alerts and they become noise and are not taken seriously; too few, and a real issue might be missed. Monitors and alerts should be derived from a combination of the test’s objectives and the high-priority conditions the test infrastructure can enter, such as out-of-memory or CPU errors. By monitoring continuously, the performance test can yield meaningful results as quickly as possible.
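As a rough illustration, a monitoring check might poll a metrics source and compare the reading against a threshold. The sketch below assumes a hypothetical metrics endpoint that returns used and available memory in megabytes; in practice, dedicated monitoring tools usually handle this, but the underlying idea is the same.

// Sketch: poll a (hypothetical) metrics endpoint during the soak test and
// warn when memory usage crosses a meaningful threshold.
const axios = require('axios');

const MEMORY_ALERT_THRESHOLD = 0.85; // 85% of available memory (illustrative)

async function checkMemory() {
  // Assumed endpoint and response shape: { usedMb, limitMb }
  const { data } = await axios.get('https://metrics.example.internal/memory');
  const usage = data.usedMb / data.limitMb;

  if (usage >= MEMORY_ALERT_THRESHOLD) {
    console.warn(`Memory at ${(usage * 100).toFixed(1)}%: investigate or stop the test`);
  }
}

// Check once a minute for the duration of the test
setInterval(() => checkMemory().catch(console.error), 60 * 1000);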

Set up notifications

Setting up notifications and alerts ensures that engineers remain informed about vital system changes even when they are not actively watching. For instance, if memory consumption reaches a critical threshold, an immediate alert can be invaluable to tell the engineer to stop the test, investigate, and scale the infrastructure down to reduce costs. These notifications act as a first line of defense against potential failures and ensure timely interventions, potentially saving hours of diagnostic time later.

Another benefit of notifications and alerts is that they can inform the whole team, not just individuals. Testing should be a shared responsibility, and by sharing this information through channels like email and messaging, the whole team can become more aware of the testing process and intervene if there are issues.
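As a minimal sketch of how such a notification might be delivered, the snippet below posts a message to a team channel through an incoming webhook. The WEBHOOK_URL environment variable and the { text } payload shape are assumptions based on Slack-style webhooks; adapt them to your messaging platform.

// Sketch: push an alert to a shared channel so the whole team sees it.
// WEBHOOK_URL is an assumed environment variable pointing at an incoming webhook.
const axios = require('axios');

async function notifyTeam(message) {
  await axios.post(process.env.WEBHOOK_URL, { text: message });
}

// Example usage when a threshold is crossed during the soak test:
// await notifyTeam('Soak test: memory above 85% for 5 minutes; consider stopping the run');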

Perform post-test analysis

The work does not end when the soak test does—engineers need to review and analyze the test results iteratively. This post-test analysis is where defects are primarily discovered: Patterns emerge, trends get spotted, and potential areas of concern become evident.

As multiple soak tests are performed, comparing and contrasting results can lead to valuable insights over time. For example, an engineering team might find that CPU usage showed a 5% increase over each of three consecutive soak tests. This might hint at an underlying issue, such as an increasing number of records in the database that are being processed inefficiently, that would have otherwise gone unnoticed in the application infrastructure.

Visually, there can also be trends to investigate, such as spikes or slowly increasing or decreasing response times. These observations can then be correlated, usually by timestamp, with application and infrastructure metrics, logs, and events.

Leveraging test tools to graph different facets of the test makes the results much easier to debug and more digestible. For example, by glancing at a line graph, an engineer can immediately see spikes over thousands of requests without needing to pore over text-based logs. Additionally, it is crucial to be able to share these results for collaborative analysis with other team members, who may be able to offer additional insights.
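As a simple illustration of comparing results across runs, the sketch below checks a single metric over consecutive soak tests and flags a consistent upward drift. The run data shown is illustrative; in practice it would come from your test tool’s exported results.

// Sketch: flag a metric that rose in every one of the last few soak tests.
// The numbers below are illustrative, not real results.
const runs = [
  { date: '2024-01-15', avgCpuPercent: 41 },
  { date: '2024-01-22', avgCpuPercent: 43 },
  { date: '2024-01-29', avgCpuPercent: 45 },
];

const deltas = runs.slice(1).map((run, i) => run.avgCpuPercent - runs[i].avgCpuPercent);

if (deltas.every((delta) => delta > 0)) {
  console.warn('CPU usage rose in every run; investigate for creeping resource growth');
}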


P95 and P99

There are many patterns to watch out for in the soak test results. For example, when analyzing reports, an engineer will want to examine the response time closely. Percentiles like P99 and P95 are typically used.

P95 focuses on the experience of the vast majority of requests: it is the response time below which 95% of requests fall, so optimizing for it ensures a satisfactory level of performance for most users in typical scenarios. It also aids in planning for standard traffic conditions, ensuring systems can handle common fluctuations and loads. For regular service-level agreements (SLAs) where the demands are moderate, P95 is often sufficient, offering a balanced view of the system’s performance under usual circumstances and reflecting the experience of a large majority.

P99 extends the scope of performance optimization to include a broader range of requests, targeting areas beyond the P95 benchmark. A P99 performance benchmark means that 99% of requests will be faster (i.e., experience lower latency) than the recorded metric. P99 metrics are invaluable in high-stakes environments such as financial systems, healthcare, or real-time communication services, where consistently high performance is critical. P99 ensures that the system’s capabilities encompass a more inclusive range of scenarios, thus enhancing the user experience even in less common usage cases.

Regarding infrastructure and capacity planning, P99 is crucial for preparing systems to maintain performance during more extreme traffic conditions. When it comes to SLAs that demand comprehensive reliability and performance consistency, P99 is an essential metric, aligning with stringent standards and ensuring an elevated level of service for a wider user base. P95 provides a good standard for performance, but P99 takes it a step further, ensuring quality and reliability in more demanding and critical scenarios.
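To make the percentile definitions concrete, here is a minimal sketch of how P95 and P99 can be computed from a list of recorded response times using the nearest-rank method. The sample values are illustrative; in practice, a test tool such as Multiple reports these percentiles for you.

// Sketch: compute P95 and P99 from recorded response times (in ms) using the
// nearest-rank method. Real soak tests produce thousands of samples; with the
// short illustrative array below, the high percentiles converge to the maximum.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, index)];
}

const responseTimesMs = [92, 95, 97, 99, 101, 104, 110, 118, 146, 240];

console.log('P95:', percentile(responseTimesMs, 95), 'ms');
console.log('P99:', percentile(responseTimesMs, 99), 'ms');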

Like many things in software development, this analysis needs to happen continuously throughout the software lifecycle: Perform and analyze performance tests iteratively. Staying on top of trends and results makes it much easier to pinpoint code and infrastructure changes that may have impacted the application’s performance.

Soak test example using Multiple

This section provides an example of how to apply the concepts described above using the performance testing tool Multiple to create a soak test.

Multiple’s tests are written in JavaScript. In this example, we will create an HTTP test that sends POST and GET requests to a chat API. We want the test to exercise the API realistically to make the soak test as accurate as possible, so we will simulate user interactions with a hypothetical chat application by issuing POST requests to send chat data and GET requests to retrieve it.

Test and test script setup

For test setup, we will require the Axios and Falso NPM packages and use the following test script:

// axios for making HTTP requests
const axios = require('axios');

// falso for generating synthetic data
const { randParagraph } = require('@ngneat/falso');

class TestSpec {
  npmDeps = {
    axios: '1.2.1',
    '@ngneat/falso': '6.4.0',
  };

  async vuInit(ctx) {
    const apiClient = axios.create({ baseURL: ctx.env.API_BASE_URL });

    return { apiClient };
  }

  async vuLoop(ctx) {
    // Retrieve the API client from vuInit
    const { apiClient } = ctx.vuInitData;

    let startTime = Date.now();

    try {
      await apiClient.post('chat', {
        // Generate synthetic data with falso
        message: randParagraph(),
      });
      // Capture the time taken for the POST request
      ctx.metric('POST /chat [200]', Date.now() - startTime, 'ms');
    } catch (error) {
      const httpStatusCode = error?.response?.status;

      if (httpStatusCode) {
        ctx.metric(
          `POST /chat [${httpStatusCode}]`,
          Date.now() - startTime,
          'ms',
        );

        // Log the error message so we can see it in data export log
        console.error(`${httpStatusCode} HTTP Error: ${error?.response?.data}`);
      } else {
        // Throw the error. Thrown errors are automatically logged
        throw error;
      }
    }

    startTime = Date.now();

    try {
      await apiClient.get('chat');
      ctx.metric('GET /chat [200]', Date.now() - startTime, 'ms');
    } catch (error) {
      const httpStatusCode = error?.response?.status;

      if (httpStatusCode) {
        ctx.metric(
          `GET /chat [${httpStatusCode}]`,
          Date.now() - startTime,
          'ms',
        );

        // Log the error message so we can see it in data export log
        console.error(`${httpStatusCode} HTTP Error: ${error?.response?.data}`);
      } else {
        // Throw the error. Thrown errors are automatically logged
        throw error;
      }
    }
  }
}

As you can see, this test script will provide the following data:

  • POST response times: The average response times in milliseconds of POST requests, separated according to each response’s HTTP status code.
  • GET response times: The average response times in milliseconds of GET requests, separated according to each response’s HTTP status code.

We want to collect response times for two reasons. Visually, we would like to see if response times increase or have spikes over time, both of which can imply resource issues or application errors. For application errors, we also want to collect response codes. Whenever we see 400 status codes (client error) or 500 status codes (server error), these are worth investigating. 400 status codes can indicate client-side configuration errors, such as a POST JSON body missing a required field. On the other hand, 500 status codes often indicate a server-side error and can be investigated by looking at server logs or conducting additional tests.

Test settings

Here are the settings for this test:

Let’s look at each of these settings in the context of soak testing:

  • Number of VUs: The number of virtual users executing our tests. Keep in mind that VUs are faster than an average user and can execute more requests than a user can in a given time span. Think about the requests per second (RPS) our application receives—the RPS value is based on this calculation (a short calculation sketch follows this list):

RPS = (Number of VUs * Requests per Loop) / Loop Duration (in seconds)

  • Test Duration: How long the tests will run. For soak tests, the duration should be determined by your test objective; in many scenarios, this could be anywhere from eight to twenty-four hours.
  • Ramp Up Duration: How quickly the VUs scale up to the number of VUs we have set in the “Number of VUs” setting above. Typically, user traffic has highs and lows. It is important to test the application in a manner where we capture the rise in traffic so that the underlying infrastructure and application can be tested for upward and downward scalability.
  • Minimum VU Loop Duration: The minimum amount of time to run the test loop. This setting helps achieve our RPS targets and predicts how fast the tests will run. Increasing the duration will lower the RPS, while lowering the duration increases RPS.
  • Environment Variables: We can set environment variables to inject in our test environment to help control the configuration.
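To make the RPS relationship referenced above concrete, here is a minimal calculation sketch. The VU count and loop duration are illustrative values; only the two-requests-per-loop figure comes from the test script above (one POST and one GET per loop).

// Sketch: estimate requests per second from the test settings.
const numberOfVus = 50;         // illustrative
const requestsPerLoop = 2;      // our vuLoop makes one POST and one GET
const loopDurationSeconds = 1;  // illustrative minimum VU loop duration

const estimatedRps = (numberOfVus * requestsPerLoop) / loopDurationSeconds;

console.log(`Estimated RPS: ${estimatedRps}`); // 100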

Data collection

With the settings in place, we can then execute the test and start collecting data. In this case, we will use ten minutes as the test duration. However, as mentioned above, the appropriate length depends on the objectives set at the start of the test. In a real scenario where we aim to replicate actual user traffic, a longer duration, such as a full day, will likely be necessary.

While the test runs, we can monitor how it performs to ensure that it works and there are no initial issues. Here is an example of how Multiple shows the test in real time:

We will look at the results in depth below, but notice the dotted lines: They represent the requests per second (RPS) for each type of request in the test. RPS generally increases as the number of VUs increases, in accordance with the two-minute ramp-up duration selected in the settings.

Here is the final result:

Results analysis

We can draw some conclusions upon looking at the final results:

  • The RPS averaged 8 requests per second for successful GET requests (with a status of 200), and 9 requests per second for successful POST requests.
  • All POST requests were executed successfully since we did not capture any POST response status codes besides 200.
  • There were a significant number of GET requests that returned either 400 or 500 status codes. Out of 5,460 total GET requests over the 10-minute test, 695 requests failed because of client or server errors. This suggests an error rate of around 12.7%, which is high enough to warrant further investigation.
  • The average response times for all POST and GET requests (whether successful or not) ranged between 97-99 ms, with P95 values of 118-125 ms and P99 values of 146-249 ms. For most standard web applications, a typical acceptable upper limit for response times is 500 ms for P95 and 1 second for P99. Based on this test, our chat API is performing well. Note that P99 values will be higher than the average because they capture tail latency, often caused by variability in the internet and network infrastructure; normally, P95 is the better indicator of what a typical user experiences.
  • While there were a few latency spikes, their magnitude is still very low, so they can be considered minor and do not need investigation.

The results of this isolated test do not suggest that there are performance issues in our application. However, the relatively high number of error codes returned from GET requests suggests that there are other issues in our API worth looking into. To do so, we could go back to our application logs, review application code, and investigate infrastructure resources to confirm that the right amount of resources, such as memory and CPU, was allocated throughout the test. The key to debugging these issues is to correlate timestamps and trace anomalies throughout the application and infrastructure.


Conclusion

For soak tests to be effective and efficient, engineers must approach them with a systematic, well-thought-out strategy. By defining objectives, simulating realistic workloads, monitoring continuously, automating tests, setting up notifications, and dedicating time for post-test analysis, engineers can ensure that they extract the maximum value from each soak test they conduct.