Synthetic Test Data: Tutorial & Best Practices

April 9, 2024 | 15 min read

Synthetic test data is artificially generated data that mimics real data in testing environments. It is commonly used in both functional and nonfunctional testing when real data is unavailable, insufficient, or inappropriate for testing purposes. In addition, synthetic test data offers an alternative to replicating real data from production to staging environments.

Compared to replication-based approaches, synthetic data is safer from a security and data integrity perspective because it is generated from scratch rather than copied from production, meaning there is no chance that real user data ends up outside the production environment. Synthetic data is especially relevant in sectors such as healthcare and finance, where compliance guarantees and SLA observance are particularly important.

This article explores five best practices for generating synthetic test data. It also discusses use cases for synthetic test data and common pitfalls to avoid.

Summary of key synthetic test data best practices

Several best practices form the basis of a robust synthetic data generation strategy. These practices are summarized in the table below and explained in greater detail in the sections that follow.

Best practice | Description
Understand data requirements | Define data field properties, relationships, and validation requirements.
Generate an appropriate volume of data | Upsample or downsample test data according to testing goals.
Validate synthetic data | Ensure adequate diversity and realism in generated data while adhering to data requirements.
Plan for scalability | Ensure that testing plans account for generating large amounts of data per build.
Prefer rule-based test data | Set rules for how data will be generated, and highlight expected data validations.

Understanding synthetic test data

As applications evolve, test data management becomes an increasingly important consideration. Large applications with expansive user bases and robust CI/CD pipelines need to utilize large amounts of test data on every build and test operation.

When building new applications, many engineering teams start testing using fixtures and hard-coded user states. As applications grow and manual management becomes unwieldy, more efficient and scalable test data management techniques become necessary.

There are two common approaches for handling test data at scale, which are summarized in the table below.

Approach | Description | Potential concerns
Data masking | Copying data from production to testing environments, replacing PII with representative fake values | PII risks and the ongoing need to add new fields to blacklists
Synthetic test data | Generating data in testing environments using rule-based or AI-driven tools | Requires initial configuration and can be tricky to implement

Data masking

The process for data masking typically includes the following steps:

  1. Identify sensitive data
  2. Choose a masking tool
  3. Configure masking rules
  4. Execute the masking action
  5. Review/validate

Tools like Informatica, IBM InfoSphere Optim, and Delphix offer data masking capabilities that can automate much of the process.
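
For teams that build masking in-house rather than adopting one of these tools, the core transformation can be expressed in a few lines. The sketch below is a minimal, hypothetical example (it does not reflect any specific vendor's configuration format) that replaces sensitive columns with deterministic fake values so that masked rows stay internally consistent:

import hashlib

def mask_email(value: str) -> str:
    # A deterministic hash keeps masked values consistent across tables
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:10]
    return f"user_{digest}@example.com"

def mask_name(value: str) -> str:
    return "REDACTED"

# Masking rules: column name -> masking function (hypothetical columns)
MASKING_RULES = {"email": mask_email, "name": mask_name}

def mask_row(row: dict) -> dict:
    # Apply the configured rule for each column; leave other fields untouched
    return {col: MASKING_RULES.get(col, lambda v: v)(value) for col, value in row.items()}

print(mask_row({"id": 1, "name": "John Doe", "email": "johndoe@example.com"}))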

One key drawback of data masking is that masking can fail at the level of individual rows during a transformation. In other words, data masking may successfully transform some pieces of sensitive data but not others. This is best illustrated by example: the following Delphix output shows records that could not be masked due to conformity issues:

Algorithm: Segment Mapping (4 Characters, Alpha-Numeric)

+--------+--------+-------------+------------------------------------+
| Input  | Masked | Non-Conform | Comment                            |
+========+========+=============+====================================+
| 1234   | 3424   |             | Masked ok                          |
| ABCD   | KENB   |             | Masked ok                          |
+--------+--------+-------------+------------------------------------+
| !AB!   | !AB!   | PLLP        | Contains punctuation               |
| ÀÄÅB   | ÀÄÅB   | LLLL        | Tricky, contains accented letters  |
+--------+--------+-------------+------------------------------------+
| ABCD12 | ABCD12 | LLLLNN      | Too long                           |
+--------+--------+-------------+------------------------------------+

An attempt to mask data records using Delphix, where three records failed to conform (source)

Another primary drawback of data masking is that it carries the inherent risk of leaking PII—whenever data is replicated from production, care must be taken to ensure that no sensitive data is copied. This can require several transformation steps because sensitive data must be removed and new data added in its place.

Synthetic test data

In contrast, synthetic test data generation processes typically follow these steps:

  1. Define a schema describing the application’s structure, relationships, and constraints.
  2. Choose a generation method: Pick the tools, algorithms, or processes to be used.
  3. Set up generation rules for how data is to be created and randomized.
  4. Generate data using automated tools to create data at the proper scale quickly.
  5. Review/validate, check results, and ensure the accuracy of synthetic test data.

There are many choices for generation tools: Tonic and Mockaroo are popular options, and many more providers serve specific industries. Data generation may also be added to CI/CD pipelines, ensuring that fresh data is available in every build/test cycle.

Synthetic data generation avoids PII concerns altogether: given a structured data schema, it can produce representative data with no risk of leaking private user data.
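
As a simple illustration of the generation-based approach, the sketch below uses the open-source Faker library to produce records shaped like the sample output shown later in this article. The field names and value ranges are assumptions chosen for illustration rather than the output of any particular product:

import json
import random

from faker import Faker  # pip install faker

fake = Faker()

def generate_user(record_id: int) -> dict:
    # Every field is generated from scratch, so no production data is copied
    return {
        "id": record_id,
        "name": fake.name(),
        "email": fake.email(),
        "age": random.randint(18, 90),
        "city": fake.city(),
    }

records = [generate_user(i) for i in range(1, 4)]
print(json.dumps(records, indent=2))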


Synthetic data generation by example

Tonic is a powerful platform for generating realistic and diverse test data. Here is an example of how Tonic can be used to generate synthetic test data.

1. After creating a Tonic account, connect the data source:

  • Once logged in, click on the Add Data Source button.
  • Choose the type of data source to connect (e.g., database, API, or file).
  • Provide the necessary connection details and credentials.

2. Define the data model:

  • After connecting the data source, Tonic will automatically infer the data model based on the structure of the data.
  • Review and customize the data model as needed, specifying the fields, data types, and any constraints.

3. Configure data generation rules:

  • Tonic provides a set of built-in data generators and transformations that can be applied to the data model.
  • Configure the desired data generation rules for each field, such as using realistic values, random values, or specific patterns.
  • You can also define custom data generation rules using JavaScript or Python.

4. Generate synthetic data:

  • Once the data model is defined and the data generation rules are configured, click the Generate button.
  • Specify the number of records to generate and any additional options.
  • Tonic will generate the synthetic data based on the configuration.

5. Export the generated data:

  • After the data generation process is complete, the generated data can be exported in various formats, such as CSV, JSON, or SQL.
  • Download the exported data file or integrate it directly into the application or testing environment.

Here is a simplified example of how the generated data from Tonic might look:

[
  {
    "id": 1,
    "name": "John Doe",
    "email": "johndoe@example.com",
    "age": 35,
    "city": "New York"
  },
  {
    "id": 2,
    "name": "Jane Smith",
    "email": "janesmith@example.com",
    "age": 28,
    "city": "London"
  },
  {
    "id": 3,
    "name": "Michael Johnson",
    "email": "michaeljohnson@example.com",
    "age": 42,
    "city": "Paris"
  }
]

With an understanding of synthetic test data and its purpose in place, the following sections expand on five best practices for generating synthetic test data and explain how to integrate data generation effectively into development workflows.

Understand data requirements

Before choosing any testing strategy, it is critical to establish business requirements: data fields, formats, and relationships must all be defined. Adhering to business requirements ensures that the synthetic test data accurately represents real-world data scenarios.

Many edge cases and potential anomalies in data can be identified by assessing application schemas, and it is wise to pay attention to nullable, transformable, or unique data fields. These can indicate key inflection points in generated data and, later, in test scenarios.

When working with structured data sources such as SQL databases or GraphQL APIs, the schema serves as the source of truth for data requirements. Test data can be generated to meet schema requirements, and edge cases (such as nullable fields) can be explored automatically or manually in tests.

Applications that rely only on non-relational databases are not required to have a set definition for how data is structured; in practice, however, data is usually serialized in a predictable form. A schemaless app may be as simple as a JSON API or as complex as a multi-tenant MongoDB server. It is ideal to manually document a schema for such applications in as much detail as can be known in advance. For APIs, tools like Swagger/OpenAPI, Postman, and RAML are commonly used to establish expectations and data requirements, and tools like Mockaroo can be used to create schemas for datasets.
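
One lightweight way to document data requirements for a schemaless application is to keep a JSON Schema in version control and validate generated records against it. The sketch below assumes a hypothetical user record and uses the jsonschema library:

from jsonschema import ValidationError, validate  # pip install jsonschema

# Hypothetical schema documenting a "user" record for a schemaless API
USER_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["id", "email"],
}

record = {"id": 1, "email": "test@example.com", "age": 35}

try:
    validate(instance=record, schema=USER_SCHEMA)
    print("Record conforms to the documented schema")
except ValidationError as err:
    print(f"Schema violation: {err.message}")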

In the context of applications with real-time data, data may be streamed from a source database to many sinks. In other words, data changes can be replicated from one database to various other destinations as they occur. Techniques such as database triggers or change data capture (CDC) may be used to ensure that this replication happens at the database level. Popular real-time tools like Apache Pulsar and Kafka are often used for this purpose. Depending on the implementation, topics may have predefined schemas that can be copied into version control. A popular example of this is Avro schemas. These schemas and the data they represent can and should be captured when generating synthetic data.
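
When topic schemas are already defined, they can double as the specification for the data generator. The sketch below shows a hypothetical Avro-style schema for an order-events topic, held as a Python dictionary checked into version control, alongside a generator that emits matching records; the field names are illustrative assumptions:

import random
import time

# Hypothetical Avro schema for an "order_events" topic, kept in version control
ORDER_EVENT_SCHEMA = {
    "type": "record",
    "name": "OrderEvent",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "amount_cents", "type": "int"},
        {"name": "event_time", "type": "long"},  # epoch milliseconds
    ],
}

def generate_event() -> dict:
    # Emit a synthetic event whose fields mirror the schema above
    return {
        "order_id": random.randint(1, 10_000),
        "amount_cents": random.randint(100, 50_000),
        "event_time": int(time.time() * 1000),
    }

print([generate_event() for _ in range(3)])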

Generate an appropriate volume of data

The volume of test data should always be aligned with testing objectives.

While some scenarios necessitate test data that mirrors the volume expected in production environments—such as during large load tests or stress tests—other use cases do not require generating data on such a vast scale. This is particularly true for functional testing or small-scale nonfunctional tests, where the focus might be on specific features or performance aspects under typical usage conditions rather than peak server load scenarios.

The key is to assess the objectives of each test (or suite of tests) to decide on the appropriate amount of data needed. It is essential to strike a balance between generating enough data to cover various scenarios and avoiding unnecessary data overload. This is especially important in nonfunctional testing, where tests may attempt to replicate real-world peak conditions.
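
One practical approach is to parameterize record counts per test type rather than hard-coding them in individual tests. A minimal sketch, with entirely hypothetical numbers:

# Hypothetical record counts per testing objective; tune these per project
VOLUME_BY_TEST_TYPE = {
    "unit": 50,           # a handful of representative records
    "functional": 1_000,  # enough variety to cover feature scenarios
    "load": 1_000_000,    # approximately production-scale volume
}

def records_needed(test_type: str) -> int:
    return VOLUME_BY_TEST_TYPE.get(test_type, 1_000)

print(records_needed("load"))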

Validate synthetic data

To ensure the reliability and accuracy of test results, it is crucial to generate diverse and realistic synthetic test data. This can be achieved by creating structured schemas. Doing so ensures that any variability in test data does not deviate from an application schema’s expected structure in a way that would cause unexpected errors during testing.

In addition, unit tests can be implemented to verify that the data layer meets specific requirements. By accurately mimicking the characteristics and patterns of real data and adhering to data constraints, validating the synthetic data ensures its effectiveness in testing.
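
In practice, this validation can take the form of a small unit test suite that runs against every freshly generated dataset. The sketch below uses pytest; the generate_users module, field names, and diversity threshold are assumptions made for illustration:

import re

import pytest  # pip install pytest

from datagen import generate_users  # hypothetical project module

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

@pytest.fixture(scope="module")
def users():
    return generate_users(1_000)

def test_required_fields_present(users):
    assert all({"id", "email", "age"} <= record.keys() for record in users)

def test_emails_are_well_formed(users):
    assert all(EMAIL_RE.match(record["email"]) for record in users)

def test_data_is_reasonably_diverse(users):
    # Guard against degenerate generators that repeat the same values
    assert len({record["email"] for record in users}) > 900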

As enterprises scale, the management of synthetic data across different environments becomes a significant challenge. This is due to the sheer volume of data generated, especially in CI/CD pipelines where large amounts of data are generated to facilitate automatic testing and deployment. A dedicated team is often necessary to manage this data effectively, ensuring its integrity across development, testing, and production environments. This management includes many elements, such as:

  • Version control
  • Data storage optimization
  • Ensuring data accessibility for relevant stakeholders
  • Maintaining security and privacy standards

Furthermore, enterprise teams must navigate a complex landscape of legal compliance requirements, audits, and intricate business requirements. Synthetic data must not only be accurate and reliable but also compliant with industry regulations and standards. This often involves working closely with legal teams, auditors, and compliance officers to ensure that data generation, storage, and usage practices adhere to all relevant laws and guidelines.

Large enterprises have the distinct advantage of accessing domain experts. These experts validate synthetic data generation methodologies and provide in-depth knowledge of specific industries. Their involvement is crucial for adding another layer of scrutiny to the validation process. Domain experts can identify industry-specific nuances and requirements that generic testing might overlook. For example, in the healthcare industry, domain experts can ensure that synthetic patient data accurately reflects real-world medical conditions and treatments while adhering to privacy regulations such as HIPAA.


The role of data schemas

Relevant application data schemas should be used as a source for synthetic test data validation, meaning that the structure of test data should match data validations in the application. For example, no instance of synthetic test user data should have a missing email field if the email field is non-nullable in the application’s GraphQL schema.

Test data should match the application’s data schema in all cases unless invalid data handling is being explicitly tested. To illustrate, imagine an application with a defined data schema for customer records:

Customer ID (numeric)
Name (text)
Email Address (text, must follow email format)
Date of Birth (date, in YYYY-MM-DD format)
Loyalty Points (numeric)

Customer schema

When testing normal application functionality, such as creating a new customer record or updating an existing one, the test data should adhere to this schema. For instance, a valid test case might involve adding a new customer record with the following details:

Customer ID: 12345
Name: "Alex Smith"
Email Address: "alex.smith@example.com"
Date of Birth: "1990-05-15"
Loyalty Points: 200

Valid test data example

This test data is valid because it matches the application’s data schema, with each field following the correct format and data type as defined.

However, there might be scenarios where developers specifically want to test the application’s resilience and error handling capabilities when faced with invalid data. In such cases, deliberately using test data that deviates from the schema is necessary. For example, to test how the application handles an invalid email address, developers might use the following test case:

Customer ID: 12346
Name: "Jamie Doe"
Email Address: "jamie.doe" 
Date of Birth: "1985-10-30"
Loyalty Points: 150

Invalid data example

In this scenario, the email address does not match the expected format (e.g., “id@domain.com”), which violates the application’s data schema for the email field. This test aims to verify that the application correctly identifies the error, perhaps by displaying an error message or rejecting the record until the data is corrected.
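
A short validation routine makes the distinction concrete. The sketch below checks both example records against the customer schema described above; the simplified email pattern stands in for whatever validation the application actually performs:

import re
from datetime import datetime

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_customer(record: dict) -> list:
    errors = []
    if not isinstance(record.get("customer_id"), int):
        errors.append("Customer ID must be numeric")
    if not EMAIL_RE.match(record.get("email", "")):
        errors.append("Email Address must follow the id@domain.com format")
    try:
        datetime.strptime(record.get("date_of_birth", ""), "%Y-%m-%d")
    except ValueError:
        errors.append("Date of Birth must use the YYYY-MM-DD format")
    if not isinstance(record.get("loyalty_points"), int):
        errors.append("Loyalty Points must be numeric")
    return errors

valid = {"customer_id": 12345, "name": "Alex Smith", "email": "alex.smith@example.com",
         "date_of_birth": "1990-05-15", "loyalty_points": 200}
invalid = {"customer_id": 12346, "name": "Jamie Doe", "email": "jamie.doe",
           "date_of_birth": "1985-10-30", "loyalty_points": 150}

print(validate_customer(valid))    # [] -> record accepted
print(validate_customer(invalid))  # email format error -> record rejected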

Plan for scalability

The data generation process should be designed in a way that can easily accommodate future changes and additions to the data requirements. Planning for scalability in this way ensures that the synthetic test data remains relevant and useful as the application grows.

Consider integrating the following specific strategies into the data generation process to ensure it remains adaptable, scalable, and relevant as application needs evolve:

  • Implement modular design: Structure data generation scripts or tools in a modular fashion, where components are loosely coupled and can be independently updated or replaced. This approach makes it easier to modify specific aspects of the data generation process without having to overhaul the entire system.
  • Use configurable parameters: Allow key aspects of the data generation process to be controlled through external configuration files or parameters. This makes it easy to adjust the volume, variety, and complexity of the generated data without altering the core logic of your generation tools (a brief sketch follows this list).
  • Incorporate data templates: Employ templates for generating data that can be easily edited or extended to accommodate new fields, formats, or constraints as data requirements evolve. Templates provide a flexible way to define how synthetic data should be structured.
  • Leverage data specification languages: Consider using a data specification language (such as JSON Schema, Avro, or Protobuf) to define the structure, types, and validation rules for your synthetic data. These specifications can be updated to reflect new data requirements, and your generation tools can use them to ensure compliance with the latest schema.
  • Automate data validation checks: Integrate automated checks into your data generation process to validate that the synthetic data adheres to the current data model and meets all necessary constraints. Automation ensures that any changes in the data requirements are consistently enforced across all generated datasets.
  • Enable CI/CD: Integrate data generation tools into a CI/CD pipeline. This setup allows for automatic updates and testing of the data generation process alongside application code changes, ensuring that synthetic data evolves in tandem with the application.
  • Embrace version control for data models: Use version control systems to manage changes to data models and generation scripts. This helps track the evolution of data requirements and ensures that changes are deliberate and documented.
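
To make the configuration and template points concrete, the sketch below drives a simple template-based generator from an external configuration; the parameter names and values are hypothetical:

import json
import random

from faker import Faker  # pip install faker

fake = Faker()

# Hypothetical configuration, e.g., loaded from a datagen_config.json file
CONFIG = {"record_count": 500, "null_rate": 0.05}

# Data template: field name -> callable that produces a value
USER_TEMPLATE = {
    "id": lambda i: i,
    "name": lambda i: fake.name(),
    "email": lambda i: fake.email(),
}

def generate(template: dict, config: dict) -> list:
    records = []
    for i in range(1, config["record_count"] + 1):
        record = {field: make(i) for field, make in template.items()}
        # Occasionally null out an optional field to exercise edge cases
        if random.random() < config["null_rate"]:
            record["name"] = None
        records.append(record)
    return records

print(json.dumps(generate(USER_TEMPLATE, CONFIG)[:2], indent=2))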

The following diagram illustrates a typical workflow for generating synthetic data in a CI/CD pipeline:

Synthetic data can be persisted between build steps for validation and further testing

In addition, implementing an iterative feedback loop where test results inform subsequent rounds of data generation can refine the synthetic data’s quality over time. This approach ensures that the data evolves in response to testing needs, continuously enhancing its effectiveness and realism.

Further enhancements to consider as an application evolves include data augmentation, random sampling, and increasing (or parameterizing) randomization rates. These three practices will be discussed in the following sections.

Data augmentation

Data augmentation is a strategy used to increase the diversity of data available for training models or testing applications by applying various transformations to the data. This is particularly useful in scenarios where collecting more data is impractical or impossible. Data augmentation techniques include the following:

  • Flipping: This involves mirroring data, such as images, horizontally or vertically. It is useful in cases where the orientation of data might vary in real-world scenarios but should not affect the application’s performance.
  • Rotating: Data, especially images or geometric shapes, can be rotated by various amounts to simulate the effect of viewing the same object from different angles. This helps in creating a more versatile dataset.
  • Scaling: Resizing data to different dimensions tests how well an application can handle inputs of varying sizes. It is crucial for applications that must process data coming in from different sources with no standard size.

An example of data flipping (Source)
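
For image data, these transformations map directly onto a few Pillow calls. A minimal sketch, where the input file name is a placeholder for an image from your test dataset:

from PIL import Image  # pip install pillow

original = Image.open("sample_input.png")  # placeholder path

augmented = [
    original.transpose(Image.Transpose.FLIP_LEFT_RIGHT),            # flipping
    original.rotate(30, expand=True),                               # rotating
    original.resize((original.width // 2, original.height // 2)),   # scaling
]

for index, image in enumerate(augmented):
    image.save(f"augmented_{index}.png")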

Random sampling

Random sampling involves selecting a subset of data from a larger dataset randomly for testing purposes. The key here is to use a variety of sampling techniques, such as these:

  • Simple random sampling: This method involves choosing sample data at random from a larger dataset.
  • Stratified sampling: Here, the larger dataset is divided into subgroups, and samples are taken from each subgroup.
  • Cluster sampling: In this approach, the larger dataset is divided into clusters, and entire clusters are selected for testing.

Each of these approaches to random sampling helps test data cover a wide range of scenarios and inputs, ensuring that the application is tested against unexpected or rare conditions and thereby improving its reliability and performance.
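
The sketch below illustrates all three sampling approaches on an in-memory dataset; the grouping key and sample sizes are arbitrary choices for illustration:

import random
from collections import defaultdict

dataset = [{"id": i, "region": random.choice(["NA", "EU", "APAC"])} for i in range(1_000)]

# Simple random sampling: pick records uniformly at random
simple_sample = random.sample(dataset, 50)

# Stratified sampling: group by region, then sample within each subgroup
by_region = defaultdict(list)
for record in dataset:
    by_region[record["region"]].append(record)
stratified_sample = [r for group in by_region.values() for r in random.sample(group, 10)]

# Cluster sampling: treat each region as a cluster and select whole clusters
chosen_clusters = random.sample(list(by_region), 2)
cluster_sample = [r for cluster in chosen_clusters for r in by_region[cluster]]

print(len(simple_sample), len(stratified_sample), len(cluster_sample))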

Randomization rates

Randomization rates dynamically adjust the level of randomness in the generated test data. By parameterizing randomization rates, developers can control how much variance is introduced in the test datasets. This is particularly useful in stress testing or in scenarios where the application’s behavior under extreme conditions is being evaluated. For instance, in a financial application, varying the randomness in transaction volumes or values can help assess the system’s performance under different market conditions. Higher randomization rates can simulate more volatile or unusual market conditions, while lower rates can simulate more stable scenarios.
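
A minimal sketch of a parameterized randomization rate for the transaction example, where a single parameter controls how far generated values may deviate from a baseline (all numbers are arbitrary):

import random

def generate_transaction_amounts(count: int, base: float, randomization_rate: float) -> list:
    # A rate of 0.1 keeps values within ±10% of the base, simulating stable conditions;
    # a rate of 1.0 allows swings of up to ±100%, simulating volatile conditions
    return [round(base * (1 + random.uniform(-randomization_rate, randomization_rate)), 2)
            for _ in range(count)]

print(generate_transaction_amounts(5, base=100.0, randomization_rate=0.1))  # stable
print(generate_transaction_amounts(5, base=100.0, randomization_rate=1.0))  # volatile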

Prefer rule-based test data

A business rules engine defines the rules used to generate data: it evaluates conditions and executes actions based on those conditions, automating decision-making according to predefined logic. Rules engines are used in various domains, such as finance for loan approval processes, e-commerce for dynamic pricing strategies, healthcare for patient care plans, and many others where decision logic needs to be frequently updated or is too complex to hard-code directly into applications. Depending on the business domain, there may be existing products that fit a team’s needs. However, it is just as common for this engine to be developed internally, with a design tailored to the application domain.

An advantage of working with data structured according to a business rules engine is the ability to maintain the integrity of incoming data records according to a predefined schema. SQL, GraphQL, Prisma, ActiveRecord, and many other tools and approaches to structuring application data rely on schemas to ensure data integrity. To generate synthetic test data, it is preferable to start from existing application schemas, which guarantees that changes to the database will be reflected in the testing environment.

Preferring rule-based test data is also vital to maintaining the integrity and consistency of test environments. By initiating synthetic test data generation from existing application schemas, developers can ensure a seamless alignment between the test data and the actual database structures, thus reducing the risk of test failures due to data schema mismatches. This is especially beneficial in CI pipelines and other DevOps practices, where automatic and frequent integration of changes requires test environments to be continuously updated to reflect the latest database schemas.

Leveraging schemas for synthetic test data generation allows for a more systematic and controlled approach to data creation. Developers can specify rules and constraints directly derived from the schema, ensuring that generated data not only conforms to the structural requirements but also reflects the complexity and variety of real-world data. This method enhances the relevance and effectiveness of testing, enabling teams to identify and address potential issues before they impact production environments.

To implement rule-based synthetic data generation effectively, teams should consider the following steps (a brief sketch follows the list):

  1. Schema analysis: Thoroughly analyze the application schema to understand the data model, relationships, and constraints.
  2. Rule definition: Define explicit rules for data generation that align with the schema constraints and business logic.
  3. Automation: Automate the process of synthetic data generation to ensure that it can easily adapt to schema changes.
  4. Validation and testing: Continuously validate generated data against the schema and conduct rigorous testing to ensure data integrity and application robustness.
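
A minimal sketch of steps 1 through 3, in which the generation rules are derived from a schema-like definition rather than hard-coded per test; the schema, rule names, and field list are hypothetical:

import random

from faker import Faker  # pip install faker

fake = Faker()

# Step 1: schema analysis output, e.g., derived from the application's table definitions
CUSTOMER_SCHEMA = {
    "customer_id": {"type": "int", "min": 1, "max": 1_000_000},
    "email": {"type": "email"},
    "loyalty_points": {"type": "int", "min": 0, "max": 10_000},
}

# Step 2: rule definition, mapping schema types to generator functions
RULES = {
    "int": lambda spec: random.randint(spec.get("min", 0), spec.get("max", 1_000_000)),
    "email": lambda spec: fake.email(),
}

# Step 3: automated generation driven entirely by the schema and rules
def generate_customer() -> dict:
    return {field: RULES[spec["type"]](spec) for field, spec in CUSTOMER_SCHEMA.items()}

print([generate_customer() for _ in range(3)])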

Adopting a rule-based approach to synthetic test data generation offers a strategic advantage by ensuring high-quality, relevant test data that accurately reflects application requirements and enhances the overall efficiency of the development and testing process.


Last thoughts

Following best practices for synthetic test data generation provides many benefits. It helps development teams ensure that test data adheres to data requirements, guarantees the reliability of testing processes, scales test data according to testing goals, and improves the robustness of CI/CD pipelines.

Synthetic data is more than a technical resource: When fully embraced, it can be a catalyst for security and responsibility in data handling. Whether it’s for PII avoidance or augmenting existing datasets, generation-based approaches avoid the challenges that come with replication-based techniques such as data masking. Embracing synthetic data generation addresses practical challenges like data scarcity and privacy concerns and also ensures ongoing security for evolving systems.