Data Validation: Deep Dive

This article describes the recommended data validation testing approach and the overall framework that drives the logistics of testing when onboarding new data sources.


Data validation serves as the foundation of reliable data-driven decision-making, ensuring that information is accurate, consistent, and adheres to predefined standards. This article outlines a comprehensive approach to data validation, encompassing the test strategy, execution strategy, and data validation management.

BlueConic strongly advises that data validation be owned by the customer, as they have the domain knowledge and know the corner cases. BlueConic will partner with the customer on data validation, as detailed in this document, but final sign-off should come from designated customer stakeholders.

Stakeholders

The project team members that should be involved in data validation and their respective roles and responsibilities include:

  • Project Manager: Plans testing activities, reviews and approves the document, tracks test performance, and is accountable for test results.

  • Solutions Architect/Validation Owner: Ensures alignment between the test plan and design, provides the testing environment, and is responsible for testing and validating the platform, reporting any issues found.

  • Customer Data Analyst/Technical Lead: Verifies the test plan correlates with the design, provides the testing environment, and follows procedures for fixing defects.

  • Customer Product Owner: Ensures the business is aligned with the results of the test, partners with the Project Manager to track testing results and assist with managing results and risks within the overall business.

  • Connection Cutover Owners/Global Data Analysts: Responsible for transferring files from test tenant to production tenant for file-based systems.

Process

When onboarding a new data source, the process is as follows:

  1. Deliver data feed for tests

  2. Validate source file for required data points and proper UTF-8 encoding

    1. If using a CSV file, make the files available on SFTP/S3 and add the credentials to the BlueConic connection; for API integrations, only the credentials are needed. Ensure the file is valid and fields are properly encoded (a minimal pre-flight check is sketched after this process list). Additional CSV file best practices can be found here.

  3. Configure import on test

    1. It is advised to import production data on test (which can be removed when it’s not needed anymore), to avoid surprises in the data when going to production.

    2. By configuring the connection on test, you can make sure that the right data processors are applied, the property and timeline event types are correct, and the date formats are correct.

  4. Run import on test

    1. Start the connection manually with the "Run now" button.

    2. The connection run history provides a summary of the ongoing import.

    3. The “first x lines” data processor can be used if you want to limit the number of lines in a CSV file to import.

  5. Validate data on test

    1. Once the data is imported, the data needs to be validated.

    2. Configuration changes are often needed; in that case, loop back to step 3 until the validation is approved.

  6. Deliver data for production

    1. Once the data is validated, ensure it is accessible to the production environment. This primarily applies to file imports, which need to be available on the production environment's SFTP server.

  7. Push configuration to production

    1. Push the connection configuration to production using the Transfer Configuration Connection, which performs the following validations:

      • Installed plugins and data processors, including version number

      • Available group types

      • Available timeline event types

      • Available profile properties, including their datatype

  8. Run import on production

    1. Once the data is ready in the configuration, the import can be started on production. In most cases, this will be a scheduled connection, so ensure the correct schedule is entered. The import can be started by setting the connection to "on" and, if needed, by pressing the "Run now" button.

  9. Double check data

    1. Perform a final check on the results of the import. This check should include verifying that the run history shows the expected numbers and that spot-checked profiles contain the expected data.
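For step 2, a lightweight pre-flight check on the source file can catch encoding problems and missing data points before the import is configured. The sketch below is a minimal example, assuming a hypothetical file name (customers.csv) and a hypothetical set of required columns; adjust both to match your own feed definition.

```python
import csv

# Hypothetical file name and required columns -- replace with your own feed definition.
SOURCE_FILE = "customers.csv"
REQUIRED_COLUMNS = {"customer_id", "email"}

def preflight_check(path: str) -> None:
    # errors="strict" makes open() raise UnicodeDecodeError on the first byte
    # sequence that is not valid UTF-8, so encoding problems surface immediately.
    with open(path, newline="", encoding="utf-8", errors="strict") as handle:
        reader = csv.DictReader(handle)
        header = set(reader.fieldnames or [])
        missing = REQUIRED_COLUMNS - header
        if missing:
            raise ValueError(f"Missing required columns: {sorted(missing)}")

        # Count the data rows and flag records without the required data points.
        total = incomplete = 0
        for row in reader:
            total += 1
            if any(not (row.get(column) or "").strip() for column in REQUIRED_COLUMNS):
                incomplete += 1
        print(f"{path}: {total} rows, {incomplete} rows missing required values")

preflight_check(SOURCE_FILE)
```

The row count reported here also gives a baseline for comparing against the run history counters in step 9.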

Infrastructure diagram

A potential infrastructure diagram could look like this:


Comments about this diagram:

  • The test SFTP and Event Streaming Platform servers are completely separated from the production SFTP server. This means test files can be added to the test SFTP server and the data will never end up on the production tenant. This improves the flexibility of the test tenant for loading fake data, for example.

  • There is a separate Event Streaming Platform cluster for testing and for production. This might not be feasible (for whatever reason) and in that case, it is strongly advised to use different topics or different group IDs for test and production.

  • The Marketing Automation Platform is connected to both the test and production SFTP servers. During the testing phase, the marketing automation platform writes its data to the test SFTP server. Before the connection is transferred to production (see step 7 of the process), the data files should be transferred to the production SFTP server (see step 6 of the process).

  • Some systems (like social media platforms) do not have a test environment. For these systems, BlueConic will connect to the API and import from the production social media environment.

Define your tests

The following type of tests can be used to validate the data:

  • Aggregated testing: Define the queries you want to compare with the source system. For example, this could include all profiles with a customer ID and an email address or all profiles with an order in 2020. Ensure these queries can also be executed in the source system.

  • Corner case testing: When logic is applied in a data processor to transform the way data is imported, define corner cases that can be validated individually. For example, when logic is involved in handling opt-ins, define cases for opting in and opting out to validate the end result. These corner case tests should also be used for unit tests in the data processor (see the sketch after this list).

  • Spot checking: Spot checking involves checking specific profiles to validate the data. This is useful for validating the import configuration's result.

  • Exploratory testing: Exploratory testing refers to playing around in the imported data without having a predefined test plan. This is a way to find unexpected anomalies.
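Corner case tests translate naturally into unit tests for the data processor logic. The sketch below is a hypothetical example: resolve_opt_in is not a BlueConic function but stands in for whatever transformation your data processor applies to opt-in values.

```python
import unittest

def resolve_opt_in(current: bool, incoming: str) -> bool:
    # Hypothetical transformation: "N" in the feed always opts the profile out,
    # "Y" opts it in, and an empty value keeps the current state.
    value = (incoming or "").strip().upper()
    if value == "N":
        return False
    if value == "Y":
        return True
    return current

class OptInCornerCases(unittest.TestCase):
    def test_explicit_opt_out_overrides_existing_opt_in(self):
        self.assertFalse(resolve_opt_in(current=True, incoming="N"))

    def test_explicit_opt_in(self):
        self.assertTrue(resolve_opt_in(current=False, incoming="Y"))

    def test_empty_value_keeps_current_state(self):
        self.assertTrue(resolve_opt_in(current=True, incoming=""))

if __name__ == "__main__":
    unittest.main()
```

The same corner case profiles can then be spot checked on the Profiles tab after the import.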

Note: Using production data in the test environment is recommended to detect anomalies early. For example, if multiple records contain [email protected] as an email address, you can prevent the data from these records from all being added to the same profile.

Data validation options

Different tests call for different ways of validation. This section describes the different ways data can be validated.

Run history table

When running an import connection, the run history table shows aggregated information of the imported data:


This run history table contains:

  • The number of updated profiles: This counter is increased when a profile is found based on the matching criteria, so this is the match rate on existing profiles.

  • The number of created profiles: This counter is increased when no profile is found based on the matching criteria and the option “allow the creation of new profiles” is selected.

  • The number of unchanged profiles: This counter is increased when updates are sent but did not result in an actual update on the profile.

  • The number of checked records from the external system. For CSV imports, this represents the number of lines in the CSV file.

  • The number of timeline events sent to BlueConic. This could result in an update (if the event ID and event date are the same) or a new timeline event.

  • The last profiles updated/created. This shows a link to the profiles that were changed or created during the import, which can be used to verify that the import is running as expected.

  • The number of invalid values: When the provided data does not match the datatype of the profile property, the value is ignored and marked as invalid (see the log file to find out which value was incorrect). The rest of the values are imported.
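A quick consistency check is to compare these counters against the number of data lines in the source file. A minimal sketch, assuming the counters are copied manually from the run history and the source is a CSV file (the file name and figures below are hypothetical):

```python
import csv

# Counters copied manually from the connection's run history (hypothetical figures).
run_history = {"created": 1200, "updated": 8300, "unchanged": 500, "invalid_values": 4}

# Count the data lines in the source CSV (header excluded) to compare with "checked records".
with open("customers.csv", newline="", encoding="utf-8") as handle:
    source_rows = sum(1 for _ in csv.DictReader(handle))

touched = run_history["created"] + run_history["updated"] + run_history["unchanged"]
print(f"Source rows: {source_rows}, profiles touched: {touched}")
if touched != source_rows:
    # A difference is not necessarily wrong (several rows can match the same profile,
    # for example), but it should be explainable before the import is signed off.
    print("Counts differ -- investigate before approving the import.")
```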

Profiles tab

The Profiles tab is for examining individual profiles and checking the values of imported data points. When you have defined corner cases in your test data set, use the Profiles tab to check the data points on these profiles. When you don't have corner case profiles, you can do a sample check.

Use the "All profile properties" tab to see all profile properties.


Note: The segments shown in “part of segments” are calculated in real time. The profile itself does not store which segments it belongs to.

The “Profile timeline” tab can be used to see the results of import mappings to the profile timeline:

Data validation insight

For data validation, it's important to have exact numbers for the queries. Segmentation and most insights use a shorthand notation (for example, 3.6K) for readability, but this makes them less reliable for data validation. Also, the segmentation UI uses an index to show the numbers, but when there are more than 2.5 million profiles, a sample index of 1 million randomly selected profiles is created. This can result in slightly different numbers when using the sample index versus the full index. This is acceptable for segmentation, but for data validation, exact numbers are preferred.

To resolve this, use the Data Validation Insight.


It works similarly to the Segment Discovery Insight, but it shows the exact number in an easily readable format and a non-formatted version for copying and pasting into Excel. The data present on the profile (as profile properties) is directly available for selection.

Query profiles notebook

The Query profiles notebook can be used to execute SQL queries on profile and timeline data. This makes it easy to execute queries and define new queries on the fly: the data is loaded (and imported into an SQLite database) once. From that moment on, you can execute SQL queries on it, and the exact number is returned.

This is also very useful when analyzing the root cause of differences between the source data and the BlueConic data.

If the notebook is not yet available on your tenant, ask BlueConic Support to add it to your tenant.

How the notebook works:

  • The profile properties, timeline event types, and segments to load can be configured in the parameters.

  • For each selected timeline event type, a SQL table is created. Also, for nested events, a separate table is created.

  • A “profiles” table is created with a column defined for each selected profile property.

  • All profiles for the selected segment are loaded and the data for those profiles are added to the SQL tables.

  • Execute the queries for validation and add the results to the test results Excel file to compare them with the numbers from the external system.
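The aggregated test queries defined earlier can then be expressed directly as SQL against the tables the notebook creates. A minimal sketch, assuming hypothetical table and column names (a "profiles" table with customer_id and email columns, and an "orders" event table with profile_id and event_date columns); the actual names depend on the profile properties and timeline event types selected in the parameters:

```python
import sqlite3

# Hypothetical database file and schema; in the notebook the SQLite tables are
# created for you based on the selected profile properties and event types.
conn = sqlite3.connect("profiles.db")

validation_queries = {
    "profiles with a customer ID and an email address": """
        SELECT COUNT(*) FROM profiles
        WHERE customer_id IS NOT NULL AND customer_id != ''
          AND email IS NOT NULL AND email != ''
    """,
    "profiles with an order in 2020": """
        SELECT COUNT(DISTINCT profile_id) FROM orders
        WHERE event_date BETWEEN '2020-01-01' AND '2020-12-31'
    """,
}

# Run each query and print the exact count, ready to paste into the test results file.
for description, query in validation_queries.items():
    (count,) = conn.execute(query).fetchone()
    print(f"{description}: {count}")
```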

Offline validation

If your team prefers to use other tools for data validation (for example, SageMaker), BlueConic data can be exported to CSV files using the S3/SFTP Connection or the REST API to extract the profiles.
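As an illustration, a minimal pandas sketch for one aggregated check on an exported file, assuming hypothetical file names and column names for both the BlueConic export and the source system extract:

```python
import pandas as pd

# Hypothetical exports: a BlueConic profile export and an extract from the source system.
blueconic = pd.read_csv("blueconic_profiles_export.csv")
source = pd.read_csv("source_system_extract.csv")

# Compare one aggregated metric: records with both a customer ID and an email address.
bc_count = blueconic.dropna(subset=["customer_id", "email"]).shape[0]
src_count = source.dropna(subset=["customer_id", "email"]).shape[0]
print(f"BlueConic: {bc_count}, source: {src_count}, difference: {bc_count - src_count}")

# List customer IDs that are present in the source but missing in BlueConic,
# which is a useful starting point for the difference analysis described below.
missing_in_blueconic = set(source["customer_id"]) - set(blueconic["customer_id"])
print(f"Customer IDs in source but not in BlueConic: {len(missing_in_blueconic)}")
```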

Difference analysis

When there are differences between the numbers from the source system and the numbers in BlueConic, the Solutions Architect will take the lead in analyzing the cause. This may include:

  • Is the source data correct?

  • Are the correct files picked up?

  • Are the mappings correctly defined in the connection?

  • Are all other settings correct in the connection?

  • Is the data correctly handled in the data processor?

Removal of data

After the connection is pushed to production, the data can be removed from the test tenant using the "Delete Profiles" notebook. If this notebook is not available, ask BlueConic Support to add it.

Data validation summary and issue log

Maintaining a central data validation summary and issue log is crucial for ensuring a smooth and transparent transition from test to production environments. This document serves as a single source of truth, detailing the progress of each connection, outlining key responsibilities for BlueConic and customer validation owners, and documenting the start and end dates for validation and cutover.

By providing a clear status field for each connection, it facilitates real-time tracking and communication among all stakeholders. Primarily owned and maintained by the Solutions Architect and Customer Data Analyst, the log will capture data validation test results, document discrepancies, and serve as the final repository for test outcomes. This central resource empowers stakeholders to effectively evaluate testing progress, identify potential issues, and confidently assess the readiness of each connection for production deployment.

The data validation summary and issue log should have the following sections, which should be used by all stakeholders:

  • Summary: Summarize the queries entered in the individual systems-related tabs. Use this sheet as a high-level status view of the data validation progress for each system.

  • Testing and Cutover Plan: Provide a detailed breakdown of the data validation milestones, the planned and actual dates for validation on test and cutover to production, and the status for each connection.

  • Details Log: Expand on critical issues cited. Use this tab as an escalation resource and a place to provide details and status updates.

  • Systems-Related Tabs: Detailed queries for each system.

Note: While the data validation summary and issue log help facilitate communication and documentation, they should not replace other communication channels for quick escalation of issues. When an issue requires immediate attention, the project team should use all available communication channels, such as email.
