Probabilistic matching notebook for AI Workbench

About probabilistic or fuzzy matching in BlueConic

The profile merging and identity resolution tools in BlueConic help you identify and merge customer profiles that belong to the same person. Additional AI-based matching tools are available in AI Workbench.

The Probabilistic matching notebook in AI Workbench examines a set of customer profiles to find likely, or “fuzzy,” matches among them. Fuzzy matches are profiles that likely belong to the same person even though not all fields in these profiles have exactly the same values.

A simple but common example of a fuzzy match is two profiles with identical first names and surnames whose phone numbers differ by one digit, possibly because of a typo. The notebook determines a match by detecting common typos, misspellings, and the deliberate replacement of a character or digit by a person with multiple profiles. Matching is probabilistic rather than exact, although the notebook also detects exact matches for the profile properties it examines.

Questions you can answer with this notebook include:

  • “Which profiles really belong to the same person if you filter out small changes in spelling?”
  • “How common are 'fuzzy' matches in my profile database?”
  • “How many exact matches are not covered by my existing merge rules?”

You can tie the notebook directly to the profile merge functionality in BlueConic by adding the notebook’s output to your profile merging rules.

Use cases for probabilistic or 'fuzzy matching' in BlueConic

This notebook is intended for use cases where you expect accurate matches to be rare relative to the total number of profiles in your database. For this reason, it is recommended to include at least one highly diverse profile property that has many different values in your dataset when performing probabilistic matching. Including less diverse properties is not a problem as long as at least one highly diverse property is also included.

Examples of highly diverse properties are “phone number” and “email address,” while “age” is a good example of a profile property with very low diversity. If you do not include a highly diverse property among the profile properties to examine, the probability of finding false matches rises sharply. For example, there are 40,000 John Smiths in the United States, and hundreds of them have an age value of 48. However, when we find two John Smiths, both aged 48, whose phone numbers differ by one digit, we are likely looking at the same man who mistyped his phone number, and thus have found a fuzzy match.

Also note that a fuzzy match on the property “age” is almost meaningless on its own: swapping two adjacent digits turns “29” into “92,” which counts as a single edit. It is best to require that this property match exactly.

Creating a Probabilistic matching notebook

Note: Before you get started, contact your BlueConic Customer Success Manager to have the notebook plugin added to your BlueConic environment.

  1. Select AI Workbench from the BlueConic navigation bar.
  2. Click Add notebook.
  3. In the popup window, scroll down to Probabilistic matching notebook and click it.
  4. The notebook opens. Read through the Notebook editor introduction. 
    In the Python notebook editor, data scientists can customize the code and the notebook's input parameters. Business users can supply model inputs in the notebook's parameters UI and run the model without writing any code.
  5. Select Parameters in the left-hand panel to access the parameters UI, where you can enter your model inputs. See the guidelines below.
  6. Save your settings and run the notebook.

Customizing your probabilistic matching parameters

In the Parameters tab, you can customize how BlueConic finds matching profiles without writing any code. Here are some guidelines for setting the input parameters:

  • Select a segment: Choose a segment of customer profiles for the model to examine for likely matches. Providing a segment here can help narrow down potential matches and/or decrease the size of the profile dataset.
  • Select at least 3 profile properties to check: Choose which profile properties the notebook should check for closely matching values. To prevent false matches, it's helpful to include profile properties with different types of values (phone number, email address, etc.).
  • Select profile properties that have to match exactly (optional): If there are profile properties from the list you selected above that must match exactly, enter them here. Your selections here must be a subset of the profile properties the notebook is examining. You can also leave this field empty.
  • Select a profile property to contain a “merge ID”: You can choose to select a merge ID for all the matching profiles the notebook finds. You can then use the merge ID in your BlueConic profile merge rules to automatically merge matching profiles. 
  • Maximum allowed Damerau-Levenshtein edit-distance: Set how far apart two values of the same profile property can be and still be considered matching. The distance is measured per profile property, and properties that must match exactly are excluded. We recommend you set a value of 1. Learn more about how the Damerau-Levenshtein edit-distance works.
  • Examine only recently updated profiles (optional): If this field is set, a profile has to have been modified in some way since this date (for example, adding an order or updating contact information) in order for the profile to be loaded. You can leave this field empty. Set a value here if you want to exclude older, inactive profiles.
  • All profile properties must have a value (recommended): Select this option to require that all the profile properties you chose to examine have a value. The notebook will not check profiles that lack a value for any of these properties. If this option is not selected, a profile will be checked for matches as long as at least one of its properties has a value. We recommend that you select this option, because profile matches containing empty values are not reliable.
  • Maximum number of profiles to process: Use this option for testing purposes, for example, to benchmark runtimes for the notebook. If you set it to 0, the notebook will not limit the number of profiles examined, aside from the segment count or the option to disregard empty values. 
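
The rules in the list above can be expressed as a small consistency check. The following Python sketch is only an illustration: the class name, field names, and default values are hypothetical and do not come from the notebook's actual code. It mirrors the documented constraints (at least three properties to check, exact-match properties forming a subset of them, and 0 meaning "no limit" for the maximum number of profiles).

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class MatchingParameters:
    """Hypothetical container mirroring the fields of the Parameters tab."""
    segment: str
    properties: list[str]
    exact_properties: list[str] = field(default_factory=list)
    merge_id_property: str = "merge_id"    # assumed property name
    max_edit_distance: int = 1             # recommended value
    modified_since: Optional[date] = None  # optional "recently updated" filter
    require_all_values: bool = True        # recommended setting
    max_profiles: int = 0                  # 0 means "no limit"

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the settings are consistent."""
        problems = []
        if len(self.properties) < 3:
            problems.append("select at least 3 profile properties to check")
        extra = set(self.exact_properties) - set(self.properties)
        if extra:
            problems.append(
                f"exact-match properties not in the checked list: {sorted(extra)}"
            )
        if self.max_edit_distance < 0:
            problems.append("maximum edit-distance cannot be negative")
        if self.max_profiles < 0:
            problems.append("maximum number of profiles cannot be negative")
        return problems
```

Calling validate() on a configuration with only two properties, or with an exact-match property that is not in the checked list, returns one message per violated rule.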

See Setting AI Workbench notebook parameters for instructions on editing your notebook input values.

Running the Probabilistic or 'fuzzy' matching notebook

You can run the notebook manually or through scheduling. Manual runs allow for more detailed monitoring of the notebook’s execution and timings. With scheduling, you can set a repeating cadence for the notebook to run automatically. Scheduling also ensures a log gets saved and that the notebook does not have to share resources with other notebooks. For more information on these options, see Scheduling and running AI Workbench notebooks.

Examining your probabilistic matching results

When you run the notebook manually in AI Workbench, various timings (rounded to one-tenth of a minute) are displayed for the notebook’s costliest operations. Because all of these operations scale roughly linearly with the number of profiles, you can use the timing for sample sets to predict the notebook’s runtime for large datasets. During a scheduled run, limited timing data is sent to the status update.

For both manual and scheduled runs, the number of profiles, exact matches, and fuzzy matches are displayed in the output and log.

For profiles found to have matches, the notebook writes a merge ID to a profile property you select in the notebook's Parameters tab. If that property is included in your BlueConic profile merge rules, the platform can automatically merge profiles that have the same merge ID. In addition, matches are recorded in a human-readable CSV-file with their profile IDs and the values of the profile properties that the notebook examined.
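
One way to picture the merge ID step is as grouping matched profiles and giving each group a shared identifier. The sketch below is an assumption about how this could work, not the notebook's documented behavior: the function name is hypothetical, it groups matches transitively (if A matches B and B matches C, all three share an ID), and it uses the smallest profile ID in each group as the merge ID purely to keep the example deterministic.

```python
def assign_merge_ids(pairs):
    """Group matched profile IDs and give every group one shared merge ID."""
    parent = {}

    def find(x):
        # Follow parent links up to the group's root, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as the root so the result is deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    # Map every matched profile ID to the root of its group.
    return {profile_id: find(profile_id) for profile_id in parent}
```

Profiles that never appear in a matched pair are absent from the result, which corresponds to the notebook writing a merge ID only for profiles found to have matches.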

Increasing the accuracy of probabilistic or fuzzy profile matches

It is important to consider which profile properties are the best ones to use to find fuzzy matches. In general, we recommend that you include at least one highly diverse profile property in the properties the notebook examines. An easy rule of thumb is to look at the maximum length of a profile property. For example, a two-letter code can never have more than 26x26=676 possible unique values, so on its own it becomes quite meaningless in a dataset of millions of profiles. Note that diversity can be deceptive. There are many possible unique names and street addresses, but these are typically unevenly distributed, with a small number of very common values accounting for a large share of profiles. For example, the property values “Joe” or “1 Main Street” could make up large portions of a dataset.
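
To check how diverse a property really is, it can help to look at both the number of distinct values and the share taken by the most common value. The following Python sketch assumes you have already exported a property's values into a list; the function name and the returned keys are hypothetical.

```python
from collections import Counter


def diversity_report(values):
    """Summarize a property's diversity: distinct values and dominance of the top value."""
    filled = [v for v in values if v]  # ignore empty values
    counts = Counter(filled)
    if not counts:
        return {"filled": 0, "distinct": 0, "top_value": None, "top_share": 0.0}
    top_value, top_count = counts.most_common(1)[0]
    return {
        "filled": len(filled),
        "distinct": len(counts),
        "top_value": top_value,
        "top_share": top_count / len(filled),
    }


# Upper bound on unique two-letter codes mentioned above.
max_two_letter_codes = 26 * 26  # 676
```

A property with many distinct values but a high top_share is the deceptive case described above: it looks diverse, yet a single value such as “Joe” covers a large part of the dataset.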

You can further achieve increased accuracy by declaring that one or more profile properties have to match exactly, or by disregarding profiles with empty property values. These options also make the notebook run faster and more efficiently.

In testing, the notebook’s runtime scaled roughly linearly with the number of profiles: doubling the number of profiles roughly doubled the runtime.

To illustrate this, a test on one customer’s server with 3.4 million real profiles, using the properties first name, last name, and phone number (all having a non-empty value), took roughly twenty minutes to find 14 thousand matches, while searching through 1.7 million profiles took 10 minutes.
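
Because of this roughly linear scaling, a timed run on a sample can be extrapolated to a larger dataset. The helper below is a minimal sketch with a hypothetical function name; it assumes strictly linear scaling, so treat the result as a rough estimate that will vary with the server, the properties chosen, and the data.

```python
def estimate_runtime_minutes(sample_profiles, sample_minutes, target_profiles):
    """Extrapolate runtime linearly from a timed sample run."""
    if sample_profiles <= 0:
        raise ValueError("sample_profiles must be positive")
    return sample_minutes * target_profiles / sample_profiles


# Using the figures from the test above: 1.7 million profiles took 10 minutes.
estimated = estimate_runtime_minutes(1_700_000, 10, 3_400_000)
```

With the figures quoted above, the estimate for 3.4 million profiles is 20 minutes, which agrees with the measured runtime of roughly twenty minutes.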

Using edit-distance to increase accuracy

The notebook measures similarity between values using the Damerau-Levenshtein edit-distance. This measure increases by 1 for each of the following edits: deletion, insertion, or substitution of a character, or the swapping of two adjacent characters. It corresponds well to common typos or misspellings, as well as the deliberate changing of, say, a phone number. When two values are exactly the same, the edit-distance is 0. Note that the notebook strips capitalization, hyphens, and special characters, so these do not contribute to the edit-distance.

Examples of probabilistic or fuzzy profile matches

For example, “Jennifer” and “Jenifer” have an edit-distance of 1 (deletion of an “n”), but so do “Michael” and “Micheal” (swapping of the adjacent characters “ae” for “ea”). It is recommended to stick to an edit-distance of 1 to lower the risk of false matches. Common profile property values are often only a few edits away from each other, so the number of false matches (as well as the notebook’s runtime) will increase exponentially with higher allowed edit-distances.
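
The examples above can be reproduced with a short Python sketch. It implements the optimal string alignment variant of the Damerau-Levenshtein distance, which is an assumption: this article does not say which variant the notebook uses. The normalize() helper is also a guess at the normalization described earlier (it lowercases and keeps only letters and digits), and both function names are hypothetical.

```python
def normalize(value):
    """Lowercase the value and keep only letters and digits (assumed normalization)."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def edit_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    a, b = normalize(a), normalize(b)
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

Under these assumptions, edit_distance("Jennifer", "Jenifer") and edit_distance("Michael", "Micheal") both return 1, and values that differ only in capitalization or hyphens have a distance of 0.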

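Note that the edit-distance is calculated per profile property. Two profiles can therefore still be a fuzzy match when their total edit-distance is as high as the number of examined profile properties multiplied by the allowed edit-distance. For example, with three examined properties and an allowed edit-distance of 1, the total edit-distance between two matching profiles can be up to 3. Properties that must match exactly contribute 0 to this total.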
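
The per-property rule can be sketched as follows. The function name and its input format are hypothetical: it takes a dictionary of already-computed edit-distances, one per examined profile property, and checks each against its own limit rather than checking the total.

```python
def is_fuzzy_match(distances, exact_properties=(), max_distance=1):
    """Decide whether two profiles match, given the edit-distance per examined property."""
    for prop, dist in distances.items():
        # Exact-match properties allow no edits; all others allow max_distance.
        allowed = 0 if prop in exact_properties else max_distance
        if dist > allowed:
            return False
    return True
```

With this rule, distances of 1 on each of three properties still count as a match (total of 3), while a single property with a distance of 2 does not, even though its total is lower.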

Using the SymSpell algorithm, the notebook can rapidly find values in the dataset that are within the allowed edit-distance of another value, without comparing every value to every other value.
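
The core idea behind SymSpell is to index each value under the strings obtained by deleting characters from it; two values within the allowed edit-distance of each other will then share at least one index entry. The following Python sketch illustrates that idea for an allowed edit-distance of 1. It is a simplified illustration with hypothetical function names, not the notebook's implementation or the SymSpell library's API.

```python
from collections import defaultdict
from itertools import combinations


def delete_variants(value):
    """The value itself plus every string obtained by deleting one character."""
    return {value} | {value[:i] + value[i + 1:] for i in range(len(value))}


def candidate_pairs(values):
    """Pairs of distinct values that share a delete variant (candidates for edit-distance 1)."""
    index = defaultdict(set)
    for value in set(values):
        for variant in delete_variants(value):
            index[variant].add(value)
    pairs = set()
    for bucket in index.values():
        # Values in the same bucket share a delete variant; report each pair once.
        for a, b in combinations(sorted(bucket), 2):
            pairs.add((a, b))
    return pairs
```

Sharing a delete variant is necessary but not sufficient for an edit-distance of 1 (for example, “xab” and “aby” share “ab” but are two edits apart), so each candidate pair would still need to be verified with an actual edit-distance calculation.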