Probabilistic matching notebook for AI Workbench

About probabilistic or fuzzy matching in BlueConic

The profile merging and identity resolution tools in BlueConic help you identify and merge profiles that belong to the same person. Additional matching tools are available in AI Workbench. You can use the AI Workbench Probabilistic matching notebook to find “fuzzy” matches in your customer profile database. Fuzzy matches are profiles that likely belong to the same person even though not all fields in these profiles have exactly the same values.

A simple, but common, example of a fuzzy match is where two profiles have identical first names and surnames, but their phone numbers differ by one digit, possibly because of a typo. The notebook determines a match by finding common typos, misspellings, and the deliberate replacement of a character or digit by a person with multiple profiles. This happens in a probabilistic way, instead of through exact matching. It is important to note that the notebook also detects exact matches for the profile properties it examines.

Questions you can answer with this notebook include:

  • “Which profiles really belong to the same person if you filter out small changes in spelling?”
  • “How common are 'fuzzy' matches in my profile database?”
  • “How many exact matches are not covered by my existing merge rules?”

You can tie the notebook directly to the profile merge functionality in BlueConic by adding the notebook’s output to your profile merging rules.

Use cases for probabilistic or 'fuzzy matching' in BlueConic

This notebook is intended for use cases where you expect accurate matches to be rare relative to the total number of profiles in your database. For this reason, it is recommended to include at least one highly diverse profile property that has many different values in your dataset when performing probabilistic matching. Including less diverse properties is not a problem as long as at least one highly diverse property is also included.

Examples of highly diverse properties are “phone number” and “email address,” while “age” is a perfect example of a profile property with a very low diversity. If you do not include a highly diverse property among the profile properties to examine, the probability of finding false matches goes up exponentially. For example, there are 40,000 John Smiths in the United States, and among them hundreds of John Smiths have an age value of 48. However, when we find two John Smiths, both aged 48, with a phone number that differs by one digit, we are likely looking at the same man who mistyped his phone number and thus have found a fuzzy match.

Also, consider how a fuzzy match for the property “age” is almost meaningless on its own since a single typo can turn “29” into “92,” so it would be best to demand that this property match exactly.

Adding the Probabilistic matching notebook

Note: Before you get started, contact your BlueConic Customer Success Manager to have the notebook plugin added to your BlueConic environment.

  1. Select AI Workbench from the BlueConic navigation bar.
  2. Click Add notebook.
  3. A popup window appears. Scroll down to Probabilistic matching notebook and click it.
  4. The notebook opens. Read through the Notebook editor introduction. 
    Data scientists can adjust parameters and customize the Python code. But you can also run the notebook by configuring its input parameters.
  5. Use the Parameters window to set the required parameters.
  6. Save your settings and run the notebook.

Setting Probabilistic matching notebook parameters

The Probabilistic matching notebook takes a set of parameters, which you can specify in the notebook's Python code or via the Parameters tab in the UI. Here are some guidelines for setting the notebook's parameters:

  • All profile properties to check: All the profile properties the notebook should check for matching values.
  • Subset of profile properties that have to match exactly: Properties that have to match exactly. This must be a subset of the list of properties to check. You can leave this field empty.
  • Profile property to write “merge ID” to […]: If you provide a profile property here, the notebook will write a merge ID to this property for all matching profiles. If this property is included in your BlueConic profile merge rules, the profiles will automatically be merged. This must be a profile property that is marked as a “unique identifier.”
  • Profile segment: This is the segment the profiles have to be a member of. You can leave this field empty. Providing a segment here can help narrow down potential matches and/or decrease the size of the profile dataset.
  • Maximum allowed Damerau-Levenshtein edit-distance […]: The maximum allowed edit-distance per profile property, excluding properties listed in Subset of profile properties that have to match exactly. A value of 1 is recommended. Learn more about how the Damerau-Levenshtein edit-distance works.
  • Require all properties in a profile to have a value?: If this field is checked, none of the profile properties in All profile properties to check are allowed to be empty. If they are empty for a particular profile, that profile is not loaded. If this field is unchecked, a profile is loaded as long as at least one of its properties has a value. It is recommended to check this box because matches containing empty values are not reliable.
  • Minimum ‘last-modified-date’ […]: If this field is set, a profile has to have been modified in some way (for example, adding an order or updating contact information) since this date in order for the profile to be loaded. You can leave this field empty. Set a value here if you want to exclude older, inactive profiles.
  • Maximum number of profiles to process: This is provided mostly for testing purposes, for example, to benchmark the runtime. If set to 0, there is no limit to the number of profiles, except possibly by the segment or the requirement to disregard empty values.
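
The constraints in the list above can be sketched as a small validation step. This is only an illustration: the class name MatchingParameters, its field names, and the example property IDs are hypothetical and are not the notebook's actual variable names.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class MatchingParameters:
    """Hypothetical container mirroring the documented notebook parameters."""
    properties_to_check: list[str]
    exact_match_properties: list[str] = field(default_factory=list)
    merge_id_property: Optional[str] = None
    segment: Optional[str] = None
    max_edit_distance: int = 1          # a value of 1 is recommended
    require_all_values: bool = True     # recommended: skip profiles with empty values
    min_last_modified: Optional[date] = None
    max_profiles: int = 0               # 0 means no limit

    def validate(self) -> list[str]:
        """Return a list of problems; the list is empty if none were found."""
        problems = []
        if not self.properties_to_check:
            problems.append("At least one profile property to check is required.")
        # The exact-match properties must be a subset of the checked properties.
        extra = set(self.exact_match_properties) - set(self.properties_to_check)
        if extra:
            problems.append(f"Exact-match properties not in the checked list: {sorted(extra)}")
        if self.max_edit_distance < 0:
            problems.append("The maximum edit-distance cannot be negative.")
        if self.max_profiles < 0:
            problems.append("The maximum number of profiles cannot be negative.")
        return problems


params = MatchingParameters(
    properties_to_check=["first_name", "last_name", "phone", "age"],
    exact_match_properties=["age"],
    merge_id_property="merge_id",
)
print(params.validate())
```

An empty list means the settings are consistent with the guidelines above. Whether the merge ID property is marked as a unique identifier can only be checked in BlueConic itself.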

See Setting AI Workbench notebook parameters for instructions on editing your notebook input values.

Running the Probabilistic or 'fuzzy' matching notebook

The notebook can be run manually or through scheduling. Manual runs allow for more detailed monitoring of the notebook’s execution and timings. With scheduling, you can set a repeating cadence for the notebook to run automatically. Scheduling also ensures a log gets saved and that the notebook does not have to share resources with other notebooks. For more information on these options, see Scheduling and running AI Workbench notebooks.

Examining your probabilistic matching results

When you run the notebook manually in AI Workbench, various timings (rounded to one-tenth of a minute) are displayed for the notebook’s costliest operations. Because all of these operations scale roughly linearly with the number of profiles, you can use the timing for sample sets to predict the notebook’s runtime for large datasets. During a scheduled run, limited timing data is sent to the status update.

For both manual and scheduled runs, the number of profiles, exact matches, and fuzzy matches are displayed in the output and log.

For profiles found to have matches, the notebook writes a merge ID to a profile property you select in the notebook's Parameters tab. If that property is included in your BlueConic profile merge rules, the platform can automatically merge profiles that have the same merge ID. In addition, matches are recorded in a human-readable CSV-file with their profile IDs and the values of the profile properties that the notebook examined.
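
One way to picture how matched profiles could end up sharing a merge ID is to group matched pairs into clusters. The sketch below is an assumption, not the notebook's documented method: it treats matches as transitive (if A matches B and B matches C, all three get the same ID) and uses random UUIDs as merge IDs. The function name assign_merge_ids and the profile IDs are hypothetical.

```python
import uuid


def assign_merge_ids(pairs):
    """Group matched profile ID pairs into clusters and give each cluster one ID."""
    parent = {}

    def find(x):
        # Find the representative of x's cluster (union-find with path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    cluster_ids = {}
    result = {}
    for profile_id in list(parent):
        root = find(profile_id)
        if root not in cluster_ids:
            cluster_ids[root] = str(uuid.uuid4())
        result[profile_id] = cluster_ids[root]
    return result


matches = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
merge_ids = assign_merge_ids(matches)
for profile_id, merge_id in sorted(merge_ids.items()):
    print(profile_id, merge_id)
```

In this example p1, p2, and p3 share one merge ID, and p4 and p5 share another.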

Ways to increase the accuracy of probabilistic or fuzzy profile matches

It is important to consider which profile properties are the best ones to use to find fuzzy matches. In general, it is recommended to include at least one highly diverse profile property in the selection. An easy rule of thumb is to look at the maximum length of a profile property. For example, a two-letter code can never have more than 26x26=676 possible unique values, so on its own it becomes quite meaningless in a dataset of millions of profiles. Note that diversity can often be deceptive. There are many different possible unique names and street addresses, but these are typically unevenly distributed, with a small number of values accounting for a large share of the profiles. For example, the property values “Joe” or “1 Main Street” could make up large portions of a dataset.
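
To get a feel for how diverse a property really is, you could count its distinct values and check how dominant the most common value is. The following is a minimal sketch on made-up sample values; the function name diversity_report is hypothetical, and loading real property values from BlueConic is left out.

```python
from collections import Counter


def diversity_report(values):
    """Summarize how diverse a list of property values is (empty values ignored)."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return {"distinct": 0, "distinct_ratio": 0.0, "top_value": None, "top_share": 0.0}
    counts = Counter(non_empty)
    top_value, top_count = counts.most_common(1)[0]
    return {
        "distinct": len(counts),
        # 1.0 means every value is unique; values near 0 mean few distinct values.
        "distinct_ratio": len(counts) / len(non_empty),
        "top_value": top_value,
        # Share of profiles that carry the single most common value.
        "top_share": top_count / len(non_empty),
    }


first_names = ["Joe", "Joe", "Joe", "Ann", "Joe", "Maria"]
phones = ["555-0101", "555-0102", "555-0103", "555-0104", "555-0105", "555-0106"]

print(diversity_report(first_names))
print(diversity_report(phones))
```

A property with a low distinct ratio or a high top share, like the first names here, is a weak basis for fuzzy matching on its own.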

You can further achieve increased accuracy by declaring that one or more profile properties have to match exactly, or by disregarding profiles with empty property values. These options also make the notebook run faster and more efficiently.

During testing, the notebook’s runtime scaled linearly with the number of profiles: doubling the number of profiles roughly doubles the runtime.

To illustrate this, a test on one customer’s server with 3.4 million real profiles, using the properties first name, last name, and phone number (all having a non-empty value), took roughly 20 minutes to find 14,000 matches, while searching through 1.7 million profiles took 10 minutes.
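
Under the assumption that runtime stays roughly linear, the timing of a sample run can be extrapolated to a larger dataset. This sketch uses the figures from the test above; the function name is hypothetical, and real runtimes will depend on your environment and the properties you select.

```python
def estimate_runtime_minutes(sample_profiles, sample_minutes, target_profiles):
    """Linearly extrapolate a measured runtime to a different number of profiles."""
    if sample_profiles <= 0:
        raise ValueError("sample_profiles must be positive")
    return sample_minutes * target_profiles / sample_profiles


# 1.7 million profiles took about 10 minutes in the test described above.
estimate = estimate_runtime_minutes(1_700_000, 10, 3_400_000)
print(f"Estimated runtime: {estimate:.0f} minutes")
```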

Using edit-distance to increase accuracy

The notebook measures similarity between values based on the Damerau-Levenshtein measure of edit-distance. This measure increases by 1 for each of the following edits: deletion, insertion, or substitution of a character, or the swapping of two adjacent characters. It corresponds well to common typos or misspellings, as well as the deliberate changing of, say, a phone number. When two values are exactly the same, the edit-distance is 0. Note that the notebook ignores capitalization and strips hyphens and special characters, so these do not contribute to the edit-distance.

Examples of probabilistic or fuzzy profile matches

For example, “Jennifer” and “Jenifer” have an edit-distance of 1 (deletion of an “n”), as do “Michael” and “Micheal” (swapping of the adjacent characters “ae” for “ea”). It is recommended to stick to an edit-distance of 1 to lower the risk of false matches. Common profile property values are often only a few edits away from each other, so the number of false matches (as well as the notebook’s runtime) will increase exponentially with higher allowed edit-distances.
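
The examples above can be checked with a short implementation of the edit-distance. This sketch uses the "optimal string alignment" variant of Damerau-Levenshtein, which may differ from the notebook's own code. The normalization step (lowercasing and dropping every non-alphanumeric character) is an assumption about how capitalization, hyphens, and special characters are stripped.

```python
def normalize(value: str) -> str:
    """Lowercase and drop non-alphanumeric characters (assumed normalization)."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def edit_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    a, b = normalize(a), normalize(b)
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("Jennifer", "Jenifer"))
print(edit_distance("Michael", "Micheal"))
print(edit_distance("555-0101", "5550101"))
```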

Note that the edit-distance is calculated per profile property. The total edit-distance between two profiles may therefore be as high as the number of profile properties × the allowed edit-distance, and the profiles will still be a fuzzy match.
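
The per-property rule can be sketched as follows. The function names, property names, and sample profiles are hypothetical, and value normalization is left out for brevity. Each property is compared on its own, and properties that must match exactly are given an allowed distance of 0.

```python
def osa_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def is_fuzzy_match(p1, p2, properties, exact_properties=(), max_distance=1):
    """True if every property is within its allowed edit-distance."""
    for prop in properties:
        allowed = 0 if prop in exact_properties else max_distance
        if osa_distance(p1.get(prop, ""), p2.get(prop, "")) > allowed:
            return False
    return True


props = ["first_name", "last_name", "phone", "age"]
a = {"first_name": "john", "last_name": "smith", "phone": "5550101", "age": "48"}
b = {"first_name": "jon", "last_name": "smyth", "phone": "5550102", "age": "48"}
c = {"first_name": "john", "last_name": "smith", "phone": "5550101", "age": "84"}

print(is_fuzzy_match(a, b, props, exact_properties=["age"]))
print(is_fuzzy_match(a, c, props, exact_properties=["age"]))
```

Profiles a and b differ by one edit in each of three properties, for a total edit-distance of 3, yet they still count as a fuzzy match. Profiles a and c are rejected because “age” is required to match exactly.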

Using the SymSpell algorithm, the notebook can rapidly find values from the dataset that are within the allowed edit-distance from another value.
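
The core idea behind SymSpell is the "symmetric delete" index: instead of comparing every value with every other value, each value is indexed under all strings obtained by deleting up to the allowed number of characters, and values that share an index key become candidate matches. The sketch below is a simplified illustration of that idea, not the notebook's code and not the API of any SymSpell library. Candidates can still be further apart than the allowed edit-distance, so they would need to be verified with the actual edit-distance afterwards.

```python
from collections import defaultdict
from itertools import combinations


def delete_variants(value, max_distance=1):
    """Return the value plus every string made by deleting up to max_distance characters."""
    variants = {value}
    frontier = {value}
    for _ in range(max_distance):
        next_frontier = set()
        for word in frontier:
            for i in range(len(word)):
                next_frontier.add(word[:i] + word[i + 1:])
        variants |= next_frontier
        frontier = next_frontier
    return variants


def candidate_pairs(values, max_distance=1):
    """Pair up values that share at least one delete variant."""
    index = defaultdict(set)
    for value in set(values):
        for variant in delete_variants(value, max_distance):
            index[variant].add(value)
    pairs = set()
    for bucket in index.values():
        for a, b in combinations(sorted(bucket), 2):
            pairs.add((a, b))
    return pairs


names = ["jennifer", "jenifer", "michael", "micheal", "robert"]
for pair in sorted(candidate_pairs(names)):
    print(pair)
```

Because each value is only indexed once, the work grows with the number of values rather than with the number of possible pairs, which is consistent with the roughly linear runtime described earlier.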