How to optimize AI Workbench notebook performance

As you develop notebooks in AI Workbench, it can be helpful to find ways to optimize notebook performance. At times you may run into these situations:

  1. Your notebook runs out of memory.
  2. Your notebook runs for a very long time.

This article illustrates a number of strategies to scale up your notebook by speeding up its processing and keeping it within its memory limits.

Avoid retrieving profile properties you don't need

Use the properties parameter of the get_profiles method to retrieve only the profile properties you are using in your model.

segment_id = bc.get_blueconic_parameter_value("Segment", "segment")
profile_property_id = bc.get_blueconic_parameter_value("Profile property", "profile_property")

for profile in bc.get_profiles(segment_id=segment_id,
                               properties=[profile_property_id],
                               progress_bar=False):
    # do something with the profile values
    value = profile.get_value(profile_property_id)

Avoid retrieving profiles you don't need

Retrieving profiles is one of the more expensive operations in AI Workbench. Reducing the number of profiles the notebook has to retrieve is therefore one of the most effective methods of improving the performance of your notebook.

Profiles that are no longer relevant

Let's say you are implementing your own RFM notebook, which only takes into account orders in the last year. If the recency, frequency, and monetary value profile properties of a specific profile all have a value of 1, and the customer did not order anything in the last year, the values for the RFM scores will not change. This means that we don't have to retrieve and update this specific profile. You can implement these conditions using filters, which are applied on top of the existing segment configuration.

from datetime import datetime, timedelta
from dateutil import relativedelta

# store datetime.now() in a global variable
# so that the value is the same across the execution
NOW = datetime.now()

segment_id = bc.get_blueconic_parameter_value("Segment", "segment")
last_order_date_property = bc.get_blueconic_parameter_value("Last order date property", "profile_property")

rfm_recency_property = bc.get_blueconic_parameter_value("RFM Recency property", "profile_property")

# the last order date has to be in the last year
ONE_YEAR_AGO = NOW - timedelta(days=365)
last_order_date_filter = blueconic.get_filter(last_order_date_property).in_range(min=ONE_YEAR_AGO)

# ... or the RFM recency has to be higher than 1
rfm_recency_filter = blueconic.get_filter(rfm_recency_property).in_range(min=2)

# retrieve all profiles that are part of the configured segment
# and match at least one of the filters
for profile in bc.get_profiles(segment_id=segment_id,
                               properties=[last_order_date_property],
                               required_properties=[last_order_date_property],
                               filters=[last_order_date_filter, rfm_recency_filter],
                               progress_bar=False):
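    last_order_date = profile.get_value(last_order_date_property)
    # relativedelta(later, earlier) yields a positive difference
    time_since_last_order = relativedelta.relativedelta(NOW, last_order_date)
    # combine years and months, because .months alone only holds the remainder (0-11)
    months_since_last_order = time_since_last_order.years * 12 + time_since_last_order.months

    # recency is the maximum of "10 - the number of months that have passed since the customer last purchased" and 1
    recency = max(10 - months_since_last_order, 1)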
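
The scoring rule in the loop above can be checked in isolation. The following is a minimal sketch that uses only the standard library. The helper names months_between and recency_score are hypothetical and not part of the BlueConic API, and the whole-month arithmetic is an assumption that approximates what relativedelta reports as years and months combined.

```python
from datetime import datetime


def months_between(earlier: datetime, later: datetime) -> int:
    """Return the number of whole months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    # the last month is incomplete if `later` has not yet reached
    # the day-of-month and time of `earlier`
    if (later.day, later.time()) < (earlier.day, earlier.time()):
        months -= 1
    return max(months, 0)


def recency_score(last_order_date: datetime, now: datetime) -> int:
    """Score from 10 (ordered this month) down to 1 (nine or more months ago)."""
    return max(10 - months_between(last_order_date, now), 1)
```

Keeping the rule in a pure function makes it easy to try edge cases, such as orders older than a year, before running the notebook against a full segment.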

Profiles that have not changed since the last successful execution

Let's say you are implementing a lead scoring model based on whether or not the lead has performed certain actions (e.g. subscribed to the newsletter, requested a demo, clicked on an ad, or downloaded a whitepaper). In this case the lead score does not change unless a value in one of the associated profile properties changes. This means the notebook does not need to retrieve profiles that have not changed since the last successful execution. You can use the get_executions method to retrieve the last few executions of the current notebook. To retrieve only the profiles that have changed since then, use the start_date of the last successful execution as a filter on the lastmodifieddate profile property.

If the profile properties you are using in your model are all filled by web behavior, you can use the lastvisitdate profile property instead of the lastmodifieddate profile property.

Please see the Filtering profiles chapter of the BlueConic Python API documentation for more information about using filters.

# Returns the start date of the last successful execution of this notebook
def get_last_successful_execution_start_date():
    for execution in bc.get_executions(count=10):
        if execution.state == "FINISHED":
            return execution.start_date
    return None

segment_id = bc.get_blueconic_parameter_value("Segment", "segment")
profile_property_id = bc.get_blueconic_parameter_value("Profile property", "profile_property")

# use the last successful execution of this notebook
# to add a filter based on the "lastmodifieddate" profile property
filters = []
last_successful_execution_start_date = get_last_successful_execution_start_date()
if last_successful_execution_start_date is not None:
    lastmodifieddate_filter = blueconic.get_filter("lastmodifieddate").in_range(min=last_successful_execution_start_date)
    filters = [lastmodifieddate_filter]

# retrieve all profiles that are part of the configured segment
# and match the filters
for profile in bc.get_profiles(segment_id=segment_id,
                               properties=[profile_property_id],
                               filters=filters,
                               progress_bar=False):
    # do something with the profile values
    value = profile.get_value(profile_property_id)

Avoid unnecessary profile update calls

Let's say your notebook updates a score in the profile (e.g. an engagement score or a propensity score). By comparing the new score with the existing score in the profile, you can determine whether it makes sense to update the profile.

segment_id = bc.get_blueconic_parameter_value("Segment", "segment")
engagement_score_property = bc.get_blueconic_parameter_value("Engagement score property", "profile_property")

with bc.get_profile_bulkhandler() as bulk_handler:
    for profile in bc.get_profiles(segment_id=segment_id,
                                   properties=["visits", "clickcount", engagement_score_property],
                                   progress_bar=False):

        # calculate a custom engagement score
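        visits = profile.get_value("visits")
        pageviews = profile.get_value("clickcount")

        # skip profiles without visits or pageviews to avoid dividing by zero or by a missing value
        if not visits or pageviews is None:
            continue

        previous_engagement_score = profile.get_value(engagement_score_property)
        new_engagement_score = pageviews / visits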
        
        # check if the new engagement score is different from the previous engagement score
        # and if so, update the profile
        if new_engagement_score != previous_engagement_score:
            profile.set_value(engagement_score_property, new_engagement_score)
            bulk_handler.write(profile)
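
When scores are floating-point numbers, an exact comparison can flag a change that is only rounding noise. The following is a minimal sketch of a tolerance-based check. The helper name score_changed is hypothetical, and it assumes the stored value is either missing (None) or convertible to float.

```python
import math
from typing import Optional, Union


def score_changed(previous: Optional[Union[float, str]], new: float,
                  rel_tol: float = 1e-9) -> bool:
    """Return True when the new score differs meaningfully from the stored one."""
    if previous is None:
        # no score stored yet, so the profile has to be written
        return True
    # assumption: a stored value may come back as text, so convert it first
    return not math.isclose(float(previous), new, rel_tol=rel_tol)
```

Such a helper could replace the inequality test in the loop above, so that the profile is only written when the score has changed by more than the tolerance.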

Avoid retrieving the same profile twice

Consider these two scenarios:

  • You need to calculate aggregates for a number of segments.
  • You first need to calculate aggregate statistics for all profiles in a segment, and then use these statistics to update a score in the profile.

A naive approach for these scenarios would be to call the get_profiles method multiple times (e.g. once for each segment or once to train the model and once to apply the model). A better approach would be retrieving all necessary profiles at once and storing them in memory (e.g. in a Pandas DataFrame) or on disk (e.g. in a CSV or SQLite file).
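
As a rough illustration of the "retrieve once, reuse many times" idea, the sketch below stores the data in SQLite and then reads it twice. The rows list is hypothetical stand-in data for the values collected in a single pass over the profiles, the table and column names are invented for this example, and each profile belongs to exactly one segment for simplicity.

```python
import sqlite3

# stand-in for the rows produced by a single pass over the profiles (hypothetical data)
rows = [
    ("p1", "segment_a", 10.0),
    ("p2", "segment_a", 30.0),
    ("p3", "segment_b", 5.0),
]

# ":memory:" keeps the sketch self-contained; a file path would keep the data on disk
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE profiles (profile_id TEXT PRIMARY KEY, segment TEXT, order_value REAL)"
)
connection.executemany("INSERT INTO profiles VALUES (?, ?, ?)", rows)
connection.commit()

# first use of the stored data: aggregate statistics per segment
segment_means = dict(
    connection.execute("SELECT segment, AVG(order_value) FROM profiles GROUP BY segment")
)

# second use of the same data: a per-profile score relative to the segment mean
scores = {
    profile_id: order_value / segment_means[segment]
    for profile_id, segment, order_value in connection.execute(
        "SELECT profile_id, segment, order_value FROM profiles"
    )
}
connection.close()
```

Both queries run against the local copy, so the profiles only have to be retrieved from BlueConic once.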

Use online algorithms

Some use cases may require processing a large number of profiles and associated profile properties. Storing all this data in Python variables could cause a notebook to run out of memory. Online or out-of-core algorithms can help in these cases. These algorithms usually process data piece-by-piece or in small batches, avoiding the need to store all data in memory.

To compute the mean, variance, standard deviation, skewness, kurtosis, minimum, and maximum of your data, you can use the RunStats library.
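
To show the kind of single-pass computation involved, here is a minimal sketch of Welford's algorithm for a running mean and variance. The RunningStats class is written for this example only and is not the RunStats library's API. It keeps three numbers in memory regardless of how many values it has seen.

```python
import math


class RunningStats:
    """Single-pass mean and sample variance using Welford's algorithm."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def push(self, value: float) -> None:
        """Update the statistics with one new value."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def variance(self) -> float:
        # the sample variance is undefined for fewer than two values
        return self._m2 / (self.count - 1) if self.count > 1 else float("nan")

    def stddev(self) -> float:
        return math.sqrt(self.variance())


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.push(value)
```

In a notebook, push would be called once per profile inside the retrieval loop, so the individual values never have to be kept in a list.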

To estimate the percentiles and quantiles of your data, you can use the tdigest library.

For machine learning use cases, the scikit-learn project provides a number of out-of-core algorithms.
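
Out-of-core learners are typically fed one small batch at a time (in scikit-learn, through an estimator's partial_fit method). The following is a minimal sketch of the batching step only. The helper name in_batches is hypothetical, and the range of integers stands in for an iterator over profiles.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most batch_size items without loading the whole input."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# hypothetical usage: seven items in batches of three
batches = list(in_batches(range(7), 3))
```

Only one batch is held in memory at a time, which keeps memory use bounded by the batch size rather than by the number of profiles.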

Example: Out-of-core percentile estimation for an RFM calculation

In a previous example we derived the value of the "RFM recency" profile property directly from the number of months since the last order. A more advanced approach uses percentiles to ensure that each bucket contains a similar number of profiles. However, calculating exact percentiles requires keeping the values of all profiles in memory, which may not be feasible for large segments. We will therefore use the tdigest library to estimate the percentiles. This requires us to make two passes across the data:

  1. Retrieve all profiles and update the T-Digest data structure.
  2. Use the T-Digest data structure to update the "RFM recency" profile property values.

To avoid retrieving all profiles twice, we will store the profile data in a CSV file in step 1, and retrieve it in step 2.

# install the tdigest library
!pip install --quiet tdigest

import csv
import math
import os
from datetime import datetime

from tdigest import TDigest

# store datetime.now() in a global variable
# so that the value is the same across the execution
NOW = datetime.now()
SECONDS_IN_DAY = 24 * 60 * 60

segment_id = bc.get_blueconic_parameter_value("Segment", "segment")
last_order_date_property = bc.get_blueconic_parameter_value("Last order date property", "profile_property")

# os.path.join adds a path separator if the working directory does not end with one
csv_filename = os.path.join(bc.get_cwd(), "profiles.csv")
# the column names must match the keys used when the file is read back
columns = ["profile_id", "number_of_days_since_last_order"]

# percentile estimation
number_of_days_since_last_order_digest = TDigest()

# newline="" is recommended by the csv module to avoid blank rows on some platforms
with open(csv_filename, "w", newline="") as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(columns)

    for profile in bc.get_profiles(segment_id=segment_id,
                                   properties=[last_order_date_property],
                                   required_properties=[last_order_date_property],
                                   progress_bar=False):
        last_order_date = profile.get_value(last_order_date_property)
        number_of_days_since_last_order = round((NOW - last_order_date).total_seconds() / SECONDS_IN_DAY)

        # write the profile ID and number of days since the last order to a file
        # for later processing
        csvwriter.writerow([profile.id, number_of_days_since_last_order])

        # update the T-Digest data structure to estimate the percentiles
        number_of_days_since_last_order_digest.update(number_of_days_since_last_order)

number_of_days_since_last_order_digest.compress()    
    
rfm_recency_property = bc.get_blueconic_parameter_value("RFM Recency property", "profile_property")

# read the CSV file and use the T-Digest data structure to update the RFM recency
with bc.get_profile_bulkhandler() as bulk_handler:
    with open(csv_filename) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            profile = blueconic.Profile(row["profile_id"])
            number_of_days_since_last_order = int(row["number_of_days_since_last_order"])
            
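            # the RFM recency is based on the cumulative distribution of the days since the last order;
            # the CDF is inverted so that more recent orders score higher, as in the earlier example,
            # and the result is clamped to the range 1-10
            cdf = number_of_days_since_last_order_digest.cdf(number_of_days_since_last_order)
            recency = min(max(math.ceil((1 - cdf) * 10), 1), 10)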
            
            # update the profile
            profile.set_value(rfm_recency_property, recency)
            bulk_handler.write(profile)

Need additional resources for your AI Workbench use cases?

If your AI Workbench use cases require additional resources, let us know via support@blueconic.com. We'll discuss your requirements and upgrade your subscription as necessary.