For starters, in all recommendation implementations there's an explore-exploit tradeoff. We suggest that all BlueConic customers who implement recommendations start off with a comprehensive exploration phase to avoid ending up in a “local maximum” – a mathematical way of saying a peak in performance that only looks like the best you can do because nothing better was tried nearby.
1. Exploration Phase
It's important to try at least a handful of different variants with different algorithms to ensure that the data BlueConic gathers during this period is sufficiently varied.
We usually recommend starting off with a number of variants with a single algorithm configured for each variant. Examples of algorithms you might select in the platform are “Breaking news,” “Viral news,” “Look-alike articles/products,” and “Recent high CTRs.” This quickly reveals which algorithms work well for that specific customer and gives us valuable data to determine the ideal algorithm combination.
Once the recommendation dialogues have racked up a number of clicks, BlueConic can start analyzing the recommendations click data to determine if a specific algorithm combination would have resulted in even more clicks. We will describe this process in more detail later on in this blog post. Once we've found an algorithm combination we like through offline analysis, it's important to test that specific combination against the previously best performing variant to ensure it's actually performing well in practice.
(Table: example content algorithms and example product algorithms)
2. Exploitation Phase
During the exploitation phase, use what you learned during the exploration phase to get the maximum amount of value out of BlueConic’s recommendations engine. This means turning off all variants except the best performing one.
3. Continuous Improvement Phase
Our most successful BlueConic customers are continuously improving the quality of their recommendations. Once you've figured out the best algorithm combination across all visitors, it's time to start thinking about specific audience segments that would benefit from recommendations specifically tailored to them.
There are a few methods of evaluating and optimizing algorithm performance, and that’s what I’ll cover in the rest of this post.
Kernel Density Estimate
This graph shows kernel density estimates of the look-alike score for recommendations that were clicked versus those that were not. In this specific graph, you can see that when the look-alike score is low (i.e., the recommended content is very different from the content on the current page), the probability is much higher that the visitor will not click the recommendation. Once the look-alike score increases, however, the roles reverse. This means the look-alike algorithm is probably a great addition to our final algorithm combination.
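BlueConic's internal tooling isn't public, but a comparison like the one in this graph can be sketched with SciPy's `gaussian_kde`. The score distributions below are synthetic stand-ins for real click-log data:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)

# Synthetic look-alike scores (0-1): clicked recommendations skew high,
# non-clicked skew low. Real scores would come from the click logs.
clicked = rng.beta(5, 2, size=500)
not_clicked = rng.beta(2, 5, size=500)

kde_clicked = gaussian_kde(clicked)
kde_not_clicked = gaussian_kde(not_clicked)

# At a low score the "not clicked" density dominates; at a high score
# the densities reverse, which is the pattern described above.
low, high = np.array([0.2]), np.array([0.9])
print(kde_clicked(low)[0] < kde_not_clicked(low)[0])
print(kde_clicked(high)[0] > kde_not_clicked(high)[0])
```

Plotting both estimated densities over the score axis reproduces the crossover described above.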
The next methods are a little more complicated, so indulge me in an analogy first. Let's say you apply for a job in maritime depth estimation, and as a result, you are sent to the middle of a lake in a rowboat to find the deepest point of the lake. After cracking open a beer (you're on a boat, after all) you use a piece of lead on a line to determine the depth at a number of random locations on the lake. When you compare your measurements, you notice that the deepest of them has comparatively little vegetation, more fish, and darker water. Using this information, you gradually work toward a location with even fewer plants, more fish, and dark blue water. At some point you notice that not all fish like really deep waters, which means the deepest waters do not contain more fish than waters closer to shore, but still more fish than the shallowest points of the lake. Using this process, you finally end up at a spot on the lake with a medium amount of fish, very dark water, and no vegetation at all.
The fish, vegetation, and color of the water in the story are our recommendation algorithms, while the depth of the lake is the quality of our recommendations with that algorithm combination. The quality of our recommendations is measured as the mean reciprocal rank of our recommendations based on the historical data. For each set of 5 recommendations where a visitor actually clicked one of them, we determine the position of the clicked recommendation within those 5. If the clicked article is the first recommendation, the score is 1; if it comes in second, the score is 1/2; if it comes in fifth, the score is 1/5. We try to maximize the mean of all these scores (or reciprocal ranks), which is called the mean reciprocal rank. The idea behind this is that the algorithm combination sorts all available articles by the probability that the visitor will click them, ultimately showing the top 5 articles to the visitor.
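As a minimal illustration, the mean reciprocal rank over a made-up set of logged click positions can be computed like this:

```python
# Hypothetical click log: the 1-based position of the clicked item in
# each set of 5 recommendations where the visitor clicked one.
clicked_positions = [1, 3, 1, 2, 5, 1, 4]

# Reciprocal rank: 1 for position 1, 1/2 for position 2, ..., 1/5 for 5.
reciprocal_ranks = [1.0 / pos for pos in clicked_positions]

# The mean of these reciprocal ranks is the score we try to maximize.
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print(round(mrr, 3))  # 0.612
```

A combination that consistently puts the clicked article first would score a perfect 1.0.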
BlueConic currently has 11 recommendation algorithms, each of which can take 21 different values between 0 and 10, which means there are 21^11 = 350,277,500,542,221 possible combinations. Even if we could evaluate a million algorithm combinations against all historical data every second, it would still take over eleven years to try them all. That's why we have to be a bit smarter than randomly trying algorithm combinations, and that's where the method of gradually moving toward the deepest part of the lake comes in.
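The back-of-the-envelope arithmetic checks out:

```python
# 11 algorithm sliders, 21 possible settings each (0 to 10 in steps of 0.5).
combinations = 21 ** 11
print(combinations)  # 350277500542221

# At a million evaluations per second:
seconds = combinations / 1_000_000
years = seconds / (60 * 60 * 24 * 365)
print(round(years, 1))  # 11.1
```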
The convergence graph shows that we're gradually finding deeper and deeper parts of the lake, until we've found the deepest part. When an optimizer has found the best solution it can and keeps trying similar algorithm combinations without making any more progress, we say that the optimizer has converged. We usually want the last part of this graph to be flat for a while. If the graph were still declining, we should take more measurements, since it looks like spending more time optimizing the algorithm combination would still pay off.
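The post doesn't spell out which optimizer BlueConic uses, so here is a toy sketch of the general idea using simple local search. Each evaluation stands in for replaying the click history with one algorithm combination, and the best-so-far curve is the convergence graph described above; the objective function and slider targets are made up for illustration:

```python
import random

random.seed(7)

# Toy stand-in for "mean reciprocal rank of a slider combination":
# smooth, with a single best point. The real objective replays clicks.
def score(combo):
    return 1.0 - sum((v - t) ** 2 for v, t in zip(combo, (6.5, 3.0, 8.0))) / 300.0

# Local search: perturb the current combination, keep improvements,
# and record the best score after every evaluation.
combo = [random.uniform(0, 10) for _ in range(3)]
best = score(combo)
curve = [best]
for _ in range(500):
    candidate = [min(10.0, max(0.0, v + random.gauss(0, 0.5))) for v in combo]
    s = score(candidate)
    if s > best:
        combo, best = candidate, s
    curve.append(best)

# The convergence curve never decreases, and it flattens once similar
# combinations stop yielding improvements.
print(all(b >= a for a, b in zip(curve, curve[1:])))
```

A production setup would more likely use a sequential model-based optimizer than blind local search, but the resulting convergence curve reads the same way.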
These graphs are basically maps of the bottom of the lake, generated from the random samples we've taken. In reality our lake is of course multidimensional; that's why it's represented by a number of separate images in which the axes are different recommendation algorithms. The dots in these images are all the probes where we lowered our piece of lead on a line to measure the depth of the lake, i.e., calculated the mean reciprocal rank for a specific algorithm combination. The color indicates the score for that combination, with a lighter color representing a higher mean reciprocal rank.
The histogram in this image shows the number of times we've tried a specific setting of the recommendation algorithm slider.
Based on our probes, a depth map is generated. The lighter colors represent areas with a higher mean reciprocal rank. In the line charts on the right you can see how the setting of a specific algorithm slider influences the mean reciprocal rank.
Recommendations is just one of the many powerful ways to put the data you’re able to collect in BlueConic to work delivering more effective, more relevant interactions with your customers. Let us know if you’d like to learn more.