When you implement personalized recommendations, it's best to start with a comprehensive exploration phase to avoid ending up in a “local maximum” – which is just a mathematical way of saying an artificial peak in performance.
1. Exploration Phase
It's important to try at least a handful of different variants with different algorithms to ensure that the data BlueConic gathers during this period is sufficiently varied.
We usually recommend starting with multiple variants with a single algorithm configured for each variant. Examples of algorithms you might select are “Breaking news,” “Viral news,” “Look-alike articles/products,” and “Recent high CTRs.” This quickly reveals which algorithms work well for that specific customer and provides valuable data to determine the ideal algorithm combination.
Once the recommendation dialogues have racked up clicks, BlueConic can start analyzing the recommendation click data to determine if a specific algorithm combination would have resulted in even better results. Once BlueConic finds an optimal algorithm combination through offline analysis, it's important to test that specific combination against the previously best-performing variant to ensure it's actually performing well in practice.
2. Exploitation Phase
During the exploitation phase, use what you learned during the exploration phase to get the maximum amount of value out of the BlueConic recommendations engine. This means turning off all variants except the best-performing one.
3. Continuous Improvement Phase
Our most successful BlueConic users are continuously improving the quality of their recommendations. Once you've figured out the best algorithm combination across all visitors, it's time to start thinking about specific audience segments that would benefit from recommendations specifically tailored to them.
The rest of this article examines methods for evaluating and optimizing algorithm performance of content and product recommendations.
Kernel Density Estimate
This graph shows the kernel density estimation of the number of recommendations that were clicked versus those that were not clicked for a specific algorithm. This graph shows that if the look-alike score is low (that is, the recommended content is very different from the content on the current page), the probability is a lot higher that the visitor will not click the recommendation. However, once the look-alike score increases, the roles reverse. This means that the look-alike algorithm is probably a great addition to our final algorithm combination.
The next methods are a little more complicated, so indulge us in an analogy first. Let's say you apply for a job in maritime depth estimation, and as a result, you are sent to the middle of a lake in a rowboat to find the deepest point of the lake. You use a piece of lead on a line to determine the depth on a number of random locations on the lake. When you compare your measurements, you notice that the deepest of your measurements has comparatively little vegetation, more fish, and the water has a darker color. Using this information, you gradually work towards a location with even fewer plants, more fish, and dark blue water. At some point you notice that not all fish like really deep waters, which means the deepest waters do not contain more fish than closer ashore, but still more fish than the shallowest points of the lake. Using this process, you finally end up in a place on the lake with a medium amount of fish, very dark water, and no vegetation at all.
The fish, vegetation, and color of the water in the story are our recommendation algorithms, while the depth of the lake is the quality of our recommendations with that algorithm combination. The quality of our recommendations is measured as the mean reciprocal rank of our recommendations based on the historical data. For all sets of 5 recommendations where a visitor actually clicked one of them, we determine the location of the clicked recommendation within these 5 recommendations. If the clicked article is the first recommendation, the score is 1; if it comes in second, the score is 1/2; and if it comes in fifth, the score is 1/5. We try to maximize the mean of all these scores (or: reciprocal ranks), which is called the mean reciprocal rank. The idea behind this is that the algorithm combination sorts all available articles by the probability that they will be clicked by the visitor, ultimately showing the top 5 articles to the visitor.
BlueConic currently has 11 recommendation algorithms, which can each have 21 different values between 0 and 10, which means there are 350,277,500,542,221 possible combinations. Even if we could evaluate a million algorithm combinations on all historical data every second, it would still take almost a dozen years to try each algorithm combination; that’s why we have to be a bit smarter than just randomly trying algorithm combinations. This is where the method to gradually move to the deepest part of the lake comes in.
Convergence
The convergence graph shows that we're gradually finding deeper and deeper parts of the lake, until we find the deepest part. When an optimizer has found the best possible solution and keeps trying similar algorithm combinations without making any more progress, we say that the optimizer has converged. We usually want to see that the last part of this graph is flat for a while. If the graph was still declining we should still take more measurements, since it looks like spending more time optimizing the algorithm combination will still pay off.
Evaluations
These graphs are basically maps of the bottom of the lake and are based on the random samples we've taken. In reality, the lake is of course multidimensional: that's why it's represented by a number of separate images where the axes are different recommendation algorithms. The dots in these images are all the probes where we used our piece of lead on a string to measure the depth of the lake, i.e. calculated the mean reciprocal rank for a specific algorithm combination. The color indicates the score for that specific algorithm combination, with a lighter color representing a higher mean reciprocal rank.
The histogram in this image shows the number of times we've tried a specific setting of the recommendation algorithm slider.
Objective
Based on our probes, a depth map is generated. The lighter colors represent areas with a higher mean reciprocal rank. In the line charts on the right you can see how the setting of a specific algorithm slider influences the mean reciprocal rank.
Using recommendations is just one of the many powerful ways to put the data you collect in BlueConic to work delivering more effective, more relevant interactions with your customers.