• Peter Hurley

How Recommendation Engines Work

The internet has had a profound effect on our everyday lives, whether we are shopping online, streaming music and films or reading the news. In all of these interactions, it is commonplace to be offered new and alternative recommendations. For example, when I log on to Netflix, I am presented with a list of personal movie recommendations; Spotify provides me with a weekly list of songs I might like to listen to; the BBC News website suggests other stories for me to read; and if I log on to Amazon, the landing page contains a range of items I might be interested in buying.

All of these recommendations are provided by what we call recommendation engines, and it is advances in machine learning over the last two decades that have made them such a powerful asset for online businesses.

There are two broad approaches to building a recommendation engine: collaborative filtering and content-based filtering.

Content-based filtering is the easier of the two to implement. It assumes a user will like items that are similar, as measured by their features, to items they have previously liked. For example, on Netflix, content-based filtering could be used to suggest movies that share an actor or genre with films that I have already watched.
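As a rough illustration of the idea, the sketch below scores films by how similar their features are to a film the user has already watched. The titles, the four binary features and the function name recommend_similar are all invented for this example, and cosine similarity is only one of several reasonable similarity measures.

```python
import numpy as np

# Hypothetical catalogue: one row per film, one column per binary feature
# (action, comedy, drama, stars_actor_a). Titles and features are invented.
titles = ["Film A", "Film B", "Film C", "Film D"]
features = np.array([
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
], dtype=float)


def recommend_similar(liked_indices, features, top_n=2):
    """Rank items by cosine similarity to the average of the liked items."""
    profile = features[liked_indices].mean(axis=0)
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(profile)
    # Guard against division by zero for items with no features at all.
    scores = features @ profile / np.where(norms == 0, 1.0, norms)
    scores[liked_indices] = -np.inf  # never recommend what was already watched
    return [int(i) for i in np.argsort(-scores)[:top_n]]


# The user has watched Film A (index 0).
recommended = recommend_similar([0], features)
print([titles[i] for i in recommended])
```

Here the film sharing a genre with Film A ranks first and the film sharing an actor ranks second, while the unrelated comedy is left out.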

Collaborative filtering is the key behind the big advances in recommendation engines over the last decade and one of the best illustrations of the power of “Big Data”. This approach assumes that there are common trends across users and that every person's tastes are a combination of those trends. For example, on Amazon, new parents will tend to buy similar items to one another, whilst students will tend to buy items such as a new laptop or notepads. Businesses that have a large pool of data on users and their interactions with products can use collaborative filtering techniques to learn what those trends are and which items are associated with them. They can then calculate which trends each person belongs to and recommend the items belonging to those trends.
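One simple way to make the notion of "trends" concrete is a low-rank factorisation of the user-item ratings matrix. The sketch below is a toy version under several assumptions: the ratings are invented, a single trend is extracted, and missing ratings are simply treated as "average for that user" before a truncated SVD is applied. Larger systems usually fit the factors on the observed ratings only, so this should be read as an illustration of the idea rather than a recipe.

```python
import numpy as np

# Hypothetical ratings (rows: users, columns: items); np.nan means "not rated".
# Users 0 and 1 share one taste, users 2 and 3 share the opposite taste.
ratings = np.array([
    [5.0, np.nan, 1.0, 1.0],
    [5.0, 5.0, 1.0, 1.0],
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, np.nan, 5.0],
])


def predict_ratings(ratings, n_trends=1):
    """Fill in missing ratings using a rank-n_trends approximation."""
    observed = ~np.isnan(ratings)
    # Centre each user's ratings on their own mean; unknown entries become 0.
    user_means = np.nanmean(ratings, axis=1, keepdims=True)
    centred = np.where(observed, ratings - user_means, 0.0)
    # Keep only the strongest n_trends directions of shared taste.
    U, s, Vt = np.linalg.svd(centred, full_matrices=False)
    low_rank = (U[:, :n_trends] * s[:n_trends]) @ Vt[:n_trends]
    return low_rank + user_means


predicted = predict_ratings(ratings)
print(np.round(predicted, 2))
```

User 0 never rated item 1, but because user 0 otherwise rates like user 1, the predicted rating for item 1 comes out higher than user 0's predictions for items 2 and 3.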

Both approaches have their advantages and disadvantages. The big disadvantage of collaborative filtering is that it typically requires a large amount of existing ratings data to learn from, a problem often referred to as the cold start problem. If I were to start a new Netflix account, for instance, Netflix would have no information about me with which to work out what trends I belong to. Content-based filtering does not suffer from the cold start problem to the same extent. However, it can only offer results that are similar to items the user has previously liked, which limits the range of items that can be recommended compared to collaborative filtering. Many of the best sites that use recommendation engines use a hybrid model combining both types of filtering.
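There are many ways to build a hybrid. One of the simplest, sketched below, is to blend the two scores and to lean on the content-based score while little is known about a user. The function name hybrid_score, the halfway parameter and the weighting formula are assumptions made for this illustration, not a description of how any particular site does it.

```python
def hybrid_score(content_score, collaborative_score, n_user_ratings, halfway=10):
    """Blend two scores, trusting collaborative filtering more as data accumulates.

    halfway is a hypothetical tuning knob: the number of ratings at which
    both scores receive equal weight.
    """
    weight = n_user_ratings / (n_user_ratings + halfway)
    return weight * collaborative_score + (1 - weight) * content_score


# A brand new user: the result is driven entirely by the content-based score.
new_user = hybrid_score(0.8, 0.2, n_user_ratings=0)

# After 10 ratings both scores count equally.
regular_user = hybrid_score(0.8, 0.2, n_user_ratings=10)
```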

The priority for any basic implementation of these algorithms is to maximise predictive accuracy, i.e. to recommend the best items. However, there are other measures that need to be considered when building a recommendation engine. Diversity and serendipity are two such measures. Diversity is a measure of how varied the recommended items are allowed to be. This is particularly important for giving new products that do not yet have any ratings, such as a new song on Spotify, a chance to be discovered. Serendipity can be thought of as how novel or surprising a recommended item is. For example, suggesting a popular Hollywood movie on Netflix is not as impressive as recommending an entirely new independent film that the user might like.

Getting the balance right between these measures and predictive accuracy is non-trivial but essential.
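To give a feel for what balancing accuracy against diversity can look like in practice, the sketch below re-ranks a list greedily: each pick is rewarded for its predicted relevance and penalised for being too similar to items already chosen. The relevance scores, the similarity matrix and the trade_off parameter are all hypothetical values chosen for this example.

```python
def rerank_for_diversity(relevance, similarity, top_n, trade_off=0.7):
    """Greedily pick items, trading relevance against similarity to earlier picks.

    trade_off = 1.0 ranks purely by relevance; lower values favour diversity.
    """
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < top_n:
        def adjusted(i):
            # Penalise an item by its closest match among the items picked so far.
            penalty = max((similarity[i][j] for j in selected), default=0.0)
            return trade_off * relevance[i] - (1 - trade_off) * penalty
        best = max(candidates, key=adjusted)
        selected.append(best)
        candidates.remove(best)
    return selected


# Items 0 and 1 are near-duplicates; item 2 is less relevant but very different.
relevance = [0.9, 0.85, 0.6]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

accuracy_only = rerank_for_diversity(relevance, similarity, top_n=2, trade_off=1.0)
balanced = rerank_for_diversity(relevance, similarity, top_n=2, trade_off=0.5)
```

With trade_off set to 1.0 the two near-duplicate items fill the list, whereas the balanced setting swaps the second of them for the less similar item.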

Another important factor to consider when using recommendation engines is the type of data the recommendations are based on. This can usually be divided into explicit and implicit data. Explicit data is the most informative and is typically a user interaction that directly indicates how much something is liked. A good example would be rating a movie on Netflix. Implicit data is typically a user interaction that does not directly indicate how much an item is liked, for example looking at an item on Amazon. There is far more implicit data available than explicit data, but it is less informative and requires more sophisticated models, more data, or both to become useful.

At DataJavelin we develop recommender systems for companies using machine learning techniques. We are particularly skilled in tackling the cold start problem: developing recommender systems where there is limited initial data.


Credit for cartoon: https://www.rinapiccolo.com/piccolo-cartoons/