What is this about
Today we’ll learn what’s happening in a recommendation engine by building a simple, yet powerful collaborative filtering engine.
1. What we have
Let’s say we have a music app. We have access to the music our users like, and it’d be tremendous to recommend new music to people (please don’t steal my idea, I might get rich with this someday).
We’ll use mongodb as our database, along with mongoid which I find to be quite awesome.
Allright, let’s roll !
Basic class definitions
First of all, let’s define some
Artist classes. In our app, a user has a list of liked artists, whereas the artist has a list of likers.
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9 10 11 12 13
We can use a simple script to load our classes and play with them:
Note: For better readability I’m not including the
Gemfile, nor the
mongoid.yml file here. A complete example is available to download at the end of this post.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
Allright, everything seems to work just fine. Time for the fancy stuff !
2. Collaborative filtering
Collaborative filtering in a nutshell
Collaborative filtering is not that difficult to understand:
- You like stuff
- There are other people who also like the same stuff
- These very people do like other stuff (that you don’t even know about)
- You might want to know about it.
Now for the implementation. There is of course a lot of details and variations in the existing algorithms. The one we are going to implement is the following (in pseudo-code):
- Find every user that share at least one favorite artist with you
- For each found user
- calculate the number of favorite artists you share. The more artists you share with a user, the more weight we’ll had to his recommendation.
- Divide the obtained sum by the total number of artists the user like. We don’t want a serial liker to pollute our score and recommend everything with too much of a weight. Let’s call
weight(u)the weight of this user
- for all the artists
weight(u)to our result:
result(a) += weight(u))
- sort the list and get the most recommended artists !
Shall we implement it?
Yes we can. Let’s create a fancy
recommended_artists method for the users that we write in a
reco module (in rails we would put this module under
1 2 3 4 5 6
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Looks nice, how about we try it with a slightly modified script:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Look Mom, it’s working !!!!
3. Performances / Why it’s awesome
Let’s take a closer look at our algorithm complexity here, and explain how we take advantage of the mongo’s NoSQL structure.
Unlike when dealing with a SQL database, we have deliberately denormalized our data. When we open a
User document, we can see an array of ids that represent the list of his/her favorite artists. When opening the corresponding
Artist document, whe can see a list of
liker_ids that correspond to the artist’s likers.
Now if we look at our algorithm in terms of requests, what we actually are doing is:
- Open the current_user document. When recommending for me, this represent my profile.
- In my document I find an array of artists. No additional query here.
- Perform one db query to obtain all the
likersof the artists I like with
friends = User.any_in(id: my_artists.distinct(:liker_ids)).only(:id, :liked_artist_ids). Note that we find the
Artistdocuments by their ids, then simply concatenate their respective
liker_idsarrays. No need to index any relation table, or foreign key. Furthermore, from this single request we already know what our ‘friends’ like, since every friend document contains a
- At this stage we already nailed down the interesting users among our database, and we can browse through their liked artists without performing any additional query.
This pretty much implies that the complexity of our algorithm doesn’t depend on the number of users, but rather to the interconnection level of our graph.
In other words: if we add billions of user that doesn’t share any liked artist with you, then your recommendation query would be totally unaffected. But if everybody loved the same artists, then we’d have some trouble.
Another very good example of such an algorithm is the “friend recommender”. Having billions of users where evey user have an average of 50 friends is a piece of cake to deal with. But of course, if a single user would become the friend of everyone, then you’d simply have to browse the whole database to find your friend’s friends. In a social graph this is very unlikely to happen: I’m way too hipster to befriend such a guy.
I’ve slightly modified the demo script to
big_reco.rb. We can see what happens with 100k users, that randomly like 10 artists each, among a 1k artists database:
100,000 users, 1,000 artists and 10 likes per user:
1 2 3
As you can see this is still pretty fast and run in about a second on an average laptop. Actually, preparing the database was hat took a while, so I didn’t make many more examples. Feel free to send me some if you are brave enough to try :)
Note that even with millions of users, a proper cache structure along with background jobs can garanty recommendations that take less than 300ms. Let me know if you’re intested, I might consider writing a tuto about how to design such a background engine !
Thanks for reading, I hope this was of some interest for you, feel free to ask any question or to share your feedback !