These slides describe our solution for the RecSys Challenge 2016. In the challenge, several datasets were provided by XING, a social network for business. The goal of the competition was to use these data to predict the job postings that a user will interact with positively (click, bookmark or reply). Our solution combines collaborative and content-based approaches through three types of models: Factorization Machines, item-based collaborative filtering, and a content-based topic model on tags.
Our best submission, a blend of ten models, achieved 7th place on the challenge's final leaderboard with a score of 1677898.52. The approaches presented here are general and scalable, so they can be applied to other problems of this type.
2. Basic elements guidelines.
Problem statement
Data description
∙ Impressions: details about which items (job postings) were shown to which user by the existing recommender (19 August 2015 to 9 November 2015).
∙ Interactions: interactions that the user performed on the items (clicked, bookmarked, replied or deleted).
∙ Users: user details (job roles, career level, discipline, industry, location, experience, and education).
∙ Items: item details (title, career level, discipline, industry, location, employment type, tags, creation time, and a flag indicating whether the item was active during the test period).
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 2 / 23
Problem statement
Data description: impressions and interactions
Date interval: 2015-08-19 – 2015-11-09
Impressions
∙ 201M unique user-item-week tuples
∙ 2.7M unique users
∙ 846K unique items
Interactions
∙ 8.8M events: clicked – 7.2M, deleted – 1.0M, replied – 422K,
bookmarked – 206K
∙ 785K unique users
∙ 1.03M unique items
∙ 2.8M of 6.9M (user-item) pairs are in impressions
Problem statement
Data description: target users and items
150K users for making recommendations, of which:
∙ 39.7K (26.5%) have no events
∙ 59.5K (39.6%) have fewer than 2 events
∙ 70.6K (47.1%) have fewer than 3 events
327K active items, of which:
∙ 129K (39.5%) have no events
∙ 164K (50.1%) have fewer than 2 events
∙ 188K (57.6%) have fewer than 3 events
Problem statement
Task of the challenge
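$$
\mathrm{score}(R, \hat{R}) = \sum_{u \in U} \Big[ 20 \big( P_2(r_u, \hat{r}_u) + P_4(r_u, \hat{r}_u) + R_{30}(r_u, \hat{r}_u) + S_{30}(r_u, \hat{r}_u) \big) + 10 \big( P_6(r_u, \hat{r}_u) + P_{20}(r_u, \hat{r}_u) \big) \Big],
$$
where
$U = \{0, \dots, N - 1\}$ – list of target users,
$R = \{r_u\}_{u \in U}$ – lists of relevant items,
$\hat{R} = \{\hat{r}_u\}_{u \in U}$ – the solution,
$P_k(r_u, \hat{r}_u)$ – precision at top $k$ for user $u$,
$R_{30}(r_u, \hat{r}_u)$ – recall at top 30,
$S_{30}(r_u, \hat{r}_u)$ – user success.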
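The score above can be sketched in Python as follows. This is a minimal illustration, not the official evaluation code: the function names are hypothetical, precision is assumed to divide by k, and user success is assumed to be 1 when at least one relevant item appears in the top 30.

```python
def precision_at_k(relevant, recommended, k):
    """Share of the top-k slots holding relevant items (assumption: denominator is k)."""
    return sum(1 for item in recommended[:k] if item in relevant) / k


def recall_at_k(relevant, recommended, k=30):
    """Share of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in recommended[:k] if item in relevant) / len(relevant)


def user_success(relevant, recommended, k=30):
    """Assumption: 1 if at least one relevant item is in the top k, else 0."""
    return 1.0 if any(item in relevant for item in recommended[:k]) else 0.0


def challenge_score(relevant_by_user, recommended_by_user):
    """Sum the per-user score over all target users."""
    total = 0.0
    for user, relevant in relevant_by_user.items():
        rec = recommended_by_user.get(user, [])
        total += 20 * (
            precision_at_k(relevant, rec, 2)
            + precision_at_k(relevant, rec, 4)
            + recall_at_k(relevant, rec, 30)
            + user_success(relevant, rec, 30)
        )
        total += 10 * (precision_at_k(relevant, rec, 6) + precision_at_k(relevant, rec, 20))
    return total
```

Each recommendation list is assumed to contain no duplicate items.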
Problem statement
Models validation
∙ Validation set: the last week of interactions
∙ 10 000 random users among those who made any interactions during this week
∙ Old items (created more than a month earlier) without interactions were removed
∙ The resulting local score was highly correlated with the result on the Public Leaderboard
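A split of this kind might look like the sketch below. It is an illustration under assumptions: interactions are taken to be `(user, item, timestamp)` tuples, the function name and the fixed seed are hypothetical, and the removal of old items without interactions is left out for brevity.

```python
import random
from datetime import timedelta


def split_last_week(interactions, n_users=10_000, seed=0):
    """Hold out the final 7 days and sample validation users active in that week."""
    last_ts = max(ts for _, _, ts in interactions)
    cutoff = last_ts - timedelta(days=7)
    train = [row for row in interactions if row[2] <= cutoff]
    holdout = [row for row in interactions if row[2] > cutoff]

    # Only users with at least one interaction in the held-out week qualify.
    active_users = sorted({user for user, _, _ in holdout})
    rng = random.Random(seed)
    sampled = set(rng.sample(active_users, min(n_users, len(active_users))))

    # Ground truth: the items each sampled user interacted with in the last week.
    truth = {}
    for user, item, _ in holdout:
        if user in sampled:
            truth.setdefault(user, set()).add(item)
    return train, truth
```

Models are then trained on `train` and scored against `truth`.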
Solution of the team
Interesting insights from data
∙ A significant proportion of users and items have few events or none at all. We therefore need a hybrid approach that takes into account not only collaborative filtering but also the content data of items and users.
∙ Impressions change slowly over time. The presence of a user-item pair in past impressions is therefore a useful feature, and we use it as a separate model.
∙ Geographical features (distance, region, city, geoclusters, etc.) did not improve the score significantly.
∙ Tokens from user profiles and items are good features.
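The impressions insight can be turned into a very simple model, sketched below. The function name and the `(user, item)` pair format are assumptions made for illustration; the slides only state that the model's output is binary.

```python
def past_impressions_scores(impressions, user, candidates):
    """Binary score per candidate: 1.0 if the item was already shown to the user."""
    shown = {item for u, item in impressions if u == user}
    return {item: 1.0 if item in shown else 0.0 for item in candidates}
```

For example, with impressions `[("u1", "a"), ("u2", "b")]`, user `"u1"` gets a score of 1.0 for item `"a"` and 0.0 for item `"b"`.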
Solution of the team
Item-based collaborative filtering
Similarity metrics:
∙ Jaccard
∙ Cosine
∙ Pearson
Event types for training:
∙ All Positive interactions
∙ Only Click interactions
∙ Impressions
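As an illustration of two of these metrics, the sketch below computes Jaccard and cosine similarity between items from binary `(user, item)` events and ranks unseen items for a user. It is a toy version under assumptions (hypothetical function names, all events held in memory, scores summed over the user's items), not the library implementation used in the solution.

```python
from collections import defaultdict
from math import sqrt


def item_user_sets(events):
    """Map each item to the set of users who have an event on it."""
    users_by_item = defaultdict(set)
    for user, item in events:
        users_by_item[item].add(user)
    return users_by_item


def jaccard(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine(a, b):
    """Cosine similarity for binary vectors represented as user sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def recommend(user, events, similarity=jaccard, top_n=30):
    """Score each unseen item by its summed similarity to the user's items."""
    users_by_item = item_user_sets(events)
    seen = {item for u, item in events if u == user}
    scores = {}
    for candidate, candidate_users in users_by_item.items():
        if candidate in seen:
            continue
        score = sum(similarity(candidate_users, users_by_item[item]) for item in seen)
        if score > 0:
            scores[candidate] = score
    return sorted(scores, key=lambda item: (-scores[item], str(item)))[:top_n]
```

Training on a different event type (all positive interactions, clicks only, or impressions) only changes which pairs are passed in as `events`.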
Solution of the team
Factorization Machines
Predicted score for user i on item j is given by:
$$
p(i, j) = \mu + w_i + w_j + a^T x_i + b^T y_j + u_i^T v_j,
$$
where
$\mu$ – a global bias term,
$w_i$ and $w_j$ – weight terms for user $i$ and item $j$ respectively,
$x_i$ and $y_j$ – the user and item side feature vectors,
$a$ and $b$ – the weight vectors for those side features,
$u_i$ and $v_j$ – latent factors, which are vectors of fixed length (the number of factors is a parameter).
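The prediction formula can be evaluated directly once the parameters are known. The sketch below uses plain Python lists and a hypothetical function name; it covers only scoring, not how the parameters are learned.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def fm_score(mu, w_user, w_item, a, x_user, b, y_item, u_user, v_item):
    """p(i, j) = mu + w_i + w_j + a^T x_i + b^T y_j + u_i^T v_j."""
    return (
        mu
        + w_user
        + w_item
        + dot(a, x_user)       # user side features
        + dot(b, y_item)       # item side features
        + dot(u_user, v_item)  # latent factor interaction
    )
```

With no side features (empty `a`, `x_user`, `b`, `y_item`) the expression reduces to a biased matrix factorization score.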
Solution of the team
Factorization Machines: main parameters
∙ Number of latent factors (30 – 400)
∙ Number of sampled negative examples (1 – 12)
∙ Maximum number of iterations (25 – 70)
∙ Regularization parameters (1e-9 – 1e-7)
∙ User and item side features
Solution of the team
Factorization Machines: side features
Users - all features (OneHotEncoder)
∙ jobroles
∙ career_level, discipline_id, industry_id
∙ country, region
∙ experience: n_entries_class, years, years_in_current
∙ edu: degree, field_of_studies
Items - all features, except latitude and longitude
(OneHotEncoder)
∙ title, tags
∙ career_level, discipline_id, industry_id
∙ country, region
∙ employment_type
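One-hot encoding of such features can be sketched in plain Python as below. The helper names and the dictionary record format are assumptions for illustration; the slides only name `OneHotEncoder`. Multi-valued fields such as tags are treated as lists, with one indicator per distinct value.

```python
def as_values(value):
    """Treat scalars as single-element lists so all fields are handled alike."""
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple)) else [value]


def build_vocabulary(records, fields):
    """Assign one column index to every (field, value) pair seen in the data."""
    vocab = {}
    for record in records:
        for field in fields:
            for value in as_values(record.get(field)):
                vocab.setdefault((field, value), len(vocab))
    return vocab


def one_hot(record, vocab, fields):
    """Encode a record as a 0/1 vector; unseen values are ignored."""
    vector = [0] * len(vocab)
    for field in fields:
        for value in as_values(record.get(field)):
            column = vocab.get((field, value))
            if column is not None:
                vector[column] = 1
    return vector
```

In practice the vectors are very sparse, so a sparse representation would be preferable to dense lists.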
Solution of the team
Topic model: Latent Semantic Indexing (LSI)
∙ The document associated with each user consists of all title and tag tokens of the items the user interacted with, plus the job role tokens from the user's profile.
∙ Convert each document into a vector of token occurrence counts.
∙ Transform the values in each vector to TF-IDF statistics and combine all vectors into a large token-document matrix.
∙ Apply Singular Value Decomposition (SVD) to the token-document matrix.
∙ The similarity between a user and an item is the similarity between the corresponding latent vectors.
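The steps above can be sketched with NumPy as follows. This is a small illustration rather than the gensim pipeline used in the solution: the function names, the `+ 1.0` IDF smoothing, and the way an item's tokens are projected into the latent space are assumptions.

```python
import numpy as np


def fit_lsi(documents, n_topics=2):
    """Build a TF-IDF token-document matrix and keep its top singular directions."""
    vocab = sorted({tok for doc in documents for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(documents)))
    for j, doc in enumerate(documents):
        for tok in doc:
            counts[index[tok], j] += 1.0
    doc_freq = (counts > 0).sum(axis=1)
    idf = np.log(len(documents) / doc_freq) + 1.0  # smoothing is an illustrative choice
    u, _, _ = np.linalg.svd(counts * idf[:, None], full_matrices=False)
    return index, idf, u[:, :n_topics]


def to_latent(tokens, index, idf, basis):
    """Project a bag of tokens (user document or item) onto the latent topics."""
    vec = np.zeros(len(index))
    for tok in tokens:
        if tok in index:
            vec[index[tok]] += 1.0
    return basis.T @ (vec * idf)


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

A user is then matched to the items whose latent vectors have the highest cosine similarity with the user's latent vector.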
Solution of the team
Solution framework
[Diagram: Initial dataset → Item-based models, FM models, Topic model → Blending → Output]
Solution of the team
Linear Ensemble
Base models:
FR0 SIM0 PI Local Score
1 0 0 76995
0 1 0 69622
0 0 1 104495
1 1 1 132505
∙ SIM0 – Item-based Recommender (jaccard similarity)
∙ FR0 – Factorization Machines Recommender (400 factors)
∙ PI – Past Impressions Recommender (very simple model with
binary output)
∙ Local Score: 10 000 random users who made interactions with
items during last week
∙ The score on the Public Leaderboard ≈ 3.8 × the score on our
Local Validation
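A linear ensemble of this kind can be sketched as a weighted sum of per-model item scores for one user. The function name and the dictionary layout are hypothetical, and the sketch assumes the model scores have already been brought to comparable scales.

```python
def blend(model_scores, weights, top_n=30):
    """Rank items by the weighted sum of the scores each model assigns to them."""
    combined = {}
    for name, scores in model_scores.items():
        weight = weights.get(name, 0.0)
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + weight * score
    # Sort by descending blended score; ties are broken by item id for determinism.
    return sorted(combined, key=lambda item: (-combined[item], str(item)))[:top_n]


# Toy scores for a single user (illustrative values only).
example_scores = {
    "FR0": {"a": 0.9, "b": 0.2},
    "SIM0": {"b": 0.8, "c": 0.5},
    "PI": {"c": 1.0},
}
equal_blend = blend(example_scores, {"FR0": 1, "SIM0": 1, "PI": 1})
```

Setting a weight to 0 reproduces the single-model rows of the table, and the weights themselves can be tuned against the local validation score.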
Solution of the team
Linear Ensemble
Version «zero»:
FR0  SIM0  PI  Local Score
1    2     1   134285
The first version:
FR0  SIM0  FR0^8.0 * SIM0  PI  Local Score
1    13    8               1   138073
The second version:
FR1  SIM0  FR1^8.0 * SIM0  PI  Local Score
1    13    8               1   140876
FR1 = FR_f100_i25
Solution of the team
Linear Ensemble
The third version:
FR2  SIM2  FR2^8.0 * SIM2  PI  Local Score
1    13    8               1   143653
FR2 = 0.5*FR_f100_i25 + 0.5*FR_f400_i70
SIM2 = 0.5*SIM_jac + 0.5*SIM_click
Solution of the team
Final models set
SIM_jac              Item-based Jaccard similarity
SIM_click            Item-based Jaccard similarity on clicks
SIM_pearson          Item-based Pearson similarity
SIM_imp              Item-based Jaccard similarity on impressions
FR_f100_i25          Factorization, n_factors=100, iter=25
FR_f400_i70          Factorization, n_factors=400, iter=70
FR_f400_i50_no_side  Factorization, no side data
FR_imp               Factorization on impressions
TM                   LSI topic model
PI                   Past Impressions
Solution of the team
Hardware & Software
∙ 1 server: 28 cores, 56 threads, 256 GB RAM
∙ Full training + prediction = 16 hours
∙ All code was written in Python
∙ ML libraries: graphlab, gensim
Leaderboard
Submission history
Score   Rank  Name                                 Date
554655  9     Topic model 100 factors              06/27/16
548366  8     Top 150 candidates from every model  06/25/16
543284  8     8 model set: 4FR + 4SIM + TM         06/24/16
537157  9     Topic model                          06/23/16
530599  10    3 models set: FR + 2SIM              06/23/16
497136  15    3 models set: FR + 2SIM              06/22/16
496241  1     FR with side data                    03/20/16
397604  1     Past Impressions model               03/11/16
132790  1     Simple item-based recommender        03/10/16