Sudeep Das presented on recommender systems and advances in deep learning approaches. Matrix factorization is still the foundational method for collaborative filtering, but deep learning models are now augmenting these approaches. Deep neural networks can learn hierarchical representations of users and items from raw data such as images, text, and sequences of user actions. Models such as Wide & Deep networks combine the strengths of memorization and generalization. Sequence models such as recurrent neural networks have also been applied to session data for next-item recommendation.
10. Background
● 2006-2009: Netflix Prize:
○ >10% improvement, win $1,000,000
● The top-performing models ended up being variations of matrix factorization (SVD++, Koren et al.)
● Although Netflix's rec system has moved on, MF is still the foundational method on which most collaborative filtering systems are based
12. Singular Value Decomposition (Origins)
R = U Σ V^T
where R is the (users × items) ratings matrix, the columns of U and rows of V^T are the left/right singular vectors (an orthonormal basis), and the diagonal matrix Σ holds the singular values (scaling).
13. Low-rank approximation
● Eckart-Young theorem: the best rank-k approximation to R in Frobenius norm is obtained by keeping only the k largest singular values of the SVD:
[U′, Σ′, V′^T] = argmin ‖R − U Σ V^T‖²_F
where ‖·‖_F is the Frobenius norm.
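As a concrete illustration, here is a minimal NumPy sketch of the Eckart-Young result, with a synthetic matrix standing in for R: truncating the SVD to the k largest singular values gives the best rank-k approximation in Frobenius norm.

import numpy as np

rng = np.random.default_rng(0)
R = rng.random((100, 50))          # dense stand-in for a ratings matrix
k = 10                             # target rank

U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation

err = np.linalg.norm(R - R_k, "fro")
print(f"rank-{k} Frobenius error: {err:.4f}")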
14. Low-rank Matrix Factorization
● No orthogonality requirement
● Fit by weighted least squares (or other losses)
● R ≈ P Q^T, where the shared inner dimension k is the size of the latent space
● Unlike SVD's U Σ V^T, the scaling factor Σ is absorbed into both matrices (they are not normalized)
15. Low-rank MF (cont…)
● Bias terms: r_ui ≈ μ + b_u + b_i + p_u · q_i, where μ is the overall bias, b_u the user bias, and b_i the item bias
● Regularization, e.g. L2, L1, etc.
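A minimal NumPy sketch of biased low-rank MF trained with SGD on hypothetical toy ratings; the update rules are standard, but the hyperparameters and data here are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 100, 50, 8
ratings = [(rng.integers(n_users), rng.integers(n_items),
            rng.integers(1, 6)) for _ in range(1000)]

mu = np.mean([r for _, _, r in ratings])          # overall bias
b_u, b_i = np.zeros(n_users), np.zeros(n_items)   # user / item biases
P = rng.normal(0, 0.1, (n_users, k))              # user factors
Q = rng.normal(0, 0.1, (n_items, k))              # item factors
lr, reg = 0.01, 0.05                              # learning rate, L2 strength

for epoch in range(20):
    for u, i, r in ratings:
        err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))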
18. Asymmetric Matrix Factorization
● Replace the user vector with a sum of item vectors: R ≈ (I(R) Y) Q^T, where I(R) indicates which items each user interacted with, so each user's vector is the sum of the Y-rows over N(u), the set of all items user u rated/viewed/clicked.
19. AMF, relation to Neural Networks
● Feed a 1-hot encoding of a user's play history into a network with a single hidden layer: learning the network weights is equivalent to learning the Y and Q matrices.
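A minimal NumPy sketch of this equivalence, with randomly initialized Y and Q standing in for learned weights: the multi-hot history picks out Y-rows (the "hidden layer"), and scoring against Q is the output layer.

import numpy as np

rng = np.random.default_rng(0)
n_items, k = 1000, 32
Y = rng.normal(0, 0.1, (n_items, k))   # "history" item embeddings
Q = rng.normal(0, 0.1, (n_items, k))   # "target" item embeddings

history = [3, 17, 256, 980]            # N(u): items the user interacted with
user_vec = Y[history].sum(axis=0) / np.sqrt(len(history))  # implicit user vector

scores = Q @ user_vec                  # score every item
top = np.argsort(-scores)[:10]         # top-10 recommendations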
20. SLIM
● SLIM replaces the low-rank approximation with a sparse item-item matrix: R ≈ I(R) Y, where I(R) is the binarized interaction matrix and Y is an (items × items) weight matrix whose diagonal is replaced with zeros. Sparsity comes from an L1 regularizer.
● Equivalent to constructing a regression that uses the user's play history to predict their ratings.
● NB: It is important that the diagonal is excluded; otherwise the solution is trivial (each item predicts itself).
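One common way to fit SLIM is one sparse regression per item. Here is a minimal sketch using scikit-learn's ElasticNet on a toy implicit-feedback matrix; the item-item weight matrix (Y on the slide) is called W here, and the hyperparameters are assumptions.

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
R = (rng.random((200, 50)) > 0.9).astype(float)   # toy implicit-feedback matrix
n_items = R.shape[1]
W = np.zeros((n_items, n_items))                  # sparse item-item weights

model = ElasticNet(alpha=0.1, l1_ratio=0.5, positive=True, fit_intercept=False)
for j in range(n_items):
    X = R.copy()
    X[:, j] = 0.0                 # zero out item j: excludes the diagonal
    model.fit(X, R[:, j])         # predict item j's column from all others
    W[:, j] = model.coef_

scores = R @ W                    # predicted scores for every user/item pair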
24. [Figure: a user-item purchase matrix with unknown entries]
● Users now belong to multiple "topics", with some proportion (e.g. 0.88 / 0.12)
● Purchases are a mix, proportional to the user's affinity for a topic and the item's affinity within that topic
28. Final step: Recommending from topics
● Once we've learnt a user's distribution over topics and each topic's distribution over items, producing a recommendation is easy.
● Score every item i as P(i|u) = Σ_t P(t|u) P(i|t), and recommend the items with the highest probability (discarding items the user has already purchased).
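A minimal NumPy sketch of that scoring step, with random Dirichlet draws standing in for the learned distributions:

import numpy as np

rng = np.random.default_rng(0)
n_users, n_topics, n_items = 10, 4, 100
theta = rng.dirichlet(np.ones(n_topics), size=n_users)   # P(t|u)
phi = rng.dirichlet(np.ones(n_items), size=n_topics)     # P(i|t)

u = 0
purchased = {3, 42}
scores = theta[u] @ phi                 # P(i|u) for every item i
scores[list(purchased)] = -np.inf       # discard already-purchased items
top = np.argsort(-scores)[:10]          # top-10 recommendations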
31. In many domains, deep learning is achieving near-human or super-human accuracy!
However, applications of deep learning in recommender systems are still in their infancy.
32. So, what is Deep Learning?
A class of machine learning algorithms:
● that use a cascade of multiple non-linear processing layers
● and complex model structures
● to learn different representations of the data in each layer
● where higher level features are derived from lower level features to form a
hierarchical representation.
Balázs Hidasi, RecSys 2016
34. Learning hierarchical representations of data
Each layer learns progressively more complex representations from its predecessor.
[Figure: raw pixels → edges → parts of objects composed from edges → object models → trainable classifier recognizing "Socrates"]
35. Earliest adaptation: Restricted Boltzmann Machines
(From a recent presentation by Alexandros Karatzoglou)
● One hidden layer. User feedback on the items interacted with is propagated back to all items.
● Very similar to an autoencoder!
36. There are many ways to make this deep.
From Olivier Grisel, dotAI 2017
41. Wide + Deep Models for Recommendations
In a recommender setting, you may want to train with a wide set of cross-product feature transformations, so that the model essentially memorizes these sparse feature combinations (rules).
Cheng et al., Google Inc. (2016)
42. Wide + Deep Models for Recommendations
On the other hand, you may want the ability to generalize using the representational power of a deep network. But deep nets can over-generalize.
Cheng et al., Google Inc. (2016)
43. Wide + Deep Models for Recommendations
Best of both worlds: jointly train a deep + wide network. The cross-feature transformation in the wide model component can memorize all those sparse, specific rules, while the deep model component can generalize to similar items via embeddings.
Cheng et al., Google Inc. (2016)
44. Wide + Deep Models for Recommendations
Wide + Deep model for app recommendations.
Cheng et al., Google Inc. (2016)
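A minimal PyTorch sketch of the joint wide + deep idea; the layer sizes, the hashed cross-feature scheme, and all names here are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, n_cross=10_000, n_ids=5_000, emb_dim=16):
        super().__init__()
        self.wide = nn.EmbeddingBag(n_cross, 1, mode="sum")   # linear model over sparse crosses
        self.emb = nn.EmbeddingBag(n_ids, emb_dim, mode="mean")
        self.deep = nn.Sequential(
            nn.Linear(emb_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, cross_ids, id_feats):
        # wide and deep components are summed into a single joint logit
        return self.wide(cross_ids) + self.deep(self.emb(id_feats))

model = WideAndDeep()
cross_ids = torch.randint(0, 10_000, (32, 4))   # hashed cross-feature ids
id_feats = torch.randint(0, 5_000, (32, 6))     # sparse id features
logits = model(cross_ids, id_feats)             # (32, 1); feed to BCEWithLogitsLoss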
45. The YouTube Recommendation Model
A two-stage approach with two deep networks:
● The candidate generation network takes events from the user's YouTube activity history as input and retrieves a small subset (hundreds) of videos from a large corpus. These candidates are intended to be generally relevant to the user with high precision. The candidate generation network only provides broad personalization via collaborative filtering.
● The ranking network scores each video according to a desired objective function, using a rich set of features describing the video and user. The highest-scoring videos are presented to the user, ranked by their score.
Covington et al., Google Inc. (2016)
46. The YouTube Recommendation Model
Stage one: deep candidate generation model architecture
● Embedded sparse features are concatenated with dense features. Embeddings are averaged before concatenation to transform variable-sized bags of sparse IDs into fixed-width vectors suitable for input to the hidden layers.
● All hidden layers are fully connected.
● In training, a cross-entropy loss is minimized with gradient descent on the output of the sampled softmax.
● At serving, an approximate nearest neighbor lookup is performed to generate hundreds of candidate video recommendations.
Covington et al., Google Inc. (2016)
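A minimal PyTorch sketch of stage one as described above; the layer sizes are assumptions, and a full softmax stands in for the paper's sampled softmax.

import torch
import torch.nn as nn

class CandidateGen(nn.Module):
    def __init__(self, n_videos=100_000, emb_dim=64, n_dense=8):
        super().__init__()
        # averaging turns variable-sized bags of video ids into fixed-width vectors
        self.watch_emb = nn.EmbeddingBag(n_videos, emb_dim, mode="mean")
        self.tower = nn.Sequential(
            nn.Linear(emb_dim + n_dense, 256), nn.ReLU(),
            nn.Linear(256, emb_dim), nn.ReLU(),
        )
        self.out = nn.Linear(emb_dim, n_videos)   # full softmax here; sampled in practice

    def forward(self, history, dense):
        u = self.tower(torch.cat([self.watch_emb(history), dense], dim=1))
        return self.out(u)                         # logits over the video corpus

model = CandidateGen()
history = torch.randint(0, 100_000, (4, 20))   # toy watch histories
dense = torch.randn(4, 8)                      # toy dense features
loss = nn.CrossEntropyLoss()(model(history, dense),
                             torch.randint(0, 100_000, (4,)))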
47. The YouTube Recommendation Model
Stage two: deep ranking network architecture
● Uses embedded categorical features (both univalent and multivalent) with shared embeddings, and powers of normalized continuous features.
● All layers are fully connected. In practice, hundreds of features are fed into the network.
Covington et al., Google Inc. (2016)
49. Collaborative Denoising Auto-Encoder
● Treats the feedback on the items the user has interacted with (input layer) as a noisy version of the user's preferences on all items (output layer).
● Introduces a user-specific input node and a hidden bias node, while the item weights are shared across all users.
Collaborative Denoising Auto-Encoders for Top-N Recommender Systems, Wu et al., WSDM 2016
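A minimal PyTorch sketch of a CDAE-style model reflecting those two bullets; the hyperparameters are assumptions.

import torch
import torch.nn as nn

class CDAE(nn.Module):
    def __init__(self, n_users, n_items, hidden=200, drop=0.5):
        super().__init__()
        self.corrupt = nn.Dropout(drop)               # denoising corruption of the input
        self.item_in = nn.Linear(n_items, hidden)     # item weights, shared across users
        self.user_in = nn.Embedding(n_users, hidden)  # user-specific input node
        self.decode = nn.Linear(hidden, n_items)

    def forward(self, user_ids, interactions):
        h = torch.sigmoid(self.item_in(self.corrupt(interactions))
                          + self.user_in(user_ids))
        return self.decode(h)   # reconstruct preferences over all items

model = CDAE(n_users=1000, n_items=5000)
users = torch.arange(4)
x = (torch.rand(4, 5000) > 0.95).float()   # toy interaction vectors
logits = model(users, x)                   # train with BCEWithLogitsLoss against x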
50. Recurrent Neural Networks - Sequence Modeling
A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
51. Session-based recommendation with Recurrent Neural Networks (GRU4Rec)
● Treat each user session as a sequence of clicks
● Predict the next item in the session sequence
Hidasi et al., ICLR (2016)
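A minimal PyTorch sketch in the spirit of GRU4Rec; the sizes are assumptions, and the paper's session-parallel mini-batches and ranking losses (BPR/TOP1) are omitted here.

import torch
import torch.nn as nn

class SessionGRU(nn.Module):
    def __init__(self, n_items=50_000, emb_dim=64, hidden=100):
        super().__init__()
        self.emb = nn.Embedding(n_items, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_items)

    def forward(self, clicks):                 # (batch, session_len) of item ids
        h, _ = self.gru(self.emb(clicks))
        return self.out(h)                     # next-item logits at every step

model = SessionGRU()
clicks = torch.randint(0, 50_000, (8, 10))     # toy click sessions
logits = model(clicks)                         # (8, 10, 50_000)
# shift targets by one step: predict clicks[:, 1:] from logits[:, :-1]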
52. Adding item metadata to GRU4Rec: Parallel RNN
● Separate RNNs for each input type:
○ Item ID
○ Image feature vector obtained from a CNN (last average-pooling layer)
Hidasi et al., RecSys (2016)
54. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback
Helping cold start by augmenting item factors with visual factors:
● Create an item factor that is the sum of two terms: an item visual factor, which is an embedding of deep CNN features of the item image, and the usual collaborative item factor.
He et al., AAAI (2016)
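A minimal NumPy sketch of that item-factor construction; the variable names and dimensions are illustrative, and random draws stand in for learned parameters and pretrained CNN features.

import numpy as np

rng = np.random.default_rng(0)
k, cnn_dim, n_items = 20, 4096, 500
gamma_i = rng.normal(0, 0.1, (n_items, k))      # collaborative item factors
f = rng.normal(0, 1, (n_items, cnn_dim))        # pretrained CNN image features
E = rng.normal(0, 0.01, (cnn_dim, k))           # learned visual embedding matrix

item_factor = gamma_i + f @ E    # visual term gives cold-start items a usable factor
user_factor = rng.normal(0, 0.1, (1, k))
scores = (user_factor * item_factor).sum(axis=1)   # rank items for this user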
55. Deep content-based music recommendations
Cold-starting new or less popular music:
● Take the mel spectrogram of the song and run it through several convolutional and max-pooling layers down to a compressed 1-d representation.
● The training objective is to minimize the squared error between the collaborative item factors of a known item and the item factor predicted from the CNN.
● Then, for a new item, the model can predict the item factor and make recommendations.
Aäron van den Oord, Sander Dieleman and Benjamin Schrauwen, NIPS 2013
http://benanne.github.io/2014/08/05/spotify-cnns.html
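A minimal PyTorch sketch of the pipeline described above; the filter sizes and factor dimension are assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class AudioToFactor(nn.Module):
    def __init__(self, n_mels=128, k=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(256, 256, kernel_size=4), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1), nn.Flatten(),   # compress the time axis to 1-d
            nn.Linear(256, k),                       # predicted item factor
        )

    def forward(self, mel):          # (batch, n_mels, time)
        return self.net(mel)

model = AudioToFactor()
mel = torch.randn(4, 128, 600)                  # toy mel spectrograms
target = torch.randn(4, 40)                     # known collaborative item factors
loss = nn.MSELoss()(model(mel), target)         # squared-error training objective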
56. The Pinterest Application: Pin2Vec Related Pins
Learn a 128-dimensional compressed representation of each item (an embedding), then use a similarity function (cosine) between them to find similar items.
Liu et al. (2017)
https://medium.com/the-graph/applying-deep-learning-to-related-pins-a6fee3c92f5e
57. The Pinterest Application: Pin2Vec Related Pins
[Figure: co-occurrence-based related pins vs. Pin2Vec results]
Liu et al. (2017)
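A minimal NumPy sketch of the retrieval step, with random vectors standing in for the learned 128-d Pin2Vec embeddings:

import numpy as np

rng = np.random.default_rng(0)
n_pins, dim = 10_000, 128
emb = rng.normal(size=(n_pins, dim))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalize rows

query = 42
sims = emb @ emb[query]                  # cosine similarity to every pin
related = np.argsort(-sims)[1:11]        # top-10 related pins, skipping the query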
59. Some concluding thoughts
● Deep learning is augmenting shallow-model-based recommender systems. The main draws for DL in RecSys seem to be:
○ Better generalization beyond linear models for user-item interactions.
○ Embeddings: a unified representation of heterogeneous signals (e.g. adding image/audio/textual content as side information to item embeddings via convolutional NNs).
○ Exploitation of the sequential information in the actions leading up to a recommendation (e.g. an LSTM on viewing/purchase/search history to predict what will be watched/purchased/searched next).
○ DL toolkits provide unprecedented flexibility in experimenting with loss functions (e.g. in toolkits like TensorFlow/MXNet/Keras, switching from a classification loss to a ranking loss is trivial; see the sketch below).
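A minimal PyTorch illustration of that last point (a sketch with toy scores, not any production system): once the model produces per-item scores, moving from softmax cross-entropy to a BPR-style pairwise ranking loss is a few lines.

import torch
import torch.nn.functional as F

scores = torch.randn(32, 100, requires_grad=True)   # model scores per item
labels = torch.randint(0, 100, (32,))               # observed positive items

# classification view: softmax cross-entropy over items
loss_cls = F.cross_entropy(scores, labels)

# ranking view: BPR-style loss on (positive, sampled negative) pairs
neg = torch.randint(0, 100, (32,))                  # sampled negatives
pos_s = scores[torch.arange(32), labels]
neg_s = scores[torch.arange(32), neg]
loss_rank = -F.logsigmoid(pos_s - neg_s).mean()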