Sudeep Das presented on recommender systems and advances in deep learning approaches. Matrix factorization is still the foundational method for collaborative filtering, but deep learning models are now augmenting these approaches. Deep neural networks can learn hierarchical representations of users and items from raw data such as images, text, and sequences of user actions. Models such as Wide & Deep networks combine the strengths of memorization and generalization. Sequence models such as recurrent neural networks have also been applied to session data for next-item recommendation.
10. Background
● 2006-2009: Netflix Prize:
○ >10% improvement, win $1,000,000
● The top-performing models ended up being variations of matrix factorization (SVD++, Koren et al.)
● Although Netflix's rec system has moved on, MF is still the foundational method on which most collaborative filtering systems are based
12. Singular Value Decomposition (Origins)
R = U Σ V^T
where R is the (users × items) ratings matrix, the columns of U and rows of V^T are the left/right singular vectors (an orthonormal basis), and the diagonal matrix Σ holds the singular values (scaling).
13. Low-rank approximation
● Eckart-Young theorem: the best rank-k approximation to R in Frobenius norm is obtained by keeping only the k largest singular values of the SVD:
[U′, Σ′, V′^T] = argmin ‖R − U Σ V^T‖²_F
where ‖·‖_F is the Frobenius norm.
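As a concrete illustration, here is a minimal NumPy sketch of the Eckart-Young result, with a synthetic matrix standing in for R: truncating the SVD to the k largest singular values gives the best rank-k approximation in Frobenius norm.

import numpy as np

rng = np.random.default_rng(0)
R = rng.random((100, 50))          # dense stand-in for a ratings matrix
k = 10                             # target rank

U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation

err = np.linalg.norm(R - R_k, "fro")
print(f"rank-{k} Frobenius error: {err:.4f}")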
14. Low-rank Matrix Factorization
● No orthogonality requirement
● Fit by weighted least squares (or other losses)
● R ≈ P Q^T, where the shared inner dimension k is the size of the latent space
● Unlike SVD's U Σ V^T, the scaling factor Σ is absorbed into both matrices (they are not normalized)
15. Low-rank MF (cont…)
● Bias terms: r_ui ≈ μ + b_u + b_i + p_u · q_i, where μ is the overall bias, b_u the user bias, and b_i the item bias
● Regularization, e.g. L2, L1, etc.
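A minimal NumPy sketch of biased low-rank MF trained with SGD on hypothetical toy ratings; the update rules are standard, but the hyperparameters and data here are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 100, 50, 8
ratings = [(rng.integers(n_users), rng.integers(n_items),
            rng.integers(1, 6)) for _ in range(1000)]

mu = np.mean([r for _, _, r in ratings])          # overall bias
b_u, b_i = np.zeros(n_users), np.zeros(n_items)   # user / item biases
P = rng.normal(0, 0.1, (n_users, k))              # user factors
Q = rng.normal(0, 0.1, (n_items, k))              # item factors
lr, reg = 0.01, 0.05                              # learning rate, L2 strength

for epoch in range(20):
    for u, i, r in ratings:
        err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))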
18. Asymmetric Matrix Factorization
● Replace the user vector with a sum of item vectors: R ≈ (I(R) Y) Q^T, where I(R) indicates which items each user interacted with, so each user's vector is the sum of the Y-rows over N(u), the set of all items user u rated/viewed/clicked.
19. AMF, relation to Neural Networks
● Feed a 1-hot encoding of a user's play history into a network with a single hidden layer: learning the network weights is equivalent to learning the Y and Q matrices.
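A minimal NumPy sketch of this equivalence, with randomly initialized Y and Q standing in for learned weights: the multi-hot history picks out Y-rows (the "hidden layer"), and scoring against Q is the output layer.

import numpy as np

rng = np.random.default_rng(0)
n_items, k = 1000, 32
Y = rng.normal(0, 0.1, (n_items, k))   # "history" item embeddings
Q = rng.normal(0, 0.1, (n_items, k))   # "target" item embeddings

history = [3, 17, 256, 980]            # N(u): items the user interacted with
user_vec = Y[history].sum(axis=0) / np.sqrt(len(history))  # implicit user vector

scores = Q @ user_vec                  # score every item
top = np.argsort(-scores)[:10]         # top-10 recommendations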
20. SLIM
● SLIM replaces the low-rank approximation with a sparse item-item matrix: R ≈ I(R) Y, where I(R) is the binarized interaction matrix and Y is an (items × items) weight matrix whose diagonal is replaced with zeros. Sparsity comes from an L1 regularizer.
● Equivalent to constructing a regression that uses the user's play history to predict their ratings.
● NB: It is important that the diagonal is excluded; otherwise the solution is trivial (each item predicts itself).
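One common way to fit SLIM is one sparse regression per item. Here is a minimal sketch using scikit-learn's ElasticNet on a toy implicit-feedback matrix; the item-item weight matrix (Y on the slide) is called W here, and the hyperparameters are assumptions.

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
R = (rng.random((200, 50)) > 0.9).astype(float)   # toy implicit-feedback matrix
n_items = R.shape[1]
W = np.zeros((n_items, n_items))                  # sparse item-item weights

model = ElasticNet(alpha=0.1, l1_ratio=0.5, positive=True, fit_intercept=False)
for j in range(n_items):
    X = R.copy()
    X[:, j] = 0.0                 # zero out item j: excludes the diagonal
    model.fit(X, R[:, j])         # predict item j's column from all others
    W[:, j] = model.coef_

scores = R @ W                    # predicted scores for every user/item pair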
24. [Figure: a user-item purchase matrix with unknown entries]
● Users now belong to multiple "topics", with some proportion (e.g. 0.88 / 0.12)
● Purchases are a mix, proportional to the user's affinity for a topic and the item's affinity within that topic
28. Final step: Recommending from topics
● Once we've learnt a user's distribution over topics and each topic's distribution over items, producing a recommendation is easy.
● Score every item i as P(i|u) = Σ_t P(t|u) P(i|t), and recommend the items with the highest probability (discarding items the user has already purchased).
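A minimal NumPy sketch of that scoring step, with random Dirichlet draws standing in for the learned distributions:

import numpy as np

rng = np.random.default_rng(0)
n_users, n_topics, n_items = 10, 4, 100
theta = rng.dirichlet(np.ones(n_topics), size=n_users)   # P(t|u)
phi = rng.dirichlet(np.ones(n_items), size=n_topics)     # P(i|t)

u = 0
purchased = {3, 42}
scores = theta[u] @ phi                 # P(i|u) for every item i
scores[list(purchased)] = -np.inf       # discard already-purchased items
top = np.argsort(-scores)[:10]          # top-10 recommendations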
31. In many domains, deep learning is achieving near-human or super-human accuracy!
However, applications of deep learning in recommender systems are still in their infancy.
32. So, what is Deep Learning?
A class of machine learning algorithms:
● that use a cascade of multiple non-linear processing layers
● and complex model structures
● to learn different representations of the data in each layer
● where higher level features are derived from lower level features to form a
hierarchical representation.
Balázs Hidasi, RecSys 2016
34. Learning hierarchical representations of data
Each layer learns progressively more complex representations from its predecessor.
[Figure: raw pixels → edges → parts of objects composed from edges → object models → trainable classifier recognizing "Socrates"]
35. Earliest adaptation: Restricted Boltzmann Machines
(From a recent presentation by Alexandros Karatzoglou)
● One hidden layer. User feedback on the items interacted with is propagated back to all items.
● Very similar to an autoencoder!
36. There are many ways to make this deep.
From Olivier Grisel, dotAI 2017
41. Wide + Deep Models for Recommendations
In a recommender setting, you may want to train with a wide set of cross-product feature transformations, so that the model essentially memorizes these sparse feature combinations (rules).
Cheng et al., Google Inc. (2016)
42. Wide + Deep Models for Recommendations
On the other hand, you may want the ability to generalize using the representational power of a deep network. But deep nets can over-generalize.
Cheng et al., Google Inc. (2016)
43. Wide + Deep Models for Recommendations
Best of both worlds: jointly train a deep + wide network. The cross-feature transformation in the wide model component can memorize all those sparse, specific rules, while the deep model component can generalize to similar items via embeddings.
Cheng et al., Google Inc. (2016)
44. Wide + Deep Models for Recommendations
Wide + Deep model for app recommendations.
Cheng et al., Google Inc. (2016)
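A minimal PyTorch sketch of the joint wide + deep idea; the layer sizes, the hashed cross-feature scheme, and all names here are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, n_cross=10_000, n_ids=5_000, emb_dim=16):
        super().__init__()
        self.wide = nn.EmbeddingBag(n_cross, 1, mode="sum")   # linear model over sparse crosses
        self.emb = nn.EmbeddingBag(n_ids, emb_dim, mode="mean")
        self.deep = nn.Sequential(
            nn.Linear(emb_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, cross_ids, id_feats):
        # wide and deep components are summed into a single joint logit
        return self.wide(cross_ids) + self.deep(self.emb(id_feats))

model = WideAndDeep()
cross_ids = torch.randint(0, 10_000, (32, 4))   # hashed cross-feature ids
id_feats = torch.randint(0, 5_000, (32, 6))     # sparse id features
logits = model(cross_ids, id_feats)             # (32, 1); feed to BCEWithLogitsLoss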
45. The YouTube Recommendation Model
A two-stage approach with two deep networks:
● The candidate generation network takes events from the user's YouTube activity history as input and retrieves a small subset (hundreds) of videos from a large corpus. These candidates are intended to be generally relevant to the user with high precision. The candidate generation network only provides broad personalization via collaborative filtering.
● The ranking network scores each video according to a desired objective function, using a rich set of features describing the video and user. The highest-scoring videos are presented to the user, ranked by their score.
Covington et al., Google Inc. (2016)
46. The YouTube Recommendation Model
Stage one: deep candidate generation model architecture
● Embedded sparse features are concatenated with dense features. Embeddings are averaged before concatenation to transform variable-sized bags of sparse IDs into fixed-width vectors suitable for input to the hidden layers.
● All hidden layers are fully connected.
● In training, a cross-entropy loss is minimized with gradient descent on the output of the sampled softmax.
● At serving, an approximate nearest neighbor lookup is performed to generate hundreds of candidate video recommendations.
Covington et al., Google Inc. (2016)
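A minimal PyTorch sketch of stage one as described above; the layer sizes are assumptions, and a full softmax stands in for the paper's sampled softmax.

import torch
import torch.nn as nn

class CandidateGen(nn.Module):
    def __init__(self, n_videos=100_000, emb_dim=64, n_dense=8):
        super().__init__()
        # averaging turns variable-sized bags of video ids into fixed-width vectors
        self.watch_emb = nn.EmbeddingBag(n_videos, emb_dim, mode="mean")
        self.tower = nn.Sequential(
            nn.Linear(emb_dim + n_dense, 256), nn.ReLU(),
            nn.Linear(256, emb_dim), nn.ReLU(),
        )
        self.out = nn.Linear(emb_dim, n_videos)   # full softmax here; sampled in practice

    def forward(self, history, dense):
        u = self.tower(torch.cat([self.watch_emb(history), dense], dim=1))
        return self.out(u)                         # logits over the video corpus

model = CandidateGen()
history = torch.randint(0, 100_000, (4, 20))   # toy watch histories
dense = torch.randn(4, 8)                      # toy dense features
loss = nn.CrossEntropyLoss()(model(history, dense),
                             torch.randint(0, 100_000, (4,)))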
47. The YouTube Recommendation Model
Stage two: deep ranking network architecture
● Uses embedded categorical features (both univalent and multivalent) with shared embeddings, and powers of normalized continuous features.
● All layers are fully connected. In practice, hundreds of features are fed into the network.
Covington et al., Google Inc. (2016)
49. Collaborative Denoising Auto-Encoder
● Treats the feedback on the items the user has interacted with (input layer) as a noisy version of the user's preferences on all items (output layer).
● Introduces a user-specific input node and a hidden bias node, while the item weights are shared across all users.
Collaborative Denoising Auto-Encoders for Top-N Recommender Systems, Wu et al., WSDM 2016
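A minimal PyTorch sketch of a CDAE-style model reflecting those two bullets; the hyperparameters are assumptions.

import torch
import torch.nn as nn

class CDAE(nn.Module):
    def __init__(self, n_users, n_items, hidden=200, drop=0.5):
        super().__init__()
        self.corrupt = nn.Dropout(drop)               # denoising corruption of the input
        self.item_in = nn.Linear(n_items, hidden)     # item weights, shared across users
        self.user_in = nn.Embedding(n_users, hidden)  # user-specific input node
        self.decode = nn.Linear(hidden, n_items)

    def forward(self, user_ids, interactions):
        h = torch.sigmoid(self.item_in(self.corrupt(interactions))
                          + self.user_in(user_ids))
        return self.decode(h)   # reconstruct preferences over all items

model = CDAE(n_users=1000, n_items=5000)
users = torch.arange(4)
x = (torch.rand(4, 5000) > 0.95).float()   # toy interaction vectors
logits = model(users, x)                   # train with BCEWithLogitsLoss against x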
50. Recurrent Neural Networks - Sequence Modeling
A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
51. Session-based recommendation with Recurrent Neural Networks (GRU4Rec)
● Treat each user session as a sequence of clicks
● Predict the next item in the session sequence
Hidasi et al., ICLR (2016)
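A minimal PyTorch sketch in the spirit of GRU4Rec; the sizes are assumptions, and the paper's session-parallel mini-batches and ranking losses (BPR/TOP1) are omitted here.

import torch
import torch.nn as nn

class SessionGRU(nn.Module):
    def __init__(self, n_items=50_000, emb_dim=64, hidden=100):
        super().__init__()
        self.emb = nn.Embedding(n_items, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_items)

    def forward(self, clicks):                 # (batch, session_len) of item ids
        h, _ = self.gru(self.emb(clicks))
        return self.out(h)                     # next-item logits at every step

model = SessionGRU()
clicks = torch.randint(0, 50_000, (8, 10))     # toy click sessions
logits = model(clicks)                         # (8, 10, 50_000)
# shift targets by one step: predict clicks[:, 1:] from logits[:, :-1]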
52. Adding item metadata to GRU4Rec: Parallel RNN
● Separate RNNs for each input type:
○ Item ID
○ Image feature vector obtained from a CNN (last average-pooling layer)
Hidasi et al., RecSys (2016)
54. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback
Helping cold start by augmenting item factors with visual factors:
● Create an item factor that is the sum of two terms: an item visual factor, which is an embedding of deep CNN features of the item image, and the usual collaborative item factor.
He et al., AAAI (2016)
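A minimal NumPy sketch of that item-factor construction; the variable names and dimensions are illustrative, and random draws stand in for learned parameters and pretrained CNN features.

import numpy as np

rng = np.random.default_rng(0)
k, cnn_dim, n_items = 20, 4096, 500
gamma_i = rng.normal(0, 0.1, (n_items, k))      # collaborative item factors
f = rng.normal(0, 1, (n_items, cnn_dim))        # pretrained CNN image features
E = rng.normal(0, 0.01, (cnn_dim, k))           # learned visual embedding matrix

item_factor = gamma_i + f @ E    # visual term gives cold-start items a usable factor
user_factor = rng.normal(0, 0.1, (1, k))
scores = (user_factor * item_factor).sum(axis=1)   # rank items for this user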
55. Deep content-based music recommendations
Cold-starting new or less popular music:
● Take the mel spectrogram of the song and run it through several convolutional and max-pooling layers down to a compressed 1-d representation.
● The training objective is to minimize the squared error between the collaborative item factors of a known item and the item factor predicted from the CNN.
● Then, for a new item, the model can predict the item factor and make recommendations.
Aäron van den Oord, Sander Dieleman and Benjamin Schrauwen, NIPS 2013
http://benanne.github.io/2014/08/05/spotify-cnns.html
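A minimal PyTorch sketch of the pipeline described above; the filter sizes and factor dimension are assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class AudioToFactor(nn.Module):
    def __init__(self, n_mels=128, k=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(256, 256, kernel_size=4), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1), nn.Flatten(),   # compress the time axis to 1-d
            nn.Linear(256, k),                       # predicted item factor
        )

    def forward(self, mel):          # (batch, n_mels, time)
        return self.net(mel)

model = AudioToFactor()
mel = torch.randn(4, 128, 600)                  # toy mel spectrograms
target = torch.randn(4, 40)                     # known collaborative item factors
loss = nn.MSELoss()(model(mel), target)         # squared-error training objective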
56. The Pinterest Application: Pin2Vec Related Pins
Learn a 128-dimensional compressed representation of each item (an embedding), then use a similarity function (cosine) between them to find similar items.
Liu et al. (2017)
https://medium.com/the-graph/applying-deep-learning-to-related-pins-a6fee3c92f5e
57. The Pinterest Application: Pin2Vec Related Pins
[Figure: co-occurrence-based related pins vs. Pin2Vec results]
Liu et al. (2017)
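A minimal NumPy sketch of the retrieval step, with random vectors standing in for the learned 128-d Pin2Vec embeddings:

import numpy as np

rng = np.random.default_rng(0)
n_pins, dim = 10_000, 128
emb = rng.normal(size=(n_pins, dim))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalize rows

query = 42
sims = emb @ emb[query]                  # cosine similarity to every pin
related = np.argsort(-sims)[1:11]        # top-10 related pins, skipping the query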
59. Some concluding thoughts
● Deep learning is augmenting shallow-model-based recommender systems. The main draws for DL in RecSys seem to be:
○ Better generalization beyond linear models for user-item interactions.
○ Embeddings: a unified representation of heterogeneous signals (e.g. adding image/audio/textual content as side information to item embeddings via convolutional NNs).
○ Exploitation of the sequential information in the actions leading up to a recommendation (e.g. an LSTM on viewing/purchase/search history to predict what will be watched/purchased/searched next).
○ DL toolkits provide unprecedented flexibility in experimenting with loss functions (e.g. in toolkits like TensorFlow/MXNet/Keras, switching from a classification loss to a ranking loss is trivial; see the sketch below).
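A minimal PyTorch illustration of that last point (a sketch with toy scores, not any production system): once the model produces per-item scores, moving from softmax cross-entropy to a BPR-style pairwise ranking loss is a few lines.

import torch
import torch.nn.functional as F

scores = torch.randn(32, 100, requires_grad=True)   # model scores per item
labels = torch.randint(0, 100, (32,))               # observed positive items

# classification view: softmax cross-entropy over items
loss_cls = F.cross_entropy(scores, labels)

# ranking view: BPR-style loss on (positive, sampled negative) pairs
neg = torch.randint(0, 100, (32,))                  # sampled negatives
pos_s = scores[torch.arange(32), labels]
neg_s = scores[torch.arange(32), neg]
loss_rank = -F.logsigmoid(pos_s - neg_s).mean()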