Vectorflow is a minimalist neural network library optimized for sparse data and single-machine environments. It is part of Netflix OSS: https://github.com/Netflix/vectorflow. This talk was given at an ML Platform meetup at Netflix HQ on Oct 10, 2017.
2. Scope
[Chart] Lines of code*: 1,888k / 252k / 2,322k / 110k / 6k
*: git ls-files | xargs cat | wc -l
● 0.05 developers (I spend 5% of my time on it)
● offers a minimal DAG with backprop for feed-forward nets
● sparse data as first class citizen
● arbitrary loss function
● extremely fast on CPU
○ zero memory allocation during training
○ lock-free inter-core parallelism
○ SIMD vectorization of dense ops via LLVM intrinsics
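As a sketch of what "sparse data as first class citizen" means in practice: a sparse row can be stored as parallel (index, value) arrays, so a dot product only touches the nonzeros. This is illustrative Python, not vectorflow's actual D API:

```python
import numpy as np

def sparse_dot(indices, values, w):
    """Dot product of a sparse row (parallel index/value arrays)
    with a dense weight vector -- touches only the nonzeros."""
    return float(np.dot(w[indices], values))

# Toy example: a 5-dim weight vector, a row with 2 nonzeros.
w = np.zeros(5)
w[[1, 3]] = [2.0, -1.0]
row_idx = np.array([1, 3])
row_val = np.array([0.5, 4.0])
print(sparse_dot(row_idx, row_val, w))  # 2*0.5 + (-1)*4 = -3.0
```

With ~510 nonzeros out of 7.3k dimensions (the workload on the next slide), this touches ~7% of the memory a dense dot product would.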
3. Performance
● Currently in A/B test, one of the many sub-algorithms used to construct Netflix
homepage recommendations
● Training set
○ 33M rows, ~510 nonzeros per row, total dimensionality 7.3k, sparsity = 7%
○ 8 bytes per entry: (index, value) = (uint, float)
○ 16.8B entries, 125GB total
● Real-world job: sparse logistic regression with positivity constraint on weights
● 4.2 sec per SGD pass (proximal AdaGrad) over 16 cores (r4.8xlarge EC2 instance)
○ per core: 1.9GB / 491k rows / 250M entries per second
○ 33 GB/sec aggregate, 1 GFLOPS per core
○ 75% of memory bandwidth (DDR4 SDRAM r4.8xlarge max read throughput is ~44GB/s)
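The job above uses proximal AdaGrad with a positivity constraint on the weights. One way to sketch that update (illustrative Python, not the library's D implementation) is a per-coordinate AdaGrad step followed by projection onto the nonnegative orthant, which is the proximal operator of the constraint:

```python
import numpy as np

def prox_adagrad_step(w, g, G, lr=0.1, eps=1e-8):
    """One proximal AdaGrad step with a positivity constraint:
    scale the gradient by the per-coordinate AdaGrad rate, then
    project onto {w >= 0} (the prox of the constraint's indicator)."""
    G += g * g                        # accumulate squared gradients in place
    w -= lr * g / (np.sqrt(G) + eps)  # per-coordinate adaptive step
    np.maximum(w, 0.0, out=w)         # projection: keep weights >= 0
    return w

w = np.array([0.5, 0.1])
G = np.zeros(2)                 # AdaGrad accumulator, one per weight
g = np.array([-1.0, 2.0])       # example gradient
prox_adagrad_step(w, g, G)      # w stays elementwise nonnegative
```

Note that all three buffers are updated in place, consistent with the zero-allocation design discussed later.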
4. Trade-offs
“All non-trivial abstractions, to some degree, are leaky.” - Joel Spolsky
[Diagram] genericity ↔ performance
● tensorflow/core/kernels
adjust_hue_op.cc
sparse_xent_op.cc
word2vec_ops.cc
REGISTER_OP("Skipgram")
    .Deprecated(19, "Moving word2vec into tensorflow_models/tutorials and "
                    "deprecating its ops here as a result")
● RNN unrolling
5. Design choice: D
● Fact 1: python is awesome but slow. Fact 2: scientists can’t code in C++.
○ Mainstream solution: python to frontend an efficient C++ backend
○ Problem: scientists have outsourced technological leverage to C++ coders
○ Scientists might think they need a cluster of GPUs instead of a single box
○ Creates a “division of labor” which hampers innovation at interface
● vectorflow is written in D: a modern systems language
○ python-like experience for beginners, 100x faster runtime
○ C++ done right for experienced users
○ code-compile-run-debug loop almost as fast as python
○ statically typed with great type-inference, best-in-class templates
○ amazing LLVM compiler LDC
○ low-level control if needed
■ compile-time evaluation, inline asm
■ manual mem management
● Single language benefits
○ you don’t have to switch language to have efficient code
○ fewer abstractions, less impedance mismatch, fewer bugs
○ faster dev time
6. Design choice: optimize for latency
● Most DL libraries optimize for throughput, not latency - assume memory move is cheap
○ mini-batch API
○ pass-by-copy by default, gather when sparse
■ computation is assumed to outweigh memory transport cost
● RAM -> GPU memory -> computation -> RAM
■ makes sense for compute heavy, dense problems
● images: convolutions are expensive
● Instead, vectorflow optimizes for low latency - assumes memory move is expensive
○ row-based API : fast query time
○ everything is pre-allocated when the graph is built
○ no memory allocation/copy during forward-prop nor backward-prop (RAM is slow)
○ great for low latency problems / sparse or shallow nets: real-time bidding, trading etc.
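A minimal sketch of the "everything is pre-allocated when the graph is built" idea, using NumPy's `out=` argument so the hot loop allocates nothing. Names here are hypothetical, not vectorflow's API:

```python
import numpy as np

class DenseLayer:
    """Dense layer whose buffers are allocated once at graph-build time;
    forward() writes into the pre-allocated output, so repeated calls
    do no heap allocation (illustrative sketch, not vectorflow's API)."""
    def __init__(self, dim_in, dim_out):
        self.W = np.random.randn(dim_out, dim_in).astype(np.float32)
        self.out = np.empty(dim_out, dtype=np.float32)  # pre-allocated

    def forward(self, x):
        np.dot(self.W, x, out=self.out)  # no new array created
        return self.out

layer = DenseLayer(4, 3)
x = np.ones(4, dtype=np.float32)
y = layer.forward(x)
assert y is layer.out  # same buffer every call
```

The same pattern applied to gradient buffers gives allocation-free backward passes as well.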
[Diagram] shallow => IO-bound => optimized for CPU; deep => compute-bound => optimized for GPU
7. Design choice: templates leverage
● Data
○ Format agnostic: “bring your own data”
○ Move the code to the data, not the opposite
○ Loose requirement on schema
○ Library just expects an iterator
■ in-memory or out-of-core learning possible
○ Compile-time mapping of data fields to DAG roots to avoid runtime copy
○ Netflix internal data-adapter example: stream parquet-encoded, s3-backed Hive tables
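A sketch of the "library just expects an iterator" contract, assuming a hypothetical `label idx:val idx:val ...` text format (not Netflix's parquet adapter). Because the trainer only consumes an iterator, the same loop works in-memory or streaming out-of-core:

```python
import io

def rows(lines):
    """'Bring your own data': yield (label, [(index, value), ...]) per row.
    The source can be an in-memory list or a streamed file handle, so
    out-of-core learning needs no API change. Format is hypothetical."""
    for line in lines:
        label, *feats = line.split()
        yield float(label), [(int(i), float(v))
                             for i, v in (t.split(':') for t in feats)]

sample = io.StringIO("1 3:0.5 7:2.0\n0 1:1.0\n")
parsed = list(rows(sample))
# parsed[0] == (1.0, [(3, 0.5), (7, 2.0)])
```

In D, the field-to-DAG-root mapping sketched here at runtime can be resolved at compile time via templates, which is what avoids the per-row copy.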
● Loss callback
○ Easily implement arbitrary loss functions
○ Compile-time specialization of learning logic based on callback signature
○ Gradient buffer reference to avoid allocation
■ Can be dense or sparse!
○ Example: sparse auto-encoder (sparse cross-entropy)
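A loss-callback sketch under these constraints: the trainer hands the callback a pre-allocated gradient buffer, and the callback writes the gradient into it and returns the loss value. Illustrative Python with a binary logistic loss and labels in {-1, +1}; the signature is hypothetical:

```python
import numpy as np

def logistic_loss_grad(pred, label, grad_out):
    """Write d(loss)/d(pred) into the caller-provided buffer grad_out
    (no allocation) and return the loss. Binary logistic loss,
    label in {-1, +1}."""
    z = label * pred[0]
    grad_out[0] = -label / (1.0 + np.exp(z))  # gradient w.r.t. net output
    return np.log1p(np.exp(-z))               # log(1 + e^{-z})

grad = np.zeros(1)  # pre-allocated once by the trainer
loss = logistic_loss_grad(np.array([0.0]), 1.0, grad)
# at pred = 0: loss = log(2), grad = -0.5
```

For a sparse target (e.g. the sparse auto-encoder case), the same pattern applies with a sparse gradient buffer, so only the touched outputs are written.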
8. Design choice: parallelism
● Distributed learning...
○ … is hard to implement & debug
○ … trades convergence speed for lower communication cost
■ meta-algorithms such as CoCoA (Berkeley), AIDE (CMU) help
● Don’t distribute over multiple machines unless you need it
● Intra-core parallelism: SIMD for all dense ops
● Inter-core parallelism: Hogwild! - asynchronous SGD
○ Data parallelism: each core iterates over a data chunk
○ Lock-free strategy, pretends each core is alone - race conditions will happen
○ Avoid need of a meta-algorithm
○ Works great as long as read/write patterns are sparse enough
■ More likely to be true in the sparse bottom layer
○ Works surprisingly well on dense problems too
○ Free: only cost is CPU cache-line thrashing
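Hogwild! in miniature: each worker updates a shared weight vector with no locks, tolerating occasional lost updates because sparse rows rarely touch the same coordinates. Illustrative Python (a real implementation would use genuinely parallel shared-memory threads, as the D library does); the chunks here are deliberately disjoint so the result is deterministic:

```python
import numpy as np
from threading import Thread

w = np.zeros(8)  # shared weights, no lock around them

def worker(rows, lr=0.1):
    """Data parallelism: each worker iterates over its own chunk and
    does unsynchronized read-modify-write updates on shared w."""
    for idx, grad in rows:
        w[idx] -= lr * grad  # race conditions are allowed to happen

# Two chunks touching disjoint coordinates (the sparse-enough case).
chunks = [[(0, 1.0), (1, -1.0)], [(6, 2.0), (7, 0.5)]]
threads = [Thread(target=worker, args=(c,)) for c in chunks]
for t in threads: t.start()
for t in threads: t.join()
```

When chunks do collide on a coordinate, an update can be overwritten; Hogwild!'s bet, borne out above, is that with sparse read/write patterns this happens rarely enough not to hurt convergence.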
9. small > big, simple > complex
● Distributed as source-code, not pre-compiled library
○ Compilation arch = running arch, so binaries are always optimized for the host
■ leverages LLVM as much as possible, no handwritten-SIMD
● No third party dependencies
○ No-brainer to install: just need a D compiler
○ Works everywhere
● Small code base, easy to understand and hack
● Polar bear friendly
Some Netflix use-cases:
● Survival regression
● Quantile regression
● Binary/multiclass classification
● Causal inference
● Auto-encoder
● ...
Roadmap:
● more complex nodes and deeper sparsity support
● algebraic API (mix of pytorch / tf style via operator overloading)
● RNN, more optimizers (SVRG etc.)
● keep it simple & small - not meant to be an ML kitchen sink