Vectorflow is a minimalist neural network library optimized for sparse data and single-machine environments. It is part of Netflix OSS: https://github.com/Netflix/vectorflow. This talk was given at an ML Platform meetup at Netflix HQ on Oct 10, 2017.
2. Scope
[Chart] Lines of code*: 1,888k / 252k / 2,322k / 110k / 6k
*: git ls-files | xargs cat | wc -l
● 0.05 developers (I spend 5% of my time on it)
● offers a minimal DAG with backprop for feed-forward nets
● sparse data as first class citizen
● arbitrary loss function
● extremely fast on CPU
○ zero memory allocation during training
○ lock-free inter-core parallelism
○ SIMD vectorization of dense ops via LLVM intrinsics
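As a sketch of what "sparse data as first class citizen" means in practice: a sparse row can be stored as parallel (index, value) arrays, so a dot product only touches the nonzeros. This is illustrative Python, not vectorflow's actual D API:

```python
import numpy as np

def sparse_dot(indices, values, w):
    """Dot product of a sparse row (parallel index/value arrays)
    with a dense weight vector -- touches only the nonzeros."""
    return float(np.dot(w[indices], values))

# Toy example: a 5-dim weight vector, a row with 2 nonzeros.
w = np.zeros(5)
w[[1, 3]] = [2.0, -1.0]
row_idx = np.array([1, 3])
row_val = np.array([0.5, 4.0])
print(sparse_dot(row_idx, row_val, w))  # 2*0.5 + (-1)*4 = -3.0
```

With ~510 nonzeros out of 7.3k dimensions (the workload on the next slide), this touches ~7% of the memory a dense dot product would.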
3. Performance
● Currently in A/B test, one of the many sub-algorithms used to construct Netflix
homepage recommendations
● Training set
○ 33M rows, ~510 nonzeros per row, total dimensionality 7.3k, sparsity = 7%
○ 8 bytes per entry: (index, value) = (uint, float)
○ 16.8B entries, 125GB total
● Real-world job: sparse logistic regression with positivity constraint on weights
● 4.2 sec per SGD pass (proximal AdaGrad) over 16 cores (r4.8xlarge EC2 instance)
○ per core: 1.9GB / 491k rows / 250M entries per second
○ 33 GB/sec aggregate, 1 GFLOPS per core
○ 75% of memory bandwidth (DDR4 SDRAM r4.8xlarge max read throughput is ~44GB/s)
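The job above uses proximal AdaGrad with a positivity constraint on the weights. One way to sketch that update (illustrative Python, not the library's D implementation) is a per-coordinate AdaGrad step followed by projection onto the nonnegative orthant, which is the proximal operator of the constraint:

```python
import numpy as np

def prox_adagrad_step(w, g, G, lr=0.1, eps=1e-8):
    """One proximal AdaGrad step with a positivity constraint:
    scale the gradient by the per-coordinate AdaGrad rate, then
    project onto {w >= 0} (the prox of the constraint's indicator)."""
    G += g * g                        # accumulate squared gradients in place
    w -= lr * g / (np.sqrt(G) + eps)  # per-coordinate adaptive step
    np.maximum(w, 0.0, out=w)         # projection: keep weights >= 0
    return w

w = np.array([0.5, 0.1])
G = np.zeros(2)                 # AdaGrad accumulator, one per weight
g = np.array([-1.0, 2.0])       # example gradient
prox_adagrad_step(w, g, G)      # w stays elementwise nonnegative
```

Note that all three buffers are updated in place, consistent with the zero-allocation design discussed later.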
4. Trade-offs
“All non-trivial abstractions, to some degree, are leaky.” - Joel Spolsky
[Diagram] genericity ↔ performance
● tensorflow/core/kernels
adjust_hue_op.cc
sparse_xent_op.cc
word2vec_ops.cc
REGISTER_OP("Skipgram")
    .Deprecated(19, "Moving word2vec into tensorflow_models/tutorials and "
                    "deprecating its ops here as a result")
● RNN unrolling
5. Design choice: D
● Fact 1: python is awesome but slow. Fact 2: scientists can’t code in C++.
○ Mainstream solution: python to frontend an efficient C++ backend
○ Problem: scientists have outsourced technological leverage to C++ coders
○ Scientists might think they need a cluster of GPUs instead of a single box
○ Creates a “division of labor” which hampers innovation at interface
● vectorflow is written in D: a modern systems language
○ python-like experience for beginners, 100x faster runtime
○ C++ done right for experienced users
○ code-compile-run-debug loop almost as fast as python
○ statically typed with great type-inference, best-in-class templates
○ amazing LLVM compiler LDC
○ low-level control if needed
■ compile-time evaluation, inline asm
■ manual mem management
● Single language benefits
○ you don’t have to switch language to have efficient code
○ fewer abstractions, less impedance mismatch, fewer bugs
○ faster dev time
6. Design choice: optimize for latency
● Most DL libraries optimize for throughput, not latency - assume memory move is cheap
○ mini-batch API
○ pass-by-copy by default, gather when sparse
■ computation is assumed to outweigh memory transport cost
● RAM -> GPU memory -> computation -> RAM
■ makes sense for compute heavy, dense problems
● images: convolutions are expensive
● Instead, vectorflow optimizes for low latency - assumes memory move is expensive
○ row-based API : fast query time
○ everything is pre-allocated when the graph is built
○ no memory allocation/copy during forward-prop nor backward-prop (RAM is slow)
○ great for low latency problems / sparse or shallow nets: real-time bidding, trading etc.
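A minimal sketch of the "everything is pre-allocated when the graph is built" idea, using NumPy's `out=` argument so the hot loop allocates nothing. Names here are hypothetical, not vectorflow's API:

```python
import numpy as np

class DenseLayer:
    """Dense layer whose buffers are allocated once at graph-build time;
    forward() writes into the pre-allocated output, so repeated calls
    do no heap allocation (illustrative sketch, not vectorflow's API)."""
    def __init__(self, dim_in, dim_out):
        self.W = np.random.randn(dim_out, dim_in).astype(np.float32)
        self.out = np.empty(dim_out, dtype=np.float32)  # pre-allocated

    def forward(self, x):
        np.dot(self.W, x, out=self.out)  # no new array created
        return self.out

layer = DenseLayer(4, 3)
x = np.ones(4, dtype=np.float32)
y = layer.forward(x)
assert y is layer.out  # same buffer every call
```

The same pattern applied to gradient buffers gives allocation-free backward passes as well.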
[Diagram] shallow => IO-bound => optimized for CPU; deep => compute-bound => optimized for GPU
7. Design choice: templates leverage
● Data
○ Format agnostic: “bring your own data”
○ Move the code to the data, not the opposite
○ Loose requirement on schema
○ Library just expects an iterator
■ in-memory or out-of-core learning possible
○ Compile-time mapping of data fields to DAG roots to avoid runtime copy
○ Netflix internal data-adapter example: stream parquet-encoded, s3-backed Hive tables
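A sketch of the "library just expects an iterator" contract, assuming a hypothetical `label idx:val idx:val ...` text format (not Netflix's parquet adapter). Because the trainer only consumes an iterator, the same loop works in-memory or streaming out-of-core:

```python
import io

def rows(lines):
    """'Bring your own data': yield (label, [(index, value), ...]) per row.
    The source can be an in-memory list or a streamed file handle, so
    out-of-core learning needs no API change. Format is hypothetical."""
    for line in lines:
        label, *feats = line.split()
        yield float(label), [(int(i), float(v))
                             for i, v in (t.split(':') for t in feats)]

sample = io.StringIO("1 3:0.5 7:2.0\n0 1:1.0\n")
parsed = list(rows(sample))
# parsed[0] == (1.0, [(3, 0.5), (7, 2.0)])
```

In D, the field-to-DAG-root mapping sketched here at runtime can be resolved at compile time via templates, which is what avoids the per-row copy.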
● Loss callback
○ Easily implement arbitrary loss functions
○ Compile-time specialization of learning logic based on callback signature
○ Gradient buffer reference to avoid allocation
■ Can be dense or sparse!
○ Example: sparse auto-encoder (sparse cross-entropy)
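A loss-callback sketch under these constraints: the trainer hands the callback a pre-allocated gradient buffer, and the callback writes the gradient into it and returns the loss value. Illustrative Python with a binary logistic loss and labels in {-1, +1}; the signature is hypothetical:

```python
import numpy as np

def logistic_loss_grad(pred, label, grad_out):
    """Write d(loss)/d(pred) into the caller-provided buffer grad_out
    (no allocation) and return the loss. Binary logistic loss,
    label in {-1, +1}."""
    z = label * pred[0]
    grad_out[0] = -label / (1.0 + np.exp(z))  # gradient w.r.t. net output
    return np.log1p(np.exp(-z))               # log(1 + e^{-z})

grad = np.zeros(1)  # pre-allocated once by the trainer
loss = logistic_loss_grad(np.array([0.0]), 1.0, grad)
# at pred = 0: loss = log(2), grad = -0.5
```

For a sparse target (e.g. the sparse auto-encoder case), the same pattern applies with a sparse gradient buffer, so only the touched outputs are written.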
8. Design choice: parallelism
● Distributed learning...
○ … is hard to implement & debug
○ … trades convergence speed for lower communication cost
■ meta-algorithms such as CoCoA (Berkeley), AIDE (CMU) help
● Don’t distribute over multiple machines unless you need it
● Intra-core parallelism: SIMD for all dense ops
● Inter-core parallelism: Hogwild! - asynchronous SGD
○ Data parallelism: each core iterates over a data chunk
○ Lock-free strategy, pretends each core is alone - race conditions will happen
○ Avoid need of a meta-algorithm
○ Works great as long as read/write patterns are sparse enough
■ More likely to be true in the sparse bottom layer
○ Works surprisingly well on dense problems too
○ Free: only cost is CPU cache-line thrashing
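Hogwild! in miniature: each worker updates a shared weight vector with no locks, tolerating occasional lost updates because sparse rows rarely touch the same coordinates. Illustrative Python (a real implementation would use genuinely parallel shared-memory threads, as the D library does); the chunks here are deliberately disjoint so the result is deterministic:

```python
import numpy as np
from threading import Thread

w = np.zeros(8)  # shared weights, no lock around them

def worker(rows, lr=0.1):
    """Data parallelism: each worker iterates over its own chunk and
    does unsynchronized read-modify-write updates on shared w."""
    for idx, grad in rows:
        w[idx] -= lr * grad  # race conditions are allowed to happen

# Two chunks touching disjoint coordinates (the sparse-enough case).
chunks = [[(0, 1.0), (1, -1.0)], [(6, 2.0), (7, 0.5)]]
threads = [Thread(target=worker, args=(c,)) for c in chunks]
for t in threads: t.start()
for t in threads: t.join()
```

When chunks do collide on a coordinate, an update can be overwritten; Hogwild!'s bet, borne out above, is that with sparse read/write patterns this happens rarely enough not to hurt convergence.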
9. small > big, simple > complex
● Distributed as source-code, not pre-compiled library
○ Compilation arch = running arch, so binaries are always optimized for the host
■ leverages LLVM as much as possible, no handwritten-SIMD
● No third party dependencies
○ No-brainer to install: just need a D compiler
○ Works everywhere
● Small code base, easy to understand and hack
● Polar bear friendly
Some Netflix use-cases:
● Survival regression
● Quantile regression
● Binary/multiclass classification
● Causal inference
● Auto-encoder
● ...
Roadmap:
● more complex nodes and deeper sparsity support
● algebraic API (mix of pytorch / tf style via operator overloading)
● RNN, more optimizers (SVRG etc.)
● keep it simple & small - not meant to be an ML kitchen sink