3. What do these stickers mean?
Pick up stickers by getting H2O installed,
or get install help at the information booth
Hands-On sessions need H2O installed!
Generally need Datasets
Generally need one of Python or R (or both!)
Python 2.7 is a must (3.0 still WIP!)
R 3.2.2 is recommended
R 3.1.0 will not work
4. H2O-3
Complete rewrite of H2O Core Internals
– 14700 commits, 526 releases, 94 branches
Complete {Rapids, Model building, API} rewrite
Loads of algo improvements; here are a few:
– Grid search, stopping criteria, row weights,
n-folds CV, checkpoints, distributions
New data source: ORC; strings: UTF-8 handling
[Chart: H2O-2 commits vs. H2O-3 commits]
6. Python
Fully functional Python client
Parity with R client
Uses your 2.7 Python install directly (3.0 WIP)
IPython, Jupyter
Load data, munge & clean, build models
– Pythonic column & row selection & slicing
– dataframe.apply(lambda x : …)
Interactive console response on Big Data
# Variance: mean of the squares minus square of the means
means = H2OFrame(zip(*[df.mean()]))
variance = df.apply(lambda x: (x*x).sum(),0) / df.nrow - means*means
# Take the square root of variance for the standard deviation
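The "mean of the squares minus square of the means" expression above is the population variance, and its square root is the standard deviation. As a rough illustration of the arithmetic only, here is a sketch in plain Python with no H2O calls; the helper `column_variance` and the toy `columns` data are made up for this example.

```python
import math

def column_variance(column):
    """Population variance as E[x^2] - (E[x])^2 for one column of numbers."""
    n = len(column)
    mean = sum(column) / n
    mean_of_squares = sum(x * x for x in column) / n
    return mean_of_squares - mean * mean

# Hypothetical toy data standing in for the columns of a frame.
columns = {"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 2.0, 2.0, 2.0]}

variances = {name: column_variance(col) for name, col in columns.items()}
# Clamp at zero before the square root, since rounding can leave a tiny negative.
sdevs = {name: math.sqrt(max(v, 0.0)) for name, v in variances.items()}
```

This one-pass formula is convenient for bulk column operations, but it can lose precision when the mean is large relative to the spread.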
7. Pythonic
Load from csv, hdfs, nfs, hive, s3, or
any 1-d or 2-d python obj:
Full set of Pythonic column & row selectors
Iteration, list comprehension
df = h2o.import_file('file_path.csv')
df = h2o.H2OFrame([any python obj])
df # All columns & rows
df['sepal_len'] # Column by name
df[2] # Column by index
df[-2] # 2nd column from end
df[0:5,:] # First 5 rows, all cols
df[[1,3,5]] # List of columns (or rows)
df[df[0]>0.5,:] # Filter rows (or cols)
df[df[0]==None,0] = mean # Impute the mean
sum_sqs = [(col*col).sum() for col in df]
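To show how one object can accept all of these selector styles, here is a minimal sketch of a toy column-oriented table that dispatches on the type of the key passed to `__getitem__`. `ToyFrame` is a hypothetical class written for this example. It is not how H2OFrame is implemented, and it supports only a few of the selectors listed above.

```python
class ToyFrame:
    """Toy column-oriented table; an illustration, NOT the H2OFrame class."""

    def __init__(self, columns):
        # columns: dict mapping column name -> list of values (equal lengths)
        self.names = list(columns)
        self.cols = [list(columns[n]) for n in self.names]

    def __getitem__(self, key):
        if isinstance(key, str):      # df['name']: column by name
            return self.cols[self.names.index(key)]
        if isinstance(key, int):      # df[2], df[-2]: column by index
            return self.cols[key]
        if isinstance(key, list):     # df[[0, 2]]: list of columns
            return ToyFrame({self.names[i]: self.cols[i] for i in key})
        if isinstance(key, tuple):    # df[0:2, :]: row slice, column slice or index
            rows, cols = key
            if not isinstance(rows, slice):
                raise TypeError("this toy only supports row slices")
            all_cols = range(len(self.names))
            picked = all_cols[cols] if isinstance(cols, slice) else [cols]
            return ToyFrame({self.names[i]: self.cols[i][rows] for i in picked})
        raise TypeError(f"unsupported selector: {key!r}")

    def __iter__(self):               # for col in df: iterate over columns
        return iter(self.cols)


df = ToyFrame({
    "sepal_len": [5.1, 4.9, 4.7],
    "sepal_wid": [3.5, 3.0, 3.2],
    "petal_len": [1.4, 1.4, 1.3],
})
first_two = df[0:2, :]                                # first 2 rows, all columns
sum_sqs = [sum(x * x for x in col) for col in df]     # list comprehension over columns
```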
8. Python - Data Munging Pipelines
Full Data Munging Pipeline Support
– Complex munging and string ops
– Feature generation; outliers; imputation
– All at H2O in-memory speeds
Generate POJOs with Pipelines
– POJOs do data munging! then run Model
Plug into e.g. Storm Bolt, any Java App
See Spencer, Pipelines @ 11:30am; Hank tomorrow
[Diagram: DB and CSV logs feed split(), asDate, join, impute, outliers, groupby, sort, train, model, POJO]
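The idea of a munging pipeline can be sketched as an ordered list of steps that is replayed on any new batch of data. The example below is plain Python rather than the H2O pipeline API; every function name and the `raw_logs` data are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical CSV log lines: date, source, value (one value is missing).
raw_logs = [
    "2015-11-09,web,12.0",
    "2015-11-09,db,",
    "2015-11-10,web,18.0",
    "2015-11-10,db,30.0",
]

def split_fields(lines):
    """split(): break each line into its comma-separated fields."""
    return [line.split(",") for line in lines]

def parse_dates(rows):
    """asDate: turn the first field into a date object."""
    return [[date.fromisoformat(d), src, val] for d, src, val in rows]

def impute_mean(rows):
    """impute: replace missing values with the mean of the known ones."""
    known = [float(v) for _, _, v in rows if v != ""]
    mean = sum(known) / len(known)
    return [[d, src, float(v) if v != "" else mean] for d, src, v in rows]

def group_sum(rows):
    """groupby: total value per source."""
    totals = defaultdict(float)
    for _, src, v in rows:
        totals[src] += v
    return dict(totals)

def run_pipeline(data, steps):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        data = step(data)
    return data

steps = [split_fields, parse_dates, impute_mean, group_sum]
result = run_pipeline(raw_logs, steps)
```

Keeping the steps as data is what makes it possible to ship the same munging logic alongside the model, which is the role the slide assigns to the generated POJO.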
9. Python – Big Data, Big Temps
Temps managed by Python's Ref-counting
– Aggressively removes temps
– No need for explicit management
– User named objects, loaded datasets,
models must be explicitly removed
Standard Python reference-semantics
Backed by copy-on-write optimization in H2O
– i.e., defensive copies are “free” until modified
Long running loops:
tmp = …
...tmp...
reclaim tmp!
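The long-running-loop case can be illustrated with a small sketch. It assumes CPython, where an object is freed as soon as its reference count reaches zero; other interpreters may defer this. `Temp` and `make_temp` are hypothetical stand-ins for a client-side handle to a server-side temporary, not H2O classes.

```python
import weakref

class Temp:
    """Stand-in for a client handle to a server-side temporary."""
    def __init__(self, label):
        self.label = label

reclaimed = []   # labels of temps that have been freed, in order

def make_temp(label):
    t = Temp(label)
    # Register a callback that runs when `t` is garbage collected.
    weakref.finalize(t, reclaimed.append, label)
    return t

tmp = None
for i in range(3):
    tmp = make_temp(f"tmp{i}")   # rebinding `tmp` drops the previous temp
    # ... use tmp ...

after_loop = list(reclaimed)     # earlier temps are already gone
del tmp                          # dropping the last name frees the last temp
after_del = list(reclaimed)
```

Under CPython the loop never holds more than two temps at once: each rebinding of `tmp` releases the previous one without an explicit remove call.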
10. R – Big Data Temp Management
Temps managed by R's GC
– Run gc() to flush extra temps
– No need for explicit management
– User named objects, loaded datasets,
models must be explicitly removed
Full R copy-by-value semantics
Backed by copy-on-write optimization in H2O
– i.e., copies are “free” until modified
Long running loops:
tmp = …
...tmp…
gc()
reclaim tmp!
11. Rapids – Driving H2O for Munging
A Big Data Language for Machines
– Used by R and Python clients, via REST
– Simple LISP syntax (1st-class functions!)
– Optimized for bulk Array ops
Functional LISP semantics – Pass-by-Value
COW: Copy-on-Write optimization
– Copies are “free” unless data is modified
Lifetimes tracked by the client
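A minimal sketch of the copy-on-write idea: a copy shares the underlying buffer until one side writes, and only then is the data duplicated. `CowColumn` is a hypothetical toy class for illustration and says nothing about how H2O implements COW internally.

```python
class CowColumn:
    """Toy copy-on-write column; an illustration, not H2O's implementation."""

    def __init__(self, data):
        self._data = data        # underlying buffer, possibly shared
        self._shared = False     # True while another column may share _data

    def copy(self):
        # O(1) "copy": the new column points at the same buffer.
        other = CowColumn(self._data)
        self._shared = other._shared = True
        return other

    def __getitem__(self, i):
        return self._data[i]

    def __setitem__(self, i, value):
        if self._shared:
            # First write after sharing: take a private copy of the buffer.
            self._data = list(self._data)
            self._shared = False
        self._data[i] = value


original = CowColumn([1, 2, 3])
snapshot = original.copy()                          # free: no data copied yet
shares_before = snapshot._data is original._data
snapshot[0] = 99                                    # the write triggers the real copy
shares_after = snapshot._data is original._data
```

The flag is conservative: a column may make one unnecessary private copy if its former partner has already written, but a write never leaks into another column.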
12. Rapids – Join, Sort, & GroupBy
Big Sort, Big Join by Matt Dowle (of data.table)
– Parallel & Distributed; data too big for one machine
– Working: 1b x 5 joined with 1b x 5, yielding 1b x 9
– Now testing 10b rows by 10b rows join on 10 nodes
– Stable sort, index built, can binary search
– Opens door for rolling joins
Any lambda function on Group-Bys
[Chart: data.table 505s; H2O, 1 node 236s; H2O, 4 nodes 113s]
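Why a stable sort plus an index "opens the door for rolling joins" can be sketched on a single machine: once the keys are sorted, an exact match is a binary search for the run of equal keys, and a rolling match is a binary search for the last key at or before the probe. The code below uses Python's standard `bisect` module on toy data. It is not the distributed H2O implementation, and the function names and the `quotes` rows are made up for this example.

```python
from bisect import bisect_left, bisect_right

def build_index(rows, key):
    """Stable sort of row positions by key; returns (sorted_keys, positions)."""
    positions = sorted(range(len(rows)), key=lambda i: rows[i][key])
    return [rows[i][key] for i in positions], positions

def lookup_exact(sorted_keys, positions, k):
    """Exact join lookup: positions of all rows whose key equals k."""
    lo = bisect_left(sorted_keys, k)
    hi = bisect_right(sorted_keys, k)
    return positions[lo:hi]

def lookup_rolling(sorted_keys, positions, k):
    """Rolling join lookup: position of the last row with key <= k, or None."""
    hi = bisect_right(sorted_keys, k)
    return positions[hi - 1] if hi else None

# Hypothetical unsorted rows keyed by a timestamp "t".
quotes = [
    {"t": 40, "px": 103.0},
    {"t": 20, "px": 101.0},
    {"t": 10, "px": 100.0},
    {"t": 20, "px": 102.0},
]
keys, pos = build_index(quotes, "t")

exact = lookup_exact(keys, pos, 20)       # both rows with t == 20, original order kept
rolled = lookup_rolling(keys, pos, 35)    # latest row at or before t == 35
missing = lookup_rolling(keys, pos, 5)    # nothing at or before t == 5
```

Because the sort is stable, rows with equal keys keep their original relative order, so the rolling lookup returns the most recently listed of the tied rows.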
14. What's (Not) New!
Same commitment to Quality, Speed, Size, Scale
– 10b x 10b row joins! GLRM! Grid search!
Same Rapid pace of Innovation
– Tons of new code! ~15000 commits!
Same Quality-Driven Culture
– Bigger team! New faces!
Community, Culture, Code