What's New in H2O Python, R, and Flow

What's New!
Python
Flow
R
Cliff Click
CTO, Co-Founder

What do these stickers mean?
Pick up stickers by getting H2O installed!
or get install help at the information booth
Hands-On sessions need H2O installed!
Generally need Datasets
Generally need one of Python or R (or both!)
Python 2.7 is a must (3.0 still WIP!)
R 3.2.2 is recommended
R 3.1.0 will not work

H2O-3

Complete rewrite of H2O Core Internals
– 14700 commits, 526 releases, 94 branches

Complete {Rapids, Model building, API} rewrite

Loads of Algo improvements; here's a few:
– Grid search, stopping criteria, row weights,
n-folds CV, checkpoints, distributions
• New data source ORC; Strings: UTF-8 handling
[Chart: H2O-2 commits vs. H2O-3 commits]

H2O-3 Algos
• Old ones improve...
– GLM beta constraints, offsets, multinomial
– GBM, DRF stochastic sampling, ncats_bin,
poisson, gamma, tweedie, huber, laplace

Generalized Low Rank Modeling - Anqi
– Dimensionality reduction, imputation

Ensembles - Erin

Python

Fully functional Python client
– Parity with R client
– Uses your Python 2.7 install directly (3.0 WIP)
– IPython, Jupyter

Load data, munge & clean, build models
– Pythonic column & row selection & slicing
– dataframe.apply(lambda x : …)

Interactive console response on Big Data
# Mean of the squares minus square of the means
means = H2OFrame(zip(*[df.mean()]))
sdev = df.apply(lambda x: (x*x).sum(),0) / df.nrow - means*means
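
The identity in the comment above can be checked without an H2O cluster. What follows is a minimal plain-Python sketch; the helper name `variance_by_column` and the sample data are illustrative assumptions, not part of the H2O API. Mean-of-squares minus square-of-mean gives the population variance, so a square root is still needed to get a standard deviation.

```python
import math

def variance_by_column(columns):
    """Population variance per column, computed as E[x^2] - (E[x])^2."""
    result = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        mean_sq = sum(x * x for x in col) / n
        result.append(mean_sq - mean * mean)
    return result

# Hypothetical two-column dataset, stored column-wise.
columns = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0]]
variances = variance_by_column(columns)

# Clamp at zero before the square root: rounding can push a tiny variance negative.
std_devs = [math.sqrt(max(v, 0.0)) for v in variances]
```

This one-pass formula is convenient on big data because it only needs sums, but it can lose precision when the mean is large relative to the spread.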

Pythonic

Load from csv, hdfs, nfs, hive, s3, or
any 1-d or 2-d python obj:

Full set of Pythonic column & row selectors

Iteration, list comprehension
df = h2o.import_file('file_path.csv')
df = h2o.H2OFrame([any python obj])
df # All columns & rows
df['sepal_len'] # Column by name
df[2] # Column by index
df[-2] # 2nd column from end
df[0:5,:] # First 5 rows, all cols
df[[1,3,5]] # List of columns (or rows)
df[df[0]>0.5,:] # Filter rows (or cols)
df[df[0]==None,0] = mean # Impute the mean
sum_sqs = [(col*col).sum() for col in df]
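
For readers without a cluster at hand, the last two lines can be mirrored on plain Python lists. This sketch only imitates the semantics (replace missing entries in a column with that column's mean, then sum the squares per column). The `impute_mean` helper and the data are hypothetical, and an H2OFrame would do this work in the cluster rather than in the client.

```python
def impute_mean(column):
    """Return a copy of `column` with None replaced by the mean of the present values."""
    present = [x for x in column if x is not None]
    mean = sum(present) / len(present)
    return [mean if x is None else x for x in column]

# Hypothetical column-wise data; None marks a missing value.
data = {
    "sepal_len": [5.0, None, 7.0],
    "sepal_wid": [3.0, 3.5, None],
}

# Impute every column, then compute the per-column sum of squares
# with a list comprehension, as on the slide.
filled = {name: impute_mean(col) for name, col in data.items()}
sum_sqs = [sum(x * x for x in col) for col in filled.values()]
```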

Python - Data Munging Pipelines

Full Data Munging Pipeline Support
– Complex munging and string ops
– Feature generation; outliers; imputation
– All at H2O in-memory speeds

Generate POJOs with Pipelines
– POJOs do data munging! then run Model

Plug into e.g. Storm Bolt, any Java App

See Spencer, Pipelines @ 11:30am; Hank tomorrow
[Pipeline diagram: DB, CSV logs → split(), asDate, join → impute, outliers → groupby, sort → train model → POJO]
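
One reason to bundle munging with the model is that the same steps then run at training time and at scoring time. The sketch below illustrates that idea in plain Python. It is not how H2O builds pipelines or generates POJOs, and every function name and the log format (`user,amount`) are assumptions made for the example.

```python
def split_fields(lines):
    """Split raw CSV log lines into [user, amount] string fields."""
    return [line.split(",") for line in lines]

def parse_amount(rows):
    """Convert the amount field to float; an empty field becomes None."""
    return [[user, float(amt) if amt else None] for user, amt in rows]

def impute_amount(rows, fill_value):
    """Replace missing amounts with a fixed fill value."""
    return [[user, fill_value if amt is None else amt] for user, amt in rows]

def make_pipeline(fill_value):
    """Bundle the munging steps into one callable that can be reused."""
    steps = [
        split_fields,
        parse_amount,
        lambda rows: impute_amount(rows, fill_value),
    ]

    def run(lines):
        data = lines
        for step in steps:
            data = step(data)
        return data

    return run

pipeline = make_pipeline(fill_value=0.0)
train_rows = pipeline(["ann,3.5", "bob,"])   # munging before training
score_rows = pipeline(["cy,1.25"])           # identical munging before scoring
```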

Python – Big Data, Big Temps

Temps managed by Python's Ref-counting
– Aggressively removes temps
– No need for explicit management
– User named objects, loaded datasets,
models must be explicitly removed

Standard Python reference-semantics

Backed by copy-on-write optimization in H2O
– i.e., defensive copies are “free” until modified
Long running loops:
tmp = …
...tmp...
reclaim tmp!
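
How reference counting can reclaim temps without explicit calls is easiest to see in a small simulation. The sketch below is not H2O's client code: `RemoteTemp` and its class-level `live_keys` set are hypothetical stand-ins for a client handle and the cluster's key store. It relies on CPython running `__del__` as soon as the last reference disappears, which other Python implementations do not guarantee.

```python
class RemoteTemp:
    """Client-side handle for a temporary that lives on a simulated server."""

    live_keys = set()  # stands in for the server's key store

    def __init__(self, key):
        self.key = key
        RemoteTemp.live_keys.add(key)

    def __del__(self):
        # Called when the last reference goes away; frees the server-side temp.
        RemoteTemp.live_keys.discard(self.key)

def run_loop(iterations):
    """Create a fresh temp per iteration and report the most seen alive at once."""
    peak = 0
    for i in range(iterations):
        # Rebinding `tmp` drops the previous handle, which removes its key.
        tmp = RemoteTemp("tmp_%d" % i)
        peak = max(peak, len(RemoteTemp.live_keys))
    return peak

peak_live = run_loop(1000)
```

In this simulation the loop never accumulates temps, and the last one is released when the function returns. Objects bound to long-lived names would stay alive until removed explicitly.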

R – Big Data Temp Management

Temps managed by R's GC
– Run gc() to flush extra temps
– No need for explicit management
– User named objects, loaded datasets,
models must be explicitly removed

Full R copy-by-value semantics

Backed by copy-on-write optimization in H2O
– i.e., copies are “free” until modified
Long running loops:
tmp = …
...tmp…
gc()
reclaim tmp!

Rapids – Driving H2O for Munging

A Big Data Language for Machines
– Used by R and Python clients, via REST
– Simple LISP syntax (1st-class functions!)
– Optimized for bulk Array ops

Functional LISP semantics – Pass-by-Value

COW: Copy-on-Write optimization
– Copies are “free” unless data is modified

Lifetimes tracked by the client
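
The copy-on-write idea mentioned above can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions (a single in-process list instead of distributed data, and a conservative shared flag instead of real reference tracking), not H2O's implementation; the class name `CowColumn` is made up for the example.

```python
class CowColumn:
    """A column whose copies share one buffer until a write occurs."""

    def __init__(self, values):
        self._buf = list(values)
        self._shared = False

    def copy(self):
        # O(1): the clone points at the same buffer; both sides are marked shared.
        clone = CowColumn.__new__(CowColumn)
        clone._buf = self._buf
        clone._shared = True
        self._shared = True
        return clone

    def __getitem__(self, i):
        return self._buf[i]

    def __setitem__(self, i, value):
        if self._shared:
            # First write after sharing: take a private copy of the data.
            # (Conservative: the other side may copy once more than needed later.)
            self._buf = list(self._buf)
            self._shared = False
        self._buf[i] = value

    def shares_buffer_with(self, other):
        return self._buf is other._buf

a = CowColumn([1, 2, 3])
b = a.copy()                               # "free": no data is duplicated yet
shared_before = a.shares_buffer_with(b)
b[0] = 99                                  # the write triggers the real copy
shared_after = a.shares_buffer_with(b)
```

The caller sees pass-by-value behavior (`a` is unaffected by the write to `b`), while the cost of copying is paid only when a modification actually happens.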

Rapids – Join, Sort, & GroupBy

Big Sort, Big Join by Matt Dowle (of data.table)
– Parallel & Distributed; data too big for one machine
– Working: 1B rows x 5 cols joined with 1B rows x 5 cols, yielding 1B rows x 9 cols
– Now testing 10b rows by 10b rows join on 10 nodes
– Stable sort, index built, can binary search
– Opens door for rolling joins

Any lambda function on Group-Bys
data.table 505s
H2O 1 node 236s
H2O 4 node 113s
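
Why a sorted index "opens the door" to rolling joins can be shown with the standard library's `bisect` module. The sketch below is a single-machine illustration of the general technique (binary search over sorted keys to find the most recent match), not the distributed algorithm described on the slide; the function name and the trade/quote data are hypothetical.

```python
from bisect import bisect_right

def rolling_join(left_keys, right_keys, right_values):
    """For each left key, take the value of the last right key <= that key.

    `right_keys` must be sorted ascending. Returns None where no right key qualifies.
    """
    out = []
    for k in left_keys:
        pos = bisect_right(right_keys, k)  # index of first right key greater than k
        out.append(right_values[pos - 1] if pos > 0 else None)
    return out

# Hypothetical data: match each trade to the most recent quote at or before it.
quote_times = [10, 20, 30]
quote_prices = [1.00, 1.05, 1.10]
trade_times = [5, 10, 25, 40]
matched = rolling_join(trade_times, quote_times, quote_prices)
```

Each lookup is O(log n) once the right-hand side is sorted, which is what makes the sort-then-index approach attractive for large joins.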

Flow

Improvements:
– CMs, ROC, scoring history, deviance plots,
cross-validation metrics, POJO listings, parameter selection
– Grid search. Model & Frame import & export. Change
column type, impute, split frame
– Save/load/share flows
– Diagnostics: cluster status, log files, network tests,
profiler, stack trace, timeline
– Faster for wide datasets

What's (Not) New!

Same commitment to Quality, Speed, Size, Scale
– 10b x 10b row joins! GLRM! Grid search!

Same Rapid pace of Innovation
– Tons of new code! ~15000 commits!

Same Quality-Driven Culture
– Bigger team! New faces!

Community, Culture, Code
