Data Science Languages and Industry Analytics

© Cloudera, Inc. All rights reserved.

Wes McKinney, BIDS, 2015-09-19


Me

•  Serial creator of structured data tools / user interfaces
•  Mathematician, MIT '07
•  Professional SQL programmer 2007-2010 (@ AQR)
•  Created pandas, April 2008
•  Wrote Python for Data Analysis, 2012
•  Founder of DataPad -> Cloudera


A sample big data architecture

[Diagram: Application data -> Kafka (x4) -> S3 or HDFS (JSON) -> Spark/MapReduce -> Columnar storage -> Analytic SQL Engine <- User SQL]


Big data architectures currently dominated by Java / JVM languages

Python/R/Julia don't have much of a "seat at the table"


Some simplistic generalizations

Industry Analytics
•  Heterogeneous data
•  Flat tables and JSON
•  Spark / MapReduce
•  SQL
•  DFS-friendly / streaming data formats
•  More physical machines

Scientific Computing
•  Homogeneous data
•  Multidimensional arrays
•  HPC tools
•  Linear algebra
•  Scientific data formats
•  Fewer physical machines


Many interactive-speed SQL engines

… and more


Ibis: not the direct subject of this talk

•  http://blog.ibis-project.org
•  Crafting a compelling Python-on-Hadoop user experience
    • Remove SQL-programming from user workflows
    • Develop high performance Python extension APIs
•  Pythonic composable DSL designed to target SQL semantics
•  Development roadmap targets Impala (C++ / LLVM) query engine
    • … but SQL compiler toolchain works well with other SQL dialects
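
To make "composable DSL designed to target SQL semantics" concrete, here is a toy sketch of the idea. It is not the Ibis API: the `Expr` and `Table` classes and all of their methods are hypothetical names invented for this illustration. Python operators build SQL fragments instead of computing values, and `compile()` emits the final query string.

```python
from dataclasses import dataclass


def _lit(value):
    """Render a Python value or another Expr as SQL text."""
    if isinstance(value, Expr):
        return value.sql
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


@dataclass(frozen=True)
class Expr:
    """Hypothetical SQL fragment; operators return bigger fragments."""
    sql: str

    def __gt__(self, other):
        return Expr(f"({self.sql} > {_lit(other)})")

    def __mul__(self, other):
        return Expr(f"({self.sql} * {_lit(other)})")

    def sum(self):
        return Expr(f"sum({self.sql})")

    def name(self, alias):
        return Expr(f"{self.sql} AS {alias}")


class Table:
    """Hypothetical immutable table expression; each call returns a new one."""

    def __init__(self, name, where=(), select=()):
        self._name = name
        self._where = tuple(where)
        self._select = tuple(select)

    def __getitem__(self, column):
        return Expr(column)

    def filter(self, predicate):
        return Table(self._name, self._where + (predicate,), self._select)

    def select(self, *exprs):
        return Table(self._name, self._where, exprs)

    def compile(self):
        cols = ", ".join(e.sql for e in self._select) or "*"
        sql = f"SELECT {cols} FROM {self._name}"
        if self._where:
            sql += " WHERE " + " AND ".join(p.sql for p in self._where)
        return sql


t = Table("trades")  # "trades", "qty" and "price" are made-up names
query = t.filter(t["qty"] > 100).select(
    (t["price"] * t["qty"]).sum().name("notional")
)
print(query.compile())
```

Because every step returns a new expression object, fragments can be reused and combined freely before any SQL is generated, which is the property the slide calls "composable".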


Enabling interoperability with big data systems

•  Distributed / MPP query engines: implemented in a host language
    • Typically C++, Java, or Scala
•  User-defined functions (UDFs) through various means
    • Implement in host language
    • Implement in user language through some external language protocol
•  External UDFs are usually very slow (cf: PL/Python, PySpark, etc.)


What are UDFs good for?

•  Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines
•  Custom data transformations
•  Custom domain logic (date / time / data types)
•  Custom data types
•  Custom aggregations (incl. machine learning / statistics expressible as reductions)
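
As an illustration of a statistic "expressible as a reduction", the sketch below computes mean and population variance from a small mergeable state. The four-function split (`init`, `update`, `merge`, `finalize`) and the sample partitions are assumptions made for this example, not the interface of any particular engine.

```python
from functools import reduce


def init():
    # State: (count, sum, sum of squares)
    return (0, 0.0, 0.0)


def update(state, x):
    # Fold one input value into a partial state.
    n, s, ss = state
    return (n + 1, s + x, ss + x * x)


def merge(a, b):
    # Combine two partial states computed on different partitions.
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def finalize(state):
    # Turn the merged state into (mean, population variance).
    n, s, ss = state
    mean = s / n
    return mean, ss / n - mean * mean


# Each inner list stands in for the rows seen by one worker.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
partials = [reduce(update, part, init()) for part in partitions]
mean, var = finalize(reduce(merge, partials, init()))
print(mean, var)
```

Since `merge` is associative and commutative, partial states can be combined in any order, which is what lets an engine run the aggregation in parallel. The sum-of-squares form is chosen for brevity; it can lose precision on large or badly scaled inputs.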


Why are external UDFs slow?

•  Serialization / deserialization overhead
•  Scalar vs vectorized computations
•  RPC overhead
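
The three costs above can be seen in a small simulation. This is a sketch under stated assumptions: `FakeWorker` is a hypothetical stand-in for an external UDF process, and `pickle` stands in for whatever wire format a real protocol would use. It counts how many serialized messages cross the boundary for a scalar interface versus a batch (vectorized) interface.

```python
import pickle


class FakeWorker:
    """Hypothetical external UDF process; counts boundary crossings."""

    def __init__(self, func):
        self.func = func
        self.round_trips = 0

    def call(self, payload: bytes) -> bytes:
        self.round_trips += 1
        arg = pickle.loads(payload)          # deserialize on the worker side
        return pickle.dumps(self.func(arg))  # serialize the result back


def run_scalar(worker, column):
    # One serialize / call / deserialize cycle per value.
    return [pickle.loads(worker.call(pickle.dumps(x))) for x in column]


def run_vectorized(worker, column):
    # One cycle for the whole column.
    return pickle.loads(worker.call(pickle.dumps(column)))


column = list(range(10_000))
scalar_worker = FakeWorker(lambda x: x * 2)
batch_worker = FakeWorker(lambda xs: [x * 2 for x in xs])

out_scalar = run_scalar(scalar_worker, column)
out_batch = run_vectorized(batch_worker, column)

print(scalar_worker.round_trips, batch_worker.round_trips)
```

Both paths produce the same result, but the scalar path pays the serialization and call overhead once per row, while the batch path pays it once per column.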


How to make them fast?

•  Common runtime memory representation for tabular data
•  Shared-memory (zero-copy or memcpy-only) external UDF protocol
•  Vectorized UDF interface (for interpreted languages)
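
A minimal sketch of the zero-copy idea, using only the Python standard library. The function name `udf_scale_in_place` is hypothetical, and an `array` within one process stands in for a buffer that a real protocol would share between the engine and the UDF runtime. The UDF receives a view of the whole column in a single call and works on the engine's memory directly.

```python
from array import array

# A contiguous float64 column, standing in for engine-owned memory.
column = array("d", [1.0, 2.0, 3.0, 4.0])


def udf_scale_in_place(buf: memoryview, factor: float) -> None:
    """Hypothetical column-at-a-time UDF operating on a shared buffer."""
    for i in range(len(buf)):
        buf[i] *= factor


view = memoryview(column)        # wraps the existing buffer, no copy
udf_scale_in_place(view, 10.0)   # one call for the whole column

tail = view[2:]                  # slicing a memoryview is also copy-free
tail[0] = -1.0

print(column.tolist())
```

The writes made through `view` and `tail` are visible in `column` because all three refer to the same bytes. In practice the inner loop would be handed to a native array library rather than written in Python.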


Memory representation

•  Many query engines are standardizing on in-memory columnar rep'n of materialized transient data
    • Apache Drill: https://drill.apache.org/faq/
    • Spark
    • Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
•  Industry-standard serialization format: Apache Parquet
    • https://parquet.apache.org/


Serialization vs In-memory

•  Serialization formats (e.g. Parquet)
    • Optimize for IO / DFS throughput at expense of CPU/memory bus throughput
    • Do not consider random access or in-memory analytics as a goal
•  No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)
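
The trade-off can be illustrated with a deliberately simplified example. Run-length encoding is used here only as a representative compact encoding; this is not a description of the actual Parquet file layout, and the helper names are hypothetical. The encoded form is small, but reading element `i` means walking the runs, while the fixed-width in-memory array answers the same question with a single index.

```python
from array import array
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def rle_get(runs, index):
    """Random access into encoded data has to scan runs sequentially."""
    for value, length in runs:
        if index < length:
            return value
        index -= length
    raise IndexError(index)


values = [7] * 1000 + [8] * 1000 + [9] * 1000

runs = rle_encode(values)   # compact: good for IO throughput
flat = array("i", values)   # fixed-width: good for random access

print(len(runs), len(flat))
print(rle_get(runs, 1500), flat[1500])
```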


One possible proposal

•  Standardize on an augmented variant of the Apache Drill in-memory columnar memory layout
    • https://drill.apache.org/docs/value-vectors/
•  Common / shared C impl for R/Python/Julia
    • Currently all languages have poor support for JSON-like data
    • make your needs known!
    • Enumerate required data types and other requirements


More on the Drill layout

persons = [
  {
    name: 'wes',
    addresses: [
      {number: 2, street: 'a'},
      {number: 3, street: 'bb'},
    ]
  },
  {
    name: 'mark',
    addresses: [
      {number: 4, street: 'ccc'},
      {number: 5, street: 'dddd'},
      {number: 6, street: 'f'},
    ]
  },
]
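
One way to picture a columnar layout for the `persons` records above is to shred each nested field into flat buffers plus offset arrays. This is a sketch of the general offsets-plus-values idea, not Drill's exact value vector format; the `shred` function and the buffer names are assumptions made for illustration, and details such as null handling are left out.

```python
persons = [
    {"name": "wes", "addresses": [
        {"number": 2, "street": "a"},
        {"number": 3, "street": "bb"},
    ]},
    {"name": "mark", "addresses": [
        {"number": 4, "street": "ccc"},
        {"number": 5, "street": "dddd"},
        {"number": 6, "street": "f"},
    ]},
]


def shred(records):
    """Flatten the list-of-structs field into one buffer per leaf column."""
    names = []
    addr_offsets = [0]      # addresses of person i: [offsets[i], offsets[i+1])
    numbers = []            # all address numbers, back to back
    street_offsets = [0]    # byte range of each street string
    street_bytes = bytearray()
    for person in records:
        names.append(person["name"])
        for addr in person["addresses"]:
            numbers.append(addr["number"])
            street_bytes += addr["street"].encode("utf-8")
            street_offsets.append(len(street_bytes))
        addr_offsets.append(len(numbers))
    return names, addr_offsets, numbers, street_offsets, bytes(street_bytes)


names, addr_offsets, numbers, street_offsets, street_bytes = shred(persons)
print(addr_offsets, numbers)
print(street_offsets, street_bytes)
```

Each leaf column ends up contiguous, so a scan over `numbers` touches no names or strings, while the offset arrays preserve which addresses belong to which person.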


Array<Array<Int32>> example

persons = [
  {
    name: 'wes',
    fav_sequences: [
      [0, 1, 2],
      [2, 3]
    ]
  },
  {
    name: 'mark',
    fav_sequences: [
      [3],
      [4, 5],
      [6, 7]
    ]
  },
]

person.fav_sequences
  offset: 0 2 5

person.fav_sequences/values
  offset: 0 3 5 6 8
  values: 0 1 2 2 3 3 4 5 6 7
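
The two offset arrays in this example can be rebuilt, and then used for random access, with a few lines of Python. This is a sketch with hypothetical helper names (`flatten`, `get_sequence`). One difference from the figure is an assumption of the sketch: it also stores the final end offset of the inner level (10), so every sequence has an explicit start and end.

```python
def flatten(nested):
    """Split a list of lists into (offsets, flat_items)."""
    offsets, flat = [0], []
    for item in nested:
        flat.extend(item)
        offsets.append(len(flat))
    return offsets, flat


# fav_sequences for 'wes' and 'mark', as in the example above.
fav_sequences = [
    [[0, 1, 2], [2, 3]],
    [[3], [4, 5], [6, 7]],
]

# Level 1: which sequences belong to which person.
outer_offsets, sequences = flatten(fav_sequences)
# Level 2: which values belong to which sequence.
inner_offsets, values = flatten(sequences)


def get_sequence(person, j):
    """Read persons[person].fav_sequences[j] from the flat buffers."""
    k = outer_offsets[person] + j
    return values[inner_offsets[k]:inner_offsets[k + 1]]


print(outer_offsets)
print(inner_offsets)
print(values)
print(get_sequence(1, 1))
```

Looking up a nested element costs two offset reads and one slice, regardless of how many persons or sequences precede it.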
