3. http://kylin.io
Extreme OLAP Engine for Big Data
Apache Kylin is an open source Distributed Analytics Engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets. It was originally contributed by eBay Inc.
What's Kylin
kylin / ˈkiːˈlɪn / 麒麟
--n. (in Chinese art) a mythical animal of composite form
• Open sourced on Oct 1st, 2014
• Accepted as Apache Incubator Project on Nov 25th, 2014
13. http://kylin.io
Kylin Architecture Overview
[Architecture diagram. Components: 3rd Party App (Web App, Mobile…); SQL-Based Tool (BI Tools: Tableau…); REST API; JDBC/ODBC; SQL; REST Server; Query Engine; Routing; Metadata; Cube Builder (MapReduce…); Data Cube; Hadoop Hive (Star Schema Data); OLAP Cubes (HBase) (Key Value Data); Low Latency - Seconds; Data Source Abstraction; Engine Abstraction; Storage Abstraction.]
• Online Analysis Data Flow
• Offline Data Flow
• Clients/Users interact with Kylin via SQL
• OLAP Cube is transparent to users
17. http://kylin.io
The Freedom, Extensibility, Flexibility
• The freedom
  • Zoo break, not bound to Hadoop any more
  • Free to go to a better engine or storage
• Extensibility
  • Accept any input, e.g. Kafka
  • Embrace next-gen distributed platform, e.g. Spark
• Flexibility
  • Choose different engine for different data set
20. http://kylin.io
Layered Cubing (MR Engine V1)
• Pros
  • Simple implementation, relies on the MR shuffle to merge sort and then aggregate
  • Low memory requirement
• Cons
  • Aggregation happens at reducer side
  • Mapper outputs raw data, thus the shuffle is huge
  • Overhead of multiple rounds of MR
  • Shuffle can be 100x the cube size, big I/O pressure
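The layer-by-layer idea on this slide can be illustrated with a small sketch. It is not Kylin's code: the function names, the "*" placeholder for a rolled-up dimension, and the choice of parent cuboid are assumptions made for illustration. Each pass of the outer loop stands in for one MapReduce round, with the aggregation done on the "reducer" side.

```python
from collections import defaultdict
from itertools import combinations

def aggregate(rows, keep):
    """Group rows by the dimension positions in `keep` and sum the measure."""
    out = defaultdict(int)
    for dims, measure in rows:
        # Dimensions that are not kept are rolled up into the "*" placeholder.
        key = tuple(d if i in keep else "*" for i, d in enumerate(dims))
        out[key] += measure
    return dict(out)

def layered_cubing(records, n_dims):
    """Build every cuboid one layer at a time, from N dimensions down to 0.

    Each layer stands in for one MapReduce round: rows of the layer above
    are emitted unaggregated and only summed up in `aggregate`.
    """
    all_dims = frozenset(range(n_dims))
    cube = {all_dims: aggregate(records, all_dims)}   # base cuboid from raw data
    for size in range(n_dims - 1, -1, -1):            # one round per layer
        for keep in map(frozenset, combinations(range(n_dims), size)):
            # Pick any already-built cuboid with exactly one extra dimension.
            parent = next(p for p in cube if len(p) == size + 1 and keep < p)
            cube[keep] = aggregate(cube[parent].items(), keep)
    return cube

records = [(("2015-10-01", "CN"), 500),
           (("2015-10-01", "US"), 300),
           (("2015-10-02", "CN"), 200)]
cube = layered_cubing(records, 2)
print(cube[frozenset({1})])   # totals per location
```

With N dimensions the loop runs N rounds after the base cuboid, which is where the scheduling overhead and repeated shuffles listed under Cons come from.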
21. http://kylin.io
Fast Cubing (MR Engine V2)
[Diagram: each mapper turns one Data Split into a Cube Segment (repeated per split, ……); the segments go through Merge Sort (Shuffle) and a reducer produces the Final Cube.]
22. http://kylin.io
Fast Cubing (MR Engine V2)
• One round of MR calculates the whole cube
  • Minimizes scheduling overhead
• Aggregation happens at mapper side
  • 1M raw records become 10K at base level
  • Reduced shuffle size, 20x the total cube size
• Memory eater
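A minimal sketch of the single-round idea follows, assuming each data split fits in memory. The names `cube_of_split` and `fast_cubing` are hypothetical and the code is not taken from Kylin; it only shows that each "mapper" emits an already aggregated cube segment, so the "reducer" merely merges segments.

```python
from collections import Counter
from itertools import combinations

def cube_of_split(records, n_dims):
    """Mapper side: build every cuboid of one data split in memory."""
    segment = Counter()
    masks = [frozenset(c) for k in range(n_dims + 1)
             for c in combinations(range(n_dims), k)]
    for dims, measure in records:
        for keep in masks:
            # "*" marks a dimension that is rolled up in this cuboid.
            key = tuple(d if i in keep else "*" for i, d in enumerate(dims))
            segment[key] += measure
    return segment

def fast_cubing(splits, n_dims):
    """Single round: one cube segment per split, then one merge (the shuffle)."""
    segments = [cube_of_split(s, n_dims) for s in splits]
    shuffled = sum(len(seg) for seg in segments)   # rows that cross the network
    final = Counter()
    for seg in segments:                           # reducer side: merge only
        final.update(seg)                          # Counter.update adds counts
    return dict(final), shuffled

splits = [[(("d1", "CN"), 5), (("d1", "CN"), 7)],
          [(("d1", "CN"), 1), (("d2", "US"), 2)]]
final, shuffled = fast_cubing(splits, 2)
print(final[("*", "*")], shuffled)
```

The first split holds two raw records with identical dimensions, so its segment shrinks to one row per cuboid before anything is shuffled; the price is that every segment lives in mapper memory.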
23. http://kylin.io
In-Mem Cubing
• A simplified star cubing algorithm
  • Xin, Dong, et al. "Star-cubing: Computing iceberg cubes by top-down and bottom-up integration." Proceedings of the 29th International Conference on Very Large Data Bases, Volume 29. VLDB Endowment, 2003.
• Top-down; free resources on branch completion
• Multi-threading if memory is available; ordered output
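The "top-down, free resources on branch complete" point can be illustrated with a depth-first walk over a cuboid spanning tree. This is a plain top-down traversal written for illustration, not the Star-Cubing algorithm from the cited paper and not Kylin's implementation; the function names and the `emit` callback are assumptions.

```python
from collections import defaultdict

def roll_up(cuboid, drop):
    """Derive a child cuboid by replacing dimension `drop` with '*'."""
    out = defaultdict(int)
    for dims, measure in cuboid.items():
        out[dims[:drop] + ("*",) + dims[drop + 1:]] += measure
    return dict(out)

def in_mem_cubing(base, n_dims, emit):
    """Depth-first, top-down walk over the cuboid spanning tree.

    A child only drops dimensions to the right of the one its parent
    dropped last, so each of the 2**n_dims cuboids is visited exactly once.
    """
    def visit(cuboid, first_droppable):
        emit(cuboid)
        for d in range(first_droppable, n_dims):
            child = roll_up(cuboid, d)
            visit(child, d + 1)
            del child   # branch complete: this function no longer holds it

    visit(base, 0)

base = {("d1", "CN"): 12, ("d1", "US"): 3, ("d2", "CN"): 5}
cuboids = []
in_mem_cubing(base, 2, cuboids.append)
print(len(cuboids))
```

In this sketch `emit` keeps every cuboid in a list for inspection; a real builder would write each cuboid out instead, so that only the cuboids on the current root-to-leaf path stay in memory.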
24. http://kylin.io
Fast Cubing Summary
• Pros
  • Less network pressure
  • Independent cubing algorithm that can be reused by Streaming, Spark etc.
  • Seems 30%-50% faster
• Cons
  • Code complexity
  • High mapper CPU/Mem consumption
25. http://kylin.io
Comparison on ~500 GB cubes
Fast cubing is 30% - 50% faster
[Bar chart: Layered Cubing vs. Fast Cubing for Case 1 and Case 2; axis from 0 to 120]
28. http://kylin.io
Push the Idea to Near Realtime
• Do micro batches at minute intervals
• Source data from streaming input
• Fast cubing
  Xin, Dong, et al. "Star-cubing: Computing iceberg cubes by top-down and bottom-up integration." Proceedings of the 29th International Conference on Very Large Data Bases, Volume 29. VLDB Endowment, 2003.
• Cube auto merge and garbage collection
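As a rough illustration of micro batching followed by automatic merging, the sketch below groups events into fixed-interval batches and then merges runs of small segments into larger ones. The event layout `(timestamp_s, key, measure)`, the `fan_in` merge policy, and both function names are assumptions for illustration; the slides do not specify Kylin's actual merge rules.

```python
from collections import Counter, defaultdict

def micro_batches(events, interval_s=60):
    """Group (timestamp_s, key, measure) events into fixed-interval batches."""
    batches = defaultdict(Counter)
    for ts, key, measure in events:
        batches[ts - ts % interval_s][key] += measure   # bucket by batch start
    return dict(sorted(batches.items()))

def auto_merge(segments, fan_in):
    """Merge every full run of `fan_in` consecutive segments into one."""
    items = list(segments.items())
    merged = {}
    for i in range(0, len(items) - fan_in + 1, fan_in):
        run = items[i:i + fan_in]
        total = Counter()
        for _, seg in run:
            total.update(seg)
        merged[run[0][0]] = total        # keyed by the start time of the run
    # Segments that do not fill a run stay as they are until more arrive.
    leftover_start = len(items) - len(items) % fan_in
    merged.update(items[leftover_start:])
    return merged

events = [(0, "CN", 1), (30, "CN", 2), (61, "US", 5), (130, "CN", 4)]
batches = micro_batches(events)
print(auto_merge(batches, fan_in=2))
```

The small segments that went into a merged one are simply not carried over, which stands in for the garbage collection step.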
32. http://kylin.io
Use Case: SEO Operational Dashboard
Dimensions
• eBay Site: ebay.com, ebay.co.uk, ebay.de
• Buyer Country: US, CN, RU
• Search Engine: Google, Bing, Yahoo!
• Referrer: google.com, google.co.uk
• Page: Search, View Item, Product
• User Experience: Desktop, Mobile APP, mWeb
Measurements
• Visits, GMB $, GMB share, conversion rate, bounce rate, # of view items, # of bought items etc.
33. http://kylin.io
Future
Lambda Architecture for Realtime
[Diagram components: Kafka (streaming); minute batch; Cube; Cube Storage; Real-time In-Mem Store (Latest second, Inverted Index); Hybrid Storage Interface; SQL Query]
34. http://kylin.io
TopN Support
Query:
  select dt, loc, item, sum(gmv)
  from test_kylin_fact
  where dt='2015-10-1' and loc='CN'
  group by dt, loc, item
  order by sum(gmv) desc
  limit 100
Cube pre-calculation:
  DT, LOC          TopN
  2015-10-1, CN    Item A, $500; Item B, $300; …
• TopN as a measure
  • Answer TopN queries directly from pre-calculation
• Approximate algorithm
  • SpaceSaving TopN
    Ahmed Metwally, et al. "Efficient computation of frequent and top-k elements in data streams". Proceedings of the 10th International Conference on Database Theory (ICDT'05), 2005.
  • A parallel version
    Massimo Cafaro, et al. "A parallel space saving algorithm for frequent items and the Hurwitz zeta distribution". arXiv:1401.0702v12 [cs.DS], 19 Sep 2015.
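The SpaceSaving idea referenced on this slide can be sketched in a few lines. This is a single-threaded, weighted variant written for illustration; it is not Kylin's TopN measure and it omits the parallel merge step of the second paper. The names `space_saving` and `top_n` and the sample stream are assumptions.

```python
def space_saving(stream, capacity):
    """Track approximate heavy hitters with at most `capacity` counters."""
    counters = {}
    for item, weight in stream:
        if item in counters:
            counters[item] += weight
        elif len(counters) < capacity:
            counters[item] = weight
        else:
            # Evict the current minimum; the newcomer inherits its count,
            # so a counter may overestimate but never underestimate.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + weight
    return counters

def top_n(counters, n):
    """Return the n items with the largest (approximate) totals."""
    return sorted(counters.items(), key=lambda kv: kv[1], reverse=True)[:n]

stream = [("A", 500), ("B", 300), ("C", 10), ("A", 100), ("D", 5)]
counters = space_saving(stream, capacity=3)
print(top_n(counters, 2))
```

Because memory is bounded by `capacity` rather than by the number of distinct items, a structure like this can be stored per cube cell as a measure; the trade-off is that totals of rarely seen items are only upper bounds.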