HBase Storage Internals, present and future!
Matteo Bertozzi | @Cloudera
March 2013 - Hadoop Summit Europe




1
What is HBase

•  Open source Storage Manager that provides random read/write on top of HDFS
•  Provides Tables with a “Key:Column/Value” interface (see the client sketch below)
   •  Dynamic columns (qualifiers), no schema needed
   •  “Fixed” column groups (families)
   •  table[row:family:column] = value
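A minimal client-side sketch of that “table[row:family:column] = value” interface, assuming a table named 'myTable' with a family 'cf' and the 0.94/0.96-era Java client API (table, row and column names are illustrative):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class KeyColumnValueExample {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "myTable");   // table name is an assumption
      try {
        // table[row:family:column] = value
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
        table.put(put);

        // value = table[row:family:column]
        Get get = new Get(Bytes.toBytes("row-1"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));
      } finally {
        table.close();
      }
    }
  }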
  




2
HBase EcoSystem

•  Apache Hadoop HDFS for data durability and reliability (Write-Ahead Log)
•  Apache ZooKeeper for distributed coordination
•  Apache Hadoop MapReduce built-in support for running MapReduce jobs

[Diagram: App and MR on top of HBase, with ZK and HDFS underneath]
  




3
How HBase Works?
“View from 10,000 ft”
  




4
Master, Region Servers and Regions

•  Region Server
   •  Server that contains a set of Regions
   •  Responsible for handling reads and writes
•  Region
   •  The basic unit of scalability in HBase
   •  Subset of the table’s data
   •  Contiguous, sorted range of rows stored together
•  Master
   •  Coordinates the HBase Cluster
   •  Assignment/Balancing of the Regions
   •  Handles admin operations
      •  create/delete/modify table, …

[Diagram: Client talking to ZooKeeper and the Master; three Region Servers, each hosting several Regions, on top of HDFS]
  




5
Autosharding and .META. table

•  A Region is a Subset of the table’s data
•  When there is too much data in a Region…
   •  a split is triggered, creating 2 regions
•  The association “Region -> Server” is stored in a System Table (see the lookup sketch below)
•  The Location of .META. is stored in ZooKeeper

   Table       Start Key   Region ID   Region Server
   testTable   Key-00      1           machine01.host
   testTable   Key-31      2           machine03.host
   testTable   Key-65      3           machine02.host
   testTable   Key-83      4           machine01.host
   …           …           …           …
   users       Key-AB      1           machine03.host
   users       Key-KG      2           machine02.host

[Diagram: machine01 hosts Region 1 and Region 4 of testTable; machine02 hosts Region 3 of testTable and Region 1 of users; machine03 hosts Region 2 of testTable and Region 2 of users]
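A minimal sketch of how a client can inspect that “Region -> Server” association for a given row, assuming the testTable above and the 0.94/0.96-era client API (the client resolves the location through .META. under the hood and caches it):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HRegionLocation;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  public class RegionLocationExample {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "testTable");   // table name is an assumption
      try {
        // Which Region (and Region Server) is responsible for this row key?
        HRegionLocation location = table.getRegionLocation(Bytes.toBytes("Key-42"));
        System.out.println("region: " + location.getRegionInfo().getRegionNameAsString());
        System.out.println("server: " + location.getHostname() + ":" + location.getPort());
      } finally {
        table.close();
      }
    }
  }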
  




6
The Write Path – Create a New Table

•  The client asks the master to create a new Table
   •  hbase> create ‘myTable’, ‘cf’
•  The Master
   •  Stores the Table information (“schema”)
   •  Creates the Regions based on the key-splits provided (see the sketch below)
      •  no splits provided, one single region by default
   •  Assigns the Regions to the Region Servers
      •  The assignment Region -> Server is written to a system table called “.META.”

[Diagram: Client calls createTable() on the Master; the Master stores the table “metadata”, assigns the Regions to the Region Servers and “enables” the table]
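A minimal sketch of creating a pre-split table from the Java client, assuming the 0.94/0.96-era admin API; the table name, family and split keys below are illustrative:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  public class CreateTableExample {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      try {
        HTableDescriptor descriptor = new HTableDescriptor("myTable");
        descriptor.addFamily(new HColumnDescriptor("cf"));

        // Key-splits: the Master creates one Region per range (4 Regions here).
        // Passing no splits creates one single Region by default.
        byte[][] splits = new byte[][] {
            Bytes.toBytes("Key-25"), Bytes.toBytes("Key-50"), Bytes.toBytes("Key-75"),
        };
        admin.createTable(descriptor, splits);
      } finally {
        admin.close();
      }
    }
  }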
  


7
The Write Path – “Inserting” data

•  table.put(row-key:family:column, value)
•  The client asks ZooKeeper the location of .META.
•  The client scans .META. searching for the Region Server responsible for handling the Key
•  The client asks the Region Server to insert/update/delete the specified key/value
•  The Region Server processes the request and dispatches it to the Region responsible for handling the Key
   •  The operation is written to a Write-Ahead Log (WAL)
   •  …and the KeyValues are added to the Store: “MemStore”

[Diagram: Client asks ZooKeeper “Where is .META.?”, scans .META. on its Region Server, then sends the Insert KeyValue to the Region Server hosting the target Region]
  




8
The Write Path – Append Only to Random R/W

•  Files in HDFS are
   •  Append-Only
   •  Immutable once closed
•  HBase provides Random Writes?
   •  …not really from a storage point of view
   •  KeyValues are stored in memory and written to disk on pressure
      •  Don’t worry, your data is safe in the WAL!
         •  (The Region Server can recover data from the WAL in case of crash)
      •  But this allows sorting the data by Key before writing it to disk
   •  Deletes are like Inserts, but with a “remove me flag” (see the sketch below)

[Diagram: a Region Server with its WAL and Regions; each Region keeps a MemStore plus Store Files (HFiles); a store file holds Key0 – value 0 … Key5 – value 5 sorted by key]
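A minimal client-side sketch of the “remove me flag” idea, assuming the table from the earlier example and the 0.94/0.96-era API: the delete below does not rewrite any store file, it just records a tombstone marker (through the WAL and MemStore, like a Put) that masks the row until a compaction physically drops it.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Delete;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  public class DeleteExample {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "myTable");   // table name is an assumption
      try {
        // Goes to the WAL and the MemStore like a Put, and is flushed later as a
        // "delete" KeyValue (tombstone) instead of touching the existing files
        Delete delete = new Delete(Bytes.toBytes("row-1"));
        table.delete(delete);
      } finally {
        table.close();
      }
    }
  }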
  




9
The Read Path – “reading” data

•  The client asks ZooKeeper the location of .META.
•  The client scans .META. searching for the Region Server responsible for handling the Key
•  The client asks the Region Server to get the specified key/value (see the read sketch below)
•  The Region Server processes the request and dispatches it to the Region responsible for handling the Key
   •  MemStore and Store Files are scanned to find the key

[Diagram: Client asks ZooKeeper “Where is .META.?”, scans .META. on its Region Server, then sends the Get Key request to the Region Server hosting the target Region]
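A minimal sketch of the client-side read calls, assuming a table named 'myTable' with family 'cf' and the 0.94/0.96-era API; for both requests the Region Server merges the MemStore and the Store Files to produce the answer:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ReadExample {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "myTable");   // table name is an assumption
      try {
        // Point lookup: routed to the single Region responsible for the row
        Result row = table.get(new Get(Bytes.toBytes("row-1")));
        System.out.println(Bytes.toString(
            row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));

        // Range scan: may span several Regions, results come back in row-key order
        Scan scan = new Scan(Bytes.toBytes("row-0"), Bytes.toBytes("row-9"));
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result result : scanner) {
            System.out.println(Bytes.toString(result.getRow()));
          }
        } finally {
          scanner.close();
        }
      } finally {
        table.close();
      }
    }
  }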
  




10
The Read Path – Append Only to Random R/W

•  Each flush creates a new file
•  Each file has its KeyValues sorted by key
•  Two or more files can contain the same key (updates/deletes)
•  To find a Key you need to scan all the files
   •  …with some optimizations
   •  Filter Files by Start/End Key
   •  Having a bloom filter on each file (see the sketch below)

[Diagram: two store files, each sorted by key; both contain a version of Key0 (value 0.0 vs. value 0.1) and of Key5 (a value and a delete marker)]
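A simplified, illustrative sketch (not HBase code) of those two optimizations: skip store files whose key range cannot contain the key, and skip files whose bloom filter says the key is definitely absent. The StoreFileInfo and BloomFilter types below are hypothetical stand-ins for the per-file metadata.

  import java.util.ArrayList;
  import java.util.List;

  public class FileSelectionSketch {
    // Hypothetical per-file metadata (in HBase it comes from the HFile itself)
    static class StoreFileInfo {
      byte[] firstKey, lastKey;
      BloomFilter bloom;     // may be null if the file has no bloom filter
    }

    interface BloomFilter {
      boolean mightContain(byte[] key);   // false => key is definitely not in the file
    }

    static List<StoreFileInfo> filesToScan(List<StoreFileInfo> files, byte[] key) {
      List<StoreFileInfo> candidates = new ArrayList<StoreFileInfo>();
      for (StoreFileInfo file : files) {
        // 1. Filter by Start/End Key: a sorted file can only contain the key
        //    if it falls inside [firstKey, lastKey]
        if (compare(key, file.firstKey) < 0 || compare(key, file.lastKey) > 0) {
          continue;
        }
        // 2. Bloom filter: may report false positives, never false negatives
        if (file.bloom != null && !file.bloom.mightContain(key)) {
          continue;
        }
        candidates.add(file);
      }
      return candidates;   // only these files (plus the MemStore) must be scanned
    }

    static int compare(byte[] a, byte[] b) {
      for (int i = 0; i < Math.min(a.length, b.length); i++) {
        int diff = (a[i] & 0xff) - (b[i] & 0xff);
        if (diff != 0) return diff;
      }
      return a.length - b.length;
    }

    public static void main(String[] args) {
      StoreFileInfo file = new StoreFileInfo();
      file.firstKey = bytes("Key0");
      file.lastKey = bytes("Key9");
      List<StoreFileInfo> files = new ArrayList<StoreFileInfo>();
      files.add(file);
      System.out.println(filesToScan(files, bytes("Key5")).size());   // 1: in range
      System.out.println(filesToScan(files, bytes("KeyZ")).size());   // 0: out of range
    }

    static byte[] bytes(String s) { return s.getBytes(); }
  }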
  




11
HFile
HBase Store File Format
  




12
HFile format

•  Only Sequential Writes, just append(key, value) (see the record-layout sketch below)
•  Large Sequential Reads are better
•  Why grouping records in blocks?
   •  Easy to split
   •  Easy to read
   •  Easy to cache
   •  Easy to index (if records are sorted)
   •  Block Compression (snappy, lz4, gz, …)

[Diagram]
File layout (blocks):  Header, Record 0 … Record N | Header, Record 0 … Record N | … | Index 0 … Index N | Trailer
Key/Value (record):    Key Length : int | Value Length : int | Key : byte[] | Value : byte[]
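A simplified, illustrative sketch (not the actual HFile writer) of that record layout: each appended record is the key length and value length as ints, followed by the raw key and value bytes.

  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;

  public class RecordLayoutSketch {
    // Append one record: [key length : int][value length : int][key bytes][value bytes]
    static void append(DataOutputStream out, byte[] key, byte[] value) throws IOException {
      out.writeInt(key.length);
      out.writeInt(value.length);
      out.write(key);
      out.write(value);
    }

    public static void main(String[] args) throws IOException {
      ByteArrayOutputStream block = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(block);
      append(out, "Key0".getBytes("UTF-8"), "value 0".getBytes("UTF-8"));
      append(out, "Key1".getBytes("UTF-8"), "value 1".getBytes("UTF-8"));
      out.flush();
      System.out.println("block size: " + block.size() + " bytes");
    }
  }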
  




13
Data Block Encoding

•  “Be aware of the data”
•  Block Encoding allows compressing the Key based on what we know
   •  Keys are sorted… the prefix may be similar in most cases (see the sketch below)
   •  One file contains keys from one Family only
   •  Timestamps are “similar”, we can store the diff “on-disk”
   •  Type is “put” most of the time…

KeyValue layout:
   Row Length : short
   Row : byte[]
   Family Length : byte
   Family : byte[]
   Qualifier : byte[]
   Timestamp : long
   Type : byte
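A simplified, illustrative sketch of the prefix idea behind such encodings: since keys are sorted, each key can be stored as the length of the prefix it shares with the previous key plus the remaining suffix. The encoder below is a hypothetical illustration, not the HBase data block encoder.

  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;

  public class PrefixEncodingSketch {
    // Store each key as: [shared prefix length][suffix length][suffix bytes]
    static void encode(DataOutputStream out, byte[] previousKey, byte[] key) throws IOException {
      int common = 0;
      while (common < previousKey.length && common < key.length
          && previousKey[common] == key[common]) {
        common++;
      }
      out.writeInt(common);                  // bytes shared with the previous key
      out.writeInt(key.length - common);     // bytes that actually need to be stored
      out.write(key, common, key.length - common);
    }

    public static void main(String[] args) throws IOException {
      ByteArrayOutputStream block = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(block);
      byte[] previous = new byte[0];
      // Sorted keys with a long common prefix: only the changing tail is written
      for (String k : new String[] { "row-0001/cf:col", "row-0002/cf:col", "row-0003/cf:col" }) {
        byte[] key = k.getBytes("UTF-8");
        encode(out, previous, key);
        previous = key;
      }
      out.flush();
      System.out.println("encoded size: " + block.size() + " bytes");
    }
  }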
  




14
Compactions
Optimize the read-path
  




15
Compactions

•  Reduce the number of files to look into during a scan
   •  Removing duplicated keys (updated values)
   •  Removing deleted keys
•  Creates a new file by merging the content of 2+ files (see the merge sketch below)
   •  Remove the old files

   Store File 1        Store File 2        Merged Store File
   Key0 – value 0.0    Key0 – value 0.1    Key0 – value 0.1
   Key2 – value 2.0    Key1 – value 1.0    Key1 – value 1.0
   Key3 – value 3.0    Key4 – value 4.0    Key2 – value 2.0
   Key5 – value 5.0    Key5 – [deleted]    Key3 – value 3.0
   Key8 – value 8.0    Key6 – value 6.0    Key4 – value 4.0
   Key9 – value 9.0    Key7 – value 7.0    Key6 – value 6.0
                                           Key7 – value 7.0
                                           Key8 – value 8.0
                                           Key9 – value 9.0
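A simplified, illustrative sketch of that merge: read the input files (already sorted), keep only the newest version of each key and drop the tombstones. The in-memory maps below stand in for the real sorted store files.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map;
  import java.util.TreeMap;

  public class CompactionSketch {
    // Merge store files listed from oldest to newest: newer values overwrite older
    // ones, and a null value marks a delete tombstone that removes the key.
    static TreeMap<String, String> compact(List<TreeMap<String, String>> filesOldestFirst) {
      TreeMap<String, String> merged = new TreeMap<String, String>();
      for (TreeMap<String, String> file : filesOldestFirst) {
        for (Map.Entry<String, String> entry : file.entrySet()) {
          if (entry.getValue() == null) {
            merged.remove(entry.getKey());                  // deleted key: drop it
          } else {
            merged.put(entry.getKey(), entry.getValue());   // updated key: keep newest value
          }
        }
      }
      return merged;    // one new sorted file; the old ones can be removed
    }

    public static void main(String[] args) {
      TreeMap<String, String> older = new TreeMap<String, String>();
      older.put("Key0", "value 0.0"); older.put("Key5", "value 5.0");
      TreeMap<String, String> newer = new TreeMap<String, String>();
      newer.put("Key0", "value 0.1"); newer.put("Key5", null);   // Key5 deleted

      List<TreeMap<String, String>> files = new ArrayList<TreeMap<String, String>>();
      files.add(older); files.add(newer);
      System.out.println(compact(files));   // {Key0=value 0.1}
    }
  }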
  




16
Pluggable Compactions

•  Try different algorithms
•  Be aware of the data
   •  Time Series? I guess no updates from the 80s
•  Be aware of the requests
   •  Compact based on statistics (see the sketch below)
   •  which files are hot and which are not
   •  which keys are hot and which are not

[Diagram: the same two store files merged into one compacted file as on the previous slide]
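A simplified, illustrative sketch of what a pluggable, statistics-driven selection policy could look like; the interface, the FileStats fields and the thresholds below are hypothetical, not the actual HBase compaction API.

  import java.util.ArrayList;
  import java.util.List;

  public class CompactionPolicySketch {
    // Hypothetical per-file statistics the policy can look at
    static class FileStats {
      final String name;
      final long sizeBytes;
      final long readsLastHour;   // "hot" files are read often

      FileStats(String name, long sizeBytes, long readsLastHour) {
        this.name = name; this.sizeBytes = sizeBytes; this.readsLastHour = readsLastHour;
      }
    }

    // Hypothetical pluggable policy: decide which files to merge
    interface CompactionPolicy {
      List<FileStats> select(List<FileStats> files);
    }

    // Example policy: compact small, frequently-read files first; leave cold data alone
    static class HotSmallFilesPolicy implements CompactionPolicy {
      public List<FileStats> select(List<FileStats> files) {
        List<FileStats> selected = new ArrayList<FileStats>();
        for (FileStats file : files) {
          if (file.sizeBytes < 64L * 1024 * 1024 && file.readsLastHour > 100) {
            selected.add(file);
          }
        }
        return selected.size() >= 2 ? selected : new ArrayList<FileStats>();
      }
    }

    public static void main(String[] args) {
      List<FileStats> files = new ArrayList<FileStats>();
      files.add(new FileStats("hfile-a", 10L * 1024 * 1024, 500));
      files.add(new FileStats("hfile-b", 20L * 1024 * 1024, 300));
      files.add(new FileStats("hfile-c", 900L * 1024 * 1024, 2));
      for (FileStats file : new HotSmallFilesPolicy().select(files)) {
        System.out.println("compact: " + file.name);
      }
    }
  }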
  




17
Snapshots
Zero-copy snapshots and table clones
  




18
How taking a snapshot works?

•  The master orchestrates the RSs (see the client sketch below)
   •  the communication is done via ZooKeeper
   •  using a “2-phase commit like” transaction (prepare/commit)
•  Each RS is responsible for taking its “piece” of the snapshot
   •  For each Region, store the metadata information needed
   •  (list of Store Files, WALs, region start/end keys, …)

[Diagram: the Master coordinating two Region Servers through ZooKeeper; each Region Server hosts Regions with a WAL and Store Files (HFiles)]
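A minimal sketch of triggering a snapshot from the client, assuming an existing table named 'myTable' and the 0.94.6+/0.96 admin API; the Master then drives the prepare/commit phases described above.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class SnapshotExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      try {
        // Equivalent of: hbase> snapshot 'myTable', 'myTableSnapshot-130321'
        admin.snapshot("myTableSnapshot-130321", "myTable");
      } finally {
        admin.close();
      }
    }
  }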
  




19
What is a Snapshot?

•  “a Snapshot is not a copy of the table”
•  a Snapshot is a set of metadata information
   •  The table “schema” (column families and attributes)
   •  The Regions information (start key, end key, …)
   •  The list of Store Files
   •  The list of active WALs

[Diagram: the Master and two Region Servers with ZooKeeper; each Region Server hosts Regions with a WAL and Store Files (HFiles)]
  




20
Cloning a Table from a Snapshot

•  hbase> clone_snapshot ‘snapshotName’, ‘tableName’ … (Java equivalent below)
•  Creates a new table with the data “contained” in the snapshot
•  No data copies involved
   •  HFiles are immutable
   •  And shared between tables and snapshots
•  You can insert/update/remove data from the new table
   •  No repercussions on the snapshot, the original table or other cloned tables
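The Java-API equivalent, as a minimal sketch assuming the 0.94.6+/0.96 admin API; the snapshot and table names are illustrative:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class CloneSnapshotExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      try {
        // New table backed by the snapshot's HFiles: no data is copied
        admin.cloneSnapshot("myTableSnapshot-130321", "myTableClone");
      } finally {
        admin.close();
      }
    }
  }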
  




21
Compactions & Archiving

•  HFiles are immutable, and shared between tables and snapshots
•  On compaction or table deletion, files are removed from disk
•  If files are referenced by a snapshot or a cloned table
   •  The file is moved to an “archive” directory
   •  And deleted later, when there are no references to it
  




22
Compactions
Optimize the read-path
  




23
0.96 is coming up

•  Moving RPC to Protobuf
   •  Allows rolling upgrades with no surprises
•  HBase Snapshots
•  Pluggable Compactions
•  Remove -ROOT-
•  Table Locks
  




24
0.98 and Beyond

•  Transparent Table/Column-Family Encryption
•  Cell-level security
•  Multiple WALs per Region Server (MTTR)
•  Data Placement Awareness (MTTR)
•  Data Type Awareness
•  Compaction policies, based on the data needs
•  Managing blocks directly (instead of files)
  




25
Questions?
  




26
Thank you!
Matteo Bertozzi, @cloudera | @th30z
  

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

HBase Storage Internals

  • 1. HBase  Storage  Internals,  present  and  future!   Ma6eo  Bertozzi  |  @Cloudera    March  2013  -­‐  Hadoop  Summit  Europe   1
  • 2. What  is  HBase   •  Open  source  Storage  Manager  that  provides  random   read/write  on  top  of  HDFS   •  Provides  Tables  with  a  “Key:Column/Value”  interface   •  Dynamic  columns  (qualifiers),  no  schema  needed   •  “Fixed”  column  groups  (families)   •  table[row:family:column]  =  value   2
  • 3. HBase  EcoSystem   •  Apache  Hadoop  HDFS  for  data   durability  and  reliability  (Write-­‐Ahead   App   MR   Log)   •  Apache  ZooKeeper  for  distributed   coordina]on   ZK   HDFS   •  Apache  Hadoop  MapReduce  built-­‐in   support  for  running  MapReduce  jobs   3
  • 4. How  HBase  Works?   “View  from  10000c”   4
  • 5. Master,  Region  Servers  and  Regions   •  Region  Server   Client   •  Server  that  contains  a  set  of  Regions   ZooKeeper   •  Responsible  to  handle  reads  and  writes   •  Region   Master   •  The  basic  unit  of  scalability  in  HBase   •  Subset  of  the  table’s  data   Region  Server   Region  Server   Region  Server   •  Con]guous,  sorted  range  of  rows  stored   Region   Region   Region   together.   Region   Region   Region   •  Master   Region   Region   Region   •  Coordinates  the  HBase  Cluster   HDFS   •  Assignment/Balancing  of  the  Regions   •  Handles  admin  opera]ons   •  create/delete/modify  table,  …   5
  • 6. Autosharding and .META. table  • A Region is a Subset of the table's data  • When there is too much data in a Region… a split is triggered, creating 2 regions  • The association "Region -> Server" is stored in a System Table  • The Location of .META. is stored in ZooKeeper  • [Example .META. rows — Table / Start Key / Region ID / Region Server: testTable Key-00 1 machine01.host; testTable Key-31 2 machine03.host; testTable Key-65 3 machine02.host; testTable Key-83 4 machine01.host; …; users Key-AB 1 machine03.host; users Key-KG 2 machine02.host]
  • 7. The Write Path – Create a New Table  • The client asks the master to create a new Table  • hbase> create 'myTable', 'cf'  • The Master: stores the Table information ("schema"), creates Regions based on the key-splits provided (if no splits are provided, one single region by default), and assigns the Regions to the Region Servers  • The assignment Region -> Server is written to a system table called ".META."
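The shell one-liner above has a Java equivalent through the admin API. A minimal sketch, assuming the client classes of that era (HBaseAdmin, HTableDescriptor); the table name, family name and split keys are placeholders, not taken from the talk:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  public class CreateTableExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HBaseAdmin admin = new HBaseAdmin(conf);

          // Equivalent of: hbase> create 'myTable', 'cf'
          HTableDescriptor desc = new HTableDescriptor("myTable");
          desc.addFamily(new HColumnDescriptor("cf"));

          // Optional pre-split: with no split keys the Master creates one region.
          byte[][] splits = { Bytes.toBytes("Key-31"), Bytes.toBytes("Key-65") };
          admin.createTable(desc, splits);

          admin.close();
      }
  }

Passing split keys up front creates several regions immediately, so writes are spread across Region Servers from the start instead of waiting for splits to happen.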
  • 8. The Write Path – "Inserting" data  • table.put(row-key:family:column, value)  • The client asks ZooKeeper for the location of .META.  • The client scans .META. searching for the Region Server responsible for handling the Key  • The client asks the Region Server to insert/update/delete the specified key/value  • The Region Server processes the request and dispatches it to the Region responsible for handling the Key  • The operation is written to a Write-Ahead Log (WAL)  • …and the KeyValues are added to the Store: "MemStore"
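A minimal sketch of the same put from the Java client, using the pre-1.0 HTable/Put API that matches this talk's timeframe; row, family, column and value are placeholders. The .META. lookup described above is done by the client library, so the code only sees the table:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PutExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "myTable");

          // table[row:family:column] = value
          Put put = new Put(Bytes.toBytes("row-key"));
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("column"), Bytes.toBytes("value"));
          table.put(put);   // sent to the Region Server that owns this key

          table.close();
      }
  }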
  • 9. The Write Path – Append Only to Random R/W  • Files in HDFS are Append-Only and Immutable once closed  • HBase provides Random Writes? …not really from a storage point of view  • KeyValues are stored in memory (MemStore) and written to disk on pressure  • Don't worry, your data is safe in the WAL! (The Region Server can recover data from the WAL in case of crash)  • But this allows sorting data by Key before writing to disk  • Deletes are like Inserts but with a "remove me" flag
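A toy illustration of the MemStore idea described above, not HBase's actual implementation: edits are buffered in a key-sorted map and dumped as a sorted, immutable run once a size threshold is reached. Class name and threshold are made up for the example:

  import java.util.Map;
  import java.util.TreeMap;

  // Illustrative only: buffers edits sorted by key, "flushes" when too large.
  public class ToyMemStore {
      private final TreeMap<String, String> buffer = new TreeMap<>();
      private final int flushThreshold;

      public ToyMemStore(int flushThreshold) {
          this.flushThreshold = flushThreshold;
      }

      public void put(String key, String value) {
          buffer.put(key, value);              // the TreeMap keeps entries key-sorted
          if (buffer.size() >= flushThreshold) {
              flush();
          }
      }

      private void flush() {
          // In HBase the sorted entries become an immutable HFile on HDFS;
          // here we just print them in key order to show the effect.
          for (Map.Entry<String, String> e : buffer.entrySet()) {
              System.out.println(e.getKey() + " = " + e.getValue());
          }
          buffer.clear();
      }

      public static void main(String[] args) {
          ToyMemStore store = new ToyMemStore(3);
          store.put("Key2", "value 2");
          store.put("Key0", "value 0");
          store.put("Key1", "value 1");   // threshold reached: flushed in key order
      }
  }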
  • 10. The Read Path – "reading" data  • The client asks ZooKeeper for the location of .META.  • The client scans .META. searching for the Region Server responsible for handling the Key  • The client asks the Region Server to get the specified key/value  • The Region Server processes the request and dispatches it to the Region responsible for handling the Key  • MemStore and Store Files are scanned to find the key
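The corresponding read from the Java client, again a sketch with placeholder names against the pre-1.0 API:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class GetExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "myTable");

          Get get = new Get(Bytes.toBytes("row-key"));
          Result result = table.get(get);   // Region Server checks MemStore + Store Files
          byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("column"));
          System.out.println(value == null ? "not found" : Bytes.toString(value));

          table.close();
      }
  }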
  • 11. The Read Path – Append Only to Random R/W  • On each flush a new file is created  • Each file has KeyValues sorted by key  • Two or more files can contain the same key (updates/deletes)  • To find a Key you need to scan all the files  • …with some optimizations: filter files by Start/End Key, have a bloom filter on each file
  • 12. HFile – HBase Store File Format
  • 13. HFile format  • Only Sequential Writes, just append(key, value)  • Large Sequential Reads are better  • Why group records in blocks? Easy to split, easy to read, easy to cache, easy to index (if records are sorted)  • Block Compression (snappy, lz4, gz, …)  • [Layout: a sequence of blocks, each with a Header and Records, followed by block Indexes and a Trailer; each record is Key Length : int, Value Length : int, Key : byte[], Value : byte[]]
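A toy sketch of the record layout listed above, purely illustrative: records are appended sequentially as length-prefixed key/value pairs. The real HFile also writes block headers, indexes, a trailer and optional compression, all omitted here:

  import java.io.DataOutputStream;
  import java.io.FileOutputStream;
  import java.io.IOException;

  // Illustrative append-only writer: [key length][value length][key][value]
  public class ToyHFileWriter {
      public static void main(String[] args) throws IOException {
          try (DataOutputStream out =
                   new DataOutputStream(new FileOutputStream("records.bin"))) {
              append(out, "Key0".getBytes(), "value 0".getBytes());
              append(out, "Key1".getBytes(), "value 1".getBytes());
          }
      }

      static void append(DataOutputStream out, byte[] key, byte[] value)
              throws IOException {
          out.writeInt(key.length);    // Key Length : int
          out.writeInt(value.length);  // Value Length : int
          out.write(key);              // Key : byte[]
          out.write(value);            // Value : byte[]
      }
  }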
  • 14. Data Block Encoding  • "Be aware of the data"  • Block Encoding allows compressing the Key based on what we know  • Keys are sorted… the prefix may be similar in most cases  • One file contains keys from one Family only  • Timestamps are "similar", we can store the diff on-disk  • Type is "put" most of the time…  • [On-disk KeyValue layout: Row Length : short, Row : byte[], Family Length : byte, Family : byte[], Qualifier : byte[], Timestamp : long, Type : byte]
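A toy illustration of the prefix idea, not HBase's actual encoders: because keys are sorted, each key can be stored as the length of the prefix it shares with the previous key plus the remaining suffix. Key values below are made up:

  // Illustrative prefix encoding over sorted keys:
  // each key becomes (shared-prefix length, remaining suffix).
  public class ToyPrefixEncoder {
      public static void main(String[] args) {
          String prev = "";
          for (String key : new String[] {"row-0001", "row-0002", "row-0010"}) {
              int common = commonPrefixLength(prev, key);
              System.out.println(common + ":" + key.substring(common));
              prev = key;
          }
          // Prints 0:row-0001, 7:2, 6:10 -- most of each key is not repeated.
      }

      static int commonPrefixLength(String a, String b) {
          int n = Math.min(a.length(), b.length());
          int i = 0;
          while (i < n && a.charAt(i) == b.charAt(i)) i++;
          return i;
      }
  }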
  • 15. Compactions – Optimize the read-path
  • 16. Compactions  • Reduce the number of files to look into during a scan  • Remove duplicated keys (updated values)  • Remove deleted keys  • Create a new file by merging the content of 2+ files  • Remove the old files
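A toy sketch of what a major compaction produces, purely illustrative: newer values win, delete markers and the rows they cover are dropped, and a single new file replaces the old ones. A real compaction streams through the key-sorted files with a merge scanner rather than loading them into memory as done here:

  import java.util.Arrays;
  import java.util.List;
  import java.util.TreeMap;

  // Illustrative (major) compaction over key-sorted "files".
  public class ToyCompaction {
      static final String DELETED = "__deleted__";   // stand-in for a delete marker

      // 'files' is ordered from oldest to newest; each map is sorted by key.
      static TreeMap<String, String> compact(List<TreeMap<String, String>> files) {
          TreeMap<String, String> merged = new TreeMap<>();
          for (TreeMap<String, String> file : files) {
              merged.putAll(file);          // newer files overwrite older values
          }
          merged.values().removeIf(v -> v.equals(DELETED));  // purge deletes
          return merged;                    // would be written out as one new file
      }

      public static void main(String[] args) {
          TreeMap<String, String> older = new TreeMap<>();
          older.put("Key0", "value 0.0");
          older.put("Key5", "value 5.0");
          TreeMap<String, String> newer = new TreeMap<>();
          newer.put("Key0", "value 0.1");
          newer.put("Key5", DELETED);
          System.out.println(compact(Arrays.asList(older, newer)));
          // prints {Key0=value 0.1}: the update won, the deleted key is gone
      }
  }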
  • 17. Pluggable Compactions  • Try different algorithms  • Be aware of the data  • Time Series? I guess no updates from the 80s  • Be aware of the requests  • Compact based on statistics: which files are hot and which are not, which keys are hot and which are not
  • 18. Snapshots – Zero-copy snapshots and table clones
  • 19. How does taking a snapshot work?  • The master orchestrates the RSs  • the communication is done via ZooKeeper  • using a "2-phase commit like" transaction (prepare/commit)  • Each RS is responsible for taking its "piece" of the snapshot  • For each Region, store the metadata information needed (list of Store Files, WALs, region start/end keys, …)
  • 20. What is a Snapshot?  • "a Snapshot is not a copy of the table"  • a Snapshot is a set of metadata information  • The table "schema" (column families and attributes)  • The Regions information (start key, end key, …)  • The list of Store Files  • The list of active WALs
  • 21. Cloning a Table from a Snapshot  • hbase> clone_snapshot 'snapshotName', 'tableName'  • Creates a new table with the data "contained" in the snapshot  • No data copies involved  • HFiles are immutable, and shared between tables and snapshots  • You can insert/update/remove data from the new table  • No repercussions on the snapshot, the original table or other cloned tables
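Both taking a snapshot and cloning it are also exposed through the Java admin API. A minimal sketch with placeholder snapshot and table names; the first call corresponds to the shell's snapshot command, the second to the clone_snapshot command shown above:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class SnapshotExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HBaseAdmin admin = new HBaseAdmin(conf);

          // Take a snapshot of the table: only metadata is written, no data copy.
          admin.snapshot("snapshotName", "myTable");

          // Clone it into a new table; the HFiles are shared, not copied.
          admin.cloneSnapshot("snapshotName", "clonedTable");

          admin.close();
      }
  }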
  • 22. Compactions & Archiving  • HFiles are immutable, and shared between tables and snapshots  • On compaction or table deletion, files are removed from disk  • If a file is referenced by a snapshot or a cloned table, it is moved to an "archive" directory and deleted later, when there are no more references to it
  • 23. Compactions – Optimize the read-path
  • 24. 0.96 is coming up  • Moving RPC to Protobuf  • Allows rolling upgrades with no surprises  • HBase Snapshots  • Pluggable Compactions  • Remove -ROOT-  • Table Locks
  • 25. 0.98 and Beyond  • Transparent Table/Column-Family Encryption  • Cell-level security  • Multiple WALs per Region Server (MTTR)  • Data Placement Awareness (MTTR)  • Data Type Awareness  • Compaction policies based on the data needs  • Managing blocks directly (instead of files)
  • 27. Thank you!  Matteo Bertozzi, @cloudera  @th30z