2. Voldemort Intro
● Amazon Dynamo style NoSQL k-v store
○ get(k)
○ put(k,v)
○ getall(k1,k2,...)
○ delete(k)
● Tunable Consistency
● Highly Available
● Automatic Partitioning
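A minimal sketch of these operations against the native Java client (bootstrap URL and store name are placeholders; based on the public client API):

```java
import java.util.Arrays;
import java.util.Map;
import voldemort.client.*;
import voldemort.versioning.Versioned;

public class VoldemortHello {
    public static void main(String[] args) {
        // Bootstrap against any node in the cluster; URL/store name are placeholders
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("test-store");

        client.put("k1", "v1");                           // put(k,v)
        Versioned<String> v = client.get("k1");           // get(k): value + vector clock
        Map<String, Versioned<String>> batch =
                client.getAll(Arrays.asList("k1", "k2")); // getall(k1,k2,...)
        client.delete("k1");                              // delete(k)
    }
}
```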
3. Voldemort Intro
● Pluggable Storage
○ BDB-JE - Primary OLTP store
○ Read Only - Reliable serving layer for Hadoop datasets
○ MySQL - Good ol' MySQL without native replication
○ InMemory - Backed by Java ConcurrentHashMap
● Clients
○ Native Java Client
○ REST Coordinator Service
● Open source
● More at project-voldemort.com
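All of these engines plug in behind one storage contract; a simplified, hypothetical version of that interface (Voldemort's real voldemort.store.StorageEngine is richer: vector clocks, streaming iterators, truncation) might look like:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified storage contract; the real interface
// (voldemort.store.StorageEngine) also carries versioning and transforms.
public interface SimpleStorageEngine<K, V> {
    List<V> get(K key);                  // all live versions of a key
    void put(K key, V value);
    boolean delete(K key);
    Iterator<Map.Entry<K, V>> entries(); // full scan, used by batch jobs
    String getName();                    // store name, e.g. "test-store"
}
```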
5. Architecture
[Architecture diagram: client services embed the Native Java Client (or call through the Coordinator Service) and issue get()/put()/getall() against Servers 1-4, each backed by bdb; a routing table maps keys to partitions and replicas, e.g. "k1" → p1 → s1,s2 and "k2" → p2 → s3,s4.]
6. Consistent Hashing
▪ Consistent Hashing Idea
▪ Divide key space into partitions
– Partitions: A,B,C,…,H
– hash(key) mod # partitions = pkey
▪ Randomly map partitions to servers
▪ Locate servers from keys
– K1 => A => S1
– K2 => C => S3
[Ring diagram: partitions A-H around the ring, mapped to servers S1-S4; keys K1 and K2 hash onto the ring to find their partitions.]
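A toy sketch of this lookup (hash function, partition count, and the partition-to-server table are illustrative, mirroring the ring pictured above):

```java
import java.util.Arrays;

// Toy consistent-hashing lookup: 8 partitions (A..H) randomly mapped to 4 servers
public class RingLookup {
    static final String[] PARTITION_TO_SERVER =
            {"S1", "S2", "S3", "S4", "S1", "S3", "S2", "S4"};

    static int partitionOf(byte[] key) {
        // hash(key) mod # partitions = pkey
        return Math.floorMod(Arrays.hashCode(key), PARTITION_TO_SERVER.length);
    }

    static String serverFor(byte[] key) {
        return PARTITION_TO_SERVER[partitionOf(key)]; // e.g. K1 -> A -> S1
    }
}
```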
7. Consistent Hashing with Replication
▪ Replication factor (RF)
– how many replicas to have
▪ Replica selection
– Find the primary partition
– Walk the ring to create preference list
▪ Find RF-1 additional servers
▪ Skip servers already in list
▪ Examples: RF = 3
– K1: S1, S2, S3
– K2: S3, S1, S4
[Ring diagram: the same A-H ring; walking it clockwise yields K1 → [S1, S2, S3] and K2 → [S3, S1, S4].]
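The ring walk above, as a sketch (ring layout is illustrative; the real logic lives in Voldemort's routing strategy):

```java
import java.util.ArrayList;
import java.util.List;

// Preference-list construction: start at the key's primary partition and
// walk the ring clockwise, skipping servers already chosen, until RF
// distinct servers are found.
static List<String> preferenceList(String[] ring, int primaryPartition, int rf) {
    List<String> prefs = new ArrayList<>();
    for (int i = 0; i < ring.length && prefs.size() < rf; i++) {
        String server = ring[(primaryPartition + i) % ring.length];
        if (!prefs.contains(server))  // skip servers already in the list
            prefs.add(server);
    }
    return prefs;                     // e.g. K1 -> [S1, S2, S3] with RF = 3
}
```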
8. Zone Aware Replication
▪ Servers divided into zones
▪ Zone = Data Center
▪ Per zone replication factor
▪ Local zone vs. remote zones
– Local zone (LZ) is where client is
▪ Two zones example:
– LZ = 1
– Zone1: S1 S3; RF=2
– Zone2: S2 S4; RF=1
– Preference lists:
▪ K1: Z1: S1, S3; Z2: S2
▪ K2: Z1: S3, S1; Z2: S4
[Ring diagram: the same ring with servers split across zones; preference lists become K1 → [ Z1 [S1, S3], Z2 [S2] ] and K2 → [ Z1 [S3, S1], Z2 [S4] ].]
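A zone-aware sketch of the same walk: one quota per zone, so each zone independently collects its own replicas (zone assignment and quotas are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Zone-aware variant: walk the ring once, filling a per-zone replication
// quota so every zone ends up with its own preference list.
static Map<Integer, List<String>> zonedPreferenceList(
        String[] ring, Map<String, Integer> serverZone,
        Map<Integer, Integer> zoneRf, int primaryPartition) {
    Map<Integer, List<String>> prefs = new HashMap<>();
    zoneRf.keySet().forEach(z -> prefs.put(z, new ArrayList<>()));
    for (int i = 0; i < ring.length; i++) {
        String server = ring[(primaryPartition + i) % ring.length];
        int zone = serverZone.get(server);
        List<String> zoneList = prefs.get(zone);
        if (zoneList.size() < zoneRf.get(zone) && !zoneList.contains(server))
            zoneList.add(server);
    }
    return prefs;  // e.g. K1 -> { Z1: [S1, S3], Z2: [S2] }
}
```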
11. Voldemort @ LinkedIn
● 17% of all LinkedIn Services
○ embed a direct client
● Fast (95th percentile < 20ms) for almost all clients
12. Voldemort @ LinkedIn
● Front Facing
○ Search (Recruiter + Site)
○ People You May Know
○ inShare
○ Media thumbnails
○ Notifications
○ Endorsements
○ Skills
○ Frequency capping Ads
○ Custom Segments
○ Who Viewed Your Profile
○ People You Want to Hire
● Internal Services
○ Email cache
○ Email delivery stack
○ Recommendation Services
○ Personalization Services
○ Mobile Auth
● Not exhaustive!
15. Storage Layer Rewrite
Where We Wanted To Be
● Predictable online performance
● Scan jobs
○ Non Intrusive, Fast
● Elastic
○ Recover failed nodes in minutes
○ Add hardware overnight
16. Storage Layer Rewrite
Where We Really Were
1. GC Issues
a. Unpredictable GC Churn
b. Scan jobs cause Full GCs
2. Slow Scans (even on SSDs)
a. Daily Retention Job/Slop Pusher
b. Not Partition Aware
3. Memory Management
a. Zero control over a single store's share
4. Managing Multiple Versions
a. Lock Contention
b. Additional bdb-delete() cost during put()
5. Weaker Durability on Crash
a. Dirty Writes in heap
17. Storage Layer Rewrite
[Diagram: BDB-JE keeps the entire B+tree (index nodes down to leaves) in the BDB cache on the JVM, spilling to disk; server threads share the process with the BDB-Checkpointer and BDB-Cleaner.]
18. Storage Layer Rewrite
Multi-Tenant Example
[Diagram: a single JVM heap holds the BDB cache with Store A's, B's, C's, and D's B+trees side by side; server threads compete with a Cleaner and Checkpointer per store.]
19. Storage Layer Rewrite
Road To Recovery
● Move data off heap
○ Only Index sits on heap
● Cache Control to reduce scan impact
● Partition Aware Storage
○ Range scans to the rescue
● Dynamic Cache Partitioning
○ Control how much heap goes to a single store
● SSD Aware Optimizations
○ Checkpointing
○ Cache Policy
● Manage versions directly
○ Treat BDB as plain k-v store
20. Storage Layer Rewrite
Moving Data Off Heap
● Much improved GC
○ memory churn
○ promotions
● SSD Aware hit-the-disk design
● Strong Durability on Crash
○ No runaway heap of dirty writes
[Diagram: put(k,v) (1) updates the index, which alone stays on the JVM heap, and (2) writes the new leaf straight to SSD/page cache, replacing the old leaf off heap.]
21. Storage Layer Rewrite
Reducing Scan Impact
● Massive Cache Pollution
○ Throttling not an option
● Exercise cursor level control
● Sustained rates up to 30-40K/sec
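A sketch of that cursor-level control against the BDB-JE API (not Voldemort's actual scan code; lock mode and consumer are illustrative):

```java
import com.sleepycat.je.*;

// Evict each leaf right after reading it so a full scan does not wipe
// out the cache serving online traffic.
static long scanWithoutPollution(Database db) {
    DatabaseEntry key = new DatabaseEntry();
    DatabaseEntry value = new DatabaseEntry();
    long count = 0;
    Cursor cursor = db.openCursor(null, CursorConfig.READ_UNCOMMITTED);
    try {
        cursor.setCacheMode(CacheMode.EVICT_LN); // drop leaf nodes after use
        while (cursor.getNext(key, value, LockMode.READ_UNCOMMITTED)
                == OperationStatus.SUCCESS) {
            count++; // hand key/value to the retention job, slop pusher, etc.
        }
    } finally {
        cursor.close();
    }
    return count;
}
```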
22. Storage Layer Rewrite
Managing Versions Directly
● No more extra delete()
● No more separate duplicate tree
○ Much improved locking performance
● More compact storage
[Diagram: before, versions V1 and V2 hang off a duplicate subtree (BIN → DIN → DBIN); after, they are packed as a single value [V1,V2] under one BIN entry.]
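The "one value per key" layout can be pictured with a toy serialization (Voldemort's real on-disk format differs; this just shows why one BIN entry replaces a whole duplicate subtree):

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Illustrative packing of all versions under one key: [count][len,v][len,v]...
static byte[] packVersions(List<byte[]> versions) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bos);
    out.writeInt(versions.size());
    for (byte[] v : versions) {
        out.writeInt(v.length);
        out.write(v);
    }
    return bos.toByteArray(); // stored as a single BDB value -> one BIN entry
}

static List<byte[]> unpackVersions(byte[] packed) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed));
    int count = in.readInt();
    List<byte[]> versions = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
        byte[] v = new byte[in.readInt()];
        in.readFully(v);
        versions.add(v);
    }
    return versions;
}
```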
23. Storage Layer Rewrite
SSD Aware Optimizations
● Checkpoints on SSD
○ Age-old recovery time vs performance tradeoff
● Predictability
○ Level based policy
● Streaming Writes
○ Turn off checkpointer
● BDB5 Support
○ Much better compaction
○ Much less index metadata
[Chart: checkpointer interval vs recovery time]
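In BDB-JE terms, the knobs involved look roughly like this (values are illustrative, not LinkedIn's production settings):

```java
import com.sleepycat.je.EnvironmentConfig;

// A large checkpoint-bytes interval trades recovery time for runtime
// performance on SSD; the checkpointer can be switched off entirely
// during streaming writes.
static EnvironmentConfig ssdTunedConfig(boolean streamingWrites) {
    EnvironmentConfig config = new EnvironmentConfig();
    config.setConfigParam(
            EnvironmentConfig.CHECKPOINTER_BYTES_INTERVAL, "2147483648"); // 2 GB
    config.setConfigParam(
            EnvironmentConfig.ENV_RUN_CHECKPOINTER,
            streamingWrites ? "false" : "true"); // turn off checkpointer
    return config;
}
```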
25. Storage Layer Rewrite
Speed Up
[Chart: speed-up vs percentage of partitions scanned]
● Restore
○ 1 Day -> 1 hour
● Rebalancing
○ ~Week -> Hours
26. Storage Layer Rewrite
Dynamic Cache Partitioning
● Control share of heap per store
○ Dynamically add/reduce memory
○ Currently isolating bursty store
● Improve Capacity Model
○ More production validation?
○ Auto tuning mechanisms?
● Isolate at the JVM level?
○ Rethink deployment model
28. Cluster Expansion Rewrite
● Basis of scale-out philosophy
● Cluster Expansion
○ Add servers to existing cluster
● Zero downtime operation
● Transparent to client
○ Functionality
○ Mostly Performance too
29. Cluster Expansion Rewrite
Types Of Clusters
● Zoned Read Write
○ Zone = DataCenter
● Non Zoned
○ Read-Write
○ Read-Only (Hadoop BuildAndPush)
30. Expansion Example
[Diagram: two zones (Zone 1, Zone 2), each with Servers 1-3 holding primary partitions P1-P4 and secondary replicas S1-S4; a new Server 4 joins each zone and takes over some of the partitions and replicas.]
31. Expansion In Action 1: Change Cluster Topology
Cluster Expansion Rewrite
[Diagram: the Rebalance Controller pushes the new cluster topology (1) to Server 1, Server 2, and the New Server in both Zone 1 and Zone 2.]
32. Expansion In Action 2: Setup Proxy Bridges
Cluster Expansion Rewrite
[Diagram: with the topology changed, proxy bridges (2) are set up from the new servers back to the old owners in each zone.]
1. Change cluster topology
33. Expansion In Action 3: Client Picks Up New Topology
Cluster Expansion Rewrite
[Diagram: the Client (3) bootstraps the new topology while requests keep flowing over the proxy bridges.]
1. Change cluster topology
2. Proxy request based on old topology
34. Expansion In Action 4: Move Partitions
Cluster Expansion Rewrite
[Diagram: partitions move (4), via local moves within each zone and a cross DC move between zones, while clients and proxy bridges stay live.]
1. Change cluster topology
2. Proxy request based on old topology
3. Client picks up change
35. Expansion In Action
Cluster Expansion Rewrite
[Diagram: the full picture across Zone 1 and Zone 2: Rebalance Controller, proxy bridges, local moves, and a cross DC move.]
1. Change cluster topology
2. Setup proxy bridges
3. Client picks up new topology
4. Move partitions
36. Problems
Cluster Expansion Rewrite
● One Ring Spanning Data Centers
○ Cross datacenter data moves/proxies
● Not Safely Abortable
○ Additional cleanup/consolidation
● Cannot Add New Data Centers
● Opaque Planner Code
○ No special treatment of Zones
● Lack of tools
○ Skew Analysis
○ Repartitioning/Balancing Utility
37. Redesign: Zone N-ary Philosophy
● Given a partition P whose mapping has changed: the server holding the old nth replica of P in zone Z is the donor, the server holding the new nth replica of P in the same zone Z is the stealer, and the data move plus proxy bridge run between the two.
[Diagram: in each zone, Servers 1-3 plus the new Server 4; the donor (old nth replica of P in zone Z) feeds the stealer (new nth replica of P in zone Z) over a data move and proxy bridge.]
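A sketch of that rule (the topology interface is illustrative, not Voldemort's planner API): every changed (partition, zone, replica-rank) triple yields a donor/stealer pairing inside one zone.

```java
import java.util.ArrayList;
import java.util.List;

// nthReplica(p, z, n) answers which server holds the nth replica of
// partition p in zone z under a given topology.
interface Topology {
    String nthReplica(int partition, int zone, int n);
}

static List<String[]> planMoves(Topology oldT, Topology newT,
                                int partitions, int zones, int zoneRf) {
    List<String[]> moves = new ArrayList<>(); // {donor, stealer} pairs
    for (int p = 0; p < partitions; p++)
        for (int z = 0; z < zones; z++)
            for (int n = 0; n < zoneRf; n++) {
                String donor = oldT.nthReplica(p, z, n);
                String stealer = newT.nthReplica(p, z, n);
                // donor and stealer share zone z by construction, so no
                // data move or proxy bridge ever crosses a datacenter
                if (!donor.equals(stealer))
                    moves.add(new String[] { donor, stealer });
            }
    return moves;
}
```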
38. Redesign: Advantages
Cluster Expansion Rewrite
● Simple, yet powerful
● Feasible alternative to breaking the ring
○ Expensive to rewrite all of DR
● No more cross datacenter moves
● Aligns proxy bridges mechanism with planner logic
● Principle applied to
○ Abortable Rebalances
○ Zone Expansion
39. Abortable Rebalance
Cluster Expansion Rewrite
● Plans go wrong
● Introducing proxy puts
○ Safely roll back to old topology
● Avoid Data Loss & ad hoc repairs
● Double write load during rebalance
[Diagram: on put(k,v) the Stealer issues proxy-get(k) to the Donor, receiving vold; it does local-put(k,vold), then local-put(k,v), acks success, and sends proxy-put(k,v) back to the Donor.]
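The flow in the diagram, as a sketch (the Server interface and method names are illustrative, not Voldemort's server API):

```java
// Illustrative server operations; real Voldemort routes these through
// its storage and socket layers.
interface Server {
    byte[] get(byte[] key);
    void put(byte[] key, byte[] value);      // remote put
    void localPut(byte[] key, byte[] value); // write to local storage only
}

// Proxy-put flow on the stealer during rebalance: pull the old value from
// the donor first, apply old then new locally, and mirror the new write
// back to the donor so aborting the rebalance loses nothing.
static void putDuringRebalance(Server stealer, Server donor,
                               byte[] k, byte[] v) {
    byte[] vOld = donor.get(k);     // proxy-get(k) -> vold
    if (vOld != null)
        stealer.localPut(k, vOld);  // local-put(k, vold)
    stealer.localPut(k, v);         // local-put(k, v), then ack success
    donor.put(k, v);                // proxy-put(k, v): old topology stays current
}
```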
40. Zone Expansion
Cluster Expansion Rewrite
● Builds upon Zone N-ary idea
● Fetch data from an existing zone
● No proxy bridges
○ No donors in same zone
● Cannot read from new zone until complete
41. New Rebalance Utilities
Cluster Expansion Rewrite
● PartitionAnalysis
○ Determine skewness of a cluster
● Repartitioner
○ Improve partition balance
○ Greedy-Random swapping
● RebalancePlanner
○ Incorporate Zone N-Ary logic
○ Operational insights: storage overhead, probability that clients pick up the new metadata
● Rebalance Controller
○ Cleaner reimplementation based on new planner/scheduler
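The Repartitioner's greedy-random idea, sketched (the skew metric and partition weights are illustrative):

```java
import java.util.*;

// Greedy-random swapping: propose a random swap of one partition between
// two servers and keep it only if it lowers the skew metric.
static void repartition(Map<String, List<Integer>> owned,
                        Map<Integer, Double> weight,
                        int attempts, Random rnd) {
    List<String> servers = new ArrayList<>(owned.keySet());
    for (int i = 0; i < attempts; i++) {
        String a = servers.get(rnd.nextInt(servers.size()));
        String b = servers.get(rnd.nextInt(servers.size()));
        if (a.equals(b) || owned.get(a).isEmpty() || owned.get(b).isEmpty())
            continue;
        int ia = rnd.nextInt(owned.get(a).size());
        int ib = rnd.nextInt(owned.get(b).size());
        double before = skew(owned, weight);
        int pa = owned.get(a).get(ia), pb = owned.get(b).get(ib);
        owned.get(a).set(ia, pb);            // tentative swap
        owned.get(b).set(ib, pa);
        if (skew(owned, weight) >= before) { // keep improving swaps only
            owned.get(a).set(ia, pa);        // undo
            owned.get(b).set(ib, pb);
        }
    }
}

// Skew = max/min weighted load across servers; 1.0 means perfectly balanced.
static double skew(Map<String, List<Integer>> owned, Map<Integer, Double> weight) {
    double max = 0, min = Double.MAX_VALUE;
    for (List<Integer> parts : owned.values()) {
        double load = 0;
        for (int p : parts) load += weight.get(p);
        max = Math.max(max, load);
        min = Math.min(min, load);
    }
    return max / Math.max(min, 1e-9);
}
```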
42. Wins In Production
Cluster Expansion Rewrite
● 7 Zoned RW Clusters expanded into a new zone
○ Hiccups resolved overnight
○ Abortability is handy
● Small Details -> Big Difference
○ Proxy Pause period
○ Accurate Progress reporting
○ Proxy get/getall optimization