Voldemort : Prototype to Production
A Journey to 1M Operations/Sec
Voldemort Intro
●  Amazon Dynamo style NoSQL k-v store
○  get(k)
○  put(k,v)
○  getall(k1,k2,...)
○  delete(k)
●  Tunable Consistency
●  Highly Available
●  Automatic Partitioning
Voldemort Intro
●  Pluggable Storage
○  BDB-JE - Primary OLTP store
○  Read Only - Reliable serving layer for Hadoop datasets
○  MySQL - Good ol’ MySQL without native replication
○  InMemory - Backed by Java ConcurrentHashMap
●  Clients
○  Native Java Client
○  REST Coordinator Service
●  Open source
●  More at project-voldemort.com
Agenda
○ High Level Overview
○  Usage At LinkedIn
○  Storage Layer
○  Cluster Expansion
Architecture
[Diagram: client services reach Servers 1-4 either through the Native Java Client or through the Coordinator Service, using get(), put(), and getall(); each server stores its data in bdb]
Example routing (key, partition, servers):
“k1” p1 s1,s2
“k2” p2 s3,s4
“k3” p1 s1,s2
“k4” p2 s3,s4
Consistent Hashing
▪  Consistent Hashing Idea
▪  Divide key space into partitions
–  Partitions: A,B,C,…,H
–  hash(key) mod # partitions = pkey
▪  Randomly map partitions to servers
▪  Locate servers from keys
–  K1 => A => S1,
–  K2 => C => S3
[Figure: hash ring of partitions A-H, each assigned to one of the servers S1-S4, with keys K1 and K2 placed on the ring]
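The lookup on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Voldemort's code: the CRC32 hash and the partition-to-server table are assumptions, chosen only so that partition A lands on S1 and partition C on S3 as in the slide's examples.

```python
import zlib

PARTITIONS = ["A", "B", "C", "D", "E", "F", "G", "H"]

# Hypothetical assignment of partitions to servers (two partitions per server).
PARTITION_TO_SERVER = {
    "A": "S1", "B": "S2", "C": "S3", "D": "S1",
    "E": "S4", "F": "S3", "G": "S2", "H": "S4",
}


def partition_for(key: str) -> str:
    """hash(key) mod #partitions, using CRC32 as a stand-in hash function."""
    h = zlib.crc32(key.encode("utf-8"))
    return PARTITIONS[h % len(PARTITIONS)]


def server_for(key: str) -> str:
    """Locate the server that owns the key's partition."""
    return PARTITION_TO_SERVER[partition_for(key)]
```

Because keys map to a fixed set of partitions rather than directly to servers, moving a partition to another server only changes one table entry; no key needs to be rehashed.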
Voldemort Intro
Consistent Hashing with Replication
▪  Replication factor (RF)
–  how many replicas to have
▪  Replica selection
–  Find the primary partition
–  Walk the ring to create preference list
▪  Find RF-1 additional servers
▪  Skip servers already in list
▪  Examples: RF = 3
–  K1: S1, S2, S3
–  K2: S3,S1,S4
[Figure: the same ring of partitions A-H on servers S1-S4, annotated with the preference lists K1 [S1, S2, S3] and K2 [S3, S1, S4]]
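The ring walk described above can be sketched as follows. The ring layout is an assumption (the exact layout cannot be read from the figure); it was picked so that the walk reproduces the slide's two examples, with K1 in partition A and K2 in partition C.

```python
# Hypothetical clockwise ring: (partition, owning server).
RING = [
    ("A", "S1"), ("B", "S2"), ("C", "S3"), ("D", "S1"),
    ("E", "S4"), ("F", "S3"), ("G", "S2"), ("H", "S4"),
]


def preference_list(primary_partition: str, replication_factor: int) -> list:
    """Walk the ring from the primary partition, collecting distinct servers."""
    names = [partition for partition, _ in RING]
    start = names.index(primary_partition)
    servers = []
    for step in range(len(RING)):
        _, server = RING[(start + step) % len(RING)]
        if server not in servers:  # skip servers already in the list
            servers.append(server)
        if len(servers) == replication_factor:
            break
    return servers
```

With this layout, `preference_list("A", 3)` gives S1, S2, S3 and `preference_list("C", 3)` gives S3, S1, S4, matching the RF = 3 examples.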
Voldemort Intro
Zone Aware Replication
▪  Servers divided into zones
▪  Zone = Data Center
▪  Per zone replication factor
▪  Local zone vs. remote zones
–  Local zone (LZ) is where client is
▪  Two zones example:
–  LZ = 1
–  Zone1: S1 S3; RF=2
–  Zone2: S2 S4; RF=1
–  Preference lists:
▪  K1: Z1: S1, S3; Z2: S2
▪  K2: Z1: S3, S1; Z2: S4
[Figure: the same ring of partitions A-H on servers S1-S4, annotated with the zoned preference lists K1 [Z1 [S1, S3], Z2 [S2]] and K2 [Z1 [S3, S1], Z2 [S4]]]
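A zone-aware variant of the ring walk might look like the sketch below. The ring layout is again an assumption chosen to reproduce the slide's example; the zone membership and per-zone replication factors are taken from the two-zone example.

```python
# Hypothetical clockwise ring: (partition, owning server).
RING = [
    ("A", "S1"), ("B", "S2"), ("C", "S3"), ("D", "S1"),
    ("E", "S4"), ("F", "S3"), ("G", "S2"), ("H", "S4"),
]
ZONE_OF = {"S1": 1, "S3": 1, "S2": 2, "S4": 2}  # Zone1: S1 S3, Zone2: S2 S4
ZONE_RF = {1: 2, 2: 1}                          # per-zone replication factor


def zoned_preference_list(primary_partition: str) -> dict:
    """Walk the ring once, filling each zone's replica list up to its RF."""
    names = [partition for partition, _ in RING]
    start = names.index(primary_partition)
    per_zone = {zone: [] for zone in ZONE_RF}
    for step in range(len(RING)):
        _, server = RING[(start + step) % len(RING)]
        zone = ZONE_OF[server]
        replicas = per_zone[zone]
        if server not in replicas and len(replicas) < ZONE_RF[zone]:
            replicas.append(server)
    return per_zone
```

For K1 (partition A) this yields Z1: S1, S3 and Z2: S2; for K2 (partition C) it yields Z1: S3, S1 and Z2: S4. A client in zone 1 would try the Z1 entries first.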
Voldemort Intro
Voldemort @ LinkedIn
●  385 Stores
○  238 R-O Stores
○  147 R-W Stores
●  3 Zones
●  14 Clusters
●  ~200 TB
●  ~750 Servers
Voldemort @ LinkedIn
●  ~1M storage ops/s
○  22% R-O
○  78% R-W
Voldemort @ LinkedIn
●  17% of all LinkedIn Services
○  embed a direct client
●  Fast (95th percentile < 20ms) for almost all clients
Voldemort @ LinkedIn
●  Front Facing
○  Search (Recruiter + Site)
○  People You May Know
○  inShare
○  Media thumbnails
○  Notifications
○  Endorsements
○  Skills
○  Frequency capping Ads
○  Custom Segments
○  Who Viewed Your Profile
○  People You Want to Hire
●  Internal Services
○  Email cache
○  Email delivery stack
○  Recommendation Services
○  Personalization Services
○  Mobile Auth
●  Not exhaustive!
Growth Since 2011
[Chart: data not recoverable from the extracted text]
Storage Layer
●  Berkeley DB Java Edition
○  Embedded
○  100% Java
○  ACID compliant
○  Log structured
●  Voldemort uses
○  Vanilla k-v apis
○  Cursors for scans
Storage Layer Rewrite
Where We Wanted To Be
●  Predictable online performance
●  Scan jobs
○  Non Intrusive, Fast
●  Elastic
○  Recover failed nodes in minutes
○  Add hardware overnight
Storage Layer Rewrite
Where We Really Were
1.  GC Issues
a.  Unpredictable GC Churn
b.  Scan jobs cause Full GCs
2.  Slow Scans (even on SSDs)
a.  Daily Retention Job/Slop Pusher
b.  Not Partition Aware
3.  Memory Management
a.  No control over a single store’s share
4.  Managing Multiple Versions
a.  Lock Contention
b.  Additional bdb-delete() cost during put()
5.  Weaker Durability on Crash
a.  Dirty Writes in heap
Storage Layer Rewrite
[Diagram: BDB-JE. The BDB cache on the JVM holds index and leaf nodes, with the disk below it; the server thread, the BDB-Checkpointer, and the BDB-Cleaner all work against this cache]
Storage Layer Rewrite
Multi-Tenant Example
[Diagram: the BDB cache on the JVM heap holds the B+Trees of Stores A, B, C, and D; server threads share it with each store's cleaner and checkpointer threads]
Storage Layer Rewrite
Road To Recovery
●  Move data off heap
○  Only Index sits on heap
●  Cache Control to reduce scan impact
●  Partition Aware Storage
○  Range scans to the rescue
●  Dynamic Cache Partitioning
○  Control how much heap goes to a single store
●  SSD Aware Optimizations
○  Checkpointing
○  Cache Policy
●  Manage versions directly
○  Treat BDB as plain k-v store
Storage Layer Rewrite
Moving Data Off Heap
●  Much improved GC
○  memory churn
○  promotions
●  SSD Aware hit-the-disk design
●  Strong Durability on Crash
○  Runaway heap
[Diagram: put(k,v) in two numbered steps; the Index sits on the JVM heap, while the old and new leaf nodes sit in the SSD/page cache]
Storage Layer Rewrite
Reducing Scan Impact
●  Massive Cache Pollution
○  Throttling not an option
●  Exercise cursor level control
●  Sustained rates up to 30-40K/sec
Storage Layer Rewrite
Managing Versions Directly
●  No more extra delete()
●  No more separate duplicate tree
○  Much improved locking performance
●  More compact storage
[Diagram: before, BIN -> DIN -> DBIN -> V1, V2 (separate duplicate tree); after, BIN -> V1,V2 (all versions stored together)]
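The idea of keeping every version of a key in one stored value can be sketched as below. This is an illustration under assumptions, not the actual storage engine: `VersionedStore` and `dominates` are hypothetical names, a plain dict stands in for BDB, and versions are tagged with vector clocks represented as node-to-counter dicts.

```python
def dominates(a: dict, b: dict) -> bool:
    """True if clock a has seen everything clock b has (a >= b on every node)."""
    return all(a.get(node, 0) >= count for node, count in b.items())


class VersionedStore:
    """Plain k-v store whose value for a key is the list of all live versions."""

    def __init__(self):
        self._data = {}  # key -> list of (clock, value)

    def get(self, key):
        return list(self._data.get(key, []))

    def put(self, key, clock: dict, value):
        versions = self._data.get(key, [])
        # Reject a write that an existing version already supersedes or equals.
        if any(dominates(old, clock) for old, _ in versions):
            raise ValueError("obsolete version")
        # Drop versions the new write supersedes; keep concurrent ones.
        survivors = [(old, v) for old, v in versions if not dominates(clock, old)]
        survivors.append((dict(clock), value))
        self._data[key] = survivors  # one write, no separate delete
```

Superseded versions disappear as part of the same write that adds the new one, which is the point of the "no more extra delete()" bullet.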
Storage Layer Rewrite
SSD Aware Optimizations
●  Checkpoints on SSD
○  Age-old recovery time vs performance tradeoff
●  Predictability
○  Level based policy
●  Streaming Writes
○  Turn off checkpointer
●  BDB5 Support
○  Much better compaction
○  Much less index metadata
[Chart: Checkpointer Interval vs Recovery Time]
Storage Layer Rewrite
Partition Aware Storage
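[Diagram: stored key changes from “Key” to Partition-id + “Key”. Before: subtrees under the root hold k1-k4, k5-k8, k9-k12, and k13-k16, ordered by key alone. After: the root splits into a P0 subtree and a P1 subtree, one holding k1, k3, ..., k15 and the other k2, k4, ..., k16, so each partition's keys are contiguous]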
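Why the prefix helps can be shown with a small sketch. Everything here is an assumption for illustration: a sorted Python list stands in for the B+Tree, `SortedStore` is a hypothetical name, and the 2-byte big-endian partition prefix is an invented encoding (partition ids are assumed to be below 65535).

```python
import bisect


def storage_key(partition_id: int, key: bytes) -> bytes:
    """Prefix the key with its partition id so keys sort by partition first."""
    return partition_id.to_bytes(2, "big") + key


class SortedStore:
    """Toy ordered store: a sorted key list plus a dict of values."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, partition_id: int, key: bytes, value) -> None:
        sk = storage_key(partition_id, key)
        if sk not in self._values:
            bisect.insort(self._keys, sk)
        self._values[sk] = value

    def scan_partition(self, partition_id: int):
        """Range scan over one partition; other partitions are never touched."""
        lo = partition_id.to_bytes(2, "big")
        hi = (partition_id + 1).to_bytes(2, "big")
        start = bisect.bisect_left(self._keys, lo)
        end = bisect.bisect_left(self._keys, hi)
        for sk in self._keys[start:end]:
            yield sk[2:], self._values[sk]
```

A job that needs one partition, such as a restore or a rebalance move, reads a single contiguous range instead of filtering a scan of the whole store.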
Storage Layer Rewrite
Speed Up
[Chart: speed up vs. percentage of partitions scanned]
●  Restore
○  1 Day -> 1 hour
●  Rebalancing
○  ~Week -> Hours
Storage Layer Rewrite
Dynamic Cache Partitioning
●  Control share of heap per store
○  Dynamically add/reduce memory
○  Currently isolating bursty store
●  Improve Capacity Model
○  More production validation?
○  Auto tuning mechanisms?
●  Isolate at the JVM level?
○  Rethink deployment model
Storage Layer Rewrite
Wins In Production
Rein in GC
Storage Latency Way Down
Cluster Expansion Rewrite
●  Basis of scale-out philosophy
●  Cluster Expansion
○  Add servers to existing cluster
●  0 Downtime operation
●  Transparent to client
○  Functionality
○  Mostly Performance too
Cluster Expansion Rewrite
Types Of Clusters
●  Zoned Read Write
○  Zone = DataCenter
●  Non Zoned
○  Read-Write
○  Read-Only (Hadoop BuildAndPush)
[Diagram: a zoned cluster with servers grouped into Zone 1 and Zone 2, next to non-zoned clusters made of Servers 1-3]
Expansion Example
[Diagram: partition replicas (P1-P4, S1-S4) laid out across the existing servers, shown before and after a new Server 4 is added; some replicas move onto the new server]
Expansion In Action 1: Change Cluster Topology
Cluster Expansion Rewrite
[Diagram: Zone 1 and Zone 2 each hold Server 1, Server 2, and a new server; the Rebalance Controller performs step 1]
Expansion In Action 2: Setup Proxy Bridges
Cluster Expansion Rewrite
1. Change cluster topology
[Diagram: same two zones and servers; after step 1 the Rebalance Controller sets up a Proxy Bridge as step 2]
Expansion In Action 3: Client Picks Up New Topology
Cluster Expansion Rewrite
1. Change cluster topology
2. Proxy request based on old topology
[Diagram: same two zones, servers, Rebalance Controller, and Proxy Bridge; a Client now appears and picks up the new topology as step 3]
Expansion In Action 4: Move Partitions
Cluster Expansion Rewrite
1. Change cluster topology
2. Proxy request based on old topology
3. Client picks up change
[Diagram: same two zones, servers, Rebalance Controller, Proxy Bridge, and Clients; step 4 appears twice, once as a Local Move and once as a Cross DC Move]
Expansion In Action
Cluster Expansion Rewrite
1. Change cluster topology
2. Proxy request based on old topology
3. Client picks up change
4. Move partitions
[Diagram: the complete sequence across Zone 1 and Zone 2, with the Rebalance Controller, Proxy Bridge, Clients, a Local Move, and a Cross DC Move]
Problems
Cluster Expansion Rewrite
●  One Ring Spanning Data Centers
○  Cross datacenter data moves/proxies
●  Not Safely Abortable
○  Additional cleanup/consolidation
●  Cannot Add New Data Centers
●  Opaque Planner Code
○  No special treatment of Zones
●  Lack of tools
○  Skew Analysis
○  Repartitioning/Balancing Utility
Redesign: Zone N-ary Philosophy
●  Given a partition P, whose mapping has changed
○  Donor: old nth replica of P in zone Z
○  Stealer: new nth replica of P in zone Z
○  The data move and the proxy bridge both run between donor and stealer
[Diagram: Zone 1 and Zone 2, each with Servers 1-3 and a new Server 4; partition replicas P1 and S1 shown on the servers]
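The donor/stealer pairing can be sketched as a comparison of replica lists before and after the topology change. This is a simplified illustration, not the planner's code: `plan_moves` and the input shape are hypothetical, and server "S5" in the usage note is an invented new server.

```python
def plan_moves(old_replicas: dict, new_replicas: dict) -> list:
    """Pair the old and new nth replica of each partition within each zone.

    Both inputs map (partition, zone) to the ordered list of servers holding
    that partition's replicas in that zone. Returns a list of
    (partition, zone, n, donor, stealer) tuples.
    """
    moves = []
    for (partition, zone), new_list in sorted(new_replicas.items()):
        old_list = old_replicas.get((partition, zone), [])
        for n, stealer in enumerate(new_list):
            if n >= len(old_list):
                continue  # no same-zone donor for this position
            donor = old_list[n]
            if donor != stealer:
                # Donor and stealer share a zone, so the move stays local.
                moves.append((partition, zone, n, donor, stealer))
    return moves
```

For example, if the zone 1 replicas of P1 change from S1, S3 to S1, S5, the only move is from donor S3 to stealer S5 for n = 1, and nothing crosses a data center boundary.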
Redesign: Advantages
Cluster Expansion Rewrite
●  Simple, yet powerful
●  Feasible alternative to breaking the ring
○  Expensive to rewrite all of DR
●  No more cross datacenter moves
●  Aligns proxy bridges mechanism with planner logic
●  Principally applied
○  Abortable Rebalances
○  Zone Expansion
Abortable Rebalance
Cluster Expansion Rewrite
●  Plans go wrong
●  Introducing proxy puts
○  Safely rollback to old topology
●  Avoid Data Loss & adhoc repairs
●  Double write load during rebalance
[Diagram: sequence between Stealer and Donor. put(k,v) arrives at the stealer; proxy-get(k) returns vold from the donor; local-put(k,vold); local-put(k,v); Success; proxy-put(k,v) to the donor]
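The sequence in the diagram can be sketched as follows. It is a toy model under assumptions: `Node` and `put_during_rebalance` are hypothetical names, versions are ignored (a put simply overwrites), and the proxy-put runs synchronously here although the real system may do it differently.

```python
class Node:
    """Toy storage node with a local dict as its store."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}

    def local_get(self, key):
        return self.data.get(key)

    def local_put(self, key, value) -> None:
        self.data[key] = value


def put_during_rebalance(stealer: Node, donor: Node, key, value) -> bool:
    """Handle a client put that arrives at the stealer mid-rebalance."""
    # proxy-get: pull the donor's current value before writing locally.
    v_old = donor.local_get(key)
    if v_old is not None:
        stealer.local_put(key, v_old)
    # Apply the client's write on the stealer.
    stealer.local_put(key, value)
    # proxy-put: mirror the write to the donor, so aborting the rebalance
    # and falling back to the old topology loses no data.
    donor.local_put(key, value)
    return True
```

Every write during the rebalance lands on both nodes, which is the "double write load" the slide mentions and the price paid for being able to abort safely.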
Zone Expansion
Cluster Expansion Rewrite
●  Builds upon Zone N-ary idea
●  Fetch data from an existing zone
●  No proxy bridges
○  No donors in same zone
●  Cannot read from new zone until complete
New Rebalance Utilities
Cluster Expansion Rewrite
●  PartitionAnalysis
○  Determine skewness of a cluster
●  Repartitioner
○  Improve partition balance
○  Greedy-Random swapping
●  RebalancePlanner
○  Incorporate Zone N-Ary logic
○  Operational Insights: storage overhead, probability client will pick up new metadata
●  Rebalance Controller
○  Cleaner reimplementation based on new planner/scheduler
Wins In Production
Cluster Expansion Rewrite
●  7 Zoned RW Clusters expanded into new zone
○  Hiccups resolved overnight
○  Abortability is handy
●  Small Details -> Big Difference
○  Proxy Pause period
○  Accurate Progress reporting
○  Proxy get/getall optimization

Contenu connexe

Tendances

Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergAnant Corporation
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsDatabricks
 
RedisConf18 - Redis at LINE - 25 Billion Messages Per Day
RedisConf18 - Redis at LINE - 25 Billion Messages Per DayRedisConf18 - Redis at LINE - 25 Billion Messages Per Day
RedisConf18 - Redis at LINE - 25 Billion Messages Per DayRedis Labs
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...DataStax
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Databricks
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge GraphsJeff Z. Pan
 
AWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAdrian Cockcroft
 
Demo Showcase: Graphs for Cybersecurity in Action
Demo Showcase: Graphs for Cybersecurity in ActionDemo Showcase: Graphs for Cybersecurity in Action
Demo Showcase: Graphs for Cybersecurity in ActionNeo4j
 
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...HostedbyConfluent
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Modernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data PipelinesModernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data PipelinesCarole Gunst
 

Tendances (20)

Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested ColumnsMaterialized Column: An Efficient Way to Optimize Queries on Nested Columns
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
 
RedisConf18 - Redis at LINE - 25 Billion Messages Per Day
RedisConf18 - Redis at LINE - 25 Billion Messages Per DayRedisConf18 - Redis at LINE - 25 Billion Messages Per Day
RedisConf18 - Redis at LINE - 25 Billion Messages Per Day
 
Spark Tips & Tricks
Spark Tips & TricksSpark Tips & Tricks
Spark Tips & Tricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
Netflix Recommendations Using Spark + Cassandra (Prasanna Padmanabhan & Roopa...
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge Graphs
 
AWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at NetflixAWS Re:Invent - High Availability Architecture at Netflix
AWS Re:Invent - High Availability Architecture at Netflix
 
Demo Showcase: Graphs for Cybersecurity in Action
Demo Showcase: Graphs for Cybersecurity in ActionDemo Showcase: Graphs for Cybersecurity in Action
Demo Showcase: Graphs for Cybersecurity in Action
 
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
The Flux Capacitor of Kafka Streams and ksqlDB (Matthias J. Sax, Confluent) K...
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Modernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data PipelinesModernize & Automate Analytics Data Pipelines
Modernize & Automate Analytics Data Pipelines
 

En vedette

Composing and Executing Parallel Data Flow Graphs wth Shell Pipes
Composing and Executing Parallel Data Flow Graphs wth Shell PipesComposing and Executing Parallel Data Flow Graphs wth Shell Pipes
Composing and Executing Parallel Data Flow Graphs wth Shell PipesVinoth Chandar
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Vinoth Chandar
 
Introducción a Voldemort - Innova4j
Introducción a Voldemort - Innova4jIntroducción a Voldemort - Innova4j
Introducción a Voldemort - Innova4jInnova4j
 
Voldemort on Solid State Drives
Voldemort on Solid State DrivesVoldemort on Solid State Drives
Voldemort on Solid State DrivesVinoth Chandar
 
The inherent complexity of stream processing
The inherent complexity of stream processingThe inherent complexity of stream processing
The inherent complexity of stream processingnathanmarz
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016StampedeCon
 
Big data @ uber vu (1)
Big data @ uber vu (1)Big data @ uber vu (1)
Big data @ uber vu (1)Mihnea Giurgea
 
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in ClustersNode Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in ClustersAhsan Javed Awan
 
Distributed Hash Table and Consistent Hashing
Distributed Hash Table and Consistent HashingDistributed Hash Table and Consistent Hashing
Distributed Hash Table and Consistent HashingCloudFundoo
 
Project Voldemort: Big data loading
Project Voldemort: Big data loadingProject Voldemort: Big data loading
Project Voldemort: Big data loadingDan Harvey
 
Story 06
Story 06Story 06
Story 06JooWan
 
Data Mining with R CH1 요약
Data Mining with R CH1 요약Data Mining with R CH1 요약
Data Mining with R CH1 요약Sung Yub Kim
 
Real time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid CloudReal time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid CloudNeeraj Sabharwal
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APICarol McDonald
 

En vedette (20)

Composing and Executing Parallel Data Flow Graphs wth Shell Pipes
Composing and Executing Parallel Data Flow Graphs wth Shell PipesComposing and Executing Parallel Data Flow Graphs wth Shell Pipes
Composing and Executing Parallel Data Flow Graphs wth Shell Pipes
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived
 
Introducción a Voldemort - Innova4j
Introducción a Voldemort - Innova4jIntroducción a Voldemort - Innova4j
Introducción a Voldemort - Innova4j
 
Bluetube
BluetubeBluetube
Bluetube
 
Voldemort on Solid State Drives
Voldemort on Solid State DrivesVoldemort on Solid State Drives
Voldemort on Solid State Drives
 
Project Voldemort
Project VoldemortProject Voldemort
Project Voldemort
 
Project Voldemort
Project VoldemortProject Voldemort
Project Voldemort
 
The inherent complexity of stream processing
The inherent complexity of stream processingThe inherent complexity of stream processing
The inherent complexity of stream processing
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Big data @ uber vu (1)
Big data @ uber vu (1)Big data @ uber vu (1)
Big data @ uber vu (1)
 
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in ClustersNode Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
 
Distributed Hash Table and Consistent Hashing
Distributed Hash Table and Consistent HashingDistributed Hash Table and Consistent Hashing
Distributed Hash Table and Consistent Hashing
 
Project Voldemort: Big data loading
Project Voldemort: Big data loadingProject Voldemort: Big data loading
Project Voldemort: Big data loading
 
ASPgems - kappa architecture
ASPgems - kappa architectureASPgems - kappa architecture
ASPgems - kappa architecture
 
Story 06
Story 06Story 06
Story 06
 
Data Mining with R CH1 요약
Data Mining with R CH1 요약Data Mining with R CH1 요약
Data Mining with R CH1 요약
 
Real time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid CloudReal time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid Cloud
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
 
UBER Data Mining
UBER Data MiningUBER Data Mining
UBER Data Mining
 

Similaire à Voldemort : Prototype to Production

Scale Relational Database with NewSQL
Scale Relational Database with NewSQLScale Relational Database with NewSQL
Scale Relational Database with NewSQLPingCAP
 
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)Rick Hwang
 
Build Dynamic DNS server from scratch in C (Part1)
Build Dynamic DNS server from scratch in C (Part1)Build Dynamic DNS server from scratch in C (Part1)
Build Dynamic DNS server from scratch in C (Part1)Yen-Kuan Wu
 
TiDB vs Aurora.pdf
TiDB vs Aurora.pdfTiDB vs Aurora.pdf
TiDB vs Aurora.pdfssuser3fb50b
 
5 levels of high availability from multi instance to hybrid cloud
5 levels of high availability  from multi instance to hybrid cloud5 levels of high availability  from multi instance to hybrid cloud
5 levels of high availability from multi instance to hybrid cloudRafał Leszko
 
5 Levels of High Availability: From Multi-instance to Hybrid Cloud
5 Levels of High Availability: From Multi-instance to Hybrid Cloud5 Levels of High Availability: From Multi-instance to Hybrid Cloud
5 Levels of High Availability: From Multi-instance to Hybrid CloudRafał Leszko
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerMichael Spector
 
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022StreamNative
 
Netflix - Realtime Impression Store
Netflix - Realtime Impression Store Netflix - Realtime Impression Store
Netflix - Realtime Impression Store Nitin S
 
To Serverless and Beyond
To Serverless and BeyondTo Serverless and Beyond
To Serverless and BeyondScyllaDB
 
How to build TiDB
How to build TiDBHow to build TiDB
How to build TiDBPingCAP
 
Orchestrating Cassandra with Kubernetes: Challenges and Opportunities
Orchestrating Cassandra with Kubernetes: Challenges and OpportunitiesOrchestrating Cassandra with Kubernetes: Challenges and Opportunities
Orchestrating Cassandra with Kubernetes: Challenges and OpportunitiesRaghavendra Prabhu
 
Logs @ OVHcloud
Logs @ OVHcloudLogs @ OVHcloud
Logs @ OVHcloudOVHcloud
 
Raft Engine Meetup 220702.pdf
Raft Engine Meetup 220702.pdfRaft Engine Meetup 220702.pdf
Raft Engine Meetup 220702.pdffengxun
 
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...
Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Sha...Lucidworks
 
Large Scale Vandalism Detection in Knowledge Bases: PyData Berlin 2017
Large Scale Vandalism Detection in Knowledge Bases: PyData Berlin 2017Large Scale Vandalism Detection in Knowledge Bases: PyData Berlin 2017
Large Scale Vandalism Detection in Knowledge Bases: PyData Berlin 2017Alexey Grigorev
 
Large scale overlay networks with ovn: problems and solutions
Large scale overlay networks with ovn: problems and solutionsLarge scale overlay networks with ovn: problems and solutions
Large scale overlay networks with ovn: problems and solutionsHan Zhou
 

Similaire à Voldemort : Prototype to Production (20)

Scale Relational Database with NewSQL
Scale Relational Database with NewSQLScale Relational Database with NewSQL
Scale Relational Database with NewSQL
 
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
Study Notes - Architecting for the cloud (AWS Best Practices, Feb 2016)
 
Build Dynamic DNS server from scratch in C (Part1)
Build Dynamic DNS server from scratch in C (Part1)Build Dynamic DNS server from scratch in C (Part1)
Build Dynamic DNS server from scratch in C (Part1)
 
TiDB vs Aurora.pdf
TiDB vs Aurora.pdfTiDB vs Aurora.pdf
TiDB vs Aurora.pdf
 
5 levels of high availability from multi instance to hybrid cloud
5 levels of high availability  from multi instance to hybrid cloud5 levels of high availability  from multi instance to hybrid cloud
5 levels of high availability from multi instance to hybrid cloud
 
5 Levels of High Availability: From Multi-instance to Hybrid Cloud
5 Levels of High Availability: From Multi-instance to Hybrid Cloud5 Levels of High Availability: From Multi-instance to Hybrid Cloud
5 Levels of High Availability: From Multi-instance to Hybrid Cloud
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 

Voldemort: Prototype to Production

  • 1. Recruiting Solutions. Voldemort: Prototype to Production. A Journey to 1M Operations/Sec
  • 2. Voldemort Intro ●  Amazon Dynamo style NoSQL k-v store ○  get(k) ○  put(k,v) ○  getall(k1,k2,...) ○  delete(k) ●  Tunable Consistency ●  Highly Available ●  Automatic Partitioning
  • 3. Voldemort Intro ●  Pluggable Storage ○  BDB-JE - Primary OLTP store ○  Read Only - Reliable serving layer for Hadoop datasets ○  MySQL - Good ‘ol MySQL without native replication ○  InMemory - Backed by Java ConcurrentHashMap ●  Clients ○  Native Java Client ○  REST Coordinator Service ●  Open source ●  More at project-voldemort.com
  • 4. Agenda ○ High Level Overview ○  Usage At LinkedIn ○  Storage Layer ○  Cluster Expansion
  • 5. Architecture [Diagram: client services use either the Native Java Client or the Coordinator Service to issue get(), put() and getall() against Servers 1 to 4, each backed by bdb. Routing table: “k1” => p1 => s1,s2; “k2” => p2 => s3,s4; “k3” => p1 => s1,s2; “k4” => p2 => s3,s4]
  • 6. Consistent Hashing ▪  Consistent Hashing Idea ▪  Divide key space into partitions –  Partitions: A,B,C,…,H –  hash(key) mod # partitions = pkey ▪  Randomly map partitions to servers ▪  Locate servers from keys –  K1 => A => S1 –  K2 => C => S3 [Diagram: ring of partitions A to H, each assigned to one of servers S1 to S4, with keys K1 and K2 placed on the ring]
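The lookup on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Voldemort's implementation: the MD5-based hash, the function names, and the partition-to-server table are all assumptions, and the table is just one layout consistent with the slide's K1 => A => S1 and K2 => C => S3 examples.

```python
import hashlib

# Hypothetical layout: 8 partitions (A..H) spread over 4 servers.
# The real mapping on the slide's ring diagram is not recoverable.
PARTITION_TO_SERVER = {
    "A": "S1", "B": "S2", "C": "S3", "D": "S1",
    "E": "S3", "F": "S4", "G": "S2", "H": "S4",
}
PARTITIONS = sorted(PARTITION_TO_SERVER)


def partition_for(key: str) -> str:
    """hash(key) mod #partitions, using MD5 as a stand-in hash function."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(PARTITIONS)
    return PARTITIONS[index]


def server_for(key: str) -> str:
    """Locate the server that owns the key's partition."""
    return PARTITION_TO_SERVER[partition_for(key)]
```

Because the number of partitions is fixed, adding a server only changes the partition-to-server table; the key-to-partition step stays the same.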
  • 7. Consistent Hashing with Replication ▪  Replication factor (RF) –  how many replicas to have ▪  Replica selection –  Find the primary partition –  Walk the ring to create preference list ▪  Find RF-1 additional servers ▪  Skip servers already in list ▪  Examples: RF = 3 –  K1: S1, S2, S3 –  K2: S3, S1, S4 [Diagram: the same ring, annotated with K1 [S1, S2, S3] and K2 [S3, S1, S4]]
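The ring walk described on this slide might look like the following sketch. The ring layout is an assumption (the diagram itself is lost); it was picked because it reproduces the slide's RF = 3 examples, and `preference_list` is a hypothetical helper name.

```python
# Assumed ring order: (partition, owning server), walked clockwise.
RING = [
    ("A", "S1"), ("B", "S2"), ("C", "S3"), ("D", "S1"),
    ("E", "S3"), ("F", "S4"), ("G", "S2"), ("H", "S4"),
]


def preference_list(primary_partition: str, replication_factor: int) -> list:
    """Walk the ring from the primary partition, collecting distinct servers."""
    names = [partition for partition, _ in RING]
    start = names.index(primary_partition)
    servers = []
    for step in range(len(RING)):
        _, server = RING[(start + step) % len(RING)]
        if server not in servers:          # skip servers already in the list
            servers.append(server)
        if len(servers) == replication_factor:
            break
    return servers


print(preference_list("A", 3))  # K1 lives in partition A
print(preference_list("C", 3))  # K2 lives in partition C
```

With this assumed layout, the two calls yield the slide's lists for K1 and K2.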
  • 8. Zone Aware Replication ▪  Servers divided into zones ▪  Zone = Data Center ▪  Per zone replication factor ▪  Local zone vs. remote zones –  Local zone (LZ) is where client is ▪  Two zones example: –  LZ = 1 –  Zone1: S1 S3; RF=2 –  Zone2: S2 S4; RF=1 –  Preference lists: ▪  K1: Z1: S1, S3; Z2: S2 ▪  K2: Z1: S3, S1; Z2: S4 [Diagram: the same ring, annotated with K1 [ Z1 [S1, S3], Z2 [S2] ] and K2 [ Z1 [S3, S1], Z2 [S4] ]]
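A zone-aware variant of the ring walk can be sketched as below. The zone membership and per-zone replication factors come from the slide; the ring order and the helper name `zoned_preference_list` are assumptions made for illustration.

```python
# Assumed ring order: (partition, owning server), walked clockwise.
RING = [
    ("A", "S1"), ("B", "S2"), ("C", "S3"), ("D", "S1"),
    ("E", "S3"), ("F", "S4"), ("G", "S2"), ("H", "S4"),
]
SERVER_ZONE = {"S1": 1, "S3": 1, "S2": 2, "S4": 2}  # from the slide
ZONE_RF = {1: 2, 2: 1}                              # per-zone replication factor


def zoned_preference_list(primary_partition: str) -> dict:
    """Walk the ring, filling each zone's replica list up to that zone's RF."""
    names = [partition for partition, _ in RING]
    start = names.index(primary_partition)
    per_zone = {zone: [] for zone in ZONE_RF}
    for step in range(len(RING)):
        _, server = RING[(start + step) % len(RING)]
        replicas = per_zone[SERVER_ZONE[server]]
        zone_rf = ZONE_RF[SERVER_ZONE[server]]
        if server not in replicas and len(replicas) < zone_rf:
            replicas.append(server)
    return per_zone


print(zoned_preference_list("A"))  # K1
print(zoned_preference_list("C"))  # K2
```

A client in the local zone (LZ = 1) would then try the Zone 1 replicas first and fall back to the remote zone.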
  • 9. Voldemort @ LinkedIn: 385 Stores (238 R-O Stores, 147 R-W Stores), 3 Zones, 14 Clusters, ~200 TB, ~750 Servers
  • 11. Voldemort @ LinkedIn ●  17% of all LinkedIn Services ○  embed a direct client ●  Fast (95th percentile < 20ms) for almost all clients
  • 12. Voldemort @ LinkedIn ●  Front Facing ○  Search (Recruiter + Site) ○  People You May Know ○  inShare ○  Media thumbnails ○  Notifications ○  Endorsements ○  Skills ○  Frequency capping Ads ○  Custom Segments ○  Who Viewed Your Profile ○  People You Want to Hire ●  Internal Services ○  Email cache ○  Email delivery stack ○  Recommendation Services ○  Personalization Services ○  Mobile Auth ●  Not exhaustive!
  • 14. Storage Layer ●  Berkeley DB Java Edition ○  Embedded ○  100% Java ○  ACID compliant ○  Log structured ●  Voldemort uses ○  Vanilla k-v apis ○  Cursors for scans
  • 15. Storage Layer Rewrite Where We Wanted To Be ●  Predictable online performance ●  Scan jobs ○  Non Intrusive, Fast ●  Elastic ○  Recover failed nodes in minutes ○  Add hardware overnight
  • 16. Storage Layer Rewrite Where We Really Were 1.  GC Issues a.  Unpredictable GC Churn b.  Scan jobs cause Full GCs 2.  Slow Scans (even on SSDs) a.  Daily Retention Job/Slop Pusher b.  Not Partition Aware 3.  Memory Management a.  0-Control over a single store’s share 4.  Managing Multiple Versions a.  Lock Contention b.  Additional bdb-delete() cost during put() 5.  Weaker Durability on Crash a.  Dirty Writes in heap
  • 17. Storage Layer Rewrite [Diagram: BDB-JE with the BDB cache on the JVM holding index and leaf nodes, backed by disk, and accessed by the server thread, the BDB-Checkpointer and the BDB-Cleaner]
  • 18. Storage Layer Rewrite Multi-Tenant Example [Diagram: one BDB cache in the JVM heap shared by the B+Trees of stores A, B, C and D, with server threads plus multiple cleaner and checkpointer threads]
  • 19. Storage Layer Rewrite Road To Recovery ●  Move data off heap ○  Only Index sits on heap ●  Cache Control to reduce scan impact ●  Partition Aware Storage ○  Range scans to the rescue ●  Dynamic Cache Partitioning ○  Control how much heap goes to a single store ●  SSD Aware Optimizations ○  Checkpointing ○  Cache Policy ●  Manage versions directly ○  Treat BDB as plain k-v store
  • 20. Storage Layer Rewrite Moving Data Off Heap ●  Much improved GC ○  memory churn ○  promotions ●  SSD Aware hit-the-disk design ●  Strong Durability on Crash ○  Runaway heap [Diagram: put(k,v) in two numbered steps, with the index in the JVM heap and the old and new leaf in the SSD/page cache]
  • 21. Storage Layer Rewrite Reducing Scan Impact ●  Massive Cache Pollution ○  Throttling not an option ●  Exercise cursor level control ●  Sustained rates up to 30-40K/sec
  • 22. Storage Layer Rewrite Managing Versions Directly ●  No more extra delete() ●  No more separate duplicate tree ○  Much improved locking performance ●  More compact storage [Diagram: before, a BIN pointing to a DIN and DBIN holding V1 and V2 as separate entries; after, a single BIN entry holding V1,V2]
  • 23. Storage Layer Rewrite SSD Aware Optimizations ●  Checkpoints on SSD ○  Age-old recovery time vs performance tradeoff ●  Predictability ○  Level based policy ●  Streaming Writes ○  Turn off checkpointer ●  BDB5 Support ○  Much better compaction ○  Much less index metadata [Chart: Checkpointer Interval vs Recovery Time]
  • 24. Storage Layer Rewrite Partition Aware Storage [Diagram: storing “Key” prefixed with its partition-id turns a tree whose subtrees hold k1 to k16 in plain key order into a tree with one subtree per partition (P0 and P1), one holding the odd-numbered keys and the other the even-numbered keys]
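The idea of prefixing each stored key with its partition id, so that one partition becomes a contiguous key range, can be sketched as follows. A sorted Python list stands in for the BDB B+Tree; the 2-byte prefix width, the class name and the key-to-partition assignment in the usage lines are assumptions.

```python
import bisect

PREFIX_BYTES = 2  # assumed width of the partition-id prefix


def storage_key(partition_id: int, key: bytes) -> bytes:
    """Prefix the key with its partition id so keys sort by partition first."""
    return partition_id.to_bytes(PREFIX_BYTES, "big") + key


class SortedStore:
    """Toy ordered store: a sorted key list plus a dict of values."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, partition_id: int, key: bytes, value: bytes) -> None:
        skey = storage_key(partition_id, key)
        if skey not in self._values:
            bisect.insort(self._keys, skey)
        self._values[skey] = value

    def scan_partition(self, partition_id: int):
        """Range scan: seek to the partition prefix, stop when it changes."""
        prefix = partition_id.to_bytes(PREFIX_BYTES, "big")
        start = bisect.bisect_left(self._keys, prefix)
        for skey in self._keys[start:]:
            if not skey.startswith(prefix):
                break
            yield skey[PREFIX_BYTES:], self._values[skey]


store = SortedStore()
for number in range(1, 5):
    # Illustrative assignment only: odd keys to partition 0, even to 1.
    store.put((number + 1) % 2, b"k%d" % number, b"v%d" % number)
partition_zero = list(store.scan_partition(0))
```

A scan for one partition now touches only that partition's range instead of walking every key in the store, which is what makes restore and rebalancing scans cheaper.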
  • 25. Storage Layer Rewrite [Chart: Speed Up vs Percentage Of Partitions Scanned] ●  Restore ○  1 Day -> 1 hour ●  Rebalancing ○  ~Week -> Hours
  • 26. Storage Layer Rewrite Dynamic Cache Partitioning ●  Control share of heap per store ○  Dynamically add/reduce memory ○  Currently isolating bursty store ●  Improve Capacity Model ○  More production validation? ○  Auto tuning mechanisms? ●  Isolate at the JVM level? ○  Rethink deployment model
  • 27. Storage Layer Rewrite Wins In Production [Charts: Rein in GC; Storage Latency Way Down]
  • 28. Cluster Expansion Rewrite ●  Basis of scale-out philosophy ●  Cluster Expansion ○  Add servers to existing cluster ●  0 Downtime operation ●  Transparent to client ○  Functionality ○  Mostly Performance too
  • 29. Cluster Expansion Rewrite Types Of Clusters ●  Zoned Read Write ○  Zone = DataCenter ●  Non Zoned ○  Read-Write ○  Read-Only (Hadoop BuildAndPush)
  • 30. Expansion Example [Diagram: Zone 1 and Zone 2 before and after expansion, showing the replicas labeled P1 to P4 and S1 to S4 redistributed across the existing servers and the new servers added to each zone]
  • 31. Cluster Expansion Rewrite Expansion In Action 1: Change Cluster Topology [Diagram: Rebalance Controller; Zone 1 and Zone 2, each with Server 1, Server 2 and a New Server; step 1 marked]
  • 32. Cluster Expansion Rewrite Expansion In Action 2: Setup Proxy Bridges [Diagram: as before, with a Proxy Bridge added; steps 1 and 2 marked] 1. Change cluster topology
  • 33. Cluster Expansion Rewrite Expansion In Action 3: Client Picks Up New Topology [Diagram: as before, with a Client added; steps 1 to 3 marked] 1. Change cluster topology 2. Proxy request based on old topology
  • 34. Cluster Expansion Rewrite Expansion In Action 4: Move Partitions [Diagram: as before, with a Local Move and a Cross DC Move added; steps 1 to 4 marked] 1. Change cluster topology 2. Proxy request based on old topology 3. Client picks up change
  • 35. Cluster Expansion Rewrite Expansion In Action [Diagram: Clients, Rebalance Controller, Proxy Bridge, Local Move and Cross DC Move across Zone 1 and Zone 2; steps 1 to 4 marked] 1. Change cluster topology 2. Client picks up change 3. Proxy request based on old topology 4. Move partitions
  • 36. Problems Cluster Expansion Rewrite ●  One Ring Spanning Data Centers ○  Cross datacenter data moves/proxies ●  Not Safely Abortable ○  Additional cleanup/consolidation ●  Cannot Add New Data Centers ●  Opaque Planner Code ○  No special treatment of Zones ●  Lack of tools ○  Skew Analysis ○  Repartitioning/Balancing Utility
  • 37. Redesign: Zone N-ary Philosophy ●  Given a partition P, whose mapping has changed: data moves from the old nth replica of P in zone Z (the donor) to the new nth replica of P in zone Z (the stealer), with a proxy bridge between the two [Diagram: Zone 1 and Zone 2, each with Servers 1 to 4, where Server 4 is new]
  • 38. Redesign: Advantages Cluster Expansion Rewrite ●  Simple, yet powerful ●  Feasible alternative to breaking the ring ○  Expensive to rewrite all of DR ●  No more cross datacenter moves ●  Aligns proxy bridges mechanism with planner logic ●  Principally applied ○  Abortable Rebalances ○  Zone Expansion
  • 39. Abortable Rebalance Cluster Expansion Rewrite ●  Plans go wrong ●  Introducing proxy puts ○  Safely rollback to old topology ●  Avoid Data Loss & ad hoc repairs ●  Double write load during rebalance [Diagram: the stealer receives put(k,v), issues proxy-get(k) to the donor and receives v_old, performs local-put(k,v_old) and then local-put(k,v), reports Success, and sends proxy-put(k,v) to the donor]
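The write path in this slide's diagram can be approximated by the sketch below. It is a simplification under stated assumptions: values are plain strings rather than versioned values, the proxy-put is done synchronously before returning (the slide shows it after Success), and `Node` and `put_during_rebalance` are hypothetical names.

```python
class Node:
    """A server holding a plain key -> value map (versioning is ignored)."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}


def put_during_rebalance(stealer: Node, donor: Node, key: str, value: str) -> str:
    """Handle a client put that arrives at the stealer while data is moving."""
    old_value = donor.data.get(key)      # proxy-get(k) returns v_old
    if old_value is not None:
        stealer.data[key] = old_value    # local-put(k, v_old)
    stealer.data[key] = value            # local-put(k, v)
    donor.data[key] = value              # proxy-put(k, v) keeps the donor current
    return "Success"


donor = Node("donor")
donor.data["k"] = "v_old"
stealer = Node("stealer")
result = put_during_rebalance(stealer, donor, "k", "v_new")
```

Because the donor also receives every write, aborting the rebalance and returning to the old topology does not lose data; the cost is the doubled write load the slide mentions.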
  • 40. Zone Expansion Cluster Expansion Rewrite ●  Builds upon Zone N-ary idea ●  Fetch data from an existing zone ●  No proxy bridges ○  No donors in same zone ●  Cannot read from new zone until complete
  • 41. New Rebalance Utilities Cluster Expansion Rewrite ●  PartitionAnalysis ○  Determine skewness of a cluster ●  Repartitioner ○  Improve partition balance ○  Greedy-Random swapping ●  RebalancePlanner ○  Incorporate Zone N-Ary logic ○  Operational Insights: storage overhead, probability client will pick up new metadata ●  Rebalance Controller ○  Cleaner reimplementation based on new planner/scheduler
  • 42. Wins In Production Cluster Expansion Rewrite ●  7 Zoned RW Clusters expanded into new zone ○  Hiccups resolved overnight ○  Abortability is handy ●  Small Details -> Big Difference ○  Proxy Pause period ○  Accurate Progress reporting ○  Proxy get/getall optimization
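A skew measurement and a greedy-random swap loop might look like the sketch below. The slides do not say which metric or swap rule the real PartitionAnalysis and Repartitioner use, so everything here is an assumption: skew is taken as the maximum replica load divided by the mean, and a random swap of two partitions' owners is kept only if it lowers that number.

```python
import random


def replica_load(ring: list, replication_factor: int) -> dict:
    """Count, per server, how many partitions list it as a replica."""
    load = {server: 0 for server in ring}
    size = len(ring)
    for start in range(size):
        chosen = []
        for step in range(size):
            server = ring[(start + step) % size]
            if server not in chosen:
                chosen.append(server)
            if len(chosen) == replication_factor:
                break
        for server in chosen:
            load[server] += 1
    return load


def skew(ring: list, replication_factor: int) -> float:
    """Assumed metric: busiest server's load relative to the mean load."""
    load = replica_load(ring, replication_factor)
    return max(load.values()) / (sum(load.values()) / len(load))


def greedy_random_swap(ring: list, replication_factor: int,
                       attempts: int, seed: int = 0) -> list:
    """Try random owner swaps; keep a swap only when it reduces skew."""
    rng = random.Random(seed)
    best = list(ring)
    best_skew = skew(best, replication_factor)
    for _ in range(attempts):
        i, j = rng.sample(range(len(best)), 2)
        if best[i] == best[j]:
            continue
        candidate = list(best)
        candidate[i], candidate[j] = candidate[j], candidate[i]
        candidate_skew = skew(candidate, replication_factor)
        if candidate_skew < best_skew:
            best, best_skew = candidate, candidate_skew
    return best


# ring[i] is the server that owns partition i (illustrative layout).
initial = ["S1", "S1", "S1", "S1", "S2", "S2", "S3", "S4"]
balanced = greedy_random_swap(initial, replication_factor=2, attempts=200)
```

Swapping owners never changes how many primary partitions a server holds; what improves is how replicas spread once the ring walk is applied, which is why the metric counts replica load rather than primaries.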