SlideShare une entreprise Scribd logo
1  sur  49
Télécharger pour lire hors ligne
On Brewing Fresh Espresso: LinkedIn’s Distributed Data
Serving Platform
Swaroop Jagadish
http://www.linkedin.com/in/swaroopjagadish
LinkedIn Confidential ©2013 All Rights Reserved
Outline
LinkedIn Data Ecosystem
Espresso: Design Points
Data Model and API
Architecture
Deep Dive: Fault Tolerance
Deep Dive: Secondary Indexing
Espresso In Production
Future work
2
The World’s Largest Professional Network
Members Worldwide
2 new
Members Per Second
100M+
Monthly Unique Visitors
225M+ 2M+
Company Pages
Connecting Talent  Opportunity. At scale…
LinkedIn Confidential ©2013 All Rights Reserved 3
LinkedIn Data Ecosystem
4
Espresso: Key Design Points
 Source-of-truth
– Master-Slave, Timeline consistent
– Query-after-write
– Backup/Restore
– High Availability
 Horizontally Scalable
 Rich functionality
– Hierarchical data model
– Document oriented
– Transactions within a hierarchy
– Secondary Indexes
5
Espresso: Key Design Points
 Agility – no “pause the world” operations
– “On the fly” Schema Evolution
– Elasticity
 Integration with the data ecosystem
– Change stream with freshness in O(seconds)
– ETL to Hadoop
– Bulk import
 Modular and Pluggable
– Off-the-shelf: MySQL, Lucene, Avro
6
Data Model and API
7
Application View
8
key
value
REST API:
/mailbox/msg_meta/bob/2
Partitioning
9
/mailbox/msg_meta/bob/2
MemberId is the partitioning key
Document based data model
Richer than a plain key-value store
Hierarchical keys
Values are rich documents and may contain
nested types
10
from : {
name : "Chris",
email : "chris@linkedin.com"
}
subject : "Go Giants!"
body : "World Series 2012! w00t!"
unread : true
Messages
mailboxID : String
messageID : long
from : {
name : String
email : String
}
subject : String
body : String
unread : boolean
REST based API
• Secondary Index query
– GET /MailboxDB/MessageMeta/bob/?query=“+isUnread:true
+isInbox:true”&start=0&count=15
• Partial updates
POST /MailboxDB/MessageMeta/bob/1
Content-Type: application/json
Content-Length: 21
{“unread” : “false”}
• Conditional operations
– Get a message, only if recently updated
GET /MailboxDB/MessageMeta/bob/1
If-Modifed-Since: Wed, 31 Oct 2012 02:54:12 GMT
11
Transactional writes within a hierarchy
mboxId value
George { “numUnread”:
2 }
MessageCounter
mboxId msgId value etag
George 0 {…, “unread”: false, …} 7abf8091
George 1 {…, “unread”: true, …} b648bc5f
George 2 {…, “unread”: true, …} 4fde8701
Message/Message/George/0 {…, “unread”: false, …} 7abf8091
/Message/George/0 {…, “unread”: true, …}
/MessageCounter/George {…, “numUnread”: “+1”, …}
1. Read, record etags
2. Prepare after-image
3.Update
mboxId value
George { “numUnread”:
3 }
Espresso Architecture
13
14
15
16
17
18
19
Cluster Management and Fault
Tolerance
20
Generic Cluster Manager: Apache Helix
 Generic cluster management
– State model + constraints
– Ideal state of distribution of partitions
across the cluster
– Migrate cluster from current state to
ideal state
• More Info
• SoCC 2012
• http://helix.incubator.apache.org
21
Espresso Partition Layout: Master, Slave
 3 Storage Engine nodes, 2-way replication
22
Apache Helix
Partition: P1
Node: 1
…
Partition: P12
Node: 3
Database
Node: 1
M: P1 – Active
…
S: P5 – Active
…
Cluster
Node 1
P1 P2
P4
P3
P5 P6
P9 P10
Node 2
P5 P6
P8
P7
P1 P2
P11 P12
Node 3
P9 P10
P12
P11
P3 P4
P7 P8
Master
Slave
Offline
Cluster Management
Cluster Expansion
Node Failover
Cluster Expansion
 Initial State with 3 Storage Nodes. Step1: Compute new Ideal
state
24
Helix
Partition: P1
Node: 1
…
Partition: P12
Node: 3
Database
Node: 1
M: P1 – Active
…
S: P5 – Active
…
Cluster
Node 1
P1 P2
P4
P3
P5 P6
P9 P10
Node 2
P5 P6
P8
P7
P1 P2
P11 P12
Node 3
P9 P10
P12
P11
P3 P4
P7 P8
Master
Slave
Offline
Node 4
Cluster Expansion
 Step 2: Bootstrap new node’s partitions by restoring from
backups
25
Helix
Partition: P1
Node: 1
…
Partition: P12
Node: 3
Database
Node: 1
M: P1 – Active
…
S: P5 – Active
…
Cluster
Node 1
P1 P2
P4
P3
P5 P6
P9 P10
Node 2
P5 P6
P8
P7
P1 P2
P11 P12
Node 3
P9 P10
P12
P11
P3 P4
P7 P8
Master
Slave
Offline
Node 4
P4 P8 P12
P7 P9P1
Snapshots
Cluster Expansion
 Step 3: Catch up from live replication stream
26
Helix
Partition: P1
Node: 1
…
Partition: P12
Node: 3
Database
Node: 1
M: P1 – Active
…
S: P5 – Active
…
Cluster
Node 1
P1 P2
P4
P3
P5 P6
P9 P10
Node 2
P5 P6
P8
P7
P1 P2
P11 P12
Node 3
P9 P10
P12
P11
P3 P4
P7 P8
Master
Slave
Offline
Node 4
P4 P8 P12
P7 P9P1
Snapshots
Cluster Expansion
 Step 4: Migrate masters and slaves to rebalance
27
Helix
Partition: P1
Node: 1
…
Partition: P12
Node: 3
Database
Node: 1
M: P1 – Active
…
S: P5 – Active
…
Cluster
Node 1
P1 P2 P3
P5 P6
P10
Node 2
P5 P6 P7
P2
P11 P12
Node 3
P9 P10 P11
P3 P4
P8
Master
Slave
Offline
Node 4
P4 P8 P12
P7 P9P1
Cluster Expansion
 Partitions are balanced. Router starts sending traffic to new
node
28
Helix
Partition: P1
Node: 1
…
Partition: P12
Node: 3
Database
Node: 1
M: P1 – Active
…
S: P5 – Active
…
Cluster
Node 1 Node 2
P5 P6 P7
P2 P11 P12
Node 3
Master
Slave
Offline
Node 4
P1 P2 P3
P5 P6 P10
P9 P10 P11
P3 P4 P8
P4 P8 P12
P1 P7 P9
Node Failover
• During failure or planned maintenance
29
Node 1
P1 P2 P3
P10P5 P6
Node 2
P5 P6 P7
P12P2 P11
Node 3
P9 P10 P11
P8P3 P4
Node 4
P4 P8 P12
P7 P9P1
Helix
Partition: P1
Node: 1
…
Partition: P12
Node: 4
Database
Cluster
Node: 4
M: P4 – Active
…
S: P7 – Active
…
Node Failover
• Step 1: Detect Node failure
30
Node 1
P1 P2 P3
P10P5 P6
Node 2
P5 P6 P7
P12P2 P11
Node 3
P9 P10 P11
P8P3 P4
Node 4
P4 P8 P12
P7 P9P1
Helix
Partition: P1
Node: 1
…
Partition: P12
Node: 4
Database
Cluster
Node: 4
M: P4 – Active
…
S: P7 – Active
…
Node Failover
• Step 2: Compute new ideal state for promoting slaves to
master
31
Node 1
P1 P2 P3
P5 P6
Node 2
P5 P6 P7
P12P2
Node 3
P10 P11
P8P3 P4
Node 4
P4 P8 P12
P7 P9P1
Helix
Partition: P1
Node: 1
…
Partition: P12
Node: 4
Database
Cluster
Node: 4
M: P4 – Active
…
S: P7 – Active
…
P11P10
P9
Failover Performance
32
Secondary indexing
33
Espresso Secondary Indexing
• Local Secondary Index Requirements
• Read after write
• Consistent with primary data under failure
• Rich query support: match, prefix, range, text search
• Cost-to-serve proportional to working set
• Pluggable Index Implementations
• MySQL B-Tree
• Inverted index using Apache Lucene with MySQL backing store
• Inverted index using Prefix Index
• Fastbit based bitmap index
Lucene based implementation
• Requires entire index to be memory-resident to support low latency
query response times
• For the Mailbox application, we have two options
Optimizations for Lucene based implementation
• Concurrent transactions on the same Lucene
index leads to inconsistency
• Need to acquire a lock
• Opening an index repeatedly is expensive
• Group commit to amortize index opening cost
write
Request 2
Request 3
Request 4
Request 5
Request 1
Optimizations for Lucene based implementation
 High value users of the site accumulate large
mailboxes
– Query performance degrades with a large index
 Performance shouldn’t get worse with more usage!
 Time Partitioned Indexes: Partition index into buckets
based on created time
Espresso in Production
38
Espresso in Production
 Unified Social Content Platform –social activity aggregation
 High Read:Write ratio
39
Espresso in Production
 InMail - Allows members to communicate with each other
 Large storage footprint
 Low latency requirement for secondary index queries involving text
search and relational predicates
40
Performance
 Average Failover Latency with 1024 partitions is
around 300ms
 Primary Data Reads and Writes
 For Single Storage Node on SSD
 Average row size = 1KB
41
Operation Average Latency Average
Throughput
Reads ~3ms 40,000 per
second
Writes ~6ms 20,000 per
second
Performance
 Partition-key level Secondary Index using Lucene
 One Index per Mailbox use-case
 Base data on SAS, Indexes on SSDs
 Average throughput per index = ~1000 per second
(after the group commit and partitioned index
optimizations)
42
Operation Average Latency
Queries (average
of 5 indexed
fields)
~20ms
Writes (Around
30 indexed fields)
~20ms
Durability and Consistency
 Within a Data Center
 Across Data Centers
Durability and Consistency
 Within a Data Center
– Write latency vs Durability
 Asynchronous replication
– May lead to data loss
– Tooling can mitigate some of this
 Semi-synchronous replication
– Wait for at least one relay to acknowledge
– During failover, slaves wait for catchup
 Consistency over availability
 Helix selects slave with least replication lag to take over
mastership
 Failover time is ~300ms in practice
Durability and Consistency
 Across data centers
– Asynchronous replication
– Stale reads possible
– Active-active: Conflict resolution via last-writer-wins
Lessons learned
Dealing with transient failures
Planned upgrades
Slave reads
Storage Devices
– SSDs vs SAS disks
Scaling Cluster Management
46
Future work
Coprocessors
– Synchronous, Asynchronous
Richer query processing
– Group-by, Aggregation
47
Key Takeaways
Espresso is a timeline consistent,
document-oriented distributed database
Feature rich: Secondary indexing,
transactions over related documents,
seamless integration with the data
ecosystem
In production since June 2012 serving
several key use-cases
48
49
Questions?

Contenu connexe

Tendances

Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonIgor Anishchenko
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured StreamingKnoldus Inc.
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBconfluent
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxData
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowDatabricks
 
From my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debeziumFrom my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debeziumClement Demonchy
 
Introduction to MariaDB
Introduction to MariaDBIntroduction to MariaDB
Introduction to MariaDBJongJin Lee
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionJames Serra
 
Managing 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariManaging 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariDataWorks Summit
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions Yugabyte
 

Tendances (20)

Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDB
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache Arrow
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
 
From my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debeziumFrom my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debezium
 
Introduction to MariaDB
Introduction to MariaDBIntroduction to MariaDB
Introduction to MariaDB
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Managing 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariManaging 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with Ambari
 
YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions YugaByte DB Internals - Storage Engine and Transactions
YugaByte DB Internals - Storage Engine and Transactions
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 

Similaire à Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

Espresso Database Replication with Kafka, Tom Quiggle
Espresso Database Replication with Kafka, Tom QuiggleEspresso Database Replication with Kafka, Tom Quiggle
Espresso Database Replication with Kafka, Tom Quiggleconfluent
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Productionconfluent
 
Jags Ramnarayan's presentation
Jags Ramnarayan's presentationJags Ramnarayan's presentation
Jags Ramnarayan's presentationpunesparkmeetup
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudRose Toomey
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 
Moolle fan-out control for scalable distributed data stores
Moolle  fan-out control for scalable distributed data storesMoolle  fan-out control for scalable distributed data stores
Moolle fan-out control for scalable distributed data storesSungJu Cho
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionSplunk
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesAhsan Javed Awan
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingApache Apex
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...StampedeCon
 
Preparing Codes for Intel Knights Landing (KNL)
Preparing Codes for Intel Knights Landing (KNL)Preparing Codes for Intel Knights Landing (KNL)
Preparing Codes for Intel Knights Landing (KNL)AllineaSoftware
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
HPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance OptimizationHPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance Optimizationinside-BigData.com
 
Event Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQEvent Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQAraf Karsh Hamid
 
Building Apps with Distributed In-Memory Computing Using Apache Geode
Building Apps with Distributed In-Memory Computing Using Apache GeodeBuilding Apps with Distributed In-Memory Computing Using Apache Geode
Building Apps with Distributed In-Memory Computing Using Apache GeodePivotalOpenSourceHub
 
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelSilicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelDaniel Coupal
 
Spil Storage Platform (Erlang) @ EUG-NL
Spil Storage Platform (Erlang) @ EUG-NLSpil Storage Platform (Erlang) @ EUG-NL
Spil Storage Platform (Erlang) @ EUG-NLThijs Terlouw
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit
 

Similaire à Espresso: LinkedIn's Distributed Data Serving Platform (Talk) (20)

Espresso Database Replication with Kafka, Tom Quiggle
Espresso Database Replication with Kafka, Tom QuiggleEspresso Database Replication with Kafka, Tom Quiggle
Espresso Database Replication with Kafka, Tom Quiggle
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Production
 
MYSQL
MYSQLMYSQL
MYSQL
 
Jags Ramnarayan's presentation
Jags Ramnarayan's presentationJags Ramnarayan's presentation
Jags Ramnarayan's presentation
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Moolle fan-out control for scalable distributed data stores
Moolle  fan-out control for scalable distributed data storesMoolle  fan-out control for scalable distributed data stores
Moolle fan-out control for scalable distributed data stores
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of Techniques
 
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingIntro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
 
Preparing Codes for Intel Knights Landing (KNL)
Preparing Codes for Intel Knights Landing (KNL)Preparing Codes for Intel Knights Landing (KNL)
Preparing Codes for Intel Knights Landing (KNL)
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
HPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance OptimizationHPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance Optimization
 
BigData Developers MeetUp
BigData Developers MeetUpBigData Developers MeetUp
BigData Developers MeetUp
 
Event Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQEvent Sourcing & CQRS, Kafka, Rabbit MQ
Event Sourcing & CQRS, Kafka, Rabbit MQ
 
Building Apps with Distributed In-Memory Computing Using Apache Geode
Building Apps with Distributed In-Memory Computing Using Apache GeodeBuilding Apps with Distributed In-Memory Computing Using Apache Geode
Building Apps with Distributed In-Memory Computing Using Apache Geode
 
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The SequelSilicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
Silicon Valley Code Camp 2015 - Advanced MongoDB - The Sequel
 
Spil Storage Platform (Erlang) @ EUG-NL
Spil Storage Platform (Erlang) @ EUG-NLSpil Storage Platform (Erlang) @ EUG-NL
Spil Storage Platform (Erlang) @ EUG-NL
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 

Plus de Amy W. Tang

Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedInAmy W. Tang
 
LinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationLinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationAmy W. Tang
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInAmy W. Tang
 
Building Distributed Systems Using Helix
Building Distributed Systems Using HelixBuilding Distributed Systems Using Helix
Building Distributed Systems Using HelixAmy W. Tang
 
LinkedIn Graph Presentation
LinkedIn Graph PresentationLinkedIn Graph Presentation
LinkedIn Graph PresentationAmy W. Tang
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Amy W. Tang
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedInAmy W. Tang
 
Voldemort on Solid State Drives
Voldemort on Solid State DrivesVoldemort on Solid State Drives
Voldemort on Solid State DrivesAmy W. Tang
 
Untangling Cluster Management with Helix
Untangling Cluster Management with HelixUntangling Cluster Management with Helix
Untangling Cluster Management with HelixAmy W. Tang
 
All Aboard the Databus
All Aboard the DatabusAll Aboard the Databus
All Aboard the DatabusAmy W. Tang
 
Introduction to Databus
Introduction to DatabusIntroduction to Databus
Introduction to DatabusAmy W. Tang
 
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInA Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInAmy W. Tang
 

Plus de Amy W. Tang (12)

Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
LinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationLinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data Application
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
 
Building Distributed Systems Using Helix
Building Distributed Systems Using HelixBuilding Distributed Systems Using Helix
Building Distributed Systems Using Helix
 
LinkedIn Graph Presentation
LinkedIn Graph PresentationLinkedIn Graph Presentation
LinkedIn Graph Presentation
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
Voldemort on Solid State Drives
Voldemort on Solid State DrivesVoldemort on Solid State Drives
Voldemort on Solid State Drives
 
Untangling Cluster Management with Helix
Untangling Cluster Management with HelixUntangling Cluster Management with Helix
Untangling Cluster Management with Helix
 
All Aboard the Databus
All Aboard the DatabusAll Aboard the Databus
All Aboard the Databus
 
Introduction to Databus
Introduction to DatabusIntroduction to Databus
Introduction to Databus
 
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInA Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
 

Dernier

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 

Dernier (20)

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 

Espresso: LinkedIn's Distributed Data Serving Platform (Talk)

  • 1. On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform Swaroop Jagadish http://www.linkedin.com/in/swaroopjagadish LinkedIn Confidential ©2013 All Rights Reserved
  • 2. Outline LinkedIn Data Ecosystem Espresso: Design Points Data Model and API Architecture Deep Dive: Fault Tolerance Deep Dive: Secondary Indexing Espresso In Production Future work 2
  • 3. The World’s Largest Professional Network Members Worldwide 2 new Members Per Second 100M+ Monthly Unique Visitors 225M+ 2M+ Company Pages Connecting Talent  Opportunity. At scale… LinkedIn Confidential ©2013 All Rights Reserved 3
  • 5. Espresso: Key Design Points  Source-of-truth – Master-Slave, Timeline consistent – Query-after-write – Backup/Restore – High Availability  Horizontally Scalable  Rich functionality – Hierarchical data model – Document oriented – Transactions within a hierarchy – Secondary Indexes 5
  • 6. Espresso: Key Design Points  Agility – no “pause the world” operations – “On the fly” Schema Evolution – Elasticity  Integration with the data ecosystem – Change stream with freshness in O(seconds) – ETL to Hadoop – Bulk import  Modular and Pluggable – Off-the-shelf: MySQL, Lucene, Avro 6
  • 10. Document based data model Richer than a plain key-value store Hierarchical keys Values are rich documents and may contain nested types 10 from : { name : "Chris", email : "chris@linkedin.com" } subject : "Go Giants!" body : "World Series 2012! w00t!" unread : true Messages mailboxID : String messageID : long from : { name : String email : String } subject : String body : String unread : boolean
  • 11. REST based API • Secondary Index query – GET /MailboxDB/MessageMeta/bob/?query=“+isUnread:true +isInbox:true”&start=0&count=15 • Partial updates POST /MailboxDB/MessageMeta/bob/1 Content-Type: application/json Content-Length: 21 {“unread” : “false”} • Conditional operations – Get a message, only if recently updated GET /MailboxDB/MessageMeta/bob/1 If-Modifed-Since: Wed, 31 Oct 2012 02:54:12 GMT 11
  • 12. Transactional writes within a hierarchy mboxId value George { “numUnread”: 2 } MessageCounter mboxId msgId value etag George 0 {…, “unread”: false, …} 7abf8091 George 1 {…, “unread”: true, …} b648bc5f George 2 {…, “unread”: true, …} 4fde8701 Message/Message/George/0 {…, “unread”: false, …} 7abf8091 /Message/George/0 {…, “unread”: true, …} /MessageCounter/George {…, “numUnread”: “+1”, …} 1. Read, record etags 2. Prepare after-image 3.Update mboxId value George { “numUnread”: 3 }
  • 14. 14
  • 15. 15
  • 16. 16
  • 17. 17
  • 18. 18
  • 19. 19
  • 20. Cluster Management and Fault Tolerance 20
  • 21. Generic Cluster Manager: Apache Helix  Generic cluster management – State model + constraints – Ideal state of distribution of partitions across the cluster – Migrate cluster from current state to ideal state • More Info • SoCC 2012 • http://helix.incubator.apache.org 21
  • 22. Espresso Partition Layout: Master, Slave  3 Storage Engine nodes, 2-way replication 22 Apache Helix Partition: P1 Node: 1 … Partition: P12 Node: 3 Database Node: 1 M: P1 – Active … S: P5 – Active … Cluster Node 1 P1 P2 P4 P3 P5 P6 P9 P10 Node 2 P5 P6 P8 P7 P1 P2 P11 P12 Node 3 P9 P10 P12 P11 P3 P4 P7 P8 Master Slave Offline
  • 24. Cluster Expansion  Initial State with 3 Storage Nodes. Step1: Compute new Ideal state 24 Helix Partition: P1 Node: 1 … Partition: P12 Node: 3 Database Node: 1 M: P1 – Active … S: P5 – Active … Cluster Node 1 P1 P2 P4 P3 P5 P6 P9 P10 Node 2 P5 P6 P8 P7 P1 P2 P11 P12 Node 3 P9 P10 P12 P11 P3 P4 P7 P8 Master Slave Offline Node 4
  • 25. Cluster Expansion  Step 2: Bootstrap new node’s partitions by restoring from backups 25 Helix Partition: P1 Node: 1 … Partition: P12 Node: 3 Database Node: 1 M: P1 – Active … S: P5 – Active … Cluster Node 1 P1 P2 P4 P3 P5 P6 P9 P10 Node 2 P5 P6 P8 P7 P1 P2 P11 P12 Node 3 P9 P10 P12 P11 P3 P4 P7 P8 Master Slave Offline Node 4 P4 P8 P12 P7 P9P1 Snapshots
  • 26. Cluster Expansion  Step 3: Catch up from live replication stream 26 Helix Partition: P1 Node: 1 … Partition: P12 Node: 3 Database Node: 1 M: P1 – Active … S: P5 – Active … Cluster Node 1 P1 P2 P4 P3 P5 P6 P9 P10 Node 2 P5 P6 P8 P7 P1 P2 P11 P12 Node 3 P9 P10 P12 P11 P3 P4 P7 P8 Master Slave Offline Node 4 P4 P8 P12 P7 P9P1 Snapshots
  • 27. Cluster Expansion  Step 4: Migrate masters and slaves to rebalance 27 Helix Partition: P1 Node: 1 … Partition: P12 Node: 3 Database Node: 1 M: P1 – Active … S: P5 – Active … Cluster Node 1 P1 P2 P3 P5 P6 P10 Node 2 P5 P6 P7 P2 P11 P12 Node 3 P9 P10 P11 P3 P4 P8 Master Slave Offline Node 4 P4 P8 P12 P7 P9P1
  • 28. Cluster Expansion  Partitions are balanced. Router starts sending traffic to new node 28 Helix Partition: P1 Node: 1 … Partition: P12 Node: 3 Database Node: 1 M: P1 – Active … S: P5 – Active … Cluster Node 1 Node 2 P5 P6 P7 P2 P11 P12 Node 3 Master Slave Offline Node 4 P1 P2 P3 P5 P6 P10 P9 P10 P11 P3 P4 P8 P4 P8 P12 P1 P7 P9
  • 29. Node Failover • During failure or planned maintenance 29 Node 1 P1 P2 P3 P10P5 P6 Node 2 P5 P6 P7 P12P2 P11 Node 3 P9 P10 P11 P8P3 P4 Node 4 P4 P8 P12 P7 P9P1 Helix Partition: P1 Node: 1 … Partition: P12 Node: 4 Database Cluster Node: 4 M: P4 – Active … S: P7 – Active …
  • 30. Node Failover • Step 1: Detect Node failure 30 Node 1 P1 P2 P3 P10P5 P6 Node 2 P5 P6 P7 P12P2 P11 Node 3 P9 P10 P11 P8P3 P4 Node 4 P4 P8 P12 P7 P9P1 Helix Partition: P1 Node: 1 … Partition: P12 Node: 4 Database Cluster Node: 4 M: P4 – Active … S: P7 – Active …
  • 31. Node Failover • Step 2: Compute new ideal state for promoting slaves to master 31 Node 1 P1 P2 P3 P5 P6 Node 2 P5 P6 P7 P12P2 Node 3 P10 P11 P8P3 P4 Node 4 P4 P8 P12 P7 P9P1 Helix Partition: P1 Node: 1 … Partition: P12 Node: 4 Database Cluster Node: 4 M: P4 – Active … S: P7 – Active … P11P10 P9
  • 34. Espresso Secondary Indexing • Local Secondary Index Requirements • Read after write • Consistent with primary data under failure • Rich query support: match, prefix, range, text search • Cost-to-serve proportional to working set • Pluggable Index Implementations • MySQL B-Tree • Inverted index using Apache Lucene with MySQL backing store • Inverted index using Prefix Index • Fastbit based bitmap index
  • 35. Lucene based implementation • Requires entire index to be memory-resident to support low latency query response times • For the Mailbox application, we have two options
  • 36. Optimizations for Lucene based implementation • Concurrent transactions on the same Lucene index leads to inconsistency • Need to acquire a lock • Opening an index repeatedly is expensive • Group commit to amortize index opening cost write Request 2 Request 3 Request 4 Request 5 Request 1
  • 37. Optimizations for Lucene based implementation  High value users of the site accumulate large mailboxes – Query performance degrades with a large index  Performance shouldn’t get worse with more usage!  Time Partitioned Indexes: Partition index into buckets based on created time
  • 39. Espresso in Production  Unified Social Content Platform –social activity aggregation  High Read:Write ratio 39
  • 40. Espresso in Production  InMail - Allows members to communicate with each other  Large storage footprint  Low latency requirement for secondary index queries involving text search and relational predicates 40
  • 41. Performance  Average Failover Latency with 1024 partitions is around 300ms  Primary Data Reads and Writes  For Single Storage Node on SSD  Average row size = 1KB 41 Operation Average Latency Average Throughput Reads ~3ms 40,000 per second Writes ~6ms 20,000 per second
  • 42. Performance  Partition-key level Secondary Index using Lucene  One Index per Mailbox use-case  Base data on SAS, Indexes on SSDs  Average throughput per index = ~1000 per second (after the group commit and partitioned index optimizations) 42 Operation Average Latency Queries (average of 5 indexed fields) ~20ms Writes (Around 30 indexed fields) ~20ms
  • 43. Durability and Consistency  Within a Data Center  Across Data Centers
  • 44. Durability and Consistency  Within a Data Center – Write latency vs Durability  Asynchronous replication – May lead to data loss – Tooling can mitigate some of this  Semi-synchronous replication – Wait for at least one relay to acknowledge – During failover, slaves wait for catchup  Consistency over availability  Helix selects slave with least replication lag to take over mastership  Failover time is ~300ms in practice
  • 45. Durability and Consistency  Across data centers – Asynchronous replication – Stale reads possible – Active-active: Conflict resolution via last-writer-wins
  • 46. Lessons learned Dealing with transient failures Planned upgrades Slave reads Storage Devices – SSDs vs SAS disks Scaling Cluster Management 46
  • 47. Future work Coprocessors – Synchronous, Asynchronous Richer query processing – Group-by, Aggregation 47
  • 48. Key Takeaways Espresso is a timeline consistent, document-oriented distributed database Feature rich: Secondary indexing, transactions over related documents, seamless integration with the data ecosystem In production since June 2012 serving several key use-cases 48