This document provides an overview of LinkedIn's data infrastructure. It covers the scale of LinkedIn's user base and the data needs of products such as profiles, communications, and recommendations, and describes a data ecosystem organized around three paradigms: online, nearline, and offline. It then summarizes the key systems that make up the infrastructure, including Databus for change data capture, Voldemort for distributed key-value storage, Kafka for messaging, and Espresso for distributed data storage, showing how LinkedIn builds scalable data solutions to power its products and services.
3. LinkedIn By The Numbers
150M+ users*
~ 4.2B People Searches in 2011**
>2M companies with LinkedIn Company Pages**
16 languages
75% of Fortune 100 Companies use LinkedIn to hire***
* As of February 9th 2012
** As of December 31st 2011
*** As of September 30th 2011
10. Three Paradigms: Simplifying the Data Continuum
Online (activity that should be reflected immediately):
• Member Profiles
• Company Profiles
• Connections
• Communications
Nearline (activity that should be reflected soon):
• LinkedIn Today
• Profile Standardization
• News
• Recommendations
• Search
• Communications
Offline (activity that can be reflected later):
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea
15. Databus at LinkedIn
[Architecture diagram: a Databus Relay captures changes from the source DB into an event window and serves on-line changes to consumers 1..n through the Databus client library; a Bootstrap service provides a consistent snapshot at time U, plus subsequent on-line changes, to consumers that have fallen behind the relay.]
16. Databus at LinkedIn
• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics
• In order, at least once delivery
• Tens of relays
• Hundreds of sources
• Low latency: milliseconds
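To make the consumption model concrete, here is a minimal Java sketch of a change-data-capture consumer in the spirit of Databus: it handles both on-line events from the relay and catch-up events from the bootstrap service, and checkpoints its position because delivery is at least once. The interface and class names are hypothetical illustrations, not the actual open-source Databus client API.

    // Illustrative sketch only: hypothetical types, not the Databus client API.
    interface ChangeEvent {
        String sourceName();      // logical table the change came from
        long sequenceNumber();    // monotonically increasing within a source
        byte[] payload();         // serialized after-image of the changed row
    }

    interface ChangeConsumer {
        // Called for each event pulled from the relay's event window, in commit order.
        void onOnlineEvent(ChangeEvent event);

        // Called when the consumer has fallen behind the relay's window and is being
        // fed a consistent snapshot (plus catch-up changes) from the bootstrap service.
        void onBootstrapEvent(ChangeEvent event);

        // Delivery is at-least-once, so processing must be idempotent; the consumer
        // periodically checkpoints the last sequence number it has fully applied.
        long lastCheckpointedSequence();
    }

    class SearchIndexUpdater implements ChangeConsumer {
        private long lastApplied = -1;

        @Override public void onOnlineEvent(ChangeEvent event)    { apply(event); }
        @Override public void onBootstrapEvent(ChangeEvent event) { apply(event); }
        @Override public long lastCheckpointedSequence()          { return lastApplied; }

        private void apply(ChangeEvent event) {
            if (event.sequenceNumber() <= lastApplied) {
                return; // duplicate from an at-least-once redelivery; skip it
            }
            // ... update the derived view (e.g. a search index) from event.payload() ...
            lastApplied = event.sequenceNumber();
        }
    }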
25. Kafka: Architecture
[Architecture diagram: the WebTier pushes events into the Kafka broker tier (sequential writes); consumers pull from topics 1..N through client-library iterators (sendfile), with throughput on the order of 100-200 MB/sec; Zookeeper manages topic/partition offsets and consumer ownership.]
• At least once delivery
• Very high throughput
• Low latency
• Durability
• Billions of events, TBs per day
• 50K+ events per second at peak
• Inter- and intra-cluster replication
• End-to-end latency: a few seconds
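As an illustration of the push/pull flow above, here is a minimal producer and consumer sketch in Java. It uses the current Apache Kafka client API, which postdates the 0.7-era, Zookeeper-managed deployment described in these slides, and the broker address, topic name, and consumer group are placeholder values.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ActivityEventsExample {
        public static void main(String[] args) {
            // Producer side: the web tier pushes activity events to a topic.
            Properties p = new Properties();
            p.put("bootstrap.servers", "broker1:9092");   // placeholder broker address
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("page-views", "member-123", "{\"page\":\"profile\"}"));
            }

            // Consumer side: downstream systems pull events at their own pace.
            Properties c = new Properties();
            c.put("bootstrap.servers", "broker1:9092");
            c.put("group.id", "news-feed-builder");        // placeholder consumer group
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                consumer.subscribe(List.of("page-views"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // At-least-once delivery: downstream processing must tolerate re-delivery.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }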
28. Espresso: Application View
• Hierarchical data model
• Rich functionality on resources:
  – Conditional updates
  – Partial updates
  – Atomic counters
• Rich functionality within resource groups:
  – Transactions
  – Secondary index
  – Text search
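The resource-oriented features above map naturally onto HTTP. The following Java sketch shows what a conditional, partial update of one document in a hierarchical key space could look like; the host name, the MailboxDB/Messages path, the JSON body, and the use of ETag/If-Match headers are illustrative assumptions, not Espresso's documented wire protocol.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class HierarchicalUpdateExample {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // Hypothetical hierarchical resource path: database / table / member id / message id.
            URI message = URI.create("http://espresso.example.com/MailboxDB/Messages/member-42/msg-7");

            // Read the resource first; assume a version tag accompanies the response.
            HttpResponse<String> current = client.send(
                    HttpRequest.newBuilder(message).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            String version = current.headers().firstValue("ETag").orElse("*");

            // Conditional partial update: only the "isRead" field changes, and only if the
            // resource still has the version we read (optimistic concurrency control).
            HttpRequest partialUpdate = HttpRequest.newBuilder(message)
                    .header("If-Match", version)
                    .method("POST", HttpRequest.BodyPublishers.ofString("{\"isRead\": true}"))
                    .build();
            HttpResponse<String> result = client.send(partialUpdate, HttpResponse.BodyHandlers.ofString());
            System.out.println("update status: " + result.statusCode());
        }
    }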
32. Generic Cluster Manager: Helix
• Generic Distributed State Model
• Centralized Config Management
• Automatic Load Balancing
• Fault tolerance
• Health monitoring
• Cluster expansion and rebalancing
• Used by Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix
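Helix's generic distributed state model means the cluster manager drives every partition replica through a declared set of states (for example OFFLINE, SLAVE, MASTER) and calls back into application code on each transition. The sketch below is a conceptual illustration of that callback pattern with hypothetical interface names; the real Helix API in the repository linked above differs.

    // Conceptual sketch only: hypothetical types, not the Apache Helix API.
    // The cluster manager moves each partition replica between declared states
    // and invokes one callback per allowed transition, per partition.
    interface PartitionStateModel {
        void onOfflineToSlave(String partition);   // start serving reads, begin catching up
        void onSlaveToMaster(String partition);    // promoted: this replica now accepts writes
        void onMasterToSlave(String partition);    // demoted during rebalancing or failover
        void onSlaveToOffline(String partition);   // partition moved off this node
    }

    class StorageNodeStateModel implements PartitionStateModel {
        @Override public void onOfflineToSlave(String partition) {
            // e.g. open the local store for this partition and subscribe to its change stream
        }
        @Override public void onSlaveToMaster(String partition) {
            // e.g. enable writes; the manager ensures at most one MASTER per partition
        }
        @Override public void onMasterToSlave(String partition) {
            // e.g. stop accepting writes but keep replicating
        }
        @Override public void onSlaveToOffline(String partition) {
            // e.g. flush and close local resources for this partition
        }
    }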
33. Espresso @ LinkedIn
Launched first application Oct 2011
Open source 2012
Future
– Multi-Datacenter support
– Global secondary indexes
– Time-partitioned data
35. Acknowledgments
Siddharth Anand, Aditya Auradkar, Chavdar Botev, Vinoth Chandar,
Shirshanka Das, Dave DeMaagd, Alex Feinberg, John Fung, Phanindra
Ganti, Mihir Gandhi, Lei Gao, Bhaskar Ghosh, Kishore Gopalakrishna,
Brendan Harris, Rajappa Iyer, Swaroop Jagadish, Joel Koshy, Kevin Krawez,
Jay Kreps, Shi Lu, Sunil Nagaraj, Neha Narkhede, Sasha Pachev, Igor
Perisic, Lin Qiao, Tom Quiggle, Jun Rao, Bob Schulman, Abraham
Sebastian, Oliver Seeliger, Adam Silberstein, Boris Shkolnik, Chinmay
Soman, Subbu Subramaniam, Roshan Sumbaly, Kapil Surlaker, Sajid
Topiwala, Cuong Tran, Balaji Varadarajan, Jemiah Westerman, Zach White,
Victor Ye, David Zhang, and Jason Zhang