This document provides an overview of LinkedIn's data infrastructure. It covers the scale of LinkedIn's user base and the data needs of products such as profiles, communications, and recommendations, and describes a data ecosystem organized around three paradigms: online, nearline, and offline. It then summarizes the key systems that make up the infrastructure, including Databus for change data capture, Voldemort for distributed key-value storage, Kafka for messaging, and Espresso for distributed data storage, showing how LinkedIn builds scalable data solutions to power its products and services.
3. LinkedIn By The Numbers
150M+ users*
~ 4.2B People Searches in 2011**
>2M companies with LinkedIn Company Pages**
16 languages
75% of Fortune 100 Companies use LinkedIn to hire***
* As of February 9th 2012
** As of December 31st 2011
*** As of September 30th 2011
10. Three Paradigms: Simplifying the Data Continuum
Online (activity that should be reflected immediately):
• Member Profiles
• Company Profiles
• Connections
• Communications
Nearline (activity that should be reflected soon):
• LinkedIn Today
• Profile Standardization
• News
• Recommendations
• Search
• Communications
Offline (activity that can be reflected later):
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea
15. Databus at LinkedIn
[Architecture diagram: a Databus Relay captures changes from the source DB into an event window and serves on-line changes to consumers 1..n through the Databus client library; a Bootstrap service provides a consistent snapshot at time U, plus subsequent on-line changes, to consumers that have fallen behind the relay.]
16. Databus at LinkedIn
• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics
• In order, at least once delivery
• Tens of relays
• Hundreds of sources
• Low latency: milliseconds
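To make the consumption model concrete, here is a minimal Java sketch of a change-data-capture consumer in the spirit of Databus: it handles both on-line events from the relay and catch-up events from the bootstrap service, and checkpoints its position because delivery is at least once. The interface and class names are hypothetical illustrations, not the actual open-source Databus client API.

    // Illustrative sketch only: hypothetical types, not the Databus client API.
    interface ChangeEvent {
        String sourceName();      // logical table the change came from
        long sequenceNumber();    // monotonically increasing within a source
        byte[] payload();         // serialized after-image of the changed row
    }

    interface ChangeConsumer {
        // Called for each event pulled from the relay's event window, in commit order.
        void onOnlineEvent(ChangeEvent event);

        // Called when the consumer has fallen behind the relay's window and is being
        // fed a consistent snapshot (plus catch-up changes) from the bootstrap service.
        void onBootstrapEvent(ChangeEvent event);

        // Delivery is at-least-once, so processing must be idempotent; the consumer
        // periodically checkpoints the last sequence number it has fully applied.
        long lastCheckpointedSequence();
    }

    class SearchIndexUpdater implements ChangeConsumer {
        private long lastApplied = -1;

        @Override public void onOnlineEvent(ChangeEvent event)    { apply(event); }
        @Override public void onBootstrapEvent(ChangeEvent event) { apply(event); }
        @Override public long lastCheckpointedSequence()          { return lastApplied; }

        private void apply(ChangeEvent event) {
            if (event.sequenceNumber() <= lastApplied) {
                return; // duplicate from an at-least-once redelivery; skip it
            }
            // ... update the derived view (e.g. a search index) from event.payload() ...
            lastApplied = event.sequenceNumber();
        }
    }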
25. Kafka: Architecture
[Architecture diagram: the WebTier pushes events into the Kafka broker tier (sequential writes); consumers pull from topics 1..N through client-library iterators (sendfile), with throughput on the order of 100-200 MB/sec; Zookeeper manages topic/partition offsets and consumer ownership.]
• At least once delivery
• Very high throughput
• Low latency
• Durability
• Billions of events, TBs per day
• 50K+ events per second at peak
• Inter- and intra-cluster replication
• End-to-end latency: a few seconds
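As an illustration of the push/pull flow above, here is a minimal producer and consumer sketch in Java. It uses the current Apache Kafka client API, which postdates the 0.7-era, Zookeeper-managed deployment described in these slides, and the broker address, topic name, and consumer group are placeholder values.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ActivityEventsExample {
        public static void main(String[] args) {
            // Producer side: the web tier pushes activity events to a topic.
            Properties p = new Properties();
            p.put("bootstrap.servers", "broker1:9092");   // placeholder broker address
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("page-views", "member-123", "{\"page\":\"profile\"}"));
            }

            // Consumer side: downstream systems pull events at their own pace.
            Properties c = new Properties();
            c.put("bootstrap.servers", "broker1:9092");
            c.put("group.id", "news-feed-builder");        // placeholder consumer group
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                consumer.subscribe(List.of("page-views"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // At-least-once delivery: downstream processing must tolerate re-delivery.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }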
28. Espresso: Application View
• Hierarchical data model
• Rich functionality on resources:
  – Conditional updates
  – Partial updates
  – Atomic counters
• Rich functionality within resource groups:
  – Transactions
  – Secondary index
  – Text search
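The resource-oriented features above map naturally onto HTTP. The following Java sketch shows what a conditional, partial update of one document in a hierarchical key space could look like; the host name, the MailboxDB/Messages path, the JSON body, and the use of ETag/If-Match headers are illustrative assumptions, not Espresso's documented wire protocol.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class HierarchicalUpdateExample {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // Hypothetical hierarchical resource path: database / table / member id / message id.
            URI message = URI.create("http://espresso.example.com/MailboxDB/Messages/member-42/msg-7");

            // Read the resource first; assume a version tag accompanies the response.
            HttpResponse<String> current = client.send(
                    HttpRequest.newBuilder(message).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            String version = current.headers().firstValue("ETag").orElse("*");

            // Conditional partial update: only the "isRead" field changes, and only if the
            // resource still has the version we read (optimistic concurrency control).
            HttpRequest partialUpdate = HttpRequest.newBuilder(message)
                    .header("If-Match", version)
                    .method("POST", HttpRequest.BodyPublishers.ofString("{\"isRead\": true}"))
                    .build();
            HttpResponse<String> result = client.send(partialUpdate, HttpResponse.BodyHandlers.ofString());
            System.out.println("update status: " + result.statusCode());
        }
    }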
32. Generic Cluster Manager: Helix
• Generic Distributed State Model
• Centralized Config Management
• Automatic Load Balancing
• Fault tolerance
• Health monitoring
• Cluster expansion and rebalancing
• Used by Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix
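Helix's generic distributed state model means the cluster manager drives every partition replica through a declared set of states (for example OFFLINE, SLAVE, MASTER) and calls back into application code on each transition. The sketch below is a conceptual illustration of that callback pattern with hypothetical interface names; the real Helix API in the repository linked above differs.

    // Conceptual sketch only: hypothetical types, not the Apache Helix API.
    // The cluster manager moves each partition replica between declared states
    // and invokes one callback per allowed transition, per partition.
    interface PartitionStateModel {
        void onOfflineToSlave(String partition);   // start serving reads, begin catching up
        void onSlaveToMaster(String partition);    // promoted: this replica now accepts writes
        void onMasterToSlave(String partition);    // demoted during rebalancing or failover
        void onSlaveToOffline(String partition);   // partition moved off this node
    }

    class StorageNodeStateModel implements PartitionStateModel {
        @Override public void onOfflineToSlave(String partition) {
            // e.g. open the local store for this partition and subscribe to its change stream
        }
        @Override public void onSlaveToMaster(String partition) {
            // e.g. enable writes; the manager ensures at most one MASTER per partition
        }
        @Override public void onMasterToSlave(String partition) {
            // e.g. stop accepting writes but keep replicating
        }
        @Override public void onSlaveToOffline(String partition) {
            // e.g. flush and close local resources for this partition
        }
    }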
33. Espresso @ LinkedIn
Launched first application Oct 2011
Open source 2012
Future
– Multi-Datacenter support
– Global secondary indexes
– Time-partitioned data
35. Acknowledgments
Siddharth Anand, Aditya Auradkar, Chavdar Botev, Vinoth Chandar,
Shirshanka Das, Dave DeMaagd, Alex Feinberg, John Fung, Phanindra
Ganti, Mihir Gandhi, Lei Gao, Bhaskar Ghosh, Kishore Gopalakrishna,
Brendan Harris, Rajappa Iyer, Swaroop Jagadish, Joel Koshy, Kevin Krawez,
Jay Kreps, Shi Lu, Sunil Nagaraj, Neha Narkhede, Sasha Pachev, Igor
Perisic, Lin Qiao, Tom Quiggle, Jun Rao, Bob Schulman, Abraham
Sebastian, Oliver Seeliger, Adam Silberstein, Boris Shkolnik, Chinmay
Soman, Subbu Subramaniam, Roshan Sumbaly, Kapil Surlaker, Sajid
Topiwala, Cuong Tran, Balaji Varadarajan, Jemiah Westerman, Zach White,
Victor Ye, David Zhang, and Jason Zhang