Big Data Ecosystem at LinkedIn
Big 2015 at WWW
LinkedIn: Largest Professional Network
360M members; 2 new members per second
Rich Data Driven Products at LinkedIn
Similar Profiles
Connections
News
Skill Endorsements
How to build Data Products
• Data Ingress
• Moving data from online to offline system
• Data Processing
• Managing offline processes
• Data Egress
• Moving results from offline to online system
Example Data Product: PYMK
• People You May Know (PYMK): recommend members to connect
Outline
• Data Ingress
• Moving data from online to offline system
• Data Processing
• Managing offline processes
• Data Egress
• Moving results from offline to online system
Data Ingress: Types of Data
• Database data: member profile, connections, …
• Activity data: Page views, Impressions, etc.
• Application and System metrics
• Service logs
Data Ingress - Point-to-point Pipelines
• O(n^2) data integration complexity
• Fragile, delayed, lossy
• Non-standardized
Data Ingress - Centralized Pipeline
• O(n) data integration complexity
• More reliable
• Standardizable
Data Ingress: Apache Kafka
• Publish-subscribe messaging
• Producers send messages to Brokers
• Consumers read messages from Brokers
• Messages are sent to a topic
• E.g., PeopleYouMayKnowTopic
• Each topic is broken into one or more ordered partitions of messages
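As a rough illustration, topics with ordered partitions can be sketched in plain Python (a toy model, not Kafka's actual API; the class and method names are made up):

```python
from collections import defaultdict

# Toy broker: each topic holds a fixed number of ordered partitions;
# a message's key decides which partition it lands in, and consumers
# read a partition in order starting from an offset.
class ToyBroker:
    def __init__(self, num_partitions=2):
        self.topics = defaultdict(
            lambda: [[] for _ in range(num_partitions)])

    def produce(self, topic, key, message):
        partitions = self.topics[topic]
        p = sum(map(ord, key)) % len(partitions)  # deterministic key hash
        partitions[p].append(message)             # ordered within the partition

    def consume(self, topic, partition, offset=0):
        return self.topics[topic][partition][offset:]

broker = ToyBroker()
broker.produce("PeopleYouMayKnowTopic", "member42", "impression")
broker.produce("PeopleYouMayKnowTopic", "member42", "click")
```

Messages with the same key land in the same partition, so a consumer of that partition sees them in production order.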
Kafka: Data Evolution and Loading
• Standardized Schema for each topic
• Avro
• Central repository
• Producers/consumers use the same schema
• Data verification - audits
• ETL to Hadoop
• Map-only jobs load data from the brokers
Goodhope et al., IEEE Data Eng. 2012
Outline
• Data Ingress
• Moving data from online to offline system
• Data Processing
• Batch processing using Hadoop, Azkaban, Cubert
• Stream processing using Samza
• Iterative processing using Spark
• Data Egress
• Moving results from offline to online system
Data Processing: Hadoop
• Ease of programming
• High level Map and Reduce functions
• Scalable to very large clusters
• Fault tolerant
• Speculative execution, auto restart of failed jobs
• Scripting languages: PIG, Hive, Scalding
Data Processing: Hadoop at LinkedIn
• Used for data products, feature computation, model training, analytics and reporting, troubleshooting, …
• Native MapReduce, PIG, Hive
• Workflows with 100s of Hadoop jobs
• 100s of workflows
• Processing petabytes of data everyday
Data Processing Example: PYMK Feature Engineering
• How do people know each other? Triangle closing
• Prob(Bob knows Carol) ∝ the number of common connections
[Figure: triangle closing; Alice is connected to both Bob and Carol]
Data Processing in Hadoop: Example
-- connections in (source_id, dest_id) format, in both directions
connections = LOAD 'connections' USING PigStorage();
group_conn = GROUP connections BY source_id;
-- generatePair is a UDF emitting all pairs of a member's connections
pairs = FOREACH group_conn GENERATE
    generatePair(connections.dest_id) AS (id1, id2);
-- second-degree pairs (id1, id2): aggregate and count common connections
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
    FLATTEN(group) AS (source_id, dest_id),
    COUNT(pairs) AS common_connections;
STORE common_conn INTO 'common_conn' USING PigStorage();
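The same triangle-closing computation can be sketched in plain Python on toy data (names like `neighbors` and the sample members are illustrative, not from the slides):

```python
from collections import defaultdict
from itertools import permutations

# Connections stored in both directions, as in the Pig script
edges = [("alice", "bob"), ("bob", "alice"),
         ("alice", "carol"), ("carol", "alice")]

# GROUP connections BY source_id
neighbors = defaultdict(list)
for src, dst in edges:
    neighbors[src].append(dst)

# generatePair: emit every ordered pair of a member's connections;
# each such pair shares that member as a common connection.
common = defaultdict(int)
for conns in neighbors.values():
    for id1, id2 in permutations(conns, 2):
        common[(id1, id2)] += 1
```

Here `common[("bob", "carol")]` is 1: Alice is their only common connection.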
How to do PYMK Triangle Closing in Hadoop
How to Manage a Production Hadoop Workflow?
Azkaban: Hadoop Workflow management
• Configuration
• Dependency management
• Access control
• Scheduling and SLA management
• Monitoring, history
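For a sense of the configuration and dependency management involved, Azkaban workflows are typically described by small per-job property files; this hypothetical pair (file names, job names, and commands are made up) declares that one job runs only after another succeeds:

```properties
# load-connections.job
type=command
command=sh load_connections.sh

# aggregate.job -- scheduled only after load-connections succeeds
type=command
command=pig triangle_closing.pig
dependencies=load-connections
```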
Distributed Machine Learning: ML-ease
• ADMM Logistic Regression for binary response prediction
Agarwal et al. 2014
Limitations of Hadoop: Join and Group By
• Two datasets: A = (Salesman, Product), B = (Salesman, Location)
SELECT SomeAggregate() FROM A INNER JOIN B ON A.Salesman = B.Salesman GROUP BY A.Product, B.Location
• Common Hadoop MapReduce/Pig/Hive implementation:
• MapReduce job 1: load the data, shuffle, and reduce to perform the inner join; store the output
• MapReduce job 2: load that output, shuffle on the group-by keys, and aggregate on the reducers to generate the final output
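The two-pass plan can be simulated in plain Python on toy data (in real Hadoop, each pass is a full MapReduce job whose intermediate output is written to HDFS):

```python
from collections import defaultdict

# Toy datasets: A = (salesman, product), B = (salesman, location)
A = [("ann", "laptop"), ("bob", "laptop"), ("ann", "phone")]
B = [("ann", "NYC"), ("bob", "SF")]

# Pass 1: shuffle both datasets on the join key (salesman) and
# reduce to perform the inner join; the joined output is stored.
shuffled = defaultdict(lambda: ([], []))
for salesman, product in A:
    shuffled[salesman][0].append(product)
for salesman, location in B:
    shuffled[salesman][1].append(location)
joined = [(p, l) for products, locations in shuffled.values()
          for p in products for l in locations]

# Pass 2: reload the joined output, shuffle on the group-by keys
# (product, location), and aggregate (a COUNT here) on the reducers.
counts = defaultdict(int)
for product, location in joined:
    counts[(product, location)] += 1
```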
Limitations of Triangle Closing Using Hadoop
• Large amount of data to shuffle from Mappers to Reducers
-- connections in (source_id, dest_id) format, in both directions
connections = LOAD 'connections' USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
    generatePair(connections.dest_id) AS (id1, id2);
-- shuffling all 2nd-degree connections: terabytes of data
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
    FLATTEN(group) AS (source_id, dest_id),
    COUNT(pairs) AS common_connections;
STORE common_conn INTO 'common_conn' USING PigStorage();
Cubert
• An open source project built for analytics needs
• Map-side aggregation
• Minimizes intermediate data and shuffling
• Fast and scalable primitives for joins and aggregation
• Partitions data into blocks
• Specialized operators: MeshJoin, Cube
• 5-60X faster in our experience
• Developer friendly: script-like language
Vemuri et al. VLDB 2014
Cubert Design
• Language
• Scripting language
• Physical: write MR programs
• Execution
• Data movement: Shuffle, Blockgen, Combine, Pivot
• Primitives: MeshJoin, Cube
• Data blocks: partition of data by cost
Vemuri et al. VLDB 2014
Cubert Script: count Daily/Weekly Stats
JOB "create blocks of the fact table"
MAP {
data = LOAD ("$FactTable", $weekAgo, $today) USING AVRO();
}
// create blocks of one week of data with a cost function
BLOCKGEN data BY ROW 1000000 PARTITIONED ON userId;
STORE data INTO "$output/blocks" USING RUBIX;
END
JOB "compute cubes"
MAP {
data = LOAD "$output/blocks" USING RUBIX;
// create a new column 'todayUserId' for today's records only
data = FROM data GENERATE country, locale, userId, clicks,
CASE(timestamp == $today, userId) AS todayUserId;
}
// creates the three cubes in a single job to count daily, weekly users and clicks
CUBE data BY country, locale INNER userId
AGGREGATES COUNT_DISTINCT(userId) as weeklyUniqueUsers,
COUNT_DISTINCT(todayUserId) as dailyUniqueUsers,
SUM(clicks) as totalClicks;
STORE data INTO "$output/results" USING AVRO();
END
Vemuri et al. VLDB 2014
Cubert Example: Join and Group By
Vemuri et al. VLDB 2014
• Two datasets: A = (Salesman, Product), B = (Salesman, Location)
SELECT SomeAggregate() FROM A INNER JOIN B ON A.Salesman = B.Salesman GROUP BY A.Product, B.Location
• Sort A by Product and B by Location
• Divide A and B into specialized blocks sorted by the group-by keys
• Load A's blocks in memory and stream B's blocks through the join
• The group-by can be performed immediately after the join
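A simplified sketch of the idea on toy data (an in-memory index stands in for Cubert's sorted blocks; the names are illustrative): each join match is aggregated on the spot, so no second shuffle pass is needed.

```python
from collections import defaultdict

# Toy datasets: A = (salesman, product), B = (salesman, location)
A = [("ann", "laptop"), ("bob", "laptop"), ("ann", "phone")]
B = [("ann", "NYC"), ("bob", "SF")]

# Hold A's block in memory, keyed on the join key ...
a_block = defaultdict(list)
for salesman, product in A:
    a_block[salesman].append(product)

# ... and stream B's block through the join, aggregating each match
# immediately instead of writing joined rows out for a second job.
counts = defaultdict(int)
for salesman, location in B:
    for product in a_block[salesman]:
        counts[(product, location)] += 1
```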
Cubert Example: Triangle Closing
• Divide connections (src, dest) into blocks
• Duplicate the connection graph: G1, G2
• Sort G1 edges (src, dest) by src
• Sort G2 edges (src, dest) by dest
• MeshJoin G1 and G2 such that G1.dest = G2.src
• Aggregate by (G1.src, G2.dest) to get the number of common connections
• 50% speedup
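The steps above can be sketched in plain Python on toy data (a hash index stands in for the sorted MeshJoin blocks):

```python
from collections import defaultdict

# Connections stored in both directions
edges = [("alice", "bob"), ("bob", "alice"),
         ("alice", "carol"), ("carol", "alice")]

# Duplicate the graph: G1 is joined against G2 on G1.dest == G2.src
g2_by_src = defaultdict(list)
for src, dst in edges:                # G2
    g2_by_src[src].append(dst)

# Each join match is a path src -> mid -> dest, i.e. one common
# connection (mid) for the pair (src, dest).
common = defaultdict(int)
for src, mid in edges:                # G1
    for dst in g2_by_src[mid]:
        if dst != src:                # skip the trivial path back to src
            common[(src, dst)] += 1
```

On this graph, Bob and Carol get one common connection (Alice).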
Cubert Summary
Vemuri et al. VLDB 2014
• Built for analytics needs
• Faster and scalable: 5-60X
• Working well in practice
Outline
• Ingress
• Moving data from online to offline system
• Offline Processing
• Batch processing - Hadoop, Azkaban, Cubert
• Stream processing - Samza
• Iterative processing - Spark
• Egress
• Moving results from offline to online system
Samza
• Samza: streaming computation
• Built on top of a messaging layer like Kafka for input/output
• Low latency
• Stateful processing through local store
• Many use cases at LinkedIn
• Site-speed monitoring
• Data standardization
Samza: Site Speed Monitoring
• LinkedIn homepage is assembled by calling many services
• Each service logs through Kafka what it did for a request Id
Samza: Site Speed Monitoring
• The complete record of a request is scattered across Kafka logs
• Problem: combine these logs to generate a holistic view
Samza: Site Speed Monitoring
• Hadoop/MR: join the logs using the request Id, once a day
• Too late to troubleshoot any issue
• Samza: near real-time join the Kafka logs using the requestId
Samza: Site Speed Monitoring
• Samza: near real-time join the Kafka logs using the requestId
• Two jobs
• Partition Kafka stream by request Id
• Aggregate all the records for a request Id
Fernandez et al. CIDR 2015
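The two jobs can be sketched with plain Python stand-ins (the event fields, service names, and latencies are hypothetical):

```python
from collections import defaultdict

# Hypothetical per-service log events, one Kafka message each
events = [
    {"request_id": "r1", "service": "frontend", "latency_ms": 120},
    {"request_id": "r2", "service": "frontend", "latency_ms": 90},
    {"request_id": "r1", "service": "profile",  "latency_ms": 45},
    {"request_id": "r1", "service": "feed",     "latency_ms": 60},
]

# Job 1: repartition the stream by request Id, so every record for
# a given request lands with the same processor (here: same bucket).
partitions = defaultdict(list)
for event in events:
    partitions[event["request_id"]].append(event)

# Job 2: aggregate all records for a request Id into one holistic view.
def assemble(request_id):
    records = partitions[request_id]
    return {"request_id": request_id,
            "services": sorted(r["service"] for r in records),
            "total_latency_ms": sum(r["latency_ms"] for r in records)}

view = assemble("r1")
```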
Outline
• Ingress
• Moving data from online to offline system
• Offline Processing
• Batch processing - Hadoop, Azkaban, Cubert
• Stream processing - Samza
• Iterative processing - Spark
• Egress
• Moving results from offline to online system
Iterative Processing using Spark
• Limitations of MapReduce
• What is Spark?
• Spark at LinkedIn
Limitations of MapReduce
• Iterative computation is slow
• Inefficient multi-pass computation
• Intermediate data is written to the distributed file system
Limitations of MapReduce
• Interactive computation is slow
• The same data is loaded again from the distributed file system for every query
Example: ADMM at LinkedIn
• Intermediate data is stored in the distributed file system: slow
[Figure: ADMM iterations on Hadoop; intermediate data in HDFS]
Spark
• Extends the programming language with a distributed data structure
• Resilient Distributed Datasets (RDDs)
• Can be stored in memory
• Faster iterative computation
• Faster interactive computation
• Clean APIs in Python, Scala, Java
• SQL, streaming, machine learning, and graph processing support
Matei Zaharia et al. NSDI 2012
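A toy illustration of the RDD idea (this is not Spark's actual API): a dataset remembers the recipe that derives it, and caching materializes it in memory so iterative passes reuse it instead of reloading from disk.

```python
class ToyRDD:
    """A dataset defined by a recipe, optionally materialized in memory."""
    def __init__(self, compute):
        self._compute = compute     # how to (re)build the data
        self._cache = None

    def map(self, fn):
        # Derived dataset: records its lineage, computes lazily
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        self._cache = list(self._compute())  # materialize once, in memory
        return self

    def collect(self):
        return self._cache if self._cache is not None else list(self._compute())

base = ToyRDD(lambda: range(5)).cache()   # loaded once, reused by each pass
squares = base.map(lambda x: x * x)
```

An uncached `ToyRDD` reruns its recipe on every `collect()`, which mirrors MapReduce reloading from HDFS; `cache()` is what makes repeated iterations cheap.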
Spark at LinkedIn
• ADMM on Spark
• Intermediate data is stored in memory: faster
[Figure: ADMM iterations on Spark; intermediate data in memory]
Outline
• Data Ingress
• Moving data from online to offline system
• Data Processing
• Batch processing - Hadoop, Azkaban, Cubert
• Iterative processing - Spark
• Stream processing - Samza
• Data Egress
• Moving results from offline to online system
Data Egress - Key/Value
• Key-value store: Voldemort
• Based on Amazon's Dynamo
• Distributed
• Scalable
• Bulk load from Hadoop
• Simple to use
• store results into ‘url’ using KeyValue(‘member_id’)
Sumbaly et al. FAST 2012
Data Egress - Streams
• Stream - Kafka
• Hadoop job as a Producer
• Service acts as Consumer
• Simple to use
• store data into ‘url’ using Stream(“topic=x“)
Goodhope et al., IEEE Data Eng. 2012
Conclusion
• Rich primitives for Data Ingress, Processing, Egress
• Data Ingress: Kafka, ETL
• Data Processing
• Batch processing - Hadoop, Cubert
• Stream processing - Samza
• Iterative processing - Spark
• Data Egress: Voldemort, Kafka
• Allows data scientists to focus on building Data Products
Future Opportunities
• Models of computation
• Efficient Graph processing
• Distributed Machine Learning
Acknowledgement
Thanks to data team at LinkedIn: data.linkedin.com
Contact: mtiwari@linkedin.com
@mitultiwari

 
Large-scale Social Recommendation Systems: Challenges and Opportunity
Large-scale Social Recommendation Systems: Challenges and OpportunityLarge-scale Social Recommendation Systems: Challenges and Opportunity
Large-scale Social Recommendation Systems: Challenges and OpportunityMitul Tiwari
 
Building Data Driven Products at Linkedin
Building Data Driven Products at LinkedinBuilding Data Driven Products at Linkedin
Building Data Driven Products at LinkedinMitul Tiwari
 
Social Network Analysis at LinkedIn
Social Network Analysis at LinkedInSocial Network Analysis at LinkedIn
Social Network Analysis at LinkedInMitul Tiwari
 

Plus de Mitul Tiwari (10)

Large scale social recommender systems at LinkedIn
Large scale social recommender systems at LinkedInLarge scale social recommender systems at LinkedIn
Large scale social recommender systems at LinkedIn
 
Modeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systemsModeling Impression discounting in large-scale recommender systems
Modeling Impression discounting in large-scale recommender systems
 
Large scale social recommender systems and their evaluation
Large scale social recommender systems and their evaluationLarge scale social recommender systems and their evaluation
Large scale social recommender systems and their evaluation
 
Metaphor: A system for related searches recommendations
Metaphor: A system for related searches recommendationsMetaphor: A system for related searches recommendations
Metaphor: A system for related searches recommendations
 
Related searches at LinkedIn
Related searches at LinkedInRelated searches at LinkedIn
Related searches at LinkedIn
 
Structural Diversity in Social Recommender Systems
Structural Diversity in Social Recommender SystemsStructural Diversity in Social Recommender Systems
Structural Diversity in Social Recommender Systems
 
Organizational Overlap on Social Networks and its Applications
Organizational Overlap on Social Networks and its ApplicationsOrganizational Overlap on Social Networks and its Applications
Organizational Overlap on Social Networks and its Applications
 
Large-scale Social Recommendation Systems: Challenges and Opportunity
Large-scale Social Recommendation Systems: Challenges and OpportunityLarge-scale Social Recommendation Systems: Challenges and Opportunity
Large-scale Social Recommendation Systems: Challenges and Opportunity
 
Building Data Driven Products at Linkedin
Building Data Driven Products at LinkedinBuilding Data Driven Products at Linkedin
Building Data Driven Products at Linkedin
 
Social Network Analysis at LinkedIn
Social Network Analysis at LinkedInSocial Network Analysis at LinkedIn
Social Network Analysis at LinkedIn
 

Dernier

Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作ys8omjxb
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Sonam Pathan
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxeditsforyah
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predieusebiomeyer
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Sonam Pathan
 
NSX-T and Service Interfaces presentation
NSX-T and Service Interfaces presentationNSX-T and Service Interfaces presentation
NSX-T and Service Interfaces presentationMarko4394
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhimiss dipika
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书rnrncn29
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一z xss
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书rnrncn29
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxDyna Gilbert
 

Dernier (17)

Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
 
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
Call Girls In The Ocean Pearl Retreat Hotel New Delhi 9873777170
 
Q4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptxQ4-1-Illustrating-Hypothesis-Testing.pptx
Q4-1-Illustrating-Hypothesis-Testing.pptx
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predi
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170
 
NSX-T and Service Interfaces presentation
NSX-T and Service Interfaces presentationNSX-T and Service Interfaces presentation
NSX-T and Service Interfaces presentation
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhi
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in  Rk Puram 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Rk Puram 🔝 9953056974 🔝 Delhi escort Service
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
 
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptx
 

Big Data Ecosystem at LinkedIn

• Scalable to very large clusters
• Fault tolerant
• Speculative execution, auto restart of failed jobs
• Scripting languages: Pig, Hive, Scalding
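The Map and Reduce primitives mentioned above can be illustrated with a minimal single-machine word-count sketch in plain Python (no Hadoop involved; the function names are ours, not part of any Hadoop API):

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user-defined map function to every input record."""
    for record in records:
        yield from map_fn(record)

def reduce_phase(pairs, reduce_fn):
    """Group intermediate (key, value) pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Classic word count: map emits (word, 1), reduce sums the ones.
lines = ["big data at linkedin", "data products at linkedin"]
counts = reduce_phase(
    map_phase(lines, lambda line: ((w, 1) for w in line.split())),
    lambda word, ones: sum(ones),
)
```

Real Hadoop distributes the map tasks across machines and shuffles the intermediate pairs to reducers over the network; the per-key grouping here stands in for that shuffle.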
Data Processing: Hadoop at LinkedIn
14
• Used for data products, feature computation, training models, analytics and reporting, troubleshooting, …
• Native MapReduce, Pig, Hive
• Workflows with 100s of Hadoop jobs
• 100s of workflows
• Processing petabytes of data every day
Data Processing Example: PYMK Feature Engineering
15
• How do people know each other?
• Triangle closing: Prob(Bob knows Carol) ~ the # of common connections (e.g., Alice connects both Bob and Carol)
Data Processing in Hadoop Example
16
How to do PYMK Triangle Closing in Hadoop:
-- connections in (source_id, dest_id) format in both directions
connections = LOAD 'connections' USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) AS (id1, id2);
-- second degree pairs (id1, id2): aggregate and count common connections
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE FLATTEN(group) AS (source_id, dest_id), COUNT(pairs) AS common_connections;
STORE common_conn INTO 'common_conn' USING PigStorage();
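The same triangle-closing logic can be sketched in plain Python to make it concrete (a toy emulation of the Pig job above, not the production implementation; `triangle_closing` is our name):

```python
from collections import defaultdict
from itertools import combinations

def triangle_closing(connections):
    """Count common connections for every 2nd-degree pair.
    `connections` holds (source_id, dest_id) edges in both directions."""
    by_source = defaultdict(list)
    for src, dst in connections:
        by_source[src].append(dst)
    common = defaultdict(int)
    for dests in by_source.values():
        # every pair of this member's connections shares them as a common connection
        for a, b in combinations(sorted(dests), 2):
            common[(a, b)] += 1
    return dict(common)

# Alice (1) and Dave (4) each know both Bob (2) and Carol (3).
edges = [(1, 2), (2, 1), (1, 3), (3, 1), (4, 2), (2, 4), (4, 3), (3, 4)]
scores = triangle_closing(edges)
```

Here Bob and Carol end up with a common-connection count of 2, via Alice and Dave; the grouping by `source_id` mirrors the `GROUP connections BY source_id` step.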
How to manage a Production Hadoop Workflow
17
Azkaban: Hadoop Workflow Management
18
• Configuration
• Dependency management
• Access control
• Scheduling and SLA management
• Monitoring, history
Distributed Machine Learning: ML-ease
20
• ADMM Logistic Regression for binary response prediction
Agarwal et al. 2014
Limitations of Hadoop: Join and Group By
21
-- Two datasets: A = (Salesman, Product), B = (Salesman, Location)
SELECT SomeAggregate() FROM A
INNER JOIN B ON A.Salesman = B.Salesman
GROUP BY A.Product, B.Location
• Common Hadoop MapReduce/Pig/Hive implementation:
• MapReduce pass 1: load the data, shuffle, and reduce to do the inner join; store the output
• MapReduce pass 2: load the above output, shuffle on the group-by keys, and aggregate on the reducer to generate the final output
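A toy in-memory emulation of that two-pass plan (the dataset values are invented for illustration; real Hadoop would write the join output to HDFS between the two passes, which is exactly the overhead the slide is pointing at):

```python
from collections import defaultdict

A = [("ann", "widget"), ("bob", "gadget"), ("ann", "gadget")]  # (Salesman, Product)
B = [("ann", "NYC"), ("bob", "SF")]                            # (Salesman, Location)

# Pass 1: shuffle both datasets on Salesman, perform the inner join on the "reducer".
by_salesman = defaultdict(lambda: ([], []))
for salesman, product in A:
    by_salesman[salesman][0].append(product)
for salesman, location in B:
    by_salesman[salesman][1].append(location)
joined = [(p, l) for prods, locs in by_salesman.values() for p in prods for l in locs]

# Pass 2: the materialized join output is re-shuffled on (Product, Location)
# and aggregated; a simple count stands in for SomeAggregate().
agg = defaultdict(int)
for product, location in joined:
    agg[(product, location)] += 1
```

Note that `joined` is fully materialized before the second shuffle begins; in the Hadoop version that intermediate dataset hits the distributed file system.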
Limitations of Triangle Closing Using Hadoop
22
• Large amount of data to shuffle from Mappers to Reducers
-- connections in (source_id, dest_id) format in both directions
connections = LOAD 'connections' USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) AS (id1, id2);
common_conn = GROUP pairs BY (id1, id2);
-- shuffling all 2nd degree connections: terabytes of data
common_conn = FOREACH common_conn GENERATE FLATTEN(group) AS (source_id, dest_id), COUNT(pairs) AS common_connections;
STORE common_conn INTO 'common_conn' USING PigStorage();
Cubert
23
• An open source project built for analytics needs
• Map-side aggregation
• Minimizes intermediate data and shuffling
• Fast and scalable primitives for joins and aggregation
• Partitions data into blocks
• Specialized operators: MeshJoin, Cube
• 5-60X faster in experiments
• Developer friendly - script-like
Vemuri et al. VLDB 2014
Cubert Design
24
• Language
• Scripting language
• Physical - write MR programs
• Execution
• Data movement: Shuffle, Blockgen, Combine, Pivot
• Primitives: MeshJoin, Cube
• Data blocks: partition of data by cost
Vemuri et al. VLDB 2014
Cubert Script: Count Daily/Weekly Stats
25
JOB "create blocks of the fact table"
  MAP {
    data = LOAD ("$FactTable", $weekAgo, $today) USING AVRO();
  }
  // create blocks of one week of data with a cost function
  BLOCKGEN data BY ROW 1000000 PARTITIONED ON userId;
  STORE data INTO "$output/blocks" USING RUBIX;
END
JOB "compute cubes"
  MAP {
    data = LOAD "$output/blocks" USING RUBIX;
    // create a new column 'todayUserId' for today's records only
    data = FROM data GENERATE country, locale, userId, clicks, CASE(timestamp == $today, userId) AS todayUserId;
  }
  // create the three cubes in a single job to count daily/weekly users and clicks
  CUBE data BY country, locale INNER userId AGGREGATES COUNT_DISTINCT(userId) AS weeklyUniqueUsers, COUNT_DISTINCT(todayUserId) AS dailyUniqueUsers, SUM(clicks) AS totalClicks;
  STORE data INTO "$output/results" USING AVRO();
END
Vemuri et al. VLDB 2014
Cubert Example: Join and Group By
26
-- Two datasets: A = (Salesman, Product), B = (Salesman, Location)
SELECT SomeAggregate() FROM A
INNER JOIN B ON A.Salesman = B.Salesman
GROUP BY A.Product, B.Location
• Sort A by Product and B by Location
• Divide A and B into specialized blocks sorted by the group-by keys
• Load A's blocks in memory and stream B's blocks to join
• Group by can be performed immediately after the join
Vemuri et al. VLDB 2014
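A rough Python sketch of the idea, heavily simplified from Cubert's actual MeshJoin: both inputs are co-partitioned into blocks on the join key, each A-block is held in memory while the matching B-block streams past it, and the aggregate is updated immediately after each join match, so no second shuffle is needed (the datasets and the `blocks` helper are our invention for illustration):

```python
from collections import defaultdict

A = [("ann", "widget"), ("bob", "gadget"), ("ann", "gadget")]  # (Salesman, Product)
B = [("ann", "NYC"), ("bob", "SF")]                            # (Salesman, Location)

def blocks(rows, num_blocks=2):
    """Co-partition rows into blocks by hashing the join key (Salesman)."""
    out = [[] for _ in range(num_blocks)]
    for row in rows:
        out[hash(row[0]) % num_blocks].append(row)
    return out

agg = defaultdict(int)
for a_block, b_block in zip(blocks(A), blocks(B)):
    in_memory = defaultdict(list)            # A's block is loaded in memory
    for salesman, product in a_block:
        in_memory[salesman].append(product)
    for salesman, location in b_block:       # B's block streams past it
        for product in in_memory[salesman]:
            agg[(product, location)] += 1    # aggregate right after the join
```

Because corresponding blocks land on the same worker, the join and the group-by complete in a single pass, which is the source of the speedups the slide quotes.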
Cubert Example: Triangle Closing
27
• Divide connections (src, dest) into blocks
• Duplicate the connection graph: G1, G2
• Sort G1 edges (src, dest) by src
• Sort G2 edges (src, dest) by dest
• MeshJoin G1 and G2 such that G1.dest = G2.src
• Aggregate by (G1.src, G2.dest) to get the number of common connections
• Speedup of 50%
Cubert Summary
28
• Built for analytics needs
• Faster and scalable: 5-60X
• Working well in practice
Vemuri et al. VLDB 2014
Outline
29
• Ingress
• Moving data from online to offline system
• Offline Processing
• Batch processing - Hadoop, Azkaban, Cubert
• Stream processing - Samza
• Iterative processing - Spark
• Egress
• Moving results from offline to online system
Samza
30
• Samza: streaming computation
• Built on top of a messaging layer like Kafka for input/output
• Low latency
• Stateful processing through a local store
• Many use cases at LinkedIn
• Site-speed monitoring
• Data standardization
Samza: Site Speed Monitoring
31
• LinkedIn homepage assembled by calling many services
• Each service logs through Kafka what went on with a request Id
Samza: Site Speed Monitoring
32
• The complete record of a request is scattered across Kafka logs
• Problem: combine these logs to generate a holistic view
Samza: Site Speed Monitoring
33
• Hadoop/MR: join the logs using the request Id - once a day
• Too late to troubleshoot any issue
• Samza: join the Kafka logs using the request Id in near real time
Samza: Site Speed Monitoring
34
• Samza: join the Kafka logs using the request Id in near real time
• Two jobs
• Partition the Kafka stream by request Id
• Aggregate all the records for a request Id
Fernandez et al. CIDR 2015
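The two-job structure can be sketched as a toy in-memory pipeline (all names and sample events are our invention; a real Samza job would consume from Kafka topics and keep its aggregation state in a local store rather than a Python dict):

```python
from collections import defaultdict

# Service-call events from many Kafka topics, each tagged with a request id.
events = [
    {"request_id": "r1", "service": "frontend", "latency_ms": 120},
    {"request_id": "r2", "service": "frontend", "latency_ms": 95},
    {"request_id": "r1", "service": "pymk", "latency_ms": 40},
    {"request_id": "r1", "service": "news", "latency_ms": 25},
    {"request_id": "r2", "service": "pymk", "latency_ms": 60},
]

def repartition(stream, num_partitions=2):
    """Job 1: route every event to a partition keyed by its request id,
    so all records for one request land in the same place."""
    partitions = [[] for _ in range(num_partitions)]
    for event in stream:
        partitions[hash(event["request_id"]) % num_partitions].append(event)
    return partitions

def assemble(partitions):
    """Job 2: within each partition, gather all records for a request id
    into one holistic view of that request."""
    trees = defaultdict(list)
    for partition in partitions:
        for event in partition:
            trees[event["request_id"]].append(event["service"])
    return trees

trees = assemble(repartition(events))
```

The repartition/aggregate split mirrors the map/reduce shape the speaker notes mention, but runs continuously over the stream instead of once a day.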
Outline
35
• Ingress
• Moving data from online to offline system
• Offline Processing
• Batch processing - Hadoop, Azkaban, Cubert
• Stream processing - Samza
• Iterative processing - Spark
• Egress
• Moving results from offline to online system
Iterative Processing using Spark
36
• Limitations of MapReduce
• What is Spark?
• Spark at LinkedIn
Limitations of MapReduce
37
• Iterative computation is slow
• Inefficient multi-pass computation
• Intermediate data written to a distributed file system
Limitations of MapReduce
38
• Interactive computation is slow
• The same data is loaded again from the distributed file system
Example: ADMM at LinkedIn
39
• Intermediate data is stored in HDFS (a distributed file system) - slow
Spark
40
• Extends the programming language with a distributed data structure
• Resilient Distributed Datasets (RDDs)
• Can be stored in memory
• Faster iterative computation
• Faster interactive computation
• Clean APIs in Python, Scala, Java
• SQL, streaming, machine learning, and graph processing support
Matei Zaharia et al. NSDI 2012
Spark at LinkedIn
41
• ADMM on Spark
• Intermediate data is stored in memory - faster
Outline
42
• Data Ingress
• Moving data from online to offline system
• Data Processing
• Batch processing - Hadoop, Azkaban, Cubert
• Iterative processing - Spark
• Stream processing - Samza
• Data Egress
• Moving results from offline to online system
Data Egress - Key/Value
43
• Key-value store: Voldemort
• Based on Amazon's Dynamo
• Distributed
• Scalable
• Bulk load from Hadoop
• Simple to use
• store results into 'url' using KeyValue('member_id')
Sumbaly et al. FAST 2012
Data Egress - Streams
44
• Stream: Kafka
• Hadoop job acts as a Producer
• Service acts as a Consumer
• Simple to use
• store data into 'url' using Stream("topic=x")
Goodhope et al., IEEE Data Eng. 2012
Conclusion
45
• Rich primitives for Data Ingress, Processing, and Egress
• Data Ingress: Kafka, ETL
• Data Processing
• Batch processing - Hadoop, Cubert
• Stream processing - Samza
• Iterative processing - Spark
• Data Egress: Voldemort, Kafka
• Allow data scientists to focus on building Data Products
Future Opportunities
46
• Models of computation
• Efficient graph processing
• Distributed machine learning
Acknowledgement
47
Thanks to the data team at LinkedIn: data.linkedin.com
Contact: mtiwari@linkedin.com @mitultiwari

Editor's Notes

  1. Hi Everyone. I am Mitul Tiwari. Today I am going to talk about Big Data Ecosystem at LinkedIn.
  2. LinkedIn is the largest professional network with more than 360M members and it’s growing fast with more than 2 members joining per second. What’s LinkedIn’s Mission? … LinkedIn’s mission is to connect the world’s professionals and make them more productive and successful. - Members can connect with each other and maintain their professional network on linkedin.
  3. A rich recommender ecosystem at LinkedIn: connections, news, skills, jobs, companies, groups, search queries, talent, similar profiles, ...
  4. How do we build these data driven products? Building these data products involve three major steps. First, moving production data from online to offline system. Second, processing data in the offline system using technologies such as Hadoop, Samza, Spark. And finally, moving the results or processed data from offline to online serving system.
  5. Let’s take a concrete data product example of People You May Know at LinkedIn. Production data such as database data, activity data is moved to offline system. Offline system processes this data to generate PYMK recommendations for each member. This recommendation output is stored in a key value store Voldemort. Production system query this store to get PYMK recommendation for a given member and serve it online. In aAny deployed large-scale recommendation systems has to deal with scaling challenges high level design Kafka, Voldemort citations, url to Azkaban
  6. Let me talk about each of these three steps in more detail, starting with ingress, that is, moving data from the online system to the offline system.
  7. There are various types of data at LinkedIn in the online production system. Database data contains member information such as profiles and connections; this is persistent data that the member has provided. Activity data captures member activities, such as which pages a member viewed or which People You May Know results were shown (impressed) to them. Performance and system metrics of the online serving system are also stored, to monitor its health. Finally, each online service generates various kinds of log information, for example the request parameters used by the People You May Know backend service while serving results.
  8. The initial solution built for data ingress was point to point: each production service had many offline clients, and data was transferred from a production service directly to an offline system. This has many limitations. First, O(N^2) data integration complexity, since each online system could be transferring data to every offline system. Second, it is fragile and easy to break: it is very hard to monitor the correctness of the data flow, and the O(N^2) complexity can easily overload a service or data pipeline, resulting in delayed or lost data. Finally, it is very hard to standardize, since each point-to-point transfer can come up with its own schema.
  9. At LinkedIn we have built a centralized data pipeline. This reduces the point-to-point data transfer complexity to O(N), lets us build a more reliable pipeline, and makes the pipeline standardizable.
  10. At LinkedIn we have built an open source data ingress pipeline called Kafka. Kafka is a publish subscribe messaging system. Producers of data (such as online serving systems) send data to brokers, and consumers (such as the offline system) read messages from brokers. Messages are sent to a particular topic; for example, PYMK impressions are sent to a topic such as PYMKImpressionTopic. Each topic is broken into one or more ordered partitions of messages.
  11. Kafka uses a standardized schema for each topic. We use Avro, which is like a JSON schema with superior serialization/deserialization properties. There is a central repository of schemas, and both producers and consumers of a topic use the same schema. Kafka also simplifies data verification using audits on the number of produced and consumed messages, and it facilitates ETL of data into Hadoop using map-only jobs that load data from the brokers. For more details, check out the IEEE Data Engineering paper.
  12. Once we have data available in offline data processing system from production, we use various technologies such as Hadoop, Samza, and Spark to process this data. Let me start with talking about batch processing technologies based on Hadoop.
  13. Hadoop has been very successful at scaling offline computation. Hadoop eased distributed programming by providing simple high-level primitives like Map and Reduce functions. It is scalable to a very large cluster, and MapReduce provides fault-tolerance features like speculative execution and automatically restarting failed tasks. Many scripting languages like Pig, Hive, and Scalding are built on top of Hadoop to further ease programming.
  14. At LinkedIn, Hadoop is in use for building data products, feature computation, training machine learning models, business analytics, troubleshooting by analyzing data, etc. We have workflows with 100s of Hadoop MapReduce jobs, and 100s of such workflows. Daily we process petabytes of data on Hadoop.
  15. One good signal is common connections: Bob and Carol are likely to know each other if they share a common connection, and as the number of common connections increases, the likelihood of the two people knowing each other increases.
  16. Here is an example of data processing using Hadoop. For PYMK an important feature is triangle closing: finding second-degree connections and the number of common connections between two members. Here is a Pig script that computes that.
  17. Here is the PYMK production Azkaban Hadoop workflow, which involves dozens of Hadoop jobs and dependencies. It looks complicated, but it's trivial to manage such workflows using Azkaban.
  18. How to manage
  19. After feature engineering and getting features such as triangle closing and organizational overlap scores for schools and companies, we apply a machine learning model to predict the probability of two people knowing each other. We also incorporate explicit and implicit user feedback to enhance the connection probability, and we use past connections as the positive response variable to train the model.
  20. ADMM stands for Alternating Direction Method of Multipliers (Boyd et al. 2011). The basic idea of ADMM is as follows: ADMM treats large-scale logistic regression model fitting as a convex optimization problem with constraints. While minimizing the user-defined loss function, it enforces an extra constraint that the coefficients from all partitions have to be equal. To solve this optimization problem, ADMM uses an iterative process. In each iteration it partitions the big data into many small partitions and fits an independent logistic regression for each partition. Then it aggregates the coefficients collected from all partitions, learns the consensus coefficients, and sends them back to all partitions to retrain. After 10-20 iterations, it ends up with a converged solution that is theoretically close to what you would have obtained if you had trained on a single machine.
  21. TODO: get comfortable
  22. TODO: get comfortable with this slide
  23. Load one week of data and build an OLAP cube over country and locale as dimensions, for unique users over the week, unique users for today, as well as the total number of clicks.
  24. TODO: get comfortable
  25. TODO: revise
  26. TODO: get comfortable
  27. Consider what data is necessary to build a particular view of the LinkedIn home page. We provide interesting news via Pulse, timely updates from your connections in the Network Update Stream, potential new connections from People You May Know, advertisements targeted to your background, and much, much more. Each service publishes its logs to its own specific Kafka topic, which is named after the service, i.e. <service>_service_call. There are hundreds of these topics, one for each service, and they share the same Avro schema, which allows them to be analyzed together. This schema includes timing information, who called whom, what was returned, etc., as well as the specifics of what each particular service call did. Additionally, log4j-style warnings and errors are routed to Kafka in a separate <service>_log_event topic.
  28. After a request has been satisfied, the complete record of all the work that went into generating it is scattered across the Kafka logs for each service that participated. These individual logs are great tools for evaluating the performance and correctness of the individual services themselves, and are carefully monitored by the service owners. But how can we use these individual elements to gain a larger view of the entire chain of calls that created that page? Such a perspective would allow us to see how the calls are interacting with each other, identify slow services or highlight redundant or unnecessary calls.
  29. By creating a unique value or GUID for each call at the front end and propagating that value across all subsequent service calls, it's possible to tie them together and define a tree structure of the calls, starting from the front end all the way through to the leaf service events. We call this value the TreeID and have built one of the first production Samza workflows at LinkedIn around it: the Call Graph Assembly (CGA) pipeline. All events involved in building the page now have such a TreeID, making it a powerful key on which to join data in new and fascinating ways. The CGA pipeline consists of two Samza jobs: the first repartitions the events coming from the sundry service-call Kafka topics, creating a new key from their TreeIDs, while the second job assembles those repartitioned events into trees corresponding to the original calls from the front-end request. This two-stage approach looks quite similar to the classic MapReduce approach, where mappers direct records to the correct reducer and reducers then aggregate them together. We expect this will be a common pattern in Samza jobs, particularly those implementing continuous, stream-based versions of work that had previously been done in batch fashion on Hadoop.
  31. That concludes my brief discussion on Stream processing using Samza. Next I am going to talk about iterative processing using Spark.
  32. ADMM example
  33. ADMM example
  34. ADMM example
  35. ADMM example
  36. TODO: add reference
  37. TODO: get comfortable
  38. TODO: revise - add lessons, opportunities
  39. TODO: revise