©2014 LinkedIn Corporation. All Rights Reserved.
Gobblin’ Big Data with Ease
Lin Qiao
Data Analytics Infra @ LinkedIn
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/gobblin-linkedin
Presented at QCon San Francisco
www.qconsf.com
Overview
• Challenges
• What does Gobblin provide?
• How does Gobblin work?
• Retrospective and lookahead
Perception
[Diagram labels: Primary Data Sources; Ingest Framework; Load and Validation (each shown twice); Analytics Platform; Transformations; Business Facing Insights; Member Facing Insights and Data Products]
Reality
[Diagram labels: Kafka (Activity (tracking) Data); R/W store (Oracle/Espresso) (Profile Data); Databus Changes; External Partner Data; Change dump on filer; Ingest Framework; Ingest utilities; Camus; Lumos; Lassen (facts and dimensions); Hadoop; Teradata; DWH ETL (fact tables); Core Data Set (Tracking, Database, External); Derived Data Set; Computed Results for Member Facing Products; Read store (Voldemort); Site (Member Facing Products); Product, Sciences, Enterprise Analytics; Enterprise Products]
Challenges @ LinkedIn
• Large variety of data sources
• Multi-paradigm: streaming data, batch data
• Different types of data: facts, dimensions, logs,
snapshots, increments, changelog
• Operational complexity of multiple pipelines
• Data quality
• Data availability and predictability
• Engineering cost
Open source solutions
• Sqoop
• Flume
• Morphline
• RDBMS vendor-specific connectors
• Aegisthus
• Logstash
• Camus
Goals
• Unified and Structured Data Ingestion Flow
– RDBMS -> Hadoop
– Event Streams -> Hadoop
• Higher level abstractions
– Facts, Dimensions
– Snapshots, increments, changelog
• ELT oriented
– Minimize transformation in the ingest pipeline
Central Ingestion Pipeline
[Diagram labels: Kafka (Tracking); R/W store (Oracle/Espresso) (OLTP Data); Databus Changes; External Partner Data; Change dump on filer; REST, JDBC, SOAP, Custom; Gobblin; Compaction; Hadoop; Teradata; DWH ETL (fact tables); Core Data Set (Tracking, Database, External); Derived Data Set; Site (Member Facing Products); Product, Sciences, Enterprise Analytics; Enterprise Products]
Gobblin Usage @ LinkedIn
• Business Analytics
– Source data for sales analysis, product sentiment
analysis, etc.
• Engineering
– Source data for issue tracking, monitoring, product
release, security compliance, A/B testing
• Consumer product
– Source data for acquisition integration
– Performance analysis for email campaign, ads
campaign, etc.
Key Features
• Horizontally scalable and robust framework
• Unified computation paradigm
• Turn-key solution
• Customize your own ingestion
Scalable and Robust Framework
• Scalable: Jobs are partitioned into tasks that run concurrently
• Centralized State Management: State is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc.
• Fault Tolerant: Framework gracefully deals with machine and job failures
• Quality Assurance: Baked-in quality checking throughout the flow
Unified computation paradigm
• Common execution flow: Common execution flow between batch ingestion and streaming ingestion pipelines
• Shared infra components: Shared job state management, job metrics store, metadata management
Turn Key Solution
• Built-in Exchange Protocols: Existing adapters can easily be re-used for sources with common protocols (e.g. JDBC, REST, SFTP, SOAP, etc.)
• Built-in Source Integration: Fully integrated with commonly used sources, including MySQL, SQLServer, Oracle, SalesForce, HDFS, filer, and internal dropbox
• Built-in Data Ingestion Semantics: Covers full dump and incremental ingestion for fact and dimension datasets
• Policy driven flow execution & tuning: Flow owners just need to specify pre-defined policy for handling job failure, degree of parallelism, what data to publish, etc.
Customize Your Own Ingestion Pipeline
• Extendable Operators: Operators for doing extraction, conversion, quality checking, data persistence, etc., can be implemented or extended against a common API
• Configurable Operator Flow: Configuration allows for multiple plugin points to add in customized logic and code
Under the Hood
Computation Model
• Gobblin standalone
– single process, multi-threading
– Testing, small data, sampling
• Gobblin on Map/Reduce
– Large datasets, horizontally scalable
• Gobblin on Yarn
– Better resource utilization
– More scheduling flexibility
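The standalone mode (single process, multi-threading) can be pictured with a small sketch. This is not Gobblin code: `run_job_standalone` and its arguments are hypothetical names, and a work unit is assumed to be any object the task function accepts.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job_standalone(work_units, run_task, threads=4):
    """Run one task per work unit on a thread pool inside a single process.

    Results come back in the same order as the work units.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(run_task, work_units))


# Each hypothetical work unit is a range of row ids; the task just counts rows.
row_counts = run_job_standalone([range(0, 3), range(3, 10)], lambda unit: len(unit))
```

The same task function could in principle be handed to a different launcher (Map/Reduce, Yarn) without changes, which is the point of separating the computation model from the ingestion logic.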
Scalable Ingestion Flow
[Diagram: a Source produces three Work Units; each Work Unit runs as a Task containing Extractor, Converter, Quality Checker, and Writer; a Data Publisher follows the tasks]
Sources
• Determines how to partition work
- Partitioning algorithm can leverage source sharding
- Group partitions intelligently for performance
• Creates work-units to be scheduled
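As a rough illustration of partitioning, the sketch below splits a watermark range into contiguous work units. The `WorkUnit` class, `create_work_units`, and the idea of partitioning on an integer watermark column are assumptions made for this example, not Gobblin's actual API or algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkUnit:
    """Hypothetical work unit: a half-open range [low, high) of a watermark column."""
    low: int
    high: int


def create_work_units(low: int, high: int, max_units: int) -> List[WorkUnit]:
    """Split [low, high) into at most max_units contiguous, non-overlapping ranges."""
    if max_units < 1:
        raise ValueError("max_units must be at least 1")
    if high <= low:
        return []  # nothing new to pull since the last run
    total = high - low
    size = -(-total // max_units)  # ceiling division
    return [WorkUnit(start, min(start + size, high))
            for start in range(low, high, size)]


units = create_work_units(low=0, high=10, max_units=3)
```

A real source could instead follow the shards of the underlying system, as the slide suggests, so that each work unit reads from a single shard.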
Job Management
• Job execution states
– Watermark
– Task state, job state, quality checker output, error code
• Job synchronization
• Job failure handling: policy driven
[Diagram: Job run 1, Job run 2, and Job run 3 connected to a shared State Store]
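A minimal sketch of how state might be carried from one job run to the next. The file-backed `StateStore`, the `run_job` helper, and the use of a plain integer as the watermark are all assumptions for illustration; the slides do not describe the actual storage format.

```python
import json
import os
import tempfile


class StateStore:
    """Hypothetical file-backed store: one JSON document of job name -> state."""

    def __init__(self, path):
        self.path = path

    def _read_all(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def load(self, job_name):
        return self._read_all().get(job_name, {})

    def save(self, job_name, state):
        all_states = self._read_all()
        all_states[job_name] = state
        with open(self.path, "w") as f:
            json.dump(all_states, f)


def run_job(store, job_name, source_rows):
    """Pull only rows above the previous run's watermark, then persist the new one."""
    previous = store.load(job_name).get("watermark", 0)
    pulled = [row for row in source_rows if row > previous]
    new_watermark = max(pulled, default=previous)
    store.save(job_name, {"watermark": new_watermark, "rows_pulled": len(pulled)})
    return pulled


store = StateStore(os.path.join(tempfile.mkdtemp(), "state.json"))
first_run = run_job(store, "demo", [1, 2, 3])
second_run = run_job(store, "demo", [1, 2, 3, 4, 5])
```

The second run sees the watermark left by the first and pulls only the new rows, which is the behaviour the watermark bullet refers to.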
Gobblin Operator Flow
[Diagram steps: Extract Schema; Convert Schema; Extract Record; Convert Record; Check Record Data Quality; Write Record; Check Task Data Quality; Commit Task Data]
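The operator flow can be sketched as a single function that wires the steps together. Everything here is hypothetical: `run_task` and its callable parameters only mirror the step names on the slide, and the ordering of the schema steps before the record loop is an assumption.

```python
def run_task(extract_schema, convert_schema, extract_records,
             convert_record, check_record, write_record, check_task, commit):
    """Run one task through the operator flow; every argument is a plain callable."""
    schema = convert_schema(extract_schema())
    written = 0
    for record in extract_records():
        converted = convert_record(schema, record)
        if check_record(converted):   # record-level quality check
            write_record(converted)
            written += 1
    if not check_task(written):       # task-level quality check
        return False
    commit()
    return True


staged, committed = [], []
ok = run_task(
    extract_schema=lambda: ["id", "name"],
    convert_schema=lambda s: [c.upper() for c in s],
    extract_records=lambda: iter([(1, "a"), (2, None), (3, "c")]),
    convert_record=lambda schema, r: dict(zip(schema, r)),
    check_record=lambda r: r["NAME"] is not None,
    write_record=staged.append,
    check_task=lambda n: n > 0,
    commit=lambda: committed.extend(staged),
)
```

In this toy run the record with a missing name is dropped by the record-level check, and the remaining two records are committed only after the task-level check passes.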
Extractors
• Specifies how to get the schema and pull data from the source
• Returns a ResultSet iterator
• Tracks the high watermark
• Tracks extraction metrics
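A small sketch of an extractor that iterates over rows while tracking a high watermark and a record count. `ListExtractor` is a made-up class over an in-memory table, and treating the first column as the watermark column is an assumption for the example.

```python
class ListExtractor:
    """Hypothetical extractor over an in-memory table of (id, value) rows."""

    def __init__(self, rows, low_watermark):
        self._rows = rows
        self._low = low_watermark
        self.high_watermark = low_watermark
        self.records_read = 0  # a simple extraction metric

    def get_schema(self):
        return ("id", "value")

    def __iter__(self):
        # Yield only rows above the low watermark and remember the largest id seen.
        for row in self._rows:
            if row[0] > self._low:
                self.high_watermark = max(self.high_watermark, row[0])
                self.records_read += 1
                yield row


extractor = ListExtractor([(1, "a"), (5, "b"), (3, "c")], low_watermark=2)
pulled = list(extractor)
```

After the iteration finishes, `extractor.high_watermark` is the value a job would persist so that the next run starts from there.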
Converters
• Allow for schema and data transformation
– Filtering
– Projection
– Type conversion
– Structural change
• Composable: can specify a list of converters to be applied in
the given order
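Composability can be illustrated with converters written as plain functions applied in list order. The helper names (`project`, `cast`, `keep_if`, `apply_converters`) and the convention that a converter returns `None` to drop a record are assumptions for this sketch.

```python
def project(fields):
    """Projection: keep only the named fields."""
    def convert(record):
        return {k: record[k] for k in fields}
    return convert


def cast(field, to_type):
    """Type conversion: replace one field with its converted value."""
    def convert(record):
        return {**record, field: to_type(record[field])}
    return convert


def keep_if(predicate):
    """Filtering: return None to drop a record that fails the predicate."""
    def convert(record):
        return record if predicate(record) else None
    return convert


def apply_converters(converters, record):
    """Apply converters in the given order; stop early if a record is dropped."""
    for convert in converters:
        record = convert(record)
        if record is None:
            return None
    return record


chain = [project(["id", "age"]), cast("age", int), keep_if(lambda r: r["age"] >= 18)]
adult = apply_converters(chain, {"id": 1, "age": "42", "note": "x"})
minor = apply_converters(chain, {"id": 2, "age": "7", "note": "y"})
```

Order matters here: the filter relies on `age` already being an integer, so it has to come after the type conversion.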
Quality Checkers
• Ensure quality of any data produced by Gobblin
• Can be run on a per record, per task, or per job basis
• Can specify a list of quality checkers to be applied
– Schema compatibility
– Audit check
– Sensitive fields
– Unique key
• Policy driven
– FAIL – if the check fails then so does the job
– OPTIONAL – if the check fails the job continues
– ERR_FILE – the offending row is written to an error file
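The three policies can be sketched for record-level checks as follows. The `Policy` enum reuses the names from the slide, but `check_rows`, `QualityCheckFailed`, and the use of an in-memory list in place of an error file are assumptions for illustration.

```python
from enum import Enum


class Policy(Enum):
    FAIL = "FAIL"
    OPTIONAL = "OPTIONAL"
    ERR_FILE = "ERR_FILE"


class QualityCheckFailed(Exception):
    """Raised when a check with the FAIL policy does not pass."""


def check_rows(rows, checks, error_rows):
    """Apply (predicate, Policy) pairs to every row and return the rows that pass.

    error_rows stands in for the error file and collects rows rejected by
    ERR_FILE checks.
    """
    passed = []
    for row in rows:
        keep = True
        for predicate, policy in checks:
            if predicate(row):
                continue
            if policy is Policy.FAIL:
                raise QualityCheckFailed(f"row failed a FAIL check: {row!r}")
            if policy is Policy.ERR_FILE:
                error_rows.append(row)
                keep = False
                break
            # Policy.OPTIONAL: the failure is tolerated and the row is kept
        if keep:
            passed.append(row)
    return passed


errors = []
checks = [(lambda r: r.get("id") is not None, Policy.ERR_FILE),
          (lambda r: "email" in r, Policy.OPTIONAL)]
good = check_rows([{"id": 1, "email": "a@x"}, {"id": None}, {"id": 2}], checks, errors)
```

Here the row without an id is diverted to the error list, while the row without an email is kept because that check is optional.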
Writers
• Writing data in Avro format onto HDFS
– One writer per task
• Flexibility
– Configurable compression codec (Deflate, Snappy)
– Configurable buffer size
• Plan to support other data formats (Parquet, ORC)
Publishers
• Determines job success based on Policy.
- COMMIT_ON_FULL_SUCCESS
- COMMIT_ON_PARTIAL_SUCCESS
• Commits data to final directories based on job success.
[Diagram: Task 1, Task 2, and Task 3 write File 1, File 2, and File 3 to a Tmp Dir; the files then appear in the Final Dir]
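A minimal sketch of the publish step under the two commit policies. The `publish` function, the mapping of file names to task outcomes, and the reading of COMMIT_ON_PARTIAL_SUCCESS as "move only the files of successful tasks" are assumptions; the slide names the policies but does not define them in detail.

```python
import os
import shutil
import tempfile

POLICIES = {"COMMIT_ON_FULL_SUCCESS", "COMMIT_ON_PARTIAL_SUCCESS"}


def publish(task_results, tmp_dir, final_dir, policy):
    """Move staged files to final_dir according to the commit policy.

    task_results maps a staged file name to True (task succeeded) or False.
    Returns the sorted names of the files that were published.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    succeeded = [name for name, ok in task_results.items() if ok]
    if policy == "COMMIT_ON_FULL_SUCCESS" and len(succeeded) < len(task_results):
        return []  # leave everything in tmp_dir; nothing is published
    os.makedirs(final_dir, exist_ok=True)
    for name in succeeded:
        shutil.move(os.path.join(tmp_dir, name), os.path.join(final_dir, name))
    return sorted(succeeded)


root = tempfile.mkdtemp()
tmp_dir, final_dir = os.path.join(root, "tmp"), os.path.join(root, "final")
os.makedirs(tmp_dir)
for name in ("file1", "file2", "file3"):
    with open(os.path.join(tmp_dir, name), "w") as f:
        f.write("data")

results = {"file1": True, "file2": False, "file3": True}
full = publish(results, tmp_dir, final_dir, "COMMIT_ON_FULL_SUCCESS")
partial = publish(results, tmp_dir, final_dir, "COMMIT_ON_PARTIAL_SUCCESS")
```

Writing to a temporary directory first means readers of the final directory never see half-written output, whichever policy is chosen.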
Gobblin Compaction
• Dimensions:
– Initial full dump followed by incremental extracts in
Gobblin
– Maintain a consistent snapshot by doing regularly
scheduled compaction
• Facts:
– Merge small files
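For dimension data, compaction can be pictured as merging incremental extracts into the last snapshot and keeping the newest row per key. The `compact` function and the `id` and `modified` field names are hypothetical; the slides do not show how Gobblin's compaction is implemented.

```python
def compact(snapshot, increments, key="id", version="modified"):
    """Merge incremental rows into a snapshot, keeping the newest row per key."""
    latest = {row[key]: row for row in snapshot}
    for row in increments:
        current = latest.get(row[key])
        # Take the incoming row if the key is new or the row is at least as recent.
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


snapshot = [
    {"id": 1, "modified": 1, "title": "a"},
    {"id": 2, "modified": 1, "title": "b"},
]
increments = [
    {"id": 2, "modified": 3, "title": "b2"},
    {"id": 3, "modified": 2, "title": "c"},
    {"id": 2, "modified": 2, "title": "stale"},
]
compacted = compact(snapshot, increments)
```

The out-of-order update for key 2 is ignored because a newer version has already been merged, so the result is a consistent snapshot regardless of arrival order.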
[Diagram labels: Ingestion, HDFS, Compaction]
Gobblin in Production
• Data Volume
– > 350 datasets
– ~ 60 TB per day
• Production Instances
– Salesforce
– Responsys
– RightNow
– Timeforce
– Slideshare
– Newsle
– A/B testing
– LinkedIn JIRA
– Data retention
Lessons Learned
• Data quality still needs a lot more work
• Small data problem is not small
• Performance optimization opportunities
• Operational traits
Gobblin Roadmap
• Gobblin on Yarn
• Streaming Sources
• Gobblin Workbench with ingestion DSL
• Data Profiling for richer quality checking
• Open source in Q4’14
Gobblin: A Framework for Solving Big Data Ingestion Problem

  • 1. Gobblin’ Big Data with Ease
      Lin Qiao, Data Analytics Infra @ LinkedIn
      ©2014 LinkedIn Corporation. All Rights Reserved.
  • 2. InfoQ.com: News & Community Site
      • 750,000 unique visitors/month
      • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
      • Post content from our QCon conferences
      • News 15-20 / week
      • Articles 3-4 / week
      • Presentations (videos) 12-15 / week
      • Interviews 2-3 / week
      • Books 1 / month
      Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/gobblin-linkedin
  • 3. Purpose of QCon
      - to empower software development by facilitating the spread of knowledge and innovation
      Strategy
      - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
      - speakers and topics driving the evolution and innovation
      - connecting and catalyzing the influencers and innovators
      Highlights
      - attended by more than 12,000 delegates since 2007
      - held in 9 cities worldwide
      Presented at QCon San Francisco, www.qconsf.com
  • 4. Overview
      • Challenges
      • What does Gobblin provide?
      • How does Gobblin work?
      • Retrospective and lookahead
  • 5. Overview
      • Challenges
      • What does Gobblin provide?
      • How does Gobblin work?
      • Retrospective and lookahead
  • 6. Perception
      (Diagram labels: Primary Data Sources; Ingest Framework, with Load and Validation; Analytics Platform, with Transformations; Business Facing Insights; Member Facing Insights and Data Products)
  • 7. Reality
      (Diagram labels: Hadoop; Camus; Lumos; Teradata; External Partner Data; Ingest Framework; DWH ETL (fact tables); Product, Sciences, Enterprise Analytics; Site (Member Facing Products); Kafka; Activity (tracking) Data; R/W store (Oracle/Espresso); Profile Data; Databus; Changes; Derived Data Set; Core Data Set (Tracking, Database, External); Computed Results for Member Facing Products; Enterprise Products; Change dump on filer; Ingest utilities; Lassen (facts and dimensions); Read store (Voldemort))
  • 8. Challenges @ LinkedIn
      • Large variety of data sources
      • Multi-paradigm: streaming data, batch data
      • Different types of data: facts, dimensions, logs, snapshots, increments, changelog
      • Operational complexity of multiple pipelines
      • Data quality
      • Data availability and predictability
      • Engineering cost
  • 9. Open source solutions
      sqoop, flume, morphline, RDBMS vendor-specific connectors, aegisthus, logstash, Camus
  • 10. Goals
      • Unified and Structured Data Ingestion Flow
          – RDBMS -> Hadoop
          – Event Streams -> Hadoop
      • Higher level abstractions
          – Facts, Dimensions
          – Snapshots, increments, changelog
      • ELT oriented
          – Minimize transformation in the ingest pipeline
  • 11. Central Ingestion Pipeline
      (Diagram labels: Hadoop; Teradata; External Partner Data; Gobblin; DWH ETL (fact tables); Product, Sciences, Enterprise Analytics; Site (Member Facing Products); Kafka; Tracking; R/W store (Oracle/Espresso); OLTP Data; Databus; Changes; Derived Data Set; Core Data Set (Tracking, Database, External); Enterprise Products; Change dump on filer; REST; JDBC; SOAP; Custom; Compaction)
  • 12. Overview
      • Challenges
      • What does Gobblin provide?
      • How does Gobblin work?
      • Retrospective and lookahead
  • 13. Gobblin Usage @ LinkedIn
      • Business Analytics
          – Source data for sales analysis, product sentiment analysis, etc.
      • Engineering
          – Source data for issue tracking, monitoring, product release, security compliance, A/B testing
      • Consumer product
          – Source data for acquisition integration
          – Performance analysis for email campaigns, ads campaigns, etc.
  • 14. Key Features
      • Horizontally scalable and robust framework
      • Unified computation paradigm
      • Turn-key solution
      • Customize your own ingestion
  • 15. Scalable and Robust Framework
      • Scalable: Jobs are partitioned into tasks that run concurrently.
      • Centralized State Management: State is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc.
      • Fault Tolerant: Framework gracefully deals with machine and job failures.
      • Quality Assurance: Baked-in quality checking throughout the flow.
  • 16. Unified computation paradigm
      • Common execution flow: Common execution flow between batch ingestion and streaming ingestion pipelines.
      • Shared infra components: Shared job state management, job metrics store, metadata management.
  • 17. Turn Key Solution
      • Built-in Exchange Protocols: Existing adapters can easily be re-used for sources with common protocols (e.g. JDBC, REST, SFTP, SOAP, etc.).
      • Built-in Source Integration: Fully integrated with commonly used sources (including MySQL, SQLServer, Oracle, SalesForce, HDFS, filer, internal dropbox).
      • Built-in Data Ingestion Semantics: Covers full dump and incremental ingestion for fact and dimension datasets.
      • Policy driven flow execution & tuning: Flow owners just need to specify a pre-defined policy for handling job failure, degree of parallelism, what data to publish, etc.
  • 18. Customize Your Own Ingestion Pipeline
      • Extendable Operators: Operators for doing extraction, conversion, quality checking, data persistence, etc., can be implemented or extended against a common API.
      • Configurable Operator Flow: Configuration allows for multiple plugin points to add in customized logic and code.
  • 19. Overview
      • Challenges
      • What does Gobblin provide?
      • How does Gobblin work?
      • Lookahead
  • 20. Under the Hood
  • 21. Computation Model
      • Gobblin standalone
          – Single process, multi-threading
          – Testing, small data, sampling
      • Gobblin on Map/Reduce
          – Large datasets, horizontally scalable
      • Gobblin on Yarn
          – Better resource utilization
          – More scheduling flexibility
  • 22. Scalable Ingestion Flow
      (Diagram labels: Source; three Work Units; three Tasks, each with Extractor, Converter, Quality Checker, Writer; Data Publisher)
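The flow on slide 22 can be sketched in a few lines. This is only an illustration under stated assumptions, not Gobblin's API: `run_task`, `run_job`, and the toy stages are hypothetical names, the source is an in-memory list, and a thread pool stands in for the standalone (single-process, multi-threaded) mode.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(work_unit):
    """One task: extract -> convert -> quality check -> write (toy stages, all hypothetical)."""
    extracted = list(work_unit)                      # extractor: pull this work unit's records
    converted = [str(r).upper() for r in extracted]  # converter: a toy transformation
    checked = [r for r in converted if r]            # quality checker: drop empty records
    return checked                                   # writer: here, simply hand back the output

def run_job(source_records, unit_size=2, max_workers=3):
    # Source: partition the input into work units.
    work_units = [source_records[i:i + unit_size]
                  for i in range(0, len(source_records), unit_size)]
    # Tasks run concurrently, one per work unit; map() keeps work-unit order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        task_outputs = list(pool.map(run_task, work_units))
    # Data publisher: gather the task outputs into the final result.
    return [record for output in task_outputs for record in output]
```

Each work unit is processed independently, which is what lets the same flow be scheduled on threads here or, per slide 21, on Map/Reduce or Yarn.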
  • 23. Sources
      • Determines how to partition work
          - Partitioning algorithm can leverage source sharding
          - Group partitions intelligently for performance
      • Creates work-units to be scheduled
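One way a source might partition work, assuming the data can be addressed by a numeric key or offset range (an assumption; the slides do not describe the actual partitioning algorithm). `create_work_units` is a hypothetical helper that splits a half-open range into contiguous, near-equal work units.

```python
def create_work_units(low, high, max_units):
    """Split the half-open range [low, high) into at most max_units contiguous ranges."""
    if high <= low or max_units < 1:
        return []
    total = high - low
    units = min(max_units, total)       # never create empty work units
    base, extra = divmod(total, units)  # the first `extra` units get one more element
    work_units, start = [], low
    for i in range(units):
        size = base + (1 if i < extra else 0)
        work_units.append((start, start + size))
        start += size
    return work_units
```

For example, `create_work_units(0, 10, 3)` yields `[(0, 4), (4, 7), (7, 10)]`. A real source could instead align the boundaries with the source system's own shards, as the slide suggests.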
  • 24. Job Management
      • Job execution states
          – Watermark
          – Task state, job state, quality checker output, error code
      • Job synchronization
      • Job failure handling: policy driven
      (Diagram labels: State Store; Job run 1; Job run 2; Job run 3)
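The idea of carrying a watermark from one job run to the next through a state store can be sketched as follows. Everything here is hypothetical and simplified: `StateStore` is a local JSON-file store, `high_watermark` is an assumed key name, and rows are assumed to carry a monotonically increasing `id`.

```python
import json
import os
import tempfile

class StateStore:
    """Hypothetical file-backed store holding one JSON state object per job name."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, job_name):
        return os.path.join(self.directory, job_name + ".json")

    def load(self, job_name):
        try:
            with open(self._path(job_name)) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}  # first run: no previous state

    def save(self, job_name, state):
        with open(self._path(job_name), "w") as f:
            json.dump(state, f)

def new_store():
    """Demo helper: a state store in a fresh temporary directory."""
    return StateStore(tempfile.mkdtemp())

def run_incremental(store, job_name, source_rows):
    """Pull only rows above the previous run's high watermark, then advance it."""
    low = store.load(job_name).get("high_watermark", 0)
    pulled = [row for row in source_rows if row["id"] > low]
    if pulled:
        store.save(job_name, {"high_watermark": max(row["id"] for row in pulled)})
    return pulled
```

Running `run_incremental` twice against the same source returns each row only once, because the second run starts from the watermark the first run saved.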
  • 25. Gobblin Operator Flow
      (Diagram labels: Extract Schema; Extract Record; Convert Record; Check Record Data Quality; Write Record; Convert Schema; Check Task Data Quality; Commit Task Data)
  • 26. Extractors
      • Specifies how to get the schema and pull data from the source
      • Returns a ResultSet iterator
      • Tracks the high watermark
      • Tracks extraction metrics
  • 27. Converters
      • Allow for schema and data transformation
          – Filtering
          – Projection
          – Type conversion
          – Structural change
      • Composable: can specify a list of converters to be applied in the given order
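Composability, as described on slide 27, amounts to applying a list of converters in order. The sketch below is an assumption-laden illustration rather than Gobblin's converter API: converters are plain functions on dict records, the field names are invented, and returning `None` is the chosen convention for "record filtered out".

```python
def keep_active(record):
    """Filtering: drop records that are not active."""
    return record if record.get("active") else None

def project_fields(record):
    """Projection: keep only the fields downstream consumers need."""
    return {k: record[k] for k in ("id", "score")}

def score_to_int(record):
    """Type conversion: parse the score string into an integer."""
    return {**record, "score": int(record["score"])}

def apply_converters(record, converters):
    """Apply converters in the given order; None means the record was filtered out."""
    for convert in converters:
        record = convert(record)
        if record is None:
            return None
    return record
```

Because the order is explicit, placing the filter first avoids doing projection and type conversion on records that will be dropped anyway.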
  • 28. Quality Checkers
      • Ensure quality of any data produced by Gobblin
      • Can be run on a per record, per task, or per job basis
      • Can specify a list of quality checkers to be applied
          – Schema compatibility
          – Audit check
          – Sensitive fields
          – Unique key
      • Policy driven
          – FAIL – if the check fails then so does the job
          – OPTIONAL – if the check fails the job continues
          – ERR_FILE – the offending row is written to an error file
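The three policies on slide 28 can be illustrated with a row-level check. This is a sketch under assumptions: `check_rows`, `has_id`, and `QualityCheckFailed` are hypothetical names, the error file is represented by an in-memory list, and keeping the row under OPTIONAL is a guess at the intended behaviour (the slide only says the job continues).

```python
from enum import Enum

class Policy(Enum):
    FAIL = "FAIL"
    OPTIONAL = "OPTIONAL"
    ERR_FILE = "ERR_FILE"

class QualityCheckFailed(Exception):
    """Raised when a check fails under the FAIL policy."""

def has_id(row):
    """Example check: the row must carry a non-null id."""
    return row.get("id") is not None

def check_rows(rows, check, policy):
    """Run a row-level check; return (passed_rows, error_rows) according to the policy."""
    passed, errors = [], []
    for row in rows:
        if check(row):
            passed.append(row)
        elif policy is Policy.FAIL:
            raise QualityCheckFailed(f"row failed check: {row!r}")
        elif policy is Policy.OPTIONAL:
            passed.append(row)   # assumed: the check is advisory, so the row is kept
        else:                    # Policy.ERR_FILE
            errors.append(row)   # stands in for writing the row to an error file
    return passed, errors
```

Separating the check from the policy lets the same check be strict for one dataset and advisory for another, which matches the "policy driven" framing of the slide.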
  • 29. Writers
      • Writing data in Avro format onto HDFS
          – One writer per task
      • Flexibility
          – Configurable compression codec (Deflate, Snappy)
          – Configurable buffer size
      • Plan to support other data formats (Parquet, ORC)
  • 30. Publishers
      • Determines job success based on policy
          - COMMIT_ON_FULL_SUCCESS
          - COMMIT_ON_PARTIAL_SUCCESS
      • Commits data to final directories based on job success
      (Diagram labels: Task 1, Task 2, Task 3; Tmp Dir with File 1, File 2, File 3; Final Dir with File 1, File 2, File 3)
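The staging-then-commit pattern on slide 30 might look like the sketch below, using local directories in place of HDFS. The function names are hypothetical, and the reading of COMMIT_ON_PARTIAL_SUCCESS as "publish the output of the tasks that succeeded" is an assumption; the slides give only the policy names.

```python
import os
import shutil
import tempfile

POLICIES = ("COMMIT_ON_FULL_SUCCESS", "COMMIT_ON_PARTIAL_SUCCESS")

def stage_files(names):
    """Demo helper: create a temporary staging directory with one small file per task."""
    tmp_dir = tempfile.mkdtemp()
    for name in names:
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write("data\n")
    return tmp_dir

def publish(task_results, tmp_dir, final_dir, policy):
    """task_results maps each task's output file name to True (success) or False.

    Returns the sorted list of file names moved from tmp_dir to final_dir.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if policy == "COMMIT_ON_FULL_SUCCESS" and not all(task_results.values()):
        return []  # nothing is published; staged files stay in tmp_dir
    os.makedirs(final_dir, exist_ok=True)
    published = []
    for name, ok in sorted(task_results.items()):
        if ok:
            shutil.move(os.path.join(tmp_dir, name), os.path.join(final_dir, name))
            published.append(name)
    return published
```

Writing to a temporary directory first means readers of the final directory never see half-written output from a failed job.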
  • 31. Gobblin Compaction
      • Dimensions:
          – Initial full dump followed by incremental extracts in Gobblin
          – Maintain a consistent snapshot by doing regularly scheduled compaction
      • Facts:
          – Merge small files
      (Diagram labels: Ingestion; HDFS; Compaction)
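For dimension data, compaction can be thought of as merging incremental extracts into the last snapshot and keeping the newest version of each key. The sketch below assumes each row has a unique key field and a version or modification field (`id` and `modified` are invented names); it does not handle deletes and is not Gobblin's compaction job.

```python
def compact(snapshot, increments, key="id", version="modified"):
    """Merge incremental rows into a snapshot, keeping the newest version of each key."""
    latest = {row[key]: row for row in snapshot}
    for row in increments:
        current = latest.get(row[key])
        # A newer (or equally new, later-arriving) row replaces the current one.
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda row: row[key])
```

The result is a new consistent snapshot that can replace the old snapshot plus its accumulated increments.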
  • 32. Overview
      • Challenges
      • What does Gobblin provide?
      • How does Gobblin work?
      • Retrospective and lookahead
  • 33. Gobblin in Production
      • Data Volume
          – > 350 datasets
          – ~ 60 TB per day
      • Production Instances
          – Salesforce
          – Responsys
          – RightNow
          – Timeforce
          – Slideshare
          – Newsle
          – A/B testing
          – LinkedIn JIRA
          – Data retention
  • 34. Lessons Learned
      • Data quality has a lot more work to do
      • Small data problem is not small
      • Performance optimization opportunities
      • Operational traits
  • 35. Gobblin Roadmap
      • Gobblin on Yarn
      • Streaming Sources
      • Gobblin Workbench with ingestion DSL
      • Data Profiling for richer quality checking
      • Open source in Q4’14