4. Data Ecosystem - Overview
[Diagram: serving apps write to online stores (Espresso, Oracle, MySQL) and logs; the analytics infrastructure feeds serving engines, OLAP systems, and business users]
5. Data Ecosystem – Data
Tracking Data
Tracks user activity on the site
Append only
Example: Page View
Database Data
Member-provided data in online stores
Inserts, Updates and Deletes
Example: Member Profiles, Likes, Comments
7. Bridging OLTP to OLAP
Integrating site-serving data stores with Hadoop at scale with low latency.
Critical to LinkedIn’s
Member engagement
Business decision making
[Diagram: databases (Espresso, Oracle, MySQL) and tracking data flow through Kafka from the OLTP side to the OLAP serving engines]
8. Challenge - Scalable ETL
600+ Tracking topics
500+ Database tables
XXX TB of Data at rest
X TB of new data generated per day
5000 Nodes, Several Hadoop clusters
9. Challenge – Consistent Snapshot with SLA
Apply updates and deletes
Or copy full tables
But: resource overheads
Only a small fraction of the data changes
10. Requirements
[Diagram: OLTP stores (Oracle, Espresso) on one side, OLAP serving engines on the other, fed by database data and tracking data]
Refresh data on HDFS frequently
Seamless handling of schema evolution
Optimal resource usage
Handle multiple data centers
Efficient change capture on source
Ensure Last-Update semantics
Handle deletes
12. Lumos
Data Capture
Can use commit logs
Delta processing
Latencies in minutes
Schema-agnostic framework
[Diagram: databases in Colo-1 and Colo-2 are extracted (via Databus and other extractors) into files shipped to the Hadoop data center, where Lumos maintains databases (HDFS) and dbchanges (HDFS)]
13. Lumos – Multi-Datacenter
Data Capture
Handle multi-datacenter stores
Resolve updates via commit order
15. Lumos - High Level Architecture
[Architecture diagram: Change Capture produces Full Drops and Increments; a Pre-Process step lands them in internal Staging (HDFS) on the ETL Hadoop cluster; the Lazy Snapshot Builder and Virtual Snapshot Builder publish Virtual Snapshots to HDFS, which user jobs read through MR/Pig/Hive Loaders; a Compactor rewrites snapshots when needed]
18. Change Capture – File Based
File Format
Compressed CSV
Metadata
Full Drop
Via Fast Reader (Oracle, MySQL)
Via MySQL backups (Espresso)
Runs for hours with Dirty reads
Increments
Via SQL
Transactional
[Diagram: a full drop runs from 1am to 4am while hourly increments (Inc h-1 ... h-4) continue; each increment advances the high-water mark from the previous value to a new one; extracted files are exposed by a web service and pulled into HDFS over HTTPS]
19. Change Capture – Databus Based
[Diagram: the database feeds a Databus Relay; mapper tasks run Databus Consumers and reducers write dbchanges (HDFS)]
Reads Database commit logs
Multi-datacenter support via Databus Relay
Runs as MR Job
Output: date-time partitioned, with multiple versions
True change capture (including hard deletes)
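The sketch below only illustrates the output side, i.e. how a consumed change event could be routed into a date-time partitioned dbchanges layout on HDFS; ChangeEvent and the path scheme are assumptions, not the actual Databus consumer API:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/**
 * Sketch of routing change events into a date-time partitioned HDFS layout,
 * where each key can appear with multiple versions and hard deletes are kept
 * as explicit delete events.
 */
public class ChangePartitioner {

  /** Illustrative change record: table, key, commit sequence number, op type, payload. */
  public record ChangeEvent(String table, String key, long scn, boolean isDelete, byte[] payload) {}

  private static final DateTimeFormatter HOUR_PART =
      DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

  /** e.g. /data/dbchanges/member_profile/2014/06/03/14/ (path scheme is an assumption). */
  public String outputDir(ChangeEvent event, LocalDateTime commitTime) {
    return String.format("/data/dbchanges/%s/%s/", event.table(), HOUR_PART.format(commitTime));
  }
}
```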
20. Pre-Processing
Data format conversion
Field level transformations
Privacy
Cleansing – e.g., remove recursive schemas
Metadata annotation
Add row counts for data validation
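A sketch of the metadata-annotation step using Avro schema properties; the property names (pk.column, delta.column, etc.) are assumptions that mirror the metadata listed in the notes:

```java
import org.apache.avro.Schema;

/**
 * Sketch of annotating the output Avro schema with the metadata that downstream
 * snapshotting jobs need (key/delta column, covered time range, row counts).
 */
public final class MetadataAnnotator {

  public static Schema annotate(Schema tableSchema, String keyColumn, String deltaColumn,
                                String beginDate, String endDate, long rowCount) {
    // Schema properties ride along with the data files and can be read by later jobs.
    tableSchema.addProp("pk.column", keyColumn);        // primary key used for partitioning/merging
    tableSchema.addProp("delta.column", deltaColumn);   // column that orders updates (last-update wins)
    tableSchema.addProp("begin.date", beginDate);       // time range covered by this drop/increment
    tableSchema.addProp("end.date", endDate);
    tableSchema.addProp("row.count", Long.toString(rowCount)); // used for data validation
    return tableSchema;
  }

  private MetadataAnnotator() {}
}
```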
21. Snapshotting – Lazy Materializer
One MR job per table, consumes full drops
Supports dirty reads.
Hash Partition on primary key
Number of partitions based on data size
Sorts on primary key
Results published into staging directory
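A minimal sketch of the hash-partitioning step as a Hadoop Partitioner; the Text key/value types stand in for the serialized primary key and row:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Sketch of hash-partitioning on the primary key so that every key always
 * lands in the same snapshot partition; sorting on the key happens in the
 * MR shuffle.
 */
public class PrimaryKeyPartitioner extends Partitioner<Text, Text> {

  @Override
  public int getPartition(Text primaryKey, Text row, int numPartitions) {
    // Mask the sign bit so the result is always a valid partition index.
    return (primaryKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
```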
22. Snapshotting – Virtual Snapshot Builder
One MR Job for all tables
Identifies all existing snapshots, both published and staged
Creates appropriate delta partitions for every snapshot
Delta partition count equals Snapshot partition count
Clubs multiple partitions into one file
Outputs latest row using delta column
Publishes staged snapshots with new deltas
Previously published snapshots updated with new deltas
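A sketch of the last-update-wins resolution, assuming a numeric delta column taken from the schema metadata:

```java
import java.util.Comparator;
import java.util.List;
import org.apache.avro.generic.GenericRecord;

/**
 * Sketch of "output latest row using delta column": among all versions of a key
 * (snapshot row plus increment rows), the row with the highest delta value wins.
 */
public final class LatestRowResolver {

  public static GenericRecord latest(List<GenericRecord> versionsOfOneKey, String deltaColumn) {
    return versionsOfOneKey.stream()
        .max(Comparator.comparingLong(r -> ((Number) r.get(deltaColumn)).longValue()))
        .orElseThrow();
  }

  private LatestRowResolver() {}
}
```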
23. Snapshotting – Virtual Snapshot Builder
/db/table/snapshot-0 (10 partitions, 10 Avro files: Part-0 ... Part-9, plus index files)
_delta
inc-1 (10 partitions, clubbed into 2 Avro files plus index files)
inc-2 (10 partitions, clubbed into 2 Avro files plus index files)
Incremental data is small
Rolls increments to avoid creating small files
Equi-partitions increments the same way as the snapshot
Seek and read a partition via the index file
[Diagram: Part-0.avro holds Partitions 0-4 and Part-5.avro holds Partitions 5-9; each Avro file has an index file recording where every partition starts]
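A sketch of reading one partition out of a clubbed Avro file, assuming the index file supplies the byte offset where the partition starts and the offset where the next one begins:

```java
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

/**
 * Sketch of seeking to and reading a single logical partition within an Avro file.
 */
public class PartitionReader {

  public void readPartition(File avroFile, long startOffset, long endOffset,
                            RecordHandler handler) throws Exception {
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(avroFile, new GenericDatumReader<>())) {
      reader.sync(startOffset);                       // jump to the partition's first block
      while (reader.hasNext() && !reader.pastSync(endOffset)) {
        handler.handle(reader.next());                // read until the next partition begins
      }
    }
  }

  /** Hypothetical callback for the rows of the requested partition. */
  public interface RecordHandler {
    void handle(GenericRecord record);
  }
}
```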
24. Snapshotting – Loaders
Custom InputFormat (MR)
Uses the Index file to create Splits
RecordReader merges partition-0 of Snapshot and Delta
Returns latest row from Delta if present
Masks row if deleted
Otherwise returns row from snapshot
Pig Loader enables reading virtual snapshot via Pig
Storage handler enables reading virtual snapshot via Hive
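A sketch of the merge performed per partition, assuming the snapshot rows arrive in key order and the small delta partition fits in a key-to-latest-row map; the is_deleted marker name is an assumption:

```java
import java.util.Iterator;
import java.util.Map;
import org.apache.avro.generic.GenericRecord;

/**
 * Sketch of the RecordReader merge: snapshot rows are streamed, each key is
 * checked against the delta, deleted rows are masked, and delta-only keys are
 * emitted as new inserts.
 */
public class SnapshotDeltaMerger {

  public void merge(Iterator<GenericRecord> snapshotRows,
                    Map<String, GenericRecord> deltaByKey,
                    String keyColumn, RowEmitter emitter) {
    while (snapshotRows.hasNext()) {
      GenericRecord snapshotRow = snapshotRows.next();
      String key = String.valueOf(snapshotRow.get(keyColumn));
      GenericRecord deltaRow = deltaByKey.remove(key);
      if (deltaRow == null) {
        emitter.emit(snapshotRow);                          // unchanged row from the snapshot
      } else if (!Boolean.TRUE.equals(deltaRow.get("is_deleted"))) {
        emitter.emit(deltaRow);                             // latest version from the delta
      }                                                     // else: row was deleted, mask it
    }
    // Keys only present in the delta are new inserts.
    deltaByKey.values().stream()
        .filter(r -> !Boolean.TRUE.equals(r.get("is_deleted")))
        .forEach(emitter::emit);
  }

  /** Hypothetical sink for merged rows. */
  public interface RowEmitter {
    void emit(GenericRecord row);
  }
}
```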
26. Snapshotting – Compactor
Required when partition size exceeds threshold
Materializes the Virtual Snapshot into a Snapshot with more partitions
MR job with Reducer
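A sketch of the compaction trigger and of sizing the new partition count; the 1 GB threshold is an illustrative value, not the production setting:

```java
/**
 * Sketch of deciding when to compact and how many partitions the materialized
 * snapshot should have, based on total data size.
 */
public final class CompactionPolicy {

  private static final long MAX_PARTITION_BYTES = 1L << 30;   // assumed 1 GB per partition

  public static boolean needsCompaction(long snapshotPlusDeltaBytes, int currentPartitions) {
    return snapshotPlusDeltaBytes / currentPartitions > MAX_PARTITION_BYTES;
  }

  public static int newPartitionCount(long snapshotPlusDeltaBytes) {
    // Size-based partition count, rounded up.
    return (int) ((snapshotPlusDeltaBytes + MAX_PARTITION_BYTES - 1) / MAX_PARTITION_BYTES);
  }

  private CompactionPolicy() {}
}
```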
27. Operating billions of rows per day
Dude, where’s my row?
– Automatic Data validation
When data misses the bus
– Handling late data
– Look back window
Cluster downtime
– Restart-ability
– Active-active
– Idempotent processing
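A sketch of the automatic "where's my row?" data validation, assuming the expected count comes from the row-count metadata annotated during pre-processing:

```java
/**
 * Sketch of a row-count check: the count reported by the source extract is
 * compared with the rows actually loaded on HDFS, and a mismatch fails the
 * flow before the snapshot is published.
 */
public final class RowCountValidator {

  public static void validate(String table, long expectedRows, long loadedRows) {
    if (expectedRows != loadedRows) {
      throw new IllegalStateException(String.format(
          "Row count mismatch for %s: source reported %d rows, HDFS has %d",
          table, expectedRows, loadedRows));
    }
  }

  private RowCountValidator() {}
}
```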
28. Conclusion and Future Work
Conclusion
Lumos : Scalable ETL framework
Battle tested in production
Future Work
Unify Internal and External data
Open source
Today's talk is about scaling ETL in order to consolidate and democratize data and analytics on Hadoop at LinkedIn.
Let’s start with the overall Data Ecosystem
Then focus on the specific problem of integrating online data-stores with Hadoop
and go over the solution
Members interact with the site apps
And they generate actions and data mutations
Which gets persisted in LOGS store and ONLINE data stores
Espresso, MySQL and Oracle are primary online data stores.
Espresso is a document oriented partitioned data store with transactional support. It is home grown.
Kafka is used as the LOG store.
Online data sources are periodically replicated to Hadoop for creating cubes & enrichments.
Cubes are used externally on the site as well as internally in reports/insights for analysts.
(E.g.: "Who viewed your profile", "Campaign performance reports", member sign-up reports)
Cubes are delivered via cube-serving engines. There are primarily 3 cube-serving stacks.
Voldemort is a key-value store: used to deliver static reports with pre-computed metrics.
Pinot: search technology: used for delivering somewhat dynamic reports with pre-computed metrics (drill).
Finally, the traditional BI stack comprising TD + Tableau + MSTR delivers insights to business users.
Explain interactively, with a real use case, what action generated what data.
Tracking: User activity at the site turns into tracking data
Example -> Tracking -> PageView, AdClick
Append -> each user activity generates new data
Immutable -> Once generated, does not change but grows over time
Usually organized by time and accessed over a time range
Database: user-provided data stored in online stores.
This data is mutable over time
Example -> Member Profile, Education
Organized as full table as of some time and accessed in full
The problem is simply replicating the data from ONLINE to HADOOP
But LinkedIn has 300m members and generates lots of data => a humongous amount of data
Fresh data directly impacts the member engagement and business decision making
The PROD data center is accessible from outside
Hadoop runs in the CORP data center
Deletes for compliance
Moving the data entirely puts load on the source system, the network, and Hadoop resources
Commit time or
Since tracking data is append-only, it is easier to handle and arrange in time windows.
DB data can have updates or deletes,
and reflecting that on HDFS with low latency and optimal resource usage is a challenge
TALK about schema evolution
This is neither an HDFS snapshot nor an HBase snapshot
Schema changes + rewrite the complete data
Sqoop: Cross-colo database connections are not allowed
Sqoop: May put load on the production databases
HBase
Write the change logs and periodically do a snapshot and replicate
not all companies run HBase as part of the standard deployment
not clear if this will meet the low-latency requirement
Hive Streaming
looks similar to what we do
caveat: it only supports ORC
Change to Data Extract
Bottom right
TODO: cluster of databases and Relay
Reading off of Databus
With a picture
Checkpoint
SCN to time mapping
Backup slides towards the end
DB dump format to Avro
Oracle data types
Map-only job
Field-level transformations
Eliminate recursive schemas
Avro schema attribute JSON
Meta info
Key and delta column
begin_date, end_date, drop_date, full_drop date
Row counts