Sqoop on Spark provides a way to run Sqoop jobs using Apache Spark for parallel data ingestion. It allows Sqoop jobs to leverage Spark's speed and growing community. The key aspects covered are:
- Sqoop jobs can be created and executed on Spark by initializing a Spark context and wrapping Sqoop and Spark initialization.
- Data is partitioned and extracted in parallel using Spark RDDs and map transformations that call the Sqoop connector Extract APIs.
- Loading likewise uses Spark RDDs and map transformations to load data in parallel via the connector Load APIs.
- Microbenchmarks show Spark-based ingestion can be significantly faster than traditional MapReduce-based Sqoop for large datasets.
2. Works currently @Uber, focused on building a real-time pipeline for ingestion into Hadoop for batch and stream processing. Previously @LinkedIn, lead on Voldemort, and @Oracle, focused on log-based replication, HPC, and stream processing.
Works currently @Uber on streaming systems. Prior to this, worked @Cloudera on ingestion for Hadoop and @LinkedIn on frontend and service infra.
3. Agenda
• Data Ingestion Today
• Introduction to Apache Sqoop 2
• Sqoop Jobs on Apache Spark
• Insights & Next Steps
7. Data Ingestion Pipeline
• Ingestion pipeline can now have
• Non-SQL-like data sources
• Messaging Systems as data sources
• Multi-stage pipeline
8. Sqoop 2
• Generic Data Transfer Service
• FROM - egress data out from ANY source
• TO - ingress data into ANY source
• Pluggable Data Sources
• Server-Client Design
11. Connector
• Data Source properties represented via Configs
• LINK config to connect to the data source
• JOB config to read/write data from the data source
13. Connector API
• Pluggable Connector API implemented by Connectors
• Partition(P) API for parallelism
• (E) Extract API to egress data
• (L) Load API to ingress data
• No (T) Transform yet!
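To make the (P)/(E)/(L) split concrete, here is a minimal illustrative sketch of the shape of such an API in Java. It is not the exact Sqoop 2 SPI, which splits these into separate Partitioner, Extractor and Loader classes and passes Link/Job configs to each call:

import java.util.List;

// Illustrative only: the three connector callbacks in one simplified interface
public interface ConnectorSketch<PARTITION, RECORD> {
    // (P) Partition: split the FROM data source into independent slices for parallelism
    List<PARTITION> getPartitions(int maxPartitions);

    // (E) Extract: read the records belonging to one partition out of the FROM source
    Iterable<RECORD> extract(PARTITION partition);

    // (L) Load: write records into the TO data source
    void load(Iterable<RECORD> records);
}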
16. Create Job
• Create LINKs
• Populate FROM link Config and create FROM LINK
• Populate TO link Config and create TO LINK
17. Create Job
• Create LINKs
• Populate FROM link Config and create FROM LINK
• Populate TO link Config and create TO LINK
Create MySQL link
Create Kafka link
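A hedged sketch of what "Create MySQL link" / "Create Kafka link" can look like with the Sqoop 2 Java client API; the server URL, connector names ("generic-jdbc-connector", "kafka-connector") and config input names ("linkConfig.*") are assumptions that vary across Sqoop 2 versions:

import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.model.MLink;
import org.apache.sqoop.model.MLinkConfig;

public class CreateLinksSketch {
    public static void main(String[] args) {
        SqoopClient client = new SqoopClient("http://sqoop-server:12000/sqoop/");

        // FROM link: MySQL via the generic JDBC connector
        MLink mysqlLink = client.createLink("generic-jdbc-connector");
        mysqlLink.setName("mysql-link");
        MLinkConfig jdbcConfig = mysqlLink.getConnectorLinkConfig();
        jdbcConfig.getStringInput("linkConfig.connectionString").setValue("jdbc:mysql://myhost:3306/test");
        jdbcConfig.getStringInput("linkConfig.jdbcDriver").setValue("com.mysql.jdbc.Driver");
        client.saveLink(mysqlLink);

        // TO link: Kafka
        MLink kafkaLink = client.createLink("kafka-connector");
        kafkaLink.setName("kafka-link");
        kafkaLink.getConnectorLinkConfig()
            .getStringInput("linkConfig.brokerList").setValue("broker1:9092,broker2:9092");
        client.saveLink(kafkaLink);
    }
}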
18. Create Job
• Create JOB associating FROM and TO LINKS
• Populate the FROM and TO Job Config
• Populate Driver Config such as parallelism for extract and load
numExtractors
numLoaders
19. Create Job
• Create JOB associating FROM and TO LINKS
• Populate the FROM and TO Job Config
• Populate Driver Config such as parallelism for extract and load
Add MySQL From Config
Add Kafka To Config
numExtractors
numLoaders
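Similarly, a hedged sketch of creating the job and its Driver Config with the Sqoop 2 client API; the table and topic values are placeholders, and the input names ("fromJobConfig.tableName", "toJobConfig.topic", "throttlingConfig.numExtractors/numLoaders") are assumptions tied to specific Sqoop 2 versions and connectors:

import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.model.MJob;

public class CreateJobSketch {
    public static void main(String[] args) {
        SqoopClient client = new SqoopClient("http://sqoop-server:12000/sqoop/");

        // Associate the FROM and TO links created earlier
        MJob job = client.createJob("mysql-link", "kafka-link");
        job.setName("mysql-to-kafka");

        // FROM job config: what to read
        job.getFromJobConfig().getStringInput("fromJobConfig.tableName").setValue("trips");

        // TO job config: where to write
        job.getToJobConfig().getStringInput("toJobConfig.topic").setValue("trips");

        // Driver config: parallelism for extract and load
        job.getDriverConfig().getIntegerInput("throttlingConfig.numExtractors").setValue(4);
        job.getDriverConfig().getIntegerInput("throttlingConfig.numLoaders").setValue(4);

        client.saveJob(job);
    }
}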
21. Job Submit
• Sqoop uses the MR engine to transfer data between the FROM and TO data sources
• A Hadoop Configuration object is used to pass the FROM/TO and Driver Configs to the MR engine
• Submits the Job via the MR client and tracks job status and stats such as counters
22. Connector API
• Pluggable Connector API implemented by Connectors
• Partition(P) API for parallelism
• (E) Extract API to egress data
• (L) Load API to ingress data
• No (T) Transform yet!
Remember!
23. Job Execution
• InputFormat/Splits for Partitioning
• Invokes FROM Partition API
• Mappers for Extraction
• Invokes FROM Extract API
• Reducers for Loading
• Invokes TO Load API
• OutputFormat for Commits/Aborts
26. It turns out…
• Sqoop 2 supports pluggable Execution Engine
• Why not replace MR with Spark for parallelism?
• Why not extend the Connector APIs to support simple (T) transformations along with (EL)?
27. Why Apache Spark ?
• Why not ? Data Pipeline expressed as Spark jobs
• Speed is a feature! Faster than MapReduce
• Growing Community embracing Apache Spark
• Low effort: less than a few weeks to build a POC
• EL to ETL -> Nifty transformations can be easily added
30. Create Sqoop Spark Job
• Create a SparkContext from the relevant configs
• Instantiate a SqoopSparkJob and invoke SqoopSparkJob.init(..), which wraps both Sqoop and Spark initialization
• As before, create a Sqoop Job with the createJob API
• Invoke SqoopSparkJob.execute(conf, context)
31. Create Sqoop Spark Job

public class SqoopJDBCHDFSJobDriver {
    public static void main(String[] args) {
        // 1. Wrap Sqoop and Spark initialization
        final SqoopSparkJob sparkJob = new SqoopSparkJob();
        CommandLine cArgs = SqoopSparkJob.parseArgs(createOptions(), args);
        SparkConf conf = sparkJob.init(cArgs);
        // 2. Create the Spark context from the resulting config
        JavaSparkContext context = new JavaSparkContext(conf);
        // 3. Create the Sqoop job from the FROM (JDBC) and TO (HDFS) links
        MLink fromLink = getJDBCLink();
        MLink toLink = getHDFSLink();
        MJob sqoopJob = createJob(fromLink, toLink);
        sparkJob.setJob(sqoopJob);
        // 4. Execute the job on Spark
        sparkJob.execute(conf, context);
    }
}
32. Spark Job Submission
• We explored a few options!
• Invoke Spark in-process within the Sqoop Server to execute the job
• Use the Remote Spark Context (used by Hive on Spark) to submit
• Sqoop Job as a driver for the Spark submit command
33. Spark Job Submission
• Build an “uber.jar” with the driver and all the Sqoop dependencies
• Programmatically using the Spark yarn client, or directly via the command line, submit the driver program in yarn client/cluster mode
• bin/spark-submit --class org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver --master yarn /path/to/uber.jar --confDir /path/to/sqoop/server/conf/ --jdbcString jdbc://myhost:3306/test --u uber --p hadoop --outputDir hdfs://path/to/output --numE 4 --numL 4
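The deck does not say how the "programmatically using the Spark yarn client" option was wired up; one possible sketch uses Spark's SparkLauncher (org.apache.spark.launcher, available since Spark 1.4), with placeholder paths and arguments:

import org.apache.spark.launcher.SparkLauncher;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
            .setAppResource("/path/to/uber.jar")
            .setMainClass("org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver")
            .setMaster("yarn-client")
            .addAppArgs("--confDir", "/path/to/sqoop/server/conf/",
                        "--numE", "4", "--numL", "4")
            .launch();                      // forks a spark-submit process
        int exitCode = spark.waitFor();     // block until the job driver finishes
        System.out.println("spark-submit exited with " + exitCode);
    }
}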
34. Spark Job Execution
• 3 main stages
• Obtain containers for parallel execution by simply converting the job’s partitions to an RDD
• Partition API determines parallelism; a Map stage uses the Extract API to read records
• Another Map stage uses the Load API to write records (see the sketch below)
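A minimal sketch of the three stages as Spark operations, assuming hypothetical extract/load functions that wrap the connector Extract and Load APIs; the function and record types here are simplified placeholders, not the POC's actual classes:

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class SparkExecutionSketch {
    // partitions: output of the FROM Partition API
    // extract/load: placeholder functions wrapping the connector Extract and Load APIs
    static void run(JavaSparkContext context,
                    List<Object> partitions,
                    Function<Object, List<Object>> extract,
                    Function<List<Object>, Boolean> load) {
        // Stage 1: one RDD element per partition, so the Partition API determines parallelism
        JavaRDD<Object> partitionRDD = context.parallelize(partitions, partitions.size());
        // Stage 2 (map): the Extract API reads the records of each partition
        JavaRDD<List<Object>> extracted = partitionRDD.map(extract);
        // Stage 3 (map): the Load API writes the records to the TO data source; count() forces execution
        extracted.map(load).count();
    }
}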
36. Spark Job Execution
• We chose to have 2 map stages for a reason
• Load parallelism can be different from Extract parallelism; for instance, we may need to restrict the TO side based on the number of Kafka partitions on the topic
• We can repartition before we invoke the Load stage (see the sketch below)
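Continuing the sketch above, the repartition sits between the two map stages; numLoaders is assumed to come from the Driver Config (for example, capped at the Kafka topic's partition count):

    // Variant of run(...) from the earlier sketch: decouple Load parallelism from Extract parallelism
    static void runWithRepartition(JavaSparkContext context,
                                   List<Object> partitions,
                                   Function<Object, List<Object>> extract,
                                   Function<List<Object>, Boolean> load,
                                   int numLoaders) {
        JavaRDD<List<Object>> extracted =
            context.parallelize(partitions, partitions.size()).map(extract);
        if (numLoaders != partitions.size()) {
            extracted = extracted.repartition(numLoaders);   // shuffle to the desired Load parallelism
        }
        extracted.map(load).count();
    }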
38. Micro Benchmark: MySQL to HDFS
• Table with 2.8M records, numExtractors = numLoaders: good partitioning!
39. What was Easy?
• Reusing existing Connectors; NO changes to the Connector API required
• Inbuilt support for Standalone and Cluster mode for quick end-to-end testing and faster iteration
• Scheduling Spark Sqoop jobs via Oozie
40. What was not Easy?
• No clean Spark Job Submit API that provides job statistics; we used the Yarn UI for job status and health
• We had to convert a bunch of Sqoop core classes, such as the IDF (internal representation for records transferred), to be serializable
• Managing Hadoop and Spark dependencies together caused some pain (ClassNotFound issues)
41. Next Steps!
• Explore alternative ways for Spark Sqoop Job submission
• Expose Spark job stats, such as accumulators, in the submission history
• Proposed Connector Filter API (cleaning, data masking)
• We want to work with the Sqoop community to merge this back if it's useful
• https://github.com/vybs/sqoop-on-spark
42. Questions!
• Apache Sqoop Project - sqoop.apache.org
• Apache Spark Project - spark.apache.org
• Thanks to the Folks @Cloudera and @Uber !!!
• You can reach us @vybs, @byte_array