Sqoop on Spark provides a way to run Sqoop jobs using Apache Spark for parallel data ingestion. It allows Sqoop jobs to leverage Spark's speed and growing community. The key aspects covered are:
- Sqoop jobs can be created and executed on Spark by initializing a Spark context and wrapping Sqoop and Spark initialization.
- Data is partitioned and extracted in parallel using Spark RDDs and map transformations that call the Sqoop connector Extract APIs.
- Loading likewise uses Spark RDDs and map transformations to load data in parallel via the connector Load APIs.
- Microbenchmarks show Spark-based ingestion can be significantly faster than traditional MapReduce-based Sqoop for large datasets.
2. Works currently @Uber, focused on building a real-time pipeline for ingestion into Hadoop for batch and stream processing. Previously @LinkedIn, lead on Voldemort, and @Oracle, focused on log-based replication, HPC, and stream processing.
Works currently @Uber on streaming systems. Prior to this, worked @Cloudera on ingestion for Hadoop and @LinkedIn on frontend and service infra.
3. Agenda
• Data Ingestion Today
• Introduction to Apache Sqoop 2
• Sqoop Jobs on Apache Spark
• Insights & Next Steps
7. Data Ingestion Pipeline
• Ingestion pipeline can now have
• Non-SQL-like data sources
• Messaging Systems as data sources
• Multi-stage pipeline
8. Sqoop 2
• Generic Data Transfer Service
• FROM - egress data out from ANY source
• TO - ingress data into ANY source
• Pluggable Data Sources
• Server-Client Design
11. Connector
• Data Source properties represented via Configs
• LINK config to connect to the data source
• JOB config to read/write data from the data source
13. Connector API
• Pluggable Connector API implemented by Connectors
• Partition(P) API for parallelism
• (E) Extract API to egress data
• (L) Load API to ingress data
• No (T) Transform yet!
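To make the (P)/(E)/(L) split concrete, here is a minimal illustrative sketch of the shape of such an API in Java. It is not the exact Sqoop 2 SPI, which splits these into separate Partitioner, Extractor and Loader classes and passes Link/Job configs to each call:

import java.util.List;

// Illustrative only: the three connector callbacks in one simplified interface
public interface ConnectorSketch<PARTITION, RECORD> {
    // (P) Partition: split the FROM data source into independent slices for parallelism
    List<PARTITION> getPartitions(int maxPartitions);

    // (E) Extract: read the records belonging to one partition out of the FROM source
    Iterable<RECORD> extract(PARTITION partition);

    // (L) Load: write records into the TO data source
    void load(Iterable<RECORD> records);
}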
16. Create Job
• Create LINKs
• Populate FROM link Config and create FROM LINK
• Populate TO link Config and create TO LINK
17. Create Job
• Create LINKs
• Populate FROM link Config and create FROM LINK
• Populate TO link Config and create TO LINK
Create MySQL link
Create Kafka link
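A hedged sketch of what "Create MySQL link" / "Create Kafka link" can look like with the Sqoop 2 Java client API; the server URL, connector names ("generic-jdbc-connector", "kafka-connector") and config input names ("linkConfig.*") are assumptions that vary across Sqoop 2 versions:

import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.model.MLink;
import org.apache.sqoop.model.MLinkConfig;

public class CreateLinksSketch {
    public static void main(String[] args) {
        SqoopClient client = new SqoopClient("http://sqoop-server:12000/sqoop/");

        // FROM link: MySQL via the generic JDBC connector
        MLink mysqlLink = client.createLink("generic-jdbc-connector");
        mysqlLink.setName("mysql-link");
        MLinkConfig jdbcConfig = mysqlLink.getConnectorLinkConfig();
        jdbcConfig.getStringInput("linkConfig.connectionString").setValue("jdbc:mysql://myhost:3306/test");
        jdbcConfig.getStringInput("linkConfig.jdbcDriver").setValue("com.mysql.jdbc.Driver");
        client.saveLink(mysqlLink);

        // TO link: Kafka
        MLink kafkaLink = client.createLink("kafka-connector");
        kafkaLink.setName("kafka-link");
        kafkaLink.getConnectorLinkConfig()
            .getStringInput("linkConfig.brokerList").setValue("broker1:9092,broker2:9092");
        client.saveLink(kafkaLink);
    }
}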
18. Create Job
• Create JOB associating FROM and TO LINKS
• Populate the FROM and TO Job Config
• Populate Driver Config such as parallelism for extract and load
numExtractors
numLoaders
19. Create Job
• Create JOB associating FROM and TO LINKS
• Populate the FROM and TO Job Config
• Populate Driver Config such as parallelism for extract and load
Add MySQL From Config
Add Kafka To Config
numExtractors
numLoaders
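Similarly, a hedged sketch of creating the job and its Driver Config with the Sqoop 2 client API; the table and topic values are placeholders, and the input names ("fromJobConfig.tableName", "toJobConfig.topic", "throttlingConfig.numExtractors/numLoaders") are assumptions tied to specific Sqoop 2 versions and connectors:

import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.model.MJob;

public class CreateJobSketch {
    public static void main(String[] args) {
        SqoopClient client = new SqoopClient("http://sqoop-server:12000/sqoop/");

        // Associate the FROM and TO links created earlier
        MJob job = client.createJob("mysql-link", "kafka-link");
        job.setName("mysql-to-kafka");

        // FROM job config: what to read
        job.getFromJobConfig().getStringInput("fromJobConfig.tableName").setValue("trips");

        // TO job config: where to write
        job.getToJobConfig().getStringInput("toJobConfig.topic").setValue("trips");

        // Driver config: parallelism for extract and load
        job.getDriverConfig().getIntegerInput("throttlingConfig.numExtractors").setValue(4);
        job.getDriverConfig().getIntegerInput("throttlingConfig.numLoaders").setValue(4);

        client.saveJob(job);
    }
}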
21. Job Submit
• Sqoop uses the MR engine to transfer data between the FROM and TO data sources
• A Hadoop Configuration object is used to pass the FROM/TO and Driver Configs to the MR engine
• Submits the Job via the MR client and tracks job status and stats such as counters
22. Connector API
• Pluggable Connector API implemented by Connectors
• Partition(P) API for parallelism
• (E) Extract API to egress data
• (L) Load API to ingress data
• No (T) Transform yet!
Remember!
23. Job Execution
• InputFormat/Splits for Partitioning
• Invokes FROM Partition API
• Mappers for Extraction
• Invokes FROM Extract API
• Reducers for Loading
• Invokes TO Load API
• OutputFormat for Commits/Aborts
26. It turns out…
• Sqoop 2 supports pluggable Execution Engine
• Why not replace MR with Spark for parallelism?
• Why not extend the Connector APIs to support simple (T) transformations along with (EL)?
27. Why Apache Spark ?
• Why not ? Data Pipeline expressed as Spark jobs
• Speed is a feature! Faster than MapReduce
• Growing Community embracing Apache Spark
• Low effort: less than a few weeks to build a POC
• EL to ETL -> Nifty transformations can be easily added
30. Create Sqoop Spark Job
• Create a SparkContext from the relevant configs
• Instantiate a SqoopSparkJob and invoke SqoopSparkJob.init(..), which wraps both Sqoop and Spark initialization
• As before, create a Sqoop Job with the createJob API
• Invoke SqoopSparkJob.execute(conf, context)
31. Create Sqoop Spark Job

public class SqoopJDBCHDFSJobDriver {
    public static void main(String[] args) {
        // 1. Wrap Sqoop and Spark initialization
        final SqoopSparkJob sparkJob = new SqoopSparkJob();
        CommandLine cArgs = SqoopSparkJob.parseArgs(createOptions(), args);
        SparkConf conf = sparkJob.init(cArgs);
        // 2. Create the Spark context from the resulting config
        JavaSparkContext context = new JavaSparkContext(conf);
        // 3. Create the Sqoop job from the FROM (JDBC) and TO (HDFS) links
        MLink fromLink = getJDBCLink();
        MLink toLink = getHDFSLink();
        MJob sqoopJob = createJob(fromLink, toLink);
        sparkJob.setJob(sqoopJob);
        // 4. Execute the job on Spark
        sparkJob.execute(conf, context);
    }
}
32. Spark Job Submission
• We explored a few options!
• Invoke Spark in-process within the Sqoop Server to execute the job
• Use the Remote Spark Context (used by Hive on Spark) to submit
• Sqoop Job as a driver for the Spark submit command
33. Spark Job Submission
• Build an “uber.jar” with the driver and all the Sqoop dependencies
• Programmatically using the Spark yarn client, or directly via the command line, submit the driver program in yarn client/cluster mode
• bin/spark-submit --class org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver --master yarn /path/to/uber.jar --confDir /path/to/sqoop/server/conf/ --jdbcString jdbc://myhost:3306/test --u uber --p hadoop --outputDir hdfs://path/to/output --numE 4 --numL 4
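The deck does not say how the "programmatically using the Spark yarn client" option was wired up; one possible sketch uses Spark's SparkLauncher (org.apache.spark.launcher, available since Spark 1.4), with placeholder paths and arguments:

import org.apache.spark.launcher.SparkLauncher;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
            .setAppResource("/path/to/uber.jar")
            .setMainClass("org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver")
            .setMaster("yarn-client")
            .addAppArgs("--confDir", "/path/to/sqoop/server/conf/",
                        "--numE", "4", "--numL", "4")
            .launch();                      // forks a spark-submit process
        int exitCode = spark.waitFor();     // block until the job driver finishes
        System.out.println("spark-submit exited with " + exitCode);
    }
}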
34. Spark Job Execution
• 3 main stages
• Obtain containers for parallel execution by simply converting the job’s partitions to an RDD
• Partition API determines parallelism; a Map stage uses the Extract API to read records
• Another Map stage uses the Load API to write records (see the sketch below)
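A minimal sketch of the three stages as Spark operations, assuming hypothetical extract/load functions that wrap the connector Extract and Load APIs; the function and record types here are simplified placeholders, not the POC's actual classes:

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class SparkExecutionSketch {
    // partitions: output of the FROM Partition API
    // extract/load: placeholder functions wrapping the connector Extract and Load APIs
    static void run(JavaSparkContext context,
                    List<Object> partitions,
                    Function<Object, List<Object>> extract,
                    Function<List<Object>, Boolean> load) {
        // Stage 1: one RDD element per partition, so the Partition API determines parallelism
        JavaRDD<Object> partitionRDD = context.parallelize(partitions, partitions.size());
        // Stage 2 (map): the Extract API reads the records of each partition
        JavaRDD<List<Object>> extracted = partitionRDD.map(extract);
        // Stage 3 (map): the Load API writes the records to the TO data source; count() forces execution
        extracted.map(load).count();
    }
}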
36. Spark Job Execution
• We chose to have 2 map stages for a reason
• Load parallelism can be different from Extract parallelism; for instance, we may need to restrict the TO side based on the number of Kafka partitions on the topic
• We can repartition before we invoke the Load stage (see the sketch below)
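Continuing the sketch above, the repartition sits between the two map stages; numLoaders is assumed to come from the Driver Config (for example, capped at the Kafka topic's partition count):

    // Variant of run(...) from the earlier sketch: decouple Load parallelism from Extract parallelism
    static void runWithRepartition(JavaSparkContext context,
                                   List<Object> partitions,
                                   Function<Object, List<Object>> extract,
                                   Function<List<Object>, Boolean> load,
                                   int numLoaders) {
        JavaRDD<List<Object>> extracted =
            context.parallelize(partitions, partitions.size()).map(extract);
        if (numLoaders != partitions.size()) {
            extracted = extracted.repartition(numLoaders);   // shuffle to the desired Load parallelism
        }
        extracted.map(load).count();
    }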
38. Micro Benchmark: MySQL to HDFS
• Table with 2.8M records, numExtractors = numLoaders: good partitioning!
39. What was Easy?
• Reusing existing Connectors; NO changes to the Connector API required
• Inbuilt support for Standalone and Cluster mode for quick end-to-end testing and faster iteration
• Scheduling Spark Sqoop jobs via Oozie
40. What was not Easy?
• No clean Spark Job Submit API that provides job statistics; we used the Yarn UI for job status and health
• We had to convert a bunch of Sqoop core classes, such as the IDF (internal representation for records transferred), to be serializable
• Managing Hadoop and Spark dependencies together caused some pain (ClassNotFound issues)
41. Next Steps!
• Explore alternative ways for Spark Sqoop Job submission
• Expose Spark job stats, such as accumulators, in the submission history
• Proposed Connector Filter API (cleaning, data masking)
• We want to work with the Sqoop community to merge this back if it's useful
• https://github.com/vybs/sqoop-on-spark
42. Questions!
• Apache Sqoop Project - sqoop.apache.org
• Apache Spark Project - spark.apache.org
• Thanks to the Folks @Cloudera and @Uber !!!
• You can reach us @vybs, @byte_array