2. Currently @Uber on streaming systems.
Previously @Cloudera on ingestion for Hadoop
and @LinkedIn on front-end service infra.
Currently @Uber, focused on building
a real-time pipeline for ingestion to
Hadoop. @LinkedIn, lead on Voldemort.
In the past, worked on log-based
replication, HPC, and stream
processing.
3. Agenda
• Sqoop for Data Ingestion
• Why Sqoop on Spark?
• Sqoop Jobs on Spark
• Insights & Next Steps
9. Agenda
• Sqoop for Data Ingestion
• Why Sqoop on Spark?
• Sqoop Jobs on Spark
• Insights & Next Steps
10. It turns out…
• MapReduce is slow!
• We need Connector APIs to support transformations (the T), not just EL
• Good news: the Execution Engine is also pluggable
11. Why Apache Spark?
• Why not? ETL expressed as Spark jobs
• Faster than MapReduce
• Growing community embracing Apache Spark
12. Why Not Use Spark Data Sources?
Sure we can! But…
13. Why Not Spark Data Sources?
• The Data Sources API is a recent addition
• Run existing MR Sqoop jobs on Spark with a simple config change
• Leverage incremental EL & job management within Sqoop
14. Agenda
• Sqoop for Data Ingestion
• Why Sqoop on Spark?
• Sqoop Jobs on Spark
• Insights & Next Steps
16. Sqoop Job API
• Create a Sqoop job
–Create FROM and TO job configs
–Create the JOB associating the FROM and TO configs
• The SparkContext holds the Sqoop jobs
• Invoke SqoopSparkJob.execute(conf, context) (a sketch follows below)
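A minimal Scala sketch of what that flow could look like. The config keys, table name, and the SqoopSparkJob stub below are illustrative assumptions; only SqoopSparkJob.execute(conf, context) is named in the talk, and the real wiring lives in the SQOOP-1532 / sqoop-on-spark prototype.

import org.apache.spark.{SparkConf, SparkContext}

// Stub standing in for the prototype's SqoopSparkJob; the real class comes from
// the sqoop-on-spark code referenced on the Next Steps slide.
object SqoopSparkJob {
  def execute(conf: Map[String, String], context: SparkContext): Unit =
    println(s"Would run Sqoop job with ${conf.size} config entries on ${context.master}")
}

object SqoopJobApiSketch {
  def main(args: Array[String]): Unit = {
    // Local master only for the sketch; the real job is submitted to YARN via spark-submit.
    val context = new SparkContext(
      new SparkConf().setAppName("sqoop-jdbc-to-hdfs").setMaster("local[*]"))

    // Create FROM (JDBC) and TO (HDFS) job configs -- keys and values are placeholders.
    val fromConfig = Map(
      "from.jdbcString" -> "jdbc://myhost:3306/test",
      "from.username"   -> "uber",
      "from.table"      -> "trips")
    val toConfig = Map("to.outputDir" -> "hdfs://path/to/output")

    // Create the JOB associating the FROM and TO configs, plus extractor/loader parallelism.
    val jobConf = fromConfig ++ toConfig ++ Map("numExtractors" -> "4", "numLoaders" -> "4")

    // The SparkContext holds the Sqoop job; execute it.
    SqoopSparkJob.execute(jobConf, context)
    context.stop()
  }
}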
17. Spark Job Submission
• We explored a few options:
– Invoke Spark in-process within the Sqoop Server to execute the job
– Use the Remote Spark Context used by Hive on Spark to submit
– Sqoop job as a driver for the spark-submit command
18. Spark Job Submission
• Build an uber jar with the driver and all the Sqoop dependencies
• Submit the driver program to YARN either programmatically via the Spark YARN client (non-public API) or directly from the command line:
• bin/spark-submit --class org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver
--master yarn /path/to/uber.jar --confDir /path/to/sqoop/server/conf/
--jdbcString jdbc://myhost:3306/test --u uber --p hadoop
--outputDir hdfs://path/to/output --numE 4 --numL 4
21. Spark Job Execution
• Compute partitions of the MySQL table
• .map(): extract from MySQL
• .mapPartition(): load into HDFS
• .repartition(): repartition to limit the number of files on HDFS
(a rough sketch of this pipeline follows below)
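A rough Scala sketch of that pipeline under stated assumptions: the partition split, the Row type, and the extract/load functions below are placeholders for what the Sqoop connectors actually do through the Connector API; only the map / mapPartitions / repartition shape comes from the slide.

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder types -- the real job moves Sqoop IDF records, not this toy Row.
case class Partition(lowerBound: Long, upperBound: Long)
case class Row(id: Long, payload: String)

object SqoopSparkPipelineSketch {

  // Compute partitions: split the source table's key range, one slice per extractor.
  def computePartitions(numExtractors: Int, maxId: Long): Seq[Partition] = {
    val step = math.max(1L, maxId / numExtractors)
    (0L until maxId by step).map(lo => Partition(lo, math.min(lo + step, maxId)))
  }

  // Stand-in for the JDBC extract done by the FROM connector.
  def extractFromMySQL(p: Partition): Seq[Row] =
    (p.lowerBound until p.upperBound).map(i => Row(i, s"row-$i"))

  // Stand-in for the formatting/loading done by the TO connector.
  def loadIntoHDFS(rows: Iterator[Row]): Iterator[String] =
    rows.map(r => s"${r.id},${r.payload}")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sqoop-pipeline-sketch").setMaster("local[*]"))
    val (numExtractors, numLoaders) = (4, 4)

    sc.parallelize(computePartitions(numExtractors, maxId = 1000L), numExtractors) // compute partitions
      .flatMap(extractFromMySQL)   // the .map() stage on the slide: extract rows from MySQL per partition
      .mapPartitions(loadIntoHDFS) // the .mapPartition() stage: load/format each partition for HDFS
      .repartition(numLoaders)     // the .repartition() stage: limit the number of output files on HDFS
      .saveAsTextFile("hdfs://path/to/output")

    sc.stop()
  }
}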
22. Agenda
• Sqoop for Data Ingestion
• Why Sqoop on Spark?
• Sqoop Jobs on Spark
• Insights & Next Steps
24. Micro Benchmark: MySQL to HDFS
• Table with 2.8M records, numExtractors = numLoaders: good partitioning!
25. What was Easy?
• No changes to the Connector API required
• Built-in support for standalone and YARN cluster modes for quick end-to-end testing and faster iteration
• Scheduling Spark Sqoop jobs via Oozie
26. What was not Easy?
• No clean public Spark job-submit API; using the YARN UI for job status and health
• A bunch of Sqoop core classes, such as IDF, had to be made serializable
• Managing Hadoop and Spark dependencies together in Sqoop caused some pain
27. Next Steps!
• Explore alternative ways for Spark Sqoop job submission with the Spark 1.4 additions
• Connector Filter API (filter, data masking)
• SQOOP-1532
– https://github.com/vybs/sqoop-on-spark