Supercharging ETL with Spark 
Rafal Kwasny 
First Spark London Meetup 
2014-05-28
Who are you?
About me 
• Sysadmin/DevOps background 
• Worked as DevOps @Visualdna 
• Now building game analytics platform 
@Sony Computer Entertainment Europe
Outline 
• What is ETL 
• How do we do it in the standard Hadoop stack 
• How can we supercharge it with Spark 
• Real-life use cases 
• How to deploy Spark 
• Lessons learned
Standard technology stack 
Get the data
Standard technology stack 
Load into HDFS / S3
Standard technology stack 
Extract & Transform & Load
Standard technology stack 
Query, Analyze, train ML models
Standard technology stack 
Real Time pipeline
Hadoop 
• Industry standard 
• Have you ever looked at Hadoop code and 
tried to fix something?
How simple is simple? 
"Simple YARN application to run n copies of a unix command -
deliberately kept simple (with minimal error handling etc.)"
➜ $ git clone https://github.com/hortonworks/simple-yarn-app.git 
(…) 
➜ $ find simple-yarn-app -name "*.java" |xargs cat | wc -l 
232
ETL Workflow 
• Get some data from S3/HDFS 
• Map 
• Shuffle 
• Reduce 
• Save to S3/HDFS
Repeat 10 times
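The three phases of the workflow above can be sketched with plain Scala collections, so the example runs without a cluster. This is only an illustration: the word-count workload and the names `mapPhase`, `shufflePhase` and `reducePhase` are assumptions of this sketch, not something from the talk.

```scala
object MapShuffleReduce {
  // Map: emit one (key, value) pair per word in each input record.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(line => line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq)

  // Shuffle: bring all values that share a key together.
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2)) }

  // Reduce: collapse the values of each key into a single result.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (key, values) => (key, values.sum) }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shufflePhase(mapPhase(lines)))

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("get map shuffle", "map reduce save", "map")))
}
```

In a chained MapReduce pipeline each such round also reads its input from, and writes its output to, S3/HDFS, which is where the repeated cost comes from.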
Issue: Test run time 
• Job startup time: ~20 s to run a job that does nothing 
• Hard to test the code without a cluster (Cascading 
simulation mode != real life)
Issue: new applications 
MapReduce awkward for key big data workloads: 
• Low-latency dispatch (e.g. quick queries) 
• Iterative algorithms (e.g. ML, graph…) 
• Streaming data ingest
Issue: hardware is moving on 
Hardware has advanced since Hadoop started: 
• Very large RAMs, Faster networks (10Gb+) 
• Bandwidth to disk not keeping up 
• 1 GB of RAM ~ $0.75/month * 
*based on a spot price of AWS r3.8xlarge instance
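As a rough check of the figure above, the per-GB cost can be turned back into an hourly instance price. The sketch assumes 244 GiB of RAM for an r3.8xlarge and an average month of 730 hours; the slide does not state the spot price itself, so the result is only what the $0.75 figure implies.

```scala
object RamCost {
  val ramGiB = 244.0        // memory of one r3.8xlarge instance
  val hoursPerMonth = 730.0 // assumption: average month length in hours

  // Hourly instance price implied by a given cost per GB of RAM per month.
  def impliedHourlyPrice(costPerGbMonth: Double): Double =
    costPerGbMonth * ramGiB / hoursPerMonth

  def main(args: Array[String]): Unit =
    println(f"implied spot price: $$${impliedHourlyPrice(0.75)}%.3f per hour")
}
```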
How can we 
supercharge our ETL?
Use Spark 
• Fast and Expressive Cluster Computing Engine 
• Compatible with Apache Hadoop 
• In-memory storage 
• Rich APIs in Java, Scala, Python
Why Spark? 
• Up to 40x faster than Hadoop MapReduce 
( for some use cases, see: https://amplab.cs.berkeley.edu/benchmark/ ) 
• Jobs can be scheduled and run in <1s 
• Typically less code (2-5x) 
• Seamless Hadoop/HDFS integration 
• REPL 
• Accessible Source in terms of LOC and modularity
Why Spark? 
• Berkeley Data Analytics Stack ecosystem: 
• Spark, Spark Streaming, Shark, BlinkDB, MLlib 
• Deep integration into Hadoop ecosystem 
• Read/write Hadoop formats 
• Interoperability with other ecosystem components 
• Runs on Mesos & YARN, also MR1 
• EC2, EMR 
• HDFS, S3
Why Spark?
Using RAM for in-memory caching
Fault recovery
Stack 
Also: 
• SHARK ( Hive on Spark ) 
• Tachyon ( off heap caching ) 
• SparkR ( R wrapper ) 
• BlinkDB ( Approximate Queries)
Real-life use
Spark use-cases 
• next-generation ETL platform 
• No more “multiple chained MapReduce jobs” 
architecture 
• Fewer jobs to worry about 
• Better sleep for your DevOps team
Sessionization 
Add session_id to events
Why add session id? 
Combine all user activity into user sessions
Adding session ID 
user_id  timestamp   Referrer              URL
user1    1401207490  http://fb.com         http://webpage/
user2    1401207491  http://twitter.com    http://webpage/
user1    1401207543  http://webpage/       http://webpage/login
user1    140120841   http://webpage/login  http://webpage/add_to_cart
user2    1401207491  http://webpage/       http://webpage/product1
Group by user 
user_id  timestamp   Referrer              URL
user1    1401207490  http://fb.com         http://webpage/
user1    1401207543  http://webpage/       http://webpage/login
user1    140120841   http://webpage/login  http://webpage/add_to_cart
user2    1401207491  http://twitter.com    http://webpage/
user2    1401207491  http://webpage/       http://webpage/product1
Add unique session id 
user_id  timestamp   session_id                        Referrer              URL
user1    1401207490  8fddc743bfbafdc45e071e5c126ceca7  http://fb.com         http://webpage/
user1    1401207543  8fddc743bfbafdc45e071e5c126ceca7  http://webpage/       http://webpage/login
user1    140120841   8fddc743bfbafdc45e071e5c126ceca7  http://webpage/login  http://webpage/add_to_cart
user2    1401207491  c00e7421525008584d9d1ff4201cbf65  http://twitter.com    http://webpage/
(last row cropped on the slide: 140120749… c00e742152500… http://webpage/product…)
Join with external data 
user_id  timestamp   session_id                        new_user  Referrer              URL
user1    1401207490  8fddc743bfbafdc45e071e5c126ceca7  TRUE      http://fb.com         http://webpage/
user1    1401207543  8fddc743bfbafdc45e071e5c126ceca7  TRUE      http://webpage/       http://webpage/login
user1    140120841   8fddc743bfbafdc45e071e5c126ceca7  TRUE      http://webpage/login  http://webpage/add_to_cart
user2    1401207491  c00e7421525008584d9d1ff4201cbf65  FALSE     http://twitter.com    http://webpage/
(last row cropped on the slide: c00e7421525…)
Sessionize user clickstream 
• Filter interesting events 
• Group by user 
• Add unique sessionId 
• Join with external data sources 
• Write output
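// Read tab-separated click events; column 0 is the user_id
val input = sc.textFile("file:///tmp/input")
val rawEvents = input
  .map(line => line.split("\t"))

// External user data, keyed by user_id
val userInfo = sc.textFile("file:///tmp/userinfo")
  .map(line => line.split("\t"))
  .map(user => (user(0), user))

val processedEvents = rawEvents
  .map(arr => (arr(0), arr))   // key events by user_id
  .cogroup(userInfo)           // collect events and user rows per user
  .flatMapValues { case (events, users) =>
    // "true" when the user has at least one row in the external data
    val new_user = if (users.nonEmpty) "true" else "false"
    // one random session id shared by all events of this user
    val session_id = java.util.UUID.randomUUID.toString
    events.map(line =>
      // keep the first three columns, then add session_id and new_user
      line.slice(0, 3) ++ Array(session_id) ++ Array(new_user) ++ line.drop(3)
    )
  }
  .map(k => k._2)              // drop the key, keep the enriched rows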
Why is it better? 
• Single Spark job 
• Easier to maintain than 3 consecutive MapReduce 
stages 
• Can be unit tested
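One way to make the job unit-testable is to pull the per-user logic out into a pure function that needs no SparkContext. The sketch below is an assumption about how that could look, not code from the talk: the names `Sessionize.addSession` and `newSessionId` are hypothetical, and the session-id generator is passed in so a test can supply a fixed value.

```scala
object Sessionize {
  // Pure per-user step: takes one user's events and that user's rows from
  // the external data, returns the events with session_id and new_user added.
  def addSession(
      events: Seq[Array[String]],
      userRows: Seq[Array[String]],
      newSessionId: () => String): Seq[Array[String]] = {
    val newUser = if (userRows.nonEmpty) "true" else "false"
    val sessionId = newSessionId() // generated once, shared by all events
    events.map(line => line.slice(0, 3) ++ Array(sessionId, newUser) ++ line.drop(3))
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(Array("user1", "1401207490", "http://fb.com", "http://webpage/"))
    val out = addSession(events, Seq(Array("user1")), () => java.util.UUID.randomUUID.toString)
    out.foreach(row => println(row.mkString("\t")))
  }
}
```

Inside the Spark job, the body of `flatMapValues` would then reduce to a call to this function with `java.util.UUID.randomUUID.toString` as the generator.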
From the DevOps 
perspective
v1.0 - running on EC2 
• Start with an EC2 script 
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> \
  --instance-type=c3.xlarge launch <cluster-name> 
If it does not work for you, modify it: it is just a simple 
Python + boto script
v2.0 - Autoscaling on spot instances 
1x Master - on-demand (c3.large) 
XX Slaves - spot instances depending on usage patterns (r3.*) 
• no HDFS 
• persistence in memory + S3
Other options 
• Mesos 
• YARN 
• MR1
Lessons learned
JVM issues 
• java.lang.OutOfMemoryError: GC overhead limit exceeded 
• add more memory? 
val sparkConf = new SparkConf() 
.set("spark.executor.memory", "120g") 
.set("spark.storage.memoryFraction","0.3") 
.set("spark.shuffle.memoryFraction","0.3") 
• increase parallelism: 
sc.textFile("s3://..path", 10000) 
groupByKey(10000)
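To see what the two fraction settings above mean in absolute terms, the executor heap can be split nominally as follows. This is only a back-of-the-envelope sketch: it ignores any safety margins Spark applies on top of these fractions, and the names `MemoryBudget` and `nominalSplit` are hypothetical.

```scala
object MemoryBudget {
  final case class Split(storageGb: Double, shuffleGb: Double, otherGb: Double)

  // Nominal division of the executor heap between cached data, shuffle
  // buffers, and everything else (user code, task working memory).
  def nominalSplit(executorGb: Double, storageFraction: Double, shuffleFraction: Double): Split = {
    val storage = executorGb * storageFraction
    val shuffle = executorGb * shuffleFraction
    Split(storage, shuffle, executorGb - storage - shuffle)
  }

  def main(args: Array[String]): Unit =
    println(nominalSplit(120.0, 0.3, 0.3))
}
```

With the values from the slide, roughly 36 GB go to storage, 36 GB to shuffle, and the remaining 48 GB are left for everything else.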
Full GC 
2014-05-21T10:15:23.203+0000: 200.710: [Full GC 109G->45G(110G), 79.3771030 secs] 
2014-05-21T10:16:42.580+0000: 280.087: Total time for which application threads were stopped: 79.3773830 seconds 
We want to avoid this 
• Use G1GC + Java 8 
• Store data serialized 
set("spark.serializer","org.apache.spark.serializer.KryoSerializer") 
set("spark.kryo.registrator","scee.SceeKryoRegistrator")
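Long pauses like the one in the log above are easy to miss by eye, so a small helper can flag them. The sketch below is an assumption, not a tool from the talk: the `GcLog` object is hypothetical, and its regular expression only matches Full GC lines of the exact shape shown on the slide (GC log formats vary with JVM version and flags).

```scala
object GcLog {
  // Matches e.g. "[Full GC 109G->45G(110G), 79.3771030 secs]"
  // and captures the pause duration in seconds.
  private val FullGc =
    """\[Full GC \d+[KMG]->\d+[KMG]\(\d+[KMG]\), ([\d.]+) secs\]""".r

  // Pause in seconds if the line records a Full GC, otherwise None.
  def fullGcPauseSeconds(line: String): Option[Double] =
    FullGc.findFirstMatchIn(line).map(_.group(1).toDouble)

  // All Full GC pauses longer than the given threshold.
  def longPauses(lines: Seq[String], thresholdSeconds: Double): Seq[Double] =
    lines.flatMap(line => fullGcPauseSeconds(line).toList).filter(_ > thresholdSeconds)

  def main(args: Array[String]): Unit = {
    val line = "2014-05-21T10:15:23.203+0000: 200.710: [Full GC 109G->45G(110G), 79.3771030 secs]"
    println(fullGcPauseSeconds(line))
  }
}
```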
Bugs 
• For example: CDH5 does not work with Amazon S3 out of the 
box (thanks to Sean, it will be fixed in the next release) 
• If in doubt, use the provided ec2/spark-ec2 script: 
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> \
  --instance-type=c3.xlarge launch <cluster-name>
Tips & Tricks 
• You do not need to package the whole of Spark with your app; just 
specify the dependencies as "provided" in sbt: 
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-cdh5.0.1" % "provided" 
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-cdh5.0.1" % "provided" 
Assembly jar size: from 120 MB down to 5 MB 
• Always ensure you are compiling against the same versions of 
artifacts; if not, "bad things will happen"™
Future - Spark 1.0 
• Voting in progress to release Spark 1.0.0 RC11 
• Spark SQL 
• History server 
• Job Submission Tool 
• Java 8 support
Spark - Hadoop done right 
• Faster to run, less code to write 
• Deploying Spark can be easy and cost-effective 
• Still rough around the edges but improves quickly
Thank you for listening 
:)

"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 

Dernier (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 

ETL with SPARK - First Spark London meetup

  • 1. Supercharging ETL with Spark Rafal Kwasny First Spark London Meetup 2014-05-28
  • 3. About me • Sysadmin/DevOps background • Worked as DevOps @Visualdna • Now building game analytics platform @Sony Computer Entertainment Europe
  • 4. Outline • What is ETL • How do we do it in the standard Hadoop stack • How can we supercharge it with Spark • Real-life use cases • How to deploy Spark • Lessons learned
  • 6. Standard technology stack Load into HDFS / S3
  • 7. Standard technology stack Extract & Transform & Load
  • 8. Standard technology stack Query, Analyze, train ML models
  • 9. Standard technology stack Real Time pipeline
  • 10. Hadoop • Industry standard • Have you ever looked at Hadoop code and tried to fix something?
  • 11. How simple is simple? “Simple YARN application to run n copies of a unix command - deliberately kept simple (with minimal error handling etc.)”
      ➜ $ git clone https://github.com/hortonworks/simple-yarn-app.git
      (…)
      ➜ $ find simple-yarn-app -name "*.java" | xargs cat | wc -l
      232
  • 12. ETL Workflow • Get some data from S3/HDFS • Map • Shuffle • Reduce • Save to S3/HDFS
  • 13. ETL Workflow • Get some data from S3/HDFS • Map • Shuffle • Reduce • Save to S3/HDFS Repeat 10 times
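The map, shuffle and reduce steps listed on this slide can be sketched with plain Scala collections, without a cluster. This is only an illustration of the three stages, not the speaker's job: the object name `EtlStages`, the function names, and the "count events per user" task are assumptions chosen for the example, and the input is assumed to be tab-separated with the user id in the first field.

```scala
object EtlStages {
  // Map: turn each tab-separated log line into a (user_id, 1) pair.
  def mapStage(lines: Seq[String]): Seq[(String, Int)] =
    lines.map(line => (line.split("\t")(0), 1))

  // Shuffle: bring all values that share a key together.
  def shuffleStage(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (user, kvs) => (user, kvs.map(_._2)) }

  // Reduce: collapse the values of each key into a single result.
  def reduceStage(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (user, ones) => (user, ones.sum) }

  // One full pass; a chained MapReduce pipeline repeats this, writing
  // to and reading from S3/HDFS between passes.
  def eventsPerUser(lines: Seq[String]): Map[String, Int] =
    reduceStage(shuffleStage(mapStage(lines)))
}
```

In a chained MapReduce architecture every such pass is a separate job with its own startup cost and intermediate output, which is what "Repeat 10 times" refers to.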
  • 14. Issue: Test run time • Job startup time ~20s to run a job that does nothing • Hard to test the code without a cluster (Cascading simulation mode != real life)
  • 15. Issue: new applications MapReduce is awkward for key big data workloads: • Low-latency dispatch (e.g. quick queries) • Iterative algorithms (e.g. ML, graph…) • Streaming data ingest
  • 16. Issue: hardware is moving on Hardware has advanced since Hadoop started: • Very large RAM, faster networks (10Gb+) • Bandwidth to disk not keeping up • 1 GB of RAM ~ $0.75/month * *based on the spot price of an AWS r3.8xlarge instance
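As a rough worked example of the RAM figure on this slide, the per-GB price can be turned back into an implied instance price. The slide gives only the $0.75 per GB per month figure; the 244 GB memory size for r3.8xlarge and the 30-day month are assumptions made here, and the object name `RamCost` is hypothetical.

```scala
object RamCost {
  val pricePerGbMonth: Double = 0.75  // from the slide
  val instanceRamGb: Double   = 244.0 // assumption: r3.8xlarge memory size
  val hoursPerMonth: Double   = 30 * 24 // assumption: 30-day month

  // Implied price of the whole instance per month and per hour.
  val instancePerMonth: Double = pricePerGbMonth * instanceRamGb
  val impliedHourly: Double    = instancePerMonth / hoursPerMonth
}
```

Under these assumptions the slide's figure corresponds to roughly $183 per instance per month, or about $0.25 per hour at spot pricing.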
  • 17. How can we supercharge our ETL?
  • 18. Use Spark • Fast and Expressive Cluster Computing Engine • Compatible with Apache Hadoop • In-memory storage • Rich APIs in Java, Scala, Python
  • 19. Why Spark? • Up to 40x faster than Hadoop MapReduce ( for some use cases, see: https://amplab.cs.berkeley.edu/benchmark/ ) • Jobs can be scheduled and run in <1s • Typically less code (2-5x) • Seamless Hadoop/HDFS integration • REPL • Accessible Source in terms of LOC and modularity
  • 20. Why Spark? • Berkeley Data Analytics Stack ecosystem: • Spark, Spark Streaming, Shark, BlinkDB, MLlib • Deep integration into Hadoop ecosystem • Read/write Hadoop formats • Interoperability with other ecosystem components • Runs on Mesos & YARN, also MR1 • EC2, EMR • HDFS, S3
  • 22. Using RAM for in-memory caching
  • 24. Stack Also: • SHARK ( Hive on Spark ) • Tachyon ( off heap caching ) • SparkR ( R wrapper ) • BlinkDB ( Approximate Queries)
  • 27. Spark use-cases • Next-generation ETL platform • No more “multiple chained MapReduce jobs” architecture • Fewer jobs to worry about • Better sleep for your DevOps team
  • 29. Why add session id? Combine all user activity into user sessions
  • 30. Adding session ID
      user_id  timestamp   Referrer              URL
      user1    1401207490  http://fb.com         http://webpage/
      user2    1401207491  http://twitter.com    http://webpage/
      user1    1401207543  http://webpage/       http://webpage/login
      user1    140120841   http://webpage/login  http://webpage/add_to_cart
      user2    1401207491  http://webpage/       http://webpage/product1
  • 31. Group by user
      user_id  timestamp   Referrer              URL
      user1    1401207490  http://fb.com         http://webpage/
      user1    1401207543  http://webpage/       http://webpage/login
      user1    140120841   http://webpage/login  http://webpage/add_to_cart
      user2    1401207491  http://twitter.com    http://webpage/
      user2    1401207491  http://webpage/       http://webpage/product1
  • 32. Add unique session id
      user_id  timestamp   session_id                        Referrer              URL
      user1    1401207490  8fddc743bfbafdc45e071e5c126ceca7  http://fb.com         http://webpage/
      user1    1401207543  8fddc743bfbafdc45e071e5c126ceca7  http://webpage/       http://webpage/login
      user1    140120841   8fddc743bfbafdc45e071e5c126ceca7  http://webpage/login  http://webpage/add_to_cart
      user2    1401207491  c00e7421525008584d9d1ff4201cbf65  http://twitter.com    http://webpage/
      (last row cut off in the source: 140120749 c00e742152500 http://webpage/product)
  • 33. Join with external data
      user_id  timestamp   session_id                        new_user  Referrer              URL
      user1    1401207490  8fddc743bfbafdc45e071e5c126ceca7  TRUE      http://fb.com         http://webpage/
      user1    1401207543  8fddc743bfbafdc45e071e5c126ceca7  TRUE      http://webpage/       http://webpage/login
      user1    140120841   8fddc743bfbafdc45e071e5c126ceca7  TRUE      http://webpage/login  http://webpage/add_to_cart
      user2    1401207491  c00e7421525008584d9d1ff4201cbf65  FALSE     http://twitter.com    http://webpage/
      (last row cut off in the source: c00e7421525)
  • 34. Sessionize user clickstream • Filter interesting events • Group by user • Add unique sessionId • Join with external data sources • Write output
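  • 35.
      // Load tab-separated click events; field 0 is the user id.
      val input = sc.textFile("file:///tmp/input")
      val rawEvents = input
        .map(line => line.split("\t"))

      // External user data, keyed by user id.
      val userInfo = sc.textFile("file:///tmp/userinfo")
        .map(line => line.split("\t"))
        .map(user => (user(0), user))

      val processedEvents = rawEvents
        .map(arr => (arr(0), arr))
        .cogroup(userInfo) // per user: (events, matching userInfo rows)
        .flatMapValues(k => {
          // Flag is "true" when the user has at least one row in userInfo.
          val new_user = if (k._2.nonEmpty) "true" else "false"
          // One random id shared by all of this user's events.
          val session_id = java.util.UUID.randomUUID.toString
          // Insert session_id and new_user after the first three fields.
          k._1.map(line =>
            line.slice(0, 3) ++ Array(session_id) ++ Array(new_user) ++ line.drop(3)
          )
        })
        .map(k => k._2) // drop the key, keep the enriched rows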
  • 36. Why is it better? • Single Spark job • Easier to maintain than 3 consecutive MapReduce stages • Can be unit tested
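To illustrate the "can be unit tested" point, one option is to pull the per-user logic out of the Spark job into pure functions that run on ordinary collections. The sketch below is an assumption about how that could look, not the speaker's code: `Sessionize`, `enrich`, `sessionize`, the `knownUsers` set (standing in for the userinfo data) and the injectable `newSessionId` function are all hypothetical names. It mirrors the slide's behaviour of one id per user and a flag that is true when the user appears in the external data.

```scala
object Sessionize {
  // Insert session_id and the new_user flag after the first three fields.
  def enrich(event: Array[String], sessionId: String, newUser: Boolean): Array[String] =
    event.slice(0, 3) ++ Array(sessionId, newUser.toString) ++ event.drop(3)

  // Group events by user id (field 0), give each user one session id,
  // and flag users that appear in the external data set.
  def sessionize(
      events: Seq[Array[String]],
      knownUsers: Set[String],
      newSessionId: () => String
  ): Seq[Array[String]] =
    events.groupBy(_(0)).toSeq.flatMap { case (userId, userEvents) =>
      val sessionId = newSessionId()
      val newUser   = knownUsers.contains(userId)
      userEvents.map(e => enrich(e, sessionId, newUser))
    }
}
```

Passing the id generator in as a function lets a test supply a fixed value instead of a random UUID, so the output can be compared exactly.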
  • 37. From the DevOps perspective
  • 38. v1.0 - running on EC2 • Start with an EC2 script
      ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name>
    If it does not work for you, modify it; it is just a simple Python + boto script
  • 39. v2.0 - Autoscaling on spot instances 1x Master - on-demand (c3.large) XX Slaves - spot instances depending on usage patterns (r3.*) • no HDFS • persistence in memory + S3
  • 40. Other options • Mesos • YARN • MR1
  • 42. JVM issues • java.lang.OutOfMemoryError: GC overhead limit exceeded • add more memory?
      val sparkConf = new SparkConf()
        .set("spark.executor.memory", "120g")
        .set("spark.storage.memoryFraction", "0.3")
        .set("spark.shuffle.memoryFraction", "0.3")
    • increase parallelism:
      sc.textFile("s3://..path", 10000)
      groupByKey(10000)
  • 43. Full GC
      2014-05-21T10:15:23.203+0000: 200.710: [Full GC 109G->45G(110G), 79.3771030 secs]
      2014-05-21T10:16:42.580+0000: 280.087: Total time for which application threads were stopped: 79.3773830 seconds
    We want to avoid this • Use G1GC + Java 8 • Store data serialized
      set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      set("spark.kryo.registrator", "scee.SceeKryoRegistrator")
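Long pauses like the one in this log are easier to catch if the GC log is scanned automatically. The following is a minimal sketch, assuming log lines shaped exactly like the Full GC line on this slide (sizes in whole gigabytes with a `G` suffix); real GC logs vary by JVM version and flags, so the pattern would need adjusting. The names `GcLog`, `FullGcEvent` and `parse` are hypothetical.

```scala
object GcLog {
  // Matches e.g. "[Full GC 109G->45G(110G), 79.3771030 secs]".
  private val FullGc =
    """\[Full GC (\d+)G->(\d+)G\((\d+)G\), ([\d.]+) secs\]""".r

  final case class FullGcEvent(beforeGb: Int, afterGb: Int, heapGb: Int, pauseSecs: Double)

  // Returns the parsed event, or None if the line is not a Full GC line
  // in the assumed format.
  def parse(line: String): Option[FullGcEvent] =
    FullGc.findFirstMatchIn(line).map { m =>
      FullGcEvent(m.group(1).toInt, m.group(2).toInt, m.group(3).toInt, m.group(4).toDouble)
    }
}
```

A monitoring script could then filter for events whose `pauseSecs` exceeds a chosen threshold and alert on them.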
  • 44. Bugs • For example: CDH5 does not work with Amazon S3 out of the box (thanks to Sean it will be fixed in the next release) • If in doubt, use the provided ec2/spark-ec2 script
      ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --instance-type=c3.xlarge launch <cluster-name>
  • 45. Tips & Tricks • You do not need to package the whole of Spark with your app; just mark the dependencies as "provided" in sbt
      libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-cdh5.0.1" % "provided"
      libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-cdh5.0.1" % "provided"
    Assembly jar size goes from 120MB -> 5MB • Always ensure you are compiling against the same version of artifacts; if not, “bad things will happen”™
  • 46. Future - Spark 1.0 • Voting in progress to release Spark 1.0.0 RC11 • Spark SQL • History server • Job Submission Tool • Java 8 support
  • 47. Spark - Hadoop done right • Faster to run, less code to write • Deploying Spark can be easy and cost-effective • Still rough around the edges but improves quickly
  • 48. Thank you for listening :)

Editor's notes

  1. My experience supercharging Extract Transform Load workloads with Spark
  2. Get the data (access logs + application logs )
  3. Put it into S3 / load into HDFS
  4. Transform using Hive/Streaming/Cascading/Scalding into flat structure you can query
  5. Load into MPP database / Query using HIVE
  6. Rewrite all the logic for real time, on top of a completely different technology (Storm, Samza, etc.)
  7. Is it the best option?
  8. REPL: read–eval–print loop