Beyond Shuffling
tips & tricks for scaling Apache Spark
Global Big Data SJ 2015 (early version)
Who am I?
My name is Holden Karau
Preferred pronouns are she/her
I’m a Software Engineer at IBM
previously Alpine, Databricks, Google, Foursquare & Amazon
co-author of Learning Spark & Fast Data Processing with Spark
co-author of a new book focused on Spark performance coming out next year*
@holdenkarau
SlideShare: http://www.slideshare.net/hkarau
What is going to be covered:
What I think I might know about you
RDD re-use (caching, persistence levels, and checkpointing)
Working with key/value data
Why groupByKey is evil and what we can do about it
Best practices for Spark accumulators*
When Spark SQL can be amazing and wonderful
A quick detour into some future performance work in Spark MLlib
Who I think you wonderful humans are?
Nice* people
Know some Apache Spark
Want to scale your Apache Spark jobs
Comfortable reading Scala
If you want to follow along with the exercise
Make sure you have a recent-ish JDK
Install Spark (any precompiled Hadoop version)
http://spark.apache.org/downloads.html
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream
RDD re-use - sadly not magic
If we know we are going to re-use the RDD what should we do?
If it fits nicely in memory: cache it in memory
Otherwise: persist at another level
MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
Or: checkpoint
Noisy clusters
The _2 (replicated) storage levels & checkpointing can help
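The options above can be sketched roughly as follows. This is a minimal illustration rather than code from the talk: it assumes a local master, and the object name, the sample data and the temporary checkpoint directory are all made up for the example (on a real cluster the checkpoint directory would normally live on a reliable store such as HDFS).

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ReuseSketch {
  // Returns (sum of squares, number of elements) while re-using one RDD twice.
  def totals(sc: SparkContext): (Long, Long) = {
    // Hypothetical checkpoint location, only suitable for a local run.
    sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

    val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)

    // Serialized in memory, spilling to disk when it does not fit.
    // StorageLevel.MEMORY_AND_DISK_SER_2 would also keep a second replica.
    squares.persist(StorageLevel.MEMORY_AND_DISK_SER)
    // Has to be requested before the first action on this RDD.
    squares.checkpoint()

    val total = squares.reduce(_ + _) // first action: computes and persists the RDD
    val count = squares.count()       // second action: re-uses the persisted data
    (total, count)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("reuse-sketch"))
    try println(totals(sc)) finally sc.stop()
  }
}
```

Persisting before checkpointing is a common pairing, since otherwise writing the checkpoint can recompute the RDD a second time.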
Considerations for Key/Value Data
What does the distribution of keys look like?
What type of aggregations do we need to do?
Do we want our data in any particular order?
Are we joining with another RDD?
What's our partitioner?
What is key skew and why do we care?
Keys aren’t evenly distributed
Sales by zip code, or records by city, etc.
groupByKey will explode (but it's pretty easy to break)
We can have really unbalanced partitions
If we have enough key skew sortByKey could even fail
Stragglers (uneven sharding can make some tasks take much longer)
groupByKey - just how evil is it?
Pretty evil
Groups all of the records with the same key into a single record
Even if we immediately reduce it (e.g. sum it or similar)
This can be too big to fit in memory, then our job fails
Unless we are in SQL then happy pandas
Let’s revisit wordcount with groupByKey
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)
And now back to the “normal” version
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts
Let’s launch spark and compare the two
You can get Spark from http://spark.apache.org/downloads.html
You need a recent version of Java
If installing is difficult don’t worry - the results will be in the slides
Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp
Code to compare the two:
val rdd = sc.textFile("python/pyspark/*.py", 20) // Make sure we have many partitions
// Evil group by key version
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
val evilWordCounts = grouped.mapValues(_.sum)
evilWordCounts.take(5)
// Less evil version
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts.take(5)
groupByKey [screenshot not included in this transcript]
reduceByKey [screenshot not included in this transcript]
So why did we read in python/pyspark/*.py?
If we just read in the standard README.md file there aren’t enough duplicated
keys for the reduceByKey & groupByKey difference to be really apparent
Which is why groupByKey can be safe sometimes
So what did we do instead?
reduceByKey
Works when the types are the same (e.g. in our summing version)
aggregateByKey
Doesn’t require the types to be the same (e.g. computing stats model or similar)
Allows Spark to pipeline the reduction & skip making the list
We also got a map-side reduction (note the difference in shuffled read)
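As a rough illustration of the aggregateByKey case, here is a sketch that computes a per-key mean, where the accumulator type (sum, count) differs from the value type. The object name, the helper meanByKey and the sample data are assumptions for the example, not part of the original slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AggregateSketch {
  // Per-key mean: values are Double, the accumulator is a (sum, count) pair.
  def meanByKey(pairs: RDD[(String, Double)]): RDD[(String, Double)] =
    pairs
      .aggregateByKey((0.0, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1), // fold one value into a partition-local accumulator
        (a, b) => (a._1 + b._1, a._2 + b._2)  // merge accumulators coming from different partitions
      )
      .mapValues { case (sum, n) => sum / n }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("aggregate-sketch"))
    try {
      val sales = sc.parallelize(Seq(("94110", 10.0), ("94110", 30.0), ("10003", 5.0)), 2)
      meanByKey(sales).collect().foreach(println)
    } finally sc.stop()
  }
}
```

Because the first function runs inside each partition before the shuffle, only one small accumulator per key and partition is sent over the network, and no per-key list is ever built.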
Can just the shuffle cause problems?
Sorting by key can put all of the records in the same partition
We can run into partition size limits (around 2GB)
Or just get bad performance
So that we can handle data like this, we can add some “junk” to our key:
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
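One possible reading of the "junk" trick, sketched with the records above: key by (zip code, second field) instead of the zip code alone, so that the range partitioner behind sortByKey is able to split a hot zip code such as 94110 across several partitions. The object name, the helper sortWithWiderKey and the choice of the second field as the extra key component are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object JunkKeySketch {
  type Record = (Int, String, String)

  // The records shown on the slide.
  val sample: Seq[Record] = Seq(
    (94110, "A", "B"), (94110, "A", "C"), (10003, "D", "E"), (94110, "E", "F"),
    (94110, "A", "R"), (10003, "A", "R"), (94110, "D", "R"), (94110, "E", "R"),
    (94110, "E", "R"), (67843, "T", "R"), (94110, "T", "R"), (94110, "T", "R"))

  // Sort by a wider key so that one zip code no longer has to fit in one partition.
  def sortWithWiderKey(records: RDD[Record], partitions: Int): RDD[Record] =
    records
      .map { case r @ (zip, a, _) => ((zip, a), r) } // the extra key component is the "junk"
      .sortByKey(ascending = true, numPartitions = partitions)
      .values

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("junk-key-sketch"))
    try sortWithWiderKey(sc.parallelize(sample, 3), 4).collect().foreach(println)
    finally sc.stop()
  }
}
```

The output is still ordered by zip code first, so downstream code that only cares about the zip code ordering keeps working.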
Spark accumulators
Really “great” way for keeping track of failed records
Double counting makes things really tricky
Jobs which worked “fine” don’t continue to work “fine” when minor changes happen
Relative rules can save us* under certain conditions
Using an accumulator for validation:
// Spark 1.x accumulator API; on Spark 2.0+ sc.longAccumulator replaces it
val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
val records = input.map { x =>
  if (isValid(x)) ok += 1 else bad += 1
  // Actual parse logic here
}
// An action (e.g. count, save, etc.) has to run before the values are read
if (bad.value > 0.1 * ok.value) {
  // Optional cleanup
  throw new Exception("bad data - do not use results")
}
// Mark as safe
P.S: If you are interested in this check out spark-validator (still early stages).
Where can Spark SQL benefit perf?
Structured or semi-structured data
OK with having less* complex operations available to us
We may only need to operate on a subset of the data
The fastest data to process isn’t even read
Remember that non-magic cat? It's got some magic** now
In part from peeking inside of boxes
**Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting magic
Why is Spark SQL good for those things?
Space efficient columnar cached representation
Able to push down operations to the data store
Optimizer is able to look inside of our operations
Regular Spark can’t see inside our operations to spot the difference between (min(_, _)) and (append(_, _))
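To make the contrast concrete, here is a small sketch of a minimum per key written with the DataFrame API, where the aggregate is a named expression the optimizer can inspect rather than an opaque closure. It assumes Spark 2.0 or later (SparkSession is newer than these slides), and the object name, column names and sample rows are invented for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.min

object SqlSketch {
  // min("amount") is a declarative expression, so the planner knows it is a
  // simple aggregate and can plan a partial aggregation before the shuffle.
  def minByZip(sales: DataFrame): DataFrame =
    sales.groupBy("zip").agg(min("amount").as("min_amount"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sql-sketch").getOrCreate()
    import spark.implicits._
    try {
      val sales = Seq((94110, 12.5), (94110, 3.0), (10003, 7.25)).toDF("zip", "amount")
      val cheapest = minByZip(sales)
      cheapest.explain() // prints the physical plan chosen by the optimizer
      cheapest.show()
    } finally spark.stop()
  }
}
```

Reading the explain() output is a quick way to check what the optimizer actually decided for a given query.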
Preview: bringing codegen to Spark ML
Based on Spark SQL’s code generation
First draft using quasiquotes
Switch to Janino for Java compilation
Initial draft for Gradient Boosted Trees
Based on DB’s work
First draft with QuasiQuotes
Moved to Java for speed
See SPARK-10387 for the details
What the generated code looks like:
@Override
public double call(Vector input) throws Exception {
  if (input.apply(1) <= 1.0) {
    return 0.1;
  } else {
    if (input.apply(0) <= 0.5) {
      return 0.0;
    } else {
      return 2.0;
    }
  }
}
The tree it was generated from, as (feature index, threshold) nodes:
(1, 1.0)
  <= : 0.1
  >  : (0, 0.5)
         <= : 0.0
         >  : 2.0
Everyone* needs reduce, let’s make it faster!
reduce & aggregate have “tree” versions
we already had free map-side reduction
but now we can get even better!**
**And we might be able to make even cooler versions
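A minimal sketch of the "tree" versions next to the plain ones. reduce sends every partition's result straight to the driver, while treeReduce and treeAggregate first merge partial results on the executors in a few rounds. The object and method names, the data and the depth values are assumptions picked for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TreeSketch {
  // The same sum computed with the flat and the tree-shaped reduction.
  def sumBothWays(nums: RDD[Long]): (Long, Long) = {
    val flat = nums.reduce(_ + _)                // all partition results are merged on the driver
    val tree = nums.treeReduce(_ + _, depth = 3) // partial results are merged on executors first
    (flat, tree)
  }

  // (sum, count) in one pass using the tree version of aggregate.
  def sumAndCount(nums: RDD[Long]): (Long, Long) =
    nums.treeAggregate((0L, 0L))(
      (acc, x) => (acc._1 + x, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2),
      depth = 2)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("tree-sketch"))
    try {
      val nums = sc.parallelize(1L to 100000L, 64)
      println(sumBothWays(nums))
      println(sumAndCount(nums))
    } finally sc.stop()
  }
}
```

The tree versions tend to pay off when there are many partitions or large per-partition results, since the driver then has far fewer values to merge.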
Additional Resources
Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.)
http://spark.apache.org/docs/latest/
Books
Videos
Our next meetup!
Spark Office Hours
follow me on twitter for future ones - https://twitter.com/holdenkarau
fill out this survey to choose the next date - http://bit.ly/spOffice1
Q&A OR a quick detour into Spark testing?
It's like a choose your own adventure novel, but with voting
But more like the voting in High School, since if we are running out of time we might just skip it
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
Spark Videos
Apache Spark YouTube Channel
My Spark videos on YouTube - http://bit.ly/holdenSparkVideos
Spark Summit 2014 training
Paco’s Introduction to Apache Spark
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau

SIP trunking in Janus @ Kamailio World 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Beyond Shuffling tips & tricks for scaling Apache Spark

  • 1. Beyond Shuffling: tips & tricks for scaling Apache Spark
      • Global Big Data SJ 2015
      • early version
  • 2. Who am I?
      • My name is Holden Karau
      • Preferred pronouns are she/her
      • I’m a Software Engineer at IBM
      • previously Alpine, Databricks, Google, Foursquare & Amazon
      • co-author of Learning Spark & Fast Data Processing with Spark
      • co-author of a new book focused on Spark performance coming out next year*
      • @holdenkarau
      • SlideShare: http://www.slideshare.net/hkarau
  • 3. What is going to be covered:
      • What I think I might know about you
      • RDD re-use (caching, persistence levels, and checkpointing)
      • Working with key/value data
      • Why groupByKey is evil and what we can do about it
      • Best practices for Spark accumulators*
      • When Spark SQL can be amazing and wonderful
      • A quick detour into some future performance work in Spark MLlib
  • 4. Who I think you wonderful humans are?
      • Nice* people
      • Know some Apache Spark
      • Want to scale your Apache Spark jobs
      • Comfortable reading Scala
      • Photo: Lori Erickson
  • 5. If you want to follow along with the exercise
      • Make sure you have a recent-ish JDK
      • Install Spark (any precompiled Hadoop version)
      • http://spark.apache.org/downloads.html
  • 6. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
      • Photo from Cocoa Dream
  • 7. RDD re-use - sadly not magic
      • If we know we are going to re-use the RDD what should we do?
          • If it fits nicely in memory: caching in memory
          • persisting at another level: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
          • checkpointing
      • Noisy clusters
          • the replicated _2 storage levels (e.g. MEMORY_ONLY_2) & checkpointing can help
      • Photo: Richard Gillin
  • 8. Considerations for Key/Value Data
      • What does the distribution of keys look like?
      • What type of aggregations do we need to do?
      • Do we want our data in any particular order?
      • Are we joining with another RDD?
      • What's our partitioner?
      • Photo: eleda
  • 9. What is key skew and why do we care?
      • Keys aren’t evenly distributed
          • Sales by zip code, or records by city, etc.
      • groupByKey will explode (but it's pretty easy to break)
      • We can have really unbalanced partitions
          • If we have enough key skew sortByKey could even fail
          • Stragglers (uneven sharding can make some tasks take much longer)
      • Photo: Mitchell Joyce
  • 10. groupByKey - just how evil is it?
      • Pretty evil
      • Groups all of the records with the same key into a single record
          • Even if we immediately reduce it (e.g. sum it or similar)
          • This can be too big to fit in memory, then our job fails
      • Unless we are in SQL, then happy pandas
      • Photo: geckoam
  • 11. Let’s revisit wordcount with groupByKey
        val words = rdd.flatMap(_.split(" "))
        val wordPairs = words.map((_, 1))
        val grouped = wordPairs.groupByKey()
        grouped.mapValues(_.sum)
  • 12. And now back to the “normal” version
        val words = rdd.flatMap(_.split(" "))
        val wordPairs = words.map((_, 1))
        val wordCounts = wordPairs.reduceByKey(_ + _)
        wordCounts
  • 13. Let’s launch Spark and compare the two
      • You can get Spark from http://spark.apache.org/downloads.html
      • You need a recent version of Java
      • If installing is difficult don’t worry - the results will be in the slides
      • Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp
  • 14. Code to compare the two:
      • Quick pastebin of the code for the two: http://pastebin.com/CKn0bsqp
        val rdd = sc.textFile("python/pyspark/*.py", 20) // Make sure we have many partitions
        // Evil group by key version
        val words = rdd.flatMap(_.split(" "))
        val wordPairs = words.map((_, 1))
        val grouped = wordPairs.groupByKey()
        val evilWordCounts = grouped.mapValues(_.sum)
        evilWordCounts.take(5)
        // Less evil version
        val wordCounts = wordPairs.reduceByKey(_ + _)
        wordCounts.take(5)
  • 17. So why did we read in python/*.py?
      • If we just read in the standard README.md file there aren’t enough duplicated keys for the reduceByKey & groupByKey difference to be really apparent
      • Which is why groupByKey can be safe sometimes
  • 18. So what did we do instead?
      • reduceByKey
          • Works when the types are the same (e.g. in our summing version)
      • aggregateByKey
          • Doesn’t require the types to be the same (e.g. computing stats model or similar)
      • Allows Spark to pipeline the reduction & skip making the list
      • We also got a map-side reduction (note the difference in shuffled read)
  • 19. Can just the shuffle cause problems?
      • Sorting by key can put all of the records in the same partition
      • We can run into partition size limits (around 2GB)
      • Or just get bad performance
      • To handle data like the records below we can add some “junk” to our key
        (94110, A, B) (94110, A, C) (10003, D, E) (94110, E, F)
        (94110, A, R) (10003, A, R) (94110, D, R) (94110, E, R)
        (94110, E, R) (67843, T, R) (94110, T, R) (94110, T, R)
      • Photo: Todd Klassy
  • 20. Spark accumulators
      • Really “great” way for keeping track of failed records
      • Double counting makes things really tricky
          • Jobs which worked “fine” don’t continue to work “fine” when minor changes happen
      • Relative rules can save us* under certain conditions
      • Photo: Found Animals Foundation
  • 21. Using an accumulator for validation:
        val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
        val records = input.map{ x =>
          if (isValid(x)) ok += 1 else bad += 1
          // Actual parse logic here
        }
        // An action (e.g. count, save, etc.)
        if (bad.value > 0.1 * ok.value) {
          throw new Exception("bad data - do not use results")
          // Optional cleanup
        }
        // Mark as safe
      • P.S.: If you are interested in this check out spark-validator (still early stages).
      • Photo: Found Animals Foundation
  • 22. Where can Spark SQL benefit perf?
      • Structured or semi-structured data
      • OK with having less* complex operations available to us
      • We may only need to operate on a subset of the data
          • The fastest data to process isn’t even read
      • Remember that non-magic cat? It’s got some magic** now
          • In part from peeking inside of boxes
      • **Magic may cause stack overflow. Not valid in all states. Consult local magic bureau before attempting magic
      • Photo: Matti Mattila
  • 23. Why is Spark SQL good for those things?
      • Space efficient columnar cached representation
      • Able to push down operations to the data store
      • Optimizer is able to look inside of our operations
          • Regular Spark can’t see inside our operations to spot the difference between (min(_, _)) and (append(_, _))
      • Photo: Matti Mattila
  • 24. Preview: bringing codegen to Spark ML
      • Based on Spark SQL’s code generation
          • First draft using quasiquotes
          • Switch to janino for Java compilation
      • Initial draft for Gradient Boosted Trees
          • Based on DB’s work
          • First draft with QuasiQuotes
          • Moved to Java for speed
      • See SPARK-10387 for the details
      • Photo: Jon
  • 25. What the generated code looks like:
        @Override
        public double call(Vector input) throws Exception {
          if (input.apply(1) <= 1.0) {
            return 0.1;
          } else {
            if (input.apply(0) <= 0.5) {
              return 0.0;
            } else {
              return 2.0;
            }
          }
        }
      • Tree diagram on the slide: node (1, 1.0) has leaf 0.1 and child node (0, 0.5), which has leaves 0.0 and 2.0
      • Photo: Glenn Simmons
  • 26. Everyone* needs reduce, let’s make it faster!
      • reduce & aggregate have “tree” versions
      • we already had free map-side reduction
      • but now we can get even better!**
      • **And we might be able to make even cooler versions
  • 27. Additional Resources
      • Programming guide (along with JavaDoc, PyDoc, ScalaDoc, etc.)
          • http://spark.apache.org/docs/latest/
      • Books
      • Videos
      • Our next meetup!
      • Spark Office Hours
          • follow me on twitter for future ones - https://twitter.com/holdenkarau
          • fill out this survey to choose the next date - http://bit.ly/spOffice1
      • Photo: raider of gin
  • 28. Q&A OR A quick detour into Spark testing?
      • It's like a choose your own adventure novel, but with voting
      • But more like the voting in High School since if we are running out of time we might just skip it
  • 29. Books
      • Learning Spark
      • Fast Data Processing with Spark (Out of Date)
      • Fast Data Processing with Spark (2nd edition)
      • Advanced Analytics with Spark
      • Coming soon: Spark in Action
  • 30. Spark Videos
      • Apache Spark YouTube Channel
      • My Spark videos on YouTube - http://bit.ly/holdenSparkVideos
      • Spark Summit 2014 training
      • Paco’s Introduction to Apache Spark
  • 31. k thnx bye!
      • Cat wave photo by Quinn Dombrowski
      • If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
      • Will tweet results “eventually” @holdenkarau
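
Slide 19 suggests adding some “junk” to a skewed key but shows no code. The sketch below is one common way to read that advice, not necessarily the speaker's method: salt each key with a random bucket number, aggregate per (key, salt), then drop the salt and aggregate again. It uses plain Scala collections as a stand-in for RDDs so it runs without Spark; `SaltedKeys`, `salt`, `countByZip`, the local `reduceByKey` helper and the bucket count are all hypothetical names chosen for illustration.

```scala
import scala.util.Random

object SaltedKeys {
  // Hypothetical record shape mirroring slide 19: (zip code, field, field).
  type Record = (Int, String, String)

  // The sample records from slide 19; 94110 is the hot key.
  val sample: Seq[Record] = Seq(
    (94110, "A", "B"), (94110, "A", "C"), (10003, "D", "E"), (94110, "E", "F"),
    (94110, "A", "R"), (10003, "A", "R"), (94110, "D", "R"), (94110, "E", "R"),
    (94110, "E", "R"), (67843, "T", "R"), (94110, "T", "R"), (94110, "T", "R")
  )

  // Step 1: add "junk" to the key. Each record gets a random salt in
  // [0, buckets), so one hot zip code is spread over up to `buckets` keys.
  def salt(records: Seq[Record], buckets: Int, rng: Random): Seq[((Int, Int), Int)] =
    records.map { case (zip, _, _) => ((zip, rng.nextInt(buckets)), 1) }

  // Local stand-in for reduceByKey(_ + _) (an assumption for this sketch;
  // on a real RDD this is where the shuffle would happen).
  def reduceByKey[K](pairs: Seq[(K, Int)]): Map[K, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // Step 2: aggregate per (zip, salt), then strip the salt and aggregate
  // the much smaller set of partial counts per zip.
  def countByZip(records: Seq[Record], buckets: Int, rng: Random): Map[Int, Int] = {
    val partial = reduceByKey(salt(records, buckets, rng))
    reduceByKey(partial.toSeq.map { case ((zip, _), n) => (zip, n) })
  }
}
```

Calling `SaltedKeys.countByZip(SaltedKeys.sample, 4, new Random(42))` gives the same per-zip totals as an unsalted count, because the salt only changes how the intermediate work is split. The trade-off is a second aggregation step in exchange for smaller intermediate groups per key.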

Editor's notes

  1. Photo from https://www.flickr.com/photos/lorika/4148361363/
  2. Photo from https://www.flickr.com/photos/haoli/6349372032/
  3. Photo from https://www.flickr.com/photos/photoverulam/22626301622/
  4. Photo from https://www.flickr.com/photos/eleda/531867386/
  5. Photo from https://www.flickr.com/photos/hckyso/2055866250/
  6. Photo from https://www.flickr.com/photos/geckoam/2956778600/
  7. Photo from https://www.flickr.com/photos/latitudes/66424863/
  8. Photo from https://www.flickr.com/photos/foundanimalsfoundation/8055190879/
  9. Photo from https://www.flickr.com/photos/mattimattila/8190148857/
  10. Photo from https://www.flickr.com/photos/jb-london/6659711647/
  11. Photo from https://www.flickr.com/photos/simmogl/4055700308/
  12. Photo from https://www.flickr.com/photos/fairerdingo/2320356657/