This document is the slide deck for a Big Data Warehousing meetup hosted by Caserta Concepts. The agenda includes an introduction and overview of Spark, a deep dive into SparkSQL, and a live demo, presented by Elliott Cordo of Caserta Concepts. The meetup aims to share stories from the rapidly changing big data landscape and to provide networking opportunities for data professionals.
2. 7:00 Networking
Grab some food and drink... Make some friends.
7:15 Leslie Linsner
Talent Manager
Caserta Concepts
Welcome + Intro + Swag
About the Meetup
About Caserta Concepts
7:30 Elliott Cordo
Chief Architect
Caserta Concepts
Introduction and Overview of Spark
Deep dive into SparkSQL
Demo of SparkSQL!
8:15 Q&A
Ask questions, share your experience with SparkSQL
8:45 More Networking
Don’t leave until you make at least one new Data Nerd friend!
Agenda
3. • Big Data is a complex, rapidly changing
landscape
• We want to share our stories and hear
about yours
• Great networking opportunity for like-minded data nerds
• Founded by Caserta Concepts
• November 10, 2012
• Next BDW Meetup:
• January 27th
• Topic: Graph Databases for MDM
• Location: TBD – Can you host us?
About the BDW Meetup
Twitter: #BDWmeetup #maximizeDataValue @CasertaConcepts
4. About Caserta Concepts
• Award-winning technology innovation consulting with
expertise in:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Ad Tech / Higher Ed
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Strategy, Implementation
• Writing, Education, Mentoring
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
5. Does this word cloud excite you?
Speak with us about our open positions: leslie@casertaconcepts.com
Help Wanted
[Word cloud of open roles and technologies: Big Data Architect, Storm, HBase, Cassandra, ...]
7. About SPARK!
• General Cluster Computing
• Born in UC Berkeley AMPLab around 2009
• Open sourced in 2010
• Donated to the Apache Software Foundation in 2013
• Became a top-level project in early 2014
8. More about Spark
• A Swiss army knife!
• Streaming, batch, and interactive
• RDD – Resilient Distributed Dataset
• APIs for Java, Scala, and Python
10. Current State of Spark
• Now in version 1.2
• ~175 active contributors
• Most Hadoop distros now support Spark, or are in the process of
integrating it
• Databricks is offering commercial support and fully
managed Spark Clusters
• Large number of Organizations using Spark
12. Caserta Active Spark Project
• Interactive SQL on large datasets – financial services
• Big ETL – JSON-crunching ETL pipelines – ad tech
• Several others in R&D
14. SPARK on Elastic Map Reduce
• Not currently a packaged application (coming soon? Or maybe AWS has
other plans for Spark?)
• Easily bootstrapped:
• https://github.com/awslabs/emr-bootstrap-actions
aws emr create-cluster --name SparkCluster --ami-version 3.2.1 \
  --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=caserta-1 --applications Name=Hive \
  --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=["-v1.2.0.a"]
• Latest version is turn-key
• Just need to copy hive-site.xml if accessing Hive tables
• Minor issues with the metastore when Impala and Parquet are installed.
15. Spark can be run locally too!
• Easy for development
• Local development is exactly the same as submitting work on a
cluster!
• IPython Notebook, or your favorite IDE
• Install on your Mac with one command
brew install apache-spark
16. IPython Notebook
• Great interactive environment for performing analysis in Python
• Cloudera has good documentation on configuring it for YARN:
http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
• Hint – if you are installing locally via Homebrew, your
“SPARK_HOME” will be:
/usr/local/Cellar/apache-spark/1.2.0/libexec/
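For a local Homebrew install, the environment wiring above can be sketched roughly as follows (versions and the py4j zip name are assumptions – check what is actually inside your install):

```shell
# Sketch for a local Homebrew install of Spark 1.2.0 -- adjust the
# version numbers to match what Homebrew actually installed.
export SPARK_HOME=/usr/local/Cellar/apache-spark/1.2.0/libexec/
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
ipython notebook   # then: from pyspark import SparkContext
```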
17. So why talk about Spark?
• Many competing big data processing platforms, query
engines, etc.
• Hadoop Map Reduce is fairly mature
18. ..about Hadoop Map Reduce
• We can process very large datasets
• split processes across a large number of machines
• High recoverability/safety – intermediate data is
written to disk
• Efficient and generally fast – move processing to data
• SQL on Hadoop via Hive
19. But map reduce has its downsides
• SLOW – disk based intermediate steps (local disk and
HDFS)
• Especially inefficient for iterative processing like
machine learning
• Challenging to conduct interactive analysis – run job,
go get coffee
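The iteration penalty above can be illustrated with a plain-Python sketch (a toy model, not Hadoop code): under the MapReduce model, every pass of an iterative algorithm writes its intermediate result to disk and reads it back, so disk round-trips scale with the number of iterations.

```python
import json
import os
import tempfile

def mapreduce_iteration(data, step, workdir):
    """One 'iteration': map over the data, persist to disk, read back."""
    result = [step(x) for x in data]                  # "map" phase
    path = os.path.join(workdir, "intermediate.json")
    with open(path, "w") as f:                        # intermediate write
        json.dump(result, f)
    with open(path) as f:                             # next pass re-reads it
        return json.load(f)

with tempfile.TemporaryDirectory() as d:
    data = [1, 2, 3]
    for _ in range(3):            # 3 iterations -> 3 disk round-trips
        data = mapreduce_iteration(data, lambda x: x * 2, d)
    print(data)  # [8, 16, 24]
```

Spark avoids exactly these per-iteration disk round-trips by keeping the working set in memory.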
20. ..about Spark
• In-memory – eliminates intermediate disk-based storage
• Performs a generalized form of map-reduce – splits
processing across a large number of machines
• Fast enough for interactive analysis
• Fault tolerant via lineage tracking
• SparkSQL!
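The lineage-tracking idea above can be sketched in a few lines of plain Python (a hypothetical toy class, not the Spark API): rather than replicating data, each dataset remembers the transformation chain that produced it, so a lost partition can simply be recomputed from the source.

```python
# Toy sketch of RDD-style lineage-based fault tolerance (not Spark itself).
class LineageDataset:
    def __init__(self, source, lineage=()):
        self.source = source      # original input data
        self.lineage = lineage    # chain of transformations applied so far

    def map(self, fn):
        # A transformation returns a NEW dataset that only records the
        # extra step -- nothing is computed or stored yet.
        return LineageDataset(self.source, self.lineage + (fn,))

    def compute(self):
        # Replay the lineage from the source to (re)build the data.
        data = self.source
        for fn in self.lineage:
            data = [fn(x) for x in data]
        return data

ds = LineageDataset([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
print(ds.compute())  # [20, 30, 40]
# If a worker holding the result dies, compute() can simply be replayed
# from the source -- no disk checkpoint of intermediate data is needed.
```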
21. So do we still need Hadoop?
No, but yes
• Why Hadoop?
• YARN
• HDFS
• Hadoop Map Reduce is mature and will still be appropriate for certain
workloads!
• Other services!
• But you can use other resource managers too:
• Mesos
• Spark Standalone
• And can work with other distributed file systems including:
• S3
• Gluster
• Tachyon
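In practice, switching resource managers is largely a matter of passing a different `--master` URL to `spark-submit`; the host names, port numbers, and bucket below are placeholders:

```shell
# Same application, different resource managers (hosts/ports are examples):
spark-submit --master yarn-cluster        my_app.py   # Hadoop YARN
spark-submit --master mesos://host:5050   my_app.py   # Apache Mesos
spark-submit --master spark://host:7077   my_app.py   # Spark standalone

# Different storage backends work inside a job, e.g. S3 instead of HDFS:
#   sc.textFile("s3n://my-bucket/events/*.json")
```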
22. About SparkSQL
• Spark's SQL engine
• Brand new – emerged as alpha in 1.0.1, ~1 year old
• Converts SQL into RDD operations
23. What happened to Shark
• Replaces the Shark query engine
• All-new Catalyst optimizer – Shark leveraged the Hive
optimizer
• Hadoop Map Reduce optimization rules were not
applicable
• Writing optimization rules is made easy – enabling more
community participation
24. We love SQL!
• Huge population of highly skilled developers and analysts
• Compatible with Tooling
• Many operations can easily and efficiently be expressed
in SQL
• Filters
• Joins
• Group by’s
• Aggregates
25. But sometimes SQL is not the best tool!
• Some operations do not fit SQL well
• Iteration
• Row-by-row processing
• Other operations that are not set-based/SQL oriented
Spark can help!
• Spark API
• MLLIB – machine learning
Blend Spark SQL with other code in the same program
26. How can you leverage SPARK SQL
• Batch ETL development
• Interactive
• Spark Shell (PySpark)
• Spark SQL CLI
• Thrift Server (JDBC)
• Beeline
• Query Platforms
• BI Tools
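The Thrift server / Beeline path above can be sketched as follows (paths are relative to a Spark install; 10000 is the default Thrift server port):

```shell
# Start the Thrift JDBC server, which exposes SparkSQL over JDBC:
./sbin/start-thriftserver.sh

# Connect with Beeline and run SQL against it:
./bin/beeline -u jdbc:hive2://localhost:10000
```

BI tools and query platforms connect the same way, through the JDBC endpoint.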
27. SPARK SQL can leverage the Hive
metastore
• Hive Metastore can also be leveraged by a wide array of
applications
• Spark
• Hive
• Impala
• Pig
• Available from HiveContext
28. The Basics of the Spark SQL API
• SparkContext – a connection to the Spark execution engine
• SchemaRDD – contains rows of data with named columns
(think spreadsheet)!
• HiveContext (superset of SQLContext) – SQL on Spark, access
to the Hive metastore
• inferSchema – apply a schema to an RDD of dictionaries
• jsonRDD/jsonFile – load JSON data as a SchemaRDD
• registerTempTable – register an RDD as a temp table for
SQL fun
34. And what about other data sources
Out of the box:
• Parquet
• JDBC
Spark 1.2 brings a data sources API:
• Much easier to develop new integrations
• New integrations underway: Cassandra, CSV, Avro
35. Where do we think SparkSQL is headed
• Spark in general will continue to gain momentum
• Increasing number of integrated data stores, file types etc
• Optimizer improvements – Catalyst should allow it to
evolve very quickly!
• Subsequent improvements for interactive SQL – better
performance and concurrency
37. Awesome collection of AWS developed bootstrap actions:
https://github.com/awslabs/emr-bootstrap-actions
Will provide notebook and helpful scripts soon!
Remember Jan 27 – Graph Databases
Resources
38. Elliott Cordo
Principal Consultant, Caserta
Concepts
P: (855) 755-2246 x267
E: elliott@casertaconcepts.com
info@casertaconcepts.com
1(855) 755-2246
www.casertaconcepts.com
Thank You