SlideShare une entreprise Scribd logo
1  sur  38
Big Data Warehousing - January 7, 2015
Hosted by:
7:00 Networking
Grab some food and drink... Make some friends.
7:15 Leslie Linsner
Talent Manager
Caserta Concepts
Welcome + Intro + Swag
About the Meetup
About Caserta Concepts
7:30 Elliott Cordo
Chief Architect
Caserta Concepts
Introduction and Overview of Spark
Deep dive into SparkSQL
Demo of SparkSQL!
8:15 Q&A Ask Questions, Share your experience with
SparkSQL
8:45 More Networking
Don’t leave until you make at least one new Data Nerd friend!
Agenda
• Big Data is a complex, rapidly changing
landscape
• We want to share our stories and hear
about yours
• Great networking opportunity for like
minded data nerds
• Founded by Caserta Concepts
• November 10, 2012
• Next BDW Meetup:
• January 27th
• Topic: Graph Databases for MDM
• Location: TBD – Can you host us?
About the BDW Meetup Twitter: #BDWmeetup
#maximizeDataValue
@CasertaConcepts
About Caserta Concepts
• Award-winning technology innovation consulting with
expertise in:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Ad Tech / Higher Ed
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Strategy, Implementation
• Writing, Education, Mentoring
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
Does this word cloud excite you?
Speak with us about our open positions: leslie@casertaconcepts.com
Help Wanted
Storm
Big Data Architect Hbase
Cassandra
Free T-Shirts
Courtesy of Caserta Concepts
RAFFLE!!!
About SPARK!
• General Cluster Computing
• Born in UC Berkeley AMPLab around 2009
• Open sourced in 2010,
• Apache Software foundation in 2013
• Became top level project early in 2014
More about Spark
• A Swiss army knife!
• Streaming, batch, and interactive
• RDD – Redundant Distributed Datastore
• API’s for Java, Scala, Python
Spark processing
• Lots of processing options:
• API’s in Java, Scala, Python
• GraphX
• Streaming
• MLlib
• SQL goodness
Current State of Spark
• Now in version 1.2
• ~175 active contributors
• Most Hadoop distros now support, or are in progress of
integrating Spark
• Databricks is offering commercial support and fully
managed Spark Clusters
• Large number of Organizations using Spark
Recent Achievement: Gray Sort
Caserta Active Spark Project
• Interactive SQL on large datasets  financial services
• Big ETL - Json Crunching ETL pipelines  Ad-tech
• Several others in R&D
Deploying Spark
• On-premise
• Databricks
• AWS EMR and EC2
SPARK on Elastic Map Reduce
• Not currently a packaged application (coming soon?) 
Maybe AWS has other plans for Spark?
• Easily bootstrapped:
• https://github.com/awslabs/emr-bootstrap-actions
aws emr create-cluster --name SparkCluster --ami-version 3.2.1
--instance-type m3.xlarge --instance-count 3
--ec2-attributes KeyName=caserta-1 --applications Name=Hive
--bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=["-
v1.2.0.a"]
• Latest version is turn-key
• Just need to copy hive-site.xml if accessing Hive tables
• Minor issues with metastore when Impala and Parquet installed.
Spark can be run locally too!
• Easy for development
• Local development is exactly the same as submitting work on a
cluster!
• IPython Notebook, or your favorite IDE
• Install on your Mac with one command
brew install apache-spark
IPython Notebook
• Great interactive environment for performing analysis in
Python..
• Cloudera has good documentation on configuring for
Yarn.
http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
• Hint - if your are installing local: Homebrew install
“SPARK_HOME” will be:
/usr/local/Cellar/apache-spark/1.2.0/libexec/
So Why talk about Spark
• Many competing big data processing platforms, query
engines, etc
• Hadoop Map Reduce is fairly mature
..about Hadoop Map Reduce
• We can process very large datasets
• split processes across a large number of machines
• High recoverability/high safety  intermediate data is
written to disk..
• Efficient and generally fast – move processing to data
• SQL on Hadoop via Hive
But map reduce has it’s downsides
• SLOW – disk based intermediate steps (local disk and
HDFS)
• Especially inefficient for iterative processing  like
machine learning
• Challenging to conduct interactive analysis  run job –
go get coffee
..about Spark
• In-memory – eliminates intermediate disk based storage
• Performs generalized form of map-reduce  split
processes across a large number of machines
• Fast enough for interactive analysis
• Fault tolerant via lineage tracking
• SparkSQL!
So do we still need Hadoop
No, but yes
• Why Hadoop?
• YARN
• HDFS
• Hadoop Map Reduce is mature and will still be appropriate for certain
workloads!
• Other services!
• But you can use other resource managers too:
• Mesos
• Spark Standalone
• And can work with other distributed file systems including:
• S3
• Gluster
• Tachyon
About SparkSQL
• Sparks SQL Engine
• Brand new - emerged as alpha in 1.0.1 ~ 1 year old
• Converts SQL into RDD operations
What happened to Shark
• Replaces the for Shark Query engine
• All new Catalyst optimizer – Shark leveraged the Hive
optimizer
• Hadoop Map Reduce optimization rules were not
applicable
• Writing optimization rules made easy  more community
participation
We love SQL!
• Huge population of highly skilled developers and analysts
• Compatible with Tooling
• Many operations can easily and efficiently be expressed
in SQL
• Filters
• Joins
• Group by’s
• Aggregates
But sometimes SQL is not the best tool!
• Some operations do not fit SQL well
• Iteration
• Row-by row processing
• Other operations that are not set-based/SQL oriented
Spark can help!
• Spark API
• MLLIB – machine learning
Blend Spark SQL with other code in the same program
How can you leverage SPARK SQL
• Batch ETL development
• Interactive
• Spark Shell (PySpark)
• Spark SQL CLI
• Thrift Server (JDBC)
• Beeline
• Query Platforms
• BI Tools
SPARK SQL can leverage the Hive
metastore
• Hive Metastore can also be leveraged by a wide array of
applications
• Spark
• Hive
• Impala
• Pig
• Available from HiveContext
The Basics of the Spark SQLAPI
• SPARK Context – a connection to the Spark Execution Engine
• SCHEMA RDD – contains row of data with named columns
(think spreadsheet)!
• HiveContext (superset of SQLContext) – SQL on Spark, access
to Hive Metastore
• inferSchema – apply schema to a RDD of dictionary type
• jsonSchema/jsonFile – load a json file as a schema RDD
• registerTempTable – register an RDD as a temp table for
SQL fun
Spark SQL from Flat File
Spark SQL Loves JSON
Inferring Schema and Querying JSON
Another method – load directly in SQL
Spark SQL + Hive
And what about other data sources
Out of the box:
• Parquet
• JDBC
Spark 1.2 brings a data sources API:
• Much easier to develop new integrations
• New integrations underway  Cassandra, CSV, Avro
Where do we think SparkSQL is headed
• Spark in general will continue to gain momentum
• Increasing number of integrated data stores, file types etc
• Optimizer improvements  Catalyst should allow it to
evolve very quickly!
• Subsequent - Improvements for interactive SQL – better
performance, concurrency
DEMO Some Spark!
…
Awesome collection of AWS developed bootstrap actions:
https://github.com/awslabs/emr-bootstrap-actions
Will provide notebook and helpful scripts soon!
Remember  Jan 27– Graph Databases
Resources
Elliott Cordo
Principal Consultant, Caserta
Concepts
P: (855) 755-2246 x267
E: elliott@casertaconcepts.com
info@casertaconcepts.com
1(855) 755-2246
www.casertaconcepts.com
Thank You

Contenu connexe

Tendances

Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for TrainingBryan Yang
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalystTakuya UESHIN
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons Provectus
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsDatabricks
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowKristian Alexander
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Simplilearn
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksLegacy Typesafe (now Lightbend)
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsMiklos Christine
 
Spark SQL with Scala Code Examples
Spark SQL with Scala Code ExamplesSpark SQL with Scala Code Examples
Spark SQL with Scala Code ExamplesTodd McGrath
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Bryan Yang
 

Tendances (20)

Spark Sql for Training
Spark Sql for TrainingSpark Sql for Training
Spark Sql for Training
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalyst
 
Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons          Using Apache Spark as ETL engine. Pros and Cons
Using Apache Spark as ETL engine. Pros and Cons
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
Spark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to KnowSpark SQL - 10 Things You Need to Know
Spark SQL - 10 Things You Need to Know
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Spark etl
Spark etlSpark etl
Spark etl
 
Spark SQL with Scala Code Examples
Spark SQL with Scala Code ExamplesSpark SQL with Scala Code Examples
Spark SQL with Scala Code Examples
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 

Similaire à Spark SQL

Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentationRamesh Mudunuri
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupHyderabad Scalability Meetup
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleDomino Data Lab
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...Tim Vaillancourt
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Alex Zeltov
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014mahchiev
 
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...Mark Rittman
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Mark Rittman
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...Simon Ambridge
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 

Similaire à Spark SQL (20)

Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentation
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
 
Big Data training
Big Data trainingBig Data training
Big Data training
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 

Plus de Caserta

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingCaserta
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Caserta
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Caserta
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017Caserta
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Caserta
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteCaserta
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Caserta
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Caserta
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseCaserta
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Caserta
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Caserta
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?Caserta
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation Caserta
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for EveryoneCaserta
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure CloudCaserta
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the CloudCaserta
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on HadoopCaserta
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data LakeCaserta
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by DatabricksCaserta
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkCaserta
 

Plus de Caserta (20)

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's Enterprise
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 

Dernier

SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Dernier (20)

SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Spark SQL

  • 1. Big Data Warehousing - January 7, 2015 Hosted by:
  • 2. 7:00 Networking Grab some food and drink... Make some friends. 7:15 Leslie Linsner Talent Manager Caserta Concepts Welcome + Intro + Swag About the Meetup About Caserta Concepts 7:30 Elliott Cordo Chief Architect Caserta Concepts Introduction and Overview of Spark Deep dive into SparkSQL Demo of SparkSQL! 8:15 Q&A Ask Questions, Share your experience with SparkSQL 8:45 More Networking Don’t leave until you make at least one new Data Nerd friend! Agenda
  • 3. • Big Data is a complex, rapidly changing landscape • We want to share our stories and hear about yours • Great networking opportunity for like minded data nerds • Founded by Caserta Concepts • November 10, 2012 • Next BDW Meetup: • January 27th • Topic: Graph Databases for MDM • Location: TBD – Can you host us? About the BDW Meetup Twitter: #BDWmeetup #maximizeDataValue @CasertaConcepts
  • 4. About Caserta Concepts • Award-winning technology innovation consulting with expertise in: • Big Data Solutions • Data Warehousing • Business Intelligence • Core focus in the following industries: • eCommerce / Retail / Marketing • Financial Services / Insurance • Healthcare / Ad Tech / Higher Ed • Established in 2001: • Increased growth year-over-year • Industry recognized work force • Strategy, Implementation • Writing, Education, Mentoring • Data Science & Analytics • Data on the Cloud • Data Interaction & Visualization
  • 5. Does this word cloud excite you? Speak with us about our open positions: leslie@casertaconcepts.com Help Wanted Storm Big Data Architect Hbase Cassandra
  • 6. Free T-Shirts Courtesy of Caserta Concepts RAFFLE!!!
  • 7. About SPARK! • General Cluster Computing • Born in UC Berkeley AMPLab around 2009 • Open sourced in 2010, • Apache Software foundation in 2013 • Became top level project early in 2014
  • 8. More about Spark • A Swiss army knife! • Streaming, batch, and interactive • RDD – Redundant Distributed Datastore • API’s for Java, Scala, Python
  • 9. Spark processing • Lots of processing options: • API’s in Java, Scala, Python • GraphX • Streaming • MLlib • SQL goodness
  • 10. Current State of Spark • Now in version 1.2 • ~175 active contributors • Most Hadoop distros now support, or are in progress of integrating Spark • Databricks is offering commercial support and fully managed Spark Clusters • Large number of Organizations using Spark
  • 12. Caserta Active Spark Project • Interactive SQL on large datasets  financial services • Big ETL - Json Crunching ETL pipelines  Ad-tech • Several others in R&D
  • 13. Deploying Spark • On-premise • Databricks • AWS EMR and EC2
  • 14. SPARK on Elastic Map Reduce • Not currently a packaged application (coming soon?)  Maybe AWS has other plans for Spark? • Easily bootstrapped: • https://github.com/awslabs/emr-bootstrap-actions aws emr create-cluster --name SparkCluster --ami-version 3.2.1 --instance-type m3.xlarge --instance-count 3 --ec2-attributes KeyName=caserta-1 --applications Name=Hive --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=["- v1.2.0.a"] • Latest version is turn-key • Just need to copy hive-site.xml if accessing Hive tables • Minor issues with metastore when Impala and Parquet installed.
  • 15. Spark can be run locally too! • Easy for development • Local development is exactly the same as submitting work on a cluster! • IPython Notebook, or your favorite IDE • Install on your Mac with one command brew install apache-spark
  • 16. IPython Notebook • Great interactive environment for performing analysis in Python.. • Cloudera has good documentation on configuring for Yarn. http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/ • Hint - if your are installing local: Homebrew install “SPARK_HOME” will be: /usr/local/Cellar/apache-spark/1.2.0/libexec/
  • 17. So Why talk about Spark • Many competing big data processing platforms, query engines, etc • Hadoop Map Reduce is fairly mature
  • 18. ..about Hadoop Map Reduce • We can process very large datasets • split processes across a large number of machines • High recoverability/high safety  intermediate data is written to disk.. • Efficient and generally fast – move processing to data • SQL on Hadoop via Hive
  • 19. But map reduce has it’s downsides • SLOW – disk based intermediate steps (local disk and HDFS) • Especially inefficient for iterative processing  like machine learning • Challenging to conduct interactive analysis  run job – go get coffee
  • 20. ..about Spark • In-memory – eliminates intermediate disk based storage • Performs generalized form of map-reduce  split processes across a large number of machines • Fast enough for interactive analysis • Fault tolerant via lineage tracking • SparkSQL!
  • 21. So do we still need Hadoop No, but yes • Why Hadoop? • YARN • HDFS • Hadoop Map Reduce is mature and will still be appropriate for certain workloads! • Other services! • But you can use other resource managers too: • Mesos • Spark Standalone • And can work with other distributed file systems including: • S3 • Gluster • Tachyon
  • 22. About SparkSQL • Sparks SQL Engine • Brand new - emerged as alpha in 1.0.1 ~ 1 year old • Converts SQL into RDD operations
  • 23. What happened to Shark • Replaces the for Shark Query engine • All new Catalyst optimizer – Shark leveraged the Hive optimizer • Hadoop Map Reduce optimization rules were not applicable • Writing optimization rules made easy  more community participation
  • 24. We love SQL! • Huge population of highly skilled developers and analysts • Compatible with Tooling • Many operations can easily and efficiently be expressed in SQL • Filters • Joins • Group by’s • Aggregates
  • 25. But sometimes SQL is not the best tool! • Some operations do not fit SQL well • Iteration • Row-by row processing • Other operations that are not set-based/SQL oriented Spark can help! • Spark API • MLLIB – machine learning Blend Spark SQL with other code in the same program
  • 26. How can you leverage SPARK SQL • Batch ETL development • Interactive • Spark Shell (PySpark) • Spark SQL CLI • Thrift Server (JDBC) • Beeline • Query Platforms • BI Tools
  • 27. SPARK SQL can leverage the Hive metastore • Hive Metastore can also be leveraged by a wide array of applications • Spark • Hive • Impala • Pig • Available from HiveContext
  • 28. The Basics of the Spark SQLAPI • SPARK Context – a connection to the Spark Execution Engine • SCHEMA RDD – contains row of data with named columns (think spreadsheet)! • HiveContext (superset of SQLContext) – SQL on Spark, access to Hive Metastore • inferSchema – apply schema to a RDD of dictionary type • jsonSchema/jsonFile – load a json file as a schema RDD • registerTempTable – register an RDD as a temp table for SQL fun
  • 29. Spark SQL from Flat File
  • 31. Inferring Schema and Querying JSON
  • 32. Another method – load directly in SQL
  • 33. Spark SQL + Hive
  • 34. And what about other data sources Out of the box: • Parquet • JDBC Spark 1.2 brings a data sources API: • Much easier to develop new integrations • New integrations underway  Cassandra, CSV, Avro
  • 35. Where do we think SparkSQL is headed • Spark in general will continue to gain momentum • Increasing number of integrated data stores, file types etc • Optimizer improvements  Catalyst should allow it to evolve very quickly! • Subsequent - Improvements for interactive SQL – better performance, concurrency
  • 37. Awesome collection of AWS developed bootstrap actions: https://github.com/awslabs/emr-bootstrap-actions Will provide notebook and helpful scripts soon! Remember  Jan 27– Graph Databases Resources
  • 38. Elliott Cordo Principal Consultant, Caserta Concepts P: (855) 755-2246 x267 E: elliott@casertaconcepts.com info@casertaconcepts.com 1(855) 755-2246 www.casertaconcepts.com Thank You