Spark SQL 漫谈 (An Informal Talk on Spark SQL)
Cheng Hao 
Oct 25, 2014 
Agenda 
 Spark SQL Overview 
 Catalyst in Depth 
 SQL Core API Introduction 
 vs. Shark & Hive-on-Spark
 Our Contributions 
 Useful Materials 
Spark SQL Overview 
Spark SQL in Spark 
[Component diagram: Spark SQL, Spark Streaming (real-time), GraphX (graph, alpha), and MLlib (machine learning), all built on top of Spark Core]
 Spark SQL was first released in Spark 1.0 (May, 2014) 
 Initially committed by Michael Armbrust & Reynold Xin from Databricks
Spark SQL Component Stack (User Perspective) 
 Hive-like interface (JDBC Service / CLI) 
 SQL API support (LINQ-like) 
 Both Hive QL & Simple SQL dialects are supported 
 DDL is 100% compatible with the Hive Metastore 
 Hive QL aims to be 100% compatible with Hive DML 
 The Simple SQL dialect is still weak in functionality, but easy to extend
[Component stack: a Data Analyst or User Application enters via CLI / JDBC Service / SQL API; Hive QL and Simple SQL both feed Catalyst (backed by the Hive Meta Store or the Simple Catalog), which drives Spark Execution Operators on Spark Core]
Spark SQL Architecture 
[Architecture diagram: Frontend → Catalyst → Backend]
By Michael Armbrust @ Databricks
Catalyst in Depth 
Understand Some Terminology 
 Logical and Physical query plans 
 Both are trees representing query evaluation 
 Internal nodes are operators over the data 
 Logical plan is higher-level and algebraic 
 Physical plan is lower-level and operational 
 Logical plan operators 
 Correspond to query language constructs 
 Conceptually describe what operation needs to be performed 
 Physical plan operators 
 Correspond to implemented access methods 
 Physically implement the operation described by logical operators 
SQL Text → (Parsing) → Unresolved Logical Plan → (Binding & Analyzing) → Logical Plan → (Optimizing) → Optimized Logical Plan → (Query Planning) → Physical Plan
Examples 
We execute the following commands on the Spark SQL CLI. 
• CREATE TABLE T (key STRING, value STRING) 
• EXPLAIN EXTENDED 
SELECT 
a.key * (2 + 3), b.value 
FROM T a JOIN T b 
ON a.key=b.key AND a.key>3 
== Parsed Logical Plan ==
Project [('a.key * (2 + 3)) AS c_0#24,'b.value]
 Join Inner, Some((('a.key = 'b.key) && ('a.key > 3)))
  UnresolvedRelation None, T, Some(a)
  UnresolvedRelation None, T, Some(b)

== Analyzed Logical Plan ==
Project [(CAST(key#27, DoubleType) * CAST((2 + 3), DoubleType)) AS c_0#24,value#30]
 Join Inner, Some(((key#27 = key#29) && (CAST(key#27, DoubleType) > CAST(3, DoubleType))))
  MetastoreRelation default, T, Some(a)
  MetastoreRelation default, T, Some(b)

== Optimized Logical Plan ==
Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
 Join Inner, Some((key#27 = key#29))
  Project [key#27]
   Filter (CAST(key#27, DoubleType) > 3.0)
    MetastoreRelation default, T, Some(a)
  MetastoreRelation default, T, Some(b)

== Physical Plan ==
Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
 BroadcastHashJoin [key#27], [key#29], BuildLeft
  Filter (CAST(key#27, DoubleType) > 3.0)
   HiveTableScan [key#27], (MetastoreRelation default, T, Some(a)), None
  HiveTableScan [key#29,value#30], (MetastoreRelation default, T, Some(b)), None
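Note what each stage buys us: analysis resolves 'a.key against the Metastore schema and inserts the necessary casts; optimization folds (2 + 3) into 5.0, prunes the unused value column of table a, and pushes the a.key > 3 filter below the join; query planning then picks BroadcastHashJoin as the physical join implementation.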
Catalyst Overview 
• Catalyst is essentially an extensible framework for analyzing & optimizing logical plans and expressions. 
• Core Elements: 
• Tree Node API 
• Expression Optimization 
• Data Type & Schema 
• Row API 
• Logical Plan (Unresolved) → Binding & Analyzing (Rules) 
• Logical Plan (Resolved) → Optimizing (Rules) 
• SPI (Service Provider Interface) 
• FunctionRegistry 
• Schema Catalog 
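To make the rule-driven design concrete, here is a minimal sketch of an expression-rewrite rule in Catalyst's style (SimplifyAddZero is a hypothetical rule name; the imports reflect the Spark 1.1-era package layout):

import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.types.IntegerType

// Rewrite x + 0 (in either operand order) to x, via Scala pattern matching over the tree.
object SimplifyAddZero extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Add(left, Literal(0, IntegerType)) => left
    case Add(Literal(0, IntegerType), right) => right
  }
}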
Data Type & Schema 
 Primitive Type 
 StringType, FloatType, IntegerType, ByteType, ShortType, DoubleType, LongType, BinaryType, BooleanType, DecimalType, TimestampType, DateType, Varchar (not completely supported yet), Char (not completely supported yet) 
 Complex Type 
 ArrayType 
 ArrayType(elementType: DataType) 
 StructType 
 StructField(name: String, dataType: DataType) 
 StructType(fields: Seq[StructField]) 
 MapType 
 MapType(keyType: DataType, valueType: DataType) 
 UnionType (Not Supported Yet) 
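As a quick sketch of how a relation schema is assembled from these types (Spark 1.1-era public API; note that StructField actually carries a third nullable field not shown above):

import org.apache.spark.sql._

// A two-column relation schema plus a complex (array) column.
val schema = StructType(Seq(
  StructField("key", IntegerType, nullable = false),
  StructField("value", StringType, nullable = true),
  StructField("tags", ArrayType(StringType), nullable = true)))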
Row API 
trait Row extends Seq[Any] with Serializable { 
def apply(i: Int): Any 
def isNullAt(i: Int): Boolean 
def getInt(i: Int): Int 
def getLong(i: Int): Long 
def getDouble(i: Int): Double 
def getFloat(i: Int): Float 
def getBoolean(i: Int): Boolean 
def getShort(i: Int): Short 
def getByte(i: Int): Byte 
def getString(i: Int): String 
def getAs[T](int: Int): T 
} 
 The Row class is the key data structure, widely used both internally and externally to Spark SQL. 
 "def getAs[T]" is used for non-primitive data types. 
 Field values are represented as native language data types. 
 Field types are represented as the DataTypes described in the last slide.
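A brief usage sketch (the row and its schema are assumed; getAs is the accessor for complex types such as ArrayType fields):

// given some row: Row from a query result
val value = if (row.isNullAt(0)) "null" else row.getString(0) // primitive accessor, with a null guard
val tags = row.getAs[Seq[String]](1)                          // complex (ArrayType) field via getAs[T]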
Logical Plan Binding & Analyzing 
• Essentially about data binding & semantic analysis 
• Example Rules 
• Bind Attributes and Relations to concrete data. 
• ResolveReferences, ResolveRelation 
• Expression Analysis 
• Data Type Coercion (PropagateTypes, PromoteStrings, BooleanCasts, Division, etc.) 
• Bind UDFs (ResolveFunctions) 
• Evict / Expand Analysis-only Logical Plan Operators 
• StarExpansion, EliminateAnalysisOperators 
• Implicit Semantic Supplement 
• Add sort expressions into the child projection list (ResolveSortReferences) 
• Convert a projection into an aggregation if the projection contains an aggregate function (GlobalAggregates) 
• UnresolvedHavingClauseAttributes 
• Semantic Checking 
• Unresolved Functions, Relations, Attributes (CheckResolution) 
• Illegal expressions in the projection of an Aggregation (CheckAggregation) 
• …. 
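For flavor, a condensed, hypothetical sketch of what a binding rule such as ResolveRelations boils down to (the Catalog is the SPI schema catalog mentioned earlier):

import org.apache.spark.sql.catalyst.analysis.{Catalog, UnresolvedRelation}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Replace every unresolved table reference with the concrete relation from the catalog.
case class ResolveRelationsSketch(catalog: Catalog) extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case UnresolvedRelation(databaseName, tableName, alias) =>
      catalog.lookupRelation(databaseName, tableName, alias)
  }
}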
Logical Plan Optimizing 
• Simplify the logical plan tree based on relational / logical algebra and common sense (rule-based) 
• Example Rules 
• Expression Optimization. 
• NullPropagation, ConstantFolding, SimplifyFilters, SimplifyCasts, OptimizeIn etc. 
• Filter PushDown 
• UnionPushdown, PushPredicateThroughProject, PushPredicateThroughJoin, ColumnPruning 
• Combine Operators 
• CombineFilters, CombineLimits 
• Concrete Example 
• IsNull('a + null) => IsNull(null) => Literal(true) 
• SELECT a.key, b.key FROM a JOIN b ON a.key=b.key AND b.key>10 => 
SELECT a.key, b.key FROM a JOIN (SELECT key FROM b WHERE key>10) b ON a.key=b.key 
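And a condensed sketch of the CombineFilters rule itself (trimmed from the actual Catalyst rule):

import org.apache.spark.sql.catalyst.expressions.And
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Merge two adjacent Filter operators into a single conjunctive predicate.
object CombineFiltersSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(c1, Filter(c2, grandChild)) => Filter(And(c1, c2), grandChild)
  }
}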
Spark SQL Dialects 
[Diagram: two concrete stacks plus the general pattern]
HiveContext: Hive Parser → Hive AST → Logical Plan → Optimized Logical Plan → Hive+Spark Planner → Execution Operators, backed by the Hive Catalog
SQLContext: SQL Parser / DSL API → Unresolved Logical Plan → Optimized Logical Plan → Spark Planner → Execution Operators, backed by the Simple Catalog
In general: XX Parser / API + XXX Catalog + XXX Planner = XXXContext, i.e. Frontend + Catalyst + SPI + Backend = Tool
Spark Plan (Physical Plan) 
 Root class of Spark Plan Operator (Physical Plan Operator for Spark) 
 Spark Plan Operators 
 Joins: BroadcastHashJoin, CartesianProduct, HashOuterJoin, LeftSemiJoinHash, etc. 
 Aggregate: Aggregate 
 BasicOperators: Distinct, Except, Filter, Limit, Project, Sort, Union, etc. 
 Shuffle: AddExchange, Exchange 
 Commands: CacheTableCommand, DescribeCommand, ExplainCommand, etc. 
 .. 
 Spark Strategy (SparkPlanner) 
 Maps the Optimized Logical Plan to a Spark Plan 
abstract class SparkPlan { 
def children: Seq[SparkPlan] 
/** Specifies how data is partitioned across different nodes in the cluster. */ 
def outputPartitioning: Partitioning = UnknownPartitioning(0) 
/** Specifies any partition requirements on the input data for this operator. */ 
def requiredChildDistribution: Seq[Distribution] = 
Seq.fill(children.size)(UnspecifiedDistribution) 
def execute(): RDD[Row] 
} 
[Flow: Optimized Logical Plan → Spark Plan → RDD (Spark Execution)]
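To illustrate, a hypothetical minimal operator written against the simplified SparkPlan above (the real Filter in basicOperators is close to this, modulo output attributes and other details omitted here):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.expressions.{Expression, Row}

// Evaluate the predicate row by row, partition by partition, over the child's output RDD.
case class FilterSketch(condition: Expression, child: SparkPlan) extends SparkPlan {
  def children = child :: Nil
  def execute(): RDD[Row] = child.execute().mapPartitions { iter =>
    iter.filter(row => condition.eval(row).asInstanceOf[Boolean])
  }
}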
Case Study for Catalyst in Depth 
• StreamSQL 
• Reuse the HiveContext but with different Frontend / Backend. 
• Frontend: Slight modification of the HiveParser 
• Backend: Custom query planner that generates the physical plan based on Spark DStreams. 
• JIRA: https://issues.apache.org/jira/browse/SPARK-1363 
• Source: https://github.com/thunderain-project/StreamSQL 
• SQL 92 Support 
• Reuse the HiveContext but with different Frontend 
• Frontend: A modified HiveParser & Hive QL translator. 
• https://github.com/intel-hadoop/spark/tree/panthera 
• Pig on Spark POC 
• Modify the SQLContext 
• Provide a PigParser to translate Pig scripts into Catalyst unresolved logical plans 
• https://github.com/databricks/pig-on-spark 
SQL Core API Introduction 
SchemaRDD 
• What’s SchemaRDD? 
• Spark SQL Core API (In Scala) 
• Create a SchemaRDD instance from 
• Plain SQL text def sql(sqlText: String) 
• An existing logical plan def logicalPlanToSparkQuery(plan: LogicalPlan) 
• A Spark RDD def createSchemaRDD[A <: Product: TypeTag](rdd: RDD[A]) 
• A Spark RDD with schema def applySchema(rowRDD: RDD[Row], schema: StructType) 
• Frequently used file formats (JSON, Parquet, etc.) def parquetFile(path: String) 
• SQL DSL 
• select, where, join, orderBy, limit, groupBy, unionAll, etc. 
• Data Sink 
• Persist the data with a specified storage level def persist(newLevel: StorageLevel) 
• Save the data as a Parquet file def saveAsParquetFile(path: String) 
• Register the data as a new (temp) table def registerTempTable(tableName: String) 
• Insert the data into an existing table def insertInto(tableName: String, overwrite: Boolean) 
• …. 
• Java API / Python API supported 
class SchemaRDD( 
@transient val sqlContext: SQLContext, 
@transient val baseLogicalPlan: LogicalPlan) 
extends RDD[Row](sqlContext.sparkContext, Nil)
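A short sketch of the applySchema path plus the SQL DSL (it mirrors the Spark 1.1 programming guide; the file path and column layout are assumptions):

import org.apache.spark.sql._
// assumes a SparkContext `sparkContext` and a SQLContext `sqlContext` are in scope
import sqlContext._ // brings in the DSL implicits ('key, where, select, ...)

val schema = StructType(Seq(
  StructField("key", IntegerType, nullable = false),
  StructField("value", StringType, nullable = true)))
val rowRDD = sparkContext.textFile("/tmp/kv.txt") // hypothetical "key,value" lines
  .map(_.split(","))
  .map(p => Row(p(0).toInt, p(1)))                // RDD[Row] matching the schema
val kv = applySchema(rowRDD, schema)              // RDD[Row] + StructType => SchemaRDD
kv.registerTempTable("kv")
kv.where('key > 10).select('value)                // the SQL DSL in action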
Conceptual State Transition Diagram 
[State transition diagram: an RDD becomes a SchemaRDD via the SQL API; SQL Text / File / Table is parsed into an Unresolved Logical Plan, which also yields a SchemaRDD; a SchemaRDD is materialized back to File / Memory, etc.]
* Unresolved Logical Plan → RDD (Unresolved Logical Plan → Logical Plan → Optimized Logical Plan → Physical Plan → Spark RDD)
Code Example 
sbt/sbt hive/console 
// HiveContext is created by default, and the object is imported, so we can call the object methods directly. 
sql("CREATE TABLE IF NOT EXISTS kv_text(key INT, value STRING)") 
sql("LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE kv_text") // create a Hive table and load data into it 
case class KV(key: Int, value: String) 
val kvRdd = sparkContext.parallelize((1 to 100).map(i => KV(i, s"val_$i"))) // create a normal RDD 
// implicitly convert the kvRDD into a SchemaRDD 
kvRdd.where('key >= 1).where('key <= 5).registerTempTable("kv_rdd") // register the SchemaRDD as a temp table 
jsonFile("/tmp/file2.json").registerTempTable("kv_json") // load a JSON file and register it as a temp table 
val result = sql("SELECT a.key, b.value, c.key from kv_text a join kv_rdd b join kv_json c") 
result.collect().foreach(row => {
  val f0 = if (row.isNullAt(0)) "null" else row.getInt(0)
  val f1 = if (row.isNullAt(1)) "null" else row.getString(1)
  val f2 = if (row.isNullAt(2)) "null" else row.getInt(2)
  println(s"result:$f0, $f1, $f2")
})
vs. Shark & Hive 
 Background of Shark/Hive-on-Spark/Spark SQL 
 Shark is the first SQL-on-Spark product, based on earlier versions of Hive (with a rewritten QueryPlanner that generates a Spark RDD-based physical plan); Shark is retired now and replaced by Spark SQL. 
 Hive-on-Spark is a QueryPlanner extension of Hive; it focuses on the SparkPlanner and Spark RDD-based physical operator implementations. Spark users automatically get the whole set of Hive's rich features, including any new features that Hive might introduce in the future. 
 Spark SQL is a new SQL engine on Spark developed from scratch. 
 Functionality 
 Spark SQL supports almost all of the functionality that Hive provides, from the perspective of data analysts. 
 SQL API on the Spark shell vs. Pig Latin. 
 Spark SQL is an extensible / flexible framework for developers (based on Catalyst); new extensions are very easy to integrate. 
 Implementation Philosophy of Spark SQL (Simple & Natural) 
 Largely employs Scala features (pattern matching, implicit conversions, partial functions, etc.) 
 Many small, simple rules bind, analyze, and optimize the logical plan & expression tree, and also drive physical plan generation. 
 In-memory computing & maximizing memory usage (cache-related SQL API & commands). 
 Spark SQL benefits a lot from Hive by reusing its components (Hive QL parser, Metastore, SerDe, StorageHandler, etc.) 
 Stability 
 Hive is the de facto standard for SQL on big data so far, and it has proven itself a productive tool over several years of practice; many corner cases are covered by its continuous enhancements. 
 Spark SQL has just started its journey (~0.5 year); we need more time to prove / improve it. 
Our Contributions 
 In total, 60+ PRs submitted and 50+ merged on Spark SQL 
 Features 
 Add serde support for CTAS (PR2570) 
 Support the Grouping Set (PR1567) 
 Support EXTENDED for EXPLAIN (PR1982) 
 Cross join support in HiveQL (PR2124) 
 Add support for left semi join (PR837) 
 Add Date type support (PR2344) 
 Add Timestamp type support (PR275) 
 Add Expression RLike & Like support (PR224) 
 .. 
 Performance Enhancement / Improvement 
 Avoid table creation in logical plan analysis for CTAS (PR1846) 
 Extract the join keys from the join condition (PR1190) 
 Reduce Expression tree object creation for aggregate functions (min/max) (PR2113) 
 Push down the join filters & predicates for outer joins (PR1015) 
 Constant folding for expression optimization (PR482) 
 Fix performance issue in data type casting (PR679) 
 Don't limit argument types for Hive simple UDFs (PR2506) 
 Use GenericUDFUtils.ConversionHelper for simple UDF type conversions (PR2407) 
 SELECT null from a table would throw a MatchError (PR2396) 
 Type coercion should support a null value for every type (PR2246) 
 …. 
 Bugs Fixing 
 …. 
Useful Materials 
 References 
 http://spark-summit.org/wp-content/uploads/2013/10/J-Michael-Armburst-catalyst-spark-summit-dec-2013.pptx 
 http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf 
 https://www.youtube.com/watch?v=GQSNJAzxOr8 
 http://www.slideshare.net/ueshin/20140908-spark-sql-catalyst?qid=3bb8abf4-3d8d-433f-9397-c24c5256841d 
 https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark 
 http://web.stanford.edu/class/cs346/qpnotes.html 
 http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf 
 http://codex.cs.yale.edu/avi/db-book/db6/slide-dir/PDF-dir/ch13.pdf 
 https://courses.cs.washington.edu/courses/cse444/12sp/lectures/ 
 http://www.cs.uiuc.edu/class/sp06/cs411/lectures.html 
• User Mail List 
 user@spark.apache.org 
• Dev Mail List 
 dev@spark.apache.org 
• Jira 
 https://issues.apache.org/jira/browse/SPARK/component/12322623 
• DevDoc 
 https://spark.apache.org/docs/latest/sql-programming-guide.html 
• Github 
 https://github.com/apache/spark/tree/master/sql 
Notice and Disclaimers: 
 Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 
See Trademarks on intel.com for the full list of Intel trademarks. 
 Optimization Notice: 
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that 
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and 
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on 
microprocessors not manufactured by Intel. 
Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain 
optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the 
applicable product User and Reference Guides for more information regarding the specific instruction sets covered 
by this notice. 
 Intel technologies may require enabled hardware, specific software, or services activation. Check with your system 
manufacturer or retailer. 
 No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems 
or any damages resulting from such losses. 
 You may not use or facilitate the use of this document in connection with any infringement or other legal analysis 
concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any 
patent claim thereafter drafted which includes subject matter disclosed herein. 
 No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this 
document. 
 The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. 
Copyright © 2014 Intel Corporation.