Not Your Father's Database by Vida Ha

•

17 j'aime•7,324 vues

Spark Summit

Spark Summit East Talk

Données & analyses

Not Your Father’s Database:
How to Use Apache Spark Properly  
in Your Big Data Architecture
Spark Summit East 2016

About Me
2005 Mobile Web & Voice Search
3

About Me
2005 Mobile Web & Voice Search
4
2012 Reporting & Analytics

About Me
2005 Mobile Web & Voice Search
5
2012 Reporting & Analytics
2014 Solutions Engineering

This system talks like a SQL Database…
Is this your Spark infrastructure?
6
HDFS
SQL

But the performance is very different…
Is this your Spark infrastructure?
7
SQL
HDFS

Just in Time Data Warehouse w/ Spark
HDFS

Just in Time Data Warehouse w/ Spark
and more…
HDFS

11
Know when to use other data stores  
besides file systems
Today’s Goal

Good: General Purpose Processing
Types of Data Sets to Store in File Systems:
• Archival Data
• Unstructured Data
• Social Media and other web datasets
• Backup copies of data stores
12

Types of workloads
• Batch Workloads
• Ad Hoc Analysis
– Best Practice: Use in memory caching
• Multi-step Pipelines
• Iterative Workloads
13
Good: General Purpose Processing

Benefits:
• Inexpensive Storage
• Incredibly flexible processing
• Speed and Scale
14
Good: General Purpose Processing

Bad: Random Access
sqlContext.sql(
“select * from my_large_table where id=2I34823”)
Will this command run in Spark?
15

Bad: Random Access
sqlContext.sql(
“select * from my_large_table where id=2I34823”)
Will this command run in Spark?
Yes, but it’s not very efficient — Spark may have  
to go through all your files to find your row.
16

Bad: Random Access
Solution: If you frequently randomly access your
data, use a database.
• For traditional SQL databases, create an index  
on your key column.
• Key-Value NOSQL stores retrieves the value  
of a key efficiently out of the box.
17

Bad: Frequent Inserts
sqlContext.sql(“insert into TABLE myTable
select fields from my2ndTable”)
Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…
18

Bad: Frequent Inserts
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files.
19

Good: Data Transformation/ETL
Use Spark to splice and dice your data files any way:
File storage is cheap:
Not an “Anti-pattern” to duplicately store your data.
20

Bad: Frequent/Incremental Updates
Update statements — not supported yet.
Why not?
• Random Access: Locate the row(s) in the files.
• Delete & Insert: Delete the old row and insert a new one.
• Update: File formats aren’t optimized for updating rows.
Solution: Many databases support efficient update operations.
21

Use Case: Up-to-date, live views of your SQL tables.
Tip: Use ClusterBy for fast joins or Bucketing with 2.0.
Bad: Frequent/Incremental Updates
22
Incremental
SQL Query
Database
Snapshot
+

Good: Connecting BI Tools
Tip: Cache your tables for optimal performance.
23
HDFS

Bad: External Reporting w/ load
Too many concurrent requests will overload Spark.
24
HDFS

Solution: Write out to a DB to handle load.
Bad: External Reporting w/ load
25
HDFS
DB

Good: Machine Learning & Data Science
Use MLlib, GraphX and Spark packages for machine
learning and data science.
Benefits:
• Built in distributed algorithms.
• In memory capabilities for iterative workloads.
• Data cleansing, featurization, training, testing, etc.
26

Bad: Searching Content w/ load
sqlContext.sql(“select * from mytable
where name like '%xyz%'”)
Spark will go through each row to find results.
27

Recommandé

Insights into Customer Behavior from Clickstream Data by Ronald NowlingSpark Summit

Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Databricks

Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonSpark Summit

Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Spark Summit

Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkDatabricks

Writing Continuous Applications with Structured Streaming in PySparkDatabricks

Optiq: A dynamic data management frameworkJulian Hyde

Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks

Recommandé

Insights into Customer Behavior from Clickstream Data by Ronald NowlingSpark Summit

Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Databricks

Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonSpark Summit

Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Spark Summit

Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkDatabricks

Writing Continuous Applications with Structured Streaming in PySparkDatabricks

Optiq: A dynamic data management frameworkJulian Hyde

Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks

Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks

BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use caseDavid Lauzon

ODI11g, Hadoop and "Big Data" SourcesMark Rittman

Building an ETL pipeline for Elasticsearch using SparkItai Yaffe

HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...Modern Data Stack France

A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks

Spark sql meetupMichael Zhang

Spectator to Participant. Contributing to Cassandra (Patrick McFadin, DataSta...DataStax

Parquet and AVROairisData

Quark Virtualization Engine for Analytics DataWorks Summit/Hadoop Summit

Spark SQL Beyond Official DocumentationDatabricks

Change Data Feed in DeltaDatabricks

Hadoop databases for oracle DBAsMaxym Kharchenko

SQL Now! How Optiq brings the best of SQL to NoSQL data.Julian Hyde

The Past, Present, and Future of Hadoop at LinkedInCarl Steinbach

Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation

Jaws - Data Warehouse with Spark SQL by Ema OrhianSpark Summit

Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)Matt Fuller

Etl with apache impala by athemasterAthemaster Co., Ltd.

Reshape Data Lake (as of 2020.07)Eric Sun

Not Your Father's Database by DatabricksCaserta

Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Databricks

Contenu connexe

Tendances

Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks

BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use caseDavid Lauzon

ODI11g, Hadoop and "Big Data" SourcesMark Rittman

Building an ETL pipeline for Elasticsearch using SparkItai Yaffe

HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...Modern Data Stack France

A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks

Spark sql meetupMichael Zhang

Spectator to Participant. Contributing to Cassandra (Patrick McFadin, DataSta...DataStax

Parquet and AVROairisData

Quark Virtualization Engine for Analytics DataWorks Summit/Hadoop Summit

Spark SQL Beyond Official DocumentationDatabricks

Change Data Feed in DeltaDatabricks

Hadoop databases for oracle DBAsMaxym Kharchenko

SQL Now! How Optiq brings the best of SQL to NoSQL data.Julian Hyde

The Past, Present, and Future of Hadoop at LinkedInCarl Steinbach

Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation

Jaws - Data Warehouse with Spark SQL by Ema OrhianSpark Summit

Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)Matt Fuller

Etl with apache impala by athemasterAthemaster Co., Ltd.

Reshape Data Lake (as of 2020.07)Eric Sun

Tendances (20)

Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...

BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

ODI11g, Hadoop and "Big Data" Sources

Building an ETL pipeline for Elasticsearch using Spark

HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...

A Thorough Comparison of Delta Lake, Iceberg and Hudi

Spark sql meetup

Spectator to Participant. Contributing to Cassandra (Patrick McFadin, DataSta...

Parquet and AVRO

Quark Virtualization Engine for Analytics

Spark SQL Beyond Official Documentation

Change Data Feed in Delta

Hadoop databases for oracle DBAs

SQL Now! How Optiq brings the best of SQL to NoSQL data.

The Past, Present, and Future of Hadoop at LinkedIn

Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache

Jaws - Data Warehouse with Spark SQL by Ema Orhian

Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)

Etl with apache impala by athemaster

Reshape Data Lake (as of 2020.07)

Similaire à Not Your Father's Database by Vida Ha

Not Your Father's Database by DatabricksCaserta

Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Databricks

Slide 2 collecting, storing and analyzing big dataTrieu Nguyen

Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks

Agile data warehousingSneha Challa

NoSQL databasesFilip Ilievski

Demystifying data engineeringThang Bui (Bob)

One Large Data Lake, Hold the HypeJared Winick

One Large Data Lake, Hold the HypeKoverse, Inc.

Bender kuszmaul tutorial-xldb12Atner Yegorov

Data Structures and Algorithms for Big Databasesomnidba

Big Data Technologies and Why They Matter To R UsersAdaryl "Bob" Wakefield, MBA

Hands On: Introduction to the Hadoop EcosystemAdaryl "Bob" Wakefield, MBA

Data modeling trends for analyticsIke Ellis

No sq lv1_0Tuan Luong

Splice machine-bloor-webinar-data-lakesEdgar Alejandro Villegas

Database as a Service on the Oracle Database Appliance PlatformMaris Elsins

The modern analytics architectureJoseph D'Antoni

Connecting Hadoop and OracleTanel Poder

Scaling ETL with Hadoop - Avoiding FailureGwen (Chen) Shapira

Similaire à Not Your Father's Database by Vida Ha (20)

Not Your Father's Database by Databricks

Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...

Slide 2 collecting, storing and analyzing big data

Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...

Agile data warehousing

NoSQL databases

Demystifying data engineering

One Large Data Lake, Hold the Hype

Bender kuszmaul tutorial-xldb12

Data Structures and Algorithms for Big Databases

Big Data Technologies and Why They Matter To R Users

Hands On: Introduction to the Hadoop Ecosystem

Data modeling trends for analytics

No sq lv1_0

Splice machine-bloor-webinar-data-lakes

Database as a Service on the Oracle Database Appliance Platform

The modern analytics architecture

Connecting Hadoop and Oracle

Scaling ETL with Hadoop - Avoiding Failure

Plus de Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit

Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit

A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit

Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit

Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit

Powering a Startup with Apache Spark with Kevin KimSpark Summit

Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit

Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit

How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit

Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit

Goal Based Data Production with Sim SimeonovSpark Summit

Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit

Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit

Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit

MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit

Plus de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang

VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu

Improving Traffic Prediction Using Weather Data with Ramya Raghavendra

A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...

Apache Spark and Tensorflow as a Service with Jim Dowling

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...

Next CERN Accelerator Logging Service with Jakub Wozniak

Powering a Startup with Apache Spark with Kevin Kim

Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra

Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...

How Nielsen Utilized Databricks for Large-Scale Research and Development with...

Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...

Goal Based Data Production with Sim Simeonov

Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...

Getting Ready to Use Redis with Apache Spark with Dvir Volk

Deduplication and Author-Disambiguation of Streaming Records via Supervised M...

MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...

Dernier

毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss

毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff

DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett

原版1:1定制南十字星大学毕业证（SCU毕业证）#文凭成绩单#真实留信学历认证永久存档208367051

Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson

1:1定制(UQ毕业证）昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk

Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy

办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一F sss

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss

Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7

Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03

Learn How Data Science Changes Our WorldEduminds Learning

IMA MSN - Medical Students Network (2).pptxdolaknnilon

Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2

RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993

ASML's Taxonomy Adventure by Daniel Cantervoginip

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone

Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics

Dernier (20)

毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree

毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...

DBA Basics: Getting Started with Performance Tuning.pdf

原版1:1定制南十字星大学毕业证（SCU毕业证）#文凭成绩单#真实留信学历认证永久存档

Defining Constituents, Data Vizzes and Telling a Data Story

1:1定制(UQ毕业证）昆士兰大学毕业证成绩单修改留信学历认证原版一模一样

Student Profile Sample report on improving academic performance by uniting gr...

办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改

Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...

Top 5 Best Data Analytics Courses In Queens

Learn How Data Science Changes Our World

IMA MSN - Medical Students Network (2).pptx

Identifying Appropriate Test Statistics Involving Population Mean

RABBIT: A CLI tool for identifying bots based on their GitHub events.

ASML's Taxonomy Adventure by Daniel Canter

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024

Predicting Salary Using Data Science: A Comprehensive Analysis.pdf

Not Your Father's Database by Vida Ha

1. Not Your Father’s Database: How to Use Apache Spark Properly   in Your Big Data Architecture Spark Summit East 2016

2. Not Your Father’s Database: How to Use Apache Spark Properly   in Your Big Data Architecture Spark Summit East 2016

3. About Me 2005 Mobile Web & Voice Search 3

4. About Me 2005 Mobile Web & Voice Search 4 2012 Reporting & Analytics

5. About Me 2005 Mobile Web & Voice Search 5 2012 Reporting & Analytics 2014 Solutions Engineering

6. This system talks like a SQL Database… Is this your Spark infrastructure? 6 HDFS SQL

7. But the performance is very different… Is this your Spark infrastructure? 7 SQL HDFS

8. Just in Time Data Warehouse w/ Spark HDFS

9. Just in Time Data Warehouse w/ Spark HDFS

10. Just in Time Data Warehouse w/ Spark and more… HDFS

11. 11 Know when to use other data stores   besides file systems Today’s Goal

12. Good: General Purpose Processing Types of Data Sets to Store in File Systems: • Archival Data • Unstructured Data • Social Media and other web datasets • Backup copies of data stores 12

13. Types of workloads • Batch Workloads • Ad Hoc Analysis – Best Practice: Use in memory caching • Multi-step Pipelines • Iterative Workloads 13 Good: General Purpose Processing

14. Benefits: • Inexpensive Storage • Incredibly flexible processing • Speed and Scale 14 Good: General Purpose Processing

15. Bad: Random Access sqlContext.sql( “select * from my_large_table where id=2I34823”) Will this command run in Spark? 15

16. Bad: Random Access sqlContext.sql( “select * from my_large_table where id=2I34823”) Will this command run in Spark? Yes, but it’s not very efficient — Spark may have   to go through all your files to find your row. 16

17. Bad: Random Access Solution: If you frequently randomly access your data, use a database. • For traditional SQL databases, create an index   on your key column. • Key-Value NOSQL stores retrieves the value   of a key efficiently out of the box. 17

18. Bad: Frequent Inserts sqlContext.sql(“insert into TABLE myTable select fields from my2ndTable”) Each insert creates a new file: • Inserts are reasonably fast. • But querying will be slow… 18

19. Bad: Frequent Inserts Solution: • Option 1: Use a database to support the inserts. • Option 2: Routinely compact your Spark SQL table files. 19

20. Good: Data Transformation/ETL Use Spark to splice and dice your data files any way: File storage is cheap: Not an “Anti-pattern” to duplicately store your data. 20

21. Bad: Frequent/Incremental Updates Update statements — not supported yet. Why not? • Random Access: Locate the row(s) in the files. • Delete & Insert: Delete the old row and insert a new one. • Update: File formats aren’t optimized for updating rows. Solution: Many databases support efficient update operations. 21

22. Use Case: Up-to-date, live views of your SQL tables. Tip: Use ClusterBy for fast joins or Bucketing with 2.0. Bad: Frequent/Incremental Updates 22 Incremental SQL Query Database Snapshot +

23. Good: Connecting BI Tools Tip: Cache your tables for optimal performance. 23 HDFS

24. Bad: External Reporting w/ load Too many concurrent requests will overload Spark. 24 HDFS

25. Solution: Write out to a DB to handle load. Bad: External Reporting w/ load 25 HDFS DB

26. Good: Machine Learning & Data Science Use MLlib, GraphX and Spark packages for machine learning and data science. Benefits: • Built in distributed algorithms. • In memory capabilities for iterative workloads. • Data cleansing, featurization, training, testing, etc. 26

27. Bad: Searching Content w/ load sqlContext.sql(“select * from mytable where name like '%xyz%'”) Spark will go through each row to find results. 27

28. Thank you