SlideShare une entreprise Scribd logo
1  sur  68
Télécharger pour lire hors ligne
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Eva Tse and Daniel Weeks, Netflix
October 2015
BDT303
Running Presto and Spark on the
Netflix
Big Data Platform
What to Expect from the Session
Our big data scale
Our architecture
For Presto and Spark
- Use cases
- Performance
- Contributions
- Deployment on Amazon EMR
- Integration with Netflix infrastructure
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
Our Biggest Challenge is Scale
Netflix Key Business Metrics
65+ million
members
50 countries 1000+ devices
supported
10 billion
hours / quarter
Our Big Data Scale
Total ~25 PB DW on Amazon S3
Read ~10% DW daily
Write ~10% of read data daily
~ 550 billion events daily
~ 350 active platform users
Global Expansion
200 countries by end of 2016
Architecture Overview
Cloud
apps
Kafka Ursula
Cassandra
Aegisthus
Dimension Data
Event Data
15 min
Daily
AWS
S3
SS Tables
Data Pipelines
Storage Compute Service Tools
AWS
S3
Analytics
ETL
Interactive data exploration
Interactive slice & dice
RT analytics & iterative/ML algo and more ...
Different Big Data Processing Needs
Amazon S3 as our DW
Amazon S3 as Our DW Storage
Amazon S3 as single source of truth (not HDFS)
Designed for 11 9’s durability and 4 9’s availability
Separate compute and storage
Key enablement to
- multiple heterogeneous clusters
- easy upgrade via r/b deployment
S3
What About Performance?
Amazon S3 is a much bigger fleet than your cluster
Offload network load from cluster
Read performance
- Single stage read-only job has 5-10% impact
- Insignificant when amortized over a sequence of stages
Write performance
- Can be faster because Amazon S3 is eventually consistent w/ higher
throughput
Presto
Presto is an open-source, distributed SQL query engine for
running interactive analytic queries against data sources of
all sizes ranging from gigabytes to petabytes
Why We Love Presto?
Hadoop friendly - integration with Hive metastore
Works well on AWS - easy integration with Amazon S3
Scalable - works at petabyte scale
User friendly - ANSI SQL
Open source - and in Java!
Fast
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
Usage Stats
~3500 queries/day
> 90%
Expanding Presto Use Cases
Data exploration and experimentation
Data validation
Backend for our interactive a/b test analytics application
Reporting
Not ETL (yet?)
Presto Deployment
Our Deployment
Version 0.114
+ patches
+ one non-public patch (Parquet vectorized read integration)
Deployed via bootstrap action on Amazon EMR
Separate clusters from our Hadoop YARN clusters
Not using Hadoop services
Leverage Amazon EMR for cluster management
Two Production Clusters
Resource isolation
Ad hoc cluster
1 coordinator (r3.4xl) + 225 workers (r3.4xl)
Dedicated application cluster
1 coordinator (r3.4xl) + 4 workers + dynamic workers (r3.xl, r3.2xl,
r3.4xl)
Dynamic cluster sizing via Netflix spinnaker API
Dynamic Cluster Sizing
Integration with Netflix Infrastructure
Atlas
Sidecar Coordinator
Amazon EMR Presto Worker
Kafka
Presto cluster
(for data lineage)
(for monitoring)
Queries + users
info
Presto Contributions
Our Contributions
Parquet File Format Support
Schema evolution
Predicate pushdown
Vectorized read**
Complex Types
Various functions for map, array and
struct types
Comparison operators for array and
struct types
Amazon S3 File System
Multi-part upload
AWS instance credentials
AWS role support*
Reliability
Query Optimization
Single distinct -> group by
Joins with similar sub-queries*
Parquet
Columnar file format
Supported across Hive, Pig, Presto, Spark
Performance benefits across different processing engines
Good performance on S3
Majority of our DW is on Parquet FF
RowGroup Metadata x N
row count, size, etc.
Dict Page
Data Page
Data Page
Column Chunk
Data Page
Data Page
Data Page
Column Chunk
Dict Page
Data Page
Data Page
RowGroup x N
Footer
schema, version, etc.
Column Chunk Metadata
encoding,
size,
min, max
Parquet
Column Chunk
Column Chunk Metadata
encoding,
size,
min, max
Column Chunk Metadata
encoding,
size,
min, max
Vectorized Read
Parquet: read column chunks in batches instead of row-by-row
Presto: replace ParquetHiveRecordCursor with ParquetPageSource
Performance improvement up to 2x for Presto
Beneficial for Spark, Hive, and Drill also
Pending parquet-131 commit before we can publish a Presto patch
Predicate Pushdown
Dictionary pushdown
column chunk stats [5, 20]
dictionary page <5,11,18,20>
skip this row group
Statistics pushdown
column chunk stats [20, 30]
skip this row group
Example: SELECT… WHERE abtest_id = 10;
Predicate Pushdown
Works best if data is clustered by predicate columns
Achieves data pruning like Hive partitions w/o the metadata overhead
Can also be implemented in Spark, Hive, and Pig
Atlas Analytics Use Case
0
20000
40000
60000
80000
100000
Pushdown No Pushdown
CPU (s)
0
50
100
150
200
250
300
350
Pushdown No Pushdown
Wall-clock Time (s)
0
2000
4000
6000
8000
Pushdown No Pushdown
Rows Processed
(Mils)
0
100
200
300
400
500
Pushdown No Pushdown
Bytes Read (GB)
170x
170x170x
10x
Example query: Analyze 4xx
errors from Genie for a day
High cardinality/selectivity for
app name and metrics name
as predicates
Data staged and clustered by
predicate columns
Stay Tuned…
Two upcoming blog posts on techblog.netflix.com
- Parquet usage @ Netflix Big Data Platform
- Presto + Parquet optimization and performance
Spark
Apache Spark™ is a fast and general engine for large-scale data processing.
Why Spark?
Batch jobs (Pig, Hive)
• ETL jobs
• Reporting and other analysis
Interactive jobs (Presto)
Iterative ML jobs (Spark)
Programmatic use cases
Deployments @ Netflix
Spark on Mesos
• Self-serving AMI
• Full BDAS (Berkeley Data Analytics Stack)
• Online streaming analytics
Spark on YARN
• Spark as a service
• YARN application on Amazon EMR Hadoop
• Offline batch analytics
Version Support
$ spark-shell –ver 1.5 …
s3://…/spark-1.4.tar.gz
s3://…/spark-1.5.tar.gz
s3://…/spark-1.5-custom.tar.gz
s3://…/1.5/spark-defaults.conf
s3://…/h2prod/yarn-site.xml
s3://../h2prod/core-site.xml
…
ConfigurationApplication
Multi-tenancy
Multi-tenancy
Dynamic Allocation on YARN
Optimize for resource utilization
Harness cluster’s scale
Still provide interactive performance
Container
Container
Container Container Container Container Container
Task Execution Model
Container
Task Task TaskTaskTaskTask
Task Task TaskTaskTask
Container
Task
Pending Tasks
Traditional MapReduce
Spark Execution
Persistent
Dynamic Allocation [SPARK-6954]
Cached Data
Spark allows data to be cached
• Interactive reuse of data set
• Iterative usage (ML)
Dynamic allocation
• Removes executors when no tasks are pending
Cached Executor Timeout [SPARK-7955]
val data = sqlContext
.table("dse.admin_genie_job_d”)
.filter($"dateint">=20150601 and $"dateint"<=20150830)
data.persist
data.count
Preemption [SPARK-8167]
Symptom
• Spark tasks randomly fail with “executor lost” error
Cause
• YARN preemption is not graceful
Solution
• Preempted tasks shouldn’t be counted as failures but should be retried
Reading / Processing / Writing
Amazon S3 Listing Optimization
Problem: Metadata is big data
• Tables with millions of partitions
• Partitions with hundreds of files each
Clients take a long time to launch jobs
Input split computation
mapreduce.input.fileinputformat.list-status.num-threads
• The number of threads to use list and fetch block locations for the specified
input paths.
Setting this property in Spark jobs doesn’t help
File listing for partitioned table
Partition path
Seq[RDD]
HadoopRDD
HadoopRDD
HadoopRDD
HadoopRDD
Partition path
Partition path
Partition path
Input dir
Input dir
Input dir
Input dir
Sequentially listing input dirs via S3N file system.
S3N
S3N
S3N
S3N
SPARK-9926, SPARK-10340
Symptom
• Input split computation for partitioned Hive table on Amazon S3 is slow
Cause
• Listing files on a per partition basis is slow
• S3N file system computes data locality hints
Solution
• Bulk list partitions in parallel using AmazonS3Client
• Bypass data locality computation for Amazon S3 objects
Amazon S3 Bulk Listing
Partition path
ParArray[RDD]
HadoopRDD
HadoopRDD
HadoopRDD
HadoopRDD
Partition path
Partition path
Partition path
Input dir
Input dir
Input dir
Input dir
Amazon S3 listing input dirs in parallel via AmazonS3Client
Amazon
S3Client
Performance Improvement
0
2000
4000
6000
8000
10000
12000
14000
16000
1 24 240 720
seconds
# of partitions
1.5 RC2
S3 bulk listing
SELECT * FROM nccp_log WHERE dateint=20150801 and hour=0 LIMIT 10;
Hadoop Output Committer
How it works
• Each task writes output to a temp dir.
• Output committer renames first successful task’s temp dir to final
destination
Challenges with Amazon S3
• Amazon S3 rename is copy and delete (non-atomic)
• Amazon S3 is eventually consistent
Amazon S3 Output Committer
How it works
• Each task writes output to local disk
• Output committer copies first successful task’s output to
Amazon S3
Advantages
• Avoid redundant Amazon S3 copy
• Avoid eventual consistency
• Always write to new paths
Our Contributions
SPARK-6018
SPARK-6662
SPARK-6909
SPARK-6910
SPARK-7037
SPARK-7451
SPARK-7850
SPARK-8355
SPARK-8572
SPARK-8908
SPARK-9270
SPARK-9926
SPARK-10001
SPARK-10340
Next Steps for Netflix Integration
Metrics
Data lineage
Parquet integration
Key Takeaways
Our DW source of truth is on Amazon S3
Run custom Presto and Spark distros on Amazon EMR
• Presto as stand-alone clusters
• Spark co-tenant on Hadoop YARN clusters
We are committed to open source; you can run what we run
What’s Next
On Scaling and Optimizing Infrastructure…
Graceful shrink of Amazon EMR +
Heterogeneous instance groups in Amazon EMR +
Netflix Atlas metrics +
Netflix Spinnaker API =
Load-based expand/shrink of Hadoop YARN clusters
Expand new Presto use cases
Integrate Spark in Netflix big data platform
Explore Spark for ETL use cases
On Presto and Spark…
Remember to complete
your evaluations!
Thank you!
Eva Tse & Daniel Weeks
Office hour: 12:00pm – 2pm Today @
Netflix Booth #326

Contenu connexe

Tendances

Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engineWalter Liu
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Databricks
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Databricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureDatabricks
 
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdfChris Hoyean Song
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Databricks
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDatabricks
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
 

Tendances (20)

Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Apache Airflow overview
Apache Airflow overviewApache Airflow overview
Apache Airflow overview
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engine
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 
Presto
PrestoPresto
Presto
 
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 

En vedette

Migration from Redshift to Spark
Migration from Redshift to SparkMigration from Redshift to Spark
Migration from Redshift to SparkSky Yin
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoSadayuki Furuhashi
 
Presto in the cloud
Presto in the cloudPresto in the cloud
Presto in the cloudQubole
 
Amazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto MeetupAmazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto Meetupstevemcpherson
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsJulien Le Dem
 

En vedette (7)

MySQL@king
MySQL@kingMySQL@king
MySQL@king
 
Migration from Redshift to Spark
Migration from Redshift to SparkMigration from Redshift to Spark
Migration from Redshift to Spark
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
 
Presto in the cloud
Presto in the cloudPresto in the cloud
Presto in the cloud
 
Amazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto MeetupAmazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto Meetup
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 

Similaire à (BDT303) Running Spark and Presto on the Netflix Big Data Platform

Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Amazon Web Services
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
Architetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS LambdaArchitetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS LambdaAmazon Web Services
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSAmazon Web Services
 
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...Amazon Web Services
 
IBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data LakeIBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data LakeTorsten Steinbach
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Amazon Web Services
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesAmazon Web Services
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
London Redshift Meetup - July 2017
London Redshift Meetup - July 2017London Redshift Meetup - July 2017
London Redshift Meetup - July 2017Pratim Das
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Amazon Web Services
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Paulo Gutierrez
 
Get Value From Your Data
Get Value From Your DataGet Value From Your Data
Get Value From Your DataDanilo Poccia
 

Similaire à (BDT303) Running Spark and Presto on the Netflix Big Data Platform (20)

Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
Architetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS LambdaArchitetture serverless e pattern avanzati per AWS Lambda
Architetture serverless e pattern avanzati per AWS Lambda
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
 
IBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data LakeIBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data Lake
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
 
AWS glue technical enablement training
AWS glue technical enablement trainingAWS glue technical enablement training
AWS glue technical enablement training
 
What's New in Amazon Aurora
What's New in Amazon AuroraWhat's New in Amazon Aurora
What's New in Amazon Aurora
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
London Redshift Meetup - July 2017
London Redshift Meetup - July 2017London Redshift Meetup - July 2017
London Redshift Meetup - July 2017
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Get Value From Your Data
Get Value From Your DataGet Value From Your Data
Get Value From Your Data
 

Plus de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Dernier

UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 

Dernier (20)

UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 

(BDT303) Running Spark and Presto on the Netflix Big Data Platform

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Eva Tse and Daniel Weeks, Netflix October 2015 BDT303 Running Presto and Spark on the Netflix Big Data Platform
  • 2. What to Expect from the Session Our big data scale Our architecture For Presto and Spark - Use cases - Performance - Contributions - Deployment on Amazon EMR - Integration with Netflix infrastructure
  • 8. Netflix Key Business Metrics 65+ million members 50 countries 1000+ devices supported 10 billion hours / quarter
  • 9. Our Big Data Scale Total ~25 PB DW on Amazon S3 Read ~10% DW daily Write ~10% of read data daily ~ 550 billion events daily ~ 350 active platform users
  • 12. Cloud apps Kafka Ursula Cassandra Aegisthus Dimension Data Event Data 15 min Daily AWS S3 SS Tables Data Pipelines
  • 13. Storage Compute Service Tools AWS S3
  • 14. Analytics ETL Interactive data exploration Interactive slice & dice RT analytics & iterative/ML algo and more ... Different Big Data Processing Needs
  • 15. Amazon S3 as our DW
  • 16. Amazon S3 as Our DW Storage Amazon S3 as single source of truth (not HDFS) Designed for 11 9’s durability and 4 9’s availability Separate compute and storage Key enablement to - multiple heterogeneous clusters - easy upgrade via r/b deployment S3
  • 17. What About Performance? Amazon S3 is a much bigger fleet than your cluster Offload network load from cluster Read performance - Single stage read-only job has 5-10% impact - Insignificant when amortized over a sequence of stages Write performance - Can be faster because Amazon S3 is eventually consistent w/ higher throughput
  • 19. Presto is an open-source, distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes
  • 20. Why We Love Presto? Hadoop friendly - integration with Hive metastore Works well on AWS - easy integration with Amazon S3 Scalable - works at petabyte scale User friendly - ANSI SQL Open source - and in Java! Fast
  • 23. Expanding Presto Use Cases Data exploration and experimentation Data validation Backend for our interactive a/b test analytics application Reporting Not ETL (yet?)
  • 25. Our Deployment Version 0.114 + patches + one non-public patch (Parquet vectorized read integration) Deployed via bootstrap action on Amazon EMR Separate clusters from our Hadoop YARN clusters Not using Hadoop services Leverage Amazon EMR for cluster management
  • 26. Two Production Clusters Resource isolation Ad hoc cluster 1 coordinator (r3.4xl) + 225 workers (r3.4xl) Dedicated application cluster 1 coordinator (r3.4xl) + 4 workers + dynamic workers (r3.xl, r3.2xl, r3.4xl) Dynamic cluster sizing via Netflix spinnaker API
  • 28. Integration with Netflix Infrastructure Atlas Sidecar Coordinator Amazon EMR Presto Worker Kafka Presto cluster (for data lineage) (for monitoring) Queries + users info
  • 30. Our Contributions Parquet File Format Support Schema evolution Predicate pushdown Vectorized read** Complex Types Various functions for map, array and struct types Comparison operators for array and struct types Amazon S3 File System Multi-part upload AWS instance credentials AWS role support* Reliability Query Optimization Single distinct -> group by Joins with similar sub-queries*
  • 31. Parquet Columnar file format Supported across Hive, Pig, Presto, Spark Performance benefits across different processing engines Good performance on S3 Majority of our DW is on Parquet FF
  • 32. RowGroup Metadata x N row count, size, etc. Dict Page Data Page Data Page Column Chunk Data Page Data Page Data Page Column Chunk Dict Page Data Page Data Page RowGroup x N Footer schema, version, etc. Column Chunk Metadata encoding, size, min, max Parquet Column Chunk Column Chunk Metadata encoding, size, min, max Column Chunk Metadata encoding, size, min, max
  • 33. Vectorized Read Parquet: read column chunks in batches instead of row-by-row Presto: replace ParquetHiveRecordCursor with ParquetPageSource Performance improvement up to 2x for Presto Beneficial for Spark, Hive, and Drill also Pending parquet-131 commit before we can publish a Presto patch
  • 34. Predicate Pushdown Dictionary pushdown column chunk stats [5, 20] dictionary page <5,11,18,20> skip this row group Statistics pushdown column chunk stats [20, 30] skip this row group Example: SELECT… WHERE abtest_id = 10;
  • 35. Predicate Pushdown Works best if data is clustered by predicate columns Achieves data pruning like Hive partitions w/o the metadata overhead Can also be implemented in Spark, Hive, and Pig
  • 36. Atlas Analytics Use Case 0 20000 40000 60000 80000 100000 Pushdown No Pushdown CPU (s) 0 50 100 150 200 250 300 350 Pushdown No Pushdown Wall-clock Time (s) 0 2000 4000 6000 8000 Pushdown No Pushdown Rows Processed (Mils) 0 100 200 300 400 500 Pushdown No Pushdown Bytes Read (GB) 170x 170x170x 10x Example query: Analyze 4xx errors from Genie for a day High cardinality/selectivity for app name and metrics name as predicates Data staged and clustered by predicate columns
  • 37. Stay Tuned… Two upcoming blog posts on techblog.netflix.com - Parquet usage @ Netflix Big Data Platform - Presto + Parquet optimization and performance
  • 38. Spark
  • 39. Apache Spark™ is a fast and general engine for large-scale data processing.
  • 40. Why Spark? Batch jobs (Pig, Hive) • ETL jobs • Reporting and other analysis Interactive jobs (Presto) Iterative ML jobs (Spark) Programmatic use cases
  • 41. Deployments @ Netflix Spark on Mesos • Self-serving AMI • Full BDAS (Berkeley Data Analytics Stack) • Online streaming analytics Spark on YARN • Spark as a service • YARN application on Amazon EMR Hadoop • Offline batch analytics
  • 42. Version Support $ spark-shell –ver 1.5 … s3://…/spark-1.4.tar.gz s3://…/spark-1.5.tar.gz s3://…/spark-1.5-custom.tar.gz s3://…/1.5/spark-defaults.conf s3://…/h2prod/yarn-site.xml s3://../h2prod/core-site.xml … ConfigurationApplication
  • 45. Dynamic Allocation on YARN Optimize for resource utilization Harness cluster’s scale Still provide interactive performance
  • 46. Container Container Container Container Container Container Container Task Execution Model Container Task Task TaskTaskTaskTask Task Task TaskTaskTask Container Task Pending Tasks Traditional MapReduce Spark Execution Persistent
  • 48. Cached Data Spark allows data to be cached • Interactive reuse of data set • Iterative usage (ML) Dynamic allocation • Removes executors when no tasks are pending
  • 49. Cached Executor Timeout [SPARK-7955] val data = sqlContext .table("dse.admin_genie_job_d”) .filter($"dateint">=20150601 and $"dateint"<=20150830) data.persist data.count
  • 50. Preemption [SPARK-8167] Symptom • Spark tasks randomly fail with “executor lost” error Cause • YARN preemption is not graceful Solution • Preempted tasks shouldn’t be counted as failures but should be retried
  • 51. Reading / Processing / Writing
  • 52. Amazon S3 Listing Optimization Problem: Metadata is big data • Tables with millions of partitions • Partitions with hundreds of files each Clients take a long time to launch jobs
  • 53. Input split computation mapreduce.input.fileinputformat.list-status.num-threads • The number of threads to use list and fetch block locations for the specified input paths. Setting this property in Spark jobs doesn’t help
  • 54. File listing for partitioned table Partition path Seq[RDD] HadoopRDD HadoopRDD HadoopRDD HadoopRDD Partition path Partition path Partition path Input dir Input dir Input dir Input dir Sequentially listing input dirs via S3N file system. S3N S3N S3N S3N
  • 55. SPARK-9926, SPARK-10340 Symptom • Input split computation for partitioned Hive table on Amazon S3 is slow Cause • Listing files on a per partition basis is slow • S3N file system computes data locality hints Solution • Bulk list partitions in parallel using AmazonS3Client • Bypass data locality computation for Amazon S3 objects
  • 56. Amazon S3 Bulk Listing Partition path ParArray[RDD] HadoopRDD HadoopRDD HadoopRDD HadoopRDD Partition path Partition path Partition path Input dir Input dir Input dir Input dir Amazon S3 listing input dirs in parallel via AmazonS3Client Amazon S3Client
  • 57. Performance Improvement 0 2000 4000 6000 8000 10000 12000 14000 16000 1 24 240 720 seconds # of partitions 1.5 RC2 S3 bulk listing SELECT * FROM nccp_log WHERE dateint=20150801 and hour=0 LIMIT 10;
  • 58. Hadoop Output Committer How it works • Each task writes output to a temp dir. • Output committer renames first successful task’s temp dir to final destination Challenges with Amazon S3 • Amazon S3 rename is copy and delete (non-atomic) • Amazon S3 is eventually consistent
  • 59. Amazon S3 Output Committer How it works • Each task writes output to local disk • Output committer copies first successful task’s output to Amazon S3 Advantages • Avoid redundant Amazon S3 copy • Avoid eventual consistency • Always write to new paths
  • 61. Next Steps for Netflix Integration Metrics Data lineage Parquet integration
  • 63. Our DW source of truth is on Amazon S3 Run custom Presto and Spark distros on Amazon EMR • Presto as stand-alone clusters • Spark co-tenant on Hadoop YARN clusters We are committed to open source; you can run what we run
  • 65. On Scaling and Optimizing Infrastructure… Graceful shrink of Amazon EMR + Heterogeneous instance groups in Amazon EMR + Netflix Atlas metrics + Netflix Spinnaker API = Load-based expand/shrink of Hadoop YARN clusters
  • 66. Expand new Presto use cases Integrate Spark in Netflix big data platform Explore Spark for ETL use cases On Presto and Spark…
  • 68. Thank you! Eva Tse & Daniel Weeks Office hour: 12:00pm – 2pm Today @ Netflix Booth #326