Azure Data Factory: Mapping Data Flows
Performance Tuning Data Flows
v001
Sample Timings 1
 Scenario 1
 Source: Delimited Text Blob Store
 Sink: Azure SQL DB
 File size: 421 MB, 74 columns, 887k rows
 Transforms: Single derived column to mask 3 fields
 Time: 4 mins end-to-end using memory-optimized 80-core debug Azure IR
 Recommended settings: Current partitioning used throughout
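For reference, the masking transform in these scenarios is a single Derived Column. Below is a minimal PySpark sketch of equivalent masking logic; data flows execute on Spark, and the path and column names here are hypothetical, not from the original test.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, regexp_replace

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("abfss://container@account.dfs.core.windows.net/input/", header=True)

# Mask 3 fields, as the scenario's single derived column does:
masked = (
    df.withColumn("ssn", lit("***-**-****"))                            # constant mask
      .withColumn("email", regexp_replace("email", r"^[^@]+", "****"))  # keep the domain
      .withColumn("phone", regexp_replace("phone", r"\d", "*"))         # mask every digit
)
```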
Sample Timings 2
 Scenario 2
 Source: Azure SQL DB Table
 Sink: Azure SQL DB Table
 Table size: 74 columns, 887k rows
 Transforms: Single derived column to mask 3 fields
 Time: 3 mins end-to-end using memory-optimized 80-core debug Azure IR
 Recommended settings: Source partitioning on SQL DB Source, current partitioning
on Derived Column and Sink
Sample Timings 3
 Scenario 3
 Source: Delimited Text Blob Store
 Sink: Delimited Text Blob Store
 Data size: 74 columns, 887k rows
 Transforms: Single derived column to mask 3 fields
 Time: 2 mins end-to-end using memory-optimized 80-core debug Azure IR
 Recommended settings: Leaving default/current partitioning throughout allows
ADF to scale partitions up or down based on the size of the Azure IR (i.e., number of worker
cores)
File conversion Source->Sink property findings
 Large data sizes should use more vcores (16+) with memory-optimized or
general-purpose compute
 Compute-optimized does not improve performance in this scenario
 CSV-to-Parquet format conversion has 45% time overhead in comparison with CSV-to-CSV
 CSV-to-JSON format conversion has 24% time overhead in comparison with CSV-to-CSV
 CSV-to-JSON performs better even though it has a lot of data to write
 CSV-to-Parquet lags slightly because of time spent in compression
 Scaling vcores improves performance for both IO and computation
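A sketch of how you might reproduce this format comparison yourself in PySpark; the paths are hypothetical, and since data flows run on Spark the relative overheads should be comparable.

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("abfss://container@account.dfs.core.windows.net/csv_dataset/",
                    header=True, inferSchema=True)

# Write the same source to each sink format and compare wall-clock time,
# mirroring the CSV-to-CSV / Parquet / JSON comparison above.
for fmt in ("csv", "parquet", "json"):
    start = time.time()
    df.write.mode("overwrite").format(fmt).save(
        f"abfss://container@account.dfs.core.windows.net/out_{fmt}/")
    print(f"{fmt}: {time.time() - start:.1f}s")
```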
File Conversion Timing
Compute type: General Purpose
• Dataset has 36 columns of string, integer, short, and double types
• CSV dataset has 25 files with different file sizes
• Performance improvement scales proportionately with the increase
in vcores
• Going from 8 to 64 vcores improves performance roughly 8x
SQL Database Timing
Synapse DW Timing
Compute type: General Purpose
Adding cores proportionally decreases the time it takes to process data into staging files for PolyBase. However, there is a
fairly static amount of time that it takes to write that data from Parquet into SQL tables using PolyBase.
Cosmos DB Timing
Compute type: General Purpose
Window / Aggregate Timing
Compute type: General Purpose
• Performance improvement scales proportionately with the
increase in vcores
• Going from 8 to 64 vcores improves performance roughly 5x
Transformation Timings
Compute type: General Purpose
Transformation recommendations
• When ranking data across the entire dataset, use the Rank
transformation instead of a Window with rank()
• When using rowNumber() in a Window to uniquely
add a row counter to each row across the entire
dataset, use the Surrogate Key
transformation instead (see the sketch below)
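A hedged PySpark sketch of why the Window versions are slower: with no partition clause, Spark must pull every row into a single partition to compute a global rank or row number. The path and column name are hypothetical.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import rank, monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/data/")

# Anti-pattern: an un-partitioned Window collapses the dataset into a
# single partition to compute rank() globally.
slow = df.withColumn("rnk", rank().over(Window.orderBy("score")))

# Surrogate-key-style alternative for a unique row counter: ids are
# generated per partition, so no global shuffle is needed (unique,
# but not contiguous).
fast = df.withColumn("row_id", monotonically_increasing_id())
```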
TPCH Timings
Compute type: General Purpose
TPCH CSV in ADLS Gen 2
Azure Data Factory Data Flow Performance
* Includes cold cluster start-up time
Azure Synapse Data Flow Performance
* Includes cold cluster start-up time
Identifying bottlenecks
1. Cluster startup time
2. Sink processing time
3. Source read time
4. Transformation stage time
1. Sequential executions can
lower the cluster startup time
by setting a TTL on the Azure IR
2. Total time to process the
stream from source to sink.
There is also a post-processing
time, shown when you click on the
Sink, that indicates how much
time Spark spent on partition
and job clean-up. Writing to a
single file and slow database
connections will increase this time
3. Shows how long it took to
read data from the source.
Optimize with different source
partitioning strategies
4. This will show you bottlenecks
in your transformation logic.
With larger general-purpose
and memory-optimized IRs, most
of these operations occur in
memory in data frames and are
usually the fastest operations
in your data flow
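To make the breakdown concrete, here is the arithmetic using the General Purpose 4+4 timings quoted later in this deck; a minimal sketch:

```python
# Stage timings from the General Purpose 4+4 run later in this deck.
stages = {
    "cluster_startup":       4.5 * 60,  # ~4.5 min cold start
    "transformation":        42,
    "sink_io_write":         46,
    "sink_post_processing":  45,
}
total = sum(stages.values())
for name, secs in stages.items():
    print(f"{name:22s} {secs:6.0f}s  ({secs / total:5.1%})")
# Cold start is ~270s of ~400s total -- which is why a TTL (startup
# ~1.5 min on a warm cluster) is the first lever for sequential jobs.
```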
File Partitioning
 Maintain current partitioning
 Avoid output to single file
 For manual partitioning, take the number of cores in your Azure IR
and multiply by 5
 Example: transforming a series of files in your ADLS folders with a 32-core Azure IR, the number of
partitions would be 32 x 5 = 160 partitions
 If you know your data well enough to identify high-cardinality columns, use those columns for Hash
partitioning
 If you do not know your data patterns very well, use Round Robin
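The same rule of thumb as a PySpark sketch; the path and the `customer_id` column are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("abfss://container@account.dfs.core.windows.net/folders/", header=True)

ir_cores = 32
num_partitions = ir_cores * 5  # 32 x 5 = 160 partitions

# Hash partitioning on a known high-cardinality column:
by_hash = df.repartition(num_partitions, col("customer_id"))

# Round robin when you don't know the data patterns (no partition key):
by_round_robin = df.repartition(num_partitions)
```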
Best practices - Sources
 When reading from file-based sources, data flow automatically
partitions the data based on size
 ~128 MB per partition, evenly distributed
 Using current partitioning is fastest for file-based sources and for Synapse using PolyBase
 Enable staging for Synapse
 For Azure SQL DB, use Source partitioning on column with high
cardinality
 Improves performance, but can saturate your source database
 Reading can be limited by the I/O of your source
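For intuition, ADF's source partitioning on Azure SQL DB behaves like Spark's partitioned JDBC read: N parallel range queries against the source. A sketch under assumed connection details and bounds:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Partitioned read from Azure SQL DB; server, table, column, and
# bounds are all hypothetical.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net;databaseName=mydb")
    .option("dbtable", "dbo.Sales")
    .option("user", "etl_user")
    .option("password", "...")
    .option("partitionColumn", "sale_id")  # numeric, high cardinality
    .option("lowerBound", "1")
    .option("upperBound", "887000")
    .option("numPartitions", "16")         # 16 parallel reads; can saturate the source
    .load()
)
```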
Optimizing transformations
 Each transformation has its own optimize tab
 Generally better not to alter partitioning here -> reshuffling is a relatively slow process
 Reshuffling can occur if data is very skewed
 One node has a disproportionate amount of data
 For Joins, Exists and Lookups:
 If you have many of these transforms, memory-optimized compute greatly increases performance
 Can ‘Broadcast’ if the data on one side is small
 Rule of thumb: Less than 50k rows
 Use Window transformation partitioned over segments of data
 For Rank() across entire dataset, use the Rank transformation instead
 For RowNumber() across entire dataset, use the Surrogate Key transformation instead
 Transformations that require reshuffling like Sort negatively impact
performance
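A sketch of the broadcast rule of thumb in PySpark; the paths and join key are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
orders = spark.read.parquet("abfss://container@account.dfs.core.windows.net/orders/")
countries = spark.read.parquet(
    "abfss://container@account.dfs.core.windows.net/countries/")  # small side, < 50k rows

# Broadcasting the small side ships it whole to every worker, so the
# large side is never reshuffled -- the same idea as the 'Broadcast'
# option on Join/Exists/Lookup transformations.
joined = orders.join(broadcast(countries), "country_code")
```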
Best practices – Debug (Data Preview)
 Data Preview
 Data preview is inside the data flow designer transformation properties
 Uses row limits and sampling techniques to preview a small slice of your data
 Allows you to build and validate units of logic with samples of data in real time
 You have control over the size of the data limits under Debug Settings
 If you wish to test with larger datasets, set a larger compute size in the Azure IR when
switching on “Debug Mode”
 Data Preview is only a snapshot of data in memory from Spark data frames. This feature does
not write any data, so the sink drivers are not utilized and not tested in this mode.
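Conceptually, Data Preview behaves like a row-limited sample over the Spark data frame with no sink write; a rough sketch of the equivalent (the path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("abfss://container@account.dfs.core.windows.net/input/", header=True)

# Row limit + sampling, then display -- nothing is written anywhere,
# which is why sink drivers are never exercised in this mode.
preview = df.sample(fraction=0.1, seed=42).limit(100)
preview.show(truncate=False)
```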
Best practices – Debug (Pipeline Debug)
 Pipeline Debug
 Click debug button to test your data flow inside of a pipeline
 Default debug limits the execution runtime so you will want to limit data sizes
 Sampling can be applied here as well by using the “Enable Sampling” option in each Source
 Use the debug button option of “use activity IR” when you wish to use a job execution
compute environment
 This option is good for debugging with larger datasets. It will not have the same execution timeout limit as the
default debug setting
Best practices - Sinks
 SQL:
 Disable indexes on target with pre/post SQL scripts
 Increase SQL capacity during pipeline execution
 Enable staging when using Synapse
 Use Source Partitioning on Source under Optimize
 Set number of partitions based on size of IR
 File-based sinks:
 Using current partitioning allows Spark to create the output
 Output to a single file is a slow operation
 A single file is often unnecessary for whoever is consuming the data
 Can set naming patterns or use data in column
 Any reshuffling of data is slow
 Cosmos DB
 Set throughput and batch size to meet performance requirements
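A sketch of the SQL sink pre/post-script pattern, driven from Python with pyodbc; the connection string and index name are hypothetical, and in ADF you would place these statements in the sink's pre- and post-SQL script boxes.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;"
    "UID=etl_user;PWD=example"
)
cur = conn.cursor()

# Pre-SQL: disable nonclustered indexes so the bulk load isn't
# paying index-maintenance costs per row.
cur.execute("ALTER INDEX IX_Sales_Customer ON dbo.Sales DISABLE")
conn.commit()

# ... data flow loads dbo.Sales here ...

# Post-SQL: rebuild the index once the load completes.
cur.execute("ALTER INDEX IX_Sales_Customer ON dbo.Sales REBUILD")
conn.commit()
conn.close()
```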
Azure Integration Runtime
 Data Flows use JIT compute to minimize running expensive clusters
when they are mostly idle
 Generally more economical, but each cluster takes ~4 minutes to spin up
 IR specifies what cluster type and core-count to use
 Memory-optimized is best; compute-optimized doesn't generally work for production workloads
 When running sequential jobs, utilize Time to Live (TTL) to reuse the cluster
between executions
 Keeps compute resources alive for TTL minutes after execution for new job to use
 Maximum one job per cluster
 Reduces job startup latency to ~1.5 minutes
 Rule of thumb: start small and scale up
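The knobs from this slide map onto the managed Azure IR's data flow properties. Below is a sketch of that definition as a Python dict following the REST payload shape; the name and values are illustrative, not recommendations for every workload.

```python
# Illustrative managed Azure IR fragment for data flows.
integration_runtime = {
    "name": "DataFlowMemOptIR",  # hypothetical name
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "MemoryOptimized",  # or "General"
                    "coreCount": 16,   # start small and scale up
                    "timeToLive": 10,  # minutes to keep the cluster warm
                },
            },
        },
    },
}
```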
Azure IR – General Purpose
• This was General Purpose 4+4, the default auto resolve Azure IR
• For prod workloads, GP is usually sufficient at >= 16 cores
• You get 1 driver and 1 worker node, both with 4 vcores
• Good for debugging, testing, and many production workloads
• Tested with 887k row CSV file with 74 columns
• Default partitioning
• Spark chose 4 partitions
• Cluster startup time: 4.5 mins
• Sink IO writing: 46s
• Transformation time: 42s
• Sink post-processing time: 45s
Azure IR – Compute Optimized
• Compute Optimized is intended for smaller workloads
• 8+8 is the smallest CO option; you get 1 driver and 2
workers
• Not suitable for large production workloads
• Tested with 887k row CSV file with 74 columns
• Default partitioning
• Spark chose 8 partitions
• Cluster startup time: 4.5 mins
• Sink IO writing: 20s
• Transformation time: 35s
• Sink post-processing time: 40s
• More worker nodes gave us more partitions and better perf than
General Purpose
Azure IR – Memory Optimized
• Memory Optimized is well suited for reliably running large production
workloads with many aggregates, lookups, and joins
• 64+16 gives you 16 vcores for driver and 64 across worker nodes
• Tested with 887k row CSV file with 74 columns
• Default partitioning
• Spark chose 64 partitions
• Cluster startup time: 4.8 mins
• Sink IO writing: 19s
• Transformation time: 17s
• Sink post-processing time: 40s