February 2016
Debugging Slow Buffered Reads to the Lustre Filesystem
By Robert Roy, Senior Staff Engineer
The Problem
Direct IO reads outperform buffered IO reads
Seagate CS9000 with 4M RPCs
Buffered reads: ~3.5 GB/s per OST
O_DIRECT reads: ~4.5 GB/s per OST
Buffered writes: ~4.5 GB/s per OST
More clients do not produce more bandwidth
This suggests a server-side limit
But the data path on the server side is the same for O_DIRECT and buffered IO
This points to the client side instead
Buffered IO uses the page cache, which is populated by readahead
Client-side readahead is the suspect
Readahead requests never ramp up to 4M RPCs
The Root Cause
[rroy@rroy-vm-wireshark ~]$ tshark -r buffered_1node_1thread.cap.gz \
    -Tfields -e ip.src -e ip.dst -e lustre.obd_ioobj.ioo_id \
    -e lustre.niobuf_remote.offset -e lustre.niobuf_remote.len \
    -R lustre.niobuf_remote | head -10
172.19.62.138 172.19.55.5 1903 0 1048576
172.19.62.138 172.19.55.5 1903 1048576 2097152
172.19.62.138 172.19.55.5 1903 3145728 1048576
172.19.62.138 172.19.55.5 1903 4194304 1048576
172.19.62.138 172.19.55.5 1903 5242880 2097152
172.19.62.138 172.19.55.5 1903 7340032 1048576
172.19.62.138 172.19.55.5 1903 8388608 1048576
172.19.62.138 172.19.55.5 1903 9437184 2097152
172.19.62.138 172.19.55.5 1903 11534336 1048576
172.19.62.138 172.19.55.5 1903 12582912 1048576
...
172.19.62.138 172.19.55.5 1903 1685061632 1048576
172.19.62.138 172.19.55.5 1903 1686110208 1048576
172.19.62.138 172.19.55.5 1903 1687158784 1048576
172.19.62.138 172.19.55.5 1903 1688207360 1048576
172.19.62.138 172.19.55.5 1903 1689255936 1048576
172.19.62.138 172.19.55.5 1903 1690304512 1048576
172.19.62.138 172.19.55.5 1903 1691353088 1048576
172.19.62.138 172.19.55.5 1903 1692401664 1048576
172.19.62.138 172.19.55.5 1903 1693450240 1048576
172.19.62.138 172.19.55.5 1903 1694498816 1048576
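One way to confirm that the requests never reach 4M is to tally the length column over the whole capture instead of eyeballing the first rows. The following is a minimal sketch, assuming the five-column tshark output shown above has been saved as text; the function name `rpc_size_histogram` and the `sample` rows (a subset of the output above) are illustrative only.

```python
from collections import Counter

def rpc_size_histogram(lines):
    """Count how often each request length (the last column) appears."""
    sizes = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip "..." and anything that is not a data row
        sizes[int(fields[4])] += 1
    return sizes

# A few rows copied from the tshark output above.
sample = """\
172.19.62.138 172.19.55.5 1903 0 1048576
172.19.62.138 172.19.55.5 1903 1048576 2097152
172.19.62.138 172.19.55.5 1903 3145728 1048576
172.19.62.138 172.19.55.5 1903 4194304 1048576
...
172.19.62.138 172.19.55.5 1903 1694498816 1048576
""".splitlines()

hist = rpc_size_histogram(sample)
for size, count in sorted(hist.items()):
    print(f"{size // (1 << 20)} MiB: {count} requests")
```

On these rows the largest length seen is 2 MiB, which matches the observation that readahead never issues a full 4 MiB request.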
Even with a large 64MB IO size, all IO serviced from readahead is 1MB in size
The Root Cause
[rroy@rroy-vm-wireshark ~]$ tshark -r buffered_32node_4thread_64mIO.cap.gz \
    -Tfields -e ip.src -e ip.dst -e lustre.obd_ioobj.ioo_id \
    -e lustre.niobuf_remote.offset -e lustre.niobuf_remote.len \
    -R lustre.niobuf_remote | grep 288 | head -n 20
172.19.62.138 172.19.55.4 2288 0 4194304
172.19.62.138 172.19.55.4 2288 4194304 4194304
172.19.62.138 172.19.55.4 2288 8388608 4194304
172.19.62.138 172.19.55.4 2288 12582912 4194304
172.19.62.138 172.19.55.4 2288 16777216 4194304
172.19.62.138 172.19.55.4 2288 20971520 4194304
172.19.62.138 172.19.55.4 2288 25165824 4194304
172.19.62.138 172.19.55.4 2288 29360128 4194304
172.19.62.138 172.19.55.4 2288 33554432 4194304
172.19.62.138 172.19.55.4 2288 37748736 4194304
172.19.62.138 172.19.55.4 2288 41943040 4194304
172.19.62.138 172.19.55.4 2288 46137344 4194304
172.19.62.138 172.19.55.4 2288 50331648 4194304
172.19.62.138 172.19.55.4 2288 54525952 4194304
172.19.62.138 172.19.55.4 2288 58720256 4194304
172.19.62.138 172.19.55.4 2288 62914560 4194304
172.19.62.138 172.19.55.4 2288 67108864 1048576
172.19.62.138 172.19.55.4 2288 68157440 1048576
172.19.62.138 172.19.55.4 2288 69206016 1048576
172.19.62.138 172.19.55.4 2288 70254592 1048576
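The point where the request size drops can be located mechanically. This is a small sketch, assuming the offset and length columns above have been loaded as integer pairs; `first_short_request` is a hypothetical helper, and `rows` is rebuilt from the 20 rows of output shown above.

```python
RPC_SIZE = 4 * 1024 * 1024  # 4M RPCs, as configured in this deck

def first_short_request(rows, rpc_size=RPC_SIZE):
    """Return (offset, length) of the first request smaller than rpc_size, or None."""
    for offset, length in rows:
        if length < rpc_size:
            return offset, length
    return None

# (offset, length) pairs from the capture above:
# 16 full 4 MiB requests, followed by 1 MiB requests.
rows = [(i * RPC_SIZE, RPC_SIZE) for i in range(16)]
rows += [(67108864 + i * 1048576, 1048576) for i in range(4)]

offset, length = first_short_request(rows)
print(f"first short request at {offset // (1 << 20)} MiB, "
      f"length {length // (1 << 20)} MiB")
```

The drop lands at the 64 MiB offset, which is the end of the first 64MB application read; everything after it is serviced by readahead in 1 MiB requests.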
The Source of the Problem
/lustre/llite/rw.c
#define RAS_INCREASE_STEP(inode) (ONE_MB_BRW_SIZE >> PAGE_CACHE_SHIFT)
And right above that line…
/* RAS_INCREASE_STEP should be (1UL << (inode->i_blkbits - PAGE_CACHE_SHIFT)).
 * Temporarily set RAS_INCREASE_STEP to 1MB. After 4MB RPC is enabled
 * by default, this should be adjusted corresponding with max_read_ahead_mb
 * and max_read_ahead_per_file_mb otherwise the readahead budget can be used
 * up quickly which will affect read performance significantly. See LU-2816 */
Set the increase step to the same value as the RPC size
The Solution
< #define RAS_INCREASE_STEP(inode) (ONE_MB_BRW_SIZE >> PAGE_CACHE_SHIFT)
> #define RAS_INCREASE_STEP(inode) (PTLRPC_MAX_BRW_SIZE >> PAGE_CACHE_SHIFT)
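To see what the one-line change means in pages, the macro arithmetic can be reproduced outside the kernel. This is only an illustration, assuming 4 KiB pages (PAGE_CACHE_SHIFT = 12) and a 4 MiB PTLRPC_MAX_BRW_SIZE to match the 4M RPCs used in this deck; neither value is stated on the slides.

```python
PAGE_CACHE_SHIFT = 12          # assumption: 4 KiB pages
ONE_MB_BRW_SIZE = 1 << 20      # 1 MiB
PTLRPC_MAX_BRW_SIZE = 4 << 20  # assumption: 4 MiB, matching the 4M RPCs

def increase_step_pages(brw_size, page_shift=PAGE_CACHE_SHIFT):
    """Mirror the macro: readahead window growth step, in pages."""
    return brw_size >> page_shift

old_step = increase_step_pages(ONE_MB_BRW_SIZE)
new_step = increase_step_pages(PTLRPC_MAX_BRW_SIZE)
print(f"old step: {old_step} pages, new step: {new_step} pages")
```

Under these assumptions the step grows from 256 to 1024 pages, so each readahead window increase covers one full RPC rather than a quarter of one.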
Results
INCREASE_STEP  RA File  RA MB  Clients  PPN  IO Size  Read Average
1MB            40       40     32       1    1M       6928.02
4MB            40       40     32       1    1M       8629.80
1MB            160      640    32       1    1M       7137.50
4MB            160      640    32       1    1M       9528.45
IOR -r -v -F -b 131072m -t 1m -i 3 -m -k -D 60
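The relative gain in each configuration follows directly from the Read Average column. A short sketch of that arithmetic, using only the four values in the table above; the tuple layout and the `percent_gain` helper are illustrative.

```python
# (RA File, RA MB, read average with 1MB step, read average with 4MB step)
results = [
    (40, 40, 6928.02, 8629.80),
    (160, 640, 7137.50, 9528.45),
]

def percent_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

gains = [percent_gain(before, after) for _, _, before, after in results]
for (ra_file, ra_mb, before, after), gain in zip(results, gains):
    print(f"RA File {ra_file}, RA MB {ra_mb}: "
          f"{before} -> {after} (+{gain:.1f}%)")
```

The 4MB step improves the read average by roughly a quarter with the smaller readahead settings and by about a third with the larger ones.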
Conclusion and More Information
Buffered reads can be improved significantly when 4M RPCs are in use
Seagate implemented a parameter to address the issue
lctl set_param -n llite.*.read_ahead_step 4
https://github.com/Xyratex/lustre-stable/commit/2395f8e0e7e963aec43deb07d719e9229884758c
LU-7140 tracks the upstream work
https://jira.hpdd.intel.com/browse/LU-7140
Thank You
Questions?
About Seagate
›  2+ million enclosures
›  17+ Petabytes shipped
›  Drive variety (HDD, SAS, SATA, SSD, hybrid)
›  Enclosures, controllers
›  Customer-driven partnership
›  Services: logistics, fulfillment, warranty, design, supply chain
›  Purpose-engineered to optimize capacity and performance
›  40% fewer racks required
›  >1TB/sec file system performance
›  Solutions for object storage
›  Reference architectures for open source and software-defined storage
›  Private cloud appliances for backup and recovery
›  Modular, scalable components for DIY customers
Scale-Out Systems
HPC
OEM
Seagate Cloud Systems & Silicon Group
Powering the Fastest HPC Sites
Awards
Award-Winning ClusterStor Architecture
