Calcul Québec - Université Laval
Monitoring a heterogeneous Lustre environment
LUG 2015
frederick.lefebvre@calculquebec.ca
Plan
Site description
A bit of history
Real time utilization monitoring
Moving forward
Who we are
Advanced computing center @ Université Laval
Part of Calcul Québec & Compute Canada
Operate 2 Clusters & 4 Lustre fs
Our infrastructure
1000+ Lustre clients (2.5.3 & 2.1.6)
On 2 IB fabrics - 6 LNET routers
4 Lustre file systems (~2.5 PB)
500 TB - Lustre 2.5.3 + ZFS - 16 OSSs
500 TB - Lustre 2.1.6 - 6 OSSs
1.4 PB - Lustre 2.1.0 (Seagate) - 12 OSSs
120 TB - Lustre 2.1.0 (Seagate) - 2 OSSs
Topology (2 IB fabrics)
[Diagram: IB switches, 6 LNET routers, the four Lustre file systems (Lustre1 - 2.5.3 + ZFS, Lustre2 - 2.1.6, Lustre3 - 2.1.0 Seagate, Lustre4 - 2.1.0 Seagate) and the two clusters (Colosse: 960 nodes; Helios: 21 nodes, 168 GPUs)]
A history of monitoring
We initially monitored because:
We knew we should
We wanted to identify failures
Tried multiple options over the years:
collectl
cerebro + lmt
A history of monitoring (cont.)
Converged towards custom gmetric scripts for Ganglia
Monitored IB, Lustre IOPS and throughput
Worked reasonably well
But the quality of our scripts was inconsistent
Some used a surprising amount of resources when running
Issues
Users keep asking the same questions
Why does navigating my files take so long?
Why is my app so slow at writing results?
Which filesystem should I use?
We had most of the data “somewhere” but searching for it was always painful
Issues (cont.)
Data is everywhere
● Performance data from compute nodes is in MongoDB
● Lustre client and server data is in Ganglia (RRD)
● Scheduler data is in text log files
Easy to end up with a spiderweb of complex scripts to integrate all the data
New requirements
1. Give staff a near-realtime view of the filesystems' utilisation and load
2. Provide end-users with data to identify their utilisation of the Lustre filesystems
a. Bonus points for cross-referencing with the scheduler’s stats
New requirements (cont.)
We want to keep “historical” data for future analysis and comparison
With the amount of storage now available, it no longer made sense to keep losing granularity over time with RRD.
Enter Splunk
We have been using Splunk to store and visualize system logs and energy efficiency data.
It’s well suited to indexing a lot of unstructured data
Plan
Collect data from /proc on all nodes and send it to the system’s log with logger
The logs are indexed and visualized with Splunk
We use a separate index for performance data
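The collection step in this plan can be sketched in Python. This is a minimal illustration, not the authors' code: the function names `format_record`, `read_loadavg` and `send_to_syslog`, the syslog tag `cq_perf`, and the choice of /proc/loadavg as the sampled file are all assumptions made for the example. The only external tool used is the standard `logger` command.

```python
import subprocess


def format_record(metrics):
    """Render a dict of metrics as a single 'key=value key=value' line."""
    return " ".join(f"{key}={value}" for key, value in sorted(metrics.items()))


def read_loadavg(path="/proc/loadavg"):
    """Read the 1, 5 and 15 minute load averages from /proc (Linux only)."""
    with open(path) as handle:
        one, five, fifteen = handle.read().split()[:3]
    return {"load1": float(one), "load5": float(five), "load15": float(fifteen)}


def send_to_syslog(line, tag="cq_perf"):
    """Hand the line to the system log with the standard `logger` command."""
    # The tag is a hypothetical name; a distinct tag makes the records easy
    # to route to a separate file or index later on.
    subprocess.run(["logger", "-t", tag, line], check=True)


if __name__ == "__main__":
    send_to_syslog(format_record(read_loadavg()))
```

Flat key=value lines are convenient here because a log indexer can typically extract the fields at search time without a predefined schema.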
Data collection
Existing options to monitor Lustre tend to be tied to a specific backend (RRD or DB)
More flexibility is needed
We decided to write another tool
We could have modified an existing application instead
Overview
[Diagram. On Lustre clients and servers: cq_mon (run, log, sleep) with its plugins, python-lustre and python-psutil, logging through syslog to rsyslogd. On the Splunk server: rsyslogd writes the syslog stream to file under /var/log/..., Splunk indexes it, and the search engine is accessed over HTTPS.]
cq_mon
Written in Python
Configuration specifies which plugins to run
Runs each plugin at a set interval and logs the results as ‘key=value’ pairs to syslog
Add a syslog filter so as not to ‘pollute’ system logs with performance data
[Diagram: cq_mon (run, log, sleep) runs its plugins and writes to syslog through rsyslogd]
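The run, log, sleep cycle described on this slide can be sketched as follows. This is a hedged illustration rather than the actual cq_mon source: the names `load_plugin`, `format_result` and `run_due_plugins`, the in-code `plugins` dictionary standing in for the configuration file, and the use of the `LOG_LOCAL0` facility are all assumptions.

```python
import syslog
import time


def load_plugin():
    """Hypothetical plugin: returns the 1-minute load average."""
    with open("/proc/loadavg") as handle:
        return {"load1": float(handle.read().split()[0])}


def format_result(name, values):
    """Build the 'plugin=name key=value ...' line that gets logged."""
    pairs = " ".join(f"{key}={value}" for key, value in sorted(values.items()))
    return f"plugin={name} {pairs}"


def run_due_plugins(plugins, last_run, now, log):
    """Run every plugin whose interval has elapsed and log its result."""
    for name, (func, interval) in plugins.items():
        if now - last_run.get(name, float("-inf")) >= interval:
            log(format_result(name, func()))
            last_run[name] = now


def main():
    # Stand-in for the configuration file: plugin name -> (callable, seconds).
    plugins = {"load": (load_plugin, 10)}
    last_run = {}
    # Assumed facility: a dedicated one lets a syslog rule filter these lines.
    syslog.openlog("cq_mon_sketch", facility=syslog.LOG_LOCAL0)
    while True:
        run_due_plugins(plugins, last_run, time.monotonic(),
                        lambda line: syslog.syslog(syslog.LOG_INFO, line))
        time.sleep(1)


if __name__ == "__main__":
    main()
```

Passing the logging function in as an argument keeps the scheduling logic testable without touching the real system log.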
python-lustre_util
Library to interact with Lustre through /proc
Loads stats for different Lustre components and makes them available through an appropriate object (e.g. lustre_mdt, lustre_ost, etc.)
[Diagram: cq_mon (run, log, sleep) and its plugins, built on python-lustre]
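To illustrate what loading stats from /proc involves, here is a small parser sketch. It is not the python-lustre_util API: the function name `parse_stats` is hypothetical, the sample text is made up, and its layout (a `snapshot_time` line followed by `name count samples [unit] min max sum` lines) is assumed from the format commonly seen in Lustre 2.x stats files, so it should be checked against the versions in use.

```python
# Made-up sample in the assumed layout of a Lustre stats file.
SAMPLE = """\
snapshot_time             1427214530.123456 secs.usecs
read_bytes                1234 samples [bytes] 4096 1048576 987654321
write_bytes               5678 samples [bytes] 1 1048576 123456789
get_info                  10 samples [reqs]
"""


def parse_stats(text):
    """Parse stats text into a dict keyed by counter name."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name = fields[0]
        if name == "snapshot_time":
            stats[name] = float(fields[1])
            continue
        entry = {"samples": int(fields[1])}
        if len(fields) >= 7:
            # Counters with a unit such as [bytes] also carry min, max and sum.
            entry["min"], entry["max"], entry["sum"] = map(int, fields[4:7])
        stats[name] = entry
    return stats


if __name__ == "__main__":
    for name, value in parse_stats(SAMPLE).items():
        print(name, value)
```

A library along the lines described on the slide would wrap such a parser in one object per component, so that plugins never deal with file paths or text formats directly.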
python-lustre_util (cont.)
Work in progress
Tested with Lustre 2.1.6, 2.4.x, 2.5.x+ZFS and 2.1.0-xyratex
Available through GitHub: https://github.com/calculquebec
Offers stats for:
MDS, OSS, MDT, OST
Client stats are collected on the client, not on the OSS/MDS
python-lustre_util (cont.)
[Diagram: cq_mon reads cq_mon.cfg, which defines the tests to run and their intervals and is managed by Puppet, then loads its plugins. The cq_mon_lustre_oss plugin instantiates lustre_oss and lustre_oss_stats from lustre-utils; lustre_oss_stats reads /proc and compares with the previous instance; results are written to syslog.]
Other plugins use python-psutil to provide OS-level stats
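The "compare with previous instance" step matters because counters read from /proc are cumulative: a useful metric only appears once two successive readings are subtracted. A minimal sketch of that idea follows; the function name `counter_rates` and the snapshot layout (a dict with `time` and `counters` keys) are assumptions for illustration, not the library's actual data model.

```python
def counter_rates(previous, current):
    """Per-second rates between two snapshots of cumulative counters."""
    elapsed = current["time"] - previous["time"]
    if elapsed <= 0:
        return {}
    rates = {}
    for name, value in current["counters"].items():
        before = previous["counters"].get(name)
        if before is None or value < before:
            # New counter, or a counter that went backwards (for example
            # after a restart): skip it rather than report a bogus rate.
            continue
        rates[name] = (value - before) / elapsed
    return rates


if __name__ == "__main__":
    earlier = {"time": 100.0, "counters": {"read_bytes": 1000}}
    later = {"time": 110.0, "counters": {"read_bytes": 6000}}
    print(counter_rates(earlier, later))
```

Skipping counters that decrease is one simple way to tolerate resets; whether that is the right policy for a given site is a design choice.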
Splunk dashboards
Put together custom dashboards for our needs
Lustre filesystems overview
1 detailed dashboard for each Lustre filesystem
Job report
Some reports are slower than we would like
Lustre overview
[Dashboard screenshots: Lustre overview, including client IOPS]
Per FS view
[Dashboard screenshots: utilization and top users]
Job report
[Dashboard screenshot]
Limitations
The quantity of data collected is ‘scary’
Especially true for clients (~1 MB/client per day)
Impact of running extra code on compute nodes
Still uncertain how to monitor or visualize some components
e.g. LNET routers, failovers, DNE setups, drives
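A back-of-the-envelope estimate puts the client figure in perspective. The sketch below assumes exactly 1000 clients (the deck says 1000+) at 1 MB per client per day, counts raw log volume only, and ignores servers and any indexing overhead, so it is a rough lower bound rather than a measured value. The function name `daily_volume_mb` is hypothetical.

```python
def daily_volume_mb(clients, mb_per_client_per_day=1.0):
    """Raw log volume generated per day, in MB."""
    return clients * mb_per_client_per_day


clients = 1000  # assumed; the site reports "1000+" Lustre clients
per_day_mb = daily_volume_mb(clients)
per_year_gb = per_day_mb * 365 / 1000  # decimal units: 1 GB = 1000 MB

if __name__ == "__main__":
    print(f"{per_day_mb:.0f} MB/day, about {per_year_gb:.0f} GB/year")
```

Under these assumptions the clients alone produce roughly 1 GB of performance records per day.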
Looking ahead
Add support for multiple MDS/MDT
Collect clients’ data on the servers
Cross-reference statistics with data from the RobinHood DB
Make the code more resistant to failovers
Thank you
frederick.lefebvre@calculquebec.ca

Contenu connexe

Tendances

Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Spark Summit
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMartin Zapletal
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark InternalsKnoldus Inc.
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Spark Summit
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - InstallationMartin Zapletal
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
 
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...DataWorks Summit
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and SparkEvan Chan
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache CalciteDataWorks Summit
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
Reactive programming on Android
Reactive programming on AndroidReactive programming on Android
Reactive programming on AndroidTomáš Kypta
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Martin Zapletal
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 

Tendances (20)

Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark Internals
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - Installation
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
Modus operandi of Spark Streaming - Recipes for Running your Streaming Applic...
 
OLAP with Cassandra and Spark
OLAP with Cassandra and SparkOLAP with Cassandra and Spark
OLAP with Cassandra and Spark
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
Internals
InternalsInternals
Internals
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
 
Reactive programming on Android
Reactive programming on AndroidReactive programming on Android
Reactive programming on Android
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 

En vedette

Cadete a
Cadete aCadete a
Cadete afbcat
 
Recommandations de la commission d'evaluation electorale d'haiti
Recommandations de la commission d'evaluation electorale d'haitiRecommandations de la commission d'evaluation electorale d'haiti
Recommandations de la commission d'evaluation electorale d'haitiStanleylucas
 
Sample Graphics
Sample GraphicsSample Graphics
Sample GraphicsMary Abe
 
Institución educativa el diamante
Institución educativa el diamanteInstitución educativa el diamante
Institución educativa el diamanteDavid Borja Cda
 
Trabajo de excel y word yuri mocada
Trabajo de excel y word yuri mocadaTrabajo de excel y word yuri mocada
Trabajo de excel y word yuri mocadaJenny Gomez
 
GardenGate Projects
GardenGate ProjectsGardenGate Projects
GardenGate ProjectsMary Abe
 
SGDA conference Trondheim Norway
SGDA conference Trondheim NorwaySGDA conference Trondheim Norway
SGDA conference Trondheim NorwayDavid Wortley
 
Understanding Student Housing Legislation | Pinnacle MC Global
Understanding Student Housing Legislation | Pinnacle MC Global Understanding Student Housing Legislation | Pinnacle MC Global
Understanding Student Housing Legislation | Pinnacle MC Global Pinnacle MC Global
 
Sample California motion to amend judgment to add judgment debtor on alter eg...
Sample California motion to amend judgment to add judgment debtor on alter eg...Sample California motion to amend judgment to add judgment debtor on alter eg...
Sample California motion to amend judgment to add judgment debtor on alter eg...LegalDocsPro
 
Sample California motion for leave to amend pleading
Sample California motion for leave to amend pleadingSample California motion for leave to amend pleading
Sample California motion for leave to amend pleadingLegalDocsPro
 
Marketing Experiencial y de contenidos: el nuevo espacio de encuentro de las ...
Marketing Experiencial y de contenidos: el nuevo espacio de encuentro de las ...Marketing Experiencial y de contenidos: el nuevo espacio de encuentro de las ...
Marketing Experiencial y de contenidos: el nuevo espacio de encuentro de las ...José Cantero Gómez
 
Libro blanco del marketing creativo: la creatividad como elemento de innovaci...
Libro blanco del marketing creativo: la creatividad como elemento de innovaci...Libro blanco del marketing creativo: la creatividad como elemento de innovaci...
Libro blanco del marketing creativo: la creatividad como elemento de innovaci...José Cantero Gómez
 

En vedette (18)

Cadete a
Cadete aCadete a
Cadete a
 
Презентация ООО "ЦОЭББ"
Презентация ООО "ЦОЭББ"Презентация ООО "ЦОЭББ"
Презентация ООО "ЦОЭББ"
 
Színházjárók
SzínházjárókSzínházjárók
Színházjárók
 
Recommandations de la commission d'evaluation electorale d'haiti
Recommandations de la commission d'evaluation electorale d'haitiRecommandations de la commission d'evaluation electorale d'haiti
Recommandations de la commission d'evaluation electorale d'haiti
 
Sample Graphics
Sample GraphicsSample Graphics
Sample Graphics
 
Institución educativa el diamante
Institución educativa el diamanteInstitución educativa el diamante
Institución educativa el diamante
 
Modulo9
Modulo9Modulo9
Modulo9
 
Trabajo de excel y word yuri mocada
Trabajo de excel y word yuri mocadaTrabajo de excel y word yuri mocada
Trabajo de excel y word yuri mocada
 
GardenGate Projects
GardenGate ProjectsGardenGate Projects
GardenGate Projects
 
SGDA conference Trondheim Norway
SGDA conference Trondheim NorwaySGDA conference Trondheim Norway
SGDA conference Trondheim Norway
 
Understanding Student Housing Legislation | Pinnacle MC Global
Understanding Student Housing Legislation | Pinnacle MC Global Understanding Student Housing Legislation | Pinnacle MC Global
Understanding Student Housing Legislation | Pinnacle MC Global
 
Pro_Tools_Tier_3
Pro_Tools_Tier_3Pro_Tools_Tier_3
Pro_Tools_Tier_3
 
Sample California motion to amend judgment to add judgment debtor on alter eg...
Sample California motion to amend judgment to add judgment debtor on alter eg...Sample California motion to amend judgment to add judgment debtor on alter eg...
Sample California motion to amend judgment to add judgment debtor on alter eg...
 
20151118 apresentação belfast nov
20151118 apresentação belfast nov20151118 apresentação belfast nov
20151118 apresentação belfast nov
 
Sample California motion for leave to amend pleading
Sample California motion for leave to amend pleadingSample California motion for leave to amend pleading
Sample California motion for leave to amend pleading
 
Marketing Experiencial y de contenidos: el nuevo espacio de encuentro de las ...
Marketing Experiencial y de contenidos: el nuevo espacio de encuentro de las ...Marketing Experiencial y de contenidos: el nuevo espacio de encuentro de las ...
Marketing Experiencial y de contenidos: el nuevo espacio de encuentro de las ...
 
Libro blanco del marketing creativo: la creatividad como elemento de innovaci...
Libro blanco del marketing creativo: la creatividad como elemento de innovaci...Libro blanco del marketing creativo: la creatividad como elemento de innovaci...
Libro blanco del marketing creativo: la creatividad como elemento de innovaci...
 
HPC in the Cloud
HPC in the CloudHPC in the Cloud
HPC in the Cloud
 

Similaire à Monitoring an Heterogenous Lustre Environment with Splunk

[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration StoryJoan Viladrosa Riera
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration storyJoan Viladrosa Riera
 
Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraApache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraSpark Summit
 
Event Driven Microservices
Event Driven MicroservicesEvent Driven Microservices
Event Driven MicroservicesFabrizio Fortino
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
Dissecting Open Source Cloud Evolution: An OpenStack Case Study
Dissecting Open Source Cloud Evolution: An OpenStack Case StudyDissecting Open Source Cloud Evolution: An OpenStack Case Study
Dissecting Open Source Cloud Evolution: An OpenStack Case StudySalman Baset
 
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around KafkaKafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around KafkaGuido Schmutz
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
SplunkLive! Tampa: Splunk Ninjas: New Features, Pivot, and Search Dojo
SplunkLive! Tampa: Splunk Ninjas: New Features, Pivot, and Search Dojo SplunkLive! Tampa: Splunk Ninjas: New Features, Pivot, and Search Dojo
SplunkLive! Tampa: Splunk Ninjas: New Features, Pivot, and Search Dojo Splunk
 
Splunk Ninjas: New Features and Search Dojo
Splunk Ninjas: New Features and Search DojoSplunk Ninjas: New Features and Search Dojo
Splunk Ninjas: New Features and Search DojoSplunk
 
Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901WeCloudData
 
KoprowskiT_SQLRelay2014#5_Newcastle_FromPlanToBackupToCloud
KoprowskiT_SQLRelay2014#5_Newcastle_FromPlanToBackupToCloudKoprowskiT_SQLRelay2014#5_Newcastle_FromPlanToBackupToCloud
KoprowskiT_SQLRelay2014#5_Newcastle_FromPlanToBackupToCloudTobias Koprowski
 
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Data Con LA
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkMatt Ingenthron
 
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Big Data Spain
 
Splunk Ninjas: New Features, Pivot, and Search Dojo
Splunk Ninjas: New Features, Pivot, and Search DojoSplunk Ninjas: New Features, Pivot, and Search Dojo
Splunk Ninjas: New Features, Pivot, and Search DojoSplunk
 
Unbreakable Sharepoint 2016 With SQL Server 2016 availability groups
Unbreakable Sharepoint 2016 With SQL Server 2016 availability groupsUnbreakable Sharepoint 2016 With SQL Server 2016 availability groups
Unbreakable Sharepoint 2016 With SQL Server 2016 availability groupsIsabelle Van Campenhoudt
 
OpenStack & OpenContrail in Production
OpenStack & OpenContrail in ProductionOpenStack & OpenContrail in Production
OpenStack & OpenContrail in ProductionEdgar Magana
 
OpenStack Ottawa Meetup - October 2018
OpenStack Ottawa Meetup - October 2018OpenStack Ottawa Meetup - October 2018
OpenStack Ottawa Meetup - October 2018Stacy Véronneau
 

Similaire à Monitoring an Heterogenous Lustre Environment with Splunk (20)

[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story[Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story
[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
 
Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraApache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
Apache Spark Streaming + Kafka 0.10 with Joan Viladrosariera
 
Event Driven Microservices
Event Driven MicroservicesEvent Driven Microservices
Event Driven Microservices
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Dissecting Open Source Cloud Evolution: An OpenStack Case Study
Dissecting Open Source Cloud Evolution: An OpenStack Case StudyDissecting Open Source Cloud Evolution: An OpenStack Case Study
Dissecting Open Source Cloud Evolution: An OpenStack Case Study
 
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around KafkaKafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
SplunkLive! Tampa: Splunk Ninjas: New Features, Pivot, and Search Dojo
SplunkLive! Tampa: Splunk Ninjas: New Features, Pivot, and Search Dojo SplunkLive! Tampa: Splunk Ninjas: New Features, Pivot, and Search Dojo
SplunkLive! Tampa: Splunk Ninjas: New Features, Pivot, and Search Dojo
 
Splunk Ninjas: New Features and Search Dojo
Splunk Ninjas: New Features and Search DojoSplunk Ninjas: New Features and Search Dojo
Splunk Ninjas: New Features and Search Dojo
 
Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
KoprowskiT_SQLRelay2014#5_Newcastle_FromPlanToBackupToCloud
KoprowskiT_SQLRelay2014#5_Newcastle_FromPlanToBackupToCloudKoprowskiT_SQLRelay2014#5_Newcastle_FromPlanToBackupToCloud
KoprowskiT_SQLRelay2014#5_Newcastle_FromPlanToBackupToCloud
 
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...
 
Splunk Ninjas: New Features, Pivot, and Search Dojo
Splunk Ninjas: New Features, Pivot, and Search DojoSplunk Ninjas: New Features, Pivot, and Search Dojo
Splunk Ninjas: New Features, Pivot, and Search Dojo
 
Unbreakable Sharepoint 2016 With SQL Server 2016 availability groups
Unbreakable Sharepoint 2016 With SQL Server 2016 availability groupsUnbreakable Sharepoint 2016 With SQL Server 2016 availability groups
Unbreakable Sharepoint 2016 With SQL Server 2016 availability groups
 
OpenStack & OpenContrail in Production
OpenStack & OpenContrail in ProductionOpenStack & OpenContrail in Production
OpenStack & OpenContrail in Production
 
OpenStack Ottawa Meetup - October 2018
OpenStack Ottawa Meetup - October 2018OpenStack Ottawa Meetup - October 2018
OpenStack Ottawa Meetup - October 2018
 

Monitoring an Heterogenous Lustre Environment with Splunk

  • 1. Calcul Québec - Université LavalCalcul Québec - Université Laval Monitoring an heterogenous Lustre environment LUG 2015 frederick.lefebvre@calculquebec.ca
  • 2. Calcul Québec - Université LavalCalcul Québec - Université LavalCalcul Québec - Université Laval Plan Site description A bit of history Real time utilization monitoring Moving forward
  • 3. Calcul Québec - Université LavalCalcul Québec - Université LavalCalcul Québec - Université Laval Who we are Advanced computing center @Université Laval Part of Calcul Québec & Compute Canada Operate 2 Clusters & 4 Lustre fs
  • 4. Calcul Québec - Université LavalCalcul Québec - Université LavalCalcul Québec - Université Laval Our infrastructure 1000+ lustre clients (2.5.3 & 2.1.6) On 2 IB fabrics - 6 LNET routers 4 Lustre file systems (~2.5 PB) 500 TB - Lustre 2.5.3 + ZFS - 16 OSSs 500 TB - Lustre 2.1.6 - 6 OSSs 1.4 PB - Lustre 2.1.0 (Seagate) - 12 OSS 120 TB - Lustre 2.1.0 (Seagate) - 2 OSS
  • 5. Calcul Québec - Université LavalCalcul Québec - Université LavalCalcul Québec - Université Laval IBSwitchIBSwitch IB Switch IB Switch IBSwitchIBSwitch IBSwitch IBSwitch Lustre1-2.5.3+ZFS Lustre2-2.1.6 Lustre3-2.1.0Seagate Lustre4-2.1.0Seagate Topology (2 IB fabrics) IBSwitch 6xLNETrouters Colosse:960nodes Helios:21nodes-168GPUs
  • 6. Calcul Québec - Université LavalCalcul Québec - Université LavalCalcul Québec - Université Laval An history of monitoring We initially monitored because We knew we should We wanted to identify failures Tried multiple options over the years collectl cerebro +lmt
  • 7. Calcul Québec - Université LavalCalcul Québec - Université LavalCalcul Québec - Université Laval An history of monitoring (cont.) Converged towards custom gmetrics scripts for Ganglia Monitored IB Lustre IOPS and throughput
  • 8. Calcul Québec - Université LavalCalcul Québec - Université LavalCalcul Québec - Université Laval An history of monitoring (cont.) Converged towards custom gmetrics scripts for Ganglia Monitored IB Lustre IOPS and throughput Worked reasonably well But quality of our scripts was inconsistent Some used a surprising amount of resources when running
  • 9. Issues: users keep asking the same questions. Why does navigating my files take so long? Why is my app so slow at writing results? Which filesystem should I use? We had most of the data “somewhere”, but searching for it was always painful.
  • 10. Issues (cont.): data is everywhere. Performance data from compute nodes is in MongoDB; Lustre client and server data is in Ganglia (RRD); scheduler data is in text log files. It is easy to end up with a spiderweb of complex scripts to integrate all the data.
  • 11. New requirements: (1) give staff a near-real-time view of filesystem utilisation and load; (2) provide end users with data to identify their utilisation of the Lustre filesystems, with bonus points for cross-referencing with the scheduler’s stats.
  • 12. New requirements (cont.): we want to keep “historical” data for future analysis and comparison. With the amount of storage now available, it no longer made sense to keep losing granularity over time with RRD.
  • 13. Enter Splunk: we have been using Splunk to store and visualize system logs and energy-efficiency data. It is well suited to indexing a lot of unstructured data.
  • 14. Plan: collect data from /proc on all nodes and send it to the system log with logger. The logs are indexed and visualized with Splunk. We use a separate index for performance data.
  • 15. Data collection: existing options to monitor Lustre tend to be tied to a specific backend (RRD or DB), and more flexibility is needed. We decided to write another one, although we could have modified an existing application.
  • 16. Overview (diagram). On Lustre clients and servers: cq_mon (run, log, sleep) with its plugins, python-lustre and python-psutil, sends syslog messages to rsyslogd. On the Splunk server: rsyslogd receives the syslog messages and writes them to files under /var/log/...; Splunk indexes them, and its search engine is reached over HTTPS.
  • 17. cq_mon: written in Python. The configuration specifies which plugins to run. It runs each plugin at an interval and logs the results as ‘key=value’ pairs to syslog.
  • 18. cq_mon: written in Python. The configuration specifies which plugins to run. It runs each plugin at an interval and logs the results as ‘key=value’ pairs to syslog. A syslog filter (rsyslogd) is added so as not to ‘pollute’ the system logs with performance data.
  • 19. python-lustre_util: a library to interact with Lustre through /proc. It loads stats for the different Lustre components and makes them available through an appropriate object (e.g. lustre_mdt, lustre_ost, etc.).
  • 20. python-lustre_util (cont.): work in progress. Tested with Lustre 2.1.6, 2.4.x, 2.5.x + ZFS and 2.1.0-xyratex. Available on GitHub (https://github.com/calculquebec). Offers stats for MDS, OSS, MDT and OST. Client stats are collected on the client and not on the OSS/MDS.
  • 21. python-lustre_util (cont.) (diagram): cq_mon reads cq_mon.cfg, which defines the tests to run and their intervals and is managed by Puppet. cq_mon loads plugins such as cq_mon_lustre_oss, which instantiates lustre_oss from lustre-utils; lustre_oss_stats reads /proc and compares the result with the previous lustre_oss_stats instance. Other plugins use python-psutil to provide OS-level stats. Results are written to syslog.
  • 22. Splunk dashboards: we put together custom dashboards for our needs: a Lustre filesystems overview, 1 detailed dashboard for each Lustre filesystem, and a job report. Some reports are slower than we would like.
  • 23. Lustre overview
  • 24. Lustre overview (cont.)
  • 25. Lustre overview (cont.)
  • 26. Lustre overview (client IOPS)
  • 27. Per-FS view (utilization)
  • 28. Per-FS view (top users)
  • 29. Job report
  • 30. Limitations: the quantity of data collected is ‘scary’, especially for clients (~1 MB/client per day). Running extra code on compute nodes has an impact. There is still uncertainty about how to monitor or visualize some components, e.g. LNET routers, failovers, DNE setups and drives.
  • 31. Looking ahead: add support for multiple MDS/MDT; collect clients’ data on the servers; cross-reference statistics with data from the RobinHood DB; make the code more resilient to failovers.
  • 32. Thank you. frederick.lefebvre@calculquebec.ca
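
The collection loop described in slides 14 to 21 (read counters from /proc, compare with the previous reading, log ‘key=value’ pairs to syslog, sleep) can be sketched roughly as follows. This is a minimal illustration and not the actual cq_mon or python-lustre_util code: the function names, the `source` field and the `_samples`/`_sum` key suffixes are hypothetical, the stats layout (`name count samples [unit] min max sum`) is assumed, and the path of the stats file is left to the caller. It uses Python's standard `syslog` module, whereas the talk mentions the `logger` command.

```python
import syslog
import time


def parse_stats(text):
    """Parse Lustre-style stats text into {counter: (samples, total)}.

    Assumed line layout: name count 'samples' [unit] [min max sum].
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "samples":
            continue  # skips snapshot_time and any unrecognised line
        samples = int(fields[1])
        # The cumulative sum is the 7th field when min/max/sum are present.
        total = int(fields[6]) if len(fields) >= 7 else 0
        stats[fields[0]] = (samples, total)
    return stats


def diff_stats(previous, current):
    """Return the increase of each counter since the previous snapshot."""
    delta = {}
    for name, (samples, total) in current.items():
        old_samples, old_total = previous.get(name, (0, 0))
        if samples < old_samples:
            # Counter went backwards (for example after a restart or
            # failover): report the current values as the increase.
            old_samples, old_total = 0, 0
        delta[name] = (samples - old_samples, total - old_total)
    return delta


def format_kv(source, delta):
    """Render a delta as a single 'key=value' line."""
    parts = ["source=%s" % source]
    for name in sorted(delta):
        samples, total = delta[name]
        parts.append("%s_samples=%d" % (name, samples))
        parts.append("%s_sum=%d" % (name, total))
    return " ".join(parts)


def run(path, source, interval=60, iterations=None):
    """Run... Log... Sleep...: poll `path` and log deltas to syslog."""
    previous = {}
    count = 0
    while iterations is None or count < iterations:
        with open(path) as handle:
            current = parse_stats(handle.read())
        if previous:  # the first pass only records a baseline
            line = format_kv(source, diff_stats(previous, current))
            syslog.syslog(syslog.LOG_INFO, line)
        previous = current
        count += 1
        time.sleep(interval)
```

Logging one flat ‘key=value’ line per interval keeps the collector independent of any storage backend, and resetting the baseline when a counter goes backwards is one possible way to tolerate the failovers mentioned in slide 31.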