Francis Marion University
Summer - Fall 2013 Research
Students Administrating an HPC Cluster
Students:
Will Dixon
Chad Garland
Supervisors:
Dr. Larry Engelhardt
Dr. Ginger Bryngelson
Dr. Galen Collier
December 6, 2013
Contents
1 Overview
2 Software
3 X11 Display
4 Hardware and Networking
5 The Operating System
6 Controlling Temperature
7 FMU PandA
8 Why Use an HPC Cluster
9 Administration
9.1 Turning Off the Patriot Cluster
9.2 Turning On the Patriot Cluster
9.3 Restarting SLURM
A Powering the Cluster
A.1 powercluster File Located on the Storage
A.2 startup File Located on the Storage
A.3 startup File Located on the Master
A.4 startup File Located on all Other Computers
1 Overview
NSF EPSCoR "CI" awarded Francis Marion University a $100,000 grant to build an HPC cluster. High performance computing is becoming a necessity in many scientific fields. Students administer the cluster, which requires knowledge of the software and hardware described below. A cluster is a group of computers that communicate and work together as a single system; each computer in the cluster is called a node.
2 Software
Acronyms for Networking Software:
• DNS (Domain Name System): maps a name to the IP address of each computer in the cluster and keeps track of which name goes with which address.
• DHCP (Dynamic Host Configuration Protocol): automatically assigns each computer a unique IP address.
• DDNS (Dynamic DNS): a DHCP server that updates the DNS server, keeping both up to date.
The user never interacts directly with any of this networking software. It is a necessity because it makes adding new computers to the cluster straightforward when expansion is needed.
Software used by the user:
• SLURM (Simple Linux Utility for Resource Management): schedules when and where a user's job will run, based on the resources available in the cluster.
• MPI (Message Passing Interface): a standard that allows multiple computer cores to communicate with one another while doing computations.
• OpenACC: a standard for offloading computations to the GPU's cores.
• OpenMP: a standard for shared memory communication (between cores on a single node).
Users work with the items above, but they are not programming languages themselves; they are standards, and software packages implement those standards. The languages that use these standards on the cluster are C, Fortran, Java, and Python.
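For illustration, the sketch below shows what a typical SLURM batch script for an MPI program might look like. The file names, job name, and resource numbers are made-up examples rather than settings taken from the Patriot Cluster.

#!/bin/bash
#SBATCH --job-name=integrate         # name that shows up in the queue
#SBATCH --nodes=2                    # request two compute nodes
#SBATCH --ntasks-per-node=12         # one MPI process per CPU core
#SBATCH --time=01:00:00              # wall-clock time limit

# Compile the (hypothetical) C program against MPI, then launch it under SLURM.
mpicc -O2 integrate.c -o integrate
srun ./integrate

The user hands this script to the scheduler with sbatch (for example, sbatch integrate.sh), and SLURM decides when and where it runs based on the available resources.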
3 X11 Display
Since most clusters do not have displays, they use what is called X11 forwarding. X11 is the display system used by Linux operating systems. With X11 forwarding, the user can ask the cluster, from the command line, to display something; the cluster sends the display data back over the network, and X11 draws it graphically on the user's local screen. This lets the user generate graphics on the cluster while viewing them locally.
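For example, a session with X11 forwarding might look like the following; the username, hostname, and program are placeholders rather than the cluster's real names.

ssh -X username@cluster.example.edu    # -X turns on X11 forwarding for this session
gnuplot                                # graphics drawn by programs on the cluster appear on the local screen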
4 Hardware and Networking
A cluster requires its nodes to communicate with one another, which is done through the software described in the Software section. Below is a diagram of how Francis Marion's Patriot Cluster is networked together.
The individual nodes in the cluster have different characteristics (memory, number of cores, etc.), which gives users flexibility in choosing where each kind of program should run.
Figure 1: Diagram of the Patriot Cluster
Figure 1 shows how the Patriot Cluster is networked together. The list below describes what makes each component unique; a short sketch after the list shows how a user can target a particular node type through SLURM.
• Internet - the public network, which everyone can reach from anywhere.
• Master node - the gateway to the internet for the nodes in the cluster and the machine users log in to from the internet (it uses iptables, the Linux firewall, for security).
• Storage node - hosts most of the servers (DHCP, DNS, DDNS) and stores all user files and many programs used by all the computers.
• Switch - a device that connects the nodes together over CAT 6 cabling (gigabit-per-second links).
• Compute nodes - each has 32 gigabytes of RAM and 12 Intel CPU cores; used by SLURM to run user submissions.
• High memory compute nodes - each also has 12 Intel CPU cores but 64 gigabytes of RAM; used by SLURM when the other compute nodes are busy or when a computation requires more memory.
• GPU node - has a GPU card available to users, with 16 Intel CPU cores, 2496 GPU cores, and 32 gigabytes of RAM; used by SLURM when a user requests the GPU node or when all other nodes are busy.
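Because the node types differ, a user can steer a job toward the right hardware by telling SLURM what the job needs. The lines below use standard sbatch options, but the specific values, and whether the GPU is exposed to SLURM as a generic resource named gpu, are assumptions rather than the Patriot Cluster's actual configuration.

#SBATCH --mem=60G        # ask for a node with at least 60 gigabytes of RAM (a high memory node)
#SBATCH --gres=gpu:1     # ask for one GPU (assumes the GPU node advertises a "gpu" resource)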
5 The Operating System
Different clusters run different operating systems, but most clusters built today use a distribution of Linux. The various distributions ship with different software and user interfaces. Francis Marion's Patriot Cluster uses Ubuntu Linux 12.04 Server LTS (long term support). Ubuntu has a large support community, which helps the administrators when they cannot find a solution to a problem on their own. Ubuntu does have a desktop version, but users who log in to the cluster do not control it through a GUI (graphical user interface); instead, they control it through a command line interface.
6 Controlling Temperature
With high performance computing come high temperatures: all of the power going into the computers comes back out as heat. If the cluster gets too hot it can damage the cores and other components, so we try to keep the cluster as cool as possible and monitor its temperature both externally and internally.
To monitor externally, we use temperature probes that measure the intake and exhaust temperatures of the cluster. When the cluster gets too hot, the monitoring system emails the administrators to say that the cluster is overheating and something needs to be done.
To monitor internally (the cores themselves), we use the cron daemon, which executes tasks on a schedule. We wrote a script that gathers the temperature of each core on each processor and compares it against dangerous thresholds. If any core gets too hot, the script pauses all currently running jobs and emails the administrators that the cluster is too hot.
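The script itself is not reproduced in this report, but a minimal sketch of the idea is shown below. The threshold, the way the sensor output is parsed, the use of scontrol suspend, and the email address are all illustrative assumptions; the actual script on the cluster may differ.

# crontab entry: run the temperature check every five minutes
*/5 * * * * /usr/local/sbin/check_temps

# /usr/local/sbin/check_temps (sketch)
#!/bin/bash
LIMIT=90    # degrees Celsius treated as dangerous

# Highest core temperature reported by lm-sensors
HOTTEST=$(sensors | grep -oP 'Core \d+:\s+\+\K[0-9]+' | sort -n | tail -1)

if [ "$HOTTEST" -ge "$LIMIT" ]; then
    # Pause every running SLURM job, then warn the administrators
    squeue -h -t RUNNING -o %i | xargs -r -n1 scontrol suspend
    echo "Core temperature reached $HOTTEST C" | mail -s "Cluster too hot" admins@example.edu
fi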
7 FMU PandA
FMU PandA stands for Francis Marion University's Physics AND Astronomy. Francis Marion's Patriot Cluster has a webpage at www.fmupanda.com. The site covers everything from getting an account on the cluster to programming with the MPI or OpenMP standards. It also explains how to contact the administrators and hosts a forum that anyone can read and post to with general questions.
8 Why Use an HPC Cluster
High performance computing is used to complete calculations faster than would otherwise be possible. If we write a program to forecast tomorrow's weather, we do not want the results two days from now, but we still want the resolution to be high. A cluster makes this possible: it takes a single job and splits it among many different cores. Below is a graph of speed-up versus number of cores for a numerical integration program.
Figure 2 shows diminishing returns: as the number of cores increases, the speedup increases at a decreasing rate, because the parts of the program that cannot be parallelized eventually dominate. So we cannot simply keep adding cores and expect proportional gains; sooner or later we have to draw the line.
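One standard way to quantify this behavior (not developed further in the report itself) is Amdahl's law: if a fraction p of a program can be parallelized, the speedup on N cores is

S(N) = 1 / ((1 - p) + p / N)

For example, with p = 0.95 and N = 12 the speedup is about 7.7, and even with unlimited cores it can never exceed 1 / (1 - p) = 20. This is why the curve in Figure 2 flattens out as more cores are added.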
9 Administration
9.1 Turning Off the Patriot Cluster
We wrote a script on the storage node that shuts the cluster down properly (see Appendix A.1). The script logs in to each computer, obtains root access, shuts that computer down, and then moves on to the next one. The file is named powercluster and lives in the /bin directory. Once that command is entered (as the admin user) it shuts the cluster down. The last node to shut down (and sometimes it fails to shut down at all) is the storage node, the one you are logged in to. The reason the storage node can fail to shut down is that if you log in to the master and then ssh to the storage node, the master shuts down before the storage node's shutdown command is executed.
Figure 2: Speed Up versus Number of Cores
9.2 Turning On the Patriot Cluster
To turn the cluster back on, power on the master first, then the storage node, and then every other computer in the cluster. Once all of the computers are at the login screen, log in to the storage node as admin and run the command startup (another script we created; see Appendix A.2). It will ask for admin's sudo password once and reuse it each time it is needed. The startup script then runs a companion script on every computer (one version for the master, see Appendix A.3, and one for all the others, see Appendix A.4) that mounts the shared drives and starts SLURM. At that point the cluster is running.
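Although it is not part of the written procedure, one quick way to confirm that the startup worked is to query SLURM from the master:

sinfo     # every node should be listed in an idle or allocated state, not down
squeue    # shows any jobs that are queued or running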
9.3 Restarting SLURM
There are several ways to restart SLURM. The simplest is to run the following command on a single node that is running SLURM:
sudo service slurm-llnl restart
This usually fixes a problem on a single computer. The only way to restart SLURM on all computers is to run the command above on each of them. Another way to reload some settings (mainly to make the computers re-read the configuration file) is to run the following command on the master:
sudo scontrol reconfigure
The scontrol command is very useful when administering users' submitted jobs. A full list of options
for the scontrol command is at https://computing.llnl.gov/linux/slurm/scontrol.html.
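A few scontrol invocations that come up often in day-to-day administration are sketched below; the job ID and node name are placeholders.

scontrol show job 1234                                        # full details for job 1234
scontrol hold 1234                                            # keep a pending job from starting
scontrol release 1234                                         # allow it to start again
scontrol update NodeName=node01 State=DRAIN Reason="repairs"  # stop scheduling new jobs on node01
scontrol update NodeName=node01 State=RESUME                  # put the node back in service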
A Powering the Cluster
A.1 powercluster File Located on the Storage
#!/bin/bash
# Power off every node reachable through dsh, then the master, then this storage node.
dsh -a poweroff
ssh -p98 master poweroff
sleep 10
poweroff
A.2 startup File Located on the Storage
#!/bin/bash
# Prompt once for the sudo password, run the startup script on the master,
# sync this node's clock to the master, then run startup on every other node via dsh.
read -sp "[sudo] Password: " pass
ssh -p 98 master "echo $pass | sudo -S startup"
echo $pass | sudo -S ntpdate master
dsh -a "echo $pass | sudo -S startup"
A.3 startup File Located on the Master
#!/bin/bash
# Run the local router script, mount the shared directories from the storage node,
# and start munge and SLURM.
sh /bin/router
mount storage:/storage /storage
mount storage:/exports/home /home
service munge start
service slurm-llnl start
A.4 startup File Located on all Other Computers
#!/bin/bash
# Run the local router script, mount the shared directories from the storage node,
# and start munge and SLURM.
sh /bin/router
mount storage:/storage /storage
mount storage:/exports/home /home
service munge start
service slurm-llnl start