Scaling, Grid Engine and Running UIMA on the Cluster
Chris Roeder, 11/2010

The Scaling Problem
  “Does the solution scale?” asks whether a given piece of software can handle larger versions of the problem (often more data).
  “Scaling” is a loose collection of techniques to improve or implement a solution’s scalability.
  The choice of techniques depends on the critical resource (CPU, memory or I/O) and on how easily the task is broken into pieces.
  This talk focuses on scaling as it applies to UIMA NLP processing (notwithstanding OpenDMAPv2).
  It is a work in progress.

Scaling NLP
  Processing a file is independent of processing another file: text in, annotations out.
  Multi-threaded
    More than one thread of execution in one process
    Pipelines share memory and can step on each other
    Ex. Stanford crashes because of concurrency issues (“was not an issue in 2001”)
    <casProcessors casPoolSize="4" processingUnitThreadCount="2">
  Multi-process
    Separate JVMs, each with a single thread
    Memory is not shared, no crushed toes
    <casProcessors casPoolSize="3" processingUnitThreadCount="1">
    Overhead of repeating the JVM and pipeline in each process does cost, but it works
  Many machines
    More memory, more cores
    Independence means they won’t miss being on the same machine
    Independent machines (Cluster) are cheaper than integrated (Enki)

Hardware
  Local Cluster (Colfax)
    A rack of machines with software (SGE) to integrate them
  Integrated CPUs (Enki)
    Much like a rack, but motherboards are tied together and can share memory
    Gigabit ethernet delivers on the order of 300Mb/sec
    Motherboard runs up to 4.8GB/sec
  Virtual Cluster
    Virtualization software allows a single machine to appear as many; offers flexibility, security
  Cloud
    A virtual cluster on the net: Amazon EC2

Hardware: CCP’s Colfax Cluster
  Runs Linux (Fedora/Red Hat)
  6 machines (amc-colfax, amc-colfaxnd[1-5])
  2 CPUs (Intel) per machine, 4 cores each, 48 cores total
  Intel motherboard
  16GB memory each, 96GB total
  5TB shared (over NFS) disk array, RAID5
  Named after the assembler: Colfax International

(Sun|Oracle) Grid Engine (SGE)
  Manages a queue of jobs, optimizing resource utilization
  Starts individual processes for a job
  Often used with Message Passing Interface (MPI) for processes that cooperate
  Used here to start “Array Jobs”
    Each job processes a portion of a large array of work to be done

SGE Job
  An SGE job is a script and a command line
  Command line specifies resources for scheduling
    Memory
    others
  Script is run once for each process started
  Is not pure shell, but more or less a shell script (next slide)
  Job is assigned an ID number

More or less a shell script?
  Put these lines at the top for SGE:
    #$ -N stanford_out
      Job name; standard out goes to a file with this prefix
    #$ -S /bin/bash
      The shell to use (no “she-bang”: #!/bin/sh)
    #$ -cwd
      Runs from the current directory
    #$ -j y
      Merge stdout and stderr into one file

Submit a Job: qsub
  qsub -t 1-200000:20000 sge_stanford_out.sh
  -t Index Range
    Do array items from 1 to 200 thousand, by 20k: 10 processes
    Do this with the sge_stanford_out.sh script
  How does the script know what files to process?
    $SGE_TASK_ID (first file number to run)
    $SGE_TASK_STEPSIZE (number of files per task)
    For example, the first task gets SGE_TASK_ID=1 and SGE_TASK_STEPSIZE=20000; the second gets SGE_TASK_ID=20001

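A minimal sketch of how a job script might turn the two variables into a document range. The function name task_range and the TOTAL value are assumptions made for illustration; the defaults only let the script be tried outside the scheduler.

```shell
# Sketch: derive the document range for one array task.
# SGE sets SGE_TASK_ID and SGE_TASK_STEPSIZE for array tasks.

# task_range FIRST STEP TOTAL -> prints "first last" for this task
task_range() {
    tr_last=$(( $1 + $2 - 1 ))
    # The final task may be shorter than a full step.
    if [ "$tr_last" -gt "$3" ]; then
        tr_last=$3
    fi
    echo "$1 $tr_last"
}

TASK_ID=${SGE_TASK_ID:-1}          # default is for trying it by hand
STEP=${SGE_TASK_STEPSIZE:-20000}   # default is for trying it by hand
TOTAL=200000                       # assumed collection size, as in the qsub line

RANGE=$(task_range "$TASK_ID" "$STEP" "$TOTAL")
echo "task $TASK_ID processes documents ${RANGE% *} through ${RANGE#* }"
```

Clamping the last value matters when the step does not divide the range evenly, as in the 1-50:3 submission, where the task with ID 49 covers only documents 49 and 50.
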
sge_stanford_out.sh
  Will evolve into a generic UIMA job submission script
  Script modifies a template CPE file, creating a CPE for each process
  CPE specifies the starting document number and the number to process
  http://wikis.sun.com/display/gridengine62u2/How+to+Submit+an+Array+Job+From+the+Command+Line
  [roederc@amc-colfax sge_scripts]$ qsub -t 1-50:3 sge_stanford_out.sh
  Your job-array 130.1-50:3 ("stanford_out") has been submitted

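One way the template step could look, as a sketch only. The placeholders @START@ and @COUNT@, the template file name, and the function make_cpe are hypothetical; the real sge_stanford_out.sh is not shown in these slides and may work differently.

```shell
# Sketch: create a per-task CPE descriptor from a template.
# make_cpe TEMPLATE START COUNT OUTPUT
# Copies TEMPLATE to OUTPUT, replacing the hypothetical placeholders
# @START@ and @COUNT@ with this task's values.
make_cpe() {
    sed -e "s/@START@/$2/g" -e "s/@COUNT@/$3/g" "$1" > "$4"
}

TEMPLATE=${TEMPLATE:-../desc/cpe/template_cpe.xml}   # assumed file name
TASK_ID=${SGE_TASK_ID:-1}
STEP=${SGE_TASK_STEPSIZE:-3}

# Only act when the template is actually there.
if [ -f "$TEMPLATE" ]; then
    make_cpe "$TEMPLATE" "$TASK_ID" "$STEP" \
        "$(dirname "$TEMPLATE")/temp_cpe_${TASK_ID}.xml"
fi
```

Writing one descriptor per task ID keeps concurrent tasks from overwriting each other's files on the shared NFS directory.
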
qstat
  [roederc@amc-colfax sge_scripts]$ qstat
  job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
  -----------------------------------------------------------------------------------------------------------------
      130 0.00000 stanford_o roederc      qw    11/02/2010 12:39:01                                    1 1-49:3
  [roederc@amc-colfax sge_scripts]$ qmon
  [roederc@amc-colfax sge_scripts]$ qstat
  job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
  -----------------------------------------------------------------------------------------------------------------
      130 0.55500 stanford_o roederc      r     11/02/2010 12:39:10 all.q@amc-colfaxnd4.ucdenver.p     1 4
      130 0.55500 stanford_o roederc      r     11/02/2010 12:39:10 all.q@amc-colfaxnd2.ucdenver.p     1 7
      130 0.55500 stanford_o roederc      r     11/02/2010 12:39:10 all.q@amc-colfaxnd5.ucdenver.p     1 10
      130 0.55500 stanford_o roederc      r     11/02/2010 12:39:10 all.q@amc-colfaxnd3.ucdenver.p     1 13
      130 0.55500 stanford_o roederc      r     11/02/2010 12:39:10 all.q@amc-colfaxnd1.ucdenver.p     1 16
      130 0.55500 stanford_o roederc      r     11/02/2010 12:39:10 all.q@amc-colfaxnd5.ucdenver.p     1 19
      130 0.55500 stanford_o roederc      r     11/02/2010 12:39:10 all.q@amc-colfaxnd2.ucdenver.p     1 22
      130 0.55500 stanford_o roederc      r     11/02/2010 12:39:10 all.q@amc-colfaxnd4.ucdenver.p     1 25
      130 0.55500 stanford_o roederc      r     11/02/2010 12:39:10 all.q@amc-colfax.ucdenver.pvt      1 28
      130 0.55500 stanford_o roederc      r     11/02/2010 12:39:10 all.q@amc-colfaxnd3.ucdenver.p     1 31

qdel command
  Use to kill a job
  qdel <job num>

Failures?
  Q: What if a job fails? (A: it stops)
  Open problem
    For now, that process dies, leaving unprocessed jobs
    Need to cull unprocessed files and try again
    Usually the cause is not enough memory
  Future: db-driven collection reader with a cas-consumer that reports completion

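Until something like the db-driven reader exists, the culling step could be sketched as below. It assumes a layout the slides do not state: each input NAME.txt yields an output NAME.xmi in a separate directory. The function name list_unprocessed and both file extensions are hypothetical.

```shell
# Sketch: list inputs that have no finished output yet.
# list_unprocessed INPUT_DIR OUTPUT_DIR
list_unprocessed() {
    for lu_in in "$1"/*.txt; do
        [ -e "$lu_in" ] || continue          # directory had no .txt files
        lu_base=$(basename "$lu_in" .txt)
        # A missing or empty output counts as unprocessed.
        if [ ! -s "$2/$lu_base.xmi" ]; then
            echo "$lu_in"
        fi
    done
}

# When called with two directories, print the files to retry.
if [ "$#" -eq 2 ]; then
    list_unprocessed "$1" "$2"
fi
```

Treating an empty output as unprocessed is a guess at how a process killed for lack of memory leaves its last file; the resulting list could feed a second, smaller submission.
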
Example 1: Distribute a simple script on the cluster: test_sge.sh
  qsub test_sge.sh
    Runs it once
  qsub -t 1-5:1 test_sge.sh
    Runs it five times
  qsub -t 100-500:100 test_sge.sh
    Also runs it five times
    Gives index starts spaced by 100

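The contents of test_sge.sh are not shown in the slides; a script along these lines would be enough to see the three submissions behave differently. The function describe_task is an illustrative name, and the check for the string "undefined" rests on the assumption that SGE sets SGE_TASK_ID to that value for non-array jobs.

```shell
#$ -N test_sge
#$ -S /bin/bash
#$ -cwd
#$ -j y

# Sketch: report which task this process is and where it runs.
# describe_task ID STEP -> one line describing the task
describe_task() {
    if [ -z "$1" ] || [ "$1" = "undefined" ]; then
        echo "single job on $(hostname)"
    else
        echo "task $1 (step $2) on $(hostname)"
    fi
}

describe_task "${SGE_TASK_ID:-}" "${SGE_TASK_STEPSIZE:-}"
```

With -t 100-500:100 the five output files should each report a different task ID (100, 200, and so on), which is a quick check that the index range is being passed through as expected.
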
Example 2: Run UIMA on Cluster
  sge_stanford_out.sh:
    Calls a script with a template CPE and index range: run_cpe_cluster_stanford_out.sh
      Modifies CPE template, creating a CPE for each sub-range
      Sets up environment, calls SimpleRunCPE (java)
  Note temp_cpe_<n>.xml in ../desc/cpe
  Start a number of terminals, run “top” in each to see CPU and memory usage.

Hadoop
  Inspired by Lisp’s map/reduce
    Map: apply a function to each element of a hash
    Reduce: combine hashes into one
  Known for optimizing by moving processing rather than data
  Similar code used by Google. Hadoop is open source, used by Yahoo, Amazon.
  Specialized interfaces make it more suited to greenfield development

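The map and reduce steps can be illustrated with a word count built from ordinary Unix tools. This is only a loose analogy on a single machine, not Hadoop's API; the function names map_words and reduce_counts are made up for the sketch.

```shell
# Sketch: word count as a map step followed by a reduce step.

# map: turn each line of text into one lowercase word per line
map_words() {
    tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d'
}

# reduce: bring identical words together (sort), then count each group
reduce_counts() {
    sort | uniq -c | awk '{print $2, $1}'
}

printf 'Text in annotations out\ntext in\n' | map_words | reduce_counts
```

The sort between the two steps plays the role of the grouping that a map/reduce framework does between mappers and reducers; what Hadoop adds is running the map step on the machines that already hold the data.
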
What about “The Cloud”?
  Amazon’s Elastic Compute Cloud (EC2) is a cluster on the internet that can be rented by the hour
  Very dynamic
    Set up nodes when you start using them
    Expect them to disappear when you stop
  Must have machine configuration management sussed. You have to re-install everything.
  Use S3 for long-term storage
  Starts at $0.10/hour

[Figure: Colfax Cluster: 6 CPUs, 5TB disk array]
[Figure: Enki: 8TB RAID, CPU]
