2. The Scaling Problem
- "Does the solution scale?" asks whether larger versions of the problem (often more data) can be handled by a given piece of software.
- "Scaling" is a loose collection of techniques for improving or implementing a solution's scalability.
- The choice of technique depends on the critical resource (CPU, memory, or I/O) and on how easily the task is broken into pieces.
- This talk focuses on scaling as it applies to UIMA NLP processing (notwithstanding OpenDMAPv2). It is a work in progress.
3. Scaling NLP
- Processing one file is independent of processing another: text in, annotations out.
- Multi-threaded: more than one thread of execution in one process. Pipelines share memory and can step on each other; for example, the Stanford tools crash because of concurrency issues ("was not an issue in 2001").
  <casProcessors casPoolSize="4" processingUnitThreadCount="2">
- Multi-process: separate JVMs, each with a single thread. Memory is not shared, so no crushed toes. The overhead of repeated JVMs and pipelines does cost, but it works.
  <casProcessors casPoolSize="3" processingUnitThreadCount="1">
- Many machines: more memory, more cores. Independence means the pieces won't miss being on the same machine, and independent machines (a cluster) are cheaper than integrated ones (Enki).
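For context, the casPoolSize and processingUnitThreadCount attributes quoted above sit on the casProcessors element of a UIMA CPE descriptor. A minimal sketch of the surrounding structure (the casProcessor name and its elided contents are illustrative, not from the talk):

```xml
<cpeDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <!-- collectionReader element goes here -->
  <casProcessors casPoolSize="4" processingUnitThreadCount="2">
    <casProcessor deployment="integrated" name="StanfordAnnotator">
      <!-- descriptor path, error handling, etc. -->
    </casProcessor>
  </casProcessors>
  <!-- cpeConfig element goes here -->
</cpeDescription>
```

With processingUnitThreadCount="2", two pipeline threads draw CASes from a shared pool of four, which is where the shared-memory concurrency issues come from.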
4. Hardware
- Local cluster (Colfax): a rack of machines with software (SGE) to integrate them.
- Integrated CPUs (Enki): much like a rack, but the motherboards are tied together and can share memory. Gigabit Ethernet delivers on the order of 300 Mb/sec; a motherboard runs up to 4.8 GB/sec.
- Virtual cluster: virtualization software allows a single machine to appear as many; offers flexibility and security.
- Cloud: a virtual cluster on the net, e.g. Amazon EC2.
5. Hardware: CCP's Colfax Cluster
- Runs Linux (Fedora/Red Hat)
- 6 machines (amc-colfax, amc-colfaxnd[1-5])
- 2 CPUs (Intel) per machine, 4 cores each: 48 cores total
- Intel motherboards
- 16 GB memory each, 96 GB total
- 5 TB shared (over NFS) disk array, RAID 5
- Named after the assembler: Colfax International
6. (Sun|Oracle) Grid Engine (SGE)
- Manages a queue of jobs, optimizing resource utilization.
- Starts individual processes for a job.
- Often used with the Message Passing Interface (MPI) for processes that cooperate.
- Used here to start "array jobs": each task processes a portion of a large array of work to be done.
7. An SGE Job
- An SGE job is a script plus a command line.
- The command line specifies resources for scheduling: memory, among others.
- The script is run once for each process started.
- It is not pure shell, but more or less a shell script (next slide).
- Each job is assigned an ID number.
8. More or less a shell script?
Put these lines at the top for SGE:
- #$ -N stanford_out — standard out goes to a file with this prefix
- #$ -S /bin/bash — the shell to use (there is no "she-bang" #!/bin/sh line)
- #$ -cwd — run from the current directory
- #$ -j y — merge stdout and stderr into one file
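Assembled, a minimal job script along these lines might look as follows. The echo body is illustrative; the real script runs the UIMA pipeline.

```shell
#$ -N stanford_out   # stdout goes to a file named stanford_out.o<jobID>.<taskID>
#$ -S /bin/bash      # interpreter SGE uses; takes the place of a #! line
#$ -cwd              # run in the directory qsub was invoked from
#$ -j y              # merge stderr into the stdout file

# SGE exports these for array jobs; default them so the script
# can also be run directly outside SGE for testing.
TASK=${SGE_TASK_ID:-1}
STEP=${SGE_TASK_STEPSIZE:-1}
echo "task $TASK (step size $STEP) starting"
```

Lines beginning with #$ are comments to the shell but directives to SGE, which is why the script is "more or less" a shell script.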
9. Submit a Job: qsub
qsub -t 1-200000:20000 sge_stanford_out.sh
- -t gives the index range: do array items from 1 to 200,000 in steps of 20,000, i.e. 10 processes.
- Do this with the sge_stanford_out.sh script.
- How does the script know which files to process?
  - $SGE_TASK_ID: the first file number for this task
  - $SGE_TASK_STEPSIZE: how many files each task covers
  - For the range above, tasks see SGE_TASK_ID values of 1, 20001, 40001, and so on, each with SGE_TASK_STEPSIZE=20000.
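The arithmetic each task does can be sketched as follows (the FIRST/LAST variable names are mine, not from the script):

```shell
# With qsub -t 1-200000:20000, a task sees e.g. SGE_TASK_ID=1 and
# SGE_TASK_STEPSIZE=20000; default them so this runs standalone too.
FIRST=${SGE_TASK_ID:-1}
STEP=${SGE_TASK_STEPSIZE:-20000}
# Last document in this task's sub-range.
LAST=$(( FIRST + STEP - 1 ))
echo "this task processes documents $FIRST through $LAST"
```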
10. sge_stanford_out.sh
- Will evolve into a generic UIMA job-submission script.
- The script modifies a template CPE file, creating a CPE descriptor for each process.
- The CPE specifies the starting document number and the number of documents to process.
- http://wikis.sun.com/display/gridengine62u2/How+to+Submit+an+Array+Job+From+the+Command+Line

[roederc@amc-colfax sge_scripts]$ qsub -t 1-50:3 sge_stanford_out.sh
Your job-array 130.1-50:3 ("stanford_out") has been submitted
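The per-process CPE files could be produced with a substitution along these lines. The placeholder names (@@START@@, @@COUNT@@) and the toy one-line template are illustrative, not taken from the actual scripts:

```shell
# Toy stand-in for the CPE template: the real one is a full CPE
# descriptor with placeholders where the document range belongs.
TEMPLATE=$(mktemp)
printf '<param name="startNumber" value="@@START@@"/>\n<param name="numberToProcess" value="@@COUNT@@"/>\n' > "$TEMPLATE"

START=${SGE_TASK_ID:-1}
COUNT=${SGE_TASK_STEPSIZE:-20000}
OUT="temp_cpe_${START}.xml"

# Stamp this task's range into its own copy of the descriptor.
sed -e "s/@@START@@/$START/" -e "s/@@COUNT@@/$COUNT/" "$TEMPLATE" > "$OUT"
```

Each task then hands its own temp_cpe_<n>.xml to SimpleRunCPE, so the processes never contend over a shared descriptor.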
11. qstat
[roederc@amc-colfax sge_scripts]$ qstat
job-ID  prior    name        user     state  submit/start at       queue                           slots  ja-task-ID
--------------------------------------------------------------------------------------------------------------------
   130  0.00000  stanford_o  roederc  qw     11/02/2010 12:39:01                                   1      1-49:3
[roederc@amc-colfax sge_scripts]$ qmon
[roederc@amc-colfax sge_scripts]$ qstat
job-ID  prior    name        user     state  submit/start at       queue                           slots  ja-task-ID
--------------------------------------------------------------------------------------------------------------------
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10   all.q@amc-colfaxnd4.ucdenver.p  1      4
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10   all.q@amc-colfaxnd2.ucdenver.p  1      7
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10   all.q@amc-colfaxnd5.ucdenver.p  1      10
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10   all.q@amc-colfaxnd3.ucdenver.p  1      13
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10   all.q@amc-colfaxnd1.ucdenver.p  1      16
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10   all.q@amc-colfaxnd5.ucdenver.p  1      19
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10   all.q@amc-colfaxnd2.ucdenver.p  1      22
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10   all.q@amc-colfaxnd4.ucdenver.p  1      25
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10   all.q@amc-colfax.ucdenver.pvt   1      28
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10   all.q@amc-colfaxnd3.ucdenver.p  1      31
13. Failures?
- Q: What if a task fails? A: It stops. This is an open problem.
- For now, that process dies, leaving unprocessed files behind.
- We need to cull the unprocessed files and try again; the cause is usually not enough memory.
- Future: a database-driven collection reader paired with a CAS consumer that reports completion.
14. Example 1: Distribute a Simple Script on the Cluster
test_sge.sh:
- qsub test_sge.sh — runs it once
- qsub -t 1-5:1 test_sge.sh — runs it five times
- qsub -t 100-500:100 test_sge.sh — also runs it five times, with index starts spaced by 100
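A minimal test_sge.sh might look like this (my sketch; the real script's contents were not shown). It just reports where and with which array index it ran:

```shell
#$ -S /bin/bash
#$ -cwd
#$ -j y
# Report which node ran this task and which array index it was given;
# outside SGE, SGE_TASK_ID is unset, so fall back to "none".
LINE="host: $(hostname), task id: ${SGE_TASK_ID:-none}"
echo "$LINE"
```

After the tasks finish, the stanford_out.o<job>.<task> output files show which nodes SGE picked and which indices each task received.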
15. Example 2: Run UIMA on the Cluster
- sge_stanford_out.sh calls a script with a template CPE and an index range: run_cpe_cluster_stanford_out.sh
- That script modifies the CPE template, creating a CPE for each sub-range.
- It sets up the environment and calls SimpleRunCPE (Java).
- Note the temp_cpe_<n>.xml files in ../desc/cpe.
- Start a number of terminals and run "top" in each to watch CPU and memory usage.
16. Hadoop
- Inspired by Lisp's map/reduce:
  - Map: apply a function to each element of a hash.
  - Reduce: combine hashes into one.
- Known for optimizing by moving processing to the data rather than moving the data.
- Similar code is used by Google; Hadoop is open source, used by Yahoo and Amazon.
- Its specialized interfaces make it better suited to greenfield development.
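The model can be mimicked with ordinary shell tools. A word count, the canonical Hadoop example, where sort plays the role of the shuffle stage:

```shell
# map: emit one word per line; shuffle: sort groups identical keys
# together; reduce: uniq -c collapses each group to a count.
COUNTS=$(printf 'to be or not to be\n' | tr ' ' '\n' | sort | uniq -c)
echo "$COUNTS"
```

Hadoop runs many mappers and reducers in parallel across machines, but the data flow per key is the same as in this pipeline.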
17. What about "The Cloud"?
- Amazon's Elastic Compute Cloud (EC2) is a cluster on the internet that can be rented by the hour.
- Very dynamic: set up nodes when you start using them, and expect them to disappear when you stop.
- You must have machine-configuration management sussed: you have to re-install everything.
- Use S3 for long-term storage.
- Prices start at $0.10/hour.