Talk at eResearch New Zealand Conference, June 2011 (given remotely from Italy, unfortunately!)
Abstract: Whitehead observed that "civilization advances by extending the number of important operations which we can perform without thinking about them." I propose that cloud computing can allow us to accelerate dramatically the pace of discovery by removing a range of mundane but time-consuming research data management tasks from our consciousness. I describe the Globus Online system that we are developing to explore these possibilities, and propose milestones for evaluating progress towards smarter science.
2. Alfred North Whitehead (1911) Civilization advances by extending the number of important operations which we can perform without thinking about them
3. J.C.R. Licklider reflects on thinking (1960) About 85 per cent of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know
4. For example … (Licklider again) At one point, it was necessary to compare six experimental determinations of a function relating speech-intelligibility to speech-to-noise ratio. No two experimenters had used the same definition or measure of speech-to-noise ratio. Several hours of calculating were required to get the data into comparable form. When they were in comparable form, it took only a few seconds to determine what I needed to know.
5. Research hasn’t changed much in 300 years Analyze data Collect data Publish results Identify patterns Design experiment Pose question Test hypotheses Hypothesize explanation
6. Discovery 1960: Data collection dominates Janet Rowley: chromosome translocations and cancer
10. Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways Software Platform Infrastructure Varieties of “* as a Service” (*aaS)
11. Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways Software Platform Amazon, GoGrid, Microsoft, Flexiscale, … Infrastructure Varieties of * as a service (*aaS)
12. Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways Software Google, Microsoft, Amazon, … Platform Amazon, GoGrid, Microsoft, Flexiscale, … Infrastructure Varieties of * as a service (*aaS)
13. Perform important tasks without thinking Web presence Email (hosted Exchange) Calendar Telephony (hosted VOIP) Human resources and payroll Accounting Customer relationship mgmt Data analytics Content distribution IaaS
14. Perform important tasks without thinking Web presence Email (hosted Exchange) Calendar Telephony (hosted VOIP) Human resources and payroll Accounting Customer relationship mgmt Data analytics Content distribution SaaS IaaS
16. Research IT is a growing burden Big projects can build sophisticated solutions to IT problems Small labs and collaborations have problems with both They need solutions, not toolkits—ideally outsourced solutions
17. Medium science: Dark Energy Survey Blanco 4m on Cerro Tololo Image credit: Roger Smith/NOAO/AURA/NSF Every night, they receive 100,000 files in Illinois They transmit these files to Texas for analysis (35 msec latency) Then move the results back to Illinois This whole process must run reliably & routinely
19. A new approach to research IT Goal: Accelerate discovery and innovation worldwide by providing research IT as a service Leverage software-as-a-service (SaaS) to provide millions of researchers with unprecedented access to powerful research tools, and enable a massive shortening of cycle times in time-consuming research processes
35. Grid (aka federation) as a service Globus Toolkit Globus Online Build the Grid Components for building custom grid solutions globustoolkit.org Use the Grid Cloud-hosted file transfer service globusonline.org
36. Globus Online’s Web 2.0 architecture Command line interface ls alcf#dtn:/ scp alcf#dtn:/myfile nersc#dtn:/myfile HTTP REST interface POST https://transfer.api.globusonline.org/v0.10/transfer <transfer-doc> Web interface Fire-and-forget data movement Many files and lots of data Credential management Performance optimization Expert operations and monitoring GridFTP servers FTP servers High-performance data transfer nodes Globus Connect on local computers
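The REST interface on this slide can be illustrated with a minimal sketch that builds, but does not send, the POST request. Only the URL comes from the slide. The slide leaves the transfer document as a placeholder, so the JSON field names, the bearer-token header, and the helper name `build_transfer_request` are assumptions made for illustration, not the service's actual schema.

```python
import json
import urllib.request

# URL as shown on the slide.
TRANSFER_URL = "https://transfer.api.globusonline.org/v0.10/transfer"


def build_transfer_request(source, destination, path, token):
    """Build (but do not send) a POST request carrying a transfer document.

    The document fields and the bearer-token header are hypothetical
    placeholders for illustration, not the service's real schema.
    """
    transfer_doc = {
        "source_endpoint": source,            # hypothetical field name
        "destination_endpoint": destination,  # hypothetical field name
        "items": [{"source_path": path, "destination_path": path}],
    }
    body = json.dumps(transfer_doc).encode("utf-8")
    return urllib.request.Request(
        TRANSFER_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # hypothetical auth scheme
        },
    )


# Same endpoints and file as the command line example on the slide.
request = build_transfer_request("alcf#dtn", "nersc#dtn", "/myfile", "EXAMPLE-TOKEN")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

The point of the sketch is the "fire-and-forget" shape: the client describes what should move and hands the description to a hosted service, rather than managing the transfer itself.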
46. Next steps: Outsource additional activities Analyze data Collect data Publish results Identify patterns Design experiment Pose question Test hypotheses Hypothesize explanation
47. A use case for the next steps Medical image data is acquired at multiple sites Uploaded to a commercial cloud Quality control algorithms applied Anonymization procedures applied Metadata extracted and stored Access granted to clinical trial team Interactive access and analysis More metadata generated and stored Access granted to subset of data for education
48. Required building blocks Group management for data sharing Scheduled September, 2011, for BIRN biomedical Metadata management Create, update, query a hosted metadata catalog Data publication workflows Data movement, naming, metadata operations, etc. Cloud storage access And HTTP, WebDAV, SRM, iRODS, … Computation on shared data E.g., via Galaxy workflow system
50. Summary To accelerate discovery, automate the mundane Data-intensive computing is particularly full of mundane tasks Outsourcing complexity to SaaS providers is a promising route to automation Globus Online is an early experiment in SaaS for science
51. For more information Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011. Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Globus Online: Radical Simplification of Data Movement via SaaS. Preprint CI-PP-05-0611, Computation Institute, 2011.
Whitehead points out that a powerful tool for enhancing human capabilities is to automate the mundane. He was talking about mathematics: e.g., the decimal system, algebra, and calculus all facilitated thinking. But in an era in which information and its processing increasingly dominate human activities, the same applies to computing. For example, arithmetic and mathematics: thus, calculus, Excel, Matlab, supercomputers. Increasingly, discovery and innovation also depend on integration of diverse resources: data sources, software, computing power, human expertise.
The basic research process has remained essentially unchanged since the emergence of the scientific method in the 17th century. Collect data, analyze data, identify patterns within data, seek explanations for those patterns, collect new data to test explanations. Speed of discovery depends to a significant degree on the time required for this cycle. Here, new technologies are changing the research process rapidly and dramatically. Data collection time used to dominate research. For example, Janet Rowley took several years to collect data on gross chromosomal abnormalities for a few patients. Today, we can generate genome data at the rate of billions of base pairs per day. So other steps become bottlenecks, like managing and analyzing data, a key issue for Midway. It is important to realize that the vast majority of research is performed within “small and medium labs.” For example, almost all of the ~1000 faculty in BSD and PSD at UChicago work in their own lab. Each lab has a faculty member, some postdocs, students, so maybe 5000 total just at UC. Academic research is a cottage industry (albeit one that is increasingly interconnected) and is likely to stay that way.
The abnormality seen by Nowell and Hungerford on chromosome 22. Now known as the Philadelphia Chromosome
Sequencing capacity of a big lab is doubling every nine months. 5 orders of magnitude in ~5 years. Single lab with 10 sequencing machines can generate 400 Gbase pairs per day.
Federal Demonstration Partnership.
Many interesting questions. What is the right mix of services at the platform level? How do we build services that meet scalability, performance, reliability needs? How can we leverage such offerings to build innovative applications? Legal, business model issues.
Of course, people also make effective use of IaaS, but only for more specialized tasks
More specifically, the opportunity is to apply a very modern technology—software as a service, or SaaS—to address a very modern problem, namely the enormous challenges inherent in translating revolutionary 21st century technologies into scientific advances. Midway’s SaaS approach will address these challenges, and both make powerful tools far more widely available, and reduce the cycle time associated with research and discovery.
So let’s look at that list again. My colleagues and I started an effort a little while ago aimed at applying SaaS to one of these tasks …
Why? Discover endpoints, determine available protocols, negotiate firewalls, configure software, manage space, determine required credentials, configure protocols, detect and respond to failures, identify, diagnose, and correct network misconfigurations, …
Explain attempts; a cornerstone of our failure mitigation strategy. Through repeated attempts, GO was able to overcome transient errors at OLCF and ranger. The expired host certs on bigred were not updated until after the run had completed.
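The idea of repeated attempts can be sketched as a small retry loop. This is an illustration under stated assumptions, not Globus Online's implementation: the `TransientError` class, the `run_with_attempts` helper, the attempt limit, and the exponential backoff schedule are all hypothetical. Retries only help with transient errors; a persistent fault, such as the expired host certificates mentioned above, keeps failing until someone fixes it.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a dropped connection)."""


def run_with_attempts(task, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is reached.

    Returns (result, attempts_used). Other exceptions propagate immediately;
    the last TransientError is re-raised once attempts run out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except TransientError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: wait base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: a stand-in transfer that fails twice, then succeeds.
outcomes = iter([TransientError("timeout"), TransientError("reset"), "done"])


def flaky_transfer():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


# The sleep function is replaced with a no-op so the example runs instantly.
result, attempts = run_with_attempts(flaky_transfer, sleep=lambda seconds: None)
print(result, attempts)
```

Passing the sleep function as a parameter keeps the sketch testable without real delays; a production service would also need to decide which errors count as transient.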
Self-healing; SLA-driven; Multi-tenancy – multitasking, … much more; Service-oriented; Virtualized; Linearly scalable; Data, data, data