globus.org/genomics
Finding Needles in a Haystack – Big Data
Management and Analysis using Globus
Ravi Madduri
madduri@anl.gov
JSM 2015, Seattle, Washington
Who We Are
• Globus Genomics is developed, operated, and supported by
researchers, developers, and bioinformaticians at the
Computation Institute – University of Chicago/Argonne
National Lab
• We are a non-profit organization building solutions for
non-profit researchers
• Our goal is to support the advancement of science by bringing
together our strengths and capabilities to help meet the
unique needs of researchers and research institutions
Finding needles in haystacks
[Diagram: research cycle] Pose question, Design experiment, Collect data, Analyze data, Identify patterns, Hypothesize explanation, Test hypothesis, Publish results
Imagine if a researcher, when
tackling a problem, could easily:
• Assemble, integrate, and interpret all
relevant data within a knowledge network
• Be informed of anomalies, patterns, gaps
• Formulate & apply computational models
• Outsource tasks if local expertise lacking
• Launch automated processes to test
hypotheses, expand knowledge network
• Pay for all this by taking on other tasks
We will cover
• Accelerating Scientific Discovery Process
by providing Science as a Service
– Research Data Management
– Analyzing Research Data
• Interactive Analysis
• Large-scale Analysis
– Publishing Results so others can
• Discover
• Validate
• Reproduce/Use
90% of cancer patients carry a
mutation that may be
responsive to a known drug
– Mark Rubin (Weill Cornell Medical College and NewYork-Presbyterian
Hospital, New York), in Nature, April 2015
Trying to find a single causative gene for
diseases with a complex genetic background
is like looking for the proverbial needle in a
haystack
– Nancy Cox
(Vanderbilt)
Higgs discovery “only possible because
of the extraordinary achievements of …
grid computing”
Rolf Heuer, CERN DG
10s of PB, 100s of institutions, 1000s of
scientists, 100Ks of CPUs, billions of tasks
How do we accelerate discovery
without requiring that every lab acquire
a haystack-sorting machine?
Clayton & Shuttleworth thresher, 1910: Museum Victoria, Australia
Managing big data with Globus
1. PI initiates transfer request; or requested automatically by script, science gateway
2. Globus transfers files reliably, securely
3. PI selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Researcher logs in to Globus and accesses shared files; no local account required; download via Globus
6. Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific)
7. Curator reviews and approves; data set published on campus or other system
8. Peers, collaborators search and discover datasets; transfer and share using Globus
Diagram locations: Light Source, Compute Facility, Personal Computer, Publication Repository
• SaaS: only a web browser required
• Access using your campus credentials
• Globus monitors and informs throughout
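The "transfers files reliably" step can be illustrated with a small sketch. This is not Globus code and does not use any Globus API; it is a plain-Python model, assuming that "reliable" means checking a checksum after each copy and retrying on mismatch. The function names `sha256_of` and `verified_copy` and the file names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_copy(src: Path, dst: Path, max_attempts: int = 3) -> int:
    """Copy src to dst and re-copy until the checksums match.

    Returns the number of attempts used (hypothetical helper, not a
    Globus function). Raises IOError if every attempt fails.
    """
    expected = sha256_of(src)
    for attempt in range(1, max_attempts + 1):
        shutil.copyfile(src, dst)
        if sha256_of(dst) == expected:
            return attempt
    raise IOError(f"checksum mismatch after {max_attempts} attempts: {dst}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "reads.fastq"  # made-up file name
        source.write_text("@read1\nACGT\n+\nIIII\n")
        target = Path(tmp) / "copy.fastq"
        attempts = verified_copy(source, target)
        print(f"copied and verified in {attempts} attempt(s)")
```

A hosted service would add things this sketch leaves out, such as authentication, parallel streams, and restart from partial transfers.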
Globus Platform-as-a-Service
Identity, Group, Profile Management Services
…
Sharing Service
Transfer Service
Globus Toolkit
Globus APIs
Globus Connect
Globus Adoption and Usage
• 166,449 active Globus endpoints
• 27,961 users registered
• Biggest transfer: 500.42TB
• Longest running transfer: 182 days.
• Fastest transfer: 58.5Gbps (average)
• 55TB moved per day, on average, since the
service was launched in November 2010
• Average throughput: 637.7Mbps (since
service launch)
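To put the figures above on a common scale, the sketch below converts them between data volume and line rate. It assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second). The second calculation pairs the biggest transfer with the fastest average rate purely for illustration; the slide does not say these were the same transfer.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_gbps(terabytes: float, seconds: float) -> float:
    """Average rate in Gbps needed to move `terabytes` in `seconds`."""
    bits = terabytes * 1e12 * 8
    return bits / seconds / 1e9


def transfer_hours(terabytes: float, rate_gbps: float) -> float:
    """Hours needed to move `terabytes` at a constant `rate_gbps`."""
    bits = terabytes * 1e12 * 8
    return bits / (rate_gbps * 1e9) / 3600


# 55 TB moved per day, expressed as a sustained aggregate rate.
daily = average_rate_gbps(55, SECONDS_PER_DAY)
print(f"55 TB/day is about {daily:.2f} Gbps sustained")  # about 5.09 Gbps

# Hypothetical pairing: 500.42 TB moved at a constant 58.5 Gbps.
hours = transfer_hours(500.42, 58.5)
print(f"500.42 TB at 58.5 Gbps would take about {hours:.1f} hours")  # about 19.0 hours
```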
Analyzing Big Data using Globus
Data management
Diagram labels: Galaxies, Sequencing Centers, Public Data, Storage, Local Cluster/Cloud, Seq Center, Research Lab
Globus provides for
• High-performance
• Fault-tolerant
• Secure
file transfer between
all data-endpoints
Data analysis
Diagram labels: Galaxy Data Libraries; Fastq, Ref Genome; Alignment, Variant Calling; Picard, GATK
Globus Genomics on Amazon EC2
• Analytical tools are automatically run on the scalable compute resources when possible
Galaxy-based workflow management
• Globus integrated within Galaxy
• Web-based UI
• Drag-Drop workflow creations
• Easily modify workflows with new tools
Our Science Stack
• Galaxy (SaaS)
– Interactive execution, iPython, R
– Creation, Execution, Sharing, Discovering Workflows
• Globus (PaaS)
– Data management
– Identity Management
• AWS (IaaS)
– HTCondor, Chef, EC2, EBS, S3, SNS
– Spot, Route 53, CloudFormation
Examples of what
researchers have done
• 134 samples and 4 workflows
• 4 TB data initially
• 2200 core hours in 6 days
Cox lab, UChicago
Consensus Caller
[Figure panels: Rediscovery of previously observed variants; Transition/Transversion Ratio; Genotype Mendel Error Rate; Distributions of Mendel Error Counts per Trio]
Contaminated Samples
Olopade lab, UChicago
A profile of inherited predisposition to breast
cancer among Nigerian women
Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner,
S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola,
O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade
• 200 targeted exomes
• 200 GB data initially
• 76,920 core hours in 1.25 days
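The core-hour figures quoted for the Cox lab and Olopade lab runs can be turned into an average number of cores in use at once, which gives a feel for how far each run scaled out. This is a back-of-the-envelope sketch: it assumes the quoted days are wall-clock time and that usage was spread evenly, neither of which the slides state.

```python
def average_cores(core_hours: float, wall_days: float) -> float:
    """Average number of cores busy at once over the run."""
    return core_hours / (wall_days * 24)


# (core hours, wall-clock days) as quoted on the slides.
runs = {
    "Cox lab": (2200, 6),
    "Olopade lab": (76920, 1.25),
}

for name, (core_hours, days) in runs.items():
    cores = average_cores(core_hours, days)
    print(f"{name}: about {cores:.0f} cores on average")
# Cox lab: about 15 cores on average
# Olopade lab: about 2564 cores on average
```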
Expanding Consensus Genotyper – SNVs, Indels, SVs
[Pipeline diagram] RAW FASTQs are processed by several callers, each producing a VCF: GATK Pipeline/HC, FreeBayes, SAMtools mpileup, GATK Pipeline/UG, Atlas2, Delly/Contra. The Consensus Genotyper combines these VCFs into a consensus VCF.
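The slides do not say how the Consensus Genotyper combines the per-caller VCFs. One common approach, shown here only as an assumption, is a support threshold: keep a variant if at least `min_callers` callers report it. The function name, caller keys, and variant records below are invented toy data, not results from the study.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# A variant keyed the way a VCF row identifies it: (chrom, pos, ref, alt).
Variant = Tuple[str, int, str, str]


def consensus_calls(
    calls_by_caller: Dict[str, Set[Variant]], min_callers: int
) -> List[Variant]:
    """Return variants reported by at least `min_callers` callers, sorted."""
    support: Counter = Counter()
    for variants in calls_by_caller.values():
        support.update(variants)  # each caller votes once per variant
    return sorted(v for v, votes in support.items() if votes >= min_callers)


# Toy input: three callers, made-up coordinates.
calls = {
    "gatk_hc": {("chr17", 100, "A", "G"), ("chr13", 200, "C", "T")},
    "freebayes": {
        ("chr17", 100, "A", "G"),
        ("chr13", 200, "C", "T"),
        ("chr1", 5, "G", "A"),
    },
    "samtools": {("chr17", 100, "A", "G")},
}

print(consensus_calls(calls, min_callers=2))
# [('chr13', 200, 'C', 'T'), ('chr17', 100, 'A', 'G')]
print(consensus_calls(calls, min_callers=3))
# [('chr17', 100, 'A', 'G')]
```

Raising `min_callers` trades sensitivity for specificity: fewer variants survive, but each has broader support.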
Preliminary Results are very encouraging
14 deleterious SNVs and 11 damaging
Indels (BRCA1: 15, BRCA2: 4, PALB2: 2,
BRIP1: 1, CHEK2: 1, NBN: 1, TP53: 1) were
found in 29 subjects, and they were all
confidently detected among 5 callers.
Identified SNVs and Indels were all
confirmed by Sanger sequencing.
Combining Data management and Analysis
1. Query and discover data
2. Transfer bags
3. Execute parallel alignment workflow on dynamically provisioned cloud resources
3. Publish bags
4. Discover published data and execute comparison workflow
Diagram labels: ERMrest; BDDS Collection; PPMI, ADNI; Adenocarcinoma, Adrenal, Brain; QC, Alignment, Feature count; Alignment Files; Differential expression
Links shown: http://bit.ly/1M0h6Yx, http://bit.ly/A10R89y
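The slide refers to "bags" without defining them. The sketch below assumes they follow the BagIt packaging convention (a `data/` payload directory, a `bagit.txt` declaration, and a checksum manifest), which is what lets a receiver check a bag after transfer. The helper names `make_bag` and `validate_bag` and the file names are hypothetical; this is not the BDDS tooling.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def make_bag(bag_dir: Path, payload: Dict[str, bytes]) -> Path:
    """Write payload files under data/ plus bagit.txt and a SHA-256 manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    )
    manifest = bag_dir / "manifest-sha256.txt"
    manifest.write_text("\n".join(manifest_lines) + "\n")
    return manifest


def validate_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = bag_dir / "manifest-sha256.txt"
    for line in manifest.read_text().splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "alignment_bag"  # made-up bag name
        make_bag(bag, {"sample1.bam": b"placeholder bytes"})
        print("bag valid:", validate_bag(bag))
```

Because the manifest travels with the payload, a bag can be re-checked at every hop: after transfer, before analysis, and again when published.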
Gene Expression Results
Globus Genomics at a glance
• 30 institutions, groups, labs
• 10s million core hours
• 2 PBs raw sequences analyzed
• >1500 analysis tools
• 1000s genomes processed
• >50 workflows
• 99% uptime over the past two years
• 1 PB largest single transfer to date
• 5 days longest running workflow
• 100s different species
Other Globus Genomics users
Dobyns Lab
Cox Lab
Volchenboum Lab
Olopade Lab
Nagarajan Lab
Pricing includes
• Estimated compute
• Storage (one month)
• Globus Genomics platform usage
• Support
Costs are remarkably low
Globus Genomics – Making it routine to find
needles in NGS haystacks
www.globus.org/genomics
Other Examples of
Science as a Service
• PDACS - Portal for data analysis services for
cosmological simulations
• CVRG Galaxy – Large-scale ECG Data
Analysis
• Globus Proteomics
• eMatter – Material Science Simulations
• FACE-IT - Framework to Advance Climate,
Economic, and Impact Investigations with
Information Technology (usefaceit.org)
• More information on Globus Genomics:
www.globus.org/genomics
• More information on Globus:
www.globus.org
Our work is supported by:
U.S. Department of Energy
Thank you!
@madduri

SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Dernier (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Finding Needles in Haystacks - Big Data Analysis Using Globus

  • 1. globus.org/genomics Finding Needles in a Haystack – Big Data Management and Analysis using Globus Ravi Madduri madduri@anl.gov JSM 2015, Seattle, Washington
  • 2. Who We Are • Globus Genomics is developed, operated, and supported by researchers, developers, and bioinformaticians at the Computation Institute – University of Chicago/Argonne National Lab • We are a non-profit organization building solutions for non-profit researchers • Our goal is to support the advancement of science by bringing together our strengths and capabilities to help meet the unique needs of researchers and research institutions
  • 4. Imagine if a researcher, when tackling a problem, could easily: • Assemble, integrate, and interpret all relevant data within a knowledge network • Be informed of anomalies, patterns, gaps • Formulate & apply computational models • Outsource tasks if local expertise lacking • Launch automated processes to test hypotheses, expand knowledge network • Pay for all this by taking on other tasks
  • 5. We will cover • Accelerating Scientific Discovery Process by providing Science as a Service – Research Data Management – Analyzing Research Data • Interactive Analysis • Large-scale Analysis – Publishing Results so others can • Discover • Validate • Reproduce/Use
  • 6. 90% of cancer patients carry a mutation that may be responsive to a known drug. Mark Rubin, Weill Cornell Medical College and NewYork-Presbyterian Hospital in New York, in Nature, April 2015
  • 7. Trying to find a single causative gene for diseases with a complex genetic background is like looking for the proverbial needle in a haystack – Nancy Cox (Vanderbilt)
  • 8. Higgs discovery “only possible because of the extraordinary achievements of … grid computing” Rolf Heuer, CERN DG. 10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, Bs of tasks
  • 9. How do we accelerate discovery without requiring that every lab acquire a haystack-sorting machine? Clayton & Shuttleworth thresher, 1910: Museum Victoria, Australia
  • 10. Managing big data with Globus. Diagram locations: Light Source, Compute Facility, Personal Computer, Publication Repository. Steps: • PI initiates transfer request; or requested automatically by script, science gateway • Globus transfers files reliably, securely • PI selects files to share, selects user or group, and sets access permissions • Globus controls access to shared files on existing storage; no need to move files to cloud storage! • Researcher logs in to Globus and accesses shared files; no local account required; download via Globus • Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific) • Curator reviews and approves; data set published on campus or other system • Peers, collaborators search and discover datasets; transfer and share using Globus. Also: • SaaS: only a web browser required • Access using your campus credentials • Globus monitors and informs throughout
  • 11. Globus Platform-as-a-Service: Identity, Group, Profile Management Services … Sharing Service, Transfer Service, Globus Toolkit, Globus APIs, Globus Connect
  • 12. Globus Adoption and Usage • 166,449 active Globus endpoints • 27,961 users registered • Biggest transfer: 500.42 TB • Longest running transfer: 182 days • Fastest transfer: 58.5 Gbps (average) • 55 TB moved per day, on average, since the service was launched in November 2010 • Average throughput: 637.7 Mbps (since service launch)
  • 13. Analyzing Big Data using Globus Galaxies. Diagram labels: Sequencing Centers, Public Data, Storage, Local Cluster/Cloud, Seq Center, Research Lab. Globus provides for • High-performance • Fault-tolerant • Secure file transfer between all data-endpoints. Data management, Data analysis: Picard, GATK, Fastq, Ref Genome, Alignment, Variant Calling, Galaxy Data Libraries. Globus Genomics on Amazon EC2 • Analytical tools are automatically run on the scalable compute resources when possible • Globus integrated within Galaxy • Web-based UI • Drag-Drop workflow creations • Easily modify workflows with new tools. Galaxy-based workflow management, Globus Genomics
  • 14. Our Science Stack • Galaxy (SaaS) – Interactive execution, iPython, R – Creation, Execution, Sharing, Discovering Workflows • Globus (PaaS) – Data management – Identity Management • AWS (IaaS) – HTCondor, Chef, EC2, EBS, S3, SNS – Spot, Route 53, Cloud Formation
  • 16. Cox lab, UChicago • 134 samples and 4 workflows • 4 TB data initially • 2200 core hours in 6 days
  • 18. Rediscovery of previously observed variants • Transition/Transversion Ratio • Genotype Mendel Error Rate • Distributions of Mendel Error Counts per Trio
  • 20. Olopade lab, UChicago. A profile of inherited predisposition to breast cancer among Nigerian women. Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner, S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola, O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade • 200 targeted exomes • 200 GB data initially • 76,920 core hours in 1.25 days
  • 21. Expanding Consensus Genotyper – SNVs, Indels, SVs. Raw FASTQs feed GATK Pipeline/HC, GATK Pipeline/UG, FreeBayes, SAMtools mpileup, Atlas2, and Delly/Contra; each caller emits a VCF, and the Consensus Genotyper combines them into a single VCF
  • 22. 14 deleterious SNVs and 11 damaging Indels (BRCA1: 15, BRCA2: 4, PALB2: 2, BRIP1: 1, CHEK2: 1, NBN: 1, TP53: 1) were found in 29 subjects, and they were all confidently detected among 5 callers. Identified SNVs and Indels were all confirmed by Sanger sequencing. Preliminary Results are very encouraging
  • 23. Combining Data management and Analysis. Diagram labels: QC, PPMI, ADNI, Adenocarcinoma, http://bit.ly/1M0h6Yx, http://bit.ly/A10R89y, Adrenal, Brain, Alignment, Feature count, Alignment QC, ERMrest, Alignment Files, BDDS Collection, Differential expression. Steps as numbered on the slide: 1. Query and discover data 2. Transfer bags 3. Execute parallel alignment workflow on dynamically provisioned cloud resources 3. Publish bags 4. Discover published data and execute comparison workflow
  • 25. Globus Genomics at a glance • 30 institutions, groups, labs • 10s of millions of core hours • 2 PBs raw sequences analyzed • >1500 analysis tools • 1000s genomes processed • >50 workflows • 99% uptime over the past two years • 1 PB largest single transfer to date • 5 days longest running workflow • 100s different species
  • 26. Other Globus Genomics users: Dobyns Lab, Cox Lab, Volchenboum Lab, Olopade Lab, Nagarajan Lab
  • 27. Costs are remarkably low. Pricing includes • Estimated compute • Storage (one month) • Globus Genomics platform usage • Support
  • 28. Globus Genomics – Making it routine to find needles in NGS haystacks www.globus.org/genomics
  • 29. Other Examples of Science as a Service • PDACS - Portal for data analysis services for cosmological simulations • CVRG Galaxy – Large-scale ECG Data Analysis • Globus Proteomics • eMatter – Material Science Simulations • FACE-IT - Framework to Advance Climate, Economic, and Impact Investigations with Information Technology (usefaceit.org)
  • 30. • More information on Globus Genomics: www.globus.org/genomics • More information on Globus: www.globus.org
  • 31. Our work is supported by: U.S. Department of Energy
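
Slide 18 names the transition/transversion (Ti/Tv) ratio as one of the quality metrics for called variants. The deck does not show how it was computed, so the following is only a minimal sketch of the standard definition: a transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T), and any other single-base substitution is a transversion. The function names and the sample variants are hypothetical.

```python
# Minimal sketch of a Ti/Tv calculation; names and data are illustrative only.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """Return True when ref -> alt stays within purines or within pyrimidines."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snvs):
    """Compute transitions / transversions over (ref, alt) single-base pairs.

    Records that are not simple A/C/G/T substitutions (indels, no-change
    records, ambiguous bases) are skipped rather than counted.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return float("inf") if transitions else float("nan")
    return transitions / transversions


# Hypothetical calls: two transitions, two transversions, one indel (ignored).
example_snvs = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T"), ("AT", "A")]
example_ratio = titv_ratio(example_snvs)
```

A call set whose ratio is far from the value expected for its capture target is usually a hint to re-check filtering, which is presumably why the slide groups this metric with the Mendel error rates.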

Editor's notes

  1. The basic research process remains essentially unchanged since the emergence of the scientific method in the 17th Century. Collect data, analyze data, identify patterns within data, seek explanations for those patterns, collect new data to test explanations. Speed of discovery depends to a significant degree on the time required for this cycle. Here, new technologies are changing the research process rapidly and dramatically. Data collection time used to dominate research. For example, Janet Rowley took several years to collect data on gross chromosomal abnormalities for a few patients. Today, we can generate genome data at the rate of billions of base pairs per day. So other steps become bottlenecks, like managing and analyzing data—a key issue for Midway. It is important to realize that the vast majority of research is performed within “small and medium labs.” For example, almost all of the ~1000 faculty in BSD and PSD at UChicago work in their own lab. Academic research is a cottage industry—albeit one that is increasingly interconnected—and is likely to stay that way.
  2. Given continued exponential growth along so many dimensions … … process efficiencies must improve at a comparable rate to maintain just constant progress
  3. http://sciencelife.uchospitals.edu/2013/10/28/testing-a-whole-haystack-to-find-more-needles/
  4. Peter Higgs
  5. http://museumvictoria.com.au/collections/items/772499/negative-threshing-team-with-aveling-porter-traction-engine-bagging-grain-building-a-haystack-miners-rest-victoria-1910
  6. Highlight CI Connect; coming up in Rob Gardner’s talk Highlight XSEDE’s planned adoption of user, group and profile management
  7. Our goal is to operationalize key capabilities so researchers can depend on them. Think of Gmail for science.
  8. We built this pipeline to create high quality variants using multiple genotyping algorithms
  9. Applying previous pipeline to targeted exomes in breast cancer. http://abstracts.ashg.org/cgi-bin/2014/ashg14s.pl?author=madduri&sort=ptimes&sbutton=Detail&absno=140122328&sid=84013
  10. Normal/tumor samples for 2 subjects (from GEO, South Korean population). Run the workflow on each normal and tumor sample and publish the results: QC, alignment, feature count, and alignment QC, producing QC files, an alignment file, and a count file. Then run differential expression.
  11. http://gene.gmi.ac.kr/geneList/LUAD_ExpLevels_EGFR.jpg Picture shows how the gene EGFR expresses in lung and cancer tumor samples we have analyzed. We can do very similar analysis for ADNI, PPMI and other data sources easily.
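
Note 8 and slide 21 describe a pipeline that runs several genotyping algorithms and merges their VCFs in a consensus genotyper. The merge rule itself is not given in the deck. The sketch below assumes the simplest plausible rule, keeping a variant when at least `min_callers` callers report it. It should be read as an illustration of the idea, not as the authors' method. The caller keys, the threshold, and the variant coordinates are all hypothetical.

```python
from collections import defaultdict


def consensus_calls(calls_by_caller, min_callers=2):
    """Keep variants reported by at least `min_callers` distinct callers.

    calls_by_caller maps a caller name to an iterable of variants, each
    represented as a (chrom, pos, ref, alt) tuple. The return value maps
    every retained variant to the sorted list of callers that support it.
    The simple vote is an assumed rule, not one stated in the slides.
    """
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in set(variants):  # de-duplicate within one caller
            support[variant].add(caller)
    return {
        variant: sorted(callers)
        for variant, callers in support.items()
        if len(callers) >= min_callers
    }


# Hypothetical calls from three of the callers named on slide 21.
example_calls = {
    "gatk_hc": [("chr17", 100, "A", "G"), ("chr13", 200, "C", "T")],
    "freebayes": [("chr17", 100, "A", "G")],
    "samtools_mpileup": [("chr17", 100, "A", "G"), ("chr2", 300, "G", "A")],
}
example_consensus = consensus_calls(example_calls, min_callers=2)
```

Returning the supporting callers alongside each variant keeps the evidence visible, so a stricter threshold (for instance requiring every caller to agree) can be applied later without re-reading the VCFs.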