SlideShare a Scribd company logo
1 of 20
Why Aren't We Benchmarking
Bioinformatics?
Joe Parker
Early Career Research Fellow (Phylogenomics)
Department of Biodiversity Informatics and Spatial Analysis
Royal Botanic Gardens, Kew
Richmond TW9 3AB
joe.parker@kew.org
Outline
• Introduction
• Brief history of Bioinformatics
• Benchmarking in Bioinformatics
• Case study 1: Typical benchmarking across
environments
• Case study 2: Mean-variance relationship for
repeated measures
• Conclusions: implication for statistical
genomics
A (very) brief history of
bioinformatics
Kluge & Farris (1969) Syst. Zool. 18:1-32
A (very) brief history of
bioinformatics
Stewart et al. (1987) Nature 330:401-404
A (very) brief history of
bioinformatics
ENCODE Consortium (2012) Nature 489:57-74
A (very) brief history of
bioinformatics
Kluge & Farris (1969) Syst. Zool. 18:1-32
Benchmarking to biologists
• Benchmarking as a comparative process
• i.e. ‘which software’s best?’ / ‘which
platform’
• Benchmarking application logic / profiling
unknown
• Environments / runtimes generally either
assumed to be identical, or else loosely
categorised into ‘laptops vs clusters’
Case Study 1
aka
‘Which program’s the best?’
Bioinformatics environments are
very heterogeneous
• Laptop:
– Portable
– Very costly form-factor
– Maté? Beer?
• Raspi:
– Low: cost, energy (& power)
– Highly portable
– Hackable form-factor
• Clusters:
• Not portable, setup costs
• The cloud:
– Power closely linked to budget (as limited as)
– Almost infinitely scalable
-Have to have a connection to get data up there
(and down!)
– Fiddly setup
Benchmarking to biologists
Comparison
System Arch
CPU type,
clock GHz
cores
RAM Gb /
MHz / type
HDD Gb
Haemodorum i686
Xeon E5620 @
2.4
8 33
1000
@ SATA
Raspberry Pi 2
B+
ARM
ARMv7
@ 1.0
1 1
8
@ flash card
Macbook Pro
(2011)
x64
Core i7
@ 2.2
4 8
250
@ SSD
EC2
m4.10xlarge
x64
Xeon E5
@ 2.4
40 160
320
@ SSD
Reviewing / comparing new methods
• Biological problems
often scale horribly
unpredictably
• Algorithm analyses
• So empirical
measurements on
different problem
sets to predict how
problems will scale…
Workflow
Setup
BLAST
2.2.30
CEGMA
genes
Short reads
Concatenate hits to CEGMA
alignments
Muscle
3.8.31
RAxML
7.2.8+
Set up workflow, binaries, and reference /
alignment data.
Deploy to machines.
Protein-protein blast reads (from MG-RAST
repository, Bass Strait oil field) against 458
core eukaryote genes from CEGMA. Keep
only top hits. Use max. num_threads
available.
Append top hit sequences to CEGMA
alignments.
For each:
Align in MUSCLE using default
parameters
Infer de novo phylogeny in RAxML
under Dayhoff, random starting tree
and max. PTHREADS.
Output and parse times.
Results - BLASTP
Results - RAxML
Case Study 2
aka
‘What the hell’s a random
seed?’
Properly benchmarking workflows
• Ignoring limiting-steps analyses
– in many workflows might actually be data
cleaning / parsing / transformation
• Or (most common error) inefficiently
iterating
• Or even disk I/O!
Workflow benchmarking very rare
• Many bioinformatics workflows / pipelines
limiting at odd steps, parsing etc
• http://beast.bio.ed.ac.uk/benchmarks
• Many e.g. bioinformatics papers
• More harm than good?
Conclusion
• Biologists and error
• Current practice
• Help!
• Interesting challenges
too…
Thanks:
RBG Kew, BI&SA, Mike Chester

More Related Content

What's hot

Reconstructing paleoenvironments using metagenomics
Reconstructing paleoenvironments using metagenomicsReconstructing paleoenvironments using metagenomics
Reconstructing paleoenvironments using metagenomicsRutger Vos
 
Expanding Your Research Capabilities Using Targeted NGS
Expanding Your Research Capabilities Using Targeted NGSExpanding Your Research Capabilities Using Targeted NGS
Expanding Your Research Capabilities Using Targeted NGSIntegrated DNA Technologies
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorialc.titus.brown
 
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel WeitschekGenomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel WeitschekData Driven Innovation
 
Molecular QC: Interpreting your Bioinformatics Pipeline
Molecular QC: Interpreting your Bioinformatics PipelineMolecular QC: Interpreting your Bioinformatics Pipeline
Molecular QC: Interpreting your Bioinformatics PipelineCandy Smellie
 
Next Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and OpportunitiesNext Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and OpportunitiesChung-Tsai Su
 
Making Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and AnnotationsMaking Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and AnnotationsJoão André Carriço
 
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...VHIR Vall d’Hebron Institut de Recerca
 
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...EMC
 
2nd stripe rust izmir
2nd stripe rust izmir2nd stripe rust izmir
2nd stripe rust izmirICARDA
 
DNA sequencer by kk sahu
DNA sequencer by kk sahu DNA sequencer by kk sahu
DNA sequencer by kk sahu KAUSHAL SAHU
 
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun SequencesTools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun SequencesSurya Saha
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global communityExternalEvents
 
Analyzing the exome—focusing your NGS analysis with high performance target c...
Analyzing the exome—focusing your NGS analysis with high performance target c...Analyzing the exome—focusing your NGS analysis with high performance target c...
Analyzing the exome—focusing your NGS analysis with high performance target c...Integrated DNA Technologies
 
Phylogenomic methods for comparative evolutionary biology - University Colleg...
Phylogenomic methods for comparative evolutionary biology - University Colleg...Phylogenomic methods for comparative evolutionary biology - University Colleg...
Phylogenomic methods for comparative evolutionary biology - University Colleg...Joe Parker
 
A decade into Next Generation Sequencing on marine non-model organisms: curre...
A decade into Next Generation Sequencing on marine non-model organisms: curre...A decade into Next Generation Sequencing on marine non-model organisms: curre...
A decade into Next Generation Sequencing on marine non-model organisms: curre...Alexander Jueterbock
 
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015Torsten Seemann
 
Ig Nobel Prize 2021 – The 9 Funny and Scientific Winning Achievements
Ig Nobel Prize 2021 – The 9 Funny and Scientific Winning AchievementsIg Nobel Prize 2021 – The 9 Funny and Scientific Winning Achievements
Ig Nobel Prize 2021 – The 9 Funny and Scientific Winning AchievementsEurofins Genomics Germany GmbH
 

What's hot (20)

Big data nebraska
Big data nebraskaBig data nebraska
Big data nebraska
 
Reconstructing paleoenvironments using metagenomics
Reconstructing paleoenvironments using metagenomicsReconstructing paleoenvironments using metagenomics
Reconstructing paleoenvironments using metagenomics
 
Expanding Your Research Capabilities Using Targeted NGS
Expanding Your Research Capabilities Using Targeted NGSExpanding Your Research Capabilities Using Targeted NGS
Expanding Your Research Capabilities Using Targeted NGS
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel WeitschekGenomic Big Data Management, Integration and Mining - Emanuel Weitschek
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek
 
Molecular QC: Interpreting your Bioinformatics Pipeline
Molecular QC: Interpreting your Bioinformatics PipelineMolecular QC: Interpreting your Bioinformatics Pipeline
Molecular QC: Interpreting your Bioinformatics Pipeline
 
Next Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and OpportunitiesNext Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and Opportunities
 
Making Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and AnnotationsMaking Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and Annotations
 
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
 
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
 
Ion Torrent Sequencing
Ion Torrent SequencingIon Torrent Sequencing
Ion Torrent Sequencing
 
2nd stripe rust izmir
2nd stripe rust izmir2nd stripe rust izmir
2nd stripe rust izmir
 
DNA sequencer by kk sahu
DNA sequencer by kk sahu DNA sequencer by kk sahu
DNA sequencer by kk sahu
 
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun SequencesTools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global community
 
Analyzing the exome—focusing your NGS analysis with high performance target c...
Analyzing the exome—focusing your NGS analysis with high performance target c...Analyzing the exome—focusing your NGS analysis with high performance target c...
Analyzing the exome—focusing your NGS analysis with high performance target c...
 
Phylogenomic methods for comparative evolutionary biology - University Colleg...
Phylogenomic methods for comparative evolutionary biology - University Colleg...Phylogenomic methods for comparative evolutionary biology - University Colleg...
Phylogenomic methods for comparative evolutionary biology - University Colleg...
 
A decade into Next Generation Sequencing on marine non-model organisms: curre...
A decade into Next Generation Sequencing on marine non-model organisms: curre...A decade into Next Generation Sequencing on marine non-model organisms: curre...
A decade into Next Generation Sequencing on marine non-model organisms: curre...
 
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
 
Ig Nobel Prize 2021 – The 9 Funny and Scientific Winning Achievements
Ig Nobel Prize 2021 – The 9 Funny and Scientific Winning AchievementsIg Nobel Prize 2021 – The 9 Funny and Scientific Winning Achievements
Ig Nobel Prize 2021 – The 9 Funny and Scientific Winning Achievements
 

Viewers also liked

Capture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid Detection
Capture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid DetectionCapture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid Detection
Capture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid DetectionCFTCC
 
Interpreting ‘tree space’ in the context of very large empirical datasets
Interpreting ‘tree space’ in the context of very large empirical datasetsInterpreting ‘tree space’ in the context of very large empirical datasets
Interpreting ‘tree space’ in the context of very large empirical datasetsJoe Parker
 
Real-time Phylogenomics: Joe Parker
Real-time Phylogenomics: Joe ParkerReal-time Phylogenomics: Joe Parker
Real-time Phylogenomics: Joe ParkerJoe Parker
 
Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...
Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...
Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...Joe Parker
 
High Throughput Sequencing Technologies: What We Can Know
High Throughput Sequencing Technologies: What We Can KnowHigh Throughput Sequencing Technologies: What We Can Know
High Throughput Sequencing Technologies: What We Can KnowBrian Krueger
 
20160308 dtl ngs_focus_group_meeting_slideshare
20160308 dtl ngs_focus_group_meeting_slideshare20160308 dtl ngs_focus_group_meeting_slideshare
20160308 dtl ngs_focus_group_meeting_slidesharehansjansen9999
 
Nanotechnology in biology and medicine
Nanotechnology in biology and medicineNanotechnology in biology and medicine
Nanotechnology in biology and medicineravirajkuber991
 
Reframing Phylogenomics
Reframing PhylogenomicsReframing Phylogenomics
Reframing PhylogenomicsJoe Parker
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analysesrjorton
 
A Comparison of NGS Platforms.
A Comparison of NGS Platforms.A Comparison of NGS Platforms.
A Comparison of NGS Platforms.mkim8
 
Driving Connectivity in the Scottish Islands: Droneways and Airmasts
Driving Connectivity in the Scottish Islands: Droneways and AirmastsDriving Connectivity in the Scottish Islands: Droneways and Airmasts
Driving Connectivity in the Scottish Islands: Droneways and Airmasts3G4G
 
An Introduction to IoT: Connectivity & Case Studies
An Introduction to IoT: Connectivity & Case StudiesAn Introduction to IoT: Connectivity & Case Studies
An Introduction to IoT: Connectivity & Case Studies3G4G
 
5G Network Architecture and Design
5G Network Architecture and Design5G Network Architecture and Design
5G Network Architecture and Design3G4G
 
3GPP Standards for the Internet-of-Things
3GPP Standards for the Internet-of-Things3GPP Standards for the Internet-of-Things
3GPP Standards for the Internet-of-ThingsEiko Seidel
 
NGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsNGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsAnnelies Haegeman
 

Viewers also liked (16)

Capture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid Detection
Capture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid DetectionCapture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid Detection
Capture and Release Nanopore-Nanofiber Mesh: pH Triggered Nucleic Acid Detection
 
Interpreting ‘tree space’ in the context of very large empirical datasets
Interpreting ‘tree space’ in the context of very large empirical datasetsInterpreting ‘tree space’ in the context of very large empirical datasets
Interpreting ‘tree space’ in the context of very large empirical datasets
 
Real-time Phylogenomics: Joe Parker
Real-time Phylogenomics: Joe ParkerReal-time Phylogenomics: Joe Parker
Real-time Phylogenomics: Joe Parker
 
Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...
Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...
Phylogenomic Convergence Detection - Evolutionary Biology Meeting in Marseill...
 
High Throughput Sequencing Technologies: What We Can Know
High Throughput Sequencing Technologies: What We Can KnowHigh Throughput Sequencing Technologies: What We Can Know
High Throughput Sequencing Technologies: What We Can Know
 
20160308 dtl ngs_focus_group_meeting_slideshare
20160308 dtl ngs_focus_group_meeting_slideshare20160308 dtl ngs_focus_group_meeting_slideshare
20160308 dtl ngs_focus_group_meeting_slideshare
 
Oxford Nanopore MinION
Oxford Nanopore MinIONOxford Nanopore MinION
Oxford Nanopore MinION
 
Nanotechnology in biology and medicine
Nanotechnology in biology and medicineNanotechnology in biology and medicine
Nanotechnology in biology and medicine
 
Reframing Phylogenomics
Reframing PhylogenomicsReframing Phylogenomics
Reframing Phylogenomics
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
A Comparison of NGS Platforms.
A Comparison of NGS Platforms.A Comparison of NGS Platforms.
A Comparison of NGS Platforms.
 
Driving Connectivity in the Scottish Islands: Droneways and Airmasts
Driving Connectivity in the Scottish Islands: Droneways and AirmastsDriving Connectivity in the Scottish Islands: Droneways and Airmasts
Driving Connectivity in the Scottish Islands: Droneways and Airmasts
 
An Introduction to IoT: Connectivity & Case Studies
An Introduction to IoT: Connectivity & Case StudiesAn Introduction to IoT: Connectivity & Case Studies
An Introduction to IoT: Connectivity & Case Studies
 
5G Network Architecture and Design
5G Network Architecture and Design5G Network Architecture and Design
5G Network Architecture and Design
 
3GPP Standards for the Internet-of-Things
3GPP Standards for the Internet-of-Things3GPP Standards for the Internet-of-Things
3GPP Standards for the Internet-of-Things
 
NGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsNGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platforms
 

Similar to Joe parker-benchmarking-bioinformatics

Thesis def
Thesis defThesis def
Thesis defJay Vyas
 
Novozymes Enzyme Stability Prediction
Novozymes Enzyme Stability PredictionNovozymes Enzyme Stability Prediction
Novozymes Enzyme Stability PredictionIttrainingIttraining
 
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Paolo Missier
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012Dan Gaston
 
20211119 ntuh azure hpc workshop final
20211119 ntuh azure hpc workshop final20211119 ntuh azure hpc workshop final
20211119 ntuh azure hpc workshop finalMeng-Ru (Raymond) Tsai
 
Distributed approach for Peptide Identification
Distributed approach for Peptide IdentificationDistributed approach for Peptide Identification
Distributed approach for Peptide Identificationabhinav vedanbhatla
 
Paper memo: persistent homology on biological problems
Paper memo: persistent homology on biological problemsPaper memo: persistent homology on biological problems
Paper memo: persistent homology on biological problemsRyohei Suzuki
 
SOT short course on computational toxicology
SOT short course on computational toxicology SOT short course on computational toxicology
SOT short course on computational toxicology Sean Ekins
 
171114 best practices for benchmarking variant calls justin
171114 best practices for benchmarking variant calls justin171114 best practices for benchmarking variant calls justin
171114 best practices for benchmarking variant calls justinGenomeInABottle
 
Bioinfomatics Presentation
Bioinfomatics PresentationBioinfomatics Presentation
Bioinfomatics PresentationZhenhong Bao
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible researchYannick Wurm
 
BioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataBioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataPhilip Cheung
 

Similar to Joe parker-benchmarking-bioinformatics (20)

Genome Assembly
Genome AssemblyGenome Assembly
Genome Assembly
 
Folker Meyer: Metagenomic Data Annotation
Folker Meyer: Metagenomic Data AnnotationFolker Meyer: Metagenomic Data Annotation
Folker Meyer: Metagenomic Data Annotation
 
Thesis def
Thesis defThesis def
Thesis def
 
Novozymes Enzyme Stability Prediction
Novozymes Enzyme Stability PredictionNovozymes Enzyme Stability Prediction
Novozymes Enzyme Stability Prediction
 
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
Design and evaluation of a genomics variant analysis pipeline using GATK Spar...
 
Cloud bioinformatics 2
Cloud bioinformatics 2Cloud bioinformatics 2
Cloud bioinformatics 2
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
 
20211119 ntuh azure hpc workshop final
20211119 ntuh azure hpc workshop final20211119 ntuh azure hpc workshop final
20211119 ntuh azure hpc workshop final
 
Cshl minseqe 2013_ouellette
Cshl minseqe 2013_ouelletteCshl minseqe 2013_ouellette
Cshl minseqe 2013_ouellette
 
Intro to databases
Intro to databasesIntro to databases
Intro to databases
 
Distributed approach for Peptide Identification
Distributed approach for Peptide IdentificationDistributed approach for Peptide Identification
Distributed approach for Peptide Identification
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Paper memo: persistent homology on biological problems
Paper memo: persistent homology on biological problemsPaper memo: persistent homology on biological problems
Paper memo: persistent homology on biological problems
 
SOT short course on computational toxicology
SOT short course on computational toxicology SOT short course on computational toxicology
SOT short course on computational toxicology
 
171114 best practices for benchmarking variant calls justin
171114 best practices for benchmarking variant calls justin171114 best practices for benchmarking variant calls justin
171114 best practices for benchmarking variant calls justin
 
HUG @ NGCLE@e-Novia 15.11.2017
HUG @ NGCLE@e-Novia 15.11.2017HUG @ NGCLE@e-Novia 15.11.2017
HUG @ NGCLE@e-Novia 15.11.2017
 
Bioinfomatics Presentation
Bioinfomatics PresentationBioinfomatics Presentation
Bioinfomatics Presentation
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 
Bioinformatica 06-10-2011-t2-databases
Bioinformatica 06-10-2011-t2-databasesBioinformatica 06-10-2011-t2-databases
Bioinformatica 06-10-2011-t2-databases
 
BioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataBioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadata
 

Recently uploaded

《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》rnrncn29
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024Jene van der Heide
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPirithiRaju
 
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxmaryFF1
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxuniversity
 
well logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxwell logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxzaydmeerab121
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPirithiRaju
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxkumarsanjai28051
 
Introduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxIntroduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxMedical College
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
projectile motion, impulse and moment
projectile  motion, impulse  and  momentprojectile  motion, impulse  and  moment
projectile motion, impulse and momentdonamiaquintan2
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naJASISJULIANOELYNV
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingNetHelix
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptJoemSTuliba
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫qfactory1
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptxpallavirawat456
 

Recently uploaded (20)

《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
 
AZOTOBACTER AS BIOFERILIZER.PPTX
AZOTOBACTER AS BIOFERILIZER.PPTXAZOTOBACTER AS BIOFERILIZER.PPTX
AZOTOBACTER AS BIOFERILIZER.PPTX
 
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
 
well logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxwell logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptx
 
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdfPests of Blackgram, greengram, cowpea_Dr.UPR.pdf
Pests of Blackgram, greengram, cowpea_Dr.UPR.pdf
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptx
 
PLASMODIUM. PPTX
PLASMODIUM. PPTXPLASMODIUM. PPTX
PLASMODIUM. PPTX
 
Introduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxIntroduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptx
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
projectile motion, impulse and moment
projectile  motion, impulse  and  momentprojectile  motion, impulse  and  moment
projectile motion, impulse and moment
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by na
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.ppt
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 

Joe parker-benchmarking-bioinformatics

  • 1. Why Aren't We Benchmarking Bioinformatics? Joe Parker Early Career Research Fellow (Phylogenomics) Department of Biodiversity Informatics and Spatial Analysis Royal Botanic Gardens, Kew Richmond TW9 3AB joe.parker@kew.org
  • 2. Outline • Introduction • Brief history of Bioinformatics • Benchmarking in Bioinformatics • Case study 1: Typical benchmarking across environments • Case study 2: Mean-variance relationship for repeated measures • Conclusions: implication for statistical genomics
  • 3. A (very) brief history of bioinformatics Kluge & Farris (1969) Syst. Zool. 18:1-32
  • 4. A (very) brief history of bioinformatics Stewart et al. (1987) Nature 330:401-404
  • 5. A (very) brief history of bioinformatics ENCODE Consortium (2012) Nature 489:57-74
  • 6. A (very) brief history of bioinformatics Kluge & Farris (1969) Syst. Zool. 18:1-32
  • 7. Benchmarking to biologists • Benchmarking as a comparative process • i.e. ‘which software’s best?’ / ‘which platform’ • Benchmarking application logic / profiling unknown • Environments / runtimes generally either assumed to be identical, or else loosely categorised into ‘laptops vs clusters’
  • 8. Case Study 1 aka ‘Which program’s the best?’
  • 9. Bioinformatics environments are very heterogeneous • Laptop: – Portable – Very costly form-factor – Maté? Beer? • Raspi: – Low: cost, energy (& power) – Highly portable – Hackable form-factor • Clusters: • Not portable, setup costs • The cloud: – Power closely linked to budget (as limited as) – Almost infinitely scalable -Have to have a connection to get data up there (and down!) – Fiddly setup
  • 11. Comparison System Arch CPU type, clock GHz cores RAM Gb / MHz / type HDD Gb Haemodorum i686 Xeon E5620 @ 2.4 8 33 1000 @ SATA Raspberry Pi 2 B+ ARM ARMv7 @ 1.0 1 1 8 @ flash card Macbook Pro (2011) x64 Core i7 @ 2.2 4 8 250 @ SSD EC2 m4.10xlarge x64 Xeon E5 @ 2.4 40 160 320 @ SSD
  • 12. Reviewing / comparing new methods • Biological problems often scale horribly unpredictably • Algorithm analyses • So empirical measurements on different problem sets to predict how problems will scale…
  • 13. Workflow Setup BLAST 2.2.30 CEGMA genes Short reads Concatenate hits to CEGMA alignments Muscle 3.8.31 RAxML 7.2.8+ Set up workflow, binaries, and reference / alignment data. Deploy to machines. Protein-protein blast reads (from MG-RAST repository, Bass Strait oil field) against 458 core eukaryote genes from CEGMA. Keep only top hits. Use max. num_threads available. Append top hit sequences to CEGMA alignments. For each: Align in MUSCLE using default parameters Infer de novo phylogeny in RAxML under Dayhoff, random starting tree and max. PTHREADS. Output and parse times.
  • 16. Case Study 2 aka ‘What the hell’s a random seed?’
  • 17.
  • 18. Properly benchmarking workflows • Ignoring limiting-steps analyses – in many workflows might actually be data cleaning / parsing / transformation • Or (most common error) inefficiently iterating • Or even disk I/O!
  • 19. Workflow benchmarking very rare • Many bioinformatics workflows / pipelines limiting at odd steps, parsing etc • http://beast.bio.ed.ac.uk/benchmarks • Many e.g. bioinformatics papers • More harm than good?
  • 20. Conclusion • Biologists and error • Current practice • Help! • Interesting challenges too… Thanks: RBG Kew, BI&SA, Mike Chester

Editor's Notes

  1. NB slot is 30mins INCLUDING questions Bioinformatics (computational manipulation, analysis and modelling of biological and biochemical data) is a substantial area of primary research activity as well as a growing market worth up to $13bn US (2020) by some estimates[1]. In particular, exponential growth in the capacity and speed of DNA sequencers has led to a critical shortfall in analytical tools often termed a 'DNA deluge'[2]. Despite this critical need and substantial and growing investment in bioinformatics software, many best practices in software design are neglected, ignored, or even unacknowledged. I will illustrate the problem with an overview of two case studies in current bioinformatics projects, and present data showing worrying reproducibility deficits in commonly-used software. [1]http://www.marketsandmarkets.com/PressReleases/bioinformatics-market.asp [2]http://spectrum.ieee.org/biomedical/devices/the-dna-data-deluge
  2. I am basically an empirical scientist But driven by methods Replication -> empirical investigation This is a big gap in bioinfo at the mo
  3. Biological problems often scale horribly unpredictably On top of which hardly any biologists and few bioinformaticians (me included) understand algorithmic complexity / analysis properly So empirical measurements on different problem sets really the only way to predict how problems will scale…
  4. Wall-clock time on ly one part of the process Slowest (limiting) steps in many workflows might actually be data cleaning / parsing / transformation Or inefficiently iterating over parametizations etc Or even disk I/O!
  5. Many bioinformatics workflows / pipelines limiting at odd steps, parsing etc http://beast.bio.ed.ac.uk/benchmarks Many e.g. bioinformatics papers Biologists are benchmarking the wrong things, or not at all
  6. Many bioinformatics workflows / pipelines limiting at odd steps, parsing etc http://beast.bio.ed.ac.uk/benchmarks Many e.g. bioinformatics papers Biologists are benchmarking the wrong things, or not at all