A foggy affair – sifting through
the 'omics data flood
Wolfgang Hoeck
Principal Business Analyst – 09/23/2010

                                R&DI - Research & Development Informatics
Today's Presentation

• Terms and Domain Definitions
• The "problem space" from a scientist's perspective
• Real and perceived bottlenecks
• Deluge of different 'omics data types
• Data Summarization: Examples in the experimental
  data space
• Taking the next step in text: Adding context
• Bringing it all together – what does it mean?
• Concluding remarks

VIB Pharma - Laboratory Data Management
Conference USA                            2
'omics Data Definition and Drug Discovery

• The English-language neologism "omics" informally refers
  to a field of study in biology ending in -omics, such as
  genomics or proteomics. The related suffix -ome is used
  to address objects of study of such fields, such as the
  genome or proteome, respectively.
• Transcriptomics: the study of transcripts in a cell/tissue
• In a larger sense, 'omics data represent gene expression,
  gene amplification/deletion, and gene mutations
• In the context of cell lines and tissues, these data are
  integrated to establish a picture of a disease and
  potential intervention points

Drug Target Identification & Validation

Structured Data: Experimental Data
(raw data available; analysis yields novel insights)
• Profiling Data
       • Gene Expression
       • Gene Copy Number
       • Gene Mutation
• Functional Data
       • si/shRNA screens
       • Transfections
       • Knock-outs

Semi- & Unstructured Data: Literature Data
(raw data not or only partially available; extraction yields confirmatory insights)
• Target/Disease Associations
• Specific experiment insight
• Publication Density

Both data streams feed: Target Identification, Target Validation, Therapeutic Molecule
Key Issues in Oncology: highly diverse
data sets in need of integration
• Numerous datasets not connected to each other: datasets located on hard drives, local machines, servers
• Datasets not "written in the same language": distinctly different annotations & data formats
• No unified global interface/portal: cannot see what datasets exist internally or externally, cannot access
  datasets



[Figure: disconnected data types (TaqMan, Microarray, NextGen Sequence, Karyotype, siRNA, Cell Line Profiling, Images, Copy Number, Drug Sensitivity). Used with permission from J. Argento]
Technology Advances contributing to the
'omics data deluge

• Cheaper "old-world" technologies
      – Microarrays: More refined, widely available, cheaper
      – Analysis software: Standardized processing, more
        commercial choices, more in public domain

• Newer high-throughput technologies
      – NextGen Sequencing (reads from 30 bp to 120 bp): Revolutionary
        method to gain insight into the genomic landscape of a sample
             • Transcriptomics (DGE, Digital Gene Expression): No pre-defined
               array necessary, very sensitive, detailed insight into a gene's
               transcriptome
             • Fusion Genes/Splice Variants: Detection of genes that normally
               don't belong together
             • Alternative Transcripts: One gene, multiple versions
             • Genomics: Sequence variations among samples
        http://seqanswers.com/
Academic "mega-Projects" contributing to
the 'omics data deluge
• The Cancer Genome Atlas Project (TCGA) – A Rich
  Catalog of Human Cancer Genomes
      – 25 cancer types; 500+ samples each; microarray gene
        expression, amplification/deletion data; traditional and
        NextGen sequencing data; epigenetic data; clinical data; data
        at 4 levels of data reduction

• The Sanger Cancer Genome Project – A rich
  characterization of cancer cell lines, tissues and
  responses
      – Catalog of somatic mutations in cancer cell lines and tissues;
        web tools, databases, downloads

• 1000 Genomes Project – A Deep Catalog of Human
  Genetic Variation
      – Sequence 1000+ human genomes to detect sequence
        variations in population segments
Data Deluge – TCGA Ovarian Cancer example

3 levels of data:
• Raw, unprocessed
• Processed
• Gene-level summaries

[Screenshot taken from http://tcga-data.nci.nih.gov/tcga/: matrix of samples (tumor, normal, cell line, 11/500+) against data types (Transcriptome, Copy Number, Epigenetic Data, SNP, Sequences, Clinical). 1 sample = 25,000 data points for one kind of data]
Where are the bottlenecks?
• Storage Space and Data Transfers: TB needed (already for microarray
  data), not just GB
      – Local storage: Attached to analysis computer, fast connectivity, stores raw
        data files and processed files
      – Centralized storage: Remote storage for sharing data across sites, data
        transfer speeds are an issue

• Analytical Skills and Computing Power:
      – Computing power: e.g., 8 CPUs, 16 GB RAM, 1.5 days to process one
        large-scale copy number data set. How many CPUs can you put to work?
        Parallelizing the load on clusters does help
      – Human power: Still lots of data manipulation needed, domain knowledge is
        necessary

• Data sharing and integration capability:
      – Simplicity in user interface: Biologists are not computer scientists
      – Data cannot be shared as files: 1 gene copy number data set = 25 M rows of
        data
      – Data needs summarization to enable broad surveys
      – Query performance: The faster the better; make users aware of what they are
        asking for
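
The summarization point above can be sketched in a few lines of Python. This is only an illustration, not the tooling used internally: the row layout (sample, gene, log2 ratio), the choice of the median as the summary statistic, and all values are assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def summarize_by_gene(rows):
    """Reduce (sample, gene, log2_ratio) rows to one median value per sample and gene."""
    values = defaultdict(list)
    for sample, gene, log2_ratio in rows:
        values[(sample, gene)].append(log2_ratio)
    return {key: median(vals) for key, vals in values.items()}


# Hypothetical probe-level copy number rows; the numbers are invented.
rows = [
    ("S1", "MYC", 1.0), ("S1", "MYC", 1.5), ("S1", "MYC", 2.0),
    ("S1", "TP53", -1.0), ("S1", "TP53", -0.5),
    ("S2", "MYC", 0.0), ("S2", "MYC", 0.25),
]

summary = summarize_by_gene(rows)
for (sample, gene), value in sorted(summary.items()):
    print(sample, gene, value)
```

Seven probe-level rows collapse to three gene-level values here. The same reduction applied to a 25 M row data set is what makes a broad survey across genes and samples practical to query and share.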
Who is affected by the bottlenecks?

[Arrow alongside the list: increasing level of summarization]
• The Information Technologist
     – Cares about how much data needs to be stored where, how it
       will be transferred, how long it needs to be kept, what needs to
       be backed up, how big the database will get, etc.
• The Bioinformaticist
     – Cares about the raw data, needs powerful hardware, fast
       algorithms, plenty of storage space, means to share results
• The Scientist
     – Cares about analyzed results, wants immediate access, ability
       to interact with the data, ask specific questions about a gene,
       group of genes or samples (tissues, cell lines)
• The Manager
     – Cares about the final result: a novel, interesting, validated
       target, ideally with notification via automated e-mails

Workflow/Data for one (1) sample in
NextGen Sequencing
  Microarray: 50,000+ probe set result values per sample
  NextGen Sequencing: 100M+ reads per sample




                          depth of read coverage at each position
Junctions.bed:            possible splice junctions
Digital_expression.txt:   normalized read counts (RPKM) with coverage statistics
Abnormal_Junction.bed:    possible junctions caused by translocation events
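
The normalized read counts in Digital_expression.txt are given as RPKM (reads per kilobase of transcript per million mapped reads). A minimal sketch of that normalization follows; the function name and the example numbers are illustrative assumptions, and the pipeline that produced the files above may compute its values differently.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 2,000 reads on a 2,000 bp transcript in a 100M-read sample.
value = rpkm(2000, 2000, 100_000_000)
print(value)
```

Dividing by transcript length and by sequencing depth is what makes values comparable between genes of different sizes and between samples with different read totals.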
Our appetite for data is enormous, but can
we digest it?

• Gene Expression: Microarrays, NextGen Sequencing
• Gene Copy Number Variations: Microarrays
• Gene Mutations: Classic and NextGen Sequencing
• Gene Methylations: Microarrays
• Cell Line Panels Response Profiles: Plate-based
• siRNA screens: Gene library screens
• Phenotypic screens: Knock-out animals

  Or do we all need a lifetime supply of indigestion pills?


Is cloud computing a solution?

• Many Software-as-a-Service (SaaS) vendors
      – Compendia Bioscience: Oncomine, Oncomine Power Tools
      – NextBio: NextBio Basic, Professional, Enterprise
      – GenomeQuest: NGS Data Management
      – DNAnexus: NGS Data Management

• Keeping all the data that need integration in one place
  makes life a lot easier
• No internal IT resources required
• However, companies subscribing to these services
  lose out on building an internal knowledgebase
• Is this then a solution to this problem? Partially!
Oncomine Power Tools




[Screenshot of Oncomine Power Tools. Used with permission from Compendia Bioscience]
Data Reduction & Interactive
Visualizations – the key to success?

• Don't start in the weeds, take a step back: Create
  summarizations, e.g., summarize at Gene level to
  enable systematic surveys
• However, enable digging into the weeds: Select a
  Gene and view the details, such as the spread of sample
  values or the sequence coverage of a gene's exon
• Make it interactive at every level: Search for Gene
  lists, enable filtering by annotations (cellular location,
  target class, pathways, etc.)
• Clearly define what type of data you are dealing with:
  identity and annotations are critical
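
The three steps above (summarize, filter by annotation, drill into details) can be sketched as follows. The data structures, gene annotations and expression values are hypothetical and exist only to make the example runnable; they are not taken from the internal database.

```python
from statistics import median

# Hypothetical log2 expression values per gene and sample.
expression = {
    "EGFR": {"S1": 8.0, "S2": 10.0, "S3": 12.0},
    "KRAS": {"S1": 6.0, "S2": 6.5, "S3": 7.0},
    "TP53": {"S1": 5.0, "S2": 9.0, "S3": 7.0},
}

# Hypothetical gene annotations used for filtering.
annotations = {
    "EGFR": {"target_class": "kinase", "location": "membrane"},
    "KRAS": {"target_class": "GTPase", "location": "membrane"},
    "TP53": {"target_class": "transcription factor", "location": "nucleus"},
}


def gene_summary(expression):
    """Step back: one summary row per gene for systematic surveys."""
    return {
        gene: {
            "median": median(samples.values()),
            "min": min(samples.values()),
            "max": max(samples.values()),
            "n_samples": len(samples),
        }
        for gene, samples in expression.items()
    }


def filter_genes(summary, annotations, **criteria):
    """Keep genes whose annotations match every given criterion."""
    return sorted(
        gene for gene in summary
        if all(annotations.get(gene, {}).get(k) == v for k, v in criteria.items())
    )


def details(expression, gene):
    """Dig into the weeds: per-sample values for one selected gene."""
    return sorted(expression[gene].items())


summary = gene_summary(expression)
membrane_genes = filter_genes(summary, annotations, location="membrane")
print(membrane_genes)
print(details(expression, "TP53"))
```

Keeping the summary and the detail view as separate functions over the same underlying data mirrors the idea that users start from the survey and only pull sample-level values for the genes they select.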

Internal Prototyping Workflow
[Diagram with two layers. Operational Layer: Data Analysis, Data Mapping, Import, Profiling Data Warehouse. Knowledge Layer: Information Links, Query & Visualize]
Molecular Profiling Database – Gene
Expression Data

[Screenshot: summary table of log2 differences with a details view; filters based on available sample & gene annotations]
Molecular Profiling Database – Gene
Expression Data

[Screenshot: multiple visualizations of sample spread (scatter plot, box plot); filters based on available sample & gene annotations]
High level visualization of gene mutations
in cell lines

[Screenshot: tissue and tumor categorizations against gene mutations in specific cell lines]
A Target Prioritization Tool

• Starting point: List of Potential Targets (Target Classes)
• Input from Multiple Data Sources, each contributing Scores:
      – Gene Expression
      – Gene Copy Number
      – Gene Mutation
      – Tool Compounds
      – siRNA Functional
      – Knockout Functional
• Result: Prioritized Target List #1, Prioritized Target List #2
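
One way to picture how per-source scores could turn into more than one prioritized list is a weighted sum with different weightings. The slide does not say how the tool combines scores, so the weighted-sum scheme, the gene names, the scores and the weights below are all assumptions for illustration.

```python
def prioritize(scores, weights):
    """Rank targets by a weighted sum of per-source scores.

    Sources without a weight contribute nothing; ties are broken by name.
    """
    totals = {
        target: sum(weights.get(source, 0.0) * value
                    for source, value in per_source.items())
        for target, per_source in scores.items()
    }
    return sorted(totals, key=lambda target: (-totals[target], target))


# Hypothetical scores in the range 0..1 for three placeholder targets.
scores = {
    "GENE_A": {"expression": 0.9, "copy_number": 0.8, "mutation": 0.1, "sirna": 0.2},
    "GENE_B": {"expression": 0.3, "copy_number": 0.2, "mutation": 0.9, "sirna": 0.9},
    "GENE_C": {"expression": 0.5, "copy_number": 0.5, "mutation": 0.5, "sirna": 0.5},
}

# Two hypothetical weightings: profiling evidence only, functional evidence only.
profiling_weights = {"expression": 1.0, "copy_number": 1.0, "mutation": 1.0}
functional_weights = {"sirna": 1.0, "knockout": 1.0}

list_1 = prioritize(scores, profiling_weights)
list_2 = prioritize(scores, functional_weights)
print(list_1)
print(list_2)
```

The same score table yields a different ranking under each weighting, which is one plausible reading of why the tool produces two prioritized lists.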

Concluding remarks/Acknowledgements

• We will get more data.
• If we fail to organize and summarize, we'll waste a lot
  of time and money.
• Some standards will be necessary; we'll have to
  compromise at times.
• Not everything can and will be in one place. Defined
  interfaces (data, processes) between disparate
  systems are desperately needed to enable data
  interchange.
• Acknowledgements: my colleagues in Research Informatics &
  Hematology/Oncology Therapeutic Area
GIAB update for GRC GIAB workshop 191015GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015
 
Next-Generation Sequencing and Data Analysis.pptx
Next-Generation Sequencing and Data Analysis.pptxNext-Generation Sequencing and Data Analysis.pptx
Next-Generation Sequencing and Data Analysis.pptx
 

Dmla0910 – Hoeck– Presentation

  • 1. A foggy affair – sifting through the 'omics data flood
      Wolfgang Hoeck, Principal Business Analyst – 09/23/2010
      R&DI - Research & Development Informatics
  • 2. Today's Presentation
      - Terms and domain definitions
      - The "problem space" from a scientist's perspective
      - Real and perceived bottlenecks
      - Deluge of different 'omics data types
      - Data summarization: examples in the experimental data space
      - Taking the next step in text: adding context
      - Bringing it all together – what does it mean?
      - Concluding remarks
      VIB Pharma - Laboratory Data Management Conference USA
  • 3. 'omics Data Definition and Drug Discovery
      - The English-language neologism "omics" informally refers to a field of study in biology ending in -omics, such as genomics or proteomics. The related suffix -ome is used to address the objects of study of such fields, such as the genome or proteome, respectively.
      - Transcriptomics: the study of transcripts in a cell or tissue
      - In a larger sense, data representing gene expression, gene amplification/deletion, and gene mutations
      - In the context of cell lines and tissues, these data are integrated to establish a picture of a disease and its potential intervention points
  • 4. Drug Target Identification & Validation
      - Structured data: experimental data (raw data available; requires analysis)
      - Semi- & unstructured data: literature data (raw data not or only partially available; requires extraction)
      - Target Identification
          - Profiling data: gene expression, gene copy number, gene mutation
          - Literature: target/disease associations, specific experiment insight, publication density
      - Target Validation
          - Functional data: si/shRNA screens, transfections, knock-outs
      - Outcomes: novel insights, confirmatory insights
      - End point: therapeutic molecule
  • 5. Key Issues In Oncology: highly diverse data sets in need of integration
      - Numerous datasets are not connected to each other: datasets are located on hard drives, local machines, servers
      - Datasets are not "written in the same language": distinctly different annotations and data formats
      - No unified global interface/portal: cannot see what datasets exist internally or externally, cannot access datasets
      [Figure: disconnected data types (NextGen Sequence, Copy Number, TaqMan, siRNA, Drug Sensitivity, Karyotype, Images, Microarray, Cell Line Profiling). Used with permission from J. Argento]
  • 6. Technology Advances contributing to the 'omics data deluge
      - Cheaper "old-world" technologies
          - Microarrays: more refined, widely available, cheaper
          - Analysis software: standardized processing, more commercial choices, more in the public domain
      - Newer high-throughput technologies
          - NextGen Sequencing, from 30 bp to 120 bp: a revolutionary method to gain insight into the genomics landscape of a sample
              - Transcriptomics (DGE, Digital Gene Expression): no pre-defined array necessary, very sensitive, detailed insight into a gene's transcriptome
              - Fusion genes/splice variants: detection of genes that normally don't fit together
              - Alternative transcripts: one gene, multiple versions
              - Genomics: sequence variations among samples
      http://seqanswers.com/
  • 7. Academic "mega-projects" contributing to the 'omics data deluge
      - The Cancer Genome Atlas Project (TCGA): a rich catalog of human cancer genomes
          - 25 cancer types; 500+ samples each; microarray gene expression and amplification/deletion data; traditional and NextGen sequencing data; epigenetic data; clinical data; data at 4 levels of data reduction
      - The Sanger Cancer Genome Project: a rich characterization of cancer cell lines, tissues and responses
          - Catalog of somatic mutations in cancer cell lines and tissues; web tools, databases, downloads
      - 1000 Genomes Project: a deep catalog of human genetic variation
          - Sequence 1000+ human genomes to detect sequence variations in population segments
  • 8. Data Deluge – TCGA Ovarian Cancer example
      - Data types: transcriptome, clinical, copy number, epigenetic data, SNP, sequences
      - 3 levels of data: raw (unprocessed), processed, gene-level summaries
      - Samples: tumor, normal, cell line (11/500+)
      - 1 sample: 25,000 data points for one kind of data
      [Figure: screenshot taken from http://tcga-data.nci.nih.gov/tcga/]
  • 9. Where are the bottlenecks?
      - Storage space and data transfers: terabytes are needed (already for microarray data), not just gigabytes
          - Local storage: attached to the analysis computer, fast connectivity, holds raw and processed data files
          - Centralized storage: remote storage for sharing data across sites; data transfer speeds are an issue
      - Analytical skills and computing power
          - Computing power: e.g., with 8 CPUs and 16 GB RAM it takes 1.5 days to process one large-scale copy number data set. How many CPUs can you put to work? Parallelizing the load on clusters does help
          - Human power: still lots of data manipulation needed; domain knowledge is necessary
      - Data sharing and integration capability
          - Simplicity in the user interface: biologists are not computer scientists
          - Data cannot be shared as files: 1 gene copy number data set = 25 M rows of data
          - Data needs summarization to enable broad surveys
          - Query performance: the faster the better; make users aware of what they are asking for
  • 10. Who is affected by the bottlenecks? (listed by increasing level of summarization)
      - The Information Technologist: cares about how much data needs to be stored where, how it will be transferred, how long it needs to be kept, what needs to be backed up, how big the database will get, etc.
      - The Bioinformaticist: cares about the raw data; needs powerful hardware, fast algorithms, plenty of storage space, and means to share results
      - The Scientist: cares about analyzed results; wants immediate access and the ability to interact with the data and ask specific questions about a gene, a group of genes, or samples (tissues, cell lines)
      - The Manager: cares about the final result, a novel, interesting, validated target, ideally notified via automated e-mails
  • 11. Workflow/Data for one (1) sample in NextGen Sequencing
      - Microarray: 50,000+ probe set result values per sample
      - NextGen Sequencing: 100M+ reads per sample
          - Depth of read coverage at each position
          - Junctions.bed: possible splice junctions
          - Digital_expression.txt: normalized read counts (RPKM) with coverage statistics
          - Abnormal_Junction.bed: possible junctions caused by translocation events
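The slide mentions RPKM-normalized read counts without spelling out the calculation. As a minimal sketch, the standard definition (Reads Per Kilobase of transcript per Million mapped reads) can be written as below. The function name and the example numbers are illustrative assumptions, not values taken from the presentation or from the Digital_expression.txt file.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = reads * 1e9 / (gene length in bp * total mapped reads),
    which is reads / (length in kb) / (mapped reads in millions).
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp gene in a 100M-read sample.
example_rpkm = rpkm(500, 2_000, 100_000_000)
print(example_rpkm)
```

Normalizing by both gene length and sequencing depth is what makes values comparable across genes and across samples of different read depth.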
  • 12. Our appetite for data is enormous, but can we digest it?
      - Gene expression: microarrays, NextGen Sequencing
      - Gene copy number variations: microarrays
      - Gene mutations: classic and NextGen Sequencing
      - Gene methylations: microarray
      - Cell line panel response profiles: plate-based
      - siRNA screens: gene library screens
      - Phenotypic screens: knock-out animals
      Or do we all need a lifetime supply of indigestion pills?
  • 13. Is cloud computing a solution?
      - Many Software-as-a-Service (SaaS) vendors
          - Compendia Bioscience: Oncomine, Oncomine Power Tools
          - NextBio: NextBio Basic, Professional, Enterprise
          - GenomeQuest: NGS data management
          - DNAnexus: NGS data management
      - Keeping all the data that need integration in one place makes life a lot easier
      - No internal IT resources required
      - However, companies subscribing to these services lose out on building an internal knowledgebase
      - Is this then a solution to this problem? Partially!
  • 14. Oncomine Power Tools
      [Figure: screenshot. Used with permission from Compendia Bioscience]
  • 15. Data Reduction & Interactive Visualizations – the key to success?
      - Don't start in the weeds, take a step back: create summarizations, e.g., summarize at gene level to enable systematic surveys
      - However, enable digging into the weeds: select a gene and view the details, such as the spread of sample values or the sequence coverage of a gene's exon
      - Make it interactive at every level: search for gene lists, enable filtering by annotations (cellular location, target class, pathways, etc.)
      - Clearly define what type of data you are dealing with: identity and annotations are critical
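The first two points on this slide (summarize at gene level, but keep the details reachable) can be sketched as follows. This is only an illustration under stated assumptions: the probe identifiers, gene names, and the choice of the median as the summary statistic are hypothetical and are not described in the presentation.

```python
from collections import defaultdict
from statistics import median


def summarize_by_gene(probe_values, probe_to_gene):
    """Collapse probe-level values into one summary record per gene.

    Each record keeps the underlying values so a user can still
    drill down from the gene-level summary into the details.
    Probes without a gene annotation are skipped.
    """
    values_by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            values_by_gene[gene].append(value)
    return {
        gene: {"median": median(values), "n_probes": len(values), "values": values}
        for gene, values in values_by_gene.items()
    }


# Hypothetical probe-level measurements and annotations.
probe_values = {"p1": 1.0, "p2": 3.0, "p3": 5.0, "p4": 9.0}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}

summary = summarize_by_gene(probe_values, probe_to_gene)
print(summary["GENE_A"]["median"], summary["GENE_A"]["values"])
```

Keeping the raw values next to the summary is one way to support both the broad survey and the "digging into the weeds" step; at the data volumes mentioned in the talk, the details would more likely be fetched on demand from a database.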
  • 16. Internal Prototyping Workflow
      [Figure: workflow diagram with an Operational Layer and a Knowledge Layer; labels: Query & Visualize, Data Mapping, Import, Information Links, Profiling Data Warehouse, Data Analysis, …]
  • 17. Molecular Profiling Database – Gene Expression Data
      [Figure: screenshot; labels: Filters based on available sample & gene annotations, Log2 difference, Summary Table, Details]
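The slide labels a "Log2 difference" column but does not say how it is computed. One common reading, offered here purely as an assumption, is the mean log2 expression in one sample group minus the mean log2 expression in another. The group names and values below are hypothetical.

```python
import math
from statistics import mean


def log2_difference(group_a, group_b):
    """Mean log2 value of group A minus mean log2 value of group B.

    Inputs are positive, linear-scale expression values. A result of
    1.0 corresponds to a two-fold higher (geometric) mean in group A.
    """
    log_a = mean(math.log2(value) for value in group_a)
    log_b = mean(math.log2(value) for value in group_b)
    return log_a - log_b


# Hypothetical linear-scale expression values for one gene.
tumor = [8.0, 32.0]    # log2 values: 3 and 5, mean 4
normal = [4.0, 4.0]    # log2 values: 2 and 2, mean 2

difference = log2_difference(tumor, normal)
print(difference)
```

Working on the log2 scale keeps up- and down-regulation symmetric around zero, which suits a sortable summary table.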
  • 18. Molecular Profiling Database – Gene Expression Data, Multiple Visualizations
      [Figure: screenshot; labels: Filters based on available sample & gene annotations, Sample Spread (Scatter Plot), Sample Spread (Box Plot)]
  • 19. High-level visualization of gene mutations in cell lines
      [Figure: screenshot; labels: Tissue and Tumor categorizations, Gene mutations in specific cell lines]
  • 20. A Target Prioritization Tool
      - Input from multiple data sources: list of potential targets (target classes)
      - Scores per data source: Gene Expression, Gene Copy Number, Gene Mutation, Tool Compounds, siRNA Functional, Knockout Functional
      - Output: Prioritized Target List #1, Prioritized Target List #2
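The slide shows per-source scores feeding into more than one prioritized list but gives no scoring formula. A minimal sketch, assuming a simple weighted sum with a different weighting per list, is shown below. The target names, scores, and weights are all hypothetical, and the weighted sum is a stand-in for whatever the actual tool does.

```python
def prioritize(target_scores, weights):
    """Rank targets by the weighted sum of their per-source scores.

    target_scores maps target name -> {data source: score}.
    weights maps data source -> weight; unlisted sources count as 0.
    Returns target names ordered from highest to lowest total.
    """
    def total(scores):
        return sum(weights.get(source, 0.0) * score for source, score in scores.items())

    return sorted(target_scores, key=lambda name: total(target_scores[name]), reverse=True)


# Hypothetical scores (0 to 1) for three candidate targets.
target_scores = {
    "TARGET_A": {"expression": 0.9, "copy_number": 0.2, "sirna": 0.1},
    "TARGET_B": {"expression": 0.3, "copy_number": 0.8, "sirna": 0.9},
    "TARGET_C": {"expression": 0.5, "copy_number": 0.5, "sirna": 0.5},
}

# Two assumed weightings: one favors profiling data, one favors functional data.
profiling_weights = {"expression": 2.0, "copy_number": 1.0, "sirna": 0.0}
functional_weights = {"expression": 0.0, "copy_number": 0.0, "sirna": 1.0}

prioritized_list_1 = prioritize(target_scores, profiling_weights)
prioritized_list_2 = prioritize(target_scores, functional_weights)
print(prioritized_list_1)
print(prioritized_list_2)
```

Changing only the weights yields a different ranking from the same scores, which is one plausible way a single score table could produce the two prioritized lists named on the slide.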
  • 21. Concluding remarks/Acknowledgements
      - We will get more data.
      - If we fail to organize and summarize, we'll waste a lot of time and money.
      - Some standards will be necessary; we'll have to compromise at times.
      - Not everything can and will be in one place. Defined interfaces (data, processes) between disparate systems are desperately needed to enable data interchange.
      - Acknowledgements: my colleagues in Research Informatics & Hematology/Oncology Therapeutic Area