Dmla0910 – Hoeck– Presentation

A foggy affair – sifting through the 'omics data flood
Wolfgang Hoeck
Principal Business Analyst – 09/23/2010

R&DI - Research & Development Informatics

Today's Presentation

▪ Terms and Domain Definitions
▪ The “problem space” from a scientist's perspective
▪ Real and perceived bottlenecks
▪ Deluge of different 'omics data types
▪ Data Summarization: Examples in the experimental data space
▪ Taking the next step in text: Adding context
▪ Bringing it all together – what does it mean?
▪ Concluding remarks

VIB Pharma - Laboratory Data Management
Conference USA 2

'omics Data Definition and Drug Discovery

▪ The English-language neologism omics informally refers to a field of study in biology ending in -omics, such as genomics or proteomics. The related suffix -ome is used to address the objects of study of such fields, such as the genome or proteome, respectively.
▪ Transcriptomics: the study of transcripts in a cell/tissue
▪ In a larger sense, data representing gene expression, gene amplification/deletion, and gene mutations
▪ In the context of cell lines and tissues, these data are integrated to establish a picture of a disease and potential intervention points


Drug Target Identification & Validation

Structured Data: Experimental Data (raw data available) → Analysis → Novel insights
• Profiling Data
  – Gene Expression
  – Gene Copy Number
  – Gene Mutation
• Functional Data
  – si/shRNA screens
  – Transfections
  – Knock-outs

Semi- & Unstructured Data: Literature Data (raw data not or only partially available) → Extraction → Confirmatory insights
• Target/Disease Associations
• Specific experiment insight
• Publication Density

Target Identification → Target Validation → Therapeutic Molecule

Key Issues in Oncology: highly diverse data sets in need of integration
▪ Numerous datasets not connected to each other: datasets located on hard drives, local machines, servers
▪ Datasets not “written in the same language”: distinctly different annotations & data formats
▪ No unified global interface/portal: cannot see what datasets exist internally or externally, cannot access datasets

[Figure: Cell Line Profiling surrounded by separate data types, with X marks between them: NextGen Sequence, Copy Number, TaqMan, siRNA, Drug Sensitivity, Karyotype, Images, Microarray. Used with permission from J. Argento.]

Technology Advances contributing to the 'omics data deluge

▪ Cheaper “old-world” technologies
– Microarrays: more refined, widely available, cheaper
– Analysis software: standardized processing, more commercial choices, more in the public domain

▪ Newer high-throughput technologies
– NextGen Sequencing (read lengths from 30 bp to 120 bp): revolutionary method to gain insight into the genomic landscape of a sample
• Transcriptomics (DGE, Digital Gene Expression): no pre-defined array necessary, very sensitive, detailed insight into a gene's transcriptome
• Fusion Genes/Splice Variants: detection of genes that are not normally joined together
• Alternative Transcripts: one gene, multiple versions
• Genomics: sequence variations among samples
http://seqanswers.com/

Academic “mega-Projects” contributing to the 'omics data deluge
▪ The Cancer Genome Atlas Project (TCGA) – A Rich Catalog of Human Cancer Genomes
– 25 cancer types; 500+ samples each; microarray gene expression, amplification/deletion data; traditional and NextGen sequencing data; epigenetic data; clinical data; data at 4 levels of data reduction

▪ The Sanger Cancer Genome Project – A rich characterization of cancer cell lines, tissues and responses
– Catalog of somatic mutations in cancer cell lines and tissues; web tools, databases, downloads

▪ 1000 Genomes Project – A Deep Catalog of Human Genetic Variation
– Sequence 1000+ human genomes to detect sequence variations in population segments

Data Deluge – TCGA Ovarian Cancer example

Data types: Transcriptome, Clinical, Copy Number, Epigenetic Data, SNP, Sequences

3 levels of data:
• Raw, unprocessed
• Processed
• Gene-level summaries

Samples (tumor, normal, cell line, 11/500+)
1 Sample (25000 data points, one kind of data)

Screenshot taken from http://tcga-data.nci.nih.gov/tcga/


Where are the bottlenecks?
▪ Storage Space and Data Transfers: TB needed (already for microarray data), not just GB
– Local storage: attached to the analysis computer, fast connectivity, stores raw data files and processed files
– Centralized storage: remote storage for sharing data across sites; data transfer speeds are an issue

▪ Analytical Skills and Computing Power:
– Computing power: e.g. 8 CPUs, 16 GB RAM, 1.5 days to process one large-scale copy number data set. How many CPUs can you put to work? Parallelizing the load on clusters does help
– Human power: still lots of data manipulation needed; domain knowledge is necessary

▪ Data sharing and integration capability:
– Simplicity in user interface: biologists are not computer scientists
– Data cannot be shared as files: 1 gene copy number data set = 25 M rows of data
– Data needs summarization to enable broad surveys
– Query performance: the faster the better; make users aware of what they are asking for

Who is affected by the bottlenecks?

Increasing level of summarization (top to bottom):
▪ The Information Technologist
– Cares about how much data needs to be stored where, how it will be transferred, how long it needs to be kept, what needs to be backed up, how big the database will get, etc.
▪ The Bioinformaticist
– Cares about the raw data; needs powerful hardware, fast algorithms, plenty of storage space, and means to share results
▪ The Scientist
– Cares about analyzed results; wants immediate access and the ability to interact with the data and ask specific questions about a gene, a group of genes, or samples (tissues, cell lines)
▪ The Manager
– Cares about the final result: a novel, interesting, validated target, ideally announced via automated e-mails


Workflow/Data for one (1) sample in NextGen Sequencing
Microarray: 50,000+ probe set result values per sample
NextGen Sequencing: 100M+ reads per sample

depth of read coverage at each position
Junctions.bed: possible splice junctions
Digital_expression.txt: normalized read counts (RPKM) with coverage statistics
Abnormal_Junction.bed: possible junctions caused by translocation events
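
The normalized read counts (RPKM) listed for Digital_expression.txt follow the standard definition: reads per kilobase of transcript per million mapped reads. The sketch below illustrates that arithmetic only; the function name `rpkm` and the example numbers are assumptions for illustration, not taken from the workflow shown on the slide.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Illustrative numbers (assumed): 2,000 reads on a 2,000 bp transcript
# in a sample with 100 million mapped reads.
example_rpkm = rpkm(2_000, 2_000, 100_000_000)
print(example_rpkm)
```

Dividing by both transcript length and sequencing depth is what makes values comparable across genes and across samples with different read totals.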

Our appetite for data is enormous, but can we digest it?

▪ Gene Expression: Microarrays, NextGen Sequencing
▪ Gene Copy Number Variations: Microarrays
▪ Gene Mutations: Classic and NextGen Sequencing
▪ Gene Methylations: Microarray
▪ Cell Line Panels Response Profiles: Plate-based
▪ siRNA screens: gene library screens
▪ Phenotypic screens: Knock-out animals

Or do we all need a lifetime supply of indigestion pills?


Is cloud computing a solution?

▪ Many Software-as-a-Service (SaaS) vendors
– Compendia Bioscience: Oncomine, Oncomine Power Tools
– NextBio: NextBio Basic, Professional, Enterprise
– GenomeQuest: NGS Data Management
– DNAnexus: NGS Data Management

▪ Keeping all the data that need integration in one place makes life a lot easier
▪ No internal IT resources required
▪ However, companies subscribing to these services lose out on building an internal knowledgebase
▪ Is this then a solution to this problem? Partially!!

Oncomine Power Tools

Used with permission from Compendia Bioscience

Data Reduction & Interactive Visualizations – the key to success?

▪ Don't start in the weeds, take a step back: create summarizations, e.g. summarize at gene level to enable systematic surveys
▪ However, enable digging into the weeds: select a gene and view the details (spread of sample values, sequence coverage of a gene's exon)
▪ Make it interactive at every level: search for gene lists, enable filtering by annotations (cellular location, target class, pathways, etc.)
▪ Clearly define what type of data you are dealing with: identity and annotations are critical
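
The summarize-then-drill-down pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `records` data, the gene and sample names, and the helper names `summarize_by_gene` and `details_for_gene` are all hypothetical, not part of any system described in the presentation.

```python
from collections import defaultdict
from statistics import median

# Hypothetical per-sample measurements: (gene, sample, log2 expression value).
records = [
    ("EGFR", "cell_line_A", 8.0),
    ("EGFR", "cell_line_B", 10.0),
    ("EGFR", "tissue_C", 12.0),
    ("KRAS", "cell_line_A", 5.0),
    ("KRAS", "cell_line_B", 7.0),
]


def summarize_by_gene(rows):
    """Collapse per-sample rows into one summary row per gene (the survey view)."""
    values_by_gene = defaultdict(list)
    for gene, _sample, value in rows:
        values_by_gene[gene].append(value)
    return {
        gene: {
            "n": len(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
        for gene, values in values_by_gene.items()
    }


def details_for_gene(rows, gene):
    """Return the per-sample values behind one summary row (the drill-down view)."""
    return [(sample, value) for g, sample, value in rows if g == gene]


summary = summarize_by_gene(records)
for gene, stats in sorted(summary.items()):
    print(gene, stats)
print(details_for_gene(records, "KRAS"))
```

Keeping the detail rows reachable from each summary row is the point of the design: the survey stays small, while the spread of sample values remains one lookup away.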


Internal Prototyping Workflow
[Diagram: Operational Layer and Knowledge Layer. Components shown: Data Import, Mapping, Data Analysis, Profiling Data Warehouse, Information Links, Query & Visualize, …]

Molecular Profiling Database – Gene Expression Data

Filters based on available sample & gene annotations
Log2 difference
Summary Table
Details
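
The slide does not say how the "Log2 difference" column is computed. One common reading, assumed here, is the difference between the mean log2 expression of two sample groups, which corresponds to a log2 fold change. The function name `log2_difference` and the intensity values are hypothetical.

```python
from math import log2
from statistics import mean


def log2_difference(group_a, group_b):
    """Mean log2 intensity of group_a minus mean log2 intensity of group_b."""
    return mean(log2(x) for x in group_a) - mean(log2(x) for x in group_b)


# Assumed raw intensities for one gene in two sample groups.
tumor = [1024.0, 4096.0]   # log2 values: 10 and 12
normal = [256.0, 256.0]    # log2 values: 8 and 8

diff = log2_difference(tumor, normal)
print(diff)        # difference on the log2 scale
print(2 ** diff)   # the same difference expressed as a fold change
```

Working on the log2 scale keeps up- and down-regulation symmetric around zero, which makes a summary table easier to sort and filter.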


Molecular Profiling Database – Gene Expression Data
Multiple Visualizations

Filters based on available sample & gene annotations
Sample Spread (Scatter Plot)

Sample Spread (Box Plot)


High level visualization of gene mutations in cell lines

Tissue and Tumor categorizations
Gene mutations in specific cell lines


A Target Prioritization Tool
Input from Multiple Data Sources
List of Potential Targets (Target Classes)

Data sources, each contributing scores: Gene Expression, Gene Copy Number, Gene Mutation, Tool Compounds, siRNA Functional, Knockout Functional

Output: Prioritized Target List #1, Prioritized Target List #2
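
The presentation does not describe how the per-source scores are combined. A weighted sum is one simple possibility, sketched here as an assumption: different weightings of the same scores yield different prioritized lists. The target names, score values, weight profiles, and the `prioritize` helper are all hypothetical.

```python
SOURCES = [
    "expression", "copy_number", "mutation",
    "tool_compounds", "sirna", "knockout",
]

# Hypothetical per-source scores in the range 0..1 for three candidate targets.
scores = {
    "TARGET_A": {"expression": 0.9, "copy_number": 0.8, "mutation": 0.1,
                 "tool_compounds": 0.0, "sirna": 0.2, "knockout": 0.0},
    "TARGET_B": {"expression": 0.3, "copy_number": 0.2, "mutation": 0.0,
                 "tool_compounds": 0.5, "sirna": 0.9, "knockout": 0.8},
    "TARGET_C": {"expression": 0.5, "copy_number": 0.5, "mutation": 0.5,
                 "tool_compounds": 0.5, "sirna": 0.5, "knockout": 0.5},
}

# Two assumed weight profiles: one favors profiling data, one favors functional data.
profiling_weights = {"expression": 2.0, "copy_number": 2.0, "mutation": 2.0,
                     "tool_compounds": 1.0, "sirna": 1.0, "knockout": 1.0}
functional_weights = {"expression": 1.0, "copy_number": 1.0, "mutation": 1.0,
                      "tool_compounds": 2.0, "sirna": 2.0, "knockout": 2.0}


def prioritize(target_scores, weights):
    """Rank targets by weighted sum of source scores, highest first."""
    totals = {
        target: sum(weights.get(s, 0.0) * per_source.get(s, 0.0) for s in SOURCES)
        for target, per_source in target_scores.items()
    }
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(totals, key=lambda target: (-totals[target], target))


list_1 = prioritize(scores, profiling_weights)
list_2 = prioritize(scores, functional_weights)
print(list_1)
print(list_2)
```

Missing scores default to zero in this sketch; a real tool would need to decide whether absent data should count against a target or be excluded from its total.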


Concluding remarks/Acknowledgements

▪ We will get more data.
▪ If we fail to organize and summarize, we'll waste a lot of time and money
▪ Some standards will be necessary, we'll have to compromise at times
▪ Not everything can and will be in one place. Defined interfaces (data, processes) between disparate systems are desperately needed to enable data interchange.
▪ My colleagues in Research Informatics & Hematology/Oncology Therapeutic Area
