1. A foggy affair – sifting through
the 'omics data flood
Wolfgang Hoeck
Principal Business Analyst – 09/23/2010
R&DI - Research & Development Informatics
2. Today's Presentation
Terms and Domain Definitions
The “problem space” from a scientist's perspective
Real and perceived bottlenecks
Deluge of different 'omics data types
Data Summarization: Examples in the experimental
data space
Taking the next step in text: Adding context
Bringing it all together – what does it mean?
Concluding remarks
VIB Pharma - Laboratory Data Management
Conference USA 2
3. 'omics Data Definition and Drug Discovery
The English-language neologism omics informally refers
to a field of study in biology ending in -omics, such as
genomics or proteomics. The related suffix -ome is used
to address objects of study of such fields, such as the
genome or proteome, respectively.
Transcriptomics – the study of transcripts in a cell/tissue
In a larger sense, data representing: Gene Expression,
Gene Amplification/Deletion, Gene Mutations
In the context of cell lines and tissues, these data are integrated
to establish a picture of a disease and potential
intervention points
4. Drug Target Identification & Validation
Structured Data: Experimental Data (raw data available; insight through Analysis)
• Profiling Data: Gene Expression, Gene Copy Number, Gene Mutation
• Functional Data: si/shRNA screens, Transfections, Knock-outs
• Novel insights
Semi- & Unstructured Data: Literature Data (raw data not or only partially available; insight through Extraction)
• Target/Disease Associations
• Specific experiment insight
• Publication Density
• Confirmatory insights
Target Identification, Target Validation, Therapeutic Molecule
5. Key Issues In Oncology: highly diverse
data sets in need of integration
Numerous datasets not connected to each other: Datasets located on hard drives, local machines, servers
Datasets not “written in the same language”: Distinctly different annotations & data formats
No unified global interface/portal: Cannot see what datasets exist internally or externally, Cannot access
datasets
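The "not written in the same language" problem can be illustrated with a minimal sketch that re-keys records from two sources to one shared gene symbol. The alias table, the sample values, and the function name `harmonize` are assumptions made for illustration; they are not taken from the presentation.

```python
# Hypothetical alias table: maps source-specific identifiers to one gene symbol.
# A real table would come from a curated annotation resource.
ALIASES = {
    "ERBB2": "ERBB2",
    "HER2": "ERBB2",
    "ENSG00000141736": "ERBB2",
    "TP53": "TP53",
    "P53": "TP53",
}


def harmonize(records):
    """Re-key (identifier, value) records to a shared gene symbol.

    Returns (mapped, unmapped) so that unknown identifiers are reported
    instead of being silently dropped.
    """
    mapped, unmapped = {}, []
    for identifier, value in records:
        symbol = ALIASES.get(identifier.strip().upper())
        if symbol is None:
            unmapped.append(identifier)
        else:
            mapped.setdefault(symbol, []).append(value)
    return mapped, unmapped


# Illustrative values only: two data sets that name the same gene differently.
expression = [("HER2", 8.1), ("p53", 5.4)]
copy_number = [("ENSG00000141736", 6.0), ("XYZ1", 2.0)]
merged, missing = harmonize(expression + copy_number)
```

Returning the unmapped identifiers separately makes annotation gaps visible, which matters when data sets are merged for the first time.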
[Figure: disconnected data set types (NextGen Sequence, Copy Number, TaqMan, siRNA, Drug Sensitivity, Karyotype, Images, Microarray, Cell Line Profiling). Used with permission from J. Argento]
6. Technology Advances contributing to the
'omics data deluge
Cheaper “old-world” technologies
– Microarrays: More refined, widely available, cheaper
– Analysis software: Standardized processing, more
commercial choices, more in public domain
Newer high-throughput technologies
– NextGen Sequencing (read lengths from 30 bp to 120 bp): Revolutionary
method to gain insight into the genomic landscape of a sample
• Transcriptomics (DGE, Digital Gene Expression): No pre-defined
array necessary, very sensitive, detailed insight into a gene's
transcriptome
• Fusion Genes/Splice Variants: Detection of genes that normally
don't fit together
• Alternative Transcripts: One gene, multiple versions
• Genomics: Sequence variations among samples
http://seqanswers.com/
7. Academic “mega-Projects” contributing to
the 'omics data deluge
The Cancer Genome Atlas Project (TCGA) – A Rich
Catalog of Human Cancer Genomes
– 25 cancer types; 500+ samples each; microarray gene
expression, amplification/deletion data; traditional and
NextGen sequencing data; epigenetic data; clinical data; data
at 4 levels of data reduction
The Sanger Cancer Genome Project – A rich
characterization of cancer cell lines, tissues and
responses
– Catalog of somatic mutations in cancer cell lines and tissues;
web tools, databases, downloads
1000 Genomes Project – A Deep Catalog of Human
Genetic Variation
– Sequence 1000+ human genomes to detect sequence
variations in population segments
8. Data Deluge – TCGA Ovarian Cancer
example
Data types: Transcriptome, Clinical, Copy Number, Epigenetic Data, SNP, Sequences
3 levels of data:
• Raw, unprocessed
• Processed
• Gene-level summaries
Samples (tumor, normal, cell line, 11/500+)
1 Sample (25000 data points, one kind of data)
Screenshot taken from http://tcga-data.nci.nih.gov/tcga/
9. Where are the bottlenecks?
Storage Space and Data Transfers: Terabytes needed, not just gigabytes
(already the case for microarray data)
– Local storage: Attached to analysis computer, fast connectivity, stores raw
data files and processed files
– Centralized storage: Remote storage for sharing data across sites; data
transfer speeds are an issue
Analytical Skills and Computing Power:
– Computing power: e.g., 8 CPUs and 16 GB RAM need 1.5 days to process one
large-scale copy number data set. How many CPUs can you put to work?
Parallelizing the load on clusters does help
– Human power: Still lots of data manipulation needed; domain knowledge is
necessary
Data sharing and integration capability:
– Simplicity in user interface: Biologists are not computer scientists
– Data cannot be shared as files: 1 gene copy number data set = 25 M rows of
data.
– Data needs summarization to enable broad surveys
– Query performance: The faster the better, make users aware of what they are
asking for.
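The computing-power point can be made concrete with a back-of-the-envelope estimate. The sketch below takes the 1.5 days per data set quoted above and assumes that jobs are fully independent, that each node (a machine like the 8-CPU example) handles one data set at a time, and that data transfer adds no delay. The function name `batch_days` and the batch size of 20 data sets are hypothetical.

```python
import math


def batch_days(n_datasets, n_nodes, days_per_dataset=1.5):
    """Estimated wall-clock days to process a batch of data sets.

    Assumes independent jobs, one data set per node at a time,
    and no I/O or scheduling overhead (an optimistic simplification).
    """
    if n_datasets < 0 or n_nodes < 1:
        raise ValueError("need n_datasets >= 0 and n_nodes >= 1")
    rounds = math.ceil(n_datasets / n_nodes)  # sequential rounds of work
    return rounds * days_per_dataset


# Hypothetical batch of 20 copy number data sets.
single_node = batch_days(20, 1)
four_nodes = batch_days(20, 4)
```

Under these assumptions a single machine needs 30 days for the batch and four machines need 7.5 days, which is why cluster parallelization helps; real gains are smaller once storage and transfer bottlenecks are included.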
10. Who is affected by the bottlenecks?
Increasing Level of summarization
The Information Technologist
– Cares about how much data needs to be stored where, how it
will be transferred, how long it needs to be kept, what needs to
be backed up, how big will the database get, etc.
The Bioinformaticist
– Cares about the raw data, needs powerful hardware, fast
algorithms, plenty of storage space, means to share results
The Scientist
– Cares about analyzed results, wants immediate access, ability
to interact with the data, ask specific questions about a gene,
group of genes or samples (tissues, cell lines)
The Manager
– Cares about the final result: a novel, interesting, validated
target, ideally with notification via automated e-mail
11. Workflow/Data for one (1) sample in
NextGen Sequencing
Microarray: 50,000+ probe set result values per sample
NextGen Sequencing: 100M+ reads per sample
depth of read coverage at each position
Junctions.bed: possible splice junctions
Digital_expression.txt: normalized read counts (RPKM) with coverage statistics
Abnormal_Junction.bed: possible junctions caused by translocation events
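The normalized read counts in Digital_expression.txt are given as RPKM (reads per kilobase of transcript per million mapped reads). A minimal sketch of the standard RPKM formula follows; the presentation does not say which pipeline produced the file, so the function name and the example numbers are assumptions.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads * 1e9 / (gene length in bp * total mapped reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp transcript,
# in a sample with 100 million mapped reads.
value = rpkm(500, 2_000, 100_000_000)
```

Normalizing by both transcript length and sequencing depth is what allows expression values to be compared across genes and across samples with different read totals.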
12. Our appetite for data is enormous, but can
we digest it?
Gene Expression: Microarrays, NextGen Sequencing
Gene Copy Number Variations: Microarrays
Gene Mutations: Classic and NextGen Sequencing
Gene Methylations: Microarray
Cell Line Panels Response Profiles: Plate-based
siRNA screens: gene library screens
Phenotypic screens: Knock-out animals
Or do we all need a lifetime supply of indigestion pills?
13. Is cloud computing a solution?
Many Software-as-a-Service (SaaS) vendors
– Compendia Bioscience: Oncomine, Oncomine Power Tools
– NextBio: NextBio Basic, Professional, Enterprise
– GenomeQuest: NGS Data Management
– DNAnexus: NGS Data Management
Keeping all the data that need integration in one place
makes life a lot easier
No internal IT resources required
However, companies subscribing to these services
lose out on building an internal knowledgebase
Is this then a solution to this problem? Partially!!
14. Oncomine Power Tools
Used with permission from Compendia Bioscience
15. Data Reduction & Interactive
Visualizations – the key to success?
Don't start in the weeds – take a step back: Create
summarizations, e.g., summarize at Gene level to
enable systematic surveys
However, enable digging into the weeds: Select a
Gene and view the details – spread of sample values,
sequence coverage of a gene's exon
Make it interactive at every level: Search for Gene
lists, enable filtering by annotations (cellular location,
target class, pathways, etc.)
Clearly define what type of data you are dealing with:
identity and annotations are critical
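The summarize-first, drill-down-second pattern described above could look roughly like the sketch below. The data, the annotation fields, and the function names (`summarize`, `details`, `filter_genes`) are all hypothetical; the median is used as one possible gene-level summary, not necessarily the one used in the tools shown in this talk.

```python
from collections import defaultdict
from statistics import median

# Hypothetical per-sample values: (gene, sample, log2 expression).
measurements = [
    ("EGFR", "S1", 7.0), ("EGFR", "S2", 9.0), ("EGFR", "S3", 8.0),
    ("MYC", "S1", 5.0), ("MYC", "S2", 6.0),
]
# Hypothetical gene annotations used for filtering.
annotations = {
    "EGFR": {"target_class": "kinase"},
    "MYC": {"target_class": "transcription factor"},
}


def summarize(rows):
    """Step back: one summary record per gene for broad surveys."""
    by_gene = defaultdict(list)
    for gene, _sample, value in rows:
        by_gene[gene].append(value)
    return {
        gene: {"n": len(vals), "median": median(vals),
               "min": min(vals), "max": max(vals)}
        for gene, vals in by_gene.items()
    }


def details(rows, gene):
    """Dig into the weeds: the spread of sample values for one gene."""
    return {sample: value for g, sample, value in rows if g == gene}


def filter_genes(summary, gene_annotations, **criteria):
    """Keep only genes whose annotations match every criterion."""
    return {
        gene: stats for gene, stats in summary.items()
        if all(gene_annotations.get(gene, {}).get(k) == v
               for k, v in criteria.items())
    }


summary = summarize(measurements)
kinases = filter_genes(summary, annotations, target_class="kinase")
myc_detail = details(measurements, "MYC")
```

Keeping the summary and the detail view as separate functions over the same rows means the survey stays small while the underlying values remain one lookup away.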
16. Internal Prototyping Workflow
[Diagram: workflow spanning an Operational Layer and a Knowledge Layer. Labels: Data Import, Data Warehouse, Data Analysis, …, Profiling, Mapping, Information Links, Query & Visualize]
17. Molecular Profiling Database – Gene
Expression Data
Filters based on available sample & gene annotations
Log2 difference
Summary Table
Details
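A log2 difference of the kind shown in the summary table can be sketched as follows. The slide does not state which sample groups are compared, so the two groups here, the intensity values, and the function name `log2_difference` are assumptions for illustration.

```python
import math
from statistics import mean


def log2_difference(group_a, group_b):
    """Mean log2 value of group_a minus mean log2 value of group_b.

    Inputs are raw (non-log) positive intensities for one gene.
    """
    log_a = [math.log2(x) for x in group_a]
    log_b = [math.log2(x) for x in group_b]
    return mean(log_a) - mean(log_b)


# Hypothetical intensities for one gene in two sample groups.
diff = log2_difference([1024, 4096], [256, 256])
```

A log2 difference of 3 corresponds to an eightfold difference on the raw scale, which is why the log scale is convenient for summary tables.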
18. Molecular Profiling Database – Gene
Expression Data
Multiple Visualizations
Filters based on available sample & gene annotations
Sample Spread (Scatter Plot)
Sample Spread (Box Plot)
19. High level visualization of gene mutations
in cell lines
Tissue and Tumor categorizations
Gene mutations in specific cell lines
20. A Target Prioritization Tool
Input from Multiple Data Sources
List of Potential Targets (Target Classes)
Gene Expression Scores | Gene Copy Number Scores | Gene Mutation Scores | Tool Compounds Scores | siRNA Functional Scores | Knockout Functional Scores
Prioritized Target List #1 Prioritized Target List #2
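One way to turn per-source scores into prioritized lists is a weighted sum, sketched below. The slide does not describe the actual scoring method or how the two lists differ, so the weighted-sum approach, the weights, the gene names, and the scores are all assumptions.

```python
# Score sources named on the slide; the keys themselves are hypothetical.
SOURCES = ["expression", "copy_number", "mutation",
           "tool_compounds", "sirna", "knockout"]

# Hypothetical scores in the range 0..1 for two candidate targets.
scores = {
    "GENE_A": {"expression": 0.9, "copy_number": 0.8, "mutation": 0.1,
               "tool_compounds": 0.0, "sirna": 0.2, "knockout": 0.0},
    "GENE_B": {"expression": 0.3, "copy_number": 0.2, "mutation": 0.4,
               "tool_compounds": 1.0, "sirna": 0.9, "knockout": 0.8},
}


def prioritize(gene_scores, weights):
    """Rank genes by weighted sum of scores, highest first.

    Missing scores and missing weights count as 0; ties are broken
    alphabetically so the ranking is deterministic.
    """
    totals = {
        gene: sum(weights.get(src, 0.0) * vals.get(src, 0.0)
                  for src in SOURCES)
        for gene, vals in gene_scores.items()
    }
    return sorted(totals, key=lambda gene: (-totals[gene], gene))


# Two hypothetical weightings producing two different lists.
profiling_weights = {"expression": 1.0, "copy_number": 1.0, "mutation": 1.0}
functional_weights = {"tool_compounds": 1.0, "sirna": 1.0, "knockout": 1.0}
list_1 = prioritize(scores, profiling_weights)
list_2 = prioritize(scores, functional_weights)
```

With these illustrative numbers the profiling-weighted list ranks GENE_A first and the functional-weighted list ranks GENE_B first, showing how the choice of weights changes the prioritization.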
21. Concluding remarks/Acknowledgements
We will get more data.
If we fail to organize and summarize, we'll waste a lot
of time and money
Some standards will be necessary, we'll have to
compromise at times
Not everything can and will be in one place. Defined
interfaces (data, processes) between disparate
systems are desperately needed to enable data
interchange.
Acknowledgements: My colleagues in Research Informatics &
Hematology/Oncology Therapeutic Area