Dmla0609 Hoeck Presentation
1. Interactive Visual Data Analytics
Wolfgang G. Hoeck, Ph.D.
Senior Manager, Therapeutic Area Systems
Amgen Inc.
Laboratory Data Management, Munich, June 16-17, 2009
2. Agenda
A bit about Amgen
Interactive visual data analytics explained
Screening and target identification/validation
Expectations from an interactive visual data analytics platform
Data formats ARE important
Registration systems: uniquely identifying what you are working with
Bringing data together, the art of data mapping
From tabular data to data networks
Wolfgang G. Hoeck, Ph.D., Laboratory Data Management Conference, Munich, June 16/17 th, 2009 2
3. Amgen: A Biotechnology Pioneer
Founded in 1980, Amgen was
one of the first biotechnology
companies to successfully
discover, develop and make
protein-based medicines
Today, we’re leading the
industry in its next wave of
innovation by:
– Developing therapies in
multiple modalities
– Driving cutting-edge
research and development
– Continuing to advance the
science of biotechnological
manufacturing
4. Our Worldwide Presence
Sites in North America: Cambridge, MA; Toronto, ON; West Greenwich, RI; Washington, DC; Burnaby, BC; Bothell, WA; Seattle, WA; Longmont, CO; Boulder, CO; Fremont, CA; South San Francisco, CA; Thousand Oaks, CA; Mexico City, Mexico; Louisville, KY; Juncos, Puerto Rico
Countries in Europe: Norway, Denmark, Luxembourg, Finland, The Netherlands, Sweden, Belgium, Estonia, Ireland, Latvia, Lithuania, Russia, England, Czech Republic, France, Poland, Switzerland, Slovakia, Hungary, Greece, Slovenia, Austria, Germany, Italy, Spain, Portugal
Countries in Asia-Pacific and the Middle East: India, United Arab Emirates, Hong Kong, Australia, New Zealand
5. Scientific Data are complex, and it’s not
going to get any better
Target Identification & Validation
– Gene Expression of Cell Line Panels: 200 x 45000 x 3
• Understand differential expression of one or a handful of genes
• Understand expression profile in a particular cell line only
– Gene Expression of tumor samples: The Cancer Genome Atlas
• Pilot phase: 3 tumor types - 500 GBM/Ovarian Cancer & 200 Lung Cancer samples
• Next years: 25 more tumor types
Compound/Target Profiling
– 400+ targets across hundreds of small-molecule compounds
• Compare target properties with compound properties
Cell Line Profiling
– 500 cell lines treated with 50 therapeutic molecules
– Each cell line has genetic abnormalities in many genes
(mutations, deletions, insertions, rearrangements, etc.)
6. Visualization of complex data must be
made available in interactive format
7. Interactive browsing of pre-analyzed data
– finding cell lines for in-vitro work
Step 1: Select gene of interest, e.g.: EGFR
Step 2: Select study of interest
Step 3: Review relative expression pattern
Step 4: Select cell line(s) for further work
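The four steps above can be sketched in Python as a lookup over a small pre-analyzed table. This is only an illustration: the records, the study and cell line names, the placeholder gene `GENE_X`, and the definition of relative expression (value minus the median across cell lines in the study) are assumptions, not data or methods from the presentation.

```python
from statistics import median

# Hypothetical pre-analyzed expression records (log2 scale); all values are made up.
RECORDS = [
    {"study": "PanelA", "gene": "EGFR", "cell_line": "CL-1", "log2_expr": 11.0},
    {"study": "PanelA", "gene": "EGFR", "cell_line": "CL-2", "log2_expr": 7.0},
    {"study": "PanelA", "gene": "EGFR", "cell_line": "CL-3", "log2_expr": 8.0},
    {"study": "PanelA", "gene": "GENE_X", "cell_line": "CL-1", "log2_expr": 9.0},
    {"study": "PanelB", "gene": "EGFR", "cell_line": "CL-1", "log2_expr": 6.0},
]


def relative_expression(records, gene, study):
    """Steps 1-3: filter by gene and study, then centre values on the study median."""
    rows = [r for r in records if r["gene"] == gene and r["study"] == study]
    if not rows:
        return {}
    centre = median(r["log2_expr"] for r in rows)
    return {r["cell_line"]: r["log2_expr"] - centre for r in rows}


def select_cell_lines(relative, min_delta):
    """Step 4: keep cell lines whose relative expression is at least min_delta."""
    return sorted(cl for cl, delta in relative.items() if delta >= min_delta)


profile = relative_expression(RECORDS, gene="EGFR", study="PanelA")
high_expressers = select_cell_lines(profile, min_delta=2.0)
print(profile)
print(high_expressers)
```

In an interactive tool each function call corresponds to a click or a filter setting; the point is that the scientist only chooses the gene, the study and a threshold, while the analysis itself was done beforehand.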
8. Steps to share data in an interactive visual
format
Determine the location of desired data (one or multiple places and/or formats)
Run a query against a database/data warehouse
Capture a dataset (table) of rows and columns
Decide on needed analytics & visualizations
Determine visualization settings and state
Share the results with other scientists
Enable scientists to interact with data
Enable scientists to download sets of data
Diagram labels: Power User (beside the earlier steps), Decision Maker (beside the later steps)
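A minimal sketch of these steps, assuming an in-memory SQLite database as a stand-in for the data warehouse and a JSON document as the shared artifact. The table name, column names, values and the layout of the settings dictionary are all hypothetical; the presentation does not prescribe a storage or exchange format.

```python
import json
import sqlite3

# Hypothetical source: an in-memory SQLite database stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (cell_line TEXT, gene TEXT, value REAL)")
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?)",
    [("CL-1", "EGFR", 11.0), ("CL-2", "EGFR", 7.0), ("CL-1", "GENE_X", 9.0)],
)

# Run a query and capture a dataset (table) of rows and columns.
cursor = conn.execute(
    "SELECT cell_line, gene, value FROM expression WHERE gene = ? ORDER BY cell_line",
    ("EGFR",),
)
columns = [d[0] for d in cursor.description]
rows = [list(r) for r in cursor.fetchall()]
conn.close()

# Decide on a visualization and record its settings and state (made-up schema).
visualization = {
    "type": "bar_chart",
    "x": "cell_line",
    "y": "value",
    "filters": {"gene": "EGFR"},
}

# Bundle data and settings into one shareable document.
shared = json.dumps({"columns": columns, "rows": rows, "visualization": visualization})

# A recipient reloads the bundle and gets the same data and the same view state.
received = json.loads(shared)
print(received["columns"], len(received["rows"]))
```

Keeping the data and the visualization state in one bundle is what lets a recipient start from the prepared view and still filter or download the underlying rows.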
9. We have many choices to visualize data …
Table
Bar Chart
Box Plot
Scatter Plot (X/Y-Plot)
Line Chart
Heatmap
Parallel Coordinates Plot (Profile Chart)
Network
Map
TreeMap
e-Northern
10. …and all choices should retain
interactivity
Filtering a set of cell line and gene alteration
data to view a particular set of cells and the
set of genes harboring deletions
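The filtering described here can be sketched as a generic column filter over alteration records. The records, cell line names and gene names below are hypothetical placeholders, and `apply_filters` is an illustrative helper, not a function of any particular analytics product.

```python
# Hypothetical gene alteration records; names and values are placeholders.
ALTERATIONS = [
    {"cell_line": "CL-1", "gene": "GENE_A", "alteration": "deletion"},
    {"cell_line": "CL-1", "gene": "GENE_B", "alteration": "mutation"},
    {"cell_line": "CL-2", "gene": "GENE_C", "alteration": "deletion"},
    {"cell_line": "CL-3", "gene": "GENE_A", "alteration": "deletion"},
    {"cell_line": "CL-3", "gene": "GENE_D", "alteration": "insertion"},
]


def apply_filters(records, **allowed):
    """Keep records whose value in each named column is in the allowed set."""
    return [
        r for r in records
        if all(r[column] in values for column, values in allowed.items())
    ]


# View a particular set of cells and only the genes harboring deletions.
selected = apply_filters(
    ALTERATIONS,
    cell_line={"CL-1", "CL-3"},
    alteration={"deletion"},
)
deleted_genes = sorted({r["gene"] for r in selected})
print(selected)
print(deleted_genes)
```

Because every filter is just a set of allowed values per column, the same mechanism can sit behind a table, a heatmap or a network view, which is what keeps each visualization interactive.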
11. The ideal interactive, visual data analytics
platform
A desktop client
– Rich interactivity, visuals
– Rich analytics tools
A server component
– Configurable security
– Configurable data access
A web client
– Rich interactivity
– Easy access
An API for extension capabilities
Diagram labels: Desktop Clients; Zero-footprint Web Clients; Analysis Server; Web Server; Stats, etc. Server; DB1, DB2, DB3
12. From Desktop Client to Analysis Server
Data Access
– From files
– From databases
– From clipboard
– From services
Data Manipulations
– Data Mapping
– Data Merging
– Calculations
– Data Transformations
Visualizations
– Table
– X/Y-Plot
– Bar Chart
– Parallel Coordinate Plot
– Box-Plot
– Networks
Data Storage
– One or many tabular datasets
Data Analysis
– Clustering Methods
• Hierarchical
• K-Means
• PCA
• SOM
– Profile Searching
Documentation
– Space to explain what was done
Data Content
– Tabular Format
– Multiple Tables
– Relationships between Tables
Data Security
– Group Level Security
– Function Level Security
– Integration with Corporate LDAP
Action Logging
– Who, When, What
13. About Data Formats
Tall-Skinny
– aka non-pivoted data format
– Each row represents a single event
Short-Wide
– aka pivoted data format
– Each row represents a summary of events in particular circumstances
– Typically results in “data loss”
Subject-Verb-Object
– aka network data format
– aka nodes and edges
– Represents complex data relationships, i.e., everything has a potential many-to-many relationship
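The three formats can be illustrated on one tiny dataset. The sketch below uses made-up measurements, the placeholder gene `GENE_X`, and averaging as the summary function; these are assumptions for illustration. It shows where the "data loss" of the short-wide format comes from: two replicate events collapse into one summary value.

```python
from collections import defaultdict
from statistics import mean

# Tall-skinny: one row per measurement event (hypothetical values).
tall = [
    ("CL-1", "EGFR", 10.0),
    ("CL-1", "EGFR", 12.0),    # replicate measurement of the same pair
    ("CL-1", "GENE_X", 5.0),
    ("CL-2", "EGFR", 7.0),
]


def pivot(rows):
    """Short-wide: one row per cell line, one column per gene, replicates averaged."""
    cells = defaultdict(list)
    for cell_line, gene, value in rows:
        cells[(cell_line, gene)].append(value)
    wide = defaultdict(dict)
    for (cell_line, gene), values in cells.items():
        wide[cell_line][gene] = mean(values)   # individual events are lost here
    return dict(wide)


def to_triples(rows):
    """Subject-verb-object: each event becomes an edge between two nodes."""
    return sorted({(gene, "is expressed in", cell_line) for cell_line, gene, _ in rows})


wide = pivot(tall)
triples = to_triples(tall)
print(wide)
print(triples)
```

The tall-skinny list can always be pivoted into the other two forms, but not the reverse, which is one reason to store data non-pivoted and pivot only for display.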
14. We are dealing with a Complex Data Concept
Network – register your entities!
Diagram (concept network), nodes: Project, Target, Disease, BioProcess, Gene, Pathway, Protein, Cell Line, Tissue
Diagram, status nodes: Gene Status (Wildtype, Mutated, Diff.Expressed, Amplified/Deleted); Protein Status (Diff.Expressed, Postt.Modified)
Diagram, edge labels: has a, is critical in, is represented by, works in, occurs in, is translated into, is functional in, is expressed in, is derived from
15. Data Assembly and Integration
Data sources:
– Contract Screening Results on monthly spreadsheets
• POC/Kd values
• Entrez Gene Symbol
• Compound Concentration
• Compound ID
– Human Gene Nomenclature Database
• Gene Symbol
• Full Name
• Gene Synonyms
– KinomeTree Kinase Map
• Kinase Classification
• Manual mapping of Gene Symbols to Kinome classes
– Amgen Project/Compound Association
• Compound registered for specific Amgen Project
Workflow stages: Contract Screening Data Assembly in Desktop Client; Publication Step; Contract Screening Data Assembly in Web Client
Data get assembled in Spotfire based on matching data keys such as Gene Symbol or CompoundID. Visualizations are prepared based on scientists' input. Filters are organized according to frequency of usage. Adjustments can typically be made in a couple of hours. The final file is published into a web-library accessible via hyperlink. Announcements are made via e-mail and embedded hyperlink.
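Assembling data on matching keys amounts to a left join of the screening rows against lookup tables. The sketch below is plain Python rather than Spotfire, and every identifier, gene symbol, class name and value in it is a hypothetical placeholder; it only illustrates joining on Gene Symbol and Compound ID and spotting keys that fail to map.

```python
# Hypothetical rows standing in for the four data sources; all values are made up.
screening = [
    {"compound_id": "C-001", "gene_symbol": "KIN1", "poc": 12.0},
    {"compound_id": "C-001", "gene_symbol": "KIN2", "poc": 85.0},
    {"compound_id": "C-002", "gene_symbol": "KIN1", "poc": 40.0},
]
nomenclature = {"KIN1": "hypothetical kinase 1", "KIN2": "hypothetical kinase 2"}
kinome_class = {"KIN1": "GroupA"}        # manual mapping; KIN2 not mapped yet
project_of = {"C-001": "Project-X"}      # C-002 is not registered to a project


def assemble(screening, nomenclature, kinome_class, project_of):
    """Left-join each screening row to the lookups on gene_symbol and compound_id."""
    assembled = []
    for row in screening:
        merged = dict(row)
        merged["full_name"] = nomenclature.get(row["gene_symbol"])
        merged["kinase_class"] = kinome_class.get(row["gene_symbol"])
        merged["project"] = project_of.get(row["compound_id"])
        assembled.append(merged)
    return assembled


table = assemble(screening, nomenclature, kinome_class, project_of)

# Keys that found no partner show where the manual mapping still has gaps.
unmapped = sorted({r["gene_symbol"] for r in table if r["kinase_class"] is None})
print(table)
print(unmapped)
```

A left join keeps every screening result even when a lookup is incomplete, so missing mappings surface as empty fields instead of silently dropping rows.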
16. Viewing and interacting with integrated
data
17. Biology Visualizations – Pathway Example
18. Network Visualizations – The hairball
principle
19. Network Visualizations – The hairball
principle resolved
Tools to connect Nodes
Tools to extend Nodes
Figure labels: Kidney, Bladder
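One way to read the two tools is as graph operations: "connect" finds a path between two chosen nodes, and "extend" reveals the direct neighbours of one node, so only the relevant part of the hairball is drawn. That reading is an assumption, as are the node names and triples below; the sketch uses a breadth-first search over an undirected graph built from subject-verb-object triples.

```python
from collections import defaultdict, deque

# Hypothetical subject-verb-object triples; node names are placeholders.
TRIPLES = [
    ("GeneA", "is translated into", "ProteinA"),
    ("ProteinA", "is functional in", "PathwayP"),
    ("GeneA", "is expressed in", "CellLine1"),
    ("CellLine1", "is derived from", "Kidney"),
    ("CellLine2", "is derived from", "Bladder"),
]


def build_graph(triples):
    """Build an undirected adjacency map from the triples."""
    graph = defaultdict(set)
    for subject, _verb, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph


def extend(graph, node):
    """Extend a node: return its direct neighbours."""
    return sorted(graph.get(node, ()))


def connect(graph, start, goal):
    """Connect two nodes: breadth-first search for a shortest path, or None."""
    if start not in graph or goal not in graph:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


graph = build_graph(TRIPLES)
print(extend(graph, "GeneA"))
print(connect(graph, "ProteinA", "Kidney"))
print(connect(graph, "Kidney", "Bladder"))
```

Drawing only the returned path or neighbourhood, instead of every node and edge, is what turns an unreadable hairball into a view a scientist can follow.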
20. Combining tabular & network
visualizations
Step 1: Select Disease, then select Therapeutic Molecule
Step 2: Study Therapeutic Molecule network
21. Concluding Thoughts
Interactive visualizations are key to making complex data shareable and understandable
If interactivity is self-explanatory, adoption is very rapid: nobody wants to read a manual
Analytics can stay in the hands of the power user; it does not need to be available to everyone
Data are not getting any simpler; however, with more sophisticated tools even complex data can be made accessible and understandable
Thank you for your time and interest