1. The Levinthal Lecture
Philip E. Bourne Ph.D., FACMI
Associate Director for Data Science
National Institutes of Health
philip.bourne@nih.gov
http://www.slideshare.net/pebourne
Open Eye Meeting, Santa Fe, March 8, 2016
2. What follows are my personal views
and not necessarily those of my
employer, the US federal government.
3. There is No Intelligent Life Down
Here
With Apologies to Cy
Phil Bourne
Open Eye Meeting, Santa Fe, March 8, 2016
8. Consider Cy’s Own words from
around 1970 concerning data sharing
“At that time, it was difficult to obtain
crystallographic coordinates although the
results of the structural analysis had
been published”
9. Local: Cooperative Community Action
Individual letters to editors of
journals
Committees
IUCr commission on
Biological Macromolecules
ACA/USNCCr
Richards committee
Funding agencies
Articles in journals
Marvin Cassman Fred Richards Richard Dickerson
Courtesy of Helen Berman
11. A Broad Culture of Sharing
1999 2003 2004 2007 2008 2014
Research
Tools
Policy
NIH Data
Sharing Policy
Model
Organism
Policy
Genome-wide
Association
(GWAS) Policy
2012
NIH Public
Access Policy
(Publications)
Big Data to
Knowledge
(BD2K) Initiative
Genomic Data
Sharing (GDS)
Policy
Modernization of
NIH Clinical
Trials
White House
Initiative
(2013 “Holdren
Memo”)
12. Data Sharing: An Essential Component
13. Modernizing NIH Clinical Trials
Activities
NIH-Funded trials published within 100 months of
completion
Less than 50% published within 30 months of completion
BMJ 2012;344:d7292
15. Increasing Clinical Trial Transparency
Proposed November 2014; Final Spring 2016 (est.)
Notice of Proposed Rulemaking: Clinical Trials Registration and
Results Submission (FDAAA, Section 801)
– Further implements statutory requirements on private and public
sponsors to register; report results on phase 2, 3, and 4 trials
– Includes drugs, biologics, and devices (except small feasibility)
Draft NIH Policy on Clinical Trial Information Dissemination
– Extends Section 801 requirements to all NIH-funded clinical trials
– Includes phase 1 trials and trials of non-FDA regulated
interventions such as behavioral trials
16. Evidence #3
Research does not follow a free market economy – you
can get rewarded regardless of what you produce
17. True Free Market - Photography
Digitization
Deception
Disruption
Demonetization
Dematerialization
Democratization
Time
Volume, Velocity, Variety
Digital camera invented by
Kodak but shelved
Megapixels & quality improve slowly;
Kodak slow to react
Film market collapses;
Kodak goes bankrupt
Phones replace
cameras
Instagram,
Flickr become the
value proposition
Digital media becomes bona fide
form of communication
18. False Market - Biomedical Research?
Digitization of Basic &
Clinical Research & EHRs
Deception
We Are Here
Disruption
Demonetization
Dematerialization
Democratization
Open science
Patient centered health care
19. Sustaining the System is a Problem
Source Michael Bell http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830
21. “And that’s why we’re here today. Because something
called precision medicine … gives us one of the greatest
opportunities for new medical breakthroughs that we
have ever seen.”
President Barack Obama
January 30, 2015
New Science
22. Let’s get a bit closer to home for this
audience ….
23. Evidence #4
Molecular graphics has not
advanced as it should
http://upload.wikimedia.org/wikipedia/commons/2/2e/Molecular-Graphics-GRIP-75-Console.jpg
24. What Did Cy Say?
1990 – “..although we may not have
"chemical insight" there are more and more
3-D structures determined experimentally to
aid in understanding which conformational
results are reasonable and which are not; as
long as we can look at them.”
25. Good News/Bad News of Molecular
Graphics Today
Good News:
– It is hard to think of a
more powerful way to
comprehend complex
data
– It has excited
generations to the
promise of science
– It has adapted to
changing technologies
Bad News:
– It is not an
adaptive/extensible
environment
– It is not a collaborative
environment
– It is not an integrative
environment
– State not transferable
BMC Bioinformatics 2005, 6:21
26. The Knowledge and Data Cycle
0. Full text of PLoS papers stored in a database
1. A link brings up figures from the paper
2. Clicking the paper figure retrieves data from the PDB which is analyzed
3. A composite view of journal and database content results
4. The composite view has links to pertinent blocks of literature text and back to the PDB
Is a database really different than a biological journal?
PLoS Comp Biol 2005 1(3) e34
27. Evidence #5
By Pbroks13 (talk) - File:Views on Evolution.jpg; New Scientist Magazine, 19 April 2008, Vol. 198, No. 2652, page 31: "Evolution myths: It doesn't matter if people don't grasp evolution"; New Scientist Magazine, 19 August 2006, Vol. 191, No. 2565, page 11: "Why doesn't America believe in evolution?"; Public Domain, https://commons.wikimedia.org/w/index.php?curid=4403503
28. Nature’s Reductionism
There are ~ 20^300
possible proteins
>>>> all the atoms in the Universe
~58M protein sequences from
58K organisms (source RefSeq)
116,539 protein structures
yield 1393 domain folds (SCOP)
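The comparison between possible proteins and atoms in the Universe can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the original slides. It assumes the count on the slide is 20^300 (20 amino acids at each of 300 positions) and uses the commonly quoted order-of-magnitude estimate of 10^80 atoms in the observable Universe; both numbers are assumptions made for illustration.

```python
import math

AMINO_ACIDS = 20    # standard amino acid alphabet
CHAIN_LENGTH = 300  # assumed chain length, reading the slide's count as 20^300
ATOMS_LOG10 = 80    # assumed estimate: about 10^80 atoms in the observable Universe


def log10_sequence_space(alphabet: int, length: int) -> float:
    """Return log10 of the number of distinct sequences of the given length."""
    return length * math.log10(alphabet)


sequence_space_log10 = log10_sequence_space(AMINO_ACIDS, CHAIN_LENGTH)
excess_orders = sequence_space_log10 - ATOMS_LOG10

print(f"20^300 is about 10^{sequence_space_log10:.0f}")
print(f"That exceeds the assumed atom count by about {excess_orders:.0f} orders of magnitude")
```

Working in logarithms keeps the comparison readable; Python integers could also hold 20**300 exactly if the full number is wanted.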
29. Is structure a useful
discriminator of species?
Yang, Doolittle & Bourne (2005) PNAS 102(2) 373-8
30. Method – Distance Determination
Rows: SCOP SUPERFAMILY (FSF); columns: organisms

Presence/Absence Data Matrix
FSF       C. intestinalis   C. briggsae   F. rubripes
a.1.1     1                 1             1
a.1.2     1                 1             1
a.10.1    0                 0             1
a.100.1   1                 1             1
a.101.1   0                 0             0
a.102.1   0                 1             1
a.102.2   1                 1             1

Distance Matrix
                  C. intestinalis   C. briggsae   F. rubripes
C. intestinalis   0                 101           109
C. briggsae                         0             144
F. rubripes                                       0
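The step from the presence/absence matrix to the distance matrix can be sketched as follows. The slide does not state which distance measure was used, so this example assumes the simplest one: the number of superfamilies present in one organism and absent in the other. Only the seven rows shown on the slide are used, so the values will not match the 101, 109 and 144 above, which come from the full table. The function names are hypothetical.

```python
from itertools import combinations

organisms = ["C. intestinalis", "C. briggsae", "F. rubripes"]

# Rows copied from the slide: FSF identifier -> presence (1) or absence (0)
# in each organism, in the same order as the `organisms` list.
presence = {
    "a.1.1":   (1, 1, 1),
    "a.1.2":   (1, 1, 1),
    "a.10.1":  (0, 0, 1),
    "a.100.1": (1, 1, 1),
    "a.101.1": (0, 0, 0),
    "a.102.1": (0, 1, 1),
    "a.102.2": (1, 1, 1),
}


def fsf_distance(i, j, table):
    """Count superfamilies whose presence differs between organisms i and j.

    Assumption: a plain mismatch count, not necessarily the published measure.
    """
    return sum(1 for row in table.values() if row[i] != row[j])


def distance_matrix(names, table):
    """Build a symmetric distance lookup keyed by (organism, organism)."""
    dist = {(name, name): 0 for name in names}
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        d = fsf_distance(i, j, table)
        dist[(a, b)] = dist[(b, a)] = d
    return dist


matrix = distance_matrix(organisms, presence)
for a, b in combinations(organisms, 2):
    print(f"{a} vs {b}: {matrix[(a, b)]}")
```

A distance lookup of this shape is the usual input to a distance-based tree-building method.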
31. The Answer Would Appear to be
Yes
It is possible to
generate a
reasonable tree of life
from merely the
presence or absence
of superfamilies
(FSFs) within a given
proteome
33. Evolution of the Earth
4.5 billion years of change
300 ± 50 K
1-5 atmospheres
Constant photoenergy
Chemical and geological
changes
Life has evolved in this time
The ocean was the “cradle”
for 90% of evolution
34. Whether the deep ocean
became oxic or euxinic
following the rise in
atmospheric oxygen (~2.3
Gya) is debated, therefore
both are shown (oxic ocean:
solid lines; euxinic ocean:
dashed lines).
The phylogenetic tree symbols
at the top of the figure show
one idea as to the theoretical
periods of diversification for
each Superkingdom.
Figure: Theoretical Levels of Trace Metals and Oxygen in the Deep Ocean Through Earth’s History
X axis: billions of years before present
Y axis: concentration (O2 in arbitrary units; Zn and Fe in moles L^-1)
Series: Oxygen, Zinc, Iron, Cobalt, Manganese
Superkingdoms: Bacteria, Archaea, Eukarya
Replotted from Saito et al, 2003
Inorganica Chimica Acta 356: 308-318
36. Good News/Bad News for the PDB in
this Changing Landscape
Bad News:
– Interface complex and
uni-data oriented
– Data accessible;
methods accessible (sort
of); but not together
– Significant redundancy in
services offered
– Sustainability
Good News:
– Annotation!
– Demand is increasing
– Integrated with other
data types
– RESTful services
37. General Problem Statement:
How to ensure a high quality
annotated data source that provides
the optimal environment for
accessibility, integration and analysis
by a broad community of diverse
users?
39. The Commons
Components
Computing environment
– cloud or HPC (High Performance Computing)
– supports access, utilization, sharing and storage of
digital objects.
Methods for Interoperability
– enable connectivity, shareability and interoperability
between digital objects.
Digital object compliance model
– describes the properties of digital objects that
enable them to be discoverable and shareable.
42. Commons - Pilots
The Cloud Credits - business model
BD2K Centers
MODs (Model Organism Databases)
HMP Data and tools available in the cloud
NCI Cloud Pilots & Genomic Data
Commons
43. The PDB in the Commons
Components:
– Annotated collection of data files
– APIs to access these data files
– Example methods using these APIs
Potential outcomes
– Nothing happens?
– A new breed of developer starts to use PDB data in new
ways?
– The casual user has a broader set of services than
previously?
– Quality declines/increases?
44. Delineation of polypharmacology
across the human structural kinome
using a functional site interaction
fingerprint approach
Zhao et al. J. Med. Chem., 2016,
DOI: 10.1021/acs.jmedchem.5b02041
Evidence #7
The difficulty of translating academic
ideas into products
46. Binding Mode Characterization
of Kinase Inhibitors
Clustering of Fs-IFP across
the structural kinome
Spatial locations for
the binding regions
for the eight clusters
47. Kinase Binding Profile Prediction Using
Fs-IFP
ROC curves of the
trained support
vector machine model
The performance of
predicted binding
profile of 51 type-I
inhibitors to 344
kinases
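For readers less familiar with ROC curves, the evaluation named on this slide can be sketched in plain Python. This is not the code behind the published results. The labels and scores below are made-up toy values, and the function names are hypothetical; the sketch only shows how a ROC curve and its area are derived from classifier scores such as the decision values of a support vector machine.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs.

    The threshold is swept from the highest score to the lowest,
    and tied scores are consumed together.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy data for illustration only: 1 = binder, 0 = non-binder.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(toy_labels, toy_scores)
print(f"AUC = {auc(curve):.3f}")
```

An area of 1.0 means every binder is scored above every non-binder, while 0.5 is what random scoring would give.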
48. Summary
There is more intelligence than we
think.
The complex systems we study are
also the reason we do not make faster
progress
49. Acknowledgements
The 133 Folks who have passed
through my lab over the years
Cy Levinthal for giving me this
opportunity
https://docs.google.com/spreadsheets/d/1QZ48UaKcwDl_iFCvBmJsT03FK-bMchdfuIHe9Oxc-rw/edit#gid=0
Figure 2. Cumulative percentage of studies published in a peer reviewed biomedical journal indexed by Medline during 100 months after trial completion among all NIH funded clinical trials registered within ClinicalTrials.gov
Public benefits to clinical trials data-sharing (OSP):
Inform future research and research funding decisions
Mitigate bias (e.g., non publication of results, especially negative results)
Prevent duplication of unsafe trials
Meet ethical obligation to human subjects (i.e., that results inform science)
Increase access to data about marketed products
All contribute to public trust in clinical research
Source: Ross JS, Tse T, Zarin DA, Xu H, Zhou L, Krumholz HM. Publication of NIH funded trials registered in ClinicalTrials.gov: cross-sectional analysis. BMJ 2012;344:d7292.
Nearly 900 comments received on the NPRM, many simply stating broad support
Final Rule expected Spring 2016
Section 801 of the Food and Drug Administration Amendments Act (FDAAA)
Photos: FC tweet; RK screen grab
Digital object = data or analytics software
Sequence reference with a variety of electrostatic properties encoded.