Dr. Jabe Wilson, Elsevier's Consulting Director of Text and Data Analytics, gave this presentation at the Bio-IT World Conference in Boston on May 17, 2018.
Zen and the Art of Data Science Maintenance
1. | 0
Dr Jabe Wilson, Elsevier R&D Solutions Professional Services
Bio-IT World, 17 May 2018
2. | 1
The experience of doing Data Science
“Programming is never easy … you’re kind of always on this frontier where you are out of your depth. And one of the things you have to learn is to accept this feeling – of being constantly wrong. Which makes coding sound like a branch of Zen Buddhism”
- Andrew Smith, “Code to Joy”, The Economist 1843, April/May.
3. | 2
Data Science as an Art
• Intuition
• Qualitative insights
• Exploring a problem through solutions
• “Inspiration exists, but it has to find you working.” - Pablo Picasso
Studio, Tony Wilson (1973), my father. http://www.tonywilsonpainterprintmaker.com/
4. | 3
What can go wrong doing Data Science
• Bad Data
• Bad Models
• Opaque Predictions
5. | 4
Good Data
Curation:
• Tagging against dictionaries
• Mapping dictionaries
• Regularising numeric units
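As a rough illustration of regularising numeric units, the sketch below converts concentration strings to a single unit (nanomolar). The function name `to_nanomolar`, the accepted unit list and the choice of nM as the target unit are all assumptions made for this example, not part of the presentation.

```python
import re

# Multipliers from each accepted unit to nanomolar (nM).
# The list of accepted units is an assumption for this sketch.
UNIT_TO_NM = {"M": 1e9, "mM": 1e6, "uM": 1e3, "µM": 1e3, "nM": 1.0, "pM": 1e-3}

# A number, optional whitespace, then a unit made of letters (µ or Greek μ allowed).
_CONCENTRATION = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([A-Za-zµμ]+)\s*$")


def to_nanomolar(text: str) -> float:
    """Parse a concentration such as '6.3nM' or '0.5 uM' and return the value in nM."""
    match = _CONCENTRATION.match(text)
    if match is None:
        raise ValueError(f"unrecognised concentration: {text!r}")
    value = float(match.group(1))
    unit = match.group(2).replace("μ", "µ")  # treat Greek mu like the micro sign
    if unit not in UNIT_TO_NM:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * UNIT_TO_NM[unit]


if __name__ == "__main__":
    for raw in ["6.3nM", "0.5 uM", "10mM"]:
        print(raw, "->", to_nanomolar(raw), "nM")
```

Rejecting unknown units, rather than guessing, keeps curation errors visible instead of silently mixing scales.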
6. | 5
Right Data
Depends on your choice of model (semantic or machine learning).
What happens when the model changes (do you still have enough data)?
9. | 8
Examples of Data Science in practice
• Rare disease treatment: Highly curated data allows us to make predictions, but building the model also requires judgement
• Translational safety: Concordance data is predictive, but it also shows the importance of curating taxonomies
• Evidence selection: To select the right information sets you need to be able to filter on the context of parameter-based assertions (machine learning can help improve data selection)
• Real World Data interpretation: Machine learning classification can be enhanced with taxonomies, and can also deliver across multimodal data sets
11. | 10
Normalizing vocabularies required: proteins, diseases, drugs, chemicals
• Biological pathways extracted via semantic text mining: A upregulates B; B upregulates C; C increases disease (A → B → C → disease)
• Bioactivities through text analysis: IC50 6.3 nM, kinase binding assay, 10 mM concentration
• Chemical structures and properties
• Identifiers and vocabularies: InChI, Name, NCBI, UniProt, EMTREE, ReaxysTree, Structures
12. | 11
What Constitutes Big Data?
• Very large data sets
- On the order of ~10^7 documents published (patents, journals, books)
- Each document has ~200 sentences, giving ~10^9 statements
- Statements are about molecules, properties, reactions, indications, etc.
• Combinatorial connections between large data sets
- “Connecting the dots” among these facts results in a very large number of possible connections
- There are $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$ combinations of k elements chosen from a pool of n.
Pathways
• Relationships mined from 12,000 titles, 25M documents
• <subject> <verb> <object> relationships
• Each subject, object, verb has a taxonomy
• Example: “protein” causes/induces disease
Compounds
• 16,000 journal titles plus patent offices
• Compounds, Reactions, Properties
• Over 6 million compounds with bioactivity
Bioassays
• Biological relationships mined from journals/patents (over 16 million)
• <compound> <verb> <object> <quantity>
• Example: Sunitinib binds-to Bcr-abl in <assay type> at 1nM
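To give a feel for the combinatorial growth mentioned on this slide, the sketch below counts how many pairs and triples can be drawn from a pool of statements. It uses the order-of-magnitude figure quoted on the slide; choosing k = 2 and k = 3 is just an illustration.

```python
from math import comb

# Order of magnitude of statements quoted on the slide (~10^9).
n_statements = 10**9

# Number of ways to choose k statements from the pool: n! / (k! (n - k)!)
pairs = comb(n_statements, 2)
triples = comb(n_statements, 3)

print(f"pairs:   {pairs:.3e}")
print(f"triples: {triples:.3e}")
```

Even pairwise connections number around 5 × 10^17, which is why exhaustive enumeration is impractical and curated taxonomies are used to narrow the search.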
13. | 12
Building and refining the disease model for hyperinsulinism
• Picked relevant pathways (from a collection of 1800 models)
• Explored functions of proteins using 6.2M pre-text-mined relations and embedded Gene Ontology
• Summarized what is known about the CHI (congenital hyperinsulinism) mechanism in an overview model
14. | 13
From pathways to treatments
Automated analysis combines bioassay data with text-mined data
Step 1: Find all targets that could be used to affect the disease state
• 88 targets related to hyperinsulinism with ≥3 literature references
• Full relationship information
15. | 14
From pathways to treatments
Automated analysis combines bioassay data with text-mined data
Step 1: Find all targets that could be used to affect the disease state (targets based on text mining)
Step 2: Query for each protein to find compounds that target it (>6 log units) (approved compounds; bioassay data)
16. | 15
From pathways to treatments
Automated analysis combines bioassay data with text-mined data
Step 1: Find all targets that could be used to affect the disease state
Step 2: Query for each protein to find compounds that target it (>6 log units)
Step 3: Collate data by compound to summarize the targets/activities related to disease that the compound hits
• Compute geometric mean of activities for ranking
• Rank by number of targets and geometric mean of activities against targets
Results shown:
• All compounds that were observed to bind to targets in pathway
• Sorted by number of active targets. Too many targets may suggest lack of specificity.
• Targets and activities for each compound
• Mean of activities among these targets
• Drug-likeness metrics for sorting/classification
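The collation and ranking in Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the presentation: the record layout, the function name `rank_compounds`, and the reading of ">6 log units" as -log10(activity in mol/L) > 6 (that is, more potent than 1 µM) are all assumptions.

```python
from collections import defaultdict
from math import log10

# Assumption: ">6 log units" means -log10(activity in mol/L) > 6.
THRESHOLD_P = 6.0


def rank_compounds(records, disease_targets):
    """Rank compounds by number of disease targets hit, then by potency.

    records: iterable of (compound, target, activity_nM) tuples (hypothetical layout).
    disease_targets: set of targets from the text-mined disease model.
    Returns a list of (compound, n_targets, geometric_mean_activity_nM).
    """
    hits = defaultdict(dict)  # compound -> {target: best p-activity}
    for compound, target, activity_nm in records:
        if target not in disease_targets:
            continue
        p_activity = 9.0 - log10(activity_nm)  # nM -> -log10(mol/L)
        if p_activity > THRESHOLD_P:
            best = hits[compound].get(target)
            # Keep the most potent measurement per compound-target pair.
            if best is None or p_activity > best:
                hits[compound][target] = p_activity

    rows = []
    for compound, by_target in hits.items():
        # The arithmetic mean in log space is the log of the geometric mean.
        mean_p = sum(by_target.values()) / len(by_target)
        geo_mean_nm = 10 ** (9.0 - mean_p)
        rows.append((compound, len(by_target), geo_mean_nm))

    # More targets first; ties broken by the more potent (smaller) geometric mean.
    rows.sort(key=lambda row: (-row[1], row[2]))
    return rows


if __name__ == "__main__":
    # Invented example data, for illustration only.
    demo = [
        ("compound_A", "target_1", 10.0),
        ("compound_A", "target_2", 100.0),
        ("compound_B", "target_1", 1.0),
        ("compound_C", "target_3", 5000.0),  # weaker than 1 µM, filtered out
    ]
    for row in rank_compounds(demo, {"target_1", "target_2", "target_3"}):
        print(row)
```

Sorting by descending target count follows the slide's "sorted by number of active targets"; as the slide notes, a very high count may indicate lack of specificity, so a real ranking might penalise it.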
17. | 16
Approved compounds that may treat hyperinsulinism
• Each binds to one or more targets related to the disease
• Can easily be obtained and tested in preclinical studies
• List includes a compound known to treat hyperinsulinism, sirolimus
18. | 17
Example: Process for Finding New Indications for a Drug (Ruxolitinib)
Step 1: Find all targets for which the compound has high affinity (targets inhibited)
• Find all compound-protein/gene relationships with > 1 reference using text analysis
Step 2: Using the unique set of proteins from step 1, search for all diseases reported to be related to them (targets related to disease)
Step 3: Collate the diseases by targets and activity of the compound
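The three steps can be sketched as a pair of lookups followed by a grouping. This is only an illustration under assumed data structures: the function name `candidate_indications`, the dictionary layout and all identifiers in the demo are hypothetical, and the activity weighting mentioned in the collation step is left out for brevity.

```python
from collections import defaultdict


def candidate_indications(compound, compound_target_refs, target_disease, min_refs=2):
    """Suggest diseases linked to a compound through its well-supported targets.

    compound_target_refs: dict mapping (compound, target) -> number of
        supporting literature references (hypothetical layout).
    target_disease: iterable of (target, disease) pairs from text mining.
    min_refs: 2 corresponds to "> 1 reference".
    Returns a list of (disease, [supporting targets]), most targets first.
    """
    # Step 1: targets of the compound with enough literature support.
    targets = {
        target
        for (name, target), n_refs in compound_target_refs.items()
        if name == compound and n_refs >= min_refs
    }

    # Step 2: diseases reported to be related to those targets.
    by_disease = defaultdict(set)
    for target, disease in target_disease:
        if target in targets:
            by_disease[disease].add(target)

    # Step 3: collate per disease; more supporting targets ranks higher.
    collated = [(disease, sorted(ts)) for disease, ts in by_disease.items()]
    collated.sort(key=lambda item: (-len(item[1]), item[0]))
    return collated


if __name__ == "__main__":
    # Invented placeholder data, for illustration only.
    refs = {
        ("drug_X", "protein_1"): 3,
        ("drug_X", "protein_2"): 1,  # only one reference, dropped
        ("drug_X", "protein_3"): 2,
    }
    links = [
        ("protein_1", "disease_a"),
        ("protein_3", "disease_a"),
        ("protein_3", "disease_b"),
        ("protein_2", "disease_c"),
    ]
    print(candidate_indications("drug_X", refs, links))
```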
19. | 18
This Analysis Shows Connections of Ruxolitinib to Alopecia
A cancer drug that grows hair! Trials are under way
Alopecia areata is driven by cytotoxic T lymphocytes and is reversed by JAK inhibition. Nature Medicine 20, 1043–1049 (2014). doi:10.1038/nm.3645
Global transcriptional profiling of mouse and human AA skin revealed gene expression signatures indicative of cytotoxic T cell infiltration, an interferon-γ (IFNG) response and upregulation of several γ-chain (γc) cytokines known to promote the activation and survival of IFN-γ–producing CD8+NKG2D+ effector T cells. Therapeutically, antibody-mediated blockade of IFN-γ, interleukin-2 (IL-2) or interleukin-15 receptor β (IL-15Rβ) prevented disease development, reducing the accumulation of CD8+NKG2D+ T cells in the skin and the dermal IFN response in a mouse model of AA.
21. | 20
A translational safety big data analysis
• Concordance between preclinical studies and human adverse events, based on the calculation of positive likelihood ratios.
- Chi-squared tells us if there is a statistically significant relationship of any kind between the human and animal observations (which is used as a filter).
- The likelihood ratio measures the predictive value of the animal observation.
22. | 21
A translational safety big data analysis
• If the chi-squared is high and the likelihood ratio is low, one can state with high confidence that the animal observation does not predict the human observation.
• In that case the animal model should not be used.
23. | 22
A translational safety big data analysis
• If the chi-squared is high and the likelihood ratio is high, one can state with high confidence that the animal observation does predict the human observation.
• In that case checks for adverse events can be added to clinical trials.
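The filter-then-measure logic on these slides can be sketched with a 2×2 contingency table per adverse event. This is an illustrative sketch only: the function names, the layout of the counts and the likelihood-ratio cut-off are assumptions (the slides do not state what counts as "high"). The chi-squared critical value of 3.841 is the standard one for 1 degree of freedom at p = 0.05.

```python
CHI2_CRITICAL_1DF = 3.841  # 1 degree of freedom, p = 0.05


def concordance_stats(tp, fp, fn, tn):
    """Chi-squared and positive likelihood ratio for a 2x2 table.

    tp: animal finding present, human event present
    fp: animal finding present, human event absent
    fn: animal finding absent,  human event present
    tn: animal finding absent,  human event absent
    """
    n = tp + fp + fn + tn
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    if denom == 0:
        raise ValueError("every row and column total must be non-zero")
    chi2 = n * (tp * tn - fp * fn) ** 2 / denom  # Pearson, no continuity correction
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    lr_plus = sensitivity / false_positive_rate if fp else float("inf")
    return chi2, lr_plus


def interpret(chi2, lr_plus, lr_threshold=5.0):
    """lr_threshold is a hypothetical cut-off; the slides do not give one."""
    if chi2 < CHI2_CRITICAL_1DF:
        return "no statistically significant relationship"
    if lr_plus >= lr_threshold:
        return "animal observation predicts human observation"
    return "animal observation does not predict human observation"


if __name__ == "__main__":
    # Invented counts, for illustration only.
    chi2, lr_plus = concordance_stats(tp=30, fp=10, fn=20, tn=140)
    print(round(chi2, 2), round(lr_plus, 2), interpret(chi2, lr_plus))
```

Using chi-squared purely as a gate and the likelihood ratio as the measure mirrors the slide: significance says a relationship exists, while the ratio says whether it is useful for prediction.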
24. | 23
A translational safety big data analysis
• Curation of taxonomy data.
• The higher levels of the MedDRA hierarchy sometimes include such a variety of events that the additional false positives and negatives result in no statistical confidence in the relationship.
26. | 25
Cold mice problem
• If we can interpret and classify complex parameter-based statements, this allows us to select the right data.
• 22°C cage (standard housing): stress/immune response to cold; decreased response to chemotoxic drugs
• 30°C cage (thermoneutrality): no immune response to cold; increased response to chemotoxic drugs
27. | 26
Cold mice problem: results
All mice were maintained in a temperature controlled (22 ± 2 °C) environment 12-h light 12-h dark photocycle and fed rodent chow meal.
The mice were individually placed into an acrylic cylinder (25 cm height 10 cm diameter) containing 8 cm of water maintained at 22–24 °C
This allows research reports to be filtered on whether their results are likely to be reliable given the experimental conditions.
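The two excerpts show why context matters: both mention a temperature near 22 °C, but only the first describes housing. The sketch below is a deliberately simple rule-based stand-in for the classification described here, which the presentation suggests doing with machine learning. The cue words, the function name `classify_housing` and the 28 °C cut-off are assumptions for illustration.

```python
import re

# A temperature, optionally followed by a tolerance or range, e.g. "22 ± 2 °C" or "22–24 °C".
TEMPERATURE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:[±–-]\s*(\d+(?:\.\d+)?))?\s*°\s*C")

# Assumed cue words suggesting the sentence describes animal housing.
HOUSING_CUES = ("housed", "housing", "environment", "cage")


def classify_housing(sentence, thermoneutral_min_c=28.0):
    """Return (temperature_C, label) for housing statements, else None.

    thermoneutral_min_c is a hypothetical cut-off between the 22 °C and
    30 °C conditions shown on the previous slide.
    """
    match = TEMPERATURE.search(sentence)
    lowered = sentence.lower()
    if match is None or not any(cue in lowered for cue in HOUSING_CUES):
        return None
    temp_c = float(match.group(1))
    label = "thermoneutral" if temp_c >= thermoneutral_min_c else "below thermoneutral"
    return temp_c, label


if __name__ == "__main__":
    housing = ("All mice were maintained in a temperature controlled (22 ± 2 °C) "
               "environment 12-h light 12-h dark photocycle and fed rodent chow meal.")
    swim_test = ("The mice were individually placed into an acrylic cylinder "
                 "(25 cm height 10 cm diameter) containing 8 cm of water "
                 "maintained at 22–24 °C")
    print(classify_housing(housing))    # housing statement
    print(classify_housing(swim_test))  # water temperature, not housing
```

Keyword cues like these are brittle, which is the motivation for a learned classifier over parameter-based assertions.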
29. | 28
Real World Data interpretation
• Machine Learning:
- Classify images.
- Classify concepts (combining taxonomies with word embeddings improves performance on similarity measurement and entity classification).
• Opportunities for developing multimodal classification of data sources with unstructured text and unlabelled images.
30. | 29
Supporting Data Scientists to deliver results
• These use case examples illustrate the challenges and creativity required to deliver Data Science.
• We are developing a platform to help support these activities.
o Good data: curated data.
o Right data: export graph and feature data.
o In-time data: bringing data sets together in a knowledge hub to enable in-time data.
31. | 30
A platform for supporting Data Scientists
• “Inspiration exists, but it has to find you working.” - Pablo Picasso
• If you want to become a Data Science Platform development partner, or wish to hear more about continuing developments around Data Science at Elsevier, please contact me:
• www.linkedin.com/in/jabewilson/
• jabe.wilson@elsevier.com
Studio, Tony Wilson (1973), my father. http://www.tonywilsonpainterprintmaker.com/
32. | 31
Acknowledgements
• Helena F. Deus, Corey Harper, Darin McBeath and Ron Daniel Jr – Elsevier Labs
• Matthew Clark, Frederik van den Broek, Anton Yuryev, Maria Shkrob – Elsevier Professional Services
• Thomas Steger-Hartmann, Investigational Toxicology, Bayer AG