Zen and the Art of Data Science Maintenance

| 0
Dr Jabe Wilson, Elsevier R&D Solutions Professional Services
Zen and the Art of Data Science Maintenance
Bio-IT World, 17 May 2018

| 1
The experience of doing Data Science
“Programming is never easy …
you’re kind of always on this
frontier where you are out of your
depth. And one of the things you
have to learn is to accept this
feeling – of being constantly
wrong. Which makes coding
sound like a branch of Zen
Buddhism”
- Andrew Smith, Code to Joy.
April/May. The Economist, 1843.

| 2
Data Science as an Art
• Intuition
• Qualitative insights
• Exploring a problem through
solutions
• “Inspiration exists, but it has to
find you working.”
- Pablo Picasso
Studio, Tony Wilson (1973) my father. http://www.tonywilsonpainterprintmaker.com/

| 3
What can go wrong doing Data Science
• Bad Data
• Bad Models
• Opaque Predictions

| 4
Good Data
Curation:
• Tagging against dictionaries
• Mapping dictionaries
• Regularising numeric units

| 5
Right Data
Depends on your choice of
model (semantic or machine
learning).
What happens when the model
changes (do you still have
enough data)?

| 6
In-time Data
• Transactional workflows
• Dynamic knowledge hubs
• Opportunity costs

| 7
Examples of Data Science in
practice

| 8
Examples of Data Science in practice
• Rare disease treatment: Highly curated data allows
us to make predictions, but also required judgement
in the building of the model
• Translational safety: Concordance data is
predictive; but also shows the importance of curating
taxonomies
• Evidence selection: In order to select the right
information sets you need to be able to filter on the
context of parameter based assertions (machine
learning can help improve data selection)
• Real World Data interpretation: Machine learning
classification can be enhanced with taxonomies, but
also deliver across multimodal data sets

| 9
taxonomies

| 10
Biological Pathways
extracted via semantic
text mining
A upregulates B
B upregulates C
C increases
Disease
A  B  C  disease
Bioactivities
through text analysis
IC50 6.3nM, kinase binding
assay 10mM concentration
Chemical Structures
And Properties
InChI,
Name
NCBI,
UniProt
EMTREE
ReaxysTree,
Structures
Normalizing vocabularies required: proteins, diseases, drugs, chemicals
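
As a minimal sketch of how text-mined statements such as "A upregulates B" can be chained into a pathway, the toy example below stores each statement as a (subject, verb, object) triple and searches for a chain from A to the disease. The triples, the function name `find_chain`, and the breadth-first search are illustrative assumptions, not the method used in the talk. The chaining only works because every entity is spelled identically in each triple, which is what vocabulary normalization is meant to guarantee.

```python
from collections import deque

# Hypothetical triples mirroring the toy pathway on the slide.
TRIPLES = [
    ("A", "upregulates", "B"),
    ("B", "upregulates", "C"),
    ("C", "increases", "disease"),
]


def find_chain(triples, start, goal):
    """Breadth-first search over triples.

    Returns the list of triples linking start to goal, or None if no chain exists.
    """
    # Index triples by subject so each node's outgoing statements are quick to find.
    outgoing = {}
    for subject, verb, obj in triples:
        outgoing.setdefault(subject, []).append((subject, verb, obj))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for triple in outgoing.get(node, []):
            next_node = triple[2]
            if next_node not in seen:
                seen.add(next_node)
                queue.append((next_node, path + [triple]))
    return None


chain = find_chain(TRIPLES, "A", "disease")
for subject, verb, obj in chain:
    print(subject, verb, obj)
```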

| 11
• Very large data sets
- Order of ~10^7 documents published (patents, journals, books)
- Each document has ~200 sentences, giving ~10^9 statements.
- Statements are about molecules, properties, reactions, indications etc.
• Combinatorial connections between large data sets
- “connecting the dots” among these facts results in a very large number of
possible connections
- $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$ combinations of k elements chosen from a pool of n.
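
To give a feel for the scale, the sketch below plugs the slide's rough figures into the binomial coefficient using Python's `math.comb`. The variable names are illustrative; the counts are the order-of-magnitude estimates quoted above, not measured values.

```python
import math

documents = 10**7             # order-of-magnitude estimate from the slide
sentences_per_document = 200  # order-of-magnitude estimate from the slide
statements = documents * sentences_per_document  # 2 * 10**9, i.e. order 10**9

# Number of ways to pick 2 or 3 statements to "connect the dots" between.
pairs = math.comb(statements, 2)
triples = math.comb(statements, 3)

print(f"statements: {statements:.1e}")
print(f"pairs:      {pairs:.1e}")
print(f"triples:    {triples:.1e}")
```

Even pairwise connections number about 2 × 10^18, which is why exhaustive enumeration is impractical and taxonomies are needed to constrain the search.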
What Constitutes Big Data?
Pathways
• Relationships mined
from 12,000 titles,
25M documents
• <subject> <verb>
<object> relationships
• Each subject, object,
verb has a taxonomy
• Example: “protein”
causes/induces
disease
Compounds
• 16,000 journal titles
plus patent offices
• Compounds,
Reactions,
Properties
• Over 6 million
compounds with
bioactivity
Bioassays
• Biological relationships
mined from journals/patents
(over 16 million)
• <compound> <verb>
<object> <quantity>
• Example: Sunitinib binds-to
Bcr-abl in <assay type> at
1nM

| 12
Building and refining the disease model for hyperinsulinism
Picked relevant
pathways
(from a collection of 1800
models)
Explored functions of
proteins using 6.2M
pre-text-mined relations
and embedded Gene
Ontology
Summarized what is known
about CHI mechanism in an
overview model

| 13
Automated analysis combines bioassay data with text-mined data
• 88 targets related to
hyperinsulinism with ≥3
literature references
• Full relationship
information
Find all targets that
could be used to affect
the disease state
Step 1
From pathways to treatments

| 14
Query for each protein to find
compounds that target it (>6
log units)
Step 1 Step 2
Targets based on
text mining
Approved
compounds
Bioassay data

| 15
Mean of activities
among these targets
Targets and activities for
each compound
Drug-likeness
metrics for
sorting/classification
• All compounds that
were observed to bind
to targets in pathway
• Sorted by number of
active targets.
Too many targets may
suggest lack of specificity.
Collate data by compound to summarize the
targets/activities related to disease that the
compound hits
• Compute geometric mean of activities for ranking
• Rank by number of targets and geometric mean of
activities against targets
Step 1 Step 2
Step 3
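
Steps 2 and 3 can be sketched as a filter followed by a group-and-rank. The example below is an illustration under stated assumptions, not the pipeline used in the talk: the records are invented, activities are taken to be IC50 values in nM, ">6 log units" is read as pIC50 > 6, and ties on target count are broken by the more potent (lower) geometric mean. The names `p_activity` and `rank_compounds` are hypothetical.

```python
import math
from statistics import geometric_mean

# Hypothetical records: (compound, target, IC50 in nM). Not real data.
RECORDS = [
    ("cmpd_1", "T1", 10.0),
    ("cmpd_1", "T2", 2000.0),  # pIC50 ~5.7, below the cut-off
    ("cmpd_1", "T3", 40.0),
    ("cmpd_2", "T1", 5.0),
    ("cmpd_3", "T2", 100.0),
    ("cmpd_3", "T3", 1.0),
    ("cmpd_3", "T4", 5000.0),  # pIC50 ~5.3, below the cut-off
]


def p_activity(ic50_nm):
    """Convert an IC50 in nM to pIC50 (negative log10 of the molar value)."""
    return 9.0 - math.log10(ic50_nm)


def rank_compounds(records, threshold=6.0):
    """Collate active targets per compound, then rank.

    Returns rows of (compound, number of active targets, geometric mean IC50 in nM),
    sorted by target count (descending) and geometric mean (ascending).
    """
    by_compound = {}
    for compound, target, ic50_nm in records:
        if p_activity(ic50_nm) > threshold:
            by_compound.setdefault(compound, {})[target] = ic50_nm

    rows = [
        (compound, len(targets), geometric_mean(list(targets.values())))
        for compound, targets in by_compound.items()
    ]
    rows.sort(key=lambda row: (-row[1], row[2]))
    return rows


ranking = rank_compounds(RECORDS)
for compound, n_targets, gmean in ranking:
    print(f"{compound}: {n_targets} targets, geometric mean {gmean:.1f} nM")
```

The geometric mean is the natural average here because activities span orders of magnitude; it is equivalent to averaging in log units.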

| 16
Approved compounds that may treat hyperinsulinism
• Each binds to one or
more targets related
to the disease
• Can easily be
obtained and tested
in preclinical studies
• List includes a
compound known to
treat hyperinsulinism,
sirolimus

| 17
Example: Process for Finding New Indications for a Drug (Ruxolitinib)
Find all targets for which
the compound has high
affinity
Collate the diseases by targets
and activity of the compound
Using the unique set of proteins
from step 1, search for all
diseases reported to be related
to them
Step 1 Step 2 Step 3
Find all compound-
protein/gene relationships
with > 1 reference using
text analysis
Targets
inhibited
Targets
Related to
Disease
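
The three steps amount to joining bioassay data to text-mined protein-disease relations. The sketch below shows one way this could look; all rows, identifiers and thresholds are hypothetical (the affinity cut-off of 6 log units is borrowed from the hyperinsulinism example, and "> 1 reference" is implemented as at least 2), and the function name `candidate_indications` is invented for illustration.

```python
from collections import defaultdict

# Hypothetical bioassay rows: (compound, protein, pActivity). Not real data.
BIOASSAYS = [
    ("drug_X", "P1", 8.5),
    ("drug_X", "P2", 7.2),
    ("drug_X", "P3", 4.9),
]

# Hypothetical text-mined rows: (protein, disease, number of references).
TEXT_MINED = [
    ("P1", "disease_a", 12),
    ("P2", "disease_a", 3),
    ("P2", "disease_b", 2),
    ("P1", "disease_c", 1),
    ("P3", "disease_d", 9),
]


def candidate_indications(compound, bioassays, text_mined,
                          min_p_activity=6.0, min_refs=2):
    # Step 1: targets for which the compound has high affinity.
    targets = {
        protein: activity
        for name, protein, activity in bioassays
        if name == compound and activity > min_p_activity
    }

    # Step 2: diseases related to those targets with enough literature support.
    diseases = defaultdict(list)
    for protein, disease, refs in text_mined:
        if protein in targets and refs >= min_refs:
            diseases[disease].append((protein, targets[protein]))

    # Step 3: collate by disease; most supporting targets first, then best activity.
    return sorted(
        diseases.items(),
        key=lambda item: (-len(item[1]), -max(act for _, act in item[1])),
    )


indications = candidate_indications("drug_X", BIOASSAYS, TEXT_MINED)
for disease, evidence in indications:
    print(disease, evidence)
```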

| 18
This Analysis Shows Connections of Ruxolitinib to Alopecia
A cancer drug that grows hair! Trials are under way
Alopecia areata is driven by cytotoxic T lymphocytes and is reversed by JAK inhibition
Nature Medicine 20, 1043–1049 (2014) doi:10.1038/nm.3645
Global transcriptional profiling of mouse and human AA skin revealed gene expression
signatures indicative of cytotoxic T cell infiltration, an interferon-γ (IFNG) response and
upregulation of several γ-chain (γc) cytokines known to promote the activation and
survival of IFN-γ–producing CD8+NKG2D+ effector T cells. Therapeutically, antibody-
mediated blockade of IFN-γ, interleukin-2 (IL-2) or interleukin-15 receptor β (IL-15Rβ)
prevented disease development, reducing the accumulation of CD8+NKG2D+ T cells in the
skin and the dermal IFN response in a mouse model of AA.

| 19
taxonomies

| 20
• Concordance between
preclinical studies and human
adverse events, based on the
calculation of positive likelihood
ratios.
- Chi-squared tells us if there is a
statistically significant
relationship of any kind
between the human and animal
observations (which is used as
a filter).
- The likelihood ratio measures
the predictive value of the
animal observation.
A translational safety big data analysis
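
Both statistics can be computed from a 2×2 table of animal versus human observations. The sketch below uses the standard Pearson chi-squared formula for a 2×2 table (no continuity correction) and the usual definition of the positive likelihood ratio, sensitivity / (1 − specificity). The counts and the function name `concordance_stats` are invented for illustration, and whether the original analysis applied a correction is not stated in the slides.

```python
def concordance_stats(a, b, c, d):
    """Chi-squared and positive likelihood ratio for a 2x2 concordance table.

    a: animal finding present, human adverse event present
    b: animal finding present, human adverse event absent
    c: animal finding absent,  human adverse event present
    d: animal finding absent,  human adverse event absent
    """
    n = a + b + c + d
    chi_squared = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

    sensitivity = a / (a + c)
    false_positive_rate = b / (b + d)  # equals 1 - specificity
    positive_lr = sensitivity / false_positive_rate
    return chi_squared, positive_lr


# Critical value of chi-squared with 1 degree of freedom at p = 0.05.
CRITICAL_95_1DF = 3.841

# Hypothetical counts, not taken from the study.
chi_squared, positive_lr = concordance_stats(30, 10, 20, 140)
print(f"chi-squared = {chi_squared:.2f}, significant = {chi_squared > CRITICAL_95_1DF}")
print(f"LR+ = {positive_lr:.2f}")
```

Chi-squared acts as the filter (is there any relationship at all?), while LR+ says how much an animal observation raises the odds of the human event.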

| 21
• If the chi-squared is high, and
the likelihood ratio is low, one
can state that there is high
confidence that the animal
observation does not predict
human observation.
• In which case the animal
model should not be used.

| 22
• If the chi-squared is high, and
the likelihood ratio is high, one
can state that there is high
confidence that the animal
observation does predict
human observation.
• In which case checks for
adverse events can be added
to clinical trials.

| 23
• Curation of taxonomy data.
• The higher levels of the
MedDRA hierarchy sometimes
include such a variety of
events that the additional false
positives and negatives result
in no statistical confidence in
the relationship.

| 24
taxonomies

| 25
Cold mice problem
• If we can interpret and classify complex parameter based
statements this allows us to select the right data.
22°C Cage (Standard Housing)
30°C Cage (Thermoneutrality)
Stress/Immune
response to
cold
No Immune
response to
cold
Decreased
response to
chemotoxic
drugs
Increased
response to
chemotoxic
drugs

| 26
All mice were maintained in a temperature controlled (22 ± 2 °C) environment 12-h light 12-h
dark photocycle and fed rodent chow meal.
The mice were individually placed into an acrylic cylinder (25 cm height 10 cm diameter)
containing 8 cm of water maintained at 22–24 °C
Cold mice problem: results
Allowing research reports to be filtered based on whether results will
be reliable due to experimental conditions.
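
A crude illustration of filtering on parameter-based statements: the sketch below pulls a temperature out of each sentence with a regular expression and uses a simple keyword check to separate housing conditions from other uses of temperature (here, the water in a swim test). The regular expression, the keyword rule, and the function names are assumptions made for this example; the 30 °C threshold comes from the thermoneutrality cage on the previous slide. A production system would presumably use a trained classifier rather than rules like these.

```python
import re

# Matches "22 ± 2 °C", "22–24 °C", "22-24 °C" or a single value such as "30 °C".
TEMPERATURE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:(±|–|-)\s*(\d+(?:\.\d+)?))?\s*°\s*C"
)
THERMONEUTRAL_C = 30.0


def nominal_temperature(sentence):
    """Return the nominal temperature in °C stated in the sentence, or None."""
    match = TEMPERATURE.search(sentence)
    if match is None:
        return None
    first = float(match.group(1))
    separator, second = match.group(2), match.group(3)
    if separator in ("–", "-"):
        return (first + float(second)) / 2  # midpoint of a range
    return first  # single value, or the centre of "x ± y"


def classify(sentence):
    temperature = nominal_temperature(sentence)
    if temperature is None:
        return "no temperature stated"
    if "water" in sentence.lower():
        return "not a housing condition"
    if temperature >= THERMONEUTRAL_C:
        return "thermoneutral housing"
    return "sub-thermoneutral housing"


housing = "All mice were maintained in a temperature controlled (22 ± 2 °C) environment"
swim = "containing 8 cm of water maintained at 22–24 °C"

print(classify(housing))
print(classify(swim))
```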

| 27
Use case examples
taxonomies

| 28
Real World Data interpretation
• Machine Learning:
- Classify images.
- Classify concepts (combining taxonomies with word embeddings
improves performance on similarity measurement and entity
classification).
• Opportunities for developing multimodal classification of data
sources with unstructured text and unlabelled images.

| 29
• These use case examples
illustrate the challenges and
creativity required to deliver
Data Science.
• We are developing a platform
to help support these activities.
o Good data: curated
data.
o Right data: export graph
and feature data.
o In-time data: bringing
data sets together in a
knowledge hub to
enable in-time data.
Supporting Data Scientists to deliver results

| 30
A platform for supporting Data Scientists
• Inspiration exists, but it has to
find you working.
- Pablo Picasso
• If you want to become a Data
Science Platform development
partner, or wish to hear more
about continuing developments
around Data Science at Elsevier
please contact me:
• www.linkedin.com/in/jabewilson/
• jabe.wilson@elsevier.com
Studio, Tony Wilson (1973) my father. http://www.tonywilsonpainterprintmaker.com/

| 31
Acknowledgements
• Helena F. Deus, Corey Harper, Darin McBeath and Ron Daniel Jr –
Elsevier Labs
• Matthew Clark, Frederik van den Broek, Anton Yuryev, Maria Shkrob
– Elsevier Professional Services
• Thomas Steger-Hartmann, Investigational Toxicology, Bayer AG
