SlideShare une entreprise Scribd logo
1  sur  26
Télécharger pour lire hors ligne
Office of Research and Development
The importance of data curation on
QSAR Modeling: PHYSPROP open
data as a case study
Kamel Mansouri,
Christopher Grulke
Ann Richard
Richard Judson
Antony Williams
NCCT, U.S. EPA
The views expressed in this presentation are those of the author and do not
necessarily reflect the views or policies of the U.S. EPA
QSAR 2016
14 June 2016, Miami, FL
Office of Research and Development
National Center for Computational Toxicology
Recent Cheminformatics development at NCCT
• We are building a new cheminformatics architecture
• PUBLIC dashboard gives access to curated chemistry
• Focus on integrating EPA and external resources
• Aggregating and curating data, visualization elements and
“services” to underpin other efforts
• RapidTox
• Read-across
• Predictive modeling
• Non-targeted screening
2
Office of Research and Development
National Center for Computational Toxicology
Developing “NCCT Models”
• Interest in physicochemical properties to include in exposure modeling,
augmented with ToxCast HTS in vitro data etc.
• Our approach to modeling:
– Obtain high quality training sets
– Apply appropriate modeling approaches
– Validate performance of models
– Define the applicability domain and limitations of the models
– Use models to predict properties across our full datasets
• Work has been initiated using available physicochemical data
Office of Research and Development
National Center for Computational Toxicology
PHYSPROP Data: Available from:
http://esc.syrres.com/interkow/EpiSuiteData.htm
• Water solubility
• Melting Point
• Boiling Point
• LogP (KOWWIN: Octanol-water partition coefficient)
• Atmospheric Hydroxylation Rate
• LogBCF (Bioconcentration Factor)
• Biodegradation Half-life
• Ready biodegradability
• Henry's Law Constant
• Fish Biotransformation Half-life
• LogKOA (Octanol/Air Partition Coefficient)
• LogKOC (Soil Adsorption Coefficient)
• Vapor Pressure
Office of Research and Development
National Center for Computational Toxicology
The Approach
• To build models we need the set of chemicals and their
property series
• Our curation process
– Decide on the “chemical” by checking levels of consistency
– We did NOT validate each measured property value
– Perform initial analysis manually to understand how to clean the data
(chemical structure and ID)
– Automate the process (and test iteratively)
– Process all datasets using final method
Office of Research and Development
National Center for Computational Toxicology
KNIME Workflow to Evaluate the Dataset
Office of Research and Development
National Center for Computational Toxicology
LogP dataset: 15,809 chemicals (structures)
• CAS Checksum: 12163 valid, 3646 invalid (>23%)
• Invalid names: 555
• Invalid SMILES 133
• Valence errors: 322 Molfile, 3782 SMILES (>24%)
• Duplicates check:
–31 DUPLICATE MOLFILES
–626 DUPLICATE SMILES
–531 DUPLICATE NAMES
• SMILES vs. Molfiles (structure check)
–1279 differ in stereochemistry (~8%)
–362 “Covalent Halogens”
–191 differ as tautomers
–436 are different compounds (~3%)
Office of Research and Development
National Center for Computational Toxicology
Examples of Errors
Valence Errors Different Compounds
Office of Research and Development
National Center for Computational Toxicology
Examples of Errors
Duplicate Structures Covalent Halogens
Office of Research and Development
National Center for Computational Toxicology
Data Files & Quality flags
• The data files have FOUR representations of a
chemical, plus the property value.
http://esc.syrres.com/interkow/EpiSuiteData.htm
4 levels of consistency exists
among:
• The Molblock
• The SMILES string
• The chemical name (based on
ACD/Labs dictionary)
• The CAS Number (based on a
DSSTox lookup)
Office of Research and Development
National Center for Computational Toxicology
Quality FLAGS into LogP data
• 4 Stars ENHANCED: 4 levels of consistency with stereo information
• 4 Stars: 4 levels of consistency, stereo ignored.
• 3 Stars Plus: 3 out of 4 levels. The 4th is a tautomer.
• 3 Stars ENHANCED: 4 levels of consistency with stereo information
• 3 Stars: 3 levels of consistency, stereo ignored.
• 2 Stars PLUS : 2 out of 4 levels. The 3rd is a tautomer.
• 1 Star - What's left.
Office of Research and Development
National Center for Computational Toxicology
Structure
standardization
Remove of
duplicates
Normalize of
tautomers
Clean salts and
counterions
Remove inorganics
and mixtures
Final inspection
QSAR-ready
structures
Initial
structures
Office of Research and Development
National Center for Computational Toxicology
KNIME workflow
UNC, DTU, EPA Consensus
Aim of the workflow:
• Combine (not reproduce) different procedures and ideas
• Minimize the differences between the structures used for prediction
by different groups
• Produce a flexible free and open source workflow to be shared
Indigo
Mansouri et al. (http://ehp.niehs.nih.gov/15-10267/)
Office of Research and Development
National Center for Computational Toxicology
Summary:
Property Initial file flagged Updated 3-4 STAR Curated QSAR ready
AOP 818 818 745
BCF 685 618 608
BioHC 175 151 150
Biowin 1265 1196 1171
BP 5890 5591 5436
HL 1829 1758 1711
KM 631 548 541
KOA 308 277 270
LogP 15809 14544 14041
MP 10051 9120 8656
PC 788 750 735
VP 3037 2840 2716
WF 5764 5076 4836
WS 2348 2046 2010
Office of Research and Development
National Center for Computational Toxicology
Development of a QSAR model
• Curation of the data
» Flagged and curated files available for sharing
• Preparation of training and test sets
» Inserted as a field in SDFiles and csv data files
• Calculation of an initial set of descriptors
» PaDEL 2D descriptors and fingerprints generated and shared
• Selection of a mathematical method
» Several approaches tested: KNN, PLS, SVM…
• Variable selection technique
» Genetic algorithm
• Validation of the model’s predictive ability
» 5-fold cross validation and external test set
• Define the Applicability Domain
» Local (nearest neighbors) and global (leverage) approaches
Office of Research and Development
National Center for Computational Toxicology
QSARs validity, reliability, applicability
and adequacy for regulatory purposes
ORCHESTRA. Theory,
guidance and application
on QSAR and REACH;
2012. http://home.
deib.polimi.it/gini/papers/or
chestra.pdf.
Office of Research and Development
National Center for Computational Toxicology
The 5
OECD
principles:
Principle Description
1) A defined endpoint Any physicochemical, biological or environmental effect
that can be measured and therefore modelled.
2) An unambiguous
algorithm
Ensure transparency in the description of the model
algorithm.
3) A defined domain of
applicability
Define limitations in terms of the types of chemical
structures, physicochemical properties and mechanisms of
action for which the models can generate reliable
predictions.
4) Appropriate measures
of goodness-of-fit,
robustness and
predictivity
a) The internal fitting performance of a model
b) the predictivity of a model, determined by using an
appropriate external test set.
5) Mechanistic
interpretation, if possible
Mechanistic associations between the descriptors used in
a model and the endpoint being predicted.
The conditions for the validity of QSARs
Office of Research and Development
National Center for Computational Toxicology
NCCT
models
Prop Vars 5-fold CV (75%) Training (75%) Test (25%)
Q2 RMSE N R2 RMSE N R2 RMSE
BCF 10 0.84 0.55 465 0.85 0.53 161 0.83 0.64
BP 13 0.93 22.46 4077 0.93 22.06 1358 0.93 22.08
LogP 9 0.85 0.69 10531 0.86 0.67 3510 0.86 0.78
MP 15 0.72 51.8 6486 0.74 50.27 2167 0.73 52.72
VP 12 0.91 1.08 2034 0.91 1.08 679 0.92 1
WS 11 0.87 0.81 3158 0.87 0.82 1066 0.86 0.86
HL 9 0.84 1.96 441 0.84 1.91 150 0.85 1.82
Office of Research and Development
National Center for Computational Toxicology
NCCT
models
Prop Vars 5-fold CV (75%) Training (75%) Test (25%)
Q2 RMSE N R2 RMSE N R2 RMSE
AOH 13 0.85 1.14 516 0.85 1.12 176 0.83 1.23
BioHL 6 0.89 0.25 112 0.88 0.26 38 0.75 0.38
KM 12 0.83 0.49 405 0.82 0.5 136 0.73 0.62
KOC 12 0.81 0.55 545 0.81 0.54 184 0.71 0.61
KOA 2 0.95 0.69 202 0.95 0.65 68 0.96 0.68
BA Sn-Sp BA Sn-Sp BA Sn-Sp
R-Bio 10 0.8 0.82-0.78 1198 0.8 0.82-0.79 411 0.79 0.81-0.77
Office of Research and Development
National Center for Computational Toxicology
LogP Model: Weighted kNN Model,
9 descriptors
Weighted 5-nearest neighbors
9 Descriptors
Training set: 10531 chemicals
Test set: 3510 chemicals
5 fold Cross-validation:
Q2=0.85 RMSE=0.69
Fitting:
R2=0.86 RMSE=0.67
Test:
R2=0.86 RMSE=0.78
Office of Research and Development
National Center for Computational Toxicology
Standalone
application:
Input:
–MATLAB .mat file, an ASCII file with only a matrix of variables
–SDF file or SMILES strings of QSAR-ready structures. In this case
the program will calculate PaDEL 2D descriptors and make the
predictions.
• The program will extract the molecules names from the input csv or SDF
(or assign arbitrary names if not) As IDs for the predictions.
Output
• Depending on the extension, the can be text file or csv with
–A list of molecules IDs and predictions
–Applicability domain
–Accuracy of the prediction
–Similarity index to the 5 nearest neighbors
–The 5 nearest neighbors from the training set: Exp. value, Prediction,
InChi key
Office of Research and Development
National Center for Computational Toxicology
The iCSS Chemistry Dashboard
at https://comptox.epa.gov
22
Office of Research and Development
National Center for Computational Toxicology
Office of Research and Development
National Center for Computational Toxicology
Conclusion
• QSAR prediction models (kNN) produced for all properties
• 700k chemical structures pushed through NCCT_Models
• Supplementary data will include appropriate files with flags
– full dataset plus QSAR ready form
• Full performance statistics available for all models
• Models will be deployed as prediction engines in the future
– one chemical at a time and batch processing (to be done
after RapidTox Project)
Office of Research and Development
National Center for Computational Toxicology
Acknowledgements
National Center for Computational Toxicology
Office of Research and Development
National Center for Computational Toxicology
26
Thank you for your attention

Contenu connexe

Tendances

Virtual screening of chemicals for endocrine disrupting activity through CER...
Virtual screening of chemicals for endocrine disrupting activity through  CER...Virtual screening of chemicals for endocrine disrupting activity through  CER...
Virtual screening of chemicals for endocrine disrupting activity through CER...Kamel Mansouri
 
OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...Kamel Mansouri
 
An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...Kamel Mansouri
 
Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...Kamel Mansouri
 
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...Kamel Mansouri
 
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Kamel Mansouri
 
Data drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryData drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryAnn-Marie Roche
 
Developing tools for high resolution mass spectrometry-based screening via th...
Developing tools for high resolution mass spectrometry-based screening via th...Developing tools for high resolution mass spectrometry-based screening via th...
Developing tools for high resolution mass spectrometry-based screening via th...Andrew McEachran
 
Multivariate data analysis and visualization tools for biological data
Multivariate data analysis and visualization tools for biological dataMultivariate data analysis and visualization tools for biological data
Multivariate data analysis and visualization tools for biological dataDmitry Grapov
 
SETAC Rome Non-Target Screening For Chemical Discovery
SETAC Rome Non-Target Screening For Chemical DiscoverySETAC Rome Non-Target Screening For Chemical Discovery
SETAC Rome Non-Target Screening For Chemical DiscoveryEmma Schymanski
 

Tendances (20)

The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
 
Virtual screening of chemicals for endocrine disrupting activity through CER...
Virtual screening of chemicals for endocrine disrupting activity through  CER...Virtual screening of chemicals for endocrine disrupting activity through  CER...
Virtual screening of chemicals for endocrine disrupting activity through CER...
 
Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...
 
OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...OPERA: A free and open source QSAR tool for predicting physicochemical proper...
OPERA: A free and open source QSAR tool for predicting physicochemical proper...
 
Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...
 
Structure Identification Using High Resolution Mass Spectrometry Data and the...
Structure Identification Using High Resolution Mass Spectrometry Data and the...Structure Identification Using High Resolution Mass Spectrometry Data and the...
Structure Identification Using High Resolution Mass Spectrometry Data and the...
 
Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...
 
Structure Identification Using High Resolution Mass Spectrometry Data and the...
Structure Identification Using High Resolution Mass Spectrometry Data and the...Structure Identification Using High Resolution Mass Spectrometry Data and the...
Structure Identification Using High Resolution Mass Spectrometry Data and the...
 
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
 
The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...
 
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
 
An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...
 
Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...
 
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
 
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
 
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
 
Data drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryData drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistry
 
Developing tools for high resolution mass spectrometry-based screening via th...
Developing tools for high resolution mass spectrometry-based screening via th...Developing tools for high resolution mass spectrometry-based screening via th...
Developing tools for high resolution mass spectrometry-based screening via th...
 
Multivariate data analysis and visualization tools for biological data
Multivariate data analysis and visualization tools for biological dataMultivariate data analysis and visualization tools for biological data
Multivariate data analysis and visualization tools for biological data
 
SETAC Rome Non-Target Screening For Chemical Discovery
SETAC Rome Non-Target Screening For Chemical DiscoverySETAC Rome Non-Target Screening For Chemical Discovery
SETAC Rome Non-Target Screening For Chemical Discovery
 

En vedette

2 d qsar model of dihydrofolate reductase (dhfr) inhibitors with activity in ...
2 d qsar model of dihydrofolate reductase (dhfr) inhibitors with activity in ...2 d qsar model of dihydrofolate reductase (dhfr) inhibitors with activity in ...
2 d qsar model of dihydrofolate reductase (dhfr) inhibitors with activity in ...Prasanthperceptron
 
Computer-aided prediction of xenobiotics toxicity
Computer-aided prediction of xenobiotics toxicityComputer-aided prediction of xenobiotics toxicity
Computer-aided prediction of xenobiotics toxicityMobiliuz
 
Plans for Creative Writing
Plans for Creative WritingPlans for Creative Writing
Plans for Creative WritingFatheha Rahman
 
Why are we still doing industrial age drug
Why are we still doing industrial age drugWhy are we still doing industrial age drug
Why are we still doing industrial age drugSean Ekins
 
Haapsalu Kolledži valikseminar
Haapsalu Kolledži valikseminarHaapsalu Kolledži valikseminar
Haapsalu Kolledži valikseminarJüri Kaljundi
 
Uram ecp course
Uram ecp courseUram ecp course
Uram ecp coursetigerron
 
Eit orginal
Eit orginalEit orginal
Eit orginalanamsini
 
Pintxo banderilla olmeda origenes
Pintxo banderilla olmeda origenesPintxo banderilla olmeda origenes
Pintxo banderilla olmeda origenesOlmeda Orígenes
 
A Writing Group Strategy for Scientists
A Writing Group Strategy for ScientistsA Writing Group Strategy for Scientists
A Writing Group Strategy for Scientistsgizemk
 
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)Food Insight
 
ILASCD - Student-Centered Leadership
ILASCD - Student-Centered LeadershipILASCD - Student-Centered Leadership
ILASCD - Student-Centered LeadershipPJ Caposey
 
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...Digiday
 
MEMS sensor catalog with I2C
MEMS sensor catalog with I2CMEMS sensor catalog with I2C
MEMS sensor catalog with I2CAkira Sasaki
 
Sagar alone qsar studies of saponin analogues for anticancer activity
Sagar alone  qsar studies of saponin analogues for anticancer activitySagar alone  qsar studies of saponin analogues for anticancer activity
Sagar alone qsar studies of saponin analogues for anticancer activitysagar alone
 
Green chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by designGreen chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by designAlex Clark
 
The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...Pieter De Leenheer
 
[Etude] Entrepreneurs de la Tech : qui sont-ils?
[Etude] Entrepreneurs de la Tech : qui sont-ils?[Etude] Entrepreneurs de la Tech : qui sont-ils?
[Etude] Entrepreneurs de la Tech : qui sont-ils?FrenchWeb.fr
 

En vedette (20)

2 d qsar model of dihydrofolate reductase (dhfr) inhibitors with activity in ...
2 d qsar model of dihydrofolate reductase (dhfr) inhibitors with activity in ...2 d qsar model of dihydrofolate reductase (dhfr) inhibitors with activity in ...
2 d qsar model of dihydrofolate reductase (dhfr) inhibitors with activity in ...
 
Computer-aided prediction of xenobiotics toxicity
Computer-aided prediction of xenobiotics toxicityComputer-aided prediction of xenobiotics toxicity
Computer-aided prediction of xenobiotics toxicity
 
Plans for Creative Writing
Plans for Creative WritingPlans for Creative Writing
Plans for Creative Writing
 
Why are we still doing industrial age drug
Why are we still doing industrial age drugWhy are we still doing industrial age drug
Why are we still doing industrial age drug
 
Haapsalu Kolledži valikseminar
Haapsalu Kolledži valikseminarHaapsalu Kolledži valikseminar
Haapsalu Kolledži valikseminar
 
Uram ecp course
Uram ecp courseUram ecp course
Uram ecp course
 
Eit orginal
Eit orginalEit orginal
Eit orginal
 
Pintxo banderilla olmeda origenes
Pintxo banderilla olmeda origenesPintxo banderilla olmeda origenes
Pintxo banderilla olmeda origenes
 
A Writing Group Strategy for Scientists
A Writing Group Strategy for ScientistsA Writing Group Strategy for Scientists
A Writing Group Strategy for Scientists
 
Codes & Tiny Houses
Codes & Tiny HousesCodes & Tiny Houses
Codes & Tiny Houses
 
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
 
ILASCD - Student-Centered Leadership
ILASCD - Student-Centered LeadershipILASCD - Student-Centered Leadership
ILASCD - Student-Centered Leadership
 
MANEJO DE LA ENFERMEDAD DE CHAGAS EN ATENCIÓN PRIMARIA EN ESPAÑA
MANEJO DE LA ENFERMEDAD DE CHAGAS  EN ATENCIÓN PRIMARIA EN ESPAÑAMANEJO DE LA ENFERMEDAD DE CHAGAS  EN ATENCIÓN PRIMARIA EN ESPAÑA
MANEJO DE LA ENFERMEDAD DE CHAGAS EN ATENCIÓN PRIMARIA EN ESPAÑA
 
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
 
MEMS sensor catalog with I2C
MEMS sensor catalog with I2CMEMS sensor catalog with I2C
MEMS sensor catalog with I2C
 
Sagar alone qsar studies of saponin analogues for anticancer activity
Sagar alone  qsar studies of saponin analogues for anticancer activitySagar alone  qsar studies of saponin analogues for anticancer activity
Sagar alone qsar studies of saponin analogues for anticancer activity
 
Green chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by designGreen chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by design
 
25.qsar
25.qsar25.qsar
25.qsar
 
The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...The Data Driven University - Automating Data Governance and Stewardship in Au...
The Data Driven University - Automating Data Governance and Stewardship in Au...
 
[Etude] Entrepreneurs de la Tech : qui sont-ils?
[Etude] Entrepreneurs de la Tech : qui sont-ils?[Etude] Entrepreneurs de la Tech : qui sont-ils?
[Etude] Entrepreneurs de la Tech : qui sont-ils?
 

Similaire à The importance of data curation on QSAR Modeling: PHYSPROP open data as a case study. Presented at QSAR2016 Miami, FL 13-17, 2016

CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...Kamel Mansouri
 
International Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsInternational Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsKamel Mansouri
 
CoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
CoMPARA: Collaborative Modeling Project for Androgen Receptor ActivityCoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
CoMPARA: Collaborative Modeling Project for Androgen Receptor ActivityKamel Mansouri
 
Using open bioactivity data for developing machine-learning prediction models...
Using open bioactivity data for developing machine-learning prediction models...Using open bioactivity data for developing machine-learning prediction models...
Using open bioactivity data for developing machine-learning prediction models...Sunghwan Kim
 
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...Kamel Mansouri
 
SOT short course on computational toxicology
SOT short course on computational toxicology SOT short course on computational toxicology
SOT short course on computational toxicology Sean Ekins
 
Development and sharing of ADME/Tox and Drug Discovery Machine learning models
Development and sharing of ADME/Tox and Drug Discovery Machine learning modelsDevelopment and sharing of ADME/Tox and Drug Discovery Machine learning models
Development and sharing of ADME/Tox and Drug Discovery Machine learning modelsSean Ekins
 
Lecture 9 molecular descriptors
Lecture 9  molecular descriptorsLecture 9  molecular descriptors
Lecture 9 molecular descriptorsRAJAN ROLTA
 
Too good to be true? How validate your data
Too good to be true? How validate your dataToo good to be true? How validate your data
Too good to be true? How validate your dataAlex Henderson
 
Development of machine learning-based prediction models for chemical modulato...
Development of machine learning-based prediction models for chemical modulato...Development of machine learning-based prediction models for chemical modulato...
Development of machine learning-based prediction models for chemical modulato...Sunghwan Kim
 

Similaire à The importance of data curation on QSAR Modeling: PHYSPROP open data as a case study. Presented at QSAR2016 Miami, FL 13-17, 2016 (20)

CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
 
International Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsInternational Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology Problems
 
CoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
CoMPARA: Collaborative Modeling Project for Androgen Receptor ActivityCoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
CoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
 
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
Environmental Chemistry Compound Identification Using High Resolution Mass Sp...
 
Using open bioactivity data for developing machine-learning prediction models...
Using open bioactivity data for developing machine-learning prediction models...Using open bioactivity data for developing machine-learning prediction models...
Using open bioactivity data for developing machine-learning prediction models...
 
Prediction of pKa from chemical structure using free and open source tools
Prediction of pKa from chemical structure using free and open source toolsPrediction of pKa from chemical structure using free and open source tools
Prediction of pKa from chemical structure using free and open source tools
 
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
 
Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...
 
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
 
New Approach Methods - What is That?
New Approach Methods - What is That?New Approach Methods - What is That?
New Approach Methods - What is That?
 
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted AnalysisThe US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
 
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
 
Chemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet EraChemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet Era
 
SOT short course on computational toxicology
SOT short course on computational toxicology SOT short course on computational toxicology
SOT short course on computational toxicology
 
Development and sharing of ADME/Tox and Drug Discovery Machine learning models
Development and sharing of ADME/Tox and Drug Discovery Machine learning modelsDevelopment and sharing of ADME/Tox and Drug Discovery Machine learning models
Development and sharing of ADME/Tox and Drug Discovery Machine learning models
 
Lecture 9 molecular descriptors
Lecture 9  molecular descriptorsLecture 9  molecular descriptors
Lecture 9 molecular descriptors
 
Structure identification approaches using the EPA CompTox Chemicals Dashboard...
Structure identification approaches using the EPA CompTox Chemicals Dashboard...Structure identification approaches using the EPA CompTox Chemicals Dashboard...
Structure identification approaches using the EPA CompTox Chemicals Dashboard...
 
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
 
Too good to be true? How validate your data
Too good to be true? How validate your dataToo good to be true? How validate your data
Too good to be true? How validate your data
 
Development of machine learning-based prediction models for chemical modulato...
Development of machine learning-based prediction models for chemical modulato...Development of machine learning-based prediction models for chemical modulato...
Development of machine learning-based prediction models for chemical modulato...
 

Dernier

STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxMurugaveni B
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024Jene van der Heide
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalMAESTRELLAMesa2
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingNetHelix
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologycaarthichand2003
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPirithiRaju
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringPrajakta Shinde
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...D. B. S. College Kanpur
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naJASISJULIANOELYNV
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...Universidade Federal de Sergipe - UFS
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuinethapagita
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 

Dernier (20)

STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and Vertical
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technology
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
Microteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical EngineeringMicroteaching on terms used in filtration .Pharmaceutical Engineering
Microteaching on terms used in filtration .Pharmaceutical Engineering
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by na
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 

The importance of data curation on QSAR Modeling: PHYSPROP open data as a case study. Presented at QSAR2016 Miami, FL 13-17, 2016

  • 1. Office of Research and Development The importance of data curation on QSAR Modeling: PHYSPROP open data as a case study Kamel Mansouri, Christopher Grulke Ann Richard Richard Judson Antony Williams NCCT, U.S. EPA The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA QSAR 2016 14 June 2016, Miami, FL
  • 2. Office of Research and Development National Center for Computational Toxicology Recent Cheminformatics development at NCCT • We are building a new cheminformatics architecture • PUBLIC dashboard gives access to curated chemistry • Focus on integrating EPA and external resources • Aggregating and curating data, visualization elements and “services” to underpin other efforts • RapidTox • Read-across • Predictive modeling • Non-targeted screening 2
  • 3. Office of Research and Development National Center for Computational Toxicology Developing “NCCT Models” • Interest in physicochemical properties to include in exposure modeling, augmented with ToxCast HTS in vitro data etc. • Our approach to modeling: – Obtain high quality training sets – Apply appropriate modeling approaches – Validate performance of models – Define the applicability domain and limitations of the models – Use models to predict properties across our full datasets • Work has been initiated using available physicochemical data
  • 4. Office of Research and Development National Center for Computational Toxicology PHYSPROP Data: Available from: http://esc.syrres.com/interkow/EpiSuiteData.htm • Water solubility • Melting Point • Boiling Point • LogP (KOWWIN: Octanol-water partition coefficient) • Atmospheric Hydroxylation Rate • LogBCF (Bioconcentration Factor) • Biodegradation Half-life • Ready biodegradability • Henry's Law Constant • Fish Biotransformation Half-life • LogKOA (Octanol/Air Partition Coefficient) • LogKOC (Soil Adsorption Coefficient) • Vapor Pressure
  • 5. Office of Research and Development National Center for Computational Toxicology The Approach • To build models we need the set of chemicals and their property series • Our curation process – Decide on the “chemical” by checking levels of consistency – We did NOT validate each measured property value – Perform initial analysis manually to understand how to clean the data (chemical structure and ID) – Automate the process (and test iteratively) – Process all datasets using final method
  • 6. Office of Research and Development National Center for Computational Toxicology KNIME Workflow to Evaluate the Dataset
  • 7. Office of Research and Development National Center for Computational Toxicology LogP dataset: 15,809 chemicals (structures) • CAS Checksum: 12163 valid, 3646 invalid (>23%) • Invalid names: 555 • Invalid SMILES 133 • Valence errors: 322 Molfile, 3782 SMILES (>24%) • Duplicates check: –31 DUPLICATE MOLFILES –626 DUPLICATE SMILES –531 DUPLICATE NAMES • SMILES vs. Molfiles (structure check) –1279 differ in stereochemistry (~8%) –362 “Covalent Halogens” –191 differ as tautomers –436 are different compounds (~3%)
  • 8. Office of Research and Development National Center for Computational Toxicology Examples of Errors Valence Errors Different Compounds
  • 9. Office of Research and Development National Center for Computational Toxicology Examples of Errors Duplicate Structures Covalent Halogens
  • 10. Office of Research and Development National Center for Computational Toxicology Data Files & Quality flags • The data files have FOUR representations of a chemical, plus the property value. http://esc.syrres.com/interkow/EpiSuiteData.htm 4 levels of consistency exists among: • The Molblock • The SMILES string • The chemical name (based on ACD/Labs dictionary) • The CAS Number (based on a DSSTox lookup)
  • 11. Office of Research and Development National Center for Computational Toxicology Quality FLAGS into LogP data • 4 Stars ENHANCED: 4 levels of consistency with stereo information • 4 Stars: 4 levels of consistency, stereo ignored. • 3 Stars Plus: 3 out of 4 levels. The 4th is a tautomer. • 3 Stars ENHANCED: 4 levels of consistency with stereo information • 3 Stars: 3 levels of consistency, stereo ignored. • 2 Stars PLUS : 2 out of 4 levels. The 3rd is a tautomer. • 1 Star - What's left.
  • 12. Office of Research and Development National Center for Computational Toxicology Structure standardization Remove of duplicates Normalize of tautomers Clean salts and counterions Remove inorganics and mixtures Final inspection QSAR-ready structures Initial structures
  • 13. Office of Research and Development National Center for Computational Toxicology KNIME workflow UNC, DTU, EPA Consensus Aim of the workflow: • Combine (not reproduce) different procedures and ideas • Minimize the differences between the structures used for prediction by different groups • Produce a flexible free and open source workflow to be shared Indigo Mansouri et al. (http://ehp.niehs.nih.gov/15-10267/)
  • 14. Office of Research and Development National Center for Computational Toxicology Summary: Property Initial file flagged Updated 3-4 STAR Curated QSAR ready AOP 818 818 745 BCF 685 618 608 BioHC 175 151 150 Biowin 1265 1196 1171 BP 5890 5591 5436 HL 1829 1758 1711 KM 631 548 541 KOA 308 277 270 LogP 15809 14544 14041 MP 10051 9120 8656 PC 788 750 735 VP 3037 2840 2716 WF 5764 5076 4836 WS 2348 2046 2010
  • 15. Office of Research and Development National Center for Computational Toxicology Development of a QSAR model • Curation of the data » Flagged and curated files available for sharing • Preparation of training and test sets » Inserted as a field in SDFiles and csv data files • Calculation of an initial set of descriptors » PaDEL 2D descriptors and fingerprints generated and shared • Selection of a mathematical method » Several approaches tested: KNN, PLS, SVM… • Variable selection technique » Genetic algorithm • Validation of the model’s predictive ability » 5-fold cross validation and external test set • Define the Applicability Domain » Local (nearest neighbors) and global (leverage) approaches
  • 16. Office of Research and Development National Center for Computational Toxicology QSARs validity, reliability, applicability and adequacy for regulatory purposes ORCHESTRA. Theory, guidance and application on QSAR and REACH; 2012. http://home. deib.polimi.it/gini/papers/or chestra.pdf.
  • 17. Office of Research and Development National Center for Computational Toxicology The 5 OECD principles: Principle Description 1) A defined endpoint Any physicochemical, biological or environmental effect that can be measured and therefore modelled. 2) An unambiguous algorithm Ensure transparency in the description of the model algorithm. 3) A defined domain of applicability Define limitations in terms of the types of chemical structures, physicochemical properties and mechanisms of action for which the models can generate reliable predictions. 4) Appropriate measures of goodness-of-fit, robustness and predictivity a) The internal fitting performance of a model b) the predictivity of a model, determined by using an appropriate external test set. 5) Mechanistic interpretation, if possible Mechanistic associations between the descriptors used in a model and the endpoint being predicted. The conditions for the validity of QSARs
  • 18. Office of Research and Development National Center for Computational Toxicology NCCT models Prop Vars 5-fold CV (75%) Training (75%) Test (25%) Q2 RMSE N R2 RMSE N R2 RMSE BCF 10 0.84 0.55 465 0.85 0.53 161 0.83 0.64 BP 13 0.93 22.46 4077 0.93 22.06 1358 0.93 22.08 LogP 9 0.85 0.69 10531 0.86 0.67 3510 0.86 0.78 MP 15 0.72 51.8 6486 0.74 50.27 2167 0.73 52.72 VP 12 0.91 1.08 2034 0.91 1.08 679 0.92 1 WS 11 0.87 0.81 3158 0.87 0.82 1066 0.86 0.86 HL 9 0.84 1.96 441 0.84 1.91 150 0.85 1.82
  • 19. Office of Research and Development National Center for Computational Toxicology NCCT models Prop Vars 5-fold CV (75%) Training (75%) Test (25%) Q2 RMSE N R2 RMSE N R2 RMSE AOH 13 0.85 1.14 516 0.85 1.12 176 0.83 1.23 BioHL 6 0.89 0.25 112 0.88 0.26 38 0.75 0.38 KM 12 0.83 0.49 405 0.82 0.5 136 0.73 0.62 KOC 12 0.81 0.55 545 0.81 0.54 184 0.71 0.61 KOA 2 0.95 0.69 202 0.95 0.65 68 0.96 0.68 BA Sn-Sp BA Sn-Sp BA Sn-Sp R-Bio 10 0.8 0.82-0.78 1198 0.8 0.82-0.79 411 0.79 0.81-0.77
  • 20. Office of Research and Development National Center for Computational Toxicology LogP Model: Weighted kNN Model, 9 descriptors Weighted 5-nearest neighbors 9 Descriptors Training set: 10531 chemicals Test set: 3510 chemicals 5 fold Cross-validation: Q2=0.85 RMSE=0.69 Fitting: R2=0.86 RMSE=0.67 Test: R2=0.86 RMSE=0.78
  • 21. Office of Research and Development National Center for Computational Toxicology Standalone application: Input: –MATLAB .mat file, an ASCII file with only a matrix of variables –SDF file or SMILES strings of QSAR-ready structures. In this case the program will calculate PaDEL 2D descriptors and make the predictions. • The program will extract the molecules names from the input csv or SDF (or assign arbitrary names if not) As IDs for the predictions. Output • Depending on the extension, the can be text file or csv with –A list of molecules IDs and predictions –Applicability domain –Accuracy of the prediction –Similarity index to the 5 nearest neighbors –The 5 nearest neighbors from the training set: Exp. value, Prediction, InChi key
  • 22. Office of Research and Development National Center for Computational Toxicology The iCSS Chemistry Dashboard at https://comptox.epa.gov 22
  • 23. Office of Research and Development National Center for Computational Toxicology
  • 24. Office of Research and Development National Center for Computational Toxicology Conclusion • QSAR prediction models (kNN) produced for all properties • 700k chemical structures pushed through NCCT_Models • Supplementary data will include appropriate files with flags – full dataset plus QSAR ready form • Full performance statistics available for all models • Models will be deployed as prediction engines in the future – one chemical at a time and batch processing (to be done after RapidTox Project)
  • 25. Office of Research and Development National Center for Computational Toxicology Acknowledgements National Center for Computational Toxicology
  • 26. Office of Research and Development National Center for Computational Toxicology 26 Thank you for your attention