SlideShare une entreprise Scribd logo
1  sur  25
Télécharger pour lire hors ligne
100 million compounds, 100K protein
structures, 2 million reactions, 1 million
journal articles, 20 million patents and 15
billion substructures
Is 20TB really Big Data?
Noel O’Boyle, Daniel Lowe, John May and
Roger Sayle
NextMove Software
From Big Data to Chemical Information (RSC CICAG) 22 April 2015 London
Big Data is…
“…a broad term for data sets so large or complex
that traditional data processing applications are
inadequate.” [Wikipedia]
Any dataset could be considered Big Data
without sufficiently efficient algorithms and
tools
100K protein structures
2 million reactions
1 million journal articles
20 million patents
100 million compounds
15 billion substructures
Swiss-Prot
Patents
PubMed Central OA
US, EU, JP, Kr
PubChem, UniChem
PubChem, ChEMBL
700K CSD entries
36 million Wikipedia articles
47 billion webpages indexed by Google
200 billion tweets per year
22 Pb EMBL-EBI’s data
20 Tb
http://pubchemblog.ncbi.nlm.nih.gov/2014/09/16/ten-years-of-service/
OVerview
• Finding matched pairs/series
• Substructure searching
• Maximum common subgraph
• Chemical text-mining
• Naming reactions
• Canonicalisation
Matched Pair Matched Series
of Length 3
+
+
+
find matched PAIRS/Series
Fragment
Index
Collate
Index
(Scaffold)
Matched
Series
• Hussain and Rea JCIM 2010, 50, 339
• ChEMBL 20 IC50 data
– 752 K datapoints from 64 K assays
• Processed in 12 minutes
– Giving 391 K matched series
Matched
Series
“Google searches are screamingly fast, so fast
that the type-ahead feature is doing the search
as you key characters in. Why are all chemical
searches so sloooow? … Ideally, as you sketch
your mol in, the searches should be happening
at the same pace, like the typeahead feature.”
John Van Drie, Nov 2011
Via Rajarshi Guha’s blog
http://blog.rguha.net/?p=993
substructure searching
• Approach: fingerprint screen (fast but false
positives) then match (slow but exact)
• Pathological cases – many pass the screen
– “denial of service queries” (Trung Nguyen)
– Substructures which happen to map to the same
bit as benzene, etc.
Faster Substructure searching
• Worst-case behaviour dominated by slow
matching
– This implies focus should be on faster matching
• Typical substructures can be expressed as
SMARTS patterns
– Arthor: fast SMARTS matching against a database
• Test set: Structure Query Collection (Andrew
Dalke, BindingDB)
– Time to count hits for 3323 queries against
eMolecules (6.9 million)
Optimised smarts matching
• Preprocess database so that matching is fast
– Efficient binary representation (minimal I/O)
– Matching done directly on binary representation
(minimal malloc)
• Match rarer atom expressions first (*)
– CCCCCBr  BrCCCCC
• Match rarer bond expressions first
– CC#C  C#CC
* c.f. slide 31 of ICCS presentation
http://www.slideshare.net/NextMoveSoftware/efficient-matching-of-multiple-
chemical-subgraphs
Time (ms)
Bingo 2h 7m Tripod 1d 6h
JCart 2h 42m FastSearch 1d 15h
RDCart 5h 9m OrChem 3d 1h
NumQueriesyettofinish
Total time for all 3323
queries
Time (ms)
NumQueriesyettofinish
Arthor+fp 36s
Arthor 27m 31s
Bingo 2h 7m Tripod 1d 6h
JCart 2h 42m FastSearch 1d 15h
RDCart 5h 9m OrChem 3d 1hTotal time for all 3323
queries
What about even larger
databases?
• Imagine a chemical database 100 times larger
than PubChem
– Search time scales linearly, so even Arthor will
take 100 times longer to search it
• A completely different approach is needed
• SmallWorld: sublinear searching by
precalculating all possible substructures and
their relationships
– The catch: disk space, and takes months of CPU-
time to generate (here’s one we made earlier)
SmallWorld -
A Graph database
Nodes are
anonymous graphs of
known substructures
Disk space: 12TB
Nodes: 19.7 billion
Edges: 64.8 billion
Maximum Common Subgraph (MCS)
• Computationally expensive
• Previous approaches:
– Backtracking algorithms
– Clique detection
– Dynamic programming
• However, with SmallWorld the computation
has already been done in advance
– MCS can be found in linear time relative to the
number of atoms in the smaller molecule
SmallWorld -
A Graph database
Chemical Text-Mining Big Data
• LeadMine for chemical entity extraction from text
– Performance: Finds ~90% of chemical structures
(compared to inter-annotator agreement of 91%*)
• Method:
– Dictionaries (e.g. list of common English words, trivial
chemical names) and grammars (e.g. all possible
IUPAC names)
– Speed just depends on the number of dictionaries
– Any dictionary can be used with spelling correction
* Krallinger et al, J. Cheminf. 2015, 7, S2
How Long to Process?
• All Open Access papers available from
PubMed Central
– 1.0 million articles processed in 3h 38min (*)
– 131K distinct compounds
• 2001-2015 USPTO applications (5 times larger)
– 4.1 million patents processed in 22h 50min (*)
– 4.3 million distinct compounds
– 84.7 million compound mentions (with at least 10
heavy atoms)
* using 4 cores
Automatic Naming of reactions
• Traditional approaches:
– Atom-mapping (can be slow, can give wrong mapping)
– Differences in fingerprints for reactants and products
(fast but has limitations)
• Our approach uses compiled SMARTS matching:
– Apply a particular reaction to the reactants and check
whether the products appears on the right
– Components are atom-mapped implicitly
Note: typically reactions in ELNs and patents are not
balanced
Performance
• Scales with the number of SMARTS patterns
used
– currently 802 patterns for 504 reactions
• Test set: 1.1 million reactions extracted from
the USPTO applications 2001-2012
– Processed in 11.2 h
– 437 K reactions named
US20010000038A1 : 3.10.2 Friedel-Crafts alkylation [Friedel-
Crafts reaction]
US20010000038A1 : 10.1.5 Wohl-Ziegler bromination
[Halogenation]
A different type of “Big” Data
• Large macromolecules can be efficiently handled by
cheminformatics tools
• Titin (35213 amino acids, 313 K atoms):
CC[C@H](C)[C@@H](C(=O)NCC(=O)N[C@@H](CCCCN)C(=O)N1CCC[C@H]1C(=O)N[C@@H](CO)C(=O)N[C@@H](Cc2c[nH]cn2)C(=
O)N3CCC[C@H]3C(=O)N[C@@H](CO)C(=O)N[C@@H](CCC(=O)O)C(=O)N4CCC[C@H]4C(=O)N[C@@H](C(C)C)C(=O)N[C@@H](CC(
C)C)C(=O)N[C@@H](C)C(=O)N[C@@H]([C@@H](C)CC)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](C)C(=O)N[C@@H](CS)C(=O)N[
C@@H](CCC(=O)O)C(=O)N5CCC[C@H]5C(=O)N6CCC[C@H]6C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](CC(=O)N)C(=O)N[C@
@H](C(C)C)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H]([C@@H](C)CC)C(=O)N[C@@H]([C@@H](C)O)C(=O)N[C@@H](CC(=O)
O)C(=O)N[C@@H]([C@@H](C)CC)C(=O)N[C@@H](CO)C(=O)N[C@@H](CCCCN)C(=O)N[C@@H](CC(=O)N)C(=O)N[C@@H](CO)C(=
O)N[C@@H](C(C)C)C(=O)N[C@@H](CC(=O)N)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CO)C(=O)N[C@@H](Cc7c[nH]c8c7cccc8)C(
=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CCC(=O)N)C(=O)N9CCC[C@H]9C(=O)N[C@@H](C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[
C@@H](CC(=O)O)C(=O)NCC(=O)NCC(=O)N[C@@H](CO)C(=O)N[C@@H](CCCCN)C(=O)N[C@@H]([C@@H](C)CC)C(=O)N[C@@H](
[C@@H](C)O)C(=O)NCC(=O)N[C@@H](Cc1ccc(cc1)O)C(=O)N[C@@H]([C@@H](C)CC)C(=O)N[C@@H](C(C)C)C(=O)N[C@@H](CCC
(=O)O)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](CC(C)C)C(=O)N1C
CC[C@H]1C(=O)N[C@@H](CC(=O)O)C(=O)NCC(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1c[nH]c2c1cccc2)C(=O)N[C@@H](
[C@@H](C)O)C(=O)N[C@@H](CCCCN)C(=O)N[C@@H](C)C(=O)N[C@@H](CO)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H]([C@@H
](C)O)C(=O)N[C@@H](CC(=O)N)C(=O)N[C@@H](C(C)C)C(=O)N[C@@H]([C@@H](C)CC)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@
H]([C@@H](C)O)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H]([C@@H](C)O)C(=O)N[C@@H](C(C)C)C(
=O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC(C)C)C(=O)N[C@@H]([C@@H](C)O)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC
(=O)N)C(=O)N[C@@H](CO)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](Cc1ccc(cc1)O)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H
](Cc1ccccc1)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](C(C)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](C)C(=O)……
A different type of “Big” Data
• Large macromolecules can be efficiently handled by
cheminformatics tools
• Titin (35213 amino acids, 313 K atoms):
– Canonical Smiles: 238s (state-of-the-art)
– Canonical Smiles: 1.7s (our preliminary results)
• All Swiss-Prot entries (541K structures)
– Remove those with ambiguous structures (e.g. Asx)
– 453K protein structures canonicalised in 2m 57s (4 cores)
100 million compounds, 100K protein
structures, 2 million reactions, 1 million
journal articles, 20 million patents and 15
billion substructures
Is 20TB really Big Data?
noel@nextmovesoftware.com
Noel O’Boyle, Daniel Lowe, John May and
Roger Sayle
NextMove Software
100 million compounds, 100K protein
structures, 2 million reactions, 1 million
journal articles, 20 million patents and 15
billion substructures
Is 20TB really Big Data?
With modern hardware and efficient algorithms,
many classic cheminformatics problems can be
handled with today’s datasets.
noel@nextmovesoftware.com
Noel O’Boyle, Daniel Lowe, John May and
Roger Sayle
NextMove Software

Contenu connexe

Tendances

Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)Kim Herzig
 
Computational materials design with high-throughput and machine learning methods
Computational materials design with high-throughput and machine learning methodsComputational materials design with high-throughput and machine learning methods
Computational materials design with high-throughput and machine learning methodsAnubhav Jain
 
Morgan uw maGIV v1.3 dist
Morgan uw maGIV v1.3 distMorgan uw maGIV v1.3 dist
Morgan uw maGIV v1.3 distddm314
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
Ase2010 shang
Ase2010 shangAse2010 shang
Ase2010 shangSAIL_QU
 
Density functional theory calculations and data mining for new thermoelectric...
Density functional theory calculations and data mining for new thermoelectric...Density functional theory calculations and data mining for new thermoelectric...
Density functional theory calculations and data mining for new thermoelectric...Anubhav Jain
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Anubhav Jain
 
Autodock Made Easy with MGL Tools - Molecular Docking
Autodock Made Easy with MGL Tools - Molecular DockingAutodock Made Easy with MGL Tools - Molecular Docking
Autodock Made Easy with MGL Tools - Molecular DockingGirinath Pillai
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Anubhav Jain
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Anubhav Jain
 
Autonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisitionAutonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisitionaimsnist
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsAnubhav Jain
 
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...University of California, San Diego
 
Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Anubhav Jain
 
Predicting local atomic structures from X-ray absorption spectroscopy using t...
Predicting local atomic structures from X-ray absorption spectroscopy using t...Predicting local atomic structures from X-ray absorption spectroscopy using t...
Predicting local atomic structures from X-ray absorption spectroscopy using t...aimsnist
 
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum DataAutomated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Dataaimsnist
 
The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...University of California, San Diego
 

Tendances (20)

Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)
 
Computational materials design with high-throughput and machine learning methods
Computational materials design with high-throughput and machine learning methodsComputational materials design with high-throughput and machine learning methods
Computational materials design with high-throughput and machine learning methods
 
Morgan uw maGIV v1.3 dist
Morgan uw maGIV v1.3 distMorgan uw maGIV v1.3 dist
Morgan uw maGIV v1.3 dist
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
Ase2010 shang
Ase2010 shangAse2010 shang
Ase2010 shang
 
Density functional theory calculations and data mining for new thermoelectric...
Density functional theory calculations and data mining for new thermoelectric...Density functional theory calculations and data mining for new thermoelectric...
Density functional theory calculations and data mining for new thermoelectric...
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
Autodock Made Easy with MGL Tools - Molecular Docking
Autodock Made Easy with MGL Tools - Molecular DockingAutodock Made Easy with MGL Tools - Molecular Docking
Autodock Made Easy with MGL Tools - Molecular Docking
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
Pathogen phylogenetics using BEAST
Pathogen phylogenetics using BEASTPathogen phylogenetics using BEAST
Pathogen phylogenetics using BEAST
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...
 
Autonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisitionAutonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisition
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
 
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
 
Digitizing documents to provide a public spectroscopy database
Digitizing documents to provide a public spectroscopy databaseDigitizing documents to provide a public spectroscopy database
Digitizing documents to provide a public spectroscopy database
 
Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...
 
Predicting local atomic structures from X-ray absorption spectroscopy using t...
Predicting local atomic structures from X-ray absorption spectroscopy using t...Predicting local atomic structures from X-ray absorption spectroscopy using t...
Predicting local atomic structures from X-ray absorption spectroscopy using t...
 
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum DataAutomated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
 
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
 
The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...
 

En vedette

Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?NextMove Software
 
GHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be usefulGHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be usefulNextMove Software
 
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...NextMove Software
 
Conclusies van de Raad van de EU over Turkije
Conclusies van de Raad van de EU over TurkijeConclusies van de Raad van de EU over Turkije
Conclusies van de Raad van de EU over TurkijeThierry Debels
 
Chemical structure representation in PubChem
Chemical structure representation in PubChemChemical structure representation in PubChem
Chemical structure representation in PubChemNextMove Software
 
Putting the Strategy in Client Strategy by JESS3
Putting the Strategy in Client Strategy by JESS3Putting the Strategy in Client Strategy by JESS3
Putting the Strategy in Client Strategy by JESS3JESS3
 
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]NextMove Software
 
GOYA Y LUCIENTES, Francisco de, Featured Paintings in Detail(1)
GOYA Y LUCIENTES, Francisco de, Featured Paintings in Detail(1)GOYA Y LUCIENTES, Francisco de, Featured Paintings in Detail(1)
GOYA Y LUCIENTES, Francisco de, Featured Paintings in Detail(1)guimera
 
Pew Research Center 2015 Climate Change Presentation
Pew Research Center 2015 Climate Change PresentationPew Research Center 2015 Climate Change Presentation
Pew Research Center 2015 Climate Change PresentationPew Research Center
 
The Future of Medical Education - Top Trends Likely to Have an Impact on the ...
The Future of Medical Education - Top Trends Likely to Have an Impact on the ...The Future of Medical Education - Top Trends Likely to Have an Impact on the ...
The Future of Medical Education - Top Trends Likely to Have an Impact on the ...Ogilvy Health
 
The Maker as a Reflective Practitioner
The Maker as a Reflective Practitioner The Maker as a Reflective Practitioner
The Maker as a Reflective Practitioner Jackie Gerstein, Ed.D
 

En vedette (13)

Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?Which is the best fingerprint for medicinal chemistry?
Which is the best fingerprint for medicinal chemistry?
 
GHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be usefulGHS and NFPA diamonds: where they come from and how they can be useful
GHS and NFPA diamonds: where they come from and how they can be useful
 
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemica...
 
Conclusies van de Raad van de EU over Turkije
Conclusies van de Raad van de EU over TurkijeConclusies van de Raad van de EU over Turkije
Conclusies van de Raad van de EU over Turkije
 
1405 todd manning
1405 todd manning1405 todd manning
1405 todd manning
 
Chemical structure representation in PubChem
Chemical structure representation in PubChemChemical structure representation in PubChem
Chemical structure representation in PubChem
 
Putting the Strategy in Client Strategy by JESS3
Putting the Strategy in Client Strategy by JESS3Putting the Strategy in Client Strategy by JESS3
Putting the Strategy in Client Strategy by JESS3
 
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
RDKit: Six Not-So-Easy Pieces [RDKit UGM 2016]
 
GOYA Y LUCIENTES, Francisco de, Featured Paintings in Detail(1)
GOYA Y LUCIENTES, Francisco de, Featured Paintings in Detail(1)GOYA Y LUCIENTES, Francisco de, Featured Paintings in Detail(1)
GOYA Y LUCIENTES, Francisco de, Featured Paintings in Detail(1)
 
Building A Content Marketing Strategy
Building A Content Marketing StrategyBuilding A Content Marketing Strategy
Building A Content Marketing Strategy
 
Pew Research Center 2015 Climate Change Presentation
Pew Research Center 2015 Climate Change PresentationPew Research Center 2015 Climate Change Presentation
Pew Research Center 2015 Climate Change Presentation
 
The Future of Medical Education - Top Trends Likely to Have an Impact on the ...
The Future of Medical Education - Top Trends Likely to Have an Impact on the ...The Future of Medical Education - Top Trends Likely to Have an Impact on the ...
The Future of Medical Education - Top Trends Likely to Have an Impact on the ...
 
The Maker as a Reflective Practitioner
The Maker as a Reflective Practitioner The Maker as a Reflective Practitioner
The Maker as a Reflective Practitioner
 

Similaire à Is 20TB really Big Data?

Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applicationsaimsnist
 
The PubChemQC Project
The PubChemQC ProjectThe PubChemQC Project
The PubChemQC ProjectMaho Nakata
 
Kobeworkshop pubchemqc project
Kobeworkshop pubchemqc projectKobeworkshop pubchemqc project
Kobeworkshop pubchemqc projectMaho Nakata
 
Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Anubhav Jain
 
Overview of the Exascale Additive Manufacturing Project
Overview of the Exascale Additive Manufacturing ProjectOverview of the Exascale Additive Manufacturing Project
Overview of the Exascale Additive Manufacturing Projectinside-BigData.com
 
Morgan osg user school 2016 07-29 dist
Morgan osg user school 2016 07-29 distMorgan osg user school 2016 07-29 dist
Morgan osg user school 2016 07-29 distddm314
 
Materials Project computation and database infrastructure
Materials Project computation and database infrastructureMaterials Project computation and database infrastructure
Materials Project computation and database infrastructureAnubhav Jain
 
Discovering advanced materials for energy applications (with high-throughput ...
Discovering advanced materials for energy applications (with high-throughput ...Discovering advanced materials for energy applications (with high-throughput ...
Discovering advanced materials for energy applications (with high-throughput ...Anubhav Jain
 
Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...Sean Ekins
 
Cycle Computing Record-breaking Petascale HPC Run
Cycle Computing Record-breaking Petascale HPC RunCycle Computing Record-breaking Petascale HPC Run
Cycle Computing Record-breaking Petascale HPC Runinside-BigData.com
 
20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horse20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horseChris Southan
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceinside-BigData.com
 
Accelerators at ORNL - Application Readiness, Early Science, and Industry Impact
Accelerators at ORNL - Application Readiness, Early Science, and Industry ImpactAccelerators at ORNL - Application Readiness, Early Science, and Industry Impact
Accelerators at ORNL - Application Readiness, Early Science, and Industry Impactinside-BigData.com
 
AI for Science Grand Challenges
AI for Science Grand ChallengesAI for Science Grand Challenges
AI for Science Grand ChallengesPFHub PFHub
 
The U.S. Exascale Computing Project: Status and Plans
The U.S. Exascale Computing Project: Status and PlansThe U.S. Exascale Computing Project: Status and Plans
The U.S. Exascale Computing Project: Status and Plansinside-BigData.com
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Anubhav Jain
 
Quality and noise in big chemistry databases
Quality and noise in big chemistry databasesQuality and noise in big chemistry databases
Quality and noise in big chemistry databasesChris Southan
 
HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores inside-BigData.com
 
TeraGrid Communication and Computation
TeraGrid Communication and ComputationTeraGrid Communication and Computation
TeraGrid Communication and ComputationTal Lavian Ph.D.
 

Similaire à Is 20TB really Big Data? (20)

Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
 
The PubChemQC Project
The PubChemQC ProjectThe PubChemQC Project
The PubChemQC Project
 
Kobeworkshop pubchemqc project
Kobeworkshop pubchemqc projectKobeworkshop pubchemqc project
Kobeworkshop pubchemqc project
 
Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...
 
Overview of the Exascale Additive Manufacturing Project
Overview of the Exascale Additive Manufacturing ProjectOverview of the Exascale Additive Manufacturing Project
Overview of the Exascale Additive Manufacturing Project
 
Morgan osg user school 2016 07-29 dist
Morgan osg user school 2016 07-29 distMorgan osg user school 2016 07-29 dist
Morgan osg user school 2016 07-29 dist
 
Materials Project computation and database infrastructure
Materials Project computation and database infrastructureMaterials Project computation and database infrastructure
Materials Project computation and database infrastructure
 
Discovering advanced materials for energy applications (with high-throughput ...
Discovering advanced materials for energy applications (with high-throughput ...Discovering advanced materials for energy applications (with high-throughput ...
Discovering advanced materials for energy applications (with high-throughput ...
 
Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...Navigatingbetween patents, papers, abstracts and databases using public sourc...
Navigatingbetween patents, papers, abstracts and databases using public sourc...
 
Cycle Computing Record-breaking Petascale HPC Run
Cycle Computing Record-breaking Petascale HPC RunCycle Computing Record-breaking Petascale HPC Run
Cycle Computing Record-breaking Petascale HPC Run
 
20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horse20 million public patent structures: looking at the gift horse
20 million public patent structures: looking at the gift horse
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
 
Accelerators at ORNL - Application Readiness, Early Science, and Industry Impact
Accelerators at ORNL - Application Readiness, Early Science, and Industry ImpactAccelerators at ORNL - Application Readiness, Early Science, and Industry Impact
Accelerators at ORNL - Application Readiness, Early Science, and Industry Impact
 
AI for Science Grand Challenges
AI for Science Grand ChallengesAI for Science Grand Challenges
AI for Science Grand Challenges
 
The U.S. Exascale Computing Project: Status and Plans
The U.S. Exascale Computing Project: Status and PlansThe U.S. Exascale Computing Project: Status and Plans
The U.S. Exascale Computing Project: Status and Plans
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
 
Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases? Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases?
 
Quality and noise in big chemistry databases
Quality and noise in big chemistry databasesQuality and noise in big chemistry databases
Quality and noise in big chemistry databases
 
HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores 
 
TeraGrid Communication and Computation
TeraGrid Communication and ComputationTeraGrid Communication and Computation
TeraGrid Communication and Computation
 

Plus de NextMove Software

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...NextMove Software
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedNextMove Software
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESNextMove Software
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...NextMove Software
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsNextMove Software
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...NextMove Software
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...NextMove Software
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...NextMove Software
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical RepresentationsNextMove Software
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsNextMove Software
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics DatabaseNextMove Software
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...NextMove Software
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)NextMove Software
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeNextMove Software
 
Automatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsAutomatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsNextMove Software
 
Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)NextMove Software
 

Plus de NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 
Challenges in Chemical Information Exchange
Challenges in Chemical Information ExchangeChallenges in Chemical Information Exchange
Challenges in Chemical Information Exchange
 
Automatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patentsAutomatic extraction of bioactivity data from patents
Automatic extraction of bioactivity data from patents
 
Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)Line notations for nucleic acids (both natural and therapeutic)
Line notations for nucleic acids (both natural and therapeutic)
 

Dernier

GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptxGENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptxRitchAndruAgustin
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxJorenAcuavera1
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxMurugaveni B
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024Jene van der Heide
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxmaryFF1
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxuniversity
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...lizamodels9
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naJASISJULIANOELYNV
 
Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...navyadasi1992
 
trihybrid cross , test cross chi squares
trihybrid cross , test cross chi squarestrihybrid cross , test cross chi squares
trihybrid cross , test cross chi squaresusmanzain586
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalMAESTRELLAMesa2
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 

Dernier (20)

GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptxGENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptxECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
ECG Graph Monitoring with AD8232 ECG Sensor & Arduino.pptx
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by na
 
Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...
 
trihybrid cross , test cross chi squares
trihybrid cross , test cross chi squarestrihybrid cross , test cross chi squares
trihybrid cross , test cross chi squares
 
PROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and VerticalPROJECTILE MOTION-Horizontal and Vertical
PROJECTILE MOTION-Horizontal and Vertical
 
Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 

Is 20TB really Big Data?

  • 1. 100 million compounds, 100K protein structures, 2 million reactions, 1 million journal articles, 20 million patents and 15 billion substructures Is 20TB really Big Data? Noel O’Boyle, Daniel Lowe, John May and Roger Sayle NextMove Software From Big Data to Chemical Information (RSC CICAG) 22 April 2015 London
  • 2. Big Data is… “…a broad term for data sets so large or complex that traditional data processing applications are inadequate.” [Wikipedia] Any dataset could be considered Big Data without sufficiently efficient algorithms and tools
  • 3. 100K protein structures 2 million reactions 1 million journal articles 20 million patents 100 million compounds 15 billion substructures Swiss-Prot Patents PubMed Central OA US, EU, JP, Kr PubChem, UniChem PubChem, ChEMBL 700K CSD entries 36 million Wikipedia articles 47 billion webpages indexed by Google 200 billion tweets per year 22 Pb EMBL-EBI’s data 20 Tb
  • 5. OVerview • Finding matched pairs/series • Substructure searching • Maximum common subgraph • Chemical text-mining • Naming reactions • Canonicalisation
  • 6. Matched Pair Matched Series of Length 3
  • 7. + + + find matched PAIRS/Series Fragment Index Collate Index (Scaffold) Matched Series • Hussain and Rea JCIM 2010, 50, 339 • ChEMBL 20 IC50 data – 752 K datapoints from 64 K assays • Processed in 12 minutes – Giving 391 K matched series Matched Series
  • 8. “Google searches are screamingly fast, so fast that the type-ahead feature is doing the search as you key characters in. Why are all chemical searches so sloooow? … Ideally, as you sketch your mol in, the searches should be happening at the same pace, like the typeahead feature.” John Van Drie, Nov 2011 Via Rajarshi Guha’s blog http://blog.rguha.net/?p=993
  • 9. substructure searching • Approach: fingerprint screen (fast but false positives) then match (slow but exact) • Pathological cases – many pass the screen – “denial of service queries” (Trung Nguyen) – Substructures which happen to map to the same bit as benzene, etc.
  • 10. Faster Substructure searching • Worst-case behaviour dominated by slow matching – This implies focus should be on faster matching • Typical substructures can be expressed as SMARTS patterns – Arthor: fast SMARTS matching against a database • Test set: Structure Query Collection (Andrew Dalke, BindingDB) – Time to count hits for 3323 queries against eMolecules (6.9 million)
  • 11. Optimised smarts matching • Preprocess database so that matching is fast – Efficient binary representation (minimal I/O) – Matching done directly on binary representation (minimal malloc) • Match rarer atom expressions first (*) – CCCCCBr  BrCCCCC • Match rarer bond expressions first – CC#C  C#CC * c.f. slide 31 of ICCS presentation http://www.slideshare.net/NextMoveSoftware/efficient-matching-of-multiple- chemical-subgraphs
  • 12. Time (ms) Bingo 2h 7m Tripod 1d 6h JCart 2h 42m FastSearch 1d 15h RDCart 5h 9m OrChem 3d 1h NumQueriesyettofinish Total time for all 3323 queries
  • 13. Time (ms) NumQueriesyettofinish Arthor+fp 36s Arthor 27m 31s Bingo 2h 7m Tripod 1d 6h JCart 2h 42m FastSearch 1d 15h RDCart 5h 9m OrChem 3d 1hTotal time for all 3323 queries
  • 14. What about even larger databases? • Imagine a chemical database 100 times larger than PubChem – Search time scales linearly, so even Arthor will take 100 times longer to search it • A completely different approach is needed • SmallWorld: sublinear searching by precalculating all possible substructures and their relationships – The catch: disk space, and takes months of CPU- time to generate (here’s one we made earlier)
  • 15. SmallWorld - A Graph database Nodes are anonymous graphs of known substructures Disk space: 12TB Nodes: 19.7 billion Edges: 64.8 billion
  • 16. Maximum Common Subgraph (MCS) • Computationally expensive • Previous approaches: – Backtracking algorithms – Clique detection – Dynamic programming • However, with SmallWorld the computation has already been done in advance – MCS can be found in linear time relative to the number of atoms in the smaller molecule
  • 18. Chemical Text-Mining Big Data • LeadMine for chemical entity extraction from text – Performance: Finds ~90% of chemical structures (compared to inter-annotator agreement of 91%*) • Method: – Dictionaries (e.g. list of common English words, trivial chemical names) and grammars (e.g. all possible IUPAC names) – Speed just depends on the number of dictionaries – Any dictionary can be used with spelling correction * Krallinger et al, J. Cheminf. 2015, 7, S2
  • 19. How Long to Process? • All Open Access papers available from PubMed Central – 1.0 million articles processed in 3h 38min (*) – 131K distinct compounds • 2001-2015 USPTO applications (5 times larger) – 4.1 million patents processed in 22h 50min (*) – 4.3 million distinct compounds – 84.7 million compound mentions (with at least 10 heavy atoms) * using 4 cores
  • 20. Automatic Naming of reactions • Traditional approaches: – Atom-mapping (can be slow, can give wrong mapping) – Differences in fingerprints for reactants and products (fast but has limitations) • Our approach uses compiled SMARTS matching: – Apply a particular reaction to the reactants and check whether the products appears on the right – Components are atom-mapped implicitly Note: typically reactions in ELNs and patents are not balanced
  • 21. Performance • Scales with the number of SMARTS patterns used – currently 802 patterns for 504 reactions • Test set: 1.1 million reactions extracted from the USPTO applications 2001-2012 – Processed in 11.2 h – 437 K reactions named US20010000038A1 : 3.10.2 Friedel-Crafts alkylation [Friedel- Crafts reaction] US20010000038A1 : 10.1.5 Wohl-Ziegler bromination [Halogenation]
  • 22. A different type of “Big” Data • Large macromolecules can be efficiently handled by cheminformatics tools • Titin (35213 amino acids, 313 K atoms): CC[C@H](C)[C@@H](C(=O)NCC(=O)N[C@@H](CCCCN)C(=O)N1CCC[C@H]1C(=O)N[C@@H](CO)C(=O)N[C@@H](Cc2c[nH]cn2)C(= O)N3CCC[C@H]3C(=O)N[C@@H](CO)C(=O)N[C@@H](CCC(=O)O)C(=O)N4CCC[C@H]4C(=O)N[C@@H](C(C)C)C(=O)N[C@@H](CC( C)C)C(=O)N[C@@H](C)C(=O)N[C@@H]([C@@H](C)CC)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](C)C(=O)N[C@@H](CS)C(=O)N[ C@@H](CCC(=O)O)C(=O)N5CCC[C@H]5C(=O)N6CCC[C@H]6C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](CC(=O)N)C(=O)N[C@ @H](C(C)C)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H]([C@@H](C)CC)C(=O)N[C@@H]([C@@H](C)O)C(=O)N[C@@H](CC(=O) O)C(=O)N[C@@H]([C@@H](C)CC)C(=O)N[C@@H](CO)C(=O)N[C@@H](CCCCN)C(=O)N[C@@H](CC(=O)N)C(=O)N[C@@H](CO)C(= O)N[C@@H](C(C)C)C(=O)N[C@@H](CC(=O)N)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CO)C(=O)N[C@@H](Cc7c[nH]c8c7cccc8)C( =O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CCC(=O)N)C(=O)N9CCC[C@H]9C(=O)N[C@@H](C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[ C@@H](CC(=O)O)C(=O)NCC(=O)NCC(=O)N[C@@H](CO)C(=O)N[C@@H](CCCCN)C(=O)N[C@@H]([C@@H](C)CC)C(=O)N[C@@H]( [C@@H](C)O)C(=O)NCC(=O)N[C@@H](Cc1ccc(cc1)O)C(=O)N[C@@H]([C@@H](C)CC)C(=O)N[C@@H](C(C)C)C(=O)N[C@@H](CCC (=O)O)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](CC(C)C)C(=O)N1C CC[C@H]1C(=O)N[C@@H](CC(=O)O)C(=O)NCC(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](Cc1c[nH]c2c1cccc2)C(=O)N[C@@H]( [C@@H](C)O)C(=O)N[C@@H](CCCCN)C(=O)N[C@@H](C)C(=O)N[C@@H](CO)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H]([C@@H ](C)O)C(=O)N[C@@H](CC(=O)N)C(=O)N[C@@H](C(C)C)C(=O)N[C@@H]([C@@H](C)CC)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@ H]([C@@H](C)O)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H]([C@@H](C)O)C(=O)N[C@@H](C(C)C)C( =O)N[C@@H](CO)C(=O)NCC(=O)N[C@@H](CC(C)C)C(=O)N[C@@H]([C@@H](C)O)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC (=O)N)C(=O)N[C@@H](CO)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](Cc1ccc(cc1)O)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H ](Cc1ccccc1)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N[C@@H](C(C)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](C)C(=O)……
  • 23. A different type of “Big” Data • Large macromolecules can be efficiently handled by cheminformatics tools • Titin (35213 amino acids, 313 K atoms): – Canonical Smiles: 238s (state-of-the-art) – Canonical Smiles: 1.7s (our preliminary results) • All Swiss-Prot entries (541K structures) – Remove those with ambiguous structures (e.g. Asx) – 453K protein structures canonicalised in 2m 57s (4 cores)
  • 24. 100 million compounds, 100K protein structures, 2 million reactions, 1 million journal articles, 20 million patents and 15 billion substructures Is 20TB really Big Data? noel@nextmovesoftware.com Noel O’Boyle, Daniel Lowe, John May and Roger Sayle NextMove Software
  • 25. 100 million compounds, 100K protein structures, 2 million reactions, 1 million journal articles, 20 million patents and 15 billion substructures Is 20TB really Big Data? With modern hardware and efficient algorithms, many classic cheminformatics problems can be handled with today’s datasets. noel@nextmovesoftware.com Noel O’Boyle, Daniel Lowe, John May and Roger Sayle NextMove Software