8/7/2019
Max Planck Society
When The New Science Is in The Outliers
Matthias Scheffler
Fritz-Haber-Institut der Max-Planck-Gesellschaft, 14195 Berlin, Germany, and Physics Department and IRIS
Adlershof, Humboldt-Universität zu Berlin, 12489 Berlin, Germany
Several issues hamper progress in data-driven materials science. In particular, a FAIR [1] data infrastructure and an appropriate data-analytics methodology [2] are still missing.
Significant efforts are still necessary to fully realize the A and I of FAIR. Here, the development of metadata, their intricate relationships, and data ontologies need critical attention. A FAIR data infrastructure, to be accepted by the community, should work without bureaucratic hurdles or the need for special training. In this talk, I will discuss the challenges and progress, focusing on computational materials science.
Concerning data analytics, we note that the number of possible materials is practically infinite, but only 10 or 100 of them may be relevant for a certain science or engineering purpose. In simple words, in materials science and engineering we are often looking for “needles in a haystack”. Fitting or machine-learning all data (i.e., the hay) with a single, global model may average away the special features of the interesting minority (i.e., the needles). I will discuss methods that identify statistically exceptional subgroups in a large amount of data, and I will discuss how one can estimate the domains of applicability of machine-learning models [3].
1. FAIR stands for Findable, Accessible, Interoperable and Re-usable. The FAIR Data Principles;
https://www.force11.org/group/fairgroup/fairprinciples
2. C. Draxl and M. Scheffler, Big-Data-Driven Materials Science and its FAIR Data Infrastructure. Plenary Chapter in Handbook of Materials
Modeling (eds. S. Yip and W. Andreoni), Springer (2019). https://arxiv.org/ftp/arxiv/papers/1904/1904.05859.pdf
3. Ch. Sutton, M. Boley, L. M. Ghiringhelli, M. Rupp, J. Vreeken, M. Scheffler, Domains of Applicability of Machine-Learning Models for Novel
Materials Discovery, to be published.
High-Throughput Screening
in Computational (and Experimental) Materials Science
Sharing Advances Science
Needs for a FAIR, Efficient Research-Data Infrastructure
Consider as many compounds as possible, typically O(10^3) – O(10^5).
O(10^1) – O(10^2) compounds selected.
Recycle the “waste”!
Enable re-purposing.
Animation by G.-M. Rignanese
Findable Accessible Interoperable Reusable
M. D. Wilkinson et al., Scientific Data 3, 160018 (2016)
Since 2015
Since 2014
Encyclopedia
Archive (normalized data)
Visualization
Repository (raw data)
Big-Data Analytics
Requests the full input and output files
The NOMAD Center of Excellence
Since 2015
The NOMAD Repository
More than 50 million total-energy calculations
90% of the VASP files are from:
AFLOW (S. Curtarolo)
OQMD (C. Wolverton)
Materials Project (G. Ceder, K. Persson)
What Is Needed for a FAIR Data Infrastructure?
• Scientific results are only meaningful and worth keeping if they are fully characterized and all individual steps are fully documented.
• Computed data are only meaningful when method, approximations, code, code version, and all computational parameters are known.
• For experimental data, we need a full characterization of the sample, the description of the apparatus, the measurement conditions, and the measured quantity.
This requires metadata, ontologies, and workflows.
We also need good search engines, an “encyclopedia” GUI, and appropriate hardware.
Learning from “Big” Data: Very Many Methods and Concepts, Very Interdisciplinary
Artificial Intelligence (AI): Any technique that enables computers to mimic human intelligence, using logical if-then rules, compressed sensing, machine learning (including deep learning).
Machine Learning: A subset of AI that includes statistical techniques that enable machines to improve at tasks with more data. It includes deep learning.
Deep Learning: The subset of machine learning composed of algorithms that permit software to train itself to perform tasks, like speech and image recognition, by exposing multilayered neural networks to vast amounts of data.
Building Maps of Materials
(Role Models: Periodic Table, Ashby Plots)
Crystal-structure prediction:
• Octet binaries (ZB vs. RS)
• AlxGayInzO3 (x+y+z=2)
• Perovskites (Goldschmidt tolerance factor)
Property classification:
• Topological insulators
• Metal vs. insulator
Activation of CO2 at metal oxides and carbides
work in progress
Global Learning
-- Machine Learning --
One single model to describe the whole population (known and unknown data):
• minimize the overall prediction error (e.g., RMSE) using regularization
• therefore, disregard (on purpose) all local details
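The effect of global learning described above can be illustrated with a minimal sketch. It fits one ridge-regularized straight line to a synthetic population and compares the RMSE over all data with the RMSE over a small minority. The data, the regularization strength `lam`, and all function names are illustrative assumptions and are not taken from the talk.

```python
import math

def fit_ridge_line(d, p, lam):
    """Fit p ≈ w*d + b by least squares with penalty lam*w**2 (intercept unpenalized)."""
    n = len(d)
    d_mean = sum(d) / n
    p_mean = sum(p) / n
    s_dp = sum((di - d_mean) * (pi - p_mean) for di, pi in zip(d, p))
    s_dd = sum((di - d_mean) ** 2 for di in d)
    w = s_dp / (s_dd + lam)   # closed-form ridge solution for the slope
    b = p_mean - w * d_mean   # optimal intercept for that slope
    return w, b

def rmse(d, p, w, b):
    """Root-mean-square error of the line w*d + b on the given points."""
    return math.sqrt(sum((w * di + b - pi) ** 2 for di, pi in zip(d, p)) / len(d))

# Synthetic population (made up): 90 "hay" points with P = 0.2 over 0 <= d <= 1,
# plus 10 "needle" points with P = 1.0 located at d >= 0.8.
hay_d = [i / 89 for i in range(90)]
hay_p = [0.2] * 90
needle_d = [0.8 + 0.02 * k for k in range(10)]
needle_p = [1.0] * 10

all_d, all_p = hay_d + needle_d, hay_p + needle_p
w, b = fit_ridge_line(all_d, all_p, lam=1.0)

rmse_all = rmse(all_d, all_p, w, b)
rmse_needles = rmse(needle_d, needle_p, w, b)
print(f"RMSE, all data: {rmse_all:.3f}")
print(f"RMSE, needles:  {rmse_needles:.3f}")
```

In this toy setting the global fit looks acceptable on average, while the error on the minority is several times larger.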
Global vs. Local Learning
Subgroups are statistically exceptional.
[Figure: target property P versus descriptor d, both axes from 0.0 to 1.0, with data series x = a and x = b]
Example selector: $\sigma(j) \equiv (d_j \ge 0.8) \wedge (x_j = a)$
A global model fitted to the entire
dataset may be difficult to interpret and
may well hide or incorrectly describe the
actuating physical mechanisms.
Given:
Sample S of the population
Target property P_j
Features (descriptors) d_j
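As a minimal sketch of how a subgroup selector acts on such a sample, the following code applies the conjunction from the slide, (d_j ≥ 0.8) ∧ (x_j = a), to a handful of data points. The `Sample` class, the function name `sigma`, and all numerical values are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    d: float  # numerical descriptor d_j
    x: str    # categorical descriptor x_j ("a" or "b")
    P: float  # target property P_j

def sigma(s: Sample) -> bool:
    """Selector from the slide: (d_j >= 0.8) AND (x_j = a)."""
    return s.d >= 0.8 and s.x == "a"

# Made-up sample S, for illustration only.
S = [
    Sample(0.10, "a", 0.15), Sample(0.35, "b", 0.30),
    Sample(0.55, "a", 0.40), Sample(0.70, "b", 0.45),
    Sample(0.85, "a", 0.95), Sample(0.90, "b", 0.50),
    Sample(0.95, "a", 0.90),
]

# The subgroup is the set of samples for which the selector is true.
subgroup = [s for s in S if sigma(s)]
coverage = len(subgroup) / len(S)

print(len(subgroup), "of", len(S), "samples selected")
print("target values in subgroup:", [s.P for s in subgroup])
```

A selector is a conjunction of simple, readable conditions on the descriptors, which is what makes the resulting subgroup interpretable.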
Turning Greenhouse Gases into Useful Chemicals and Fuels
We need an efficient catalyst!
[Figure labels: CO2, CO, C; formic acid, formaldehyde, methanol, methane]
Aliaksei Mazheika, Sergey Levchenko, Francesc Illas
H.-J. Freund et al., Angew. Chem. Int. Ed. 50, 10064 (2011).
Identifying New Potential Catalysts
Considering Oxides
Oxides:
A²⁺B⁴⁺O3, AO, BO2,
A³⁺B³⁺O3, A2O3 (B2O3),
A¹⁺B⁵⁺O3, A2O, BO
A²⁺: Mg, Ca, Sr, Ba
A³⁺ (B³⁺): Al, Ga, In, Sc, Y, La
B⁴⁺: Ti, Zr, Si, Ge, Sn
A⁺: Li, Na, K, Rb, Cs
B⁵⁺: Nb, V, Sb
Machine learning of all produced data
does not provide a good description.
Consider surfaces of many different
materials and all possibly relevant surface
sites: Which materials (and surface sites)
are catalytically active?
Two Possibly Interesting Subgroups
for Identifying High-Performance Materials
Subgroup identification:
• Define a ‘target property’.
• Minimize the width of the target-property distribution.
• Maximize the distance between the median of the target-property distribution and that of the whole data set.
• Maximize the size of the subgroup.
For how many xxx compounds do we know high catalytic activity? What is meant by high catalytic activity?
1) ‘Small O-C-O angle’ subgroup
2) ‘Large C-O bond length’ subgroup
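The subgroup-identification criteria listed above (narrow target distribution, large median shift, large subgroup) can be combined into a single score in several ways. The sketch below uses one possible choice, a product of coverage, normalized median shift, and relative width reduction, and searches over single-threshold selectors of the form d ≥ t. The functional form, the names `spread` and `quality`, and the data are assumptions for illustration, not the objective used in the talk.

```python
from statistics import median

def spread(values):
    """Mean absolute deviation from the median, used here as the 'width'."""
    m = median(values)
    return sum(abs(v - m) for v in values) / len(values)

def quality(sub, pop):
    """Coverage x normalized median shift x relative width reduction (assumed form)."""
    coverage = len(sub) / len(pop)
    shift = abs(median(sub) - median(pop)) / (max(pop) - min(pop))
    narrowing = max(0.0, 1.0 - spread(sub) / spread(pop))
    return coverage * shift * narrowing

# Made-up data: (descriptor d, target property P).
data = [
    (0.0, 0.10), (0.1, 0.50), (0.2, 0.30), (0.3, 0.60), (0.4, 0.20),
    (0.5, 0.40), (0.6, 0.30), (0.7, 0.90), (0.8, 0.95), (0.9, 1.00),
]
targets = [p for _, p in data]

# Exhaustive search over candidate selectors "d >= t".
best_t, best_q = None, -1.0
for t in sorted({d for d, _ in data}):
    sub = [p for d, p in data if d >= t]
    q = quality(sub, targets)
    if q > best_q:
        best_t, best_q = t, q

best_subgroup = [p for d, p in data if d >= best_t]
print(f"best selector: d >= {best_t}, quality = {best_q:.3f}")
print("target values in subgroup:", best_subgroup)
```

The product form means a candidate scores zero if it fails any one criterion, so a large but unexceptional subgroup cannot win on size alone.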
Statistically Exceptional Subgroups of Oxides
– Considering 51 Potential Descriptors –
Two Possibly Relevant Subgroups for Semiconducting Oxide Materials
The descriptors should characterize the clean surface.

• VBM < −5.14 eV (wrt vacuum)
• Min. of Hirshfeld charges of the A and B atoms: qmin < 0.48 e−
• Distance between the O surface atom and its second-nearest neighbor cation: d2 > 2.26 Å
(qmin < 0.48 e) AND (W ≥ 5.14) AND (d2 > 2.16 Å).

• Max. of O 2p DOS: M > −6.0 eV
• Distance between the O surface atom and its nearest neighbor cation: d1 > 1.8 Å
• Distance between the O surface atom and its second-nearest neighbor cation: d2 > 2.12 Å

[Figure: C-O bond length (Å), axis range 1.2 to 1.5. Legend: ‘small OCO angle’ subgroup, ‘large C-O bond length’ subgroup, other materials. Labels: gas-phase CO2 (δ = 0; 1.17 Å, 180°); δ− molecule (2 > δ > 0.9)]
[Figure: number of systems per energy. Legend: ‘small OCO angle’ subgroup, ‘large C-O bond length’ subgroup, all materials and sites]

Most known materials with good catalytic performance belong to the ‘large C-O bond length’ subgroup.
None of the “bad-performance materials” belongs to the green subgroup.
Domain of (reliable) Applicability (DoA)
of Machine-Learning Models
• How reliable are machine-learning models when fitted to all data?
• Are all data fitted equally well by the one selected representation?
Individual absolute error: $e_i = |f(x_i) - y(x_i)|$
Find the subgroup with small individual errors.
Example: Data from the NOMAD-Kaggle-2018 competition(*) on transparent conducting oxides, (AlxGayInz)2O3 (for 6 space groups and up to 80 atoms/unit cell). Consider conjunctions on lattice-vector lengths and angles, volume per atom, number of atoms per unit cell, composition (%), average nearest-neighbor (nn) distances (Al-Al, Al-Ga, Al-In, ...), etc.
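The idea of searching for a subgroup with small individual errors can be sketched as follows. The code computes e_i = |f(x_i) − y(x_i)| for each sample and scans selectors of the form "number of atoms ≥ t", scoring each candidate by coverage times relative error reduction. This scoring function, the variable names, and all numbers are hypothetical; they are not the data or the objective of the cited work.

```python
def mean(values):
    return sum(values) / len(values)

# Made-up samples: (atoms per unit cell, true value y, model prediction f), in meV/cation.
samples = [
    (10, 100.0, 130.0), (20, 120.0, 95.0), (30, 90.0, 118.0), (40, 110.0, 88.0),
    (50, 105.0, 113.0), (60, 95.0, 86.0), (70, 115.0, 122.0), (80, 100.0, 92.0),
]

# Individual absolute errors e_i = |f(x_i) - y(x_i)|.
errors = [(n, abs(f - y)) for n, y, f in samples]
mae_all = mean([e for _, e in errors])

best = None  # (score, threshold, MAE inside the candidate domain)
for t in sorted({n for n, _ in errors}):
    inside = [e for n, e in errors if n >= t]
    coverage = len(inside) / len(errors)
    reduction = max(0.0, 1.0 - mean(inside) / mae_all)
    score = coverage * reduction  # assumed trade-off between size and accuracy
    if best is None or score > best[0]:
        best = (score, t, mean(inside))

score, threshold, mae_doa = best
print(f"MAE on all data: {mae_all:.3f} meV/cation")
print(f"candidate DoA: n_atoms >= {threshold}, MAE = {mae_doa:.3f} meV/cation")
```

Weighting the error reduction by coverage prevents the search from returning a tiny domain that happens to contain only well-fitted points.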
(*) C. Sutton, L.M. Ghiringhelli, et al., npj Comput. Materials, in print
[Figure, simplified sketch: known data y(x_i) and fit f(x) versus representation x, comparing a linear fit to all data with a linear fit in the Domain of Applicability (DoA)]
Domain of (reliable) Applicability (DoA)
of Machine-Learning Models
Example: (AlxGayInz)2O3 with Gaussian-kernel KRR and different representations(*)

ML model   MAE, all data (meV/cation)   MAE, DoA (meV/cation)   Selectors defining the DoA
n-gram     15.2                         11.41                   b ≥ 5.59 Å ∧ γ < 90.35° ∧ R(Al-O) ≤ 2.06 Å ∧ R(Ga-O) ≤ 2.07 Å
SOAP       14.5                         11.25                   a/c ≤ 3.89 ∧ γ < 90.35° ∧ β ≥ 88.68°
MBTR       13.9                         8.03                    N ≥ 50 ∧ γ < 90.35° ∧ R(Al-O) ≤ 2.06 Å

Mean absolute error (MAE) of the cohesive energy: $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |f(x_i) - y(x_i)|$
(*) C. Sutton, M. Boley, L.M. Ghiringhelli, M. Rupp, J. Vreeken, M. Scheffler, to be published
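For readers unfamiliar with the model class named in this example, here is a minimal, dependency-free sketch of kernel ridge regression (KRR) with a Gaussian kernel, evaluated with the MAE defined on this slide. It uses a toy one-dimensional data set rather than the (AlxGayInz)2O3 data, and it does not reproduce the n-gram, SOAP, or MBTR representations. The function names and the values of `sigma` and `lam` are assumptions.

```python
import math

def gaussian_kernel(a, b, sigma):
    """k(a, b) = exp(-|a - b|^2 / (2 sigma^2)) for feature vectors a and b."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def krr_fit(X, y, sigma, lam):
    """Return coefficients alpha solving (K + lam * I) alpha = y."""
    n = len(X)
    K = [[gaussian_kernel(X[i], X[j], sigma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, y)

def krr_predict(X_train, alpha, sigma, x):
    """f(x) = sum_i alpha_i * k(x_i, x)."""
    return sum(a * gaussian_kernel(xt, x, sigma) for a, xt in zip(alpha, X_train))

# Toy 1D data (made up): y = x^2 sampled at five points.
X = [[0.0], [0.25], [0.5], [0.75], [1.0]]
y = [x[0] ** 2 for x in X]

alpha = krr_fit(X, y, sigma=0.25, lam=1e-6)
pred = [krr_predict(X, alpha, 0.25, x) for x in X]
train_mae = sum(abs(f - t) for f, t in zip(pred, y)) / len(y)
print(f"training MAE: {train_mae:.2e}")
```

With a very small `lam` the model nearly interpolates the training points; larger values trade training accuracy for smoothness, which is the regularization discussed in this talk.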
The Materials-Science Challenge Is Different
to That of Standard Machine Learning
$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(f(x_i) - y(x_i)\right)^2}$
$f(x_i)$ = predicted value
$y(x_i)$ = true value
Regularized RMSE optimization emphasizes the description of the majority. It provides a “high chance of being right in the description of the hay”.
We are looking for statistically exceptional data groups. These may be needles, or nuts, or bolts, or coins, or … Often, we don’t know exactly what we are searching for, except that the data should be statistically exceptional.
Identify these subgroups, and don’t “regularize away” the outliers!

Contenu connexe

Tendances

2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML modelaimsnist
 
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...aimsnist
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...BrianDeCost
 
Autonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisitionAutonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisitionaimsnist
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applicationsaimsnist
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsAnubhav Jain
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryIan Foster
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAnubhav Jain
 
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum DataAutomated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Dataaimsnist
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsAnubhav Jain
 
Physics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningPhysics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningKAMAL CHOUDHARY
 
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...aimsnist
 
Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Anubhav Jain
 
Methods, tools, and examples (Part II): High-throughput computation and machi...
Methods, tools, and examples (Part II): High-throughput computation and machi...Methods, tools, and examples (Part II): High-throughput computation and machi...
Methods, tools, and examples (Part II): High-throughput computation and machi...Anubhav Jain
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Anubhav Jain
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Anubhav Jain
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Anubhav Jain
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Anubhav Jain
 
Density functional theory calculations and data mining for new thermoelectric...
Density functional theory calculations and data mining for new thermoelectric...Density functional theory calculations and data mining for new thermoelectric...
Density functional theory calculations and data mining for new thermoelectric...Anubhav Jain
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 

Tendances (20)

2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model
 
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...
 
Autonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisitionAutonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisition
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methods
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and Chemistry
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design Problems
 
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum DataAutomated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials Informatics
 
Physics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningPhysics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learning
 
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
 
Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...
 
Methods, tools, and examples (Part II): High-throughput computation and machi...
Methods, tools, and examples (Part II): High-throughput computation and machi...Methods, tools, and examples (Part II): High-throughput computation and machi...
Methods, tools, and examples (Part II): High-throughput computation and machi...
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
 
Density functional theory calculations and data mining for new thermoelectric...
Density functional theory calculations and data mining for new thermoelectric...Density functional theory calculations and data mining for new thermoelectric...
Density functional theory calculations and data mining for new thermoelectric...
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 

Similaire à When The New Science Is In The Outliers

The interplay between data-driven and theory-driven methods for chemical scie...
The interplay between data-driven and theory-driven methods for chemical scie...The interplay between data-driven and theory-driven methods for chemical scie...
The interplay between data-driven and theory-driven methods for chemical scie...Ichigaku Takigawa
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningAnubhav Jain
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
C hi mad_phasefieldworkshop(1)
C hi mad_phasefieldworkshop(1)C hi mad_phasefieldworkshop(1)
C hi mad_phasefieldworkshop(1)PFHub PFHub
 
Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...
Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...
Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...remAYDOAN3
 
The Einstein Toolkit: A Community Computational Infrastructure for Relativist...
The Einstein Toolkit: A Community Computational Infrastructure for Relativist...The Einstein Toolkit: A Community Computational Infrastructure for Relativist...
The Einstein Toolkit: A Community Computational Infrastructure for Relativist...University of Illinois at Urbana-Champaign
 
The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...Anubhav Jain
 
Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Anubhav Jain
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
(2018.9) 分子のグラフ表現と機械学習
(2018.9) 分子のグラフ表現と機械学習(2018.9) 分子のグラフ表現と機械学習
(2018.9) 分子のグラフ表現と機械学習Ichigaku Takigawa
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLAnubhav Jain
 
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. FreyMachine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. FreyNathan Frey, PhD
 
Xin Yao: "What can evolutionary computation do for you?"
Xin Yao: "What can evolutionary computation do for you?"Xin Yao: "What can evolutionary computation do for you?"
Xin Yao: "What can evolutionary computation do for you?"ieee_cis_cyprus
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsAnubhav Jain
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text Paul Groth
 
Overview of accelerated materials design efforts in the Hacking Materials res...
Overview of accelerated materials design efforts in the Hacking Materials res...Overview of accelerated materials design efforts in the Hacking Materials res...
Overview of accelerated materials design efforts in the Hacking Materials res...Anubhav Jain
 

Similaire à When The New Science Is In The Outliers (20)

The interplay between data-driven and theory-driven methods for chemical scie...
The interplay between data-driven and theory-driven methods for chemical scie...The interplay between data-driven and theory-driven methods for chemical scie...
The interplay between data-driven and theory-driven methods for chemical scie...
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learning
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
C hi mad_phasefieldworkshop(1)
C hi mad_phasefieldworkshop(1)C hi mad_phasefieldworkshop(1)
C hi mad_phasefieldworkshop(1)
 
Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...
Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...
Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...
 
The Einstein Toolkit: A Community Computational Infrastructure for Relativist...
The Einstein Toolkit: A Community Computational Infrastructure for Relativist...The Einstein Toolkit: A Community Computational Infrastructure for Relativist...
The Einstein Toolkit: A Community Computational Infrastructure for Relativist...
 
The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...
 
Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
(2018.9) 分子のグラフ表現と機械学習
(2018.9) 分子のグラフ表現と機械学習(2018.9) 分子のグラフ表現と機械学習
(2018.9) 分子のグラフ表現と機械学習
 
Cyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and BeyondCyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and Beyond
 
AI Science
AI Science AI Science
AI Science
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. FreyMachine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
 
Apt thomas kelly
Apt thomas kellyApt thomas kelly
Apt thomas kelly
 
Xin Yao: "What can evolutionary computation do for you?"
Xin Yao: "What can evolutionary computation do for you?"Xin Yao: "What can evolutionary computation do for you?"
Xin Yao: "What can evolutionary computation do for you?"
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text
 
E04423133
E04423133E04423133
E04423133
 
Overview of accelerated materials design efforts in the Hacking Materials res...
Overview of accelerated materials design efforts in the Hacking Materials res...Overview of accelerated materials design efforts in the Hacking Materials res...
Overview of accelerated materials design efforts in the Hacking Materials res...
 

When The New Science Is In The Outliers

  • 1. 8/7/2019. Max Planck Society. When The New Science Is In The Outliers
    Matthias Scheffler, Fritz-Haber-Institut der Max-Planck-Gesellschaft, 14195 Berlin, Germany, and Physics Department and IRIS Adlershof, Humboldt-Universität zu Berlin, 12489 Berlin, Germany

    Several issues hamper progress in data-driven materials science. In particular, a FAIR [1] data infrastructure and appropriate data-analytics methodology [2] are missing.

    Significant efforts are still necessary to fully realize the A and I of FAIR. Here the development of metadata, their intricate relationships, and data ontology need critical attention. To be accepted by the community, a FAIR data infrastructure should work without bureaucratic hurdles or the need for special training. In this talk, I will discuss the challenges and progress, focusing on computational materials science.

    Concerning the data analytics, we note that the number of possible materials is practically infinite, but only 10 or 100 of them may be relevant for a certain science or engineering purpose. In simple words, in materials science and engineering we are often looking for "needles in a haystack". Fitting or machine-learning all data (i.e. the hay) with a single, global model may average away the specialties of the interesting minority (i.e. the needles). I will discuss methods that identify statistically exceptional subgroups in a large amount of data, and I will discuss how one can estimate the domains of applicability of machine-learning models [3].

    1. FAIR stands for Findable, Accessible, Interoperable and Re-usable. The FAIR Data Principles; https://www.force11.org/group/fairgroup/fairprinciples
    2. C. Draxl and M. Scheffler, Big-Data-Driven Materials Science and its FAIR Data Infrastructure. Plenary Chapter in Handbook of Materials Modeling (eds. S. Yip and W. Andreoni), Springer (2019). https://arxiv.org/ftp/arxiv/papers/1904/1904.05859.pdf
    3. Ch. Sutton, M. Boley, L. M. Ghiringhelli, M. Rupp, J. Vreeken, M. Scheffler, Domains of Applicability of Machine-Learning Models for Novel Materials Discovery, to be published.
  • 2. High-Throughput Screening in Computational (and Experimental) Materials Science
    Consider as many compounds as possible, typically $O(10^3)$ to $O(10^5)$.
    $O(10^1)$ to $O(10^2)$ compounds selected.
    Recycle the "waste"! Enable re-purposing.
    (Animation by G.-M. Rignanese)

    Sharing Advances Science
    Needs for a FAIR, Efficient Research-Data Infrastructure
    Findable, Accessible, Interoperable, Reusable. M. D. Wilkinson et al., Scientific Data 3, 160018 (2016).
    Since 2015
  • 3. Since 2014
    Findable, Accessible, Interoperable, Reusable. M. D. Wilkinson et al., Scientific Data 3, 160018 (2016).
    The NOMAD Center of Excellence (since 2015):
      • Encyclopedia
      • Archive (normalized data)
      • Visualization
      • Repository (raw data)
      • Big-Data Analytics
    Requests the full input and output files.

    The NOMAD Repository
    More than 50 million total-energy calculations. 90% of the VASP files are from AFLOW (S. Curtarolo), OQMD (C. Wolverton), and Materials Project (G. Ceder, K. Persson).
  • 4. What Is Needed for a FAIR Data Infrastructure?
      • Scientific results are only meaningful and worth keeping if they are fully characterized and all individual steps are fully documented.
      • Computed data are only meaningful when method, approximations, code, code version, and all computational parameters are known.
      • For experimental data, we need a full characterization of the sample, the description of the apparatus, the measurement conditions, and the measured quantity.
    This requires metadata, ontologies, and workflows. We also need good search engines, an "encyclopedia" GUI, and appropriate hardware.

    Learning from "Big" Data: Very Many Methods and Concepts, Very Interdisciplinary
      • Artificial Intelligence (AI): any technique that enables computers to mimic human intelligence, using logical if-then rules, compressed sensing, machine learning (including deep learning).
      • Machine Learning: a subset of AI that includes statistical techniques that enable machines to improve at tasks with more data. It includes deep learning.
      • Deep Learning: the subset of machine learning composed of algorithms that permit software to train itself to perform tasks, like speech and image recognition, by exposing multilayered neural networks to vast amounts of data.
  • 5. Building Maps of Materials (Role Models: Periodic Table, Ashby Plots)
    Crystal-structure prediction:
      • Octet binaries (ZB vs. RS)
      • Al$_x$Ga$_y$In$_z$O$_3$ ($x+y+z=2$)
      • Perovskites (Goldschmidt tolerance factor)
    Property classification:
      • Topological insulators
    Activation of CO$_2$ at metal oxides and carbides
    Property classification:
      • Metal vs. insulator
    Work in progress.

    Global Learning (Machine Learning)
    One single model to describe the whole population (known and unknown data):
      • minimize the overall prediction error (e.g. RMSE) using regularization
      • therefore, disregard (on purpose) all local details
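The Goldschmidt tolerance factor listed above for perovskites is a compact example of a one-number "map coordinate". For an ABX3 compound it is commonly written as t = (r_A + r_X) / (sqrt(2) (r_B + r_X)). The sketch below evaluates it in Python; the function name and the ionic radii used for SrTiO3 are illustrative assumptions and do not come from the talk.

```python
import math

def goldschmidt_tolerance_factor(r_a, r_b, r_x):
    """Return t = (r_A + r_X) / (sqrt(2) * (r_B + r_X)) for an ABX3 perovskite."""
    return (r_a + r_x) / (math.sqrt(2) * (r_b + r_x))

# Approximate ionic radii in angstrom for Sr, Ti, and O in SrTiO3.
# These numbers are assumed for illustration only.
t_srtio3 = goldschmidt_tolerance_factor(r_a=1.44, r_b=0.605, r_x=1.40)
print(f"tolerance factor: {t_srtio3:.3f}")
```

Values of t close to 1 are usually associated with the ideal cubic perovskite structure, which is why a single descriptor of this kind can already organize many compounds on a map.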
  • 6. Global vs. Local Learning
    Subgroups are statistically exceptional.
    [Figure: target property P versus feature d, both from 0.0 to 1.0, with data points labeled x = a and x = b.]
    Example selector: $\sigma_j \equiv (d_j \ge 0.8) \wedge (x_j = a)$
    A global model fitted to the entire dataset may be difficult to interpret and may well hide or incorrectly describe the actuating physical mechanisms.
    Given: sample S (population); target property $P_j$; features (descriptors) $d_j$.

    Turning Greenhouse Gases into Useful Chemicals and Fuels
    [Figure: CO, CO$_2$, and C converted to formic acid, formaldehyde, methanol, and methane.]
    We need an efficient catalyst!
    Aliaksei Mazheika, Sergey Levchenko, Francesc Illas
    H.-J. Freund et al., Angew. Chem. Int. Ed. 50, 10064 (2011).
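The selector on this slide, a conjunction of the form (d_j ≥ 0.8) ∧ (x_j = a), can be read as a Boolean test applied to every data point; the subgroup is simply the set of points that pass. A minimal Python sketch, assuming a hypothetical record layout with keys `d`, `x`, and `P` and made-up values (none of which come from the talk):

```python
# Hypothetical toy data: each sample j has a numeric feature d, a categorical
# feature x, and a target property P. All values are invented for illustration.
samples = [
    {"d": 0.90, "x": "a", "P": 0.95},
    {"d": 0.85, "x": "a", "P": 0.90},
    {"d": 0.95, "x": "b", "P": 0.30},
    {"d": 0.40, "x": "a", "P": 0.35},
    {"d": 0.20, "x": "b", "P": 0.25},
]

def sigma(sample):
    """Selector from the slide: (d >= 0.8) AND (x == 'a')."""
    return sample["d"] >= 0.8 and sample["x"] == "a"

# The subgroup is the set of samples for which the selector is true.
subgroup = [s for s in samples if sigma(s)]
subgroup_targets = [s["P"] for s in subgroup]
all_targets = [s["P"] for s in samples]

print(len(subgroup), "of", len(samples), "samples selected")
```

Because the selector is a short conjunction of interpretable conditions, the subgroup can be described in words, which is the main contrast with a global model fitted to all data.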
  • 7. Identifying New Potential Catalysts Considering Oxides
    Oxides: A$^{2+}$B$^{4+}$O$_3$, AO, BO$_2$, A$^{3+}$B$^{3+}$O$_3$, A$_2$O$_3$ (B$_2$O$_3$), A$^{1+}$B$^{5+}$O$_3$, A$_2$O, BO
      • A$^{2+}$: Mg, Ca, Sr, Ba
      • A$^{3+}$ (B$^{3+}$): Al, Ga, In, Sc, Y, La
      • B$^{4+}$: Ti, Zr, Si, Ge, Sn
      • A$^{+}$: Li, Na, K, Rb, Cs
      • B$^{5+}$: Nb, V, Sb
    Consider surfaces of many different materials and all possibly relevant surface sites: Which materials (and surface sites) are catalytically active?
    Machine learning of all produced data does not provide a good description.

    Two Possibly Interesting Subgroups for Identifying High-Performance Materials
    Subgroup identification:
      • Define a 'target property'.
      • Minimize the width of the target-property distribution.
      • Maximize the distance between the median of the target-property distribution and that of the whole data set.
      • Maximize the size of the subgroup.
    For how many xxx compounds do we know high catalytic activity? What is meant by high catalytic activity?
    1) 'Small O-C-O angle' subgroup
    2) 'Large C-O bond length' subgroup
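The three subgroup-identification criteria on this slide (narrow target distribution, large shift of the median, large subgroup) can be folded into a single score that a search over candidate selectors would maximize. The slide does not give a functional form, so the product used below, the choice of mean absolute deviation as the "width", and all data values are assumptions made for illustration.

```python
from statistics import median

def spread(values):
    """Mean absolute deviation around the median, used here as the 'width'."""
    m = median(values)
    return sum(abs(v - m) for v in values) / len(values)

def subgroup_quality(sub_targets, all_targets):
    """Illustrative score: larger for big, narrow, strongly shifted subgroups."""
    if not sub_targets:
        return 0.0
    all_spread = spread(all_targets)
    if all_spread == 0:
        return 0.0
    coverage = len(sub_targets) / len(all_targets)                  # subgroup size
    narrowing = max(0.0, 1.0 - spread(sub_targets) / all_spread)    # width reduction
    shift = abs(median(sub_targets) - median(all_targets)) / all_spread  # median distance
    return coverage * narrowing * shift

# Made-up target values for the whole data set and two candidate subgroups.
all_targets = [0.95, 0.90, 0.30, 0.35, 0.25]
narrow_shifted = [0.95, 0.90]   # tight and far from the overall median
mixed = [0.95, 0.25]            # same size, but as wide as the whole data set

print(subgroup_quality(narrow_shifted, all_targets))
print(subgroup_quality(mixed, all_targets))
```

A multiplicative form is one simple way to require all three criteria at once: a subgroup that fails any single criterion scores close to zero regardless of the others.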
  • 8. Statistically Exceptional Subgroups of Oxides (Considering 51 Potential Descriptors)
    The descriptors should characterize the clean surface.
    Selector conditions shown on the slide:
      • VBM < −5.14 eV (with respect to vacuum)
      • Minimum of the Hirshfeld charges of the A and B atoms: $q_{\min} < 0.48\,e^-$
      • Distance between the O surface atom and its second-nearest-neighbor cation: $d_2 > 2.26$ Å
      • Combined selector as printed: ($q_{\min} < 0.48\,e$) AND ($W \ge 5.14$) AND ($d_2 > 2.16$ Å)
      • Maximum of the O 2p DOS: $M > −6.0$ eV
      • Distance between the O surface atom and its nearest-neighbor cation: $d_1 > 1.8$ Å
      • Distance between the O surface atom and its second-nearest-neighbor cation: $d_2 > 2.12$ Å
    [Figure: C-O bond length (Å, axis ticks 1.2 to 1.5) for the 'small OCO angle' subgroup, the 'large C-O bond length' subgroup, and other materials. Annotations: gas-phase CO$_2^{\delta-}$ molecule ($2 > \delta > 0.9$); $\delta = 0$: 1.17 Å, 180°.]

    Two Possibly Relevant Subgroups for Semiconducting Oxide Materials
    [Figure: number of systems per energy for the 'small OCO angle' subgroup, the 'large C-O bond length' subgroup, and all materials and sites.]
    Most known materials with good catalytic performance belong to the 'large C-O bond length' subgroup. Of the "bad-performance materials", none belongs to the green subgroup.
  • 9. Domain of (reliable) Applicability (DoA) of Machine-Learning Models
      • How reliable are machine-learning models when fitted to all data?
      • Are all data fitted equally well by the one selected representation?
    Individual absolute error: $e_i = |f(x_i) - y(x_i)|$
    Find the subgroup with small individual errors.
    Example: data from the NOMAD-Kaggle-2018 competition (*) on transparent, conducting oxides, (Al$_x$Ga$_y$In$_z$)$_2$O$_3$ (for 6 space groups and up to 80 atoms per unit cell). Consider conjunctions on lattice-vector lengths and angles, volume per atom, number of atoms per unit cell, composition (%), average nearest-neighbor distances (Al-Al, Al-Ga, Al-In, ...), etc.
    (*) C. Sutton, L. M. Ghiringhelli, et al., npj Comput. Materials, in print.
    [Figure: simplified sketch of known data $y(x_i)$ and fit $f(x)$ versus the representation $x$, comparing a linear fit to all data with a linear fit in the Domain of Applicability.]

    Example: (Al$_x$Ga$_y$In$_z$)$_2$O$_3$ with Gaussian-kernel KRR and different representations (*)
    Mean absolute error of the cohesive energy: $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |f(x_i) - y(x_i)|$

    ML model   MAE, all data (meV/cation)   MAE, DoA (meV/cation)   Selectors defining the DoA
    n-gram     15.2                         11.41                   $b \ge 5.59$ Å ∧ $\gamma < 90.35°$ ∧ $R_{\mathrm{Al-O}} \le 2.06$ Å ∧ $R_{\mathrm{Ga-O}} \le 2.07$ Å
    SOAP       14.5                         11.25                   $a/c \le 3.89$ ∧ $\gamma < 90.35°$ ∧ $\beta \ge 88.68°$
    MBTR       13.9                         8.03                    $N \ge 50$ ∧ $\gamma < 90.35°$ ∧ $R_{\mathrm{Al-O}} \le 2.06$ Å

    (*) C. Sutton, M. Boley, L. M. Ghiringhelli, M. Rupp, J. Vreeken, M. Scheffler, to be published.
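The comparison in the table, mean absolute error over all data versus mean absolute error inside a selector-defined domain, can be sketched in a few lines of Python. The records below are invented, the key names `gamma`, `f`, and `y` are hypothetical, and only the form of the condition γ < 90.35° is borrowed from the slide; this is not the procedure used to produce the published numbers.

```python
def mean_absolute_error(records):
    """MAE = (1/N) * sum_i |f(x_i) - y(x_i)| over the given records."""
    return sum(abs(r["f"] - r["y"]) for r in records) / len(records)

# Made-up records: lattice angle gamma (degrees), model prediction f, reference y.
records = [
    {"gamma": 90.0,  "f": 1.00, "y": 1.01},
    {"gamma": 90.1,  "f": 2.00, "y": 1.98},
    {"gamma": 120.0, "f": 3.00, "y": 3.20},
    {"gamma": 103.0, "f": 4.00, "y": 3.70},
]

def in_doa(record):
    """Example selector of the same form as on the slide: gamma < 90.35 degrees."""
    return record["gamma"] < 90.35

doa_records = [r for r in records if in_doa(r)]

mae_all = mean_absolute_error(records)
mae_doa = mean_absolute_error(doa_records)
coverage = len(doa_records) / len(records)

print(f"MAE all data: {mae_all:.4f}, MAE in DoA: {mae_doa:.4f}, coverage: {coverage:.2f}")
```

Reporting the coverage next to the reduced error matters, because a domain that contains only a handful of points can always be made to look accurate.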
  • 10. The Materials-Science Challenge Is Different to That of Standard Machine Learning
    $\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( f(x_i) - y(x_i) \right)^2}$, where $f(x_i)$ is the predicted value and $y(x_i)$ is the true value.
    Regularized RMSE optimization emphasizes the description of the majority. It provides a "high chance of being right in the description of the hay".
    We are looking for statistically exceptional data groups. These may be needles, or nuts, or bolts, or coins, or …
    Often, we don't know exactly what we are searching for, except that the data should be statistically exceptional.
    Identify these subgroups, and don't "regularize away" the outliers!
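The point about regularized RMSE optimization favoring the majority can be made concrete with a toy example. The sketch below fits a one-parameter ridge model to invented data in which 18 "hay" points follow y = x and 2 "needle" points follow y = 3x. The data, the model form, and the regularization strength are all assumptions chosen for illustration.

```python
import math

def ridge_slope(xs, ys, lam):
    """Closed-form ridge solution for the one-parameter model y ~ w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def rmse(pred, true):
    """Root-mean-square error between predicted and true values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

# Made-up data: the majority ("hay") follows y = x, the minority ("needles") y = 3x.
hay_x = [0.1 * k for k in range(1, 19)]
hay_y = [x for x in hay_x]
needle_x = [0.5, 1.0]
needle_y = [3.0 * x for x in needle_x]

# One global model fitted to everything.
xs, ys = hay_x + needle_x, hay_y + needle_y
w = ridge_slope(xs, ys, lam=0.1)

rmse_hay = rmse([w * x for x in hay_x], hay_y)
rmse_needles = rmse([w * x for x in needle_x], needle_y)

print(f"slope: {w:.3f}, RMSE hay: {rmse_hay:.3f}, RMSE needles: {rmse_needles:.3f}")
```

The fitted slope stays close to the majority trend, so the global error looks acceptable while the error on the minority is several times larger, which is the sense in which a single global fit can "regularize away" the interesting points.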