1. Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
Artificial Intelligence for Materials Science
August 7, 2018
Slides (already) posted to hackingmaterials.lbl.gov
2. • Three projects available now
– Interpretable descriptors of crystal structure
– matminer
– atomate / Rocketsled
• One project in progress
– A text mining materials database
Overview of talk
4. Machine learning: the big problem in my view is connecting
data to ML algorithms through features
Lots of data on
complex objects that
you want to interrelate
Clustering, Regression, Feature
extraction, Model-building, etc.
Well developed
data-mining routines that work
only on numbers (ideally ones
with high relevance to your
problem)
Need to transform materials science objects into a set of
physically relevant numerical data (“features” or “descriptors”)
5.
The crystal structure is a core entity that
machine learning algorithms should know about
Step 1. Describe each site as a
fingerprint telling you how close it is
to each of 22 known local
environments (e.g., tetrahedral,
octahedral, etc.)
Step 2: Describe each structure
as the average of its site
fingerprints*
tetrahedron
octahedron
distorted 8-coordinated cube
*(plus additional statistics like standard deviation, min, max,
etc. if desired – or split into separate cation/anion vectors)
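The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the site fingerprints are made-up numbers, only three environments are scored instead of the 22 used in the talk, and the function name structure_fingerprint is hypothetical rather than the pymatgen or matminer API.

```python
from statistics import mean, pstdev


def structure_fingerprint(site_fingerprints, stats=("mean", "std", "min", "max")):
    """Combine per-site fingerprints into one structure-level vector.

    For each local environment (column), the requested statistics over all
    sites are appended in order.
    """
    funcs = {"mean": mean, "std": pstdev, "min": min, "max": max}
    columns = list(zip(*site_fingerprints))  # one tuple per local environment
    vector = []
    for column in columns:
        for name in stats:
            vector.append(funcs[name](column))
    return vector


# Hypothetical 3-site structure; each site is scored against 3 environments
# (tetrahedral, octahedral, cubic). The values are invented for illustration.
sites = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.0],
    [0.8, 0.2, 0.0],
]
fingerprint = structure_fingerprint(sites, stats=("mean",))
full_fingerprint = structure_fingerprint(sites)
```

With stats=("mean",) the result is the plain average of the site fingerprints (Step 2); the default adds the standard deviation, minimum, and maximum mentioned in the footnote.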
6. Defining local order parameters for various environments
Use a given local order parameter
with a threshold
for motif recognition:
If qtet > qthresh,
then motif is tetrahedron.
Else
not (too much) a tetrahedron.
Tetrahedral order parameter, qtet, [1]:
[1] Zimmermann et al., J. Am. Chem. Soc., 2015, 10.1021/jacs.5b08098
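The threshold rule on this slide can be written as a short function. This is a minimal sketch: the threshold value 0.5 and the example scores are assumptions chosen for illustration, and recognize_motif is a hypothetical generalization to several environments, not a function from the talk's software.

```python
def is_tetrahedron(q_tet, q_thresh):
    """Slide rule: the motif is a tetrahedron only if q_tet exceeds the threshold."""
    return q_tet > q_thresh


def recognize_motif(order_parameters, q_thresh=0.5):
    """Return the best-scoring motif name, or None if no score passes the threshold.

    order_parameters maps a motif name to its order parameter value.
    The default threshold is an assumed value, not one given in the talk.
    """
    best = max(order_parameters, key=order_parameters.get)
    return best if order_parameters[best] > q_thresh else None


# Invented order parameter values for one site.
site_scores = {"tetrahedron": 0.92, "octahedron": 0.15}
motif = recognize_motif(site_scores)
```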
7. We have now developed mathematical order parameters for
22 different local environments
8. How well do these work?
1. Order parameters clearly
distinguish different environments
even after thermal distortion
2. Work well in applications (defect site
finding, diffusion characterization)
[1] Zimmermann et al., Frontiers in Materials, 2017, doi: 10.3389/fmats.2017.00034
9.
Structure fingerprints: can they distinguish crystal
structures?
BaAl2O4, BaZnF4, CaFe2O4, CrVO4, K2NiF4,
CaB2O4-I, MgUO4, Pb3O4, SbNbO4, Sr2PbO4,
Tetragonal BaTiO3, Th3P4, TlAlF4, ZnSO4, α-MnMoO4,
BCC, Aragonite, Barite, β-K2SO4, Calcite,
Half-Heusler, FCC, Garnet, HCP, Rocksalt, Diamond,
High-cristobalite, Ilmenite, Low-cristobalite, Low-quartz,
Monazite, Olivine, Perovskites, Rutile,
Scheelite, Spinel, Thenardite, Wolframite, Zircon,
Phenacite
• 40 diverse crystal structure prototypes
• Many complex examples (e.g., multi-cation, multi-anion) from each class
• Thousands of crystal structures in the test set
• Create structure fingerprints based on averages of local environments
10. • The Euclidean distance between structure fingerprints
is small for structures of the same prototype
and larger for structures of different prototypes
Local environment fingerprints do distinguish prototypes!
(Figure: distributions of the distance between structure fingerprint vectors for same-prototype and different-prototype pairs; overlapping coefficient OVC = 1.7%)
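The comparison on this slide can be sketched as follows. The talk does not say how the overlapping coefficient was estimated, so the histogram-based estimator, the bin count, and both function names here are assumptions made for illustration.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two fingerprint vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def overlapping_coefficient(sample_a, sample_b, bins=20):
    """Histogram estimate of the overlap between two distributions (0 to 1).

    Both samples are binned on a shared grid, converted to fractions, and the
    bin-wise minimum is summed. The bin count is an arbitrary choice.
    """
    lo = min(min(sample_a), min(sample_b))
    hi = max(max(sample_a), max(sample_b))
    width = (hi - lo) / bins or 1.0  # avoid zero width if all values coincide

    def histogram(sample):
        counts = [0] * bins
        for value in sample:
            index = min(int((value - lo) / width), bins - 1)
            counts[index] += 1
        return [count / len(sample) for count in counts]

    hist_a, hist_b = histogram(sample_a), histogram(sample_b)
    return sum(min(pa, pb) for pa, pb in zip(hist_a, hist_b))


# Invented distance samples: same-prototype pairs sit near zero,
# different-prototype pairs sit farther away.
same_prototype = [0.0, 0.1, 0.2]
different_prototype = [0.9, 1.0]
ovc = overlapping_coefficient(same_prototype, different_prototype)
```

A small overlap, such as the 1.7% reported on the slide, means a distance cutoff can separate the two groups with few misassignments.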
12. Results on MP web site, e.g. for BCC-like structures
https://www.materialsproject.org/materials/mp-91/
Target: W
similar structures
(distance near 0)
Cs3Sb
TiGaFeCo
CeMg2Cu
13. • Incorporate into machine learning models
• Compare performance against other site /
structure descriptors
• Beyond local environments
Structure descriptors – next steps
Implemented in:
• pymatgen - www.pymatgen.org
• matminer – https://hackingmaterials.github.io/matminer
More info: talk to Nils
Zimmermann at the poster
session!
15.
Currently, it can be hard to get started with ML in materials
How can we make
this transformation?
Test different ideas?
Where do we get
the data?
16. Goal of matminer: connect materials data with data mining
algorithms and data visualization libraries
Ward, L. et al. Matminer: An open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60–69 (2018).
17. >40 featurizer classes can
generate thousands of
potential descriptors
Matminer contains a library of descriptors for various
materials science entities
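from matminer.featurizers.structure import EwaldEnergy
feat = EwaldEnergy()  # constructor options are optional
y = feat.featurize(structure)  # structure: a pymatgen Structure with oxidation states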
• compatible with
scikit-learn
pipelining
• automatically deploy
multiprocessing to
parallelize over data
• include citations to
methodology papers
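The featurizer pattern described on this slide can be imitated with a small stand-alone class. Everything here is a hypothetical stand-in rather than matminer's actual code: the class MeanAtomicNumber, its method names, and the use of a thread pool in place of multiprocessing are assumptions chosen to keep the sketch runnable without extra dependencies.

```python
from concurrent.futures import ThreadPoolExecutor


class MeanAtomicNumber:
    """Hypothetical featurizer: maps a composition dict to a single number."""

    # Small lookup table covering only the elements used in the example.
    ATOMIC_NUMBERS = {"H": 1, "O": 8, "Na": 11, "Cl": 17, "Fe": 26}

    def feature_labels(self):
        return ["mean atomic number"]

    def citations(self):
        # A real featurizer would point to its methodology paper here.
        return ["(placeholder for a methodology reference)"]

    def featurize(self, composition):
        """Return the amount-weighted mean atomic number as a one-item list."""
        total = sum(composition.values())
        weighted = sum(self.ATOMIC_NUMBERS[el] * n for el, n in composition.items())
        return [weighted / total]

    # scikit-learn style interface: fit is a no-op, transform maps over inputs.
    def fit(self, X, y=None):
        return self

    def transform(self, X, n_jobs=2):
        # Threads stand in for the multiprocessing mentioned on the slide.
        with ThreadPoolExecutor(max_workers=n_jobs) as pool:
            return list(pool.map(self.featurize, X))


data = [{"Na": 1, "Cl": 1}, {"Fe": 2, "O": 3}]
features = MeanAtomicNumber().fit(data).transform(data)
```

Because the class exposes fit and transform, an object of this shape could in principle be placed in a scikit-learn pipeline ahead of a regressor.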
18.
Interactive Jupyter notebooks demonstrate use cases
https://github.com/hackingmaterials/matminer_examples
Many examples available:
• Retrieving data from various databases
• Predicting bulk / shear modulus with ML
• Predicting formation energies:
• from composition alone
• with Voronoi-based structure features
included
• with Coulomb matrix and Orbital Field
matrix descriptors (reproducing
previous studies in the literature)
• Making interactive visualizations
• Creating an ML pipeline
19. • Further increase coverage and scope of feature
extraction methods available in the literature
• Increase the number of “standard” data sets that
can be used to benchmark different ML
approaches
• Apply to materials problems (in progress)
matminer – next steps
Implemented in:
• matminer – https://hackingmaterials.github.io/matminer
20.
III. atomate / Rocketsled
Generalizable forward solver: FireWorks
Supercomputing power: NERSC
Statistical optimization: various optimization libraries
(Figure: J. Mueller)
21. With high-throughput DFT, we can generate data rapidly –
what to do next?
>4500 elastic tensors
M. de Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. K. Ande, S. Van Der Zwaag, J. J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. A. Persson, and M. Asta, Sci. Data, 2015, 2, 150009.
>900 piezoelectric tensors
M. de Jong, W. Chen, H. Geerlings, M. Asta, and K. A. Persson, Sci. Data, 2015, 2, 150053.
>48000 Seebeck coefficients + cRTA transport
Ricci, Chen, Aydemir, Snyder, Rignanese, Jain, & Hautier (in submission)
22. Atomate is our software to easily run millions of such
calculations at supercomputing centers
Researcher: "Start with all binary oxides, replace O->S, run several different properties"
Workflows to run:
✓ band structure
✓ surface energies
✓ elastic tensor
☐ Raman spectrum
☐ QH thermal expansion
☐ spin-orbit coupling
Results
23. Can we build a general computational optimizer?
Generalizable forward solver: FireWorks / atomate
Supercomputing power: NERSC
Statistical optimization: various optimization libraries
(Figure: J. Mueller)
24. Rocketsled: Automatic materials screening that selects
materials to compute AND submits them to supercomputer
Screening space: ~20,000 potential ABX3 perovskite
combinations as water-splitting materials,
precomputed with DFT by a different group
If a machine learning algorithm were in
charge of picking the next compound
based on past data, how efficient
would it be?
25. • Built on the scikit-optimize package, with 10
different regressors (ML algorithms) available
• Bootstrapped uncertainty estimates for balancing
exploration and exploitation
• Next step: deployment for thermoelectrics search
Further details and next steps
Implemented in:
• rocketsled – https://github.com/hackingmaterials/rocketsled
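The idea of bootstrapped uncertainty estimates for balancing exploration and exploitation can be sketched without any external library. This is not rocketsled's implementation: a straight-line least-squares fit is swapped in for its regressors, the data are invented, and the function names and the upper-confidence-bound rule with weight kappa are assumptions made for illustration.

```python
import random
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; falls back to a constant."""
    x_bar, y_bar = mean(xs), mean(ys)
    var = sum((x - x_bar) ** 2 for x in xs)
    if var == 0:  # resample contained a single distinct x value
        return 0.0, y_bar
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / var
    return slope, y_bar - slope * x_bar


def bootstrap_predict(xs, ys, candidates, n_boot=200, seed=0):
    """Refit on resampled data; return {candidate: (mean, std)} of predictions."""
    rng = random.Random(seed)
    predictions = {c: [] for c in candidates}
    n = len(xs)
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        for c in candidates:
            predictions[c].append(a * c + b)
    return {c: (mean(p), pstdev(p)) for c, p in predictions.items()}


def pick_next(stats, kappa=1.0):
    """Choose the candidate with the largest mean + kappa * std.

    A high mean favors exploitation; a high spread favors exploration.
    """
    return max(stats, key=lambda c: stats[c][0] + kappa * stats[c][1])


# Invented past results (to be maximized) and untested candidates.
xs, ys = [0.0, 1.0, 2.0, 3.0], [0.1, 0.9, 2.1, 2.9]
stats = bootstrap_predict(xs, ys, candidates=[4.0, 5.0, 10.0])
next_x = pick_next(stats)
```

In a screening loop, the chosen candidate would be computed, added to the past data, and the procedure repeated; raising kappa shifts the search toward poorly constrained regions.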
27. Some questions that current search tools don’t answer:
these questions require materials-specific search tools!
“I’d like a list of all the chemical compositions that have been studied as
thermoelectrics, ideally weighted by research interest in them. Ok, now
filter to thermoelectric materials known to have layered structures. Now
show me some materials that aren’t in that list but are similar in terms
of structure and electronic properties in the Materials Project database.”
“What are all the known applications and unique properties of
NaCoO2? What techniques (computational, experimental) have
been used to study this compound in the past?”
“I just predicted a new composition as a battery cathode. A lit search
shows no hits at all for that composition. Has anyone ever made
anything similar to that composition? I’d like to know for synthesis
ideas and also want to check against similarity to known battery
materials.”
28.
An engine to label the content of scientific abstracts
(Figure: labeling pipeline. Matstract corpus (unlabeled data plus data labels) -> text processing (text cleaning, tokenization) -> feature engineering (POS tag labels, word2vec word embeddings, hand-crafted features) -> supervised learning on train/test sets (LSTM neural network, logistic regression) -> named entities)
“Learning” what a
scientific study is about
from >2 million
materials science
abstracts
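The tokenization and hand-crafted feature steps named on this slide might look roughly like the sketch below. The regular expressions, the feature names, and the example sentence are assumptions made for illustration and are not taken from the Matstract code; in particular, the formula pattern is a crude heuristic that will misfire on some ordinary words.

```python
import re

# Words (optionally joined by '-' or '.') or single punctuation marks.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-.][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")
# Two or more element-like chunks, e.g. Na + Co + O2. Heuristic only.
FORMULA_PATTERN = re.compile(r"(?:[A-Z][a-z]?\d*){2,}")


def tokenize(text):
    """Split text into tokens while keeping formulas such as NaCoO2 intact."""
    return TOKEN_PATTERN.findall(text)


def token_features(token):
    """Hand-crafted features that a downstream classifier could consume."""
    return {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "looks_like_formula": FORMULA_PATTERN.fullmatch(token) is not None,
    }


abstract = "NaCoO2 is a layered thermoelectric oxide."
tokens = tokenize(abstract)
features = [token_features(t) for t in tokens]
```

Feature dictionaries of this kind, combined with word embeddings and POS tags, would then be passed to a supervised model that assigns named-entity labels to each token.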
32. • Further testing
• Similarity metrics, e.g. if a target compound
doesn’t exist, retrieve information for “similar”
compounds instead
• Integration with Materials Project
Materials abstracts – next steps
Interested in being a beta tester?
Contact me
33. • Our group has been working on methods and
software for various applications
– Interpretable descriptors of crystal structure
– matminer
– atomate / Rocketsled
– A text mining materials database
• We encourage you to try the software and let us
know what you think!
– Help lists are available for all software
Conclusions
34. • Structure descriptors
– N. Zimmermann (project lead)
• Atomate / Rocketsled
– K. Mathew (project lead, atomate)
– A. Dunn (project lead, rocketsled)
• Matminer
– L. Ward (project lead, U. Chicago)
• Text mining
– V. Tshitoyan, J. Dagdelen, L. Weston
• All that provided feedback & contributed code to open-source software efforts!
• Funding:
– DOE-BES
– Toyota Research Institute
Thank you!
Slides (already) posted to hackingmaterials.lbl.gov