Final Project#

For your final project, you will be given a benchmark dataset from the MatBench paper and website. Matbench is an automated leaderboard for benchmarking state-of-the-art ML algorithms on predicting a diverse range of properties of solid materials. It is hosted and maintained by the Materials Project.

Each team has been assigned a different dataset, with a different set of features.

| Team | Property (entries) | Input | Type | Features |
| --- | --- | --- | --- | --- |
| A | Band gap (4,604) | Formula | Regression | Magpie element properties |
| B | Glass formation (5,680) | Formula | Classification | Matscholar element properties |
| C | Bulk modulus (10,987) | Structure | Regression | Magpie + site stats fingerprint |
| D | Exfoliation (635) | Structure | Regression | Meredig + packing efficiency + density features |
| E | Dielectric (4,557) | Structure | Regression | MEGNet + site stats fingerprint |

Task#

Your task is to perform data analysis on this dataset. You should submit your finished notebook and present your findings as a group (20 minutes + 10 minutes of questions).

The starting point is always to check the literature. Read the MatBench paper and the papers for the models that have been tested. Try to understand the features that have been used. Next, you can begin the data analysis. Here are some ideas to get you started:

Data

  • What does your data contain? Is it computed or experimental?

  • How reliable is the data? What sort of errors do you expect?

  • If you have been assigned a composition-only dataset, what impact do you expect this to have on model performance?

  • If you have been assigned a dataset with structures, does this make the problem easier or more difficult?

Features

  • What are the features in your dataset?

  • Compare and contrast methods of featurisation (e.g. fingerprints have no direct physical interpretation, in contrast to descriptors).

  • Can you use other featurisation methods or representations of crystals (e.g. graphs, if a structure is available)? Note, this is more advanced and an entirely optional step. See the code in the rest of this notebook for how the features were generated and for ideas on how to modify them.

Supervised learning

  • Get the most important predictors from a random forest, XGBoost, or another model; does that say anything about the chemistry? A minimal baseline is sketched after this list.

  • Classify crystals according to their properties. Would you be able to predict the performance of new crystals?

  • Attempt to predict the target property. How well does this compare against the other models on the MatBench leaderboard? How do your models differ?
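
As a concrete starting point, here is a minimal baseline sketch. It assumes the featurised Team A file generated at the end of this notebook (datasets/team-a.csv); substitute your own team's file and target column.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("datasets/team-a.csv")
X = df.drop(columns=["formula", "gap expt"])  # keep only the numeric features
y = df["gap expt"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(f"held-out R^2: {model.score(X_test, y_test):.3f}")

# rank the features by importance; do the top ones make chemical sense?
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))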

Unsupervised learning

  • Attempt to cluster the materials based on their features. Can you learn anything from the clustering about the materials in the dataset or their performance?

  • Some problems have many features. Does dimensionality reduction (e.g., PCA or recursive feature elimination) help? This can reveal whether the choice of featurisation matters; a starting sketch is given after this list.
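
A minimal sketch, reusing X from the supervised learning sketch above (standardising the features first matters for PCA):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

# project onto two principal components for visualisation
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print("explained variance ratio:", pca.explained_variance_ratio_)

# cluster in the full feature space, then colour the projection by cluster
labels = KMeans(n_clusters=4, random_state=42).fit_predict(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()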

Explainability

  • Is your model interpretable? What does it say?

  • You can perform SHAP (SHapley Additive exPlanations) analysis for greater insight; a minimal sketch follows this list.

  • Otherwise, can you use feature importance or counterfactual explanations to provide further insights?
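
A minimal SHAP sketch, assuming the shap package is installed and reusing the fitted random forest (model, X_test) from the supervised learning sketch above:

import shap

# decompose each prediction into additive per-feature contributions
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# global summary: which features drive predictions, and in which direction?
shap.summary_plot(shap_values, X_test)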

Materials screening (advanced)

  • Note, this is more advanced and an entirely optional step.

  • The Materials Project contains a database of 200,000 known and predicted inorganic materials.

  • You can download the structures and compositions of all of the materials using the Materials Project API. A good tutorial is available on YouTube, and a minimal query sketch follows this list.

  • What happens if you apply your models to the full dataset? Can you find any materials with interesting properties?
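
A minimal sketch of querying the database, assuming the mp-api package and a (free) personal API key; the exact accessor names can vary between mp-api versions:

from mp_api.client import MPRester

with MPRester("YOUR_API_KEY") as mpr:  # replace with your own API key
    docs = mpr.materials.summary.search(
        elements=["O"],  # example filter: oxygen-containing materials
        fields=["material_id", "formula_pretty", "structure"],
    )
print(f"retrieved {len(docs)} materials")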

Downloads#

The featurised datasets are available to download as csv files from the links below. Note the files are gzipped and should be uncompressed before use (or read directly with pandas, as sketched below).
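
For example, pandas can read the gzipped files directly (the filename here is illustrative; use the file for your team):

import pandas as pd

df = pd.read_csv("team-a.csv.gz")  # compression is inferred from the extension
print(df.shape)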

Submission#

You should submit your Jupyter notebook and associated files as a zip before the deadline (9 am on January 9th). Make sure you adhere to the following guidance:

  • Structure your notebook clearly, with designated sections.

  • Include a brief introduction to the task, explaining what you did and why you did it.

  • Provide appropriate and clearly labelled figures and tables documenting your findings.

  • Summarise your findings and put them into context.

  • Comment your code so we can follow along.

  • Include an author contributions statement following the CRediT system.

  • Include a statement on the use of generative AI following Imperial’s guidance.

  • Include a requirements.txt that lists the packages (and ideally the versions) required to run your notebook; a minimal example is given after this list. You can see the example for the notebooks in this course here.

  • Include the PowerPoint file of your presentation slides in the upload.
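
For reference, a minimal requirements.txt might look like the following (the version numbers are illustrative; pin the ones you actually used):

matminer==0.9.2
pymatgen==2024.3.1
pandas==2.2.2
scikit-learn==1.4.2
matplotlib==3.8.4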

Featurisation#

In the rest of this notebook, we’ll prepare the datasets used for the final project. Note: This notebook requires the matminer Python package to be installed.

Team A#

More information on the dataset is available on the MatBench description page.

Features

Here, we use the “Magpie” features defined in the paper “A general-purpose machine learning framework for predicting properties of inorganic materials” (2016). They consist of statistical measures of elemental properties, weighted according to the chemical formula: the minimum value across all elements, the maximum value, the range, the mean, the average absolute deviation, and the mode (132 features in total).

Dataset Generation

First load the dataset:

from matminer.datasets import load_dataset, get_all_dataset_info
import warnings

warnings.filterwarnings("ignore")  # ignore warnings during featurisation

print(get_all_dataset_info("matbench_expt_gap"))

df = load_dataset("matbench_expt_gap")
df
Dataset: matbench_expt_gap
Description: Matbench v0.1 test dataset for predicting experimental band gap from composition alone. Retrieved from Zhuo et al. supplementary information. Deduplicated according to composition, removing compositions with reported band gaps spanning more than a 0.1eV range; remaining compositions were assigned values based on the closest experimental value to the mean experimental value for that composition among all reports. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
	composition: Chemical formula.
	gap expt: Target variable. Experimentally measured gap, in eV.
Num Entries: 4604
Reference: Y. Zhuo, A. Masouri Tehrani, J. Brgoch (2018) Predicting the Band Gaps of Inorganic Solids by Machine Learning J. Phys. Chem. Lett. 2018, 9, 7, 1668-1673 https:doi.org/10.1021/acs.jpclett.8b00124.
Bibtex citations: ["@Article{Dunn2020,\nauthor={Dunn, Alexander\nand Wang, Qi\nand Ganose, Alex\nand Dopp, Daniel\nand Jain, Anubhav},\ntitle={Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm},\njournal={npj Computational Materials},\nyear={2020},\nmonth={Sep},\nday={15},\nvolume={6},\nnumber={1},\npages={138},\nabstract={We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13{\\thinspace}ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm---namely, that crystal graph methods appear to outperform traditional machine learning methods given {\\textasciitilde}104 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.},\nissn={2057-3960},\ndoi={10.1038/s41524-020-00406-3},\nurl={https://doi.org/10.1038/s41524-020-00406-3}\n}\n", '@article{doi:10.1021/acs.jpclett.8b00124,\nauthor = {Zhuo, Ya and Mansouri Tehrani, Aria and Brgoch, Jakoah},\ntitle = {Predicting the Band Gaps of Inorganic Solids by Machine Learning},\njournal = {The Journal of Physical Chemistry Letters},\nvolume = {9},\nnumber = {7},\npages = {1668-1673},\nyear = {2018},\ndoi = {10.1021/acs.jpclett.8b00124},\nnote ={PMID: 29532658},\neprint = {\nhttps://doi.org/10.1021/acs.jpclett.8b00124\n\n}}']
File type: json.gz
Figshare URL: https://ml.materialsproject.org/projects/matbench_expt_gap.json.gz
SHA256 Hash Digest: 783e7d1461eb83b00b2f2942da4b95fda5e58a0d1ae26b581c24cf8a82ca75b2

 
composition gap expt
0 Ag(AuS)2 0.00
1 Ag(W3Br7)2 0.00
2 Ag0.5Ge1Pb1.75S4 1.83
3 Ag0.5Ge1Pb1.75Se4 1.51
4 Ag2BBr 0.00
... ... ...
4599 ZrTaN3 1.72
4600 ZrTe 0.00
4601 ZrTi2O 0.00
4602 ZrTiF6 0.00
4603 ZrW2 0.00

4604 rows × 2 columns

Convert the composition string into a pymatgen Composition object.

from pymatgen.core import Composition

df.rename({"composition": "formula"}, axis=1, inplace=True)
df["composition"] = [Composition(comp) for comp in df["formula"]]

Generate the Magpie features for the dataset:

from matminer.featurizers.composition import ElementProperty

ep = ElementProperty.from_preset("magpie")
ep.set_n_jobs(1)
ep.featurize_dataframe(df, col_id="composition", inplace=True)
df
formula gap expt composition MagpieData minimum Number MagpieData maximum Number MagpieData range Number MagpieData mean Number MagpieData avg_dev Number MagpieData mode Number MagpieData minimum MendeleevNumber ... MagpieData range GSmagmom MagpieData mean GSmagmom MagpieData avg_dev GSmagmom MagpieData mode GSmagmom MagpieData minimum SpaceGroupNumber MagpieData maximum SpaceGroupNumber MagpieData range SpaceGroupNumber MagpieData mean SpaceGroupNumber MagpieData avg_dev SpaceGroupNumber MagpieData mode SpaceGroupNumber
0 Ag(AuS)2 0.00 (Ag, Au, S) 16.0 79.0 63.0 47.400000 25.280000 16.0 65.0 ... 0.000000 0.000000 0.000000 0.000000 70.0 225.0 155.0 163.000000 74.400000 70.0
1 Ag(W3Br7)2 0.00 (Ag, W, Br) 35.0 74.0 39.0 46.714286 15.619048 35.0 51.0 ... 0.000000 0.000000 0.000000 0.000000 64.0 229.0 165.0 118.809524 73.079365 64.0
2 Ag0.5Ge1Pb1.75S4 1.83 (Ag, Ge, Pb, S) 16.0 82.0 66.0 36.275862 23.552913 16.0 65.0 ... 0.000000 0.000000 0.000000 0.000000 70.0 225.0 155.0 139.482759 76.670630 70.0
3 Ag0.5Ge1Pb1.75Se4 1.51 (Ag, Ge, Pb, Se) 32.0 82.0 50.0 46.206897 17.388823 34.0 65.0 ... 0.000000 0.000000 0.000000 0.000000 14.0 225.0 211.0 108.586207 104.370987 14.0
4 Ag2BBr 0.00 (Ag, B, Br) 5.0 47.0 42.0 33.500000 14.250000 47.0 65.0 ... 0.000000 0.000000 0.000000 0.000000 64.0 225.0 161.0 170.000000 55.000000 225.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4599 ZrTaN3 1.72 (Zr, Ta, N) 7.0 73.0 66.0 26.800000 23.760000 7.0 44.0 ... 0.000000 0.000000 0.000000 0.000000 194.0 229.0 35.0 201.000000 11.200000 194.0
4600 ZrTe 0.00 (Zr, Te) 40.0 52.0 12.0 46.000000 6.000000 40.0 44.0 ... 0.000000 0.000000 0.000000 0.000000 152.0 194.0 42.0 173.000000 21.000000 152.0
4601 ZrTi2O 0.00 (Zr, Ti, O) 8.0 40.0 32.0 23.000000 8.500000 22.0 43.0 ... 0.000023 0.000011 0.000011 0.000023 12.0 194.0 182.0 148.500000 68.250000 194.0
4602 ZrTiF6 0.00 (Zr, Ti, F) 9.0 40.0 31.0 14.500000 8.250000 9.0 43.0 ... 0.000023 0.000003 0.000005 0.000000 15.0 194.0 179.0 59.750000 67.125000 15.0
4603 ZrW2 0.00 (Zr, W) 40.0 74.0 34.0 62.666667 15.111111 74.0 44.0 ... 0.000000 0.000000 0.000000 0.000000 194.0 229.0 35.0 217.333333 15.555556 229.0

4604 rows × 135 columns

Drop the composition object column and save the dataset.

df.drop("composition", axis=1, inplace=True)
df.to_csv("datasets/team-a.csv", index=False)

Team B#

More information on the dataset is available on the MatBench description page.

Features

Here, we use elemental embeddings taken from the paper “Unsupervised word embeddings capture latent knowledge from materials science literature” (2019). Each element is represented by a 200-dimensional vector learned through natural language processing of the materials science literature. The features consist of statistical measures of these embeddings, weighted according to the chemical formula: the minimum value across all elements, the maximum value, the range, the mean, and the standard deviation (1000 features in total).

Dataset Generation

First load the dataset:

from matminer.datasets import load_dataset, get_all_dataset_info
import warnings

warnings.filterwarnings("ignore")  # ignore warnings during featurisation

print(get_all_dataset_info("matbench_glass"))

df = load_dataset("matbench_glass")
df
Dataset: matbench_glass
Description: Matbench v0.1 test dataset for predicting full bulk metallic glass formation ability from chemical formula. Retrieved from "Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys,’ a volume of the Landolt– Börnstein collection. Deduplicated according to composition, ensuring no compositions were reported as both GFA and not GFA (i.e., all reports agreed on the classification designation). For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
	composition: Chemical formula.
	gfa: Target variable. Glass forming ability: 1 means glass forming and corresponds to amorphous, 0 means non full glass forming.
Num Entries: 5680
Reference: Y. Kawazoe, T. Masumoto, A.-P. Tsai, J.-Z. Yu, T. Aihara Jr. (1997) Y. Kawazoe, J.-Z. Yu, A.-P. Tsai, T. Masumoto (ed.) SpringerMaterials
Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys · 1 Introduction Landolt-Börnstein - Group III Condensed Matter 37A (Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys) https://www.springer.com/gp/book/9783540605072 (Springer-Verlag Berlin Heidelberg © 1997) Accessed: 03-09-2019
Bibtex citations: ["@Article{Dunn2020,\nauthor={Dunn, Alexander\nand Wang, Qi\nand Ganose, Alex\nand Dopp, Daniel\nand Jain, Anubhav},\ntitle={Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm},\njournal={npj Computational Materials},\nyear={2020},\nmonth={Sep},\nday={15},\nvolume={6},\nnumber={1},\npages={138},\nabstract={We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13{\\thinspace}ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm---namely, that crystal graph methods appear to outperform traditional machine learning methods given {\\textasciitilde}104 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.},\nissn={2057-3960},\ndoi={10.1038/s41524-020-00406-3},\nurl={https://doi.org/10.1038/s41524-020-00406-3}\n}\n", '@Misc{LandoltBornstein1997:sm_lbs_978-3-540-47679-5_2,\nauthor="Kawazoe, Y.\nand Masumoto, T.\nand Tsai, A.-P.\nand Yu, J.-Z.\nand Aihara Jr., T.",\neditor="Kawazoe, Y.\nand Yu, J.-Z.\nand Tsai, A.-P.\nand Masumoto, T.",\ntitle="Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys {\\textperiodcentered} 1 Introduction: Datasheet from Landolt-B{\\"o}rnstein - Group III Condensed Matter {\\textperiodcentered} Volume 37A: ``Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys\'\' in SpringerMaterials (https://dx.doi.org/10.1007/10510374{\\_}2)",\npublisher="Springer-Verlag Berlin Heidelberg",\nnote="Copyright 1997 Springer-Verlag Berlin Heidelberg",\nnote="Part of SpringerMaterials",\nnote="accessed 2018-10-23",\ndoi="10.1007/10510374_2",\nurl="https://materials.springer.com/lb/docs/sm_lbs_978-3-540-47679-5_2"\n}', '@Article{Ward2016,\nauthor={Ward, Logan\nand Agrawal, Ankit\nand Choudhary, Alok\nand Wolverton, Christopher},\ntitle={A general-purpose machine learning framework for predicting properties of inorganic materials},\njournal={Npj Computational Materials},\nyear={2016},\nmonth={Aug},\nday={26},\npublisher={The Author(s)},\nvolume={2},\npages={16028},\nnote={Article},\nurl={http://dx.doi.org/10.1038/npjcompumats.2016.28}\n}']
File type: json.gz
Figshare URL: https://ml.materialsproject.org/projects/matbench_glass.json.gz
SHA256 Hash Digest: 36beb654e2a463ee2a6572105bea0ca2961eee7c7b26a25377bff2c3b338e53a
composition gfa
0 Al False
1 Al(NiB)2 True
2 Al10Co21B19 True
3 Al10Co23B17 True
4 Al10Co27B13 True
... ... ...
5675 ZrTi9 False
5676 ZrTiSi2 True
5677 ZrTiSi3 True
5678 ZrVCo8 True
5679 ZrVNi2 True

5680 rows × 2 columns

Convert the composition string into a pymatgen Composition object.

from pymatgen.core import Composition

df.rename({"composition": "formula"}, axis=1, inplace=True)
df["composition"] = [Composition(comp) for comp in df["formula"]]

Generate the features for the dataset:

from matminer.featurizers.composition import ElementProperty

ep = ElementProperty.from_preset("matscholar_el")
ep.set_n_jobs(1)
ep.featurize_dataframe(df, col_id="composition", inplace=True)
df
formula gfa composition MatscholarElementData minimum embedding 1 MatscholarElementData maximum embedding 1 MatscholarElementData range embedding 1 MatscholarElementData mean embedding 1 MatscholarElementData std_dev embedding 1 MatscholarElementData minimum embedding 2 MatscholarElementData maximum embedding 2 ... MatscholarElementData minimum embedding 199 MatscholarElementData maximum embedding 199 MatscholarElementData range embedding 199 MatscholarElementData mean embedding 199 MatscholarElementData std_dev embedding 199 MatscholarElementData minimum embedding 200 MatscholarElementData maximum embedding 200 MatscholarElementData range embedding 200 MatscholarElementData mean embedding 200 MatscholarElementData std_dev embedding 200
0 Al False (Al) -0.034189 -0.034189 0.000000 -0.034189 0.000000 -0.001735 -0.001735 ... -0.044308 -0.044308 0.000000 -0.044308 0.000000 -0.009788 -0.009788 0.000000 -0.009788 0.000000
1 Al(NiB)2 True (Al, Ni, B) -0.105366 0.020968 0.126334 -0.040597 0.070736 -0.015921 0.038819 ... -0.065492 0.037160 0.102652 -0.020195 0.059330 -0.009788 0.042107 0.051894 0.015274 0.027822
2 Al10Co21B19 True (Al, Co, B) -0.105366 0.000804 0.106171 -0.046539 0.059815 -0.015921 0.067943 ... -0.082171 0.037160 0.119331 -0.029253 0.067328 -0.029063 0.042107 0.071170 0.001836 0.040419
3 Al10Co23B17 True (Al, Co, B) -0.105366 0.000804 0.106171 -0.042292 0.059232 -0.015921 0.067943 ... -0.082171 0.037160 0.119331 -0.034026 0.066641 -0.029063 0.042107 0.071170 -0.001010 0.039941
4 Al10Co27B13 True (Al, Co, B) -0.105366 0.000804 0.106171 -0.033799 0.057383 -0.015921 0.067943 ... -0.082171 0.037160 0.119331 -0.043572 0.064497 -0.029063 0.042107 0.071170 -0.006704 0.038517
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5675 ZrTi9 False (Zr, Ti) -0.035601 -0.018014 0.017588 -0.033843 0.012436 -0.066841 -0.047185 ... -0.051519 0.005334 0.056852 -0.045833 0.040201 0.042314 0.063139 0.020825 0.044396 0.014726
5676 ZrTiSi2 True (Zr, Ti, Si) -0.035601 -0.018014 0.017588 -0.022898 0.009291 -0.066841 -0.035818 ... -0.051519 0.005334 0.056852 -0.029050 0.026519 -0.012335 0.063139 0.075474 0.020196 0.042189
5677 ZrTiSi3 True (Zr, Ti, Si) -0.035601 -0.018014 0.017588 -0.022116 0.009025 -0.066841 -0.035818 ... -0.051519 0.005334 0.056852 -0.030242 0.025259 -0.012335 0.063139 0.075474 0.013690 0.043492
5678 ZrVCo8 True (Zr, V, Co) -0.018014 0.057822 0.075835 0.004624 0.031897 -0.047185 0.067943 ... -0.082171 0.100906 0.183077 -0.055113 0.099783 -0.029063 0.102289 0.131353 -0.006708 0.078135
5679 ZrVNi2 True (Zr, V, Ni) -0.018014 0.057822 0.075835 0.020436 0.033921 -0.047185 0.038819 ... -0.065492 0.100906 0.166398 -0.006186 0.086338 0.000973 0.102289 0.101316 0.041844 0.054582

5680 rows × 1003 columns

Drop the composition object column and save the dataset.

df.drop("composition", axis=1, inplace=True)
df.to_csv("datasets/team-b.csv", index=False)

Team C#

More information on the dataset is available on the MatBench description page.

Features

Here, we use the “Magpie” features defined in the paper “A general-purpose machine learning framework for predicting properties of inorganic materials” (2016). They consist of statistical measures of elemental properties, weighted according to the chemical formula: the minimum value across all elements, the maximum value, the range, the mean, the average absolute deviation, and the mode (132 features in total).

Since this dataset also has the crystal structure available, we also include the site stats fingerprint. This method calculates order parameters for the different sites (e.g. how tetrahedral, octahedral, or square planar the site is, and the coordination number) and calculates statistical properties (mean and standard deviation) of these over the full structure. Specifically, we use the CrystalNNFingerprint, where the order parameters are introduced in the paper “Local structure order parameters and site fingerprints for quantification of coordination environment and crystal structure similarity” (2020).

Note, if you are so inclined, you can use the structure directly with message-passing graph neural networks such as MEGNet or M3GNet. Nice implementations are provided in the MatGL library, although these will be significantly more expensive to train than classical models. A minimal inference sketch follows.
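
For example, a minimal sketch of inference with a pretrained model via MatGL (the model name is an assumption; check the MatGL documentation for the pretrained models available in your installed version):

import matgl
from pymatgen.core import Lattice, Structure

# rock-salt MgO as a toy input structure
mgo = Structure.from_spacegroup(
    "Fm-3m", Lattice.cubic(4.21), ["Mg", "O"], [[0, 0, 0], [0.5, 0.5, 0.5]]
)

# load a pretrained MEGNet formation-energy model and predict
model = matgl.load_model("MEGNet-MP-2018.6.1-Eform")
print(model.predict_structure(mgo))  # formation energy in eV/atom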

Dataset Generation

First load the dataset:

from matminer.datasets import load_dataset, get_all_dataset_info
import warnings

warnings.filterwarnings("ignore")  # ignore warnings during featurisation

print(get_all_dataset_info("matbench_log_kvrh"))

df = load_dataset("matbench_log_kvrh")
df
Dataset: matbench_log_kvrh
Description: Matbench v0.1 test dataset for predicting DFT log10 VRH-average bulk modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
	log10(K_VRH): Target variable. Base 10 logarithm of the DFT Voigt-Reuss-Hill average bulk moduli in GPa.
	structure: Pymatgen Structure of the material.
Num Entries: 10987
Reference: Jong, M. De, Chen, W., Angsten, T., Jain, A., Notestine, R., Gamst,
A., Sluiter, M., Ande, C. K., Zwaag, S. Van Der, Plata, J. J., Toher,
C., Curtarolo, S., Ceder, G., Persson, K. and Asta, M., "Charting
the complete elastic properties of inorganic crystalline compounds",
Scientific Data volume 2, Article number: 150009 (2015)
Bibtex citations: ["@Article{Dunn2020,\nauthor={Dunn, Alexander\nand Wang, Qi\nand Ganose, Alex\nand Dopp, Daniel\nand Jain, Anubhav},\ntitle={Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm},\njournal={npj Computational Materials},\nyear={2020},\nmonth={Sep},\nday={15},\nvolume={6},\nnumber={1},\npages={138},\nabstract={We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13{\\thinspace}ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm---namely, that crystal graph methods appear to outperform traditional machine learning methods given {\\textasciitilde}104 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.},\nissn={2057-3960},\ndoi={10.1038/s41524-020-00406-3},\nurl={https://doi.org/10.1038/s41524-020-00406-3}\n}\n", '@Article{deJong2015,\nauthor={de Jong, Maarten and Chen, Wei and Angsten, Thomas\nand Jain, Anubhav and Notestine, Randy and Gamst, Anthony\nand Sluiter, Marcel and Krishna Ande, Chaitanya\nand van der Zwaag, Sybrand and Plata, Jose J. and Toher, Cormac\nand Curtarolo, Stefano and Ceder, Gerbrand and Persson, Kristin A.\nand Asta, Mark},\ntitle={Charting the complete elastic properties\nof inorganic crystalline compounds},\njournal={Scientific Data},\nyear={2015},\nmonth={Mar},\nday={17},\npublisher={The Author(s)},\nvolume={2},\npages={150009},\nnote={Data Descriptor},\nurl={http://dx.doi.org/10.1038/sdata.2015.9}\n}']
File type: json.gz
Figshare URL: https://ml.materialsproject.org/projects/matbench_log_kvrh.json.gz
SHA256 Hash Digest: 44b113ddb7e23aa18731a62c74afa7e5aa654199e0db5f951c8248a00955c9cd

 
structure log10(K_VRH)
0 [[0. 0. 0.] Ca, [1.37728887 1.57871271 3.73949... 1.707570
1 [[3.14048493 1.09300401 1.64101398] Mg, [0.625... 1.633468
2 [[ 2.06884519 2.40627241 -0.45891585] Si, [1.... 1.908485
3 [[2.06428082 0. 2.06428082] Pd, [0. ... 2.117271
4 [[3.09635262 1.0689416 1.53602403] Mg, [0.593... 1.690196
... ... ...
10982 [[0. 0. 0.] Rh, [3.2029368 3.2029368 2.09459... 1.778151
10983 [[-1.51157821 4.4173925 1.21553922] Mg, [3.... 1.724276
10984 [[4.37546772 4.51128393 6.81784473] H, [0.4573... 1.342423
10985 [[0. 0. 0.] Si, [ 4.55195829 4.55195829 -4.55... 1.770852
10986 [[1.44565668 0. 2.05259079] Al, [1.445... 1.954243

10987 rows × 2 columns

Create a composition column containing pymatgen Composition objects from the Structure object.

df["composition"] = [structure.composition for structure in df["structure"]]

Generate the compositional features for the dataset:

from matminer.featurizers.composition import ElementProperty

ep = ElementProperty.from_preset("magpie")
ep.set_n_jobs(1)
ep.featurize_dataframe(df, col_id="composition", inplace=True)
from matminer.featurizers.structure import SiteStatsFingerprint

cnn_fp = SiteStatsFingerprint.from_preset("CrystalNNFingerprint_ops")
cnn_fp.set_n_jobs(8)
cnn_fp.featurize_dataframe(df, col_id="structure", inplace=True)
df
structure log10(K_VRH) composition MagpieData minimum Number MagpieData maximum Number MagpieData range Number MagpieData mean Number MagpieData avg_dev Number MagpieData mode Number MagpieData minimum MendeleevNumber ... mean wt CN_20 std_dev wt CN_20 mean wt CN_21 std_dev wt CN_21 mean wt CN_22 std_dev wt CN_22 mean wt CN_23 std_dev wt CN_23 mean wt CN_24 std_dev wt CN_24
0 [[0. 0. 0.] Ca, [1.37728887 1.57871271 3.73949... 1.707570 (Ca, Ge, Ag) 20.0 47.0 27.0 35.600000 9.120000 32.0 7.0 ... 0.001273 0.002546 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
1 [[3.14048493 1.09300401 1.64101398] Mg, [0.625... 1.633468 (Mg, Ge, Ba) 12.0 56.0 44.0 28.800000 13.440000 12.0 9.0 ... 0.006620 0.013240 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
2 [[ 2.06884519 2.40627241 -0.45891585] Si, [1.... 1.908485 (Si, Cu, Sr) 14.0 38.0 24.0 24.800000 8.640000 14.0 8.0 ... 0.000000 0.000000 0.0 0.0 0.007384 0.014768 0.0 0.0 0.0 0.0
3 [[2.06428082 0. 2.06428082] Pd, [0. ... 2.117271 (Pd, Dy) 46.0 66.0 20.0 51.000000 7.500000 46.0 31.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
4 [[3.09635262 1.0689416 1.53602403] Mg, [0.593... 1.690196 (Mg, Si, Ba) 12.0 56.0 44.0 21.600000 13.760000 12.0 9.0 ... 0.005602 0.011204 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10982 [[0. 0. 0.] Rh, [3.2029368 3.2029368 2.09459... 1.778151 (Rh, I) 45.0 53.0 8.0 50.333333 3.555556 53.0 59.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
10983 [[-1.51157821 4.4173925 1.21553922] Mg, [3.... 1.724276 (Mg, Co, Sn) 12.0 50.0 38.0 18.625000 9.937500 12.0 58.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
10984 [[4.37546772 4.51128393 6.81784473] H, [0.4573... 1.342423 (H, N, O) 1.0 8.0 7.0 4.666667 3.259259 1.0 82.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
10985 [[0. 0. 0.] Si, [ 4.55195829 4.55195829 -4.55... 1.770852 (Si, Sn) 14.0 50.0 36.0 32.000000 18.000000 14.0 78.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
10986 [[1.44565668 0. 2.05259079] Al, [1.445... 1.954243 (Al, Cu) 13.0 29.0 16.0 18.333333 7.111111 13.0 64.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0

10987 rows × 257 columns

Drop the composition and structure object columns and save the dataset.

df.drop(["composition", "structure"], axis=1, inplace=True)
df.to_csv("datasets/team-c.csv", index=False)

Team D#

More information on the dataset is available on the MatBench description page.

Features

Here we use the “Meredig” features defined in the paper “Combinatorial screening for new materials in unconstrained composition space with machine learning” (2014). They consist of the fraction of each element in the composition together with statistical attributes of the constituent elements (135 features in total). We also include the atomic packing efficiency features defined in “A predictive structural model for bulk metallic glasses” (2015).

Since this dataset also has the crystal structure available, we also include some structural features. In particular, these are the density, volume per atom, packing fraction (density features), and the maximum packing efficiency defined in “Including crystal structure attributes in machine learning models of formation energies via Voronoi tessellations” (2017).

Dataset Generation

First load the dataset:

from matminer.datasets import load_dataset, get_all_dataset_info
import warnings

warnings.filterwarnings("ignore")  # ignore warnings during featurisation

print(get_all_dataset_info("matbench_jdft2d"))

df = load_dataset("matbench_jdft2d")
df
Dataset: matbench_jdft2d
Description: Matbench v0.1 test dataset for predicting exfoliation energies from crystal structure (computed with the OptB88vdW and TBmBJ functionals). Adapted from the JARVIS DFT database. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
	exfoliation_en: Target variable. Exfoliation energy (meV/atom).
	structure: Pymatgen Structure of the material.
Num Entries: 636
Reference: 2D Dataset discussed in:
High-throughput Identification and Characterization of Two dimensional Materials using Density functional theory Kamal Choudhary, Irina Kalish, Ryan Beams & Francesca Tavazza Scientific Reports volume 7, Article number: 5179 (2017)
Original 2D Data file sourced from:
choudhary, kamal; https://orcid.org/0000-0001-9737-8074 (2018): jdft_2d-7-7-2018.json. figshare. Dataset.
Bibtex citations: ["@Article{Dunn2020,\nauthor={Dunn, Alexander\nand Wang, Qi\nand Ganose, Alex\nand Dopp, Daniel\nand Jain, Anubhav},\ntitle={Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm},\njournal={npj Computational Materials},\nyear={2020},\nmonth={Sep},\nday={15},\nvolume={6},\nnumber={1},\npages={138},\nabstract={We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13{\\thinspace}ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm---namely, that crystal graph methods appear to outperform traditional machine learning methods given {\\textasciitilde}104 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.},\nissn={2057-3960},\ndoi={10.1038/s41524-020-00406-3},\nurl={https://doi.org/10.1038/s41524-020-00406-3}\n}\n", '@Article{Choudhary2017,\nauthor={Choudhary, Kamal\nand Kalish, Irina\nand Beams, Ryan\nand Tavazza, Francesca},\ntitle={High-throughput Identification and Characterization of Two-dimensional Materials using Density functional theory},\njournal={Scientific Reports},\nyear={2017},\nvolume={7},\nnumber={1},\npages={5179},\nabstract={We introduce a simple criterion to identify two-dimensional (2D) materials based on the comparison between experimental lattice constants and lattice constants mainly obtained from Materials-Project (MP) density functional theory (DFT) calculation repository. Specifically, if the relative difference between the two lattice constants for a specific material is greater than or equal to 5%, we predict them to be good candidates for 2D materials. We have predicted at least 1356 such 2D materials. For all the systems satisfying our criterion, we manually create single layer systems and calculate their energetics, structural, electronic, and elastic properties for both the bulk and the single layer cases. Currently the database consists of 1012 bulk and 430 single layer materials, of which 371 systems are common to bulk and single layer. The rest of calculations are underway. To validate our criterion, we calculated the exfoliation energy of the suggested layered materials, and we found that in 88.9% of the cases the currently accepted criterion for exfoliation was satisfied. Also, using molybdenum telluride as a test case, we performed X-ray diffraction and Raman scattering experiments to benchmark our calculations and understand their applicability and limitations. 
The data is publicly available at the website http://www.ctcms.nist.gov/{\textasciitilde}knc6/JVASP.html.},\nissn={2045-2322},\ndoi={10.1038/s41598-017-05402-0},\nurl={https://doi.org/10.1038/s41598-017-05402-0}\n}', '@misc{choudhary__2018, title={jdft_2d-7-7-2018.json}, url={https://figshare.com/articles/jdft_2d-7-7-2018_json/6815705/1}, DOI={10.6084/m9.figshare.6815705.v1}, abstractNote={2D materials}, publisher={figshare}, author={choudhary, kamal and https://orcid.org/0000-0001-9737-8074}, year={2018}, month={Jul}}']
File type: json.gz
Figshare URL: https://ml.materialsproject.org/projects/matbench_jdft2d.json.gz
SHA256 Hash Digest: 26057dc4524e193e32abffb296ce819b58b6e11d1278cae329a2f97817a4eddf
structure exfoliation_en
0 [[1.49323139 3.32688406 7.26257785] Hf, [3.326... 63.593833
1 [[1.85068084 4.37698238 6.9301577 ] As, [0. ... 134.863750
2 [[ 0. 2.0213325 11.97279555] Ti, [ 1... 43.114667
3 [[2.39882726 2.39882726 2.53701553] In, [0.054... 240.715488
4 [[-1.83484554e-06 1.73300105e+00 2.61675943e... 67.442833
... ... ...
631 [[ 2.38592362 1.37751086 13.178104 ] Co, [-2... 26.426545
632 [[0. 0. 6.02219863] Br, [0. ... 43.574286
633 [[2.74646086 0.06822876 1.46596737] Se, [6.324... 88.808659
634 [[6.79056646 2.04327631 3.37729406] I, [2.0440... 132.265250
635 [[ 0.69409027 1.22690182 -0.85636865] Co, [-0... 63.564333

636 rows × 2 columns

Create a composition column containing pymatgen Composition objects from the Structure object.

df["composition"] = [structure.composition for structure in df["structure"]]

Generate the compositional features for the dataset:

from matminer.featurizers.composition import Meredig, AtomicPackingEfficiency


for featurizer in [Meredig(), AtomicPackingEfficiency(impute_nan=True)]:
    featurizer.set_n_jobs(1)
    featurizer.featurize_dataframe(df, col_id="composition", inplace=True)
from matminer.featurizers.structure import DensityFeatures, MaximumPackingEfficiency

for featurizer in [DensityFeatures(), MaximumPackingEfficiency()]:
    featurizer.set_n_jobs(8)
    featurizer.featurize_dataframe(df, col_id="structure", inplace=True, ignore_errors=True)

# remove any rows that contain NaN values
df.dropna(inplace=True)
df
structure exfoliation_en composition H fraction He fraction Li fraction Be fraction B fraction C fraction N fraction ... frac f valence electrons mean simul. packing efficiency mean abs simul. packing efficiency dist from 1 clusters |APE| < 0.010 dist from 3 clusters |APE| < 0.010 dist from 5 clusters |APE| < 0.010 density vpa packing fraction max packing efficiency
0 [[1.49323139 3.32688406 7.26257785] Hf, [3.326... 63.593833 (Hf, Si, Te) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.368421 -0.007860 0.016033 0.048029 0.056762 0.067759 3.021472 61.218660 0.177875 0.184684
1 [[1.85068084 4.37698238 6.9301577 ] As, [0. ... 134.863750 (As) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.000000 0.023994 0.023994 1.000000 1.000000 1.000000 1.124487 110.637295 0.057581 0.074276
2 [[ 0. 2.0213325 11.97279555] Ti, [ 1... 43.114667 (Ti, Br, O) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.000000 0.002382 0.017291 0.078567 0.078567 0.079184 1.385515 57.436271 0.108929 0.103004
3 [[2.39882726 2.39882726 2.53701553] In, [0.054... 240.715488 (In, Bi) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.333333 -0.011374 0.013329 0.000000 0.067344 0.121218 1.950268 137.847700 0.118812 0.109654
4 [[-1.83484554e-06 1.73300105e+00 2.61675943e... 67.442833 (Nb, O) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.000000 -0.005835 0.005835 0.018856 0.049327 0.101429 1.183135 58.435112 0.083167 0.083109
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
631 [[ 2.38592362 1.37751086 13.178104 ] Co, [-2... 26.426545 (Co, Cl, O) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.000000 0.049309 0.049309 0.000000 0.035398 0.051959 0.823767 47.249048 0.049875 0.039890
632 [[0. 0. 6.02219863] Br, [0. ... 43.574286 (Br, Ca, Cu, O) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.000000 0.038838 0.038838 0.037863 0.044437 0.053600 1.437709 55.358454 0.190227 0.144017
633 [[2.74646086 0.06822876 1.46596737] Se, [6.324... 88.808659 (Se, Cl, Pd) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.000000 0.027839 0.027839 0.052691 0.066689 0.078605 1.252087 97.537984 0.066022 0.070146
634 [[6.79056646 2.04327631 3.37729406] I, [2.0440... 132.265250 (I, Hg) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.233333 0.023963 0.023963 1.000000 1.000000 1.000000 1.445489 174.000416 0.071121 0.069553
635 [[ 0.69409027 1.22690182 -0.85636865] Co, [-0... 63.564333 (Co, O) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.000000 -0.037170 0.037170 0.040992 0.067233 0.173491 1.004062 50.128437 0.080563 0.068925

635 rows × 147 columns

Drop the composition and structure object columns and save the dataset.

df.drop(["composition", "structure"], axis=1, inplace=True)
df.to_csv("datasets/team-d.csv", index=False)

Team E#

More information on the dataset is available on the MatBench description page.

Features

Here, we use elemental embeddings taken from the paper “Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals” (2019). Each element is represented by a 16-dimensional vector learned by training a graph neural network on materials spanning the periodic table. The features consist of statistical measures of these embeddings (minimum, maximum, range, mean, and standard deviation), weighted according to the chemical formula (80 features in total).

Since this dataset also has the crystal structure available, we also include the site stats fingerprint with a Voronoi decomposition. This method obtains information about the geometric environment of each site and calculates statistical properties (mean and standard deviation) of these over the full structure. Specifically, we use the VoronoiFingerprint, whose features are introduced in the paper “A transferable machine-learning framework linking interstice distribution and plastic heterogeneity in metallic glasses” (2019).

Dataset Generation

First load the dataset:

from matminer.datasets import load_dataset, get_all_dataset_info
import warnings

warnings.filterwarnings("ignore")  # ignore warnings during featurisation

print(get_all_dataset_info("matbench_dielectric"))

df = load_dataset("matbench_dielectric")
df
Dataset: matbench_dielectric
Description: Matbench v0.1 test dataset for predicting refractive index from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having refractive indices less than 1 and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
	n: Target variable. Refractive index (unitless).
	structure: Pymatgen Structure of the material.
Num Entries: 4764
Reference: Petousis, I., Mrdjenovich, D., Ballouz, E., Liu, M., Winston, D.,
Chen, W., Graf, T., Schladt, T. D., Persson, K. A. & Prinz, F. B.
High-throughput screening of inorganic compounds for the discovery
of novel dielectric and optical materials. Sci. Data 4, 160134 (2017).
Bibtex citations: ["@Article{Dunn2020,\nauthor={Dunn, Alexander\nand Wang, Qi\nand Ganose, Alex\nand Dopp, Daniel\nand Jain, Anubhav},\ntitle={Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm},\njournal={npj Computational Materials},\nyear={2020},\nmonth={Sep},\nday={15},\nvolume={6},\nnumber={1},\npages={138},\nabstract={We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13{\\thinspace}ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm---namely, that crystal graph methods appear to outperform traditional machine learning methods given {\\textasciitilde}104 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.},\nissn={2057-3960},\ndoi={10.1038/s41524-020-00406-3},\nurl={https://doi.org/10.1038/s41524-020-00406-3}\n}\n", '@article{Jain2013,\nauthor = {Jain, Anubhav and Ong, Shyue Ping and Hautier, Geoffroy and Chen, Wei and Richards, William Davidson and Dacek, Stephen and Cholia, Shreyas and Gunter, Dan and Skinner, David and Ceder, Gerbrand and Persson, Kristin a.},\ndoi = {10.1063/1.4812323},\nissn = {2166532X},\njournal = {APL Materials},\nnumber = {1},\npages = {011002},\ntitle = {{The Materials Project: A materials genome approach to accelerating materials innovation}},\nurl = {http://link.aip.org/link/AMPADS/v1/i1/p011002/s1\\&Agg=doi},\nvolume = {1},\nyear = {2013}\n}', '@article{Petousis2017,\nauthor={Petousis, Ioannis and Mrdjenovich, David and Ballouz, Eric\nand Liu, Miao and Winston, Donald and Chen, Wei and Graf, Tanja\nand Schladt, Thomas D. and Persson, Kristin A. and Prinz, Fritz B.},\ntitle={High-throughput screening of inorganic compounds for the\ndiscovery of novel dielectric and optical materials},\njournal={Scientific Data},\nyear={2017},\nmonth={Jan},\nday={31},\npublisher={The Author(s)},\nvolume={4},\npages={160134},\nnote={Data Descriptor},\nurl={http://dx.doi.org/10.1038/sdata.2016.134}\n}']
File type: json.gz
Figshare URL: https://ml.materialsproject.org/projects/matbench_dielectric.json.gz
SHA256 Hash Digest: 83befa09bc2ec2f4b6143afc413157827a90e5e2e42c1eb507ccfa01bf26a1d6
structure n
0 [[4.29304147 2.4785886 1.07248561] S, [4.2930... 1.752064
1 [[3.95051434 4.51121437 0.28035002] K, [4.3099... 1.652859
2 [[-1.78688104 4.79604117 1.53044621] Rb, [-1... 1.867858
3 [[4.51438064 4.51438064 0. ] Mn, [0.133... 2.676887
4 [[-4.36731958 6.8886097 0.50929706] Li, [-2... 1.793232
... ... ...
4759 [[ 2.79280881 0.12499663 -1.84045389] Ca, [-2... 2.136837
4760 [[0. 5.50363806 3.84192106] O, [4.7662... 2.690619
4761 [[0. 0. 0.] Ba, [ 0.23821924 4.32393487 -0.35... 2.811494
4762 [[0. 0.18884638 0. ] K, [0. ... 1.832887
4763 [[0. 0. 0.] Cs, [2.80639641 2.80639641 2.80639... 2.559279

4764 rows × 2 columns

Create a composition column containing pymatgen Composition objects from the Structure object.

df["composition"] = [structure.composition for structure in df["structure"]]

Generate the compositional features for the dataset:

from matminer.featurizers.composition import ElementProperty

ep = ElementProperty.from_preset("megnet_el")
ep.set_n_jobs(1)
ep.featurize_dataframe(df, col_id="composition", inplace=True)
df
structure n composition MEGNetElementData minimum embedding 1 MEGNetElementData maximum embedding 1 MEGNetElementData range embedding 1 MEGNetElementData mean embedding 1 MEGNetElementData std_dev embedding 1 MEGNetElementData minimum embedding 2 MEGNetElementData maximum embedding 2 ... MEGNetElementData minimum embedding 15 MEGNetElementData maximum embedding 15 MEGNetElementData range embedding 15 MEGNetElementData mean embedding 15 MEGNetElementData std_dev embedding 15 MEGNetElementData minimum embedding 16 MEGNetElementData maximum embedding 16 MEGNetElementData range embedding 16 MEGNetElementData mean embedding 16 MEGNetElementData std_dev embedding 16
0 [[4.29304147 2.4785886 1.07248561] S, [4.2930... 1.752064 (S, K) -0.207227 0.101355 0.308582 -0.052936 0.218201 -0.418307 0.304931 ... -0.014150 0.246111 0.260262 0.115980 0.184033 -0.040199 0.018383 0.058582 -0.010908 0.041424
1 [[3.95051434 4.51121437 0.28035002] K, [4.3099... 1.652859 (K, V, O) -0.207227 0.010069 0.217296 -0.133437 0.089964 -0.188673 0.304931 ... 0.113411 0.246111 0.132700 0.202901 0.054314 0.018383 0.053152 0.034769 0.032988 0.015804
2 [[-1.78688104 4.79604117 1.53044621] Rb, [-1... 1.867858 (Rb, Zr, O) -0.222826 0.072824 0.295650 -0.119124 0.126238 -0.188673 0.179438 ... -0.148203 0.356478 0.504681 0.190559 0.215217 -0.047841 0.085135 0.132976 0.017693 0.062858
3 [[4.51438064 4.51438064 0. ] Mn, [0.133... 2.676887 (Mn, O, F) -0.308105 -0.026669 0.281436 -0.149582 0.144058 -0.575614 0.343975 ... -0.312556 0.501850 0.814407 0.127387 0.411133 -0.155995 0.371508 0.527503 0.084805 0.266730
4 [[-4.36731958 6.8886097 0.50929706] Li, [-2... 1.793232 (Li, Co, Si, O) -0.161449 0.426308 0.587757 -0.036320 0.230572 -0.275507 0.334700 ... -0.168183 0.299678 0.467860 0.143178 0.193200 -0.311854 0.306350 0.618204 0.045337 0.232319
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4759 [[ 2.79280881 0.12499663 -1.84045389] Ca, [-2... 2.136837 (Ca, Fe, W, O) -0.234947 0.153433 0.388380 -0.087296 0.161825 -0.188673 0.463157 ... -0.424430 0.299887 0.724317 0.122727 0.276572 -0.333184 0.348429 0.681613 0.039165 0.259230
4760 [[0. 5.50363806 3.84192106] O, [4.7662... 2.690619 (O, S, Mn, La) -0.113972 0.101355 0.215327 0.008412 0.120043 -0.418307 0.343975 ... -0.312556 0.192867 0.505423 0.027736 0.162918 -0.155995 0.462238 0.618232 0.124222 0.306600
4761 [[0. 0. 0.] Ba, [ 0.23821924 4.32393487 -0.35... 2.811494 (Ba, Ag, Ge, Se) -0.318859 0.220026 0.538885 0.090390 0.198999 -0.164837 0.271697 ... -0.502267 0.352671 0.854938 -0.142704 0.314194 -0.343382 0.275374 0.618756 -0.236249 0.240652
4762 [[0. 0.18884638 0. ] K, [0. ... 1.832887 (K, Zn, H, I, O) -0.207227 0.352363 0.559590 -0.032446 0.225594 -0.188673 0.635952 ... 0.021095 0.246111 0.225016 0.169482 0.089604 -0.270104 0.045132 0.315236 -0.050470 0.162726
4763 [[0. 0. 0.] Cs, [2.80639641 2.80639641 2.80639... 2.559279 (Cs, Ge, Br) -0.122546 0.163372 0.285918 -0.055542 0.148459 -0.047956 0.021357 ... -0.039392 0.583951 0.623343 0.249013 0.264631 -0.259408 0.052268 0.311676 -0.100368 0.131754

4764 rows × 83 columns

from matminer.featurizers.structure import SiteStatsFingerprint
from matminer.featurizers.site import VoronoiFingerprint

cesf = VoronoiFingerprint(cutoff=15.0)
ssf = SiteStatsFingerprint(cesf)

ssf.set_n_jobs(8)
ssf.fit(df["structure"])
ssf.featurize_dataframe(df, col_id="structure", inplace=True, ignore_errors=True)

# remove any rows that contain NaN values
df.dropna(inplace=True)
df
structure n composition MEGNetElementData minimum embedding 1 MEGNetElementData maximum embedding 1 MEGNetElementData range embedding 1 MEGNetElementData mean embedding 1 MEGNetElementData std_dev embedding 1 MEGNetElementData minimum embedding 2 MEGNetElementData maximum embedding 2 ... mean Voro_area_maximum std_dev Voro_area_maximum mean Voro_dist_mean std_dev Voro_dist_mean mean Voro_dist_std_dev std_dev Voro_dist_std_dev mean Voro_dist_minimum std_dev Voro_dist_minimum mean Voro_dist_maximum std_dev Voro_dist_maximum
0 [[4.29304147 2.4785886 1.07248561] S, [4.2930... 1.752064 (S, K) -0.207227 0.101355 0.308582 -0.052936 0.218201 -0.418307 0.304931 ... 8.204232 0.823014 3.767918 0.426651 0.709328 0.240967 2.674113 0.534927 4.890630 0.808575
1 [[3.95051434 4.51121437 0.28035002] K, [4.3099... 1.652859 (K, V, O) -0.207227 0.010069 0.217296 -0.133437 0.089964 -0.188673 0.304931 ... 6.824999 0.644861 3.258812 0.211691 0.599649 0.174834 2.092884 0.450891 4.157850 0.328473
2 [[-1.78688104 4.79604117 1.53044621] Rb, [-1... 1.867858 (Rb, Zr, O) -0.222826 0.072824 0.295650 -0.119124 0.126238 -0.188673 0.179438 ... 6.752772 0.892951 3.237223 0.301533 0.568850 0.134024 2.306172 0.401358 4.243866 0.241752
3 [[4.51438064 4.51438064 0. ] Mn, [0.133... 2.676887 (Mn, O, F) -0.308105 -0.026669 0.281436 -0.149582 0.144058 -0.575614 0.343975 ... 4.643363 0.543658 2.682981 0.200683 0.513412 0.099900 1.960063 0.028847 3.469882 0.083992
4 [[-4.36731958 6.8886097 0.50929706] Li, [-2... 1.793232 (Li, Co, Si, O) -0.161449 0.426308 0.587757 -0.036320 0.230572 -0.275507 0.334700 ... 5.444131 1.757433 2.966329 0.157502 0.514779 0.110638 1.914097 0.462314 3.549006 0.185586
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4759 [[ 2.79280881 0.12499663 -1.84045389] Ca, [-2... 2.136837 (Ca, Fe, W, O) -0.234947 0.153433 0.388380 -0.087296 0.161825 -0.188673 0.463157 ... 4.929955 0.291390 2.817745 0.235961 0.448275 0.073659 2.042058 0.146875 3.455685 0.174053
4760 [[0. 5.50363806 3.84192106] O, [4.7662... 2.690619 (O, S, Mn, La) -0.113972 0.101355 0.215327 0.008412 0.120043 -0.418307 0.343975 ... 6.809922 0.558490 3.379219 0.129438 0.470356 0.098123 2.556854 0.177656 4.183076 0.217302
4761 [[0. 0. 0.] Ba, [ 0.23821924 4.32393487 -0.35... 2.811494 (Ba, Ag, Ge, Se) -0.318859 0.220026 0.538885 0.090390 0.198999 -0.164837 0.271697 ... 9.552565 1.300761 3.786360 0.095118 0.639826 0.118692 2.586770 0.339563 4.664158 0.236099
4762 [[0. 0.18884638 0. ] K, [0. ... 1.832887 (K, Zn, H, I, O) -0.207227 0.352363 0.559590 -0.032446 0.225594 -0.188673 0.635952 ... 6.678443 0.932676 3.014896 0.271109 0.655179 0.183725 1.696626 0.474518 3.926988 0.410659
4763 [[0. 0. 0.] Cs, [2.80639641 2.80639641 2.80639... 2.559279 (Cs, Ge, Br) -0.122546 0.163372 0.285918 -0.055542 0.148459 -0.047956 0.021357 ... 7.414504 0.922714 3.636716 0.420112 0.244063 0.199277 3.038886 0.464979 3.736354 0.464979

4557 rows × 143 columns

Drop the composition and structure object columns and save the dataset.

df.drop(["composition", "structure"], axis=1, inplace=True)
df.to_csv("datasets/team-e.csv", index=False)