Final Project#

For your final project, you will be given a benchmark dataset from the Matbench paper and website. Matbench is an automated leaderboard for benchmarking state-of-the-art ML algorithms that predict a diverse range of solid materials’ properties. It is hosted and maintained by the Materials Project.

Each team has been assigned a different dataset, with a different set of features.

| Team | Property (entries) | Input | Type | Features |
| ---- | ------------------ | ----- | ---- | -------- |
| A | Band gap (4,604) | Formula | Regression | Magpie element properties |
| B | Glass formation (5,680) | Formula | Classification | Matscholar element properties |
| C | Bulk modulus (10,987) | Structure | Regression | Magpie + site stats fingerprint |

Task#

Your task is to perform data analysis on this dataset. You should submit your finished notebook and present your findings as a group (20 minutes + 10 minutes of questions).

The starting point is always to check the literature. Read the Matbench paper and look at the models that have been tested. Try to understand the features that have been used. Next, you can begin the data analysis. Here are some ideas to get you started:

Data

  • What does your data contain? Is it computed or from experiments? (A minimal exploratory sketch follows this list.)

  • How reliable is the data? What sort of errors do you expect?

  • If you have been assigned a composition-only dataset, what impact do you expect this to have on model performance?

  • If you have been assigned a dataset with structures, does this make the problem more difficult or easier?
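As a starting point for these questions, a minimal exploratory sketch might look as follows. It assumes the Team A file and column names used later in this notebook; adapt the path and target column to your own dataset.

import pandas as pd
import matplotlib.pyplot as plt

# Load the featurised dataset (Team A path; use your team's file)
df = pd.read_csv("datasets/team-a.csv")

# Basic checks: size, missing values, and summary statistics
print(df.shape)
print(df.isna().sum().sum(), "missing values in total")
print(df.describe())

# Distribution of the target property
df["gap expt"].hist(bins=50)
plt.xlabel("Experimental band gap (eV)")
plt.ylabel("Count")
plt.show()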

Features

  • What are the features in your dataset?

  • Compare and contrast methods of featurisation (e.g., fingerprints have no real-life equivalents, whereas descriptors correspond to physical properties).

  • Can you use other featurisation methods or representations of crystals (e.g., graphs if a structure is available)? Note, this is more advanced and an entirely optional step. See the code in the rest of this notebook to see how the features were generated and to give you ideas of how to modify them.

Supervised learning

  • Extract the most important predictors from a random forest, XGBoost, or another model. Do they say anything about the chemistry? (A minimal sketch follows this list.)

  • Classify crystals according to their properties. Would you be able to predict the performance of new crystals?

  • Attempt to predict the target property. How well do your models compare against the others on the Matbench leaderboard? How do they differ?
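For example, a minimal sketch of training a random forest and inspecting its feature importances with scikit-learn (again assuming the Team A file and column names; adapt these to your own dataset):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("datasets/team-a.csv")
X = df.drop(columns=["formula", "gap expt"])
y = df["gap expt"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("R² on held-out data:", model.score(X_test, y_test))

# The ten features that contribute most to the model's predictions
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))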

Unsupervised learning

  • Attempt to cluster the materials based on their features. Does the clustering tell you anything about the materials in the dataset or their performance?

  • Some problems have many features. Does dimensionality reduction (e.g., PCA or recursive feature elimination) help? This can indicate whether the choice of featurisation matters. (A minimal sketch follows this list.)
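A minimal sketch of PCA followed by clustering (assumptions as in the sketches above):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

df = pd.read_csv("datasets/team-a.csv")
X = df.drop(columns=["formula", "gap expt"])

# Standardise so large-magnitude features do not dominate the components
X_scaled = StandardScaler().fit_transform(X)

# Project onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained variance ratios:", pca.explained_variance_ratio_)

# Cluster in the reduced space and colour the projection by cluster label
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_pca)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, s=5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()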

Explainability

  • Is your model interpretable? What does it say?

  • You can perform SHAP (SHapley Additive exPlanations) analysis for greater insight. (A minimal sketch follows this list.)

  • Otherwise, can you use feature importance or counterfactual explanations to provide further insights?
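A minimal SHAP sketch for a tree-based model, assuming the shap package is installed and reusing model and X_test from the random forest sketch above:

import shap

# TreeExplainer is efficient for tree ensembles such as random forests
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot: which features push predictions up or down, and by how much
shap.summary_plot(shap_values, X_test)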

Materials screening (advanced)

  • Note, this is more advanced and an entirely optional step.

  • The Materials Project contains a database of 200,000 known and predicted inorganic materials.

  • You can download the structures and compositions of all of the materials using the Materials Project API. A good tutorial is available on YouTube. (A minimal query sketch follows this list.)

  • What happens if you apply your models to the full dataset? Can you find any materials with interesting properties?
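A minimal query sketch using the mp-api client. This assumes you have registered for an API key; the example filter and the exact search fields available may change between database releases.

from mp_api.client import MPRester

# Replace with your own API key from materialsproject.org
with MPRester("YOUR_API_KEY") as mpr:
    docs = mpr.materials.summary.search(
        elements=["Si", "O"],  # example filter: materials containing Si and O
        fields=["material_id", "formula_pretty", "structure"],
    )

print(len(docs), "matching materials")
print(docs[0].formula_pretty)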

Downloads#

The featurised datasets are available to download as CSV files from the links below. Note that the files are gzipped; they should be uncompressed before use, although pandas can also read them directly, as sketched below.
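A minimal loading sketch (the filename is a placeholder for your team’s download):

import pandas as pd

# pandas infers gzip compression from the .gz extension,
# so manual decompression is optional
df = pd.read_csv("team-a.csv.gz")
df.head()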

Submission#

You should submit your Jupyter notebook and associated files as a zip before the deadline (9 am on January 9th). Make sure you adhere to the following guidance:

  • Structure your notebook clearly, with designated sections.

  • Include a brief introduction to the task, explaining what you did and why you did it.

  • Provide appropriate and clearly labelled figures and tables documenting your findings.

  • Summarise your findings and put them into context.

  • Comment your code so we can follow along.

  • Include an author contributions statement following the CRediT system.

  • Include a statement on the use of generative AI following Imperial’s guidance.

  • Include a requirements.txt that lists the packages (and ideally the versions) required to run your notebook. You can see the example for the notebooks in this course here. (A hypothetical example is sketched after this list.)

  • Include the PowerPoint file for your presentation slides in the upload.
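As a hypothetical example (your package list and version pins will differ), a requirements.txt might look like:

matminer==0.9.0
pymatgen==2023.10.11
scikit-learn==1.3.2
pandas==2.1.3
matplotlib==3.8.2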

Featurisation#

In the rest of this notebook, we’ll prepare the datasets used for the final project. Note: This notebook requires the matminer Python package to be installed.

Team A#

More information on the dataset is available on the Matbench description page.

Features

Here, we use the “Magpie” features defined in the paper “A general-purpose machine learning framework for predicting properties of inorganic materials” (2016). They consist of statistical measures of elemental properties, weighted according to the chemical formula: the minimum value across all elements, the maximum, range, mean, average deviation, and mode (132 features in total).

Dataset Generation

First load the dataset:

from matminer.datasets import load_dataset, get_all_dataset_info
import warnings

warnings.filterwarnings("ignore")  # ignore warnings during featurisation

print(get_all_dataset_info("matbench_expt_gap"))

df = load_dataset("matbench_expt_gap")
df
Dataset: matbench_expt_gap
Description: Matbench v0.1 test dataset for predicting experimental band gap from composition alone. Retrieved from Zhuo et al. supplementary information. Deduplicated according to composition, removing compositions with reported band gaps spanning more than a 0.1eV range; remaining compositions were assigned values based on the closest experimental value to the mean experimental value for that composition among all reports. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
	composition: Chemical formula.
	gap expt: Target variable. Experimentally measured gap, in eV.
Num Entries: 4604
Reference: Y. Zhuo, A. Masouri Tehrani, J. Brgoch (2018) Predicting the Band Gaps of Inorganic Solids by Machine Learning J. Phys. Chem. Lett. 2018, 9, 7, 1668-1673 https:doi.org/10.1021/acs.jpclett.8b00124.
Bibtex citations: ["@Article{Dunn2020,\nauthor={Dunn, Alexander\nand Wang, Qi\nand Ganose, Alex\nand Dopp, Daniel\nand Jain, Anubhav},\ntitle={Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm},\njournal={npj Computational Materials},\nyear={2020},\nmonth={Sep},\nday={15},\nvolume={6},\nnumber={1},\npages={138},\nabstract={We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13{\\thinspace}ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm---namely, that crystal graph methods appear to outperform traditional machine learning methods given {\\textasciitilde}104 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.},\nissn={2057-3960},\ndoi={10.1038/s41524-020-00406-3},\nurl={https://doi.org/10.1038/s41524-020-00406-3}\n}\n", '@article{doi:10.1021/acs.jpclett.8b00124,\nauthor = {Zhuo, Ya and Mansouri Tehrani, Aria and Brgoch, Jakoah},\ntitle = {Predicting the Band Gaps of Inorganic Solids by Machine Learning},\njournal = {The Journal of Physical Chemistry Letters},\nvolume = {9},\nnumber = {7},\npages = {1668-1673},\nyear = {2018},\ndoi = {10.1021/acs.jpclett.8b00124},\nnote ={PMID: 29532658},\neprint = {\nhttps://doi.org/10.1021/acs.jpclett.8b00124\n\n}}']
File type: json.gz
Figshare URL: https://ml.materialsproject.org/projects/matbench_expt_gap.json.gz
SHA256 Hash Digest: 783e7d1461eb83b00b2f2942da4b95fda5e58a0d1ae26b581c24cf8a82ca75b2

 
composition gap expt
0 Ag(AuS)2 0.00
1 Ag(W3Br7)2 0.00
2 Ag0.5Ge1Pb1.75S4 1.83
3 Ag0.5Ge1Pb1.75Se4 1.51
4 Ag2BBr 0.00
... ... ...
4599 ZrTaN3 1.72
4600 ZrTe 0.00
4601 ZrTi2O 0.00
4602 ZrTiF6 0.00
4603 ZrW2 0.00

4604 rows × 2 columns

Convert the composition string into a pymatgen Composition object.

from pymatgen.core import Composition

df.rename({"composition": "formula"}, axis=1, inplace=True)
df["composition"] = [Composition(comp) for comp in df["formula"]]

Generate the Magpie features for the dataset:

from matminer.featurizers.composition import ElementProperty

# Use the Magpie preset (22 elemental properties × 6 statistics = 132 features)
ep = ElementProperty.from_preset("magpie")
ep.set_n_jobs(1)  # use a single process
ep.featurize_dataframe(df, col_id="composition", inplace=True)
df
formula gap expt composition MagpieData minimum Number MagpieData maximum Number MagpieData range Number MagpieData mean Number MagpieData avg_dev Number MagpieData mode Number MagpieData minimum MendeleevNumber ... MagpieData range GSmagmom MagpieData mean GSmagmom MagpieData avg_dev GSmagmom MagpieData mode GSmagmom MagpieData minimum SpaceGroupNumber MagpieData maximum SpaceGroupNumber MagpieData range SpaceGroupNumber MagpieData mean SpaceGroupNumber MagpieData avg_dev SpaceGroupNumber MagpieData mode SpaceGroupNumber
0 Ag(AuS)2 0.00 (Ag, Au, S) 16.0 79.0 63.0 47.400000 25.280000 16.0 65.0 ... 0.000000 0.000000 0.000000 0.000000 70.0 225.0 155.0 163.000000 74.400000 70.0
1 Ag(W3Br7)2 0.00 (Ag, W, Br) 35.0 74.0 39.0 46.714286 15.619048 35.0 51.0 ... 0.000000 0.000000 0.000000 0.000000 64.0 229.0 165.0 118.809524 73.079365 64.0
2 Ag0.5Ge1Pb1.75S4 1.83 (Ag, Ge, Pb, S) 16.0 82.0 66.0 36.275862 23.552913 16.0 65.0 ... 0.000000 0.000000 0.000000 0.000000 70.0 225.0 155.0 139.482759 76.670630 70.0
3 Ag0.5Ge1Pb1.75Se4 1.51 (Ag, Ge, Pb, Se) 32.0 82.0 50.0 46.206897 17.388823 34.0 65.0 ... 0.000000 0.000000 0.000000 0.000000 14.0 225.0 211.0 108.586207 104.370987 14.0
4 Ag2BBr 0.00 (Ag, B, Br) 5.0 47.0 42.0 33.500000 14.250000 47.0 65.0 ... 0.000000 0.000000 0.000000 0.000000 64.0 225.0 161.0 170.000000 55.000000 225.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4599 ZrTaN3 1.72 (Zr, Ta, N) 7.0 73.0 66.0 26.800000 23.760000 7.0 44.0 ... 0.000000 0.000000 0.000000 0.000000 194.0 229.0 35.0 201.000000 11.200000 194.0
4600 ZrTe 0.00 (Zr, Te) 40.0 52.0 12.0 46.000000 6.000000 40.0 44.0 ... 0.000000 0.000000 0.000000 0.000000 152.0 194.0 42.0 173.000000 21.000000 152.0
4601 ZrTi2O 0.00 (Zr, Ti, O) 8.0 40.0 32.0 23.000000 8.500000 22.0 43.0 ... 0.000023 0.000011 0.000011 0.000023 12.0 194.0 182.0 148.500000 68.250000 194.0
4602 ZrTiF6 0.00 (Zr, Ti, F) 9.0 40.0 31.0 14.500000 8.250000 9.0 43.0 ... 0.000023 0.000003 0.000005 0.000000 15.0 194.0 179.0 59.750000 67.125000 15.0
4603 ZrW2 0.00 (Zr, W) 40.0 74.0 34.0 62.666667 15.111111 74.0 44.0 ... 0.000000 0.000000 0.000000 0.000000 194.0 229.0 35.0 217.333333 15.555556 229.0

4604 rows × 135 columns

Drop the composition object column and save the dataset.

df.drop("composition", axis=1, inplace=True)
df.to_csv("datasets/team-a.csv", index=False)

Team B#

More information on the dataset is available on the Matbench description page.

Features

Here, we use elemental embeddings taken from the paper Unsupervised word embeddings capture latent knowledge from materials science literature (2019). The embeddings are 200-dimensional vectors learned through natural language processing. The features consist of statistical measures of these embeddings, weighted according to the chemical formula: the minimum value across all elements, the maximum, range, mean, and standard deviation (5 × 200 = 1,000 features in total).

Dataset Generation

First load the dataset:

from matminer.datasets import load_dataset, get_all_dataset_info
import warnings

warnings.filterwarnings("ignore")  # ignore warnings during featurisation

print(get_all_dataset_info("matbench_glass"))

df = load_dataset("matbench_glass")
df
Dataset: matbench_glass
Description: Matbench v0.1 test dataset for predicting full bulk metallic glass formation ability from chemical formula. Retrieved from "Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys,’ a volume of the Landolt– Börnstein collection. Deduplicated according to composition, ensuring no compositions were reported as both GFA and not GFA (i.e., all reports agreed on the classification designation). For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
	composition: Chemical formula.
	gfa: Target variable. Glass forming ability: 1 means glass forming and corresponds to amorphous, 0 means non full glass forming.
Num Entries: 5680
Reference: Y. Kawazoe, T. Masumoto, A.-P. Tsai, J.-Z. Yu, T. Aihara Jr. (1997) Y. Kawazoe, J.-Z. Yu, A.-P. Tsai, T. Masumoto (ed.) SpringerMaterials
Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys · 1 Introduction Landolt-Börnstein - Group III Condensed Matter 37A (Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys) https://www.springer.com/gp/book/9783540605072 (Springer-Verlag Berlin Heidelberg © 1997) Accessed: 03-09-2019
Bibtex citations: ["@Article{Dunn2020,\nauthor={Dunn, Alexander\nand Wang, Qi\nand Ganose, Alex\nand Dopp, Daniel\nand Jain, Anubhav},\ntitle={Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm},\njournal={npj Computational Materials},\nyear={2020},\nmonth={Sep},\nday={15},\nvolume={6},\nnumber={1},\npages={138},\nabstract={We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13{\\thinspace}ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm---namely, that crystal graph methods appear to outperform traditional machine learning methods given {\\textasciitilde}104 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.},\nissn={2057-3960},\ndoi={10.1038/s41524-020-00406-3},\nurl={https://doi.org/10.1038/s41524-020-00406-3}\n}\n", '@Misc{LandoltBornstein1997:sm_lbs_978-3-540-47679-5_2,\nauthor="Kawazoe, Y.\nand Masumoto, T.\nand Tsai, A.-P.\nand Yu, J.-Z.\nand Aihara Jr., T.",\neditor="Kawazoe, Y.\nand Yu, J.-Z.\nand Tsai, A.-P.\nand Masumoto, T.",\ntitle="Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys {\\textperiodcentered} 1 Introduction: Datasheet from Landolt-B{\\"o}rnstein - Group III Condensed Matter {\\textperiodcentered} Volume 37A: ``Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys\'\' in SpringerMaterials (https://dx.doi.org/10.1007/10510374{\\_}2)",\npublisher="Springer-Verlag Berlin Heidelberg",\nnote="Copyright 1997 Springer-Verlag Berlin Heidelberg",\nnote="Part of SpringerMaterials",\nnote="accessed 2018-10-23",\ndoi="10.1007/10510374_2",\nurl="https://materials.springer.com/lb/docs/sm_lbs_978-3-540-47679-5_2"\n}', '@Article{Ward2016,\nauthor={Ward, Logan\nand Agrawal, Ankit\nand Choudhary, Alok\nand Wolverton, Christopher},\ntitle={A general-purpose machine learning framework for predicting properties of inorganic materials},\njournal={Npj Computational Materials},\nyear={2016},\nmonth={Aug},\nday={26},\npublisher={The Author(s)},\nvolume={2},\npages={16028},\nnote={Article},\nurl={http://dx.doi.org/10.1038/npjcompumats.2016.28}\n}']
File type: json.gz
Figshare URL: https://ml.materialsproject.org/projects/matbench_glass.json.gz
SHA256 Hash Digest: 36beb654e2a463ee2a6572105bea0ca2961eee7c7b26a25377bff2c3b338e53a
composition gfa
0 Al False
1 Al(NiB)2 True
2 Al10Co21B19 True
3 Al10Co23B17 True
4 Al10Co27B13 True
... ... ...
5675 ZrTi9 False
5676 ZrTiSi2 True
5677 ZrTiSi3 True
5678 ZrVCo8 True
5679 ZrVNi2 True

5680 rows × 2 columns

Convert the composition string into a pymatgen Composition object.

from pymatgen.core import Composition

df.rename({"composition": "formula"}, axis=1, inplace=True)
df["composition"] = [Composition(comp) for comp in df["formula"]]

Generate the features for the dataset:

from matminer.featurizers.composition import ElementProperty

# Use the Matscholar embedding preset (200 dimensions × 5 statistics = 1,000 features)
ep = ElementProperty.from_preset("matscholar_el")
ep.set_n_jobs(1)  # use a single process
ep.featurize_dataframe(df, col_id="composition", inplace=True)
df
formula gfa composition MatscholarElementData minimum embedding 1 MatscholarElementData maximum embedding 1 MatscholarElementData range embedding 1 MatscholarElementData mean embedding 1 MatscholarElementData std_dev embedding 1 MatscholarElementData minimum embedding 2 MatscholarElementData maximum embedding 2 ... MatscholarElementData minimum embedding 199 MatscholarElementData maximum embedding 199 MatscholarElementData range embedding 199 MatscholarElementData mean embedding 199 MatscholarElementData std_dev embedding 199 MatscholarElementData minimum embedding 200 MatscholarElementData maximum embedding 200 MatscholarElementData range embedding 200 MatscholarElementData mean embedding 200 MatscholarElementData std_dev embedding 200
0 Al False (Al) -0.034189 -0.034189 0.000000 -0.034189 0.000000 -0.001735 -0.001735 ... -0.044308 -0.044308 0.000000 -0.044308 0.000000 -0.009788 -0.009788 0.000000 -0.009788 0.000000
1 Al(NiB)2 True (Al, Ni, B) -0.105366 0.020968 0.126334 -0.040597 0.070736 -0.015921 0.038819 ... -0.065492 0.037160 0.102652 -0.020195 0.059330 -0.009788 0.042107 0.051894 0.015274 0.027822
2 Al10Co21B19 True (Al, Co, B) -0.105366 0.000804 0.106171 -0.046539 0.059815 -0.015921 0.067943 ... -0.082171 0.037160 0.119331 -0.029253 0.067328 -0.029063 0.042107 0.071170 0.001836 0.040419
3 Al10Co23B17 True (Al, Co, B) -0.105366 0.000804 0.106171 -0.042292 0.059232 -0.015921 0.067943 ... -0.082171 0.037160 0.119331 -0.034026 0.066641 -0.029063 0.042107 0.071170 -0.001010 0.039941
4 Al10Co27B13 True (Al, Co, B) -0.105366 0.000804 0.106171 -0.033799 0.057383 -0.015921 0.067943 ... -0.082171 0.037160 0.119331 -0.043572 0.064497 -0.029063 0.042107 0.071170 -0.006704 0.038517
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5675 ZrTi9 False (Zr, Ti) -0.035601 -0.018014 0.017588 -0.033843 0.012436 -0.066841 -0.047185 ... -0.051519 0.005334 0.056852 -0.045833 0.040201 0.042314 0.063139 0.020825 0.044396 0.014726
5676 ZrTiSi2 True (Zr, Ti, Si) -0.035601 -0.018014 0.017588 -0.022898 0.009291 -0.066841 -0.035818 ... -0.051519 0.005334 0.056852 -0.029050 0.026519 -0.012335 0.063139 0.075474 0.020196 0.042189
5677 ZrTiSi3 True (Zr, Ti, Si) -0.035601 -0.018014 0.017588 -0.022116 0.009025 -0.066841 -0.035818 ... -0.051519 0.005334 0.056852 -0.030242 0.025259 -0.012335 0.063139 0.075474 0.013690 0.043492
5678 ZrVCo8 True (Zr, V, Co) -0.018014 0.057822 0.075835 0.004624 0.031897 -0.047185 0.067943 ... -0.082171 0.100906 0.183077 -0.055113 0.099783 -0.029063 0.102289 0.131353 -0.006708 0.078135
5679 ZrVNi2 True (Zr, V, Ni) -0.018014 0.057822 0.075835 0.020436 0.033921 -0.047185 0.038819 ... -0.065492 0.100906 0.166398 -0.006186 0.086338 0.000973 0.102289 0.101316 0.041844 0.054582

5680 rows × 1003 columns

Drop the composition object column and save the dataset.

df.drop("composition", axis=1, inplace=True)
df.to_csv("datasets/team-b.csv", index=False)

Team C#

More information on the dataset is available on the Matbench description page.

Features

Here, we use the “Magpie” features defined in the paper “A general-purpose machine learning framework for predicting properties of inorganic materials” (2016). They consist of statistical measures of elemental properties, weighted according to the chemical formula: the minimum value across all elements, the maximum, range, mean, average deviation, and mode (132 features in total).

Since this dataset also has the crystal structure available, we additionally include the site stats fingerprint. This method calculates order parameters for each site (e.g., how tetrahedral, octahedral, or square planar the site is, and its coordination number) and computes statistical measures (mean and standard deviation) of these over the full structure. Specifically, we use the CrystalNNFingerprint, where the order parameters are introduced in the paper “Local structure order parameters and site fingerprints for quantification of coordination environment and crystal structure similarity” (2020).

Note, if you are so inclined, you can use the structure with message-passing graph neural networks such as MEGNet or M3GNet. Nice implementations are provided in the MatGL library, although these will be significantly more expensive to train than classical models.
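As a rough sketch of how a pretrained graph network could be applied with MatGL (the model name is an assumption; check the MatGL documentation for the currently available pretrained models, and note that df here refers to the dataframe of structures loaded below):

import matgl

# Load a pretrained MEGNet formation-energy model (name is an assumption)
model = matgl.load_model("MEGNet-MP-2018.6.1-Eform")

# Predict directly from a pymatgen Structure, e.g. one from the dataset below
eform = model.predict_structure(df["structure"][0])
print(f"Predicted formation energy: {float(eform):.3f} eV/atom")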

Dataset Generation

First load the dataset:

from matminer.datasets import load_dataset, get_all_dataset_info
import warnings

warnings.filterwarnings("ignore")  # ignore warnings during featurisation

print(get_all_dataset_info("matbench_log_kvrh"))

df = load_dataset("matbench_log_kvrh")
df
Dataset: matbench_log_kvrh
Description: Matbench v0.1 test dataset for predicting DFT log10 VRH-average bulk modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
	log10(K_VRH): Target variable. Base 10 logarithm of the DFT Voigt-Reuss-Hill average bulk moduli in GPa.
	structure: Pymatgen Structure of the material.
Num Entries: 10987
Reference: Jong, M. De, Chen, W., Angsten, T., Jain, A., Notestine, R., Gamst,
A., Sluiter, M., Ande, C. K., Zwaag, S. Van Der, Plata, J. J., Toher,
C., Curtarolo, S., Ceder, G., Persson, K. and Asta, M., "Charting
the complete elastic properties of inorganic crystalline compounds",
Scientific Data volume 2, Article number: 150009 (2015)
Bibtex citations: ["@Article{Dunn2020,\nauthor={Dunn, Alexander\nand Wang, Qi\nand Ganose, Alex\nand Dopp, Daniel\nand Jain, Anubhav},\ntitle={Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm},\njournal={npj Computational Materials},\nyear={2020},\nmonth={Sep},\nday={15},\nvolume={6},\nnumber={1},\npages={138},\nabstract={We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13{\\thinspace}ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm---namely, that crystal graph methods appear to outperform traditional machine learning methods given {\\textasciitilde}104 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.},\nissn={2057-3960},\ndoi={10.1038/s41524-020-00406-3},\nurl={https://doi.org/10.1038/s41524-020-00406-3}\n}\n", '@Article{deJong2015,\nauthor={de Jong, Maarten and Chen, Wei and Angsten, Thomas\nand Jain, Anubhav and Notestine, Randy and Gamst, Anthony\nand Sluiter, Marcel and Krishna Ande, Chaitanya\nand van der Zwaag, Sybrand and Plata, Jose J. and Toher, Cormac\nand Curtarolo, Stefano and Ceder, Gerbrand and Persson, Kristin A.\nand Asta, Mark},\ntitle={Charting the complete elastic properties\nof inorganic crystalline compounds},\njournal={Scientific Data},\nyear={2015},\nmonth={Mar},\nday={17},\npublisher={The Author(s)},\nvolume={2},\npages={150009},\nnote={Data Descriptor},\nurl={http://dx.doi.org/10.1038/sdata.2015.9}\n}']
File type: json.gz
Figshare URL: https://ml.materialsproject.org/projects/matbench_log_kvrh.json.gz
SHA256 Hash Digest: 44b113ddb7e23aa18731a62c74afa7e5aa654199e0db5f951c8248a00955c9cd

 
structure log10(K_VRH)
0 [[0. 0. 0.] Ca, [1.37728887 1.57871271 3.73949... 1.707570
1 [[3.14048493 1.09300401 1.64101398] Mg, [0.625... 1.633468
2 [[ 2.06884519 2.40627241 -0.45891585] Si, [1.... 1.908485
3 [[2.06428082 0. 2.06428082] Pd, [0. ... 2.117271
4 [[3.09635262 1.0689416 1.53602403] Mg, [0.593... 1.690196
... ... ...
10982 [[0. 0. 0.] Rh, [3.2029368 3.2029368 2.09459... 1.778151
10983 [[-1.51157821 4.4173925 1.21553922] Mg, [3.... 1.724276
10984 [[4.37546772 4.51128393 6.81784473] H, [0.4573... 1.342423
10985 [[0. 0. 0.] Si, [ 4.55195829 4.55195829 -4.55... 1.770852
10986 [[1.44565668 0. 2.05259079] Al, [1.445... 1.954243

10987 rows × 2 columns

Create a composition column containing pymatgen Composition objects derived from the Structure objects.

df["composition"] = [structure.composition for structure in df["structure"]]

Generate the compositional features for the dataset:

from matminer.featurizers.composition import ElementProperty

# Compositional features: the same Magpie preset used for Team A
ep = ElementProperty.from_preset("magpie")
ep.set_n_jobs(1)  # use a single process
ep.featurize_dataframe(df, col_id="composition", inplace=True)

from matminer.featurizers.structure import SiteStatsFingerprint

# Structural features: statistics of CrystalNN order parameters over all sites
cnn_fp = SiteStatsFingerprint.from_preset("CrystalNNFingerprint_ops")
cnn_fp.set_n_jobs(8)  # structure featurisation is slower, so use more processes
cnn_fp.featurize_dataframe(df, col_id="structure", inplace=True)
df
structure log10(K_VRH) composition MagpieData minimum Number MagpieData maximum Number MagpieData range Number MagpieData mean Number MagpieData avg_dev Number MagpieData mode Number MagpieData minimum MendeleevNumber ... mean wt CN_20 std_dev wt CN_20 mean wt CN_21 std_dev wt CN_21 mean wt CN_22 std_dev wt CN_22 mean wt CN_23 std_dev wt CN_23 mean wt CN_24 std_dev wt CN_24
0 [[0. 0. 0.] Ca, [1.37728887 1.57871271 3.73949... 1.707570 (Ca, Ge, Ag) 20.0 47.0 27.0 35.600000 9.120000 32.0 7.0 ... 0.001273 0.002546 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
1 [[3.14048493 1.09300401 1.64101398] Mg, [0.625... 1.633468 (Mg, Ge, Ba) 12.0 56.0 44.0 28.800000 13.440000 12.0 9.0 ... 0.006620 0.013240 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
2 [[ 2.06884519 2.40627241 -0.45891585] Si, [1.... 1.908485 (Si, Cu, Sr) 14.0 38.0 24.0 24.800000 8.640000 14.0 8.0 ... 0.000000 0.000000 0.0 0.0 0.007384 0.014768 0.0 0.0 0.0 0.0
3 [[2.06428082 0. 2.06428082] Pd, [0. ... 2.117271 (Pd, Dy) 46.0 66.0 20.0 51.000000 7.500000 46.0 31.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
4 [[3.09635262 1.0689416 1.53602403] Mg, [0.593... 1.690196 (Mg, Si, Ba) 12.0 56.0 44.0 21.600000 13.760000 12.0 9.0 ... 0.005602 0.011204 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10982 [[0. 0. 0.] Rh, [3.2029368 3.2029368 2.09459... 1.778151 (Rh, I) 45.0 53.0 8.0 50.333333 3.555556 53.0 59.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
10983 [[-1.51157821 4.4173925 1.21553922] Mg, [3.... 1.724276 (Mg, Co, Sn) 12.0 50.0 38.0 18.625000 9.937500 12.0 58.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
10984 [[4.37546772 4.51128393 6.81784473] H, [0.4573... 1.342423 (H, N, O) 1.0 8.0 7.0 4.666667 3.259259 1.0 82.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
10985 [[0. 0. 0.] Si, [ 4.55195829 4.55195829 -4.55... 1.770852 (Si, Sn) 14.0 50.0 36.0 32.000000 18.000000 14.0 78.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
10986 [[1.44565668 0. 2.05259079] Al, [1.445... 1.954243 (Al, Cu) 13.0 29.0 16.0 18.333333 7.111111 13.0 64.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0

10987 rows × 257 columns

Drop the composition and structure object columns and save the dataset.

df.drop(["composition", "structure"], axis=1, inplace=True)
df.to_csv("datasets/team-c.csv", index=False)