5. Large Language Models#

Taught by: Dat Doan, Alex Ganose

Getting started#

Welcome to the fifth practical session! This notebook can be run locally or on Google Colab; to open it in Colab, click the rocket icon at the top right and select Colab.

Installation#

If you’re running this locally or on Colab, you’ll need to install the following packages:

pip install transformers[torch] datasets accelerate sentencepiece rdkit

Outline#

This workshop will cover the following content:

  1. Brief recap on LLMs

  2. Introduction to tokenisation

  3. Generating text and the concept of temperature

  4. Fine-tuning on a custom dataset

  5. Pretrained chemistry language models

What are Large Language Models?#

Large Language Models (LLMs) are neural networks trained on vast amounts of text data to understand and generate human-like text. They have revolutionized natural language processing and found applications across many domains, including chemistry.

A brief history:

  • 2017: Introduction of the Transformer architecture in “Attention is All You Need” by Vaswani et al.

  • 2018: BERT (Bidirectional Encoder Representations from Transformers) by Google

  • 2018: GPT (Generative Pre-trained Transformer) by OpenAI

  • 2019: GPT-2 demonstrates impressive text generation capabilities

  • 2020: GPT-3 shows emergence of in-context learning with 175 billion parameters

  • 2022: ChatGPT brings LLMs to mainstream attention

  • 2023: Explosion of open-source models (LLaMA, Mistral, etc.)

Key concepts:

  • Pre-training: Models are trained on large text corpora to learn language patterns

  • Fine-tuning: Models are adapted to specific tasks with smaller, task-specific datasets

  • Transformers: The underlying architecture based on self-attention mechanisms

  • Autoregressive generation: Models predict the next token based on previous tokens

Applications in chemistry:

  • Molecule generation and design

  • Retrosynthesis prediction

  • Property prediction from molecular descriptions

  • Literature mining and knowledge extraction

  • Chemical reaction prediction

Tokenisation#

Before a language model can process text, it must be converted into numerical representations. This process is called tokenisation. Tokens are the basic units that a model works with - they could be words, subwords, or even individual characters.

Why not just use words?

  • Limited vocabulary: Using whole words would require an enormous vocabulary

  • Unknown words: New or rare words wouldn’t be in the vocabulary

  • Efficiency: Subword tokenisation provides a good balance

Common tokenisation methods:

  • Byte-Pair Encoding (BPE): Iteratively merges the most frequent character pairs (see the toy sketch after this list)

  • WordPiece: Similar to BPE but used by BERT

  • SentencePiece: Language-independent tokenisation
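
To make the BPE merge step concrete, here is a minimal toy sketch (an illustrative corpus and merge loop written for this notebook, not the actual GPT-2 merge rules): repeatedly count adjacent symbol pairs across the corpus and merge the most frequent one into a new subword unit.

from collections import Counter

# Toy corpus, with each word represented as a list of symbols (initially characters)
corpus = ["low", "lower", "lowest", "newest", "widest"]
words = [list(w) for w in corpus]

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Apply a few merges and watch subword units emerge
for step in range(5):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(f"Merge {step + 1}: {pair} -> {words}")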

Let’s explore tokenisation using the Hugging Face transformers library. We’ll use a pre-trained tokenizer for distilgpt2. This is a smaller version of GPT-2, designed to be more efficient while retaining much of the original model’s capabilities. This tokenizer uses Byte-Pair Encoding to split text into subword tokens. It has a vocabulary size of 50,257 tokens.

# Run this cell if using Google Colab or locally with a fresh environment

! pip install transformers[torch] datasets accelerate sentencepiece rdkit
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# Print a few tokens and their indexes from the tokenizer's vocabulary
for i, (token, index) in enumerate(tokenizer.vocab.items()):
    print(f"Token: {token}, Index: {index}")
    if i >= 10:
        break

We can use the tokenizer to convert text into tokens:

text = "The benzene molecule has a hexagonal structure with alternating double bonds."

tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
print("Number of tokens:", len(tokens))

Notice how some words are split into multiple tokens (like “benzene” → “benz”, “ene”). This is subword tokenisation in action.

We can also convert tokens to their numerical IDs:

# Convert tokens to IDs
token_ids = tokenizer.encode(text)
print("Token IDs:", token_ids)

# Convert IDs back to text
decoded_text = tokenizer.decode(token_ids)
print("Decoded text:", decoded_text)

In class challenge 1#

Try tokenising different types of text and observe how the tokenizer handles them:

  1. A simple sentence

  2. A SMILES string (e.g., “CCO” for ethanol)

  3. Chemical nomenclature

  4. Text with special characters

Do the tokens make sense in a chemical context? Can you think of ways that the tokenisers could be improved for chemical problems?

examples = [
    "Put your examples here...",
]

for text in examples:
    # Tokenize each example and print the results
    pass
Answer
for text in examples:
    tokens = tokenizer.tokenize(text)
    print(f"Text: {text}")
    print(f"Tokens: {tokens}")
    print(f"Number of tokens: {len(tokens)}")
    print()

Notice how SMILES strings and chemical nomenclature are often split into many tokens because they weren’t common in the training data.

Generating Text with LLMs#

Now let’s load a small language model and generate some text. We’ll use the distilgpt2 model again from Hugging Face’s transformers library to keep things efficient.

First, let’s initialise the model and check the number of parameters.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

print(f"Number of parameters: {sum(p.numel() for p in model.parameters()):,}")

Despite being a small model, it still has 82 million parameters. However, it is much more manageable than larger models like GPT-3 or GPT-4, which have billions of parameters.

To generate text, we need to go through the following process:

  1. Tokenize the input prompt to convert it into token IDs.

  2. Feed the token IDs into the model to get output logits.

  3. Sample from the output logits to generate new token IDs.

  4. Decode the generated token IDs back into text.
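
To make steps 2 and 3 concrete, here is a minimal hand-written sketch of greedy decoding using the model and tokenizer loaded above; model.generate (used next) wraps this same loop with many more options, such as sampling and beam search.

import torch

# Step 1: tokenize a prompt into token IDs
input_ids = tokenizer.encode("Chemistry is the study of", return_tensors="pt")

# Steps 2-3: repeatedly run the model and pick the most likely next token (greedy decoding)
with torch.no_grad():
    for _ in range(10):
        logits = model(input_ids).logits             # shape: (batch, sequence, vocab)
        next_id = logits[0, -1].argmax()             # highest-probability next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

# Step 4: decode the token IDs back into text
print(tokenizer.decode(input_ids[0]))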

Let’s now do steps 1-3:

prompt = "Chemistry is the study of"

# Encode the prompt
inputs = tokenizer.encode(prompt, return_tensors="pt")

# Generate the next tokens (steps 2 and 3)
outputs = model.generate(
    inputs,
    max_length=50,     # total output length, including the prompt tokens
    temperature=1.0,   # 1.0 leaves the model's probability distribution unchanged
)

outputs

Currently, the generated output is in token IDs. Let’s decode it back to text:

generated_text = tokenizer.decode(outputs[0]).strip()

print(generated_text)

This manual approach of tokenisation, generation, and decoding is quite tiresome. We can simplify this using the pipeline API from the transformers library.

from transformers import pipeline

text_generator = pipeline("text-generation", model="distilgpt2", framework="pt")
generated = text_generator(prompt, max_new_tokens=50, temperature=1.0)

print(generated[0]['generated_text'].strip())

Understanding Temperature#

Temperature is a crucial parameter in text generation. It controls the randomness of predictions:

  • Low temperature (e.g., 0.1-0.5): More deterministic, picks high-probability tokens

  • Temperature = 1.0: Uses the model’s original probability distribution

  • High temperature (e.g., 1.5-2.0): More random, explores unlikely options

Mathematically, temperature modifies the softmax function used to convert logits to probabilities:

\[ P(x_i) = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}} \]

where \(z_i\) are the logits and \(T\) is the temperature.
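
As a quick numerical illustration (a minimal sketch with made-up logits for three candidate tokens), dividing the logits by \(T\) before the softmax sharpens the distribution when \(T < 1\) and flattens it when \(T > 1\):

import torch

logits = torch.tensor([2.0, 1.0, 0.1])  # made-up logits for three candidate tokens

for T in [0.3, 1.0, 1.5]:
    probs = torch.softmax(logits / T, dim=-1)
    print(f"T = {T}: {probs.numpy().round(3)}")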

Let’s see how temperature affects generation:

temperatures = [0.3, 1.0, 1.5]

for temp in temperatures:
    print(f"\nTemperature: {temp}\n----------------")

    for i in range(3):
        text = text_generator(prompt, max_new_tokens=50, temperature=temp)[0]['generated_text'].strip()
        print(f"Sample {i+1}: {text}\n")

In class challenge 2#

Clearly the distilgpt2 model is quite small and limited in its capabilities. Try swapping it out for other models available on the Hugging Face Model Hub, such as “gpt2”, “gpt2-medium”, or “EleutherAI/gpt-neo-125M”. How do the results differ?

You can find a list of available models here. Note that larger models will require more computational resources.

# 3, 2, 1, code!
Answer
model_names = ["distilgpt2", "gpt2", "gpt2-medium", "EleutherAI/gpt-neo-125M"]

for model_name in model_names:
    text_generator = pipeline("text-generation", model=model_name, framework="pt")
    generated = text_generator(prompt, max_new_tokens=50, temperature=1.0)

    print(f"\nModel: {model_name}\n----------------")
    print(generated[0]['generated_text'].strip())

Fine-tuning on a Custom Dataset#

While pre-trained models have general knowledge, they often need to be adapted to specific domains. Fine-tuning allows us to specialize a model for chemistry-related tasks.

We’ll create a simple dataset of chemistry facts and fine-tune our model on it.

Creating a Dataset#

from datasets import Dataset

chemistry_texts = [
    "Water has the chemical formula H2O and consists of two hydrogen atoms bonded to one oxygen atom.",
    "The periodic table organizes elements by atomic number and chemical properties.",
    "Sodium chloride, or table salt, has the formula NaCl and forms an ionic crystal structure.",
    "Benzene is an aromatic hydrocarbon with the formula C6H6 and a hexagonal ring structure.",
    "The Haber process synthesizes ammonia from nitrogen and hydrogen using an iron catalyst.",
    "DNA consists of four nucleotide bases: adenine, thymine, guanine, and cytosine.",
    "Carbon dioxide has the formula CO2 and is produced during combustion and respiration.",
    "The pH scale measures the acidity or basicity of a solution from 0 to 14.",
    "Ethanol, with formula C2H5OH, is a common alcohol used in beverages and as a fuel.",
    "Photosynthesis converts carbon dioxide and water into glucose and oxygen using sunlight.",
]

dataset = Dataset.from_dict({"text": chemistry_texts})
print(f"Dataset size: {len(dataset)} examples")

Preparing the Data for Training#

We need to tokenize our dataset:

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    outputs = tokenizer(
        examples["text"],
        truncation=True,
        max_length=128,
        padding="max_length"
    )
    # For causal language modeling, labels are the same as input_ids
    outputs["labels"] = outputs["input_ids"].copy()
    return outputs

# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
print("Dataset tokenized successfully")

Training the Model#

Now we’ll fine-tune the model. We’ll use the Hugging Face Trainer class which handles the training loop for us:

from transformers import Trainer, TrainingArguments

# Set padding token (GPT-2 doesn't have one by default)
model.config.pad_token_id = model.config.eos_token_id

# Define training arguments
training_args = TrainingArguments(
    output_dir="./chemistry_model",
    num_train_epochs=5,
    logging_steps=1,
    learning_rate=5e-5,
    weight_decay=0.01,
    report_to="none"  # disable wandb logging
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

# Train the model
trainer.train()

Testing the Fine-tuned Model#

Let’s see how the fine-tuned model performs:

prompts = [
    "Water has the chemical formula",
    "Benzene is an aromatic",
    "The pH scale measures",
]

print("Fine-tuned model outputs:\n")
for prompt in prompts:
    text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer, framework="pt")
    generated = text_generator(prompt, max_new_tokens=50, temperature=1.0)
    output = generated[0]['generated_text'].strip()
    print(f"Prompt: {prompt}\n")
    print(f"Output: {output}\n")

Clearly fine-tuning has not helped much! More training data and larger models are needed for better results.

Using pre-trained chemistry models#

So far, we’ve been using general-purpose language models. However, there are models pre-trained specifically on chemistry data, such as MolT5. MolT5 was trained on a large dataset of SMILES strings paired with their corresponding chemical captions. This dataset includes a wide variety of chemical compounds, allowing the model to learn the relationships between molecular structures and their textual descriptions.

MolT5 uses the T5 architecture, which is designed for text-to-text tasks. This means that both the input (SMILES strings) and output (chemical captions) are treated as text sequences. After pre-training on the large dataset, MolT5 can be fine-tuned on specific tasks, such as generating captions for new molecules or predicting molecular properties.

Below is an example of using a pre-trained MolT5 model to generate a chemical caption from a SMILES string. MolT5 can be loaded and run with the same transformers API as before.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("laituan245/molt5-small-smiles2caption", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained('laituan245/molt5-small-smiles2caption')

MolT5 includes its own tokenizer customised for chemistry. Let’s tokenize a SMILES string and see how this compares to the GPT-2 tokenizer we used before.

First, let’s define a molecule via SMILES.

from rdkit.Chem import MolFromSmiles

smiles = 'C1=CC2=C(C(=C1)[O-])NC(=CC2=O)C(=O)O'
mol = MolFromSmiles(smiles)
mol

Next, let’s tokenize the SMILES.

tokens = tokenizer.tokenize(smiles)
print("MolT5 Tokens:", tokens)
print("Number of tokens:", len(tokens))

gpt2_tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
gpt2_tokens = gpt2_tokenizer.tokenize(smiles)
print("\nGPT-2 Tokens:", gpt2_tokens)
print("Number of tokens:", len(gpt2_tokens))

The tokens look quite similar. There are a few differences related to how each tokenizer handles the beginning of a sequence and how splitting is done at brackets.

Finally, we can use MolT5 to generate a caption from the SMILES string.

input_ids = tokenizer(smiles, return_tensors="pt").input_ids
outputs = model.generate(input_ids, num_beams=5, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

MolT5 also includes a model for generating SMILES from a caption. We can download and run this model using the transformers library. In this case, we will use the small version of the model due to computational limitations, but base and large versions are also available.

tokenizer = T5Tokenizer.from_pretrained("laituan245/molt5-small-caption2smiles", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained('laituan245/molt5-small-caption2smiles')

Once the model is loaded, we can create a caption and generate a SMILES string from it.

input_text = 'The molecule is a monomethoxybenzene that is 2-methoxyphenol substituted by a hydroxymethyl group at position 4. It has a role as a plant metabolite. It is a member of guaiacols and a member of benzyl alcohols.'

input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, num_beams=5, max_length=512)
generated_smiles = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_smiles)

Let’s see what this molecule looks like using RDKit. Does it match the description?

mol = MolFromSmiles(generated_smiles)
mol

In class challenge 3#

Now it’s your turn! Try providing your own chemical description to the model and see what SMILES it generates. Does the generated molecule match your description? For example, if you describe a well known molecule for a certain function, does the model generate the correct SMILES?

# 3, 2, 1, code!

Limitations and considerations#

Important notes about LLMs for Chemistry:

  1. Small datasets: Many of our toy datasets are far too small for real predictions

  2. Model size: DistilGPT-2 is very small; larger models would perform better

  3. Chemical validity: The model doesn’t know chemistry rules and may generate invalid SMILES. This must be checked as a post-processing step (see the RDKit sketch after this list).

  4. Real applications: Production systems use:

    • Much larger datasets (e.g., USPTO reaction database with millions of reactions)

    • Specialized architectures (e.g., Molecular Transformer)

    • Post-processing to ensure chemical validity

    • Beam search for multiple predictions
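
As noted above, generated SMILES should be validated before use. A minimal sketch of such a check with RDKit (using a few illustrative strings; in practice you would screen the model’s outputs):

from rdkit import Chem

# A valid SMILES, an aromatic ring, and an unclosed ring (invalid)
candidates = ["CCO", "c1ccccc1", "C1CC"]

for smi in candidates:
    mol = Chem.MolFromSmiles(smi)  # returns None when the string cannot be parsed
    print(f"{smi}: {'valid' if mol is not None else 'invalid'}")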

Real-world models for chemistry:

  • Molecular Transformer: Specialized for reaction prediction

  • ChemGPT: Language model trained on chemical literature

  • Graph neural networks: Often more effective for molecular property prediction

Next steps#

Well done for completing the final workshop of the Data Analytics for Chemistry course. You have learned how a variety of supervised and unsupervised machine learning approaches can be used for chemistry problems.

The next stage is applying these techniques to a real-world chemistry dataset. You have each been assigned a dataset from the MatBench leaderboard. For the remaining time of the workshop, you should explore the features of the dataset.

More information is provided here.