Final Project#

For your final project, you will be given a benchmark dataset from the MatBench paper and website. Matbench is an automated leaderboard for benchmarking state-of-the-art ML algorithms on predicting a diverse range of properties of solid materials. It is hosted and maintained by the Materials Project.

Each team has been assigned a different dataset, with a different set of features.

| Team | Property (entries) | Input | Type | Features |
| --- | --- | --- | --- | --- |
| A | Band gap (4,604) | Formula | Regression | Magpie element properties |
| B | Glass formation (5,680) | Formula | Classification | Matscholar element properties |
| C | Bulk modulus (10,987) | Structure | Regression | Magpie + site stats fingerprint |
| D | Exfoliation (635) | Structure | Regression | Meredig + packing efficiency + density features |
| E | Dielectric (4,557) | Structure | Regression | MEGNet + site stats fingerprint |

Task#

Your task is to perform data analysis on this dataset. You should submit your finished notebook and present your findings as a group (20 minutes + 10 minutes of questions).

The starting point is always to check the literature. Read the MatBench paper and the papers for the models that have been tested. Try to understand the features that have been used. Next, you can begin the data analysis. Here are some ideas to get you started:

Data

  • What does your data contain? Is it computed or experimental?

  • How reliable is the data? What sort of errors do you expect?

  • If you have been assigned a composition-only dataset, what impact do you expect this to have on model performance?

  • If you have been assigned a dataset with structures, does this make the problem easier or more difficult?

Features

  • What are the features in your dataset?

  • Compare and contrast methods of featurisation (e.g. fingerprints have no direct physical interpretation, in contrast to descriptors).

  • Can you use other featurisation methods or representations of crystals (e.g. graphs, if a structure is available)? Note, this is more advanced and an entirely optional step. See the code in the rest of this notebook for how the features were generated and for ideas on how to modify them.

Supervised learning

  • Get the most important predictors from a random forest, XGBoost, or another model; does that say anything about the chemistry? A minimal baseline is sketched after this list.

  • Classify crystals according to their properties. Would you be able to predict the performance of new crystals?

  • Attempt to predict the target property. How well does this compare against the other models on the MatBench leaderboard? How do your models differ?
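
As a concrete starting point, here is a minimal baseline sketch. It assumes the featurised Team A file generated at the end of this notebook (datasets/team-a.csv); substitute your own team's file and target column.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("datasets/team-a.csv")
X = df.drop(columns=["formula", "gap expt"])  # keep only the numeric features
y = df["gap expt"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(f"held-out R^2: {model.score(X_test, y_test):.3f}")

# rank the features by importance; do the top ones make chemical sense?
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))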

Unsupervised learning

  • Attempt to cluster the materials based on their features. Can you learn anything from the clustering about the materials in the dataset or their performance?

  • Some problems have many features. Does dimensionality reduction (e.g., PCA or recursive feature elimination) help? This can reveal whether the choice of featurisation matters; a starting sketch is given after this list.
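
A minimal sketch, reusing X from the supervised learning sketch above (standardising the features first matters for PCA):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

# project onto two principal components for visualisation
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print("explained variance ratio:", pca.explained_variance_ratio_)

# cluster in the full feature space, then colour the projection by cluster
labels = KMeans(n_clusters=4, random_state=42).fit_predict(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()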

Explainability

  • Is your model interpretable? What does it say?

  • You can perform SHAP (SHapley Additive exPlanations) analysis for greater insight; a minimal sketch follows this list.

  • Otherwise, can you use feature importance or counterfactual explanations to provide further insights?
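
A minimal SHAP sketch, assuming the shap package is installed and reusing the fitted random forest (model, X_test) from the supervised learning sketch above:

import shap

# decompose each prediction into additive per-feature contributions
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# global summary: which features drive predictions, and in which direction?
shap.summary_plot(shap_values, X_test)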

Materials screening (advanced)

  • Note, this is more advanced and an entirely optional step.

  • The Materials Project contains a database of 200,000 known and predicted inorganic materials.

  • You can download the structures and compositions of all of the materials using the Materials Project API. A good tutorial is available on YouTube, and a minimal query sketch follows this list.

  • What happens if you apply your models to the full dataset? Can you find any materials with interesting properties?
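
A minimal sketch of querying the database, assuming the mp-api package and a (free) personal API key; the exact accessor names can vary between mp-api versions:

from mp_api.client import MPRester

with MPRester("YOUR_API_KEY") as mpr:  # replace with your own API key
    docs = mpr.materials.summary.search(
        elements=["O"],  # example filter: oxygen-containing materials
        fields=["material_id", "formula_pretty", "structure"],
    )
print(f"retrieved {len(docs)} materials")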

Downloads#

The featurised datasets are available to download as csv files from the links below. Note the files are gzipped and should be uncompressed before use (or read directly with pandas, as sketched below).
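
For example, pandas can read the gzipped files directly (the filename here is illustrative; use the file for your team):

import pandas as pd

df = pd.read_csv("team-a.csv.gz")  # compression is inferred from the extension
print(df.shape)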

Submission#

You should submit your Jupyter notebook and associated files as a zip before the deadline (9 am on January 9th). Make sure you adhere to the following guidance:

  • Structure your notebook clearly, with designated sections.

  • Include a brief introduction to the task, explaining what you did and why you did it.

  • Provide appropriate and clearly labelled figures and tables documenting your findings.

  • Summarise your findings and put them into context.

  • Comment your code so we can follow along.

  • Include an author contributions statement following the CRediT system.

  • Include a statement on the use of generative AI following Imperial’s guidance.

  • Include a requirements.txt that lists the packages (and ideally the versions) required to run your notebook; a minimal example is given after this list. You can see the example for the notebooks in this course here.

  • Include the PowerPoint file of your presentation slides in the upload.
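
For reference, a minimal requirements.txt might look like the following (the version numbers are illustrative; pin the ones you actually used):

matminer==0.9.2
pymatgen==2024.3.1
pandas==2.2.2
scikit-learn==1.4.2
matplotlib==3.8.4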

Featurisation#

In the rest of this notebook, we’ll prepare the datasets used for the final project. Note: This notebook requires the matminer Python package to be installed.

Team A#

More information on the dataset is available on the MatBench description page.

Features

Here, we use the “Magpie” features defined in the paper “A general-purpose machine learning framework for predicting properties of inorganic materials” (2016). They consist of statistical measures of elemental properties, weighted according to the chemical formula: the minimum value across all elements, the maximum value, the range, the mean, the average absolute deviation, and the mode (132 features in total).

Dataset Generation

First load the dataset:

from matminer.datasets import load_dataset, get_all_dataset_info
import warnings

warnings.filterwarnings("ignore")  # ignore warnings during featurisation

print(get_all_dataset_info("matbench_expt_gap"))

df = load_dataset("matbench_expt_gap")
df
Dataset: matbench_expt_gap
Description: Matbench v0.1 test dataset for predicting experimental band gap from composition alone. Retrieved from Zhuo et al. supplementary information. Deduplicated according to composition, removing compositions with reported band gaps spanning more than a 0.1eV range; remaining compositions were assigned values based on the closest experimental value to the mean experimental value for that composition among all reports. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
	composition: Chemical formula.
	gap expt: Target variable. Experimentally measured gap, in eV.
Num Entries: 4604
Reference: Y. Zhuo, A. Masouri Tehrani, J. Brgoch (2018) Predicting the Band Gaps of Inorganic Solids by Machine Learning J. Phys. Chem. Lett. 2018, 9, 7, 1668-1673 https:doi.org/10.1021/acs.jpclett.8b00124.
Bibtex citations: ["@Article{Dunn2020,\nauthor={Dunn, Alexander\nand Wang, Qi\nand Ganose, Alex\nand Dopp, Daniel\nand Jain, Anubhav},\ntitle={Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm},\njournal={npj Computational Materials},\nyear={2020},\nmonth={Sep},\nday={15},\nvolume={6},\nnumber={1},\npages={138},\nabstract={We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13{\\thinspace}ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm---namely, that crystal graph methods appear to outperform traditional machine learning methods given {\\textasciitilde}104 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.},\nissn={2057-3960},\ndoi={10.1038/s41524-020-00406-3},\nurl={https://doi.org/10.1038/s41524-020-00406-3}\n}\n", '@article{doi:10.1021/acs.jpclett.8b00124,\nauthor = {Zhuo, Ya and Mansouri Tehrani, Aria and Brgoch, Jakoah},\ntitle = {Predicting the Band Gaps of Inorganic Solids by Machine Learning},\njournal = {The Journal of Physical Chemistry Letters},\nvolume = {9},\nnumber = {7},\npages = {1668-1673},\nyear = {2018},\ndoi = {10.1021/acs.jpclett.8b00124},\nnote ={PMID: 29532658},\neprint = {\nhttps://doi.org/10.1021/acs.jpclett.8b00124\n\n}}']
File type: json.gz
Figshare URL: https://ml.materialsproject.org/projects/matbench_expt_gap.json.gz
SHA256 Hash Digest: 783e7d1461eb83b00b2f2942da4b95fda5e58a0d1ae26b581c24cf8a82ca75b2

 
composition gap expt
0 Ag(AuS)2 0.00
1 Ag(W3Br7)2 0.00
2 Ag0.5Ge1Pb1.75S4 1.83
3 Ag0.5Ge1Pb1.75Se4 1.51
4 Ag2BBr 0.00
... ... ...
4599 ZrTaN3 1.72
4600 ZrTe 0.00
4601 ZrTi2O 0.00
4602 ZrTiF6 0.00
4603 ZrW2 0.00

4604 rows × 2 columns

Convert the composition string into a pymatgen Composition object.

from pymatgen.core import Composition

df.rename({"composition": "formula"}, axis=1, inplace=True)
df["composition"] = [Composition(comp) for comp in df["formula"]]

Generate the Magpie features for the dataset:

from matminer.featurizers.composition import ElementProperty

ep = ElementProperty.from_preset("magpie")
ep.set_n_jobs(1)
ep.featurize_dataframe(df, col_id="composition", inplace=True)
df
formula gap expt composition MagpieData minimum Number MagpieData maximum Number MagpieData range Number MagpieData mean Number MagpieData avg_dev Number MagpieData mode Number MagpieData minimum MendeleevNumber ... MagpieData range GSmagmom MagpieData mean GSmagmom MagpieData avg_dev GSmagmom MagpieData mode GSmagmom MagpieData minimum SpaceGroupNumber MagpieData maximum SpaceGroupNumber MagpieData range SpaceGroupNumber MagpieData mean SpaceGroupNumber MagpieData avg_dev SpaceGroupNumber MagpieData mode SpaceGroupNumber
0 Ag(AuS)2 0.00 (Ag, Au, S) 16.0 79.0 63.0 47.400000 25.280000 16.0 65.0 ... 0.000000 0.000000 0.000000 0.000000 70.0 225.0 155.0 163.000000 74.400000 70.0
1 Ag(W3Br7)2 0.00 (Ag, W, Br) 35.0 74.0 39.0 46.714286 15.619048 35.0 51.0 ... 0.000000 0.000000 0.000000 0.000000 64.0 229.0 165.0 118.809524 73.079365 64.0
2 Ag0.5Ge1Pb1.75S4 1.83 (Ag, Ge, Pb, S) 16.0 82.0 66.0 36.275862 23.552913 16.0 65.0 ... 0.000000 0.000000 0.000000 0.000000 70.0 225.0 155.0 139.482759 76.670630 70.0
3 Ag0.5Ge1Pb1.75Se4 1.51 (Ag, Ge, Pb, Se) 32.0 82.0 50.0 46.206897 17.388823 34.0 65.0 ... 0.000000 0.000000 0.000000 0.000000 14.0 225.0 211.0 108.586207 104.370987 14.0
4 Ag2BBr 0.00 (Ag, B, Br) 5.0 47.0 42.0 33.500000 14.250000 47.0 65.0 ... 0.000000 0.000000 0.000000 0.000000 64.0 225.0 161.0 170.000000 55.000000 225.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4599 ZrTaN3 1.72 (Zr, Ta, N) 7.0 73.0 66.0 26.800000 23.760000 7.0 44.0 ... 0.000000 0.000000 0.000000 0.000000 194.0 229.0 35.0 201.000000 11.200000 194.0
4600 ZrTe 0.00 (Zr, Te) 40.0 52.0 12.0 46.000000 6.000000 40.0 44.0 ... 0.000000 0.000000 0.000000 0.000000 152.0 194.0 42.0 173.000000 21.000000 152.0
4601 ZrTi2O 0.00 (Zr, Ti, O) 8.0 40.0 32.0 23.000000 8.500000 22.0 43.0 ... 0.000023 0.000011 0.000011 0.000023 12.0 194.0 182.0 148.500000 68.250000 194.0
4602 ZrTiF6 0.00 (Zr, Ti, F) 9.0 40.0 31.0 14.500000 8.250000 9.0 43.0 ... 0.000023 0.000003 0.000005 0.000000 15.0 194.0 179.0 59.750000 67.125000 15.0
4603 ZrW2 0.00 (Zr, W) 40.0 74.0 34.0 62.666667 15.111111 74.0 44.0 ... 0.000000 0.000000 0.000000 0.000000 194.0 229.0 35.0 217.333333 15.555556 229.0

4604 rows × 135 columns

Drop the composition object column and save the dataset.

df.drop("composition", axis=1, inplace=True)
df.to_csv("datasets/team-a.csv", index=False)

Team B#

More information on the dataset is available on the MatBench description page.

Features

Here, we use elemental embeddings taken from the paper “Unsupervised word embeddings capture latent knowledge from materials science literature” (2019). Each element is represented by a 200-dimensional vector learned through natural language processing of the materials science literature. The features consist of statistical measures of these embeddings, weighted according to the chemical formula: the minimum value across all elements, the maximum value, the range, the mean, and the standard deviation (1000 features in total).

Dataset Generation

First load the dataset:

from matminer.datasets import load_dataset, get_all_dataset_info
import warnings

warnings.filterwarnings("ignore")  # ignore warnings during featurisation

print(get_all_dataset_info("matbench_glass"))

df = load_dataset("matbench_glass")
df
Dataset: matbench_glass
Description: Matbench v0.1 test dataset for predicting full bulk metallic glass formation ability from chemical formula. Retrieved from "Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys,’ a volume of the Landolt– Börnstein collection. Deduplicated according to composition, ensuring no compositions were reported as both GFA and not GFA (i.e., all reports agreed on the classification designation). For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
	composition: Chemical formula.
	gfa: Target variable. Glass forming ability: 1 means glass forming and corresponds to amorphous, 0 means non full glass forming.
Num Entries: 5680
Reference: Y. Kawazoe, T. Masumoto, A.-P. Tsai, J.-Z. Yu, T. Aihara Jr. (1997) Y. Kawazoe, J.-Z. Yu, A.-P. Tsai, T. Masumoto (ed.) SpringerMaterials
Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys · 1 Introduction Landolt-Börnstein - Group III Condensed Matter 37A (Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys) https://www.springer.com/gp/book/9783540605072 (Springer-Verlag Berlin Heidelberg © 1997) Accessed: 03-09-2019
Bibtex citations: ["@Article{Dunn2020,\nauthor={Dunn, Alexander\nand Wang, Qi\nand Ganose, Alex\nand Dopp, Daniel\nand Jain, Anubhav},\ntitle={Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm},\njournal={npj Computational Materials},\nyear={2020},\nmonth={Sep},\nday={15},\nvolume={6},\nnumber={1},\npages={138},\nabstract={We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13{\\thinspace}ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm---namely, that crystal graph methods appear to outperform traditional machine learning methods given {\\textasciitilde}104 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.},\nissn={2057-3960},\ndoi={10.1038/s41524-020-00406-3},\nurl={https://doi.org/10.1038/s41524-020-00406-3}\n}\n", '@Misc{LandoltBornstein1997:sm_lbs_978-3-540-47679-5_2,\nauthor="Kawazoe, Y.\nand Masumoto, T.\nand Tsai, A.-P.\nand Yu, J.-Z.\nand Aihara Jr., T.",\neditor="Kawazoe, Y.\nand Yu, J.-Z.\nand Tsai, A.-P.\nand Masumoto, T.",\ntitle="Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys {\\textperiodcentered} 1 Introduction: Datasheet from Landolt-B{\\"o}rnstein - Group III Condensed Matter {\\textperiodcentered} Volume 37A: ``Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys\'\' in SpringerMaterials (https://dx.doi.org/10.1007/10510374{\\_}2)",\npublisher="Springer-Verlag Berlin Heidelberg",\nnote="Copyright 1997 Springer-Verlag Berlin Heidelberg",\nnote="Part of SpringerMaterials",\nnote="accessed 2018-10-23",\ndoi="10.1007/10510374_2",\nurl="https://materials.springer.com/lb/docs/sm_lbs_978-3-540-47679-5_2"\n}', '@Article{Ward2016,\nauthor={Ward, Logan\nand Agrawal, Ankit\nand Choudhary, Alok\nand Wolverton, Christopher},\ntitle={A general-purpose machine learning framework for predicting properties of inorganic materials},\njournal={Npj Computational Materials},\nyear={2016},\nmonth={Aug},\nday={26},\npublisher={The Author(s)},\nvolume={2},\npages={16028},\nnote={Article},\nurl={http://dx.doi.org/10.1038/npjcompumats.2016.28}\n}']
File type: json.gz
Figshare URL: https://ml.materialsproject.org/projects/matbench_glass.json.gz
SHA256 Hash Digest: 36beb654e2a463ee2a6572105bea0ca2961eee7c7b26a25377bff2c3b338e53a
composition gfa
0 Al False
1 Al(NiB)2 True
2 Al10Co21B19 True
3 Al10Co23B17 True
4 Al10Co27B13 True
... ... ...
5675 ZrTi9 False
5676 ZrTiSi2 True
5677 ZrTiSi3 True
5678 ZrVCo8 True
5679 ZrVNi2 True

5680 rows × 2 columns

Convert the composition string into a pymatgen Composition object.

from pymatgen.core import Composition

df.rename({"composition": "formula"}, axis=1, inplace=True)
df["composition"] = [Composition(comp) for comp in df["formula"]]

Generate the features for the dataset:

from matminer.featurizers.composition import ElementProperty

ep = ElementProperty.from_preset("matscholar_el")
ep.set_n_jobs(1)
ep.featurize_dataframe(df, col_id="composition", inplace=True)
df
formula gfa composition MatscholarElementData minimum embedding 1 MatscholarElementData maximum embedding 1 MatscholarElementData range embedding 1 MatscholarElementData mean embedding 1 MatscholarElementData std_dev embedding 1 MatscholarElementData minimum embedding 2 MatscholarElementData maximum embedding 2 ... MatscholarElementData minimum embedding 199 MatscholarElementData maximum embedding 199 MatscholarElementData range embedding 199 MatscholarElementData mean embedding 199 MatscholarElementData std_dev embedding 199 MatscholarElementData minimum embedding 200 MatscholarElementData maximum embedding 200 MatscholarElementData range embedding 200 MatscholarElementData mean embedding 200 MatscholarElementData std_dev embedding 200
0 Al False (Al) -0.034189 -0.034189 0.000000 -0.034189 0.000000 -0.001735 -0.001735 ... -0.044308 -0.044308 0.000000 -0.044308 0.000000 -0.009788 -0.009788 0.000000 -0.009788 0.000000
1 Al(NiB)2 True (Al, Ni, B) -0.105366 0.020968 0.126334 -0.040597 0.070736 -0.015921 0.038819 ... -0.065492 0.037160 0.102652 -0.020195 0.059330 -0.009788 0.042107 0.051894 0.015274 0.027822
2 Al10Co21B19 True (Al, Co, B) -0.105366 0.000804 0.106171 -0.046539 0.059815 -0.015921 0.067943 ... -0.082171 0.037160 0.119331 -0.029253 0.067328 -0.029063 0.042107 0.071170 0.001836 0.040419
3 Al10Co23B17 True (Al, Co, B) -0.105366 0.000804 0.106171 -0.042292 0.059232 -0.015921 0.067943 ... -0.082171 0.037160 0.119331 -0.034026 0.066641 -0.029063 0.042107 0.071170 -0.001010 0.039941
4 Al10Co27B13 True (Al, Co, B) -0.105366 0.000804 0.106171 -0.033799 0.057383 -0.015921 0.067943 ... -0.082171 0.037160 0.119331 -0.043572 0.064497 -0.029063 0.042107 0.071170 -0.006704 0.038517
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5675 ZrTi9 False (Zr, Ti) -0.035601 -0.018014 0.017588 -0.033843 0.012436 -0.066841 -0.047185 ... -0.051519 0.005334 0.056852 -0.045833 0.040201 0.042314 0.063139 0.020825 0.044396 0.014726
5676 ZrTiSi2 True (Zr, Ti, Si) -0.035601 -0.018014 0.017588 -0.022898 0.009291 -0.066841 -0.035818 ... -0.051519 0.005334 0.056852 -0.029050 0.026519 -0.012335 0.063139 0.075474 0.020196 0.042189
5677 ZrTiSi3 True (Zr, Ti, Si) -0.035601 -0.018014 0.017588 -0.022116 0.009025 -0.066841 -0.035818 ... -0.051519 0.005334 0.056852 -0.030242 0.025259 -0.012335 0.063139 0.075474 0.013690 0.043492
5678 ZrVCo8 True (Zr, V, Co) -0.018014 0.057822 0.075835 0.004624 0.031897 -0.047185 0.067943 ... -0.082171 0.100906 0.183077 -0.055113 0.099783 -0.029063 0.102289 0.131353 -0.006708 0.078135
5679 ZrVNi2 True (Zr, V, Ni) -0.018014 0.057822 0.075835 0.020436 0.033921 -0.047185 0.038819 ... -0.065492 0.100906 0.166398 -0.006186 0.086338 0.000973 0.102289 0.101316 0.041844 0.054582

5680 rows × 1003 columns

Drop the composition object column and save the dataset.

df.drop("composition", axis=1, inplace=True)
df.to_csv("datasets/team-b.csv", index=False)

Team C#

More information on the dataset is available on the MatBench description page.

Features

Here, we use the “Magpie” features defined in the paper “A general-purpose machine learning framework for predicting properties of inorganic materials” (2016). They consist of statistical measures of elemental properties, weighted according to the chemical formula: the minimum value across all elements, the maximum value, the range, the mean, the average absolute deviation, and the mode (132 features in total).

Since this dataset also has the crystal structure available, we also include the site stats fingerprint. This method calculates order parameters for the different sites (e.g. how tetrahedral, octahedral, or square planar the site is, and the coordination number) and calculates statistical properties (mean and standard deviation) of these over the full structure. Specifically, we use the CrystalNNFingerprint, where the order parameters are introduced in the paper “Local structure order parameters and site fingerprints for quantification of coordination environment and crystal structure similarity” (2020).

Note, if you are so inclined, you can use the structure directly with message-passing graph neural networks such as MEGNet or M3GNet. Nice implementations are provided in the MatGL library, although these will be significantly more expensive to train than classical models. A minimal inference sketch follows.
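
For example, a minimal sketch of inference with a pretrained model via MatGL (the model name is an assumption; check the MatGL documentation for the pretrained models available in your installed version):

import matgl
from pymatgen.core import Lattice, Structure

# rock-salt MgO as a toy input structure
mgo = Structure.from_spacegroup(
    "Fm-3m", Lattice.cubic(4.21), ["Mg", "O"], [[0, 0, 0], [0.5, 0.5, 0.5]]
)

# load a pretrained MEGNet formation-energy model and predict
model = matgl.load_model("MEGNet-MP-2018.6.1-Eform")
print(model.predict_structure(mgo))  # formation energy in eV/atom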

Dataset Generation

First load the dataset:

from matminer.datasets import load_dataset, get_all_dataset_info
import warnings

warnings.filterwarnings("ignore")  # ignore warnings during featurisation

print(get_all_dataset_info("matbench_log_kvrh"))

df = load_dataset("matbench_log_kvrh")
df
Dataset: matbench_log_kvrh
Description: Matbench v0.1 test dataset for predicting DFT log10 VRH-average bulk modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
	log10(K_VRH): Target variable. Base 10 logarithm of the DFT Voigt-Reuss-Hill average bulk moduli in GPa.
	structure: Pymatgen Structure of the material.
Num Entries: 10987
Reference: Jong, M. De, Chen, W., Angsten, T., Jain, A., Notestine, R., Gamst,
A., Sluiter, M., Ande, C. K., Zwaag, S. Van Der, Plata, J. J., Toher,
C., Curtarolo, S., Ceder, G., Persson, K. and Asta, M., "Charting
the complete elastic properties of inorganic crystalline compounds",
Scientific Data volume 2, Article number: 150009 (2015)
Bibtex citations: ["@Article{Dunn2020,\nauthor={Dunn, Alexander\nand Wang, Qi\nand Ganose, Alex\nand Dopp, Daniel\nand Jain, Anubhav},\ntitle={Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm},\njournal={npj Computational Materials},\nyear={2020},\nmonth={Sep},\nday={15},\nvolume={6},\nnumber={1},\npages={138},\nabstract={We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13{\\thinspace}ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm---namely, that crystal graph methods appear to outperform traditional machine learning methods given {\\textasciitilde}104 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.},\nissn={2057-3960},\ndoi={10.1038/s41524-020-00406-3},\nurl={https://doi.org/10.1038/s41524-020-00406-3}\n}\n", '@Article{deJong2015,\nauthor={de Jong, Maarten and Chen, Wei and Angsten, Thomas\nand Jain, Anubhav and Notestine, Randy and Gamst, Anthony\nand Sluiter, Marcel and Krishna Ande, Chaitanya\nand van der Zwaag, Sybrand and Plata, Jose J. and Toher, Cormac\nand Curtarolo, Stefano and Ceder, Gerbrand and Persson, Kristin A.\nand Asta, Mark},\ntitle={Charting the complete elastic properties\nof inorganic crystalline compounds},\njournal={Scientific Data},\nyear={2015},\nmonth={Mar},\nday={17},\npublisher={The Author(s)},\nvolume={2},\npages={150009},\nnote={Data Descriptor},\nurl={http://dx.doi.org/10.1038/sdata.2015.9}\n}']
File type: json.gz
Figshare URL: https://ml.materialsproject.org/projects/matbench_log_kvrh.json.gz
SHA256 Hash Digest: 44b113ddb7e23aa18731a62c74afa7e5aa654199e0db5f951c8248a00955c9cd

 
structure log10(K_VRH)
0 [[0. 0. 0.] Ca, [1.37728887 1.57871271 3.73949... 1.707570
1 [[3.14048493 1.09300401 1.64101398] Mg, [0.625... 1.633468
2 [[ 2.06884519 2.40627241 -0.45891585] Si, [1.... 1.908485
3 [[2.06428082 0. 2.06428082] Pd, [0. ... 2.117271
4 [[3.09635262 1.0689416 1.53602403] Mg, [0.593... 1.690196
... ... ...
10982 [[0. 0. 0.] Rh, [3.2029368 3.2029368 2.09459... 1.778151
10983 [[-1.51157821 4.4173925 1.21553922] Mg, [3.... 1.724276
10984 [[4.37546772 4.51128393 6.81784473] H, [0.4573... 1.342423
10985 [[0. 0. 0.] Si, [ 4.55195829 4.55195829 -4.55... 1.770852
10986 [[1.44565668 0. 2.05259079] Al, [1.445... 1.954243

10987 rows × 2 columns

Create a composition column containing pymatgen Composition objects from the Structure object.

df["composition"] = [structure.composition for structure in df["structure"]]

Generate the compositional features for the dataset:

from matminer.featurizers.composition import ElementProperty

ep = ElementProperty.from_preset("magpie")
ep.set_n_jobs(1)
ep.featurize_dataframe(df, col_id="composition", inplace=True)
from matminer.featurizers.structure import SiteStatsFingerprint

cnn_fp = SiteStatsFingerprint.from_preset("CrystalNNFingerprint_ops")
cnn_fp.set_n_jobs(8)
cnn_fp.featurize_dataframe(df, col_id="structure", inplace=True)
df
structure log10(K_VRH) composition MagpieData minimum Number MagpieData maximum Number MagpieData range Number MagpieData mean Number MagpieData avg_dev Number MagpieData mode Number MagpieData minimum MendeleevNumber ... mean wt CN_20 std_dev wt CN_20 mean wt CN_21 std_dev wt CN_21 mean wt CN_22 std_dev wt CN_22 mean wt CN_23 std_dev wt CN_23 mean wt CN_24 std_dev wt CN_24
0 [[0. 0. 0.] Ca, [1.37728887 1.57871271 3.73949... 1.707570 (Ca, Ge, Ag) 20.0 47.0 27.0 35.600000 9.120000 32.0 7.0 ... 0.001273 0.002546 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
1 [[3.14048493 1.09300401 1.64101398] Mg, [0.625... 1.633468 (Mg, Ge, Ba) 12.0 56.0 44.0 28.800000 13.440000 12.0 9.0 ... 0.006620 0.013240 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
2 [[ 2.06884519 2.40627241 -0.45891585] Si, [1.... 1.908485 (Si, Cu, Sr) 14.0 38.0 24.0 24.800000 8.640000 14.0 8.0 ... 0.000000 0.000000 0.0 0.0 0.007384 0.014768 0.0 0.0 0.0 0.0
3 [[2.06428082 0. 2.06428082] Pd, [0. ... 2.117271 (Pd, Dy) 46.0 66.0 20.0 51.000000 7.500000 46.0 31.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
4 [[3.09635262 1.0689416 1.53602403] Mg, [0.593... 1.690196 (Mg, Si, Ba) 12.0 56.0 44.0 21.600000 13.760000 12.0 9.0 ... 0.005602 0.011204 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10982 [[0. 0. 0.] Rh, [3.2029368 3.2029368 2.09459... 1.778151 (Rh, I) 45.0 53.0 8.0 50.333333 3.555556 53.0 59.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
10983 [[-1.51157821 4.4173925 1.21553922] Mg, [3.... 1.724276 (Mg, Co, Sn) 12.0 50.0 38.0 18.625000 9.937500 12.0 58.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
10984 [[4.37546772 4.51128393 6.81784473] H, [0.4573... 1.342423 (H, N, O) 1.0 8.0 7.0 4.666667 3.259259 1.0 82.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
10985 [[0. 0. 0.] Si, [ 4.55195829 4.55195829 -4.55... 1.770852 (Si, Sn) 14.0 50.0 36.0 32.000000 18.000000 14.0 78.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0
10986 [[1.44565668 0. 2.05259079] Al, [1.445... 1.954243 (Al, Cu) 13.0 29.0 16.0 18.333333 7.111111 13.0 64.0 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0

10987 rows × 257 columns

Drop the composition and structure object columns and save the dataset.

df.drop(["composition", "structure"], axis=1, inplace=True)
df.to_csv("datasets/team-c.csv", index=False)

Team D#

More information on the dataset is available on the MatBench description page.

Features

Here we use the “Meredig” features defined in the paper “Combinatorial screening for new materials in unconstrained composition space with machine learning” (2014). They consist of the fraction of each element in the composition together with statistical attributes of the constituent elements (135 features in total). We also include the atomic packing efficiency features defined in “A predictive structural model for bulk metallic glasses” (2015).

Since this dataset also has the crystal structure available, we also include some structural features. In particular, these are the density, volume per atom, packing fraction (density features), and the maximum packing efficiency defined in “Including crystal structure attributes in machine learning models of formation energies via Voronoi tessellations” (2017).

Dataset Generation

First load the dataset:

from matminer.datasets import load_dataset, get_all_dataset_info
import warnings

warnings.filterwarnings("ignore")  # ignore warnings during featurisation

print(get_all_dataset_info("matbench_jdft2d"))

df = load_dataset("matbench_jdft2d")
df
Dataset: matbench_jdft2d
Description: Matbench v0.1 test dataset for predicting exfoliation energies from crystal structure (computed with the OptB88vdW and TBmBJ functionals). Adapted from the JARVIS DFT database. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
	exfoliation_en: Target variable. Exfoliation energy (meV/atom).
	structure: Pymatgen Structure of the material.
Num Entries: 636
Reference: 2D Dataset discussed in:
High-throughput Identification and Characterization of Two dimensional Materials using Density functional theory Kamal Choudhary, Irina Kalish, Ryan Beams & Francesca Tavazza Scientific Reports volume 7, Article number: 5179 (2017)
Original 2D Data file sourced from:
choudhary, kamal; https://orcid.org/0000-0001-9737-8074 (2018): jdft_2d-7-7-2018.json. figshare. Dataset.
Bibtex citations: ["@Article{Dunn2020,\nauthor={Dunn, Alexander\nand Wang, Qi\nand Ganose, Alex\nand Dopp, Daniel\nand Jain, Anubhav},\ntitle={Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm},\njournal={npj Computational Materials},\nyear={2020},\nmonth={Sep},\nday={15},\nvolume={6},\nnumber={1},\npages={138},\nabstract={We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13{\\thinspace}ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm---namely, that crystal graph methods appear to outperform traditional machine learning methods given {\\textasciitilde}104 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.},\nissn={2057-3960},\ndoi={10.1038/s41524-020-00406-3},\nurl={https://doi.org/10.1038/s41524-020-00406-3}\n}\n", '@Article{Choudhary2017,\nauthor={Choudhary, Kamal\nand Kalish, Irina\nand Beams, Ryan\nand Tavazza, Francesca},\ntitle={High-throughput Identification and Characterization of Two-dimensional Materials using Density functional theory},\njournal={Scientific Reports},\nyear={2017},\nvolume={7},\nnumber={1},\npages={5179},\nabstract={We introduce a simple criterion to identify two-dimensional (2D) materials based on the comparison between experimental lattice constants and lattice constants mainly obtained from Materials-Project (MP) density functional theory (DFT) calculation repository. Specifically, if the relative difference between the two lattice constants for a specific material is greater than or equal to 5%, we predict them to be good candidates for 2D materials. We have predicted at least 1356 such 2D materials. For all the systems satisfying our criterion, we manually create single layer systems and calculate their energetics, structural, electronic, and elastic properties for both the bulk and the single layer cases. Currently the database consists of 1012 bulk and 430 single layer materials, of which 371 systems are common to bulk and single layer. The rest of calculations are underway. To validate our criterion, we calculated the exfoliation energy of the suggested layered materials, and we found that in 88.9% of the cases the currently accepted criterion for exfoliation was satisfied. Also, using molybdenum telluride as a test case, we performed X-ray diffraction and Raman scattering experiments to benchmark our calculations and understand their applicability and limitations. 
The data is publicly available at the website http://www.ctcms.nist.gov/{\textasciitilde}knc6/JVASP.html.},\nissn={2045-2322},\ndoi={10.1038/s41598-017-05402-0},\nurl={https://doi.org/10.1038/s41598-017-05402-0}\n}', '@misc{choudhary__2018, title={jdft_2d-7-7-2018.json}, url={https://figshare.com/articles/jdft_2d-7-7-2018_json/6815705/1}, DOI={10.6084/m9.figshare.6815705.v1}, abstractNote={2D materials}, publisher={figshare}, author={choudhary, kamal and https://orcid.org/0000-0001-9737-8074}, year={2018}, month={Jul}}']
File type: json.gz
Figshare URL: https://ml.materialsproject.org/projects/matbench_jdft2d.json.gz
SHA256 Hash Digest: 26057dc4524e193e32abffb296ce819b58b6e11d1278cae329a2f97817a4eddf
structure exfoliation_en
0 [[1.49323139 3.32688406 7.26257785] Hf, [3.326... 63.593833
1 [[1.85068084 4.37698238 6.9301577 ] As, [0. ... 134.863750
2 [[ 0. 2.0213325 11.97279555] Ti, [ 1... 43.114667
3 [[2.39882726 2.39882726 2.53701553] In, [0.054... 240.715488
4 [[-1.83484554e-06 1.73300105e+00 2.61675943e... 67.442833
... ... ...
631 [[ 2.38592362 1.37751086 13.178104 ] Co, [-2... 26.426545
632 [[0. 0. 6.02219863] Br, [0. ... 43.574286
633 [[2.74646086 0.06822876 1.46596737] Se, [6.324... 88.808659
634 [[6.79056646 2.04327631 3.37729406] I, [2.0440... 132.265250
635 [[ 0.69409027 1.22690182 -0.85636865] Co, [-0... 63.564333

636 rows × 2 columns

Create a composition column containing pymatgen Composition objects from the Structure object.

df["composition"] = [structure.composition for structure in df["structure"]]

Generate the compositional features for the dataset:

from matminer.featurizers.composition import Meredig, AtomicPackingEfficiency


for featurizer in [Meredig(), AtomicPackingEfficiency(impute_nan=True)]:
    featurizer.set_n_jobs(1)
    featurizer.featurize_dataframe(df, col_id="composition", inplace=True)
from matminer.featurizers.structure import DensityFeatures, MaximumPackingEfficiency

for featurizer in [DensityFeatures(), MaximumPackingEfficiency()]:
    featurizer.set_n_jobs(8)
    featurizer.featurize_dataframe(df, col_id="structure", inplace=True, ignore_errors=True)

# remove any rows that contain NaN values
df.dropna(inplace=True)
df
structure exfoliation_en composition H fraction He fraction Li fraction Be fraction B fraction C fraction N fraction ... frac f valence electrons mean simul. packing efficiency mean abs simul. packing efficiency dist from 1 clusters |APE| < 0.010 dist from 3 clusters |APE| < 0.010 dist from 5 clusters |APE| < 0.010 density vpa packing fraction max packing efficiency
0 [[1.49323139 3.32688406 7.26257785] Hf, [3.326... 63.593833 (Hf, Si, Te) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.368421 -0.007860 0.016033 0.048029 0.056762 0.067759 3.021472 61.218660 0.177875 0.184684
1 [[1.85068084 4.37698238 6.9301577 ] As, [0. ... 134.863750 (As) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.000000 0.023994 0.023994 1.000000 1.000000 1.000000 1.124487 110.637295 0.057581 0.074276
2 [[ 0. 2.0213325 11.97279555] Ti, [ 1... 43.114667 (Ti, Br, O) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.000000 0.002382 0.017291 0.078567 0.078567 0.079184 1.385515 57.436271 0.108929 0.103004
3 [[2.39882726 2.39882726 2.53701553] In, [0.054... 240.715488 (In, Bi) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.333333 -0.011374 0.013329 0.000000 0.067344 0.121218 1.950268 137.847700 0.118812 0.109654
4 [[-1.83484554e-06 1.73300105e+00 2.61675943e... 67.442833 (Nb, O) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.000000 -0.005835 0.005835 0.018856 0.049327 0.101429 1.183135 58.435112 0.083167 0.083109
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
631 [[ 2.38592362 1.37751086 13.178104 ] Co, [-2... 26.426545 (Co, Cl, O) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.000000 0.049309 0.049309 0.000000 0.035398 0.051959 0.823767 47.249048 0.049875 0.039890
632 [[0. 0. 6.02219863] Br, [0. ... 43.574286 (Br, Ca, Cu, O) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.000000 0.038838 0.038838 0.037863 0.044437 0.053600 1.437709 55.358454 0.190227 0.144017
633 [[2.74646086 0.06822876 1.46596737] Se, [6.324... 88.808659 (Se, Cl, Pd) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.000000 0.027839 0.027839 0.052691 0.066689 0.078605 1.252087 97.537984 0.066022 0.070146
634 [[6.79056646 2.04327631 3.37729406] I, [2.0440... 132.265250 (I, Hg) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.233333 0.023963 0.023963 1.000000 1.000000 1.000000 1.445489 174.000416 0.071121 0.069553
635 [[ 0.69409027 1.22690182 -0.85636865] Co, [-0... 63.564333 (Co, O) 0.0 0 0.0 0 0.0 0.0 0.0 ... 0.000000 -0.037170 0.037170 0.040992 0.067233 0.173491 1.004062 50.128437 0.080563 0.068925

635 rows × 147 columns

Drop the composition and structure object columns and save the dataset.

df.drop(["composition", "structure"], axis=1, inplace=True)
df.to_csv("datasets/team-d.csv", index=False)

Team E#

More information on the dataset is available on the MatBench description page.

Features

Here, we use elemental embeddings taken from the paper “Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals” (2019). Each element is represented by a 16-dimensional vector learned by training a graph neural network on materials spanning the periodic table. The features consist of statistical measures of these embeddings (minimum, maximum, range, mean, and standard deviation), weighted according to the chemical formula (80 features in total).

Since this dataset also has the crystal structure available, we also include the site stats fingerprint with a Voronoi decomposition. This method obtains information about the geometric environment of each site and calculates statistical properties (mean and standard deviation) of these over the full structure. Specifically, we use the VoronoiFingerprint, whose features are introduced in the paper “A transferable machine-learning framework linking interstice distribution and plastic heterogeneity in metallic glasses” (2019).

Dataset Generation

First load the dataset:

from matminer.datasets import load_dataset, get_all_dataset_info
import warnings

warnings.filterwarnings("ignore")  # ignore warnings during featurisation

print(get_all_dataset_info("matbench_dielectric"))

df = load_dataset("matbench_dielectric")
df
Dataset: matbench_dielectric
Description: Matbench v0.1 test dataset for predicting refractive index from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having refractive indices less than 1 and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
Columns:
	n: Target variable. Refractive index (unitless).
	structure: Pymatgen Structure of the material.
Num Entries: 4764
Reference: Petousis, I., Mrdjenovich, D., Ballouz, E., Liu, M., Winston, D.,
Chen, W., Graf, T., Schladt, T. D., Persson, K. A. & Prinz, F. B.
High-throughput screening of inorganic compounds for the discovery
of novel dielectric and optical materials. Sci. Data 4, 160134 (2017).
Bibtex citations: ["@Article{Dunn2020,\nauthor={Dunn, Alexander\nand Wang, Qi\nand Ganose, Alex\nand Dopp, Daniel\nand Jain, Anubhav},\ntitle={Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm},\njournal={npj Computational Materials},\nyear={2020},\nmonth={Sep},\nday={15},\nvolume={6},\nnumber={1},\npages={138},\nabstract={We present a benchmark test suite and an automated machine learning procedure for evaluating supervised machine learning (ML) models for predicting properties of inorganic bulk materials. The test suite, Matbench, is a set of 13{\\thinspace}ML tasks that range in size from 312 to 132k samples and contain data from 10 density functional theory-derived and experimental sources. Tasks include predicting optical, thermal, electronic, thermodynamic, tensile, and elastic properties given a material's composition and/or crystal structure. The reference algorithm, Automatminer, is a highly-extensible, fully automated ML pipeline for predicting materials properties from materials primitives (such as composition and crystal structure) without user intervention or hyperparameter tuning. We test Automatminer on the Matbench test suite and compare its predictive power with state-of-the-art crystal graph neural networks and a traditional descriptor-based Random Forest model. We find Automatminer achieves the best performance on 8 of 13 tasks in the benchmark. We also show our test suite is capable of exposing predictive advantages of each algorithm---namely, that crystal graph methods appear to outperform traditional machine learning methods given {\\textasciitilde}104 or greater data points. We encourage evaluating materials ML algorithms on the Matbench benchmark and comparing them against the latest version of Automatminer.},\nissn={2057-3960},\ndoi={10.1038/s41524-020-00406-3},\nurl={https://doi.org/10.1038/s41524-020-00406-3}\n}\n", '@article{Jain2013,\nauthor = {Jain, Anubhav and Ong, Shyue Ping and Hautier, Geoffroy and Chen, Wei and Richards, William Davidson and Dacek, Stephen and Cholia, Shreyas and Gunter, Dan and Skinner, David and Ceder, Gerbrand and Persson, Kristin a.},\ndoi = {10.1063/1.4812323},\nissn = {2166532X},\njournal = {APL Materials},\nnumber = {1},\npages = {011002},\ntitle = {{The Materials Project: A materials genome approach to accelerating materials innovation}},\nurl = {http://link.aip.org/link/AMPADS/v1/i1/p011002/s1\\&Agg=doi},\nvolume = {1},\nyear = {2013}\n}', '@article{Petousis2017,\nauthor={Petousis, Ioannis and Mrdjenovich, David and Ballouz, Eric\nand Liu, Miao and Winston, Donald and Chen, Wei and Graf, Tanja\nand Schladt, Thomas D. and Persson, Kristin A. and Prinz, Fritz B.},\ntitle={High-throughput screening of inorganic compounds for the\ndiscovery of novel dielectric and optical materials},\njournal={Scientific Data},\nyear={2017},\nmonth={Jan},\nday={31},\npublisher={The Author(s)},\nvolume={4},\npages={160134},\nnote={Data Descriptor},\nurl={http://dx.doi.org/10.1038/sdata.2016.134}\n}']
File type: json.gz
Figshare URL: https://ml.materialsproject.org/projects/matbench_dielectric.json.gz
SHA256 Hash Digest: 83befa09bc2ec2f4b6143afc413157827a90e5e2e42c1eb507ccfa01bf26a1d6
structure n
0 [[4.29304147 2.4785886 1.07248561] S, [4.2930... 1.752064
1 [[3.95051434 4.51121437 0.28035002] K, [4.3099... 1.652859
2 [[-1.78688104 4.79604117 1.53044621] Rb, [-1... 1.867858
3 [[4.51438064 4.51438064 0. ] Mn, [0.133... 2.676887
4 [[-4.36731958 6.8886097 0.50929706] Li, [-2... 1.793232
... ... ...
4759 [[ 2.79280881 0.12499663 -1.84045389] Ca, [-2... 2.136837
4760 [[0. 5.50363806 3.84192106] O, [4.7662... 2.690619
4761 [[0. 0. 0.] Ba, [ 0.23821924 4.32393487 -0.35... 2.811494
4762 [[0. 0.18884638 0. ] K, [0. ... 1.832887
4763 [[0. 0. 0.] Cs, [2.80639641 2.80639641 2.80639... 2.559279

4764 rows × 2 columns

Create a composition column containing pymatgen Composition objects from the Structure object.

df["composition"] = [structure.composition for structure in df["structure"]]

Generate the compositional features for the dataset:

from matminer.featurizers.composition import ElementProperty

ep = ElementProperty.from_preset("megnet_el")
ep.set_n_jobs(1)
ep.featurize_dataframe(df, col_id="composition", inplace=True)
df
structure n composition MEGNetElementData minimum embedding 1 MEGNetElementData maximum embedding 1 MEGNetElementData range embedding 1 MEGNetElementData mean embedding 1 MEGNetElementData std_dev embedding 1 MEGNetElementData minimum embedding 2 MEGNetElementData maximum embedding 2 ... MEGNetElementData minimum embedding 15 MEGNetElementData maximum embedding 15 MEGNetElementData range embedding 15 MEGNetElementData mean embedding 15 MEGNetElementData std_dev embedding 15 MEGNetElementData minimum embedding 16 MEGNetElementData maximum embedding 16 MEGNetElementData range embedding 16 MEGNetElementData mean embedding 16 MEGNetElementData std_dev embedding 16
0 [[4.29304147 2.4785886 1.07248561] S, [4.2930... 1.752064 (S, K) -0.207227 0.101355 0.308582 -0.052936 0.218201 -0.418307 0.304931 ... -0.014150 0.246111 0.260262 0.115980 0.184033 -0.040199 0.018383 0.058582 -0.010908 0.041424
1 [[3.95051434 4.51121437 0.28035002] K, [4.3099... 1.652859 (K, V, O) -0.207227 0.010069 0.217296 -0.133437 0.089964 -0.188673 0.304931 ... 0.113411 0.246111 0.132700 0.202901 0.054314 0.018383 0.053152 0.034769 0.032988 0.015804
2 [[-1.78688104 4.79604117 1.53044621] Rb, [-1... 1.867858 (Rb, Zr, O) -0.222826 0.072824 0.295650 -0.119124 0.126238 -0.188673 0.179438 ... -0.148203 0.356478 0.504681 0.190559 0.215217 -0.047841 0.085135 0.132976 0.017693 0.062858
3 [[4.51438064 4.51438064 0. ] Mn, [0.133... 2.676887 (Mn, O, F) -0.308105 -0.026669 0.281436 -0.149582 0.144058 -0.575614 0.343975 ... -0.312556 0.501850 0.814407 0.127387 0.411133 -0.155995 0.371508 0.527503 0.084805 0.266730
4 [[-4.36731958 6.8886097 0.50929706] Li, [-2... 1.793232 (Li, Co, Si, O) -0.161449 0.426308 0.587757 -0.036320 0.230572 -0.275507 0.334700 ... -0.168183 0.299678 0.467860 0.143178 0.193200 -0.311854 0.306350 0.618204 0.045337 0.232319
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4759 [[ 2.79280881 0.12499663 -1.84045389] Ca, [-2... 2.136837 (Ca, Fe, W, O) -0.234947 0.153433 0.388380 -0.087296 0.161825 -0.188673 0.463157 ... -0.424430 0.299887 0.724317 0.122727 0.276572 -0.333184 0.348429 0.681613 0.039165 0.259230
4760 [[0. 5.50363806 3.84192106] O, [4.7662... 2.690619 (O, S, Mn, La) -0.113972 0.101355 0.215327 0.008412 0.120043 -0.418307 0.343975 ... -0.312556 0.192867 0.505423 0.027736 0.162918 -0.155995 0.462238 0.618232 0.124222 0.306600
4761 [[0. 0. 0.] Ba, [ 0.23821924 4.32393487 -0.35... 2.811494 (Ba, Ag, Ge, Se) -0.318859 0.220026 0.538885 0.090390 0.198999 -0.164837 0.271697 ... -0.502267 0.352671 0.854938 -0.142704 0.314194 -0.343382 0.275374 0.618756 -0.236249 0.240652
4762 [[0. 0.18884638 0. ] K, [0. ... 1.832887 (K, Zn, H, I, O) -0.207227 0.352363 0.559590 -0.032446 0.225594 -0.188673 0.635952 ... 0.021095 0.246111 0.225016 0.169482 0.089604 -0.270104 0.045132 0.315236 -0.050470 0.162726
4763 [[0. 0. 0.] Cs, [2.80639641 2.80639641 2.80639... 2.559279 (Cs, Ge, Br) -0.122546 0.163372 0.285918 -0.055542 0.148459 -0.047956 0.021357 ... -0.039392 0.583951 0.623343 0.249013 0.264631 -0.259408 0.052268 0.311676 -0.100368 0.131754

4764 rows × 83 columns

from matminer.featurizers.structure import SiteStatsFingerprint
from matminer.featurizers.site import VoronoiFingerprint

cesf = VoronoiFingerprint(cutoff=15.0)
ssf = SiteStatsFingerprint(cesf)

ssf.set_n_jobs(8)
ssf.fit(df["structure"])
ssf.featurize_dataframe(df, col_id="structure", inplace=True, ignore_errors=True)

# remove any rows that contain NaN values
df.dropna(inplace=True)
df
structure n composition MEGNetElementData minimum embedding 1 MEGNetElementData maximum embedding 1 MEGNetElementData range embedding 1 MEGNetElementData mean embedding 1 MEGNetElementData std_dev embedding 1 MEGNetElementData minimum embedding 2 MEGNetElementData maximum embedding 2 ... mean Voro_area_maximum std_dev Voro_area_maximum mean Voro_dist_mean std_dev Voro_dist_mean mean Voro_dist_std_dev std_dev Voro_dist_std_dev mean Voro_dist_minimum std_dev Voro_dist_minimum mean Voro_dist_maximum std_dev Voro_dist_maximum
0 [[4.29304147 2.4785886 1.07248561] S, [4.2930... 1.752064 (S, K) -0.207227 0.101355 0.308582 -0.052936 0.218201 -0.418307 0.304931 ... 8.204232 0.823014 3.767918 0.426651 0.709328 0.240967 2.674113 0.534927 4.890630 0.808575
1 [[3.95051434 4.51121437 0.28035002] K, [4.3099... 1.652859 (K, V, O) -0.207227 0.010069 0.217296 -0.133437 0.089964 -0.188673 0.304931 ... 6.824999 0.644861 3.258812 0.211691 0.599649 0.174834 2.092884 0.450891 4.157850 0.328473
2 [[-1.78688104 4.79604117 1.53044621] Rb, [-1... 1.867858 (Rb, Zr, O) -0.222826 0.072824 0.295650 -0.119124 0.126238 -0.188673 0.179438 ... 6.752772 0.892951 3.237223 0.301533 0.568850 0.134024 2.306172 0.401358 4.243866 0.241752
3 [[4.51438064 4.51438064 0. ] Mn, [0.133... 2.676887 (Mn, O, F) -0.308105 -0.026669 0.281436 -0.149582 0.144058 -0.575614 0.343975 ... 4.643363 0.543658 2.682981 0.200683 0.513412 0.099900 1.960063 0.028847 3.469882 0.083992
4 [[-4.36731958 6.8886097 0.50929706] Li, [-2... 1.793232 (Li, Co, Si, O) -0.161449 0.426308 0.587757 -0.036320 0.230572 -0.275507 0.334700 ... 5.444131 1.757433 2.966329 0.157502 0.514779 0.110638 1.914097 0.462314 3.549006 0.185586
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4759 [[ 2.79280881 0.12499663 -1.84045389] Ca, [-2... 2.136837 (Ca, Fe, W, O) -0.234947 0.153433 0.388380 -0.087296 0.161825 -0.188673 0.463157 ... 4.929955 0.291390 2.817745 0.235961 0.448275 0.073659 2.042058 0.146875 3.455685 0.174053
4760 [[0. 5.50363806 3.84192106] O, [4.7662... 2.690619 (O, S, Mn, La) -0.113972 0.101355 0.215327 0.008412 0.120043 -0.418307 0.343975 ... 6.809922 0.558490 3.379219 0.129438 0.470356 0.098123 2.556854 0.177656 4.183076 0.217302
4761 [[0. 0. 0.] Ba, [ 0.23821924 4.32393487 -0.35... 2.811494 (Ba, Ag, Ge, Se) -0.318859 0.220026 0.538885 0.090390 0.198999 -0.164837 0.271697 ... 9.552565 1.300761 3.786360 0.095118 0.639826 0.118692 2.586770 0.339563 4.664158 0.236099
4762 [[0. 0.18884638 0. ] K, [0. ... 1.832887 (K, Zn, H, I, O) -0.207227 0.352363 0.559590 -0.032446 0.225594 -0.188673 0.635952 ... 6.678443 0.932676 3.014896 0.271109 0.655179 0.183725 1.696626 0.474518 3.926988 0.410659
4763 [[0. 0. 0.] Cs, [2.80639641 2.80639641 2.80639... 2.559279 (Cs, Ge, Br) -0.122546 0.163372 0.285918 -0.055542 0.148459 -0.047956 0.021357 ... 7.414504 0.922714 3.636716 0.420112 0.244063 0.199277 3.038886 0.464979 3.736354 0.464979

4557 rows × 143 columns

Drop the composition and structure object columns and save the dataset.

df.drop(["composition", "structure"], axis=1, inplace=True)
df.to_csv("datasets/team-e.csv", index=False)