Welcome to MAterials Simulation Toolkit for Machine Learning (MAST-ML)’s documentation!¶
Acknowledgements¶
Materials Simulation Toolkit for Machine Learning (MAST-ML)
MAST-ML is an open-source Python package designed to broaden and accelerate the use of machine learning in materials science research.
As of MAST-ML version 3.x, much of the original code and workflow have been rewritten. The use of an input file in version 2.x and older has been removed in favor of a more modular Jupyter notebook computing environment. Please see the examples and tutorials under the mastml/examples folder for a guide to using MAST-ML.
Contributors
University of Wisconsin-Madison Computational Materials Group:
Prof. Dane Morgan
Dr. Ryan Jacobs
Dr. Tam Mayeshiba
Ben Afflerbach
Dr. Henry Wu
University of Kentucky contributors:
Luke Harold Miles
Robert Max Williams
Prof. Raphael Finkel
University of Wisconsin-Madison Undergraduate Skunkworks members (Spring 2021):
Avery Chan
Min Yi Lin
Hock Lye Lee
MAST-ML documentation:
An overview of code documentation and guides for installing MAST-ML can be found here
A number of Jupyter notebook tutorials detailing different MAST-ML use cases can be found here
Funding
This work was and is funded by the National Science Foundation (NSF) SI2 award No. 1148011 and DMREF award No. DMR-1332851
Citing MAST-ML
If you find MAST-ML useful, please cite the following publication:
Jacobs, R., Mayeshiba, T., Afflerbach, B., Miles, L., Williams, M., Turner, M., Finkel, R., Morgan, D., “The Materials Simulation Toolkit for Machine Learning (MAST-ML): An automated open source toolkit to accelerate data-driven materials research”, Computational Materials Science 175 (2020), 109544. https://doi.org/10.1016/j.commatsci.2020.109544
Code Repository
MAST-ML is available via PyPi: pip install mastml
MAST-ML is available via Github
git clone --single-branch --branch master https://github.com/uw-cmg/MAST-ML
MAST-ML version 3.x¶
New changes to MAST-ML¶
As of MAST-ML version 3.x and going forward, there are some significant changes to MAST-ML for users to be aware of:
MAST-ML major updates:
MAST-ML no longer uses an input file. The core functionality and workflow of MAST-ML has been rewritten to be more conducive to use in a Jupyter notebook environment. This major change has made the code more modular and transparent, and we believe more intuitive and easier to use in a research setting. The last version of MAST-ML to have input file support was version 2.0.20 on PyPi.
Each component of MAST-ML can be run in a Jupyter notebook environment, either locally or through a cloud-based service like Google Colab. As a result, we have completely reworked our use-case tutorials and examples. All of these MAST-ML tutorials are in the form of Jupyter notebooks and can be found in the mastml/examples folder on Github.
An active part of improving MAST-ML is to provide an automated, quantitative analysis of model domain assessment and model prediction uncertainty quantification (UQ). Version 3.x of MAST-ML includes more detailed implementations of model UQ using new and established techniques.
MAST-ML minor updates:
More straightforward implementation of left-out test data, both designated manually by the user and via nested cross validation.
Improved integration of feature generation schemes in complementary materials informatics packages, particularly matminer.
Improved data import schemes based on locally-stored files, and via downloading data hosted on databases including Figshare, matminer, Materials Data Facility, and Foundry.
Support for generalized ensemble models with user-specified choice of model type to use as the weak learner, including support for ensembles of Keras-based neural networks.
Installing MAST-ML¶
Hardware and Data Requirements¶
Hardware¶
PC, Mac, computing cluster, or cloud resource (e.g. Google Colab) capable of running Python 3.
Data¶
Numeric data file in the form of an .xlsx file. There must be at least some target feature data so that models can be fit.
The first row of the data file (each column) should have a text name (string), which will be used when importing data with MAST-ML.
For more information and examples of how to import data into MAST-ML, see the mastml/examples folder and the MASTML_examples_dataimport.ipynb Jupyter notebook.
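For illustration, a conformant data file could be written with pandas as follows; the column names and values here are hypothetical examples:

import pandas as pd

# Each column gets a string name in the first row; 'bandgap' serves as the target feature
df = pd.DataFrame({'composition': ['GaAs', 'InP', 'SiC'], 'bandgap': [1.42, 1.34, 3.26]})
df.to_excel('my_data.xlsx', index=False)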
Terminal installation (Linux or Linux-like terminal environment, e.g. Mac)¶
This documentation provides a few ways to install MAST-ML. If you don’t have Python 3 on your system, begin with the section “Install Python3”. If you already have Python 3 installed, skip ahead to creating a conda or virtualenv environment.
Install Python3¶
Install Python 3: for easier installation of numpy and scipy dependencies, download Anaconda from https://www.continuum.io/downloads
Create a conda environment (if using Anaconda)¶
Create an anaconda python environment:
conda create --name MAST_ML_env python=3.7
conda activate MAST_ML_env
Create a virtualenv environment (if not using Anaconda)¶
Create a virtualenv environment:
python3 -m venv MAST_ML_env
source MAST_ML_env/bin/activate
Install the MAST-ML package via PyPi¶
Pip install MAST-ML from PyPi:
pip install mastml
Install the MAST-ML package via Git¶
As an alternative to PyPi, you can git clone the Github repository, for example:
git clone --single-branch --branch master https://github.com/uw-cmg/MAST-ML
Once the branch is downloaded, install the needed dependencies with:
pip install -r MAST-ML/requirements.txt
Note that MAST-ML will need to be imported from within the MAST-ML directory, as mastml is not located in the usual spot where Python looks for imported packages.
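If you want to import mastml from outside that directory, one simple workaround is to add the clone location to your Python path; the path below is an example:

import sys

# Point Python at the cloned repository (replace with your actual clone location)
sys.path.insert(0, '/path/to/MAST-ML')
import mastml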
Set up Jupyter notebooks¶
No separate setup for Jupyter notebooks is necessary; once MAST-ML has been run and has created a notebook, navigate in the terminal to the directory housing the notebook and type:
jupyter notebook
and a browser window with the notebook should appear.
Imports that don’t work¶
First try a conda install, and if that gives errors, try a pip install. Example: conda install numpy or pip install numpy. Put the path to the installed MAST-ML folder in your PYTHONPATH if it isn’t there already.
Windows installation¶
Install Python3¶
Install Python 3: for easier installation of numpy and scipy dependencies, download Anaconda from https://www.continuum.io/downloads
Create a conda environment¶
From the Anaconda Navigator, go to Environments and create a new environment. Select Python version 3.6.
Under “Channels”, along with defaults channel, “Add” the “materials” channel. The Channels list should now read:
defaults
materials
(may be the “matsci” channel instead of the “materials” channel; this channel is used to install pymatgen)
Set up the Spyder IDE and Jupyter notebooks¶
From the Anaconda Navigator, go to Home. With the newly created environment selected, click on “Install” below Jupyter, then click on “Install” below Spyder.
Once MAST-ML has been run and has created a Jupyter notebook (run MAST-ML from a location inside the Anaconda environment, so that the notebook will also be inside the environment tree), go to Environments in the Anaconda Navigator, make sure the environment is selected, press the green arrow button, and select Open jupyter notebook.
Install the MAST-ML package¶
Pip install MAST-ML from PyPi:
pip install mastml
Alternatively, git clone the Github repository, for example:
git clone https://github.com/uw-cmg/MAST-ML
Clone from “master” unless instructed specifically to use another branch. Ask for access if you cannot find this code.
Check status.github.com for issues if you believe GitHub may be malfunctioning.
Run:
python setup.py install
Imports that don’t work¶
First try a conda install, and if that gives errors, try a pip install. Example: conda install numpy or pip install numpy. Put the path to the installed MAST-ML folder in your PYTHONPATH if it isn’t there already.
Windows 10 install: step-by-step guide (credit Joe Kern)¶
First, figure out whether your computer is 32- or 64-bit. Type “system information” in your search bar and look at System Type: x86 is a 32-bit computer, x64 is a 64-bit computer.
Second, download an environment manager. Environments are directories on your computer that store dependencies. For instance, one program you run might depend on version 1.0 of another program x, while another program might depend on version 2.0 of program x. Having multiple environments allows you to utilize both programs and their dependencies on your computer. I recommend you download Anaconda, not because it is the best, but because it is an environment manager I know how to get working with MAST-ML. Feel free to experiment with other managers. Download the Python 3.7 version at https://www.anaconda.com/distribution/ and follow the installation instructions. Pick the graphical installer that corresponds to your computer system (64-bit or 32-bit).
Third, download Visual Studio. Some of the MAST-ML dependencies require C++ redistributables in order to run, and they will look in the Visual Studio folder for these redistributables when they install. There may be another way to get these C++ redistributables without Visual Studio, but I am not sure how to do that. Go here to download: https://visualstudio.microsoft.com/downloads/#build-tools-for-visual-studio-2017
Fourth, install Visual Studio with the C++ build tools and restart the computer.
Fifth, open Anaconda Navigator. Click Environments, then Create at the bottom. Name it MASTML and make it Python 3.6. DO NOT make it Python 3.7 or Python 2.6/2.7; some dependencies do not work with those other versions.
Sixth, click the arrow next to your environment name and open a command shell. In the command line, type “pip install ” and then copy and paste the dependency names from the dependency file into your command prompt.
Seventh, test whether MAST-ML runs. There are multiple ways to do this, but I will outline one. Navigate to your MAST-ML folder in the command prompt. To do this, you need to know the command ‘cd’, which changes the directory your command prompt is operating in. Right-click the folder, click Properties, and copy the location; in the command prompt, type ‘cd’ and paste the location after it. Add the folder name (e.g. ‘MAST-ML’, or whatever your folder is called) to the end of the pasted value so you can get to mastml.
Finally, copy and paste python -m mastml.mastml_driver mastml/tests/conf/example_input.conf mastml/tests/csv/example_data.csv -o results/mastml_tutorial into your command prompt and run it. If it all works, you’re good to go.
Getting Started with MAST-ML¶
Installing MAST-ML¶
If you have not done so, the first step is to install MAST-ML. More information on how to install MAST-ML can be found by navigating to the “Installing MAST-ML” tab on the left-hand side of this documentation page.
Performing your first MAST-ML run¶
Once MAST-ML is installed, you are ready to perform your first MAST-ML run
The first MAST-ML tutorial can be found under the mastml/examples folder, and is named MASTML_Tutorial_1_GettingStarted.ipynb
This first notebook can also be opened on Google Colab via this link: https://colab.research.google.com/github/uw-cmg/MAST-ML/blob/master/examples/MASTML_Tutorial_1_GettingStarted.ipynb
Open this first example notebook either in Google Colab if running on the cloud or locally by starting a Jupyter notebook session. There are explanations for each cell of the notebook. Reading through and running this tutorial should take about 10 minutes. At the end, you will have performed your first MAST-ML run!
Once complete, there are a series of other example/tutorial notebooks that can be found in the mastml/examples folder on Github.
Note that all of the example notebooks can be opened via Google Colab by clicking on the corresponding Google Colab badge icon in the README section of the Github repo master branch.
Overview of MAST-ML tutorials and examples¶
MAST-ML tutorials¶
There are numerous MAST-ML tutorial and example Jupyter notebooks. These notebooks can be found in the mastml/examples folder. Here, a brief overview of the contents of each tutorial is provided:
Tutorial 1: Getting Started (MASTML_Tutorial_1_GettingStarted.ipynb):
Tutorial 1 link: https://colab.research.google.com/github/uw-cmg/MAST-ML/blob/master/examples/MASTML_Tutorial_1_GettingStarted.ipynb
In this notebook, we will perform a first, basic run (a condensed code sketch follows this list) where we:
Import example data of Boston housing prices
Define a data preprocessor to normalize the data
Define a linear regression model and kernel ridge model to fit the data
Evaluate each of our models with 5-fold cross validation
Add a random forest model to our run and compare model performance
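For a flavor of what the notebook contains, below is a condensed sketch of such a run built from the classes documented later in these pages; the exact keyword arguments may differ from the tutorial, so treat this as an outline rather than a verbatim excerpt:

from mastml.datasets import SklearnDatasets
from mastml.preprocessing import SklearnPreprocessor
from mastml.models import SklearnModel
from mastml.data_splitters import SklearnDataSplitter

# Import example data (the scikit-learn Boston housing dataset)
X, y = SklearnDatasets(as_frame=True).load_boston()

# Normalize the data, define two models, and evaluate each with 5-fold cross validation
preprocessor = SklearnPreprocessor(preprocessor='StandardScaler', as_frame=True)
models = [SklearnModel(model='LinearRegression'), SklearnModel(model='KernelRidge')]
splitter = SklearnDataSplitter(splitter='RepeatedKFold', n_repeats=1, n_splits=5)
splitter.evaluate(X=X, y=y, models=models, preprocessor=preprocessor, savepath='results')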
Tutorial 2: Data Import and Cleaning (MASTML_Tutorial_2_DataImport.ipynb):
Tutorial 2 link: https://colab.research.google.com/github/uw-cmg/MAST-ML/blob/master/examples/MASTML_Tutorial_2_DataImport.ipynb
In this notebook, we will learn different ways to download and import data into a MAST-ML run:
Import model datasets from scikit-learn
Conduct different data cleaning methods
Import and prepare a real dataset that is stored locally
Download data from various materials databases
Tutorial 3: Feature Generation and Selection (MASTML_Tutorial_3_FeatureEngineering.ipynb):
Tutorial 3 link: https://colab.research.google.com/github/uw-cmg/MAST-ML/blob/master/examples/MASTML_Tutorial_3_FeatureEngineering.ipynb
In this notebook, we will learn different ways to generate, preprocess, and select features:
Generate features based on material composition
Generate one-hot encoded features based on group labels
Preprocess features to be normalized
Select features using an ensemble model-based approach
Generate learning curves using a basic feature selection approach
Select features using forward selection
Tutorial 4: Model Fits and Data Split Tests (MASTML_Tutorial_4_Models_and_Tests.ipynb):
Tutorial 4 link: https://colab.research.google.com/github/uw-cmg/MAST-ML/blob/master/examples/MASTML_Tutorial_4_Models_and_Tests.ipynb
In this notebook, we will learn how to run a few different types of models on a select dataset, and conduct a few different types of data splits to evaluate our model performance. In this tutorial, we will:
Run a variety of model types from the scikit-learn package
Run a bootstrapped ensemble of neural networks
Compare performance of scikit-learn’s gradient boosting method and XGBoost
Compare performance of scikit-learn’s neural network and Keras-based neural network regressor
Compare model performance using random k-fold cross validation and leave out group cross validation
Explore the limits of model performance when up to 90% of data is left out using leave out percent cross validation
Tutorial 5: Left-out data, Nested cross-validation, and Optimized models (MASTML_Tutorial_5_NestedCV_and_OptimizedModels.ipynb):
Tutorial 5 link: https://colab.research.google.com/github/uw-cmg/MAST-ML/blob/master/examples/MASTML_Tutorial_5_NestedCV_and_OptimizedModels.ipynb
In this notebook, we will perform more advanced model fitting routines, including nested cross validation and hyperparameter optimization. In this tutorial, we will learn how to use MAST-ML to:
Assess performance on manually left-out test data
Perform nested cross validation to assess model performance on unseen data
Optimize the hyperparameters of our models to create the best model
Tutorial 6: Model Error Analysis, Uncertainty Quantification (MASTML_Tutorial_6_ErrorAnalysis_UncertaintyQuantification.ipynb):
Tutorial 6 link: https://colab.research.google.com/github/uw-cmg/MAST-ML/blob/master/examples/MASTML_Tutorial_6_ErrorAnalysis_UncertaintyQuantification.ipynb
In this notebook tutorial, we will learn about how MAST-ML can be used to:
Assess the true and predicted errors of our model, and some useful measures of their statistical distributions
Explore different methods of quantifying and calibrating model uncertainties.
Compare the uncertainty quantification behavior of Bayesian and ensemble-based models.
Tutorial 7: Model predictions with calibrated error bars on new data, hosting on Foundry/DLHub (MASTML_Tutorial_7_ModelPredictions_with_CalibratedErrorBars_HostModelonFoundry.ipynb):
In this notebook tutorial, we will learn about how MAST-ML can be used to:
Fit a model and use it to predict on new data.
Use our model to predict on new data using only composition as input.
Use nested CV to obtain error bar recalibration parameters and get predictions with calibrated error bars.
Code Documentation: Data Cleaning¶
mastml.data_cleaning Module¶
This module provides various methods for cleaning data that has been imported into MAST-ML, prior to model fitting.
- DataCleaning:
Class that enables easy use of various data cleaning methods, such as removal of missing values, different modes of data imputation, or using principal component analysis to interpolate missing values.
- DataUtilities:
Support class used to evaluate some basic statistics of imported data, such as its distribution, mean, etc. Also provides a means of flagging potential outlier datapoints based on their deviation from the overall data distribution.
- PPCA:
Class used by the PCA data cleaning routine in the DataCleaning class to perform probabilistic PCA to fill in missing data.
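As a usage sketch, cleaning an imported dataset might look like the following; the evaluate call and its keyword names are assumptions based on the class descriptions above, so check the class documentation for the exact signature:

import numpy as np
import pandas as pd
from mastml.data_cleaning import DataCleaning

X = pd.DataFrame({'f1': [1.0, np.nan, 3.0], 'f2': [0.5, 0.7, np.nan]})
y = pd.Series([1.0, 2.0, 3.0])

# 'imputation' fills missing entries column-wise; 'remove' and 'ppca' would be analogous modes
X_clean, y_clean = DataCleaning().evaluate(X=X, y=y, method='imputation', savepath='results')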
Classes¶
- Dict subclass for counting hashable items.
- Class to perform various data cleaning operations, such as imputation or NaN removal
- Class that contains some basic data analysis utilities, such as flagging columns that contain problematic string entries, or flagging potential outlier values based on threshold values
- Class to generate histogram plots, such as histograms of residual values
- Class to perform probabilistic principal component analysis (PPCA) to fill in missing data.
- Imputation transformer for completing missing values.
- The year, month and day arguments are required.
Class Inheritance Diagram¶

Code Documentation: Data Splitters¶
mastml.data_splitters Module¶
This module contains a collection of methods to split data into different types of train/test sets. Data splitters are the core component to evaluating model performance.
- BaseSplitter:
Base class that handles the core MAST-ML data splitting and model evaluation workflow. This class is responsible for looping over provided feature selectors, models, and data splits and training and evaluating the model for each split, then generating the necessary plots and performance statistics. All different splitter types inherit this base class.
- SklearnDataSplitter:
Wrapper class to enable MAST-ML workflow compatible use of any data splitter contained in scikit-learn, e.g. KFold, RepeatedKFold, LeaveOneGroupOut, etc.
- NoSplit:
Class that doesn’t perform any data split. Equivalent to a “full fit” of the data where all data is used in training.
- JustEachGroup:
Class that splits data so each individual group is used as training with all other groups used as testing. Essentially the inverse of LeaveOneGroupOut, this class trains only on one group and predicts the rest, as opposed to training on all but one group and testing on the left-out group.
- LeaveCloseCompositionsOut:
Class to split data based on their compositional similarity. A useful means to separate compositionally similar compounds into the training or testing set, so that similar materials are not contained in both sets.
- LeaveOutPercent:
Method to randomly split the data based on a fraction of the total data points, rather than a designated number of splits. Enables leaving out more than 50% of the data (50% is the highest leave-out possible with KFold, where k=2), e.g. leaving out 90% of the data.
- LeaveOutTwinCV:
Another method to help separate similar data from the training and testing set. This method makes use of a general distance metric on the provided features, and flags twins as those data points within some provided distance threshold in the feature space.
- LeaveOutClusterCV:
Method to use a clustering algorithm to pre-cluster data into groups. Then, these different groups are used as each left-out data set. Basically functions as a leave out group test where the groups are automatically obtained from a clustering algorithm.
- LeaveMultiGroupOut:
Class to train the model on multiple groups at a time and test it on the rest of the data
- Bootstrap:
Method to perform bootstrap resampling, i.e. random leave-out with replacement.
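To illustrate the workflow that BaseSplitter provides, here is a sketch of wrapping a scikit-learn splitter and evaluating a model with it; the keyword names are assumptions consistent with the descriptions above, and the data here are synthetic placeholders:

import numpy as np
import pandas as pd
from mastml.data_splitters import SklearnDataSplitter
from mastml.models import SklearnModel

X = pd.DataFrame(np.random.rand(50, 4), columns=['f1', 'f2', 'f3', 'f4'])
y = pd.Series(np.random.rand(50), name='target')

model = SklearnModel(model='RandomForestRegressor')
# Wrap scikit-learn's RepeatedKFold: 5-fold cross validation, repeated twice
splitter = SklearnDataSplitter(splitter='RepeatedKFold', n_repeats=2, n_splits=5)
# evaluate() trains and tests the model on each split, then writes plots and statistics
splitter.evaluate(X=X, y=y, models=[model], savepath='results')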
Classes¶
- Class functioning as a base splitter with methods for organizing output and evaluating any mastml data splitter
- Random sampling with replacement cross-validation iterator. Provides train/test indices to split data into train/test sets while resampling the input n_bootstraps times: each time a new random split of the data is performed and then samples are drawn (with replacement) on each side of the split to build the training and test sets. (Note: Bootstrap was taken directly from the scikit-learn GitHub repository, https://github.com/scikit-learn/scikit-learn/blob/0.11.X/sklearn/cross_validation.py, as it was removed from more recent scikit-learn releases.)
- Represents a Composition, which is essentially a {element:amount} mapping type.
- This class evaluates which test data points are within and outside of the domain
- Class to calculate the atomic fraction of each element in a composition.
- Collection of functions to conduct error analysis on certain types of models (uncertainty quantification), and prepare residual and model error data for plotting, as well as recalibrate model errors with various methods
- Class to train the model on one group at a time and test it on the rest of the data. This class wraps scikit-learn's LeavePGroupsOut with P set to n-1.
- Leave-P-out where you exclude materials with compositions close to those in the test set
- Class to train the model on multiple groups at a time and test it on the rest of the data
- Class to generate train/test split using clustering.
- Class to train the model using a certain percentage of data as training data
- Class to remove data twins from the test data.
- Class containing access to a wide range of metrics from scikit-learn and a number of MAST-ML custom-written metrics
- Unsupervised learner for implementing neighbor searches.
- Class for having a "null" transform where the output is the same as the input.
- Class for having a "null" transform where the output is the same as the input.
- Class to just train the model on the training data and test it on that same data.
- Class to wrap any scikit-learn based data splitter, e.g.
- The year, month and day arguments are required.
Class Inheritance Diagram¶

Code Documentation: Datasets¶
mastml.datasets Module¶
This module provides various methods for importing data into MAST-ML.
- SklearnDatasets:
Enables easy import of model datasets from scikit-learn, such as the Boston housing data, Friedman datasets, etc.
- LocalDatasets:
Main method for importing datasets that are stored in an accessible path. The main file format is an Excel spreadsheet (.xls or .xlsx). This method also makes it easy to separately denote other data features that are not directly the X or y data, such as features used for grouping, extra features not used in fitting, or features that denote manually held-out test data
- FigshareDatasets:
Method to download data that is stored on Figshare, an open-source data hosting service. This class can be used to download data, then subsequently the LocalDatasets class can be used to import the data.
- FoundryDatasets:
Method to download data that is stored on the Materials Data Facility (MDF) Foundry data hosting service. This class can be used to download data, then subsequently the LocalDatasets class can be used to import the data.
- MatminerDatasets:
Method to download data that is stored as part of the matminer machine learning package (https://github.com/hackingmaterials/matminer). This class can be used to download data, then subsequently the LocalDatasets class can be used to import the data.
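A sketch of the typical local-import pattern is shown below; the file path and column names are hypothetical, and the keyword names are assumptions based on the LocalDatasets description:

from mastml.datasets import LocalDatasets

# 'bandgap' is the target column; 'group' is an extra non-X column used for grouping
d = LocalDatasets(file_path='my_data.xlsx', target='bandgap',
                  group_column='group', as_frame=True)
data = d.load_data()  # returns the parsed data, e.g. the X features and y target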
Classes¶
- Class to download datasets hosted on Figshare.
- Forge fetches metadata and files from the Materials Data Facility.
- Class to download datasets hosted on Materials Data Facility
- Class to handle import and organization of a dataset stored locally.
- Class to download datasets hosted from the Matminer package's Figshare page.
- Class wrapping the sklearn.datasets functionality for easy import of toy datasets from sklearn.
Class Inheritance Diagram¶

Code Documentation: Error Analysis¶
mastml.error_analysis Module¶
This module contains classes for quantifying predicted model errors (uncertainty quantification), preparing provided residual (true error) and predicted model error data for plotting (e.g. residual vs. error plots), and recalibrating model errors using the method of Palmer et al.
- ErrorUtils:
Collection of functions to conduct error analysis on certain types of models (uncertainty quantification), and prepare residual and model error data for plotting, as well as recalibrate model errors with various methods
- CorrectionFactors:
Class for performing recalibration of model errors (uncertainty quantification) based on the method from the work of Palmer et al.
Classes¶
- Class for performing recalibration of model errors (uncertainty quantification) based on the method from the work of Palmer et al.
- Collection of functions to conduct error analysis on certain types of models (uncertainty quantification), and prepare residual and model error data for plotting, as well as recalibrate model errors with various methods
Class Inheritance Diagram¶

Code Documentation: Feature Generators¶
mastml.feature_generators Module¶
This module contains a collection of classes for generating input features to fit machine learning models to.
- BaseGenerator:
Base class to provide MAST-ML type functionality to all feature generators. All other feature generator classes should inherit from this base class
- ElementalFractionGenerator:
Class written to encode element fractions in materials compositions as a full 118-element vector per material, where each element in the vector represents an element on the periodic table.
- ElementalFeatureGenerator:
Class written for MAST-ML to generate features for material compositions based on properties of the elements comprising the composition. A number of mathematically derived variants are included, like arithmetic average, composition-weighted average, range, max, and min. This generator also supports sublattice-based generation, where the elemental features can be averaged for each sublattice as opposed to just the total composition together. To use the sublattice feature of this generator, composition strings must include square brackets to separate the sublattices, e.g. the perovskite material La0.75Sr0.25MnO3 would be written as [La0.75Sr0.25][Mn][O3]
- PolynomialFeatureGenerator:
Class used to construct new features based on a polynomial expansion of existing features. The degree of the polynomial is given as input. For example, for two features x1 and x2, the quadratic features x1^2, x2^2 and x1*x2 would be generated if the degree is set to 2.
- OneHotGroupGenerator:
Class used to create a set of one-hot encoded features based on a single feature containing assorted categories. For example, if a feature contains strings denoting each data point as belonging to one of three groups such as “metal”, “semiconductor”, “insulator”, then the generated one-hot features are three feature columns containing a 1 or 0 to denote which group each data point is in
- OneHotElementEncoder:
Class used to create a set of one-hot encoded features based on elements present in a supplied chemical composition string. For example, if the data set contains alloys of materials with chemical formulas such as “GaAs”, “InAs”, “InP”, etc., then the generated one-hot features are four feature columns containing a 1 or 0 to denote whether a particular data point contains each of the unique elements, in this case Ga, As, In, or P.
- MaterialsProjectFeatureGenerator:
Class used to search the Materials Project database for computed material property information for the supplied composition. This only works if the material composition matches an entry present in the Materials Project. Will return material properties like formation energy, volume, electronic bandgap, elastic constants, etc.
- MatminerFeatureGenerator:
Class used to combine various composition and structure-based feature generation routines in the matminer package into MAST-ML. The use of structure-based features will require pymatgen structure objects in the input dataframe, while composition-based features require only a composition string. See the class documentation for more information on the different types of feature generation this class supports.
- DataframeUtilities:
Collection of helper routines for various common dataframe operations, like concatenation, merging, etc.
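As a sketch of composition-based feature generation, the keyword names and evaluate call below are assumptions based on the ElementalFeatureGenerator description above:

import pandas as pd
from mastml.feature_generators import ElementalFeatureGenerator

df = pd.DataFrame({'composition': ['GaAs', 'InP', 'SiC'], 'bandgap': [1.42, 1.34, 3.26]})
y = df['bandgap']

# feature_types selects which elemental statistics to compute for each composition
generator = ElementalFeatureGenerator(composition_df=df['composition'],
                                      feature_types=['composition_avg', 'max', 'min'])
X_generated, y = generator.evaluate(X=df, y=y, savepath='results')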
Classes¶
- Base class for all estimators in scikit-learn.
- Class functioning as a base generator to support directory organization and evaluating different feature generators
- Represents a Composition, which is essentially a {element:amount} mapping type.
- Class of basic utilities for dataframe manipulation, and exchanging between dataframes and numpy arrays
- Enum representing an element in the periodic table.
- Class that is used to create elemental-based features from material composition strings
- Class that is used to create an 86-element vector of element fractions from material composition strings
- A class to conveniently interface with the Materials Project REST interface. The recommended way to use MPRester is with the "with" context manager to ensure that sessions are properly closed after usage.
- Class that wraps MaterialsProjectFeatureGeneration, giving it scikit-learn structure
- Class to wrap feature generator routines contained in the matminer package to more neatly conform to the MAST-ML working environment, and have all under a single class
- Class to generate new categorical features (i.e.
- Encode categorical features as a one-hot numeric array.
- Class to generate one-hot encoded values from a list of categories using scikit-learn's one hot encoder method. More info at: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
- Class to generate polynomial features using scikit-learn's polynomial features method. More info at: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
- Generate polynomial and interaction features.
- Mixin class for all transformers in scikit-learn.
- The year, month and day arguments are required.
Class Inheritance Diagram¶

Code Documentation: Feature Selectors¶
mastml.feature_selectors Module¶
This module contains a collection of routines to perform feature selection.
- BaseSelector:
Base class to have MAST-ML like workflow functionality for feature selectors. All feature selection routines should inherit this base class
- SklearnFeatureSelector:
Class to wrap feature selectors from the scikit-learn package and make them have functionality from BaseSelector. Any scikit-learn feature selector from sklearn.feature_selection can be used by providing the name of the selector class as a string.
- NoSelect:
Class that performs no feature selection and just uses all features in the dataset. Needed as a placeholder when evaluating data splits in a MAST-ML run where feature selection is not performed.
- EnsembleModelFeatureSelector:
Class to select features based on the feature importance scores obtained when fitting an ensemble-based model. Any model with the feature_importances_ attribute will work, e.g. sklearn’s RandomForestRegressor and GradientBoostingRegressor.
- PearsonSelector:
Class that selects features based on their Pearson correlation score with the target data. Can also be used to assess Pearson correlation between features for use to reduce dimensionality of the feature space.
- MASTMLFeatureSelector:
Class written for MAST-ML to perform more flexible forward selection than what can be found in scikit-learn. Allows the user to specify a particular model and cross validation routine for selecting features, as well as the ability to forcibly select certain features on the outset.
- ShapFeatureSelector:
Class to select features based on how much each of the features contribute to the model in predicting the target data.
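A sketch of ensemble-based selection is shown below; n_features_to_select is an assumed keyword name, and any model exposing feature_importances_ can stand in for the random forest:

import numpy as np
import pandas as pd
from mastml.feature_selectors import EnsembleModelFeatureSelector
from mastml.models import SklearnModel

X = pd.DataFrame(np.random.rand(50, 20))
y = pd.Series(np.random.rand(50))

model = SklearnModel(model='RandomForestRegressor')
# Keep the 10 features with the largest random forest importance scores
selector = EnsembleModelFeatureSelector(model=model, n_features_to_select=10)
X_selected = selector.evaluate(X=X, y=y, savepath='results')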
Functions¶
- Pearson correlation coefficient and p-value for testing non-correlation.
- Method that calculates the root mean squared error (RMSE)
- Function to get the correlation between two sets of features selected from two different methods of feature selection
Classes¶
- Base class for all estimators in scikit-learn.
- Base class that forms foundation of MAST-ML feature selectors
- Class custom-written for MAST-ML to conduct selection of features with ensemble model feature importances
- K-Folds cross-validator
- Class custom-written for MAST-ML to conduct forward selection of features with flexible model and cv scheme
- Class for having a "null" transform where the output is the same as the input.
- Class custom-written for MAST-ML to conduct selection of features based on Pearson correlation coefficient between features and target.
- Class custom-written for MAST-ML to conduct selection of features with SHAP
- Class that wraps scikit-learn feature selection methods with some new MAST-ML functionality
- Mixin class for all transformers in scikit-learn.
- The year, month and day arguments are required.
Class Inheritance Diagram¶

Code Documentation: Hyperparameter Optimization¶
mastml.hyper_opt Module¶
This module contains methods for optimizing hyperparameters of models
- HyperOptUtils:
This class contains various helper utilities for setting up and running hyperparameter optimization
- GridSearch:
This class performs a basic grid search over the parameters and value ranges of interest to find the best set of model hyperparameters in the provided grid of values
- RandomizedSearch:
This class performs a randomized search over the parameters and value ranges of interest to find the best set of model hyperparameters in the provided grid of values. Often faster than GridSearch. Instead of a grid of values, it takes a probability distribution name as input (e.g. “norm”)
- BayesianSearch:
This class performs a Bayesian search over the parameters and value ranges of interest to find the best set of model hyperparameters in the provided grid of values. Often faster than GridSearch.
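A rough sketch of a grid search is given below; the param_names/param_values/scoring grammar and the fit call are assumptions, so consult the class documentation for the exact form:

import numpy as np
import pandas as pd
from mastml.hyper_opt import GridSearch
from mastml.models import SklearnModel

X = pd.DataFrame(np.random.rand(50, 4))
y = pd.Series(np.random.rand(50))

model = SklearnModel(model='KernelRidge')
# Search a small grid of KernelRidge alpha and gamma values, scored by RMSE
hyperopt = GridSearch(param_names=['alpha', 'gamma'],
                      param_values=[[0.001, 0.01, 0.1, 1.0], [0.001, 0.01, 0.1, 1.0]],
                      scoring='root_mean_squared_error')
best_model = hyperopt.fit(X=X, y=y, model=model, cv=5)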
Classes¶
- Bayesian optimization over hyper parameters.
- Class to conduct a Bayesian search to find optimized model hyperparameter values
- Search space dimension that can take on categorical values.
- Class to conduct a grid search to find optimized model hyperparameter values
- Exhaustive search over specified parameter values for an estimator.
- Helper class providing useful methods for other hyperparameter optimization classes.
- Search space dimension that can take on integer values.
- Class containing access to a wide range of metrics from scikit-learn and a number of MAST-ML custom-written metrics
- Class to conduct a randomized search to find optimized model hyperparameter values
- Randomized search on hyper parameters.
- Search space dimension that can take on any real value.
- Class to wrap any sklearn estimator, and provide some new dataframe functionality
Class Inheritance Diagram¶

Code Documentation: Learning Curve¶
mastml.learning_curve Module¶
This module contains methods to construct learning curves, which evaluate some cross-validation performance metric (e.g. RMSE) as a function of amount of training data (i.e. a data learning curve) or as a function of the number of features used in the fitting (i.e. a feature learning curve).
- LearningCurve:
Class used to construct data learning curves and feature learning curves
Classes¶
- K-Folds cross-validator
- This class is used to construct learning curves, both in the form of model performance vs.
- Class containing methods for constructing line plots
- Class containing access to a wide range of metrics from scikit-learn and a number of MAST-ML custom-written metrics
- Class that wraps scikit-learn feature selection methods with some new MAST-ML functionality
- The year, month and day arguments are required.
Class Inheritance Diagram¶

Code Documentation: Mastml¶
mastml.mastml Module¶
This module contains routines to set up and manage the metadata for a MAST-ML run
- Mastml:
Class to set up directories for saving the output of a MAST-ML run, and for constructing and updating a metadata summary file.
Functions¶
- Run some function in parallel.
Classes¶
- Main helper class to initialize mastml runs and create and manage run metadata
- Dictionary that remembers insertion order
- The year, month and day arguments are required.
- partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.
Class Inheritance Diagram¶

Code Documentation: Metrics¶
mastml.metrics Module¶
This module contains a metrics class for construction and evaluation of various regression score metrics between true and model predicted data.
- Metrics:
Class to construct and evaluate a list of regression metrics of interest. The full list of available metrics can be obtained from Metrics()._metric_zoo()
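A sketch of evaluating a few metrics on true and predicted values is shown below; the metrics_list keyword is an assumption, and the metric names follow the functions listed in this module:

import numpy as np
from mastml.metrics import Metrics

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])

metrics = Metrics(metrics_list=['r2_score', 'root_mean_squared_error'])
stats = metrics.evaluate(y_true=y_true, y_pred=y_pred)  # mapping of metric name to value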
Functions¶
- Method that calculates the adjusted R^2 value
- Method that calculates the R^2 value
- Method that calculates the R^2 value without fitting the y-intercept
- Method that calculates the root mean squared error (RMSE) of a set of data, divided by the standard deviation of the training data set.
- Method that calculates the root mean squared error (RMSE)
Classes¶
- Ordinary least squares Linear Regression.
- Class containing access to a wide range of metrics from scikit-learn and a number of MAST-ML custom-written metrics
Class Inheritance Diagram¶

Code Documentation: Models¶
mastml.models Module¶
Module for constructing models for use in MAST-ML.
- SklearnModel:
Class that wraps scikit-learn models to have MAST-ML type functionality. Providing the model name as a string and the keyword arguments for the model parameters will construct the model. Note that this class also supports construction of XGBoost models and Keras neural network models via Keras’ keras.wrappers.scikit_learn.KerasRegressor model.
- EnsembleModel:
Class that constructs a model which is an ensemble of many base models (sometimes called weak learners). This class supports construction of ensembles of most scikit-learn regression models as well as ensembles of neural networks that are made via Keras’ keras.wrappers.scikit_learn.KerasRegressor class.
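For illustration, constructing a wrapped scikit-learn model and an ensemble of weak learners might look like the following; model keyword arguments pass through to the underlying estimator, and n_ensemble is an assumed name for the ensemble size:

from mastml.models import SklearnModel, EnsembleModel

# Any scikit-learn regressor can be named as a string; extra kwargs pass through
model = SklearnModel(model='GaussianProcessRegressor')

# A bagged ensemble of 10 linear regression weak learners
ensemble = EnsembleModel(model='LinearRegression', n_ensemble=10)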
Classes¶
- A Bagging regressor.
- Base class for all estimators in scikit-learn.
- Class used to construct ensemble models with a particular number and type of weak learner (base model).
- Gaussian process regression (GPR).
- Class to wrap any sklearn estimator, and provide some new dataframe functionality
- Mixin class for all transformers in scikit-learn.
Class Inheritance Diagram¶

Code Documentation: Plots¶
mastml.plots Module¶
This module contains classes used for generating different types of analysis plots
- Scatter:
This class contains a variety of scatter plot types, e.g. parity (predicted vs. true) plots
- Error:
This class contains plotting methods used to better quantify and visualize model errors and uncertainty.
- Histogram:
This class contains methods for constructing histograms of data distributions and visualization of model residuals.
- Line:
This class contains methods for making line plots, e.g. for constructing learning curves of model performance vs. amount of data or number of features.
Functions¶
- Return the ceiling of x as an Integral.
- Method to check the dimensions of supplied data.
- Build a text report showing the main classification metrics.
- Calculate the width and height for a figure with a specified aspect ratio.
- Method to obtain a sensible divisor based on range of two values
- Return the logarithm of x to the given base.
- Method to make the x and y ticks for each axis the same.
- Method to make matplotlib figure and axes objects.
- Method to make square shaped matplotlib figure and axes objects.
- Helper function to make collections of different types of plots after a single or multiple data splits are evaluated.
- Draw a box to mark the location of an area represented by an inset axes.
- Method to return mean of a list or equivalent array with NaN values
- Method to create a range of values, including the specified start and end points, with nicely spaced intervals
- Method to return standard deviation of a list or equivalent array with NaN values
- Function to plot the average score of each feature against their occurrence in all of the splits
- DEPRECATED: Function plot_confusion_matrix is deprecated in 1.0 and will be removed in 1.2.
- Function to plot the occurrence of each feature in all of the splits
- Method that prints stats onto the plot.
- R^2 (coefficient of determination) regression score function.
- Method to recursively find the max value of an array of iterables.
- Method to recursively return max and min of values or iterables in array
- Method to recursively find the min value of an array of iterables.
- Method to return a rounded down number
- Method to return a rounded up number
- Method to obtain number of decimal places to report on plots
- Method that converts a metric object into a string for displaying on a plot
- Method used to trim a set of arrays to make all arrays the same shape
- Create an anchored inset axes by scaling a parent axes.
Classes¶
- Classification plots
- Class to make plots related to model error assessment and uncertainty quantification
- Collection of functions to conduct error analysis on certain types of models (uncertainty quantification), and prepare residual and model error data for plotting, as well as recalibrate model errors with various methods
- The top level container for all the plot elements.
- A class for storing and manipulating font properties.
- Class to generate histogram plots, such as histograms of residual values
- Class containing methods for constructing line plots
- Ordinary least squares Linear Regression.
- Class containing access to a wide range of metrics from scikit-learn and a number of MAST-ML custom-written metrics
- Exception class to raise if estimator is used before fitting.
- Class to generate scatter plots, such as parity plots showing true vs.
- Representation of a kernel-density estimate using Gaussian kernels.
Class Inheritance Diagram¶

Code Documentation: Preprocessing¶
mastml.preprocessing Module¶
This module contains methods to perform data preprocessing, such as various standardization/normalization methods
- BasePreprocessor:
Base class that adds some MAST-ML type functionality to other preprocessors. Other preprocessor classes all inherit this base class
- SklearnPreprocessor:
Class that wraps any preprocessor method from scikit-learn (e.g. StandardScaler) to have MAST-ML type functionality
- NoPreprocessor:
Class that performs no preprocessing. A preprocessor is needed in the MAST-ML evaluation of data splits. If no preprocessing is desired, then this NoPreprocessor class is invoked by default
- MeanStdevScaler:
Preprocessor class which extends scikit-learn’s StandardScaler to scale the dataset to a particular user-specified mean and standard deviation value
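A sketch of typical preprocessing usage is shown below; the as_frame keyword and the MeanStdevScaler arguments are assumptions based on the descriptions above:

import numpy as np
import pandas as pd
from mastml.preprocessing import SklearnPreprocessor, MeanStdevScaler

X = pd.DataFrame(np.random.rand(50, 4))

# Wrap scikit-learn's StandardScaler and keep dataframe output
scaler = SklearnPreprocessor(preprocessor='StandardScaler', as_frame=True)
X_scaled = scaler.fit_transform(X)

# Scale each feature to a user-specified mean and standard deviation
custom = MeanStdevScaler(mean=0, stdev=1)
X_custom = custom.fit_transform(X)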
Classes¶
- Base class for all estimators in scikit-learn.
- Base class to provide new methods beyond sklearn fit_transform, such as dataframe support and directory management
- Class designed to normalize input data to a specified mean and standard deviation
- Class for having a "null" transform where the output is the same as the input.
- Class to wrap any scikit-learn preprocessor, e.g.
- Mixin class for all transformers in scikit-learn.
- The year, month and day arguments are required.
Class Inheritance Diagram¶

Code Documentation: Baseline Tests¶
mastml.baseline_tests Module¶
This module contains baseline tests for models
- Baseline_tests:
Class that contains the tests for the models
Classes¶
- Class that contains the baseline tests for the models (Baseline_tests, described above)
- Class containing access to a wide range of metrics from scikit-learn and a number of MAST-ML custom-written metrics
Class Inheritance Diagram¶

Code Documentation: MAST-ML Predictor¶
mastml.mastml_predictor Module¶
This module contains methods for easily making new predictions on test data once a suitable model has been trained. Also available is output of calibrated uncertainties for each prediction.
- make_prediction:
Method used to take a saved preprocessor, model and calibration file and output predictions and calibrated uncertainties on new test data.
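A sketch of the intended call pattern is shown below; the file names are hypothetical, and the keyword names are assumptions based on the description above:

from mastml.mastml_predictor import make_prediction

# The saved model, preprocessor, and calibration files come from a prior MAST-ML run
pred_df = make_prediction(X_test='new_compositions.xlsx',
                          model='model.pkl',
                          preprocessor='preprocessor.pkl',
                          calibration_file='recalibration_parameters.xlsx')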
Functions¶
- Method used to take a saved preprocessor, model and calibration file and output predictions and calibrated uncertainties on new test data
- Prediction script, same functionality as make_prediction above, but tailored for model running on DLHub/Foundry