Welcome to Materials Simulation Toolkit for Machine Learning (MAST-ML)’s documentation!¶
Acknowledgements¶
Materials Simulation Toolkit for Machine Learning (MAST-ML)
MAST-ML is an open-source Python package designed to broaden and accelerate the use of machine learning in materials science research
Contributors
University of Wisconsin-Madison Computational Materials Group:
- Prof. Dane Morgan
- Dr. Ryan Jacobs
- Dr. Tam Mayeshiba
- Ben Afflerbach
- Dr. Henry Wu
University of Kentucky contributors:
- Luke Harold Miles
- Robert Max Williams
- Prof. Raphael Finkel
MAST-ML documentation:
An overview of code documentation and tutorials for getting started with MAST-ML can be found here
Funding
This work was and is funded by the National Science Foundation (NSF) SI2 award No. 1148011 and DMREF award number DMR-1332851
Citing MAST-ML
If you find MAST-ML useful, please cite the following publication:
Jacobs, R., Mayeshiba, T., Afflerbach, B., Miles, L., Williams, M., Turner, M., Finkel, R., Morgan, D., “The Materials Simulation Toolkit for Machine Learning (MAST-ML): An automated open source toolkit to accelerate data-driven materials research”, Computational Materials Science 175 (2020), 109544. https://doi.org/10.1016/j.commatsci.2020.109544
Code Repository
MAST-ML is available on PyPI: pip install mastml
MAST-ML is available on GitHub: https://github.com/uw-cmg/MAST-ML
git clone --single-branch --branch master https://github.com/uw-cmg/MAST-ML
Installing MAST-ML¶
Hardware and Data Requirements¶
Hardware¶
A PC or Mac capable of running Python 3.
Data¶
- Numeric data file in .csv or .xlsx form. There must be at least some target feature data, so that models can be fit.
- The first row of the file (i.e. each column header) should be a text name (string), which is how columns will be referenced later in the input file.
- If working in a Jupyter environment, you can also pass in a pandas DataFrame directly.
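The requirements above can be sketched with pandas. This is an illustrative example (the column names are stand-ins, not those of the MAST-ML example data): a header row of string names, numeric feature columns, and at least one target column.

```python
import pandas as pd

# A minimal frame matching the expected layout: string column headers,
# numeric feature columns, and a target column to fit models to.
# All names here are illustrative, not from the MAST-ML example files.
df = pd.DataFrame({
    "feature_1": [1.0, 2.0, 3.0],
    "feature_2": [0.5, 0.7, 0.9],
    "target": [10.0, 20.0, 30.0],
})

# Every column header must be a string, since columns are referenced
# by name in the input file.
assert all(isinstance(col, str) for col in df.columns)
```

In a Jupyter session this DataFrame could be passed to MAST-ML directly; otherwise, save it first with df.to_csv(...) or df.to_excel(...).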
Terminal installation (Linux or linux-like terminal on Mac)¶
Install Python3¶
Install Python 3: for easier installation of numpy and scipy dependencies, download Anaconda from https://www.continuum.io/downloads
Create a conda environment¶
Create an environment:
conda create --name MAST_ML python=3.7
conda activate MAST_ML
pip install mastml
Set up Jupyter notebooks¶
There is no separate setup necessary for Jupyter notebooks; once MAST-ML has been run and has created a notebook, navigate in the terminal to the directory housing the notebook and type:
jupyter notebook
and a browser window with the notebook should appear.
Install the MAST-ML package¶
Pip install MAST-ML from PyPi:
pip install mastml
Alternatively, git clone the Github repository, for example:
git clone https://github.com/uw-cmg/MAST-ML
Clone from the “master” branch unless specifically instructed to use another branch. Ask for access if you cannot find this code.
Check status.github.com for issues if you believe GitHub may be malfunctioning.
Run:
python setup.py install
Imports that don’t work¶
First try a conda install and, if that gives errors, try a pip install (for example: conda install numpy or pip install numpy). Put the path to the installed MAST-ML folder in your PYTHONPATH if it isn’t there already.
Windows installation¶
Install Python3¶
Install Python 3: for easier installation of numpy and scipy dependencies, download Anaconda from https://www.continuum.io/downloads
Create a conda environment¶
From the Anaconda Navigator, go to Environments and create a new environment. Select Python version 3.6.
Under “Channels”, along with defaults channel, “Add” the “materials” channel. The Channels list should now read:
defaults
materials
(may be the “matsci” channel instead of the “materials” channel; this channel is used to install pymatgen)
Set up the Spyder IDE and Jupyter notebooks¶
From the Anaconda Navigator, go to Home. With the newly created environment selected, click “Install” below Jupyter, then click “Install” below Spyder.
Once MAST-ML has been run and has created a Jupyter notebook (run MAST-ML from a location inside the Anaconda environment, so that the notebook will also be inside the environment tree), go to Environments in the Anaconda Navigator, make sure the environment is selected, press the green arrow button, and select Open jupyter notebook.
Install the MAST-ML package¶
Pip install MAST-ML from PyPi:
pip install mastml
Alternatively, git clone the Github repository, for example:
git clone https://github.com/uw-cmg/MAST-ML
Clone from the “master” branch unless specifically instructed to use another branch. Ask for access if you cannot find this code.
Check status.github.com for issues if you believe GitHub may be malfunctioning.
Run:
python setup.py install
Imports that don’t work¶
First try a conda install and, if that gives errors, try a pip install (for example: conda install numpy or pip install numpy). Put the path to the installed MAST-ML folder in your PYTHONPATH if it isn’t there already.
Windows 10 install: step-by-step guide (credit Joe Kern)¶
First, figure out whether your computer is 32- or 64-bit. Type “system information” in your search bar and look at System Type: x86 means a 32-bit computer, x64 a 64-bit one.
Second, download an environment manager. Environments are directories on your computer that store dependencies. For instance, one program you run might depend on version 1.0 of another program x, while a different program depends on version 2.0 of program x. Having multiple environments allows you to utilize both programs and their dependencies on your computer. I recommend you download Anaconda, not because it is the best, but because it is an environment manager I know how to get working with MAST-ML. Feel free to experiment with other managers. Download the Python 3.7 version at https://www.anaconda.com/distribution/ and follow the installation instructions, picking the graphical installer that corresponds to your computer system (64-bit or 32-bit).
Third, download Visual Studio. Some of the MAST-ML dependencies require C++ redistributables in order to run. Visual Studio Code is a code editor made for Windows 10, and the MAST-ML dependencies will look in the Visual Studio Code folder for these C++ redistributables when they download. There may be another way to obtain these C++ redistributables without Visual Studio Code, but I am not sure how to do that. Go here to download: https://visualstudio.microsoft.com/downloads/#build-tools-for-visual-studio-2017
Fourth, install Visual Studio with the C++ build tools and restart the computer.
Fifth, open Anaconda Navigator. Click Environments, then Create at the bottom. Name it MASTML and make it Python 3.6. DO NOT make it Python 3.7, 2.6, or 2.7; some dependencies do not work with those versions.
Sixth, click the arrow next to your environment name and open a command shell. At the command line, type “pip install ” and then copy-paste the dependency names from the dependency file into your command prompt.
Seventh, test whether MAST-ML runs. There are multiple ways to do this, but I will outline one. Navigate to your MAST-ML folder in the command prompt using the ‘cd’ command, which changes the directory your command prompt is operating in. To find the path, right-click the folder, click Properties, and copy the location; in the command prompt, type ‘cd’, paste the location, and append the folder name (e.g. ‘Mast-ml’, or whatever your folder is called) so you end up inside the mastml directory.
Finally, copy-paste the following into your command prompt and run it. If it all works, you’re good to go:
python -m mastml.mastml_driver mastml/tests/conf/example_input.conf mastml/tests/csv/example_data.csv -o results/mastml_tutorial
Startup¶
Locate the examples folder¶
In the installed MAST-ML directory, navigate to the tests folder.
Under tests/conf, the file example_input.conf will use the example_data.xlsx data file located in tests/csv to run an example.
Run the MASTML command¶
The format is python3 -m mastml.mastml_driver <path to config file> <path to data .xlsx file> -o <path to results folder>
For example, to conduct the test run above, while in the MASTML install directory:
python3 -m mastml.mastml_driver tests/conf/example_input.conf tests/csv/example_data.xlsx -o results/example_results
This is a terminal command. For Windows, assuming setup has been followed as above, go to the Anaconda Navigator, Environments, select the environment, click the green arrow button, and Open terminal.
When you execute the above command, you’ll know it’s working if you begin to see output on your screen.
Check output¶
An index.html file should be created, linking to representative plots for each test.
For this example, output will be located in subfolders in the results/example_results folder.
Check the following to see if the run completed successfully:
A log.log file is generated and the last line contains the phrase "Making html file of all run stats..."
An index.html file that gives some summary plots from all the tests that were run
A series of subfolders with names "StandardScaler" -> "DoNothing" -> "KernelRidge", with the following three directories within the "KernelRidge" directory: "LeaveOneGroupOut_host", "NoSplit", and "RepeatedKFold"
You can compare all of these files with those given in the /example_results directory which should match.
MAST-ML Input File¶
This document provides an overview of the various sections and fields of the MAST-ML input file.
A full template input file can be downloaded here: MASTML_InputFile
Input file sections¶
General Setup¶
The “GeneralSetup” section of the input file allows the user to specify an assortment of basic MAST-ML parameters, ranging from which column names in the data file to use as features for fitting (i.e. X data) or as the fitting target (i.e. y data), to which metrics to employ in evaluating a model, among other things.
Example:
[GeneralSetup]
input_features = feature_1, feature_2, etc. or "Auto"
input_target = target_feature
randomizer = False
metrics = root_mean_squared_error, mean_absolute_error, etc. or "Auto"
input_other = additional_feature_1, additional_feature_2
input_grouping = grouping_feature_1
input_testdata = validation_feature_1
- input_features List of input X features
- input_target Target y feature
- randomizer Whether or not to randomize y feature data. Useful for establishing a null “baseline” test
- metrics Which metrics to evaluate model fits
- input_other Additional features that are not to be fitted on (i.e. not X features)
- input_grouping Feature names that provide information on data grouping
- input_testdata Feature name that designates whether a row will be used for validation (set rows to 1 or 0 in the csv file)
Data Cleaning¶
The “DataCleaning” section of the input file allows the user to clean their data to remove rows or columns that contain empty or NaN fields, or fill in these fields using imputation or principal component analysis methods.
Example:
[DataCleaning]
cleaning_method = remove, imputation, ppca
imputation_strategy = mean, median
- cleaning_method Method of data cleaning. “remove” simply removes columns with missing data. “imputation” uses basic operation to fill in missing values. “ppca” uses principal component analysis to fill in missing values.
- imputation_strategy Only valid field if doing imputation, selects method to impute missing data by using mean, median, etc. of the column
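To make the cleaning_method options concrete, here is a minimal numpy-only sketch (illustrative, not MAST-ML’s own implementation) of what imputation with strategy = mean does to a column containing a missing entry:

```python
import numpy as np

# A feature matrix with one missing (NaN) entry, as might appear in a
# raw data file before cleaning.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [7.0, 6.0]])

# Mean imputation: replace each NaN with the mean of the non-missing
# values in its column ((1 + 7) / 2 = 4 for column 0). The "remove"
# method would instead drop the offending row or column entirely.
col_means = np.nanmean(X, axis=0)
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]
```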
Clustering¶
Optional section to perform clustering of data using well-known clustering algorithms available in scikit-learn. Note that the subsection names must match the corresponding name of the routine in scikit-learn. More information on clustering routines and the parameters to set for each routine can be found here: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster . For the purpose of this full input file, we use the scikit-learn default parameter values. Note that not all parameters are listed.
Example:
[Clustering]
[[AffinityPropagation]]
damping = 0.5
max_iter = 200
convergence_iter = 15
affinity = euclidean
[[AgglomerativeClustering]]
n_clusters = 2
affinity = euclidean
compute_full_tree = auto
linkage = ward
[[Birch]]
threshold = 0.5
branching_factor = 50
n_clusters = 3
[[DBSCAN]]
eps = 0.5
min_samples = 5
metric = euclidean
algorithm = auto
leaf_size = 30
[[KMeans]]
n_clusters = 8
n_init = 10
max_iter = 300
tol = 0.0001
[[MiniBatchKMeans]]
n_clusters = 8
max_iter = 100
batch_size = 100
[[MeanShift]]
[[SpectralClustering]]
n_clusters = 8
n_init = 10
gamma = 1.0
affinity = rbf
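Each [Clustering] subsection maps onto the scikit-learn routine of the same name, with the fields passed as keyword arguments. As a hedged sketch (toy data, not MAST-ML’s own code), a [[KMeans]] subsection with n_clusters = 2 corresponds to roughly:

```python
import numpy as np
from sklearn.cluster import KMeans

# Four points forming two well-separated blobs (toy data for illustration).
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])

# The subsection fields become keyword arguments of the estimator.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Points within the same blob receive the same cluster label.
```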
Feature Generation¶
Optional section to perform feature generation based on properties of the constituent elements. These routines were custom written for MAST-ML, except for PolynomialFeatures. For more information on the MAST-ML custom routines, consult the MAST-ML online documentation. For more information on PolynomialFeatures, see: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
Example:
[FeatureGeneration]
[[Magpie]]
composition_feature = Material Compositions
feature_types = composition_avg, arithmetic_avg, max, min, difference
[[MaterialsProject]]
composition_feature = Material Compositions
api_key = my_api_key
[[Citrine]]
composition_feature = Material Compositions
api_key = my_api_key
[[ContainsElement]]
composition_feature = Host element
all_elements = False
element = Al
new_name = has_Al
[[PolynomialFeatures]]
degree=2
interaction_only=False
include_bias=True
- composition_feature Name of column in csv file containing material compositions
- feature_types Types of elemental features to output. If None is specified, all features are output. Note “elements” refers to properties of constituent elements
- api_key Your API key to access the Materials Project or Citrine. Register for your account at Materials Project: https://materialsproject.org or at Citrine: https://citrination.com
- all_elements For ContainsElement, whether or not to scan all data rows to assess all elements present in data set
- element For ContainsElement, name of element of interest. Ignored if all_elements = True
- new_name For ContainsElement, name of new feature column to generate. Ignored if all_elements = True
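Of the routines above, PolynomialFeatures comes directly from scikit-learn. A small sketch of what the degree = 2, include_bias = True settings generate from two input features a and b (namely 1, a, b, a², ab, b²):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two input features for three samples (toy values for illustration).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Matches the [[PolynomialFeatures]] settings in the example above:
# each row [a, b] expands to [1, a, b, a^2, a*b, b^2].
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=True)
X_poly = poly.fit_transform(X)
```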
Feature Normalization¶
Optional section to perform feature normalization of the input or generated features using well-known feature normalization algorithms available in scikit-learn. Note that the subsection names must match the corresponding name of the routine in scikit-learn. More information on normalization routines and the parameters to set for each routine can be found here: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing . For the purpose of this full input file, we use the scikit-learn default parameter values. Note that not all parameters are listed, and only the currently listed normalization routines are supported. In addition, MeanStdevScaler is a custom written normalization routine for MAST-ML. Additional information on MeanStdevScaler can be found in the online MAST-ML documentation.
Example:
[FeatureNormalization]
[[Binarizer]]
threshold = 0.0
[[MaxAbsScaler]]
[[MinMaxScaler]]
[[Normalizer]]
norm = l2
[[QuantileTransformer]]
n_quantiles = 1000
output_distribution = uniform
[[RobustScaler]]
with_centering = True
with_scaling = True
[[StandardScaler]]
[[MeanStdevScaler]]
mean = 0
stdev = 1
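As a concrete sketch of what standardization does (numpy only, for illustration): StandardScaler, and MeanStdevScaler with mean = 0 and stdev = 1, both transform each feature column to zero mean and unit standard deviation.

```python
import numpy as np

# A single feature column (toy values).
X = np.array([[1.0], [2.0], [3.0]])

# Subtract the column mean and divide by the column standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```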
Learning Curve¶
Optional section to perform learning curve analysis on a dataset. Two types of learning curves will be generated: a data learning curve (score vs. amount of training data) and a feature learning curve (score vs. number of features).
Example:
[LearningCurve]
estimator = KernelRidge_learn
cv = RepeatedKFold_learn
scoring = root_mean_squared_error
n_features_to_select = 5
selector_name = MASTMLFeatureSelector
- estimator A scikit-learn model/estimator. The name needs to match an entry in the [Models] section. Note this model will be removed from the [Models] list after the learning curve is generated.
- cv A scikit-learn cross validation generator. The name needs to match an entry in the [DataSplits] section. Note this method will be removed from the [DataSplits] list after the learning curve is generated.
- scoring A scikit-learn scoring method compatible with MAST-ML. See the MAST-ML online documentation at https://htmlpreview.github.io/?https://raw.githubusercontent.com/uw-cmg/MAST-ML/dev_Ryan_2018-10-29/docs/build/html/3_metrics.html for more information.
- n_features_to_select The max number of features to use for the feature learning curve.
- selector_name Method to conduct feature selection for the feature learning curve. The name needs to match an entry in the [FeatureSelection] section. Note this method will be removed from the [FeatureSelection] section after the learning curve is generated.
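The data learning curve can be sketched conceptually as follows. This is an illustrative stand-in, not MAST-ML’s code: a simple linear fit via numpy.polyfit plays the role of the configured estimator, and the training-set sizes are arbitrary.

```python
import numpy as np

# Synthetic data: a noisy linear relationship.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2 * x + rng.normal(0, 0.1, 100)

# Data learning curve: train on a growing amount of data and score
# (RMSE here) on a fixed held-out set.
x_test, y_test = x[80:], y[80:]
sizes, scores = [10, 20, 40, 80], []
for n in sizes:
    coef = np.polyfit(x[:n], y[:n], 1)   # toy linear model as stand-in
    pred = np.polyval(coef, x_test)
    scores.append(np.sqrt(np.mean((pred - y_test) ** 2)))
# Plotting scores vs. sizes gives the data learning curve; the feature
# learning curve varies the number of features instead.
```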
Feature Selection¶
Optional section to perform feature selection using routines from scikit-learn, mlxtend, and routines custom-written for MAST-ML. Note that the subsection names must match the corresponding name of the routine in scikit-learn. More information on selection routines and the parameters to set for each routine can be found here: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection . For the purpose of this full input file, we use the scikit-learn default parameter values. Note that not all parameters are listed, and only the currently listed selection routines are supported. In addition, MASTMLFeatureSelector is a custom written selection routine for MAST-ML. Additional information on MASTMLFeatureSelector can be found in the online MAST-ML documentation. Finally, SequentialFeatureSelector is a routine available from the mlxtend package, whose documentation can be found here: http://rasbt.github.io/mlxtend/
Example:
[FeatureSelection]
[[GenericUnivariateSelect]]
[[SelectPercentile]]
[[SelectKBest]]
[[SelectFpr]]
[[SelectFdr]]
[[SelectFwe]]
[[RFE]]
estimator = RandomForestRegressor_selectRFE
n_features_to_select = 5
step = 1
[[SequentialFeatureSelector]]
estimator = RandomForestRegressor_selectSFS
k_features = 5
[[RFECV]]
estimator = RandomForestRegressor_selectRFECV
step = 1
cv = LeaveOneGroupOut_selectRFECV
min_features_to_select = 1
[[SelectFromModel]]
estimator = KernelRidge_selectfrommodel
max_features = 5
[[VarianceThreshold]]
threshold = 0.0
[[PCA]]
n_components = 5
[[MASTMLFeatureSelector]]
estimator = KernelRidge_selectMASTML
n_features_to_select = 5
cv = LeaveOneGroupOut_selectMASTML
# Any features you want to keep from the start, then use these to subsequently do forward selection
manually_selected_features = myfeature_1, myfeature_2
[[EnsembleModelFeatureSelector]]
# A scikit-learn model/estimator. Needs to have estimator feature ranking. The name needs to match an entry in the [Models] section.
estimator = RandomForestRegressor_selectEnsemble
# number of features to select
k_features = 5
[[PearsonSelector]]
# threshold for removal of redundant features
threshold_between_features = 0.9
# threshold for removal of features not sufficiently correlated with target
threshold_with_target = 0.8
# whether to remove features that are highly correlated with each other (i.e. redundant)
remove_highly_correlated_features = True
# number of features to select
k_features = 5
- estimator A scikit-learn model/estimator. The name needs to match an entry in the [Models] section. Note this model will be removed from the [Models] list after feature selection is performed.
- n_features_to_select The max number of features to select
- step For RFE and RFECV, the number of features to remove in each step
- k_features For SequentialFeatureSelector, the max number of features to select.
- cv A scikit-learn cross validation generator. The name needs to match an entry in the [DataSplits] section. Note this method will be removed from the [DataSplits] list after feature selection is performed.
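As an illustration of the PearsonSelector thresholds (toy data; not MAST-ML’s own code), features can be kept or dropped based on the magnitude of their Pearson correlation with the target:

```python
import numpy as np

# Synthetic target plus two candidate features: one strongly correlated
# with the target, one pure noise.
rng = np.random.default_rng(1)
y = rng.normal(size=200)
X = np.column_stack([y + 0.1 * rng.normal(size=200),  # correlated feature
                     rng.normal(size=200)])           # uncorrelated noise

# Keep features whose |Pearson correlation| with the target exceeds
# threshold_with_target (0.8, as in the example above).
corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
selected = [j for j, c in enumerate(corrs) if c > 0.8]
```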
Data Splits¶
Optional section to perform data splits using cross validation routines from scikit-learn and routines custom-written for MAST-ML. Note that the subsection names must match the corresponding name of the routine in scikit-learn. More information on selection routines and the parameters to set for each routine can be found here: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection . For the purpose of this full input file, we use the scikit-learn default parameter values. Note that not all parameters are listed, and only the currently listed data split routines are supported. In addition, NoSplit is a custom written data split routine for MAST-ML, which simply produces a full data fit with no cross validation. Additional information on NoSplit can be found in the online MAST-ML documentation.
Example:
[DataSplits]
[[NoSplit]]
[[KFold]]
shuffle = True
n_splits = 10
[[RepeatedKFold]]
n_splits = 5
n_repeats = 10
# Here, an example of another instance of RepeatedKFold, this one being used in the [LearningCurve] section above.
[[RepeatedKFold_learn]]
n_splits = 5
n_repeats = 10
[[GroupKFold]]
n_splits = 3
[[LeaveOneOut]]
[[LeavePOut]]
p = 10
[[RepeatedStratifiedKFold]]
n_splits = 5
n_repeats = 10
[[StratifiedKFold]]
n_splits = 3
[[ShuffleSplit]]
n_splits = 10
[[StratifiedShuffleSplit]]
n_splits = 10
[[LeaveOneGroupOut]]
# The column name in the input csv file containing the group labels
grouping_column = Host element
# Here, an example of another instance of LeaveOneGroupOut, this one being used in the [FeatureSelection] section above.
[[LeaveOneGroupOut_selectMASTML]]
# The column name in the input csv file containing the group labels
grouping_column = Host element
# Here, an example of another instance of LeaveOneGroupOut, this one being used based on the creation of the "has_Al"
# group from the [[ContainsElement]] routine present in the [FeatureGeneration] section.
[[LeaveOneGroupOut_Al]]
grouping_column = has_Al
# Here, an example of another instance of LeaveOneGroupOut, this one being used based on the creation of clusters
# from the [[KMeans]] routine present in the [Clustering] section.
[[LeaveOneGroupOut_kmeans]]
grouping_column = KMeans
[[LeaveCloseCompositionsOut]]
# Set the distance threshold in composition space
dist_threshold=0.1
[[Bootstrap]]
# Data set size
n = 378
# Number of bootstrap resamplings to perform
n_bootstraps = 10
# Training set size
train_size = 303
# Validation/test set size
test_size = 75
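To make the split definitions concrete, here is a numpy-only sketch of one shuffled 5-fold pass over 20 samples (RepeatedKFold with n_splits = 5, n_repeats = 10 simply repeats such a pass ten times with fresh shuffles, yielding 50 train/test splits):

```python
import numpy as np

# Shuffle 20 sample indices and cut them into 5 folds.
rng = np.random.default_rng(0)
idx = rng.permutation(20)
folds = np.array_split(idx, 5)

# Each fold serves once as the test set; the rest form the training set.
splits = [(np.concatenate(folds[:k] + folds[k + 1:]), folds[k])
          for k in range(5)]
```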
Models¶
Optional section to denote different models/estimators for model fitting from scikit-learn. Note that the subsection names must match the corresponding name of the routine in scikit-learn. More information on different model routines and the parameters to set for each routine can be found here for ensemble methods: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble and here for kernel ridge and linear methods: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.kernel_ridge and here for neural network methods: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.neural_network and here for support vector machine and decision tree methods: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm . For the purpose of this full input file, we use the scikit-learn default parameter values. Note that not all parameters are listed, and only the currently listed model routines are supported.
Example:
[Models]
# Ensemble methods
[[AdaBoostClassifier]]
n_estimators = 50
learning_rate = 1.0
[[AdaBoostRegressor]]
n_estimators = 50
learning_rate = 1.0
[[BaggingClassifier]]
n_estimators = 50
max_samples = 1.0
max_features = 1.0
[[BaggingRegressor]]
n_estimators = 50
max_samples = 1.0
max_features = 1.0
[[ExtraTreesClassifier]]
n_estimators = 10
criterion = gini
min_samples_split = 2
min_samples_leaf = 1
[[ExtraTreesRegressor]]
n_estimators = 10
criterion = mse
min_samples_split = 2
min_samples_leaf = 1
[[GradientBoostingClassifier]]
loss = deviance
learning_rate = 1.0
n_estimators = 100
subsample = 1.0
criterion = friedman_mse
min_samples_split = 2
min_samples_leaf = 1
[[GradientBoostingRegressor]]
loss = ls
learning_rate = 0.1
n_estimators = 100
subsample = 1.0
criterion = friedman_mse
min_samples_split = 2
min_samples_leaf = 1
[[RandomForestClassifier]]
n_estimators = 10
criterion = gini
min_samples_leaf = 1
min_samples_split = 2
[[RandomForestRegressor]]
n_estimators = 10
criterion = mse
min_samples_leaf = 1
min_samples_split = 2
# Here, an example of another instance of RandomForestRegressor, this one used by the [[EnsembleModelFeatureSelector]]
# method from the [FeatureSelection] section.
[[RandomForestRegressor_selectEnsemble]]
n_estimators = 100
criterion = mse
[[XGBoostClassifier]]
[[XGBoostRegressor]]
n_estimators = 100
objective = reg:squarederror
# Kernel ridge and linear methods
[[KernelRidge]]
alpha = 1
kernel = linear
# Here, an example of another instance of KernelRidge, this one used by the [[MASTMLFeatureSelector]]
# method from the [FeatureSelection] section.
[[KernelRidge_selectMASTML]]
alpha = 1
kernel = linear
# Here, an example of another instance of KernelRidge, this one used in the [LearningCurve] section.
[[KernelRidge_learn]]
alpha = 1
kernel = linear
[[ARDRegression]]
n_iter = 300
[[BayesianRidge]]
n_iter = 300
[[ElasticNet]]
alpha = 1.0
[[HuberRegressor]]
epsilon = 1.35
max_iter = 100
[[Lars]]
[[Lasso]]
alpha = 1.0
[[LassoLars]]
alpha = 1.0
max_iter = 500
[[LassoLarsIC]]
criterion = aic
max_iter = 500
[[LinearRegression]]
[[LogisticRegression]]
penalty = l2
C = 1.0
[[Perceptron]]
alpha = 0.0001
[[Ridge]]
alpha = 1.0
[[RidgeClassifier]]
alpha = 1.0
[[SGDClassifier]]
loss = hinge
penalty = l2
alpha = 0.0001
[[SGDRegressor]]
loss = squared_loss
penalty = l2
alpha = 0.0001
# Neural networks
[[MLPClassifier]]
hidden_layer_sizes = 100,
activation = relu
solver = adam
alpha = 0.0001
batch_size = auto
learning_rate = constant
[[MLPRegressor]]
hidden_layer_sizes = 100,
activation = relu
solver = adam
alpha = 0.0001
batch_size = auto
learning_rate = constant
[[KerasRegressor]]
[[[Layer1]]]
layer_type = Dense
neuron_num= 100
input_dim= 287 #typically equal to n_features
kernel_initializer= random_normal
activation=relu
[[[Layer2]]]
layer_type = Dense
neuron_num= 50
kernel_initializer= random_normal
activation=relu
[[[Layer3]]]
layer_type = Dense
neuron_num= 25
kernel_initializer= random_normal
activation=relu
[[[Layer4]]]
layer_type = Dense
neuron_num= 1
kernel_initializer= random_normal
activation=linear
[[[FitParams]]]
epochs=20
batch_size=25
loss = mean_squared_error
optimizer = adam
metrics = mse
verbose=1
shuffle = True
#validation_split = 0.2
# Support vector machine methods
[[LinearSVC]]
penalty = l2
loss = squared_hinge
tol = 0.0001
C = 1.0
[[LinearSVR]]
epsilon = 0.1
loss = epsilon_insensitive
tol = 0.0001
C = 1.0
[[NuSVC]]
nu = 0.5
kernel = rbf
degree = 3
[[NuSVR]]
nu = 0.5
C = 1.0
kernel = rbf
degree = 3
[[SVC]]
C = 1.0
kernel = rbf
degree = 3
[[SVR]]
C = 1.0
kernel = rbf
degree = 3
# Decision tree methods
[[DecisionTreeClassifier]]
criterion = gini
splitter = best
min_samples_split = 2
min_samples_leaf = 1
[[DecisionTreeRegressor]]
criterion = mse
splitter = best
min_samples_split = 2
min_samples_leaf = 1
[[ExtraTreeClassifier]]
criterion = gini
splitter = random
min_samples_split = 2
min_samples_leaf = 1
[[ExtraTreeRegressor]]
criterion = mse
splitter = random
min_samples_split = 2
min_samples_leaf = 1
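Each [Models] subsection maps onto the scikit-learn estimator of the same name, with the fields passed as keyword arguments. For example, the [[KernelRidge]] settings above correspond roughly to constructing and fitting (toy data for illustration):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# [[KernelRidge]] with alpha = 1, kernel = linear becomes:
model = KernelRidge(alpha=1, kernel="linear")

# Toy 1-D training data on the line y = x.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
model.fit(X, y)

# Predict at a new point; regularization (alpha = 1) shrinks the
# prediction slightly below the unregularized value of 4.
pred = model.predict(np.array([[4.0]]))
```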
Misc Settings¶
This section controls which types of plots MAST-ML will write to the results directory and other miscellaneous settings.
Example:
[MiscSettings]
plot_target_histogram = True
plot_train_test_plots = True
plot_predicted_vs_true = True
plot_predicted_vs_true_average = True
plot_best_worst_per_point = True
plot_each_feature_vs_target = False
plot_error_plots = True
rf_error_method = stdev
rf_error_percentile = 95
normalize_target_feature = False
- plot_target_histogram Whether or not to output target data histograms
- plot_train_test_plots Whether or not to output parity plots within each CV split
- plot_predicted_vs_true Whether or not to output summarized parity plots
- plot_predicted_vs_true_average Whether or not to output averaged parity plots
- plot_best_worst_per_point Whether or not to output parity plot showing best and worst split per point
- plot_each_feature_vs_target Whether or not to show plots of target feature as a function of each individual input feature
- plot_error_plots Whether or not to output the individual and average plots of the normalized errors
- rf_error_method If using random forest, whether to calculate error bars with stdev or confidence intervals (confint)
- rf_error_percentile If using confint above, the confidence interval to use to calculate the error bars
- normalize_target_feature Whether or not to normalize the target feature values
MAST-ML overview slides¶
The information for this MAST-ML overview shown on this page is available for download here:
Let’s begin with an overview of what MAST-ML is and what it can do:

Here is what MAST-ML can currently do, as well as how to acquire it:

An overview of the general machine learning workflow that MAST-ML executes. Continuing development will focus on making the workflows more flexible and general

MAST-ML uses a text-based input file (.conf extension) which consists of different sections (corresponding to each part of the workflow) and specific subsections (e.g. different machine learning models to test, different feature selection algorithms, etc.). The input file is discussed in much greater detail here:
and an input file with the full range of capabilities can be downloaded here:

Running MAST-ML is easily done with a single-line command in a Terminal/command line, your favorite IDE, or within a Jupyter notebook

MAST-ML output takes the form of a full directory tree of results, with each level of the tree corresponding to a different portion of the machine learning workflow

The last three figures demonstrate some example output of a few machine learning analysis features MAST-ML offers. Here, the ability to generate and select features is shown.

A core feature of MAST-ML is the many pieces of statistical analysis regarding model assessment, which forms the basis of interpreting the quality and extensibility of a machine learning model.

Finally, MAST-ML offers the ability to easily optimize the model hyperparameters used in your analysis

Running MAST-ML on Google Colab¶
In addition to running MAST-ML on your own machine or computing cluster, MAST-ML can be run using cloud resources on Google Colab. This can be advantageous as you don’t have to worry about installing MAST-ML yourself, and all output files can be saved directly to your Google Drive.
MAST-ML comes with a notebook called MASTML_Colab.ipynb that you can open in Google Colab
Once you open the notebook in Google Colab, it will look something like this:

There are a few blocks of code in this notebook. The first block performs a pip install of MAST-ML for this Colab session. The second block links your Google Drive to the Colab instance so MAST-ML can save your run output directly to your Google Drive.
The one thing you’ll need to do from here is to upload a data file (.csv or .xlsx format) and MAST-ML input file (.conf format) to this Colab session. Files can be uploaded by pressing the vertical arrow on the left side of the screen, by the file directory tree.
Note that when a Colab session ends, the files you uploaded will be deleted; since your output is saved to your Google Drive, only the uploaded data and input files are lost. Note that MAST-ML automatically saves a copy of both of these files to your output directory for each run you do.
MAST-ML tutorial¶
Introduction¶
This document provides step-by-step tutorials of conducting and analyzing different MAST-ML runs. For this tutorial, we will be using the dataset example_data.xlsx in the tests/csv/ folder and input file example_input.conf in tests/conf/.
MAST-ML requires two files to run: The first is the text-based input file (.conf extension). This file contains all of the key settings for MAST-ML, for example, which models to fit and how to normalize your input feature matrix. The second file is the data file (.csv or .xlsx extension). This is the data file containing the input feature columns and values (X values) and the corresponding y data to fit models to. The data file may contain other columns that are dedicated to constructing groups of data for specific tests, or to miscellaneous notes; these columns can be selectively left out so they are not used in the fitting. This will be discussed in more detail below.
Throughout this tutorial, we will be modifying the input file to add and remove different sections and values. For a complete and more in-depth discussion of the input file and its myriad settings, the reader is directed to the dedicated input file section:
The data contained in the example_data.xlsx file consist of a previously selected matrix of X features created from combinations of elemental properties, for example the average atomic radius of the elements in the material. The y data values used for fitting are listed in the “Scaled activation energy (eV)” column, and are DFT-calculated migration barriers of dilute solute diffusion, referenced to the host system. For example, the value for an Ag solute diffusing through an Ag host is set to zero. The “Host element” and “Solute element” columns denote which species comprise the corresponding reduced migration barrier.
Your first MAST-ML run¶
It’s time to conduct your very first MAST-ML run! First, we will set up the most basic input file, which will only import your data and input file, and do nothing else except copy the input files to the results directory and output a basic histogram of the target data. Open the example_input.conf file (or create your own new file), and write the following in your input file:
Example:
[GeneralSetup]
input_features = Auto
input_target = Scaled activation energy (eV)
randomizer = False
metrics = Auto
input_other = Material composition, Host element, Solute element, predict_Pt
The GeneralSetup section contains high-level control over how your input data file is parsed. Additional details of each parameter can be found in the MAST-ML Input File section of this documentation. Briefly, setting “input_features” to “Auto” automatically assigns all columns to the X feature matrix, except those listed under input_target or input_other. The “randomizer” option shuffles all of your y-data, which can be useful for running a “null” test. The “metrics” option denotes which metrics your models will eventually be evaluated on, such as mean_absolute_error; using “Auto” provides a catalogue of standard metrics which is sufficient for many problems. Finally, the “input_other” field denotes any feature columns you don’t want to use in fitting. If some columns contain text notes, these need to be added here too.
There are two ways to execute a MAST-ML run. The first is to run it from a Terminal or IDE command line by directly calling the main MAST-ML driver module. Here, the python -m (for module) command is invoked on the mastml.mastml_driver module, and the paths to the input file and data file are passed in. Lastly, the argument -o (for output) is used together with the path in which to put all results files and folders.
Example:
python3 -m mastml.mastml_driver tests/conf/example_input.conf tests/csv/example_data.xlsx -o results/mastml_tutorial
The second way is to run MAST-ML from a Jupyter notebook by importing mastml and calling the mastml_driver main() method, supplying the paths to the input file, data file, and output directory:
Example:
import mastml_driver
conf_path = 'tests/conf/example_input.conf'
data_path = 'tests/csv/example_data.xlsx'
results_path = 'results/mastml_tutorial'
mastml_driver.main(conf_path, data_path, results_path)
Let’s examine the output from this first run. Below is a screenshot of a Mac directory output tree in the results/mastml_tutorial folder. Note that you can re-use the same output folder name; the date and time of the run will be appended, so no work will be lost. Each level of the directory tree corresponds to a step in the general supervised learning workflow that MAST-ML uses. The first level is general data input and feature generation, the second level is numerical manipulation of features, and the third level is selection of features. Since we did not do any feature manipulation in this run, the output selected.csv, normalized.csv and generated_features.csv files are all the same, and match the copied input data file, example_data.xlsx. In the main directory tree, there are also log.log and errors.log files, which summarize the inner details of the MAST-ML run and flag any errors that may have occurred. There are two .html files which provide very high-level summaries of data plots and file links that may be of interest, making it easier to search for these files. Finally, there is some generated data about the statistics of your input target data: a histogram named target_histogram.png is created, and a basic statistical summary of your data is saved in the input_data_statistics.csv file.

Cleaning input data¶
Now, let’s imagine a slightly more complicated (but realistic) scenario where some of the values in your X feature matrix are not known. Open your example_data.xlsx file, and randomly remove some values from the X feature columns in your dataset. Don’t remove any y data values in the “Reduced barrier (eV)” column. You’ll need to add the following section to your input file to handle cleaning of the input data:
Example:
[DataCleaning]
cleaning_method = imputation
imputation_strategy = mean
What this does is perform data imputation, where each missing value will be replaced with the mean value for that particular feature column. Other data cleaning options include imputation with median values, simply removing rows of data with missing values, or performing a probabilistic principal component analysis to fill in missing values.
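Mean imputation of this kind can be sketched directly with scikit-learn’s SimpleImputer (an illustrative toy example, not MAST-ML’s internal code; MAST-ML performs the equivalent step for you based on the input file):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy feature matrix with one missing entry (NaN) per column
df = pd.DataFrame({"feat_1": [1.0, np.nan, 3.0],
                   "feat_2": [4.0, 5.0, np.nan]})

# Replace each NaN with the mean of its own feature column
imputer = SimpleImputer(strategy="mean")
cleaned = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(cleaned["feat_1"].tolist())  # [1.0, 2.0, 3.0]
print(cleaned["feat_2"].tolist())  # [4.0, 5.0, 4.5]
```

Switching strategy to "median" gives median imputation, and dropping rows with missing values corresponds to pandas dropna.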
Comparing the data file in the parent directory with those in the subsequent directories, you can see that the missing values (here, the first 10 rows of the first several features were removed) have been replaced with the mean value of each respective feature column:

After data cleaning with imputation:

Feature generation and normalization¶
For this run, we are going to first generate a large X feature matrix based on a suite of elemental properties. Then, we are going to normalize the feature matrix so that all values in a given feature column have a mean of zero and a standard deviation equal to one.
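This normalization step corresponds to scikit-learn’s StandardScaler, which the [[StandardScaler]] subsection below refers to; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy X feature matrix: 4 rows, 2 feature columns
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Rescale each feature column to mean 0 and standard deviation 1
X_norm = StandardScaler().fit_transform(X)

print(X_norm.mean(axis=0))  # ~[0. 0.]
print(X_norm.std(axis=0))   # [1. 1.]
```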
To perform the feature generation and normalization steps, add these sections to your input file. Use the same file from the previous run, which contains the GeneralSetup and DataCleaning sections, and use your data file with the values you previously removed. (Note that you can use the pristine original data file too, and the data cleaning step will simply do nothing). For the purpose of this example, we are going to generate elemental features using the MAGPIE approach, using compositions as specified in the “Solute element” column of the data file. Note that if multiple elements are present, features containing the average (both mean and composition-weighted averages) of the elements present will be calculated. The value specified in the composition_feature parameter must be a column name in your data file which contains the material compositions.
Example:
[FeatureGeneration]
[[Magpie]]
composition_feature = Solute element
feature_types = composition_avg, arithmetic_avg, max, min, difference, elements
[FeatureNormalization]
[[StandardScaler]]
After performing this run, we can see that the .csv files in the feature generation and normalization folders of the results directory tree are now updated to reflect the generated and normalized X feature matrices. There are now many more features in the generated_features.csv file:

Note that feature columns that are identical in all values are removed automatically. We can see that the normalized feature set consists of each column having mean zero and standard deviation of one:

Training and evaluating your first model¶
Now that we have a full X feature matrix that has been normalized appropriately, it is time to train and evaluate your first model. For this tutorial, we will train a Kernel Ridge model with a radial basis function kernel (also known as Gaussian Kernel Ridge Regression, GKRR). We need to add two sections of our input file.
The first is the Models section, which provides a list of model types to train and the associated parameter values for each model. Here, we have chosen values of alpha and gamma equal to 1. There is no reason to think that these are the optimal parameter values, they were simply chosen as an example. Later in this tutorial we will optimize these parameters. Note that if you don’t specify the model parameter values, the values used will be the scikit-learn default values.
The second is the DataSplits section, which controls the types of fits and cross-validation tests applied to each specified model. Here, we have chosen “NoSplit”, which is simply a full y versus X fit of the data, without any form of cross-validation. We have also denoted “RepeatedKFold”, which is a random leave-out cross-validation test. In this instance, we use 5 splits (so leave out 20% each time) and repeat the test two times.
Example:
[Models]
[[KernelRidge]]
kernel = rbf
alpha = 1
gamma = 1
[DataSplits]
[[NoSplit]]
[[RepeatedKFold]]
n_splits = 5
n_repeats = 2
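These settings map onto standard scikit-learn objects; the following sketch (with invented toy data standing in for the real feature matrix) shows what the RepeatedKFold evaluation of this KernelRidge model amounts to:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Invented toy regression data
rng = np.random.RandomState(0)
X = rng.rand(50, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.randn(50)

# Same settings as the input file: rbf kernel, alpha = 1, gamma = 1
model = KernelRidge(kernel="rbf", alpha=1, gamma=1)

# 5 splits x 2 repeats = 10 total train/test splits
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = cross_val_score(model, X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
print(len(scores))     # 10
print(-scores.mean())  # mean RMSE over the 10 splits
```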
Below is a snapshot of the resulting directory tree generated from this MAST-ML run. You’ll immediately notice the tree is deeper now, with a new level corresponding to each model we’ve fit (here just the single KernelRidge model), and, for each model, folders corresponding to each DataSplit test we denoted in the input file. For each data split method, there are folders and corresponding data plots and files for each hold-out split of the test. For instance, with the RepeatedKFold test, there were 10 total splits, which are labeled as split_0 through split_9. Contained in each folder are numerous files, such as different data parity plots of predicted vs. actual values, histograms of residuals, .csv files for all plotted data, a .pkl file of the exported trained model, and .ipynb Jupyter notebooks useful for custom modifications of the data plots.

Below is a parity plot from the NoSplit (full data fit) run. The R-squared value is high, but there is significant mean error. This suggests that the model parameters are not optimal (which shouldn’t be surprising considering we just picked them arbitrarily).

From examining the parity plot from the RepeatedKFold run (this is the ‘average_points_with_bars.png’ plot), which has the averaged values over all 10 splits, we can see that the predictions from random cross-validation result in both a very low R-squared value and a high error. Essentially, cross-validation has shown that this model has no predictive ability. It seems our issues are two-fold: nonoptimal hyperparameters, and over-fitting. The over-fitting is evident from the much worse performance of the cross-validated parity plot compared to the full fit.

Feature selection and learning curves¶
As mentioned above, one problem with our current model is over-fitting. To further understand and minimize the effect of over-fitting, it is often necessary to construct learning curves and perform feature selection to obtain a reduced feature set which most accurately describes your data. To do this, we are going to add two additional sections to our input file.
The first section is related to feature selection. Here, we will use the SequentialFeatureSelector algorithm, which performs forward selection of features. We will select a total of 20 features, and use a KernelRidge model to evaluate the selected features. Here, we have denoted our estimator as “KernelRidge_select”. The models used in feature selection and learning curves are removed from the model queue, because in general one may want to use a different model for this step of the analysis than what will ultimately be used for fitting. Therefore, we also need to amend our models list to include this new KernelRidge_select model, as shown below.
Example:
[FeatureSelection]
[[SequentialFeatureSelector]]
estimator = KernelRidge_select
k_features = 20
[Models]
[[KernelRidge]]
kernel = rbf
alpha = 1
gamma = 1
[[KernelRidge_select]]
kernel = rbf
alpha = 1
gamma = 1
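An analogous forward-selection step can be sketched with scikit-learn’s SequentialFeatureSelector (an illustration on invented data, not MAST-ML’s internal implementation; MAST-ML drives its selector from the input file):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.kernel_ridge import KernelRidge

# Invented data: 30 candidate features, only 5 of them informative
X, y = make_regression(n_samples=60, n_features=30, n_informative=5,
                       random_state=0)

# Forward selection: greedily add the feature that most improves the CV score
selector = SequentialFeatureSelector(
    KernelRidge(kernel="rbf", alpha=1, gamma=1),
    n_features_to_select=5, direction="forward", cv=5)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (60, 5)
```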
The second section we will add plots learning curves. There are two types of learning curves MAST-ML can make: a data learning curve and a feature learning curve. The former is a plot of the metric of interest versus the amount of training data used in the fits. The latter is a plot of the metric of interest versus the number of features comprising the X feature matrix. In the example LearningCurve input file section shown below, we are going to use a KernelRidge model, a random k-fold cross-validation, and root_mean_squared_error to evaluate our learning curves. We will also use a maximum of 20 features, and use the SelectKBest algorithm to assess the choice of features.
Example:
[LearningCurve]
estimator = KernelRidge_learn
cv = RepeatedKFold_learn
scoring = root_mean_squared_error
n_features_to_select = 20
selector_name = SelectKBest
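A data (sample) learning curve of the kind MAST-ML plots can be sketched with scikit-learn’s learning_curve function (illustrative toy data; MAST-ML produces these plots for you):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import RepeatedKFold, learning_curve

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# Score the model at increasing training-set sizes
sizes, train_scores, test_scores = learning_curve(
    KernelRidge(kernel="rbf", alpha=1, gamma=1), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=RepeatedKFold(n_splits=5, n_repeats=2, random_state=0),
    scoring="neg_root_mean_squared_error")

print(sizes)                      # training-set sizes actually used
print(-test_scores.mean(axis=1))  # mean test RMSE at each size
```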
As with the above example of FeatureSelection, we need to add the KernelRidge_learn and RepeatedKFold_learn entries to the Models and DataSplits sections of our input file, respectively. At this point in the tutorial, the complete input file should look like this:
Example:
[GeneralSetup]
input_features = Auto
input_target = Reduced barrier (eV)
randomizer = False
metrics = Auto
input_other = Host element, Solute element, predict_Pt
[DataCleaning]
cleaning_method = imputation
imputation_strategy = mean
[FeatureGeneration]
[[Magpie]]
composition_feature = Solute element
[FeatureNormalization]
[[StandardScaler]]
[FeatureSelection]
[[SequentialFeatureSelector]]
estimator = KernelRidge_select
k_features = 20
[LearningCurve]
estimator = KernelRidge_learn
cv = RepeatedKFold_learn
scoring = root_mean_squared_error
n_features_to_select = 20
selector_name = SelectKBest
[Models]
[[KernelRidge]]
kernel = rbf
alpha = 1
gamma = 1
[[KernelRidge_select]]
kernel = rbf
alpha = 1
gamma = 1
[[KernelRidge_learn]]
kernel = rbf
alpha = 1
gamma = 1
[DataSplits]
[[NoSplit]]
[[RepeatedKFold]]
n_splits = 5
n_repeats = 2
[[RepeatedKFold_learn]]
n_splits = 5
n_repeats = 2
Let’s take a look at the same full fit and RepeatedKFold random cross-validation tests for this run:
Full-fit:

Random leave out cross-validation:

Now that we have down-selected from more than 300 features in the previous run to just 20, we can see that the fits have noticeably improved and the problem of over-fitting has been reduced. Below, we can look at the plotted learning curves.
Data learning curve:

Feature learning curve:

We can clearly see that, as expected, having more training data results in better test scores, and adding more features (up to a certain point) also results in better fits. Based on these learning curves, one may be able to argue that additional features could be used to further lower the error.
Hyperparameter optimization¶
Next, we will consider optimization of the model hyperparameters, in order to use a better-optimized model with a selected feature set to minimize the model errors. To do this, we need to add the HyperOpt section to our input file, as shown below. Here, we are optimizing our KernelRidge model, specifically its root_mean_squared_error, using our RepeatedKFold random leave-out cross-validation scheme. The param_names field provides the parameter names to optimize; here, we are optimizing the KernelRidge alpha and gamma parameters. Parameters must be delineated with a semicolon. The param_values field provides the bounds on the values to search over. Here, the minimum exponent is -5, the maximum is 5, 100 points are analyzed, and the numerical scaling is logarithmic, meaning values range from 10^-5 to 10^5. If “lin” instead of “log” had been specified, the scale would be linear, with 100 values ranging from -5 to 5.
Example:
[HyperOpt]
[[GridSearch]]
estimator = KernelRidge
cv = RepeatedKFold
param_names = alpha ; gamma
param_values = -5 5 100 log float ; -5 5 100 log float
scoring = root_mean_squared_error
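The param_values grid can be reproduced with numpy, and the search itself corresponds to scikit-learn’s GridSearchCV; the sketch below uses a coarser 11-point grid and invented toy data so it runs quickly:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# "-5 5 100 log" denotes 100 points spaced logarithmically from 1e-5 to 1e5
grid = np.logspace(-5, 5, 100)

# Invented toy data standing in for the selected feature matrix
rng = np.random.RandomState(0)
X = rng.rand(40, 4)
y = X.sum(axis=1) + 0.05 * rng.randn(40)

search = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid={"alpha": np.logspace(-5, 5, 11),
                "gamma": np.logspace(-5, 5, 11)},
    cv=RepeatedKFold(n_splits=5, n_repeats=2, random_state=0),
    scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_)  # best alpha/gamma found on the coarse grid
```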
Let’s take a final look at the same full fit and RepeatedKFold random cross-validation tests for this run:
Full-fit:

Random leave out cross-validation:

What we can see is that, now that we have down-selected from more than 300 features in the previous run to just 20, along with optimizing the hyperparameters of our KernelRidge model, our fits are once again improved. The hyperparameter optimization portion of this workflow outputs the hyperparameter values and cross-validation scores for each step of, in this case, the GridSearch that we performed. All of this information is saved in the KernelRidge.csv file in the GridSearch folder of the results directory tree. For this run, the optimal hyperparameters were alpha = 0.034 and gamma = 0.138.
Random leave-out versus leave-out-group cross-validation¶
Here, we will use our selected feature set and optimized KernelRidge hyperparameters from the previous section to do a new kind of cross-validation test: leave-out-group (LOG) CV. To do this, modify the alpha and gamma values in the KernelRidge subsection of the Models section in your input file. In addition, you can rename the selected.csv data file to a new name, for example “example_data_selected.csv”, and use the path to this new data file for this new run, as we will not be performing feature selection again (to save time).
We will compare the results of LOG cross-validation with those of random cross-validation. Our input data file has a column called “Host element”. This is a natural grouping to use for this problem, as it is interesting to assess our fits when training on a set of host elements and predicting the values of an entirely new host element set, without having ever trained on that set. Modify your input file to match what is shown below. Note that we have commented out the sections that we no longer want with the # symbol. You can either comment out the sections or remove them entirely.
Example:
[GeneralSetup]
input_features = Auto
input_target = Reduced barrier (eV)
randomizer = False
metrics = Auto
input_other = Host element, Solute element, predict_Pt
input_grouping = Host element
#[DataCleaning]
# cleaning_method = imputation
# imputation_strategy = mean
#[FeatureGeneration]
# [[Magpie]]
# composition_feature = Solute element
[FeatureNormalization]
[[StandardScaler]]
#[FeatureSelection]
# [[SequentialFeatureSelector]]
# estimator = KernelRidge_select
# k_features = 20
#[LearningCurve]
# estimator = KernelRidge_learn
# cv = RepeatedKFold_learn
# scoring = root_mean_squared_error
# n_features_to_select = 20
# selector_name = SelectKBest
[Models]
[[KernelRidge]]
kernel = rbf
alpha = 0.034
gamma = 0.138
#[[KernelRidge_select]]
# kernel = rbf
# alpha = 1
# gamma = 1
#[[KernelRidge_learn]]
# kernel = rbf
# alpha = 1
# gamma = 1
[DataSplits]
[[NoSplit]]
[[RepeatedKFold]]
n_splits = 5
n_repeats = 2
#[[RepeatedKFold_learn]]
# n_splits = 5
# n_repeats = 2
[[LeaveOneGroupOut]]
grouping_column = Host element
#[HyperOpt]
# [[GridSearch]]
# estimator = KernelRidge
# cv = RepeatedKFold
# param_names = alpha ; gamma
# param_values = -5 5 100 log ; -5 5 100 log
# scoring = root_mean_squared_error
The main new additions to this input file are the input_grouping parameter in the GeneralSetup section and the LeaveOneGroupOut subsection in the DataSplits section.
By doing this run, we can assess the model fits resulting from the random cross-validation and the LOG cross-validation.
Random cross-validation:

LOG cross-validation:

We can immediately see the R-squared and errors are both worse for the LOG cross-validation test compared to the random cross-validation test. This is likely because the LOG test is a more rigorous test of model extrapolation, because the test scores in each case are for data for which host elements were never included in the training set. In addition, a minor effect contributing to the reduced accuracy may be due to the fact that the model hyperparameters were optimized by evaluating the root mean squared error for a random cross-validation test. If instead the parameters were optimized using the LOG test, the resulting fits would likely be improved.
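The LOG splitting corresponds to scikit-learn’s LeaveOneGroupOut scheme; a minimal sketch with invented group labels playing the role of the “Host element” column:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.arange(12).reshape(6, 2)
y = np.arange(6, dtype=float)
# Invented group labels standing in for "Host element"
groups = np.array(["Ag", "Ag", "Cu", "Cu", "Pt", "Pt"])

logo = LeaveOneGroupOut()
n_folds = logo.get_n_splits(X, y, groups)  # one fold per host element
for train_idx, test_idx in logo.split(X, y, groups):
    # Each test fold holds exactly one host element, never seen in training
    print("test:", set(groups[test_idx]), "train:", set(groups[train_idx]))
```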
There are a couple of additional plots that are standard output for a LOG test and are worth drawing attention to. The first is a plot of each metric’s test value for each group. This enables one to quickly assess which groups perform better or worse than others.

In addition, the parity plots for each split are now plotted with symbols denoting each group, which can help assess clustering of groups and goodness of fit on a per-group basis.
Training on all groups except Ag:

Testing on just Ag as the left-out host element:

Making predictions by importing a previously fit model¶
Here, we are going to import a previously fit model, and use it to predict the migration barriers for those data points with Pt as the host element.
In your previous run, the LOG test split where the Pt host values were predicted is in the split_12 folder. The parity plot for Pt test data should look like the below plot for your previous run:

Here, we are going to import the model that was fitted to all the groups except Pt, and use MAST-ML’s data validation function (detailed in the next section) to obtain this same plot, but using Pt as the validation data and the imported, previously trained model. If one were to extend this data set to include, for example, U as a host element, any number of previously trained models could be used to predict the migration barrier values for U. To import this model, save the KernelRidge_split_12.pkl file from your previous run into the /models/ folder (it is at the same level as the /tests/ folder in your main MAST-ML directory). To import this model into your next run, create a new field in the Models section, as shown below:
Example:
[Models]
#[[KernelRidge]]
# kernel = rbf
# alpha = 0.034
# gamma = 0.138
#[[KernelRidge_select]]
# kernel = rbf
# alpha = 1
# gamma = 1
#[[KernelRidge_learn]]
# kernel = rbf
# alpha = 1
# gamma = 1
[[ModelImport]]
model_path = models/KernelRidge_split_12.pkl
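MAST-ML exports trained models as .pkl files, which are ordinary Python pickles. The sketch below (invented toy data, filename simply mirroring the one above) shows the round trip of exporting and reloading a trained model:

```python
import pickle
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.RandomState(0)
X_train = rng.rand(30, 3)
y_train = X_train.sum(axis=1)

# Fit and export a model, as MAST-ML does for each split
model = KernelRidge(kernel="rbf", alpha=0.034, gamma=0.138).fit(X_train, y_train)
with open("KernelRidge_split_12.pkl", "wb") as f:
    pickle.dump(model, f)

# A later run can reload the trained model and predict without refitting
with open("KernelRidge_split_12.pkl", "rb") as f:
    reloaded = pickle.load(f)
preds = reloaded.predict(rng.rand(5, 3))
print(preds.shape)  # (5,)
```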
As we are only interested in assessing the fit on Pt for this example, we can change the DataSplits section to only have the LOG test:
Example:
[DataSplits]
#[[NoSplit]]
[[RepeatedKFold]]
n_splits = 5
n_repeats = 2
#[[RepeatedKFold_learn]]
# n_splits = 5
# n_repeats = 2
[[LeaveOneGroupOut]]
grouping_column = Host element
From running this model and inspecting the test data parity plot in split_12 (the folder for the Pt group), we obtain this parity plot:

As a comparison, this plot is exactly the same as the above plot from the previous run. This is the expected result, and it demonstrates that the previously fit model was successfully imported and used to predict the Pt values. By inspecting the other groups, for example split_1, which is for Ag, the R-squared and error values indicate a better fit than in our previous run. This better fit is expected, as the model we saved from the previous run contained Ag in the training data, so these predictions on Ag should be improved (note that this defeats the purpose of the LOG test, but it shows that the trained model we imported is behaving as expected).
Predicting values for new, extrapolated data¶
As a final example, we are going to use our model to predict the migration barriers for those data points with Pt as the host element. Your data file already has a column titled “predict_Pt”, with values equal to 0 in all rows except where Pt is the host, in which case the value is 1. In the GeneralSetup section of your input file, add the parameter input_testdata and set it equal to “predict_Pt”, as shown below. This ensures that the data with Pt as the host element are never involved in model training. This feature is a convenient way to isolate part of your data, or some new part of your data, to function only as a validation data set. This way, whenever a model is trained and tested on the remaining data, an additional prediction will also be calculated, which here is for the Pt host data.
Example:
[GeneralSetup]
input_features = Auto
input_target = Reduced barrier (eV)
randomizer = False
metrics = Auto
input_other = Host element, Solute element, predict_Pt
input_grouping = Host element
input_testdata = predict_Pt
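The effect of a validation column can be sketched with pandas (invented toy rows): rows flagged 1 are set aside from all training and used only for prediction.

```python
import pandas as pd

df = pd.DataFrame({"feat": [0.1, 0.2, 0.3, 0.4],
                   "Host element": ["Ag", "Cu", "Pt", "Pt"],
                   "predict_Pt": [0, 0, 1, 1]})

# Rows flagged 1 never enter model training; they form the validation set
train_df = df[df["predict_Pt"] == 0]
validate_df = df[df["predict_Pt"] == 1]

print(len(train_df), len(validate_df))   # 2 2
print(set(validate_df["Host element"]))  # {'Pt'}
```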
For this test, let’s run both the random cross-validation and LOG test. As a reminder, we need to un-comment the random cross-validation test in the DataSplits section:
Example:
[DataSplits]
#[[NoSplit]]
[[RepeatedKFold]]
n_splits = 5
n_repeats = 2
#[[RepeatedKFold_learn]]
# n_splits = 5
# n_repeats = 2
[[LeaveOneGroupOut]]
grouping_column = Host element
When running this test, you’ll notice there are fewer splits in the LOG test folder now. This is because Pt is only treated as a final “validation” or “extrapolation” data set, and is never involved in the training or test set in any split. For each split in the random and LOG CV tests, a “stats.txt” file is written, which provides the average train, test, and prediction results. The prediction results are for the Pt validation data. Below are screenshots of the stats.txt file for the random and LOG tests.
Random cross-validation:

LOG cross-validation:

For the random cross-validation, the errors are higher (and the R-squared value lower) for the predict_Pt dataset compared to the average over the test datasets. This is to be expected, as Pt is never involved in model training. Further, we can see that the predictions for predict_Pt are slightly worse in the case of the LOG cross-validation test compared to the random cross-validation test. This also makes sense, as each training split of the LOG test tends to result in worse predictive performance (i.e. worse model training) relative to the random cross-validation case, as discussed above when we compared the results of the random and LOG cross-validation tests.
This concludes the MAST-ML tutorial document! There are some other features of MAST-ML which were not explicitly discussed in this tutorial, such as forming data clusters. Consult the MAST-ML Input File section of this documentation for a more in-depth overview of all the possible options for different MAST-ML runs.
Code Documentation: Metrics¶
mastml.metrics Module¶
This module contains constructors for different model score metrics. Most model metrics are obtained from scikit-learn, while others are custom variations.
The full list of score functions in scikit-learn can be found at: http://scikit-learn.org/stable/modules/model_evaluation.html
Functions¶
adjusted_r2_score(y_true, y_pred[, n_features]): Method that calculates the adjusted R^2 value
check_and_fetch_names(metric_names, …): Method that checks whether chosen metrics to evaluate models are appropriate for user-specified models
r2_score_fitted(y_true, y_pred): Method that calculates the R^2 value
r2_score_noint(y_true, y_pred): Method that calculates the R^2 value without fitting the y-intercept
rmse_over_stdev(y_true, y_pred[, train_y]): Method that calculates the root mean squared error (RMSE) of a set of data, divided by the standard deviation of the training data set
root_mean_squared_error(y_true, y_pred): Method that calculates the root mean squared error (RMSE)
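Two of the metrics above are simple enough to restate in plain numpy (illustrative reimplementations, not MAST-ML’s exact code):

```python
import numpy as np

def root_mean_squared_error(y_true, y_pred):
    # RMSE: square root of the mean squared residual
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def rmse_over_stdev(y_true, y_pred, train_y):
    # RMSE normalized by the spread of the training target data
    return root_mean_squared_error(y_true, y_pred) / np.std(train_y)

print(root_mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # ~1.155
print(rmse_over_stdev([1.0, 2.0, 3.0], [1.0, 2.0, 5.0], [0.0, 2.0, 4.0]))
```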
Code Documentation: Configuration file parser¶
mastml.conf_parser Module¶
The conf_parser module is used for handling, parsing, and checking MAST-ML input configuration files
Functions¶
check_models_mixed(model_names): Method used to check whether the user has mixed regression and classification tasks
fix_types(maybe_list): Method that returns the true datatype of values passed as a string or list of strings, parsed from the configuration file
make_scorer(score_func, *[, …]): Make a scorer from a performance metric or loss function
mybool(string): Method that converts a string equal to ‘True’ or ‘False’ into type bool
parse_conf_file(filepath[, from_dict]): Method that accepts the filepath of an input configuration file and returns its parsed dictionary
Code Documentation: Data cleaner¶
mastml.data_cleaner Module¶
The data_cleaner module is used to clean missing or NaN values from pandas dataframes (e.g. removing NaN, imputation, etc.)
Functions¶
columns_with_strings(df): Method that ascertains which columns in the data contain string entries
flag_outliers(df, conf_not_input_features, …): Method that scans each X feature matrix column and flags values more than 3 standard deviations from that column’s mean
imputation(df, strategy[, cols_to_leave_out]): Method that imputes values to the missing places based on the median, mean, etc.
orth(A[, rcond]): Construct an orthonormal basis for the range of A using SVD
ppca(df[, cols_to_leave_out]): Method that performs a recursive PCA routine, using PCA of the known columns to fill in missing values in a particular column
remove(df, axis): Method that removes a full column or row of data values if that column or row contains NaN or is blank
Classes¶
PPCA(): Class to perform probabilistic principal component analysis (PPCA) to fill in missing data
SimpleImputer(*[, missing_values, strategy, …]): Imputation transformer for completing missing values
Class Inheritance Diagram¶

Code Documentation: Data loader¶
Code Documentation: Learning curve¶
mastml.learning_curve Module¶
This module contains methods to construct learning curves, which evaluate some cross-validation performance metric (e.g. RMSE) as a function of amount of training data (i.e. a sample learning curve) or as a function of the number of features used in the fitting (i.e. a feature learning curve).
Functions¶
f_regression(X, y, *[, center]): Univariate linear regression tests
feature_learning_curve(X, y, estimator, cv, …): Method that calculates the data used to plot a feature learning curve
learning_curve(estimator, X, y, *[, groups, …]): Learning curve
sample_learning_curve(X, y, estimator, cv, …): Method that calculates the data used to plot a sample learning curve
Code Documentation: Clusterers¶
mastml.legos.clusterers Module¶
The clusterers module is used for instantiating cluster algorithm objects from scikit-learn. More information is available at http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster
Code Documentation: Data splitters¶
mastml.legos.data_splitters Module¶
The data_splitters module contains a collection of classes for generating (train_indices, test_indices) pairs from a dataframe or a numpy array.
For more information and a list of scikit-learn splitter classes, see http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection
Classes¶
BaseEstimator |
Base class for all estimators in scikit-learn. |
Bootstrap (n[, n_bootstraps, train_size, …]) |
# Note: Bootstrap taken directly from sklearn Github (https://github.com/scikit-learn/scikit-learn/blob/0.11.X/sklearn/cross_validation.py) # which was necessary as it was later removed from more recent sklearn releases Random sampling with replacement cross-validation iterator Provides train/test indices to split data in train test sets while resampling the input n_bootstraps times: each time a new random split of the data is performed and then samples are drawn (with replacement) on each side of the split to build the training and test sets. |
JustEachGroup () |
Class to train the model on one group at a time and test it on the rest of the data This class wraps scikit-learn’s LeavePGroupsOut with P set to n-1. |
LeaveCloseCompositionsOut ([dist_threshold, …]) |
Leave-P-out where you exclude materials with compositions close to those the test set |
LeaveOutPercent ([percent_leave_out, n_repeats]) |
Class to train the model using a certain percentage of data as training data |
NearestNeighbors (*[, n_neighbors, radius, …]) |
Unsupervised learner for implementing neighbor searches. |
NoSplit () |
Class to just train the model on the training data and test it on that same data. |
SplittersUnion (splitters) |
Class to take the union of two separate splitting routines, so that many splitting routines can be performed at once |
TransformerMixin |
Mixin class for all transformers in scikit-learn. |
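The idea behind the Bootstrap splitter above can be sketched with plain NumPy; the helper below is hypothetical and only illustrates the split-then-resample-with-replacement scheme, not the class's exact implementation:

```python
import numpy as np

def bootstrap_split(n, train_size=0.5, random_state=0):
    """Sketch of one bootstrap iteration: split the indices, then
    resample each side with replacement (hypothetical helper, not
    the MAST-ML Bootstrap class itself)."""
    rng = np.random.RandomState(random_state)
    permuted = rng.permutation(n)
    n_train = int(train_size * n)
    train_pool, test_pool = permuted[:n_train], permuted[n_train:]
    # Draw with replacement on each side of the split
    train = rng.choice(train_pool, size=len(train_pool), replace=True)
    test = rng.choice(test_pool, size=len(test_pool), replace=True)
    return train, test

train_idx, test_idx = bootstrap_split(10)
```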
Class Inheritance Diagram¶

Code Documentation: Utils¶
mastml.utils Module¶
The utils module contains a collection of miscellaneous methods and error handling used throughout MAST-ML
Functions¶
activate_logging (savepath, paths[, …]) |
Method to create MAST-ML logger file |
ceil |
Return the ceiling of x as an Integral. |
floor |
Return the floor of x as an Integral. |
join (a, *p) |
Join two or more pathname components, inserting ‘/’ as needed. |
log (x, [base=math.e]) |
Return the logarithm of x to the given base. |
log_header (paths, log) |
Method to create header for MAST-ML logger |
nice_range (lower, upper) |
Method to create a range of values, including the specified start and end points, with nicely spaced intervals |
verbosalize_logger (log, verbosity) |
Classes¶
BetweenFilter (min_level, max_level) |
Class to aid in handling logger display levels |
ConfError |
Class representing error in input configuration file |
FileNotFoundError |
Class representing error raised when a needed file cannot be found |
FiletypeError |
Class representing error raised when an improper file extension is used |
InvalidConfParameters |
Class representing error raised when you have invalid input configuration file parameters |
InvalidConfSection |
Class representing error raised when an invalid section name is present in the input configuration file |
InvalidConfSubSection |
Class representing error raised when an invalid subsection name is present in the input configuration file |
InvalidModel |
Class representing error when model does not exist |
InvalidValue |
Class representing error raised when an invalid value has been used |
MastError |
Base class for MAST-ML specific errors that should be shown to the user |
MissingColumnError |
Class representing error raised when your csv doesn’t have the specified column |
defaultdict |
defaultdict(default_factory[, …]) --> dict with default factory |
Class Inheritance Diagram¶

Code Documentation: MAST-ML Driver¶
mastml.mastml_driver Module¶
Main MAST-ML module responsible for executing the workflow of a MAST-ML run
Functions¶
check_paths (conf_path, data_path, outdir) |
This method is responsible for error handling of the user-specified paths for the configuration file, data file, and output directory. |
clone (estimator, *[, safe]) |
Constructs a new unfitted estimator with the same parameters. |
deepcopy (x[, memo, _nil]) |
Deep copy operation on arbitrary Python objects. |
get_commandline_args () |
This method is responsible for parsing and checking the command-line execution of MAST-ML inputted by the user. |
join (a, *p) |
Join two or more pathname components, inserting ‘/’ as needed. |
main (conf_path, data_path[, outdir, verbosity]) |
This method is responsible for setting up the initial stage of the MAST-ML run, such as parsing input directories to designate where data will be imported and results saved to, as well as creation of the MAST-ML run log. |
make_scorer (score_func, *[, …]) |
Make a scorer from a performance metric or loss function. |
mastml_run (conf_path, data_path, outdir) |
This method is responsible for conducting the main MAST-ML run workflow |
reduce (function, sequence[, initial]) |
Apply a function of two arguments cumulatively to the items of a sequence, from left to right, so as to reduce the sequence to a single value. |
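Several of the functions above are re-exported from the Python standard library; for example, reduce folds a sequence down to a single value:

```python
from functools import reduce

# reduce applies a two-argument function cumulatively from the left:
# ((((1 + 2) + 3) + 4) + 5) = 15
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5])

# With an initial value, that value seeds the accumulation
total_from_ten = reduce(lambda acc, x: acc + x, [1, 2, 3, 4, 5], 10)
```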
Code Documentation: Plot Helper¶
mastml.plot_helper Module¶
This module contains a collection of functions which make plots (saved as png files) using matplotlib, generated from some model fits and cross-validation evaluation within a MAST-ML run.
This module also contains a method to create python notebooks containing plotted data and the relevant source code from this module, to enable the user to make their own modifications to the created plots in a straightforward way (useful for tweaking plots for a presentation or publication).
Functions¶
auc (x, y) |
Compute Area Under the Curve (AUC) using the trapezoidal rule. |
ceil |
Return the ceiling of x as an Integral. |
confusion_matrix (y_true, y_pred, *[, …]) |
Compute confusion matrix to evaluate the accuracy of a classification. |
figaspect (arg) |
Calculate the width and height for a figure with a specified aspect ratio. |
floor |
Return the floor of x as an Integral. |
get_divisor (high, low) |
Method to obtain a sensible divisor based on range of two values |
get_histogram_bins (y_df) |
Method to obtain the number of bins to use when plotting a histogram |
ipynb_maker (plot_func) |
This method creates Jupyter Notebooks so user can modify and regenerate the plots produced by MAST-ML. |
join (a, *p) |
Join two or more pathname components, inserting ‘/’ as needed. |
log (x, [base=math.e]) |
Return the logarithm of x to the given base. |
make_axes_locatable (axes) |
|
make_axis_same (ax, max1, min1) |
Method to make the x and y ticks for each axis the same. |
make_error_plots (run, path, …[, groups]) |
|
make_fig_ax ([aspect_ratio, x_align, left]) |
Method to make matplotlib figure and axes objects. |
make_fig_ax_square ([aspect, aspect_ratio]) |
Method to make square shaped matplotlib figure and axes objects. |
make_train_test_plots (run, path, …[, groups]) |
General plotting method used to execute sequence of specific plots of train-test data analysis |
mark_inset (parent_axes, inset_axes, loc1, …) |
Draw a box to mark the location of an area represented by an inset axes. |
nice_mean (ls) |
Method to return mean of a list or equivalent array with NaN values |
nice_names () |
|
nice_range (lower, upper) |
Method to create a range of values, including the specified start and end points, with nicely spaced intervals |
nice_std (ls) |
Method to return standard deviation of a list or equivalent array with NaN values |
parse_error_data (dataset_stdev, …) |
|
plot_1d_heatmap (xs, heats, savepath[, …]) |
Method to plot a heatmap for values of a single variable; used for plotting GridSearch results in hyperparameter optimization. |
plot_2d_heatmap (xs, ys, heats, savepath[, …]) |
Method to plot a heatmap for values of two variables; used for plotting GridSearch results in hyperparameter optimization. |
plot_3d_heatmap (xs, ys, zs, heats, savepath) |
Method to plot a heatmap for values of three variables; used for plotting GridSearch results in hyperparameter optimization. |
plot_average_cumulative_normalized_error (…) |
Method to plot the cumulative normalized residual errors of a model prediction |
plot_average_normalized_error (y_true, …[, …]) |
Method to plot the normalized residual errors of a model prediction |
plot_best_worst_per_point (y_true, …[, …]) |
Method to create a parity plot (predicted vs. true values) |
plot_best_worst_split (y_true, best_run, …) |
Method to create a parity plot (predicted vs. true values) |
plot_confusion_matrix (y_true, y_pred, …[, …]) |
Method used to generate a confusion matrix for a classification run. |
plot_cumulative_normalized_error (y_true, …) |
Method to plot the cumulative normalized residual errors of a model prediction |
plot_keras_history (model_history, savepath, …) |
|
plot_learning_curve (train_sizes, train_mean, …) |
Method used to plot both data and feature learning curves |
plot_learning_curve_convergence (train_sizes, …) |
Method used to plot both the convergence of data and feature learning curves as a function of amount of data or features |
plot_metric_vs_group (metric, groups, stats, …) |
Method to plot the value of a particular calculated metric (e.g. RMSE) for each group of data |
plot_metric_vs_group_size (metric, groups, …) |
Method to plot the value of a particular calculated metric (e.g. RMSE) as a function of the size of each data group |
plot_normalized_error (y_true, y_pred, …[, …]) |
Method to plot the normalized residual errors of a model prediction |
plot_precision_recall_curve (y_true, y_pred, …) |
Method to calculate and plot the precision-recall curve for classification model results |
plot_predicted_vs_true (train_quad, …) |
Method to create a parity plot (predicted vs. true values) |
plot_predicted_vs_true_bars (y_true, …[, …]) |
Method to calculate a parity plot (predicted vs. true values) |
plot_real_vs_predicted_error (y_true, …) |
|
plot_residuals_histogram (y_true, y_pred, …) |
Method to calculate and plot the histogram of residuals from regression model |
plot_roc_curve (y_true, y_pred, savepath) |
Method to calculate and plot the receiver-operator characteristic curve for classification model results |
plot_scatter (x, y, savepath[, groups, …]) |
Method to create a general scatter plot |
plot_stats (fig, stats[, x_align, y_align, …]) |
Method that prints stats onto the plot. |
plot_target_histogram (y_df, savepath[, …]) |
Method to plot the histogram of true y values |
precision_recall_curve (y_true, probas_pred, *) |
Compute precision-recall pairs for different probability thresholds. |
prediction_intervals (model, X, …) |
Method to calculate prediction intervals when using Random Forest and Gaussian Process regression models. |
r2_score (y_true, y_pred, *[, sample_weight, …]) |
R^2 (coefficient of determination) regression score function. |
recursive_max (arr) |
Method to recursively find the max value of an array of iterables. |
recursive_max_and_min (arr) |
Method to recursively return max and min of values or iterables in array |
recursive_min (arr) |
Method to recursively find the min value of an array of iterables. |
roc_curve (y_true, y_score, *[, pos_label, …]) |
Compute Receiver operating characteristic (ROC). |
round_down (num, divisor) |
Method to return a rounded down number |
round_up (num, divisor) |
Method to return a rounded up number |
rounder (delta) |
Method to obtain number of decimal places to report on plots |
stat_to_string (name, value, nice_names) |
Method that converts a metric object into a string for displaying on a plot |
trim_array (arr_list) |
Method used to trim a set of arrays to make all arrays the same shape |
wraps (wrapped[, assigned, updated]) |
Decorator factory to apply update_wrapper() to a wrapper function |
zoomed_inset_axes (parent_axes, zoom[, loc, …]) |
Create an anchored inset axes by scaling a parent axes. |
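Two of the scikit-learn metrics listed above can be checked directly; auc integrates y(x) with the trapezoidal rule, and r2_score is the coefficient of determination:

```python
from sklearn.metrics import auc, r2_score

# The diagonal from (0, 0) to (1, 1) encloses an area of exactly 0.5
area = auc([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])

# r2_score is 1 - SS_res / SS_tot; a perfect prediction scores 1.0
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
score = r2_score(y_true, y_pred)
perfect = r2_score(y_true, y_true)
```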
Code Documentation: HTML Helper¶
mastml.html_helper Module¶
Module for generating an HTML file, called index.html, which contains an overview of the key data and plots from a MAST-ML run. Images of cross-validation parity plots, data histograms, data statistics, and links to the relevant files are all provided.
Functions¶
attr (*args, **kwargs) |
Set attributes on the current active tag context |
get_current ([default]) |
get the current tag being used as a with context or decorated function. |
gmtime ([seconds]) |
Convert seconds since the Epoch to a struct_time expressing UTC, with fields such as tm_year, tm_mon, tm_mday, tm_hour, tm_min, tm_sec, tm_wday, tm_yday, and tm_isdst |
is_test_image (path) |
Method used to assess whether an image is for testing data |
is_train_image (path) |
Method used to assess whether an image is for training data |
join (a, *p) |
Join two or more pathname components, inserting ‘/’ as needed. |
make_html (outdir) |
Method used to create the main index.html file |
make_image (src[, title]) |
Method used to generate and show an image of a fixed width. |
make_link (href) |
Method used to generate a link to a particular file created from a MAST-ML run. |
relpath (path[, start]) |
Return a relative version of a path |
show_combo (combo_dir, outdir) |
Method used to collect the results from each combination of data analysis steps (e.g. feature normalization, feature selection, and model fitting) for display in the index.html file |
simple_section (filepath, outdir) |
Method used to create a section name for a particular analysis combination that will be displayed in the index.html file. |
strftime (format[, tuple]) |
Convert a time tuple to a string according to a format specification. |
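The time utilities listed above come from Python's standard library; for example, gmtime(0) is the Unix epoch, which strftime formats deterministically:

```python
import time

# gmtime converts seconds since the Epoch to a struct_time in UTC;
# gmtime(0) is therefore 1970-01-01 00:00:00
epoch = time.gmtime(0)
stamp = time.strftime("%Y-%m-%d %H:%M:%S", epoch)
```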
Code Documentation: Feature Selectors¶
mastml.legos.feature_selectors Module¶
This module contains a collection of classes and methods for selecting features, and interfaces with scikit-learn feature selectors. More information on scikit-learn feature selectors is available at:
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
Functions¶
cov (m[, y, rowvar, bias, ddof, fweights, …]) |
Estimate a covariance matrix, given data and weights. |
dataframify_new_column_names (transform, name) |
Method which transforms output of scikit-learn feature selectors to dataframe, and adds column names |
dataframify_selector (transform) |
Method which transforms output of scikit-learn feature selectors from array to dataframe. |
fitify_just_use_values (fit) |
Method which enables a feature selector fit method to operate on dataframes |
pearsonr (x, y) |
Pearson correlation coefficient and p-value for testing non-correlation. |
root_mean_squared_error (y_true, y_pred) |
Method that calculates the root mean squared error (RMSE) |
wraps (wrapped[, assigned, updated]) |
Decorator factory to apply update_wrapper() to a wrapper function |
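The RMSE computed by root_mean_squared_error follows the standard formula; a minimal NumPy sketch of the metric (not necessarily the module's exact implementation):

```python
import numpy as np

def rmse(y_true, y_pred):
    """RMSE = sqrt(mean((y_true - y_pred)^2)); a minimal sketch of
    the metric the module's root_mean_squared_error computes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Residuals are [0, 0, -2], so RMSE is sqrt(4/3)
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```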
Classes¶
BaseEstimator |
Base class for all estimators in scikit-learn. |
EnsembleModelFeatureSelector (estimator, …) |
Class custom-written for MAST-ML to conduct selection of features with ensemble model feature importances |
MASTMLFeatureSelector (estimator, …[, …]) |
Class custom-written for MAST-ML to conduct forward selection of features with flexible model and cv scheme |
PCA ([n_components, copy, whiten, …]) |
Principal component analysis (PCA). |
PearsonSelector (threshold_between_features, …) |
Class custom-written for MAST-ML to conduct selection of features based on the Pearson correlation coefficient between features and target. |
SequentialFeatureSelector (estimator[, …]) |
Sequential Feature Selection for Classification and Regression. |
TransformerMixin |
Mixin class for all transformers in scikit-learn. |
constructor |
alias of sklearn.feature_selection._variance_threshold.VarianceThreshold |
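The constructor alias above points at scikit-learn's VarianceThreshold, which drops features whose variance does not exceed a threshold; with the default threshold of 0 it removes constant columns:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Three features; the middle column is constant, so it carries no
# information and a variance threshold of 0 removes it
X = np.array([[0.0, 1.0, 2.0],
              [1.0, 1.0, 4.0],
              [2.0, 1.0, 6.0]])

selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(X)
```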
Class Inheritance Diagram¶

Code Documentation: Feature Normalizers¶
mastml.legos.feature_normalizers Module¶
This module contains a collection of classes and methods for normalizing features. Also included is connection with scikit-learn methods. See http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing for more info.
Functions¶
dataframify (transform) |
Method which is a decorator that transforms the output of scikit-learn feature normalizers from array to dataframe. |
wraps (wrapped[, assigned, updated]) |
Decorator factory to apply update_wrapper() to a wrapper function |
Classes¶
BaseEstimator |
Base class for all estimators in scikit-learn. |
Binarizer (*[, threshold, copy]) |
Binarize data (set feature values to 0 or 1) according to a threshold. |
MaxAbsScaler (*[, copy]) |
Scale each feature by its maximum absolute value. |
MeanStdevScaler ([features, mean, stdev]) |
Class designed to normalize input data to a specified mean and standard deviation |
MinMaxScaler ([feature_range, copy, clip]) |
Transform features by scaling each feature to a given range. |
Normalizer ([norm, copy]) |
Normalize samples individually to unit norm. |
OneHotEncoder (*[, categories, drop, sparse, …]) |
Encode categorical features as a one-hot numeric array. |
QuantileTransformer (*[, n_quantiles, …]) |
Transform features using quantiles information. |
RobustScaler (*[, with_centering, …]) |
Scale features using statistics that are robust to outliers. |
StandardScaler (*[, copy, with_mean, with_std]) |
Standardize features by removing the mean and scaling to unit variance |
TransformerMixin |
Mixin class for all transformers in scikit-learn. |
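Most of these normalizers come straight from scikit-learn; for example, StandardScaler centers each feature to zero mean and scales it to unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

# StandardScaler removes the per-feature mean and divides by the
# per-feature standard deviation, giving mean 0 and variance 1
X_scaled = StandardScaler().fit_transform(X)
```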
Class Inheritance Diagram¶

Code Documentation: Randomizers¶
mastml.legos.randomizers Module¶
This module contains a class used to randomize the input y data, in order to create a “null model” for testing how rigorous other machine learning model predictions are.
Classes¶
Randomizer () |
Class which randomizes X-y pairings by shuffling the y values |
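The null-model idea can be sketched in a few lines: shuffling y preserves its distribution while destroying any real X-y relationship (illustrative only, not the Randomizer class's exact implementation):

```python
import numpy as np

# Shuffle the target values with a fixed seed; any model fit to the
# original X against y_shuffled should perform no better than chance
rng = np.random.RandomState(0)
y = np.arange(100, dtype=float)
y_shuffled = y[rng.permutation(len(y))]
```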
Class Inheritance Diagram¶

Code Documentation: Model Finder¶
mastml.legos.model_finder Module¶
This module provides a name_to_constructor dict for all models/estimators in scikit-learn, plus a couple of test models and error-handling functions
Functions¶
check_models_mixed (model_names) |
Method used to check whether the user has mixed regression and classification tasks |
find_model (model_name) |
Method used to check model names conform to scikit-learn model/estimator names |
Classes¶
AlwaysFive ([constant]) |
Class used as a test model that always predicts a value of 5. |
EnsembleRegressor (n_estimators, num_samples, …) |
|
KerasRegressor (conf_dict) |
|
ModelImport (model_path) |
Class used to import pickled models from previous machine learning fits |
RandomGuesser () |
Class used as a test model that always predicts random values for y data. |
Class Inheritance Diagram¶

Code Documentation: Utility Legos¶
mastml.legos.util_legos Module¶
This module contains a collection of classes for debugging and control flow
Classes¶
BaseEstimator |
Base class for all estimators in scikit-learn. |
DataFrameFeatureUnion (transforms) |
Class for taking the union of dataframe generators (sklearn.pipeline.FeatureUnion always outputs arrays) |
DoNothing () |
Class for having a “null” transform where the output is the same as the input. |
TransformerMixin |
Mixin class for all transformers in scikit-learn. |
Class Inheritance Diagram¶

Code Documentation: Feature Generators¶
mastml.legos.feature_generators Module¶
This module contains a collection of classes for generating input features to fit machine learning models to.
Functions¶
clean_dataframe (df) |
Method to clean dataframes after feature generation has occurred, to remove columns that have a single missing or NaN value, or remove a row that is fully empty |
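A sketch of the kind of cleaning clean_dataframe describes, using plain pandas (illustrative, not the module's exact implementation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],      # complete column: kept
    "b": [1.0, np.nan, 3.0],   # contains a NaN: dropped
})

# Drop fully empty rows, then drop any column still containing a NaN,
# mirroring the cleaning described for clean_dataframe
cleaned = df.dropna(axis=0, how="all").dropna(axis=1, how="any")
```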
Classes¶
BaseEstimator |
Base class for all estimators in scikit-learn. |
ContainsElement (composition_feature, …[, …]) |
Class to generate new categorical features (i.e. values of 1 or 0) indicating whether a composition contains a particular element |
DataframeUtilities |
Class of basic utilities for dataframe manipulation, and exchanging between dataframes and numpy arrays |
Magpie (composition_feature[, feature_types]) |
Class that wraps MagpieFeatureGeneration, giving it scikit-learn structure |
MagpieFeatureGeneration (dataframe, …) |
Class to generate new features using Magpie data and dataframe containing material compositions |
MaterialsProject (composition_feature, api_key) |
Class that wraps MaterialsProjectFeatureGeneration, giving it scikit-learn structure |
MaterialsProjectFeatureGeneration (dataframe, …) |
Class to generate new features using Materials Project data and a dataframe containing material compositions. The dataframe must have a column named “Material compositions”. |
Matminer (structural_features, structure_col) |
Class to generate structural features from the matminer structure module. Args: structural_features (the structure feature(s) the user wants to instantiate and generate) and structure_col (the dataframe column that contains the pymatgen structure object). |
NoGenerate () |
Class for having a “null” transform where the output is the same as the input. |
PolynomialFeatures ([features, degree, …]) |
Class to generate polynomial features using scikit-learn’s polynomial features method. More info at: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html |
SklearnPolynomialFeatures |
alias of sklearn.preprocessing._data.PolynomialFeatures |
TransformerMixin |
Mixin class for all transformers in scikit-learn. |
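The PolynomialFeatures entry above wraps scikit-learn's transformer of the same name; a degree-2 expansion of two features [a, b] yields [1, a, b, a^2, ab, b^2]:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])

# Degree-2 expansion of [2, 3]: [1, 2, 3, 4, 6, 9]
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
```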
Class Inheritance Diagram¶
