Welcome to MAterials Simulation Toolkit for Machine Learning (MAST-ML)’s documentation!

Acknowledgements

Materials Simulation Toolkit for Machine Learning (MAST-ML)

MAST-ML is an open-source Python package designed to broaden and accelerate the use of machine learning in materials science research

Contributors

University of Wisconsin-Madison Computational Materials Group:

  • Prof. Dane Morgan
  • Dr. Ryan Jacobs
  • Dr. Tam Mayeshiba
  • Ben Afflerbach
  • Dr. Henry Wu

University of Kentucky contributors:

  • Luke Harold Miles
  • Robert Max Williams
  • Prof. Raphael Finkel

MAST-ML documentation:

An overview of code documentation and tutorials for getting started with MAST-ML can be found here

Funding

This work was and is funded by the National Science Foundation (NSF) SI2 award No. 1148011 and DMREF award number DMR-1332851

Citing MAST-ML

If you find MAST-ML useful, please cite the following publication:

Jacobs, R., Mayeshiba, T., Afflerbach, B., Miles, L., Williams, M., Turner, M., Finkel, R., Morgan, D., “The Materials Simulation Toolkit for Machine Learning (MAST-ML): An automated open source toolkit to accelerate data-driven materials research”, Computational Materials Science 175 (2020), 109544. https://doi.org/10.1016/j.commatsci.2020.109544

Code Repository

MAST-ML is available on PyPi: pip install mastml

MAST-ML is available on GitHub: https://github.com/uw-cmg/MAST-ML

git clone --single-branch --branch master https://github.com/uw-cmg/MAST-ML

Installing MAST-ML

Hardware and Data Requirements

Hardware

A PC or Mac capable of running Python 3.

Data

  • Numeric data file in .csv or .xlsx format. There must be at least some target feature data, so that models can be fit.
  • The first row of the file (each column) should have a text name (a string); this is how columns are referenced later in the input file.
  • If working in a Jupyter environment, you can also pass in a pandas DataFrame directly (see the sketch below).
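
For illustration, a minimal dataset satisfying these requirements could be assembled with pandas (the column names here are hypothetical):

import pandas as pd

# Named columns in the first row, numeric feature values, and at least one
# target column for models to fit to
df = pd.DataFrame({'feature_1': [1.0, 2.0, 3.0],
                   'feature_2': [0.1, 0.2, 0.3],
                   'target': [10.0, 20.0, 30.0]})
df.to_csv('my_data.csv', index=False)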

Terminal installation (Linux or linux-like terminal on Mac)

Install Python3

Install Python 3: for easier installation of numpy and scipy dependencies, download Anaconda from https://www.continuum.io/downloads

Create a conda environment

Create an environment:

conda create --name MAST_ML python=3.7
conda activate MAST_ML
pip install mastml
Set up Jupyter notebooks

No separate setup for Jupyter notebooks is necessary; once MAST-ML has been run and has created a notebook, navigate in the terminal to the directory housing the notebook and type:

jupyter notebook

and a browser window with the notebook should appear.

Install the MAST-ML package

Pip install MAST-ML from PyPi:

pip install mastml

Alternatively, git clone the Github repository, for example:

git clone https://github.com/uw-cmg/MAST-ML

Clone from “master” unless instructed specifically to use another branch. Ask for access if you cannot find this code.

Check status.github.com for issues if you believe GitHub may be malfunctioning.

Run:

python setup.py install

Imports that don’t work

If any imports don’t work, first try a conda install, and if that gives errors, try a pip install (for example, conda install numpy or pip install numpy). Put the path to the installed MAST-ML folder in your PYTHONPATH if it isn’t there already.

Windows installation

Install Python3

Install Python 3: for easier installation of numpy and scipy dependencies, download Anaconda from https://www.continuum.io/downloads

Create a conda environment

From the Anaconda Navigator, go to Environments and create a new environment. Select Python version 3.6.

Under “Channels”, along with the defaults channel, “Add” the “materials” channel. The Channels list should now read:

defaults
materials

(may be the “matsci” channel instead of the “materials” channel; this channel is used to install pymatgen)

Set up the Spyder IDE and Jupyter notebooks

From the Anaconda Navigator, go to Home. With the newly created environment selected, click on “Install” below Jupyter, then click on “Install” below Spyder.

Once MAST-ML has been run and has created a Jupyter notebook (run MAST-ML from a location inside the Anaconda environment, so that the notebook is also inside the environment tree), go to Environments in the Anaconda Navigator, make sure the environment is selected, press the green arrow button, and select Open jupyter notebook.

Install the MAST-ML package

Pip install MAST-ML from PyPi:

pip install mastml

Alternatively, git clone the Github repository, for example:

git clone https://github.com/uw-cmg/MAST-ML

Clone from “master” unless instructed specifically to use another branch. Ask for access if you cannot find this code.

Check status.github.com for issues if you believe GitHub may be malfunctioning.

Run:

python setup.py install

Imports that don’t work

If any imports don’t work, first try a conda install, and if that gives errors, try a pip install (for example, conda install numpy or pip install numpy). Put the path to the installed MAST-ML folder in your PYTHONPATH if it isn’t there already.

Windows 10 install: step-by-step guide (credit Joe Kern)

First, figure out whether your computer is 32- or 64-bit. Type “system information” in your search bar and look at System Type: x86 means a 32-bit computer, x64 means 64-bit.

Second, download an environment manager. Environments are directories on your computer that store dependencies. For instance, one program you run might depend on version 1.0 of another program x, while a different program might depend on version 2.0 of program x. Having multiple environments allows you to use both programs and their dependencies on your computer. I recommend you download Anaconda, not because it is the best, but because it is an environment manager I know how to get working with MAST-ML. Feel free to experiment with other managers. Download the Python 3.7 version at https://www.anaconda.com/distribution/ and follow the installation instructions. Pick the graphical installer that corresponds to your computer system (64-bit or 32-bit).

Third, download Visual Studio. Some of the MAST-ML dependencies require C++ distributables in order to run. Visual Studio Code is a code editor made for Windows 10, and the MAST-ML dependencies will look in the Visual Studio Code folder for these C++ distributables when they download. There may be another way to obtain these C++ distributables without Visual Studio Code, but I am not sure how to do that. Go here to download: https://visualstudio.microsoft.com/downloads/#build-tools-for-visual-studio-2017

Fourth, install Visual Studio with the C++ build tools and restart the computer.

Fifth, open the Anaconda Navigator. Click Environments, then Create at the bottom. Name it MASTML and make it Python 3.6. DO NOT make it Python 3.7 or Python version 2.6 or 2.7; some dependencies do not work with those other versions.

Sixth, click the arrow next to your environment name and open a command shell. In the command line, type “pip install ” and then copy-paste the dependency names from the dependency file into your command prompt.

Seventh, test whether MAST-ML runs. There are multiple ways to do this, but I will outline one. Navigate to your MAST-ML folder in the command prompt using the command ‘cd’, which changes the directory your command prompt is operating in. To find the path, right-click the MAST-ML folder and click Properties. Copy the location, and in the command prompt type ‘cd’ followed by the pasted location. Append the folder name itself (e.g. ‘MAST-ML’) to the end of the pasted value so you end up inside the mastml directory.

Finally, copy-paste python -m mastml.mastml_driver mastml/tests/conf/example_input.conf mastml/tests/csv/example_data.csv -o results/mastml_tutorial into your command prompt and run it. If it all works, you’re good to go.

Startup

Locate the examples folder

In the installed MASTML directory, navigate to the tests folder.

Under tests/conf, the file example_input.conf will use the example_data.xlsx data file located in tests/csv to run an example.

Run the MASTML command

The format is python3 -m mastml.mastml_driver <path to config file> <path to data .xlsx file> -o <path to results folder>

For example, to conduct the test run above, while in the MASTML install directory:

python3 -m mastml.mastml_driver tests/conf/example_input.conf tests/csv/example_data.xlsx -o results/example_results

This is a terminal command. For Windows, assuming setup has been followed as above, go to the Anaconda Navigator, Environments, select the environment, click the green arrow button, and Open terminal.

When you execute the above command, you’ll know it’s working if you begin to see output on your screen.

Check output

index.html should be created, linking to certain representative plots for each test

For this example, output will be located in subfolders in the results/example_results folder.

Check the following to see if the run completed successfully:

  • A log.log file is generated, and its last line contains the phrase "Making html file of all run stats..."
  • An index.html file that gives some summary plots from all the tests that were run
  • A series of subfolders with names "StandardScaler" -> "DoNothing" -> "KernelRidge", with the following three directories within the "KernelRidge" directory: "LeaveOneGroupOut_host", "NoSplit", and "RepeatedKFold"

You can compare all of these files with those given in the /example_results directory which should match.

MAST-ML Input File

This document provides an overview of the various sections and fields of the MAST-ML input file.

A full template input file can be downloaded here: MASTML_InputFile

Input file sections

General Setup

The “GeneralSetup” section of the input file allows the user to specify an assortment of basic MAST-ML parameters, including which column names in the data file to use as features for fitting (i.e. X data), which column to fit to (i.e. y data), and which metrics to employ in evaluating a model, among other things.

Example:

[GeneralSetup]
    input_features = feature_1, feature_2, etc. or "Auto"
    input_target = target_feature
    randomizer = False
    metrics = root_mean_squared_error, mean_absolute_error, etc. or "Auto"
    input_other = additional_feature_1, additional_feature_2
    input_grouping = grouping_feature_1
    input_testdata = validation_feature_1
  • input_features List of input X features
  • input_target Target y feature
  • randomizer Whether or not to randomize y feature data. Useful for establishing a null “baseline” test
  • metrics Which metrics to evaluate model fits
  • input_other Additional features that are not to be fitted on (i.e. not X features)
  • input_grouping Feature names that provide information on data grouping
  • input_testdata Feature name that designates whether data will be used for validation (set rows to 1 or 0 in the csv file)

Data Cleaning

The “DataCleaning” section of the input file allows the user to clean their data to remove rows or columns that contain empty or NaN fields, or fill in these fields using imputation or principal component analysis methods.

Example:

[DataCleaning]
    cleaning_method = remove, imputation, ppca
    imputation_strategy = mean, median
  • cleaning_method Method of data cleaning. “remove” simply removes columns with missing data. “imputation” uses basic operation to fill in missing values. “ppca” uses principal component analysis to fill in missing values.
  • imputation_strategy Only valid field if doing imputation, selects method to impute missing data by using mean, median, etc. of the column

Clustering

Optional section to perform clustering of data using well-known clustering algorithms available in scikit-learn. Note that the subsection names must match the corresponding name of the routine in scikit-learn. More information on clustering routines and the parameters to set for each routine can be found here: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster . For the purpose of this full input file, we use the scikit-learn default parameter values. Note that not all parameters are listed.

Example:

[Clustering]
    [[AffinityPropagation]]
        damping = 0.5
        max_iter = 200
        convergence_iter = 15
        affinity = euclidean
    [[AgglomerativeClustering]]
        n_clusters = 2
        affinity = euclidean
        compute_full_tree = auto
        linkage = ward
    [[Birch]]
        threshold = 0.5
        branching_factor = 50
        n_clusters = 3
    [[DBSCAN]]
        eps = 0.5
        min_samples = 5
        metric = euclidean
        algorithm = auto
        leaf_size = 30
    [[KMeans]]
        n_clusters = 8
        n_init = 10
        max_iter = 300
        tol = 0.0001
    [[MiniBatchKMeans]]
        n_clusters = 8
        max_iter = 100
        batch_size = 100
    [[MeanShift]]
    [[SpectralClustering]]
        n_clusters = 8
        n_init = 10
        gamma = 1.0
        affinity = rbf
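
Since the subsection names map directly onto scikit-learn classes, each entry corresponds to an ordinary estimator construction. For example, the [[KMeans]] settings above are equivalent to the following sketch (an illustration only, not MAST-ML's internal code):

from sklearn.cluster import KMeans

# Build the clusterer with the same parameters as the [[KMeans]] subsection
kmeans = KMeans(n_clusters=8, n_init=10, max_iter=300, tol=0.0001)
# labels = kmeans.fit_predict(X)  # one cluster label per row of the X feature matrix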

Feature Generation

Optional section to perform feature generation based on properties of the constituent elements. These routines were custom written for MAST-ML, except for PolynomialFeatures. For more information on the MAST-ML custom routines, consult the MAST-ML online documentation. For more information on PolynomialFeatures, see: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html

Example:

[FeatureGeneration]
    [[Magpie]]
        composition_feature = Material Compositions
        feature_types = composition_avg, arithmetic_avg, max, min, difference
    [[MaterialsProject]]
        composition_feature = Material Compositions
        api_key = my_api_key
    [[Citrine]]
        composition_feature = Material Compositions
        api_key = my_api_key
    [[ContainsElement]]
        composition_feature = Host element
        all_elements = False
        element = Al
        new_name = has_Al
    [[PolynomialFeatures]]
        degree = 2
        interaction_only = False
        include_bias = True
  • composition_feature Name of column in csv file containing material compositions
  • feature_types Types of elemental features to output. If None is specified, all features are output. Note “elements” refers to properties of constituent elements
  • api_key Your API key to access the Materials Project or Citrine. Register for your account at Materials Project: https://materialsproject.org or at Citrine: https://citrination.com
  • all_elements For ContainsElement, whether or not to scan all data rows to assess all elements present in data set
  • element For ContainsElement, name of element of interest. Ignored if all_elements = True
  • new_name For ContainsElement, name of new feature column to generate. Ignored if all_elements = True
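
Of these routines, PolynomialFeatures maps directly onto the scikit-learn transformer of the same name. As a quick illustration of what the settings above produce (not MAST-ML's internal code):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0]])
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=True)
# Columns: bias, x1, x2, x1^2, x1*x2, x2^2
print(poly.fit_transform(X))  # [[1. 1. 2. 1. 2. 4.]]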

Feature Normalization

Optional section to perform feature normalization of the input or generated features using well-known feature normalization algorithms available in scikit-learn. Note that the subsection names must match the corresponding name of the routine in scikit-learn. More information on normalization routines and the parameters to set for each routine can be found here: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing . For the purpose of this full input file, we use the scikit-learn default parameter values. Note that not all parameters are listed, and only the currently listed normalization routines are supported. In addition, MeanStdevScaler is a custom written normalization routine for MAST-ML. Additional information on MeanStdevScaler can be found in the online MAST-ML documentation.

Example:

[FeatureNormalization]
    [[Binarizer]]
        threshold = 0.0
    [[MaxAbsScaler]]
    [[MinMaxScaler]]
    [[Normalizer]]
        norm = l2
    [[QuantileTransformer]]
        n_quantiles = 1000
        output_distribution = uniform
    [[RobustScaler]]
        with_centering = True
        with_scaling = True
    [[StandardScaler]]
    [[MeanStdevScaler]]
        mean = 0
        stdev = 1

Learning Curve

Optional section to perform learning curve analysis on a dataset. Two types of learning curves will be generated: a data learning curve (score vs. amount of training data) and a feature learning curve (score vs. number of features).

Example:

[LearningCurve]
    estimator = KernelRidge_learn
    cv = RepeatedKFold_learn
    scoring = root_mean_squared_error
    n_features_to_select = 5
    selector_name = MASTMLFeatureSelector
  • estimator A scikit-learn model/estimator. The name needs to match an entry in the [Models] section. Note this model will be removed from the [Models] list after the learning curve is generated.
  • cv A scikit-learn cross validation generator. The name needs to match an entry in the [DataSplits] section. Note this method will be removed from the [DataSplits] list after the learning curve is generated.
  • scoring A scikit-learn scoring method compatible with MAST-ML. See the MAST-ML online documentation at https://htmlpreview.github.io/?https://raw.githubusercontent.com/uw-cmg/MAST-ML/dev_Ryan_2018-10-29/docs/build/html/3_metrics.html for more information.
  • n_features_to_select The max number of features to use for the feature learning curve.
  • selector_name Method to conduct feature selection for the feature learning curve. The name needs to match an entry in the [FeatureSelection] section. Note this method will be removed from the [FeatureSelection] section after the learning curve is generated.
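
For intuition, the data learning curve is analogous to scikit-learn's learning_curve utility, as in this sketch (illustration only; the placeholder data and estimator are assumptions):

import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import learning_curve

X, y = np.random.rand(50, 4), np.random.rand(50)  # placeholder data
# Score the model with increasing fractions of the training data
sizes, train_scores, test_scores = learning_curve(
    KernelRidge(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring='neg_root_mean_squared_error')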

Feature Selection

Optional section to perform feature selection using routines from scikit-learn and mlxtend, as well as routines custom-written for MAST-ML. Note that the subsection names must match the corresponding name of the routine in scikit-learn. More information on selection routines and the parameters to set for each routine can be found here: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection . For the purpose of this full input file, we use the scikit-learn default parameter values. Note that not all parameters are listed, and only the currently listed selection routines are supported. In addition, MASTMLFeatureSelector is a custom written selection routine for MAST-ML. Additional information on MASTMLFeatureSelector can be found in the online MAST-ML documentation. Finally, SequentialFeatureSelector is a routine available from the mlxtend package, whose documentation can be found here: http://rasbt.github.io/mlxtend/

Example:

[FeatureSelection]
    [[GenericUnivariateSelect]]
    [[SelectPercentile]]
    [[SelectKBest]]
    [[SelectFpr]]
    [[SelectFdr]]
    [[SelectFwe]]
    [[RFE]]
        estimator = RandomForestRegressor_selectRFE
        n_features_to_select = 5
        step = 1
    [[SequentialFeatureSelector]]
        estimator = RandomForestRegressor_selectSFS
        k_features = 5
    [[RFECV]]
        estimator = RandomForestRegressor_selectRFECV
        step = 1
        cv = LeaveOneGroupOut_selectRFECV
        min_features_to_select = 1
    [[SelectFromModel]]
        estimator = KernelRidge_selectfrommodel
        max_features = 5
    [[VarianceThreshold]]
        threshold = 0.0
    [[PCA]]
        n_components = 5
    [[MASTMLFeatureSelector]]
        estimator = KernelRidge_selectMASTML
        n_features_to_select = 5
        cv = LeaveOneGroupOut_selectMASTML
        # Any features you want to keep from the start, then use these to subsequently do forward selection
        manually_selected_features = myfeature_1, myfeature_2
    [[EnsembleModelFeatureSelector]]
        # A scikit-learn model/estimator. Needs to have estimator feature ranking. The name needs to match an entry in the [Models] section.
        estimator = RandomForestRegressor_selectEnsemble
        # number of features to select
        k_features = 5
    [[PearsonSelector]]
        # threshold for removal of redundant features
        threshold_between_features = 0.9
        # threshold for removal of features not sufficiently correlated with target
        threshold_with_target = 0.8
        # whether to remove features that are highly correlated with each other (i.e. redundant)
        remove_highly_correlated_features = True
        # number of features to select
        k_features = 5
  • estimator A scikit-learn model/estimator. The name needs to match an entry in the [Models] section. Note this model will be removed from the [Models] list after feature selection is performed.
  • n_features_to_select The max number of features to select
  • step For RFE and RFECV, the number of features to remove in each step
  • k_features For SequentialFeatureSelector, the max number of features to select.
  • cv A scikit-learn cross validation generator. The name needs to match an entry in the [DataSplits] section. Note this method will be removed from the [DataSplits] list after feature selection is performed.
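
As with clustering, the scikit-learn-based subsections correspond to ordinary estimator constructions. For example, the [[RFE]] settings above roughly translate to this sketch (illustration only):

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Recursively eliminate features, dropping 1 per step, until 5 remain
selector = RFE(estimator=RandomForestRegressor(), n_features_to_select=5, step=1)
# X_selected = selector.fit_transform(X, y)  # reduced X feature matrix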

Data Splits

Optional section to perform data splits using cross validation routines in scikit-learn, and custom-written for MAST-ML. Note that the subsection names must match the corresponding name of the routine in scikit-learn. More information on selection routines and the parameters to set for each routine can be found here: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection . For the purpose of this full input file, we use the scikit-learn default parameter values. Note that not all parameters are listed, and only the currently listed data split routines are supported. In addition, NoSplit is a custom written selection routine for MAST-ML, which simply produces a full data fit with no cross validation. Additional information on NoSplit can be found in the online MAST-ML documentation.

Example:

[DataSplits]
    [[NoSplit]]
    [[KFold]]
        shuffle = True
        n_splits = 10
    [[RepeatedKFold]]
        n_splits = 5
        n_repeats = 10
    # Here, an example of another instance of RepeatedKFold, this one being used in the [LearningCurve] section above.
    [[RepeatedKFold_learn]]
        n_splits = 5
        n_repeats = 10
    [[GroupKFold]]
        n_splits = 3
    [[LeaveOneOut]]
    [[LeavePOut]]
        p = 10
    [[RepeatedStratifiedKFold]]
        n_splits = 5
        n_repeats = 10
    [[StratifiedKFold]]
        n_splits = 3
    [[ShuffleSplit]]
        n_splits = 10
    [[StratifiedShuffleSplit]]
        n_splits = 10
    [[LeaveOneGroupOut]]
        # The column name in the input csv file containing the group labels
        grouping_column = Host element
    # Here, an example of another instance of LeaveOneGroupOut, this one being used in the [FeatureSelection] section above.
    [[LeaveOneGroupOut_selectMASTML]]
        # The column name in the input csv file containing the group labels
        grouping_column = Host element
    # Here, an example of another instance of LeaveOneGroupOut, this one being used based on the creation of the "has_Al"
    # group from the [[ContainsElement]] routine present in the [FeatureGeneration] section.
    [[LeaveOneGroupOut_Al]]
        grouping_column = has_Al
    # Here, an example of another instance of LeaveOneGroupOut, this one being used based on the creation of clusters
    # from the [[KMeans]] routine present in the [Clustering] section.
    [[LeaveOneGroupOut_kmeans]]
        grouping_column = KMeans
    [[LeaveCloseCompositionsOut]]
        # Set the distance threshold in composition space
        dist_threshold=0.1
    [[Bootstrap]]
        # Data set size
        n = 378
        # Number of bootstrap resamplings to perform
        n_bootstraps = 10
        # Training set size
        train_size = 303
        # Validation/test set size
        test_size = 75
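
For the group-based splitters, the grouping_column values are passed to scikit-learn as the groups argument. A sketch of what [[LeaveOneGroupOut]] does, with a hypothetical "Host element" column (illustration only):

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.arange(12).reshape(6, 2)
groups = ['Ag', 'Ag', 'Cu', 'Cu', 'Pt', 'Pt']  # e.g. a "Host element" column
for train_idx, test_idx in LeaveOneGroupOut().split(X, groups=groups):
    print(test_idx)  # each split holds out all rows of one group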

Models

Optional section to denote different models/estimators for model fitting from scikit-learn. Note that the subsection names must match the corresponding name of the routine in scikit-learn. More information on different model routines and the parameters to set for each routine can be found here for ensemble methods: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble and here for kernel ridge and linear methods: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.kernel_ridge and here for neural network methods: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.neural_network and here for support vector machine and decision tree methods: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm . For the purpose of this full input file, we use the scikit-learn default parameter values. Note that not all parameters are listed, and only the currently listed model routines are supported.

Example:

[Models]
    # Ensemble methods

    [[AdaBoostClassifier]]
        n_estimators = 50
        learning_rate = 1.0
    [[AdaBoostRegressor]]
        n_estimators = 50
        learning_rate = 1.0
    [[BaggingClassifier]]
        n_estimators = 50
        max_samples = 1.0
        max_features = 1.0
    [[BaggingRegressor]]
        n_estimators = 50
        max_samples = 1.0
        max_features = 1.0
    [[ExtraTreesClassifier]]
        n_estimators = 10
        criterion = gini
        min_samples_split = 2
        min_samples_leaf = 1
    [[ExtraTreesRegressor]]
        n_estimators = 10
        criterion = mse
        min_samples_split = 2
        min_samples_leaf = 1
    [[GradientBoostingClassifier]]
        loss = deviance
        learning_rate = 1.0
        n_estimators = 100
        subsample = 1.0
        criterion = friedman_mse
        min_samples_split = 2
        min_samples_leaf = 1
    [[GradientBoostingRegressor]]
        loss = ls
        learning_rate = 0.1
        n_estimators = 100
        subsample = 1.0
        criterion = friedman_mse
        min_samples_split = 2
        min_samples_leaf = 1
    [[RandomForestClassifier]]
        n_estimators = 10
        criterion = gini
        min_samples_leaf = 1
        min_samples_split = 2
    [[RandomForestRegressor]]
        n_estimators = 10
        criterion = mse
        min_samples_leaf = 1
        min_samples_split = 2
    # Here, an example of another instance of RandomForestRegressor, this one used by the [[EnsembleModelFeatureSelector]]
    # method from the [FeatureSelection] section.
    [[RandomForestRegressor_selectEnsemble]]
        n_estimators = 100
        criterion = mse
    [[XGBoostClassifier]]
    [[XGBoostRegressor]]
        n_estimators = 100
        objective = reg:squarederror

    # Kernel ridge and linear methods

    [[KernelRidge]]
        alpha = 1
        kernel = linear
    # Here, an example of another instance of KernelRidge, this one used by the [[MASTMLFeatureSelector]]
    # method from the [FeatureSelection] section.
    [[KernelRidge_selectMASTML]]
        alpha = 1
        kernel = linear
    # Here, an example of another instance of KernelRidge, this one used in the [LearningCurve] section.
    [[KernelRidge_learn]]
        alpha = 1
        kernel = linear

    [[ARDRegression]]
        n_iter = 300
    [[BayesianRidge]]
        n_iter = 300
    [[ElasticNet]]
        alpha = 1.0
    [[HuberRegressor]]
        epsilon = 1.35
        max_iter = 100
    [[Lars]]
    [[Lasso]]
        alpha = 1.0
    [[LassoLars]]
        alpha = 1.0
        max_iter = 500
    [[LassoLarsIC]]
        criterion = aic
        max_iter = 500
    [[LinearRegression]]
    [[LogisticRegression]]
        penalty = l2
        C = 1.0
    [[Perceptron]]
        alpha = 0.0001
    [[Ridge]]
        alpha = 1.0
    [[RidgeClassifier]]
        alpha = 1.0
    [[SGDClassifier]]
        loss = hinge
        penalty = l2
        alpha = 0.0001
    [[SGDRegressor]]
        loss = squared_loss
        penalty = l2
        alpha = 0.0001

    # Neural networks

    [[MLPClassifier]]
        hidden_layer_sizes = 100,
        activation = relu
        solver = adam
        alpha = 0.0001
        batch_size = auto
        learning_rate = constant
    [[MLPRegressor]]
        hidden_layer_sizes = 100,
        activation = relu
        solver = adam
        alpha = 0.0001
        batch_size = auto
        learning_rate = constant
    [[KerasRegressor]]
        [[[Layer1]]]
             layer_type = Dense
             neuron_num = 100
             input_dim = 287   # typically equal to n_features
             kernel_initializer = random_normal
             activation = relu
        [[[Layer2]]]
             layer_type = Dense
             neuron_num = 50
             kernel_initializer = random_normal
             activation = relu
        [[[Layer3]]]
             layer_type = Dense
             neuron_num = 25
             kernel_initializer = random_normal
             activation = relu
        [[[Layer4]]]
             layer_type = Dense
             neuron_num = 1
             kernel_initializer = random_normal
             activation = linear
        [[[FitParams]]]
             epochs = 20
             batch_size = 25
             loss = mean_squared_error
             optimizer = adam
             metrics = mse
             verbose = 1
             shuffle = True
             #validation_split = 0.2

    # Support vector machine methods

    [[LinearSVC]]
        penalty = l2
        loss = squared_hinge
        tol = 0.0001
        C = 1.0
    [[LinearSVR]]
        epsilon = 0.1
        loss = epsilon_insensitive
        tol = 0.0001
        C = 1.0
    [[NuSVC]]
        nu = 0.5
        kernel = rbf
        degree = 3
    [[NuSVR]]
        nu = 0.5
        C = 1.0
        kernel = rbf
        degree = 3
    [[SVC]]
        C = 1.0
        kernel = rbf
        degree = 3
    [[SVR]]
        C = 1.0
        kernel = rbf
        degree = 3

    # Decision tree methods

    [[DecisionTreeClassifier]]
        criterion = gini
        splitter = best
        min_samples_split = 2
        min_samples_leaf = 1
    [[DecisionTreeRegressor]]
        criterion = mse
        splitter = best
        min_samples_split = 2
        min_samples_leaf = 1
    [[ExtraTreeClassifier]]
        criterion = gini
        splitter = random
        min_samples_split = 2
        min_samples_leaf = 1
    [[ExtraTreeRegressor]]
        criterion = mse
        splitter = random
        min_samples_split = 2
        min_samples_leaf = 1

Misc Settings

This section controls which types of plots MAST-ML will write to the results directory and other miscellaneous settings.

Example:

[MiscSettings]
    plot_target_histogram = True
    plot_train_test_plots = True
    plot_predicted_vs_true = True
    plot_predicted_vs_true_average = True
    plot_best_worst_per_point = True
    plot_each_feature_vs_target = False
    plot_error_plots = True
    rf_error_method = stdev
    rf_error_percentile = 95
    normalize_target_feature = False
  • plot_target_histogram Whether or not to output target data histograms
  • plot_train_test_plots Whether or not to output parity plots within each CV split
  • plot_predicted_vs_true Whether or not to output summarized parity plots
  • plot_predicted_vs_true_average Whether or not to output averaged parity plots
  • plot_best_worst_per_point Whether or not to output parity plot showing best and worst split per point
  • plot_each_feature_vs_target Whether or not to show plots of target feature as a function of each individual input feature
  • plot_error_plots Whether or not to show the individual and average plots of the normalized errors
  • rf_error_method If using random forest, whether to calculate error bars with stdev or confidence intervals (confint)
  • rf_error_percentile If using confint above, the confidence interval to use to calculate the error bars
  • normalize_target_feature Whether or not to normalize the target feature values

MAST-ML overview slides

The information for this MAST-ML overview shown on this page is available for download here:

MASTMLoverview

Let’s begin with an overview of what MAST-ML is and what it can do:

_images/WhatisMASTML.png

Here is currently what MAST-ML can do as well as how to acquire it:

_images/MASTMLscope.png

An overview of the general machine learning workflow that MAST-ML executes. Continuing development will focus on making the workflows more flexible and general

_images/MASTMLworkflow.png

MAST-ML uses a text-based input file (.conf extension) which consists of different sections (corresponding to each part of the workflow) and specific subsections (e.g. different machine learning models to test, different feature selection algorithms, etc.). The input file is discussed in much greater detail here:

MAST-ML Input File

and an input file with the full range of capabilities can be downloaded here:

MASTMLinputfile

_images/MASTMLsampleinput.png

Running MAST-ML is easily done with a single-line command in a Terminal/command line, your favorite IDE, or within a Jupyter notebook

_images/RunningMASTML.png

MAST-ML output takes the form of a full directory tree of results, with each level of the tree corresponding to a different portion of the machine learning workflow

_images/MASTMLhighleveloutput.png

The last three figures demonstrate some example output of a few machine learning analysis features MAST-ML offers. Here, the ability to generate and select features is shown.

_images/MASTMLfeaturegenerationselection.png

A core feature of MAST-ML is the many pieces of statistical analysis regarding model assessment, which forms the basis of interpreting the quality and extensibility of a machine learning model.

_images/MASTMLmodelassessment.png

Finally, MAST-ML offers the ability to easily optimize the model hyperparameters used in your analysis

_images/MASTMLhyperparameter.png

Running MAST-ML on Google Colab

In addition to running MAST-ML on your own machine or computing cluster, MAST-ML can be run using cloud resources on Google Colab. This can be advantageous as you don’t have to worry about installing MAST-ML yourself, and all output files can be saved directly to your Google Drive.

MAST-ML comes with a notebook called MASTML_Colab.ipynb that you can open in Google Colab

MASTML_Colab.ipynb

Once you open the notebook in Google Colab, it will look something like this:

_images/ColabHome.png

There are a few blocks of code in this notebook. The first block performs a pip install of MAST-ML for this Colab session. The second block links your Google Drive to the Colab instance so MAST-ML can save your run output directly to your Google Drive.

The one thing you’ll need to do from here is to upload a data file (.csv or .xlsx format) and MAST-ML input file (.conf format) to this Colab session. Files can be uploaded by pressing the vertical arrow on the left side of the screen, by the file directory tree.

example_input.conf

example_data.xlsx

Note that when a Colab session ends, the files you uploaded will be deleted; only the output saved to your Google Drive persists. However, MAST-ML automatically saves a copy of both the data and input files to your output directory for each run you do.

MAST-ML tutorial

Introduction

This document provides step-by-step tutorials of conducting and analyzing different MAST-ML runs. For this tutorial, we will be using the dataset example_data.xlsx in the tests/csv/ folder and input file example_input.conf in tests/conf/.

MAST-ML requires two files to run: The first is the text-based input file (.conf extension). This file contains all of the key settings for MAST-ML, for example, which models to fit and how to normalize your input feature matrix. The second file is the data file (.csv or .xlsx extension). This is the data file containing the input feature columns and values (X values) and the corresponding y data to fit models to. The data file may contain other columns that are dedicated to constructing groups of data for specific tests, or miscellaneous notes; these columns can be selectively left out so they are not used in the fitting. This will be discussed in more detail below.

Throughout this tutorial, we will be modifying the input file to add and remove different sections and values. For a complete and more in-depth discussion of the input file and its myriad settings, the reader is directed to the dedicated input file section:

MAST-ML Input File

The data contained in the example_data.xlsx file consist of a previously selected matrix of X features created from combinations of elemental properties, for example the average atomic radius of the elements in the material. The y data values used for fitting are listed in the “Scaled activation energy (eV)” column, and are DFT-calculated migration barriers of dilute solute diffusion, referenced to the host system; for example, the value for an Ag solute diffusing through an Ag host is set to zero. The “Host element” and “Solute element” columns denote which species comprise the corresponding reduced migration barrier.

Your first MAST-ML run

It’s time to conduct your very first MAST-ML run! First, we will set up the most basic input file, which will only import your data and input file, and do nothing else except copy the input files to the results directory and output a basic histogram of the target data. Open the example_input.conf file (or create your own new file), and write the following in your input file:

Example:

[GeneralSetup]
    input_features = Auto
    input_target = Scaled activation energy (eV)
    randomizer = False
    metrics = Auto
    input_other = Material composition, Host element, Solute element, predict_Pt

The GeneralSetup section contains high-level control over how your input data file is parsed. Additional details of each parameter can be found in the MAST-ML Input File section of this documentation. Briefly, setting “input_features” to “Auto” will automatically assign all columns to the X feature matrix, except those associated with “input_target” or “input_other”. The option “randomizer” will shuffle all of your y data, which can be useful for running a “null” baseline test. The “metrics” option denotes which metrics to eventually evaluate your models on, such as mean_absolute_error; using “Auto” provides a catalogue of standard metrics which is generally sufficient for many problems. Finally, the “input_other” field is used to denote any feature columns you don’t want to use in fitting. If some columns contain text notes, these will need to be added here too.

There are two ways to execute a MAST-ML run. The first is to run it via a Terminal or IDE command line by directly calling the main MAST-ML driver module. Here, the python -m (for module) command is invoked on the mastml.mastml_driver module, and the paths containing the input file and data file are passed in. Lastly, the argument -o (for output) is used together with the path that will hold all results files and folders.

Example:

python3 -m mastml.mastml_driver tests/conf/example_input.conf tests/csv/example_data.xlsx -o results/mastml_tutorial

The second way is to run MAST-ML through a Jupyter notebook, by importing mastml, calling the mastml_driver main() method, and supplying the paths to the input file, data file, and results folder:

Example:

from mastml import mastml_driver  # import the driver module from the mastml package

conf_path = 'tests/conf/example_input.conf'
data_path = 'tests/csv/example_data.xlsx'
results_path = 'results/mastml_tutorial'
mastml_driver.main(conf_path, data_path, results_path)

Let’s examine the output from this first run. Below is a screenshot of a Mac directory output tree in the results/mastml_tutorial folder. Note that you can re-use the same output folder name, and the date and time of the run will be appended so no work will be lost. Each level of the directory tree corresponds to a step in the general supervised learning workflow that MAST-ML uses. The first level is general data input and feature generation, the second level is numerical manipulation of features, and the third level is selection of features. Since we did not do any feature manipulation in this run, the output selected.csv, normalized.csv and generated_features.csv are all the same, and are the same file as the copied input data file, example_data.csv. In the main directory tree, there is also a log.log and errors.log file, which summarize the inner details of the MAST-ML run and flag any errors that may have occurred. There are two .html files which provide very high-level summaries of data plots and file links that may be of interest, to make searching for these files easier. Finally, there is some generated data about the statistics of your input target data. A histogram named target_histogram.png is created, and basic statistical summary of your data is saved in the input_data_statistics.csv file.

_images/MASTMLtutorial_run1.png

Cleaning input data

Now, let’s imagine a slightly more complicated (but realistic) scenario where some of the values of your X feature matrix are not known. Open your example_data.csv file, and randomly remove some values of the X feature columns in your dataset. Don’t remove any y data values in the “Reduced Barrier (eV)” column. You’ll need to add the following section to your input file to handle cleaning of the input data:

Example:

[DataCleaning]
    cleaning_method = imputation
    imputation_strategy = mean

What this does is perform data imputation, where each missing value will be replaced with the mean value for that particular feature column. Other data cleaning options include imputation with median values, simply removing rows of data with missing values, or performing a probabilistic principal component analysis to fill in missing values.
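
For intuition, mean imputation behaves like scikit-learn's SimpleImputer in the following sketch (an illustration of the concept, not MAST-ML's internal code):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 4.0],
              [3.0, 8.0]])
imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(X))  # the NaN is replaced by the column mean, 6.0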

By comparing the data file in the parent directory with those in the subsequent directories, you can see that the missing values (here, the first 10 rows of the first several features were removed) have been replaced with the mean values for each respective feature column:

_images/MASTMLtutorial_run2_1.png

After data cleaning with imputation:

_images/MASTMLtutorial_run2_2.png

Feature generation and normalization

For this run, we are going to first generate a large X feature matrix based on a suite of elemental properties. Then, we are going to normalize the feature matrix so that all values in a given feature column have a mean of zero and a standard deviation equal to one.

To perform the feature generation and normalization steps, add these sections to your input file. Use the same file from the previous run, which contains the GeneralSetup and DataCleaning sections, and use your data file with the values you previously removed. (Note that you can use the pristine original data file too, and the data cleaning step will simply do nothing). For the purpose of this example, we are going to generate elemental features using the MAGPIE approach, using compositions as specified in the “Solute element” column of the data file. Note that if multiple elements are present, features containing the average (both mean and composition-weighted averages) of the elements present will be calculated. The value specified in the composition_feature parameter must be a column name in your data file which contains the material compositions.

Example:

[FeatureGeneration]
    [[Magpie]]
        composition_feature = Solute element
        feature_types = composition_avg, arithmetic_avg, max, min, difference, elements

[FeatureNormalization]
    [[StandardScaler]]

After performing this run, we can see that the .csv files in the feature generation and normalization folders of the results directory tree are now updated to reflect the generated and normalized X feature matrices. There are now many more features in the generated_features.csv file:

_images/MASTMLtutorial_run3_1.png

Note that feature columns that are identical in all values are removed automatically. We can see that the normalized feature set consists of each column having mean zero and standard deviation of one:

_images/MASTMLtutorial_run3_2.png
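
This behavior can be verified with scikit-learn's StandardScaler directly (a quick illustrative check, not MAST-ML's internal code):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_norm = StandardScaler().fit_transform(X)
print(X_norm.mean(axis=0))  # each column mean is ~0
print(X_norm.std(axis=0))   # each column standard deviation is 1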

Training and evaluating your first model

Now that we have a full X feature matrix that has been normalized appropriately, it is time to train and evaluate your first model. For this tutorial, we will train a Kernel Ridge model with a radial basis function kernel (also known as Gaussian Kernel Ridge Regression, GKRR). We need to add two sections to our input file.

The first is the Models section, which provides a list of model types to train and the associated parameter values for each model. Here, we have chosen values of alpha and gamma equal to 1. There is no reason to think these are the optimal parameter values; they were simply chosen as an example, and later in this tutorial we will optimize them. Note that if you don’t specify the model parameter values, the scikit-learn default values will be used.

The second is the DataSplits section, which controls what types of fits and cross-validation tests will be applied to each specified model. Here, we have chosen “NoSplit”, which is simply a full y versus X fit of the data without any form of cross-validation. We have also denoted “RepeatedKFold”, which is a random leave-out cross-validation test. In this instance, we have 5 splits (so each split leaves out 20% of the data) and repeat the test two times.

Example:

[Models]
    [[KernelRidge]]
        kernel = rbf
        alpha = 1
        gamma = 1

[DataSplits]
    [[NoSplit]]
    [[RepeatedKFold]]
        n_splits = 5
        n_repeats = 2
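
The [[RepeatedKFold]] settings above correspond to scikit-learn's RepeatedKFold splitter, as this sketch shows (illustration only):

import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(20).reshape(10, 2)  # placeholder data, 10 rows
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
splits = list(cv.split(X))
print(len(splits))        # 10 total splits (5 folds x 2 repeats)
print(len(splits[0][1]))  # each test fold holds out 20% of the rows (2 of 10)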

Below is a snapshot of the resulting directory tree generated from this MAST-ML run. You’ll immediately notice the tree is deeper now, with a new level corresponding to each model we’ve fit (here just the single KernelRidge model), and, for each model, folders corresponding to each DataSplit test we denoted in the input file. For each data split method, there are folders and corresponding data plots and files for each hold-out split of the test. For instance, with the RepeatedKFold test, there were 10 total splits, which are labeled as split_0 through split_9. Contained in each folder are numerous files, such as different data parity plots of predicted vs. actual values, histograms of residuals, .csv files for all plotted data, a .pkl file of the exported trained model, and .ipynb Jupyter notebooks useful for custom modifications of the data plots.

_images/MASTMLtutorial_run4_1.png

Below is a parity plot from the NoSplit (full data fit) run. The R-squared value is high, but there is significant mean error. This suggests that the model parameters are not optimal (which shouldn’t be surprising considering we just picked them arbitrarily).

_images/MASTMLtutorial_run4_2.png

From examining the parity plot from the RepeatedKFold run (this is the ‘average_points_with_bars.png’ plot), which has the averaged values over all 10 splits, we can see that the predictions from random cross-validation result in both a very low R-squared value and a high error. Essentially, cross-validation has shown that this model has no predictive ability. It seems our issues are two-fold: nonoptimal hyperparameters, and over-fitting. The over-fitting is evident from the much worse performance of the cross-validated parity plot compared to the full fit.

_images/MASTMLtutorial_run4_3.png

Feature selection and learning curves

As mentioned above, one problem with our current model is over-fitting. To further understand and minimize the effect of over-fitting, it is often necessary to construct learning curves and perform feature selection to obtain a reduced feature set which most accurately describes your data. To do this, we are going to add two additional sections to our input file.

The first section is related to feature selection. Here, we will use the SequentialFeatureSelector algorithm, which performs forward selection of features. We will select a total of 20 features, and use a KernelRidge model to evaluate the selected features. Here, we have denoted our estimator as “KernelRidge_select”. The models used in feature selection and learning curves are removed from the model queue, because in general one may want to use a different model for this step of the analysis than what will ultimately be used for fitting. Therefore, we need to also amend our models list to include this new KernelRidge_select model, as shown below.

Example:

[FeatureSelection]
    [[SequentialFeatureSelector]]
        estimator = KernelRidge_select
        k_features = 20

[Models]
    [[KernelRidge]]
        kernel = rbf
        alpha = 1
        gamma = 1
    [[KernelRidge_select]]
        kernel = rbf
        alpha = 1
        gamma = 1

The second section we will add is to plot learning curves. There are two types of learning curves MAST-ML can make: a data learning curve and a feature learning curve. The former is a plot of the metric of interest versus the amount of training data used in the fits. The latter is a plot of the metric of interest versus the number of features comprising the X feature matrix. In the example LearningCurve input file section shown below, we are going to use a KernelRidge model, random k-fold cross-validation, and the root_mean_squared_error metric to evaluate our learning curves. We will also use a maximum of 20 features, and use the SelectKBest algorithm to assess the choice of features.

Example:

[LearningCurve]
    estimator = KernelRidge_learn
    cv = RepeatedKFold_learn
    scoring = root_mean_squared_error
    n_features_to_select = 20
    selector_name = SelectKBest

As with the above example of FeatureSelection, we need to add the KernelRidge_learn and RepeatedKFold_learn entries to the Models and DataSplits sections of our input file, respectively. At this point in the tutorial, the complete input file should look like this:

Example:

[GeneralSetup]
    input_features = Auto
    input_target = Reduced barrier (eV)
    randomizer = False
    metrics = Auto
    input_other = Host element, Solute element, predict_Pt

[DataCleaning]
    cleaning_method = imputation
    imputation_strategy = mean

[FeatureGeneration]
    [[Magpie]]
        composition_feature = Solute element

[FeatureNormalization]
    [[StandardScaler]]

[FeatureSelection]
    [[SequentialFeatureSelector]]
        estimator = KernelRidge_select
        k_features = 20

[LearningCurve]
    estimator = KernelRidge_learn
    cv = RepeatedKFold_learn
    scoring = root_mean_squared_error
    n_features_to_select = 20
    selector_name = SelectKBest

[Models]
    [[KernelRidge]]
        kernel = rbf
        alpha = 1
        gamma = 1
    [[KernelRidge_select]]
        kernel = rbf
        alpha = 1
        gamma = 1
    [[KernelRidge_learn]]
        kernel = rbf
        alpha = 1
        gamma = 1

[DataSplits]
    [[NoSplit]]
    [[RepeatedKFold]]
        n_splits = 5
        n_repeats = 2
    [[RepeatedKFold_learn]]
        n_splits = 5
        n_repeats = 2

Let’s take a look at the same full fit and RepeatedKFold random cross-validation tests for this run:

Full-fit:

_images/MASTMLtutorial_run5_1.png

Random leave out cross-validation:

_images/MASTMLtutorial_run5_2.png

What we can see is that, now that we have down-selected from more than 300 features in the previous run to just 20 here, the fits have noticeably improved and the problem of over-fitting has been minimized. Below, we can look at the plotted learning curves.

Data learning curve:

_images/MASTMLtutorial_run5_3.png

Feature learning curve:

_images/MASTMLtutorial_run5_4.png

We can clearly see that, as expected, having more training data results in better test scores, and adding more features (up to a certain point) also results in better fits. Based on these learning curves, one may be able to argue that additional features could be used to further lower the error.

Hyperparameter optimization

Next, we will consider optimization of the model hyperparameters, in order to use a better-optimized model with a selected feature set to minimize the model errors. To do this, we need to add the HyperOpt section to our input file, as shown below. Here, we are optimizing our KernelRidge model, specifically its root_mean_squared_error, by using our RepeatedKFold random leave-out cross-validation scheme. The param_names field provides the parameter names to optimize; here, we are optimizing the KernelRidge alpha and gamma parameters. Parameters must be delineated with a semicolon. The param_values field provides a bound on the values to search over. Here, the minimum value is -5, the max is 5, 100 points are analyzed, and the numerical scaling is logarithmic, meaning the grid ranges from 10^-5 to 10^5. If “lin” instead of “log” had been specified, the scale would be linear, with 100 values ranging from -5 to 5.

Example:

[HyperOpt]
    [[GridSearch]]
        estimator = KernelRidge
        cv = RepeatedKFold
        param_names = alpha ; gamma
        param_values = -5 5 100 log float ; -5 5 100 log float
        scoring = root_mean_squared_error
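
For reference, a roughly equivalent grid search can be expressed with scikit-learn directly, using numpy's logspace to build the 100-point logarithmic grids (a sketch under those assumptions, not MAST-ML's internal implementation):

import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, RepeatedKFold

param_grid = {'alpha': np.logspace(-5, 5, 100),   # 100 values from 10^-5 to 10^5
              'gamma': np.logspace(-5, 5, 100)}
search = GridSearchCV(KernelRidge(kernel='rbf'), param_grid,
                      scoring='neg_root_mean_squared_error',
                      cv=RepeatedKFold(n_splits=5, n_repeats=2))
# search.fit(X, y)  # X, y: the normalized feature matrix and target values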

Let’s take a final look at the same full fit and RepeatedKFold random cross-validation tests for this run:

Full-fit:

_images/MASTMLtutorial_run6_1.png

Random leave out cross-validation:

_images/MASTMLtutorial_run6_2.png

What we can see is that, with our down-selected set of 20 features and the optimized hyperparameters of our KernelRidge model, our fits are once again improved. The hyperparameter optimization portion of this workflow outputs the hyperparameter values and cross-validation scores for each step of, in this case, the GridSearch that we performed. All of this information is saved in the KernelRidge.csv file in the GridSearch folder in the results directory tree. For this run, the optimal hyperparameters were alpha = 0.034 and gamma = 0.138.

Random leave-out versus leave-out-group cross-validation

Here, we will use our selected feature set and optimized KernelRidge hyperparameters from the previous section to do a new kind of cross-validation test: leave out group (LOG) CV. To do this, you will modify the alpha and gamma values in the Models section, KernelRidge model in your input file. In addition, you can rename the selected.csv data file to a new name, for example “example_data_selected.csv”, and use the path to this new data file for this new run, as we will not be performing feature selection again (to save time).

We will compare the results of LOG cross-validation with those of random cross-validation. Our input data file has a column called “Host element”. This is a natural grouping to use for this problem, as it is interesting to assess our fits when training on a set of host elements and predicting the values for an entirely new host element that was never part of the training set. Modify your input file to match what is shown below. Note that we have commented out the sections that we no longer want with the # symbol. You can either comment out the sections or remove them entirely.

Example:

[GeneralSetup]
    input_features = Auto
    input_target = Reduced barrier (eV)
    randomizer = False
    metrics = Auto
    input_other = Host element, Solute element, predict_Pt
    input_grouping = Host element

#[DataCleaning]
#    cleaning_method = imputation
#    imputation_strategy = mean

#[FeatureGeneration]
#    [[Magpie]]
#        composition_feature = Solute element

[FeatureNormalization]
    [[StandardScaler]]

#[FeatureSelection]
#    [[SequentialFeatureSelector]]
#        estimator = KernelRidge_select
#        k_features = 20

#[LearningCurve]
#    estimator = KernelRidge_learn
#    cv = RepeatedKFold_learn
#    scoring = root_mean_squared_error
#    n_features_to_select = 20
#    selector_name = SelectKBest

[Models]
    [[KernelRidge]]
        kernel = rbf
        alpha = 0.034
        gamma = 0.138
    #[[KernelRidge_select]]
    #    kernel = rbf
    #    alpha = 1
    #    gamma = 1
    #[[KernelRidge_learn]]
    #    kernel = rbf
    #    alpha = 1
    #    gamma = 1

[DataSplits]
    [[NoSplit]]
    [[RepeatedKFold]]
        n_splits = 5
        n_repeats = 2
    #[[RepeatedKFold_learn]]
    #    n_splits = 5
    #    n_repeats = 2
    [[LeaveOneGroupOut]]
        grouping_column = Host element

#[HyperOpt]
#    [[GridSearch]]
#        estimator = KernelRidge
#        cv = RepeatedKFold
#        param_names = alpha ; gamma
#        param_values = -5 5 100 log ; -5 5 100 log
#        scoring = root_mean_squared_error

The main new additions to this input file are the input_grouping parameter in the GeneralSetup section and the LeaveOneGroupOut method in the DataSplits section.

By doing this run, we can assess the model fits resulting from the random cross-validation and the LOG cross-validation.

Random cross-validation:

_images/MASTMLtutorial_run7_1.png

LOG cross-validation:

_images/MASTMLtutorial_run7_2.png

We can immediately see that the R-squared value and errors are both worse for the LOG cross-validation test than for the random cross-validation test. This is likely because the LOG test is a more rigorous test of model extrapolation: the test scores in each case are for data whose host elements were never included in the training set. In addition, a minor effect contributing to the reduced accuracy may be that the model hyperparameters were optimized by evaluating the root mean squared error for a random cross-validation test. If the parameters were instead optimized using the LOG test, the resulting fits would likely be improved.

A couple of additional plots that are standard output for a LOG test are worth drawing attention to. The first is a plot of each test metric value for each group, which enables one to quickly assess which groups perform better or worse than others.

_images/MASTMLtutorial_run7_3.png

In addition, the parity plots for each split are now plotted with symbols denoting each group, which can help assess clustering of groups and goodness of fit on a per-group basis.

Training on all groups except Ag:

_images/MASTMLtutorial_run7_4.png

Testing on just Ag as the left-out host element:

_images/MASTMLtutorial_run7_5.png

Making predictions by importing a previously fit model

Here, we are going to import a previously fit model, and use it to predict the migration barriers for those data points with Pt as the host element.

In your previous run, the LOG test split in which the Pt host values were predicted is in the split_12 folder. The parity plot for the Pt test data from that run should look like the plot below:

_images/MASTMLtutorial_run8_1.png

Here, we are going to import the model that was fitted to all the groups except Pt, and use MAST-ML’s data validation function as detailed above to obtain this same plot, but using Pt as the validation data with the imported, previously trained model. If this data set were extended to include, for example, U as a host element, any number of previously trained models could be used to predict the migration barrier values for U. To import the model, save the KernelRidge_split_12.pkl file from your previous run into the /models/ folder (it is at the same level as the /tests/ folder in your main MAST-ML directory). To import this model into your next run, create a new field in the Models section, as shown below:

Example:

[Models]
    #[[KernelRidge]]
    #    kernel = rbf
    #    alpha = 0.034
    #    gamma = 0.138
    #[[KernelRidge_select]]
    #    kernel = rbf
    #    alpha = 1
    #    gamma = 1
    #[[KernelRidge_learn]]
    #    kernel = rbf
    #    alpha = 1
    #    gamma = 1
    [[ModelImport]]
        model_path = models/KernelRidge_split_12.pkl
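Under the hood, importing a saved model amounts to loading the pickled estimator and reusing it. A minimal sketch of the equivalent Python is shown below (the feature matrix X_new is a hypothetical placeholder):

    import pickle

    with open("models/KernelRidge_split_12.pkl", "rb") as f:
        model = pickle.load(f)

    # Predict with the previously fit model; X_new must contain the
    # same (normalized) features the model was originally trained on
    y_pred = model.predict(X_new)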

As we are only interested in assessing the fit on Pt for this example, we can change the DataSplits section to only have the LOG test:

Example:

[DataSplits]
    #[[NoSplit]]
    #[[RepeatedKFold]]
    #    n_splits = 5
    #    n_repeats = 2
    #[[RepeatedKFold_learn]]
    #    n_splits = 5
    #    n_repeats = 2
    [[LeaveOneGroupOut]]
        grouping_column = Host element

From running this model and inspecting the test-data parity plot in split_12 (the folder for the Pt group), we obtain this parity plot:

_images/MASTMLtutorial_run8_2.png

This plot is exactly the same as the corresponding plot from the previous run. This is the expected result, and it demonstrates that the previously fit model was successfully imported and used to predict the Pt values. Inspecting the other groups, for example split_1 (which is for Ag), shows R-squared and error values indicating a better fit than in the previous run. This better fit is expected: the saved model contained Ag in its training data, so its predictions on Ag should be improved (note that this defeats the purpose of the LOG test, but it shows that the imported trained model behaves as expected).

Predicting values for new, extrapolated data

As a final example, we are going to use our model to predict the migration barriers for those data points with Pt as the host element. Your data file already has a column titled “predict_Pt”, with values equal to 0 in all rows except where Pt is the host, in which case the value is 1. In the GeneralSetup section of your input file, add the parameter input_testdata and set it equal to “predict_Pt”, as shown below. This ensures that the data with Pt as the host element are never involved in model training. This feature is a convenient way to isolate part of your data, or some new part of your data, to function purely as a validation data set. This way, whenever a model is trained and tested on the remaining data, an additional prediction is also calculated, which here is for the Pt host data.

Example:

[GeneralSetup]
    input_features = Auto
    input_target = Reduced barrier (eV)
    randomizer = False
    metrics = Auto
    input_other = Host element, Solute element, predict_Pt
    input_grouping = Host element
    input_testdata = predict_Pt
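Conceptually, the validation column acts as a row mask: flagged rows are set aside before any splitting and only ever receive predictions. A short pandas sketch of the idea (file and column names from this tutorial):

    import pandas as pd

    df = pd.read_csv("example_data_selected.csv")
    mask = df["predict_Pt"] == 1
    df_validation = df[mask]    # Pt host rows: never used for training
    df_cv = df[~mask]           # remaining rows: used for the CV splits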

For this test, let’s run both the random cross-validation and the LOG test. As a reminder, we need to uncomment the random cross-validation test in the DataSplits section:

Example:

[DataSplits]
    #[[NoSplit]]
    [[RepeatedKFold]]
        n_splits = 5
        n_repeats = 2
    #[[RepeatedKFold_learn]]
    #    n_splits = 5
    #    n_repeats = 2
    [[LeaveOneGroupOut]]
        grouping_column = Host element

When running this test, you’ll notice there are fewer splits in the LOG test folder now. This is because Pt is treated only as a final “validation” or “extrapolation” data set, and is never part of the training or test set in any split. For each split in the random and LOG CV tests, a “stats.txt” file is written, which provides the average train, test, and prediction results. The prediction results are for the Pt validation data. Below are screenshots of the stats.txt files for the random and LOG tests.

Random cross-validation:

_images/MASTMLtutorial_run9_1.png

LOG cross-validation:

_images/MASTMLtutorial_run9_2.png

For the random cross-validation, the error values are higher (and the R-squared value lower) for the predict_Pt dataset than for the average of the test datasets. This is to be expected, as Pt is never involved in model training. Further, the predictions for predict_Pt are slightly worse for the LOG cross-validation test than for the random cross-validation test. This also makes sense, as each training split of the LOG test tends to produce worse predictive performance (i.e. worse model training) relative to the random cross-validation case, as discussed above when we compared the results of the random and LOG cross-validation tests.

This concludes the MAST-ML tutorial document! There are some other features of MAST-ML which were not explicitly discussed in this tutorial, such as forming data clusters. Consult the MAST-ML Input File section of this documentation for a more in-depth overview of all the possible options for different MAST-ML runs.

Code Documentation: Metrics

mastml.metrics Module

This module contains constructors for different model score metrics. Most model metrics are obtained from scikit-learn, while others are custom variations.

The full list of score functions in scikit-learn can be found at: http://scikit-learn.org/stable/modules/model_evaluation.html

Functions

adjusted_r2_score(y_true, y_pred[, n_features]) Method that calculates the adjusted R^2 value
check_and_fetch_names(metric_names, …) Method that checks whether chosen metrics to evaluate models are appropriate for user-specified models (e.g.
r2_score_fitted(y_true, y_pred) Method that calculates the R^2 value
r2_score_noint(y_true, y_pred) Method that calculates the R^2 value without fitting the y-intercept
rmse_over_stdev(y_true, y_pred[, train_y]) Method that calculates the root mean squared error (RMSE) of a set of data, divided by the standard deviation of the training data set.
root_mean_squared_error(y_true, y_pred) Method that calculates the root mean squared error (RMSE)
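As an illustration of the custom metrics listed above, the following are simple numpy sketches (illustrative only, not the exact MAST-ML implementations):

    import numpy as np

    def root_mean_squared_error(y_true, y_pred):
        return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

    def rmse_over_stdev(y_true, y_pred, train_y=None):
        # Normalizing the RMSE by the spread of the training data gives a
        # scale-free error; values near 1 mean the model does little better
        # than predicting the mean
        stdev = np.std(train_y if train_y is not None else y_true)
        return root_mean_squared_error(y_true, y_pred) / stdev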

Code Documentation: Configuration file parser

mastml.conf_parser Module

The conf_parser module is used for handling, parsing, and checking MAST-ML input configuration files

Functions

check_models_mixed(model_names) Method used to check whether the user has mixed regression and classification tasks
fix_types(maybe_list) Method that returns true datatype of values passed as string or list of strings, parsed from configuration file
make_scorer(score_func, *[, …]) Make a scorer from a performance metric or loss function.
mybool(string) Method that converts a string equal to ‘True’ or ‘False’ into type bool
parse_conf_file(filepath[, from_dict]) Method that accepts the filepath of an input configuration file and returns its parsed dictionary

Code Documentation: Data cleaner

mastml.data_cleaner Module

The data_cleaner module is used to clean missing or NaN values from pandas dataframes (e.g. removing NaN, imputation, etc.)

Functions

columns_with_strings(df) Method that ascertains which columns in data contain string entries
flag_outliers(df, conf_not_input_features, …) Method that scans values in each X feature matrix column and flags values that are larger than 3 standard deviations from the average of that column value.
imputation(df, strategy[, cols_to_leave_out]) Method that imputes values for missing entries based on the median, mean, etc.
orth(A[, rcond]) Construct an orthonormal basis for the range of A using SVD
ppca(df[, cols_to_leave_out]) Method that performs a recursive PCA routine to use PCA of known columns to fill in missing values in particular column
remove(df, axis) Method that removes a full column or row of data values if one column or row contains NaN or is blank
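For example, the imputation-style cleaning wrapped by this module can be sketched with scikit-learn’s SimpleImputer (a minimal, self-contained illustration):

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})
    imputer = SimpleImputer(strategy="mean")  # or "median", etc.
    df_clean = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)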

Classes

PPCA() Class to perform probabilistic principal component analysis (PPCA) to fill in missing data.
SimpleImputer(*[, missing_values, strategy, …]) Imputation transformer for completing missing values.

Class Inheritance Diagram

Inheritance diagram of mastml.data_cleaner.PPCA

Code Documentation: Data loader

mastml.data_loader Module

The data_loader module is used for importing data from user-specified csv or xlsx file to MAST-ML

Functions

load_data(file_path[, input_features, …]) Method that accepts the filepath of an input data file and returns a full dataframe and parsed X and y dataframes

Code Documentation: Learning curve

mastml.learning_curve Module

This module contains methods to construct learning curves, which evaluate some cross-validation performance metric (e.g. RMSE) as a function of amount of training data (i.e. a sample learning curve) or as a function of the number of features used in the fitting (i.e. a feature learning curve).

Functions

f_regression(X, y, *[, center]) Univariate linear regression tests.
feature_learning_curve(X, y, estimator, cv, …) Method that calculates data used to plot a feature learning curve, e.g.
learning_curve(estimator, X, y, *[, groups, …]) Learning curve.
sample_learning_curve(X, y, estimator, cv, …) Method that calculates data used to plot a sample learning curve, e.g.
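A sample learning curve of the kind this module produces can be sketched with scikit-learn’s learning_curve function, which the module wraps (X and y are hypothetical feature and target data):

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.model_selection import RepeatedKFold, learning_curve

    # Evaluate CV error at five training-set sizes from 10% to 100%
    sizes, train_scores, test_scores = learning_curve(
        KernelRidge(kernel="rbf"), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5),
        cv=RepeatedKFold(n_splits=5, n_repeats=2),
        scoring="neg_root_mean_squared_error")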

Code Documentation: Clusterers

mastml.legos.clusterers Module

The clusterers module is used for instantiating cluster algorithm objects from scikit-learn. More information is available at http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster

Code Documentation: Data splitters

mastml.legos.data_splitters Module

The data_splitters module contains a collection of classes for generating (train_indices, test_indices) pairs from a dataframe or a numpy array.

For more information and a list of scikit-learn splitter classes, see:
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection
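All of these classes follow the scikit-learn splitter contract: a split() method that yields (train_indices, test_indices) pairs. A hypothetical, simplified version of LeaveOutPercent illustrates the interface:

    import numpy as np

    class LeaveOutPercentSketch:
        """Simplified sketch; not the actual MAST-ML implementation."""
        def __init__(self, percent_leave_out=20, n_repeats=5):
            self.fraction = percent_leave_out / 100.0
            self.n_repeats = n_repeats

        def split(self, X, y=None, groups=None):
            n = len(X)
            n_test = max(1, int(self.fraction * n))
            rng = np.random.default_rng()
            for _ in range(self.n_repeats):
                idx = rng.permutation(n)  # fresh random split each repeat
                yield idx[n_test:], idx[:n_test]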

Classes

BaseEstimator Base class for all estimators in scikit-learn.
Bootstrap(n[, n_bootstraps, train_size, …]) Random sampling with replacement cross-validation iterator. Provides train/test indices to split data into train and test sets while resampling the input n_bootstraps times: each time a new random split of the data is performed, and samples are then drawn (with replacement) on each side of the split to build the training and test sets. (Note: Bootstrap is taken directly from the scikit-learn GitHub repository, https://github.com/scikit-learn/scikit-learn/blob/0.11.X/sklearn/cross_validation.py, as it was removed from more recent scikit-learn releases.)
JustEachGroup() Class to train the model on one group at a time and test it on the rest of the data. This class wraps scikit-learn’s LeavePGroupsOut with P set to n-1.
LeaveCloseCompositionsOut([dist_threshold, …]) Leave-P-out where you exclude materials with compositions close to those in the test set
LeaveOutPercent([percent_leave_out, n_repeats]) Class to train the model using a certain percentage of data as training data
NearestNeighbors(*[, n_neighbors, radius, …]) Unsupervised learner for implementing neighbor searches.
NoSplit() Class to just train the model on the training data and test it on that same data.
SplittersUnion(splitters) Class to take the union of two separate splitting routines, so that many splitting routines can be performed at once
TransformerMixin Mixin class for all transformers in scikit-learn.

Class Inheritance Diagram

Inheritance diagram of mastml.legos.data_splitters.Bootstrap, mastml.legos.data_splitters.JustEachGroup, mastml.legos.data_splitters.LeaveCloseCompositionsOut, mastml.legos.data_splitters.LeaveOutPercent, mastml.legos.data_splitters.NoSplit, mastml.legos.data_splitters.SplittersUnion

Code Documentation: Utils

mastml.utils Module

The utils module contains a collection of miscellaneous methods and error handling used throughout MAST-ML

Functions

activate_logging(savepath, paths[, …]) Method to create MAST-ML logger file
ceil Return the ceiling of x as an Integral.
floor Return the floor of x as an Integral.
join(a, *p) Join two or more pathname components, inserting ‘/’ as needed.
log(x, [base=math.e]) Return the logarithm of x to the given base.
log_header(paths, log) Method to create header for MAST-ML logger
nice_range(lower, upper) Method to create a range of values, including the specified start and end points, with nicely spaced intervals
verbosalize_logger(log, verbosity)

Classes

BetweenFilter(min_level, max_level) Class to aid in handling logger display levels
ConfError Class representing error in input configuration file
FileNotFoundError Class representing error raised when a needed file cannot be found
FiletypeError Class representing error raised when an improper file extension is used
InvalidConfParameters Class representing error raised when you have invalid input configuration file parameters
InvalidConfSection Class representing error raised when an invalid section name is present in the input configuration file
InvalidConfSubSection Class representing error raised when an invalid subsection name is present in the input configuration file
InvalidModel Class representing error when model does not exist
InvalidValue Class representing error raised when an invalid value has been used
MastError Base class for MAST-ML specific errors that should be shown to the user
MissingColumnError Class representing error raised when your csv doesn’t have the specified column
defaultdict defaultdict(default_factory[, …]) --> dict with default factory

Class Inheritance Diagram

Inheritance diagram of mastml.utils.BetweenFilter, mastml.utils.ConfError, mastml.utils.FileNotFoundError, mastml.utils.FiletypeError, mastml.utils.InvalidConfParameters, mastml.utils.InvalidConfSection, mastml.utils.InvalidConfSubSection, mastml.utils.InvalidModel, mastml.utils.InvalidValue, mastml.utils.MastError, mastml.utils.MissingColumnError

Code Documentation: MAST-ML Driver

mastml.mastml_driver Module

Main MAST-ML module responsible for executing the workflow of a MAST-ML run

Functions

check_paths(conf_path, data_path, outdir) This method is responsible for error handling of the user-specified paths for the configuration file, data file, and output directory.
clone(estimator, *[, safe]) Constructs a new unfitted estimator with the same parameters.
deepcopy(x[, memo, _nil]) Deep copy operation on arbitrary Python objects.
get_commandline_args() This method is responsible for parsing and checking the command-line execution of MAST-ML inputted by the user.
join(a, *p) Join two or more pathname components, inserting ‘/’ as needed.
main(conf_path, data_path[, outdir, verbosity]) This method is responsible for setting up the initial stage of the MAST-ML run, such as parsing input directories to designate where data will be imported and results saved to, as well as creation of the MAST-ML run log.
make_scorer(score_func, *[, …]) Make a scorer from a performance metric or loss function.
mastml_run(conf_path, data_path, outdir) This method is responsible for conducting the main MAST-ML run workflow
reduce(function, sequence[, initial]) Apply a function of two arguments cumulatively to the items of a sequence, from left to right, so as to reduce the sequence to a single value.

Code Documentation: Plot Helper

mastml.plot_helper Module

This module contains a collection of functions which make plots (saved as png files) using matplotlib, generated from some model fits and cross-validation evaluation within a MAST-ML run.

This module also contains a method to create python notebooks containing plotted data and the relevant source code from this module, to enable the user to make their own modifications to the created plots in a straightforward way (useful for tweaking plots for a presentation or publication).

Functions

auc(x, y) Compute Area Under the Curve (AUC) using the trapezoidal rule.
ceil Return the ceiling of x as an Integral.
confusion_matrix(y_true, y_pred, *[, …]) Compute confusion matrix to evaluate the accuracy of a classification.
figaspect(arg) Calculate the width and height for a figure with a specified aspect ratio.
floor Return the floor of x as an Integral.
get_divisor(high, low) Method to obtain a sensible divisor based on range of two values
get_histogram_bins(y_df) Method to obtain the number of bins to use when plotting a histogram
ipynb_maker(plot_func) This method creates Jupyter Notebooks so the user can modify and regenerate the plots produced by MAST-ML.
join(a, *p) Join two or more pathname components, inserting ‘/’ as needed.
log(x, [base=math.e]) Return the logarithm of x to the given base.
make_axes_locatable(axes)
make_axis_same(ax, max1, min1) Method to make the x and y ticks for each axis the same.
make_error_plots(run, path, …[, groups])
make_fig_ax([aspect_ratio, x_align, left]) Method to make matplotlib figure and axes objects.
make_fig_ax_square([aspect, aspect_ratio]) Method to make square shaped matplotlib figure and axes objects.
make_train_test_plots(run, path, …[, groups]) General plotting method used to execute sequence of specific plots of train-test data analysis
mark_inset(parent_axes, inset_axes, loc1, …) Draw a box to mark the location of an area represented by an inset axes.
nice_mean(ls) Method to return mean of a list or equivalent array with NaN values
nice_names()
nice_range(lower, upper) Method to create a range of values, including the specified start and end points, with nicely spaced intervals
nice_std(ls) Method to return standard deviation of a list or equivalent array with NaN values
parse_error_data(dataset_stdev, …)
plot_1d_heatmap(xs, heats, savepath[, …]) Method to plot a heatmap for values of a single variable; used for plotting GridSearch results in hyperparameter optimization.
plot_2d_heatmap(xs, ys, heats, savepath[, …]) Method to plot a heatmap for values of two variables; used for plotting GridSearch results in hyperparameter optimization.
plot_3d_heatmap(xs, ys, zs, heats, savepath) Method to plot a heatmap for values of three variables; used for plotting GridSearch results in hyperparameter optimization.
plot_average_cumulative_normalized_error(…) Method to plot the cumulative normalized residual errors of a model prediction
plot_average_normalized_error(y_true, …[, …]) Method to plot the normalized residual errors of a model prediction
plot_best_worst_per_point(y_true, …[, …]) Method to create a parity plot (predicted vs.
plot_best_worst_split(y_true, best_run, …) Method to create a parity plot (predicted vs.
plot_confusion_matrix(y_true, y_pred, …[, …]) Method used to generate a confusion matrix for a classification run.
plot_cumulative_normalized_error(y_true, …) Method to plot the cumulative normalized residual errors of a model prediction
plot_keras_history(model_history, savepath, …)
plot_learning_curve(train_sizes, train_mean, …) Method used to plot both data and feature learning curves
plot_learning_curve_convergence(train_sizes, …) Method used to plot both the convergence of data and feature learning curves as a function of amount of data or features
plot_metric_vs_group(metric, groups, stats, …) Method to plot the value of a particular calculated metric (e.g.
plot_metric_vs_group_size(metric, groups, …) Method to plot the value of a particular calculated metric (e.g.
plot_normalized_error(y_true, y_pred, …[, …]) Method to plot the normalized residual errors of a model prediction
plot_precision_recall_curve(y_true, y_pred, …) Method to calculate and plot the precision-recall curve for classification model results
plot_predicted_vs_true(train_quad, …) Method to create a parity plot (predicted vs.
plot_predicted_vs_true_bars(y_true, …[, …]) Method to calculate parity plot (predicted vs.
plot_real_vs_predicted_error(y_true, …)
plot_residuals_histogram(y_true, y_pred, …) Method to calculate and plot the histogram of residuals from regression model
plot_roc_curve(y_true, y_pred, savepath) Method to calculate and plot the receiver-operator characteristic curve for classification model results
plot_scatter(x, y, savepath[, groups, …]) Method to create a general scatter plot
plot_stats(fig, stats[, x_align, y_align, …]) Method that prints stats onto the plot.
plot_target_histogram(y_df, savepath[, …]) Method to plot the histogram of true y values
precision_recall_curve(y_true, probas_pred, *) Compute precision-recall pairs for different probability thresholds.
prediction_intervals(model, X, …) Method to calculate prediction intervals when using Random Forest and Gaussian Process regression models.
r2_score(y_true, y_pred, *[, sample_weight, …]) R^2 (coefficient of determination) regression score function.
recursive_max(arr) Method to recursively find the max value of an array of iterables.
recursive_max_and_min(arr) Method to recursively return max and min of values or iterables in array
recursive_min(arr) Method to recursively find the min value of an array of iterables.
roc_curve(y_true, y_score, *[, pos_label, …]) Compute Receiver operating characteristic (ROC).
round_down(num, divisor) Method to return a rounded down number
round_up(num, divisor) Method to return a rounded up number
rounder(delta) Method to obtain number of decimal places to report on plots
stat_to_string(name, value, nice_names) Method that converts a metric object into a string for displaying on a plot
trim_array(arr_list) Method used to trim a set of arrays to make all arrays the same shape
wraps(wrapped[, assigned, updated]) Decorator factory to apply update_wrapper() to a wrapper function
zoomed_inset_axes(parent_axes, zoom[, loc, …]) Create an anchored inset axes by scaling a parent axes.

Code Documentation: HTML Helper

mastml.html_helper Module

Module for generating an HTML file, called index.html, which contains an overview of the key data and plots from a MAST-ML run. Images of cross-validation parity plots, data histograms, data statistics, and links to the relevant files are all provided.

Functions

attr(*args, **kwargs) Set attributes on the current active tag context
get_current([default]) get the current tag being used as a with context or decorated function.
gmtime([seconds]) Convert seconds since the Epoch to a time tuple expressing UTC (a.k.a. GMT).
is_test_image(path) Method used to assess whether an image is for testing data
is_train_image(path) Method used to assess whether an image is for training data
join(a, *p) Join two or more pathname components, inserting ‘/’ as needed.
make_html(outdir) Method used to create the main index.html file
make_image(src[, title]) Method used to generate and show an image of a fixed width.
make_link(href) Method used to generate a link to a particular file created from a MAST-ML run.
relpath(path[, start]) Return a relative version of a path
show_combo(combo_dir, outdir) Method used to collect combinations of data analysis (e.g.
simple_section(filepath, outdir) Method used to create a section name for a particular analysis combination that will be displayed in the index.html file.
strftime(format[, tuple]) Convert a time tuple to a string according to a format specification.

Code Documentation: Feature Selectors

mastml.legos.feature_selectors Module

This module contains a collection of classes and methods for selecting features, and interfaces with scikit-learn feature selectors. More information on scikit-learn feature selectors is available at:

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection

Functions

cov(m[, y, rowvar, bias, ddof, fweights, …]) Estimate a covariance matrix, given data and weights.
dataframify_new_column_names(transform, name) Method which transforms output of scikit-learn feature selectors to dataframe, and adds column names
dataframify_selector(transform) Method which transforms output of scikit-learn feature selectors from array to dataframe.
fitify_just_use_values(fit) Method which enables a feature selector fit method to operate on dataframes
pearsonr(x, y) Pearson correlation coefficient and p-value for testing non-correlation.
root_mean_squared_error(y_true, y_pred) Method that calculates the root mean squared error (RMSE)
wraps(wrapped[, assigned, updated]) Decorator factory to apply update_wrapper() to a wrapper function

Classes

BaseEstimator Base class for all estimators in scikit-learn.
EnsembleModelFeatureSelector(estimator, …) Class custom-written for MAST-ML to conduct selection of features with ensemble model feature importances
MASTMLFeatureSelector(estimator, …[, …]) Class custom-written for MAST-ML to conduct forward selection of features with flexible model and cv scheme
PCA([n_components, copy, whiten, …]) Principal component analysis (PCA).
PearsonSelector(threshold_between_features, …) Class custom-written for MAST-ML to conduct selection of features based on Pearson correlation coefficent between features and target.
SequentialFeatureSelector(estimator[, …]) Sequential Feature Selection for Classification and Regression.
TransformerMixin Mixin class for all transformers in scikit-learn.
constructor alias of sklearn.feature_selection._variance_threshold.VarianceThreshold

Class Inheritance Diagram

Inheritance diagram of mastml.legos.feature_selectors.EnsembleModelFeatureSelector, mastml.legos.feature_selectors.MASTMLFeatureSelector, mastml.legos.feature_selectors.PearsonSelector

Code Documentation: Feature Normalizers

mastml.legos.feature_normalizers Module

This module contains a collection of classes and methods for normalizing features. Also included is connection with scikit-learn methods. See http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing for more info.

Functions

dataframify(transform) Decorator method that transforms the output of scikit-learn feature normalizers from array to dataframe.
wraps(wrapped[, assigned, updated]) Decorator factory to apply update_wrapper() to a wrapper function

Classes

BaseEstimator Base class for all estimators in scikit-learn.
Binarizer(*[, threshold, copy]) Binarize data (set feature values to 0 or 1) according to a threshold.
MaxAbsScaler(*[, copy]) Scale each feature by its maximum absolute value.
MeanStdevScaler([features, mean, stdev]) Class designed to normalize input data to a specified mean and standard deviation
MinMaxScaler([feature_range, copy, clip]) Transform features by scaling each feature to a given range.
Normalizer([norm, copy]) Normalize samples individually to unit norm.
OneHotEncoder(*[, categories, drop, sparse, …]) Encode categorical features as a one-hot numeric array.
QuantileTransformer(*[, n_quantiles, …]) Transform features using quantiles information.
RobustScaler(*[, with_centering, …]) Scale features using statistics that are robust to outliers.
StandardScaler(*[, copy, with_mean, with_std]) Standardize features by removing the mean and scaling to unit variance
TransformerMixin Mixin class for all transformers in scikit-learn.

Class Inheritance Diagram

Inheritance diagram of mastml.legos.feature_normalizers.MeanStdevScaler

Code Documentation: Randomizers

mastml.legos.randomizers Module

This module contains a class used to randomize the input y data, in order to create a “null model” for testing how rigorous other machine learning model predictions are.

Classes

Randomizer() Class which randomizes X-y pairings by shuffling the y values
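The null-model idea can be sketched in a few lines: shuffling y breaks any real X-y relationship, so a model that still appears to fit the shuffled data is fitting noise (a minimal numpy illustration, not the MAST-ML implementation):

    import numpy as np

    def randomize_y(y, seed=None):
        # Return a shuffled copy of y, destroying the X-y pairing
        rng = np.random.default_rng(seed)
        return rng.permutation(y)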

Class Inheritance Diagram

Inheritance diagram of mastml.legos.randomizers.Randomizer

Code Documentation: Model Finder

mastml.legos.model_finder Module

This module provides a name_to_constructor dict for all models/estimators in scikit-learn, plus a couple test models and error handling functions

Functions

check_models_mixed(model_names) Method used to check whether the user has mixed regression and classification tasks
find_model(model_name) Method used to check model names conform to scikit-learn model/estimator names

Classes

AlwaysFive([constant]) Class used as a test model that always predicts a value of 5.
EnsembleRegressor(n_estimators, num_samples, …)
KerasRegressor(conf_dict)
ModelImport(model_path) Class used to import pickled models from previous machine learning fits
RandomGuesser() Class used as a test model that always predicts random values for y data.

Class Inheritance Diagram

Inheritance diagram of mastml.legos.model_finder.AlwaysFive, mastml.legos.model_finder.EnsembleRegressor, mastml.legos.model_finder.KerasRegressor, mastml.legos.model_finder.ModelImport, mastml.legos.model_finder.RandomGuesser

Code Documentation: Utility Legos

mastml.legos.util_legos Module

This module contains a collection of classes for debugging and control flow

Classes

BaseEstimator Base class for all estimators in scikit-learn.
DataFrameFeatureUnion(transforms) Class for unioning dataframe generators (sklearn.pipeline.FeatureUnion always puts out arrays)
DoNothing() Class for having a “null” transform where the output is the same as the input.
TransformerMixin Mixin class for all transformers in scikit-learn.

Class Inheritance Diagram

Inheritance diagram of mastml.legos.util_legos.DataFrameFeatureUnion, mastml.legos.util_legos.DoNothing

Code Documentation: Feature Generators

mastml.legos.feature_generators Module

This module contains a collection of classes for generating input features to fit machine learning models to.

Functions

clean_dataframe(df) Method to clean dataframes after feature generation has occurred, to remove columns that have a single missing or NaN value, or remove a row that is fully empty

Classes

BaseEstimator Base class for all estimators in scikit-learn.
ContainsElement(composition_feature, …[, …]) Class to generate new categorical features (i.e.
DataframeUtilities Class of basic utilities for dataframe manipulation, and exchanging between dataframes and numpy arrays
Magpie(composition_feature[, feature_types]) Class that wraps MagpieFeatureGeneration, giving it scikit-learn structure
MagpieFeatureGeneration(dataframe, …) Class to generate new features using Magpie data and dataframe containing material compositions
MaterialsProject(composition_feature, api_key) Class that wraps MaterialsProjectFeatureGeneration, giving it scikit-learn structure
MaterialsProjectFeatureGeneration(dataframe, …) Class to generate new features using Materials Project data and a dataframe containing material compositions. Dataframe must have a column named “Material compositions”.
Matminer(structural_features, structure_col) Class to generate structural features from matminer structure module Args: structural_features: the structure feature(s) the user wants to instantiate and generate structure_col: the dataframe column that contains the pymatgen structure object.
NoGenerate() Class for having a “null” transform where the output is the same as the input.
PolynomialFeatures([features, degree, …]) Class to generate polynomial features using scikit-learn’s polynomial features method More info at: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
SklearnPolynomialFeatures alias of sklearn.preprocessing._data.PolynomialFeatures
TransformerMixin Mixin class for all transformers in scikit-learn.

Class Inheritance Diagram

Inheritance diagram of mastml.legos.feature_generators.ContainsElement, mastml.legos.feature_generators.DataframeUtilities, mastml.legos.feature_generators.Magpie, mastml.legos.feature_generators.MagpieFeatureGeneration, mastml.legos.feature_generators.MaterialsProject, mastml.legos.feature_generators.MaterialsProjectFeatureGeneration, mastml.legos.feature_generators.Matminer, mastml.legos.feature_generators.NoGenerate, mastml.legos.feature_generators.PolynomialFeatures
