Code Documentation: Data Splitters

mastml.data_splitters Module

This module contains a collection of methods for splitting data into different types of train/test sets. Data splitters are the core component of evaluating model performance.

BaseSplitter:

Base class that handles the core MAST-ML data splitting and model evaluation workflow. This class is responsible for looping over provided feature selectors, models, and data splits and training and evaluating the model for each split, then generating the necessary plots and performance statistics. All different splitter types inherit this base class.

SklearnDataSplitter:

Wrapper class to enable MAST-ML workflow compatible use of any data splitter contained in scikit-learn, e.g. KFold, RepeatedKFold, LeaveOneGroupOut, etc.

NoSplit:

Class that doesn’t perform any data split. Equivalent to a “full fit” of the data where all data is used in training.

JustEachGroup:

Class that splits data so each individual group is used as training with all other groups used as testing. Essentially the inverse of LeaveOneGroupOut, this class trains only on one group and predicts the rest, as opposed to training on all but one group and testing on the left-out group.
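The train-on-one-group behaviour described above can be sketched in plain Python. This is only an illustration of the splitting logic, not the MAST-ML implementation; the function name just_each_group_splits and the example group labels are hypothetical.

```python
from typing import Hashable, Iterator, List, Sequence, Tuple


def just_each_group_splits(
    groups: Sequence[Hashable],
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices), training on exactly one group per split."""
    # Keep group labels in first-seen order so the split order is deterministic.
    unique_groups = list(dict.fromkeys(groups))
    for train_group in unique_groups:
        train = [i for i, g in enumerate(groups) if g == train_group]
        test = [i for i, g in enumerate(groups) if g != train_group]
        yield train, test


# Hypothetical group label for each of five data points.
groups = ["oxide", "oxide", "nitride", "carbide", "nitride"]
splits = list(just_each_group_splits(groups))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each group appears as the training set exactly once, so the number of splits equals the number of distinct groups.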

LeaveCloseCompositionsOut:

Class to split data based on their compositional similarity. This is a useful way to keep compositionally similar compounds together in either the training or the testing set, so that similar materials are not contained in both sets.
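One way to picture this idea is the sketch below, which drops from the training set every material whose composition lies within a distance threshold of the held-out test material. It is a simplified illustration under stated assumptions, not the MAST-ML code: the L1 distance on atomic fractions, the function names, the dist_threshold parameter, and the example compositions are all chosen here for illustration.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

# Assumed representation: element symbol -> atomic fraction.
Composition = Dict[str, float]


def composition_distance(a: Composition, b: Composition) -> float:
    """L1 distance between two compositions expressed as atomic fractions."""
    elements = set(a) | set(b)
    return sum(abs(a.get(el, 0.0) - b.get(el, 0.0)) for el in elements)


def leave_close_compositions_out(
    compositions: Sequence[Composition], dist_threshold: float
) -> Iterator[Tuple[List[int], List[int]]]:
    """Hold out one material at a time; train only on compositions far from it."""
    for test_idx, test_comp in enumerate(compositions):
        # The test point itself has distance 0, so it is excluded automatically.
        train = [
            i
            for i, comp in enumerate(compositions)
            if composition_distance(comp, test_comp) > dist_threshold
        ]
        yield train, [test_idx]


# Hypothetical compositions for illustration only.
compositions = [
    {"Fe": 0.5, "O": 0.5},
    {"Fe": 0.45, "O": 0.55},
    {"Al": 0.4, "O": 0.6},
    {"Si": 1.0},
]
splits = list(leave_close_compositions_out(compositions, dist_threshold=0.2))
```

In this toy data the two iron oxides are close to each other, so neither is used to train a model that is tested on the other.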

LeaveOutPercent:

Method to randomly split the data based on a fraction of the total data points, rather than a designated number of splits. This enables leaving out more than 50% of the data (the highest leave-out fraction possible with KFold, reached at k=2), e.g. leaving out 90% of the data.
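A minimal sketch of a fraction-based random split follows. The parameter names percent_leave_out and n_repeats mirror the class signature listed in this module; the function name, the seed argument, the default values, and the rounding rule are assumptions made for this example and may differ from the MAST-ML implementation.

```python
import random
from typing import Iterator, List, Tuple


def leave_out_percent_splits(
    n_samples: int,
    percent_leave_out: float = 0.2,
    n_repeats: int = 5,
    seed: int = 0,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) with a fixed fraction of points left out."""
    if not 0.0 < percent_leave_out < 1.0:
        raise ValueError("percent_leave_out must be strictly between 0 and 1")
    rng = random.Random(seed)
    n_test = round(n_samples * percent_leave_out)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The first n_test shuffled indices are left out; the rest are used for training.
        yield sorted(indices[n_test:]), sorted(indices[:n_test])


# Leave out 90% of 20 data points, repeated three times.
splits = list(leave_out_percent_splits(20, percent_leave_out=0.9, n_repeats=3))
for train, test in splits:
    print(len(train), "train /", len(test), "test")
```

Because each repeat is an independent shuffle, test sets from different repeats may overlap, unlike the folds of a KFold split.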

LeaveOutTwinCV:

Another method to help separate similar data between the training and testing sets. This method makes use of a general distance metric on the provided features, and flags twins as those data points within some provided distance threshold in the feature space.
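The twin-flagging step can be illustrated with a brute-force pairwise distance check. This sketch assumes a Euclidean metric and a hypothetical helper named find_twins; it covers only the detection of twins, not how the class subsequently assigns the flagged points to training or testing.

```python
import math
from typing import Sequence, Set


def find_twins(features: Sequence[Sequence[float]], threshold: float) -> Set[int]:
    """Return indices of points lying within `threshold` of at least one other point."""
    twins: Set[int] = set()
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            # Euclidean distance is an assumption; other norms could be substituted.
            if math.dist(features[i], features[j]) <= threshold:
                twins.update((i, j))
    return twins


# Hypothetical two-dimensional feature vectors.
X = [[0.0, 0.0], [0.05, 0.0], [1.0, 1.0], [3.0, 3.0], [3.0, 3.05]]
twins = find_twins(X, threshold=0.1)
print(sorted(twins))
```

The pairwise loop scales quadratically with the number of data points, so a nearest-neighbor index would be preferable for large data sets.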

LeaveOutClusterCV:

Method to use a clustering algorithm to pre-cluster data into groups. These groups are then each used in turn as the left-out data set. It essentially functions as a leave-out-group test in which the groups are obtained automatically from a clustering algorithm.

LeaveMultiGroupOut:

Class to train the model on multiple groups at a time and test it on the rest of the data.

Bootstrap:

Method to perform bootstrap resampling, i.e. random leave-out with replacement.

Classes

BaseSplitter()

Class functioning as a base splitter with methods for organizing output and evaluating any mastml data splitter

Baseline_tests()

Methods:

Bootstrap(n[, n_bootstraps, train_size, ...])

Note: Bootstrap was taken directly from the scikit-learn GitHub repository (https://github.com/scikit-learn/scikit-learn/blob/0.11.X/sklearn/cross_validation.py), which was necessary because it was removed from later scikit-learn releases.

Random sampling with replacement cross-validation iterator. Provides train/test indices to split data into train and test sets while resampling the input n_bootstraps times: each time, a new random split of the data is performed and then samples are drawn (with replacement) on each side of the split to build the training and test sets.
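The two-stage procedure described above (a random split, then resampling with replacement on each side) can be sketched with the standard library alone. The parameter names n, n_bootstraps, and train_size follow the signature listed here; the function name, the seed argument, and the default values are assumptions for this example, and the code is not the scikit-learn source.

```python
import random
from typing import Iterator, List, Tuple


def bootstrap_splits(
    n: int,
    n_bootstraps: int = 3,
    train_size: float = 0.5,
    seed: int = 0,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) built by resampling each side of a random split."""
    n_train = int(n * train_size)
    if not 0 < n_train < n:
        raise ValueError("train_size must leave at least one point on each side")
    rng = random.Random(seed)
    for _ in range(n_bootstraps):
        # Step 1: a fresh random split of the indices into two disjoint pools.
        permutation = list(range(n))
        rng.shuffle(permutation)
        train_pool, test_pool = permutation[:n_train], permutation[n_train:]
        # Step 2: draw with replacement within each pool, so indices may repeat.
        train = rng.choices(train_pool, k=len(train_pool))
        test = rng.choices(test_pool, k=len(test_pool))
        yield train, test


splits = list(bootstrap_splits(10, n_bootstraps=4, train_size=0.5))
for train, test in splits:
    print("train:", train, "test:", test)
```

Resampling happens only within each pool, so an index can repeat inside the training or test set but never appears in both.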

Composition(*args[, strict])

Represents a Composition, which is essentially a {element:amount} mapping type.

Domain()

This class evaluates which test data points fall within or outside the domain.

ElementFraction()

Class to calculate the atomic fraction of each element in a composition.

ErrorUtils()

Collection of functions to conduct error analysis on certain types of models (uncertainty quantification), prepare residual and model error data for plotting, and recalibrate model errors with various methods.

JustEachGroup(**kwargs)

Class to train the model on one group at a time and test it on the rest of the data. This class wraps scikit-learn's LeavePGroupsOut with P set to n-1.

LeaveCloseCompositionsOut(composition_df[, ...])

Leave-P-out where you exclude materials with compositions close to those in the test set.

LeaveMultiGroupOut([multigroup_size])

Class to train the model on multiple groups at a time and test it on the rest of the data.

LeaveOutClusterCV(cluster, **kwargs)

Class to generate train/test split using clustering.

LeaveOutPercent([percent_leave_out, n_repeats])

Class to train the model using a certain percentage of data as training data

LeaveOutTwinCV([threshold, ord, debug, ...])

Class to remove data twins from the test data.

Metrics(metrics_list[, metrics_type])

Class containing access to a wide range of metrics from scikit-learn and a number of MAST-ML custom-written metrics

NearestNeighbors(*[, n_neighbors, radius, ...])

Unsupervised learner for implementing neighbor searches.

NoPreprocessor([preprocessor, as_frame])

Class for having a "null" transform where the output is the same as the input.

NoSelect()

Class for having a "null" transform where the output is the same as the input.

NoSplit(**kwargs)

Class to just train the model on the training data and test it on that same data.

SklearnDataSplitter(splitter, **kwargs)

Class to wrap any scikit-learn based data splitter, e.g. KFold, RepeatedKFold, LeaveOneGroupOut, etc.

datetime(year, month, day[, hour[, minute[, ...)

The year, month and day arguments are required.

Class Inheritance Diagram

Inheritance diagram of mastml.data_splitters.BaseSplitter, mastml.data_splitters.Bootstrap, mastml.data_splitters.JustEachGroup, mastml.data_splitters.LeaveCloseCompositionsOut, mastml.data_splitters.LeaveMultiGroupOut, mastml.data_splitters.LeaveOutClusterCV, mastml.data_splitters.LeaveOutPercent, mastml.data_splitters.LeaveOutTwinCV, mastml.data_splitters.NoSplit, mastml.data_splitters.SklearnDataSplitter