Code Documentation: Data Splitters

mastml.data_splitters Module

This module contains a collection of methods to split data into different types of train/test sets. Data splitters are a core component of evaluating model performance.

BaseSplitter:

Base class that handles the core MAST-ML data splitting and model evaluation workflow. It loops over the provided feature selectors, models, and data splits, trains and evaluates the model for each split, and then generates the necessary plots and performance statistics. All splitter types inherit from this base class.

SklearnDataSplitter:

Wrapper class that lets any data splitter contained in scikit-learn (e.g. KFold, RepeatedKFold, LeaveOneGroupOut) be used within the MAST-ML workflow.
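
As a rough illustration of what wrapping a scikit-learn splitter by name can look like, the sketch below resolves a class name from sklearn.model_selection and forwards keyword arguments to it. The helper make_sklearn_splitter is hypothetical and is not part of MAST-ML; the actual SklearnDataSplitter may resolve and configure splitters differently.

```python
import numpy as np
import sklearn.model_selection as ms


def make_sklearn_splitter(splitter, **kwargs):
    """Hypothetical helper: look up a scikit-learn splitter class by name."""
    return getattr(ms, splitter)(**kwargs)


X = np.arange(20).reshape(10, 2)

# Build a RepeatedKFold from its name plus keyword arguments.
cv = make_sklearn_splitter("RepeatedKFold", n_splits=5, n_repeats=2, random_state=0)
splits = list(cv.split(X))

# 5 folds repeated 2 times gives 10 train/test index pairs.
print(len(splits))
```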

NoSplit:

Class that doesn’t perform any data split. Equivalent to a “full fit” of the data where all data is used in training.

JustEachGroup:

Class that splits data so each individual group is used as training with all other groups used as testing. Essentially the inverse of LeaveOneGroupOut, this class trains only on one group and predicts the rest, as opposed to training on all but one group and testing on the left-out group.
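
The class summary further down notes that JustEachGroup wraps scikit-learn's LeavePGroupsOut with P set to n-1. A minimal sketch of that idea, using scikit-learn directly rather than the MAST-ML class, with made-up toy data and integer group labels:

```python
import numpy as np
from sklearn.model_selection import LeavePGroupsOut

X = np.arange(12).reshape(6, 2)
groups = np.array([0, 0, 1, 1, 2, 2])  # three groups of two samples each
n_groups = len(np.unique(groups))

# Leaving out n-1 groups means each split trains on exactly one group.
cv = LeavePGroupsOut(n_groups=n_groups - 1)
splits = list(cv.split(X, groups=groups))

for train_idx, test_idx in splits:
    print("train groups:", np.unique(groups[train_idx]),
          "test groups:", np.unique(groups[test_idx]))
```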

LeaveCloseCompositionsOut:

Class to split data based on compositional similarity. A useful means to separate compositionally similar compounds into the training or testing set, so that similar materials are not contained in both sets.

LeaveOutPercent:

Method to randomly split the data based on a fraction of the total data points, rather than a designated number of splits. This makes it possible to leave out more than 50% of the data (the largest leave-out possible with KFold, reached at k=2), for example leaving out 90% of the data.
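
The fraction-based split can be sketched with NumPy alone. The function leave_out_percent_splits below is a hypothetical stand-in, not the MAST-ML implementation; its percent_leave_out and n_repeats arguments borrow the parameter names from the LeaveOutPercent signature, and the rounding rule is an assumption.

```python
import numpy as np


def leave_out_percent_splits(n_samples, percent_leave_out=0.9, n_repeats=3, seed=0):
    """Hypothetical helper: yield (train, test) index arrays for random leave-out."""
    rng = np.random.default_rng(seed)
    # Assumed rounding rule for the size of the left-out (test) set.
    n_test = int(round(percent_leave_out * n_samples))
    for _ in range(n_repeats):
        order = rng.permutation(n_samples)
        yield order[n_test:], order[:n_test]


splits = list(leave_out_percent_splits(n_samples=50, percent_leave_out=0.9, n_repeats=3))
for train_idx, test_idx in splits:
    # With 50 samples and 90% left out: 5 training points, 45 test points.
    print(len(train_idx), len(test_idx))
```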

LeaveOutTwinCV:

Another method to help separate similar data from the training and testing set. This method makes use of a general distance metric on the provided features, and flags twins as those data points within some provided distance threshold in the feature space.
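
One way to picture the twin-flagging step is a pairwise distance check against a threshold. The sketch below uses plain NumPy on made-up data; the threshold value and the variable names are assumptions, and it only flags twins. What LeaveOutTwinCV then does with the flagged points is not shown.

```python
import numpy as np

X = np.array([
    [0.0, 0.0],
    [0.05, 0.0],   # close to row 0
    [1.0, 1.0],
    [3.0, 3.0],
    [3.0, 3.02],   # close to row 3
])
threshold = 0.1  # assumed distance cutoff in feature space
order = 2        # norm order; 2 gives Euclidean distance

# Pairwise distances between all rows, shape (n_samples, n_samples).
diff = X[:, None, :] - X[None, :, :]
dist = np.linalg.norm(diff, ord=order, axis=-1)

# A point should not count as its own twin.
np.fill_diagonal(dist, np.inf)

# A point is a twin if any other point lies within the threshold.
is_twin = (dist < threshold).any(axis=1)
print(np.flatnonzero(is_twin))  # rows 0, 1, 3 and 4
```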

LeaveOutClusterCV:

Method to use a clustering algorithm to pre-cluster data into groups. Each of these groups is then used in turn as the left-out data set. In effect, this is a leave-out-group test in which the groups are obtained automatically from a clustering algorithm.
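
The cluster-then-leave-out idea can be approximated with scikit-learn building blocks. This is a sketch under the assumption that KMeans is the clustering algorithm and the data are three synthetic, well-separated blobs; it is not the LeaveOutClusterCV code itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs of 10 points each.
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

# Step 1: pre-cluster the data; the cluster labels act as group labels.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: leave each cluster out in turn as the test set.
splits = list(LeaveOneGroupOut().split(X, groups=labels))
for train_idx, test_idx in splits:
    print("held-out cluster:", np.unique(labels[test_idx]), "test size:", len(test_idx))
```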

LeaveMultiGroupOut:

Class to train the model on multiple groups at a time and test it on the rest of the data.

Bootstrap:

Method to perform bootstrap resampling, i.e. random leave-out with replacement.
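
The Bootstrap class summary later in this page describes the procedure as a fresh random split followed by drawing samples with replacement on each side of the split. A minimal NumPy sketch of that description follows; bootstrap_splits is a hypothetical function, and treating train_size as a fraction is an assumption.

```python
import numpy as np


def bootstrap_splits(n, n_bootstraps=3, train_size=0.5, seed=0):
    """Hypothetical helper: yield bootstrap (train, test) index arrays."""
    rng = np.random.default_rng(seed)
    n_train = int(train_size * n)  # assumes train_size is a fraction of n
    n_test = n - n_train
    for _ in range(n_bootstraps):
        # New random split of the indices for every bootstrap iteration.
        perm = rng.permutation(n)
        train_pool, test_pool = perm[:n_train], perm[n_train:]
        # Resample with replacement within each side of the split.
        train = rng.choice(train_pool, size=n_train, replace=True)
        test = rng.choice(test_pool, size=n_test, replace=True)
        yield train, test


boot = list(bootstrap_splits(n=10, n_bootstraps=3, train_size=0.5))
for train_idx, test_idx in boot:
    print(sorted(train_idx.tolist()), sorted(test_idx.tolist()))
```

Because each side is resampled only from its own pool, an index can repeat within the training or test set but never appears in both.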

Functions

ceil(x, /)

Return the ceiling of x as an Integral.

check_random_state(seed)

Turn seed into a np.random.RandomState instance.

flatten_split_summary(split_summary)

make_plots(plots, y_true, y_pred, groups, ...)

Helper function to make collections of different types of plots after a single or multiple data splits are evaluated.

minkowski(u, v[, p, w])

Compute the Minkowski distance between two 1-D arrays.

parallel(func, x, *args, **kwargs)

Run some function in parallel.

pprint(object[, stream, indent, width, ...])

Pretty-print a Python object to a stream [default is sys.stdout].

Classes

BaseSplitter()

Class functioning as a base splitter with methods for organizing output and evaluating any mastml data splitter

Baseline_tests()

Methods:

Bootstrap(n[, n_bootstraps, train_size, ...])

Random sampling with replacement cross-validation iterator. Provides train/test indices to split data into train and test sets while resampling the input n_bootstraps times: each time, a new random split of the data is performed and then samples are drawn (with replacement) on each side of the split to build the training and test sets. Note: Bootstrap was taken directly from the scikit-learn GitHub repository (https://github.com/scikit-learn/scikit-learn/blob/0.11.X/sklearn/cross_validation.py), which was necessary because it was removed from more recent scikit-learn releases.

Composition(*args[, strict])

Represents a Composition, a mapping of {element/species: amount} with enhanced functionality tailored for handling chemical compositions.

Domain(check_type[, preprocessor, model, ...])

ElementFraction()

Class to calculate the atomic fraction of each element in a composition.

ErrorUtils()

Collection of functions to conduct error analysis on certain types of models (uncertainty quantification), and prepare residual and model error data for plotting, as well as recalibrate model errors with various methods

JustEachGroup(**kwargs)

Class to train the model on one group at a time and test it on the rest of the data. This class wraps scikit-learn's LeavePGroupsOut with P set to n-1.

LeaveCloseCompositionsOut(composition_df[, ...])

Leave-P-out where you exclude materials with compositions close to those in the test set

LeaveMultiGroupOut([multigroup_size])

Class to train the model on multiple groups at a time and test it on the rest of the data

LeaveOutClusterCV(cluster, **kwargs)

Class to generate train/test split using clustering.

LeaveOutPercent([percent_leave_out, n_repeats])

Class to train the model using a certain percentage of data as training data

LeaveOutTwinCV([threshold, ord, debug, ...])

Class to remove data twins from the test data.

Metrics(metrics_list[, metrics_type])

Class containing access to a wide range of metrics from scikit-learn and a number of MAST-ML custom-written metrics

NearestNeighbors(*[, n_neighbors, radius, ...])

Unsupervised learner for implementing neighbor searches.

NoPreprocessor([preprocessor, as_frame])

Class for having a "null" transform where the output is the same as the input.

NoSelect()

Class for having a "null" transform where the output is the same as the input.

NoSplit(**kwargs)

Class to just train the model on the training data and test it on that same data.

NumpyEncoder(*[, skipkeys, ensure_ascii, ...])

SklearnDataSplitter(splitter, **kwargs)

Class to wrap any scikit-learn based data splitter, e.g. KFold.

datetime(year, month, day[, hour[, minute[, ...)

The year, month and day arguments are required.

tqdm(*_, **__)

Decorate an iterable object, returning an iterator which acts exactly like the original iterable, but prints a dynamically updating progressbar every time a value is requested.

Class Inheritance Diagram

[Inheritance diagram] _MetadataRequester -> BaseCrossValidator -> BaseSplitter. The following classes each inherit directly from BaseSplitter: Bootstrap, JustEachGroup, LeaveCloseCompositionsOut, LeaveMultiGroupOut, LeaveOutClusterCV, LeaveOutPercent, LeaveOutTwinCV, NoSplit, SklearnDataSplitter.