BaseSplitter

class mastml.data_splitters.BaseSplitter[source]

Bases: BaseCrossValidator

Base splitter class providing methods for organizing output and evaluating any mastml data splitter

Args:

None

Methods:
split_asframe: method to perform split into train indices and test indices, but return as dataframes
Args:

X: (pd.DataFrame), dataframe of X features

y: (pd.Series), series of y target data

groups: (pd.Series), series of group designations

Returns:

X_splits: (list), list of dataframes for X splits

y_splits: (list), list of dataframes for y splits
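
As a rough illustration of returning splits as dataframes rather than index arrays, the sketch below wraps a scikit-learn splitter. The helper name split_asframe_sketch and the choice to store each split as a (train, test) pair are assumptions made for this example, not the mastml implementation.

```python
import pandas as pd
from sklearn.model_selection import KFold


def split_asframe_sketch(splitter, X, y, groups=None):
    """Turn index-based splits into lists of (train, test) pandas objects.

    Hypothetical helper: the (train, test) pairing is an assumption.
    """
    X_splits, y_splits = [], []
    for train_idx, test_idx in splitter.split(X, y, groups):
        # iloc selects rows by integer position, matching the splitter output
        X_splits.append((X.iloc[train_idx], X.iloc[test_idx]))
        y_splits.append((y.iloc[train_idx], y.iloc[test_idx]))
    return X_splits, y_splits


X = pd.DataFrame({"f1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                  "f2": [1, 0, 1, 0, 1, 0]})
y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name="target")

# Unshuffled 3-fold split of six rows: each test fold holds two rows
X_splits, y_splits = split_asframe_sketch(KFold(n_splits=3), X, y)
```

Keeping the pandas objects preserves the original index and column names in every split, which is convenient when the split data is later written to file.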

evaluate: main method to evaluate a sequence of models, selectors, and hyperparameter optimizers, build directories and perform analysis and output plots
Args:

X: (pd.DataFrame), dataframe of X features

y: (pd.Series), series of y target data

models: (list), list containing mastml.models instances

preprocessor: (mastml.preprocessor), mastml.preprocessor object to normalize the training data in each split.

groups: (pd.Series), series of group designations

hyperopts: (list), list containing mastml.hyperopt instances. One for each provided model is needed.

selectors: (list), list containing mastml.feature_selectors instances

metrics: (list), list of metric names to evaluate true vs. pred data in each split

plots: (list), list of names denoting which types of plots to make. Valid names are ‘Scatter’, ‘Error’, and ‘Histogram’

savepath: (str), string containing main savepath to construct splits for saving output

X_extra: (pd.DataFrame), dataframe of extra X data not used in model fitting

leaveout_inds: (list), list of arrays containing indices of data to be held out and evaluated using best model from set of train/validation splits

best_run_metric: (str), metric name to be used to decide which model performed best. Defaults to first listed metric in metrics.

nested_CV: (bool), whether to perform nested cross-validation. The nesting is done using the same splitter object as self.splitter

error_method: (str), the type of model error evaluation method to perform. Only applies to certain models. Valid names are ‘stdev_weak_learners’ and ‘jackknife_after_bootstrap’

remove_outlier_learners: (bool), whether to remove weak learners from ensemble models whose predictions are found to be outliers. Default False.

recalibrate_errors: (bool), whether to perform the predicted error bar recalibration method of Palmer et al. Default False.

verbosity: (int), the output plotting verbosity. Default is 1. Valid choices are 0, 1, 2, and 3.

baseline_test: (list), list of arrays containing baseline tests to run.

distance_metric: (str), distance metric to use in baseline_test’s test_nearest_neighbour_cdist method

domain_distance: (str), distance metric to perform domain evaluation of test data

file_extension: (str), must be either ‘.xlsx’ or ‘.csv’, determines data file type for saving

image_dpi: (int), determines output image quality

remove_split_dirs: (bool), whether to remove all the inner split directories after data and plots saved

**kwargs: (str), extra argument for domain_distance, e.g. minkowski requires the additional argument p

Returns:

None
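
The leaveout_inds argument can be pictured with the following minimal sketch. It assumes that each entry of leaveout_inds is treated independently: the held-out rows are removed, the remaining rows form the pool for the train/validation splits, and the held-out rows are scored afterwards with the best model. The helper name train_validation_pool is hypothetical, and plain lists stand in for index arrays.

```python
def train_validation_pool(n_samples, leaveout):
    """Return the row positions that remain after holding out `leaveout`."""
    held_out = set(leaveout)
    return [i for i in range(n_samples) if i not in held_out]


n_samples = 8
leaveout_inds = [[0, 1], [6, 7]]  # two independent held-out sets

# One pool of train/validation rows per held-out set
pools = [train_validation_pool(n_samples, held) for held in leaveout_inds]
```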

_evaluate_split_sets: method to evaluate a set of train/test splits. At the end of the split set, the left-out data (if any) is evaluated using the best model from the train/test splits
Args:

X_splits: (list), list of dataframes for X splits

y_splits: (list), list of dataframes for y splits

train_inds: (list), list of arrays of indices denoting the training data

test_inds: (list), list of arrays of indices denoting the testing data

model: (mastml.models instance), an estimator for fitting data

model_name: (str), class name of the model being evaluated

selector: (mastml.selector), a feature selector to select features in each split

preprocessor: (mastml.preprocessor), mastml.preprocessor object to normalize the training data in each split.

X_extra: (pd.DataFrame), dataframe of extra X data not used in model fitting

groups: (pd.Series), series of group designations

splitdir: (str), string denoting the split path in the save directory

hyperopt: (mastml.hyperopt), mastml.hyperopt instance to perform model hyperparameter optimization in each split

metrics: (list), list of metric names to evaluate true vs. pred data in each split

plots: (list), list of names denoting which types of plots to make. Valid names are ‘Scatter’, ‘Error’, and ‘Histogram’

has_model_errors: (bool), whether the model used has error bars (uncertainty quantification)

error_method: (str), the type of model error evaluation method to perform. Only applies to certain models. Valid names are ‘stdev_weak_learners’ and ‘jackknife_after_bootstrap’

remove_outlier_learners: (bool), whether to remove weak learners from ensemble models whose predictions are found to be outliers. Default False.

recalibrate_errors: (bool), whether to perform the predicted error bar recalibration method of Palmer et al. Default False.

verbosity: (int), the output plotting verbosity. Default is 1. Valid choices are 0, 1, 2, and 3.

baseline_test: (list), list of arrays containing baseline tests to run.

distance_metric: (str), distance metric to use in baseline_test’s test_nearest_neighbour_cdist method

domain_distance: (str), distance metric to perform domain evaluation of test data

file_extension: (str), must be either ‘.xlsx’ or ‘.csv’, determines data file type for saving

image_dpi: (int), determines output image quality

**kwargs: (str), extra argument for domain_distance, e.g. minkowski requires the additional argument p

Returns:

None
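
To show why a metric such as Minkowski needs an extra argument passed through **kwargs, here is a small pure-Python sketch of the distance itself. The function name minkowski_distance is made up for this example; it does not reproduce how mastml computes domain distances.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

# The same pair of points gives different distances for different p
manhattan = minkowski_distance(point_a, point_b, p=1)
euclidean = minkowski_distance(point_a, point_b, p=2)
```

Because the result depends on p, the metric name alone is not enough to define the distance, which is why the value must be supplied as an additional keyword argument.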

_evaluate_split: method to evaluate a single data split, i.e. fit model, predict test data, and perform some plots and analysis
Args:

X_train: (pd.DataFrame), dataframe of X training features

X_test: (pd.DataFrame), dataframe of X test features

y_train: (pd.Series), series of y training target data

y_test: (pd.Series), series of y test target data

model: (mastml.models instance), an estimator for fitting data

model_name: (str), class name of the model being evaluated

preprocessor: (mastml.preprocessor), mastml.preprocessor object to normalize the training data in each split.

selector: (mastml.selector), a feature selector to select features in each split

hyperopt: (mastml.hyperopt), mastml.hyperopt instance to perform model hyperparameter optimization in each split

metrics: (list), list of metric names to evaluate true vs. pred data in each split

plots: (list), list of names denoting which types of plots to make. Valid names are ‘Scatter’, ‘Error’, and ‘Histogram’

groups: (str), string denoting the test group, if applicable

splitpath: (str), string denoting the split path in the save directory

has_model_errors: (bool), whether the model used has error bars (uncertainty quantification)

X_extra_train: (pd.DataFrame), dataframe of the extra X data of the training split (not used in fit)

X_extra_test: (pd.DataFrame), dataframe of the extra X data of the testing split (not used in fit)

error_method: (str), the type of model error evaluation method to perform. Only applies to certain models. Valid names are ‘stdev_weak_learners’ and ‘jackknife_after_bootstrap’

remove_outlier_learners: (bool), whether to remove weak learners from ensemble models whose predictions are found to be outliers. Default False.

verbosity: (int), the output plotting verbosity. Default is 1. Valid choices are 0, 1, 2, and 3.

baseline_test: (list), list of arrays containing baseline tests to run.

distance_metric: (str), distance metric to use in baseline_test’s test_nearest_neighbour_cdist method

domain_distance: (str), distance metric to perform domain evaluation of test data

file_extension: (str), must be either ‘.xlsx’ or ‘.csv’, determines data file type for saving

image_dpi: (int), determines output image quality

**kwargs: (str), extra argument for domain_distance, e.g. minkowski requires the additional argument p

Returns:

None
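
The idea behind the ‘stdev_weak_learners’ error method can be sketched as follows, assuming it means taking the spread of the individual weak-learner predictions of an ensemble as the error bar for each test point. The function name and the use of the population standard deviation are assumptions made for this example, not confirmed details of mastml.

```python
import statistics


def stdev_of_weak_learners(weak_learner_preds):
    """Per-point standard deviation across weak-learner predictions.

    weak_learner_preds: one list of predictions per weak learner,
    all of the same length (one value per test point).
    """
    per_point = zip(*weak_learner_preds)  # regroup predictions by test point
    # Population standard deviation is an assumption of this sketch
    return [statistics.pstdev(values) for values in per_point]


# Three hypothetical weak learners predicting two test points
weak_learner_preds = [
    [1.0, 10.0],
    [2.0, 10.0],
    [3.0, 10.0],
]
model_errors = stdev_of_weak_learners(weak_learner_preds)
```

Points on which the weak learners agree receive a small error bar, while points on which they disagree receive a larger one.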

_setup_savedir: method to create a save directory based on model/selector/preprocessor names
Args:

model: (mastml.models instance), an estimator for fitting data

preprocessor: (mastml.preprocessor), mastml.preprocessor object to normalize the training data in each split.

selector: (mastml.selector), a feature selector to select features in each split

savepath: (str), string denoting the save path of the file

_save_split_data: method to save the X and y split data to Excel (.xlsx) or CSV (.csv) files
Args:

df: (pd.DataFrame), dataframe of X or y data to save to file

filename: (str), string denoting the filename, e.g. ‘Xtest’

savepath: (str), string denoting the save path of the file

columns: (list), list of dataframe column names, e.g. X feature names

file_extension: (str), must be either ‘.xlsx’ or ‘.csv’, determines data file type for saving

Returns:

None

_collect_data: method to collect all pd.Series (e.g. ytrain/ytest) data into single series over many splits (directories)
Args:

filename: (str), string denoting the filename, e.g. ‘ytest’

savepath: (str), string denoting the save path of the file

file_extension: (str), must be either ‘.xlsx’ or ‘.csv’, determines data file type for saving

iterdirs: (bool), whether to recursively search within directories to collect data or just look in cwd

Returns:

data: (list), list containing flattened array of all data of a given type over many splits, e.g. all ypred data
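
Collecting one type of data over many split directories might look like the sketch below. The directory names (split_0, split_1), the single-column CSV layout with a header row, and the helper name collect_series are all assumptions made for this illustration.

```python
import csv
import os
import tempfile


def collect_series(filename, savepath, file_extension=".csv", iterdirs=True):
    """Gather the first column of every matching file into one flat list."""
    target = filename + file_extension
    if iterdirs:
        # Walk all subdirectories and keep those that contain the target file
        dirs = sorted(root for root, _, files in os.walk(savepath)
                      if target in files)
    else:
        # Only look directly inside savepath
        dirs = [savepath] if os.path.exists(os.path.join(savepath, target)) else []
    data = []
    for directory in dirs:
        with open(os.path.join(directory, target), newline="") as handle:
            rows = list(csv.reader(handle))
        data.extend(float(row[0]) for row in rows[1:])  # skip the header row
    return data


# Build two hypothetical split directories and collect their ytest values
with tempfile.TemporaryDirectory() as savepath:
    for i, values in enumerate([[1.0, 2.0], [3.0]]):
        splitdir = os.path.join(savepath, f"split_{i}")
        os.makedirs(splitdir)
        with open(os.path.join(splitdir, "ytest.csv"), "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["ytest"])
            writer.writerows([[value] for value in values])
    collected = collect_series("ytest", savepath)
    top_level_only = collect_series("ytest", savepath, iterdirs=False)
```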

_collect_df_data: method to collect all pd.DataFrame (e.g. Xtrain/Xtest) data into single dataframe over many splits (directories)
Args:

filename: (str), string denoting the filename, e.g. ‘Xtest’

savepath: (str), string denoting the save path of the file

file_extension: (str), must be either ‘.xlsx’ or ‘.csv’, determines data file type for saving

Returns:

data: (list), list containing flattened array of all data of a given type over many splits, e.g. all Xtest data

_get_best_split: method to find the best performing model in a set of train/test splits
Args:

savepath: (str), string denoting the save path of the file

preprocessor: (mastml.preprocessor), mastml.preprocessor object to normalize the training data in each split.

best_run_metric: (str), name of the metric to use to find the best performing model

model_name: (str), class name of model being evaluated

file_extension: (str), must be either ‘.xlsx’ or ‘.csv’, determines data file type for saving

Returns:

best_split_dict: (dict), dictionary containing the path locations of the best model and corresponding preprocessor and selected feature list
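
Selecting the best split from a set of per-split scores can be sketched as below. The split names, the score values, and the lower_is_better flag are assumptions for this example. Whether a smaller or larger value is better depends on the chosen best_run_metric (for instance, an error metric versus a correlation metric).

```python
def pick_best_split(split_scores, lower_is_better=True):
    """Return the name of the split with the best score."""
    if not split_scores:
        raise ValueError("no split scores provided")
    chooser = min if lower_is_better else max
    return chooser(split_scores, key=split_scores.get)


# Hypothetical error-type scores for three splits (smaller is better)
error_scores = {"split_0": 0.42, "split_1": 0.31, "split_2": 0.55}
best_by_error = pick_best_split(error_scores)

# Hypothetical correlation-type scores (larger is better)
r2_scores = {"split_0": 0.90, "split_1": 0.95}
best_by_r2 = pick_best_split(r2_scores, lower_is_better=False)
```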

_get_average_recalibration_params: method to get the average and standard deviation of the recalibration factors in all train/test CV sets
Args:

savepath: (str), string denoting the save path of the file

data_type: (str), string denoting the type of data to examine (e.g. test or leftout)

file_extension: (str), must be either ‘.xlsx’ or ‘.csv’, determines data file type for saving

Returns:

recalibrate_avg_dict: (dict), dictionary of average recalibration parameters

recalibrate_stdev_dict: (dict), dictionary of stdev of recalibration parameters
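
Averaging recalibration factors over several train/test sets could be sketched as follows. The parameter names "a" and "b", the example values, and the use of the sample standard deviation are assumptions for illustration. The actual keys and statistics used by mastml may differ.

```python
import statistics


def average_recalibration_params(params_per_split):
    """Return (mean, stdev) dictionaries over a list of parameter dicts."""
    keys = params_per_split[0].keys()
    avg = {k: statistics.mean(p[k] for p in params_per_split) for k in keys}
    # Sample standard deviation; requires at least two splits
    std = {k: statistics.stdev(p[k] for p in params_per_split) for k in keys}
    return avg, std


# Hypothetical recalibration factors from three splits
params_per_split = [
    {"a": 1.0, "b": 0.1},
    {"a": 1.2, "b": 0.3},
    {"a": 1.4, "b": 0.2},
]
recalibrate_avg_dict, recalibrate_stdev_dict = average_recalibration_params(
    params_per_split
)
```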

_get_recalibration_params: method to get the recalibration factors for a single evaluation
Args:

savepath: (str), string denoting the save path of the file

data_type: (str), string denoting the type of data to examine (e.g. test or leftout)

file_extension: (str), must be either ‘.xlsx’ or ‘.csv’, determines data file type for saving

Returns:

recalibrate_dict: (dict), dictionary of recalibration parameters

help: method to output key information on class use, e.g. methods and parameters
Args:

None

Returns:

None, but outputs help to screen

Methods Summary

evaluate(X, y, models[, mastml, ...])

help()

split_asframe(X, y[, groups, X_force_train, ...])

Methods Documentation

evaluate(X, y, models, mastml=None, preprocessor=None, groups=None, hyperopts=None, selectors=None, metrics=None, plots=None, savepath=None, X_extra=None, X_force_train=None, y_force_train=None, leaveout_inds=[], best_run_metric=None, nested_CV=False, error_method='stdev_weak_learners', remove_outlier_learners=False, recalibrate_errors=False, verbosity=1, baseline_test=None, distance_metric='euclidean', domain_distance=None, file_extension='.csv', image_dpi=250, parallel_run=False, remove_split_dirs=False, **kwargs)[source]
help()[source]
split_asframe(X, y, groups=None, X_force_train=None, y_force_train=None)[source]