ErrorUtils

class mastml.error_analysis.ErrorUtils

Bases: object

Collection of functions for conducting error analysis on models that support uncertainty quantification. The functions prepare residual and model error data for plotting and recalibrate model errors with various methods.

Args:

None

Methods:

_collect_error_data: method to collect all residuals, model errors, and dataset standard deviation over many data splits

Args:

savepath: (str), string denoting the path to save output to

data_type: (str), string denoting the data type analyzed, e.g. train, test, leftout

Returns:

model_errors: (pd.Series), series containing the predicted model errors

residuals: (pd.Series), series containing the true model errors (residuals)

dataset_stdev: (float), standard deviation of the data set

_recalibrate_errors: method to recalibrate the model errors using the negative log-likelihood function from the work of Palmer et al.

Args:

model_errors: (pd.Series), series containing the predicted (uncalibrated) model errors

residuals: (pd.Series), series containing the true model errors (residuals)

Returns:

model_errors: (pd.Series), series containing the predicted (calibrated) model errors

a: (float), the slope of the recalibration linear fit

b: (float), the intercept of the recalibration linear fit
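The recalibration described above can be illustrated with a minimal sketch. It assumes a zero-mean Gaussian negative log-likelihood and a linear rescaling `a * model_errors + b`, and it uses a coarse grid search in place of a proper optimizer. The names `gaussian_nll` and `recalibrate_sketch`, the grids, and the synthetic data are hypothetical and are not part of MAST-ML; NumPy arrays stand in for the pd.Series inputs.

```python
import numpy as np

def gaussian_nll(a, b, model_errors, residuals):
    """Zero-mean Gaussian negative log-likelihood of the residuals,
    using a * model_errors + b as the per-point standard deviation."""
    scaled = a * model_errors + b
    if np.any(scaled <= 0):
        return np.inf  # a standard deviation must be positive
    return 0.5 * np.sum(np.log(2 * np.pi * scaled**2) + (residuals / scaled) ** 2)

def recalibrate_sketch(model_errors, residuals, a_grid, b_grid):
    """Pick the (a, b) pair on the grid with the lowest NLL (assumed approach)."""
    candidates = [
        (gaussian_nll(a, b, model_errors, residuals), a, b)
        for a in a_grid
        for b in b_grid
    ]
    _, a, b = min(candidates, key=lambda c: c[0])
    return a * model_errors + b, float(a), float(b)

# Synthetic data: the predicted errors underestimate the true spread by a factor of 2.
rng = np.random.default_rng(0)
true_sigma = rng.uniform(0.5, 2.0, size=5000)
residuals = rng.normal(0.0, true_sigma)
model_errors = true_sigma / 2.0

calibrated, a, b = recalibrate_sketch(
    model_errors,
    residuals,
    a_grid=np.linspace(0.5, 3.5, 31),
    b_grid=np.linspace(-0.2, 0.2, 9),
)
```

In practice a numerical minimizer would likely replace the grid search; the grid keeps the sketch dependency-free and makes the objective easy to inspect.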

_parse_error_data: method to prepare the provided residuals and model errors for plotting the binned RvE (residual vs. error) plots

Args:

model_errors: (pd.Series), series containing the predicted model errors

residuals: (pd.Series), series containing the true model errors (residuals)

dataset_stdev: (float), standard deviation of the data set

number_of_bins: (int), the number of bins to digitize the data into for making the RvE (residual vs. error) plot

Returns:

bin_values: (np.array), the x-axis of the RvE plot: reduced model error values digitized into bins

rms_residual_values: (np.array), the y-axis of the RvE plot: the RMS of the residual values digitized into bins

num_values_per_bin: (np.array), the number of data samples in each bin

number_of_bins: (int), the number of bins to put the model error and residual data into.
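The binning step above can be sketched as follows. This is an illustration under stated assumptions, not the MAST-ML implementation: errors and residuals are divided by the dataset standard deviation, bins are equal-width between 0 and the largest reduced error, bin centers serve as the x-values, and empty bins are dropped. The function name `bin_residuals_vs_errors` is hypothetical.

```python
import numpy as np

def bin_residuals_vs_errors(model_errors, residuals, dataset_stdev, number_of_bins=15):
    """Digitize reduced model errors into bins and compute the RMS residual per bin."""
    errors = np.asarray(model_errors, dtype=float) / dataset_stdev
    resid = np.asarray(residuals, dtype=float) / dataset_stdev

    # Equal-width bin edges from 0 to the largest reduced model error (assumption).
    edges = np.linspace(0.0, errors.max(), number_of_bins + 1)
    # Using only the interior edges maps every point to an index 0..number_of_bins-1.
    idx = np.digitize(errors, edges[1:-1])

    counts = np.bincount(idx, minlength=number_of_bins)
    sum_sq = np.bincount(idx, weights=resid**2, minlength=number_of_bins)

    occupied = counts > 0  # skip empty bins to avoid dividing by zero
    bin_values = 0.5 * (edges[:-1] + edges[1:])[occupied]
    rms_residual_values = np.sqrt(sum_sq[occupied] / counts[occupied])
    return bin_values, rms_residual_values, counts[occupied], number_of_bins

bin_values, rms_residual_values, num_values_per_bin, n_bins = bin_residuals_vs_errors(
    model_errors=[0.1, 0.2, 0.9, 1.0],
    residuals=[0.3, -0.4, 1.0, -1.0],
    dataset_stdev=1.0,
    number_of_bins=2,
)
```

For a well-calibrated model, the RMS residual in each bin should track the bin's model error, so the points of the resulting plot fall near the identity line.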

_get_model_errors: method for generating the model error values using either the standard deviation of the weak learner predictions or the jackknife-after-bootstrap method of Wager et al.

Args:

model: (mastml.models object), a MAST-ML model, e.g. SklearnModel or EnsembleModel

X: (pd.DataFrame), dataframe of the X feature matrix

X_train: (pd.DataFrame), dataframe of the X training data feature matrix

X_test: (pd.DataFrame), dataframe of the X test data feature matrix

error_method: (str), string denoting the UQ error method to use. Viable options are ‘stdev_weak_learners’ and ‘jackknife_after_bootstrap’

remove_outlier_learners: (bool), whether to remove weak learners whose predictions deviate by more than 3 sigma from the average prediction for a given data point (Default False)

Returns:

model_errors: (pd.Series), series containing the predicted model errors

num_removed_learners: (list), list of number of removed weak learners for each data point
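The 'stdev_weak_learners' option can be pictured with the sketch below, which takes the standard deviation across the individual learners' predictions for each data point. It is an assumption-laden illustration rather than the MAST-ML code: the function `stdev_of_weak_learners` and the stand-in class `OffsetLearner` are hypothetical, and the population standard deviation (ddof=0) is an assumed choice. Any objects with a `predict` method would work, for example the `estimators_` list of a fitted scikit-learn RandomForestRegressor.

```python
import numpy as np

def stdev_of_weak_learners(learners, X):
    """Per-sample standard deviation of the predictions of an ensemble's learners."""
    # Rows are learners, columns are data points.
    preds = np.array([learner.predict(X) for learner in learners])
    return preds.std(axis=0)  # ddof=0 is an assumption

class OffsetLearner:
    """Hypothetical stand-in for a fitted weak learner."""

    def __init__(self, offset):
        self.offset = offset

    def predict(self, X):
        # Toy prediction: row sum of the features plus a fixed offset.
        return np.asarray(X, dtype=float).sum(axis=1) + self.offset

learners = [OffsetLearner(-1.0), OffsetLearner(0.0), OffsetLearner(1.0)]
X_demo = [[1.0, 2.0], [3.0, 4.0]]
model_errors_demo = stdev_of_weak_learners(learners, X_demo)
```

The jackknife-after-bootstrap option requires bookkeeping of which training points each learner saw, so it is not sketched here.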

_remove_outlier_preds: method to flag and remove outlier weak learner predictions

Args:

preds: (list), list of predicted values of a given data point from an ensemble of weak learners

Returns:

preds_cleaned: (list), amended list of predicted values of a given data point from an ensemble of weak learners, with predictions from outlier learners removed

num_outliers: (int), the number of removed weak learners for the data point evaluated
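A minimal sketch of this kind of outlier filter is shown below. It assumes the 3 sigma rule is applied to the mean and population standard deviation of the predictions for a single data point; the function name `remove_outlier_preds_sketch` and the `n_sigma` parameter are hypothetical and not taken from MAST-ML.

```python
import numpy as np

def remove_outlier_preds_sketch(preds, n_sigma=3.0):
    """Drop predictions more than n_sigma standard deviations from the mean."""
    preds = np.asarray(preds, dtype=float)
    mean, std = preds.mean(), preds.std()
    keep = np.abs(preds - mean) <= n_sigma * std
    preds_cleaned = preds[keep].tolist()
    num_outliers = int((~keep).sum())
    return preds_cleaned, num_outliers

# Twenty learners agree on 1.0 and one learner predicts 100.0.
preds_cleaned, num_outliers = remove_outlier_preds_sketch([1.0] * 20 + [100.0])
```

One consequence of this formulation is that it can only flag a prediction when the ensemble is large enough: with n predictions, no value can lie more than sqrt(n - 1) population standard deviations from the mean, so ensembles of 10 or fewer learners never exceed the 3 sigma threshold.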