Bootstrap

class mastml.data_splitters.Bootstrap(n, n_bootstraps=3, train_size=0.5, test_size=None, n_train=None, n_test=None, random_state=0, **kwargs)[source]

Bases: BaseSplitter

Random sampling with replacement cross-validation iterator.

Provides train/test indices to split data into train and test sets while resampling the input n_bootstraps times: each time, a new random split of the data is performed, and then samples are drawn (with replacement) on each side of the split to build the training and test sets.

Note: Bootstrap was taken directly from the scikit-learn GitHub repository (https://github.com/scikit-learn/scikit-learn/blob/0.11.X/sklearn/cross_validation.py), which was necessary because it was removed from later scikit-learn releases.

Note: contrary to other cross-validation strategies, bootstrapping allows some samples to occur several times in each split. However, a sample that occurs in the train split will never occur in the test split, and vice versa. If you want each sample to occur at most once, you should probably use ShuffleSplit cross-validation instead.
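The procedure described above (permute, split, then resample each side with replacement) can be sketched with the standard library alone. This is an illustrative stand-in, not the MAST-ML or scikit-learn source: the function name bootstrap_indices, its seed argument, and the use of Python's random module are assumptions made for the example.

```python
import random


def bootstrap_indices(n, n_train, n_test, n_bootstraps=3, seed=0):
    """Yield (train, test) index lists; hypothetical helper, not the library code."""
    if n_train + n_test > n:
        raise ValueError("n_train + n_test must not exceed n")
    rng = random.Random(seed)
    for _ in range(n_bootstraps):
        # Step 1: a fresh random split of all indices into two disjoint pools.
        permutation = list(range(n))
        rng.shuffle(permutation)
        train_pool = permutation[:n_train]
        test_pool = permutation[n_train:n_train + n_test]
        # Step 2: draw with replacement within each pool, so an index may
        # repeat inside a split but never crosses from train to test.
        train = rng.choices(train_pool, k=n_train)
        test = rng.choices(test_pool, k=n_test)
        yield train, test


for train, test in bootstrap_indices(n=10, n_train=5, n_test=5):
    print("train:", train, "test:", test)
```

Because resampling happens after the split, the train and test index lists stay disjoint even though each may contain duplicates.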

Args:

n: (int), total number of elements in the dataset

n_bootstraps: (int), (default is 3) Number of bootstrapping iterations

train_size: (int or float), (default is 0.5) If int, the number of samples to include in the training split (should be smaller than the total number of samples in the dataset). If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the training split.

test_size: (int or float or None), (default is None) If int, the number of samples to include in the test split (should be smaller than the total number of samples in the dataset). If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If None, n_test is set as the complement of n_train.

random_state: (int or RandomState), (default is 0) Pseudo-random number generator state used for random sampling.
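As a rough illustration of how train_size and test_size could translate into the sample counts n_train and n_test, consider the sketch below. The helper name resolve_sizes and the choice to round fractional counts down are assumptions for this example; the actual implementation may round differently.

```python
def resolve_sizes(n, train_size=0.5, test_size=None):
    """Return (n_train, n_test); hypothetical helper, rounding down is assumed."""

    def to_count(size):
        # A float is read as a proportion of n, an int as an absolute count.
        if isinstance(size, float):
            if not 0.0 <= size <= 1.0:
                raise ValueError("float sizes must lie between 0.0 and 1.0")
            return int(size * n)
        return int(size)

    n_train = to_count(train_size)
    # With test_size=None, the test split is the complement of the train split.
    n_test = n - n_train if test_size is None else to_count(test_size)
    if n_train + n_test > n:
        raise ValueError("train and test sizes together exceed n")
    return n_train, n_test


print(resolve_sizes(10))             # defaults: half train, complement as test
print(resolve_sizes(10, 0.5, 0.25))  # both given as proportions
print(resolve_sizes(10, 6, 3))       # both given as absolute counts
```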

Attributes:

parallel_run: an attribute defining whether to run splits with all available computer cores

Attributes Summary

indices

Methods Summary

get_n_splits([X, y, groups])

Returns the number of splitting iterations in the cross-validator.

split(X[, y, groups])

Generate indices to split data into training and test set.

Attributes Documentation

indices = True

Methods Documentation

get_n_splits(X=None, y=None, groups=None)[source]

Returns the number of splitting iterations in the cross-validator.

split(X, y=None, groups=None)[source]

Generate indices to split data into training and test set.

Parameters

X: array-like of shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features.

y: array-like of shape (n_samples,)

The target variable for supervised learning problems.

groups: array-like of shape (n_samples,), default=None

Group labels for the samples used while splitting the dataset into train/test set.

Yields

train: ndarray

The training set indices for that split.

test: ndarray

The testing set indices for that split.
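To show how yielded index arrays might be consumed, the sketch below applies one train/test index pair to a small dataset. The data, the take helper, and the index values are all made up for illustration; a real bootstrap split would produce its own indices.

```python
# Toy data: four samples with two features each, plus one target per sample.
X = [[0.1, 1.0], [0.2, 2.0], [0.3, 3.0], [0.4, 4.0]]
y = [10, 20, 30, 40]


def take(rows, indices):
    """Select rows by position; repeated indices yield repeated rows."""
    return [rows[i] for i in indices]


# Hypothetical index pair of the kind a bootstrap split could yield:
# duplicates within a split, no overlap between splits.
train_idx, test_idx = [0, 2, 2], [3, 1, 3]

X_train, y_train = take(X, train_idx), take(y, train_idx)
X_test, y_test = take(X, test_idx), take(y, test_idx)

print(y_train)  # [10, 30, 30]
print(y_test)   # [40, 20, 40]
```

Since indices can repeat, the number of distinct samples in a split may be smaller than the split's length, which is worth keeping in mind when interpreting scores.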