LeaveCloseCompositionsOut
- class mastml.data_splitters.LeaveCloseCompositionsOut(composition_df, dist_threshold=0.1, nn_kwargs=None, **kwargs)
Bases: BaseSplitter
Leave-P-out splitter that excludes from the training set any materials whose compositions are close to those in the test set.
Computes the distance between the element fraction vectors. For example, the \(L_2\) distance between Al and Cu is \(\sqrt{2}\) and the \(L_1\) distance between Al and Al0.9Cu0.1 is 0.2.
Consequently, this splitter requires a list of compositions as the input to split rather than the features.
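The two distances quoted above can be checked by hand. The following is a minimal sketch using only the Python standard library; the helpers `element_fractions` and `lp_distance` are hypothetical names introduced for illustration and are not part of mastml.

```python
import math

def element_fractions(composition, elements):
    """Return the element-fraction vector for a composition given as {element: amount}.

    Hypothetical helper for illustration; not a mastml function.
    """
    total = sum(composition.values())
    return [composition.get(el, 0.0) / total for el in elements]

def lp_distance(u, v, p):
    """L_p distance between two equal-length vectors (hypothetical helper)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

# Fix an element ordering so every composition maps to the same vector space.
elements = ["Al", "Cu"]
al = element_fractions({"Al": 1.0}, elements)                # [1.0, 0.0]
cu = element_fractions({"Cu": 1.0}, elements)                # [0.0, 1.0]
al_cu = element_fractions({"Al": 0.9, "Cu": 0.1}, elements)  # [0.9, 0.1]

l2_al_vs_cu = lp_distance(al, cu, 2)       # approximately sqrt(2)
l1_al_vs_alcu = lp_distance(al, al_cu, 1)  # approximately 0.2
```

Both values agree with the figures in the description up to floating-point rounding.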
- Attributes:
parallel_run: an attribute defining whether to run splits with all available computer cores
- Args:
composition_df (pd.DataFrame): dataframe containing the vector of material compositions to analyze
dist_threshold (float): Entries must be farther than this distance to be included in the training set
nn_kwargs (dict): Keyword arguments for the scikit-learn NearestNeighbors class used to find nearest points
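To illustrate how `dist_threshold` acts on a split, here is a brute-force sketch in plain Python. It is not the mastml implementation, which according to the argument list above delegates the neighbor search to scikit-learn; the function name `filter_training_indices`, the choice of the \(L_2\) metric, and the sample data are all assumptions made for this example.

```python
def lp_distance(u, v, p=2):
    """L_p distance between two equal-length vectors (hypothetical helper)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def filter_training_indices(vectors, test_indices, dist_threshold=0.1, p=2):
    """Keep only non-test entries farther than dist_threshold from every test entry.

    Illustrative brute-force version; assumes test_indices is non-empty.
    """
    test_set = set(test_indices)
    train = []
    for i, vec in enumerate(vectors):
        if i in test_set:
            continue
        # Distance from this candidate to its nearest test composition.
        nearest = min(lp_distance(vec, vectors[j], p) for j in test_set)
        if nearest > dist_threshold:
            train.append(i)
    return train

# Element-fraction vectors over (Al, Cu): Al, Al0.95Cu0.05, Al0.5Cu0.5, Cu
vectors = [[1.0, 0.0], [0.95, 0.05], [0.5, 0.5], [0.0, 1.0]]
train = filter_training_indices(vectors, test_indices=[0], dist_threshold=0.1)
```

With pure Al as the only test entry, Al0.95Cu0.05 lies about 0.07 away in \(L_2\) and is dropped, while Al0.5Cu0.5 and Cu remain available for training.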
Methods Summary
get_n_splits([X, y, groups]): Returns the number of splitting iterations in the cross-validator.
split(X[, y, groups]): Generate indices to split data into training and test set.
Methods Documentation
- get_n_splits(X=None, y=None, groups=None)
Returns the number of splitting iterations in the cross-validator.
- split(X, y=None, groups=None)
Generate indices to split data into training and test set.
Parameters
- X: array-like of shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
- y: array-like of shape (n_samples,)
The target variable for supervised learning problems.
- groups: array-like of shape (n_samples,), default=None
Group labels for the samples used while splitting the dataset into train/test set.
Yields
- train: ndarray
The training set indices for that split.
- test: ndarray
The testing set indices for that split.