LocalDatasets

class mastml.datasets.LocalDatasets(file_path, feature_names=None, target=None, extra_columns=None, group_column=None, testdata_columns=None, average_duplicates=False, average_duplicates_col=None, as_frame=False)

Bases: object

Class to handle import and organization of a dataset stored locally.

Args:

file_path: (str), path to the data file to import

feature_names: (list), list of strings containing the X feature names

target: (str), string denoting the y data (target) name

extra_columns: (list), list of strings containing additional column names that are not features or target

group_column: (str), string denoting the name of an input column to be used to group data

testdata_columns: (list), list of strings containing column names denoting sets of left-out data. Entries should be marked with a 0 (not left out) or 1 (left out)

average_duplicates: (bool), whether to average duplicate entries from the imported data.

average_duplicates_col: (str), string denoting column name to perform averaging of duplicate entries. Needs to be specified if average_duplicates is True.

as_frame: (bool), whether to return data as pandas dataframe (otherwise will be numpy array)
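The 0/1 convention described for testdata_columns can be illustrated with a small sketch. It uses only the Python standard library and is not MAST-ML's own implementation; the column names `feat_a`, `y`, and `holdout_set` are hypothetical.

```python
import csv
import io

# Hypothetical data file contents: one feature, one target, and one
# left-out-data column marked 0 (not left out) or 1 (left out).
raw = io.StringIO(
    "feat_a,y,holdout_set\n"
    "1.0,10.0,0\n"
    "2.0,20.0,1\n"
    "3.0,30.0,0\n"
)

rows = list(csv.DictReader(raw))

# Rows marked 1 are treated as left-out data; rows marked 0 are kept.
left_out = [r for r in rows if int(r["holdout_set"]) == 1]
kept = [r for r in rows if int(r["holdout_set"]) == 0]

print(len(kept), len(left_out))
```

In this sketch, `"holdout_set"` is the kind of name that would be passed in the testdata_columns list.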

Methods:

_import: Imports the data. The file should be in either .csv or .xlsx format.

Args:

None

Returns:

df: (pd.DataFrame), pandas dataframe of full dataset

_get_features: Method to determine which columns belong to the target and feature_names

Args:

df: (pd.DataFrame), pandas dataframe of full dataset

Returns:

None

load_data: Method to import the data and ascertain which columns are features, target and extra based on provided input.

Args:

copy: (bool), whether or not to copy the imported data to the designated savepath

savepath: (str), path to save the data to (used if copy=True)

Returns:

data_dict: (dict), dictionary containing dataframes of X, y, groups, X_extra, X_testdata
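A minimal usage sketch follows, based only on the constructor signature and the load_data description given here. The helper `load_local_dataset`, the file name `my_data.csv`, and the column names are hypothetical, and the dictionary keys `"X"` and `"y"` are assumed to match the names listed under Returns.

```python
def load_local_dataset(file_path, feature_names, target, extra_columns=None):
    """Load a local file with LocalDatasets and return the data dictionary."""
    # Imported lazily so this sketch can be defined without MAST-ML installed.
    from mastml.datasets import LocalDatasets

    dataset = LocalDatasets(
        file_path=file_path,
        feature_names=feature_names,
        target=target,
        extra_columns=extra_columns,
        as_frame=True,
    )
    # Per the description above, the returned dict holds X, y, groups,
    # X_extra, and X_testdata.
    return dataset.load_data()


# Hypothetical file and column names, for illustration only.
example_args = {
    "file_path": "my_data.csv",
    "feature_names": ["feat_a", "feat_b"],
    "target": "y",
}

# Uncomment once MAST-ML is installed and the file exists:
# data_dict = load_local_dataset(**example_args)
# X, y = data_dict["X"], data_dict["y"]  # key names assumed
```

Here load_data is called with its defaults (copy=False, savepath=None), so the imported data is not copied to a save path.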

Methods Summary

load_data([copy, savepath])

Methods Documentation

load_data(copy=False, savepath=None)