Datasets

lazygrid.datasets

lazygrid.datasets.fetch_datasets(output_dir: str = './data', update_data: bool = False, min_classes: int = 0, task: str = 'classification', max_samples: int = inf, max_features: int = inf) → pandas.core.frame.DataFrame

Fetch metadata for the OpenML data sets that satisfy the given requirements.

Parameters:
  • output_dir – Directory where the .csv file will be stored
  • update_data – If True, delete the cached data sets and download their latest versions; otherwise load the data sets recorded in the cache
  • min_classes – Minimum number of classes required for each data set
  • task – Type of task, either classification or regression
  • max_samples – Maximum number of samples allowed in each data set
  • max_features – Maximum number of features allowed in each data set
Returns:

Table indexed by data set name, holding the information required to load the latest version of each data set (version, did, n_samples, n_features, n_classes)

Return type:

pandas.DataFrame

Examples

>>> import lazygrid as lg
>>>
>>> datasets = lg.datasets.fetch_datasets(task="classification", min_classes=2, max_samples=1000, max_features=10)
>>> datasets.loc["iris"]
version          45
did           42098
n_samples       150
n_features        4
n_classes         3
Name: iris, dtype: int64
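
The selection criteria above can be illustrated without any network access. The following is a minimal sketch in plain pandas, not lazygrid's implementation: `filter_catalog` is a hypothetical helper, the bounds are assumed to be inclusive, and every row of the table except `iris` is made up for illustration.

```python
import pandas as pd

# Hypothetical metadata table indexed by data set name. Only the "iris" row
# mirrors the example output above; the other two rows are made up.
catalog = pd.DataFrame(
    {
        "version": [45, 1, 1],
        "did": [42098, 900001, 900002],
        "n_samples": [150, 5000, 300],
        "n_features": [4, 8, 64],
        "n_classes": [3, 2, 1],
    },
    index=["iris", "toy_large", "toy_wide"],
)


def filter_catalog(frame, min_classes=0, max_samples=float("inf"), max_features=float("inf")):
    """Keep the rows that satisfy all three bounds (assumed inclusive)."""
    mask = (
        (frame["n_classes"] >= min_classes)
        & (frame["n_samples"] <= max_samples)
        & (frame["n_features"] <= max_features)
    )
    return frame[mask]


selected = filter_catalog(catalog, min_classes=2, max_samples=1000, max_features=10)
print(selected.index.tolist())
```

With these illustrative rows only `iris` is kept: `toy_large` exceeds `max_samples` and `toy_wide` has fewer than `min_classes` classes.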

lazygrid.datasets.load_npy_dataset(path_x: str, path_y: str) → (numpy.ndarray, numpy.ndarray, int)

Load npy data set.

Parameters:
  • path_x – Path to data matrix
  • path_y – Path to data labels
Returns:

Data matrix, data labels, and number of classes

Return type:

Tuple

Examples

>>> import os
>>> from sklearn.datasets import make_classification
>>> import numpy as np
>>> import lazygrid as lg
>>>
>>> x, y = make_classification(random_state=42)
>>>
>>> path_x, path_y = "x.npy", "y.npy"
>>> np.save(path_x, x)
>>> np.save(path_y, y)
>>>
>>> x, y, n_classes = lg.datasets.load_npy_dataset(path_x, path_y)
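
To make the returned triple concrete, here is a minimal sketch of a function with the same shape of result, written with NumPy only. `load_npy_pair` is a hypothetical stand-in, and deriving the number of classes from the distinct label values is an assumption rather than lazygrid's documented behaviour.

```python
import os
import tempfile

import numpy as np


def load_npy_pair(path_x, path_y):
    """Hypothetical stand-in: load both arrays and count the distinct labels."""
    x = np.load(path_x)
    y = np.load(path_y)
    n_classes = len(np.unique(y))  # assumption: classes = distinct label values
    return x, y, n_classes


# Write a tiny data set to a temporary directory, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path_x = os.path.join(tmp_dir, "x.npy")
    path_y = os.path.join(tmp_dir, "y.npy")
    np.save(path_x, np.arange(12, dtype=float).reshape(6, 2))
    np.save(path_y, np.array([0, 1, 2, 0, 1, 2]))
    x, y, n_classes = load_npy_pair(path_x, path_y)

print(x.shape, y.shape, n_classes)
```

The data matrix has one row per sample and the label array has one entry per row, so the two files must agree on the number of samples.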

lazygrid.datasets.load_openml_dataset(data_id: int = None, dataset_name: str = None) → (numpy.ndarray, numpy.ndarray, int)

Load OpenML data set.

Parameters:
  • data_id – Data set identifier
  • dataset_name – Data set name
Returns:

Data matrix, data labels, and number of classes

Return type:

Tuple

Examples

>>> import lazygrid as lg
>>>
>>> x, y, n_classes = lg.datasets.load_openml_dataset(dataset_name="iris")
>>> n_classes
3