timesead.data.exathlon_dataset

Attributes

`TRAIN_FILES`
`TEST_FILES`
`TRAIN_LENGTHS`
`TEST_LENGTHS`

Classes

ExathlonDataset

Implements the Exathlon dataset from [Jacob2021].

Module Contents

timesead.data.exathlon_dataset.TRAIN_FILES = ('1_0_1000000_14.csv', '1_0_100000_15.csv', '1_0_100000_16.csv', '1_0_10000_17.csv',...

timesead.data.exathlon_dataset.TEST_FILES = ('1_2_100000_68.csv', '1_4_1000000_80.csv', '1_5_1000000_86.csv', '2_1_100000_60.csv',...

timesead.data.exathlon_dataset.TRAIN_LENGTHS

timesead.data.exathlon_dataset.TEST_LENGTHS
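
The file names in TRAIN_FILES and TEST_FILES consist of underscore-separated fields. The following is a minimal sketch for grouping them, assuming the first field is the application id that app_id selects; this reference does not state that. The helper group_files_by_app is hypothetical and not part of timesead, and the example uses only the file names visible in the truncated listing above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_files_by_app(file_names: Iterable[str]) -> Dict[int, List[str]]:
    """Group trace file names by their leading underscore-separated field.

    Assumption (not confirmed by this reference): the leading field is the
    application id that ``app_id`` refers to.
    """
    groups: Dict[int, List[str]] = defaultdict(list)
    for name in file_names:
        # Split off everything before the first underscore and parse it.
        app_id = int(name.split('_', 1)[0])
        groups[app_id].append(name)
    return dict(groups)


# Only the entries that are visible in the truncated TEST_FILES listing.
visible_test_files = (
    '1_2_100000_68.csv',
    '1_4_1000000_80.csv',
    '1_5_1000000_86.csv',
    '2_1_100000_60.csv',
)
files_by_app = group_files_by_app(visible_test_files)
print(files_by_app)
```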

class timesead.data.exathlon_dataset.ExathlonDataset(dataset_path: str = os.path.join(DATA_DIRECTORY, 'exathlon'), app_id: int = 1, training: bool = True, standardize: bool | Callable[[pandas.DataFrame, Dict], pandas.DataFrame] = True, download: bool = True, preprocess: bool = True)

Bases: timesead.data.dataset.BaseTSDataset

Implements the Exathlon dataset from [Jacob2021]. The data was collected by running different applications on a Spark cluster and recording metrics from the Spark service and the worker nodes. We consider the trace for each app a separate dataset. You can control which app trace to load by setting the app_id parameter.

Note

The Exathlon dataset consists of more than 2000 raw features that we reduce to 19 aggregated features as described in [Jacob2021]. This is done in the preprocess step during the class initialization.

Note

Automatically downloading the dataset via the download option requires git to be installed on your system and has so far been tested only on Linux.

Warning

This dataset requires preprocessed data. Set the preprocess argument to run the preprocessing step. If the data has not been preprocessed, the class raises a RuntimeError.

[Jacob2021]

V. Jacob, F. Song, A. Stiegler, B. Rad, Y. Diao, and N. Tatbul. Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series. Proceedings of the VLDB Endowment (PVLDB), 14(11): 2613-2626, 2021.

Parameters:

dataset_path (str) – Folder from which to load the dataset.
app_id (int) – ID of the app whose data should be loaded. Must be one of 1, 2, 3, 4, 5, 6, 9, or 10.
training (bool) – Whether to load the training or the test set.
standardize (Union[bool, Callable[[pandas.DataFrame, Dict], pandas.DataFrame]]) – Either a bool that decides whether to apply the dataset-dependent default standardization, or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of common statistics on the training dataset (i.e., mean, std, median, etc. for each feature).
download (bool) – Whether to download the dataset if it doesn’t exist.
preprocess (bool) – Whether to set up the dataset for experiments.
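
A custom standardize function with the (dataframe, stats) -> dataframe signature could look like the sketch below. The function name robust_standardize is hypothetical. The sketch assumes that stats contains the keys 'median' and 'std' and that each maps feature (column) names to values. The description above only says that stats holds statistics such as mean, std, and median for each feature, so check the actual layout before relying on this.

```python
from typing import Dict

import pandas as pd


def robust_standardize(dataframe: pd.DataFrame, stats: Dict) -> pd.DataFrame:
    """Center each feature by its training median and scale by its training std.

    Assumption: stats['median'] and stats['std'] map column names to values.
    """
    median = pd.Series(stats['median'])
    std = pd.Series(stats['std'])
    # Avoid division by zero for constant features.
    std = std.where(std != 0, 1.0)
    # DataFrame - Series aligns the Series index with the DataFrame columns.
    return (dataframe - median) / std


# Toy data standing in for the real features and training statistics.
frame = pd.DataFrame({'cpu': [1.0, 3.0, 5.0], 'mem': [2.0, 2.0, 2.0]})
stats = {'median': {'cpu': 3.0, 'mem': 2.0}, 'std': {'cpu': 2.0, 'mem': 0.0}}
scaled = robust_standardize(frame, stats)
print(scaled)

# Following the constructor signature above, the function would be passed as:
# ExathlonDataset(app_id=1, training=True, standardize=robust_standardize)
```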

GITHUB_LINK = 'https://github.com/exathlonbenchmark/exathlon.git'

dataset_path

data_path

app_id = 1

training = True

inputs = None

targets = None

load_data() → Tuple[List[numpy.ndarray], List[numpy.ndarray]]

Return type: Tuple[List[numpy.ndarray], List[numpy.ndarray]]

__getitem__(item: int) → Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

Access the time series at position item and its corresponding label sequence. A call to this function should return a single time series that was sampled independently of the other time series in this dataset.

Parameters:

item (int) – The zero-based index of the time series to retrieve.

Returns:

A tuple (inputs, targets), where inputs is again a tuple of Tensors with shape (T, D*), where D* can vary between the tensors. targets contains labels for the time series as tensors of shape (T,).

Return type:

Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

__len__() → int | None

This should return the number of independent time series in the dataset.

Return type: Optional[int]

property seq_len: List[int]

This should return the length of each time series. If the time series have different lengths, the return value should be a list that contains the length of each sequence. If all sequences are of equal length, this should return an int.

Return type: List[int]
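
Since the description of seq_len allows either a single int (all sequences of equal length) or a list with one length per sequence, generic code may want to normalize the value first. The helper lengths_as_list below is a hypothetical sketch and not part of timesead.

```python
from typing import List, Sequence, Union


def lengths_as_list(seq_len: Union[int, Sequence[int]], num_series: int) -> List[int]:
    """Return one length per time series, whatever form seq_len takes."""
    if isinstance(seq_len, int):
        # A single int means that all sequences share this length.
        return [seq_len] * num_series
    lengths = list(seq_len)
    if len(lengths) != num_series:
        raise ValueError(
            f'Expected {num_series} lengths, but got {len(lengths)}.'
        )
    return lengths


print(lengths_as_list(5, 3))
print(lengths_as_list([4, 7], 2))

# With a dataset object, this would be called as:
# lengths_as_list(dataset.seq_len, len(dataset))
```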

property num_features: int

Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.

Return type: int

static get_default_pipeline() → Dict[str, Dict[str, Any]]

Return the default pipeline for this dataset that is used if the user does not specify a different pipeline. This must be a dict of the form:

{
    '<name>': {'class': '<name-of-transform-class>', 'args': {'<arg-name>': <arg-value>, ...}},
    ...
}

Return type: Dict[str, Dict[str, Any]]
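
As an illustration of the dict form described above, the sketch below builds a one-step pipeline and checks its structure. The step name 'window', the class name 'HypotheticalWindowTransform', and the argument window_size are placeholders and do not refer to real timesead transforms. The check also assumes that 'args' is a dict of constructor keyword arguments.

```python
from typing import Any, Dict


def check_pipeline(pipeline: Dict[str, Dict[str, Any]]) -> None:
    """Raise if a pipeline dict does not follow the documented form."""
    for name, step in pipeline.items():
        if not isinstance(step, dict):
            raise TypeError(f'Step {name!r} must be a dict.')
        for key in ('class', 'args'):
            if key not in step:
                raise KeyError(f'Step {name!r} is missing the {key!r} entry.')
        if not isinstance(step['class'], str):
            raise TypeError(f"Step {name!r}: 'class' must be a string.")
        # Assumption: 'args' holds keyword arguments for the constructor.
        if not isinstance(step['args'], dict):
            raise TypeError(f"Step {name!r}: 'args' must be a dict.")


# Placeholder names only; substitute a real transform class and its arguments.
example_pipeline = {
    'window': {
        'class': 'HypotheticalWindowTransform',
        'args': {'window_size': 10},
    },
}
check_pipeline(example_pipeline)
print('pipeline structure is valid')
```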

static get_feature_names()

Return names for the features in the order they are present in the data tensors.

Returns: A list of strings with names for each feature.

download() → None

Return type: None