timesead.data.exathlon_dataset
==============================

.. py:module:: timesead.data.exathlon_dataset


Attributes
----------

.. autoapisummary::

   timesead.data.exathlon_dataset.TRAIN_FILES
   timesead.data.exathlon_dataset.TEST_FILES
   timesead.data.exathlon_dataset.TRAIN_LENGTHS
   timesead.data.exathlon_dataset.TEST_LENGTHS


Classes
-------

.. autoapisummary::

   timesead.data.exathlon_dataset.ExathlonDataset


Module Contents
---------------

.. py:data:: TRAIN_FILES
   :value: ('1_0_1000000_14.csv', '1_0_100000_15.csv', '1_0_100000_16.csv', '1_0_10000_17.csv',...

.. py:data:: TEST_FILES
   :value: ('1_2_100000_68.csv', '1_4_1000000_80.csv', '1_5_1000000_86.csv', '2_1_100000_60.csv',...

.. py:data:: TRAIN_LENGTHS

.. py:data:: TEST_LENGTHS

.. py:class:: ExathlonDataset(dataset_path: str = os.path.join(DATA_DIRECTORY, 'exathlon'), app_id: int = 1, training: bool = True, standardize: Union[bool, Callable[[pandas.DataFrame, Dict], pandas.DataFrame]] = True, download: bool = True, preprocess: bool = True)

   Bases: :py:obj:`timesead.data.dataset.BaseTSDataset`

   Implements the Exathlon dataset from [Jacob2021]_.
   The data was collected by running different applications on a Spark cluster and recording
   metrics from the Spark service and the worker nodes. The trace for each app is treated as a
   separate dataset. Set the `app_id` parameter to choose which app trace to load.

   .. note::
      The Exathlon dataset consists of more than 2000 raw features, which are reduced to 19
      aggregated features as described in [Jacob2021]_. This reduction happens in the preprocessing
      step during class initialization.

   .. note::
      Automatically downloading the dataset via the `download` option requires `git` to be
      installed on your system and is currently only tested on Linux.

   .. warning::
      This dataset requires preprocessing, which is enabled by setting the `preprocess` argument.
      Without preprocessed data, the class raises a RuntimeError.

   .. [Jacob2021] V. Jacob, F. Song, A. Stiegler, B. Rad, Y. Diao, and N. Tatbul.
      Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series.
      Proceedings of the VLDB Endowment (PVLDB), 14(11): 2613-2626, 2021.

   :param dataset_path: Folder from which to load the dataset.
   :param app_id: ID of the app whose data should be loaded. Must be in [1-6, 9, 10].
   :param training: Whether to load the training or the test set.
   :param standardize: Either a bool that decides whether to apply the dataset-dependent default
       standardization, or a function with signature (dataframe, stats) -> dataframe, where stats is a
       dictionary of common statistics on the training dataset (i.e., mean, std, median, etc. for each feature).
   :param download: Whether to download the dataset if it doesn't exist.
   :param preprocess: Whether to set up the dataset for experiments.


   .. py:attribute:: GITHUB_LINK
      :value: 'https://github.com/exathlonbenchmark/exathlon.git'


   .. py:attribute:: dataset_path


   .. py:attribute:: data_path


   .. py:attribute:: app_id
      :value: 1


   .. py:attribute:: training
      :value: True


   .. py:attribute:: inputs
      :value: None


   .. py:attribute:: targets
      :value: None


   .. py:method:: load_data() -> Tuple[List[numpy.ndarray], List[numpy.ndarray]]


   .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

      Access the time series at position `item` and its corresponding label sequence.
      A call to this function should return a single time series that was sampled independently
      of the other time series in this dataset.

      :param item: The zero-based index of the time series to retrieve.
      :return: A tuple `(inputs, targets)`, where `inputs` is again a tuple of :class:`~torch.Tensor`\s
          with shape `(T, D*)`, where `D*` can vary between the tensors. `targets` contains labels for
          the time series as tensors of shape `(T,)`.


   .. py:method:: __len__() -> Optional[int]

      This should return the number of independent time series in the dataset.


   .. py:property:: seq_len
      :type: List[int]

      This should return the length of each time series. If the time series have different lengths,
      the return value should be a list that contains the length of each sequence. If all sequences
      are of equal length, this should return an int.


   .. py:property:: num_features
      :type: int

      Number of features of each datapoint. This can also be a tuple if the data has more than one
      feature dimension.


   .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]]
      :staticmethod:

      Return the default pipeline for this dataset that is used if the user does not specify a
      different pipeline. This must be a dict of the form::

          {
              '': {'class': '', 'args': {'', ...}},
              ...
          }


   .. py:method:: get_feature_names()
      :staticmethod:

      Return names for the features in the order they are present in the data tensors.

      :return: A list of strings with names for each feature.


   .. py:method:: download() -> None
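
The `standardize` parameter of `ExathlonDataset` accepts a callable with signature
(dataframe, stats) -> dataframe. The following is a minimal sketch of such a callable, not the
library's own default standardization. The function name `zscore_standardize` is hypothetical,
and the dictionary keys `'mean'` and `'std'` are assumptions; check which keys the `stats`
dictionary actually provides before relying on them. The sketch also assumes that each statistic
is indexed by feature name so that it aligns with the columns of the dataframe.

.. code-block:: python

    from typing import TYPE_CHECKING, Any, Mapping

    if TYPE_CHECKING:  # pandas is only needed for the type hints
        import pandas as pd


    def zscore_standardize(dataframe: "pd.DataFrame", stats: Mapping[str, Any]) -> "pd.DataFrame":
        """Shift and scale each feature using statistics of the training set."""
        # Assumption: per-feature statistics are stored under the keys 'mean' and 'std'.
        mean = stats["mean"]
        std = stats["std"]
        # A small epsilon guards against division by zero for constant features.
        return (dataframe - mean) / (std + 1e-8)

Such a function could then be passed to the constructor as `standardize=zscore_standardize`
in place of the default `True`.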