timesead.data.exathlon_dataset
==============================

.. py:module:: timesead.data.exathlon_dataset


Attributes
----------

.. autoapisummary::

   timesead.data.exathlon_dataset.TRAIN_FILES
   timesead.data.exathlon_dataset.TEST_FILES
   timesead.data.exathlon_dataset.TRAIN_LENGTHS
   timesead.data.exathlon_dataset.TEST_LENGTHS


Classes
-------

.. autoapisummary::

   timesead.data.exathlon_dataset.ExathlonDataset


Module Contents
---------------

.. py:data:: TRAIN_FILES
   :value: ('1_0_1000000_14.csv', '1_0_100000_15.csv', '1_0_100000_16.csv', '1_0_10000_17.csv',...


.. py:data:: TEST_FILES
   :value: ('1_2_100000_68.csv', '1_4_1000000_80.csv', '1_5_1000000_86.csv', '2_1_100000_60.csv',...


.. py:data:: TRAIN_LENGTHS

.. py:data:: TEST_LENGTHS

.. py:class:: ExathlonDataset(dataset_path: str = os.path.join(DATA_DIRECTORY, 'exathlon'), app_id: int = 1, training: bool = True, standardize: Union[bool, Callable[[pandas.DataFrame, Dict], pandas.DataFrame]] = True, download: bool = True, preprocess: bool = True)

   Bases: :py:obj:`timesead.data.dataset.BaseTSDataset`


   Implements the Exathlon dataset from [Jacob2021]_.
   The data was collected by running different applications on a Spark cluster and recording metrics from the Spark
   service and the worker nodes. We consider the trace for each app a separate dataset. You can control which app trace
   to load by setting the `app_id` parameter.

   .. note::
       The Exathlon dataset consists of more than 2000 raw features that we reduce to 19 aggregated features as
       described in [Jacob2021]_. This reduction happens in the preprocessing step during class initialization.

   .. note::
      Automatically downloading the dataset via the `download` option requires `git` to be installed on your system
      and has so far only been tested on Linux.

   .. warning::
       This dataset must be preprocessed before it can be used. Enable preprocessing by setting the `preprocess`
       argument. Without preprocessed data, the class raises a `RuntimeError`.

   .. [Jacob2021] V. Jacob, F. Song, A. Stiegler, B. Rad, Y. Diao, and N. Tatbul.
       Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series.
       Proceedings of the VLDB Endowment (PVLDB), 14(11): 2613-2626, 2021.

   :param dataset_path: Folder from which to load the dataset.
   :param app_id: ID of the app whose trace to load. Must be one of 1-6, 9, or 10.
   :param training: Whether to load the training or the test set.
   :param standardize: Either a bool that decides whether to apply the dataset-dependent default
       standardization, or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of
       common statistics on the training dataset (e.g., mean, std, and median for each feature).
   :param download: Whether to download the dataset if it doesn't exist.
   :param preprocess: Whether to set up the dataset for experiments.
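
   The callable form of `standardize` can be sketched as follows. This is a minimal example, not the library's
   default standardization. It assumes the statistics dictionary holds per-feature values under the keys
   `'median'` and `'std'`; those key names and the function name `robust_standardize` are assumptions made for
   illustration and should be checked against the actual implementation.

   ```python
   import pandas as pd


   def robust_standardize(dataframe: pd.DataFrame, stats: dict) -> pd.DataFrame:
       """Center each feature on its training median and scale by its training std.

       Assumes stats['median'] and stats['std'] map feature names to values
       (hypothetical key names).
       """
       median = pd.Series(stats["median"])
       # Replace a zero std by 1 so constant features do not divide by zero.
       std = pd.Series(stats["std"]).replace(0, 1)
       # DataFrame/Series arithmetic aligns the Series index with the columns.
       return (dataframe - median) / std


   # Stand-in data; the real dataset has 19 aggregated features.
   frame = pd.DataFrame({"cpu": [1.0, 2.0, 3.0], "mem": [5.0, 5.0, 5.0]})
   train_stats = {"median": {"cpu": 2.0, "mem": 5.0}, "std": {"cpu": 1.0, "mem": 0.0}}
   scaled = robust_standardize(frame, train_stats)
   ```

   Such a function would then be passed to the constructor as the `standardize` argument in place of a bool.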


   .. py:attribute:: GITHUB_LINK
      :value: 'https://github.com/exathlonbenchmark/exathlon.git'


   .. py:attribute:: dataset_path


   .. py:attribute:: data_path


   .. py:attribute:: app_id
      :value: 1


   .. py:attribute:: training
      :value: True


   .. py:attribute:: inputs
      :value: None


   .. py:attribute:: targets
      :value: None


   .. py:method:: load_data() -> Tuple[List[numpy.ndarray], List[numpy.ndarray]]


   .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

      Access the time series at position `item` and its corresponding label sequence. A call to this function should
      return a single time series that was sampled independently of the other time series in this dataset.

      :param item: The zero-based index of the time series to retrieve.
      :return: A tuple `(inputs, targets)`, where `inputs` is again a tuple of :class:`~torch.Tensor`\s with shape
          `(T, D*)`, where `D*` can vary between the tensors. `targets` contains labels for the time series as tensors
          of shape `(T,)`.


   .. py:method:: __len__() -> Optional[int]

      This should return the number of independent time series in the dataset.


   .. py:property:: seq_len
      :type: List[int]


      This should return the length of each time series. If the time series have different lengths, the return
      value should be a list that contains the length of each sequence. If all sequences are of equal length,
      this should return an int.
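
      Code that consumes `seq_len` may need to handle both return forms. A minimal sketch, assuming the caller
      knows the number of time series (for example from `len(dataset)`); `as_length_list` is a hypothetical
      helper and not part of this module.

      ```python
      from typing import List, Union


      def as_length_list(seq_len: Union[int, List[int]], num_series: int) -> List[int]:
          """Return one length per time series, whether seq_len is an int or a list."""
          if isinstance(seq_len, int):
              # All series share the same length.
              return [seq_len] * num_series
          lengths = list(seq_len)
          if len(lengths) != num_series:
              raise ValueError("expected one length per time series")
          return lengths
      ```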


   .. py:property:: num_features
      :type: int


      Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.


   .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]]
      :staticmethod:


      Return the default pipeline for this dataset that is used if the user does not specify a different pipeline.
      This must be a dict of the form::

          {
              '<name>': {'class': '<name-of-transform-class>', 'args': {'<args-for-constructor>', ...}},
              ...
          }
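
      The structure above can be illustrated with a small check. This is only a sketch: `check_pipeline` is a
      hypothetical helper, and the transform name and arguments in `example_pipeline` are placeholders that do
      not refer to real transform classes.

      ```python
      from typing import Any, Dict


      def check_pipeline(pipeline: Dict[str, Dict[str, Any]]) -> None:
          """Raise ValueError if an entry lacks a string 'class' or a dict 'args'."""
          for name, spec in pipeline.items():
              if not isinstance(spec.get("class"), str):
                  raise ValueError(f"{name}: 'class' must be a string")
              if not isinstance(spec.get("args"), dict):
                  raise ValueError(f"{name}: 'args' must be a dict")


      # Placeholder entry: one named step with a class name and constructor arguments.
      example_pipeline = {
          "window": {"class": "ExampleWindowTransform", "args": {"window_size": 10}},
      }
      check_pipeline(example_pipeline)
      ```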


   .. py:method:: get_feature_names()
      :staticmethod:


      Return names for the features in the order they are present in the data tensors.

      :return: A list of strings with names for each feature.


   .. py:method:: download() -> None