timesead.data
=============

.. py:module:: timesead.data

.. autoapi-nested-parse::

   This package contains code for loading and processing data. Each dataset class inherits from :class:`BaseTSDataset`.


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/timesead/data/dataset/index
   /autoapi/timesead/data/exathlon_dataset/index
   /autoapi/timesead/data/minismd_dataset/index
   /autoapi/timesead/data/preprocessing/index
   /autoapi/timesead/data/smap_dataset/index
   /autoapi/timesead/data/smd_dataset/index
   /autoapi/timesead/data/statistics/index
   /autoapi/timesead/data/swat_dataset/index
   /autoapi/timesead/data/tep_dataset/index
   /autoapi/timesead/data/transforms/index
   /autoapi/timesead/data/wadi_dataset/index


Classes
-------

.. autoapisummary::

   timesead.data.BaseTSDataset
   timesead.data.ExathlonDataset
   timesead.data.MiniSMDDataset
   timesead.data.SMAPDataset
   timesead.data.MSLDataset
   timesead.data.SMDDataset
   timesead.data.SWaTDataset
   timesead.data.TEPDataset
   timesead.data.WADIDataset


Package Contents
----------------

.. py:class:: BaseTSDataset

   Bases: :py:obj:`abc.ABC`, :py:obj:`torch.utils.data.Dataset`


   Base class for all time-series datasets in TimeSeAD. Implementing the members in this abstract class provides the
   data pipeline system with the necessary information to process the data correctly.


   .. py:method:: __len__() -> int
      :abstractmethod:


      This should return the number of independent time series in the dataset.


   .. py:property:: seq_len
      :type: Union[int, List[int]]

      :abstractmethod:


      This should return the length of each time series. If the time series have different lengths, the return
      value should be a list that contains the length of each sequence. If all sequences are of equal length,
      this should return an int.
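
      The rule above can be sketched as a small helper. This is only an illustration; ``summarize_seq_len`` is a
      hypothetical name and not part of TimeSeAD.

      .. code-block:: python

         from typing import List, Union


         def summarize_seq_len(lengths: List[int]) -> Union[int, List[int]]:
             """Collapse per-series lengths to one int when they all agree."""
             if not lengths:
                 raise ValueError("expected at least one time series")
             if all(n == lengths[0] for n in lengths):
                 return lengths[0]
             return list(lengths)


         print(summarize_seq_len([100, 100, 100]))  # 100
         print(summarize_seq_len([100, 80]))        # [100, 80]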


   .. py:property:: num_features
      :type: Union[int, Tuple[int, Ellipsis]]

      :abstractmethod:


      Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.


   .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]]
      :staticmethod:

      :abstractmethod:


      Return the default pipeline for this dataset that is used if the user does not specify a different pipeline.
      This must be a dict of the form::

          {
              '<name>': {'class': '<name-of-transform-class>', 'args': {'<args-for-constructor>', ...}},
              ...
          }
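
      As a hedged illustration of this shape, the sketch below builds one such dict and checks its structure.
      The step name ``'window'``, the class name ``'WindowTransform'`` and its ``window_size`` argument are
      placeholders chosen for the example, as is the helper ``check_pipeline``. The sketch also assumes that
      ``'args'`` maps constructor argument names to values and may be omitted.

      .. code-block:: python

         from typing import Any, Dict

         # Hypothetical pipeline with a single step; all names are placeholders.
         pipeline: Dict[str, Dict[str, Any]] = {
             'window': {'class': 'WindowTransform', 'args': {'window_size': 100}},
         }


         def check_pipeline(spec: Dict[str, Dict[str, Any]]) -> None:
             """Raise if a step does not follow the documented dict layout."""
             for name, step in spec.items():
                 if not isinstance(name, str):
                     raise TypeError(f"step name {name!r} is not a string")
                 if not isinstance(step, dict):
                     raise TypeError(f"step {name!r} is not a dict")
                 if not isinstance(step.get('class'), str):
                     raise ValueError(f"step {name!r} needs a string 'class' entry")
                 # Treating 'args' as optional is an assumption of this sketch.
                 if not isinstance(step.get('args', {}), dict):
                     raise ValueError(f"step {name!r} needs a dict 'args' entry")


         check_pipeline(pipeline)  # passes silently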


   .. py:method:: get_feature_names() -> List[str]
      :staticmethod:

      :abstractmethod:


      Return names for the features in the order they are present in the data tensors.

      :return: A list of strings with names for each feature.


   .. py:method:: __getitem__(index: int) -> Tuple[Tuple[torch.Tensor, Ellipsis], Tuple[torch.Tensor, Ellipsis]]
      :abstractmethod:


      Access the time series at position `index` and its corresponding label sequence. A call to this function should
      return a single time series that was sampled independently of the other time series in this dataset.

      :param index: The zero-based index of the time series to retrieve.
      :return: A tuple `(inputs, targets)`, where `inputs` is again a tuple of :class:`~torch.Tensor`\s with shape
          `(T, D*)`, where `D*` can vary between the tensors. `targets` contains labels for the time series as tensors
          of shape `(T,)`.
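
   The return contract of ``__getitem__`` can be checked with a short helper such as the sketch below.
   ``check_item`` is a hypothetical name, and NumPy arrays stand in for tensors so that the example runs
   without PyTorch. The helper only relies on the ``shape`` attribute, which both array types provide.

   .. code-block:: python

      import numpy as np


      def check_item(inputs, targets) -> int:
          """Validate an ``(inputs, targets)`` pair and return the common length T."""
          if not inputs:
              raise ValueError("inputs must contain at least one tensor")
          seq_len = inputs[0].shape[0]
          for x in inputs:
              # Every input has shape (T, D*); only the first axis must agree.
              if len(x.shape) < 2 or x.shape[0] != seq_len:
                  raise ValueError(f"bad input shape {tuple(x.shape)}")
          for y in targets:
              # Every target holds one label per time step.
              if tuple(y.shape) != (seq_len,):
                  raise ValueError(f"bad target shape {tuple(y.shape)}")
          return seq_len


      # A toy item: 50 time steps, one input with 3 features, one label sequence.
      item = ((np.zeros((50, 3)),), (np.zeros(50),))
      print(check_item(*item))  # 50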


.. py:class:: ExathlonDataset(dataset_path: str = os.path.join(DATA_DIRECTORY, 'exathlon'), app_id: int = 1, training: bool = True, standardize: Union[bool, Callable[[pandas.DataFrame, Dict], pandas.DataFrame]] = True, download: bool = True, preprocess: bool = True)

   Bases: :py:obj:`timesead.data.dataset.BaseTSDataset`


   Implements the Exathlon dataset from [Jacob2021]_.
   The data was collected by running different applications on a Spark cluster and recording metrics from the Spark
   service and the worker nodes. We consider the trace for each app a separate dataset. You can control which app trace
   to load by setting the `app_id` parameter.

   .. note::
       The Exathlon dataset consists of more than 2000 raw features that we reduce to 19 aggregated features as
       described in [Jacob2021]_. This is done in the preprocess step during the class initialization.

   .. note::
      Automatically downloading the dataset via the `download` option requires `git` to be installed on your system
      and is currently only tested on Linux!

   .. warning::
       This dataset must be preprocessed before it can be used. Enable preprocessing by setting the `preprocess`
       argument to True; otherwise, the class raises a RuntimeError.

   .. [Jacob2021] V. Jacob, F. Song, A. Stiegler, B. Rad, Y. Diao, and N. Tatbul.
       Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series.
       Proceedings of the VLDB Endowment (PVLDB), 14(11): 2613 - 2626, 2021.

   :param dataset_path: Folder from which to load the dataset.
   :param app_id: ID of the app whose trace to load. Must be one of 1-6, 9, or 10.
   :param training: Whether to load the training or the test set.
   :param standardize: Can be either a bool that decides whether to apply the dataset-dependent default
       standardization or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of
       common statistics on the training dataset (e.g., mean, std, and median for each feature).
   :param download: Whether to download the dataset if it doesn't exist.
   :param preprocess: Whether to set up the dataset for experiments.
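
   A custom ``standardize`` callable could look like the following sketch. The layout of ``stats`` is an
   assumption here: it is taken to hold per-feature ``'mean'`` and ``'std'`` entries indexed by column name,
   and ``zscore`` is a hypothetical function name. Inspect the statistics that the dataset actually passes
   before relying on these keys.

   .. code-block:: python

      from typing import Dict

      import pandas as pd


      def zscore(dataframe: pd.DataFrame, stats: Dict) -> pd.DataFrame:
          """Standardize each column with training-set mean and std."""
          mean = pd.Series(stats['mean'])
          # Replace a zero std by 1 so constant features do not divide by zero.
          std = pd.Series(stats['std']).replace(0, 1)
          # DataFrame-Series arithmetic aligns the Series index with the columns.
          return (dataframe - mean) / std


      df = pd.DataFrame({'a': [1.0, 3.0], 'b': [5.0, 5.0]})
      stats = {'mean': {'a': 2.0, 'b': 5.0}, 'std': {'a': 1.0, 'b': 0.0}}
      print(zscore(df, stats))

   Such a function would then be passed as the ``standardize`` argument in place of a bool.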


   .. py:attribute:: GITHUB_LINK
      :value: 'https://github.com/exathlonbenchmark/exathlon.git'


   .. py:attribute:: dataset_path


   .. py:attribute:: data_path


   .. py:attribute:: app_id
      :value: 1


   .. py:attribute:: training
      :value: True


   .. py:attribute:: inputs
      :value: None


   .. py:attribute:: targets
      :value: None


   .. py:method:: load_data() -> Tuple[List[numpy.ndarray], List[numpy.ndarray]]


   .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

      Access the time series at position `item` and its corresponding label sequence. A call to this function should
      return a single time series that was sampled independently of the other time series in this dataset.

      :param item: The zero-based index of the time series to retrieve.
      :return: A tuple `(inputs, targets)`, where `inputs` is again a tuple of :class:`~torch.Tensor`\s with shape
          `(T, D*)`, where `D*` can vary between the tensors. `targets` contains labels for the time series as tensors
          of shape `(T,)`.


   .. py:method:: __len__() -> Optional[int]

      This should return the number of independent time series in the dataset.


   .. py:property:: seq_len
      :type: List[int]


      This should return the length of each time series. If the time series have different lengths, the return
      value should be a list that contains the length of each sequence. If all sequences are of equal length,
      this should return an int.


   .. py:property:: num_features
      :type: int


      Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.


   .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]]
      :staticmethod:


      Return the default pipeline for this dataset that is used if the user does not specify a different pipeline.
      This must be a dict of the form::

          {
              '<name>': {'class': '<name-of-transform-class>', 'args': {'<args-for-constructor>', ...}},
              ...
          }


   .. py:method:: get_feature_names()
      :staticmethod:


      Return names for the features in the order they are present in the data tensors.

      :return: A list of strings with names for each feature.


   .. py:method:: download() -> None


.. py:class:: MiniSMDDataset(server_id: int = 0, path: str = os.path.join(DATA_DIRECTORY, 'mini_smd'), training: bool = True, standardize: Union[bool, Callable] = True, preprocess: bool = True)

   Bases: :py:obj:`timesead.data.dataset.BaseTSDataset`


   This is a condensed version of the :class:`~timesead.data.smd_dataset.SMDDataset` containing only shortened time
   series for two different machines. Mostly used for testing purposes.

   :param server_id: ID of the server to load. Must be 0 or 1.
   :param path: Path to the data.
   :param training: Whether to load the training or the test set.
   :param standardize: Can be either a bool that decides whether to apply the dataset-dependent default
       standardization or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of
       common statistics on the training dataset (e.g., mean, std, and median for each feature).
   :param preprocess: Whether to set up the dataset for experiments.


   .. py:attribute:: server_id
      :value: 0


   .. py:attribute:: path


   .. py:attribute:: training
      :value: True


   .. py:attribute:: standardize
      :value: True


   .. py:attribute:: inputs
      :value: None


   .. py:attribute:: targets
      :value: None


   .. py:attribute:: processed_dir


   .. py:method:: load_data() -> Tuple[numpy.ndarray, numpy.ndarray]


   .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]


   .. py:method:: __len__() -> Optional[int]


   .. py:property:: seq_len
      :type: Union[int, List[int]]


   .. py:property:: num_features
      :type: int


   .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]]
      :staticmethod:


   .. py:method:: get_feature_names()
      :staticmethod:


.. py:class:: SMAPDataset(data_path: str = os.path.join(DATA_DIRECTORY, 'smap'), channel_id: int = 0, training: bool = True, download: bool = True)

   Bases: :py:obj:`_SMAPBaseDataset`


   Implementation of the SMAP dataset [Hundman2018].
   It consists of several monitored values from a single satellite and commands sent to that satellite. We consider the
   trace for each channel a separate dataset, where the monitored value is in the first feature dimension and the
   remaining binary features correspond to the commands.

   :param data_path: Folder from which to load the dataset.
   :param channel_id: ID of the channel whose trace to load. Must be in [0-54].
   :param training: Whether to load the training or the test set.
   :param download: Whether to download the dataset if it doesn't exist.


   .. py:property:: num_features
      :type: Union[int, Tuple[int, Ellipsis]]


      Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.


.. py:class:: MSLDataset(data_path: str = os.path.join(DATA_DIRECTORY, 'smap'), channel_id: int = 0, training: bool = True, download: bool = True)

   Bases: :py:obj:`_SMAPBaseDataset`


   Implementation of the MSL dataset [Hundman2018].
   It consists of several monitored values from a Mars rover and commands sent to the rover. We consider the trace for
   each channel a separate dataset, where the monitored value is in the first feature dimension and the remaining
   binary features correspond to the commands.

   :param data_path: Folder from which to load the dataset.
   :param channel_id: ID of the channel whose trace to load. Must be in [0-26].
   :param training: Whether to load the training or the test set.
   :param download: Whether to download the dataset if it doesn't exist.


   .. py:property:: num_features
      :type: Union[int, Tuple[int, Ellipsis]]


      Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.


.. py:class:: SMDDataset(server_id: int, path: str = os.path.join(DATA_DIRECTORY, 'smd'), training: bool = True, standardize: Union[bool, Callable] = True, download: bool = True, preprocess: bool = True)

   Bases: :py:obj:`timesead.data.dataset.BaseTSDataset`


   Implementation of the Server Machine Dataset [Su2019]_.
   The data consists of traces from 28 different servers recorded over several weeks. We consider each trace to be a
   separate dataset.

   .. note::
       Automatically downloading the dataset currently requires that you have `git` installed on your system!

   .. [Su2019] Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, D. Pei.
       Robust anomaly detection for multivariate time series through stochastic recurrent neural network.
       In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining,
       2019 Jul 25 (pp. 2828-2837).

   :param path: Folder from which to load the dataset.
   :param server_id: ID of the machine whose trace to load. Must be in [0, ..., 27].
   :param training: Whether to load the training or the test set.
   :param standardize: Can be either a bool that decides whether to apply the dataset-dependent default
       standardization or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of
       common statistics on the training dataset (e.g., mean, std, and median for each feature).
   :param download: Whether to download the dataset if it doesn't exist.
   :param preprocess: Whether to set up the dataset for experiments.


   .. py:attribute:: GITHUB_LINK
      :value: 'https://github.com/NetManAIOps/OmniAnomaly.git'


   .. py:attribute:: server_id


   .. py:attribute:: path


   .. py:attribute:: processed_dir


   .. py:attribute:: training
      :value: True


   .. py:attribute:: standardize
      :value: True


   .. py:attribute:: inputs
      :value: None


   .. py:attribute:: targets
      :value: None


   .. py:method:: load_data() -> Tuple[numpy.ndarray, numpy.ndarray]


   .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]


   .. py:method:: __len__() -> Optional[int]


   .. py:property:: seq_len
      :type: Union[int, List[int]]


   .. py:property:: num_features
      :type: int


   .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]]
      :staticmethod:


   .. py:method:: get_feature_names()
      :staticmethod:


   .. py:method:: download()


.. py:class:: SWaTDataset(path: str = os.path.join(DATA_DIRECTORY, 'SWaT', 'SWaT.A1 & A2_Dec 2015', 'Physical'), training: bool = True, standardize: Union[bool, Callable] = True, remove_startup: bool = True, preprocess: bool = True)

   Bases: :py:obj:`timesead.data.dataset.BaseTSDataset`


   Implementation of the Secure WAter Treatment Dataset [Goh2016]_.
   This dataset was recorded from a miniature water treatment plant over the course of several weeks. The training
   and test sets each consist of a single long time series. During testing, several attacks (cyber and physical) were
   carried out against the plant.

   .. note::
      Due to licensing issues, we cannot offer an automatic download option for this dataset. Please visit
      https://itrust.sutd.edu.sg/itrust-labs_datasets/dataset_info/ and fill in the form to request a download link.
      The required files are in the folder `SWaT.A1 & A2_Dec 2015/Physical`.

   .. warning::
       This dataset must be preprocessed before it can be used. Enable preprocessing by setting the `preprocess`
       argument to True; otherwise, the class fails with an error.

   .. [Goh2016] Goh, Jonathan, et al. "A dataset to support research in the design of secure water treatment systems."
        Critical Information Infrastructures Security: 11th International Conference, CRITIS 2016, Paris, France,
        October 10–12, 2016, Revised Selected Papers 11. Springer International Publishing, 2017.

   :param path: Path where the files "SWaT_Dataset_Normal_v1.csv" and "SWaT_Dataset_Attack_v0.csv" are located.
   :param training: If True, this will load the training set consisting only of normal samples. Otherwise, loads
       the test set, which includes attacks.
   :param standardize: If True, apply min-max scaling (based on the training set). This can also be a function
       that accepts a DataFrame as its positional argument and a keyword argument `stats`: a dictionary of training
       data statistics.
   :param remove_startup: If True, this will remove the first 5 hours from the training set, as during this time
       the system was starting from an empty state. To be more exact, this removes only 4.5 hours, since the first 30
       minutes were already removed in v1 of the Dataset.
   :param preprocess: If True, set up the dataset to run experiments.


   .. py:attribute:: path


   .. py:attribute:: processed_dir


   .. py:attribute:: training
      :value: True


   .. py:attribute:: remove_startup
      :value: True


   .. py:attribute:: inputs
      :value: None


   .. py:attribute:: targets
      :value: None


   .. py:method:: load_data() -> Tuple[numpy.ndarray, numpy.ndarray]


   .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]


   .. py:method:: __len__() -> Optional[int]


   .. py:property:: seq_len
      :type: Optional[int]


   .. py:property:: num_features
      :type: int


   .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]]
      :staticmethod:


   .. py:method:: get_feature_names()
      :staticmethod:


.. py:class:: TEPDataset(path: str = os.path.join(DATA_DIRECTORY, 'TEP_harvard'), faults: Optional[Union[int, List[int]]] = None, runs: Optional[Union[int, List[int]]] = None, training: bool = True, standardize: bool = True, cache_size: int = 21, preprocess: bool = True)

   Bases: :py:obj:`timesead.data.dataset.BaseTSDataset`


   Implementation of the Tennessee Eastman Process Dataset [Downs1993]_.
   The dataset was recorded by simulating a chemical process. The simulation also allows introducing 20 different
   faults into the process, which are used as anomaly labels. We implement the extended version of the dataset by
   Rieth et al. [Rieth2017]_, which runs the process several times with different RNG seeds.

   .. note::
      At the moment, we do not offer an automatic download option for this dataset. Please visit
      https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6C3JR1 and download the files manually.

   .. warning::
       This dataset must be preprocessed before it can be used. Enable preprocessing by setting the `preprocess`
       argument to True; otherwise, the class fails with an error.

   .. [Downs1993] Downs, James J., and Ernest F. Vogel.
       "A plant-wide industrial process control problem." Computers & chemical engineering 17.3 (1993): 245-255.

   .. [Rieth2017] Rieth, Cory A.; Amsel, Ben D.; Tran, Randy; Cook, Maia B., 2017,
       "Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation",
       https://doi.org/10.7910/DVN/6C3JR1, Harvard Dataverse, V1

   :param path: Folder from which to load the dataset.
   :param faults: Specifies which faults to load data for. This can be a list of `int`\s, where 0 stands for
       fault-free data and [1, ..., 20] for the corresponding faults. Also supports a single `int` which means to
       only load data for this specific fault or `None` which loads data for all faults.
   :param runs: Specifies which runs to load for each fault. Each of the 500 runs was performed with a different
       random seed. This can either be specific runs passed as a list or a single `int` which means to load all runs
       from 0 up to this run. `None` means to load all available runs.
   :param training: Whether to load the training or the test set.
   :param standardize: Can be either a bool that decides whether to apply the dataset-dependent default
       standardization or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of
       common statistics on the training dataset (e.g., mean, std, and median for each feature).
   :param cache_size: Depending on the number of faults and runs chosen, this dataset can be quite large. It is
       therefore loaded in a lazy manner from disk. Data for each fault is kept in memory in a FIFO cache to reduce
       access time. This parameter sets the size of that cache. Setting this to the number of faults that you want
       to load will mean that eventually the entire dataset will be cached in memory.
   :param preprocess: Whether to set up the dataset for experiments.
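
   The caching behaviour described for ``cache_size`` can be pictured with the following sketch of a FIFO
   cache keyed by fault ID. It is not the class's actual implementation; ``FIFOCache`` and the ``loader``
   callback are hypothetical names.

   .. code-block:: python

      from collections import OrderedDict
      from typing import Any, Callable


      class FIFOCache:
          """Keep at most ``max_size`` entries and evict the oldest insertion first."""

          def __init__(self, max_size: int, loader: Callable[[int], Any]):
              self.max_size = max_size
              self.loader = loader
              self._store: "OrderedDict[int, Any]" = OrderedDict()

          def get(self, fault: int) -> Any:
              if fault in self._store:
                  # FIFO: a hit does not change the eviction order.
                  return self._store[fault]
              value = self.loader(fault)
              if self.max_size > 0:
                  if len(self._store) >= self.max_size:
                      self._store.popitem(last=False)  # drop the oldest entry
                  self._store[fault] = value
              return value


      loads = []
      cache = FIFOCache(max_size=2, loader=lambda f: loads.append(f) or f"data-{f}")
      for fault in [0, 1, 0, 2, 0]:
          cache.get(fault)
      print(loads)  # [0, 1, 2, 0]

   With the default ``cache_size`` of 21, there is one slot for each of the 21 fault IDs (0 for fault-free data
   plus faults 1 to 20), so a cache of this kind would never evict an entry.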


   .. py:attribute:: cache


   .. py:attribute:: path


   .. py:attribute:: processed_dir


   .. py:attribute:: training
      :value: True


   .. py:attribute:: standardize
      :value: True


   .. py:attribute:: cache_size
      :value: 21


   .. py:attribute:: faults


   .. py:attribute:: runs
      :value: None


   .. py:method:: load_data(fault: int, runs: Optional[Union[int, List[int]]] = None) -> Tuple[numpy.ndarray, numpy.ndarray]


   .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

      Access the time series at position `item` and its corresponding label sequence. A call to this function should
      return a single time series that was sampled independently of the other time series in this dataset.

      :param item: The zero-based index of the time series to retrieve.
      :return: A tuple `(inputs, targets)`, where `inputs` is again a tuple of :class:`~torch.Tensor`\s with shape
          `(T, D*)`, where `D*` can vary between the tensors. `targets` contains labels for the time series as tensors
          of shape `(T,)`.


   .. py:method:: __len__() -> Optional[int]

      This should return the number of independent time series in the dataset.


   .. py:property:: seq_len
      :type: Optional[int]


      This should return the length of each time series. If the time series have different lengths, the return
      value should be a list that contains the length of each sequence. If all sequences are of equal length,
      this should return an int.


   .. py:property:: num_features
      :type: int


      Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.


   .. py:method:: get_feature_names() -> List[str]
      :staticmethod:


      Return names for the features in the order they are present in the data tensors.

      :return: A list of strings with names for each feature.


   .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]]
      :staticmethod:


      Return the default pipeline for this dataset that is used if the user does not specify a different pipeline.
      This must be a dict of the form::

          {
              '<name>': {'class': '<name-of-transform-class>', 'args': {'<args-for-constructor>', ...}},
              ...
          }


.. py:class:: WADIDataset(path: str = os.path.join(DATA_DIRECTORY, 'wadi', 'WADI.A2_19 Nov 2019'), training: bool = True, standardize: Union[bool, Callable[[pandas.DataFrame, Dict], pandas.DataFrame]] = True, remove_startup: bool = True, split: bool = True, preprocess: bool = True)

   Bases: :py:obj:`timesead.data.dataset.BaseTSDataset`


   Implementation of the WAter DIstribution Dataset [Ahmed2017]_.
   This dataset was recorded from a miniature water distribution network over the course of several weeks.
   Both the training and the test set consist of a single long time series, or of two time series, depending on the
   `split` parameter. During testing, several attacks (cyber and physical) were carried out against the plant.

   .. note::
      Due to licensing issues, we cannot offer an automatic download option for this dataset. Please visit
      https://itrust.sutd.edu.sg/itrust-labs_datasets/dataset_info/ and fill in the form to request a download link.
      The required files are in the folder `WADI.A2_19 Nov 2019`.

   .. [Ahmed2017] Ahmed, Chuadhry Mujeeb, Venkata Reddy Palleti, and Aditya P. Mathur.
       "WADI: a water distribution testbed for research in the design of secure cyber physical systems."
       Proceedings of the 3rd international workshop on cyber-physical systems for smart water networks. 2017.

   :param path: Folder from which to load the dataset.
   :param training: Whether to load the training or the test set.
   :param standardize: Can be either a bool that decides whether to apply the dataset-dependent default
       standardization or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of
       common statistics on the training dataset (e.g., mean, std, and median for each feature).
   :param remove_startup: This removes the first 5 hours of the training set, during which the plant is starting.
   :param split: The authors removed some data points in v2 of the training dataset. Thus, there is a clear split
       at index 335998. Setting this to True will return two time series split at this location. Otherwise, one
       long time series is returned.
   :param preprocess: Whether to set up the dataset for experiments.
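
   The options ``remove_startup`` and ``split`` amount to simple slicing, as the sketch below illustrates on a
   toy series. The helper names are hypothetical, and the order in which the class applies the two steps is
   not specified here. The attribute ``startup_remove_amount`` is 18000, which equals 5 hours if one sample is
   recorded per second (an assumption about the sampling rate).

   .. code-block:: python

      from typing import Sequence, Tuple


      def drop_startup(series: Sequence, amount: int) -> Sequence:
          """Remove the first ``amount`` samples of a series."""
          return series[amount:]


      def split_at(series: Sequence, index: int) -> Tuple[Sequence, Sequence]:
          """Cut one series into two at ``index``."""
          return series[:index], series[index:]


      # 5 hours at one sample per second (assumed sampling rate).
      print(5 * 60 * 60)  # 18000

      toy = list(range(10))
      print(drop_startup(toy, 3))  # [3, 4, 5, 6, 7, 8, 9]
      print(split_at(toy, 6))      # ([0, 1, 2, 3, 4, 5], [6, 7, 8, 9])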


   .. py:attribute:: path


   .. py:attribute:: processed_dir


   .. py:attribute:: training
      :value: True


   .. py:attribute:: remove_startup
      :value: True


   .. py:attribute:: split
      :value: True


   .. py:attribute:: startup_remove_amount
      :value: 18000


   .. py:attribute:: split_index
      :value: 335999


   .. py:attribute:: inputs
      :value: None


   .. py:attribute:: targets
      :value: None


   .. py:method:: load_data() -> Tuple[numpy.ndarray, numpy.ndarray]


   .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

      This should return the time series of the dataset. For example, if the dataset has 5 independent time series,
      passing 0, ..., 4 as `item` should return these time series. The format is `(inputs, targets)`, where `inputs`
      and `targets` are tuples of :class:`~torch.Tensor`\s.

      :param item: Index of the time series to return.
      :return: A tuple `(inputs, targets)` of tuples of tensors.


   .. py:method:: __len__() -> Optional[int]

      This should return the number of independent time series in the dataset.


   .. py:property:: seq_len
      :type: Optional[int]


      This should return the length of each time series. If the time series have different lengths, the return
      value should be a list that contains the length of each sequence. If all sequences are of equal length,
      this should return an int.


   .. py:property:: num_features
      :type: int


      Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.


   .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]]
      :staticmethod:


      Return the default pipeline for this dataset that is used if the user does not specify a different pipeline.
      This must be a dict of the form::

          {
              '<name>': {'class': '<name-of-transform-class>', 'args': {'<args-for-constructor>', ...}},
              ...
          }


   .. py:method:: get_feature_names()
      :staticmethod:


      Return names for the features in the order they are present in the data tensors.

      :return: A list of strings with names for each feature.