timesead.data.tep_dataset
=========================

.. py:module:: timesead.data.tep_dataset


Classes
-------

.. autoapisummary::

   timesead.data.tep_dataset.TEPDataset


Module Contents
---------------

.. py:class:: TEPDataset(path: str = os.path.join(DATA_DIRECTORY, 'TEP_harvard'), faults: Optional[Union[int, List[int]]] = None, runs: Optional[Union[int, List[int]]] = None, training: bool = True, standardize: bool = True, cache_size: int = 21, preprocess: bool = True)

   Bases: :py:obj:`timesead.data.dataset.BaseTSDataset`


   Implementation of the Tennessee Eastman Process Dataset [Downs1993]_.
   The dataset was recorded by simulating a chemical process. The simulation can also introduce 20 different
   faults into the process, which are used as anomaly labels. We implement the extended version of the dataset by
   Rieth et al. [Rieth2017]_, which runs the process several times with different RNG seeds.

   .. note::
      At the moment, we do not offer an automatic download option for this dataset. Please visit
      https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6C3JR1 and download the files manually.

   .. warning::
       This dataset must be preprocessed before it can be used. Preprocessing is enabled through the `preprocess`
       argument. Without preprocessed data, the class fails with an error.

   .. [Downs1993] Downs, James J., and Ernest F. Vogel.
       "A plant-wide industrial process control problem." Computers & chemical engineering 17.3 (1993): 245-255.

   .. [Rieth2017] Rieth, Cory A.; Amsel, Ben D.; Tran, Randy; Cook, Maia B., 2017,
       "Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation",
       https://doi.org/10.7910/DVN/6C3JR1, Harvard Dataverse, V1

   :param path: Folder from which to load the dataset.
   :param faults: Specifies which faults to load data for. This can be a list of `int`\s, where 0 stands for
       fault-free data and [1, ..., 20] for the corresponding faults. Also supports a single `int` which means to
       only load data for this specific fault or `None` which loads data for all faults.
   :param runs: Specifies which runs to load for each fault. Each of the 500 runs was performed with a different
       random seed. This can either be specific runs passed as a list or a single `int` which means to load all runs
       from 0 up to this run. `None` means to load all available runs.
   :param training: Whether to load the training or the test set.
   :param standardize: Can be either a bool that decides whether to apply the dataset-dependent default
       standardization or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of
       common statistics on the training dataset (e.g., mean, std, and median for each feature).
   :param cache_size: Depending on the number of faults and runs chosen, this dataset can be quite large. It is
       therefore loaded in a lazy manner from disk. Data for each fault is kept in memory in a FIFO cache to reduce
       access time. This parameter sets the size of that cache. Setting this to the number of faults that you want
       to load will mean that eventually the entire dataset will be cached in memory.
   :param preprocess: Whether to set up the dataset for experiments by running the preprocessing step.
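
   The lazy loading described for `cache_size` can be pictured with a small FIFO cache. The following is a minimal
   sketch under stated assumptions, not the actual implementation of this class: `FIFOCache` and `load_fault` are
   hypothetical names, and the loader returns a placeholder string instead of reading files from disk.

   ```python
   from collections import OrderedDict
   from typing import Any, Callable, List


   class FIFOCache:
       """Hypothetical first-in-first-out cache keyed by fault number."""

       def __init__(self, loader: Callable[[int], Any], max_size: int):
           if max_size < 1:
               raise ValueError("max_size must be at least 1")
           self._loader = loader
           self._max_size = max_size
           self._items = OrderedDict()

       def get(self, key: int) -> Any:
           if key not in self._items:
               if len(self._items) >= self._max_size:
                   # FIFO: drop the entry that was inserted first,
                   # no matter how recently it was read.
                   self._items.popitem(last=False)
               self._items[key] = self._loader(key)
           return self._items[key]

       def keys(self) -> List[int]:
           # Keys in insertion order, oldest first.
           return list(self._items)


   loads = []  # records every (simulated) disk access


   def load_fault(fault: int) -> str:
       # Placeholder for an expensive read from disk (assumption for illustration).
       loads.append(fault)
       return f"data for fault {fault}"


   cache = FIFOCache(load_fault, max_size=2)
   cache.get(0)  # miss: fault 0 is loaded
   cache.get(1)  # miss: fault 1 is loaded
   cache.get(0)  # hit: nothing is loaded
   cache.get(2)  # miss: cache is full, fault 0 (oldest insertion) is evicted
   cache.get(0)  # miss again: fault 0 has to be reloaded
   ```

   With a cache at least as large as the number of selected faults, no entry is ever evicted, which matches the
   remark above that the entire dataset eventually ends up in memory.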


   .. py:attribute:: cache


   .. py:attribute:: path


   .. py:attribute:: processed_dir


   .. py:attribute:: training
      :value: True


   .. py:attribute:: standardize
      :value: True


   .. py:attribute:: cache_size
      :value: 21


   .. py:attribute:: faults


   .. py:attribute:: runs
      :value: None


   .. py:method:: load_data(fault: int, runs: Optional[Union[int, List[int]]] = None) -> Tuple[numpy.ndarray, numpy.ndarray]


   .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

      Access the time series at position `item` and its corresponding label sequence. A call to this function should
      return a single time series that was sampled independently of the other time series in this dataset.

      :param item: The zero-based index of the time series to retrieve.
      :return: A tuple `(inputs, targets)`, where `inputs` is again a tuple of :class:`~torch.Tensor`\s with shape
          `(T, D*)`, where `D*` can vary between the tensors. `targets` contains labels for the time series as tensors
          of shape `(T,)`.


   .. py:method:: __len__() -> Optional[int]

      Return the number of independent time series in the dataset.


   .. py:property:: seq_len
      :type: Optional[int]


      This should return the length of each time series. If the time series have different lengths, the return
      value should be a list that contains the length of each sequence. If all sequences are of equal length,
      this should return an int.


   .. py:property:: num_features
      :type: int


      Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.


   .. py:method:: get_feature_names() -> List[str]
      :staticmethod:


      Return names for the features in the order they are present in the data tensors.

      :return: A list of strings with names for each feature.


   .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]]
      :staticmethod:


      Return the default pipeline for this dataset that is used if the user does not specify a different pipeline.
      This must be a dict of the form::

          {
              '<name>': {'class': '<name-of-transform-class>', 'args': {'<args-for-constructor>', ...}},
              ...
          }
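
The dict form described for `get_default_pipeline` can be made concrete with a small example. This is a sketch under
assumptions: the step name `'window'`, the class name `'WindowTransform'`, and its `window_size` argument are
hypothetical placeholders rather than names confirmed by timesead, `'args'` is assumed to be a dict of keyword
arguments, and `check_pipeline` is an illustrative helper that only validates the shape.

```python
from typing import Any, Dict


def check_pipeline(pipeline: Dict[str, Dict[str, Any]]) -> bool:
    """Return True if every step has a string 'class' and a dict 'args'."""
    for name, step in pipeline.items():
        if not isinstance(name, str) or not isinstance(step, dict):
            return False
        if not isinstance(step.get('class'), str):
            return False
        if not isinstance(step.get('args'), dict):
            return False
    return True


# Hypothetical single-step pipeline in the documented form.
example_pipeline = {
    'window': {'class': 'WindowTransform', 'args': {'window_size': 100}},
}

is_valid = check_pipeline(example_pipeline)

# A step without 'args' does not match the documented form.
is_incomplete_valid = check_pipeline({'window': {'class': 'WindowTransform'}})
```

The helper checks structure only; whether a class name resolves to a real transform depends on the library.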