timesead.data ============= .. py:module:: timesead.data .. autoapi-nested-parse:: This package contains code for loading and processing data. Each dataset class inherits from :class:`BaseTSDataset` Submodules ---------- .. toctree:: :maxdepth: 1 /autoapi/timesead/data/dataset/index /autoapi/timesead/data/exathlon_dataset/index /autoapi/timesead/data/minismd_dataset/index /autoapi/timesead/data/preprocessing/index /autoapi/timesead/data/smap_dataset/index /autoapi/timesead/data/smd_dataset/index /autoapi/timesead/data/statistics/index /autoapi/timesead/data/swat_dataset/index /autoapi/timesead/data/tep_dataset/index /autoapi/timesead/data/transforms/index /autoapi/timesead/data/wadi_dataset/index Classes ------- .. autoapisummary:: timesead.data.BaseTSDataset timesead.data.ExathlonDataset timesead.data.MiniSMDDataset timesead.data.SMAPDataset timesead.data.MSLDataset timesead.data.SMDDataset timesead.data.SWaTDataset timesead.data.TEPDataset timesead.data.WADIDataset Package Contents ---------------- .. py:class:: BaseTSDataset Bases: :py:obj:`abc.ABC`, :py:obj:`torch.utils.data.Dataset` Base class for all time-series datasets in TimeSeAD. Implementing the members in this abstract class provides the data pipeline system with the necessary information to process the data correctly. .. py:method:: __len__() -> int :abstractmethod: This should return the number of independent time series in the dataset .. py:property:: seq_len :type: Union[int, List[int]] :abstractmethod: This should return the length of each time series. If the time series have different lengths, the return value should be a list that contains the length of each sequence. If all sequences are of equal length, this should return an int. .. py:property:: num_features :type: Union[int, Tuple[int, Ellipsis]] :abstractmethod: Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension. .. 
py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]] :staticmethod: :abstractmethod: Return the default pipeline for this dataset that is used if the user does not specify a different pipeline. This must be a dict of the form:: { '': {'class': '', 'args': {'', ...}}, ... } .. py:method:: get_feature_names() -> List[str] :staticmethod: :abstractmethod: Return names for the features in the order they are present in the data tensors. :return: A list of strings with names for each feature. .. py:method:: __getitem__(index: int) -> Tuple[Tuple[torch.Tensor, Ellipsis], Tuple[torch.Tensor, Ellipsis]] :abstractmethod: Access the time series at position `index` and its corresponding label sequence. A call to this function should return a single time series that was sampled independently of the other time series in this dataset. :param index: The zero-based index of the time series to retrieve. :return: A tuple `(inputs, targets)`, where `inputs` is itself a tuple of :class:`~torch.Tensor`\s with shape `(T, D*)`, where `D*` can vary between the tensors. `targets` contains labels for the time series as tensors of shape `(T,)`. .. py:class:: ExathlonDataset(dataset_path: str = os.path.join(DATA_DIRECTORY, 'exathlon'), app_id: int = 1, training: bool = True, standardize: Union[bool, Callable[[pandas.DataFrame, Dict], pandas.DataFrame]] = True, download: bool = True, preprocess: bool = True) Bases: :py:obj:`timesead.data.dataset.BaseTSDataset` Implements the Exathlon dataset from [Jacob2021]_. The data was collected by running different applications on a Spark cluster and recording metrics from the Spark service and the worker nodes. We consider the trace for each app a separate dataset. You can control which app trace to load by setting the `app_id` parameter. .. note:: The Exathlon dataset consists of more than 2000 raw features that we reduce to 19 aggregated features as described in [Jacob2021]_. This is done in the preprocess step during the class initialization. ..
note:: Automatically downloading the dataset via the `download` option requires `git` to be installed on your system and is currently only tested on Linux. .. warning:: This dataset requires preprocessing, which is enabled by setting the `preprocess` argument. The class raises a RuntimeError without preprocessing. .. [Jacob2021] V. Jacob, F. Song, A. Stiegler, B. Rad, Y. Diao, and N. Tatbul. Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series. Proceedings of the VLDB Endowment (PVLDB), 14(11): 2613 - 2626, 2021. :param dataset_path: Folder from which to load the dataset. :param app_id: ID of the app whose data to load. Must be in [1-6, 9, 10]. :param training: Whether to load the training or the test set. :param standardize: Can be either a bool that decides whether to apply the dataset-dependent default standardization or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of common statistics on the training dataset (e.g., mean, std, and median for each feature). :param download: Whether to download the dataset if it doesn't exist. :param preprocess: Whether to set up the dataset for experiments. .. py:attribute:: GITHUB_LINK :value: 'https://github.com/exathlonbenchmark/exathlon.git' .. py:attribute:: dataset_path .. py:attribute:: data_path .. py:attribute:: app_id :value: 1 .. py:attribute:: training :value: True .. py:attribute:: inputs :value: None .. py:attribute:: targets :value: None .. py:method:: load_data() -> Tuple[List[numpy.ndarray], List[numpy.ndarray]] .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]] Access the time series at position `item` and its corresponding label sequence. A call to this function should return a single time series that was sampled independently of the other time series in this dataset. :param item: The zero-based index of the time series to retrieve.
:return: A tuple `(inputs, targets)`, where `inputs` is itself a tuple of :class:`~torch.Tensor`\s with shape `(T, D*)`, where `D*` can vary between the tensors. `targets` contains labels for the time series as tensors of shape `(T,)`. .. py:method:: __len__() -> Optional[int] This should return the number of independent time series in the dataset. .. py:property:: seq_len :type: List[int] This should return the length of each time series. If the time series have different lengths, the return value should be a list that contains the length of each sequence. If all sequences are of equal length, this should return an int. .. py:property:: num_features :type: int Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension. .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]] :staticmethod: Return the default pipeline for this dataset that is used if the user does not specify a different pipeline. This must be a dict of the form:: { '': {'class': '', 'args': {'', ...}}, ... } .. py:method:: get_feature_names() :staticmethod: Return names for the features in the order they are present in the data tensors. :return: A list of strings with names for each feature. .. py:method:: download() -> None .. py:class:: MiniSMDDataset(server_id: int = 0, path: str = os.path.join(DATA_DIRECTORY, 'mini_smd'), training: bool = True, standardize: Union[bool, Callable] = True, preprocess: bool = True) Bases: :py:obj:`timesead.data.dataset.BaseTSDataset` This is a condensed version of the :class:`~timesead.data.smd_dataset.SMDDataset` containing only shortened time series for two different machines. It is mostly used for testing purposes. :param server_id: ID of the server to load. Must be 0 or 1. :param path: Path to the data. :param training: Whether to load the training or the test set.
:param standardize: Can be either a bool that decides whether to apply the dataset-dependent default standardization or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of common statistics on the training dataset (e.g., mean, std, and median for each feature). .. py:attribute:: server_id :value: 0 .. py:attribute:: path .. py:attribute:: training :value: True .. py:attribute:: standardize :value: True .. py:attribute:: inputs :value: None .. py:attribute:: targets :value: None .. py:attribute:: processed_dir .. py:method:: load_data() -> Tuple[numpy.ndarray, numpy.ndarray] .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]] .. py:method:: __len__() -> Optional[int] .. py:property:: seq_len :type: Union[int, List[int]] .. py:property:: num_features :type: int .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]] :staticmethod: .. py:method:: get_feature_names() :staticmethod: .. py:class:: SMAPDataset(data_path: str = os.path.join(DATA_DIRECTORY, 'smap'), channel_id: int = 0, training: bool = True, download: bool = True) Bases: :py:obj:`_SMAPBaseDataset` Implementation of the SMAP dataset [Hundman2018]. It consists of several monitored values from a single satellite and commands sent to that satellite. We consider the trace for each channel a separate dataset, where the monitored value is in the first feature dimension and the remaining binary features correspond to the commands. :param data_path: Folder from which to load the dataset. :param channel_id: ID of the channel whose data to load. Must be in [0-54]. :param training: Whether to load the training or the test set. :param download: Whether to download the dataset if it doesn't exist. .. py:property:: num_features :type: Union[int, Tuple[int, Ellipsis]] Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension. ..
py:class:: MSLDataset(data_path: str = os.path.join(DATA_DIRECTORY, 'smap'), channel_id: int = 0, training: bool = True, download: bool = True) Bases: :py:obj:`_SMAPBaseDataset` Implementation of the MSL dataset [Hundman2018]. It consists of several monitored values from a Mars rover and commands sent to the rover. We consider the trace for each channel a separate dataset, where the monitored value is in the first feature dimension and the remaining binary features correspond to the commands. :param data_path: Folder from which to load the dataset. :param channel_id: ID of the channel whose data to load. Must be in [0-26]. :param training: Whether to load the training or the test set. :param download: Whether to download the dataset if it doesn't exist. .. py:property:: num_features :type: Union[int, Tuple[int, Ellipsis]] Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension. .. py:class:: SMDDataset(server_id: int, path: str = os.path.join(DATA_DIRECTORY, 'smd'), training: bool = True, standardize: Union[bool, Callable] = True, download: bool = True, preprocess: bool = True) Bases: :py:obj:`timesead.data.dataset.BaseTSDataset` Implementation of the Server Machine Dataset [Su2019]_. The data consists of traces from 28 different servers recorded over several weeks. We consider each trace to be a separate dataset. .. note:: Automatically downloading the dataset currently requires `git` to be installed on your system. .. [Su2019] Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, D. Pei. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019 Jul 25 (pp. 2828-2837). :param path: Folder from which to load the dataset. :param server_id: ID of the machine whose data to load. Must be in [0, ..., 27]. :param training: Whether to load the training or the test set.
:param standardize: Can be either a bool that decides whether to apply the dataset-dependent default standardization or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of common statistics on the training dataset (e.g., mean, std, and median for each feature). :param download: Whether to download the dataset if it doesn't exist. :param preprocess: Whether to set up the dataset for experiments. .. py:attribute:: GITHUB_LINK :value: 'https://github.com/NetManAIOps/OmniAnomaly.git' .. py:attribute:: server_id .. py:attribute:: path .. py:attribute:: processed_dir .. py:attribute:: training :value: True .. py:attribute:: standardize :value: True .. py:attribute:: inputs :value: None .. py:attribute:: targets :value: None .. py:method:: load_data() -> Tuple[numpy.ndarray, numpy.ndarray] .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]] .. py:method:: __len__() -> Optional[int] .. py:property:: seq_len :type: Union[int, List[int]] .. py:property:: num_features :type: int .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]] :staticmethod: .. py:method:: get_feature_names() :staticmethod: .. py:method:: download() .. py:class:: SWaTDataset(path: str = os.path.join(DATA_DIRECTORY, 'SWaT', 'SWaT.A1 & A2_Dec 2015', 'Physical'), training: bool = True, standardize: Union[bool, Callable] = True, remove_startup: bool = True, preprocess: bool = True) Bases: :py:obj:`timesead.data.dataset.BaseTSDataset` Implementation of the Secure WAter Treatment Dataset [Goh2016]_. This dataset was recorded from a miniature water treatment plant over the course of several weeks. The training and test sets each consist of a single long time series. During testing, several attacks (cyber and physical) were carried out against the plant. .. note:: Due to licensing issues, we cannot offer an automatic download option for this dataset.
Please visit https://itrust.sutd.edu.sg/itrust-labs_datasets/dataset_info/ and fill in the form to request a download link. The required files are in the folder `SWaT.A1 & A2_Dec 2015/Physical`. .. warning:: This dataset requires preprocessing, which is enabled by setting the `preprocess` argument to True. The class raises an error without preprocessing. .. [Goh2016] Goh, Jonathan, et al. "A dataset to support research in the design of secure water treatment systems." Critical Information Infrastructures Security: 11th International Conference, CRITIS 2016, Paris, France, October 10–12, 2016, Revised Selected Papers 11. Springer International Publishing, 2017. :param path: Path where the files "SWaT_Dataset_Normal_v1.csv" and "SWaT_Dataset_Attack_v0.csv" are located. :param training: If True, this will load the training set consisting only of normal samples. Otherwise, loads the test set, which includes attacks. :param standardize: If True, apply min-max scaling (based on the training set). This can also be a function that accepts a DataFrame as its positional argument and a keyword argument `stats`: a dictionary of training data statistics. :param remove_startup: If True, this will remove the first 5 hours from the training set, as during this time the system was starting from an empty state. To be more exact, this removes only 4.5 hours, since the first 30 minutes were already removed in v1 of the dataset. :param preprocess: If True, set up the dataset for running experiments. .. py:attribute:: path .. py:attribute:: processed_dir .. py:attribute:: training :value: True .. py:attribute:: remove_startup :value: True .. py:attribute:: inputs :value: None .. py:attribute:: targets :value: None .. py:method:: load_data() -> Tuple[numpy.ndarray, numpy.ndarray] .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]] .. py:method:: __len__() -> Optional[int] ..
py:property:: seq_len :type: Optional[int] .. py:property:: num_features :type: int .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]] :staticmethod: .. py:method:: get_feature_names() :staticmethod: .. py:class:: TEPDataset(path: str = os.path.join(DATA_DIRECTORY, 'TEP_harvard'), faults: Optional[Union[int, List[int]]] = None, runs: Optional[Union[int, List[int]]] = None, training: bool = True, standardize: bool = True, cache_size: int = 21, preprocess: bool = True) Bases: :py:obj:`timesead.data.dataset.BaseTSDataset` Implementation of the Tennessee Eastman Process Dataset [Downs1993]_. The dataset was recorded by simulating a chemical process. The simulation can also introduce 20 different faults into the process, which are used as anomaly labels. We implement the extended version of the dataset by Rieth et al. [Rieth2017]_, which runs the process several times with different RNG seeds. .. note:: At the moment, we do not offer an automatic download option for this dataset. Please visit https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6C3JR1 and download the files manually. .. warning:: This dataset requires preprocessing, which is enabled by setting the `preprocess` argument. The class raises an error without preprocessing. .. [Downs1993] Downs, James J., and Ernest F. Vogel. "A plant-wide industrial process control problem." Computers & chemical engineering 17.3 (1993): 245-255. .. [Rieth2017] Rieth, Cory A.; Amsel, Ben D.; Tran, Randy; Cook, Maia B., 2017, "Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation", https://doi.org/10.7910/DVN/6C3JR1, Harvard Dataverse, V1 :param path: Folder from which to load the dataset. :param faults: Specifies which faults to load data for. This can be a list of `int`\s, where 0 stands for fault-free data and [1, ..., 20] for the corresponding faults.
A single `int` loads data only for that specific fault, and `None` loads data for all faults. :param runs: Specifies which runs to load for each fault. Each of the 500 runs was performed with a different random seed. This can either be specific runs passed as a list or a single `int` which means to load all runs from 0 up to this run. `None` means to load all available runs. :param training: Whether to load the training or the test set. :param standardize: Can be either a bool that decides whether to apply the dataset-dependent default standardization or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of common statistics on the training dataset (e.g., mean, std, and median for each feature). :param cache_size: Depending on the number of faults and runs chosen, this dataset can be quite large. It is therefore loaded lazily from disk. Data for each fault is kept in memory in a FIFO cache to reduce access time. This parameter sets the size of that cache. Setting this to the number of faults that you want to load means that eventually the entire dataset will be cached in memory. :param preprocess: Whether to set up the dataset for experiments. .. py:attribute:: cache .. py:attribute:: path .. py:attribute:: processed_dir .. py:attribute:: training :value: True .. py:attribute:: standardize :value: True .. py:attribute:: cache_size :value: 21 .. py:attribute:: faults .. py:attribute:: runs :value: None .. py:method:: load_data(fault: int, runs: Optional[Union[int, List[int]]] = None) -> Tuple[numpy.ndarray, numpy.ndarray] .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]] Access the time series at position `item` and its corresponding label sequence. A call to this function should return a single time series that was sampled independently of the other time series in this dataset. :param item: The zero-based index of the time series to retrieve.
:return: A tuple `(inputs, targets)`, where `inputs` is itself a tuple of :class:`~torch.Tensor`\s with shape `(T, D*)`, where `D*` can vary between the tensors. `targets` contains labels for the time series as tensors of shape `(T,)`. .. py:method:: __len__() -> Optional[int] This should return the number of independent time series in the dataset. .. py:property:: seq_len :type: Optional[int] This should return the length of each time series. If the time series have different lengths, the return value should be a list that contains the length of each sequence. If all sequences are of equal length, this should return an int. .. py:property:: num_features :type: int Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension. .. py:method:: get_feature_names() -> List[str] :staticmethod: Return names for the features in the order they are present in the data tensors. :return: A list of strings with names for each feature. .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]] :staticmethod: Return the default pipeline for this dataset that is used if the user does not specify a different pipeline. This must be a dict of the form:: { '': {'class': '', 'args': {'', ...}}, ... } .. py:class:: WADIDataset(path: str = os.path.join(DATA_DIRECTORY, 'wadi', 'WADI.A2_19 Nov 2019'), training: bool = True, standardize: Union[bool, Callable[[pandas.DataFrame, Dict], pandas.DataFrame]] = True, remove_startup: bool = True, split: bool = True, preprocess: bool = True) Bases: :py:obj:`timesead.data.dataset.BaseTSDataset` Implementation of the WAter DIstribution Dataset [Ahmed2017]_. This dataset was recorded from a miniature water distribution network over the course of several weeks. Both the training and the test set consist of a single long time series or of two time series; see the `split` parameter for details. During testing, several attacks (cyber and physical) were carried out against the plant. ..
note:: Due to licensing issues, we cannot offer an automatic download option for this dataset. Please visit https://itrust.sutd.edu.sg/itrust-labs_datasets/dataset_info/ and fill in the form to request a download link. The required files are in the folder `WADI.A2_19 Nov 2019`. .. [Ahmed2017] Ahmed, Chuadhry Mujeeb, Venkata Reddy Palleti, and Aditya P. Mathur. "WADI: a water distribution testbed for research in the design of secure cyber physical systems." Proceedings of the 3rd international workshop on cyber-physical systems for smart water networks. 2017. :param path: Folder from which to load the dataset. :param training: Whether to load the training or the test set. :param standardize: Can be either a bool that decides whether to apply the dataset-dependent default standardization or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of common statistics on the training dataset (e.g., mean, std, and median for each feature). :param remove_startup: This removes the first 5 hours of the training set, during which the plant is starting. :param split: The authors removed some data points in v2 of the training dataset. Thus, there is a clear split at index 335998. Setting this to True returns two time series split at this location; otherwise, one long time series is returned. :param preprocess: Whether to set up the dataset for experiments. .. py:attribute:: path .. py:attribute:: processed_dir .. py:attribute:: training :value: True .. py:attribute:: remove_startup :value: True .. py:attribute:: split :value: True .. py:attribute:: startup_remove_amount :value: 18000 .. py:attribute:: split_index :value: 335999 .. py:attribute:: inputs :value: None .. py:attribute:: targets :value: None .. py:method:: load_data() -> Tuple[numpy.ndarray, numpy.ndarray] .. py:method:: __getitem__(item: int) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]] This should return the time series of the dataset.
For example, if the dataset has 5 independent time series, passing 0, ..., 4 as `item` should return these time series. The format is (inputs, targets), where inputs and targets are tuples of torch.Tensors. :param item: Index of the time series to return. :return: A tuple `(inputs, targets)`. .. py:method:: __len__() -> Optional[int] This should return the number of independent time series in the dataset. .. py:property:: seq_len :type: Optional[int] This should return the length of each time series. If the time series have different lengths, the return value should be a list that contains the length of each sequence. If all sequences are of equal length, this should return an int. .. py:property:: num_features :type: int Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension. .. py:method:: get_default_pipeline() -> Dict[str, Dict[str, Any]] :staticmethod: Return the default pipeline for this dataset that is used if the user does not specify a different pipeline. This must be a dict of the form:: { '': {'class': '', 'args': {'', ...}}, ... } .. py:method:: get_feature_names() :staticmethod: Return names for the features in the order they are present in the data tensors. :return: A list of strings with names for each feature.
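The `seq_len` convention described for the dataset classes above (an int when all time series share one length, otherwise a list with one length per series) can be illustrated with a small standalone sketch. The helper `summarize_seq_len` is hypothetical and not part of TimeSeAD; it uses plain Python lists in place of tensors and only mirrors the documented return convention.

```python
from typing import List, Sequence, Union


def summarize_seq_len(series: Sequence[Sequence[float]]) -> Union[int, List[int]]:
    """Hypothetical helper (not part of TimeSeAD) mirroring the documented
    ``seq_len`` convention: an int if all series have the same length,
    otherwise a list with the length of each series."""
    lengths = [len(ts) for ts in series]
    # Exactly one distinct length means every series is equally long.
    if len(set(lengths)) == 1:
        return lengths[0]
    # Mixed lengths (or no series at all): report one entry per series.
    return lengths


# Two series of equal length collapse to a single int.
equal = summarize_seq_len([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
# Series of different lengths are reported individually.
ragged = summarize_seq_len([[0.1, 0.2, 0.3], [0.4, 0.5]])
print(equal)
print(ragged)
```

A concrete dataset would presumably compute this from its loaded arrays; how each TimeSeAD class does so is not shown in this reference, so treat the sketch as an illustration of the return type only.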