timesead.data

This package contains code for loading and processing data. Each dataset class inherits from BaseTSDataset.

Classes

`BaseTSDataset`	Base class for all time-series datasets in TimeSeAD. Implementing the members in this abstract class provides the data pipeline system with the necessary information to process the data correctly.
`ExathlonDataset`	Implements the Exathlon dataset from [Jacob2021].
`MiniSMDDataset`	This is a condensed version of the `SMDDataset` containing only shortened time series for two different machines.
`SMAPDataset`	Implementation of the SMAP dataset [Hundman2018].
`MSLDataset`	Implementation of the MSL dataset [Hundman2018].
`SMDDataset`	Implementation of the Server Machine Dataset [Su2019].
`SWaTDataset`	Implementation of the Secure WAter Treatment Dataset [Goh2016].
`TEPDataset`	Implementation of the Tennessee Eastman Process Dataset [Downs1993].
`WADIDataset`	Implementation of the WAter DIstribution Dataset [Ahmed2017].

Package Contents

class timesead.data.BaseTSDataset

Bases: abc.ABC, torch.utils.data.Dataset

Base class for all time-series datasets in TimeSeAD. Implementing the members in this abstract class provides the data pipeline system with the necessary information to process the data correctly.

abstract __len__() → int

This should return the number of independent time series in the dataset.

Return type:: int

property seq_len: int | List[int]

Abstractmethod:
Return type:: Union[int, List[int]]

This should return the length of each time series. If the time series have different lengths, the return value should be a list that contains the length of each sequence. If all sequences are of equal length, this should return an int.
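Downstream code often wants one length per series regardless of which form seq_len takes. A minimal sketch of such a normalisation follows; lengths_per_series is a hypothetical helper and not part of TimeSeAD.

```python
from typing import List, Union


def lengths_per_series(seq_len: Union[int, List[int]], num_series: int) -> List[int]:
    """Expand a seq_len value into one length per time series.

    Hypothetical helper, not part of TimeSeAD: it only illustrates the two
    forms that seq_len may take according to the description above.
    """
    if isinstance(seq_len, int):
        # All series share the same length.
        return [seq_len] * num_series
    lengths = list(seq_len)
    if len(lengths) != num_series:
        raise ValueError("expected one length per time series")
    return lengths
```

Here num_series would typically be len(dataset).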

property num_features: int | Tuple[int, Ellipsis]

Abstractmethod:
Return type:: Union[int, Tuple[int, Ellipsis]]

Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.

static get_default_pipeline() → Dict[str, Dict[str, Any]]

Abstractmethod:
Return type:: Dict[str, Dict[str, Any]]

Return the default pipeline for this dataset that is used if the user does not specify a different pipeline. This must be a dict of the form:

{
    '<name>': {'class': '<name-of-transform-class>', 'args': {'<args-for-constructor>', ...}},
    ...
}
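As an illustration of this form, the sketch below builds a small pipeline dict and checks its shape. The step names, the transform class names and their arguments are placeholders invented for this example, and it assumes that 'args' holds keyword arguments for the transform constructor.

```python
from typing import Any, Dict

# Placeholder names only: 'ExampleWindowTransform' and 'ExampleSubsampleTransform'
# are invented for illustration and are not real TimeSeAD transforms.
example_pipeline: Dict[str, Dict[str, Any]] = {
    "window": {"class": "ExampleWindowTransform", "args": {"window_size": 100}},
    "subsample": {"class": "ExampleSubsampleTransform", "args": {"factor": 5}},
}


def check_pipeline(pipeline: Dict[str, Dict[str, Any]]) -> None:
    """Raise ValueError if an entry lacks the 'class' or 'args' key.

    Assumption: 'args' is a dict of keyword arguments for the constructor.
    """
    for name, spec in pipeline.items():
        if not isinstance(spec.get("class"), str):
            raise ValueError(f"step {name!r} needs a 'class' string")
        if not isinstance(spec.get("args"), dict):
            raise ValueError(f"step {name!r} needs an 'args' dict")


check_pipeline(example_pipeline)  # passes silently for a well-formed dict
```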

static get_feature_names() → List[str]

Abstractmethod:
Return type:: List[str]

Return names for the features in the order they are present in the data tensors.

Returns:: A list of strings with names for each feature.
Return type:: List[str]

abstract __getitem__(index: int) → Tuple[Tuple[torch.Tensor, Ellipsis], Tuple[torch.Tensor, Ellipsis]]

Access the time series at position index and its corresponding label sequence. A call to this function should return a single time series that was sampled independently of the other time series in this dataset.

Parameters:: index (int) – The zero-based index of the time series to retrieve.
Returns:: A tuple (inputs, targets), where inputs is again a tuple of Tensors with shape (T, D*), where D* can vary between the tensors. targets contains labels for the time series as tensors of shape (T,).
Return type:: Tuple[Tuple[torch.Tensor, Ellipsis], Tuple[torch.Tensor, Ellipsis]]
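To make the contract concrete, here is a toy class that exposes the same members. It is a sketch under stated assumptions: it does not import timesead or torch, nested lists stand in for tensors, and ToyDataset and its feature names are hypothetical. A real dataset would subclass BaseTSDataset and return torch.Tensor objects.

```python
from typing import Any, Dict, List, Tuple, Union

Series = List[List[float]]  # stand-in for a tensor of shape (T, D)
Labels = List[int]          # stand-in for a tensor of shape (T,)


class ToyDataset:
    """Holds independent series and exposes the members described above."""

    def __init__(self, inputs: List[Series], targets: List[Labels]) -> None:
        if not inputs or len(inputs) != len(targets):
            raise ValueError("need at least one series and one label sequence per series")
        self.inputs = inputs
        self.targets = targets

    def __len__(self) -> int:
        # Number of independent time series, not the number of time steps.
        return len(self.inputs)

    @property
    def seq_len(self) -> Union[int, List[int]]:
        lengths = [len(series) for series in self.inputs]
        # A single int if all series share one length, else one entry per series.
        return lengths[0] if len(set(lengths)) == 1 else lengths

    @property
    def num_features(self) -> int:
        return len(self.inputs[0][0])

    @staticmethod
    def get_default_pipeline() -> Dict[str, Dict[str, Any]]:
        return {}  # no default transforms in this toy example

    @staticmethod
    def get_feature_names() -> List[str]:
        return ["f0", "f1"]  # hypothetical names, one per feature column

    def __getitem__(self, index: int) -> Tuple[Tuple[Series, ...], Tuple[Labels, ...]]:
        # Both elements are tuples so that several input or target tensors fit.
        return (self.inputs[index],), (self.targets[index],)


toy = ToyDataset(
    inputs=[[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [[1.0, 1.1], [1.2, 1.3]]],
    targets=[[0, 0, 1], [0, 1]],
)
```

In this toy instance the two series have different lengths, so seq_len is a list rather than an int.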

class timesead.data.ExathlonDataset(dataset_path: str = os.path.join(DATA_DIRECTORY, 'exathlon'), app_id: int = 1, training: bool = True, standardize: bool | Callable[[pandas.DataFrame, Dict], pandas.DataFrame] = True, download: bool = True, preprocess: bool = True)

Bases: timesead.data.dataset.BaseTSDataset

Implements the Exathlon dataset from [Jacob2021]. The data was collected by running different applications on a Spark cluster and recording metrics from the Spark service and the worker nodes. We consider the trace for each app a separate dataset. You can control which app trace to load by setting the app_id parameter.

Note

The Exathlon dataset consists of more than 2000 raw features that we reduce to 19 aggregated features as described in [Jacob2021]. This is done in the preprocess step during the class initialization.

Note

Automatically downloading the dataset via the download option requires git to be installed on your system and is currently only tested on Linux.

Warning

This dataset requires preprocessed data. To run preprocessing, set the preprocess argument to True. If the data has not been preprocessed, the class raises a RuntimeError.

[Jacob2021]

V. Jacob, F. Song, A. Stiegler, B. Rad, Y. Diao, and N. Tatbul. Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series. Proceedings of the VLDB Endowment (PVLDB), 14(11): 2613 - 2626, 2021.

Parameters:

dataset_path (str) – Folder from which to load the dataset.
app_id (int) – ID of the app whose trace to load. Must be in [1-6, 9, 10].
training (bool) – Whether to load the training or the test set.
standardize (Union[bool, Callable[[pandas.DataFrame, Dict], pandas.DataFrame]]) – Either a bool that decides whether to apply the dataset-dependent default standardization, or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of common statistics on the training dataset (e.g., mean, std, and median for each feature).
download (bool) – Whether to download the dataset if it doesn’t exist.
preprocess (bool) – Whether to set up the dataset for experiments.
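A custom standardize function could look roughly like the sketch below. It assumes that stats exposes per-feature entries under the keys 'mean' and 'std'; the key names are not spelled out here, so treat them as placeholders. zscore_standardize and its eps parameter are hypothetical.

```python
from typing import Any, Dict


def zscore_standardize(dataframe: Any, stats: Dict[str, Any], eps: float = 1e-8) -> Any:
    """Centre and scale every feature with training-set statistics.

    Assumption: `stats` has 'mean' and 'std' entries (key names are a guess).
    The arithmetic works for any operands that support - and /, for example
    a pandas DataFrame combined with per-column Series.
    """
    # eps guards against division by zero for constant features.
    return (dataframe - stats["mean"]) / (stats["std"] + eps)
```

Such a function would then be passed as the standardize argument in place of True or False.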

GITHUB_LINK = 'https://github.com/exathlonbenchmark/exathlon.git'

dataset_path

data_path

app_id = 1

training = True

inputs = None

targets = None

load_data() → Tuple[List[numpy.ndarray], List[numpy.ndarray]]

Return type:: Tuple[List[numpy.ndarray], List[numpy.ndarray]]

__getitem__(item: int) → Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

Access the time series at position item and its corresponding label sequence. A call to this function should return a single time series that was sampled independently of the other time series in this dataset.

Parameters:

item (int) – The zero-based index of the time series to retrieve.

Returns:

A tuple (inputs, targets), where inputs is again a tuple of Tensors with shape (T, D*), where D* can vary between the tensors. targets contains labels for the time series as tensors of shape (T,).

Return type:

Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

__len__() → int | None

This should return the number of independent time series in the dataset.

Return type:: Optional[int]

property seq_len: List[int]

This should return the length of each time series. If the time series have different lengths, the return value should be a list that contains the length of each sequence. If all sequences are of equal length, this should return an int.

Return type:: List[int]

property num_features: int

Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.

Return type:: int

static get_default_pipeline() → Dict[str, Dict[str, Any]]

Return the default pipeline for this dataset that is used if the user does not specify a different pipeline. This must be a dict of the form:

{
    '<name>': {'class': '<name-of-transform-class>', 'args': {'<args-for-constructor>', ...}},
    ...
}

Return type:: Dict[str, Dict[str, Any]]

static get_feature_names()

Return names for the features in the order they are present in the data tensors.

Returns:: A list of strings with names for each feature.

download() → None

Return type:: None

class timesead.data.MiniSMDDataset(server_id: int = 0, path: str = os.path.join(DATA_DIRECTORY, 'mini_smd'), training: bool = True, standardize: bool | Callable = True, preprocess: bool = True)

Bases: timesead.data.dataset.BaseTSDataset

This is a condensed version of the SMDDataset containing only shortened time series for two different machines. Mostly used for testing purposes.

Parameters:

server_id (int) – ID of the server to load. Must be 0 or 1.
path (str) – Path to the data.
training (bool) – Whether to load the training or the test set.
standardize (Union[bool, Callable]) – Either a bool that decides whether to apply the dataset-dependent default standardization, or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of common statistics on the training dataset (e.g., mean, std, and median for each feature).
preprocess (bool)

server_id = 0

path

training = True

standardize = True

inputs = None

targets = None

processed_dir

load_data() → Tuple[numpy.ndarray, numpy.ndarray]

Return type:: Tuple[numpy.ndarray, numpy.ndarray]

__getitem__(item: int) → Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

Parameters:: item (int)
Return type:: Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

__len__() → int | None

Return type:: Optional[int]

property seq_len: int | List[int]

Return type:: Union[int, List[int]]

property num_features: int

Return type:: int

static get_default_pipeline() → Dict[str, Dict[str, Any]]

Return type:: Dict[str, Dict[str, Any]]

static get_feature_names()

class timesead.data.SMAPDataset(data_path: str = os.path.join(DATA_DIRECTORY, 'smap'), channel_id: int = 0, training: bool = True, download: bool = True)

Bases: _SMAPBaseDataset

Implementation of the SMAP dataset [Hundman2018]. It consists of several monitored values from a single satellite and commands sent to that satellite. We consider the trace for each channel a separate dataset, where the monitored value is in the first feature dimension and the remaining binary features correspond to the commands.
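The column layout described above can be illustrated with a short sketch. Plain nested lists stand in for a (T, D) tensor, and split_channel is a hypothetical helper that is not part of TimeSeAD.

```python
from typing import List, Sequence, Tuple


def split_channel(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[List[float]]]:
    """Separate the monitored value (column 0) from the command columns."""
    telemetry = [row[0] for row in rows]
    commands = [list(row[1:]) for row in rows]
    return telemetry, commands


# Three time steps: one monitored value followed by two binary command flags.
telemetry, commands = split_channel([[0.7, 0, 1], [0.6, 0, 0], [0.9, 1, 0]])
```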

Parameters:

data_path (str) – Folder from which to load the dataset.
channel_id (int) – ID of the channel whose data to load. Must be in [0-54].
training (bool) – Whether to load the training or the test set.
download (bool) – Whether to download the dataset if it doesn’t exist.

property num_features: int | Tuple[int, Ellipsis]

Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.

Return type:: Union[int, Tuple[int, Ellipsis]]

class timesead.data.MSLDataset(data_path: str = os.path.join(DATA_DIRECTORY, 'smap'), channel_id: int = 0, training: bool = True, download: bool = True)

Bases: _SMAPBaseDataset

Implementation of the MSL dataset [Hundman2018]. It consists of several monitored values from a Mars rover and commands sent to the rover. We consider the trace for each channel a separate dataset, where the monitored value is in the first feature dimension and the remaining binary features correspond to the commands.

Parameters:

data_path (str) – Folder from which to load the dataset.
channel_id (int) – ID of the channel whose data to load. Must be in [0-26].
training (bool) – Whether to load the training or the test set.
download (bool) – Whether to download the dataset if it doesn’t exist.

property num_features: int | Tuple[int, Ellipsis]

Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.

Return type:: Union[int, Tuple[int, Ellipsis]]

class timesead.data.SMDDataset(server_id: int, path: str = os.path.join(DATA_DIRECTORY, 'smd'), training: bool = True, standardize: bool | Callable = True, download: bool = True, preprocess: bool = True)

Bases: timesead.data.dataset.BaseTSDataset

Implementation of the Server Machine Dataset [Su2019]. The data consists of traces from 28 different servers recorded over several weeks. We consider each trace to be a separate dataset.

Note

Automatically downloading the dataset currently requires that you have git installed on your system!

[Su2019]

Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, D. Pei. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019 Jul 25 (pp. 2828-2837).

Parameters:

path (str) – Folder from which to load the dataset.
server_id (int) – ID of the machine whose data to load. Must be in [0, …, 27].
training (bool) – Whether to load the training or the test set.
standardize (Union[bool, Callable]) – Either a bool that decides whether to apply the dataset-dependent default standardization, or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of common statistics on the training dataset (e.g., mean, std, and median for each feature).
download (bool) – Whether to download the dataset if it doesn’t exist.
preprocess (bool) – Whether to set up the dataset for experiments.

GITHUB_LINK = 'https://github.com/NetManAIOps/OmniAnomaly.git'

server_id

path

processed_dir

training = True

standardize = True

inputs = None

targets = None

load_data() → Tuple[numpy.ndarray, numpy.ndarray]

Return type:: Tuple[numpy.ndarray, numpy.ndarray]

__getitem__(item: int) → Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

Parameters:: item (int)
Return type:: Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

__len__() → int | None

Return type:: Optional[int]

property seq_len: int | List[int]

Return type:: Union[int, List[int]]

property num_features: int

Return type:: int

static get_default_pipeline() → Dict[str, Dict[str, Any]]

Return type:: Dict[str, Dict[str, Any]]

static get_feature_names()

download()

class timesead.data.SWaTDataset(path: str = os.path.join(DATA_DIRECTORY, 'SWaT', 'SWaT.A1 & A2_Dec 2015', 'Physical'), training: bool = True, standardize: bool | Callable = True, remove_startup: bool = True, preprocess: bool = True)

Bases: timesead.data.dataset.BaseTSDataset

Implementation of the Secure WAter Treatment Dataset [Goh2016]. This dataset was recorded from a miniature water treatment plant over the course of several weeks. The training set and the test set each consist of a single long time series. During testing, several attacks (cyber and physical) were carried out against the plant.

Note

Due to licensing issues, we cannot offer an automatic download option for this dataset. Please visit https://itrust.sutd.edu.sg/itrust-labs_datasets/dataset_info/ and fill in the form to request a download link. The required files are in the folder SWaT.A1 & A2_Dec 2015/Physical.

Warning

This dataset requires preprocessed data. To run preprocessing, set the preprocess argument to True. If the data has not been preprocessed, the class fails with an error.

[Goh2016]

Goh, Jonathan, et al. “A dataset to support research in the design of secure water treatment systems.” Critical Information Infrastructures Security: 11th International Conference, CRITIS 2016, Paris, France, October 10–12, 2016, Revised Selected Papers 11. Springer International Publishing, 2017.

Parameters:

path (str) – Path where the files “SWaT_Dataset_Normal_v1.csv” and “SWaT_Dataset_Attack_v0.csv” are located.
training (bool) – If True, this will load the training set consisting only of normal samples. Otherwise, loads the test set, which includes attacks.
standardize (Union[bool, Callable]) – If True, apply min-max scaling (based on the training set). This can also be a function that accepts a DataFrame as its positional argument and a keyword argument stats: a dictionary of training data statistics.
remove_startup (bool) – If True, this will remove the first 5 hours from the training set, as during this time the system was starting from an empty state. To be more exact, this removes only 4.5 hours, since the first 30 minutes were already removed in v1 of the Dataset.
preprocess (bool) – If True, set up the dataset to run experiments.
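Min-max scaling based on the training set can be sketched as follows. This illustrates the general technique rather than the library's implementation; nested lists stand in for DataFrames, fit_min_max and apply_min_max are hypothetical helpers, and mapping constant features to 0.0 is an assumption.

```python
from typing import List, Sequence, Tuple


def fit_min_max(train: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-feature minimum and maximum over the training rows."""
    columns = list(zip(*train))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(
    rows: Sequence[Sequence[float]], mins: Sequence[float], maxs: Sequence[float]
) -> List[List[float]]:
    """Scale rows with training statistics; constant features map to 0.0."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


train = [[0.0, 10.0], [2.0, 30.0], [4.0, 20.0]]
test = [[1.0, 40.0]]
mins, maxs = fit_min_max(train)          # statistics come from the training set only
scaled_train = apply_min_max(train, mins, maxs)
scaled_test = apply_min_max(test, mins, maxs)
```

Because the statistics come from the training set, scaled test values can fall outside [0, 1], as the second feature of the test row does here.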

path

processed_dir

training = True

remove_startup = True

inputs = None

targets = None

load_data() → Tuple[numpy.ndarray, numpy.ndarray]

Return type:: Tuple[numpy.ndarray, numpy.ndarray]

__getitem__(item: int) → Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

Parameters:: item (int)
Return type:: Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

__len__() → int | None

Return type:: Optional[int]

property seq_len: int | None

Return type:: Optional[int]

property num_features: int

Return type:: int

static get_default_pipeline() → Dict[str, Dict[str, Any]]

Return type:: Dict[str, Dict[str, Any]]

static get_feature_names()

class timesead.data.TEPDataset(path: str = os.path.join(DATA_DIRECTORY, 'TEP_harvard'), faults: int | List[int] | None = None, runs: int | List[int] | None = None, training: bool = True, standardize: bool = True, cache_size: int = 21, preprocess: bool = True)

Bases: timesead.data.dataset.BaseTSDataset

Implementation of the Tennessee Eastman Process Dataset [Downs1993]. The dataset was recorded by simulating a chemical process. The simulation also allows introducing 20 different faults into the process, which are used as anomaly labels. We implement the extended version of the dataset by Rieth et al. [Rieth2017], which runs the process several times with different RNG seeds.

Note

At the moment, we do not offer an automatic download option for this dataset. Please visit https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6C3JR1 and download the files manually.

Warning

This dataset requires preprocessed data. To run preprocessing, set the preprocess argument to True. If the data has not been preprocessed, the class fails with an error.

[Downs1993]

Downs, James J., and Ernest F. Vogel. “A plant-wide industrial process control problem.” Computers & chemical engineering 17.3 (1993): 245-255.

[Rieth2017]

Rieth, Cory A.; Amsel, Ben D.; Tran, Randy; Cook, Maia B., 2017, “Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation”, https://doi.org/10.7910/DVN/6C3JR1, Harvard Dataverse, V1

Parameters:

path (str) – Folder from which to load the dataset.
faults (Optional[Union[int, List[int]]]) – Specifies which faults to load data for. This can be a list of ints, where 0 stands for fault-free data and [1, …, 20] for the corresponding faults. Also supports a single int which means to only load data for this specific fault or None which loads data for all faults.
runs (Optional[Union[int, List[int]]]) – Specifies which runs to load for each fault. Each of the 500 runs was performed with a different random seed. This can either be specific runs passed as a list or a single int which means to load all runs from 0 up to this run. None means to load all available runs.
training (bool) – Whether to load the training or the test set.
standardize (bool) – Either a bool that decides whether to apply the dataset-dependent default standardization, or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of common statistics on the training dataset (e.g., mean, std, and median for each feature).
cache_size (int) – Depending on the number of faults and runs chosen, this dataset can be quite large. It is therefore loaded in a lazy manner from disk. Data for each fault is kept in memory in a FIFO cache to reduce access time. This parameter sets the size of that cache. Setting this to the number of faults that you want to load will mean that eventually the entire dataset will be cached in memory.
preprocess (bool) – Whether to set up the dataset for experiments.
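The caching behaviour described for cache_size can be pictured with a small first-in-first-out cache. This is a generic sketch, not the class's actual implementation; FIFOCache, its loader callback and fake_load are hypothetical names.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, List


class FIFOCache:
    """Keep at most max_size loaded items and evict the oldest insertion first."""

    def __init__(self, max_size: int, loader: Callable[[Hashable], Any]) -> None:
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self.loader = loader
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Any:
        if key in self._items:
            # FIFO: a hit does not refresh the entry's position (unlike LRU).
            return self._items[key]
        value = self.loader(key)  # lazy load, e.g. reading one fault from disk
        if len(self._items) >= self.max_size:
            self._items.popitem(last=False)  # drop the oldest insertion
        self._items[key] = value
        return value


loads: List[int] = []


def fake_load(fault: int) -> str:
    """Stand-in for an expensive disk read; records every call."""
    loads.append(fault)
    return f"data-for-fault-{fault}"


cache = FIFOCache(max_size=2, loader=fake_load)
for fault in (0, 1, 0, 2, 0):
    cache.get(fault)
```

With room for two entries, the second request for fault 0 is a cache hit, loading fault 2 evicts fault 0, and the final request has to load fault 0 again. A cache at least as large as the number of requested faults never evicts.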

cache

path

processed_dir

training = True

standardize = True

cache_size = 21

faults

runs = None
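The three accepted forms of faults and runs (None, a single int, or a list) could be normalised to explicit lists as sketched below. resolve_faults and resolve_runs are hypothetical helpers. The sketch assumes that run IDs are numbered from 0 to 499 and that a single int for runs includes the end point; the parameter description does not state either.

```python
from typing import List, Optional, Union

Selector = Optional[Union[int, List[int]]]

NUM_FAULT_IDS = 21  # 0 = fault-free data, 1..20 = the individual faults
NUM_RUNS = 500      # runs per fault; numbering them 0..499 is an assumption


def resolve_faults(faults: Selector) -> List[int]:
    """None selects every fault ID, an int selects exactly that fault."""
    if faults is None:
        return list(range(NUM_FAULT_IDS))
    if isinstance(faults, int):
        return [faults]
    return list(faults)


def resolve_runs(runs: Selector) -> List[int]:
    """None selects every run, an int selects runs 0 up to that run."""
    if runs is None:
        return list(range(NUM_RUNS))
    if isinstance(runs, int):
        # Assumption: the end point is included; the text does not say.
        return list(range(runs + 1))
    return list(runs)
```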

load_data(fault: int, runs: int | List[int] | None = None) → Tuple[numpy.ndarray, numpy.ndarray]

Parameters:

fault (int)
runs (Optional[Union[int, List[int]]])

Return type:

Tuple[numpy.ndarray, numpy.ndarray]

__getitem__(item: int) → Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

Access the time series at position item and its corresponding label sequence. A call to this function should return a single time series that was sampled independently of the other time series in this dataset.

Parameters:

item (int) – The zero-based index of the time series to retrieve.

Returns:

A tuple (inputs, targets), where inputs is again a tuple of Tensors with shape (T, D*), where D* can vary between the tensors. targets contains labels for the time series as tensors of shape (T,).

Return type:

Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

__len__() → int | None

This should return the number of independent time series in the dataset.

Return type:: Optional[int]

property seq_len: int | None

This should return the length of each time series. If the time series have different lengths, the return value should be a list that contains the length of each sequence. If all sequences are of equal length, this should return an int.

Return type:: Optional[int]

property num_features: int

Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.

Return type:: int

static get_feature_names() → List[str]

Return names for the features in the order they are present in the data tensors.

Returns:: A list of strings with names for each feature.
Return type:: List[str]

static get_default_pipeline() → Dict[str, Dict[str, Any]]

Return the default pipeline for this dataset that is used if the user does not specify a different pipeline. This must be a dict of the form:

{
    '<name>': {'class': '<name-of-transform-class>', 'args': {'<args-for-constructor>', ...}},
    ...
}

Return type:: Dict[str, Dict[str, Any]]

class timesead.data.WADIDataset(path: str = os.path.join(DATA_DIRECTORY, 'wadi', 'WADI.A2_19 Nov 2019'), training: bool = True, standardize: bool | Callable[[pandas.DataFrame, Dict], pandas.DataFrame] = True, remove_startup: bool = True, split: bool = True, preprocess: bool = True)

Bases: timesead.data.dataset.BaseTSDataset

Implementation of the WAter DIstribution Dataset [Ahmed2017]. This dataset was recorded from a miniature water distribution network over the course of several weeks. The test set consists of a single long time series. The training set consists of either one long time series or two time series, depending on the split parameter. During testing, several attacks (cyber and physical) were carried out against the plant.

Note

Due to licensing issues, we cannot offer an automatic download option for this dataset. Please visit https://itrust.sutd.edu.sg/itrust-labs_datasets/dataset_info/ and fill in the form to request a download link. The required files are in the folder WADI.A2_19 Nov 2019.

[Ahmed2017]

Ahmed, Chuadhry Mujeeb, Venkata Reddy Palleti, and Aditya P. Mathur. “WADI: a water distribution testbed for research in the design of secure cyber physical systems.” Proceedings of the 3rd international workshop on cyber-physical systems for smart water networks. 2017.

Parameters:

path (str) – Folder from which to load the dataset.
training (bool) – Whether to load the training or the test set.
standardize (Union[bool, Callable[[pandas.DataFrame, Dict], pandas.DataFrame]]) – Either a bool that decides whether to apply the dataset-dependent default standardization, or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of common statistics on the training dataset (e.g., mean, std, and median for each feature).
remove_startup (bool) – This removes the first 5 hours of the training set, during which the plant is starting.
split (bool) – The authors removed some data points in v2 of the training dataset. Thus, there is a clear split at index 335998. Setting this to True returns two time series split at this location. Otherwise, one long time series is returned.
preprocess (bool) – Whether to set up the dataset for experiments.
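How remove_startup and split might interact can be sketched on a toy sequence. This is an illustration under assumptions, not the class's actual logic: it assumes the split index refers to positions in the untrimmed series and marks the first element of the second segment. drop_startup_and_split is a hypothetical helper.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def drop_startup_and_split(
    series: Sequence[T], startup: int, split_index: Optional[int]
) -> List[List[T]]:
    """Drop the first `startup` points and optionally cut at `split_index`.

    Assumptions: split_index counts positions in the untrimmed series, it is
    the first index of the second segment, and startup < split_index.
    """
    if split_index is None:
        return [list(series[startup:])]
    return [list(series[startup:split_index]), list(series[split_index:])]


# Toy numbers for readability; the class itself documents
# startup_remove_amount = 18000 and split_index = 335999.
parts = drop_startup_and_split(list(range(10)), startup=2, split_index=6)
whole = drop_startup_and_split(list(range(10)), startup=2, split_index=None)
```

The value 18000 matches 5 hours of data if one sample is recorded per second (5 × 3600 = 18000).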

path

processed_dir

training = True

remove_startup = True

split = True

startup_remove_amount = 18000

split_index = 335999

inputs = None

targets = None

load_data() → Tuple[numpy.ndarray, numpy.ndarray]

Return type:: Tuple[numpy.ndarray, numpy.ndarray]

__getitem__(item: int) → Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

This should return the time series at position item. For example, if the dataset has 5 independent time series, passing 0, …, 4 as item should return each of them in turn. The format is (inputs, targets), where inputs and targets are tuples of torch.Tensors.

Parameters:: item (int) – Index of the time series to return.
Return type:: Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]

__len__() → int | None

This should return the number of independent time series in the dataset.

Return type:: Optional[int]

property seq_len: int | None

This should return the length of each time series. If the time series have different lengths, the return value should be a list that contains the length of each sequence. If all sequences are of equal length, this should return an int.

Return type:: Optional[int]

property num_features: int

Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.

Return type:: int

static get_default_pipeline() → Dict[str, Dict[str, Any]]

Return the default pipeline for this dataset that is used if the user does not specify a different pipeline. This must be a dict of the form:

{
    '<name>': {'class': '<name-of-transform-class>', 'args': {'<args-for-constructor>', ...}},
    ...
}

Return type:: Dict[str, Dict[str, Any]]

static get_feature_names()

Return names for the features in the order they are present in the data tensors.

Returns:: A list of strings with names for each feature.