timesead.data.wadi_dataset
Classes
WADIDataset – Implementation of the WAter DIstribution Dataset [Ahmed2017].
Module Contents
- class timesead.data.wadi_dataset.WADIDataset(path: str = os.path.join(DATA_DIRECTORY, 'wadi', 'WADI.A2_19 Nov 2019'), training: bool = True, standardize: bool | Callable[[pandas.DataFrame, Dict], pandas.DataFrame] = True, remove_startup: bool = True, split: bool = True, preprocess: bool = True)
Bases: timesead.data.dataset.BaseTSDataset
Implementation of the WAter DIstribution Dataset [Ahmed2017]. This dataset was recorded from a miniature water distribution network over the course of several weeks. Both the training and the test set consist of a single long time series, or of two time series, depending on the split parameter (see below). During testing, several attacks (cyber and physical) were carried out against the plant.
Note
Due to licensing issues, we cannot offer an automatic download option for this dataset. Please visit https://itrust.sutd.edu.sg/itrust-labs_datasets/dataset_info/ and fill in the form to request a download link. The required files are in the folder WADI.A2_19 Nov 2019.
- Parameters:
path (str) – Folder from which to load the dataset.
training (bool) – Whether to load the training or the test set.
standardize (Union[bool, Callable[[pandas.DataFrame, Dict], pandas.DataFrame]]) – Either a bool that decides whether to apply the dataset-dependent default standardization, or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of common statistics computed on the training dataset (i.e., mean, std, median, etc. for each feature).
remove_startup (bool) – Whether to remove the first 5 hours of the training set, during which the plant is starting up.
split (bool) – The authors removed some data points in v2 of the training dataset. Thus, there is a clear split at index 335998. If True, two time series split at this location are returned. Otherwise, one long time series is returned.
preprocess (bool) – Whether to set up the dataset for experiments.
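A custom standardize callable might look like the following minimal sketch. The function name median_center is hypothetical, and the stats keys 'median' and 'std' (each assumed to hold one value per feature, addressable by column name) are guesses based on the parameter description above, not something this page confirms. Inspect the stats dictionary you actually receive before relying on them.

```python
from typing import Dict

import pandas as pd


def median_center(dataframe: pd.DataFrame, stats: Dict) -> pd.DataFrame:
    """Hypothetical standardizer: subtract the training median, divide by the training std."""
    # ASSUMPTION: stats['median'] and stats['std'] hold one value per feature,
    # keyed (or ordered) like the columns of `dataframe`.
    median = pd.Series(stats['median'], index=dataframe.columns, dtype=float)
    std = pd.Series(stats['std'], index=dataframe.columns, dtype=float)
    # Constant features have std == 0; divide by 1 instead to avoid inf/NaN.
    std = std.replace(0.0, 1.0)
    # DataFrame-minus-Series aligns the Series index with the column labels.
    return (dataframe - median) / std
```

Under these assumptions, the function would be passed as the standardize argument in place of True or False.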
- path
- processed_dir
- training = True
- remove_startup = True
- split = True
- startup_remove_amount = 18000
- split_index = 335999
- inputs = None
- targets = None
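The attribute values above hint at how remove_startup and split could interact. The sketch below is not the library's implementation. It assumes that split_index refers to a row position in the untrimmed training series and therefore has to be shifted when the startup phase is cut off. The helper name split_training_series is hypothetical. Note that 18000 equals 5 * 60 * 60, which matches the documented 5 hours if the data holds one sample per second.

```python
from typing import List, Sequence

STARTUP_REMOVE_AMOUNT = 18000  # value listed in the attributes above
SPLIT_INDEX = 335999           # value listed in the attributes above


def split_training_series(values: Sequence,
                          remove_startup: bool = True,
                          split: bool = True,
                          startup_remove_amount: int = STARTUP_REMOVE_AMOUNT,
                          split_index: int = SPLIT_INDEX) -> List[Sequence]:
    """Hypothetical helper: trim the startup phase, then optionally split in two."""
    # Number of leading rows dropped from the series.
    offset = startup_remove_amount if remove_startup else 0
    trimmed = values[offset:]
    if not split:
        return [trimmed]
    # ASSUMPTION: split_index counts rows of the untrimmed series,
    # so it is shifted by the number of rows removed at the start.
    cut = split_index - offset
    return [trimmed[:cut], trimmed[cut:]]
```

For example, with a toy series of ten values, startup_remove_amount=2 and split_index=6, the helper returns the rows 2 to 5 and the rows 6 to 9 as two separate series.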
- load_data() Tuple[numpy.ndarray, numpy.ndarray]
- Return type:
Tuple[numpy.ndarray, numpy.ndarray]
- __getitem__(item: int) Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]
This should return the time series of the dataset. That is, if the dataset has 5 independent time series, passing 0, …, 4 as item should return these time series. The format is (inputs, targets), where inputs and targets are tuples of torch.Tensors.
- Parameters:
item (int) – Index of the time series to return.
- Return type:
Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]
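To illustrate the (inputs, targets) convention, the following sketch walks over every time series of a dataset and records the shape of the first input and first target tensor. The helper series_shapes is hypothetical and works with any object that follows the convention described above. It assumes that len(dataset) returns an int and that each tuple element exposes a .shape attribute, as torch.Tensor does.

```python
from typing import Any, List, Tuple


def series_shapes(dataset: Any) -> List[Tuple[tuple, tuple]]:
    """Hypothetical helper: collect (input shape, target shape) per time series."""
    shapes = []
    # ASSUMPTION: len(dataset) is an int, not None.
    for index in range(len(dataset)):
        # Each item is a pair of tuples: (inputs, targets).
        inputs, targets = dataset[index]
        # Only the first tensor of each tuple is inspected here.
        shapes.append((tuple(inputs[0].shape), tuple(targets[0].shape)))
    return shapes
```

Going by the description of the split parameter, a training set loaded with split=True would be expected to yield two entries and one loaded with split=False a single entry.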
- __len__() int | None
This should return the number of independent time series in the dataset.
- Return type:
Optional[int]
- property seq_len: int | None
This should return the length of each time series. If the time series have different lengths, the return value should be a list that contains the length of each sequence. If all sequences are of equal length, this should return an int.
- Return type:
Optional[int]
- property num_features: int
Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.
- Return type:
int
- static get_default_pipeline() Dict[str, Dict[str, Any]]
Return the default pipeline for this dataset that is used if the user does not specify a different pipeline. This must be a dict of the form:
{ '<name>': {'class': '<name-of-transform-class>', 'args': {'<args-for-constructor>', ...}}, ... }
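A concrete instance of this form, together with a small structural check, could look like the sketch below. The transform class name 'ExampleWindowTransform', its argument window_size, and the helper check_pipeline are all hypothetical. The sketch also assumes that 'args' maps constructor keyword names to values, which the template above does not spell out.

```python
from typing import Any, Dict


def check_pipeline(pipeline: Dict[str, Dict[str, Any]]) -> bool:
    """Hypothetical check that a pipeline dict follows the documented form."""
    for name, spec in pipeline.items():
        # Every entry needs a string name and a dict describing the transform.
        if not isinstance(name, str) or not isinstance(spec, dict):
            return False
        # 'class' must name the transform class as a string.
        if not isinstance(spec.get('class'), str):
            return False
        # ASSUMPTION: 'args' is a mapping of constructor keyword arguments.
        if not isinstance(spec.get('args', {}), dict):
            return False
    return True


# Placeholder pipeline; the class name and argument are illustrative only.
example_pipeline = {
    'window': {'class': 'ExampleWindowTransform', 'args': {'window_size': 100}},
}
```

The check only validates the shape of the dictionary. Whether a given class name resolves to an existing transform is up to the library.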
- static get_feature_names()
Return names for the features in the order they are present in the data tensors.
- Returns:
A list of strings with names for each feature.