timesead.data.tep_dataset
Classes
Implementation of the Tennessee Eastman Process Dataset [Downs1993].
Module Contents
- class timesead.data.tep_dataset.TEPDataset(path: str = os.path.join(DATA_DIRECTORY, 'TEP_harvard'), faults: int | List[int] | None = None, runs: int | List[int] | None = None, training: bool = True, standardize: bool = True, cache_size: int = 21, preprocess: bool = True)
Bases:
timesead.data.dataset.BaseTSDataset
Implementation of the Tennessee Eastman Process Dataset [Downs1993]. The dataset was recorded by simulating a chemical process. The simulation can also introduce 20 different faults into the process, which are used as anomaly labels. We implement the extended version of the dataset by Rieth et al. [Rieth2017], which runs the process several times with different RNG seeds.
Note
At the moment, we do not offer an automatic download option for this dataset. Please visit https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6C3JR1 and download the files manually.
Warning
This dataset must be preprocessed before it can be used. Preprocessing is enabled by setting the preprocess argument. Without preprocessing, the class fails with an error.
[Downs1993] Downs, James J., and Ernest F. Vogel. “A plant-wide industrial process control problem.” Computers & Chemical Engineering 17.3 (1993): 245-255.
[Rieth2017] Rieth, Cory A.; Amsel, Ben D.; Tran, Randy; Cook, Maia B., 2017, “Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation”, https://doi.org/10.7910/DVN/6C3JR1, Harvard Dataverse, V1
- Parameters:
path (str) – Folder from which to load the dataset.
faults (Optional[Union[int, List[int]]]) – Specifies which faults to load data for. This can be a list of ints, where 0 stands for fault-free data and [1, …, 20] for the corresponding faults. Also supports a single int which means to only load data for this specific fault or None which loads data for all faults.
runs (Optional[Union[int, List[int]]]) – Specifies which runs to load for each fault. Each of the 500 runs was performed with a different random seed. This can either be specific runs passed as a list or a single int which means to load all runs from 0 up to this run. None means to load all available runs.
training (bool) – Whether to load the training or the test set.
standardize (bool) – Either a bool that decides whether to apply the dataset-dependent default standardization, or a function with signature (dataframe, stats) -> dataframe, where stats is a dictionary of common statistics on the training dataset (e.g., mean, std, and median for each feature).
cache_size (int) – Depending on the number of faults and runs chosen, this dataset can be quite large. It is therefore loaded in a lazy manner from disk. Data for each fault is kept in memory in a FIFO cache to reduce access time. This parameter sets the size of that cache. Setting this to the number of faults that you want to load will mean that eventually the entire dataset will be cached in memory.
preprocess (bool) – Whether to set up the dataset for experiments.
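The conventions described for faults, runs, and cache_size can be illustrated with a small standalone sketch. This is not the library's implementation: the helper names (normalize_faults, normalize_runs, FifoCache) are hypothetical, runs are assumed to be numbered 0 to 499, and a single int passed as runs is assumed to be an exclusive upper bound, which the description above does not specify.

```python
from collections import OrderedDict
from typing import Any, Callable, List, Optional, Union

NUM_FAULTS = 20  # faults 1..20; 0 denotes fault-free data
NUM_RUNS = 500   # runs per fault; ASSUMPTION: numbered 0..499

Selection = Optional[Union[int, List[int]]]


def normalize_faults(faults: Selection) -> List[int]:
    """Turn the `faults` argument into an explicit list of fault ids."""
    if faults is None:
        return list(range(NUM_FAULTS + 1))  # all faults, including fault-free
    if isinstance(faults, int):
        return [faults]  # only this specific fault
    return list(faults)


def normalize_runs(runs: Selection) -> List[int]:
    """Turn the `runs` argument into an explicit list of run ids."""
    if runs is None:
        return list(range(NUM_RUNS))  # all available runs
    if isinstance(runs, int):
        # ASSUMPTION: the upper bound is exclusive, as with range().
        return list(range(runs))
    return list(runs)


class FifoCache:
    """Hypothetical per-fault cache that evicts the oldest inserted entry."""

    def __init__(self, max_size: int, loader: Callable[[int], Any]):
        self.max_size = max_size
        self.loader = loader  # called on a cache miss, e.g. to read from disk
        self._store: "OrderedDict[int, Any]" = OrderedDict()

    def get(self, fault: int) -> Any:
        if fault in self._store:
            # A hit does not change the eviction order (FIFO, not LRU).
            return self._store[fault]
        value = self.loader(fault)
        if self.max_size > 0:
            if len(self._store) >= self.max_size:
                self._store.popitem(last=False)  # drop the oldest entry
            self._store[fault] = value
        return value
```

Under these assumptions, loading all faults yields 21 entries (0 through 20), so the default cache_size of 21 is large enough that no entry is ever evicted.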
- cache
- path
- processed_dir
- training = True
- standardize = True
- cache_size = 21
- faults
- runs = None
- load_data(fault: int, runs: int | List[int] | None = None) Tuple[numpy.ndarray, numpy.ndarray]
- Parameters:
fault (int)
runs (Optional[Union[int, List[int]]])
- Return type:
Tuple[numpy.ndarray, numpy.ndarray]
- __getitem__(item: int) Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]
Access the time series at position item and its corresponding label sequence. A call to this function should return a single time series that was sampled independently of the other time series in this dataset.
- Parameters:
item (int) – The zero-based index of the time series to retrieve.
- Returns:
A tuple (inputs, targets), where inputs is again a tuple of Tensors with shape (T, D*), where D* can vary between the tensors. targets contains labels for the time series as tensors of shape (T,).
- Return type:
Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]
- __len__() int | None
This should return the number of independent time series in the dataset.
- Return type:
Optional[int]
- property seq_len: int | None
This should return the length of each time series. If the time series have different lengths, the return value should be a list that contains the length of each sequence. If all sequences are of equal length, this should return an int.
- Return type:
Optional[int]
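The seq_len contract described above (a single int when all sequences share one length, otherwise a list of lengths) could be sketched as follows. The function name summarize_seq_len is hypothetical, and returning None for an empty collection is an assumption; the source only states that the return type is optional.

```python
from typing import List, Optional, Sequence, Union


def summarize_seq_len(lengths: Sequence[int]) -> Optional[Union[int, List[int]]]:
    """Collapse per-series lengths into the value a seq_len property might report."""
    if not lengths:
        # ASSUMPTION: no sequences means the length is unknown.
        return None
    first = lengths[0]
    if all(n == first for n in lengths):
        return first  # all sequences have equal length
    return list(lengths)  # lengths differ: report each one
```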
- property num_features: int
Number of features of each datapoint. This can also be a tuple if the data has more than one feature dimension.
- Return type:
int
- static get_feature_names() List[str]
Return names for the features in the order they are present in the data tensors.
- Returns:
A list of strings with names for each feature.
- Return type:
List[str]