timesead.data.preprocessing.exathlon
Attributes
Functions
Returns the start and (excluded) end indices of all contiguous ranges of 1s in the binary array y. |
|
|
Loads a Spark trace as a pd.DataFrame from its full input path. |
|
|
|
Spark-specific raw data loading. |
|
Spark-specific Anomaly column extension. |
|
Spark-specific handling of NaN values. |
|
Returns the period DataFrames with "handled" executor features. |
|
Returns the period DataFrames with "handled" OS features. |
|
Returns the period DataFrames with "handled" OS features. |
|
Adds executor features averaged across active executors, keeping or not the original ones. |
|
Adds node features averaged across nodes, keeping or not the original ones. |
|
Adds features differences, either keeping or dropping the original ones. |
|
Returns the period DataFrames resampled to sampling_period using the provided aggregation function. |
|
Preprocess Exathlon dataset for experiments |
Module Contents
- timesead.data.preprocessing.exathlon.APP_IDS
- timesead.data.preprocessing.exathlon.TRACE_TYPES = ('undisturbed', 'bursty_input', 'bursty_input_crash', 'stalled_input', 'cpu_contention',...
- timesead.data.preprocessing.exathlon.ANOMALY_TYPES = ('bursty_input', 'bursty_input_crash', 'stalled_input', 'cpu_contention', 'driver_failure',...
- timesead.data.preprocessing.exathlon.extract_binary_ranges_ids(y)
Returns the start and (excluded) end indices of all contiguous ranges of 1s in the binary array y.
- Parameters:
y (ndarray) – 1d-array of binary elements.
- Returns:
array of start, end indices: [[start_1, end_1], [start_2, end_2], …].
- Return type:
ndarray
- timesead.data.preprocessing.exathlon.load_trace(trace_path)
Loads a Spark trace as a pd.DataFrame from its full input path.
- Parameters:
trace_path (str) – full path of the trace to load (with file extension).
- Returns:
the trace indexed by time, with columns processed to be consistent between traces.
- Return type:
pd.DataFrame
- timesead.data.preprocessing.exathlon.load_labels(data_path: str) pandas.DataFrame
- Parameters:
data_path (str)
- Return type:
- timesead.data.preprocessing.exathlon.load_raw_data(data_path: str, app_id: int = 0, trace_types: List[str] = TRACE_TYPES)
Spark-specific raw data loading. - The loaded periods as the application(s) traces. - The labels as the ground-truth table gathering events information for disturbed traces. - The periods information as lists initialized in the form [file_name, trace_type].
- timesead.data.preprocessing.exathlon.add_anomaly_column(period_dfs, labels, periods_info, ignored_anomalies: str = 'none')
Spark-specific Anomaly column extension. Note: periods_info is assumed to be of the form [file_name, trace_type]. Anomaly will be set to 0 if the record is outside any anomaly range, otherwise it will be set to another value depending on the range type (as defined by utils.spark.ANOMALY_TYPES). => The label for a given range type corresponds to its index in the ANOMALY_TYPES list +1.
- Parameters:
ignored_anomalies (str)
- timesead.data.preprocessing.exathlon.get_handled_nans(period_dfs)
Spark-specific handling of NaN values. The only NaN values that were recorded were for inactive executors, found equivalent of them being -1.
- timesead.data.preprocessing.exathlon.get_handled_executor_features(period_dfs)
Returns the period DataFrames with “handled” executor features. By looking at the features, we saw that some executor spots sometimes went to -1, presumably because we did not receive data from them within a given delay, without meaning the executors were not active anymore. The goal of this method is to detect such scenarios, in which case all -1 features are replaced with their last valid occurrence. Note: it was checked that if a given feature for an executor spot is -1, then all features from that spot are also -1, except for 1 or 2 records of some traces, which we argue is negligible. => To handle these very few cases, we would explicitly set all features to -1.
- timesead.data.preprocessing.exathlon.get_handled_os_features(period_dfs)
Returns the period DataFrames with “handled” OS features. Similarly to some executor features, some OS features might happen to be -1 for no other reason than their real value not being sent fast enough by the monitoring software. This is here true for all encountered -1 values that do not span an entire trace. In such scenarios, we replace -1 features with their last valid value (or their next one if their last is not available).
- timesead.data.preprocessing.exathlon.handle_missing_values(period_dfs)
Returns the period DataFrames with “handled” OS features. Similarly to some executor features, some OS features might happen to be -1 for no other reason than their real value not being sent fast enough by the monitoring software. This is here true for all encountered -1 values that do not span an entire trace. In such scenarios, we replace -1 features with their last valid value (or their next one if their last is not available).
- timesead.data.preprocessing.exathlon.add_executors_avg(period_df, original_treatment)
Adds executor features averaged across active executors, keeping or not the original ones. An executor is defined as “inactive” for a given feature if the value of the feature for this executor is -1. Note: it was checked that if a given feature for an executor spot is -1, then all features from that spot are also -1, except for 1 or 2 records of some traces, which we argue is negligible. => To handle these few cases, we should have set all the features to -1 (see data.spark_manager.SparkManager.get_handled_executor_features).
- timesead.data.preprocessing.exathlon.add_nodes_avg(period_df, original_treatment)
Adds node features averaged across nodes, keeping or not the original ones.
- timesead.data.preprocessing.exathlon.add_differencing(period_df, diff_factor_str, original_treatment)
Adds features differences, either keeping or dropping the original ones.
- Parameters:
- Returns:
the input DataFrame with differenced features, with or without the original ones.
- Return type:
pd.DataFrame
- timesead.data.preprocessing.exathlon.get_resampled(period_dfs, sampling_period, agg='mean', anomaly_col=False, pre_sampling_period=None)
Returns the period DataFrames resampled to sampling_period using the provided aggregation function. If anomaly_col is False, periods indices will also be reset to start from a round date, in order to keep them aligned with any subsequently resampled labels.
- Parameters:
period_dfs (list) – the period DataFrames to resample.
sampling_period (str) – the new sampling period, as a valid argument to pd.Timedelta.
agg (str) – the aggregation function defining the resampling (e.g. mean, median or max).
anomaly_col (bool) – whether the provided DataFrames have an Anomaly column, to resample differently.
pre_sampling_period (str|None) – original sampling period of the DataFrames. If None and anomaly_col is False, the original sampling period will be inferred from the first two records of the first DataFrame.
- Returns:
the same periods with resampled records.
- Return type:
- timesead.data.preprocessing.exathlon.preprocess_exathlon_data(raw_data_dir: str, out_data_dir: str, app_ids: List[int] = set(APP_IDS) - {7, 8}, subsample: str = None)
Preprocess Exathlon dataset for experiments
- Parameters:
raw_data_dir (str) – Path to raw data folder in the dataset
out_data_dir (str) – Folder to output the processed data
app_ids (List[int]) – Which all app data should be processed
subsample (str) – The new sampling period for DataFrames. Should be a valid argument to
pandas.Timedelta