Datasets

The progpy dataset subpackage is used to download labeled prognostics data for use in model building, analysis, or validation. Every dataset comes equipped with a load_data function which loads the specified data. Some datasets require a dataset number or id. This indicates the specific data to load from the larger dataset. The format of the data is specific to the dataset downloaded. Details of the specific datasets are summarized below:

Note

To use the dataset feature, you must install the requests package.
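One way to confirm this dependency before calling any loader is to probe for the package without importing it. The sketch below uses only the standard library; the helper name `has_requests` is hypothetical and is not part of progpy.

```python
import importlib.util


def has_requests() -> bool:
    """Return True if the `requests` package can be found in this environment."""
    return importlib.util.find_spec("requests") is not None


if not has_requests():
    # Installing from PyPI is the usual fix: pip install requests
    print("requests is missing; install it before using progpy.datasets")
```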

Variable Load Battery Data (nasa_battery)

progpy.datasets.nasa_battery.load_data(batt_id: str) -> tuple

New in version 1.3.0.

Loads data for one or more batteries from NASA’s PCoE Dataset, ‘11. Randomized Battery Usage Data Set’, available at https://www.nasa.gov/content/prognostics-center-of-excellence-data-set-repository

Parameters

batt_id (str) – Battery name from dataset (RW1-28)

Raises
  • ValueError – Battery not in dataset (should be RW1-28)

  • ValueError – Battery id must be a string or int

  • ConnectionError – Failed to download data. This may be because of issues with your internet connection or the datasets may have moved. Please check your internet connection and make sure you’re using the latest version of progpy.

Returns

Data and description as a tuple (description, data), where the data is a list of pandas DataFrames such that data[i] is the data for run i, corresponding with details[i], above. The columns of the dataframe are (‘relativeTime’, ‘current’ (amps), ‘voltage’, ‘temperature’ (°C)) in that order.

Return type

tuple[dict, list[pd.DataFrame]]
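As a rough illustration of the parameter and error handling documented above, the sketch below validates a battery id locally and then calls `load_data`. The helpers `normalize_batt_id` and `load_battery` are hypothetical and not part of progpy. The sketch also assumes that `load_data` accepts names of the form "RW3" and that the documented ConnectionError is Python's built-in exception.

```python
def normalize_batt_id(batt_id):
    """Hypothetical helper: map 3 or "rw3" to "RW3" and reject ids outside RW1-RW28."""
    if isinstance(batt_id, bool) or not isinstance(batt_id, (str, int)):
        raise ValueError("Battery id must be a string or int")
    text = str(batt_id).strip().upper()
    number = text[2:] if text.startswith("RW") else text
    if not number.isdecimal() or not 1 <= int(number) <= 28:
        raise ValueError("Battery not in dataset (should be RW1-28)")
    return f"RW{int(number)}"


def load_battery(batt_id):
    """Download one battery's data; returns (description, data), or None if the download fails."""
    # Imported lazily so the helper above can be used without progpy installed.
    from progpy.datasets import nasa_battery

    try:
        return nasa_battery.load_data(normalize_batt_id(batt_id))
    except ConnectionError as err:
        # Documented failure mode: no network, or the dataset has moved.
        print(f"Download failed: {err}")
        return None
```

Checking the id before the call means a mistyped name fails immediately rather than after a download attempt.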

CMAPSS Jet Engine Data (nasa_cmapss)

progpy.datasets.nasa_cmapss.load_data(dataset_id: int) -> tuple

New in version 1.3.0.

Loads data for one CMAPSS trajectory from NASA’s PCoE Dataset. See ‘6. Turbofan Engine Degradation Simulation Data Set’ at https://www.nasa.gov/content/prognostics-center-of-excellence-data-set-repository

Data Set: 1
Train trajectories: 100
Test trajectories: 100
Conditions: ONE (Sea Level)
Fault Modes: ONE (HPC Degradation)

Data Set: 2
Train trajectories: 260
Test trajectories: 259
Conditions: SIX
Fault Modes: ONE (HPC Degradation)

Data Set: 3
Train trajectories: 100
Test trajectories: 100
Conditions: ONE (Sea Level)
Fault Modes: TWO (HPC Degradation, Fan Degradation)

Data Set: 4
Train trajectories: 248
Test trajectories: 249
Conditions: SIX
Fault Modes: TWO (HPC Degradation, Fan Degradation)
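When choosing which dataset_id to load, it can help to hold the summary above in a small lookup structure. The sketch below transcribes those values into a dictionary. The names `CMAPSS_SUBSETS` and `select_subsets` are hypothetical and are not provided by progpy.

```python
# Values transcribed from the data set summary above.
CMAPSS_SUBSETS = {
    1: {"train": 100, "test": 100, "conditions": 1, "fault_modes": ("HPC",)},
    2: {"train": 260, "test": 259, "conditions": 6, "fault_modes": ("HPC",)},
    3: {"train": 100, "test": 100, "conditions": 1, "fault_modes": ("HPC", "Fan")},
    4: {"train": 248, "test": 249, "conditions": 6, "fault_modes": ("HPC", "Fan")},
}


def select_subsets(conditions=None, n_fault_modes=None):
    """Return the dataset ids that match the requested number of conditions and/or fault modes."""
    return [
        ds_id
        for ds_id, info in CMAPSS_SUBSETS.items()
        if (conditions is None or info["conditions"] == conditions)
        and (n_fault_modes is None or len(info["fault_modes"]) == n_fault_modes)
    ]
```

For example, `select_subsets(conditions=6)` picks out the two data sets recorded under six operating conditions.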

Each data set consists of multiple multivariate time series and is further divided into training and test subsets. Each time series is from a different engine, i.e., the data can be considered to be from a fleet of engines of the same type. Each engine starts with a different degree of initial wear and manufacturing variation, which is unknown to the user. This wear and variation is considered normal, i.e., it is not considered a fault condition. There are three operational settings that have a substantial effect on engine performance. These settings are also included in the data. The data is contaminated with sensor noise.

The engine is operating normally at the start of each time series and develops a fault at some point during the series. In the training set, the fault grows in magnitude until system failure. In the test set, the time series ends some time prior to system failure. The objective of the competition is to predict the number of remaining operational cycles before failure in the test set, i.e., the number of operational cycles after the last cycle that the engine will continue to operate. A vector of true Remaining Useful Life (RUL) values is also provided for the test data [0].
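Because each training series runs to failure, a common way to label it is to take the RUL at a given cycle as that engine's last recorded cycle minus the current cycle. For the truncated test series, the true RUL at the last observed cycle is added on top. The sketch below works on plain (unit, cycle) pairs rather than on the DataFrames returned by load_data. The function names, and the `true_rul` mapping from unit number to RUL, are assumptions made for illustration.

```python
def _last_cycles(rows):
    """Map each unit number to the largest cycle observed for it."""
    last = {}
    for unit, cycle in rows:
        last[unit] = max(last.get(unit, cycle), cycle)
    return last


def training_rul(rows):
    """RUL per (unit, cycle) row, assuming each unit's last recorded cycle is its failure."""
    rows = list(rows)
    last = _last_cycles(rows)
    return [last[unit] - cycle for unit, cycle in rows]


def held_out_rul(rows, true_rul):
    """RUL per row for truncated series; true_rul[unit] is the RUL at the unit's last observed cycle."""
    rows = list(rows)
    last = _last_cycles(rows)
    return [last[unit] - cycle + true_rul[unit] for unit, cycle in rows]
```

With a DataFrame, the same labels could be derived by grouping on the unit number column and subtracting each row's cycle from the group maximum.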

Parameters

dataset_id (int) – Dataset id

Raises
  • ValueError – Data not in dataset (should be 1-4)

  • ConnectionError – Failed to download data. This may be because of issues with your internet connection or the datasets may have moved. Please check your internet connection and make sure you’re using the latest version of progpy.

Returns

Tuple of data: (training data, testing data, time of end of life)

Each row of the training and testing data is a snapshot of data taken during a single operational cycle, and each column is a different variable. The columns in the pandas DataFrame correspond to:
  1. unit number

  2. time, in cycles

  3. operational setting 1

  4. operational setting 2

  5. operational setting 3

  6. sensor measurement 1

  7. sensor measurement 2

  ...

  26. sensor measurement 21
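The layout above amounts to 2 + 3 + 21 = 26 columns. The sketch below generates one label per column. The label strings are illustrative only, and the DataFrames returned by load_data may use different column names.

```python
N_SETTINGS = 3   # operational settings
N_SENSORS = 21   # sensor measurements

# Illustrative labels only; not necessarily the names progpy assigns.
CMAPSS_COLUMNS = (
    ["unit", "cycle"]
    + [f"setting_{i}" for i in range(1, N_SETTINGS + 1)]
    + [f"sensor_{i}" for i in range(1, N_SENSORS + 1)]
)


def column_position(name):
    """Return the 1-based position of a column in the layout."""
    return CMAPSS_COLUMNS.index(name) + 1
```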

Return type

tuple[pd.DataFrame, pd.DataFrame, np.array]
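A minimal sketch of calling the loader and unpacking the documented three-element tuple follows. The wrapper `load_cmapss` and the dictionary keys are hypothetical. The id check repeats the documented 1-4 range locally so that a bad id fails before any download is attempted.

```python
def load_cmapss(dataset_id):
    """Hypothetical wrapper: validate the id, then unpack the documented 3-tuple into a dict."""
    if isinstance(dataset_id, bool) or not isinstance(dataset_id, int) or not 1 <= dataset_id <= 4:
        raise ValueError("Data not in dataset (should be 1-4)")

    # Imported lazily so the validation above can be used without progpy installed.
    from progpy.datasets import nasa_cmapss

    train, test, end_of_life = nasa_cmapss.load_data(dataset_id)
    return {"train": train, "test": test, "end_of_life": end_of_life}
```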

References

[0] A. Saxena, K. Goebel, D. Simon, and N. Eklund, Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation, in the Proceedings of the 1st International Conference on Prognostics and Health Management (PHM08), Denver CO, Oct 2008.