Commit 79eabb64 authored by Antoine Guillaume

Initial commit

# Time series classification for predictive maintenance on event logs
This is the companion repository of the "Time series classification for predictive maintenance on event logs" paper.
# Abstract
Time series classification (TSC) has gained a lot of attention in the past decade, and a number of methods for representing and classifying time series have been proposed.
Nowadays, methods based on convolutional networks and ensemble techniques represent the state of the art for time series classification. Techniques transforming time series into images or text also provide reliable ways to extract meaningful features or representations of time series. We compare the state-of-the-art representation and classification methods on a specific application, namely predictive maintenance from sequences of event logs. The contributions of this paper are twofold: introducing a new data set for predictive maintenance on automated teller machine (ATM) log data, and comparing the performance of different representation methods for predicting the occurrence of a breakdown. The problem is difficult because, unlike the classic case of predictive maintenance via signals from sensors, we have sequences of discrete event logs occurring at any time, and the length of the sequences, corresponding to life cycles, varies a lot.
When using this repository or the ATM dataset, please cite:
** Link to paper **
## Required packages
The experiments were conducted with Python 3.8. The following packages are required to run the script:
* numpy
* scikit-learn
* pyts
* matrixprofile
* sktime
* pandas
If you wish to run ResNet for image classification, you will also need TensorFlow 2.x.
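
Before launching a long run, it can help to confirm that the packages listed above are importable. The snippet below is a minimal sketch, not part of the repository: the `missing_packages` helper is hypothetical, and the import names (for example `sklearn` for scikit-learn) are assumptions based on how these packages are usually imported.

```python
import importlib.util

# Assumed mapping from package name (as listed above) to import name.
REQUIRED = {
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "pyts": "pyts",
    "matrixprofile": "matrixprofile",
    "sktime": "sktime",
    "pandas": "pandas",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose top-level module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Add `"tensorflow": "tensorflow"` to the mapping if you plan to run ResNet.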
## How to get the ATM dataset
The ATM dataset is the property of equensWorldline. To obtain it, you must first send an email to "intellectual-property-team-worldline@worldline.com" and "antoine.guillaume@equensworldline.com" to ask for authorization.
The compressed archive is around 50 MB and expands to about 575 MB.
## Parameters & Configuration
Configuration parameters are located at the beginning of the script. You MUST change base_path to match the current directory of this project. The other parameters can be left as is to reproduce the results of the paper.
To check or change the algorithm parameters, look at the custom wrapper classes, in which all algorithms are redefined to avoid errors. A parameter that is not specified in a wrapper's constructor keeps its default value.
ResNet is commented out in the code (around line 880), so the other algorithms run unaffected without a TensorFlow installation or a GPU.
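
As an alternative to hard-coding base_path, the project directory can be derived from the location of the script itself. This is only a sketch under assumptions: `project_base_path` is a hypothetical helper, and the trailing slash is assumed because the paths in this repository are built by string concatenation (for example `'results/TSCHIEF/'`).

```python
from pathlib import Path


def project_base_path(script_file):
    """Return the directory containing `script_file`, with a trailing slash."""
    # resolve() makes the path absolute; as_posix() keeps forward slashes.
    return Path(script_file).resolve().parent.as_posix() + "/"


if __name__ == "__main__":
    # Inside a script, __file__ is the path of the script being run.
    base_path = project_base_path(__file__)
    print(base_path)
```

Note that `__file__` is not defined when the code is pasted into a notebook cell, so in that case the path still has to be set by hand.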
## Usage
Extract the files of the dataset archive located in ~/datasets into the dataset folder, then run the script:
```bash
python CV_script.py
```
The bash script launch_cross_validation.sh can be used on Linux systems to run this as a background task with nohup. You MUST change the path to Python in the script to your own. The script can also be imported inside a Jupyter notebook environment.
It is recommended that you run this script on a machine with at least 10 CPU cores, so all cross validation steps for a pipeline can run at the same time.
Expected runtime is 7 to 8 hours with 10 cores.
To obtain results from TS-CHIEF: in its default configuration, CV_script exports data formatted for the Java version of TS-CHIEF available at https://github.com/dotnet54/TS-CHIEF. A jar executable is already packaged, including some debugging changes that make it runnable. Once the TS-CHIEF data is exported, you can run it with the following script (for Linux systems):
```bash
bash ts-chief_script.sh
```
If you changed the path to the data for TS-CHIEF, make sure to apply the same change in this script.
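
Because each TS-CHIEF run is long, it may be worth checking that every exported file is present before starting. The sketch below is not part of the repository: the helper names are hypothetical, and the file pattern, directory, size, and loop ranges are copied from the defaults in ts-chief_script.sh (`size=1000`, splits 0 to 9, R1 to R4). Adjust them if you changed the export settings.

```python
from pathlib import Path


def expected_tschief_files(data_dir="datasets/TSCHIEF", size=1000,
                           n_cv=10, r_ids=(1, 2, 3, 4)):
    """List the train/test CSV files that ts-chief_script.sh will read."""
    files = []
    for id_r in r_ids:
        for id_cv in range(n_cv):
            for split in ("Train", "Test"):
                name = f"data_{split}_{size}_{id_cv}_R{id_r}.csv"
                files.append(Path(data_dir) / name)
    return files


def missing_tschief_files(**kwargs):
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_tschief_files(**kwargs) if not p.is_file()]


if __name__ == "__main__":
    expected = expected_tschief_files()
    missing = missing_tschief_files()
    print(f"{len(missing)} of {len(expected)} expected files are missing")
    for path in missing:
        print(path)
```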
The runtime of this script is extremely long: one iteration takes about 4 hours, and 40 iterations are needed to cover all cross validation splits and data encodings (roughly 160 hours in total when run sequentially). The output can then be formatted the same way as the other results with the Python script:
```bash
python TSCHIEF_results_to_csv.py
```
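
For reference, the formatting script reports each metric as a truncated `mean(+/- std)` string over the cross validation splits, and uses a custom score named CFI defined as (fp + fn) / (tp + fp + fn), which is one minus the Jaccard index of the positive class. The following pure-Python sketch illustrates both on made-up labels. It is not the script itself, and the lowercase helper names `cfi` and `summarize` are hypothetical.

```python
from statistics import mean, pstdev


def cfi(y_true, y_pred):
    """(fp + fn) / (tp + fp + fn) for the positive class 1; 1 if undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + fp + fn
    return (fp + fn) / total if total > 0 else 1


def summarize(scores):
    """Format scores as 'mean(+/- std)', each truncated to 5 characters."""
    # pstdev is the population standard deviation, like np.std by default.
    return f"{str(mean(scores))[:5]}(+/- {str(pstdev(scores))[:5]})"


if __name__ == "__main__":
    # Illustrative labels only, not taken from the ATM dataset.
    y_true = [1, 1, 0, 0]
    y_pred = [1, 0, 1, 0]
    print(cfi(y_true, y_pred))  # one tp, one fp, one fn: 2/3
    print(summarize([0.5, 0.5]))
```

A lower CFI is better: 0 means no false positives or false negatives, and 1 means no true positives.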
## Contributing
If you encounter a bug, please open an issue so we can work on a fix!
## Citations
[1]: [Johann Faouzi and Hicham Janati, "pyts: A Python Package for Time Series Classification", Journal of Machine Learning Research 2020](https://pyts.readthedocs.io/)
[2]: [Markus Loning, Anthony Bagnall, Sajaysurya Ganesh, Viktor Kazakov, Jason Lines and Franz J. Kiraly, "sktime: A Unified Interface for Machine Learning with Time Series", Workshop on Systems for ML at NeurIPS 2019](https://www.sktime.org/en/latest/)
[3]: [The Scikit-learn development team, "Scikit-learn: Machine Learning in Python", Journal of Machine Learning Research 2011](https://scikit-learn.org/stable/)
[4]: [The Pandas development team, "Pandas: Data Structures for Statistical Computing in Python"](https://pandas.pydata.org/)
[5]: [The Numpy development team, "Array programming with NumPy", Nature 2020](https://numpy.org/)
## License
[GPL-3.0](https://www.gnu.org/licenses/gpl-3.0.en.html)
import pandas as pd
import numpy as np
from pathlib import Path
from os import listdir
from sklearn.metrics import f1_score, balanced_accuracy_score
base_path = 'results/TSCHIEF/'
out_path = 'results/TSCHIEF/'
def CFI(y_test, y_pred):
    # Accept lists, arrays or pandas Series and compare them as arrays.
    y_test = np.asarray(y_test)
    y_pred = np.asarray(y_pred)
    # Count true positives, false positives and false negatives (positive class is 1).
    tp = np.sum((y_test == 1) & (y_pred == 1))
    fp = np.sum((y_test == 0) & (y_pred == 1))
    fn = np.sum((y_test == 1) & (y_pred == 0))
    # Share of errors among all samples involving the positive class; 1 if there are none.
    return (fp+fn)/(tp+fp+fn) if (tp+fp+fn) > 0 else 1
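run_list = listdir(base_path)
# Collect every CSV result file of every run into one flat list
# (building an array from sub-lists of unequal length fails in recent NumPy versions).
results_path = [f for path in run_list
                for f in Path(base_path, path).rglob("*.csv") if f.is_file()]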
df_dict = {'name':['TS-CHIEF']}
scorer_dict = {'balanced accuracy': balanced_accuracy_score,
               'CFI': CFI,
               'F1 score': f1_score}

for R in ['R1','R2','R3','R4']:
    # Result files for this R configuration, one per cross validation split
    R_path = [r for r in results_path if R+'.csv' in str(r)]
    result_dict = {col: [] for col in scorer_dict}
    for result in R_path:
        df = pd.read_csv(result)
        y_true = df['actual_label']
        y_pred = df['predicted_label']
        for col, scorer in scorer_dict.items():
            result_dict[col].append(scorer(y_true, y_pred))
    # Report each metric as "mean(+/- std)" over the splits, truncated to 5 characters
    for col in scorer_dict:
        df_dict.update({col+' '+R: str(np.mean(result_dict[col]))[0:5] + '(+/- '+str(np.std(result_dict[col]))[0:5]+')'})
df_res = pd.DataFrame(df_dict)
df_res.to_csv(out_path+'TS-CHIEF_results.csv',index=False)
nohup ~/path_to_python3 ~/CV_script.py &
#!/bin/bash
n_cv=9
n_r=4
size=1000
for id_r in $(seq 1 $n_r)
do
    for id_cv in $(seq 0 $n_cv)
    do
        jdk/jdk-15/bin/java -jar tschief.jar -train="datasets/TSCHIEF/data_Train_${size}_${id_cv}_R${id_r}.csv" -test="datasets/TSCHIEF/data_Test_${size}_${id_cv}_R${id_r}.csv" -out="results/TSCHIEF/" -repeats="1" -trees="500" -s="ee:10,boss:150,rise:150" -export="1" -verbosity="1" -shuffle="True" -target_column="last"
    done
done
File added