Predicting damage inflicted in traffic accidents.
Correlation - measuring the correlation between a feature and the target value sometimes helped to decide whether to drop a column.
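import pandas as pd

def check_corelation(csv, col_1, col_2):
    # Encode col_1 as integer category codes so it can be correlated with col_2.
    df_corr = pd.DataFrame()
    df_corr[col_1] = csv[col_1].astype('category').cat.codes
    df_corr[col_2] = csv[col_2]
    # Note: missing values in col_1 become code -1 and are not dropped here.
    df_corr = df_corr.dropna()
    print(df_corr.corr())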
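As a small illustration of how such a check behaves, the sketch below computes the same kind of correlation on a toy frame and returns the value instead of printing the matrix. The helper name `category_correlation`, the column names `light` and `damage`, and all values are made up for illustration. Category codes are assigned in sorted order, so the number is only meaningful for binary or naturally ordered categories.

```python
import pandas as pd

def category_correlation(df, cat_col, target_col):
    """Pearson correlation between the integer codes of cat_col and target_col."""
    pair = df[[cat_col, target_col]].dropna()
    codes = pair[cat_col].astype("category").cat.codes
    return codes.corr(pair[target_col])

# Hypothetical toy data, not taken from the accident data set.
toy = pd.DataFrame({
    "light": ["day", "night", "day", "night", "day", "night"],
    "damage": [0, 1, 0, 1, 0, 1],
})
corr = category_correlation(toy, "light", "damage")
print(round(corr, 3))
```

A value close to zero for a column would be one argument for dropping it, although a weak linear correlation alone does not prove that the column is useless.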
Factorization - categorical columns were converted to one-hot (dummy) columns.
def factorize(csv, col_name):
    dummy = pd.get_dummies(csv[col_name])
    dummy.columns = [col_name + " " + str(x) for x in dummy.columns]
    csv = csv.drop(col_name, axis=1)
    csv = pd.concat([csv, dummy], axis=1)
    return csv
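A short usage sketch of `factorize` on a toy frame shows what the resulting columns look like. The column names `weather` and `damage` and their values are hypothetical.

```python
import pandas as pd

def factorize(csv, col_name):
    # Same logic as the function above: one dummy column per category value.
    dummy = pd.get_dummies(csv[col_name])
    dummy.columns = [col_name + " " + str(x) for x in dummy.columns]
    csv = csv.drop(col_name, axis=1)
    return pd.concat([csv, dummy], axis=1)

# Hypothetical toy data.
toy = pd.DataFrame({"weather": ["rain", "sun", "rain"], "damage": [1, 0, 1]})
encoded = factorize(toy, "weather")
print(list(encoded.columns))
# prints: ['damage', 'weather rain', 'weather sun']
```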
Feature Extraction - some features were extracted manually
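# Convert the timestamp to minutes since midnight, then bin into parts of the day.
time = pd.DatetimeIndex(csv["time"])
minutes = time.hour * 60 + time.minute
# Left-closed bins: [0, 360) night, [360, 720) morning, [720, 1080) midday, [1080, 1440) evening.
csv["day_time"] = pd.cut(
    minutes.to_numpy(),
    bins=[0, 360, 720, 1080, 1440],
    labels=["night", "morning", "midday", "evening"],
    right=False,
)
csv = factorize(csv, "day_time")
csv = csv.drop("time", axis=1)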
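A minimal, self-contained sketch of mapping timestamps to parts of the day by minutes since midnight, using `pandas.cut` with left-closed bins. The helper name `part_of_day` and the sample timestamps are hypothetical; the thresholds (360, 720, 1080) are the ones used in this section.

```python
import pandas as pd

def part_of_day(timestamps):
    """Map timestamps to night/morning/midday/evening by minutes since midnight."""
    idx = pd.DatetimeIndex(timestamps)
    minutes = (idx.hour * 60 + idx.minute).to_numpy()
    return pd.cut(
        minutes,
        bins=[0, 360, 720, 1080, 1440],
        labels=["night", "morning", "midday", "evening"],
        right=False,  # each bin includes its lower bound, e.g. 06:00 is "morning"
    )

# Hypothetical sample timestamps.
labels = part_of_day([
    "2019-10-12 02:30",
    "2019-10-12 08:00",
    "2019-10-12 13:15",
    "2019-10-12 21:45",
])
print(list(labels))
# prints: ['night', 'morning', 'midday', 'evening']
```

Binning in one call avoids the pitfall of sequential threshold assignments, where a later, looser condition can overwrite labels written by an earlier one.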
Standardization
# The scaler is the first pipeline step, so it is fitted on the training data only.
logregPipe = Pipeline([('scaler', StandardScaler()), ('logreg', LogisticRegression())])
logregPipe.fit(x_train, y_train)
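To make the effect of the scaling step visible, here is a small sketch on synthetic data (the data and the variable names `X`, `y`, `pipe` are assumptions, not the project's data). A pipeline whose last step is a classifier is trained with `fit`, and the fitted scaler can be inspected through `named_steps`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: two features on very different scales.
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline([("scaler", StandardScaler()), ("logreg", LogisticRegression())])
pipe.fit(X, y)

# After scaling, every column has mean 0 and standard deviation 1.
scaled = pipe.named_steps["scaler"].transform(X)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
print(pipe.score(X, y))
```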
R² Denoiser - I used two regressors inside the denoiser: Decision Tree Regressor and K-Nearest Neighbours Regressor.
R² Score Denoisser
Regressor: DecisionTreeRegressor
Number of Selected Features: 71
R² Score Denoisser
Regressor: KNeighborsRegressor
Number of Selected Features: 68
Boruta Selection - resulted in 16 features.
Boruta gives a closer look at the selected features. The most important were Light and Weather Condition.
BorutaPy finished running.
Iteration: 20 / 100
Confirmed: 16
Tentative: 0
Rejected: 87
Number of Selected Features: 16
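The idea behind Boruta can be illustrated with a simplified, single-pass sketch that uses only scikit-learn. This is not the BorutaPy implementation and not the code used in this project: real Boruta repeats the comparison over many iterations and applies a statistical test, while the sketch below does it once on synthetic data. All names and data here are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Synthetic data: only the first of four features carries signal.
X = rng.normal(size=(n, 4))
y = (X[:, 0] > 0).astype(int)

# Shadow features: each column shuffled independently, which destroys any
# relationship with the target but keeps the column's distribution.
shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
forest.fit(np.hstack([X, shadows]), y)

importances = forest.feature_importances_
real, shadow = importances[:4], importances[4:]
# A real feature is kept only if it beats the best shadow feature.
selected = np.flatnonzero(real > shadow.max())
print(selected)
```

In a single pass a pure-noise feature can beat the shadows by chance, which is why Boruta repeats the comparison before confirming or rejecting a feature.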
PCA - I also tried Principal Component Analysis to reduce the features to the components that explain 90% of the variance.
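from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def pca_decomp(X_train, Y_train):
    # Standardize, keep the components explaining 90% of the variance, then split.
    X_train = StandardScaler().fit_transform(X_train)
    pca = PCA(n_components=0.9, svd_solver='full', random_state=0)
    X_train = pca.fit_transform(X_train)
    x_train, x_test, y_train, y_test = train_test_split(X_train, Y_train, test_size=0.3)
    return (x_train, x_test, y_train, y_test)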
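With a float `n_components` between 0 and 1 and `svd_solver='full'`, scikit-learn keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below shows this on synthetic data (the data is an assumption). Note that the resulting components are linear combinations of all input columns, not a subset of the original features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 10 columns generated from 3 latent factors plus noise.
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.9, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

# Number of kept components and the share of variance they explain.
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```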
Plotting the distributions also helps to check whether the selected features behave sensibly just before training.
I used a custom pipeline strategy to train and test 5 different ML algorithms. I'll describe it using the Decision Tree Classifier as an example.
Pipeline - I used it to perform scaling automatically.
dectreePipe = Pipeline([('scaler', StandardScaler()),
                        ('dectree', DecisionTreeClassifier())
                        ])
dectreeParam = {
    'dectree__criterion': ['entropy', 'gini'],
    'dectree__class_weight': ['balanced', None],
    'dectree__max_depth': range(1, 7),
}
GridSearchCV - I've used Grid Search with Cross Validation to select the best model
dectreeGS = GridSearchCV(dectreePipe, param_grid=dectreeParam, cv=5,
                         scoring='f1').fit(x_train, y_train)
mean_test_score = dectreeGS.cv_results_["mean_test_score"]
std_test_score = dectreeGS.cv_results_["std_test_score"]
plt.figure(figsize=(4, 2))
plt.errorbar(np.arange(mean_test_score.shape[0]), mean_test_score,
             std_test_score, fmt='ok')
selected parameter
class_weight balanced
criterion entropy
max_depth 3
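For readers who want to run the pipeline and grid search end to end, here is a self-contained sketch on synthetic, imbalanced data. The data set from `make_classification` and the names `pipe`, `params` and `search` are assumptions; the grid itself mirrors the one shown above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the accident data.
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("dectree", DecisionTreeClassifier(random_state=0))])
# Step name + double underscore + parameter name addresses a pipeline step.
params = {
    "dectree__criterion": ["entropy", "gini"],
    "dectree__class_weight": ["balanced", None],
    "dectree__max_depth": range(1, 7),
}
search = GridSearchCV(pipe, param_grid=params, cv=5, scoring="f1")
search.fit(x_train, y_train)

print(search.best_params_)
# One mean score per grid point: 2 criteria * 2 class weights * 6 depths.
print(len(search.cv_results_["mean_test_score"]))
print(search.score(x_test, y_test))  # F1 of the refitted best model
```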
A confusion matrix is a good way to visualize the predictions.
| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 14073 | 20092 |
| Actual 1 | 2463 | 4652 |
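Precision, recall and F1 follow directly from these four counts. The short calculation below assumes that class 1 is the positive class, as the table layout suggests.

```python
# Counts copied from the confusion matrix above (class 1 is the positive class).
tn, fp = 14073, 20092
fn, tp = 2463, 4652

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.188 recall=0.654 f1=0.292
```

The model finds about two thirds of the positive cases, but fewer than one in five of its positive predictions is correct, which is what keeps the F1 score low.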
Logistic Regression - unfortunately, Grid Search did not find any parameters that gave a satisfactory solution. The ROC curve looks almost flat, so we move on.
K-Neighbors Classifier - the results were better than with Logistic Regression, but still not good enough.
knnPipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_jobs=-1)),
])
knnParam = {
    'knn__n_neighbors': range(1, 12),
    'knn__weights': ['uniform', 'distance'],
}
_ = knnPipe.fit(x_train, y_train)
f1_score on the train set: 0.066
f1_score on the test set: 0.028
Decision Tree Classifier - this one gave me the best results. The scores below come from training on 10% of the training data; the final training is performed later.
f1_score on the train set: 0.297
f1_score on the test set: 0.292
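The write-up does not say how the 10% part of the training data was drawn. One possible way, shown here purely as an assumption, is a stratified subsample with `train_test_split`, which keeps the class ratio of the full training set. The data below is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in training set with an imbalanced binary target.
x_train = rng.normal(size=(1000, 5))
y_train = (rng.random(1000) < 0.17).astype(int)

# Keep 10% of the rows; stratify so the class ratio is preserved.
x_small, _, y_small, _ = train_test_split(
    x_train, y_train, train_size=0.1, stratify=y_train, random_state=0
)
print(x_small.shape, y_small.mean(), y_train.mean())
```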
selected parameter
class_weight balanced
criterion entropy
max_depth 3
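The `class_weight balanced` setting weights each class by `n_samples / (n_classes * class_count)`, so the rarer class counts more during training. As an illustration, the sketch below applies scikit-learn's `compute_class_weight` to the class counts implied by the rows of the confusion matrix shown earlier. Using those counts is an assumption made for the example; the actual training split may have different counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the rows of the confusion matrix:
# actual 0 = 14073 + 20092, actual 1 = 2463 + 4652.
n_neg = 14073 + 20092
n_pos = 2463 + 4652
y = np.repeat([0, 1], [n_neg, n_pos])

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(weights.round(3))
```

With these counts, an error on a class 1 sample costs roughly five times as much as an error on a class 0 sample.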
Random Forest - it was very hard to find suitable parameters for this algorithm.
selected parameter
bootstrap True
max_depth 20
max_features auto
min_samples_leaf 1
min_samples_split 2
XGBoost - even though it is one of the most popular ML algorithms, this time it loses because of overfitting.
Quick summary of the Grid Search parameters.
We choose Decision Tree Classifier as our algorithm.
As a result, none of our feature selection techniques performed better than the whole feature set.
The best algorithm for predicting damage inflicted in traffic accidents is
DecisionTreeClassifier(criterion='entropy', class_weight='balanced', max_depth=5)
Working on all features and the whole data set, it reached F1 Score = 34.17.
Mateusz Dorobek, Piotr Podbielski, Aitor Mato, Jaume Mora Viñes - Team Safely - HACK UPC 2019