McKinsey - Hack the crash

Predicting damage inflicted in traffic accidents.

Data Preprocessing

Correlation - the correlation measured between each feature and the target value sometimes helped to decide whether to drop a column.


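import pandas as pd

def check_corelation(csv, col_1, col_2):
    """Print the correlation between a categorical column and a numeric one."""
    df_corr = pd.DataFrame()
    # Encode categories as integer codes (missing values become -1)
    df_corr[col_1] = csv[col_1].astype('category').cat.codes
    df_corr[col_2] = csv[col_2]
    # Drop rows where col_2 is missing before correlating
    df_corr = df_corr.dropna()
    print(df_corr.corr())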
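To make the "decide whether to drop a column" step concrete, here is a minimal sketch that ranks several categorical columns by their absolute correlation with the target. The toy data, the column names and the helper `correlation_with_target` are assumptions for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy data; column names are hypothetical, not the competition's schema.
toy = pd.DataFrame({
    "light": ["day", "night", "day", "night", "day", "night"],
    "weather": ["rain", "dry", "dry", "rain", "dry", "rain"],
    "target": [0, 1, 0, 1, 0, 1],
})

def correlation_with_target(df, target):
    """Return |Pearson r| between each integer-coded column and the target."""
    scores = {}
    for col in df.columns.drop(target):
        # Same encoding idea as above: categories become integer codes.
        codes = df[col].astype("category").cat.codes
        scores[col] = abs(codes.corr(df[target]))
    return pd.Series(scores).sort_values(ascending=False)

ranking = correlation_with_target(toy, "target")
print(ranking)
```

Columns at the bottom of such a ranking are candidates for dropping. Note that integer codes impose an arbitrary order on the categories, so this is only a rough screening signal.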

Factorization - categorical columns were converted to one-hot (dummy) columns.


def factorize(csv, col_name):
    dummy = pd.get_dummies(csv[col_name])
    dummy.columns = [col_name + " " + str(x) for x in dummy.columns]
    csv = csv.drop(col_name, axis=1)
    csv = pd.concat([csv, dummy], axis=1)
    return csv
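As a rough equivalent, the same "<column> <value>" naming can be obtained directly from `pd.get_dummies` through its `columns` and `prefix_sep` arguments. The toy frame and its column names below are made up for illustration.

```python
import pandas as pd

# Toy data; "light" and "speed" are hypothetical column names.
toy = pd.DataFrame({"light": ["day", "night", "day"], "speed": [50, 90, 30]})

# One-hot encode "light"; the column name is used as the prefix and a
# single space as the separator, matching the naming used by factorize().
encoded = pd.get_dummies(toy, columns=["light"], prefix_sep=" ")
print(list(encoded.columns))
```

Passing several names in `columns` encodes them all in one call, which avoids repeating the drop-and-concat steps per column.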

Feature Extraction - some features were extracted manually, for example the part of the day from the accident time.


# Minutes since midnight for every record
time = pd.DatetimeIndex(csv["time"])
minutes = time.hour * 60 + time.minute
# Bin into parts of the day: [0, 360) night, [360, 720) morning,
# [720, 1080) midday, [1080, 1440) evening
csv["day_time"] = pd.cut(minutes, bins=[0, 360, 720, 1080, 1440], right=False,
                         labels=["night", "morning", "midday", "evening"])
csv = factorize(csv, "day_time")
csv = csv.drop("time", axis=1)

Standardization


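# Pipeline: sklearn.pipeline; StandardScaler: sklearn.preprocessing; LogisticRegression: sklearn.linear_model
logregPipe = Pipeline([('scaler', StandardScaler()), ('logreg', LogisticRegression())])
# fit() standardizes x_train inside the pipeline and then trains the classifier
logregPipe.fit(x_train, y_train)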
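A small self-contained sketch of what the scaling step does inside such a pipeline. The data here is synthetic (two features on very different scales) and only stands in for the real feature matrix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: two features on very different scales.
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(500, 100, 200)])
y = (X[:, 0] + (X[:, 1] - 500) / 100 > 0).astype(int)

pipe = Pipeline([("scaler", StandardScaler()), ("logreg", LogisticRegression())])
pipe.fit(X, y)  # scales X internally, then trains the classifier

# After standardization every column has mean 0 and standard deviation 1.
scaled = pipe.named_steps["scaler"].transform(X)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
print("train accuracy:", pipe.score(X, y))
```

Keeping the scaler inside the pipeline means that `predict` and `score` apply the same scaling automatically, so no separate transformed copy of the data is needed.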

Feature Selection

R² Denoisser - I used two regressors inside the Denoisser: Decision Tree Regressor and K-Nearest Neighbours Regressor.


R² Score Denoisser
Regressor: DecisionTreeRegressor
Number of Selected Features: 71


R² Score Denoisser
Regressor: KNeighborsRegressor
Number of Selected Features: 68

Boruta Selection - resulted in 16 features.

Boruta gives a closer look at the selected features. The most important were Light and Weather Condition.


BorutaPy finished running.
Iteration:  20 / 100
Confirmed:  16
Tentative:  0
Rejected:   87
Number of Selected Features: 16
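For intuition, here is a simplified single round of the shadow-feature comparison that Boruta is built on. This is not BorutaPy itself: the real algorithm repeats the comparison over many iterations and applies a statistical test before confirming or rejecting a feature. The data is synthetic and only feature 0 carries signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
# Only feature 0 carries signal in this synthetic example.
y = (X[:, 0] > 0).astype(int)

# Shadow features: each column shuffled independently, which keeps its
# distribution but destroys any relation to the target.
shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(np.hstack([X, shadows]), y)

importances = forest.feature_importances_
real, shadow = importances[:5], importances[5:]
# A real feature scores a "hit" when it beats the best shadow feature.
hits = real > shadow.max()
print(hits)
```

A feature that cannot beat randomly shuffled copies of the data is unlikely to be useful, which is why most of the features were rejected in the run above.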

PCA - I also tried Principal Component Analysis to reduce the features to the most informative components (enough of them to explain 90% of the variance).


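from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def pca_decomp(X_train, Y_train):
    # Standardize, then keep as many components as needed to explain 90% of the variance
    X_train = StandardScaler().fit_transform(X_train)
    pca = PCA(n_components=0.9, svd_solver='full', random_state=0)
    X_train = pca.fit_transform(X_train)
    # Note: scaler and PCA are fitted before the split, so they also see the held-out 30%
    x_train, x_test, y_train, y_test = train_test_split(X_train, Y_train, test_size=0.3)
    return (x_train, x_test, y_train, y_test)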
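The effect of a fractional `n_components` can be checked on synthetic data. In this sketch the 10 columns are built from 3 latent factors plus a little noise (an assumption made purely for illustration), so only a few components are needed to pass the 90% threshold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 10 columns built from 3 latent factors plus small noise.
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))

X_scaled = StandardScaler().fit_transform(X)
# A float in (0, 1) asks PCA for the smallest number of components whose
# cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.9, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("explained variance:", pca.explained_variance_ratio_.sum())
```

The resulting columns are linear combinations of all original features, so PCA reduces dimensionality rather than picking a subset of the original columns.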

Data Analysis

Plotting the distributions also helps to check that the selected features behave properly just before training.

Machine Learning Pipeline

I used a custom pipeline strategy to train and test 5 different ML algorithms. I'll describe it using the Decision Tree Classifier as an example.

Pipeline - I used it to perform the scaling automatically


dectreePipe = Pipeline([('scaler', StandardScaler()),
                        ('dectree', DecisionTreeClassifier())
                       ])
dectreeParam = {
    'dectree__criterion': ['entropy', 'gini'],
    'dectree__class_weight': ['balanced', None],
    'dectree__max_depth': range(1, 7),
}

GridSearchCV - I've used Grid Search with Cross Validation to select the best model


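import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import GridSearchCV

# 5-fold cross-validated grid search, scored by F1
dectreeGS = GridSearchCV(dectreePipe, param_grid=dectreeParam, cv=5,
                         scoring='f1').fit(x_train, y_train)
# Plot the mean CV score with its standard deviation for every parameter combination
mean_test_score = dectreeGS.cv_results_["mean_test_score"]
std_test_score = dectreeGS.cv_results_["std_test_score"]
plt.figure(figsize=(4, 2))
plt.errorbar(np.arange(mean_test_score.shape[0]), mean_test_score,
             std_test_score, fmt='ok')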


                      selected parameter
class_weight           balanced
criterion               entropy
max_depth                     3
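The whole step can be sketched end to end on synthetic data. The dataset from `make_classification` is only an imbalanced stand-in for the accident data, and the way the selected parameters are turned into a table is an assumption about how output like the above could be produced.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the accident data.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.8, 0.2],
                           random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("dectree", DecisionTreeClassifier(random_state=0))])
grid = {
    "dectree__criterion": ["entropy", "gini"],
    "dectree__class_weight": ["balanced", None],
    "dectree__max_depth": range(1, 7),
}
search = GridSearchCV(pipe, param_grid=grid, cv=5, scoring="f1")
search.fit(x_train, y_train)

# Strip the "dectree__" step prefix to get a table of selected parameters.
selected = pd.Series(
    {name.split("__", 1)[1]: value for name, value in search.best_params_.items()},
    name="selected parameter",
).sort_index()
print(selected)
print("test f1:", search.score(x_test, y_test))
```

After fitting, `search` behaves like the best pipeline refitted on the whole training set, so `search.score` and `search.predict` can be used directly on held-out data.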

A confusion matrix is a good way to visualize the predictions:

            Predicted 0    Predicted 1
Actual 0          14073          20092
Actual 1           2463           4652
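As a worked example, precision, recall and F1 for the positive class can be computed by hand from the four counts in the confusion matrix above.

```python
# Counts copied from the confusion matrix above (positive class = 1).
tn, fp = 14073, 20092
fn, tp = 2463, 4652

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

This gives a precision of about 0.188, a recall of about 0.654 and an F1 of about 0.292, which agrees with the 0.292 test-set score reported for the Decision Tree Classifier. The model finds most of the positive cases but at the cost of many false alarms.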

Model Selection

Logistic Regression - unfortunately, Grid Search did not find any parameters that gave a satisfactory solution. The ROC curve looks almost flat, so we move on.

K-Neighbors Classifier - the results were better than with Logistic Regression, but still not good enough.


knnPipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_jobs=-1)),
])
knnParam = {
    'knn__n_neighbors': range(1, 12),
    'knn__weights': ['uniform', 'distance'],
}
_ = knnPipe.fit(x_train, y_train)


f1_score on the train set:  0.066
f1_score on the test set:  0.028

Decision Tree Classifier - this one gave the best results. The scores below come from training on a 10% sample of the training data; the final training on the whole set is performed later.


f1_score on the train set:  0.297
f1_score on the test set: 0.292


                selected parameter
class_weight           balanced
criterion               entropy
max_depth                     3

Random Forest - It was very hard to find proper parameters for that algorithm.


                selected parameter
bootstrap                 True
max_depth                 20
max_features              auto
min_samples_leaf          1
min_samples_split         2

XGBoost - even though it is one of the most popular ML algorithms, this time it loses because of overfitting.

Quick summary of the Grid Search results:
- Logistic Regression: f1_score on the train set: 0.0265, f1_score on the test set: 0.0
- K-Neighbors Classifier: f1_score on the train set: 1.0, f1_score on the test set: 0.102
- Decision Tree Classifier: f1_score on the train set: 0.291, f1_score on the test set: 0.279
- RandomForestClassifier: f1_score on the train set: 0.969, f1_score on the test set: 0.0
- XGBoost: f1_score on the train set: 1.0, f1_score on the test set: 0.048
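The gap between train and test scores in the list above is the usual sign of overfitting. A minimal sketch of how such a comparison can be made, on synthetic noisy data and with a hypothetical helper `train_test_f1`:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (flip_y adds label noise).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=3,
                           flip_y=0.2, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

def train_test_f1(model):
    """Fit the model and return (train F1, test F1)."""
    model.fit(x_train, y_train)
    return (f1_score(y_train, model.predict(x_train)),
            f1_score(y_test, model.predict(x_test)))

# An unconstrained tree memorizes the training set, noise included.
deep = train_test_f1(DecisionTreeClassifier(random_state=0))
# Limiting the depth trades training fit for better generalization.
shallow = train_test_f1(DecisionTreeClassifier(max_depth=3, random_state=0))

print("deep   : train %.3f test %.3f" % deep)
print("shallow: train %.3f test %.3f" % shallow)
```

A train score of 1.0 combined with a much lower test score means the model has memorized the training data, which is the pattern K-Neighbors, Random Forest and XGBoost show in the summary.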

We chose the Decision Tree Classifier as our algorithm.

Model Selection - Part II - choosing features

Feature set      Features   CV f1 train (%)   CV f1 test (%)
PCA                    68              31.3             30.8
Denoisser KNR          68             31.04            30.75
Denoisser DTR          71             30.81             31.2
Boruta                 16             34.59             33.8
All features          103             34.19            34.17

As a result, none of our feature selection techniques performed better on the test set than the whole feature set.
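A comparison like this can be sketched with `cross_validate`, which reports train and test scores per fold. The data is synthetic and the feature subsets are hypothetical placeholders; in the project they would come from Boruta, the Denoisser and so on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=12, n_informative=4,
                           random_state=0)
# Hypothetical feature subsets given as column indices.
subsets = {
    "all features": np.arange(12),
    "first six": np.arange(6),
}

model = DecisionTreeClassifier(criterion="entropy", class_weight="balanced",
                               max_depth=3, random_state=0)
results = {}
for name, cols in subsets.items():
    cv = cross_validate(model, X[:, cols], y, cv=5, scoring="f1",
                        return_train_score=True)
    results[name] = (cv["train_score"].mean(), cv["test_score"].mean())
    print(f"{name:12s} {len(cols):3d}  train {results[name][0]:.3f}"
          f"  test {results[name][1]:.3f}")
```

Using the same model and the same folds for every subset keeps the comparison fair, so differences in the scores come from the features alone.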

Lessons for future projects

Try PCA and other dimensionality reduction techniques, such as LDA or QDA, instead of feature selection.
Know your algorithm's parameters so that you can Grid Search over them properly.
Feature processing can be very time consuming; use functions and generalize your tasks.
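On the second point, one quick way to see which parameters are available for a grid is to list them with `get_params()`. The sketch below uses the same step names as the decision tree pipeline described earlier.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([("scaler", StandardScaler()),
                 ("dectree", DecisionTreeClassifier())])

# Every key returned by get_params() is a valid name for a param_grid entry;
# the "<step>__<parameter>" form addresses a parameter of one pipeline step.
params = pipe.get_params()
tunable = sorted(name for name in params if name.startswith("dectree__"))
for name in tunable:
    print(name, "=", params[name])
```

The printed defaults are a reasonable starting point for choosing the value ranges to search over.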

Summary

The best algorithm for predicting damage inflicted in traffic accidents is

DecisionTreeClassifier(criterion='entropy', class_weight='balanced', max_depth=5)

Trained on all features and the whole training set, it reached an F1 score of 34.17%.

Mateusz Dorobek, Piotr Podbielski, Aitor Mato, Jaume Mora Viñes - Team Safely - HACK UPC 2019