
Predicting damage inflicted in traffic accidents.
Correlation - measuring the correlation between a feature and the target value sometimes helped decide whether a column could be dropped.
```python
import pandas as pd

def check_correlation(csv, col_1, col_2):
    # Encode the first (categorical) column as numeric codes, keep the second as-is
    df_corr = pd.DataFrame()
    df_corr[col_1] = csv[col_1].astype('category').cat.codes
    df_corr[col_2] = csv[col_2]
    df_corr = df_corr.dropna()
    print(df_corr.corr())
```

Factorization
```python
def factorize(csv, col_name):
    # One-hot encode a categorical column and replace it with the dummy columns
    dummy = pd.get_dummies(csv[col_name])
    dummy.columns = [col_name + " " + str(x) for x in dummy.columns]
    csv = csv.drop(col_name, axis=1)
    csv = pd.concat([csv, dummy], axis=1)
    return csv
```

Feature Extraction - some features were extracted manually.
```python
# Bucket the time of day into four categories and one-hot encode them
time = pd.DatetimeIndex(csv["time"])
minutes = pd.Series(time.hour * 60 + time.minute, index=csv.index)
day_time = pd.Series("night", index=csv.index)
day_time[minutes >= 360] = "morning"
day_time[minutes >= 720] = "midday"
day_time[minutes >= 1080] = "evening"
csv["day_time"] = day_time
csv = factorize(csv, "day_time")
csv = csv.drop("time", axis=1)
```

Standardization
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Scaling is handled automatically as the first step of the pipeline
logregPipe = Pipeline([('scaler', StandardScaler()),
                       ('logreg', LogisticRegression())])
logregPipe.fit(x_train, y_train)
```

R² Denoiser - I've used two regressors inside the denoiser: a Decision Tree Regressor and a K-Nearest Neighbours Regressor.
```
R² Score Denoiser
Regressor: DecisionTreeRegressor
Number of Selected Features: 71
```

```
R² Score Denoiser
Regressor: KNeighborsRegressor
Number of Selected Features: 68
```
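The denoiser code itself isn't part of this writeup; below is a rough, hypothetical sketch of an R²-based selector in that spirit (drop each feature in turn and keep only the features whose removal hurts the cross-validated R² score). The `r2_denoise` helper and its logic are my assumptions, not the exact implementation used.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def r2_denoise(X, y, regressor=None, cv=3):
    # Hypothetical sketch of an R²-based feature "denoiser":
    # keep only the features whose removal lowers the cross-validated R² score.
    if regressor is None:
        regressor = DecisionTreeRegressor(random_state=0)
    baseline = cross_val_score(regressor, X, y, cv=cv, scoring='r2').mean()
    selected = []
    for col in X.columns:
        score = cross_val_score(regressor, X.drop(columns=col), y,
                                cv=cv, scoring='r2').mean()
        if score < baseline:  # removing the column hurts R², so keep it
            selected.append(col)
    print("Number of Selected Features:", len(selected))
    return X[selected]
```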
Boruta Selection - resulted in 16 features.
Boruta gives a closer look at the selected features. The most important were Light and Weather Condition.
```
BorutaPy finished running.
Iteration: 20 / 100
Confirmed: 16
Tentative: 0
Rejected: 87
Number of Selected Features: 16
```
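The Boruta call itself isn't included above; a minimal sketch using the boruta package could look like the following. The RandomForestClassifier base estimator and its settings are assumptions (a common default for Boruta), and x_train / y_train are assumed to still be pandas objects.

```python
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# Base estimator for Boruta; these settings are assumptions, not the ones used at the hackathon
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

boruta = BorutaPy(rf, n_estimators='auto', max_iter=100, verbose=2, random_state=0)
boruta.fit(x_train.values, y_train.values)  # BorutaPy expects numpy arrays

selected = x_train.columns[boruta.support_]
print("Number of Selected Features:", len(selected))
```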
PCA - I've also tried Principal Component Analysis to select the most informative features.
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def pca_decomp(X_train, Y_train):
    # Standardize, then keep enough components to explain 90% of the variance
    X_train = StandardScaler().fit_transform(X_train)
    pca = PCA(n_components=0.9, svd_solver='full', random_state=0)
    X_train = pca.fit_transform(X_train)
    x_train, x_test, y_train, y_test = train_test_split(X_train, Y_train, test_size=0.3)
    return (x_train, x_test, y_train, y_test)
```

Plotting distributions also helps to check whether the selected features behave properly just before training.
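The plotting code isn't part of the writeup; a minimal sketch of such a sanity check, assuming x_train is still available, is to histogram each feature:

```python
import pandas as pd
import matplotlib.pyplot as plt

# One histogram per feature: a quick check for degenerate or badly scaled columns
pd.DataFrame(x_train).hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()
```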
I've used a custom pipeline strategy to train and test 5 different ML algorithms. I'll describe it using the Decision Tree Classifier as an example.
Pipeline - I've used it to perform automated scaling.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

dectreePipe = Pipeline([('scaler', StandardScaler()),
                        ('dectree', DecisionTreeClassifier())])

dectreeParam = {
    'dectree__criterion': ['entropy', 'gini'],
    'dectree__class_weight': ['balanced', None],
    'dectree__max_depth': range(1, 7),
}
```

GridSearchCV - I've used Grid Search with Cross Validation to select the best model.
```python
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import numpy as np

dectreeGS = GridSearchCV(dectreePipe, param_grid=dectreeParam,
                         cv=5, scoring='f1').fit(x_train, y_train)

# Plot mean ± std of the CV score for every parameter combination
mean_test_score = dectreeGS.cv_results_["mean_test_score"]
std_test_score = dectreeGS.cv_results_["std_test_score"]
plt.figure(figsize=(4, 2))
plt.errorbar(np.arange(mean_test_score.shape[0]), mean_test_score, std_test_score, fmt='ok')
```

| parameter | selected |
|---|---|
| class_weight | balanced |
| criterion | entropy |
| max_depth | 3 |
A confusion matrix is always a good way to visualize the predictions (a minimal sketch of computing one follows the table below).
| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 14073 | 20092 |
| Actual 1 | 2463 | 4652 |
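A minimal sketch of producing such a matrix with scikit-learn; the x_test and y_test names are assumed from the earlier train/test split:

```python
from sklearn.metrics import confusion_matrix
import pandas as pd

# Rows = actual class, columns = predicted class
y_pred = dectreeGS.predict(x_test)
cm = pd.DataFrame(confusion_matrix(y_test, y_pred),
                  index=["Actual 0", "Actual 1"],
                  columns=["Predicted 0", "Predicted 1"])
print(cm)
```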
Logistic Regression - unfortunately, Grid Search did not find any parameters that would give a satisfactory solution. The ROC curve looks almost flat, so we move on.
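The Logistic Regression grid isn't shown in the writeup; a plausible sketch following the same pattern as the other models (the parameter values here are assumptions):

```python
from sklearn.model_selection import GridSearchCV

# Hypothetical grid; the values actually searched are not recorded in the writeup
logregParam = {
    'logreg__C': [0.01, 0.1, 1, 10],
    'logreg__class_weight': ['balanced', None],
}
logregGS = GridSearchCV(logregPipe, param_grid=logregParam,
                        cv=5, scoring='f1').fit(x_train, y_train)
print(logregGS.best_params_, logregGS.best_score_)
```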
K-Neighbors Classifier - the results were better than Logistic Regression, but still not good enough.
```python
from sklearn.neighbors import KNeighborsClassifier

knnPipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_jobs=-1)),
])
knnParam = {
    'knn__n_neighbors': range(1, 12),
    'knn__weights': ['uniform', 'distance'],
}
_ = knnPipe.fit(x_train, y_train)
```

```
f1_score on the train set: 0.066
f1_score on the test set: 0.028
```
Decision Tree Classifier - this one gave me the best results. The score below is after training on 10% of the training data; the final training is performed later.
```
f1_score on the train set: 0.297
f1_score on the test set: 0.292
```

| parameter | selected |
|---|---|
| class_weight | balanced |
| criterion | entropy |
| max_depth | 3 |
Random Forest - it was very hard to find proper parameters for this algorithm (see the sketch after the parameter table below).
| parameter | selected |
|---|---|
| bootstrap | True |
| max_depth | 20 |
| max_features | auto |
| min_samples_leaf | 1 |
| min_samples_split | 2 |
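The Random Forest pipeline and grid aren't included above; a minimal sketch in the same style, with grid values chosen as assumptions that cover the selected parameters:

```python
from sklearn.ensemble import RandomForestClassifier

rfPipe = Pipeline([('scaler', StandardScaler()),
                   ('rf', RandomForestClassifier(n_jobs=-1))])
# Grid values are assumptions; note that 'auto' equals 'sqrt' in older scikit-learn
# and is spelled 'sqrt' in recent versions.
rfParam = {
    'rf__bootstrap': [True, False],
    'rf__max_depth': [10, 20, None],
    'rf__max_features': ['auto', 'sqrt'],
    'rf__min_samples_leaf': [1, 2, 4],
    'rf__min_samples_split': [2, 5, 10],
}
rfGS = GridSearchCV(rfPipe, param_grid=rfParam, cv=5, scoring='f1').fit(x_train, y_train)
print(rfGS.best_params_)
```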
XGBoost - even though it's one of the most popular ML algorithms, it loses this time due to overfitting.
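The XGBoost setup isn't shown either; a minimal sketch in the same pipeline style, with a hypothetical parameter grid:

```python
from xgboost import XGBClassifier

xgbPipe = Pipeline([('scaler', StandardScaler()),
                    ('xgb', XGBClassifier())])
# Hypothetical grid; the actual search space is not recorded in the writeup
xgbParam = {
    'xgb__max_depth': [3, 5, 7],
    'xgb__n_estimators': [100, 300],
    'xgb__learning_rate': [0.05, 0.1],
}
xgbGS = GridSearchCV(xgbPipe, param_grid=xgbParam, cv=5, scoring='f1').fit(x_train, y_train)
print(xgbGS.best_params_)
```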
Quick summary of the parameter Grid Search.
We chose the Decision Tree Classifier as our algorithm.
In the end, none of our feature selection techniques performed better than the full feature set.
The best algorithm for predicting damage inflicted in traffic accidents is
`DecisionTreeClassifier(criterion='entropy', class_weight='balanced', max_depth=5)`
trained on all features and the whole data set, achieving F1 Score = 34.17.
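As a closing illustration, a minimal sketch of the final fit and evaluation; the split variable names are assumptions:

```python
from sklearn.metrics import f1_score

finalPipe = Pipeline([
    ('scaler', StandardScaler()),
    ('dectree', DecisionTreeClassifier(criterion='entropy',
                                       class_weight='balanced',
                                       max_depth=5)),
])
finalPipe.fit(x_train, y_train)
print("f1_score on the test set:", f1_score(y_test, finalPipe.predict(x_test)))
```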
Mateusz Dorobek, Piotr Podbielski, Aitor Mato, Jaume Mora Viñes - Team Safely - HACK UPC 2019

