Abstract
In this vignette, we learn how to evaluate predictions on the ID level with `evaluate()`.
Contact the author at r-pkgs@ludvigolsen.dk
When we have groups of observations (e.g. a participant ID), we are sometimes more interested in the overall prediction for the group than in the individual observation-level predictions.
Say we have a dataset with 10 observations per participant and a model that predicts whether a participant has an autism diagnosis or not. While the model makes a prediction for each of the 10 observations, it's really the overall prediction for the participant that we are interested in.
`evaluate()` has two approaches to performing the evaluation on the ID level: averaging and voting.
In averaging, we simply average the predicted probabilities for the participant. This is the default approach, as it maintains information about how certain our model is about its class prediction. That is, if all observations have a 60% predicted probability of an autism diagnosis, that should be considered differently than if they all have a 90% predicted probability.
In voting, we simply count the predictions of each outcome class and assign the class with the most predictions to the participant.
If 7 out of 10 of the observations are predicted as having no autism diagnosis, that becomes the prediction for the participant.
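To make the difference concrete, here is a small sketch (not part of the vignette; the probabilities are made up) of both aggregation strategies for a single participant with ten observations:

```r
# Made-up predicted probabilities for one participant's ten observations
probs <- c(0.20, 0.30, 0.45, 0.35, 0.40, 0.10, 0.48, 0.60, 0.70, 0.55)

# Averaging: take the mean probability, then apply the 0.5 cutoff
mean(probs)                    # 0.413
as.integer(mean(probs) > 0.5)  # 0

# Voting: classify each observation first, then take the majority class
votes <- as.integer(probs > 0.5)
table(votes)                                  # 7 zeros, 3 ones
as.integer(sum(votes) > length(votes) / 2)    # 0
```

Both strategies predict class 0 here, but only the averaged probability (0.413) tells us how far from the cutoff the prediction is.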
We will use the simple `participant.scores` dataset, as it has 3 rows per participant and a diagnosis column that we can evaluate predictions against. Let's add predicted probabilities and diagnoses and have a look:
```r
library(cvms)
library(knitr)  # kable()
library(dplyr)

set.seed(74)

# Prepare dataset
data <- participant.scores %>% as_tibble()

# Add probabilities and predicted classes
data[["probability"]] <- runif(nrow(data))
data[["predicted diagnosis"]] <- ifelse(data[["probability"]] > 0.5, 1, 0)

data %>% head(10) %>% kable()
```

| participant | age | diagnosis | score | session | probability | predicted diagnosis |
|---|---|---|---|---|---|---|
| 1 | 20 | 1 | 10 | 1 | 0.7046162 | 1 |
| 1 | 20 | 1 | 24 | 2 | 0.4800045 | 0 |
| 1 | 20 | 1 | 45 | 3 | 0.1960176 | 0 |
| 2 | 23 | 0 | 24 | 1 | 0.9369707 | 1 |
| 2 | 23 | 0 | 40 | 2 | 0.8698302 | 1 |
| 2 | 23 | 0 | 67 | 3 | 0.2140318 | 0 |
| 3 | 27 | 1 | 15 | 1 | 0.0240853 | 0 |
| 3 | 27 | 1 | 30 | 2 | 0.8547959 | 1 |
| 3 | 27 | 1 | 40 | 3 | 0.7027153 | 1 |
| 4 | 21 | 0 | 35 | 1 | 0.9579817 | 1 |
We tell `evaluate()` to aggregate the predictions by the `participant` column with the mean (averaging) method.
Note: It is assumed that the target class is constant within the IDs, i.e. that the participant has the same diagnosis in all observations.
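If you want to verify this assumption for your own data, a quick dplyr check (a sketch, not part of `evaluate()`) could look like this:

```r
# Sketch: find participants whose diagnosis is not constant across observations
data %>%
  dplyr::group_by(participant) %>%
  dplyr::summarise(n_targets = dplyr::n_distinct(diagnosis)) %>%
  dplyr::filter(n_targets > 1)
# Zero rows means the assumption holds for every participant
```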
```r
ev <- evaluate(
  data = data,
  target_col = "diagnosis",
  prediction_cols = "probability",
  id_col = "participant",
  id_method = "mean",
  type = "binomial"
)

ev
#> # A tibble: 1 × 19
#>   `Balanced Accuracy` Accuracy    F1 Sensitivity Specificity `Pos Pred Value`
#>                 <dbl>    <dbl> <dbl>       <dbl>       <dbl>            <dbl>
#> 1               0.292      0.3 0.364       0.333        0.25              0.4
#> # ℹ 13 more variables: `Neg Pred Value` <dbl>, AUC <dbl>, `Lower CI` <dbl>,
#> #   `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>, `Detection Rate` <dbl>,
#> #   `Detection Prevalence` <dbl>, Prevalence <dbl>, Predictions <list>,
#> #   ROC <named list>, `Confusion Matrix` <list>, Process <list>
```

The `Predictions` column contains the averaged predictions:
| Target | Prediction | SD | Predicted Class | participant | id_method |
|---|---|---|---|---|---|
| 1 | 0.4602128 | 0.2548762 | 0 | 1 | mean |
| 0 | 0.6736109 | 0.3994204 | 1 | 2 | mean |
| 1 | 0.5271988 | 0.4422946 | 1 | 3 | mean |
| 0 | 0.7974576 | 0.1703448 | 1 | 4 | mean |
| 1 | 0.5887699 | 0.4738221 | 1 | 5 | mean |
| 1 | 0.3526630 | 0.2302525 | 0 | 6 | mean |
| 1 | 0.2333758 | 0.1913763 | 0 | 7 | mean |
| 1 | 0.3956015 | 0.3207379 | 0 | 8 | mean |
| 0 | 0.3374361 | 0.0304785 | 0 | 9 | mean |
| 0 | 0.5988969 | 0.0675830 | 1 | 10 | mean |
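To see where these numbers come from, we can reproduce participant 1's row by hand with dplyr (a minimal sketch; the values match the probabilities shown in the first table above):

```r
# Average the three predicted probabilities for participant 1
data %>%
  dplyr::filter(participant == "1") %>%
  dplyr::summarise(
    Prediction = mean(probability),  # 0.4602128
    SD = sd(probability)             # 0.2548762
  )
# 0.46 is below the 0.5 cutoff, so the predicted class is 0
```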
Let’s plot the confusion matrix as well:
```r
# Note: If ev had multiple rows, we would have to
# pass ev$`Confusion Matrix`[[1]] to
# plot the first row's confusion matrix
plot_confusion_matrix(ev)
```

We can have a better look at the metrics:
| Balanced Accuracy | Accuracy | F1 | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | AUC | Lower CI |
|---|---|---|---|---|---|---|---|---|
| 0.29167 | 0.3 | 0.36364 | 0.33333 | 0.25 | 0.4 | 0.2 | 0.20833 | 0 |
| Upper CI | Kappa | MCC | Detection Rate | Detection Prevalence |
|---|---|---|---|---|
| 0.62475 | -0.4 | -0.40825 | 0.2 | 0.5 |
We can use the majority (voting) method for the ID aggregation instead:
```r
ev_2 <- evaluate(
  data = data,
  target_col = "diagnosis",
  prediction_cols = "probability",
  id_col = "participant",
  id_method = "majority",
  type = "binomial"
)

ev_2
#> # A tibble: 1 × 19
#>   `Balanced Accuracy` Accuracy    F1 Sensitivity Specificity `Pos Pred Value`
#>                 <dbl>    <dbl> <dbl>       <dbl>       <dbl>            <dbl>
#> 1               0.292      0.3 0.364       0.333        0.25              0.4
#> # ℹ 13 more variables: `Neg Pred Value` <dbl>, AUC <dbl>, `Lower CI` <dbl>,
#> #   `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>, `Detection Rate` <dbl>,
#> #   `Detection Prevalence` <dbl>, Prevalence <dbl>, Predictions <list>,
#> #   ROC <named list>, `Confusion Matrix` <list>, Process <list>
```

Now the `Predictions` column looks as follows:
| Target | Prediction | Predicted Class | participant | id_method |
|---|---|---|---|---|
| 1 | 0 | 0 | 1 | majority |
| 0 | 1 | 1 | 2 | majority |
| 1 | 1 | 1 | 3 | majority |
| 0 | 1 | 1 | 4 | majority |
| 1 | 1 | 1 | 5 | majority |
| 1 | 0 | 0 | 6 | majority |
| 1 | 0 | 0 | 7 | majority |
| 1 | 0 | 0 | 8 | majority |
| 0 | 0 | 0 | 9 | majority |
| 0 | 1 | 1 | 10 | majority |
In this case, the `Predicted Class` column is identical to that in the averaging approach. We just don't have the probabilities to tell us how sure the model is about that prediction.
If we have predictions from multiple models, we can group the dataframe and get the results per model.
Let's duplicate the dataset and change the predictions. We then combine the datasets and add a `model` column for indicating which of the data frames the observation came from:
```r
# Duplicate data frame
data_2 <- data

# Change the probabilities and predicted classes
data_2[["probability"]] <- runif(nrow(data))
data_2[["predicted diagnosis"]] <- ifelse(data_2[["probability"]] > 0.5, 1, 0)

# Combine the two data frames
data_multi <- dplyr::bind_rows(data, data_2, .id = "model")

data_multi
#> # A tibble: 60 × 8
#>    model participant   age diagnosis score session probability
#>    <chr> <fct>       <dbl>     <dbl> <dbl>   <int>       <dbl>
#>  1 1     1              20         1    10       1      0.705
#>  2 1     1              20         1    24       2      0.480
#>  3 1     1              20         1    45       3      0.196
#>  4 1     2              23         0    24       1      0.937
#>  5 1     2              23         0    40       2      0.870
#>  6 1     2              23         0    67       3      0.214
#>  7 1     3              27         1    15       1      0.0241
#>  8 1     3              27         1    30       2      0.855
#>  9 1     3              27         1    40       3      0.703
#> 10 1     4              21         0    35       1      0.958
#> # ℹ 50 more rows
#> # ℹ 1 more variable: `predicted diagnosis` <dbl>
```

We can now group the data frame by the `model` column and run the evaluation again:
```r
ev_3 <- data_multi %>%
  dplyr::group_by(model) %>%
  evaluate(
    target_col = "diagnosis",
    prediction_cols = "probability",
    id_col = "participant",
    id_method = "mean",
    type = "binomial"
  )

ev_3
#> # A tibble: 2 × 20
#>   model `Balanced Accuracy` Accuracy    F1 Sensitivity Specificity
#>   <chr>               <dbl>    <dbl> <dbl>       <dbl>       <dbl>
#> 1 1                   0.292      0.3 0.364       0.333        0.25
#> 2 2                   0.375      0.4 0.5         0.5          0.25
#> # ℹ 14 more variables: `Pos Pred Value` <dbl>, `Neg Pred Value` <dbl>,
#> #   AUC <dbl>, `Lower CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>,
#> #   `Detection Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>,
#> #   Predictions <list>, ROC <named list>, `Confusion Matrix` <list>,
#> #   Process <list>
```

The `Predictions` column for the second model looks as follows:
| model | Target | Prediction | SD | Predicted Class | participant | id_method |
|---|---|---|---|---|---|---|
| 2 | 1 | 0.3302017 | 0.3002763 | 0 | 1 | mean |
| 2 | 0 | 0.6040242 | 0.2854935 | 1 | 2 | mean |
| 2 | 1 | 0.7342651 | 0.2653166 | 1 | 3 | mean |
| 2 | 0 | 0.6383918 | 0.3799305 | 1 | 4 | mean |
| 2 | 1 | 0.4551732 | 0.3417810 | 0 | 5 | mean |
| 2 | 1 | 0.6808281 | 0.3626166 | 1 | 6 | mean |
| 2 | 1 | 0.4536740 | 0.3784584 | 0 | 7 | mean |
| 2 | 1 | 0.6281501 | 0.4506029 | 1 | 8 | mean |
| 2 | 0 | 0.7000411 | 0.1490745 | 1 | 9 | mean |
| 2 | 0 | 0.4630344 | 0.4344227 | 0 | 10 | mean |
'gaussian' evaluation

This kind of ID aggregation is also available for the 'gaussian' evaluation (e.g. for linear regression models), although only with the averaging approach. Again, it is assumed that the target value is constant for all observations by a participant (like the `age` column in our dataset).
We add a `predicted age` column to our initial dataset:
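The code for this step is not shown above; a minimal sketch that simulates such a column could look like the following (the original vignette's simulation may differ, so the output below stems from its own `predicted age` values):

```r
# Illustrative only: simulate a predicted age for each observation
set.seed(1)
data[["predicted age"]] <- round(runif(nrow(data), min = 20, max = 45))
```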
We evaluate the predicted age, aggregated by participant:
```r
ev_4 <- evaluate(
  data = data,
  target_col = "age",
  prediction_cols = "predicted age",
  id_col = "participant",
  id_method = "mean",
  type = "gaussian"
)

ev_4
#> # A tibble: 1 × 8
#>    RMSE   MAE `NRMSE(IQR)`  RRSE   RAE RMSLE Predictions       Process
#>   <dbl> <dbl>        <dbl> <dbl> <dbl> <dbl> <list>            <list>
#> 1  10.3   8.7        0.984  1.48  1.45 0.340 <tibble [10 × 5]> <prcss_n_>
```

The `Predictions` column looks as follows:
| Target | Prediction | SD | participant | id_method |
|---|---|---|---|---|
| 20 | 35.66667 | 8.326664 | 1 | mean |
| 23 | 33.33333 | 10.214369 | 2 | mean |
| 27 | 35.33333 | 5.686241 | 3 | mean |
| 21 | 30.00000 | 4.582576 | 4 | mean |
| 32 | 28.66667 | 5.507570 | 5 | mean |
| 31 | 43.33333 | 1.154700 | 6 | mean |
| 43 | 39.00000 | 5.196152 | 7 | mean |
| 21 | 40.33333 | 2.516611 | 8 | mean |
| 34 | 35.33333 | 5.507570 | 9 | mean |
| 32 | 35.33333 | 7.571878 | 10 | mean |
On average, we predict participant 1 to have an age of 35.67.
If our targets are not constant within the IDs, we might instead be interested in evaluating the predictions for each ID separately, e.g. how well the model predicted the score for each of the participants.
We add a `predicted score` column to our dataset:
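As above, the column-creation code is not shown; a hypothetical sketch could be (the results below come from the vignette's own, unshown, `predicted score` values):

```r
# Illustrative only: simulate a predicted score for each observation
set.seed(2)
data[["predicted score"]] <- round(runif(nrow(data), min = 10, max = 80))
```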
Now, we group the data frame by the `participant` column and evaluate the predicted scores:
```r
data %>%
  dplyr::group_by(participant) %>%
  evaluate(
    target_col = "score",
    prediction_cols = "predicted score",
    type = "gaussian"
  )
#> # A tibble: 10 × 9
#>    participant  RMSE   MAE `NRMSE(IQR)`  RRSE   RAE RMSLE Predictions Process
#>    <fct>       <dbl> <dbl>        <dbl> <dbl> <dbl> <dbl> <list>      <list>
#>  1 1           13.8  13.7         0.787 0.957 1.10  0.683 <tibble>    <prcss_n_>
#>  2 2           32.4  26.3         1.50  1.82  1.69  0.946 <tibble>    <prcss_n_>
#>  3 3           12.8  10.7         1.03  1.25  1.2   0.549 <tibble>    <prcss_n_>
#>  4 4            9.15  7.67        0.425 0.513 0.486 0.154 <tibble>    <prcss_n_>
#>  5 5           24.1  17.3         1.27  1.47  1.15  0.566 <tibble>    <prcss_n_>
#>  6 6           34.2  33.3         4.27  5.12  5.56  0.895 <tibble>    <prcss_n_>
#>  7 7           44.7  40           2.98  3.45  3.33  1.21  <tibble>    <prcss_n_>
#>  8 8            9.80  8           0.700 0.854 0.818 0.306 <tibble>    <prcss_n_>
#>  9 9           22.6  21.3         1.37  1.66  1.81  0.447 <tibble>    <prcss_n_>
#> 10 10          29.3  28           1.13  1.38  1.62  0.556 <tibble>    <prcss_n_>
```

Participant 4 has the lowest prediction error, while participant 7 has the highest.
This approach is similar to what the `most_challenging()` function does:
```r
# Extract the ~20% observations with highest prediction error
most_challenging(
  data = data,
  type = "gaussian",
  obs_id_col = "participant",
  target_col = "score",
  prediction_cols = "predicted score",
  threshold = 0.20
)
#> # A tibble: 2 × 4
#>   participant   MAE  RMSE  `>=`
#>   <fct>       <dbl> <dbl> <dbl>
#> 1 7            40    44.7  32.7
#> 2 6            33.3  34.2  32.7
```

This concludes the vignette. If any elements are unclear, you can leave feedback in a mail or in a GitHub issue :-)