visualize using a simple boxplot showing the spread of each performance metric across our validation folds

given that each box summarizes only a handful of folds, it makes sense to plot the individual points as well, so the small sample size is visible
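a minimal sketch of this kind of plot, assuming matplotlib and numpy are available. The metric names, the fold scores, and the helper `plot_metric_spread` are made up for illustration and are not our actual results:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical scores from 5 validation folds (placeholder values).
fold_scores = {
    "accuracy": [0.81, 0.78, 0.84, 0.80, 0.79],
    "precision": [0.75, 0.72, 0.80, 0.77, 0.70],
    "recall": [0.68, 0.74, 0.71, 0.66, 0.73],
}


def plot_metric_spread(scores, seed=0):
    """Draw one box per metric and overlay the individual fold scores."""
    rng = np.random.default_rng(seed)
    names = list(scores)
    fig, ax = plt.subplots(figsize=(6, 4))

    # Boxes sit at x = 1, 2, ...; outlier markers are hidden because
    # every point is drawn explicitly below.
    ax.boxplot([scores[name] for name in names], showfliers=False)

    for position, name in enumerate(names, start=1):
        values = scores[name]
        # Small horizontal jitter so overlapping points stay distinguishable.
        x = position + rng.uniform(-0.08, 0.08, size=len(values))
        ax.scatter(x, values, color="black", alpha=0.7, zorder=3)

    ax.set_xticks(range(1, len(names) + 1))
    ax.set_xticklabels(names)
    ax.set_ylabel("score")
    ax.set_title("Metric spread across validation folds")
    return fig, ax


fig, ax = plot_metric_spread(fold_scores)
```

calling `fig.savefig(...)` or `plt.show()` afterwards renders the figure; the jitter is purely cosmetic and does not change the plotted scores.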


Multilabel results

results are more complex for multilabel, but we can compute the same metrics for each class

→ then show different aggregations of these across classes

→ the macro average takes the unweighted mean of the per-label metric values; the micro average computes the metric globally by pooling true/false positives and negatives across all classes
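to make the difference concrete, here is a small sketch that computes per-class, macro and micro precision by hand with numpy. The label matrices are invented toy data (4 samples, 3 labels), and the sketch assumes every class has at least one predicted positive so no division by zero occurs:

```python
import numpy as np

# Toy multilabel data: rows are samples, columns are classes (made-up values).
y_true = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
])
y_pred = np.array([
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
])

# Per-class counts of true positives and false positives.
tp = np.sum((y_pred == 1) & (y_true == 1), axis=0)
fp = np.sum((y_pred == 1) & (y_true == 0), axis=0)

# Precision computed separately for each class.
per_class_precision = tp / (tp + fp)

# Macro: unweighted mean of the per-class values.
macro_precision = per_class_precision.mean()

# Micro: pool the counts over all classes first, then compute the metric once.
micro_precision = tp.sum() / (tp.sum() + fp.sum())

print("per class:", per_class_precision)
print("macro:", macro_precision)
print("micro:", micro_precision)
```

here the macro average is (0.75 + 0.5 + 0.5) / 3 ≈ 0.583, while the micro average is 5 / 8 = 0.625: the first class contributes more predictions, so it carries more weight in the pooled counts. scikit-learn's `precision_score` offers the same aggregations through its `average` parameter ("macro" or "micro").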
