Normalised Dice Similarity Coefficient - Shifts Challenge 2022

Normalised Dice Similarity Coefficient

Typically, the Dice Similarity Coefficient (DSC) is used as the performance metric between the ground truth Y and its corresponding prediction Ŷ:

DSC = 2 |Y ∩ Ŷ| / (|Y| + |Ŷ|)
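As a minimal sketch of this overlap measure (twice the number of voxels shared by the two masks, divided by the sum of the two mask sizes), the function below works on flattened binary masks given as sequences of 0/1. The name `dsc` and the convention of returning 1.0 for two empty masks are assumptions made for illustration; for volumetric arrays one would more typically use an array library.

```python
def dsc(ground_truth, prediction):
    """Dice Similarity Coefficient between two flattened binary masks (0/1)."""
    gt = [bool(v) for v in ground_truth]
    pred = [bool(v) for v in prediction]
    if len(gt) != len(pred):
        raise ValueError("masks must have the same number of voxels")

    # Voxels marked as lesion in both masks.
    overlap = sum(g and p for g, p in zip(gt, pred))
    total = sum(gt) + sum(pred)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * overlap / total


y = [0, 1, 1, 0, 1, 0]
y_hat = [0, 1, 0, 0, 1, 1]
print(dsc(y, y_hat))  # 2 * 2 / (3 + 3)
```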

The reported score is usually the DSC averaged across all patient scans. However, DSC is biased towards greater values for patients with a greater lesion load, i.e. a greater probability of the event occurring, where the event is identifying a voxel as a lesion. To decorrelate DSC from lesion load and obtain an unbiased metric of performance, we consider a normalised DSC (nDSC). The following steps explain how the proposed nDSC is calculated and why:

1. The probability of a successful event (identifying a lesion) influences the DSC score as the precision at 100% recall varies across the patients (the precision at 100% recall is simply the percentage of lesion voxels for the patient - i.e. the lesion load).

2. The DSC score is calculated as the harmonic mean of the precision, Prτ, and recall, Reτ, at a selected threshold, τ, i.e. DSCτ = 2 · Prτ · Reτ / (Prτ + Reτ). (ML models typically produce a probabilistic prediction for each voxel, which must be compared against a threshold to classify the voxel as either the positive class or the negative class.)

3. Here, the recall is held fixed and the precision for each patient is adjusted (Prτ → Pr′τ) by a different amount such that the cross-patient performance can be fairly evaluated.
4. The new value of the precision is determined by scaling the FP (false positives) by a factor, kp, that is different for each patient, p.
5. kp for each patient is determined using the 100% recall point, as this point is not influenced by model performance.
6. Hence, kp for patient p is the factor by which the FP at 100% recall must be scaled to ensure that the precision achieved equals a chosen reference value, r. The derivation of kp is given below. The subscript 100% denotes operating at 100% recall.
<img src="https://public.grand-challenge-user-content.org/i/2022/09/29/Screenshot_2022-09-29_at_13.15.35.png"     style="width: 803px;" />
7. Here, r is selected as 0.1% because this is approximately the average precision across the patients at 100% recall (i.e. the average fraction of lesion voxels).
8. The recall is not influenced by scaling the FP by kp.

9. The precision is directly affected, as the new precision at our selected operating point (threshold to form the segmentation mask), τ∗, is given by:

<img src="https://public.grand-challenge-user-content.org/i/2022/09/29/Screenshot_2022-09-29_at_13.15.39.png"     style="width: 50%;" />

10. Thus, nDSC is calculated for each patient as the harmonic mean of the adjusted precision and the recall, Reτ∗, both taken at τ∗.

Recall that kp is given in step 6.
The averaged nDSC is used as the predictive performance metric.
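Written out, the quantities described in the steps above take the following form. This is a reconstruction from the definitions in the text, not a transcription of the original figures. TP, FP and FN (true positives, false positives, false negatives) and the prime on the adjusted precision are notation assumed here, and nDSC is assumed to combine adjusted precision and recall in the same way DSC combines precision and recall.

```latex
\begin{align}
\frac{TP_{100\%}}{TP_{100\%} + k_p \, FP_{100\%}} &= r
\quad\Longrightarrow\quad
k_p = \frac{(1 - r)\, TP_{100\%}}{r \, FP_{100\%}}, \\
Pr'_{\tau^*} &= \frac{TP_{\tau^*}}{TP_{\tau^*} + k_p \, FP_{\tau^*}}, \qquad
Re_{\tau^*} = \frac{TP_{\tau^*}}{TP_{\tau^*} + FN_{\tau^*}}, \\
\mathrm{nDSC} &= \frac{2 \, Pr'_{\tau^*} \, Re_{\tau^*}}{Pr'_{\tau^*} + Re_{\tau^*}}
= \frac{2 \, TP_{\tau^*}}{2 \, TP_{\tau^*} + k_p \, FP_{\tau^*} + FN_{\tau^*}}.
\end{align}
```

At 100% recall every voxel is labelled as lesion, so TP at 100% recall is the number of lesion voxels and FP at 100% recall is the number of non-lesion voxels. kp therefore depends only on the patient's lesion load and on r, not on the model.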

Figure 1: Empirical relationship of each metric with lesion load on Evlin using a UNET ensemble. Left: DSC; right: nDSC.

We empirically demonstrate via Figure 1 that the nDSC metric is less dependent on the lesion load than DSC. Recall that lesion load is defined as the fraction of voxels that are lesion voxels for a given subject. Figure 1 plots the performance in terms of both DSC and nDSC against the lesion load for each subject in Evlin. DSC clearly depends on the lesion load, while nDSC decorrelates this relationship, giving an approximately flat average trend.
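A toy calculation can make the same point analytically. The sketch below assumes a hypothetical model with identical per-voxel true-positive and false-positive rates for every subject (`tpr` and `fpr` are illustrative values, not results from the challenge). It takes kp to be the factor that brings the precision at 100% recall to r, and works with expected counts per voxel rather than real segmentations.

```python
def expected_scores(lesion_load, tpr=0.8, fpr=0.001, r=0.001):
    """Expected DSC and nDSC (per-voxel counts) for fixed, assumed TPR and FPR."""
    tp = tpr * lesion_load
    fn = (1.0 - tpr) * lesion_load
    fp = fpr * (1.0 - lesion_load)

    # Scaling factor: at 100% recall, TP is the lesion fraction and
    # FP is the non-lesion fraction, and scaled precision must equal r.
    k = (1.0 - r) * lesion_load / (r * (1.0 - lesion_load))

    dsc = 2.0 * tp / (2.0 * tp + fp + fn)
    ndsc = 2.0 * tp / (2.0 * tp + k * fp + fn)
    return dsc, ndsc


for load in (0.0005, 0.001, 0.01):
    dsc, ndsc = expected_scores(load)
    print(f"lesion load {load:.4f}: DSC = {dsc:.3f}, nDSC = {ndsc:.3f}")
```

Under these assumptions the lesion load cancels out of the nDSC expression, so nDSC is the same for every subject while DSC grows with lesion load. The two metrics coincide when the lesion load equals r. Real models do not have identical rates across subjects, which is why the empirical trend is only approximately flat.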

The code to compute the nDSC is available in the GitHub repo.
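For orientation, the steps above can be sketched in a few lines. This is not the code from the repository: the function name `ndsc`, the use of flattened 0/1 sequences, and the handling of empty masks are assumptions made for illustration, and the closed form used is the harmonic mean of the FP-scaled precision and the recall.

```python
def ndsc(ground_truth, prediction, r=0.001):
    """Normalised DSC for one patient from flattened binary masks (0/1)."""
    gt = [int(v) for v in ground_truth]
    pred = [int(v) for v in prediction]
    if len(gt) != len(pred):
        raise ValueError("masks must have the same number of voxels")

    n_lesion = sum(gt)
    n_background = len(gt) - n_lesion
    tp = sum(1 for g, p in zip(gt, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gt, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gt, pred) if g == 1 and p == 0)

    if tp + fp + fn == 0:
        # Assumed convention: no lesion present and none predicted is perfect.
        return 1.0
    if n_lesion == 0 or n_background == 0:
        # Assumed fallback: the scaling factor is undefined, so FP is unscaled.
        k = 1.0
    else:
        # Factor that makes the precision at 100% recall equal to r.
        k = (1.0 - r) * n_lesion / (r * n_background)

    return 2.0 * tp / (2.0 * tp + k * fp + fn)


gt = [1] * 10 + [0] * 990                       # lesion load of 1%
pred = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 985  # 8 TP, 2 FN, 5 FP
print(ndsc(gt, pred))
```

Averaging the per-patient values returned by such a function gives the reported score.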