Fixing a severe MISTAKE in calculating the accuracy using ImageNet-A #15
The eval.py example provided along with the release of the ImageNet-A dataset has a severe mistake in calculating ImageNet-A accuracy for an ImageNet-1K pretrained model.
Cause:
ImageNet-1K has 1,000 labelled classes, whereas ImageNet-A has only 200 (a subset defined by the variable "thousands_k_to_200"). When the model under test is trained on the ImageNet-1K dataset, the old version determines that model's prediction in the wrong way, causing the final ImageNet-A classification accuracy to be over-estimated.
Reason:
I believe the goal of the ImageNet-A dataset is to test the robustness of the original ImageNet-1K model. When we apply the argmax rule to determine the pretrained model's prediction, we should not assume the 200-class label subset is known to the model and take the argmax over only those 200 logits. Instead, we should consider all 1,000 logits and check whether the argmax lands on the correct label within the 200-class subset.
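To make this concrete, here is a minimal toy illustration (hypothetical logits and a 5-class / 2-class setup, not taken from eval.py or any real model) of how restricting the argmax to the subset logits can turn a wrong top-1 prediction into one that is counted as correct:

import torch

# Toy label spaces: 5 "ImageNet-1K" classes, of which 1K classes 1 and 3 form the "ImageNet-A" subset.
indices_in_1k = torch.tensor([1, 3])                 # subset class index -> 1K class index
logits = torch.tensor([[0.1, 0.3, 2.0, 0.5, 0.0]])   # hypothetical logits for one image
target = torch.tensor([1])                           # ground truth in subset coordinates (1K class 3)

# Old protocol: argmax over the 2 subset logits only -> predicts subset class 1 -> counted as correct.
pred_old = logits[:, indices_in_1k].max(1)[1]
print(pred_old.eq(target).item())                    # True (over-optimistic)

# Proposed protocol: argmax over all 5 logits -> predicts 1K class 2, which is outside the subset.
pred_1k = logits.argmax(dim=1)
print(pred_1k.eq(indices_in_1k[target]).item())      # False (the model is actually wrong)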
Approach:
Instead of using the pipeline provided below:
# Old protocol: keep only the 200 subset logits, then take the argmax over those 200
output = net(data)[:, indices_in_1k]
pred = output.data.max(1)[1]
correct = pred.eq(target)
We should compute the result as follows (pseudocode only; see the pull request for the actual implementation):
output = net(data)                        # all 1,000 logits
pred_1k = torch.argmax(output, dim=1)     # argmax over the full 1K label space
pred = MAP_TO_200_SUBCLASS[pred_1k]       # map the 1K prediction back to the 200-class index
correct = pred.eq(target)
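For completeness, a self-contained sketch of the corrected evaluation loop could look like the one below. This is only a sketch under assumed names: net is an ImageNet-1K pretrained model, loader yields ImageNet-A batches whose targets are 200-class indices, and indices_in_1k maps each of the 200 classes back to its ImageNet-1K index (as in eval.py); the actual implementation is in this pull request:

import torch

@torch.no_grad()
def evaluate_imagenet_a(net, loader, indices_in_1k, device="cuda"):
    # Accuracy where the prediction is the argmax over all 1,000 logits.
    subset_to_1k = torch.tensor(indices_in_1k, device=device)  # 200-class index -> 1K class index
    net.eval()
    num_correct, num_total = 0, 0
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        output = net(data)              # logits over all 1,000 classes
        pred_1k = output.argmax(dim=1)  # argmax over the full 1K label space
        # Correct only if the top-1 class among all 1,000 classes is exactly
        # the 1K class corresponding to the 200-class ground-truth label.
        num_correct += pred_1k.eq(subset_to_1k[target]).sum().item()
        num_total += target.size(0)
    return num_correct / num_total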
Result:
The old eval.py significantly over-estimates the robust accuracy. For example, Table 6 of the DINOv2 paper (https://arxiv.org/pdf/2304.07193.pdf) reports an IN-A accuracy of 75.9 under the old eval.py protocol; when the same model is evaluated with the corrected version, this number drops to ~53%.
Impact:
Considering how widely the original paper is cited, I think this issue needs to be broadcast and clarified in the community, so that researchers are aware that previous claims based on this ImageNet-A evaluation are very likely over-optimistic.