Fixing a severe MISTAKE in calculating the accuracy using ImageNet-A #15

Open · wants to merge 1 commit into master

Conversation

HengyueL

The eval.py example provided along with the release of the ImageNet-A dataset has a severe mistake in how it calculates ImageNet-A accuracy for a model pretrained on ImageNet-1K.

Cause:
ImageNet-1K has 1,000 labelled classes, whereas ImageNet-A has only 200 labelled classes (a subset defined by the variable "thousands_k_to_200"). Since the model under test is trained on the ImageNet-1K dataset, the old version determines that model's prediction in the wrong way, causing the final ImageNet-A classification accuracy to be over-estimated.
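To make the mismatch concrete, here is a minimal sketch of the two label spaces. The short index list is a hypothetical placeholder (the real list has 200 entries), and the names simply follow the variables mentioned in this issue:

# Hypothetical example: the real indices_in_1k has 200 entries, one
# ImageNet-1K class index per ImageNet-A class.
indices_in_1k = [6, 11, 13, 15, 17]
# Inverse map: ImageNet-1K index -> ImageNet-A index (the "thousands_k_to_200" idea).
thousands_k_to_200 = {k1000: i200 for i200, k1000 in enumerate(indices_in_1k)}
# e.g. thousands_k_to_200[13] == 2, i.e. ImageNet-1K class 13 is ImageNet-A class 2.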

Reason:
I believe the goal of the ImageNet-A dataset is to test the robustness of the original ImageNet-1K model. That means that when we apply the argmax rule to determine the pretrained model's prediction, we should not assume the subset of 200 possible labels is known to the model and determine the prediction from only those 200 logits. Instead, we should consider all 1,000 logits and check whether the argmax lands on the correct label within the subset.

Approach:
Instead of using the pipeline provided as below:
output = net(data)[:, indices_in_1k]   # keep only the 200 logits of the ImageNet-A subset
pred = output.data.max(1)[1]           # argmax over those 200 logits only
correct = pred.eq(target)

We should compute the result as follows (pseudocode to convey the idea; see the pull request for the actual implementation, and the runnable sketch below):
output = net(data)                         # all 1,000 ImageNet-1K logits
pred_1k = torch.argmax(output, dim=1)      # argmax over the full 1,000-way label space
pred = MAP_TO_200_SUBCLASS[pred_1k]        # map the 1K prediction to the 200-class space (invalid if outside the subset)
correct = pred.eq(target)
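For completeness, here is a minimal runnable sketch of the corrected evaluation loop. It assumes indices_in_1k is the 200-entry list described above and that the data loader yields targets in the 0..199 ImageNet-A label space; it illustrates the idea rather than reproducing the exact code in the pull request:

import torch

@torch.no_grad()
def evaluate_imagenet_a(net, loader, indices_in_1k, device="cuda"):
    # indices_in_1k[i] is the ImageNet-1K class index of ImageNet-A class i.
    idx_map = torch.tensor(indices_in_1k, device=device)
    net.eval()
    correct, total = 0, 0
    for data, target in loader:              # target in 0..199 (ImageNet-A labels)
        data, target = data.to(device), target.to(device)
        logits = net(data)                   # shape (B, 1000): all ImageNet-1K logits
        pred_1k = logits.argmax(dim=1)       # argmax over the full 1,000-way label space
        target_1k = idx_map[target]          # ImageNet-A label -> ImageNet-1K index
        correct += pred_1k.eq(target_1k).sum().item()
        total += target.numel()
    return correct / total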

Result:
The old eval.py significantly over-estimates the robust accuracy. For example, Table 6 of the DINOv2 paper (https://arxiv.org/pdf/2304.07193.pdf) reports an IN-A accuracy of 75.9 using the old eval.py protocol; if the same model is evaluated with the corrected version, the number drops to roughly 53%.

Impact:
Considering the number of citations of the original paper, I think this issue needs to be broadcast and clarified in the community, so that researchers are aware that previous claims based on the ImageNet-A evaluation are very likely over-optimistic.

@zhulinchng

@HengyueL I think that because ImageNet-A has only 200 classes, it is better to compare against only the relevant classes predicted by standard classifiers trained on the 1,000 classes.

I believe your version of the evaluation would be most appropriate if:

  1. the classifiers were trained on only the 200 classes from ImageNet-1K; or
  2. the ImageNet-A dataset had the same 1,000 classes as ImageNet-1K.

@xksteven

Thanks for the PR, but @zhulinchng's comments reflect our original thinking: we didn't want to bias the models or expect them to be calibrated for OOD images. In this way we simply grade their performance against our 200-class subset.

Both versions of the evaluation can be viewed as "correct"; they simply measure slightly different things. We have had discussions with other groups who choose to evaluate in the way you're proposing.
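For reference, the two protocols can be computed from the same forward pass, which makes it easy to report both numbers side by side. This is an illustrative sketch under the same assumptions as above, not code from the repository:

import torch

@torch.no_grad()
def both_accuracies(net, loader, indices_in_1k, device="cuda"):
    idx_map = torch.tensor(indices_in_1k, device=device)   # 200 ImageNet-1K indices
    net.eval()
    correct_subset, correct_full, total = 0, 0, 0
    for data, target in loader:                    # target in 0..199
        data, target = data.to(device), target.to(device)
        logits = net(data)                         # shape (B, 1000)
        # Original protocol: argmax restricted to the 200-class subset.
        correct_subset += logits[:, idx_map].argmax(dim=1).eq(target).sum().item()
        # Proposed protocol: argmax over all 1,000 classes, compared in 1K index space.
        correct_full += logits.argmax(dim=1).eq(idx_map[target]).sum().item()
        total += target.numel()
    return correct_subset / total, correct_full / total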

Thanks for the PR, but we will not merge it.
