This file documents a collection of models reported in our paper. All numbers were obtained on Big Basin servers with 8 NVIDIA V100 GPUs & NVLink (except for the COCO panoptic segmentation models, which were trained with 64 NVIDIA V100 GPUs).
- The "Name" column contains a link to the config file. Running
train_net.py --num-gpus 8
with this config file will reproduce the model (except for COCO panoptic segmentation models are trained with 64 NVIDIA V100 GPUs with distributed training). - The model id column is provided for ease of reference. To check downloaded file integrity, any model on this page contains its md5 prefix in its file name.
- Training curves and other statistics can be found in
metrics
for each model.
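As a concrete illustration of the integrity check, the sketch below recomputes a checkpoint's md5 and compares it against the hex prefix embedded in the file name. The `model_final_<md5prefix>.pkl` naming pattern and the `verify_checkpoint` helper are assumptions for this example, not part of the release.

```python
import hashlib
import re
from pathlib import Path

def verify_checkpoint(path):
    """Return True if the file's md5 hash starts with the hex prefix
    embedded in its name, e.g. model_final_f10217.pkl -> "f10217".
    The naming pattern is an assumption for this sketch."""
    match = re.search(r"([0-9a-f]{6,32})$", Path(path).stem)
    if match is None:
        raise ValueError(f"no md5 prefix found in file name: {path}")
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Hash the file in 1 MB chunks to keep memory use flat.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest().startswith(match.group(1))

# Usage: verify_checkpoint("model_final_f10217.pkl")
```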
It's common to initialize from backbone models pre-trained on ImageNet classification tasks. The following backbone models are available:
- R-50.pkl (torchvision): converted copy of torchvision's ResNet-50 model. More details can be found in the conversion script.
- R-103.pkl: a ResNet-101 whose first 7x7 convolution is replaced by three 3x3 convolutions (see the sketch below). This modification has been used in most semantic segmentation papers, and the resulting backbone is referred to as ResNet101c in our paper. We pre-train this backbone on ImageNet using the default recipe of pytorch examples.
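For concreteness, here is a minimal PyTorch sketch of that stem replacement. The intermediate channel widths (64, 64, 128) follow the common DeepLab-style deep stem and are an assumption, since the text above does not spell them out.

```python
import torch
import torch.nn as nn

# Vanilla ResNet stem: one 7x7 stride-2 convolution.
plain_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

# R-103 / ResNet101c-style stem: three 3x3 convolutions, the first with
# stride 2, giving the same output resolution. Channel widths 64/64/128
# are assumed here; the usual 3x3 max pooling follows in both variants.
deep_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 224, 224)
print(plain_stem(x).shape, deep_stem(x).shape)  # both are downsampled 2x
```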
Note: the following pre-trained models are available in Detectron2 but are not used in our paper.
- R-50.pkl: converted copy of MSRA's original ResNet-50 model.
- R-101.pkl: converted copy of MSRA's original ResNet-101 model.
- X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB.
Our paper also uses ImageNet pre-trained models that are not part of Detectron2; please refer to tools to get those pre-trained models.
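As one example of what such a conversion tool does, below is a condensed sketch of turning a torchvision ResNet checkpoint into the Detectron2 `.pkl` format. The key-renaming rules mirror detectron2's `tools/convert-torchvision-to-d2.py` (the conversion script mentioned above); the file paths are placeholders.

```python
import pickle
import sys

import torch

# Usage: python convert.py resnet50-torchvision.pth R-50.pkl
if __name__ == "__main__":
    obj = torch.load(sys.argv[1], map_location="cpu")

    newmodel = {}
    for k in list(obj.keys()):
        old_k = k
        # torchvision names the stem layers conv1/bn1 at the top level.
        if "layer" not in k:
            k = "stem." + k
        # layer1..layer4 -> res2..res5, bnN -> convN.norm,
        # downsample -> shortcut.
        for t in [1, 2, 3, 4]:
            k = k.replace(f"layer{t}", f"res{t + 1}")
        for t in [1, 2, 3]:
            k = k.replace(f"bn{t}", f"conv{t}.norm")
        k = k.replace("downsample.0", "shortcut")
        k = k.replace("downsample.1", "shortcut.norm")
        newmodel[k] = obj.pop(old_k).detach().numpy()

    res = {"model": newmodel, "__author__": "torchvision", "matching_heuristics": True}
    with open(sys.argv[2], "wb") as f:
        pickle.dump(res, f)
```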
All models available for download through this document are licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
ADE20K semantic segmentation:

Name | Backbone | crop size | lr sched | train mem (MB) | mIoU | mIoU (ms+flip) | model id | download
---|---|---|---|---|---|---|---|---
PerPixelBaseline | R50 | 512x512 | 160k | 2451 | 39.2 | 40.9 | 40913338_1 | model \| metrics
PerPixelBaseline+ | R50 | 512x512 | 160k | 5817 | 41.9 | 42.9 | 40931736_2 | model \| metrics
MaskFormer | R50 | 512x512 | 160k | 4334 | 44.5 | 46.7 | 40931736_14 | model \| metrics
MaskFormer | R101 | 512x512 | 160k | 4905 | 45.5 | 47.2 | 40986936_1 | model \| metrics
MaskFormer | R101c | 512x512 | 160k | 4968 | 46.0 | 48.1 | 41703904_1 | model \| metrics
MaskFormer | Swin-T | 512x512 | 160k | 5292 | 46.7 | 48.8 | 40986951_3 | model \| metrics
MaskFormer | Swin-S | 512x512 | 160k | 6330 | 49.8 | 51.0 | 40846700_5 | model \| metrics
MaskFormer | Swin-B | 640x640 | 160k | 12928 | 52.7 | 53.9 | 40986951_0 | model \| metrics
MaskFormer | Swin-L | 640x640 | 160k | 18144 | 54.1 | 55.6 | 40846700_0 | model \| metrics
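The mIoU (ms+flip) columns report accuracy with multi-scale and horizontal-flip test-time augmentation. The sketch below shows that protocol generically; the scale set and the assumption that the model maps a `(1, 3, H, W)` tensor to `(1, C, H, W)` logits are illustrative, not MaskFormer's exact evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ms_flip_probs(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
    """Average per-pixel class probabilities over several scales and a
    horizontal flip (the "ms+flip" protocol). The scale set is an
    assumption; it varies between papers and codebases."""
    _, _, h, w = image.shape
    avg = 0.0
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        for flip in (False, True):
            xi = torch.flip(x, dims=[3]) if flip else x
            logits = model(xi)
            if flip:
                logits = torch.flip(logits, dims=[3])  # undo the flip
            # Resize back to the original resolution before averaging.
            probs = F.interpolate(
                logits, size=(h, w), mode="bilinear", align_corners=False
            ).softmax(dim=1)
            avg = avg + probs
    return avg / (2 * len(scales))
```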
COCO-Stuff-10K semantic segmentation:

Name | Backbone | lr sched | train mem (MB) | mIoU | mIoU (ms+flip) | model id | download
---|---|---|---|---|---|---|---
PerPixelBaseline | R50 | 60k | 6898 | 32.4 | 34.4 | 40941321_0 | model \| metrics
PerPixelBaseline+ | R50 | 60k | 18227 | 34.2 | 35.8 | 40941321_3 | model \| metrics
MaskFormer | R50 | 60k | 8618 | 37.1 | 38.9 | 40941321_6 | model \| metrics
MaskFormer | R101 | 60k | 10091 | 38.1 | 39.8 | 40986940_1 | model \| metrics
MaskFormer | R101c | 60k | 9927 | 38.0 | 39.3 | 41703904_3 | model \| metrics
ADE20K-Full semantic segmentation:

Name | Backbone | lr sched | train mem (MB) | mIoU | model id | download
---|---|---|---|---|---|---
PerPixelBaseline | R50 | 200k | 8030 | 12.4 | 40986914_5 | model \| metrics
PerPixelBaseline+ | R50 | 200k | 26698 | 13.9 | 40986914_6 | model \| metrics
MaskFormer | R50 | 200k | 6529 | 16.0 | 40986914_1 | model \| metrics
MaskFormer | R101 | 200k | 6894 | 16.8 | 40986946_1 | model \| metrics
MaskFormer | R101c | 200k | 6904 | 17.4 | 41703904_6 | model \| metrics
Cityscapes semantic segmentation:

Name | Backbone | lr sched | train mem (MB) | mIoU | mIoU (ms+flip) | model id | download
---|---|---|---|---|---|---|---
MaskFormer | R101 | 90k | 6960 | 78.5 | 80.3 | 41127351_1 | model \| metrics
MaskFormer | R101c | 90k | 7204 | 79.7 | 81.4 | 41630444_2 | model \| metrics
Mapillary Vistas semantic segmentation:

Name | Backbone | lr sched | train mem (MB) | mIoU | mIoU (ms+flip) | model id | download
---|---|---|---|---|---|---|---
MaskFormer | R50 | 300k | 15761 | 53.1 | 55.4 | 42325118 | model \| metrics
COCO panoptic segmentation:

Name | Backbone | lr sched | train mem (MB) | PQ | model id | download
---|---|---|---|---|---|---
MaskFormer | R50 + 6 Enc | 554k | 22634 | 46.5 | 42747488_1 | model \| metrics
MaskFormer | R101 + 6 Enc | 554k | 27358 | 47.6 | 42747488_0 | model \| metrics
MaskFormer | Swin-T | 554k | 20023 | 47.7 | 41143190_0 | model \| metrics
MaskFormer | Swin-S | 554k | 21620 | 49.7 | 41270920 | model \| metrics
MaskFormer | Swin-B | 554k | 24411 | 51.8 | 41260906 | model \| metrics
MaskFormer | Swin-L | 554k | 23275 | 52.7 | 43219274 | model \| metrics
Note:
- All COCO panoptic segmentation models are trained with 64 NVIDIA V100 GPUs.
- For the Swin-L model, we set `MAX_SIZE_TRAIN=1000` due to memory constraints.
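For reference, `MAX_SIZE_TRAIN` corresponds to detectron2's `INPUT.MAX_SIZE_TRAIN` key, which caps the longer image side during training. A minimal sketch of applying the same override through the config API is shown below; the released Swin-L config already sets this, so the snippet only illustrates the mechanism.

```python
from detectron2.config import get_cfg

cfg = get_cfg()
# Cap the longer training-image side at 1000 pixels (detectron2's default
# is 1333) to reduce activation memory, as done for the Swin-L model.
cfg.merge_from_list(["INPUT.MAX_SIZE_TRAIN", 1000])
print(cfg.INPUT.MAX_SIZE_TRAIN)  # 1000
```

In detectron2-style training scripts, the same override can also be appended to the command line as trailing `INPUT.MAX_SIZE_TRAIN 1000` arguments.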
ADE20K panoptic segmentation:

Name | Backbone | lr sched | train mem (MB) | PQ | model id | download
---|---|---|---|---|---|---
MaskFormer | R50 + 6 Enc | 720k | 15899 | 34.7 | 42746872_1 | model \| metrics
MaskFormer | R50 + 6 Enc | 720k | 16516 | 35.7 | 42747444 | model \| metrics