MaskFormer Model Zoo and Baselines

Introduction

This file documents a collection of models reported in our paper. All numbers were obtained on Big Basin servers with 8 NVIDIA V100 GPUs and NVLink, except for the COCO panoptic segmentation models, which were trained with 64 NVIDIA V100 GPUs.

How to Read the Tables

The "Name" column contains a link to the config file. Running train_net.py --num-gpus 8 with this config file will reproduce the model, except for the COCO panoptic segmentation models, which are trained on 64 NVIDIA V100 GPUs with distributed training.
The "model id" column is provided for ease of reference. So that you can check the integrity of a downloaded file, every model on this page contains its md5 prefix in its file name.
Training curves and other statistics can be found in the "metrics" link for each model.
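
The integrity check mentioned above can be sketched in Python. This is a minimal sketch, not a script from this repository: the function names are hypothetical, and it assumes the downloaded file name ends in `_<md5 prefix>.pkl`, a pattern this page does not spell out.

```python
import hashlib
import re
from pathlib import Path


def md5_hexdigest(path, chunk_size=1 << 20):
    """Return the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_prefix(path):
    """Check that the hex prefix embedded in the file name matches the file's md5.

    Assumption: the name ends in "_<hex prefix>.pkl" with at least 6 hex digits.
    """
    match = re.search(r"_([0-9a-f]{6,32})\.pkl$", Path(path).name)
    if match is None:
        raise ValueError(f"no md5 prefix found in file name: {path}")
    return md5_hexdigest(path).startswith(match.group(1))
```

Calling `matches_md5_prefix` on a downloaded checkpoint returns `True` when the file content agrees with the prefix in its name, and `False` when the download is likely corrupted or truncated.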

Detectron2 ImageNet Pretrained Models

It's common to initialize from backbone models pre-trained on ImageNet classification tasks. The following backbone models are available:

R-50.pkl (torchvision): converted copy of torchvision's ResNet-50 model. More details can be found in the conversion script.
R-103.pkl: a ResNet-101 with its first 7x7 convolution replaced by three 3x3 convolutions. This modification has been used in most semantic segmentation papers (it is called ResNet101c in our paper). We pre-train this backbone on ImageNet using the default recipe of pytorch examples.

Note: the following pretrained models are also available in Detectron2, but we do not use them in our paper.

R-50.pkl: converted copy of MSRA's original ResNet-50 model.
R-101.pkl: converted copy of MSRA's original ResNet-101 model.
X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB.
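
To see what a downloaded backbone file contains, the parameter names can be listed with the standard library alone. This is a rough sketch under stated assumptions: the function name is hypothetical, and the file layouts it handles (a pickled dict with a "model" entry, a dict with a "blobs" entry, or a bare name-to-array dict) are assumptions about the .pkl files rather than something this page documents. Unpickling array-valued files also needs NumPy installed.

```python
import pickle


def list_parameter_names(pkl_path):
    """Return the sorted parameter names stored in a pickled backbone file.

    Security note: pickle can execute arbitrary code, so only load files
    from a source you trust.
    """
    with open(pkl_path, "rb") as f:
        # latin1 lets Python 3 read pickles that were written by Python 2.
        data = pickle.load(f, encoding="latin1")

    # Assumed layouts, checked from most to least specific.
    if isinstance(data, dict) and "model" in data:
        weights = data["model"]
    elif isinstance(data, dict) and "blobs" in data:
        weights = data["blobs"]
    else:
        weights = data
    return sorted(weights.keys())
```

Comparing the names returned for R-50.pkl (torchvision) and R-103.pkl is a quick way to confirm which backbone variant a file holds before pointing a config at it.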

Third-party ImageNet Pretrained Models

Our paper also uses ImageNet pretrained models that are not part of Detectron2. Please refer to tools to get those pretrained models.

License

All models available for download through this document are licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

Semantic Segmentation Models

ADE20K Semantic Segmentation

Name	Backbone	crop size	lr sched	train mem (MB)	mIoU	mIoU (ms+flip)	model id	download
PerPixelBaseline	R50	512x512	160k	2451	39.2	40.9	40913338_1	model | metrics
PerPixelBaseline+	R50	512x512	160k	5817	41.9	42.9	40931736_2	model | metrics
MaskFormer	R50	512x512	160k	4334	44.5	46.7	40931736_14	model | metrics
MaskFormer	R101	512x512	160k	4905	45.5	47.2	40986936_1	model | metrics
MaskFormer	R101c	512x512	160k	4968	46.0	48.1	41703904_1	model | metrics
MaskFormer	Swin-T	512x512	160k	5292	46.7	48.8	40986951_3	model | metrics
MaskFormer	Swin-S	512x512	160k	6330	49.8	51.0	40846700_5	model | metrics
MaskFormer	Swin-B	640x640	160k	12928	52.7	53.9	40986951_0	model | metrics
MaskFormer	Swin-L	640x640	160k	18144	54.1	55.6	40846700_0	model | metrics

COCO-Stuff-10K Semantic Segmentation

Name	Backbone	lr sched	train mem (MB)	mIoU	mIoU (ms+flip)	model id	download
PerPixelBaseline	R50	60k	6898	32.4	34.4	40941321_0	model | metrics
PerPixelBaseline+	R50	60k	18227	34.2	35.8	40941321_3	model | metrics
MaskFormer	R50	60k	8618	37.1	38.9	40941321_6	model | metrics
MaskFormer	R101	60k	10091	38.1	39.8	40986940_1	model | metrics
MaskFormer	R101c	60k	9927	38.0	39.3	41703904_3	model | metrics

ADE20K-Full Semantic Segmentation

Name	Backbone	lr sched	train mem (MB)	mIoU	model id	download
PerPixelBaseline	R50	200k	8030	12.4	40986914_5	model | metrics
PerPixelBaseline+	R50	200k	26698	13.9	40986914_6	model | metrics
MaskFormer	R50	200k	6529	16.0	40986914_1	model | metrics
MaskFormer	R101	200k	6894	16.8	40986946_1	model | metrics
MaskFormer	R101c	200k	6904	17.4	41703904_6	model | metrics

Cityscapes Semantic Segmentation

Name	Backbone	lr sched	train mem (MB)	mIoU	mIoU (ms+flip)	model id	download
MaskFormer	R101	90k	6960	78.5	80.3	41127351_1	model | metrics
MaskFormer	R101c	90k	7204	79.7	81.4	41630444_2	model | metrics

Mapillary Vistas Semantic Segmentation

Name	Backbone	lr sched	train mem (MB)	mIoU	mIoU (ms+flip)	model id	download
MaskFormer	R50	300k	15761	53.1	55.4	42325118	model | metrics

Panoptic Segmentation Models

COCO Panoptic Segmentation

Name	Backbone	lr sched	train mem (MB)	PQ	model id	download
MaskFormer	R50 + 6 Enc	554k	22634	46.5	42747488_1	model | metrics
MaskFormer	R101 + 6 Enc	554k	27358	47.6	42747488_0	model | metrics
MaskFormer	Swin-T	554k	20023	47.7	41143190_0	model | metrics
MaskFormer	Swin-S	554k	21620	49.7	41270920	model | metrics
MaskFormer	Swin-B	554k	24411	51.8	41260906	model | metrics
MaskFormer	Swin-L	554k	23275	52.7	43219274	model | metrics

Note:

All COCO panoptic segmentation models are trained with 64 NVIDIA V100 GPUs.
For the Swin-L model, we set MAX_SIZE_TRAIN=1000 due to memory constraints.
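
The two notes above translate into a launch command roughly as follows. This is a sketch built on assumptions, not the authors' actual launch script: `build_train_command` and the config path are hypothetical, the 64 GPUs are assumed to be split as 8 machines with 8 GPUs each, and `INPUT.MAX_SIZE_TRAIN` is assumed to be the full Detectron2 config key behind MAX_SIZE_TRAIN. The `--config-file`, `--num-gpus` and `--num-machines` flags and the trailing `KEY VALUE` overrides are the ones provided by Detectron2's default argument parser.

```python
import shlex


def build_train_command(config_file, num_gpus=8, num_machines=1, overrides=None):
    """Assemble a train_net.py command line as a list of arguments."""
    cmd = [
        "python", "train_net.py",
        "--config-file", config_file,
        "--num-gpus", str(num_gpus),  # GPUs per machine
    ]
    if num_machines > 1:
        # A real multi-machine run also needs --machine-rank and --dist-url
        # set on each machine; they are left out of this sketch.
        cmd += ["--num-machines", str(num_machines)]
    # Config overrides are passed as trailing "KEY VALUE" pairs.
    for key, value in (overrides or {}).items():
        cmd += [key, str(value)]
    return cmd


# Hypothetical Swin-L COCO panoptic launch: 8 machines x 8 GPUs (assumed split).
swin_l_cmd = build_train_command(
    "path/to/config.yaml",  # placeholder, not a real path in the repository
    num_gpus=8,
    num_machines=8,
    overrides={"INPUT.MAX_SIZE_TRAIN": 1000},
)
print(shlex.join(swin_l_cmd))
```

Building the command as a list keeps each argument separate, so it can be passed directly to `subprocess.run` without shell quoting issues.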

ADE20K Panoptic Segmentation

Name	Backbone	lr sched	train mem (MB)	PQ	model id	download
MaskFormer	R50 + 6 Enc	720k	15899	34.7	42746872_1	model | metrics
MaskFormer	R101 + 6 Enc	720k	16516	35.7	42747444	model | metrics
