Memory allocation problem #94

YeSho-cpp · 2023-10-18T09:58:33Z

Hello, sorry to bother you. I am training MaskDINO on a cell nuclei dataset, but I am running out of memory. I reduced the batch size to 2 and set num_workers to 0, and training now starts, but it is very slow. Setting num_workers to even 1 causes a memory allocation failure. I have two A6000 graphics cards, but I cannot use both for distributed training, because memory allocation fails when I do. Which parameters should I change to reduce memory usage?

YeSho-cpp · 2023-10-18T09:59:39Z

This is my data set information

[10/18 11:09:22] d2.data.datasets.coco INFO: Loading /share/home/ncu10/Code/AI/Point_label/PointWSSIS/cell_data_root/coco/annotations/instances_train2017.json takes 2.70 seconds.
d2.data.datasets.coco INFO: Loaded 432 images in COCO format from /share/home/ncu10/Code/AI/Point_label/PointWSSIS/cell_data_root/coco/annotations/instances_train2017.json
d2.data.build INFO: Removed 0 images with no usable annotations. 432 images left.
d2.data.build INFO: Distribution of instances among all 80 categories:
| category      | #instances | category     | #instances | category      | #instances |
|:-------------:|:-----------|:------------:|:-----------|:-------------:|:-----------|
| person        | 17073      | bicycle      | 0          | car           | 0          |
| motorcycle    | 0          | airplane     | 0          | bus           | 0          |
| train         | 0          | truck        | 0          | boat          | 0          |
| traffic light | 0          | fire hydrant | 0          | stop sign     | 0          |
| parking meter | 0          | bench        | 0          | bird          | 0          |
| cat           | 0          | dog          | 0          | horse         | 0          |
| sheep         | 0          | cow          | 0          | elephant      | 0          |
| bear          | 0          | zebra        | 0          | giraffe       | 0          |
| backpack      | 0          | umbrella     | 0          | handbag       | 0          |
| tie           | 0          | suitcase     | 0          | frisbee       | 0          |
| skis          | 0          | snowboard    | 0          | sports ball   | 0          |
| kite          | 0          | baseball bat | 0          | baseball gl.. | 0          |
| skateboard    | 0          | surfboard    | 0          | tennis racket | 0          |
| bottle        | 0          | wine glass   | 0          | cup           | 0          |
| fork          | 0          | knife        | 0          | spoon         | 0          |
| bowl          | 0          | banana       | 0          | apple         | 0          |
| sandwich      | 0          | orange       | 0          | broccoli      | 0          |
| carrot        | 0          | hot dog      | 0          | pizza         | 0          |
| donut         | 0          | cake         | 0          | chair         | 0          |
| couch         | 0          | potted plant | 0          | bed           | 0          |
| dining table  | 0          | toilet       | 0          | tv            | 0          |
| laptop        | 0          | mouse        | 0          | remote        | 0          |
| keyboard      | 0          | cell phone   | 0          | microwave     | 0          |
| oven          | 0          | toaster      | 0          | sink          | 0          |
| refrigerator  | 0          | book         | 0          | clock         | 0          |
| vase          | 0          | scissors     | 0          | teddy bear    | 0          |
| hair drier    | 0          | toothbrush   | 0          |               |            |
| total         | 17073      |              |            |               |            |
d2.data.build INFO: Using training sampler TrainingSampler
d2.data.common INFO: Serializing the dataset using: <class 'detectron2.data.common._TorchSerializedList'>
d2.data.common INFO: Serializing 432 elements to byte tensors and concatenating them all ...
d2.data.common INFO: Serialized dataset takes 28.01 MiB

YeSho-cpp · 2023-10-18T10:02:58Z

Using the resnet50 model
[10/18 11:09:13] detectron2 INFO: Rank of current process: 0. World size: 1
[10/18 11:09:14] detectron2 INFO: Environment info:

sys.platform linux
Python 3.8.15 (default, Nov 24 2022, 15:19:38) [GCC 11.2.0]
numpy 1.24.4
detectron2 0.6 @/share/home/ncu10/Code/AI/Point_label/MaskDINO/detectron2/detectron2
Compiler GCC 9.4
CUDA compiler CUDA 11.4
detectron2 arch flags 8.6
DETECTRON2_ENV_MODULE
PyTorch 1.13.1 @/share/home/ncu10/miniconda3/envs/py38/lib/python3.8/site-packages/torch
PyTorch debug build False
torch._C._GLIBCXX_USE_CXX11_ABI False
GPU available Yes
GPU 0 NVIDIA RTX A6000 (arch=8.6)
Driver version 470.86
CUDA_HOME /share/home/ncu10/CUDA/CUDA11.4
Pillow 9.5.0
torchvision 0.14.1 @/share/home/ncu10/miniconda3/envs/py38/lib/python3.8/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore 0.1.5.post20221221
iopath 0.1.9
cv2 4.8.0

CUDA_VISIBLE_DEVICES=1 python train_net.py --num-gpus 1 --config-file /share/home/ncu10/Code/AI/Point_label/MaskDINO/configs/coco/instance-segmentation/maskdino_R50_bs16_50ep_3s.yaml MODEL.WEIGHTS /share/home/ncu10/Code/AI/Point_label/MaskDINO/model_file/maskdino_r50_50ep_300q_hid1024_3sd1_instance_maskenhanced_mask46.1ap_box51.5ap.pth
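
For readers hitting the same limit: detectron2-style `train_net.py` scripts accept config overrides as trailing `KEY VALUE` pairs, the same way `MODEL.WEIGHTS` is passed in the command above. The sketch below only assembles such a command line and does not launch training. `SOLVER.IMS_PER_BATCH` and `DATALOADER.NUM_WORKERS` are standard detectron2 keys. `INPUT.IMAGE_SIZE` is assumed here to be the crop size used by MaskDINO's COCO configs and should be checked against the config file before use. The helper name `build_train_command` is hypothetical, and the values are illustrative rather than settings recommended by the maintainers.

```python
import shlex


def build_train_command(config_file, overrides, num_gpus=1):
    """Assemble a train_net.py command with detectron2-style KEY VALUE overrides."""
    cmd = [
        "python", "train_net.py",
        "--num-gpus", str(num_gpus),
        "--config-file", config_file,
    ]
    # Overrides are appended after the named arguments as alternating key/value tokens.
    for key, value in overrides.items():
        cmd += [key, str(value)]
    return cmd


# Illustrative values only; smaller batches and images trade accuracy and speed for memory.
overrides = {
    "SOLVER.IMS_PER_BATCH": 2,    # total batch size across all GPUs
    "DATALOADER.NUM_WORKERS": 0,  # load data in the main process
    "INPUT.IMAGE_SIZE": 512,      # assumption: key exists in the MaskDINO config
}

command = build_train_command(
    "configs/coco/instance-segmentation/maskdino_R50_bs16_50ep_3s.yaml",
    overrides,
)
print(shlex.join(command))
```

Keeping the overrides in one dictionary makes it easy to change a single setting at a time and see which one actually lowers memory use.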

sym330 · 2023-11-23T07:26:13Z

same error

FengLi-ust · 2024-07-02T05:38:16Z

Sorry for the late reply. How much memory do you need in your case? We use about 30 GB for ResNet-50 with batch size 4.
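
As a rough way to relate that figure to the hardware in this thread, the sketch below scales the reported 30 GB at batch size 4 linearly to other batch sizes and compares the result with the 48 GB of an RTX A6000. It assumes the reported figure refers to GPU memory and that memory grows linearly with batch size. Neither is confirmed in the thread: model weights and optimizer state do not shrink with the batch, and images with many instances can need more. Treat the output as an order-of-magnitude estimate only. The function name `estimate_gpu_memory_gb` is hypothetical.

```python
REPORTED_GB = 30.0   # figure quoted in the reply above
REPORTED_BATCH = 4   # batch size that figure was quoted for
A6000_GB = 48.0      # memory of one NVIDIA RTX A6000


def estimate_gpu_memory_gb(batch_size):
    """Naive linear estimate; ignores fixed costs such as weights and optimizer state."""
    return REPORTED_GB * batch_size / REPORTED_BATCH


for batch_size in (1, 2, 4):
    estimate = estimate_gpu_memory_gb(batch_size)
    fits = estimate <= A6000_GB
    print(f"batch size {batch_size}: ~{estimate:.1f} GB, fits in {A6000_GB:.0f} GB: {fits}")
```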
