Memory allocation problem #94

Open
YeSho-cpp opened this issue Oct 18, 2023 · 4 comments

Comments

@YeSho-cpp

Hello, and sorry to bother you. I am training MaskDINO on a nuclei dataset and running out of memory. I reduced the batch size to 2 and num_workers to 0, and training now starts, but it is very slow. Setting num_workers to even 1 causes a memory allocation failure. I have two A6000 graphics cards, but I cannot use them together for distributed training, because then memory cannot be allocated either. Which parameters should I modify to reduce memory usage?

@YeSho-cpp
Author

This is my dataset information:

[10/18 11:09:21] d2.data.datasets.coco INFO: Loading /share/home/ncu10/Code/AI/Point_label/PointWSSIS/cell_data_root/coco/annotations/instances_train2017.json takes 2.70 seconds.
[10/18 11:09:21] d2.data.datasets.coco INFO: Loaded 432 images in COCO format from /share/home/ncu10/Code/AI/Point_label/PointWSSIS/cell_data_root/coco/annotations/instances_train2017.json
[10/18 11:09:21] d2.data.build INFO: Removed 0 images with no usable annotations. 432 images left.
[10/18 11:09:21] d2.data.build INFO: Distribution of instances among all 80 categories:
| category | #instances | category | #instances | category | #instances |
|:-------------:|:-------------|:------------:|:-------------|:-------------:|:-------------|
| person | 17073 | bicycle | 0 | car | 0 |
| motorcycle | 0 | airplane | 0 | bus | 0 |
| train | 0 | truck | 0 | boat | 0 |
| traffic light | 0 | fire hydrant | 0 | stop sign | 0 |
| parking meter | 0 | bench | 0 | bird | 0 |
| cat | 0 | dog | 0 | horse | 0 |
| sheep | 0 | cow | 0 | elephant | 0 |
| bear | 0 | zebra | 0 | giraffe | 0 |
| backpack | 0 | umbrella | 0 | handbag | 0 |
| tie | 0 | suitcase | 0 | frisbee | 0 |
| skis | 0 | snowboard | 0 | sports ball | 0 |
| kite | 0 | baseball bat | 0 | baseball gl.. | 0 |
| skateboard | 0 | surfboard | 0 | tennis racket | 0 |
| bottle | 0 | wine glass | 0 | cup | 0 |
| fork | 0 | knife | 0 | spoon | 0 |
| bowl | 0 | banana | 0 | apple | 0 |
| sandwich | 0 | orange | 0 | broccoli | 0 |
| carrot | 0 | hot dog | 0 | pizza | 0 |
| donut | 0 | cake | 0 | chair | 0 |
| couch | 0 | potted plant | 0 | bed | 0 |
| dining table | 0 | toilet | 0 | tv | 0 |
| laptop | 0 | mouse | 0 | remote | 0 |
| keyboard | 0 | cell phone | 0 | microwave | 0 |
| oven | 0 | toaster | 0 | sink | 0 |
| refrigerator | 0 | book | 0 | clock | 0 |
| vase | 0 | scissors | 0 | teddy bear | 0 |
| hair drier | 0 | toothbrush | 0 | | |
| total | 17073 | | | | |
[10/18 11:09:21] d2.data.build INFO: Using training sampler TrainingSampler
[10/18 11:09:21] d2.data.common INFO: Serializing the dataset using: <class 'detectron2.data.common._TorchSerializedList'>
[10/18 11:09:21] d2.data.common INFO: Serializing 432 elements to byte tensors and concatenating them all ...
[10/18 11:09:22] d2.data.common INFO: Serialized dataset takes 28.01 MiB

@YeSho-cpp
Author

I am using the ResNet-50 model:
[10/18 11:09:13] detectron2 INFO: Rank of current process: 0. World size: 1
[10/18 11:09:14] detectron2 INFO: Environment info:


sys.platform linux
Python 3.8.15 (default, Nov 24 2022, 15:19:38) [GCC 11.2.0]
numpy 1.24.4
detectron2 0.6 @/share/home/ncu10/Code/AI/Point_label/MaskDINO/detectron2/detectron2
Compiler GCC 9.4
CUDA compiler CUDA 11.4
detectron2 arch flags 8.6
DETECTRON2_ENV_MODULE
PyTorch 1.13.1 @/share/home/ncu10/miniconda3/envs/py38/lib/python3.8/site-packages/torch
PyTorch debug build False
torch._C._GLIBCXX_USE_CXX11_ABI False
GPU available Yes
GPU 0 NVIDIA RTX A6000 (arch=8.6)
Driver version 470.86
CUDA_HOME /share/home/ncu10/CUDA/CUDA11.4
Pillow 9.5.0
torchvision 0.14.1 @/share/home/ncu10/miniconda3/envs/py38/lib/python3.8/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore 0.1.5.post20221221
iopath 0.1.9
cv2 4.8.0

CUDA_VISIBLE_DEVICES=1 python train_net.py --num-gpus 1 --config-file /share/home/ncu10/Code/AI/Point_label/MaskDINO/configs/coco/instance-segmentation/maskdino_R50_bs16_50ep_3s.yaml MODEL.WEIGHTS /share/home/ncu10/Code/AI/Point_label/MaskDINO/model_file/maskdino_r50_50ep_300q_hid1024_3sd1_instance_maskenhanced_mask46.1ap_box51.5ap.pth
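
The batch size and worker count described earlier can be passed as `KEY VALUE` overrides at the end of the same command, the way `MODEL.WEIGHTS` is passed above. Below is a minimal sketch in Python that assembles such a command. It assumes detectron2's standard config keys `SOLVER.IMS_PER_BATCH` and `DATALOADER.NUM_WORKERS`; `build_train_command` is a hypothetical helper, not part of MaskDINO, and the paths in the usage are placeholders.

```python
import shlex
from typing import Dict, List, Tuple


def build_train_command(
    config_file: str,
    weights: str,
    ims_per_batch: int = 2,
    num_workers: int = 0,
    gpu_index: int = 1,
) -> Tuple[Dict[str, str], List[str]]:
    """Return (extra environment variables, argv) for a single-GPU run.

    Hypothetical helper: it only assembles strings and launches nothing.
    """
    env = {"CUDA_VISIBLE_DEVICES": str(gpu_index)}
    argv = [
        "python", "train_net.py",
        "--num-gpus", "1",
        "--config-file", config_file,
        # Trailing KEY VALUE pairs override entries of the YAML config.
        "MODEL.WEIGHTS", weights,
        "SOLVER.IMS_PER_BATCH", str(ims_per_batch),  # total images per iteration
        "DATALOADER.NUM_WORKERS", str(num_workers),  # 0 loads data in the main process
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_train_command("path/to/config.yaml", "path/to/weights.pth")
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    print(prefix, shlex.join(argv))
```

Keeping the overrides on the command line leaves the YAML config untouched, so different batch sizes can be tried without editing files.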

@sym330

sym330 commented Nov 23, 2023

same error

@FengLi-ust
Collaborator

Sorry for the late reply. How much memory do you need in your case? We use about 30 GB for ResNet-50 with batch size 4.
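
For a rough comparison against the figure above, here is a sketch in Python that bounds the memory at a smaller batch size. It assumes memory is a fixed cost plus a per-image cost, so a smaller batch lands somewhere between a proportional share and all of the reference figure. That model, the hypothetical `estimate_memory_range_gb` helper, and the 48 GB capacity of an RTX A6000 used in the usage are assumptions not stated in the thread. Real usage also depends on image size and the number of instances per image.

```python
from typing import Tuple


def estimate_memory_range_gb(
    ref_memory_gb: float, ref_batch: int, new_batch: int
) -> Tuple[float, float]:
    """Bound memory at new_batch, assuming memory = fixed + per_image * batch.

    Lower bound: the fixed cost is zero, so memory scales with the batch.
    Upper bound: the per-image cost is zero, so memory does not change.
    Only meaningful for new_batch <= ref_batch.
    """
    if not 0 < new_batch <= ref_batch:
        raise ValueError("expected 0 < new_batch <= ref_batch")
    lower = ref_memory_gb * new_batch / ref_batch
    upper = ref_memory_gb
    return lower, upper


if __name__ == "__main__":
    low, high = estimate_memory_range_gb(30.0, ref_batch=4, new_batch=2)
    a6000_gb = 48.0  # capacity of one RTX A6000; not stated in the thread
    print(f"batch size 2: between {low:.1f} and {high:.1f} GB")
    print("fits on one A6000 under this model:", high <= a6000_gb)
```

This is only an estimate under the stated model. Measuring the actual peak on the target dataset is the reliable way to answer the question.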
