This repository has been archived by the owner on Sep 2, 2024. It is now read-only.

Problem with MSeg: it took 10 minutes for 1 image and only produced a grayscale image? #56

Open
luda1013 opened this issue Apr 19, 2023 · 17 comments


@luda1013

Hello Pros,

I am currently working with the Enhancing Photorealism Enhancement paper and algorithm and trying to implement it. In the process, we need to use MSeg segmentation for the real and rendered images/datasets. I have around 50-60k images.

The dependencies mseg-api and mseg-semantic were already installed. I tried the Google Colab first and then copied the commands so I could run the script on my Linux machine as well. The command is like this:
python -u mseg_semantic/tool/universal_demo.py \
    --config="default_config_360.yaml" \
    model_name mseg-3m \
    model_path mseg-3m.pth \
    input_file /home/luda1013/PfD/image/try_images

The weights I used I downloaded from the Google Colab, so mseg-3m-1080.pth.
[screenshot: MSeg log]

But for me it took about 10 minutes for 1 image, and all I get in temp_files is just the grayscale image of it.
Could someone help me figure out how to solve this problem, thank you :)

@luda1013
Author

My setup:

  • Ubuntu 20.04.6 LTS
  • Core i9-10980XE @ 3 GHz
  • GPU (nvidia-smi): NVIDIA RTX A5000 (48 GB)
  • CUDA 11.7
  • PyTorch version: 2.0.0
  • CUDA at: /usr/local/cuda*

@czero69

czero69 commented May 20, 2023

Sounds like you are running on the CPU and not on the GPU. I am running on a 24 GB GPU, and with the "default_config_1080_ms.yaml" config it took a good few seconds for one full-HD input image. Check in your config which GPU is indicated as test_gpu: [0]. Also make sure PyTorch is picking the NVIDIA card as its device and not the CPU; it is often the case that something is badly installed in the environment. You should be able to tell from the log lines, or you can also write a super simple PyTorch script to print the device name (see the sketch below). If that is the case, I would create a new conda environment and install PyTorch for GPU/CUDA (the pip install might be a bit different for GPU support).
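A minimal sketch of such a device check, using only standard PyTorch calls (nothing project-specific):

import torch

print(torch.cuda.is_available())           # False means a CPU-only build (or a broken driver/CUDA setup)
print(torch.cuda.device_count())           # how many GPUs PyTorch can see
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA RTX A5000"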

About the grayscale output: it is probably what you want. One channel (8-bit gray) is enough to store 256 classes, and MSeg produces fewer than that. This way of storing the info is efficient on disk; a 4K image is barely 112 KB or so.
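A quick way to confirm that the gray image is a per-pixel class-ID map rather than a broken render (a sketch assuming Pillow/NumPy; the path is hypothetical):

import numpy as np
from PIL import Image

ids = np.array(Image.open("temp_files/gray/00001_gray.jpg"))  # hypothetical output path
print(ids.shape, ids.dtype)   # (H, W), uint8 -- one channel, one class ID per pixel
print(np.unique(ids))         # the class IDs present in this image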

@luda1013
Author

OK, thank you for the reply. I will make a new virtual environment just for MSeg. About the CPU vs. GPU question: when running the script I also checked nvidia-smi, and it is at 100% load.
[screenshot: nvidia-smi while MSeg is active]

Is it okay to use another segmentation network for this Enhancing Photorealism Enhancement?

@czero69

czero69 commented May 22, 2023

Ah, you are using two GPUs (2x 24 GB). That may well be the cause. The GPU usage indeed indicates that PyTorch is using the GPUs. I haven't tested the code (EPE training / MSeg generation) in a multi-GPU setup yet, but I remember from my past PyTorch projects that if something was not designed for a multi-GPU setup, some steps (communication, combining results, even some parts of the gradient computation, copying things back and forth, some merge ops on the CPU) took a lot of time and could be less efficient than on one GPU. What I would try in your case:

  • try running the code on one GPU and check (maybe that alone will solve your issue; see the sketch after this list)
  • if that does not help, create a new conda env and install all dependencies with GPU support
  • check which MSeg configuration you are using
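A minimal sketch of the single-GPU restriction from the first bullet; setting the environment variable before importing torch is the safest way, and the same effect can be had from the shell by prefixing the python command with CUDA_VISIBLE_DEVICES=0:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU to this process

import torch
print(torch.cuda.device_count())          # should now report 1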

Also, if your kNN-generation step is slow, that means faiss is using the CPU. On the GPU faiss is blazing fast; the whole thing took 1-2 seconds for 500k samples in my case.
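A quick sketch to check which faiss build is installed (the GPU package exposes get_num_gpus, while to my knowledge the CPU-only package does not, so an AttributeError here is also informative):

import faiss

try:
    print(faiss.get_num_gpus())  # > 0 means the GPU build can see your cards
except AttributeError:
    print("CPU-only faiss build installed")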

Is it okay to use another segmentation network for this Enhancing Photorealism Enhancement?

MSeg and EPE are different networks; you probably mean conda envs. Everything should run fine in the same environment (it works for me), but you can use different conda environments; it does not matter at all.

PS: what is your input image size for the MSeg step?

@luda1013
Author

luda1013 commented May 22, 2023

Okay, so the MSeg script I used is the universal demo inference, or the universal demo for batches; the command is like:
python -u ~/mseg-semantic/mseg_semantic/tool/universal_demo_batched.py \
    --config=mseg_semantic/config/test/480/default_config_batched_ss.yaml \
    model_name mseg-3m \
    model_path ~/mseg-semantic/mseg-3m-480p.pth \
    input_file /home/luda1013/PfD/image/01_images/

The input file 01_images is a folder containing 2500 images of the Playing for Data dataset;
each image is 1914 x 1052 pixels (width x height).
Here are some screenshots of running MSeg:
[screenshot]
Oh, and I also don't understand the batched universal demo from MSeg that much, because I thought that with batches MSeg would run on batches of images; but with the command line above it is as if 1 batch consists of only 1 image, so there are 2500 batches.

And before that, I tried to use universal_demo.py; at the first image I interrupted the run, and I think these steps are why it takes such a long time to infer just 1 image. But I haven't investigated/debugged it line by line yet.
[screenshot: MSeg log where it took a long time]
[screenshot: MSeg log where it took a long time, part 2]

Because MSeg took too long for me, I now have MSeg BW images for 1.5-2 folders of PfD: images_01 and images_02.
Now I am using them to implement EPE, with 1 folder as the fake and the 2nd folder as the real dataset.
Have you also tried to implement and train EPE? Because until now I still can't bring it to training.
I get this error when running the training:
[screenshot]

[screenshot]

Now I am still debugging line by line to find where this num_samples = 0 comes from.

Thank you so much @czero69 for your help, btw :)

@czero69

czero69 commented May 22, 2023

Oh, and I also don't understand the batched universal demo from MSeg that much, because I thought that with batches MSeg would run on batches of images; but with the command line above it is as if 1 batch consists of only 1 image, so there are 2500 batches.

Ah, I am only using universal_demo.py.

I get this error when running the training:

You have at least two errors. One says some file is missing. The other one, num_samples=0, happens when the dataloader sees no data at all; usually wrong paths / wrong input file structure / wrong input file path.

For preparing the EPE data, you must take extra care. Go through all preparation steps carefully. I recommend printing the results of every step below (values, means) to check that the tensors look OK (no NaNs / no Infs etc.).

These are my scripts, where the images are 4K (hence 2160 3840 and -c 60):

python epe/matching/feature_based/collect_crops.py FirstRealVids /path/real.txt -n 8 -c 60
python epe/matching/feature_based/collect_crops.py Coffing /path/fake.txt -n 8 -c 60
python epe/matching/feature_based/find_knn.py crop_Coffing.npz crop_FirstRealVids.npz knn_Coffing-FirstRealVids.npz -k 15
python epe/matching/filter.py knn_Coffing-FirstRealVids.npz crop_Coffing.csv crop_FirstRealVids.csv 1.0 matched_crops_Coffing-FirstRealVids.csv
python epe/matching/compute_weights.py matched_crops_Coffing-FirstRealVids.csv 2160 3840 crop_weights_Coffing-FirstRealVids.npz

Take note of the correct order for /path/real.txt (images, msegs)
and for /path/fake.txt (images, msegs, gbuffers_npz, gt_stencils).

To verify a bit, it's good to see whether the matched crops look OK:

python ./epe/matching/feature_based/sample_matches.py  /path/fake.txt crop_Coffing.csv /path/real.txt crop_FirstRealVids.csv knn_Coffing-FirstRealVids.npz

Also, make sure all your input color images are RGB, 3-channel (not RGBA). Robust maps (MSeg) and stencils (GT masks) are 8-bit. Your NPZs should have the same structure as the fake NPZs ('data' key in the numpy dict, float16). If yours has a different dim than 32 (32 == total number of g-buffer channels), modify the code accordingly; it should be in one place. A quick structural check is sketched below.
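A minimal sketch for inspecting one of your NPZs against that structure (the path is hypothetical; the 'data' key and float16 dtype are as described above):

import numpy as np

gbuf = np.load("gbuffer_02_images_rendered/02501.npz")["data"]  # hypothetical path
print(gbuf.dtype, gbuf.shape)  # expect float16 and a 32-sized channel dim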

Have you also tried to implement and train EPE? Because until now I still can't bring it to training.

After solving some trivial issues, it is all training fine for me. The results are ... shortly speaking ... breathtaking. I would possibly rewrite the entire training pipeline to the latest PyTorch, and to a pipeline similar to how I work nowadays, so it would be easier for me to modify the EPE baseline arch further, support batches > 1, logging, etc.

@luda1013
Author

Yes, this "skipped entries" message is produced in epe/dataset/utils.py, in the function read_filelist.
Here is what my text files look like, with 1 line per image:
rendered.txt:
/home/luda1013/PfD_test/image/02_images/images/02501.png,/home/luda1013/mseg-semantic/temp_files/mseg-3m_01_images_universal_ss/259/gray/02501_gray.jpg,/home/luda1013/PEcon2/gbuffer_02_images_rendered/02501.npz,/home/luda1013/PfD_test/label/02_labels/labels/02501.png

real.txt:
/home/luda1013/PfD_test/image/01_images/images/00001.png,/home/luda1013/mseg-semantic/temp_files/mseg-3m_01_images_universal_ss/259/gray/00001.png

So I think I did the text files correctly, and I already checked for NaN.
Oh, and at compute_weights.py, I previously got NaN when I ran the script in the terminal; I tried running it in a Jupyter notebook and it gave me a number... I don't know why. But even after solving the NaN in compute_weights, I still get the same error and skipped entries at training.

The other inputs, like RGB 3-channel (not RGBA) and how many bits, I haven't checked yet.

I will report back as soon as I have tried all of your suggestions.
As for MSeg, I will try it later with 1 GPU, after I can bring this EPE to train with these 2 folders of the PfD dataset that I have XD

@czero69

czero69 commented May 22, 2023

In compute_weights.py, take note that the arguments are in H, W order (and not W, H), so e.g. for full HD it will be 1080 1920; that was the reason for my NaNs.

The other inputs, like RGB 3-channel (not RGBA) and how many bits, I haven't checked yet.

It should be 24 bits; check one random image each for fake & real (see the sketch below). Not everywhere in the code is there a [:,:3,:,:], so RGBA will raise some 4 != 3 mismatch in the tensor sizes.
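A minimal sketch of that check with Pillow (the file name is hypothetical):

from PIL import Image

img = Image.open("00001.png")  # hypothetical: pick one random fake and one random real image
print(img.mode)                # want "RGB" (24-bit, 3 channels); "RGBA" will break downstream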

num_samples=0

Almost for sure the paths are wrong. Take one file from each of your .txt / .csv files and stat it in the terminal:

stat /path/to.png
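Beyond spot-checking single files with stat, here is a minimal sketch that validates every path in one of the comma-separated list files (the file name is hypothetical; the line format is as described earlier in the thread):

import os

with open("rendered.txt") as f:  # hypothetical list file
    for i, line in enumerate(f, start=1):
        for path in line.strip().split(","):
            if path and not os.path.isfile(path):
                print(f"line {i}: missing {path}")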

@luda1013
Author

Oww... so the txt file I created is wrong? I thought it was correct, because I followed the instructions:
"Each line should contain paths to image, robust label map, gbuffer file, and ground truth label map, all separated by commas."

Sorry, but I also don't quite understand the stat in the terminal; so you suggest verifying the location of the images using stat?
[screenshot]

Maybe to help you see it better, here is part of a screenshot of my rendered.txt:
[screenshot]

@czero69

czero69 commented May 22, 2023

Oww... so the txt file I created is wrong?

At least the order looks OK. I have this order too: ["screenshot", "msegs/gray4k", "NPZs", "gray_stencils"].

The stat is just to make sure the .txt/.csv files contain correct paths. The one you've printed is correct, indicating the file does indeed exist on the HDD. Rendered.txt looks OK too at first sight. But num_samples == 0 indicates the EPE pipeline does not see something. Check all 4 paths in some random row of rendered.txt. Check the paths in your citysample_ie2.config (or whatever config name you are using for EPE training). Also check what's going on with the missing files; probably some paths are pointing to non-existing files.

@luda1013
Author

luda1013 commented May 24, 2023

Ahh okay, thank you @czero69 for the stat tip to check the location of the images.

Yes, when I run the training, the num_samples = 0 comes from the skipped entries.
[screenshot]
And this comes from epe/dataset/utils.py, from the function read_filelist().

Yes, for the config I use the train_pfd2cs.yaml from GitHub; I just modified the basics like paths etc. I even kept the names pfd and cs the same, just to avoid unnecessary errors.

The 355 skipped entries are for validation and the 1066 are for training; there are 355 lines/images in val.txt and 1066 images in train.txt.
I guess the script cannot see/process those lines of paths.

@vace17

vace17 commented Jun 6, 2023

Hello @luda1013,
I noticed you are using CUDA version 11.7; were you able to train the model properly?
I was also using this version of CUDA, but the training step takes a lot of time; I suppose because PyTorch is not using the GPU (maybe because of an incompatible CUDA?).
Thank you in advance for any help.

@luda1013
Author

luda1013 commented Jun 6, 2023

Hi @vace17, not yet, I am still encountering some problems; right now it is this one:
[screenshot: training error]
It seems that I need to squeeze or reshape the image first into (H, W) instead of (H, W, C).

In your case: in the scripts the device used is always cuda. You can also check while you are training: open your terminal and run nvidia-smi; you should then know whether your GPU is loaded or not.

Could you please help me with the training then? Until now I cannot bring it to train.
May I know your datasets for fake and real, the config, and the step by step? Do you just train, or did you also modify the scripts? Thank you very much :)

Update:
I solved the ValueError: could not broadcast input array from shape (..) to shape (..).
I converted the label map from PfD (which is still RGB or CMYK) to gray, roughly as sketched below.
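A minimal sketch of that conversion with Pillow (paths are hypothetical; note that for color-coded label maps a proper palette-to-class-ID mapping may be more faithful than a plain luminance conversion):

from PIL import Image

lbl = Image.open("labels/02501.png").convert("L")  # collapse multi-channel label map to one channel
lbl.save("labels_gray/02501.png")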

But I have another issue now, and maybe it is the same as yours, @vace17:
[screenshot]

@vace17

vace17 commented Jul 10, 2023

@luda1013 I have a different issue, since I don't encounter this specific error and the training process is running, but it takes a lot of time.

@vace17

vace17 commented Jul 10, 2023

@czero69 can I ask what your specific setup is?
Mine is:

  • Windows 10
  • Graphics card: NVIDIA GeForce RTX 3090
  • CUDA 11.7
  • Python version 3.8.16

I checked the current usage of the GPU from the terminal using the command nvidia-smi during the run of the training process. It seems to me that the GPU is being used, but its usage percentage keeps fluctuating between low values (5-10%) and 50-60%.
How much time does it take to train the framework with your current setup?
Thank you in advance for any help.

@czero69

czero69 commented Jul 11, 2023

Hey, I have tried two setups so far:

  • win10, NVIDIA RTX 3090: it trained 75k steps/day
  • linux, NVIDIA A6000 (48 GB): ~150k steps/day

My entire epoch would be around 1M steps (batch == 1), with a single 196x196 crop.

The authors mentioned somewhere here in the issue space that for them it was around 200k steps/day, and they were using 1x 3090.

@ZGX010

ZGX010 commented Dec 27, 2023

Oww... so the txt file I created is wrong?

At least the order looks OK. I have this order too: ["screenshot", "msegs/gray4k", "NPZs", "gray_stencils"].

The stat is just to make sure the .txt/.csv files contain correct paths. The one you've printed is correct, indicating the file does indeed exist on the HDD. Rendered.txt looks OK too at first sight. But num_samples == 0 indicates the EPE pipeline does not see something. Check all 4 paths in some random row of rendered.txt. Check the paths in your citysample_ie2.config (or whatever config name you are using for EPE training). Also check what's going on with the missing files; probably some paths are pointing to non-existing files.

Hello Kamil, nice to see you again. @czero69
The g-buffers seem to be in npz format; how should they be organized?
How should the g-buffer attributes corresponding to size and channels be defined?
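For reference, czero69's earlier description ('data' key in the numpy dict, float16, 32 g-buffer channels in total) suggests a structure like this minimal sketch (the channel-first layout, resolution, and file name here are hypothetical):

import numpy as np

gbuf = np.zeros((32, 1052, 1914), dtype=np.float16)  # hypothetical: 32 channels at PfD resolution
np.savez_compressed("00001.npz", data=gbuf)          # one NPZ per frame, under the 'data' key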
