Problem on MSeg, took 10 minutes for 1 image and only produced the gray scaled image? #56

luda1013 · 2023-04-19T09:51:10Z

Hallo Pros,

i am currently working with enhancing image enhancement paper and algorithm and trying to implement that. In the process, we need to use MSeg-segmentation for real and rendered images/ datasets. i have like 50-60k images.

So the dependencies MSeg-api and MSeg_semantic were already installed. I tried the google collab first and then copying the commands, so i could run the script in my linux also. the command is like this:
python -u mseg_semantic/tool/universal_demo.py
--config="default_config_360.yaml"
model_name mseg-3m
model_path mseg-3m.pth
input_file /home/luda1013/PfD/image/try_images

the weight i used, i downloaded it from the google collab, so the mseg-3m-1080.pth

but for me, it took like 10 minutes for 1 image and also what i get in temp_files is just the gray scale image of it.
Could someone help me how i could solve this problem, thank you :)

luda1013 · 2023-04-19T10:49:06Z

My setup:

Ubuntu 20.04.6 LTS
core i9-10980XE @ 3GHz
graphic (nvidia-smi): NVIDIA RTX A5000 (48 GB)
Cuda 11.7
Pytorch version : 2.0.0
Cuda at: /usr/local/cuda*

czero69 · 2023-05-20T10:52:21Z

Sounds like you are running on cpu and not on gpu. I am runninng on 24gb gpu and for "default_config_1080_ms.yaml" config it took good few seconds for one fullHD input image. Check in your config what gpu is indicated as test_gpu: [0]. Make sure also pytorch is picking nvidia as a device and not a cpu. It is the often case where you have something badly installed in your environment. From log lines you should indicate or you can also write super simple pytorch script to print device name (howto check pytorch is using gpu). If it is the case, I would create new conda environment and install pytorch for gpu/cuda (pip install might be bit different for gpu support).

About grayscale output it is probably what you want. One channel (8 bit gray) is super enough to store 256 classes and mseg is producing less than that. This way of storing info is efficient for hdd, 4k image is barely 112kb or so.

luda1013 · 2023-05-22T12:14:03Z

ok, thank you for the reply.. ok, i will make new virtual environment just for the MSeg. about the cpu and gpu, when running the script, i also check the nvidia-smi and it is on 100% load

is okay to use another segmentation network for this enhancing photorealism enhancement?

czero69 · 2023-05-22T12:42:08Z

ah, you are using two gpus (2x 24GB). That potentially may be the case. GPU usage indeed indicates that pytorch is using gpus. I didn't test the code (epe training / mseg generations) in multiGPU setup yet, but I remember from my past pytorch projects that if something was not well designed for multiGPU setup, some steps (communication, combinning results, even some part of calculating gradients, copying things back and forth, some merging ops on cpu) took much time and may be less efficient than on one gpu. What I would try in your case:

try running the code for one GPU and check (mby that will solve your issue)
if that does not helped. Create new conda env and install all dependencies that have gpu support.
check what mseg configuration are you using

Also if you step of generating knn will be slow that means faiss use cpu. On gpu faiss is blazing fast and all took 1-2 second for 500k samples in my case.

is okay to use another segmentation network for this enhancing photorealism enhancement?

Msegs and EPE are different networks, you probably mean conda envs. Everything should be running fine in the same environment (as it is working for me), but you can use different conda environments, it does not matter at all.

PS. what is your input image size for Mseg step?

luda1013 · 2023-05-22T13:06:18Z

okay, so the MSeg scipt i used is the universal demo inference or the universal demo for batch; so the command is like:
python -u ~/mseg-semantic/mseg_semantic/tool/universal_demo_batched.py
--config=mseg_semantic/config/test/480/default_config_batched_ss.yaml
model_name mseg-3m
model_path ~/mseg-semantic/mseg-3m-480p.pth
input_file /home/luda1013/PfD/image/01_images/

the input file 01_images is a folder contains of 2500 images of Playing for Data - dataset
each image has 1914 pixel x 1052 pixel (Width x Height)
here are some screenshots on running MSeg:

o ya, also i dont nderstand that much about batch universal demo from Mseg, because i thought with batch, the MSeg run on batches of images, but in that command line above is like 1 batch consists only 1 image, so there are 2500 batches.

And before, i tried to use the universal_demo.py and at the first image i break the running and i think these are why it takes long time just to infer 1 image. But i dont yet investigate/debug line-per-line further.

because took too long for me for the mseg, now i have MSeg-BW-images for 1,5-2 folders of PfD: images_01 and images_02.
Now i am using it to implements the EPE, 1 folder as fake , 2nd folder as real dataset.
Have you tried also to implement and train the EPE? because till now i still can't bring it for training.
i have error when run the training:

now i am still at debug line per line where this num_samples = 0 came..

Thank you so much @czero69 for your help btw :)

czero69 · 2023-05-22T15:24:59Z

o ya, also i dont nderstand that much about batch universal demo from Mseg, because i thought with batch, the MSeg run on batches of images, but in that command line above is like 1 batch consists only 1 image, so there are 2500 batches.

ah, I am going only with universal_demo.py

i have error when run the training:

you have at least two errors. One says some file is missing. Another one, num_samples=0 this happens when dataloader see no data at all, usually wrong paths / wrong input file structure / wrong input file path.

For preparing EPE data, you must take extra care. Go throu all preparation steps careffuly. I recommend printing results of every step below (values, means) to check does tensors looks ok (not nans/ not inf etc.)

these are my scripts, where images are 4k (hence 2160 3840 and -c 60)

python epe/matching/feature_based/collect_crops.py FirstRealVids /path/real.txt -n 8 -c 60
python epe/matching/feature_based/collect_crops.py Coffing /path/fake.txt -n 8 -c 60
python epe/matching/feature_based/find_knn.py crop_Coffing.npz crop_FirstRealVids.npz knn_Coffing-FirstRealVids.npz -k 15
python epe/matching/filter.py knn_Coffing-FirstRealVids.npz crop_Coffing.csv crop_FirstRealVids.csv 1.0 matched_crops_Coffing-FirstRealVids.csv
python epe/matching/compute_weights.py matched_crops_Coffing-FirstRealVids.csv 2160 3840 crop_weights_Coffing-FirstRealVids.npz

take a note of correct order for /path/real.txt (images, msegs)
and for /path/fake.txt (images, msegs, gbuffers_npz, gt_stencils)

To verify a bit, its good to see does matched crops looks ok

python ./epe/matching/feature_based/sample_matches.py  /path/fake.txt crop_Coffing.csv /path/real.txt crop_FirstRealVids.csv knn_Coffing-FirstRealVids.npz

Also, make sure all your input color images are RGB, 3-channel (not RGBA). Robust maps (mseg) and Stencils (gt masks) are in 8 bit. Your NPZs should have a same structure as fake NPZs ('data' key in numpy dict, float16). If it has different dim than 32 (32 == num of gbuffer channels in total) modify the code accordingly; should be in one place.

Have you tried also to implement and train the EPE? because till now i still can't bring it for training.

After solving some trivial issues, all is training fine for me. Results are ... shortly speaking ... breathtaking. Possibly I would rewrite an entire training pipeline to latest pytorch and pipeline similar to how I work nowadays, so it would be easier for me to modify epe basilne arch furhter, support batches > 1, logging, etc.

luda1013 · 2023-05-22T15:42:45Z

ya, this skipped entries is found at script epe.datasets.utils.py in function read_filelist.
here is my text file looks like for 1 line / 1 image:
rendered.txt:
/home/luda1013/PfD_test/image/02_images/images/02501.png,/home/luda1013/mseg-semantic/temp_files/mseg-3m_01_images_universal_ss/259/gray/02501_gray.jpg,/home/luda1013/PEcon2/gbuffer_02_images_rendered/02501.npz,/home/luda1013/PfD_test/label/02_labels/labels/02501.png

real.txt:
/home/luda1013/PfD_test/image/01_images/images/00001.png,/home/luda1013/mseg-semantic/temp_files/mseg-3m_01_images_universal_ss/259/gray/00001.png

so i think with the text files i do it correctly. and i already check the NaN.
o ya, at compute_weights.py, before i have NaN number when i run the script on terminal, tried to run at jupyter notebook, it gave me number.. idk why.. but even after solving the NaN number in compute_weights; i still get the same error and skipped entries at training.

The other input like RGB 3 Channel (not RGBA), how many bits, i dont check it yet.

i tell you again as soon i try all of the suggestions you made.
With the MSeg, i will try it later with 1 GPU after i can bring this EPE to train with these 2 folders of PfD-dataset that i have XD

czero69 · 2023-05-22T15:46:05Z

in compute_weights.py take a note that argument is in H, W (and not W, H), so for e.g. fullHD will be 1080 1920; it was my NaN reason

The other input like RGB 3 Channel (not RGBA), how many bits, i dont check it yet.

should be 24 bits, check one random img for fake & real. Not everywhere in the code there is [:,:3,:,:] so RGBA will rise some 4!=3 in tensors size

num_samples=0

Almost for sure paths are wrong. Take one file from each of your .txt, .csvs and make stat in the terminal

stat /path/to.png

luda1013 · 2023-05-22T16:40:30Z

oww.. so the txt file i created is false? i thought is correct because i followed the instructions
" Each line should contain paths to image, robust label map, gbuffer file, and ground truth label map, all separated by commas. Each line should contain paths to image, robust label map, gbuffer file, and ground truth label map, all separated by commas."

sorry but i also dont quite understand with the stat in the terminal, so u suggest to verify the location of the image using stat?

maybe could help u too look better, here is the part of screen shot of my rendered.txt:

czero69 · 2023-05-22T17:37:43Z

oww.. so the txt file i created is false?

at least order looks ok. I have this order too ["screenshot", "msegs/gray4k", "NPZs", "gray_stencils"]

the stat is just to make sure .txt/.csv have correct paths. The one you've printed is correct, indicating file indeed exist in HDD. Rendered.txt looks ok too for a first sight. But num_samples == 0 indicates epe pipeline does not see sth. Check all 4 paths in some random row in rendered.txt. Check paths in your citysample_ie2.config (or whatever name config you are using for epe training). Also check whats going on with, missing files, probably some paths are pointing to non-existing files.

luda1013 · 2023-05-24T11:05:23Z

ahh okay, thank you, @czero69 for the stat tips to check the location of the images.

ya when i run the training, the num_samples = 0 come from the skipped entries.

and this comes form the epe.datasets.utils so from the function read_filelist()

ya, for the config i use the train_pfd2cs.yaml from github, i just modify the basic like path, etc.. i even keep the name same. pfd and cs, just to avoid unnecessary error

the number 355 skipped entries are for validation and 1066 are for training, in the val.txt there are 355 lines/ images and 1066 images in train.txt
i guess the script cannot see/ work the lines of paths

vace17 · 2023-06-06T08:18:55Z

Hello @luda1013,
I noticed you are using CUDA version 11.7, were you able to train the model properly?
I was also using this version of CUDA but the training step takes a lot of time, I supposed because pytorch is not using the GPU (maybe because of incompatible CUDA?)
Thank you in advance for any help

luda1013 · 2023-06-06T10:45:11Z

Hi @vace17 , not yet, i am still countered some problem, now i countered the problem:

seems that i need to squeeze or reshape the image first into (H,W) instead (H,W,C)

in your case, in the script they always made the used device is cuda, u can also then check when u are on training, open your terminal and check with nvidia-smi
u should then know if your gpu is loaded or not

Could you please help me then with training? till now i cannot bring it to train.
may i know your dataset for fake and real, the config and the step by step? do u just train or did u also modify the scripts? thank you very much :)

update:
i solved the ValueError: could not broadcast input array from shape (..) to shape (..)
i tried to convert the label map from PfD (which is still RGB or CMYK) to gray

but i have another issue now, and maybe it is the same with you @vace17 :

vace17 · 2023-07-10T09:31:18Z

@luda1013 I have a different issue since I don't encounter this specific error and the training process is running but it takes a lot of time

vace17 · 2023-07-10T09:54:41Z

@czero69 can I ask what is your specific setup?
Mine is :

Windows 10
Graphic card: NVIDIA GEFORCE RTX 3090
Cuda 11.7
Python version 3.8.16

I checked from the terminal the current usage of the GPU using the command nvidia-smi during the run of the training process. It seems to me that the GPU is currently used but the percentage of usage of it continues changing between low values (5-10%) and 50-60%.
How much time does it take to train the framework with your current setup?
Thank you in advance for any help

czero69 · 2023-07-11T17:24:34Z

hey, I have tried two set-ups so far:

win10, NVIDIA RTX 3090 it trained 75k steps / day
linux, NVIDIA A6000 (48GB), ~150k steps / day

my entire epoch would be around 1M steps (batch == 1), 196x196 single crop

Authors mentioned somewhere here in the issue space that for them it was around 200k steps / day too and they were using 1x3090

ZGX010 · 2023-12-27T03:15:47Z

oww.. so the txt file i created is false?

at least order looks ok. I have this order too ["screenshot", "msegs/gray4k", "NPZs", "gray_stencils"]

the stat is just to make sure .txt/.csv have correct paths. The one you've printed is correct, indicating file indeed exist in HDD. Rendered.txt looks ok too for a first sight. But num_samples == 0 indicates epe pipeline does not see sth. Check all 4 paths in some random row in rendered.txt. Check paths in your citysample_ie2.config (or whatever name config you are using for epe training). Also check whats going on with, missing files, probably some paths are pointing to non-existing files.

Hello Kamil, nice to see you again. @czero69
gbuffer seems to be in npz format, how to organize them?
How should the Gbuffer attributes corresponding to size and channel be defined?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problem on MSeg, took 10 minutes for 1 image and only produced the gray scaled image? #56

Problem on MSeg, took 10 minutes for 1 image and only produced the gray scaled image? #56

luda1013 commented Apr 19, 2023

luda1013 commented Apr 19, 2023

czero69 commented May 20, 2023 •

edited

Loading

luda1013 commented May 22, 2023

czero69 commented May 22, 2023 •

edited

Loading

luda1013 commented May 22, 2023 •

edited

Loading

czero69 commented May 22, 2023 •

edited

Loading

luda1013 commented May 22, 2023

czero69 commented May 22, 2023 •

edited

Loading

luda1013 commented May 22, 2023

czero69 commented May 22, 2023 •

edited

Loading

luda1013 commented May 24, 2023 •

edited

Loading

vace17 commented Jun 6, 2023

luda1013 commented Jun 6, 2023 •

edited

Loading

vace17 commented Jul 10, 2023

vace17 commented Jul 10, 2023

czero69 commented Jul 11, 2023 •

edited

Loading

ZGX010 commented Dec 27, 2023 •

edited

Loading

Problem on MSeg, took 10 minutes for 1 image and only produced the gray scaled image? #56

Problem on MSeg, took 10 minutes for 1 image and only produced the gray scaled image? #56

Comments

luda1013 commented Apr 19, 2023

luda1013 commented Apr 19, 2023

czero69 commented May 20, 2023 • edited Loading

luda1013 commented May 22, 2023

czero69 commented May 22, 2023 • edited Loading

luda1013 commented May 22, 2023 • edited Loading

czero69 commented May 22, 2023 • edited Loading

luda1013 commented May 22, 2023

czero69 commented May 22, 2023 • edited Loading

luda1013 commented May 22, 2023

czero69 commented May 22, 2023 • edited Loading

luda1013 commented May 24, 2023 • edited Loading

vace17 commented Jun 6, 2023

luda1013 commented Jun 6, 2023 • edited Loading

vace17 commented Jul 10, 2023

vace17 commented Jul 10, 2023

czero69 commented Jul 11, 2023 • edited Loading

ZGX010 commented Dec 27, 2023 • edited Loading

czero69 commented May 20, 2023 •

edited

Loading

czero69 commented May 22, 2023 •

edited

Loading

luda1013 commented May 22, 2023 •

edited

Loading

czero69 commented May 22, 2023 •

edited

Loading

czero69 commented May 22, 2023 •

edited

Loading

czero69 commented May 22, 2023 •

edited

Loading

luda1013 commented May 24, 2023 •

edited

Loading

luda1013 commented Jun 6, 2023 •

edited

Loading

czero69 commented Jul 11, 2023 •

edited

Loading

ZGX010 commented Dec 27, 2023 •

edited

Loading