
📋 [TASK] Can I use GPU for training? I looked at the code and it doesn't seem to have this interface #2401

Closed
bommbomm opened this issue Oct 31, 2024 · 12 comments

Comments

@bommbomm

Describe the task

Can I use GPU for training? I looked at the code and it doesn't seem to have this interface

Acceptance Criteria

from anomalib.engine import Engine
from anomalib.models import Patchcore

# datamodule (e.g. Folder or MVTec) is assumed to be constructed above
datamodule.setup()
model = Patchcore()
engine = Engine(task="classification")
engine.train(datamodule=datamodule, model=model)

Priority

High

Related Epic

No response

Estimated Time

No response

Current Status

Not Started

Additional Information

No response

@blaz-r
Contributor

blaz-r commented Oct 31, 2024

Hi. The Engine is based on the Lightning Trainer, so you can pass any Trainer argument to the Engine. Keep in mind that multi-GPU training is currently not supported, but you can use a single GPU just as you would with the Trainer.
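For example, a minimal single-GPU sketch, with Trainer-style arguments forwarded through the Engine (MVTec is used here only as a placeholder datamodule; any datamodule works):

from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import Patchcore

datamodule = MVTec()
model = Patchcore()

# Any Lightning Trainer argument can be forwarded to the Engine;
# accelerator/devices pin training to a single GPU.
engine = Engine(accelerator="gpu", devices=1)
engine.fit(datamodule=datamodule, model=model)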

@vmiller987

Then you can't pass any Trainer argument to the Engine...

devices=[0,1,2,3,4,5,6,7]

I've honestly found anomalib very difficult to use.

@samet-akcay
Contributor

@vmiller987 can you check that you installed torch with the CUDA option? The GPU should be picked up automatically in Anomalib training if you have the correct torch build.

With that being said, please note that multi-GPU is currently not supported, but we are working on enabling it in v2.
#2258

I would love to get your feedback on which parts of anomalib you find difficult to work with.
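A quick sanity check from Python (plain torch calls only, nothing anomalib-specific):

import torch

# True only when torch was built with CUDA and a GPU is visible.
print(torch.cuda.is_available())

# Names of the GPUs torch can see.
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))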

@vmiller987

vmiller987 commented Nov 1, 2024

@samet-akcay I have the correct torch. I can get it to run on one GPU, but I have to use an environment variable to assign Anomalib to a specific GPU. I can't pass devices=[3] to the Engine, which is the Trainer way to assign devices; it still defaults to GPU 0.
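For reference, the workaround looks roughly like this (a sketch: the variable has to be set before torch initializes CUDA, so the process only ever sees the chosen card):

import os

# Expose only physical GPU 3 to this process; inside it, that card becomes device 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

from anomalib.engine import Engine  # imported after the variable is set

engine = Engine()  # auto-detects the single visible GPU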

I am a novice when it comes to unsupervised learning; I am trying to learn, as I mainly have experience with supervised learning. Anomalib doesn't have a good place that explains its models and how they should be used. The notebooks mainly revolve around Padim, it seems. I am looking through the core papers/repos for the other models to try to understand them.

Not all of the models are easily passed to the engine. For example, if I try to use model = Ganomaly() or model = Uflow(), both will raise an attribute error: 'Uflow' object has no attribute 'model'. These must need to be adapted or used in a different way that I haven't quite figured out.

@vmiller987

Not all of the models are easily passed to the engine. For example, if I try to use model = Ganomaly() or model = Uflow(), both will raise an attribute error: 'Uflow' object has no attribute 'model'. These must need to be adapted or used in a different way that I haven't quite figured out.

I'm going to retract this part. I was doing something very silly and fixed it. I'm able to get quite a few of them to run, including Ganomaly.

@samet-akcay
Contributor

I can't pass devices=[3] to the Engine, which is the Trainer way to assign devices; it still defaults to GPU 0.

This is a known issue; I've created a PR for it, which has not been merged yet.
#2256

We are also working on a better solution, where you will be able to choose the device ID or train on multiple GPUs:
#2258

@bommbomm
Author

bommbomm commented Nov 4, 2024

Can you check that you installed torch with the CUDA option? The GPU should be picked up automatically in Anomalib training if you have the correct torch build.

With that being said, please note that multi-GPU is currently not supported, but we are working on enabling it in v2. #2258

I would love to get your feedback on which parts of anomalib you find difficult to work with.

1. Because I wanted to use the GPU, I changed my code to the following snippet:

import multiprocessing
from anomalib.data import Folder
from anomalib.models import Patchcore
from anomalib.engine import Engine

datamodule.setup()
model = Patchcore()
engine = Engine(task="classification", accelerator="gpu", devices="1")
engine.train(datamodule=datamodule, model=model)

2. Then I reinstalled CUDA and a torch build that supports GPU training:

PyTorch version: 1.8.0+cu111
CUDA version: 11.1
cuDNN version: 8005
CUDA available: True
Number of GPUs available: 1
Device 0: NVIDIA GeForce RTX 3080 Ti

3. Finally, this error appeared:
Traceback (most recent call last):
  File "E:\WHGWD\anomalib-main\train.py", line 2, in <module>
    from anomalib.data import Folder
  File "D:\Anaconda_location\envs\AnomalibGPU\lib\site-packages\anomalib\data\__init__.py", line 13, in <module>
    from .avenue import Avenue
  File "D:\Anaconda_location\envs\AnomalibGPU\lib\site-packages\anomalib\data\avenue.py", line 30, in <module>
    from anomalib.data.base import AnomalibVideoDataModule, AnomalibVideoDataset
  File "D:\Anaconda_location\envs\AnomalibGPU\lib\site-packages\anomalib\data\base\__init__.py", line 7, in <module>
    from .datamodule import AnomalibDataModule
  File "D:\Anaconda_location\envs\AnomalibGPU\lib\site-packages\anomalib\data\base\datamodule.py", line 13, in <module>
    from pytorch_lightning import LightningDataModule
  File "D:\Anaconda_location\envs\AnomalibGPU\lib\site-packages\pytorch_lightning\__init__.py", line 34, in <module>
    from lightning_fabric.utilities.seed import seed_everything  # noqa: E402
  File "D:\Anaconda_location\envs\AnomalibGPU\lib\site-packages\lightning_fabric\__init__.py", line 23, in <module>
    from lightning_fabric.fabric import Fabric  # noqa: E402
  File "D:\Anaconda_location\envs\AnomalibGPU\lib\site-packages\lightning_fabric\fabric.py", line 32, in <module>
    from lightning_fabric.plugins import Precision  # avoid circular imports: # isort: split
  File "D:\Anaconda_location\envs\AnomalibGPU\lib\site-packages\lightning_fabric\plugins\__init__.py", line 18, in <module>
    from lightning_fabric.plugins.precision.deepspeed import DeepSpeedPrecision
  File "D:\Anaconda_location\envs\AnomalibGPU\lib\site-packages\lightning_fabric\plugins\precision\__init__.py", line 16, in <module>
    from lightning_fabric.plugins.precision.fsdp import FSDPPrecision
  File "D:\Anaconda_location\envs\AnomalibGPU\lib\site-packages\lightning_fabric\plugins\precision\fsdp.py", line 19, in <module>
    from lightning_fabric.plugins.precision.native_amp import MixedPrecision
  File "D:\Anaconda_location\envs\AnomalibGPU\lib\site-packages\lightning_fabric\plugins\precision\native_amp.py", line 29, in <module>
    class MixedPrecision(Precision):
  File "D:\Anaconda_location\envs\AnomalibGPU\lib\site-packages\lightning_fabric\plugins\precision\native_amp.py", line 90, in MixedPrecision
    def _autocast_context_manager(self) -> torch.autocast:
AttributeError: module 'torch' has no attribute 'autocast'

4. I would like to ask: what would the code look like if GPU training were used correctly? And which torch version is required?

@bommbomm
Author

bommbomm commented Nov 4, 2024

Hi. The Engine is based on the Lightning Trainer, so you can pass any Trainer argument to the Engine. Keep in mind that multi-GPU training is currently not supported, but you can use a single GPU just as you would with the Trainer.

Hello! Thank you for your reply!


@bommbomm
Author

bommbomm commented Nov 4, 2024

Then you can't pass any Trainer argument to the Engine...

devices=[0,1,2,3,4,5,6,7]

I've honestly found anomalib very difficult to use.

Hello, thank you for your reply!

I found that I didn't know how to use the GPU, and the torch version I downloaded seems to have problems as well.

If I install the default torch build, it only supports CPU training, so I installed other torch versions that support CUDA, but in the end it still doesn't work. How did you solve this problem?

@samet-akcay
Contributor

How did you install torch? pip install torch? In a fresh environment, can you install all the required anomalib dependencies via anomalib install? That command decides which torch build to install (CPU or GPU) and picks the right build based on the CUDA version on your system.

Regarding the torch version, anomalib has the following torch requirement, and your torch version could also be one of the issues; 1.8.0 does not satisfy it, which is why you see "module 'torch' has no attribute 'autocast'" (torch.autocast only exists in newer torch releases):

"torch>=2",

@samet-akcay
Contributor

You don't need to specify the GPU as the accelerator; it defaults to auto mode, which picks the GPU if you have one installed.

For example, here is the setup I tried.

Available GPU

❯ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:17:00.0 Off |                  N/A |
| 31%   38C    P8              24W / 350W |   3062MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:65:00.0 Off |                  N/A |
| 30%   41C    P8              18W / 350W |    283MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Code

# Import the required modules
from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import Patchcore

# Initialize the datamodule, model and engine
datamodule = MVTec()
model = Patchcore()
engine = Engine()

# Train the model
engine.fit(datamodule=datamodule, model=model)

Output

FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)

>>> # Look here to check whether a GPU is detected and being used
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
<<<

You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
/home/sakcay/.pyenv/versions/3.11.8/envs/anomalib/lib/python3.11/site-packages/lightning/pytorch/core/optimizer.py:181: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer

  | Name           | Type                     | Params
------------------------------------------------------------
0 | pre_processor  | PreProcessor             | 0     
1 | post_processor | OneClassPostProcessor    | 0     
2 | model          | PatchcoreModel           | 24.9 M
3 | image_metrics  | AnomalibMetricCollection | 0     
4 | pixel_metrics  | AnomalibMetricCollection | 0     
------------------------------------------------------------
24.9 M    Trainable params
0         Non-trainable params
24.9 M    Total params
99.450    Total estimated model params size (MB)
Epoch 0:   0%|                                                      | 0/7 [00:00<?, ?it/s]
/home/sakcay/.pyenv/versions/3.11.8/envs/anomalib/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py:132: `training_step` returned `None`. If this was on purpose, ignore this warning...
Epoch 0: 100%|██████████████████████████████████████████████| 7/7 [00:01<00:00,  4.84it/s]
Selecting Coreset Indices.:  16%|███                | 2685/16385 [00:03<00:17, 795.60it/s]

@samet-akcay
Contributor

I'm moving this to the Q&A discussions, as I don't think this is a bug in Anomalib but an installation issue on your end. Feel free to ask your questions there. Thanks!

@openvinotoolkit openvinotoolkit locked and limited conversation to collaborators Nov 4, 2024
@samet-akcay samet-akcay converted this issue into discussion #2404 Nov 4, 2024

