[<Library component: Model|Core|etc...>] AutoTimeMixer is not working #1213

Open
skmanzg opened this issue Nov 25, 2024 · 1 comment
skmanzg commented Nov 25, 2024

What happened + What you expected to happen

I tried AutoTimeMixer after successfully training a plain TimeMixer model.

The plain TimeMixer configuration below trained without problems.

Note: Y_train_df["unique_id"].nunique() = 5 and H = 288

from neuralforecast.losses.pytorch import MAE, MSE
from neuralforecast.models import TimeMixer

model = TimeMixer(
    h=H,
    input_size=1440,
    n_series=Y_train_df["unique_id"].nunique(),
    scaler_type='minmax',
    max_steps=500,
    early_stop_patience_steps=10,
    val_check_steps=50,
    learning_rate=1e-3,
    loss=MSE(),
    valid_loss=MAE(),
    batch_size=32,

    d_ff=5,
    e_layers=5,

    # Trainer arguments
    accelerator='auto',
    devices='auto',
    enable_model_summary=False,
    enable_progress_bar=True
)

AutoTimeMixer, however, does not work. Both the Ray and the Optuna backends fail, and I don't understand why the auto version behaves differently.
I tried many different parameters to make the tensor sizes match, but none of them solved the problem.

The code and logs are in the sections below:

Versions / Dependencies

python 3.10.14
reinstalled neuralforecast today

Reproduction script

from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoTimeMixer
from neuralforecast.losses.pytorch import MSE

# Y_train_df is the training DataFrame (not shown); it contains 5 unique_id values.
H = 288

# Fixed configuration passed to the Ray backend
config1 = {
    'n_series': Y_train_df["unique_id"].nunique(),
    'input_size': 1440,
    # 'down_sampling_layers': 5,
    # 'down_sampling_window': 5,
    'scaler_type': 'minmax',
    'batch_size': 64,
}

# Default search space, used when trying the Optuna backend
config2 = AutoTimeMixer.get_default_config(h=288, backend="optuna", n_series=Y_train_df["unique_id"].nunique())

def config_o(trial):
    return config1

model = AutoTimeMixer(
    h=H,
    n_series=Y_train_df["unique_id"].nunique(),
    config=config1,
    loss=MSE(),
    valid_loss=MSE(),
    verbose=True,
    backend="ray",  # the error is the same with backend="optuna" and config2
    num_samples=5,
    gpus=1,
)

nf = NeuralForecast(models=[model], freq='10min')

nf.fit(df=Y_train_df, val_size=288)

ERROR LOG





---------------------------------------------------------------------------
ProcessRaisedException                    Traceback (most recent call last)
Cell In[3], line 60
     42 model = AutoTimeMixer(
     43     h = H,
     44     n_series = Y_train_df["unique_id"].nunique(),
   (...)
     52
     53 )
     58 nf = NeuralForecast(models=[model], freq='10min')
---> 60 nf.fit(df=Y_train_df, val_size=288)

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/core.py:544, in NeuralForecast.fit(self, df, static_df, val_size, sort_df, use_init_models, verbose, id_col, time_col, target_col, distributed_config)
    541     self._reset_models()
    543 for i, model in enumerate(self.models):
--> 544     self.models[i] = model.fit(
    545         self.dataset, val_size=val_size, distributed_config=distributed_config
    546     )
    548 self._fitted = True

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/common/_base_auto.py:429, in BaseAuto.fit(self, dataset, val_size, test_size, random_seed, distributed_config)
    417     results = self._optuna_tune_model(
    418         cls_model=self.cls_model,
    419         dataset=dataset,
   (...)
    426         distributed_config=distributed_config,
    427     )
    428     best_config = results.best_trial.user_attrs["ALL_PARAMS"]
--> 429 self.model = self._fit_model(
    430     cls_model=self.cls_model,
    431     config=best_config,
    432     dataset=dataset,
    433     val_size=val_size * self.refit_with_val,
    434     test_size=test_size,
    435     distributed_config=distributed_config,
    436 )
    437 self.results = results
    439 # Added attributes for compatibility with NeuralForecast core

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/common/_base_auto.py:362, in BaseAuto._fit_model(self, cls_model, config, dataset, val_size, test_size, distributed_config)
    358 def _fit_model(
    359     self, cls_model, config, dataset, val_size, test_size, distributed_config=None
    360 ):
    361     model = cls_model(**config)
--> 362     model = model.fit(
    363         dataset,
    364         val_size=val_size,
    365         test_size=test_size,
    366         distributed_config=distributed_config,
    367     )
    368     return model

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/common/_base_multivariate.py:547, in BaseMultivariate.fit(self, dataset, val_size, test_size, random_seed, distributed_config)
    543 if distributed_config is not None:
    544     raise ValueError(
    545         "multivariate models cannot be trained using distributed data parallel."
    546     )
--> 547 return self._fit(
    548     dataset=dataset,
    549     batch_size=self.n_series,
    550     valid_batch_size=self.n_series,
    551     val_size=val_size,
    552     test_size=test_size,
    553     random_seed=random_seed,
    554     shuffle_train=False,
    555     distributed_config=None,
    556 )

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/common/_base_model.py:356, in BaseModel._fit(self, dataset, batch_size, valid_batch_size, val_size, test_size, random_seed, shuffle_train, distributed_config)
    354 model = self
    355 trainer = pl.Trainer(**model.trainer_kwargs)
--> 356 trainer.fit(model, datamodule=datamodule)
    357 model.metrics = trainer.callback_metrics
    358 model.__dict__.pop("_trainer", None)

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:538, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    536 self.state.status = TrainerStatus.RUNNING
    537 self.training = True
--> 538 call._call_and_handle_interrupt(
    539     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    540 )

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:46, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     44 try:
     45     if trainer.strategy.launcher is not None:
---> 46         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
     47     return trainer_fn(*args, **kwargs)
     49 except _TunerExitException:

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py:144, in _MultiProcessingLauncher.launch(self, function, trainer, *args, **kwargs)
    136 process_context = mp.start_processes(
    137     self._wrapping_function,
    138     args=process_args,
   (...)
    141     join=False,  # we will join ourselves to get the process references
    142 )
    143 self.procs = process_context.processes
--> 144 while not process_context.join():
    145     pass
    147 worker_output = return_queue.get()

File ~/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:189, in ProcessContext.join(self, timeout)
    187 msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
    188 msg += original_trace
--> 189 raise ProcessRaisedException(msg, error_index, failed_process.pid)

ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
    fn(i, *args)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 173, in _wrapping_function
    results = function(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
    self.advance()
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 250, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 190, in run
    self._optimizer_step(batch_idx, closure)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 268, in _optimizer_step
    call._call_lightning_module_hook(
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 167, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1306, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 153, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 270, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 238, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision.py", line 122, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
    return func.__get__(opt, opt.__class__)(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
    out = func(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/optim/optimizer.py", line 89, in _use_grad
    ret = func(self, *args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/optim/adam.py", line 205, in step
    loss = closure()
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision.py", line 108, in _wrap_closure
    closure_result = closure()
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 144, in __call__
    self._result = self.closure(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 129, in closure
    step_output = self._step_fn()
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 317, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 319, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 389, in training_step
    return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 640, in __call__
    wrapper_output = wrapper_module(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "seoul/anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 633, in wrapped_forward
    out = method(*_args, **_kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/common/_base_multivariate.py", line 371, in training_step
    output = self(windows_batch)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/models/timemixer.py", line 645, in forward
    y_pred = self.forecast(insample_y, x_mark_enc, x_mark_dec)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/models/timemixer.py", line 576, in forecast
    x = self.normalize_layers[i](x, "norm")
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/common/_modules.py", line 557, in forward
    x = self._normalize(x)
  File "anaconda3/envs/sswoon_llama3/lib/python3.10/site-packages/neuralforecast/common/_modules.py", line 588, in _normalize
    x = x * self.affine_weight
RuntimeError: The size of tensor a (2) must match the size of tensor b (5) at non-singleton dimension 2
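
A possible reading of the size mismatch, offered as a sketch rather than a confirmed diagnosis: the traceback passes through Lightning's `ddp` strategy, and the failing worker is "Process 2", so at least three processes were running. If the 5 series were split across processes the way PyTorch's `DistributedSampler` splits a dataset (with `drop_last=False`, each process gets `ceil(N / num_processes)` items), each process would see only 2 series, while `affine_weight` is sized for all 5. The helper names below are hypothetical, and the process count of 4 is an assumption based on the four GPUs mentioned in this thread.

```python
import math


def per_process_samples(n_items: int, n_processes: int) -> int:
    """Items each process receives when a dataset is split evenly
    across processes and the remainder is padded (drop_last=False)."""
    return math.ceil(n_items / n_processes)


def broadcastable(a: int, b: int) -> bool:
    """Two dimension sizes broadcast only if they are equal or one of them is 1."""
    return a == b or a == 1 or b == 1


n_series = 5      # from the note: Y_train_df["unique_id"].nunique()
n_processes = 4   # assumption: one DDP process per GPU, four GPUs

local_series = per_process_samples(n_series, n_processes)
print(local_series)                            # 2
print(broadcastable(local_series, n_series))   # False
```

Under these assumptions the local tensor has 2 series and the weight has 5, which is the pair of sizes reported in the `RuntimeError`. Three processes would give the same local size of 2.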

Issue Severity

High: It blocks me from completing my task.

@skmanzg skmanzg added the bug label Nov 25, 2024

skmanzg commented Nov 25, 2024

Additionally: since I have four GPUs, I set gpus = 4, but the GPUs do not seem to be detected and the run freezes. I had to set gpus = 1 to avoid this. According to the documentation, gpus should be the number of GPUs I have, so I wonder why this does not work either.
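
One thing that may be worth trying, as an untested sketch and not a confirmed fix: the traceback in the first comment runs through Lightning's `ddp` strategy, and the check shown at `_base_multivariate.py:544` states that multivariate models cannot be trained using distributed data parallel. The working `TimeMixer` call passed `accelerator` and `devices` explicitly, while `config1` did not. Because the traceback shows the Auto class building the model as `cls_model(**config)`, the same trainer arguments can presumably be supplied through the config. The name `config1_single_device` is hypothetical, `n_series = 5` is taken from the note in the first comment, and `devices=1` is an assumption chosen here to keep training in a single process.

```python
n_series = 5  # from the note in the first comment: Y_train_df["unique_id"].nunique()

# Same fixed settings as config1, plus explicit trainer arguments.
config1_single_device = {
    "n_series": n_series,
    "input_size": 1440,
    "scaler_type": "minmax",
    "batch_size": 64,
    # Trainer arguments, passed to TimeMixer as keyword arguments
    # in the same way as in the working TimeMixer call.
    "accelerator": "auto",
    "devices": 1,  # assumption: a single device, so no DDP processes are launched
}
```

This dictionary would replace `config1` in the `AutoTimeMixer(...)` call. Whether it resolves the error depends on how the installed neuralforecast and Lightning versions choose devices by default.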
