Getting error while fine tuning for Hindi #13

sanjitk2014 · 2024-02-19T11:59:12Z

Thanks. I am getting the error below: RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

Please help. I am using Google Colab and following the instructions exactly.

2024-02-19 11:48:42.153900: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-19 11:48:42.153955: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-19 11:48:42.155392: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-19 11:48:43.496722: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/torch/functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
Steps: 0%| | 50/175200 [00:36<26:49:06, 1.81it/s, lr=2e-5, step_loss=29.5, step_loss_disc=2.78, step_loss_duration=1.5
02/19/2024 11:49:16 - INFO - __main__ - Running validation...
VALIDATION - batch 0, process0, waveform torch.Size([4, 134400, 1]), tokens torch.Size([4, 169])...
VALIDATION - batch 0, process0, PADDING AND GATHER...
Traceback (most recent call last):
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/run_vits_finetuning.py", line 1494, in <module>
main()
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/run_vits_finetuning.py", line 1327, in main
full_generation = model(**full_generation_sample.to(model.device), speaker_id=speaker_id)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 817, in forward
return model_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 805, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 2151, in forward
return self._inference_forward(
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 2000, in _inference_forward
text_encoder_output = self.text_encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1563, in forward
hidden_states = self.embed_tokens(input_ids) * math.sqrt(self.config.hidden_size)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2233, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

ylacombe · 2024-02-19T12:05:36Z

Hey @sanjitk2014, you should probably check your samples. This might very well be caused by empty text or empty audio. Let me know how it goes.

sanjitk2014 · 2024-02-19T14:27:46Z

I have checked the dataset: there is no empty audio and no empty text. I used the following code to verify the dataset:

    from datasets import load_dataset

    dataset = load_dataset("/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/ttsdata")

    def prepare_dataset(batch):
        # load the audio and compute its duration in seconds
        audio = batch["audio"]
        batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]

        # report samples whose audio is empty
        if batch["input_length"] <= 0:
            print(batch["file_name"])

        # report samples whose transcription is empty
        input_str = batch["transcription"]
        if len(input_str) <= 0:
            print(batch["file_name"])

        return batch

    train_data1 = dataset.map(prepare_dataset, num_proc=1)

sanjitk2014 · 2024-02-19T14:34:14Z

I generated the checkpoint from facebook/mms-tts-hin and am using it as the pretrained model.

ylacombe · 2024-02-19T16:16:12Z

I think you should test whether samples are empty after the dataset has been prepared.
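A minimal sketch of such a post-preparation check, assuming the prepared rows are plain dicts whose token ids live under "input_ids" and whose audio samples live under "waveform" (both column names and the helper name are assumptions for illustration, not taken from the training script):

```python
def find_empty_samples(prepared_rows, ids_key="input_ids", audio_key="waveform"):
    """Return (index, reason) pairs for rows with no token ids or no audio samples."""
    problems = []
    for index, row in enumerate(prepared_rows):
        for key in (ids_key, audio_key):
            value = row.get(key)
            # A missing column and a zero-length column are both reported.
            if value is None or len(value) == 0:
                problems.append((index, "empty " + key))
    return problems


# Toy rows standing in for a dataset that has already been tokenized.
rows = [
    {"input_ids": [5, 9, 2], "waveform": [0.1, -0.2]},
    {"input_ids": [], "waveform": [0.3, 0.1]},
    {"input_ids": [7], "waveform": []},
]
problems = find_empty_samples(rows)
print(problems)  # [(1, 'empty input_ids'), (2, 'empty waveform')]
```

The point of running it on the prepared rows rather than the raw ones is that a non-empty transcription can still turn into an empty token list during tokenization.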

sanjitk2014 · 2024-02-19T17:49:54Z

I have checked the dataset again: there are no empty values or empty strings. I am still getting the same error.

sanjitk2014 · 2024-02-20T16:08:43Z

Hi Ylacombe,
After converting input_ids with int() before passing them to nn.Embedding, I resolved that issue but ran into the following exception.

tensor([], device='cuda:0', size=(1, 0, 192))
Traceback (most recent call last):
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/run_vits_finetuning.py", line 1494, in <module>
main()
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/run_vits_finetuning.py", line 1327, in main
full_generation = model(**full_generation_sample.to(model.device), speaker_id=speaker_id)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 817, in forward
return model_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 805, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 2159, in forward
return self._inference_forward(
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 2008, in _inference_forward
text_encoder_output = self.text_encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1573, in forward
encoder_outputs = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1507, in forward
layer_outputs = encoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1437, in forward
hidden_states, attn_weights = self.attention(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1282, in forward
rel_pos_bias = self._relative_position_to_absolute_position(relative_logits)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1354, in _relative_position_to_absolute_position
x = nn.functional.pad(x, [0, 1, 0, 0, 0, 0])
RuntimeError: The input size 0, plus negative padding 0 and 0 resulted in a negative output size, which is invalid. Check dimension 1 of your input.
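One plausible way the two errors above connect, sketched under the assumption that a sample reached the model with an empty token list: PyTorch cannot infer an integer type from an empty list, so the tensor defaults to float32, which the embedding layer rejects. Casting to an integer type gets past that check but leaves a sequence of length 0, consistent with the size=(1, 0, 192) tensor printed above. The vocabulary size and embedding width below are arbitrary illustration values.

```python
import torch

# An empty token list gives PyTorch nothing to infer an integer dtype from.
empty_ids = torch.tensor([[]])
print(empty_ids.dtype)  # torch.float32

# Casting hides the dtype problem but the sequence is still empty.
cast_ids = empty_ids.long()
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
hidden_states = embedding(cast_ids)
print(hidden_states.shape)  # torch.Size([1, 0, 4])
```

If this is what happened, the cast only moves the failure further down the network, and the empty sample itself is what needs fixing.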

VafaKnm · 2024-02-27T10:06:30Z

Hi!
I get the same error while fine-tuning the mms-tts-fas model for Persian (Farsi). I printed the waveforms and token_ids for debugging; you can see them in the screenshots below. As you can see, some samples have empty tokens even though there is no empty text in my dataset. Did you find a solution for this?

ylacombe · 2024-02-29T15:11:14Z

Hi, screenshots like this are really not helpful!

Both of your issues seem related to some samples being empty, i.e. not tokenized properly. Could you share a link to the datasets you're using?

Thanks

VafaKnm · 2024-03-02T07:51:17Z

Hi!
This is the "prepare_dataset" function in "run_vits_finetuning". I added two pieces of code that write some information to text files for debugging: one is "input_str", the output of the "uromanize" function, and the other is "string_inputs", the output of the tokenizer.

    def prepare_dataset(batch):
        # process target audio
        sample = batch[audio_column_name]
        audio_inputs = feature_extractor(
            sample["array"],
            sampling_rate=sample["sampling_rate"],
            return_attention_mask=False,
            do_normalize=do_normalize,
        )

        batch["labels"] = audio_inputs.get("input_features")[0]

        # process text inputs
        input_str = batch[text_column_name].lower() if do_lower_case else batch[text_column_name]
        
        if is_uroman:
            input_str = uromanize(input_str, uroman_path=uroman_path)
        string_inputs = tokenizer(input_str, return_attention_mask=False)


        # Writing input_str to a text file
        with open("/home/user1/vits_input_str.txt", "a") as file:
            file.write(input_str + "\n")
    
        # Writing string_inputs to a text file
        with open("/home/user1/vits_string_inputs.txt", "a") as file:
            file.write(str(string_inputs) + "\n")


        batch[model_input_name] = string_inputs.get("input_ids")[: max_tokens_length + 1]
        batch["waveform_input_length"] = len(sample["array"])
        batch["tokens_input_length"] = len(batch[model_input_name])
        batch["waveform"] = batch[audio_column_name]["array"]

        batch["mel_scaled_input_features"] = audio_inputs.get("mel_scaled_input_features")[0]

        if speaker_id_column_name is not None:
            if new_num_speakers > 1:
                # align speaker_id to [0, num_speaker_id-1].
                batch["speaker_id"] = speaker_id_dict.get(batch[speaker_id_column_name], 0)
        return batch

After inspecting these text files, I found that the file with the "uromanize" output is correct, but the file with the tokenizer output has problems: some of the token lists are empty and most of the text is tokenized incorrectly.
I noticed that, although the documentation lists Persian as one of the uroman languages, the "is_uroman" parameter in the "tokenizer_config" file of the original model is set to "False":
https://huggingface.co/facebook/mms-tts-fas/blob/main/tokenizer_config.json

So I changed my previous config and set "is_uroman" to False. As a result, the error was fixed for me.
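A toy illustration of why a mismatch between romanization and the tokenizer vocabulary could produce empty tokens. This is not the real tokenizer: it assumes a character-level tokenizer that silently drops characters missing from its vocabulary, and the vocabulary and helper name are hypothetical.

```python
def toy_char_tokenize(text, vocab):
    """Map each character to its id, dropping characters that are not in vocab."""
    return [vocab[ch] for ch in text if ch in vocab]


# Hypothetical vocabulary built from native-script characters only.
native_vocab = {ch: i for i, ch in enumerate("سلام ")}

native_text = "سلام"
romanized_text = "salam"  # roughly what a romanizer might produce

native_ids = toy_char_tokenize(native_text, native_vocab)
romanized_ids = toy_char_tokenize(romanized_text, native_vocab)

print(len(native_ids))  # 4
print(romanized_ids)    # []
```

Under that assumption, romanizing text for a model whose vocabulary expects native script leaves nothing to tokenize, even though the transcription itself is not empty.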

imPdhar · 2024-11-12T12:01:14Z

@VafaKnm Were you able to fine-tune a model successfully? I fine-tuned on top of the Arabic model, but all I got was this error or a lot of noise.
cc @ylacombe
