Getting error while fine tuning for Hindi #13

Open
sanjitk2014 opened this issue Feb 19, 2024 · 10 comments

@sanjitk2014

Thanks. I am getting the error below: RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

Please help. I am using Google Colab and following the instructions exactly.

2024-02-19 11:48:42.153900: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-19 11:48:42.153955: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-19 11:48:42.155392: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-19 11:48:43.496722: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/torch/functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
Steps: 0%| | 50/175200 [00:36<26:49:06, 1.81it/s, lr=2e-5, step_loss=29.5, step_loss_disc=2.78, step_loss_duration=1.5
02/19/2024 11:49:16 - INFO - __main__ - Running validation...
VALIDATION - batch 0, process0, waveform torch.Size([4, 134400, 1]), tokens torch.Size([4, 169])...
VALIDATION - batch 0, process0, PADDING AND GATHER...
Traceback (most recent call last):
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/run_vits_finetuning.py", line 1494, in <module>
main()
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/run_vits_finetuning.py", line 1327, in main
full_generation = model(**full_generation_sample.to(model.device), speaker_id=speaker_id)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 817, in forward
return model_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 805, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 2151, in forward
return self._inference_forward(
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 2000, in _inference_forward
text_encoder_output = self.text_encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1563, in forward
hidden_states = self.embed_tokens(input_ids) * math.sqrt(self.config.hidden_size)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2233, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

@ylacombe
Owner

Hey @sanjitk2014, you should probably check your samples. This might very well be caused by empty text or empty audio. Let me know how it goes.

@sanjitk2014
Author

I have checked the dataset: there is no empty audio and no empty text. I used the following code to verify the dataset:

    from datasets import load_dataset

    dataset = load_dataset("/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/ttsdata")

    def prepare_dataset(batch):
        # load the decoded audio and compute its duration in seconds
        audio = batch["audio"]
        batch["input_length"] = len(audio["array"]) / audio["sampling_rate"]

        # report samples with empty audio
        if batch["input_length"] <= 0:
            print(batch["file_name"])

        # report samples with an empty transcription
        input_str = batch["transcription"]
        if len(input_str) <= 0:
            print(batch["file_name"])

        return batch

    train_data1 = dataset.map(prepare_dataset, num_proc=1)

@sanjitk2014
Author

I generated the checkpoint from facebook/mms-tts-hin and am using it as the pretrained model.

@ylacombe
Owner

I think you should check whether any samples are empty after the dataset has been prepared, not before.
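
A check on the prepared samples, rather than on the raw text and audio, could look roughly like the sketch below. The key names mirror the length fields that the training script's prepare_dataset stores on each sample; the function name and the sample values are made up for illustration.

```python
from typing import Dict, List


def find_empty_samples(prepared: List[Dict]) -> List[int]:
    """Return the indices of prepared samples that have no tokens or no audio."""
    empty = []
    for index, sample in enumerate(prepared):
        if sample["tokens_input_length"] == 0 or sample["waveform_input_length"] == 0:
            empty.append(index)
    return empty


# Hypothetical prepared samples: the second one had non-empty text in the raw
# dataset but ended up with zero tokens after tokenization.
prepared_samples = [
    {"tokens_input_length": 169, "waveform_input_length": 134400},
    {"tokens_input_length": 0, "waveform_input_length": 98000},
]

print(find_empty_samples(prepared_samples))
```

With a Hugging Face dataset, the same condition could be passed to `Dataset.filter` to drop such samples before training.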

@sanjitk2014
Author

I have checked the dataset: there are no empty values or empty strings. I am still getting the same error.

@sanjitk2014
Author

Hi Ylacombe,
After casting input_ids to int before passing them to nn.Embedding, the original error went away, but I then stumbled on the following exception.

tensor([], device='cuda:0', size=(1, 0, 192))
Traceback (most recent call last):
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/run_vits_finetuning.py", line 1494, in <module>
main()
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/run_vits_finetuning.py", line 1327, in main
full_generation = model(**full_generation_sample.to(model.device), speaker_id=speaker_id)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 817, in forward
return model_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 805, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 2159, in forward
return self._inference_forward(
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 2008, in _inference_forward
text_encoder_output = self.text_encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1573, in forward
encoder_outputs = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1507, in forward
layer_outputs = encoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1437, in forward
hidden_states, attn_weights = self.attention(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1282, in forward
rel_pos_bias = self._relative_position_to_absolute_position(relative_logits)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1354, in _relative_position_to_absolute_position
x = nn.functional.pad(x, [0, 1, 0, 0, 0, 0])
RuntimeError: The input size 0, plus negative padding 0 and 0 resulted in a negative output size, which is invalid. Check dimension 1 of your input.
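
Both tracebacks are consistent with a single sample whose token sequence is empty. PyTorch infers float32 for a tensor built from an empty list, which nn.Embedding rejects, and after a cast to an integer type the sequence length is still 0, which matches the size=(1, 0, 192) tensor printed above. The sketch below reproduces that behavior in isolation; the embedding sizes are made up and this is not the repository's code.

```python
import torch
from torch import nn

# Small stand-in embedding table (sizes are arbitrary for this sketch).
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# A batch of one sample with zero tokens: there are no integers to infer the
# dtype from, so PyTorch falls back to its default floating-point dtype.
empty_ids = torch.tensor([[]])
print(empty_ids.dtype, empty_ids.shape)

# Float indices are rejected by the embedding lookup.
try:
    embedding(empty_ids)
    float_error = None
except RuntimeError as err:
    float_error = err
print(float_error)

# Casting to an integer dtype silences that error, but the sequence is still empty.
hidden = embedding(empty_ids.long())
print(hidden.shape)
```

If this is what is happening, casting the ids only moves the failure further down the model, so the empty sample itself has to be fixed or filtered out.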

@VafaKnm

VafaKnm commented Feb 27, 2024

Hi!
I get the same error while fine-tuning the mms-tts-fas model for Persian (Farsi). I printed the waveforms and token_ids for debugging, as shown in the screenshots below. Some samples have empty tokens, even though there is no empty text in my dataset. Did you find a solution for this?

Screenshot (152)
Screenshot (153)
Screenshot (155)
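
One way non-empty text can end up with empty tokens is a character-level tokenizer that silently drops characters missing from its vocabulary. The toy tokenizer below only illustrates that mechanism; its name, vocabulary, and ids are hypothetical, and it is not a description of how the MMS tokenizer is implemented.

```python
from typing import Dict, List


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Map each character to its id, silently skipping characters not in the vocabulary."""
    return [vocab[char] for char in text if char in vocab]


# Hypothetical vocabulary that only knows four Persian-script characters
# (seen, lam, alef, meem), written as escapes to avoid rendering issues.
persian_vocab = {"\u0633": 1, "\u0644": 2, "\u0627": 3, "\u0645": 4}

# "salam" written in Persian script tokenizes as expected.
native_ids = toy_tokenize("\u0633\u0644\u0627\u0645", persian_vocab)

# The romanized form contains only Latin characters, none of which are in
# the vocabulary, so every character is dropped.
romanized_ids = toy_tokenize("salam", persian_vocab)

print(native_ids)
print(romanized_ids)
```

Under this assumption, text that looks fine in the dataset can still produce an empty token list when its script does not match the tokenizer's vocabulary.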

@ylacombe
Owner

Hi, screenshots like this are really not helpful!

Both your issues seem related to some samples being empty, i.e. not tokenized properly. Could you give a link to the datasets you're using?

Thanks

@VafaKnm

VafaKnm commented Mar 2, 2024

Hi!
This is the "prepare_dataset" function in "run_vits_finetuning". I added code that writes two values to text files for debugging: "input_str", which is the output of the "uromanize" function, and "string_inputs", which is the output of the tokenizer.

    def prepare_dataset(batch):
        # process target audio
        sample = batch[audio_column_name]
        audio_inputs = feature_extractor(
            sample["array"],
            sampling_rate=sample["sampling_rate"],
            return_attention_mask=False,
            do_normalize=do_normalize,
        )

        batch["labels"] = audio_inputs.get("input_features")[0]

        # process text inputs
        input_str = batch[text_column_name].lower() if do_lower_case else batch[text_column_name]
        
        if is_uroman:
            input_str = uromanize(input_str, uroman_path=uroman_path)
        string_inputs = tokenizer(input_str, return_attention_mask=False)


        # Writing input_str to a text file
        with open("/home/user1/vits_input_str.txt", "a") as file:
            file.write(input_str + "\n")
    
        # Writing string_inputs to a text file
        with open("/home/user1/vits_string_inputs.txt", "a") as file:
            file.write(str(string_inputs) + "\n")


        batch[model_input_name] = string_inputs.get("input_ids")[: max_tokens_length + 1]
        batch["waveform_input_length"] = len(sample["array"])
        batch["tokens_input_length"] = len(batch[model_input_name])
        batch["waveform"] = batch[audio_column_name]["array"]

        batch["mel_scaled_input_features"] = audio_inputs.get("mel_scaled_input_features")[0]

        if speaker_id_column_name is not None:
            if new_num_speakers > 1:
                # align speaker_id to [0, num_speaker_id-1].
                batch["speaker_id"] = speaker_id_dict.get(batch[speaker_id_column_name], 0)
        return batch

After inspecting these text files, I found that the file for "uromanize" is correct, but the file for the tokenizer has problems: some of the token lists are empty and most of the text is tokenized wrongly.
I noticed that, although the documentation lists Persian as one of the uroman languages, the "is_uroman" parameter in the "tokenizer_config" file of the original model is set to "False":
https://huggingface.co/facebook/mms-tts-fas/blob/main/tokenizer_config.json

So I changed my previous config and set "is_uroman" to False. As a result, the error was fixed for me.
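
A mismatch like this can be caught early by comparing the local setting against the checkpoint's tokenizer_config.json. The sketch below works on a JSON string so that it runs without network access; only the "is_uroman" key comes from this thread, while the function name, the other JSON key, and the local flag are hypothetical.

```python
import json

# Hypothetical excerpt of a checkpoint's tokenizer_config.json. Only the
# "is_uroman" key is taken from the thread; the rest is made up.
tokenizer_config_text = '{"is_uroman": false, "language": "fas"}'

# Hypothetical local setting kept alongside the training configuration.
local_is_uroman = True


def uroman_mismatch(config_text: str, local_flag: bool) -> bool:
    """Return True when the local flag disagrees with the checkpoint's tokenizer config."""
    checkpoint_flag = json.loads(config_text).get("is_uroman", False)
    return bool(checkpoint_flag) != bool(local_flag)


if uroman_mismatch(tokenizer_config_text, local_is_uroman):
    print("is_uroman differs from the checkpoint's tokenizer_config.json")
```

Running such a check before preprocessing would flag the situation described above, where the text is romanized although the checkpoint's tokenizer does not expect it.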

@imPdhar

imPdhar commented Nov 12, 2024

@VafaKnm Were you able to fine-tune a model successfully? I managed to fine-tune the Arabic model, but all I got was this error or a lot of noise.
cc @ylacombe
