
Hardcoded num_mels to 80? #166

Open · bzp83 opened this issue Jun 16, 2024 · 4 comments

Comments

@bzp83 commented Jun 16, 2024

self.conv_pre = weight_norm(Conv1d(80, h.upsample_initial_channel, 7, 1, padding=3))

Hi, why is 80 hardcoded here? Should it match num_mels?

Thanks

@harsh40c

Hey, I tried this repo's code and hit the same error. I used librosa instead of Tacotron 2 for mel-spectrogram generation, and my spectrograms had shape (128 × 387). Since the channel count is hardcoded to 80, and changing it only there doesn't fix the error (several other places would need to change too), I instead set n_mels to 80 when generating the mel spectrograms with librosa. That resolves this error, but now I'm getting a cuDNN error because the CUDA and cuDNN versions the repo was built against are incompatible with my GPU (RTX 3090). With a newer PyTorch build for CUDA 11.1 and the matching cuDNN I get a "no available kernel" error, and with the old versions I get CUDNN_EXECUTION_FAILED. If you have any solution for that, please tell me. As for your query: as mentioned above, setting n_mels of the generated spectrograms to 80 works around this issue.
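For reference, a minimal sketch of generating an 80-bin mel spectrogram with librosa; the file path and STFT parameters below are placeholders, match them to your HiFi-GAN config:

import librosa
import numpy as np

# Placeholder path and parameters; use the n_fft, hop_size, win_size, fmin, fmax from your config.
wav, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, win_length=1024,
    n_mels=80, fmin=0, fmax=8000,
    power=1.0,  # magnitude spectrogram, as HiFi-GAN's meldataset uses
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # same log dynamic-range compression as the repo
print(log_mel.shape)  # (80, num_frames)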

@bzp83 (Author) commented Jun 18, 2024

yes... and to confuse me even more, VITS changes the HiFi-GAN code slightly and uses "initial_channel" (https://github.com/jaywalnut310/vits/blob/2e561ba58618d021b5b8323d3765880f7e0ecfdb/models.py#L249) instead of the hardcoded 80... I'm having a hard time figuring it out.
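For comparison, a small runnable sketch of the two pre-convolution layers; the HiFi-GAN line mirrors this repo, while the VITS side is paraphrased from the linked models.py, and 192 is just the value the VITS configs appear to use for inter_channels:

from torch.nn import Conv1d
from torch.nn.utils import weight_norm

upsample_initial_channel = 512  # value used by the configs in this repo

# HiFi-GAN (this repo): input channels hardcoded to 80, i.e. an 80-bin mel spectrogram.
hifigan_conv_pre = weight_norm(Conv1d(80, upsample_initial_channel, 7, 1, padding=3))

# VITS (paraphrased): the same layer takes its input channel count as an argument;
# the VITS decoder is fed the latent z (inter_channels, typically 192), not a mel,
# which would explain why that number doesn't match num_mels.
initial_channel = 192
vits_conv_pre = Conv1d(initial_channel, upsample_initial_channel, 7, 1, padding=3)

print(hifigan_conv_pre.weight.shape)  # torch.Size([512, 80, 7])
print(vits_conv_pre.weight.shape)     # torch.Size([512, 192, 7])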

Anyway, yes, I solved the problem and it works great on my RTX 4090:

1 - update your requirements.txt to the list below; this will install the latest versions of those packages:

numpy
librosa
scipy
tensorboard
soundfile
matplotlib

2 - install the latest PyTorch, e.g. for 2.3.1 and CUDA 12.1:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

3 - update the mel_spectrogram function in meldataset.py to:

def mel_spectrogram(
    y, n_fft, num_mels, sampling_rate, hop_size, win_size, fmin, fmax, center=False
):
    if torch.min(y) < -1.0:
        print("min value is ", torch.min(y))
    if torch.max(y) > 1.0:
        print("max value is ", torch.max(y))

    global mel_basis, hann_window
    # Cache the mel filterbank and Hann window per dtype/device (and fmax/win_size)
    # so they are only built once.
    dtype_device = str(y.dtype) + "_" + str(y.device)
    fmax_dtype_device = str(fmax) + "_" + dtype_device
    wnsize_dtype_device = str(win_size) + "_" + dtype_device
    if fmax_dtype_device not in mel_basis:
        mel = librosa_mel_fn(
            sr=sampling_rate, n_fft=n_fft, n_mels=num_mels, fmin=fmin, fmax=fmax
        )
        mel_basis[fmax_dtype_device] = torch.from_numpy(mel).type_as(y)
    if wnsize_dtype_device not in hann_window:
        hann_window[wnsize_dtype_device] = torch.hann_window(win_size).type_as(y)

    # Reflect-pad the waveform so frames line up with hop_size segmentation.
    y = torch.nn.functional.pad(
        y.unsqueeze(1),
        (int((n_fft - hop_size) / 2), int((n_fft - hop_size) / 2)),
        mode="reflect",
    )
    y = y.squeeze(1)

    # STFT with return_complex=True, then view_as_real to get real/imag in the last dim.
    spec = torch.view_as_real(
        torch.stft(
            y,
            n_fft,
            hop_length=hop_size,
            win_length=win_size,
            window=hann_window[wnsize_dtype_device],
            center=center,
            pad_mode="reflect",
            normalized=False,
            onesided=True,
            return_complex=True,
        )
    )

    # Magnitude spectrogram (small epsilon for numerical stability).
    spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)

    # Project onto the mel filterbank and apply log dynamic-range compression.
    spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
    spec = spectral_normalize_torch(spec)

    return spec

that should do it!
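A hypothetical call site, assuming the updated meldataset.py is importable and using the 44.1 kHz / 128-mel values from the config further down this thread; the input tensor is just random fake audio to show the shapes:

import torch
from meldataset import mel_spectrogram  # assumes the updated function above

y = torch.rand(1, 44100) * 2 - 1  # one second of fake audio in [-1, 1)
mel = mel_spectrogram(
    y,
    n_fft=2048,
    num_mels=128,
    sampling_rate=44100,
    hop_size=512,
    win_size=2048,
    fmin=0,
    fmax=22050,
    center=False,
)
print(mel.shape)  # roughly (1, 128, 86): 44100 samples / hop_size 512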

@bzp83 (Author) commented Jun 18, 2024

btw... I managed to train a model with 128 mels at 44100 Hz using the config below. I also had to change that hardcoded 80 to 128, or simply use self.conv_pre = weight_norm(Conv1d(h.num_mels, h.upsample_initial_channel, 7, 1, padding=3)), so I suspect that value is indeed num_mels... but as I said, VITS uses initial_channel, which seems to always be 192 in its configs while num_mels is 80 😵

{
    "resblock": "1",
    "num_gpus": 0,
    "batch_size": 8,
    "learning_rate": 0.0002,
    "adam_b1": 0.8,
    "adam_b2": 0.99,
    "lr_decay": 0.999875,
    "seed": 1234,
    "upsample_rates": [
      8,
      8,
      2,
      2,
      2
    ],
    "upsample_kernel_sizes": [
      16,
      16,
      4,
      4,
      4
    ],
    "upsample_initial_channel": 512,
    "resblock_kernel_sizes": [
      3,
      7,
      11
    ],
    "resblock_dilation_sizes": [
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ],
      [
        1,
        3,
        5
      ]
    ],
    "segment_size": 16384,
    "num_mels": 128,
    "num_freq": 1025,
    "n_fft": 2048,
    "hop_size": 512,
    "win_size": 2048,
    "sampling_rate": 44100,
    "fmin": 0,
    "fmax": 22050,
    "fmax_for_loss": null,
    "num_workers": 16,
    "dist_config": {
      "dist_backend": "nccl",
      "dist_url": "tcp://localhost:54321",
      "world_size": 1
    }
  }
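A small, hypothetical sanity check for a config like the one above; the filename is a placeholder, and the relationships checked follow from how HiFi-GAN uses these values (the generator upsamples each mel frame by the product of upsample_rates, which must equal hop_size, and the mel filterbank can't go above Nyquist):

import json
import math

with open("config_v1_44khz.json") as f:  # placeholder filename
    h = json.load(f)

# One mel frame becomes hop_size audio samples, so the upsampling factors
# must multiply to hop_size (8*8*2*2*2 = 512 above).
assert math.prod(h["upsample_rates"]) == h["hop_size"], "upsample_rates must multiply to hop_size"

# STFT bookkeeping: num_freq is the number of rFFT bins, win_size can't exceed n_fft.
assert h["num_freq"] == h["n_fft"] // 2 + 1
assert h["win_size"] <= h["n_fft"]

# The mel filterbank can't go above Nyquist.
assert h["fmax"] is None or h["fmax"] <= h["sampling_rate"] / 2

print("config looks consistent:", h["num_mels"], "mels at", h["sampling_rate"], "Hz")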

@harsh40c

Hey man, thanks for the solution, it worked. It's just consuming a lot of GPU memory, but since other trainings are running on our server machine, I'll start this training once a GPU is free. Hopefully it will train properly then. Anyway, thanks a bunch.
