I want to run OpenVoiceV2 huggingface demo on my local GPU. #306

Open
tar-xzvff opened this issue Sep 10, 2024 · 2 comments
Labels
question Further information is requested

@tar-xzvff

I am running OpenVoiceV2 in a Jupyter notebook on a local GPU (NVIDIA T4).
How can I run the Hugging Face demo (with the WebUI) on my local GPU?
https://huggingface.co/spaces/myshell-ai/OpenVoiceV2

dosubot bot commented Sep 10, 2024

To run the OpenVoiceV2 Hugging Face demo with WebUI on your local GPU (NVIDIA T4) in a Jupyter notebook, follow these steps:

  1. Install the required packages:
    Ensure you have the necessary packages installed. You can install them using pip.

    !pip install torch gradio langid
  2. Set up the environment and load the models:
    Use the provided script to set up the environment and load the models. You can adapt the script to run in a Jupyter notebook.

    import os
    import torch
    import gradio as gr
    from openvoice import se_extractor
    from openvoice.api import BaseSpeakerTTS, ToneColorConverter
    import langid
    
    en_ckpt_base = 'checkpoints_v2/base_speakers/EN'
    zh_ckpt_base = 'checkpoints_v2/base_speakers/ZH'
    ckpt_converter = 'checkpoints_v2/converter'
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    output_dir = 'outputs'
    os.makedirs(output_dir, exist_ok=True)
    
    # load models
    en_base_speaker_tts = BaseSpeakerTTS(f'{en_ckpt_base}/config.json', device=device)
    en_base_speaker_tts.load_ckpt(f'{en_ckpt_base}/checkpoint.pth')
    zh_base_speaker_tts = BaseSpeakerTTS(f'{zh_ckpt_base}/config.json', device=device)
    zh_base_speaker_tts.load_ckpt(f'{zh_ckpt_base}/checkpoint.pth')
    tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
    tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')
    
    # load speaker embeddings
    en_source_default_se = torch.load(f'{en_ckpt_base}/en_default_se.pth').to(device)
    en_source_style_se = torch.load(f'{en_ckpt_base}/en_style_se.pth').to(device)
    zh_source_se = torch.load(f'{zh_ckpt_base}/zh_default_se.pth').to(device)
    
    # This online demo mainly supports English and Chinese
    supported_languages = ['zh', 'en']
    def predict(prompt, style, audio_file_pth, agree):
        # initialize an empty info string
        text_hint = ''
        # the user must agree to the terms before anything is generated
        if not agree:
            text_hint += '[ERROR] Please accept the Terms & Condition!\n'
            gr.Warning("Please accept the Terms & Condition!")
            return (
                text_hint,
                None,
                None,
            )
    
        # first detect the input language
        language_predicted = langid.classify(prompt)[0].strip()  
        print(f"Detected language:{language_predicted}")
    
        if language_predicted not in supported_languages:
            text_hint += f"[ERROR] The detected language {language_predicted} for your input text is not in our Supported Languages: {supported_languages}\n"
            gr.Warning(
                f"The detected language {language_predicted} for your input text is not in our Supported Languages: {supported_languages}"
            )
    
            return (
                text_hint,
                None,
                None,
            )
        
        if language_predicted == "zh":
            tts_model = zh_base_speaker_tts
            source_se = zh_source_se
            language = 'Chinese'
            if style not in ['default']:
                text_hint += f"[ERROR] The style {style} is not supported for Chinese, which should be in ['default']\n"
                gr.Warning(f"The style {style} is not supported for Chinese, which should be in ['default']")
                return (
                    text_hint,
                    None,
                    None,
                )
        else:
            tts_model = en_base_speaker_tts
            if style == 'default':
                source_se = en_source_default_se
            else:
                source_se = en_source_style_se
            language = 'English'
            if style not in ['default', 'whispering', 'shouting', 'excited', 'cheerful', 'terrified', 'angry', 'sad', 'friendly']:
                text_hint += f"[ERROR] The style {style} is not supported for English, which should be in ['default', 'whispering', 'shouting', 'excited', 'cheerful', 'terrified', 'angry', 'sad', 'friendly']\n"
                gr.Warning(f"The style {style} is not supported for English, which should be in ['default', 'whispering', 'shouting', 'excited', 'cheerful', 'terrified', 'angry', 'sad', 'friendly']")
                return (
                    text_hint,
                    None,
                    None,
                )
    
        speaker_wav = audio_file_pth
        if len(prompt) < 2:
            text_hint += f"[ERROR] Please give a longer prompt text \n"
            gr.Warning("Please give a longer prompt text")
            return (
                text_hint,
                None,
                None,
            )
        if len(prompt) > 200:
            text_hint += f"[ERROR] Text length limited to 200 characters for this demo, please try shorter text. You can clone our open-source repo and try for your usage \n"
            gr.Warning(
                "Text length limited to 200 characters for this demo, please try shorter text. You can clone our open-source repo for your usage"
            )
            return (
                text_hint,
                None,
                None,
            )
        
        # extract the tone color embedding of the reference speaker audio
        try:
            target_se, audio_name = se_extractor.get_se(speaker_wav, tone_color_converter, target_dir='processed', vad=True)
        except Exception as e:
            text_hint += f"[ERROR] Get target tone color error {str(e)} \n"
            gr.Warning(
                f"[ERROR] Get target tone color error {str(e)} \n"
            )
            return (
                text_hint,
                None,
                None,
            )
    
        src_path = f'{output_dir}/tmp.wav'
        tts_model.tts(prompt, src_path, speaker=style, language=language)
    
        save_path = f'{output_dir}/output.wav'
        # Run the tone color converter
        encode_message = "@MyShell"
        tone_color_converter.convert(
            audio_src_path=src_path, 
            src_se=source_se, 
            tgt_se=target_se, 
            output_path=save_path,
            message=encode_message)
    
        text_hint += f'''Get response successfully \n'''
    
        return (
            text_hint,
            save_path,
            speaker_wav,
        )
    
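    title = "MyShell OpenVoice"
    # Placeholders: the layout below references these two names, but this snippet never defines them.
    # Their original contents are not shown here, so empty strings are used only to avoid a NameError.
    markdown_table_v2 = ""
    wrapped_markdown_content = ""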
    description = """
    We introduce OpenVoice, a versatile instant voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. OpenVoice also achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set.
    """
    
    examples = [
        [
            "今天天气真好,我们一起出去吃饭吧。",
            'default',
            "resources/demo_speaker1.mp3",
            True,
        ],[
            "This audio is generated by open voice with a half-performance model.",
            'whispering',
            "resources/demo_speaker2.mp3",
            True,
        ],
        [
            "He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick, peppered, flour-fattened sauce.",
            'sad',
            "resources/demo_speaker0.mp3",
            True,
        ],
    ]
    
    with gr.Blocks(analytics_enabled=False) as demo:
    
        with gr.Row():
            with gr.Column():
                with gr.Row():
                    gr.Markdown(
                        """
                        ## <img src="https://huggingface.co/spaces/myshell-ai/OpenVoice/raw/main/logo.jpg" height="40"/>
                        """
                    )
                with gr.Row():    
                    gr.Markdown(markdown_table_v2)
                with gr.Row():
                    gr.Markdown(description)
            with gr.Column():
                gr.Video('https://github.com/myshell-ai/OpenVoice/assets/40556743/3cba936f-82bf-476c-9e52-09f0f417bb2f', autoplay=True)
                
        with gr.Row():
            gr.HTML(wrapped_markdown_content)
        with gr.Row():
            with gr.Column():
                input_text_gr = gr.Textbox(
                    label="Text Prompt",
                    info="One or two sentences at a time is better. Up to 200 text characters.",
                    value="He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick, peppered, flour-fattened sauce.",
                )
                style_gr = gr.Dropdown(
                    label="Style",
                    info="Select a style of output audio for the synthesised speech. (Chinese only support 'default' now)",
                    choices=['default', 'whispering', 'cheerful', 'terrified', 'angry', 'sad', 'friendly'],
                    max_choices=1,
                    value="default",
                )
                ref_gr = gr.Audio(
                    label="Reference Audio",
                    info="Click on the ✎ button to upload your own target speaker audio",
                    type="filepath",
                    value="resources/demo_speaker2.mp3",
                )
                tos_gr = gr.Checkbox(
                    label="Agree",
                    value=False,
                    info="I agree to the terms of the cc-by-nc-4.0 license-: https://github.com/myshell-ai/OpenVoice/blob/main/LICENSE",
                )
    
                tts_button = gr.Button("Send", elem_id="send-btn", visible=True)
    
    
            with gr.Column():
                out_text_gr = gr.Text(label="Info")
                audio_gr = gr.Audio(label="Synthesised Audio", autoplay=True)
                ref_audio_gr = gr.Audio(label="Reference Audio Used")
    
                gr.Examples(examples,
                            label="Examples",
                            inputs=[input_text_gr, style_gr, ref_gr, tos_gr],
                            outputs=[out_text_gr, audio_gr, ref_audio_gr],
                            fn=predict,
                            cache_examples=False,)
                tts_button.click(predict, [input_text_gr, style_gr, ref_gr, tos_gr], outputs=[out_text_gr, audio_gr, ref_audio_gr])
    demo.queue()  
    demo.launch(debug=True, show_api=True, share=False)
  3. Run the notebook:
    Execute the cells in your Jupyter notebook to start the Gradio interface and interact with the OpenVoiceV2 demo.

This script sets up the Gradio interface and loads the necessary models to run the OpenVoiceV2 demo on your local GPU [1][2].
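
Before launching the interface, it may help to confirm that the notebook kernel can actually see the GPU. The following is a minimal sketch of that check, not part of the demo script: `pick_device` is a hypothetical helper name, and it assumes PyTorch may or may not be installed in the current kernel.

```python
def pick_device():
    """Return 'cuda' when PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this kernel, so nothing can run on the GPU.
        return "cpu"
    if torch.cuda.is_available():
        # Report which card was found (for example, the T4 from the question).
        print("CUDA device:", torch.cuda.get_device_name(0))
        return "cuda"
    return "cpu"


device = pick_device()
print("Using device:", device)
```

If this prints `cpu` on a machine that has a GPU, the likely cause is a CPU-only PyTorch build or a driver problem rather than anything in the demo code.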

@nitinmukesh

@dosu

The code above doesn't work. It fails because the checkpoint path it uses does not exist:

    (openvoice) C:\ai\OpenVoice>python openvoice\openvoice_app2.py
    Traceback (most recent call last):
      File "C:\ai\OpenVoice\openvoice\openvoice_app2.py", line 23, in <module>
        en_base_speaker_tts = BaseSpeakerTTS(f'{en_ckpt_base}/config.json', device=device)
      File "c:\ai\openvoice\openvoice\api.py", line 21, in __init__
        hps = utils.get_hparams_from_file(config_path)
      File "c:\ai\openvoice\openvoice\utils.py", line 7, in get_hparams_from_file
        with open(config_path, "r", encoding="utf-8") as f:
    FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints_v2/base_speakers/EN/config.json'
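
The traceback shows the script stopping at the first file it tries to open. A quick way to see every path the script expects and which ones are absent is sketched below. The file list is copied from the paths used in the script above; whether the V2 checkpoint download actually ships files under these names is not confirmed in this thread. `EXPECTED_FILES` and `missing_files` are hypothetical names used only for illustration.

```python
from pathlib import Path

# Paths the script above tries to open, relative to the working directory.
EXPECTED_FILES = [
    "checkpoints_v2/base_speakers/EN/config.json",
    "checkpoints_v2/base_speakers/EN/checkpoint.pth",
    "checkpoints_v2/base_speakers/ZH/config.json",
    "checkpoints_v2/base_speakers/ZH/checkpoint.pth",
    "checkpoints_v2/converter/config.json",
    "checkpoints_v2/converter/checkpoint.pth",
]


def missing_files(root=".", expected=EXPECTED_FILES):
    """Return the expected paths that do not exist as files under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


# Run from the directory the script is started in (C:\ai\OpenVoice in the log above).
for rel in missing_files():
    print("missing:", rel)
```

Comparing the printed list against the actual contents of the `checkpoints_v2` directory shows whether the files are in a different location or use a different layout than the script assumes.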
