Question for the experts: what does the duration field in the dataset mean? #91

Open
Alone749-i opened this issue Sep 29, 2024 · 3 comments


@Alone749-i

        {
            "audio": {
                "path": item["audio_file"],
            },
            "sentence": item["label"][0].replace(" ", ""),
            "language": "Chinese",
            "duration": 7.37
        }

What information should the duration field be derived from?
@hanasay

hanasay commented Oct 4, 2024

I'm not sure whether you are asking how to obtain duration or why duration is needed,
so I'll answer both, even though this is probably obvious to the experts here XD

1. How to obtain duration

Many tools can report the length of an audio file (for example librosa or ffmpeg).
Here is an example that uses the Python librosa module to get the duration:

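import librosa

# Returns the clip length in seconds as a float.
# The `path` keyword requires librosa >= 0.10; older releases used `filename`.
duration = librosa.get_duration(path='dataset/audio0.wav')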
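
If librosa is not available, or the dataset is large, the duration of an uncompressed WAV file can also be read from its header alone. The following is only a minimal sketch using Python's standard `wave` module. It assumes the files are PCM WAV, and `wav_duration` is a hypothetical helper name that does not come from the repository.

```python
import wave


def wav_duration(path):
    """Return the duration in seconds of a PCM WAV file.

    Only the header is read (frame count and sample rate),
    so the audio samples themselves are never decoded.
    """
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()
```

Calling `wav_duration('dataset/audio0.wav')` would give a value that can be written into the `duration` field. Compressed formats such as MP3 are not handled by `wave` and would need a tool like librosa or ffmpeg.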

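2. Why duration is needed

The duration field is used to filter out clips that are too long or too short, because such clips can hurt training quality.
See this part of the source code:
https://github.com/yeyupiaoling/Whisper-Finetune/blob/dd3653a3103fb53323ff95a6ebe875bed3c7a47d/utils/reader.py#L89C23-L89C25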
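
To illustrate the idea, duration-based filtering can be sketched roughly as follows. This is not the repository's actual code. The function name `filter_by_duration` and the thresholds 0.5 and 30.0 seconds are assumptions chosen for the example.

```python
def filter_by_duration(entries, min_duration=0.5, max_duration=30.0):
    """Keep entries whose "duration" lies within [min_duration, max_duration].

    The default thresholds are illustrative assumptions,
    not values taken from the repository.
    """
    return [e for e in entries if min_duration <= e["duration"] <= max_duration]


samples = [
    {"audio": {"path": "a.wav"}, "duration": 0.2},   # too short, dropped
    {"audio": {"path": "b.wav"}, "duration": 7.37},  # within range, kept
    {"audio": {"path": "c.wav"}, "duration": 45.0},  # too long, dropped
]

kept = filter_by_duration(samples)
print([e["audio"]["path"] for e in kept])  # prints ['b.wav']
```

Note that in this sketch the decision depends only on the stored `duration` value, so an entry is kept or dropped based on what the field says rather than on the audio itself.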

@Alone749-i
Author

Thank you very much for the patient explanation 🙏

@buyaOyiweiniyingle

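A question for the experts: if duration is only used here to remove clips that are too long or too short, and I have a very large speech/text dataset where measuring the length of every clip takes too long, can I simply assign each duration field a safe value (for example the 7.37 from the sample in the README) instead of making every duration match the real length of its clip?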
