
Adjusting Sentence Length in SRT File Sync #328

Open · iodides opened this issue Oct 11, 2024 · 2 comments
Labels: enhancement (New feature or request)

iodides commented Oct 11, 2024

Hello,

I am currently working on synchronizing lyrics in SRT files. However, I'm encountering an issue where the sentences are too long, and I would like to split them into individual lines for synchronization.

For example, the output from the web UI looks like this:

1
00:00:00,000 --> 00:00:18,559
The quiet night skies are so bright They shine softly through the window light

What I want is this:

1
00:00:00,000 --> 00:00:08,316
The quiet night skies are so bright

2
00:00:08,340 --> 00:00:18,559
They shine softly through the window light.

It seems that the AI recognizes "They" as the beginning of a new sentence because it's capitalized. Is there any option in the web UI settings to adjust the sentence length so that the lyrics are split appropriately across multiple lines?

Thank you for your help!

jhj0517 added the enhancement (New feature or request) label on Oct 12, 2024
jhj0517 (Owner) commented Oct 12, 2024

Hi. As far as I know, this is difficult to achieve, as there's no such parameter in Whisper yet.

We can think of some pre-processing or post-processing to work around this.

For pre-processing, you could use VAD with a short Min Speech Duration (ms) and Min Silence Duration (ms) and a long Speech Pad (ms) to force short segments by padding between each segment when transcribing; see the sketch after the next paragraph.

Still, this can sometimes give a bad result, because VAD often doesn't catch the very short silences between speech segments, even with a short Min Silence Duration (ms).
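
For illustration, here is a minimal sketch of that pre-processing idea, assuming the faster-whisper backend; the parameter names mirror faster_whisper.vad.VadOptions, and the input file name is just a placeholder:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2")

# Short min speech/silence durations plus a long speech pad force
# shorter segments by splitting on brief pauses and padding around them.
segments, info = model.transcribe(
    "song.mp3",  # placeholder input
    vad_filter=True,
    vad_parameters=dict(
        min_speech_duration_ms=100,  # Min Speech Duration (ms): short
        min_silence_duration_ms=50,  # Min Silence Duration (ms): short
        speech_pad_ms=800,           # Speech Pad (ms): long
    ),
)

for segment in segments:
    print(f"[{segment.start:.2f} --> {segment.end:.2f}] {segment.text}")
```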

Another post-processing approach would be to simply force the number of words on each line when writing the subtitles, as suggested in openai/whisper#223 (comment); a sketch follows below.
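
A rough sketch of that post-processing idea, assuming word-level timestamps from faster-whisper (word_timestamps=True); MAX_WORDS_PER_LINE and the file names are illustrative:

```python
from faster_whisper import WhisperModel

MAX_WORDS_PER_LINE = 8  # illustrative setting

def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:00:08,316."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

model = WhisperModel("large-v2")
segments, _ = model.transcribe("song.mp3", word_timestamps=True)

# Flatten all words, then regroup them into fixed-size lines,
# taking each line's start/end from its first/last word.
words = [w for segment in segments for w in segment.words]
entries = []
for i in range(0, len(words), MAX_WORDS_PER_LINE):
    chunk = words[i:i + MAX_WORDS_PER_LINE]
    text = "".join(w.word for w in chunk).strip()
    entries.append((chunk[0].start, chunk[-1].end, text))

with open("output.srt", "w", encoding="utf-8") as f:
    for n, (start, end, text) in enumerate(entries, start=1):
        f.write(f"{n}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n\n")
```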

+) As far as I know, large-v3 is better for accurate timestamps. But large-v3 is only good with clean audio; if the audio is noisy, it often causes hallucinations.

jhj0517 (Owner) commented Nov 15, 2024
