
Error: Token '4' not found in text #70

Open
IMBAepsilon opened this issue Sep 14, 2024 · 1 comment

Comments

@IMBAepsilon

When I run:

echogarden align-transcript-and-translation 01.mp3 01.txt 01_translate.txt 01.json 01.srt

I get:

Echogarden v1.5.0

Start stage 1: Align speech to transcript
Transcode with command-line ffmpeg.. 1102.4ms
Convert wave buffer to raw audio.. 384.1ms
Resample audio to 16kHz mono.. 962.1ms
Crop using voice activity detection.. 1263.1ms
Normalize and trim audio.. 181.2ms
No language specified. Detect language using reference text.. 84.4ms
Language detected: Japanese (ja)
Load alignment module.. 0.2ms
Synthesize alignment reference with eSpeak.. 5911.2ms

Starting alignment pass 1/1: granularity: low, max window duration: 189s
Compute reference MFCC features.. 1069.2ms
Compute source MFCC features.. 721.3ms
DTW cost matrix memory size: 685.4MB
Align reference and source MFCC features using DTW.. 2345.1ms

Convert path to timeline.. 20.7ms
Postprocess timeline.. 54.9ms
Total alignment time: 14195.5ms

Start stage 2: Align timeline to translated transcript
No source language specified. Detect source language.. 0.9ms
Source language detected: Japanese (ja)
No target language specified. Detect target language.. 0.6ms
Target language detected: Chinese (zh)
Load e5 module
Prepare text for semantic alignment.. 331.4ms
Initialize E5 embedding model.. 1184.6ms
Extract embeddings from source 1.. Error: Token '4' not found in text
@rotemdan (Member) commented Oct 4, 2024

Thanks for the report.

align-transcript-and-translation is a complex operation that combines alignment engines with a specialized word embedding model.

Due to how the text is tokenized before being passed to the embedding model, there are likely edge cases where tokenization followed by de-tokenization fails to match the original text.
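As an illustration (this is a hedged sketch, not Echogarden's actual code, and `map_tokens_to_offsets` is a hypothetical helper): when subword tokens are mapped back to character positions in the original text, the lookup can fail if the tokenizer normalized a character, for example a full-width digit `４` (U+FF14) being emitted as the ASCII `4`, which is common in Japanese text:

```python
# Hedged sketch: mapping tokens back to character offsets in the
# original text, scanning left to right. Fails when a token string
# no longer appears verbatim in the text.

def map_tokens_to_offsets(text, tokens):
    """Return the character offset of each token in `text`."""
    offsets = []
    cursor = 0
    for token in tokens:
        index = text.find(token, cursor)
        if index == -1:
            raise ValueError(f"Token '{token}' not found in text")
        offsets.append(index)
        cursor = index + len(token)
    return offsets

# Works when tokens match the text exactly:
map_tokens_to_offsets("track 4 start", ["track", "4", "start"])

# Fails when the text contains a full-width '４' but the tokenizer
# emitted the normalized ASCII '4':
try:
    map_tokens_to_offsets("track ４ start", ["track", "4", "start"])
except ValueError as e:
    print(e)  # Token '4' not found in text
```

If something like this is the cause, a full-width digit (or other NFKC-normalized character) in the transcript or translation would trigger exactly the reported message.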

I'll need the exact inputs used so I can reproduce the error and determine how to fix it.
