Fine-tuned Whisper model for the Levantine Dialect (Israeli-Arabic)
This repository contains a fine-tuned version of the Whisper medium model, optimized for transcribing Levantine Arabic with a focus on the Israeli dialect. The model aims to improve automatic speech recognition (ASR) performance for this variant of Arabic.
- Base Model: Whisper Medium
- Fine-tuned for: Levantine Arabic (Israeli Dialect)
- Performance: 10% word error rate (WER) on the test set
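For reference, the figure above is the standard word error rate. The snippet below is a minimal sketch of how such a score can be computed with the Hugging Face `evaluate` package, given paired reference and predicted transcripts (the example strings are illustrative placeholders, not the actual test set):

```python
# Minimal sketch: computing WER with the `evaluate` package.
# The reference/prediction strings are illustrative placeholders.
import evaluate

wer_metric = evaluate.load("wer")
references = ["reference transcript one", "reference transcript two"]   # ground-truth test transcripts
predictions = ["predicted transcript one", "predicted transcript two"]  # model outputs for the same audio
print(wer_metric.compute(predictions=predictions, references=references))
```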
This dataset contains transcribed audio of spoken Levantine Arabic, with a focus on the Israeli dialect, and is designed to support research and development in speech recognition, linguistic analysis, and natural language processing for this variety. All recordings are human-transcribed and annotated, making the dataset a valuable resource for training and evaluating speech recognition models and for linguistic studies.
The dataset consists of three main components:
- Self-maintained Collection: 2,000 hours of audio data, collected and maintained by our team. This forms the core of the dataset and represents a wide range of Israeli Levantine Arabic speech.
- Multi-Genre Broadcast (MGB-2), filtered: 200 hours of audio data sourced from the MGB-2 corpus, which includes broadcast news and other media content in Arabic.
- CommonVoice18, filtered: an additional portion of data from the CommonVoice18 dataset.
Both the MGB-2 and CommonVoice18 subsets were filtered using AlcLaM (Arabic Language Model) to ensure their relevance to Levantine Arabic.
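The exact filtering procedure is not described here; as a rough illustration of the idea, the sketch below scores each transcript with a dialect classifier and keeps only high-confidence Levantine segments. The checkpoint name `levantine-dialect-classifier`, the `LEVANTINE` label, and the 0.8 threshold are hypothetical placeholders, not the actual AlcLaM-based setup:

```python
# Hypothetical sketch of dialect-based filtering; the model id, label, and threshold
# are placeholders and do not reflect the actual AlcLaM pipeline used for this dataset.
from transformers import pipeline

dialect_clf = pipeline("text-classification", model="levantine-dialect-classifier")  # placeholder id

def is_levantine(transcript: str, threshold: float = 0.8) -> bool:
    """Keep a segment only if the classifier tags it as Levantine with high confidence."""
    pred = dialect_clf(transcript)[0]
    return pred["label"] == "LEVANTINE" and pred["score"] >= threshold

candidate_transcripts = ["transcript one", "transcript two"]  # e.g. MGB-2 / CommonVoice18 segments
levantine_only = [t for t in candidate_transcripts if is_levantine(t)]
```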
- Total Duration: Approximately 2,200 hours of transcribed audio
- Dialects: Primarily Israeli Levantine Arabic, with some general Levantine Arabic content
- Annotation: Human-transcribed and annotated for high accuracy
- Diverse Sources: Includes self-collected data, broadcast media, and crowd-sourced content
- Sampling Rate: 16kHz
The model was trained on 16kHz audio, so make sure your audio files are also sampled at 16kHz for optimal performance. You can download the model from here.
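If your recordings use a different sample rate, they can be resampled to 16kHz beforehand; the snippet below is a minimal sketch using torchaudio (the file names are placeholders):

```python
# Minimal sketch: resample a WAV file to the 16kHz rate the model expects.
# "input.wav" / "input_16k.wav" are placeholder file names.
import torchaudio
import torchaudio.functional as F

waveform, sample_rate = torchaudio.load("input.wav")
if sample_rate != 16000:
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=16000)
torchaudio.save("input_16k.wav", waveform, 16000)
```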
In the main function, change the current working directory to the location of the model, then launch the app from the terminal: streamlit run app.py
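The app itself is not reproduced on this page; the sketch below is only a rough idea of what a minimal Streamlit front end for the model could look like, assuming the checkpoint can be loaded through the Transformers ASR pipeline ('your model path' is the same placeholder used in the example code below, and decoding uploaded bytes requires ffmpeg):

```python
# Hypothetical minimal Streamlit front end; the real app.py may differ.
import streamlit as st
from transformers import pipeline

@st.cache_resource  # load the model once per session
def load_asr(checkpoint_path="your model path"):  # placeholder path, as in the example below
    return pipeline(
        "automatic-speech-recognition",
        model=checkpoint_path,
        generate_kwargs={"language": "arabic", "task": "transcribe"},
    )

st.title("Levantine Arabic (Israeli dialect) transcription")
uploaded = st.file_uploader("Upload a 16kHz WAV file", type=["wav"])
if uploaded is not None:
    st.audio(uploaded)
    st.write(load_asr()(uploaded.read())["text"])  # ffmpeg is needed to decode raw bytes
```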
```python
# Example code for using the model
import glob

import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor, WhisperTokenizer


def transcribe_audio(files_dir_path):
    """
    Transcribe all WAV files in a directory using the fine-tuned Whisper model.

    Args:
        files_dir_path (str): Path to the directory containing 16kHz .wav files.
    """
    for file_path in glob.glob(files_dir_path + '/*.wav'):
        # Load the audio; files are expected to be sampled at 16kHz (see note above)
        audio_input, samplerate = torchaudio.load(file_path)

        # Convert the waveform into log-Mel input features for Whisper
        inputs = processor(audio_input.squeeze(), return_tensors="pt", sampling_rate=samplerate)

        # Generate token ids and decode them into text
        with torch.no_grad():
            predicted_ids = model.generate(inputs["input_features"].to(device))
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
        print(transcription[0])


if __name__ == '__main__':
    wav_dir_path = '/path to wav files'
    checkpoint_path = 'your model path'
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Initialize the tokenizer, processor, and model from the downloaded checkpoint
    tokenizer = WhisperTokenizer.from_pretrained(f'{checkpoint_path}/tokenizer', language="Arabic", task="transcribe")
    processor = WhisperProcessor.from_pretrained(f'{checkpoint_path}/processor', language="Arabic", task="transcribe")
    model = WhisperForConditionalGeneration.from_pretrained(checkpoint_path).to(device)

    # Force Arabic transcription (rather than translation or language detection)
    model.generation_config.language = "arabic"
    model.generation_config.task = "transcribe"
    model.eval()

    transcribe_audio(wav_dir_path)
```
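A note on the structure of the example: `processor` and `model` are created at module scope inside the `__main__` block, so `transcribe_audio` can reach them as globals; for anything beyond a quick script you may prefer to pass them to the function explicitly. The `WhisperTokenizer` is loaded for completeness, but decoding goes through `processor.batch_decode`, which uses the processor's own tokenizer.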