Skip to content

Fine-tuned Whisper model for the Levantine Dialect (Israeli-Arabic)

Notifications You must be signed in to change notification settings

NNLP-IL/WhisperLevantineArabic

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 

Repository files navigation

WhisperLevantineArabic

Fine-tuned Whisper model for the Levantine Dialect (Israeli-Arabic)

Description

This repository contains a fine-tuned version of the Whisper medium model, specifically optimized for transcribing Levantine Arabic with a focus on the Israeli dialect. This model aims to improve automatic speech recognition (ASR) performance for this specific variant of Arabic.

Model Details

  • Base Model: Whisper Medium
  • Fine-tuned for: Levantine Arabic (Israeli Dialect)
  • Performance Metrics: 10% WER on test-set

Dataset Description

This dataset contains transcribed audio data of spoken Levantine Arabic, with a focus on the Israeli dialect. It is designed to support research and development in speech recognition, linguistic analysis, and natural language processing for Levantine Arabic. The dataset comprises human-transcribed and annotated audio recordings, making it a valuable resource for training and evaluating speech recognition models and conducting linguistic studies.

Dataset Composition

The dataset consists of three main components:

  1. Self-maintained Collection: 2,000 hours of audio data, collected and maintained by our team. This forms the core of the dataset and represents a wide range of Israeli Levantine Arabic speech.

  2. Multi-Genre Broadcast (MGB-2)-Filtered: 200 hours of audio data sourced from the MGB-2 corpus, which includes broadcast news and other media content in Arabic.

  3. CommonVoice18 (Filtered): An additional portion of data from the CommonVoice18 dataset.

    Both MGB-2 and commonvoice18 filtered using AlcLaM (Arabic Language Model), ensuring relevance to Levantine Arabic.
  • Total Duration: Approximately 2,200 hours of transcribed audio
  • Dialects: Primarily Israeli Levantine Arabic, with some general Levantine Arabic content
  • Annotation: Human-transcribed and annotated for high accuracy
  • Diverse Sources: Includes self-collected data, broadcast media, and crowd-sourced content
  • Sampling Rate: 16kHz

Usage

The model was trained using a 16kHz sample rate, so ensure your audio files are also at 16kHz for optimal performance. You can get download the model from here

Run app.py to upload audio files or use the microphone

In the main function, change the current working directory to where the model is located. From the terminal: streamlit app.py

Run inference

# Example code for using the model
import glob
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from transformers import WhisperTokenizer
import torchaudio


def transcribe_audio(files_dir_path):
    """
    Transcribe an audio file using the Whisper model.
    Args:
        files_dir_path (str): The path to the audio files directory.
    """
    for file_path in glob.glob(files_dir_path + '/*.wav'):
        audio_input, samplerate = torchaudio.load(file_path)
        inputs = processor(audio_input.squeeze(), return_tensors="pt", sampling_rate=samplerate)
        with torch.no_grad():
            predicted_ids = model.generate(inputs["input_features"].to(device))
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
        print(transcription[0])


if __name__ == '__main__':
    wav_dir_path = '/path to wav files'
    checkpoint_path = 'your model path'
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Initialize model and processor
    tokenizer = WhisperTokenizer.from_pretrained(f'{checkpoint_path}/tokenizer', language="Arabic", task="transcribe")
    processor = WhisperProcessor.from_pretrained(f'{checkpoint_path}/processor', language="Arabic", task="transcribe")
    model = WhisperForConditionalGeneration.from_pretrained(checkpoint_path).to(device)
    model.generation_config.language = "arabic"
    model.generation_config.task = "transcribe"
    model.eval()

    transcribe_audio(wav_dir_path)



About

Fine-tuned Whisper model for the Levantine Dialect (Israeli-Arabic)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%