A comprehensive list of Hebrew NLP resources.

AdamKaabyia/Resources

 
 

Hebrew NLP Resources

This repository collects resources for NLP in Hebrew, as part of the NLPH project, which you can read more about here. Resources are divided into folders by type. If you have a resource you can contribute, to be released under an open license, please submit a pull request, or contact us at [email protected]. See here for a list of companies operating in the field.

This specific document is meant to be a list of Hebrew NLP resources, both for general use and to be used as reference when discussing what existing tools can be opened, adapted or integrated to help create a good open source foundation for NLP in Hebrew, as part of the NLPH Project.

When contributing to the list, please add a link to the license for all non-paper resources, e.g. {AGPL-3.0}, {?} for an unknown license or {X} for unreleased/closed/copyrighted resources. For code resources, please also add the main language in which the tool is written, e.g. [Python] or [?] for an unknown programming language. Please add hosting mirrors with pointy brackets, e.g. <Zenodo mirror>.
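The tagging convention above is regular enough to check mechanically. The following is a minimal sketch, not part of the NLPH tooling, of how the tags of a single entry could be extracted; the function name `parse_entry_tags` and the returned dictionary layout are assumptions made for illustration.

```python
import re

# Tag conventions described above: {license}, [language], <mirror>.
LICENSE_RE = re.compile(r"\{([^{}]*)\}")
LANGUAGE_RE = re.compile(r"\[([^\[\]]*)\]")
MIRROR_RE = re.compile(r"<([^<>]*)>")


def parse_entry_tags(entry: str) -> dict:
    """Extract license, language(s), and mirrors from one list entry.

    Hypothetical helper: only the first {...} and the first [...] are
    treated as tags, so brackets later in a description are ignored.
    """
    license_match = LICENSE_RE.search(entry)
    language_match = LANGUAGE_RE.search(entry)
    languages = []
    if language_match:
        # "[Python, JavaScript]" lists several languages in one tag.
        languages = [part.strip() for part in language_match.group(1).split(",")]
    return {
        "license": license_match.group(1).strip() if license_match else None,
        "languages": languages,
        "mirrors": [mirror.strip() for mirror in MIRROR_RE.findall(entry)],
    }
```

An entry with no {license} tag yields `"license": None`, which makes it easy to spot entries that still need one.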

1.1.1 Unannotated Corpora

  • Hebrew Wikipedia dumps {CC-BY-SA 3.0} - Wikipedia, the free encyclopedia, publishes dumps of its content as XML files on a monthly basis.
  • Wikipedia Corpora used for AlephBERT - The text of all of Hebrew Wikipedia was also extracted to pre-train OnlpLab's AlephBERT, using Attardi's Wikiextractor.
  • OSCAR {CC BY 4.0} - OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
  • Project Ben Yehuda public dumps {Public Domain} - A repository containing dumps of thousands of public domain works in Hebrew, from Project Ben-Yehuda, in plaintext UTF-8 files, with and without diacritics (nikkud), and in HTML files.
  • CC100 {MIT} - This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages, including Hebrew, and was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing the January-December 2018 Common Crawl snapshots.
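Some of the corpora above, such as the Project Ben Yehuda dumps, come both with and without diacritics (nikkud). When only pointed text is at hand, the unpointed form can be derived from it. The following is a minimal sketch using only the Python standard library; the function name `strip_nikkud` is an assumption for illustration, and the sketch also removes cantillation marks, which may or may not be desired.

```python
import unicodedata


def strip_nikkud(text: str) -> str:
    """Remove Hebrew points and cantillation marks from text.

    Hypothetical helper: drops nonspacing marks (category Mn) in the
    range U+0591..U+05C7. Punctuation in that range, such as the maqaf
    (U+05BE), is kept because it is not a nonspacing mark.
    """
    # NFD splits precomposed presentation forms into letter + mark.
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch
        for ch in decomposed
        if not ("\u0591" <= ch <= "\u05c7" and unicodedata.category(ch) == "Mn")
    ]
    # NFC recomposes any non-Hebrew characters split by the NFD step.
    return unicodedata.normalize("NFC", "".join(kept))
```

For example, the pointed word "שָׁלוֹם" becomes "שלום".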

1.1.2 Annotated Datasets

1.1.2.1 Annotated by Parts of Speech, Morphological Features, and/or Syntactic Dependencies

  • Knesset 2004-2005 {Public Domain} - A corpus of transcriptions of Knesset (Israeli parliament) meetings between January 2004 and November 2005. Includes tokenized and morphologically tagged versions of most of the documents in the corpus. <MILA> <Zenodo>
  • The GOV.il Corpus {CC-BY-SA 3.0} - The Hebrew Language Corpus: a repository of tagged language, part of the Hebrew Language Corpus project of the Government ICT Authority. The tagging is performed by the Academy of the Hebrew Language. This first release includes 600 tagged sentences.

1.1.2.2 Annotated by Entities

  • NEMO {?} - Named Entity (NER) annotations of the Hebrew Treebank (Haaretz newspaper) corpus, including: morpheme and token level NER labels, nested mentions, and more. The following entity types are tagged: Person, Organization, Geo-Political Entity, Location, Facility, Work-of-Art, Event, Product, Language.
  • MDTEL {?} - A dataset of posts from www.camoni.co.il, tagged with medical entities from the UMLS, and code that recognizes medical entities in Hebrew text.
  • Ben-Mordecai and Elhadad's Corpus {?} - Newspaper articles in different fields: news, economy, fashion and gossip. The following entity types are tagged: entity names (person, location, organization), temporal expressions (date, time) and number expressions (percent, money).

1.1.2.3 Question Answering Datasets

  • ParaShoot {?} - A Hebrew question answering dataset in the style of SQuAD, created by Omri Keren and Omer Levy. ParaShoot is based on articles scraped from Wikipedia. The dataset contains 3K crowdsource-annotated pairs of questions and answers, in a setting suitable for few-shot learning.
  • tdklab {?} - The SQuAD dataset translated from English to Hebrew using the Google translation API. The translation process included fixing and removing bad translations.
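When working with SQuAD-style data, a common sanity check is that every answer text actually appears in its context at the recorded offset. The following is a minimal sketch of such a check. It assumes the SQuAD v1.1 field names (`text`, `answer_start`); whether the datasets above use exactly these names is an assumption, and the sample record is invented for illustration.

```python
def answer_spans_match(context: str, answers: list) -> bool:
    """Return True if every answer text is found at its stated offset.

    Assumes SQuAD v1.1-style answers: dicts with "text" and
    "answer_start" (a character offset into the context).
    """
    for answer in answers:
        start = answer["answer_start"]
        end = start + len(answer["text"])
        if context[start:end] != answer["text"]:
            return False
    return True


# Invented sample record, for illustration only.
sample = {
    "context": "ירושלים היא בירת ישראל.",
    "question": "מהי בירת ישראל?",
    "answers": [{"text": "ירושלים", "answer_start": 0}],
}
is_valid = answer_spans_match(sample["context"], sample["answers"])
```

Records that fail this check are candidates for repair or removal, since machine translation can shift or alter the answer span.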

1.1.2.4 Sentiment

  • Hebrew-Sentiment-Data Amram et al. {?} - A sentiment analysis benchmark (positive, negative and neutral sentiment) for Hebrew, based on 12K social media comments, containing two instances of input items: token-based and morpheme-based. Also: a cleaned version of the Hebrew Sentiment dataset, from which a train-test data leakage was removed.
  • Emotion User Generated Content (UGC) {MIT} - Collected for the HeBERT model, this dataset includes comments posted on news articles from 3 major Israeli news sites between January 2020 and August 2020. The total size of the data is ~150 MB, including over 7 million words and 350K sentences. ~2000 sentences were annotated by crowd members (3-10 annotators per sentence) for overall sentiment (polarity) and eight emotions: anger, disgust, expectation, fear, happy, sadness, surprise and trust.

1.1.2.5 Recorded Spoken Hebrew

  • CoSIH - The Corpus of Spoken Hebrew {?} - The Corpus of Spoken Israeli Hebrew (CoSIH) is a database of recordings of spoken Israeli Hebrew.
  • MaTaCOp {?} - A corpus of Hebrew dialogues within the Map Task framework (allowed for non-commercial research and teaching purposes only).

1.1.2.6 Other

  • The MILA corpora collection {GPLv3} - The MILA center has 20 different corpora available for free for non-commercial use. All are available in plain text format, and most have tokenized, morphologically-analyzed, and morphologically-disambiguated versions available too.
  • JPress {Custom Terms of Use} - The National Library offers a collection of Jewish newspapers published in various countries, languages, and time periods, including digital versions and full-text search. The texts are published under a custom Terms of Use document that prohibits commercial use, and additionally requires checking the copyright status and receiving permission from the copyright-holder of the work for any use requiring such permission according to the Copyright Law.
  • DICTA {?} - Analytical tools for Jewish texts. They also have a GitHub organization.
  • Sefaria {Various} - A Living Library of Jewish Texts. 3,000 years of Jewish texts in Hebrew and English translation.
  • HaArchion {?} - Recordings of various Hebrew prose and poetry being read.
  • ThinkIL {CC-BY-SA 3.0} - An archive of the writings of Zvi Yanai.
  • The BGU morphological lexicon {?} - Is it released?
  • The morphological lexicon of the Israeli National Institute for Testing and Evaluation - Unreleased.
  • The MILA lexicon of Hebrew words {GPLv3} - The lexicon was designed mainly for usage by morphological analyzers, but is being constantly extended to facilitate other applications as well. The lexicon contains about 25,000 lexicon items and is extended regularly. Free for non-commercial use.
  • Hebrew WordNet {GPLv3} - Hebrew WordNet uses the MultiWordNet methodology and is aligned with the one developed at IRST (and therefore is aligned with English, Italian and Spanish). Free for non-commercial use.
  • MILA's Verb Complements Lexicon {GPLv3} - NLPH backup here.
  • Hebrew Psychological Lexicons {CC-BY-SA} - Natalie Shapira's large collection of Hebrew psychological lexicons and word lists. Useful for various psychology applications such as detecting emotional state, well being, relationship quality in conversation, identifying topics (e.g., family, work) and many more.
  • The Hebrew Treebank {GPLv3} - The Hebrew Treebank Version 2.0 contains 6500 hand-annotated sentences of news items from the MILA HaAretz Corpus, with full word segmentation and morpho-syntactic analysis. Morphological features that are not directly relevant for syntactic structures, like roots, templates and patterns, are not analyzed. This resource can be used freely for research purposes only.
  • UD Hebrew Treebank {CC BY-NC-SA 4.0} - The Hebrew Universal Dependencies Treebank.
  • Modern Hebrew Dependency Treebank v.1 {GPLv3} - This is the Modern Hebrew Dependency Treebank which was created and used in Yoav Goldberg's PhD thesis.

Also see here: https://github.com/iddoberger/awesome-hebrew-nlp

  • Neural Sentiment Analyzer for Modern Hebrew [?] {MIT} - This code and dataset provide an established benchmark for neural sentiment analysis for Modern Hebrew.
  • Universal Language Model Fine-tuning for Text Classification (ULMFiT) in Hebrew - The weights (i.e. a trained model) for a Hebrew version of Howard and Ruder's ULMFiT model. Trained on the Hebrew Wikipedia corpus.
  • BERT's multilingual model - Trained (also) on Hebrew.
  • MDTEL {?} - Yonatan Bitton's code that recognizes medical entities in Hebrew text.
  • HebSpacy {MIT} - A custom spaCy pipeline for Hebrew text including a transformer-based multitask NER model that recognizes 16 entity types in Hebrew, including GPE, PER, LOC and ORG.
  • HeBERT {MIT} - HeBERT is a Hebrew pretrained language model for polarity analysis and emotion recognition, published by Dr. Inbal Yahav Shenberger and Avichay Chriqui. It is based on Google's BERT architecture and uses the BERT-Base configuration. HeBERT was trained on three datasets: OSCAR, a Hebrew dump of Wikipedia, and the Emotion User Generated Content (UGC) data that was collected for the purpose of this study. The model was evaluated on two downstream tasks: emotion recognition and sentiment analysis. GitHub: https://github.com/avichaychriqui/HeBERT
  • AlephBERT {?} - A large, publicly available pre-trained language model for Modern Hebrew, published by the OnlpLab team. It was pre-trained on OSCAR, texts of Hebrew tweets, and all of Hebrew Wikipedia. The model obtains state-of-the-art results on the tasks of segmentation, part-of-speech tagging, named entity recognition, and sentiment analysis. GitHub: https://github.com/OnlpLab/AlephBERT
  • Verb Inflector [Java] {Apache License 2.0} - A generation mechanism, created as part of Eran Tomer's ([email protected]) Master's thesis, which produces vocalized and morphologically tagged Hebrew verbs given a non-vocalized verb in base form and an indication of which pattern the verb follows.
  • HebMorph [Lucene] {AGPL-3.0} - An open-source effort to make Hebrew properly searchable by various IR software libraries. Includes Hebrew Analyzer for Lucene.
  • Hebrew OCR with Nikud [Python] {?} - A program to convert Hebrew text files (without Nikud) to text files with the correct Nikud. Developed by Adi Oz and Vered Shani.
  • Text-Fabric [Python] {CC BY-NC 4.0} - A Python package for browsing and processing ancient corpora, focused on the Hebrew Bible Database.
  • Nakdan - Automatic Nikud for Hebrew texts.
  • The Automatic Hebrew Transcriber - Automatically transcribes text from Hebrew audio and video files.
  • word2word {Apache License 2.0} - Easy-to-use word-to-word translations for 3,564 language pairs. Hebrew is one of the 62 supported languages, and thus word-to-word translation to/from Hebrew is supported for 61 languages.
  • Eyfo - A commercial engine for search and entity tagging in Hebrew.
  • Melingo's ICA (Intelligent Content Analysis) - A text analysis and textual categorized entity extraction API for Hebrew, Arabic and Farsi texts.
  • Genius - Automatic analysis of free text in Hebrew.
  • AlmaReader - Online text-to-speech service for Hebrew.
  • LightTag [?] {not open source} - A tool for managing annotation projects. Handles right-to-left and part-of-word marking. Tutorial video here.
  • Recogito [Scala, JavaScript, HTML] {Apache License 2.0} - A tool for linked data annotation.
  • CATMA [HTML, Java] {unclear} - A web-based tool for research and collaboration over text data. Handles right-to-left and part-of-word marking.
  • WebAnno [Java] {Apache License 2.0} - A web-based annotation tool. Supports RTL and project management.
  • Arethusa: Annotation Environment [JavaScript] {MIT} - A backend-independent client-side annotation framework. Repository here.
  • rasa-nlu-trainer [JavaScript] {MIT} - A tool to edit training examples for rasa NLU. Handles right-to-left and part-of-word marking.
  • brat [Python, JavaScript] {MIT} - An online environment for collaborative text annotation. Does not support right-to-left. Repository here.
  • openNLP [Java] {Apache License 2.0} - OpenNLP has a tagging tool.
  • opeNER [Ruby, HTML, Java, Python] - opeNER has a tagging tool.
  • pybossa [Python] {AGPL-3.0} - A framework for crowdsourcing of data analysis and enrichment tasks. GitHub.
  • TextThrasher [JavaScript, Python] - A crowdsourced text annotator. Built with React and Redux (possibly also with pybossa).
  • SHEBANQ - System for HEBrew Text: ANnotations for Queries and Markup. SHEBANQ is an online environment for studying the Hebrew Bible.
  • doccano {MIT} - An open source text annotation tool for humans. It provides annotation features for text classification, sequence labeling and sequence-to-sequence tasks, so you can create labeled data for sentiment analysis, named entity recognition, text summarization and so on.

This list is meant to cover both researchers in the field of natural language processing, and in various related fields, including neurolinguistics and speech science. It also aims to cover researchers in both academia and industry.

  • Allen Institute for AI - Israel
    • Prof. Yoav Goldberg
    • Dr. Jonathan Berant

Researching natural language processing in the industry? Open a pull request and add yourself here now!
