HeQ - Hebrew Question Answering Dataset

Summary

HeQ is a question answering dataset in Modern Hebrew, consisting of 30,147 questions.

Introduction

The dataset follows the format and crowdsourcing methodology of SQuAD (Stanford Question Answering Dataset) and the original ParaShoot. A team of crowdworkers formulated and answered reading comprehension questions based on random paragraphs in Hebrew. The answer to each question is a segment of text (span) included in the relevant paragraph. The paragraphs are sourced from two different platforms: (1) Hebrew Wikipedia, and (2) Geektime, an online Israeli news channel specializing in technology.

Question Features

Two types of questions were collected:

Answerable questions (21,784): Questions for which the answer is present in the paragraph.
Unanswerable questions (8,363): Questions related to the paragraph's content, where the paragraph provides a plausible answer, but the answer is not explicitly included.
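Since the dataset follows the SQuAD format, the two question types can be told apart programmatically. The following is a minimal sketch, assuming a SQuAD 2.0-style JSON layout. The field names (`data`, `paragraphs`, `qas`, `is_impossible`, `answers`, `answer_start`) and the sample record are assumptions for illustration and have not been checked against the HeQ data files.

```python
from collections import Counter

# Hypothetical SQuAD 2.0-style record; field names are assumptions,
# not confirmed against the actual HeQ files.
sample = {
    "data": [
        {
            "title": "example article",
            "paragraphs": [
                {
                    "context": "HeQ contains 30,147 questions.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "How many questions does HeQ contain?",
                            "is_impossible": False,
                            "answers": [{"text": "30,147", "answer_start": 13}],
                        },
                        {
                            "id": "q2",
                            "question": "How many paragraphs does HeQ contain?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ]
}


def count_question_types(dataset):
    """Count answerable and unanswerable questions in a SQuAD-style dict."""
    counts = Counter()
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    counts["unanswerable"] += 1
                    continue
                # Every answer of an answerable question must be a span
                # of the paragraph at the recorded offset.
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if context[start:end] != answer["text"]:
                        raise ValueError(f"answer span mismatch in {qa['id']}")
                counts["answerable"] += 1
    return counts


counts = count_question_types(sample)
print(dict(counts))
```

To apply this to a real file, load it with `json.load` and pass the resulting dictionary to `count_question_types`, adjusting the field names if the actual files differ.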

Quality Labels

As part of ongoing quality control during the collection process, and additional checks on the test and validation sets, 28% of the final data was manually verified. Of the final data, 16% received one of the following labels:

Verified: Questions that passed the threshold and were relatively easy, with wording exactly or similar to the relevant sentence in the paragraph, or very common questions.
Good: Questions with wording that was significantly different (lexically or syntactically) from the wording of the relevant sentence in the paragraph.
Gold: Questions that involve inference-making.

The Geektime dataset also includes the following labels:

Deixis: Questions where the answer is an expression of time that depends entirely on the context (e.g., "last year," "tomorrow," etc.).
Second: Questions where the answer is written in the second person ("you").
Checked: Manually validated questions that did not receive a label based on their quality.

Additional Answers

After splitting the data, the test and validation subsets underwent additional processing. Where an answerable question had multiple correct answer spans, the additional possible answers were added in both subsets to make evaluation more robust. For example, if the answer appears in quotation marks, the same answer without the quotation marks is another possible answer. Similarly, an answer may be correct both with and without a preceding preposition or an apposition. Each answerable question in the test and validation sets received 0 to 6 additional possible answers.
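The effect of additional answers on scoring can be illustrated with a simplified exact-match check that accepts a prediction if it equals any of the gold answers. This is only a sketch: the function name and the example strings are hypothetical, and the scorer is deliberately simpler than the official SQuAD evaluation script, which also normalizes the text before comparing.

```python
def exact_match(prediction, gold_answers):
    """Return 1 if the prediction equals any gold answer, else 0.

    Simplified on purpose: only surrounding whitespace is ignored.
    """
    pred = prediction.strip()
    return int(any(pred == gold.strip() for gold in gold_answers))


# Hypothetical gold answers: the original span is quoted in the paragraph,
# and the additional answer is the same span without quotation marks.
original_only = ['"Geektime"']
with_additional = ['"Geektime"', "Geektime"]

score_without_extra = exact_match("Geektime", original_only)
score_with_extra = exact_match("Geektime", with_additional)
score_wrong = exact_match("Wikipedia", with_additional)

print(score_without_extra, score_with_extra, score_wrong)  # 0 1 0
```

With only the original span as gold, a correct unquoted prediction is scored as a miss. Once the unquoted variant is listed as an additional answer, the same prediction is scored as a match.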

Dataset Statistics

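The table below shows the number of questions in each source of HeQ, by question type and by split. In each column, the Answerable and Unanswerable counts sum to the Train count, so these two rows appear to describe the training split only: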

	Hebrew Wikipedia	Geektime
Answerable	9987	9667
Unanswerable	3533	3955
Train	13520	13622
Val	751	750
Test	754	750

The table below shows the number of unique questions, paragraphs, and articles in each source of HeQ:

	Hebrew Wikipedia	Geektime
Questions	15025	15122
Paragraphs	2006	2395
Articles	1481	2317
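The figures in the two tables can be cross-checked with simple arithmetic. The sketch below copies the counts from the tables (the dictionary layout and key names are chosen here for illustration) and verifies that the splits add up to the per-source question counts and to the 30,147 total given in the summary.

```python
# Counts copied from the tables above; the key names are illustrative.
stats = {
    "Hebrew Wikipedia": {
        "answerable": 9987, "unanswerable": 3533,
        "train": 13520, "val": 751, "test": 754,
        "questions": 15025,
    },
    "Geektime": {
        "answerable": 9667, "unanswerable": 3955,
        "train": 13622, "val": 750, "test": 750,
        "questions": 15122,
    },
}

for source, s in stats.items():
    # Train, validation and test together account for every question.
    assert s["train"] + s["val"] + s["test"] == s["questions"], source
    # The answerable/unanswerable rows add up to the training split.
    assert s["answerable"] + s["unanswerable"] == s["train"], source

total_questions = sum(s["questions"] for s in stats.values())
print(total_questions)  # 30147
```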

Code

The Quality_control_Hebrew_QA.ipynb file is a notebook used to extract specific data features that formed the basis for the validation process. The code for the web-based annotation interface used in this project can be found here.

Paper

https://u.cs.biu.ac.il/~yogo/heq.pdf

Model

https://huggingface.co/amirdnc/HeQ

Contributors

HeQ was annotated by Webiks for MAFAT, as part of the National Natural Language Processing Plan of Israel.

Contributors: Hilla Merhav Fine (Webiks), Roei Shlezinger (Webiks), Amir David Nissan Cohen (MAFAT)

Advisors: Reut Tsarfaty, Kfir Bar, Yoav Goldberg

Acknowledgments

We would like to express our gratitude to Omri Keren and Omer Levy, the creators of the original ParaShoot, the first question answering dataset in Hebrew, who also contributed their annotation interface to this project. We are also grateful to Geektime for allowing us to annotate their articles and share the data as a public dataset.
