HumSet is a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. HumSet is curated by humanitarian analysts and covers various disasters around the globe that occurred from 2018 to 2021 across 46 humanitarian response projects. The dataset consists of approximately 17K annotated documents in three languages (English, French, and Spanish), originally taken from publicly available sources. For each document, analysts identified informative snippets (entries) with respect to common humanitarian frameworks and assigned one or more classes to each entry. See our paper for details.
```
@inproceedings{fekih-etal-2022-humset,
    title = "{H}um{S}et: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crises Response",
    author = "Fekih, Selim and
      Tamagnone, Nicolo{'} and
      Minixhofer, Benjamin and
      Shrestha, Ranjan and
      Contla, Ximena and
      Oglethorpe, Ewan and
      Rekabsaz, Navid",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.321",
    pages = "4379--4389",
}
```
The main dataset is shared in CSV format (humset_data.csv), where each row is an entry with the following features:
entry_id | lead_id | project_id | sectors | pillars_1d | pillars_2d | subpillars_1d | subpillars_2d | lang | n_tokens | project_title | created_at | document | excerpt
- entry_id: unique identification number for a given entry. (int64)
- lead_id: unique identification number for the document to which the corresponding entry belongs. (int64)
- project_id: unique identification number for the project in which the corresponding entry was annotated. (int64)
- sectors, pillars_1d, pillars_2d, subpillars_1d, subpillars_2d: labels assigned to the corresponding entry. Since this is a multi-label dataset (an entry may carry several labels within the same category), each of these fields is reported as an array of strings. For a detailed description of these categories, see the paper. (list)
- lang: language. (str)
- n_tokens: number of tokens (tokenized with the NLTK v3.7 library). (int64)
- project_title: the name of the project where the corresponding annotation was created. (str)
- created_at: date and time of creation of the annotation in standard ISO 8601 format. (str)
- document: document URL source of the excerpt. (str)
- excerpt: excerpt text. (str)
Note:
- The subpillars_1d and subpillars_2d tags are reported as strings in the format {PILLAR}->{SUBPILLARS}, to make the hierarchical structure of the 1D and 2D categories explicit (see the loading sketch below for how to parse these fields).
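For convenience, here is a minimal loading sketch in Python. It assumes the list-valued label columns are serialized in the CSV with Python-style literal syntax (e.g. "['Health']"); the file path and the split_subpillar helper are illustrative, not part of the dataset's tooling.

```python
import ast

import pandas as pd

# Load the main dataset; label columns are stored as serialized lists.
df = pd.read_csv("humset_data.csv")

label_columns = ["sectors", "pillars_1d", "pillars_2d",
                 "subpillars_1d", "subpillars_2d"]
for col in label_columns:
    # Assumes lists are serialized as Python literals, e.g. "['Health']".
    df[col] = df[col].apply(ast.literal_eval)

def split_subpillar(tag):
    """Recover the hierarchy encoded in a {PILLAR}->{SUBPILLARS} tag."""
    pillar, _, subpillar = tag.partition("->")
    return pillar, subpillar

# Example: inspect the parsed 2D subpillar tags of the first entry.
print([split_subpillar(tag) for tag in df.loc[0, "subpillars_2d"]])
```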
In addition to the main dataset, the full texts of the documents (leads) are also provided (documents.tar.gz). Each text source is represented as a JSON-formatted file ({lead_id}.json) with the following structure:
```
[
  [
    "paragraph 1 - page 1",
    "paragraph 2 - page 1",
    ...
    "paragraph N - page 1"
  ],
  [
    "paragraph 1 - page 2",
    "paragraph 2 - page 2",
    ...
    "paragraph N - page 2"
  ],
  [
    ...
  ],
  ...
]
```
Each document is a list of lists of strings: each inner list corresponds to a page, and its elements are that page's paragraphs. This format was chosen because, as indicated in the paper, over 70% of the sources are PDFs, so it preserves the original textual subdivision. HTML web pages are reported as single-page documents.
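As an illustration, a short sketch for reading one extracted document; the lead_id value here is hypothetical, and the path assumes documents.tar.gz has been unpacked into the working directory:

```python
import json

lead_id = 12345  # hypothetical id, for illustration only

# Each file is a list of pages, and each page a list of paragraph strings.
with open(f"{lead_id}.json", encoding="utf-8") as f:
    pages = json.load(f)

print(f"{len(pages)} pages, {sum(len(p) for p in pages)} paragraphs")

# Flatten pages and paragraphs into the full document text.
full_text = "\n\n".join("\n".join(paragraphs) for paragraphs in pages)
```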
Additionally, a train/validation/test split of the dataset is shared. The repository contains the code used to process the full dataset into these splits; note that it has some random components, so re-running it may produce slightly different results.
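To illustrate the kind of randomness involved, here is a hedged sketch of a seeded, document-level split (grouping by lead_id so that entries from the same document stay in one split). The fractions, seed, and grouping choice are illustrative assumptions, not the procedure behind the released split:

```python
import numpy as np
import pandas as pd

def split_by_lead(df, seed=42, frac_train=0.8, frac_val=0.1):
    """Assign each lead (document) wholly to train, val, or test."""
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility
    lead_ids = df["lead_id"].unique()
    rng.shuffle(lead_ids)

    n_train = int(frac_train * len(lead_ids))
    n_val = int(frac_val * len(lead_ids))
    train_ids = set(lead_ids[:n_train])
    val_ids = set(lead_ids[n_train:n_train + n_val])

    split = df["lead_id"].map(
        lambda i: "train" if i in train_ids else "val" if i in val_ids else "test"
    )
    return df[split == "train"], df[split == "val"], df[split == "test"]
```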
To gain access to HumSet, please contact us at [email protected]
For any technical questions, please contact Selim Fekih or Nicolò Tamagnone.
For a detailed description of the terms and conditions, refer to the DEEP Terms of Use and Privacy Notice.