Monolingual Word Sense Alignment (MWSA) is the task of aligning word senses across resources in the same language. Because a word can be defined in different ways in different resources, the goal is to find out which senses are connected and how. This task was recently the focus of the 1st "Monolingual Word Sense Alignment" Shared Task.
The current repository contains a set of 17 datasets of manually annotated senses developed within the ELEXIS project. These datasets cover 15 languages and are based on expert-made dictionaries along with collaboratively curated ones, such as Wiktionary. The following table shows the statistics of the datasets: the number of senses per part of speech, with the number of words in the definitions given in parentheses.
| Language | Resource | Nouns | Verbs | Adjectives | Adverbs | Other | All |
|---|---|---|---|---|---|---|---|
| Basque (eu) | Basque Wordnet | 929 (6836) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 929 (6836) |
| | Euskal Hiztegia | 971 (7754) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 971 (7754) |
| Bulgarian (bg) | BTB-WN | 1394 (15649) | 175 (1698) | 305 (3187) | 50 (338) | 0 (0) | 1924 (20872) |
| | Bulgarian Wiktionary | 1273 (12883) | 164 (1107) | 194 (1418) | 39 (306) | 0 (0) | 1670 (15714) |
| Danish (da) | Ordbog over det danske Sprog | 2176 (282040) | 983 (119163) | 436 (60599) | 0 (0) | 0 (0) | 3595 (461802) |
| | Den Danske Ordbog | 1036 (12326) | 383 (4045) | 248 (2228) | 0 (0) | 0 (0) | 1667 (18599) |
| Dutch (nl) | Woordenboek der Nederlandsche Taal | 1459 (28979) | 405 (5185) | 527 (7878) | 106 (2662) | 0 (0) | 2497 (44704) |
| | Algemeen Nederlands Woordenboek | 497 (8443) | 140 (1542) | 109 (1393) | 13 (172) | 0 (0) | 759 (11550) |
| English (KD) (en) | Global | 92 (532) | 107 (617) | 80 (457) | 57 (257) | 61 (283) | 397 (2146) |
| | Password | 66 (536) | 72 (417) | 62 (324) | 33 (177) | 46 (188) | 279 (1642) |
| English (NUIG) (en) | Webster 1913 | 1131 (11606) | 741 (4622) | 373 (2585) | 45 (269) | 0 (0) | 2290 (19082) |
| | Princeton WordNet | 730 (12166) | 496 (6980) | 249 (2892) | 24 (207) | 0 (0) | 1499 (22245) |
| Estonian (et) | Dictionary of Estonian (EKS) | 543 (4012) | 273 (1598) | 151 (747) | 98 (451) | 78 (370) | 1143 (7178) |
| | Estonian Basic Dictionary (PSV) | 543 (4492) | 273 (1983) | 151 (1097) | 98 (596) | 79 (468) | 1144 (8636) |
| German (de) | German Wiktionary | 2026 (15160) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 2026 (15160) |
| | German OmegaWiki | 1266 (14354) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 1266 (14354) |
| Hungarian (hu) | Comprehensive | X | X | X | X | X | 1355 (14654) |
| | Explanatory | X | X | X | X | X | 1038 (10934) |
| Irish (ga) | An Foclóir Beag | 891 (8053) | 11 (95) | 55 (267) | 10 (56) | 36 (171) | 1003 (8642) |
| | Irish Wiktionary | 1209 (6696) | 8 (45) | 61 (181) | 10 (41) | 36 (109) | 1324 (7072) |
| Italian (it) | ItalWordNet | 408 (3128) | 352 (2411) | 0 (0) | 0 (0) | 0 (0) | 760 (5539) |
| | SIMPLE | 290 (1990) | 218 (1240) | 0 (0) | 0 (0) | 0 (0) | 508 (3230) |
| Serbian (sr) | Serbian WordNet | 691 (5864) | 985 (6522) | 92 (713) | 0 (0) | 0 (0) | 1768 (13099) |
| | Dictionary of Serbo-Croatian Literary Language | 289 (2360) | 281 (1527) | 29 (215) | 0 (0) | 0 (0) | 599 (4102) |
| Slovenian (JSI) (sl) | Slovene WordNet | 409 (1106) | 303 (901) | 237 (733) | 44 (133) | 0 (0) | 993 (2873) |
| | Slovene Lexical Database | 284 (2237) | 191 (1047) | 220 (1486) | 29 (102) | 0 (0) | 724 (4872) |
| Slovenian (ISJFR) (sl) | Standard Slovenian Dictionary (eSSKJ) | 229 (2060) | 109 (911) | 76 (620) | 0 (0) | 60 (588) | 474 (4179) |
| | Kostelski slovar | 151 (1050) | 61 (308) | 45 (257) | 0 (0) | 38 (263) | 295 (1878) |
| Spanish (es) | Diccionario de la lengua española | 617 (7986) | 225 (2426) | 305 (3269) | 26 (161) | 24 (250) | 1197 (14092) |
| | Spanish Wiktionary | 602 (6421) | 227 (2045) | 294 (2825) | 25 (129) | 22 (123) | 1170 (11543) |
| Portuguese (pt-pt) | Dicionário da Língua Portuguesa Contemporânea | 285 (4060) | 58 (686) | 110 (1287) | 9 (143) | 1 (9) | 463 (6185) |
| | Dicionário Aberto | 199 (1521) | 53 (203) | 67 (372) | 3 (15) | 1 (5) | 323 (2116) |
| Russian (ru) | Ozhegov-Shvedova | 258 (2038) | 109 (615) | 101 (533) | 15 (77) | 44 (368) | 527 (3631) |
| | Dictionary of the Russian Language (MAS) | 310 (2811) | 173 (1338) | 190 (1219) | 20 (114) | 71 (1010) | 764 (6492) |
This repository contains datasets in JSON, RDF and TSV. In the TSV format, each line corresponds to a sense pair, and the last column gives the type of semantic relationship. We have also included the semantic relationships induced by swapping the two senses of a pair, as follows:
especial	adjective	que se aplica exclusivamente a alguém ou a alguma coisa. ≈ exclusivo, particular, privado.	exclusivo.	narrower
especial	adjective	exclusivo.	que se aplica exclusivamente a alguém ou a alguma coisa. ≈ exclusivo, particular, privado.	broader
where the first row represents a `narrower` relation while the second one is `broader`, with the senses being swapped.
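The swap described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the datasets: the column order (lemma, part of speech, first definition, second definition, relation) is inferred from the two example rows, the names `FIELDS`, `INVERSE`, `read_pairs` and `induce_inverse` are hypothetical, and treating any label other than `narrower` and `broader` as unchanged under swapping is an assumption.

```python
import csv
import io

# Assumed column order, inferred from the example rows (not an official schema).
FIELDS = ("lemma", "pos", "definition_1", "definition_2", "relation")

# narrower and broader are inverses of each other; any other label is
# assumed to stay the same when the two senses are swapped.
INVERSE = {"narrower": "broader", "broader": "narrower"}


def read_pairs(tsv_text):
    """Parse TSV text into a list of dicts keyed by FIELDS."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    # Skip rows that do not have exactly the assumed number of columns.
    return [dict(zip(FIELDS, row)) for row in reader if len(row) == len(FIELDS)]


def induce_inverse(pair):
    """Return the pair with its two senses swapped and the relation inverted."""
    return {
        "lemma": pair["lemma"],
        "pos": pair["pos"],
        "definition_1": pair["definition_2"],
        "definition_2": pair["definition_1"],
        "relation": INVERSE.get(pair["relation"], pair["relation"]),
    }


sample = (
    "especial\tadjective\t"
    "que se aplica exclusivamente a alguém ou a alguma coisa. "
    "≈ exclusivo, particular, privado.\t"
    "exclusivo.\tnarrower\n"
)
pairs = read_pairs(sample)
inverse = induce_inverse(pairs[0])
print(inverse["definition_1"], "->", inverse["relation"])  # exclusivo. -> broader
```

Since the TSV files already include the induced rows, a sketch like this is mainly useful for checking that each `narrower` row has a matching `broader` row.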
`json-to-rdf.py` is a simple script that converts the JSON alignments into TSV and then RDF. This allows you to use the datasets with NAISC.
If you're using any part of these datasets, please don't forget to cite the following paper:
@inproceedings{ahmadi2020multilingual,
title={A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment},
author="Ahmadi, Sina and McCrae, John P. and Nimb, Sanni and Khan, Fahad and Monachini, Monica and Pedersen, Bolette S. and Declerck, Thierry and Wissik, Tanja and Bellandi, Andrea and Pisani, Irene and Troelsgård, Thomas and Olsen, Sussi and Krek, Simon and Lipp, Veronika and Váradi, Tamás and Simon, László and Győrffy, András and Tiberius, Carole and Schoonheim, Tanneke and Ben Moshe, Yifat and Rudich, Maya and Abu Ahmad, Raya and Lonke, Dorielle and Kovalenko, Kira and Langemets, Margit and Kallas, Jelena and Dereza, Oksana and Fransen, Theodorus and Cillessen, David and Lindemann, David and Alonso, Mikel and Salgado, Ana and Sancho, José Luis and Ureña-Ruiz, Rafael-J. and Simov, Kiril and Osenova, Petya and Kancheva, Zara and Radev, Ivaylo and Stanković, Ranka and Perdih, Andrej and Gabrovšek, Dejan",
booktitle="Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020)",
year={2020},
date="2020-05-11",
address= "Marseille, France"
}
This repository is licensed under the Apache License 2.0.