
WordDumb WrongTranslation Issue #134

Open
Dorisking opened this issue Jun 23, 2023 · 13 comments
Labels
help wanted Extra attention is needed

Comments

@Dorisking

Hi guys, I'm trying to use WordDumb to read Harry Potter, but I get a lot of wrong meanings for words. For example, it explains "drills" as a type of strong cotton cloth instead of a hand tool, power tool, or machine with a rotating cutting tip used for making holes. It sometimes chooses a very rare and useless meaning of a word.
How can I adjust the translation settings? Is it related to the dictionary on the Kindle?

@xxyzz
Owner

xxyzz commented Jun 24, 2023

You could click the "Other meanings" button to select the correct definition. This plugin only matches each word with one definition; you could also change the default meaning in the plugin's "Customize Kindle Word Wise" window.

@xxyzz
Owner

xxyzz commented Jun 30, 2023

Maybe I could train a machine learning model to match each word to its gloss, and also match person names or locations to Wikipedia summaries or Wikidata items.

@xxyzz xxyzz reopened this Jun 30, 2023

@Vuizur
Contributor

Vuizur commented Jul 9, 2023

> Maybe I could train a machine learning model to match each word to its gloss, and also match person names or locations to Wikipedia summaries or Wikidata items.

It is a super interesting question. I randomly stumbled upon this problem for my thesis and tried using llama.cpp with an instruction-fine-tuned language model based on Llama, such as Wizard-Vicuna-7B. I simply gave it the task in this format:

```
Sentence: <sentence>
Question: Which definition of <word> is correct here?
1. <definition>
2. <another definition>
Answer only with a number.
Answer: 
```
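A minimal sketch of how a prompt in that format could be run against a local GGUF model through the llama-cpp-python bindings (the model file name and sampling settings below are placeholders, not something tested in this thread):

```python
# Sketch: ask a local instruction-tuned model to pick a sense by number.
# Model path and parameters are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="wizard-vicuna-7b.Q4_K_M.gguf", n_ctx=2048)

def pick_sense(sentence: str, word: str, definitions: list[str]) -> int:
    numbered = "\n".join(f"{i + 1}. {d}" for i, d in enumerate(definitions))
    prompt = (
        f"Sentence: {sentence}\n"
        f"Question: Which definition of {word} is correct here?\n"
        f"{numbered}\n"
        "Answer only with a number.\n"
        "Answer: "
    )
    out = llm(prompt, max_tokens=4, temperature=0.0)
    answer = out["choices"][0]["text"].strip()
    digits = "".join(ch for ch in answer if ch.isdigit())
    return int(digits) - 1 if digits else 0  # 0-based index; fall back to the first sense
```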

I benchmarked it for Russian (copying a work-in-progress graphic):
[WSD_results: benchmark chart comparing the accuracy of the tested models]

Disclaimer: I benchmarked the association of words with etymologies, not with senses.

(The accuracy in reality is maybe 5 percent higher; the test data has a few mistakes.)
So WV7 (Wizard-Vicuna-7B) runs on PCs with 8 GB of RAM and Manticore 13B on 16 GB of RAM. And ChatGPT aced everything (except one example), but it might be a bit too expensive.

In English the results will surely be better. The runtime will probably suck, though; but if users are very patient it might be possible.

Of course, training one's own model, maybe on synthetic GPT-3.5/4 data, also looks pretty promising. But no idea.

This is maybe also interesting, but apparently it only works for English (I didn't test it): https://github.com/alvations/pywsd
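For reference, pywsd's Lesk-based functions are called roughly like this (a sketch based on its README; it disambiguates against English WordNet senses and needs NLTK's WordNet data downloaded first):

```python
# Sketch: pywsd's simple Lesk disambiguation against English WordNet.
# Requires NLTK's WordNet corpora to be available beforehand.
from pywsd.lesk import simple_lesk

sense = simple_lesk("He used a drill to make a hole in the wall.", "drill")
if sense is not None:
    print(sense.name(), "-", sense.definition())
```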

@xxyzz
Owner

xxyzz commented Jul 17, 2023

I think I'll need to take a deep learning course first...

Using an existing model is easier to start with, but the performance could be bad. Training a model might be unavoidable because the model needs to output customized data (a Kindle Word Wise database id or a Wiktionary gloss). For that same reason, pywsd might not be suitable, or maybe I could replace the default gloss data they're using.

The ultimate goal is to find (or build) a model or library that could take a chunk of text and magically mark the words in it with the correct gloss and Wikipedia summary (the output data should also include the token offset locations).
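As a rough illustration of the shape of that output (the names below are hypothetical, not an existing WordDumb API):

```python
# Hypothetical record such a model/library would return for each annotated span.
from dataclasses import dataclass

@dataclass
class Annotation:
    start: int                      # token offset into the source text
    end: int
    lemma: str
    gloss: str | None               # chosen Word Wise sense id or Wiktionary gloss
    wikipedia_summary: str | None   # for person or location names
```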

@Vuizur
Contributor

Vuizur commented Jul 20, 2023

> Using an existing model is easier to start with, but the performance could be bad. Training a model might be unavoidable because the model needs to output customized data (a Kindle Word Wise database id or a Wiktionary gloss). For that same reason, pywsd might not be suitable, or maybe I could replace the default gloss data they're using.

I think large language models such as Llama would work out of the box, but would be extremely slow. For WordDumb they would only be viable (but probably still a bit slow) if the user has a GPU with at least 8 GB of VRAM, which almost nobody has. Compared to English, Llama unfortunately has pretty mediocre multilingual skills.

pywsd uses old-school algorithms; if I understood it correctly, they could be applied to the Wiktionary data and might not even be too slow, but the accuracy will likely be garbage. (But I don't know a lot about this.)

> The ultimate goal is to find (or build) a model or library that could take a chunk of text and magically mark the words in it with the correct gloss and Wikipedia summary (the output data should also include the token offset locations).

True. I tried asking GPT-4 to add a short translation in [brackets] after each word of a specific text, and it did what I asked. But it was still a bit buggy and will probably hallucinate a lot and give wrong answers for more exotic languages or rarer words.

It might only be a matter of time before something like this gets more viable. 👍

@xxyzz
Owner

xxyzz commented Jul 20, 2023

Using a large language model for WSD may be a bit overkill IMO. I found this EWISER library: https://github.com/SapienzaNLP/ewiser, and they also have a spaCy plugin. Their paper is more recent and I'll see how I can integrate their work; looks like I have a lot to learn...

The EWISER paper's authors' university also created babelfy.org, which has almost all the features I need, but it has an API limit (1000 requests per day).
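For example, its REST endpoint can be queried roughly like this (the URL, parameter names, and response fields follow Babelfy's public docs as I recall them, the key is a placeholder, and every call counts against the 1000-requests-per-day limit):

```python
# Sketch: one request to Babelfy's disambiguation REST API.
import requests

resp = requests.get(
    "https://babelfy.io/v1/disambiguate",
    params={
        "text": "Mr. Dursley worked at a firm called Grunnings, which made drills.",
        "lang": "EN",
        "key": "YOUR_BABELNET_KEY",  # placeholder
    },
)
resp.raise_for_status()
for item in resp.json():
    frag = item["charFragment"]  # character offsets back into the text
    print(frag["start"], frag["end"], item["babelSynsetID"])
```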

@xxyzz xxyzz added the help wanted Extra attention is needed label Aug 3, 2023
@xxyzz
Owner

xxyzz commented Aug 28, 2023

I found the state-of-the-art WSD models here: https://paperswithcode.com/sota/word-sense-disambiguation-on-supervised, and the best model is ConSeC: https://paperswithcode.com/paper/consec-word-sense-disambiguation-as

But I have never trained a model before and don't have a GPU, so this will take some time...

@xxyzz
Owner

xxyzz commented Jul 9, 2024

I tried the LLaMA-3-Instruct-8B llamafile. I think the accuracy is good, but performance is ridiculously slow on CPU; I killed the process after waiting 4 hours. Maybe it's more usable with a powerful GPU?

Code pushed to the wsd branch: https://github.com/xxyzz/WordDumb/tree/wsd
