Train a gensim word2vec model on Wikipedia.
Most of it is taken from this blogpost and this discussion.
This repository was created mostly for trying out make; see The gist for the important stuff.
Note that performance depends heavily on corpus size and chosen parameters (especially for smaller corpora).
Examples and parameters below are cherry-picked.
Get the code for a language (see here). Run make with the code as the value for LANGUAGE (or change the Makefile). For instance, try Swahili (sw):

```shell
make LANGUAGE=sw
```
Or ignore make and execute the following bash commands for Swahili:

```shell
mkdir -p data/sw/
wget -P data/sw/ https://dumps.wikimedia.org/swwiki/latest/swwiki-latest-pages-articles.xml.bz2
```
Train a model in Python:

```python
import multiprocessing

from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec

# Passing an empty dictionary skips the (slow) vocabulary-building pass.
wiki = WikiCorpus('data/sw/swwiki-latest-pages-articles.xml.bz2',
                  lemmatize=False, dictionary={})
sentences = list(wiki.get_texts())

# gensim < 4.0 API: with gensim >= 4.0, rename 'size' to 'vector_size'
# and drop the removed 'lemmatize' argument.
params = {'size': 200, 'window': 10, 'min_count': 10,
          'workers': max(1, multiprocessing.cpu_count() - 1),
          'sample': 1e-3}
word2vec = Word2Vec(sentences, **params)
```
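The `window` parameter above bounds how far apart two words may sit and still form a training pair. A minimal sketch of that pairing (word2vec itself also subsamples frequent words and randomly shrinks the window; `skipgram_pairs` is a hypothetical helper for illustration, not a gensim API):

```python
def skipgram_pairs(sentence, window):
    """Yield (centre, context) pairs for every word within `window` positions."""
    pairs = []
    for i, centre in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((centre, sentence[j]))
    return pairs
```

With `window=1`, `['a', 'b', 'c']` yields the four adjacent pairs; a larger window multiplies the number of pairs and thus training cost.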
Try the old man:king :: woman:? analogy:

```python
female_king = word2vec.most_similar_cosmul(
    positive='mfalme mwanamke'.split(),  # king, woman
    negative='mtu'.split(),              # man
    topn=5)
for ii, (word, score) in enumerate(female_king):
    print("{}. {} ({:1.2f})".format(ii + 1, word, score))
```

```
1. malkia (0.97)
2. kambisi (0.93)
3. suleimani (0.93)
4. karolo (0.92)
5. koreshi (0.92)
```
Returning, respectively: queen (jackpot!), Cambyses II (a Persian king), Solomon (king of Israel), Karolo Mkuu? (Charlemagne?) and Cyrus (a Persian king).
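Under the hood, `most_similar_cosmul` ranks candidates with the 3CosMul objective of Levy & Goldberg (2014): the product of cosine similarities to the positive words divided by the product for the negative words, with each cosine shifted from [-1, 1] into [0, 1]. A pure-Python sketch of the score for one candidate:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosmul_score(candidate, positives, negatives, eps=1e-6):
    """3CosMul: product of shifted similarities to positives over negatives."""
    pos = 1.0
    for p in positives:
        pos *= (1 + cosine(candidate, p)) / 2  # shift [-1, 1] -> [0, 1]
    neg = 1.0
    for n in negatives:
        neg *= (1 + cosine(candidate, n)) / 2
    return pos / (neg + eps)  # eps avoids division by zero
```

The multiplicative combination keeps one very dissimilar negative word from dominating the score the way plain vector subtraction can.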
What doesn't match: car, train or breakfast?
```python
print(word2vec.doesnt_match('gari treni mlo'.split()))
```

```
mlo
```
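`doesnt_match` picks the word least similar to the mean of all the given word vectors. A simplified sketch (gensim also length-normalizes the vectors first; the toy vectors and the `odd_one_out` name below are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def odd_one_out(vectors):
    """Return the word whose vector is least similar to the mean vector."""
    dim = len(next(iter(vectors.values())))
    mean = [sum(v[i] for v in vectors.values()) / len(vectors) for i in range(dim)]
    return min(vectors, key=lambda w: cosine(vectors[w], mean))

# Toy 2-d vectors: car and train point one way, meal another.
toy = {'gari': [1.0, 0.0], 'treni': [0.9, 0.1], 'mlo': [0.0, 1.0]}
```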
- Python
- gensim (`pip install gensim`)