nyxjemk/wiki-word2vec

Wiki Word2vec

Train a gensim word2vec model on Wikipedia.

Most of it is taken from this blogpost and this discussion. This repository was created mostly to try out make; see The gist for the important stuff. Note that performance depends heavily on corpus size and the chosen parameters (especially for smaller corpora). The examples and parameters below are cherry-picked.

Usage

Get the code for a language (see here).

Run make with the code as the value for LANGUAGE (or change the Makefile). For instance, try Swahili (sw):

make LANGUAGE=sw

The gist

Ignore make and execute the following bash commands for Swahili:

mkdir -p data/sw/
wget -P data/sw/ https://dumps.wikimedia.org/swwiki/latest/swwiki-latest-pages-articles.xml.bz2

Train a model in Python:

import multiprocessing
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import LineSentence
from gensim.models.word2vec import Word2Vec

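# Note: written against an older gensim. In gensim 4.x, 'size' was renamed
# to 'vector_size' and WikiCorpus no longer accepts 'lemmatize'.

# Passing an empty dictionary skips building a vocabulary over the dump.
wiki = WikiCorpus('data/sw/swwiki-latest-pages-articles.xml.bz2',
                  lemmatize=False, dictionary={})
# Each article becomes one list of tokens; the whole corpus is held in memory.
sentences = list(wiki.get_texts())
# Leave one core free for the rest of the system.
params = {'size': 200, 'window': 10, 'min_count': 10,
          'workers': max(1, multiprocessing.cpu_count() - 1), 'sample': 1E-3}
word2vec = Word2Vec(sentences, **params)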
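
Calling list(wiki.get_texts()) keeps every tokenized article in memory, which is fine for Swahili but may not be for a larger Wikipedia. A minimal sketch of one alternative, assuming the articles are first written to a text file with one article per line: a restartable iterable that streams the file lazily. The names LineCorpus and write_corpus and the toy data are hypothetical; gensim's LineSentence class in gensim.models.word2vec serves a similar purpose.

```python
import os
import tempfile


class LineCorpus:
    """Restartable iterable: yields one list of tokens per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated
        # more than once (training makes several passes over the data).
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip empty lines
                    yield tokens


def write_corpus(texts, path):
    """Write an iterable of token lists to `path`, one article per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for tokens in texts:
            handle.write(" ".join(tokens) + "\n")


# Toy stand-in for wiki.get_texts() (hypothetical data, not from the dump).
toy_texts = [["mfalme", "na", "malkia"], ["gari", "na", "treni"]]
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
write_corpus(toy_texts, corpus_path)

corpus = LineCorpus(corpus_path)
first_pass = list(corpus)
second_pass = list(corpus)  # a plain generator would be exhausted here
```

With the real dump, the idea would be to call write_corpus(wiki.get_texts(), path) once and then pass LineCorpus(path) to Word2Vec in place of sentences. This has not been run against the actual Swahili dump.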

Example 1

Try the old man:king :: woman:? analogy problem:

# In gensim 4.x the query methods live on word2vec.wv rather than on the model.
female_king = word2vec.most_similar_cosmul(positive='mfalme mwanamke'.split(),
                                           negative='mtu'.split(), topn=5)
for ii, (word, score) in enumerate(female_king):
    print("{}. {} ({:1.2f})".format(ii+1, word, score))

1. malkia (0.97)
2. kambisi (0.93)
3. suleimani (0.93)
4. karolo (0.92)
5. koreshi (0.92)

These are, respectively, queen (jackpot!), Cambyses II (a Persian king), Solomon (king of Israel), Karolo Mkuu? (Charlemagne?) and Cyrus (a Persian king).
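
To make the scores above less of a black box, here is a rough pure-Python sketch of the multiplicative objective (3CosMul, from Levy and Goldberg) that most_similar_cosmul is documented to use: each cosine is shifted into [0, 1], the positive terms are multiplied, and the product is divided by the product of the negative terms plus a small constant. The function names and the three-dimensional toy vectors are made up for illustration and are not taken from the trained model. It needs Python 3.8+ for math.prod.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosmul(vectors, positive, negative, topn=5, eps=1e-6):
    """Rank words by the product of shifted cosines to the positive words,
    divided by the product of shifted cosines to the negative words."""

    def shifted(word, other):
        # Map the cosine from [-1, 1] to [0, 1] so the products stay non-negative.
        return (1.0 + cosine(vectors[word], vectors[other])) / 2.0

    queried = set(positive) | set(negative)
    scores = {}
    for word in vectors:
        if word in queried:  # the query words themselves are not candidates
            continue
        numerator = math.prod(shifted(word, p) for p in positive)
        denominator = math.prod(shifted(word, n) for n in negative) + eps
        scores[word] = numerator / denominator
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]


# Hypothetical toy vectors with axes (royal, female, male).
toy_vectors = {
    "mfalme": (1.0, 0.0, 1.0),    # king
    "mwanamke": (0.0, 1.0, 0.0),  # woman
    "mtu": (0.0, 0.0, 1.0),       # the negative word from the query above
    "malkia": (1.0, 1.0, 0.0),    # queen
    "gari": (0.1, 0.1, 1.0),      # an unrelated distractor
}
ranked = cosmul(toy_vectors, positive=["mfalme", "mwanamke"], negative=["mtu"])
best_word = ranked[0][0]
```

Because the score is a ratio, it is not bounded by 1 the way a plain cosine is, so it is mainly useful for ranking candidates against each other.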

Example 2

What doesn't match: car, train or breakfast?

print(word2vec.doesnt_match('gari treni mlo'.split()))

mlo
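
The idea behind doesnt_match can be sketched in a few lines: normalize each word vector, average them, and return the word whose vector is least similar to that average. This is a simplified pure-Python illustration, not gensim's code; the function names and the two-dimensional toy vectors are invented.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return tuple(x / norm for x in vector)


def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the mean vector."""
    units = {word: unit(vectors[word]) for word in words}
    dims = len(next(iter(units.values())))
    # Component-wise mean of the unit vectors, itself scaled to length 1.
    mean = unit(tuple(sum(u[i] for u in units.values()) / len(units)
                      for i in range(dims)))
    similarity = {word: sum(a * b for a, b in zip(u, mean))
                  for word, u in units.items()}
    return min(similarity, key=similarity.get)


# Hypothetical toy vectors with axes (vehicle, food).
toy_vectors = {
    "gari": (1.0, 0.1),
    "treni": (0.9, 0.2),
    "mlo": (0.1, 1.0),
}
outlier = odd_one_out(toy_vectors, "gari treni mlo".split())
```

With only three words the mean is pulled towards the two that agree, which is why the third one stands out.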

Dependencies

  • Python
  • gensim: pip install gensim
