Wiki Word2vec

Train a gensim word2vec model on Wikipedia.

Most of it is taken from this blogpost and this discussion. This repository was created mostly for trying out make; see The gist for the important stuff. Note that performance depends heavily on corpus size and the chosen parameters, especially for smaller corpora. The examples and parameters below are cherry-picked.

Usage

Get the code for a language (see here).

Run make with the code as the value for LANGUAGE (or change the Makefile). For instance, try Swahili (sw):

make LANGUAGE=sw

The gist

Ignore make and execute the following bash commands for Swahili:

mkdir -p data/sw/
wget -P data/sw/ https://dumps.wikimedia.org/swwiki/latest/swwiki-latest-pages-articles.xml.bz2
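The same two commands should work for other languages once the code is swapped in. A minimal sketch of that substitution follows. The helper names `dump_url` and `local_dump_path` are hypothetical, and the URL pattern is generalized from the Swahili example on the assumption that other editions use the same naming (codes containing a hyphen may be named differently on the dump server; this is not checked here).

```python
from pathlib import Path

# Assumption: every language edition follows the same dump naming scheme
# as the Swahili example.
DUMP_URL = ("https://dumps.wikimedia.org/{lang}wiki/latest/"
            "{lang}wiki-latest-pages-articles.xml.bz2")


def dump_url(lang):
    """Return the URL of the latest articles dump for a language code."""
    return DUMP_URL.format(lang=lang)


def local_dump_path(lang, root="data"):
    """Return where `wget -P data/<lang>/` would place the downloaded file."""
    filename = dump_url(lang).rsplit("/", 1)[-1]
    return Path(root) / lang / filename


print(dump_url("sw"))
print(local_dump_path("sw"))
```

The second function only computes a path; it does not download anything or create directories.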

Train a model in Python:

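import multiprocessing
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec

# Written against the pre-4.0 gensim API: gensim 4 removed `lemmatize` from
# WikiCorpus and renamed Word2Vec's `size` to `vector_size`.
# dictionary={} skips building a vocabulary dictionary, which is not needed here.
wiki = WikiCorpus('data/sw/swwiki-latest-pages-articles.xml.bz2',
                  lemmatize=False, dictionary={})
# Each article becomes one list of tokens; list() keeps the whole corpus in memory.
sentences = list(wiki.get_texts())
params = {'size': 200, 'window': 10, 'min_count': 10,
          'workers': max(1, multiprocessing.cpu_count() - 1), 'sample': 1E-3}
word2vec = Word2Vec(sentences, **params)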

Example 1

Try the classic man:king woman:? analogy problem:

# On gensim 4 and later, query through word2vec.wv.most_similar_cosmul instead.
female_king = word2vec.most_similar_cosmul(positive='mfalme mwanamke'.split(),
                                           negative='mtu'.split(), topn=5)
for ii, (word, score) in enumerate(female_king):
    print("{}. {} ({:1.2f})".format(ii+1, word, score))

1. malkia (0.97)
2. kambisi (0.93)
3. suleimani (0.93)
4. karolo (0.92)
5. koreshi (0.92)

These are, respectively, queen (jackpot!), Cambyses II (a Persian king), Solomon (king of Israel), Karolo Mkuu? (Charlemagne?) and Cyrus (a Persian king).
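For intuition about the scores, most_similar_cosmul ranks candidates with the multiplicative combination objective of Levy and Goldberg: cosine similarities are shifted into the range 0 to 1, the similarities to the positive words are multiplied together, and the product is divided by the similarities to the negative words. The following is a rough pure-Python sketch of that idea, not gensim's code. The 4-dimensional toy vectors and English words are made up for illustration, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosmul_ranking(vectors, positive, negative, eps=1e-6):
    """Rank all words outside the query by the multiplicative objective."""
    scores = {}
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # query words are not candidates
        # Shift each cosine from [-1, 1] into [0, 1] before multiplying.
        num = math.prod((1 + cosine(vec, vectors[p])) / 2 for p in positive)
        den = math.prod((1 + cosine(vec, vectors[n])) / 2 for n in negative)
        scores[word] = num / (den + eps)  # eps avoids division by zero
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hand-made toy vectors; dimensions are (royal, male, female, food).
toy = {
    "king":   (1.0, 1.0, 0.0, 0.0),
    "queen":  (1.0, 0.0, 1.0, 0.0),
    "man":    (0.0, 1.0, 0.0, 0.0),
    "woman":  (0.0, 0.0, 1.0, 0.0),
    "prince": (1.0, 0.8, 0.0, 0.0),
    "bread":  (0.0, 0.0, 0.0, 1.0),
}

ranking = cosmul_ranking(toy, positive=["king", "woman"], negative=["man"])
for rank, (word, score) in enumerate(ranking, start=1):
    print("{}. {} ({:1.2f})".format(rank, word, score))
```

With these toy vectors "queen" ranks first, because it is close to both positive words and orthogonal to the negative one.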

Example 2

What doesn't match: car, train or breakfast?

print(word2vec.doesnt_match('gari treni mlo'.split()))

mlo
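The odd-one-out query can be understood as follows: normalize each word vector to unit length, average them, and return the word whose vector is least similar to that average. Below is a minimal pure-Python sketch of this idea under that assumption. The toy vectors, English words and function names are made up for illustration and are not taken from the trained model.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def odd_one_out(vectors, words):
    """Return the word least similar to the mean of all the words."""
    units = {word: unit(vectors[word]) for word in words}
    dim = len(next(iter(units.values())))
    # Mean of the unit vectors, itself rescaled to unit length.
    mean = unit([sum(u[i] for u in units.values()) / len(units)
                 for i in range(dim)])
    # The dot product of unit vectors is their cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hand-made toy vectors: the two vehicles point in a similar direction.
toy = {
    "car":       (1.0, 0.2, 0.0),
    "train":     (0.9, 0.3, 0.0),
    "breakfast": (0.0, 0.1, 1.0),
}

print(odd_one_out(toy, ["car", "train", "breakfast"]))
```

Because two of the three vectors nearly coincide, the mean leans towards them and the remaining word gets the lowest similarity.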

Dependencies

Python
gensim: pip install gensim

nyxjemk/wiki-word2vec

Folders and files

Latest commit

History

Repository files navigation

Wiki Word2vec

Usage

The gist

Example 1

Example 2

Dependencies

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages