This script converts a text consisting of plain words (separated by whitespace) into a text consisting of phrases. A phrase may be an ordinary word, as in the input, but more importantly it may be a named entity as recognized by spaCy. Such an entity can span multiple plain words, which are then merged into a single word using underscores.
Example: New York City -> New_York_City
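The merging step can be illustrated with a minimal sketch. It is not the script's actual code: the function names `merge_phrases` and `entity_spans_from_spacy` are hypothetical, and the sketch assumes entity spans are given as half-open token index ranges, which is what spaCy exposes as `ent.start` and `ent.end` for each entity in `doc.ents`.

```python
def merge_phrases(tokens, entity_spans):
    """Join the tokens covered by each (start, end) span with underscores.

    Spans are half-open token index ranges and are assumed to be sorted
    and non-overlapping (as spaCy's doc.ents are).
    """
    merged = []
    position = 0
    for start, end in entity_spans:
        merged.extend(tokens[position:start])        # plain words before the entity
        merged.append("_".join(tokens[start:end]))   # the entity as one word
        position = end
    merged.extend(tokens[position:])                 # plain words after the last entity
    return merged


def entity_spans_from_spacy(text, model="en_core_web_sm"):
    """Hypothetical helper: obtain tokens and entity spans from spaCy.

    spaCy is imported lazily so the rest of the sketch runs without it.
    """
    import spacy

    nlp = spacy.load(model)
    doc = nlp(text)
    return [token.text for token in doc], [(ent.start, ent.end) for ent in doc.ents]


# Demo with a hand-written span, so no model download is needed.
tokens = "I moved to New York City last year".split()
result = " ".join(merge_phrases(tokens, [(3, 6)]))  # span (3, 6) covers "New York City"
print(result)  # I moved to New_York_City last year
```

With spaCy and the model installed, the two helpers could be chained as `merge_phrases(*entity_spans_from_spacy(text))`. Which entities are found depends on the model.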
Requirements:
- Python3
- spacy (pip3 install spacy)
- spacy vocabulary (python3 -m spacy download en_core_web_sm)
The tool can be used in two different ways:
- Convert one big input file:
python3 spacy_ner.py file --input={PATH_TO_CLEANED_TEXT} --output={PATH_TO_OUTPUT=stdout}
- Convert multiple input files:
python3 spacy_ner.py dir --source={PATH_TO_SOURCE_DIR} --target={PATH_TO_TARGET_DIR}
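For readers who want to see how the two modes might map onto a command-line interface, here is a rough sketch using `argparse` subcommands. The actual argument handling in `spacy_ner.py` is not shown in this README, so everything beyond the mode names and the four option names above is an assumption, including the `build_parser` name and the choice of `None` to represent the stdout default.

```python
import argparse


def build_parser():
    """Build a parser with two subcommands mirroring the usage lines above."""
    parser = argparse.ArgumentParser(prog="spacy_ner.py")
    subparsers = parser.add_subparsers(dest="mode", required=True)

    # "file" mode: one big input file, output defaults to stdout.
    file_parser = subparsers.add_parser("file", help="convert one big input file")
    file_parser.add_argument("--input", required=True, help="path to the cleaned text")
    file_parser.add_argument("--output", default=None,
                             help="output path (assumption: None means stdout)")

    # "dir" mode: every file in a source directory is written to a target directory.
    dir_parser = subparsers.add_parser("dir", help="convert multiple input files")
    dir_parser.add_argument("--source", required=True, help="path to the source directory")
    dir_parser.add_argument("--target", required=True, help="path to the target directory")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["file", "--input=corpus.txt"])
print(args.mode, args.input, args.output)
```

Here `corpus.txt` is only a placeholder file name. Because `--output` is omitted in the demo call, `args.output` stays at its default.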