Add support for more markers
wizmer
released this
08 Feb 14:58
Improve marker handling (#229)
* Handle more markers
This commit assumes that each marker tag can be suffixed with an arbitrary integer.
For example, `Flower` is a valid marker, but `Flower4` or `Flower4444` are now
also valid.
The list of valid marker tags has been updated based on the current morphology
repository that we have.
## Marker specification
Here are the regexes of the supported markers:
- Dot[0-9]*
- Plus[0-9]*
- Cross[0-9]*
- Splat[0-9]*
- Flower[0-9]*
- Circle[0-9]*
- TriStar[0-9]*
- OpenStar[0-9]*
- Asterisk[0-9]*
- SnowFlake[0-9]*
- OpenCircle[0-9]*
- ShadedStar[0-9]*
- FilledStar[0-9]*
- TexacoStar[0-9]*
- MoneyGreen[0-9]*
- DarkYellow[0-9]*
- OpenSquare[0-9]*
- OpenDiamond[0-9]*
- CircleArrow[0-9]*
- CircleCross[0-9]*
- OpenQuadStar[0-9]*
- DoubleCircle[0-9]*
- FilledSquare[0-9]*
- MalteseCross[0-9]*
- FilledCircle[0-9]*
- FilledDiamond[0-9]*
- FilledQuadStar[0-9]*
- OpenUpTriangle[0-9]*
- FilledUpTriangle[0-9]*
- OpenDownTriangle[0-9]*
- FilledDownTriangle[0-9]*
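As an illustration, the suffix rule above can be checked with a single compiled regex. The snippet below is a sketch, not the library's actual implementation, and it uses only a handful of the tags from the list above:

```python
import re

# A few of the marker tags listed above; each may carry an optional
# integer suffix, e.g. `Flower`, `Flower4`, `Flower4444`.
MARKERS = ['Dot', 'Plus', 'Cross', 'Flower', 'OpenCircle', 'FilledStar']

# Alternation of all tags, longest first so that a longer name is not
# shadowed by a shorter one, followed by an optional integer suffix.
MARKER_RE = re.compile(
    '^(?:'
    + '|'.join(sorted(map(re.escape, MARKERS), key=len, reverse=True))
    + ')[0-9]*$'
)

def is_valid_marker(token):
    """Return True if `token` is a known marker tag, optionally suffixed."""
    return MARKER_RE.match(token) is not None
```

With the full tag list substituted for `MARKERS`, this accepts exactly the patterns enumerated above (`Flower`, `Flower4`, `Flower4444`, ...) and rejects anything else.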
## Code snippet
Here is the code I used to generate the list of markers. I then manually
filtered and checked the result to obtain the final list of markers.
```python
import json
from itertools import chain
from pprint import pprint

from morph_tool.utils import iter_morphology_files
from tqdm import tqdm


def create_token_list():
    files = list(iter_morphology_files('/home/bcoste/workspace/MorphologyRepository/',
                                       recursive=True))

    def is_not_number(token):
        try:
            float(token)
            return False
        except ValueError:
            return True

    def parse_line(line):
        tokens = line.split(';')[0].replace('(', ' ').replace(')', ' ').split()
        return filter(is_not_number, tokens)

    token = set()
    for f in tqdm(files):
        try:
            new_set = set(chain.from_iterable(parse_line(line)
                                              for line in f.open().readlines()))
            token = token | new_set
        except Exception:  # deliberately broad: skip any unparsable file
            print(f'failed parsing {f}')

    with open('bla.json', mode='w') as f:
        json.dump(list(token), f)


def read_token_list():
    with open('bla.json') as f:
        data = json.load(f)

    filtered = {'Settings\\user\\Desktop\\tkb02\\060206\\'}
    data = filter(lambda token: not token.startswith('0x'), data)
    data = filter(lambda token: token[0].isupper(), data)
    data = filter(lambda token: all(word not in token for word in filtered), data)
    data = sorted(data, key=lambda token: len(token))

    pprint(data)
```