Add support for more markers

@wizmer wizmer released this 08 Feb 14:58
· 177 commits to master since this release
8fdd355
Improve marker handling (#229)

* Handle more markers

This commit assumes that each marker can be suffixed with an arbitrary integer.
For example, `Flower` is a valid marker, and `Flower4` or `Flower4444` are now
also valid.

The list of valid marker tags has been updated based on the contents of our
current morphology repository.

## Marker specification
Here are the regexes of the supported markers:

- Dot[0-9]*
- Plus[0-9]*
- Cross[0-9]*
- Splat[0-9]*
- Flower[0-9]*
- Circle[0-9]*
- TriStar[0-9]*
- OpenStar[0-9]*
- Asterisk[0-9]*
- SnowFlake[0-9]*
- OpenCircle[0-9]*
- ShadedStar[0-9]*
- FilledStar[0-9]*
- TexacoStar[0-9]*
- MoneyGreen[0-9]*
- DarkYellow[0-9]*
- OpenSquare[0-9]*
- OpenDiamond[0-9]*
- CircleArrow[0-9]*
- CircleCross[0-9]*
- OpenQuadStar[0-9]*
- DoubleCircle[0-9]*
- FilledSquare[0-9]*
- MalteseCross[0-9]*
- FilledCircle[0-9]*
- FilledDiamond[0-9]*
- FilledQuadStar[0-9]*
- OpenUpTriangle[0-9]*
- FilledUpTriangle[0-9]*
- OpenDownTriangle[0-9]*
- FilledDownTriangle[0-9]*
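The patterns above can be folded into a single anchored regex. Here is a minimal sketch of such a validator; the `MARKER_TAGS` list and the `is_marker` helper are illustrative names, not part of the actual implementation in #229:

```python
import re

# All marker tags from the specification above; each may carry an
# optional integer suffix (e.g. 'Flower', 'Flower4', 'Flower4444').
MARKER_TAGS = [
    'Dot', 'Plus', 'Cross', 'Splat', 'Flower', 'Circle', 'TriStar',
    'OpenStar', 'Asterisk', 'SnowFlake', 'OpenCircle', 'ShadedStar',
    'FilledStar', 'TexacoStar', 'MoneyGreen', 'DarkYellow', 'OpenSquare',
    'OpenDiamond', 'CircleArrow', 'CircleCross', 'OpenQuadStar',
    'DoubleCircle', 'FilledSquare', 'MalteseCross', 'FilledCircle',
    'FilledDiamond', 'FilledQuadStar', 'OpenUpTriangle', 'FilledUpTriangle',
    'OpenDownTriangle', 'FilledDownTriangle',
]

# One alternation over all tags, followed by the optional integer suffix.
MARKER_RE = re.compile('(?:' + '|'.join(MARKER_TAGS) + ')[0-9]*')

def is_marker(token):
    """Return True if `token` is a supported marker name."""
    # fullmatch ensures the whole token matches, so 'Flowers' is rejected.
    return MARKER_RE.fullmatch(token) is not None
```

Note that `re.fullmatch` backtracks across the alternation, so overlapping prefixes such as `Circle` vs. `CircleArrow` are handled correctly.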

## Code snippet

Here is the code I used to generate the list of markers. I then manually
filtered and checked the result to obtain the final list of markers.

```python
import json
from itertools import chain
from pprint import pprint

from morph_tool.utils import iter_morphology_files
from tqdm import tqdm

def create_token_list():
    files = list(iter_morphology_files('/home/bcoste/workspace/MorphologyRepository/',
                                       recursive=True))

    def is_not_number(token):
        try:
            float(token)
            return False
        except ValueError:
            return True

    def parse_line(line):
        # Drop comments (everything after ';') and split on parentheses/whitespace.
        tokens = line.split(';')[0].replace('(', ' ').replace(')', ' ').split()
        return filter(is_not_number, tokens)

    token = set()
    for f in tqdm(files):
        try:
            new_set = set(chain.from_iterable(parse_line(line) for line in f.open().readlines()))
            token |= new_set
        except Exception:
            print(f'failed parsing {f}')

    with open('bla.json', mode='w') as f:
        json.dump(list(token), f)

def read_token_list():
    with open('bla.json') as f:
        data = json.load(f)

    filtered = {'Settings\\user\\Desktop\\tkb02\\060206\\'}

    # Keep only tokens that look like marker names: no hex literals,
    # capitalized, and not containing known junk substrings.
    data = filter(lambda token: not token.startswith('0x'), data)
    data = filter(lambda token: token[0].isupper(), data)
    data = filter(lambda token: all(word not in token for word in filtered), data)
    data = sorted(data, key=len)

    pprint(data)
```