Default Normalizers not working #66

thomasegense · 2018-07-19T07:49:18Z

I am using the latest 1.13 release.

The FrequencyAnalyzer default constructor adds the following normalizers:

    public FrequencyAnalyzer() {
        this.normalizers.add(new TrimToEmptyNormalizer());
        this.normalizers.add(new CharacterStrippingNormalizer());
        this.normalizers.add(new LowerCaseNormalizer());
    }

This looks correct, but it does not work properly: whitespace is left in place, so the trim is not working correctly for some reason. Here is the log output.
Notice the first line: the most frequent token is just whitespace.
Also notice how many times the word "crack" appears below, with and without surrounding spaces.

2018-07-19 09:42:44,639 [main] INFO  com.kennycason.kumo.WordCloud - placed:    (1/300)
2018-07-19 09:42:44,642 [main] INFO  com.kennycason.kumo.WordCloud - placed: the (2/300)
2018-07-19 09:42:44,643 [main] INFO  com.kennycason.kumo.WordCloud - placed: music (3/300)
2018-07-19 09:42:44,644 [main] INFO  com.kennycason.kumo.WordCloud - placed: and (4/300)
2018-07-19 09:42:44,644 [main] INFO  com.kennycason.kumo.WordCloud - placed: user (5/300)
2018-07-19 09:42:44,645 [main] INFO  com.kennycason.kumo.WordCloud - placed:  crack (6/300)
2018-07-19 09:42:44,646 [main] INFO  com.kennycason.kumo.WordCloud - placed: this (7/300)
2018-07-19 09:42:44,646 [main] INFO  com.kennycason.kumo.WordCloud - placed: you (8/300)
2018-07-19 09:42:44,647 [main] INFO  com.kennycason.kumo.WordCloud - placed: csdb (9/300)
2018-07-19 09:42:44,689 [main] INFO  com.kennycason.kumo.WordCloud - placed: comment (10/300)
2018-07-19 09:42:44,689 [main] INFO  com.kennycason.kumo.WordCloud - placed: submitted (11/300)
2018-07-19 09:42:44,689 [main] INFO  com.kennycason.kumo.WordCloud - placed: for (12/300)
2018-07-19 09:42:44,690 [main] INFO  com.kennycason.kumo.WordCloud - placed: graphics (13/300)
2018-07-19 09:42:44,690 [main] INFO  com.kennycason.kumo.WordCloud - placed: scene (14/300)
2018-07-19 09:42:44,691 [main] INFO  com.kennycason.kumo.WordCloud - placed: demo (15/300)
2018-07-19 09:42:44,702 [main] INFO  com.kennycason.kumo.WordCloud - placed: crack   (16/300)
2018-07-19 09:42:44,702 [main] INFO  com.kennycason.kumo.WordCloud - placed: c64 (17/300)
2018-07-19 09:42:44,702 [main] INFO  com.kennycason.kumo.WordCloud - placed: crack (18/300)
2018-07-19 09:42:44,710 [main] INFO  com.kennycason.kumo.WordCloud - placed: demo   (19/300)
2018-07-19 09:42:44,711 [main] INFO  com.kennycason.kumo.WordCloud - placed: can (20/300)
2018-07-19 09:42:44,713 [main] INFO  com.kennycason.kumo.WordCloud - placed: made (21/300)
2018-07-19 09:42:44,714 [main] INFO  com.kennycason.kumo.WordCloud - placed: commodore (22/300)
2018-07-19 09:42:44,714 [main] INFO  com.kennycason.kumo.WordCloud - placed: find (23/300)
2018-07-19 09:42:44,715 [main] INFO  com.kennycason.kumo.WordCloud - placed: all (24/300)
2018-07-19 09:42:44,719 [main] INFO  com.kennycason.kumo.WordCloud - placed: one-file (25/300)
2018-07-19 09:42:44,721 [main] INFO  com.kennycason.kumo.WordCloud - placed: intro (26/300)
2018-07-19 09:42:44,721 [main] INFO  com.kennycason.kumo.WordCloud - placed: 1990 (27/300)
2018-07-19 09:42:44,723 [main] INFO  com.kennycason.kumo.WordCloud - placed: about (28/300)
2018-07-19 09:42:44,723 [main] INFO  com.kennycason.kumo.WordCloud - placed: out (29/300)


kennycason · 2018-07-20T20:36:19Z

Thanks for posting this. I'll check it out. It seems like it should be a straightforward fix.

kennycason · 2018-07-20T20:39:37Z

Could you give me a sample input? Are you loading from a raw text file, or are you loading a "frequency file" in the following format?

100: frog
94: dog
43: cog
3: fog
1: log
1: pog

kennycason · 2018-07-20T20:50:24Z

I created a simple unit test with some weird text and could not immediately replicate your issue.
Test:

    @Test
    public void defaultTokenizerTrimTest() throws IOException {
        final FrequencyAnalyzer frequencyAnalyzer = new FrequencyAnalyzer();
        final List<WordFrequency> wordFrequencies = frequencyAnalyzer.load(
                Thread.currentThread().getContextClassLoader().getResourceAsStream("trim_test.txt"));

        final Map<String, WordFrequency> wordFrequencyMap = wordFrequencies
                .stream()
                .collect(Collectors.toMap(WordFrequency::getWord,
                                          Function.identity()));

        assertEquals(2, wordFrequencyMap.get("random").getFrequency());
        assertEquals(1, wordFrequencyMap.get("some").getFrequency());
        assertEquals(1, wordFrequencyMap.get("with").getFrequency());
        assertEquals(1, wordFrequencyMap.get("spaces").getFrequency());
        assertEquals(1, wordFrequencyMap.get("i'm").getFrequency());
    }

The contents of trim_test.txt:
I'm some random random text with spaces .

Feel free to post your raw text/file and I can add tests around it and help debug.

kennycason · 2018-07-20T20:53:48Z

I went ahead and pushed up the test since there was no existing FrequencyAnalyzerTest. https://github.com/kennycason/kumo/blob/master/kumo-core/src/test/java/com/kennycason/kumo/nlp/FrequencyAnalyzerTest.java

thomasegense · 2018-07-23T06:50:03Z

Here is an example text file with the bug.
(removed sample file)
It gives the same result whether loading from a text file or from an InputStream.
Most special characters are removed, but not -. I am not sure whether this is intended.
As a result, I end up with several distinct tokens:

-
--
---

etc.
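
Until the intended behaviour for - is settled, tokens made up only of hyphens can be dropped on the caller's side. The following is a minimal sketch in plain Java. It does not use kumo's own filter API, and the class and method names (HyphenTokenFilter, dropHyphenOnly) are hypothetical. Hyphenated words such as one-file are kept.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class HyphenTokenFilter {

    // Matches tokens that consist of one or more hyphens and nothing else.
    private static final Pattern ONLY_HYPHENS = Pattern.compile("-+");

    // Keeps every token that is not hyphen-only (hypothetical predicate, not part of kumo).
    static final Predicate<String> KEEP = token -> !ONLY_HYPHENS.matcher(token).matches();

    // Returns a new list without the hyphen-only tokens; order is preserved.
    static List<String> dropHyphenOnly(List<String> tokens) {
        return tokens.stream().filter(KEEP).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("-", "--", "one-file", "---", "crack");
        System.out.println(dropHyphenOnly(tokens));
    }
}
```

The same predicate could be applied to the tokens before they are counted, or to the words of the resulting frequency list.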

kennycason · 2018-07-23T16:32:53Z

@thomasegense thanks for the sample! I'll check it out.

thomasegense · 2018-07-26T05:38:34Z

Hi again, can you reproduce the error?

kennycason · 2018-08-02T22:44:21Z

@thomasegense Hi, sorry, this week has been hectic for me at work. I'll try to look at it over this weekend. I have this tab open in my browser. :)

kennycason · 2018-08-05T23:17:41Z

I was able to replicate this error.

    @Test
    public void largeTextFileTest() throws IOException {
        final FrequencyAnalyzer frequencyAnalyzer = new FrequencyAnalyzer();
        final List<WordFrequency> wordFrequencies = frequencyAnalyzer.load(
                Thread.currentThread().getContextClassLoader().getResourceAsStream("text/csdb.txt"));

        wordFrequencies
                .forEach(wordFrequency ->
                                 System.out.println(
                                         String.format("[%s] -> [%d]", wordFrequency.getWord(), wordFrequency.getFrequency())));
    }

Result:

[  ] -> [258594]
[the] -> [251345]
[music] -> [82106]
[and] -> [69944]
[user] -> [66652]
[ crack] -> [55529]
[this] -> [54919]
[you] -> [54355]
[csdb] -> [53250]
[comment] -> [50887]
[submitted] -> [50417]
[for] -> [49680]
[graphics] -> [44411]
[scene] -> [40164]
[demo] -> [38855]
[crack  ] -> [37584]
[c64] -> [36656]
[crack] -> [35495]
[demo  ] -> [35339]
[can] -> [31646]
[made] -> [28503]
[commodore] -> [27584]
[find] -> [27268]
[all] -> [25895]
[one-file] -> [25843]
[intro] -> [25235]
[1990] -> [22883]
[about] -> [22095]
[out] -> [21743]
[1989] -> [21269]
[here] -> [21171]
[not] -> [21055]
[but] -> [21001]
[which] -> [20647]
[was] -> [20377]
[are] -> [20349]
[forum] -> [20110]
[release] -> [20101]
[search] -> [19774]
[sceners] -> [19406]
[page] -> [19343]
[home] -> [19306]
[1988] -> [19037]
[that] -> [18841]
[code] -> [18535]
[website] -> [18503]
[computer] -> [18459]
[] -> [18446]
[1991] -> [17545]
[comments] -> [17502]

Looking at [ crack] in the debugger shows character code 160 (0xA0), which is a non-breaking space.
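
A likely explanation, assuming the trim step behaves like Java's String.trim(): trim() only removes characters up to U+0020, and Character.isWhitespace() explicitly excludes the non-breaking space, so character 160 survives both. The sketch below shows this in plain Java. The helper trimIncludingNbsp is hypothetical and is not part of kumo.

```java
final class NbspTrimDemo {

    // True for ordinary whitespace and for Unicode space separators such as U+00A0.
    static boolean isBlank(char c) {
        return Character.isWhitespace(c) || Character.isSpaceChar(c);
    }

    // Hypothetical helper: like String.trim(), but also removes non-breaking spaces.
    static String trimIncludingNbsp(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && isBlank(s.charAt(start))) {
            start++;
        }
        while (end > start && isBlank(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // One leading and two trailing non-breaking spaces around "crack".
        String token = "\u00A0crack\u00A0\u00A0";
        System.out.println(token.trim().length());            // 8: trim() removed nothing
        System.out.println(Character.isWhitespace('\u00A0')); // false
        System.out.println(Character.isSpaceChar('\u00A0'));  // true
        System.out.println("[" + trimIncludingNbsp(token) + "]"); // [crack]
    }
}
```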

One unquestionable bug is the empty token, which shows up in the result above as [] -> [18446].

I will consider how to handle these use cases. In the meantime, I recommend you strip character 160 (the non-breaking space) from your text file. Its hex code, which also works as a regex escape, is \xA0.
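
As a rough sketch of that workaround in plain Java: the text can be cleaned in memory and handed on as an InputStream, which is the input type passed to frequencyAnalyzer.load(...) in the tests above. The class and method names here (NbspCleaner, replaceNbsp, cleanedStream) are hypothetical, and UTF-8 is an assumption about the encoding. The sketch replaces each non-breaking space with a regular space rather than deleting it, so that neighbouring words are not glued together.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class NbspCleaner {

    // Replaces every non-breaking space (\xA0 in regex syntax) with a regular space.
    static String replaceNbsp(String text) {
        return text.replaceAll("\\xA0", " ");
    }

    // Wraps the cleaned text in a stream; UTF-8 is an assumed encoding.
    static InputStream cleanedStream(String rawText) {
        byte[] bytes = replaceNbsp(rawText).getBytes(StandardCharsets.UTF_8);
        return new ByteArrayInputStream(bytes);
    }

    public static void main(String[] args) {
        String raw = "\u00A0crack\u00A0\u00A0demo";
        System.out.println("[" + replaceNbsp(raw) + "]");
    }
}
```

The stream returned by cleanedStream could then be passed to the analyzer in place of the raw resource stream, at which point the existing trim step only has ordinary spaces to deal with.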
