Default Normalizers not working #66
Comments
Thanks for posting this. I'll check it out. Seems like it should be a straightforward fix. |
Could you give me a sample input? Are you loading from a raw text file, or are you loading a "Frequency file" of the format:
|
I created a simple unit test with some weird text and could not immediately replicate your issue.
@Test
public void defaultTokenizerTrimTest() throws IOException {
    // Load the sample file from the test classpath and compute word frequencies.
    final FrequencyAnalyzer frequencyAnalyzer = new FrequencyAnalyzer();
    final List<WordFrequency> wordFrequencies = frequencyAnalyzer.load(
            Thread.currentThread().getContextClassLoader().getResourceAsStream("trim_test.txt"));
    // Index the results by word so individual counts can be asserted.
    final Map<String, WordFrequency> wordFrequencyMap = wordFrequencies
            .stream()
            .collect(Collectors.toMap(WordFrequency::getWord,
                                      Function.identity()));
    assertEquals(2, wordFrequencyMap.get("random").getFrequency());
    assertEquals(1, wordFrequencyMap.get("some").getFrequency());
    assertEquals(1, wordFrequencyMap.get("with").getFrequency());
    assertEquals(1, wordFrequencyMap.get("spaces").getFrequency());
    assertEquals(1, wordFrequencyMap.get("i'm").getFrequency());
}
The contents of trim_test.txt:
Feel free to post your raw text/file and I can add tests around it and help debug. |
I went ahead and pushed up the test since there was no existing FrequencyAnalyzerTest. https://github.com/kennycason/kumo/blob/master/kumo-core/src/test/java/com/kennycason/kumo/nlp/FrequencyAnalyzerTest.java |
Here is an example text file with the bug.
etc. |
@thomasegense thanks for the sample! I'll check it out. |
Hi again, can you reproduce the error? |
@thomasegense Hi, sorry, this week has been hectic for me at work. I'll try to look at it over this weekend. I have this tab open in my browser. :) |
I am using the latest 1.13 release.
The FrequencyAnalyzer default constructor adds the following normalizers:
This seems correct, but it does not work properly. It leaves whitespace, so the trim is not working correctly for some reason. Here is the log file.
Notice the first line: the most frequent entry is just whitespace.
Also notice how many times the word "crack" appears below, with and without trailing spaces.
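One way to narrow this down is to check which whitespace characters actually trail the words. The sketch below is not kumo's code, and every name in it (`WhitespaceDemo`, `stripUnicode`, `count`, `codePoints`) is hypothetical. It only illustrates one possible cause, which this thread does not confirm: Java's `String.trim()` removes characters up to U+0020 only, so a non-breaking space (U+00A0) at the end of a word would survive trimming and be counted as a separate word.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustration only: this is not kumo's code and all names here are hypothetical.
final class WhitespaceDemo {

    // Strips leading/trailing characters matching \s or Unicode category Z
    // (separators, which includes the non-breaking space U+00A0).
    static String stripUnicode(final String token) {
        return token.replaceAll("^[\\s\\p{Z}]+|[\\s\\p{Z}]+$", "");
    }

    // Counts tokens after applying the given normalizer, dropping empty results.
    static Map<String, Integer> count(final List<String> tokens,
                                      final UnaryOperator<String> normalizer) {
        final Map<String, Integer> counts = new LinkedHashMap<>();
        for (final String token : tokens) {
            final String word = normalizer.apply(token);
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Renders each character as a code point so invisible characters show up in a log.
    static String codePoints(final String token) {
        final StringBuilder sb = new StringBuilder();
        token.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(final String[] args) {
        // Assumed sample tokens: plain, trailing space, trailing NBSP, NBSP only.
        final List<String> tokens = List.of("crack", "crack ", "crack\u00A0", "\u00A0");
        System.out.println(count(tokens, String::trim));                // NBSP variants stay separate
        System.out.println(count(tokens, WhitespaceDemo::stripUnicode)); // all variants merge
        tokens.forEach(t -> System.out.println(codePoints(t)));
    }
}
```

If the code points printed for the duplicated "crack" entries show something other than U+0020 at the end, that would point to the input rather than the trim normalizer itself; if they show plain spaces, the cause lies elsewhere.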