This repository has been archived by the owner on May 3, 2022. It is now read-only.

Odd results for ". . ." #84

Open
nickom opened this issue May 6, 2014 · 4 comments

nickom commented May 6, 2014

http://capitolwords.org/term/._._._/

Found because it was listed as the top 5 word phrase for this date:
http://capitolwords.org/date/2014/04/28/

[Screenshot attached: 2014-05-06, 4:35:33 PM]

Contributor

drinks commented May 6, 2014

Yeah, this is a known issue. We use an ngram parser similar to Google's,
which treats punctuation as distinct tokens. I believe these are
low-volume days that have either sequences of dots in rollcalls or
similar 'table of contents' style pages. Definitely on the list.
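
To illustrate the behavior described above, here is a minimal sketch of a tokenizer that emits each punctuation mark as its own token, plus one possible filter that drops n-grams made entirely of punctuation. This is not the Capitol Words parser; `tokenize`, `ngrams`, `is_punctuation_only`, and `top_ngrams` are hypothetical names, and the sample text is invented to resemble a dotted table-of-contents line.

```python
import re
from collections import Counter

# Hypothetical tokenizer: a run of word characters is one token,
# and every other non-space character is a token of its own.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def is_punctuation_only(gram):
    """True when no token in the n-gram contains a letter or digit."""
    return all(not any(ch.isalnum() for ch in tok) for tok in gram)


def top_ngrams(text, n, k=5, skip_punctuation=True):
    """Count n-grams and return the k most common, optionally filtered."""
    counts = Counter(ngrams(tokenize(text), n))
    if skip_punctuation:
        counts = Counter(
            {g: c for g, c in counts.items() if not is_punctuation_only(g)}
        )
    return counts.most_common(k)


# Invented sample resembling dotted leader lines on a low-volume day.
sample = "Sec. 1 . . . . . 3\nSec. 2 . . . . . 7"
print(top_ngrams(sample, 3, k=1, skip_punctuation=False))
print(top_ngrams(sample, 3, k=1))
```

With the filter off, the dotted leaders dominate the counts, which matches the symptom in this issue. Whether filtering at count time or at index time is the better fit for the real pipeline is an open question.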

Author

nickom commented May 6, 2014

Gotcha. The other thing that was so odd to me was that the highlighted examples had letters in them:
[Screenshot attached: 2014-05-06, 4:51:44 PM]

Contributor

drinks commented May 6, 2014

Guessing that's a separate issue related to "." being the regexp for
'match any character'. Code here:
https://github.com/sunlightlabs/Capitol-Words/blob/2bf155cd586847ea32ed294a8a3e6997e822199e/cwod_site/cwod/views.py#L318-L332
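
A small sketch of the suspected cause, assuming the search term is interpolated into a regular expression without escaping (I have not confirmed that this is what the linked code does). `find_matches` is a hypothetical helper, not a function from the repository; `re.escape` is the standard-library call that makes a string match literally.

```python
import re


def find_matches(term, text, escape=True):
    """Return the substrings of text that match term.

    With escape=False the term is used as a raw pattern, so each "."
    matches any character. With escape=True the term matches literally.
    """
    pattern = re.escape(term) if escape else term
    return re.findall(pattern, text, flags=re.IGNORECASE)


term = ". . ."

# Unescaped: three arbitrary characters separated by spaces also match.
print(find_matches(term, "a b c", escape=False))

# Escaped: only literal dots match.
print(find_matches(term, "a b c", escape=True))
print(find_matches(term, "Yeas . . . 12", escape=True))
```

If the highlighter does build its pattern this way, escaping the term before compiling would explain and fix the letters showing up in the highlighted examples.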

Author

nickom commented May 6, 2014

Also, shorter versions of the dots are the top words for that day, and some of their links lead to server errors or 404s. Here are the links for the top words on that day:

Two words (not found):
http://capitolwords.org/term/

Three words (server error):
http://capitolwords.org/term/._/

Four words:
http://capitolwords.org/term/._._/

Five words:
http://capitolwords.org/term/._._._/
