Error on large readers #8
Thanks for the repro, it makes this easier. Unfortunately, here is what I've found. There doesn't appear to be an issue with the buffering/scanning of bytes. The buffering is working as expected, meaning that we keep the buffer filled and hand these bytes over to the "segment words" black-box function. I say black box because that is the Ragel state machine, with logic ported from Java, and it will take a while for someone to reload it into their head. When calling into this black box we pass two things: the buffer, and a boolean indicating whether we're at the end of input. So, what happens when processing this data (both short and long)?
At this point the two paths diverge. For the short file, the entire data is read into memory. Once this happens, when we call "segment words", the second argument is true (no more data), and this correctly returns the next token, "redirect2". For the long file, the entire file cannot be read into memory without exceeding a configured limit.

So, the question is: why is "segment words" behaving this way? Most likely there is some improperly handled corner case in the logic. It would appear that it is in some greedy path trying to build a really big token, and only EOF=true convinces it that it cannot build a larger token; only then does it return the correct token. So, that is my understanding at the moment: the bug is in the Ragel state machine code. But to be honest I'm not sure when I'll be able to look deeper.

Here are the detailed log messages, showing everything behaving the same right up until short has consumed the entire file and long exceeds the buffer. Notice how, in the short case, once everything has been consumed, it returns a 9-byte token ("redirect2").
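For readers less familiar with this pattern, the hand-off described above has the same basic shape as a bufio.SplitFunc from the standard library. The segmenter's real internal function differs in its details, so treat this strictly as an analogy:

```go
// From the standard library's bufio package. The "segment words" hand-off
// described above follows the same basic contract: given the currently
// buffered bytes and a flag saying whether more input can follow, decide how
// much to consume and which token (if any) to emit. Producing no token while
// atEOF is false means "I need more data", which is what forces the buffer
// to keep growing in the long-file case.
type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)
```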
Thanks for the quick response! On my end, a workaround might be to chunk the incoming reader and new up a segmenter every 64K bytes. Another path would perhaps be to regenerate using the latest Ragel, in the hope the bug might have been found and fixed in the interim. Or perhaps to hack a bit of logic into segmentWords to return on whitespace instead of EOF.
Yeah, I think chunking from the outside would work, but obviously that's undesirable because you might chunk in the middle of a token. Maybe finding whitespace near your desired chunk boundary is a reasonable compromise, depending on your input.

Also, not to discourage trying newer Ragel, but I suspect the bug is in our Ragel source file, not in the Ragel engine. If you look at this file: https://github.com/blevesearch/segment/blob/master/segment_words.rl you'll see the rules where, based on seeing various runes, it's trying to build up a token of a particular type. You can see how the atEOF flag changes behavior in a number of places. My guess is that the bug is in this file, though I don't even remember how to go about debugging this. Parts of it turn into the generated Go code, so regular logging should work, but I believe that is limited to the actions; there might be some Ragel options to add debugging of the states when generating the code. Hopefully that gives you a few ideas.
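A rough sketch of that chunk-at-whitespace workaround, in case it helps. This is only an illustration of the idea being discussed, not code from this library; the readChunks name, the chunkSize parameter, and the callback are all assumptions made for the example:

```go
package sketch

import (
	"bufio"
	"bytes"
	"io"
)

// readChunks reads roughly chunkSize bytes at a time, then extends each chunk
// to the next space so we don't split in the middle of a token, and hands each
// chunk to a callback. The callback could, for example, feed the []byte-based
// segmenter entry point mentioned later in this thread.
func readChunks(r io.Reader, chunkSize int, handle func([]byte) error) error {
	br := bufio.NewReader(r)
	for {
		chunk := make([]byte, chunkSize)
		n, err := io.ReadFull(br, chunk)
		chunk = chunk[:n]
		if err == nil {
			// A full chunk was read and more input may follow: extend to the
			// next space (a simplification of "nearest whitespace") so the
			// last token in this chunk is complete.
			rest, rerr := br.ReadBytes(' ')
			chunk = append(chunk, rest...)
			if rerr != nil && rerr != io.EOF {
				return rerr
			}
		}
		if len(bytes.TrimSpace(chunk)) > 0 {
			if herr := handle(chunk); herr != nil {
				return herr
			}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil // input exhausted
		}
		if err != nil {
			return err
		}
	}
}
```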
Hi -- I took a quick peek, trying to follow this thread -- but am also struggling with the Ragel. I am sheepish to share this, but one thought that popped to mind was whether a brute-force, bubble-gum band-aid might work. So, I hacked my local blevesearch/segment.MaxScanTokenSize from 64KB to 640KB (which ought to be enough for anyone?), and then TestLong() unsurprisingly passed. So one idea is perhaps to change segment.MaxScanTokenSize from a const into a var, so that apps might be able to tweak it as an init()-ish escape valve until we have a real answer? Or add maxScanTokenSize as a param in a new variant of NewSegmenter(), or as a tweakable knob?
@steveyen sure, you can increase MaxScanTokenSize; I thought that was implied, but you're right, it should be explicitly stated. The real issue, though, is that if you expose an API that takes a Reader, it presumes some attempt to process the data without the entire thing in memory. What this bug demonstrates (I think; I have not fully verified this) is that in some cases (perhaps all) the method actually reads in all the data anyway. Presumably we didn't notice this because it does behave correctly, so long as the inputs fit into the buffer. But it is actually a quite severe bug.
@steveyen perhaps I should have addressed your suggested changes. While I'm not strongly opposed to them, I think they don't really help. We already have another method that works on []byte. So if you know that our implementation is going to read in all the bytes anyway (due to the bug), you could just do that yourself and use the alternate method: https://github.com/blevesearch/segment/blob/master/segment.go#L84
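For what it's worth, that "read it all yourself" workaround looks roughly like the sketch below. The constructor name NewWordSegmenterDirect is an assumption about the []byte method linked above (based on the package README), so confirm it against segment.go before relying on it:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"

	"github.com/blevesearch/segment"
)

// Sketch of the workaround: since the Reader-based path currently ends up
// buffering everything anyway (due to the bug), read the whole input up
// front and use the []byte-based segmenter. NewWordSegmenterDirect is an
// assumed name, not confirmed against this repository.
func main() {
	buf, err := ioutil.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read:", err)
		os.Exit(1)
	}
	seg := segment.NewWordSegmenterDirect(buf)
	for seg.Segment() {
		fmt.Printf("token: %q type: %d\n", seg.Text(), seg.Type())
	}
	if err := seg.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "segment:", err)
	}
}
```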
So, I can't help but continue to look into this... I have not looked at this code in a really long time, but this line here seems completely wrong: https://github.com/blevesearch/segment/blob/master/segment_words.rl#L121

As I understand it, when we match a word there but aren't at the end of the input, we don't return it and instead keep demanding more data. What it seems should happen instead is that the code should check to make sure we're either truly at the end of the data or genuinely unable to extend the match with what's in the buffer. I have some vague recollection of considering this issue, but don't know how the code ended up as is. It's also possible I'm misunderstanding this code now.
Hehe, I thought you might not be able to resist. Happy to help you reason through it, though you obviously are more familiar than I. A useful, grokkable goal might simply be to return (block) per token; no need to buffer ahead more than one, in all likelihood. Perf-wise, buffering matters most with I/O; can you just exploit the std lib for that? OTOH, some small amount of buffering of tokens (not just bytes) might turn out to be necessary. To ensure a whole code point, you'll have to either look forward or look back. (Pardon if I am stating the obvious.)
@clipperhouse the problem is that while this bug is severe, it doesn't affect the main consumer of this library (bleve), because it always does have all the data in memory. And for that use case, it's not so much that we buffer the tokens as that we minimize the number of slices allocated by doing everything in a single pass. Avoiding/minimizing unnecessary allocations has been crucial to making bleve perform acceptably. So, while I'd like to fix this bug, I do not want to break the other mode where we can find all the tokens in a single pass. I got Ragel built and installed again, so I'll see about getting everything building end-to-end again tonight.
OK, so I started to take a deeper look at this today. First, the suspected problem was this section: https://github.com/blevesearch/segment/blob/master/segment_words.rl#L120-L122

Basically, at this point the grammar has found something matching the definition of a word. The question is, could it have built a longer word if it could see more of the data? Well, normally the rules already try to do that, but if we're sourcing data from a reader, we didn't present the Ragel state machine all the data, just some portion of it. So this particular section of code takes an extremely conservative view unless we're really at the end of the data.

But why take such a conservative view? Couldn't we just see if the token we matched went all the way to the end of the buffer? If our token went right to the edge of the buffer, then sure, be conservative, but otherwise we should be fine. So I tried this:
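A minimal sketch of that idea, with hypothetical names rather than the actual variables inside the Ragel action:

```go
// canEmit reports whether a word token ending at data[end] can safely be
// returned without seeing more input. Illustration of the check described
// above, not the actual segment_words.rl action: if the match stopped short
// of the buffer edge, bytes we haven't seen could not have extended it; if
// it runs right to the edge, only atEOF makes it safe.
func canEmit(data []byte, end int, atEOF bool) bool {
	return atEOF || end < len(data)
}
```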
But this doesn't work; it fails one of the Unicode-provided test cases. OK, but we can fix this case too. What we can do is not just check that we're at the end of the buffer, but also make sure there is a complete, valid Unicode codepoint after the token. So, we change it to this:
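Again as an illustrative sketch rather than the real diff, the refinement might look something like this, using utf8.FullRune to require a complete codepoint beyond the matched token:

```go
package sketch

import "unicode/utf8"

// canEmit, refined as described above (still an illustration, not the actual
// Ragel action): besides requiring that the token stopped short of the buffer
// edge, require that the bytes after it begin with a complete, valid UTF-8
// rune, so the boundary decision isn't made on a partial codepoint.
func canEmit(data []byte, end int, atEOF bool) bool {
	if atEOF {
		return true // nothing more is coming; the match cannot be extended
	}
	if end >= len(data) {
		return false // token runs to the buffer edge; ask for more data
	}
	return utf8.FullRune(data[end:])
}
```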
That fixes the case above, but still fails on yet another Unicode-provided test case.

So, where does that leave us? It seems to me the correct solution needs some proper lookahead or backtracking (I'm not a grammar expert), and I'm probably not the right person to build that. The other option would be to have some sort of user-configurable option, either in number of bytes or number of runes, that would simply say: if we matched a word, and we have at least N more bytes/runes ahead in the buffer, assume that it's probably safe to return the token.
OK, I wanted to see how many runes ahead we would have to look to at least pass the tests provided by the Unicode spec. I created a helper function, the intent being that after we've found a word token, we want to make sure the buffer beyond the end of our token has at least N valid runes.
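Something along these lines; this is a sketch of the kind of helper described, with an assumed name and shape rather than the actual code:

```go
package sketch

import "unicode/utf8"

// hasNValidRunes reports whether data[end:] contains at least n complete,
// valid UTF-8 runes. It is the lookahead test described above: after matching
// a word token ending at data[end], do we have enough context beyond it?
func hasNValidRunes(data []byte, end, n int) bool {
	rest := data[end:]
	for i := 0; i < n; i++ {
		r, size := utf8.DecodeRune(rest)
		if r == utf8.RuneError && size <= 1 {
			// Either we ran out of buffered bytes or hit an incomplete or
			// invalid encoding, so we can't count a full rune here.
			return false
		}
		rest = rest[size:]
	}
	return true
}
```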
I then tweaked the Ragel action to do something like this:
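Using the helper sketched above, the decision might look roughly like this (again illustrative; the real check lives inside the Ragel action in segment_words.rl with its own variable names):

```go
package sketch

// safeToReturn captures how the tweaked action decides whether to emit a
// just-matched word token: either no more input is coming, or at least n
// more valid runes are visible in the buffer beyond the token.
func safeToReturn(data []byte, end int, atEOF bool, n int) bool {
	return atEOF || hasNValidRunes(data, end, n)
}
```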
So, if at the point we matched this token we are not at EOF, make sure we have at least N more valid runes. Then I ran the test suite, and the answer is that N must be set to 4 in order for the tests to pass. What I don't know is whether that is correct, or just the longest case exposed by this test data set.
I think it would be pretty straightforward to expose this as something optional and keep the old behavior by default. I think the goal of this library should be to be useful in practical situations; as it is, the Reader version is arguably useless. Even if you set N to 100, you'd have at most 400 extra bytes read into your buffer, and presumably that would be correct on most inputs you'd actually encounter in the real world.
Wow, thanks for digging in. So, trying to hypothesize what N represents here. Is it the least number of lookahead runes needed to disambiguate all cases (in which case 4 could be hard-coded with some confidence), or is it a coincidence relating to this test set? To test the latter, maybe try an artificial example, perhaps including a colon as in WB6, with a very long token, and N much smaller than that token. To make sure I understand: N needs to be at least 4 to pass, or exactly 4? Sounds like the former.

If this were user-configurable, that would de facto represent a max token size, I think, which would be workable for me. I'd even go to 1024; I am happy to pay 4K of memory if that's the worst-case impact.

Two more questions, if you'll indulge. Does your current experiment pass my long test? And, if we take the user-configurable N approach, does that get us 'proper' buffering, i.e., not unlimited memory use? Really appreciate your effort.
So, there are a number of cases where you can successfully match a word token, but if you had only seen N more Unicode runes, you would have matched an even longer token. This matters for the Reader case doing proper buffering, because we're not passing the entire data to the Ragel state machine at once; in fact, we may never have the entire data in memory at once.

So, back to the changes I'm proposing: you, as the user of the Segmenter, would get to choose the value of N you are comfortable with (any value of N is a heuristic, possibly leading to tokenization behavior not described by the Unicode spec). You have to choose N greater than or equal to 4 just to pass the Unicode-provided tests. Choosing a larger value of N may further decrease the chance of tokenization errors, but it also means more data in the buffer before returning tokens (up to N * 4 bytes). N does not directly relate to the maximum token length that can be matched; it's the maximum number of Unicode runes we must read after a potential token to make sure we couldn't produce an even longer token.

With these changes in place, your TestLong passes. Yes, I believe the buffering logic was already working correctly; it was the segmenter itself demanding more data, and the buffer just growing to accommodate until it couldn't grow any more. With the segmenter fixed to only look N runes ahead (not requiring it to go all the way to EOF), I believe the buffering works as originally intended.
Thumbs-up from me on such a fix. The case you describe, missing the even-longer token: is that an edge case due to certain Unicode rules that I, as a Latin-script-biased person, would be unfamiliar with? Or is it broadly true of any token? The configurable N would be great regardless. A token larger than 100 or 1000 is not a use case for a tool like mine; that would be some weird data.
Hi! My environment is Go 1.14, macOS Catalina 10.15.3.
When reading a large file (over approximately 65K), I am getting:
bufio.Segmenter: token too long
To reproduce: here's a repo. Run the tests with go test -v.

I believe this error is in error, since no token in the file is that long. I expect to be able to use a reader of arbitrary length.
So either the returned error is inaccurate, or there is a bug in buffering/scanning the bytes. Thanks.
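For context, the failing usage is essentially the Reader-based loop from the package README pointed at a large file. The sketch below assumes the constructor is segment.NewWordSegmenter (as in the README) and uses a placeholder file path, so treat both as assumptions:

```go
package main

import (
	"fmt"
	"os"

	"github.com/blevesearch/segment"
)

// Minimal sketch of the failing scenario described above. On a file larger
// than the internal buffer, this loop stops with
// "bufio.Segmenter: token too long" even though no single token is that long.
func main() {
	f, err := os.Open("testdata/large.txt") // placeholder path, > 64KB of text
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer f.Close()

	seg := segment.NewWordSegmenter(f)
	for seg.Segment() {
		_ = seg.Bytes() // consume tokens; their contents don't matter for the repro
	}
	if err := seg.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "segment error:", err) // reports the error above
	}
}
```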
—
If curious I am using the segmenter here: https://github.com/clipperhouse/jargon. Thanks for creating it!