Parsing JSON from string generators/iterators #124
@doerwalter I won't be able to respond in detail now, but in the meantime I'll point you to #44. I'll try to answer more carefully tomorrow.
Thanks @doerwalter for the wait. I haven't had as much time on my hands as I'd like, but I still wrote the message above to signal that I didn't intend to forget about this issue. Today I finally could sit down and write a proper reply. Sorry for the wall of text, but this is a good opportunity to do some brain dumps and somewhat document things, I guess. Firstly, thanks again for the very detailed message and for presenting the problem very clearly.

Long story short, the main issue with supporting generators or iterables as inputs is that it cannot be supported in the "generic" routines (see below), as this would break the support ijson currently has for event interception. That feature works on the premise that users can provide a generator to ijson as an input, with the generator yielding 2- or 3-tuples of events (rather than strings or bytes).
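For illustration, a small sketch of that event interception pattern; the tuple-yielding generator here is simply the output of `ijson.parse`, which is how the feature is typically used:

```python
import io
import ijson

data = io.BytesIO(b'{"books": [{"title": "A"}, {"title": "B"}]}')

# ijson.parse() yields (prefix, event, value) tuples...
events = ijson.parse(data)

# ...and ijson.items() accepts that generator of tuples as its input:
# the input is a generator, but one that yields event tuples,
# not str/bytes chunks.
for book in ijson.items(events, 'books.item'):
    print(book['title'])   # A, then B
```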
Just one comment here: ijson never reads with a size of `-1`; it always asks the file object for at most `buf_size` bytes at a time.
You are correct that in your case the buffer size is what's preventing the parsing process from proceeding, and indeed using a smaller buffer size for your workaround is the way to go, even if that creates more back and forth to/from the underlying parser. In any case, yours is a very small JSON document, so it makes sense that the backend you choose makes no humanly noticeable difference (there sure is a difference, but it will be minimal).

About the back and forth: that's exactly what a strings/bytes iterator input would end up doing too: each time you, the user, yield some bytes/str, we'd pass that through the parser. The main difference is that users wouldn't need to pass a file-like wrapper (or pick a buffer size) themselves.
That'd be kind of a good first approach, but like I say below there's no way of doing this in the generic entry points. What we need to do is add new specialised entry points.
There's no way (that I can see) in which we can add support for str/bytes iterables as inputs to the existing generic entry points. What can be done however, and I hadn't thought about this during #44 I think, is to add a specialised method that deals only with bytes/string iterables as inputs.
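As an aside, a sketch of how chunks can already be pushed into the parser in whatever sizes they arrive, assuming the `*_coro` push entry points that recent ijson versions expose (the `collect` target coroutine and the sample chunks are illustrative):

```python
import ijson

def collect(results):
    # Target coroutine: receives each parsed item and stores it.
    while True:
        item = yield
        results.append(item)

results = []
target = collect(results)
next(target)  # prime the target coroutine by hand

# items_coro() gives a push-based parser: the caller decides the chunking.
coro = ijson.items_coro(target, 'rows.item')
for chunk in (b'{"rows": [1, ', b'2, 3]}'):  # chunks of whatever size arrive
    coro.send(chunk)
coro.close()

print(results)  # [1, 2, 3]
```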
I wouldn't care about the Python parser (it's miles behind the others in terms of performance). And also, in general, encoding/decoding usually isn't the bottleneck, although it can make a measurable difference (and still, on principle, you want to be as efficient as possible).

In the case of the library you're using, does it have any option to return the non-decoded bytes instead? The fact that it is returning strings with fragments of a raw JSON document means it is doing some unnecessary extra work to begin with. And yes, the Python backend internally uses str, decoding the incoming bytes.
@doerwalter just a final remark about the
Thanks for the detailed reply, this is really appreciated.
For context, this is the feature request I made in
OK, so passing a generator to the generic entry points is already taken by the event interception feature.
It feels wrong to me to ignore the `size` argument, and indeed, when I replace

```python
async def read(self, size: int = -1) -> str:
    while size == -1 or len(self.buffer) < size:
        try:
            self.buffer += await anext(self.iterator)
        except StopAsyncIteration:
            break
    if size == -1:
        (result, self.buffer) = (self.buffer, "")
    else:
        (result, self.buffer) = (self.buffer[:size], self.buffer[size:])
    return result
```

with

```python
async def read(self, size: int = -1) -> str:
    return await anext(self.iterator, b"")
```

my test script hangs.
Using
Exactly. The advantage would be that we would neither have to feed shorter input to the parser (because each incoming chunk could be passed through as-is), nor wrap the iterator in an artificial file-like object.
OK.
Ah, OK, now the naming scheme makes sense! ;)
Sounds good to me.
OK, when we ignore the Python parser, there's really no feasible way around encoding and decoding, since the C parser expects bytes.
This sounds like a useful approach, except that we could drop the warning, and document the support for mixed str/bytes as a feature. ;)
Well, the type annotations for the method I'm overriding in `langchain` look like this:

```python
def _transform(self, input: Iterator[Union[str, BaseMessage]]) -> Iterator[Any]:
    ...
```

with

```python
class BaseMessage(Serializable):
    content: Union[str, list[Union[str, dict]]]
    """The string contents of the message."""
```

(see https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/output_parsers/transform.py#L111 and https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/messages/base.py#L23)
I'm using `ijson` to parse JSON that is coming out of a `langchain` (https://www.langchain.com/) pipeline. This pipeline basically is an async iterator over strings (i.e. `str` objects, not `bytes`) which, when joined together, give the complete JSON output. But as far as I can tell, `ijson` doesn't support parsing from such a generator directly (which is strange for a library that is based on generators and coroutines).

So to be able to parse that JSON with `ijson` I have to wrap this generator in a file-like object. I'm doing it like this:
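A sketch of such a wrapper, assuming an async iterator yielding `str` chunks (the class name is illustrative):

```python
from typing import AsyncIterator

class AsyncIteratorReader:
    """Illustrative file-like wrapper around an async iterator of str chunks."""

    def __init__(self, iterator: AsyncIterator[str]):
        self.iterator = iterator
        self.buffer = ""

    async def read(self, size: int = -1) -> str:
        # Accumulate chunks until `size` characters are available (or the
        # iterator is exhausted), then slice off exactly what was asked for.
        while size == -1 or len(self.buffer) < size:
            try:
                self.buffer += await anext(self.iterator)
            except StopAsyncIteration:
                break
        if size == -1:
            result, self.buffer = self.buffer, ""
        else:
            result, self.buffer = self.buffer[:size], self.buffer[size:]
        return result
```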
For example, the pipeline delivers 343 JSON fragments over the course of 34 seconds. I want to show output to the user as soon as the initial JSON snippets arrive (which happens after about one second). This seems to be exactly what `ijson` was made for.

However, when I try it out, nothing gets output until 34 seconds have passed, and then everything gets output at once. My guess would be that this is caused by the default buffer size used by `ijson` being 64K. And indeed the total size of the JSON output is less than 64K; furthermore, when I use a small buffer size when calling `ijson.basic_parse_async()` like this:
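Roughly along these lines (a sketch reusing the illustrative wrapper from above; the function name is made up, and `buf_size` is ijson's buffer-size argument):

```python
import ijson

async def show_partial_output(pipeline):
    # `pipeline` is the async iterator of str fragments; wrap it so ijson
    # can read() from it, and use a tiny buf_size so ijson asks for (and
    # parses) small pieces as soon as they arrive.
    reader = AsyncIteratorReader(pipeline)
    async for event, value in ijson.basic_parse_async(reader, buf_size=16):
        print(event, value)
```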
I get the desired effect: partial JSON gets output almost immediately.

However, this seems to be inefficient for several reasons:
- It is an additional layer.
- The `read` method has to ensure that the strings it returns have the correct size, so it constantly has to slice and re-concatenate strings. However, I don't care about the input being parsed in 10-character chunks; I care about the incoming chunks being fed to the parser immediately, without being split into tiny chunks should a longer chunk arrive.
- I don't know how `yajl` is implemented, but I assume that when a small chunk size gets used, the C implementation of the parser can't play to its strengths, as the back and forth between Python and C drowns out the performance advantage of the C parser. (For my example the result was that using the pure Python parser is as fast as using `yajl2_c`.)

I tried to add support for parsing from generators to `ijson`, but I'm really out of my depth here. What I tried to do was:

I don't know whether this is the correct approach (the fact that the argument is called `file_obj` suggests that it isn't). I also don't know what the difference between `basic_parse` and `basic_parse_gen` is.

However, that nearly gets me there. I get an error, i.e. the parser seems to expect bytes. I can work around that problem by replacing the strings with their UTF-8 encoded equivalents, but that is another level of needless inefficiency and indirection (at least for the parser implemented in Python): I have to encode the strings to UTF-8, only for that to be immediately decoded from UTF-8 again by the parser.

(For the C parser this is still needless inefficiency, but that would be harder to fix, since the C parser would have to directly operate on Python unicode data.)
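To illustrate the kind of workaround meant here, a sketch building on the illustrative wrapper from above, where the only change is encoding each returned piece:

```python
class BytesAsyncIteratorReader(AsyncIteratorReader):
    """Illustrative variant whose read() hands UTF-8 bytes to the parser."""

    async def read(self, size: int = -1) -> bytes:
        # `size` is counted in characters here, which is close enough for a sketch.
        text = await super().read(size)
        # Encode to UTF-8 only for the parser to decode it again internally.
        return text.encode("utf-8")
```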
So, long story short: is there any chance that `ijson` gets the functionality to directly parse from string generators/iterators?