-
Notifications
You must be signed in to change notification settings - Fork 586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How should we handle text_data in proto #1186
Comments
This was also discussed on the 2022-07-28 call: https://www.youtube.com/watch?v=cg0jKFwyJeM&list=PLO-qzjSpLN1BEyKjOVX_nMg7ziHXUYwec&index=34 |
I think this is where Mr. @jskeet conformance tests are going to come in useful :-) As you mention the behavior of the Go SDK is a bit odd, but potentially not incorrect - until you read the fine-print in the JSON Format specification. ;-)
Given that statement I'd humbly suggest that the Go SDK is not spec compliant, whether all the other SDKs are compliant in this scenario is another question entirely... |
The bottom line here seems to be that if your application code hands a 'string' to the SDK along with the assertion that it is JSON then the SDK its required to parse the string back into a JSON value and stick that in the data property. If the parse step fails then the SDK should error-out. |
@lionelvillard if we "fixed" the go-lang SDK so that we don't treat this as a string do you think this will break lots of Knative folks? |
I might be doing it wrong, but in testing the golang SDK I'm seeing this:
I think it's working as expected. Perhaps something changed since the original issue was opened. |
@duglin: I believe the point is that it's probably fine if the SDK parses something that already has a JSON object as the Here's C# code (using the System.Text.Json formatter, but I'm sure the Newtonsoft.Json one will behave the same way) showing the issue in C#: using CloudNative.CloudEvents;
using CloudNative.CloudEvents.SystemTextJson;
using System.Text;
var evt = new CloudEvent
{
Id = "Issue1186",
Type = "example.type",
Source = new Uri("example/uri", UriKind.RelativeOrAbsolute),
Data = "{\"Hello\": \"there\"}",
DataContentType = "application/json"
};
var formatter = new JsonEventFormatter(new() { WriteIndented = true }, new());
var bytes = formatter.EncodeStructuredModeMessage(evt, out _);
Console.WriteLine(Encoding.UTF8.GetString(bytes.Span)); Output:
It would be good to know what other SDKs do in the same scenario. Fundamentally I think this is a problem with model we have of data within our formats and SDKs - each format is opinionated in its own way, violating the model of "data is just a sequence of bytes". That does make for much more readable serialized events - at the cost of this sort of thing. I really don't know where we should go from here. |
That example I used was based on the first comment in the issue, here's another example passing in a string that looks like json:
Still seems right to me. Are you saying that the C# example you gave is wrong? It looks right to me too. You gave it a JSON string, so you get back a JSON string. Not a JSON object. Try your example with
|
I'm saying that it's different to the Java SDK, and at least plausibly violating the bit of the spec that Jem quoted. (It looks like that piece of the spec is subject to interpretation, which is never great.) At the very least, it means that we don't have portability across formats - converting from one format to another and back again in the "obvious" way can easily lose information. |
To me, the key is the first character after the Which part of the spec does the above example violate? This sentence:
(to me) is consistent with my first sentence in this comment... the first char after Another way I view it is this, for :
Is someone suggesting that each of these might result in a different interpretation? ie. string, error and json object respectively? If so, that doesn't seem right since each of these are valid JSON Values (strings) and I don't see how else someone could encode them in the CE to say "these are strings, don't try to decode them as json". What's interesting is if someone wanted to say that
but then that same logic doesn't apply for There could be a bug someplace, but I'm not seeing it being the spec....unless I'm just not groking this |
But that's the result of serialization. The problem is in the act of serializing (not deserializing; I don't believe that part is controversial). What Jem is saying is that if an SDK has a bunch of bytes (or a string) as the data, and the data content type has been set to application/json, it should be interpreted by the SDK as already being JSON - and that the act of serializing the CloudEvent in the JSON format should be to represent the JSON value that's already JSON-encoded in that text. That means the SDK would need to either parse or at least validate that it is JSON to start with, which is non-trivial, of course. Basically I think the distinction is between whether data content type should be used to indicate "this is already the type of data that I've presented to you" or "this is what I want it to be". |
How is that "bunch of bytes" given to the SDK? Are we talking about it's part of a stream from bytes from an HTTP body? If so, there's still a char/byte after the I guess I view this is an SDK issue in that each SDK needs to decide how it wants its users to populate But I don't think the main spec itself is ambiguous, or that a JSON serialized CE is open to interpretation w.r.t. what What's interesting is your question about whether data content type should influence things - and my current thinking "no" since I think whatever we do should work correctly even when that property is null - and should net the same results as when the property is set. Imagine how annoyed a user would be if they never set the property, things worked but then decided to be explicit about everything by setting this property and suddenly the serialization of their CEs changes and breaks things. That feels bad. |
Again, that would only be if we were deserializing a JSON-format CloudEvent to start with. That's not the situation we're talking about. The two scenarios described so far are:
That's definitely not what the spec says at the moment though. (Storing the bytes directly in data_base64 wouldn't be compliant with "If the datacontenttype declares the data to contain JSON-formatted content, a JSON serializer MUST translate the data value to a JSON value, and use the member name data to store it inside the JSON representation." I do believe we've got a problem if we're not consistent: if the result of steps of:
(Or other combinations...) ... comes up with significantly different results depending on the SDK used, I think that's an issue. I'd be open to saying "The JSON format spec is unfortunate here, and we should revisit it (and therefore the Java SDK handling)" but I'm not convinced yet in either direction. But I am convinced there's a problem. |
is this a JSON string? isn't a JSON string something like |
I tried with java SDK for some examples you have showed here (no HTTP deserialization): final CloudEvent[] events = new CloudEvent[]{
CloudEventBuilder.v1()
.withId(UUID.randomUUID().toString())
.withType("foo")
.withSource(URI.create("/foo"))
.withData("application/json", "\"{\"Hello\": \"there\"}\"".getBytes(StandardCharsets.UTF_8))
.build(),
CloudEventBuilder.v1()
.withId(UUID.randomUUID().toString())
.withType("foo")
.withSource(URI.create("/foo"))
.withData("application/json", "{\"Hello\": \"there\"}".getBytes(StandardCharsets.UTF_8))
.build(),
CloudEventBuilder.v1()
.withId(UUID.randomUUID().toString())
.withType("foo")
.withSource(URI.create("/foo"))
.withData("application/json", "{[[[}".getBytes(StandardCharsets.UTF_8))
.build(),
CloudEventBuilder.v1()
.withId(UUID.randomUUID().toString())
.withType("foo")
.withSource(URI.create("/foo"))
.withData("application/json", "\"{[[[}\"".getBytes(StandardCharsets.UTF_8))
.build()
};
for (final CloudEvent ce: events) {
byte[] bytes = getFormat().serialize(ce);
System.out.println(new String(bytes, StandardCharsets.UTF_8));
} output:
It doesn't do any special processing it is just considering it as sequence of bytes. it even outputs invalid JSONs with data = |
I'm not sure what's more wrong but to me the fact that Java SDK serialize to invalid JSON is wrong, however, for the |
I'm not sure the To me the quoted text is very clear... if you have JSON you can stick it directly under As for "bunch of bytes in == bunch of bytes out", the spec doesn't need to say that directly because I believe it's implied in the sense that people should expect their event payload remains unchanged between the sending and receiving of the event. If an SDK breaks this assumption then I think the SDK is broken. @pierDipi I think we may need to be more precise on what the term "JSON string" means and its context. I think we agree that when a string is serialized in JSON then quotes are added (string = So:
to me is passing in a string, and it doesn't become a JSON string until the complete JSON is created. And at that point the string The question is whether |
Ok, I see, Java SDK represents data only as bytes (byte[]), so |
that would imply they expect people to know the format and encode/escape things themselves.... that's interesting. |
Not only people, even things like proto text data to JSON format, the Java SKD loses the quotes: io.cloudevents.v1.proto.CloudEvent protoCe = io.cloudevents.v1.proto.CloudEvent
.newBuilder()
.setSpecVersion("1.0")
.setId(UUID.randomUUID().toString())
.setType("foo")
.setSource("/foo")
.setTextData("hello")
.build();
CloudEvent ce = new ProtoDeserializer(protoCe).read(CloudEventBuilder::fromSpecVersion);
byte[] bytes = new JsonFormat().serialize(ce);
System.out.println(new String(bytes, StandardCharsets.UTF_8)); output:
that's not correct |
Is there javadoc available? I see docs here: https://cloudevents.github.io/sdk-java/api.html but no formal api definition. Or is javadoc too old school :-) |
thanks! |
I don’t consider that a valid usage pattern .. it looks like you’re accessing the raw protobuff message builerd whereas you should be using the CloudEventBuilder.
Ideally the io.cloudevents.v1.proto package shouldn’t be accessible outside the ProtoFormat module :-(
J--
… On Apr 21, 2023, at 9:57 AM, Pierangelo Di Pilato ***@***.***> wrote:
Not only people, even things like proto text data to JSON format, the Java SKD loses the quotes:
io.cloudevents.v1.proto.CloudEvent protoCe = io.cloudevents.v1.proto.CloudEvent
.newBuilder()
.setSpecVersion("1.0")
.setId(UUID.randomUUID().toString())
.setType("foo")
.setSource("/foo")
.setTextData("hello")
.build();
CloudEvent ce = new ProtoDeserializer(protoCe).read(CloudEventBuilder::fromSpecVersion);
byte[] bytes = new JsonFormat().serialize(ce);
System.out.println(new String(bytes, StandardCharsets.UTF_8));
output:
{"specversion":"1.0","id":"b80e46fe-3428-4733-961e-ad56339e55d6","source":"/foo","type":"foo","data":hello}
—
Reply to this email directly, view it on GitHub <#1186 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAQQUTAVVXVGSFK2TKB6D33XCK36TANCNFSM6AAAAAAWVZJICU>.
You are receiving this because you were mentioned.
|
@JemDay, I'm matching the implementation. This part io.cloudevents.v1.proto.CloudEvent protoCe = io.cloudevents.v1.proto.CloudEvent
.newBuilder()
.setSpecVersion("1.0")
.setId(UUID.randomUUID().toString())
.setType("foo")
.setSource("/foo")
.setTextData("hello")
.build();
CloudEvent ce = new ProtoDeserializer(protoCe).read(CloudEventBuilder::fromSpecVersion); public CloudEvent deserialize(byte[] bytes, CloudEventDataMapper<? extends CloudEventData> mapper)
throws EventDeserializationException {
try {
final io.cloudevents.v1.proto.CloudEvent ceProto = io.cloudevents.v1.proto.CloudEvent.parseFrom(bytes);
return new ProtoDeserializer(ceProto).read(CloudEventBuilder::fromSpecVersion);
} catch (InvalidProtocolBufferException e) {
throw new EventDeserializationException(e);
}
} and as you can see, text data is then passed as raw bytes without quotes during deserialization and therefore the Json format just passes the data leading to the invalid JSON: |
Apologies for the delay, I've been out-of-town and I'm trying to keep up on this thread... HTTP is a red-herring, we should be concentrating on the 'format' not the transport. My earlier comment was relating to the builder used in that proto 'example' - it's not the correct one. Application code does-not (or should not) be using those constructs... The CE Builder is the same irrespective of the format being used ... If we look at this:
Produces this output on-the-wire (which adheres to the spec as I understand it):
And then you'd need to have some example code that blindly demarshalls the buffer, and then re-marshals it using the "proto json format" ... this is what the proto Unit Tests do to ensure the wire-format is as expected. -- Apologiers for so many edits, that's me trying to rush an answer to the thread. |
This won't work when |
I may be wrong (probably am) but I don't believe you can call |
Right, you're not wrong, we need something like StringCloudEventData object that is instantiated by protobuf when |
I agree there is some cleanup we can do but I think we need to "step away" from proto for a minute and limit ourselves to the JSON Format implementation as that was the issue at-hand. We need crystal-clear guidance as to whether this is considered valid from a JSON Format specification perspective..
or is the excpectation it's:
|
@jskeet, I tested more or less the same thing for Java [ref]: public class SimpleTests {
private static final Logger LOGGER = Logger.getLogger(SimpleTests.class.getName());
private static final JsonNodeFactory jnf = JsonNodeFactory.instance;
@Test
void jsonSerializeString() throws Exception {
var event =
CloudEventBuilder.v1()
.withId("Issue1186")
.withType("example.type")
.withSource(URI.create("example/uri"))
.withDataContentType("application/json")
.withData("{\"Hello\": \"there\"}".getBytes(StandardCharsets.UTF_8))
.build();
var formatOptions = JsonFormatOptions.builder().forceStringSerialization(true).build();
var format = new JsonFormat(formatOptions);
var serialized = format.serialize(event);
LOGGER.info(() -> String.format("serialized = %s", new String(serialized)));
var mapper = new ObjectMapper();
var parsed = mapper.readTree(serialized);
LOGGER.info(() -> String.format("parsed = %s", parsed));
var dataNode = parsed.get("data");
assertEquals(jnf.objectNode().<ObjectNode>set("Hello", jnf.textNode("there")), dataNode);
}
} That test passes, and the output is:
The output formatted with {
"specversion": "1.0",
"id": "Issue1186",
"source": "example/uri",
"type": "example.type",
"datacontenttype": "application/json",
"data": {
"Hello": "there"
}
} So Java is behaving differently, the specific code that is causing this behaviour is this I would say the behaviour of the Java SDK is correct because the data in this case is essentially just the bytes of spec/cloudevents/formats/json-format.md Lines 153 to 158 in 30f1a08
I will test the same in Python. Not sure if this answers all the questions here, if you would like me to test more things let me know. |
I have now tested similar things in python [ref]. The most direct test in python behaves the same as C#: def test_json_serialize_string() -> None:
event = CloudEvent.create(
{
"specversion": "1.0",
"type": "example.type",
"source": "example/uri",
"datacontenttype": "application/json",
},
data='{"Hello": "there"}',
)
logging.debug("event = %s", event)
event_json = conversion.to_json(event)
logging.debug("event_json = %s", event_json)
parsed = json.loads(event_json)
logging.debug("parsed['data'] = %s", parsed["data"])
assert parsed["data"] == '{"Hello": "there"}' Test output: tests/test_simple.py::test_json_serialize_string
------------------------------------------------------------------------------- live log call -------------------------------------------------------------------------------
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG root test_simple:21:test_json_serialize_string event = {'attributes': {'specversion': '1.0', 'type': 'example.type', 'source': 'example/uri', 'datacontenttype': 'application/json', 'id': '334cb7f8-3f5f-40f4-89e5-3caed0e17e54', 'time': '2023-04-29T14:11:39.531902+00:00'}, 'data': '{"Hello": "there"}'}
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG root test_simple:23:test_json_serialize_string event_json = b'{"specversion": "1.0", "id": "334cb7f8-3f5f-40f4-89e5-3caed0e17e54", "source": "example/uri", "type": "example.type", "datacontenttype": "application/json", "time": "2023-04-29T14:11:39.531902+00:00", "data": "{\\"Hello\\": \\"there\\"}"}'
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG root test_simple:25:test_json_serialize_string pretty event_json =
{
"specversion": "1.0",
"id": "334cb7f8-3f5f-40f4-89e5-3caed0e17e54",
"source": "example/uri",
"type": "example.type",
"datacontenttype": "application/json",
"time": "2023-04-29T14:11:39.531902+00:00",
"data": "{\"Hello\": \"there\"}"
}
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG root test_simple:26:test_json_serialize_string parsed['data'] = '{"Hello": "there"}'
PASSED [ 33%]
I can somewhat accept this behaviour, the obvious challenge with python is that If using a binary value for JSON, instead of a string, the behaviour is different and I would say more correct, or possibly less surprising: def test_json_serialize_bytes() -> None:
event = CloudEvent.create(
{
"specversion": "1.0",
"type": "example.type",
"source": "example/uri",
"datacontenttype": "application/json",
},
data=b'{"Hello": "there"}',
)
logging.debug("event = %s", event)
event_json = conversion.to_json(event)
logging.debug("event_json = %s", event_json)
parsed = json.loads(event_json)
logging.debug("pretty event_json = \n%s", json.dumps(parsed, indent=2))
logging.debug("parsed['data_base64'] = %s", parsed["data_base64"])
logging.debug(
"decoded parsed['data_base64'] = %r",
base64.b64decode(parsed["data_base64"]).decode(),
)
assert parsed["data_base64"] == base64.b64encode(b'{"Hello": "there"}').decode() Test output: tests/test_simple.py::test_json_serialize_bytes
------------------------------------------------------------------------------- live log call -------------------------------------------------------------------------------
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG root test_simple:41:test_json_serialize_bytes event = {'attributes': {'specversion': '1.0', 'type': 'example.type', 'source': 'example/uri', 'datacontenttype': 'application/json', 'id': '8be02a3b-c362-441c-8bd3-56876e1e2810', 'time': '2023-04-29T14:11:39.533045+00:00'}, 'data': b'{"Hello": "there"}'}
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG root test_simple:43:test_json_serialize_bytes event_json = b'{"specversion": "1.0", "id": "8be02a3b-c362-441c-8bd3-56876e1e2810", "source": "example/uri", "type": "example.type", "datacontenttype": "application/json", "time": "2023-04-29T14:11:39.533045+00:00", "data_base64": "eyJIZWxsbyI6ICJ0aGVyZSJ9"}'
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG root test_simple:45:test_json_serialize_bytes pretty event_json =
{
"specversion": "1.0",
"id": "8be02a3b-c362-441c-8bd3-56876e1e2810",
"source": "example/uri",
"type": "example.type",
"datacontenttype": "application/json",
"time": "2023-04-29T14:11:39.533045+00:00",
"data_base64": "eyJIZWxsbyI6ICJ0aGVyZSJ9"
}
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG root test_simple:46:test_json_serialize_bytes parsed['data_base64'] = eyJIZWxsbyI6ICJ0aGVyZSJ9
2023-04-29T16:11:39 1385287 140513442748224 010:DEBUG root test_simple:47:test_json_serialize_bytes decoded parsed['data_base64'] = '{"Hello": "there"}'
PASSED [ 66%] My expectation with Protobuf CloudEvents I guess in every language things will and probably should be a bit different. |
For C#, my conclusion would be, if C# has an analogue to com.fasterxml.jackson.databind.node.TextNode - that this is the wrong behaviour:
If on the other hand, parsing a JSON string (i.e. |
For C#, what is
|
It's defined with a compile-time type of |
Then it seems like |
Circling back to the original comment:
based on my reading (and lack of understanding of proto :-) ), it seems to me that If the golang or java SDKs are doing something different between those two fields (deserializing in one but not the other) then that would seem incorrect because the choice between the two is based on whether it has special characters, NOT because we have two different (de)serialization algorithms at play. |
This issue is stale because it has been open for 30 days with no |
Wandering into this much later, I can see each of the intermediate steps, but this is a deeply weird conversation given that the on-the-wire serialization of I believe that |
This issue is stale because it has been open for 30 days with no |
See: cloudevents/sdk-go#759
From that issue:
On the use of the
text_data
andbinary_data
fields in the protobuf-format spec, it says:The Java SDK's serializer always uses the
text_data
field when thedatacontenttype
isapplication/json
,application/xml
ortext/*
. Otherwise thebinary_data
field is used. However, in the Go SDK's deserializerbinary_data
is treated literally, whiletext_data
gets serialized using the appropriate serializer for thedatacontenttype
. When a CloudEvent with adatacontenttype
ofapplication/json
is serialized into protobuf using the Java SDK and then sent to a service using the Go SDK and deserialized there, this results in the JSON in thetext_data
field being interpreted as a JSON-String when it should be treated as a JSON-Object. So if I serialize this CloudEvent into protobuf using the Java SDK:and then send the resulting protobuf object to a service using the Go SDK and deserialize it there back into JSON format, the result is:
This seems to be unintended behavior, but I as far as I can tell, neither the Java nor the Go SDK implementations of the Protobuf format are technically wrong according to the way the spec is written, so I decided to raise this issue here to get clarification. Is this a bug in one or both of these libraries? Should the Protobuf format spec be more specific in how the
binary_data
andtext_data
fields should be used? Or is this behavior somehow intended?@JemDay FYI
The text was updated successfully, but these errors were encountered: