Discussion on CloudEvent data and transcoding #1204
Comments
@duglin: If you could add this to the agenda for tomorrow's meeting, that would be useful. I hope this write-up helps... |
re: Example 5 section:
If you're ok with 5a then I think you have to be ok with 5b too since I believe they're basically the same - both have
I'm not sure this changes much for 5a and 5b since both treat it as bytes so whether we use
re: Example 6 section:
I think this is mostly correct. I view it this way:
re: Example 7 section:
re: Example 8 section:
re: Example 8 section:
Net of all of this:
On the 5/25 call @jskeet agreed to write up a summary/proposed-next-steps or perhaps a PR... to try to focus the discussion |
Okay, having read through this again to try to summarize it:
My personal "grand conclusion" is that data formats probably should be able to insist that implementations are aware of some data content types and handle them in a particular way, e.g. inferring that if a JSON format is presented with text data and told it's JSON, it should parse it as JSON. That will be a breaking change for many SDKs, I suspect, so we need to think about it really carefully. There is one thing that I think we can discuss concretely:
I disagree with this. Just because an attribute is optional doesn't mean a receiver should be able to ignore it if it's provided. The JSON format explicitly talks about what should happen if the content type is indicated to be JSON - so I'd argue that any SDK receiver which implements the JSON format but ignores the content type is violating the spec. Maybe that's a good place to start, in order to chip away at this... |
From our specs:
For clarity, when a feature is marked as "OPTIONAL" this means that it is OPTIONAL for both the Producer and Consumer of a message to support that feature. In other words, a producer can choose to include that feature in a message if it wants, and a consumer can choose to support that feature if it wants. A consumer that does not support that feature is free to take any action it wishes, including no action or generating an error, as long as doing so does not violate other requirements defined by this specification. However, the RECOMMENDED action is to ignore it. The producer SHOULD be prepared for the situation where a consumer ignores that feature. An Intermediary SHOULD forward OPTIONAL attributes.
Rereading some of the previous comments, I'm still thinking part of the solution might be what I said:
However, we could choose to be smart about this and say: if the transcoder supports a format (JSON in this case, and we know this because it's going to output JSON), then it MUST therefore understand a "datacontenttype" of that same format, so it knows it's not just an array of bytes, it's a serialized JSON object and therefore it needs to convert it into a JSON object and serialize it as such - not as an array of bytes.
and we could write this guidance in a generic way to handle any format. |
Eek... I hadn't noticed that aspect of "optional" before. I think that was a mistake :( Optionality of understanding/processing and optionality of including should be entirely separated IMO. Too late now... |
https://datatracker.ietf.org/doc/html/rfc2119#section-5 I think it's the: |
Hmm... I view "optional within data" to be very, very different from "this is an optional feature which may or may not be implemented in a conformant platform". For example, |
I think part of the reason we landed where we did is that CE (for the most part) is a format-spec, unlike other specs that control semantics. For example, in the xRegistry spec you'll see statements about what a receiver MUST do when an OPTIONAL attribute appears in a message. At that point it's optional to appear, but the semantics of it (when present) are not. We don't have a lot of words in CE around what a receiver does with a CE once it receives it.
One thing we could consider is adding more normative language to SDK.md to ensure they behave the way we want them to. But then we're working on an "SDK spec" and not the "CE spec".
As for this issue... perhaps what we're looking for is something closer to guidance right now, and maybe that'll turn into RFC2119 language at some point. For example, maybe we start with describing what a CE receiver should be doing if it wants to understand For example:
Once we agree on that, and hopefully it's generic and not format specific, we can then decide if any of that should be normative in a spec(s). Maybe? |
This issue is stale because it has been open for 30 days with no |
/remove-lifecycle stale Still hoping to find time to write up next steps. |
/remove-lifecycle stale I'll get back to this some day... |
This "issue" is to record discussions/thoughts on the nature of CloudEvent data, in the hope that it will help us to resolve #1186.
Each example is numbered for ease of reference later. The terms "data" and "payload" are used interchangeably, as sometimes this is helpful for disambiguation.
What is the data/payload of a CloudEvent?
The spec is deliberately hands-off about the nature of CloudEvent data:
(As an aside, "encapsulated within data" doesn't really mean much now that data isn't an attribute. We should do some clean-up at some point.)
So the payload of a CloudEvent is generally opaque. It must be representable as a sequence of bytes, in order to be represented in binary mode in HTTP at least. ("Binary mode" doesn't define what a "message body" is, but in HTTP we need to be able to encode it as bytes. "Message bodies" for other transports may require text, which presumably means that binary mode for those transports has to specify how non-text CloudEvent data would be represented in the body.)
While it may sound like a truism that "data must be representable as a sequence of bytes", it's not entirely straightforward, as it requires that a serialized representation be chosen. When a CloudEvent is created within an event producer, the data that's intended to be represented (for example "an object in memory") may well not have a single "natural" serialized form. (There may be multiple representations available, or one may need to be created just for the purpose of encoding the data as a CloudEvent.)
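To make the "no single natural serialized form" point concrete, here is a minimal sketch. The `order` object is hypothetical and not taken from this discussion; it only shows that one in-memory value can yield different byte sequences that are all valid encodings of it.

```python
import json

# Hypothetical in-memory payload, used purely for illustration.
order = {"id": 7, "items": ["tea", "milk"]}

# Two equally valid JSON serializations of the same object.
compact = json.dumps(order, separators=(",", ":")).encode("utf-8")
pretty = json.dumps(order, indent=2, sort_keys=True).encode("utf-8")

# The byte sequences differ, yet both decode to the same value.
same_bytes = compact == pretty
same_value = json.loads(compact) == json.loads(pretty) == order
```

A producer has to pick one of these (or some non-JSON form entirely) before the data can become the body of a binary-mode message.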
There's also room for some interpretation when it comes to "It is encoded into a media format which is specified by the datacontenttype attribute". What guarantees/constraints are present in terms of the validity of that encoding?
CloudEvent validity
Let's consider the following structured-mode event in the JSON format, which is only changed very slightly from an example in the JSON format spec:
Example 1: Invalid XML in JSON-formatted event
The same event could certainly be represented in binary mode, e.g. in HTTP, where the message body would be the UTF-8-encoded bytes of:
Is this a valid CloudEvent? Should it be accepted or rejected by processors?
It's fine in every way except one: the data isn't valid for the declared content type, because the & isn't escaped.
Jon's opinion: this is valid at the "CloudEvents spec" level, but invalid at a data-processor level. It would be reasonable for a CloudEvent processor which tried to use the data to reject it.
Rationale:
There are many ways in which event data may be invalid in an application-specific way, beyond the content-type-level validity shown above. (Imagine database constraints being violated, for example.)
Note that this is entirely separate from spec-level-invalid or format-level-invalid CloudEvents.
Example 2: Invalid JSON-formatted event (empty id)
Example 3: Invalid JSON-formatted event (invalid JSON)
Example 4: Invalid JSON-formatted event (valid JSON, invalid data type for an extension attribute)
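The distinction between spec-level validity and data-level validity can be sketched as two separate checks. This is only an illustration: the event below is hypothetical (it is not the event from Example 1), and `check_envelope` and `check_xml_data` are made-up helper names, not part of any SDK.

```python
import xml.etree.ElementTree as ET

def check_envelope(event: dict) -> list[str]:
    """Spec-level check: required context attributes must be non-empty strings."""
    problems = []
    for name in ("specversion", "id", "source", "type"):
        value = event.get(name)
        if not isinstance(value, str) or value == "":
            problems.append(f"missing or empty attribute: {name}")
    return problems

def check_xml_data(event: dict) -> list[str]:
    """Data-level check, relevant only to a processor that consumes the XML."""
    try:
        ET.fromstring(event["data"])
        return []
    except ET.ParseError as exc:
        return [f"data is not well-formed XML: {exc}"]

# Hypothetical event: the envelope is fine, but the '&' in the data is unescaped.
event = {
    "specversion": "1.0",
    "id": "A234-1234-1234",
    "source": "/example/source",
    "type": "com.example.someevent",
    "datacontenttype": "application/xml",
    "data": "<note>salt & pepper</note>",
}

envelope_problems = check_envelope(event)  # spec level: nothing to report
data_problems = check_xml_data(event)      # data-processor level: one problem
```

Under the view described above, only the first check belongs to a generic CloudEvents implementation; the second belongs to whichever component actually tries to use the data.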
What is being represented?
The upshot of all of the above is that in general, the CloudEvent spec does not have a firm opinion of what the data in a CloudEvent "means". However, in structured mode, event formats effectively become opinionated about the meaning of data in some cases.
For example:
Transcoding
It is desirable to be able to convert a CloudEvent from one representation to another, e.g. "structured JSON to structured XML", "structured protobuf to binary" or "binary to structured JSON". This is where the difference in "opinion" causes problems.
The datacontenttype of the CloudEvent is basically the only common information that can be used to inform transcoding - and it feels reasonable for it to do so. Let's look at some examples to try to agree on correct behavior.
Note that this section is not intended to impose constraints on SDKs. Some SDKs may support explicit transcoding operations, some may effectively do so by "decode in one format, encode the result in a different format" - or that may lead to issues. But until we agree on what the right result of transcoding is, it's harder to work out how it should appear in SDKs.
Example 5: transcoding binary to structured JSON, no content-type
HTTP request (where the body after the blank line is UTF-8-encoded text, but that isn't present in the HTTP request header):
Option 5a: no inference
Option 5b: infer text
Option 5c: infer JSON
Relevant parts of the JSON format spec:
Without any content type at all, should the implementation determine that the type of data is Binary?
Jon's opinion: really unclear; either 5a or 5c seems reasonable.
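One possible reading of the three options, as a rough sketch: 5a keeps the bytes opaque via the JSON format's data_base64 member, 5b decodes the bytes as text, and 5c parses them as JSON. The function name `to_structured_json`, the attribute values and the body are all hypothetical, and the UTF-8 assumption in 5b is exactly the kind of inference under debate.

```python
import base64
import json

def to_structured_json(attrs: dict, body: bytes, option: str) -> dict:
    """Build a JSON-format envelope from a binary-mode body with no content type."""
    event = dict(attrs)
    if option == "5a":    # no inference: keep the bytes opaque
        event["data_base64"] = base64.b64encode(body).decode("ascii")
    elif option == "5b":  # infer text: assumes the body is UTF-8
        event["data"] = body.decode("utf-8")
    elif option == "5c":  # infer JSON: assumes the body parses as JSON
        event["data"] = json.loads(body)
    else:
        raise ValueError(f"unknown option: {option}")
    return event

# Hypothetical attributes and body, not the ones from the HTTP request above.
attrs = {"specversion": "1.0", "id": "1", "source": "/example", "type": "com.example.test"}
body = b'{"greeting": "hello"}'
results = {opt: to_structured_json(attrs, body, opt) for opt in ("5a", "5b", "5c")}
```

The three envelopes decode to different things on the receiving side (bytes, a string, and an object), which is why the choice matters even though the input bytes are identical.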
Corollaries:
Example 6: Transcoding from structured protobuf to structured JSON, for JSON content
This is the example at the heart of #1186.
Note: this uses the JSON representation of the protobuf message. Don't get confused between the two!
Protobuf-format:
Transcoded JSON format:
Option 6a: (data is a JSON object)
Option 6b: (data is a JSON string)
Jon's opinion: Here the data content type says it's JSON, so it's reasonable for the transcoding operation to end up with a JSON object as the result (so option 6a). (Note that option 6b is what at least some SDKs will come out with at the moment.)
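The difference between options 6a and 6b comes down to whether the transcoder consults the content type. A minimal sketch, assuming the protobuf event carries its payload as text; `transcode_text_data` and `is_json_content_type` are hypothetical helpers and the payload is invented for illustration.

```python
import json

def is_json_content_type(content_type: str) -> bool:
    """Rough check for JSON media types: application/json or a +json suffix."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type.endswith("/json") or media_type.endswith("+json")

def transcode_text_data(text_data: str, content_type: str, content_type_aware: bool):
    """Return the value to place in the JSON envelope's "data" member."""
    if content_type_aware and is_json_content_type(content_type):
        return json.loads(text_data)  # option 6a: embed the parsed JSON value
    return text_data                  # option 6b: embed the text as a JSON string

text_data = '{"message": "hi"}'  # hypothetical text payload from the protobuf event
option_6a = json.dumps({"data": transcode_text_data(text_data, "application/json", True)})
option_6b = json.dumps({"data": transcode_text_data(text_data, "application/json", False)})
```

With the content-type-aware path, a text payload of `"hello"` (including the quotes) becomes the JSON string hello, which is the quoting asymmetry the protobuf side has to account for.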
To get "data": "hello" in the JSON format, we'd need "textData": "\"hello\"" in the protobuf format
If datacontenttype hadn't been specified, would the result be the same? (The "assume it's JSON" part would be separated from the original event creation, only occurring at transcoding time...)
Example 7: Transcoding from structured JSON to binary (numeric data)
Initial JSON-formatted event:
What should the binary mode encoding of this event be?
Option 7a: Encode as text
Treat the value as "it's a number, let's just encode it as text". This leads to further questions of:
Option 7b: Fail to deserialize from JSON
The JSON format spec states:
Jon's opinion: option 7b seems safe and reasonable here.
Example 8: Transcoding from structured JSON to binary (text data)
Initial JSON-formatted event:
What should the binary mode encoding of this event be?
Option 8a: Encode as text
We've got text, we can encode it that way, only needing to choose the encoding. (It's probably reasonable to assume UTF-8, but we should document that.)
Option 8b: Fail to serialize as it's invalid XML
Either initial deserialization could fail, or serialization to binary mode (if that's a separate step) could fail, because "Not XML" is not a valid XML document.
Option 8c: Coerce into valid XML
An implementation could transcode the data into some made-up element name, e.g. <event>Not XML</event>.
Jon's opinion: option 8a seems appropriate here. The JSON format has no knowledge of XML, and it should only concern itself with content types it actually knows about. (As for option 8c... please no!)
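The preferences stated for Examples 7 and 8 (7b and 8a) can be combined into one rule, sketched below. This assumes the events in both examples declare a non-JSON content type; `binary_body` and `TranscodeError` are hypothetical names, and UTF-8 is an assumed encoding that would need documenting.

```python
import json

class TranscodeError(ValueError):
    """Raised when JSON-format data cannot be mapped to a binary-mode body."""

def is_json_content_type(content_type: str) -> bool:
    """Rough check for JSON media types: application/json or a +json suffix."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type.endswith("/json") or media_type.endswith("+json")

def binary_body(data, content_type: str) -> bytes:
    """Map the "data" member of a JSON-format event to a binary-mode body."""
    if is_json_content_type(content_type):
        # JSON content: serialize whatever JSON value was present.
        return json.dumps(data).encode("utf-8")
    if isinstance(data, str):
        # Option 8a: pass text through without validating it against the content type.
        return data.encode("utf-8")  # UTF-8 is an assumption
    # Option 7b: a non-string JSON value with a non-JSON content type is rejected.
    raise TranscodeError(f"non-string data with content type {content_type!r}")

xml_body = binary_body("Not XML", "application/xml")  # accepted, not validated as XML
```

Note that the function never inspects whether the text is valid for the declared type, which matches the idea that the JSON format should only concern itself with content types it knows about.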
Example 9: Transcoding from structured protobuf to binary, protobuf message data
Protobuf-format:
Option 9a: serialize the Any
Just serialize the value of the proto_data field.
Option 9b: serialize the data
Just use the proto_data.value field (which is already a bytes value).
Jon's opinion: 9b is consistent with normal protobuf transports, where the message type is effectively part of a side-channel, e.g. implicit in the RPC being invoked via gRPC. On the other hand, losing data always feels odd.
What's next?
After discussion of the right result of these examples of transcoding, we can work out the implications for specs and SDKs. Expected changes: