adding ndjson format #218

Merged
merged 6 commits into from
Oct 9, 2023

Conversation

jose-sherpa
Contributor

@jose-sherpa jose-sherpa commented Sep 6, 2023

The omniparser tool currently outputs a JSON array, so consumers often need another tool or package to stream the output. While I am aware this tool is only meant to produce JSON, there is a JSON-based format called NDJSON, which stands for newline-delimited JSON. It makes it easy to stream-parse and process a sequence of JSON records with no added packages or complexity, since you just read each line and parse the lines one by one. Since a strength of omniparser is stream-parsing large files, we think it makes sense to make the output easily streamable while keeping every record valid JSON. It also results in a smaller file size.

http://ndjson.org/

header/header.go Outdated
@@ -15,6 +15,7 @@ type ParserSettings struct {
Version string `json:"version,omitempty"`
FileFormatType string `json:"file_format_type,omitempty"`
Encoding *string `json:"encoding,omitempty"`
NDJSON bool `json:"ndjson,omitempty"`
Contributor Author

Not sure if this is the type of case preferred in this project

Comment on lines 114 to 116
start := "[\n%s"
middle := ",\n%s"
end := "\n]"
Contributor Author

Might be able to come up with better names or create a method on parser settings that returns a struct that encapsulates these variables

Owner

Probably lparen, rparen, and delim might be better names?

I'm fine with the current implementation you proposed, but we do need unit tests; we try very hard to keep coverage at 100%.

I'm thinking about adding a utility to https://github.com/jf-tech/go-corelib/tree/master/jsons, which omniparser uses extensively: a JSON writer that encapsulates the functionality you implement here. But that's a later optimization/refactoring, not needed this time. Just unit test coverage.

@jose-sherpa
Contributor Author

jose-sherpa commented Sep 6, 2023

If accepted I will add the tests and refactor

@codecov

codecov bot commented Sep 6, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (79a540b) 100.00% compared to head (b26cb6d) 100.00%.

Additional details and impacted files
@@            Coverage Diff            @@
##            master      #218   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           53        53           
  Lines         3041      3041           
=========================================
  Hits          3041      3041           


@jose-sherpa
Contributor Author

Oh and just curious @jf-tech , have you thought about updating the go version? From my experience it's completely backwards compatible and has had some good improvements on the performance side and new methods

@jf-tech
Owner

jf-tech commented Sep 7, 2023

@jose-sherpa

  1. if the desired effect is just to change the output from the command line tool, not the actual output format (which is currently standard JSON) from the Transform.Read() interface, then I would be against putting the setting into the schema. It seems to be more of a command line option that would be the better place for the switch. What do you think?

    Yes, maybe in the future, we actually want to introduce different output formats for the transform interface, offering callers the choice of the current JSON, and maybe protobuf, or XML, or something else. At that time, we need to introduce the proper ParserSettings option and an extensible and pluggable mechanism for hooking up various writers. But doesn't seem to be the case in your request.

  2. As for go version, we pinned our go version to 1.14 against the underlying goja (the javascript engine version) at that time because that was the version we thoroughly tested against. Now most likely upgrading to a higher version is fine, but that might unknowingly break other dependent customers who might still be running on a lower go version. So our philosophy is unless we start to use some go features only available in a higher version, we will stick with the current one, well, until sometime in the future it's too antiquated or Google simply stops supporting it.

Thoughts? And again, thanks for chiming in and contributing!! Love it!

@jose-sherpa
Contributor Author

Thanks for the prompt reply! I think it makes sense to have it as a command line option, I'll make that change!

@@ -39,6 +40,8 @@ func init() {

transformCmd.Flags().StringVarP(
&input, "input", "i", "", "input file (optional; if not specified, stdin/pipe is used)")
transformCmd.Flags().BoolVarP(
&ndjson, "ndjson", "", false, "change the output format to ndjson")
Contributor Author

@jf-tech is this what you meant with a command line option or did you want something like --format?

Owner

how about changing it to a long flag only --stream. By default or not specified, it's false. The flag doesn't have a short form, only the --stream long form.

Owner

@jf-tech jf-tech left a comment

overall looking good. thanks for contributing. please do add unittest coverage.


Comment on lines 92 to 99

if ndjson {
return string(b), nil
}

return strings.Join(
strs.NoErrMapSlice(
strings.Split(jsons.BPJ(string(b)), "\n"),
Owner

Looks like the code can be optimized a little here:

s := string(b)
if stream {
    return s, nil
}
return strings.Join(
	strs.NoErrMapSlice(
		strings.Split(jsons.BPJ(s), "\n"),

since string(b) needs to be done no matter what, do it early on.


@jose-sherpa
Contributor Author

@jf-tech I'll make the changes and add tests, thank you!

@jose-sherpa
Contributor Author

jose-sherpa commented Sep 21, 2023

@jf-tech I'm looking through the tests and I think this should be in extensions/omniv21/samples; however, when looking in there, it seems to use SampleTestCommon, which has its own logic for producing the JSON output. Is there a different test that tests through the command? Or should I move the output logic into separate functions that SampleTestCommon can use?

Maybe something like this will allow multiple places to use the same writing that the transform command uses:

package writer

import (
	"github.com/jf-tech/go-corelib/jsons"
	"github.com/jf-tech/go-corelib/strs"
	"github.com/jf-tech/omniparser"
	"io"
	"strings"
)

type Writer interface {
	Write(printf func(string, ...any) (int, error), println func(...any) (int, error)) error
}

func NewJSON(transform omniparser.Transform) Writer {
	return &writer{
		LParen: "[\n%s",
		Delim:  ",\n%s",
		RParen: "\n]",
		Empty:  "[]",
		DoOne: func() (string, error) {
			b, err := transform.Read()
			if err != nil {
				return "", err
			}

			return strings.Join(
				strs.NoErrMapSlice(
					strings.Split(jsons.BPJ(string(b)), "\n"),
					func(s string) string { return "\t" + s }),
				"\n"), nil
		},
	}
}

func NewNDJSON(transform omniparser.Transform) Writer {
	return &writer{
		LParen: "%s",
		Delim:  "\n%s",
		RParen: "",
		Empty:  "",
		DoOne: func() (string, error) {
			b, err := transform.Read()
			if err != nil {
				return "", err
			}

			return string(b), nil
		},
	}
}

type writer struct {
	LParen string
	Delim  string
	RParen string
	Empty  string
	DoOne  func() (string, error)
}

func (w *writer) Write(printf func(string, ...any) (int, error), println func(...any) (int, error)) error {
	record, err := w.DoOne()
	if err == io.EOF {
		_, _ = println(w.Empty)
		return nil
	}
	if err != nil {
		return err
	}

	_, _ = printf(w.LParen, record)
	for {
		record, err = w.DoOne()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		_, _ = printf(w.Delim, record)
	}
	_, _ = println(w.RParen)
	return nil
}

which reduces doTransform to:

func doTransform() error {
	schemaName := filepath.Base(schema)
	schemaReadCloser, err := openFile("schema", schema)
	if err != nil {
		return err
	}
	defer schemaReadCloser.Close()

	inputReadCloser := io.ReadCloser(nil)
	inputName := ""
	if strs.IsStrNonBlank(input) {
		inputName = filepath.Base(input)
		inputReadCloser, err = openFile("input", input)
		if err != nil {
			return err
		}
		defer inputReadCloser.Close()
	} else {
		inputName = "(stdin)"
		inputReadCloser = os.Stdin
		// Note we don't defer Close() on this since os/golang runtime owns it.
	}

	schema, err := omniparser.NewSchema(schemaName, schemaReadCloser)
	if err != nil {
		return err
	}

	transform, err := schema.NewTransform(inputName, inputReadCloser, &transformctx.Ctx{})
	if err != nil {
		return err
	}

	var w writer.Writer
	if ndjson {
		w = writer.NewNDJSON(transform)
	} else {
		w = writer.NewJSON(transform)
	}

	return w.Write(fmt.Printf, fmt.Println)
}

@jf-tech
Owner

jf-tech commented Sep 30, 2023

@jose-sherpa sorry for the delay. We were out/offline for several days.

First, I'm not seeing the updated PR. Did you push your updates (based on my feedback) to origin?
Second, I just checked: the cli dir actually has no tests, because its original purpose was to provide a quick CLI tool for interactive schema authoring and an HTTP endpoint for the playground (which was retired last year).

So I would say don't worry about the tests for changes in this directory. Just push your latest updates to the PR and we'll take a look again and approve.

@jose-sherpa
Contributor Author

@jf-tech hey sorry I was out of town, yeah I made the changes you suggested unless you see I missed something. The only commit I did not push was the commit containing the changes I outlined in my previous comment. If you like those changes let me know and I can push that commit

@@ -31,6 +31,7 @@ var (
}
schema string
input string
ndjson bool
Owner

thought we said changing ndjson to stream?

Contributor Author

Ah ok I changed the command option name but not in code, I'll make that change now

@@ -39,6 +40,8 @@ func init() {

transformCmd.Flags().StringVarP(
&input, "input", "i", "", "input file (optional; if not specified, stdin/pipe is used)")
transformCmd.Flags().BoolVarP(
&ndjson, "stream", "", false, "change the output format to ndjson")
Owner

change the argument help display text to "if specified, each record will be a standalone/full JSON blob and printed out immediately once transform is done"

Owner

again, ndjson -> stream

@@ -86,22 +89,42 @@ func doTransform() error {
if err != nil {
return "", err
}

s := string(b)
if ndjson {
Owner

ditto

func(s string) string { return "\t" + s }),
"\n"), nil
}

record, err := doOne()
if err == io.EOF {
fmt.Println("[]")
if ndjson {
Owner

ditto

Comment on lines 107 to 111
if ndjson {
fmt.Println("")
} else {
fmt.Println("[]")
}
Owner

well this is just written weirdly, why not something in line with:

if !stream {
    println("[]")
}

Contributor Author

Println writes a new line even with a blank string, should we just leave it blank?

lparen := "[\n%s"
delim := ",\n%s"
rparen := "\n]"
if ndjson {
Owner

stream

@@ -39,6 +40,8 @@ func init() {

transformCmd.Flags().StringVarP(
&input, "input", "i", "", "input file (optional; if not specified, stdin/pipe is used)")
transformCmd.Flags().BoolVarP(
&stream, "stream", "", false, "change the output format to ndjson")
Owner

i thought i had a comment on the argument help text?

@jf-tech jf-tech merged commit dd04a11 into jf-tech:master Oct 9, 2023
3 checks passed