Enhancement: process each JSON Line separately #67

Open
smammy opened this issue May 23, 2023 · 3 comments

smammy commented May 23, 2023

I'd like to have a switch (say, -L) that would cause jello to evaluate QUERY once per JSON line in the input. I'm not sure if this would fit in with the jello philosophy, but it sure would help me eliminate CPython startup time (and shell boilerplate) while avoiding memory bloat.

I think the JSON Line is a natural chunk size, because it avoids the problem of having to specify the chunk size (cf. ijson's "prefix" handling).

Some contrived examples, in fish shell:

# OLD
for url in $my_data_urls
    curl $url | jello _.haystack.needle
end

# NEW
curl $my_data_urls | jello -L _.haystack.needle

# OLD
find . -type f -name \*.json -print0 | while read -z jsonfile
    cat $jsonfile | jello _.haystack.needle
end

# NEW
find . -type f -name \*.json -exec cat '{}' + | jello -L _.haystack.needle

Think of it as analogous to Perl's -p switch if that helps.
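
To make the requested behavior concrete, here is a minimal standard-library sketch of "evaluate the query once per JSON Line". It is not jello's implementation: the function name `query_each_line` and the sample data are hypothetical, and the query is modeled as an ordinary Python callable.

```python
import io
import json
from typing import Callable, Iterable, Iterator


def query_each_line(stream: Iterable[str], query: Callable) -> Iterator:
    """Parse one JSON Line at a time and apply `query` to that record only."""
    for line in stream:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        yield query(json.loads(line))


# Hypothetical two-record JSON Lines input.
sample = io.StringIO(
    '{"haystack": {"needle": 1}}\n'
    '{"haystack": {"needle": 2}}\n'
)
results = list(query_each_line(sample, lambda _: _["haystack"]["needle"]))
print(results)
```

Because the function is a generator, each result is available as soon as its line has been read, and only one parsed record needs to be alive at a time.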

kellyjonbrazil (Owner) commented May 24, 2023

Interesting - so jello currently slurps multiple JSON Lines documents into an array. With the -L option, instead of applying the query to the entire document, it would loop through the objects and apply the query?

I guess it's not as pretty, but couldn't this already be done within jello by using a for loop or list comprehension? Something like:

# OLD
for url in $my_data_urls
    curl $url | jello _.haystack.needle
end

# NEW
curl $my_data_urls | jello '[x.haystack.needle for x in _]'

smammy (Author) commented May 26, 2023

Yeah, that's correct. However, if I understand correctly, curl $my_data_urls | jello '[x.haystack.needle for x in _]' could potentially use a lot more memory, and won't start outputting results until the entire stream is processed.
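
A small sketch of the latency difference described above, under the assumption that slurping means "parse every line into a list, then run the query". The `lines` generator and the `consumed` counter are hypothetical stand-ins for an input stream; nothing here calls jello itself.

```python
import json
from typing import Iterator

consumed = []  # records which input lines have been read so far


def lines() -> Iterator[str]:
    """Hypothetical input stream of three JSON Lines."""
    for n in range(3):
        consumed.append(n)
        yield json.dumps({"haystack": {"needle": n}})


# Slurp: every line is parsed and held in memory before the query runs.
slurped = [json.loads(line) for line in lines()]
slurp_result = [x["haystack"]["needle"] for x in slurped]
lines_read_before_first_slurp_output = len(consumed)

# Stream: each line is parsed and queried in turn, then can be released.
consumed.clear()
stream = (json.loads(line)["haystack"]["needle"] for line in lines())
first = next(stream)
lines_read_before_first_stream_output = len(consumed)

print(lines_read_before_first_slurp_output, lines_read_before_first_stream_output)
```

In the slurp case the whole input has been read before the first result exists; in the streaming case only one line has.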

So implementing -L would get you the memory and latency savings of #53, but only in the case where your input is JSONL rather than one large JSON structure.

The downside would be that with -L there'd be no way to, say, calculate summary statistics, since they depend on preserving variables across lines. So there'd still be applications for the current way that JSON Lines are handled.
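
As an illustration of that downside, the sketch below computes a mean over hypothetical sample data in both modes. It assumes a per-line query sees exactly one record and keeps no state between lines; the variable names are made up for this example.

```python
import json
from statistics import mean

# Hypothetical JSON Lines input with two records.
jsonl = '{"haystack": {"needle": 2}}\n{"haystack": {"needle": 4}}\n'

# Slurp mode: the query sees all records at once, so aggregates work.
records = [json.loads(line) for line in jsonl.splitlines()]
overall_mean = mean(r["haystack"]["needle"] for r in records)

# Per-line mode: each evaluation sees a single record, so a "mean"
# is just that record's own value and nothing carries over.
per_line_means = [
    mean([json.loads(line)["haystack"]["needle"]])
    for line in jsonl.splitlines()
]

print(overall_mean, per_line_means)
```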

I might implement this locally and see if it's actually useful. :-)

kellyjonbrazil (Owner) commented

Ah, I see! Yes, this is an option I've been kicking around as a possible alternative to full stream-based JSON processing.
