Question about efficiency (lazy loading) #111

Open
stijndcl opened this issue Nov 30, 2024 · 3 comments

@stijndcl

This issue is more of a question about the way the library works. If the answer is that this is currently not supported, consider it a feature request instead.

JsonPath::query() seems to return a Vec<Value>, which to me suggests that it collects all results from a query at once and returns that.

When querying very large files (e.g. a 100 GB JSON file that doesn't fit in memory), or when only a small part of the result is needed, this is not always ideal. For example, if I only want the first or second element, there's no point in storing all the others, or even in evaluating the query for them. Similarly, if I want to fetch all the items, it would be useful to loop over an Iterator<Item = Value> that retrieves more results progressively as I request them.

Does this library have support for such queries? I'm essentially looking for a LazyJsonPath or JsonPath::lazy_query() or whatever that only fetches the results I actually want, only when I request them, rather than everything upfront.

Examples

Example 1: I only want the first element, so I'd expect query() to stop immediately once it finds a result.

let very_large_json = json!([]); // Imagine this is a 100 GB list

// This should not fetch more than 1 result
JsonPath::parse("$..name")?.query(&very_large_json).first()

Example 2: I only want the second element, so I'd expect query() to stop immediately once it finds two results. It should not store the first match, just skip over it and increment a counter, since I don't need it.

let very_large_json = json!([]); // Imagine this is a 100 GB list

// This should not store more than 1 result in memory, and not fetch more than 2
JsonPath::parse("$..name")?.query(&very_large_json).get(1)

Example 3: I want all results, but they don't fit in memory all at once, so they should be streamed instead.

let very_large_json = json!([]); // Imagine this is a 100 GB list, streamed from a file instead of held in memory

// This should not fetch everything at once
for item in JsonPath::parse("$..name")?.query(&very_large_json) {
    // do something
}

@hiltontj
Owner

hiltontj commented Nov 30, 2024

TL;DR: what you describe is not currently supported.

> JsonPath::query() seems to return a Vec<Value>, which to me suggests that it collects all results from a query at once and returns that.

That is correct, though it's worth noting that the output is a Vec of references, not owned values, i.e., Vec<&Value>; it returns references to the internal nodes of the serde_json::Value.

Although it would be possible to have an iterator-based approach that only yields one &Value at a time, the bigger problem might be that serde_json_path relies on serde_json having parsed the entire JSON into a Value up front.

So, I guess, there are two potential feature requests out of this:

  1. couple the JSON parsing with the JSONPath query processing (which currently runs against the already-parsed JSON), so that query results can be yielded in a streaming fashion as the JSON is parsed
  2. build an Iterator-based query processor into the existing JsonPath type, so that the caller can decide which or how many elements to yield, among other things that come with Iterator

I suppose that 2) is possible; 1), however, will be quite challenging, as it would probably need to be an entirely new library/API and would need to handle the JSON parsing itself. For 1), you might still be able to re-use the JSONPath parser from serde_json_path, though.

@stijndcl
Author

stijndcl commented Nov 30, 2024

Makes sense, thanks. Do you think an intermediary solution for the second could be possible?

That is, if the whole JSON document fits into memory and there are a lot of matches, but I only want the Nth one, could the query stop looking once it has found it?

You could basically have the same API as the current NodeList (first, get(index), slice, ...) but just stop evaluating once the requirements are met. No explicit Iterator that you can lazily query with, but rather a "fetch what I want and then stop"
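As a rough illustration of the "fetch what I want and then stop" idea, the sketch below walks an in-memory tree and returns as soon as the Nth match is reached, without storing the earlier matches. It is only a sketch under stated assumptions: `Json` is a tiny stand-in for serde_json::Value, the query is hard-coded to a descendant key lookup in the spirit of `$..name`, and `find_nth` and `nth_match` are hypothetical names, not part of serde_json_path.

```rust
/// Minimal stand-in for `serde_json::Value` (assumption: only three variants).
#[derive(Debug, PartialEq)]
enum Json {
    Str(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Hypothetical helper: returns the match that is reached when `remaining`
/// hits zero. Earlier matches are only counted, never stored.
fn find_nth<'a>(node: &'a Json, key: &str, remaining: &mut usize) -> Option<&'a Json> {
    match node {
        Json::Object(members) => {
            // First look at this object's own members.
            for (k, v) in members {
                if k.as_str() == key {
                    if *remaining == 0 {
                        return Some(v); // stop: nothing else is visited
                    }
                    *remaining -= 1; // skip this match, keep counting
                }
            }
            // Then descend into each member value.
            for (_, v) in members {
                if let Some(hit) = find_nth(v, key, remaining) {
                    return Some(hit);
                }
            }
            None
        }
        Json::Array(items) => {
            for item in items {
                if let Some(hit) = find_nth(item, key, remaining) {
                    return Some(hit);
                }
            }
            None
        }
        Json::Str(_) => None,
    }
}

/// Hypothetical entry point: the `n`-th (0-based) value stored under `key`.
fn nth_match<'a>(root: &'a Json, key: &str, n: usize) -> Option<&'a Json> {
    let mut remaining = n;
    find_nth(root, key, &mut remaining)
}
```

Calling `nth_match(&doc, "name", 1)` would return the second match, or `None` if there are fewer than two. A real implementation would have to do the same for every JSONPath segment and selector type, which is where most of the work lies.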

I understand that the first one (streaming solution) is a lot of work though. Was hoping Serde would have enough support for it already, but I'm not that familiar with the crate.

@hiltontj
Owner

hiltontj commented Dec 1, 2024

> Do you think an intermediary solution for the second could be possible?

It is certainly possible, but it would need a separate API, so it would still take some work to figure out the implementation. The current implementation takes the input JSONPath string, parses it into an abstract syntax tree (which is what the JsonPath type represents), then uses that along with a serde_json::Value to recurse down the serde_json::Value data structure and produce the nodes in the query result.

It uses a recursive approach which has the advantage of not needing to hold onto any state to produce a correct result. But the downside is that the resulting API can't provide the efficiency wins that you're describing.
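To make the trade-off concrete, here is a minimal sketch of an eager recursive evaluator. It is not the serde_json_path implementation: `Json` is a small stand-in for serde_json::Value, the query is hard-coded to a descendant key lookup in the spirit of `$..name`, and `collect_descendants` and `query_all` are illustrative names. The recursion keeps no state beyond the call stack, but the caller only sees a result once the whole tree has been walked.

```rust
/// Minimal stand-in for `serde_json::Value` (assumption: only three variants).
#[derive(Debug, PartialEq)]
enum Json {
    Str(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Pushes every value stored under `key`, at any depth, into `out`.
/// Matches on the current node come first, then the walk descends.
fn collect_descendants<'a>(node: &'a Json, key: &str, out: &mut Vec<&'a Json>) {
    match node {
        Json::Object(members) => {
            for (k, v) in members {
                if k.as_str() == key {
                    out.push(v);
                }
            }
            for (_, v) in members {
                collect_descendants(v, key, out);
            }
        }
        Json::Array(items) => {
            for item in items {
                collect_descendants(item, key, out);
            }
        }
        Json::Str(_) => {}
    }
}

/// Eager query: returns references into `root`, but only after visiting
/// every node, even if the caller wants just the first match.
fn query_all<'a>(root: &'a Json, key: &str) -> Vec<&'a Json> {
    let mut out = Vec::new();
    collect_descendants(root, key, &mut out);
    out
}
```

With this shape, `query_all(&doc, "name").first()` still pays for the full traversal, which is the inefficiency raised in the original question.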

> You could basically have the same API as the current NodeList (first, get(index), slice, ...) but just stop evaluating once the requirements are met. No explicit Iterator that you can lazily query with, but rather a "fetch what I want and then stop"

I would really push here to just have an API like so:

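impl JsonPath {
    // Borrows the path (&self) and ties the iterator to the lifetime of the queried value
    fn query_iter<'b>(&self, value: &'b serde_json::Value) -> QueryIter<'b> {
        /* ... */
    }
}

struct QueryIter<'b> { /* ... */ }

impl<'b> Iterator for QueryIter<'b> {
    type Item = &'b Value;

    /* ... */
}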

That is obviously oversimplified, but the idea is basically an API, JsonPath::query_iter, that returns a type QueryIter, which implements the Iterator trait with &Value as the item. I think the Iterator trait gives a lot of the semantics you are describing, takes a lot of the maintenance burden off of serde_json_path, and reduces the cognitive burden on users who are already familiar with Iterator.
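One way such an iterator could work is to replace the recursion with an explicit stack, so that each call to next() does only as much traversal as is needed to produce one more match. The following is a hedged sketch, not a proposed implementation: `Json` is a small stand-in for serde_json::Value, the query is hard-coded to a descendant key lookup in the spirit of `$..name`, and the `QueryIter` and `query_iter` shown here are hypothetical and merely borrow the names used in this thread.

```rust
use std::collections::VecDeque;

/// Minimal stand-in for `serde_json::Value` (assumption: only three variants).
#[derive(Debug, PartialEq)]
enum Json {
    Str(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Hypothetical lazy iterator over all values stored under `key`.
struct QueryIter<'a, 'k> {
    key: &'k str,
    stack: Vec<&'a Json>,        // nodes still to be visited
    pending: VecDeque<&'a Json>, // matches found on the last visited node
}

/// Hypothetical constructor; a real API would take a parsed JSONPath.
fn query_iter<'a, 'k>(root: &'a Json, key: &'k str) -> QueryIter<'a, 'k> {
    QueryIter { key, stack: vec![root], pending: VecDeque::new() }
}

impl<'a, 'k> Iterator for QueryIter<'a, 'k> {
    type Item = &'a Json;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // Hand out buffered matches before visiting more nodes.
            if let Some(hit) = self.pending.pop_front() {
                return Some(hit);
            }
            // No nodes left means the query is exhausted.
            let node = self.stack.pop()?;
            match node {
                Json::Object(members) => {
                    for (k, v) in members {
                        if k.as_str() == self.key {
                            self.pending.push_back(v);
                        }
                    }
                    // Push in reverse so the first member is visited first.
                    for (_, v) in members.iter().rev() {
                        self.stack.push(v);
                    }
                }
                Json::Array(items) => {
                    for item in items.iter().rev() {
                        self.stack.push(item);
                    }
                }
                Json::Str(_) => {}
            }
        }
    }
}
```

Because it implements Iterator, the caller gets `next()`, `nth(1)`, `take(n)` and a plain `for` loop for free, and traversal stops as soon as the caller stops asking. The cost is the explicit state (`stack` and `pending`) that the recursive approach does not need.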


Anyhow, I don't know that I will have time to attempt this any time soon. Perhaps if I get some spare time over the holidays, but unfortunately I can't guarantee it.
