Suspected bug with complex strings #23
Comments
Yes, this thwarts the fairly naïve object boundary detection that genson deploys by default. It just checks for curly braces with their backs to each other. To override this, pass a
It occurs to me that there's no actual way to turn the delimiter off completely, short of passing it something that doesn't appear in your input text. Perhaps I should add that feature, and perhaps I should turn delimiting off by default. Or perhaps I should just find a more intelligent way to parse the input. Votes for your preferred option are appreciated. |
@wolverdude I had to look at the code to see what exactly this was referring to, and I understand now: you use a heuristic to pull multiple object definitions from a single non-JSON file. Personally, I'd deprecate this functionality unless there's some standard that supports it (it seems very fragile and non-standard). YAML actually does have support for this (via the '---' indicator), so it should be fine to support in that format once I implement it.

What I did for my use-case was create a wrapper that instead generates the schema based on "subobjects": you pass a valid JSON list of objects, plus a flag to generate the schema to match each item instead of the whole list. That way you can just use It would be handy and much more robust if this were builtin. |
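The "subobjects" wrapper idea above can be sketched with the stdlib alone. Here a toy type-only schema stands in for genson's full `SchemaBuilder` (an assumption for illustration; `toy_item_schema` is a hypothetical name):

```python
import json

def toy_item_schema(json_list_text):
    """Given a valid JSON list, derive one (toy) schema matching each
    item rather than the list as a whole. A stand-in sketch for the
    genson-based wrapper described above; the type mapping is
    deliberately minimal."""
    type_names = {dict: "object", list: "array", str: "string",
                  bool: "boolean", int: "integer", float: "number",
                  type(None): "null"}
    # Collect the JSON type of every item in the list.
    types = sorted({type_names[type(item)] for item in json.loads(json_list_text)})
    return {"type": types[0] if len(types) == 1 else types}
```

With a real schema builder, the same loop would feed each item to the builder instead of just recording its type.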
To start off the deprecation, you could make |
I understand this is marked wontfix, and maybe I'm wrong, but I don't actually think it's so hard to fix, even in a streaming fashion. I would do it with a LALR parser. You can say e.g. (untested):

```
object = '{' ~ internals* ~ '}'
internals = quoted_string | bytes+
bytes = /[^{}]/
```

(Lark has a By testing for quoted strings first you just drop those And then you can use a streaming Lark parser, and only consider top-level objects. What am I missing, @wolverdude? |
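The quoted-string-aware splitting that the grammar above expresses can also be done with a plain depth counter in stdlib Python. This is a sketch of the idea, not the Lark-based approach, and `split_top_level_objects` is a hypothetical name:

```python
def split_top_level_objects(text):
    """Split concatenated JSON objects by tracking brace depth,
    ignoring braces inside quoted strings (and handling backslash
    escapes). A sketch only, not a full JSON validator."""
    objects, depth, start = [], 0, None
    in_string = escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False          # char after a backslash
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False        # closing quote
        elif ch == '"':
            in_string = True             # opening quote
        elif ch == '{':
            if depth == 0:
                start = i                # new top-level object begins
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                objects.append(text[start:i + 1])
    return objects

print(split_top_level_objects('{"alph{a": 3, "king": {"queen": 3}} {"2": []}'))
# → ['{"alph{a": 3, "king": {"queen": 3}}', '{"2": []}']
```

Because quoted strings are consumed first, braces inside them never affect the depth counter, which is the same reason the grammar tests for quoted strings before raw bytes.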
@ctrlcctrlv, thanks for the suggestion! I'm unfamiliar with LALR, but I'll take your word for it. I'd be open to a PR for this, but some things need to be worked out first:
There is no standard way that I'm aware of to pack multiple JSON objects into a single file, so support for that beyond define-your-own-delimiter isn't a strong priority. My current plan is to soft-deprecate the current functionality by making GenSON assume there's only one object in the file unless a All that said, if you do in fact have a good, simple way to do this, then I see no reason why we can't at least optionally support it. Just be sure to include tests. |
Why is |
I've tested this one this time and it meets even your requirement №2. I had a long comment written here, but I thought a proof of concept was better: https://github.com/ctrlcctrlv/jsonl-streaming-parser

Input:

```
{"alph{a": 3, "king": {"queen": 3}} {"2": []} {"id": 1} 23 23 "cap'n crunch" [1,2, 3]
```

Output (on console):
There are several changes @erezsh could consider making in Lark to make streaming parsers easier to handle. At present I just tear down the parser repeatedly. A non-naive implementation should rotate |
Also, you seem to rely on the default The parser isn't quite naïve, just not correct, since its job is splitting more than it is parsing:

```
start: jsonl+
jsonl: internals
object: "{" internals* "}"
list: "[" internals* "]"
internals: (ESCAPED_STRING | BYTES | WS | object | list)
BYTES: _BYTE+
_BYTE: /[^{}\[\]\s"]/
%import common.ESCAPED_STRING
%import common.WS
```
|
@ctrlcctrlv Not sure why you pinged me. Can you sum it up for me? |
@erezsh Oh, yes. In summary, if there were a way for me to tell Lark that I am only interested in the results my transformer is storing (which I can e.g. auto-rotate), then Lark could natively support streaming parsers via its transformer mechanism. At present I have to use its ability to cache the grammar and continually recreate the parser (but not the transformer): |
The good thing about my implementation is that it allows input such as:

```
{}{"alph{a": 3, "king": {"queen": 3}} [][]""{} {"2": []} {"id": 1} 23 23 "cap\"n crunch" [1,2, 3] []""3{}2.2""null
```

to be parsed as:

```
[
  {},
  {"alph{a": 3, "king": {"queen": 3}},
  [],
  [],
  "",
  {},
  {"2": []},
  {"id": 1},
  23,
  23,
  "cap\"n crunch",
  [1,2, 3],
  [],
  "",
  3,
  {},
  2.2,
  "",
  null
]
```
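For comparison, the same concatenated-values behavior can be had from the stdlib alone: `json.JSONDecoder.raw_decode` parses one value and reports where it stopped, so repeated calls walk the whole stream. `parse_concatenated` is a hypothetical name for this sketch, not part of the PoC:

```python
import json

def parse_concatenated(text):
    """Parse a stream of concatenated and/or whitespace-separated
    JSON values using the stdlib decoder's raw_decode, which returns
    (value, end_index). A sketch of the behavior shown above."""
    decoder = json.JSONDecoder()
    values, idx = [], 0
    while idx < len(text):
        # raw_decode rejects leading whitespace, so skip it manually.
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        value, end = decoder.raw_decode(text, idx)
        values.append(value)
        idx = end
    return values

print(parse_concatenated('{} {"2": []} 23 [1,2, 3] null'))
# → [{}, {'2': []}, 23, [1, 2, 3], None]
```

Unlike the Lark PoC this is not streaming (it needs the full text in memory), but it handles the adjacent-value cases like `""3{}2.2""null` correctly.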
|
This is great! Since you already created a general-purpose JSON-parsing library, go ahead and package it up, and we can add it as a dependency. We can make the default delimiter None unless your Lark library has been installed, in which case it will use that by default. |
I haven't seen any movement on this. @ctrlcctrlv, I think your library could be useful as an independent PyPI package. Just add docs, manifests, and preferably some tests. https://packaging.python.org/en/latest/tutorials/packaging-projects/ Once that is done, I'll add it as an optional dependency. Failing that, I will change the default behavior in the next version, but it will just always default to no delimiter. It won't have this nice optional feature. |
Sorry. I know how to package Python code and maintain several PyPI packages; that's not the issue. The issue is I lost interest in the problem, because I probably want to rewrite GenSON in Rust some day and extend it to generate SQL, which is what I use it for anyway, because better/jsonschema2db is kinda awful and hacky and very hard for me to extend. And it's unmaintained (apparently "Better" engineering does not include maintenance, har har). I even have a cute name for it: NoNoSQL. I asked the @surrealdb guys about it, specifically @tobiemh, but he was a bit too busy to engage w/ the idea lol. I can understand why; replacing PostgreSQL is a big task, and my silly NoNoSQL idea is probably below the bottom of the priority list.

You don't want to use my code in production as it stands, because it never pops objects out of This is probably fixable. But I won't be the one to do it in Python, as Python is not even the right language for this very computationally expensive problem (schema generation and SQL conversion) anyway, which also happens to be a strongly-typed problem once you bring it into SQL land, thus why Rust seems attractive to me.

Nevertheless, you may edit my PoC as you wish. And if I ever finish NoNoSQL I'll let you know know ba dum tiss |
Oh, also, to make this not quite a hard "no": if @erezsh considers doing what I said above to make Lark better for streaming parsers, so I don't have to constantly tear down and recreate the grammar (another blocking issue, performance-wise), then I will consider packaging my JSON-L parser for real, because it'll then be much easier (and more logical) for me to auto-rotate the parsed objects, keeping only a count of what we've seen. Even if I end up replacing GenSON for the things I use it for (which is really just a symptom of needing to replace jsonschema2db), I can see how a fast streaming JSON-L parser would be useful to projects long-term and would prevent many hacky implementations such as the one in GenSON. I just don't like encouraging people to use code I know has serious issues, however well it may work on small examples. |
@ctrlcctrlv Have you opened an issue in Lark to discuss these changes that you're proposing? Also, are you aware of this solution? lark-parser/lark#488 (comment) |
I wasn't, but it seems like it would work :) |
This works fine:
but if there are repeated curly braces within a string, it bombs: