feature request: support incremental/streaming lexing #67
In a number of application domains I need to handle streaming input in an incremental fashion, and having a streaming lexer / tokenization layer helps immensely when writing the layers on top.

If adding such capabilities to Alex is viable, I'd be very interested in trying to help add them (rather than having to reinvent a lot of the tooling that Alex provides).

Would this be a feature you'd be open to having added, @simonmar? Even better would be if Alex already tacitly supports this and I'm simply not understanding it yet :)

Comments
I'd happily accept a patch, provided it doesn't compromise the speed of non-streaming lexers.
@cartazio In many cases this can already be made to work, though it requires knowing something about the maximum token length. For example, we have implemented a streaming JSON lexer using alex. It relies on the fact that there is a largest possible token length (around 6 bytes, IIRC, for JSON), so that when the lexer returns an error near the end of a chunk we can tell whether it simply ran out of input or hit a real failure. If it fails within 6 bytes of the end, we need to supply more input and try again; if more input than that is available, it's a real lex error.
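For anyone curious how that chunk-boundary check might look, here is a minimal sketch of the idea. The `Scan` type and `step` function are hypothetical stand-ins for one step of an Alex-generated scanner, not part of Alex's actual API:

```haskell
-- A hypothetical sketch of the "maximum token length" trick described above.
-- The Scan type stands in for the result of running an Alex-generated
-- scanner for one token over the current chunk; none of these names are
-- part of Alex's real API.
import qualified Data.ByteString as BS

data Scan tok
  = Scanned tok BS.ByteString  -- a token plus the unconsumed remainder
  | ScanEOF                    -- ran out of input cleanly
  | ScanError Int              -- byte offset within the chunk where lexing failed

-- Decide whether a failure is a genuine lex error or just a token that got
-- split across a chunk boundary.  maxTokenLen is the longest token the
-- grammar can produce (about 6 bytes for the JSON lexer mentioned above).
step :: Int -> BS.ByteString -> Scan tok -> Either String (Maybe (tok, BS.ByteString))
step maxTokenLen chunk result = case result of
  Scanned tok rest -> Right (Just (tok, rest))
  ScanEOF          -> Right Nothing  -- ask the caller for the next chunk
  ScanError off
    | BS.length chunk - off < maxTokenLen ->
        Right Nothing                -- possibly a truncated token: refill, then rescan
    | otherwise ->
        Left ("lexical error at byte offset " ++ show off)
```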
Interesting. I have many questions :) Where is your Alex lexer for JSON? Do you have a parser too? Is it faster than aeson?
I have a properly streaming one I wrote at work a year ago.
I am very happy for you.
I can see about cleaning it up and getting that into Hackage if you want :)
I got something working that is pull-based, and I'd be happy to try and get it cleaned up and merged. You supply a monadic action that can be used to get additional data, plus a maximum token length. The lexer treats an empty result from the action as EOF. If there is a lex error, it checks for additional data and rescans if the remaining data is shorter than the user-supplied maximum token length. It also attempts to get more data at EOF. There is probably room for improvement in differentiating errors that occur because of EOF from other errors, but this is a rough first cut. It currently only works for bytestrings, with code borrowed from the monad template. It could accommodate user state fairly readily, but I didn't need that, so it isn't written.
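To make the shape of that interface concrete, here is a rough sketch of what such a pull-based driver might look like for bytestrings. The `scan` argument stands in for the Alex-generated scanner, and all of the names here are assumptions, not the actual patch:

```haskell
import qualified Data.ByteString as BS

-- Hypothetical pull-based driver.  'refill' is the user-supplied monadic
-- action for fetching more input (an empty ByteString means EOF), and
-- 'scan' stands in for the Alex-generated scanner: it lexes as many tokens
-- as it can and reports the leftover bytes and whether it hit a lex error.
pullTokens
  :: Monad m
  => m BS.ByteString                                  -- refill action
  -> Int                                              -- maximum token length
  -> (BS.ByteString -> ([tok], BS.ByteString, Bool))  -- scan: tokens, leftover, error?
  -> m (Either String [tok])
pullTokens refill maxTokenLen scan = go BS.empty []
  where
    go buf acc = case scan buf of
      (toks, rest, hitError)
        -- Plenty of input remains beyond the failure point: a genuine lex error.
        | hitError && BS.length rest >= maxTokenLen ->
            pure (Left "lexical error")
        -- Otherwise we may simply have run out of data: try to pull more.
        | otherwise -> do
            more <- refill
            if BS.null more
              then if hitError || not (BS.null rest)
                     then pure (Left "lexical error at end of input")
                     else pure (Right (acc ++ toks))
              -- Prepend the unconsumed tail to the new chunk and rescan.
              else go (rest <> more) (acc ++ toks)
```

A caller would then pass, for example, an action that reads the next chunk from a Handle or a socket as the refill action, and the driver decides when to ask for more input versus reporting a real lex error.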
Ooo, this sounds amazing!
This repo has the parser I mentioned: https://github.com/cartazio/streaming-machine-json