feature request: support incremental/streaming lexing #67

Open
cartazio opened this issue Jun 29, 2015 · 11 comments
@cartazio

In a number of application domains I need to handle streaming inputs incrementally, and having a streaming lexer / tokenization layer helps immensely with writing the layers on top.

If adding such capabilities to Alex is viable, I'd be very interested in trying to help add them, rather than having to reinvent a lot of the tooling that Alex provides.

Would this be a feature you'd be open to having added, @simonmar?

@cartazio (Author)

Even better would be if Alex already tacitly supports this and I'm simply not understanding it yet :)

@simonmar (Member)

I'd happily accept a patch, provided it doesn't compromise the speed of non-streaming lexers.

@dcoutts (Contributor) commented Sep 15, 2016

@cartazio In many cases this can already be made to work, though it requires knowing something about the maximum token length. For example, we have implemented a streaming JSON lexer using Alex. It relies on the fact that there is a largest possible token length (around 6 bytes, IIRC, for JSON), so that when the lexer returns an error near the end of a chunk we can tell whether it ran out of input or hit a real failure. If it fails within 6 bytes of the end, we need to supply more input and try again; but if more input than that is available, it's a real lex error.
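
A minimal sketch of that chunk-boundary heuristic, assuming a hypothetical `scanToken` function standing in for an Alex-generated scanner that reports the unconsumed input at the point of failure; none of these names come from Alex's actual wrappers:

```haskell
import qualified Data.ByteString as BS

-- Largest possible token, per the JSON example above.
maxTokenLen :: Int
maxTokenLen = 6

data Step tok
  = Token tok BS.ByteString      -- a token, plus the remaining input
  | NeedMoreInput BS.ByteString  -- failed too close to the chunk end; rescan later
  | LexError                     -- a genuine lexical error

-- 'scanToken' is a hypothetical stand-in for the generated scanner:
-- Right (token, rest) on success, Left unconsumed on failure, where
-- 'unconsumed' is the input from the start of the failed match.
step :: (BS.ByteString -> Either BS.ByteString (tok, BS.ByteString))
     -> BS.ByteString
     -> Step tok
step scanToken chunk =
  case scanToken chunk of
    Right (tok, rest) -> Token tok rest
    Left unconsumed
      -- Within maxTokenLen bytes of the end, the failure may simply
      -- mean the token is split across chunks: ask for more input.
      | BS.length unconsumed < maxTokenLen -> NeedMoreInput unconsumed
      -- Otherwise no amount of extra input could help: real lex error.
      | otherwise -> LexError
```

On `NeedMoreInput`, the driver would append the next chunk to the saved suffix and rescan from there.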

@simonmar (Member)

Interesting. I have many questions :) Where is your Alex lexer for JSON? Do you have a parser too? Is it faster than aeson?

@cartazio (Author)

I have a properly streaming one I wrote at work a year ago that has way better memory behavior and incremental ingestion.


@simonmar (Member)

I am very happy for you.

@cartazio (Author)

I can see about cleaning it up and getting that onto Hackage if you want :)


@iteratee

I got something working that is pull-based, and I'd be happy to try and get it cleaned up and merged.

You supply a monadic action that can be used to fetch additional data, along with a maximum token length.

The lexer treats an empty result from the action as EOF. If there is a lex error, it checks for additional data and rescans when the remaining data is shorter than the user-supplied maximum token length. It also attempts to fetch more data at EOF.

There is probably room for improvement in differentiating errors caused by EOF from other errors, but this is a rough first cut.

It currently works only for ByteStrings, with code borrowed from the monad template. It could accommodate user state fairly readily, but I didn't need that, so it isn't written.
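
A rough sketch of that pull-based loop, under assumed names (`refill` for the user-supplied monadic action, `scan` for the generated ByteString scanner); this illustrates the scheme described above, not the actual patch:

```haskell
import qualified Data.ByteString as BS

-- Result of one scanner call; 'ScanErr' carries the unconsumed input
-- from the start of the failed match. Hypothetical, for illustration.
data Scanned tok
  = Scanned tok BS.ByteString
  | ScanEOF
  | ScanErr BS.ByteString

lexPull :: Monad m
        => m BS.ByteString                -- refill action; empty result = EOF
        -> Int                            -- user-supplied maximum token length
        -> (BS.ByteString -> Scanned tok) -- stand-in for the Alex scanner
        -> BS.ByteString                  -- current buffer
        -> m (Either String (Maybe (tok, BS.ByteString)))
lexPull refill maxLen scan buf =
  case scan buf of
    Scanned tok rest -> pure (Right (Just (tok, rest)))
    ScanEOF -> do
      more <- refill                      -- try to get more data at EOF
      if BS.null more
        then pure (Right Nothing)         -- empty result: real EOF
        else lexPull refill maxLen scan more
    ScanErr unconsumed
      | BS.length unconsumed < maxLen -> do
          more <- refill                  -- error near chunk end: maybe truncated
          if BS.null more
            then pure (Left "lexical error")
            else lexPull refill maxLen scan (unconsumed <> more)
      | otherwise -> pure (Left "lexical error")
```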

@cartazio (Author)

Ooh, this sounds amazing!

@cartazio (Author)

This repo has the parser I mentioned: https://github.com/cartazio/streaming-machine-json

@andreasabel (Member)

@iteratee If this is fully backwards-compatible and does not affect performance of what we have now, a PR would be welcome!

@simonmar wrote:

I'd happily accept a patch, provided it doesn't compromise the speed of non-streaming lexers.
