-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VM implementations: Split/RSplit use signed char offset #20
Comments
Perhaps because I never think that Split/RSplit can jump only forward.
Yes.
I guess yes, but should be done in generic way, to allow offsets be 32bit too for example (for some torture test/whatever). Unions isn't my first choice of language features, I'd start with macros. Then, per above issues, there's need to deal with 2 types of offsets - signed and unsigned. There should be either macros dichotomy then, or typedefs for signed and unsigned offsets. |
So first I will send a pull request that makes the split/rsplit offset unsigned.
In any case, I think it would be a good idea if compilecode() returns an error when an offset overflows instead of just letting the match function to segfault on it. I will send a pull request that implements that. |
You are right. The Possible solutions:
|
I guess option 4 is the most practical for now. Variable-length encoding is feature-winner, but more effort, bigger code, slower code. |
Thanks. |
Additional comment on the pattern that segfauls due to 8-bit signed offsets:
If Unicode is added, for some languages the max. possible regex length (using 8-bit signed offsets) will be about half than that, or even much less. |
1. The following pattern currently segfaults:
0123|5678|0123|56789|123456|8901234|6789|1234|67890|23|5
It is because the split instruction offset, which is fetched as signed char, becomes negative at that length.
Since split/rsplit can only use positive offsets, it is possible to support, using an unsigned char offset, a twice longer regex with alt at its end.
Is there a reason not to use unsigned offset for split/rsplit?
If this is OK, I will send a pull request that implements it.
2. Regarding the size restrictions on regexes. It looks like 8-bit offsets may be a too tight restriction for real-world use. However, maybe there are minimal implementations that may benefit from shorter compiled regex codes (I don't know).
It is possible to define the
*pc
casts as macros (or not use casts at all but instead use a union, and then define only the types with macros) that can be defined for 8-bit or 16-bit offsets. (Also the hard-coded offset increments/decrements would need to be macros in that case).Is this a good idea to implement that?
The text was updated successfully, but these errors were encountered: