Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VM implementations: Split/RSplit use signed char offset #20

Open
ampli opened this issue Jul 22, 2015 · 6 comments
Open

VM implementations: Split/RSplit use signed char offset #20

ampli opened this issue Jul 22, 2015 · 6 comments

Comments

@ampli
Copy link
Contributor

ampli commented Jul 22, 2015

1. The following pattern currently segfaults:
0123|5678|0123|56789|123456|8901234|6789|1234|67890|23|5
It is because the split instruction offset, which is fetched as signed char, becomes negative at that length.
Since split/rsplit can only use positive offsets, it is possible to support, using an unsigned char offset, a twice longer regex with alt at its end.

Is there a reason not to use unsigned offset for split/rsplit?
If this is OK, I will send a pull request that implements it.

2. Regarding the size restrictions on regexes. It looks like 8-bit offsets may be a too tight restriction for real-world use. However, maybe there are minimal implementations that may benefit from shorter compiled regex codes (I don't know).
It is possible to define the *pccasts as macros (or not use casts at all but instead use a union, and then define only the types with macros) that can be defined for 8-bit or 16-bit offsets. (Also the hard-coded offset increments/decrements would need to be macros in that case).
Is this a good idea to implement that?

@pfalcon
Copy link
Owner

pfalcon commented Jul 22, 2015

Is there a reason not to use unsigned offset for split/rsplit?

Perhaps because I never think that Split/RSplit can jump only forward.

However, maybe there are minimal implementations that may benefit from shorter compiled regex codes (I don't know).

Yes.

Is this a good idea to implement that?

I guess yes, but should be done in generic way, to allow offsets be 32bit too for example (for some torture test/whatever). Unions isn't my first choice of language features, I'd start with macros. Then, per above issues, there's need to deal with 2 types of offsets - signed and unsigned. There should be either macros dichotomy then, or typedefs for signed and unsigned offsets.

@ampli
Copy link
Contributor Author

ampli commented Jul 22, 2015

So first I will send a pull request that makes the split/rsplit offset unsigned.

However, maybe there are minimal implementations that may benefit from shorter compiled regex codes (I don't know).

Yes.

In any case, I think it would be a good idea if compilecode() returns an error when an offset overflows instead of just letting the match function to segfault on it.

I will send a pull request that implements that.

@ampli
Copy link
Contributor Author

ampli commented Jul 23, 2015

Perhaps because I never think that Split/RSplit can jump only forward.

You are right.

The + code (only) needs a negative Split/RSplit offset.
So the needs for signed offset of Split/RSplit severely restricts the length of regex'es which use only signed 8-bit offsets. (The need of negative Jmp offset for '*' is also restrictive in this regard.)

Possible solutions:

  1. The + handling can be changed to use something like SplitB/RSplitB (B for back),
  2. Variable length offsets can be introduced, identical to the Unicode scheme.
    This will allow 6-bit positive and negative offsets in one byte (instead of 7-bit like now),
    a thing that will not change the code length of short regex'es, and will allow unrestricted regex length.
  3. Combination of the above 2 (will not change at all the code length of currently supported regex'es, at the cost of adding a few instructions for that.)
  4. As already mentioned, a compile time option for bigger offsets.

@pfalcon
Copy link
Owner

pfalcon commented Jul 23, 2015

I guess option 4 is the most practical for now. Variable-length encoding is feature-winner, but more effort, bigger code, slower code.

@pfalcon
Copy link
Owner

pfalcon commented Jul 23, 2015

In any case, I think it would be a good idea if compilecode() returns an error when an offset overflows instead of just letting the match function to segfault on it.
I will send a pull request that implements that.

Thanks.

@ampli
Copy link
Contributor Author

ampli commented Jul 23, 2015

Additional comment on the pattern that segfauls due to 8-bit signed offsets:

0123|5678|0123|56789|123456|8901234|6789|1234|67890|23|5

If Unicode is added, for some languages the max. possible regex length (using 8-bit signed offsets) will be about half than that, or even much less.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants