VM implementations: Split/RSplit use signed char offset #20

ampli · 2015-07-22T15:24:24Z

1. The following pattern currently segfaults:
0123|5678|0123|56789|123456|8901234|6789|1234|67890|23|5
It is because the split instruction offset, which is fetched as signed char, becomes negative at that length.
Since split/rsplit can only use positive offsets, it is possible to support, using an unsigned char offset, a twice longer regex with alt at its end.

Is there a reason not to use unsigned offset for split/rsplit?
If this is OK, I will send a pull request that implements it.

2. Regarding the size restrictions on regexes. It looks like 8-bit offsets may be a too tight restriction for real-world use. However, maybe there are minimal implementations that may benefit from shorter compiled regex codes (I don't know).
It is possible to define the *pccasts as macros (or not use casts at all but instead use a union, and then define only the types with macros) that can be defined for 8-bit or 16-bit offsets. (Also the hard-coded offset increments/decrements would need to be macros in that case).
Is this a good idea to implement that?

The text was updated successfully, but these errors were encountered:

pfalcon · 2015-07-22T20:02:13Z

Is there a reason not to use unsigned offset for split/rsplit?

Perhaps because I never think that Split/RSplit can jump only forward.

However, maybe there are minimal implementations that may benefit from shorter compiled regex codes (I don't know).

Yes.

Is this a good idea to implement that?

I guess yes, but should be done in generic way, to allow offsets be 32bit too for example (for some torture test/whatever). Unions isn't my first choice of language features, I'd start with macros. Then, per above issues, there's need to deal with 2 types of offsets - signed and unsigned. There should be either macros dichotomy then, or typedefs for signed and unsigned offsets.

ampli · 2015-07-22T23:23:55Z

So first I will send a pull request that makes the split/rsplit offset unsigned.

However, maybe there are minimal implementations that may benefit from shorter compiled regex codes (I don't know).

Yes.

In any case, I think it would be a good idea if compilecode() returns an error when an offset overflows instead of just letting the match function to segfault on it.

I will send a pull request that implements that.

ampli · 2015-07-23T12:27:29Z

Perhaps because I never think that Split/RSplit can jump only forward.

You are right.

The + code (only) needs a negative Split/RSplit offset.
So the needs for signed offset of Split/RSplit severely restricts the length of regex'es which use only signed 8-bit offsets. (The need of negative Jmp offset for '*' is also restrictive in this regard.)

Possible solutions:

The + handling can be changed to use something like SplitB/RSplitB (B for back),
Variable length offsets can be introduced, identical to the Unicode scheme.
This will allow 6-bit positive and negative offsets in one byte (instead of 7-bit like now),
a thing that will not change the code length of short regex'es, and will allow unrestricted regex length.
Combination of the above 2 (will not change at all the code length of currently supported regex'es, at the cost of adding a few instructions for that.)
As already mentioned, a compile time option for bigger offsets.

pfalcon · 2015-07-23T18:04:42Z

I guess option 4 is the most practical for now. Variable-length encoding is feature-winner, but more effort, bigger code, slower code.

pfalcon · 2015-07-23T18:05:38Z

In any case, I think it would be a good idea if compilecode() returns an error when an offset overflows instead of just letting the match function to segfault on it.
I will send a pull request that implements that.

Thanks.

ampli · 2015-07-23T19:23:29Z

Additional comment on the pattern that segfauls due to 8-bit signed offsets:

0123|5678|0123|56789|123456|8901234|6789|1234|67890|23|5

If Unicode is added, for some languages the max. possible regex length (using 8-bit signed offsets) will be about half than that, or even much less.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VM implementations: Split/RSplit use signed char offset #20

VM implementations: Split/RSplit use signed char offset #20

ampli commented Jul 22, 2015

pfalcon commented Jul 22, 2015

ampli commented Jul 22, 2015

ampli commented Jul 23, 2015

pfalcon commented Jul 23, 2015

pfalcon commented Jul 23, 2015

ampli commented Jul 23, 2015

VM implementations: Split/RSplit use signed char offset #20

VM implementations: Split/RSplit use signed char offset #20

Comments

ampli commented Jul 22, 2015

pfalcon commented Jul 22, 2015

ampli commented Jul 22, 2015

ampli commented Jul 23, 2015

pfalcon commented Jul 23, 2015

pfalcon commented Jul 23, 2015

ampli commented Jul 23, 2015