Matching functions abort on "complex" regex'es instead of returning an error #18
Comments
I mentioned this issue (I mentioned, right?), and invited you to look into it.
Yes, this is another known issue.
Sorry, there's no portable way to check C stack overflow. So, the original issue should be fixed first ;-). How I want to fix it for uPy in the meantime (backlogged for a year) is to add a STACK_CHECK() macro to recursion points, and let the app which embeds re1.5 define it to its liking. In the uPy case, that would throw an exception (via longjmp), not return anything. But on the idea of returning negative values for errors - sure, sounds good. |
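A minimal sketch of the STACK_CHECK() idea, for illustration only. Everything except the macro name is an assumption: the depth budget, the `stack_check_env` jump buffer, and the toy `count_down()` routine standing in for a matcher recursion point are hypothetical, and a real embedder such as uPy would plug in its own stack check and exception mechanism instead of a simple counter.

```c
#include <setjmp.h>

/* Hypothetical embedder-side state: a jump target and a depth budget. */
static jmp_buf stack_check_env;
static int stack_depth;
#define MAX_STACK_DEPTH 64

/* The embedder supplies STACK_CHECK(); the library only invokes it.
   This default escapes via longjmp() instead of returning a value. */
#ifndef STACK_CHECK
#define STACK_CHECK() \
    do { \
        if (stack_depth > MAX_STACK_DEPTH) \
            longjmp(stack_check_env, 1); \
    } while (0)
#endif

/* Toy recursive routine standing in for a matcher recursion point. */
static int count_down(int n)
{
    int r;
    stack_depth++;
    STACK_CHECK();
    r = (n == 0) ? 0 : 1 + count_down(n - 1);
    stack_depth--;
    return r;
}

/* Entry point: returns the result, or -1 if the stack check fired. */
static int guarded_count_down(int n)
{
    stack_depth = 0;
    if (setjmp(stack_check_env) != 0)
        return -1;
    return count_down(n);
}
```

The library code never needs to know what the check does, which is why the macro approach can coexist with a separate scheme of negative return values.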
If you mentioned it, I missed it (and I cannot find it now).
By "max stack depth" I meant counting the number of recursions. This looks simpler to me, and in uPy an exception should most probably be raised on any negative value returned from the regex match. On the downside, performing this counting requires adding an argument to recursiveloop(). (But if one additional argument matters, then the nsubp parameter of recursiveloop() should be eliminated regardless of this proposed recursion limit implementation.) |
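To illustrate counting recursions with an extra argument, here is a small sketch. It is not re1.5 code: `match_here()` is a hypothetical toy matcher (literals, `.`, and `c*` only), and `MATCH_ERR_DEPTH` and `MAX_DEPTH` are assumed names and values. The point is only that the depth argument turns runaway recursion into a negative return value that propagates to the caller.

```c
#define MATCH_ERR_DEPTH (-2)   /* hypothetical error code */
#define MAX_DEPTH 100          /* hypothetical recursion limit */

/* Match pattern `re` at the start of `s`.
   Supports literal characters, '.', and 'c*'.
   Returns 1 on match, 0 on no match, negative on error. */
static int match_here(const char *re, const char *s, int depth)
{
    if (depth > MAX_DEPTH)
        return MATCH_ERR_DEPTH;
    if (re[0] == '\0')
        return 1;
    if (re[1] == '*') {
        /* Try zero or more occurrences of re[0]. */
        for (;;) {
            int r = match_here(re + 2, s, depth + 1);
            if (r != 0)
                return r;   /* a match or an error: propagate it */
            if (*s == '\0' || (re[0] != '.' && re[0] != *s))
                return 0;
            s++;
        }
    }
    if (*s != '\0' && (re[0] == '.' || re[0] == *s))
        return match_here(re + 1, s + 1, depth + 1);
    return 0;
}
```

A caller would start with `match_here(re, s, 0)` and treat any negative result as an error rather than as "no match".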
My guess is that this counter would need to be part of the same structure as counters for {n,m}. |
Maybe a "stupid counter" is enough:
|
Maybe it's enough for now, but please take the above as a hint: we'll need to maintain (more) per-match state anyway. Whereas each new argument to a recursive function means more stack usage for register-hungry architectures. |
And the best solution would be still to unbreak |
It seems that as you said, the
|
Good clarification, sounds good. |
Using a parameter block for recursiveloop()
Currently, the parameters of recursiveloop() are: I propose to use a parameter block for the last 3, as follows:
This saves 2 arguments on the stack for each recursion, and allows adding as many state variables as needed without adding function parameters. Note that the recursiveloop() VM is rather wasteful with recursion depth. Even trivial regex'es on a very short input may need a recursion depth in the hundreds or even in the thousands. For example, the following needs a recursion depth of exactly 100:
I hope this time I didn't forget It may mean that the recursiveloop() function is not appropriate for stuff that is sensitive to large stack depth. |
There is a problem with my idea of negative values for errors, and the current compilecode(): |
Yes, that's a known technique, sounds good. I didn't try to independently review what should go there and what shouldn't, but I assume you did. |
In order to pass a parameter block, which is better (note that it is somewhat bigger than shown here, as it includes additional fields which are not shown here, like
1. Allocating it on the stack.
2. Allocating it using malloc() (maybe actually by macros for alloc/free).
|
Malloc definitely should be avoided. There are 2 choices: 1) dependency injection (the caller allocates the structure and passes a pointer); 2) the stack. I guess in this case nothing calls for 1), so just allocate it on the stack. |
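A hedged sketch of a parameter block allocated on the stack and passed by pointer. The struct layout, the field names, and the `count_run()` routine are all hypothetical and do not reproduce the block proposed earlier in the thread; they only show how per-match state can travel through one pointer instead of several arguments, without malloc().

```c
#include <stddef.h>

#define ERR_DEPTH (-2)   /* hypothetical error code */

/* Hypothetical per-match state; field names are illustrative only. */
struct match_params {
    const char *end;   /* one past the last character of the subject */
    int depth;         /* current recursion depth */
    int max_depth;     /* limit after which matching reports an error */
};

/* Recursively count leading occurrences of `c`, carrying all shared
   state in *p so that each frame passes a single extra pointer. */
static int count_run(const char *sp, char c, struct match_params *p)
{
    int r;
    if (p->depth >= p->max_depth)
        return ERR_DEPTH;
    if (sp == p->end || *sp != c)
        return 0;
    p->depth++;
    r = count_run(sp + 1, c, p);
    p->depth--;
    return r < 0 ? r : r + 1;
}

/* Entry point: the block lives in this frame, so no malloc() is needed. */
static int count_run_entry(const char *s, size_t len, char c, int max_depth)
{
    struct match_params p;
    p.end = s + len;
    p.depth = 0;
    p.max_depth = max_depth;
    return count_run(s, c, &p);
}
```

Adding another state variable later means adding a field to the struct, with no change to the recursive function's signature.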
Consider the regex `(a*)*`. Currently the recursive, recursiveloop and backtrack implementations abort on it.
(BTW, this particular infinite loop, in which the PC is incremented without eventually the SP (string pointer) incremented too, can be fixed, but this is not the point of this issue.)
The first 2 implementations (recursive and recursiveloop) abort on runtime error (stack overflow), the last one aborts on a check of a max-depth limit.
It means that, for example, a whole uPy program would abort in such a case, instead of raising an exception that an `except` clause could handle. I propose to fix it (the possibility of aborting) by allowing the match functions to return a negative value, just like pcre_exec, as follows:
0 - no match
positive - number of captures (this fixes another problem, not discussed yet)
negative - error number (max stack depth exceeded, Unicode error, and some more error types that can happen while matching a regex).
An unknown VM instruction can be left as an abort, but even this can be changed to return a negative value, as, for example, pcre_exec() does, so anything that goes wrong while matching a regex will have the potential of raising an exception instead of aborting the program.
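The proposed return convention could be consumed by a caller roughly as sketched below. The error names and numeric values are assumptions made up for illustration; the proposal itself only fixes the sign convention (0, positive, negative).

```c
/* Hypothetical error codes; names and values are assumptions. */
enum {
    RE_ERR_STACK_DEPTH     = -1,   /* max stack depth exceeded */
    RE_ERR_UNICODE         = -2,   /* Unicode error in the input */
    RE_ERR_BAD_INSTRUCTION = -3    /* unknown VM instruction */
};

/* Describe a match result under the proposed convention:
   0 = no match, positive = number of captures, negative = error. */
static const char *describe_match_result(int rc)
{
    if (rc > 0)
        return "match";
    if (rc == 0)
        return "no match";
    switch (rc) {
    case RE_ERR_STACK_DEPTH:
        return "error: max stack depth exceeded";
    case RE_ERR_UNICODE:
        return "error: Unicode error";
    case RE_ERR_BAD_INSTRUCTION:
        return "error: unknown VM instruction";
    default:
        return "error: unknown";
    }
}
```

An embedder like uPy would map the negative branch to raising an exception instead of returning a string.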
Similarly, I propose to fix `re1_5_compilecode()` to return a negative value if a particular stack depth is exceeded, in order to prevent stack overflow on regex'es with many nested (). I say "negative value" here too, not just -1, in order to leave room to return other errors (like a Unicode problem).