Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matching functions abort on "complex" regex'es instead of returning an error #18

Open
ampli opened this issue Jul 18, 2015 · 13 comments
Open

Comments

@ampli
Copy link
Contributor

ampli commented Jul 18, 2015

Consider the regex (a*)* .
Currently the recursive, recursiveloop and backtrack implementations abort on it.
(BTW, this particular infinite loop, in which the PC is incremented without eventually the SP (string pointer) incremented too, can be fixed, but this is not the point of this issue.)

The first 2 implementations (recursive and recursiveloop) abort on runtime error (stack overflow), the last one aborts on a check of a max-depth limit.
It means that, for example, a whole uPy program would abort in such a case, instead of raising an except.

I propose to fix it (the possibility of aborting) by allowing a negative value returned from the match functions, just like pcre_exec, as following:
0 - no match
positive - number of captures (this fixes another problem, not discussed yet)
negative - error number (max stack depth exceeded, Unicode error, and some more error types that can happen while matching a regex).

An unknown VM instruction can be left as abort, but even this can be changed to return a negative value, as, for example, pcre_exec() does, so anything which got wrong while matching a regex will have the potential of raising an exception instead of aborting the program.

Similarly, I propose to fix re1_5_compilecode() to return a negative value if a particular stack depth exceeds, in order to prevent stack overflow on regex'es with many nested (). I say "negative value" here too, not just -1, in order to leave a room to return other errors (like a Unicode problem).

@pfalcon
Copy link
Owner

pfalcon commented Jul 18, 2015

Consider the regex (a_)_ .
Currently the recursive, recursiveloop and backtrack implementations abort on it.
(BTW, this particular infinite loop, in which the PC is incremented without eventually the SP (string pointer) incremented too, can be fixed, but this is not the point of this issue.)

I mentioned this issue (I mentioned, right?), and invited you to look into it.

It means that, for example, a whole uPy program would abort in such a case, instead of raising an except.

Yes, this is another known issue.

negative - error number (max stack depth exceeded

Sorry, there's no portable way to check C stack overflow. So, the original issue should be fixed first ;-).

How I want to fix it for uPy in the meantime (backlogged for a year) is to add STACK_CHECK() macro to recursion points, and let app which embeds re1.5 to define it to its like. In uPy case, that would throw exception (via longjmp), not return anything.

But on the idea to return negative values for errors - sure, sounds good.

@ampli
Copy link
Contributor Author

ampli commented Jul 18, 2015

I mentioned this issue (I mentioned, right?), and invited you to look into it.

If you mentioned it, I missed it (and I cannot find it now).

negative - error number (max stack depth exceeded

Sorry, there's no portable way to check C stack overflow. So, the original issue should be fixed first ;-).

How I want to fix it for uPy in the meantime (backlogged for a year) is to add STACK_CHECK() macro to recursion points, and let app which embeds re1.5 to define it to its like. In uPy case, that would throw exception (via longjmp), not return anything.

By "max stack depth" I meant to counting number of recursions.
This counter can be compared to MATCH_LIMIT_RECURSION, which the app can redefine if needed.
I also thought that in case it is exceeded, returning a negative value is enough, just like what PCRE does (it returns then a negative value PCRE_ERROR_RECURSIONLIMIT).

This looks to me simpler, and in uPy most probably an exception should be raised on any negative value returned form the regex match.

On the down side in order to perform this counting there is a need to add an argument to recursiveloop(). (But if one additional argument is important, then the nsubp parameter of recusrsionloop() should be eliminated regardless this proposed recursion limit implementation.)

@pfalcon
Copy link
Owner

pfalcon commented Jul 19, 2015

By "max stack depth" I meant to counting number of recursions. This counter ...

My guess is that this counter would need to be part of the same structure as counters for {n,m}.

@ampli
Copy link
Contributor Author

ampli commented Jul 19, 2015

Maybe a "stupid counter" is enough:

recursiveloop(..., int r_counter) {
    if (r_counter++ > MATCH_LIMIT_RECURSION) return -1;
    ...
    case RSplit:
            off = (signed char)*pc++;
            rc = recursiveloop(..., r_counter);
            if (rc) return rc;
            continue;
     ...

@pfalcon
Copy link
Owner

pfalcon commented Jul 19, 2015

Maybe it's enough for now, but please take the above as a hint: we'll need to maintain (more) per-match state anyway. Whereas each new argument to a recursive function means more stack usage for register-hungry architectures.

@pfalcon
Copy link
Owner

pfalcon commented Jul 19, 2015

And the best solution would be still to unbreak (a*)* in a right way. But I suspect when "right way" will be known, we'll want to reduce back to recursion depth check ;-). I tried to think up "magically easy" way to assess matching progress to avoid infinite loops like that, and couldn't think up. Intuitively, that would require storing string pointer for each looping operator, and then checking it on each iteration. Single case if no-progress is not a problem, e.g. in a*b*, but when there's no progress all levels up, we caught in cycle.

@ampli
Copy link
Contributor Author

ampli commented Jul 19, 2015

It seems that as you said, the (a*)* problem and the too-deep recursion stack should be handled separately.

(a*)* is not really an error, it should match or not match a given string. Threads that don't cause the SP to eventually proceed should just be detected somehow (a special vm instruction?) and terminated.
But regex'es like (...(...( ... many more () ... ) ... ) ...) may cause a too-deep stack not due to an infinite loop. Maybe this can be detected while compiling by a recursion counter (and return a failure indication). Such a check is needed anyway during compilation, because it is recursive too.

@pfalcon
Copy link
Owner

pfalcon commented Jul 19, 2015

Good clarification, sounds good.

@ampli
Copy link
Contributor Author

ampli commented Jul 21, 2015

Using parameter block for recusrsiveloop()

Currently, the parameters of recursiveloop() are:
recursiveloop(char *pc, const char *sp, Subject *input, const char **subp, int nsubp)

I propose to use a parameter block for the last 3, as follows:

typedef struct Param Param;
struct Param {
    const Subject *input;
    const char **subp;
    unsigned char nsubp;
    ... // more per-match state, e.g. for progress check
    short rcnt; // recursion depth counter
} ;

recursiveloop(char *pc, const char *sp,  Param *p)

This saves 2 arguments on the stack for each recursion, and allows to add as many as needed more state variables without adding function parameters.

Note that the recursiveloop() VM is rather wasteful with recursion depth. Even trivial regex'es on a very short input may need a recursion depth in the hundreds or even in the thousands.
Anything which is not trivial, on input which is not a few characters, may need much more.

For example, the following needs a recursion depth of exactly 100:
EDIT:
The example I gave here was wrong, since I had a miscounting (didn't reduce the recursive depth as needed). There may be a need for a longer subject string to get a recursion depths of 100 on trivial regexes:

Regex: ((a*)b)*'
Subject: abababaabababababababababababaaba

I hope this time I didn't forget rcnt-- before any return....
I guess there are even better examples.

It may mean that the recursiveloop() function is not appropriate for stuff that is sensitive to large stack depth.

@ampli
Copy link
Contributor Author

ampli commented Jul 22, 2015

But on the idea to return negative values for errors - sure, sounds good.

There is a problem with my idea of negative values for errors, and the current compilecode():
Currently, compilecode() returns (char *) , and not int.
I propose to make it to return int, and change re through the first argument (that will be char **re,
or, better, Subject * (a separate issue - #22)).

@ampli ampli mentioned this issue Jul 22, 2015
@pfalcon
Copy link
Owner

pfalcon commented Jul 23, 2015

I propose to use a parameter block for the last 3, as follows:

Yes, that's know technique, sounds good. I didn't try to independently review what should go there and what shouldn't, but I assume you did.

@ampli
Copy link
Contributor Author

ampli commented Jul 23, 2015

In order to pass a parameter block, what is better (note that it is somewhat bigger than shown here, as it includes additional fields which are not shown here, like RE_Progress):
(The examples are from my development branches.)

1. Allocating it on the stack.

int
re1_5_recursiveloopprog(ByteProg *prog, Subject *input, const char **subp, int nsubp, int is_anchored)
{
    int rc;
    Param p = {
        .prog = prog,
        .input = input,
        .subp = subp,
        .nsubp = nsubp
    };
    rc = recursiveloop(HANDLE_ANCHORED(prog->insts, is_anchored), input->begin, &p);
#ifdef DEBUG
    printf("Max recursion depth %d\n", p->max_rcnt);
#endif
    return rc;
}

2. Allocating it using malloc() (maybe actually by macros for alloc/free).

int
re1_5_recursiveloopprog(ByteProg *prog, Subject *input, const char **subp, int nsubp, int is_anchored)
{
    int rc;
    Param *p = mal(sizeof(Param));
    p->prog = prog;
    p->input = input;
    p->subp = subp;
    p->nsubp = nsubp;

    rc = recursiveloop(HANDLE_ANCHORED(prog->insts, is_anchored), input->begin, p);
#ifdef DEBUG
    printf("Max recursion depth %d\n", p->max_rcnt);
#endif
    free(p);
    return rc;
}

@pfalcon
Copy link
Owner

pfalcon commented Jul 23, 2015

Malloc definitely should be avoided. There're 2 choices: 1) dependency injection (caller allocates structure and passes a pointer); 2) stack. I guess in this case nothing calls for 1), so just allocating on stack.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants