Incorrect results in regexp_like #21124

Open
mbasmanova opened this issue Oct 12, 2023 · 2 comments
@mbasmanova
Contributor

regexp_like behaves strangely. I'd expect 'true' in all of the following cases, but one of them returns false. This is using JONI as the regex library.

CC: @tdcmeehan @aditi-pandit @amitkdutta @zacw7 @kaikalur

presto:whatsapp_closed> select regexp_like('a.b-c.d.e', '[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$');

 _col0
-------
 true

presto> select regexp_like('@a.b-c.d.e', '@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$');

 _col0
-------
 false

presto> select regexp_like('@a.b-c.d.e', '@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$');

 _col0
-------
 true
@tdcmeehan
Contributor

It looks like the - in the second character class, [a-zA-Z0-9-.], should either be interpreted as a literal character or raise an error (it's a special character when it's not in the first or last position of the character class), but it seems to be ignored entirely.

The last example works intuitively: in one possible match, a matches the first character class and b-c.d.e matches the second one (the hyphen, normally a special character, is treated as a literal since it's the last character in the character class).

The first example matches only a portion of the string, since the regex is not anchored at the start. One possible match is b-c for the first character class, then d.e for the second, which never requires the second class to accept a hyphen.
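
To make the reasoning above concrete, here is a minimal sketch that uses java.util.regex as a stand-in (it is not Joni) and assumes the hyphen in the second character class is simply dropped. The class name HyphenSketch, the helper found, and the "dropped" pattern are hypothetical and exist only for illustration. Matcher.find() is used because regexp_like searches for a match anywhere in the string.

```java
import java.util.regex.Pattern;

public class HyphenSketch {
    // Unanchored search, like regexp_like: true if any substring matches.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Assumption: models the second class with its hyphen ignored.
        String dropped = "[a-zA-Z0-9-]+\\.[a-zA-Z0-9.]+$";
        // Hyphen in the last position of the class, as in the third query.
        String trailing = "[a-zA-Z0-9-]+\\.[a-zA-Z0-9.-]+$";

        // First query: b-c, then d.e, so no hyphen is needed in the second class.
        System.out.println(found(dropped, "a.b-c.d.e"));
        // Second query: @ pins the match to the start, so the second class
        // must cover b-c.d.e and fails on the hyphen.
        System.out.println(found("@" + dropped, "@a.b-c.d.e"));
        // Third query: the trailing hyphen is a literal, so the match succeeds.
        System.out.println(found("@" + trailing, "@a.b-c.d.e"));
    }
}
```

Under that assumption the sketch prints true, false, true, which lines up with the three results in the report. It does not show what Joni actually does with the hyphen internally.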

@tdcmeehan tdcmeehan moved this from 🆕 Unprioritized to 🏗 In progress in Bugs and support requests Oct 12, 2023
@tdcmeehan
Contributor

The same pattern works fine with Java's Pattern, where the hyphen is interpreted as a literal.

Pattern pattern = Pattern.compile("@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+$");
Matcher matcher = pattern.matcher("@a.b-c.d.e");
matcher.matches(); // Returns true

Based on the comments on jruby/joni#14, it seems that any difference between Java's pattern syntax and Joni's should be considered a bug, so I'd consider raising an issue.
