Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VL] Results are mismatch with vanilla Spark when use regexp_replace('a{bc', '\\{', '\\[') #6224

Closed
kecookier opened this issue Jun 26, 2024 · 7 comments
Assignees
Labels
bug Something isn't working triage

Comments

@kecookier
Copy link
Contributor

Backend

VL (Velox)

Bug description

spark-sql> set spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConstantFolding;
spark-sql> select regexp_replace('a{bc', '\\{', '\\[');

vanilla get
a[bc

gluten get
abc

Spark version

Spark-3.2.x

Spark configurations

No response

System information

Velox System Info v0.0.2
Commit: 3a459ab
CMake Version: 3.22.0
System: Linux-3.10.0-862.mt20190308.130.el7.x86_64
Arch: x86_64
CPU Name: Model name: Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
C++ Compiler: /opt/rh/devtoolset-9/root/usr/bin/c++
C++ Compiler Version: 9.3.1
C Compiler: /opt/rh/devtoolset-9/root/usr/bin/cc
C Compiler Version: 9.3.1
CMake Prefix Path: /usr/local;/usr;/;/usr/local/cmake;/usr/local;/usr/X11R6;/usr/pkg;/opt

Relevant logs

No response

@kecookier kecookier added bug Something isn't working triage labels Jun 26, 2024
@kecookier kecookier self-assigned this Jun 27, 2024
@gaoyangxiaozhu
Copy link
Contributor

gaoyangxiaozhu commented Jun 28, 2024

thanks @kecookier do you will help fix the issue ?

@kecookier
Copy link
Contributor Author

thanks @kecookier do you will help fix the issue ?

Yes, I will work on this issue.

@wang-zhun
Copy link
Contributor

wang-zhun commented Jul 2, 2024

@kecookier I also encountered this issue, the reason is differences in handling \\

present situation
--- Query
with tb as (
select
    name
FROM Values
    ('12234') people (name)
)
select regexp_replace(name, '22', '\\|') c1, 'gluten-spark' c2 from (select /*+ repartition(2) */ * from tb)
union
select regexp_replace(name, '22', '\\|') c1, 'vanilla-spark' c2 from tb

-- Result
c1            c2
1|34      vanilla-spark
134       gluten-spark
Gluten Velox RE2
1. code
#include <iostream>
#include <re2/re2.h>

int main() {
    std::string input = "12234";
    RE2 pattern("22");
    std::string replacement = "\\|";
    int replacements = RE2::GlobalReplace(&input, pattern, replacement);
    std::cout << "Modified string: " << input << std::endl;
    return 0;
}

2. output:
re2/re2.cc:1059: invalid rewrite pattern: \|
Modified string: 134

3. compile:
g++ -o example example.cpp -L/usr/local/lib -lre2 -Wl,-rpath,/usr/local/lib
Spark java.util.regex
1. code
    import java.util.regex.Matcher
    import java.util.regex.Pattern

    val input = "12234"
    val pattern: Pattern = Pattern.compile("22")
    val replacement = "\\|"
    val matcher: Matcher = pattern.matcher(input)
    val result = matcher.replaceAll(replacement)
    println(s"Modified string: $result")

2. output:
Modified string: 1|34

@kecookier
Copy link
Contributor Author

@wang-zhun Sorry for the late reply, and thank you for the information. I have a work-in-progress PR that will fix this issue.

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Jul 4, 2024

Posting code where re2 error happens for replacement string with "\\": https://github.com/google/re2/blob/6dcd83d60f7944926bfd308cc13979fc53dd69ca/re2/re2.cc#L1053.

While java.util.regex.Matcher views characters after "\\" as literal, i.e., using what it is after "\\".

Can we just revise the replacement string to let Re2 do the correct replacement?

@kecookier
Copy link
Contributor Author

@PHILO-HE Apologies for the delayed response. This PR attempts to fix the mismatch problems in regexp_replace: facebookincubator/velox#10981

@kecookier
Copy link
Contributor Author

facebookincubator/velox#10981 Already merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants