Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automate testing / ensuring that string functions get the same answer for String, LargeString, StringView, DictionaryString, etc #12415

Closed
Tracked by #11752
alamb opened this issue Sep 10, 2024 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Sep 10, 2024

Is your feature request related to a problem or challenge?

@2010YOUY01 found a bug where substr behaved differently for StringView than for String: #12383

As we continue to implement special versions of the string functions for different representations this drift may get worse

Describe the solution you'd like

I would like a systematic and SQL level set of tests that ensure the string functions compute the same answer for all different combinations.

Describe alternatives you've considered

I think we could do this in sqllogictest by separating the table definitions into different .slt files and then includeing the same queries file. Here is an example:

https://github.com/apache/datafusion/blob/02eab80cd62e02fcb68dee8b99d63aaac680a66c/datafusion/sqllogictest/test_files/join_disable_repartition_joins.slt#L26-L25

So this would look something like

in strings_stringview.slt

strings_stringview.slt defines the table with appropriate type. Note there would be similar files for string and largestring

statement ok
-- Populate the "t" table table with the "standard" values in StringView form
CREATE TABLE temp AS VALUES
 ('one', 'two'),
  ('three, 'four)

statement ok
CREATE TABLE t as 
SELECT 
  arrow_cast(column1, 'Utf8View') as column1,
  arrow_cast(column2, 'Utf8View') as column2,
FROM temp

-- run queries from string_queries.slt
include string_queries.slt

in strings_queries.slt

-- Note this file is intended to be run with a table 't' already defined 
-- with standard data with the appropriate data type (StringView, String, etc)

-- Test concat
query
SELECT concat(column1, column2) from t;

...

Additional context

No response

@goldmedal
Copy link
Contributor

take

@goldmedal
Copy link
Contributor

I plan to create a directory to collect all the string-related tests. The structure would like

test_files
    - string
        - init_data.slt.part
        - string_query.slt.part
        - string.slt
        - string_view.slt
        - large_string.slt

The purpose of each file:

  • init_data.slt.part: To generate the common testing data to ensure every derived test file uses the same data.
statement ok
create table test_source as values
  ('Andrew', 'X', 'datafusion📊🔥', '融'),
  ('Xiangpeng', 'Xiangpeng', 'datafusion数据融合', 'datafusion数据融合'),
  ('Raphael', 'R', 'datafusionДатаФусион', 'аФус'),
  (NULL, 'R', NULL, '🔥');
  • string_query.slt.part: To collect all the tests for string-like type (string, largestring and StringView)
    All the tests will be based on a test table like
# select all
query TTTT
SELECT ascii_1, ascii_2, unicode_1, unicode_2 FROM test
----
Andrew X datafusion📊🔥 融
Xiangpeng Xiangpeng datafusion数据融合 datafusion数据融合
Raphael R datafusionДатаФусион аФус
NULL R NULL 🔥

The original test case in datafusion/sqllogictest/test_files/string_view.slt is complete for many string scenarios. I'll follow it to implement the common tests and add some scalar function tests.

  • string.slt, string_view.slt and large_string.slt: They are the main entrypoint of sqllogictests. The structure would like
# init the testing data
include ./init_data.slt.part

# create the testing table required by the common tests
statement ok
create table test as
select
    arrow_cast(column1, 'Utf8') as ascii_1,
    arrow_cast(column2, 'Utf8') as ascii_2,
    arrow_cast(column3, 'Utf8') as unicode_1,
    arrow_cast(column4, 'Utf8') as unicode_2
from test_source;

# common tests for string-like functions and operators
include ./string_query.slt.part

# type specific tests
...

@alamb
Copy link
Contributor Author

alamb commented Sep 17, 2024

Thank you @goldmedal . Thank you so much.

Note that @2010YOUY01 did an initial version with one function here: #12433

I suggest we try and follow the same model with some other functions and see how it goes

@goldmedal
Copy link
Contributor

goldmedal commented Sep 18, 2024

I suggest we try and follow the same model with some other functions and see how it goes

Thanks for the suggestion. I think my idea is similar to his. The difference is that I tried to group the test cases by data type instead of by function. I'm not sure which one is better. Maybe by function to implement tests is more simple 🤔

@goldmedal
Copy link
Contributor

goldmedal commented Sep 21, 2024

Move the TODO list here:

Some TODO items should be finished in the follow-up PR

  • The remaining tests in the string_view.slt
    • LIKE/ILIKE
    • SUBSTR planing
    • ASCII
    • BTRIM
    • LTRIM
    • RTRIM
    • CHARACTER_LENGTH
    • CONCAT
    • CONTAINS
    • ENDS_WITH
    • LEVENSHTEIN
    • LOWER
    • UPPER
    • LPAD
    • OCTET_LENGTH
    • OVERLAY
    • REGEXP_LIKE
    • REGEXP_MATCH
    • REPEAT
    • REPLACE
    • REVERSE
    • RIGHT
    • LEFT
    • RPAD
    • SPLIT_PART
    • STRPOS
    • SUBSTRINDEX
    • FIND_IN_SET
    • || operator
    • ~ operator (regex match)
    • ~* operator (regex match case-insensitive)
    • !~~ operator (not like match)
    • !~~* operator (not like match case-insensitive)
    • Move StringView specific tests to string/string_view.slt
    • Dictionary String tests

@alamb
Copy link
Contributor Author

alamb commented Oct 25, 2024

I think @goldmedal has basically completed this epic -- and we should close it -- is that the case ?

@goldmedal
Copy link
Contributor

Yes, the testing framework has been finished. Other function tests can be enhanced by the related issue. Let's close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants