Add sqlite test files, progress bar, and automatic postgres container management into sqllogictests #13936

Merged: 17 commits, Jan 1, 2025

Conversation

Omega359
Contributor

Which issue does this PR close?

Closes #13812

Rationale for this change

Add most of the sqlite test suite to the Datafusion sqllogictests. Note: THESE TESTS DO NOT CURRENTLY PASS! Any test result where Datafusion returns a result that matches neither sqlite nor Postgresql was left as-is.

What changes are included in this PR?

This PR includes a number of changes, many of which are part of the test files in the datafusion-testing repo (5,711,125 select statements, of which 78,437 fail outright in Datafusion). The list below covers both the changes made directly in this PR and the process used to generate the files in datafusion-testing/data/sqlite/

  • All the .test files in the sqlite test suite are included except the contents of the evidence and the index/delete folders
  • All .test files have had mysql and mssql specific tests removed. All other references to mysql and mssql have been removed.
  • All .test files were renamed to .slt
  • All files had control resultmode valuewise added to the beginning so that the sqllogictest runner can properly compare the results from Datafusion (and Postgresql) to the results in the .slt file
  • All queries have been run through both Datafusion and Postgres and any queries that failed with an error have had comments added explaining the failure and a skipif Datafusion and/or skipif postgres. For example:
# Datafusion - DataFusion error: SQL error: RecursionLimitExceeded
skipif Datafusion
  • Datatypes have been updated to reflect data types from Datafusion/Postgresql as the sqlite datatypes are very limited. Comments reflecting the change have been added. For example:
# Datafusion - Types were automatically converted from:
# Datafusion - [Expected] [T]
# Datafusion - [Actual  ] [I]
  • Results have been updated where the Datafusion results differ from the sqlite results AND match the Postgresql results. There are queries where the results differ, especially around floating point (Real results in slt terms). Floating point results were deemed equivalent between Datafusion and Postgresql if they were the same to 4 decimal places. Comments reflecting the change have been added. For example:
# Datafusion - Data was automatically updated based on comparison db results
# Datafusion - Previous results:
# Datafusion - 54
# Datafusion - 9
  • The sqllogictest and runners have been updated to include progress information
  • A datafusion-testing git submodule has been added. You may need to run git submodule update --init --remote --recursive to get it added to an existing checkout of datafusion.
  • Added the ability to start a postgres docker container automatically if no PG_URI is set.
  • Readme updates
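
As an illustration of the "same to 4 decimal places" rule mentioned in the list above, here is a minimal sketch. It assumes that equivalence means "equal after rounding both values to 4 decimal places"; the PR does not state the exact rule, and the function name `same_to_decimal_places` is hypothetical rather than taken from the tooling used to generate the files.

```rust
/// Returns true when `a` and `b` are equal after rounding to `places`
/// decimal places. Assumption: this rounding rule is only one plausible
/// reading of "the same to 4 decimal places".
fn same_to_decimal_places(a: f64, b: f64, places: i32) -> bool {
    let scale = 10f64.powi(places);
    (a * scale).round() == (b * scale).round()
}

fn main() {
    // These two values differ only in the sixth decimal place.
    println!("{}", same_to_decimal_places(0.333333, 0.333334, 4));
    // These two values differ in the fourth decimal place.
    println!("{}", same_to_decimal_places(0.3333, 0.3334, 4));
}
```

Note that NaN values never compare equal under this sketch, so a real comparison would need to decide how to treat them.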

Are these changes tested?

Indeed, yes. To run the tests locally, check out this branch, update the git submodules, then run INCLUDE_SQLITE=true cargo test --profile release-nonlto --test sqllogictests. Be aware that the tests can take quite a long time to run, especially if you do not run with the release or release-nonlto profiles.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 28, 2024
@Omega359 Omega359 marked this pull request as ready for review December 28, 2024 22:56
@Omega359
Contributor Author

Related: apache/datafusion-testing#2

Contributor

@alamb alamb left a comment

THANK YOU @Omega359 -- this looks awesome

I think there are two things that we should fix prior to merge:

  1. The submodule issue (details below)
  2. "UnexpectedToken" issues (though I think it could also be fine to fix this in a follow-on PR)

Once we get this PR merged, I think the next obvious thing to do is to start running the suite in CI and actively tend the tickets for fixing the issues found (the bugs listed on #13811, which will now be much easier to reproduce)

"UnexpectedToken label-XXX" errors:

When I ran this branch with

INCLUDE_SQLITE=true cargo test --profile release-nonlto --test sqllogictests

I got an error

External error: task 27341 panicked with message "called `Result::unwrap()` on an `Err` value: ParseError { kind: UnexpectedToken(\"label-1\"), loc: Location { file: \"../../datafusion-testing/data/sqlite/random/select/slt_good_21.slt\", line: 47, upper: None } }"

Perhaps we could downgrade / revert to 0.24 and file a ticket upstream 🤔

progress reporting

This is pretty neat 🎉

cargo test --test sqllogictests

Screenshot 2024-12-29 at 9 01 58 AM

git submodule issues

Initially I tried to run this locally and had some problems

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ INCLUDE_SQLITE=true cargo test --profile release-nonlto --test sqllogictests
    Finished `release-nonlto` profile [optimized] target(s) in 0.43s
     Running bin/sqllogictests.rs (target/release-nonlto/deps/sqllogictests-19127caafe5284e5)
Error: Execution("Error reading directory \"../../datafusion-testing/data/\": Not a directory (os error 20)")
error: test failed, to rerun pass `-p datafusion-sqllogictest --test sqllogictests`

Caused by:
  process didn't exit successfully: `/Users/andrewlamb/Software/datafusion2/target/release-nonlto/deps/sqllogictests-19127caafe5284e5` (exit status: 1)

This seems to be related to not having the datafusion-testing submodule checked out
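
The "Not a directory (os error 20)" failure above could be turned into a more actionable message by checking the data path up front. The following is only a sketch of that idea, not code from this PR: the function name `check_test_data_dir` and the wording of the hint are assumptions, while the path and the git command are the ones quoted in this conversation.

```rust
use std::path::Path;

/// Command suggested in the PR description for fetching the submodule.
const SUBMODULE_HINT: &str = "git submodule update --init --remote --recursive";

/// Hypothetical pre-flight check: returns an error with a hint when the
/// test data path is missing or is not a directory.
fn check_test_data_dir(path: &Path) -> Result<(), String> {
    if path.is_dir() {
        Ok(())
    } else {
        Err(format!(
            "{} is not a directory; the datafusion-testing submodule may not be checked out. Try: {}",
            path.display(),
            SUBMODULE_HINT
        ))
    }
}

fn main() {
    // Path taken from the error message in this comment.
    let path = Path::new("../../datafusion-testing/data/");
    match check_test_data_dir(path) {
        Ok(()) => println!("found test data"),
        Err(msg) => eprintln!("{msg}"),
    }
}
```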

However, git submodule init didn't seem to work

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ git submodule init
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ git status
On branch sqllogictest_with_sqlite
nothing to commit, working tree clean
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ ls datafusion-testing
datafusion-testing*
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ cat datafusion-testing
e2e320c9477a6d8ab09662eae255887733c0e304(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$

I found I could fix it by running with --force:

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ git rm datafusion-testing
rm 'datafusion-testing'
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ git submodule add --force https://github.com/apache/datafusion-testing.git
Reactivating local git directory for submodule 'datafusion-testing'
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ ls datafusion-testing/
LICENSE.txt  NOTICE.txt   README.md    data/
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ git status
On branch sqllogictest_with_sqlite
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	typechange: datafusion-testing

datafusion/sqllogictest/README.md Show resolved Hide resolved
datafusion/sqllogictest/README.md Outdated Show resolved Hide resolved
use std::ffi::OsStr;
use std::fs;
#[cfg(feature = "postgres")]
Contributor

Perhaps as a follow-on PR the code that manages the postgres container could be moved into its own module (like postgres_container.rs or something) so we only need one #[cfg(feature = "postgres")]

I suspect this would also make the code a bit easier to reason about

Contributor Author

It is a bit of a mess eh? Yeah, I think that is a good idea

@alamb alamb changed the title Add sqlite test files into sqllogictests Add sqlite test files, progress bar, and automatic postgres container management into sqllogictests Dec 29, 2024
@Omega359
Contributor Author

Git submodules - for me this worked (as documented in the description above):
git submodule update --init --remote --recursive

Submodules with branches should be easier to update than ones without, but it still isn't a normal flow like the rest of git. I had found this Stack Overflow post, but not all the suggestions actually worked for me.

@Omega359
Contributor Author

When I ran this branch with

External error: task 27341 panicked with message "called Result::unwrap() on an Err value: ParseError { kind: UnexpectedToken("label-1"), loc: Location { file: "../../datafusion-testing/data/sqlite/random/select/slt_good_21.slt", line: 47, upper: None } }"


- This looks like something that came in via sqllogictests 0.25 yesterday: https://github.com/apache/datafusion/pull/13917

Perhaps we could downgrade / revert to 0.24 and file a ticket upstream 🤔 

This is the slt in question and yes, I think it's a bug. I'll downgrade for now and file an issue with the sqllogictest-rs project later today:

query I rowsort label-1 
SELECT + 94 / - col2 + col1 + 9 FROM tab0 AS cor0 
---- 
12 
93 
99
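
For context on why `label-1` appears in that position: in the original sqlite sqllogictest format, a query record header has the shape `query <type-string> <sort-mode> <label>`, where the sort mode is one of `nosort`, `rowsort`, or `valuesort`. The sketch below shows one way such a header could be tokenized. It is an illustration only and is not the sqllogictest-rs parser; the `QueryHeader` type and `parse_query_header` function are hypothetical names.

```rust
/// Parsed pieces of a `query` header line (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
struct QueryHeader<'a> {
    types: &'a str,
    sort_mode: Option<&'a str>,
    label: Option<&'a str>,
}

/// Splits a header such as `query I rowsort label-1` into its parts.
/// Returns None if the line is not a query header or has extra tokens.
fn parse_query_header(line: &str) -> Option<QueryHeader<'_>> {
    let mut tokens = line.split_whitespace();
    if tokens.next()? != "query" {
        return None;
    }
    let types = tokens.next()?;
    let mut sort_mode = None;
    let mut label = None;
    for token in tokens {
        match token {
            // A sort mode is only accepted before any label has been seen.
            "nosort" | "rowsort" | "valuesort" if sort_mode.is_none() && label.is_none() => {
                sort_mode = Some(token)
            }
            // The first remaining token is treated as the label.
            _ if label.is_none() => label = Some(token),
            // Anything after the label is unexpected in this sketch.
            _ => return None,
        }
    }
    Some(QueryHeader { types, sort_mode, label })
}

fn main() {
    // Header line from the failing .slt record quoted above.
    println!("{:?}", parse_query_header("query I rowsort label-1"));
}
```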

@Omega359
Contributor Author

  • submodule - readme updated with correct git submodule command to run.
  • sqlite test suite linked in docs now
  • Force version 0.24.0 of sqllogictest dependency until issue upstream is fixed

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 29, 2024
@alamb
Contributor

alamb commented Dec 30, 2024

Git submodules - for me this worked (as documented in the description above):
git submodule update --init --remote --recursive

🤔 -- it still doesn't work for me.

I also tried it on an entirely new checkout:

andrewlamb@Andrews-MacBook-Pro-2 Downloads % git clone https://github.com/Omega359/arrow-datafusion.git
Cloning into 'arrow-datafusion'...
remote: Enumerating objects: 125653, done.
remote: Counting objects: 100% (29331/29331), done.
remote: Compressing objects: 100% (1500/1500), done.
remote: Total 125653 (delta 28229), reused 27839 (delta 27831), pack-reused 96322 (from 1)
Receiving objects: 100% (125653/125653), 242.40 MiB | 10.17 MiB/s, done.
Resolving deltas: 100% (97365/97365), done.
andrewlamb@Andrews-MacBook-Pro-2 Downloads % cd arrow-datafusion
andrewlamb@Andrews-MacBook-Pro-2 arrow-datafusion % git submodule init
Submodule 'parquet-testing' (https://github.com/apache/parquet-testing.git) registered for path 'parquet-testing'
Submodule 'testing' (https://github.com/apache/arrow-testing) registered for path 'testing'
andrewlamb@Andrews-MacBook-Pro-2 arrow-datafusion %

Contributor

@alamb alamb left a comment

Thank you @Omega359 -- I think this looks great 🙏

It would be nice to fix the submodule thing before we merge to main, but I also don't think it is necessary (we can fix it afterwards)

Also, it would be nice to file a ticket / follow on to clean up / modularize the postgres container management code in the sqllogictest runner. I can do that if you don't have a chance.

I'll plan to leave this PR open for at least one more day while approved to let others have a chance to comment / test if they want.

Thank you again for helping drive this forward 🙏

(and this is so cool!)
Screenshot 2024-12-30 at 6 27 15 AM

runner.with_normalizer(value_normalizer);
runner.with_validator(validator);

let res = runner
.run_file_async(path)
Member

With run_multi_async the file only needs to be parsed once; maybe try it in a follow-up PR.

let records = parse_file(&path).unwrap();
let count = get_record_count2(&records, "Datafusion");
let res = runner
    .run_multi_async(records)
    .await
    .map_err(|e| DataFusionError::External(Box::new(e)));

// Counts the query and statement records that will actually run for `label`,
// i.e. those whose skipif/onlyif conditions all allow it.
fn get_record_count2(
    records: &[Record<<DataFusion as AsyncDB>::ColumnType>],
    label: &str,
) -> usize {
    // A record is runnable unless it is skipped for this label
    // or restricted to a different label.
    fn runnable(cond: &Condition, label: &str) -> bool {
        match cond {
            Condition::SkipIf { label: l } => l != label,
            Condition::OnlyIf { label: l } => l == label,
        }
    }
    records
        .iter()
        .filter(|rec| match rec {
            Record::Query { conditions, .. } => {
                conditions.iter().all(|c| runnable(c, label))
            }
            Record::Statement { conditions, .. } => {
                conditions.iter().all(|c| runnable(c, label))
            }
            // Other record kinds (comments, control lines, ...) are not counted.
            _ => false,
        })
        .count()
}

@Omega359
Copy link
Contributor Author

Also, it would be nice to file a ticket / follow on to clean up / modularize the postgres container management code in the sqllogictest runnner. I can do that if you don't have a chance.

#13948

Member

@jonahgao jonahgao left a comment

Thanks @Omega359 for this great work. I think we can merge it as it is and make further improvements with follow-up PRs.

}

#[cfg(feature = "postgres")]
static POSTGRES_IN: Lazy<Channel<ContainerCommands>> = Lazy::new(channel);
Member

We can use std::sync::LazyLock without introducing a dependency on once_cell.

Contributor Author

I was under the impression that it wasn't stable yet in an MSRV that DF supports, but apparently I was wrong. I'll see if I can find time to change this today or file an issue to improve it otherwise.
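
A minimal sketch of what the `std::sync::LazyLock` suggestion could look like follows. It uses std `mpsc` channels purely to stay self-contained; the runner's real channel type is not shown in this thread, so the `Channel` struct, the `new_channel` helper, and the `ContainerCommand` enum here are assumptions for illustration. `LazyLock` has been stable in std since Rust 1.80.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{LazyLock, Mutex};

/// Hypothetical command type; only one variant is needed for the sketch.
#[derive(Debug, PartialEq)]
enum ContainerCommand {
    FetchHost,
}

/// Hypothetical wrapper holding both ends of a channel. The receiver is
/// wrapped in a Mutex because `Receiver` is not `Sync`.
struct Channel<T> {
    tx: Sender<T>,
    rx: Mutex<Receiver<T>>,
}

fn new_channel<T>() -> Channel<T> {
    let (tx, rx) = channel();
    Channel { tx, rx: Mutex::new(rx) }
}

// Initialized on first access, with no once_cell dependency.
static POSTGRES_IN: LazyLock<Channel<ContainerCommand>> = LazyLock::new(|| new_channel());

fn main() {
    POSTGRES_IN.tx.send(ContainerCommand::FetchHost).unwrap();
    let received = POSTGRES_IN.rx.lock().unwrap().recv().unwrap();
    println!("{received:?}");
}
```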

});

POSTGRES_IN.tx.send(FetchHost).unwrap();
let db_host = POSTGRES_HOST.rx.lock().await.recv().await.unwrap();
Member

Here we need to wait for Postgres to start, so I wonder if we can call start_postgres directly in the current thread, without using thread::spawn and channels.

Contributor Author

Sure, but you need access to the host/port and container elsewhere. You could return that info, of course, but this code was inspired by my test code in another project where that wasn't feasible.
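
One possible middle ground between the two positions in this thread, sketched under stated assumptions: start the container directly in the calling thread and store the returned connection details in a `std::sync::OnceLock`, so other code can read them later without channels. The `PostgresInfo` struct, the `postgres_info` accessor, and the stubbed body of `start_postgres` are hypothetical; the real function would start a container rather than return fixed values.

```rust
use std::sync::OnceLock;

/// Hypothetical connection details returned by the start-up routine.
#[derive(Debug, Clone, PartialEq)]
struct PostgresInfo {
    host: String,
    port: u16,
}

static POSTGRES: OnceLock<PostgresInfo> = OnceLock::new();

/// Stand-in for the real start-up code. Assumption: the real version
/// would launch the container and report its actual host and port.
fn start_postgres() -> PostgresInfo {
    PostgresInfo {
        host: "127.0.0.1".to_string(),
        port: 5432,
    }
}

/// Starts Postgres on first use and returns the shared details afterwards.
fn postgres_info() -> &'static PostgresInfo {
    POSTGRES.get_or_init(start_postgres)
}

fn main() {
    let info = postgres_info();
    println!("{}:{}", info.host, info.port);
}
```

This does not cover shutting the container down, which is part of what the channel-based design handles.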

})
.unwrap_or_else(|| "default_schema".to_string())
.to_string_lossy()
.to_string()
Member

nit: Calling to_string() seems unnecessary.

@alamb alamb merged commit aafec07 into apache:main Jan 1, 2025
29 checks passed
@alamb
Contributor

alamb commented Jan 1, 2025

Thanks again @Omega359 and @jonahgao -- and happy new year!

@jonahgao
Member

jonahgao commented Jan 1, 2025

Thanks again @Omega359 and @jonahgao -- and happy new year!

Happy new year!

Labels
documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Complete / integrate sqlite sqllogictest test scripts integrattion
3 participants