Merge pull request #1945 from jqnatividad/joinp_right_join
`joinp`: add `--right` outer join option
jqnatividad authored Jul 5, 2024
2 parents b4042a2 + a6e075d commit 0504328
Showing 5 changed files with 68 additions and 28 deletions.
36 changes: 18 additions & 18 deletions Cargo.lock

Generated file; diff not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -260,7 +260,7 @@ calamine = { git = "https://github.com/tafia/calamine", rev = "6b41309" }
# use modernized version of local_encoding
local-encoding = { git = "https://github.com/slonopotamus/local-encoding-rs", branch = "travis-madness" }
# use latest upstream version of polars with additional unreleased features/fixes
-polars = { git = "https://github.com/pola-rs/polars", rev = "276655a" }
+polars = { git = "https://github.com/pola-rs/polars", rev = "34126ca" }


[features]
2 changes: 1 addition & 1 deletion README.md
@@ -54,7 +54,7 @@
| [index](/src/cmd/index.rs#L2) | Create an index (📇) for a CSV. This is very quick (even the 15gb, 28m row NYC 311 dataset takes all of 14 seconds to index) & provides constant time indexing/random access into the CSV. With an index, `count`, `sample` & `slice` work instantaneously; random access mode is enabled in `luau`; and multithreading (🏎️) is enabled for the `frequency`, `split`, `stats`, `schema` & `tojsonl` commands. |
| [input](/src/cmd/input.rs#L2) | Read CSV data with special commenting, quoting, trimming, line-skipping & non-UTF8 encoding handling rules. Typically used to "normalize" a CSV for further processing with other qsv commands. |
| [join](/src/cmd/join.rs#L2)<br>👆 | Inner, outer, right, cross, anti & semi joins. Automatically creates a simple, in-memory hash index to make it fast. |
-| [joinp](/src/cmd/joinp.rs#L2)<br>✨🚀🐻‍❄️ | Inner, outer, cross, anti, semi & asof joins using the [Pola.rs](https://www.pola.rs) engine. Unlike the `join` command, `joinp` can process files larger than RAM, is multithreaded, has join key validation, pre-join filtering, supports [asof joins](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.join_asof.html) (which is [particularly useful for time series data](https://github.com/jqnatividad/qsv/blob/30cc920d0812a854fcbfedc5db81788a0600c92b/tests/test_joinp.rs#L509-L983)) & its output doesn't have duplicate columns. However, `joinp` doesn't have an --ignore-case option & it doesn't support right outer joins. |
+| [joinp](/src/cmd/joinp.rs#L2)<br>✨🚀🐻‍❄️ | Inner, outer, right, cross, anti, semi & asof joins using the [Pola.rs](https://www.pola.rs) engine. Unlike the `join` command, `joinp` can process files larger than RAM, is multithreaded, has join key validation, pre-join filtering, supports [asof joins](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.join_asof.html) (which is [particularly useful for time series data](https://github.com/jqnatividad/qsv/blob/30cc920d0812a854fcbfedc5db81788a0600c92b/tests/test_joinp.rs#L509-L983)) & its output columns can be coalesced. However, `joinp` doesn't have an --ignore-case option. |
| [jsonl](/src/cmd/jsonl.rs#L2)<br>🚀🔣 | Convert newline-delimited JSON ([JSONL](https://jsonlines.org/)/[NDJSON](http://ndjson.org/)) to CSV. See `tojsonl` command to convert CSV to JSONL.
| [json](/src/cmd/json.rs#L2)<br> | Convert non-nested JSON to CSV.
| <a name="luau_deeplink"></a><br>[luau](/src/cmd/luau.rs#L2) 👑<br>✨📇🌐🔣 ![CKAN](docs/images/ckan.png) | Create multiple new computed columns, filter rows, compute aggregations and build complex data pipelines by executing a [Luau](https://luau-lang.org) [0.630](https://github.com/Roblox/luau/releases/tag/0.630) expression/script for every row of a CSV file ([sequential mode](https://github.com/jqnatividad/qsv/blob/bb72c4ef369d192d85d8b7cc6e972c1b7df77635/tests/test_luau.rs#L254-L298)), or using [random access](https://www.webopedia.com/definitions/random-access/) with an index ([random access mode](https://github.com/jqnatividad/qsv/blob/bb72c4ef369d192d85d8b7cc6e972c1b7df77635/tests/test_luau.rs#L367-L415)).<br>Can process a single Luau expression or [full-fledged data-wrangling scripts using lookup tables](https://github.com/dathere/qsv-lookup-tables#example) with discrete BEGIN, MAIN and END sections.<br> It is not just another qsv command, it is qsv's [Domain-specific Language](https://en.wikipedia.org/wiki/Domain-specific_language) (DSL) with [numerous qsv-specific helper functions](https://github.com/jqnatividad/qsv/blob/113eee17b97882dc368b2e65fec52b86df09f78b/src/cmd/luau.rs#L1356-L2290) to build production data pipelines. |
38 changes: 30 additions & 8 deletions src/cmd/joinp.rs
@@ -8,7 +8,7 @@ Unlike the join command, joinp can process files larger than RAM, is multithread
has join key validation, pre-join filtering, supports asof joins & its output columns
can be coalesced (no duplicate columns).
-However, joinp doesn't have an --ignore-case option & it doesn't support right outer joins.
+However, joinp doesn't have an --ignore-case option.
Returns the shape of the join result (number of rows, number of columns) to stderr.
@@ -41,6 +41,11 @@ joinp options:
--left-semi This returns only the rows in the first CSV data set
that have a corresponding row in the second data set.
The output schema is the same as the first data set.
+--right Do a 'right outer' join. This returns all rows in
+the second CSV data set, including rows with no
+corresponding row in the first data set. When no
+corresponding row exists, it is padded out with
+empty fields. (This is the reverse of a 'left outer' join.)
--full Do a 'full outer' join. This returns all rows in
both data sets with matching records joined. If
there is no match, the missing side will be padded
@@ -200,6 +205,7 @@ struct Args {
flag_left: bool,
flag_left_anti: bool,
flag_left_semi: bool,
+flag_right: bool,
flag_full: bool,
flag_cross: bool,
flag_coalesce: bool,
@@ -263,17 +269,33 @@ pub fn run(argv: &[&str]) -> CliResult<()> {
args.flag_left,
args.flag_left_anti,
args.flag_left_semi,
+args.flag_right,
args.flag_full,
args.flag_cross,
args.flag_asof,
) {
-(false, false, false, false, false, false) => join.run(JoinType::Inner, validation, false),
-(true, false, false, false, false, false) => join.run(JoinType::Left, validation, false),
-(false, true, false, false, false, false) => join.run(JoinType::Anti, validation, false),
-(false, false, true, false, false, false) => join.run(JoinType::Semi, validation, false),
-(false, false, false, true, false, false) => join.run(JoinType::Full, validation, false),
-(false, false, false, false, true, false) => join.run(JoinType::Cross, validation, false),
-(false, false, false, false, false, true) => {
+(false, false, false, false, false, false, false) => {
+    join.run(JoinType::Inner, validation, false)
+},
+(true, false, false, false, false, false, false) => {
+    join.run(JoinType::Left, validation, false)
+},
+(false, true, false, false, false, false, false) => {
+    join.run(JoinType::Anti, validation, false)
+},
+(false, false, false, true, false, false, false) => {
+    join.run(JoinType::Right, validation, false)
+},
+(false, false, true, false, false, false, false) => {
+    join.run(JoinType::Semi, validation, false)
+},
+(false, false, false, false, true, false, false) => {
+    join.run(JoinType::Full, validation, false)
+},
+(false, false, false, false, false, true, false) => {
+    join.run(JoinType::Cross, validation, false)
+},
+(false, false, false, false, false, false, true) => {
// safety: flag_strategy is always is_some() as it has a default value
args.flag_strategy = Some(args.flag_strategy.unwrap().to_lowercase());
let strategy = match args.flag_strategy.as_deref() {
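
Review note: the boolean tuple matched in `run` gains one position with every new join flag, and each arm has to keep its `true` in the correct slot. As a sketch of one alternative, assuming the flags are meant to be mutually exclusive and that no flag means an inner join (as the first arm suggests), each flag can be paired with its join kind and the pairs scanned. `JoinKind`, `JoinFlags` and `select_join_kind` are hypothetical names used only for illustration; they are not part of qsv or Polars, and the error message is invented, since the diff does not show how `joinp` handles conflicting flags.

```rust
/// Hypothetical mirror of the join choices exposed by the flags in this diff.
/// This is NOT the Polars `JoinType` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JoinKind {
    Inner,
    Left,
    LeftAnti,
    LeftSemi,
    Right,
    Full,
    Cross,
    AsOf,
}

/// Hypothetical flag set, in the same order as the tuple matched in `run`.
#[derive(Debug, Default, Clone, Copy)]
pub struct JoinFlags {
    pub left: bool,
    pub left_anti: bool,
    pub left_semi: bool,
    pub right: bool,
    pub full: bool,
    pub cross: bool,
    pub asof: bool,
}

/// Pick the join kind: no flag set means inner, exactly one flag selects
/// that kind, and more than one flag is reported as an error.
pub fn select_join_kind(flags: JoinFlags) -> Result<JoinKind, String> {
    let candidates = [
        (flags.left, JoinKind::Left),
        (flags.left_anti, JoinKind::LeftAnti),
        (flags.left_semi, JoinKind::LeftSemi),
        (flags.right, JoinKind::Right),
        (flags.full, JoinKind::Full),
        (flags.cross, JoinKind::Cross),
        (flags.asof, JoinKind::AsOf),
    ];

    // Keep only the kinds whose flag was set.
    let mut chosen = candidates
        .iter()
        .filter(|(set, _)| *set)
        .map(|(_, kind)| *kind);

    match (chosen.next(), chosen.next()) {
        (None, _) => Ok(JoinKind::Inner),
        (Some(kind), None) => Ok(kind),
        // Assumed behavior: conflicting flags are rejected.
        (Some(_), Some(_)) => Err("more than one join flag was set".to_string()),
    }
}
```

With this shape, adding a flag means adding one row to `candidates` rather than widening every match arm. Whether that trade-off is worth it for `joinp` is a judgment call for the maintainers.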
18 changes: 18 additions & 0 deletions tests/test_joinp.rs
@@ -371,6 +371,24 @@ joinp_test!(
}
);

+joinp_test!(
+    joinp_outer_right_none_streaming,
+    |wrk: Workdir, mut cmd: process::Command| {
+        cmd.arg("--right")
+            .args(["--validate", "none"])
+            .arg("--streaming");
+        let got: Vec<Vec<String>> = wrk.read_stdout(&mut cmd);
+        let expected: Vec<Vec<String>> = vec![
+            svec!["state", "city", "place"],
+            svec!["MA", "Boston", "Logan Airport"],
+            svec!["MA", "Boston", "Boston Garden"],
+            svec!["NY", "Buffalo", "Ralph Wilson Stadium"],
+            svec!["", "Orlando", "Disney World"],
+        ];
+        assert_eq!(got, expected);
+    }
+);
+
joinp_test_comments!(
joinp_outer_left_validate_none_comments,
|wrk: Workdir, mut cmd: process::Command| {
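
Review note: for readers checking the expected rows in `joinp_outer_right_none_streaming`, here is a minimal, dependency-free sketch of right outer join semantics. It is not how `joinp` or Polars implements the join. `right_outer_join` is a hypothetical helper, and the (key, value) row shape is a simplification chosen for illustration.

```rust
use std::collections::HashMap;

/// Right outer join of two (key, value) tables on `key`.
/// Every right row appears in the output; when no left row shares its key,
/// the left value is padded with an empty string.
/// Output rows are [left value, key, right value], in right-side row order.
pub fn right_outer_join(left: &[(&str, &str)], right: &[(&str, &str)]) -> Vec<Vec<String>> {
    // Index the left table by key so each right row is matched in O(1).
    let mut index: HashMap<&str, Vec<&str>> = HashMap::new();
    for (key, value) in left {
        index.entry(*key).or_default().push(*value);
    }

    let mut out = Vec::new();
    for (key, right_value) in right {
        match index.get(key) {
            // One output row per matching left row.
            Some(left_values) => {
                for left_value in left_values {
                    out.push(vec![
                        left_value.to_string(),
                        key.to_string(),
                        right_value.to_string(),
                    ]);
                }
            }
            // No match: keep the right row and pad the left side.
            None => out.push(vec![
                String::new(),
                key.to_string(),
                right_value.to_string(),
            ]),
        }
    }
    out
}
```

Called with assumed city/state rows on the left and city/place rows on the right (the real test fixtures are not part of this diff), a right row with no left match, such as `Orlando`, comes back with an empty first field. That is the shape of the last expected row in the test above.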