[R] Use FetchNode and OrderByNode #34437
Comments
This might wait until |
I'll start this now. Will do order_by first, then tackle the Declaration refactor after. |
@westonpace I started wiring these up and have run into a couple of issues. If these are user error, I'd appreciate pointers on resolving them. The change to use these nodes is a pure refactor: I expect all existing tests to pass when using these nodes, and there are a lot of tests. I haven't gotten to everything yet but am hitting a couple of errors in C++:
Also, am I correct that |
I tried just removing the validation in the fetch node like this:
and that caused the tests to hang. |
Once #34698 merges then you should be able to ask the scan to emit data in a deterministic order (if it's a dataset scan this will be the order in which files are given to the dataset. If the dataset is created through discovery this is (usually, but not necessarily always) lexicographically ordered filenames). Note that in-memory sources (record batch reader, table, etc.) already declare implicit ordering. So if your source is a table in memory then The performance penalty for doing so is pretty minor. So this should allow you to do However, it still wouldn't support something like If we really want / need to support that non-deterministic approach then we can add a boolean flag to the fetch node options to
You are correct. It's still very doable but I don't see the purpose until someone gets around to implementing the more efficient solution. I think the only reason we had it before is because ordering and limit were both sink nodes and so you couldn't chain them.
Yes. The implementation will have to change slightly when we do a non-deterministic fetch but it's not too bad. We can just skip the sequencing queue and call process immediately (guarded by a mutex). Right now it's hanging because the sequencing queue just accumulates everything and never emits because it never sees the first batch (it should probably error though when it sees an unordered batch. That would be a nice cleanup). |
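To make the hang described above concrete, here is a minimal pure-Python sketch of a sequencing queue. It is a toy model rather than Acero's implementation: the `SequencingQueue` class, its methods, and the `UNORDERED` marker are hypothetical names chosen for illustration. The queue only releases a batch once every earlier index has arrived, so if the batches carry no usable index it buffers everything and never emits.

```python
import heapq


class SequencingQueue:
    """Toy model: releases batches strictly in index order, starting at 0."""

    def __init__(self):
        self._pending = []    # min-heap of (index, batch)
        self._next_index = 0  # index of the next batch allowed out

    def insert(self, index, batch):
        """Buffer a batch, then return every batch that is now ready, in order."""
        heapq.heappush(self._pending, (index, batch))
        ready = []
        while self._pending and self._pending[0][0] == self._next_index:
            ready.append(heapq.heappop(self._pending)[1])
            self._next_index += 1
        return ready

    def pending_count(self):
        return len(self._pending)


UNORDERED = -1  # hypothetical marker for a batch that has no sequence index

queue = SequencingQueue()
out_of_order = queue.insert(1, "b1")  # held back: still waiting for batch 0
in_order = queue.insert(0, "b0")      # batch 0 arrives, so b0 and b1 are released

# Unordered batches never match the expected index, so nothing is ever emitted.
stuck = SequencingQueue()
stuck_results = [stuck.insert(UNORDERED, f"x{i}") for i in range(3)]
```

In this model, erroring as soon as `insert` sees `UNORDERED` would turn the silent hang into an immediate failure, which is the cleanup suggested above.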
I'm pretty sure I saw this failing all over and not just on dataset tests. Most tests of acero in R just use tables and recordbatches. |
I was just testing the FetchNode in Python, and at least for a Table source, I seem to get deterministic behaviour:

import numpy as np
import pyarrow as pa
from pyarrow._acero import TableSourceNodeOptions, FetchNodeOptions, Declaration

# Build a 10M-row table and re-chunk it into batches of 1M rows
table = pa.table({'a': np.arange(10_000_000)})
table = pa.Table.from_batches(table.to_batches(max_chunksize=1_000_000))

# Fetch the first 5 rows repeatedly and check the result never changes
for _ in range(100):
    decl = Declaration.from_sequence([
        Declaration("table_source", TableSourceNodeOptions(table)),
        Declaration("fetch", FetchNodeOptions(0, 5))
    ])
    assert decl.to_table()['a'].to_pylist() == [0, 1, 2, 3, 4]
I checked again with the current state of my branch and 4 tests fail if I use the FetchNode, 3 of which are on Datasets. The other one is a query on a table, but fetch comes after aggregation:
I guess that's reasonable that it should error (or at least warn)? Seems like I should wait for #34698 to happen so that I'm not having to special-case datasets temporarily. |
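The failure above (fetch after aggregation on an in-memory table) can be modelled with a small sketch of how an ordering guarantee might flow through a plan. This is a toy model and not Acero's API: the node names are plain strings, `validate_plan` is a hypothetical helper, and treating filter and project as order-preserving is an assumption made for illustration.

```python
# Toy model of ordering propagation through a plan; not Acero's API.
ORDER_DESTROYING = {"aggregate"}    # output order is no longer guaranteed
ORDER_ESTABLISHING = {"order_by"}   # output has an explicit order
# Assumption: every other node (e.g. "filter", "project") preserves ordering.


def validate_plan(nodes, source_is_ordered):
    """Walk the nodes in order and reject a fetch on unordered input."""
    ordered = source_is_ordered
    for node in nodes:
        if node == "fetch" and not ordered:
            raise ValueError("fetch requires ordered input")
        if node in ORDER_ESTABLISHING:
            ordered = True
        elif node in ORDER_DESTROYING:
            ordered = False
    return ordered


# An in-memory table declares implicit ordering, so fetch right after it is fine.
table_then_fetch = validate_plan(["filter", "fetch"], source_is_ordered=True)

# Aggregation drops the ordering, so a following fetch is rejected...
try:
    validate_plan(["aggregate", "fetch"], source_is_ordered=True)
    aggregate_then_fetch_error = None
except ValueError as exc:
    aggregate_then_fetch_error = str(exc)

# ...unless an order_by re-establishes an ordering first.
sorted_then_fetch = validate_plan(
    ["aggregate", "order_by", "fetch"], source_is_ordered=True
)
```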
One other question: this implicit order, can I reference it in some way? Is there some |
That's a fun question. No, there is no Since there is no magic column there are some things you cannot do with the implicit order. For example, you cannot "first order by X and then by implicit" (the implicit ordering is an entire ordering and not just a sort key). Since there is no column corresponding to it you also cannot "restore" it after some kind of order-destroying operation. However...reversing it should be possible in theory. Though reversing a dataset and taking A much better feature that (presumably) wouldn't be too difficult, would be to support a "reverse scan" that scans the data in reverse order. This would be a very efficient way to implement |
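As a rough illustration of the "reverse scan" idea, the sketch below takes the last `n` rows by reading batches from the end and stopping as soon as enough rows have been collected. It uses plain Python lists as stand-ins for record batches, and `tail_via_reverse_scan` is a hypothetical helper, not an existing Arrow function.

```python
def tail_via_reverse_scan(batches, n):
    """Return (last n rows, number of batches read), scanning from the end."""
    collected = []  # slices taken from each batch, last batch first
    remaining = n
    batches_read = 0
    for batch in reversed(batches):
        if remaining <= 0:
            break  # enough rows collected: earlier batches are never touched
        batches_read += 1
        piece = batch[-remaining:]  # at most `remaining` rows from the batch's end
        collected.append(piece)
        remaining -= len(piece)
    # Restore the original (implicit) order of the collected rows.
    rows = [row for piece in reversed(collected) for row in piece]
    return rows, batches_read


batches = [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
rows, batches_read = tail_via_reverse_scan(batches, 5)
```

The point of the design is that only the trailing batches are read, whereas reversing the whole dataset first would require reading all of it.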
### Rationale for this change

See also #32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

* #34941: Taking head/tail on unordered data, which has non-deterministic results but should still be possible for the case where the user wants to see a slice of the result, any slice
* #34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simplify further and then move to the new Declaration class.

### What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* everything we do in a query should be represented in an ExecPlan.

### Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

### Are there any user-facing changes?

The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative.

`tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).

When printing query objects that include `summarize()` when the `arrow.summarize.sort = TRUE` option is set, the sorting is correctly printed.

It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios.

* Closes: #34437
* Closes: #31980
* Closes: #31982

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
Describe the enhancement requested
See #34059. There's at least one workaround we can remove and push the work into the ExecPlan instead of massaging the result after.
Component(s)
R