WIP: Testing-only PR to check maint-19.0.1 status #45401
Conversation
### Rationale for this change

If the value of a Decimal32 or Decimal64 is `INT32_MIN` or `INT64_MIN` respectively, UBSAN reports an overflow when `Negate` is called on it.

### What changes are included in this PR?

Have the `Negate` methods of Decimal32 and Decimal64 use `arrow::internal::SafeSignedNegate`.

### Are these changes tested?

Unit tests were added for both cases; without the fix they reproduce the problem when UBSAN is enabled.

### Are there any user-facing changes?

No.

* OSS-Fuzz issue: https://issues.oss-fuzz.com/issues/371239168
* GitHub Issue: #45180

Authored-by: Matt Topol <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
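For illustration, a minimal sketch of the safe-negation technique (negating via the unsigned representation so that `INT32_MIN`/`INT64_MIN` wrap instead of overflowing). The helper name is hypothetical and this is not necessarily the exact implementation of `arrow::internal::SafeSignedNegate`:

```cpp
#include <cstdint>
#include <type_traits>

// Negate a signed integer without undefined behavior: convert to the unsigned
// type, negate there (well defined, modulo 2^N), and convert back. INT32_MIN /
// INT64_MIN then wrap to themselves instead of triggering signed overflow.
template <typename T>
constexpr T SafeNegateSketch(T value) {
  static_assert(std::is_signed_v<T>, "only meaningful for signed integers");
  using U = std::make_unsigned_t<T>;
  return static_cast<T>(~static_cast<U>(value) + 1);  // two's-complement negate
}

static_assert(SafeNegateSketch<int64_t>(1) == -1);
static_assert(SafeNegateSketch<int32_t>(INT32_MIN) == INT32_MIN);
```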
#45108)

### Rationale for this change

#44513 triggers two distinct overflow issues within the swiss join, both happening when the build-side table contains a large enough number of rows or distinct keys. (Hash join builds of this size are rather rare, so we haven't seen them reported until now.)

1. Our swiss table implementation takes the higher `N` bits of the 32-bit hash value as the index into a buffer storing "block"s (a block contains `8` key ids - in some code also referred to as "group" ids). This `N`-bit number is further multiplied by the size of a block, which is also related to `N`. In the case of #44513, `N` is `26` and a block takes `40` bytes, so the product can exceed `1 << 31` (negative when interpreted as a signed 32-bit integer). In our AVX2 specialization of block buffer access, https://github.com/apache/arrow/blob/0a00e25f2f6fb927fb555b69038d0be9b9d9f265/cpp/src/arrow/compute/key_map_internal_avx2.cc#L404 , an issue like #41813 (comment) shows up. This is the issue that directly produced the segfault in #44513.
2. We take the `7` bits of the 32-bit hash value after the `N` block bits as a "stamp" (to quickly fail hash comparisons). But when `N` is greater than `25`, arithmetic like https://github.com/apache/arrow/blob/0a00e25f2f6fb927fb555b69038d0be9b9d9f265/cpp/src/arrow/compute/key_map_internal.cc#L397 (`bits_hash_` is `constexpr 32`, `log_blocks_` is `N`, `bits_stamp_` is `constexpr 7`; this retrieves the stamp from a hash) produces `hash >> -1`, aka `hash >> 0xFFFFFFFF`, aka `hash >> 31` (the leading `1`s are trimmed). The stamp value is then wrong and results in falsely mismatched rows. This is the reason for my false-positive run in #44513 (comment).

### What changes are included in this PR?

For issue 1, use the 64-bit index gather intrinsic to avoid the offset overflow.

For issue 2, do not right-shift the hash if `N + 7 >= 32`. This effectively allows the block id (the `N` bits) and the stamp (the `7` bits) to overlap. Though this may introduce more false-positive hash comparisons (and thus worsen performance), it is still more reasonable than failing outright for `N > 25`. I introduce two members, `bits_shift_for_block_and_stamp_` and `bits_shift_for_block_`, which are derived from `log_blocks_` - in particular set to `0` and `32 - N` when `N + 7 >= 32` - to avoid branching like `if (log_blocks_ + bits_stamp_ > bits_hash_)` in tight loops. (See the sketch after this description.)

### Are these changes tested?

The fix is manually tested with the original case on my machine. (I do have a concrete C++ unit test that verifies the fix, but it requires too many resources and runs for too long to be practical in any reasonable CI environment.)

### Are there any user-facing changes?

None.

* GitHub Issue: #44513

Lead-authored-by: Rossi Sun <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Rossi Sun <[email protected]>
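To illustrate issue 2 and the shape of the fix, a sketch with names modeled on, but not identical to, the members introduced in the PR; the actual Arrow code differs in detail:

```cpp
#include <cstdint>

// Shift amounts derived once from N = log_blocks (names modeled on the PR's
// bits_shift_for_block_and_stamp_ / bits_shift_for_block_; illustrative only).
struct HashBitShifts {
  int shift_for_block_and_stamp;  // applied first to the 32-bit hash
  int shift_for_block;            // applied second to isolate the block id
};

constexpr HashBitShifts DeriveShifts(int log_blocks, int bits_hash = 32, int bits_stamp = 7) {
  if (log_blocks + bits_stamp > bits_hash) {
    // Overlapping case (N > 25): never shift by a negative amount (UB).
    // The stamp bits now overlap the block-id bits, trading some extra
    // false-positive key comparisons for correctness.
    return {0, bits_hash - log_blocks};
  }
  return {bits_hash - log_blocks - bits_stamp, bits_stamp};
}

// Branch-free extraction usable in a tight loop.
inline void BlockIdAndStamp(uint32_t hash, HashBitShifts s,
                            uint64_t* block_id, uint32_t* stamp) {
  uint32_t block_and_stamp = hash >> s.shift_for_block_and_stamp;
  // Keep the block id 64-bit so multiplying by the block size cannot wrap
  // around 1 << 31 (issue 1 is about exactly that kind of 32-bit offset).
  *block_id = static_cast<uint64_t>(block_and_stamp) >> s.shift_for_block;
  *stamp = block_and_stamp & 0x7Fu;  // low 7 bits are the stamp
}
```

Precomputing the two shift amounts keeps the hot loop free of the `if (log_blocks_ + bits_stamp_ > bits_hash_)` branch; the overlap case only costs additional false-positive comparisons.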
…5285)

### Rationale for this change

The level histogram in size statistics can be omitted if its max level is 0. We haven't implemented this yet and enforce the histogram size to be equal to `max_level + 1`, so when reading a Parquet file with an omitted level histogram, an exception is thrown.

### What changes are included in this PR?

Omit the level histogram when the max level is 0.

### Are these changes tested?

Yes, a test case has been added to reflect the change.

### Are there any user-facing changes?

No.

* GitHub Issue: #45283

Lead-authored-by: Gang Wu <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Gang Wu <[email protected]>
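A hedged sketch of the rule being implemented (the helper names are illustrative, not the actual Parquet C++ internals): a level histogram carries no information when the max level is 0, so the writer may omit it and the reader must accept its absence.

```cpp
#include <cstdint>
#include <vector>

// Writer side: only emit a histogram when max_level > 0; with max_level == 0
// every level is 0, so the histogram is fully determined by the value count.
std::vector<int64_t> MaybeLevelHistogram(int16_t max_level,
                                         const std::vector<int64_t>& counts) {
  if (max_level == 0) return {};  // histogram omitted
  return counts;                  // expected size: max_level + 1
}

// Reader side: accept either an omitted histogram or one of size max_level + 1.
bool HistogramIsValid(int16_t max_level, const std::vector<int64_t>& histogram) {
  return histogram.empty() ||
         histogram.size() == static_cast<size_t>(max_level) + 1;
}
```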
…d MinIO (#45310)

### Rationale for this change

Some AWS SDK versions have faulty chunked encoding when the body is 0 bytes: aws/aws-sdk-cpp#3259

### What changes are included in this PR?

Work around the faulty chunked encoding implementation by only setting a body stream if it is non-empty.

### Are these changes tested?

Locally for now, but this will be picked up by CI (and conda-forge) at some point.

### Are there any user-facing changes?

No.

* GitHub Issue: #45304

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
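The workaround pattern, sketched generically rather than as the actual Arrow S3 filesystem code (the request type and its `SetBody` setter below are stand-ins for the SDK API): attach a body stream only when there is data, so the SDK never produces a zero-byte chunked body.

```cpp
#include <cstdint>
#include <iostream>
#include <memory>
#include <sstream>

// Stand-in for an SDK request type that accepts an optional body stream
// (hypothetical; the real AWS SDK request classes are more involved).
struct FakeRequest {
  std::shared_ptr<std::iostream> body;
  void SetBody(std::shared_ptr<std::iostream> stream) { body = std::move(stream); }
};

// Only give the SDK a body stream when there is data to send; an empty upload
// leaves the body unset, sidestepping the zero-byte chunked-encoding bug.
template <typename RequestT>
void MaybeSetBody(RequestT* request, const void* data, int64_t nbytes) {
  if (nbytes == 0) return;  // no body stream for empty payloads
  auto stream = std::make_shared<std::stringstream>();
  stream->write(static_cast<const char*>(data), static_cast<std::streamsize>(nbytes));
  request->SetBody(std::move(stream));
}

int main() {
  FakeRequest non_empty, empty;
  MaybeSetBody(&non_empty, "hello", 5);
  MaybeSetBody(&empty, nullptr, 0);
  std::cout << (non_empty.body ? "body set" : "body omitted") << " / "
            << (empty.body ? "body set" : "body omitted") << "\n";
}
```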
### Rationale for this change

See #45120

### What changes are included in this PR?

Disable pointless test

### Are these changes tested?

N/A

### Are there any user-facing changes?

No

* GitHub Issue: #45357

Lead-authored-by: David Li <[email protected]>
Co-authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…nd multiple row groups (#45350)

### Rationale for this change

The logic for loading `arrow::ArrayStatistics` depends on `parquet::ColumnChunkMetaData`. We can't get `parquet::ColumnChunkMetaData` when the requested row groups are empty, because no associated row group and column chunk exist. We also can't use multiple `parquet::ColumnChunkMetaData`s for now, because we don't have statistics merge logic. So we can't load statistics when we use multiple row groups.

### What changes are included in this PR?

* Don't load statistics when no row groups are used
* Don't load statistics when multiple row groups are used
* Add `parquet::ArrowReaderProperties::{set_,}should_load_statistics()` to enforce loading statistics by reading row groups one by one

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #45339

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
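A hedged usage sketch of the new reader property: `set_should_load_statistics()` is the accessor named in the PR description, the surrounding setup follows the usual `parquet::arrow` reader API, and the file path is hypothetical.

```cpp
#include <memory>

#include <arrow/io/file.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>
#include <parquet/properties.h>

// Opt in to loading Arrow statistics even across multiple row groups, at the
// cost of reading row groups one by one (per the PR description).
arrow::Status ReadWithStatistics() {
  ARROW_ASSIGN_OR_RAISE(auto input,
                        arrow::io::ReadableFile::Open("/tmp/example.parquet"));

  parquet::ArrowReaderProperties arrow_props;
  arrow_props.set_should_load_statistics(true);  // property added by this change

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(input));
  builder.properties(arrow_props);

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.Build(&reader));

  std::shared_ptr<arrow::Table> table;
  return reader->ReadTable(&table);
}
```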
…pandas>=2.3 (#45383)

The option already exists in pandas 2.2, but our code does not work with that version, so it is restricted to pandas >= 2.3.

* GitHub Issue: #45296

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
@github-actions crossbow submit --group verify-rc-source

@github-actions crossbow submit --group packaging

Revision: e696b26 Submitted crossbow builds: ursacomputing/crossbow @ actions-0ef1a4e171

Revision: e696b26 Submitted crossbow builds: ursacomputing/crossbow @ actions-aae58ed5ac
…stics (#45202)

We found out in #45085 that there is non-trivial overhead when writing size statistics is enabled. Dramatically reduce that overhead by speeding up def/rep level histogram updates.

Performance results on the author's machine:

```
------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                              Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>               8103053 ns      8098569 ns           86 bytes_per_second=1003.26Mi/s items_per_second=129.477M/s output_size=537.472k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>        8153499 ns      8148492 ns           86 bytes_per_second=997.117Mi/s items_per_second=128.683M/s output_size=537.488k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type> 8212560 ns      8207754 ns           83 bytes_per_second=989.918Mi/s items_per_second=127.754M/s output_size=537.502k page_index_size=47
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType>              10405020 ns     10400775 ns          67 bytes_per_second=444.142Mi/s items_per_second=100.817M/s output_size=848.305k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>       10464784 ns     10460778 ns          66 bytes_per_second=441.594Mi/s items_per_second=100.239M/s output_size=848.325k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType> 10469832 ns    10465739 ns          67 bytes_per_second=441.385Mi/s items_per_second=100.191M/s output_size=848.344k page_index_size=48
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                    13004962 ns     12992678 ns          52 bytes_per_second=657.101Mi/s items_per_second=80.7052M/s output_size=617.464k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>             13718352 ns     13705599 ns          50 bytes_per_second=622.921Mi/s items_per_second=76.5071M/s output_size=617.486k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>      13845553 ns     13832138 ns          52 bytes_per_second=617.222Mi/s items_per_second=75.8072M/s output_size=617.506k page_index_size=54
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType>                   15715263 ns     15702707 ns          44 bytes_per_second=320.449Mi/s items_per_second=66.7768M/s output_size=927.326k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>            16507328 ns     16493800 ns          43 bytes_per_second=305.079Mi/s items_per_second=63.5739M/s output_size=927.352k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>     16575359 ns     16561311 ns          42 bytes_per_second=303.836Mi/s items_per_second=63.3148M/s output_size=927.377k page_index_size=55
```

Performance results without this PR:

```
------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                              Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>               8042576 ns      8037678 ns           87 bytes_per_second=1010.86Mi/s items_per_second=130.458M/s output_size=537.472k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>        9576627 ns      9571279 ns           73 bytes_per_second=848.894Mi/s items_per_second=109.554M/s output_size=537.488k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type> 9570204 ns      9563595 ns           73 bytes_per_second=849.576Mi/s items_per_second=109.642M/s output_size=537.502k page_index_size=47
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType>              10165397 ns     10160868 ns          69 bytes_per_second=454.628Mi/s items_per_second=103.197M/s output_size=848.305k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>       11662568 ns     11657396 ns          60 bytes_per_second=396.265Mi/s items_per_second=89.9494M/s output_size=848.325k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType> 11657135 ns    11653063 ns          60 bytes_per_second=396.412Mi/s items_per_second=89.9829M/s output_size=848.344k page_index_size=48
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                    13182006 ns     13168704 ns          51 bytes_per_second=648.318Mi/s items_per_second=79.6264M/s output_size=617.464k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>             16438205 ns     16421762 ns          43 bytes_per_second=519.89Mi/s items_per_second=63.8528M/s output_size=617.486k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>      16424615 ns     16409032 ns          42 bytes_per_second=520.293Mi/s items_per_second=63.9024M/s output_size=617.506k page_index_size=54
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType>                   15387808 ns     15373086 ns          46 bytes_per_second=327.32Mi/s items_per_second=68.2086M/s output_size=927.326k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>            18319628 ns     18302938 ns          37 bytes_per_second=274.924Mi/s items_per_second=57.29M/s output_size=927.352k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>     18346665 ns     18329336 ns          37 bytes_per_second=274.528Mi/s items_per_second=57.2075M/s output_size=927.377k page_index_size=55
```

Tested by existing tests, validated by existing benchmarks. No user-facing changes.

* GitHub Issue: #45201

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
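Not the actual patch, but a plausible sketch of the kind of histogram-update fast path such a speedup typically relies on: when the max level is 0 (or 1), the per-value indexed increment collapses into simple counting.

```cpp
#include <cstdint>
#include <vector>

// Plausible shape of a def/rep level histogram update (illustrative only):
// fast-path the common cases instead of a per-value indexed increment.
void UpdateLevelHistogram(const int16_t* levels, int64_t num_levels,
                          int16_t max_level, std::vector<int64_t>* histogram) {
  if (max_level == 0) {
    // All levels are necessarily 0: one addition instead of a loop.
    (*histogram)[0] += num_levels;
    return;
  }
  if (max_level == 1) {
    // Only two buckets: count the 1s, derive the 0s.
    int64_t ones = 0;
    for (int64_t i = 0; i < num_levels; ++i) ones += levels[i];
    (*histogram)[1] += ones;
    (*histogram)[0] += num_levels - ones;
    return;
  }
  // General case: one bucket per distinct level value.
  for (int64_t i = 0; i < num_levels; ++i) {
    ++(*histogram)[levels[i]];
  }
}
```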
After my initial push, I noticed we needed to cherry-pick a 20.0.0 issue, #45201, in order to get a commit we already cherry-picked for this release to compile. I moved that issue into the 19.0.1 milestone and ran:

git checkout maint-19.0.1
git cherry-pick 2b5f56ca999678411f35862539f4f4a53b38de5a
git push apache maint-19.0.1

resulting in cherry-picked commit 1b9079c.
There's a failure in the regular CI checks that looks worth investigating. From https://github.com/apache/arrow/actions/runs/13081610957/job/36506197170?pr=45401:

Test failure output
The stack trace in the errors doesn't look like it would implicate PyArrow, since they're testing fsspec/s3fs, but I thought I'd check it out since this release contains an S3 change (even though it looks unrelated). I was able to reproduce the error locally, but it started to seem more like a MinIO bug. So I updated minio from

Edit: This appears to have been fixed in

cc @pitrou @h-vetinari just FYI in case knowing about MinIO's incompatibility with the newer AWS SDK versions is useful, see minio/minio#20845.

Edit: Nevermind on that FYI, I see #45305 now.
@github-actions crossbow submit --group verify-rc-source

@github-actions crossbow submit --group packaging

Revision: 1b9079c Submitted crossbow builds: ursacomputing/crossbow @ actions-631ed85276

Revision: 1b9079c Submitted crossbow builds: ursacomputing/crossbow @ actions-b32d59baa6
Use the latest MinIO server release, which includes a fix for minio/minio#20845. This allows us to remove the boto3 version constraint.

Are these changes tested? Yes, by existing CI tests.

Are there any user-facing changes? Yes.

* GitHub Issue: #45305

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
Revision: cbb438e Submitted crossbow builds: ursacomputing/crossbow @ actions-44a3332ffb

Revision: cbb438e Submitted crossbow builds: ursacomputing/crossbow @ actions-833d20ce19
CHANGELOG.md (Outdated)

## New Features and Improvements

* [GH-45201](https://github.com/apache/arrow/issues/45201) - [C++][Parquet] Improve performance of writing size statistics
* [GH-45304](https://github.com/apache/arrow/issues/45304) - [C++] Compatibility with newer aws sdk
* [GH-45305](https://github.com/apache/arrow/issues/45305) - [Python] Compatibility with boto 1.36
- [C++][Parquet] Improve performance of writing size statistics #45201: It doesn't change the current behavior; it improves performance, but it's not a new feature.
- [C++] Compatibility with newer aws sdk #45304: It doesn't change the current behavior. It just keeps the current behavior with recent AWS SDK for C++ versions, so it's not a new feature.
- [Python] Compatibility with boto 1.36 #45305: It just changes our CI, so it's not a new feature. (In general, we don't need to list CI-related changes in release notes because they're not relevant to users.)
Right. We could:
- Modify the labels on the issues and re-generate. The "enhancement" label seems like it's enough to cause an issue to get listed here.
- Or just edit the changelog
Has this come up before?
Let's choose "Or just edit the changelog" for now. We can revisit how to manage our issues and changelog later.

> Has this come up before?

I don't think so. I think we just didn't check the changelog carefully...
Okay, I'll tweak the header so it doesn't say "new features" but keep the issues since I think it's best for the changelog to be comprehensive.
Bumping versions didn't help fix the jobs failing on the upload-artifacts step. I think this happened before, so I'm going to go look at past testing PRs.
The Windows wheel failures seem related to: but this was fixed on nightlies without us taking action.
@github-actions crossbow submit ubuntu* -v 19.0.1

Revision: cbb438e Submitted crossbow builds: ursacomputing/crossbow @ actions-b26cb89bd5
A performance improvement is not a bugfix unless there was a severe regression, so yes, it's an enhancement. Normally, performance improvements are not included in bugfix releases (because every change is an opportunity for more bugs, and we want a bugfix release to fix existing bugs, not introduce new ones); this is an exception.
@github-actions crossbow submit debian* -v 19.0.1

Revision: cbb438e Submitted crossbow builds: ursacomputing/crossbow @ actions-ee0b027082

@github-actions crossbow submit almalinux-9-arm64

Revision: 7722b62 Submitted crossbow builds: ursacomputing/crossbow @ actions-81cbaf362c

@github-actions crossbow submit debian* --arrow-version 19.0.1.dev13

Revision: 7722b62 Submitted crossbow builds: ursacomputing/crossbow @ actions-d338ab75cd

@github-actions crossbow submit ubuntu* --arrow-version 19.0.1.dev13

Revision: 7722b62 Submitted crossbow builds: ursacomputing/crossbow @ actions-464f7177c8
Thanks so much @kou, I think the only remaining failing jobs are the Windows ones. How did you come up with the `19.0.1.dev13` version?
Sorry for testing without explanation. Crossbow uses arrow/dev/archery/archery/crossbow/core.py (line 753 at 31747f0).

It uses deb packages; see also arrow/dev/tasks/linux-packages/apache-arrow/Rakefile (lines 36 to 46 at 31747f0).

If we use If we use
Anyway, we are ready for the first RC.
You don't have to apologize for testing. The 13 part had me confused, so I appreciate the explanation. I see now how that all worked. I started a thread on Zulip to coordinate. Thank you for the help here.
Closing this as 19.0.1 RC0 has been created. Thanks all for the help. |
Caution
Do not merge this PR.
This PR is only to pre-check the first RC for 19.0.1 with crossbow and should not be merged.