
[BUG] Compute Sanitizer reports cudaErrorInvalidConfiguration error when writing ORC #13887

Closed
res-life opened this issue Aug 16, 2023 · 1 comment
Labels: 0 - Backlog (In queue waiting for assignment), bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf (C++/CUDA) code)

@res-life (Contributor):

Describe the bug
The y dimension of the dim3 grid size is zero when writing ORC for a table whose columns are all non-string. Details:

https://github.com/rapidsai/cudf/blob/v23.08.00/cpp/src/io/orc/dict_enc.cu#L68-L70

  auto const grid_size =
    dim3(cudf::util::div_rounding_up_unsafe<unsigned int>(num_rowgroups, block_size),
         static_cast<unsigned int>(num_str_cols));

grid_size.y is 0 when I print it out, which means num_str_cols is zero.
Each dimension (x, y, and z) of the dim3 passed to a kernel launch must be at least one.
The zero dimension caused the error:

========= COMPUTE-SANITIZER
========= Program hit cudaErrorInvalidConfiguration (error 9) due to "invalid configuration argument" on CUDA API call to cudaLaunchKernel_ptsz.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4545f6]
=========                in /usr/lib64/libcuda.so.1
=========     Host Frame: [0x3199c48]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:__device_stub__ZN4cudf2io3orc3gpu27rowgroup_char_counts_kernelENS_6detail11base_2dspanIiNS_11device_spanEEENS5_IKNS1_22orc_column_device_viewELm18446744073709551615EEENS4_IKNS1_13rowgroup_rowsES5_EENS5_IKjLm18446744073709551615EEE(cudf::detail::base_2dspan<int, cudf::device_span>&, cudf::device_span<cudf::io::orc::orc_column_device_view const, 18446744073709551615ul>&, cudf::detail::base_2dspan<cudf::io::orc::rowgroup_rows const, cudf::device_span>&, cudf::device_span<unsigned int const, 18446744073709551615ul>&) [0x1a94474]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::orc::gpu::rowgroup_char_counts_kernel(cudf::detail::base_2dspan<int, cudf::device_span>, cudf::device_span<cudf::io::orc::orc_column_device_view const, 18446744073709551615ul>, cudf::detail::base_2dspan<cudf::io::orc::rowgroup_rows const, cudf::device_span>, cudf::device_span<unsigned int const, 18446744073709551615ul>) [0x1a944ae]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::detail::orc::(anonymous namespace)::set_rowgroup_char_counts(cudf::io::detail::orc::orc_table_view&, cudf::detail::base_2dspan<cudf::io::orc::rowgroup_rows const, cudf::device_span>, rmm::cuda_stream_view) [0x1ac2c21]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::detail::orc::(anonymous namespace)::convert_table_to_orc_data(cudf::table_view const&, cudf::io::table_input_metadata const&, cudf::io::detail::orc::stripe_size_limits, int, bool, cudf::io::orc::CompressionKind, unsigned long, cudf::io::statistics_freq, bool, cudf::io::detail::single_write_mode, cudf::io::data_sink const&, rmm::cuda_stream_view) [0x1ad1657]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::detail::orc::writer::impl::write(cudf::table_view const&) [0x1ad2b30]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::orc_chunked_writer::write(cudf::table_view const&) [0x1a0adc9]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:Java_ai_rapids_cudf_Table_writeORCChunk [0x120e73e]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame: [0xffffffffe6e20b26]
=========                in 
========= 
========= Program hit cudaErrorInvalidConfiguration (error 9) due to "invalid configuration argument" on CUDA API call to cudaGetLastError.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4545f6]
=========                in /usr/lib64/libcuda.so.1
=========     Host Frame: [0x3193c14]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cub::CUB_101702_600_700_750_800_860_900_NS::PtxVersion(int&) [0x112da6d]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::detail::orc::(anonymous namespace)::decimal_chunk_sizes(cudf::io::detail::orc::orc_table_view&, cudf::io::detail::orc::file_segmentation const&, rmm::cuda_stream_view) [0x1ac601a]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::detail::orc::(anonymous namespace)::convert_table_to_orc_data(cudf::table_view const&, cudf::io::table_input_metadata const&, cudf::io::detail::orc::stripe_size_limits, int, bool, cudf::io::orc::CompressionKind, unsigned long, cudf::io::statistics_freq, bool, cudf::io::detail::single_write_mode, cudf::io::data_sink const&, rmm::cuda_stream_view) [0x1ad16d7]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::detail::orc::writer::impl::write(cudf::table_view const&) [0x1ad2b30]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:cudf::io::orc_chunked_writer::write(cudf::table_view const&) [0x1a0adc9]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame:Java_ai_rapids_cudf_Table_writeORCChunk [0x120e73e]
=========                in /tmp/cudf1642589926955139763.so
=========     Host Frame: [0xffffffffe6e20b26]
=========                in 
========= 
========= ERROR SUMMARY: 2 errors

Steps/Code to reproduce bug
The steps are in NVIDIA/spark-rapids-jni#1338.
Note that the steps are time-consuming due to the compilation. Alternatively, you can review the relevant cuDF code directly, and I can verify the issue after a fix.

git clone https://github.com/NVIDIA/spark-rapids-jni.git
cd spark-rapids-jni
# cherry-pick the changes from https://github.com/NVIDIA/spark-rapids-jni/pull/1321
./build/build-in-docker clean install -Dsubmodule.check.skip=true -DCPP_PARALLEL_LEVEL=15 -DskipTests
./build/build-in-docker test -Dtest=TableTest#testORCReadAndWriteForDecimal128 -DUSE_SANITIZER=ON -Dsubmodule.check.skip=true

Please let me know if you have any problems with the reproduction steps.

Expected behavior
The error should be fixed.
It would also be worth checking whether this issue causes other problems, and whether it impacts performance.

Environment overview (please complete the following information)
cuDF branch 23.10

Environment details

Additional context
Refer to JNI issue: NVIDIA/spark-rapids-jni#1338
If the test case is updated to write an ORC file that includes a String column, the y dimension becomes non-zero and the error disappears.
Refer to:

void testORCReadAndWriteForDecimal128() throws IOException {
    String[] colNames = new String[]{Columns.DECIMAL64.name,

==>>

    String[] colNames = new String[]{Columns.String.name,  // this means write a string column
...

compute-sanitizer is a tool that can detect GPU memory-related issues.

@res-life res-life added bug Something isn't working Needs Triage Need team to review and classify cuIO cuIO issue labels Aug 16, 2023
@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. 0 - Backlog In queue waiting for assignment and removed Needs Triage Need team to review and classify labels Sep 27, 2023
@vuule vuule self-assigned this Dec 4, 2023
@davidwendt (Contributor):

I believe this is resolved by #14139
