Check valid byte_range in parquet Column Chunk reading? #6255

Open
mapleFU opened this issue Aug 15, 2024 · 4 comments
Labels
enhancement (Any new improvement worthy of an entry in the changelog), parquet (Changes to the parquet crate)

Comments

Member

mapleFU commented Aug 15, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

See: apache/parquet-testing#58 (comment)

When reading a corrupt file, arrow-rs currently relies on an assertion in byte_range(), which panics on negative values:

    /// Returns the offset and length in bytes of the column chunk within the file
    pub fn byte_range(&self) -> (u64, u64) {
        let col_start = match self.dictionary_page_offset() {
            Some(dictionary_page_offset) => dictionary_page_offset,
            None => self.data_page_offset(),
        };
        let col_len = self.compressed_size();
        assert!(
            col_start >= 0 && col_len >= 0,
            "column start and length should not be negative"
        );
        (col_start as u64, col_len as u64)
    }

Would it be better to check the range here?

Describe the solution you'd like

Check the range either when building the group reader or in "byte_range()".
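One reading of "checking the range" is to validate, when the metadata is built, that the chunk lies inside the file. The sketch below is only an illustration under that assumption: `validate_chunk_range`, `RangeError`, and the `file_len` parameter are hypothetical names and are not part of the parquet crate.

```rust
use std::convert::TryFrom;

/// Hypothetical error type for this sketch (not the parquet crate's `ParquetError`).
#[derive(Debug, PartialEq)]
enum RangeError {
    /// The offset or the length was negative.
    Negative,
    /// The chunk would extend past the end of the file.
    OutOfBounds,
}

/// Validates a column chunk's signed offset and length against the file length.
/// Returns the range as unsigned values on success.
fn validate_chunk_range(
    col_start: i64,
    col_len: i64,
    file_len: u64,
) -> Result<(u64, u64), RangeError> {
    // The metadata carries signed integers, so a corrupt file can hold negatives.
    let start = u64::try_from(col_start).map_err(|_| RangeError::Negative)?;
    let len = u64::try_from(col_len).map_err(|_| RangeError::Negative)?;
    // Defensive: two non-negative i64 values cannot overflow u64 when added,
    // but checked_add keeps this correct if the input types ever change.
    let end = start.checked_add(len).ok_or(RangeError::OutOfBounds)?;
    if end > file_len {
        return Err(RangeError::OutOfBounds);
    }
    Ok((start, len))
}
```

A caller that knows the file size could run this once per column chunk and surface the error to the user instead of attempting an out-of-range read.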

Describe alternatives you've considered

    /// Returns the offset and length in bytes of the column chunk within the file
    pub fn byte_range(&self) -> Result<(u64, u64)> {
        let col_start = match self.dictionary_page_offset() {
            Some(dictionary_page_offset) => dictionary_page_offset,
            None => self.data_page_offset(),
        };
        let col_len = self.compressed_size();
        // Return an error rather than panicking on corrupt metadata
        if col_start < 0 || col_len < 0 {
            return Err(ParquetError::General(
                "column start and length should not be negative".to_string(),
            ));
        }
        Ok((col_start as u64, col_len as u64))
    }
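To try the Result-returning idea without the parquet crate, here is a minimal stand-alone sketch. `ChunkMeta` and `SketchError` are hypothetical stand-ins for `ColumnChunkMetaData` and `ParquetError`. The struct fields are an assumption made for illustration and do not reflect the real type's layout.

```rust
/// Stand-in for the parquet crate's error type; only the variant used here.
#[derive(Debug)]
enum SketchError {
    General(String),
}

/// Hypothetical, simplified stand-in for `ColumnChunkMetaData`.
struct ChunkMeta {
    dictionary_page_offset: Option<i64>,
    data_page_offset: i64,
    compressed_size: i64,
}

impl ChunkMeta {
    /// Returns the offset and length in bytes of the column chunk,
    /// or an error if either value is negative.
    fn byte_range(&self) -> Result<(u64, u64), SketchError> {
        // Prefer the dictionary page offset when one is present.
        let col_start = self
            .dictionary_page_offset
            .unwrap_or(self.data_page_offset);
        let col_len = self.compressed_size;
        if col_start < 0 || col_len < 0 {
            return Err(SketchError::General(
                "column start and length should not be negative".to_string(),
            ));
        }
        // Both values are known to be non-negative, so the casts are lossless.
        Ok((col_start as u64, col_len as u64))
    }
}
```

With this shape, callers would propagate the failure with `?` rather than hitting a panic. The trade-off is that every existing call site of `byte_range` has to handle a `Result`.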

Additional context

mapleFU added the enhancement and parquet labels Aug 15, 2024
Member Author

mapleFU commented Aug 15, 2024

@alamb I'm willing to try this, but I'm not very familiar with parquet-rs. Do you think this would be better checked in ColumnChunkMetaData::byte_range or in ColumnChunkMetaData::from_thrift? Or is this already checked, so we don't need it?

Contributor

alamb commented Aug 15, 2024

@alamb I'm willing to try this, but I'm not very familiar with parquet-rs. Do you think this would be better checked in ColumnChunkMetaData::byte_range or in ColumnChunkMetaData::from_thrift? Or is this already checked, so we don't need it?

I am not sure -- I think the first thing we should do is get a reproducer. Let me see if I can whip up some tests

Contributor

alamb commented Aug 15, 2024

See #6261 / #6262

Contributor

alamb commented Aug 15, 2024

Describe alternatives you've considered

That looks good to me, FWIW
