Can't write parquet of mixed types #159

zenazn · 2022-07-13T21:47:27Z

(Thank you so much for this project! I was about to write the very same thing and was fortunate enough to stumble upon your version)

The following code fails in version v0.3.1:

const table = tableFromArrays({
  name: ['carl'],
  fav_number: [3],
});
const arrowBytes = tableToIPC(table, 'file');
const writerProps = new WriterPropertiesBuilder()
  .setCompression(Compression.SNAPPY)
  .setEncoding(Encoding.PLAIN_DICTIONARY) // <-- encoding line
  .build();
const bytes = writeParquet2(arrowBytes, writerProps);
return bytes;

In particular, it throws the following error: External format error: Invalid argument error: The datatype Float64 cannot be encoded by PlainDictionary.

If I comment out the encoding line above, I instead get the following error: External format error: Not yet implemented: Dictionary arrays only support dictionary encoding

I don't think any single parquet encoding works for both strings and numbers—instead, encodings need to be settable or inferred per field, which requires an API change of some sort to WriterPropertiesBuilder.


kylebarron · 2022-07-14T00:08:49Z

> I don't think any single parquet encoding works for both strings and numbers

I think more specifically... some parquet encodings cannot be used for both strings and numbers. In the test case, read-write-read works:

parquet-wasm/tests/js/arrow2.ts

Lines 41 to 49 in 50ff7b0

    
           const initialTable = tableFromIPC(wasm.readParquet2(arr)); 
        
           const writerProperties = new wasm.WriterPropertiesBuilder().build(); 
        
           const parquetBuffer = wasm.writeParquet2( 
        
             tableToIPC(initialTable, "file"), 
        
             writerProperties 
        
           ); 
        
           const table = tableFromIPC(wasm.readParquet2(parquetBuffer));

for this table that has both strings and numbers:

parquet-wasm/tests/data/generate_data.py

Lines 10 to 13 in 50ff7b0

    
           "str": pa.array(["a", "b", "c", "d"], type=pa.string()), 
        
           "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()), 
        
           "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()), 
        
           "bool": pa.array([True, True, False, False], type=pa.bool_()),

when using the default Plain encoding:

parquet-wasm/src/arrow2/writer_properties.rs

Line 91 in 50ff7b0

let encoding = arrow2::io::parquet::write::Encoding::Plain;

But it makes sense that dictionary encodings won't work on some data types.

On the arrow1 side, metadata for column encoding is supported natively:

parquet-wasm/src/arrow1/writer_properties.rs

Line 187 in 50ff7b0

Self(self.0.set_column_encoding(column_path, value.to_arrow1()))

On the arrow2 side, we store encoding separately:

parquet-wasm/src/arrow2/writer_properties.rs

Line 59 in 50ff7b0

encoding: arrow2::io::parquet::write::Encoding,

So we'd need to store some sort of mapping from column name to encoding (with a separate fallback encoding). The writer properties don't have any knowledge of the dataset schema, so they wouldn't be able to catch mismatched column names or similar.
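
To make the idea concrete, here is a minimal sketch of a column-name-to-encoding mapping with a fallback. It is an illustration only, not the parquet-wasm API: the class `ColumnEncodingProperties`, its methods, and the string labels in `EncodingName` are all hypothetical. Since the properties object knows nothing about the schema, the column-name check can only run once the schema is available, for example at write time.

```typescript
// Hypothetical string labels standing in for parquet encodings.
type EncodingName = "PLAIN" | "PLAIN_DICTIONARY" | "RLE_DICTIONARY";

// Hypothetical container: a fallback encoding plus per-column overrides.
class ColumnEncodingProperties {
  private fallback: EncodingName = "PLAIN";
  private perColumn = new Map<string, EncodingName>();

  // Sets the encoding used for columns without an explicit override.
  setEncoding(value: EncodingName): this {
    this.fallback = value;
    return this;
  }

  // Sets the encoding for one column, identified by name.
  setColumnEncoding(column: string, value: EncodingName): this {
    this.perColumn.set(column, value);
    return this;
  }

  // Resolves the encoding for a column: override first, then fallback.
  encodingFor(column: string): EncodingName {
    return this.perColumn.get(column) ?? this.fallback;
  }

  // Returns overrides that name no column in the given schema.
  // This needs the schema, so it cannot run when the properties are built.
  unknownColumns(schemaColumns: string[]): string[] {
    const known = new Set(schemaColumns);
    return Array.from(this.perColumn.keys()).filter((c) => !known.has(c));
  }
}

const props = new ColumnEncodingProperties()
  .setEncoding("PLAIN")
  .setColumnEncoding("name", "RLE_DICTIONARY")
  .setColumnEncoding("fav_numbr", "PLAIN"); // deliberate typo in the column name

const resolvedName = props.encodingFor("name");
const resolvedNumber = props.encodingFor("fav_number");
const mismatched = props.unknownColumns(["name", "fav_number"]);
```

In this sketch the misspelled `fav_numbr` override is silently ignored when resolving `fav_number`, which falls back to the default; only the schema-aware `unknownColumns` check surfaces the mistake.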

zenazn · 2022-07-14T06:10:57Z

Oh interesting! It looks like tableFromArrays in Arrow JS infers dictionary-encoded vectors for strings rather than plain values, whereas your example parquet file uses plain strings:

https://github.com/apache/arrow/blob/e766828c699c6c74eba3b8c5de99e541017b8b9e/js/src/factories.ts#L142

I think I can work around this! Although of course it would be great to have support for arrow dictionaries turning into one of the parquet dictionary encodings by default
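
As a rough sketch of what such a default could look like, the function below picks an encoding per column from its Arrow type: dictionary columns get a dictionary encoding, everything else gets plain. This follows the two error messages in the original report (Float64 cannot use a dictionary encoding, and dictionary arrays only support dictionary encoding). The type names, the `SketchField` interface, and the `"RLE_DICTIONARY"` and `"PLAIN"` labels are assumptions for illustration, not parquet-wasm or arrow2 identifiers.

```typescript
// Hypothetical, simplified view of a column's Arrow type.
type SketchType = "Utf8" | "Float64" | "Int32" | "Bool" | "Dictionary";

// Hypothetical labels for the two encodings this sketch chooses between.
type InferredEncoding = "PLAIN" | "RLE_DICTIONARY";

interface SketchField {
  name: string;
  type: SketchType;
}

// Chooses an encoding per column: dictionary-typed columns must use a
// dictionary encoding, and all other columns default to plain.
function inferEncodings(fields: SketchField[]): Map<string, InferredEncoding> {
  const result = new Map<string, InferredEncoding>();
  for (const field of fields) {
    result.set(field.name, field.type === "Dictionary" ? "RLE_DICTIONARY" : "PLAIN");
  }
  return result;
}

// The table from the report: tableFromArrays infers a dictionary for the
// string column and Float64 for the number column.
const inferred = inferEncodings([
  { name: "name", type: "Dictionary" },
  { name: "fav_number", type: "Float64" },
]);
```

With per-column inference like this, a mixed table would not need a single encoding that suits every column.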

kylebarron · 2022-07-14T15:41:53Z

Ah interesting. My test case isn't testing writing a table initially created in arrow.js, because it loaded the table saved from pyarrow.

Regardless, as I mentioned above, I think it would be good to support column-specific encodings. They're already supported in arrow1; we just need to change encoding to a hashmap in the arrow2 writer properties.

zenazn · 2022-07-14T20:57:15Z

Agreed!

FWIW, inferring the type Utf8 for strings (instead of Dictionary<...>) is a decent workaround!
