[FFI] - RangeError: byte length of BigInt64Array should be a multiple of 8 #129

Open · Vectorrent opened this issue Sep 11, 2024 · 6 comments

@Vectorrent
I tried to load a new Parquet table using the same method I always use, but it failed with the following error:

(venv) [crow@crow-pc ode]$ node misc/parquetFailing.js 
file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:300
            ? new dataType.ArrayType(copyBuffer(dataView.buffer, dataPtr, length * byteWidth))
              ^

RangeError: byte length of BigInt64Array should be a multiple of 8
    at new BigInt64Array (<anonymous>)
    at parseDataContent (file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:300:15)
    at parseData (file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:175:16)
    at parseData (file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:139:23)
    at parseTable (file:///home/crow/repos/ode/node_modules/arrow-js-ffi/dist/arrow-js-ffi.es.mjs:935:28)
    at file:///home/crow/repos/ode/misc/parquetFailing.js:25:19
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

Node.js v18.20.4

This error is thrown when loading the table with FFI, but it does not happen when the same table is loaded through the IPC stream (the original, non-FFI path).
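
For context, the RangeError itself is trivial to reproduce in isolation: a BigInt64Array can only be constructed over a byte range whose length is a multiple of 8, which is presumably the constraint being violated inside copyBuffer when the length * byteWidth math in the stack trace goes wrong. A minimal sketch (illustration only, not arrow-js-ffi's actual code):

// Illustration only: BigInt64Array requires its backing byte length
// to be a multiple of 8
const backing = new ArrayBuffer(20)
new BigInt64Array(backing.slice(0, 16)) // ok: 16 bytes -> 2 elements
new BigInt64Array(backing.slice(0, 20)) // RangeError: byte length of BigInt64Array should be a multiple of 8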

Since I already found a workaround, this bug isn't a huge priority for me. But I thought you guys might want to know about it.

Here is some reproducible code:

import * as arrow from 'apache-arrow'
import { parseTable } from 'arrow-js-ffi'
import { wasmMemory, readParquet } from 'parquet-wasm'

const url =
    'https://huggingface.co/api/datasets/tiiuae/falcon-refinedweb/parquet/default/train/320.parquet'

// This one will succeed
;(async () => {
    const resp = await fetch(url)
    const buffer = new Uint8Array(await resp.arrayBuffer())
    const arrowWasmTable = readParquet(buffer)
    const table = arrow.tableFromIPC(arrowWasmTable.intoIPCStream())
    // note: intoIPCStream() consumes the Wasm-side table, and the plain
    // apache-arrow Table returned here has no free() method

    console.log('successfully loaded table via parquet-wasm')
})()

// This one will fail
;(async () => {
    const resp = await fetch(url)
    const buffer = new Uint8Array(await resp.arrayBuffer())
    const ffiTable = readParquet(buffer).intoFFI()

    const table = parseTable(
        wasmMemory().buffer,
        ffiTable.arrayAddrs(),
        ffiTable.schemaAddr()
    )
    // free the Wasm-side FFI table; the parsed apache-arrow Table has no free()
    ffiTable.free()

    console.log('successfully loaded table via FFI')
})()

Versions:

  • parquet-wasm v0.6.1
  • arrow-js-ffi v0.4.2
  • node v18.20.4
kylebarron transferred this issue from kylebarron/parquet-wasm on Sep 11, 2024
@kylebarron (Owner) commented Sep 11, 2024

@Vectorrent I'm unable to reproduce this:

  • Node v20.9.0
  • arrow-js-ffi latest main (effectively the same as the latest release)
  • parquet-wasm 0.6.1

With this test case:

// issue129.test.ts
import { readFileSync } from "fs";
import { readParquet, wasmMemory } from "parquet-wasm";
import { describe, it, expect } from "vitest";
import * as arrow from "apache-arrow";
import * as wasm from "rust-arrow-ffi";
import { parseTable } from "../src";

wasm.setPanicHook();

describe("issue 129", (t) => {
  const buffer = readFileSync("0320.parquet");

  const ffiTable = readParquet(buffer).intoFFI();
  const memory = wasmMemory();

  const table = parseTable(
    memory.buffer,
    ffiTable.arrayAddrs(),
    ffiTable.schemaAddr()
  );
  ffiTable.free();

  console.log(table.schema);

  it("Should pass", () => {
    expect(true).toBeTruthy();
  });
});
This logs the following schema:

Schema {
  fields: [
    Field {
      name: 'content',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'url',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'timestamp',
      type: [Timestamp_ [Timestamp]],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'dump',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'segment',
      type: [Utf8],
      nullable: true,
      metadata: Map(0) {}
    },
    Field {
      name: 'image_urls',
      type: [List],
      nullable: true,
      metadata: Map(0) {}
    }
  ],
  metadata: Map(1) {
    'huggingface' => '{"info": {"features": {"content": {"dtype": "string", "_type": "Value"}, "url": {"dtype": "string", "_type": "Value"}, "timestamp": {"dtype": "timestamp[s]", "_type": "Value"}, "dump": {"dtype": "string", "_type": "Value"}, "segment": {"dtype": "string", "_type": "Value"}, "image_urls": {"feature": {"feature": {"dtype": "string", "_type": "Value"}, "_type": "Sequence"}, "_type": "Sequence"}}}}'
  },
  dictionaries: Map(0) {},
  metadataVersion: 4
}

@Vectorrent (Author)

Strange. I tried your code (i.e. loading from disk), and that fails too. I upgraded to Node v22 and apache-arrow v17.0.0, with no luck. Not sure what else to try; maybe it's an engine thing? I'm running on Linux.
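
For what it's worth, the only 64-bit column in the schema above is timestamp (timestamp[s] is backed by int64), so that is presumably the buffer that ends up in BigInt64Array. A quick sanity check of that column over the working IPC path (a hypothetical sketch; the column name 'timestamp' is taken from the schema printed above):

// Hypothetical check: read via the IPC path, which works, and inspect the
// int64-backed 'timestamp' column, the likely source of the BigInt64Array
const table = arrow.tableFromIPC(readParquet(buffer).intoIPCStream())
console.log(table.getChild('timestamp')?.get(0))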

Anyway, not a huge priority, since I do have a workaround. Just thought it was worth reporting.

@kylebarron (Owner)

Are you able to slice that data (i.e. take the first 5 rows) and save it as a Parquet file that also fails for you? Then we could check that data into Git and add it as a test case to this repo.

It's good that reading from IPC works, but I do want to make sure that arrow-js-ffi is stable!
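
If it helps, something along these lines might produce the slice directly in JS (an untested sketch; it assumes parquet-wasm 0.6 exposes writeParquet and Table.fromIPCStream, alongside apache-arrow's tableToIPC):

// Untested sketch: take the first 5 rows and write them back to Parquet.
// Assumes parquet-wasm 0.6's writeParquet and Table.fromIPCStream APIs.
import { readFileSync, writeFileSync } from 'fs'
import * as arrow from 'apache-arrow'
import { readParquet, writeParquet, Table } from 'parquet-wasm'

const buffer = readFileSync('0320.parquet')
const table = arrow.tableFromIPC(readParquet(buffer).intoIPCStream())
const head = table.slice(0, 5) // first 5 rows
const parquetBytes = writeParquet(Table.fromIPCStream(arrow.tableToIPC(head, 'stream')))
writeFileSync('0320.sliced.parquet', parquetBytes)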

@Vectorrent (Author)

I sliced 5 rows with PyArrow, saved them to disk, then tried FFI again with the new file. No dice, it still fails.

Here's the sliced file: https://mega.nz/file/CRsFDJrC#3lRSoohQ1kohnqzX0O0TmVtjrsfgKRgj0KMLzxf2nU8

@kylebarron (Owner)

Ok, cool, thanks for making that file.

For reference, I find it much easier to zip a Parquet file and attach it to the issue itself on GitHub.

@Vectorrent (Author)

Oops, didn't realize zip files were supported here. See attached: 0320.output.parquet.zip
