feat(arrow): Triangulate on worker #2789
Conversation
Follow-up to #2788
It looked like the current code is running on batched geometries?
I may need to take another look. The current goal is to allow this to run once on each batch.
FWIW, I'm interested in worker scaffolding for arbitrary Arrow data. One complication with Arrow is to ensure that the view for the specific object instance you wish to transfer is not shared with other data. For example, if you load an Arrow Table from an IPC buffer, every column is a view onto the same backing ArrayBuffer (sorry for the screenshot, but that is what it was showing). If you transferred that ArrayBuffer, the table would stop working on the main thread. To get around that, we'd check that the view isn't shared, or do a memcopy to a new "owned" Data instance, and then transfer that.

As an aside, this is another reason why https://github.com/kylebarron/arrow-js-ffi will be so nice when it stabilizes. Right now, parquet-wasm uses only Arrow IPC for moving Arrow data from Wasm to JS. This means that the output Arrow JS table will always be shared views on a single backing buffer. But using arrow-js-ffi to move data from Wasm to JS means that every array can own its backing buffer.
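A minimal sketch of that shared-view check (a hypothetical helper, not code from this PR), assuming the data of interest is a typed-array view:

```ts
// Hypothetical helper: make sure a typed-array view owns its backing buffer
// before adding that buffer to a postMessage transferList. If the view is a
// window into a larger shared ArrayBuffer (as with an IPC-loaded Arrow
// table), transferring the buffer would detach it and break the table on
// the main thread.
function toOwnedArrayBuffer(view: Uint8Array): ArrayBuffer {
  const ownsWholeBuffer =
    view.byteOffset === 0 && view.byteLength === view.buffer.byteLength;
  // Caveat: even a full-length view can still be shared by other views;
  // this check only catches the common "slice of a bigger buffer" case.
  return ownsWholeBuffer
    ? (view.buffer as ArrayBuffer)
    : view.buffer.slice(view.byteOffset, view.byteOffset + view.byteLength);
}
```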
Yes, that pesky memory-block problem is why I am focusing on adding triangulation for the binary data representation, after we have converted from Arrow. On the upside, that should allow this triangulation worker to work for all table loaders that can convert to binary geometries.

I am personally more interested in batched / streaming loads rather than the atomic load case in your example. In the streaming case the returned Arrow may be assembled from multiple incoming memory chunks. I am not 100% clear how the Arrow library handles this in terms of ArrayBuffer sharing.
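For reference, this is roughly what triangulation on the binary representation looks like with earcut (a sketch with made-up sample data; the actual binary geometry layout in loaders.gl may differ):

```ts
import earcut from 'earcut';

// One polygon in a flat, interleaved [x0, y0, x1, y1, ...] layout -- the
// format-agnostic binary representation that any table loader can produce.
const positions = [0, 0, 10, 0, 10, 10, 0, 10]; // outer ring of a square
const holeIndices: number[] = []; // start indices of hole rings (none here)

// earcut returns flat triangle indices into the position array:
// two triangles (six indices) for the convex square above.
const triangleIndices: number[] = earcut(positions, holeIndices, 2);
```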
Thanks, @ibgreen! I will try to continue working on this PR. The current code is running on batched geometries, so what's in my mind is that each worker can handle one batch of geometries and output binary geometries plus triangle indices. I hope there is no memory issue preventing each batch from being transferable.
Thanks, @ibgreen! The worker infrastructure works great! I just added a job for parse-geoarrow.

From my experience with kepler, parsing geometries on the main thread blocks the UI when loading a big dataset. To address this issue and implement e.g. progressive map rendering on big datasets, we can move the parsing job from the main thread to parallel web workers.

I think maybe we could rename the worker, since it now does more than triangulation. Let me know what you think. Thanks!
function parseGeoArrowBatch(data: ParseGeoArrowInput): ParseGeoArrowResult {
  let binaryGeometries: BinaryDataFromGeoArrow | null = null;
  const {arrowData, chunkIndex, geometryColumnName, geometryEncoding, meanCenter, triangle} = data;
  const arrowTable = arrow.tableFromIPC(arrowData);
The raw Arrow ArrayBuffer is zero-copied to each web worker.
Not sure where this arrowData is coming from, but if you call tableToIPC somewhere, you're making a copy.
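For context, a sketch of the IPC round trip under discussion (assuming apache-arrow and a standard Worker; the message shape is illustrative):

```ts
import * as arrow from 'apache-arrow';

declare const table: arrow.Table; // an existing table on the main thread
declare const worker: Worker;

// Main thread: tableToIPC serializes the table into a new Uint8Array --
// i.e. a copy -- so transferring its buffer cannot detach the live table.
const ipc = arrow.tableToIPC(table, 'stream');
worker.postMessage({arrowData: ipc.buffer}, [ipc.buffer]);

// Worker: rebuild the table; its vectors are zero-copy views onto the
// single transferred buffer.
onmessage = (event: MessageEvent<{arrowData: ArrayBuffer}>) => {
  const arrowTable = arrow.tableFromIPC(new Uint8Array(event.data.arrowData));
  console.log(arrowTable.numRows);
};
```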
FYI, for earcut specifically, I created https://github.com/geoarrow/geoarrow-js as a place for some Arrow-specific operations (so far it wraps the area and winding operations from math.gl/polygon plus earcut), and I plan to implement a helper for post-messaging arrays.
100% agree.

Yes. My goal is primarily to avoid blocking the main thread (parallelism is a bonus).

We could rename it. I would wait a little, for a few reasons: we have plenty of formats that need WKT / WKB parsing, and, for triangulation, also GeoJSON, Shapefile, GeoPackage etc. If we need to create one worker for every format, it will get very tedious and error-prone.
Good progress. See my questions and let's align quickly off-line before we land.
@@ -37,6 +37,7 @@ export function parseArrowInBatches(
        shape: 'arrow-table',
        batchType: 'data',
        data: new arrow.Table([recordBatch]),
        rawArrayBuffer: asyncIterator,
This (including the input async iterator in the batch) raises red flags. What are you trying to achieve?
This should be removed.
@@ -21,17 +22,24 @@ export type TriangulationWorkerOutput =

export type ParseGeoArrowInput = {
  operation: 'parse-geoarrow';
  arrowData: ArrayBuffer;
  chunkData: {
Interesting that you chose to use the Arrow API on the worker. A problem with Arrow is that tables are JS classes (data + methods), and they can't be transferred to workers. To call this worker we need to extract the binary arrays from Arrow, so what do we gain by putting them back into Arrow again to do our calculations?
My idea here is to pass the pure data (not the methods in the class) to the web worker, so we can reconstruct the arrow.Data and then the arrow.Vector for parsing the specific chunk. Most of the pure data is metadata like offsets, indices, lengths etc.
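A minimal sketch of that reconstruction (assuming apache-arrow v9+ and a flat Float64 column; the field names are illustrative, not this PR's actual types):

```ts
import * as arrow from 'apache-arrow';

// Plain, structured-clone-friendly fields posted to the worker.
type PlainChunk = {
  length: number;
  nullCount: number;
  values: ArrayBuffer; // ideally transferred, not copied
};

// Worker side: rebuild arrow.Data, then wrap it in an arrow.Vector.
function rebuildVector(chunk: PlainChunk): arrow.Vector<arrow.Float64> {
  const data = arrow.makeData({
    type: new arrow.Float64(),
    length: chunk.length,
    nullCount: chunk.nullCount,
    data: new Float64Array(chunk.values)
  });
  return new arrow.Vector([data]);
}
```

Nested geometry columns would need their children rebuilt recursively, which runs into the caveat below about children holding class instances.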
    chunkData.length,
    chunkData.nullCount,
    chunkData.buffers,
    chunkData.children,
Note that while many things here are pure data, I expect that children can contain class instances that are not transferable to workers.
@ibgreen Is there any way to specify the value of transferList in the function at loaders.gl/modules/worker-utils/src/lib/worker-farm/worker-thread.ts (lines 72 to 76 at 0884690)?

Let me know if I was wrong: it looks like this function will only be called in worker-job.ts (loaders.gl/modules/worker-utils/src/lib/worker-farm/worker-job.ts, lines 34 to 39 at 0884690) without a transferList, and therefore getTransferList() will be called to automatically get transferable objects in the message data. Thank you!
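For readers, a conceptual sketch of what such an automatic transfer-list scan does (not the actual loaders.gl getTransferList() implementation):

```ts
// Walk the message object and collect every ArrayBuffer (directly, or
// behind a typed-array view) so postMessage can transfer rather than
// structured-clone them.
function collectTransferables(
  value: unknown,
  out: Set<Transferable> = new Set()
): Transferable[] {
  if (value instanceof ArrayBuffer) {
    out.add(value);
  } else if (ArrayBuffer.isView(value)) {
    out.add(value.buffer as ArrayBuffer);
  } else if (value && typeof value === 'object') {
    for (const child of Object.values(value)) {
      collectTransferables(child, out); // mutates the shared set
    }
  }
  return [...out];
}

// Usage: worker.postMessage(message, collectTransferables(message));
```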
Some notes: I've tried to use this worker in kepler (see PR here), but failed. Originally, I was thinking of sending each batch of data (arrow.Data) to a worker and reconstructing the arrow.Data and arrow.Vector. However, the batch data points to the same underlying buffer of the Arrow table, and that buffer will be put in the transferList, which detaches it on the main thread.

So I think we may need to make a hard copy of the slice of the buffer for each batch. Removing the raw buffer from the transferList is a quick way to test, but then the entire raw buffer of the Arrow table is copied to each worker. Or, we could borrow the post-messaging helper @kylebarron mentioned above.
I am not sure this is a good idea. I suspect that it requires enough plumbing that it is not super easy to add. But more importantly, if the objects are not transferred, I assume they will be serialized/deserialized, which will kill performance.

I would normally just strip any data that I don't want to be transferred, or replace it with copies. I think that is the way we should go. A bit more effort, but let's do it right. I heard some positive noises from @kylebarron on this topic; let's see if we can leverage his work.
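A sketch of that strip-or-copy step (hypothetical batch shape; only rawArrayBuffer is taken from this PR's diff, and reviewers above suggest it should be removed anyway):

```ts
// Before posting, strip fields that must not be transferred and replace
// any view that shares the table's backing buffer with an owned copy, so
// the automatic transfer-list scan can transfer safely.
function sanitizeBatch(batch: {data: Float64Array; rawArrayBuffer?: unknown}) {
  const {rawArrayBuffer, ...rest} = batch; // strip: never transfer/serialize
  const shared =
    rest.data.byteOffset !== 0 ||
    rest.data.byteLength !== rest.data.buffer.byteLength;
  // slice() copies the view into a fresh ArrayBuffer owned by this batch
  return {...rest, data: shared ? rest.data.slice() : rest.data};
}
```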
"hard cloning" the array isn't enough; you also need to monkey patch the data type before moving to a thread |
Starts wiring up earcut into the new triangulation worker.
This requires making some of the existing functions async.
@lixun910 My sense is that this is the wrong approach. The workload is too small, and it introduces too much async into a sync util library. It would be preferable to run the worker on one batch of geometries rather than on individual polygons.