feat(arrow): Triangulate on worker #2789
Conversation
Follow-up to #2788
It looked like the current code is running on batched geometries?
I may need to take another look. The current goal is to allow this to run once on each batch.
FWIW, I'm interested in worker scaffolding for arbitrary Arrow data. One complication with Arrow is to ensure that the view for the specific object instance you wish to transfer is not shared with other data. For example, if you load an Arrow Table from an IPC buffer, every column is a view onto the same backing ArrayBuffer (sorry for the screenshot, but that is what it was showing). If you transferred that ArrayBuffer, the table would stop working on the main thread. To get around that, we'd check that the view isn't shared, or do a memcopy to a new "owned" Data instance, and then transfer that.

As an aside, this is another reason why https://github.com/kylebarron/arrow-js-ffi will be so nice when it stabilizes. Right now, parquet-wasm uses only Arrow IPC for moving Arrow data from Wasm to JS. This means that the output Arrow JS table will always be shared views on a single backing buffer. But using arrow-js-ffi to move data from Wasm to JS means that every array can own its backing buffer.
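A minimal sketch of that shared-view check (a hypothetical helper, not code from this PR), assuming the data of interest is a typed-array view:

```ts
// Hypothetical helper: make sure a typed-array view owns its backing buffer
// before adding that buffer to a postMessage transferList. If the view is a
// window into a larger shared ArrayBuffer (as with an IPC-loaded Arrow
// table), transferring the buffer would detach it and break the table on
// the main thread.
function toOwnedArrayBuffer(view: Uint8Array): ArrayBuffer {
  const ownsWholeBuffer =
    view.byteOffset === 0 && view.byteLength === view.buffer.byteLength;
  // Caveat: even a full-length view can still be shared by other views;
  // this check only catches the common "slice of a bigger buffer" case.
  return ownsWholeBuffer
    ? (view.buffer as ArrayBuffer)
    : view.buffer.slice(view.byteOffset, view.byteOffset + view.byteLength);
}
```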
Yes, that pesky memory-block problem is why I am focusing on adding triangulation for the binary data representation, after we have converted from Arrow. On the upside, that should allow this triangulation worker to work for all table loaders that can convert to binary geometries.

I am personally more interested in batched / streaming loads rather than the atomic load case in your example. In the streaming case the returned Arrow may be assembled from multiple incoming memory chunks. I am not 100% clear how the Arrow library handles this in terms of ArrayBuffer sharing.
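For reference, this is roughly what triangulation on the binary representation looks like with earcut (a sketch with made-up sample data; the actual binary geometry layout in loaders.gl may differ):

```ts
import earcut from 'earcut';

// One polygon in a flat, interleaved [x0, y0, x1, y1, ...] layout -- the
// format-agnostic binary representation that any table loader can produce.
const positions = [0, 0, 10, 0, 10, 10, 0, 10]; // outer ring of a square
const holeIndices: number[] = []; // start indices of hole rings (none here)

// earcut returns flat triangle indices into the position array:
// two triangles (six indices) for the convex square above.
const triangleIndices: number[] = earcut(positions, holeIndices, 2);
```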
Thanks, @ibgreen! I will try to continue working on this PR. The current code is running on batched geometries, so what's in my mind is that each worker can handle one batch of geometries and output binary geometries plus triangle indices. I hope there is no memory issue preventing each batch from being transferable.
Thanks, @ibgreen! The worker infrastructure works great! I just added a job for parse-geoarrow.

From my experience with kepler, parsing geometries on the main thread blocks the UI when loading a big dataset. To address this issue and implement e.g. progressive map rendering on big datasets, we can move the parsing job from the main thread to parallel web workers.

I think maybe we could rename the worker, since it now does more than triangulation. Let me know what you think. Thanks!
function parseGeoArrowBatch(data: ParseGeoArrowInput): ParseGeoArrowResult {
  let binaryGeometries: BinaryDataFromGeoArrow | null = null;
  const {arrowData, chunkIndex, geometryColumnName, geometryEncoding, meanCenter, triangle} = data;
  const arrowTable = arrow.tableFromIPC(arrowData);
The raw Arrow ArrayBuffer is zero-copied to each web worker.
Not sure where this arrowData is coming from, but if you call tableToIPC somewhere, you're making a copy.
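For context, a sketch of the IPC round trip under discussion (assuming apache-arrow and a standard Worker; the message shape is illustrative):

```ts
import * as arrow from 'apache-arrow';

declare const table: arrow.Table; // an existing table on the main thread
declare const worker: Worker;

// Main thread: tableToIPC serializes the table into a new Uint8Array --
// i.e. a copy -- so transferring its buffer cannot detach the live table.
const ipc = arrow.tableToIPC(table, 'stream');
worker.postMessage({arrowData: ipc.buffer}, [ipc.buffer]);

// Worker: rebuild the table; its vectors are zero-copy views onto the
// single transferred buffer.
onmessage = (event: MessageEvent<{arrowData: ArrayBuffer}>) => {
  const arrowTable = arrow.tableFromIPC(new Uint8Array(event.data.arrowData));
  console.log(arrowTable.numRows);
};
```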
FYI, for earcut specifically, I created https://github.com/geoarrow/geoarrow-js as a place for some Arrow-specific operations (so far it wraps the area and winding operations from math.gl/polygon plus earcut), and I plan to implement a helper for post-messaging arrays.
100% agree.

Yes. My goal is primarily to avoid blocking the main thread (parallelism is a bonus).

We could rename it. I would wait a little, for a few reasons: we have plenty of formats that need WKT / WKB parsing, and, for triangulation, also GeoJSON, Shapefile, GeoPackage etc. If we need to create one worker for every format, it will get very tedious and error-prone.
Good progress. See my questions and let's align quickly off-line before we land.
@@ -37,6 +37,7 @@ export function parseArrowInBatches(
        shape: 'arrow-table',
        batchType: 'data',
        data: new arrow.Table([recordBatch]),
        rawArrayBuffer: asyncIterator,
This (including the input async iterator in the batch) raises red flags. What are you trying to achieve?
This should be removed.
@@ -21,17 +22,24 @@ export type TriangulationWorkerOutput =

export type ParseGeoArrowInput = {
  operation: 'parse-geoarrow';
  arrowData: ArrayBuffer;
  chunkData: {
Interesting that you chose to use the Arrow API on the worker. A problem with Arrow is that tables are JS classes (data + methods), and they can't be transferred to workers. To call this worker we need to extract the binary arrays from Arrow, so what do we gain by putting them back into Arrow again to do our calculations?
My idea here is to pass the pure data (not the methods in the class) to the web worker, so we can reconstruct the arrow.Data and then the arrow.Vector for parsing the specific chunk. Most of the pure data is metadata like offsets, indices, lengths etc.
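A minimal sketch of that reconstruction (assuming apache-arrow v9+ and a flat Float64 column; the field names are illustrative, not this PR's actual types):

```ts
import * as arrow from 'apache-arrow';

// Plain, structured-clone-friendly fields posted to the worker.
type PlainChunk = {
  length: number;
  nullCount: number;
  values: ArrayBuffer; // ideally transferred, not copied
};

// Worker side: rebuild arrow.Data, then wrap it in an arrow.Vector.
function rebuildVector(chunk: PlainChunk): arrow.Vector<arrow.Float64> {
  const data = arrow.makeData({
    type: new arrow.Float64(),
    length: chunk.length,
    nullCount: chunk.nullCount,
    data: new Float64Array(chunk.values)
  });
  return new arrow.Vector([data]);
}
```

Nested geometry columns would need their children rebuilt recursively, which runs into the caveat below about children holding class instances.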
    chunkData.length,
    chunkData.nullCount,
    chunkData.buffers,
    chunkData.children,
Note that while many things here are pure data, I expect that children can contain class instances that are not transferable to workers.
@ibgreen Is there any way to specify the value of transferList in the function at loaders.gl/modules/worker-utils/src/lib/worker-farm/worker-thread.ts (lines 72 to 76 at 0884690)?

Let me know if I was wrong: it looks like this function will only be called in worker-job.ts (loaders.gl/modules/worker-utils/src/lib/worker-farm/worker-job.ts, lines 34 to 39 at 0884690) without a transferList, and therefore getTransferList() will be called to automatically get transferable objects in the message data. Thank you!
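For readers, a conceptual sketch of what such an automatic transfer-list scan does (not the actual loaders.gl getTransferList() implementation):

```ts
// Walk the message object and collect every ArrayBuffer (directly, or
// behind a typed-array view) so postMessage can transfer rather than
// structured-clone them.
function collectTransferables(
  value: unknown,
  out: Set<Transferable> = new Set()
): Transferable[] {
  if (value instanceof ArrayBuffer) {
    out.add(value);
  } else if (ArrayBuffer.isView(value)) {
    out.add(value.buffer as ArrayBuffer);
  } else if (value && typeof value === 'object') {
    for (const child of Object.values(value)) {
      collectTransferables(child, out); // mutates the shared set
    }
  }
  return [...out];
}

// Usage: worker.postMessage(message, collectTransferables(message));
```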
Some notes: I've tried to use this worker in kepler (see PR here), but failed. Originally, I was thinking of sending each batch of data (arrow.Data) to a worker and reconstructing the arrow.Data and arrow.Vector. However, the batch data points to the same underlying buffer of the Arrow table, and that buffer will be put in the transferList, which detaches it on the main thread.

So I think we may need to make a hard copy of the slice of the buffer for each batch. Removing the raw buffer from the transferList is a quick way to test, but then the entire raw buffer of the Arrow table is copied to each worker. Or, we could borrow the post-messaging helper @kylebarron mentioned above.
I am not sure this is a good idea. I suspect that it requires enough plumbing that it is not super easy to add. But more importantly, if the objects are not transferred, I assume they will be serialized/deserialized, which will kill performance.

I would normally just strip any data that I don't want to be transferred, or replace it with copies. I think that is the way we should go. A bit more effort, but let's do it right. I heard some positive noises from @kylebarron on this topic; let's see if we can leverage his work.
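A sketch of that strip-or-copy step (hypothetical batch shape; only rawArrayBuffer is taken from this PR's diff, and reviewers above suggest it should be removed anyway):

```ts
// Before posting, strip fields that must not be transferred and replace
// any view that shares the table's backing buffer with an owned copy, so
// the automatic transfer-list scan can transfer safely.
function sanitizeBatch(batch: {data: Float64Array; rawArrayBuffer?: unknown}) {
  const {rawArrayBuffer, ...rest} = batch; // strip: never transfer/serialize
  const shared =
    rest.data.byteOffset !== 0 ||
    rest.data.byteLength !== rest.data.buffer.byteLength;
  // slice() copies the view into a fresh ArrayBuffer owned by this batch
  return {...rest, data: shared ? rest.data.slice() : rest.data};
}
```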
"hard cloning" the array isn't enough; you also need to monkey patch the data type before moving to a thread |
Starts wiring up earcut into the new triangulation worker.
This requires making some of the existing functions async.
@lixun910 My sense is that this is the wrong approach. The workload is too small, and it introduces too much async into a sync util library. It would be preferable to run the worker on one batch of geometries rather than on individual polygons.