Scale gs sort bucket size with scene size #7155

slimbuck · 2024-11-29T11:37:36Z

This PR scales the number of sorting buckets to the size of the scene. This is to address flickering found in larger scenes.

Notes:

- The web worker takes only slightly longer on large scenes:
  - bicycle scene sort went from 28ms -> 35ms (64k -> 1M buckets)
  - the urban scene, on which a user reported flicker, went from 5ms -> 5.7ms (64k -> 256k buckets)
- The web worker allocates more bucket memory than before. We previously always allocated 64k uint32s, but now allocate up to 1M uint32s.
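
The bucket counts and memory figures quoted above are consistent with picking a power-of-two bucket count from the splat count. A minimal sketch of that idea follows; the function names, the divide-by-four heuristic and the 16 to 20 bit clamp are assumptions for illustration, not the engine's actual formula.

```javascript
// Hypothetical helper (not the engine's code): choose the number of
// bucket bits from the splat count, clamped to an assumed [16, 20] range.
function bucketBitsForScene(numSplats, minBits = 16, maxBits = 20) {
    // Assumed heuristic: aim for roughly one bucket per four splats.
    const bits = Math.ceil(Math.log2(Math.max(1, numSplats) / 4));
    return Math.min(maxBits, Math.max(minBits, bits));
}

// Memory used by the bucket counters: one uint32 per bucket.
function bucketMemoryBytes(bits) {
    return (1 << bits) * Uint32Array.BYTES_PER_ELEMENT;
}
```

Under these assumptions a 3M-splat scene gets 2^20 buckets, i.e. 4 MB of uint32 counters, while a small scene stays at 64k buckets (256 KB).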

This is the before and after comparison on urban scene:

BEFORE:

Screen.Recording.2024-11-29.at.11.34.30.mov

AFTER:

Screen.Recording.2024-11-29.at.11.35.21.mov

mvaligursky

Perfect

willeastcott · 2024-11-29T12:13:21Z

Damn - this is lovely. 😍

fimbox · 2024-11-29T12:52:57Z

I am concerned about the added memory needs especially for large scenes on mobile with very limited memory.

What about storing the distances in exponential space instead of linear space?

something like
const d = 1 / (x * dx + y * dy + z * dz);

Needs changes in min max calculation, as well as inverting the results.

slimbuck · 2024-11-29T13:40:21Z

Hi @fimbox,

Are you suggesting 4MB overhead is too much for a scene comprising, say, 3M splats? Such a scene would require over 50MB of compressed data, never mind the other work buffers we create. Percentage-wise I actually think this is a fine trade-off to lower flickering on large scenes. (Note that smaller scenes allocate smaller buffers).

Nice suggestion about non-linear depth. I actually did try this (briefly), but it didn't result in smaller worst-case bucket size. Might be worth investigating further though or trying different mappings. As it stands, what I see using this update is that actually the linear buckets are fast and work surprisingly well with these added bits.

Thanks so much for your input!

fimbox · 2024-11-29T15:59:55Z

Hi @slimbuck
I agree, it's really not much, but I couldn't resist giving it a try.

Here is a demo of non-linear depth with my (very) old gsplat port. You can switch between Linear and Non-linear by pressing the button. I reduced the bit size to 11 so that the effect is visible on the currently loaded scene:

https://playcanvas.com/editor/scene/2124910

The idea is two-fold:

- Only include particles in front of the camera in the min/max bounds. If the camera is in the center of the splat, this can double the precision.
- After camera clipping, remap the depth to depth = 1.0 - 1.0 / (1.0 + depth)
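
A minimal sketch of the two steps above. The function names are hypothetical, and clamping behind-camera depths to zero is an assumed way of keeping them out of the min/max bounds; the demo project may do this differently.

```javascript
// Remap a non-negative view depth into [0, 1), giving near depths more resolution.
function remapDepth(depth) {
    return 1.0 - 1.0 / (1.0 + depth);
}

// Hypothetical helper: quantize view depths into bucket indices.
// depths: signed distances along the view direction (negative = behind camera).
function bucketIndices(depths, numBuckets, nonLinear) {
    const n = depths.length;
    const keys = new Float64Array(n);
    let min = Infinity;
    let max = -Infinity;
    for (let i = 0; i < n; i++) {
        // Assumption: splats behind the camera collapse onto depth 0.
        const d = Math.max(0, depths[i]);
        const k = nonLinear ? remapDepth(d) : d;
        keys[i] = k;
        if (k < min) min = k;
        if (k > max) max = k;
    }
    const range = max - min;
    const out = new Uint32Array(n);
    for (let i = 0; i < n; i++) {
        const t = range > 0 ? (keys[i] - min) / range : 0;
        out[i] = Math.min(numBuckets - 1, Math.floor(t * (numBuckets - 1)));
    }
    return out;
}
```

For depths `[0.5, 1, 1.5, 100]` and 4 buckets, the linear mapping puts the three near splats in the same bucket, while the remapped depths start to separate them.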

slimbuck · 2024-11-29T16:31:18Z

Hey this is so cool! I'm glad you couldn't resist! :D

The non-linear buckets actually seem to work better here. I also implemented skipping the sort for splats behind the camera, with updated buckets, but it was actually slower. Having sorted splats behind the camera is also useful when the camera moves and the latest sorted indices haven't arrived yet.

Don't you want to just implement this in the engine and submit a PR?

fimbox · 2024-12-13T17:40:28Z

Need to report that after this change, one of our AVP gsplat demos became unusable. The demo features a large splat and was already near the performance threshold. Post-change, the AVP intermittently shows a black screen, suggesting CPU/GPU overload. The screen recovers briefly before going black again.
I will try to implement the proposed solution, since the flicker reduction is needed as well.

slimbuck · 2024-12-13T18:01:14Z

Hi @fimbox,

Thanks for reporting this issue. I'm really sorry for the performance regression.

I was thinking of making the number of sorting bins configurable. This will help with performance, but as you say, ideally we wouldn't be forced to choose between flicker and rendering speed.

I have learned a few things on this topic since merging this PR:

Firstly, the reason for the slowdown on large scenes might not be what you think. The sorting code running on the CPU does run a bit slower now since there are more buckets, but that doesn't account for the rendering drop-off. The reason for the rendering performance hit is that placing splats into more buckets results in more re-ordering of splats at render time. This causes more cache misses on memory access, slowing rendering considerably. (For some context on the importance of memory ordering to performance see #6357). This implies that reordering splat data into render order at runtime might be the only solution here...

Secondly, I realised that it's precisely when gaussians move from one bucket to the next that flickering occurs. (The counting sort we use is stable, so it keeps the splat order otherwise). This means that the more accurate log scales mentioned above actually exacerbate the flickering issue.
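
To illustrate why stability matters here, below is a minimal stable counting sort over bucket indices. It is a sketch under assumed names, not the engine's sort worker code.

```javascript
// Hypothetical helper: stable counting sort.
// keys[i] is the bucket index of splat i; returns splat indices in sorted order.
function countingSortOrder(keys, numBuckets) {
    const counts = new Uint32Array(numBuckets);
    for (let i = 0; i < keys.length; i++) {
        counts[keys[i]]++;
    }
    // Exclusive prefix sum: counts[b] becomes the first output slot of bucket b.
    let sum = 0;
    for (let b = 0; b < numBuckets; b++) {
        const c = counts[b];
        counts[b] = sum;
        sum += c;
    }
    // Scatter in input order, which preserves the original order within a bucket.
    const order = new Uint32Array(keys.length);
    for (let i = 0; i < keys.length; i++) {
        order[counts[keys[i]]++] = i;
    }
    return order;
}
```

Splats that share a bucket keep their input order, so the relative order of two splats only changes when one of them crosses a bucket boundary.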

I am once again investigating approaches to speed up rendering in the face of these seemingly incompatible requirements.

If you make any headway on this, please do share!

Thanks

fimbox · 2024-12-16T18:01:26Z

Hi @slimbuck,

I made the following observations while playing around with the SortWorker.update function:

- A higher number of buckets usually does reduce flicker.
- The current implementation only changes bucket transitions when the camera orientation changes, not when the camera position changes. This makes transitions very stable.
- Every camera position-dependent "enhancement" increases flicker due to the bucket transition change.
- Limiting buckets to the front of the camera increases flicker because the bucket transitions also move with the camera.
- Using a non-linear mapping also changes bucket transitions based on camera position, which increases flicker (as you mentioned).
- In some cases, higher bucket density can outweigh the fact that the transition is camera position-dependent (see my demo).
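
The position-independence of linear buckets can be checked with a small experiment: translating the camera shifts every depth and the minimum by the same amount, so linear bucket indices are unchanged, whereas a non-linear remap breaks this. The sketch below uses hypothetical names and a full min/max range over all splats.

```javascript
// Hypothetical helper: bucket index per splat for a given camera.
// positions: array of [x, y, z]; camDir is assumed to be normalized.
// remap is an optional depth mapping (identity = linear buckets).
function depthBuckets(positions, camPos, camDir, numBuckets, remap = (d) => d) {
    const keys = positions.map((p) => {
        const d = (p[0] - camPos[0]) * camDir[0] +
                  (p[1] - camPos[1]) * camDir[1] +
                  (p[2] - camPos[2]) * camDir[2];
        return remap(d);
    });
    const min = Math.min(...keys);
    const range = Math.max(...keys) - min;
    return keys.map((k) => {
        const t = range > 0 ? (k - min) / range : 0;
        return Math.min(numBuckets - 1, Math.floor(t * (numBuckets - 1)));
    });
}

// Example: same view direction, two camera positions along it.
const positions = [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 10]];
const dir = [0, 0, 1];
const nonLinear = (d) => 1.0 - 1.0 / (1.0 + d);
const linearA = depthBuckets(positions, [0, 0, -1], dir, 8);
const linearB = depthBuckets(positions, [0, 0, -5], dir, 8);
const curvedA = depthBuckets(positions, [0, 0, -1], dir, 8, nonLinear);
const curvedB = depthBuckets(positions, [0, 0, -5], dir, 8, nonLinear);
```

Here `linearA` and `linearB` are identical, while `curvedA` and `curvedB` differ, which is the bucket-transition change described in the observations above.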

However, the number of buckets needed is not just dependent on the number of splats per se (as done in the current implementation). It is also dependent on the splat distribution. Worst case: a few distant splats combined with a very detailed splat assembly in the middle can produce flicker, even with a small number of splats.

To improve accuracy with limited memory, I tried:

- A 32-bit floating-point radix sort, which produces perfect sorting but is at least 2-3x slower than 16-bit sorting. Its performance is similar to using 20-bit buckets for smaller scenes but does not scale so well.
- Surprisingly, a WASM radix sort is not much faster.

I think a WebGPU sort is the next thing to try. What are your next steps?

willeastcott · 2024-12-16T18:14:32Z

@fimbox I'm wondering whether leveraging SIMD could improve your WASM sort.

willeastcott · 2024-12-16T18:19:28Z

Just quickly asked ChatGPT:

* SIMD can speed up the dot-product phase when computing sorting keys. You could batch-process multiple vertices at once using intrinsics.
* However, the counting and output-reordering phases of a radix sort are often limited by memory bandwidth, scatter/gather patterns, and partial sums—these don’t vectorize as cleanly.
* More significant speedups usually come from parallelizing the histogram counts (thread-level parallelism) or using a GPU-based approach, not from instruction-level (SIMD) vectorization alone.

So yes, some parts will get a boost with SIMD (the dot product especially). But for a pure CPU-based, single-threaded Radix Sort, don’t expect a huge overall speedup from SIMD alone. Look to parallelization, memory layout, and potential GPU offload for larger gains.

I don't have much insight into this stuff personally, but thought I'd kick off a discussion about WASM optimization...

Maksims · 2024-12-17T08:42:55Z

> but thought I'd kick off a discussion about WASM optimization

If there is very little communication between JS and WASM, a high computation workload in the meantime, a fixed memory footprint (no re-allocations), and simple raw buffer data communication, then WASM SIMD can work. But it very much depends on what gets compiled into WASM: it should be a lightweight, dependency-free solution that just solves a specific computation.

Also worth remembering that WASM SIMD is unlikely to perform as well as a compute shader solution with WebGPU, which in the long term is the way to go.

fimbox · 2024-12-17T14:56:51Z

I just optimized the hell out of the WASM radix sort (check here), only one vertex loop per pass and so on. It's faster, but it's still at half speed compared to the JS single bucket sort.

I also played around with WASM SIMD, but the hardest part (scatter) is not easily vectorizable, so the speed-up might be negligible.

mvaligursky · 2024-12-17T15:13:27Z

I initially wrote a radix sort as well, using 4x8 bits. To speed it up, I switched to 2x16 bits. The logical conclusion was a single pass, with the key dropped to 16 bits.

So perhaps try 2 passes instead of 4 as well?
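
For reference, a minimal sketch of a two-pass (2x16-bit) LSD radix sort over 32-bit float keys. This is an illustration under assumed names, not the code from either implementation discussed here, and it returns indices in ascending key order.

```javascript
// Hypothetical helper: returns indices that sort `values` ascending,
// using two stable counting passes of 16 bits each.
function radixSortFloat32(values) {
    const n = values.length;
    // Reinterpret the float32 values as raw 32-bit patterns.
    const bits = new Uint32Array(new Float32Array(values).buffer);
    const keys = new Uint32Array(n);
    for (let i = 0; i < n; i++) {
        const b = bits[i];
        // Make the bit pattern order-preserving as an unsigned integer:
        // flip all bits of negatives, set the sign bit of non-negatives.
        keys[i] = (b & 0x80000000) ? ~b : (b | 0x80000000);
    }
    let src = new Uint32Array(n);
    let dst = new Uint32Array(n);
    for (let i = 0; i < n; i++) src[i] = i;
    const counts = new Uint32Array(1 << 16);
    for (let shift = 0; shift < 32; shift += 16) {
        counts.fill(0);
        for (let i = 0; i < n; i++) {
            counts[(keys[src[i]] >>> shift) & 0xffff]++;
        }
        // Exclusive prefix sum gives the first output slot of each digit.
        let sum = 0;
        for (let d = 0; d < counts.length; d++) {
            const c = counts[d];
            counts[d] = sum;
            sum += c;
        }
        // Stable scatter by the current 16-bit digit.
        for (let i = 0; i < n; i++) {
            const idx = src[i];
            dst[counts[(keys[idx] >>> shift) & 0xffff]++] = idx;
        }
        [src, dst] = [dst, src];
    }
    return src;
}
```

Each pass touches a 64k-entry histogram, so two passes trade a larger histogram for half the number of scatter loops compared with 4x8 bits.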

fimbox · 2024-12-19T12:54:47Z

I created a project that compares the 3 approaches (linear bucket, non-linear and radix). It also outputs the sorting times to the logs.

For us, it seems we are good with non-linear mapping, and we can easily patch it on demand. It solves flicker on close objects and is as fast as linear sorting. It has some flickering in distant areas though.

With typical Scaniverse scans you can also observe flickering with the linear sorting approach, even with higher bucket counts.
These scenes have a detailed center and a very far splat-based "skybox". Non-linear mapping is perfect for such scenarios:

https://playcanvas.com/project/1285780/overview/gaussian-splatting-sorting

scale bucket size with scene size

8fe6aa1

slimbuck added bug area: graphics Graphics related issue labels Nov 29, 2024

slimbuck requested a review from a team November 29, 2024 11:37

slimbuck self-assigned this Nov 29, 2024

vercel bot deployed to Preview November 29, 2024 11:39 View deployment

mvaligursky approved these changes Nov 29, 2024


slimbuck merged commit 59813df into playcanvas:main Nov 29, 2024
8 checks passed

slimbuck deleted the sorter-dev-2 branch November 29, 2024 12:00

slimbuck added a commit to slimbuck/engine that referenced this pull request Nov 29, 2024

Scale gs sort bucket size with scene size (playcanvas#7155)

7ef3404
