Scale gs sort bucket size with scene size #7155

Merged
merged 1 commit into from
Nov 29, 2024

slimbuck
Member

@slimbuck slimbuck commented Nov 29, 2024

This PR scales the number of sorting buckets to the size of the scene. This is to address flickering found in larger scenes.

Notes:

  • the webworker takes only slightly longer on large scenes:
    • bicycle scene sort went from 28ms -> 35ms (64k -> 1M buckets)
    • the urban scene which user reported flicker on went from 5ms -> 5.7ms (64k -> 256k buckets)
  • the webworker allocates more bucket memory than before: it previously always allocated 64k uint32s (256 KB), and now allocates up to 1M uint32s (4 MB).
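
As a rough illustration of the idea in these notes, the sketch below picks a bucket count from the splat count and then runs a counting sort over linear depth. The 64k floor and 1M ceiling come from the notes; `chooseBucketBits`, `sortByDepth`, and the log2 rule between those limits are assumptions made for illustration, not the code in this PR.

```javascript
// Hypothetical rule: one bucket bit per doubling of the splat count,
// clamped to 16 bits (64k buckets) and 20 bits (1M buckets).
function chooseBucketBits(numSplats) {
    const bits = Math.ceil(Math.log2(Math.max(1, numSplats)));
    return Math.min(20, Math.max(16, bits));
}

// Stable counting sort of splat indices by linear depth.
// depths: array of view-direction distances, one per splat.
function sortByDepth(depths, bucketBits) {
    const numBuckets = 1 << bucketBits;
    let min = Infinity;
    let max = -Infinity;
    for (const d of depths) {
        if (d < min) min = d;
        if (d > max) max = d;
    }
    const range = max - min;
    const scale = range > 0 ? (numBuckets - 1) / range : 0;

    // Histogram of bucket keys.
    const keys = new Uint32Array(depths.length);
    const counts = new Uint32Array(numBuckets);
    for (let i = 0; i < depths.length; i++) {
        keys[i] = Math.min(numBuckets - 1, Math.floor((depths[i] - min) * scale));
        counts[keys[i]]++;
    }

    // Exclusive prefix sum turns counts into start offsets.
    let sum = 0;
    for (let b = 0; b < numBuckets; b++) {
        const c = counts[b];
        counts[b] = sum;
        sum += c;
    }

    // Scatter indices in input order, which keeps the sort stable.
    const order = new Uint32Array(depths.length);
    for (let i = 0; i < depths.length; i++) {
        order[counts[keys[i]]++] = i;
    }
    return order;
}

const order = sortByDepth([3, 1, 2], chooseBucketBits(3));
```

The memory cost is the `counts` array: one uint32 per bucket, so 20 bits means 4 MB regardless of how many splats land in each bucket.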

This is the before and after comparison on urban scene:

BEFORE:

Screen.Recording.2024-11-29.at.11.34.30.mov

AFTER:

Screen.Recording.2024-11-29.at.11.35.21.mov

@slimbuck slimbuck added bug area: graphics Graphics related issue labels Nov 29, 2024
@slimbuck slimbuck requested a review from a team November 29, 2024 11:37
@slimbuck slimbuck self-assigned this Nov 29, 2024
Contributor

@mvaligursky mvaligursky left a comment

Perfect

@slimbuck slimbuck merged commit 59813df into playcanvas:main Nov 29, 2024
8 checks passed
@slimbuck slimbuck deleted the sorter-dev-2 branch November 29, 2024 12:00
slimbuck added a commit to slimbuck/engine that referenced this pull request Nov 29, 2024
@willeastcott
Contributor

Damn - this is lovely. 😍

@fimbox

fimbox commented Nov 29, 2024

I am concerned about the added memory needs especially for large scenes on mobile with very limited memory.

What about storing the distances in exponential space instead of linear space?

something like
const d = 1 / (x * dx + y * dy + z * dz);

Needs changes in min max calculation, as well as inverting the results.

@slimbuck
Member Author

Hi @fimbox,

Are you suggesting 4MB overhead is too much for a scene comprising, say, 3M splats? Such a scene would require over 50MB of compressed data, never mind the other work buffers we create. Percentage-wise I actually think this is a fine trade-off to lower flickering on large scenes. (Note that smaller scenes allocate smaller buffers).

Nice suggestion about non-linear depth. I actually did try this (briefly), but it didn't result in smaller worst-case bucket size. Might be worth investigating further though or trying different mappings. As it stands, what I see using this update is that actually the linear buckets are fast and work surprisingly well with these added bits.

Thanks so much for your input!

@fimbox

fimbox commented Nov 29, 2024

Hi @slimbuck
I agree, it's really not much, but I couldn't resist giving it a try.

Here is a demo of non-linear depth with my (very) old gsplat port. You can switch between Linear and Non-linear by pressing the button. I reduced the bit size to 11 to have an effect on the currently loaded scene:

https://playcanvas.com/editor/scene/2124910

The idea is two-fold:

  1. Only take particles in front of the camera in the min max bound. If the cam is in the center of the splat this can double the precision.
  2. After camera clipping, remap the depth to depth = 1.0 - 1.0 / (1.0 + depth)
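
A minimal sketch of the two steps above, assuming depth is measured from the camera along the view direction and that the bounds are taken over splats in front of the camera only. The function names and the clamping of splats behind the camera are assumptions for illustration, not the demo's code.

```javascript
// Remap a non-negative depth into [0, 1). Near depths get more of the range.
function remapDepth(depth) {
    return 1.0 - 1.0 / (1.0 + depth);
}

// Linear bucket index over [minDepth, maxDepth].
function linearBucket(depth, minDepth, maxDepth, bucketBits) {
    const numBuckets = 1 << bucketBits;
    const t = maxDepth > minDepth ? (depth - minDepth) / (maxDepth - minDepth) : 0;
    return Math.min(numBuckets - 1, Math.max(0, Math.floor(t * numBuckets)));
}

// Non-linear bucket index: clip bounds to the front of the camera, then remap.
function nonLinearBucket(depth, minDepth, maxDepth, bucketBits) {
    const numBuckets = 1 << bucketBits;
    const lo = remapDepth(Math.max(minDepth, 0));
    const hi = remapDepth(Math.max(maxDepth, 0));
    // Assumption: splats behind the camera collapse into the nearest bucket.
    const d = remapDepth(Math.max(depth, 0));
    const t = hi > lo ? (d - lo) / (hi - lo) : 0;
    return Math.min(numBuckets - 1, Math.max(0, Math.floor(t * numBuckets)));
}

// Two close splats near the camera, with the scene extending out to depth 1000,
// using 11 bucket bits as in the demo.
const linearNear = [linearBucket(1.0, 0, 1000, 11), linearBucket(1.01, 0, 1000, 11)];
const nonLinearNear = [nonLinearBucket(1.0, 0, 1000, 11), nonLinearBucket(1.01, 0, 1000, 11)];
```

With these numbers the linear mapping puts both near splats in the same bucket, while the remapped version separates them. The cost is coarser buckets far from the camera.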

@slimbuck
Member Author

Hey this is so cool! I'm glad you couldn't resist! :D

The non-linear buckets actually seem to work better here. I also implemented skipping the sort for splats behind the camera, with updated buckets, but it was actually slower, and having sorted splats behind the camera is good when the camera moves and hasn't yet got the latest sorted indices.

Don't you want to just implement this in the engine and submit a PR?

@fimbox

fimbox commented Dec 13, 2024

Need to report that after this change, one of our AVP gsplat demos became unusable. The demo features a large splat and was already near the performance threshold. Post-change, the AVP intermittently shows a black screen, suggesting CPU/GPU overload. The screen recovers briefly before going black again.
I will try to implement the proposed solution, since the flicker reduction is needed as well.

@slimbuck
Member Author

slimbuck commented Dec 13, 2024

Hi @fimbox,

Thanks for reporting this issue. I'm really sorry for the performance regression.

I was thinking of making the number of sorting bins configurable. This will help with performance, but as you say, ideally we wouldn't be forced to choose between flicker and rendering speed.

I have learned a few things on this topic since merging this PR:

Firstly, the reason for the slowdown on large scenes might not be what you think. The sorting code running on the CPU does run a bit slower now since there are more buckets, but that doesn't account for the rendering drop-off. The reason for the rendering performance hit is that placing splats into more buckets results in more render re-ordering of splats. This results in more memory access cache misses, slowing rendering considerably. (For some context on the importance of memory ordering to performance see #6357). This implies that reordering splat data into render order at runtime might be the only solution here...

Secondly, I realised that it's precisely when gaussians move from one bucket to the next that flickering occurs. (The counting sort we use is stable, so it keeps the splat order otherwise). This means that the more accurate log scales mentioned above actually exacerbate the flickering issue.
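
To make the bucket-transition effect concrete, here is a small sketch with two splats and four linear buckets. The built-in `Array.prototype.sort` (stable since ES2019) stands in for the stable counting sort; `bucketOf` and `drawOrder` are hypothetical names.

```javascript
// Bucket index for a depth in [min, max], using numBuckets linear buckets.
function bucketOf(depth, min, max, numBuckets) {
    const t = (depth - min) / (max - min);
    return Math.min(numBuckets - 1, Math.floor(t * numBuckets));
}

// Draw order for one frame: splat indices sorted by bucket only.
// Splats sharing a bucket keep their input order because the sort is stable.
function drawOrder(depths, min, max, numBuckets) {
    const keys = depths.map((d) => bucketOf(d, min, max, numBuckets));
    return depths.map((_, i) => i).sort((a, b) => keys[a] - keys[b]);
}

// Frame A: both splats fall in bucket 1, so input order [0, 1] is kept
// even though splat 1 is slightly nearer.
const frameA = drawOrder([0.45, 0.40], 0, 1, 4);

// Frame B: the camera moved a little. Splat 0 crossed the 0.5 boundary into
// bucket 2 while splat 1 stayed in bucket 1, so the order flips to [1, 0].
const frameB = drawOrder([0.52, 0.47], 0, 1, 4);
```

The relative depth of the two splats never changed between the frames, yet their draw order did. Finer buckets change where and how often such crossings happen.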

I am again busy investigating approaches to speed up rendering in the face of these seemingly incompatible requirements.

If you make any headway on this, please do share!

Thanks

@fimbox

fimbox commented Dec 16, 2024

Hi @slimbuck,

I made the following observations while playing around with the SortWorker.update function:

  • A higher number of buckets usually does reduce flicker.
  • The current implementation only changes bucket transitions when the camera orientation changes, not when the camera position changes; this makes transitions very stable.
  • Every camera position-dependent "enhancement" increases flicker due to the bucket transition change.
  • Limiting buckets to the front of the camera increases flicker because the bucket transitions also move with the camera.
  • Using a non-linear mapping also changes bucket transitions based on camera position, which increases flicker (as you mentioned).
  • In some cases, higher bucket density can outweigh the fact that the transition is camera position-dependent (see my demo).

However, the number of buckets needed is not just dependent on the number of splats per se (as done in the current implementation). It is also dependent on the splat distribution. Worst case: a few distant splats combined with a very detailed splat assembly in the middle can produce flicker, even with a small number of splats.
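
The worst case described here can be reproduced with synthetic numbers. The sketch below counts the most crowded linear bucket for a dense cluster, with and without a single distant splat; the data and the `worstBucketOccupancy` helper are made up for illustration.

```javascript
// Largest number of splats that share one linear bucket.
function worstBucketOccupancy(depths, numBuckets) {
    let min = Infinity;
    let max = -Infinity;
    for (const d of depths) {
        if (d < min) min = d;
        if (d > max) max = d;
    }
    const range = max - min;
    const counts = new Uint32Array(numBuckets);
    let worst = 0;
    for (const d of depths) {
        const t = range > 0 ? (d - min) / range : 0;
        const b = Math.min(numBuckets - 1, Math.floor(t * numBuckets));
        counts[b]++;
        if (counts[b] > worst) worst = counts[b];
    }
    return worst;
}

// 1000 splats spread evenly over depths 0 .. 0.999.
const cluster = Array.from({ length: 1000 }, (_, i) => i / 1000);

// Same cluster plus one distant splat at depth 1000.
const withOutlier = cluster.concat([1000]);

const clusterOnly = worstBucketOccupancy(cluster, 1024);
const stretched = worstBucketOccupancy(withOutlier, 1024);
```

In this example the single outlier stretches the depth range so far that nearly the whole cluster lands in one bucket, even though the splat count barely changed.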

To improve accuracy with limited memory, I tried:

  • A 32-bit floating-point radix sort, which produces perfect sorting but is at least 2-3x slower than 16-bit sorting. Its performance is similar to using 20-bit buckets for smaller scenes, but it does not scale as well.
  • Surprisingly, a WASM radix sort is not much faster.
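
For readers unfamiliar with radix-sorting floats: the usual trick is to convert each float's bit pattern into an unsigned integer whose ordering matches the float ordering. The sketch below shows that standard conversion; it is not taken from the implementation tested here, and `floatKeys` is a hypothetical name.

```javascript
// Convert float32 values into uint32 keys that sort in the same order.
// Positive floats: set the sign bit so they sort above all negatives.
// Negative floats: invert all bits so larger magnitudes sort lower.
function floatKeys(values) {
    const bits = new Uint32Array(new Float32Array(values).buffer);
    const keys = new Uint32Array(bits.length);
    for (let i = 0; i < bits.length; i++) {
        const b = bits[i];
        // Storing into a Uint32Array wraps the signed bitwise result to unsigned.
        keys[i] = (b & 0x80000000) ? ~b : (b ^ 0x80000000);
    }
    return keys;
}

const depths = [3.5, -1, 0, 2, -7.25];
const keys = floatKeys(depths);

// Sorting indices by integer key gives the same order as sorting by float value.
const byKey = depths.map((_, i) => i).sort((a, b) => keys[a] - keys[b]);
```

Once the keys are plain 32-bit integers, any integer radix sort can process them without losing precision, at the cost of more passes than a single 16-bit bucket sort.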

I think a WebGPU sort is the next thing to try. What are your next steps?

@willeastcott
Contributor

@fimbox I'm wondering whether leveraging SIMD could improve your WASM sort.

@willeastcott
Contributor

Just quickly asked ChatGPT:

* SIMD can speed up the dot-product phase when computing sorting keys. You could batch-process multiple vertices at once using intrinsics.
* However, the counting and output-reordering phases of a radix sort are often limited by memory bandwidth, scatter/gather patterns, and partial sums—these don’t vectorize as cleanly.
* More significant speedups usually come from parallelizing the histogram counts (thread-level parallelism) or using a GPU-based approach, not from instruction-level (SIMD) vectorization alone.

So yes, some parts will get a boost with SIMD (the dot product especially). But for a pure CPU-based, single-threaded Radix Sort, don’t expect a huge overall speedup from SIMD alone. Look to parallelization, memory layout, and potential GPU offload for larger gains.

I don't have much insight into this stuff personally, but thought I'd kick off a discussion about WASM optimization...

@Maksims
Collaborator

Maksims commented Dec 17, 2024

but thought I'd kick off a discussion about WASM optimization

If there is very little communication between JS and WASM, a high computation workload in the meantime, a fixed memory footprint (no re-allocations), and simple raw buffer data communication, then WASM SIMD can work. But it very much depends on what gets compiled into WASM: it should be a lightweight, dependency-free solution that just solves a specific computation.

Also worth remembering that WASM SIMD is unlikely to perform as well as a compute shader solution with WebGPU, which in the long term is the way to go.

@fimbox

fimbox commented Dec 17, 2024

I just optimized the hell out of the WASM radix sort (check here), with only one vertex loop per pass and so on. It's faster, but it's still at half the speed of the JS single bucket sort.

I also played around with WASM SIMD, but the hardest part (scatter) is not easily vectorizable, so the speed-up might be negligible.

@mvaligursky
Contributor

I initially wrote a radix sort as well, using 4x8 bits. To speed it up, I switched to 2x16 bits. The logical conclusion was a single pass, but with the key dropped to 16 bits.

So perhaps try 2 passes instead of 4 as well?
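
For reference, the 4x8 versus 2x16 trade-off can be sketched with a least-significant-digit radix sort whose digit width is a parameter. This is a generic version written for illustration, not the engine's sorter or the WASM code discussed above; `radixSortIndices` is a hypothetical name.

```javascript
// LSD radix sort of indices by 32-bit unsigned keys.
// bitsPerPass = 8 gives 4 passes with 256 counters each;
// bitsPerPass = 16 gives 2 passes with 65536 counters each.
function radixSortIndices(keys, bitsPerPass) {
    const n = keys.length;
    const radix = 1 << bitsPerPass;
    const mask = radix - 1;
    const counts = new Uint32Array(radix);

    let src = new Uint32Array(n);
    let dst = new Uint32Array(n);
    for (let i = 0; i < n; i++) src[i] = i;

    for (let shift = 0; shift < 32; shift += bitsPerPass) {
        // Histogram of the current digit.
        counts.fill(0);
        for (let i = 0; i < n; i++) {
            counts[(keys[src[i]] >>> shift) & mask]++;
        }
        // Exclusive prefix sum gives each digit's start offset.
        let sum = 0;
        for (let d = 0; d < radix; d++) {
            const c = counts[d];
            counts[d] = sum;
            sum += c;
        }
        // Stable scatter into the destination buffer.
        for (let i = 0; i < n; i++) {
            const idx = src[i];
            dst[counts[(keys[idx] >>> shift) & mask]++] = idx;
        }
        [src, dst] = [dst, src];
    }
    return src;
}

const sampleKeys = new Uint32Array([0xC0600000, 5, 0x80000000, 70000, 5]);
const fourPasses = radixSortIndices(sampleKeys, 8);
const twoPasses = radixSortIndices(sampleKeys, 16);
```

Both calls produce the same order. Wider digits mean fewer full passes over the data but a larger histogram to clear and prefix-sum on every pass.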

@fimbox

fimbox commented Dec 19, 2024

I created a project that compares the 3 approaches (linear bucket, non-linear and radix). It also outputs the sorting times to the logs.

For us it seems we are good with non-linear mapping and we can easily patch it on demand. It solves flicker on close objects and is as fast as linear sorting. It has some flickering in distant areas though.

With typical Scaniverse scans you can also observe flickering with the linear sorting approach even with higher bucket counts.
These scenes have a detailed center and a very far splat-based "skybox". Non-linear mapping is perfect for such scenarios:

https://playcanvas.com/project/1285780/overview/gaussian-splatting-sorting

5 participants