Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use LoopBuilder to multithread Modern plugins #252

Open
mdoube opened this issue May 14, 2020 · 4 comments
Open

Use LoopBuilder to multithread Modern plugins #252

mdoube opened this issue May 14, 2020 · 4 comments
Assignees

Comments

@mdoube
Copy link
Member

mdoube commented May 14, 2020

Describe the bug
LoopBuilder is the ImgLib2 way to achieve chunked multithreading, see this for an example (specifically, the 'squared sum' example)
https://forum.image.sc/t/imglib2-split-image-into-chunks-for-multi-threaded-processing/37519/4

Modern Connectivity currently uses 8 cursors to achieve multithreading, which is a bit of a hack.

imagej/imagej-ops@6a60315#diff-d6c590f31ede52f4518dba0b4a44cf9fR261

    				final Cursor<B> octantCursor1 = Views.flatIterable(interval).cursor();
    				final Cursor<B> octantCursor2 = Views.flatIterable(interval).cursor();
    				final Cursor<B> octantCursor3 = Views.flatIterable(interval).cursor();
    				final Cursor<B> octantCursor4 = Views.flatIterable(interval).cursor();
    				final Cursor<B> octantCursor5 = Views.flatIterable(interval).cursor();
    				final Cursor<B> octantCursor6 = Views.flatIterable(interval).cursor();
    				final Cursor<B> octantCursor7 = Views.flatIterable(interval).cursor();
    				final Cursor<B> octantCursor8 = Views.flatIterable(interval).cursor();

    				octantCursor1.jumpFwd(start);
    				octantCursor2.jumpFwd(start + w);
    				octantCursor3.jumpFwd(start + 1);
    				octantCursor4.jumpFwd(start + w + 1);
    				octantCursor5.jumpFwd(start + w * h);
    				octantCursor6.jumpFwd(start + w * h + w);
    				octantCursor7.jumpFwd(start + w * h + 1);
    				octantCursor8.jumpFwd(start + w * h + w + 1);

    				for (int i = 0; i < steps; i++) {
    					boolean o1 = octantCursor1.next().get();
    					boolean o2 = octantCursor2.next().get();
    					boolean o3 = octantCursor3.next().get();
    					boolean o4 = octantCursor4.next().get();
    					boolean o5 = octantCursor5.next().get();
    					boolean o6 = octantCursor6.next().get();
    					boolean o7 = octantCursor7.next().get();
    					boolean o8 = octantCursor8.next().get();

Additional context
Any change to the code needs to be performance benchmarked and reported to @maarzt
If it works out, we might be able to improve the performance of other Modern plugins like VolumeFraction

@rimadoma
Copy link
Contributor

rimadoma commented May 14, 2020

The cursors do avoid random access, which slow the op down (I've tested). Specifically the random access needed to read the 8-neighbourhood of a pixel. My bet is that the inner lambda function of LoopBuilder.forEachPixel(T) (or Chunk) will have to avoid it too to achieve performance similar to the current op.

@rimadoma
Copy link
Contributor

rimadoma commented May 14, 2020

If it works out, we might be able to improve the performance of other Modern plugins like VolumeFraction

Sure, worth having a go. ATM ElementFractionWrapper just calls opService.stats().sum(interval) to count foreground voxels, and Interval.size() for total voxels. The former works, because in a BitType image foreground is always 1.

@maarzt
Copy link

maarzt commented May 14, 2020

LoopBuilder won't work for this use case. It supports a maximum number of 6 cursors or random accesses. You could work around this be using Views.collapse(Views.stack(...)) but I wouldn't expect any performance improvements by this.

I would choose a different approach:

  1. You should copy the BitType input image into a UnsignedByteType image. (Using values 0 and 1).
  2. Convolve the UnsignedByteType image with a kernel [1 2] in x, with a kernel [1 4] in you, and with a kernel [1 16] in z. This will encode your 8-neighborhood in the pixels of the UnsignedByteType.
  3. Use a look up table byte[256] to make the encoded 8-neighborhood to the delta-euler-value.

@mdoube mdoube changed the title Use LoopBuilder to multithread Modern Connectivity Use LoopBuilder to multithread Modern plugins May 15, 2020
@mdoube
Copy link
Member Author

mdoube commented Jun 9, 2020

@maarzt I just did a little performance testing comparing the legacy and modern Connectivity plugins. You can see that even with some hacky multithreading, and avoiding random accesses with final Cursor<B> octantCursor1 = Views.flatIterable(interval).cursor(); the Modern approach is still way slower than the legacy approach. Is there a faster way to access pixels sequentially in an Op? especially pixel neighbourhoods, which is a standard kind of thing to do when processing with a kernel.

Performance timings

Timing Legacy Connectivity
Legacy connectivity took 1111 ms
Legacy connectivity took 876 ms
Legacy connectivity took 873 ms
Legacy connectivity took 1107 ms
Legacy connectivity took 909 ms
Timing Modern Connectivity
Modern connectivity took 14352 ms
Modern connectivity took 14298 ms
Modern connectivity took 14079 ms
Modern connectivity took 14071 ms
Modern connectivity took 14141 ms

Macro code:

print("Timing Legacy Connectivity");
for (i = 0; i < 5; i++){
   start = getTime();
   run("Connectivity");
   end = getTime();
   print("Legacy connectivity took "+end - start+" ms");
}

print("Timing Modern Connectivity");
for (i = 0; i < 5; i++){
   start = getTime();
   run("Connectivity (Modern)", "inputimage=net.imagej.ImgPlus@73672ce");
   end = getTime();
   print("Modern connectivity took "+end - start+" ms");
}

Test image
umzc_378p_Apteryx_haastii_head.tif.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants