[WIP] Add support for writing to mmapped TIFFs #40

tlnagy · 2021-03-02T20:01:09Z

This will complete TiffImages.jl support for mmapped TIFFs. There are still some features that are needed before this is ready to merge:

Allocations

Despite my best efforts setindex! is still not zero allocation, despite no allocations reported by running julia in --track-allocations=all mode.

To reproduce:

julia> img = TiffImages.load("/home/tlnagy/Documents/exampletiffs/julia.tif", mmap=true);
[ Info: Watching /home/tlnagy/Documents/exampletiffs/julia.tif for changes

julia> filepath = download("https://github.com/tlnagy/exampletiffs/blob/master/julia.tif?raw=true")
"/tmp/jl_sP369j"

julia> img = TiffImages.load(filepath, mmap=true);
[ Info: Watching /tmp/jl_sP369j for changes

julia> convert!(img.data; writeable=true) # convert to writeable

julia> img[:, :, 1] .= 1.0; # warm up setindex

julia> @time img[:, :, 1] .= 1.0;
  0.360166 seconds (303.86 k allocations: 9.351 MiB, 4.10% gc time)

Thoughts @timholy, @IanButterworth?

codecov-io · 2021-03-02T20:04:03Z

Codecov Report

Merging #40 (808f8d5) into master (31191cb) will decrease coverage by 5.62%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master      #40      +/-   ##
==========================================
- Coverage   88.79%   83.16%   -5.63%     
==========================================
  Files          12       12              
  Lines         562      600      +38     
==========================================
  Hits          499      499              
- Misses         63      101      +38

Impacted Files	Coverage Δ
src/TiffImages.jl	`100.00% <ø> (ø)`
src/types/mmap.jl	`0.00% <0.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 31191cb...808f8d5. Read the comment docs.

timholy · 2021-03-02T21:58:25Z

It's probably because of ifd[STRIPBYTECOUNTS]. Do you know what type it will return? It would be better to put the type-assert as "early" as possible.

To show you what I mean, here's a screen shot from Cthulhu with optimizations on, but with verbose turned off to highlight just the red lines:

Don't worry about the apparent inference failures for KeyError, Julia no longer will claim a type for a branch that is guaranteed to error (this is a good change). Those arrayrefs correspond to the ifd[STRIPBYTECOUNTS] and ifd[STRIPOFFSETS], and crucially you pass the returned value to getproperty (to extract data) before asserting a type. If you know the exact type so that the getproperty might be inferrable, it's better to assert it there. E.g.,

(ifd[STRIPBYTECOUNTS]::SomeType).data

is to be preferred to

(ifd[STRIPBYTECOUNTS].data)::SomeOtherType

In #36 I didn't feel confident I knew the types so I asserted at a later stage.

tlnagy · 2021-03-03T18:58:03Z

Ah, cool. Adding the earlier asserts gets rid of almost all allocations (no improvement in speed but we're probably I/O limited here):

julia> @time img[:, :, 1] .= 1.0;
  0.383113 seconds (9 allocations: 320 bytes)

That said, the flexibility of TIFFs is pretty difficult here because there aren't a lot of guarantees. I guess what I could do is parameterize the mmapped TIFF type on the on-disk structure of the image? Because then I don't need to rely on if/else statements in setindex! and getindex to figure out if it's striped or tiled or compressed, etc. Though my experience with speeding up TiffImages.jl has shown me to be careful with over-parameterizing my types 😰

timholy · 2021-03-03T23:07:40Z

Yeah, it's a tricky design space. You're going to have to have something be non-inferrable here---Julia can't predict what's on disk. The key issue comes down to two questions:

where can you make choices that lead to the fewest inference failures? Getting an inference failure out of the way once at the beginning so that later steps are inferrable is typically a win from a performance standpoint.
how diverse will your types be? This directly determines the amount of specialization the compiler will need to do. Since, for example, image stacks (or temporal movies) have an arbitrary length, forcing the compiler to specialize on this integer is definitely not a good thing. Conversely, when the type is one of just 3 choices, that's bounded so it might make sense to specialize. Then again, if you have 10 fields and each one of them might be any of the 3, then you have 3^10 possible combinations and again that's more specializations than you'd want.

When the two points above are in conflict with one another, often one of your best strategies is to coerce to a standard. I don't know much about this package or about the structure of TIFF files, but if they are essentially metadata then if you can standardize your metadata on just a few types (e.g., Vector{Int}, String, maybe something else), then you can handle the diversity with union-splitting. Note that if you're loading the metadata into RAM, there might not need to be a 1:1 correspondence between the object type in RAM and the object type on disk. EDIT: and now I notice the title of this PR, "mmapped", so you can ignore that last sentence!

timholy · 2021-03-04T01:46:04Z

BTW, it's possible you could at least make the code cleaner with a variant of https://timholy.github.io/SnoopCompile.jl/stable/snoopr/#Inferrable-field-access-for-abstract-types. Basically you could apply the type-assert for particular tag keys in a getindex method so that the caller will be inferrable. I doubt this would change anything with the specialization issues, though.

tlnagy · 2021-03-08T20:02:27Z

you can handle the diversity with union-splitting

Do you have any more information on this? I think I know what you're getting at, but not 100% sure.

As for the TIFF diversity, the problem stems from the fact that each tag can have its own datatype and there are quite a few native types:

TiffImages.jl/src/utils.jl

Lines 123 to 140 in 3d1d5c3

    
           const tiff_to_julian = Dict( 
        
             0x0001 => UInt8,            # CHAR 
        
             0x0002 => String,           # ASCII 
        
             0x0003 => UInt16,           # SHORT 
        
             0x0004 => UInt32,           # LONG 
        
             0x0005 => Rational{UInt32}, # RATIONAL 
        
             0x0006 => Int8,             # SBYTE 
        
             0x0007 => Any,              # UNDEFINED 
        
             0x0008 => Int16,            # SSHORT 
        
             0x0009 => Int32,            # SLONG 
        
             0x000a => Rational{Int32},  # SRATIONAL 
        
             0x000b => Float32,          # FLOAT 
        
             0x000c => Float64,          # DOUBLE 
        
             0x000d => UInt32,           # IFD 
        
             0x0010 => UInt64,           # LONG8 
        
             0x0011 => Int64,            # SLONG8 
        
             0x0012 => UInt64,           # IFD8 
        
           )

Background

My original solution was to store all tags using the same memory layout (as a Vector of raw bytes) and reinterpret them into the correct datatype on access. I changed this in #22 to the current design where I parameterize the Tag type on the stored datatype. But now the memory layout is different for each tag in the tag list (the tag list now has an eltype of abstract Tag) and the compiler doesn't know what the concrete type is until runtime since it depends of which key is accessed. That's what we were running into here. The difficulty is that I was hoping to make TiffImages.jl support all weird and nonstandard tags were I don't want to coerce them into a limited set of bins.

Potential solution

BTW, it's possible you could at least make the code cleaner with a variant of https://timholy.github.io/SnoopCompile.jl/stable/snoopr/#Inferrable-field-access-for-abstract-types. Basically you could apply the type-assert for particular tag keys in a getindex method so that the caller will be inferrable. I doubt this would change anything with the specialization issues, though.

This sounds like a good middle road. I only regularly use a minimal core set of tags for performance-critical code, things like width, length, data offsets, etc. I could coerce those into a limited set of expected datatypes on getindex or earlier* and leave the rest to be flexible. Why would this not fix the specialization issues? Or is it that I'm overspecializing?

EDIT: *I think on read could be useful here. I can force the datatype for these limited tags and then assert their type in getindex or wherever I access them.

timholy · 2021-03-08T21:30:45Z

then you can handle the diversity with union-splitting.

You may know about these already but:

automatic union-splitting: https://julialang.org/blog/2018/08/union-splitting/
manual union splitting: https://discourse.julialang.org/t/performance-problems-when-dealing-with-deliberately-type-unstable-code-how-to-use-type-knowledge-better/52535/26?u=tim.holy

I only regularly use a minimal core set of tags for performance-critical code, things like width, length, data offsets, etc. I could coerce those into a limited set of expected datatypes on getindex or earlier

That sounds like a good idea. I am guessing that this is quite a small amount of data, right? Could you also read it when you open the file and copy it in coerced form to a single vector in RAM? Or do that image-by-image when you first read a particular frame?

mwp implementation of writing for mmapped files

808f8d5

tlnagy added the enhancement New feature or request label Mar 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Add support for writing to mmapped TIFFs #40

[WIP] Add support for writing to mmapped TIFFs #40

tlnagy commented Mar 2, 2021 •

edited

Loading

codecov-io commented Mar 2, 2021 •

edited

Loading

timholy commented Mar 2, 2021 •

edited

Loading

tlnagy commented Mar 3, 2021

timholy commented Mar 3, 2021 •

edited

Loading

timholy commented Mar 4, 2021

tlnagy commented Mar 8, 2021 •

edited

Loading

timholy commented Mar 8, 2021

[WIP] Add support for writing to mmapped TIFFs #40

Are you sure you want to change the base?

[WIP] Add support for writing to mmapped TIFFs #40

Conversation

tlnagy commented Mar 2, 2021 • edited Loading

Allocations

codecov-io commented Mar 2, 2021 • edited Loading

Codecov Report

timholy commented Mar 2, 2021 • edited Loading

tlnagy commented Mar 3, 2021

timholy commented Mar 3, 2021 • edited Loading

timholy commented Mar 4, 2021

tlnagy commented Mar 8, 2021 • edited Loading

Background

Potential solution

timholy commented Mar 8, 2021

tlnagy commented Mar 2, 2021 •

edited

Loading

codecov-io commented Mar 2, 2021 •

edited

Loading

timholy commented Mar 2, 2021 •

edited

Loading

timholy commented Mar 3, 2021 •

edited

Loading

tlnagy commented Mar 8, 2021 •

edited

Loading