Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add support for writing to mmapped TIFFs #40

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tlnagy
Copy link
Owner

@tlnagy tlnagy commented Mar 2, 2021

This will complete TiffImages.jl support for mmapped TIFFs. There are still some features that are needed before this is ready to merge:

  • Support for out-of-memory copy
  • Proper errors for compressed images
  • Support for whole plane writing for speed?
  • Append!
  • Tests

Allocations

Despite my best efforts setindex! is still not zero allocation, despite no allocations reported by running julia in --track-allocations=all mode.

To reproduce:

julia> img = TiffImages.load("/home/tlnagy/Documents/exampletiffs/julia.tif", mmap=true);
[ Info: Watching /home/tlnagy/Documents/exampletiffs/julia.tif for changes

julia> filepath = download("https://github.com/tlnagy/exampletiffs/blob/master/julia.tif?raw=true")
"/tmp/jl_sP369j"

julia> img = TiffImages.load(filepath, mmap=true);
[ Info: Watching /tmp/jl_sP369j for changes

julia> convert!(img.data; writeable=true) # convert to writeable

julia> img[:, :, 1] .= 1.0; # warm up setindex

julia> @time img[:, :, 1] .= 1.0;
  0.360166 seconds (303.86 k allocations: 9.351 MiB, 4.10% gc time)

Thoughts @timholy, @IanButterworth?

@tlnagy tlnagy added the enhancement New feature or request label Mar 2, 2021
@codecov-io
Copy link

codecov-io commented Mar 2, 2021

Codecov Report

Merging #40 (808f8d5) into master (31191cb) will decrease coverage by 5.62%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #40      +/-   ##
==========================================
- Coverage   88.79%   83.16%   -5.63%     
==========================================
  Files          12       12              
  Lines         562      600      +38     
==========================================
  Hits          499      499              
- Misses         63      101      +38     
Impacted Files Coverage Δ
src/TiffImages.jl 100.00% <ø> (ø)
src/types/mmap.jl 0.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 31191cb...808f8d5. Read the comment docs.

@timholy
Copy link
Contributor

timholy commented Mar 2, 2021

It's probably because of ifd[STRIPBYTECOUNTS]. Do you know what type it will return? It would be better to put the type-assert as "early" as possible.

To show you what I mean, here's a screen shot from Cthulhu with optimizations on, but with verbose turned off to highlight just the red lines:

image

Don't worry about the apparent inference failures for KeyError, Julia no longer will claim a type for a branch that is guaranteed to error (this is a good change). Those arrayrefs correspond to the ifd[STRIPBYTECOUNTS] and ifd[STRIPOFFSETS], and crucially you pass the returned value to getproperty (to extract data) before asserting a type. If you know the exact type so that the getproperty might be inferrable, it's better to assert it there. E.g.,

(ifd[STRIPBYTECOUNTS]::SomeType).data

is to be preferred to

(ifd[STRIPBYTECOUNTS].data)::SomeOtherType

In #36 I didn't feel confident I knew the types so I asserted at a later stage.

@tlnagy
Copy link
Owner Author

tlnagy commented Mar 3, 2021

Ah, cool. Adding the earlier asserts gets rid of almost all allocations (no improvement in speed but we're probably I/O limited here):

julia> @time img[:, :, 1] .= 1.0;
  0.383113 seconds (9 allocations: 320 bytes)

That said, the flexibility of TIFFs is pretty difficult here because there aren't a lot of guarantees. I guess what I could do is parameterize the mmapped TIFF type on the on-disk structure of the image? Because then I don't need to rely on if/else statements in setindex! and getindex to figure out if it's striped or tiled or compressed, etc. Though my experience with speeding up TiffImages.jl has shown me to be careful with over-parameterizing my types 😰

@timholy
Copy link
Contributor

timholy commented Mar 3, 2021

Yeah, it's a tricky design space. You're going to have to have something be non-inferrable here---Julia can't predict what's on disk. The key issue comes down to two questions:

  • where can you make choices that lead to the fewest inference failures? Getting an inference failure out of the way once at the beginning so that later steps are inferrable is typically a win from a performance standpoint.
  • how diverse will your types be? This directly determines the amount of specialization the compiler will need to do. Since, for example, image stacks (or temporal movies) have an arbitrary length, forcing the compiler to specialize on this integer is definitely not a good thing. Conversely, when the type is one of just 3 choices, that's bounded so it might make sense to specialize. Then again, if you have 10 fields and each one of them might be any of the 3, then you have 3^10 possible combinations and again that's more specializations than you'd want.

When the two points above are in conflict with one another, often one of your best strategies is to coerce to a standard. I don't know much about this package or about the structure of TIFF files, but if they are essentially metadata then if you can standardize your metadata on just a few types (e.g., Vector{Int}, String, maybe something else), then you can handle the diversity with union-splitting. Note that if you're loading the metadata into RAM, there might not need to be a 1:1 correspondence between the object type in RAM and the object type on disk. EDIT: and now I notice the title of this PR, "mmapped", so you can ignore that last sentence!

@timholy
Copy link
Contributor

timholy commented Mar 4, 2021

BTW, it's possible you could at least make the code cleaner with a variant of https://timholy.github.io/SnoopCompile.jl/stable/snoopr/#Inferrable-field-access-for-abstract-types. Basically you could apply the type-assert for particular tag keys in a getindex method so that the caller will be inferrable. I doubt this would change anything with the specialization issues, though.

@tlnagy
Copy link
Owner Author

tlnagy commented Mar 8, 2021

you can handle the diversity with union-splitting

Do you have any more information on this? I think I know what you're getting at, but not 100% sure.

As for the TIFF diversity, the problem stems from the fact that each tag can have its own datatype and there are quite a few native types:

TiffImages.jl/src/utils.jl

Lines 123 to 140 in 3d1d5c3

const tiff_to_julian = Dict(
0x0001 => UInt8, # CHAR
0x0002 => String, # ASCII
0x0003 => UInt16, # SHORT
0x0004 => UInt32, # LONG
0x0005 => Rational{UInt32}, # RATIONAL
0x0006 => Int8, # SBYTE
0x0007 => Any, # UNDEFINED
0x0008 => Int16, # SSHORT
0x0009 => Int32, # SLONG
0x000a => Rational{Int32}, # SRATIONAL
0x000b => Float32, # FLOAT
0x000c => Float64, # DOUBLE
0x000d => UInt32, # IFD
0x0010 => UInt64, # LONG8
0x0011 => Int64, # SLONG8
0x0012 => UInt64, # IFD8
)

Background

My original solution was to store all tags using the same memory layout (as a Vector of raw bytes) and reinterpret them into the correct datatype on access. I changed this in #22 to the current design where I parameterize the Tag type on the stored datatype. But now the memory layout is different for each tag in the tag list (the tag list now has an eltype of abstract Tag) and the compiler doesn't know what the concrete type is until runtime since it depends of which key is accessed. That's what we were running into here. The difficulty is that I was hoping to make TiffImages.jl support all weird and nonstandard tags were I don't want to coerce them into a limited set of bins.

Potential solution

BTW, it's possible you could at least make the code cleaner with a variant of https://timholy.github.io/SnoopCompile.jl/stable/snoopr/#Inferrable-field-access-for-abstract-types. Basically you could apply the type-assert for particular tag keys in a getindex method so that the caller will be inferrable. I doubt this would change anything with the specialization issues, though.

This sounds like a good middle road. I only regularly use a minimal core set of tags for performance-critical code, things like width, length, data offsets, etc. I could coerce those into a limited set of expected datatypes on getindex or earlier* and leave the rest to be flexible. Why would this not fix the specialization issues? Or is it that I'm overspecializing?

EDIT: *I think on read could be useful here. I can force the datatype for these limited tags and then assert their type in getindex or wherever I access them.

@timholy
Copy link
Contributor

timholy commented Mar 8, 2021

then you can handle the diversity with union-splitting.

You may know about these already but:

I only regularly use a minimal core set of tags for performance-critical code, things like width, length, data offsets, etc. I could coerce those into a limited set of expected datatypes on getindex or earlier

That sounds like a good idea. I am guessing that this is quite a small amount of data, right? Could you also read it when you open the file and copy it in coerced form to a single vector in RAM? Or do that image-by-image when you first read a particular frame?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants