Bitrounding + Lossless compression #3599

milankl · 2024-05-13T23:38:36Z

TL;DR: We can compress 18GB of Oceananigans simulation checkpoints into 350MB with bitrounding and lossless compression.

Problem

Output is currently written uncompressed in Float64, which contains

redundancies: zeros for immersed boundaries, halos; similar/identical exponent bits
false information: trailing mantissa bits with no mutual information to neighbouring grid points

Proposed solution

Bitrounding to remove false information (those bits are replaced with zero bits, which turns them into redundancies), then lossless compression to remove the redundancies.
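As a rough illustration of the bitrounding step, here is a minimal sketch in plain Julia, assuming Float64 input and round-to-nearest with ties to even. The name `bitround` is a placeholder for this example and not necessarily what this PR or BitInformation.jl uses, and NaN handling is left out.

```julia
# Minimal bitrounding sketch for Float64 (52 explicit mantissa bits).
# `bitround` is a hypothetical helper name used only for this example.
function bitround(x::Float64, keepbits::Integer)
    keepbits >= 0 || throw(ArgumentError("keepbits must be non-negative"))
    keepbits >= 52 && return x                        # nothing to discard
    shift = 52 - keepbits                             # trailing mantissa bits to zero out
    ui = reinterpret(UInt64, x)
    keepmask = ~((UInt64(1) << shift) - UInt64(1))    # ones for sign, exponent, kept bits
    half = (UInt64(1) << (shift - 1)) - UInt64(1)     # just below half of the last kept bit
    ui += half + ((ui >> shift) & UInt64(1))          # round to nearest, ties to even
    return reinterpret(Float64, ui & keepmask)        # discarded bits become zeros
end

T = [20.123456789, 4.987654321, -1.8]
T_rounded = bitround.(T, 7)    # 7 keepbits, the value reported for temperature in this PR
```

With 7 keepbits the relative rounding error stays below 2^-8, and the 45 trailing mantissa bits of every value are zero, which is what the lossless compressor can then exploit.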

I've looked into the bitwise real information content for a single checkpoint in Simone's OMIP simulations and got this, with the orange line denoting 99.9% of the real information

So

u, v have 0-2 mantissa bits of information (=keepbits) with more information in the surface layer (k=60)
w has 0 keepbits (the exponent bits still carry information though!)
temperature T (in ˚C) has 7 keepbits (that's 3-4 digits) relatively independent of depth
salinity S has 12 keepbits at the surface, increasing to 16 in the deep ocean
sea surface height $\eta$ is at 6 keepbits
tendencies generally have fewer keepbits, but maybe they shouldn't be stored anyway (restart with a single Euler forward step instead)

The checkpoint file Simone provided had

18GB total file size, single time step
including 7 halo points in all directions
400MB are grid
u,v,w,T,S,$\eta$ variables and 2x tendencies (AB2) for all but w, all in Float64

Compression options

The 18GB can be compressed into

Only lossless: 6.9GB (2.6x), removes redundancies from halo and immersed boundaries
Only Float32: 9GB (2x), removes only some false information in trailing bits
Float32 then lossless: 3.25GB (5.5x)
Bitrounded then lossless: 1GB (18x)
Bitrounded, zero tendencies, then lossless: 350MB (51x). With lossy compression, saving the tendencies eventually becomes pointless, as restarting with a single Euler forward step might just do the job anyway
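The compression factors quoted above are simply the original size divided by the compressed size. A small sketch that recomputes them from the sizes reported in this thread (the labels and variable names are only for this example):

```julia
# File sizes in GB as reported in this thread; factor = original / compressed.
original_gb = 18.0
compressed_gb = [
    "lossless only"                           => 6.9,
    "Float32 only"                            => 9.0,
    "Float32 + lossless"                      => 3.25,
    "bitrounded + lossless"                   => 1.0,
    "bitrounded + zero tendencies + lossless" => 0.35,
]

factors = [name => original_gb / gb for (name, gb) in compressed_gb]

for (name, factor) in factors
    println(rpad(name, 42), round(factor, digits = 1), "x")
end
```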

This currently uses Zstd (https://github.com/facebook/zstd), a modern yet already widely available lossless compressor, through its command-line interface zstd. In JLD2, compress=true currently uses ZlibCompressor from https://github.com/JuliaIO/CodecZlib.jl, which compresses similarly well but is 2-3x slower. I'm working on getting CodecZstd supported in JLD2: JuliaIO/JLD2.jl#560

While this PR is still a draft, I'm proposing the following new defaults:

lossless compression with compress=true for JLD2, deflatelevel=3 for netCDF
bitrounding to keepbits ~20 (single precision-ish) whether you output in Float32/64 (doesn't matter when lossless compression is on)
a default bitrounder that rounds to the keepbits as suggested above that can be used instead of bitrounder=nothing (default)

We can then independently tweak the precision (how many keepbits, ideally as a function of the vertical, see salinity) and the lossless compressor (Zlib -> Zstandard)

milankl · 2024-05-13T23:47:28Z

Just added BitInformation to the Project.toml. Due to its dependencies on StatsBase and Distributions, this also adds

    Updating `~/git/Oceananigans.jl/Project.toml`
  [de688a37] + BitInformation v0.6.1
    Updating `~/git/Oceananigans.jl/Manifest.toml`
  [66dad0bd] + AliasTables v1.1.2
  [de688a37] + BitInformation v0.6.1
  [49dc2e85] + Calculus v0.5.1
  [31c24e10] + Distributions v0.25.108
  [fa6b7ba4] + DualNumbers v0.6.8
  [1a297f60] + FillArrays v1.11.0
  [34004b35] + HypergeometricFunctions v0.3.23
  [77ba4419] + NaNMath v1.0.2
  [90014a1f] + PDMats v0.11.31
  [1fd47b50] + QuadGK v2.9.4
  [79098fc4] + Rmath v0.7.1
  [2913bbd2] + StatsBase v0.34.3
  [4c63d2b9] + StatsFuns v1.3.1
  [f50d1b31] + Rmath_jll v0.4.0+0

also why is the Manifest.toml committed?

glwagner · 2024-05-14T01:40:21Z

> also why is the Manifest.toml committed?

Through past experience we found that we needed the Manifest committed to make sense of the errors we encounter during CI.

glwagner · 2024-05-14T01:43:11Z

src/OutputWriters/netcdf_output_writer.jl

@@ -326,7 +320,7 @@ simulation = Simulation(model, Δt=1.25, stop_iteration=3)

 f(model) = model.clock.time^2; # scalar output

-g(model) = model.clock.time .* exp.(znodes(grid, Center())) # vector/profile output
+g(model) = model.clock.time .* exp.(znodes(Center, grid)) # vector/profile output


You may want to merge main, because I think we need this change for the doctest to pass.

I'm two commits ahead of main and none behind (main...mk/compression). I haven't actively changed these, but maybe @simone-silvestri and I started off from an outdated branch?

Yes, not sure, but this change does walk back a recent PR.

Shoot, then I might have created an inconsistent history. Sorry, I'll try to resolve that.

glwagner · 2024-05-14T01:45:05Z

src/OutputWriters/bit_rounding.jl

+default_bit_rounding(::Val{:T}) = 7
+default_bit_rounding(::Val{:S}) = 16                # 12 at the surface, 16 deep ocean


This is interesting. Why is there a difference between T, S? Is this specific to the simulation that this was tested on, or can we be sure this is valid for all simulations, past climates, future climates, idealized simulations at other resolutions, etc?

It seems we need to have default bit rounding for passive tracers.

Although relatively robust through time and space, this depends on a lot of things, including whether your unit carries an offset around (e.g. Kelvin vs ˚C, density vs density anomaly). So it's tricky to generalise. I suggest having some reasonable defaults if someone uses bit rounding (default nothing or single precision, as you like), but highlighting that these should be checked, similar to how I did it here with the bitinformation analysis above.

For global ocean simulations I expect these to be reasonable defaults. I believe for now this is mostly to reduce the file sizes for OMIP simulations.

True, I'm just not sure that OMIP is going to be the most common use case, so there's a question about what default is appropriate here

The OMIP defaults might belong in the ClimaOcean setup, perhaps

We could set the defaults here to 23 mantissa bits (=Float32 precision, whether you use Float32 or 64) and then lower them in ClimaOcean?
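One way this split could look, as a sketch only: a generic fallback of 23 mantissa bits for every variable (passive tracers included), with per-variable overrides passed as keyword arguments by a downstream setup. `default_keepbits` and `resolve_keepbits` are hypothetical names, not part of this PR.

```julia
# Hypothetical sketch: a generic fallback plus per-variable overrides.
const FLOAT32_MANTISSA_BITS = 23

# Fallback for any variable without a measured value (e.g. passive tracers).
default_keepbits(::Val) = FLOAT32_MANTISSA_BITS

# Resolve keepbits per variable; keyword arguments override the fallback.
function resolve_keepbits(names; user_rounding...)
    return Dict(name => get(user_rounding, name, default_keepbits(Val(name)))
                for name in names)
end

# Generic defaults: every variable is rounded to Float32 precision.
generic = resolve_keepbits((:u, :v, :w, :T, :S, :c))

# A downstream (e.g. OMIP-style) setup lowers T and S to the values from this PR.
omip = resolve_keepbits((:u, :v, :w, :T, :S, :c); T = 7, S = 16)
```

Keeping the conservative fallback in the core package means an unknown tracer is never rounded more aggressively than single precision unless a user asks for it.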

glwagner · 2024-05-14T17:10:44Z

src/OutputWriters/bit_rounding.jl

+function BitRounding(outputs = nothing;
+                     user_rounding...)


For the purpose of figuring out good defaults perhaps we should include model as an input here?

Then default_bit_rounding can take model as an argument, and dispatch on various things, for example the equation of state (which should know the units of temperature), and perhaps the biogeochemistry model, which may know the units of some important tracers
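A sketch of what dispatching on the model could look like. All types here are toy stand-ins invented for this example (none exist in Oceananigans), and the only measured value is the 7 keepbits for temperature in ˚C from this PR; anything else falls back to 23 mantissa bits.

```julia
# Hypothetical stand-ins; none of these types exist in Oceananigans.
struct CelsiusUnits end
struct KelvinUnits end

struct ToyEquationOfState{U}
    temperature_units::U
end

struct ToyModel{E}
    equation_of_state::E
end

# Generic fallback: Float32 precision (23 mantissa bits).
default_bit_rounding(::Val, model) = 23

# Temperature is forwarded to the equation of state, which knows its units.
default_bit_rounding(::Val{:T}, model::ToyModel) =
    temperature_keepbits(model.equation_of_state)

temperature_keepbits(::ToyEquationOfState{CelsiusUnits}) = 7  # measured in this PR (˚C)
temperature_keepbits(::ToyEquationOfState) = 23               # not measured here: stay conservative

celsius_model = ToyModel(ToyEquationOfState(CelsiusUnits()))
kelvin_model  = ToyModel(ToyEquationOfState(KelvinUnits()))

default_bit_rounding(Val(:T), celsius_model)   # uses the ˚C value
default_bit_rounding(Val(:T), kelvin_model)    # falls back to Float32 precision
```

A biogeochemistry model could be handled the same way, by forwarding its tracers to a method that knows their units.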

milankl added 2 commits May 13, 2024 18:46

Bitrounding + Lossless compression

1f37d24

add BitInformation to Project.toml

7416b7b

glwagner reviewed May 14, 2024

milankl added the output 💾 label May 14, 2024

glwagner reviewed May 14, 2024

glwagner reviewed May 14, 2024

milankl and others added 3 commits May 14, 2024 14:12

Update src/OutputWriters/bit_rounding.jl

a2c8793

Co-authored-by: Gregory L. Wagner <[email protected]>

Merge branch 'main' into mk/compression

2d3e8b4

Merge branch 'main' into mk/compression

bbf80e1
