Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

There is no encodings on float array #1723

Open
shenlei149 opened this issue Dec 19, 2024 · 9 comments
Open

There is no encodings on float array #1723

shenlei149 opened this issue Dec 19, 2024 · 9 comments

Comments

@shenlei149
Copy link

We have a lot of data for ML training, which are saved as parquet files. We want to find a alternative way to store the data to speed up reading and reduce file size.

There are several columns with few nulls, and their data type is struct<value: list<array: float not null>>, and there are several handreds of rows per row group and 500/1000 floats in each list. For these columns, vortex does not have a good performance bacause of no encodings, so the array size is larger than parquet.

pseudo code likes:

import pyarrow.parquet as pq
import pyarrow as pa
import vortex as vx

arraw_table = pq.read_table("file.parquet")

some_arrow_array = choose_some_column(arraw_table)
for a_array in some_arrow_array:
    vx_array = vx.array(a_array)
    cvx_array = vx.compress(vx_array)
    # for list float columns, cvx_array.nbytes == vx_array.nbytes

I try to flatten the list array and export data, use them as one array, still no encoding...

Here is the code and output:

import pyarrow as pa
import vortex as vx
import array

with open("sample.data","rb") as file:
    numbers = array.array('f')
    numbers.fromfile(file, (265216//4))
    arrow_array = pa.array(numbers)
    print(arrow_array.nbytes)
    vcol = vx.array(arrow_array)
    print(vcol.nbytes)
    cvcol = vx.compress(vcol)
    print(cvcol.nbytes)
    print(cvcol.nbytes/vcol.nbytes)

530432
530433
526365
0.9923307938985697

I use btrblocks to encode the data.

Stats:
- input size 530432
- output size 219769
- compression ratio 2.41359

if I use 500 floats to do the same thing, no encoding, and the compression ratio is 1.

My question is if the way I use vortex is right, if not, what is the right way to do it?

I'm not sure if I make question clear, please let me know if you need more details.

@a10y
Copy link
Contributor

a10y commented Dec 19, 2024

Hey @shenlei149, thanks for giving Vortex a try for your usecase!

We do implement two floating point encodings, ALP and ALP-RD ("ALP for Real Doubles"). The first meant for low-precision decimals that can be approximated as integers, the latter is for high-precision doubles and is probably more likely to be useful for your use-case.

Note that the compression ratio for ALP-RD is not gigantic (usually it saves on the order of ~9 bits-per-value) but they do match what DuckDB/the ALP paper indicate.

I've noticed that we actually had a bug in our compressor selection for floating points, I've filed a PR with the fix here: #1725

@a10y
Copy link
Contributor

a10y commented Dec 19, 2024

BtrBlocks implements two encodings that we don't, Frequency and PDE. I'd be curious to understand which encoding they're choosing for your dataset

a10y added a commit that referenced this issue Dec 19, 2024
Fixes issue seen in #1723

In #1000, changes were made to remove ALP-RD as a top-level compressor
and instead to only use it for patches. However, it seems that it was
not getting selected anymore, whether that was due to the patching cost
overhead or something else. This was noticed by a user, and confirmed by
me in a Python shell.

<img width="946" alt="image"
src="https://github.com/user-attachments/assets/c42caed5-12e3-448b-aea6-3f33a7c97bfc"
/>

After this change, ALP-RD indeed does get selected again.

<img width="907" alt="image"
src="https://github.com/user-attachments/assets/bbb996fc-5223-43ed-9b4c-4b0262a417dc"
/>
@shenlei149
Copy link
Author

Thank you for your reply and quick fix.

update the code and rebuild, but it does not work. The size is still large.


From return status of btrblocks compress API, used_compression_schemes contains only one element, which is 2, so I think it uses dictionary encoding for my dataset.

enum class DoubleSchemeType : uint8_t {
  UNCOMPRESSED = 0,
  ONE_VALUE = 1,
  DICT = 2,
  RLE = 3,
  FREQUENCY = 4,
  PSEUDODECIMAL = 5,
  // legacy schemes
  DOUBLE_BP = 28,
  DICTIONARY_8 = 29,
  DICTIONARY_16 = 31,
  SCHEME_MAX = 32
};

@a10y
Copy link
Contributor

a10y commented Dec 19, 2024

Thanks for the extra info @shenlei149. I spent a bit of time today tracking down some float compression things in BtrBlocks.

Take for example the following program, which tries to compress a column of 64,000 fully random doubles:

#include "btrblocks.hpp"
#include "common/Log.hpp"
#include <random>

#define MAX_DEPTH 4

static btrblocks::Vector<double> generateData() {
    btrblocks::Vector<double> data(64000);
    std::default_random_engine engine;
    std::uniform_real_distribution<double> unif;
    for (size_t i = 0; i < 64000; i++) {
        data[i] = unif(engine);
    }

    return data;
}

int main(int argc, char *argv[]) {
    using namespace btrblocks;

    
    BtrBlocksConfig::configure([&](BtrBlocksConfig &config){
        std::cout << "setting max cascade depth to " << MAX_DEPTH << std::endl;
        config.integers.max_cascade_depth = MAX_DEPTH;
        config.doubles.max_cascade_depth = MAX_DEPTH;
        config.strings.max_cascade_depth = MAX_DEPTH;
        config.doubles.schemes.enable(DoubleSchemeType::DOUBLE_BP);
    });


    // Logging
    Log::set_level(Log::level::debug);

    // Compress it.
    Relation floats_rel;
    // All zeros?
    floats_rel.addColumn({"dbls", generateData()});

    // Compress the entire column.
    Range range(0, floats_rel.tuple_count);
    Chunk input = floats_rel.getChunk({range}, 0);
    Datablock compressor(floats_rel);
    
    
    // Compress, getting stats back.
    std::unique_ptr<uint8_t[]> output(new uint8_t[64000 * sizeof(double) * 2]);
    auto stats = compressor.compress(input, output);


    std::cout << "Stats:" <<  std::endl
        << "- input size " << input.size_bytes() << std::endl
        << "- output size " << stats.total_data_size << std::endl
        << "- compression ratio " << stats.compression_ratio << std::endl
        ;

    return 0;
}

This, somewhat predictably, does not compress at all:

image

It's really hard to figure out why the compressor is performing a certain way without having a better understanding of your data distribution.

It sounds like from your comment that BtrBlocks is selecting Dict encoding for your data, i.e. there are a lot of repeated fp64 values. Is that right? If so, I'd expect Vortex to be able to pick up on that as well, like it does in this example:

image

If it'd be possible to share a bit more information about this arrays that might help track this down.

@shenlei149
Copy link
Author

sample.zip

Here is sample file, and it is binary file not zip file... I just change the extension name to make github happy, otherwise file cannot be uploaded.

file size is 265216, and they are 66304 floats, 4 Bytes.


btrblocks does not support float so I have to use double like this

Relation to_compress;
std::variant<Vector<INTEGER>, Vector<DOUBLE>, Vector<str>> data;
Vector<DOUBLE> doubles(original_data.size()); // original_data is std::vector<float>
for (size_t i = 0; i < original_data.size(); ++i) {
  doubles[i] = original_data[i];
}
to_compress.addColumn(Column{"doubles", std::move(doubles)});

so compression ratio is 1.2 instead of 2.4. maybe it find out that there are many zeros in array and decide to use dictionary encoding :)

I find you use a function name tree_display, here is output. so vortex chooses alp+zigzag+bitpacked?

root: vortex.alp(0x11)(f64, len=66304) nbytes=526.37 kB (100.00%)
  metadata: ALPMetadata { exponents: Exponents { e: 16, f: 0 }, patches: Some(PatchesMetadata { len: 7452, indices_ptype: U32 }) }
  encoded: vortex.zigzag(0x1d)(i64, len=66304) nbytes=449.73 kB (85.44%)
    metadata: ZigZagMetadata
    encoded: fastlanes.bitpacked(0x15)(u64, len=66304) nbytes=449.73 kB (85.44%)
      metadata: BitPackedMetadata { validity: NonNullable, bit_width: 54, offset: 0, patches: Some(PatchesMetadata { len: 42, indices_ptype: U16 }) }
      buffer: 449.28 kB
      patch_indices: vortex.primitive(0x03)(u16, len=42) nbytes=85 B (0.02%)
        metadata: PrimitiveMetadata { validity: NonNullable }
        buffer: 84 B
      patch_values: vortex.primitive(0x03)(u64, len=42) nbytes=337 B (0.06%)
        metadata: PrimitiveMetadata { validity: NonNullable }
        buffer: 336 B
  patch_indices: fastlanes.bitpacked(0x15)(u32, len=7452) nbytes=17.00 kB (3.23%)
    metadata: BitPackedMetadata { validity: NonNullable, bit_width: 16, offset: 0, patches: Some(PatchesMetadata { len: 98, indices_ptype: U16 }) }
    buffer: 16.38 kB
    patch_indices: vortex.primitive(0x03)(u16, len=98) nbytes=197 B (0.04%)
      metadata: PrimitiveMetadata { validity: NonNullable }
      buffer: 196 B
    patch_values: vortex.primitive(0x03)(u32, len=98) nbytes=393 B (0.07%)
      metadata: PrimitiveMetadata { validity: NonNullable }
      buffer: 392 B
  patch_values: vortex.primitive(0x03)(f64, len=7452) nbytes=59.62 kB (11.33%)
    metadata: PrimitiveMetadata { validity: NonNullable }
    buffer: 59.62 kB

In my experiment, arrow_array is pyarrow.lib.DoubleArray not FloatArray because I don't know how to let pyarrow think data type if float. Sadly... :(

with open("sample.data","rb") as file:
    numbers = array.array('f')
    numbers.fromfile(file, (265216//4))
    arrow_array = pa.array(numbers)
    print(type(arrow_array))

vortex tries to ALP and btrblocks may find out a lot of zeros then use dictionary here.

@a10y
Copy link
Contributor

a10y commented Dec 20, 2024

Thanks for sharing that! I think I've decomped the source of the divergent behaviors.

Vortex performs compression by sampling. The samples are targeted at finding Run-length encoding and bitpacking opportunities. It does this by sampling 16 runs of 64 contiguous elements into a 1024 element tuple and tests the different compressors over that.

However, the duplicates in this dataset are distributed in such a way whereas to make it hard for the sampler to find them. It seems that on average, any individual window of 64-elements has very few duplicates, and that holds true even up to 1024 elements:

import struct
import pandas as pd

# Calculate for every 64-element window how many duplicates
def dupes_histo(window_size, xs):
    dupes = []
    for i in range(len(xs) - window_size):
        window = xs[i:i+window_size]
        dupe_count = len(window) - len(set(window))
        dupes.append(dupe_count)
    return dupes


with open("/Users/aduffy/Downloads/sample_floats.bin", "rb") as f:
    xs = []
    for i in range(66304):
        xs.append(struct.unpack("f", f.read(4))[0])

assert len(xs) == 66304


for WINDOW_SIZE in [64, 128, 1024]:
    dupes = pd.Series(dupes_histo(WINDOW_SIZE, xs))

    print(f"Analysis for WINDOW_SIZE {WINDOW_SIZE}:")

    print("- # windows: {}".format(len(dupes)))
    print("- min dupes in any window: {}".format(dupes.min()))
    print("- max dupes in any window: {}".format(dupes.max()))
    print("- # windows w/max dupes: {}".format(len(dupes[dupes == dupes.max()])))
    print("- average dupes per window: {}".format(dupes.mean()))
❯ uv run load_bin.py
Analysis for WINDOW_SIZE 64:
- # windows: 66240
- min dupes in any window: 0
- max dupes in any window: 3
- # windows w/max dupes: 96
- average dupes per window: 0.21541364734299517
Analysis for WINDOW_SIZE 128:
- # windows: 66176
- min dupes in any window: 0
- max dupes in any window: 5
- # windows w/max dupes: 98
- average dupes per window: 0.8615056818181818
Analysis for WINDOW_SIZE 1024:
- # windows: 65280
- min dupes in any window: 41
- max dupes in any window: 529
- # windows w/max dupes: 15
- average dupes per window: 174.91424632352943

Note that in Rust API you can change the sampling parameters, so when I tweak them high enough to effectively grab the entire array as a "sample", we do get the ideal encoding:

https://github.com/spiraldb/vortex/blob/aduffy/repro-gh-1723/bench-vortex/src/bin/gh1723.rs#L37-L44

image

I think we likely need to improve the use of global statistics in our compressor to perform well on these kinds of arrays. We have an open issue to add a cardinality estimate stat in #913, I think the addition of this + plumbing into the compressor would help greatly here.

@shenlei149
Copy link
Author

Thanks for your help, looking forward to new features.

@shenlei149
Copy link
Author

I have another question, how to read the output of tree_display? the outer level encoding is applied first or last? From the file size, I think it is applied last. For image which contains compression ratio 1.7, vortex use bitpacked first, then uses alp, the last encoding is dictionary, is it right?

@robert3005
Copy link
Member

The encodings are read from top to bottom so it did dictionary first, then did alp and bitpacking. When performing decompression they're applied in reverse order, i.e. bitpacking->alp->dict.

a10y added a commit that referenced this issue Dec 30, 2024
Our sample_size and sample_count parameters should be configurable.

As I was looking into #1723 I also noticed that we will runtime panic
when the total sample > u16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants