-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
There is no encodings on float array #1723
Comments
Hey @shenlei149, thanks for giving Vortex a try for your usecase! We do implement two floating point encodings, ALP and ALP-RD ("ALP for Real Doubles"). The first meant for low-precision decimals that can be approximated as integers, the latter is for high-precision doubles and is probably more likely to be useful for your use-case. Note that the compression ratio for ALP-RD is not gigantic (usually it saves on the order of ~9 bits-per-value) but they do match what DuckDB/the ALP paper indicate. I've noticed that we actually had a bug in our compressor selection for floating points, I've filed a PR with the fix here: #1725 |
BtrBlocks implements two encodings that we don't, Frequency and PDE. I'd be curious to understand which encoding they're choosing for your dataset |
Fixes issue seen in #1723 In #1000, changes were made to remove ALP-RD as a top-level compressor and instead to only use it for patches. However, it seems that it was not getting selected anymore, whether that was due to the patching cost overhead or something else. This was noticed by a user, and confirmed by me in a Python shell. <img width="946" alt="image" src="https://github.com/user-attachments/assets/c42caed5-12e3-448b-aea6-3f33a7c97bfc" /> After this change, ALP-RD indeed does get selected again. <img width="907" alt="image" src="https://github.com/user-attachments/assets/bbb996fc-5223-43ed-9b4c-4b0262a417dc" />
Thank you for your reply and quick fix. update the code and rebuild, but it does not work. The size is still large. From return status of btrblocks compress API, used_compression_schemes contains only one element, which is 2, so I think it uses dictionary encoding for my dataset. enum class DoubleSchemeType : uint8_t {
UNCOMPRESSED = 0,
ONE_VALUE = 1,
DICT = 2,
RLE = 3,
FREQUENCY = 4,
PSEUDODECIMAL = 5,
// legacy schemes
DOUBLE_BP = 28,
DICTIONARY_8 = 29,
DICTIONARY_16 = 31,
SCHEME_MAX = 32
}; |
Thanks for the extra info @shenlei149. I spent a bit of time today tracking down some float compression things in BtrBlocks. Take for example the following program, which tries to compress a column of 64,000 fully random doubles:
This, somewhat predictably, does not compress at all: It's really hard to figure out why the compressor is performing a certain way without having a better understanding of your data distribution. It sounds like from your comment that BtrBlocks is selecting Dict encoding for your data, i.e. there are a lot of repeated fp64 values. Is that right? If so, I'd expect Vortex to be able to pick up on that as well, like it does in this example: If it'd be possible to share a bit more information about this arrays that might help track this down. |
Here is sample file, and it is binary file not zip file... I just change the extension name to make github happy, otherwise file cannot be uploaded. file size is 265216, and they are 66304 floats, 4 Bytes. btrblocks does not support float so I have to use double like this Relation to_compress;
std::variant<Vector<INTEGER>, Vector<DOUBLE>, Vector<str>> data;
Vector<DOUBLE> doubles(original_data.size()); // original_data is std::vector<float>
for (size_t i = 0; i < original_data.size(); ++i) {
doubles[i] = original_data[i];
}
to_compress.addColumn(Column{"doubles", std::move(doubles)}); so compression ratio is 1.2 instead of 2.4. maybe it find out that there are many zeros in array and decide to use dictionary encoding :) I find you use a function name tree_display, here is output. so vortex chooses alp+zigzag+bitpacked?
In my experiment, arrow_array is pyarrow.lib.DoubleArray not FloatArray because I don't know how to let pyarrow think data type if float. Sadly... :( with open("sample.data","rb") as file:
numbers = array.array('f')
numbers.fromfile(file, (265216//4))
arrow_array = pa.array(numbers)
print(type(arrow_array)) vortex tries to ALP and btrblocks may find out a lot of zeros then use dictionary here. |
Thanks for sharing that! I think I've decomped the source of the divergent behaviors. Vortex performs compression by sampling. The samples are targeted at finding Run-length encoding and bitpacking opportunities. It does this by sampling 16 runs of 64 contiguous elements into a 1024 element tuple and tests the different compressors over that. However, the duplicates in this dataset are distributed in such a way whereas to make it hard for the sampler to find them. It seems that on average, any individual window of 64-elements has very few duplicates, and that holds true even up to 1024 elements:
Note that in Rust API you can change the sampling parameters, so when I tweak them high enough to effectively grab the entire array as a "sample", we do get the ideal encoding: https://github.com/spiraldb/vortex/blob/aduffy/repro-gh-1723/bench-vortex/src/bin/gh1723.rs#L37-L44 I think we likely need to improve the use of global statistics in our compressor to perform well on these kinds of arrays. We have an open issue to add a cardinality estimate stat in #913, I think the addition of this + plumbing into the compressor would help greatly here. |
Thanks for your help, looking forward to new features. |
I have another question, how to read the output of tree_display? the outer level encoding is applied first or last? From the file size, I think it is applied last. For image which contains compression ratio 1.7, vortex use bitpacked first, then uses alp, the last encoding is dictionary, is it right? |
The encodings are read from top to bottom so it did dictionary first, then did alp and bitpacking. When performing decompression they're applied in reverse order, i.e. bitpacking->alp->dict. |
Our sample_size and sample_count parameters should be configurable. As I was looking into #1723 I also noticed that we will runtime panic when the total sample > u16
We have a lot of data for ML training, which are saved as parquet files. We want to find a alternative way to store the data to speed up reading and reduce file size.
There are several columns with few nulls, and their data type is
struct<value: list<array: float not null>>
, and there are several handreds of rows per row group and 500/1000 floats in each list. For these columns, vortex does not have a good performance bacause of no encodings, so the array size is larger than parquet.pseudo code likes:
I try to flatten the list array and export data, use them as one array, still no encoding...
Here is the code and output:
I use btrblocks to encode the data.
if I use 500 floats to do the same thing, no encoding, and the compression ratio is 1.
My question is if the way I use vortex is right, if not, what is the right way to do it?
I'm not sure if I make question clear, please let me know if you need more details.
The text was updated successfully, but these errors were encountered: