
A report on improving R serialization performance #3

Open
traversc opened this issue Jun 15, 2024 · 2 comments

@traversc

Introduction

Hi everyone, I thought I'd write up a report summarizing some experiments I've conducted over the past year on how R could improve serialization performance.

Colleagues often express frustration with the slowness of saveRDS or RStudio startup delays caused by save.image and load. However, these processes can be made significantly more efficient. I'm hoping to spark a discussion and gain support for making improvements.

To motivate discussion, here are some benchmarks:

| Algorithm | Save Time (s) | Read Time (s) | Compression (x) |
|---|---|---|---|
| saveRDS | 35.6 | 8.45 | 2.83 |
| base::serialize | 3.01 | 5.90 | 1.07 |
| qs2 (1 thread) | 3.59 | 4.98 | 3.12 |
| qs2 (4 threads) | 1.66 | 4.43 | 3.12 |

For these tests, I saved a mixed numeric and text dataset of about 1 GB. saveRDS was used with default settings, base::serialize was used with no compression and xdr = FALSE, and qs2 used the R_Serialize C API with a ZSTD block compression scheme.

Below, I've outlined several areas where I believe improvements can be made, roughly ordered from lowest to highest hanging fruit. Please let me know if you have any differing opinions or if I've overlooked something.

(On a side note, I would like to understand whether R_Serialize is part of the stable C API or whether it could be added. I counted at least 16 packages on CRAN that use this interface. It is not in the list of excluded non-API functions, and R CMD check does not flag it. Any insights would be appreciated.)

List of things that could be improved

Compression algorithm

Introducing ZSTD as an option would be a significant enhancement. Major tech projects like AWS and Chrome have now adopted ZSTD compression, so it seems like it would be a worthy addition to R.

File IO

R serialization via saveRDS uses an IO buffer of 16384 bytes. On most systems I've tested, this small buffer size adds overhead. The same buffer is also used for gzip, where it may likewise be too small for efficient compression.

Below is a plot of writing 1 GB of data to a raw C file descriptor on disk. You can see that small buffers decrease performance.

[Figure: write throughput vs. buffer size for a 1 GB write to a raw file descriptor on macOS]
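To make the buffer-size effect concrete, here is a small sketch (not R's actual connection code; all names are hypothetical) of a counting buffered writer. It computes how many flush-to-OS calls a given buffer size incurs when streaming a payload through it, which is the syscall overhead the plot above is measuring.

```c
#include <stddef.h>

/* Hypothetical byte-counting buffered writer: flushes_for() computes how
 * many flush-to-OS calls a given buffer size incurs when streaming
 * `total` bytes in `chunk`-sized serializer writes. */
typedef struct {
    size_t buf_size;   /* capacity of the in-memory buffer          */
    size_t filled;     /* bytes currently sitting in the buffer     */
    size_t flushes;    /* number of times we handed data to the OS  */
} byte_sink;

static void sink_write(byte_sink *s, size_t nbytes) {
    while (nbytes > 0) {
        size_t room = s->buf_size - s->filled;
        size_t take = nbytes < room ? nbytes : room;
        s->filled += take;
        nbytes    -= take;
        if (s->filled == s->buf_size) {   /* buffer full: flush to OS */
            s->flushes++;
            s->filled = 0;
        }
    }
}

/* Total flushes when writing `total` bytes through a buffer of `buf_size`. */
size_t flushes_for(size_t total, size_t buf_size, size_t chunk) {
    byte_sink s = { buf_size, 0, 0 };
    for (size_t done = 0; done < total; done += chunk)
        sink_write(&s, chunk < total - done ? chunk : total - done);
    if (s.filled > 0) s.flushes++;        /* final partial flush */
    return s.flushes;
}
```

For a 1 GiB payload, a 16384-byte buffer means 65536 trips to the OS versus 1024 with a 1 MiB buffer, which is consistent with the slowdown at small buffer sizes.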

XDR

This has been discussed on R-devel, but an option for setting XDR in saveRDS would be much appreciated. For raw numeric data, base::serialize with xdr = FALSE is faster by a whopping factor of 3.
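To illustrate where that factor of 3 comes from: XDR stores data big-endian, so on the little-endian machines that dominate today, every 8-byte double must be byte-reversed on both write and read, while the native path is a single bulk copy. A minimal sketch (my own illustration, not R's internals):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Illustrative sketch of the per-element work XDR implies on a
 * little-endian machine: each double is byte-reversed before writing
 * (and again after reading), whereas xdr = FALSE copies bytes straight
 * through. */
uint64_t byteswap64(uint64_t x) {
    x = ((x & 0x00000000FFFFFFFFULL) << 32) |  (x >> 32);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    return x;
}

/* XDR-style encode: byte-reverse each double individually. */
void encode_xdr(const double *src, unsigned char *dst, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint64_t bits;
        memcpy(&bits, &src[i], 8);
        bits = byteswap64(bits);
        memcpy(dst + 8 * i, &bits, 8);
    }
}

/* Native encode: one bulk copy, no per-element transform. */
void encode_native(const double *src, unsigned char *dst, size_t n) {
    memcpy(dst, src, 8 * n);
}
```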

Byte shuffling

Byte shuffling enhances both speed and compression for numeric data; the technique is described in this StackExchange post. It can be applied heuristically to blocks of data without much overhead.

Below is a comparison plot of a numeric dataset with and without byte shuffling.

[Figure: compression of a numeric dataset with and without byte shuffling]
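For readers unfamiliar with the transform, here is a minimal sketch of the shuffle used by Blosc-style filters: instead of storing doubles contiguously, it groups byte 0 of every element, then byte 1, and so on. Doubles of similar magnitude share sign/exponent bytes, so those bytes end up adjacent and highly repetitive, which LZ-family compressors exploit.

```c
#include <string.h>
#include <stddef.h>

/* Byte shuffle for doubles: transpose the n x 8 byte matrix so that
 * byte k of every element is stored contiguously. */
void shuffle_doubles(const double *src, unsigned char *dst, size_t n) {
    const unsigned char *in = (const unsigned char *)src;
    for (size_t byte = 0; byte < 8; byte++)
        for (size_t i = 0; i < n; i++)
            dst[byte * n + i] = in[i * 8 + byte];
}

/* Inverse transform: restore the original element-contiguous layout. */
void unshuffle_doubles(const unsigned char *src, double *dst, size_t n) {
    unsigned char *out = (unsigned char *)dst;
    for (size_t byte = 0; byte < 8; byte++)
        for (size_t i = 0; i < n; i++)
            out[i * 8 + byte] = src[byte * n + i];
}

/* Roundtrip check used for testing the transform. */
int shuffle_roundtrip_ok(void) {
    double a[5] = {1.0, 2.0, 3.5, -4.25, 1e9};
    unsigned char tmp[40];
    double b[5];
    shuffle_doubles(a, tmp, 5);
    unshuffle_doubles(tmp, b, 5);
    return memcmp(a, b, sizeof a) == 0;
}
```

Production implementations (e.g. Blosc) vectorize this transpose with SIMD, which is why the overhead stays small relative to compression itself.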

Multithreading

This is a hard one to get right. Most data serialization libraries do not take full advantage of multithreading and only multithread compression after serialization of a block of data. Ideally, compression and IO should occur asynchronously to see significant benefits, especially during deserialization.
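As a sketch of the asynchrony I mean (names are hypothetical; a real implementation would use a ring of slots and a real codec), here is a single-slot hand-off channel that lets a worker thread "compress" block i+1 while the consumer "writes" block i:

```c
#include <pthread.h>
#include <stddef.h>

/* Single-slot channel: the compressor thread pushes finished blocks,
 * the writer thread pops them, so compression and IO overlap. */
typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    int  has_item, done;
    long item;                 /* stands in for a compressed block */
} chan;

static void chan_push(chan *c, long v) {
    pthread_mutex_lock(&c->m);
    while (c->has_item) pthread_cond_wait(&c->cv, &c->m);
    c->item = v; c->has_item = 1;
    pthread_cond_broadcast(&c->cv);
    pthread_mutex_unlock(&c->m);
}

static int chan_pop(chan *c, long *out) {
    pthread_mutex_lock(&c->m);
    while (!c->has_item && !c->done) pthread_cond_wait(&c->cv, &c->m);
    int ok = c->has_item;
    if (ok) { *out = c->item; c->has_item = 0; pthread_cond_broadcast(&c->cv); }
    pthread_mutex_unlock(&c->m);
    return ok;
}

typedef struct { chan *c; int nblocks; } worker_arg;

/* Worker = "compressor": transform each block and hand it off.  The
 * transform is faked (doubling) so the pipeline is testable without a
 * real codec. */
static void *compress_worker(void *p) {
    worker_arg *w = (worker_arg *)p;
    for (int i = 0; i < w->nblocks; i++)
        chan_push(w->c, (long)i * 2);
    pthread_mutex_lock(&w->c->m);
    w->c->done = 1;
    pthread_cond_broadcast(&w->c->cv);
    pthread_mutex_unlock(&w->c->m);
    return NULL;
}

/* Consumer = "writer": drains the channel, as if writing blocks out. */
long run_pipeline(int nblocks) {
    chan c = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, 0 };
    worker_arg w = { &c, nblocks };
    pthread_t t;
    long v, total = 0;
    pthread_create(&t, NULL, compress_worker, &w);
    while (chan_pop(&c, &v)) total += v;
    pthread_join(&t, NULL);
    return total;
}
```

The hard part during deserialization is the mirror image: reads and decompression must be issued ahead of the consumer, which is why most libraries that only fan out compression jobs see limited gains on the read path.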

String serialization

Serialization format doesn't matter as much as I once thought, but string handling is one place where I think there is room for improvement.

Each string currently incurs 64 bits of overhead: the first 32 bits store only encoding and type, and the last 32 bits store the length of the string. For a large text dataset, this overhead becomes substantial.

For a character vector (STRSXP), storing type information is unnecessary since each element is a CHARSXP. Encoding requires only 3 bits, and most strings do not need a full 32 bits for size. There are numerous ways this overhead could be reduced.
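One option among the many: a LEB128-style variable-length integer for the length field (a hypothetical illustration, not a format proposal). Strings under 128 bytes then need one length byte instead of four, and the 3-bit encoding flag could be packed into the same leading byte (not shown here).

```c
#include <stddef.h>
#include <stdint.h>

/* Varint (LEB128-style) length encoding: 7 payload bits per byte, high
 * bit set means "more bytes follow".  Lengths < 128 take a single byte. */
size_t varint_encode(uint32_t v, unsigned char *out) {
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (unsigned char)(v | 0x80);  /* low 7 bits + continue bit */
        v >>= 7;
    }
    out[n++] = (unsigned char)v;
    return n;   /* bytes written: 1..5 */
}

size_t varint_decode(const unsigned char *in, uint32_t *v) {
    size_t n = 0;
    uint32_t shift = 0, out = 0;
    do {
        out |= (uint32_t)(in[n] & 0x7F) << shift;
        shift += 7;
    } while (in[n++] & 0x80);
    *v = out;
    return n;   /* bytes consumed */
}
```

For typical text data, where most strings are short, this alone cuts the per-string length overhead from 32 bits to 8.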

Conclusion

I believe these areas present opportunities for improvement and I hope this overview helps. I'd appreciate any feedback or additional insights.

@shikokuchuo
Member

> (On a side note, I would like to understand whether R_serialize is part of the stable C API or whether it can be added in. I counted at least 16 packages on CRAN that are using this interface. It is not in the list of excluded non-API functions and R CMD check does not check for it. Any insights would be appreciated.)

R_Serialize and R_Unserialize are currently marked as experimental API. The designations are incomplete and still being added; they can be tracked using the front end created by @yutannihilation here: https://yutannihilation.github.io/R-fun-API/ Having said that, for me it wouldn't make much sense for these functions to leave the API unless alternatives were already in place.

@yutannihilation

Oh, it was only a few days ago that these APIs got marked as experimental. Now WRE has a section about serialization. Great.

https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Custom-serialization-input-and-output-1

commit: r-devel/r-svn@0b1eeb4
