Introduction
Hi everyone, I thought I'd write up a report summarizing some experiments I've conducted over the past year on how R could improve serialization performance.
Colleagues often express frustration with the slowness of `saveRDS`, or with RStudio startup delays caused by `save.image` and `load`. These processes can be made significantly more efficient, and I'm hoping to spark a discussion and gain support for making improvements.
To motivate discussion, here are some benchmarks:
| Algorithm | Save Time (s) | Read Time (s) | Compression (x) |
|---|---|---|---|
| `saveRDS` | 35.6 | 8.45 | 2.83 |
| `base::serialize` | 3.01 | 5.90 | 1.07 |
| qs2 (1 thread) | 3.59 | 4.98 | 3.12 |
| qs2 (4 threads) | 1.66 | 4.43 | 3.12 |
For these tests, I saved a mixed numeric and text dataset of about 1 GB. `saveRDS` was used with default settings, `base::serialize` was used with no compression and `xdr = FALSE`, and qs2 used the `R_serialize` C API with a ZSTD block compression scheme.
Below, I've outlined several areas where I believe improvements can be made, roughly ordered from lowest to highest hanging fruit. Please let me know if you have any differing opinions or if I've overlooked something.
(On a side note, I would like to understand whether `R_serialize` is part of the stable C API, or whether it could be added to it. I counted at least 16 packages on CRAN that use this interface. It is not in the list of excluded non-API functions, and `R CMD check` does not flag it. Any insights would be appreciated.)
List of things that could be improved
Compression algorithm
Introducing ZSTD as an option would be a significant enhancement. Major technology platforms such as AWS and Chrome have adopted ZSTD compression, so it seems a worthy addition to R.
File IO
R serialization via `saveRDS` uses an IO buffer of 16384 bytes. On most systems I've tested, this small buffer size adds overhead. The same buffer is also used for gzip, where it may likewise be too small for efficient compression.
Below is a plot of writing 1 GB of data to a raw C file descriptor on disk. You can see that small buffers decrease performance.
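As an illustration of why buffer size matters, here is a minimal buffered writer in C with a configurable buffer size. This is a hypothetical sketch, not R's actual connection code; the `BufWriter` type and its functions are invented for this example:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Minimal buffered writer with a configurable buffer size (an
 * illustration, not R's connection code). R's serialization layer
 * uses a fixed 16384-byte buffer; a larger, tunable buffer means far
 * fewer trips through the underlying file stream. */
typedef struct {
    FILE   *fp;
    char   *buf;
    size_t  cap, used;
    long    flushes;  /* number of writes issued to the underlying FILE */
} BufWriter;

int bw_open(BufWriter *w, const char *path, size_t cap) {
    w->fp = fopen(path, "wb");
    if (!w->fp) return -1;
    w->buf = malloc(cap);
    w->cap = cap;
    w->used = 0;
    w->flushes = 0;
    return 0;
}

static void bw_flush(BufWriter *w) {
    if (w->used) {
        fwrite(w->buf, 1, w->used, w->fp);
        w->used = 0;
        w->flushes++;
    }
}

void bw_write(BufWriter *w, const void *data, size_t n) {
    const char *p = data;
    while (n > 0) {
        size_t room = w->cap - w->used;
        size_t take = n < room ? n : room;
        memcpy(w->buf + w->used, p, take);
        w->used += take;
        p += take;
        n -= take;
        if (w->used == w->cap) bw_flush(w);  /* only hit the OS when full */
    }
}

void bw_close(BufWriter *w) {
    bw_flush(w);
    fclose(w->fp);
    free(w->buf);
}
```

Writing 1 MiB through a 16 KiB buffer issues 64 underlying writes, while a 1 MiB buffer issues just one; each flush carries fixed per-call overhead, which is where the small default buffer loses time.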
XDR
This has been discussed on R-devel, but an option for setting XDR in `saveRDS` would be much appreciated. For raw numeric data, `base::serialize` with `xdr = FALSE` is roughly three times faster.
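For concreteness, here is what the XDR path costs on little-endian hardware: every 8-byte double must be byte-reversed on write and again on read, whereas the native-endian path is a plain copy. A hypothetical sketch, not R's actual code:

```c
#include <stdint.h>
#include <string.h>

/* XDR stores data big-endian, so on little-endian hardware every
 * 8-byte double is reversed on write and again on read. With
 * xdr = FALSE the bytes are copied through untouched. */
static uint64_t bswap64(uint64_t v) {
    v = ((v & 0x00000000FFFFFFFFULL) << 32) |  (v >> 32);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    v = ((v & 0x00FF00FF00FF00FFULL) <<  8) | ((v >>  8) & 0x00FF00FF00FF00FFULL);
    return v;
}

/* Convert a double buffer to/from XDR byte order in place
 * (the swap is its own inverse). */
void xdr_swap_doubles(double *x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint64_t bits;
        memcpy(&bits, &x[i], 8);
        bits = bswap64(bits);
        memcpy(&x[i], &bits, 8);
    }
}
```

This per-element swap on every save and load is the overhead that `xdr = FALSE` removes.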
Byte shuffling
Byte shuffling enhances both speed and compression for numeric data; it is described in this StackExchange post. It can be applied heuristically to blocks of data without much overhead.
Below is a comparison plot of a numeric dataset with and without byte shuffling.
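A minimal sketch of the transform itself (the helper names here are invented for illustration, not taken from any particular library):

```c
#include <stddef.h>

/* Byte shuffle: regroup element bytes by position, so byte 0 of every
 * double is stored together, then byte 1, and so on. Similar-magnitude
 * doubles share their high-order (sign/exponent) bytes, so the
 * shuffled stream contains long near-constant runs that generic
 * compressors handle far better than the interleaved original. */
void shuffle(const unsigned char *in, unsigned char *out,
             size_t n, size_t elem) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < elem; j++)
            out[j * n + i] = in[i * elem + j];
}

/* Exact inverse of shuffle(). */
void unshuffle(const unsigned char *in, unsigned char *out,
               size_t n, size_t elem) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < elem; j++)
            out[i * elem + j] = in[j * n + i];
}
```

Because the transform is a pure byte permutation, it is cheap, lossless, and easy to apply block by block only when a heuristic suggests the data will benefit.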
Multithreading
This is a hard one to get right. Most data serialization libraries do not take full advantage of multithreading and only multithread compression after serialization of a block of data. Ideally, compression and IO should occur asynchronously to see significant benefits, especially during deserialization.
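The overlapping approach described above can be sketched as a single-producer/single-consumer pipeline: the main thread "serializes" blocks into a bounded queue while a worker thread compresses and writes them. Everything here is invented for illustration, and the compression step is stubbed out as a byte transform:

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Sketch of a serialize -> compress+write pipeline (hypothetical, not
 * R's code). The producer fills a small bounded queue while a worker
 * thread "compresses" (stubbed) and writes, so compression and IO
 * overlap with serialization instead of running strictly after it. */
enum { QDEPTH = 4, BLKSZ = 4096, NBLOCKS = 32 };

typedef struct {
    unsigned char blocks[QDEPTH][BLKSZ];
    int head, tail, count, done;
    pthread_mutex_t mu;
    pthread_cond_t not_full, not_empty;
} Queue;

static Queue q;
static long bytes_written;

static void q_push(const unsigned char *blk) {
    pthread_mutex_lock(&q.mu);
    while (q.count == QDEPTH) pthread_cond_wait(&q.not_full, &q.mu);
    memcpy(q.blocks[q.tail], blk, BLKSZ);
    q.tail = (q.tail + 1) % QDEPTH;
    q.count++;
    pthread_cond_signal(&q.not_empty);
    pthread_mutex_unlock(&q.mu);
}

/* Returns 0 once the producer has finished and the queue is drained. */
static int q_pop(unsigned char *blk) {
    pthread_mutex_lock(&q.mu);
    while (q.count == 0 && !q.done) pthread_cond_wait(&q.not_empty, &q.mu);
    if (q.count == 0) { pthread_mutex_unlock(&q.mu); return 0; }
    memcpy(blk, q.blocks[q.head], BLKSZ);
    q.head = (q.head + 1) % QDEPTH;
    q.count--;
    pthread_cond_signal(&q.not_full);
    pthread_mutex_unlock(&q.mu);
    return 1;
}

static void *writer_thread(void *arg) {
    FILE *fp = arg;
    unsigned char blk[BLKSZ];
    while (q_pop(blk)) {
        for (int i = 0; i < BLKSZ; i++) blk[i] ^= 0x5A; /* stand-in for compression */
        bytes_written += (long)fwrite(blk, 1, BLKSZ, fp);
    }
    return NULL;
}

/* Drive the pipeline; returns total bytes written by the worker. */
long run_pipeline(const char *path) {
    memset(&q, 0, sizeof q);
    pthread_mutex_init(&q.mu, NULL);
    pthread_cond_init(&q.not_full, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    bytes_written = 0;

    FILE *fp = fopen(path, "wb");
    pthread_t th;
    pthread_create(&th, NULL, writer_thread, fp);

    unsigned char blk[BLKSZ];
    for (int b = 0; b < NBLOCKS; b++) {  /* "serialize" each block */
        memset(blk, b, BLKSZ);
        q_push(blk);
    }
    pthread_mutex_lock(&q.mu);
    q.done = 1;
    pthread_cond_signal(&q.not_empty);
    pthread_mutex_unlock(&q.mu);
    pthread_join(th, NULL);
    fclose(fp);
    return bytes_written;
}
```

The bounded queue is what makes the overlap safe: the producer can run ahead by at most `QDEPTH` blocks, so memory use stays fixed while compression and IO proceed concurrently with serialization.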
String serialization
Serialization format doesn't matter as much as I once thought, but string handling is one place I think there is room for improvement.
Each string currently incurs 64 bits of overhead: the first 32 bits store only the encoding and type, and the last 32 bits store the length of the string. For a large text dataset, this overhead becomes substantial.
For a character vector (STRSXP), storing type information is unnecessary since each element is a CHARSXP. Encoding requires only 3 bits, and most strings do not need a full 32 bits for size. There are numerous ways this overhead could be reduced.
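One possible direction, sketched below: since encoding needs only 3 bits, a single header byte can cover the common case of short strings, with an escape to a full 32-bit length for the rest. The format and helper names here are hypothetical, not any proposed R format:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compact CHARSXP header. R currently spends 8 bytes per
 * string (32 bits of type/encoding flags + a 32-bit length). Encoding
 * fits in 3 bits, so one byte can cover the common case:
 * 3 bits of encoding + 5 bits of length, where length 31 is an
 * escape meaning "a full little-endian 32-bit length follows". */
size_t header_write(unsigned char *out, unsigned enc, uint32_t len) {
    if (len < 31) {                            /* common case: 1 byte */
        out[0] = (unsigned char)((enc << 5) | len);
        return 1;
    }
    out[0] = (unsigned char)((enc << 5) | 31); /* escape case: 5 bytes */
    out[1] = (unsigned char)(len & 0xFF);
    out[2] = (unsigned char)((len >> 8) & 0xFF);
    out[3] = (unsigned char)((len >> 16) & 0xFF);
    out[4] = (unsigned char)((len >> 24) & 0xFF);
    return 5;
}

size_t header_read(const unsigned char *in, unsigned *enc, uint32_t *len) {
    *enc = in[0] >> 5;
    uint32_t l = in[0] & 31;
    if (l < 31) { *len = l; return 1; }
    *len = (uint32_t)in[1] | ((uint32_t)in[2] << 8) |
           ((uint32_t)in[3] << 16) | ((uint32_t)in[4] << 24);
    return 5;
}
```

For a dataset dominated by short strings this cuts the per-string overhead from 8 bytes to 1, while long strings pay only 5.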
Conclusion
I believe these areas present opportunities for improvement and I hope this overview helps. I'd appreciate any feedback or additional insights.
`R_Serialize` and `R_Unserialize` are currently marked as experimental API. The designations are incomplete and still being added; they can be tracked using the front end created by @yutannihilation here: https://yutannihilation.github.io/R-fun-API/ That said, it wouldn't make much sense for these functions to leave the API unless alternatives were already in place.