forked from facebookincubator/velox
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Introduce BatchVectorSerializer to better support preserving encoding…
…s when serializing Vectors (facebookincubator#8567) Summary: Pull Request resolved: facebookincubator#8567 High Level Goal: Support preserving encodings in serialized Vectors from PartitionedOutput when the "arbitrary" partitioning scheme is enabled. This is specifically for Presto, in order to match the Presto Java behavior, and because we've seen in practice this can dramatically reduce the amount of shuffled data improving performance. This change: Looking at the way PrestoVectorSerializer handles encodings, it's currently not generic, and it looks very fragile. E.g. we need to make sure that flush is called after every append, which is currently impossible at compile time, and I'm not 100% confident is always guaranteed at run time. Since we now need to expose this through the generic VectorSerde interface, I figured it was a good time to clean it up a little before it's further cemented. To do this, I've added a new interface BatchVectorSerializer which only supports a single API serialize which effectively combines append and flush so we guarantee that only rows from a single Vector are written for every flush. Thanks to this it also doesn't have to maintain the streams across calls, so it doesn't need the VectorStreamGroup. In the PrestoBatchVectorSerializer implementation, we can take advantage of this to set the encodings based on the single RowVector we're serializing on each call. My hope is with this, we can remove PrestoVectorSerde::serializeEncoded replacing it with BatchVectorSerializer, and also remove the notion of encodings from PrestoVectorSerializer so we don't need to worry about calling append twice without a flush on a PrestoVectorSerializer initialized with encodings. If there's general agreement on this approach I'll do that in a follow up shortly after this. This will also give me a bit more isolation to implement the rest of the features that are needed for the high level goal (calculating the serialized size of Vector slices with encodings, splitting very large RowVectors while maintaining encodings, etc.) in order to match the Presto Java behavior, without breaking the existing serialization logic. Reviewed By: xiaoxmeng Differential Revision: D53144805 fbshipit-source-id: efbe557936f06c92bac9e3e926add1943712cda3
- Loading branch information
1 parent
e8da627
commit b8871b1
Showing
5 changed files
with
403 additions
and
190 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.