PrestoBatchSerializer should not preserve Dictionary encoding if it makes the data larger (#8688)

Summary:
Pull Request resolved: #8688

This change adds some basic heuristics which serializeDictionaryVector can use to flatten a Vector as part of serializing it
rather than preserving the Dictionary encoding.

The Vector is flattened if any of the following checks holds:
* the size of the Vector type is smaller than or equal to the size of int32_t (the type of the indices into the dictionary)
* the Vector type is fixed width and the size of the indices plus the size of the alphabet is larger than the size of the original data
* regardless of the Vector type, the alphabet contains unique values

This helps ensure that preserving encodings during serialization won't actually make the serialized data larger.
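
As a rough illustration of the three checks, here is a minimal standalone sketch. It is not the code from this commit: `shouldFlatten` and `valueWidth` are hypothetical names, a width of 0 is used to mean "variable width", the "original data" is assumed to be the flattened rows, and "the alphabet contains unique values" is read as "every row references a distinct dictionary entry".

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not part of the Velox API.
// valueWidth: byte width of a fixed-width value type, or 0 for variable width.
// indices: one dictionary index per serialized row.
// Returns true if the sketch would flatten instead of keeping the dictionary.
bool shouldFlatten(std::size_t valueWidth, const std::vector<int32_t>& indices) {
  const bool fixedWidth = valueWidth != 0;

  // Check 1: values no wider than an index, so indices cannot save space.
  if (fixedWidth && valueWidth <= sizeof(int32_t)) {
    return true;
  }

  // Count the dictionary entries (the "alphabet") actually referenced.
  const std::unordered_set<int32_t> used(indices.begin(), indices.end());
  const std::size_t numRows = indices.size();
  const std::size_t numUsed = used.size();

  // Check 2: for fixed-width types, compare indices + alphabet against
  // the flattened rows (assumed meaning of "original data").
  if (fixedWidth) {
    const std::size_t dictionaryBytes =
        numRows * sizeof(int32_t) + numUsed * valueWidth;
    const std::size_t flatBytes = numRows * valueWidth;
    if (dictionaryBytes > flatBytes) {
      return true;
    }
  }

  // Check 3: no entry is referenced twice (assumed meaning of "unique
  // values"), so the dictionary removes no repetition.
  return numUsed == numRows;
}
```

For example, under these assumptions eight rows of 8-byte values that all point at one entry keep the dictionary (40 bytes versus 64 flattened), while four rows spread over three entries are flattened (40 bytes versus 32).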

Reviewed By: bikramSingh91

Differential Revision: D53484809

fbshipit-source-id: c7954b827a0a8e946a67d53e5b1195184c9e8d3a
Kevin Wilfong authored and facebook-github-bot committed Feb 27, 2024
1 parent 87a0519 commit 7e0a5a2
Showing 4 changed files with 324 additions and 168 deletions.