MINOR: Fix Patched Base doc in specification

### What changes were proposed in this pull request?

Fix patched base specification to state that only 5% of values are patched, not 10%.

### Why are the changes needed?

According to implementation: https://github.com/apache/orc/blob/0828c2ff114f30c84e4a23fd42ed58c6615c6f97/java/core/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java#L535-L550

- Also 10% of 512 doesn't fit in max patch list length of 31

Also fix some formatting issues.

Before:

![image](https://github.com/apache/orc/assets/22608443/69849f63-94f5-4da3-8338-70ef1dbc9ef5)

After:

![image](https://github.com/apache/orc/assets/22608443/747cf944-9b3a-4367-b4f5-b6d8b2364f17)

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #1948 from Jefffrey/patched-base-doc-fix.

Authored-by: Jefffrey <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
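
The arithmetic behind the "doesn't fit" argument can be checked with a short sketch. The two constants are taken from the commit message (a run of 512 values, a patch list of at most 31 entries); the helper name `max_patched_values` is hypothetical and is not part of the ORC code base. This is a rough bound only, since it ignores any extra patch list entries used purely to extend a gap.

```python
# Figures quoted in the commit message; treated here as assumptions.
MAX_RUN_LENGTH = 512
MAX_PATCH_LIST_LENGTH = 31


def max_patched_values(run_length: int, patched_percent: int) -> int:
    """Upper bound on how many values lie above the chosen percentile.

    Hypothetical helper for illustration, not an ORC API.
    """
    return run_length * patched_percent // 100


for percent in (10, 5):
    count = max_patched_values(MAX_RUN_LENGTH, percent)
    fits = count <= MAX_PATCH_LIST_LENGTH
    print(f"{percent}% of {MAX_RUN_LENGTH} -> {count} patches, fits: {fits}")
```

With these numbers, 10% yields 51 candidate patches, which exceeds 31, while 5% yields 25, which fits.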
apache · Jul 9, 2024 · cccbe72
1 parent e9706df
Showing 2 changed files with 8 additions and 8 deletions.
diff --git a/site/specification/ORCv1.md b/site/specification/ORCv1.md
@@ -804,8 +804,8 @@ length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad,
 The patched base encoding is used for integer sequences whose bit
 widths varies a lot. The minimum signed value of the sequence is found
 and subtracted from the other values. The bit width of those adjusted
-values is analyzed and the 90 percentile of the bit width is chosen
-as W. The 10\% of values larger than W use patches from a patch list
+values is analyzed and the 95 percentile of the bit width is chosen
+as W. The 5% of values larger than W use patches from a patch list
 to set the additional bits. Patches are encoded as a list of gaps in
 the index values and the additional value bits.
 
@@ -830,8 +830,8 @@ the index values and the additional value bits.
   patch, and a patch value. Patches are applied by logically or'ing
   the data values with the relevant patch shifted W bits left. If a
   patch is 0, it was introduced to skip over more than 255 items. The
-  combined length of each patch (PGW + PW) must be less or equal to
-  64. (PGW + PW) is padded to the closest fixed bit size according to the
+  combined length of each patch (PGW + PW) must be less or equal to 64.
+  (PGW + PW) is padded to the closest fixed bit size according to the
   below table before being encoded in the patch list.
 
 (PGW + PW)    | closestFixedBits(PGW + PW)

diff --git a/site/specification/ORCv2.md b/site/specification/ORCv2.md
@@ -823,8 +823,8 @@ length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad,
 The patched base encoding is used for integer sequences whose bit
 widths varies a lot. The minimum signed value of the sequence is found
 and subtracted from the other values. The bit width of those adjusted
-values is analyzed and the 90 percentile of the bit width is chosen
-as W. The 10\% of values larger than W use patches from a patch list
+values is analyzed and the 95 percentile of the bit width is chosen
+as W. The 5% of values larger than W use patches from a patch list
 to set the additional bits. Patches are encoded as a list of gaps in
 the index values and the additional value bits.
 
@@ -849,8 +849,8 @@ the index values and the additional value bits.
   patch, and a patch value. Patches are applied by logically or'ing
   the data values with the relevant patch shifted W bits left. If a
   patch is 0, it was introduced to skip over more than 255 items. The
-  combined length of each patch (PGW + PW) must be less or equal to
-  64. (PGW + PW) is padded to the closest fixed bit size according to the
+  combined length of each patch (PGW + PW) must be less or equal to 64.
+  (PGW + PW) is padded to the closest fixed bit size according to the
   below table before being encoded in the patch list.
 
 (PGW + PW)    | closestFixedBits(PGW + PW)
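
The patch rule quoted in the hunks above (OR each data value with its patch shifted W bits left, then undo the subtraction of the minimum value) can be sketched as follows. This is a minimal illustration, not the ORC reader: the function name `apply_patches`, the `(gap, patch)` tuple layout, and the assumption that each gap is counted from the previously visited index (starting at 0) are all choices made for this example. Bit unpacking of the header, data, and patch list is left out.

```python
from typing import List, Tuple


def apply_patches(
    base: int,
    width: int,
    reduced: List[int],
    patches: List[Tuple[int, int]],
) -> List[int]:
    """Rebuild the original values from already-unpacked patched base fields.

    base    -- minimum value that was subtracted from every entry
    width   -- W, the bit width used for the data values
    reduced -- the low W bits of each adjusted value
    patches -- (gap, patch) pairs; the layout is an assumption of this sketch
    """
    values = list(reduced)
    index = 0
    for gap, patch in patches:
        index += gap
        # A zero patch carries no bits; it only extends the gap.
        if patch != 0:
            values[index] |= patch << width
    # Add the base back to undo the subtraction of the minimum value.
    return [base + v for v in values]


# One outlier at index 3: adjusted value 0x1234 stored as 0x34 plus patch 0x12.
decoded = apply_patches(
    base=2000,
    width=8,
    reduced=[30, 0, 20, 0x34, 10],
    patches=[(3, 0x12)],
)
print(decoded)
```

Here only index 3 receives extra high bits, so the other entries decode to the base plus their stored low bits.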