MINOR: Fix Patched Base doc in specification
### What changes were proposed in this pull request?

Fix patched base specification to state that only 5% of values are patched, not 10%

### Why are the changes needed?

The implementation patches at most 5% of values:

https://github.com/apache/orc/blob/0828c2ff114f30c84e4a23fd42ed58c6615c6f97/java/core/src/java/org/apache/orc/impl/RunLengthIntegerWriterV2.java#L535-L550

- Also, 10% of 512 does not fit in the maximum patch list length of 31
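
The arithmetic behind that bullet can be checked with a minimal sketch. It assumes that 512 is the maximum number of values in one run and that 31 is the largest patch list length the header can express, as the bullet states. The names `MAX_RUN_LENGTH`, `MAX_PATCH_LIST_LENGTH`, `max_patched`, and `fits_patch_list` are hypothetical and do not come from the ORC code base.

```python
# Illustrative check only; constants mirror the numbers quoted in the commit message.
MAX_RUN_LENGTH = 512         # assumed maximum number of values in one run
MAX_PATCH_LIST_LENGTH = 31   # maximum patch list length quoted above


def max_patched(run_length: int, patched_percent: int) -> int:
    """Largest whole number of values that can exceed the chosen width W."""
    return run_length * patched_percent // 100


def fits_patch_list(run_length: int, patched_percent: int) -> bool:
    """True if the worst-case patch count fits in the patch list."""
    return max_patched(run_length, patched_percent) <= MAX_PATCH_LIST_LENGTH


for percent in (10, 5):
    count = max_patched(MAX_RUN_LENGTH, percent)
    print(percent, count, fits_patch_list(MAX_RUN_LENGTH, percent))
```

Under these assumptions, patching 10% of a full run would need 51 patches, which exceeds 31, while patching 5% needs at most 25. The sketch ignores the extra zero-valued patches that the specification inserts to skip gaps of more than 255 items.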

Also fix some formatting issues.

Before:

![image](https://github.com/apache/orc/assets/22608443/69849f63-94f5-4da3-8338-70ef1dbc9ef5)

After:

![image](https://github.com/apache/orc/assets/22608443/747cf944-9b3a-4367-b4f5-b6d8b2364f17)

### How was this patch tested?

N/A

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #1948 from Jefffrey/patched-base-doc-fix.

Authored-by: Jefffrey <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Jefffrey authored and dongjoon-hyun committed Jul 9, 2024
1 parent e9706df commit cccbe72
Showing 2 changed files with 8 additions and 8 deletions.
8 changes: 4 additions & 4 deletions site/specification/ORCv1.md
@@ -804,8 +804,8 @@ length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad,
The patched base encoding is used for integer sequences whose bit
widths varies a lot. The minimum signed value of the sequence is found
and subtracted from the other values. The bit width of those adjusted
-values is analyzed and the 90 percentile of the bit width is chosen
-as W. The 10\% of values larger than W use patches from a patch list
+values is analyzed and the 95 percentile of the bit width is chosen
+as W. The 5% of values larger than W use patches from a patch list
to set the additional bits. Patches are encoded as a list of gaps in
the index values and the additional value bits.

@@ -830,8 +830,8 @@ the index values and the additional value bits.
patch, and a patch value. Patches are applied by logically or'ing
the data values with the relevant patch shifted W bits left. If a
patch is 0, it was introduced to skip over more than 255 items. The
-combined length of each patch (PGW + PW) must be less or equal to
-64. (PGW + PW) is padded to the closest fixed bit size according to the
+combined length of each patch (PGW + PW) must be less or equal to 64.
+(PGW + PW) is padded to the closest fixed bit size according to the
below table before being encoded in the patch list.

(PGW + PW) | closestFixedBits(PGW + PW)
8 changes: 4 additions & 4 deletions site/specification/ORCv2.md
@@ -823,8 +823,8 @@ length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad,
The patched base encoding is used for integer sequences whose bit
widths varies a lot. The minimum signed value of the sequence is found
and subtracted from the other values. The bit width of those adjusted
-values is analyzed and the 90 percentile of the bit width is chosen
-as W. The 10\% of values larger than W use patches from a patch list
+values is analyzed and the 95 percentile of the bit width is chosen
+as W. The 5% of values larger than W use patches from a patch list
to set the additional bits. Patches are encoded as a list of gaps in
the index values and the additional value bits.

@@ -849,8 +849,8 @@ the index values and the additional value bits.
patch, and a patch value. Patches are applied by logically or'ing
the data values with the relevant patch shifted W bits left. If a
patch is 0, it was introduced to skip over more than 255 items. The
-combined length of each patch (PGW + PW) must be less or equal to
-64. (PGW + PW) is padded to the closest fixed bit size according to the
+combined length of each patch (PGW + PW) must be less or equal to 64.
+(PGW + PW) is padded to the closest fixed bit size according to the
below table before being encoded in the patch list.

(PGW + PW) | closestFixedBits(PGW + PW)
