diff --git a/docs/notes/EntropyMetricForDataCompression.html b/docs/notes/EntropyMetricForDataCompression.html index 09c76a5..8325b3c 100644 --- a/docs/notes/EntropyMetricForDataCompression.html +++ b/docs/notes/EntropyMetricForDataCompression.html @@ -83,7 +83,7 @@

Entropy

evaluating the effectiveness of a data compression implementation.

Note: For an engaging and accessible explanation of - the entropy concept, see Josh Starmer’s video
+ the entropy concept, see Josh Starmer’s video Entropy (for Data Science) Clearly Explained!!! on YouTube (Starmer, 2024). I think you will find that the video lives up to its title.

@@ -116,8 +116,7 @@

The entropy calculation

Applying entropy

-

To illustrate the entropy concept, let’s consider a couple - of special cases.

+

To illustrate the entropy concept, let’s consider a couple of applications.

Natural language text: The probability for each symbol in the alphabet is computed based on the number of times it occurs in @@ -127,23 +126,13 @@

Applying entropy

Using that approach to evaluate a text extract from this document produces an entropy value of approximately 4.79.
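As a concrete illustration, the sketch below tallies symbol counts and evaluates the sum of -p*log2(p) over the alphabet. It is a simplified stand-in rather than the Gridfour implementation, and the class and method names are invented for this example; the same loop works equally well over bytes when the data is numeric rather than textual.

    import java.util.HashMap;
    import java.util.Map;

    // A minimal first-order entropy computation: count how often each symbol
    // occurs, convert the counts to probabilities, and sum -p * log2(p).
    public class FirstOrderEntropy {

      public static double entropy(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = 0; i < text.length(); i++) {
          counts.merge(text.charAt(i), 1, Integer::sum);
        }
        double n = text.length();
        double sum = 0;
        for (int count : counts.values()) {
          double p = count / n;                      // probability of this symbol
          sum -= p * (Math.log(p) / Math.log(2.0));  // -p * log2(p)
        }
        return sum;  // bits per symbol
      }

      public static void main(String[] args) {
        System.out.println(entropy("aaaaaaaa"));  // 0.0: a repeated symbol is fully predictable
        System.out.println(entropy("tree"));      // 1.5 bits per symbol
      }
    }

Applied to the four-character string "tree", the method reports 1.5 bits per symbol, a value that will be useful when we look at Huffman coding below.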

-

Text consisting of a single value: When the text is - just a single symbol repeated multiple times, the value of each symbol is completely - predictable. The probability of that value is always 1.  So the log of each - probability 1.0 is zero.  Thus the entropy would be zero.  As noted above, a - data set with low entropy compresses readily (or, in this case, trivially).

- -

All symbols have uniform probability: If all symbols - in a text are equiprobable, then they have the probability of - log(n) - where n is the number of unique symbols in the text.  Plugging that probability into - the equation above, we see that the entropy of the data set is simply .  This - situation would occur if:

- +

Numeric data: The most common way to compute entropy for numeric data is to treat individual bytes as symbols. + Since a byte can take on 256 separate values (typically 0 to 255), this approach + limits the maximum number of unique symbols in a data stream and simplifies + the process of counting symbols and computing probabilities. Shannon's + definition of entropy is broad enough to accommodate symbol sets consisting of alternate + data types, such as integers or even floating-point values. Examples of this approach + are given in Part II of this article.

Higher order models for entropy and conditional probability

@@ -186,9 +175,13 @@

Higher order models for entropy and conditional probability

Third-order entropy formula -
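Under one common convention, a third-order model conditions each symbol on the two symbols that precede it. Offered here only as a sketch of the general form, the formula sums over all three-symbol sequences:

    H_3 = -\sum_{i}\sum_{j}\sum_{k} p(s_i, s_j, s_k) \, \log_2 p(s_k \mid s_i, s_j)

where p(s_i, s_j, s_k) is the joint probability of three consecutive symbols and p(s_k | s_i, s_j) is the conditional probability of the third symbol given the two that precede it.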

Entropy, Huffman coding, and effective data compression

+

Entropy coding, the Huffman technique, and efficient data compression

+

The term entropy coding refers to "any lossless data compression method + that attempts to approach the lower bound declared by Shannon's source coding theorem" (Wikipedia, 2023). + Entropy coding techniques include Huffman coding, arithmetic coding, + and asymmetric numeral systems. The simplest and most widely used of these is Huffman coding.

-

Huffman coding (Huffman, 1952) is a data compression technique that uses +

Huffman coding (Huffman, 1952) uses variable-length bit sequences to encode individual entries in a symbol set (alphabet, etc.).  The encoding for each symbol is based on the frequency of its occurrence within the text.  The most common symbols are assigned short sequences. Less common @@ -215,87 +208,54 @@

Entropy, Huffman coding, and effective data compression

Figure 1 – Huffman tree for "tree" (Image source: Huffman Coding Online, Ydelo, 2021).
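The tree in Figure 1 can be produced by the classic construction: place every symbol in a priority queue keyed on its frequency, repeatedly merge the two lowest-weight nodes, and read each symbol's bit sequence from the path that leads to its leaf. The sketch below is a compact illustration of that construction, not the Gridfour implementation; its class and method names are invented for this example.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    // A small illustrative Huffman coder: builds the code tree from symbol
    // frequencies and prints the variable-length bit sequence for each symbol.
    public class HuffmanSketch {

      static final class Node implements Comparable<Node> {
        final int weight;
        final char symbol;      // meaningful only for leaf nodes
        final Node left, right;

        Node(char symbol, int weight) {     // leaf
          this(symbol, weight, null, null);
        }
        Node(Node left, Node right) {       // internal node
          this('\0', left.weight + right.weight, left, right);
        }
        private Node(char symbol, int weight, Node left, Node right) {
          this.symbol = symbol;
          this.weight = weight;
          this.left = left;
          this.right = right;
        }
        boolean isLeaf() { return left == null; }
        public int compareTo(Node other) { return Integer.compare(weight, other.weight); }
      }

      public static Map<Character, String> buildCodes(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : text.toCharArray()) {
          counts.merge(c, 1, Integer::sum);
        }
        // Repeatedly join the two lowest-weight nodes until one tree remains.
        PriorityQueue<Node> queue = new PriorityQueue<>();
        counts.forEach((symbol, weight) -> queue.add(new Node(symbol, weight)));
        while (queue.size() > 1) {
          queue.add(new Node(queue.poll(), queue.poll()));
        }
        Map<Character, String> codes = new HashMap<>();
        assign(queue.poll(), "", codes);
        return codes;
      }

      private static void assign(Node node, String prefix, Map<Character, String> codes) {
        if (node.isLeaf()) {
          codes.put(node.symbol, prefix.isEmpty() ? "0" : prefix);
        } else {
          assign(node.left, prefix + "0", codes);
          assign(node.right, prefix + "1", codes);
        }
      }

      public static void main(String[] args) {
        // For "tree", 'e' occurs twice and therefore gets the shortest code.
        buildCodes("tree").forEach((symbol, code) ->
            System.out.println(symbol + " -> " + code));
      }
    }

For "tree", the symbol 'e' receives a one-bit code while 't' and 'r' receive two-bit codes, so the average cost is 1.5 bits per symbol, the same as the first-order entropy computed earlier.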
+ -

Limitations of Huffman revealed by entropy calculation

-

The encoded text for “tree”, 111000, is six bits long and has an average of 1.5 bits per symbol. - The computed entropy rate for the text is also 1.5 bits per symbol. For this particular text, - the Huffman code is optimum. For most texts, Huffman is optimal, but not truly optimum. - Consider the code tree that results when we append the letter ‘s’ to the end of our text to get “trees”.

+

Implementation details and overhead

+

Of course, if we wish to use Huffman coding to compress a data set, we will also want + to have some method to extract the original information later on. So, most Huffman compressors +include some method to encode the structure of the Huffman tree. This requirement adds overhead to a compressed +product. Even discounting the overhead, the Huffman code often does not reach the lower bound +established by the first-order entropy computation. This shortfall occurs because Huffman coding can +only represent probabilities that are integral powers of two and must approximate the probabilities for many symbol sets. +Even so, it usually produces good results.
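To see that constraint in action, consider a hypothetical two-symbol alphabet in which one symbol occurs 90 percent of the time. The first-order entropy is -(0.9 * log2(0.9) + 0.1 * log2(0.1)), or approximately 0.47 bits per symbol, but a Huffman code must still spend a whole bit on every symbol, for an average of 1.0 bits per symbol. The code is as good as any prefix code can be, yet it falls well short of the entropy bound.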

+ +

In practice, we accept a Huffman implementation +as efficient if its average encoded length, in bits per symbol, is close to the entropy value. One way to define efficiency is with the formula below:
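One natural way to express that ratio, offered here as a sketch of the general form, is the first-order entropy divided by the average encoded length, both in bits per symbol:

    \text{efficiency} = \frac{H}{\bar{L}} \times 100\%

An implementation whose average encoded length approaches the entropy earns an efficiency near 100 percent.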

+ + + -
- -
Figure 2 – Huffman tree for "trees" (Image source: Huffman Coding Online, Ydelo, 2021).
-
-

The encoded text, 1011100110, is now 10 bits long. The text has an entropy rate of 1.922 bits/symbol, - but its average encoding rate is 2.0 bits/symbol. Based on that, we might wonder why the “optimal” Huffman - coding is producing a encoding rate that higher than the lower bound indicated by the entropy rate. - It turns out that the Huffman code is constrained by the mechanics of a encoding - in which each unique symbol is assigned a specific sequence of bits. That constraint - leads to the situation in which the construction of the Huffman tree can only - represent probabilities that are a power of two. Referring to the frequency table in Figure 2 above, - we see that the probabilities in the ‘p’ column are not integer powers of two. - So the Huffman results are only optimal within the constraints of how they are coded.

- -

By revealing that the Huffman code does not always reach the - true lower bound for an encoding rate, the entropy measurement suggests that - there might be an opportunity for more efficient encoding algorithms. And, - indeed, more recent data compression techniques such as Arithmetic Coding and Asymmetric - Numeral Systems improve on Huffman by operating over accurate representations - of probability (at the cost of more elaborate processing and software - complexity).

- -

Overhead elements for Huffman coding and other techniques

- -

To be useful, of course, it is not enough for a - compressor to encode a data set. It also has to provide a means of extraction.  - Each text can, potentially, have a unique code alphabet.  To ensure that a - Huffman coded message can be extracted by a receiving process, the message must - include information about how the text was encoded.  Naturally, this - information adds overhead to the overall size of the encoded message. Naïve - software implementations sometimes include a “frequency table” giving the - counts for each symbol so that the Huffman tree can be reconstructed by the - receiver. Unfortunately, storing a frequency tables may require a large number - of bytes. Better implementations (such as the one found in the Gridfour - software library) may use schemes such as encoding the structure of the Huffman - tree as a preamble to a transmitted text

- -

Putting it all together

- -

We exported the text for an early version of this document - to an 8-bit text file.   Running the text through a first-order entropy +

For example, the Gridfour software project includes + an implementation of Huffman coding. To evaluate it, we exported the text for an earlier version of this document + to an 8-bit text file.  Running the text through a first-order entropy computation and a Huffman coding implementation yielded the following results:

     Huffman encoding
-        Huffman tree overhead: 887 bytes, 10.08 bits/unique symbol
+        Length of text:       11813 bytes
+        Unique symbols in text:  88
+        Entropy:                  4.79 bits/symbol
+
+        Huffman tree overhead:     887 bits, 10.08 bits per unique symbol
         Encoded size
             Including overhead: 4.86 bits/symbol
             Excluding overhead: 4.82 bits/symbol
+
         Efficiency
             Including overhead:  97.6 %
             Excluding overhead:  99.4 %
                 
-
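As a quick check of the figures above, dividing the entropy of 4.79 bits/symbol by the encoded size of 4.82 bits/symbol (excluding overhead) gives approximately 0.994, which corresponds to the reported efficiency of 99.4 percent.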

Excluding the overhead, the efficiency for the - implementation could be computed as a percentage using the following: 

- - - - -

Excluding the overhead for the Huffman tree, the implementation - had an efficiency of 99.4 percent.  Including the overhead, the implementation - had an efficiency of 97.6 percent.

-

For many data compression techniques, the efficiency calculation is useful as a diagnostic tool.  A poor efficiency rating would indicate that there might be a problem with an implementation.  Efficiency also - has the advantage of being robust with regard to the entropy in the source + has the advantage of being robust with regard to the magnitude of the entropy in the source data.  If an implementation is suitable to a data set, even a data set with a high - entropy value might still produce a good efficiency score.

+ entropy value will produce a good efficiency score.

-

Beyond entropy-based compression: the Deflate algorithm

+

Beyond entropy coding: the Deflate algorithm

Because the Huffman technique does not consider the order of symbols within the text, it cannot take advantage of sequences of symbols @@ -356,6 +316,7 @@

References

Shannon, C. (July 1948). A mathematical theory of communication (PDF). Bell System Technical Journal. 27 (3): 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x. hdl:11858/00-001M-0000-002C-4314-2. Reprint with corrections hosted by the Harvard University Mathematics Department at https://en.wikipedia.org/wiki/Harvard_University (accessed November 2024).

Soni, J., & Goodman, R. (2017). A mind at play: how Claude Shannon invented the information age (First Simon & Schuster hardcover edition.). Simon & Schuster.

Starmer, J. [Stat Quest] (2024). Entropy (for Data Science): Clearly Explained!!! [Video]. YouTube. https://www.youtube.com/watch?v=YtebGVx-Fxw

+

Wikipedia contributors. (2023, November 15). Entropy coding. In Wikipedia, The Free Encyclopedia. Retrieved 18:06, November 27, 2024, from https://en.wikipedia.org/w/index.php?title=Entropy_coding&oldid=1185288022

Wikipedia contributors. (2024a, October 3). Josiah Willard Gibbs. In Wikipedia, The Free Encyclopedia. Retrieved 17:02, October 15, 2024, from https://en.wikipedia.org/w/index.php?title=Josiah_Willard_Gibbs&oldid=1249234374

Wikipedia contributors. (2024b, October 14). Huffman coding. In Wikipedia, The Free Encyclopedia. Retrieved 18:03, October 14, 2024, from https://en.wikipedia.org/w/index.php?title=Huffman_coding&oldid=1245054267

Wikipedia contributors. (2024c, May 2). Shannon's source coding theorem. In Wikipedia, The Free Encyclopedia. Retrieved 13:01, October 3, 2024, from https://en.wikipedia.org/w/index.php?title=Shannon%27s_source_coding_theorem&oldid=1221844320

diff --git a/docs/notes/EntropyMetricForDataCompressionCaseStudies.html b/docs/notes/EntropyMetricForDataCompressionCaseStudies.html index 98475ee..9b9bbd7 100644 --- a/docs/notes/EntropyMetricForDataCompressionCaseStudies.html +++ b/docs/notes/EntropyMetricForDataCompressionCaseStudies.html @@ -122,7 +122,7 @@

Data compression

does not change the actual values of the elevations and ocean depths stored in the data product. It just changes the statistical properties of the file (not the data) and the way it is processed by conventional data compression techniques. - With the adjustment for data type, we see the following results:

+ With the adjustment for data type, we see the following specifications:

     ETOPO1_Ice_c_gmt4.grd
diff --git a/docs/notes/index.html b/docs/notes/index.html
index 7af2645..56435aa 100644
--- a/docs/notes/index.html
+++ b/docs/notes/index.html
@@ -78,7 +78,10 @@ 

Welcome to the Gridfour Project Notes page!

Gridfour Raster Data Compression Algorithms
  • - What Entropy tells us about Data Compression + What Entropy tells us about Data Compression, Part I: Concepts +
  • +
  • + What Entropy tells us about Data Compression, Part II: Case Studies
  • Lossless Compression for Floating-Point Data @@ -92,7 +95,7 @@

    Welcome to the Gridfour Project Notes page!

  • GVRS Performance
  • -
  • +
  • Managing a Virtual Raster using a Tile-Cache Algorithm