diff --git a/docs/notes/EntropyMetricForDataCompression.html b/docs/notes/EntropyMetricForDataCompression.html
index 09c76a5..8325b3c 100644
--- a/docs/notes/EntropyMetricForDataCompression.html
+++ b/docs/notes/EntropyMetricForDataCompression.html
@@ -83,7 +83,7 @@
Note: For an engaging and accessible explanation of
- the entropy concept, see Josh Starmer’s video
+ the entropy concept, see Josh Starmer’s video
Entropy (for Data Science) Clearly Explained!!! on YouTube (Starmer, 2024).
I think you will find that the video lives up to its title.
To illustrate the entropy concept, let’s consider a couple - of special cases.
+To illustrate the entropy concept, let’s consider a couple of applications.
Natural language text: The probability for each symbol in the alphabet is computed based on the number of times it occurs in
@@ -127,23 +126,13 @@
Text consisting of a single value: When the text is - just a single symbol repeated multiple times, the value of each symbol is completely - predictable. The probability of that value is always 1. So the log of each - probability 1.0 is zero. Thus the entropy would be zero. As noted above, a - data set with low entropy compresses readily (or, in this case, trivially).
- -All symbols have uniform probability: If all symbols - in a text are equiprobable, then they have the probability of 1/n, - where n is the number of unique symbols in the text. Plugging that probability into - the equation above, we see that the entropy of the data set is simply log2(n). This - situation would occur if:
-Numeric data: The most common way to compute entropy for numeric data is to treat individual bytes as symbols. + Since a byte can take on 256 distinct values (typically, 0 to 255), this approach + limits the maximum number of unique symbols in a data stream and simplifies + the process of counting symbols and computing probabilities. Shannon's + definition of entropy is broad enough to accommodate symbol sets consisting of other + data types, such as integers or even floating-point values. Examples of this approach + are given in Part II of this article.
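To make the byte-oriented approach concrete, the short Java sketch below estimates first-order entropy by treating each of the 256 possible byte values as a symbol. It is a generic illustration written for this note; the class and method names are not part of the Gridfour API.

public class FirstOrderEntropy {

  /** Returns the first-order entropy of the data, in bits per symbol. */
  public static double entropyBitsPerSymbol(byte[] data) {
    if (data == null || data.length == 0) {
      return 0;
    }
    long[] counts = new long[256];
    for (byte b : data) {
      counts[b & 0xFF]++;                          // map signed byte to 0..255
    }
    double n = data.length;
    double entropy = 0;
    for (long count : counts) {
      if (count > 0) {
        double p = count / n;                      // probability of this symbol
        entropy -= p * Math.log(p) / Math.log(2.0); // accumulate -p * log2(p)
      }
    }
    return entropy;
  }

  public static void main(String[] args) {
    // The four-character text "tree" has an entropy of 1.5 bits per symbol.
    System.out.printf("Entropy: %4.2f bits/symbol%n",
        entropyBitsPerSymbol("tree".getBytes()));
  }
}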
The term entropy coding refers to any lossless data compression method + that "attempts to approach the lower bound declared by Shannon's source coding theorem" (Wikipedia, 2023). + Entropy coding techniques include Huffman coding, arithmetic coding, + and asymmetric numeral systems. The simplest and most widely used of these is Huffman coding.
-Huffman coding (Huffman, 1952) is a data compression technique that uses +
Huffman coding (Huffman, 1952) uses variable-length bit sequences to encode individual entries in a symbol set (alphabet, etc.). The encoding for each symbol is based on the frequency of its occurrence within the text. The most common symbols are assigned short sequences. Less common
@@ -215,87 +208,54 @@
The encoded text for “tree”, 111000, is six bits long and has an average of 1.5 bits per symbol. - The computed entropy rate for the text is also 1.5 bits per symbol. For this particular text, - the Huffman code is optimum. For most texts, Huffman is optimal, but not truly optimum. - Consider the code tree that results when we append the letter ‘s’ to the end of our text to get “trees”.
+Of course, if we wish to use Huffman coding to compress a data set, we will also need + a way to extract the original information later on. So, most Huffman compressors +include some method to encode the structure of the Huffman tree. This requirement adds overhead to a compressed +product. Even discounting the overhead, the Huffman code often does not reach the lower bound +established by the first-order entropy computation. This shortfall occurs because Huffman coding can +only represent symbol probabilities as integral powers of two (1/2, 1/4, 1/8, and so on) and must approximate the probabilities for many symbol sets. +Even so, it usually produces good results.
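As an illustration of the mechanics described above, the sketch below builds a Huffman tree using the textbook algorithm: place every symbol in a priority queue ordered by its count, then repeatedly merge the two least-frequent nodes until a single tree remains. This is a generic example written for this note, not the Gridfour implementation, and it omits the tree-serialization step that a real compressor needs.

import java.util.PriorityQueue;

public class HuffmanSketch {

  // A tree node; leaves carry a symbol, interior nodes carry only a combined count.
  static class Node implements Comparable<Node> {
    final int symbol;
    final long count;
    final Node left;
    final Node right;

    Node(int symbol, long count, Node left, Node right) {
      this.symbol = symbol;
      this.count = count;
      this.left = left;
      this.right = right;
    }

    boolean isLeaf() {
      return left == null && right == null;
    }

    @Override
    public int compareTo(Node other) {
      return Long.compare(count, other.count);
    }
  }

  // Builds the Huffman tree from symbol counts and prints the code for each symbol.
  static void buildAndPrint(long[] counts) {
    PriorityQueue<Node> queue = new PriorityQueue<>();
    for (int i = 0; i < counts.length; i++) {
      if (counts[i] > 0) {
        queue.add(new Node(i, counts[i], null, null));
      }
    }
    while (queue.size() > 1) {
      Node a = queue.poll();                            // two least-frequent nodes
      Node b = queue.poll();
      queue.add(new Node(-1, a.count + b.count, a, b)); // merged under a new parent
    }
    printCodes(queue.poll(), "");
  }

  static void printCodes(Node node, String code) {
    if (node.isLeaf()) {
      System.out.println((char) node.symbol + " -> " + (code.isEmpty() ? "0" : code));
    } else {
      printCodes(node.left, code + "0");
      printCodes(node.right, code + "1");
    }
  }

  public static void main(String[] args) {
    long[] counts = new long[256];
    for (byte b : "tree".getBytes()) {
      counts[b & 0xFF]++;
    }
    buildAndPrint(counts);  // 'e' receives a 1-bit code; 't' and 'r' receive 2-bit codes
  }
}

Because ties between equal counts can be broken in different ways, two implementations can produce different (but equally compact) codes for the same text, which is another reason the tree, or an equivalent set of code lengths, must accompany the encoded data.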
+ -In practice, we accept a Huffman implementation +as efficient if its average encoding rate comes close to the entropy value. One way to define efficiency is with the formula below:
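Assuming the conventional definition of coding efficiency as the ratio of the first-order entropy to the achieved encoding rate, expressed as a percentage, the formula takes approximately this form:

\[ \text{efficiency (\%)} = 100 \times \frac{H}{\bar{b}} \]

where H is the first-order entropy in bits per symbol and \bar{b} is the average number of bits per encoded symbol, computed with or without the Huffman tree overhead depending on which figure is being reported.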
+ + + - -The encoded text, 1011100110, is now 10 bits long. The text has an entropy rate of 1.922 bits/symbol, - but its average encoding rate is 2.0 bits/symbol. Based on that, we might wonder why the “optimal” Huffman - coding is producing a encoding rate that higher than the lower bound indicated by the entropy rate. - It turns out that the Huffman code is constrained by the mechanics of a encoding - in which each unique symbol is assigned a specific sequence of bits. That constraint - leads to the situation in which the construction of the Huffman tree can only - represent probabilities that are a power of two. Referring to the frequency table in Figure 2 above, - we see that the probabilities in the ‘p’ column are not integer powers of two. - So the Huffman results are only optimal within the constraints of how they are coded.
- -By revealing that the Huffman code does not always reach the - true lower bound for an encoding rate, the entropy measurement suggests that - there might be an opportunity for more efficient encoding algorithms. And, - indeed, more recent data compression techniques such as Arithmetic Coding and Asymmetric - Numeral Systems improve on Huffman by operating over accurate representations - of probability (at the cost of more elaborate processing and software - complexity).
- -To be useful, of course, it is not enough for a - compressor to encode a data set. It also has to provide a means of extraction. - Each text can, potentially, have a unique code alphabet. To ensure that a - Huffman coded message can be extracted by a receiving process, the message must - include information about how the text was encoded. Naturally, this - information adds overhead to the overall size of the encoded message. Naïve - software implementations sometimes include a “frequency table” giving the - counts for each symbol so that the Huffman tree can be reconstructed by the - receiver. Unfortunately, storing a frequency tables may require a large number - of bytes. Better implementations (such as the one found in the Gridfour - software library) may use schemes such as encoding the structure of the Huffman - tree as a preamble to a transmitted text
- -We exported the text for an early version of this document - to an 8-bit text file. Running the text through a first-order entropy +
For example, the Gridfour software project includes + an implementation of Huffman coding. To evaluate it, we exported the text for an earlier version of this document + to an 8-bit text file. Running the text through a first-order entropy computation and a Huffman coding implementation yielded the following results:
Huffman encoding
-  Huffman tree overhead: 887 bytes, 10.08 bits/unique symbol
+  Length of text:         11813 bytes
+  Unique symbols in text: 88
+  Entropy:                4.79 bits/symbol
+
+  Huffman tree overhead:  887 bytes, 10.08 bits per unique symbol
   Encoded size
     Including overhead:   4.86 bits/symbol
     Excluding overhead:   4.82 bits/symbol
+  Efficiency
+    Including overhead:   97.6 %
+    Excluding overhead:   99.4 %
-
Excluding the overhead, the efficiency for the - implementation could be computed as a percentage using the following:
- - - - -Excluding the overhead for the Huffman tree, the implementation - had an efficiency of 99.4 percent. Including the overhead, the implementation - had an efficiency of 97.6 percent.
-For many data compression techniques, the efficiency calculation is useful as a diagnostic tool. A poor efficiency rating would indicate that there might be a problem with an implementation. Efficiency also - has the advantage of being robust with regard to the entropy in the source + has the advantage of being robust with regard to the magnitude of the entropy in the source data. If an implementation is suitable to a data set, even a data set with a high - entropy value might still produce a good efficiency score.
+ entropy value will produce a good efficiency score. -Because the Huffman technique does not consider the order of symbols within the text, it cannot take advantage of sequences of symbols @@ -356,6 +316,7 @@
Shannon, C. (July 1948). A mathematical theory of communication (PDF). Bell System Technical Journal. 27 (3): 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x. hdl:11858/00-001M-0000-002C-4314-2. Reprint with corrections hosted by the Harvard University Mathematics Department (accessed November 2024).
Soni, J., & Goodman, R. (2017). A mind at play: how Claude Shannon invented the information age (First Simon & Schuster hardcover edition.). Simon & Schuster.
Starmer, J. [Stat Quest] (2024). Entropy (for Data Science): Clearly Explained!!! [Video]. YouTube. https://www.youtube.com/watch?v=YtebGVx-Fxw
+Wikipedia contributors. (2023, November 15). Entropy coding. In Wikipedia, The Free Encyclopedia. Retrieved 18:06, November 27, 2024, from https://en.wikipedia.org/w/index.php?title=Entropy_coding&oldid=1185288022
Wikipedia contributors. (2024a, October 3). Josiah Willard Gibbs. In Wikipedia, The Free Encyclopedia. Retrieved 17:02, October 15, 2024, from https://en.wikipedia.org/w/index.php?title=Josiah_Willard_Gibbs&oldid=1249234374
Wikipedia contributors. (2024b, October 14). Huffman coding. In Wikipedia, The Free Encyclopedia. Retrieved 18:03, October 14, 2024, from https://en.wikipedia.org/w/index.php?title=Huffman_coding&oldid=1245054267
Wikipedia contributors. (2024c, May 2). Shannon's source coding theorem. In Wikipedia, The Free Encyclopedia. Retrieved 13:01, October 3, 2024, from https://en.wikipedia.org/w/index.php?title=Shannon%27s_source_coding_theorem&oldid=1221844320
diff --git a/docs/notes/EntropyMetricForDataCompressionCaseStudies.html b/docs/notes/EntropyMetricForDataCompressionCaseStudies.html
index 98475ee..9b9bbd7 100644
--- a/docs/notes/EntropyMetricForDataCompressionCaseStudies.html
+++ b/docs/notes/EntropyMetricForDataCompressionCaseStudies.html
@@ -122,7 +122,7 @@ ETOPO1_Ice_c_gmt4.grd
diff --git a/docs/notes/index.html b/docs/notes/index.html
index 7af2645..56435aa 100644
--- a/docs/notes/index.html
+++ b/docs/notes/index.html
@@ -78,7 +78,10 @@ Welcome to the Gridfour Project Notes page!
Gridfour Raster Data Compression Algorithms