==Introduction==
The core idea of information theory is that the "informational value" of a communicated message depends on the degree to which the content of the message is surprising. If a highly likely event occurs, the message carries very little information. On the other hand, if a highly unlikely event occurs, the message is much more informative. For instance, the knowledge that some particular number ''will not'' be the winning number of a lottery provides very little information, because any particular chosen number will almost certainly not win. However, knowledge that a particular number ''will'' win a lottery has high informational value because it communicates the occurrence of a very low probability event.

The ''[[information content]],'' also called the ''surprisal'' or ''self-information,'' of an event <math>E</math> is a function that increases as the probability <math>p(E)</math> of an event decreases. When <math>p(E)</math> is close to 1, the surprisal of the event is low, but if <math>p(E)</math> is close to 0, the surprisal of the event is high. This relationship is described by the function
<math display="block">\log\left(\frac{1}{p(E)}\right) ,</math>
where <math>\log</math> is the [[logarithm]], which gives 0 surprise when the probability of the event is 1.<ref>{{cite web |url = https://www.youtube.com/watch?v=YtebGVx-Fxw |title = Entropy (for data science) Clearly Explained!!! |date = 24 August 2021 |via = [[YouTube]] |access-date = 5 October 2021 |archive-date = 5 October 2021 |archive-url = https://web.archive.org/web/20211005135139/https://www.youtube.com/watch?v=YtebGVx-Fxw |url-status = live }}</ref> In fact, {{math|log}} is the only function that satisfies a specific set of conditions defined in section ''{{slink|#Characterization}}''. Hence, we can define the information, or surprisal, of an event <math>E</math> by
<math display="block">I(E) = -\log(p(E)) ,</math>
or equivalently,
<math display="block">I(E) = \log\left(\frac{1}{p(E)}\right) .</math>

Entropy measures the expected (i.e., average) amount of information conveyed by identifying the outcome of a random trial.<ref name="mackay2003">{{cite book|last=MacKay|first=David J.C.|author-link=David J. C. MacKay|url=http://www.inference.phy.cam.ac.uk/mackay/itila/book.html|title=Information Theory, Inference, and Learning Algorithms|publisher=Cambridge University Press|year=2003|isbn=0-521-64298-1|access-date=9 June 2014|archive-date=17 February 2016|archive-url=https://web.archive.org/web/20160217105359/http://www.inference.phy.cam.ac.uk/mackay/itila/book.html|url-status=live}}</ref>{{rp|p=67}} This implies that rolling a die has higher entropy than tossing a coin because each outcome of a die roll has smaller probability (<math>p=1/6</math>) than each outcome of a coin toss (<math>p=1/2</math>).

Consider a coin with probability {{math|''p''}} of landing on heads and probability {{math|1 − ''p''}} of landing on tails. The maximum surprise is when {{math|1=''p'' = 1/2}}, for which neither outcome is expected over the other. In this case a coin flip has an entropy of one [[bit]] (similarly, one [[Ternary numeral system|trit]] with equiprobable values contains <math>\log_2 3</math> (about 1.58496) bits of information because it can have one of three values). The minimum surprise is when {{math|1=''p'' = 0}} (impossibility) or {{math|1=''p'' = 1}} (certainty) and the entropy is zero bits.
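These quantities are straightforward to compute numerically. The following is a minimal Python sketch (illustrative only, not drawn from the cited sources; the function names are chosen here) that evaluates the surprisal defined above and the entropy, in bits, of a fair coin, a fair die, and a trit with equiprobable values:

<syntaxhighlight lang="python">
# Illustrative sketch (not from the cited sources): surprisal and entropy
# computed with base-2 logarithms, so the results are in bits.
import math

def surprisal(p: float) -> float:
    """Information content I(E) = log2(1/p(E)) of an event with probability p."""
    return math.log2(1 / p)

def entropy(probs) -> float:
    """Expected (probability-weighted average) surprisal over the outcomes."""
    return sum(p * surprisal(p) for p in probs if p > 0)

print(surprisal(1.0))          # 0.0    -- a certain event carries no surprise
print(entropy([0.5, 0.5]))     # 1.0    -- a fair coin flip: one bit
print(entropy([1/6] * 6))      # ~2.585 -- a fair die roll: higher entropy
print(entropy([1/3] * 3))      # ~1.585 -- a trit: log2(3) bits
print(entropy([0.0, 1.0]))     # 0.0    -- certainty: zero bits
</syntaxhighlight>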
When the entropy is zero (sometimes referred to as unity<ref group=Note name=Note02/>), there is no uncertainty at all – no freedom of choice – no [[Information content|information]].<ref>{{Cite book |last=Shannon |first=Claude Elwood |title=The mathematical theory of communication |last2=Weaver |first2=Warren |date=1998 |publisher=Univ. of Illinois Press |isbn=978-0-252-72548-7 |location=Urbana |pages=15 |language=English}}</ref> Other values of ''p'' give entropies between zero and one bits.

=== Example ===
Information theory is useful for calculating the smallest amount of information required to convey a message, as in [[data compression]]. For example, consider the transmission of sequences comprising the 4 characters 'A', 'B', 'C', and 'D' over a binary channel. If all 4 letters are equally likely (25%), one cannot do better than using two bits to encode each letter: 'A' might code as '00', 'B' as '01', 'C' as '10', and 'D' as '11'. However, if the probabilities of each letter are unequal, say 'A' occurs with 70% probability, 'B' with 26%, and 'C' and 'D' with 2% each, one could assign variable-length codes. In this case, 'A' would be coded as '0', 'B' as '10', 'C' as '110', and 'D' as '111'. With this representation, 70% of the time only one bit needs to be sent, 26% of the time two bits, and only 4% of the time three bits. On average, fewer than 2 bits are required since the entropy is lower (owing to the high prevalence of 'A' followed by 'B' – together 96% of characters). The entropy, computed as the sum of probability-weighted log probabilities, measures and captures this effect.

English text, treated as a string of characters, has fairly low entropy; i.e., it is fairly predictable. We can be fairly certain that, for example, 'e' will be far more common than 'z', that the combination 'qu' will be much more common than any other combination with a 'q' in it, and that the combination 'th' will be more common than 'z', 'q', or 'qu'. After the first few letters one can often guess the rest of the word. English text has between 0.6 and 1.3 bits of entropy per character of the message.<ref name="Schneier, B page 234">Schneier, B: ''Applied Cryptography'', Second edition, John Wiley and Sons.</ref>{{rp|p=234}}
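Returning to the four-character example above, its figures can be checked directly. The following Python sketch (illustrative only, not drawn from the cited sources) computes the entropy of the skewed distribution and the average length of the variable-length code given above:

<syntaxhighlight lang="python">
# Illustrative sketch: checking the four-character compression example.
import math

probs = {'A': 0.70, 'B': 0.26, 'C': 0.02, 'D': 0.02}   # letter probabilities
code  = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}  # variable-length code

entropy = -sum(p * math.log2(p) for p in probs.values())   # ~1.091 bits/character
avg_len = sum(probs[c] * len(code[c]) for c in probs)      # 1.34 bits/character

print(f"entropy of the distribution: {entropy:.3f} bits per character")
print(f"average code length:         {avg_len:.2f} bits per character")
</syntaxhighlight>

Both figures are below the 2 bits per character of the fixed-length code, and the average code length of 1.34 bits remains slightly above the entropy of about 1.09 bits, in line with entropy being the smallest average number of bits per character that any such code can achieve.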