In this post, I am going to write about Shannon information, entropy, cross-entropy and the Kullback-Leibler divergence. Since the last two concepts are heavily used in machine learning, I think it is nice to have a good understanding of what these terms mean and how they relate to each other.

So, to try to explain these concepts, I would like to start with a question/exercise:

Suppose we have 4 letters A, B, C and D in some alphabet we will use to communicate with each other through some channel. I tell you that I will be sending you a message every morning. This message will have no meaning; it will just be a concatenation of 2 of these 4 letters picked randomly with equal probability.

Let's generate some messages:

import numpy as np

abecedary = ["A", "B", "C", "D"]
# Uniform distribution: every letter is equally likely
p = np.array([1 / len(abecedary)] * len(abecedary))
# Draw 10 messages of 2 letters each
np.random.choice(abecedary, size=(10, 2), p=p)
array([['D', 'D'],
       ['A', 'A'],
       ['B', 'C'],
       ['B', 'D'],
       ['D', 'A'],
       ['B', 'C'],
       ['D', 'C'],
       ['D', 'A'],
       ['D', 'C'],
       ['C', 'A']], dtype='<U1')

So for 10 consecutive days I send you one of these messages. So far we have talked about the message but not about the channel we use to transmit it. Let me tell you that this channel will be digital, meaning we will use values of 1 or 0 (bits) to encode our message.

The first question is: how many bits do I need to encode my message? Well, we know that if I have 4 letters, each equally likely, I am going to need $\small \log_2(4)=2$ bits to encode each letter. And since each letter is independent of the others, we will need $\small 2 \times 2 = 4$ bits to encode the message I'm sending you every day. The key thing to note here is that these 2 bits required for each letter are actually the information each letter in the message is carrying. Bear with me.
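
As a quick sanity check, here is a small sketch of that calculation using the numpy objects defined above:

np.log2(len(abecedary))      # 2.0 bits per letter
2 * np.log2(len(abecedary))  # 4.0 bits for a 2-letter message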

Let's look at another, more interesting example: the letters will be independent as in the previous example, but this time the probability mass function won't be uniform. They will follow the distribution $\small p=(0.7, 0.1, 0.1, 0.1)$. Let's generate 10 messages again:

# Non-uniform distribution: A is much more likely than B, C or D
p = np.array([0.7, 0.1, 0.1, 0.1])
np.random.choice(abecedary, size=(10, 2), p=p)
array([['A', 'A'],
       ['B', 'A'],
       ['A', 'A'],
       ['A', 'A'],
       ['A', 'A'],
       ['A', 'B'],
       ['A', 'A'],
       ['B', 'A'],
       ['A', 'A'],
       ['A', 'A']], dtype='<U1')

Looking at these samples, I ask you again: how should we encode our messages? One could say that we still have 4 letters, therefore we could use the same encoding as before. This way, we would again be using 2 bits for each letter. But if you think carefully, we know that 'A' is far more likely than the 3 other letters. So how about we take advantage of this information to encode our message? Maybe we can use one bit to encode whether the letter is A or not, and then use up to two more bits to distinguish the rest of the letters.

This way we would have, for example, the following codes for each letter:

    A -> 1
    B -> 01
    C -> 001
    D -> 000
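
Just for illustration, here is a minimal sketch of this variable-length code written as a Python dictionary, with a hypothetical `encode` helper that simply concatenates the codewords:

code = {"A": "1", "B": "01", "C": "001", "D": "000"}

def encode(message):
    # Concatenate the codeword of each letter in the message
    return "".join(code[letter] for letter in message)

encode(["B", "A"])  # '011'

Because no codeword is a prefix of another, the receiver can decode the resulting bit stream unambiguously.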

It is true that we now need 3 bits for some letters. But the key thing to measure is the expected number of bits used per letter. Let's compute it:

# Code lengths (in bits) for A, B, C and D under the new encoding
num_bits = np.array([1, 2, 3, 3])
# Expected number of bits per letter
np.sum(p * num_bits)
1.5

Spoiler alert: this is close to the entropy of our message, also referred to as the expected information of our message, but it is not quite that. It is actually the cross-entropy. Bear with me again.

Information

So we have managed to make our encoding more efficient by using the distribution of our random process of sending messages. Going back to the information concept, here we would have:

$$ -\log_2(0.7)=0.51 \;\text{bits for A},\\ -\log_2(0.1)=3.32 \;\text{bits for B},\\ -\log_2(0.1)=3.32 \;\text{bits for C},\\ -\log_2(0.1)=3.32 \;\text{bits for D} $$

Wait, where does this formula come from? Well, let me tell you: the way we compute information is this:

$$ \large I_X(x)=-\log_2(p(x)) $$

We could actually use another base, but let's stick to base two and continue talking about bits. Try to think of information as the surprise you feel after receiving the message. A wouldn't surprise you that much, but B, C and D would, therefore they carry more information. The other way to think about it is as the number of bits we need to represent each value optimally.

So, as we computed previously, we would need 0.51 bits to encode A and 3.32 bits to encode each of the other letters. Going back to the previous example, let us try to compute the information in a different way to make it (hopefully) more intuitive:

Let's turn our probabilities into frequency counts: our alphabet now has 7 As, 1 B, 1 C and 1 D, i.e. 10 letters in total. So how many bits do we need to encode one of these letters? It is $\small \log_2(10)=3.32$. The letters appearing only once will need this many bits to be encoded, which, as you can see, is the same result we got above. How about A? Well, we know that A is 7 times more likely than B, C or D, so the way to compute it is $\small \log_2(\frac{10}{7})=0.51$. Intuitively, out of the ten buckets, A is filling 7.

As you can see, for B, C and D we used $\small \log_2(10)$, which is actually $\small -\log_2(\frac{1}{10})$, and the same goes for A, which becomes $\small -\log_2(\frac{7}{10})$. This leads us to the formal definition presented before: $\small I_X(x)=-\log_2(p(x))$. This should resolve the first "bear with me".
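
To double-check, we can compute the information of each letter both directly from the formula and from the frequency-count view (a small sketch reusing `p` from above):

-np.log2(p)                 # array([0.5146..., 3.3219..., 3.3219..., 3.3219...])
np.log2(10/7), np.log2(10)  # roughly (0.5146, 3.3219), matching the values above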

Entropy

Let's move on to the second one. Now that we know what information means in this context and how to quantify the amount each event carries, let's calculate the expected amount of information in our message like this:

$$ \large E[I_X]=H(p)=\sum_i p_X(x_i)I_X(x_i)=-\sum_i p_X(x_i)\log_2(p_X(x_i)) $$
"The entropy of our source or the expected information in our message is {}".format(-np.sum(p*np.log2(p)))
'The entropy of our source or the expected information in our message is 1.3567796494470397'
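
Equivalently, just to connect the code to the formula above, we can compute the information of each letter first and then take its expectation (a small sketch):

I = -np.log2(p)  # information carried by each letter
np.sum(p * I)    # 1.3567..., the same entropy as above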

Previously we tried to adjust the encoding of our messages by assigning 1 bit to A, 2 bits to B, and 3 bits to C and D. That reduced the average number of bits used to send a message, but we are still using more bits than required: our message has an entropy of about 1.36 bits per letter while we use 1.5 bits per letter to encode it.

We cannot do better than this when encoding each of the four letters with a whole number of bits, but now we know that we are using a code that could carry more information than what our "source", or message, is actually carrying. This concept of expected information is known as entropy.

And this attempt of ours to find an optimal code leads us to the concept of cross-entropy.

Cross-entropy

When we tried to use our improved encoding, we were assuming a probability mass function for our source: $$ P(A)=2^{-1}\\ P(B)=2^{-2}\\ P(C)=2^{-3}\\ P(D)=2^{-3} $$

"This the probability distribution \
we were assuming by using our improved encoding: {}".format(2**(-num_bits).astype(float))
'This the probability distribution we were assuming by using our improved encoding: [0.5   0.25  0.125 0.125]'

So when, instead of plugging the source distribution into the information term of the entropy formula, we use another distribution, we get what we call the cross-entropy. It gives us the expected length in bits per letter of our source messages under the chosen encoding. The idea is that we know the distribution of our alphabet and the encoding (or pmf) we are using to transmit it. The formula for the cross-entropy $\small H(p,q)$ is therefore the same as the one for the entropy, only that we now use two different probability distributions:

$$ \large H(p,q)=-\sum_i p(x_i)\log_2(q(x_i)) $$
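
We can check this with a quick sketch: plugging in the distribution q implied by our improved code recovers exactly the 1.5 bits per letter we computed earlier.

q = 2.0 ** -num_bits     # the pmf implied by our code: [0.5, 0.25, 0.125, 0.125]
-np.sum(p * np.log2(q))  # 1.5, the cross-entropy H(p, q)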

Kullback-Leibler divergence

So we now know the entropy, or expected information per letter of our message, and the cross-entropy, or expected encoded length per letter of our message. The difference between the optimal encoding (the entropy) and the expected length of our encoding (the cross-entropy) is what is called the Kullback-Leibler divergence: $$ D_{KL}(p||q)=H(p,q)-H(p)=-\sum_i p(x_i)\log_2(q(x_i))+\sum_i p(x_i)\log_2(p(x_i))=\sum_i p(x_i)\log_2\left(\frac{p(x_i)}{q(x_i)}\right) $$
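
A quick numerical check of this identity, reusing `p` and the `q` defined in the previous sketch:

H_pq = -np.sum(p * np.log2(q))          # cross-entropy: 1.5
H_p = -np.sum(p * np.log2(p))           # entropy: about 1.3568
H_pq - H_p, np.sum(p * np.log2(p / q))  # both give about 0.143 bits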

  • properties and uses of the kl divergence (WIP)
  • cross-entropy in ML -> maximum likelihood estimation (WIP)