Friday 15 March 2013

Probability Preferences: Equivalence Classes are primary

Another dimension of relevance, when trying to judge what you might fairly expect of the achievements of the founders of probability theory, is the idea of an equivalence class.  At its most fundamental, a randomisation device is a piece of technology which has $n \geq 1$ states, possibly infinitely many in the continuous case.  After a run it is in exactly one of those states, and the likelihood of it being in the $i$th state is $p_i$, its probability.  As mentioned in yesterday's post, there's no requirement that these probabilities follow any pattern whatsoever other than the primary one, namely that they sum to 1.

Start by imagining a traditional die, which has six distinctly marked faces, each with a different number of pips.  What matters is that these faces are mutually distinguishable, not that the distinction is achieved with pips indicating the numbers 1 to 6.  It could just as easily have been six colours or six pictures.  We can refer to this randomisation machine's six states as its elementary states, its elementary outcomes.  It will have some particular 6-state discrete probability distribution, that is to say, some set of six probability numbers in the range 0 to 1 with the single additional constraint that $\sum_{i=1}^6 p_i = 1$.
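
As a concrete sketch of that definition - in Python, with a made-up, deliberately non-uniform distribution - a die of this kind is nothing more than six labels, six probabilities and the sum-to-1 constraint:

```python
# A die as a randomisation machine: six distinguishable elementary
# states, each with a built-in probability. The numbers here are
# made up; the only constraint is that they sum to 1.
import random

faces = [1, 2, 3, 4, 5, 6]                # any six distinct labels would do
probs = [0.3, 0.25, 0.2, 0.1, 0.1, 0.05]  # need not be uniform

assert abs(sum(probs) - 1.0) < 1e-12      # the single primary constraint

def roll():
    """Produce one elementary outcome with the built-in probabilities."""
    return random.choices(faces, weights=probs, k=1)[0]
```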

Now imagine a different die, one which has three faces painted a common colour, and the other three faces a second colour.  This randomisation machine operates like a coin - it will have some particular 2-state discrete probability distribution.

Now imagine all the possible ways of marking the faces of dice with up to six different colours.  That's a lot of dice, each with its own number of distinguishable states and its own discrete probability distribution.

Finally, imagine a die with no face markings at all.  This represents the minimal 1-state machine, which doesn't technically get to be called a randomisation machine, since there's no uncertainty in rolling it: every roll produces the same indistinguishable outcome.  But for completeness you can see how it fits in.

Without knowing anything about the particular probability distribution of a die, you can see that the first die I mentioned, the one with six distinct faces, somehow provides the most randomness.  That is, if you first build the die (and thereby fix its probability distribution), when you come to decide how to label its faces, there's something natural-feeling about having all the available slots differently labelled.  It is more efficient, less wasteful of capacity, more generative of randomness.  In terms of information theory, it is the maximum entropy choice, given any particular probability distribution: all other things being equal, it uses every available slot on the randomisation machine.  Likewise the faceless die is the minimum entropy configuration.  And in between, each and every labelling can be ranked by entropy.  The entropy you can extract from the die is therefore a function of how many states its labelling lets you distinguish.
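
To make that ranking concrete, here's a small Python sketch (with a made-up fixed distribution) which holds the die's built-in probabilities constant and computes the entropy induced by three different labellings:

```python
# How relabelling the faces changes the entropy you can extract from
# one fixed die. The built-in distribution is held constant; only the
# grouping of faces into distinguishable labels varies.
from math import log2

def entropy(ps):
    # terms with p == 0 contribute nothing; p == 1 gives log2(1) == 0
    return sum(-p * log2(p) for p in ps if 0 < p < 1)

probs = [0.3, 0.25, 0.2, 0.1, 0.1, 0.05]   # fixed at build time (made up)

def induced(partition):
    """Roll up the built-in probabilities within each group of faces."""
    return [sum(probs[i] for i in group) for group in partition]

all_distinct = [[0], [1], [2], [3], [4], [5]]   # six different labels
coin_like    = [[0, 1, 2], [3, 4, 5]]           # two labels, three faces each
faceless     = [[0, 1, 2, 3, 4, 5]]             # one label: no uncertainty

for partition in (all_distinct, coin_like, faceless):
    print(induced(partition), round(entropy(induced(partition)), 3))
# ~2.366, ~0.811 and 0 bits: the finer the labelling, the more entropy.
```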

Next let's turn our attention to the probability distribution we can construct for a die.  To do that, let's hold the face-painting choice constant and go for the option of six distinguishable faces.  There are in theory infinitely many sets of six real numbers which fall between 0 and 1 and sum to 1.  For this six-distinct-faced die, we can rank each and every one of them by entropy.  We'll discover that the maximum entropy probability model is the one which assigns equal probabilities to all faces, namely a probability of $\frac{1}{6}$ each.  And, for a 6-faced die, the minimum entropy die would be one where five of the faces were impossible to get, but one particular face completely certain.  How to build such a die in practice is a different matter, but it isn't relevant here.
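
The same entropy calculation ranks distributions once the labelling is fixed.  A sketch (re-defining the helper so it stands alone; the in-between distribution is made up):

```python
# Holding the labelling fixed at six distinct faces, rank a few
# candidate probability distributions by entropy.
from math import log2

def entropy(ps):
    return sum(-p * log2(p) for p in ps if 0 < p < 1)

uniform    = [1/6] * 6                         # the maximum entropy choice
skewed     = [0.3, 0.25, 0.2, 0.1, 0.1, 0.05]  # an arbitrary in-between case
degenerate = [0, 0, 0, 0, 0, 1]                # one face certain, five impossible

for ps in (uniform, skewed, degenerate):
    print(round(entropy(ps), 3))
# ~2.585 (= log2 6), ~2.366, and 0 bits respectively.
```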

Now realise that you can run the same analysis not just for the 'maximum entropy' six-face-distinguished die, but for every labelling down to the faceless die.  And there's a kind of global maximum and minimum entropy pair of dice in this universe of all possible dice, namely the equi-probable six-label die and the totally faceless die.  Every labelling-and-distribution pairing in between can be ranked by entropy.  When you tell a randomisation machine to produce an observable state for you (that is, when you roll the die), you get the most information out of it when you're rolling the maximum entropy die.

This is a nice way of characterising a randomisation machine: knowing the maximum number of distinct states it can be in.  That seems a kind of permanent, almost physical aspect of the machine.  Likewise, the probability distribution seems somehow 'built in' to the physical machine.  Of course, the machine doesn't need to be physical at all.  Still, the machine gets built, and it is natural to imagine a particular probability distribution burned in to the 'maximum entropy' face-painted die.  This is where we started.  Now, imagine we took that particular die - in fact, let's just for the sake of argument give it, off the top of my head, the distribution $\frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{12}, \frac{1}{4}$.  I could have picked equi-probable but decided not to.  And I paint the colours red, orange, yellow, green, blue, indigo on it, some colours of the rainbow.

It is done, the die is built.  But wait.  I decide on the following rule.  I want to make this particular die behave like a coin flip.  Rather than re-paint, I just decide in my head to count red or orange or yellow as one state, and the other three colours as another.  This is an equivalence class.  It is a transformation or grouping or partition of the elementary outcomes of a randomisation machine.  Likewise I can, just by deciding to do so, interpret my rainbow die castings to reproduce any particular painting.  I just need to remember the equivalence rule in my head.
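
As a sketch of that 'rule in my head', in Python (the names are made up): an equivalence class is just a mapping from elementary outcomes to the coarser outcomes I've decided to see:

```python
# An equivalence class as "software" running on the same physical die:
# a plain mapping from elementary outcomes to grouped outcomes.
coin_view = {
    "red": "heads", "orange": "heads", "yellow": "heads",
    "green": "tails", "blue": "tails", "indigo": "tails",
}

def observe(elementary_outcome, view):
    """Reinterpret one roll of the rainbow die through an equivalence class."""
    return view[elementary_outcome]

print(observe("orange", coin_view))  # -> "heads"
```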

So the rainbow die, with its specific probability distribution, is set in the middle of a room filled with people, each with their own distinct equivalence class in mind.  Each roll of the die is seen by each person as a different outcome of his own randomisation machine.  By applying an equivalence class, you've got the randomisation machine to perform differently for you.  This is kind of like software, the equivalence class being the program.  With each equivalence class, there's a way of rolling up the built-in probabilities to produce a transformation to a new probability distribution for that equivalence class.  Recall that the blue face carries probability $\frac{1}{12}$, the indigo face $\frac{1}{4}$, and the other four colours $\frac{1}{6}$ each.  Under the red-or-orange-or-yellow versus green-or-blue-or-indigo equivalence class, each side rolls up to $\frac{1}{2}$, so I have simulated a fair coin flip even though the 'elementary' outcomes were not equi-probable.
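
Here is the roll-up spelled out in Python, using exact fractions so the halves come out exactly (the die is the rainbow die just described):

```python
# Rolling up the rainbow die's built-in probabilities through the
# coin-flip equivalence class.
from fractions import Fraction as F

probs = {
    "red": F(1, 6), "orange": F(1, 6), "yellow": F(1, 6),
    "green": F(1, 6), "blue": F(1, 12), "indigo": F(1, 4),
}
heads = ["red", "orange", "yellow"]
tails = ["green", "blue", "indigo"]

print(sum(probs[c] for c in heads))  # 1/2
print(sum(probs[c] for c in tails))  # 1/2 - a fair coin from a biased die
```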

So in a sense a so-called n-state randomisation machine describes the maximum entropy, all-states-distinguished equivalence class.  Even though I've been calling this natural, efficient, etc., in theory there's nothing special about the maximum entropy choice except that it ranks top of the list for information content.  It is as if each of the observers of the machine, through the glasses of his own equivalence class, sees a different reality, but none of them can take the glasses off.  If you do privilege the maximum entropy equivalence class, then call all of its states elementary outcomes, elementary events or the sample space.  If that's what you're going to do, then all the other equivalence classes represent composite events, or simply events, and you can work out the probability of these events by rolling up their constituent probabilities.

Executing or running a randomisation machine can then be said to reduce uncertainty across a disjoint set of n possible outcomes.  That is, a randomisation machine is a chooser.  It picks one of n.  It is an OR-eraser.  The concept of OR is primitive; it exists at the elementary outcome level.  It is a particularly tight kind of OR - one which is exclusive and all-encompassing.  In other words, the OR-eraser which is the random event picks exactly one of the n elementary outcomes.  The act of causing a randomisation machine to run is an act of OR-erasure.  At the level of a single randomisation machine, there's no concept of AND.  A single choice, 1-of-n, is made.  Only at the equivalence class level can the construction of an equivalence class involve OR-construction (disjunction) and AND-construction (conjunction).
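
A sketch of that last point, treating composite events as plain sets of the rainbow die's elementary outcomes, so that OR-construction is set union and AND-construction is set intersection (the event names are made up):

```python
# Composite events are sets of elementary outcomes; their probabilities
# roll up by summation over the privileged (maximum entropy) states.
from fractions import Fraction as F

probs = {
    "red": F(1, 6), "orange": F(1, 6), "yellow": F(1, 6),
    "green": F(1, 6), "blue": F(1, 12), "indigo": F(1, 4),
}
warm = {"red", "orange", "yellow"}   # one composite event
ends = {"red", "indigo"}             # another

def p(event):
    return sum(probs[c] for c in event)

print(p(warm | ends))  # OR  (union):        1/2 + 1/4 = 3/4
print(p(warm & ends))  # AND (intersection): just "red" = 1/6
```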

As I mentioned last night, the best a single die can hope to achieve is an uncertainty reduction of about 2.58 bits.  That's its maximum entropy.  The formula is $-\sum_{i=1}^6 p_i \log_2 p_i$.  This quantity is purely a function of the probability distribution, as you can see, but you should remember I chose colours as elementary outcomes partly because there's no natural mapping onto a number.  In this sense information is more fundamental than expectation, which I'll say more about in another posting.
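
To spell out the arithmetic for the equi-probable die:

$$-\sum_{i=1}^{6} \frac{1}{6} \log_2 \frac{1}{6} = \log_2 6 \approx 2.585 \text{ bits}$$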

My thought experiment of multiple people looking at the result of a randomisation machine's single run and seeing different (non-elementary) outcomes is clearer in the act of picking a random playing card.  Participant 1 sees a Queen of Hearts, another sees a Queen, another sees a Heart, another sees a red card, another sees a Face card, etc.  And those are only the 'semantically coherent' equivalence classes - there are in fact a whole bunch more.
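
To close with a sketch of that card-drawing room, in Python (the views are made up, and are only a few of the many possible equivalence classes):

```python
# Several observers applying their own equivalence classes to one
# drawn card: a single elementary outcome, many observed outcomes.
card = ("Q", "hearts")

views = {
    "exact card": lambda c: c,
    "rank only":  lambda c: c[0],
    "suit only":  lambda c: c[1],
    "colour":     lambda c: "red" if c[1] in ("hearts", "diamonds") else "black",
    "face card?": lambda c: c[0] in ("J", "Q", "K"),
}

for name, view in views.items():
    print(name, "->", view(card))
```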