The Persistent Struggle: March 2013

Saturday, 30 March 2013

The absent-minded psychic

Neo-classical economics makes dramatically different assumptions about how the head of an economic agent works when looking into the future than when looking into the past. This looks like mathematical expediency to me. Looking backward, the efficient market hypothesis says that there's no point looking at any price beyond the most recent one, since that market price fully reflects all of the information available to rational agents. This fits nicely with Markov model mathematics, memoryless sources of randomness. On the other hand, when looking into the future, the rational expectations theory suggests that the agent has perfect (in the sense of being as good as it could ever be) foresight of the downstream consequences of hypothesised economic choices. This fits nicely with optimisation mathematics, calculus, iterated game theory. Insofar as, for these agents, everything is a market within which they make long term equilibrium-oriented decisions, they can operate with no memory and ideal foresight.

Monday, 25 March 2013

The patron saint of quants

Pascal was a competent but by no means brilliant mathematician. He designed and set into production a mechanical computer. He leveraged off the back of superior mathematical talent. He is said to have invented decision theory with his wager, giving birth to a whole industry, via Daniel Bernoulli, Bentham, Samuelson down to Black, which inappropriately attempted to extend the concept of fair value calculation way beyond any practicable remit. Pascal briefly hung around with wealthy gamblers who used his intellectual horsepower to help enrich themselves. He quit his game while still relatively young with more than a few regrets. Pascal was the patron saint of quants.

Probability preferences : the source of randomness is not the game

Fermat's version of the solution to the problem of points was to create a grid of possibilities reaching fully out to the point beyond which no doubt could exist as to who the winner would be. This grid of possibilities included parts of the tree which, on one view, would be utterly irrelevant to the game in hand, and on another view, incorrectly modelled the set of possibilities embedded in the game.

Pascal's solution, by way of contrast, was a ragged tree of possibilities stretching out along each branch only as far as was needed to resolve the state of the game in question, and no further.

Pascal additionally made the mistake, in interpreting Fermat's solution, of ignoring order when tossing three dice/coins and in this mis-interpretation came up with an answer in the case of three players which diverged from his own reverse recursive solution based on the principle of fair treatment at each node of his ragged tree.

Because Pascal's wrong-headed idea of Fermat's solution did not match his own, he jumped to the conclusion that what must be wrong in Fermat's method was the extension of the tree of possibilities beyond those parts which the game in hand required. Pascal consulted Roberval on the likely legitimacy of this fully rolled out tree of possibilities and Roberval seems to have told Pascal that this is where Fermat is going wrong, namely that this 'false assumption' of theoretical play of zombie-games leads to bad results. It doesn't.

The evolution in time of a source of randomness was seen clearly by Fermat as separate from the rule, game or activity sitting on top of it. In this case the game was 'first to get N wins'. Modern derivatives pricing, when tree based methods are used, applies this same move. First the random process's set of possibilities is evolved on a lower, supporting layer, then the payoff of the contract is worked out at the terminal time horizon. Both in De Mere's game and with an option, there's a clearly defined termination point. With De Mere's game, the point happens when the first player reaches N wins. With options, the termination point is the expiry of the option. Gambler's ruin, as I'll discuss later, doesn't have such a straightforward termination point. So step 1 is to lay out all the possible states from now to the termination point, the tree of possibilities for the stochastic process. Then you work out the terminal value of the contract or game and use Pascal's fairness criterion to crawl back up the second tree, until you reach the 'now' point, which gives you the fair value of the contract. This is the essence of the finite difference solution set, and it works for path dependent and path independent pricings. The implication of the game is that the tree is re-combinant, which means the binomial coefficients become relevant when working out the probability that each path is traversed.

Fermat has a clearer and earlier conception of this separation. But Roberval and Pascal were right to flag this move up - what grounds did Fermat give for the move? In modern parlance, we can see that the stochastic process, often a stock price or a spot FX or a tradeable rate, is independently observable in the market. But back then, Pascal was struggling to separate the game from the source of randomness. F. N. David suggests that Pascal sets Roberval up as the disbeliever as a distancing mechanism for his own failure to grasp this point. Likewise, David suggests perhaps Pascal only solved his side of the problem after initial prompting from Fermat, in a letter which starts off the correspondence but which unfortunately no longer exists.

Of course, this isn't a solution of an unfinished game, but the fair value of the game at any point during its life. Each author I read seems clear in his mind that one or other of the great mathematicians' solutions is preferred. Is this just ignorance, aesthetic preference masquerading as informed opinion? Yes, largely. But my own opinion is that both solutions share many similarities - both need to evolve a tree of possibilities, a binary tree, for which the binomial coefficients come in handy as the number of steps increases. Both then involve evaluating the state of the game at the fixed and known horizon point. Fermat's tree is a set of possibilities of a stochastic process. His solution takes place exclusively at that final set of terminal nodes, by working out the ratio of the set of nodes in which player A is the winner over the total set of terminal nodes. Pascal's tree is the tree of game states. He reasons in a reverse iterative way until he reaches the start point, and the start point gives him his final answer. The arithmetic triangle could help both these men build their trees as the number of steps increases.
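The two routes can be put side by side in a short sketch. This is only an illustration under stated assumptions: the function names `fermat_share` and `pascal_share` are invented here, the coin defaults to fair, and the game state is described by how many more wins each player still needs.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product


def fermat_share(a_needs, b_needs, p=Fraction(1, 2)):
    """Fermat-style: roll the coin out for the full a_needs + b_needs - 1
    tosses (including tosses after the game is already decided) and add up
    the weight of the terminal nodes at which player A has won."""
    n = a_needs + b_needs - 1
    total = Fraction(0)
    for tosses in product((True, False), repeat=n):  # True = toss won by A
        a_wins = sum(tosses)
        if a_wins >= a_needs:
            total += p ** a_wins * (1 - p) ** (n - a_wins)
    return total


@lru_cache(maxsize=None)
def pascal_share(a_needs, b_needs, p=Fraction(1, 2)):
    """Pascal-style: backward recursion over game states on a ragged tree,
    stopping each branch as soon as the game is resolved."""
    if a_needs == 0:
        return Fraction(1)  # A has already won
    if b_needs == 0:
        return Fraction(0)  # B has already won
    return (p * pascal_share(a_needs - 1, b_needs, p)
            + (1 - p) * pascal_share(a_needs, b_needs - 1, p))


# A needs 2 more wins, B needs 3 more: both methods give A 11/16 of the pot.
share_f = fermat_share(2, 3)
share_p = pascal_share(2, 3)
```

The fully rolled out grid and the ragged tree agree, for a biased coin as well as a fair one, which is the point about the 'zombie' tosses being harmless.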

Friday, 22 March 2013

Probability preferences : expectation is secondary

I didn't realise counting was so important to the theory of probability. First you have the simplified sub-case where all N disjoint outcomes are equi-probable, in which case you can use combinatorics to estimate probabilities. Combinatorics just being counting power tools. In effect the move is to set all of these $\frac{1}{N}$ probabilities to be mapped to the natural numbers. Then comparing probability areas becomes a question of counting sample space elementary outcomes.
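As a small sketch of probability by counting, assuming fair dice and reading the dice in order so that every elementary outcome is equally likely (the helper name `prob_by_counting` and the choice of sums 9 and 10 are illustrative):

```python
from fractions import Fraction
from itertools import product


def prob_by_counting(event, faces=range(1, 7), n_dice=2):
    """Probability of an event as (favourable outcomes) / (all outcomes),
    valid only when the elementary outcomes are equi-probable."""
    outcomes = list(product(faces, repeat=n_dice))  # ordered results
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))


p_nine = prob_by_counting(lambda outcome: sum(outcome) == 9)   # 4 of 36
p_ten = prob_by_counting(lambda outcome: sum(outcome) == 10)   # 3 of 36
```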

Second, even in the case where it is a general (non equi-probable) distribution, you can look at the set of outcomes themselves and map them to a series of numbers on the real (or whole) line. So say you have a die with six images on it. You could map those images to six numbers. In fact, dice normally come with this 1-to-6 mapping additionally etched onto each of the faces. The move from odds-format to ratio-of-unity format that we see in probability theory is crying out for a second number, representing some kind of value, perhaps a fair value, associated with some game or contract or activity. In other words, now we've partitioned the sample space into mutually exclusive outcome weights, let's look at finding numerical values associated with the various states. When it comes to pricing a financial contract which has an element of randomness in it (usually a function of some company's stock price, which serves nicely as such a source), then a careful reading of the prospectus of the derived instrument ought to be able to be cashed out in terms of a future value, given any particular level of the stock.

I've seen Pascal's wager claimed to be the first use of expectation in a founding moment for decision theory. By the way, that's a poorly constructed wager since it doesn't present value the infinite benefit of God's love. That could make a dramatic difference to the choices made. Anyway, Huygens himself wrote about expectations in his probability book, but for me, the warm seat problem (the problem of points) represents an attempt to find a mean future value starting from now during a game. This is an expectation calculation, even though the word may not have been used in this context.

Thursday, 21 March 2013

Warm Seat

I am really rather pleased with my reading of the history of the theory of probability. Four points struck me about it. First, that Cardano has a much stronger claim than the authors of histories of probability give him credit for. Second, that Pascal was wrong in criticising Fermat's combinatorial approach in the case of more than two players in the problem of points and that his mistake was an equivalence class / ordering misunderstanding about the reading of three thrown dice. Third, that Pascal's solution is a bit like using dynamic hedging for an exotic option (one which doesn't exist yet, but which I'll call a one-touch upswing option). And fourth, that Huygens's gambler's ruin can be made into a problem of points by using participant stakes and separately some tokens which are transferred from the loser to the winner after each throw. On the last three of these points Todhunter and the authors Shafer and Vovk agree with me, variously.

A better name for the problem of points is the warm seat price. And the original first-to-six game, and also Gambler's ruin with plastic tokens and stakes can both be seen as specific games for which there's a warm seat price - the fair value of the game for a participant if he wanted to get out of the game immediately. Gambler's ruin doesn't have a definite time in the future at which point it will with certainty be known who the winner is.

It is also amusingly my warm seat moment since I didn't discover anything myself, but followed in other people's footsteps, and have had the warm seat experience of discoveries others had made before me.

Wednesday, 20 March 2013

Probability preferences: the irrelevance of parallel/sequential distinction

In a sense, it doesn't matter whether you throw one die sequentially n times to get a $6^n$ event space, or whether you toss n distinguishable dice simultaneously, as long as you read your die results in a way which preserves the identity of the die on which each number appears. I'll leave off talking about what implication this has for the famous Pascal-Fermat problem of points until a later posting. For now, consider what this means for the classic repeated experiment in probability theory. If the events are genuinely independent, then it doesn't matter what relative time it is when you toss each one. The law of large numbers could equally well be satisfied with a single massively parallel experiment in, say, tossing n coins at once as with tossing a coin sequentially n times.

Likewise in set theory, there's a curious atemporality to Venn diagrams. The same holds when discussing the joint probability of $A \cap B$, which is of course not the same as A then B. Even with Bayes' theorem it is important to realise that the 'given' meaning in A|B is with respect to our knowledge of the occurrence of B, not that B happened first and then A subsequently happened.

Tuesday, 19 March 2013

Probability Preferences : Independence is primary, multiple random sources secondary

I have already talked about the absolute importance of the idea of mutual exclusivity, disjunction to probability theory and how it enables the addition of probabilities. I'd now like to chat about independence. Remember I said that the pairwise disjoint sets were absolutely dependent, in the sense that knowing one happened tells you everything you need to know about whether the other happened. Note the opposite is not the case. That is, you can also have absolutely dependent events which are nevertheless not mutually exclusive. I will give three examples, though of course the classic example of independence is two (or more) separate randomisation machines in operation.

Take a die. Give the six faces six different colours. Then give the faces six separate figurative etchings. Then add six separate signatures to the faces. When you roll this die and are told it landed red face up, you know with certainty which etching landed face up, and which signature is on that face. But those three events are not mutually exclusive.

Take another die, with the traditional pips. Event E1 is the tossing of an even number. Event E2 is the tossing of 1, 2, 3 or 4. $P(E1)=\frac{1}{2}$ and $P(E2)=\frac{2}{3}$. The occurrence of $E1 \cap E2$ is satisfied only by throwing a 2 or a 4 and so $P(E1E2) = \frac{1}{3}$. This means, weirdly, that E1 and E2 are considered independent, since knowing that one occurred doesn't change your best guess of the likelihood of the other. The events are independent within the toss of a single randomisation machine.
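The arithmetic can be checked in a few lines. This is a sketch assuming a fair six-faced die; the helper names `prob` and `independent` are made up for the illustration.

```python
from fractions import Fraction

FACES = range(1, 7)  # a fair six-faced die is assumed


def prob(event):
    """Probability of an event (a set of faces) on the fair die."""
    return Fraction(sum(1 for face in FACES if face in event), 6)


def independent(a, b):
    """Two events are independent when P(A and B) = P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)


E1 = {2, 4, 6}      # an even number
E2 = {1, 2, 3, 4}   # one of 1, 2, 3 or 4

e1_e2_independent = independent(E1, E2)          # True: 1/3 == 1/2 * 2/3
e1_low_independent = independent(E1, {1, 2, 3})  # False: 1/6 != 1/2 * 1/2
```

The second check shows how fragile this technical independence is: shrink E2 by one face and it disappears.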

In a previous posting, I mentioned having 52 cards strung out with 52 people, and when someone decides, they pick up a card, and in that act, disable that possibility for the 51 others. This system is mutually exclusive. You can create independence by splitting the audio link into two channels. The independence of the channels creates the independent pair of randomisation machines.

As the second example hinted at, independence means $P(E1E2) = P(E1) \times P(E2)$. The most obvious way in which this can happen over one or more randomisation machines is for it to happen over two machines, where E1 can only happen as an outcome of machine 1 and E2 from machine 2. This is what you might call segregated independence - all the ways E1 can be realised happen to be on randomisation machine 1 and all E2s on a second randomisation machine. Example two could be called technical independence.

As the single randomisation machine becomes more complex - 12 faces instead of 6; 24 faces, 1000 faces, a countably large number of faces, it becomes clear that independence of a rich kind is entirely possible with just one source of randomness. Another way of saying this is that multiple sources of randomness are just one way, albeit the most obvious way, of achieving independence. Hence relegating that idea to the second tier in importance.

One gambler wiped out, the other withdraws his interest

In so far as odds are products of a book maker, they reflect not true chances but bookie-hedged or risk-neutral odds. So right at the birth of probability theory you had a move from risk-neutral odds to risk neutral slices, in the sense of dividing up a pie. The odds, remember, reflect the betting action, not directly the likelihood of respective outcomes. If there's heavy betting in one direction, then the odds (and the corresponding probability distribution) will reflect it, regardless of any participant's own opinion on the real probabilities. Those subjective assessments of the real likelihood start, at their most general, as a set of prior subjective probability models in each interested party's head. Ongoing revelation of information may adjust that probability distribution. If the event being betted on is purely random (that is, with no strategic element, a distinction Cardano made), then one or more participants might correctly model the situation in a way which is as good as they'll want, that is immune to new information. For example, the rolling of two dice and the relative occurrence of pips summing to 10 versus the relative occurrence of pips summing to 9 is the basis of a game where an interested party may well hit upon the theoretical outcomes implied by Cardano and others, and would stick with that model.

Another way of putting this is to say that probability theory only co-incidentally cares about correspondence to reality. This extra property of a probability distribution over a sample space is not in any way essential. In other words, the fair value of these games, or the various actual likelihoods are just one probability distribution of infinitely many for the game.

Yet another way of putting this is to say that the core of the theory of probability didn't need to require the analysis of the fair odds of a game. The discoverers ought to have been familiar with bookies' odds and how they may differ from likely outcome odds. Their move was in switching from hedge odds of "a to b" to hedge probabilities of $\frac{b}{a+b}$. That it did bind this up with a search for fair odds is no doubt partly due to the history of the idea of a fair price, dating back in the Christian tradition as far as Saint Thomas Aquinas.

Imagine two players, Pascal and Fermat, playing a coin tossing game. They both arrive with equal bags of coins which represent their two wagers. They hand these wagers to the organisers, who take care of the pair of wagers. Imagine they each come with 6,000,000 USD. The organisers hand out six tokens each, made of plastic and otherwise identical looking. Then the coin is brought out. Everyone knows that the coin will be very slightly biassed, but only the organisers know precisely to what degree, or whether towards heads or tails. The game is simple. Player 1 is the heads player, player 2 tails. Player 1 starts. He tosses the coin. If it is heads, he takes one of his opponent's plastic tokens and puts it in his pile. If that happened, he'd have 7 to his opponent's 5. If it is tails, then he surrenders one of his tokens to his opponent. Then the opponent takes his turn, collecting on tails and paying out on heads. The game ends when the winner gets to have all 12 tokens and the loser has 0 tokens. The winner keeps the 12,000,000 USD, a tidy 100% profit for an afternoon's work. The loser just lost 6,000,000 USD. Each player can quit the game at any point.

Meanwhile this game is televised and on the internet. There are 15 major independent betting cartels around the world taking bets on the game. In each of these geographic regions, the betting is radically different, leading to 15 sets of odds on a Pascal or a Fermat victory.

Totally independent of those 15 betting cartels, there are a further 15 betting cartels which have an inside bet on, which pays out if you guessed who would see 6 victories first, not necessarily in a row.

Now this second game is inside the first, since you can't finish the first game unless you have collected 6 points too. Pascal and Fermat don't know or care about the inner game. They're battling it out for total ownership of the tokens, at which point their game ends. The second set of cartels is guaranteed a result in at most 11 tosses every time, and possibly in as few as 6 tosses.

Just by coincidence, Fermat, player 1, gets 4 heads in a row, to bring him to 10 of the 12 tokens. He only needs 2 more heads to win. At this point Pascal decides to quit the game. To betters in cartel 1 it looks like Pascal and Fermat are playing gambler's ruin; to cartel 2 it looks like they're playing 'first to get six wins', which is the game the real Pascal and Fermat analyse in their famous letters.
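To see how differently the outer and inner bets would value Fermat's position at the moment Pascal quits, here is a rough sketch. It assumes a known heads probability `p` (the story only says the coin is slightly biassed, so the fair coin used below is purely an assumption), uses the standard closed form for gambler's ruin, and the function names are invented for the illustration.

```python
from fractions import Fraction
from math import comb


def ruin_win_prob(tokens, total, p):
    """Gambler's ruin: chance of collecting all `total` tokens when holding
    `tokens` of them and winning each toss with probability p."""
    q = 1 - p
    if p == q:
        return Fraction(tokens, total)
    r = q / p
    return (1 - r ** tokens) / (1 - r ** total)


def points_win_prob(needs, opp_needs, p):
    """Problem of points: chance of reaching `needs` further wins before the
    opponent reaches `opp_needs`, via the fully rolled out binomial grid."""
    n = needs + opp_needs - 1
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(needs, n + 1))


half = Fraction(1, 2)  # assumed fair coin, for illustration only
outer = ruin_win_prob(10, 12, half)   # Fermat holds 10 of 12 tokens: 5/6
inner = points_win_prob(2, 6, half)   # Fermat leads 4-0 in first-to-six: 15/16
```

Under the fair-coin assumption, Fermat's warm seat is worth 5/6 of the pot in the outer game but 15/16 in the inner one, even though both sit on the same sequence of tosses.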

Soon after, Pascal's religious conversion wipes out his gambling dalliance, and Fermat, only partly engaged with this problem, withdraws his interest. Both men metaphorically enacting gambler's ruin and the problem of points.

Monday, 18 March 2013

Probability Preferences: Conjunction and Disjunction are primary, Disjunction more so

Making addition work with sets is the heart of probability theory. Sure, a probability was really just a way of re-expressing odds, and that had been known about for ages before Cardano. Odds of n : m means that the event has probability $\frac{n}{n+m}$, which allows you to work out a weighting. But apart from nailing those numbers down to the range 0 to 1, the primary basic rule of probability can be thought of as specifying the conditions under which disjunction works at its most simple. That is, it lays out what must be true about A and B to allow you to say that $P(A \cup B) = P(A) + P(B)$. Set theory, of course, has its own history and life outside of probability theory, but probability theory becomes parasitic on set theory insofar as the general descriptions of events use the language of set theory. In set theory, events are said to be disjoint when there's no possibility of overlap. The events have their own autonomous standalone identities, so applying the union operator allows for no possibility of double counting. If you define a number of events $A_i$ and you want to state that there's no overlap anywhere you say they're pairwise disjoint, meaning that each and every pair from that list is disjoint. We sometimes say the events $A_i$ are mutually exclusive. What we don't say often enough is that this means they're utterly dependent on each other. The occurrence of some particular one of the disjoint events $A_i$, say $A_{12}$, tells you with certainty that all the other events did not happen. This is complete dependence and it isn't obvious just by looking at the corresponding Venn diagram. If you have a full house, so to speak, of events such that $A_1 \cup A_2 \cup \dots = S$, the entire sample space of possibility, then you've fully specified a probability model and can say $P(\bigcup_{i=1}^\infty A_i) = \sum_{i=1}^\infty P(A_i)$. With this single condition, the addition of probabilities is born.
It is a very constrained sort of addition, to be sure, since no matter how many disjoint events in your experiment, even an infinite number of them, your sum of probabilities (all those additions) will never result in a number greater than 1.

Examples of these utterly dependent events: rolling a die and getting 'red face up' (with a red-orange-yellow-green-blue-indigo die), tossing a coin and watching it fall heads-up, selecting the six of clubs by picking randomly from a pack of 52 playing cards, rolling a traditional pair of dice and getting a pair of sixes, rolling a die and getting an even (not an odd) number of pips face up.

With dice and coins, notice that it is in spinning them that we rely on their shape to select precisely one of n possibilities. We initiate the random event and a separate object's physical shape guarantees the one of n result. With picking a card, we initiate the random event but it is additionally in our act of selecting that we guarantee leaving the remaining 51 cards unturned. Strictly speaking, the randomising action has already happened with the cards, when they were presumably shuffled thoroughly. The shuffle, the toss, the flip. These are the randomising acts. With the toss and the flip, imagine the viewer closes his eyes on the toss and the flip. Then he's in the same uncertain state as the person about to pick a card from a shuffled deck.

Notice you can fully specify a probability model with a complete set of pairwise disjoint events even if the events in question aren't elementary - the example above which I gave is of rolling a die and getting odd or even.

If I gave half the playing cards to one person and the other half to another person, perhaps in a different room, then if there was no form of communication possible between them, then we wouldn't have pairwise disjoint events across all 52 cards. We'd have a pair of 26-card pairwise disjoint events, each of which was independent from the other. Imagine if I gave one card each to 52 different people, in different countries. Imagine further that I told them they could turn over their one card whenever they wanted, and as soon as they did so, to press a buzzer which had the effect of disabling the remaining 51 cards so that they could not be turned over. Ignoring messy practical reality here, then there's no shuffle. There's no natural sort order which could be applied to the geographical distribution of the people and cards, so no sense of working out whether they were randomly distributed in space. Still, the buzzer and disabling devices make this a coherent utterly dependent trial.

This requirement which allows simple addition of probabilities has implications for the randomisation machine - if it is to co-ordinate precisely 1 of n outcomes, then all n outcomes must be co-ordinated or constrained by someone or something.

Conjunction, on the other hand, cannot work in a world of mutually exclusive events. By definition, there is no overlap anywhere. So the major set up axiom of probability theory identifies a set of events on which it is impossible to perform intersection (probability multiplication).

In summary, the basic axioms of probability nail it as a real number in the range 0 to 1, and identify a set of events on which natural addition is absolutely possible and natural multiplication is absolutely impossible. Finally, when you have a set of mutually exclusive, absolutely dependent events which cover all the outcomes of a trial, then the set of events is called a partition of the sample space.
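A short sketch of the partition idea, assuming a fair six-faced die and events represented as sets of faces (the names `is_partition` and `prob` are illustrative): a family of events is a partition when it is pairwise disjoint and covers the sample space, and only then do the probabilities simply add up to 1.

```python
from fractions import Fraction
from itertools import combinations

SAMPLE_SPACE = {1, 2, 3, 4, 5, 6}  # a fair die is assumed


def is_partition(events, sample_space):
    """True when the events are pairwise disjoint and jointly exhaustive."""
    pairwise_disjoint = all(a.isdisjoint(b) for a, b in combinations(events, 2))
    covers = set().union(*events) == set(sample_space)
    return pairwise_disjoint and covers


def prob(event):
    """Probability of an event on the fair die, by counting faces."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))


odd, even = {1, 3, 5}, {2, 4, 6}
odd_even_ok = is_partition([odd, even], SAMPLE_SPACE)           # True
overlap_ok = is_partition([even, {1, 2, 3, 4}], SAMPLE_SPACE)   # False
total = prob(odd) + prob(even)                                   # 1
```

The odd/even pair is a partition even though neither event is elementary, while the overlapping pair fails and its probabilities would double count the faces 2 and 4.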

Friday, 15 March 2013

Probability Preferences: Equivalence Classes are primary

Another dimension of relevance when trying to judge what you might fairly expect of the achievements of the founders of probability theory is the idea of an equivalence class. At its most fundamental, a randomisation device is a piece of technology which has $n \geq 0$ states, possibly infinitely many in the continuous case. It is said to be in only one of those states, and the likelihood of it being in the $i$th state is $p_i$, the probability. As mentioned in yesterday's post, there's no requirement that these probabilities follow any pattern whatsoever other than the primary one, namely that their sum is 1.

Start by imagining a traditional die, which has six distinctly marked faces, each with a different number of pips. The fact that these faces are mutually distinguishable is important, not the fact that the distinction is achieved with pips indicating the numbers 1 to 6. It could just as easily have been 6 colours or 6 pictures. We can refer to this randomisation machine's $n$ states as its elementary states, its elementary outcomes. It will have some particular 6-state discrete probability distribution, that is to say, some set of six probability numbers in the range 0 to 1 with the single additional constraint that $\sum_{i=1}^6 p_i =1$

Now imagine a different die, one which had on three faces a common colour, and on the other three faces a second colour. This randomisation machine operates like a coin - it will have some particular 2-state discrete probability distribution.

Now imagine all possible combinations of six different colours written on the faces of dice. That's a lot of dice, each with its own number of states, with its own discrete probability distribution.

Finally, imagine a die with no face markings. This represents the minimal 0-state machine which doesn't technically get to be called a randomisation machine, since there's no uncertainty in rolling it. But for completeness you can see how it fits in.

Without knowing anything about the particular probability distribution of a die, you can see that the first die I mentioned, the one with 6 distinct faces, somehow provides the most randomness. That is, if you first build the die (and therefore fix its probability distribution), when you come to the decision of how to label its faces, there's something natural feeling about having all the available slots differently labelled. It is more efficient, less wasteful of capacity, more generative of randomness. In terms of information theory, it is the maximum entropy choice, given any particular probability distribution. The maximum entropy choice would have you use all available slots on this randomisation machine, all other things being equal. Likewise the faceless die is the minimum entropy configuration. And in between, each and every labelling can be ranked by entropy. The die's entropy must therefore be a function of the number of distinguishable states in the randomisation machine.

Next let's turn our attention to the probability distribution we can construct for a die. To do that, let's hold the face-painting choice constant and go for the option of 6 distinguishable faces. There are in theory an infinite number of sets of 6 real numbers which fall between 0 and 1 and which sum to 1. For this 6 distinct-faced die, we can rank each and every one of them by entropy. We'll discover that the maximum entropy probability model is the one which has equal probabilities for all faces, namely all faces have a probability of $\frac{1}{6}$. And, for a 6 faced die, the minimum entropy die would be one where it was impossible to get 5 of the faces, but completely certain you'd get one particular face. How to build such a die in practice is a different matter, but it isn't relevant here.

Now realise that you can run the same analysis for not just the 'maximum entropy' 6 face-distinguished die, but for all labellings down to the faceless die. And there's a kind of global maximum and minimum entropy pair of dice in this universe of all possible dice, namely the equi-probable 6-label die and the totally faceless die. And all couplings in between can get ranked by entropy. When you tell a randomisation machine to produce for you an observable state (that is, when you roll the die), you get the most information out of it when you're rolling the maximum entropy die.

It is a nice way of characterising a randomisation machine: knowing the maximum number of distinct states it can be in. That seems a kind of permanent, almost physical aspect of the machine. Likewise, the probability distribution seems somehow 'built in' to the physical machine. Of course, the machine doesn't need to be physical at all. Still, the machine gets built and it is kind of natural to imagine a particular probability distribution burned in to the 'maximum entropy' face-painted die. This is where we started. Now, imagine we took that particular die - in fact, let's just for the sake of argument give it, off the top of my head, the distribution $\frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{12}, \frac{1}{4}$. I could have picked equi-probable but decided not to. And I paint the colours red, orange, yellow, green, blue, indigo on it, some colours of the rainbow.

It is done, the die is built. But wait. I decide on the following rule. I want to make this particular die behave like a coin flip. Rather than re-paint, I just decide in my head to count red or orange or yellow to represent one state, and the other three colours to represent another. This is an equivalence class. It is a transformation or grouping or partition of the elementary outcomes of a randomisation machine. Likewise I can, by just deciding to do so, interpret my rainbow die castings to reproduce any particular painting. I just need to remember in my head the equivalence rule.

So the rainbow die, with its specific probability distribution, is set in the middle of a room, filled with people, each with their own distinct equivalence class in mind. Each roll of the die is seen by each person as a different outcome of his own randomisation machine. By applying an equivalence class, you've got the randomisation machine to perform differently for you. This is kind of like software. The equivalence class being the program. With each equivalence class, there's a way of rolling up the built-in probabilities to produce a transformation to a new probability distribution for that equivalence class. Imagine the red face was $\frac{1}{12}$ and the orange $\frac{1}{4}$ and the rest of the colours $\frac{1}{6}$. By the red-or-orange-or-yellow versus green-or-blue-or-indigo equivalence class I have simulated a fair coin flip, even though the 'elementary' outcomes were not equi-probable.
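Treating the equivalence class as a program is easy to sketch. The probabilities below are the ones given in the paragraph above; the function name `roll_up` and the labels "heads" and "tails" are invented for the illustration.

```python
from fractions import Fraction

# Built-in distribution of the rainbow die, as given in the text.
rainbow = {
    "red": Fraction(1, 12),
    "orange": Fraction(1, 4),
    "yellow": Fraction(1, 6),
    "green": Fraction(1, 6),
    "blue": Fraction(1, 6),
    "indigo": Fraction(1, 6),
}


def roll_up(distribution, label_of):
    """Apply an equivalence rule: sum the elementary probabilities that
    share a label to get the distribution each observer actually sees."""
    rolled = {}
    for outcome, p in distribution.items():
        label = label_of(outcome)
        rolled[label] = rolled.get(label, Fraction(0)) + p
    return rolled


warm = {"red", "orange", "yellow"}
coin = roll_up(rainbow, lambda colour: "heads" if colour in warm else "tails")
# coin == {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}
```

Any other observer in the room just passes a different `label_of` rule over the same die.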

So in a sense a so-called n-state randomisation machine describes the maximum entropy, all-states-distinguished equivalence class. Even though I've been claiming this is natural, efficient, etc., in theory there's nothing special about this maximum entropy state except that it ranks top of the list for information content. It is as if each of the observers of the machine, through the glasses of his own equivalence class, sees a different reality, but none of them can take the glasses off. If you do privilege the maximum entropy equivalence class, then call all of its states elementary outcomes, elementary events or the sample space. If that's what you're going to do, then all the other equivalence classes represent composite events, or simply events, and you can work out the probability of these events by rolling up their constituent probabilities. Executing or running a randomisation machine can then be said to reduce uncertainty in a disjoint set of n possible outcomes. That is, a randomisation machine is a chooser. It picks one of n. It is an OR-eraser. The concept of OR is primitive, it exists at the elementary outcome level. It is a particularly tight kind of OR - one which is exclusive and all-encompassing. In other words the OR-eraser which is the random event picks exactly one of the n elementary outcomes. The act of causing a randomisation machine to run is an act of OR-erasure. At the level of a single randomisation machine, there's no concept of AND. A single choice, 1-of-n, is made. At the equivalence class level the construction of the equivalence class can involve OR-construction (disjunction) and AND-construction (conjunction).

As I mentioned last night, the best a single die can hope to achieve is the uncertainty reduction of about 2.58 bits. That's its maximum entropy. The formula is $-\sum_{i=1}^6 p_i \log_2 p_i$. This quantity is purely a function of the probability distribution, as you can see, but you should remember I chose colours as elementary outcomes partly because there's no natural mapping on to a number. In this sense information is more fundamental than expectation, which I'll mention more of in another posting.
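As a quick check of the 2.58 figure, here is a small Python sketch using the usual Shannon form of the entropy, with a leading minus sign and base-2 logarithms so that the answer comes out in bits. The function name `entropy_bits` is just my label, and the second distribution is the off-the-top-of-my-head rainbow die from earlier.

```python
from math import log2

def entropy_bits(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)), skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

fair_die = [1 / 6] * 6
rainbow_die = [1 / 6, 1 / 6, 1 / 6, 1 / 6, 1 / 12, 1 / 4]

print("fair die:   ", round(entropy_bits(fair_die), 3), "bits")
print("rainbow die:", round(entropy_bits(rainbow_die), 3), "bits")
```

Note that the function only ever sees the list of probabilities, never the colours, which is the point about not needing a mapping on to numbers.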

My thought experiment of multiple people looking at the result of a randomisation machine's single run and seeing different (non-elementary) outcomes is clearer in the act of picking a random playing card. Participant 1 sees a Queen of Hearts, another sees a Queen, another sees a Heart, another sees a Face card, etc. And those are only the 'semantically coherent' equivalence classes - there are in fact a whole bunch more.
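The playing card version can be sketched like this, assuming a standard 52-card deck, a uniformly random draw, and Jack, Queen and King as the face cards (all assumptions of mine rather than anything stated above). The `views` dictionary and `probability` helper are illustrative names.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["Hearts", "Diamonds", "Clubs", "Spades"]
deck = list(product(ranks, suits))  # 52 elementary outcomes as (rank, suit) pairs

# Each participant's event is a yes/no test applied to the same drawn card.
views = {
    "Queen of Hearts": lambda card: card == ("Q", "Hearts"),
    "Queen": lambda card: card[0] == "Q",
    "Heart": lambda card: card[1] == "Hearts",
    "Face card": lambda card: card[0] in {"J", "Q", "K"},
}

def probability(event):
    """Proportion of equally likely cards for which the event holds."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

for name, event in views.items():
    print(name, probability(event))
```

The single drawn Queen of Hearts satisfies all four tests at once, yet each participant attaches a different probability to what he saw.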

Thursday, 14 March 2013

Probability Preferences: Event Space is primary, Equi-probable Event Space is secondary

Technically, probabilities are proportions, fractions of a nominally unitary whole. Those proportions don't have to be the same size. When they are, then counting tricks, combinatorics, can come into play. In my four walls metaphor for probability the first wall is made up of bricks of uneven areas. This is the primary case in probability theory. Understanding that you have an event space and that you sum regions of a unitary whole, this is all that you need. With equally-sized areas, number theory tricks become relevant, since there's a mapping from each area to a whole number, and you arrive at your proportion by scaling it down by the sum of all such elementary outcomes, $\sum_n 1$.

It is hugely important in my mind to see where and when numbers come into it all and at what stage. Unevenly sized elementary outcomes don't map neatly to the whole number system, and that's OK. On a related point, the event in question, elementary or otherwise, doesn't have to have a mapping on to a number either. If it does, then you further can talk about expectations, functions of random variables, etc. But you don't need that either. What distinguishes an equi-probable random device is that this probability distribution is the maximum entropy one (2.58 bits in the case of a die, 1 in the case of a coin). The minimal entropy case for all randomisation devices is the one where all elementary outcomes, regardless of how biassed or unbiased the device is, map to one event. In that case the information content is 0 and technically it is no longer a randomisation device, you've effaced its randomness, so to speak. What makes these proportions of a unitary whole interesting is that, for any given activity, game or contract with randomness, there's a particular configuration of these probabilities in your mathematical analysis which comes close to the results you would expect if you carried out the experiment multiple times.
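That last point, the analysed proportions coming close to what repeated experiments give, can be illustrated with a simulation. This is a sketch only: it reuses the uneven rainbow die from an earlier post as an assumed example, and the seed, the number of rolls and the function name `observed_frequencies` are arbitrary choices of mine.

```python
import random
from collections import Counter

colours = ["red", "orange", "yellow", "green", "blue", "indigo"]
weights = [1 / 6, 1 / 6, 1 / 6, 1 / 6, 1 / 12, 1 / 4]  # unevenly sized bricks

def observed_frequencies(n_rolls, seed=0):
    """Roll the weighted die n_rolls times and return each colour's observed share."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rolls = rng.choices(colours, weights=weights, k=n_rolls)
    counts = Counter(rolls)
    return {colour: counts[colour] / n_rolls for colour in colours}

freqs = observed_frequencies(200_000)
for colour, p in zip(colours, weights):
    print(f"{colour:7s} analysed {p:.4f}  observed {freqs[colour]:.4f}")
```

No colour is ever mapped to a number here; only the proportions are compared.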

Isaac Todhunter's "History of the mathematical theory of probability from the time of Pascal to that of Laplace", 1865, is a key milestone in the history of probability theory. F.N. David, also often quoted by many of the authors I've read, references Todhunter thus: "[he].. has been and always will be the major work of reference in this subject" (F.N. David, preface, ix). Ian Hacking, in his amazing "The emergence of probability" says in the first sentence of chapter 1 "[Todhunter]...remains an authoritative survey of nearly all work between 1654 and 1812" (Hacking, p1). Todhunter's very book title is revealing - he originates probability theory with Pascal. This choice echoes down through all the probability books I've come across.

Todhunter was a senior wrangler, so his intellectual capacity is beyond doubt (just check out the list of former senior wranglers and the equally stellar top 12's). He describes Cardano's "On casting the die" as a 15 page gambler's manual where ".. the discussions relating to chances form but a small portion of the treatise" (Todhunter, p2).

Cardano discusses the activity of throwing two dice and summing the number of pips across the two dice. He lays out the theory of probability as 'proportions of a unitary whole' using the language of 'chances'. That he chose dice rather than astragali is of merely historical interest since no doubt he is the first in the western tradition to make this proportions-as-chances analogy. Cardano also nails the implications of all 36 elementary outcomes on the activity of 'summing the pips', which involves understanding that rolling two dice implicitly maintains a knowledge of which die is which. In a sense, that each die is 'result-reading colour coded'. In a previous book he also talks about binomial coefficients, for which Pascal usually gets credit. He performs the same analysis for three dice. As I'll mention in a subsequent post (on parallel/sequential irrelevance), this is theoretically equivalent to predicting the future three steps out. Keith Devlin in "The unfinished game" explicitly (and wrongly) gives Pascal and Fermat credit for this.
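Cardano's enumeration is easy to reproduce today. The sketch below assumes fair six-sided dice (my assumption, so that all ordered outcomes are equally likely) and keeps track of which die is which by listing ordered tuples; `sum_distribution` is an illustrative name.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sum_distribution(n_dice):
    """Probability of each pip total, counting ordered outcomes of n fair dice."""
    outcomes = list(product(range(1, 7), repeat=n_dice))  # (1, 2) and (2, 1) are distinct
    counts = Counter(sum(outcome) for outcome in outcomes)
    return {total: Fraction(n, len(outcomes)) for total, n in sorted(counts.items())}

two_dice = sum_distribution(2)    # built from 36 elementary outcomes
three_dice = sum_distribution(3)  # built from 216 elementary outcomes

for total, p in two_dice.items():
    print(total, p)
```

Treating (1, 2) and (2, 1) as the same outcome would give 21 cases instead of 36 and the wrong proportions, which is why the colour coding matters.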

My suspicion is that this senior Wrangler naturally preferred the great mathematicians Pascal and Fermat and that he recoiled in disgust at the unloveable life which Cardano seems to have lived.

F.N. David upgrades Cardano to ".. a little more achievement than Todhunter allows him but .. not .. much more" (F.N. David, p59). Hacking ends his chapter on Cardano with this: "Do we not find all the germs of a reflective study of chance in Cardano? Yes indeed" (Hacking, p56).

Did Cardano understand the primacy of the 'variable sized brick' case? Yes. Hacking quotes this translated section from Cardano: "I am as able to throw 1,3 or 5 as 2,4 or 6. The wagers are therefore laid in accordance with this equality if the die is honest, and if not, they are made so much the larger or smaller in proportion to the departure from true equality" (Hacking, p54). F.N. David is not so sure since Cardano incorrectly treats of astragali as if they were equi-probable, though he admits this may just be due to Cardano's lack of experience with astragali. Anyway, if not, surely you're allowed to totally mis-characterise one specific randomisation machine and still be the father of modern probability theory.

Tuesday, 12 March 2013

Probability preferences

In order to support my claim that Pascal (and to some extent, Fermat) are too highly praised in the history of probability theory, I'd like to make a claim about what I see as important in the constellation of ideas around the birth of probability theory. This is my opinion, and is based on what I know has happened in the subject of probability theory since the time of Cardano, Pascal, Fermat and Huygens.

Concepts of primary importance in probability theory (in the pre-Kolmogorov world of Cardano, Fermat, Pascal)

Event space.
Independence.
Conjunction and disjunction.
Equivalence class.
Parallel/sequential irrelevance of future outcomes.
A relation between historical observed regularities and multiple future possible worlds.
A clear separation between the implementation of the random process(es) and the implementation of the activity, game, contract, etc. which utilises the source of randomness.

Concepts of secondary importance.

Equi-probable event space.
Expectation.
Single versus multiple random sources.
Law of large numbers (though it is of primary importance to the dependent subject of statistics).
i.i.d. (two or more random sources which are independent and identically distributed).
A Bernoulli scheme.
The binomial distribution.
Stirling's approximation for n factorial.
The normal distribution.
Information content of a random device.
Identification of the activity, game, contract, etc, as purely random, or additionally strategic.

I'd like to say something about each of these in turn.

Before I do, I'd like to say this - the Greeks' failure to develop probability theory was not, as Bernstein and also David suggest, due to a preference for theory over experimentation, but perhaps due to the fact that probabilities are ratios, and the Indians didn't invent base ten positional number notation until the eighth century A.D., an invention which made subsequent manipulations of these ratios more notationally bearable. The early renaissance love of experimentation (Bacon and Galileo) may also have assisted in drawing the parallel between the outcome of a scientific experiment and the outcome of a randomisation machine.

Sunday, 10 March 2013

Musical Chairs

Most of the histories of probability trace their facts back to Hacking and David, and I agree these two books are the best of the bunch I have read. The Hacking book itself references David. I love the Bernstein book series but I noticed his page 43-44 has some musings on why the Greeks didn't bother with working out odds behind dice games. I bet they did. Anyway, he offers an example of the so-called sloppiness of the Greek observations of dice probabilities by mentioning some facts he gleans from David - namely that when using the astragali they valued the Venus throw (1,3,4,6) higher than (6,6,6,6) or (1,1,1,1), which are, he states, "... equally probable".

No, they are not. David clearly states the probability of throwing a six as one in ten; likewise with throwing a one. And threes and fours are each about four in ten events. This means that even if order is important, the Venus is indeed more likely. Second mistake, there's only one permutation of four sixes, and only one permutation of four ones. But there are many permutations of the four Venus numbers, meaning the probability of (1,3,4,6) in any permutation is even higher again than the strictly ordered (1,3,4,6).
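The arithmetic can be checked directly. The sketch below takes the approximate face probabilities quoted above (one in ten for a one or a six, four in ten for a three or a four) and assumes, as my own simplification, that the four astragali fall independently of each other. The variable and function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Approximate astragalus face probabilities, as quoted above from David.
face_probability = {
    1: Fraction(1, 10),
    3: Fraction(4, 10),
    4: Fraction(4, 10),
    6: Fraction(1, 10),
}

def ordered_probability(throw):
    """Probability of one specific ordered result, assuming independent astragali."""
    p = Fraction(1)
    for face in throw:
        p *= face_probability[face]
    return p

four_sixes = ordered_probability((6, 6, 6, 6))
venus_ordered = ordered_probability((1, 3, 4, 6))
# All distinct orderings of the four Venus faces (24 of them).
venus_any_order = sum(ordered_probability(t) for t in set(permutations((1, 3, 4, 6))))

print("four sixes:       ", four_sixes)
print("Venus, this order:", venus_ordered)
print("Venus, any order: ", venus_any_order)
```

On these figures the strictly ordered Venus is 16 times as likely as four sixes, and the Venus in any order is 24 times as likely again.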

It is this partition/permutation dilemma of probability theory, even today, which is so easy to get wrong. I just re-read some earlier postings I made on equivalence classes and their information content and key milestones in probability theory, and I still like what I wrote. Also check out a posting on combinatorics in Cardano and Lull.

It is just a throwaway comment in Bernstein's book and hardly invalidates his wonderful sweeping history of risk but it nicely illustrates the problems of thinking about event space and equivalence class.

Divorce born

I've been thinking about Cardano, Pascal, Fermat and Huygens a lot recently and hope to make a number of postings. For now I'd just like to bring some controversy to the usual story found in the literature about these characters and their relative importance. According to this literature there are three pivotal moments - which I'll call Cardano's circuit, Pascal-Fermat's divorce settlement and Huygens's hope - relating to the problems of complex sample space, the arithmetic triangle, and expected value of an uncertain outcome, or to simplify it even further, to factorial, binomial coefficients and the average, all fairly contemporaneous mathematical inventions or discoveries in the Western tradition.

The story usually told is one which lays great praise at the workings of Pascal and Fermat and which makes a big deal of the so-called problem of points. What I'd like to do during this discussion is show how connected the problem of points is to another famous probability exercise, so-called Gambler's ruin. I'd like to bring these two problems together and show ways in which they're related to many contemporary decision problems. I'd also like to claim that the solution to Gambler's ruin is more important than the problem of points, and has more resonance today. I'd also like to claim that Cardano's discussion of event space has the better claim to being the foundation of probability theory.

In all of the postings to come, I base my readings on the following books, plus free online primary sources, where available in an English translation.

One last introductory point - this thread is clearly a biassed Western history of ideas discussion. Many of the commentators below neglect to sufficiently emphasise the great world traditions in mathematics which played into this - especially from the Islamic, Chinese, Indian traditions. These clearly played in to the so-called canonical view of the birth of probability but that weakness in the line of argument is a weakness for another time and another place.

Saturday, 9 March 2013

Horse Shit

I just read John Gray's disappointing latest book, The Silence of Animals. There's not much to say about it. He's good on criticising humanism and the progressive perspective, but utterly unconvincing. His selection of quotations left me cold. His economic perspectives are miscalculations. I didn't know about Fritz Mauthner or that he was born in a Bohemian town the Germans pronounce as 'horse shit'.

Thursday, 7 March 2013

Philosophy of pessimism re-balanced

The philosophy of pessimism is wrongly categorised and I have an improvement. The usual classification of types of pessimism is cultural (Rousseau to Foucault), metaphysical/theological (Schopenhauer, Buddhism) and post-metaphysical (Nietzsche, existentialism). My classification addresses four issues.

First, pessimism is really a critical attitude towards something. Before addressing the what, I'd like to point out that it is better to talk about philosophies which exist along the optimism-pessimism scale, rather than just concentrating on progressivist/optimistic and pessimistic as if they were inhabiting different worlds. The act of criticism, to some philosophers, leads to the possibility of a better situation, and for others, merely an understanding of some kind of the situation we address. This aprogressivist set of philosophies can range widely over this optimistic-pessimistic scale from Panglossian to Schopenhauerian. In summary, the first dimension of a re-balanced philosophy of pessimism is how the thinker evaluates the possibility that things could get better, either through the critique he provides or through some other mechanism. This is a measure of the impotence of the critique.

Second, I make a primary distinction between ontological and phenomenological pessimism, since I think 'post-metaphysical' is backward-looking and, dare I say it, negative. Depending on what you consider your ontology to be you might find Schopenhauer's Will in here, or a theory of the fundamental nature of mankind, or God.

Third, the critical thinker may (or may not) see a connection or implication between his initial critical target (the ontological or the phenomenological) and its correspondent subject - that is to say some kind of implication from ontological to phenomenological may exist for him (Schopenhauer), or some kind of implication from phenomenological to ontological may exist for him (Stoics), or no implication whatsoever may exist for him (Nietzsche).

Fourth, the primary critical target of the phenomenological critical thinkers is often exclusively either individuals or supra-individual constructs. Critics of individuals may additionally posit supra-individual solutions to the primary problem - Plato's Republic, Hobbes's Leviathan, Machiavelli's Prince, Mill's Liberal society, Comte's positive science/religion, Marx's dictatorship of the proletariat. Critics of individuals may not posit any such remedial supra-individual fix-up (La Rochefoucauld, Montaigne, Kahneman). Critics of the supra-individual tend to want to deconstruct the offending edifice - Rousseau, Foucault, neo-classical economists, public choice theorists, anarchists of all persuasions, Nietzsche, the later Wittgenstein, Derrida.

Philosophies of pessimism, particularly those which are revolutionary, are unbalanced. Their critique is neither best directed exclusively at human nature, nor at cultural institutions, but at both. Ontological stances where the thinker makes a leap from ontological to phenomenological ought still to be considered, provided that the phenomenological conclusions drawn are socially useful. Ontological stances which say nothing about the phenomenological realm in a sense turn their backs on the possibility of social improvement and ought to be of interest only to historians of ideas. Cultural institutions can and do change, sometimes for the better, sometimes for the worse. A balanced philosophy should recognise that. It should also address both the individual and the supra-individual/cultural as more or less equally valid subjects of criticism (and praise). To give primacy to one over the other is a form of extremism.