## Tuesday, 25 April 2017

### Birthdays

Call event $A$ the event that at least 2 people out of $k$ share the same birthday.  The probability of this is one minus the complement  $A^c$, namely that all $k$ individuals have different birthdays.  This is easier to calculate using permutations and counting methods. The answer is $1 - \frac{365!}{365^k(365-k)!}$  To break this apart, you're finding the number of ways in which you can permute $k$ from $365$, which is $365 \times 364 \times ... \times 365 -k$.  You divide this by the total number of days of the year each of $k$ people could conceivably have their birthday, in an unconstrained way.  This is simply $365 \times 365, k$ times, or $365^k$.  Generalising this away from dates, you get $1 - n!\binom{n}{k} \frac{1}{n^k}$.  Selecting $k$ people like this is at base a kind of sampling without replacement.  How does one solve a problem like this.  The first step is to realise that solving its complement may be more tractable than solving the primary problem. This trick is as old as the origins of probability theory itself - see how Pascal used it for DeMere's foundational gambling problem.

Secondly, the basic law of sampling with replacement, namely the act of picking $k$ from $n$ with replacement can be done in $n \times n \times n ... \times n, k$ times, or $n^k$ for short.  This is the free and unconstrained selection of $k$ objects each of which can have one of $n$ attributes.  Then with this denominator counted, you work out the top line, namely the ways in which you can have $k$ all of whom have different birthdays.  This would be sampling with replacement, also known as a permutation, on the numerator.  Perhaps $1-n!\frac{\binom{n}{k}}{n^k}$ is clearer. Permutation is the act of creating a list from a set.

To get to this point mathematically you're simplifying reality. No twins considered, no increased school prevalence of joint birthday parties, no modelling of the non uniform distribution of birthdays throughout the year. The birthday problem appears in most probability textbooks.  But now to find alternative uses for it and to come to a conclusion about the implementation of factorial in systems.

There are 2 dimensions to this problem - first you take a sample of $k$ objects, presumably from a larger population.   This act of sampling is assumed to be completely uniform - namely that the likelihood of you having picked this object is considered to be the same as the likelihood of any other object. Notice that we don't have (or need) a handle on the size of the population.

The second dimension to the problem implies that every member of your population has been deterministically classified.  And further, that this categorisation has a finite (or countably infinite) range of possibilities and that this categorisation can be mapped onto a a counting label. With birthdays, you can be sure that whoever you pick, they were surely born on one of 365 dates of the year. The cardinality of this categorisation  is of course 365. Be clear however that the categorisation itself doesn't have to be inherently numeric, since you're attaching a cardinality to the categorisation.  What I mean by this can be spelled out with another example. We could create a palette of 50 colours and for every member of the population map their eye colour to one of those 50 colour buckets. Clearly there's no inherent numerical or ordering aspect to these 50 colours yet we know that there are 50 of them. Notice here that with eye colour it is unlikely that the bucketing would be as uniformly distributed as the putative day of year was. Without the uniformity of the bucketing distribution you wouldn't arrive at the birthday solution but would need a more complex solution.

So let's step back. We need a categorisation which - unlike many real world categorisations, which are probably likely to be distributed by some kind of power law distribution - is uniformly distributed over the population.

Condition 1 says we implicitly pick $k$ objects with uniform likelihood from a population. Condition 2 says that the population must be categorised in a perfectly uniform way.  Without condition 2 you don't get a denominator with $n^k$ in it.

What kind of categorisation is so perfectly uniformly distributed? How could it be so? Real world informationally contentful categorisations are rarely so uniform. Using the technical definition of information however, random categorisations are indeed the most informationally rich - they reduce the largest amount of uncertainty when the categorisation is revealed to you.  Even if the categorisation is deterministic for the objects, you still, in selecting $k$ from the population, might chose any number of repeats.  This probability calculation tells you how likely you are to achieve 'all clear' or not.

I must link these 2 dimensions together - you don't need the sampling process to have an unconditionally uniform likelihood of picking any member. It only needs to be uniform with respect to the categorisation, that is conditional on the particular categorisation you've chosen. In other words the selector of the sample might be quite non-uniform provided that the categorisation probability distribution is such that the probability of having selected any one object times the conditional probability of picking a category given the object works out as uniformly distributed.

OK so with this in mind I'd like to consider other examples of the birthday problem.
Start with a binary classification.  Say boys and girls - probability examples often  make the approximating assumption that they're equally likely in the population. Let's say we pick two people at random.  What are the chances that the two people are of the same sex?  Clearly this ought to be $\frac{1}{2}$.  Let's see:  $\frac{2\times 1}{(2^2)(0!)}$.

How about good old dice.  I have a game with a single die.  A play fee of $1$ is associated with the game where the player throws 3 times in a row and if the same number comes up you win.  If you were going to devise this as a break-even profitable game, what reward ($r$) can you afford to give to the gambler?  $1- \frac{6 \times 5 \times 4}{6^3}$ or $1-\frac{120}{216}$ which equals $0.444$ or 44.4%.  So $0.444 \times r = 1$.  Hence $r=2.25$ is a break-even level of reward.

A ticket machine allocates tickets for passengers to travel on a 20 carriage train such that it allocates the passenger to a carriage uniformly.  5 Drunken Irish men buy tickets.  What are the chances that at least two of them end up in the same carriage, at which point they will start singing sad ballads loudly together?  The answer is $1-\frac{20 \times 19 \times 18 \times 17 \times 16}{20^5}$ or $1-\frac{1,860,480}{3,200,000}$ or $1-0.5814$ or $0.42$ approximately, a 42% chance.

You have in your hands a digital contract.  And a fraudulent one which benefits you.  You have also access to the same hash function used by some cryptographic community - e.g. a bitcoin pay hash or a pgp hash.  Now in a legitimate process, this hashed contract would be signed by the patsy's private key (which you don't have).  So what you do is make a large number of informationally identical but subtly different copies of the legitimate contract and a large number of similar copies of your putative fraudulent contract.  Keep producing new alternatives on both sides until finally you find one such that both the legitimate and the illegitimate evaluate to precisely the same hash.  The collision you found is much easier than you might imagine, thanks to the birthday problem.  Once found, you submit the legitimate contract to the patsy.  The patsy signs this.  You scrape away the digital signature and attach it to the same-hash-producing illegitimate contract, benefiting you.  Present this for payment and you will have exploited the patsy.

You have to estimate the population of sparrows in a woodland area but you're on a budget.  You catch 100 sparrows, tag them with 'TPS' for 'The Persistent Struggle' then release them again.  A week later you return and you catch another 200.  Of these 200, 6 had TPS tags (3% or 0.03).  So you estimate the population in this woodland as 100/0.03 or 3,333 sparrows.

The following link contains some history on the birthday problem.