Thursday 23 January 2020

The whole Black-Scholes formula derivation in one post

First some intuition on the journey through.  A call option is a security whose value is somehow dependent on the price of the underlying and it is our job to create a model for how to value it.  It is so strongly dependent on the price of the underlying that I am even prepared to say that there is a single source of uncertainty here which both of these instruments share.  Imagine you're chaperoning your drunk friend home at the end of the night.  He staggers around on the street, giving perhaps too much of his money to some strangers he meets, stealing flowers from pretty girls, asking for cigarettes from smokers.  You chase behind him, apologetic, retrieving the money so foolishly given away, apologetically returning the cigarettes and flowers.  At the end of the night, everyone goes to bed with the items they owned at the start of your friend's stagger home.  Good job.  You just neutralised him, clinically, precisely, completely.  His behaviour was a single source of randomness there, which explained and motivated both his own and your own behaviour.   In just the same way, a portfolio that is long some quantity of a call option and short some quantity of the underlying stock could end the night with no change in financial value (transaction costs, among other things, are assumed to be zero).  You could make matters worse, of course, and copy the drunk friend, leveraging his social inadequacies.  And in a similar way, owning a portfolio of long some number of call options and long some number of stocks would amplify your returns compared to how you might have performed if you just owned the stock. Imagine further that the net set of objects he collected on his drunken wander had a value, and this value was called his drift.  One Friday, his drift might amount to a gain of 10 dollars and the next Friday, a loss of 3 dollars.  But given this perfect chaperone role you perform week in week out, none of us care what the actual drift is, since you do so lovely a job of neutralising it.  This act of not caring what the estimate of the drift value is turns out to be the part of the puzzle which eluded Black and Scholes for years, until it was suggested to them by Merton.

First we need to define how a random, time-evolving process might be described.  In fact, following Thiele, Bachelier, Einstein and Wiener, we use the following formula to say, in essence, that we would like to build a model of the return on a stock price as a steady drift together with a random (normally distributed) piece of noise overlaid on top, and it can be thought of as shaped Brownian motion with a drift overlay:
$\frac{dS}{S} = \mu dt + \sigma dz$

The noise element, $dz$, is the novel part here; it represents a sequence of normally distributed increments, each with a mean of zero and a variance of $dt$.  
$ dz = \epsilon \sqrt{dt} $
Where $ \epsilon $ represents a draw from a standard normal distribution (mean zero, unit variance); $z$ itself is the pure Wiener process.  To this we scale the variance by $ \sigma^2 $, the variance of our stock.  This stock is risky, so it must also have a positive expected return (according to CAPM, otherwise why would anyone want to own it).  This is a time process as well as being random in some additional dimension, and we know that variance is additive in time.  A characteristic attribute of $dz$ is that any and all future random steps are independent of any previous step.  So by stating the above, you're deciding that stock returns could be somewhat accurately modelled as a normal distribution with a specific expected drift, $ \mu $, and a variance $ \sigma^2 $.  Think of the above equation as saying that stock price returns are distributed normally, that is,
$\frac{\delta S}{S} \sim \phi(\mu \, \delta t, \sigma \sqrt{\delta t}) $
and that this is operationally achieved by decomposing the process into a pure random element, isolating it, if you like, and a drift component, an instantaneous expected mean value.  So $\mu$ is the gradient of a straight line, that line being a non-deviating march directly from here to there for the stock price.  Imagine a little tin toy mouse is wound up (the degree of winding up corresponds to how long the mouse will move for) and starts in the corner of an empty room.  You pointing the mouse's nose in a particular direction is akin to you determining the drift component.  Then you let it go, and its path is pretty much determined.  You could draw a line on that floor and this would be the line the mouse follows.  But imagine a randomising element being applied to the wheels.  These are just as likely to jig the mouse momentarily left or right with equal probability.  Now, at the outset you aren't as confident about the precise location of the mouse, but your estimate of his general direction is still the same, if you were forced to guess.  That mouse is still roughly following the same heading.  Also, if you are forced to estimate a location for the mouse, and the experimenter offers you the choice of guessing its location soon after the start of the process or quite near the end of the process, pick the former.  There's clearly less randomness that has accumulated then.  The more that time elapses, the more chances there are for random events to occur, making it much harder for you to guess the location correctly.  Variance is additive, and specifically in this case, variance is additive in time.  You also don't want the wheels to jig the mouse too violently, as this too makes your prediction job harder.  This is just another way of saying that volatility, like time, makes your prediction job harder.  The more volatility and time there is, the harder the life of a predictor is.
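
To make the return model concrete, here is a minimal numpy sketch (the drift, volatility and time step values are purely illustrative assumptions): it draws a large number of one-period returns $\mu \delta t + \sigma \sqrt{\delta t}\, \epsilon$ and checks that their sample mean and standard deviation come out as $\mu \delta t$ and $\sigma \sqrt{\delta t}$.

```python
import numpy as np

# Minimal sketch of the discretised return model dS/S = mu*dt + sigma*sqrt(dt)*eps.
# The parameter values (mu, sigma, dt) are illustrative assumptions, not from the post.
rng = np.random.default_rng(0)

mu, sigma = 0.10, 0.20            # assumed annual drift and volatility
dt = 1 / 252                      # one trading day as a fraction of a year

eps = rng.standard_normal(1_000_000)            # the pure noise: standard normal draws
returns = mu * dt + sigma * np.sqrt(dt) * eps   # one-period returns under the model

print("sample mean :", returns.mean(), " target mu*dt         :", mu * dt)
print("sample stdev:", returns.std(),  " target sigma*sqrt(dt):", sigma * np.sqrt(dt))
```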

OK so where do we go from here?  Well, we have just set a model for returns, but going back to the original problem, the relationship isn't between a call price and a stock return, no, it is between a call price and its underlying stock price.   So how do we go from a model which says returns are normally distributed to a model which describes the price itself?  Easy, just multiply both sides by $ S $ to give
$dS = \mu S dt + \sigma S dz$
As a side note, this is now called an Ito process, since the drift and variance are both expressed as functions of the underlying, rather than as absolutely static constants, and it is also sometimes called geometric Brownian motion, since price changes are scaled with the current level of the stock; they are still scalars 'in the present moment' since, at any time, you are expected to know what the stock price or stock volatility is (i.e. they are assumed known or at least estimable and are most certainly not considered to be random over the little chink of time $dt$ which the model considers).   This subtle move from a Wiener to an Ito process means that the chaperone must be continually alert to changes in his friend's behaviour - if his friend followed a Wiener process, then the chaperone could have a firmer idea at the outset of how busy he was going to be on average.  Now that we have a model for how a stock price changes over a short period of time, we can ask the major question - how can we describe the movement of the call price changes so that it is fully in terms of the stock price changes?
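
And here is an equally minimal sketch of the Ito process itself, stepping a single price path forward with Euler increments; the starting price, drift, volatility and one-year daily grid are again assumptions made only for illustration.

```python
import numpy as np

# Sketch of the Ito process dS = mu*S*dt + sigma*S*dz, stepped forward with Euler
# increments.  Starting price, drift, volatility and the daily grid are assumptions
# made purely for illustration.
rng = np.random.default_rng(1)

S0, mu, sigma = 100.0, 0.10, 0.20
steps, T = 252, 1.0
dt = T / steps

S = np.empty(steps + 1)
S[0] = S0
for i in range(steps):
    dz = np.sqrt(dt) * rng.standard_normal()              # dz = eps * sqrt(dt)
    S[i + 1] = S[i] + mu * S[i] * dt + sigma * S[i] * dz  # dS = mu*S*dt + sigma*S*dz

print("path starts at", S[0], "and ends at", round(S[-1], 2))
```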

Now, we'd like our chaperone behaviour to be modelled mathematically as being a function of his friend's behaviour.  The so-called chain rule, which many of us learned in school,  shows us how a small change in an underlying function gets geared into a small change in the outer function; it states that:
$ \frac{d}{dt}[C(S(t))] = C'(S(t))\times S'(t) $
However, this won't work directly since the friend's behaviour is random (as are stock price returns, according to our primary model choice) and the chain rule is only for smooth functions.  Random functions are not mathematically differentiable in the same way.  So we have hit a brick wall.  We can't directly use the chain rule to see how our call price (C) might move as a function of the stock price (S).  What next?    The Taylor polynomial.  Why?  Because this approximation helps Ito to create a rather fearsome version of the chain rule for the world of random processes.  And once we have the chain rule, we can progress with our challenge to express the somewhat random movement of the chaperone (the call option) in terms of the somewhat random movement of the drunk friend (the stock).

The Taylor polynomial for some function $C(S,t)$ up to and including second order terms is
$\delta C = \frac{\partial C}{\partial S} \delta S + \frac{\partial C}{\partial t} \delta t + \frac{1}{2} \frac{\partial^2 C}{\partial S^2} \delta S^2  + \frac{1}{2} \frac{\partial^2 C}{\partial t^2} \delta t^2 + \frac{\partial^2 C}{\partial S \partial t} \delta S \delta t + \ldots $

As we take both $\delta S$ and $\delta t$ closer to 0, we are usually happy, in the familiar world of non-random functions, to drop all second order (including cross) terms, which leaves something like the familiar school chain rule
 $ \delta C \simeq \frac{\partial C}{\partial S} \delta S + \frac{\partial C}{\partial t} \delta t  $
But in our case we know that $dS = \mu S dt + \sigma S dz$.  So we need to be careful when disposing of those second order terms.  $\delta t^2$ can uncontroversially go to 0 in the limit, so any term with $\delta t^2$ or $\delta t \delta S$ can be dropped, leaving a question over what to do with the term for $\delta S^2$.  Ito helpfully points out that $dS = \mu S dt + \sigma S dz$ is really the same, over a small discrete timeframe, as $\delta S = \mu S \delta t + \sigma S \epsilon \sqrt{\delta t}$

Let's square the discretised formula: $\delta S^2 = \mu^2 S^2 \delta t^2 + 2 \mu \sigma S^2 \epsilon \, \delta t^{\frac{3}{2}} + \sigma^2 S^2 \epsilon^2 \delta t$, and again we can throw away any term where $\delta t$ appears at a power higher than one.  This leaves us with $\delta S^2 \simeq \sigma^2 S^2 \epsilon^2 \delta t$. 

What do we know about $\epsilon$?  It has a mean of zero, $E[\epsilon]=0$, and a variance of 1, by construction of a Wiener process.  We know the alternative definition of variance as $E[\epsilon^2]-{E[\epsilon]}^2$ and that this equals 1.  So $E[\epsilon^2]$ must also be 1 ($1-0=1$).  Moreover, the variance of $\epsilon^2 \delta t$ is itself of order $\delta t^2$, so in the limit the term $\sigma^2 S^2 \epsilon^2 \delta t$ stops being random at all and can be replaced by its expected value, $\sigma^2 S^2 \delta t$.
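
A quick numerical sketch of why this works: over a fixed horizon, the sum of squared Wiener increments settles on the length of the horizon itself as the step size shrinks, which is exactly the claim that $\epsilon^2 \delta t$ behaves like $\delta t$.  The step counts below are arbitrary.

```python
import numpy as np

# Over a fixed horizon T = 1, sum the squared Wiener increments (eps*sqrt(dt))**2 for
# ever finer steps.  The sum settles on T itself with less and less scatter, which is
# why eps**2 * dt can safely be replaced by dt.  Step counts are arbitrary choices.
rng = np.random.default_rng(2)

T = 1.0
for n in (10, 100, 10_000, 1_000_000):
    dt = T / n
    dz = rng.standard_normal(n) * np.sqrt(dt)
    print(f"n = {n:>9}: sum of dz^2 = {np.sum(dz**2):.4f}   (target {T})")
```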

What a piece of magic.  We are now armed with a chain rule for functions which have a stochastic element: $\delta C = \frac{\partial C}{\partial S} \delta S + \frac{\partial C}{\partial t} \delta t + \frac{1}{2} \frac{\partial^2 C}{\partial S^2} \sigma^2 S^2 \delta t   $

This little step is quite a milestone.  It has three terms: the first two are just the usual first order terms for the chain rule in two variables, then comes the little monster, the second order term, which ends up being first order in $\delta t$ and which brings with it a second derivative in $S$.

Next we realise we already know that $dS = \mu S dt + \sigma S dz$ so we can do a substitution right away $dC = \frac{\partial C}{\partial S} (\mu S dt + \sigma S dz) + \frac{\partial C}{\partial t} dt + \frac{1}{2} \frac{\partial^2 C}{\partial S^2} \sigma^2 S^2  dt   $

When this is tidied up you get Ito's lemma, which is really the chain rule with a bit of Wiener process magic: because variance is additive in time, standard deviation grows with the square root of time, and when that appears in a second order term of a Taylor polynomial it switches allegiance from $dz$ to $dt$, in a way.   This has partly helped us eliminate an element of randomness from the proceedings, but now, thanks largely to Black and Merton, we are going to eliminate the remaining element of randomness.

So we have Ito telling us that $dC =  (\frac{\partial C}{\partial S} \mu S + \frac{\partial C}{\partial t} + \frac{1}{2} \frac{\partial^2 C}{\partial S^2} \sigma^2 S^2) dt  + \frac{\partial C}{\partial S} \sigma S dz$ and now we consider a portfolio consisting of short one call and long some known amount of stock, the amount cleverly chosen to be $\frac{\partial C}{\partial S}$.  We don't yet know what that instantaneously constant value is, but it is some value which we shall use as a fraction.  So our portfolio can be described as $\Pi = \frac{\partial C}{\partial S} S - C$ and correspondingly the change in the portfolio value is $\delta \Pi = \frac{\partial C}{\partial S} \delta S - \delta C$.  When we substitute in for $\delta S$ and $\delta C$ using that specially chosen ratio, the $dz$ terms cancel out perfectly, leaving $\delta \Pi = (- \frac{\partial C}{\partial t}  - \frac{1}{2} \frac{\partial^2 C}{\partial S^2} \sigma^2 S^2) \delta t$
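
If you want to see the cancellation happen mechanically, here is a small symbolic sketch using sympy, with the partial derivatives of $C$ treated as abstract symbols (the names C_S, C_t and C_SS are local assumptions for this illustration only):

```python
import sympy as sp

# Symbolic sketch of the hedged portfolio.  The partial derivatives of C are treated
# as abstract symbols (C_S, C_t, C_SS are names assumed here for illustration only).
S, mu, sigma, dt, dz = sp.symbols('S mu sigma dt dz')
C_S, C_t, C_SS = sp.symbols('C_S C_t C_SS')

dS = mu * S * dt + sigma * S * dz                              # the Ito process for S
dC = (C_S * mu * S + C_t + sp.Rational(1, 2) * C_SS * sigma**2 * S**2) * dt \
     + C_S * sigma * S * dz                                    # Ito's lemma applied to C

dPi = sp.expand(C_S * dS - dC)     # long C_S shares of stock, short one call
print(dPi)                         # only dt terms survive: -C_t*dt - C_SS*S**2*sigma**2*dt/2
print("coefficient of dz:", dPi.coeff(dz))   # prints 0 -- the randomness has cancelled
```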

The next step in this mammoth piece of deductive logic is to realise that this change in portfolio value has lost its randomness, and the principle of no arbitrage then implies that this riskless portfolio must earn exactly the risk free rate.  If you got more than that, you would have made money with no risk, a violation of CAPM (and of no arbitrage).  So with an instantaneous risk free rate $r$ and a little bit of time $\delta t$,  $\delta \Pi = (- \frac{\partial C}{\partial t}  - \frac{1}{2} \frac{\partial^2 C}{\partial S^2} \sigma^2 S^2) \delta t = r(\frac{\partial C}{\partial S} S - C) \delta t$.  $\delta t$ cancels on both sides and, rearranging, you get the legendary Black-Scholes partial differential equation

$\frac{\partial C}{\partial t} + rS \frac{\partial C}{\partial S} + \frac{1}{2} \sigma^2 S^2 \frac{\partial^2 C}{\partial S^2} = rC$

All that's left for Black and Scholes to do is solve it.  It turns out this is a fairly unsurprising relative of the heat equation, and it certainly can be solved by numerical methods.  For certain kinds of option it can be solved analytically.  Black got here in June 1969.  Ed Thorp was already claiming in 1967 to have worked this out himself and was making money doing it.  It was in fact Scholes who suggested constructing a 'zero beta' portfolio to Black, who by then had been sitting on the equation without having any way to solve it.  It was Merton who suggested a replicating portfolio / no arbitrage approach which didn't rely on CAPM (beta).  The shocking thing is, in constructing a pair of portfolio weights designed to get rid entirely of the randomness, the expected return on the stock also disappears.
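
Since the equation can be attacked numerically, here is a rough explicit finite-difference sketch for a European call, marching backwards from the terminal payoff on a grid of stock prices.  The grid sizes, the $S_{max}$ cutoff and the option parameters are all assumptions chosen so that this simple explicit scheme stays stable; it is a sketch, not production code.

```python
import numpy as np

# Rough explicit finite-difference sketch for the Black-Scholes PDE: march a European
# call backwards in time from its terminal payoff on a grid of stock prices.  Grid
# sizes, the Smax cutoff and the option parameters are assumptions chosen so that this
# simple explicit scheme stays stable.
S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.20, 1.0

M, N = 200, 4000                   # price steps and time steps (N kept large for stability)
Smax = 4 * K
S = np.linspace(0.0, Smax, M + 1)
dS, dt = S[1] - S[0], T / N

C = np.maximum(S - K, 0.0)         # terminal condition C(S, T) = max(S - K, 0)

for n in range(N):                 # step backwards from T towards 0
    t = T - (n + 1) * dt
    C_SS = (C[2:] - 2 * C[1:-1] + C[:-2]) / dS**2         # second derivative in S
    C_S = (C[2:] - C[:-2]) / (2 * dS)                      # first derivative in S
    C[1:-1] = C[1:-1] + dt * (0.5 * sigma**2 * S[1:-1]**2 * C_SS
                              + r * S[1:-1] * C_S - r * C[1:-1])
    C[0] = 0.0                                             # a call is worthless at S = 0
    C[-1] = Smax - K * np.exp(-r * (T - t))                # deep in the money: S - PV(K)

print("PDE price at S0 = 100:", np.interp(S0, S, C))       # should land near 10.45
```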

At this important juncture it is worth reflecting that no limit or constraint was placed on the interpretation of $C$ above, so the resulting partial differential equation will hold for any option on $S$.  But, in order to solve it, we now have to get specific about which options we would like to solve for, and this will be easier for some than for others.  And, by easier, the ideal end goal is to solve the PDE analytically.  Luckily, for European call options an analytic solution exists.

Now, the consequence of $\mu$ not appearing in the Black-Scholes equation is fully exploited.  We live in a single world, the one where stock prices are risky, and hence where the unknown but estimable $\mu$ is the return a stock has to pay an investor such that they are happy to take on the risk of that stock.  In CAPM, the risk premium, of course, is $\mu -r$, and this, when properly diversified across a sufficient number of stocks, will indicate the equity risk premium.  And in this world, the Black-Scholes equation must hold.  But there are other worlds, and in all of those other worlds, the Black-Scholes equation must hold too.  Since it doesn't need $\mu$ it doesn't care for the expectations of investors.  They could demand 200% return on stocks, or -10% or even the risk free rate.  It doesn't matter.  In all of those worlds of varying investor preferences, the equation must still hold true.  So the thought emerges, why not make life easier, why not assume that the fundamental stock process itself didn't have $\mu$ in it at all, but instead $r$, the risk free rate.  As a further simplifying factor, this risk free rate is assumed to be a constant throughout the life of the option.  Both of these factors assisted in producing an analytical solution to the case of a European call.

This is such an alien world, the risk neutral world.  All assets, in this world, return the same as treasuries: Amazon, options on Netflix, corporate bonds on Disney.  How strange.  But this world, as one might imagine, is a world which is just that little bit simpler to process mathematically.  The randomness is still there, the disparate volatilities are still there, but the drift element is $r$ for all of those stock processes.  But to step back a little, let's just look at the complexity of the issue of finding a formula for the price of a call option in the absence of the simplifying assumptions which the Black-Scholes equation makes possible.

In general, the problem facing Black at the outset, and which faced Samuelson too, is how to solve $C_t = E_t[e^{- \int_{t}^{T} r_u du} \max(0, S_T - K)]$ where $T$ is the maturity of the call and $K$ is the strike price.  $\max(0,S_T - K)$ is the terminal value of the call in the instant it expires, some time way out in the future, at $T$, and we want to find its expected value (in this risky world).  We also want to discount that expected value back using the risk free rate.

The first simplifying step comes from the Black-Scholes assumptions, namely that the risk free rate is constant for the life of the option.  With this, we can pull the discount factor out of the expectation and simplify to $C_t = e^{-r(T-t)}E_t[\max(0,S_T-K)]$.  At this point, Paul Samuelson didn't really progress beyond making assumptions about the required return on the stock and option.  


Recall that the drift term didn't appear in the Black-Scholes partial differential equation.  This means you can consider the driving stochastic equation $dS = \mu S dt + \sigma S dz$ for the movement of the changes in a stock price to be replaced by $dS = r S dt + \sigma S dz$.  In other words, if we pretended that all stocks drifted like treasuries, whilst still retaining their prior degree of randomness (which, as we have seen, through portfolio construction we can eliminate), we would still end up with the Black Scholes equation.  But the simplification might make pricing easier.  And indeed it does.

Next comes a discussion about a lognormal process.  I have found the books to be not quite as clear here in discussing what it is and why it is important.   All this time, we have been working with the returns of a stock price, modelling them as normally distributed.   But prices themselves aren't normally distributed - since prices are bounded by 0 at the lower end.  Note this isn't an additional assumption Black is making here on top of the assumption that returns are normally distributed but rather it is an inevitable mathematical consequence of how continuously compounded normal returns work when applied to a (non-negative) starting price.  

However, what justifies the use of the lognormal process mathematically?  The assumed process for the returns of a stock is a multiplicative (geometric) process, whereas the central limit theorem is about additive (arithmetic) sums.  The central limit theorem says that if you take a large number of samples from independent and identically distributed random variables, their (suitably normalised) sum, and hence their arithmetic mean, is approximately normally distributed.

But we have been talking about $\frac{dS}{S}$, which can be thought of in discrete form as $\frac{\Delta S}{S}$, and the corresponding price relative $\frac{S_{t+\Delta t}}{S_t}$ is just 1 plus this geometric return.  How do we massage it so that the central limit theorem applies?   We can consider the price relative over a longer horizon to be a multiplicative process made up of many of these smaller parts.  If we apply the log function to that product we get a sum of logs, and this sum is now amenable to the central limit theorem.  So $\log(\frac{S_{t+\Delta t}}{S_t})$ can be claimed to be normally distributed.  A lognormal distribution is skewed, so the median (driven by the geometric mean) isn't the same as the mean (the arithmetic one).  To translate between the two you need to apply a correction term.  But what is this correction term?  You can work that out by a second application of Ito's Lemma.
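
Before working out that correction term, a little simulation makes the lognormal claim concrete: multiply together many small independent gross returns and the product is visibly skewed, while its log looks normal.  The uniform shock distribution and path counts below are arbitrary assumptions, chosen only to illustrate the mechanism.

```python
import numpy as np
from scipy import stats

# Multiply together many small independent gross returns: the product (the price
# relative) is visibly right-skewed, while its log looks normal.  The uniform shock
# distribution and the path/step counts are arbitrary assumptions for illustration.
rng = np.random.default_rng(3)

shocks = rng.uniform(0.98, 1.0204, size=(20_000, 250))   # 250 small gross returns per path
price_relative = shocks.prod(axis=1)                     # S_T / S_0 for each path

print("skew of price relative:", stats.skew(price_relative))          # clearly positive
print("skew of its logarithm :", stats.skew(np.log(price_relative)))  # near zero
```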

Now, applying Ito's lemma a second time: $G = \log S, \frac{\partial G}{\partial S} = \frac{1}{S}, \frac{\partial^2 G}{\partial S^2} = -\frac{1}{S^2}, \frac{\partial G}{\partial t} = 0$, so $dG = (\mu - \frac{\sigma^2}{2})dt + \sigma dz$.  As you can see, the arithmetic to geometric adjustment factor turns out to be $-\frac{\sigma^2}{2}$.  Notice that $dG$, in English, is just the change from one moment to the following moment in the logarithm of the price of the stock.  I.e. for times $t$ and later $T$, this difference is $\log S_T - \log S_t$, so really, $\log S_T - \log S_t$ is going to be normally distributed with mean $(\mu - \frac{\sigma^2}{2})(T-t)$ and a familiar variance of $\sigma^2(T-t)$.  Just two more steps here: first, $\log S_T$ itself must have a mean of $ \log(S_t) + (\mu - \frac{\sigma^2}{2})(T-t)$; second, $\mu = r$ in a risk neutral world, so we can say the mean is $ \log(S_t) + (r - \frac{\sigma^2}{2})(T-t)$, variance still as before.
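
The $-\frac{\sigma^2}{2}$ correction can be checked numerically: step the original process $dS = \mu S dt + \sigma S dz$ on a fine grid and look at the average log return, which should come out near $(\mu - \frac{\sigma^2}{2})T$ rather than $\mu T$.  Parameters and grid sizes below are illustrative assumptions.

```python
import numpy as np

# Check the -sigma^2/2 correction numerically: step dS = mu*S*dt + sigma*S*dz on a fine
# grid and compare the average log return with (mu - sigma^2/2)*T.  All parameters and
# grid sizes are illustrative assumptions.
rng = np.random.default_rng(4)

S0, mu, sigma, T = 100.0, 0.10, 0.20, 1.0
paths, steps = 100_000, 500
dt = T / steps

S = np.full(paths, S0)
for _ in range(steps):
    dz = np.sqrt(dt) * rng.standard_normal(paths)
    S = S + mu * S * dt + sigma * S * dz

log_ret = np.log(S / S0)
print("mean log return    :", log_ret.mean(), "  target (mu - sigma^2/2)*T:", (mu - 0.5 * sigma**2) * T)
print("variance of log ret:", log_ret.var(), "  target sigma^2*T         :", sigma**2 * T)
```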

Exponentiating that normally distributed $\log S_T$, we get $S_T=S_t e^{(r-\frac{\sigma^2}{2})(T-t)+\sigma \sqrt{T-t}\, \epsilon}$, which describes how the stock price is distributed at time horizon $T$ (to correspond to the expiry of the option).  Now let's work through our 'discounted expected value' formula again, knowing what we now know: $C_t = e^{-r(T-t)}E_t[\max(0,S_T-K)]$

First of all, to make it clear this isn't an expectation over the real-world probability measure $\mathcal{P}$ but rather one under the risk neutral measure, call it $\mathcal{Q}$, we write $C_t = e^{-r(T-t)}E^{\mathcal{Q}}_{t}[\max(0,S_T-K)]$.  We now know what the distribution of $S_T$ is, thanks to the lognormal reasoning above, so we insert that in to get:


$C_t = e^{-r(T-t)}E^{\mathcal{Q}}_{t}[\max(0,S_t e^{(r-\frac{\sigma^2}{2})(T-t)+\sigma \sqrt{T-t}\, \epsilon}-K)]$
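
This expectation is also exactly the thing a Monte Carlo estimate can chew on directly: draw $\epsilon$ from a standard normal, build $S_T$, average the payoff and discount at $r$.  A minimal sketch, with illustrative parameter values:

```python
import numpy as np

# Monte Carlo sketch of the discounted risk-neutral expectation above: draw eps from a
# standard normal, build S_T from the lognormal expression, average the payoff and
# discount at r.  Parameter values are illustrative assumptions.
rng = np.random.default_rng(5)

S_t, K, r, sigma, tau = 100.0, 100.0, 0.05, 0.20, 1.0   # tau = T - t

eps = rng.standard_normal(2_000_000)
S_T = S_t * np.exp((r - 0.5 * sigma**2) * tau + sigma * np.sqrt(tau) * eps)
payoff = np.maximum(S_T - K, 0.0)

price = np.exp(-r * tau) * payoff.mean()
print("Monte Carlo call price:", round(price, 3))   # roughly 10.45 for these parameters
```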

Back to the algebra: yes, that expectation looks horrible, but it quickly simplifies down.  First, you only get paid on condition that $S_T \ge K$, so we can zero in on that condition.  All the other possible values of the stock price, those less than the strike, result in a terminal value of 0, so regardless of their probability, their contribution to the overall expectation will be 0.  In other words, when we come to write out the expectation as an integral, we will be looking to break the $\int_{-\infty}^{\infty}$ bounds into the region which is valueless and the region which is valuable, and concentrate solely on the valuable region.  But where is that dividing line?

So  $S_t e^{(r-\frac{\sigma^2}{2})(T-t)+\sigma \sqrt{(T-t)} \epsilon} \ge K$ or  $ e^{(r-\frac{\sigma^2}{2})(T-t)+\sigma \sqrt{(T-t)} \epsilon} \ge \frac{K}{S_t}$ and taking logs, $(r-\frac{\sigma^2}{2})(T-t)+\sigma \sqrt{(T-t)} \epsilon \ge \log(\frac{K}{S_t})$

Now, we know what $\epsilon$ is, namely a standard normal variable (for which we have lookup tables for specific values, as we do in one sided statistical tests), and here we have an inequality involving it, so we continue to rearrange to isolate $\epsilon$ and find the threshold it must exceed.  The payoff is non-zero exactly when $\epsilon$ is greater than the cutoff value
$\frac{\log(\frac{K}{S_t})-(r-\frac{\sigma^2}{2})(T-t)}{\sigma \sqrt{T-t}}$

This begins to look familiar.  Factor $-1$ out of the numerator and you get $-\frac{\log(\frac{S_t}{K})+(r-\frac{\sigma^2}{2})(T-t)}{\sigma \sqrt{T-t}}$.  We have our cutoff point and now we can return one last time to the 'discounted expected value' formula, using the shorthand $-d_2$ to refer to this cutoff point.  Why $d_2$?  Well, it is the label Black uses for this bound in the original paper.

Recall that the density of a standard normal variable is $f(x)  = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}x^2}$, so we'd like to integrate the payoff against this density, but only over the values from $-d_2$ to infinity, where the terminal stock price exceeds the strike.

$e^{-r(T-t)}  \frac{1}{\sqrt{2 \pi}} [\int_{-d_2}^{\infty} (S_t e^{(r-\frac{\sigma^2}{2})(T-t) + \sigma \sqrt{T-t}x} - K) e^{-\frac{1}{2}x^2} dx ] $
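
As a sanity check on the integral we are about to manipulate, it can be handed straight to a numerical integrator (here scipy's quad), using the same illustrative parameters as before; the answer should agree with the Monte Carlo estimate above.

```python
import numpy as np
from scipy.integrate import quad

# Hand the truncated integral above straight to a numerical integrator, with the same
# illustrative parameters as before.  The answer should agree with the Monte Carlo
# estimate, and with the closed form we are about to derive.
S_t, K, r, sigma, tau = 100.0, 100.0, 0.05, 0.20, 1.0

d2 = (np.log(S_t / K) + (r - 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))

def integrand(x):
    S_T = S_t * np.exp((r - 0.5 * sigma**2) * tau + sigma * np.sqrt(tau) * x)
    return (S_T - K) * np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

value, _ = quad(integrand, -d2, np.inf)
print("price from the integral:", np.exp(-r * tau) * value)   # about 10.45
```

Now back to doing the integral by hand.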

The final flourish is all down to the convenience of dealing with exponentials in the context of integration.  First, let's break this in two

$e^{-r(T-t)}  \frac{1}{\sqrt{2 \pi}} [ \int_{-d_2}^{\infty} S_t e^{(r-\frac{\sigma^2}{2})(T-t) + \sigma \sqrt{T-t}x}  e^{-\frac{1}{2}x^2} dx  - \int_{-d_2}^{\infty}  K e^{-\frac{1}{2}x^2} dx ] $

and that becomes

$  \frac{1}{\sqrt{2 \pi}} [ S_t \int_{-d_2}^{\infty} e^{-r(T-t)} e^{(r-\frac{\sigma^2}{2})(T-t) + \sigma \sqrt{T-t}x}  e^{-\frac{1}{2}x^2} dx  - e^{-r(T-t)} K \int_{-d_2}^{\infty}  e^{-\frac{1}{2}x^2} dx ] $

The $r(T-t)$ terms cancel in the first integral when you bring the discount factor inside it; in the second term, the integral of the standard normal density from $-d_2$ to $\infty$ is, by symmetry, the same as the integral from $-\infty$ to $d_2$, i.e. $N(d_2)$, where $N(x)$ denotes the cumulative normal distribution (for which we have tables)

$  \frac{1}{\sqrt{2 \pi}} [ S_t \int_{-d_2}^{\infty}  e^{-\frac{\sigma^2}{2}(T-t) + \sigma \sqrt{T-t}x  - \frac{1}{2}x^2}  dx ] - e^{-r(T-t)} K N(d_2)  $

Now there's a square which can be completed in that first term,

$  \frac{1}{\sqrt{2 \pi}} [ S_t \int_{-d_2}^{\infty}  e^{-\frac{1}{2}(x-\sigma \sqrt{T-t})^2 }  dx ] - e^{-r(T-t)} K N(d_2)  $

OK.  It is change of variable time for that first element.  Let $y = x - \sigma \sqrt{T-t}$.  To make this a proper $dy$ equivalent, we must also tweak the bounds.  Since $x$ previously started at $-d_2$, $y$ has to start a bit lower than that, at $-d_2 - \sigma \sqrt{T-t}$, also known as $-(d_2 + \sigma \sqrt{T-t})$, which in Black's nomenclature is referred to as $-d_1$.  So, that results in

$   S_t [\frac{1}{\sqrt{2 \pi}} \int_{-d_1}^{\infty}  e^{-\frac{1}{2}y^2 }  dy ] - e^{-r(T-t)} K N(d_2)  $

This change of variable trick lets us see that the first term is also just a read from the cumulative normal distribution, only at a slightly different point

$   S_t N(d_1) - e^{-r(T-t)} K N(d_2)  $


And this is the value now of a European call option.  Sweet Jesus.  Roll credits.  Give that man a Nobel prize.
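
For completeness, the final formula takes only a few lines of Python to evaluate (using the same illustrative parameters as the earlier sketches); it should agree with the Monte Carlo, numerical integration and PDE estimates above.

```python
import numpy as np
from scipy.stats import norm

# The final formula as a small function.  The example parameters below are the same
# illustrative assumptions used in the earlier sketches.
def black_scholes_call(S, K, r, sigma, tau):
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    return S * norm.cdf(d1) - K * np.exp(-r * tau) * norm.cdf(d2)

print(black_scholes_call(100.0, 100.0, 0.05, 0.20, 1.0))   # about 10.45
```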