If $X$ and $Y$ are random variables then their covariance is the expected value of the product of their deviations from their means. Or in mathematical form, $\sigma_{X,Y}=E[(X-E[X]) (Y-E[Y])]$. There's a lot of juice in this idea, a lot. But interpreting it can be hard, since the value's meaning depends heavily on the units of $X$ and $Y$. For example, if $X$ and $Y$ are return streams and you represent the returns as percentages (4%, 3.5%, etc.) rather than as decimal fractions (0.04, 0.035, etc.), then each variable is scaled by 100, so the covariance of the percentage series is 10,000 times larger than the covariance of the decimal series.
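As a rough illustration of the units problem, here is a minimal sketch in plain Python. The return figures are made up for the example, and the `covariance` helper is a hypothetical name that uses the population form of the definition above (dividing by $n$).

```python
def covariance(xs, ys):
    """Population covariance: mean of the products of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical return streams, expressed as decimal fractions.
x_dec = [0.04, 0.035, -0.01, 0.02]
y_dec = [0.03, 0.04, -0.02, 0.01]

# The same returns expressed in percent: each value is scaled by 100.
x_pct = [100 * x for x in x_dec]
y_pct = [100 * y for y in y_dec]

cov_dec = covariance(x_dec, y_dec)
cov_pct = covariance(x_pct, y_pct)

# Scaling both variables by 100 scales the covariance by 100 * 100.
ratio = cov_pct / cov_dec
print(cov_dec, cov_pct, ratio)
```

Up to floating-point rounding, `ratio` comes out at 10,000, even though the two series describe exactly the same returns.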
You can see that the variance is in fact just the self-covariance. That is, $\sigma_X^2 = \sigma_{X,X} = E[{(X-E[X])}^2]$. So going back to the covariance between two random variables: for a $Y$ with the same variance as $X$, the largest possible value for the covariance of $X$ and $Y$ occurs when $Y$ moves exactly like $X$, and in the extreme is in fact $X$. More generally, the Cauchy-Schwarz inequality gives $|\sigma_{X,Y}| \le \sigma_X \sigma_Y$.
A useful way to normalise covariance was presented by Auguste Bravais, an idea which Pearson championed. In it, the units of covariance are normalised away by dividing by the product of the standard deviations of the variables: $\rho_{X,Y} = \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y}$. The resulting measure, normalised covariance, which ranges from -1 to +1, has become better known as the Pearson correlation coefficient, or simply the correlation, or CORREL() in Excel (COVAR() is Excel's covariance function, not its correlation). This is easier for humans to read and comprehend, and it lets covariances from different contexts be compared and ranked. And if you are building a square variance-covariance matrix, you now know it is just a covariance matrix, since the variances on its diagonal are self-covariances. Furthermore, if you square this normalised covariance, you arrive at the familiar $R^2$ measure, the coefficient of determination. For a simple linear regression of $Y$ on $X$ with an intercept, this is also equal to the proportion of the variance explained by the model, as a fraction of the total dependent variable variance, being $\frac{\sigma_{\hat{Y}}^2}{\sigma_{Y}^2}$.
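The normalisation can be checked numerically. The sketch below, again on made-up returns and with hypothetical helper names, computes the correlation in both percent and decimal units, and then compares its square with the explained-variance ratio from a one-variable least-squares fit with an intercept (the setting in which the $R^2$ identity is assumed to hold here).

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical return streams in decimal form.
x = [0.04, 0.035, -0.01, 0.02]
y = [0.03, 0.04, -0.02, 0.01]

rho_dec = correlation(x, y)
rho_pct = correlation([100 * v for v in x], [100 * v for v in y])

# Simple least-squares fit of y on x with an intercept.
slope = covariance(x, y) / covariance(x, x)
intercept = mean(y) - slope * mean(x)
fitted = [intercept + slope * v for v in x]

# Variance of the fitted values as a fraction of the variance of y.
r_squared = covariance(fitted, fitted) / covariance(y, y)

print(rho_dec, rho_pct, rho_dec ** 2, r_squared)
```

Unlike the covariance, the correlation is the same in both unit conventions, and `r_squared` agrees with `rho_dec ** 2` up to rounding.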
If $X$ is the return stream of an equity, and $Y$ is the return of the market, then by dividing the covariance by the variance of the market return, $\sigma_Y^2$, we end up with the familiar beta of the stock, $\beta_X = \frac{\sigma_{X,Y}}{\sigma_Y^2}$. Notice how similar this is to the so-called Pearson correlation coefficient. In fact $\beta_X = \rho_{X,Y} \times \frac{\sigma_X}{\sigma_Y}$. That is to say, when you scale the correlation between the security's returns and the market's returns by the ratio of the security's volatility to the market's volatility, you get the beta. Beta as correlation times a volatility ratio: that makes sense for a beta.
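To see that the two expressions for beta agree, here is a small sketch that computes beta both ways. The `stock` and `market` return series are invented for illustration, and the helper is the same hypothetical population covariance as before.

```python
import math


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


# Hypothetical periodic returns for one stock and for the market.
stock = [0.05, -0.02, 0.03, 0.01, -0.04]
market = [0.03, -0.01, 0.02, 0.015, -0.025]

sigma_stock = math.sqrt(covariance(stock, stock))
sigma_market = math.sqrt(covariance(market, market))

# Beta as covariance over market variance.
beta_direct = covariance(stock, market) / covariance(market, market)

# Beta as correlation times the volatility ratio.
rho = covariance(stock, market) / (sigma_stock * sigma_market)
beta_via_rho = rho * sigma_stock / sigma_market

print(beta_direct, beta_via_rho)
```

The two values match up to floating-point rounding, because the $\sigma_X$ in the correlation's denominator cancels against the $\sigma_X$ in the volatility ratio.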
Finally, 3 rules:
- if $Y =V+W$ then $\sigma_{X,Y} = \sigma_{X,V} + \sigma_{X,W}$
- if $Y = b$ for a constant $b$, then $\sigma_{X,Y} = 0$
- if $Y = bZ$ for a constant $b$, then $\sigma_{X,Y} = b \times \sigma_{X,Z}$
And of course it is on the basis of the first rule, the additivity of covariance, that Sharpe makes the development from Markowitz.
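The three rules listed above can be sanity-checked on sample data. This is only a numerical sketch, not a proof: the series `x`, `v`, `w`, `z` and the constant `b` are arbitrary made-up values, and `covariance` is a hypothetical helper using the population form.

```python
def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


# Arbitrary illustrative data.
x = [0.04, 0.035, -0.01, 0.02]
v = [0.01, -0.02, 0.03, 0.00]
w = [0.02, 0.01, -0.01, 0.03]
z = [0.03, 0.04, -0.02, 0.01]
b = 2.5

# Additivity: Y = V + W.
y_sum = [a + c for a, c in zip(v, w)]
rule1_lhs = covariance(x, y_sum)
rule1_rhs = covariance(x, v) + covariance(x, w)

# Constant: Y = b has no deviations from its mean, so nothing to co-vary.
y_const = [b] * len(x)
rule2 = covariance(x, y_const)

# Scaling: Y = b * Z.
y_scaled = [b * val for val in z]
rule3_lhs = covariance(x, y_scaled)
rule3_rhs = b * covariance(x, z)

print(rule1_lhs, rule1_rhs)
print(rule2)
print(rule3_lhs, rule3_rhs)
```

Each pair agrees up to floating-point rounding, and the covariance with the constant series is zero.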