Markov’s inequality is a basic tool for putting upper bounds on tail probabilities. To derive it, suppose $X$ is a non-negative random variable. We want to put an upper bound on the tail probability $P(X \ge t)$, where $t > 0$. We do this by observing that

$$ t\,\mathbf{1}_{\{X \ge t\}} \le X, \tag{1} $$

where $\mathbf{1}_{\{X \ge t\}}$ is the indicator function which takes the value 1 when $X \ge t$ and the value 0 otherwise. (It is obvious that (1) above holds, simply by the definition of this indicator function.) Taking expectations of both sides of (1) we get

$$ t\,E\bigl[\mathbf{1}_{\{X \ge t\}}\bigr] \le E[X]. $$
But $E[\mathbf{1}_{\{X \ge t\}}] = P(X \ge t)$ (since the expectation of an indicator variable is just the probability of the corresponding event), so we can write

$$ t\,P(X \ge t) \le E[X], $$

and so

$$ P(X \ge t) \le \frac{E[X]}{t}. \tag{2} $$

This is Markov’s inequality.
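As a quick sanity check (a minimal numerical sketch, not part of the derivation, assuming NumPy is available), the following snippet compares the empirical tail probability of a unit-mean exponential variate with the Markov bound (2):

```python
import numpy as np

# Monte Carlo check of Markov's inequality P(X >= t) <= E[X] / t,
# using an exponential variate with E[X] = 1 as the example.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)

for t in [0.5, 1.0, 2.0, 4.0]:
    tail = np.mean(x >= t)   # empirical P(X >= t)
    bound = x.mean() / t     # Markov upper bound E[X] / t
    print(f"t={t}: tail={tail:.4f} <= bound={bound:.4f}")
```

Note that for $t \le E[X]$ the bound exceeds 1 and is trivially true; Markov’s inequality only has force for $t > E[X]$.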
It can be modified, and thereby made sharper for particular situations, by considering a non-decreasing and non-negative function $\phi$ defined on an interval $I$ containing the values of $X$. Then we have, for any $t \in I$ with $\phi(t) > 0$,

$$ P(X \ge t) \le P\bigl(\phi(X) \ge \phi(t)\bigr) \le \frac{E[\phi(X)]}{\phi(t)}. \tag{3} $$

As an example of applying this modified version, note that we obtain Chebyshev’s inequality by using the function $\phi(x) = x^2$ and defining $X = |Z - E[Z]|$ for a random variable $Z$ with finite variance. Putting these in (3) we get

$$ P\bigl(|Z - E[Z]| \ge t\bigr) \le \frac{E\bigl[(Z - E[Z])^2\bigr]}{t^2} = \frac{\operatorname{Var}(Z)}{t^2}. $$
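The same kind of numerical check works here (again a sketch assuming NumPy; the standard normal has $E[Z] = 0$ and $\operatorname{Var}(Z) = 1$):

```python
import numpy as np

# Monte Carlo check of Chebyshev's inequality
# P(|Z - E[Z]| >= t) <= Var(Z) / t**2 for a standard normal Z.
rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)

for t in [1.0, 2.0, 3.0]:
    tail = np.mean(np.abs(z - z.mean()) >= t)
    bound = z.var() / t**2
    print(f"t={t}: tail={tail:.4f} <= bound={bound:.4f}")
```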
More generally, we can get a version of Markov’s inequality involving the Laplace transform of a random variable $Z$ by specifying $\phi(x) = e^{\lambda x}$, where $\lambda$ is a positive number. In this case, noting that $P(Z \ge t) = P(e^{\lambda Z} \ge e^{\lambda t})$ since the function $x \mapsto e^{\lambda x}$ is monotonically increasing, Markov’s inequality implies

$$ P(Z \ge t) \le \frac{E[e^{\lambda Z}]}{e^{\lambda t}}. \tag{4} $$

So now the Laplace transform of $Z$ appears in the upper bound.
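Each choice of $\lambda > 0$ gives a valid bound in (4), and some choices are much better than others. Here is a small illustration (a sketch, using the known fact that $E[e^{\lambda Z}] = e^{\lambda^2/2}$ for a standard normal $Z$):

```python
import numpy as np

# The bound (4) for a standard normal Z at t = 2, for several lambda.
# E[exp(lambda * Z)] = exp(lambda**2 / 2) for a standard normal variate.
t = 2.0
for lam in [0.5, 1.0, 2.0, 3.0]:
    bound = np.exp(lam**2 / 2 - lam * t)   # E[e^{lam Z}] / e^{lam t}
    print(f"lambda={lam}: bound={bound:.4f}")
# The exact tail P(Z >= 2) is about 0.0228; of these choices, lambda = 2
# gives the tightest bound (e^{-2}, about 0.1353).
```

Choosing the best $\lambda$ is exactly the subject of the Cramér-Chernoff method discussed next.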
The so-called Cramér-Chernoff bounding method determines the best bound on a tail probability that one can obtain by using Markov’s inequality with an exponential function $\phi(x) = e^{\lambda x}$. (This is one of the ways to understand intuitively how the entropy function in large deviation theory arises – see below.) To explain this, begin by writing (4) above as

$$ P(Z \ge t) \le e^{-\lambda t}\,E[e^{\lambda Z}]. \tag{5} $$

Since this inequality holds for all $\lambda > 0$, one may choose $\lambda$ to minimise the upper bound in (5). Define the log of the Laplace transform as

$$ \psi(\lambda) = \log E[e^{\lambda Z}] \tag{6} $$

for all $\lambda$ for which the expectation is finite; note that $\psi(0) = 0$. Then we can write (5) as

$$ P(Z \ge t) \le e^{-(\lambda t - \psi(\lambda))}. \tag{7} $$

Since we are free to replace the right-hand side of (7) by its minimum over $\lambda \ge 0$ (at $\lambda = 0$ the bound is trivially true), we can do so, and we thus obtain Chernoff’s inequality

$$ P(Z \ge t) \le e^{-\psi^*(t)}, \tag{8} $$

where

$$ \psi^*(t) = \sup_{\lambda \ge 0}\,\bigl(\lambda t - \psi(\lambda)\bigr); \tag{9} $$

the right-hand side of (9) is known as the Cramér transform of (6). A straightforward argument, given in the Appendix below, shows that for $t \ge E[Z]$ we can extend the supremum in (9) over all $\lambda \in \mathbb{R}$ in the definition of the Cramér transform:

$$ \psi^*(t) = \sup_{\lambda \in \mathbb{R}}\,\bigl(\lambda t - \psi(\lambda)\bigr). \tag{10} $$
The expression on the right-hand side of (10) is known as the Fenchel-Legendre dual function of $\psi$, and is exactly the Fenchel-Legendre transform that is used in large deviation theory. As discussed in the Appendix below, the Fenchel-Legendre transform will coincide with the Cramér transform at every $t \ge E[Z]$, while the Cramér transform equals zero for every $t \le E[Z]$.
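These two facts can be seen numerically (a sketch assuming NumPy; for a standard normal $Z$ one has $\psi(\lambda) = \lambda^2/2$, $E[Z] = 0$, and the Fenchel-Legendre transform works out to $t^2/2$):

```python
import numpy as np

# Compare the Cramér transform (supremum over lambda >= 0, eq. (9)) with
# the Fenchel-Legendre transform (supremum over all real lambda, eq. (10))
# for a standard normal Z, where psi(lambda) = lambda**2 / 2 and E[Z] = 0.
def psi(lam):
    return lam**2 / 2

lams_pos = np.linspace(0, 10, 100_001)
lams_all = np.linspace(-10, 10, 200_001)

for t in [-1.0, 0.0, 1.0, 2.0]:
    cramer  = np.max(lams_pos * t - psi(lams_pos))
    fenchel = np.max(lams_all * t - psi(lams_all))
    print(f"t={t}: Cramer={cramer:.4f}  Fenchel-Legendre={fenchel:.4f}  "
          f"t^2/2={t*t/2:.4f}")
```

The output shows the two transforms agreeing (with value $t^2/2$) for $t \ge 0 = E[Z]$, while for $t < 0$ the Cramér transform is zero and the Fenchel-Legendre transform is not.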
If $\psi$ is differentiable, the Cramér transform can be computed by differentiating $\lambda t - \psi(\lambda)$ with respect to $\lambda$. The optimising value $\lambda_t$ of $\lambda$ is found by setting the derivative with respect to $\lambda$ equal to zero, that is

$$ \frac{d}{d\lambda}\bigl(\lambda t - \psi(\lambda)\bigr) = t - \psi'(\lambda) = 0, \tag{11} $$

where $\lambda_t$ is such that

$$ \psi'(\lambda_t) = t. $$

The strict convexity of $\psi$ implies that $\psi'$ has an increasing inverse $(\psi')^{-1}$, so $\lambda_t = (\psi')^{-1}(t)$ and

$$ \psi^*(t) = t\,(\psi')^{-1}(t) - \psi\bigl((\psi')^{-1}(t)\bigr). \tag{12} $$
This formula can be used to compute the Cramér transform explicitly in simple cases. As an example, consider an exponential variate $X$ with probability density

$$ f(x) = \theta e^{-\theta x}, \qquad x \ge 0. \tag{13} $$

Then the log of the Laplace transform of $X$ is

$$ \psi(\lambda) = \log\frac{\theta}{\theta - \lambda}, \qquad \lambda < \theta. \tag{14} $$

Therefore

$$ \psi'(\lambda) = \frac{1}{\theta - \lambda}, \qquad \text{so that} \qquad \lambda_t = \theta - \frac{1}{t}, $$

and so, for $t \ge 1/\theta = E[X]$,

$$ \psi^*(t) = \lambda_t t - \psi(\lambda_t) = \theta t - 1 - \log(\theta t). $$
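This closed form is easy to verify numerically (a sketch assuming NumPy; $\theta = 2$ is an arbitrary choice):

```python
import numpy as np

# Verify psi*(t) = theta*t - 1 - log(theta*t) for an exponential variate by
# maximising lambda*t - psi(lambda) directly over a grid of lambda < theta,
# and compare the Chernoff bound exp(-psi*(t)) with the exact tail exp(-theta*t).
theta = 2.0
lams = np.linspace(0.0, theta - 1e-6, 200_001)   # psi is finite for lambda < theta
psi = np.log(theta / (theta - lams))

for t in [1.0, 2.0, 4.0]:                        # each t > E[X] = 1/theta = 0.5
    numeric = np.max(lams * t - psi)
    closed  = theta * t - 1 - np.log(theta * t)
    print(f"t={t}: numeric={numeric:.4f}  closed={closed:.4f}  "
          f"Chernoff={np.exp(-closed):.4f}  exact tail={np.exp(-theta * t):.4f}")
```

As expected, the grid maximum matches the closed form, and the Chernoff bound lies above the exact tail $e^{-\theta t}$.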
Appendix
The Cramér transform on the right-hand side of (9) is a non-negative function, because the quantity $\lambda t - \psi(\lambda)$ being maximised is identically zero for all $t$ when $\lambda = 0$ (since $\psi(0) = \log E[e^{0 \cdot Z}] = 0$), so the supremum can always be chosen to be at least zero. Since the exponential function is convex, Jensen’s inequality applies to it, and we can write, for any random variable $W$,

$$ e^{E[W]} \le E[e^{W}]. \tag{15} $$

Therefore

$$ E[W] \le \log E[e^{W}]. \tag{16} $$

Applying this to our case, with $W = \lambda Z$, we get

$$ \lambda E[Z] \le \psi(\lambda). \tag{17} $$

Therefore, we will have for all negative values of $\lambda$ that

$$ \lambda t - \psi(\lambda) \le 0 $$

whenever $t \ge E[Z]$ (because $\lambda t \le \lambda E[Z]$ when $\lambda < 0$, so (17) will then hold a fortiori with $\lambda t$ in place of $\lambda E[Z]$). But the supremum on the right-hand side of (9) is at least zero in any case, so adjoining these non-positive values changes nothing: we can formally extend the supremum on the right-hand side of (9) over all $\lambda \in \mathbb{R}$ in the definition of the Cramér transform, thereby obtaining the Fenchel-Legendre transform in (10). Therefore, as stated in the main text, the Fenchel-Legendre transform will coincide with the Cramér transform at every $t \ge E[Z]$.
To see that the Cramér transform is zero whenever $t \le E[Z]$, observe that this inequality implies $\lambda t \le \lambda E[Z]$ for every $\lambda \ge 0$, but by Jensen’s inequality (17) we will then have $\lambda t \le \lambda E[Z] \le \psi(\lambda)$, so $\lambda t - \psi(\lambda) \le 0$ for all $\lambda \ge 0$. Since the supremum can always be chosen to be at least zero (by taking $\lambda = 0$), the Cramér transform will indeed be exactly zero in this case. (Note that this argument does not apply to negative $\lambda$, which is why it is the Cramér transform, and not the full Fenchel-Legendre transform, that vanishes below the mean.)
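A final numerical check of the appendix claims (a sketch assuming NumPy, again using the exponential variate with $\theta = 2$, so $E[Z] = 1/2$):

```python
import numpy as np

# Check (i) Jensen's bound psi(lambda) >= lambda * E[Z] for all lambda (eq. (17)),
# and (ii) lambda*t - psi(lambda) <= 0 for negative lambda whenever t >= E[Z],
# for an exponential variate with theta = 2, so E[Z] = 0.5.
theta, mean = 2.0, 0.5
lams = np.linspace(-10.0, theta - 1e-6, 200_001)
psi = np.log(theta / (theta - lams))              # finite for lambda < theta

print("eq. (17) holds on the grid:", np.all(psi >= lams * mean))

neg = lams < 0
for t in [0.5, 1.0, 3.0]:                         # each t >= E[Z]
    worst = np.max(lams[neg] * t - psi[neg])      # should be <= 0
    print(f"t={t}: max over lambda < 0 of lambda*t - psi = {worst:.4f}")
```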
