Markov’s inequality and the Cramér-Chernoff bounding method

Markov’s inequality is a basic tool for putting upper bounds on tail probabilities. To derive it, suppose Y is a non-negative random variable. We want to put an upper bound on the tail probability P(Y \geq t), where t > 0. We do this by observing that

t \cdot 1_{\{Y \ge t\}} \leq Y \cdot 1_{\{Y \ge t\}} \qquad \qquad \qquad (1)

where 1_{\{Y \ge t\}} is the indicator function which takes the value 1 when Y \geq t and the value 0 otherwise. (Inequality (1) holds immediately from the definition of this indicator: on the event \{Y \ge t\} the right-hand side is Y \geq t, and off that event both sides are zero.) Taking expectations of both sides of (1) we get

t \cdot P(Y \geq t) \leq E[Y \cdot 1_{\{Y \geq t\}}]

But

E[Y \cdot 1_{\{Y \geq t\}}] = E[Y | Y \geq t] P(Y \geq t)

\leq E[Y | Y \geq t] P(Y \geq t) + E[Y | Y < t] P(Y < t)

= E[Y]

(the middle inequality holds because Y is non-negative, so the added term is non-negative, and the final equality is the law of total expectation), so we can write

t \cdot P(Y \geq t) \leq E[Y]

and so

P(Y \geq t) \leq \frac{E[Y]}{t} \qquad \qquad \qquad (2)
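As a quick numerical illustration of (2) (not part of the derivation), the following Python sketch compares the empirical tail probability of an exponential sample with the bound E[Y]/t; the distribution, sample size and thresholds are arbitrary choices made only for this example.

```python
import numpy as np

# Monte Carlo check of Markov's inequality (2).
# The exponential distribution (mean 1), the sample size and the
# thresholds t are arbitrary choices for this illustration.
rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=1_000_000)

for t in (1.0, 2.0, 5.0):
    empirical_tail = (y >= t).mean()   # estimate of P(Y >= t)
    markov_bound = y.mean() / t        # E[Y] / t
    print(f"t = {t}: P(Y >= t) ~ {empirical_tail:.4f} <= {markov_bound:.4f}")
```

In each case the empirical tail sits comfortably below E[Y]/t, though the bound is clearly loose.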

Inequality (2) is Markov’s inequality. It can be modified, and thereby made sharper in particular situations, by considering a non-decreasing, non-negative function \phi defined on an interval containing the values of Y, with \phi(t) > 0. Then we have

P(\phi(Y) \geq \phi(t)) \leq \frac{E[\phi(Y)]}{\phi(t)} \qquad \qquad \qquad (3)

As an example of applying this modified version, note that we obtain Chebyshev’s inequality by taking \phi(t) = t^2 and Y = |Z - E[Z]| for a random variable Z with finite variance. Substituting these into (3) gives

P(|Z - E[Z]| \geq t) \leq \frac{Var[Z]}{t^2}
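A similar sketch, here checking Chebyshev’s inequality for a uniformly distributed Z (again an arbitrary choice of test distribution):

```python
import numpy as np

# Chebyshev's inequality for Z ~ Uniform(0, 1), an arbitrary test case.
rng = np.random.default_rng(1)
z = rng.uniform(0.0, 1.0, size=1_000_000)

t = 0.4
empirical = (np.abs(z - z.mean()) >= t).mean()   # P(|Z - E[Z]| >= t)
chebyshev_bound = z.var() / t**2                 # Var[Z] / t^2
print(f"P(|Z - E[Z]| >= {t}) ~ {empirical:.4f} <= {chebyshev_bound:.4f}")
```

The empirical probability (about 0.2 here) is again below the bound (about 0.52).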

More generally, we can get a version of Markov’s inequality involving the Laplace transform of the random variable Z by specifying \phi(t) = e^{kt}, where k is a positive number. In this case, noting that P(\phi(Z) \geq \phi(t)) \equiv P(Z \geq t) since \phi is monotonically increasing, Markov’s inequality implies

P(Z \geq t) \leq \frac{E[e^{kZ}]}{e^{kt}} \qquad \qquad \qquad (4)

So, now the Laplace transform of Z appears in the upper bound.

The so-called Cramér-Chernoff bounding method determines the best bound on a tail probability that can be obtained from Markov’s inequality with an exponential function \phi(t) = e^{kt}. (This is one way to understand intuitively how the entropy function in large deviation theory arises; see below.) To explain this, begin by writing (4) above as

P(Z \geq t) \leq e^{-kt} E[e^{kZ}] \qquad \qquad \qquad (5)
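To get a feel for how the bound (5) depends on the free parameter k, the sketch below evaluates it for a standard normal Z at t = 2, where E[e^{kZ}] = e^{k^2/2} is known in closed form (the choice of Z, of t and of the trial values of k is purely illustrative).

```python
import numpy as np
from scipy.stats import norm

# The bound (5) for Z ~ N(0, 1) at t = 2, for a few trial values of k.
# For a standard normal, E[e^{kZ}] = exp(k^2 / 2) in closed form.
t = 2.0
exact_tail = norm.sf(t)                # exact P(Z >= t), for comparison
for k in (0.5, 1.0, 2.0, 3.0):
    bound = np.exp(-k * t + k**2 / 2)  # e^{-kt} E[e^{kZ}]
    print(f"k = {k}: bound = {bound:.4f}   (exact tail = {exact_tail:.4f})")
```

The bound is smallest near k = 2, which is precisely the optimal value k^{\ast} = t that the Cramér-Chernoff method described next delivers for this example.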

Since inequality (5) holds for every k \geq 0, one may choose k to minimise the upper bound. Define the logarithm of the Laplace transform as

\lambda(k) = \ln E[e^{kZ}] \qquad \qquad \qquad (6)

for all k \geq 0. Then we can write (5) as

P(Z \geq t) \leq e^{-(kt - \lambda(k))} \qquad \qquad \qquad (7)

Since (7) holds for every k \geq 0, we may replace its right-hand side by its infimum over k, and we thus obtain Chernoff’s inequality

P(Z \geq t) \leq e^{-\Lambda^{\ast}(t)} \qquad \qquad \qquad (8)

where

\Lambda^{\ast}(t) \equiv \sup_{k \geq 0} (kt - \lambda(k)) \qquad \qquad \qquad (9)

defines the Cramér transform of (6). A straightforward argument, given in the Appendix below, shows that the supremum in (9) can be extended over all of \mathbb{R}:

\Lambda^{\ast}(t) \equiv \sup_{k \in \mathbb{R}} (kt - \lambda(k)) \qquad \qquad \qquad (10)

The expression on the right-hand side of (10) is known as the Fenchel-Legendre dual function of \lambda(k), and is exactly the Fenchel-Legendre transform used in large deviation theory. As discussed in the Appendix below, the Fenchel-Legendre transform coincides with the Cramér transform at every t \geq E[Z]; for t < E[Z] the restricted supremum in (9) is simply zero, so the bound (8) becomes trivial there.
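As a numerical illustration of (8)–(10), the sketch below approximates the two suprema on a grid for a standard normal Z, for which \lambda(k) = k^2/2 and \Lambda^{\ast}(t) = t^2/2 are known exactly (the grid and the value t = 2 are arbitrary choices).

```python
import numpy as np
from scipy.stats import norm

# Grid approximation of the Cramér transform (9) and the Fenchel-Legendre
# transform (10) for Z ~ N(0, 1), where lambda(k) = k^2 / 2 and the exact
# answer is Lambda*(t) = t^2 / 2.
def lam(k):
    return k**2 / 2.0

t = 2.0
k_grid = np.linspace(-10.0, 10.0, 200_001)
values = k_grid * t - lam(k_grid)

cramer = values[k_grid >= 0].max()    # supremum over k >= 0, as in (9)
fenchel = values.max()                # supremum over all k, as in (10)
print(cramer, fenchel, t**2 / 2)      # all three agree since t >= E[Z] = 0
print(np.exp(-cramer), norm.sf(t))    # Chernoff bound (8) versus exact tail
```

Here t = 2 \geq E[Z] = 0, so the two suprema coincide; the resulting Chernoff bound e^{-2} \approx 0.135 is valid but, as expected, looser than the exact Gaussian tail.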

If \lambda(k) is differentiable, the Cramér transform can be computed by differentiating kt - \lambda(k) with respect to k and setting the derivative equal to zero. This gives

\Lambda^{\ast}(t) \equiv (k^{\ast} t - \lambda(k^{\ast})) \qquad \qquad \qquad (11)

where k^{\ast} is such that

\lambda^{\prime}(k^{\ast}) = t \qquad \qquad \qquad (12)

Provided Z is non-degenerate, \lambda(k) is strictly convex, so \lambda^{\prime} has an increasing inverse (\lambda^{\prime})^{-1}, and therefore

k^{\ast} = (\lambda^{\prime})^{-1}(t) \qquad \qquad \qquad (13)
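The following sketch illustrates (11)–(13) for a Poisson variate with mean \mu, for which \lambda(k) = \mu(e^k - 1), so that \lambda^{\prime}(k) = \mu e^k and k^{\ast} = \ln(t/\mu) in closed form; the distribution and the parameter values are illustrative choices, and the root of (12) is found numerically with scipy.optimize.brentq.

```python
import numpy as np
from scipy.optimize import brentq

# Illustration of (11)-(13) for a Poisson variate with mean mu, for which
# lambda(k) = mu * (e^k - 1), lambda'(k) = mu * e^k and k* = ln(t / mu).
# The values of mu and t are arbitrary.
mu, t = 1.0, 3.0

def lam(k):
    return mu * (np.exp(k) - 1.0)

def lam_prime(k):
    return mu * np.exp(k)

k_star = brentq(lambda k: lam_prime(k) - t, -10.0, 10.0)  # solve (12) numerically
print(k_star, np.log(t / mu))        # numerical root versus closed form (13)
print(k_star * t - lam(k_star))      # the Cramér transform via (11)
```

For these values the transform equals t \ln(t/\mu) - t + \mu \approx 1.296, the familiar Poisson rate function.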

This formula can be used to compute the Cramér transform explicitly in simple cases. As an example, consider an exponential variate X with mean \mu and probability density

f(x) = \frac{1}{\mu} e^{-x/\mu}, \qquad x \geq 0 \qquad \qquad \qquad (14)

Then, for k < 1/\mu (so that the expectation is finite), the log of the Laplace transform of the random variable is

\lambda(k) = \ln E[e^{kX}] = -\ln(1 - \mu k) \qquad \qquad \qquad (15)

Therefore

\lambda^{\prime}(k) = \frac{\mu}{1 - \mu k}

and so

k^{\ast} = (\lambda^{\prime})^{-1}(t) = \frac{1}{\mu} - \frac{1}{t} \qquad \qquad \qquad (16)
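Substituting this k^{\ast} back into (11) gives \Lambda^{\ast}(t) = t/\mu - 1 - \ln(t/\mu) for t \geq \mu. The short sketch below (with illustrative values of \mu and t) checks this algebra and confirms that the resulting Chernoff bound e^{-\Lambda^{\ast}(t)} dominates the exact tail probability e^{-t/\mu}.

```python
import numpy as np

# Completing the exponential example: with k* from (16), the Cramér
# transform (11) is Lambda*(t) = t/mu - 1 - ln(t/mu) for t >= mu.
# The values of mu and t are arbitrary illustrative choices.
mu, t = 1.0, 3.0

k_star = 1.0 / mu - 1.0 / t

def lam(k):
    return -np.log(1.0 - mu * k)            # lambda(k) from (15), valid for k < 1/mu

rate = k_star * t - lam(k_star)             # Lambda*(t) via (11)

print(rate, t / mu - 1.0 - np.log(t / mu))  # numerical value versus closed form
print(np.exp(-rate), np.exp(-t / mu))       # Chernoff bound versus exact P(X >= t)
```

With \mu = 1 and t = 3 the rate is 2 - \ln 3 \approx 0.90, so the Chernoff bound is about 0.41 against an exact tail of e^{-3} \approx 0.05: valid, but far from tight.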

Appendix

The Cramér transform defined in (9) is a non-negative function of t, because the expression kt - \lambda(k) vanishes at k = 0 (since \lambda(0) = 0), so the supremum over k \geq 0 is always at least zero. Since the exponential function is convex, Jensen’s inequality applies to it, and we can write

e^{E[X]} \leq E[e^{X}]

Therefore

\ln E[e^X] \geq E[X]

Applying this to our case, with X = k Z, we get

\lambda(k) \geq k E[Z] \qquad \qquad \qquad (17)

Therefore, for all negative values of k we will have

k t - \lambda(k) \leq 0

whenever t \geq E[Z], because in that case kt \leq kE[Z] (multiplying t \geq E[Z] by k \leq 0 reverses the inequality) and kE[Z] \leq \lambda(k) by (17). Including negative values of k in the supremum therefore adds nothing, since the supremum over k \geq 0 is already at least zero. We can thus formally extend the supremum on the right-hand side of (9) over all k \in \mathbb{R}, obtaining the Fenchel-Legendre transform in (10). Therefore, as stated in the main text, the Fenchel-Legendre transform coincides with the Cramér transform at every t \geq E[Z].

To see that the Cramér transform in (9) is zero whenever t < E[Z], observe that for every k > 0 this inequality implies kt < kE[Z], while by Jensen’s inequality \lambda(k) \geq kE[Z] > kt, so kt - \lambda(k) < 0. Since the supremum in (9) includes k = 0, where the expression vanishes, the Cramér transform is indeed equal to zero in this case. (Note that the unrestricted supremum in (10) need not vanish for t < E[Z], since negative values of k can then make kt - \lambda(k) positive; this is why the identification of (9) with (10) is claimed only for t \geq E[Z].)
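A brief numerical check of these two appendix claims, using a standard normal Z (so that \lambda(k) = k^2/2 and E[Z] = 0; the grid range and the test values of t are arbitrary):

```python
import numpy as np

# Checking the two Appendix claims for Z ~ N(0, 1), where lambda(k) = k^2 / 2
# and E[Z] = 0. The grid range and the test values of t are arbitrary.
def lam(k):
    return k**2 / 2.0

k_grid = np.linspace(-10.0, 10.0, 200_001)

for t in (2.0, -1.0):                        # one t above E[Z], one below
    values = k_grid * t - lam(k_grid)
    sup_all = values.max()                   # Fenchel-Legendre transform (10)
    sup_nonneg = values[k_grid >= 0].max()   # Cramér transform (9)
    print(f"t = {t}: sup over R = {sup_all:.4f}, sup over k >= 0 = {sup_nonneg:.4f}")
```

For t = 2 the two suprema agree (both equal t^2/2 = 2), while for t = -1 the supremum restricted to k \geq 0 is zero even though the unrestricted supremum is 0.5.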

