Markov’s inequality is a basic tool for putting upper bounds on tail probabilities. To derive it, suppose $X$ is a non-negative random variable. We want to put an upper bound on the tail probability $P(X \ge t)$, where $t > 0$. We do this by observing that

$$ t\,\mathbf{1}_{\{X \ge t\}} \le X, \tag{1} $$

where $\mathbf{1}_{\{X \ge t\}}$ is the indicator function which takes the value 1 when $X \ge t$ and the value 0 otherwise. (It is obvious that (1) above holds, simply by the definition of this indicator function.) Taking expectations of both sides of (1) we get

$$ t\,E\bigl[\mathbf{1}_{\{X \ge t\}}\bigr] \le E[X]. $$
But $E[\mathbf{1}_{\{X \ge t\}}] = P(X \ge t)$ (since the expectation of an indicator variable is just the probability of the corresponding event), so we can write

$$ t\,P(X \ge t) \le E[X], $$

and so

$$ P(X \ge t) \le \frac{E[X]}{t}. \tag{2} $$

This is Markov’s inequality.
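As a quick sanity check (a minimal numerical sketch, not part of the derivation, assuming NumPy is available), the following snippet compares the empirical tail probability of a unit-mean exponential variate with the Markov bound (2):

```python
import numpy as np

# Monte Carlo check of Markov's inequality P(X >= t) <= E[X] / t,
# using an exponential variate with E[X] = 1 as the example.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)

for t in [0.5, 1.0, 2.0, 4.0]:
    tail = np.mean(x >= t)   # empirical P(X >= t)
    bound = x.mean() / t     # Markov upper bound E[X] / t
    print(f"t={t}: tail={tail:.4f} <= bound={bound:.4f}")
```

Note that for $t \le E[X]$ the bound exceeds 1 and is trivially true; Markov’s inequality only has force for $t > E[X]$.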
It can be modified, and thereby made sharper for particular situations, by considering a non-decreasing and non-negative function $\phi$ defined on an interval $I$ containing the values of $X$. Then we have, for any $t \in I$ with $\phi(t) > 0$,

$$ P(X \ge t) \le P\bigl(\phi(X) \ge \phi(t)\bigr) \le \frac{E[\phi(X)]}{\phi(t)}. \tag{3} $$

As an example of applying this modified version, note that we obtain Chebyshev’s inequality by using the function $\phi(x) = x^2$ and defining $X = |Z - E[Z]|$ for a random variable $Z$ with finite variance. Putting these in (3) we get

$$ P\bigl(|Z - E[Z]| \ge t\bigr) \le \frac{E\bigl[(Z - E[Z])^2\bigr]}{t^2} = \frac{\operatorname{Var}(Z)}{t^2}. $$
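The same kind of numerical check works here (again a sketch assuming NumPy; the standard normal has $E[Z] = 0$ and $\operatorname{Var}(Z) = 1$):

```python
import numpy as np

# Monte Carlo check of Chebyshev's inequality
# P(|Z - E[Z]| >= t) <= Var(Z) / t**2 for a standard normal Z.
rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)

for t in [1.0, 2.0, 3.0]:
    tail = np.mean(np.abs(z - z.mean()) >= t)
    bound = z.var() / t**2
    print(f"t={t}: tail={tail:.4f} <= bound={bound:.4f}")
```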
More generally, we can get a version of Markov’s inequality involving the Laplace transform of a random variable $Z$ by specifying $\phi(x) = e^{\lambda x}$, where $\lambda$ is a positive number. In this case, noting that $P(Z \ge t) = P(e^{\lambda Z} \ge e^{\lambda t})$ since the function $x \mapsto e^{\lambda x}$ is monotonically increasing, Markov’s inequality implies

$$ P(Z \ge t) \le \frac{E[e^{\lambda Z}]}{e^{\lambda t}}. \tag{4} $$

So now the Laplace transform of $Z$ appears in the upper bound.
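Each choice of $\lambda > 0$ gives a valid bound in (4), and some choices are much better than others. Here is a small illustration (a sketch, using the known fact that $E[e^{\lambda Z}] = e^{\lambda^2/2}$ for a standard normal $Z$):

```python
import numpy as np

# The bound (4) for a standard normal Z at t = 2, for several lambda.
# E[exp(lambda * Z)] = exp(lambda**2 / 2) for a standard normal variate.
t = 2.0
for lam in [0.5, 1.0, 2.0, 3.0]:
    bound = np.exp(lam**2 / 2 - lam * t)   # E[e^{lam Z}] / e^{lam t}
    print(f"lambda={lam}: bound={bound:.4f}")
# The exact tail P(Z >= 2) is about 0.0228; of these choices, lambda = 2
# gives the tightest bound (e^{-2}, about 0.1353).
```

Choosing the best $\lambda$ is exactly the subject of the Cramér-Chernoff method discussed next.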
The so-called Cramér-Chernoff bounding method determines the best bound on a tail probability that one can obtain by using Markov’s inequality with an exponential function $\phi(x) = e^{\lambda x}$. (This is one of the ways to understand intuitively how the entropy function in large deviation theory arises – see below.) To explain this, begin by writing (4) above as

$$ P(Z \ge t) \le e^{-\lambda t}\,E[e^{\lambda Z}]. \tag{5} $$

Since this inequality holds for all $\lambda > 0$, one may choose $\lambda$ to minimise the upper bound in (5). Define the log of the Laplace transform as

$$ \psi(\lambda) = \log E[e^{\lambda Z}] \tag{6} $$

for all $\lambda$ for which the expectation is finite; note that $\psi(0) = 0$. Then we can write (5) as

$$ P(Z \ge t) \le e^{-(\lambda t - \psi(\lambda))}. \tag{7} $$

Since we are free to replace the right-hand side of (7) by its minimum over $\lambda \ge 0$ (at $\lambda = 0$ the bound is trivially true), we can do so, and we thus obtain Chernoff’s inequality

$$ P(Z \ge t) \le e^{-\psi^*(t)}, \tag{8} $$

where

$$ \psi^*(t) = \sup_{\lambda \ge 0}\,\bigl(\lambda t - \psi(\lambda)\bigr); \tag{9} $$

the right-hand side of (9) is known as the Cramér transform of (6). A straightforward argument, given in the Appendix below, shows that for $t \ge E[Z]$ we can extend the supremum in (9) over all $\lambda \in \mathbb{R}$ in the definition of the Cramér transform:

$$ \psi^*(t) = \sup_{\lambda \in \mathbb{R}}\,\bigl(\lambda t - \psi(\lambda)\bigr). \tag{10} $$
The expression on the right-hand side of (10) is known as the Fenchel-Legendre dual function of $\psi$, and is exactly the Fenchel-Legendre transform that is used in large deviation theory. As discussed in the Appendix below, the Fenchel-Legendre transform will coincide with the Cramér transform at every $t \ge E[Z]$, while the Cramér transform equals zero for every $t \le E[Z]$.
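These two facts can be seen numerically (a sketch assuming NumPy; for a standard normal $Z$ one has $\psi(\lambda) = \lambda^2/2$, $E[Z] = 0$, and the Fenchel-Legendre transform works out to $t^2/2$):

```python
import numpy as np

# Compare the Cramér transform (supremum over lambda >= 0, eq. (9)) with
# the Fenchel-Legendre transform (supremum over all real lambda, eq. (10))
# for a standard normal Z, where psi(lambda) = lambda**2 / 2 and E[Z] = 0.
def psi(lam):
    return lam**2 / 2

lams_pos = np.linspace(0, 10, 100_001)
lams_all = np.linspace(-10, 10, 200_001)

for t in [-1.0, 0.0, 1.0, 2.0]:
    cramer  = np.max(lams_pos * t - psi(lams_pos))
    fenchel = np.max(lams_all * t - psi(lams_all))
    print(f"t={t}: Cramer={cramer:.4f}  Fenchel-Legendre={fenchel:.4f}  "
          f"t^2/2={t*t/2:.4f}")
```

The output shows the two transforms agreeing (with value $t^2/2$) for $t \ge 0 = E[Z]$, while for $t < 0$ the Cramér transform is zero and the Fenchel-Legendre transform is not.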
If $\psi$ is differentiable, the Cramér transform can be computed by differentiating $\lambda t - \psi(\lambda)$ with respect to $\lambda$. The optimising value $\lambda_t$ of $\lambda$ is found by setting the derivative with respect to $\lambda$ equal to zero, that is

$$ \frac{d}{d\lambda}\bigl(\lambda t - \psi(\lambda)\bigr) = t - \psi'(\lambda) = 0, \tag{11} $$

where $\lambda_t$ is such that

$$ \psi'(\lambda_t) = t. $$

The strict convexity of $\psi$ implies that $\psi'$ has an increasing inverse $(\psi')^{-1}$, so $\lambda_t = (\psi')^{-1}(t)$ and

$$ \psi^*(t) = t\,(\psi')^{-1}(t) - \psi\bigl((\psi')^{-1}(t)\bigr). \tag{12} $$
This formula can be used to compute the Cramér transform explicitly in simple cases. As an example, consider an exponential variate $X$ with probability density

$$ f(x) = \theta e^{-\theta x}, \qquad x \ge 0. \tag{13} $$

Then the log of the Laplace transform of $X$ is

$$ \psi(\lambda) = \log\frac{\theta}{\theta - \lambda}, \qquad \lambda < \theta. \tag{14} $$

Therefore

$$ \psi'(\lambda) = \frac{1}{\theta - \lambda}, \qquad \text{so that} \qquad \lambda_t = \theta - \frac{1}{t}, $$

and so, for $t \ge 1/\theta = E[X]$,

$$ \psi^*(t) = \lambda_t t - \psi(\lambda_t) = \theta t - 1 - \log(\theta t). $$
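This closed form is easy to verify numerically (a sketch assuming NumPy; $\theta = 2$ is an arbitrary choice):

```python
import numpy as np

# Verify psi*(t) = theta*t - 1 - log(theta*t) for an exponential variate by
# maximising lambda*t - psi(lambda) directly over a grid of lambda < theta,
# and compare the Chernoff bound exp(-psi*(t)) with the exact tail exp(-theta*t).
theta = 2.0
lams = np.linspace(0.0, theta - 1e-6, 200_001)   # psi is finite for lambda < theta
psi = np.log(theta / (theta - lams))

for t in [1.0, 2.0, 4.0]:                        # each t > E[X] = 1/theta = 0.5
    numeric = np.max(lams * t - psi)
    closed  = theta * t - 1 - np.log(theta * t)
    print(f"t={t}: numeric={numeric:.4f}  closed={closed:.4f}  "
          f"Chernoff={np.exp(-closed):.4f}  exact tail={np.exp(-theta * t):.4f}")
```

As expected, the grid maximum matches the closed form, and the Chernoff bound lies above the exact tail $e^{-\theta t}$.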
Appendix
The Cramér transform on the right-hand side of (9) is a non-negative function, because the quantity $\lambda t - \psi(\lambda)$ being maximised is identically zero for all $t$ when $\lambda = 0$ (since $\psi(0) = \log E[e^{0 \cdot Z}] = 0$), so the supremum can always be chosen to be at least zero. Since the exponential function is convex, Jensen’s inequality applies to it, and we can write, for any random variable $W$,

$$ e^{E[W]} \le E[e^{W}]. \tag{15} $$

Therefore

$$ E[W] \le \log E[e^{W}]. \tag{16} $$

Applying this to our case, with $W = \lambda Z$, we get

$$ \lambda E[Z] \le \psi(\lambda). \tag{17} $$

Therefore, we will have for all negative values of $\lambda$ that

$$ \lambda t - \psi(\lambda) \le 0 $$

whenever $t \ge E[Z]$ (because $\lambda t \le \lambda E[Z]$ when $\lambda < 0$, so (17) will then hold a fortiori with $\lambda t$ in place of $\lambda E[Z]$). But the supremum on the right-hand side of (9) is at least zero in any case, so adjoining these non-positive values changes nothing: we can formally extend the supremum on the right-hand side of (9) over all $\lambda \in \mathbb{R}$ in the definition of the Cramér transform, thereby obtaining the Fenchel-Legendre transform in (10). Therefore, as stated in the main text, the Fenchel-Legendre transform will coincide with the Cramér transform at every $t \ge E[Z]$.
To see that the Cramér transform is zero whenever $t \le E[Z]$, observe that this inequality implies $\lambda t \le \lambda E[Z]$ for every $\lambda \ge 0$, but by Jensen’s inequality (17) we will then have $\lambda t \le \lambda E[Z] \le \psi(\lambda)$, so $\lambda t - \psi(\lambda) \le 0$ for all $\lambda \ge 0$. Since the supremum can always be chosen to be at least zero (by taking $\lambda = 0$), the Cramér transform will indeed be exactly zero in this case. (Note that this argument does not apply to negative $\lambda$, which is why it is the Cramér transform, and not the full Fenchel-Legendre transform, that vanishes below the mean.)
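A final numerical check of the appendix claims (a sketch assuming NumPy, again using the exponential variate with $\theta = 2$, so $E[Z] = 1/2$):

```python
import numpy as np

# Check (i) Jensen's bound psi(lambda) >= lambda * E[Z] for all lambda (eq. (17)),
# and (ii) lambda*t - psi(lambda) <= 0 for negative lambda whenever t >= E[Z],
# for an exponential variate with theta = 2, so E[Z] = 0.5.
theta, mean = 2.0, 0.5
lams = np.linspace(-10.0, theta - 1e-6, 200_001)
psi = np.log(theta / (theta - lams))              # finite for lambda < theta

print("eq. (17) holds on the grid:", np.all(psi >= lams * mean))

neg = lams < 0
for t in [0.5, 1.0, 3.0]:                         # each t >= E[Z]
    worst = np.max(lams[neg] * t - psi[neg])      # should be <= 0
    print(f"t={t}: max over lambda < 0 of lambda*t - psi = {worst:.4f}")
```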
