Zipf’s law refers to the phenomenon that many data sets in the social and exact sciences are observed to obey a power law of the form $p(x) \sim x^{-\alpha}$ with the exponent $\alpha$ approximately equal to 2. In the present note I want to set out a simple Yule-Simon process (similar to one first discussed in Simon, H, 1955, On a class of skew distribution functions, Biometrika, 42:425-40) which shows clearly how Zipf’s law can emerge from urn-type processes following a similar pattern.
The simple process discussed here involves the appearance of new species within families of closely related species called genera. New species appear within genera (through evolutionary processes), and they usually remain quite close in their main characteristics to the pre-existing species. However, every so often, a new species will appear which is sufficiently different from all pre-existing ones to be regarded as having started a completely new genus. We can construct a simple Yule-Simon process as a stylised version of this. Suppose that species appear one at a time and that when the number of species reaches $m$, the next new species will start a new genus. Therefore, when the first new genus appears, there are $m+1$ species in total. Species continue to appear one at a time, and when the number of species reaches $2m+1$ (i.e., another $m$ species have appeared), the next new species will again start a new genus. Thus, when the second new genus appears, there are $2(m+1)$ species in total. We assume this process continues indefinitely, so that when the $n$-th new genus appears, there are $n(m+1)$ species in total.
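As a minimal sketch, the growth schedule just described can be simulated in Python. Here `m` is the number of species that appear between one new genus and the next. To place the species that do not found a genus, the sketch assumes that each one joins an existing genus with probability proportional to that genus's current size. The function name `simulate_genera`, its parameters, and the choice to start from a single genus holding `m` species are illustrative assumptions.

```python
import random
from collections import Counter


def simulate_genera(m, n_new_genera, seed=0):
    """Simulate the species/genus schedule and return a Counter {genus size: number of genera}.

    Assumptions (hypothetical, for illustration only):
    - the process starts with one genus holding m species;
    - every species that does not found a genus joins an existing genus
      with probability proportional to that genus's current size.
    """
    rng = random.Random(seed)
    owner = [0] * m   # owner[s] = index of the genus that species s belongs to
    sizes = [m]       # sizes[g] = number of species currently in genus g

    for _ in range(n_new_genera):
        # The next species founds a completely new genus.
        owner.append(len(sizes))
        sizes.append(1)
        # The following m species join existing genera. Picking a uniformly
        # random existing species and copying its genus selects a genus
        # with probability proportional to its size.
        for _ in range(m):
            g = owner[rng.randrange(len(owner))]
            owner.append(g)
            sizes[g] += 1

    return Counter(sizes)


if __name__ == "__main__":
    counts = simulate_genera(m=1, n_new_genera=20000)
    n_genera = sum(counts.values())
    for k in range(1, 6):
        print(k, counts[k] / n_genera)
```

The species-level `owner` list is a design choice that avoids recomputing cumulative genus sizes at every step: sampling one existing species uniformly is equivalent to sampling a genus in proportion to its size.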
We further assume that between each genus and the next, the $m$ new species that appear will be distributed among the already existing genera in proportion to the number of species they already have (this gives rise to the characteristic feature of Zipf’s law when applied to wealth distribution in economics that ‘the rich get richer’). So at stage $n$, the next species that appears will appear in the $i$-th genus with probability
$$\frac{k_i}{\sum_j k_j} = \frac{k_i}{n(m+1)}$$
where $k_i$ is the number of species already in genus $i$, and $\sum_j k_j$ is simply the total number of species at stage $n$, which is $n(m+1)$. There are $m$ opportunities for this to happen, so genus $i$ gains a new species with probability
$$\frac{m k_i}{n(m+1)}$$
Let $p_{k,n}$ denote the fraction of genera that have $k$ species when the total number of genera is $n$. Then $n p_{k,n}$ is the number of genera that have $k$ species when the total number of genera is $n$, and the expected number of genera of size $k$ that gain a new species in this interval is
$$\frac{m k}{n(m+1)} \times n p_{k,n} = \frac{m}{m+1} k p_{k,n}$$
Now, when these genera gain the new species, they will move out of the class of genera with $k$ species, and into the class of genera with $k+1$ species, so the number of genera with $k$ species will fall by $\frac{m}{m+1} k p_{k,n}$. Analogously, the expected number of genera with $k-1$ species that will gain a new species is
$$\frac{m}{m+1} (k-1) p_{k-1,n}$$
When these genera gain a new species, they will move into the class of genera with $k$ species, so the number of genera with $k$ species will rise by $\frac{m}{m+1} (k-1) p_{k-1,n}$. Therefore we can write a master equation for the new number $(n+1) p_{k,n+1}$ of genera with $k$ species at stage $n+1$ thus:
$$(n+1) p_{k,n+1} = n p_{k,n} + \frac{m}{m+1} \left[ (k-1) p_{k-1,n} - k p_{k,n} \right]$$
However, this master equation does not hold for genera of size 1. Instead, these genera obey
$$(n+1) p_{1,n+1} = n p_{1,n} + 1 - \frac{m}{m+1} p_{1,n}$$
The second term on the right-hand side is 1 because, by definition, exactly one new genus appears at each step of the process, so there is only one entrant from the class of genera with zero species into the class of genera with one species.
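The two master equations can also be iterated numerically to see how the size distribution evolves as genera accumulate. The following is a minimal sketch, not a definitive implementation: it tracks the expected counts $N_k = n p_{k,n}$, where $p_{k,n}$ is the fraction of genera with $k$ species at stage $n$ and $m$ is the number of species that appear between one new genus and the next. The function name `iterate_master_equation`, the truncation at `k_max`, and the starting condition (a single genus with one species at $n = 1$) are assumptions made for illustration.

```python
def iterate_master_equation(m, n_max, k_max):
    """Iterate the master equations up to n_max genera.

    Returns a list p with p[k] = fraction of genera of size k (p[0] is unused).
    Assumptions (hypothetical): the process starts at n = 1 with a single
    genus of size 1, and sizes above k_max are simply not tracked.
    """
    c = m / (m + 1)
    N = [0.0] * (k_max + 1)  # N[k] = expected number of genera with k species
    N[1] = 1.0

    for n in range(1, n_max):
        p = [x / n for x in N]  # current fractions p_{k,n}
        new = N[:]
        for k in range(1, k_max + 1):
            if k == 1:
                gain = 1.0                       # exactly one new genus per step
            else:
                gain = c * (k - 1) * p[k - 1]    # genera promoted from size k-1
            loss = c * k * p[k]                  # genera promoted to size k+1
            new[k] = N[k] + gain - loss
        N = new

    return [x / n_max for x in N]


if __name__ == "__main__":
    p = iterate_master_equation(m=2, n_max=20000, k_max=10)
    for k in range(1, 6):
        print(k, p[k])
```

Truncating at `k_max` does not disturb the tracked sizes, because each class only receives genera from the class immediately below it.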
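We assume there is a steady state as $n \to \infty$, with $p_{k,n+1} = p_{k,n} = p_k$, in which case we get the steady state equation for $p_1$
$$p_1 = 1 - \frac{m}{m+1} p_1 \quad \Longrightarrow \quad p_1 = \frac{m+1}{2m+1}$$
and the steady state equation for $p_k$
$$p_k = \frac{m}{m+1} \left[ (k-1) p_{k-1} - k p_k \right] \quad \Longrightarrow \quad p_k = \frac{k-1}{k+1+1/m} \, p_{k-1}$$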
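But using the steady state equation for $p_k$ above we observe that
$$p_{k-1} = \frac{k-2}{k+1/m} \, p_{k-2}$$
and substituting this back into the steady state equation for $p_k$ above we get
$$p_k = \frac{(k-1)(k-2)}{(k+1+1/m)(k+1/m)} \, p_{k-2}$$
Continuing the iteration in this way we get
$$p_k = \frac{(k-1)(k-2) \cdots 1}{(k+1+1/m)(k+1/m) \cdots (3+1/m)} \, p_1$$
and using the steady state expression for $p_1 = (1+1/m)/(2+1/m)$ in this we get
$$p_k = (1+1/m) \, \frac{(k-1)(k-2) \cdots 1}{(k+1+1/m)(k+1/m) \cdots (2+1/m)}$$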
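Since $\Gamma(x+1) = x \Gamma(x)$ with $\Gamma(1) = 1$, we can write this as
$$p_k = (1+1/m) \, \frac{\Gamma(k) \, \Gamma(2+1/m)}{\Gamma(k+2+1/m)}$$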
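Now, the gamma function is defined as
$$\Gamma(x) = \int_0^\infty t^{x-1} e^{-t} \, dt$$
and the beta function for $a > 0$ and $b > 0$ is defined as
$$B(a,b) = \int_0^1 t^{a-1} (1-t)^{b-1} \, dt$$
It is not too difficult to show that the two are related by the equation
$$B(a,b) = \frac{\Gamma(a) \, \Gamma(b)}{\Gamma(a+b)}$$
and furthermore, for large $a$ we have
$$B(a,b) \sim \Gamma(b) \, a^{-b}$$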
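Comparing with the final expression for $p_k$ above in terms of the gamma function we see that
$$p_k = (1+1/m) \, B(k, 2+1/m) \sim k^{-(2+1/m)}$$
So we get a power law with exponent $2 + 1/m$ when the genera size $k$ is large, and we get Zipf’s law, with exponent approximately equal to 2, when the number of new entrant species $m$ is large.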
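The closing steps can be checked numerically. The sketch below takes the steady state to be $p_k = (1+1/m) B(k, 2+1/m)$, with $p_1 = (m+1)/(2m+1)$ and the recurrence $p_k = \frac{k-1}{k+1+1/m} p_{k-1}$, and estimates the local power-law exponent from the slope of $\log p_k$ against $\log k$. The function names are hypothetical, and the slope is measured between $k$ and $2k$ purely as a convenient choice.

```python
import math


def log_beta(a, b):
    """log B(a, b), computed from log-gamma values to avoid overflow."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def steady_state_exact(k, m):
    """p_k = (1 + 1/m) * B(k, 2 + 1/m)."""
    return (1 + 1 / m) * math.exp(log_beta(k, 2 + 1 / m))


def steady_state_recurrence(k_max, m):
    """p_1, ..., p_k_max from p_k = (k - 1) / (k + 1 + 1/m) * p_{k-1}; index 0 is unused."""
    p = [0.0, (m + 1) / (2 * m + 1)]
    for k in range(2, k_max + 1):
        p.append(p[-1] * (k - 1) / (k + 1 + 1 / m))
    return p


def local_exponent(k, m):
    """Minus the slope of log p_k versus log k, measured between k and 2k."""
    drop = math.log(steady_state_exact(k, m)) - math.log(steady_state_exact(2 * k, m))
    return drop / math.log(2)


if __name__ == "__main__":
    for m in (1, 10, 100):
        print(m, local_exponent(1000, m), 2 + 1 / m)
```

For large `k` the estimated exponent should come out close to $2 + 1/m$, and so close to 2 when `m` is large.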
