A simple Yule-Simon process and Zipf’s law

Zipf’s law refers to the phenomenon that many data sets in the social and exact sciences are observed to obey a power law of the form p(x) \sim x^{-\alpha} with the exponent \alpha approximately equal to 2. In the present note I want to set out a simple Yule-Simon process (similar to one first discussed in Simon, H. A., 1955, On a class of skew distribution functions, Biometrika, 42:425-40) which shows clearly how Zipf’s law can emerge from urn-type processes following a similar pattern.

The simple process discussed here involves the appearance of new species within families of closely related species called genera. New species appear within genera through evolutionary processes, and they usually remain quite close in their main characteristics to the pre-existing species. However, every so often, a new species will appear which is sufficiently different from all pre-existing ones to be regarded as having started a completely new genus. We can construct a simple Yule-Simon process as a stylised version of this. Suppose that species appear one at a time and that when the number of species reaches m, the next new species will start a new genus. Therefore, when the first new genus appears, there are m + 1 species in total. Species continue to appear one at a time, and when the number of species reaches 2m + 1 (i.e., another m species have appeared), the next new species will again start a new genus. Thus, when the second new genus appears, there are 2(m + 1) species in total. We assume this process continues indefinitely, so that when the n-th new genus appears, there are n(m+1) species in total.

We further assume that between the appearance of one new genus and the next, the m new species that appear are distributed among the already existing genera in proportion to the number of species they already have (this is the ‘rich get richer’ feature that is characteristic of Zipf’s law when applied to wealth distribution in economics). So at stage n, the next species that appears will appear in the i-th genus with probability

\frac{k_i}{\sum k_i} = \frac{k_i}{n(m+1)}

where k_i is the number of species already in genus i, and \sum k_i is simply the total number of species at stage n, which is n(m+1). There are m opportunities for this to happen before the next new genus appears, so during this interval genus i gains a new species with probability approximately

\frac{m k_i}{n(m+1)}
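
The process described so far can be simulated directly. The following is a minimal sketch rather than a definitive implementation. The function names, the seed, and the starting state (a single genus holding m + 1 species, so that stage n = 1 has 1 \cdot (m+1) species) are assumptions made for illustration. The sketch also updates genus sizes after every single species, so the denominator grows slightly within an interval instead of staying fixed at n(m+1); for large n the difference should be negligible.

```python
import random
from collections import Counter


def simulate_genera(n_genera, m, seed=0):
    """Simulate the stylised process until there are n_genera genera.

    Assumed starting state: one genus holding m + 1 species.
    Returns a list whose i-th entry is the number of species in genus i.
    """
    rng = random.Random(seed)
    sizes = [m + 1]            # sizes[i] = number of species in genus i
    owner = [0] * (m + 1)      # owner[j] = genus that species j belongs to
    while len(sizes) < n_genera:
        # m species join existing genera. Picking a uniformly random
        # existing species and copying its genus selects genus i with
        # probability k_i / (total number of species).
        for _ in range(m):
            g = owner[rng.randrange(len(owner))]
            sizes[g] += 1
            owner.append(g)
        # The next species founds a new genus of size 1.
        sizes.append(1)
        owner.append(len(sizes) - 1)
    return sizes


def size_fractions(sizes):
    """Return {k: fraction of genera that have exactly k species}."""
    counts = Counter(sizes)
    n = len(sizes)
    return {k: c / n for k, c in sorted(counts.items())}


if __name__ == "__main__":
    sizes = simulate_genera(n_genera=20_000, m=1, seed=1)
    fractions = size_fractions(sizes)
    for k in range(1, 6):
        print(k, fractions.get(k, 0.0))
```

Sampling a random species and copying its genus avoids recomputing the probabilities k_i / \sum k_i after each arrival, which keeps each step constant-time.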

Let p_{k,n} denote the fraction of genera that have k species when the total number of genera is n. Then np_{k,n} is the number of genera that have k species when the total number of genera is n, and the expected number of genera of size k that gain a new species in this interval is

\frac{m k}{n(m+1)}np_{k,n} = \frac{m}{m+1}kp_{k,n}

Now, when these genera gain the new species, they will move out of the class of genera with k species, and into the class of genera with k+1 species, so the number of genera with k species will fall by \frac{m}{m+1}kp_{k,n}. Analogously, the expected number of genera with k-1 species that will gain a new species is

\frac{m}{m+1}(k-1)p_{k-1,n}

When these genera gain a new species, they will move into the class of genera with k species, so the number of genera with k species will rise by \frac{m}{m+1}(k-1)p_{k-1,n}. Therefore we can write a master equation for the new number (n+1)p_{k,n+1} of genera with k > 1 species at stage n+1 thus:

(n+1)p_{k,n+1} = np_{k,n} + \frac{m}{m+1}[(k-1)p_{k-1,n} - kp_{k,n}]

However, this master equation does not hold for genera of size 1. Instead, these genera obey

(n+1)p_{1,n+1} = np_{1,n} + 1 - \frac{m}{m+1}p_{1,n}

The second term on the right-hand side is 1 because, by definition, exactly one new genus appears at each step of the process, so there is only one entrant from the class of genera with zero species into the class of genera with one species.
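
One way to get a feel for these two master equations is to iterate them numerically and watch whether the fractions p_{k,n} settle down. The sketch below does this for the expected counts np_{k,n}. The function name, the truncation at a largest tracked size k_max, and the starting state (one genus of size m + 1 at n = 1) are assumptions made for illustration; genera that grow beyond k_max are simply dropped from the bookkeeping.

```python
def iterate_master_equation(m, n_steps, k_max=10):
    """Iterate the master equations for the expected counts n * p_{k,n}.

    Returns a list p with p[k] = fraction of genera of size k after
    n_steps steps (index 0 is unused). Sizes above k_max are not tracked.
    """
    a = m / (m + 1)
    counts = [0.0] * (k_max + 1)
    counts[min(m + 1, k_max)] = 1.0   # assumed start: one genus of size m + 1
    n = 1
    for _ in range(n_steps):
        p = [c / n for c in counts]
        new = counts[:]
        # Size 1: one new genus enters, a fraction of size-1 genera leave.
        new[1] = counts[1] + 1.0 - a * p[1]
        # Size k > 1: inflow from size k - 1, outflow to size k + 1.
        for k in range(2, k_max + 1):
            new[k] = counts[k] + a * ((k - 1) * p[k - 1] - k * p[k])
        counts = new
        n += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    for steps in (100, 1_000, 10_000):
        p = iterate_master_equation(m=1, n_steps=steps)
        print(steps, p[1], p[2], p[3])
```

If successive rows of output agree to several decimal places, that is at least consistent with the fractions approaching fixed values as n grows.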

We assume that a steady state exists as n \rightarrow \infty, in which p_{k,n} and p_{k,n+1} both tend to a common limit p_k. In that case we get the steady state equation for k > 1

(n+1)p_{k} = np_{k} + \frac{m}{m+1}[(k-1)p_{k-1} - kp_{k}]

\implies

p_k = \frac{k-1}{k+1+\frac{1}{m}}p_{k-1}

and the steady state equation for k = 1

(n+1)p_{1} = np_{1} + 1 - \frac{m}{m+1}p_{1}

\implies

p_1 = \frac{1 + \frac{1}{m}}{2 + \frac{1}{m}}
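
Each of the two implications above hides a line or two of algebra, which can be spelled out as follows. For k > 1, the np_k terms cancel, leaving

p_k = \frac{m}{m+1}[(k-1)p_{k-1} - kp_k]

Collecting the p_k terms on the left-hand side gives

\left(1 + \frac{mk}{m+1}\right)p_k = \frac{m}{m+1}(k-1)p_{k-1}

and multiplying both sides by \frac{m+1}{m} gives

\left(k + 1 + \frac{1}{m}\right)p_k = (k-1)p_{k-1}

which is the stated recursion. For k = 1, the np_1 terms cancel in the same way, leaving

p_1 = 1 - \frac{m}{m+1}p_1

so that

\frac{2m+1}{m+1}p_1 = 1 \implies p_1 = \frac{m+1}{2m+1}

and dividing the numerator and denominator by m gives the stated form.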

But using the steady state equation for p_k above we observe that

p_{k-1} = \frac{k-2}{k+\frac{1}{m}}p_{k-2}

and substituting this back into the steady state equation for p_k above we get

p_k = \frac{(k-1)(k-2)}{(k+1+\frac{1}{m})(k + \frac{1}{m})}p_{k-2}

Continuing the iteration in this way we get

p_k = \frac{(k-1)(k-2) \cdots 1}{(k+1+\frac{1}{m})(k+\frac{1}{m}) \cdots (3+\frac{1}{m})}p_1

and using the steady state expression for p_1 in this we get

p_k = \frac{(k-1)!(1+\frac{1}{m})}{(k+1+\frac{1}{m})(k+\frac{1}{m}) \cdots (3+\frac{1}{m})(2+\frac{1}{m})}

Since \Gamma(k) = (k-1)\Gamma(k-1) with \Gamma(1) = 1, we can write this as

p_k = \frac{\Gamma(k)\Gamma(2+\frac{1}{m})}{\Gamma(k+2+\frac{1}{m})}(1+\frac{1}{m})
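
As a sanity check, the gamma-function expression can be compared numerically with the recursion it came from. The sketch below uses the standard-library function math.lgamma to avoid overflow for large k; the function names and the shorthand rho = 1 + 1/m are assumptions made for illustration.

```python
import math


def p_closed_form(k, m):
    """p_k from the gamma-function expression, evaluated via log-gamma."""
    rho = 1.0 + 1.0 / m
    log_ratio = math.lgamma(k) + math.lgamma(rho + 1.0) - math.lgamma(k + rho + 1.0)
    return rho * math.exp(log_ratio)


def p_recursive(k_max, m):
    """p_1, ..., p_{k_max} from the steady state recursion (index 0 unused)."""
    p = [0.0, (1.0 + 1.0 / m) / (2.0 + 1.0 / m)]
    for k in range(2, k_max + 1):
        p.append((k - 1) / (k + 1 + 1.0 / m) * p[k - 1])
    return p


if __name__ == "__main__":
    m = 3
    rec = p_recursive(10, m)
    for k in range(1, 11):
        print(k, rec[k], p_closed_form(k, m))
```

For m = 1 the closed form reduces to 4 / (k(k+1)(k+2)), which gives a second, hand-checkable comparison.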

Now, the gamma function is defined as

\Gamma(p) = \int_0^{\infty} t^{p-1}e^{-t} dt

and the beta function B(p, q) for p > 0 and q > 0 is defined as

B(p, q) = \int_0^1 x^{p-1}(1 - x)^{q-1} dx

It is not too difficult to show that the two are related by the equation

B(p, q) = \frac{\Gamma(p) \Gamma(q)}{\Gamma(p+q)}
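
This relation is easy to check numerically for particular values. The sketch below compares a simple midpoint-rule estimate of the beta integral with the gamma-function ratio; the function names and the number of integration steps are assumptions made for illustration, and the quadrature is only intended for p \geq 1 and q \geq 1, where the integrand is bounded.

```python
import math


def beta_midpoint(p, q, steps=100_000):
    """Midpoint-rule estimate of the beta integral (intended for p, q >= 1)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x ** (p - 1) * (1.0 - x) ** (q - 1)
    return total * h


def beta_from_gamma(p, q):
    """B(p, q) computed from the gamma-function ratio."""
    return math.gamma(p) * math.gamma(q) / math.gamma(p + q)


if __name__ == "__main__":
    # q = 2 + 1/m with m = 2 is the kind of argument that appears in this note.
    for p, q in [(2, 3), (3, 2.5), (10, 2.5)]:
        print(p, q, beta_midpoint(p, q), beta_from_gamma(p, q))
```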

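and furthermore, for large p and fixed q, Stirling’s approximation gives \Gamma(p)/\Gamma(p+q) \approx p^{-q}, so that up to the constant factor \Gamma(q) we have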

B(p, q) \sim p^{-q}

Comparing with the final expression for p_k above in terms of the gamma function we see that

p_k = (1 + \frac{1}{m}) B(k, 2 + \frac{1}{m}) \sim k^{-(2 + \frac{1}{m})}
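
The power-law behaviour can be seen numerically by estimating the local slope of \log p_k against \log k. The sketch below does this by comparing p_k with p_{2k}; the function names and the choice of doubling k are assumptions made for illustration.

```python
import math


def log_p(k, m):
    """log of p_k = (1 + 1/m) B(k, 2 + 1/m), evaluated via log-gamma."""
    rho = 1.0 + 1.0 / m
    return (math.log(rho) + math.lgamma(k) + math.lgamma(rho + 1.0)
            - math.lgamma(k + rho + 1.0))


def local_exponent(k, m):
    """Estimate alpha in p_k ~ k^(-alpha) from the values at k and 2k."""
    return -(log_p(2 * k, m) - log_p(k, m)) / math.log(2.0)


if __name__ == "__main__":
    for m in (1, 2, 10, 1000):
        print(m, local_exponent(10_000, m), 2.0 + 1.0 / m)
```

The two printed columns should be close for large k, with the estimated exponent approaching 2 as m grows.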

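So we get a power law with exponent 2 + \frac{1}{m} when the genus size k is large. Since this exponent tends to 2 as m \rightarrow \infty, we get Zipf’s law when the number m of new species appearing between successive new genera is large.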

Published by Dr Christian P. H. Salas

Mathematics Lecturer
