2. Talk outline
1 Introduction to power laws
2 Distributional properties
3 Parameter inference
4 Power law generating mechanisms
http://xkcd.com/
3. Classic example: distribution of US cities
Some data sets vary over an enormous range
US towns & cities: from Duffield (pop 52) to New York City (pop 8 million)
The data is highly right-skewed
When the data is plotted on a logarithmic scale, it seems to follow a straight line
This observation is attributed to Zipf
[Figures: histogram of No. of Cities against City population; Cumulative No. of Cities against City population on log-log axes]
4. Distribution of world cities
World city populations for 8 countries
[Figure: log size vs log rank; Log Population against Log Rank for the largest world cities, labelled from New York, Mumbai, São Paulo and Delhi down to Tucson]
http://brenocon.com/blog/2009/05/zipfs-law-and-world-city-populations/
6. What does it mean?
Let p(x) dx be the fraction of cities with a population between x and x + dx
If this histogram is a straight line on log-log scales, then
ln p(x) = −α ln x + c
where α and c are constants
Hence
p(x) = C x^−α
where C = e^c
Distributions of this form are said to follow a power law
The constant α is called the exponent of the power law
We typically don’t care about c.
7. The power law distribution
Name          f(x)                               Notes
Power law     x^−α                               Pareto distribution
Exponential   e^(−λx)
Log-normal    (1/x) exp(−(ln x − µ)² / (2σ²))
Power law     x^−α                               Zeta distribution
Power law     x^−α                               x = 1, ..., n; Zipf's distribution
Yule          Γ(x) / Γ(x + α)
Poisson       λ^x / x!
9. Alleged power-law phenomena
The frequency of occurrence of unique words in the novel Moby Dick by
Herman Melville
The numbers of customers affected in electrical blackouts in the United
States between 1984 and 2002
The number of links to web sites found in a 1997 web crawl of about 200
million web pages
The number of hits on web pages
The number of papers scientists write
The number of citations received by papers
Annual incomes
Sales of books, music, and in fact anything that can be sold
12. The power law distribution
The power-law distribution is
p(x) ∝ x^−α
where α, the scaling parameter, is a constant
The scaling parameter typically lies in the range 2 < α < 3, although there are occasional exceptions
Typically, the entire process doesn’t obey a power law
Instead, the power law applies only for values greater than some
minimum xmin
13. Power law: PDF & CDF
For the continuous power law, the pdf is
p(x) = ((α − 1) / xmin) (x / xmin)^−α
where α > 1 and xmin > 0
The CDF is
P(x) = 1 − (x / xmin)^(−α+1)
[Figure: PDF and CDF of the continuous power law for α = 1.50, 1.75, 2.00, 2.25, 2.50, plotted over 0 ≤ x ≤ 10]
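As a concrete illustration (my own, not from the talk), a minimal Python sketch of these two formulas; the function names and the choice xmin = 1 in the demo are assumptions.

```python
import numpy as np

def pl_pdf(x, alpha, xmin):
    """Continuous power-law pdf: (alpha - 1)/xmin * (x/xmin)**(-alpha), for x >= xmin."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= xmin, (alpha - 1) / xmin * (x / xmin) ** (-alpha), 0.0)

def pl_cdf(x, alpha, xmin):
    """Continuous power-law CDF: 1 - (x/xmin)**(-(alpha - 1)), for x >= xmin."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= xmin, 1 - (x / xmin) ** (-(alpha - 1)), 0.0)

xs = np.array([1.0, 2.0, 5.0, 10.0])
for alpha in (1.5, 2.0, 2.5):
    print(alpha, np.round(pl_pdf(xs, alpha, 1.0), 3), np.round(pl_cdf(xs, alpha, 1.0), 3))
```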
14. Power law: PDF & CDF
For the discrete power law, the pmf is
p(x) = x^−α / ζ(α, xmin)
where
ζ(α, xmin) = Σ_{n=0}^{∞} (n + xmin)^−α
is the generalised (Hurwitz) zeta function
When xmin = 1, ζ(α, 1) is the standard zeta function
[Figure: pmf and CDF of the discrete power law for α = 1.50, 1.75, 2.00, 2.25, 2.50, plotted over 0 ≤ x ≤ 10]
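For completeness, a small Python sketch (my own) of the discrete pmf, using scipy's Hurwitz zeta for the normalising constant:

```python
import numpy as np
from scipy.special import zeta  # zeta(s, q) is the Hurwitz zeta function

def dpl_pmf(x, alpha, xmin=1):
    """Discrete power-law pmf: x**(-alpha) / zeta(alpha, xmin) for integer x >= xmin."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= xmin, x ** (-alpha) / zeta(alpha, xmin), 0.0)

xs = np.arange(1, 11)
print(np.round(dpl_pmf(xs, alpha=2.5), 4))   # decays like x^-2.5
print(round(dpl_pmf(xs, alpha=2.5).sum(), 4))  # partial sum over x = 1..10, approaches 1 as x grows
```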
16. Moments
Moments:
⟨x^m⟩ = E[X^m] = ∫_{xmin}^{∞} x^m p(x) dx = ((α − 1) / (α − 1 − m)) xmin^m
Hence, when m ≥ α − 1, we have diverging moments
So when
α < 2, all moments are infinite
α < 3, all second and higher-order moments are infinite
α < 4, all third and higher-order moments are infinite
....
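A quick numerical check of the closed form (my own, not from the talk), using scipy's quad on a convergent case; α = 2.5 and xmin = 1 are arbitrary choices.

```python
import numpy as np
from scipy.integrate import quad

alpha, xmin = 2.5, 1.0
pdf = lambda x: (alpha - 1) / xmin * (x / xmin) ** (-alpha)

for m in (1, 2):
    if m < alpha - 1:
        numeric, _ = quad(lambda x: x**m * pdf(x), xmin, np.inf)
        closed = (alpha - 1) / (alpha - 1 - m) * xmin**m
        print(f"m={m}: numeric={numeric:.4f}, closed form={closed:.4f}")
    else:
        print(f"m={m}: m >= alpha - 1, so the moment diverges")
```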
19. Distributional properties
For any power law with exponent α > 1, the median is
x_{1/2} = 2^(1/(α−1)) xmin
If we use a power law to model the wealth distribution, then we might be interested in the fraction of wealth held by the richer half:
∫_{x_{1/2}}^{∞} x p(x) dx / ∫_{xmin}^{∞} x p(x) dx = (x_{1/2} / xmin)^(−α+2) = 2^(−(α−2)/(α−1))
provided α > 2, so that the integrals converge
When the wealth distribution was modelled using a power law, α was estimated to be 2.1, giving 2^−0.091 ≈ 0.94, so about 94% of the wealth is in the hands of the richer 50% of the population
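As a quick sanity check of that arithmetic (my own):

```python
alpha = 2.1
richer_half_share = 2 ** (-(alpha - 2) / (alpha - 1))
print(round(richer_half_share, 3))  # ~0.939, i.e. roughly 94% of total wealth
```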
21. Top-heavy distribution & the 80/20 rule
Pareto principle: aka 80/20 rule
The law of the vital few (or the principle of factor sparsity) states that, for many events, roughly 80% of the effects come from 20% of the causes
For example, the distribution of world GDP
Population quantile    Income
Richest 20%            82.70%
Second 20%             11.75%
Third 20%               2.30%
Fourth 20%              1.85%
Poorest 20%             1.40%
Other examples are:
80% of your profits come from 20% of your customers
80% of your complaints come from 20% of your customers
80% of your profits come from 20% of the time you spend
23. Scale-free distributions
The power law distribution is often referred to as a scale-free distribution
A power law is the only distribution that looks the same regardless of the scale on which we measure it
For any b, we have
p(bx) = g(b) p(x)
That is, if we increase the scale on which we measure x by a factor of b, the shape of the distribution p(x) is unchanged, except for a multiplicative constant
The power-law distribution is the only distribution with this property
24. Random numbers
For the continuous case, we can generate random numbers using the
standard inversion method:
x = xmin (1 − u)^(−1/(α−1))
where u ∼ U(0, 1)
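A minimal Python sketch of this inversion sampler (my own; the function name and defaults are assumptions):

```python
import numpy as np

def rpl_continuous(n, alpha, xmin=1.0, rng=None):
    """Draw n samples from the continuous power law by inverting the CDF."""
    rng = np.random.default_rng(rng)
    u = rng.random(n)                       # u ~ Uniform(0, 1)
    return xmin * (1 - u) ** (-1 / (alpha - 1))

x = rpl_continuous(100_000, alpha=2.5, rng=42)
print(x.min(), round(x.mean(), 3))          # min is ~xmin; mean ~ (alpha-1)/(alpha-2) = 3 for alpha=2.5
```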
26. Random numbers
The discrete case is a bit more tricky
Instead, we have to solve the CDF numerically, by “doubling up” and then a binary search
So for a given u, we first bracket the solution via:
1: x2 := xmin
2: repeat
3: x1 := x2
4: x2 := 2x1
5: until P(x2) < 1 − u
Here P(x) denotes Pr(X ≥ x). Basically, the algorithm tests whether the solution lies in [x, 2x), starting with x = xmin
Once we have bracketed the solution, we pin it down with a binary search
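A Python sketch of this scheme (my own; using scipy's Hurwitz zeta to evaluate Pr(X ≥ x) is one convenient choice, not something prescribed by the talk):

```python
import numpy as np
from scipy.special import zeta  # zeta(s, q) = Hurwitz zeta, sum over k >= 0 of (k + q)**(-s)

def rpl_discrete(n, alpha, xmin=1, rng=None):
    """Sample the discrete power law p(x) ~ x**(-alpha), x = xmin, xmin+1, ...
    by doubling up to bracket the solution, then a binary search."""
    rng = np.random.default_rng(rng)
    norm = zeta(alpha, xmin)
    ccdf = lambda x: zeta(alpha, x) / norm          # Pr(X >= x)
    samples = np.empty(n, dtype=int)
    for i in range(n):
        u = rng.random()
        x1, x2 = xmin, xmin
        while ccdf(x2) >= 1 - u:                    # double until we pass the u-quantile
            x1, x2 = x2, 2 * x2
        while x2 - x1 > 1:                          # binary search inside [x1, x2)
            mid = (x1 + x2) // 2
            if ccdf(mid) >= 1 - u:
                x1 = mid
            else:
                x2 = mid
        samples[i] = x1
    return samples

x = rpl_discrete(10_000, alpha=2.5, rng=1)
print(x.min(), np.mean(x == 1))  # smallest value is xmin; Pr(X = 1) should be ~1/zeta(2.5) ~ 0.75
```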
28. Fitting power law distributions
Suppose we know xmin and wish to estimate the exponent α.
29. Method 1
1 Bin your data: [xmin, xmin + Δx), [xmin + Δx, xmin + 2Δx), ...
2 Plot your data on a log-log plot
3 Use least squares to estimate α
[Figure: binned data on log-log axes for bin sizes 0.01, 0.1 and 1.0]
You could also use logarithmic binning, which is better (or should I say, not as bad?)
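For concreteness, a minimal Python sketch (my own) of this binned least-squares approach; it is shown because it is the method under discussion, not because it is recommended, and the bin width and function name are arbitrary choices.

```python
import numpy as np

def fit_binned_ls(x, xmin=1.0, bin_width=0.1):
    """Estimate alpha by binning, taking logs, and fitting a straight line (method 1)."""
    edges = np.arange(xmin, x.max() + bin_width, bin_width)
    counts, edges = np.histogram(x, bins=edges, density=True)
    centres = 0.5 * (edges[:-1] + edges[1:])
    keep = counts > 0                                   # log of empty bins is undefined
    slope, intercept = np.polyfit(np.log(centres[keep]), np.log(counts[keep]), 1)
    return -slope                                       # slope of log p(x) vs log x is -alpha

rng = np.random.default_rng(0)
x = 1.0 * (1 - rng.random(50_000)) ** (-1 / 1.5)        # true alpha = 2.5, xmin = 1
print(round(fit_binned_ls(x), 2))                       # typically drifts away from 2.5
```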
32. Method 2
Similar to method 1, but
Don’t bin, just plot the data CDF
Then use least squares to estimate α
Using linear regression is a bad idea
Error estimates are completely off
It doesn’t even provide a good point estimate of α
On the bright side, you do get a good R² value
33. Method 3: Log-Likelihood
The log-likelihood isn’t that hard to derive
Continuous:
ℓ(α | x, xmin) = n log(α − 1) − n log(xmin) − α Σ_{i=1}^{n} log(xi / xmin)
Discrete:
ℓ(α | x, xmin) = −n log[ζ(α, xmin)] − α Σ_{i=1}^{n} log(xi)
              = −n log[ζ(α) − Σ_{i=1}^{xmin−1} i^−α] − α Σ_{i=1}^{n} log(xi)
35. MLEs
Maximising the log-likelihood gives
α̂ = 1 + n [Σ_{i=1}^{n} ln(xi / xmin)]^−1
An estimate of the associated error is
σ = (α̂ − 1) / √n
The discrete case is a bit more tricky and involves ignoring higher-order terms, to get
α̂ ≈ 1 + n [Σ_{i=1}^{n} ln(xi / (xmin − 0.5))]^−1
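A minimal Python sketch (my own) of these estimators, checked against data simulated with a known α:

```python
import numpy as np

def fit_alpha_continuous(x, xmin):
    """Continuous MLE: alpha_hat = 1 + n / sum(log(x/xmin)), with std error (alpha_hat - 1)/sqrt(n)."""
    x = x[x >= xmin]
    n = len(x)
    alpha_hat = 1 + n / np.sum(np.log(x / xmin))
    return alpha_hat, (alpha_hat - 1) / np.sqrt(n)

def fit_alpha_discrete(x, xmin):
    """Approximate discrete MLE, using the xmin - 0.5 correction from the slide."""
    x = x[x >= xmin]
    return 1 + len(x) / np.sum(np.log(x / (xmin - 0.5)))

rng = np.random.default_rng(0)
x = 1.0 * (1 - rng.random(10_000)) ** (-1 / 1.5)      # continuous samples with alpha = 2.5, xmin = 1
print(fit_alpha_continuous(x, xmin=1.0))              # roughly (2.5, 0.015)
```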
36. Estimating xmin
Recall that the power-law pdf is
p(x) = ((α − 1) / xmin) (x / xmin)^−α
where α > 1 and xmin > 0
xmin isn't a parameter in the usual sense: it's a cut-off in the state space
Typically power-laws are only present in the distributional tails.
So how much of the data should we discard so our distribution fits a
power-law?
37. Estimating xmin : method 1
The most common way is to just look at the log-log plot
What could be easier!
[Figure: empirical complementary CDFs 1 − P(x) on log-log axes for six data sets: blackouts, fires, solar flares, Moby Dick word frequencies, terrorism, and web links]
38. Estimating xmin : method 2
Use a "Bayesian approach" - the BIC:
BIC = −2ℓ + k ln n = −2ℓ + xmin ln n
Increasing xmin increases the number of parameters
Only suitable for discrete distributions
39. Estimating xmin : method 3
Minimise the distance between the data and the fitted model CDFs:
D = max_{x ≥ xmin} |S(x) − P(x)|
where S(x) is the CDF of the data and P(x) is the theoretical CDF (this is the Kolmogorov-Smirnov statistic)
Our estimate x̂min is then the value of xmin that minimises D
Use some form of bootstrapping to get a handle on the uncertainty of x̂min
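A minimal Python sketch (my own) of this procedure for continuous data: for each candidate xmin, fit α by MLE on the tail, compute the KS distance, and keep the candidate with the smallest distance. The candidate grid and test data are arbitrary choices.

```python
import numpy as np

def ks_distance(x, alpha, xmin):
    """KS statistic between the empirical CDF of the tail and the fitted power-law CDF."""
    tail = np.sort(x[x >= xmin])
    n = len(tail)
    emp = np.arange(1, n + 1) / n                       # empirical CDF S(x) at the data points
    theo = 1 - (tail / xmin) ** (-(alpha - 1))          # fitted CDF P(x)
    return np.max(np.abs(emp - theo))

def estimate_xmin(x, candidates):
    """Pick the xmin (from the supplied candidates) that minimises the KS distance."""
    best = None
    for xmin in candidates:
        tail = x[x >= xmin]
        alpha = 1 + len(tail) / np.sum(np.log(tail / xmin))   # continuous MLE for this candidate
        d = ks_distance(x, alpha, xmin)
        if best is None or d < best[2]:
            best = (xmin, alpha, d)
    return best

rng = np.random.default_rng(2)
body = rng.uniform(1.0, 2.0, 10_000)                    # non-power-law body below xmin
tail = 2.0 * (1 - rng.random(10_000)) ** (-1 / 1.5)     # power-law tail: xmin = 2, alpha = 2.5
x = np.concatenate([body, tail])
print(estimate_xmin(x, candidates=np.linspace(1.0, 5.0, 41)))  # estimate should land near xmin = 2
```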
42. Word distributions
Suppose we type randomly on a
typewriter
We hit the space bar with probability qs
and a letter with probability ql
If there are m letters in the alphabet,
then
ql = (1 − qs )/m
The distribution of word frequency has the form p(x) ∼ x^−α
http://activerain.com/
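A small simulation sketch (my own) of this random-typing mechanism: type a long random character stream, split it into words, and fit α to the word-frequency distribution using the discrete MLE approximation from earlier. The alphabet size m = 5 and qs = 0.2 are arbitrary choices.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
m, qs = 5, 0.2                                   # 5-letter alphabet, space with probability 0.2
letters = list("abcde") + [" "]
probs = [(1 - qs) / m] * m + [qs]                # each letter has probability (1 - qs)/m
stream = "".join(rng.choice(letters, size=2_000_000, p=probs))

freqs = np.array(sorted(Counter(stream.split()).values(), reverse=True))
xmin = 1
alpha_hat = 1 + len(freqs) / np.sum(np.log(freqs / (xmin - 0.5)))
print(round(alpha_hat, 2))                       # word frequencies decay roughly as a power law
```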
43. Relationship between α value and Zipf’s principle of least
effort.
α value        Examples in literature                                                       Least effort for
α < 1.6        Advanced schizophrenia
1.6 ≤ α < 2    Military combat texts, Wikipedia, web pages listed on the Open Directory Project   Annotator
α = 2          Single-author texts                                                          Equal effort levels
2 < α ≤ 2.4    Multi-author texts                                                           Audience
α > 2.4        Fragmented discourse schizophrenia
45. Random walks
Suppose we have a 1d random walk
At each unit of time, we move ±1
[Figure: a sample path of the walk, Position against Time]
If we start at position 0, what is the probability that the first return to 0 happens at time t?
47. Random walks
With a bit of algebra, we get:
f_2n = (2n choose n) / ((2n − 1) 2^(2n))
For large n, using Stirling's approximation, we get
f_2n ≈ 2^(2n) / (√(πn) (2n − 1) 2^(2n)) = 1 / ((2n − 1) √(πn))
So as n → ∞, we get
f_2n ∼ n^(−3/2)
So the distribution of return times follows a power law with exponent α = 3/2!
Tenuous link to phylogenetics
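A quick simulation sketch (my own) that checks both the exact first-return formula and the heavy tail it implies; if f_2n ∼ n^(−3/2), then Pr(T > t) should decay like t^(−1/2).

```python
import numpy as np
from math import comb

def first_return_time(rng, max_steps=2_000):
    """Time of first return to 0 for a simple +/-1 random walk (inf if not within max_steps)."""
    path = np.cumsum(rng.choice((-1, 1), size=max_steps))
    hits = np.nonzero(path == 0)[0]
    return hits[0] + 1 if hits.size else np.inf

rng = np.random.default_rng(0)
times = np.array([first_return_time(rng) for _ in range(20_000)])

# Check the exact formula for small n
for n in (1, 2, 3):
    exact = comb(2 * n, n) / ((2 * n - 1) * 2 ** (2 * n))
    print(2 * n, round(exact, 4), round(float(np.mean(times == 2 * n)), 4))

# Tail check: Pr(T > t) should fall like t**(-1/2), consistent with f_2n ~ n**(-3/2)
p50, p500 = np.mean(times > 50), np.mean(times > 500)
print(round(float(np.log(p50 / p500) / np.log(500 / 50)), 2))   # roughly 0.5
```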
49. Phase transitions and critical phenomena
Suppose we have a simple lattice. Each square is coloured with probability p = 0.5
We can look at the clusters of coloured squares. For example, the mean cluster area, s, of a randomly chosen square:
If a square is white, then zero
If a square is coloured, but surrounded by white, then one
etc
When p is small, s is independent of
the lattice size
When p is large, s depends on the
lattice size
50. Phase transitions and critical phenomena
[Figure: lattices at p = 0.3, p = 0.5927... and p = 0.9]
As we increase p, the value of s also increases
For some p, s starts to increase with the lattice size
This is known as the critical value, and is p = pc = 0.5927462...
If we calculate the distribution p(s), then when p = pc, p(s) follows a power-law distribution
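A small Python sketch (my own) of this experiment, using scipy.ndimage.label to find clusters of coloured squares on square lattices of increasing size; the lattice sizes are arbitrary choices.

```python
import numpy as np
from scipy import ndimage

def mean_cluster_area(p, size, rng):
    """Mean cluster area of a randomly chosen square on a size x size lattice coloured with prob p."""
    grid = rng.random((size, size)) < p
    labels, n_clusters = ndimage.label(grid)             # 4-connected clusters of coloured squares
    if n_clusters == 0:
        return 0.0
    areas = np.bincount(labels.ravel())[1:]              # area of each labelled cluster
    # A coloured square contributes the area of its cluster; a white square contributes zero
    square_areas = np.where(grid, areas[labels - 1], 0)
    return square_areas.mean()

rng = np.random.default_rng(0)
for p in (0.3, 0.5927, 0.9):
    print(p, [round(mean_cluster_area(p, size, rng), 1) for size in (50, 100, 200)])
# Below p_c the mean barely changes with lattice size; at and above p_c it grows with the lattice
```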
52. Forest fire
This simple model has been used as a primitive model of forest fires
We start with an empty lattice and trees grow at random
Every so often, a forest fire strikes at random
If the forest is too connected, i.e. large p, then the forest burns down
So (it is argued) the forest density oscillates around p = pc
This is an example of self-organised criticality
53. Future work
There isn’t even an R package for power law estimation
Writing this talk I have (more or less) written one
Use a Bayesian change point model to estimate xmin in a vaguely
sensible way
RJMCMC to jump between the power law and other heavy-tailed distributions
References
A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. http://arxiv.org/abs/0706.1062
M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. http://arxiv.org/abs/cond-mat/0412004