2. Talk outline
1 Introduction to power laws
2 Distributional properties
3 Parameter inference
4 Power law generating mechanisms
http://xkcd.com/
3. Classic example: distribution of US cities
Some data sets vary over an enormous range
US towns & cities: from Duffield (pop 52) to New York City (pop 8 million)
The data is highly right-skewed
When the data is plotted on a logarithmic scale, it seems to follow a straight line
This observation is attributed to Zipf
[Figures: histogram of No. of Cities against City population; Cumulative No. of Cities against City population on log-log axes]
4. Distribution of world cities
World city populations for 8 countries
[Figure: log size vs log rank; Log Population against Log Rank for the largest world cities, labelled from New York, Mumbai, São Paulo and Delhi down to Tucson]
http://brenocon.com/blog/2009/05/zipfs-law-and-world-city-populations/
6. What does it mean?
Let p(x) dx be the fraction of cities with a population between x and x + dx
If this histogram is a straight line on log-log scales, then
ln p(x) = −α ln x + c
where α and c are constants
Hence
p(x) = C x^−α
where C = e^c
Distributions of this form are said to follow a power law
The constant α is called the exponent of the power law
We typically don’t care about c.
7. The power law distribution
Name          f(x)                               Notes
Power law     x^−α                               Pareto distribution
Exponential   e^(−λx)
Log-normal    (1/x) exp(−(ln x − µ)² / (2σ²))
Power law     x^−α                               Zeta distribution
Power law     x^−α                               x = 1, ..., n; Zipf's distribution
Yule          Γ(x) / Γ(x + α)
Poisson       λ^x / x!
9. Alleged power-law phenomena
The frequency of occurrence of unique words in the novel Moby Dick by
Herman Melville
The numbers of customers affected in electrical blackouts in the United
States between 1984 and 2002
The number of links to web sites found in a 1997 web crawl of about 200
million web pages
The number of hits on web pages
The number of papers scientists write
The number of citations received by papers
Annual incomes
Sales of books, music, and in fact anything that can be sold
12. The power law distribution
The power-law distribution is
p(x) ∝ x^−α
where α, the scaling parameter, is a constant
The scaling parameter typically lies in the range 2 < α < 3, although there are occasional exceptions
Typically, the entire process doesn’t obey a power law
Instead, the power law applies only for values greater than some
minimum xmin
13. Power law: PDF & CDF
For the continuous power law, the pdf is
p(x) = ((α − 1) / xmin) (x / xmin)^−α
where α > 1 and xmin > 0
The CDF is
P(x) = 1 − (x / xmin)^(−α+1)
[Figure: PDF and CDF of the continuous power law for α = 1.50, 1.75, 2.00, 2.25, 2.50, plotted over 0 ≤ x ≤ 10]
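As a concrete illustration (my own, not from the talk), a minimal Python sketch of these two formulas; the function names and the choice xmin = 1 in the demo are assumptions.

```python
import numpy as np

def pl_pdf(x, alpha, xmin):
    """Continuous power-law pdf: (alpha - 1)/xmin * (x/xmin)**(-alpha), for x >= xmin."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= xmin, (alpha - 1) / xmin * (x / xmin) ** (-alpha), 0.0)

def pl_cdf(x, alpha, xmin):
    """Continuous power-law CDF: 1 - (x/xmin)**(-(alpha - 1)), for x >= xmin."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= xmin, 1 - (x / xmin) ** (-(alpha - 1)), 0.0)

xs = np.array([1.0, 2.0, 5.0, 10.0])
for alpha in (1.5, 2.0, 2.5):
    print(alpha, np.round(pl_pdf(xs, alpha, 1.0), 3), np.round(pl_cdf(xs, alpha, 1.0), 3))
```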
14. Power law: PDF & CDF
For the discrete power law, the pmf is
p(x) = x^−α / ζ(α, xmin)
where
ζ(α, xmin) = Σ_{n=0}^{∞} (n + xmin)^−α
is the generalised (Hurwitz) zeta function
When xmin = 1, ζ(α, 1) is the standard zeta function
[Figure: pmf and CDF of the discrete power law for α = 1.50, 1.75, 2.00, 2.25, 2.50, plotted over 0 ≤ x ≤ 10]
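For completeness, a small Python sketch (my own) of the discrete pmf, using scipy's Hurwitz zeta for the normalising constant:

```python
import numpy as np
from scipy.special import zeta  # zeta(s, q) is the Hurwitz zeta function

def dpl_pmf(x, alpha, xmin=1):
    """Discrete power-law pmf: x**(-alpha) / zeta(alpha, xmin) for integer x >= xmin."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= xmin, x ** (-alpha) / zeta(alpha, xmin), 0.0)

xs = np.arange(1, 11)
print(np.round(dpl_pmf(xs, alpha=2.5), 4))   # decays like x^-2.5
print(round(dpl_pmf(xs, alpha=2.5).sum(), 4))  # partial sum over x = 1..10, approaches 1 as x grows
```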
16. Moments
Moments:
⟨x^m⟩ = E[X^m] = ∫_{xmin}^{∞} x^m p(x) dx = ((α − 1) / (α − 1 − m)) xmin^m
Hence, when m ≥ α − 1, we have diverging moments
So when
α < 2, all moments are infinite
α < 3, all second and higher-order moments are infinite
α < 4, all third and higher-order moments are infinite
....
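A quick numerical check of the closed form (my own, not from the talk), using scipy's quad on a convergent case; α = 2.5 and xmin = 1 are arbitrary choices.

```python
import numpy as np
from scipy.integrate import quad

alpha, xmin = 2.5, 1.0
pdf = lambda x: (alpha - 1) / xmin * (x / xmin) ** (-alpha)

for m in (1, 2):
    if m < alpha - 1:
        numeric, _ = quad(lambda x: x**m * pdf(x), xmin, np.inf)
        closed = (alpha - 1) / (alpha - 1 - m) * xmin**m
        print(f"m={m}: numeric={numeric:.4f}, closed form={closed:.4f}")
    else:
        print(f"m={m}: m >= alpha - 1, so the moment diverges")
```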
19. Distributional properties
For any power law with exponent α > 1, the median is
x_{1/2} = 2^(1/(α−1)) xmin
If we use a power law to model the wealth distribution, then we might be interested in the fraction of wealth held by the richer half:
∫_{x_{1/2}}^{∞} x p(x) dx / ∫_{xmin}^{∞} x p(x) dx = (x_{1/2} / xmin)^(−α+2) = 2^(−(α−2)/(α−1))
provided α > 2, so that the integrals converge
When the wealth distribution was modelled using a power law, α was estimated to be 2.1, giving 2^−0.091 ≈ 0.94, so about 94% of the wealth is in the hands of the richer 50% of the population
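As a quick sanity check of that arithmetic (my own):

```python
alpha = 2.1
richer_half_share = 2 ** (-(alpha - 2) / (alpha - 1))
print(round(richer_half_share, 3))  # ~0.939, i.e. roughly 94% of total wealth
```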
21. Top-heavy distribution & the 80/20 rule
Pareto principle: aka 80/20 rule
The law of the vital few (or the principle of factor sparsity) states that, for many events, roughly 80% of the effects come from 20% of the causes
For example, the distribution of world GDP
Population quantile    Income
Richest 20%            82.70%
Second 20%             11.75%
Third 20%               2.30%
Fourth 20%              1.85%
Poorest 20%             1.40%
Other examples are:
80% of your profits come from 20% of your customers
80% of your complaints come from 20% of your customers
80% of your profits come from 20% of the time you spend
23. Scale-free distributions
The power law distribution is often referred to as a scale-free distribution
A power law is the only distribution that looks the same regardless of the scale on which we measure it
For any b, we have
p(bx) = g(b) p(x)
That is, if we increase the scale on which we measure x by a factor of b, the shape of the distribution p(x) is unchanged, except for a multiplicative constant
The power-law distribution is the only distribution with this property
24. Random numbers
For the continuous case, we can generate random numbers using the
standard inversion method:
x = xmin (1 − u)^(−1/(α−1))
where u ∼ U(0, 1)
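A minimal Python sketch of this inversion sampler (my own; the function name and defaults are assumptions):

```python
import numpy as np

def rpl_continuous(n, alpha, xmin=1.0, rng=None):
    """Draw n samples from the continuous power law by inverting the CDF."""
    rng = np.random.default_rng(rng)
    u = rng.random(n)                       # u ~ Uniform(0, 1)
    return xmin * (1 - u) ** (-1 / (alpha - 1))

x = rpl_continuous(100_000, alpha=2.5, rng=42)
print(x.min(), round(x.mean(), 3))          # min is ~xmin; mean ~ (alpha-1)/(alpha-2) = 3 for alpha=2.5
```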
26. Random numbers
The discrete case is a bit more tricky
Instead, we have to solve the CDF numerically, by “doubling up” and then a binary search
So for a given u, we first bracket the solution via:
1: x2 := xmin
2: repeat
3: x1 := x2
4: x2 := 2x1
5: until P(x2) < 1 − u
Here P(x) denotes Pr(X ≥ x). Basically, the algorithm tests whether the solution lies in [x, 2x), starting with x = xmin
Once we have bracketed the solution, we pin it down with a binary search
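A Python sketch of this scheme (my own; using scipy's Hurwitz zeta to evaluate Pr(X ≥ x) is one convenient choice, not something prescribed by the talk):

```python
import numpy as np
from scipy.special import zeta  # zeta(s, q) = Hurwitz zeta, sum over k >= 0 of (k + q)**(-s)

def rpl_discrete(n, alpha, xmin=1, rng=None):
    """Sample the discrete power law p(x) ~ x**(-alpha), x = xmin, xmin+1, ...
    by doubling up to bracket the solution, then a binary search."""
    rng = np.random.default_rng(rng)
    norm = zeta(alpha, xmin)
    ccdf = lambda x: zeta(alpha, x) / norm          # Pr(X >= x)
    samples = np.empty(n, dtype=int)
    for i in range(n):
        u = rng.random()
        x1, x2 = xmin, xmin
        while ccdf(x2) >= 1 - u:                    # double until we pass the u-quantile
            x1, x2 = x2, 2 * x2
        while x2 - x1 > 1:                          # binary search inside [x1, x2)
            mid = (x1 + x2) // 2
            if ccdf(mid) >= 1 - u:
                x1 = mid
            else:
                x2 = mid
        samples[i] = x1
    return samples

x = rpl_discrete(10_000, alpha=2.5, rng=1)
print(x.min(), np.mean(x == 1))  # smallest value is xmin; Pr(X = 1) should be ~1/zeta(2.5) ~ 0.75
```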
28. Fitting power law distributions
Suppose we know xmin and wish to estimate the exponent α.
29. Method 1
1 Bin your data: [xmin, xmin + Δx), [xmin + Δx, xmin + 2Δx), ...
2 Plot your data on a log-log plot
3 Use least squares to estimate α
[Figure: binned data on log-log axes for bin sizes 0.01, 0.1 and 1.0]
You could also use logarithmic binning, which is better (or should I say, not as bad?)
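For concreteness, a minimal Python sketch (my own) of this binned least-squares approach; it is shown because it is the method under discussion, not because it is recommended, and the bin width and function name are arbitrary choices.

```python
import numpy as np

def fit_binned_ls(x, xmin=1.0, bin_width=0.1):
    """Estimate alpha by binning, taking logs, and fitting a straight line (method 1)."""
    edges = np.arange(xmin, x.max() + bin_width, bin_width)
    counts, edges = np.histogram(x, bins=edges, density=True)
    centres = 0.5 * (edges[:-1] + edges[1:])
    keep = counts > 0                                   # log of empty bins is undefined
    slope, intercept = np.polyfit(np.log(centres[keep]), np.log(counts[keep]), 1)
    return -slope                                       # slope of log p(x) vs log x is -alpha

rng = np.random.default_rng(0)
x = 1.0 * (1 - rng.random(50_000)) ** (-1 / 1.5)        # true alpha = 2.5, xmin = 1
print(round(fit_binned_ls(x), 2))                       # typically drifts away from 2.5
```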
32. Method 2
Similar to method 1, but
Don’t bin, just plot the data CDF
Then use least squares to estimate α
Using linear regression is a bad idea
Error estimates are completely off
It doesn’t even provide a good point estimate of α
On the bright side, you do get a good R² value
33. Method 3: Log-Likelihood
The log-likelihood isn’t that hard to derive
Continuous:
ℓ(α | x, xmin) = n log(α − 1) − n log(xmin) − α Σ_{i=1}^{n} log(xi / xmin)
Discrete:
ℓ(α | x, xmin) = −n log[ζ(α, xmin)] − α Σ_{i=1}^{n} log(xi)
              = −n log[ζ(α) − Σ_{i=1}^{xmin−1} i^−α] − α Σ_{i=1}^{n} log(xi)
35. MLEs
Maximising the log-likelihood gives
α̂ = 1 + n [Σ_{i=1}^{n} ln(xi / xmin)]^−1
An estimate of the associated error is
σ = (α̂ − 1) / √n
The discrete case is a bit more tricky and involves ignoring higher-order terms, to get
α̂ ≈ 1 + n [Σ_{i=1}^{n} ln(xi / (xmin − 0.5))]^−1
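A minimal Python sketch (my own) of these estimators, checked against data simulated with a known α:

```python
import numpy as np

def fit_alpha_continuous(x, xmin):
    """Continuous MLE: alpha_hat = 1 + n / sum(log(x/xmin)), with std error (alpha_hat - 1)/sqrt(n)."""
    x = x[x >= xmin]
    n = len(x)
    alpha_hat = 1 + n / np.sum(np.log(x / xmin))
    return alpha_hat, (alpha_hat - 1) / np.sqrt(n)

def fit_alpha_discrete(x, xmin):
    """Approximate discrete MLE, using the xmin - 0.5 correction from the slide."""
    x = x[x >= xmin]
    return 1 + len(x) / np.sum(np.log(x / (xmin - 0.5)))

rng = np.random.default_rng(0)
x = 1.0 * (1 - rng.random(10_000)) ** (-1 / 1.5)      # continuous samples with alpha = 2.5, xmin = 1
print(fit_alpha_continuous(x, xmin=1.0))              # roughly (2.5, 0.015)
```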
36. Estimating xmin
Recall that the power-law pdf is
p(x) = ((α − 1) / xmin) (x / xmin)^−α
where α > 1 and xmin > 0
xmin isn't a parameter in the usual sense: it's a cut-off in the state space
Typically power-laws are only present in the distributional tails.
So how much of the data should we discard so our distribution fits a
power-law?
37. Estimating xmin : method 1
The most common way is to just look at the log-log plot
What could be easier!
[Figure: empirical complementary CDFs 1 − P(x) on log-log axes for six data sets: blackouts, fires, solar flares, Moby Dick word frequencies, terrorism, and web links]
38. Estimating xmin : method 2
Use a "Bayesian approach" - the BIC:
BIC = −2ℓ + k ln n = −2ℓ + xmin ln n
Increasing xmin increases the number of parameters
Only suitable for discrete distributions
39. Estimating xmin : method 3
Minimise the distance between the data and the fitted model CDFs:
D = max_{x ≥ xmin} |S(x) − P(x)|
where S(x) is the CDF of the data and P(x) is the theoretical CDF (this is the Kolmogorov-Smirnov statistic)
Our estimate x̂min is then the value of xmin that minimises D
Use some form of bootstrapping to get a handle on the uncertainty of x̂min
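A minimal Python sketch (my own) of this procedure for continuous data: for each candidate xmin, fit α by MLE on the tail, compute the KS distance, and keep the candidate with the smallest distance. The candidate grid and test data are arbitrary choices.

```python
import numpy as np

def ks_distance(x, alpha, xmin):
    """KS statistic between the empirical CDF of the tail and the fitted power-law CDF."""
    tail = np.sort(x[x >= xmin])
    n = len(tail)
    emp = np.arange(1, n + 1) / n                       # empirical CDF S(x) at the data points
    theo = 1 - (tail / xmin) ** (-(alpha - 1))          # fitted CDF P(x)
    return np.max(np.abs(emp - theo))

def estimate_xmin(x, candidates):
    """Pick the xmin (from the supplied candidates) that minimises the KS distance."""
    best = None
    for xmin in candidates:
        tail = x[x >= xmin]
        alpha = 1 + len(tail) / np.sum(np.log(tail / xmin))   # continuous MLE for this candidate
        d = ks_distance(x, alpha, xmin)
        if best is None or d < best[2]:
            best = (xmin, alpha, d)
    return best

rng = np.random.default_rng(2)
body = rng.uniform(1.0, 2.0, 10_000)                    # non-power-law body below xmin
tail = 2.0 * (1 - rng.random(10_000)) ** (-1 / 1.5)     # power-law tail: xmin = 2, alpha = 2.5
x = np.concatenate([body, tail])
print(estimate_xmin(x, candidates=np.linspace(1.0, 5.0, 41)))  # estimate should land near xmin = 2
```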
42. Word distributions
Suppose we type randomly on a
typewriter
We hit the space bar with probability qs
and a letter with probability ql
If there are m letters in the alphabet,
then
ql = (1 − qs )/m
The distribution of word frequency has the form p(x) ∼ x^−α
http://activerain.com/
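A small simulation sketch (my own) of this random-typing mechanism: type a long random character stream, split it into words, and fit α to the word-frequency distribution using the discrete MLE approximation from earlier. The alphabet size m = 5 and qs = 0.2 are arbitrary choices.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
m, qs = 5, 0.2                                   # 5-letter alphabet, space with probability 0.2
letters = list("abcde") + [" "]
probs = [(1 - qs) / m] * m + [qs]                # each letter has probability (1 - qs)/m
stream = "".join(rng.choice(letters, size=2_000_000, p=probs))

freqs = np.array(sorted(Counter(stream.split()).values(), reverse=True))
xmin = 1
alpha_hat = 1 + len(freqs) / np.sum(np.log(freqs / (xmin - 0.5)))
print(round(alpha_hat, 2))                       # word frequencies decay roughly as a power law
```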
43. Relationship between α value and Zipf’s principle of least
effort.
α value        Examples in literature                                                       Least effort for
α < 1.6        Advanced schizophrenia
1.6 ≤ α < 2    Military combat texts, Wikipedia, web pages listed on the Open Directory Project   Annotator
α = 2          Single-author texts                                                          Equal effort levels
2 < α ≤ 2.4    Multi-author texts                                                           Audience
α > 2.4        Fragmented discourse schizophrenia
45. Random walks
Suppose we have a 1d random walk
At each unit of time, we move ±1
[Figure: a sample path of the walk, Position against Time]
If we start at position 0, what is the probability that the first return to 0 happens at time t?
47. Random walks
With a bit of algebra, we get:
f_2n = (2n choose n) / ((2n − 1) 2^(2n))
For large n, using Stirling's approximation, we get
f_2n ≈ 2^(2n) / (√(πn) (2n − 1) 2^(2n)) = 1 / ((2n − 1) √(πn))
So as n → ∞, we get
f_2n ∼ n^(−3/2)
So the distribution of return times follows a power law with exponent α = 3/2!
Tenuous link to phylogenetics
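A quick simulation sketch (my own) that checks both the exact first-return formula and the heavy tail it implies; if f_2n ∼ n^(−3/2), then Pr(T > t) should decay like t^(−1/2).

```python
import numpy as np
from math import comb

def first_return_time(rng, max_steps=2_000):
    """Time of first return to 0 for a simple +/-1 random walk (inf if not within max_steps)."""
    path = np.cumsum(rng.choice((-1, 1), size=max_steps))
    hits = np.nonzero(path == 0)[0]
    return hits[0] + 1 if hits.size else np.inf

rng = np.random.default_rng(0)
times = np.array([first_return_time(rng) for _ in range(20_000)])

# Check the exact formula for small n
for n in (1, 2, 3):
    exact = comb(2 * n, n) / ((2 * n - 1) * 2 ** (2 * n))
    print(2 * n, round(exact, 4), round(float(np.mean(times == 2 * n)), 4))

# Tail check: Pr(T > t) should fall like t**(-1/2), consistent with f_2n ~ n**(-3/2)
p50, p500 = np.mean(times > 50), np.mean(times > 500)
print(round(float(np.log(p50 / p500) / np.log(500 / 50)), 2))   # roughly 0.5
```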
49. Phase transitions and critical phenomena
Suppose we have a simple lattice. Each square is coloured with probability p = 0.5
We can look at the clusters of coloured squares. For example, the mean cluster area, s, of a randomly chosen square:
If a square is white, then zero
If a square is coloured, but surrounded by white, then one
etc
When p is small, s is independent of
the lattice size
When p is large, s depends on the
lattice size
50. Phase transitions and critical phenomena
[Figure: lattices at p = 0.3, p = 0.5927... and p = 0.9]
As we increase p, the value of s also increases
For some p, s starts to increase with the lattice size
This is known as the critical value, and is p = pc = 0.5927462...
If we calculate the distribution p(s), then when p = pc, p(s) follows a power-law distribution
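A small Python sketch (my own) of this experiment, using scipy.ndimage.label to find clusters of coloured squares on square lattices of increasing size; the lattice sizes are arbitrary choices.

```python
import numpy as np
from scipy import ndimage

def mean_cluster_area(p, size, rng):
    """Mean cluster area of a randomly chosen square on a size x size lattice coloured with prob p."""
    grid = rng.random((size, size)) < p
    labels, n_clusters = ndimage.label(grid)             # 4-connected clusters of coloured squares
    if n_clusters == 0:
        return 0.0
    areas = np.bincount(labels.ravel())[1:]              # area of each labelled cluster
    # A coloured square contributes the area of its cluster; a white square contributes zero
    square_areas = np.where(grid, areas[labels - 1], 0)
    return square_areas.mean()

rng = np.random.default_rng(0)
for p in (0.3, 0.5927, 0.9):
    print(p, [round(mean_cluster_area(p, size, rng), 1) for size in (50, 100, 200)])
# Below p_c the mean barely changes with lattice size; at and above p_c it grows with the lattice
```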
52. Forest fire
This simple model has been used as a primitive model of forest fires
We start with an empty lattice and trees grow at random
Every so often, a forest fire strikes at random
If the forest is too connected, i.e. large p, then the forest burns down
So (it is argued) the forest density oscillates around p = pc
This is an example of self-organised criticality
53. Future work
There isn’t even an R package for power law estimation
Writing this talk I have (more or less) written one
Use a Bayesian change point model to estimate xmin in a vaguely
sensible way
RJMCMC to jump between the power law and other heavy-tailed distributions
References
A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. http://arxiv.org/abs/0706.1062
M. E. J. Newman. Power laws, Pareto distributions and Zipf's law. http://arxiv.org/abs/cond-mat/0412004