Genotype Imputation via Matrix Completion


ERIC CHI
DEPARTMENT OF HUMAN GENETICS
UNIVERSITY OF CALIFORNIA, LOS ANGELES

October 25, 2012

Collaborators

UCLA
Kenneth Lange
Diego Ortega Del Vecchyo

NCSU
Hua Zhou

USC
Gary Chen

Where are we going?

Genotype Imputation

Movie Recommendation Systems
Matrix Completion
Mendel-Impute

Some Test Cases
Genome wide association
Low coverage sequencing

Genetic Variation: SNPs

Much of the variation between people is due to Single Nucleotide Polymorphisms (SNPs).

If you have the G allele in position k, that increases your risk of high cholesterol 3x.

Seven aligned sequences; position k is the first base shown, and the SNP is diallelic (A or G):
...AATGATC...
...AATGATC...
...GATGATC...
...AATGATC...
...AATGATC...
...AATGATC...
...GATGATC...

Simulated Association Study
Genetic Association

y ≈ µ + βx

y = Your Cholesterol
µ = Baseline Cholesterol
x = "State" at SNP k

[Figure: seven aligned sequences (...AATGATC... or ...GATGATC...) with the diallelic SNP at position k]

Genome Wide Association Studies

Problem: Don't always get to see what's at SNP k.
Problem: Costs $$$ to sequence an individual.

GWAS:
~1K subjects
~1M "select" SNPs

[Figure: seven aligned sequences (...AATGATC... or ...GATGATC...) with the diallelic SNP at position k]

SNPs Have High Spatial Correlation
“Linkage Disequilibrium”
Reference   Observation
A A A G     A
A T A A     A
T T G T     .
G G G G     .
A G A A     .
T T T T     .
C G G C     C

Rows are SNPs, columns are haplotypes, "." marks a SNP that was not typed.

Haplotypes: Blocks of highly correlated SNPs

SNPs Have High Spatial Correlation
“Linkage Disequilibrium”
Reference   Observation   Prediction
A A A G     A             A
A T A A     A             A
T T G T     .             T
G G G G     .             G
A G A A     .             A
T T T T     .             T
C G G C     C             C

Deliberately Introduce Missingness

Strategically type (measure) certain
SNPs.

Use reference haplotypes to
reconstruct.

Motivation: Save $$$

Challenge #1: Recombination

A A A G A
A T A A A
T T G T T
G G G G .
A G A A G
T T T T T
C G G C G

Challenge #1: Recombination

A A A G A A
A T A A A A
T T G T T T
G G G G . G
A G A A G G
T T T T T T
C G G C G G

Challenge #2: Mutations/Typing Errors

A A A G G
A T A A A
T T G T T
G G G G .
A G A A G
T T T T T
C G G C C

Challenge #2: Mutations/Typing Errors

A A A G G G
A T A A A A
T T G T T T
G G G G . G
A G A A G A
T T T T T T
C G G C C C

Genotype Imputation
Case 1: Traditional GWAS

A A A G A/G A G
A T A A A/A A A
T T G T ./. T T
G G G G ./. G G
A G A A A/A A A
T T T T T/T T T
C G G C C/G C G

Idea:
Estimate underlying haplotypes via Hidden Markov Models

Deliberately Introduce Missingness

Strategically type (measure) certain SNPs.

Motivation: Save $$$

Use references to reconstruct via HMM

Problem: Very slow!
Weeks to months (on a cluster) to impute all chromosomes for a large study (~1K subjects)

Netflix Prize: October 2006-August 2009

• Predict un-rated movies

• Training Set: ratings from about 480,000 customers on about 18,000 movies.

• Around 98.7% of ratings missing.

• $1,000,000 prize!




“You look at the cumulative hours
and you’re getting Ph.D.’s for a dollar an hour.” -- Reed Hastings




Methods: Variations on the SVD and k-nearest neighbors

Ratings Matrix

Rows: Movies. Columns: Customers.

                     Alice   Bob   Charlie   ···
Star Wars              2      5       ?      ···
Harry Potter           ?      1       ?      ···
Miss Congeniality      1      5       1      ···
Lord of the Rings      5      2       ?      ···
  ⋮                    ⋮      ⋮       ⋮      ⋱

Unphased Genotype Matrix

Rows: SNPs. Columns: Subjects.

             "Alice"   "Bob"   "Charlie"   ···
rs274044        0        2         ?       ···
rs274541        ?        1         ?       ···
rs286593        1        2         1       ···
rs287261        2        0         ?       ···
  ⋮             ⋮        ⋮         ⋮       ⋱

0 = Homozygous Major Allele
1 = Heterozygous
2 = Homozygous Minor Allele
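
The 0/1/2 coding above can be sketched in a few lines. The function name `dosage`, the "A/G" string format, and the use of "./." for an untyped genotype are assumptions made for illustration; they are not taken from Mendel-Impute.

```python
import math

def dosage(genotype, minor_allele):
    """Count copies of the minor allele in an unphased genotype such as "A/G".

    Returns 0.0, 1.0, or 2.0, or NaN when either allele is untyped (".").
    """
    first, second = genotype.split("/")
    if "." in (first, second):
        return math.nan
    return float((first == minor_allele) + (second == minor_allele))

# With G as the minor allele:
print([dosage(g, "G") for g in ["A/A", "A/G", "G/G", "./."]])
# prints [0.0, 1.0, 2.0, nan]
```

The NaN entries play the role of the "?" cells in the matrix.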

Solve with Matrix Completion?

Columns: SNPs. Rows: Subjects.

Study Panel            ... 0 0 ? ? 0 1 1 0 0 2 ...
                       ... 0 0 ? ? 0 1 1 0 0 1 ...
Reference Haplotypes   ... 0 1 0 0 0 1 0 0 0 1 ...

Idea:
Linkage disequilibrium = low rank structure?
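
One way to see why linkage disequilibrium suggests low rank is a toy simulation. This is a sketch under strong assumptions that are not in the slides: every subject carries two haplotypes copied whole from a small pool of founder haplotypes, with no recombination, mutation, or typing error. The names `simulate_genotypes` and `n_founders` are hypothetical. Rows are subjects and columns are SNPs.

```python
import numpy as np

def simulate_genotypes(n_subjects=200, n_snps=100, n_founders=4, seed=0):
    """Genotype matrix built from whole founder haplotypes (idealized, no recombination)."""
    rng = np.random.default_rng(seed)
    founders = rng.integers(0, 2, size=(n_founders, n_snps))  # 0/1 alleles
    first = founders[rng.integers(0, n_founders, size=n_subjects)]
    second = founders[rng.integers(0, n_founders, size=n_subjects)]
    return (first + second).astype(float)  # entries in {0, 1, 2}

G = simulate_genotypes()
singular_values = np.linalg.svd(G, compute_uv=False)
numerical_rank = int(np.sum(singular_values > 1e-8 * singular_values[0]))
print(G.shape, numerical_rank)
```

Every row of `G` is a sum of two founder rows, so the rank is at most `n_founders` no matter how many subjects or SNPs there are. Recombination and mutation break this exactness, which is why one would expect real data to be only approximately low rank.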



Not missing at random = problem?


The Singular Value Decomposition

$$X = U \Sigma V^t = \sum_{k=1}^{R} \sigma_k u_k v_k^t$$

Each term $u_k v_k^t$ is a rank 1 matrix.
Each rank 1 matrix is a basic pattern.
SVD expresses X as a mixture of basic patterns (a mixture of rank 1 matrices).

The answer to the question:
What is the "best" rank r approximation to X?

Eckart-Young Theorem (1936):

$$U_r \Sigma_r V_r^t = \arg\min_{\operatorname{rank}(Z) \le r} \sum_{i,j} (x_{ij} - z_{ij})^2$$
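
A small numerical check of the Eckart-Young statement, as a sketch: keep the r largest singular values and their singular vectors, and compare the squared error with the sum of the discarded squared singular values. The helper name `best_rank_r` and the random test matrix are illustrative assumptions.

```python
import numpy as np

def best_rank_r(X, r):
    """Truncated SVD: keep the r largest singular values and their vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6))
r = 2
X_r = best_rank_r(X, r)

s = np.linalg.svd(X, compute_uv=False)
squared_error = np.sum((X - X_r) ** 2)
discarded = np.sum(s[r:] ** 2)
print(squared_error, discarded)  # the two numbers agree up to rounding
```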

Matrix Completion

Problem: Given an observed m × n matrix X with missing entries indexed by Ω ⊂ {1, . . . , m} × {1, . . . , n}, fill in the missing entries.
Solution: Find a low rank matrix consistent with the observed entries of X.

$$\min_{\operatorname{rank}(Z) \le r} f(Z) := \frac{1}{2} \sum_{(i,j) \in \Omega^c} (x_{ij} - z_{ij})^2$$

Candès and Recht, 2009
Cai, Candès, and Shen, 2010
Mazumder, Hastie, and Tibshirani, 2010
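
The loss f only looks at observed entries, which a boolean mask makes explicit. A minimal sketch, assuming missing entries are stored as NaN and `observed` marks the complement of Ω; the function names are hypothetical.

```python
import numpy as np

def objective(X, Z, observed):
    """f(Z) = 0.5 * sum over observed (i, j) of (x_ij - z_ij)^2."""
    resid = np.where(observed, X - Z, 0.0)  # missing entries contribute nothing
    return 0.5 * np.sum(resid ** 2)

def gradient(X, Z, observed):
    """Gradient of f: (z_ij - x_ij) on observed entries, 0 elsewhere."""
    return np.where(observed, Z - X, 0.0)

X = np.array([[1.0, np.nan],
              [3.0, 4.0]])
observed = ~np.isnan(X)
Z = np.zeros_like(X)
print(objective(X, Z, observed))  # prints 13.0
```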

Solving a Convex Relaxation

Convex relaxation under the nuclear norm $\|Z\|_*$.

$$\min_Z h(Z) := f(Z) + \lambda \|Z\|_*$$

$$\|Z\|_* = \sum_r \sigma_r,$$

where $\sigma_r$ are the singular values of $Z$.

$\|X\|_*$ is to $\operatorname{rank}(X)$ as $\|x\|_1$ is to $\|x\|_0$
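
The analogy can be made concrete on a tiny example: the rank counts the nonzero singular values (an $\ell_0$ count), while the nuclear norm sums them (an $\ell_1$ sum). The matrix below is an arbitrary illustration.

```python
import numpy as np

Z = np.diag([3.0, 4.0, 0.0])
sigma = np.linalg.svd(Z, compute_uv=False)

rank = int(np.sum(sigma > 1e-12))  # number of nonzero singular values
nuclear_norm = float(np.sum(sigma))  # sum of singular values

print(rank, nuclear_norm)  # prints 2 7.0
```

NumPy also exposes the same quantity as `np.linalg.norm(Z, "nuc")`.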

Solving the convex relaxation

Solve by minimizing a local quadratic surrogate of f(Z)

$$g(Z \mid Z^k) = f(Z^k) + \langle \nabla f(Z^k), Z - Z^k \rangle + \frac{1}{2\delta} \|Z - Z^k\|_F^2 = \frac{1}{2\delta} \|Z - M\|_F^2 + c^k,$$

where

$$M = Z^k - \delta \nabla f(Z^k)$$

Core problem:

$$\min_Z \frac{1}{2} \|Z - M\|_F^2 + \lambda \|Z\|_*.$$

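Solving the Core Problem

Core problem:

$$\min_Z \frac{1}{2} \|Z - M\|_F^2 + \lambda \|Z\|_*.$$

Solution:

$$Z^\star = D_\lambda(M),$$

where, for the SVD $M = U \Sigma V^t$,

$$D_\lambda(M) := U S_\lambda(\Sigma) V^t, \qquad S_\lambda(\sigma) = \operatorname{sign}(\sigma)(|\sigma| - \lambda)_+ .$$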
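
The operator $D_\lambda$ translates almost line for line into NumPy. A minimal sketch; the function names `soft_threshold` and `svt` are assumptions, and a full SVD is used here for clarity rather than speed.

```python
import numpy as np

def soft_threshold(sigma, lam):
    """S_lam(sigma) = sign(sigma) * max(|sigma| - lam, 0), applied elementwise."""
    return np.sign(sigma) * np.maximum(np.abs(sigma) - lam, 0.0)

def svt(M, lam):
    """D_lam(M) = U S_lam(Sigma) V^t for the SVD M = U Sigma V^t."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * soft_threshold(s, lam)) @ Vt

M = np.diag([5.0, 2.0, 0.5])
Z_star = svt(M, lam=1.0)
print(np.linalg.svd(Z_star, compute_uv=False))  # singular values 4, 1, 0 up to rounding
```

Singular values below λ are set to zero, so thresholding lowers the rank as well as shrinking the matrix.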

Solving convex relaxation

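$$Z^{k+1} = \arg\min_Z \; g(Z \mid Z^k) + \lambda \|Z\|_* = \arg\min_Z \; \frac{1}{2\delta} \|Z - M\|_F^2 + \lambda \|Z\|_* + c^k,$$

where

$$M = Z^k - \delta \nabla f(Z^k)$$

Multiplying the objective by δ gives the core problem with λ replaced by δλ, so the minimizer is $D_{\delta\lambda}(M)$.

repeat

$$Z^{k+1} = D_{\delta\lambda}\bigl(Z^k - \delta \nabla f(Z^k)\bigr)$$

until convergence

With δ = 1 the update is $Z^{k+1} = D_\lambda(Z^k - \nabla f(Z^k))$.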
Accelerate with Nesterov’s Method
(Beck and Teboulle 2009)
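
The repeat-until-convergence loop can be sketched as follows. This is an unaccelerated illustration, not the Mendel-Impute code: the Nesterov step is left out, the stopping rule and defaults are assumptions, and the names `complete` and `objective` are hypothetical. The gradient of f has Lipschitz constant 1, so a step size δ ≤ 1 keeps the objective from increasing.

```python
import numpy as np

def svt(M, lam):
    """Singular value thresholding D_lam(M)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def objective(X0, Z, observed, lam):
    """h(Z) = 0.5 * sum over observed entries of (x - z)^2 + lam * ||Z||_*."""
    resid = np.where(observed, X0 - Z, 0.0)
    return 0.5 * np.sum(resid ** 2) + lam * np.linalg.norm(Z, "nuc")

def complete(X, observed, lam, delta=1.0, max_iter=500, tol=1e-7):
    """Iterate Z <- D_{delta*lam}(Z - delta * grad f(Z)) starting from Z = 0."""
    X0 = np.where(observed, X, 0.0)  # zero-fill so NaNs never propagate
    Z = np.zeros_like(X0)
    for _ in range(max_iter):
        grad = np.where(observed, Z - X0, 0.0)
        Z_next = svt(Z - delta * grad, delta * lam)
        step = np.linalg.norm(Z_next - Z)
        Z = Z_next
        if step <= tol * max(1.0, np.linalg.norm(Z)):
            break
    return Z

# Toy problem: a rank 1 matrix with roughly 30% of entries hidden.
rng = np.random.default_rng(0)
truth = np.outer(rng.uniform(1, 2, 20), rng.uniform(1, 2, 15))
observed = rng.random(truth.shape) < 0.7
X = np.where(observed, truth, np.nan)

Z = complete(X, observed, lam=0.5)
rmse_missing = np.sqrt(np.mean((Z - truth)[~observed] ** 2))
print(rmse_missing)
```

Each pass costs one SVD, which is the step that "Singular Value Thresholding without the SVD" style methods aim to avoid.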

Mendel-Impute
Sliding window
Exploit linkage disequilibrium to solve smaller problems.

SNPs
Subjects

A B C

Construct hold-out set by masking entries in A and C
Train on observed entries from A, B, and C
Choose λ based on performance on hold-out set
Impute missing entries in B.
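
The four steps above can be sketched for a single window. Everything concrete here is an assumption made for illustration: rows are subjects and columns are SNPs, the window sizes, the λ grid, the 10% hold-out fraction, and the stand-in solver `fit` are all hypothetical, and the toy genotypes come from a made-up founder pool.

```python
import numpy as np

def svt(M, lam):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def fit(X, observed, lam, n_iter=200):
    """Stand-in solver: unaccelerated thresholding iteration with step size 1."""
    X0 = np.where(observed, X, 0.0)
    Z = np.zeros_like(X0)
    for _ in range(n_iter):
        Z = svt(np.where(observed, X0, Z), lam)  # equals D_lam(Z - grad f(Z))
    return Z

def choose_lambda(X, observed, flank_cols, lambdas, holdout_frac=0.1, seed=0):
    """Mask some observed entries in the flanks, then pick the best lambda on them."""
    rng = np.random.default_rng(seed)
    in_flank = np.zeros(X.shape[1], dtype=bool)
    in_flank[flank_cols] = True
    candidates = np.argwhere(observed & in_flank[None, :])
    n_hold = max(1, int(holdout_frac * len(candidates)))
    picked = candidates[rng.choice(len(candidates), size=n_hold, replace=False)]
    holdout = np.zeros(observed.shape, dtype=bool)
    holdout[picked[:, 0], picked[:, 1]] = True
    train = observed & ~holdout
    errors = [np.mean((fit(X, train, lam)[holdout] - X[holdout]) ** 2)
              for lam in lambdas]
    return lambdas[int(np.argmin(errors))], holdout

# Toy window: 40 subjects, 30 SNPs split into flanks A, C and center B.
rng = np.random.default_rng(0)
founders = rng.integers(0, 2, size=(4, 30))
G = (founders[rng.integers(0, 4, 40)] + founders[rng.integers(0, 4, 40)]).astype(float)
observed = rng.random(G.shape) < 0.85
A, B, C = np.arange(0, 10), np.arange(10, 20), np.arange(20, 30)

lambdas = [0.1, 0.5, 2.0]
best_lam, holdout = choose_lambda(G, observed, np.concatenate([A, C]), lambdas)
Z = fit(G, observed, best_lam)  # refit on all observed entries of A, B, C
imputed_B = np.where(observed[:, B], G[:, B], Z[:, B])
print(best_lam, imputed_B.shape)
```

Holding out entries only in the flanks leaves every observed entry of B available for the final fit.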

Not Missing at Random

[Figure: two scatter panels, "Raw Imputed Values" and "Final Imputed Dosages"; x-axis: Subject (0 to 500); y-axis: value (0.0 to 2.0).]

[Figure: Mendel-Impute versus Beagle, both axes running from 0.0 to 1.0.]

Example: Simulated Association Study

Quantitative trait with a single causative SNP.
545 people and 60,000 SNPs.
ith measurement generated by causative SNP

$$y_i = \mu + \beta x_i + \sigma \varepsilon_i,$$

where µ = 160, β = 3, σ = 5 and the $\varepsilon_i$ are i.i.d. N(0, 1).
$x_i$ = dosage (count of minor alleles for subject i).
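
A sketch of how such a trait could be simulated and the effect recovered by least squares. The parameters µ, β, σ and the sample size come from the slide; the minor allele frequency of 0.3, the seeds, and the function names are assumptions, and only the single causative SNP is simulated rather than all 60,000.

```python
import numpy as np

def simulate_trait(dosage, mu=160.0, beta=3.0, sigma=5.0, seed=0):
    """Draw y_i = mu + beta * x_i + sigma * eps_i with eps_i ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    return mu + beta * dosage + sigma * rng.standard_normal(dosage.shape[0])

def fit_line(dosage, y):
    """Ordinary least squares estimates of (mu, beta)."""
    design = np.column_stack([np.ones_like(dosage), dosage])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1]

rng = np.random.default_rng(1)
maf = 0.3  # assumed minor allele frequency, not from the slides
x = rng.binomial(2, maf, size=545).astype(float)  # dosages in {0, 1, 2}
y = simulate_trait(x)

mu_hat, beta_hat = fit_line(x, y)
print(mu_hat, beta_hat)
```

In an imputation experiment, the same regression would be run with imputed dosages in place of `x` to see whether the association signal survives.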

Results: Timing

Program         Run time (HH:MM)
MACH            12:40
BEAGLE          10:20
IMPUTE2         07:10
Mendel-Impute   00:56

Table: Timing results on high-coverage genotyping microarray data.

Genotype Imputation
Case 2: Low Coverage Sequencing

Genotypes   Alleles   Reads
A A         A,G       10,0
A T         A,T       50,50
T T         T,G       0
G G         G         13
A G         A,G       21,17
T T         T         8
C G         C,G       0

Idea:

Genotype Imputation
Case 2: Low Coverage Sequencing

A A A G (A,G) = (20,15) A G
A T A A (A) = (10) A A
T T G T (T) = (5) T T
G G G G No reads G G
A G A A (A) = (13) A A
T T T T (T) = (17) T T
C G G C (C,G) = (25,1) C G

Idea:


Consider ith SNP
A occurs 5% (minor allele)
T occurs 95% (major allele)

Mapping: Reads (A,T) → Posterior Mean in [0,2]
18,1 → 1.9
0,11 → 0.1

Columns: SNPs. Rows: Subjects.
... 0.1 0.3  ?  1.3  ?  1.9 ...
... 0.2  ?  0.5 1.7  ?  1.5 ...
... 0.4 1.3  ?   ?  0.1  ?  ...

Missingness is more random

Application: Low Coverage Sequencing

Weighted version for sequencing data

$$\min_Z \frac{1}{2} \sum_{i,j} w_{ij} (x_{ij} - z_{ij})^2 + \lambda \|Z\|_*,$$

$w_{ij}$ := number of reads at locus j for subject i.
$x_{ij}$ ∈ [0, 2] (posterior mean allele dosage)
Binomial likelihood ("success" = sequencing error).
Prior: Hardy-Weinberg genotype frequencies.
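
A sketch of how read counts could be mapped to a posterior mean dosage and a weight under the stated model. The per-read error rate of 0.01, the function name, and the treatment of a heterozygote as showing either allele with probability 0.5 are assumptions, so the numbers it produces need not match the ones on the slides.

```python
import numpy as np

def posterior_mean_dosage(minor_reads, major_reads, maf, error_rate=0.01):
    """Posterior mean count of minor alleles given read counts at one SNP.

    Prior: Hardy-Weinberg frequencies for dosage 0, 1, 2.
    Likelihood: binomial in the number of reads showing the minor allele.
    """
    q = maf
    prior = np.array([(1 - q) ** 2, 2 * q * (1 - q), q ** 2])
    # Probability that one read shows the minor allele, for dosage 0, 1, 2.
    p_minor = np.array([error_rate, 0.5, 1.0 - error_rate])
    log_post = (np.log(prior)
                + minor_reads * np.log(p_minor)
                + major_reads * np.log(1.0 - p_minor))
    post = np.exp(log_post - log_post.max())  # subtract max for stability
    post /= post.sum()
    return float(post @ np.array([0.0, 1.0, 2.0]))

# Minor allele A at 5%, major allele T at 95%, reads given as (A, T).
for a_reads, t_reads in [(18, 1), (0, 11), (0, 0)]:
    x_ij = posterior_mean_dosage(a_reads, t_reads, maf=0.05)
    w_ij = a_reads + t_reads  # weight = number of reads
    print((a_reads, t_reads), round(x_ij, 3), w_ij)
```

With no reads the posterior equals the prior and the weight is zero, so that entry has no influence on the weighted loss.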

Results: Timing

Program Run time
Mendel-Impute 03:18:36
BEAGLE 23:27:34
IMPUTE2 31:02:09

Pros: Fast and accurate enough
Cons: No phasing, just imputation

Summary

Matrix completion is purely empirical!
-- Singular vectors are not interpretable.

Accuracy is in the ballpark of standard model-based methods

But much faster.

Improvements:
-- Singular Value Thresholding without the SVD (Cai and Osher)
