This document discusses genotype imputation via matrix completion. It summarizes Eric Chi's work using matrix completion methods to impute missing genotypes in large genetic datasets. Specifically, it discusses how genotype data can be represented as a matrix with missing entries, how linkage disequilibrium induces low-rank structure, and how algorithms like singular value decomposition can be used to recover missing values. The document also describes challenges like non-random missingness and how Mendel-Impute addresses these through a sliding window approach to impute missing genotypes.
3. Where are we going?
Genotype Imputation
Movie Recommendation Systems
Matrix Completion
Mendel-Impute
Some Test Cases
Genome wide association
Low coverage sequencing
4. Genetic Variation: SNPs
Much of the variation between people consists of Single Nucleotide Polymorphisms (SNPs).
Example: if you have the G allele in position k, that increases your risk of high cholesterol 3x.
[Figure: aligned sequences ...AATGATC... and ...GATGATC..., with the diallelic SNP at position k marked]
5. Simulated Association Study
y ≈ µ + βx
y: Your Cholesterol
µ: Baseline Cholesterol
β: Genetic Association
x: “State” at SNP k
[Figure: aligned sequences ...AATGATC... and ...GATGATC..., with the diallelic SNP at position k marked]
6. Genome Wide Association Studies
Problem: Don’t always get to see what’s at SNP k
Problem: Costs $$$ to sequence an individual.
GWAS:
~1K subjects
~1M “select” SNPs
[Figure: aligned sequences ...AATGATC... and ...GATGATC..., with the diallelic SNP at position k marked]
8. SNPs Have High Spatial Correlation
“Linkage Disequilibrium”
Reference   Observation   Prediction
A A A G     A             A
A T A A     A             A
T T G T     .             T
G G G G     .             G
A G A A     .             A
T T T T     .             T
C G G C     C             C
Haplotypes: Blocks of highly correlated SNPs
11. Challenge #1: Recombination
Reference   Observation   Prediction
A A A G     A             A
A T A A     A             A
T T G T     T             T
G G G G     .             G
A G A A     G             G
T T T T     T             T
C G G C     G             G
Haplotypes: Blocks of highly correlated SNPs
13. Challenge #2: Mutations/Typing Errors
Reference   Observation   Prediction
A A A G     G             G
A T A A     A             A
T T G T     T             T
G G G G     .             G
A G A A     G             A
T T T T     T             T
C G G C     C             C
Haplotypes: Blocks of highly correlated SNPs
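The table above can be reproduced with a very simple matching rule. The sketch below is not the Hidden Markov Model approach used by standard imputation programs; it only copies the reference haplotype that disagrees with the observation at the fewest typed SNPs. The helper names `mismatches` and `impute_by_nearest` are hypothetical, and the data are the four reference haplotypes and the observation from this slide.

```python
# Reference haplotypes from the slide, one string per haplotype (column),
# read top to bottom over the seven SNPs.
reference = ["AATGATC", "ATTGGTG", "AAGGATG", "GATGATC"]

# Observed haplotype from the slide; "." marks the untyped SNP.
observed = "GAT.GTC"


def mismatches(ref, obs):
    """Count typed positions where the observation disagrees with ref."""
    return sum(1 for r, o in zip(ref, obs) if o != "." and o != r)


def impute_by_nearest(reference, obs):
    """Copy the closest reference haplotype (a toy rule, not an HMM)."""
    return min(reference, key=lambda ref: mismatches(ref, obs))


prediction = impute_by_nearest(reference, observed)
print(prediction)  # prints GATGATC
```

The closest reference differs from the observation only at the fifth SNP, so this rule fills the gap with G and also overrides the observed G with A, as in the Prediction column. A single best match cannot represent a recombined haplotype, which is one reason a model that switches between references is needed.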
14. Genotype Imputation
Case 1: Traditional GWAS
Reference   Observation   Prediction
A A A G     A/G           A G
A T A A     A/A           A A
T T G T     ./.           T T
G G G G     ./.           G G
A G A A     A/A           A A
T T T T     T/T           T T
C G G C     C/G           C G
Idea:
Estimate underlying haplotypes via Hidden Markov Models
15. Deliberately Introduce Missingness
Strategically type (measure) certain SNPs.
Motivation: Save $$$
Use references to reconstruct via HMM
Problem: Very slow!
Weeks to months (on a cluster) to impute all chromosomes for a large study (~1K subjects)
17. Netflix Prize: October 2006-August 2009
• Predict un-rated movies
• Training Set: ratings from about 480,000 customers on about 18,000 movies.
• Around 98.7% missing ratings.
• $1,000,000 prize!
“You look at the cumulative hours
and you’re getting Ph.D.’s for a dollar an hour.” -- Reed Hastings
Methods: Variations on the SVD and k-nearest neighbors
21. Ratings Matrix
Rows: Movies. Columns: Customers.
                    Alice   Bob   Charlie   ···
Star Wars           2       5     ?         ···
Harry Potter        ?       1     ?         ···
Miss Congeniality   1       5     1         ···
Lord of the Rings   5       2     ?         ···
···                 ···     ···   ···       ···
28. The Singular Value Decomposition
\[
X = U \Sigma V^t = \sum_{r=1}^{R} \sigma_r u_r v_r^t
\]
Each term u_r v_r^t is a rank 1 matrix; the sum is a mixture of rank 1 matrices.
\[
U_k \Sigma_k V_k^t = \arg\min_{\operatorname{rank}(Z) = k} \sum_{i,j} (x_{ij} - z_{ij})^2
\]
Each rank 1 matrix is a basic pattern
SVD expresses X as a mixture of basic patterns
30. The Singular Value Decomposition
The answer to the question:
What is the “best” rank r approximation to X?
\[
X = U \Sigma V^t = \sum_{k=1}^{R} \sigma_k u_k v_k^t
\]
Eckart-Young Theorem (1936):
\[
U_r \Sigma_r V_r^t = \arg\min_{\operatorname{rank}(Z) \le r} \sum_{i,j} (x_{ij} - z_{ij})^2
\]
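As a quick numerical illustration of the best rank r approximation, the sketch below truncates the SVD of a small random matrix and checks that the squared error equals the sum of the discarded squared singular values, which is the error attained by the truncated SVD. The matrix and the helper name `truncated_svd` are illustrative assumptions, not part of the talk.

```python
import numpy as np

# Small random matrix standing in for X (illustrative data only).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))

# Thin SVD: X = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)


def truncated_svd(U, s, Vt, r):
    """Return U_r Sigma_r V_r^t, the sum of the first r rank 1 terms."""
    return (U[:, :r] * s[:r]) @ Vt[:r, :]


r = 2
X_r = truncated_svd(U, s, Vt, r)

# Squared error of the truncation versus the discarded singular values.
squared_error = np.sum((X - X_r) ** 2)
discarded = np.sum(s[r:] ** 2)
print(np.isclose(squared_error, discarded))
```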
31. Matrix Completion
Problem: Given an observed m × n matrix X with missing entries indexed by Ω ⊂ {1, . . . , m} × {1, . . . , n}, fill in the missing entries.
Solution: Find a low rank matrix, consistent with the observed entries of X.
\[
\min_{\operatorname{rank}(Z) \le r} f(Z) := \frac{1}{2} \sum_{(i,j) \in \Omega^c} (x_{ij} - z_{ij})^2
\]
Candès and Recht, 2009
Cai, Candès, Shen, 2010
Mazumder, Hastie, and Tibshirani, 2010
32. Solving the convex relaxation
Convex relaxation under the nuclear norm \|Z\|_*.
\[
\min_Z h(Z) := f(Z) + \lambda \|Z\|_*
\]
\[
\|Z\|_* = \sum_r \sigma_r,
\]
where σ_r are the singular values of Z.
\|X\|_* is to rank(X) as \|x\|_1 is to \|x\|_0
33. Solving the convex relaxation
Solve by minimizing a local quadratic surrogate of f(Z)
\[
\begin{aligned}
g(Z \mid Z^k) &= f(Z^k) + \langle \nabla f(Z^k), Z - Z^k \rangle + \frac{1}{2\delta} \|Z - Z^k\|_F^2 \\
              &= \frac{1}{2\delta} \|Z - M\|_F^2 + c^k,
\end{aligned}
\]
where
\[
M = Z^k - \delta \nabla f(Z^k)
\]
Core problem:
\[
\min_Z \frac{1}{2} \|Z - M\|_F^2 + \lambda \|Z\|_*.
\]
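34. Solving the Core Problem
Core problem:
\[
\min_Z \frac{1}{2} \|Z - M\|_F^2 + \lambda \|Z\|_*.
\]
Solution:
\[
Z^\star = D_\lambda(M), \qquad D_\lambda(M) := U S_\lambda(\Sigma) V^t, \qquad M = U \Sigma V^t,
\]
where S_λ is applied to each singular value:
\[
S_\lambda(\sigma) = \operatorname{sign}(\sigma)(|\sigma| - \lambda)_+ .
\]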
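The operator D_λ can be written in a few lines of NumPy. This is a minimal sketch assuming a dense matrix and a full SVD; the function name `svt` is hypothetical. Because singular values are nonnegative, the sign factor in S_λ is always 1 and the soft threshold reduces to max(σ − λ, 0).

```python
import numpy as np


def svt(M, lam):
    """Singular value thresholding: D_lam(M) = U S_lam(Sigma) V^t."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    shrunk = np.maximum(s - lam, 0.0)  # soft-threshold each singular value
    return (U * shrunk) @ Vt


# Diagonal test matrix with singular values 3, 1 and 0.2.
M = np.diag([3.0, 1.0, 0.2])
Z = svt(M, lam=0.5)

# The singular values become 2.5, 0.5 and 0, so the rank drops to 2.
print(np.round(Z, 3))
```

Singular values below λ are set exactly to zero, which is how the nuclear norm penalty produces low rank solutions.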
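35. Solving convex relaxation
\[
Z^{k+1} = \arg\min_Z \; g(Z \mid Z^k) + \lambda \|Z\|_* = \arg\min_Z \; \frac{1}{2\delta} \|Z - M\|_F^2 + \lambda \|Z\|_* + c^k,
\]
where
\[
M = Z^k - \delta \nabla f(Z^k)
\]
Multiplying the objective by δ puts it in the core-problem form with threshold λδ.
repeat
\[
Z^{k+1} = D_{\lambda\delta}\left(Z^k - \delta \nabla f(Z^k)\right)
\]
until convergence
Accelerate with Nesterov’s Method
(Beck and Teboulle 2009)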
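The repeat loop can be sketched as follows, without the Nesterov acceleration. The assumptions here are mine: step size δ = 1 (valid because the gradient of f has Lipschitz constant 1), a Boolean mask `observed` marking the observed entries, missing entries of `X0` stored as 0, and the hypothetical function names `svt`, `objective` and `complete`. The toy data are a random rank 1 matrix, not genotypes.

```python
import numpy as np


def svt(M, lam):
    """Singular value thresholding D_lam(M)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt


def objective(Z, X0, observed, lam):
    """h(Z) = f(Z) + lam * nuclear norm, with f summed over observed entries."""
    fit = 0.5 * np.sum((X0 - Z)[observed] ** 2)
    nuclear = np.sum(np.linalg.svd(Z, compute_uv=False))
    return fit + lam * nuclear


def complete(X0, observed, lam, delta=1.0, max_iter=500, tol=1e-7):
    """Iterate Z <- D_{lam*delta}(Z - delta * grad f(Z)) until it stalls."""
    Z = np.zeros_like(X0)
    for _ in range(max_iter):
        grad = np.where(observed, Z - X0, 0.0)  # gradient of f at Z
        Z_next = svt(Z - delta * grad, lam * delta)
        done = np.linalg.norm(Z_next - Z) <= tol * max(1.0, np.linalg.norm(Z))
        Z = Z_next
        if done:
            break
    return Z


# Toy example: rank 1 matrix with roughly 30% of entries hidden.
rng = np.random.default_rng(0)
truth = np.outer(rng.normal(size=20), rng.normal(size=15))
observed = rng.random(truth.shape) > 0.3
X0 = np.where(observed, truth, 0.0)

lam = 0.1
Z = complete(X0, observed, lam)
start = objective(np.zeros_like(X0), X0, observed, lam)
end = objective(Z, X0, observed, lam)
print(end <= start)
```

Each step minimizes a surrogate that lies above h and touches it at the current iterate, so the objective cannot increase from one iteration to the next.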
36. Mendel-Impute
Sliding window
Exploit linkage disequilibrium to solve smaller problems.
[Figure: genotype matrix with axes Subjects and SNPs, divided into three adjacent SNP windows A, B, C]
Construct hold-out-set by masking entries in A and C
Train on observed entries from A, B, and C
Choose λ based on performance on hold-out-set
Impute missing entries in B.
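The four steps above might look like the sketch below for a single window position. This is an illustration under stated assumptions, not the Mendel-Impute implementation: subjects are rows and SNPs are columns, the step size is 1, the model is refit on all observed entries once λ is chosen, and the function names, the λ grid, the window widths and the hold-out fraction are all hypothetical. The toy data are a random rank 2 matrix rather than real genotypes.

```python
import numpy as np


def svt(M, lam):
    """Singular value thresholding D_lam(M)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt


def complete(X0, observed, lam, n_iter=200):
    """Thresholded gradient steps with step size 1 (fixed iteration count)."""
    Z = np.zeros_like(X0)
    for _ in range(n_iter):
        # With step size 1, Z - grad f(Z) equals X0 on observed entries
        # and Z elsewhere.
        Z = svt(np.where(observed, X0, Z), lam)
    return Z


def impute_window(X0, observed, n_a, n_b, lams, holdout_frac=0.2, seed=0):
    """Impute window B from columns laid out as A | B | C."""
    rng = np.random.default_rng(seed)
    flank = np.ones(X0.shape[1], dtype=bool)
    flank[n_a:n_a + n_b] = False  # True for columns in A and C
    # Hold-out set: observed entries in A and C, masked at random.
    holdout = observed & flank[None, :] & (rng.random(X0.shape) < holdout_frac)
    if not holdout.any():
        raise ValueError("no hold-out entries; increase holdout_frac")
    train = observed & ~holdout

    def holdout_error(lam):
        Z = complete(X0, train, lam)
        return np.mean((Z[holdout] - X0[holdout]) ** 2)

    best_lam = min(lams, key=holdout_error)
    # Assumption: refit on every observed entry with the chosen lambda.
    Z = complete(X0, observed, best_lam)
    filled = np.where(observed, X0, Z)  # keep observed values as they are
    return filled[:, n_a:n_a + n_b], best_lam


rng = np.random.default_rng(1)
truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 24))
observed = rng.random(truth.shape) > 0.2
X0 = np.where(observed, truth, 0.0)

B, lam = impute_window(X0, observed, n_a=8, n_b=8, lams=[0.1, 1.0, 5.0])
print(B.shape, lam)
```

Sliding the window means repeating this call with the three windows shifted along the SNP axis, so each call works on a small matrix.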
39. Example: Simulated Association Study
Quantitative trait with a single causative SNP.
545 people and 60,000 SNPs.
ith measurement generated by causative SNP
\[
y_i = \mu + \beta x_i + \sigma \epsilon_i,
\]
where µ = 160, β = 3, σ = 5 and ε_i are i.i.d. N(0, 1).
x_i = dosage (count of minor alleles for i).
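The trait model can be simulated and refit in a few lines. The sketch below uses the slide's values for µ, β, σ and the sample size; the minor allele frequency of 0.3, the binomial draw of dosages and the random seed are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 545                                # people, as on the slide
mu, beta, sigma = 160.0, 3.0, 5.0      # slide values
maf = 0.3                              # assumed minor allele frequency

# Dosage x_i: count of minor alleles (0, 1 or 2) at the causative SNP.
x = rng.binomial(2, maf, size=n)

# Trait: y_i = mu + beta * x_i + sigma * eps_i with eps_i ~ N(0, 1).
y = mu + beta * x + sigma * rng.normal(size=n)

# Least squares fit of y on an intercept and the dosage.
design = np.column_stack([np.ones(n), x])
mu_hat, beta_hat = np.linalg.lstsq(design, y, rcond=None)[0]
print(mu_hat, beta_hat)
```

In an imputation experiment the same regression would be run on imputed dosages, so errors in the imputed x feed directly into the estimate of β.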
41. Results: Timing
Results A
Program         Run time (HH:MM)
MACH            12:40
BEAGLE          10:20
IMPUTE2         07:10
Mendel-Impute   00:56
Table: Timing results on high-coverage genotyping microarray data (HH:MM).
42. Genotype Imputation
Case 2: Low Coverage Sequencing
Genotypes   Alleles   Reads
A A         A,G       10,0
A T         A,T       50,50
T T         T,G       0
G G         G         13
A G         A,G       21,17
T T         T         8
C G         C,G       0
Idea:
Estimate underlying haplotypes via Hidden Markov Models
43. Genotype Imputation
Case 2: Low Coverage Sequencing
Reference   Observation        Prediction
A A A G     (A,G) = (20,15)    A G
A T A A     (A) = (10)         A A
T T G T     (T) = (5)          T T
G G G G     No reads           G G
A G A A     (A) = (13)         A A
T T T T     (T) = (17)         T T
C G G C     (C,G) = (25,1)     C G
Idea:
Estimate underlying haplotypes via Hidden Markov Models
44. Solve with Matrix Completion?
Consider ith SNP
A occurs 5% (minor allele)
T occurs 95% (major allele)
Mapping: Reads → Posterior Mean, which is in [0,2]
Reads (A,T)   Posterior Mean
18,1          1.9
0,11          0.1
Resulting matrix (rows: subjects, columns: SNPs, ? = missing):
... 0.1 0.3 ?   1.3 ?   1.9 ...
... 0.2 ?   0.5 1.7 ?   1.5 ...
... 0.4 1.3 ?   ?   0.1 ?   ...
Missingness is more random
45. Application: Low Coverage Sequencing
Weighted version for sequencing data
\[
\min_Z \frac{1}{2} \sum_{i,j} w_{ij} (x_{ij} - z_{ij})^2 + \lambda \|Z\|_*,
\]
w_ij := number of reads at locus j for subject i.
xij ∈ [0, 2] (posterior mean allele dosage)
Binomial likelihood (“success” = sequencing error).
Prior: Hardy-Weinberg genotype frequencies.
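One way to turn read counts into a posterior mean dosage is sketched below. It combines a Hardy-Weinberg prior with a binomial read likelihood, under assumptions that the slides do not spell out: a per-read error rate of 1%, reads from a heterozygote showing either allele with probability 1/2, and a hypothetical function name. The numbers it produces are not expected to match the rounded values shown in the mapping slide.

```python
def posterior_mean_dosage(n_minor, n_major, maf, err=0.01):
    """Posterior mean count of minor alleles given read counts.

    n_minor, n_major: reads showing the minor and the major allele.
    maf: minor allele frequency used in the Hardy-Weinberg prior.
    err: assumed probability that a read shows the wrong allele.
    """
    # Hardy-Weinberg prior over genotypes with 0, 1, 2 minor alleles.
    prior = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
    # Probability that a single read shows the minor allele, per genotype.
    p_minor_read = [err, 0.5, 1 - err]
    # Binomial likelihood; the binomial coefficient cancels on normalizing.
    weights = [
        pr * q ** n_minor * (1 - q) ** n_major
        for pr, q in zip(prior, p_minor_read)
    ]
    total = sum(weights)
    return sum(g * w for g, w in enumerate(weights)) / total


# Read counts from the mapping slide, with a 5% minor allele frequency.
print(posterior_mean_dosage(18, 1, maf=0.05))
print(posterior_mean_dosage(0, 11, maf=0.05))
# With no reads the posterior equals the prior, whose mean is 2 * maf.
print(posterior_mean_dosage(0, 0, maf=0.05))
```

An entry with no reads receives weight w_ij = 0 in the weighted objective, so its prior-based value does not influence the fit.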
47. Results: Timing
Example: Simulated Association Study
Program Run time
Mendel-Impute 03:18:36
BEAGLE 23:27:34
IMPUTE2 31:02:09
Pros: Fast and accurate enough
Cons: No phasing, just imputation
48. Summary
Matrix completion is purely empirical!
-- Singular vectors are not interpretable.
Accuracy is in the ballpark of standard model-based methods,
but much faster.
Improvements:
-- Singular Value Thresholding without the SVD (Cai and Osher)