Genotype Imputation via Matrix Completion


ERIC CHI
DEPARTMENT OF HUMAN GENETICS
UNIVERSITY OF CALIFORNIA, LOS ANGELES




                     October 25, 2012
Collaborators


UCLA
Kenneth Lange
Diego Ortega Del Vecchyo

NCSU
Hua Zhou

USC
Gary Chen
Where are we going?


Genotype Imputation



Movie Recommendation Systems
    Matrix Completion
    Mendel-Impute


Some Test Cases
   Genome wide association
   Low coverage sequencing
Genetic Variation: SNPs

Much of the variation between people consists of
Single Nucleotide Polymorphisms (SNPs).

If you have the G allele at position k,
your risk of high cholesterol increases 3x.

    ...AATGATC...
    ...AATGATC...
    ...GATGATC...
    ...AATGATC...
    ...AATGATC...
    ...AATGATC...
    ...GATGATC...
       k

Diallelic: position k carries one of two alleles (A or G).
Genetic Association

    y ≈ µ + βx

    y = Your Cholesterol
    µ = Baseline Cholesterol
    x = "State" at SNP k

    ...AATGATC...
    ...AATGATC...
    ...GATGATC...
    ...AATGATC...
    ...AATGATC...
    ...AATGATC...
    ...GATGATC...
       k

Diallelic
Genome Wide Association Studies

Problem: Don't always get to see what's at SNP k.

Problem: Costs $$$ to sequence an individual.

GWAS:
    ~1K subjects
    ~1M "select" SNPs

    ...AATGATC...
    ...AATGATC...
    ...GATGATC...
    ...AATGATC...
    ...AATGATC...
    ...AATGATC...
    ...GATGATC...
       k

Diallelic
SNPs Have High Spatial Correlation
   “Linkage Disequilibrium”
   Reference     Observation    Prediction
  A A A G            A              A
  A T A A            A              A
  T T G T            .              T
  G G G G            .              G
  A G A A            .              A
  T T T T            .              T
  C G G C            C              C
Haplotypes: Blocks of
highly correlated SNPs
Deliberately Introduce Missingness

Strategically type (measure) certain
SNPs.

Use reference haplotypes to
reconstruct.

Motivation: Save $$$
Challenge #1: Recombination

   Reference     Observation    Prediction
  A A A G            A              A
  A T A A            A              A
  T T G T            T              T
  G G G G            .              G
  A G A A            G              G
  T T T T            T              T
  C G G C            G              G
Haplotypes: Blocks of
highly correlated SNPs
Challenge #2: Mutations/Typing Errors

   Reference     Observation    Prediction
  A A A G            G              G
  A T A A            A              A
  T T G T            T              T
  G G G G            .              G
  A G A A            G              A
  T T T T            T              T
  C G G C            C              C
Haplotypes: Blocks of
highly correlated SNPs
Genotype Imputation
 Case 1: Traditional GWAS

 Reference           Observation             Prediction
A A A G                 A/G                    A G
A T A A                 A/A                    A A
T T G T                  ./.                   T T
G G G G                  ./.                   G G
A G A A                 A/A                    A A
T T T T                 T/T                    T T
C G G C                 C/G                    C G

Idea:
Estimate underlying haplotypes via Hidden Markov Models
Deliberately Introduce Missingness


Strategically type (measure) certain SNPs.

Motivation: Save $$$

Use references to reconstruct via HMM

Problem: Very slow!
   Weeks to months (on a cluster) to impute all
   chromosomes for a large study (~1K subjects)
Netflix Prize: October 2006-August 2009



•   Predict un-rated movies

•   Training Set: ratings from
    480,000 customers on
    18,000 movies.

•   Around 98.7% missing
    ratings.

•   $1,000,000 prize!

“You look at the cumulative hours
      and you’re getting Ph.D.’s for a dollar an hour.” -- Reed Hastings

Methods: Variations on the SVD and k-nearest neighbors
Ratings Matrix


                                     Customers
                             Alice   Bob   Charlie   ···
            Star Wars         2       5      ?       ···
           Harry Potter       ?       1      ?       ···
Movies   Miss Congeniality    1       5      1       ···
         Lord of the Rings    5       2      ?       ···
                 ⋮            ⋮       ⋮      ⋮       ⋱
Unphased Genotype Matrix


                               Subject
                 “Alice”   “Bob” “Charlie”   ···
      rs274044     0         2         ?     ···
      rs274541     ?         1         ?     ···
SNP   rs286593     1         2         1     ···
      rs287261     2         0         ?     ···
          ⋮        ⋮         ⋮         ⋮     ⋱


           0 = Homozygous Major Allele
           1 = Heterozygous
           2 = Homozygous Minor Allele
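
A minimal sketch of this 0/1/2 encoding, assuming genotype calls arrive as strings such as "A/G" with "./." for a missing call (the notation used in the Case 1 table). The function name `encode_dosage` and the use of NaN for missing entries are illustrative assumptions, not part of the talk.

```python
import numpy as np

def encode_dosage(genotype, minor_allele):
    """Count copies of the minor allele; return NaN when the call is missing."""
    alleles = genotype.split("/")
    if "." in alleles:
        return np.nan
    return float(sum(a == minor_allele for a in alleles))

# Hypothetical calls at one SNP whose minor allele is G.
calls = ["A/A", "A/G", "./.", "G/G"]
dosages = np.array([encode_dosage(g, "G") for g in calls])
print(dosages)
```

Stacking one such vector per SNP gives the unphased genotype matrix, with NaN playing the role of "?".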
Solve with Matrix Completion?



                       Study Panel


                       SNP
            ... 0 0 ? ? 0 1 1 0 0 2 ...
       Subj ... 0 0 ? ? 0 1 1 0 0 1 ...
            ... 0 1 0 0 0 1 0 0 0 1 ...

                  Reference Haplotypes



Idea:
Linkage disequilibrium = low rank structure?
            Not missing at random = problem?
The Singular Value Decomposition

    $$X = U \Sigma V^t = \sum_{r=1}^{R} \sigma_r u_r v_r^t$$

    Each term $\sigma_r u_r v_r^t$ is a rank 1 matrix; the sum is a mixture of rank 1 matrices.

    $$U_k \Sigma_k V_k^t = \arg\min_{\operatorname{rank}(Z) = k} \sum_{i,j} (x_{ij} - z_{ij})^2$$

    Each rank 1 matrix is a basic pattern.
    SVD expresses X as a mixture of basic patterns.
The Singular Value Decomposition

    The answer to the question:
    What is the "best" rank r approximation to X?

    $$X = U \Sigma V^t = \sum_{k=1}^{R} \sigma_k u_k v_k^t$$

    Eckart-Young Theorem (1936):

    $$U_r \Sigma_r V_r^t = \arg\min_{\operatorname{rank}(Z) \le r} \sum_{i,j} (x_{ij} - z_{ij})^2$$
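
The truncation can be checked numerically. The sketch below assumes NumPy and a small random matrix (both illustrative choices); it keeps the r largest singular values and compares the squared error with the sum of the squared discarded singular values, which is what the theorem implies for the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))          # illustrative data, not from the talk

# Thin SVD: X = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def truncate(U, s, Vt, r):
    """Keep the r largest singular values and their singular vectors."""
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

r = 2
X_r = truncate(U, s, Vt, r)

# Squared error of the rank-r truncation versus the sum of the squared
# discarded singular values; the two numbers should agree.
err = np.sum((X - X_r) ** 2)
print(err, np.sum(s[r:] ** 2))
```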
Matrix Completion

    Problem: Given an observed m × n matrix X with missing
    entries indexed by Ω ⊂ {1, ..., m} × {1, ..., n}, fill in the
    missing entries.

    Solution: Find a low rank matrix consistent with the observed
    entries of X.

    $$\min_{\operatorname{rank}(Z) \le r} f(Z) := \frac{1}{2} \sum_{(i,j) \in \Omega^c} (x_{ij} - z_{ij})^2$$

    Candès and Recht, 2009
    Cai, Candès, and Shen, 2010
    Mazumder, Hastie, and Tibshirani, 2010
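
A small sketch of the loss f(Z) and its gradient, assuming NumPy and a boolean mask `observed` that marks the complement of Ω (the function names are illustrative). The 3 × 3 example reuses the top-left corner of the unphased genotype matrix shown earlier, with NaN for "?".

```python
import numpy as np

def objective(X, Z, observed):
    """f(Z): half the squared error summed over observed entries only."""
    resid = np.where(observed, X - Z, 0.0)
    return 0.5 * np.sum(resid ** 2)

def gradient(X, Z, observed):
    """Gradient of f: Z - X on observed entries, zero on missing ones."""
    return np.where(observed, Z - X, 0.0)

X = np.array([[0.0, 2.0, np.nan],
              [np.nan, 1.0, np.nan],
              [1.0, 2.0, 1.0]])
observed = ~np.isnan(X)      # True where an entry was actually typed
Z = np.ones_like(X)          # arbitrary candidate matrix
print(objective(X, Z, observed))
print(gradient(X, Z, observed))
```

Missing entries contribute nothing to either quantity, which is why the fit is driven entirely by the typed genotypes.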
Solving a Convex Relaxation

    Convex relaxation under the nuclear norm $\|Z\|_*$:

    $$\min_Z h(Z) := f(Z) + \lambda \|Z\|_*$$

    $$\|Z\|_* = \sum_r \sigma_r,$$

    where $\sigma_r$ are the singular values of Z.

    $\|X\|_*$ is to $\operatorname{rank}(X)$ as $\|x\|_1$ is to $\|x\|_0$.
Solving a Convex Relaxation

    Solve by minimizing a local quadratic surrogate of f(Z):

    $$g(Z \mid Z^k) = f(Z^k) + \langle \nabla f(Z^k), Z - Z^k \rangle + \frac{1}{2\delta} \|Z - Z^k\|_F^2 = \frac{1}{2\delta} \|Z - M\|_F^2 + c^k,$$

    where

    $$M = Z^k - \delta \nabla f(Z^k)$$

    Core problem:

    $$\min_Z \frac{1}{2} \|Z - M\|_F^2 + \lambda \|Z\|_*.$$
Solving the Core Problem

    Core problem:

    $$\min_Z \frac{1}{2} \|Z - M\|_F^2 + \lambda \|Z\|_*.$$

    Solution:

    $$Z^\star = D_\lambda(M).$$

    $$D_\lambda(M) := U S_\lambda(\Sigma) V^t$$
    $$M = U \Sigma V^t$$
    $$S_\lambda(\sigma) = \operatorname{sign}(\sigma)(|\sigma| - \lambda)_+.$$
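
The operator $D_\lambda$ can be sketched in a few lines of NumPy. The function names `soft_threshold` and `svt` are illustrative, and the diagonal test matrix is chosen only so that the effect on the singular values is easy to read off.

```python
import numpy as np

def soft_threshold(sigma, lam):
    """S_lam(sigma) = sign(sigma) * max(|sigma| - lam, 0), elementwise."""
    return np.sign(sigma) * np.maximum(np.abs(sigma) - lam, 0.0)

def svt(M, lam):
    """D_lam(M): soft-threshold the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(soft_threshold(s, lam)) @ Vt

# Singular values 3, 1, 0.2 shrink by 0.5; the smallest is set to zero.
M = np.diag([3.0, 1.0, 0.2])
Z = svt(M, 0.5)
print(np.round(Z, 3))
```

Singular values below the threshold vanish, so the output has lower rank than the input; this is how the nuclear norm penalty produces low rank solutions.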
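Solving a Convex Relaxation

    $$Z^{k+1} = \arg\min_Z \left\{ g(Z \mid Z^k) + \lambda \|Z\|_* \right\} = \arg\min_Z \left\{ \frac{1}{2\delta} \|Z - M\|_F^2 + \lambda \|Z\|_* + c^k \right\},$$

    where

    $$M = Z^k - \delta \nabla f(Z^k).$$

    Multiplying the objective by δ gives the core problem with threshold δλ, so the update is:

    repeat

    $$Z^{k+1} = D_{\delta\lambda}\left(Z^k - \delta \nabla f(Z^k)\right)$$

    until convergence

    Accelerate with Nesterov's Method (Beck and Teboulle, 2009)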
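
The loop above can be sketched as follows, without the Nesterov acceleration. This is an illustrative NumPy version under stated assumptions, not the Mendel-Impute code: the step size δ = 1, the stopping rule, the synthetic rank 2 test matrix, and all function names are choices made for the example.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: shrink singular values of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def complete(X, observed, lam, delta=1.0, max_iter=500, tol=1e-6):
    """Proximal gradient iteration for f(Z) + lam * nuclear_norm(Z)."""
    X0 = np.where(observed, X, 0.0)
    Z = np.zeros_like(X0)
    for _ in range(max_iter):
        grad = np.where(observed, Z - X0, 0.0)       # gradient of f at Z
        Z_new = svt(Z - delta * grad, delta * lam)   # thresholded gradient step
        if np.linalg.norm(Z_new - Z) <= tol * max(1.0, np.linalg.norm(Z)):
            return Z_new
        Z = Z_new
    return Z

# Synthetic check: a rank 2 matrix with roughly 30% of entries hidden.
rng = np.random.default_rng(1)
truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
observed = rng.random(truth.shape) < 0.7
Z_hat = complete(truth, observed, lam=0.1)

# Relative error on the entries the solver never saw.
hidden = ~observed
rel_err = np.linalg.norm((Z_hat - truth)[hidden]) / np.linalg.norm(truth[hidden])
print(rel_err)
```

With δ = 1 the gradient step simply copies the observed entries of X into the current iterate before thresholding.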
Mendel-Impute
Sliding window
  Exploit linkage disequilibrium to solve smaller problems.

       [Figure: genotype matrix (Subjects × SNPs) divided along the SNP axis into consecutive windows A, B, C]

       Construct hold-out-set by masking entries in A and C
       Train on observed entries from A, B, and C
       Choose λ based on performance on hold-out-set
       Impute missing entries in B.
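
The four steps above can be sketched as one function. Everything here is an assumption made for illustration: the window is passed in as a matrix laid out as [A | B | C] with `flank` columns in each of A and C, the hold-out fraction is 10%, the final refit uses all observed entries, and the solver is a plain unaccelerated iteration with unit step. The names `impute_window`, `complete`, and `svt` are hypothetical.

```python
import numpy as np

def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def complete(X, observed, lam, n_iter=200):
    """Unit-step proximal gradient for the nuclear norm problem."""
    X0 = np.where(observed, X, 0.0)
    Z = np.zeros_like(X0)
    for _ in range(n_iter):
        # Gradient step with delta = 1: observed entries reset to the data.
        Z = svt(np.where(observed, X0, Z), lam)
    return Z

def impute_window(X, observed, flank, lambdas, holdout_frac=0.1, seed=0):
    """Impute block B of a window [A | B | C]; A and C have `flank` columns."""
    rng = np.random.default_rng(seed)
    n_cols = X.shape[1]
    in_flank = np.zeros(n_cols, dtype=bool)
    in_flank[:flank] = True
    in_flank[n_cols - flank:] = True

    # Hold out a random subset of observed entries in A and C only.
    holdout = observed & in_flank[None, :] & (rng.random(X.shape) < holdout_frac)
    train = observed & ~holdout

    def holdout_error(lam):
        Z = complete(X, train, lam)
        return np.mean((Z[holdout] - X[holdout]) ** 2)

    best_lam = min(lambdas, key=holdout_error)

    # Refit on all observed entries (an assumption), then fill in B.
    Z = complete(X, observed, best_lam)
    B = np.where(observed, X, Z)[:, flank:n_cols - flank]
    return B, best_lam

# Synthetic window: 40 subjects, 30 SNPs, about 20% of entries missing.
rng = np.random.default_rng(2)
truth = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 30))
observed = rng.random(truth.shape) < 0.8
X = np.where(observed, truth, np.nan)
B, lam = impute_window(X, observed, flank=10, lambdas=[0.1, 1.0, 10.0])
print(B.shape, lam)
```

Sliding the window along the chromosome and keeping only each middle block B gives every SNP flanking context on both sides.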
Not Missing at Random

    [Figure: two scatter panels, "Raw Imputed Values" and "Final Imputed Dosages"; x-axis: Subject (0 to 500); y-axis: value (0.0 to 2.0)]

    [Figure: scatter plot with axes labeled Mendel-Impute and Beagle, each running from 0.0 to 1.0]
Example: Simulated Association Study

       Quantitative trait with a single causative SNP.
             545 people and 60,000 SNPs.
             ith measurement generated by causative SNP

             $$y_i = \mu + \beta x_i + \sigma \epsilon_i,$$

             where µ = 160, β = 3, σ = 5, and $\epsilon_i$ are i.i.d. N(0, 1).
             $x_i$ = dosage (count of minor alleles for subject i).
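
A sketch of how such a trait could be simulated and then tested for association, using the parameter values from the slide. The minor allele frequency of 0.3 and the binomial draw of dosages are assumptions for the example; the talk does not state how the causative SNP was chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 545
mu, beta, sigma = 160.0, 3.0, 5.0

maf = 0.3   # assumed minor allele frequency (not given in the talk)
x = rng.binomial(2, maf, size=n).astype(float)   # dosage in {0, 1, 2}
y = mu + beta * x + sigma * rng.normal(size=n)   # trait values

# Ordinary least squares fit of y on an intercept and the dosage.
design = np.column_stack([np.ones(n), x])
(mu_hat, beta_hat), *_ = np.linalg.lstsq(design, y, rcond=None)
print(mu_hat, beta_hat)
```

In an imputation benchmark the same regression would be run on imputed dosages in place of `x`, to see whether the association signal survives.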
Results: Timing

    Program          Run time (HH:MM)
    MACH             12:40
    BEAGLE           10:20
    IMPUTE2          07:10
    Mendel-Impute    00:56

    Table: Timing results on high-coverage genotyping microarray data.
Genotype Imputation
 Case 2: Low Coverage Sequencing

Genotypes           Alleles          Reads
 A A                  A,G            10,0
 A T                  A,T            50,50
 T T                  T,G            0
 G G                  G              13
 A G                  A,G            21,17
 T T                  T              8
 C G                  C,G            0

Idea:
Estimate underlying haplotypes via Hidden Markov Models
Genotype Imputation
 Case 2: Low Coverage Sequencing

 Reference           Observation             Prediction
A A A G            (A,G) = (20,15)             A G
A T A A               (A) = (10)               A A
T T G T                (T) = (5)               T T
G G G G               No reads                 G G
A G A A               (A) = (13)               A A
T T T T               (T) = (17)               T T
C G G C             (C,G) = (25,1)             C G

Idea:
Estimate underlying haplotypes via Hidden Markov Models
Solve with Matrix Completion?

Consider ith SNP
    A occurs 5% (minor allele)
    T occurs 95% (major allele)

Mapping: reads → posterior mean (in [0,2])

    Reads (A,T)    Posterior Mean
    18,1           1.9
    0,11           0.1

                SNP
    Subj  ... 0.1 0.3  ?  1.3  ?  1.9 ...
          ... 0.2  ?  0.5 1.7  ?  1.5 ...
          ... 0.4 1.3  ?   ?  0.1  ?  ...

Missingness is more random
Application: Low Coverage Sequencing

       Weighted version for sequencing data

       $$\min_Z \frac{1}{2} \sum_{i,j} w_{ij} (x_{ij} - z_{ij})^2 + \lambda \|Z\|_*,$$

             $w_{ij}$ := number of reads at locus j for subject i.
             $x_{ij} \in [0, 2]$ (posterior mean allele dosage)
                   Binomial likelihood ("success" = sequencing error).
                   Prior: Hardy-Weinberg genotype frequencies.
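
One way the posterior mean dosage could be computed from read counts is sketched below. The model details are assumptions for illustration: a per-read error rate of 1%, a heterozygote showing the minor allele on half its reads, and the function name `posterior_mean_dosage`. The talk specifies only a binomial likelihood with a Hardy-Weinberg prior, so the output is not expected to match the illustrative values on the mapping slide.

```python
import numpy as np
from math import comb

def posterior_mean_dosage(minor_reads, major_reads, maf, error_rate=0.01):
    """Posterior mean count of minor alleles given read counts at one locus."""
    n = minor_reads + major_reads
    # Hardy-Weinberg prior for 0, 1, 2 copies of the minor allele.
    prior = np.array([(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2])
    # Probability that a single read shows the minor allele, per genotype
    # (assumed error model: homozygotes differ only through sequencing error).
    p_minor = np.array([error_rate, 0.5, 1 - error_rate])
    likelihood = comb(n, minor_reads) * p_minor ** minor_reads * (1 - p_minor) ** major_reads
    posterior = prior * likelihood
    posterior /= posterior.sum()
    return float(posterior @ np.array([0.0, 1.0, 2.0]))

# With no reads the posterior equals the prior, whose mean is 2 * maf.
print(posterior_mean_dosage(0, 0, maf=0.05))
print(posterior_mean_dosage(18, 1, maf=0.05))
print(posterior_mean_dosage(0, 11, maf=0.05))
```

Because every locus receives a value, even with zero reads, the weights $w_{ij}$ rather than missing entries carry the information about how much each value should be trusted.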
Results: Timing

    Program          Run time
    Mendel-Impute    03:18:36
    BEAGLE           23:27:34
    IMPUTE2          31:02:09

       Pros: Fast and accurate enough
       Cons: No phasing, just imputation
Summary



Matrix completion is purely empirical!
-- Singular vectors are not interpretable.

Accuracy is in the ballpark of standard model-based methods

But much faster.

Improvements:
-- Singular Value Thresholding without the SVD (Cai & Osher)

Contenu connexe

Tendances

Editing rice-genome with CRISPR/Cas9: To improve agronomic traits for increa...
Editing rice-genome with CRISPR/Cas9:  To improve agronomic traits for increa...Editing rice-genome with CRISPR/Cas9:  To improve agronomic traits for increa...
Editing rice-genome with CRISPR/Cas9: To improve agronomic traits for increa...apaari
 
Molecular marker and its application to genome mapping and molecular breeding
Molecular marker and its application to genome mapping and molecular breedingMolecular marker and its application to genome mapping and molecular breeding
Molecular marker and its application to genome mapping and molecular breedingFOODCROPS
 
08.13.08: DNA Sequence Variation
08.13.08: DNA Sequence Variation08.13.08: DNA Sequence Variation
08.13.08: DNA Sequence VariationOpen.Michigan
 
DNA microarray final ppt.
DNA microarray final ppt.DNA microarray final ppt.
DNA microarray final ppt.Aashish Patel
 
Gene stacking and its materiality in crop improvement
Gene stacking and its materiality in crop improvementGene stacking and its materiality in crop improvement
Gene stacking and its materiality in crop improvementShamlyGupta
 
Quantitative Trait LOci (QTLs) Mapping: Basics procedure, principle and Methods
Quantitative Trait LOci (QTLs) Mapping: Basics procedure, principle and MethodsQuantitative Trait LOci (QTLs) Mapping: Basics procedure, principle and Methods
Quantitative Trait LOci (QTLs) Mapping: Basics procedure, principle and MethodsMahesh Hampannavar
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomicsAjit Shinde
 
Comparitive genomic hybridisation
Comparitive genomic hybridisationComparitive genomic hybridisation
Comparitive genomic hybridisationnamrathrs87
 
Crispr cas: A new tool of genome editing
Crispr cas: A new tool of genome editing Crispr cas: A new tool of genome editing
Crispr cas: A new tool of genome editing palaabhay
 
Marker assisted whole genome selection in crop improvement
Marker assisted whole genome     selection in crop improvementMarker assisted whole genome     selection in crop improvement
Marker assisted whole genome selection in crop improvementSenthil Natesan
 
Map based cloning of genome
Map based cloning of genomeMap based cloning of genome
Map based cloning of genomeKAUSHAL SAHU
 
Genomic selection for crop improvement
Genomic selection for crop improvementGenomic selection for crop improvement
Genomic selection for crop improvementnagamani gorantla
 
The ensembl database
The ensembl databaseThe ensembl database
The ensembl databaseAshfaq Ahmad
 
Mapping and Applications of Linkage Disequilibrium and Association Mapping in...
Mapping and Applications of Linkage Disequilibrium and Association Mapping in...Mapping and Applications of Linkage Disequilibrium and Association Mapping in...
Mapping and Applications of Linkage Disequilibrium and Association Mapping in...FAO
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomicskiran singh
 

Tendances (20)

Microarray
MicroarrayMicroarray
Microarray
 
Editing rice-genome with CRISPR/Cas9: To improve agronomic traits for increa...
Editing rice-genome with CRISPR/Cas9:  To improve agronomic traits for increa...Editing rice-genome with CRISPR/Cas9:  To improve agronomic traits for increa...
Editing rice-genome with CRISPR/Cas9: To improve agronomic traits for increa...
 
Molecular marker and its application to genome mapping and molecular breeding
Molecular marker and its application to genome mapping and molecular breedingMolecular marker and its application to genome mapping and molecular breeding
Molecular marker and its application to genome mapping and molecular breeding
 
08.13.08: DNA Sequence Variation
08.13.08: DNA Sequence Variation08.13.08: DNA Sequence Variation
08.13.08: DNA Sequence Variation
 
DNA microarray final ppt.
DNA microarray final ppt.DNA microarray final ppt.
DNA microarray final ppt.
 
GWAS
GWASGWAS
GWAS
 
Gene stacking and its materiality in crop improvement
Gene stacking and its materiality in crop improvementGene stacking and its materiality in crop improvement
Gene stacking and its materiality in crop improvement
 
Quantitative Trait LOci (QTLs) Mapping: Basics procedure, principle and Methods
Quantitative Trait LOci (QTLs) Mapping: Basics procedure, principle and MethodsQuantitative Trait LOci (QTLs) Mapping: Basics procedure, principle and Methods
Quantitative Trait LOci (QTLs) Mapping: Basics procedure, principle and Methods
 
Crispr/Cas 9
Crispr/Cas 9Crispr/Cas 9
Crispr/Cas 9
 
Functional genomics
Functional genomicsFunctional genomics
Functional genomics
 
Comparitive genomic hybridisation
Comparitive genomic hybridisationComparitive genomic hybridisation
Comparitive genomic hybridisation
 
Crispr cas: A new tool of genome editing
Crispr cas: A new tool of genome editing Crispr cas: A new tool of genome editing
Crispr cas: A new tool of genome editing
 
Gene expression profiling
Gene expression profilingGene expression profiling
Gene expression profiling
 
Marker assisted whole genome selection in crop improvement
Marker assisted whole genome     selection in crop improvementMarker assisted whole genome     selection in crop improvement
Marker assisted whole genome selection in crop improvement
 
Map based cloning of genome
Map based cloning of genomeMap based cloning of genome
Map based cloning of genome
 
Genomic selection for crop improvement
Genomic selection for crop improvementGenomic selection for crop improvement
Genomic selection for crop improvement
 
SNp mining in crops
SNp mining in cropsSNp mining in crops
SNp mining in crops
 
The ensembl database
The ensembl databaseThe ensembl database
The ensembl database
 
Mapping and Applications of Linkage Disequilibrium and Association Mapping in...
Mapping and Applications of Linkage Disequilibrium and Association Mapping in...Mapping and Applications of Linkage Disequilibrium and Association Mapping in...
Mapping and Applications of Linkage Disequilibrium and Association Mapping in...
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 

Similaire à Genotype Imputation via Matrix Completion

Photomorphogenesis talk
Photomorphogenesis talkPhotomorphogenesis talk
Photomorphogenesis talkHugh Shanahan
 
SSAHA_pileup
SSAHA_pileupSSAHA_pileup
SSAHA_pileupbpb
 
Lecture 3 l dand_haplotypes_full
Lecture 3 l dand_haplotypes_fullLecture 3 l dand_haplotypes_full
Lecture 3 l dand_haplotypes_fullLekki Frazier-Wood
 
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...Torsten Seemann
 
Mechanisms of Evolution LabGroup Members NamesScenario .docx
Mechanisms of Evolution LabGroup Members NamesScenario .docxMechanisms of Evolution LabGroup Members NamesScenario .docx
Mechanisms of Evolution LabGroup Members NamesScenario .docxandreecapon
 
2.2 analyzing and manipulating dna
2.2 analyzing and manipulating dna2.2 analyzing and manipulating dna
2.2 analyzing and manipulating dnaEmmanuel Aguon
 
Relationships and Biodiversity State Lab Review(1)
Relationships and Biodiversity  State Lab Review(1)Relationships and Biodiversity  State Lab Review(1)
Relationships and Biodiversity State Lab Review(1)gparchment
 
Gel Electrophoresis Notes
Gel Electrophoresis NotesGel Electrophoresis Notes
Gel Electrophoresis Noteskathy_lambert
 
Logic to-prolog
Logic to-prologLogic to-prolog
Logic to-prologsaru40
 
OpenCL applications in genomics
OpenCL applications in genomicsOpenCL applications in genomics
OpenCL applications in genomicsUSC
 
Sequence Alignment - Data Bioinformatics Introduction
Sequence Alignment - Data Bioinformatics IntroductionSequence Alignment - Data Bioinformatics Introduction
Sequence Alignment - Data Bioinformatics IntroductionTenaAvdic
 
monsanto MON_06/25/04d
monsanto MON_06/25/04dmonsanto MON_06/25/04d
monsanto MON_06/25/04dfinance28
 
Genotype Imputation via Matrix Completion

  • 1. Genotype Imputation via Matrix Completion. Eric Chi, Department of Human Genetics, University of California, Los Angeles. October 25, 2012
  • 2. Collaborators. UCLA: Kenneth Lange, Diego Ortega Del Vecchyo. NCSU: Hua Zhou. USC: Gary Chen
  • 3. Where are we going? Genotype Imputation. Movie Recommendation Systems: Matrix Completion, Mendel-Impute. Some Test Cases: Genome wide association, Low coverage sequencing
  • 4. Genetic Variation: SNPs. Much of the variation between people consists of Single Nucleotide Polymorphisms (SNPs). If you have the G allele in position k, that increases your risk of high cholesterol 3x. Position k is diallelic (A or G):
        ...AATGATC...
        ...AATGATC...
        ...GATGATC...
        ...AATGATC...
        ...AATGATC...
        ...AATGATC...
        ...GATGATC...
           k
  • 5. Simulated Association Study: Genetic Association. $y \approx \mu + \beta x$, where $y$ is your cholesterol, $\mu$ is the baseline cholesterol, and $x$ is the “state” at the diallelic SNP k.
  • 6. Genome Wide Association Studies. Problem: Don’t always get to see what’s at SNP k. Problem: Costs $$$ to sequence an individual. GWAS: ~1K subjects, ~1M “select” SNPs.
  • 7. SNPs Have High Spatial Correlation: “Linkage Disequilibrium”
        Reference   Observation
        A A A G     A
        A T A A     A
        T T G T     .
        G G G G     .
        A G A A     .
        T T T T     .
        C G G C     C
        Haplotypes: Blocks of highly correlated SNPs
  • 8. SNPs Have High Spatial Correlation: “Linkage Disequilibrium”
        Reference   Observation   Prediction
        A A A G     A             A
        A T A A     A             A
        T T G T     .             T
        G G G G     .             G
        A G A A     .             A
        T T T T     .             T
        C G G C     C             C
        Haplotypes: Blocks of highly correlated SNPs
  • 9. Deliberately Introduce Missingness Strategically type (measure) certain SNPs. Use reference haplotypes to reconstruct. Motivation: Save $$$
  • 10. Challenge #1: Recombination
        Reference   Observation
        A A A G     A
        A T A A     A
        T T G T     T
        G G G G     .
        A G A A     G
        T T T T     T
        C G G C     G
        Haplotypes: Blocks of highly correlated SNPs
  • 11. Challenge #1: Recombination
        Reference   Observation   Prediction
        A A A G     A             A
        A T A A     A             A
        T T G T     T             T
        G G G G     .             G
        A G A A     G             G
        T T T T     T             T
        C G G C     G             G
        Haplotypes: Blocks of highly correlated SNPs
  • 12. Challenge #2: Mutations/Typing Errors
        Reference   Observation
        A A A G     G
        A T A A     A
        T T G T     T
        G G G G     .
        A G A A     G
        T T T T     T
        C G G C     C
        Haplotypes: Blocks of highly correlated SNPs
  • 13. Challenge #2: Mutations/Typing Errors
        Reference   Observation   Prediction
        A A A G     G             G
        A T A A     A             A
        T T G T     T             T
        G G G G     .             G
        A G A A     G             A
        T T T T     T             T
        C G G C     C             C
        Haplotypes: Blocks of highly correlated SNPs
  • 14. Genotype Imputation Case 1: Traditional GWAS
        Reference   Observation   Prediction
        A A A G     A/G           A G
        A T A A     A/A           A A
        T T G T     ./.           T T
        G G G G     ./.           G G
        A G A A     A/A           A A
        T T T T     T/T           T T
        C G G C     C/G           C G
        Idea: Estimate underlying haplotypes via Hidden Markov Models
  • 15. Deliberately Introduce Missingness: Strategically type (measure) certain SNPs. Motivation: Save $$$. Use references to reconstruct via HMM. Problem: Very slow! Weeks to months (on a cluster) to impute all chromosomes for a large study (~1K subjects).
  • 17. Netflix Prize: October 2006-August 2009 • Predict un-rated movies • Training Set: ratings from 480,000 customers on 18,000 movies. • Around 98.7% missing ratings. • $1,000,000 prize!
  • 18. Netflix Prize: October 2006-August 2009 • Predict un-rated movies • Training Set: ratings from 480,000 customers on 18,000 movies. • Around 98.7% missing ratings. • $1,000,000 prize! “You look at the cumulative hours and you’re getting Ph.D.’s for a dollar an hour.” -- Reed Hastings
  • 19. Netflix Prize: October 2006-August 2009 • Predict un-rated movies • Training Set: ratings from 480,000 customers on 18,000 movies. • Around 98.7% missing ratings. • $1,000,000 prize! Methods: Variations on the SVD and k-nearest neighbors
  • 21. Ratings Matrix (rows: movies; columns: customers)
                            Alice   Bob   Charlie   ···
        Star Wars           2       5     ?         ···
        Harry Potter        ?       1     ?         ···
        Miss Congeniality   1       5     1         ···
        Lord of the Rings   5       2     ?         ···
        ···
  • 22. Unphased Genotype Matrix (rows: SNPs; columns: subjects)
                   “Alice”   “Bob”   “Charlie”   ···
        rs274044   0         2       ?           ···
        rs274541   ?         1       ?           ···
        rs286593   1         2       1           ···
        rs287261   2         0       ?           ···
        ···
        0 = Homozygous Major Allele, 1 = Heterozygous, 2 = Homozygous Minor Allele
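The 0/1/2 coding on slide 22 can be sketched in a few lines. This is only an illustration: the function name `encode_genotype`, the `"A/G"` string format (borrowed from the observation column on slide 14), and the example genotypes are assumptions, not part of the talk.

```python
import math

def encode_genotype(genotype, minor_allele):
    """Return the minor-allele count (0, 1, or 2), or NaN if untyped."""
    alleles = genotype.split("/")
    if "." in alleles:
        return math.nan  # untyped entry, shown as "?" in the matrix
    return float(sum(a == minor_allele for a in alleles))

# Hypothetical genotypes at one SNP whose minor allele is assumed to be G.
row = [encode_genotype(g, "G") for g in ["A/A", "A/G", "G/G", "./."]]
print(row)  # [0.0, 1.0, 2.0, nan]
```

Keeping missing entries as NaN makes it easy to derive the observed-entry mask later with a single comparison.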
  • 23. Solve with Matrix Completion? (rows: subjects; columns: SNPs)
        Study Panel            ... 0 0 ? ? 0 1 1 0 0 2 ...
                               ... 0 0 ? ? 0 1 1 0 0 1 ...
        Reference Haplotypes   ... 0 1 0 0 0 1 0 0 0 1 ...
        Idea: Linkage disequilibrium = low rank structure?
  • 24. Solve with Matrix Completion? (rows: subjects; columns: SNPs)
        ... 0 0 ? ? 0 1 1 0 0 2 ...
        ... 0 0 ? ? 0 1 1 0 0 1 ...
        ... 0 1 0 0 0 1 0 0 0 1 ...
        Not missing at random = problem?
        Idea: Linkage disequilibrium = low rank structure?
  • 25. The Singular Value Decomposition: $X = U \Sigma V^t = \sum_{r=1}^{R} \sigma_r u_r v_r^t$, and $U_k \Sigma_k V_k^t = \arg\min_{\operatorname{rank}(Z) = k} \sum_{i,j} (x_{ij} - z_{ij})^2$.
  • 26. The Singular Value Decomposition: each term $\sigma_r u_r v_r^t$ is a rank 1 matrix.
  • 27. The Singular Value Decomposition: the sum $\sum_{r=1}^{R} \sigma_r u_r v_r^t$ is a mixture of rank 1 matrices.
  • 28. The Singular Value Decomposition: Each rank 1 matrix is a basic pattern. SVD expresses X as a mixture of basic patterns.
  • 29. The Singular Value Decomposition: The answer to the question: What is the “best” rank k approximation to X? $U_k \Sigma_k V_k^t = \arg\min_{\operatorname{rank}(Z) = k} \sum_{i,j} (x_{ij} - z_{ij})^2$.
  • 30. The Singular Value Decomposition: The answer to the question: What is the “best” rank r approximation to X? $X = U \Sigma V^t = \sum_{k=1}^{R} \sigma_k u_k v_k^t$. Eckart-Young Theorem (1936): $U_r \Sigma_r V_r^t = \arg\min_{\operatorname{rank}(Z) \le r} \sum_{i,j} (x_{ij} - z_{ij})^2$.
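A small numerical check of the Eckart-Young statement, as a sketch only: the helper name `best_rank_k`, the random test matrix, and the choice k = 2 are assumptions made for illustration. It relies on the standard fact that the squared Frobenius error of the truncated SVD equals the sum of the squared discarded singular values.

```python
import numpy as np

def best_rank_k(X, k):
    """Truncated SVD: keep the k largest singular triplets of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))   # arbitrary test matrix
k = 2
Xk = best_rank_k(X, k)

# Squared Frobenius error versus the discarded singular values.
s = np.linalg.svd(X, compute_uv=False)
err = np.sum((X - Xk) ** 2)
print(np.isclose(err, np.sum(s[k:] ** 2)))
```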
  • 31. Matrix Completion. Problem: Given an observed $m \times n$ matrix $X$ with missing entries indexed by $\Omega \subset \{1, \ldots, m\} \times \{1, \ldots, n\}$, fill in the missing entries. Solution: Find a low rank matrix consistent with the observed entries of $X$: $\min_{\operatorname{rank}(Z) \le r} f(Z) := \frac{1}{2} \sum_{(i,j) \in \Omega^c} (x_{ij} - z_{ij})^2$. References: Candès and Recht, 2009; Cai, Candès, and Shen, 2010; Mazumder, Hastie, and Tibshirani, 2010.
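The objective $f$ only looks at observed entries, so its gradient is the residual on those entries and zero on the missing ones. A minimal sketch, assuming missing entries are stored as NaN and using the first three rows of the slide 22 matrix as toy data (the function names `f` and `grad_f` and the all-ones guess `Z` are illustrative):

```python
import numpy as np

def f(Z, X, observed):
    """Half the squared error over observed entries only."""
    R = np.where(observed, Z - X, 0.0)
    return 0.5 * np.sum(R ** 2)

def grad_f(Z, X, observed):
    """Gradient of f: the residual on observed entries, zero elsewhere."""
    return np.where(observed, Z - X, 0.0)

X = np.array([[0.0, 2.0, np.nan],
              [np.nan, 1.0, np.nan],
              [1.0, 2.0, 1.0]])
observed = ~np.isnan(X)     # complement of the missing set
Z = np.ones_like(X)         # arbitrary candidate matrix
print(f(Z, X, observed))    # 1.5
```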
  • 32. Solving a Convex Relaxation: Convex relaxation under the nuclear norm $\|Z\|_*$: $\min_Z h(Z) := f(Z) + \lambda \|Z\|_*$, with $\|Z\|_* = \sum_r \sigma_r$, where $\sigma_r$ are the singular values of $Z$. $\|X\|_*$ is to $\operatorname{rank}(X)$ as $\|x\|_1$ is to $\|x\|_0$.
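The analogy on slide 32 can be made concrete: the nuclear norm is the l1 norm of the singular values, while the rank counts the nonzero ones. A tiny sketch on an assumed diagonal example (the matrix and the 1e-12 threshold are illustrative choices):

```python
import numpy as np

Z = np.diag([3.0, 1.0, 0.0])
sigma = np.linalg.svd(Z, compute_uv=False)

nuclear = sigma.sum()               # ||Z||_* : l1 norm of the singular values
rank = int(np.sum(sigma > 1e-12))   # rank(Z): number of nonzero singular values
print(nuclear, rank)
```

NumPy also exposes the same quantity directly as `np.linalg.norm(Z, 'nuc')`.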
  • 33. Solving a Convex Relaxation: Solve by minimizing a local quadratic surrogate of $f(Z)$: $g(Z \mid Z^k) = f(Z^k) + \langle \nabla f(Z^k), Z - Z^k \rangle + \frac{1}{2\delta} \|Z - Z^k\|_F^2 = \frac{1}{2\delta} \|Z - M\|_F^2 + c^k$, where $M = Z^k - \delta \nabla f(Z^k)$. Core problem: $\min_Z \frac{1}{2} \|Z - M\|_F^2 + \lambda \|Z\|_*$.
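The second equality on slide 33 follows from completing the square. The slide leaves the constant $c^k$ unspecified; the explicit expression below is my own expansion, not something stated in the talk.

```latex
\begin{aligned}
g(Z \mid Z^k)
 &= f(Z^k) + \langle \nabla f(Z^k),\, Z - Z^k \rangle
    + \frac{1}{2\delta}\,\|Z - Z^k\|_F^2 \\
 &= \frac{1}{2\delta}\,\bigl\|Z - \bigl(Z^k - \delta \nabla f(Z^k)\bigr)\bigr\|_F^2
    + f(Z^k) - \frac{\delta}{2}\,\|\nabla f(Z^k)\|_F^2 \\
 &= \frac{1}{2\delta}\,\|Z - M\|_F^2 + c^k,
 \qquad c^k = f(Z^k) - \frac{\delta}{2}\,\|\nabla f(Z^k)\|_F^2 .
\end{aligned}
```

Since $c^k$ does not depend on $Z$, it plays no role in the minimization.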
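  • 34. Solving the Core Problem: Core problem: $\min_Z \frac{1}{2} \|Z - M\|_F^2 + \lambda \|Z\|_*$. Solution: $Z^\star = D_\lambda(M)$, where $D_\lambda(M) := U S_\lambda(\Sigma) V^t$, $M = U \Sigma V^t$, and $S_\lambda(\sigma) = \operatorname{sign}(\sigma)(|\sigma| - \lambda)_+$.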
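The operator $D_\lambda$ amounts to an SVD followed by soft-thresholding of the singular values. A minimal sketch, assuming a dense NumPy matrix; the function name `svt` and the diagonal test matrix are illustrative. Because singular values are nonnegative, the sign factor in $S_\lambda$ drops out.

```python
import numpy as np

def svt(M, lam):
    """D_lambda(M): soft-threshold the singular values of M by lam."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)  # (sigma - lam)_+ since sigma >= 0
    return (U * s_shrunk) @ Vt

M = np.diag([3.0, 1.0, 0.5])
Z_star = svt(M, 1.0)   # singular values 3, 1, 0.5 shrink to 2, 0, 0
print(np.linalg.matrix_rank(Z_star))
```

Thresholding lowers the rank whenever some singular values fall at or below `lam`, which is how the nuclear norm penalty induces low rank solutions.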
  • 35. Solving a Convex Relaxation: $Z^{k+1} = \arg\min_Z \; g(Z \mid Z^k) + \lambda \|Z\|_* = \arg\min_Z \; \frac{1}{2\delta} \|Z - M\|_F^2 + \lambda \|Z\|_* + c^k$, where $M = Z^k - \delta \nabla f(Z^k)$. Repeat $Z^{k+1} = D_{\lambda\delta}(Z^k - \delta \nabla f(Z^k))$ until convergence. Accelerate with Nesterov’s Method (Beck and Teboulle 2009).
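Putting slides 31 to 35 together gives a short proximal gradient loop. The sketch below is not the Mendel-Impute implementation: it assumes step size δ = 1, omits the Nesterov acceleration, uses a plain relative-change stopping rule, and runs on synthetic rank 1 data. The names `svt`, `complete`, and all parameter values are illustrative.

```python
import numpy as np

def svt(M, lam):
    """Soft-threshold the singular values of M by lam."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def complete(X, observed, lam, delta=1.0, max_iter=500, tol=1e-6):
    """Proximal gradient for 0.5 * sum_observed (x - z)^2 + lam * ||Z||_*."""
    Z = np.zeros_like(X)
    for _ in range(max_iter):
        grad = np.where(observed, Z - X, 0.0)       # gradient of f at Z
        Z_new = svt(Z - delta * grad, lam * delta)  # thresholding step
        if np.linalg.norm(Z_new - Z) <= tol * max(1.0, np.linalg.norm(Z)):
            return Z_new
        Z = Z_new
    return Z

# Synthetic rank 1 matrix with roughly 30% of the entries hidden.
rng = np.random.default_rng(1)
truth = np.outer(rng.uniform(0.5, 1.5, 30), rng.uniform(0.5, 1.5, 20))
observed = rng.random(truth.shape) > 0.3
X = np.where(observed, truth, 0.0)   # hidden entries are never read

Z = complete(X, observed, lam=0.1)
rmse = np.sqrt(np.mean((Z - truth)[~observed] ** 2))
print(rmse)
```

The error is measured only on the hidden entries, which is the quantity an imputation method is judged on.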
  • 36. Mendel-Impute: Sliding window. Exploit linkage disequilibrium to solve smaller problems. [Figure: SNPs by subjects matrix with windows A, B, C.] Construct a hold-out set by masking entries in A and C. Train on observed entries from A, B, and C. Choose λ based on performance on the hold-out set. Impute missing entries in B.
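One way to read the sliding window scheme is sketched below. The slide does not give window widths, overlap, or the hold-out fraction, so everything here is an assumption: SNPs are taken to be columns, windows are equal-width and non-overlapping, and the names `sliding_windows` and `holdout_mask` are hypothetical.

```python
import numpy as np

def sliding_windows(n_snps, width):
    """Yield (A, B, C) column slices; B is imputed, A and C flank it."""
    for start in range(0, n_snps, width):
        a = slice(max(start - width, 0), start)
        b = slice(start, min(start + width, n_snps))
        c = slice(b.stop, min(b.stop + width, n_snps))
        yield a, b, c

def holdout_mask(observed, a, c, frac, rng):
    """Mark a random fraction of the observed entries in A and C as held out."""
    flank = np.zeros_like(observed)
    flank[:, a] = True
    flank[:, c] = True
    return flank & observed & (rng.random(observed.shape) < frac)

rng = np.random.default_rng(0)
observed = rng.random((5, 12)) > 0.2   # toy observed-entry mask
for a, b, c in sliding_windows(12, 4):
    held = holdout_mask(observed, a, c, frac=0.1, rng=rng)
    print(a, b, c, int(held.sum()))
```

For each window one would then fit on the observed entries minus the held-out ones for several values of λ, keep the λ with the smallest hold-out error, and read off the completed entries in B.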
  • 37. Not Missing at Random [Figure: two scatter panels, “Raw Imputed Values” and “Final Imputed Dosages”; x-axis: Subject (0 to 500); y-axis: value (0.0 to 2.0)]
  • 38. [Figure: scatter plot comparing Mendel-Impute with Beagle; both axes run from 0.0 to 1.0]
  • 39. Example: Simulated Association Study. Quantitative trait with a single causative SNP. 545 people and 60,000 SNPs. The $i$th measurement is generated by the causative SNP: $y_i = \mu + \beta x_i + \sigma \epsilon_i$, where $\mu = 160$, $\beta = 3$, $\sigma = 5$, and the $\epsilon_i$ are i.i.d. $N(0, 1)$. $x_i$ = dosage (count of minor alleles for subject $i$).
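The generating model on slide 39 is easy to simulate and refit. In this sketch the values of µ, β, σ, and the sample size come from the slide, while the minor allele frequency `maf`, the binomial draw of dosages, and the use of ordinary least squares are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, beta, sigma = 545, 160.0, 3.0, 5.0
maf = 0.3  # hypothetical minor allele frequency; the slides do not state one

x = rng.binomial(2, maf, size=n).astype(float)       # dosage: 0, 1, or 2
y = mu + beta * x + sigma * rng.standard_normal(n)   # simulated trait

# Ordinary least squares fit of y on [1, x].
design = np.column_stack([np.ones(n), x])
(mu_hat, beta_hat), *_ = np.linalg.lstsq(design, y, rcond=None)
print(mu_hat, beta_hat)
```

Replacing `x` with imputed dosages and comparing the fitted slope or its test statistic is one way to see how imputation error affects an association signal.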
  • 41. Results: Timing
        Program         Run time (HH:MM)
        MACH            12:40
        BEAGLE          10:20
        IMPUTE2         07:10
        Mendel-Impute   00:56
        Table: Timing results on high-coverage genotyping microarray data.
  • 42. Genotype Imputation Case 2: Low Coverage Sequencing
        Genotypes   Alleles   Reads
        A A         A,G       10,0
        A T         A,T       50,50
        T T         T,G       0
        G G         G         13
        A G         A,G       21,17
        T T         T         8
        C G         C,G       0
        Idea: Estimate underlying haplotypes via Hidden Markov Models
  • 43. Genotype Imputation Case 2: Low Coverage Sequencing
        Reference   Observation        Prediction
        A A A G     (A,G) = (20,15)    A G
        A T A A     (A) = (10)         A A
        T T G T     (T) = (5)          T T
        G G G G     No reads           G G
        A G A A     (A) = (13)         A A
        T T T T     (T) = (17)         T T
        C G G C     (C,G) = (25,1)     C G
        Idea: Estimate underlying haplotypes via Hidden Markov Models
  • 44. Solve with Matrix Completion? Consider the ith SNP: A occurs 5% (minor allele), T occurs 95% (major allele). Mapping from reads (A,T) to a posterior mean in [0,2]: (18,1) → 1.9; (0,11) → 0.1. Resulting matrix (rows: subjects; columns: SNPs):
        ... 0.1 0.3 ?   1.3 ?   1.9 ...
        ... 0.2 ?   0.5 1.7 ?   1.5 ...
        ... 0.4 1.3 ?   ?   0.1 ?   ...
        Missingness is more random.
  • 45. Application: Low Coverage Sequencing. Weighted version for sequencing data: $\min_Z \frac{1}{2} \sum_{i,j} w_{ij} (x_{ij} - z_{ij})^2 + \lambda \|Z\|_*$, where $w_{ij}$ := number of reads at locus $j$ for subject $i$, and $x_{ij} \in [0, 2]$ (posterior mean allele dosage). Binomial likelihood (“success” = sequencing error). Prior: Hardy-Weinberg genotype frequencies.
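A posterior mean dosage of the kind described on slide 45 can be sketched as follows. The talk does not spell out its parameterization, so this is one plausible version under stated assumptions: reads are independent, each read is miscalled with probability `err` (0.01 here, an arbitrary choice), and the function name `posterior_dosage` is hypothetical. It is not expected to reproduce the exact values shown on slide 44.

```python
import numpy as np

def posterior_dosage(minor_reads, major_reads, maf, err=0.01):
    """Posterior mean minor-allele count given read counts at one site."""
    g = np.array([0.0, 1.0, 2.0])
    # Hardy-Weinberg prior on genotypes with minor allele frequency maf.
    prior = np.array([(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2])
    # Chance that a single read shows the minor allele, allowing for errors.
    p_minor = (g / 2) * (1 - err) + (1 - g / 2) * err
    # Binomial likelihood (the binomial coefficient cancels on normalizing).
    log_lik = minor_reads * np.log(p_minor) + major_reads * np.log1p(-p_minor)
    post = prior * np.exp(log_lik - log_lik.max())
    post /= post.sum()
    return float(post @ g)

print(posterior_dosage(18, 1, maf=0.05))   # mostly minor reads
print(posterior_dosage(0, 11, maf=0.05))   # only major reads
print(posterior_dosage(0, 0, maf=0.05))    # no reads: prior mean, 2 * maf
```

With no reads the value falls back to the prior mean, and such entries would carry weight $w_{ij} = 0$ in the weighted objective.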
  • 46. Example: Simulated Association Study
  • 47. Results: Timing (Example: Simulated Association Study)
        Program         Run time (HH:MM:SS)
        Mendel-Impute   03:18:36
        BEAGLE          23:27:34
        IMPUTE2         31:02:09
        Pros: Fast and accurate enough
        Cons: No phasing, just imputation
  • 48. Summary
        Matrix completion is purely empirical!
        -- Singular vectors are not interpretable.
        Accuracy is in the ballpark of standard model-based methods, but much faster.
        Improvements:
        -- Singular Value Thresholding without the SVD (Cai, Osher)