Sequence Analysis
• Is the process of subjecting a DNA, RNA or
peptide sequence to any of a wide range of
analytical methods to understand its
features, function, structure, or evolution
• Given two sequences, we can
– Measure their similarity
– Determine the residue-residue correspondences
– Observe patterns of conservation and variability
– Infer evolutionary relationships
Bioinformatics
Sequence Analysis
• The most basic sequence analysis asks whether two sequences
are related; this is done by sequence alignment. This involves
– aligning the two sequences
– finding the similarity in the sequences
– deciding whether the sequences are related or the similarity is by chance
• Sequence alignment is the most basic tool of bioinformatics.
• Sequence similarity must be quantified; this is
important for distinguishing real similarity from
coincidence.
• Finding similarity between sequences is important for
many biological inferences, for example:
• Finding similar proteins allows us to predict the
function and structure of an unknown protein.
• Similar sequences can come from two species which
share a common ancestor, indicating their evolutionary
relationship.
• Locating similar subsequences in DNA allows us to
identify regions of interest, such as regulatory
elements.
Sequence Alignment
• Pairwise sequence alignment
• Local and global alignment
• Multiple sequence alignment
•Clustal W
Pairwise sequence alignment
• Pairwise alignment compares two sequences by searching for a series of
individual characters or patterns that are in the same order in
both sequences, i.e. it identifies the residue-residue
correspondences.
• There are two types: local and global.
• Global alignment attempts to align the entire length of both sequences. Two
sequences that have approximately the same length and are quite
similar are suitable for global alignment.
• Local alignment finds stretches of the sequences with a high level
of matches.
Global alignment
L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A
Local alignment
- - - - - - - T G K G - - - - - - - -
- - - - - - - A G K G - - - - - - - -
Methods of sequence alignment
•Dot plot method
• Dynamic programming approach
• Smith-Waterman algorithm and Needleman-Wunsch
algorithm
•Heuristic methods / k-Tuple Method
• BLAST and FASTA
Dot matrix analysis
• Dot matrix analysis is a method for comparing two
sequences to look for possible alignments (Gibbs and
McIntyre 1970)
• One sequence (A) is listed across the top of the matrix and
the other (B) is listed down the left side
• Starting from the first character in B, one moves across the
page along the first row, placing a dot in every
column where the character in A is the same
• The process is repeated row by row until all possible comparisons
between A and B have been made
• Any region of similarity is revealed by a diagonal row
of dots
• Isolated dots not on a diagonal represent random matches
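The procedure above can be sketched in a few lines of Python. This is only an illustration: the function names `dot_matrix` and `render` are made up for this example, and the two short sequences are the local-alignment fragments from the earlier slide with one flanking residue on each side.

```python
def dot_matrix(seq_a, seq_b):
    """Build the matrix: seq_a runs across the top, seq_b down the left side."""
    # One row per character of B; a cell is True where the characters match.
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the matrix as text, using '*' for a dot and '.' for no match."""
    lines = ["  " + " ".join(seq_a)]
    for b, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        lines.append(b + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("TGKGS", "AGKGA"))
```

In the printed matrix the shared G-K-G stretch shows up as three dots on one diagonal, while the two remaining G/G dots lie off that diagonal and are the kind of isolated random match described above.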
Dot matrix analysis
• Detection of matching regions can be improved by
filtering out random matches; this can be achieved
by using a sliding window
• Instead of comparing single sequence
positions, several positions are compared at the same time,
and a dot is printed only if a certain minimal number of
matches occurs
• Dot matrix analysis can also be used to find direct and
inverted repeats within the sequences
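A minimal sketch of the sliding-window filter, assuming a window of 3 positions and a threshold of 3 matches (both values, and the function name `windowed_dots`, are chosen for illustration only):

```python
def windowed_dots(seq_a, seq_b, window=3, min_matches=3):
    """Return (row, column) window start positions that pass the filter.

    A dot is kept at (i, j) only if at least `min_matches` of the `window`
    positions starting at seq_b[i] and seq_a[j] are identical.
    """
    dots = []
    for i in range(len(seq_b) - window + 1):
        for j in range(len(seq_a) - window + 1):
            matches = sum(seq_b[i + k] == seq_a[j + k] for k in range(window))
            if matches >= min_matches:
                dots.append((i, j))
    return dots


print(windowed_dots("TGKGS", "AGKGA"))                 # strict filter
print(windowed_dots("TGKGS", "AGKGA", min_matches=2))  # looser filter
```

With the strict setting only the G-K-G window survives; lowering the threshold to 2 of 3 keeps the neighbouring windows on the same diagonal as well. The off-diagonal G/G matches are filtered out in both cases.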
Dot matrix analysis: two identical sequences
• Nucleic Acids Dot Plots -
http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html
Dot matrix analysis: two very different sequences
• Nucleic Acids Dot Plots of genes Adh1 and G6pd in the mouse
• http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html
Dot matrix analysis: two similar sequences
• Nucleic Acids Dot Plots of genes Adh1 from the mouse and rat (25 MY)
• http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html
Dot matrix analysis: two similar sequences; size
of the sliding window increased
• Nucleic Acids Dot Plots of genes Adh1 from the mouse and rat (25 MY)
• http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html
Dynamic programming algorithm for
sequence alignment
• Dynamic programming is a highly computationally demanding
method.
• It aligns two nucleotide or protein sequences, explores all possible
alignments and chooses the highest-scoring
alignment as the optimal alignment.
• It is based on alignment scores.
• It uses gaps to achieve the best alignment.
• Global alignment programs are based on the Needleman-Wunsch
algorithm and local alignment programs on the Smith-Waterman algorithm. Both
are derivatives of the basic dynamic programming
algorithm.
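To make the relationship between the two algorithms concrete, here is a small sketch that fills the dynamic programming table either Needleman-Wunsch style (global) or Smith-Waterman style (local). The function name `align_score`, the score values (match 2, mismatch -1) and the simple linear gap cost of -2 per position are assumptions made for this example, not values from the slides; only the best score is returned, without the traceback.

```python
def align_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the start of a sequence.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                table[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,       # a[i-1] opposite a gap
                table[i][j - 1] + gap,       # b[j-1] opposite a gap
            )
            if local:
                score = max(score, 0)        # a local alignment may restart
            table[i][j] = score
            best = max(best, score)
    # Global: score of the full-length alignment; local: best cell anywhere.
    return best if local else table[-1][-1]


print(align_score("TTGKGAA", "CCGKGCC"))              # global score
print(align_score("TTGKGAA", "CCGKGCC", local=True))  # local score
```

For these two toy sequences the local score is higher than the global one, because the local alignment keeps only the shared G-K-G stretch while the global alignment must also pay for the mismatched flanks.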
• How are alignments scored?
• Using scoring matrices
• They account for gaps, substitutions, insertions and
deletions.
• For nucleic acids, scoring is simple: only 4 characters are
present, and all substitutions are usually given the same mismatch score.
• E.g. the scoring scheme used by BioEdit:
Variation             Score
Match                   2
Mismatch               -1
Gap initiation         -3
Extending gap by 1     -1
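As an illustration, the scheme in the table can be applied to an already aligned pair of nucleotide strings. The sketch below is not BioEdit code. It assumes that the first position of a gap costs only the initiation penalty (-3) and every further position of the same gap costs the extension penalty (-1); the function name `score_alignment` is hypothetical.

```python
MATCH, MISMATCH, GAP_INITIATION, GAP_EXTENSION = 2, -1, -3, -1


def score_alignment(top, bottom):
    """Score two equal-length aligned strings; '-' marks a gap position."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    gap_row = None  # row holding the gap that is currently open, if any
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            # Continuing a gap in the same row is cheaper than opening one.
            total += GAP_EXTENSION if gap_row == row else GAP_INITIATION
            gap_row = row
        else:
            total += MATCH if x == y else MISMATCH
            gap_row = None
    return total


print(score_alignment("AC--GT", "ACTTGT"))  # 4 matches, one gap of length 2
```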
• For proteins, the scoring schemes are more complicated because
amino acid substitutions occur frequently, especially among
amino acids with similar physicochemical properties
• E.g. alanine-valine substitutions happen without
significant changes to the protein.
Scoring a sequence alignment with a gap penalty
Sequence 1    V    D    S    -    C    Y
Sequence 2    V    E    S    L    C    Y
Score         4    2    4  -11    9    7
Score = sum of amino acid pair scores (26) minus single gap penalty (11) = 15
As two sequences may differ, non-identical amino
acids are likely to be placed in corresponding positions. In order to optimise
the alignment, gap(s) may be introduced, which may reflect deletions
or insertions that occurred in the past in the sequences.
Introducing gaps incurs penalties.
Scores gained by each match are not always the same; for instance,
a match between two rare amino acids scores more than a match between two common ones.
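The arithmetic of this example can be checked with a short script. The pair scores and the gap penalty are copied from the example above; the dictionary holds only the pairs that occur in it and is not a full scoring matrix, and the function name `score` is made up for the illustration.

```python
# Pair scores taken from the worked example (V/V, D/E, S/S, C/C, Y/Y).
PAIR_SCORES = {("V", "V"): 4, ("D", "E"): 2, ("S", "S"): 4,
               ("C", "C"): 9, ("Y", "Y"): 7}
GAP_PENALTY = 11


def score(top, bottom):
    """Sum the pair scores and subtract the penalty for each gap column."""
    total = 0
    for x, y in zip(top, bottom):
        if "-" in (x, y):
            total -= GAP_PENALTY
        else:
            total += PAIR_SCORES[(x, y)]
    return total


print(score("VDS-CY", "VESLCY"))  # 26 - 11 = 15
```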
Derivation of the dynamic programming algorithm
1. Score of new alignment = Score of previous alignment (A) + Score of new aligned pair
   V D S - C Y       V D S - C       Y
   V E S L C Y       V E S L C       Y
        15       =       8       +   7
2. Score of alignment (A) = Score of previous alignment (B) + Score of new aligned pair
   V D S - C         V D S -         C
   V E S L C         V E S L         C
        8        =      -1       +   9
3. Repeat removing aligned pairs until the end of the alignment is reached
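The same peeling-off of aligned pairs can be followed numerically. In this sketch the column scores are copied from the worked example (the gap column counts as -11), and `itertools.accumulate` gives the score of every partial alignment; the variable names are chosen for this illustration.

```python
from itertools import accumulate

# One entry per alignment column: (aligned pair, score of that column).
columns = [("V/V", 4), ("D/E", 2), ("S/S", 4),
           ("-/L", -11), ("C/C", 9), ("Y/Y", 7)]

# running[k] is the score of the alignment made of the first k + 1 columns.
running = list(accumulate(score for _, score in columns))

# Remove the last aligned pair repeatedly, as in steps 1 to 3.
for k in range(len(columns) - 1, 0, -1):
    pair, score = columns[k]
    print(f"{running[k]} = {running[k - 1]} + ({score})   last pair {pair}")
```

The first two printed lines reproduce 15 = 8 + 7 and 8 = -1 + 9 from steps 1 and 2.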
Description of the dynamic programming algorithm
• Consider building this alignment in steps, starting from the initial match (V/V)
and then sequentially adding a new pair until the alignment is complete, at each
stage choosing the pair, from all the possible matches, that provides the highest
score for the alignment up to that point.
• If the full alignment has the highest possible (or optimal) score, then the previous
alignment (A), from which it was derived by addition of the aligned Y/Y pair,
must also have been optimal up to that point in the alignment.
• In this manner, the alignment can be traced back to the first aligned pair, at each
step passing through an alignment that was also optimal.
• The example we have considered illustrates 3 choices at each step: 1. Match the
next characters of the two sequences; 2. Match the next character to a
gap in the upper sequence; 3. Match the next character to a gap in the lower sequence.
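The three choices are commonly written as a recurrence. The notation below is an assumption of this note rather than something taken from the slides: a is the upper sequence, b the lower sequence, S_{i,j} the best score for aligning the first i residues of a with the first j residues of b, s(a_i, b_j) the score of a residue pair, and g a fixed penalty per gap position (a simple linear gap cost).

```latex
S_{0,0} = 0, \qquad S_{i,0} = -i\,g, \qquad S_{0,j} = -j\,g

S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + s(a_i, b_j) & \text{(choice 1: align } a_i \text{ with } b_j\text{)} \\
S_{i,j-1} - g             & \text{(choice 2: } b_j \text{ opposite a gap in the upper sequence)} \\
S_{i-1,j} - g             & \text{(choice 3: } a_i \text{ opposite a gap in the lower sequence)}
\end{cases}
```

The optimal score of the full alignment is the last entry of the table, and the alignment itself is recovered by tracing back which of the three cases gave each maximum.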
Scoring matrices
• It is critical to have reasonable scoring schemes, accepted by the scientific
community, for DNA and proteins and for different types of alignments
• Matrices for DNA are rather similar to one another, as there are only two distinctions to score: purine vs
pyrimidine, and match vs mismatch
• Proteins are much more complex and the number of options is large
• PAM and BLOSUM matrices are the commonly used scoring matrices for
proteins.
• They are constructed by analysing the substitution frequencies seen in
alignments of known families of proteins.
• Identities are assigned high positive scores. These differ between amino acids, because some amino acids are
more abundant than others
• Frequently observed substitutions also get positive scores.
• Mismatches, or matches that are unlikely to have been a result of
evolution, are given negative scores.
Scoring matrices
• These scores form the matrix entries and are expressed as log odds scores
• The odds score is the ratio of the chance of an amino acid substitution due to an essential
biological reason to the chance of a random substitution.
• The PAM (Point Accepted Mutation) matrix is derived from global alignments of
very similar sequences, so that an observed change will reflect one mutation
• An accepted point mutation is a replacement of one amino acid by another that is
accepted by natural selection
• There are many different PAMs, which represent different evolutionary
scenarios.
• BLOSUM (blocks substitution matrix) was developed from regions of closely related
proteins that can be aligned without gaps. Its authors calculated the ratio of
observed pairs at any position to the number expected from the overall amino acid
frequencies.
• The results are in the form of log odds scores.
• PAM is more suitable for studying quite distant proteins, BLOSUM for
more conserved proteins or domains
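A log odds score can be illustrated with a few made-up numbers. In the sketch below the frequencies are hypothetical, the function name `log_odds` is invented for the example, and the scaling (2 times the base-2 logarithm, i.e. half-bit units) is an assumption; published matrices round the result to integers and use their own scaling.

```python
import math


def log_odds(observed_pair_freq, freq_a, freq_b, scale=2.0):
    """Scaled log2 of the observed pair frequency over the chance expectation."""
    expected = freq_a * freq_b  # chance of the pair if residues were independent
    return scale * math.log2(observed_pair_freq / expected)


# Hypothetical frequencies: both residues make up 12.5% of all residues.
print(log_odds(0.03125, 0.125, 0.125))    # seen twice as often as chance: positive
print(log_odds(0.015625, 0.125, 0.125))   # seen as often as chance: zero
print(log_odds(0.0078125, 0.125, 0.125))  # seen half as often as chance: negative
```

A positive score therefore marks a pairing seen more often than chance predicts, zero a pairing seen exactly as often as chance, and a negative score a pairing that is rarer than chance.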
Gap Penalty
• Gap penalties are subtracted from alignment scores to ensure algorithms
produce biologically sensible alignments without too many gaps
• Gap penalties may be:
• Constant – independent of the length of the gap
• Proportional – proportional to the length of the gap
• Affine – containing gap opening and gap extension contributions.
• Opening a gap should be more strongly penalised than extending a gap.
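The three kinds of penalty can be compared side by side. The numeric values and function names below are illustrative assumptions, and the affine form shown (opening penalty for the first position plus an extension penalty for each further position) is only one of the conventions in use.

```python
def constant_gap(length, penalty=11):
    """Same penalty for any gap, whatever its length."""
    return penalty if length > 0 else 0


def proportional_gap(length, per_position=2):
    """Penalty grows in direct proportion to the gap length."""
    return per_position * length


def affine_gap(length, opening=11, extension=1):
    """Opening cost for the first position, extension cost for the rest."""
    return opening + extension * (length - 1) if length > 0 else 0


for n in (1, 2, 5):
    print(n, constant_gap(n), proportional_gap(n), affine_gap(n))
```

With the affine form one long gap is cheaper than several short gaps of the same total length, which matches the idea that opening a gap should cost more than extending it.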
Scoring matrices: PAM (Percent Accepted Mutation)
Amino acids are grouped according to the chemistry of the side group: (C) sulfhydryl, (STPAG) small
hydrophilic, (NDEQ) acid, acid amide and hydrophilic, (HRK) basic, (MILV) small hydrophobic, and
(FYW) aromatic. Log odds values: +10 means that common ancestry is the more probable explanation, 0 means that the
probabilities are equal, -4 means that the change is more likely random. Thus the score of the alignment YY/YY is
10+10=20, whereas that of YY/TP is -3-5=-8, a rare and unexpected event between homologous sequences.
Scoring matrices: BLOSUM62
(BLOcks amino acid SUbstitution Matrices)
The idea behind BLOSUM is similar, but it is calculated from a very different and much larger set
of proteins, which are much more similar and form blocks of proteins with a similar pattern
Alignment A:  a1 a2 a3 a4
              b1 b2 b3 b4
Alignment B:  a1 a2 a3 a4 -
              b1 -  b2 b3 b4
The highest scoring matrix position
is located (in this case s44) and then
traced back as far as possible,
generating the path shown

Sequence-analysis-pairwise-alignment.pdf

  • 1. Sequence Analysis • Is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution • Is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution
  • 2. • Given two sequences, we can – Measure their similarity – Determine the residue-residue correspondences – Observe patterns of conservation and variability – Inter evolutionary relationships • Given two sequences, we can – Measure their similarity – Determine the residue-residue correspondences – Observe patterns of conservation and variability – Inter evolutionary relationships
  • 3. Bioinformatics Sequence Analysis • The most basic sequence analysis - whether two sequences are related – sequence alignment. This involves aligning two sequences similarity in sequences sequences are related similarity is by chance • The most basic sequence analysis - whether two sequences are related – sequence alignment. This involves aligning two sequences similarity in sequences sequences are related similarity is by chance
  • 4. Bioinformatics • Is the most basic tool of bioinformatics. • Sequence similarity must be quantified – important to identify real similarity from coincidence. • Is the most basic tool of bioinformatics. • Sequence similarity must be quantified – important to identify real similarity from coincidence.
  • 5. Bioinformatics • Finding similarity between sequences is important for many biological inferences, like •Finding similar proteins allows us to predict the function and structure of the unknown protein. •Similar sequences can come from two species which share a common ancestor indicating their evolutionary relationship. • Locating similar subsequences in DNA allows us to identify pockets of interest, such as regulatory elements.etc • Finding similarity between sequences is important for many biological inferences, like •Finding similar proteins allows us to predict the function and structure of the unknown protein. •Similar sequences can come from two species which share a common ancestor indicating their evolutionary relationship. • Locating similar subsequences in DNA allows us to identify pockets of interest, such as regulatory elements.etc
  • 6. Bioinformatics • Pairwise sequence alignment • Local and global alignment • Multiple sequence alignment •Clustal W Sequence Alignment • Pairwise sequence alignment • Local and global alignment • Multiple sequence alignment •Clustal W
  • 7. Pairwise sequence alignment • The comparison of two sequences by searching for a series of individual characters or patterns that are in the same order in both sequences, i.e. the identification of residue-residue correspondences. • It can be local or global. • Global alignment attempts to align the entire sequences. Two sequences that have approximately the same length and are quite similar are suitable for global alignment. • Local alignment finds stretches of the sequences with a high level of matches.
      Global alignment
        L G P S S K Q T G K G S - S R I W D N
        L N - I T K S A G K G A I M R L G D A
      Local alignment
        - - - - - - - T G K G - - - - - - - -
        - - - - - - - A G K G - - - - - - - -
  • 8. Methods of sequence alignment • Dot plot method • Dynamic programming approach – Needleman-Wunsch algorithm and Smith-Waterman algorithm • Heuristic (k-tuple) methods – BLAST and FASTA
  • 9. Dot matrix analysis • A dot matrix analysis is a method for comparing two sequences to look for possible alignments (Gibbs and McIntyre 1970). • One sequence (A) is listed across the top of the matrix and the other (B) is listed down the left side. • Starting from the first character in B, one moves across the page, keeping in the first row and placing a dot in any column where the character in A is the same. • The process is repeated for each character in B until all possible comparisons between A and B have been made. • Any region of similarity is revealed by a diagonal row of dots. • Isolated dots that are not on a diagonal represent random matches.
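The procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code behind any particular dot plot tool; the function names `dot_matrix` and `render` are made up for the example.

```python
def dot_matrix(seq_a, seq_b):
    """Build the dot matrix: one row per character of seq_b (left side),
    one column per character of seq_a (top). A cell is True where the
    two characters are the same."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(matrix, seq_a, seq_b):
    """Draw the matrix as text, with '*' for a dot and '.' for no dot."""
    lines = ["  " + " ".join(seq_a)]
    for residue, row in zip(seq_b, matrix):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)
```

Calling `print(render(dot_matrix("GATTACA", "GATTACA"), "GATTACA", "GATTACA"))` draws a full main diagonal, plus isolated off-diagonal dots wherever the same letter recurs.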
  • 10. Dot matrix analysis • Detection of matching regions can be improved by filtering out random matches, which can be achieved by using a sliding window. • Instead of comparing single sequence positions, a window of several positions is compared at a time, and a dot is printed only if a certain minimum number of matches occurs within the window. • Dot matrix analysis can also be used to find direct and inverted repeats within sequences.
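A sliding-window filter of this kind might look as follows. It is a sketch under assumptions: the parameter names `window` and `min_matches` are illustrative, and the dot is placed at the start of each window, whereas real tools may centre it.

```python
def windowed_dot_matrix(seq_a, seq_b, window=3, min_matches=3):
    """Dot matrix with a sliding window.

    Cell [i][j] is True when the window of seq_b starting at i and the
    window of seq_a starting at j agree in at least min_matches positions.
    """
    rows = max(len(seq_b) - window + 1, 0)
    cols = max(len(seq_a) - window + 1, 0)
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Count identical positions along the diagonal of this window.
            matches = sum(seq_a[j + k] == seq_b[i + k] for k in range(window))
            row.append(matches >= min_matches)
        matrix.append(row)
    return matrix
```

With `window=1, min_matches=1` this reduces to the plain dot matrix; larger windows remove isolated dots and leave the diagonals.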
  • 11. • Nucleic Acids Dot Plots - http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html Dot matrix analysis: two identical sequences
  • 12. • Nucleic Acids Dot Plots of genes Adh1 and G6pd in the mouse •http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html Dot matrix analysis: two very different sequences
  • 13. • Nucleic Acids Dot Plots of genes Adh1 from the mouse and rat (25 MY) •http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html Dot matrix analysis: two similar sequences
  • 14. • Nucleic Acids Dot Plots of genes Adh1 from the mouse and rat (25 MY) •http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html Dot matrix analysis: two similar sequences; size of the sliding window increased
  • 15. Dynamic programming algorithm for sequence alignment • It is a computationally demanding method. • It aligns two nucleotide or protein sequences, effectively considers all possible alignments, and chooses the highest-scoring one as the optimal alignment. • It is based on alignment scores. • It uses gaps to achieve the best alignment. • Global alignment programs are based on the Needleman-Wunsch algorithm and local alignment programs on the Smith-Waterman algorithm. Both are derived from the basic dynamic programming algorithm.
  • 16. • How are alignments scored? • Using scoring matrices and gap penalties, which together account for substitutions, insertions and deletions. • For nucleic acids, scoring is simple: only 4 characters are present, and all mismatches are usually given the same score. • Eg: the scoring scheme used by BioEdit:
      Variation              Score
      Match                    2
      Mismatch                -1
      Gap initiation          -3
      Extending gap by 1      -1
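As an illustration, the scheme in the table can be applied to an already aligned pair of nucleotide sequences as sketched here. One assumption is made that the slide does not state: the first position of a gap costs the initiation score (-3) and each further position of the same gap costs the extension score (-1). The function name `score_alignment` is hypothetical and is not part of BioEdit.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1,
                    gap_open=-3, gap_extend=-1):
    """Score two aligned rows of equal length; '-' marks a gap.

    Assumption: the first gap position costs gap_open and every further
    consecutive gap position in the same row costs gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    previous_gap_row = None  # which row held a gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            total += gap_extend if previous_gap_row == gap_row else gap_open
            previous_gap_row = gap_row
        else:
            total += match if a == b else mismatch
            previous_gap_row = None
    return total
```

For example, `score_alignment("AC--GT", "ACTTGT")` adds four matches (8), one gap initiation (-3) and one extension (-1).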
  • 17. • For proteins, the scoring schemes are more complicated because amino acid substitutions occur frequently, especially among amino acids with similar physicochemical properties. • Eg: alanine/valine substitutions often happen without significant changes to the protein.
  • 18. Scoring a sequence alignment with a gap penalty
      Sequence 1    V    D    S    -    C    Y
      Sequence 2    V    E    S    L    C    Y
      Score         4    2    4  -11    9    7
      Score = sum of amino acid pair scores (26) minus single gap penalty (11) = 15
      Because two sequences may differ, non-identical amino acids are likely to be placed in corresponding positions. To optimise the alignment, gaps may be introduced; these may reflect deletions or insertions that occurred in the sequences in the past. Introducing a gap incurs a penalty. The scores gained by matches are not always the same: for instance, a pair of rare amino acids scores more than a pair of common ones.
  • 19. Derivation of the dynamic programming algorithm
      1. Score of new alignment = Score of previous alignment (A) + Score of new aligned pair
           V D S - C Y     V D S - C     Y
           V E S L C Y     V E S L C     Y
                15      =      8      +  7
      2. Score of alignment (A) = Score of previous alignment (B) + Score of new aligned pair
           V D S - C       V D S -       C
           V E S L C       V E S L       C
                8       =     -1      +  9
      3. Repeat removing aligned pairs until the end of the alignment is reached
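The recurrence on this slide (score of an alignment = score of the alignment without its last column + score of that column) can be checked with a running sum. The column scores are the ones from the V D S - C Y example; the helper name `prefix_scores` is made up for this sketch.

```python
from itertools import accumulate

# Column scores from the worked example: V/V, D/E, S/S, gap, C/C, Y/Y.
COLUMN_SCORES = [4, 2, 4, -11, 9, 7]


def prefix_scores(column_scores):
    """Score of every partial alignment, built one column at a time."""
    return list(accumulate(column_scores))
```

The last three running totals are -1, 8 and 15, i.e. alignment (B), alignment (A) and the full alignment.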
  • 20. Description of the dynamic programming algorithm • Consider building this alignment in steps, starting from the initial match (V/V) and then sequentially adding a new pair until the alignment is complete, at each stage choosing, from all the possible matches, the pair that provides the highest score for the alignment up to that point. • If the full alignment has the highest possible (or optimal) score, then the old alignment from which it was derived (A) by addition of the aligned Y/Y pair must also have been optimal up to that point in the alignment. • In this manner, the alignment can be traced back to the first aligned pair, which was also an optimal alignment. • The example we have considered illustrates 3 choices: 1. Match the next character(s) in the following position(s); 2. Match the next character(s) to a gap in the upper sequence; 3. Add a gap in the lower sequence.
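The three choices can be turned into a small global (Needleman-Wunsch style) alignment routine. This is a minimal sketch under simplifying assumptions: a single linear gap penalty and a plain match/mismatch score stand in for the substitution matrices and affine penalties that real programs use, and the default values are illustrative only.

```python
def needleman_wunsch(seq_a, seq_b, match=2, mismatch=-1, gap=-3):
    """Global alignment with a linear gap penalty.

    Returns (optimal score, aligned seq_a, aligned seq_b).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Each cell stores the best score of aligning two prefixes, so the optimal score of the full alignment is built from optimal scores of shorter alignments, as the slide argues.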
  • 21. Scoring matrices • It is critical to have reasonable scoring schemes, accepted by the scientific community, for DNA and proteins and for different types of alignments. • Matrices for DNA are rather similar to one another, as there are only two distinctions to make: purine versus pyrimidine and match versus mismatch. • Proteins are much more complex and the number of options is large. • PAM and BLOSUM matrices are the commonly used scoring matrices for proteins. • They are constructed by analysing the substitution frequencies seen in alignments of known families of proteins. • Identities are assigned high positive scores, which differ between residues because some amino acids are more abundant than others. • Frequently observed substitutions also get positive scores. • Mismatches, or matches that are unlikely to have been a result of evolution, are given negative scores.
  • 22. Scoring matrices • These scores form the matrix entries and are expressed as log odds scores. • The odds score is the ratio of the chance that an amino acid substitution reflects a real biological (evolutionary) relationship to the chance that it occurs at random. • PAM (Point Accepted Mutation) matrices are derived from global alignments of very similar sequences, so that an observed change is likely to reflect a single mutation. • An accepted point mutation is a replacement of one amino acid by another that has been accepted by natural selection. • There are many different PAM matrices, which represent different evolutionary scenarios. • BLOSUM (blocks substitution matrix) matrices were developed from conserved regions of related proteins that can be aligned without gaps. The ratio of observed pairs at any position to the number expected from the overall amino acid frequencies is calculated, and the result is expressed as a log odds score. • High-numbered PAM matrices are suited to quite distant proteins; BLOSUM, being built from conserved blocks, is suited to the more conserved regions (domains) of proteins.
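A log odds entry can be computed as sketched here. The base-2 logarithm, the scaling factor of 2 (half-bit units) and the rounding to an integer are assumptions borrowed from common practice rather than details given on the slide, and the frequencies in the usage note are invented numbers.

```python
import math


def log_odds(observed_freq, expected_freq, scale=2.0):
    """Rounded log odds score for one amino acid pair.

    observed_freq: how often the pair is seen in trusted alignments.
    expected_freq: how often it would be seen by chance, given the
                   overall amino acid frequencies.
    """
    return round(scale * math.log2(observed_freq / expected_freq))
```

A pair seen four times more often than chance predicts, `log_odds(0.04, 0.01)`, gets a positive score; a pair seen half as often as expected, `log_odds(0.005, 0.01)`, gets a negative one.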
  • 23. Gap Penalty • Gap penalties are subtracted from alignment scores to ensure that algorithms produce biologically sensible alignments without too many gaps. • Gap penalties may be: • Constant – independent of the length of the gap • Proportional – proportional to the length of the gap • Affine – containing gap opening and gap extension contributions. • Opening a gap should be penalised more strongly than extending a gap.
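The three kinds of penalty can be written as small functions of the gap length. The numeric defaults are illustrative, and the affine form assumes one common convention (the opening penalty covers the first gap position and each further position adds the extension penalty); other conventions exist.

```python
def constant_gap(length, penalty=11):
    """Same penalty for any gap, whatever its length."""
    return penalty if length > 0 else 0


def proportional_gap(length, per_position=2):
    """Penalty grows in proportion to the gap length."""
    return per_position * max(length, 0)


def affine_gap(length, opening=11, extension=1):
    """Opening penalty for the first position, extension for the rest."""
    if length <= 0:
        return 0
    return opening + extension * (length - 1)
```

With these defaults a 3-residue gap costs 11 (constant), 6 (proportional) or 13 (affine); each value is then subtracted from the alignment score.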
  • 24. Scoring matrices: PAM (Percent Accepted Mutation) Amino acids are grouped according to the chemistry of the side group: (C) sulfhydryl, (STPAG) small hydrophilic, (NDEQ) acid, acid amide and hydrophilic, (HRK) basic, (MILV) small hydrophobic, and (FYW) aromatic. Log odds values: a positive value such as +10 means that the pairing is more likely to come from a common ancestor than from chance, 0 means that the two probabilities are equal, and a negative value such as -4 means that the pairing is more likely to be random. Thus the score of the alignment YY/YY is 10+10=20, whereas that of YY/TP is -3-5=-8, a rare and unexpected pairing between homologous sequences.
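The arithmetic on this slide amounts to adding one matrix entry per aligned column. In the sketch here only the three entries quoted on the slide are included, so the lookup table is a fragment for illustration, not a usable PAM matrix.

```python
# The three entries quoted on the slide; a frozenset makes Y/T and T/Y
# the same key, and frozenset("Y") stands for the identical pair Y/Y.
PAM_ENTRIES = {
    frozenset("Y"): 10,
    frozenset("YT"): -3,
    frozenset("YP"): -5,
}


def alignment_score(row_1, row_2):
    """Sum the log odds scores of the aligned residue pairs (no gaps)."""
    return sum(PAM_ENTRIES[frozenset((a, b))] for a, b in zip(row_1, row_2))
```

Because the entries are logarithms, adding them corresponds to multiplying the underlying odds ratios for each column.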
  • 25. Scoring matrices: BLOSUM62 (BLOcks amino acid SUbstitution Matrices) The idea behind BLOSUM is similar, but it is calculated from a very different and much larger set of proteins, which are much more similar to one another and form blocks of proteins with a similar pattern.
  • 26.
      Alignment A:   a1 a2 a3 a4
                     b1 b2 b3 b4
      Alignment B:   a1 a2 a3 a4 -
                     b1 -  b2 b3 b4
      The highest scoring matrix position is located (in this case s44) and then traced back as far as possible, generating the path shown
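Locating the highest-scoring cell and tracing back from it is the core of local (Smith-Waterman style) alignment. The sketch assumes a linear gap penalty and a plain match/mismatch score instead of a substitution matrix, and it stops the traceback at the first cell with score 0.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-3):
    """Local alignment with a linear gap penalty.

    Returns (best score, aligned part of seq_a, aligned part of seq_b).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # start a new local alignment
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

For two sequences that share only an internal stretch, such as `smith_waterman("TTTGATTACATTT", "CCCGATTACACCC")`, the traceback recovers just the shared region and ignores the unrelated flanks.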