Genomic Data Analysis: From Reads to Variants

Genomic Data Analysis
From READS to VARIANTS
24-10-17 to 26-10-17,
Porto Alegre, Brazil.
Aureliano Bombarely
Virginia Tech
Department of Horticulture
Latham 216
220 Ag Quad Lane
Blacksburg, VA
USA
aurebg@vt.edu

DAY 1:
• Presentation of the Course.
• Introduction to the Linux Operating System and the Command Line Interface.
• 25 essential commands to work with Linux.
• Common bioinformatics formats, from FASTAs to GFFs and VCFs.
• 10 essential commands to play with the biological data.
DAY 2:
• Introduction to Next Generation Sequencing Technologies (NGS).
• Experimental design for population studies, from breeding to ecological studies.
• De-multiplexing and the complexities of sample identification.
• Read processing and quality evaluation.
• Read mapping to a reference.
• Variant calling and summary of the read processing.
• Quality evaluation and possible pitfalls.
DAY 3:
• Variant filtering.
• Simple stats for the variant analysis.
• Variant visualization tools: IGV and TASSEL.
• Changing formats for VCF files.
• Example 1: Population analysis with Structure for Sinningia speciosa.
• Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.

SOFTWARE REQUIRED
If
PuTTY
http://www.putty.org/
If
Terminal
(Already in your system)

SOFTWARE REQUIRED
IGV
http://software.broadinstitute.org/software/igv/ http://www.maizegenetics.net/tassel
TASSEL

SOFTWARE REQUIRED
FileZilla
https://filezilla-project.org/

1. Presentation of the Course.
2. Introduction to the Linux Operating System and the
Command Line Interface.
3. 25 essential commands to work with Linux.
4. Common bioinformatics formats, from FASTAs to GFFs
and VCFs.
5. 10 essential commands to play with the biological data.

GDA1: 1- Presentation of the Course.
Biological Problem
Scientific Question
Hypothesis
Genetics & related disciplines
Molecular biology
Massive DNA Sequencing
Results
Experimental Design
Approach
?

Biological Problem
Scientific Question
Hypothesis
Genetics & related disciplines
Molecular biology
Massive DNA Sequencing
Results
Experimental Design
Approach

Genomic Data Analysis is:
• Knowledge about sequencing technologies.
• Knowledge about methodologies (e.g. library preparation).
• Bioinformatic skills (Linux command line and R).
• Basic knowledge about statistical analysis.
Genomic Data Analysis IS NOT:
• Programming (useful but not necessary).
• Basic knowledge of computer system administration.
• Modeling.
• Algorithm development.
• Database development.

Genomic Data Analysis is:
• Knowledge about sequencing technologies.
• Knowledge about methodologies (e.g. library preparation).
• Bioinformatic skills (Linux command line and R).
• Basic knowledge about statistical analysis.
BIOINFORMATICS:
• Programming (useful but not necessary).
• Basic knowledge of computer system administration.
• Modeling.
• Algorithm development.
• Database development.

GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
Linux:
Unix-like computer operating system assembled
under the model of free and open source software
development and distribution
Operating System (OS):
Set of programs that
manage computer hardware
resources and provide common
services for application
software.
Wikipedia

Unix-like?
Feduccia A, Trends Ecol. Evol. 2001

Unix:
Is a multitasking, multi-user computer operating
system originally developed in 1969.
https://www.howtogeek.com/182649/htg-explains-what-is-unix/

Linux Distribution:
Distributions (often called distros for short) are
Operating Systems including a large collection of
software applications such as word processors,
spreadsheets, media players, and database applications.
The operating system will consist of the Linux
kernel and, usually, a set of libraries and utilities from
the GNU Project, with graphics support from the X
Window System.

Linux Distribution:

What is a console ?
Computer terminal or system consoles are the text
entry and display device for system administration
messages, particularly those from the BIOS or boot loader,
the kernel, from the init system and from the system logger. It
is a physical device consisting of a keyboard and a
screen.
Wikipedia

So then...
What is typical black or white screen where
programers and system administrators write
commands?

Command-line interface (CLI):
Mechanism for interacting with a computer operating
system or software by typing commands to perform specific
tasks.
The command-line interpreter may be run in a
text terminal or in a terminal emulator window as a remote
shell client such as PuTTY.
Wikipedia
010100
110010
000101
2+2 4

Shell:
Piece of software that provides an interface for users of an
operating system. There are two categories:
- Command-line interface (CLI)
- Graphical user interface (GUI)

Command-line interface (CLI):
Shell: Bash:
Operating
System
Shell (Bash)
STDIN STDOUT
STDERRCommand
Command is a directive to a computer program acting as an
interpreter of some kind, in order to perform a specific task.

Parts of a command:
...And push RETURN or ENTER key to execute the command
Command Argument 1 & 2
The command
call a program
Arguments modify the
behavior of the program
Argument 3
-l means “long listing”
-h/--human-readable means “human readable”

Special characters in bash:
CHARACTER MEANING
SPACE Separate commands and arguments
# POUND Comment
; SEMICOLON Command separator two run multiple commands
. DOT Source command OR filename component
OR current directory
.. DOUBLE DOTS Parent directory
' ' SINGLE QUOTES Use expression between quotes literaly
, COMMA Concatenate strings
BACKSLASH Escape for single character
/ SLASH Filename path separator
* ASTERISK Wild card for filename expansion in globbing
>, <, >> CHARACTERS Redirection input/outputs
| PIPE Pipe outputs between commands
Characters with an special meaning for the bash

ls Solanum lycopersicum
ls 'Solanum lycopersicum'
ls Solanum lycopersicum
Use single quotes or escape for special characters
Bash understand spaces as separators between arguments
Special characters in bash:

Practice 1.1: Connect to the server
Windows users:
1. Open the program PuTTY and start a session.
2. Add the following information for your session and click open.
1. Host: begonia.hort.vt.edu
2. Port: 1809
3. Connection type: SSH
3. Introduce username and password
4. Click connect.
MacOS/Linux users:
1. Open the program Terminal.
2. Type in the terminal:
ssh -p 1809 username@begonia.hort.vt.edu
3. Push enter/return
4. Introduce the password and push Enter/Return. Note the writing will be hidden.

Practice 1.1: Connect to the server
Everyone:
5. Type in the terminal:
pwd
6. Push enter/return
7. Describe the message that has appeared in the screen

GDA1: 3- 25 essential commands to work with Linux.
1. pwd
• The command prints the path to the working directory.
• When you login to the system working directory = home ($HOME)
/data/GDA_UFRGS2017/User00_Home
/ means root (beginning of the file system)
data is the name of the 1st directory in root
/GDA_UFRGS2017 2nd directory after data
/User00_Home 3rd directory after GDA_UFRGS2017
pwd

2. mkdir
• The command prints makes a new directory.
• If the file exists gives an error.
• Argument -p makes also the parent directories
• rmdir removes a empty directory
mkdir linux_exercises
mkdir linux_exercises/test01
mkdir -p linux_exercises/test01
rmdir linux_exercises/test01
✓correct
✴error
✓correct
pwd
mkdir
✓correct

3. cd
• The command changes the working directory.
• Two consecutive dots changes (e.g. “cd ..”) one directory up/
back in the file system
cd linux_exercises
cd linux_exercises/test01
cd test01
✓correct
✴error
✓correct
pwd
mkdir
cd

4. ls
• It lists the items in the working directory (default) or any directory.
• -l argument produces the item long listing.
• -h argument produces a human readable form.
• -a argument prints everything (including hided files starting with “.”).
• -t argument sorts by time
ls
ls -lht linux_exercises/test01
cd test01
✓correct
✴error
✓correct
pwd
mkdir
cd
ls

Practice 1.2: Navigating the file system
1. Type pwd in the terminal and run it.
2. Make the directory ‘linux_exercises’ typing and running.
3. Run pwd in the current directory and annotate the result.
4. Change the working directory to ‘linux_exercises’.
5. Make a new directory named ’01_file_system_tree’.
6. Change the working directory to ’01_file_system_tree’.
8. Make a new directory named ‘subdir01’
9. Change the working directory to ’subdir01’.
11. Change the working directory one level up
12. Make a new directory named ‘subdir02’
13. Change the working directory to ’subdir02’.
15. Draw the file system tree for the directories ‘subdir01’ and ‘subdir02’
pwd
mkdir
cd
ls

/
data/
GDA_UFRGS2017/
User00_Home/
linux_exercises/
01_file_system_tree/
subdir01/
subdir02/
cd 01_file_system_tree/subdir02
pwd
mkdir
cd
ls

/
data/
GDA_UFRGS2017/
User00_Home/
linux_exercises/
subdir01/
subdir02/
cd ../../
pwd
mkdir
cd
ls

/
data/
GDA_UFRGS2017/
User00_Home/
linux_exercises/
subdir01/
subdir02/
cd ../subdir01/
pwd
mkdir
cd
ls

/
data/
GDA_UFRGS2017/
User00_Home/
linux_exercises/
subdir01/
subdir02/
cd /data/GDS_URFG2017/User00_Home/linux_exercises/01_file_system_tree/
subdir01/
cd ../subdir01/
pwd
mkdir
cd
ls

/
data/
GDA_UFRGS2017/
User00_Home/
linux_exercises/
subdir01/
subdir02/
cd /data/GDS_URFG2017/User00_Home/linux_exercises/01_file_system_tree/
subdir01/
cd ../subdir01/
Relative filepath
Absolute filepath
pwd
mkdir
cd
ls

Absolute filepath Relative filepath
Latham Hall 311
220 Ag Quad Lane
Blacksburg, VA 24061
USA
Room 311
pwd
mkdir
cd
ls

pwd
mkdir
cd
lsCommands for directories:
COMMAND USE EXAMPLE
cd Change working dir cd ../
pwd Print working dir pwd
ls List information ls -lh /home
mkdir Create a new dir mkdir test
rmdir Remove empty dir rmdir test

pwd
mkdir
cd
ls
history
5. history
• It lists last 500 command runs.
• No arguments needed

pwd
mkdir
cd
ls
history
Typing shortcuts for bash:
SHORTCUT MEANING
Tab Autocomplete files or folder names
↑ Scroll up to the command history
↓ Scroll down to the command history
Ctrl + A Go to the beginning of the line that you are typing
Ctrl + D Go to the end of the line that you are typing
Ctrl + U Clear all the line (or until the cursor position)
Ctrl + R Search previously used commands
Ctrl + C Kill the process that you are running
Ctrl + D Exit the current shell
Ctrl + Z Put the running process to the background. Use
command fg to recover it.

6. less
• Opens a text type file in the screen.
• To navigate use the arrows ( ).
• “Shift + G” goes to the end of the file.
• “/“ + some word search for the word.
• “q” to quit/exit.
• Open the file with “-N” to open with row numbers.
• More information at: http://www.tutorialspoint.com/unix_commands/less.htm
less ../DATA/Sinningia_speciosa/reference/Sispe038_cds.fasta
less -N ../DATA/Sinningia_speciosa/reference/Sispe038_cds.fasta
pwd
mkdir
cd
ls
history
less

7. touch
• Creates an empty file.
touch this_is_a_test_file.txt
pwd
mkdir
cd
ls
history
less
touch
rm
8. rm
• Remove/delete permanently a file from the system.
• The file CAN NOT BE RECOVERED.
• “rm -Rf <directory>” will remove the directory and all its content CAREFUL.
rm this_is_a_test_file.txt
rm -Rf 01_file_system_tree/subdir01

9. cp
• Copy a file from one location to another.
• “./“ means copy here in the working directory
pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
10.mv
• Two functions:
• If the destination EXISTS and is a DIR, move a file there.
• If the destination DO NOT EXISTS, change the
• CAREFUL: If the destination EXISTS and is a file WILL OVERWRITE IT
mv Sispe038_cds.fasta Sispe038_cds.fa
cp ../DATA/Sinningia_speciosa/reference/Sispe038_cds.fasta ./

Practice 1.3: Copying and moving files
1. Change working directory to ‘linux_exercises’.
2. Make a directory with the name: “Sispe_ref”.
3. Change working directory to “Sispe_ref”.
4. Copy all the fasta files from /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/reference/
to your current working directory typing:
cp /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/reference/
*.fasta ./
6. Remove the file “Sispe038.scaffolds.fasta”.
7. Change the name of “Sispe038.scaffolds500kb.fasta” to “Sispe038ReducedRef.fa”.
8. Create a mapping reference using Bowtie2-build running:
bowtie2-build Sispe038ReducedRef.fa Sispe038ReducedRef
pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv

11.cat
• It prints the content of the file as STDOUT in the screen.
• Usually used to concatenate (merge) files one after another using
“cat file1.txt file2.txt > merged.txt ”
pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
12.head/tail
• Prints the first/last 10 lines of the file as STDOUT
• The number of lines (x) can be changed using “-n x”.
head Sispe038ReducedRef.fasta
head -n 100 Sispe038ReducedRef.fasta
tail -n 1 Sispe038ReducedRef.fasta
cat Sispe038ReducedRef.fa

13.grep
• Command to find LINES and print that match with the pattern used.
• “-c” option prints the NUMBER of LINES that match.
• “-v” option prints the LINES that DO NOT match.
pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
grep “>” Sispe038ReducedRef.fa
grep -c “>” Sispe038ReducedRef.fa
grep -v “>” Sispe038ReducedRef.fa

14.gzip/gunzip
• Command to compress a file with gzip.
• Command to uncompress a file.gz with gunzip
• To keep the original file the “-c” option can be used
pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
gzip Sispe038ReducedRef.fa
gunzip Sispe038ReducedRef.fa.gz
gzip -c Sispe038ReducedRef.fa > SispeRef.fa.gz

pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
tar
15.tar
• Command to archive/unarchive files contained into a directory.
• It can be combined with tools such as gzip and bzip2.
• Commonly used commands:
• tar -zxvf package_of_files.tar.gz to unarchive and uncompress .gz
• tar -jxvf package_of_files.tar.bz2 to unarchive and uncompress .bz2
• tar -zcvf dir1.tar.gz /path_to_fir1 to archive and compress with gzip
• tar -jcvf dir1.tar.bz2 /path_to_fir1 to archive and compress with bzip2

Practice 1.4: Concatenating files and taking a look to them
2. Make a directory with the name: “CDS_refs”.
3. Change working directory to “CDS_refs”.
4. Copy into your current working directory the following the files:
1. /data/GDA_UFRGS2017/DATA/Arabidopsis_thaliana/reference/
Athaliana_Phytozome167_TAIR10.pep.fa.gz
2. /data/GDA_UFRGS2017/DATA/Oryza_sativa/reference/
Osativa_Phytozome323_v7.0.pep.fa.gz
5. Uncompress both files.
6. Count how many lines have the symbol “>” for both files.
7. Concatenate both files in a file named “Atha_Osat_PEP.fasta”.
8. Count how many lines have the symbol “>” in this file.
9. Create a BLAST+ reference running the following command:
makeblastdb -in Atha_Osat_PEP.fasta -dbtype prot -
parse_seqids
pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
tar

16.cut
• It divides the file by TAB and prints as STDOUT the selected column.
• “-f x” where x is the number of the column.
• “-d y” where y is a character can be used to change the delimiter
17.sort
• It sorts alphabetically a file based in the firsts characters of the line.
• “-n” can be used to sort numerically.
• “-r” can be used to do a reversed sorting.
• “-u” can be used to apile unique ids
• Usually used with cut “e.g. cut -f1 my_file.txt | sort”.
cut -f 3 Sispe038_genome.genemodels.gff3
cut -f 3 Sispe038_genome.genemodels.gff3 | sort -u
pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
tar
cut
sort

18.uniq
• It reports or omits unique lines.
• Usually used in conjunction with cut and sort “e.g. cut -f1 my_file.txt |
sort”.
19.wc
• It counts newlines, words or bytes in a file.
• “-l” counts the number of lines.
• “-w” counts the number of words.
• “-m” counts the number of characters
cut -f 3 Sispe038_genome.genemodels.gff3 | sort | uniq -c
pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
tar
cut
sort
uniq
wc
wc -l Sispe038_genome.genemodels.gff3

20.sed
• Stream editor to transform text.
• The simplest option is to use “s/<find>/<replace>/“ option.
• A “g” to replace as many times as it can “s/<find>/<replace>/“
• More info at: https://www.gnu.org/software/sed/manual/sed.html
sed “s/A/a/g“ Sispe038ReducedRef.fa
pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
tar
cut
sort
uniq
wc
sed

2. Make a directory with the name: “GFF_refs”.
3. Change working directory to “GFF_refs”.
1. /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/reference/
Sispe038_genome.genemodels.gff3
2. /data/GDA_UFRGS2017/DATA/Nicotiana_benthamiana/reference/
Niben251.1_genome.genemodels.sorted.gff3
5. Count the number of lines in both files.
6. Count the number of lines ignoring lines with “#” symbol using grep.
7. Select the third column in both files and print the first 20 lines.
8. Select the third column in both files, sort it and count unique items using “uniq -c”
Practice 1.5: Selecting columns and counting them
pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
tar
cut
sort
uniq
wc
sed

pwd
mkdir
cd
ls
history
less
touch
rm
COMMAND USE EXAMPLE
less Open a file with less.
Q to exit. Arrows to scroll
less myfile
touch Create an empty file touch myfile
mv Move file between dirs. Change name mv myfile yourfile
rm Remove file rm youfil
cat Print file content as STDOUT cat myfile
head Print first 10 lines as STDOUT head myfile
tail Print last 10 lines as STDOUT tail myfile
grep Print matching lines as STDOUT grep 'ATG' myfile
cut Cut columns and print as STDOUT cut -f1 myfile
sort Sort lines and print as STDOUT sort myfile
uniq Select uniq words (-c to count uniq). uniq -c myfile
sed Replace ocurrences, print lines STDOUT sed 's/ATG/CTG/' myfile
wc Word count wc myfile
Commands for files:

Compression and archiving commands:
pwd
mkdir
cd
ls
history
less
touch
rm
COMMAND USE EXAMPLE
gzip Compress a file using gzip gzip -c test.txt > test.txt.gz
gunzip Uncompress a file using gzip gunzip test.txt.gz
bzip2 Compress a file using bzip bzip2 -c test.txt >
test.txt.bz2
bunzip2 Uncompress a file using gzip bunzip2 test.txt.bz2
tar Archive files usint tar tar -cf sample.tar sample/*.txt
tar -zcvf Archive using tar and compress
using gzip
tar -zcvf samples.tar.gz
sample/*.txt
tar -zxvf Unarchive using tar and
uncompress using gunzip
tar -zxvf samples.tar.gz
tar -jcvf Archive using tar and compress
using bzip2
tar -jcvf samples.tar.bz2
sample/*.txt
tar -jxvf Unarchive using tar and
uncompress using bunzip2
tar -jxvf samples.tar.bz2

21.top/htop
• Display Linux processes.
• Type “q” to quit.
• “kill PID” can be used to terminate a process.
pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
tar
cut
sort
uniq
wc
sed
top/htop
df/du
Global Resource Usage: %CPU / MEMORY / SWAP MEMORY
Single Process Resource Usage: PID / USER / %CPU / %MEMORY / COMMAND

22.df/du
• Commands to check how much disk space is being used in the
system (df -lh) or how much space a directory is using (du -lh
<dir>).
pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
tar
cut
sort
uniq
wc
sed
top/htop
df/du
df -lh
du -lh linux_exercises

23.wget/curl
• Commands to download files from the internet.
• wget can be used recursively (e.g. using * or “-r” for dirs)
• curl has pipeting abilities (using “|”)..
pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
tar
cut
sort
uniq
wc
sed
top/htop
df/du
wget/curl
wget ftp://ftp.solgenomics.net/genomes/
Solanum_lycopersicum/annotation/ITAG3.2_release/*.fasta
curl ftp://ftp.solgenomics.net/genomes/
Solanum_lycopersicum/annotation/ITAG3.2_release/
ITAG3.2_proteins.fasta | grep -c “>”

pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
tar
cut
sort
uniq
wc
sed
top/htop
df/du
wget/curl
ssh/scp
24.ssh/scp
• Commands to:
• ssh = access to a remote server
ssh -p 1809 username@begonia.hort.vt.edu
• scp = copy from/to a remote server
• From LOCAL to REMOTE
scp -p 1809 file1 username@begonia.hort.vt.edu:/dirpath
• From REMOTE to LOCAL
scp -p 1809 username@begonia.hort.vt.edu:/dirpath/file1 ./

pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
tar
cut
sort
uniq
wc
sed
top/htop
df/du
wget/curl
ssh/scp
File Permissions and Ownerships:
All the Unix systems are designed as multiuser operating
systems. It means that different could access, modify or
delete the same files.
To avoid problems, they has a file permission and ownership
system. It restrict who can access and modify each of the
files in the computer.
This system has two parts:
• Who is the owner of the file ?
• What type of access has each of the users in the
System ?

pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
tar
cut
sort
uniq
wc
sed
top/htop
df/du
wget/curl
ssh/scp
Ownership:
Each file has assigned an user owner and a group owner.
The user owner can be:
• Real user (for example: aurebg).
• Virtual user created by a program (for example: mysql).
• Administrator user or root.

pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
tar
cut
sort
uniq
wc
sed
top/htop
df/du
wget/curl
ssh/scp
Permissions:
Each file has assigned 9 different permissions, 3 for the file
user-owner (u), 3 for the group-owner (g) and 3 for everyone
else (o). There are 3 types of permissions or file attributes:
• Readable (r), it has permission to read the file.
• Writable (w), it has permission to write the file.
• Executable (x), it has permission to execute as program.
10 letters code for linux file: ----------
drwxrwxrwx
switch OFF
switch ON
user
group
other
Readable for everyone
Readable for everyone, writable or
executable only for the user-owner
-r--r--r--
-rwxr—-r--

pwd
mkdir
cd
ls
history
less
touch
rm
cp
mv
cat
head/tail
grep
gzip
tar
cut
sort
uniq
wc
sed
top/htop
df/du
wget/curl
ssh/scp
chmod/chown
25.chmod/chown
• Commands to manage ownership and privileges:
• To know the permissions type: “ls -lh"
• Change owner: chown user:group filename
chown aurebg:aurebg file.txt
• Change permisions: chmod permissions_code filename
chmod 664 file.txt
# It changes to readable+writable by user and group and readable by anyone
chmod u+r file.txt
# It changes to readable by user
chmod [ugo] [+-=] [rwx] file
chmod [0-7] [0-7] [0-7] file |rwx|rwx|rwx|
|421|421|421|

I. FASTA
FASTA format is a text based file format that store three different types: DNA,
RNA or protein sequences. Used to represent the information for sequences
for genomes, mRNA’s, cDNA’s, miRNA’s…
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
>SeqID1 optional_description1
AGCGTGGAGAGCGATGAGATCAGAAAGTAGGACGACAGATGGGGAGAT
GGCAGGTGTGGGAGGAGTTGACGATGACGTGATTGATGACGGGAGACG
AGCGTGGAGAGCGATGAGATCAGAAAGTAGGCTGACAGATGGGGAGAT
GGCAGGTGAGGGAGGAGCTGACGATGACGTGTTTGATGACGGGAGACG
AGCGTGGAGAGCGATGAGATCAGAAAGTAGGACGACAGTGGGGGAGAT
GGCAGGTGAGGGAGGAGTTGACGATGACGTGTTTGATGACGGGAGACG
Space separating ID and description
One line ID
ID line always
starts with “>”
}
sequence can be
one or more lines

II. FASTQ
FASTQ format is a text based file format that store usually DNA sequences. It
contains information about the sequencing QUALITY of each nucleotide.
@GWNJ-0957:89:GW170928504:7:1101:2757:1309 1:N:0:NCGTCCC
TATCTAAGTATTTGATTAATGATAGATGACGATGGAGAAATATAATCTACTTTTTT
AAGTCCCTCATTTTCTTTCTCCATCTTTCTTTTTTATTACTCCCATTGTTCCCCAT
+
AAAAAFFJFJJFJJAAAAAFJJJ<FJJJJJJJJJJ7<7<<<<JJJJJJFFJJJAFJ
F-7<<-7AFJJFJJJJJJJJAJJFJFJ<7<-7A-7FAFJA777777<7-7--7--7
@GWNJ-0957:89:GW170928504:7:1101:3549:1309 1:N:0:NCGTCCC
ACCATTCATTATTTTTTTATTTAGTCTTTATTACTTTACTTTCCTTCCTTCTGAAA
TACTGCTATTGTACATAAAACAAAATGATCTACTTAAAAATAAAACAAATTTAAAA
+
AAA-AAJJFJJAAJAA-7AFJJ-7-<<-<AJJ--<J-<-<---77F7-A---A7--
<777<7<7<<F-77F<J<JJ7F7AFF77<77<7777<77<---7---77---7---
One line ID
ID line always
starts with “@”
}
sequence can be
one or more lines
quality line
always starts
with “+” }
One quality character per nucleotide. Each character code a
number from 0-41 (Illumina v1.8+).

II. FASTQ
QUALITY explained.

2. Make a directory with the name: “FASTQ1”.
3. Change working directory to “FASTQ1”.
1. /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/collection/P1_001B.fastq.gz
2. /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/collection/P1_007.fastq.gz
5. Uncompress them.
6. Run the following commands to get the stats.
fastq-stats P1_001B.fastq
fastq-stats P1_007.fastq
7. Redirect the output using “>” into a file using the following commands.
fastq-stats P1_001B.fastq > P1_001B.stats.txt
fastq-stats P1_007.fastq > P1_007.stats.txt
Practice 1.6: Getting stats for a FASTQ file

III. SAM/BAM
SAM (and its binary form BAM) format is designed to store read mapping
information to a reference. It has 11 columns.

III. SAM/BAM
The 2nd column: FLAG defines the status of the read mapping.

IV. GFF3
GFF3 is a text-based file with 9 columns. It is designed to store genomic
features (e.g. genes, exons, repetitive elements…) information. More
information at http://gmod.org/wiki/GFF3.
##gff-version 3
ctg13 . mRNA 1300 9000 . + . ID=mrna0001;Name=GDR1
ctg13 . exon 1300 1500 . + . ID=exon00001;Parent=mrna0001
seqid
source
type
start
end
score
phase
attributes
strand
mrna0001
exon00001 exon00002 exon00003 exon00004 exon00005

DIPLOID
0 = REF
1 => ALT
/ => NO PHASED
| => PHASED
V. VCF
VCF is a text-based file with 8 fixed columns and one extra per sample for the
multisample files. It contacts metadata at the beginning of the file as “#”
explaining the different fields
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
20 1370 rs01 G A 29 PASS DP=14;AF=0.5 GT:GQ:DP 0/1:51:14
20 1730 . T A 3 q10 DP=11;AF=0.2 GT:GQ:DP 0/1:58:11
20 2121 rs02 A G,T 67 PASS DP=10;AF=0.5 GT:GQ:DP 1/2:23:10
20 6781 . T . 47 PASS DP=13;AF=1 GT:GQ:DP 1/1:56:13
E.g. 1
E.g. 2
E.g. 3
E.g. 4
GT:GQ:DP
0/1:51:14GENOTYPE
DEPTH
GENOTYPEQUAL
• E.g. 1 is a biallelic heterozygous SNP.
• E.g. 2 is a biallelic heterozygous SNP with
low quality, probably because the mapping.
• E.g. 3 is a non-biallelic heterozygous SNP.
• E.g. 4 is a biallelic homozygous Deletion

PIPELINE: Combination of commands where the input of one
command is the the output of the previous one
GDA1: 5- 10 essential commands to play with the biological data.
CMD1 CMD2 CMD3 CMD4
Input Output
grep -v “#” Sispe038.gff3 | cut -f 3 | sort | uniq -c

2. Make a directory with the name: “VCF_ANALYSIS”.
3. Change working directory to “VCF_ANALYSIS”.
1. /data/GDA_UFRGS2017/DATA/Nicotiana_benthamiana/resistant_popbatch01/
VLS24_S1.PolCollapsedBiallelicAF1.vcf
5. Answer the following questions:
5.1. How many variants has this file?
5.2. Ignoring Scaffolds (SeqID=Niben251ScfXXXXX), how many variants have each
chromosome (SeqID=Niben251ChrYY)?
5.3. How many biallelic SNPs have this file?
5.4. How many biallelic SNPs with allele frequency 1 (AF=1) have each chromosome?
Practice 1.7:

2. Make a directory with the name: “MY_FIRST_SCRIPT”.
3. Change working directory to “MY_FIRST_SCRIPT”.
1. /data/GDA_UFRGS2017/DATA/Nicotiana_benthamiana/reference/
Niben251.1_genome.gene_models.sorted.gff
5. Write a script that count the number of types per chromosome and uses two arguments
1st=file.gff; 2nd=type.
Practice 1.8:

1. Introduction to Next Generation Sequencing Technologies.
2. Experimental design for population studies, from breeding
to ecological studies.
3. De-multiplexing and the complexities of sample
identiﬁcation.
4. Read processing and quality evaluation.
5. Read mapping to a reference.
6. Variant calling and summary of the read processing.
7. Quality evaluation and possible pitfalls.

DNA sequencing is the process of determining the precise
order of nucleotides within a DNA molecule. It includes any
method or technology that is used to determine the order of
the four bases—adenine, guanine, cytosine, and thymine—in
a strand of DNA.
https://en.wikipedia.org/wiki/DNA_sequencing
(Gentile et al. Nano Lett., 2012, 12 (12), pp 6453–6458)
ATGCGCGTCGCGGTGAAT
GDA2: 1- Introduction to Next Generation Sequencing Technologies.

1950 1960 1970 1980 1990 2000 2010 2020
Electrophoresis(1952)
DNAStructure(1953)
SangerDNASequencing(1977)
AB370ASequencer(1986)
AB310capillarSequencer(1986)
454Sequencer(2005)
SolexaGenomeAnalyzerSequencer(2006)
PaciﬁcBiosciencesSequencer(2011)
OxfordNanoporePortablesequencer(2015)
MS2Bacteriophage(1977)
Epstein-BarrVirus(1984)
Haemophilusinﬂuenzae(1995)
Arabiodpsisthaliana(2000)
Homosapiens(2001)
2016/02/04 Sequenced
Genomes
Viridiplantae 178
Metazoa 5907
Bacteria 7897

Frederick Sanger (1918-2013)
Twice awarded with the Nobel Prize of Chemistry
PreNGS Era

0.015Mb
0.078Mb
0.315Mb
0.138Mb
0.414Mb
1.3Mb
2.6Mb
Error Rate
0.1%
Error Type
substitution

https://www.ebi.ac.uk/training/online/course/ebi-next-generation-sequencing-practical-
course/what-you-will-learn/what-next-generation-dna-

(Mardis E.R. (2013) Annual Review of Analytical Chemistry 6: 287-303)
Next Generation Sequencing vs Sanger
Next Generation Sequencing Sanger
DNA libraries need to be prepared Fragment ampliﬁcation
Direct nucleotide detection based in different
methods
Physical fragment separation for detection
Millions to billions of reads Thousands of reads
Variable size (short and long technologies) 400 to 900 bp read length
Variable error rate Very low error rate
Quantitative comparison Semicomparative comparison

Next Generation Sequencing
0
10000
20000
30000
40000
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
NGS Ecology
(Graph by Dr. David Haak)

http://www.slideshare.net/cosentia/high-throughput-equencing
Next-generation sequencing platforms
Isolation and purification of
target DNA
Sample preparation
Library validation
Cluster generation
on solid-phase
Emulsion PCR
Sequencing by synthesis
with 3’-blocked reversible
terminators
Pyrosequencing Sequencing by ligation
Single colour imaging
Sequencing by synthesis
with 3’-unblocked reversible
terminators
AmplificationSequencingImaging
Four colour imaging
Data analysis
Roche 454Illumina GAII ABi SOLiD Helicos HeliScope
Next Generation Sequencing

Technology
Read length
(bp)
Accuracy Reads/Run Time/Run Cost/Mb
Applied Bio 3730XL
(Sanger)
400 - 900 99.9% 384
4 h
(12 runs/day)
$2,400
Roche 454 GS FLX
(Pyrosequencing)
700
Single/Pairs
99.9% 1,000,000 24h $10
Illumina HiSeq4000 (Seq.
by synthesis)
75-250
Single/Pairs
99% 5,000,000,000 24 to 120 h $0.05 to $0.15
Ilumina MiSeq
(Seq. by synthesis)
50-300 Single/
Pairs
99% 44,000,000 24 to 72 h $0.17
SOLiD 4
(Seq. by ligation)
25-50
Single/Pairs
99.9% 1,400,000,000 168 h $0.13
ION Torrent
(Seq. by semiconductor)
170-400
Single
98% 80,000,000 2 h $2
Paciﬁc Biosciences Sequel
(SMRT)
14,000
Single
85%
(99.9%)
1,600,000 4 h $0.6
Oxford N. Minion
(Nanopore sequencing)
10,000
Single
62%
(96%)
4,400,000 48 h $0.02

GDA2: 1- Introduction to Next Generation Sequencing Technologies: Libraries

Multiplexing
Use of DNA tags (4-7 bp) to identify samples in the same
sequencing lane, cell or sector.
mRNA-1
mRNA-2
Sample
1
Sample
2
cDNA-1-tag_ATGC
cDNA-2-tag_CGAG
ATGC
CGAG
AUGCGUU
AUGCGUU
UUGCGCU
AAGAGUU
AUGCGUU
AUGUGAA
UUGCGCU
AAAAGUU
ATGCGTTATGC
ATGCGTTATGC
TTGCGCTATGC
AAGAGTTATGC
ATGCGTTCGAG
ATGTGAACGAG
TTGCGCTCGAG
AAAAGTTCGAG
}
Pool
ATGCGTTCGAG
ATGTGAACGAG
TTGCGCTCGAG
AAAAGTTCGAG
ATGCGTTATGC
ATGCGTTATGC
TTGCGCTATGC
AAGAGTTATGC
Sequencing
GDA2: 1- Introduction to Next Generation Sequencing Technologies: Libraries

http://www.roche.com/
GDA2: 1- Introduction to Next Generation Sequencing Technologies: 454

Pyrosequencing technology
(Mardis E.R. (2008) Trends in Genetics 24: 133-141)

Pyrosequencing technology
https://www.youtube.com/watch?v=rsJoG-AulNE

http://454.com/products/gs-ﬂx-system/index.asp

http://www.bio-itworld.com/BioIT_Article.aspx?id=131053

http://www.illumina.com/
GDA2: 1- Introduction to Next Generation Sequencing Technologies: Illumina

Sequence by Synthesis technology
(Mardis E.R. (2013) Annual Review of Analytical Chemistry 6: 287-303)

Sequence by Synthesis technology
http://www.illumina.com/techniques/sequencing/dna-
sequencing.html#

https://www.illumina.com/systems/sequencing-platforms.html
Benchtop systems
Production-scale systems

https://products.appliedbiosystems.com
GDA2: 1- Introduction to Next Generation Sequencing Technologies: SOLiD

Sequence by Ligation technology

http://media.invitrogen.com.edgesuite.net/ab/
applications-technologies/solid/solid-5500.html
Sequence by Ligation technology

https://products.appliedbiosystems.com
GDA2: 1- Introduction to Next Generation Sequencing Technologies: Ion Torrent

Sequence by Semiconductor technology
A"sample"of"DNA"is"cut"into"
millions"of"fragments,"and"
each"fragment"is"a7ached"
to"its"own"bead"
The"fragment"is"copied""
un;l"it"covers"the"bead"
This"automated"process"
produces"millions"of"beads"
covered"with"millions"of"
different"fragments"
The"beads"are"then"flowed"
across"the"chip,"each"being"
deposited"into"a"well"
Then"the"chip"is"flooded"
with"one"of"the"four"
nucleo;des"
If"the"next"base"on"the"DNA"
strand"is"complementary"to"
this"nucleo;de,"a"nucleo;de"
will"be"incorporated"and""
a"hydrogen"ion"will"be"
released"
The"hydrogen"ion"changes"
the"pH"of"the"solu;on"in""
the"well"
An"ionCsensi;ve"layer"
beneath"the"well"measures"
that"pH"change"and"
converts"it"to"voltage"
This"voltage"change"is"
recorded,"indica;ng"the"
nucleo;de"has"been"
incorporated"and"the""
base"is"called"
This"process"happens"
simultaneously"in"millions"
of"wells"
Copy%DNA% Load%chip% Incorporate%nucleo6de% Detect%and%call%

Sequence by Semiconductor technology

http://www.pacb.com/
GDA2: 1- Introduction to Next Generation Sequencing Technologies: PacBio

Single Molecule Real Time (SMRT) technology
Niedringhaus et al. Analytical Chemistry 2011

hsp://bit.ly/1naxgTe

http://genome.duke.edu/cores-and-services/sequencing-and-genomic-technologies/pacbio
PacBio Sequel

https://www.nanoporetech.com/
GDA2: 1- Introduction to Next Generation Sequencing Technologies: Oxford Nanopore

Niedringhaus et al. Analytical Chemistry 2011
Sequence by Nanopore technology

Sequence by Nanopore technology
GDA2: 1- Introduction to Next Generation Sequencing Technologies

GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
Population (Genetics)
Group of organisms or individuals from the same geographical
location with the capability of interbreeding.
• Natural populations (e.g. Sinningia speciosa group of plants that
grow in the area of Pedra Lisa).
• Artiﬁcial populations (e.g. F2 segregating population of Sinningia
speciosa Empress x Buzios).

• Natural populations • Artificial populations
- Structure & Size.
- Diversification.
- Speciation.
- Selection.
- Drift.
- Fitness.
- Migration.
- Genetic maps.
- Geno2Pheno links.
- QTLs
- GWAS.
- Artificial Selection.
- Domestication.

(1) Focus in genotyping instead a right sampling of the populations.
(2) Wrong randomization of the samples.
(3) Confuse geopolitical borders with biological borders.
(4) Testing signiﬁcance of the clustering output.
(5) Misinterpretation of Mandel’s r statistic (correlation between dist. matrices).
(6) Single K value interpretation without consider other alternative scenarios.
(7) Don’t take into account loci ﬁxation associated with an adaptive trait

✴ Focus in genotyping instead a right sampling of the populations.
How many individuals are necessary per “population” ?
It depends of the analysis and the population.
Example 1: Single dominant locus QTL Analysis.
• Recombination rate (genome size and chromosome
number).
• Genotyping methodology (resolution).
• Loci location.
}
100 F2 individuals
as starting point and
then move to other
populations (e.g. F3) or
adding more individuals
Example 2: Local adaptation.
• Trait analyzed.
• Population structure.
• Quality of the reference.
• Genotyping methodology (resolution).
}
50 individuals per
group as starting point
and then move to other
populations (e.g. F3) or
adding more individuals

✴ Genotyping approaches.
Genotyping: It is the process of determining genetic differences of an
individual by examining the individual's DNA sequence.
Genome sequencing
Cost effective approaches
Reduced representation
1. Targeted amplification (e.g. TrueSeq Custom Amplicon)
2. Hybridization (e.g. Sequence Capture)
3. Enzymatic Digestion + Size selection (e.g. RAD-Seq / GBS)
4. RNA isolation (RNA-Seq)

✴ Genotyping approaches: Reduced representation approaches.
1. Targeted amplification (e.g. TrueSeq Custom Amplicon)
Gene A Gene B Gene C
RE
site
RE
site
RE
site
RE
site
Amplification
Library preparation and sequencing
Fastq Files
Different samples
Different MIDs

2. Hybridization (e.g. Sequence Capture)
MIDPCR
Different samples
Different MIDs
RE
site
RE
site
RE
site
RE
site
Fragmentation
DNA Capture
Sequencing
Fastq Files
Amplification and Lib. preparation

3. Enzymatic Digestion + Size selection (e.g. RAD-Seq / GBS)
REMIDPCR
Different samples
Different MIDs
RE
site
RE
site
RE
site
RE
site
Digestion
Adapters ligation
Sequencing
Fastq Files
Amplification (Size selection ~500bp)
Elshire et al. 2011 PLOS One 6:e193779
Genotyping-By-Sequencing (GBS)

4.RNA isolation (RNA-Seq)
RE
site
RE
site
RE
site
RE
site
Gene expression
RNA extraction and cDNA synthesis
Library preparation and sequencing
Fastq Files
Different samples
Different MIDs

GDA2: 3- De-multiplexing and the complexities of sample identiﬁcation.
Multiplexing
Use of DNA tags (4-7 bp) to identify samples in the same
sequencing lane, cell or sector.
mRNA-1
mRNA-2
Sample
1
Sample
2
cDNA-1-tag_ATGC
cDNA-2-tag_CGAG
ATGC
CGAG
AUGCGUU
AUGCGUU
UUGCGCU
AAGAGUU
AUGCGUU
AUGUGAA
UUGCGCU
AAAAGUU
ATGCGTTATGC
ATGCGTTATGC
TTGCGCTATGC
AAGAGTTATGC
ATGCGTTCGAG
ATGTGAACGAG
TTGCGCTCGAG
AAAAGTTCGAG
}
Pool
ATGCGTTCGAG
ATGTGAACGAG
TTGCGCTCGAG
AAAAGTTCGAG
ATGCGTTATGC
ATGCGTTATGC
TTGCGCTATGC
AAGAGTTATGC
Sequencing

De-Multiplexing
Identiﬁcation of the sequenced DNA samples using the DNA tag
ATGCGTTCGAG
ATGTGAACGAG
TTGCGCTCGAG
AAAAGTTCGAG
ATGCGTTATGC
ATGCGTTATGC
TTGCGCTATGC
AAGAGTTATGC
Sequencing
Demultiplexing
ATGCGTTCGAG
ATGTGAACGAG
TTGCGCTCGAG
AAAAGTTCGAG
ATGCGTTATGC
ATGCGTTATGC
TTGCGCTATGC
AAGAGTTATGC
Sample
1
Sample
2

De-Multiplexing
ATGCGTTCGAG
ATGTGAACGAG
TTGCGCTCGCG
AAAAGTTCGAG
ATGCGTTATCC
ATGCGTTATGC
TTGCGCTATGC
AAGAGTTATGC
Sequencing
Demultiplexing
ATGCGTTCGAG
ATGTGAACGAG
AAAAGTTCGAG
ATGCGTTATGC
TTGCGCTATGC
AAGAGTTATGC
TTGCGCTCGCG
ATGCGTTATCC
?
Sample
1
Sample
2

De-Multiplexing
Keys for barcode/tag designing (GBS/RADseq):
• The barcode does not contain or recreate the enzyme cut
site.
• Any barcode in a set is at least two substitutions away
from any other barcode.
• They vary in length as a set (to avoid the all cut site bases
appearing at the same positions in the sequencing read).
• They contain the complementary sticky end to the enzyme
cut site.
http://www.maizegenetics.net/genotyping-by-sequencing-gbs

De-Multiplexing
Software RE Link
Fastx-toolkit
(Barcode splitter)
No http://hannonlab.cshl.edu/fastx_toolkit/
Ea-utils
(Fastq-multx)
No https://expressionanalysis.github.io/ea-utils/
GBSX Yes https://github.com/GenomicsCoreLeuven/GBSX
TASSEL Yes http://www.maizegenetics.net/tassel

Fastq raw
Fastq Processed
Reads processing
and
filtering
1. Low quality reads (qscore) (Q30)
2. Short reads (L50)
3. PCR duplications (Only Genomes).
4. Contaminations.
5. Corrections
Mapped
Reads
Assembled
Reads
Other Analysis
GDA2: 4- Read processing and quality evaluation.

0- Read Quality Evaluation
• Does the sequencing produced the expected number of
reads?
READ COUNTS
• Do the reads have the expected average length?
AVERAGE READ LENGTH
• Do the reads have the expected nucleotide qscore?
QSCORE NUCLEOTIDE BOXES
Technology dependent

FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
0- Read Quality Evaluation

1- Quality ﬁltering
• Generally associated with a minimum length and a
minimum qscore (extremes, by average, minimum for all
the nucleotides)
Tecnology min. length (bp) min. qscore
454 100 20
Illumina 50 30
SOLiD 20 30
Ion Torrent 50 20
PacBio 1000 NA
Oxford Nanopore 1000 NA

1- Quality ﬁltering
• Tools and time for processing varies depending of the
technology.
Software Link
Fastx-toolkit http://hannonlab.cshl.edu/fastx_toolkit/
Ea-utils https://expressionanalysis.github.io/ea-utils/
PrinSeq http://prinseq.sourceforge.net/
Trimmomatic http://www.usadellab.org/cms/?page=trimmomatic
e.g. running ea-utils command:
fastq-mcf -q 30 -l 50 -o s01_Q30L50_R1.fq
Illumina_Adapters.fa s01_R1.fq

Practice 2.1: Process reads and get stats.
1. Make a new directory called ‘sinningia_genotyping’.
2. Change the working directory to ‘sinningia_genotyping’
2. Make a directory with the name: “00_raw”.
3. Change working directory to “00_raw”.
4. Copy four fastq files and the “IlluminaAdapters_V2.fasta” from /data/GDA_UFRGS2017/
DATA/Sinningia_speciosa/collection/ to your current working directory.
5. Get the stats for the raw reads using “fastq-stats”.
6. Process the reads using “fastq-mcf” with a min. quality score of 30 and a min. length of
50 bp (note: you can use a script). An example of the command could be something like:
fastq-mcf -q 30 -l 50 -o P1_003_Q30L50.fq
IlluminaAdapters_V2.fasta P1_003.fastq.gz
7. Make a directory one level up (../) with the name “01_processed”.
8. Move the outputs from “fastq-mcf” a “../01_processed”.
9. Get the stats for the processed reads using “fastq-stats”.

Read Mapping:
It is the process of search the location of a read comparing the its sequence and the
sequence of a reference.
ATGGCGTGGCAGCGACCAGTGACCAGTGACGTGTGCAGACGTGATATGCAG
GCAGCGACCAGCGA
||||||||||| ||
1........10........20........30........40........50
ref
read
ref:10..23
Sequence Alignment:
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity
that may be a consequence of functional, structural, or evolutionary relationships between the sequences.[1] Aligned sequences of
nucleotide or amino acid residues are typically represented as rows within a matrix.
http://en.wikipedia.org/wiki/Sequence_alignment
GDA2: 5- Read mapping to a reference.

Read Mapping Considerations:
• Length of the read.
• Number of reads.
• Size of the reference.
Short reads
(NGS)
Millions of
sequences
Long reads
(Chromosomes)
Dozens of
sequences
Medium reads
(Genes/Transcripts)
Thousands of
sequences

Read Mapping Software:
ATGGCGTGGCAGCGACCAGTGACCAGTGACGTGTGCAGACGTGATATGCAG ref
database/indexes
GCAGCGACCAGTGA
read seeds (kmers)
GCAGCGACCA
CAGCGACCAG
AGCGACCAGC
CGACCAGCGA
GCGACCAGCG
search
GCAGCGACCA
ATGGCGTGGCAGCGACCAGTGACCAGTGACGTGTGCAGACGTGATATGCAG
extension
GCAGCGACCAGCGA

ATGACGTGC
GCCGTGCTG
find seed
(perfect match l=4)
extension
(mismatches allowed)
ATGACGTGC
GCCGTGCTG ATGACGTGC
GCCGTGCTG
| |||||
A T G A C G T G C
0 -1 -2 -3 -4 -5 -6 -7 -8 -9
G -1 -1 -2 -1 -4 -5 -4 -7 -6 -9
C -2 -2 -2 -3 -2 -3 -6 -5 -8 -5
C -3 -3 -3 -3 -4 -1 -4 -7 -6 -7
G -2 -4 -4 -2 -4 -5 0 -5 -6 -7
T -3 -3 -3 -5 -3 -5 -6 1 -6 -7
G -4 -4 -4 -2 -6 -4 -4 -7 2 -7
C -5 -5 -5 -5 -3 -5 -5 -5 -8 3
T -4 -6 -4 -6 -6 -4 -6 -4 -6 -9
G -5 -5 -7 -3 -8 -7 -3 -7 -3 -7
• Smith–Waterman algorithm
• Needleman–Wunsch algorithm
• Burrows-Wheeler index
Short reads
(NGS)
Millions of
sequences
Long reads
(Chromosomes)
Dozens of
sequences
Medium reads
(Genes/Transcripts)
Thousands of
sequences
||||

Name Type Input Output
Mauve Long sequences Fasta, GenBank
backbone (positions)
XMFA (alignments)
LastZ/MultiZ Long sequences Fasta several (maf, sam…)
Blast Medium sequences Fasta
Blast formats (0 text ﬁle, 6
tabular ﬁle)
Blat Medium sequences Fasta
Blast formats + Blat tabular
format
Bowtie Short sequences Fasta, Fastq Sam
BWA Short sequences Fasta, Fastq Sam
Novoalign Short sequences Fasta, Fastq Sam
SOAP Short sequences Fasta, Fastq Sam
Stampy Short sequences Fasta, Fastq Sam

http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

Sam/Bam ﬁle manipulation:
http://samtools.sourceforge.net/

Practice 2.2: Read mapping and get stats.
2. Make a directory with the name: “02_mapped”.
3. Change working directory to “01_processed”.
4. Map each of the processed reads to the reference index “../../linux_exercises/Sispe_ref/
Sispe038ReducedRef.fa” created Day1: Practice 3, using bowtie2-build. Redirect the
output to the directory “../02_mapped/”. An example of a command could be:
bowtie2 -p 2 -t -x ../../linux_exercises/Sispe_ref/
Sispe038ReducedRef -U P1_009_Q30L50.fq -S ../02_mapped/
P1_009_Q30L50.sam
5. Change working directory to “../01_processed”.
6. Count how many hits have each sam file using “samtools” and the for example the
following command.
samtools view -c —F 4 -Sb P1_009_Q30L50.sam

Practice 2.2: Read mapping and get stats.
7. Filter the sam file removing the reads without hits (tag 0x4) and convert it to bam.
samtools view -F 4 -Sb -o P1_009.bam P1_009_Q30L50.sam
8. Merge all the bam files using ‘bamaddrg’ with a command such as. As sample name use
the names from the file “SampleNamesMappingFile.txt” (e.g. P1_003 name is
“Purple_Dreaming”; do not use spaces).
/data/software/bamaddrg/bamaddrg -b P1_003.bam -s
Purple_Dreaming -b P1_009.bam -s Merry_Christmas -b P1_014.bam
-s White -b P1_021.bam -s Good_Morning > SispeUser00_merged.bam
9. Sort the merged bam file with “samtools sort”.
samtools sort -o SispeUser00_sorted.bam SispeUser00_merged.bam
10. Delete the sam files and the unsorted bam’s.

Genetic variant is the genetic differences both within and among populations.
• Structural differences: Structural Variations (SVs), Copy Number Variation (CNV).
• Molecular differences (changes in the DNA sequence).
• Single Nucleotide Variants/Polymorphisms (SNVs/SNPs)
• Insertions/deletions Variants/Polymorphisms (INDELs/DIVs/DIPs)
• Multiple Nucleotide Variants/Polymorphism (MNVs/MNPs)
GACGTGC
GCCGTGC
| |||||
Sample 1
Sample 2
Polymorphism is a DNA sequence variation that is common in the population
GACGTGC
G-CGTGC
| |||||
Sample 1
Sample 2
SNVs/SNPs INDELs/DIVs/DIPs
GACGTGC
GCTGTGC
| ||||
Sample 1
Sample 2
MNVs/MNPs
GDA2: 6- Variant calling and summary of the read processing.

Processed Reads
Mapped Reads
Processed Map
Variants
Read
mapping
Local realignment,
sort and filtering
Variant
calling
Annotated Variants
Variant
annotation
Variant calling:
• Heuristic methods (read depth)
- SamTools
- VarScan
• Probabilistic methods (bayesian)
- GATK
- FreeBayes
- SOAPsnp/SOAPindel
Variant
filtering

Name Type Strength Weaknesses
SamTools Heuristic
• Assumes errors are non-
independent (matches data)
• Good accuracy with low
coverage data
• Reasonably quick
• Increase false positives at
high coverage
• Lower quality indel calling
GATK Probabilistic
• Trains with real data
• Excellent accuracy with high
coverage data
• Low false positive rate
• Assumes errors are
independent
• High level of preprocessing
• Very slow
FreeBayes Probabilistic
• Combined bam population
estimate
• Good accuracy with low
coverage data
• Very very quick
• No training, population level
estimate only
• Lower quality indel calling

Processed Reads
Mapped Reads
Processed Map
Variants
Read
mapping
Local realignment,
sort and filtering
Variant
calling
Annotated Variants
Variant
annotation
Variant filtering:
- VCFTools
- GATK
Variant
filtering
Variant annotation:
- SnpEff

Practice 2.3: Variant calling.
2. Make a directory with the name: “03_variants”.
3. Change working directory to “02_mapped”.
4. Create an index with “samtools index” for the sorted bam file:
samtools index SispeUser00_sorted.bam
5. Run “freebayes” with --min-base-quality 30 --min-mapping-quality 20 --min-coverage 5
with a command such as:
freebayes -b SispeUser00_sorted.bam -f ../../linux_exercises/Sispe_ref/
Sispe038ReducedRef.fa -v ../03_variants/SispeUser00.vcf --min-coverage 5 -q 30 -m 20
6. Count how many variants has the VCF file including a division of variants per type (SNP,
INDEL, MNP and Complex).

Methods for Variant Evaluation
• Validation by Sanger Sequencing of specific candidates (~5 - 500) using other
datasets (e.g. transcriptome) if it is possible.
• Comparison with other method (e.g. genotyping chip).
• Different mapping and variant calling tools comparison (with a “truth set” or a
“gold standard” if it is possible).
GDA2: 7- Quality evaluation and possible pitfalls.
https://gatkforums.broadinstitute.org/gatk/discussion/6308/evaluating-the-quality-of-a-variant-callset

• Validation by Sanger Sequencing of specific candidates (~5 - 500) using other
datasets (e.g. transcriptome) if it is possible.
Variants from
RNASeq
(Illumina)
Variants from
ESTs
(Sanger)

• Different mapping, variant calling tools and datasets comparison (with a “truth
set” or a “gold standard” if it is possible).
Assumptions:
1. The content of the truth set has been validated.
2. Your samples are expected to have similar genomic content as the
population of samples that was used to produce the truth set

Metrics:
1. Variant level concordance: Percentage of variants in your samples that
match (are concordant with) variants in your truth set.
2. Genotype concordance: Percentage of variants in your genotype that
match (are concordant with) variants in your truth set.
False Positives (FP) False Negatives (FN)
True Positives (TP)
My Dataset (16) True Set (18) % SENSITIVITY:
TP * 100 / (TP + FN) = 13 * 100 / (13 + 5) = 72%
% FALSE DISCOVERY RATE:
FP * 100 / (TP + FP) = 3 * 100 / (13 + 3) = 20%
% GT CONCORDANCE:
SumMatches * 100 / TP
6 * 100 / 11 = 54%
A * T C T C C * C A C
A T T C * C C T * A *
1 0 1 1 0 1 1 0 0 1 0
True Set (9)
My Dataset (8)
Matches (6)

Metrics:
3. Number of SNPs and INDELs: Between different datasets should be
consistent for the same number of mapped reads.
4. TiTv Ratio: Ratio of transition (Ts) to transversion (Tv) SNPs should be
random (~0.5). Methylation islands (CpG) and other factors may introduce
a bias so expected values will range from 0.5 - 3.0.
5. Ratio Insertions/Deletions: It should be close to 1, except in rare alleles
that it could be 0.2 - 0.5.

Comparison between different tools:
https://bcbio.wordpress.com/

Tools:
Name URL
VariantEvaluation
(GATK)
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/
org_broadinstitute_gatk_tools_walkers_varianteval_VariantEval.php
GenotypeConcordance
(GATK)
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/
org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeConcordance.php
VCFTools http://vcftools.sourceforge.net/
VCFStats http://lindenb.github.io/jvarkit/
PicardTools https://broadinstitute.github.io/picard/index.html

1. Variant ﬁltering.
2. Simple stats for the variant analysis.
3. Variant visualization tools: IGV and TASSEL.
4. Changing formats for VCF ﬁles.
5. Example 1: Population analysis with Structure for Sinningia
speciosa.
6. Example 2: Genetic Map with R/QTL for Nicotiana
benthamiana.

Variant ﬁltering is the process to remove low quality or other non
adequate variants (e.g. non biallelic, complex…) for the downstream
analysis. It depends on:
1. Source and methodology used to generate the data (library
preparation errors and biases).
2. Sequencing technology (read sequencing errors) and amount of
data (insufﬁcient depth/sites).
3. Software used for mapping (mapping errors) and variant calling
(produced by a low coverage/low complexity sites).
4. Reliability (low quality/incomplete) and nature (genomic differences/
polyploidy) of the reference genome.
5. Type of population (e.g. F2 population) and type of analysis that it
will be performed.
GDA3: 1-Variant Filtering

Variant ﬁltering
Two major source of error (Li et al. 2014):
• Erroneous realignment in low-complexity regions
• Incomplete reference genome with respect to the sample
“The raw genotype calls is as high as 1 in 10-15 kb, but the error
rate of post-filtered calls is reduced in 1 in 100-200 kb without
significant compromise on the sensitivity”.
More data is not always better.
High quality/reliable data

Alignment problems
coordinates 12345678901234 5678901234567890123456
reference aggttttttataac---aattaagtctacagagcaacta
sample aggttttttataacAATaattaagtctacagagcaacta
read1 aggttttttataac***aaAtaa
read2 ggttttttataac***aaAtaaTt
read3 ttttataacAATaattaagtctaca
read4 CaaT***aattaagtctacagagcaac
read5 aaT***aattaagtctacagagcaact
read6 T***aattaagtctacagagcaacta
Aligners calculate the alignment correctness and give it a score
depending of:
• Length of the alignment.
• Number of mismatches and gaps.
• Uniqueness of the alignment (number of hits).
}Good alignment
Misaligned bases

Alignment problems
Misaligned bases - Solutions:
• Read realignment (IndelRealigner for GATK (obsolete),
now it is integrated in the HaplotypeCaller).
• Mark alignment quality per base (BAQ) and do not use for
variant calling.

Library preparation problems
PCR duplications produce biases in the variant call (e.g. het.)
• Library speciﬁc problem for Whole Genome Sequencing.
Fragmentation
Library preparation
PCR Duplication

PCR duplications - Solutions:
• Mark duplicates with tools such as samtools rmdup
SKIP PCR DUPLICATION MARKING STEP FOR GBS, RAD-SEQ…
CAREFUL: Some reduced representations techniques with unequal ratios
of site amplication WILL PRODUCE THOUSANDS PCR DUPLICATION

Sequencing errors produce biases in the variant call.

Sequencing errors - Solutions:
• High coverage (< 20 X) to minimize sequencing errors.
• Recalibrate bases (Base Score Quality Recalibration - BSQR)
using tools such as BaseRecalibrator.

Variant ﬁltering:
https://software.broadinstitute.org/gatk/best-practices/

Variant ﬁltering:
https://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-
detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/

Variant ﬁltering:
Three general purpose callers:
• FreeBayes (v0.9.9.2-18)
• GATK UnifiedGenotyper (2.7-2)
• GATK HaplotypeCaller (2.7-2)
• Skipping base recalibration and indel realignment
had almost no impact on the quality of resulting
variant calls
• FreeBayes outperforms the GATK callers on both
SNP and indel calling. The most recent versions of
FreeBayes have improved sensitivity and specificity
which puts them on par with GATK HaplotypeCaller.
• GATK HaplotypeCaller is all around better than the
UnifiedGenotyper.

Software
Filters
Depth Het.
Var.
Quality
Mapping
Quality
Allele
Freq.
Position/
Distance
HWE Missing
VCFTools Yes Yes Yes No Yes Yes Yes Yes
SnpSift* Yes Yes Yes No Yes Yes No No
Vardict Yes No Yes No Yes No No No
GATK Yes Yes Yes Yes Yes Yes No Yes
Variant ﬁltering:
* It will depends of the tags for the VCF ﬁle

Examples using VCFTools
1.Variants with low quality QUAL < 20.
vcftools --vcf input.vcf --minQ 20 --recode --recode-INFO-
all --out out
2. Variants with depth DP < 10.
vcftools --vcf input.vcf --min-meanDP 10 --recode --
recode-INFO-all --out out
3. Separated by at least 1000 bp.
vcftools --vcf input.vcf --thin 1000 --recode --recode-
INFO-all --out out
4. No biallelic.
vcftools --vcf input.vcf --min-alleles 2 --max-alleles 2
--recode --recode-INFO-all --out out
5. No missing.
vcftools --vcf input.vcf --max-missing 1.0 --recode --
recode-INFO-all --out out

Practice 3.1: Filter the variant file.
2. Change working directory to “03_variants”.
3. Run “vcf-stats” on the “SispeUserXX.vcf” file.
4. Remove the variants with QUAL < 20 and run “vcf-stats” again.
5. Remove the variants with DEPTH < 10 and run “vcf-stats” again.
6. Remove the variants separated between them 1000 bp or less and run “vcf-stats” again.
7. Get biallelic variants and run “vcf-stats” again.
8. Remove all the genotypes with missing data.
9. Select only SNPs.

Stats for the VCF files
GDA3: 2- Simple population stats for the variant analysis.
Regular stats with vcf-stats (https://vcftools.github.io/perl_module.html)
vcf-stats is a program that runs several stats for a VCF file producing the
following files:
• stats.counts, number of variants per sample and for all the samples.
• stats.dump, parseable hash Perl format file with all the VCF stats.
• stats.indels, number of INDELs per sample and for all the samples.
• stats.legend, various definitions
• stats.private, unique (not shared) variants for each sample
• stats.samples-tstv, transicions/transversions for each sample
• stats.shared, shared variants for all the samples
• stats.snps, number of SNPs per sample and for all the samples.
• stats.tstv, transicions/transversions for all the samples

Distribution with bcftools stats (https://samtools.github.io/bcftools/bcftools.html)
bcftools is a software to manipulate VCF/BCF ﬁles. Stats can be used to
produce several data distributions such as QUAL (quality), DP (depth), ST
(substitution types), IDD (InDel size distribution), AF (allele frequency)… It
also include as summary (SN).
# SN, Summary numbers:
# SN [2]id [3]key [4]value
SN 0 number of samples: 4
SN 0 number of records: 110927
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 99184
SN 0 number of MNPs:10943
SN 0 number of indels:3101
SN 0 number of others:506
SN 0 number of multiallelic sites: 3816
SN 0 number of multiallelic SNP sites: 798

Distribution with vcfutils.pl qstats
vcfutils.pl is a program that get the qual. and ts/tv parameters associated
with the SNPs. It can be used to test if there are some bias of the ts/tv
associated with low quality.
QUAL #non-indel #SNPs #transitions #joint ts/tv #joint/#ref #joint/#non-indel
6856.32 1909 1909 654 0 0.5211 0.0000 0.0000 0.5211
4769.34 3818 3818 1381 0 0.5667 0.0000 0.0000 0.6151
3506.53 5727 5727 2215 0 0.6307 0.0000 0.0000 0.7758
2748.14 7636 7636 3051 0 0.6654 0.0000 0.0000 0.7791
2240.06 9545 9545 3956 0 0.7078 0.0000 0.0000 0.9014
. . .
16.3149 80179 80179 41279 0 1.0612 0.0000 0.0000 1.2945
11.551 82088 82088 42386 0 1.0676 0.0000 0.0000 1.3803
6.48534 83997 83997 43471 0 1.0727 0.0000 0.0000 1.3167
2.79415 85906 85906 44556 0 1.0775 0.0000 0.0000 1.3167

Populations stats with vcftools
vcftools can also be used to get simple population genetics parameters
associated to a VCF ﬁle. Some of these examples are:
• Calculate nucleotide diversity (π)
vcftools --vcf input.vcf --keep NamesGroup1.txt --
window-pi 100000 --out Group1_Pi
• Calculate linkage disequilibrium (LD) (for phased genotypes).
vcftools --vcf input.vcf --keep NamesGroup1.txt --ld-
window-bp 50000 --chr SeqID1 --hap-r2 --min-r2 0.001 --
out Group1_SeqID1_LD

Populations stats with vcftools
vcftools can also be used to get simple population genetics parameters
associated to a VCF ﬁle. Some of these examples are:
• Calculate FST between two groups.
vcftools --vcf input.vcf --weir-fst-pop SampleGroup1.txt
--weir-fst-pop SampleGroup2.txt --fst-window-size 100000
--out Group1_VS_Group2_FST
• Calculate TajimaD for one group.
vcftools --vcf input.vcf --keep NamesGroup1.txt --
TajimaD 100000 --out Group1_Pi

Practice 3.2: Get stats for the VCF file
2. Change working directory to “03_variants”.
3. Run “vcf-stats” on the “SispeUserXX.vcf” file.
4. Run “bcftools stats” on the “SispeUserXX.vcf” and pipe the output to select “^SN”
5. Run “vcftools” to calculate the nucleotide diversity on the “SispeUserXX.vcf”.
6. Run “vcftools” to calculate the Tajima D on the “SispeUserXX.vcf”..
7. Divide your dataset in two groups and calculate the FST between these two groups.

IGV, Integrative Genomic Viewer
GDA3: 3- Variant visualization tools: IGV and TASSEL
http://software.broadinstitute.org/software/igv/
The Integrative Genomics Viewer (IGV) is a high-performance
visualization tool for interactive exploration of large, integrated genomic
datasets. It supports a wide variety of data types, including array-based
and next-generation sequence data, and genomic annotations.

http://software.broadinstitute.org/software/igv/download

1- Create a new .genome file for the Sinningia reference.
2- Add an Unique identifier (e.g. “Sispe038”), a descriptive name (e.g. “S.
species version 0.3.8”, the FASTA and the GFF files with the reference.

3- Select any scaffold

4- To load any VCF or BAM ﬁle, select “Load From File”
6- Then select the scaffold that you want to visualize (e.g. “Sispe038Scf0002”)
5- Then load your VCF ﬁle (e.g. “SispeUser00.vcf”).

It creates two tracks: 1- With all the variants and the AF as a red/blue
bar; 2- With all the individual samples.

You also can load BAM ﬁles to see the read alignment.

TASSEL, Integrative Genomic Viewer
http://www.maizegenetics.net/tassel
TASSEL is a tools to investigate the relationship between phenotypes and
genotypes.TASSEL has functionality for association study, evaluating
evolutionary relationships, analysis of linkage disequilibrium, principal
component analysis, cluster analysis, missing data imputation and data
visualization.

1- Load VCF data

2- Explore the VCF data

3- Calculate nucleotide diversity

4- Get a distance matrix

4- Perform a Principal Component Analysis

5- Produce a cladogram

Change formats from VCF to others.
GDA3: 4- Changing formats for VCF ﬁles.
http://www.cmpg.unibe.ch/software/PGDSpider/

VCF => FastStructure
PGDSpider can be used to change between different formats:
• From VCF to FastStructure.
perl -ne 'chomp($_); if ($_ =~ m/#/) { print "$_n"}
else { @a= split(/t/, $_); if (length($a[3]) == 1 &&
length($a[4]) == 1) {print "$_n"} }' input.vcf >
clean.vcf
java -Xmx1024m -Xms512m -jar /data/software/
PGDSpider_2.1.1.2/PGDSpider2-cli.jar -inputfile
clean.vcf -inputfileformat VCF -outputfile
clean.structure.str -outputfileformat STRUCTURE -spid
VCF2FastStructure.spid

VCF => FastStructure
PGDSpider requires a configuration file (.spid) for each of the formats.
Example for a VCF2FastStructure file.
# VCF Parser questions
PARSER_FORMAT=VCF
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
VCF_PARSER_REGION_QUESTION=
VCF_PARSER_PL_QUESTION=GT
VCF_PARSER_QUAL_QUESTION=20
VCF_PARSER_GTQUAL_QUESTION=0
VCF_PARSER_READ_QUESTION=5
VCF_PARSER_IND_QUESTION=
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=TRUE
VCF_PARSER_MONOMORPHIC_QUESTION=FALSE
VCF_PARSER_POP_QUESTION=
# STRUCTURE Writer questions
WRITER_FORMAT=STRUCTURE
STRUCTURE_WRITER_FAST_FORMAT_QUESTION=TRUE
STRUCTURE_WRITER_LOCI_DISTANCE_QUESTION=TRUE

Change formats

https://github.com/aubombarely/GenoToolBox
Change formats

GDA3: 5- Example 1: Population analysis with Structure for Sinningia speciosa.
EmpressPurple01 S_speciosa DOMESTICATED
EmpressRed01 S_speciosa DOMESTICATED
Buzios S_speciosa WILD
PurpleDreaming Hybrid DOMESTICATED
GalaxyTour S_speciosa DOMESTICATED
AnsNix Hybrid DOMESTICATED
AmandasPenny Hybrid DOMESTICATED
TV_Faeton S_speciosa DOMESTICATED
DarthVader S_speciosa DOMESTICATED
MerryChristmas S_speciosa DOMESTICATED
StrawberryJam S_speciosa DOMESTICATED
BlueDandy S_speciosa DOMESTICATED
DeadlyRomance S_speciosa DOMESTICATED
DiegoPink S_speciosa WILD
White S_speciosa DOMESTICATED
Kleopatra S_speciosa DOMESTICATED
BestRoskosh S_speciosa DOMESTICATED
LovePotion S_speciosa DOMESTICATED
NTVenushki S_speciosa DOMESTICATED
GoodMorning S_speciosa DOMESTICATED
BlueKnight S_speciosa DOMESTICATED
AvenidaNiemeyer S_speciosa WILD
BuziosXEmpressF1 S_speciosa DOMESTICATED
ChilternSeeds S_speciosa WILD
EmpressRed02 S_speciosa DOMESTICATED
PedraLisa S_speciosa WILD
CardosoMoreira S_speciosa WILD
EmpressPurple02 S_speciosa DOMESTICATED
Carangola S_speciosa WILD
CardosoMoreiraPink S_speciosa WILD
CharlesLawn S_speciosa DOMESTICATED
Shelleri S_helleri WILD
MiguelPereira S_speciosa WILD
Buzios Carangola
A. Niemeyer
Darth Vader
Empress Red
Blue Knight
Goal:
Analyze the population
structure for cultivated
Sinningias

Genetic Variation in the Species
Wild accessions 9
Cultivars 25
Wild x Cultivated F1 1
Other species 1
____________________________________________
TOTAL 36
Sequencing
Library preparation
De-multiplexing
Read processing
Alignment
Variant detection
SNP filtering
APeKI digestion
Illumina, single end, 100 bp
GBSX v1.2
Fastq-mcf v1.04.807, Q30, L50
Bowtie2 v2.2.4
Freebayes v0.9.20
bcftools: only biallelic SNPs
vcffliter: Q>30, Depth >= 5
vcftools: no missing observations
41,626 SNPs

1. Clean the ﬁle of SNPs deﬁned with more than one character (e.g AC/AG).
perl -ne 'chomp($_); if ($_ =~ m/#/) { print "$_n"}
else { @a= split(/t/, $_); if (length($a[3]) == 1 &&
length($a[4]) == 1) {print "$_n"} }'
Sispe038_Set01_FILTERED_SNPs.vcf >
Sispe038_Set01_FILTERED_SNPs_CLEAN.vcf
2. Change the VCF format to FastStructure.
java -Xmx1024m -Xms512m -jar /data/software/
PGDSpider_2.1.1.2/PGDSpider2-cli.jar -inputfile
Sispe038_Set01_FILTERED_SNPs_CLEAN.vcf -inputfileformat
VCF -outputfile
Sispe038_Set01_FILTERED_SNPs_CLEAN.structure.str -
outputfileformat STRUCTURE -spid VCF2FastStructure.spid

3. Prepare a script (run_faststructure) with the fastStructure command line,
5 replicates, random seeds and K from 1 to 20.
4. Change the permissions of the script and run it
chmod 755 run_faststructure.sh
./run_faststructure.sh
#!/bin/bash
python /data/sowware/fastStructure/structure.py -K 1 --
input=Sispe038_Set01_FILTERED_SNPs_CLEAN.structure --
output=Sispe038_Set01_FS_K01_R01 --format=str —seed=123456789
…

5. Run ChooseK to get the most supported K.
python /data/software/fastStructure/chooseK.py --
input=Sispe038_Set01_FS_*
Model complexity that maximizes marginal likelihood = 2
Model components used to explain structure in data = 2

5. Run ChooseK to get the most supported K.
Model components used to explain structure in data = 2
In our review of 1,264 studies using structure to explore population subdivision, studies
that used ΔK were more likely to identify K = 2 (54%, 443/822) than studies that did not
use ΔK (21%, 82/386). A troubling finding was that very few studies performed the
hierarchical analysis recommended by the authors of both ΔK and structure to fully
explore population subdivision.

S-6-4
(PI#555684)
S-6-5
(Standard Line)
F1
(Picture from Dr. David Zaitlin)
GDA3: 6- Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
F2_107
F2_108
F2_109
F2_110
F2_111
F2_112
F2_113
F2_115
F2_117
F2_118
F2_119
F2_120
F2_121
F2_122
F2_123
F2_124
F2_125
F2_126
F2_127
F2_129
F2_130
F2_131
S_64_2
S_65_2
F2_001
F2_002
F2_003
F2_004
F2_005
F2_006
F2_007
F2_009
F2_010
F2_011
F2_012
F2_013
F2_014
F2_015
F2_016
F2_017
F2_018
F2_019
F2_021
F2_022
F2_023
F2_024
F2_028
F2_031
F2_032
F2_033
F2_034
F2_035
F2_036
F2_077
F2_078
F2_079
F2_080
F2_081
F2_082
F2_083
F2_084
F2_085
F2_086
F2_087
F2_088
F2_089
F2_091
F2_092
F2_093
F2_094
F2_095
F2_096
F2_097
F2_098
F2_099
F2_100
F2_101
F2_102
F2_103
F2_104
F2_105
F2_106
Goal:
Develop a genetic map
with 19 linkage groups

http://www.rqtl.org/
http://www.rqtl.org/tutorials/geneticmaps.pdf

https://github.com/aubombarely/GenoToolBox

1. Change the VCF format to CSV used as input by R/QTL using Vcf2Maker
from GenoToolBox (https://github.com/aubombarely/GenoToolBox).
/old_home/aurebg/Software/GenoToolBox/SNPTools/
Vcf2Mapmaker -i NibenGBS_M30.vcf -o NibenGBS_M30.csv -f
csv -a S_64_2 -b S_65_2 -B -d 1000
2. Load the genotypes in R/QTL.
setwd('./')
library('qtl')
NbenX = read.cross(file="NibenGBS_M30.csv", format=“csv”)
3. Follow the R/QTL tutorial.

Genomic Data Analysis: From Reads to Variants

Genomic Data Analysis: From Reads to Variants

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à Genomic Data Analysis: From Reads to Variants

Similaire à Genomic Data Analysis: From Reads to Variants (20)

Plus de Aureliano Bombarely

Plus de Aureliano Bombarely (10)

Dernier

Dernier (20)

Genomic Data Analysis: From Reads to Variants