Genomic Data Analysis
From READS to VARIANTS
24-10-17 to 26-10-17,
Porto Alegre, Brazil.
Aureliano Bombarely
Virginia Tech
Department of Horticulture
Latham 216
220 Ag Quad Lane
Blacksburg, VA
USA
aurebg@vt.edu
Genomic Data Analysis
DAY 1:
• Presentation of the Course.
• Introduction to the Linux Operating System and the Command Line Interface.
• 25 essential commands to work with Linux.
• Common bioinformatics formats, from FASTAs to GFFs and VCFs.
• 10 essential commands to play with the biological data.
DAY 2:
• Introduction to Next Generation Sequencing Technologies (NGS).
• Experimental design for population studies, from breeding to ecological studies.
• De-multiplexing and the complexities of sample identification.
• Read processing and quality evaluation.
• Read mapping to a reference.
• Variant calling and summary of the read processing.
• Quality evaluation and possible pitfalls.
DAY 3:
• Variant filtering.
• Simple stats for the variant analysis.
• Variant visualization tools: IGV and TASSEL.
• Changing formats for VCF files.
• Example 1: Population analysis with Structure for Sinningia speciosa.
• Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
Day 1
SOFTWARE REQUIRED
• If Windows: PuTTY, http://www.putty.org/
• If macOS/Linux: Terminal (already in your system)
• IGV, http://software.broadinstitute.org/software/igv/
• TASSEL, http://www.maizegenetics.net/tassel
• FileZilla, https://filezilla-project.org/
Genomic Data Analysis
1. Presentation of the Course.
2. Introduction to the Linux Operating System and the
Command Line Interface.
3. 25 essential commands to work with Linux.
4. Common bioinformatics formats, from FASTAs to GFFs
and VCFs.
5. 10 essential commands to play with the biological data.
GDA1: 1- Presentation of the Course.
[Diagram: where does Genomic Data Analysis fit?]
• Biological Problem
• Scientific Question
• Hypothesis
• Genetics & related disciplines
• Molecular biology
• Massive DNA Sequencing
• Results
• Experimental Design
• Approach
• ? = Genomic Data Analysis
Genomic Data Analysis is:
• Knowledge about sequencing technologies.
• Knowledge about methodologies (e.g. library preparation).
• Bioinformatic skills (Linux command line and R).
• Basic knowledge about statistical analysis.
Genomic Data Analysis IS NOT (this is BIOINFORMATICS):
• Programming (useful but not necessary).
• Basic knowledge of computer system administration.
• Modeling.
• Algorithm development.
• Database development.
GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
Linux:
Unix-like computer operating system assembled
under the model of free and open source software
development and distribution
Operating System (OS):
Set of programs that
manage computer hardware
resources and provide common
services for application
software.
Wikipedia
Unix-like?
(Figure credit: Feduccia A, Trends Ecol. Evol. 2001)
Unix:
A multitasking, multi-user computer operating system originally developed in 1969.
(Figure source: https://www.howtogeek.com/182649/htg-explains-what-is-unix/)
Linux Distribution:
Distributions (often called distros for short) are
Operating Systems including a large collection of
software applications such as word processors,
spreadsheets, media players, and database applications.
The operating system will consist of the Linux
kernel and, usually, a set of libraries and utilities from
the GNU Project, with graphics support from the X
Window System.
What is a console ?
Computer terminal or system consoles are the text
entry and display device for system administration
messages, particularly those from the BIOS or boot loader,
the kernel, from the init system and from the system logger. It
is a physical device consisting of a keyboard and a
screen.
Wikipedia
So then...
What is the typical black or white screen where programmers and system administrators type commands?
Command-line interface (CLI):
Mechanism for interacting with a computer operating
system or software by typing commands to perform specific
tasks.
The command-line interpreter may be run in a
text terminal or in a terminal emulator window as a remote
shell client such as PuTTY.
Wikipedia
Shell:
Piece of software that provides an interface for users of an
operating system. There are two categories:
- Command-line interface (CLI)
- Graphical user interface (GUI)
Shell: Bash:
[Diagram: a Command goes through the Shell (Bash) to the Operating System. The shell handles three streams: STDIN, STDOUT and STDERR.]
A command is a directive to a computer program acting as an interpreter of some kind, in order to perform a specific task.
Parts of a command:
• Command: the command calls a program.
• Arguments: they modify the behavior of the program. In the slide example, -l means “long listing” and -h/--human-readable means “human readable”.
...And push the RETURN or ENTER key to execute the command.
Special characters in bash:
CHARACTER   NAME            MEANING
(space)     SPACE           Separates commands and arguments
#           POUND           Comment
;           SEMICOLON       Command separator to run multiple commands
.           DOT             Source command OR filename component OR current directory
..          DOUBLE DOTS     Parent directory
' '         SINGLE QUOTES   Use the expression between quotes literally
,           COMMA           Separates alternatives in brace expansion, e.g. {a,b}
\           BACKSLASH       Escape for a single character
/           SLASH           Filename path separator
*           ASTERISK        Wild card for filename expansion in globbing
>, <, >>    CHARACTERS      Redirection of inputs/outputs
|           PIPE            Pipe outputs between commands
Characters with a special meaning for bash
Bash understands spaces as separators between arguments, so this is read as two arguments:
ls Solanum lycopersicum
Use single quotes or an escape for special characters:
ls 'Solanum lycopersicum'
ls Solanum\ lycopersicum
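The effect of the space character can be checked in a throwaway directory. This is a minimal sketch: the directory and the file name genome.fa are made up for the illustration, and mktemp is assumed to be available.

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# A directory whose name contains a space (illustrative name).
mkdir 'Solanum lycopersicum'
touch 'Solanum lycopersicum/genome.fa'

# Unquoted: bash passes TWO arguments, "Solanum" and "lycopersicum"; neither exists.
if ls Solanum lycopersicum >/dev/null 2>&1; then unquoted=found; else unquoted=error; fi

# Single quotes or a backslash keep the name as ONE argument.
quoted=$(ls 'Solanum lycopersicum')
escaped=$(ls Solanum\ lycopersicum)

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
echo "escaped:  $escaped"
```

Both the quoted and the escaped form list the same file, while the unquoted form fails.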
Practice 1.1: Connect to the server
Windows users:
1. Open the program PuTTY and start a session.
2. Add the following information for your session and click Open:
   • Host: begonia.hort.vt.edu
   • Port: 1809
   • Connection type: SSH
3. Enter your username and password.
4. Click connect.
macOS/Linux users:
1. Open the program Terminal.
2. Type in the terminal:
ssh -p 1809 username@begonia.hort.vt.edu
3. Press Enter/Return.
4. Enter the password and press Enter/Return. Note that what you type will be hidden.
Everyone:
5. Type in the terminal:
pwd
6. Press Enter/Return.
7. Describe the message that has appeared on the screen.
GDA1: 3- 25 essential commands to work with Linux.
1. pwd
• The command prints the path to the working directory.
• When you log in to the system, the working directory is your home ($HOME).
pwd
/data/GDA_UFRGS2017/User00_Home
• / means root (beginning of the file system).
• data is the name of the 1st directory in root.
• GDA_UFRGS2017 is the 2nd directory, inside data.
• User00_Home is the 3rd directory, inside GDA_UFRGS2017.
2. mkdir
• The command makes a new directory.
• If the directory already exists, it gives an error.
• Argument -p also makes the parent directories.
• rmdir removes an empty directory.
mkdir linux_exercises ✓correct
mkdir linux_exercises/test01 ✴error (if the parent linux_exercises does not exist yet)
mkdir -p linux_exercises/test01 ✓correct
rmdir linux_exercises/test01 ✓correct
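The difference between mkdir and mkdir -p can be tried safely in a temporary directory. This is a sketch under the assumption that mktemp is available; the directory names are the ones used above.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Without -p the missing parent "linux_exercises" makes mkdir fail.
if mkdir linux_exercises/test01 2>/dev/null; then plain=ok; else plain=error; fi

# With -p the parent and the child are created in one go.
mkdir -p linux_exercises/test01 && with_p=ok

# Running it again is harmless: -p does not complain if the directory exists.
mkdir -p linux_exercises/test01 && again=ok

# rmdir only removes EMPTY directories, so the child has to go first.
rmdir linux_exercises/test01
rmdir linux_exercises
if [ -d linux_exercises ]; then left=yes; else left=no; fi

echo "plain=$plain with_p=$with_p again=$again left=$left"
```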
3. cd
• The command changes the working directory.
• Two consecutive dots (e.g. “cd ..”) change one directory up/back in the file system.
cd linux_exercises ✓correct
cd linux_exercises/test01 ✴error (once you are already inside linux_exercises)
cd test01 ✓correct
4. ls
• It lists the items in the working directory (default) or in any directory.
• -l argument produces the long listing of the items.
• -h argument produces a human readable form.
• -a argument prints everything (including hidden files, whose names start with “.”).
• -t argument sorts by time.
ls
ls -lht linux_exercises/test01
Practice 1.2: Navigating the file system
1. Type pwd in the terminal and run it.
2. Make the directory ‘linux_exercises’ by typing and running the mkdir command.
3. Run pwd in the current directory and write down the result.
4. Change the working directory to ‘linux_exercises’.
5. Make a new directory named ‘01_file_system_tree’.
6. Change the working directory to ‘01_file_system_tree’.
7. Run pwd in the current directory and write down the result.
8. Make a new directory named ‘subdir01’.
9. Change the working directory to ‘subdir01’.
10. Run pwd in the current directory and write down the result.
11. Change the working directory one level up.
12. Make a new directory named ‘subdir02’.
13. Change the working directory to ‘subdir02’.
14. Run pwd in the current directory and write down the result.
15. Draw the file system tree for the directories ‘subdir01’ and ‘subdir02’.
File system tree:
/
  data/
    GDA_UFRGS2017/
      User00_Home/
        linux_exercises/
          01_file_system_tree/
            subdir01/
            subdir02/
Moving around the tree:
cd 01_file_system_tree/subdir02
cd ../../
cd ../subdir01/
Relative filepath:
cd ../subdir01/
Absolute filepath:
cd /data/GDA_UFRGS2017/User00_Home/linux_exercises/01_file_system_tree/subdir01/
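The two kinds of path can be compared on a copy of the practice tree built in a temporary location. This is a sketch: the root of the tree is a mktemp directory instead of /data/GDA_UFRGS2017/User00_Home, which is an assumption made so that it runs anywhere.

```shell
# Rebuild the practice tree under a temporary root.
root=$(mktemp -d)
mkdir -p "$root/linux_exercises/01_file_system_tree/subdir01"
mkdir -p "$root/linux_exercises/01_file_system_tree/subdir02"

# Relative path: only valid from the right starting point (here, subdir02).
cd "$root/linux_exercises/01_file_system_tree/subdir02"
cd ../subdir01
via_relative=$(pwd)

# Absolute path: starts at / and is valid from anywhere.
cd /
cd "$root/linux_exercises/01_file_system_tree/subdir01"
via_absolute=$(pwd)

echo "relative: $via_relative"
echo "absolute: $via_absolute"
```

Both routes end in the same directory; only the absolute one would also work from any other starting directory.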
Absolute filepath (like a full postal address):
Latham Hall 311
220 Ag Quad Lane
Blacksburg, VA 24061
USA
Relative filepath (like directions from where you already are):
Room 311
Commands for directories:
COMMAND   USE                  EXAMPLE
cd        Change working dir   cd ../
pwd       Print working dir    pwd
ls        List information     ls -lh /home
mkdir     Create a new dir     mkdir test
rmdir     Remove empty dir     rmdir test
5. history
• It lists the last commands run (the last 500 with the default bash settings).
• No arguments needed.
Typing shortcuts for bash:
SHORTCUT   MEANING
Tab        Autocomplete file or folder names
↑          Scroll up through the command history
↓          Scroll down through the command history
Ctrl + A   Go to the beginning of the line that you are typing
Ctrl + E   Go to the end of the line that you are typing
Ctrl + U   Clear the line from the cursor position back to the beginning
Ctrl + R   Search previously used commands
Ctrl + C   Kill the process that you are running
Ctrl + D   Exit the current shell
Ctrl + Z   Suspend the running process. Use the command fg to recover it (or bg to let it continue in the background).
6. less
• Opens a text file on the screen.
• To navigate use the arrows (↑ ↓).
• “Shift + G” goes to the end of the file.
• “/” + some word searches for the word.
• “q” to quit/exit.
• Open the file with “-N” to show line numbers.
• More information at: http://www.tutorialspoint.com/unix_commands/less.htm
less ../DATA/Sinningia_speciosa/reference/Sispe038_cds.fasta
less -N ../DATA/Sinningia_speciosa/reference/Sispe038_cds.fasta
7. touch
• Creates an empty file.
touch this_is_a_test_file.txt
8. rm
• Permanently removes/deletes a file from the system.
• The file CANNOT BE RECOVERED.
• “rm -Rf <directory>” will remove the directory and all its content. BE CAREFUL.
rm this_is_a_test_file.txt
rm -Rf 01_file_system_tree/subdir01
9. cp
• Copies a file from one location to another.
• “./” means copy here, in the working directory.
cp ../DATA/Sinningia_speciosa/reference/Sispe038_cds.fasta ./
10. mv
• Two functions:
  • If the destination EXISTS and is a DIR, it moves the file there.
  • If the destination DOES NOT EXIST, it changes the name of the file.
• CAREFUL: If the destination EXISTS and is a file, mv WILL OVERWRITE IT.
mv Sispe038_cds.fasta Sispe038_cds.fa
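The three behaviours of mv are easy to confuse, so here is a small sketch that runs them on dummy files in a temporary directory. All file names are invented for the illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir refs
echo ">seq1" > Sispe_demo.fasta
echo "old content" > existing.txt

cp Sispe_demo.fasta copy.fasta      # copy: the original stays in place
mv copy.fasta renamed.fa            # destination does not exist -> rename
mv renamed.fa refs/                 # destination is a directory -> move into it
mv Sispe_demo.fasta existing.txt    # destination is a file -> overwritten without warning

overwritten=$(cat existing.txt)
moved=$(ls refs)
echo "existing.txt now contains: $overwritten"
echo "refs/ contains: $moved"
```

Adding -i to mv (or cp) makes it ask before overwriting, which is a reasonable habit while learning.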
Practice 1.3: Copying and moving files
1. Change working directory to ‘linux_exercises’.
2. Make a directory with the name “Sispe_ref”.
3. Change working directory to “Sispe_ref”.
4. Copy all the fasta files from /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/reference/ to your current working directory typing:
cp /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/reference/*.fasta ./
6. Remove the file “Sispe038.scaffolds.fasta”.
7. Change the name of “Sispe038.scaffolds500kb.fasta” to “Sispe038ReducedRef.fa”.
8. Create a mapping reference using bowtie2-build running:
bowtie2-build Sispe038ReducedRef.fa Sispe038ReducedRef
11. cat
• It prints the content of the file as STDOUT on the screen.
• Usually used to concatenate (merge) files one after another using “cat file1.txt file2.txt > merged.txt”.
cat Sispe038ReducedRef.fa
12. head/tail
• They print the first/last 10 lines of the file as STDOUT.
• The number of lines (x) can be changed using “-n x”.
head Sispe038ReducedRef.fa
head -n 100 Sispe038ReducedRef.fa
tail -n 1 Sispe038ReducedRef.fa
13. grep
• Command to find and print the LINES that match the pattern used.
• “-c” option prints the NUMBER of LINES that match.
• “-v” option prints the LINES that DO NOT match.
grep ">" Sispe038ReducedRef.fa
grep -c ">" Sispe038ReducedRef.fa
grep -v ">" Sispe038ReducedRef.fa
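A tiny FASTA file makes the three grep modes easy to verify by eye. This is a sketch with an invented three-sequence file. Note that the ">" must be quoted: unquoted, bash would read it as an output redirection and overwrite the file.

```shell
fasta=$(mktemp)
# Invented example: 3 sequences, 4 sequence lines.
printf '>seq1 kinase\nATGC\nGGCC\n>seq2 kinase\nTTAA\n>seq3 transporter\nCCGG\n' > "$fasta"

headers=$(grep ">" "$fasta")          # the LINES that match
n_seqs=$(grep -c ">" "$fasta")        # the NUMBER of lines that match
n_seqlines=$(grep -vc ">" "$fasta")   # the NUMBER of lines that do NOT match

echo "$headers"
echo "sequences: $n_seqs, sequence lines: $n_seqlines"
```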
14. gzip/gunzip
• gzip compresses a file (the original is replaced by file.gz).
• gunzip uncompresses a file.gz.
• To keep the original file, use the “-c” option (it writes to STDOUT) and redirect the output.
gzip Sispe038ReducedRef.fa
gunzip Sispe038ReducedRef.fa.gz
gzip -c Sispe038ReducedRef.fa > SispeRef.fa.gz
15. tar
• Command to archive/unarchive files contained in a directory.
• It can be combined with tools such as gzip and bzip2.
• Commonly used commands:
  • tar -zxvf package_of_files.tar.gz to unarchive and uncompress .gz
  • tar -jxvf package_of_files.tar.bz2 to unarchive and uncompress .bz2
  • tar -zcvf dir1.tar.gz /path_to_dir1 to archive and compress with gzip
  • tar -jcvf dir1.tar.bz2 /path_to_dir1 to archive and compress with bzip2
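A round trip (archive, list, extract) shows what each letter does: c creates, x extracts, t lists, z goes through gzip, f names the archive file. The sketch below uses invented file names in a temporary directory and leaves out -v to keep the output short.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir dir1
echo "ACGT" > dir1/a.txt
echo "TTGA" > dir1/b.txt

tar -zcf dir1.tar.gz dir1            # archive dir1 and compress with gzip
tar -ztf dir1.tar.gz                 # -t lists the content without extracting
mkdir restore
tar -zxf dir1.tar.gz -C restore      # -C extracts into another directory
restored=$(cat restore/dir1/a.txt)

gzip -c dir1/a.txt > a_copy.txt.gz   # -c: the original a.txt is kept
gzip dir1/b.txt                      # plain gzip: b.txt is replaced by b.txt.gz
ls dir1
```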
Practice 1.4: Concatenating files and taking a look at them
1. Change working directory to ‘linux_exercises’.
2. Make a directory with the name “CDS_refs”.
3. Change working directory to “CDS_refs”.
4. Copy the following files into your current working directory:
   1. /data/GDA_UFRGS2017/DATA/Arabidopsis_thaliana/reference/Athaliana_Phytozome167_TAIR10.pep.fa.gz
   2. /data/GDA_UFRGS2017/DATA/Oryza_sativa/reference/Osativa_Phytozome323_v7.0.pep.fa.gz
5. Uncompress both files.
6. Count how many lines have the symbol “>” in each file.
7. Concatenate both files in a file named “Atha_Osat_PEP.fasta”.
8. Count how many lines have the symbol “>” in this file.
9. Create a BLAST+ reference running the following command:
makeblastdb -in Atha_Osat_PEP.fasta -dbtype prot -parse_seqids
16. cut
• It divides each line by TAB and prints the selected column as STDOUT.
• “-f x” where x is the number of the column.
• “-d y” where y is a character can be used to change the delimiter.
17. sort
• It sorts the lines of a file alphabetically, starting from the first characters of each line.
• “-n” can be used to sort numerically.
• “-r” can be used to do a reversed sorting.
• “-u” can be used to keep only one copy of each repeated line.
• Usually used with cut, e.g. “cut -f1 my_file.txt | sort”.
cut -f 3 Sispe038_genome.genemodels.gff3
cut -f 3 Sispe038_genome.genemodels.gff3 | sort -u
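The difference between alphabetical and numeric sorting is a frequent source of surprises. The sketch below uses an invented tab-separated table (name, feature type, length) to show cut and the main sort options.

```shell
table=$(mktemp)
# Invented table: sequence name, feature type, length (TAB separated).
printf 'chr1\tgene\t900\nchr1\texon\t120\nchr2\tgene\t1500\nchr2\texon\t80\n' > "$table"

types=$(cut -f 2 "$table" | sort -u)                   # unique values of column 2
alpha_last=$(cut -f 3 "$table" | sort | tail -n 1)     # alphabetical: "900" sorts last
longest=$(cut -f 3 "$table" | sort -nr | head -n 1)    # numeric + reversed: largest first
csv_second=$(echo "a,b,c" | cut -d ',' -f 2)           # -d changes the delimiter

echo "types: $types"
echo "alphabetical last: $alpha_last, numeric max: $longest"
echo "second CSV field: $csv_second"
```

Without -n, "900" sorts after "1500" because the comparison is made character by character.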
18. uniq
• It reports or omits repeated lines. Only ADJACENT repeated lines are collapsed, so sort the input first.
• “-c” counts how many times each line occurs.
• Usually used in conjunction with cut and sort, e.g. “cut -f1 my_file.txt | sort | uniq”.
cut -f 3 Sispe038_genome.genemodels.gff3 | sort | uniq -c
19. wc
• It counts newlines, words or bytes in a file.
• “-l” counts the number of lines.
• “-w” counts the number of words.
• “-m” counts the number of characters.
wc -l Sispe038_genome.genemodels.gff3
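Why sort comes before uniq can be seen on five invented lines. In this sketch the $(( ... )) wrapper and the sed call only strip the padding spaces that wc and uniq -c may print; they are not part of the technique.

```shell
types=$(mktemp)
printf 'gene\nexon\ngene\nexon\nexon\n' > "$types"

# uniq alone only merges ADJACENT duplicates: gene, exon, gene, exon -> 4 lines.
unsorted_groups=$(( $(uniq "$types" | wc -l) ))

# After sorting, identical lines are adjacent: exon, gene -> 2 lines.
sorted_groups=$(( $(sort "$types" | uniq | wc -l) ))

# uniq -c prefixes each line with its count.
counts=$(sort "$types" | uniq -c | sed 's/^ *//')

echo "without sort: $unsorted_groups groups, with sort: $sorted_groups groups"
echo "$counts"
```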
20. sed
• Stream editor to transform text.
• The simplest option is to use the “s/<find>/<replace>/” expression, which replaces the first match in each line.
• Add a “g” to replace as many times as it can: “s/<find>/<replace>/g”.
• More info at: https://www.gnu.org/software/sed/manual/sed.html
sed "s/A/a/g" Sispe038ReducedRef.fa
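The effect of the trailing g is clearest on a short string. A minimal sketch with an invented sequence; sed prints the result as STDOUT and does not change its input.

```shell
seq="ATGAAATAG"

first_only=$(echo "$seq" | sed 's/A/a/')    # only the first A of the line
all=$(echo "$seq" | sed 's/A/a/g')          # every A of the line

echo "original:   $seq"
echo "without g:  $first_only"
echo "with g:     $all"
```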
Practice 1.5: Selecting columns and counting them
1. Change working directory to ‘linux_exercises’.
2. Make a directory with the name “GFF_refs”.
3. Change working directory to “GFF_refs”.
4. Copy the following files into your current working directory:
   1. /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/reference/Sispe038_genome.genemodels.gff3
   2. /data/GDA_UFRGS2017/DATA/Nicotiana_benthamiana/reference/Niben251.1_genome.genemodels.sorted.gff3
5. Count the number of lines in both files.
6. Count the number of lines ignoring lines with the “#” symbol using grep.
7. Select the third column in both files and print the first 20 lines.
8. Select the third column in both files, sort it and count unique items using “uniq -c”.
Commands for files:
COMMAND   USE                                           EXAMPLE
less      Open a file with less. Q to exit.             less myfile
          Arrows to scroll
touch     Create an empty file                          touch myfile
mv        Move file between dirs. Change name           mv myfile yourfile
rm        Remove file                                   rm yourfile
cat       Print file content as STDOUT                  cat myfile
head      Print first 10 lines as STDOUT                head myfile
tail      Print last 10 lines as STDOUT                 tail myfile
grep      Print matching lines as STDOUT                grep 'ATG' myfile
cut       Cut columns and print as STDOUT               cut -f1 myfile
sort      Sort lines and print as STDOUT                sort myfile
uniq      Collapse repeated lines (-c to count them)    uniq -c myfile
sed       Replace occurrences, print lines as STDOUT    sed 's/ATG/CTG/' myfile
wc        Word count                                    wc myfile
Compression and archiving commands:
COMMAND     USE                                                EXAMPLE
gzip        Compress a file using gzip                         gzip -c test.txt > test.txt.gz
gunzip      Uncompress a file using gzip                       gunzip test.txt.gz
bzip2       Compress a file using bzip2                        bzip2 -c test.txt > test.txt.bz2
bunzip2     Uncompress a file using bzip2                      bunzip2 test.txt.bz2
tar         Archive files using tar                            tar -cf sample.tar sample/*.txt
tar -zcvf   Archive using tar and compress using gzip          tar -zcvf samples.tar.gz sample/*.txt
tar -zxvf   Unarchive using tar and uncompress using gunzip    tar -zxvf samples.tar.gz
tar -jcvf   Archive using tar and compress using bzip2         tar -jcvf samples.tar.bz2 sample/*.txt
tar -jxvf   Unarchive using tar and uncompress using bunzip2   tar -jxvf samples.tar.bz2
21. top/htop
• Display Linux processes.
• Type “q” to quit.
• “kill PID” can be used to terminate a process.
• Global Resource Usage: %CPU / MEMORY / SWAP MEMORY
• Single Process Resource Usage: PID / USER / %CPU / %MEMORY / COMMAND
22. df/du
• Commands to check how much disk space is being used in the system (df -lh) or how much space a directory is using (du -lh <dir>).
df -lh
du -lh linux_exercises
23. wget/curl
• Commands to download files from the internet.
• wget can be used recursively (e.g. using * or “-r” for dirs).
• curl writes the download to STDOUT, so it can be piped into other commands (using “|”).
wget ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG3.2_release/*.fasta
curl ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG3.2_release/ITAG3.2_proteins.fasta | grep -c ">"
24. ssh/scp
• Commands to:
  • ssh = access a remote server
ssh -p 1809 username@begonia.hort.vt.edu
  • scp = copy from/to a remote server (the port option is a capital -P; lowercase -p preserves file times instead)
    • From LOCAL to REMOTE
scp -P 1809 file1 username@begonia.hort.vt.edu:/dirpath
    • From REMOTE to LOCAL
scp -P 1809 username@begonia.hort.vt.edu:/dirpath/file1 ./
File Permissions and Ownerships:
All Unix systems are designed as multiuser operating systems. This means that different users could access, modify or delete the same files.
To avoid problems, they have a file permission and ownership system. It restricts who can access and modify each of the files in the computer.
This system has two parts:
• Who is the owner of the file?
• What type of access does each of the users in the system have?
Ownership:
Each file is assigned a user owner and a group owner.
The user owner can be:
• Real user (for example: aurebg).
• Virtual user created by a program (for example: mysql).
• Administrator user or root.
Permissions:
Each file is assigned 9 different permissions: 3 for the file user-owner (u), 3 for the group-owner (g) and 3 for everyone else (o). There are 3 types of permissions or file attributes:
• Readable (r), permission to read the file.
• Writable (w), permission to write the file.
• Executable (x), permission to execute the file as a program.
10-letter code for a Linux file:
----------   all switches OFF
drwxrwxrwx   all switches ON (first letter d = directory, then rwx for user, rwx for group, rwx for other)
-r--r--r--   Readable for everyone
-rwxr--r--   Readable for everyone, writable or executable only for the user-owner
25. chmod/chown
• Commands to manage ownership and privileges:
  • To see the permissions type: “ls -lh”
  • Change owner: chown user:group filename
chown aurebg:aurebg file.txt
  • Change permissions: chmod permissions_code filename
chmod 664 file.txt
# It changes to readable+writable by user and group and readable by anyone
chmod u+r file.txt
# It adds the readable permission for the user
Syntax:
chmod [ugo][+-=][rwx] file
chmod [0-7][0-7][0-7] file
|rwx|rwx|rwx|
|421|421|421|
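Each digit of the numeric code is the sum of the switches that are ON (r=4, w=2, x=1). The sketch below builds the code 664 from those sums on a dummy file and then applies a symbolic change. The cut -c 1-10 call only keeps the 10-letter code from the ls -l output.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch file.txt

# One digit per class: add r=4, w=2, x=1 for each switch that is ON.
user=$((4 + 2))     # rw- -> 6
group=$((4 + 2))    # rw- -> 6
other=4             # r-- -> 4
chmod "$user$group$other" file.txt
mode_numeric=$(ls -l file.txt | cut -c 1-10)

# Symbolic form: add x for the user, remove r for others.
chmod u+x,o-r file.txt
mode_symbolic=$(ls -l file.txt | cut -c 1-10)

echo "after chmod $user$group$other: $mode_numeric"
echo "after chmod u+x,o-r: $mode_symbolic"
```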
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
I. FASTA
FASTA format is a text-based file format that stores three different types of sequences: DNA, RNA or protein. It is used to represent the sequence information for genomes, mRNAs, cDNAs, miRNAs…
>SeqID1 optional_description1
AGCGTGGAGAGCGATGAGATCAGAAAGTAGGACGACAGATGGGGAGAT
GGCAGGTGTGGGAGGAGTTGACGATGACGTGATTGATGACGGGAGACG
>SeqID2 optional_description2
AGCGTGGAGAGCGATGAGATCAGAAAGTAGGCTGACAGATGGGGAGAT
GGCAGGTGAGGGAGGAGCTGACGATGACGTGTTTGATGACGGGAGACG
>SeqID3 optional_description3
AGCGTGGAGAGCGATGAGATCAGAAAGTAGGACGACAGTGGGGGAGAT
GGCAGGTGAGGGAGGAGTTGACGATGACGTGTTTGATGACGGGAGACG
• One line ID: the ID line always starts with “>”.
• A space separates the ID and the description.
• The sequence can be one or more lines.
II. FASTQ
FASTQ format is a text-based file format that usually stores DNA sequences. It contains information about the sequencing QUALITY of each nucleotide.
@GWNJ-0957:89:GW170928504:7:1101:2757:1309 1:N:0:NCGTCCC
TATCTAAGTATTTGATTAATGATAGATGACGATGGAGAAATATAATCTACTTTTTT
AAGTCCCTCATTTTCTTTCTCCATCTTTCTTTTTTATTACTCCCATTGTTCCCCAT
+
AAAAAFFJFJJFJJAAAAAFJJJ<FJJJJJJJJJJ7<7<<<<JJJJJJFFJJJAFJ
F-7<<-7AFJJFJJJJJJJJAJJFJFJ<7<-7A-7FAFJA777777<7-7--7--7
@GWNJ-0957:89:GW170928504:7:1101:3549:1309 1:N:0:NCGTCCC
ACCATTCATTATTTTTTTATTTAGTCTTTATTACTTTACTTTCCTTCCTTCTGAAA
TACTGCTATTGTACATAAAACAAAATGATCTACTTAAAAATAAAACAAATTTAAAA
+
AAA-AAJJFJJAAJAA-7AFJJ-7-<<-<AJJ--<J-<-<---77F7-A---A7--
<777<7<7<<F-77F<J<JJ7F7AFF77<77<7777<77<---7---77---7---
• One line ID: the ID line always starts with “@”.
• The sequence can be one or more lines.
• A line starting with “+” separates the sequence from the quality string.
• One quality character per nucleotide. Each character codes a number from 0-41 (Illumina v1.8+).
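How a quality character becomes a number can be sketched in the shell. Illumina v1.8+ uses the Phred+33 encoding: the quality Q is the ASCII code of the character minus 33, and the estimated error probability is 10^(-Q/10). The helper name phred below is made up for this illustration.

```shell
# printf '%d' with a leading quote prints the ASCII code of the character.
# "phred" is a hypothetical helper name, not a standard command.
phred() { echo $(( $(printf '%d' "'$1") - 33 )); }

q_J=$(phred J)   # ASCII 74 - 33
q_F=$(phred F)   # ASCII 70 - 33
q_A=$(phred A)   # ASCII 65 - 33
q_7=$(phred 7)   # ASCII 55 - 33

echo "J=$q_J F=$q_F A=$q_A 7=$q_7"
```

In the reads above, “J” is therefore the best possible score (41, about 1 error in 12,600 bases) and “7” corresponds to 22 (about 1 error in 160 bases).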
Practice 1.6: Getting stats for a FASTQ file
1. Change working directory to ‘linux_exercises’.
2. Make a directory with the name “FASTQ1”.
3. Change working directory to “FASTQ1”.
4. Copy the following files into your current working directory:
   1. /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/collection/P1_001B.fastq.gz
   2. /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/collection/P1_007.fastq.gz
5. Uncompress them.
6. Run the following commands to get the stats.
fastq-stats P1_001B.fastq
fastq-stats P1_007.fastq
7. Redirect the output using “>” into a file using the following commands.
fastq-stats P1_001B.fastq > P1_001B.stats.txt
fastq-stats P1_007.fastq > P1_007.stats.txt
III. SAM/BAM
SAM (and its binary form BAM) format is designed to store the mapping information of reads to a reference. It has 11 mandatory columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL).
The 2nd column, FLAG, defines the status of the read mapping.
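The FLAG is a sum of powers of two, one per property, so each property can be tested with a bitwise AND. The sketch below decodes the value 99, chosen here only as an example. The bit meanings are the ones defined in the SAM specification (1 paired, 2 properly paired, 4 unmapped, 16 reverse strand, 32 mate on reverse strand, 64 first in pair).

```shell
flag=99   # example value: 64 + 32 + 2 + 1

# (flag & bit) != 0 evaluates to 1 when the bit is set and 0 otherwise.
paired=$(( (flag & 1) != 0 ))
proper_pair=$(( (flag & 2) != 0 ))
unmapped=$(( (flag & 4) != 0 ))
reverse=$(( (flag & 16) != 0 ))
mate_reverse=$(( (flag & 32) != 0 ))
first_in_pair=$(( (flag & 64) != 0 ))

echo "paired=$paired proper_pair=$proper_pair unmapped=$unmapped"
echo "reverse=$reverse mate_reverse=$mate_reverse first_in_pair=$first_in_pair"
```

With samtools installed, “samtools flags 99” gives the same decoding.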
IV. GFF3
GFF3 is a text-based file with 9 columns. It is designed to store genomic
features (e.g. genes, exons, repetitive elements…) information. More
information at http://gmod.org/wiki/GFF3.
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
##gff-version 3
ctg13 . mRNA 1300 9000 . + . ID=mrna0001;Name=GDR1
ctg13 . exon 1300 1500 . + . ID=exon00001;Parent=mrna0001
ctg13 . exon 1600 1800 . + . ID=exon00002;Parent=mrna0001
ctg13 . exon 3000 3900 . + . ID=exon00003;Parent=mrna0001
ctg13 . exon 5000 5500 . + . ID=exon00004;Parent=mrna0001
ctg13 . exon 7000 9000 . + . ID=exon00005;Parent=mrna0001
Columns, in order: seqid, source, type, start, end, score, strand, phase, attributes.
(Diagram: mrna0001 is the parent of exon00001, exon00002, exon00003, exon00004 and exon00005.)
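As a small worked example of the column layout, the sketch below rebuilds the records above as a tab-delimited temporary file and adds up the exon lengths (end - start + 1) with awk. The temporary file and variable names are illustrative.

```shell
# Rebuild the example GFF3 (tab-delimited) and sum the exon lengths.
gff=$(mktemp)
printf '##gff-version 3\n' > "$gff"
printf 'ctg13\t.\tmRNA\t1300\t9000\t.\t+\t.\tID=mrna0001;Name=GDR1\n' >> "$gff"
# printf reuses the format for every group of three arguments (start, end, exon number)
printf 'ctg13\t.\texon\t%s\t%s\t.\t+\t.\tID=exon0000%s;Parent=mrna0001\n' \
  1300 1500 1  1600 1800 2  3000 3900 3  5000 5500 4  7000 9000 5 >> "$gff"

# Column 3 is the type, columns 4 and 5 are start and end (1-based, inclusive).
exon_bp=$(awk -F '\t' '$3 == "exon" { sum += $5 - $4 + 1 } END { print sum }' "$gff")
echo "Total exon length: $exon_bp bp"
rm -f "$gff"
```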
Genotype notation for a DIPLOID sample: 0 = REF allele; 1 = ALT allele; “/” = unphased; “|” = phased.
V. VCF
VCF is a text-based file with 8 fixed columns plus a FORMAT column and one extra column per sample in multisample files. It contains metadata at the beginning of the file, in lines starting with “##”, explaining the different fields.
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
20 1370 rs01 G A 29 PASS DP=14;AF=0.5 GT:GQ:DP 0/1:51:14
20 1730 . T A 3 q10 DP=11;AF=0.2 GT:GQ:DP 0/1:58:11
20 2121 rs02 A G,T 67 PASS DP=10;AF=0.5 GT:GQ:DP 1/2:23:10
20 6781 . T . 47 PASS DP=13;AF=1 GT:GQ:DP 1/1:56:13
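The four records above are E.g. 1 to E.g. 4, in order.
The FORMAT column (GT:GQ:DP) gives the order of the values in the sample column. For 0/1:51:14: GENOTYPE (GT) = 0/1, GENOTYPE QUAL (GQ) = 51, DEPTH (DP) = 14.
• E.g. 1 is a biallelic heterozygous SNP.
• E.g. 2 is a biallelic heterozygous SNP with low quality, probably because of the mapping.
• E.g. 3 is a non-biallelic (multiallelic) heterozygous SNP.
• E.g. 4 is a biallelic homozygous deletion.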
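The way FORMAT and the sample column pair up can be sketched with awk. The example below rebuilds the four records as a tab-delimited temporary file and classifies each genotype; the class labels (het, hom_ref, hom_alt) are illustrative, not part of the VCF format.

```shell
# Pair the FORMAT keys with the sample values and classify each genotype.
vcf=$(mktemp)
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n'
  printf '20\t1370\trs01\tG\tA\t29\tPASS\tDP=14;AF=0.5\tGT:GQ:DP\t0/1:51:14\n'
  printf '20\t1730\t.\tT\tA\t3\tq10\tDP=11;AF=0.2\tGT:GQ:DP\t0/1:58:11\n'
  printf '20\t2121\trs02\tA\tG,T\t67\tPASS\tDP=10;AF=0.5\tGT:GQ:DP\t1/2:23:10\n'
  printf '20\t6781\t.\tT\t.\t47\tPASS\tDP=13;AF=1\tGT:GQ:DP\t1/1:56:13\n'
} > "$vcf"

calls=$(awk -F '\t' '!/^#/ {
  nf = split($9, key, ":"); split($10, val, ":")
  for (i = 1; i <= nf; i++) field[key[i]] = val[i]   # e.g. field["GT"] = "0/1"
  split(field["GT"], allele, "[/|]")                 # "/" unphased, "|" phased
  if (allele[1] != allele[2]) class = "het"
  else if (allele[1] == "0")  class = "hom_ref"
  else                        class = "hom_alt"
  print $2, class, "GQ=" field["GQ"], "DP=" field["DP"]
}' "$vcf")
printf '%s\n' "$calls"
rm -f "$vcf"
```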
PIPELINE: Combination of commands where the input of one command is the output of the previous one
GDA1: 5- 10 essential commands to play with the biological data.
Input → CMD1 | CMD2 | CMD3 | CMD4 → Output
grep -v "#" Sispe038.gff3 | cut -f 3 | sort | uniq -c
(1) GET SEQUENCE NUMBER
grep -c '^>' file.fasta
(2) GET FASTA SIZE
grep -v '^>' file.fasta | tr -d '\n' | wc -m
(3) GET A LIST OF THE TOP TEN MOST ABUNDANT FASTA DESCRIPTIONS
grep '^>' file.fasta | cut -d ' ' -f2 | sort | uniq -c | sort -nr | head -n 10
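A quick way to check these commands is to run them on a tiny FASTA file whose answer is known in advance. The sketch below uses a made-up two-sequence file; note that the size count removes the newline characters before counting, otherwise “wc -m” would count them as well.

```shell
# Tiny made-up FASTA: 2 sequences, 14 nucleotides in total.
fasta=$(mktemp)
printf '>seq1 kinase\nACGTAC\nGTAA\n>seq2 transporter\nGGCC\n' > "$fasta"

# Number of sequences: one header line (starting with ">") per sequence.
n_seqs=$(grep -c '^>' "$fasta")

# Number of nucleotides: drop headers and newlines, then count characters.
n_bases=$(grep -v '^>' "$fasta" | tr -d '\n' | wc -m | tr -d ' ')

echo "sequences=$n_seqs nucleotides=$n_bases"
rm -f "$fasta"
```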
GDA1: 5- 10 essential commands to play with the biological data.
(4) GET NUMBER OF TYPES IN A GFF3 FILE
grep -v '^#' file.gff | cut -f3 | sort | uniq -c
(5) GET NUMBER OF GENES PER SEQID IN A GFF3 FILE
grep -v '^#' file.gff | cut -f1,3 | grep -w 'gene' | sort | uniq -c
(6) GET NUMBER OF EXONS PER mRNA IN A GFF3 FILE
grep -v '^#' file.gff | cut -f3,9 | grep 'exon' | sed -r 's/.+Parent=//' | sed -r 's/;.+//' | sort | uniq -c | sed -r 's/^\s+//' | cut -d ' ' -f1 | sort | uniq -c | sort -nr
GDA1: 5- 10 essential commands to play with the biological data.
(7) GET THE NUMBER OF VARIANTS PER CHROM IN A VCF FILE
grep -v '^#' file.vcf | cut -f1 | sort | uniq -c
(8) GET THE NUMBER OF BIALLELIC VARIANTS IN A VCF FILE
grep -v '^#' file.vcf | cut -f4,5 | grep -v ',' | wc -l
(9) GET THE NUMBER OF BIALLELIC SNPs IN A VCF FILE
grep -v '^#' file.vcf | cut -f4,5 | grep -Ec '^.\s+.$'
(10) GET THE NUMBER OF VARIANT IMPACTS IN A SNPEFF VCF FILE
grep -v '^#' file.SnpEff.vcf | cut -f8 | sed -r 's/.+;ANN=//' | cut -d '|' -f2 | sort | uniq -c
(In the SnpEff ANN field, subfield 2 is the effect term, e.g. missense_variant; subfield 3 holds the impact category, so use -f3 to count by impact.)
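The VCF one-liners can be tested on a small made-up file before running them on real data. In this sketch the four records are invented for the example, and “[[:space:]]” is used in place of “\s” only to keep the pattern portable across grep versions.

```shell
# Made-up VCF: 3 variants on chr1 (one multiallelic) and 1 insertion on chr2.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tG\tA\t50\tPASS\tDP=10\n'
  printf 'chr1\t200\t.\tT\tA\t50\tPASS\tDP=12\n'
  printf 'chr1\t300\t.\tA\tG,T\t50\tPASS\tDP=9\n'
  printf 'chr2\t150\t.\tT\tTA\t50\tPASS\tDP=11\n'
} > "$vcf"

# (7) variants per chromosome, printed as name=count
per_chrom=$(grep -v '^#' "$vcf" | cut -f1 | sort | uniq -c |
  awk '{ printf "%s%s=%s", (NR > 1 ? " " : ""), $2, $1 } END { print "" }')

# (8) biallelic variants: no comma in the ALT column
biallelic=$(grep -v '^#' "$vcf" | cut -f4,5 | grep -vc ',')

# (9) biallelic SNPs: exactly one character in REF and one in ALT
biallelic_snps=$(grep -v '^#' "$vcf" | cut -f4,5 | grep -Ec '^.[[:space:]]+.$')

echo "$per_chrom biallelic=$biallelic biallelic_snps=$biallelic_snps"
rm -f "$vcf"
```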
1. Change	working	directory	to	‘linux_exercises’.		
2. Make	a	directory	with	the	name:	“VCF_ANALYSIS”.	
3. Change	working	directory	to	“VCF_ANALYSIS”.	
4. Copy into your current working directory the following file:
1. /data/GDA_UFRGS2017/DATA/Nicotiana_benthamiana/resistant_popbatch01/VLS24_S1.PolCollapsedBiallelicAF1.vcf
5. Answer the following questions:
5.1. How many variants does this file have?
5.2. Ignoring scaffolds (SeqID=Niben251ScfXXXXX), how many variants does each chromosome (SeqID=Niben251ChrYY) have?
5.3. How many biallelic SNPs does this file have?
5.4. How many biallelic SNPs with allele frequency 1 (AF=1) does each chromosome have?
Practice 1.7:
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
SCRIPT: Executable file written in a specific language (e.g. Bash, Perl…) that contains commands/functions to be executed.
GDA1: 5- 10 essential commands to play with the biological data.
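#!/bin/bash
# Count how many mRNAs have 1, 2, 3... exons in a GFF3 file.
file_gff=$1   # external argument: the GFF3 file
echo "Analyzing file $file_gff"
date
grep -v '^#' "$file_gff" | cut -f3,9 | grep 'exon' |
  sed -r 's/.+Parent=//' | sed -r 's/;.+//' | sort |
  uniq -c | sed -r 's/^\s+//' | cut -d ' ' -f1 |
  sort | uniq -c | sort -nr
Create the script, make it executable and run it:
nano exons_per_mRNA.sh
chmod 755 exons_per_mRNA.sh
./exons_per_mRNA.sh file1.gff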
NANO EDITOR SCREEN: $1 is the external argument passed to the script. To save in nano: Ctrl+O. To exit nano: Ctrl+X.
1. Change	working	directory	to	‘linux_exercises’.		
2. Make	a	directory	with	the	name:	“MY_FIRST_SCRIPT”.	
3. Change	working	directory	to	“MY_FIRST_SCRIPT”.	
4. Copy into your current working directory the following file:
1. /data/GDA_UFRGS2017/DATA/Nicotiana_benthamiana/reference/Niben251.1_genome.gene_models.sorted.gff
5. Write a script that counts the number of features of a given type per chromosome and takes two arguments: 1st=file.gff; 2nd=type.
Practice 1.8:
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
Genomic Data Analysis
Day 2
Genomic Data Analysis
1. Introduction to Next Generation Sequencing Technologies.
2. Experimental design for population studies, from breeding
to ecological studies.
3. De-multiplexing and the complexities of sample
identification.
4. Read processing and quality evaluation.
5. Read mapping to a reference.
6. Variant calling and summary of the read processing.
7. Quality evaluation and possible pitfalls.
DNA sequencing is the process of determining the precise
order of nucleotides within a DNA molecule. It includes any
method or technology that is used to determine the order of
the four bases—adenine, guanine, cytosine, and thymine—in
a strand of DNA.
https://en.wikipedia.org/wiki/DNA_sequencing
(Gentile et al. Nano Lett., 2012, 12 (12), pp 6453–6458)
ATGCGCGTCGCGGTGAAT
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
Timeline (1950 to 2020)
Technologies: Electrophoresis (1952); DNA structure (1953); Sanger DNA sequencing (1977); AB370A sequencer (1986); AB310 capillary sequencer (1986); 454 sequencer (2005); Solexa Genome Analyzer sequencer (2006); Pacific Biosciences sequencer (2011); Oxford Nanopore portable sequencer (2015).
Sequenced genomes: MS2 bacteriophage (1977); Epstein-Barr virus (1984); Haemophilus influenzae (1995); Arabidopsis thaliana (2000); Homo sapiens (2001).
Sequenced genomes as of 2016/02/04: Viridiplantae 178; Metazoa 5907; Bacteria 7897.
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
Frederick Sanger (1918-2013)
Twice awarded the Nobel Prize in Chemistry
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
PreNGS Era
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
(Figure: Pre-NGS sequencers, with throughput values of 0.015 Mb, 0.078 Mb, 0.315 Mb, 0.138 Mb, 0.414 Mb, 1.3 Mb and 2.6 Mb. Error rate: 0.1%. Error type: substitution.)
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
https://www.ebi.ac.uk/training/online/course/ebi-next-generation-sequencing-practical-course/what-you-will-learn/what-next-generation-dna-
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
(Mardis E.R. (2013) Annual Review of Analytical Chemistry 6: 287-303)
Next Generation Sequencing vs Sanger
Next Generation Sequencing | Sanger
DNA libraries need to be prepared | Fragment amplification
Direct nucleotide detection based on different methods | Physical fragment separation for detection
Millions to billions of reads | Thousands of reads
Variable size (short and long technologies) | 400 to 900 bp read length
Variable error rate | Very low error rate
Quantitative comparison | Semicomparative comparison
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
Next Generation Sequencing
(Figure: yearly counts for “NGS” and “Ecology”, 2005 to 2014, y-axis from 0 to 40000. Graph by Dr. David Haak)
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
http://www.slideshare.net/cosentia/high-throughput-equencing
Next-generation sequencing platforms
Workflow: Isolation and purification of target DNA → Sample preparation → Library validation → Amplification → Sequencing → Imaging → Data analysis
Platforms compared: Illumina GAII, Roche 454, ABi SOLiD, Helicos HeliScope
Amplification: cluster generation on solid-phase; emulsion PCR
Sequencing: sequencing by synthesis with 3’-blocked reversible terminators; pyrosequencing; sequencing by ligation; sequencing by synthesis with 3’-unblocked reversible terminators
Imaging: four colour imaging; single colour imaging
Next Generation Sequencing
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
Technology | Read length (bp) | Accuracy | Reads/Run | Time/Run | Cost/Mb
Applied Bio 3730XL (Sanger) | 400 - 900 | 99.9% | 384 | 4 h (12 runs/day) | $2,400
Roche 454 GS FLX (Pyrosequencing) | 700, Single/Pairs | 99.9% | 1,000,000 | 24 h | $10
Illumina HiSeq4000 (Seq. by synthesis) | 75-250, Single/Pairs | 99% | 5,000,000,000 | 24 to 120 h | $0.05 to $0.15
Illumina MiSeq (Seq. by synthesis) | 50-300, Single/Pairs | 99% | 44,000,000 | 24 to 72 h | $0.17
SOLiD 4 (Seq. by ligation) | 25-50, Single/Pairs | 99.9% | 1,400,000,000 | 168 h | $0.13
Ion Torrent (Seq. by semiconductor) | 170-400, Single | 98% | 80,000,000 | 2 h | $2
Pacific Biosciences Sequel (SMRT) | 14,000, Single | 85% (99.9%) | 1,600,000 | 4 h | $0.6
Oxford Nanopore MinION (Nanopore sequencing) | 10,000, Single | 62% (96%) | 4,400,000 | 48 h | $0.02
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
GDA2: 1- Introduction to Next Generation Sequencing Technologies: Libraries
Multiplexing
Use of DNA tags (4-7 bp) to identify samples in the same sequencing lane, cell or sector.
Sample 1: mRNA-1 (AUGCGUU, AUGCGUU, UUGCGCU, AAGAGUU) → cDNA-1 with tag ATGC: ATGCGTTATGC, ATGCGTTATGC, TTGCGCTATGC, AAGAGTTATGC
Sample 2: mRNA-2 (AUGCGUU, AUGUGAA, UUGCGCU, AAAAGUU) → cDNA-2 with tag CGAG: ATGCGTTCGAG, ATGTGAACGAG, TTGCGCTCGAG, AAAAGTTCGAG
Pool → Sequencing: the eight tagged fragments are sequenced together.
GDA2: 1- Introduction to Next Generation Sequencing Technologies: Libraries
http://www.roche.com/
GDA2: 1- Introduction to Next Generation Sequencing Technologies: 454
Pyrosequencing technology
(Mardis E.R. (2008) Trends in Genetics 24: 133-141)
GDA2: 1- Introduction to Next Generation Sequencing Technologies: 454
Pyrosequencing technology
https://www.youtube.com/watch?v=rsJoG-AulNE
GDA2: 1- Introduction to Next Generation Sequencing Technologies: 454
http://454.com/products/gs-flx-system/index.asp
GDA2: 1- Introduction to Next Generation Sequencing Technologies: 454
http://www.bio-itworld.com/BioIT_Article.aspx?id=131053
GDA2: 1- Introduction to Next Generation Sequencing Technologies: 454
http://www.illumina.com/
GDA2: 1- Introduction to Next Generation Sequencing Technologies: Illumina
Sequence by Synthesis technology
(Mardis E.R. (2013) Annual Review of Analytical Chemistry 6: 287-303)
GDA2: 1- Introduction to Next Generation Sequencing Technologies: Illumina
Sequence by Synthesis technology
http://www.illumina.com/techniques/sequencing/dna-sequencing.html#
GDA2: 1- Introduction to Next Generation Sequencing Technologies: Illumina
https://www.illumina.com/systems/sequencing-platforms.html
Benchtop	systems	
Production-scale	systems	
GDA2: 1- Introduction to Next Generation Sequencing Technologies: Illumina
https://products.appliedbiosystems.com
GDA2: 1- Introduction to Next Generation Sequencing Technologies: SOLiD
Sequence by Ligation technology
GDA2: 1- Introduction to Next Generation Sequencing Technologies: SOLiD
http://media.invitrogen.com.edgesuite.net/ab/applications-technologies/solid/solid-5500.html
Sequence by Ligation technology
GDA2: 1- Introduction to Next Generation Sequencing Technologies: SOLiD
https://products.appliedbiosystems.com
GDA2: 1- Introduction to Next Generation Sequencing Technologies: Ion Torrent
Sequence by Semiconductor technology
1. Copy DNA: A sample of DNA is cut into millions of fragments, and each fragment is attached to its own bead. The fragment is copied until it covers the bead. This automated process produces millions of beads covered with millions of different fragments.
2. Load chip: The beads are then flowed across the chip, each being deposited into a well.
3. Incorporate nucleotide: Then the chip is flooded with one of the four nucleotides. If the next base on the DNA strand is complementary to this nucleotide, a nucleotide will be incorporated and a hydrogen ion will be released.
4. Detect and call: The hydrogen ion changes the pH of the solution in the well. An ion-sensitive layer beneath the well measures that pH change and converts it to voltage. This voltage change is recorded, indicating the nucleotide has been incorporated, and the base is called. This process happens simultaneously in millions of wells.
GDA2: 1- Introduction to Next Generation Sequencing Technologies: Ion Torrent
http://www.pacb.com/
GDA2: 1- Introduction to Next Generation Sequencing Technologies: PacBio
Single Molecule Real Time (SMRT) technology
Niedringhaus et al. Analytical Chemistry 2011
GDA2: 1- Introduction to Next Generation Sequencing Technologies: PacBio
Single Molecule Real Time (SMRT) technology
http://bit.ly/1naxgTe
GDA2: 1- Introduction to Next Generation Sequencing Technologies: PacBio
Single Molecule Real Time (SMRT) technology
http://genome.duke.edu/cores-and-services/sequencing-and-genomic-technologies/pacbio
GDA2: 1- Introduction to Next Generation Sequencing Technologies: PacBio
PacBio	Sequel
https://www.nanoporetech.com/
GDA2: 1- Introduction to Next Generation Sequencing Technologies: Oxford Nanopore
Niedringhaus et al. Analytical Chemistry 2011
Sequence by Nanopore technology
GDA2: 1- Introduction to Next Generation Sequencing Technologies: Oxford Nanopore
Sequence by Nanopore technology
GDA2: 1- Introduction to Next Generation Sequencing Technologies
GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
Population (Genetics)
Group of organisms or individuals from the same geographical
location with the capability of interbreeding.
• Natural populations (e.g. Sinningia speciosa group of plants that
grow in the area of Pedra Lisa).
• Artificial populations (e.g. F2 segregating population of Sinningia
speciosa Empress x Buzios).
GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
• Natural populations
- Structure & Size.
- Diversification.
- Speciation.
- Selection.
- Drift.
- Fitness.
- Migration.
• Artificial populations
- Genetic maps.
- Geno2Pheno links.
- QTLs.
- GWAS.
- Artificial Selection.
- Domestication.
GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
(1) Focusing on genotyping instead of properly sampling the populations.
(2) Wrong randomization of the samples.
(3) Confusing geopolitical borders with biological borders.
(4) Testing the significance of the clustering output.
(5) Misinterpretation of Mantel’s r statistic (correlation between distance matrices).
(6) Interpreting a single K value without considering alternative scenarios.
(7) Not taking into account loci fixation associated with an adaptive trait.
GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
✴ Focusing on genotyping instead of properly sampling the populations.
How many individuals are necessary per “population”?
It depends on the analysis and the population.
Example 1: Single dominant locus QTL analysis. Factors to consider:
• Recombination rate (genome size and chromosome number).
• Genotyping methodology (resolution).
• Loci location.
Suggestion: 100 F2 individuals as a starting point, then move to other populations (e.g. F3) or add more individuals.
Example 2: Local adaptation. Factors to consider:
• Trait analyzed.
• Population structure.
• Quality of the reference.
• Genotyping methodology (resolution).
Suggestion: 50 individuals per group as a starting point, then move to other populations (e.g. F3) or add more individuals.
✴ Genotyping approaches.
GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
Genotyping: It is the process of determining genetic differences of an
individual by examining the individual's DNA sequence.
Genome sequencing
Cost effective approaches
Reduced representation
1. Targeted amplification (e.g. TruSeq Custom Amplicon)
2. Hybridization (e.g. Sequence Capture)
3. Enzymatic Digestion + Size selection (e.g. RAD-Seq / GBS)
4. RNA isolation (RNA-Seq)
✴ Genotyping approaches: Reduced representation approaches.
GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
1. Targeted amplification (e.g. TruSeq Custom Amplicon)
(Figure: genome region with Gene A, Gene B and Gene C and four restriction enzyme (RE) sites.)
Workflow: Amplification → Library preparation and sequencing → Fastq files
Different samples, different MIDs.
✴ Genotyping approaches: Reduced representation approaches.
GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
2. Hybridization (e.g. Sequence Capture)
(Figure: genome region with Gene A, Gene B and Gene C and four restriction enzyme (RE) sites.)
Workflow: Fragmentation → Amplification and library preparation (MIDs added by PCR) → DNA capture → Sequencing → Fastq files
Different samples, different MIDs.
✴ Genotyping approaches: Reduced representation approaches.
GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
3. Enzymatic Digestion + Size selection (e.g. RAD-Seq / GBS)
(Figure: genome region with Gene A, Gene B and Gene C and four restriction enzyme (RE) sites.)
Workflow: Digestion (RE) → Adapters ligation (MIDs) → Amplification by PCR (size selection ~500 bp) → Sequencing → Fastq files
Different samples, different MIDs.
Genotyping-By-Sequencing (GBS): Elshire et al. 2011 PLOS One 6:e19379
✴ Genotyping approaches: Reduced representation approaches.
GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
4. RNA isolation (RNA-Seq)
(Figure: genome region with Gene A, Gene B and Gene C and four restriction enzyme (RE) sites.)
Workflow: Gene expression → RNA extraction and cDNA synthesis → Library preparation and sequencing → Fastq files
Different samples, different MIDs.
GDA2: 3- De-multiplexing and the complexities of sample identification.
GDA2: 3- De-multiplexing and the complexities of sample identification.
De-Multiplexing
Identification of the sequenced DNA samples using the DNA tag
Sequencing → pooled reads:
ATGCGTTCGAG, ATGTGAACGAG, TTGCGCTCGAG, AAAAGTTCGAG, ATGCGTTATGC, ATGCGTTATGC, TTGCGCTATGC, AAGAGTTATGC
Demultiplexing → reads grouped by tag:
Sample 1 (tag ATGC): ATGCGTTATGC, ATGCGTTATGC, TTGCGCTATGC, AAGAGTTATGC
Sample 2 (tag CGAG): ATGCGTTCGAG, ATGTGAACGAG, TTGCGCTCGAG, AAAAGTTCGAG
GDA2: 3- De-multiplexing and the complexities of sample identification.
De-Multiplexing
Identification of the sequenced DNA samples using the DNA tag
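Sequencing → pooled reads (two of them with a sequencing error in the tag):
ATGCGTTCGAG, ATGTGAACGAG, TTGCGCTCGCG, AAAAGTTCGAG, ATGCGTTATCC, ATGCGTTATGC, TTGCGCTATGC, AAGAGTTATGC
Demultiplexing → reads grouped by tag:
Sample 1 (tag ATGC): ATGCGTTATGC, TTGCGCTATGC, AAGAGTTATGC
Sample 2 (tag CGAG): ATGCGTTCGAG, ATGTGAACGAG, AAAAGTTCGAG
Unassigned (?): TTGCGCTCGCG, ATGCGTTATCC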
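The exact-match demultiplexing of this slide can be sketched in a few lines of shell. The tags are placed at the end of the read, as drawn in the example (in real GBS/RADseq data the barcode is usually at the start of the read), and the counters are illustrative.

```shell
# Assign each read to a sample by exact match of the tag at the end of the read.
reads="ATGCGTTCGAG ATGTGAACGAG TTGCGCTCGCG AAAAGTTCGAG ATGCGTTATCC ATGCGTTATGC TTGCGCTATGC AAGAGTTATGC"
sample1=0; sample2=0; unassigned=0
for read in $reads; do
  case "$read" in
    *ATGC) sample1=$((sample1 + 1)) ;;        # tag of Sample 1
    *CGAG) sample2=$((sample2 + 1)) ;;        # tag of Sample 2
    *)     unassigned=$((unassigned + 1)) ;;  # tag with a sequencing error
  esac
done
echo "sample1=$sample1 sample2=$sample2 unassigned=$unassigned"
```

The two reads with an error in the tag (CGCG and ATCC) cannot be assigned by exact matching, which is why dedicated tools usually allow a limited number of mismatches.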
De-Multiplexing
Keys for barcode/tag designing (GBS/RADseq):
• The barcode does not contain or recreate the enzyme cut
site.
• Any barcode in a set is at least two substitutions away
from any other barcode.
• They vary in length as a set (to avoid all the cut-site bases appearing at the same positions in the sequencing reads).
• They contain the complementary sticky end to the enzyme
cut site.
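The “at least two substitutions away” rule can be checked by counting the positions at which two barcodes differ (Hamming distance). A minimal sketch, assuming barcodes of equal length; the function name is illustrative.

```shell
# Number of positions at which two equal-length barcodes differ.
hamming() {
  a=$1; b=$2; d=0; i=1
  while [ "$i" -le "${#a}" ]; do
    ca=$(printf '%s' "$a" | cut -c "$i")
    cb=$(printf '%s' "$b" | cut -c "$i")
    if [ "$ca" != "$cb" ]; then d=$((d + 1)); fi
    i=$((i + 1))
  done
  echo "$d"
}

hamming ATGC CGAG   # the two tags of the multiplexing example
hamming ATGC ATGG   # a pair that would be too similar
```

With a distance of 1, a single sequencing error can turn one barcode into another; with a distance of 2 or more, a single error produces an unknown tag instead of a wrong assignment.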
GDA2: 3- De-multiplexing and the complexities of sample identification.
http://www.maizegenetics.net/genotyping-by-sequencing-gbs
GDA2: 3- De-multiplexing and the complexities of sample identification.
De-Multiplexing
Identification of the sequenced DNA samples using the DNA tag
Software | RE | Link
Fastx-toolkit (Barcode splitter) | No | http://hannonlab.cshl.edu/fastx_toolkit/
Ea-utils (Fastq-multx) | No | https://expressionanalysis.github.io/ea-utils/
GBSX | Yes | https://github.com/GenomicsCoreLeuven/GBSX
TASSEL | Yes | http://www.maizegenetics.net/tassel
Workflow: Fastq raw → Reads processing and filtering → Fastq processed → Mapped reads / Assembled reads / Other analysis
Reads processing and filtering:
1. Low quality reads (qscore) (Q30)
2. Short reads (L50)
3. PCR duplications (only genomes).
4. Contaminations.
5. Corrections
GDA2: 4- Read processing and quality evaluation.
0- Read Quality Evaluation
• READ COUNTS: Did the sequencing produce the expected number of reads?
• AVERAGE READ LENGTH: Do the reads have the expected average length?
• QSCORE NUCLEOTIDE BOXES: Do the reads have the expected nucleotide qscore?
The expected values are technology dependent.
GDA2: 4- Read processing and quality evaluation.
FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
0- Read Quality Evaluation
GDA2: 4- Read processing and quality evaluation.
1- Quality filtering
• Generally associated with a minimum length and a minimum qscore (at the extremes, by average, or as a minimum for all the nucleotides)
Technology | min. length (bp) | min. qscore
454 | 100 | 20
Illumina | 50 | 30
SOLiD | 20 | 30
Ion Torrent | 50 | 20
PacBio | 1000 | NA
Oxford Nanopore | 1000 | NA
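To make the two thresholds concrete, here is a minimal awk sketch of a filter by minimum length and minimum average qscore. It is not how any of the dedicated tools work internally (they also trim adapters and low-quality ends); it assumes 4-line FASTQ records with Phred+33 qualities, and the three reads and the thresholds are made up for the example.

```shell
# Made-up FASTQ: read1 passes, read2 is too short, read3 has low quality.
fq=$(mktemp)
{
  printf '@read1\nACGTACGT\n+\nJJJJJJJJ\n'
  printf '@read2\nACGT\n+\nJJJJ\n'
  printf '@read3\nACGTACGT\n+\n########\n'
} > "$fq"

# Keep the IDs of reads with length >= minlen and mean qscore >= minq.
kept=$(awk -v minlen=6 -v minq=30 '
  BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }  # char -> ASCII code
  NR % 4 == 1 { id = $0 }
  NR % 4 == 0 {
    n = length($0); sum = 0
    for (i = 1; i <= n; i++) sum += ord[substr($0, i, 1)] - 33     # Phred+33
    if (n >= minlen && sum / n >= minq) print id
  }' "$fq")
echo "$kept"
rm -f "$fq"
```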
GDA2: 4- Read processing and quality evaluation.
1- Quality filtering
• Tools and processing time vary depending on the technology.
Software | Link
Fastx-toolkit | http://hannonlab.cshl.edu/fastx_toolkit/
Ea-utils | https://expressionanalysis.github.io/ea-utils/
PrinSeq | http://prinseq.sourceforge.net/
Trimmomatic | http://www.usadellab.org/cms/?page=trimmomatic
e.g. running an ea-utils command:
fastq-mcf -q 30 -l 50 -o s01_Q30L50_R1.fq Illumina_Adapters.fa s01_R1.fq
GDA2: 4- Read processing and quality evaluation.
Practice 2.1: Process reads and get stats.
1. Make a new directory called ‘sinningia_genotyping’.
2. Change the working directory to ‘sinningia_genotyping’.
3. Make a directory with the name: “00_raw”.
4. Change working directory to “00_raw”.
5. Copy four fastq files and the “IlluminaAdapters_V2.fasta” from /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/collection/ to your current working directory.
6. Get the stats for the raw reads using “fastq-stats”.
7. Process the reads using “fastq-mcf” with a min. quality score of 30 and a min. length of 50 bp (note: you can use a script). An example of the command could be something like:
fastq-mcf -q 30 -l 50 -o P1_003_Q30L50.fq IlluminaAdapters_V2.fasta P1_003.fastq.gz
8. Make a directory one level up (../) with the name “01_processed”.
9. Move the outputs from “fastq-mcf” to “../01_processed”.
10. Get the stats for the processed reads using “fastq-stats”.
GDA2: 4- Read processing and quality evaluation.
Read Mapping:
It is the process of finding the location of a read by comparing its sequence with the sequence of a reference.
ref   ATGGCGTGGCAGCGACCAGTGACCAGTGACGTGTGCAGACGTGATATGCAG
              ||||||||||| ||
read          GCAGCGACCAGCGA
      1........10........20........30........40........50
Hit: ref:9..22 (one mismatch)
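The idea of locating a read can be sketched with awk: find an exact match for the first bases of the read (a seed) and then compare the rest of the read against the reference, counting mismatches. This is a toy illustration and not how any particular mapper is implemented; the seed length of 10 is an arbitrary choice, and positions are 1-based.

```shell
# Toy seed-and-extend: exact seed search, then count mismatches over the read.
ref='ATGGCGTGGCAGCGACCAGTGACCAGTGACGTGTGCAGACGTGATATGCAG'
read_seq='GCAGCGACCAGCGA'
hit=$(awk -v ref="$ref" -v read="$read_seq" -v k=10 'BEGIN {
  seed = substr(read, 1, k)
  pos = index(ref, seed)               # 1-based position of the first exact seed match
  if (pos == 0) { print "no seed hit"; exit }
  mm = 0
  for (i = 1; i <= length(read); i++)  # extension: mismatches are allowed
    if (substr(ref, pos + i - 1, 1) != substr(read, i, 1)) mm++
  printf "start=%d end=%d mismatches=%d\n", pos, pos + length(read) - 1, mm
}')
echo "$hit"
```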
Sequence Alignment:
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix.
http://en.wikipedia.org/wiki/Sequence_alignment
GDA2: 5- Read mapping to a reference.
Read Mapping Considerations:
• Length of the read.
• Number of reads.
• Size of the reference.
Short reads (NGS): millions of sequences.
Medium reads (Genes/Transcripts): thousands of sequences.
Long reads (Chromosomes): dozens of sequences.
GDA2: 5- Read mapping to a reference.
Read Mapping Software:
ref (database/indexes): ATGGCGTGGCAGCGACCAGTGACCAGTGACGTGTGCAGACGTGATATGCAG
read: GCAGCGACCAGCGA
read seeds (kmers): GCAGCGACCA, CAGCGACCAG, AGCGACCAGC, GCGACCAGCG, CGACCAGCGA
search: the seed GCAGCGACCA is located in the reference by exact match.
extension: the hit is extended to the whole read, GCAGCGACCAGCGA.
GDA2: 5- Read mapping to a reference.
Sequences: ATGACGTGC and GCCGTGCTG
find seed (perfect match, l=4), then extension (mismatches allowed):
ATGACGTGC
  GCCGTGCTG
  | |||||
A T G A C G T G C
0 -1 -2 -3 -4 -5 -6 -7 -8 -9
G -1 -1 -2 -1 -4 -5 -4 -7 -6 -9
C -2 -2 -2 -3 -2 -3 -6 -5 -8 -5
C -3 -3 -3 -3 -4 -1 -4 -7 -6 -7
G -2 -4 -4 -2 -4 -5 0 -5 -6 -7
T -3 -3 -3 -5 -3 -5 -6 1 -6 -7
G -4 -4 -4 -2 -6 -4 -4 -7 2 -7
C -5 -5 -5 -5 -3 -5 -5 -5 -8 3
T -4 -6 -4 -6 -6 -4 -6 -4 -6 -9
G -5 -5 -7 -3 -8 -7 -3 -7 -3 -7
• Smith–Waterman algorithm
• Needleman–Wunsch algorithm
• Burrows-Wheeler index
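As an illustration of the first algorithm in the list, the following awk sketch computes the best local alignment score between two sequences with the Smith–Waterman recurrence. The scoring values (match +1, mismatch -1, gap -1) are assumptions made for the example, and only the score is reported, not the alignment itself.

```shell
# Smith-Waterman local alignment score (match +1, mismatch -1, gap -1).
sw_score() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    match_s = 1; mismatch = -1; gap = -1; best = 0
    n = length(a); m = length(b)
    for (i = 0; i <= n; i++) H[i, 0] = 0      # first column of the matrix
    for (j = 0; j <= m; j++) H[0, j] = 0      # first row of the matrix
    for (i = 1; i <= n; i++)
      for (j = 1; j <= m; j++) {
        s = (substr(a, i, 1) == substr(b, j, 1)) ? match_s : mismatch
        h = H[i-1, j-1] + s                                # diagonal: match or mismatch
        if (H[i-1, j] + gap > h) h = H[i-1, j] + gap       # gap in b
        if (H[i, j-1] + gap > h) h = H[i, j-1] + gap       # gap in a
        if (h < 0) h = 0                                   # local alignment: never below 0
        H[i, j] = h
        if (h > best) best = h
      }
    print best
  }'
}

sw_score ACGT ACGT
sw_score TTACGTTT GGACGTGG
```

Filling the whole matrix takes time proportional to the product of the two lengths, which is why short-read mappers rely on an index (e.g. Burrows-Wheeler) to find seeds first and align only around them.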
GDA2: 5- Read mapping to a reference.
Read Mapping Software:
Name | Type | Input | Output
Mauve | Long sequences | Fasta, GenBank | backbone (positions), XMFA (alignments)
LastZ/MultiZ | Long sequences | Fasta | several (maf, sam…)
Blast | Medium sequences | Fasta | Blast formats (0 text file, 6 tabular file)
Blat | Medium sequences | Fasta | Blast formats + Blat tabular format
Bowtie | Short sequences | Fasta, Fastq | Sam
BWA | Short sequences | Fasta, Fastq | Sam
Novoalign | Short sequences | Fasta, Fastq | Sam
SOAP | Short sequences | Fasta, Fastq | Sam
Stampy | Short sequences | Fasta, Fastq | Sam
GDA2: 5- Read mapping to a reference.
Read Mapping Software:
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
GDA2: 5- Read mapping to a reference.
Sam/Bam file manipulation:
http://samtools.sourceforge.net/
GDA2: 5- Read mapping to a reference.
Practice 2.2: Read mapping and get stats.
1. Change the working directory to ‘sinningia_genotyping’.
2. Make a directory with the name: “02_mapped”.
3. Change working directory to “01_processed”.
4. Map each set of processed reads to the index of the reference “../../linux_exercises/Sispe_ref/Sispe038ReducedRef.fa”, created with bowtie2-build on Day 1 (Practice 3). Redirect the output to the directory “../02_mapped/”. An example of a command could be:
bowtie2 -p 2 -t -x ../../linux_exercises/Sispe_ref/Sispe038ReducedRef -U P1_009_Q30L50.fq -S ../02_mapped/P1_009_Q30L50.sam
5. Change working directory to “../02_mapped”.
6. Count how many hits each sam file has using “samtools”, for example with the following command.
samtools view -c -F 4 -S P1_009_Q30L50.sam
GDA2: 5- Read mapping to a reference.
Practice 2.2: Read mapping and get stats.
7. Filter the sam file, removing the reads without hits (flag 0x4), and convert it to bam.
samtools view -F 4 -Sb -o P1_009.bam P1_009_Q30L50.sam
8. Merge all the bam files using ‘bamaddrg’ with a command such as the one below. As sample names, use the names from the file “SampleNamesMappingFile.txt” (e.g. the P1_003 name is “Purple_Dreaming”; do not use spaces).
/data/software/bamaddrg/bamaddrg -b P1_003.bam -s Purple_Dreaming -b P1_009.bam -s Merry_Christmas -b P1_014.bam -s White -b P1_021.bam -s Good_Morning > SispeUser00_merged.bam
9. Sort the merged bam file with “samtools sort”.
samtools sort -o SispeUser00_sorted.bam SispeUser00_merged.bam
10. Delete the sam files and the unsorted bams.
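What the “-F 4” filter of steps 6 and 7 does can be imitated on a plain-text SAM file with awk: a read is kept when bit 4 (0x4, unmapped) of the FLAG column is not set. The three alignment records below are made up for the example, and this sketch is only meant to explain the filter, not to replace samtools.

```shell
# Made-up SAM: r1 and r3 are mapped, r2 has the unmapped bit (4) set.
sam=$(mktemp)
{
  printf '@HD\tVN:1.6\n'
  printf 'r1\t0\tref\t9\t42\t14M\t*\t0\t0\tGCAGCGACCAGCGA\tIIIIIIIIIIIIII\n'
  printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTTTTTTTTTTT\tIIIIIIIIIIIIII\n'
  printf 'r3\t16\tref\t20\t30\t14M\t*\t0\t0\tTGACCAGTGACGTG\tIIIIIIIIIIIIII\n'
} > "$sam"

# Skip header lines (@...) and count records whose FLAG (column 2) lacks bit 4.
mapped=$(awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 0 { n++ } END { print n + 0 }' "$sam")
echo "mapped reads: $mapped"
rm -f "$sam"
```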
GDA2: 5- Read mapping to a reference.
Genetic variation is the set of genetic differences both within and among populations.
• Structural differences: Structural Variations (SVs), Copy Number Variation (CNV).
• Molecular differences (changes in the DNA sequence).
• Single Nucleotide Variants/Polymorphisms (SNVs/SNPs)
• Insertions/deletions Variants/Polymorphisms (INDELs/DIVs/DIPs)
• Multiple Nucleotide Variants/Polymorphism (MNVs/MNPs)
A polymorphism is a DNA sequence variation that is common in the population.
SNVs/SNPs
Sample 1  GACGTGC
          | |||||
Sample 2  GCCGTGC
INDELs/DIVs/DIPs
Sample 1  GACGTGC
          | |||||
Sample 2  G-CGTGC
MNVs/MNPs
Sample 1  GACGTGC
          |  ||||
Sample 2  GCTGTGC
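In a VCF file these three classes can be told apart from the lengths of the REF and ALT alleles. The sketch below applies a simple length rule to the three examples written VCF-style (A>C, GA>G and AC>CT); the rule ignores multiallelic and complex variants, so it is only an approximation.

```shell
# Classify REF/ALT pairs by allele length: SNP, MNP or INDEL.
types=$(printf 'A C\nGA G\nAC CT\n' | awk '{
  r = length($1); a = length($2)
  if (r == 1 && a == 1) t = "SNP"      # one base replaced by another
  else if (r == a)      t = "MNP"      # several adjacent bases replaced
  else                  t = "INDEL"    # different lengths: insertion or deletion
  out = out (NR > 1 ? " " : "") t
} END { print out }')
echo "$types"
```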
GDA2: 6- Variant calling and summary of the read processing.
Workflow: Processed Reads → (read mapping) → Mapped Reads → (local realignment, sort and filtering) → Processed Map → (variant calling) → Variants → (variant filtering) → (variant annotation) → Annotated Variants
Variant calling:
• Heuristic methods (read depth)
- SamTools
- VarScan
• Probabilistic methods (Bayesian)
- GATK
- FreeBayes
- SOAPsnp/SOAPindel
GDA2: 6- Variant calling and summary of the read processing.
Variant Calling Software:
Name | Type | Strengths | Weaknesses
SamTools | Heuristic | Assumes errors are non-independent (matches data); good accuracy with low coverage data; reasonably quick | Increased false positives at high coverage; lower quality indel calling
GATK | Probabilistic | Trains with real data; excellent accuracy with high coverage data; low false positive rate | Assumes errors are independent; high level of preprocessing; very slow
FreeBayes | Probabilistic | Combined bam population estimate; good accuracy with low coverage data; very quick | No training, population level estimate only; lower quality indel calling
GDA2: 6- Variant calling and summary of the read processing.
Variant filtering:
- VCFTools
- GATK
Variant annotation:
- SnpEff
GDA2: 6- Variant calling and summary of the read processing.
Practice 2.3: Variant calling.
1. Change the working directory to ‘sinningia_genotyping’.
2. Make a directory with the name “03_variants”.
3. Change the working directory to “02_mapped”.
4. Create an index with “samtools index” for the sorted BAM file:
samtools index SispeUser00_sorted.bam
5. Run “freebayes” with --min-base-quality 30 --min-mapping-quality 20 --min-coverage 5, with a command such as:
freebayes -b SispeUser00_sorted.bam -f ../../linux_exercises/Sispe_ref/Sispe038ReducedRef.fa -v ../03_variants/SispeUser00.vcf --min-coverage 5 -q 30 -m 20
6. Count how many variants the VCF file has, broken down by type (SNP, INDEL, MNP and Complex).
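One possible way to approach the last step is to read the TYPE tag that FreeBayes writes in the INFO column (values such as snp, mnp, ins, del and complex, one per alternative allele). The sketch below is only an illustration under that assumption; the function name `count_variant_types` and the two demo records are hypothetical.

```python
from collections import Counter


def count_variant_types(vcf_lines):
    """Count the TYPE values found in the INFO column of VCF records.

    Assumption: the INFO column (8th field) carries a TYPE=... tag with one
    comma-separated value per alternative allele, as FreeBayes writes it.
    """
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
        for variant_type in info.get("TYPE", "unknown").split(","):
            counts[variant_type] += 1
    return counts


# Hypothetical records, only to show the expected input layout
demo = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Scf1\t10\t.\tA\tC\t50\t.\tDP=12;TYPE=snp",
    "Scf1\t25\t.\tAT\tA,ATT\t40\t.\tDP=9;TYPE=del,ins",
]
print(count_variant_types(demo))
```

For a real file, the same function can be called on an open file handle, for example `count_variant_types(open("SispeUser00.vcf"))`; insertions and deletions would then be summed to get the INDEL total.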
Methods for Variant Evaluation
• Validation by Sanger sequencing of specific candidates (~5 - 500), or with other datasets (e.g. transcriptome) if possible.
• Comparison with another method (e.g. genotyping chip).
• Comparison of different mapping and variant calling tools (with a “truth set” or a “gold standard” if possible).
GDA2: 7- Quality evaluation and possible pitfalls.
https://gatkforums.broadinstitute.org/gatk/discussion/6308/evaluating-the-quality-of-a-variant-callset
• Validation by Sanger sequencing of specific candidates (~5 - 500), or with other datasets (e.g. transcriptome) if possible.
[Figure: variants from RNASeq (Illumina) compared with variants from ESTs (Sanger)]
• Comparison of different mapping tools, variant calling tools and datasets (with a “truth set” or a “gold standard” if possible).
Assumptions:
1. The content of the truth set has been validated.
2. Your samples are expected to have similar genomic content as the population of samples that was used to produce the truth set.
Metrics:
1. Variant level concordance: Percentage of variants in your samples that match (are concordant with) variants in your truth set.
2. Genotype concordance: Percentage of genotype calls in your samples that match (are concordant with) the genotypes in your truth set.
My Dataset (16 variants) compared with the True Set (18 variants):
True Positives (TP) = 13, False Positives (FP) = 3, False Negatives (FN) = 5
% SENSITIVITY:
TP * 100 / (TP + FN) = 13 * 100 / (13 + 5) = 72%
% FALSE DISCOVERY RATE:
FP * 100 / (TP + FP) = 3 * 100 / (13 + 3) = 19%
% GT CONCORDANCE:
SumMatches * 100 / TP = 6 * 100 / 11 = 55% (11 sites compared in this example)
True Set (9)    A * T C T C C * C A C
My Dataset (8)  A T T C * C C T * A *
Matches (6)     1 0 1 1 0 1 1 0 0 1 0
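The three percentages can be checked with a few lines of Python. This is a minimal sketch using the counts from the slide (TP = 13, FP = 3, FN = 5) and the two genotype rows of the example; the function names are hypothetical, and the genotype concordance divides by the number of compared sites, as the worked example does.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Percentage of true variants that were recovered."""
    return 100.0 * tp / (tp + fn)


def false_discovery_rate(tp: int, fp: int) -> float:
    """Percentage of called variants that are not in the truth set."""
    return 100.0 * fp / (tp + fp)


def genotype_concordance(truth: str, calls: str, missing: str = "*") -> float:
    """Percentage of compared sites where both genotypes are called and equal."""
    matches = sum(1 for t, c in zip(truth, calls) if t == c and t != missing)
    return 100.0 * matches / len(truth)


truth_set = "A*TCTCC*CAC"    # True Set row, "*" marks a site without a call
my_dataset = "ATTC*CCT*A*"   # My Dataset row

print(round(sensitivity(13, 5), 2))
print(round(false_discovery_rate(13, 3), 2))
print(round(genotype_concordance(truth_set, my_dataset), 2))
```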
Metrics:
3. Number of SNPs and INDELs: It should be consistent between datasets with the same number of mapped reads.
4. TiTv Ratio: The ratio of transition (Ti) to transversion (Tv) SNPs would be ~0.5 if substitutions were random. Methylation islands (CpG) and other factors may introduce a bias, so expected values range from 0.5 to 3.0.
5. Ratio Insertions/Deletions: It should be close to 1, except for rare alleles, where it could be 0.2 - 0.5.
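The value of ~0.5 for random substitutions follows from counting: of the 12 possible single-base substitutions, 4 are transitions (A<->G, C<->T) and 8 are transversions. A minimal Python sketch, where the function name `titv_ratio` and the input format (a list of REF/ALT pairs) are assumptions made for the illustration:

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = "ACGT"


def titv_ratio(snps):
    """Return the transition/transversion ratio for a list of (ref, alt) SNPs."""
    transitions = 0
    transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # ignore anything that is not a simple SNP
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return float("inf")
    return transitions / transversions


# All 12 possible substitutions, each seen once: 4 transitions, 8 transversions
all_substitutions = [(r, a) for r in BASES for a in BASES if r != a]
print(titv_ratio(all_substitutions))
```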
Comparison between different tools:
https://bcbio.wordpress.com/
Tools:
Name                         URL
VariantEvaluation (GATK)     https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_varianteval_VariantEval.php
GenotypeConcordance (GATK)   https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeConcordance.php
VCFTools                     http://vcftools.sourceforge.net/
VCFStats                     http://lindenb.github.io/jvarkit/
PicardTools                  https://broadinstitute.github.io/picard/index.html
Genomic Data Analysis
Day 3
Genomic Data Analysis
1. Variant filtering.
2. Simple stats for the variant analysis.
3. Variant visualization tools: IGV and TASSEL.
4. Changing formats for VCF files.
5. Example 1: Population analysis with Structure for Sinningia
speciosa.
6. Example 2: Genetic Map with R/QTL for Nicotiana
benthamiana.
Variant filtering is the process of removing low-quality or otherwise unsuitable variants (e.g. non-biallelic, complex…) before the downstream analysis. It depends on:
1. Source and methodology used to generate the data (library preparation errors and biases).
2. Sequencing technology (read sequencing errors) and amount of data (insufficient depth/sites).
3. Software used for mapping (mapping errors) and variant calling (errors produced at low coverage/low complexity sites).
4. Reliability (low quality/incomplete) and nature (genomic differences/polyploidy) of the reference genome.
5. Type of population (e.g. F2 population) and type of analysis that will be performed.
GDA3: 1-Variant Filtering
Variant filtering
Two major sources of error (Li et al. 2014):
• Erroneous realignment in low-complexity regions
• Incomplete reference genome with respect to the sample
“The error rate of raw genotype calls is as high as 1 in 10-15 kb, but the error rate of post-filtered calls is reduced to 1 in 100-200 kb without significant compromise on the sensitivity”.
More data is not always better.
High quality/reliable data
Alignment problems
coordinates 12345678901234   5678901234567890123456
reference   aggttttttataac---aattaagtctacagagcaacta
sample      aggttttttataacAATaattaagtctacagagcaacta
read1       aggttttttataac***aaAtaa
read2        ggttttttataac***aaAtaaTt
read3            ttttataacAATaattaagtctaca
read4                 CaaT***aattaagtctacagagcaac
read5                  aaT***aattaagtctacagagcaact
read6                    T***aattaagtctacagagcaacta
Good alignment: read3, which spans the whole insertion. Misaligned bases: the uppercase bases in the other reads.
Aligners calculate the alignment correctness and give it a score depending on:
• Length of the alignment.
• Number of mismatches and gaps.
• Uniqueness of the alignment (number of hits).
Misaligned bases - Solutions:
• Read realignment (IndelRealigner in GATK is now obsolete; local realignment is integrated in HaplotypeCaller).
• Compute the per-base alignment quality (BAQ) and do not use bases with low BAQ for variant calling.
Library preparation problems
PCR duplications produce biases in the variant call (e.g. in heterozygous calls).
• Library-specific problem for Whole Genome Sequencing.
[Figure: Gene A, Gene B and Gene C through fragmentation, library preparation and PCR duplication]
PCR duplications - Solutions:
• Remove duplicates with tools such as samtools rmdup (recent SAMtools versions provide samtools markdup instead).
SKIP THE PCR DUPLICATION MARKING STEP FOR GBS, RAD-SEQ…
CAREFUL: Some reduced representation techniques with unequal ratios of site amplification WILL PRODUCE THOUSANDS OF PCR DUPLICATES.
Sequencing errors produce biases in the variant call.
Sequencing errors - Solutions:
• High coverage (> 20 X) to minimize the effect of sequencing errors.
• Recalibrate base qualities (Base Quality Score Recalibration - BQSR) using tools such as BaseRecalibrator.
Variant filtering:
https://software.broadinstitute.org/gatk/best-practices/
https://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant-detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/
Three general purpose callers:
• FreeBayes (v0.9.9.2-18)
• GATK UnifiedGenotyper (2.7-2)
• GATK HaplotypeCaller (2.7-2)
• Skipping base recalibration and indel realignment had almost no impact on the quality of resulting variant calls.
• FreeBayes outperforms the GATK callers on both SNP and indel calling. The most recent versions of FreeBayes have improved sensitivity and specificity which puts them on par with GATK HaplotypeCaller.
• GATK HaplotypeCaller is all around better than the UnifiedGenotyper.
Variant filtering:
Filters available in each software:
Software   Depth  Het.  Var. Quality  Mapping Quality  Allele Freq.  Position/Distance  HWE  Missing
VCFTools   Yes    Yes   Yes           No               Yes           Yes                Yes  Yes
SnpSift*   Yes    Yes   Yes           No               Yes           Yes                No   No
Vardict    Yes    No    Yes           No               Yes           No                 No   No
GATK       Yes    Yes   Yes           Yes              Yes           Yes                No   Yes
* It depends on the tags present in the VCF file.
Examples using VCFTools
1. Remove variants with low quality (QUAL < 20).
vcftools --vcf input.vcf --minQ 20 --recode --recode-INFO-all --out out
2. Remove variants with low depth (mean DP < 10).
vcftools --vcf input.vcf --min-meanDP 10 --recode --recode-INFO-all --out out
3. Keep only variants separated by at least 1000 bp.
vcftools --vcf input.vcf --thin 1000 --recode --recode-INFO-all --out out
4. Keep only biallelic variants.
vcftools --vcf input.vcf --min-alleles 2 --max-alleles 2 --recode --recode-INFO-all --out out
5. Remove variants with missing genotypes.
vcftools --vcf input.vcf --max-missing 1.0 --recode --recode-INFO-all --out out
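To make the logic of two of these filters explicit, the sketch below applies a QUAL threshold and a biallelic check directly to VCF lines in Python. It is an illustration of the idea, not a reimplementation of VCFTools: the function names and demo records are hypothetical, and the threshold is treated as inclusive (QUAL of at least 20 is kept), which is an assumption.

```python
def passes_filters(vcf_record: str, min_qual: float = 20.0, biallelic_only: bool = True) -> bool:
    """Return True if a VCF data line passes the QUAL and biallelic filters."""
    fields = vcf_record.rstrip("\n").split("\t")
    alt, qual = fields[4], fields[5]
    if qual == "." or float(qual) < min_qual:
        return False                      # missing or low quality
    if biallelic_only and (alt == "." or "," in alt):
        return False                      # no ALT allele, or more than one
    return True


def filter_vcf(lines, min_qual: float = 20.0, biallelic_only: bool = True):
    """Yield header lines unchanged and only the data lines that pass."""
    for line in lines:
        if line.startswith("#") or passes_filters(line, min_qual, biallelic_only):
            yield line


# Hypothetical records, only to show the behaviour
demo_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Scf1\t10\t.\tA\tC\t50\t.\tDP=12",
    "Scf1\t25\t.\tG\tT\t5.5\t.\tDP=3",
    "Scf1\t40\t.\tC\tA,T\t80\t.\tDP=20",
]
for kept in filter_vcf(demo_vcf):
    print(kept)
```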
Practice 3.1: Filter the variant file.
1. Change the working directory to ‘sinningia_genotyping’.
2. Change the working directory to “03_variants”.
3. Run “vcf-stats” on the “SispeUserXX.vcf” file.
4. Remove the variants with QUAL < 20 and run “vcf-stats” again.
5. Remove the variants with DEPTH < 10 and run “vcf-stats” again.
6. Remove the variants separated by less than 1000 bp and run “vcf-stats” again.
7. Get the biallelic variants and run “vcf-stats” again.
8. Remove all the variants with missing genotype data.
9. Select only SNPs.
Stats for the VCF files
GDA3: 2- Simple population stats for the variant analysis.
Regular stats with vcf-stats (https://vcftools.github.io/perl_module.html)
vcf-stats is a program that computes several stats for a VCF file, producing the following files:
• stats.counts, number of variants per sample and for all the samples.
• stats.dump, parseable file in Perl hash format with all the VCF stats.
• stats.indels, number of INDELs per sample and for all the samples.
• stats.legend, various definitions.
• stats.private, unique (not shared) variants for each sample.
• stats.samples-tstv, transitions/transversions for each sample.
• stats.shared, shared variants for all the samples.
• stats.snps, number of SNPs per sample and for all the samples.
• stats.tstv, transitions/transversions for all the samples.
Distribution with bcftools stats (https://samtools.github.io/bcftools/bcftools.html)
bcftools is a program to manipulate VCF/BCF files. Its stats command can be used to produce several data distributions such as QUAL (quality), DP (depth), ST (substitution types), IDD (InDel size distribution), AF (allele frequency)… It also includes a summary (SN).
# SN, Summary numbers:
# SN [2]id [3]key [4]value
SN 0 number of samples: 4
SN 0 number of records: 110927
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 99184
SN 0 number of MNPs: 10943
SN 0 number of indels: 3101
SN 0 number of others: 506
SN 0 number of multiallelic sites: 3816
SN 0 number of multiallelic SNP sites: 798
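The SN block is easy to load into a script for further checks. The following Python sketch assumes the usual tab-separated layout of the bcftools stats output (SN, id, key, value); the function name `parse_summary_numbers` is hypothetical.

```python
def parse_summary_numbers(stats_lines):
    """Collect the SN (summary numbers) records into a dictionary.

    Assumption: fields are tab-separated, in the order SN, id, key, value.
    """
    summary = {}
    for line in stats_lines:
        if not line.startswith("SN\t"):
            continue  # comments and the other sections are ignored
        fields = line.rstrip("\n").split("\t")
        key = fields[2].rstrip(":")
        summary[key] = int(fields[3])
    return summary


# A few lines copied from the summary above, with tabs between the fields
demo_stats = [
    "# SN\t[2]id\t[3]key\t[4]value",
    "SN\t0\tnumber of samples:\t4",
    "SN\t0\tnumber of records:\t110927",
    "SN\t0\tnumber of SNPs:\t99184",
    "SN\t0\tnumber of indels:\t3101",
]
numbers = parse_summary_numbers(demo_stats)
print(numbers["number of SNPs"] / numbers["number of records"])
```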
Distribution with vcfutils.pl qstats
vcfutils.pl qstats reports the quality (QUAL) and ts/tv values associated with the SNPs. It can be used to test whether there is a ts/tv bias associated with low quality.
QUAL #non-indel #SNPs #transitions #joint ts/tv #joint/#ref #joint/#non-indel
6856.32 1909 1909 654 0 0.5211 0.0000 0.0000 0.5211
4769.34 3818 3818 1381 0 0.5667 0.0000 0.0000 0.6151
3506.53 5727 5727 2215 0 0.6307 0.0000 0.0000 0.7758
2748.14 7636 7636 3051 0 0.6654 0.0000 0.0000 0.7791
2240.06 9545 9545 3956 0 0.7078 0.0000 0.0000 0.9014
. . .
16.3149 80179 80179 41279 0 1.0612 0.0000 0.0000 1.2945
11.551 82088 82088 42386 0 1.0676 0.0000 0.0000 1.3803
6.48534 83997 83997 43471 0 1.0727 0.0000 0.0000 1.3167
2.79415 85906 85906 44556 0 1.0775 0.0000 0.0000 1.3167
Population stats with vcftools
vcftools can also be used to get simple population genetics parameters associated with a VCF file. Some examples are:
• Calculate nucleotide diversity (π)
vcftools --vcf input.vcf --keep NamesGroup1.txt --window-pi 100000 --out Group1_Pi
• Calculate linkage disequilibrium (LD) (for phased genotypes).
vcftools --vcf input.vcf --keep NamesGroup1.txt --ld-window-bp 50000 --chr SeqID1 --hap-r2 --min-r2 0.001 --out Group1_SeqID1_LD
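As a reminder of what nucleotide diversity measures, the sketch below computes π as the fraction of chromosome pairs that differ at a site, then averages it over a window. It is a simplified illustration and not the exact VCFTools implementation; the function names and the input format (a list of allele counts per site) are assumptions.

```python
def site_pi(allele_counts):
    """Per-site diversity: probability that two sampled chromosomes differ.

    allele_counts holds the number of chromosomes carrying each allele,
    e.g. [6, 2] for 6 reference and 2 alternative copies.
    """
    n = sum(allele_counts)
    if n < 2:
        return 0.0
    identical_pairs = sum(c * (c - 1) for c in allele_counts)
    return 1.0 - identical_pairs / (n * (n - 1))


def window_pi(sites, window_size):
    """Average diversity per base pair over a window of window_size bp.

    Sites without variants contribute zero, so only variant sites are listed.
    """
    return sum(site_pi(counts) for counts in sites) / window_size


# Two variant sites in a 100 bp window, 4 chromosomes sampled
print(window_pi([[2, 2], [3, 1]], 100))
```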
• Calculate FST between two groups.
vcftools --vcf input.vcf --weir-fst-pop SampleGroup1.txt --weir-fst-pop SampleGroup2.txt --fst-window-size 100000 --out Group1_VS_Group2_FST
• Calculate Tajima's D for one group.
vcftools --vcf input.vcf --keep NamesGroup1.txt --TajimaD 100000 --out Group1_TajimaD
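For reference, Tajima's D compares the average number of pairwise differences (π) with the number of segregating sites (S) scaled by the sample size. The Python sketch below follows the standard formula from Tajima (1989) for a single window; the function name `tajima_d` and its arguments are hypothetical, and the result is not guaranteed to match the VCFTools output digit by digit.

```python
import math


def tajima_d(n: int, seg_sites: int, pi: float) -> float:
    """Tajima's D for n chromosomes, seg_sites segregating sites and
    pi average pairwise differences (summed over the window)."""
    if n < 4:
        raise ValueError("at least 4 chromosomes are needed")
    if seg_sites == 0:
        return float("nan")  # D is undefined without segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Fewer pairwise differences than expected from S gives a negative D
print(tajima_d(n=10, seg_sites=16, pi=3.5))
```

Negative values indicate an excess of rare alleles (e.g. population expansion or purifying selection) and positive values an excess of intermediate-frequency alleles.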
Practice 3.2: Get stats for the VCF file
1. Change the working directory to ‘sinningia_genotyping’.
2. Change the working directory to “03_variants”.
3. Run “vcf-stats” on the “SispeUserXX.vcf” file.
4. Run “bcftools stats” on the “SispeUserXX.vcf” file and pipe the output to select the lines matching “^SN”.
5. Run “vcftools” to calculate the nucleotide diversity on the “SispeUserXX.vcf” file.
6. Run “vcftools” to calculate Tajima's D on the “SispeUserXX.vcf” file.
7. Divide your dataset into two groups and calculate the FST between these two groups.
IGV, Integrative Genomics Viewer
GDA3: 3- Variant visualization tools: IGV and TASSEL
http://software.broadinstitute.org/software/igv/
The Integrative Genomics Viewer (IGV) is a high-performance
visualization tool for interactive exploration of large, integrated genomic
datasets. It supports a wide variety of data types, including array-based
and next-generation sequence data, and genomic annotations.
http://software.broadinstitute.org/software/igv/download
1- Create a new .genome file for the Sinningia reference.
2- Add a unique identifier (e.g. “Sispe038”), a descriptive name (e.g. “S. species version 0.3.8”), and the FASTA and GFF files with the reference.
3- Select any scaffold.
4- To load any VCF or BAM file, select “Load From File”.
5- Then load your VCF file (e.g. “SispeUser00.vcf”).
6- Then select the scaffold that you want to visualize (e.g. “Sispe038Scf0002”).
It creates two tracks: 1- With all the variants and the AF as a red/blue bar; 2- With all the individual samples.
You can also load BAM files to see the read alignment.
TASSEL, Trait Analysis by aSSociation, Evolution and Linkage
http://www.maizegenetics.net/tassel
TASSEL is a tool to investigate the relationship between phenotypes and genotypes. TASSEL has functionality for association studies, evaluating evolutionary relationships, analysis of linkage disequilibrium, principal component analysis, cluster analysis, missing data imputation and data visualization.
1- Load VCF data
2- Explore the VCF data
3- Calculate nucleotide diversity
4- Get a distance matrix
5- Perform a Principal Component Analysis
6- Produce a cladogram
Change formats from VCF to others.
GDA3: 4- Changing formats for VCF files.
http://www.cmpg.unibe.ch/software/PGDSpider/
VCF => FastStructure
PGDSpider can be used to change between different formats:
• From VCF to FastStructure. First keep only the header and the variants with single-character REF and ALT alleles, then convert the file:
perl -ne 'chomp($_); if ($_ =~ m/^#/) { print "$_\n" } else { @a = split(/\t/, $_); if (length($a[3]) == 1 && length($a[4]) == 1) { print "$_\n" } }' input.vcf > clean.vcf
java -Xmx1024m -Xms512m -jar /data/software/PGDSpider_2.1.1.2/PGDSpider2-cli.jar -inputfile clean.vcf -inputfileformat VCF -outputfile clean.structure.str -outputfileformat STRUCTURE -spid VCF2FastStructure.spid
PGDSpider requires a configuration file (.spid) for each format conversion. Example of a VCF2FastStructure.spid file:
# VCF Parser questions
PARSER_FORMAT=VCF
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
VCF_PARSER_REGION_QUESTION=
VCF_PARSER_PL_QUESTION=GT
VCF_PARSER_QUAL_QUESTION=20
VCF_PARSER_GTQUAL_QUESTION=0
VCF_PARSER_READ_QUESTION=5
VCF_PARSER_IND_QUESTION=
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=TRUE
VCF_PARSER_MONOMORPHIC_QUESTION=FALSE
VCF_PARSER_POP_QUESTION=
# STRUCTURE Writer questions
WRITER_FORMAT=STRUCTURE
STRUCTURE_WRITER_FAST_FORMAT_QUESTION=TRUE
STRUCTURE_WRITER_LOCI_DISTANCE_QUESTION=TRUE
Change formats
https://github.com/aubombarely/GenoToolBox
GDA3: 5- Example 1: Population analysis with Structure for Sinningia speciosa.
EmpressPurple01 S_speciosa DOMESTICATED
EmpressRed01 S_speciosa DOMESTICATED
Buzios S_speciosa WILD
PurpleDreaming Hybrid DOMESTICATED
GalaxyTour S_speciosa DOMESTICATED
AnsNix Hybrid DOMESTICATED
AmandasPenny Hybrid DOMESTICATED
TV_Faeton S_speciosa DOMESTICATED
DarthVader S_speciosa DOMESTICATED
MerryChristmas S_speciosa DOMESTICATED
StrawberryJam S_speciosa DOMESTICATED
BlueDandy S_speciosa DOMESTICATED
DeadlyRomance S_speciosa DOMESTICATED
DiegoPink S_speciosa WILD
White S_speciosa DOMESTICATED
Kleopatra S_speciosa DOMESTICATED
BestRoskosh S_speciosa DOMESTICATED
LovePotion S_speciosa DOMESTICATED
NTVenushki S_speciosa DOMESTICATED
GoodMorning S_speciosa DOMESTICATED
BlueKnight S_speciosa DOMESTICATED
AvenidaNiemeyer S_speciosa WILD
BuziosXEmpressF1 S_speciosa DOMESTICATED
ChilternSeeds S_speciosa WILD
EmpressRed02 S_speciosa DOMESTICATED
PedraLisa S_speciosa WILD
CardosoMoreira S_speciosa WILD
EmpressPurple02 S_speciosa DOMESTICATED
Carangola S_speciosa WILD
CardosoMoreiraPink S_speciosa WILD
CharlesLawn S_speciosa DOMESTICATED
Shelleri S_helleri WILD
MiguelPereira S_speciosa WILD
[Figure: accessions Buzios, Carangola, A. Niemeyer, Darth Vader, Empress Red and Blue Knight]
Goal:
Analyze the population
structure for cultivated
Sinningias
Genetic Variation in the Species
Wild accessions 9
Cultivars 25
Wild x Cultivated F1 1
Other species 1
____________________________________________
TOTAL 36
Library preparation: ApeKI digestion
Sequencing: Illumina, single end, 100 bp
De-multiplexing: GBSX v1.2
Read processing: Fastq-mcf v1.04.807, Q30, L50
Alignment: Bowtie2 v2.2.4
Variant detection: Freebayes v0.9.20
SNP filtering: bcftools (only biallelic SNPs); vcffilter (Q > 30, Depth >= 5); vcftools (no missing observations)
Result: 41,626 SNPs
1. Clean the file of SNPs defined with more than one character (e.g. AC/AG).
perl -ne 'chomp($_); if ($_ =~ m/^#/) { print "$_\n" } else { @a = split(/\t/, $_); if (length($a[3]) == 1 && length($a[4]) == 1) { print "$_\n" } }' Sispe038_Set01_FILTERED_SNPs.vcf > Sispe038_Set01_FILTERED_SNPs_CLEAN.vcf
2. Change the VCF format to FastStructure.
java -Xmx1024m -Xms512m -jar /data/software/PGDSpider_2.1.1.2/PGDSpider2-cli.jar -inputfile Sispe038_Set01_FILTERED_SNPs_CLEAN.vcf -inputfileformat VCF -outputfile Sispe038_Set01_FILTERED_SNPs_CLEAN.structure.str -outputfileformat STRUCTURE -spid VCF2FastStructure.spid
3. Prepare a script (run_faststructure.sh) with the fastStructure command lines: 5 replicates, random seeds and K from 1 to 20.
#!/bin/bash
python /data/software/fastStructure/structure.py -K 1 --input=Sispe038_Set01_FILTERED_SNPs_CLEAN.structure --output=Sispe038_Set01_FS_K01_R01 --format=str --seed=123456789
…
4. Change the permissions of the script and run it.
chmod 755 run_faststructure.sh
./run_faststructure.sh
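Writing the 100 command lines by hand (20 values of K with 5 replicates each) is tedious, so the script can be generated. The following Python sketch is one possible way to do it, assuming the same structure.py path, options and output naming pattern shown in the script; the function name `faststructure_commands` and the way the random seeds are drawn are illustrative choices.

```python
import random


def faststructure_commands(input_prefix, output_prefix, max_k=20, replicates=5,
                           structure_py="/data/software/fastStructure/structure.py",
                           rng_seed=42):
    """Build one fastStructure command line per K and replicate.

    rng_seed only makes this example reproducible; each command still
    receives its own random --seed value.
    """
    rng = random.Random(rng_seed)
    commands = []
    for k in range(1, max_k + 1):
        for replicate in range(1, replicates + 1):
            seed = rng.randint(1, 10**9)
            commands.append(
                f"python {structure_py} -K {k} --input={input_prefix} "
                f"--output={output_prefix}_K{k:02d}_R{replicate:02d} "
                f"--format=str --seed={seed}"
            )
    return commands


commands = faststructure_commands(
    "Sispe038_Set01_FILTERED_SNPs_CLEAN.structure", "Sispe038_Set01_FS")
script_text = "#!/bin/bash\n" + "\n".join(commands) + "\n"
print(script_text.splitlines()[1])
```

The content of `script_text` can then be written to run_faststructure.sh before changing its permissions.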
5. Run chooseK.py to get the most supported K.
python /data/software/fastStructure/chooseK.py --input=Sispe038_Set01_FS_*
Model complexity that maximizes marginal likelihood = 2
Model components used to explain structure in data = 2
In our review of 1,264 studies using structure to explore population subdivision, studies that used ΔK were more likely to identify K = 2 (54%, 443/822) than studies that did not use ΔK (21%, 82/386). A troubling finding was that very few studies performed the hierarchical analysis recommended by the authors of both ΔK and structure to fully explore population subdivision.
[Figure: parental lines S-6-4 (PI#555684) and S-6-5 (Standard Line), and their F1. Picture from Dr. David Zaitlin.]
GDA3: 6- Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
F2_107, F2_108, F2_109, F2_110, F2_111, F2_112, F2_113, F2_115, F2_117, F2_118, F2_119, F2_120, F2_121, F2_122, F2_123, F2_124, F2_125, F2_126, F2_127, F2_129, F2_130, F2_131, S_64_2, S_65_2
F2_001, F2_002, F2_003, F2_004, F2_005, F2_006, F2_007, F2_009, F2_010, F2_011, F2_012, F2_013, F2_014, F2_015, F2_016, F2_017, F2_018, F2_019, F2_021, F2_022, F2_023, F2_024, F2_028, F2_031, F2_032, F2_033, F2_034, F2_035, F2_036
F2_077, F2_078, F2_079, F2_080, F2_081, F2_082, F2_083, F2_084, F2_085, F2_086, F2_087, F2_088, F2_089, F2_091, F2_092, F2_093, F2_094, F2_095, F2_096, F2_097, F2_098, F2_099, F2_100, F2_101, F2_102, F2_103, F2_104, F2_105, F2_106
Goal:
Develop a genetic map
with 19 linkage groups
http://www.rqtl.org/
http://www.rqtl.org/tutorials/geneticmaps.pdf
1. Change the VCF format to the CSV format used as input by R/QTL, using Vcf2Mapmaker from GenoToolBox (https://github.com/aubombarely/GenoToolBox).
/old_home/aurebg/Software/GenoToolBox/SNPTools/Vcf2Mapmaker -i NibenGBS_M30.vcf -o NibenGBS_M30.csv -f csv -a S_64_2 -b S_65_2 -B -d 1000
2. Load the genotypes in R/QTL.
setwd('./')
library('qtl')
NbenX = read.cross(file="NibenGBS_M30.csv", format="csv")
3. Follow the R/QTL tutorial.
GDA3: 6- Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
GDA3: 6- Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
Genomic Data Analysis: From Reads to Variants
Genomic Data Analysis: From Reads to Variants

Contenu connexe

Tendances

Basic linux commands for bioinformatics
Basic linux commands for bioinformaticsBasic linux commands for bioinformatics
Basic linux commands for bioinformaticsBonnie Ng
 
An Overview to Protein bioinformatics
An Overview to Protein bioinformaticsAn Overview to Protein bioinformatics
An Overview to Protein bioinformaticsJoel Ricci-López
 
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...VHIR Vall d’Hebron Institut de Recerca
 
De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015Torsten Seemann
 
Stateful Flow Table - SFT 2020 DPDK users pace summit
Stateful Flow Table - SFT 2020 DPDK users pace summitStateful Flow Table - SFT 2020 DPDK users pace summit
Stateful Flow Table - SFT 2020 DPDK users pace summitAndrey Vesnovaty
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...Torsten Seemann
 
Role of transcriptomics in gene expression studies and
Role of transcriptomics in gene expression studies andRole of transcriptomics in gene expression studies and
Role of transcriptomics in gene expression studies andSarla Rao
 
Variant (SNPs/Indels) calling in DNA sequences, Part 2
Variant (SNPs/Indels) calling in DNA sequences, Part 2Variant (SNPs/Indels) calling in DNA sequences, Part 2
Variant (SNPs/Indels) calling in DNA sequences, Part 2Denis C. Bauer
 
Transcriptomics: A time efficient tool for crop improvement
Transcriptomics: A time efficient tool for crop improvementTranscriptomics: A time efficient tool for crop improvement
Transcriptomics: A time efficient tool for crop improvementSajid Sheikh
 
Introduction to Python for Bioinformatics
Introduction to Python for BioinformaticsIntroduction to Python for Bioinformatics
Introduction to Python for BioinformaticsJosé Héctor Gálvez
 
RNASeq - Analysis Pipeline for Differential Expression
RNASeq - Analysis Pipeline for Differential ExpressionRNASeq - Analysis Pipeline for Differential Expression
RNASeq - Analysis Pipeline for Differential ExpressionJatinder Singh
 
Enzymes used in genetic engineering
Enzymes used in genetic engineeringEnzymes used in genetic engineering
Enzymes used in genetic engineeringdharmesh sherathia
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataThomas Keane
 

Tendances (20)

Snp genotyping
Snp genotypingSnp genotyping
Snp genotyping
 
Biopython
BiopythonBiopython
Biopython
 
Primerdesign
PrimerdesignPrimerdesign
Primerdesign
 
Basic linux commands for bioinformatics
Basic linux commands for bioinformaticsBasic linux commands for bioinformatics
Basic linux commands for bioinformatics
 
An Overview to Protein bioinformatics
Genomic Data Analysis: From Reads to Variants

  • 1. Genomic Data Analysis From READS to VARIANTS 24-10-17 to 26-10-17, Porto Alegre, Brazil. Aureliano Bombarely Virginia Tech Department of Horticulture Latham 216 220 Ag Quad Lane Blacksburg, VA USA aurebg@vt.edu
  • 2. Genomic Data Analysis DAY 1: • Presentation of the Course. • Introduction to the Linux Operating System and the Command Line Interface. • 25 essential commands to work with Linux. • Common bioinformatics formats, from FASTAs to GFFs and VCFs. • 10 essential commands to play with the biological data. DAY 2: • Introduction to Next Generation Sequencing Technologies (NGS). • Experimental design for population studies, from breeding to ecological studies. • De-multiplexing and the complexities of sample identification. • Read processing and quality evaluation. • Read mapping to a reference. • Variant calling and summary of the read processing. • Quality evaluation and possible pitfalls. DAY 3: • Variant filtering. • Simple stats for the variant analysis. • Variant visualization tools: IGV and TASSEL. • Changing formats for VCF files. • Example 1: Population analysis with Structure for Sinningia speciosa. • Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
  • 7. Genomic Data Analysis 1. Presentation of the Course. 2. Introduction to the Linux Operating System and the Command Line Interface. 3. 25 essential commands to work with Linux. 4. Common bioinformatics formats, from FASTAs to GFFs and VCFs. 5. 10 essential commands to play with the biological data.
  • 9. GDA1: 1- Presentation of the Course. Biological Problem Scientific Question Hypothesis Genetics & related disciplines Molecular biology Massive DNA Sequencing Results Experimental Design Approach ?
  • 10. GDA1: 1- Presentation of the Course. Biological Problem Scientific Question Hypothesis Genetics & related disciplines Molecular biology Massive DNA Sequencing Results Experimental Design Approach Genomic Data Analysis
  • 11. GDA1: 1- Presentation of the Course. Genomic Data Analysis is: • Knowledge about sequencing technologies. • Knowledge about methodologies (e.g. library preparation). • Bioinformatic skills (Linux command line and R). • Basic knowledge about statistical analysis. Genomic Data Analysis IS NOT: • Programming (useful but not necessary). • Basic knowledge of computer system administration. • Modeling. • Algorithm development. • Database development.
  • 12. GDA1: 1- Presentation of the Course. Genomic Data Analysis is: • Knowledge about sequencing technologies. • Knowledge about methodologies (e.g. library preparation). • Bioinformatic skills (Linux command line and R). • Basic knowledge about statistical analysis. BIOINFORMATICS: • Programming (useful but not necessary). • Basic knowledge of computer system administration. • Modeling. • Algorithm development. • Database development.
  • 14. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface Linux: Unix-like computer operating system assembled under the model of free and open source software development and distribution. Operating System (OS): Set of programs that manage computer hardware resources and provide common services for application software. (Source: Wikipedia)
  • 15. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface Unix-like? Feduccia A, Trends Ecol. Evol. 2001
  • 16. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface Unix: A multitasking, multi-user computer operating system originally developed in 1969. https://www.howtogeek.com/182649/htg-explains-what-is-unix/
  • 17. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface Linux Distribution: Distributions (often called distros for short) are Operating Systems including a large collection of software applications such as word processors, spreadsheets, media players, and database applications. The operating system will consist of the Linux kernel and, usually, a set of libraries and utilities from the GNU Project, with graphics support from the X Window System.
  • 18. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface Linux Distribution:
  • 19. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface What is a console ? Computer terminal or system consoles are the text entry and display device for system administration messages, particularly those from the BIOS or boot loader, the kernel, from the init system and from the system logger. It is a physical device consisting of a keyboard and a screen. Wikipedia
  • 20. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface So then... What is the typical black or white screen where programmers and system administrators write commands?
  • 21. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface Command-line interface (CLI): Mechanism for interacting with a computer operating system or software by typing commands to perform specific tasks. The command-line interpreter may be run in a text terminal or in a terminal emulator window as a remote shell client such as PuTTY. (Source: Wikipedia)
  • 22. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface Shell: Piece of software that provides an interface for users of an operating system. There are two categories: Command-line interface (CLI) and Graphical user interface (GUI).
  • 23. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface Command-line interface (CLI): Shell: Bash. [Diagram labels: Command, STDIN, STDOUT, STDERR, Shell (Bash), Operating System] A command is a directive to a computer program acting as an interpreter of some kind, in order to perform a specific task.
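The three streams named on the diagram (STDIN, STDOUT, STDERR) can be tried out with a small sketch. It is only an illustration under stated assumptions: it runs in a throwaway directory made with `mktemp`, the file and directory names are made up, and the `2>` redirection for STDERR is standard bash but is not listed on the slides.

```shell
# Work in a throwaway directory so that no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# STDOUT: ">" writes the output of echo into a file instead of the screen.
echo "ACGT" > out.txt

# ">>" appends to the file instead of overwriting it.
echo "TTGA" >> out.txt

# STDERR: "2>" saves the error message; the directory name is hypothetical
# and is assumed not to exist.
ls ./no_such_directory_demo 2> err.txt || echo "ls failed, message saved in err.txt"

# STDIN: "<" feeds the file to the command, here to count its lines.
n_lines=$(wc -l < out.txt)
echo "lines in out.txt: $n_lines"
```

Keeping STDOUT and STDERR in separate files is handy later on, when a program writes its results to one stream and its log messages to the other.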
  • 24. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface Parts of a command (figure labels: Command, Argument 1 & 2, Argument 3): The command calls a program. Arguments modify the behavior of the program. -l means “long listing”; -h/--human-readable means “human readable”. Then push the RETURN or ENTER key to execute the command.
  • 25. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface Special characters in bash (characters with a special meaning for bash):
      CHARACTER    NAME            MEANING
      (space)      SPACE           Separates commands and arguments
      #            POUND           Comment
      ;            SEMICOLON       Command separator, to run multiple commands
      .            DOT             Source command OR filename component OR current directory
      ..           DOUBLE DOTS     Parent directory
      ' '          SINGLE QUOTES   Use the expression between the quotes literally
      ,            COMMA           Concatenate strings
      \            BACKSLASH       Escape for a single character
      /            SLASH           Filename path separator
      *            ASTERISK        Wildcard for filename expansion (globbing)
      >, <, >>     CHARACTERS      Redirection of inputs/outputs
      |            PIPE            Pipe outputs between commands
  • 26. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface Special characters in bash: Bash understands spaces as separators between arguments. Use single quotes or a backslash escape for special characters.
      ls Solanum lycopersicum
      ls 'Solanum lycopersicum'
      ls Solanum\ lycopersicum
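The effect of the space in the `ls` examples above can be checked with a short sketch. It assumes a throwaway directory created with `mktemp`; the file name `genome.fasta` is hypothetical and only there to give `ls` something to list.

```shell
# Work in a throwaway directory so that nothing in your home is touched.
workdir=$(mktemp -d)
cd "$workdir"

# One directory whose name contains a space, with one (hypothetical) file in it.
mkdir 'Solanum lycopersicum'
touch 'Solanum lycopersicum/genome.fasta'

# Unquoted: bash passes two arguments, "Solanum" and "lycopersicum".
# Neither exists, so ls reports an error.
ls Solanum lycopersicum 2>/dev/null || echo "unquoted: no such file or directory"

# Quoted or escaped: bash passes one single argument.
quoted=$(ls 'Solanum lycopersicum')
escaped=$(ls Solanum\ lycopersicum)
echo "quoted:  $quoted"
echo "escaped: $escaped"
```

In practice, the Tab key inserts the backslash for you when it autocompletes a name that contains a space.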
  • 27. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface Practice 1.1: Connect to the server
      Windows users:
      1. Open the program PuTTY and start a session.
      2. Add the following information for your session and click Open.
         • Host: begonia.hort.vt.edu
         • Port: 1809
         • Connection type: SSH
      3. Enter your username and password.
      4. Click Connect.
      MacOS/Linux users:
      1. Open the program Terminal.
      2. Type in the terminal: ssh -p 1809 username@begonia.hort.vt.edu
      3. Push Enter/Return.
      4. Enter the password and push Enter/Return. Note that what you type will be hidden.
  • 28. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface Practice 1.1: Connect to the server
      Everyone:
      5. Type in the terminal: pwd
      6. Push Enter/Return.
      7. Describe the message that has appeared on the screen.
  • 30. GDA1: 3- 25 essential commands to work with Linux. 1. pwd
      • The command prints the path to the working directory.
      • When you log in to the system, the working directory = home ($HOME).
      pwd
      /data/GDA_UFRGS2017/User00_Home
      • / means root (the beginning of the file system).
      • data is the name of the 1st directory in root.
      • /GDA_UFRGS2017 is the 2nd directory, after data.
      • /User00_Home is the 3rd directory, after GDA_UFRGS2017.
  • 31. GDA1: 3- 25 essential commands to work with Linux. 2. mkdir
      • The command makes a new directory.
      • If the directory already exists, it gives an error.
      • The -p argument also makes the parent directories.
      • rmdir removes an empty directory.
      mkdir linux_exercises
      mkdir linux_exercises/test01
      mkdir -p linux_exercises/test01
      rmdir linux_exercises/test01
      Slide marks: ✓correct ✴error ✓correct ✓correct
  • 32. GDA1: 3- 25 essential commands to work with Linux. 3. cd
      • The command changes the working directory.
      • Two consecutive dots (e.g. “cd ..”) move one directory up/back in the file system.
      cd linux_exercises              ✓correct
      cd linux_exercises/test01       ✴error
      cd test01                       ✓correct
  • 33. GDA1: 3- 25 essential commands to work with Linux. 4. ls
      • It lists the items in the working directory (default) or in any other directory.
      • The -l argument produces the long listing of the items.
      • The -h argument produces a human-readable form.
      • The -a argument prints everything (including hidden files, whose names start with “.”).
      • The -t argument sorts by time.
      ls
      ls -lht linux_exercises/test01
      cd test01
      Slide marks: ✓correct ✴error ✓correct
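The four directory commands seen so far (pwd, mkdir, cd, ls) can be rehearsed safely before touching the course server. The sketch below is an illustration, not part of the original slides: it assumes a throwaway directory created with `mktemp`, and it uses `basename` (a standard utility not covered in the slides) only to print the last component of the path.

```shell
# Throwaway sandbox, so the exercise does not depend on the course server.
sandbox=$(mktemp -d)
cd "$sandbox"

mkdir linux_exercises                  # make a new directory
mkdir -p linux_exercises/test01        # -p also makes missing parent directories

# A plain mkdir on a directory that already exists gives an error.
mkdir linux_exercises 2>/dev/null || echo "mkdir: linux_exercises already exists"

cd linux_exercises/test01
pwd                                    # path ends in linux_exercises/test01
cd ..                                  # one directory up
here=$(basename "$(pwd)")
echo "now in: $here"

ls -lh                                 # long, human-readable listing shows test01
rmdir test01                           # works because test01 is empty
remaining=$(ls | wc -l)
echo "items left: $remaining"
```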
  • 34. Practice 1.2: Navigating the file system GDA1: 3- 25 essential commands to work with Linux.
      1. Type pwd in the terminal and run it.
      2. Make the directory ‘linux_exercises’ by typing and running: mkdir linux_exercises
      3. Run pwd in the current directory and note the result.
      4. Change the working directory to ‘linux_exercises’.
      5. Make a new directory named ‘01_file_system_tree’.
      6. Change the working directory to ‘01_file_system_tree’.
      7. Run pwd in the current directory and note the result.
      8. Make a new directory named ‘subdir01’.
      9. Change the working directory to ‘subdir01’.
      10. Run pwd in the current directory and note the result.
      11. Change the working directory one level up.
      12. Make a new directory named ‘subdir02’.
      13. Change the working directory to ‘subdir02’.
      14. Run pwd in the current directory and note the result.
      15. Draw the file system tree for the directories ‘subdir01’ and ‘subdir02’.
  • 35. Practice 1.2: Navigating the file system GDA1: 3- 25 essential commands to work with Linux. File system tree: / → data/ → GDA_UFRGS2017/ → User00_Home/ → linux_exercises/ → 01_file_system_tree/, which contains subdir01/ and subdir02/.
      cd 01_file_system_tree/subdir02
  • 36. Practice 1.2: Navigating the file system GDA1: 3- 25 essential commands to work with Linux. Same file system tree.
      cd ../../
  • 37. Practice 1.2: Navigating the file system GDA1: 3- 25 essential commands to work with Linux. Same file system tree.
      cd ../subdir01/
  • 38. Practice 1.2: Navigating the file system GDA1: 3- 25 essential commands to work with Linux. Same file system tree.
      cd /data/GDA_UFRGS2017/User00_Home/linux_exercises/01_file_system_tree/subdir01/
      cd ../subdir01/
  • 39. Practice 1.2: Navigating the file system GDA1: 3- 25 essential commands to work with Linux. Same file system tree.
      Absolute filepath: cd /data/GDA_UFRGS2017/User00_Home/linux_exercises/01_file_system_tree/subdir01/
      Relative filepath: cd ../subdir01/
  • 40. Practice 1.2: Navigating the file system GDA1: 3- 25 essential commands to work with Linux. An analogy with a postal address:
      Absolute filepath: Latham Hall 311, 220 Ag Quad Lane, Blacksburg, VA 24061, USA
      Relative filepath: Room 311
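The difference between the two kinds of filepath can be checked with a small sketch. It rebuilds only the last levels of the practice tree inside a throwaway directory made with `mktemp` (an assumption, so that it runs anywhere), and then reaches subdir01 once through a relative path and once through an absolute path.

```shell
# Rebuild the end of the practice tree inside a throwaway sandbox.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/01_file_system_tree/subdir01" "$sandbox/01_file_system_tree/subdir02"

# Relative filepath: it is read starting from the current working directory.
cd "$sandbox/01_file_system_tree/subdir02"
cd ../subdir01
via_relative=$(pwd)

# Absolute filepath: it starts at "/" and works from any working directory.
cd /
cd "$sandbox/01_file_system_tree/subdir01"
via_absolute=$(pwd)

echo "relative route ended in: $via_relative"
echo "absolute route ended in: $via_absolute"
```

Both routes end in the same directory. The relative one is shorter to type, but it only works when you start from the right place.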
  • 41. GDA1: 3- 25 essential commands to work with Linux. Commands for directories:
      COMMAND   USE                  EXAMPLE
      cd        Change working dir   cd ../
      pwd       Print working dir    pwd
      ls        List information     ls -lh /home
      mkdir     Create a new dir     mkdir test
      rmdir     Remove empty dir     rmdir test
  • 42. GDA1: 3- 25 essential commands to work with Linux. 5. history
      • It lists the last 500 commands that were run.
      • No arguments needed.
  • 43. GDA1: 3- 25 essential commands to work with Linux. Typing shortcuts for bash:
      SHORTCUT    MEANING
      Tab         Autocomplete file or folder names
      ↑           Scroll up through the command history
      ↓           Scroll down through the command history
      Ctrl + A    Go to the beginning of the line that you are typing
      Ctrl + E    Go to the end of the line that you are typing
      Ctrl + U    Clear the line from the cursor position back to the beginning
      Ctrl + R    Search previously used commands
      Ctrl + C    Kill the process that you are running
      Ctrl + D    Exit the current shell
      Ctrl + Z    Suspend the running process. Use the command fg to resume it (or bg to continue it in the background).
  • 44. GDA1: 3- 25 essential commands to work with Linux. 6. less
      • Opens a text file on the screen.
      • To navigate, use the arrow keys (↑ ↓).
      • “Shift + G” goes to the end of the file.
      • “/” followed by a word searches for that word.
      • “q” quits/exits.
      • Open the file with “-N” to show line numbers.
      • More information at: http://www.tutorialspoint.com/unix_commands/less.htm
      less ../DATA/Sinningia_speciosa/reference/Sispe038_cds.fasta
      less -N ../DATA/Sinningia_speciosa/reference/Sispe038_cds.fasta
  • 45. GDA1: 3- 25 essential commands to work with Linux. 7. touch
      • Creates an empty file.
      touch this_is_a_test_file.txt
    8. rm
      • Removes/deletes a file permanently from the system.
      • The file CANNOT BE RECOVERED.
      • “rm -Rf <directory>” removes the directory and all its content. BE CAREFUL.
      rm this_is_a_test_file.txt
      rm -Rf 01_file_system_tree/subdir01
  • 46. GDA1: 3- 25 essential commands to work with Linux. 9. cp
      • Copies a file from one location to another.
      • “./” means copy here, in the working directory.
      cp ../DATA/Sinningia_speciosa/reference/Sispe038_cds.fasta ./
    10. mv
      • Two functions:
      • If the destination EXISTS and is a DIR, it moves the file there.
      • If the destination DOES NOT EXIST, it changes the name of the file.
      • CAREFUL: if the destination EXISTS and is a file, mv WILL OVERWRITE IT.
      mv Sispe038_cds.fasta Sispe038_cds.fa
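The file commands above (touch, cp, mv, rm) can be combined in one short sketch. It is an illustration under assumptions: everything happens inside a throwaway directory made with `mktemp`, and the names `backup` and `renamed.txt` are hypothetical.

```shell
# Throwaway sandbox: rm is permanent, so practice where nothing matters.
sandbox=$(mktemp -d)
cd "$sandbox"

touch this_is_a_test_file.txt             # create an empty file
mkdir backup

cp this_is_a_test_file.txt backup/        # copy the file into a directory
mv this_is_a_test_file.txt renamed.txt    # destination does not exist: rename
mv renamed.txt backup/                    # destination is a directory: move

listing=$(ls backup)
echo "$listing"

rm backup/renamed.txt                     # delete one file
rm -Rf backup                             # delete the directory and all its content
```

As an addition to the slides: the standard `-i` option of cp and mv asks for confirmation before overwriting an existing file, which guards against the overwrite case flagged as CAREFUL.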
  • 47. Practice 1.3: Copying and moving files GDA1: 3- 25 essential commands to work with Linux.
      1. Change the working directory to ‘linux_exercises’.
      2. Make a directory with the name “Sispe_ref”.
      3. Change the working directory to “Sispe_ref”.
      4. Copy all the fasta files from /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/reference/ to your current working directory by typing:
         cp /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/reference/*.fasta ./
      5. Remove the file “Sispe038.scaffolds.fasta”.
      6. Change the name of “Sispe038.scaffolds500kb.fasta” to “Sispe038ReducedRef.fa”.
      7. Create a mapping reference using Bowtie2-build by running:
         bowtie2-build Sispe038ReducedRef.fa Sispe038ReducedRef
  • 48. GDA1: 3- 25 essential commands to work with Linux. 11. cat
      • It prints the content of the file to STDOUT (the screen).
      • Usually used to concatenate (merge) files one after another, using “cat file1.txt file2.txt > merged.txt”.
      cat Sispe038ReducedRef.fa
    12. head/tail
      • Print the first/last 10 lines of the file to STDOUT.
      • The number of lines (x) can be changed using “-n x”.
      head Sispe038ReducedRef.fa
      head -n 100 Sispe038ReducedRef.fa
      tail -n 1 Sispe038ReducedRef.fa
  • 49. GDA1: 3- 25 essential commands to work with Linux. 13. grep
      • Command to find and print the LINES that match the pattern used.
      • The “-c” option prints the NUMBER of LINES that match.
      • The “-v” option prints the LINES that DO NOT match.
      • The pattern must be inside straight quotes: an unquoted > would be read by bash as a redirection.
      grep ">" Sispe038ReducedRef.fa
      grep -c ">" Sispe038ReducedRef.fa
      grep -v ">" Sispe038ReducedRef.fa
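Because each FASTA record starts with a “>” header line, cat, head, tail and grep together give a quick way to merge files and count sequences. The sketch below is an illustration with two tiny made-up FASTA files (hypothetical sequences, not the course data), written inside a throwaway directory made with `mktemp`.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# Two tiny, made-up FASTA files.
printf '>seq1\nACGT\n>seq2\nGGCC\n' > a.fasta
printf '>seq3\nTTAA\n' > b.fasta

# Concatenate both files, one after the other, into a new file.
cat a.fasta b.fasta > merged.fasta

head -n 2 merged.fasta        # first record: header line plus sequence line
tail -n 1 merged.fasta        # last line of the file

# Count header lines (= number of sequences) and non-header lines.
# The ">" is quoted so that bash does not treat it as a redirection.
n_seqs=$(grep -c ">" merged.fasta)
n_other=$(grep -c -v ">" merged.fasta)
echo "sequences: $n_seqs, sequence lines: $n_other"
```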
  • 50. GDA1: 3- 25 essential commands to work with Linux. 14. gzip/gunzip
      • gzip compresses a file.
      • gunzip uncompresses a file.gz.
      • To keep the original file, the “-c” option can be used: it writes the compressed data to STDOUT, which is then redirected to a new file.
      gzip Sispe038ReducedRef.fa
      gunzip Sispe038ReducedRef.fa.gz
      gzip -c Sispe038ReducedRef.fa > SispeRef.fa.gz
  • 51. GDA1: 3- 25 essential commands to work with Linux. 15. tar
      • Command to archive/unarchive the files contained in a directory.
      • It can be combined with tools such as gzip and bzip2.
      • Commonly used commands:
      tar -zxvf package_of_files.tar.gz       to unarchive and uncompress .gz
      tar -jxvf package_of_files.tar.bz2      to unarchive and uncompress .bz2
      tar -zcvf dir1.tar.gz /path_to_dir1     to archive and compress with gzip
      tar -jcvf dir1.tar.bz2 /path_to_dir1    to archive and compress with bzip2
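A round trip through gzip and tar shows that nothing is lost when compressing and archiving. This is only a sketch under assumptions: it works in a throwaway directory made with `mktemp`, and the directory `dir1` and the file `ref.fa` are hypothetical stand-ins for real data.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# A hypothetical directory with one small FASTA file in it.
mkdir dir1
printf '>seq1\nACGT\n' > dir1/ref.fa

# gzip -c writes the compressed data to STDOUT, so the original file is kept.
gzip -c dir1/ref.fa > ref_copy.fa.gz

# gunzip replaces ref_copy.fa.gz with the uncompressed ref_copy.fa.
gunzip ref_copy.fa.gz

# Archive and compress the whole directory, delete it, then restore it.
tar -zcvf dir1.tar.gz dir1
rm -Rf dir1
tar -zxvf dir1.tar.gz

ls -lh
```

After the last step, dir1/ref.fa is back and has the same content as ref_copy.fa.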
  • 52. Practice 1.4: Concatenating files and taking a look at them GDA1: 3- 25 essential commands to work with Linux.
      1. Change the working directory to ‘linux_exercises’.
      2. Make a directory with the name “CDS_refs”.
      3. Change the working directory to “CDS_refs”.
      4. Copy the following files into your current working directory:
         • /data/GDA_UFRGS2017/DATA/Arabidopsis_thaliana/reference/Athaliana_Phytozome167_TAIR10.pep.fa.gz
         • /data/GDA_UFRGS2017/DATA/Oryza_sativa/reference/Osativa_Phytozome323_v7.0.pep.fa.gz
      5. Uncompress both files.
      6. Count how many lines contain the symbol “>” in each file.
      7. Concatenate both files into a file named “Atha_Osat_PEP.fasta”.
      8. Count how many lines contain the symbol “>” in this file.
      9. Create a BLAST+ reference by running the following command:
         makeblastdb -in Atha_Osat_PEP.fasta -dbtype prot -parse_seqids
  • 53. GDA1: 3- 25 essential commands to work with Linux. 16. cut • It splits each line by TAB and prints the selected column to STDOUT. • “-f x”, where x is the number of the column. • “-d y”, where y is a character, changes the delimiter. 17. sort • It sorts the lines of a file alphabetically, comparing from the first characters of each line. • “-n” sorts numerically. • “-r” reverses the sort order. • “-u” keeps only one copy of each repeated line. • Usually used with cut, e.g. “cut -f1 my_file.txt | sort”. Examples:
      cut -f 3 Sispe038_genome.genemodels.gff3
      cut -f 3 Sispe038_genome.genemodels.gff3 | sort -u
  • 54. GDA1: 3- 25 essential commands to work with Linux. 18. uniq • It reports or omits repeated lines. Only ADJACENT duplicates are collapsed, so the input is usually sorted first. • “-c” prefixes each line with the number of times it occurred. • Usually used in conjunction with cut and sort, e.g. “cut -f1 my_file.txt | sort | uniq”. 19. wc • It counts newlines, words or bytes in a file. • “-l” counts the number of lines. • “-w” counts the number of words. • “-m” counts the number of characters. Examples:
      cut -f 3 Sispe038_genome.genemodels.gff3 | sort | uniq -c
      wc -l Sispe038_genome.genemodels.gff3
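The reason for sorting before uniq can be sketched with a toy tab-separated table (toy.tsv and its rows are made up for illustration). The “tr -d ' '” step is only there because some wc implementations pad their output with spaces.

```shell
tmpdir=$(mktemp -d)
# Toy table (hypothetical): seqid, source, type.
printf 'ctg1\t.\tgene\nctg1\t.\texon\nctg1\t.\texon\nctg2\t.\tgene\nctg2\t.\texon\n' > "$tmpdir/toy.tsv"

# Without sorting, uniq only collapses ADJACENT duplicates:
# gene, exon, exon, gene, exon -> gene, exon, gene, exon
unsorted=$(cut -f3 "$tmpdir/toy.tsv" | uniq | wc -l | tr -d ' ')

# Sorting first groups identical values together: exon, gene
sorted=$(cut -f3 "$tmpdir/toy.tsv" | sort | uniq | wc -l | tr -d ' ')

# Count how many times each type appears.
cut -f3 "$tmpdir/toy.tsv" | sort | uniq -c

echo "distinct types without sort: $unsorted, with sort: $sorted"
rm -rf "$tmpdir"
```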
  • 55. GDA1: 3- 25 essential commands to work with Linux. 20. sed • Stream editor to transform text. • The simplest use is the substitution “s/<find>/<replace>/”, which replaces the first match on each line. • Add a “g” to replace every match on the line: “s/<find>/<replace>/g”. • More info at: https://www.gnu.org/software/sed/manual/sed.html Example:
      sed "s/A/a/g" Sispe038ReducedRef.fa
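The effect of the “g” flag is easiest to see on a single short string. The sketch below uses a made-up sequence; the last command adds an address (“/^>/!”), which is not on the slide, to skip FASTA header lines so that only the sequence lines are edited.

```shell
line="AAGATA"

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/A/a/')

# With g: every match on the line is replaced.
all_hits=$(printf '%s\n' "$line" | sed 's/A/a/g')

# Apply the substitution only to lines that do NOT start with ">".
fasta_out=$(printf '>SeqA\nAAGA\n' | sed '/^>/!s/A/a/g')

echo "$first_only $all_hits"
echo "$fasta_out"
```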
  • 56. Practice 1.5: Selecting columns and counting them. GDA1: 3- 25 essential commands to work with Linux. 1. Change working directory to “linux_exercises”. 2. Make a directory with the name “GFF_refs”. 3. Change working directory to “GFF_refs”. 4. Copy the following files into your current working directory: (a) /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/reference/Sispe038_genome.genemodels.gff3 (b) /data/GDA_UFRGS2017/DATA/Nicotiana_benthamiana/reference/Niben251.1_genome.genemodels.sorted.gff3 5. Count the number of lines in both files. 6. Count the number of lines, ignoring lines with the “#” symbol, using grep. 7. Select the third column in both files and print the first 20 lines. 8. Select the third column in both files, sort it and count unique items using “uniq -c”.
  • 57. GDA1: 3- 25 essential commands to work with Linux. Commands for files:
      COMMAND | USE                                                  | EXAMPLE
      less    | Open a file with less. Q to exit, arrows to scroll   | less myfile
      touch   | Create an empty file                                 | touch myfile
      mv      | Move a file between dirs, or change its name         | mv myfile yourfile
      rm      | Remove a file                                        | rm yourfile
      cat     | Print the file content to STDOUT                     | cat myfile
      head    | Print the first 10 lines to STDOUT                   | head myfile
      tail    | Print the last 10 lines to STDOUT                    | tail myfile
      grep    | Print the matching lines to STDOUT                   | grep 'ATG' myfile
      cut     | Cut columns and print them to STDOUT                 | cut -f1 myfile
      sort    | Sort lines and print them to STDOUT                  | sort myfile
      uniq    | Collapse repeated adjacent lines (-c to count them)  | uniq -c myfile
      sed     | Replace occurrences and print the lines to STDOUT    | sed 's/ATG/CTG/' myfile
      wc      | Count lines, words and bytes                         | wc myfile
  • 58. GDA1: 3- 25 essential commands to work with Linux. Compression and archiving commands:
      COMMAND   | USE                                               | EXAMPLE
      gzip      | Compress a file using gzip                        | gzip -c test.txt > test.txt.gz
      gunzip    | Uncompress a file compressed with gzip            | gunzip test.txt.gz
      bzip2     | Compress a file using bzip2                       | bzip2 -c test.txt > test.txt.bz2
      bunzip2   | Uncompress a file compressed with bzip2           | bunzip2 test.txt.bz2
      tar       | Archive files using tar                           | tar -cf sample.tar sample/*.txt
      tar -zcvf | Archive using tar and compress using gzip         | tar -zcvf samples.tar.gz sample/*.txt
      tar -zxvf | Unarchive using tar and uncompress using gunzip   | tar -zxvf samples.tar.gz
      tar -jcvf | Archive using tar and compress using bzip2        | tar -jcvf samples.tar.bz2 sample/*.txt
      tar -jxvf | Unarchive using tar and uncompress using bunzip2  | tar -jxvf samples.tar.bz2
  • 59. GDA1: 3- 25 essential commands to work with Linux. 21. top/htop • Display the running Linux processes. • Type “q” to quit. • “kill PID” can be used to terminate a process. • Global resource usage: %CPU / MEMORY / SWAP MEMORY. • Single-process resource usage: PID / USER / %CPU / %MEMORY / COMMAND.
  • 60. GDA1: 3- 25 essential commands to work with Linux. 22. df/du • Commands to check how much disk space is being used in the system (df -lh) or how much space a directory is using (du -lh <dir>). Examples:
      df -lh
      du -lh linux_exercises
  • 61. GDA1: 3- 25 essential commands to work with Linux. 23. wget/curl • Commands to download files from the internet. • wget can download several files at once (e.g. using “*” or “-r” for dirs). • curl writes the downloaded content to STDOUT by default, so it can be piped into other commands (using “|”). Examples:
      wget ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG3.2_release/*.fasta
      curl ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG3.2_release/ITAG3.2_proteins.fasta | grep -c ">"
  • 62. GDA1: 3- 25 essential commands to work with Linux. 24. ssh/scp • ssh = access to a remote server (the port is given with a lowercase “-p”). • scp = copy from/to a remote server (the port is given with an uppercase “-P”; in scp the lowercase “-p” only preserves file times and modes). Examples:
      ssh -p 1809 username@begonia.hort.vt.edu
      scp -P 1809 file1 username@begonia.hort.vt.edu:/dirpath        (from LOCAL to REMOTE)
      scp -P 1809 username@begonia.hort.vt.edu:/dirpath/file1 ./     (from REMOTE to LOCAL)
  • 63. GDA1: 3- 25 essential commands to work with Linux. File Permissions and Ownerships: All Unix systems are designed as multiuser operating systems. This means that different users could access, modify or delete the same files. To avoid problems, they have a file permission and ownership system that restricts who can access and modify each of the files in the computer. This system has two parts: • Who is the owner of the file? • What type of access does each of the users in the system have?
  • 64. GDA1: 3- 25 essential commands to work with Linux. Ownership: Each file is assigned a user owner and a group owner. The user owner can be: • A real user (for example: aurebg). • A virtual user created by a program (for example: mysql). • The administrator user, or root.
  • 65. GDA1: 3- 25 essential commands to work with Linux. Permissions: Each file is assigned 9 different permissions: 3 for the file user-owner (u), 3 for the group-owner (g) and 3 for everyone else (o). There are 3 types of permissions or file attributes: • Readable (r), permission to read the file. • Writable (w), permission to write the file. • Executable (x), permission to execute it as a program. The 10-character code of a Linux file: the first character is the file type (“d” for a directory, “-” for a regular file), followed by three rwx triplets for user, group and other. A letter means that the permission is switched ON and “-” means that it is switched OFF, from “----------” (all OFF) to “drwxrwxrwx” (a directory with all ON). Examples: “-r--r--r--” is readable for everyone; “-rwxr--r--” is readable for everyone, and writable and executable only for the user-owner.
  • 66. GDA1: 3- 25 essential commands to work with Linux. 25. chmod/chown • Commands to manage ownership and privileges. • To see the permissions, type “ls -lh”. • Change owner: chown user:group filename. • Change permissions: chmod permissions_code filename. Symbolic form: chmod [ugo][+-=][rwx] file. Numeric form: chmod [0-7][0-7][0-7] file, where each digit is the sum of r=4, w=2 and x=1 for user, group and other (|rwx|rwx|rwx| = |421|421|421|). Examples:
      chown aurebg:aurebg file.txt
      chmod 664 file.txt    # readable+writable by user and group, readable by anyone else
      chmod u+r file.txt    # adds the read permission for the user
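The link between the numeric and the symbolic forms can be checked on a scratch file. A minimal sketch, assuming a hypothetical file.txt in a temporary directory; “cut -c1-10” keeps only the 10-character permission code from the “ls -l” output.

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/file.txt"

# 664 = (4+2)(4+2)(4) = rw- rw- r--
chmod 664 "$tmpdir/file.txt"
perm_numeric=$(ls -l "$tmpdir/file.txt" | cut -c1-10)

# Symbolic form: remove write permission from group and other.
chmod go-w "$tmpdir/file.txt"
perm_symbolic=$(ls -l "$tmpdir/file.txt" | cut -c1-10)

# Symbolic form: add execute permission for the user-owner only.
chmod u+x "$tmpdir/file.txt"
perm_exec=$(ls -l "$tmpdir/file.txt" | cut -c1-10)

echo "$perm_numeric -> $perm_symbolic -> $perm_exec"
rm -rf "$tmpdir"
```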
  • 67. Genomic Data Analysis 1. Presentation of the Course. 2. Introduction to the Linux Operating System and the Command Line Interface. 3. 25 essential commands to work with Linux. 4. Common bioinformatics formats, from FASTAs to GFFs and VCFs. 5. 10 essential commands to play with the biological data.
  • 68. I. FASTA. GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs. FASTA is a text-based file format that stores DNA, RNA or protein sequences. It is used to represent sequences such as genomes, mRNAs, cDNAs, miRNAs… • The ID line always starts with “>” and takes exactly one line. • A space separates the ID from the optional description. • The sequence can take one or more lines.
      >SeqID1 optional_description1
      AGCGTGGAGAGCGATGAGATCAGAAAGTAGGACGACAGATGGGGAGAT
      GGCAGGTGTGGGAGGAGTTGACGATGACGTGATTGATGACGGGAGACG
      >SeqID2 optional_description2
      AGCGTGGAGAGCGATGAGATCAGAAAGTAGGCTGACAGATGGGGAGAT
      GGCAGGTGAGGGAGGAGCTGACGATGACGTGTTTGATGACGGGAGACG
      >SeqID3 optional_description3
      AGCGTGGAGAGCGATGAGATCAGAAAGTAGGACGACAGTGGGGGAGAT
      GGCAGGTGAGGGAGGAGTTGACGATGACGTGTTTGATGACGGGAGACG
  • 69. II. FASTQ. GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs. FASTQ is a text-based file format that usually stores DNA sequences. It also contains the sequencing QUALITY of each nucleotide. • The ID line always starts with “@” and takes exactly one line. • The sequence can take one or more lines. • A line starting with “+” separates the sequence from the quality string. • There is one quality character per nucleotide. Each character encodes a number from 0 to 41 (Illumina v1.8+).
      @GWNJ-0957:89:GW170928504:7:1101:2757:1309 1:N:0:NCGTCCC
      TATCTAAGTATTTGATTAATGATAGATGACGATGGAGAAATATAATCTACTTTTTT
      AAGTCCCTCATTTTCTTTCTCCATCTTTCTTTTTTATTACTCCCATTGTTCCCCAT
      +
      AAAAAFFJFJJFJJAAAAAFJJJ<FJJJJJJJJJJ7<7<<<<JJJJJJFFJJJAFJ
      F-7<<-7AFJJFJJJJJJJJAJJFJFJ<7<-7A-7FAFJA777777<7-7--7--7
      @GWNJ-0957:89:GW170928504:7:1101:3549:1309 1:N:0:NCGTCCC
      ACCATTCATTATTTTTTTATTTAGTCTTTATTACTTTACTTTCCTTCCTTCTGAAA
      TACTGCTATTGTACATAAAACAAAATGATCTACTTAAAAATAAAACAAATTTAAAA
      +
      AAA-AAJJFJJAAJAA-7AFJJ-7-<<-<AJJ--<J-<-<---77F7-A---A7--
      <777<7<7<<F-77F<J<JJ7F7AFF77<77<7777<77<---7---77---7---
  • 70. II. FASTQ QUALITY explained. GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
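A quality character can be turned back into its score with a little shell arithmetic. This is a sketch that assumes the Phred+33 encoding (the one used by Illumina v1.8+), where the score is the ASCII code of the character minus 33; the score Q is related to the error probability P by Q = -10·log10(P), so Q = 30 means 1 expected error in 1,000 bases. The function name phred_score is hypothetical.

```shell
# Convert one FASTQ quality character into its Phred score (Phred+33 assumed).
phred_score() {
    # printf with a leading quote in the argument prints the ASCII code
    # of the character that follows it.
    ascii=$(printf '%d' "'$1")
    echo $(( ascii - 33 ))
}

q_J=$(phred_score J)      # 'J' is ASCII 74
q_A=$(phred_score A)      # 'A' is ASCII 65
q_dash=$(phred_score -)   # '-' is ASCII 45

echo "J=$q_J A=$q_A -=$q_dash"
```

In the example reads, “J” is therefore the best possible quality and “-” marks a low-confidence base.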
  • 72. Practice 1.6: Getting stats for a FASTQ file. GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs. 1. Change working directory to “linux_exercises”. 2. Make a directory with the name “FASTQ1”. 3. Change working directory to “FASTQ1”. 4. Copy the following files into your current working directory: (a) /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/collection/P1_001B.fastq.gz (b) /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/collection/P1_007.fastq.gz 5. Uncompress them. 6. Run the following commands to get the stats:
      fastq-stats P1_001B.fastq
      fastq-stats P1_007.fastq
    7. Redirect the output into a file with “>” using the following commands:
      fastq-stats P1_001B.fastq > P1_001B.stats.txt
      fastq-stats P1_007.fastq > P1_007.stats.txt
  • 73. III. SAM/BAM. The SAM format (and its binary form, BAM) is designed to store the mapping of reads to a reference. It has 11 mandatory columns. GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
  • 74. III. SAM/BAM The 2nd column: FLAG defines the status of the read mapping. GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
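The FLAG is a sum of bits, each one describing a property of the read mapping, so it can be decoded with a bitwise AND. The sketch below assumes the bit values defined in the SAM specification (1 = paired, 2 = properly paired, 4 = unmapped, 16 = reverse strand, 32 = mate on the reverse strand, 64 = first in pair); flag_has is a hypothetical helper and 99 is just an example value.

```shell
# Succeeds (exit status 0) when bit $2 is set in FLAG $1.
flag_has() {
    [ $(( $1 & $2 )) -ne 0 ]
}

flag=99   # 99 = 1 + 2 + 32 + 64

if flag_has "$flag" 1;  then echo "read is paired"; fi
if flag_has "$flag" 2;  then echo "read is mapped in a proper pair"; fi
if flag_has "$flag" 4;  then echo "read is unmapped"; fi
if flag_has "$flag" 16; then echo "read maps to the reverse strand"; fi
if flag_has "$flag" 32; then echo "mate maps to the reverse strand"; fi
if flag_has "$flag" 64; then echo "read is the first in the pair"; fi
```

For FLAG 99 only the “paired”, “proper pair”, “mate on the reverse strand” and “first in the pair” messages are printed.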
  • 75. IV. GFF3. GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs. GFF3 is a text-based, tab-separated file with 9 columns. It is designed to store information about genomic features (e.g. genes, exons, repetitive elements…). More information at http://gmod.org/wiki/GFF3. Columns: seqid, source, type, start, end, score, strand, phase, attributes. Example with one mRNA (mrna0001) and its five exons (exon00001 to exon00005):
      ##gff-version 3
      ctg13  .  mRNA  1300  9000  .  +  .  ID=mrna0001;Name=GDR1
      ctg13  .  exon  1300  1500  .  +  .  ID=exon00001;Parent=mrna0001
      ctg13  .  exon  1600  1800  .  +  .  ID=exon00002;Parent=mrna0001
      ctg13  .  exon  3000  3900  .  +  .  ID=exon00003;Parent=mrna0001
      ctg13  .  exon  5000  5500  .  +  .  ID=exon00004;Parent=mrna0001
      ctg13  .  exon  7000  9000  .  +  .  ID=exon00005;Parent=mrna0001
  • 76. V. VCF. GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs. VCF is a text-based file with 8 fixed columns, plus a FORMAT column and one extra column per sample in multisample files. It contains metadata at the beginning of the file, in lines starting with “#”, explaining the different fields.
      #CHROM  POS   ID    REF  ALT  QUAL  FILTER  INFO           FORMAT    SAMPLE1
      20      1370  rs01  G    A    29    PASS    DP=14;AF=0.5   GT:GQ:DP  0/1:51:14   (E.g. 1)
      20      1730  .     T    A    3     q10     DP=11;AF=0.2   GT:GQ:DP  0/1:58:11   (E.g. 2)
      20      2121  rs02  A    G,T  67    PASS    DP=10;AF=0.5   GT:GQ:DP  1/2:23:10   (E.g. 3)
      20      6781  .     T    .    47    PASS    DP=13;AF=1     GT:GQ:DP  1/1:56:13   (E.g. 4)
    Sample column: GT:GQ:DP = 0/1:51:14 means GENOTYPE 0/1, GENOTYPE QUAL 51 and DEPTH 14. DIPLOID genotype codes: 0 = REF, 1 = ALT, “/” = NOT PHASED, “|” = PHASED. • E.g. 1 is a biallelic heterozygous SNP. • E.g. 2 is a biallelic heterozygous SNP with low quality, probably because of the mapping. • E.g. 3 is a non-biallelic (multiallelic) heterozygous SNP. • E.g. 4 is a biallelic homozygous Deletion.
  • 77. Genomic Data Analysis 1. Presentation of the Course. 2. Introduction to the Linux Operating System and the Command Line Interface. 3. 25 essential commands to work with Linux. 4. Common bioinformatics formats, from FASTAs to GFFs and VCFs. 5. 10 essential commands to play with the biological data.
  • 78. PIPELINE: Combination of commands where the input of one command is the output of the previous one. GDA1: 5- 10 essential commands to play with the biological data. Input → CMD1 | CMD2 | CMD3 | CMD4 → Output. Example:
      grep -v "#" Sispe038.gff3 | cut -f 3 | sort | uniq -c
  • 79. GDA1: 5- 10 essential commands to play with the biological data.
    (1) GET SEQUENCE NUMBER
      grep -c '>' file.fasta
    (2) GET FASTA SIZE (the newlines are removed so that only sequence characters are counted)
      grep -v ">" file.fasta | tr -d '\n' | wc -m
    (3) GET A LIST OF THE TOP TEN MOST ABUNDANT FASTA DESCRIPTIONS
      grep '>' file.fasta | cut -d ' ' -f2 | sort | uniq -c | sort -nr | head -n 10
  • 80. GDA1: 5- 10 essential commands to play with the biological data.
    (4) GET NUMBER OF TYPES IN A GFF3 FILE
      grep -v '#' file.gff | cut -f3 | sort | uniq -c
    (5) GET NUMBER OF GENES PER SEQID IN A GFF3 FILE
      grep -v '#' file.gff | cut -f1,3 | grep "gene" | sort | uniq -c
    (6) GET NUMBER OF EXONS PER mRNA IN A GFF3 FILE
      grep -v "#" file.gff | cut -f3,9 | grep "exon" | sed -r 's/.+Parent=//' | sed -r 's/;.+//' | sort | uniq -c | sed -r 's/\s+//' | cut -d ' ' -f1 | sort | uniq -c | sort -nr
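The first half of pipeline (6) can be followed step by step on a toy GFF3 file. This is a sketch under stated assumptions: the file toy.gff3 and the second transcript mrna0002 are made up, and “sed -E” with POSIX character classes is used instead of “sed -r” and “\s” only to keep the example portable across sed implementations.

```shell
tmpdir=$(mktemp -d)
{
  printf '##gff-version 3\n'
  printf 'ctg13\t.\tmRNA\t1300\t9000\t.\t+\t.\tID=mrna0001;Name=GDR1\n'
  printf 'ctg13\t.\texon\t1300\t1500\t.\t+\t.\tID=exon00001;Parent=mrna0001\n'
  printf 'ctg13\t.\texon\t1600\t1800\t.\t+\t.\tID=exon00002;Parent=mrna0001\n'
  printf 'ctg13\t.\texon\t3000\t3900\t.\t+\t.\tID=exon00003;Parent=mrna0001\n'
  printf 'ctg13\t.\tmRNA\t9500\t9900\t.\t-\t.\tID=mrna0002\n'
  printf 'ctg13\t.\texon\t9500\t9900\t.\t-\t.\tID=exon00004;Parent=mrna0002\n'
} > "$tmpdir/toy.gff3"

# Keep type and attributes, select the exons, reduce each line to the
# Parent ID and count how many times each Parent ID appears.
per_mrna=$(grep -v '^#' "$tmpdir/toy.gff3" | cut -f3,9 | grep '^exon' \
  | sed -E 's/.*Parent=//; s/;.*//' | sort | uniq -c)
echo "$per_mrna"

# Extract the exon count of one transcript: strip the leading spaces
# added by uniq -c, then take the first space-separated field.
n_exons_mrna1=$(echo "$per_mrna" | grep 'mrna0001' \
  | sed -E 's/^[[:space:]]+//' | cut -d ' ' -f1)
echo "mrna0001 has $n_exons_mrna1 exons"
rm -rf "$tmpdir"
```

Pipeline (6) then goes one step further and counts how many mRNAs share each exon number.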
  • 81. GDA1: 5- 10 essential commands to play with the biological data.
    (7) GET THE NUMBER OF VARIANTS PER CHROM IN A VCF FILE
      grep -v '#' file.vcf | cut -f1 | sort | uniq -c
    (8) GET THE NUMBER OF BIALLELIC VARIANTS IN A VCF FILE
      grep -v '#' file.vcf | cut -f4,5 | grep -v "," | wc -l
    (9) GET THE NUMBER OF BIALLELIC SNPs IN A VCF FILE
      grep -v '#' file.vcf | cut -f4,5 | grep -Ec "^.\s+.$"
    (10) GET THE NUMBER OF VARIANT IMPACTS IN A SNPEFF VCF FILE
      grep -v "#" file.SnpEff.vcf | cut -f8 | sed -r 's/.+;ANN=//' | cut -d '|' -f2 | sort | uniq -c
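Commands (7) to (9) can be tried on a toy VCF where the answers are known in advance. The file toy.vcf and its four records are made up for illustration. Note one deliberate change, flagged here as an assumption: the SNP pattern is restricted to the bases A, C, G and T, because the “.” of command (9) matches any character, including a missing ALT allele written as “.”.

```shell
tmpdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '20\t1370\trs01\tG\tA\t29\tPASS\tDP=14\n'
  printf '20\t1730\t.\tT\tA\t3\tq10\tDP=11\n'
  printf '20\t2121\trs02\tA\tG,T\t67\tPASS\tDP=10\n'
  printf '21\t500\t.\tAT\tA\t50\tPASS\tDP=12\n'
} > "$tmpdir/toy.vcf"

# Total number of variants (lines that are not metadata or header).
n_variants=$(grep -vc '^#' "$tmpdir/toy.vcf")

# Variants on chromosome 20.
n_chr20=$(grep -v '^#' "$tmpdir/toy.vcf" | cut -f1 | grep -c '^20$')

# Biallelic variants: no comma in the ALT column.
n_biallelic=$(grep -v '^#' "$tmpdir/toy.vcf" | cut -f4,5 | grep -vc ',')

# Biallelic SNPs: one base in REF and one base in ALT.
n_snps=$(grep -v '^#' "$tmpdir/toy.vcf" | cut -f4,5 \
  | grep -Ec '^[ACGT][[:space:]]+[ACGT]$')

echo "variants=$n_variants chr20=$n_chr20 biallelic=$n_biallelic SNPs=$n_snps"
rm -rf "$tmpdir"
```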
  • 82. Practice 1.7: GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs. 1. Change working directory to “linux_exercises”. 2. Make a directory with the name “VCF_ANALYSIS”. 3. Change working directory to “VCF_ANALYSIS”. 4. Copy the following file into your current working directory: /data/GDA_UFRGS2017/DATA/Nicotiana_benthamiana/resistant_popbatch01/VLS24_S1.PolCollapsedBiallelicAF1.vcf 5. Answer the following questions: 5.1. How many variants does this file have? 5.2. Ignoring Scaffolds (SeqID=Niben251ScfXXXXX), how many variants does each chromosome (SeqID=Niben251ChrYY) have? 5.3. How many biallelic SNPs does this file have? 5.4. How many biallelic SNPs with allele frequency 1 (AF=1) does each chromosome have?
  • 83. SCRIPT: Executable file written in a specific language (e.g. Bash, Perl…) that contains the commands/functions to be executed. GDA1: 5- 10 essential commands to play with the biological data. Write the script in the NANO editor (to save in nano: Ctrl+O; to exit nano: Ctrl+X), make it executable and run it with the GFF file as an external argument ($1):
      nano exons_per_mRNA.sh
      chmod 755 exons_per_mRNA.sh
      ./exons_per_mRNA.sh file1.gff
    Content of exons_per_mRNA.sh:
      #!/bin/bash
      file_gff=$1
      echo "Analyzing file $file_gff"
      date
      grep -v "#" "$file_gff" | cut -f3,9 | grep "exon" | sed -r 's/.+Parent=//' | sed -r 's/;.+//' | sort | uniq -c | sed -r 's/\s+//' | cut -d ' ' -f1 | sort | uniq -c | sort -nr
  • 84. Practice 1.8: GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs. 1. Change working directory to “linux_exercises”. 2. Make a directory with the name “MY_FIRST_SCRIPT”. 3. Change working directory to “MY_FIRST_SCRIPT”. 4. Copy the following file into your current working directory: /data/GDA_UFRGS2017/DATA/Nicotiana_benthamiana/reference/Niben251.1_genome.gene_models.sorted.gff 5. Write a script that counts the number of features of a given type per chromosome and takes two arguments: 1st=file.gff; 2nd=type.
  • 86. Genomic Data Analysis 1. Introduction to Next Generation Sequencing Technologies. 2. Experimental design for population studies, from breeding to ecological studies. 3. De-multiplexing and the complexities of sample identification. 4. Read processing and quality evaluation. 5. Read mapping to a reference. 6. Variant calling and summary of the read processing. 7. Quality evaluation and possible pitfalls.
  • 88. DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. It includes any method or technology that is used to determine the order of the four bases—adenine, guanine, cytosine, and thymine—in a strand of DNA. https://en.wikipedia.org/wiki/DNA_sequencing (Gentile et al. Nano Lett., 2012, 12 (12), pp 6453–6458) ATGCGCGTCGCGGTGAAT GDA2: 1- Introduction to Next Generation Sequencing Technologies.
  • 89. GDA2: 1- Introduction to Next Generation Sequencing Technologies. Timeline (1950 to 2020). Technologies: Electrophoresis (1952); DNA Structure (1953); Sanger DNA Sequencing (1977); AB370A Sequencer (1986); AB310 capillary Sequencer (1986); 454 Sequencer (2005); Solexa Genome Analyzer Sequencer (2006); Pacific Biosciences Sequencer (2011); Oxford Nanopore Portable sequencer (2015). Genomes: MS2 Bacteriophage (1977); Epstein-Barr Virus (1984); Haemophilus influenzae (1995); Arabidopsis thaliana (2000); Homo sapiens (2001). Sequenced Genomes as of 2016/02/04: Viridiplantae 178; Metazoa 5907; Bacteria 7897.
  • 90. Pre-NGS Era. Frederick Sanger (1918-2013), twice awarded the Nobel Prize in Chemistry. GDA2: 1- Introduction to Next Generation Sequencing Technologies.
  • 91. GDA2: 1- Introduction to Next Generation Sequencing Technologies.
  • 94. Next Generation Sequencing vs Sanger (Mardis E.R. (2013) Annual Review of Analytical Chemistry 6: 287-303). GDA2: 1- Introduction to Next Generation Sequencing Technologies.
      Next Generation Sequencing                              | Sanger
      DNA libraries need to be prepared                       | Fragment amplification
      Direct nucleotide detection based on different methods  | Physical fragment separation for detection
      Millions to billions of reads                           | Thousands of reads
      Variable size (short and long technologies)             | 400 to 900 bp read length
      Variable error rate                                     | Very low error rate
      Quantitative comparison                                 | Semicomparative comparison
  • 95. Next Generation Sequencing. [Figure: values per year (0 to 40,000) from 2005 to 2014 for the series “NGS” and “Ecology”. Graph by Dr. David Haak.] GDA2: 1- Introduction to Next Generation Sequencing Technologies.
  • 96. Next Generation Sequencing: next-generation sequencing platforms (Roche 454, Illumina GAII, ABi SOLiD, Helicos HeliScope). Source: http://www.slideshare.net/cosentia/high-throughput-equencing GDA2: 1- Introduction to Next Generation Sequencing Technologies. Workflow: Isolation and purification of target DNA → Sample preparation → Library validation → Amplification → Sequencing → Imaging → Data analysis. Amplification: cluster generation on solid-phase; emulsion PCR. Sequencing: sequencing by synthesis with 3’-blocked reversible terminators; pyrosequencing; sequencing by ligation; sequencing by synthesis with 3’-unblocked reversible terminators. Imaging: single colour imaging; four colour imaging.
  • 97. GDA2: 1- Introduction to Next Generation Sequencing Technologies.
      Technology                              | Read length (bp)     | Accuracy    | Reads/Run     | Time/Run           | Cost/Mb
      Applied Bio 3730XL (Sanger)             | 400 - 900            | 99.9%       | 384           | 4 h (12 runs/day)  | $2,400
      Roche 454 GS FLX (Pyrosequencing)       | 700 Single/Pairs     | 99.9%       | 1,000,000     | 24 h               | $10
      Illumina HiSeq4000 (Seq. by synthesis)  | 75-250 Single/Pairs  | 99%         | 5,000,000,000 | 24 to 120 h        | $0.05 to $0.15
      Illumina MiSeq (Seq. by synthesis)      | 50-300 Single/Pairs  | 99%         | 44,000,000    | 24 to 72 h         | $0.17
      SOLiD 4 (Seq. by ligation)              | 25-50 Single/Pairs   | 99.9%       | 1,400,000,000 | 168 h              | $0.13
      ION Torrent (Seq. by semiconductor)     | 170-400 Single       | 98%         | 80,000,000    | 2 h                | $2
      Pacific Biosciences Sequel (SMRT)       | 14,000 Single        | 85% (99.9%) | 1,600,000     | 4 h                | $0.6
      Oxford N. MinION (Nanopore sequencing)  | 10,000 Single        | 62% (96%)   | 4,400,000     | 48 h               | $0.02
  • 98. GDA2: 1- Introduction to Next Generation Sequencing Technologies: Libraries
  • 99. GDA2: 1- Introduction to Next Generation Sequencing Technologies: Libraries
  • 100. GDA2: 1- Introduction to Next Generation Sequencing Technologies: Libraries
  • 101. Multiplexing. GDA2: 1- Introduction to Next Generation Sequencing Technologies: Libraries. Use of DNA tags (4-7 bp) to identify samples in the same sequencing lane, cell or sector.
      Sample 1 (mRNA-1): AUGCGUU AUGCGUU UUGCGCU AAGAGUU → cDNA-1 with tag ATGC: ATGCGTTATGC ATGCGTTATGC TTGCGCTATGC AAGAGTTATGC
      Sample 2 (mRNA-2): AUGCGUU AUGUGAA UUGCGCU AAAAGUU → cDNA-2 with tag CGAG: ATGCGTTCGAG ATGTGAACGAG TTGCGCTCGAG AAAAGTTCGAG
      Pool → Sequencing: ATGCGTTCGAG ATGTGAACGAG TTGCGCTCGAG AAAAGTTCGAG ATGCGTTATGC ATGCGTTATGC TTGCGCTATGC AAGAGTTATGC
  • 102. http://www.roche.com/ GDA2: 1- Introduction to Next Generation Sequencing Technologies: 454
  • 103. Pyrosequencing technology (Mardis E.R. (2008) Trends in Genetics 24: 133-141) GDA2: 1- Introduction to Next Generation Sequencing Technologies: 454
  • 104. Pyrosequencing technology https://www.youtube.com/watch?v=rsJoG-AulNE GDA2: 1- Introduction to Next Generation Sequencing Technologies: 454
  • 105. http://454.com/products/gs-flx-system/index.asp GDA2: 1- Introduction to Next Generation Sequencing Technologies: 454
  • 106. http://www.bio-itworld.com/BioIT_Article.aspx?id=131053 GDA2: 1- Introduction to Next Generation Sequencing Technologies: 454
  • 107. http://www.illumina.com/ GDA2: 1- Introduction to Next Generation Sequencing Technologies: Illumina
  • 108. Sequence by Synthesis technology (Mardis E.R. (2013) Annual Review of Analytical Chemistry 6: 287-303) GDA2: 1- Introduction to Next Generation Sequencing Technologies: Illumina
  • 109. Sequence by Synthesis technology (Mardis E.R. (2013) Annual Review of Analytical Chemistry 6: 287-303) GDA2: 1- Introduction to Next Generation Sequencing Technologies: Illumina
  • 110. Sequence by Synthesis technology http://www.illumina.com/techniques/sequencing/dna- sequencing.html# GDA2: 1- Introduction to Next Generation Sequencing Technologies: Illumina
  • 112. https://products.appliedbiosystems.com GDA2: 1- Introduction to Next Generation Sequencing Technologies: SOLiD
  • 113. Sequence by Ligation technology GDA2: 1- Introduction to Next Generation Sequencing Technologies: SOLiD
  • 114. Sequence by Ligation technology GDA2: 1- Introduction to Next Generation Sequencing Technologies: SOLiD
  • 115. Sequence by Ligation technology GDA2: 1- Introduction to Next Generation Sequencing Technologies: SOLiD
  • 116. Sequence by Ligation technology GDA2: 1- Introduction to Next Generation Sequencing Technologies: SOLiD
  • 117. http://media.invitrogen.com.edgesuite.net/ab/ applications-technologies/solid/solid-5500.html Sequence by Ligation technology GDA2: 1- Introduction to Next Generation Sequencing Technologies: SOLiD
  • 118. https://products.appliedbiosystems.com GDA2: 1- Introduction to Next Generation Sequencing Technologies: Ion Torrent
  • 119. Sequence by Semiconductor technology. GDA2: 1- Introduction to Next Generation Sequencing Technologies: Ion Torrent. Copy DNA: A sample of DNA is cut into millions of fragments, and each fragment is attached to its own bead. The fragment is copied until it covers the bead. This automated process produces millions of beads covered with millions of different fragments. Load chip: The beads are then flowed across the chip, each being deposited into a well. Incorporate nucleotide: Then the chip is flooded with one of the four nucleotides. If the next base on the DNA strand is complementary to this nucleotide, a nucleotide will be incorporated and a hydrogen ion will be released. Detect and call: The hydrogen ion changes the pH of the solution in the well. An ion-sensitive layer beneath the well measures that pH change and converts it to voltage. This voltage change is recorded, indicating the nucleotide has been incorporated, and the base is called. This process happens simultaneously in millions of wells.
  • 120. Sequence by Semiconductor technology GDA2: 1- Introduction to Next Generation Sequencing Technologies: Ion Torrent
  • 121. Sequence by Semiconductor technology GDA2: 1- Introduction to Next Generation Sequencing Technologies: Ion Torrent
  • 122. http://www.pacb.com/ GDA2: 1- Introduction to Next Generation Sequencing Technologies: PacBio
  • 123. Single Molecule Real Time (SMRT) technology Niedringhaus et al. Analytical Chemistry 2011 GDA2: 1- Introduction to Next Generation Sequencing Technologies: PacBio
  • 124. Single Molecule Real Time (SMRT) technology http://bit.ly/1naxgTe GDA2: 1- Introduction to Next Generation Sequencing Technologies: PacBio
  • 125. Single Molecule Real Time (SMRT) technology http://genome.duke.edu/cores-and-services/sequencing-and-genomic-technologies/pacbio GDA2: 1- Introduction to Next Generation Sequencing Technologies: PacBio PacBio Sequel
  • 126. https://www.nanoporetech.com/ GDA2: 1- Introduction to Next Generation Sequencing Technologies: Oxford Nanopore
  • 127. Niedringhaus et al. Analytical Chemistry 2011 Sequence by Nanopore technology GDA2: 1- Introduction to Next Generation Sequencing Technologies: Oxford Nanopore
  • 128. GDA2: 1- Introduction to Next Generation Sequencing Technologies: Oxford Nanopore
  • 129. GDA2: 1- Introduction to Next Generation Sequencing Technologies: Oxford Nanopore
  • 130. GDA2: 1- Introduction to Next Generation Sequencing Technologies: Oxford Nanopore
  • 131. Sequence by Nanopore technology GDA2: 1- Introduction to Next Generation Sequencing Technologies
  • 132. Genomic Data Analysis 1. Introduction to Next Generation Sequencing Technologies. 2. Experimental design for population studies, from breeding to ecological studies. 3. De-multiplexing and the complexities of sample identification. 4. Read processing and quality evaluation. 5. Read mapping to a reference. 6. Variant calling and summary of the read processing. 7. Quality evaluation and possible pitfalls.
  • 133. GDA2: 2- Experimental design for population studies, from breeding to ecological studies. Population (Genetics): Group of organisms or individuals from the same geographical location with the capability of interbreeding. • Natural populations (e.g. the group of Sinningia speciosa plants that grow in the area of Pedra Lisa). • Artificial populations (e.g. an F2 segregating population of Sinningia speciosa Empress x Buzios).
  • 134. GDA2: 2- Experimental design for population studies, from breeding to ecological studies. • Natural populations: - Structure & Size. - Diversification. - Speciation. - Selection. - Drift. - Fitness. - Migration. • Artificial populations: - Genetic maps. - Geno2Pheno links. - QTLs. - GWAS. - Artificial Selection. - Domestication.
  • 135. GDA2: 2- Experimental design for population studies, from breeding to ecological studies. (1) Focusing on genotyping instead of on a correct sampling of the populations. (2) Wrong randomization of the samples. (3) Confusing geopolitical borders with biological borders. (4) Testing significance of the clustering output. (5) Misinterpretation of Mantel’s r statistic (correlation between distance matrices). (6) Interpreting a single K value without considering alternative scenarios. (7) Not taking into account loci fixation associated with an adaptive trait.
  • 136. GDA2: 2- Experimental design for population studies, from breeding to ecological studies. ✴ Focusing on genotyping instead of on a correct sampling of the populations. How many individuals are necessary per “population”? It depends on the analysis and the population. Example 1: Single dominant locus QTL Analysis. It depends on: • Recombination rate (genome size and chromosome number). • Genotyping methodology (resolution). • Loci location. Suggestion: 100 F2 individuals as a starting point, and then move to other populations (e.g. F3) or add more individuals. Example 2: Local adaptation. It depends on: • Trait analyzed. • Population structure. • Quality of the reference. • Genotyping methodology (resolution). Suggestion: 50 individuals per group as a starting point, and then move to other populations (e.g. F3) or add more individuals.
  • 137. ✴ Genotyping approaches. GDA2: 2- Experimental design for population studies, from breeding to ecological studies. Genotyping: It is the process of determining the genetic differences of an individual by examining the individual's DNA sequence. Approaches: genome sequencing, or cost-effective reduced representation approaches: 1. Targeted amplification (e.g. TruSeq Custom Amplicon). 2. Hybridization (e.g. Sequence Capture). 3. Enzymatic Digestion + Size selection (e.g. RAD-Seq / GBS). 4. RNA isolation (RNA-Seq).
  • 138. ✴ Genotyping approaches: Reduced representation approaches. GDA2: 2- Experimental design for population studies, from breeding to ecological studies. 1. Targeted amplification (e.g. TruSeq Custom Amplicon). Diagram: genome region with Gene A, Gene B, Gene C and four RE sites. Steps: Amplification → Library preparation and sequencing (different samples, different MIDs) → Fastq Files.
  • 139. ✴ Genotyping approaches: Reduced representation approaches. GDA2: 2- Experimental design for population studies, from breeding to ecological studies. 2. Hybridization (e.g. Sequence Capture). Diagram: genome region with Gene A, Gene B, Gene C and four RE sites; adapters carry MID and PCR sites. Steps: Fragmentation → Amplification and Lib. preparation (different samples, different MIDs) → DNA Capture → Sequencing → Fastq Files.
  • 140. ✴ Genotyping approaches: Reduced representation approaches. GDA2: 2- Experimental design for population studies, from breeding to ecological studies. 3. Enzymatic Digestion + Size selection (e.g. RAD-Seq / GBS). Diagram: genome region with Gene A, Gene B, Gene C and four RE sites; adapters carry RE, MID and PCR sites. Steps: Digestion → Adapters ligation (different samples, different MIDs) → Amplification (size selection ~500 bp) → Sequencing → Fastq Files. Genotyping-By-Sequencing (GBS): Elshire et al. 2011 PLoS ONE 6:e19379.
  • 141. ✴ Genotyping approaches: Reduced representation approaches. GDA2: 3- Experimental design for population studies, from breeding to ecological studies. 4.RNA isolation (RNA-Seq) Gene A Gene B Gene C RE site RE site RE site RE site Gene expression RNA extraction and cDNA synthesis Library preparation and sequencing Fastq Files Different samples Different MIDs
  • 142. Genomic Data Analysis 1. Introduction to Next Generation Sequencing Technologies. 2. Experimental design for population studies, from breeding to ecological studies. 3. De-multiplexing and the complexities of sample identification. 4. Read processing and quality evaluation. 5. Read mapping to a reference. 6. Variant calling and summary of the read processing. 7. Quality evaluation and possible pitfalls.
  • 143. GDA2: 3- De-multiplexing and the complexities of sample identification. Multiplexing Use of DNA tags (4-7 bp) to identify samples in the same sequencing lane, cell or sector. mRNA-1 mRNA-2 Sample 1 Sample 2 cDNA-1-tag_ATGC cDNA-2-tag_CGAG ATGC CGAG AUGCGUU AUGCGUU UUGCGCU AAGAGUU AUGCGUU AUGUGAA UUGCGCU AAAAGUU ATGCGTTATGC ATGCGTTATGC TTGCGCTATGC AAGAGTTATGC ATGCGTTCGAG ATGTGAACGAG TTGCGCTCGAG AAAAGTTCGAG } Pool ATGCGTTCGAG ATGTGAACGAG TTGCGCTCGAG AAAAGTTCGAG ATGCGTTATGC ATGCGTTATGC TTGCGCTATGC AAGAGTTATGC Sequencing
  • 144. GDA2: 3- De-multiplexing and the complexities of sample identification. De-Multiplexing Identification of the sequenced DNA samples using the DNA tag ATGCGTTCGAG ATGTGAACGAG TTGCGCTCGAG AAAAGTTCGAG ATGCGTTATGC ATGCGTTATGC TTGCGCTATGC AAGAGTTATGC Sequencing Demultiplexing ATGCGTTCGAG ATGTGAACGAG TTGCGCTCGAG AAAAGTTCGAG ATGCGTTATGC ATGCGTTATGC TTGCGCTATGC AAGAGTTATGC Sample 1 Sample 2
  • 145. GDA2: 3- De-multiplexing and the complexities of sample identification. De-Multiplexing Identification of the sequenced DNA samples using the DNA tag ATGCGTTCGAG ATGTGAACGAG TTGCGCTCGCG AAAAGTTCGAG ATGCGTTATCC ATGCGTTATGC TTGCGCTATGC AAGAGTTATGC Sequencing Demultiplexing ATGCGTTCGAG ATGTGAACGAG AAAAGTTCGAG ATGCGTTATGC TTGCGCTATGC AAGAGTTATGC TTGCGCTCGCG ATGCGTTATCC ? Sample 1 Sample 2
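The two de-multiplexing slides can be sketched in a few lines of Python. This is a minimal illustration, not how any of the tools in the course work internally: it assumes the 4 bp tag sits at the end of the read, as drawn on the slides, and it only accepts exact tag matches, so the two reads with a sequencing error in the tag end up in an "unassigned" bin. The function name `demultiplex` is hypothetical.

```python
from collections import defaultdict

# Tags taken from the slides (cDNA-1-tag_ATGC, cDNA-2-tag_CGAG).
TAGS = {"ATGC": "Sample 1", "CGAG": "Sample 2"}
TAG_LEN = 4


def demultiplex(reads, tags=TAGS, tag_len=TAG_LEN):
    """Group reads by the exact tag at their 3' end and strip the tag."""
    bins = defaultdict(list)
    for read in reads:
        insert, tag = read[:-tag_len], read[-tag_len:]
        # Reads whose tag is not in the table cannot be assigned safely.
        bins[tags.get(tag, "unassigned")].append(insert)
    return dict(bins)


# Pooled reads from the slide, including two tags with one error each.
reads = ["ATGCGTTCGAG", "ATGTGAACGAG", "TTGCGCTCGCG", "AAAAGTTCGAG",
         "ATGCGTTATCC", "ATGCGTTATGC", "TTGCGCTATGC", "AAGAGTTATGC"]

result = demultiplex(reads)
for sample, inserts in sorted(result.items()):
    print(sample, inserts)
```

Real de-multiplexers often tolerate one mismatch in the tag, which is only safe when the tags in the set are far enough apart.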
  • 146. De-Multiplexing Keys for barcode/tag designing (GBS/RADseq): • The barcode does not contain or recreate the enzyme cut site. • Any barcode in a set is at least two substitutions away from any other barcode. • They vary in length as a set (to avoid the all cut site bases appearing at the same positions in the sequencing read). • They contain the complementary sticky end to the enzyme cut site. GDA2: 3- De-multiplexing and the complexities of sample identification. http://www.maizegenetics.net/genotyping-by-sequencing-gbs
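Three of the design rules above are easy to check automatically. The sketch below is a simplification under stated assumptions: barcodes of different lengths are compared over the length of the shorter one, only cut sites fully contained in a barcode are detected (not sites recreated together with the flanking sequence), and the barcodes and cut sites in the example are illustrative values rather than a published set. The function names are hypothetical.

```python
from itertools import combinations


def substitutions(a, b):
    """Count differing positions, compared over the shorter barcode."""
    n = min(len(a), len(b))
    return sum(x != y for x, y in zip(a[:n], b[:n]))


def check_barcodes(barcodes, cut_sites, min_dist=2):
    """Return a list of human-readable problems found in a barcode set."""
    problems = []
    for bc in barcodes:
        for site in cut_sites:
            if site in bc:
                problems.append(f"{bc} contains cut site {site}")
    for a, b in combinations(barcodes, 2):
        d = substitutions(a, b)
        if d < min_dist:
            problems.append(f"{a} and {b} differ by only {d} substitution(s)")
    if len({len(bc) for bc in barcodes}) < 2:
        problems.append("all barcodes have the same length")
    return problems


# Illustrative cut sites (GCWGC written out as its two sequences).
cut_sites = ("GCAGC", "GCTGC")
good = ["CTCC", "TGCA", "ACTAG", "GAGATA"]
bad = ["CTCC", "CTCA", "GCAGCT"]

print(check_barcodes(good, cut_sites))
for problem in check_barcodes(bad, cut_sites):
    print(problem)
```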
  • 147. GDA2: 3- De-multiplexing and the complexities of sample identification. De-Multiplexing Identification of the sequenced DNA samples using the DNA tag Software RE Link Fastx-toolkit (Barcode splitter) No http://hannonlab.cshl.edu/fastx_toolkit/ Ea-utils (Fastq-multx) No https://expressionanalysis.github.io/ea-utils/ GBSX Yes https://github.com/GenomicsCoreLeuven/GBSX TASSEL Yes http://www.maizegenetics.net/tassel
  • 148. Genomic Data Analysis 1. Introduction to Next Generation Sequencing Technologies. 2. Experimental design for population studies, from breeding to ecological studies. 3. De-multiplexing and the complexities of sample identification. 4. Read processing and quality evaluation. 5. Read mapping to a reference. 6. Variant calling and summary of the read processing. 7. Quality evaluation and possible pitfalls.
  • 149. Fastq raw Fastq Processed Reads processing and filtering 1. Low quality reads (qscore) (Q30) 2. Short reads (L50) 3. PCR duplications (Only Genomes). 4. Contaminations. 5. Corrections Mapped Reads Assembled Reads Other Analysis GDA2: 4- Read processing and quality evaluation.
  • 150. 0- Read Quality Evaluation • Did the sequencing produce the expected number of reads? READ COUNTS • Do the reads have the expected average length? AVERAGE READ LENGTH • Do the reads have the expected nucleotide qscore? QSCORE NUCLEOTIDE BOXES Technology dependent GDA2: 4- Read processing and quality evaluation.
  • 151. FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) 0- Read Quality Evaluation GDA2: 4- Read processing and quality evaluation.
  • 152. 1- Quality filtering • Generally associated with a minimum length and a minimum qscore (at the read ends, as an average, or as a minimum for all the nucleotides). Technology: min. length (bp), min. qscore. 454: 100, 20. Illumina: 50, 30. SOLiD: 20, 30. Ion Torrent: 50, 20. PacBio: 1000, NA. Oxford Nanopore: 1000, NA. GDA2: 4- Read processing and quality evaluation.
  • 153. 1- Quality filtering • Tools and processing time vary depending on the technology. Software Link Fastx-toolkit http://hannonlab.cshl.edu/fastx_toolkit/ Ea-utils https://expressionanalysis.github.io/ea-utils/ PrinSeq http://prinseq.sourceforge.net/ Trimmomatic http://www.usadellab.org/cms/?page=trimmomatic e.g. running the ea-utils command: fastq-mcf -q 30 -l 50 -o s01_Q30L50_R1.fq Illumina_Adapters.fa s01_R1.fq GDA2: 4- Read processing and quality evaluation.
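To make the Q30/L50 thresholds concrete, here is a small Python sketch of a length plus mean-quality filter. It is an assumption-laden simplification: it decodes Phred+33 qualities and filters on the mean qscore of the whole read, whereas tools such as fastq-mcf trim low-quality ends and adapters before applying the length cutoff. The records and function names are made up for the example.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def passes(seq, quality, min_len=50, min_q=30):
    """Keep reads that are long enough and have a high enough mean qscore."""
    return len(seq) >= min_len and mean_phred(quality) >= min_q


# Hypothetical reads: (name, sequence, quality string).
records = [
    ("read1", "ACGT" * 15, "I" * 60),  # 60 bp, every base Q40
    ("read2", "ACGT" * 15, "5" * 60),  # 60 bp, every base Q20
    ("read3", "ACGT" * 5, "I" * 20),   # 20 bp, every base Q40
]

kept = [name for name, seq, qual in records if passes(seq, qual)]
print(kept)
```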
  • 154. Practice 2.1: Process reads and get stats. 1. Make a new directory called ‘sinningia_genotyping’. 2. Change the working directory to ‘sinningia_genotyping’. 3. Make a directory with the name “00_raw”. 4. Change working directory to “00_raw”. 5. Copy four fastq files and the “IlluminaAdapters_V2.fasta” from /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/collection/ to your current working directory. 6. Get the stats for the raw reads using “fastq-stats”. 7. Process the reads using “fastq-mcf” with a min. quality score of 30 and a min. length of 50 bp (note: you can use a script). An example of the command could be something like: fastq-mcf -q 30 -l 50 -o P1_003_Q30L50.fq IlluminaAdapters_V2.fasta P1_003.fastq.gz 8. Make a directory one level up (../) with the name “01_processed”. 9. Move the outputs from “fastq-mcf” to “../01_processed”. 10. Get the stats for the processed reads using “fastq-stats”. GDA2: 4- Read processing and quality evaluation.
  • 155. Genomic Data Analysis 1. Introduction to Next Generation Sequencing Technologies. 2. Experimental design for population studies, from breeding to ecological studies. 3. De-multiplexing and the complexities of sample identification. 4. Read processing and quality evaluation. 5. Read mapping to a reference. 6. Variant calling and summary of the read processing. 7. Quality evaluation and possible pitfalls.
  • 156. Read Mapping: the process of finding the location of a read by comparing its sequence with the sequence of a reference. ATGGCGTGGCAGCGACCAGTGACCAGTGACGTGTGCAGACGTGATATGCAG GCAGCGACCAGCGA ||||||||||| || 1........10........20........30........40........50 ref read ref:9..22 Sequence Alignment: In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. http://en.wikipedia.org/wiki/Sequence_alignment GDA2: 5- Read mapping to a reference.
  • 157. Read Mapping Considerations: • Length of the read. • Number of reads. • Size of the reference. Short reads (NGS) Millions of sequences Long reads (Chromosomes) Dozens of sequences Medium reads (Genes/Transcripts) Thousands of sequences GDA2: 5- Read mapping to a reference.
  • 158. Read Mapping Software: ATGGCGTGGCAGCGACCAGTGACCAGTGACGTGTGCAGACGTGATATGCAG ref database/indexes GCAGCGACCAGCGA read seeds (kmers) GCAGCGACCA CAGCGACCAG AGCGACCAGC GCGACCAGCG CGACCAGCGA search GCAGCGACCA ATGGCGTGGCAGCGACCAGTGACCAGTGACGTGTGCAGACGTGATATGCAG extension GCAGCGACCAGCGA GDA2: 5- Read mapping to a reference.
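The index, seed, search and extension steps of this slide can be sketched as follows. This is a toy illustration that assumes a plain k-mer dictionary as the index and an ungapped extension that only counts mismatches; real mappers such as Bowtie2 or BWA use a Burrows-Wheeler based index and gapped extension. The function names are hypothetical; the reference and the read are the ones on the slide.

```python
from collections import defaultdict


def build_index(ref, k):
    """Index every k-mer of the reference by its 0-based start position."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def map_read(read, ref, index, k, max_mismatches=2):
    """Seed with exact k-mer matches, then extend without gaps."""
    hits = {}
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(ref) or start in hits:
                continue
            window = ref[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())


ref = "ATGGCGTGGCAGCGACCAGTGACCAGTGACGTGTGCAGACGTGATATGCAG"
read = "GCAGCGACCAGCGA"
index = build_index(ref, 10)
hits = map_read(read, ref, index, 10)

for start, mismatches in hits:
    # Report 1-based, inclusive coordinates, as is usual for mapping output.
    print(f"ref:{start + 1}..{start + len(read)} mismatches={mismatches}")
```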
  • 159. ATGACGTGC GCCGTGCTG find seed (perfect match l=4) extension (mismatches allowed) ATGACGTGC GCCGTGCTG ATGACGTGC GCCGTGCTG | ||||| A T G A C G T G C 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 G -1 -1 -2 -1 -4 -5 -4 -7 -6 -9 C -2 -2 -2 -3 -2 -3 -6 -5 -8 -5 C -3 -3 -3 -3 -4 -1 -4 -7 -6 -7 G -2 -4 -4 -2 -4 -5 0 -5 -6 -7 T -3 -3 -3 -5 -3 -5 -6 1 -6 -7 G -4 -4 -4 -2 -6 -4 -4 -7 2 -7 C -5 -5 -5 -5 -3 -5 -5 -5 -8 3 T -4 -6 -4 -6 -6 -4 -6 -4 -6 -9 G -5 -5 -7 -3 -8 -7 -3 -7 -3 -7 • Smith–Waterman algorithm • Needleman–Wunsch algorithm • Burrows-Wheeler index Short reads (NGS) Millions of sequences Long reads (Chromosomes) Dozens of sequences Medium reads (Genes/Transcripts) Thousands of sequences GDA2: 5- Read mapping to a reference. ||||
  • 160. Read Mapping Software: Name Type Input Output Mauve Long sequences Fasta, GenBank backbone (positions) XMFA (alignments) LastZ/MultiZ Long sequences Fasta several (maf, sam…) Blast Medium sequences Fasta Blast formats (0 text file, 6 tabular file) Blat Medium sequences Fasta Blast formats + Blat tabular format Bowtie Short sequences Fasta, Fastq Sam BWA Short sequences Fasta, Fastq Sam Novoalign Short sequences Fasta, Fastq Sam SOAP Short sequences Fasta, Fastq Sam Stampy Short sequences Fasta, Fastq Sam GDA2: 5- Read mapping to a reference.
  • 163. Practice 2.2: Read mapping and get stats. 1. Change the working directory to ‘sinningia_genotyping’. 2. Make a directory with the name “02_mapped”. 3. Change working directory to “01_processed”. 4. Map each of the processed read files to the reference “../../linux_exercises/Sispe_ref/Sispe038ReducedRef.fa”, using the index created with bowtie2-build in Day 1: Practice 3. Write the output to the directory “../02_mapped/”. An example of a command could be: bowtie2 -p 2 -t -x ../../linux_exercises/Sispe_ref/Sispe038ReducedRef -U P1_009_Q30L50.fq -S ../02_mapped/P1_009_Q30L50.sam 5. Change working directory to “../02_mapped”. 6. Count how many mapped reads each sam file has using “samtools”, for example with the following command: samtools view -c -F 4 -S P1_009_Q30L50.sam GDA2: 5- Read mapping to a reference.
  • 164. Practice 2.2: Read mapping and get stats. 7. Filter the sam file, removing the reads without hits (flag 0x4), and convert it to bam. samtools view -F 4 -Sb -o P1_009.bam P1_009_Q30L50.sam 8. Merge all the bam files using ‘bamaddrg’. As sample names use the names from the file “SampleNamesMappingFile.txt” (e.g. the name for P1_003 is “Purple_Dreaming”; do not use spaces). An example of a command could be: /data/software/bamaddrg/bamaddrg -b P1_003.bam -s Purple_Dreaming -b P1_009.bam -s Merry_Christmas -b P1_014.bam -s White -b P1_021.bam -s Good_Morning > SispeUser00_merged.bam 9. Sort the merged bam file with “samtools sort”. samtools sort -o SispeUser00_sorted.bam SispeUser00_merged.bam 10. Delete the sam files and the unsorted bam files. GDA2: 5- Read mapping to a reference.
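The `-F 4` option used in steps 6 and 7 excludes records whose FLAG has the bit 0x4 (read unmapped) set. The sketch below shows the same bit test in Python on a few hypothetical, truncated SAM records; it illustrates the flag logic only and is not a replacement for samtools.

```python
UNMAPPED = 0x4  # SAM FLAG bit: the read has no alignment


def is_mapped(sam_line):
    """True if the FLAG column (2nd field) does not have bit 0x4 set."""
    flag = int(sam_line.split("\t")[1])
    return (flag & UNMAPPED) == 0


def count_mapped(lines):
    """Count alignment records that pass the 0x4 test, skipping @ headers."""
    return sum(1 for line in lines
               if not line.startswith("@") and is_mapped(line))


# Hypothetical SAM records, truncated to the first columns.
sam_lines = [
    "@SQ\tSN:Scf0001\tLN:1000",
    "read1\t0\tScf0001\t100\t42\t14M",   # mapped, forward strand
    "read2\t4\t*\t0\t0\t*",              # unmapped
    "read3\t16\tScf0001\t250\t42\t14M",  # mapped, reverse strand
]

print(count_mapped(sam_lines))
```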
  • 165. Genomic Data Analysis 1. Introduction to Next Generation Sequencing Technologies. 2. Experimental design for population studies, from breeding to ecological studies. 3. De-multiplexing and the complexities of sample identification. 4. Read processing and quality evaluation. 5. Read mapping to a reference. 6. Variant calling and summary of the read processing. 7. Quality evaluation and possible pitfalls.
  • 166. Genetic variation is the genetic difference both within and among populations. • Structural differences: Structural Variations (SVs), Copy Number Variation (CNV). • Molecular differences (changes in the DNA sequence). • Single Nucleotide Variants/Polymorphisms (SNVs/SNPs) • Insertions/deletions Variants/Polymorphisms (INDELs/DIVs/DIPs) • Multiple Nucleotide Variants/Polymorphisms (MNVs/MNPs) A polymorphism is a DNA sequence variation that is common in the population. SNVs/SNPs: Sample 1 GACGTGC, Sample 2 GCCGTGC. INDELs/DIVs/DIPs: Sample 1 GACGTGC, Sample 2 G-CGTGC. MNVs/MNPs: Sample 1 GACGTGC, Sample 2 GCTGTGC. GDA2: 6- Variant calling and summary of the read processing.
  • 167. Processed Reads Mapped Reads Processed Map Variants Read mapping Local realignment, sort and filtering Variant calling Annotated Variants Variant annotation Variant calling: • Heuristic methods (read depth) - SamTools - VarScan • Probabilistic methods (bayesian) - GATK - FreeBayes - SOAPsnp/SOAPindel Variant filtering GDA2: 6- Variant calling and summary of the read processing.
  • 168. Variant Calling Software (Name, Type, Strengths, Weaknesses): SamTools (Heuristic). Strengths: • Assumes errors are non-independent (matches data) • Good accuracy with low coverage data • Reasonably quick. Weaknesses: • Increased false positives at high coverage • Lower quality indel calling. GATK (Probabilistic). Strengths: • Trains with real data • Excellent accuracy with high coverage data • Low false positive rate. Weaknesses: • Assumes errors are independent • High level of preprocessing • Very slow. FreeBayes (Probabilistic). Strengths: • Combined bam population estimate • Good accuracy with low coverage data • Very quick. Weaknesses: • No training, population level estimate only • Lower quality indel calling. GDA2: 6- Variant calling and summary of the read processing.
  • 169. GDA2: 6- Variant calling and summary of the read processing.
  • 170. Processed Reads Mapped Reads Processed Map Variants Read mapping Local realignment, sort and filtering Variant calling Annotated Variants Variant annotation Variant filtering: - VCFTools - GATK Variant filtering Variant annotation: - SnpEff GDA2: 6- Variant calling and summary of the read processing.
  • 171. Practice 2.3: Variant calling. 1. Change the working directory to ‘sinningia_genotyping’. 2. Make a directory with the name “03_variants”. 3. Change working directory to “02_mapped”. 4. Create an index with “samtools index” for the sorted bam file: samtools index SispeUser00_sorted.bam 5. Run “freebayes” with --min-base-quality 30 --min-mapping-quality 20 --min-coverage 5 with a command such as: freebayes -b SispeUser00_sorted.bam -f ../../linux_exercises/Sispe_ref/Sispe038ReducedRef.fa -v ../03_variants/SispeUser00.vcf --min-coverage 5 -q 30 -m 20 6. Count how many variants the VCF file has, including a breakdown of the variants per type (SNP, INDEL, MNP and Complex). GDA2: 6- Variant calling and summary of the read processing.
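One possible way to approach step 6 is to tally the `TYPE` tag that freebayes writes in the INFO column (values such as snp, mnp, ins, del and complex). The sketch below is a hedged illustration on hypothetical VCF lines; it assumes the INFO column is the 8th field and that multi-allelic records list one type per alternative allele, separated by commas. The function name is hypothetical.

```python
from collections import Counter


def count_types(vcf_lines):
    """Tally the INFO/TYPE values of every alternative allele."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines carry no variants
        info = line.rstrip("\n").split("\t")[7]
        fields = dict(f.split("=", 1) for f in info.split(";") if "=" in f)
        for vtype in fields.get("TYPE", "unknown").split(","):
            counts[vtype] += 1
    return counts


# Hypothetical records; only the first eight VCF columns are shown.
vcf_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Scf1\t10\t.\tA\tG\t55.2\t.\tDP=20;TYPE=snp",
    "Scf1\t25\t.\tCA\tC\t30.0\t.\tDP=12;TYPE=del",
    "Scf1\t40\t.\tGC\tAT\t90.0\t.\tDP=30;TYPE=mnp",
    "Scf1\t60\t.\tG\tA,GT\t90.0\t.\tDP=30;TYPE=snp,ins",
]

counts = count_types(vcf_lines)
indels = counts["ins"] + counts["del"]
print(dict(counts), "INDELs:", indels)
```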
  • 172. Genomic Data Analysis 1. Introduction to Next Generation Sequencing Technologies. 2. Experimental design for population studies, from breeding to ecological studies. 3. De-multiplexing and the complexities of sample identification. 4. Read processing and quality evaluation. 5. Read mapping to a reference. 6. Variant calling and summary of the read processing. 7. Quality evaluation and possible pitfalls.
  • 173. Methods for Variant Evaluation • Validation by Sanger Sequencing of specific candidates (~5 - 500) using other datasets (e.g. transcriptome) if it is possible. • Comparison with other method (e.g. genotyping chip). • Different mapping and variant calling tools comparison (with a “truth set” or a “gold standard” if it is possible). GDA2: 7- Quality evaluation and possible pitfalls. https://gatkforums.broadinstitute.org/gatk/discussion/6308/evaluating-the-quality-of-a-variant-callset
  • 174. • Validation by Sanger Sequencing of specific candidates (~5 - 500) using other datasets (e.g. transcriptome) if it is possible. GDA2: 7- Quality evaluation and possible pitfalls. Variants from RNASeq (Illumina) Variants from ESTs (Sanger)
  • 175. • Different mapping, variant calling tools and datasets comparison (with a “truth set” or a “gold standard” if it is possible). GDA2: 7- Quality evaluation and possible pitfalls. Assumptions: 1. The content of the truth set has been validated. 2. Your samples are expected to have similar genomic content as the population of samples that was used to produce the truth set
  • 176. Metrics: 1. Variant level concordance: Percentage of variants in your samples that match (are concordant with) variants in your truth set. 2. Genotype concordance: Percentage of genotype calls in your samples that match (are concordant with) the genotype calls in your truth set. • Different mapping, variant calling tools and datasets comparison (with a “truth set” or a “gold standard” if it is possible). GDA2: 7- Quality evaluation and possible pitfalls. False Positives (FP) False Negatives (FN) True Positives (TP) My Dataset (16) True Set (18) % SENSITIVITY: TP * 100 / (TP + FN) = 13 * 100 / (13 + 5) = 72.2% % FALSE DISCOVERY RATE: FP * 100 / (TP + FP) = 3 * 100 / (13 + 3) = 18.8% % GT CONCORDANCE: SumMatches * 100 / TP = 6 * 100 / 11 = 54.5% True Set (9): A * T C T C C * C A C My Dataset (8): A T T C * C C T * A * Matches (6): 1 0 1 1 0 1 1 0 0 1 0
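The arithmetic on this slide can be reproduced with a few helper functions. This is a small sketch using the counts shown in the figure (13 true positives, 5 false negatives, 3 false positives, and 6 matching genotypes out of the 11 compared sites); the function names are hypothetical.

```python
def sensitivity(tp, fn):
    """Percentage of truth-set variants that were recovered."""
    return 100 * tp / (tp + fn)


def false_discovery_rate(tp, fp):
    """Percentage of called variants that are absent from the truth set."""
    return 100 * fp / (tp + fp)


def gt_concordance(matches):
    """Percentage of compared sites where both genotypes agree."""
    return 100 * sum(matches) / len(matches)


tp, fn, fp = 13, 5, 3
matches = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0]  # per-site flags from the slide

print(f"Sensitivity:    {sensitivity(tp, fn):.1f}%")
print(f"FDR:            {false_discovery_rate(tp, fp):.1f}%")
print(f"GT concordance: {gt_concordance(matches):.1f}%")
```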
  • 177. Metrics: 3. Number of SNPs and INDELs: It should be consistent between different datasets for the same number of mapped reads. 4. TiTv Ratio: If substitutions were random, the ratio of transition (Ts) to transversion (Tv) SNPs would be ~0.5, because there are twice as many possible transversions as transitions. Methylation islands (CpG) and other factors may introduce a bias, so expected values range from 0.5 to 3.0. 5. Ratio Insertions/Deletions: It should be close to 1, except for rare alleles, where it could be 0.2 - 0.5. • Different mapping, variant calling tools and datasets comparison (with a “truth set” or a “gold standard” if it is possible). GDA2: 7- Quality evaluation and possible pitfalls.
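The ~0.5 expectation for random substitutions follows from simple counting: of the 12 possible single-base substitutions, 4 are transitions (A<->G, C<->T) and 8 are transversions. A minimal Python sketch, with hypothetical function names, that classifies SNPs and computes the ratio:

```python
from itertools import permutations

PURINES = {"A", "G"}


def is_transition(ref, alt):
    """Purine<->purine or pyrimidine<->pyrimidine changes are transitions."""
    return (ref in PURINES) == (alt in PURINES)


def ts_tv_ratio(snps):
    """Ts/Tv ratio for a list of (ref, alt) single-base substitutions."""
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")


# Every possible substitution once: 4 transitions and 8 transversions.
all_substitutions = list(permutations("ACGT", 2))
print(ts_tv_ratio(all_substitutions))
```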
  • 178. Comparison between different tools: • Different mapping, variant calling tools and datasets comparison (with a “truth set” or a “gold standard” if it is possible). GDA2: 7- Quality evaluation and possible pitfalls. https://bcbio.wordpress.com/
  • 179. Tools: • Different mapping, variant calling tools and datasets comparison (with a “truth set” or a “gold standard” if it is possible). GDA2: 7- Quality evaluation and possible pitfalls. Name URL VariantEvaluation (GATK) https://software.broadinstitute.org/gatk/documentation/tooldocs/current/ org_broadinstitute_gatk_tools_walkers_varianteval_VariantEval.php GenotypeConcordance (GATK) https://software.broadinstitute.org/gatk/documentation/tooldocs/current/ org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeConcordance.php VCFTools http://vcftools.sourceforge.net/ VCFStats http://lindenb.github.io/jvarkit/ PicardTools https://broadinstitute.github.io/picard/index.html
  • 181. Genomic Data Analysis 1. Variant filtering. 2. Simple stats for the variant analysis. 3. Variant visualization tools: IGV and TASSEL. 4. Changing formats for VCF files. 5. Example 1: Population analysis with Structure for Sinningia speciosa. 6. Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
  • 183. Variant filtering is the process of removing low quality or otherwise unsuitable variants (e.g. non-biallelic, complex…) before the downstream analysis. It depends on: 1. Source and methodology used to generate the data (library preparation errors and biases). 2. Sequencing technology (read sequencing errors) and amount of data (insufficient depth/sites). 3. Software used for mapping (mapping errors) and variant calling (errors produced at low coverage/low complexity sites). 4. Reliability (low quality/incomplete) and nature (genomic differences/polyploidy) of the reference genome. 5. Type of population (e.g. F2 population) and type of analysis that will be performed. GDA3: 1-Variant Filtering
  • 184. Variant filtering Two major sources of error (Li et al. 2014): • Erroneous realignment in low-complexity regions • Incomplete reference genome with respect to the sample GDA3: 1-Variant Filtering “The error rate of raw genotype calls is as high as 1 in 10-15 kb, but the error rate of post-filtered calls is reduced to 1 in 100-200 kb without significant compromise on the sensitivity”. More data is not always better. High quality/reliable data
  • 185. Alignment problems GDA3: 1-Variant Filtering coordinates 12345678901234 5678901234567890123456 reference aggttttttataac---aattaagtctacagagcaacta sample aggttttttataacAATaattaagtctacagagcaacta read1 aggttttttataac***aaAtaa read2 ggttttttataac***aaAtaaTt read3 ttttataacAATaattaagtctaca read4 CaaT***aattaagtctacagagcaac read5 aaT***aattaagtctacagagcaact read6 T***aattaagtctacagagcaacta Aligners calculate the alignment correctness and give it a score depending of: • Length of the alignment. • Number of mismatches and gaps. • Uniqueness of the alignment (number of hits). }Good alignment Misaligned bases
  • 186. Alignment problems GDA3: 1-Variant Filtering Misaligned bases - Solutions: • Read realignment (IndelRealigner for GATK (obsolete), now it is integrated in the HaplotypeCaller). • Mark alignment quality per base (BAQ) and do not use for variant calling.
  • 187. Library preparation problems GDA3: 1-Variant Filtering PCR duplications produce biases in the variant call (e.g. het.) • Library specific problem for Whole Genome Sequencing. Gene A Gene B Gene C Fragmentation Library preparation PCR Duplication
  • 188. PCR duplications - Solutions: • Mark or remove duplicates with tools such as samtools rmdup Library preparation problems GDA3: 1-Variant Filtering SKIP THE PCR DUPLICATION MARKING STEP FOR GBS, RAD-SEQ… CAREFUL: Some reduced representation techniques with unequal ratios of site amplification WILL PRODUCE THOUSANDS OF PCR DUPLICATES
  • 189. Library preparation problems GDA3: 1-Variant Filtering Sequencing errors produce biases in the variant call.
  • 190. Library preparation problems GDA3: 1-Variant Filtering Sequencing errors - Solutions: • High coverage (> 20 X) to minimize the effect of sequencing errors. • Recalibrate bases (Base Quality Score Recalibration - BQSR) using tools such as BaseRecalibrator.
  • 191. GDA3: 1-Variant Filtering Variant filtering: https://software.broadinstitute.org/gatk/best-practices/
  • 192. GDA3: 1-Variant Filtering Variant filtering: https://bcbio.wordpress.com/2013/10/21/updated-comparison-of-variant- detection-methods-ensemble-freebayes-and-minimal-bam-preparation-pipelines/
  • 193. GDA3: 1-Variant Filtering Variant filtering: Three general purpose callers: • FreeBayes (v0.9.9.2-18) • GATK UnifiedGenotyper (2.7-2) • GATK HaplotypeCaller (2.7-2) • Skipping base recalibration and indel realignment had almost no impact on the quality of resulting variant calls • FreeBayes outperforms the GATK callers on both SNP and indel calling. The most recent versions of FreeBayes have improved sensitivity and specificity which puts them on par with GATK HaplotypeCaller. • GATK HaplotypeCaller is all around better than the UnifiedGenotyper.
  • 194. Variant filtering: Software Filters Depth Het. Var. Quality Mapping Quality Allele Freq. Position/Distance HWE Missing VCFTools Yes Yes Yes No Yes Yes Yes Yes SnpSift* Yes Yes Yes No Yes Yes No No Vardict Yes No Yes No Yes No No No GATK Yes Yes Yes Yes Yes Yes No Yes * It depends on the tags present in the VCF file GDA3: 1-Variant Filtering
  • 195. GDA3: 1-Variant Filtering Examples using VCFTools 1. Remove variants with low quality (QUAL < 20). vcftools --vcf input.vcf --minQ 20 --recode --recode-INFO-all --out out 2. Remove variants with mean depth DP < 10. vcftools --vcf input.vcf --min-meanDP 10 --recode --recode-INFO-all --out out 3. Keep variants separated by at least 1000 bp. vcftools --vcf input.vcf --thin 1000 --recode --recode-INFO-all --out out 4. Keep only biallelic variants. vcftools --vcf input.vcf --min-alleles 2 --max-alleles 2 --recode --recode-INFO-all --out out 5. Remove variants with missing genotypes. vcftools --vcf input.vcf --max-missing 1.0 --recode --recode-INFO-all --out out
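To show what the quality and biallelic filters do to individual records, here is a rough Python sketch applied to hypothetical VCF lines. It only approximates the vcftools behaviour (exact threshold handling may differ) and is meant to illustrate the logic: header lines are always kept, records below the QUAL threshold are dropped, and records with more than one alternative allele are dropped. The function name is hypothetical.

```python
def keep_line(line, min_qual=20.0):
    """Keep headers and biallelic records with QUAL >= min_qual."""
    if line.startswith("#"):
        return True
    cols = line.rstrip("\n").split("\t")
    alt, qual = cols[4], cols[5]
    if qual == "." or float(qual) < min_qual:
        return False
    # One REF allele plus exactly one ALT allele means biallelic.
    return alt != "." and "," not in alt


# Hypothetical records; only the first eight VCF columns are shown.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Scf1\t10\t.\tA\tG\t55.2\t.\tDP=20",     # kept
    "Scf1\t25\t.\tC\tT\t8.1\t.\tDP=12",      # dropped: QUAL < 20
    "Scf1\t40\t.\tG\tA,T\t90.0\t.\tDP=30",   # dropped: two ALT alleles
]

kept = [line for line in vcf_lines if keep_line(line)]
print("\n".join(kept))
```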
  • 196. Practice 3.1: Filter the variant file. 1. Change the working directory to ‘sinningia_genotyping’ 2. Change working directory to “03_variants”. 3. Run “vcf-stats” on the “SispeUserXX.vcf” file. 4. Remove the variants with QUAL < 20 and run “vcf-stats” again. 5. Remove the variants with DEPTH < 10 and run “vcf-stats” again. 6. Remove the variants separated between them 1000 bp or less and run “vcf-stats” again. 7. Get biallelic variants and run “vcf-stats” again. 8. Remove all the genotypes with missing data. 9. Select only SNPs. GDA3: 1-Variant Filtering
  • 197. Genomic Data Analysis 1. Variant filtering. 2. Simple stats for the variant analysis. 3. Variant visualization tools: IGV and TASSEL. 4. Changing formats for VCF files. 5. Example 1: Population analysis with Structure for Sinningia speciosa. 6. Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
  • 198. Stats for the VCF files GDA3: 2- Simple population stats for the variant analysis. Regular stats with vcf-stats (https://vcftools.github.io/perl_module.html) vcf-stats is a program that runs several stats for a VCF file, producing the following files: • stats.counts, number of variants per sample and for all the samples. • stats.dump, parseable hash Perl format file with all the VCF stats. • stats.indels, number of INDELs per sample and for all the samples. • stats.legend, various definitions. • stats.private, unique (not shared) variants for each sample. • stats.samples-tstv, transitions/transversions for each sample. • stats.shared, shared variants for all the samples. • stats.snps, number of SNPs per sample and for all the samples. • stats.tstv, transitions/transversions for all the samples.
  • 199. Stats for the VCF files GDA3: 2- Simple population stats for the variant analysis. Distribution with bcftools stats (https://samtools.github.io/bcftools/bcftools.html) bcftools is a software package to manipulate VCF/BCF files. Its stats command can be used to produce several data distributions such as QUAL (quality), DP (depth), ST (substitution types), IDD (InDel size distribution), AF (allele frequency)… It also includes a summary (SN). # SN, Summary numbers: # SN [2]id [3]key [4]value SN 0 number of samples: 4 SN 0 number of records: 110927 SN 0 number of no-ALTs: 0 SN 0 number of SNPs: 99184 SN 0 number of MNPs: 10943 SN 0 number of indels: 3101 SN 0 number of others: 506 SN 0 number of multiallelic sites: 3816 SN 0 number of multiallelic SNP sites: 798
  • 200. Stats for the VCF files GDA3: 2- Simple population stats for the variant analysis. Distribution with vcfutils.pl qstats vcfutils.pl qstats reports the ts/tv ratio of the SNPs at different QUAL thresholds. It can be used to test whether the ts/tv ratio is biased among the low quality calls. QUAL #non-indel #SNPs #transitions #joint ts/tv #joint/#ref #joint/#non-indel 6856.32 1909 1909 654 0 0.5211 0.0000 0.0000 0.5211 4769.34 3818 3818 1381 0 0.5667 0.0000 0.0000 0.6151 3506.53 5727 5727 2215 0 0.6307 0.0000 0.0000 0.7758 2748.14 7636 7636 3051 0 0.6654 0.0000 0.0000 0.7791 2240.06 9545 9545 3956 0 0.7078 0.0000 0.0000 0.9014 . . . 16.3149 80179 80179 41279 0 1.0612 0.0000 0.0000 1.2945 11.551 82088 82088 42386 0 1.0676 0.0000 0.0000 1.3803 6.48534 83997 83997 43471 0 1.0727 0.0000 0.0000 1.3167 2.79415 85906 85906 44556 0 1.0775 0.0000 0.0000 1.3167
  • 201. Stats for the VCF files GDA3: 2- Simple population stats for the variant analysis. Population stats with vcftools vcftools can also be used to get simple population genetics parameters from a VCF file. Some examples are: • Calculate nucleotide diversity (π) vcftools --vcf input.vcf --keep NamesGroup1.txt --window-pi 100000 --out Group1_Pi • Calculate linkage disequilibrium (LD) (for phased genotypes). vcftools --vcf input.vcf --keep NamesGroup1.txt --ld-window-bp 50000 --chr SeqID1 --hap-r2 --min-r2 0.001 --out Group1_SeqID1_LD
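As a rough idea of what a windowed nucleotide diversity value means, the sketch below computes π as the average number of pairwise differences per base pair from the alternative allele counts at biallelic sites. It is a textbook estimator under simplifying assumptions (no missing genotypes, every base of the window callable) and is not claimed to reproduce the exact vcftools implementation; the counts and function names are hypothetical.

```python
def site_pi(alt_count, n_chrom):
    """Average pairwise difference at one biallelic site.

    alt_count chromosomes carry the alternative allele out of n_chrom,
    so alt_count * (n_chrom - alt_count) of the n_chrom * (n_chrom - 1) / 2
    possible pairs differ at this site.
    """
    j, n = alt_count, n_chrom
    return 2 * j * (n - j) / (n * (n - 1))


def window_pi(alt_counts, n_chrom, window_bp):
    """Sum the per-site values and divide by the window length in bp."""
    return sum(site_pi(j, n_chrom) for j in alt_counts) / window_bp


# Four diploid samples give 8 chromosomes; three variable sites in 1 kb.
alt_counts = [4, 1, 8]
print(window_pi(alt_counts, n_chrom=8, window_bp=1000))
```

A site where all chromosomes carry the same allele (here the third one) contributes nothing to π.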
  • 202. Stats for the VCF files GDA3: 2- Simple population stats for the variant analysis. Population stats with vcftools vcftools can also be used to get simple population genetics parameters from a VCF file. Some examples are: • Calculate FST between two groups. vcftools --vcf input.vcf --weir-fst-pop SampleGroup1.txt --weir-fst-pop SampleGroup2.txt --fst-window-size 100000 --out Group1_VS_Group2_FST • Calculate Tajima's D for one group. vcftools --vcf input.vcf --keep NamesGroup1.txt --TajimaD 100000 --out Group1_TajimaD
  • 203. Practice 3.2: Get stats for the VCF file 1. Change the working directory to ‘sinningia_genotyping’. 2. Change working directory to “03_variants”. 3. Run “vcf-stats” on the “SispeUserXX.vcf” file. 4. Run “bcftools stats” on the “SispeUserXX.vcf” and pipe the output to select the lines matching “^SN”. 5. Run “vcftools” to calculate the nucleotide diversity on the “SispeUserXX.vcf”. 6. Run “vcftools” to calculate Tajima's D on the “SispeUserXX.vcf”. 7. Divide your dataset into two groups and calculate the FST between these two groups. GDA3: 2- Simple population stats for the variant analysis.
  • 204. Genomic Data Analysis 1. Variant filtering. 2. Simple stats for the variant analysis. 3. Variant visualization tools: IGV and TASSEL. 4. Changing formats for VCF files. 5. Example 1: Population analysis with Structure for Sinningia speciosa. 6. Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
  • 205. IGV, Integrative Genomics Viewer GDA3: 3- Variant visualization tools: IGV and TASSEL http://software.broadinstitute.org/software/igv/ The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
  • 206. IGV, Integrative Genomics Viewer GDA3: 3- Variant visualization tools: IGV and TASSEL http://software.broadinstitute.org/software/igv/download
  • 207. IGV, Integrative Genomics Viewer GDA3: 3- Variant visualization tools: IGV and TASSEL 1- Create a new .genome file for the Sinningia reference. 2- Add a unique identifier (e.g. “Sispe038”), a descriptive name (e.g. “S. speciosa version 0.3.8”), and the FASTA and GFF files of the reference.
  • 208. IGV, Integrative Genomics Viewer GDA3: 3- Variant visualization tools: IGV and TASSEL 3- Select any scaffold
  • 209. IGV, Integrative Genomics Viewer GDA3: 3- Variant visualization tools: IGV and TASSEL 4- To load any VCF or BAM file, select “Load From File” 5- Then load your VCF file (e.g. “SispeUser00.vcf”). 6- Then select the scaffold that you want to visualize (e.g. “Sispe038Scf0002”)
  • 210. IGV, Integrative Genomics Viewer GDA3: 3- Variant visualization tools: IGV and TASSEL It creates two tracks: 1- With all the variants and the AF as a red/blue bar; 2- With all the individual samples.
  • 211. IGV, Integrative Genomics Viewer GDA3: 3- Variant visualization tools: IGV and TASSEL You can also load BAM files to see the read alignment.
  • 212. TASSEL, Trait Analysis by aSSociation, Evolution and Linkage GDA3: 3- Variant visualization tools: IGV and TASSEL http://www.maizegenetics.net/tassel TASSEL is a tool to investigate the relationship between phenotypes and genotypes. TASSEL has functionality for association studies, evaluating evolutionary relationships, analysis of linkage disequilibrium, principal component analysis, cluster analysis, missing data imputation and data visualization.
  • 213. TASSEL GDA3: 3- Variant visualization tools: IGV and TASSEL 1- Load VCF data
  • 214. TASSEL GDA3: 3- Variant visualization tools: IGV and TASSEL 2- Explore the VCF data
  • 215. TASSEL GDA3: 3- Variant visualization tools: IGV and TASSEL 3- Calculate nucleotide diversity
  • 216. TASSEL GDA3: 3- Variant visualization tools: IGV and TASSEL 4- Get a distance matrix
  • 217. TASSEL GDA3: 3- Variant visualization tools: IGV and TASSEL 5- Perform a Principal Component Analysis
  • 218. TASSEL GDA3: 3- Variant visualization tools: IGV and TASSEL 6- Produce a cladogram
  • 219. Genomic Data Analysis 1. Variant filtering. 2. Simple stats for the variant analysis. 3. Variant visualization tools: IGV and TASSEL. 4. Changing formats for VCF files. 5. Example 1: Population analysis with Structure for Sinningia speciosa. 6. Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
  • 220. Change formats from VCF to others. GDA3: 4- Changing formats for VCF files. http://www.cmpg.unibe.ch/software/PGDSpider/
  • 221. Change formats from VCF to others. VCF => FastStructure PGDSpider can be used to change between different formats: • From VCF to FastStructure. First keep only the header lines and the variants where both REF and ALT are a single base: perl -ne 'chomp($_); if ($_ =~ m/^#/) { print "$_\n" } else { @a = split(/\t/, $_); if (length($a[3]) == 1 && length($a[4]) == 1) { print "$_\n" } }' input.vcf > clean.vcf Then convert the cleaned file: java -Xmx1024m -Xms512m -jar /data/software/PGDSpider_2.1.1.2/PGDSpider2-cli.jar -inputfile clean.vcf -inputfileformat VCF -outputfile clean.structure.str -outputfileformat STRUCTURE -spid VCF2FastStructure.spid GDA3: 4- Changing formats for VCF files.
  • 222. Change formats from VCF to others. VCF => FastStructure GDA3: 4- Changing formats for VCF files.
    PGDSpider requires a configuration file (.spid) for each pair of formats. Example of a VCF2FastStructure.spid file:
        # VCF Parser questions
        PARSER_FORMAT=VCF
        VCF_PARSER_PLOIDY_QUESTION=DIPLOID
        VCF_PARSER_REGION_QUESTION=
        VCF_PARSER_PL_QUESTION=GT
        VCF_PARSER_QUAL_QUESTION=20
        VCF_PARSER_GTQUAL_QUESTION=0
        VCF_PARSER_READ_QUESTION=5
        VCF_PARSER_IND_QUESTION=
        VCF_PARSER_EXC_MISSING_LOCI_QUESTION=TRUE
        VCF_PARSER_MONOMORPHIC_QUESTION=FALSE
        VCF_PARSER_POP_QUESTION=
        # STRUCTURE Writer questions
        WRITER_FORMAT=STRUCTURE
        STRUCTURE_WRITER_FAST_FORMAT_QUESTION=TRUE
        STRUCTURE_WRITER_LOCI_DISTANCE_QUESTION=TRUE
  • 223. Change formats GDA3: 4- Changing formats for VCF files.
  • 224. https://github.com/aubombarely/GenoToolBox GDA3: 4- Changing formats for VCF files. Change formats
  • 225. Genomic Data Analysis 1. Variant filtering. 2. Simple stats for the variant analysis. 3. Variant visualization tools: IGV and TASSEL. 4. Changing formats for VCF files. 5. Example 1: Population analysis with Structure for Sinningia speciosa. 6. Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
  • 226. GDA3: 5- Example 1: Population analysis with Structure for Sinningia speciosa.
    Goal: Analyze the population structure for cultivated Sinningias.
        Accession           Species     Type
        EmpressPurple01     S_speciosa  DOMESTICATED
        EmpressRed01        S_speciosa  DOMESTICATED
        Buzios              S_speciosa  WILD
        PurpleDreaming      Hybrid      DOMESTICATED
        GalaxyTour          S_speciosa  DOMESTICATED
        AnsNix              Hybrid      DOMESTICATED
        AmandasPenny        Hybrid      DOMESTICATED
        TV_Faeton           S_speciosa  DOMESTICATED
        DarthVader          S_speciosa  DOMESTICATED
        MerryChristmas      S_speciosa  DOMESTICATED
        StrawberryJam       S_speciosa  DOMESTICATED
        BlueDandy           S_speciosa  DOMESTICATED
        DeadlyRomance       S_speciosa  DOMESTICATED
        DiegoPink           S_speciosa  WILD
        White               S_speciosa  DOMESTICATED
        Kleopatra           S_speciosa  DOMESTICATED
        BestRoskosh         S_speciosa  DOMESTICATED
        LovePotion          S_speciosa  DOMESTICATED
        NTVenushki          S_speciosa  DOMESTICATED
        GoodMorning         S_speciosa  DOMESTICATED
        BlueKnight          S_speciosa  DOMESTICATED
        AvenidaNiemeyer     S_speciosa  WILD
        BuziosXEmpressF1    S_speciosa  DOMESTICATED
        ChilternSeeds       S_speciosa  WILD
        EmpressRed02        S_speciosa  DOMESTICATED
        PedraLisa           S_speciosa  WILD
        CardosoMoreira      S_speciosa  WILD
        EmpressPurple02     S_speciosa  DOMESTICATED
        Carangola           S_speciosa  WILD
        CardosoMoreiraPink  S_speciosa  WILD
        CharlesLawn         S_speciosa  DOMESTICATED
        Shelleri            S_helleri   WILD
        MiguelPereira       S_speciosa  WILD
    Pictured accessions: Buzios, Carangola, A. Niemeyer, Darth Vader, Empress Red, Blue Knight.
  • 227. Genetic Variation in the Species GDA3: 5- Example 1: Population analysis with Structure for Sinningia speciosa.
    Samples:
        Wild accessions          9
        Cultivars               25
        Wild x Cultivated F1     1
        Other species            1
        TOTAL                   36
    Pipeline:
        Library preparation   ApeKI digestion
        Sequencing            Illumina, single end, 100 bp
        De-multiplexing       GBSX v1.2
        Read processing       Fastq-mcf v1.04.807, Q30, L50
        Alignment             Bowtie2 v2.2.4
        Variant detection     Freebayes v0.9.20
        SNP filtering         bcftools: only biallelic SNPs; vcffilter: Q>30, Depth >= 5; vcftools: no missing observations
    Result: 41,626 SNPs
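The SNP filtering criteria listed above (biallelic SNPs, Q>30, depth >= 5, no missing observations) can be illustrated with a small Python sketch. It is not a replacement for bcftools, vcffilter or vcftools. The function name `passes_filters` is hypothetical, and reading the depth from the `DP` key of the INFO column is an assumption, since the slide does not say which depth value was used.

```python
BASES = {"A", "C", "G", "T"}


def passes_filters(line, min_qual=30.0, min_depth=5):
    """Return True if one tab-separated VCF record passes all four checks."""
    fields = line.rstrip("\n").split("\t")
    ref, alt, qual, info = fields[3], fields[4], fields[5], fields[7]

    # 1. Biallelic SNP: a single base in REF and a single base in ALT.
    if ref.upper() not in BASES or alt.upper() not in BASES:
        return False

    # 2. Site quality strictly above the threshold (Q>30 on the slide).
    if qual == "." or float(qual) <= min_qual:
        return False

    # 3. Depth. ASSUMPTION: taken from the DP key of the INFO column.
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    if "DP" not in info_map or int(info_map["DP"]) < min_depth:
        return False

    # 4. No missing observations: every sample needs a called genotype.
    for sample in fields[9:]:
        genotype = sample.split(":")[0]
        if "." in genotype:
            return False
    return True


if __name__ == "__main__":
    # Made-up record with two samples, for illustration only.
    record = "\t".join(
        ["chr1", "100", ".", "A", "G", "45", ".", "DP=12", "GT", "0/0", "0/1"]
    )
    print(passes_filters(record))
```

Keeping the four checks in one function makes it easy to see which criterion rejects a given record while debugging a filtering pipeline.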
  • 228. Genetic Variation in the Species GDA3: 5- Example 1: Population analysis with Structure for Sinningia speciosa.
    1. Remove from the file the variants whose alleles are defined with more than one character (e.g. AC/AG).
        perl -ne 'chomp($_); if ($_ =~ m/^#/) { print "$_\n" } else { @a = split(/\t/, $_); if (length($a[3]) == 1 && length($a[4]) == 1) { print "$_\n" } }' Sispe038_Set01_FILTERED_SNPs.vcf > Sispe038_Set01_FILTERED_SNPs_CLEAN.vcf
    2. Change the VCF format to FastStructure.
        java -Xmx1024m -Xms512m -jar /data/software/PGDSpider_2.1.1.2/PGDSpider2-cli.jar -inputfile Sispe038_Set01_FILTERED_SNPs_CLEAN.vcf -inputfileformat VCF -outputfile Sispe038_Set01_FILTERED_SNPs_CLEAN.structure.str -outputfileformat STRUCTURE -spid VCF2FastStructure.spid
  • 229. Genetic Variation in the Species GDA3: 5- Example 1: Population analysis with Structure for Sinningia speciosa.
    3. Prepare a script (run_faststructure.sh) with the fastStructure command lines: 5 replicates, random seeds and K from 1 to 20.
        #!/bin/bash
        python /data/software/fastStructure/structure.py -K 1 --input=Sispe038_Set01_FILTERED_SNPs_CLEAN.structure --output=Sispe038_Set01_FS_K01_R01 --format=str --seed=123456789
        …
    4. Change the permissions of the script and run it.
        chmod 755 run_faststructure.sh
        ./run_faststructure.sh
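Writing 100 command lines by hand (20 values of K times 5 replicates) is error-prone, so the script body can be generated. The sketch below is one possible way to do it. The helper name `build_commands` is hypothetical, and the `_K01_R01` output naming is extrapolated from the single command shown on the slide, so treat it as an assumption.

```python
import random


def build_commands(input_prefix, output_base, k_values=range(1, 21),
                   replicates=5, rng=None,
                   script="/data/software/fastStructure/structure.py"):
    """Return one fastStructure command line per (K, replicate) pair."""
    rng = rng or random.Random()
    commands = []
    for k in k_values:
        for rep in range(1, replicates + 1):
            # A fresh random seed for every run.
            seed = rng.randint(1, 999_999_999)
            # ASSUMPTION: naming pattern inferred from the slide example.
            output = f"{output_base}_K{k:02d}_R{rep:02d}"
            commands.append(
                f"python {script} -K {k} --input={input_prefix} "
                f"--output={output} --format=str --seed={seed}"
            )
    return commands


if __name__ == "__main__":
    lines = build_commands(
        "Sispe038_Set01_FILTERED_SNPs_CLEAN.structure",
        "Sispe038_Set01_FS",
    )
    print("#!/bin/bash")
    print("\n".join(lines))
```

Redirecting the output of this sketch to run_faststructure.sh would give a script in the same form as the one described in step 3.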
  • 230. Genetic Variation in the Species GDA3: 5- Example 1: Population analysis with Structure for Sinningia speciosa.
    5. Run chooseK to get the most supported K.
        python /data/software/fastStructure/chooseK.py --input=Sispe038_Set01_FS_*
    Output:
        Model complexity that maximizes marginal likelihood = 2
        Model components used to explain structure in data = 2
  • 231. Genetic Variation in the Species GDA3: 5- Example 1: Population analysis with Structure for Sinningia speciosa.
    5. Run chooseK to get the most supported K.
        Model components used to explain structure in data = 2
    Caveat: "In our review of 1,264 studies using structure to explore population subdivision, studies that used ΔK were more likely to identify K = 2 (54%, 443/822) than studies that did not use ΔK (21%, 82/386). A troubling finding was that very few studies performed the hierarchical analysis recommended by the authors of both ΔK and structure to fully explore population subdivision."
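The hierarchical analysis mentioned in the quote means repeating the clustering inside each group found at the top level. A first step is to split the samples by their main cluster. The sketch below assumes the ancestry proportions are in a fastStructure meanQ file (one row per individual, one whitespace-separated column per cluster). The function names and the 0.8 threshold are hypothetical choices, not values from the course.

```python
def parse_mean_q(text):
    """Parse the content of a meanQ file into a list of float rows."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append([float(value) for value in line.split()])
    return rows


def split_by_cluster(rows, names, threshold=0.8):
    """Group sample names by their main cluster (numbered from 1).

    Samples whose largest ancestry proportion is below the threshold
    are collected under the key "admixed".
    """
    groups = {}
    for name, row in zip(names, rows):
        best = max(range(len(row)), key=lambda i: row[i])
        label = best + 1 if row[best] >= threshold else "admixed"
        groups.setdefault(label, []).append(name)
    return groups


if __name__ == "__main__":
    # Invented proportions for three placeholder samples at K = 2.
    text = "0.95  0.05\n0.10  0.90\n0.55  0.45\n"
    names = ["sample_1", "sample_2", "sample_3"]
    print(split_by_cluster(parse_mean_q(text), names))
```

Each resulting group could then be written to its own input file and analyzed again over a range of K values.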
  • 232. Genomic Data Analysis 1. Variant filtering. 2. Simple stats for the variant analysis. 3. Variant visualization tools: IGV and TASSEL. 4. Changing formats for VCF files. 5. Example 1: Population analysis with Structure for Sinningia speciosa. 6. Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
  • 233. GDA3: 6- Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
    Goal: Develop a genetic map with 19 linkage groups.
    Pictured plants (picture from Dr. David Zaitlin): S-6-4 (PI#555684), S-6-5 (Standard Line), F1.
    Parental samples: S_64_2 S_65_2
    F2 samples: F2_001 F2_002 F2_003 F2_004 F2_005 F2_006 F2_007 F2_009 F2_010 F2_011 F2_012 F2_013 F2_014 F2_015 F2_016 F2_017 F2_018 F2_019 F2_021 F2_022 F2_023 F2_024 F2_028 F2_031 F2_032 F2_033 F2_034 F2_035 F2_036 F2_077 F2_078 F2_079 F2_080 F2_081 F2_082 F2_083 F2_084 F2_085 F2_086 F2_087 F2_088 F2_089 F2_091 F2_092 F2_093 F2_094 F2_095 F2_096 F2_097 F2_098 F2_099 F2_100 F2_101 F2_102 F2_103 F2_104 F2_105 F2_106 F2_107 F2_108 F2_109 F2_110 F2_111 F2_112 F2_113 F2_115 F2_117 F2_118 F2_119 F2_120 F2_121 F2_122 F2_123 F2_124 F2_125 F2_126 F2_127 F2_129 F2_130 F2_131
  • 234. http://www.rqtl.org/ http://www.rqtl.org/tutorials/geneticmaps.pdf GDA3: 6- Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
  • 235. https://github.com/aubombarely/GenoToolBox GDA3: 6- Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
  • 236. GDA3: 6- Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
    1. Change the VCF format to the CSV used as input by R/QTL, using Vcf2Mapmaker from GenoToolBox (https://github.com/aubombarely/GenoToolBox).
        /old_home/aurebg/Software/GenoToolBox/SNPTools/Vcf2Mapmaker -i NibenGBS_M30.vcf -o NibenGBS_M30.csv -f csv -a S_64_2 -b S_65_2 -B -d 1000
    2. Load the genotypes in R/QTL.
        setwd('./')
        library('qtl')
        NbenX = read.cross(file="NibenGBS_M30.csv", format="csv")
    3. Follow the R/QTL tutorial.
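To make the conversion in step 1 less of a black box, the sketch below shows the general idea of recoding VCF genotypes of an F2 population relative to its two parents, using the A/H/B genotype codes and the "-" missing code that R/QTL reads by default. It is an illustration of the concept only, not the Vcf2Mapmaker implementation. The function names are hypothetical, and the rule of keeping only markers where both parents are homozygous for different alleles is an assumption.

```python
def homozygous_allele(gt):
    """Return the allele of a homozygous diploid genotype such as '0/0',
    or None if the genotype is heterozygous or missing."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) == 2 and alleles[0] == alleles[1] and alleles[0] != ".":
        return alleles[0]
    return None


def code_marker(parent_a_gt, parent_b_gt, progeny_gts):
    """Recode progeny genotypes as A, H, B or '-' for one marker.

    Returns None when the marker is not informative, that is, when the
    parents are not homozygous for two different alleles.
    """
    allele_a = homozygous_allele(parent_a_gt)
    allele_b = homozygous_allele(parent_b_gt)
    if allele_a is None or allele_b is None or allele_a == allele_b:
        return None
    codes = []
    for gt in progeny_gts:
        alleles = set(gt.replace("|", "/").split("/"))
        if alleles == {allele_a}:
            codes.append("A")   # homozygous, parent A allele
        elif alleles == {allele_b}:
            codes.append("B")   # homozygous, parent B allele
        elif alleles == {allele_a, allele_b}:
            codes.append("H")   # heterozygous
        else:
            codes.append("-")   # missing or unexpected allele
    return codes


if __name__ == "__main__":
    # Made-up genotypes for one marker: parent A, parent B, four progeny.
    print(code_marker("0/0", "1/1", ["0/0", "0/1", "1/1", "./."]))
```

In an F2 population an informative marker is expected to segregate roughly 1:2:1 for A:H:B, which is a useful sanity check before building the map.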