Presentation on data processing for genomic data in population genetics and quantitative genetics studies. It explains how to process the reads, map them to a reference, call variants and quantify them. It also presents 25 common Linux commands required to interact with a Linux system and run the different tools.
Genomic Data Analysis: From Reads to Variants
1. Genomic Data Analysis
From READS to VARIANTS
24-10-17 to 26-10-17,
Porto Alegre, Brazil.
Aureliano Bombarely
Virginia Tech
Department of Horticulture
Latham 216
220 Ag Quad Lane
Blacksburg, VA
USA
aurebg@vt.edu
2. Genomic Data Analysis
DAY 1:
• Presentation of the Course.
• Introduction to the Linux Operating System and the Command Line Interface.
• 25 essential commands to work with Linux.
• Common bioinformatics formats, from FASTAs to GFFs and VCFs.
• 10 essential commands to play with the biological data.
DAY 2:
• Introduction to Next Generation Sequencing Technologies (NGS).
• Experimental design for population studies, from breeding to ecological studies.
• De-multiplexing and the complexities of sample identification.
• Read processing and quality evaluation.
• Read mapping to a reference.
• Variant calling and summary of the read processing.
• Quality evaluation and possible pitfalls.
DAY 3:
• Variant filtering.
• Simple stats for the variant analysis.
• Variant visualization tools: IGV and TASSEL.
• Changing formats for VCF files.
• Example 1: Population analysis with Structure for Sinningia speciosa.
• Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
7. Genomic Data Analysis
1. Presentation of the Course.
2. Introduction to the Linux Operating System and the
Command Line Interface.
3. 25 essential commands to work with Linux.
4. Common bioinformatics formats, from FASTAs to GFFs
and VCFs.
5. 10 essential commands to play with the biological data.
9. GDA1: 1- Presentation of the Course.
[Diagram: Biological Problem → Scientific Question → Hypothesis → Experimental Design → Approach → Results, drawing on Genetics & related disciplines, Molecular biology and Massive DNA Sequencing. The missing piece: ?]
10. GDA1: 1- Presentation of the Course.
[Diagram: Biological Problem → Scientific Question → Hypothesis → Experimental Design → Approach → Results, drawing on Genetics & related disciplines, Molecular biology and Massive DNA Sequencing. The missing piece filled in: Genomic Data Analysis]
11. GDA1: 1- Presentation of the Course.
Genomic Data Analysis is:
• Knowledge about sequencing technologies.
• Knowledge about methodologies (e.g. library preparation).
• Bioinformatic skills (Linux command line and R).
• Basic knowledge about statistical analysis.
Genomic Data Analysis IS NOT:
• Programming (useful but not necessary).
• Basic knowledge of computer system administration.
• Modeling.
• Algorithm development.
• Database development.
12. GDA1: 1- Presentation of the Course.
Genomic Data Analysis is:
• Knowledge about sequencing technologies.
• Knowledge about methodologies (e.g. library preparation).
• Bioinformatic skills (Linux command line and R).
• Basic knowledge about statistical analysis.
BIOINFORMATICS:
• Programming (useful but not necessary).
• Basic knowledge of computer system administration.
• Modeling.
• Algorithm development.
• Database development.
13. Genomic Data Analysis
1. Presentation of the Course.
2. Introduction to the Linux Operating System and the
Command Line Interface.
3. 25 essential commands to work with Linux.
4. Common bioinformatics formats, from FASTAs to GFFs
and VCFs.
5. 10 essential commands to play with the biological data.
14. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
Linux:
Unix-like computer operating system assembled
under the model of free and open source software
development and distribution
Operating System (OS):
Set of programs that
manage computer hardware
resources and provide common
services for application
software.
Wikipedia
15. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
Unix-like?
Feduccia A, Trends Ecol. Evol. 2001
16. Unix:
A multitasking, multi-user computer operating system originally developed in 1969.
GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
https://www.howtogeek.com/182649/htg-explains-what-is-unix/
17. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
Linux Distribution:
Distributions (often called distros for short) are
Operating Systems including a large collection of
software applications such as word processors,
spreadsheets, media players, and database applications.
The operating system will consist of the Linux
kernel and, usually, a set of libraries and utilities from
the GNU Project, with graphics support from the X
Window System.
18. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
Linux Distribution:
19. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
What is a console ?
A computer terminal or system console is the text entry and display device for system administration messages, particularly those from the BIOS or boot loader, the kernel, the init system and the system logger. It is a physical device consisting of a keyboard and a screen.
Wikipedia
20. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
So then...
What is the typical black or white screen where programmers and system administrators type commands?
21. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
Command-line interface (CLI):
Mechanism for interacting with a computer operating
system or software by typing commands to perform specific
tasks.
The command-line interpreter may be run in a
text terminal or in a terminal emulator window as a remote
shell client such as PuTTY.
Wikipedia
22. Shell:
Piece of software that provides an interface for users of an
operating system. There are two categories:
- Command-line interface (CLI)
- Graphical user interface (GUI)
GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
23. Command-line interface (CLI):
GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
Shell: Bash:
[Diagram: Command → Shell (Bash) → Operating System, connected through the STDIN, STDOUT and STDERR streams]
A command is a directive to a computer program acting as an interpreter of some kind, in order to perform a specific task.
24. Parts of a command:
GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
...And push RETURN or ENTER key to execute the command
Command + Argument 1 & 2 + Argument 3
The command calls a program; the arguments modify the behavior of the program.
-l means "long listing"
-h/--human-readable means "human readable"
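As a concrete sketch of this anatomy (the directory /tmp is just an assumed example):

```shell
# "ls" is the command (it calls a program); "-l" (long listing) and
# "-h" (human-readable sizes) are arguments; "/tmp" names the directory to list.
ls -l -h /tmp
# Short flags can usually be combined:
ls -lh /tmp
```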
25. Special characters in bash:
Characters with a special meaning for bash:
CHARACTER         MEANING
SPACE             Separates commands and arguments
# POUND           Comment
; SEMICOLON       Command separator to run multiple commands
. DOT             Source command OR filename component OR current directory
.. DOUBLE DOTS    Parent directory
' ' SINGLE QUOTES Use the expression between quotes literally
, COMMA           Concatenates strings
\ BACKSLASH       Escape for a single character
/ SLASH           Filename path separator
* ASTERISK        Wild card for filename expansion in globbing
>, <, >>          Redirection of inputs/outputs
| PIPE            Pipes outputs between commands
GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
26. Special characters in bash:
ls Solanum lycopersicum
ls 'Solanum lycopersicum'
ls Solanum\ lycopersicum
Use single quotes or escape special characters:
bash understands spaces as separators between arguments.
GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
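A few of these special characters in action, on a throwaway file (name made up):

```shell
echo first; echo second          # ; runs two commands on one line
echo "line 1" > demo_file.txt    # > redirects STDOUT to a file
echo "line 2" >> demo_file.txt   # >> appends instead of overwriting
wc -l < demo_file.txt            # < feeds the file to STDIN; prints 2
ls demo_*.txt | head -n 1        # * globs matching file names; | pipes output
```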
27. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
Practice 1.1: Connect to the server
Windows users:
1. Open the program PuTTY and start a session.
2. Add the following information for your session and click open.
1. Host: begonia.hort.vt.edu
2. Port: 1809
3. Connection type: SSH
3. Introduce username and password
4. Click connect.
MacOS/Linux users:
1. Open the program Terminal.
2. Type in the terminal:
ssh -p 1809 username@begonia.hort.vt.edu
3. Push enter/return
4. Introduce the password and push Enter/Return. Note: the typing will be hidden.
28. GDA1: 2- Introduction to the Linux Operating System and the Command Line Interface
Practice 1.1: Connect to the server
Everyone:
5. Type in the terminal:
pwd
6. Push enter/return
7. Describe the message that has appeared in the screen
29. Genomic Data Analysis
1. Presentation of the Course.
2. Introduction to the Linux Operating System and the
Command Line Interface.
3. 25 essential commands to work with Linux.
4. Common bioinformatics formats, from FASTAs to GFFs
and VCFs.
5. 10 essential commands to play with the biological data.
30. GDA1: 3- 25 essential commands to work with Linux.
1. pwd
• The command prints the path to the working directory.
• When you log in to the system, the working directory = home ($HOME)
/data/GDA_UFRGS2017/User00_Home
/ means root (beginning of the file system)
data is the name of the 1st directory in root
/GDA_UFRGS2017 2nd directory after data
/User00_Home 3rd directory after GDA_UFRGS2017
pwd
31. GDA1: 3- 25 essential commands to work with Linux.
2. mkdir
• The command makes a new directory.
• If the directory already exists, it gives an error.
• The -p argument also creates any missing parent directories (and does not complain if the directory exists).
• rmdir removes an empty directory.
mkdir linux_exercises              ✓ correct
mkdir linux_exercises/test01       ✴ error (e.g. if the directory already exists)
mkdir -p linux_exercises/test01    ✓ correct
rmdir linux_exercises/test01       ✓ correct
32. GDA1: 3- 25 essential commands to work with Linux.
3. cd
• The command changes the working directory.
• Two consecutive dots (e.g. "cd ..") change one directory up/back in the file system.
cd linux_exercises            ✓ correct
cd linux_exercises/test01     ✴ error (the path is relative to the new working directory)
cd test01                     ✓ correct
33. GDA1: 3- 25 essential commands to work with Linux.
4. ls
• It lists the items in the working directory (default) or in any given directory.
• The -l argument produces the long listing.
• The -h argument produces human-readable sizes.
• The -a argument prints everything (including hidden files starting with ".").
• The -t argument sorts by time.
ls
ls -lht linux_exercises/test01
34. Practice 1.2: Navigating the file system
GDA1: 3- 25 essential commands to work with Linux.
1. Type pwd in the terminal and run it.
2. Make the directory ‘linux_exercises’.
3. Run pwd in the current directory and annotate the result.
4. Change the working directory to ‘linux_exercises’.
5. Make a new directory named ’01_file_system_tree’.
6. Change the working directory to ’01_file_system_tree’.
7. Run pwd in the current directory and annotate the result.
8. Make a new directory named ‘subdir01’
9. Change the working directory to ’subdir01’.
10. Run pwd in the current directory and annotate the result.
11. Change the working directory one level up
12. Make a new directory named ‘subdir02’
13. Change the working directory to ’subdir02’.
14. Run pwd in the current directory and annotate the result.
15. Draw the file system tree for the directories ‘subdir01’ and ‘subdir02’
35. Practice 1.2: Navigating the file system
GDA1: 3- 25 essential commands to work with Linux.
/
data/
GDA_UFRGS2017/
User00_Home/
linux_exercises/
01_file_system_tree/
subdir01/
subdir02/
cd 01_file_system_tree/subdir02
36. Practice 1.2: Navigating the file system
GDA1: 3- 25 essential commands to work with Linux.
/
data/
GDA_UFRGS2017/
User00_Home/
linux_exercises/
01_file_system_tree/
subdir01/
subdir02/
cd ../../
37. Practice 1.2: Navigating the file system
GDA1: 3- 25 essential commands to work with Linux.
/
data/
GDA_UFRGS2017/
User00_Home/
linux_exercises/
01_file_system_tree/
subdir01/
subdir02/
cd ../subdir01/
38. Practice 1.2: Navigating the file system
GDA1: 3- 25 essential commands to work with Linux.
/
data/
GDA_UFRGS2017/
User00_Home/
linux_exercises/
01_file_system_tree/
subdir01/
subdir02/
cd /data/GDA_UFRGS2017/User00_Home/linux_exercises/01_file_system_tree/subdir01/
cd ../subdir01/
39. Practice 1.2: Navigating the file system
GDA1: 3- 25 essential commands to work with Linux.
/
data/
GDA_UFRGS2017/
User00_Home/
linux_exercises/
01_file_system_tree/
subdir01/
subdir02/
cd /data/GDA_UFRGS2017/User00_Home/linux_exercises/01_file_system_tree/subdir01/   ← Absolute filepath
cd ../subdir01/   ← Relative filepath
40. Practice 1.2: Navigating the file system
GDA1: 3- 25 essential commands to work with Linux.
Absolute filepath Relative filepath
Latham Hall 311
220 Ag Quad Lane
Blacksburg, VA 24061
USA
Room 311
41. GDA1: 3- 25 essential commands to work with Linux.
Commands for directories:
COMMAND USE EXAMPLE
cd Change working dir cd ../
pwd Print working dir pwd
ls List information ls -lh /home
mkdir Create a new dir mkdir test
rmdir Remove empty dir rmdir test
42. GDA1: 3- 25 essential commands to work with Linux.
5. history
• It lists the last commands run (typically up to 500).
• No arguments are needed.
43. GDA1: 3- 25 essential commands to work with Linux.
Typing shortcuts for bash:
SHORTCUT   MEANING
Tab        Autocomplete file or folder names
↑          Scroll up through the command history
↓          Scroll down through the command history
Ctrl + A   Go to the beginning of the line that you are typing
Ctrl + E   Go to the end of the line that you are typing
Ctrl + U   Clear the whole line (or up to the cursor position)
Ctrl + R   Search previously used commands
Ctrl + C   Kill the process that you are running
Ctrl + D   Exit the current shell
Ctrl + Z   Put the running process in the background. Use the command fg to recover it.
44. GDA1: 3- 25 essential commands to work with Linux.
6. less
• Opens a text file on the screen.
• To navigate, use the arrow keys.
• "Shift + G" goes to the end of the file.
• "/" followed by a word searches for that word.
• "q" quits/exits.
• Open the file with "-N" to show line numbers.
• More information at: http://www.tutorialspoint.com/unix_commands/less.htm
less ../DATA/Sinningia_speciosa/reference/Sispe038_cds.fasta
less -N ../DATA/Sinningia_speciosa/reference/Sispe038_cds.fasta
45. GDA1: 3- 25 essential commands to work with Linux.
7. touch
• Creates an empty file.
touch this_is_a_test_file.txt
8. rm
• Permanently removes/deletes a file from the system.
• The file CAN NOT BE RECOVERED.
• "rm -Rf <directory>" will remove the directory and all its content. CAREFUL.
rm this_is_a_test_file.txt
rm -Rf 01_file_system_tree/subdir01
46. GDA1: 3- 25 essential commands to work with Linux.
9. cp
• Copies a file from one location to another.
• "./" means "here", the current working directory.
cp ../DATA/Sinningia_speciosa/reference/Sispe038_cds.fasta ./
10. mv
• Two functions:
• If the destination EXISTS and is a DIRECTORY, it moves the file there.
• If the destination DOES NOT EXIST, it changes the name of the file.
• CAREFUL: if the destination EXISTS and is a file, mv WILL OVERWRITE IT.
mv Sispe038_cds.fasta Sispe038_cds.fa
47. Practice 1.3: Copying and moving files
GDA1: 3- 25 essential commands to work with Linux.
1. Change working directory to ‘linux_exercises’.
2. Make a directory with the name: “Sispe_ref”.
3. Change working directory to “Sispe_ref”.
4. Copy all the fasta files from /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/reference/ to your current working directory typing:
cp /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/reference/*.fasta ./
5. Remove the file “Sispe038.scaffolds.fasta”.
6. Change the name of “Sispe038.scaffolds500kb.fasta” to “Sispe038ReducedRef.fa”.
7. Create a mapping reference using Bowtie2-build running:
bowtie2-build Sispe038ReducedRef.fa Sispe038ReducedRef
48. GDA1: 3- 25 essential commands to work with Linux.
11. cat
• It prints the content of the file to STDOUT (the screen).
• Usually used to concatenate (merge) files one after another:
"cat file1.txt file2.txt > merged.txt"
cat Sispe038ReducedRef.fa
12. head/tail
• Prints the first/last 10 lines of the file to STDOUT.
• The number of lines (x) can be changed using "-n x".
head Sispe038ReducedRef.fa
head -n 100 Sispe038ReducedRef.fa
tail -n 1 Sispe038ReducedRef.fa
49. GDA1: 3- 25 essential commands to work with Linux.
13. grep
• Finds and prints the LINES that match a given pattern.
• The "-c" option prints the NUMBER of LINES that match.
• The "-v" option prints the LINES that DO NOT match.
grep ">" Sispe038ReducedRef.fa
grep -c ">" Sispe038ReducedRef.fa
grep -v ">" Sispe038ReducedRef.fa
50. GDA1: 3- 25 essential commands to work with Linux.
14. gzip/gunzip
• gzip compresses a file.
• gunzip uncompresses a file.gz.
• To keep the original file, the "-c" option can be used (it writes to STDOUT, so redirect with ">").
gzip Sispe038ReducedRef.fa
gunzip Sispe038ReducedRef.fa.gz
gzip -c Sispe038ReducedRef.fa > SispeRef.fa.gz
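A minimal round trip with these commands, on a made-up file:

```shell
printf '>seq1\nACGTACGT\n' > demo.fa   # a tiny, hypothetical FASTA file
gzip -c demo.fa > demo.fa.gz           # compress; -c keeps demo.fa in place
gunzip -c demo.fa.gz                   # print the compressed content to STDOUT
gunzip -c demo.fa.gz > demo_copy.fa    # uncompress into a new file
cmp demo.fa demo_copy.fa               # byte-identical: cmp prints nothing
```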
51. GDA1: 3- 25 essential commands to work with Linux.
15. tar
• Command to archive/unarchive the files contained in a directory.
• It can be combined with tools such as gzip and bzip2.
• Commonly used commands:
• tar -zxvf package_of_files.tar.gz to unarchive and uncompress .gz
• tar -jxvf package_of_files.tar.bz2 to unarchive and uncompress .bz2
• tar -zcvf dir1.tar.gz /path_to_dir1 to archive and compress with gzip
• tar -jcvf dir1.tar.bz2 /path_to_dir1 to archive and compress with bzip2
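The archive/compress cycle above, sketched on a made-up directory:

```shell
mkdir -p sample_dir
printf 'AAA\n' > sample_dir/a.txt
printf 'BBB\n' > sample_dir/b.txt
tar -zcvf sample_dir.tar.gz sample_dir   # archive + gzip-compress
tar -ztvf sample_dir.tar.gz              # -t lists the content without extracting
rm -r sample_dir
tar -zxvf sample_dir.tar.gz              # unarchive + uncompress
cat sample_dir/a.txt                     # the file is back
```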
52. Practice 1.4: Concatenating files and taking a look at them
GDA1: 3- 25 essential commands to work with Linux.
1. Change working directory to ‘linux_exercises’.
2. Make a directory with the name: “CDS_refs”.
3. Change working directory to “CDS_refs”.
4. Copy into your current working directory the following files:
1. /data/GDA_UFRGS2017/DATA/Arabidopsis_thaliana/reference/Athaliana_Phytozome167_TAIR10.pep.fa.gz
2. /data/GDA_UFRGS2017/DATA/Oryza_sativa/reference/Osativa_Phytozome323_v7.0.pep.fa.gz
5. Uncompress both files.
6. Count how many lines have the symbol “>” in each file.
7. Concatenate both files into a file named “Atha_Osat_PEP.fasta”.
8. Count how many lines have the symbol “>” in this file.
9. Create a BLAST+ reference running the following command:
makeblastdb -in Atha_Osat_PEP.fasta -dbtype prot -parse_seqids
53. GDA1: 3- 25 essential commands to work with Linux.
16. cut
• It divides each line of the file by TAB and prints the selected column to STDOUT.
• "-f x", where x is the number of the column.
• "-d y", where y is a character, can be used to change the delimiter.
17. sort
• It sorts a file alphabetically, based on the first characters of each line.
• "-n" can be used to sort numerically.
• "-r" can be used to do a reversed sort.
• "-u" can be used to keep only unique lines.
• Usually used with cut (e.g. "cut -f1 my_file.txt | sort").
cut -f 3 Sispe038_genome.genemodels.gff3
cut -f 3 Sispe038_genome.genemodels.gff3 | sort -u
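The "-d" delimiter option mentioned above, on a throwaway comma-separated line:

```shell
# cut splits on TAB by default; -d ',' switches the delimiter to a comma
echo "gene1,chr2,450" | cut -d ',' -f 2   # → chr2
```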
54. GDA1: 3- 25 essential commands to work with Linux.
18. uniq
• It reports or omits repeated lines (the input must be sorted first).
• Usually used in conjunction with cut and sort (e.g. "cut -f1 my_file.txt | sort | uniq").
19. wc
• It counts newlines, words or bytes in a file.
• "-l" counts the number of lines.
• "-w" counts the number of words.
• "-m" counts the number of characters.
cut -f 3 Sispe038_genome.genemodels.gff3 | sort | uniq -c
wc -l Sispe038_genome.genemodels.gff3
55. GDA1: 3- 25 essential commands to work with Linux.
20. sed
• Stream editor to transform text.
• The simplest option is the substitution command "s/<find>/<replace>/".
• Append a "g" to replace every occurrence in each line: "s/<find>/<replace>/g".
• More info at: https://www.gnu.org/software/sed/manual/sed.html
sed "s/A/a/g" Sispe038ReducedRef.fa
56. Practice 1.5: Selecting columns and counting them
GDA1: 3- 25 essential commands to work with Linux.
1. Change working directory to ‘linux_exercises’.
2. Make a directory with the name: “GFF_refs”.
3. Change working directory to “GFF_refs”.
4. Copy into your current working directory the following files:
1. /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/reference/Sispe038_genome.genemodels.gff3
2. /data/GDA_UFRGS2017/DATA/Nicotiana_benthamiana/reference/Niben251.1_genome.genemodels.sorted.gff3
5. Count the number of lines in both files.
6. Count the number of lines ignoring lines with the “#” symbol using grep.
7. Select the third column in both files and print the first 20 lines.
8. Select the third column in both files, sort it and count unique items using “uniq -c”.
57. GDA1: 3- 25 essential commands to work with Linux.
Commands for files:
COMMAND  USE                                          EXAMPLE
less     Open a file with less. Q to exit, arrows to scroll.   less myfile
touch    Create an empty file                         touch myfile
mv       Move a file between dirs / change its name   mv myfile yourfile
rm       Remove a file                                rm yourfile
cat      Print file content as STDOUT                 cat myfile
head     Print first 10 lines as STDOUT               head myfile
tail     Print last 10 lines as STDOUT                tail myfile
grep     Print matching lines as STDOUT               grep 'ATG' myfile
cut      Cut columns and print as STDOUT              cut -f1 myfile
sort     Sort lines and print as STDOUT               sort myfile
uniq     Select unique lines (-c to count)            uniq -c myfile
sed      Replace occurrences, print lines as STDOUT   sed 's/ATG/CTG/' myfile
wc       Word count                                   wc myfile
58. Compression and archiving commands:
GDA1: 3- 25 essential commands to work with Linux.
COMMAND    USE                                          EXAMPLE
gzip       Compress a file using gzip                   gzip -c test.txt > test.txt.gz
gunzip     Uncompress a file using gzip                 gunzip test.txt.gz
bzip2      Compress a file using bzip2                  bzip2 -c test.txt > test.txt.bz2
bunzip2    Uncompress a file using bzip2                bunzip2 test.txt.bz2
tar        Archive files using tar                      tar -cf sample.tar sample/*.txt
tar -zcvf  Archive using tar and compress using gzip    tar -zcvf samples.tar.gz sample/*.txt
tar -zxvf  Unarchive using tar and uncompress using gunzip   tar -zxvf samples.tar.gz
tar -jcvf  Archive using tar and compress using bzip2   tar -jcvf samples.tar.bz2 sample/*.txt
tar -jxvf  Unarchive using tar and uncompress using bunzip2  tar -jxvf samples.tar.bz2
59. GDA1: 3- 25 essential commands to work with Linux.
21.top/htop
• Display Linux processes.
• Type “q” to quit.
• “kill PID” can be used to terminate a process.
Global Resource Usage: %CPU / MEMORY / SWAP MEMORY
Single Process Resource Usage: PID / USER / %CPU / %MEMORY / COMMAND
60. 22.df/du
• Commands to check how much disk space is being used in the
system (df -lh) or how much space a directory is using (du -lh
<dir>).
GDA1: 3- 25 essential commands to work with Linux.
df -lh
du -lh linux_exercises
61. 23.wget/curl
• Commands to download files from the internet.
• wget can be used recursively (e.g. using * or "-r" for dirs).
• curl can pipe its output into other commands (using "|").
GDA1: 3- 25 essential commands to work with Linux.
wget ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG3.2_release/*.fasta
curl ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG3.2_release/ITAG3.2_proteins.fasta | grep -c ">"
63. GDA1: 3- 25 essential commands to work with Linux.
File Permissions and Ownerships:
All Unix systems are designed as multiuser operating systems. It means that different users could access, modify or delete the same files.
To avoid problems, they have a file permission and ownership system. It restricts who can access and modify each of the files in the computer.
This system has two parts:
• Who is the owner of the file?
• What type of access does each of the users in the system have?
65. GDA1: 3- 25 essential commands to work with Linux.
Permissions:
Each file has 9 permissions assigned: 3 for the file user-owner (u), 3 for the group-owner (g) and 3 for everyone else (o). There are 3 types of permissions or file attributes:
• Readable (r): permission to read the file.
• Writable (w): permission to write the file.
• Executable (x): permission to execute it as a program.
10-letter code for a Linux file: ----------
d rwx rwx rwx → file type, then the user, group and other triplets; a letter means the switch is ON, a dash means it is OFF.
-r--r--r--   Readable for everyone
-rwxr--r--   Readable for everyone, writable or executable only for the user-owner
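These permission switches are what the chmod command flips; a short sketch with a made-up file:

```shell
printf '#!/bin/sh\necho hello\n' > myscript.sh
chmod u+x myscript.sh    # give the user-owner execute permission (u+x)
chmod 644 myscript.sh    # octal form: 644 = rw-r--r--
ls -l myscript.sh        # the first column shows the permission string
```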
67. Genomic Data Analysis
1. Presentation of the Course.
2. Introduction to the Linux Operating System and the
Command Line Interface.
3. 25 essential commands to work with Linux.
4. Common bioinformatics formats, from FASTAs to GFFs
and VCFs.
5. 10 essential commands to play with the biological data.
68. I. FASTA
FASTA format is a text-based file format that stores three different types of sequence: DNA, RNA or protein. It is used to represent sequence information for genomes, mRNAs, cDNAs, miRNAs…
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
>SeqID1 optional_description1
AGCGTGGAGAGCGATGAGATCAGAAAGTAGGACGACAGATGGGGAGAT
GGCAGGTGTGGGAGGAGTTGACGATGACGTGATTGATGACGGGAGACG
>SeqID2 optional_description2
AGCGTGGAGAGCGATGAGATCAGAAAGTAGGCTGACAGATGGGGAGAT
GGCAGGTGAGGGAGGAGCTGACGATGACGTGTTTGATGACGGGAGACG
>SeqID3 optional_description3
AGCGTGGAGAGCGATGAGATCAGAAAGTAGGACGACAGTGGGGGAGAT
GGCAGGTGAGGGAGGAGTTGACGATGACGTGTTTGATGACGGGAGACG
• The ID line always starts with ">".
• The ID is one line; a space separates the ID from the optional description.
• The sequence can be one or more lines.
69. II. FASTQ
FASTQ format is a text-based file format that usually stores DNA sequences. It contains information about the sequencing QUALITY of each nucleotide.
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
@GWNJ-0957:89:GW170928504:7:1101:2757:1309 1:N:0:NCGTCCC
TATCTAAGTATTTGATTAATGATAGATGACGATGGAGAAATATAATCTACTTTTTT
AAGTCCCTCATTTTCTTTCTCCATCTTTCTTTTTTATTACTCCCATTGTTCCCCAT
+
AAAAAFFJFJJFJJAAAAAFJJJ<FJJJJJJJJJJ7<7<<<<JJJJJJFFJJJAFJ
F-7<<-7AFJJFJJJJJJJJAJJFJFJ<7<-7A-7FAFJA777777<7-7--7--7
@GWNJ-0957:89:GW170928504:7:1101:3549:1309 1:N:0:NCGTCCC
ACCATTCATTATTTTTTTATTTAGTCTTTATTACTTTACTTTCCTTCCTTCTGAAA
TACTGCTATTGTACATAAAACAAAATGATCTACTTAAAAATAAAACAAATTTAAAA
+
AAA-AAJJFJJAAJAA-7AFJJ-7-<<-<AJJ--<J-<-<---77F7-A---A7--
<777<7<7<<F-77F<J<JJ7F7AFF77<77<7777<77<---7---77---7---
• The ID line always starts with "@" and is one line.
• The sequence can be one or more lines.
• The quality block always starts with a "+" line.
• One quality character per nucleotide. Each character encodes a number from 0-41 (Illumina v1.8+).
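Each quality character encodes (ASCII value − 33) in this scheme; a quick check in bash for the character 'J' seen in the reads above:

```shell
ascii=$(printf '%d' "'J")   # the "'J" idiom prints the ASCII code of J (74)
echo $((ascii - 33))        # → 41, the top quality in the Illumina v1.8+ encoding
```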
72. 1. Change working directory to ‘linux_exercises’.
2. Make a directory with the name: “FASTQ1”.
3. Change working directory to “FASTQ1”.
4. Copy into your current working directory the following files:
1. /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/collection/P1_001B.fastq.gz
2. /data/GDA_UFRGS2017/DATA/Sinningia_speciosa/collection/P1_007.fastq.gz
5. Uncompress them.
6. Run the following commands to get the stats.
fastq-stats P1_001B.fastq
fastq-stats P1_007.fastq
7. Redirect the output using “>” into a file using the following commands.
fastq-stats P1_001B.fastq > P1_001B.stats.txt
fastq-stats P1_007.fastq > P1_007.stats.txt
Practice 1.6: Getting stats for a FASTQ file
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
73. III. SAM/BAM
SAM (and its binary form, BAM) format is designed to store information about reads mapped to a reference. It has 11 mandatory columns.
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
74. III. SAM/BAM
The 2nd column: FLAG defines the status of the read mapping.
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
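The FLAG is a bitmask of read properties (e.g. bit 16 = reverse strand, bit 4 = unmapped). A sketch with 147 as an example flag value:

```shell
flag=147   # 147 = 1 + 2 + 16 + 128: paired, properly paired, reverse strand, second in pair
echo $(( (flag & 16) != 0 ))   # → 1: the reverse-strand bit is set
echo $(( (flag & 4) != 0 ))    # → 0: the read is not unmapped
```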
75. IV. GFF3
GFF3 is a text-based file with 9 columns. It is designed to store information about genomic features (e.g. genes, exons, repetitive elements…). More information at http://gmod.org/wiki/GFF3.
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
##gff-version 3
ctg13 . mRNA 1300 9000 . + . ID=mrna0001;Name=GDR1
ctg13 . exon 1300 1500 . + . ID=exon00001;Parent=mrna0001
ctg13 . exon 1600 1800 . + . ID=exon00002;Parent=mrna0001
ctg13 . exon 3000 3900 . + . ID=exon00003;Parent=mrna0001
ctg13 . exon 5000 5500 . + . ID=exon00004;Parent=mrna0001
ctg13 . exon 7000 9000 . + . ID=exon00005;Parent=mrna0001
The 9 columns: seqid, source, type, start, end, score, strand, phase, attributes.
[Diagram: mrna0001 drawn as exon00001-exon00005 joined along ctg13]
76. V. VCF
VCF is a text-based file with 8 fixed columns, plus a FORMAT column and one extra column per sample in multisample files. It contains metadata at the beginning of the file in "#" lines explaining the different fields.
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
20 1370 rs01 G A 29 PASS DP=14;AF=0.5 GT:GQ:DP 0/1:51:14   (E.g. 1)
20 1730 . T A 3 q10 DP=11;AF=0.2 GT:GQ:DP 0/1:58:11   (E.g. 2)
20 2121 rs02 A G,T 67 PASS DP=10;AF=0.5 GT:GQ:DP 1/2:23:10   (E.g. 3)
20 6781 . T . 47 PASS DP=13;AF=1 GT:GQ:DP 1/1:56:13   (E.g. 4)
DIPLOID genotype (GT) coding: 0 = REF allele, 1 = first ALT allele (2 = second ALT…); "/" = not phased, "|" = phased.
The FORMAT string GT:GQ:DP stands for GENOTYPE : GENOTYPE QUALITY : DEPTH (e.g. 0/1:51:14).
• E.g. 1 is a biallelic heterozygous SNP.
• E.g. 2 is a biallelic heterozygous SNP with low quality, probably because of the mapping.
• E.g. 3 is a non-biallelic (multiallelic) heterozygous SNP.
• E.g. 4 is a biallelic homozygous deletion.
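A hedged sketch (file name and values made up) of how the GT subfield of the first sample column can be tallied with the commands from this deck:

```shell
# Build a tiny hypothetical VCF body (tab-separated columns)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n' > mini.vcf
printf '20\t1370\trs01\tG\tA\t29\tPASS\tDP=14\tGT:GQ:DP\t0/1:51:14\n'    >> mini.vcf
printf '20\t6781\t.\tT\t.\t47\tPASS\tDP=13\tGT:GQ:DP\t1/1:56:13\n'       >> mini.vcf
# Skip "#" lines, take the sample column (10), keep the GT subfield, tally
grep -v '#' mini.vcf | cut -f 10 | cut -d ':' -f 1 | sort | uniq -c
```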
77. Genomic Data Analysis
1. Presentation of the Course.
2. Introduction to the Linux Operating System and the
Command Line Interface.
3. 25 essential commands to work with Linux.
4. Common bioinformatics formats, from FASTAs to GFFs
and VCFs.
5. 10 essential commands to play with the biological data.
78. PIPELINE: Combination of commands where the input of one command is the output of the previous one.
GDA1: 5- 10 essential commands to play with the biological data.
CMD1 CMD2 CMD3 CMD4
Input Output
grep -v '#' Sispe038.gff3 | cut -f3 | sort | uniq -c
79. (1) GET SEQUENCE NUMBER
grep -c '>' file.fasta
(2) GET FASTA SIZE
grep -v '>' file.fasta | wc -m
(3) GET A LIST OF THE TOP TEN MOST ABUNDANT FASTA
DESCRIPTIONS
grep '>' file.fasta | cut -d ' ' -f2 | sort | uniq -c | sort -nr | head -n 10
GDA1: 5- 10 essential commands to play with the biological data.
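These one-liners can be sanity-checked on a tiny hand-made FASTA (the file name toy.fasta is just for illustration). One caveat: wc -m also counts newline characters, so for an exact residue count strip the newlines first with tr:

```shell
# Toy FASTA with three sequences (9 residues in total)
printf '>s1 geneA\nATGC\n>s2 geneA\nATG\n>s3 geneB\nAT\n' > toy.fasta

# (1) sequence number
grep -c '>' toy.fasta                         # 3

# (2) fasta size; tr -d strips newlines so only residues are counted
grep -v '>' toy.fasta | tr -d '\n' | wc -m    # 9

# (3) most abundant descriptions (2nd space-separated field of the headers)
grep '>' toy.fasta | cut -d ' ' -f2 | sort | uniq -c | sort -nr | head -n 10
```

Here command (3) puts geneA (2 sequences) on top of geneB (1 sequence).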
80. (4) GET NUMBER OF TYPES IN A GFF3 FILE
grep -v '#' file.gff | cut -f3 | sort | uniq -c
(5) GET NUMBER OF GENES PER SEQID IN A GFF3 FILE
grep -v '#' file.gff | cut -f1,3 | grep "gene" | sort | uniq -c
(6) GET NUMBER OF EXONS PER mRNA IN A GFF3 FILE
grep -v '#' file.gff | cut -f3,9 | grep "exon" | sed -r 's/.+Parent=//' | sed -r 's/;.+//' | sort | uniq -c | sed -r 's/^\s+//' | cut -d ' ' -f1 | sort | uniq -c | sort -nr
GDA1: 5- 10 essential commands to play with the biological data.
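These commands can be tried on a minimal GFF3 file before running them on real data (the toy file below is made up; note that the columns must be TAB-separated and that sed -r and \s are GNU extensions):

```shell
# Minimal GFF3: one gene, one mRNA, two exons (TAB-separated fields)
printf 'chr1\t.\tgene\t1\t900\t.\t+\t.\tID=gene1\n' > toy.gff
printf 'chr1\t.\tmRNA\t1\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n' >> toy.gff
printf 'chr1\t.\texon\t1\t200\t.\t+\t.\tID=e1;Parent=mrna1\n' >> toy.gff
printf 'chr1\t.\texon\t300\t900\t.\t+\t.\tID=e2;Parent=mrna1\n' >> toy.gff

# (4) feature types: prints counts for exon, gene and mRNA
grep -v '#' toy.gff | cut -f3 | sort | uniq -c

# (6) exons per mRNA: extract the Parent of each exon, count exons per
# mRNA, then count how many mRNAs share each exon number
grep -v '#' toy.gff | cut -f3,9 | grep exon \
  | sed -r 's/.+Parent=//' | sed -r 's/;.+//' | sort | uniq -c \
  | sed -r 's/^\s+//' | cut -d ' ' -f1 | sort | uniq -c | sort -nr
```

On this toy file command (6) prints "1 2": one mRNA with two exons.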
81. (7) GET THE NUMBER OF VARIANTS PER CHROM IN A VCF FILE
grep -v '#' file.vcf | cut -f1 | sort | uniq -c
(8) GET THE NUMBER OF BIALLELIC VARIANTS IN A VCF FILE
grep -v '#' file.vcf | cut -f4,5 | grep -v ',' | wc -l
(9) GET THE NUMBER OF BIALLELIC SNPs IN A VCF FILE
grep -v '#' file.vcf | cut -f4,5 | grep -Ec '^.\s+.$'
(10) GET THE NUMBER OF VARIANT IMPACTS IN A SNPEFF VCF FILE
grep -v '#' file.SnpEff.vcf | cut -f8 | sed -r 's/.+;ANN=//' | cut -d '|' -f2 | sort | uniq -c
GDA1: 5- 10 essential commands to play with the biological data.
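The same trick works for the VCF one-liners; a made-up three-record VCF is enough to see what each filter keeps (the \s in command (9) relies on GNU grep):

```shell
# Toy VCF: two variants on chrom 20 (one SNP, one triallelic), one INDEL on 21
printf '##fileformat=VCFv4.2\n' > toy.vcf
printf '20\t100\t.\tG\tA\t30\tPASS\tDP=10\n' >> toy.vcf
printf '20\t200\t.\tT\tA,C\t40\tPASS\tDP=12\n' >> toy.vcf
printf '21\t300\t.\tTA\tT\t50\tPASS\tDP=14\n' >> toy.vcf

# (7) variants per chromosome: 2 on "20", 1 on "21"
grep -v '#' toy.vcf | cut -f1 | sort | uniq -c

# (8) biallelic variants, i.e. a single ALT allele (no comma): 2
grep -v '#' toy.vcf | cut -f4,5 | grep -v ',' | wc -l

# (9) biallelic SNPs: REF and ALT are both a single base: 1
grep -v '#' toy.vcf | cut -f4,5 | grep -Ec '^.\s+.$'
```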
82. 1. Change working directory to ‘linux_exercises’.
2. Make a directory with the name: “VCF_ANALYSIS”.
3. Change working directory to “VCF_ANALYSIS”.
4. Copy into your current working directory the following files:
1. /data/GDA_UFRGS2017/DATA/Nicotiana_benthamiana/resistant_popbatch01/
VLS24_S1.PolCollapsedBiallelicAF1.vcf
5. Answer the following questions:
5.1. How many variants does this file have?
5.2. Ignoring scaffolds (SeqID=Niben251ScfXXXXX), how many variants does each
chromosome (SeqID=Niben251ChrYY) have?
5.3. How many biallelic SNPs does this file have?
5.4. How many biallelic SNPs with allele frequency 1 (AF=1) does each chromosome have?
Practice 1.7:
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
83. SCRIPT: Executable file written in a specific language (e.g. Bash,
Perl…) that contains commands/functions to be executed.
GDA1: 5- 10 essential commands to play with the biological data.
#!/bin/bash
file_gff=$1
echo "Analyzing file $file_gff"
date
grep -v '#' "$file_gff" | cut -f3,9 | grep "exon" | sed -r 's/.+Parent=//' | sed -r 's/;.+//' | sort | uniq -c | sed -r 's/^\s+//' | cut -d ' ' -f1 | sort | uniq -c | sort -nr

To write, make executable and run the script (file1.gff is the external argument, available inside the script as $1):
nano exons_per_mRNA.sh
chmod 755 exons_per_mRNA.sh
./exons_per_mRNA.sh file1.gff

In the nano editor: CTRL+O to save, CTRL+X to exit.
84. 1. Change working directory to ‘linux_exercises’.
2. Make a directory with the name: “MY_FIRST_SCRIPT”.
3. Change working directory to “MY_FIRST_SCRIPT”.
4. Copy into your current working directory the following files:
1. /data/GDA_UFRGS2017/DATA/Nicotiana_benthamiana/reference/
Niben251.1_genome.gene_models.sorted.gff
5. Write a script that counts the number of types per chromosome and uses two arguments:
1st=file.gff; 2nd=type.
Practice 1.8:
GDA1: 4- Common bioinformatics formats, from FASTAs to GFFs and VCFs.
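One possible shape for the two-argument script of Practice 1.8 is sketched below; $1 and $2 are the positional arguments, and the file and variable names are only illustrative:

```shell
#!/bin/bash
# types_per_chrom.sh -- count features of a given type per chromosome
# Usage: ./types_per_chrom.sh file.gff type
file_gff=$1
feature_type=$2

echo "Counting features of type '$feature_type' in $file_gff"
# column 1 = seqid, column 3 = type; -w matches the type as a whole word
grep -v '#' "$file_gff" | cut -f1,3 | grep -w "$feature_type" | sort | uniq -c
```

For instance, `./types_per_chrom.sh file.gff gene` prints one count line per chromosome.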
86. Genomic Data Analysis
1. Introduction to Next Generation Sequencing Technologies.
2. Experimental design for population studies, from breeding
to ecological studies.
3. De-multiplexing and the complexities of sample
identification.
4. Read processing and quality evaluation.
5. Read mapping to a reference.
6. Variant calling and summary of the read processing.
7. Quality evaluation and possible pitfalls.
87. Genomic Data Analysis
1. Introduction to Next Generation Sequencing Technologies.
2. Experimental design for population studies, from breeding
to ecological studies.
3. De-multiplexing and the complexities of sample
identification.
4. Read processing and quality evaluation.
5. Read mapping to a reference.
6. Variant calling and summary of the read processing.
7. Quality evaluation and possible pitfalls.
88. DNA sequencing is the process of determining the precise
order of nucleotides within a DNA molecule. It includes any
method or technology that is used to determine the order of
the four bases—adenine, guanine, cytosine, and thymine—in
a strand of DNA.
https://en.wikipedia.org/wiki/DNA_sequencing
(Gentile et al. Nano Lett., 2012, 12 (12), pp 6453–6458)
ATGCGCGTCGCGGTGAAT
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
90. Frederick Sanger (1918-2013)
Twice awarded the Nobel Prize in Chemistry
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
PreNGS Era
94. (Mardis E.R. (2013) Annual Review of Analytical Chemistry 6: 287-303)
Next Generation Sequencing vs Sanger
NGS | Sanger
DNA libraries need to be prepared | Fragment amplification
Direct nucleotide detection based on different methods | Physical fragment separation for detection
Millions to billions of reads | Thousands of reads
Variable read size (short- and long-read technologies) | 400 to 900 bp read length
Variable error rate | Very low error rate
Quantitative comparison | Semiquantitative comparison
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
96. http://www.slideshare.net/cosentia/high-throughput-equencing
Next-generation sequencing platforms (Roche 454, Illumina GAII, ABi SOLiD, Helicos
HeliScope) share a common workflow: isolation and purification of target DNA, sample
preparation, library validation, amplification (cluster generation on solid-phase for
Illumina; emulsion PCR for 454 and SOLiD), sequencing (sequencing by synthesis with
3'-blocked reversible terminators for Illumina; pyrosequencing for 454; sequencing by
ligation for SOLiD; sequencing by synthesis with 3'-unblocked reversible terminators
for Helicos), imaging (four-colour for Illumina and SOLiD; single-colour for Helicos)
and data analysis.
Next Generation Sequencing
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
97. Technology | Read length (bp) | Accuracy | Reads/Run | Time/Run | Cost/Mb
Applied Bio 3730XL (Sanger) | 400-900 | 99.9% | 384 | 4 h (12 runs/day) | $2,400
Roche 454 GS FLX (pyrosequencing) | 700, single/pairs | 99.9% | 1,000,000 | 24 h | $10
Illumina HiSeq 4000 (seq. by synthesis) | 75-250, single/pairs | 99% | 5,000,000,000 | 24 to 120 h | $0.05 to $0.15
Illumina MiSeq (seq. by synthesis) | 50-300, single/pairs | 99% | 44,000,000 | 24 to 72 h | $0.17
SOLiD 4 (seq. by ligation) | 25-50, single/pairs | 99.9% | 1,400,000,000 | 168 h | $0.13
Ion Torrent (seq. by semiconductor) | 170-400, single | 98% | 80,000,000 | 2 h | $2
Pacific Biosciences Sequel (SMRT) | 14,000, single | 85% (99.9% consensus) | 1,600,000 | 4 h | $0.60
Oxford Nanopore MinION (nanopore sequencing) | 10,000, single | 62% (96%) | 4,400,000 | 48 h | $0.02
GDA2: 1- Introduction to Next Generation Sequencing Technologies.
123. Single Molecule Real Time (SMRT) technology
Niedringhaus et al. Analytical Chemistry 2011
GDA2: 1- Introduction to Next Generation Sequencing Technologies: PacBio
124. Single Molecule Real Time (SMRT) technology
http://bit.ly/1naxgTe
GDA2: 1- Introduction to Next Generation Sequencing Technologies: PacBio
125. Single Molecule Real Time (SMRT) technology
http://genome.duke.edu/cores-and-services/sequencing-and-genomic-technologies/pacbio
GDA2: 1- Introduction to Next Generation Sequencing Technologies: PacBio
PacBio Sequel
131. Sequence by Nanopore technology
GDA2: 1- Introduction to Next Generation Sequencing Technologies
132. Genomic Data Analysis
1. Introduction to Next Generation Sequencing Technologies.
2. Experimental design for population studies, from breeding
to ecological studies.
3. De-multiplexing and the complexities of sample
identification.
4. Read processing and quality evaluation.
5. Read mapping to a reference.
6. Variant calling and summary of the read processing.
7. Quality evaluation and possible pitfalls.
133. GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
Population (Genetics)
Group of organisms or individuals from the same geographical
location with the capability of interbreeding.
• Natural populations (e.g. Sinningia speciosa group of plants that
grow in the area of Pedra Lisa).
• Artificial populations (e.g. F2 segregating population of Sinningia
speciosa Empress x Buzios).
135. GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
(1) Focusing on genotyping instead of on a proper sampling of the populations.
(2) Incorrect randomization of the samples.
(3) Confusing geopolitical borders with biological borders.
(4) Not testing the significance of the clustering output.
(5) Misinterpretation of the Mantel r statistic (correlation between distance matrices).
(6) Interpreting a single K value without considering other alternative scenarios.
(7) Not taking into account loci fixation associated with an adaptive trait.
136. GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
✴ Focusing on genotyping instead of on a proper sampling of the populations.
How many individuals are necessary per “population”?
It depends on the analysis and on the population.
Example 1: Single dominant locus QTL Analysis.
• Recombination rate (genome size and chromosome
number).
• Genotyping methodology (resolution).
• Loci location.
→ Start with 100 F2 individuals, then move to other populations (e.g. F3) or add more individuals.
Example 2: Local adaptation.
• Trait analyzed.
• Population structure.
• Quality of the reference.
• Genotyping methodology (resolution).
→ Start with 50 individuals per group, then move to other populations (e.g. F3) or add more individuals.
137. ✴ Genotyping approaches.
GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
Genotyping: It is the process of determining genetic differences of an
individual by examining the individual's DNA sequence.
Genome sequencing
Cost effective approaches
Reduced representation
1. Targeted amplification (e.g. TruSeq Custom Amplicon)
2. Hybridization (e.g. Sequence Capture)
3. Enzymatic Digestion + Size selection (e.g. RAD-Seq / GBS)
4. RNA isolation (RNA-Seq)
138. ✴ Genotyping approaches: Reduced representation approaches.
GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
1. Targeted amplification (e.g. TruSeq Custom Amplicon)
Workflow: the target regions (genes A, B, C) are amplified directly, then library
preparation and sequencing produce the fastq files; different samples carry
different MIDs.
139. ✴ Genotyping approaches: Reduced representation approaches.
GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
2. Hybridization (e.g. Sequence Capture)
Workflow: genomic DNA is fragmented, the target fragments (genes A, B, C) are
captured by hybridization, then amplification and library preparation (MID PCR)
and sequencing produce the fastq files; different samples carry different MIDs.
140. ✴ Genotyping approaches: Reduced representation approaches.
GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
3. Enzymatic Digestion + Size selection (e.g. RAD-Seq / GBS)
Workflow: genomic DNA is digested at the restriction enzyme (RE) sites, adapters
(RE + MID + PCR) are ligated, the fragments are amplified with size selection
(~500 bp) and sequenced to produce the fastq files; different samples carry
different MIDs.
Genotyping-By-Sequencing (GBS): Elshire et al. 2011 PLoS ONE 6:e19379
141. ✴ Genotyping approaches: Reduced representation approaches.
GDA2: 3- Experimental design for population studies, from breeding to ecological studies.
4. RNA isolation (RNA-Seq)
Workflow: only expressed genes are sampled; RNA extraction and cDNA synthesis
are followed by library preparation and sequencing to produce the fastq files;
different samples carry different MIDs.
142. Genomic Data Analysis
1. Introduction to Next Generation Sequencing Technologies.
2. Experimental design for population studies, from breeding
to ecological studies.
3. De-multiplexing and the complexities of sample
identification.
4. Read processing and quality evaluation.
5. Read mapping to a reference.
6. Variant calling and summary of the read processing.
7. Quality evaluation and possible pitfalls.
143. GDA2: 3- De-multiplexing and the complexities of sample identification.
Multiplexing
Use of DNA tags (4-7 bp) to identify samples in the same
sequencing lane, cell or sector.
Sample 1 mRNA is converted to cDNA and tagged with ATGC; sample 2 cDNA is tagged
with CGAG. Both libraries are pooled and sequenced in the same lane, so every read
carries its sample tag (e.g. ATGCGTTATGC from sample 1 and ATGCGTTCGAG from
sample 2).
144. GDA2: 3- De-multiplexing and the complexities of sample identification.
De-Multiplexing
Identification of the sequenced DNA samples using the DNA tag
After sequencing, the pooled reads are split back into samples by their tag:
reads carrying the ATGC tag are assigned to sample 1 and reads carrying the
CGAG tag to sample 2.
145. GDA2: 3- De-multiplexing and the complexities of sample identification.
De-Multiplexing
Identification of the sequenced DNA samples using the DNA tag
If a sequencing error hits the tag itself (e.g. CGCG instead of CGAG, or ATCC
instead of ATGC), the read cannot be assigned to any sample ("?") and is left
as unassigned or discarded.
146. De-Multiplexing
Keys for barcode/tag designing (GBS/RADseq):
• The barcode does not contain or recreate the enzyme cut
site.
• Any barcode in a set is at least two substitutions away
from any other barcode.
• They vary in length as a set (to avoid all the cut-site bases
appearing at the same positions in the sequencing read).
• They contain the complementary sticky end to the enzyme
cut site.
GDA2: 3- De-multiplexing and the complexities of sample identification.
http://www.maizegenetics.net/genotyping-by-sequencing-gbs
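The heart of de-multiplexing is easy to sketch: look at the tag bases of each read and route the read to the matching sample file. Real tools also handle tag sequencing errors, variable-length tags and RE sites; the awk sketch below only does exact matching on 4 bp tags placed at the read end, as in the slides' example (tags and file names are made up):

```shell
# Pooled reads, one per line; the last 4 bases are the sample tag
printf 'ATGCGTTATGC\nATGTGAACGAG\nTTGCGCTCGCG\n' > pooled_reads.txt

awk '{
    tag = substr($0, length($0) - 3)      # last 4 bases of the read
    if      (tag == "ATGC") print substr($0, 1, length($0) - 4) > "sample1.txt"
    else if (tag == "CGAG") print substr($0, 1, length($0) - 4) > "sample2.txt"
    else                    print $0      > "unassigned.txt"   # tag error: cannot assign
}' pooled_reads.txt
```

The third read carries the erroneous tag CGCG and ends up in unassigned.txt.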
147. GDA2: 3- De-multiplexing and the complexities of sample identification.
De-Multiplexing
Identification of the sequenced DNA samples using the DNA tag
Software | RE | Link
Fastx-toolkit (Barcode splitter) | No | http://hannonlab.cshl.edu/fastx_toolkit/
Ea-utils (Fastq-multx) | No | https://expressionanalysis.github.io/ea-utils/
GBSX | Yes | https://github.com/GenomicsCoreLeuven/GBSX
TASSEL | Yes | http://www.maizegenetics.net/tassel
148. Genomic Data Analysis
1. Introduction to Next Generation Sequencing Technologies.
2. Experimental design for population studies, from breeding
to ecological studies.
3. De-multiplexing and the complexities of sample
identification.
4. Read processing and quality evaluation.
5. Read mapping to a reference.
6. Variant calling and summary of the read processing.
7. Quality evaluation and possible pitfalls.
149. Read processing workflow: Fastq raw → reads processing and filtering →
Fastq processed → mapped reads, assembled reads or other analyses.
Processing and filtering steps:
1. Low quality reads (qscore) (Q30).
2. Short reads (L50).
3. PCR duplications (only genomes).
4. Contaminations.
5. Corrections.
GDA2: 4- Read processing and quality evaluation.
150. 0- Read Quality Evaluation
• Did the sequencing produce the expected number of
reads?
READ COUNTS
• Do the reads have the expected average length?
AVERAGE READ LENGTH
• Do the reads have the expected nucleotide qscore?
QSCORE NUCLEOTIDE BOXES
Technology dependent
GDA2: 4- Read processing and quality evaluation.
152. 1- Quality filtering
• Generally associated with a minimum length and a
minimum qscore (extremes, by average, minimum for all
the nucleotides)
Technology | min. length (bp) | min. qscore
454 100 20
Illumina 50 30
SOLiD 20 30
Ion Torrent 50 20
PacBio 1000 NA
Oxford Nanopore 1000 NA
GDA2: 4- Read processing and quality evaluation.
153. 1- Quality filtering
• Tools and processing time vary depending on the
technology.
Software Link
Fastx-toolkit http://hannonlab.cshl.edu/fastx_toolkit/
Ea-utils https://expressionanalysis.github.io/ea-utils/
PrinSeq http://prinseq.sourceforge.net/
Trimmomatic http://www.usadellab.org/cms/?page=trimmomatic
e.g. running the ea-utils command:
fastq-mcf -q 30 -l 50 -o s01_Q30L50_R1.fq Illumina_Adapters.fa s01_R1.fq
GDA2: 4- Read processing and quality evaluation.
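With many samples the same fastq-mcf call is usually wrapped in a loop. The sketch below only echoes the commands (a dry run), so it runs even before ea-utils is installed; the sample names are hypothetical, and removing the echo executes the real tool:

```shell
# Dry-run: print one fastq-mcf command per sample (s01..s03 are made-up names)
for sample in s01 s02 s03; do
    echo fastq-mcf -q 30 -l 50 -o "${sample}_Q30L50_R1.fq" \
        Illumina_Adapters.fa "${sample}_R1.fq"
done
```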
154. Practice 2.1: Process reads and get stats.
1. Make a new directory called ‘sinningia_genotyping’.
2. Change the working directory to ‘sinningia_genotyping’.
3. Make a directory with the name: “00_raw”.
4. Change working directory to “00_raw”.
5. Copy four fastq files and the “IlluminaAdapters_V2.fasta” from /data/GDA_UFRGS2017/
DATA/Sinningia_speciosa/collection/ to your current working directory.
6. Get the stats for the raw reads using “fastq-stats”.
7. Process the reads using “fastq-mcf” with a min. quality score of 30 and a min. length of
50 bp (note: you can use a script). An example of the command could be something like:
fastq-mcf -q 30 -l 50 -o P1_003_Q30L50.fq IlluminaAdapters_V2.fasta P1_003.fastq.gz
8. Make a directory one level up (../) with the name “01_processed”.
9. Move the outputs from “fastq-mcf” to “../01_processed”.
10. Get the stats for the processed reads using “fastq-stats”.
GDA2: 4- Read processing and quality evaluation.
155. Genomic Data Analysis
1. Introduction to Next Generation Sequencing Technologies.
2. Experimental design for population studies, from breeding
to ecological studies.
3. De-multiplexing and the complexities of sample
identification.
4. Read processing and quality evaluation.
5. Read mapping to a reference.
6. Variant calling and summary of the read processing.
7. Quality evaluation and possible pitfalls.
156. Read Mapping:
It is the process of finding the location of a read by comparing its sequence
with the sequence of a reference.
ATGGCGTGGCAGCGACCAGTGACCAGTGACGTGTGCAGACGTGATATGCAG
GCAGCGACCAGCGA
||||||||||| ||
1........10........20........30........40........50
ref
read
ref:10..23
Sequence Alignment:
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity
that may be a consequence of functional, structural, or evolutionary relationships between the sequences.[1] Aligned sequences of
nucleotide or amino acid residues are typically represented as rows within a matrix.
http://en.wikipedia.org/wiki/Sequence_alignment
GDA2: 5- Read mapping to a reference.
157. Read Mapping Considerations:
• Length of the read.
• Number of reads.
• Size of the reference.
• Short reads (NGS): millions of sequences.
• Medium reads (genes/transcripts): thousands of sequences.
• Long reads (chromosomes): dozens of sequences.
GDA2: 5- Read mapping to a reference.
158. Read Mapping Software:
The reference (ATGGCGTGGCAGCGACCAGTGACC… in the example) is turned into a
database of indexes. Each read is broken into seeds (kmers, e.g. GCAGCGACCA,
CAGCGACCAG, AGCGACCAGC…), the seeds are searched against the index, and the
seed hits are then extended into a full read alignment.
GDA2: 5- Read mapping to a reference.
159. Seed-and-extend alignment example: a seed (perfect match, l=4) is found
between the read GCCGTGCTG and the reference ATGACGTGC, and the alignment is
then extended allowing mismatches; the extension is scored with a
dynamic-programming matrix (shown on the slide).
Common alignment algorithms:
• Smith-Waterman algorithm
• Needleman-Wunsch algorithm
• Burrows-Wheeler index
GDA2: 5- Read mapping to a reference.
160. Read Mapping Software:
Name | Type | Input | Output
Mauve | Long sequences | Fasta, GenBank | backbone (positions), XMFA (alignments)
LastZ/MultiZ | Long sequences | Fasta | several (maf, sam…)
Blast | Medium sequences | Fasta | Blast formats (0 text file, 6 tabular file)
Blat | Medium sequences | Fasta | Blast formats + Blat tabular format
Bowtie | Short sequences | Fasta, Fastq | Sam
BWA | Short sequences | Fasta, Fastq | Sam
Novoalign | Short sequences | Fasta, Fastq | Sam
SOAP | Short sequences | Fasta, Fastq | Sam
Stampy | Short sequences | Fasta, Fastq | Sam
GDA2: 5- Read mapping to a reference.
163. Practice 2.2: Read mapping and get stats.
1. Change the working directory to ‘sinningia_genotyping’.
2. Make a directory with the name: “02_mapped”.
3. Change working directory to “01_processed”.
4. Map each of the processed read files to the reference index “../../linux_exercises/
Sispe_ref/Sispe038ReducedRef” (created on Day 1, Practice 3, with bowtie2-build).
Redirect the output to the directory “../02_mapped/”. An example of a command could be:
bowtie2 -p 2 -t -x ../../linux_exercises/Sispe_ref/Sispe038ReducedRef -U P1_009_Q30L50.fq -S ../02_mapped/P1_009_Q30L50.sam
5. Change working directory to “../02_mapped”.
6. Count how many hits each sam file has using “samtools”, for example with the
following command:
samtools view -c -F 4 -S P1_009_Q30L50.sam
GDA2: 5- Read mapping to a reference.
164. Practice 2.2: Read mapping and get stats.
7. Filter the sam file removing the reads without hits (tag 0x4) and convert it to bam.
samtools view -F 4 -Sb -o P1_009.bam P1_009_Q30L50.sam
8. Merge all the bam files using ‘bamaddrg’ with a command such as the one below. As
sample names use the names from the file “SampleNamesMappingFile.txt” (e.g. the name
for P1_003 is “Purple_Dreaming”; do not use spaces).
/data/software/bamaddrg/bamaddrg -b P1_003.bam -s
Purple_Dreaming -b P1_009.bam -s Merry_Christmas -b P1_014.bam
-s White -b P1_021.bam -s Good_Morning > SispeUser00_merged.bam
9. Sort the merged bam file with “samtools sort”.
samtools sort -o SispeUser00_sorted.bam SispeUser00_merged.bam
10. Delete the sam files and the unsorted bam files.
GDA2: 5- Read mapping to a reference.
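The `-F 4` filter used above keeps only alignments whose FLAG field does not have bit 0x4 (read unmapped) set. The same bit test can be reproduced in plain awk on a SAM text file, which makes explicit what samtools is filtering (the two-read SAM below is made up):

```shell
# Toy SAM: one header line, one mapped read (FLAG 0), one unmapped read (FLAG 4)
printf '@SQ\tSN:ref\tLN:100\n' > toy.sam
printf 'r1\t0\tref\t10\t60\t5M\t*\t0\t0\tACGTA\t*\n' >> toy.sam
printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\t*\n' >> toy.sam

# Same count as: samtools view -c -F 4 toy.sam
# (skip header lines; keep records whose FLAG bit 0x4 is unset)
awk '!/^@/ { if (int($2 / 4) % 2 == 0) mapped++ } END { print mapped + 0 }' toy.sam   # 1
```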
165. Genomic Data Analysis
1. Introduction to Next Generation Sequencing Technologies.
2. Experimental design for population studies, from breeding
to ecological studies.
3. De-multiplexing and the complexities of sample
identification.
4. Read processing and quality evaluation.
5. Read mapping to a reference.
6. Variant calling and summary of the read processing.
7. Quality evaluation and possible pitfalls.
166. Genetic variation is the set of genetic differences both within and among populations.
• Structural differences: Structural Variations (SVs), Copy Number Variation (CNV).
• Molecular differences (changes in the DNA sequence).
• Single Nucleotide Variants/Polymorphisms (SNVs/SNPs)
• Insertions/deletions Variants/Polymorphisms (INDELs/DIVs/DIPs)
• Multiple Nucleotide Variants/Polymorphism (MNVs/MNPs)
GACGTGC
GCCGTGC
| |||||
Sample 1
Sample 2
Polymorphism is a DNA sequence variation that is common in the population
GACGTGC
G-CGTGC
| |||||
Sample 1
Sample 2
SNVs/SNPs INDELs/DIVs/DIPs
GACGTGC
GCTGTGC
| ||||
Sample 1
Sample 2
MNVs/MNPs
GDA2: 6- Variant calling and summary of the read processing.
168. Variant Calling Software:
Name Type Strength Weaknesses
SamTools Heuristic
• Assumes errors are non-
independent (matches data)
• Good accuracy with low
coverage data
• Reasonably quick
• Increase false positives at
high coverage
• Lower quality indel calling
GATK Probabilistic
• Trains with real data
• Excellent accuracy with high
coverage data
• Low false positive rate
• Assumes errors are
independent
• High level of preprocessing
• Very slow
FreeBayes Probabilistic
• Combined bam population
estimate
• Good accuracy with low
coverage data
• Very very quick
• No training, population level
estimate only
• Lower quality indel calling
GDA2: 6- Variant calling and summary of the read processing.
170. Variant calling workflow: Processed Reads → (read mapping) → Mapped Reads →
(local realignment, sorting and filtering) → Processed Map → (variant calling) →
Variants → (variant filtering: VCFTools, GATK) → (variant annotation: SnpEff) →
Annotated Variants.
GDA2: 6- Variant calling and summary of the read processing.
171. Practice 2.3: Variant calling.
1. Change the working directory to ‘sinningia_genotyping’
2. Make a directory with the name: “03_variants”.
3. Change working directory to “02_mapped”.
4. Create an index with “samtools index” for the sorted bam file:
samtools index SispeUser00_sorted.bam
5. Run “freebayes” with --min-base-quality 30 --min-mapping-quality 20 --min-coverage 5
with a command such as:
freebayes -b SispeUser00_sorted.bam -f ../../linux_exercises/Sispe_ref/Sispe038ReducedRef.fa -v ../03_variants/SispeUser00.vcf --min-coverage 5 -q 30 -m 20
6. Count how many variants the VCF file has, including a breakdown of variants per type
(SNP, INDEL, MNP and Complex).
GDA2: 6- Variant calling and summary of the read processing.
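One way to get the per-type breakdown is the TYPE= tag that FreeBayes writes into the INFO column; pulling it out with sed and counting gives the summary. The toy records below are invented, and multi-allelic records would carry comma-separated types:

```shell
# Toy freebayes-style VCF records; freebayes stores the variant class
# in the TYPE= INFO tag (snp, mnp, ins, del, complex)
printf '20\t100\t.\tG\tA\t30\tPASS\tDP=10;TYPE=snp\n' > toy_fb.vcf
printf '20\t200\t.\tTA\tT\t40\tPASS\tDP=12;TYPE=del\n' >> toy_fb.vcf
printf '20\t300\t.\tGC\tAT\t50\tPASS\tDP=14;TYPE=mnp\n' >> toy_fb.vcf
printf '20\t400\t.\tC\tT\t60\tPASS\tDP=16;TYPE=snp\n' >> toy_fb.vcf

# Extract the TYPE value from each record and count per type;
# top line here is "2 snp", with del and mnp appearing once each
sed -r 's/.*TYPE=([a-z,]+).*/\1/' toy_fb.vcf | sort | uniq -c | sort -nr
```

On a real file, add `grep -v '#'` in front of the sed to skip the header lines.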
172. Genomic Data Analysis
1. Introduction to Next Generation Sequencing Technologies.
2. Experimental design for population studies, from breeding
to ecological studies.
3. De-multiplexing and the complexities of sample
identification.
4. Read processing and quality evaluation.
5. Read mapping to a reference.
6. Variant calling and summary of the read processing.
7. Quality evaluation and possible pitfalls.
173. Methods for Variant Evaluation
• Validation by Sanger Sequencing of specific candidates (~5 - 500) using other
datasets (e.g. transcriptome) if it is possible.
• Comparison with other method (e.g. genotyping chip).
• Different mapping and variant calling tools comparison (with a “truth set” or a
“gold standard” if it is possible).
GDA2: 7- Quality evaluation and possible pitfalls.
https://gatkforums.broadinstitute.org/gatk/discussion/6308/evaluating-the-quality-of-a-variant-callset
174. • Validation by Sanger Sequencing of specific candidates (~5 - 500) using other
datasets (e.g. transcriptome) if it is possible.
GDA2: 7- Quality evaluation and possible pitfalls.
Variants from
RNASeq
(Illumina)
Variants from
ESTs
(Sanger)
175. • Different mapping, variant calling tools and datasets comparison (with a “truth
set” or a “gold standard” if it is possible).
GDA2: 7- Quality evaluation and possible pitfalls.
Assumptions:
1. The content of the truth set has been validated.
2. Your samples are expected to have genomic content similar to that of the
population of samples used to produce the truth set.
176. Metrics:
1. Variant level concordance: Percentage of variants in your samples that
match (are concordant with) variants in your truth set.
2. Genotype concordance: Percentage of genotypes in your samples that
match (are concordant with) the genotypes in your truth set.
• Different mapping, variant calling tools and datasets comparison (with a “truth
set” or a “gold standard” if it is possible).
GDA2: 7- Quality evaluation and possible pitfalls.
Worked example: my dataset contains 16 variants and the truth set 18; 13 are shared
(TP), 3 appear only in my dataset (FP) and 5 only in the truth set (FN).
% SENSITIVITY: TP * 100 / (TP + FN) = 13 * 100 / (13 + 5) = 72%
% FALSE DISCOVERY RATE: FP * 100 / (TP + FP) = 3 * 100 / (13 + 3) ≈ 19%
% GT CONCORDANCE: SumMatches * 100 / ComparedSites; e.g. with a truth set of 9
calls and a dataset of 8 calls that match at 6 of the 11 compared sites,
6 * 100 / 11 = 54%
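Plugging the counts of the slide into the formulas with shell arithmetic (integer division, hence 18 rather than 18.75 for the false discovery rate):

```shell
TP=13; FP=3; FN=5
echo "sensitivity    = $(( TP * 100 / (TP + FN) ))%"   # 72%
echo "FDR            = $(( FP * 100 / (TP + FP) ))%"   # 18% (18.75 before truncation)
echo "GT concordance = $(( 6 * 100 / 11 ))%"           # 54%
```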
177. Metrics:
3. Number of SNPs and INDELs: Between different datasets should be
consistent for the same number of mapped reads.
4. Ts/Tv ratio: if mutations were random, the ratio of transitions (Ts) to
transversions (Tv) would be ~0.5; methylation islands (CpG) and other factors
introduce a bias, so expected values range from 0.5 to 3.0.
5. Insertion/Deletion ratio: it should be close to 1, except for rare alleles,
where it can be 0.2 - 0.5.
• Different mapping, variant calling tools and datasets comparison (with a “truth
set” or a “gold standard” if it is possible).
GDA2: 7- Quality evaluation and possible pitfalls.
178. Comparison between different tools:
• Different mapping, variant calling tools and datasets comparison (with a “truth
set” or a “gold standard” if it is possible).
GDA2: 7- Quality evaluation and possible pitfalls.
https://bcbio.wordpress.com/
179. Tools:
• Different mapping, variant calling tools and datasets comparison (with a “truth
set” or a “gold standard” if it is possible).
GDA2: 7- Quality evaluation and possible pitfalls.
Name URL
VariantEvaluation
(GATK)
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/
org_broadinstitute_gatk_tools_walkers_varianteval_VariantEval.php
GenotypeConcordance
(GATK)
https://software.broadinstitute.org/gatk/documentation/tooldocs/current/
org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeConcordance.php
VCFTools http://vcftools.sourceforge.net/
VCFStats http://lindenb.github.io/jvarkit/
PicardTools https://broadinstitute.github.io/picard/index.html
181. Genomic Data Analysis
1. Variant filtering.
2. Simple stats for the variant analysis.
3. Variant visualization tools: IGV and TASSEL.
4. Changing formats for VCF files.
5. Example 1: Population analysis with Structure for Sinningia
speciosa.
6. Example 2: Genetic Map with R/QTL for Nicotiana
benthamiana.
182. Genomic Data Analysis
1. Variant filtering.
2. Simple stats for the variant analysis.
3. Variant visualization tools: IGV and TASSEL.
4. Changing formats for VCF files.
5. Example 1: Population analysis with Structure for Sinningia
speciosa.
6. Example 2: Genetic Map with R/QTL for Nicotiana
benthamiana.
183. Variant filtering is the process of removing low-quality or otherwise
inadequate variants (e.g. non-biallelic, complex…) before the downstream
analysis. It depends on:
1. Source and methodology used to generate the data (library
preparation errors and biases).
2. Sequencing technology (read sequencing errors) and amount of
data (insufficient depth/sites).
3. Software used for mapping (mapping errors) and variant calling
(produced by a low coverage/low complexity sites).
4. Reliability (low quality/incomplete) and nature (genomic differences/
polyploidy) of the reference genome.
5. Type of population (e.g. F2 population) and type of analysis that it
will be performed.
GDA3: 1-Variant Filtering
184. Variant filtering
Two major sources of error (Li et al. 2014):
• Erroneous realignment in low-complexity regions
• Incomplete reference genome with respect to the sample
GDA3: 1-Variant Filtering
“The error rate of raw genotype calls is as high as 1 in 10-15 kb, but the error
rate of post-filtered calls is reduced to 1 in 100-200 kb without a significant
compromise on the sensitivity”.
More data is not always better.
High quality/reliable data
185. Alignment problems
GDA3: 1-Variant Filtering
coordinates 12345678901234 5678901234567890123456
reference aggttttttataac---aattaagtctacagagcaacta
sample aggttttttataacAATaattaagtctacagagcaacta
read1 aggttttttataac***aaAtaa
read2 ggttttttataac***aaAtaaTt
read3 ttttataacAATaattaagtctaca
read4 CaaT***aattaagtctacagagcaac
read5 aaT***aattaagtctacagagcaact
read6 T***aattaagtctacagagcaacta
Aligners calculate the alignment correctness and give it a score
depending on:
• Length of the alignment.
• Number of mismatches and gaps.
• Uniqueness of the alignment (number of hits).
}Good alignment
Misaligned bases
186. Alignment problems
GDA3: 1-Variant Filtering
Misaligned bases - Solutions:
• Read realignment (IndelRealigner for GATK, now obsolete;
it is integrated into the HaplotypeCaller).
• Mark the per-base alignment quality (BAQ) and do not use
those bases for variant calling.
187. Library preparation problems
GDA3: 1-Variant Filtering
PCR duplications produce biases in the variant call (e.g. het.)
• Library specific problem for Whole Genome Sequencing.
Gene A Gene B Gene C
Fragmentation
Library preparation
PCR Duplication
188. PCR duplications - Solutions:
• Remove duplicates with tools such as samtools rmdup
Library preparation problems
GDA3: 1-Variant Filtering
SKIP PCR DUPLICATION MARKING STEP FOR GBS, RAD-SEQ…
CAREFUL: Some reduced representation techniques with unequal ratios
of site amplification WILL PRODUCE THOUSANDS OF PCR DUPLICATES
193. GDA3: 1-Variant Filtering
Variant caller comparison:
Three general purpose callers:
• FreeBayes (v0.9.9.2-18)
• GATK UnifiedGenotyper (2.7-2)
• GATK HaplotypeCaller (2.7-2)
• Skipping base recalibration and indel realignment
had almost no impact on the quality of resulting
variant calls
• FreeBayes outperforms the GATK callers on both
SNP and indel calling. The most recent versions of
FreeBayes have improved sensitivity and specificity
which puts them on par with GATK HaplotypeCaller.
• GATK HaplotypeCaller is all around better than the
UnifiedGenotyper.
195. GDA3: 1-Variant Filtering
Examples using VCFTools
1. Remove variants with low quality, QUAL < 20.
vcftools --vcf input.vcf --minQ 20 --recode --recode-INFO-all --out out
2. Remove variants with mean depth DP < 10.
vcftools --vcf input.vcf --min-meanDP 10 --recode --recode-INFO-all --out out
3. Keep variants separated by at least 1000 bp.
vcftools --vcf input.vcf --thin 1000 --recode --recode-INFO-all --out out
4. Keep only biallelic variants.
vcftools --vcf input.vcf --min-alleles 2 --max-alleles 2 --recode --recode-INFO-all --out out
5. Remove sites with missing data.
vcftools --vcf input.vcf --max-missing 1.0 --recode --recode-INFO-all --out out
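The effect of the quality filter can be reproduced on a toy VCF with plain awk (QUAL is column 6 of a VCF record). File names here are illustrative:

```shell
# Build a minimal tab-delimited VCF: one header line, three records
# (QUAL is column 6). File names are illustrative.
{
  echo '##fileformat=VCFv4.2'
  printf 'chr1\t100\t.\tA\tG\t35\t.\t.\n'
  printf 'chr1\t200\t.\tC\tT\t12\t.\t.\n'
  printf 'chr1\t300\t.\tG\tA\t88\t.\t.\n'
} > input.vcf

# Keep header lines and records with QUAL >= 20, the same effect
# as `vcftools --minQ 20` on this toy file.
awk -F'\t' '/^#/ || $6 >= 20' input.vcf > filtered.vcf
grep -vc '^#' filtered.vcf   # → 2
```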
196. Practice 3.1: Filter the variant file.
1. Change the working directory to ‘sinningia_genotyping’
2. Change working directory to “03_variants”.
3. Run “vcf-stats” on the “SispeUserXX.vcf” file.
4. Remove the variants with QUAL < 20 and run “vcf-stats” again.
5. Remove the variants with DEPTH < 10 and run “vcf-stats” again.
6. Remove variants separated by less than 1000 bp and run “vcf-stats” again.
7. Get biallelic variants and run “vcf-stats” again.
8. Remove all the genotypes with missing data.
9. Select only SNPs.
GDA3: 1-Variant Filtering
197. Genomic Data Analysis
1. Variant filtering.
2. Simple stats for the variant analysis.
3. Variant visualization tools: IGV and TASSEL.
4. Changing formats for VCF files.
5. Example 1: Population analysis with Structure for Sinningia
speciosa.
6. Example 2: Genetic Map with R/QTL for Nicotiana
benthamiana.
198. Stats for the VCF files
GDA3: 2- Simple population stats for the variant analysis.
Regular stats with vcf-stats (https://vcftools.github.io/perl_module.html)
vcf-stats computes several statistics for a VCF file, producing the
following files:
• stats.counts, number of variants per sample and for all the samples.
• stats.dump, parseable Perl hash file with all the VCF stats.
• stats.indels, number of INDELs per sample and for all the samples.
• stats.legend, various definitions.
• stats.private, unique (not shared) variants for each sample.
• stats.samples-tstv, transitions/transversions for each sample.
• stats.shared, shared variants for all the samples.
• stats.snps, number of SNPs per sample and for all the samples.
• stats.tstv, transitions/transversions for all the samples.
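stats.tstv summarizes the transition/transversion ratio; the classification itself is simple (transitions are A<->G and C<->T, everything else is a transversion), as this awk sketch on invented REF/ALT pairs shows:

```shell
# Toy REF/ALT substitution pairs (invented data).
printf '%s\n' 'A G' 'C T' 'A C' 'G T' 'C T' 'A G' > pairs.txt

# Normalize each pair (e.g. "G A" -> "AG"), classify it as
# transition (AG or CT) or transversion, and report ts/tv.
awk '{
  pair = ($1 < $2) ? ($1 $2) : ($2 $1)
  if (pair == "AG" || pair == "CT") ts++; else tv++
}
END { printf "ts=%d tv=%d ts/tv=%.2f\n", ts, tv, ts/tv }' pairs.txt
# → ts=4 tv=2 ts/tv=2.00
```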
199. Stats for the VCF files
GDA3: 2- Simple population stats for the variant analysis.
Distribution with bcftools stats (https://samtools.github.io/bcftools/bcftools.html)
bcftools is a program for manipulating VCF/BCF files. Its stats command
produces several data distributions, such as QUAL (quality), DP (depth), ST
(substitution types), IDD (InDel size distribution) and AF (allele frequency).
It also includes a summary section (SN).
# SN, Summary numbers:
# SN [2]id [3]key [4]value
SN 0 number of samples: 4
SN 0 number of records: 110927
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 99184
SN 0 number of MNPs: 10943
SN 0 number of indels: 3101
SN 0 number of others: 506
SN 0 number of multiallelic sites: 3816
SN 0 number of multiallelic SNP sites: 798
200. Stats for the VCF files
GDA3: 2- Simple population stats for the variant analysis.
Distribution with vcfutils.pl qstats
vcfutils.pl qstats reports the quality and ts/tv parameters associated
with the SNPs. It can be used to test whether there is a ts/tv bias
associated with low-quality calls.
QUAL #non-indel #SNPs #transitions #joint ts/tv #joint/#ref #joint/#non-indel
6856.32 1909 1909 654 0 0.5211 0.0000 0.0000 0.5211
4769.34 3818 3818 1381 0 0.5667 0.0000 0.0000 0.6151
3506.53 5727 5727 2215 0 0.6307 0.0000 0.0000 0.7758
2748.14 7636 7636 3051 0 0.6654 0.0000 0.0000 0.7791
2240.06 9545 9545 3956 0 0.7078 0.0000 0.0000 0.9014
. . .
16.3149 80179 80179 41279 0 1.0612 0.0000 0.0000 1.2945
11.551 82088 82088 42386 0 1.0676 0.0000 0.0000 1.3803
6.48534 83997 83997 43471 0 1.0727 0.0000 0.0000 1.3167
2.79415 85906 85906 44556 0 1.0775 0.0000 0.0000 1.3167
201. Stats for the VCF files
GDA3: 2- Simple population stats for the variant analysis.
Population stats with vcftools
vcftools can also compute simple population genetics parameters
from a VCF file. Some examples:
• Calculate nucleotide diversity (π).
vcftools --vcf input.vcf --keep NamesGroup1.txt --window-pi 100000 --out Group1_Pi
• Calculate linkage disequilibrium (LD) (for phased genotypes).
vcftools --vcf input.vcf --keep NamesGroup1.txt --ld-window-bp 50000 --chr SeqID1 --hap-r2 --min-r2 0.001 --out Group1_SeqID1_LD
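The windowing that --window-pi performs can be illustrated with awk: assign each variant to a 100 kb bin and aggregate per bin. Here the sketch just counts variants per window; vcftools averages per-site diversity instead. The positions are invented:

```shell
# Toy variant positions: chrom pos (invented data).
printf '%s\n' \
  'chr1 15000' \
  'chr1 95000' \
  'chr1 120000' \
  'chr1 180000' \
  'chr2 50000' > sites.txt

# Assign each variant to a 100 kb window and count per window.
awk '{ win = int($2 / 100000); n[$1 FS win]++ }
     END { for (k in n) print k, n[k] }' sites.txt | sort
# → chr1 0 2
# → chr1 1 2
# → chr2 0 1
```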
202. Stats for the VCF files
GDA3: 2- Simple population stats for the variant analysis.
Population stats with vcftools
vcftools can also compute simple population genetics parameters
from a VCF file. Some examples:
• Calculate FST between two groups.
vcftools --vcf input.vcf --weir-fst-pop SampleGroup1.txt --weir-fst-pop SampleGroup2.txt --fst-window-size 100000 --out Group1_VS_Group2_FST
• Calculate Tajima's D for one group.
vcftools --vcf input.vcf --keep NamesGroup1.txt --TajimaD 100000 --out Group1_TajimaD
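FST estimators compare allele frequencies between groups. A sketch of the per-site bookkeeping on invented GT calls (no missing data), with group 1 in columns 1-2 and group 2 in columns 3-4:

```shell
# Toy GT calls (0/0, 0/1, 1/1) for four diploid samples per site.
printf '%s\n' '0/1 0/0 1/1 1/1' '0/0 0/1 1/1 0/1' > gts.txt

# Count ALT alleles per group (2 diploids = 4 alleles) and report
# the ALT-allele frequency, the per-site quantity that estimators
# such as Weir-Cockerham FST are built from.
awk '{
  a1 = substr($1,1,1) + substr($1,3,1) + substr($2,1,1) + substr($2,3,1)
  a2 = substr($3,1,1) + substr($3,3,1) + substr($4,1,1) + substr($4,3,1)
  printf "site %d: p1=%.2f p2=%.2f\n", NR, a1/4, a2/4
}' gts.txt
# → site 1: p1=0.25 p2=1.00
# → site 2: p1=0.25 p2=0.75
```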
203. Practice 3.2: Get stats for the VCF file
1. Change the working directory to ‘sinningia_genotyping’
2. Change working directory to “03_variants”.
3. Run “vcf-stats” on the “SispeUserXX.vcf” file.
4. Run “bcftools stats” on the “SispeUserXX.vcf” and pipe the output through grep to select the “^SN” lines.
5. Run “vcftools” to calculate the nucleotide diversity on the “SispeUserXX.vcf”.
6. Run “vcftools” to calculate the Tajima's D on the “SispeUserXX.vcf”.
7. Divide your dataset in two groups and calculate the FST between these two groups.
GDA3: 2- Simple population stats for the variant analysis.
204. Genomic Data Analysis
1. Variant filtering.
2. Simple stats for the variant analysis.
3. Variant visualization tools: IGV and TASSEL.
4. Changing formats for VCF files.
5. Example 1: Population analysis with Structure for Sinningia
speciosa.
6. Example 2: Genetic Map with R/QTL for Nicotiana
benthamiana.
205. IGV, Integrative Genomic Viewer
GDA3: 3- Variant visualization tools: IGV and TASSEL
http://software.broadinstitute.org/software/igv/
The Integrative Genomics Viewer (IGV) is a high-performance
visualization tool for interactive exploration of large, integrated genomic
datasets. It supports a wide variety of data types, including array-based
and next-generation sequence data, and genomic annotations.
207. IGV, Integrative Genomic Viewer
GDA3: 3- Variant visualization tools: IGV and TASSEL
1- Create a new .genome file for the Sinningia reference.
2- Add a unique identifier (e.g. “Sispe038”), a descriptive name (e.g. “S.
speciosa version 0.3.8”), and the FASTA and GFF files of the reference.
208. IGV, Integrative Genomic Viewer
GDA3: 3- Variant visualization tools: IGV and TASSEL
3- Select any scaffold
209. IGV, Integrative Genomic Viewer
GDA3: 3- Variant visualization tools: IGV and TASSEL
4- To load any VCF or BAM file, select “Load From File”.
5- Then load your VCF file (e.g. “SispeUser00.vcf”).
6- Then select the scaffold that you want to visualize (e.g. “Sispe038Scf0002”).
210. IGV, Integrative Genomic Viewer
GDA3: 3- Variant visualization tools: IGV and TASSEL
It creates two tracks: 1- all the variants, with the AF shown as a red/blue
bar; 2- the individual samples.
211. IGV, Integrative Genomic Viewer
GDA3: 3- Variant visualization tools: IGV and TASSEL
You can also load BAM files to see the read alignments.
212. TASSEL
GDA3: 3- Variant visualization tools: IGV and TASSEL
http://www.maizegenetics.net/tassel
TASSEL is a tool to investigate the relationship between phenotypes and
genotypes. TASSEL has functionality for association studies, evaluating
evolutionary relationships, analysis of linkage disequilibrium, principal
component analysis, cluster analysis, missing data imputation and data
visualization.
216. TASSEL
GDA3: 3- Variant visualization tools: IGV and TASSEL
4- Get a distance matrix
217. TASSEL
GDA3: 3- Variant visualization tools: IGV and TASSEL
5- Perform a Principal Component Analysis
218. TASSEL
GDA3: 3- Variant visualization tools: IGV and TASSEL
6- Produce a cladogram
219. Genomic Data Analysis
1. Variant filtering.
2. Simple stats for the variant analysis.
3. Variant visualization tools: IGV and TASSEL.
4. Changing formats for VCF files.
5. Example 1: Population analysis with Structure for Sinningia
speciosa.
6. Example 2: Genetic Map with R/QTL for Nicotiana
benthamiana.
220. Change formats from VCF to others.
GDA3: 4- Changing formats for VCF files.
http://www.cmpg.unibe.ch/software/PGDSpider/
221. Change formats from VCF to others.
VCF => FastStructure
PGDSpider can be used to convert between different formats, e.g. from VCF
to FastStructure. First remove multi-base variants, then run PGDSpider:
perl -ne 'chomp($_); if ($_ =~ m/#/) { print "$_\n" } else { @a = split(/\t/, $_); if (length($a[3]) == 1 && length($a[4]) == 1) { print "$_\n" } }' input.vcf > clean.vcf

java -Xmx1024m -Xms512m -jar /data/software/PGDSpider_2.1.1.2/PGDSpider2-cli.jar -inputfile clean.vcf -inputfileformat VCF -outputfile clean.structure.str -outputfileformat STRUCTURE -spid VCF2FastStructure.spid
GDA3: 4- Changing formats for VCF files.
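The Perl one-liner above only keeps header lines and records whose REF and ALT are a single base. An equivalent, more readable awk sketch (toy data and file names):

```shell
# Toy VCF: the second record is a multi-base variant to be dropped.
{
  echo '##fileformat=VCFv4.2'
  printf 'chr1\t100\t.\tA\tG\t50\t.\t.\n'
  printf 'chr1\t200\t.\tAC\tGT\t50\t.\t.\n'
} > raw.vcf

# Keep headers plus records whose REF ($4) and ALT ($5)
# are both one base long.
awk -F'\t' '/^#/ || (length($4) == 1 && length($5) == 1)' raw.vcf > clean.vcf
grep -vc '^#' clean.vcf   # → 1
```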
222. Change formats from VCF to others.
VCF => FastStructure
PGDSpider requires a configuration file (.spid) for each format pair.
Example for a VCF-to-FastStructure conversion:
# VCF Parser questions
PARSER_FORMAT=VCF
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
VCF_PARSER_REGION_QUESTION=
VCF_PARSER_PL_QUESTION=GT
VCF_PARSER_QUAL_QUESTION=20
VCF_PARSER_GTQUAL_QUESTION=0
VCF_PARSER_READ_QUESTION=5
VCF_PARSER_IND_QUESTION=
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=TRUE
VCF_PARSER_MONOMORPHIC_QUESTION=FALSE
VCF_PARSER_POP_QUESTION=
# STRUCTURE Writer questions
WRITER_FORMAT=STRUCTURE
STRUCTURE_WRITER_FAST_FORMAT_QUESTION=TRUE
STRUCTURE_WRITER_LOCI_DISTANCE_QUESTION=TRUE
GDA3: 4- Changing formats for VCF files.
225. Genomic Data Analysis
1. Variant filtering.
2. Simple stats for the variant analysis.
3. Variant visualization tools: IGV and TASSEL.
4. Changing formats for VCF files.
5. Example 1: Population analysis with Structure for Sinningia
speciosa.
6. Example 2: Genetic Map with R/QTL for Nicotiana
benthamiana.
227. Genetic Variation in the Species
Wild accessions 9
Cultivars 25
Wild x Cultivated F1 1
Other species 1
____________________________________________
TOTAL 36
Sequencing
Library preparation
De-multiplexing
Read processing
Alignment
Variant detection
SNP filtering
APeKI digestion
Illumina, single end, 100 bp
GBSX v1.2
Fastq-mcf v1.04.807, Q30, L50
Bowtie2 v2.2.4
Freebayes v0.9.20
bcftools: only biallelic SNPs
vcffilter: Q>30, Depth >= 5
vcftools: no missing observations
41,626 SNPs
GDA3: 5- Example 1: Population analysis with Structure for Sinningia speciosa.
228. Genetic Variation in the Species
1. Remove SNPs defined with more than one character (e.g. AC/AG).
perl -ne 'chomp($_); if ($_ =~ m/#/) { print "$_\n" } else { @a = split(/\t/, $_); if (length($a[3]) == 1 && length($a[4]) == 1) { print "$_\n" } }' Sispe038_Set01_FILTERED_SNPs.vcf > Sispe038_Set01_FILTERED_SNPs_CLEAN.vcf
2. Change the VCF format to FastStructure.
java -Xmx1024m -Xms512m -jar /data/software/PGDSpider_2.1.1.2/PGDSpider2-cli.jar -inputfile Sispe038_Set01_FILTERED_SNPs_CLEAN.vcf -inputfileformat VCF -outputfile Sispe038_Set01_FILTERED_SNPs_CLEAN.structure.str -outputfileformat STRUCTURE -spid VCF2FastStructure.spid
GDA3: 5- Example 1: Population analysis with Structure for Sinningia speciosa.
229. Genetic Variation in the Species
3. Prepare a script (run_faststructure.sh) with the fastStructure command
line, 5 replicates, random seeds and K from 1 to 20.
4. Change the permissions of the script and run it:
chmod 755 run_faststructure.sh
./run_faststructure.sh
#!/bin/bash
python /data/software/fastStructure/structure.py -K 1 --input=Sispe038_Set01_FILTERED_SNPs_CLEAN.structure --output=Sispe038_Set01_FS_K01_R01 --format=str --seed=123456789
…
GDA3: 5- Example 1: Population analysis with Structure for Sinningia speciosa.
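The script in step 3 can be generated with a loop rather than written by hand. A sketch, using the paths and file names from the slides; here the seed is derived from K and the replicate number so the sketch is reproducible, whereas the slides call for random seeds:

```shell
# Emit one structure.py command per (K, replicate) pair:
# K = 1..20, 5 replicates, 100 commands plus the shebang line.
{
  echo '#!/bin/bash'
  for K in $(seq 1 20); do
    for R in $(seq 1 5); do
      printf 'python /data/software/fastStructure/structure.py -K %d --input=Sispe038_Set01_FILTERED_SNPs_CLEAN.structure --output=Sispe038_Set01_FS_K%02d_R%02d --format=str --seed=%d\n' \
        "$K" "$K" "$R" "$((K * 1000 + R))"
    done
  done
} > run_faststructure.sh
chmod 755 run_faststructure.sh
grep -c 'structure.py' run_faststructure.sh   # → 100
```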
230. Genetic Variation in the Species
5. Run ChooseK to get the most supported K.
python /data/software/fastStructure/chooseK.py --input=Sispe038_Set01_FS_*
Model complexity that maximizes marginal likelihood = 2
Model components used to explain structure in data = 2
GDA3: 5- Example 1: Population analysis with Structure for Sinningia speciosa.
231. Genetic Variation in the Species
5. Run ChooseK to get the most supported K.
Model components used to explain structure in data = 2
In our review of 1,264 studies using structure to explore population subdivision, studies
that used ΔK were more likely to identify K = 2 (54%, 443/822) than studies that did not
use ΔK (21%, 82/386). A troubling finding was that very few studies performed the
hierarchical analysis recommended by the authors of both ΔK and structure to fully
explore population subdivision.
GDA3: 5- Example 1: Population analysis with Structure for Sinningia speciosa.
232. Genomic Data Analysis
1. Variant filtering.
2. Simple stats for the variant analysis.
3. Variant visualization tools: IGV and TASSEL.
4. Changing formats for VCF files.
5. Example 1: Population analysis with Structure for Sinningia
speciosa.
6. Example 2: Genetic Map with R/QTL for Nicotiana
benthamiana.
236. 1. Change the VCF format to the CSV used as input by R/QTL with Vcf2Mapmaker
from GenoToolBox (https://github.com/aubombarely/GenoToolBox).
/old_home/aurebg/Software/GenoToolBox/SNPTools/Vcf2Mapmaker -i NibenGBS_M30.vcf -o NibenGBS_M30.csv -f csv -a S_64_2 -b S_65_2 -B -d 1000
2. Load the genotypes in R/QTL.
setwd('./')
library('qtl')
NbenX = read.cross(file="NibenGBS_M30.csv", format="csv")
3. Follow the R/QTL tutorial.
GDA3: 6- Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.
237. GDA3: 6- Example 2: Genetic Map with R/QTL for Nicotiana benthamiana.