Data and Python in Biology at PyData NYC 2015

http://marianattestad.com
How big data is transforming biology and how we are using Python to make sense of it all. Genome sequencing, big data in genomics, long-read sequencing, SplitThreader string graph library in Python, cancer research, and personal genomics!
I am Maria Nattestad, a PhD student in computational biology at Cold Spring Harbor Laboratory.

  1. How big data is transforming biology and how we are using Python to make sense of it all. Maria Nattestad, computational biology PhD student. PyData NYC 2015.
  2. Overview: Genome sequencing; Using Python to study cancer; Personal genomics
  3. Overview: Genome sequencing; Using Python to study cancer; Personal genomics
  4. Your genome: 46 strings of A, T, C, and G for a total of about 6 billion characters. [Figure label: male]
  5. Mutations in the genome can lead to cancer and other diseases. Over 20,000 genes are scattered all over the genome. The genome is the instruction manual for creating a living thing. Some changes in the genome do nothing or encode normal variation like hair color, while others can cause disease.
  6. [Figure labels] Illumina = “Next-generation sequencing”. Sanger = The original. Human Genome Project publishes first draft.
  7. Big Data. [Figure label: 3000 Rice Genomes Project]
  8. Sequencing by the numbers
     • Human genome is 6 billion letters [ATCG]
     • No technology exists that can read an entire chromosome from end to end
     • Illumina sequencing produces reads of 100 letters
     • If the genome were random, this would be enough
  9. The genome is not random: ATCGATCAT?ATCGATCATA (repeats). Because of this, the human genome STILL has gaps.
  10. Repeats make it harder to assemble the genome puzzle if a repeat is longer than the reads. [Figure: genome segments A, B, C, and D separated by copies of a repeat R, shown alongside short read fragments]
  11. Long-read DNA sequencing: Pacific Biosciences, Oxford Nanopore MinION. >10X as expensive as next-generation (Illumina) sequencing; >100X read length.
  12. Resolving repeats with long-read sequencing. [Figure: segments A, B, C, and D and repeat R assembled using long reads]
  13. Overview: Genome sequencing; Using Python to study cancer; Personal genomics
  14. How the human genome changes during cancer. [Figure: normal human genome]
  15. How the human genome changes during cancer. [Figure: cancer genome (Davidson et al., 2000), 80 chromosomes instead of 46] Cell line from a woman with metastatic breast cancer in 1971; tumor cells have been grown and studied in the lab ever since.
  16. Split-read variant calling. [Figure: chromosome 1, chromosome 2]
  17. A simple gene fusion. [Figure: Gene1, Gene2; fused Gene1 Gene2]
  18. A complex gene fusion. [Figure: Gene1, Gene2; fused Gene1 Gene2]
  19. SplitThreader: a new Python graph library for representing rearranged genomes. [Figure: CHR 1, CHR 2; sequences ATCGCCTA, GTCCATAG; segments ATCG, CCGA, ATAGGTCC; labels 8, 10, 2]
  20. Class structure of SplitThreader. [Figure: a Graph containing Nodes, Ports, and Edges] Once you enter a node, you must exit out the other side, like a tunnel.
  21. Biological insights from SplitThreader. Depth-first search or breadth-first search. Gene fusion finding. History of mutations.
  22. Using SplitThreader to find a gene fusion. Goal is to find a path like this: [Figure: CYTH1, EIF3H; fused CYTH1 EIF3H]
  23. Too many copies of Her2 contribute to making cancer worse. [Figure labels: Sequencing, Actual genome, Her2] Too much Her2 → too much signal to divide → too many cell divisions → cancer grows.
  24. About 40 copies of the Her2 gene scattered around the genome. [Figure label: Her2 gene]
  25. [Figure: Her2; Chr 17: 83 Mb; 8 Mb]
  26. [Figure: Her2, Her2]
  27. [Figure: Her2; 8 Mb; Chromosome 17]
  28. Her2 [Figure: Chr 17, Chr 8]
      1. Healthy chromosome 17
      2. Sequence copied into chromosome 8
      3. Subsequence copied within chromosome 8
      4. Complex variant and inverted duplication within chromosome 8
      5. Subsequence copied within chromosome 8
  29. SplitThreader is open source on GitHub: https://github.com/marianattestad/splitthreader [Figure: CHR 1, CHR 2; segments ATCG, CCGA, ATAGGTCC; labels 10, 2, 8] Visualization with D3.js is underway! Contributions are very welcome.
  30. Overview: Genome sequencing; Using Python to study cancer; Personal genomics
  31. Personal genomics
      • SNP chip: 23andMe; about $100; captures tiny mutations scientists already know to look for
      • Sequencing: Illumina, SureGenomics; about $1,000; captures large and small mutations even if completely novel and unexpected
  32. Personal genomics debates
      • Should the government allow these companies to give people their genomic data?
        – How about interpreting the health risks?
      • Is sharing your own genome breaking your family’s privacy?
  33. THANK YOU
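
Slides 9 to 12 argue that repeats longer than the reads leave gaps, and that longer reads resolve them. A minimal sketch of that argument follows. The segment sequences, the repeat `R`, and the helper names `read_set` and `distinguishable` are all made up for illustration, and the "reads" are idealized: error-free, one starting at every position. Two genomes that differ only in the order of the segments between repeat copies give exactly the same set of short reads, so no assembler could tell them apart.

```python
from collections import Counter


def read_set(genome, read_length):
    """Multiset of every substring of the given length (idealized reads)."""
    return Counter(
        genome[i:i + read_length]
        for i in range(len(genome) - read_length + 1)
    )


# Made-up segments; R is the repeat and appears three times in each genome.
A, B, C, D = "ACGTAC", "CATCAT", "GACGAC", "AGTCCA"
R = "TTAGGG"

genome_1 = A + R + B + R + C + R + D   # segment order A, B, C, D
genome_2 = A + R + C + R + B + R + D   # B and C swapped


def distinguishable(read_length):
    """True if the two genomes give different read sets at this read length."""
    return read_set(genome_1, read_length) != read_set(genome_2, read_length)


# A read of len(R) + 1 letters can never show what lies on both sides of R.
print(distinguishable(len(R) + 1))   # False
# Two more letters than R is enough to link the segments on either side.
print(distinguishable(len(R) + 2))   # True
```

Real reads have errors and uneven coverage, so in practice a read needs to extend well past both ends of a repeat, which is the case the long-read slide makes.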
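
Slide 16 names split-read variant calling without showing the mechanics. The toy sketch below illustrates the idea only: a read whose left part matches one chromosome and whose right part matches another points to a junction between them. The function `find_split`, the `min_anchor` parameter, and the two reference strings are hypothetical, and exact substring matching stands in for a real aligner, which would tolerate sequencing errors and report such reads as split alignments.

```python
def find_split(read, ref_a, ref_b, min_anchor=4):
    """Find a cut where read[:cut] occurs in ref_a and read[cut:] in ref_b.

    Returns a dict with the cut position in the read and the breakpoint
    coordinates in each reference, or None if no such cut exists.
    Both sides must be at least min_anchor letters long.
    """
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        pos_a = ref_a.find(read[:cut])
        pos_b = ref_b.find(read[cut:])
        if pos_a != -1 and pos_b != -1:
            return {"cut": cut, "break_a": pos_a + cut, "break_b": pos_b}
    return None


# Made-up reference sequences standing in for two chromosomes.
chromosome_1 = "ACGTTGCAAGGCTA"
chromosome_2 = "TTCAGGACCTGAAT"

# A read that starts on chromosome 1 and continues on chromosome 2.
split_read = chromosome_1[3:9] + chromosome_2[5:11]
print(find_split(split_read, chromosome_1, chromosome_2))

# A read taken entirely from chromosome 1 gives no split.
normal_read = chromosome_1[0:12]
print(find_split(normal_read, chromosome_1, chromosome_2))
```

The sketch returns the first cut that works; when the two sides share a few letters at the junction, several cuts are equally valid and a real caller has to report that ambiguity.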
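
Slides 20 to 22 describe a graph of Nodes, Ports, and Edges, the tunnel rule, and a breadth-first search for a path joining CYTH1 to EIF3H. The sketch below is one way those pieces could fit together. It is not SplitThreader's actual API: every class layout, method name (`other_side`, `far_end`, `find_fusion_path`), node name, and edge weight here is an assumption made for illustration.

```python
from collections import deque


class Port:
    """One side of a node; edges attach to ports, not to nodes."""
    def __init__(self, node, side):
        self.node, self.side, self.edges = node, side, []


class Node:
    """A stretch of chromosome between two breakpoints."""
    def __init__(self, name, genes=()):
        self.name, self.genes = name, set(genes)
        self.ports = {"start": Port(self, "start"), "end": Port(self, "end")}

    def other_side(self, port):
        # Tunnel rule: enter through one port, leave through the other.
        return self.ports["end" if port.side == "start" else "start"]


class Edge:
    """An adjacency between two ports, e.g. supported by split reads."""
    def __init__(self, port_a, port_b, weight):
        self.ports, self.weight = (port_a, port_b), weight
        port_a.edges.append(self)
        port_b.edges.append(self)

    def far_end(self, port):
        return self.ports[1] if port is self.ports[0] else self.ports[0]


class Graph:
    def __init__(self):
        self.nodes = {}

    def add_node(self, name, genes=()):
        self.nodes[name] = Node(name, genes)

    def add_edge(self, name_a, side_a, name_b, side_b, weight=1):
        Edge(self.nodes[name_a].ports[side_a],
             self.nodes[name_b].ports[side_b], weight)

    def find_fusion_path(self, gene_a, gene_b):
        """Breadth-first search from a node with gene_a to one with gene_b."""
        queue, seen = deque(), set()
        for node in self.nodes.values():
            if gene_a in node.genes:
                for port in node.ports.values():  # leave in either direction
                    queue.append((port, [node.name]))
        while queue:
            exit_port, path = queue.popleft()
            for edge in exit_port.edges:
                entry = edge.far_end(exit_port)
                if entry in seen:
                    continue
                seen.add(entry)
                new_path = path + [entry.node.name]
                if gene_b in entry.node.genes:
                    return new_path
                queue.append((entry.node.other_side(entry), new_path))
        return None


def build_example(variant):
    """Two made-up chromosomes (A1-A2 and B1-B2) plus one variant edge."""
    graph = Graph()
    graph.add_node("A1", genes=["CYTH1"])
    graph.add_node("A2")
    graph.add_node("B1")
    graph.add_node("B2", genes=["EIF3H"])
    graph.add_edge("A1", "end", "A2", "start")   # reference adjacency
    graph.add_edge("B1", "end", "B2", "start")   # reference adjacency
    graph.add_edge(*variant, weight=10)          # rearrangement
    return graph


fused = build_example(("A1", "end", "B2", "start"))
print(fused.find_fusion_path("CYTH1", "EIF3H"))    # ['A1', 'B2']

# Here A1, A2, B1, B2 are all linked, but reaching B1 would mean entering
# and leaving A2 through the same port, which the tunnel rule forbids.
blocked = build_example(("A2", "start", "B1", "end"))
print(blocked.find_fusion_path("CYTH1", "EIF3H"))  # None
```

Attaching edges to ports rather than nodes is what lets the search reject the second graph: a plain node-to-node adjacency list would report a path there that no real DNA molecule could follow.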