King’s College London, University of London
MSc in Advanced Software Engineering
Approximate Indexing: Gapped Suffix Array
KyungHoon Park
Agenda
• Research Objective
• Gapped suffix array
• Application
• Going beyond gSA
• Q&A
Research Objective
Main questions
1. Given an already constructed suffix array, can a gapped suffix array be built in O(n) time?
2. What are the limitations of the gapped suffix array, and how can they be overcome?
Research aims
1. To fully understand and implement the suffix array and the LCP array.
2. To implement a gapped suffix array from the suffix array in O(n) time.
3. To study and implement the gapped suffix array paper.
4. If the approach can be extended to multiple gapped suffix arrays, to investigate their further limitations.
Gapped Suffix Array
Main questions
1. Given an already constructed suffix array, can a gapped suffix array be built in O(n) time?
2. What are the limitations of the gapped suffix array, and how can they be overcome?
Definitions
T = t1 t2 … tn and P = p1 p2 … pm are strings of symbols over a finite alphabet
n = length of the text T
m = length of the search string P
k = number of allowed mistakes (Hamming distance)
Suffix Array
T = mississippi
i    T[i..]        SA[i]   T[SA[i]..]    LCP[i]
0    mississippi   10      i             0
1    ississippi    7       ippi          1
2    ssissippi     4       issippi       1
3    sissippi      1       ississippi    4
4    issippi       0       mississippi   0
5    ssippi        9       pi            0
6    sippi         8       ppi           1
7    ippi          6       sippi         0
8    ppi           3       sissippi      2
9    pi            5       ssippi        1
10   i             2       ssissippi     3
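The suffix array and LCP columns for T = mississippi can be reproduced with a short Python sketch. It is only an illustration: the suffix array is built by plain sorting rather than the linear-time skew algorithm the deck refers to, the LCP array uses Kasai's algorithm, and the function names `build_suffix_array` and `build_lcp` are hypothetical.

```python
def build_suffix_array(text):
    """Naive construction: sort start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp(text, sa):
    """Kasai's algorithm: lcp[r] is the longest common prefix of the
    suffixes at ranks r - 1 and r (lcp[0] is 0 by convention)."""
    n = len(text)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    lcp = [0] * n
    h = 0
    for pos in range(n):
        r = rank[pos]
        if r == 0:
            h = 0
            continue
        prev = sa[r - 1]
        while pos + h < n and prev + h < n and text[pos + h] == text[prev + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1
    return lcp


text = "mississippi"
sa = build_suffix_array(text)
lcp = build_lcp(text, sa)
for r, pos in enumerate(sa):
    print(r, pos, text[pos:], lcp[r])
```

Each printed row gives the rank, the start position SA[i], the sorted suffix, and LCP[i].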
Gapped Suffix Array
1. First introduced by Crochemore and Tischler (2010).
2. Constructed after the SA.
3. A suffix array in which a gap at a fixed range of positions is ignored when sorting, which provides an approximate index.
4. The gap range must be defined before the gapped suffix array is constructed.
Gapped Suffix Array
T = mississippi, (1, 2)-gSA(3, 1)
Rows are numbered from 1; SA and gSA hold 0-based text positions.
i    suffix        SA   gSA   gapped suffix
1    mississippi   10   10    i#
2    ississippi    7    7     i#pi
3    ssissippi     4    4     i#sippi
4    sissippi      1    1     i#sissippi
5    issippi       0    0     m#ssissippi
6    ssippi        9    9     p#
7    sippi         8    8     p#i
8    ippi          6    5     s#ippi
9    ppi           3    2     s#issippi
10   pi            5    6     s#ppi
11   i             2    3     s#ssippi
Definition
(g0, g1)-gSA(m, k)
gSA = gapped suffix array
g0 = start position of the gap
g1 = end position of the gap
m = length of the search string
k = Hamming distance
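To make the definition concrete, the following Python sketch builds a (g0, g1)-gSA by directly sorting the suffixes with the gap positions ignored. This is a naive illustration, not the linear-time construction; `gapped_key`, `build_gsa_naive`, and `show` are hypothetical names, and the gap is assumed to cover offsets g0 up to (but not including) g1 within each suffix.

```python
def gapped_key(text, pos, g0, g1):
    """Sort key for the suffix at pos with offsets [g0, g1) ignored."""
    suffix = text[pos:]
    return (suffix[:g0], suffix[g1:])


def build_gsa_naive(text, g0, g1):
    """Sort the suffix array by (part before the gap, part after the gap)."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # sorted() is stable, so suffixes with equal keys keep their SA order.
    return sorted(sa, key=lambda pos: gapped_key(text, pos, g0, g1))


def show(text, pos, g0, g1):
    """Render a gapped suffix with '#' marking the gap."""
    head, tail = gapped_key(text, pos, g0, g1)
    return head + "#" * (g1 - g0) + tail


text = "mississippi"
gsa = build_gsa_naive(text, 1, 2)
print(gsa)  # [10, 7, 4, 1, 0, 9, 8, 5, 2, 6, 3]
for pos in gsa:
    print(pos, show(text, pos, 1, 2))
```

The printed positions agree with the gSA column of the (1, 2)-gSA(3, 1) example for mississippi.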
Flow of constructing the gSA
1. Constructing the SA
   • Skew algorithm
2. Defining the limitations
   • Value of k (number of mistakes)
   • Range of the gap
3. Constructing the gSA
   • Sorting based on GRANK & HRANK
Limitations of gSA
1. The Hamming distance, the pattern length, and the gap range must be defined prior to construction.
2. A gSA cannot cover every approximate match allowed by the defined k.
   Example: k = 2, gap = (1, 3)
   coat -> c##t, ##at, co## (supported)
           #o#t, c#a# (not supported)
3. A gSA cannot support multiple gaps.
   Example: coach -> c#a#h
Constructing gSA - #1. GRANK
GRANK contains the ranks of the factors of the text T of length up to g0. That is, the rank is obtained by keeping only the characters that come before the beginning of the gap at position g0.
Example: m = 3, gap range = (1, 2)
i       0  1  2  3  4  5  6  7  8  9  10
T[i]    m  i  s  s  i  s  s  i  p  p  i
GRANK   5  1  8  8  1  8  8  1  6  6  1
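The GRANK row can be recomputed with a few lines of Python. The sketch assumes, based on the values in the table, that GRANK[i] is the 1-based rank of the first suffix in the SA that shares its first g0 characters with the suffix starting at i; `build_grank` is a hypothetical name, and the SA is built by plain sorting for brevity.

```python
def build_grank(text, g0):
    """GRANK by text position: 1-based rank of the first suffix in the SA
    that has the same length-g0 prefix."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    grank = [0] * len(text)
    group_start = 0
    previous_prefix = None
    for r, pos in enumerate(sa, start=1):
        prefix = text[pos:pos + g0]
        if prefix != previous_prefix:
            # A new group of suffixes begins at rank r.
            group_start = r
            previous_prefix = prefix
        grank[pos] = group_start
    return grank


grank = build_grank("mississippi", 1)
print(grank)  # [5, 1, 8, 8, 1, 8, 8, 1, 6, 6, 1]
```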
Constructing gSA - #2. HRANK
HRANK contains the ranks of the suffixes that start where the gap ends.
Because the suffix array has already been built before the gapped suffix array is constructed, the rank of the suffix that starts where the gap ends can be looked up directly in the inverse suffix array (ISA):
HRANK[r] = ISA[SA[r] + g1]
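A minimal Python sketch of the HRANK formula follows. It assumes 1-based ranks in ISA and uses 0 when SA[r] + g1 runs past the end of the text, which matches the values shown in the deck's example table; `build_hrank` is a hypothetical name.

```python
def build_hrank(text, g1):
    """HRANK by SA rank: rank of the suffix that starts g1 positions
    after SA[r], or 0 if that position lies beyond the text."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    isa = [0] * n
    for r, pos in enumerate(sa, start=1):
        isa[pos] = r  # 1-based rank of the suffix starting at pos
    return [isa[pos + g1] if pos + g1 < n else 0 for pos in sa]


hrank = build_hrank("mississippi", 2)
print(hrank)  # [0, 6, 8, 9, 11, 0, 1, 7, 10, 2, 3]
```

Using 0 for an empty remainder makes such suffixes sort before all others within the same GRANK group.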
GRANK & HRANK
For example, with the gap range (1, 2), the fourth suffix, sissippi, splits into its GRANK and HRANK parts as follows.
s       i     ssippi
GRANK   Gap   HRANK
If we perform a radix sort on the combined (GRANK, HRANK) pairs created in this way, the gSA can be built in linear time.
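The radix sort over (GRANK, HRANK) pairs can be sketched in Python as two stable counting-sort passes, first on HRANK and then on GRANK. This is an illustration under assumptions: the names `counting_sort` and `build_gsa` are hypothetical, ranks are 1-based with 0 for an empty remainder, and the SA itself is built by plain sorting as a stand-in for the linear-time skew algorithm.

```python
def counting_sort(items, key, max_key):
    """Stable counting sort of items by an integer key in [0, max_key]."""
    counts = [0] * (max_key + 2)
    for item in items:
        counts[key(item) + 1] += 1
    for k in range(1, len(counts)):
        counts[k] += counts[k - 1]  # counts[k] = first slot for key k
    out = [None] * len(items)
    for item in items:
        out[counts[key(item)]] = item
        counts[key(item)] += 1
    return out


def build_gsa(text, g0, g1):
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # stand-in for skew
    isa = [0] * n
    grank = [0] * n
    group_start, previous_prefix = 0, None
    for r, pos in enumerate(sa, start=1):
        isa[pos] = r
        prefix = text[pos:pos + g0]
        if prefix != previous_prefix:
            group_start, previous_prefix = r, prefix
        grank[pos] = group_start

    def hrank(pos):
        return isa[pos + g1] if pos + g1 < n else 0

    # Least significant key first, then the most significant key.
    by_hrank = counting_sort(sa, hrank, n)
    return counting_sort(by_hrank, lambda pos: grank[pos], n)


gsa = build_gsa("mississippi", 1, 2)
print(gsa)  # [10, 7, 4, 1, 0, 9, 8, 5, 2, 6, 3]
```

Both passes run in time linear in n because every key is a rank between 0 and n.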
Example of (1,2)-gSA(3,1)
GRANK is listed by text position (aligned with the suffix column); HRANK is listed by SA rank (aligned with the SA column).
i    suffix        SA   gSA   gapped suffix   GRANK   HRANK
1    mississippi   10   10    i#              5       0
2    ississippi    7    7     i#pi            1       6
3    ssissippi     4    4     i#sippi         8       8
4    sissippi      1    1     i#sissippi      8       9
5    issippi       0    0     m#ssissippi     1       11
6    ssippi        9    9     p#              8       0
7    sippi         8    8     p#i             8       1
8    ippi          6    5     s#ippi          1       7
9    ppi           3    2     s#issippi       6       10
10   pi            5    6     s#ppi           6       2
11   i             2    3     s#ssippi        1       3
Search in (1,2)-gSA(3,1)
For example, if P = mis (p0, p1, p2), three searches are needed:
- search mi (p0, p1) in the SA
- search is (p1, p2) in the SA
- search m#s (p0, p2) in the gSA
P = cot
Pattern           c#t                     #ot         co#
Searching array   in the (1,2)-gSA(3,1)   in the SA   in the SA
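The three searches can be sketched in Python for a pattern of length 3 and k = 1. This is a simplified illustration, not the deck's C# implementation: the SA and the (1,2)-gSA are built by plain sorting, the binary searches run over materialised key lists instead of using LCP, and `search_one_mismatch` is a hypothetical name.

```python
from bisect import bisect_left, bisect_right


def search_one_mismatch(text, pattern):
    """Start positions where a length-3 pattern matches with at most one
    mismatching character, using the SA and the (1,2)-gSA."""
    assert len(pattern) == 3
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    gsa = sorted(sa, key=lambda i: (text[i:i + 1], text[i + 2:]))

    def sa_range(query):
        # Prefixes of sorted suffixes are themselves sorted.
        keys = [text[pos:pos + len(query)] for pos in sa]
        return sa[bisect_left(keys, query):bisect_right(keys, query)]

    def gsa_range(first, last):
        keys = [(text[pos:pos + 1], text[pos + 2:pos + 3]) for pos in gsa]
        lo = bisect_left(keys, (first, last))
        hi = bisect_right(keys, (first, last))
        return gsa[lo:hi]

    hits = set()
    # p0 p1 # : search the first two characters in the SA.
    hits.update(pos for pos in sa_range(pattern[:2]) if pos + 3 <= n)
    # # p1 p2 : search the last two characters in the SA; the match
    # starts one position earlier.
    hits.update(pos - 1 for pos in sa_range(pattern[1:]) if pos >= 1)
    # p0 # p2 : search the first and last characters in the gSA.
    hits.update(gsa_range(pattern[0], pattern[2]))
    return sorted(hits)


print(search_one_mismatch("mississippi", "mis"))  # [0, 3]
```

Position 0 is the exact match mis, and position 3 is sis, which differs from mis only in its first character.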
Application
Platform and Language
1. Language: C#
2. Platform: Microsoft .NET (.NET Framework v4.0)
Algorithms
1. Construction of suffix array with LCP
- Radix sort
- Skew algorithm
2. Construction of gapped suffix array with gLCP
- Radix sort
3. Approximate string search
- pattern analysis
- binary search with LCP
Gapped Suffix Array
Going beyond gSA
Main questions
1. Given an already constructed suffix array, can a gapped suffix array be built in O(n) time?
2. What are the limitations of the gapped suffix array, and how can they be overcome?
Limitation of gSA
Suppose k is 1 and the gap ends at m - 1.
P = coat, (2,3)-gSA(4,1)
Pattern           #oat   c#at             co#t       coa#
Searching array   SA     Cannot support   gSA(4,1)   SA
P = coast, (3,4)-gSA(5,1)
Pattern           #oast   c#ast            co#st            coa#t      coas#
Searching array   SA      Cannot support   Cannot support   gSA(5,1)   SA
Countermeasure
P = coat, (2,3)-gSA(4,1)
Pattern           #oat   c#at       co#t       coa#
Searching array   SA     gSA(3,1)   gSA(4,1)   SA
P = coast, (3,4)-gSA(5,1)
Pattern           #oast   c#ast      co#st      coa#t      coas#
Searching array   SA      gSA(3,1)   gSA(4,1)   gSA(5,1)   SA
Countermeasure
P = cot: c#t, #ot, co#
gSA(3, 1) → SA, gSA(3, 1)
P = coat: #oat, c#at, co#t, coa#
gSA(4, 1) → SA, gSA(3, 1), gSA(4, 1)
P = coast: #oast, c#ast, co#st, coa#t, coas#
gSA(5, 1) → SA, gSA(3, 1), gSA(4, 1), gSA(5, 1)
P = coasts: #oasts, c#asts, co#sts, coa#ts, coas#s, coast#
gSA(6, 1) → SA, gSA(3, 1), gSA(4, 1), gSA(5, 1), gSA(6, 1)
gSA(m, 1) → SA, gSA(3, 1), …, gSA(m-2, 1), gSA(m-1, 1), gSA(m, 1)
Theorem: If the length of the gap is 1, the number of gSAs required is |m - 2|, and both construction and search can be performed in linear time.
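The assignment of single-gap patterns to indexes can be written out as a small Python sketch. It follows the naming used in the Countermeasure slides, where a gap at 0-based position j inside the pattern is served by gSA(j + 2, 1) and a gap at either end is served by the plain SA; the function names `index_for_single_mismatch` and `required_gsas` are hypothetical.

```python
def index_for_single_mismatch(pattern):
    """Pair each one-gap variant of the pattern with the index that serves it."""
    m = len(pattern)
    plan = []
    for j in range(m):
        masked = pattern[:j] + "#" + pattern[j + 1:]
        if j == 0 or j == m - 1:
            index = "SA"  # a gap at either end only needs the suffix array
        else:
            index = f"gSA({j + 2},1)"
        plan.append((masked, index))
    return plan


def required_gsas(m):
    """gSAs needed for all one-gap patterns of length m: gSA(3,1) .. gSA(m,1)."""
    return [f"gSA({length},1)" for length in range(3, m + 1)]


for masked, index in index_for_single_mismatch("coast"):
    print(masked, index)
print(len(required_gsas(5)))  # 3, that is m - 2 for m = 5
```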
Total count of required gSAs
gSA(m, p) → Required gapped suffix arrays
gSA(3,1) → SA, gSA(3,1)
gSA(4,1) → SA, gSA(3,1), gSA(4,1)
gSA(4,2) → SA, gSA(3,1), gSA(4,2)
gSA(5,1) → SA, gSA(3,1), gSA(4,1), gSA(5,1)
gSA(5,2) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2)
gSA(5,3) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,3)
gSA(6,1) → SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1)
gSA(6,2) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,2)
gSA(6,3) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,3)
gSA(6,4) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,4)
gSA(7,1) → SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1), gSA(7,1)
gSA(7,2) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,1), gSA(6,2), gSA(7,2)
gSA(7,3) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(7,3)
gSA(7,4) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(6,4), gSA(7,4)
gSA(7,5) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), …
gC = total count of required gSAs
\[ gC = \sum_{i=1}^{p-1} \begin{cases} k - i & \text{if } k - i > 0 \\ 1 & \text{otherwise} \end{cases} \]
Multiple gaps, for various m
P = coat: ##at, #o#t, #oa#, c##t, c#a#, co##
gSA(4,2) → SA, gSA(3,1), gSA(4,2)
P = coast: ##ast, #o#st, #oa#t, #oas#, c##st, c#a#t, c#as#, co##t, co#s#, coa##
gSA(5,2) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2), (1,2)(3,4)-gSA(5,2)
P = coasts: ##asts, #o#sts, #oa#ts, #oas#s, #oast#, c##sts, c#a#ts, c#as#s, c#ast#, co##ts, co#s#s, co#st#, coa##s, coa#t#, coas##
gSA(6,2) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(6,2), (1,2)(4,5)-gSA(6,2), (2,3)(4,5)-gSA(6,2)
P = coasts: ###sts, ##a#ts, ##as#s, ##ast#, #o##ts, #o#s#s, #o#st#, #oa##s, #oa#t#, #oas##, c###ts, c##s#s, c##st#, c#a##s, c#a#t#, c#as##, co###s, co##t#, co#s##, coa###
gSA(6,3) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(5,3), gSA(6,3), (1,3)(4,5)-gSA(6,3), (1,2)(3,5)-gSA(6,3)
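The pattern lists above are every way of replacing k characters of P with the gap symbol #. A short Python sketch that enumerates them with `itertools.combinations` is given below; `masked_patterns` is a hypothetical helper name.

```python
from itertools import combinations


def masked_patterns(pattern, k):
    """All variants of pattern with exactly k positions replaced by '#'."""
    result = []
    for positions in combinations(range(len(pattern)), k):
        chars = list(pattern)
        for p in positions:
            chars[p] = "#"
        result.append("".join(chars))
    return result


print(masked_patterns("coat", 2))
# ['##at', '#o#t', '#oa#', 'c##t', 'c#a#', 'co##']
print(len(masked_patterns("coasts", 3)))  # 20
```

The number of variants is the binomial coefficient C(m, k), which is why the set of indexes needed grows quickly as m and k increase.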
Two approaches to support multiple gaps
1. The first is to search up to the first gap of the search pattern and, after that, compare every remaining character individually.
2. The second is to keep creating additional multiple gapped suffix arrays as per the method above.
First approach
Pattern: c # a # t
r = gSA(3,1)[i], T[r]
T[r+2] T[r+3] T[r+4]
Pattern: c # a s # s
r = gSA(3,1)[i], T[r]
T[r+3] T[r+4] T[r+5]
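A simplified Python sketch of this approach follows. It is an assumption-laden illustration rather than the deck's implementation: the fragment before the first gap is located in the plain SA (the slide uses gSA(3,1) to cover the first gap as well), the binary search runs over a materialised key list, and `search_with_gaps` is a hypothetical name.

```python
from bisect import bisect_left, bisect_right


def search_with_gaps(text, pattern, wildcard="#"):
    """Find start positions where pattern matches text, treating each
    wildcard as matching any single character."""
    n, m = len(text), len(pattern)
    sa = sorted(range(n), key=lambda i: text[i:])
    first_gap = pattern.find(wildcard)
    fragment = pattern if first_gap == -1 else pattern[:first_gap]

    # Step 1: binary search for the fragment before the first gap.
    keys = [text[pos:pos + len(fragment)] for pos in sa]
    lo, hi = bisect_left(keys, fragment), bisect_right(keys, fragment)

    # Step 2: compare every remaining character individually.
    hits = []
    for pos in sa[lo:hi]:
        if pos + m > n:
            continue  # the pattern would run past the end of the text
        if all(pc == wildcard or pc == text[pos + j]
               for j, pc in enumerate(pattern) if j >= len(fragment)):
            hits.append(pos)
    return sorted(hits)


print(search_with_gaps("mississippi", "s#s#i"))  # [3]
print(search_with_gaps("mississippi", "is#i"))   # [1, 4]
```

When the first fragment is short, almost every suffix becomes a candidate and the character-by-character step dominates the running time.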
Worst case for searching with the first approach
Let fm be the length of the first fragment.
Binary search for the first fragment with gLCP: O(log n + fm)
Comparing the rest of the pattern: O((m - fm) n)
Total: O((m - fm) n + log n + fm)
Summary
Further work
The gapped suffix array only supports searching for specific patterns.
Supporting approximate indexing in all situations will require more research and development into multiple gapped suffix arrays.
A future task is to study the multiple gapped suffix array and its efficiency.
Conclusion
The theory of Crochemore and Tischler that a gSA can be constructed in linear time has been put into practice and confirmed.
In addition, the potential of multiple gSAs was examined, and the conclusion is that this area requires more research.
Q&A

Contenu connexe

Similaire à Approximate Indexing: Gapped Suffix Array

A taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithmsA taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithmsunyil96
 
SPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREE
SPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREESPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREE
SPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREEijitcs
 
String kmp
String kmpString kmp
String kmpthinkphp
 
Parallel random projection using R high performance computing for planted mot...
Parallel random projection using R high performance computing for planted mot...Parallel random projection using R high performance computing for planted mot...
Parallel random projection using R high performance computing for planted mot...TELKOMNIKA JOURNAL
 
Combining text and pattern preprocessing in an adaptive dna pattern matcher
Combining text and pattern preprocessing in an adaptive dna pattern matcherCombining text and pattern preprocessing in an adaptive dna pattern matcher
Combining text and pattern preprocessing in an adaptive dna pattern matcherIAEME Publication
 
A Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitmentA Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitmentKemal Can Kara
 
prolog-coolPrograms-flora.ppt
prolog-coolPrograms-flora.pptprolog-coolPrograms-flora.ppt
prolog-coolPrograms-flora.pptdatapro2
 
Deconstructing Columnar Transposition Ciphers
Deconstructing Columnar Transposition CiphersDeconstructing Columnar Transposition Ciphers
Deconstructing Columnar Transposition CiphersRobert Talbert
 
Graph Summarization with Quality Guarantees
Graph Summarization with Quality GuaranteesGraph Summarization with Quality Guarantees
Graph Summarization with Quality GuaranteesTwo Sigma
 
32 -longest-common-prefix
32 -longest-common-prefix32 -longest-common-prefix
32 -longest-common-prefixSanjeev Gupta
 
Point Placement Algorithms: An Experimental Study
Point Placement Algorithms: An Experimental StudyPoint Placement Algorithms: An Experimental Study
Point Placement Algorithms: An Experimental StudyCSCJournals
 
Langford sequences through a product of labeled digraphs
Langford sequences through a product of labeled digraphsLangford sequences through a product of labeled digraphs
Langford sequences through a product of labeled digraphsGraph-TA
 
Msa & rooted/unrooted tree
Msa & rooted/unrooted treeMsa & rooted/unrooted tree
Msa & rooted/unrooted treeSamiul Ehsan
 
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...cseiitgn
 

Similaire à Approximate Indexing: Gapped Suffix Array (18)

A taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithmsA taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithms
 
SPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREE
SPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREESPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREE
SPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREE
 
String kmp
String kmpString kmp
String kmp
 
Parallel random projection using R high performance computing for planted mot...
Parallel random projection using R high performance computing for planted mot...Parallel random projection using R high performance computing for planted mot...
Parallel random projection using R high performance computing for planted mot...
 
poster
posterposter
poster
 
Combining text and pattern preprocessing in an adaptive dna pattern matcher
Combining text and pattern preprocessing in an adaptive dna pattern matcherCombining text and pattern preprocessing in an adaptive dna pattern matcher
Combining text and pattern preprocessing in an adaptive dna pattern matcher
 
A Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitmentA Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitment
 
prolog-coolPrograms-flora.ppt
prolog-coolPrograms-flora.pptprolog-coolPrograms-flora.ppt
prolog-coolPrograms-flora.ppt
 
Deconstructing Columnar Transposition Ciphers
Deconstructing Columnar Transposition CiphersDeconstructing Columnar Transposition Ciphers
Deconstructing Columnar Transposition Ciphers
 
Presentation 2
Presentation 2Presentation 2
Presentation 2
 
Graph Summarization with Quality Guarantees
Graph Summarization with Quality GuaranteesGraph Summarization with Quality Guarantees
Graph Summarization with Quality Guarantees
 
32 -longest-common-prefix
32 -longest-common-prefix32 -longest-common-prefix
32 -longest-common-prefix
 
2nd Semester M Tech: Structural Engineering (June-2015) Question Papers
2nd  Semester M Tech: Structural Engineering  (June-2015) Question Papers2nd  Semester M Tech: Structural Engineering  (June-2015) Question Papers
2nd Semester M Tech: Structural Engineering (June-2015) Question Papers
 
Point Placement Algorithms: An Experimental Study
Point Placement Algorithms: An Experimental StudyPoint Placement Algorithms: An Experimental Study
Point Placement Algorithms: An Experimental Study
 
Ch06 multalign
Ch06 multalignCh06 multalign
Ch06 multalign
 
Langford sequences through a product of labeled digraphs
Langford sequences through a product of labeled digraphsLangford sequences through a product of labeled digraphs
Langford sequences through a product of labeled digraphs
 
Msa & rooted/unrooted tree
Msa & rooted/unrooted treeMsa & rooted/unrooted tree
Msa & rooted/unrooted tree
 
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
 

Dernier

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 

Dernier (20)

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 

Approximate Indexing: Gapped Suffix Array

  • 1. King’s College London, University of London MSc in Advanced Software Engineering Approximate Indexing: Gapped Suffix Array KyungHoon Park
  • 2. King’s College London, University of London Agenda  Research Objective  Gapped suffix array  Application  Going beyond gSA  Q&A
  • 3. King’s College London, University of London Research Objective
  • 4. King’s College London, University of London Main questions 1. Using the developed suffix array, can gapped suffix array be developed in O(n) time? 2. What are the limitations of gapped suffix array? How can these can be overcome?
  • 5. King’s College London, University of London Research aims 1. To fully understand and implement suffix array and LCP. 2. Implement a gapped suffix array from the suffix array in O(n) time. 3. To study and implement the paper gapped suffix array. 4. If there are possibilities to develop to multiple gapped suffix array, to research other limitations.
  • 6. King’s College London, University of London Gapped Suffix Array
  • 7. King’s College London, University of London Main questions 1. Using the developed suffix array, can gapped suffix array be developed in O(n) time? 2. 2. What are the limitations of gapped suffix array? How can these can be overcome?
  • 8. King’s College London, University of London Definitions T = t1t2 … tn, P = p1 p2 … pn , strings of symbols in finite alphabet m = length of search string n = length of text k = k-mistake (Hamming distance)
  • 9. King’s College London, University of London Suffix Array i T[i] SA T[SA[i]] LCP 0 mississippi 10 i 0 1 ississippi 7 ippi 1 2 ssissippi 4 issippi 1 3 sissippi 1 ississippi 4 4 issippi 0 mississippi 0 5 ssippi 9 pi 0 6 sippi 8 ppi 1 7 ippi 6 sippi 0 8 ppi 3 sissippi 2 9 pi 5 ssippi 1 T = mississippi
  • 10. King’s College London, University of London Gapped Suffix Array 1. First introduced by Crochemore and Tischler (2010) 2. Constructed after SA 3. SA that has a Gap within a specific range to provide approximate index. 4. The range of gap defined before constructing the gapped suffix array.
  • 11. King’s College London, University of London Gapped Suffix Array T = mississippi, (1, 2)-gSA (3,1) i T[i] SA gSA (1, 2)- gSA(3,1) 1 mississippi 10 10 i# 2 ississippi 7 7 i#pi 3 ssissippi 4 4 i#sippi 4 sissippi 1 1 i#sissippi 5 issippi 0 0 m#ssissippi 6 Ssippi 9 9 p# 7 Sippi 8 8 p#i 8 Ippi 6 5 s#ppi 9 ppi 3 2 s#ssippi 10 pi 5 6 s#ippi 11 i 2 3 s#issippi Definition (g0, g1)-gSA (m, k) gSA = Gapped suffix array g0 = start cursor of the gap g1 = end cursor of the gap m = length of search string k = Hamming distance
  • 12. King’s College London, University of London Flow of constructing the gSA • Skew Algorithm 1. Constructing the SA • Figure of the k-mistake • Range of gap 2. Defining the limitations • Sorting based on GRANK & HRANK 3. Constructing the gSA
  • 13. King’s College London, University of London Limitations of gSA 1. Hamming distance, length of pattern and gap range should define prior to constructing. 2. gSA cannot cover all of approximate string matching based on defined k-mistake. ex) k = 2, gap=(1,3) coat -> c##t, ##at, co## (support) #o#t, c#a# (cannot support) 3. gSA cannot support multiple gaps EX) coach -> c#a#h
  • 14. King’s College London, University of London Constructing gSA - #1. GRANK i 0 1 2 3 4 5 6 7 8 9 10 T[i] m i s s i s s i p p i GRANK 5 1 8 8 1 8 8 1 6 6 1 GRANK contains the ranks of factors of y with length up to g0. That is, rank created by cutting the characters before the beginning of the gap at position g0 For Example, m = 3, gap range = (1,2)
  • 15. King’s College London, University of London Constructing gSA - #2. HRANK HRANK contains the RANKs of the suffixes that are at the end of the gap. As we have now already created the suffix array before constructing the gapped suffix, it is possible to easily bring the suffix of where the gap ends. HRANK[r] = ISA[SA[r]+g1]
  • 16. King’s College London, University of London GRANK & HRANK For example, the structure of the GRANK and HRANK of the fourth suffix sissippi is constructed as below. s i s s i p p i GRANK Gap HRANK If we perform the radix sort by combining both GRANK and HRANK created in this way, it is possible to create gSA in linear time.
  • 17. King’s College London, University of London Example of (1,2)-gSA(3,1) i T[i] SA gSA (1, 2)- gSA GRANK HRANK 1 mississippi 10 10 i# 5 0 2 ississippi 7 7 i#pi 1 6 3 ssissippi 4 4 i#sippi 8 8 4 sissippi 1 1 i#sissippi 8 9 5 issippi 0 0 m#ssissippi 1 11 6 Ssippi 9 9 p# 8 0 7 Sippi 8 8 p#i 8 1 8 Ippi 6 5 s#ppi 1 7 9 ppi 3 2 s#ssippi 6 10 10 pi 5 6 s#ippi 6 2 11 i 2 3 s#issippi 1 3
  • 18. King’s College London, University of London Search in (1,2)-gSA(3,1) For example, if m = mis (m0, m1, m2), it needs to search three times: - search mi (m0, m1) in the SA - search is (m1, m2) in the SA - search ms (m0, m2) in the gSA P = cot (1,2)-gSA(3,1) c#t #ot co# Searching array in the (1,2)-gSA(3,1) in the SA in the SA
  • 19. Application
  • 20. Platform and Language
    1. Language: C#
    2. Platform: Microsoft .NET Framework v4.0
  • 21. Algorithms
    1. Construction of suffix array with LCP
       - Radix sort
       - Skew algorithm
    2. Construction of gapped suffix array with gLCP
       - Radix sort
    3. Approximate string search
       - Pattern analysis
       - Binary search with LCP
  • 22. Gapped Suffix Array
  • 23. Going beyond gSA
  • 24. Main questions
    1. Using the developed suffix array, can a gapped suffix array be built in O(n) time?
    2. What are the limitations of the gapped suffix array? How can they be overcome?
  • 25. Limitation of gSA
    Assuming k = 1 and the gap ends at position m - 1:
    P = coat, (2,3)-gSA(4,1)
    Pattern  Searching array
    #oat     SA
    c#at     Cannot support
    co#t     gSA(4,1)
    coa#     SA
    P = coast, (3,4)-gSA(5,1)
    Pattern  Searching array
    #oast    SA
    c#ast    Cannot support
    co#st    Cannot support
    coa#t    gSA(5,1)
    coas#    SA
  • 26. Countermeasure
    P = coat, (2,3)-gSA(4,1)
    Pattern  Searching array
    #oat     SA
    c#at     gSA(3,1)
    co#t     gSA(4,1)
    coa#     SA
    P = coast, (3,4)-gSA(5,1)
    Pattern  Searching array
    #oast    SA
    c#ast    gSA(3,1)
    co#st    gSA(4,1)
    coa#t    gSA(5,1)
    coas#    SA
  • 27. Countermeasure
    P = cot: c#t, #ot, co#
      gSA(3,1) -> SA, gSA(3,1)
    P = coat: #oat, c#at, co#t, coa#
      gSA(4,1) -> SA, gSA(3,1), gSA(4,1)
    P = coast: #oast, c#ast, co#st, coa#t, coas#
      gSA(5,1) -> SA, gSA(3,1), gSA(4,1), gSA(5,1)
    P = coasts: #oasts, c#asts, co#sts, coa#ts, coas#s, coast#
      gSA(6,1) -> SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1)
    In general:
      gSA(m,1) -> SA, gSA(3,1) … gSA(m-2,1), gSA(m-1,1), gSA(m,1)
  • 28. Theorem
    If the length of the gap is 1, the number of gSAs required is |m - 2|, and both construction and search can be performed in linear time.
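The count in the theorem can be checked against the mapping used on the Countermeasure slides, where a single wildcard at the first or last position is served by the plain SA and a wildcard at inner position j is served by gSA(j + 2, 1). The sketch below only encodes that mapping; the function name `required_arrays` is hypothetical.

```python
def required_arrays(m):
    """For k = 1, map each wildcard position j of a length-m pattern
    to the array that serves it (per the Countermeasure slides)."""
    plan = {}
    for j in range(m):
        if j == 0 or j == m - 1:
            plan[j] = "SA"  # leading or trailing wildcard: plain SA lookup
        else:
            plan[j] = f"gSA({j + 2},1)"  # gap sits at position j
    return plan


for m in range(3, 8):
    gapped = {name for name in required_arrays(m).values() if name != "SA"}
    assert len(gapped) == m - 2

print(required_arrays(4))
# -> {0: 'SA', 1: 'gSA(3,1)', 2: 'gSA(4,1)', 3: 'SA'}
```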
  • 29. Total count of required gSAs
    gSA(m, p)  Required gapped suffix arrays
    gSA(3,1) -> SA, gSA(3,1)
    gSA(4,1) -> SA, gSA(3,1), gSA(4,1)
    gSA(4,2) -> SA, gSA(3,1), gSA(4,2)
    gSA(5,1) -> SA, gSA(3,1), gSA(4,1), gSA(5,1)
    gSA(5,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2)
    gSA(5,3) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,3)
    gSA(6,1) -> SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1)
    gSA(6,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,2)
    gSA(6,3) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,3)
    gSA(6,4) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,4)
    gSA(7,1) -> SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1), gSA(7,1)
    gSA(7,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,1), gSA(6,2), gSA(7,2)
    gSA(7,3) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(7,3)
    gSA(7,4) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(6,4), gSA(7,4)
    gSA(7,5) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), …
    gC = total count of required gSAs:
    \[ gC = \sum_{i=1}^{p-1} \begin{cases} k - i & \text{if } k - i > 0 \\ 1 & \text{otherwise} \end{cases} \]
  • 30. Multiple gaps for varying m
    P = coat: ##at, #o#t, #oa#, c##t, c#a#, co##
      gSA(4,2) -> SA, gSA(3,1), gSA(4,2)
    P = coast: ##ast, #o#st, #oa#t, #oas#, c##st, c#a#t, c#as#, co##t, co#s#, coa##
      gSA(5,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2), (1,2)(3,4)-gSA(5,2)
    P = coasts: ##asts, #o#sts, #oa#ts, #oas#s, #oast#, c##sts, c#a#ts, c#as#s, c#ast#, co##ts, co#s#s, co#st#, coa##s, coa#t#, coas##
      gSA(6,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(6,2), (1,2)(4,5)-gSA(6,2), (2,3)(4,5)-gSA(6,2)
    P = coasts: ###sts, ##a#ts, ##as#s, ##ast#, #o##ts, #o#s#s, #o#st#, #oa##s, #oa#t#, #oas##, c###ts, c##s#s, c##st#, c#a##s, c#a#t#, c#as##, co###s, co##t#, co#s##, coa###
      gSA(6,3) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(5,3), gSA(6,3), (1,3)(4,5)-gSA(6,3), (1,2)(3,5)-gSA(6,3)
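The pattern lists on this slide are all ways of replacing exactly k characters of P with the wildcard #. A small sketch that generates them in the same order (the function name `wildcard_patterns` is hypothetical):

```python
from itertools import combinations


def wildcard_patterns(pattern, k):
    """All variants of `pattern` with exactly k positions replaced by '#'."""
    variants = []
    for gaps in combinations(range(len(pattern)), k):
        variants.append("".join("#" if i in gaps else ch
                                for i, ch in enumerate(pattern)))
    return variants


print(wildcard_patterns("coat", 2))
# -> ['##at', '#o#t', '#oa#', 'c##t', 'c#a#', 'co##']
```

The number of variants is the binomial coefficient C(m, k), for example 15 for coasts with k = 2 and 20 with k = 3, which is why the number of required arrays grows quickly once multiple gaps are allowed.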
  • 31. Two approaches to support multiple gaps
    First: search with the index up to the first gap of the search pattern, then compare every remaining character individually.
    Second: keep building additional multiple gapped suffix arrays, following the method above.
  • 32. First approach
    P = c#a#t: r = gSA[i](3,1), then compare T[r], T[r+2], T[r+3], T[r+4]
    P = c#as#s: r = gSA[i](3,1), then compare T[r], T[r+3], T[r+4], T[r+5]
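The verification step of the first approach can be sketched as follows. The candidate positions r would come from the gSA(3,1) lookup of the first fragment (c#a); here a brute-force scan stands in for that lookup so that the example stays self-contained. The function names and the sample text are hypothetical.

```python
def verify_rest(text, pattern, r, start):
    """Compare pattern[start:] with the text at candidate r, skipping '#'."""
    if r + len(pattern) > len(text):
        return False
    return all(ch == "#" or text[r + j] == ch
               for j, ch in enumerate(pattern) if j >= start)


def first_approach(text, pattern, candidates, fragment_len):
    """Keep the candidates whose remaining characters also match."""
    return [r for r in candidates
            if verify_rest(text, pattern, r, fragment_len)]


text = "coast coats chart"
pattern = "c#a#t"
# Stand-in for the gSA(3,1) lookup of the first fragment "c#a".
candidates = [i for i in range(len(text) - 2)
              if text[i] == "c" and text[i + 2] == "a"]
print(first_approach(text, pattern, candidates, 3))
# -> [0, 12]
```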
  • 33. Worst case for searching with the first approach
    Let fm be the length of the first fragment.
    Binary search for the first fragment with gLCP: O(log n + fm)
    Search for the rest of the pattern: O((m - fm)n)
    Total: O((m - fm)n + log n + fm)
  • 34. Summary
  • 35. Further work
    A gapped suffix array only supports searching for specific patterns. Supporting approximate indexing in all situations will require more research and development into multiple gapped suffix arrays. The future task is to study multiple gapped suffix arrays and their efficiency.
  • 36. Conclusion
    The theory of Crochemore and Tischler that a gSA can be built in linear time was put into practice and confirmed. In addition, the potential of multiple gSAs was examined, with the conclusion that this area requires more research.
  • 38. Q&A