4. King’s College London, University of London
Main questions
1. Given an already constructed suffix array, can a
gapped suffix array be constructed in O(n) time?
2. What are the limitations of the gapped suffix array?
How can they be overcome?
5. King’s College London, University of London
Research aims
1. To fully understand and implement the suffix array
and the LCP array.
2. To implement a gapped suffix array from the suffix
array in O(n) time.
3. To study and implement the gapped suffix array
paper.
4. If the approach can be extended to multiple gapped
suffix arrays, to investigate the remaining limitations.
7. King’s College London, University of London
Main questions
1. Given an already constructed suffix array, can a
gapped suffix array be constructed in O(n)
time?
2. What are the limitations of the gapped suffix array?
How can they be overcome?
8. King’s College London, University of London
Definitions
T = t1 t2 … tn and P = p1 p2 … pm are strings of symbols
over a finite alphabet
m = length of the search string P
n = length of the text T
k = number of allowed mistakes (Hamming distance)
9. King’s College London, University of London
Suffix Array
i    T[i..]        SA[i]   T[SA[i]..]    LCP[i]
0    mississippi   10      i             0
1    ississippi    7       ippi          1
2    ssissippi     4       issippi       1
3    sissippi      1       ississippi    4
4    issippi       0       mississippi   0
5    ssippi        9       pi            0
6    sippi         8       ppi           1
7    ippi          6       sippi         0
8    ppi           3       sissippi      2
9    pi            5       ssippi        1
10   i             2       ssissippi     3
T = mississippi
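The table above can be reproduced with a short Python sketch. It is only an illustration: the suffix array is built by naive sorting rather than with the skew algorithm, the LCP array uses Kasai's algorithm, and the function names `suffix_array` and `lcp_array` are chosen here for the example.

```python
def suffix_array(text):
    """Naive construction: sort the start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Kasai's algorithm: lcp[r] is the common prefix length of the
    suffixes at sa[r - 1] and sa[r]; lcp[0] is defined as 0."""
    n = len(text)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    lcp = [0] * n
    h = 0
    for pos in range(n):
        r = rank[pos]
        if r == 0:
            h = 0
            continue
        prev = sa[r - 1]
        while pos + h < n and prev + h < n and text[pos + h] == text[prev + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters
    return lcp


T = "mississippi"
SA = suffix_array(T)
LCP = lcp_array(T, SA)
```

The naive sort costs more than linear time; it stands in for the skew algorithm only so that the values in the table can be checked.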
10. King’s College London, University of London
Gapped Suffix Array
1. First introduced by Crochemore and Tischler
(2010).
2. Constructed after the SA.
3. A suffix array whose ordering skips the characters
inside a fixed range (the gap), so that it can serve as
an index for approximate matching.
4. The gap range is defined before the gapped suffix
array is constructed.
11. King’s College London, University of London
Gapped Suffix Array
T = mississippi, (1,2)-gSA(3,1)
i    T[i..]        SA[i]   gSA[i]   gapped suffix at gSA[i]
0    mississippi   10      10       i#
1    ississippi    7       7        i#pi
2    ssissippi     4       4        i#sippi
3    sissippi      1       1        i#sissippi
4    issippi       0       0        m#ssissippi
5    ssippi        9       9        p#
6    sippi         8       8        p#i
7    ippi          6       5        s#ippi
8    ppi           3       2        s#issippi
9    pi            5       6        s#ppi
10   i             2       3        s#ssippi
Definition
(g0, g1)-gSA (m, k)
gSA = Gapped suffix array
g0 = start cursor of the gap
g1 = end cursor of the gap
m = length of search string
k = Hamming distance
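As a sanity check on the definition, the gapped order can be computed directly by sorting. This is a sketch under the deck's convention that the gap covers positions g0 up to (but not including) g1 of each suffix; the names `gapped_suffix_array` and `gapped_view` are invented for this example. It is not the linear-time construction.

```python
def gapped_suffix_array(text, g0, g1):
    """Sort start positions by (first g0 characters, suffix from offset g1),
    so the characters at offsets g0 .. g1 - 1 are ignored."""
    return sorted(range(len(text)),
                  key=lambda i: (text[i:i + g0], text[i + g1:]))


def gapped_view(text, i, g0, g1):
    """Suffix i with its gap shown as '#' characters, as on the slide."""
    return text[i:i + g0] + "#" * (g1 - g0) + text[i + g1:]


T = "mississippi"
GSA = gapped_suffix_array(T, 1, 2)
VIEWS = [gapped_view(T, i, 1, 2) for i in GSA]
```

For T = mississippi the order differs from the plain suffix array only in the block of suffixes that start with s.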
12. King’s College London, University of London
Flow of constructing the gSA
1. Constructing the SA
• Skew algorithm
2. Defining the limitations
• Value of k (number of mistakes)
• Range of gap
3. Constructing the gSA
• Sorting based on GRANK & HRANK
13. King’s College London, University of London
Limitations of gSA
1. The Hamming distance, the pattern length and the gap
range must all be defined before construction.
2. A gSA cannot cover every approximate match allowed
by the chosen k.
Example: k = 2, gap = (1,3)
coat -> c##t, ##at, co## (supported)
#o#t, c#a# (not supported)
3. A gSA cannot support multiple gaps.
Example: coach -> c#a#h
14. King’s College London, University of London
Constructing gSA - #1. GRANK
i       0  1  2  3  4  5  6  7  8  9  10
T[i]    m  i  s  s  i  s  s  i  p  p  i
GRANK   5  1  8  8  1  8  8  1  6  6  1
GRANK contains the ranks of the factors of T of
length up to g0, that is, the ranks obtained by keeping
only the characters before the start of the gap at
position g0.
The example uses m = 3 and gap range (1,2), so g0 = 1
and each rank depends only on the first character.
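The GRANK row can be recomputed with the following sketch. It assumes, as the table suggests, that ranks are counted from 1 and that equal prefixes share the rank of their first occurrence in sorted order; the function name `grank` is chosen for this example.

```python
def grank(text, g0):
    """1-based rank of each position's length-g0 prefix; equal prefixes
    share the smallest rank of their group."""
    prefixes = [text[i:i + g0] for i in range(len(text))]
    first_rank = {}
    for r, prefix in enumerate(sorted(prefixes), start=1):
        first_rank.setdefault(prefix, r)  # keep the first rank seen per prefix
    return [first_rank[prefix] for prefix in prefixes]


GRANK = grank("mississippi", 1)
```

Sorting the prefixes is used here for brevity; with the suffix array already available, the same ranks can be read off in one pass over it.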
15. King’s College London, University of London
Constructing gSA - #2. HRANK
HRANK contains the ranks of the suffixes that start
where the gap ends.
Because the suffix array is built before the gapped
suffix array, the rank of the suffix at which the gap
ends can be looked up directly in the inverse suffix
array (ISA). An entry is 0 when SA[r] + g1 lies past the
end of T.
HRANK[r] = ISA[SA[r]+g1]
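A minimal sketch of this lookup follows. The suffix array is copied from the Suffix Array slide; ranks are assumed to be 1-based with 0 reserved for a suffix that starts past the end of the text, which is the convention the example table appears to use. The function name `hrank` is illustrative.

```python
SA = [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]  # suffix array of "mississippi"


def hrank(sa, g1):
    """HRANK[r] = ISA[SA[r] + g1], with 1-based ranks and 0 when the
    position SA[r] + g1 is past the end of the text."""
    n = len(sa)
    isa = [0] * n
    for r, pos in enumerate(sa):
        isa[pos] = r + 1
    return [isa[pos + g1] if pos + g1 < n else 0 for pos in sa]


HRANK = hrank(SA, 2)
```

Both loops are single passes, so the whole array is obtained in O(n) time once the SA exists.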
16. King’s College London, University of London
GRANK & HRANK
For example, for the fourth suffix, sissippi, GRANK and
HRANK cover the following parts:
s        i      s s i p p i
GRANK    Gap    HRANK
Radix sorting the suffixes on the pair (GRANK, HRANK)
built in this way yields the gSA in linear time.
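The radix sort step can be sketched as two stable counting-sort passes, first on HRANK and then on GRANK. This is an illustration under assumptions rather than the implementation from the deck: here GRANK is indexed by suffix-array row instead of by text position, and the helper names `counting_sort` and `gsa_from_sa` are invented for the example.

```python
def counting_sort(items, key, max_key):
    """Stable counting sort of items by an integer key in 0 .. max_key."""
    buckets = [[] for _ in range(max_key + 1)]
    for item in items:
        buckets[key(item)].append(item)
    return [item for bucket in buckets for item in bucket]


def gsa_from_sa(text, sa, g0, g1):
    n = len(sa)
    isa = [0] * n
    for r, pos in enumerate(sa):
        isa[pos] = r + 1  # 1-based ranks, 0 means "past the end"
    # GRANK per row: rows with the same first g0 characters share a rank.
    grank = [0] * n
    for r in range(n):
        same = r > 0 and text[sa[r]:sa[r] + g0] == text[sa[r - 1]:sa[r - 1] + g0]
        grank[r] = grank[r - 1] if same else r + 1
    hrank = [isa[pos + g1] if pos + g1 < n else 0 for pos in sa]
    # Least significant key first, then the most significant key.
    rows = counting_sort(range(n), lambda r: hrank[r], n)
    rows = counting_sort(rows, lambda r: grank[r], n)
    return [sa[r] for r in rows]


T = "mississippi"
SA = [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
GSA = gsa_from_sa(T, SA, 1, 2)
```

Each pass touches every row once, so for a fixed g0 the construction from an existing SA takes O(n) time.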
17. King’s College London, University of London
Example of (1,2)-gSA(3,1)
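i    T[i..]        GRANK   SA[i]   HRANK   gSA[i]   gapped suffix at gSA[i]
0    mississippi   5       10      0       10       i#
1    ississippi    1       7       6       7        i#pi
2    ssissippi     8       4       8       4        i#sippi
3    sissippi      8       1       9       1        i#sissippi
4    issippi       1       0       11      0        m#ssissippi
5    ssippi        8       9       0       9        p#
6    sippi         8       8       1       8        p#i
7    ippi          1       6       7       5        s#ippi
8    ppi           6       3       10      2        s#issippi
9    pi            6       5       2       6        s#ppi
10   i             1       2       3       3        s#ssippi
GRANK is listed by text position i; HRANK is listed by
suffix array row, with ranks counted from 1 and 0 for a
suffix that starts past the end of T.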
18. King’s College London, University of London
Search in (1,2)-gSA(3,1)
For example, for P = mis (p0 p1 p2) and k = 1, three
searches are needed:
- search mi (p0 p1) in the SA
- search is (p1 p2) in the SA
- search m#s (p0, p2) in the gSA
P = cot
Pattern            c#t               #ot    co#
Searching array    (1,2)-gSA(3,1)    SA     SA
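The three searches can be sketched as follows for a pattern of length 3 and k = 1. The arrays are copied from the slides; the helper names are invented for this example, and the sorted key lists are materialised only to keep the code short (a real implementation would binary search the index directly, using the LCP information).

```python
from bisect import bisect_left, bisect_right

T = "mississippi"
SA = [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
GSA = [10, 7, 4, 1, 0, 9, 8, 5, 2, 6, 3]  # (1,2)-gSA


def sa_find(text, sa, piece):
    """Start positions of the suffixes that begin with piece."""
    keys = [text[i:i + len(piece)] for i in sa]
    return sa[bisect_left(keys, piece):bisect_right(keys, piece)]


def gsa_find(text, gsa, head, tail, g1):
    """Start positions matching head, then a gap, then tail at offset g1."""
    keys = [(text[i:i + len(head)], text[i + g1:i + g1 + len(tail)]) for i in gsa]
    target = (head, tail)
    return gsa[bisect_left(keys, target):bisect_right(keys, target)]


def search_one_mismatch(text, sa, gsa, pattern):
    """Occurrences of a length-3 pattern with at most one mismatch."""
    assert len(pattern) == 3
    hits = set()
    # p0 p1 #: prefix in the SA; the third character must still exist
    hits.update(i for i in sa_find(text, sa, pattern[:2]) if i + 3 <= len(text))
    # # p1 p2: suffix in the SA; the occurrence starts one position earlier
    hits.update(i - 1 for i in sa_find(text, sa, pattern[1:]) if i >= 1)
    # p0 # p2: looked up in the gapped suffix array
    hits.update(gsa_find(text, gsa, pattern[0], pattern[2], 2))
    return sorted(hits)


RESULT = search_one_mismatch(T, SA, GSA, "mis")
```

For mis the result contains position 0 (the exact match) and position 3 (sis, one mismatch).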
20. King’s College London, University of London
Platform and Language
1. Language: C#
2. Platform: Microsoft .NET
(.NET Framework v4.0)
21. King’s College London, University of London
Algorithms
1. Construction of suffix array with LCP
- Radix sort
- Skew algorithm
2. Construction of gapped suffix array with gLCP
- Radix sort
3. Approximate string search
- pattern analysis
- binary search with LCP
24. King’s College London, University of London
Main questions
1. Given an already constructed suffix array, can a
gapped suffix array be constructed in O(n) time?
2. What are the limitations of the gapped
suffix array? How can they be
overcome?
25. King’s College London, University of London
Limitation of gSA
P = coat
(2,3)-gSA(4,1)     #oat    c#at             co#t             coa#
Searching array    SA      Cannot support   gSA(4,1)         SA
P = coast
(3,4)-gSA(5,1)     #oast   c#ast            co#st            coa#t      coas#
Searching array    SA      Cannot support   Cannot support   gSA(5,1)   SA
Both tables assume k = 1 and a gap that ends at m − 1.
26. King’s College London, University of London
Countermeasure
P = coat
(2,3)-gSA(4,1)     #oat    c#at       co#t       coa#
Searching array    SA      gSA(3,1)   gSA(4,1)   SA
P = coast
(3,4)-gSA(5,1)     #oast   c#ast      co#st      coa#t      coas#
Searching array    SA      gSA(3,1)   gSA(4,1)   gSA(5,1)   SA
27. King’s College London, University of London
Countermeasure
P = cot       c#t, #ot, co#
gSA(3, 1)     SA, gSA(3, 1)
P = coat      #oat, c#at, co#t, coa#
gSA(4, 1)     SA, gSA(3, 1), gSA(4, 1)
P = coast     #oast, c#ast, co#st, coa#t, coas#
gSA(5, 1)     SA, gSA(3, 1), gSA(4, 1), gSA(5, 1)
P = coasts    #oasts, c#asts, co#sts, coa#ts, coas#s, coast#
gSA(6, 1)     SA, gSA(3, 1), gSA(4, 1), gSA(5, 1), gSA(6, 1)
gSA(m, 1)     SA, gSA(3, 1) … gSA(m-2, 1), gSA(m-1, 1), gSA(m, 1)
28. King’s College London, University of London
Theorem: If the length of the gap is 1, the number of
gSAs required is m − 2, and both construction and search
can be performed in linear time.
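The count in the theorem can be checked by listing, for each single-mismatch variant of a pattern, the index that serves it. The sketch below follows the naming read off the Countermeasure slides (a mismatch at 0-based position j inside the pattern is served by gSA(j + 2, 1); a mismatch at either end is served by the plain SA). The function names are hypothetical.

```python
def indexes_for_single_mismatch(pattern):
    """Pair every one-mismatch variant of pattern with the index that serves it."""
    m = len(pattern)
    plan = []
    for j in range(m):
        variant = pattern[:j] + "#" + pattern[j + 1:]
        if j == 0 or j == m - 1:
            index = "SA"                 # gap at an end: plain suffix array
        else:
            index = f"gSA({j + 2},1)"    # gap inside: one gSA per position
        plan.append((variant, index))
    return plan


def required_gsa_count(m):
    """Number of distinct gSAs needed for patterns of length m with k = 1."""
    plan = indexes_for_single_mismatch("x" * m)
    return len({index for _, index in plan if index != "SA"})


PLAN = indexes_for_single_mismatch("coat")
```

The interior positions are 1 to m − 2, which gives exactly m − 2 gapped suffix arrays in addition to the SA.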
29. King’s College London, University of London
Total count of required gSAs
gSA(m, p)   Required gapped suffix arrays
gSA(3,1)    SA, gSA(3,1)
gSA(4,1)    SA, gSA(3,1), gSA(4,1)
gSA(4,2)    SA, gSA(3,1), gSA(4,2)
gSA(5,1)    SA, gSA(3,1), gSA(4,1), gSA(5,1)
gSA(5,2)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2)
gSA(5,3)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,3)
gSA(6,1)    SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1)
gSA(6,2)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,2)
gSA(6,3)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,3)
gSA(6,4)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,4)
gSA(7,1)    SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1), gSA(7,1)
gSA(7,2)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,1), gSA(6,2), gSA(7,2)
gSA(7,3)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(7,3)
gSA(7,4)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(6,4), gSA(7,4)
gSA(7,5)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), …
gC = total count of required gSAs
$$gC = \sum_{i=1}^{p-1} \begin{cases} k - i & \text{if } k - i > 0 \\ 1 & \text{otherwise} \end{cases}$$
30. King’s College London, University of London
Multiple gaps for various m
P = coat      ##at, #o#t, #oa#, c##t, c#a#, co##
gSA(4,2)      SA, gSA(3,1), gSA(4,2)
P = coast     ##ast, #o#st, #oa#t, #oas#, c##st, c#a#t, c#as#, co##t, co#s#, coa##
gSA(5,2)      SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2), (1,2)(3,4)-gSA(5,2)
P = coasts    ##asts, #o#sts, #oa#ts, #oas#s, #oast#, c##sts, c#a#ts, c#as#s, c#ast#, co##ts, co#s#s, co#st#, coa##s, coa#t#, coas##
gSA(6,2)      SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(6,2), (1,2)(4,5)-gSA(6,2), (2,3)(4,5)-gSA(6,2)
P = coasts    ###sts, ##a#ts, ##as#s, ##ast#, #o##ts, #o#s#s, #o#st#, #oa##s, #oa#t#, #oas##, c###ts, c##s#s, c##st#, c#a##s, c#a#t#, c#as##, co###s, co##t#, co#s##, coa###
gSA(6,3)      SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(5,3), gSA(6,3), (1,3)(4,5)-gSA(6,3), (1,2)(3,5)-gSA(6,3)
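The variant lists above can be generated mechanically, which also shows which variants need more than one gap. In this sketch, `mismatch_variants` and `interior_gap_runs` are names made up for the example; a run of # characters at either end of a variant is not counted as a gap because it only shortens the part that has to be searched.

```python
import re
from itertools import combinations


def mismatch_variants(pattern, k):
    """All ways to replace exactly k characters of pattern with '#'."""
    variants = []
    for positions in combinations(range(len(pattern)), k):
        chars = list(pattern)
        for j in positions:
            chars[j] = "#"
        variants.append("".join(chars))
    return variants


def interior_gap_runs(variant):
    """Number of maximal '#' runs strictly inside the variant."""
    return len(re.findall(r"#+", variant.strip("#")))


COAT_2 = mismatch_variants("coat", 2)
COAST_TWO_GAPS = [v for v in mismatch_variants("coast", 2)
                  if interior_gap_runs(v) == 2]
COASTS_TWO_GAPS = [v for v in mismatch_variants("coasts", 2)
                   if interior_gap_runs(v) == 2]
```

For coast with k = 2 the only variant with two interior gaps is c#a#t, which corresponds to the (1,2)(3,4)-gSA(5,2) entry in the table; for coasts the three such variants correspond to the three multiple gapped arrays listed for gSA(6,2).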
31. King’s College London, University of London
Two approaches to support
multiple gaps
The first is to search with the index only up to the first gap
of the search pattern, and after that to compare every individual
character.
The second is to keep constructing additional multiple gapped
suffix arrays, as in the method above.
32. King’s College London, University of London
First approach
c # a # t
r = gSA(3,1)[i], match starts at T[r]
then compare T[r+2] T[r+3] T[r+4]
c # a s # s
r = gSA(3,1)[i], match starts at T[r]
then compare T[r+3] T[r+4] T[r+5]
33. King’s College London, University of London
Worst case for searching with it
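Let fm be the length of the first fragment of the pattern.
Binary search for the first fragment with gLCP: O(log n + fm)
Comparing the rest of the pattern for every candidate: O((m − fm) n)
In the worst case every suffix matches the first fragment, so the
total is O((m − fm) n + log n + fm)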
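The first approach can be sketched as below. This is a simplification under stated assumptions: the first fragment is taken to be the part of the pattern before the first #, it is located in the plain SA rather than in a gSA, and the sorted key list is built explicitly for brevity. The function name `search_with_gaps` is invented for the example.

```python
from bisect import bisect_left, bisect_right

T = "mississippi"
SA = [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]


def search_with_gaps(text, sa, pattern):
    """Find occurrences of a pattern in which '#' matches any character."""
    first = pattern.split("#", 1)[0]          # fragment before the first gap
    keys = [text[i:i + len(first)] for i in sa]
    lo, hi = bisect_left(keys, first), bisect_right(keys, first)
    hits = []
    for start in sa[lo:hi]:                   # every candidate is verified
        if start + len(pattern) > len(text):
            continue
        rest_ok = all(c == "#" or text[start + j] == c
                      for j, c in enumerate(pattern) if j >= len(first))
        if rest_ok:
            hits.append(start)
    return sorted(hits)


HITS = search_with_gaps(T, SA, "s#s#i")
```

The verification loop is the source of the (m − fm) n term: when the first fragment is short, almost every suffix is a candidate and each one costs up to m − fm comparisons.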
35. King’s College London, University of London
Further work
A gapped suffix array only supports searching for specific
patterns.
Supporting approximate indexing in all situations will
require more research and development into
multiple gapped suffix arrays.
The future task is to study the multiple gapped suffix array
and its efficiency.
36. King’s College London, University of London
Conclusion
The claim by Crochemore and Tischler that a gSA can be
constructed in linear time has been implemented and
confirmed in practice.
In addition, the potential of multiple gSAs was examined,
with the conclusion that this area requires more
research.