2. This week, I’m doing a study on n-grams and
their possible applications for topic detection
in multiple documents.
(Details in subsequent slides)
3. In the fields of computational
linguistics and probability, an n-gram is a
contiguous sequence of n items from a
given sequence of text or speech. The items can
be phonemes, syllables, letters, words or base
pairs according to the application. The n-grams
typically are collected from a text or speech
corpus. When the items are words, n-grams may
also be called shingles. [5]
4. An n-gram of size 1 is referred to as a
"unigram"; size 2 is a "bigram" (or, less
commonly, a "digram"); size 3 is a "trigram".
Larger sizes are sometimes referred to by the
value of n, e.g., "four-gram", "five-gram", and
so on.
N-grams have been used in summarization
and summary evaluation [1, 2, 3].
5. This means that the n-gram S(i, i+n-1) can be found as a substring of length n of the original text,
spanning from the i-th to the (i+n-1)-th character of the original text.
6. Consider the following sentence:
This is a sentence.
Word unigrams: this, is, a, sentence
Word bigrams: this is, is a, a sentence
Character bigrams: th, hi, is, s , a, ...
Character 4-grams: this, his , is , ...
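As a quick sanity check, the lists above can be reproduced with a few lines of Python (a sketch; the function names are mine, not from the slides):

```python
def word_ngrams(text, n):
    # Lowercase, drop the trailing period, then slide a window of n words.
    words = text.lower().rstrip(".").split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    # Slide a window of n characters over the raw string.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "This is a sentence."
print(word_ngrams(sentence, 1))  # ['this', 'is', 'a', 'sentence']
print(word_ngrams(sentence, 2))  # ['this is', 'is a', 'a sentence']
print(char_ngrams("this", 2))    # ['th', 'hi', 'is']
```

Note that character n-grams include spaces as ordinary characters, which is why items such as "s " appear in the lists above.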
8. Application of this method to the sentence
Do you like this summary?
with a requested n-gram size of 3 would return:
{'Do ', 'o y', ' yo', 'you', 'ou ', 'u l', ' li', 'lik', 'ike', 'ke ', 'e t',
' th', 'thi', 'his', 'is ', 's s', ' su', 'sum', 'umm', 'mma', 'mar',
'ary', 'ry?'}
while an algorithm taking disjoint n-grams would return
{'Do ', 'you', ' li', 'ke ', 'thi', 's s', 'umm', 'ary'}
(and '?' would probably be omitted).
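Both variants can be sketched in Python (the function names are hypothetical; the slide does not name an implementation):

```python
def sliding_ngrams(text, n):
    # Overlapping n-grams: advance the window one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def disjoint_ngrams(text, n):
    # Disjoint n-grams: advance the window n characters at a time;
    # a trailing fragment shorter than n (here '?') is dropped.
    return [text[i:i + n] for i in range(0, len(text) - n + 1, n)]

s = "Do you like this summary?"
print(sliding_ngrams(s, 3)[:4])  # ['Do ', 'o y', ' yo', 'you']
print(disjoint_ngrams(s, 3))
# ['Do ', 'you', ' li', 'ke ', 'thi', 's s', 'umm', 'ary']
```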
10. There are three methods which can be used to create
n-gram graphs, based on how neighbourhood between
adjacent n-grams is computed in a text.
In general, a fixed-width window of characters (or
words) around a given n-gram N0 is used, with all
characters (or words) within the window considered to
be neighbours of N0.
These neighbours are represented as connected
vertices in the text graph.
The edge connecting the neighbours is weighted,
indicating for example the distance between the
neighbours or the number of co-occurrences within
the text.
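A minimal sketch of such a construction in Python, assuming the simplest variant: vertices are character n-grams, and a directed edge connects an n-gram to each n-gram starting within a fixed window after it, weighted by the number of co-occurrences. (The names and the exact window convention are my assumptions, not the paper's definition.)

```python
from collections import defaultdict

def ngram_graph(text, n=3, window=3):
    """Build a non-symmetric n-gram graph as a dict: (gram, gram) -> weight."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = defaultdict(int)
    for i, g in enumerate(grams):
        # Non-symmetric: only look at the n-grams that follow g
        # within the window, counting each co-occurrence.
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[(g, grams[j])] += 1
    return edges

g = ngram_graph("abcdef", n=2, window=2)
# e.g. ('ab', 'bc') and ('ab', 'cd') are edges; ('cd', 'ab') is not,
# because the non-symmetric variant only links forwards.
```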
11. 1) The Non-Symmetric Approach
2) The Symmetric Approach
3) The Gauss-Normalized Symmetric
Approach (details omitted)
12. For the string abcdef, the figure shows the n-gram graphs
for the non-symmetric approach, the symmetric approach
and the Gauss-normalized symmetric approach, respectively.
13. 1) Transitivity, implying indication of n-th order relations:
The graph in itself is a structure that maintains
information about the 'neighbor-of-a-neighbor'. This
means that if A is related to B through an edge or a path in
the graph, and B is related to C through another edge or
path, then, if the proximity relation is considered transitive,
we can deduce that A is related to C.
The length of the path between A and C can offer
information about this indirect proximity relation. This can
be further refined if the edges have been assigned
weights indicating the degree of proximity between the
connected vertices.
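The path-length idea can be sketched with a plain breadth-first search over the edge set (a hypothetical helper; it treats edges as undirected and ignores weights for simplicity):

```python
from collections import deque

def path_length(edges, src, dst):
    """Hop count of the shortest path from src to dst, or None if unreachable."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # not connected

# A relates to B, B relates to C; transitivity gives an indirect
# A-C relation, and the path length (2 hops) measures how indirect it is.
hops = path_length([("A", "B"), ("B", "C")], "A", "C")
```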
Major Features of the n-gram Graph
14. 2) Language independence/neutrality:
When used in Natural Language Processing, the n-gram
graph representation makes no assumption about the
underlying language. This makes the representation fully
language-neutral and applicable independently even of
writing orientation (left-to-right or right-to-left) when
character n-grams are used.
Moreover, the fact that the method enters the sub-word level
has proved to be useful in all the cases where a word
appears in different forms, e.g. due to differences in writing
style, intersection of word types and so forth.
15. Given two instances of the n-gram graph representation, G1 and G2,
there are a number of operators that can be applied to G1 and G2 to
provide the n-gram graph equivalent of union, intersection and other
such operators of set theory.
For example, let the merging of G1 and G2, corresponding to the
union operator in set theory, be:
G3 = G1 ∪ G2,
which is implemented by adding all edges from both graphs to a
third one, while making sure no duplicate edges are created.
Two edges are considered duplicates of each other when they
share identical vertices.
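Representing a graph as a dictionary from vertex pairs to edge weights, the union can be sketched as follows. The slide does not say which weight a duplicate edge keeps, so this sketch keeps the weight from G1 (an assumption):

```python
def union(g1, g2):
    # All edges from both graphs; a duplicate edge (identical vertex
    # pair) is kept only once, with g1's weight taking precedence.
    g3 = dict(g2)
    g3.update(g1)
    return g3

g1 = {("th", "hi"): 2, ("hi", "is"): 1}
g2 = {("hi", "is"): 3, ("is", "s "): 1}
u = union(g1, g2)
# u has three edges; the shared edge ('hi', 'is') appears once.
```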
16. The intersection operator ∩ (G × G -> G) for
two graphs returns a graph with the
common edges of the two operand graphs,
with the averaged weights of the original
edges assigned as edge weights.
Averaging ensures that we keep the common
edges, with each weight assigned the value
closest possible to both of the original
graphs: the average.
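With the same dictionary representation as before (vertex pair -> weight), the intersection with averaged weights is a one-liner (a sketch, not the paper's implementation):

```python
def intersection(g1, g2):
    # Only edges present in both graphs survive; the new weight is
    # the average of the two original weights.
    return {e: (g1[e] + g2[e]) / 2 for e in g1.keys() & g2.keys()}

g1 = {("th", "hi"): 2.0, ("hi", "is"): 1.0}
g2 = {("hi", "is"): 3.0, ("is", "s "): 1.0}
print(intersection(g1, g2))  # {('hi', 'is'): 2.0}
```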
18. The intersection operator, on the other hand,
can be used to determine the common
subgraphs of different document graphs.
Thus, if we take the running intersection of a
number of document graphs, the final resulting
graph would likely contain n-grams of the
common topic of the documents.
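A sketch of the running intersection over several documents, using a deliberately simplified graph (edges join consecutive character trigrams with unit weight; the real construction uses a window, and all names here are mine): edges drawn from the shared phrase survive every intersection.

```python
from functools import reduce

def ngram_graph(text, n=3):
    # Minimal stand-in graph: each edge links consecutive character
    # n-grams with weight 1.0 (a simplification of the windowed version).
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return {(a, b): 1.0 for a, b in zip(grams, grams[1:])}

def intersection(g1, g2):
    # Common edges only, with averaged weights.
    return {e: (g1[e] + g2[e]) / 2 for e in g1.keys() & g2.keys()}

docs = ["the topic model", "a topic model here", "my topic model too"]
common = reduce(intersection, (ngram_graph(d) for d in docs))
# Edges built from the shared phrase "topic model" remain;
# edges unique to one document, e.g. ('the', 'he '), are gone.
```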
19. How can we extract possible topics/keywords
back from the resulting graph (after applying the
intersection operator)?
The final graph would also contain vertices
made from the n-grams of stop-words (noise),
since it has not been preprocessed. How should we
filter it? Or should we preprocess the
documents first? (Some methods for removing
noise have already been studied in the context of
summary evaluation.)
20. [1] Michele Banko and Lucy Vanderwende. Using n-grams to understand the nature
of summaries. In Daniel Marcu, Susan Dumais and Salim Roukos, editors, HLT-NAACL
2004: Short Papers, pages 1–4, Boston, Massachusetts, USA, May 2004. Association
for Computational Linguistics.
[2] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-gram
co-occurrence statistics. In NAACL '03: Proceedings of the 2003 Conference of
the North American Chapter of the Association for Computational Linguistics on
Human Language Technology, pages 71–78, Morristown, NJ, USA, 2003. Association
for Computational Linguistics.
[3] T. Copeck and S. Szpakowicz. Vocabulary usage in newswire summaries. In Text
Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 19–26.
Association for Computational Linguistics, 2004.
[4] George Giannakopoulos. Testing the use of n-gram graphs in
summarization sub-tasks. 2009.
[5] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse and Geoffrey Zweig.
Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8):
1157–1166, 1997.