Weekly Progress Report
11th to 16th July 2016
-Ayush Pareek
 This week, I’m doing a study on n-grams and
their possible applications for topic detection
in multiple documents.
 (Details in subsequent slides)
 In the fields of computational
linguistics and probability, an n-gram is a
contiguous sequence of n items from a
given sequence of text or speech. The items can
be phonemes, syllables, letters, words or base
pairs, depending on the application. The n-grams
are typically collected from a text or speech
corpus. When the items are words, n-grams may
also be called shingles.[5]
 An n-gram of size 1 is referred to as a
"unigram"; size 2 is a "bigram" (or, less
commonly, a "digram"); size 3 is a "trigram".
Larger sizes are sometimes referred to by the
value of n, e.g., "four-gram", "five-gram", and
so on.
 N-grams have been used in summarization
and summary evaluation [1, 2, 3].
This means that the n-gram $S_{i,i+n-1}$ can be found as a substring of length n of the original text,
spanning from the i-th to the (i+n-1)-th character of the original text.
 Consider the following sentence:
This is a sentence.
 Word unigrams: this, is, a, sentence
 Word bigrams: this is, is a, a sentence
 Character bigrams: th, hi, is, s , a, ...
 Character 4-grams: this, his , is , ...
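The word and character n-grams listed above can be reproduced with a short sliding-window helper. This is a minimal sketch: the function name `ngrams` and the tokenization (lowercasing, dropping the final period, splitting on whitespace) are assumptions made for illustration, not something the slides specify.

```python
def ngrams(items, n):
    """Return every contiguous window of n items from a list or a string."""
    return [items[i:i + n] for i in range(len(items) - n + 1)]


sentence = "This is a sentence."

# Word n-grams: assumed tokenization is lowercase, strip the final period,
# then split on whitespace.
words = sentence.lower().rstrip(".").split()
word_unigrams = [" ".join(g) for g in ngrams(words, 1)]
word_bigrams = [" ".join(g) for g in ngrams(words, 2)]

# Character n-grams are taken over the lowercased string, spaces included.
chars = sentence.lower()
char_bigrams = ngrams(chars, 2)
char_4grams = ngrams(chars, 4)

print(word_unigrams)
print(word_bigrams)
print(char_bigrams[:4])
print(char_4grams[:2])
```

Each window is produced in a single pass over the input, so the number of n-grams grows linearly with the length of the text.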
Time Complexity = O(size of the text) = O(|T|)
Application of this method to the sentence
Do you like this summary?
with a requested n-gram size of 3 would return:
{`Do ', `o y', ` yo', `you', `ou ', `u l', ` li', `lik', `ike', `ke ', `e t',
` th', `thi', `his', `is ', `s s', ` su', `sum', `umm', `mma', `mar',
`ary', `ry?'}
 while an algorithm taking disjoint n-grams would return
{`Do ', `you', ` li', `ke ', `thi', `s s', `umm', `ary'}
(and `?' would probably be omitted).
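The difference between the two extraction strategies can be sketched as follows. The function names `sliding_ngrams` and `disjoint_ngrams` are hypothetical, and dropping a trailing chunk shorter than n is an assumption that matches the remark about `?' being omitted.

```python
def sliding_ngrams(text, n):
    """Overlapping n-grams: the window advances one character at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def disjoint_ngrams(text, n):
    """Non-overlapping n-grams: the window advances n characters at a time.

    A trailing chunk shorter than n is dropped (an assumption).
    """
    return [text[i:i + n] for i in range(0, len(text) - n + 1, n)]


text = "Do you like this summary?"
overlapping = sliding_ngrams(text, 3)
disjoint = disjoint_ngrams(text, 3)

print(overlapping)
print(disjoint)
```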
Formally,
 There are three methods that can be used to create
n-gram graphs, based on how the neighbourhood between
adjacent n-grams is computed in a text.
 In general, a fixed-width window of characters (or
words) around a given n-gram N0 is used, with all
characters (or words) within the window considered to
be neighbours of N0.
 These neighbours are represented as connected
vertices in the text graph.
 The edge connecting the neighbours is weighted,
indicating for example the distance between the
neighbours or the number of co-occurrences within
the text.
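A minimal sketch of such a window-based construction is given below. It is an illustration under stated assumptions rather than any of the three specific methods: n-grams are character n-grams, two n-grams count as neighbours when their start positions are at most `window` apart, edges are undirected, and the edge weight is the number of co-occurrences. The name `build_ngram_graph` and its parameters are hypothetical.

```python
from collections import Counter


def build_ngram_graph(text, n=3, window=2):
    """Map each undirected edge (a sorted pair of n-grams) to its co-occurrence count."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        # Pair the current n-gram with the next `window` n-grams; looking only
        # forward counts each pair of positions exactly once.
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[tuple(sorted((gram, neighbour)))] += 1
    return dict(edges)


graph = build_ngram_graph("abcdef", n=2, window=1)
for edge, weight in sorted(graph.items()):
    print(edge, weight)
```

Storing the graph as a dictionary from edge to weight keeps the later set-like operators simple, since edges with identical vertices share one key.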
 1) The Non-Symmetric Approach
 2) The Symmetric Approach
 3) The Gauss-normalized symmetric
approach (details omitted)
 For the string abcdef, the figure shows the n-gram graphs
for the non-symmetric approach, the symmetric approach
and the Gauss-normalized symmetric approach, respectively.
Major Features of the n-gram
1) Transitivity, implying indication of n-th order relations:
 The graph in itself is a structure that maintains
information about the `neighbor-of-a-neighbor'. This
means that if A is related to B through an edge or a path in
the graph, and B is related to C through another edge or
path, then, if the proximity relation is considered transitive,
we can deduce that A is related to C.
 The length of the path between A and C can offer
information about this indirect proximity relation. This can
be further refined if the edges have been assigned
weights indicating the degree of proximity between the
connected vertices.
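The indirect relation between A and C can be illustrated with a breadth-first search that returns the number of edges on the shortest path. This is a sketch only: the function name `hop_distance` is hypothetical, edges are assumed to be undirected, and edge weights are ignored (a weighted shortest-path search would be the refinement mentioned in the text).

```python
from collections import defaultdict, deque


def hop_distance(edges, source, target):
    """Fewest edges on a path from source to target, or None if unreachable."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


edges = [("A", "B"), ("B", "C"), ("D", "E")]
print(hop_distance(edges, "A", "C"))
print(hop_distance(edges, "A", "D"))
```

A small distance suggests a close indirect relation; `None` means the two vertices are not related through any path.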
2) Language independence / neutrality.
 When used in Natural Language Processing, the n-gram
graph representation makes no assumption about the
underlying language. This makes the representation fully
language-neutral and applicable independently even of
writing orientation (left-to-right or right-to-left) when
character n-grams are used.
 Moreover, the fact that the method works at the sub-word level
has proved to be useful in all cases where a word
appears in different forms, e.g. due to differences in writing
style, intersection of word types and so forth.
 Given two instances of the n-gram graph representation, G1 and G2, there
are a number of operators that can be applied to G1 and G2 to provide
the n-gram graph equivalents of union, intersection and other such
operators of set theory.
 For example, let the merging of G1 and G2, corresponding to the
union operator in set theory, be:
G3 = G1 ∪ G2,
which is implemented by adding all edges from both graphs to a
third one, while making sure no duplicate edges are created.
 Two edges are considered duplicates of each other when they
share identical vertices.
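The merging step can be sketched as below, assuming a graph is stored as a dictionary from an undirected edge key to a weight. The names `edge_key` and `union` are hypothetical. The text does not say which weight a duplicate edge should receive, so averaging the two weights here is an assumption.

```python
def edge_key(u, v):
    """Undirected edge key: edges with identical vertices share the same key."""
    return tuple(sorted((u, v)))


def union(g1, g2):
    """Return a new graph with every edge of g1 and g2 and no duplicates."""
    merged = dict(g1)
    for edge, weight in g2.items():
        if edge in merged:
            # Assumption: a duplicate edge keeps the average of both weights.
            merged[edge] = (merged[edge] + weight) / 2
        else:
            merged[edge] = weight
    return merged


g1 = {edge_key("ab", "bc"): 2.0, edge_key("bc", "cd"): 1.0}
g2 = {edge_key("bc", "ab"): 4.0, edge_key("cd", "de"): 3.0}
g3 = union(g1, g2)
print(g3)
```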
 The intersection operator ∩ : G × G → G for
two graphs returns a graph with the
common edges of the two operand graphs,
with the averaged weights of the original
edges assigned as edge weights.
 Averaging ensures that each common edge
is kept with a weight as close as possible
to its weight in both of the original
graphs: the average.
 Formally,
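One way to state this, assuming each graph is written as $G_k = (E_k, w_k)$ with edge set $E_k$ and weight function $w_k$ (this notation is an assumption and not necessarily the one used on the original slide):

```latex
G_1 \cap G_2 = (E, w), \qquad
E = E_1 \cap E_2, \qquad
w(e) = \frac{w_1(e) + w_2(e)}{2} \quad \text{for all } e \in E .
```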
 The intersection operator, on the other hand,
can be used to determine the common
subgraphs of different document graphs.
 Thus, if we take the running intersection of the
graphs of a number of documents, the final resulting
graph may contain the n-grams of the
common topic of the documents.
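The running intersection can be sketched as a fold over the document graphs. Everything named here is hypothetical: `ngram_graph` is a simple window-based builder with co-occurrence counts as weights (an assumption), `intersect` keeps common edges with averaged weights as described earlier, and the three example documents are made up for illustration.

```python
from collections import Counter
from functools import reduce


def ngram_graph(text, n=3, window=2):
    """Character n-gram graph: undirected edge -> co-occurrence count (assumed scheme)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[tuple(sorted((gram, neighbour)))] += 1
    return dict(edges)


def intersect(g1, g2):
    """Keep only the edges present in both graphs, with averaged weights."""
    return {edge: (g1[edge] + g2[edge]) / 2 for edge in g1.keys() & g2.keys()}


def running_intersection(graphs):
    """Intersect the graphs one after another; weights are averaged pairwise at each step."""
    return reduce(intersect, graphs)


docs = ["the n-gram graph model", "an n-gram graph method", "n-gram graph operators"]
common = running_intersection([ngram_graph(d) for d in docs])
common_ngrams = {gram for edge in common for gram in edge}
print(sorted(common_ngrams))
```

In this toy example the surviving n-grams come from the shared phrase, while n-grams that occur in only some documents are removed. Stop-word n-grams shared by all documents would survive in the same way.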
 How can possible topics/keywords be extracted
back from the resulting graph (after applying the
intersection operator)?
 The final graph would also contain vertices
made from the n-grams of stop-words (noise),
since the text has not been preprocessed. How should
these be filtered? Or should we preprocess the
documents first? (Some methods for removing
noise have already been studied in the context of
summary evaluation.)
 [1] Michele Banko and Lucy Vanderwende. Using n-grams to understand the nature
of summaries. In Daniel Marcu, Susan Dumais and Salim Roukos, editors, HLT-NAACL
2004: Short Papers, pages 1–4, Boston, Massachusetts, USA, May 2004. Association
for Computational Linguistics.
 [2] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using
n-gram co-occurrence statistics. In NAACL '03: Proceedings of the 2003 Conference of
the North American Chapter of the Association for Computational Linguistics on
Human Language Technology, pages 71–78, Morristown, NJ, USA, 2003. Association
for Computational Linguistics.
 [3] T. Copeck and S. Szpakowicz. Vocabulary usage in newswire summaries. In Text
Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 19–26.
Association for Computational Linguistics, 2004.
 [4] George Giannakopoulos. Testing the Use of N-gram Graphs in
Summarization Sub-tasks. 2009.
 [5] Broder, Andrei Z.; Glassman, Steven C.; Manasse, Mark S.; Zweig, Geoffrey
(1997). "Syntactic clustering of the web". Computer Networks and ISDN Systems 29 (8):
1157–1166.
