1. From NLP to Text Mining
Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
yishin@gmail.com
2. About Speaker
陳宜欣 Yi-Shin Chen
▷ Currently
• Associate Professor, Department of Computer Science, National Tsing Hua University
• Director of the Intelligent Data Engineering and Applications Lab (IDEA Lab)
▷ Current Research
• Music therapy + artificial intelligence
• Emotion analysis and psychological analysis of text data
Planting blessings in the soil of education, nurturing them with love.
4. Language
▷The method of human communication (Oxford dictionary)
• Either spoken or written
• Consisting of the use of words
• In a structured and conventional way
▷The fundamental problem of communication:
• "Reproducing at one point either exactly or approximately a message selected at another point"
• Claude Shannon (1916–2001)
5. Communication
[Diagram: a speaker encodes the thought behind "This is the best thing happened in my life"; the listener decodes it into the logical form This == Best thing(happened(in my life)), applies semantics + inference to conclude "He is happy," and produces something to show we are buddies: "I'm so happy for you."]
7. The Father of Modern Linguistics
▷Noam Chomsky (1928– )
• "a language [is] a set (finite or infinite) of sentences, each finite in length and constructed out of a finite set of elements"
• "the structure of language is biologically determined"
• humans are born with an innate linguistic ability that constitutes a Universal Grammar
→ Syntactic Structures (句法結構)
8. Basic Concepts in NLP
This is the best thing happened in my life.
This(Det) is(Verb) the(Det) best(Adj) thing(NN) happened(Verb) in(Prep) my(PN) life(NN)
Lexical analysis 辭彙分析 (Part-of-Speech tagging 詞性標註)
Syntactic analysis 句法分析 (Parsing)
[Parse tree: the Sentence decomposes into Noun Phrases and Prepositional Phrases, e.g., "the best thing" (NP) and "in my life" (PP).]
9. Parsing
▷Parsing is the process of determining whether a string of tokens can be generated by a grammar
[Diagram of a compiler front end: the input feeds a Lexical Analyzer, which supplies tokens to the Parser via getNextToken; the Parser builds a parse tree for the rest of the front end, which emits the intermediate representation; both stages share a symbol table. The output of the encoder should be equivalent to the input.]
10. Parsing Example (for Compiler)
▷Grammar:
• E ::= E op E | - E | (E) | id
• op ::= + | - | * | /
[Parse tree for the input a * - ( b + c ): E → E(id: a) op(*) E, where the right E → - E → ( E ), and the inner E → E(id: b) op(+) E(id: c).]
11. Basic Concepts in NLP
This is the best thing happened in my life.
This(Det) is(Verb) the(Det) best(Adj) thing(NN) happened(Verb) in(Prep) my(PN) life(NN)
Lexical analysis 辭彙分析 (Part-of-Speech tagging 詞性標註)
Syntactic analysis 句法分析 (Parsing): Noun Phrases, Prepositional Phrases, Sentence
Semantic Analysis 語意分析:
• This? (t1); Best thing (t2); My (m1); Happened(t1, t2, m1)
Inference 推理:
• Happy(x) if Happened(t1, 'Best', m1) → Happy
12. NLP to Natural Language Understanding (NLU)
https://nlp.stanford.edu/~wcmac/papers/20140716-UNLU.pdf
[Diagram: tasks arranged along a spectrum from NLP to NLU.]
• Closer to NLP: Part-Of-Speech Tagging (POS), Named Entity Recognition (NER), Text Categorization, Syntactic Parsing, Co-Reference Resolution, Machine Translation, Automatic Speech Recognition (ASR), Text-To-Speech (TTS)
• Closer to NLU: Relation Extraction, Semantic Parsing, Question Answering (QA), Paraphrase & Natural Language Inference, Sentiment Analysis, Dialogue Agents, Summarization
15. NLP Techniques
▷Word segmentation*
▷Part of speech tagging (POS)
▷Stemming*
▷Syntactic Parsing
▷Named entity extraction
▷Co-reference resolution
▷Text categorization
16. Word Segmentation
▷In some languages, there is no explicit word boundary
• 這地面積還真不小
• 人体内存在很多微生物
• うふふふふ 楽しむ ありがとうございます
▷Need segmentation tools (a minimal sketch follows below)
• Chinese
→ Jieba: https://github.com/fxsjy/jieba
→ CKIP (Sinica): http://ckipsvr.iis.sinica.edu.tw/
→ Or via any statistical analysis
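A minimal segmentation sketch with Jieba (Python; the package is the one linked above; the exact segments depend on the dictionary and model version):

    import jieba  # Chinese word segmentation library

    # lcut() returns the segments as a plain Python list
    print(jieba.lcut("這地面積還真不小"))    # output may vary by dictionary
    print(jieba.lcut("人体内存在很多微生物"))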
17. Part-of-speech (POS) Tagging
▷Processing text and assigning parts of speech to
each word
▷Tag sets vary across taggers
• Noun (N), Adjective (A), Verb (V), URL (U)…
Happy Easter! I went to work and came home to an empty house now im
going for a quick run
Happy_A Easter_N !_, I_O went_V to_P work_N and_& came_V home_N to_P
an_D empty_A house_N now_R im_L going_V for_P a_D quick_A run_N
Happy(JJ) Easter(NNP) ! I(PRP) went(VBD) to(TO) work(VB) and(CC)
came(VBD) home(NN) to(TO) an(DT) empty(JJ) house(NN) now(RB)
im(VBP) going(VBG) for(IN) a(DT) quick(JJ) run(NN)
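A short NLTK sketch of POS tagging (Penn Treebank tags, as in the second output above); the model names below are the ones used by recent NLTK releases and are downloaded on first use:

    import nltk
    nltk.download("punkt")                        # tokenizer model
    nltk.download("averaged_perceptron_tagger")   # POS tagger model

    text = "Happy Easter! I went to work and came home to an empty house"
    tokens = nltk.word_tokenize(text)
    print(nltk.pos_tag(tokens))   # [('Happy', 'JJ'), ('Easter', 'NNP'), ...]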
18. Stemmer
▷ Reduces inflected/derived words to their base/root form
• E.g., Porter Stemmer
http://textanalysisonline.com/nltk-porter-stemmer
▷ Gives more accurate statistics of words
Now, AI is poised to start an equally large transformation on many industries
↓ Porter Stemmer
Now , AI is pois to start an equal larg transform on mani industry
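A sketch of the same transformation with NLTK's Porter stemmer (output shown is indicative; the stemmer also lowercases):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    sentence = "Now AI is poised to start an equally large transformation on many industries"
    print(" ".join(stemmer.stem(w) for w in sentence.split()))
    # e.g., "now ai is pois to start an equal larg transform on mani industri"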
19. NLP Techniques
▷Word segmentation*
▷Part of speech tagging (POS)
▷Stemming*
▷Syntactic Parsing
▷Named entity extraction
▷Co-reference resolution
▷Text categorization
Obtain representations
20. Vector Space Model (Bag of Words)
▷Represent the keywords of objects using a term vector
• Term: basic concept, e.g., a keyword describing an object
• Each term represents one dimension in the vector
• The value of each term in a vector corresponds to the importance of that term
• Words are used as features, without order
→ 我欠你一個人情 = 你欠我一個人情 ("I owe you a favor" = "You owe me a favor")
• Usually combined with N-gram features
21. Term Frequency and Inverse Document Frequency (TFIDF)
▷Since not all terms in the vector space are equally important, we can weight each term by its occurrence probability in the object description
• Term frequency: TF(d,t)
→ the number of times t occurs in the object description d
• Inverse document frequency: IDF(t)
→ scales down the terms that occur in many descriptions
22. Normalizing Term Frequency
▷nij is the number of times a term ti occurs in a description dj. tfij can be normalized using the total number of terms in the document:
• tf_ij = n_ij / NormalizedValue
▷The normalizing value could be:
• The sum of all term frequencies
• The maximum frequency value
• Any other value that keeps tfij between 0 and 1
• BM25*: tf_ij = n_ij × (k + 1) / (n_ij + k)
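A small sketch of these two normalizations in plain Python (k is the BM25 saturation parameter, commonly around 1.2):

    from collections import Counter

    def normalized_tf(tokens):
        counts = Counter(tokens)
        max_freq = max(counts.values())            # max-frequency normalization
        return {t: n / max_freq for t, n in counts.items()}

    def bm25_tf(tokens, k=1.2):
        counts = Counter(tokens)
        return {t: n * (k + 1) / (n + k) for t, n in counts.items()}  # saturates large counts

    doc = "the cat sat on the mat the cat".split()
    print(normalized_tf(doc))
    print(bm25_tf(doc))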
23. Inverse Document Frequency
▷ IDF seeks to scale down the coordinates of terms that occur in many object descriptions
• For example, some stop words (the, a, of, to, and…) may occur many times in a description. However, they should be considered non-important in many cases
• idf_i = log(N / df_i) + 1
→ where df_i (document frequency of term ti) is the number of descriptions in which ti occurs
▷ IDF can be replaced with ICF (inverse class frequency) and many other concepts, depending on the application
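A minimal TF-IDF sketch combining the two formulas above (max-frequency TF and idf = log(N/df) + 1); the toy documents are made up:

    import math
    from collections import Counter

    docs = [["cat", "eats", "fish"], ["dog", "eats", "meat"], ["cat", "drinks", "milk"]]
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency per term

    def tfidf(doc):
        counts = Counter(doc)
        max_f = max(counts.values())
        return {t: (n / max_f) * (math.log(N / df[t]) + 1) for t, n in counts.items()}

    print(tfidf(docs[0]))   # "cat"/"eats" get a lower idf than "fish"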
24. BOW Vectors
▷TFIDF+BOW is still close to the state of the art
▷A good baseline to start with (see the sketch below)

             season  timeout  lost  win  game  score  ball  play  coach  team
Document 1      3       0      5     0     2     6     0     2      0     2
Document 2      0       7      0     2     1     0     0     3      0     0
Document 3      0       1      0     0     1     2     2     0      3     0
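A sketch of building such term-document vectors with scikit-learn (the toy sentences are invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = [
        "the team lost the game but the coach called a timeout",
        "they win the game with a great play and score",
        "the ball play after the timeout saved the season",
    ]
    bow = CountVectorizer()
    X = bow.fit_transform(docs)                # sparse term-document count matrix
    print(bow.get_feature_names_out())
    print(X.toarray())

    tfidf = TfidfVectorizer()                  # same matrix, TF-IDF weighted
    print(tfidf.fit_transform(docs).toarray())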
26. Limitations of Bag of Words
▷No semantic relationships between words
• Not designed to model linguistic knowledge
▷Sparsity
• Due to the high number of dimensions
▷Curse of dimensionality
• As dimensionality increases, the distance between points becomes less meaningful
27. Word Context
▷Intuition: the context represents the semantics
▷Hypothesis: simpler models trained on more data will result in better word representations
Work done by Mikolov et al. in 2013
A medical doctor is a person who uses medicine to treat illness and injuries
Some medical doctors only work on certain diseases or injuries
Medical doctors examine, diagnose and treat patients
→ The surrounding words characterize "doctor(s)"
29. Main Ideas of Word2Vec
▷Two models:
• Continuous bag-of-words model
• Continuous skip-gram model
▷Utilize a neural network to learn the weights
of the word vectors
30. Continuous Bag-of-Words Model
▷Continuous Bag-of-Words Model
• Predicts the target word from the context words
• E.g., given a sentence and window size 2:
I am eating good pizza now
("eating" is the target word; "I", "am", "good", "pizza" are context words)
Training examples take the form ([features], label):
([I, am, good, pizza], eating), ([am, eating, pizza, now], good), and so on and so forth.
31. Continuous Bag-of-Words Model (Contd.)
▷ We train a neural network with one hidden layer; here the hidden layer has width 3.
[Network diagram: the vocabulary of "I am eating good pizza (now)" is one-hot encoded in 6 dimensions (I = 000001, am = 000010, eating = 000100, good = 001000, pizza = 010000). A 1×6 one-hot input is multiplied by the weight matrix W (6×3) to give the hidden layer, then by W' (3×6) and a softmax to give a predicted distribution over the vocabulary, e.g., (0.1, 0.2, 0.4, 0.1, 0.1, 0.1), which is compared against the actual one-hot label; backpropagation updates the weights.]
32. Continuous Bag-of-Words Model (Contd.)
[Same network diagram as above, highlighting the hidden layer.]
▷ The goal is to extract a word representation vector, not the probability of predicting the target word. Let's look into the hidden layer.
34. Continuous Bag-of-Words Model (Contd.)
▷The output of the hidden layer is just the "word vector" for the input word
Multiplying a one-hot input by W (6×3) simply selects one row of W:
[0 0 0 0 0 1] × W = [.2 .13 .23]
where W's rows are, e.g., (.2 .7 .2), (.2 .3 .8), (.19 .28 .22), (.22 .23 .21), (.1 .5 .3), (.2 .13 .23); the selected last row is the word vector of the sixth vocabulary word.
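A numpy sketch of this lookup (the W values are the illustrative ones from the slide):

    import numpy as np

    W = np.array([[.2, .7, .2],
                  [.2, .3, .8],
                  [.19, .28, .22],
                  [.22, .23, .21],
                  [.1, .5, .3],
                  [.2, .13, .23]])           # 6 vocabulary words x 3 hidden units

    one_hot = np.array([0, 0, 0, 0, 0, 1])   # sixth word in the vocabulary
    print(one_hot @ W)                        # [0.2  0.13 0.23] -- just row 6 of W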
35. Continuous Bag-of-Words Model (Contd.)
▷Disadvantage:
• It cannot capture rare words.
Hence the skip-gram algorithm.
[Figure: "This is a good movie theater", with a rare target word such as "marvelous" and its context words.]
36. Continuous Skip-gram Model
▷ The reverse of CBOW
▷ Predicts the words that appear around the target word
▷ The context is specified by the window length
37. Continuous Skip-gram Model (Contd.)
▷Advantages:
• It can capture rare words.
• It captures the semantic similarity of words.
→ synonyms like "intelligent" and "smart" have very similar contexts
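A gensim sketch of both models (gensim 4.x API; the two-sentence corpus is made up, so the vectors are only illustrative):

    from gensim.models import Word2Vec

    sentences = [["i", "am", "eating", "good", "pizza", "now"],
                 ["my", "kitten", "eats", "fish"]]

    cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # sg=0: CBOW
    skip = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram

    print(skip.wv["pizza"][:5])              # learned word vector (first 5 dims)
    print(skip.wv.most_similar("eating"))    # nearest neighbours in vector space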
39. NLP Techniques
▷Word segmentation*
▷Part of speech tagging (POS)
▷Stemming*
▷Syntactic Parsing
▷Named entity extraction
▷Co-reference resolution
▷Text categorization
40. Named Entity Recognition (NER)
▷Find and classify all the named entities in a text
▷What’s a named entity?
• A mention of an entity using its name.
→ Kansas Jayhawks
• This is a subset of the possible mentions...
→ Kansas, Jayhawks, the team, it, they
▷Find means identify the exact span of the mention
▷Classify means determine the category of the entity
being referred to
43. Named Entity Recognition Approaches
▷As with partial parsing and chunking there are two
basic approaches (and hybrids)
• Rule-based (regular expressions)
→ Lists of names
→ Patterns to match things that look like names
→ Patterns to match the environments that classes of
names tend to occur in.
• Machine Learning-based approaches
→ Get annotated training data
→ Extract features
→ Train systems to replicate the annotation
44. Rule-Based Approaches
▷Employ regular expressions to extract data
▷Examples (see the sketch below):
• Telephone number: (\d{3}[-. ()]){1,2}[\dA-Z]{4}
→ 800-865-1125
→ 800.865.1125
→ (800)865-CARE
• Software name extraction: ([A-Z][a-z]*\s*)+
→ Installation Designer v1.1
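A Python re sketch using the two patterns above (with the backslashes restored):

    import re

    phone_re = re.compile(r"(?:\d{3}[-. ()]){1,2}[\dA-Z]{4}")
    for s in ["800-865-1125", "800.865.1125", "(800)865-CARE"]:
        print(phone_re.search(s).group())

    name_re = re.compile(r"(?:[A-Z][a-z]*\s*)+")
    print(name_re.search("get Installation Designer v1.1 now").group())  # "Installation Designer "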
46. Co-reference Resolution
▷Find all expressions that refer to a given entity in a text
▷Common approaches for relation extraction
• Hand-written patterns
• Supervised machine learning
• Semi-supervised and unsupervised
→ Bootstrapping (using seeds)
→ Distant supervision
→ Unsupervised learning from the web
47. Hearst's Patterns for IS-A Relations
▷"Y such as X ((, X)* (, and|or) X)"
▷"such Y as X"
▷"X or other Y"
▷"X and other Y"
▷"Y including X"
▷"Y, especially X"
(Hearst, 1992): Automatic Acquisition of Hyponyms
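A sketch of matching one Hearst pattern ("Y such as X") with Python re; the sentence and the grouping logic are illustrative only:

    import re

    text = "He studies diseases such as diabetes, hypertension, and asthma."
    m = re.search(r"(\w+) such as ((?:\w+, )*(?:and |or )?\w+)", text)
    if m:
        hypernym = m.group(1)                    # "diseases"
        hyponyms = re.split(r",\s*(?:and\s+|or\s+)?|\s+and\s+|\s+or\s+", m.group(2))
        print(hypernym, "->", [h for h in hyponyms if h])
        # diseases -> ['diabetes', 'hypertension', 'asthma']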
48. Extracting Richer Relations Using Rules
▷ Intuition: relations often hold between specific entities
• located-in (ORGANIZATION, LOCATION)
• founded (PERSON, ORGANIZATION)
• cures (DRUG, DISEASE)
▷ Start with Named Entity tags to help extract relation
Content Slides by Prof. Dan Jurafsky
49. Relation Types
▷As with named entities, the list of relations is
application specific. For generic news texts...
50. Relation Bootstrapping (Hearst 1992)
▷Gather a set of seed pairs that have relation R
▷Iterate:
1. Find sentences with these pairs
2. Look at the context between or around the pair and generalize it to create patterns
3. Use the patterns to grep for more pairs
Content Slides by Prof. Dan Jurafsky
51. Bootstrapping
▷<Mark Twain, Elmira> — seed tuple
• Grep (Google) for the environments of the seed tuple:
"Mark Twain is buried in Elmira, NY." → X is buried in Y
"The grave of Mark Twain is in Elmira" → The grave of X is in Y
"Elmira is Mark Twain's final resting place" → Y is X's final resting place
▷Use those patterns to grep for new tuples
▷Iterate
Content Slides by Prof. Dan Jurafsky
52. Possible Seed: Wikipedia Infobox
▷ Infoboxes are kept in a namespace separate from articles
• Namespace examples: Special:SpecialPages; Wikipedia:List of infoboxes
• Example:
{{Infobox person
|name = Casanova
|image = Casanova_self_portrait.jpg
|caption = A self portrait of Casanova
...
|website = }}
53. Concept-based Model
▷ESA (Egozi, Markovitch, 2011)
• Every Wikipedia article represents a concept
• TF-IDF weighting over concepts to infer the concepts of a document
• Manually-maintained knowledge base
54. Yago
▷ YAGO: A Core of Semantic Knowledge Unifying WordNet and
Wikipedia, WWW 2007
▷ Unification of Wikipedia & WordNet
▷ Make use of rich structures and information
• Infoboxes, Category Pages, etc.
58. More Tools
▷NLTK (Python): tokenization, tagging, NE extraction, parse-tree display
• Porter stemmer
• n-grams
▷spaCy: industrial-strength NLP in Python (see the sketch below)
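A spaCy sketch covering tagging, parsing, and NER (requires the small English model, installed via `python -m spacy download en_core_web_sm`):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Mark Twain is buried in Elmira, NY.")

    for tok in doc:
        print(tok.text, tok.pos_, tok.dep_)   # token, POS tag, dependency label
    for ent in doc.ents:
        print(ent.text, ent.label_)           # e.g., ("Mark Twain", "PERSON")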
61. Data (Text vs. Non-Text)

World    | Sensor                       | Data (report)
Weather  | Thermometer, hygrometer      | 24°C, 55%
Location | GPS                          | 37°N, 123°E
Body     | Sphygmomanometer, MRI, etc.  | 126/70 mmHg
World    | Human                        | "To be or not to be.."

Sensor data is objective; text interpreted by humans is subjective.
62. Data Mining vs. Text Mining
Non-text data (numerical, categorical, relational) → Data Mining (clustering, classification, association rules, …)
Text data (text) → Text Processing, including NLP (preprocessing) → Data Mining
64.
一般資料:(General Data)
────
職業: 無
種族: 客家
婚姻: married
旅遊史:No recent travel history in three months
接觸史:無
群聚:無
職業病史:Nil
資料來源:Patient herself and her daughter
主訴:(Chief Complaint)
──
Sudden onest short of breath with cold sweating noted five days ago. ( since 06/09)
現在病症:(Present Illness)
────
This 60 year-old female had hypertension for 10 years and diabetes mellitus for 5 years that had
regular medical control. As her mentioned, she got similar episode attacked three months ago with
initail presentations of short of breat, DOE, orthopnea. She went to LMD for help with CAD, 3-V-D,
s/p PTCA stenting x 3 for one vessels in 2012/03. She got regular CV OPD f/u at there and the
medications was tooked. Since after, she had increatment of the oral water intake amounts. The
urine output seems to be adequate and no body weight change or legs edema noted in recent three
months. This time, acute onset severe dyspnea with orthopnea, DOE, heavy sweating and oral
thirsty noted on 06/09. He had no fever, chills, nausea, vomiting, palpitation, cough, chest tightness,
chest pain, palpitation, abdominal discomfort noticed. For the symptoms intolerable, he came to our
ED for help with chest x film revealed cardiomegaly and elevations of cardiac markers noted. The
cardiologist was consulted and the heparinization was applied. The CPK level had no elevation at
regular f/u examinations, and her symptoms got much improved after. The cardiosonogram reported
impaired LV systolic function and moderate MR. She was admitted for further suvery and
managements to the acute ischemic heart disease.
Align/classify the attributes correctly
65. Language Detection
▷Detect the language (or possible languages) in which a given text is written
▷Difficulties
• Short messages
• Multiple languages in one statement
• Noise
職業: 無
種族: 客家
婚姻: married
旅遊史:No recent travel history in three months
接觸史:無
群聚:無
職業病史:Nil
資料來源:Patient herself and her daughter
66. Data Cleaning
▷Special characters
• Unicode emoticons: ☺, ♥…
• Symbol icons: ☏, ✉…
• Currency symbols: €, £, $...
• Tweet URLs
▷Utilize regular expressions to clean data (see the sketch below)
• Strip URLs: (^|\s*)http(\S+)?(\s*|$)
• Keep only letters, spaces, punctuation, digits: (\p{L}+)|(\p{Z}+)|(\p{Punct}+)|(\p{Digit}+)
Examples: "◕‿◕ Friendship is everything ♥ ✉ xxxx@gmail.com"; "I added a video to a @YouTube playlist http://t.co/ceYX62StGO Jamie Riepe"
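A cleaning sketch with the third-party `regex` module, which, unlike the standard `re`, supports Unicode properties (\p{L}, \p{Z}, \p{P}, \p{N} correspond to the slide's Java-style \p{L}, \p{Z}, \p{Punct}, \p{Digit}):

    import regex  # pip install regex

    tweet = "I added a video to a @YouTube playlist http://t.co/ceYX62StGO Jamie Riepe"
    no_urls = regex.sub(r"(^|\s*)http(\S+)?(\s*|$)", " ", tweet)   # strip URLs

    # keep only letters, separators, punctuation, and digits
    kept = "".join(regex.findall(r"\p{L}+|\p{Z}+|\p{P}+|\p{N}+", no_urls))
    print(kept)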
67. Part-of-speech (POS) Tagging
▷Processing text and assigning parts of speech
to each word
▷Twitter POS tagging
• Noun (N), Adjective (A), Verb (V), URL (U)…
67
This 60 year-old female had hypertension for 10 years and diabetes mellitus
for 5 years that had regular medical control.
This(D) 60(Num) year-old(Adj) female(N) had(V) hypertension(N) for(Pre)
10(Num) years(N) and(Con) diabetes(N) mellitus(N) for(Pre) 5(Num) years(N)
that(Det) had(V) regular(Adj) medical(Adj) control(N).
68. Stemming
This 60 year-old female had hypertension for 10 years and diabetes mellitus for 5 years that had regular medical control.
• had → have, years → year (intended normalization)
▷Problem:
• Diabetes → diabete (a disease name, wrongly stemmed)
69. More Preprocesses for Different Languages
▷Chinese Simplified/Traditional Conversion
▷Word segmentation
70. Wrong Chinese Word Segmentation
▷Wrong segmentation (地面 vs. 面積)
• 這(Nep) 地面(Nc) 積(VJ) 還(D) 真(D) 不(D) 小(VH) http://t.co/QlUbiaz2Iz
▷Wrong word (實驗室 vs. 室友)
• @iamzeke 實驗(Na) 室友(Na) 多(Dfa) 危險(VH) 你(Nh) 不(D) 知道(VK) 嗎(T) ?
▷Wrong order (存在 vs. 內在)
• 人體(Na) 存(VC) 內在(Na) 很多(Neqa) 微生物(Na)
▷Unknown word (團購, "group buying")
• 半夜(Nd) 逛團(Na) 購(VC) 看到(VE) 太(Dfa) 吸引人(VH) !!
72. Data Mining vs. Text Mining
Non-text data (numerical, categorical, relational; precise, objective) → Data Mining (clustering, classification, association rules, …)
Text data (text; ambiguous, subjective) → Text Processing, including NLP (preprocessing) → Text Mining
73. Overview of Data Mining
Understand the objectives
74. Tasks in Data Mining
▷Problems should be well defined at the beginning
▷Two categories of tasks [Fayyad et al., 1996]
• Predictive tasks: predict unknown values (e.g., identify potential/VIP customers)
• Descriptive tasks: find patterns to describe data (e.g., friendship finding)
75. Select Techniques
▷Problems can be further decomposed
• Predictive tasks (supervised learning): classification, ranking, regression, …
• Descriptive tasks (unsupervised learning): clustering, association rules, summarization, …
76. Supervised vs. Unsupervised Learning
▷ Supervised learning
• Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
• New data is classified based on the training set
▷ Unsupervised learning
• The class labels of the training data are unknown
• Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
77. Classification
▷ Given a collection of records (training set)
• Each record contains a set of attributes
• One of the attributes is the class
▷ Find a model for class attribute:
• The model forms a function of the values of other attributes
▷ Goal: previously unseen records should be assigned a class as
accurately as possible.
• A test set is needed
→ To determine the accuracy of the model
▷Usually, the given data set is divided into training & test
• With training set used to build the model
• With test set used to validate it
78. Ranking
▷Produce a permutation of the items in a new list
• Items ranked in higher positions should be more important
• E.g., ranking webpages in a search engine: webpages in higher positions are more relevant
79. Regression
▷Find a function that models the data with the least error
• The output is a numerical value
• E.g., predicting a stock's value
80. Clustering
▷Group data into clusters
• Objects within the same cluster are similar
• Objects in different clusters are dissimilar
• No predefined classes (unsupervised classification)
81. Association Rule Mining
▷Basic concept
• Given a set of transactions
• Find rules that will predict the occurrence of an item
• Based on the occurrences of other items in the transaction
82. Summarization
▷Provide a more compact representation of the data
• Data: visualization
• Text: document summarization
→ E.g., snippets
83. Back to Text Mining
Let’s come back to Text Mining
84. Landscape of Text Mining
[Diagram: the World is perceived both by devices/sensors, which report objective non-text data (context), e.g., 24°C, 55%, and by humans, who express what they perceive as subjective text, e.g., "To be or not to be.."]
Mining knowledge about languages: Natural Language Processing & Text Representation; Word Association and Mining
85. Landscape of Text Mining
[Same landscape diagram as above.]
Mining content about the observers: Opinion Mining and Sentiment Analysis
86. Landscape of Text Mining
[Same landscape diagram as above.]
Mining content about the world: Topic Mining, Contextual Text Mining
87. From NLP to Text Mining
[Pipeline over "This is the best thing happened in my life.":
1. String of characters
2. String of words
3. POS tags: This(Det) is(Verb) the(Det) best(Adj) thing(NN) happened(Verb) in(Prep) my(PN) life(NN)
4. Entities and relationships: entities "Best thing", "My life" (types such as Entity, Period); relationship Happened(…)
5. Understanding (logic predicates) and emotion: "The writer loves his new born baby"; Happy
Deeper NLP is less accurate but closer to knowledge.]
88. NLP vs. Text Mining
▷Text Mining objectives
• Overview
• Know the trends
• Accept noise
▷NLP objectives
• Understanding
• Ability to answer
• Immaculate
90. Word Relations
▷Paradigmatic: can be substituted for each other (similar)
• E.g., cat & dog, run & walk
▷Syntagmatic: can be combined with each other (correlated)
• E.g., cat and fights, dog and barks
→These two basic and complementary relations can be generalized to describe the relations of any items in a language
91. Mining Word Associations
▷Paradigmatic
• Represent each word by its context
• Compute context similarity
• Words with high context similarity
▷Syntagmatic
• Count the number of times two words occur together in a context
• Compare the co-occurrences with the corresponding individual
occurrences
• Words with high co-occurrences but relatively low individual
occurrence
92. Paradigmatic Word Associations
John's cat eats fish in Saturday
Mary's dog eats meat in Sunday
John's cat drinks milk in Sunday
Mary's dog drinks beer in Tuesday

Blank out the word and compare its contexts:
John's --- eats fish in Saturday
Mary's --- eats meat in Sunday
John's --- drinks milk in Sunday
Mary's --- drinks beer in Tuesday
• Similar left context, similar right context, similar general context
How similar are context("cat") and context("dog")?
How similar are context("cat") and context("John")?
→ Expected Overlap of Words in Context (EOWC): Overlap("cat", "dog") vs. Overlap("cat", "John")
93. Common Approach for EOWC: Cosine Similarity
▷ If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
where · indicates the vector dot product and ||d|| is the length of vector d.
▷ Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = .3150
→ Overlap("John", "Cat") = .3150
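A numpy sketch reproducing this computation:

    import numpy as np

    d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
    d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

    cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
    print(round(cos, 4))   # 0.315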
94. Quality of EOWC?
▷ The more overlap the two context documents have, the higher the similarity is
▷However:
• It favors matching one frequent term very well over matching more distinct terms
• It treats every word equally (overlap on "the" should not be as meaningful as overlap on "eats")
▷Solution?
• TFIDF
95. Mining Word Associations
▷Paradigmatic
• Represent each word by its context
• Compute context similarity
• Words with high context similarity
▷Syntagmatic
• Count the number of times two words occur together in a context
• Compare the co-occurrences with the corresponding individual
occurrences
• Words with high co-occurrences but relatively low individual
occurrence
96. Syntagmatic Word Associations
John's cat eats fish in Saturday
Mary's dog eats meat in Sunday
John's cat drinks milk in Sunday
Mary's dog drinks beer in Tuesday

Blank out the neighbours of a word:
John's *** eats *** in Saturday
Mary's *** eats *** in Sunday
John's *** drinks *** in Sunday
Mary's *** drinks *** in Tuesday
What words tend to occur to the left of "eats"? What words to the right?
Whenever "eats" occurs, what other words also tend to occur?
→ Correlated occurrences: P(dog | eats) = ?; P(cats | eats) = ?
97. Word Prediction
Prediction Question: Is word W present (or absent) in this segment?
97
Text Segment (any unit, e.g., sentence, paragraph, document)
Predict the occurrence of word
W1 = ‘meat’ W2 = ‘a’ W3 = ‘unicorn’
98. Word Prediction: Formal Definition
▷Binary random variable x_w ∈ {0, 1}
• x_w = 1 if w is present, 0 if w is absent
• P(x_w = 1) + P(x_w = 0) = 1
▷The more random x_w is, the more difficult the prediction
▷How do we quantitatively measure the randomness?
99. Entropy
▷Entropy measures the amount of randomness (surprise, uncertainty)
▷Entropy is defined as:
H(p_1, …, p_n) = − Σ_{i=1..n} p_i log p_i = Σ_{i=1..n} p_i log(1/p_i), where Σ_{i=1..n} p_i = 1
• entropy = 0 → easy to predict
• entropy = 1 → difficult to predict
[Plot: binary entropy H(p, 1−p) as a function of p; it is 0 at p = 0 and p = 1 and peaks at 1 when p = 0.5. Entropy grows with the number of choices.]
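A small sketch of this measure (log base 2, so the binary maximum is 1 bit):

    import math

    def entropy(probs):
        # Shannon entropy of a discrete distribution, in bits
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))   # 1.0  -> hardest to predict
    print(entropy([0.9, 0.1]))   # ~0.47
    print(entropy([1.0, 0.0]))   # 0.0  -> trivially predictable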
101. Mining Syntagmatic Relations
▷For each word w1:
• For every word w2, compute the conditional entropy H(x_w1 | x_w2)
• Sort all candidate words in ascending order of H(x_w1 | x_w2)
• Take the top-ranked candidate words, subject to a given threshold
▷However
• H(x_w1 | x_w2) and H(x_w1 | x_w3) are comparable
• H(x_w1 | x_w2) and H(x_w3 | x_w2) are not
→ Because the upper bounds differ
▷Conditional entropy is not symmetric
• H(x_w1 | x_w2) ≠ H(x_w2 | x_w1)
102. Mutual Information
▷I(x; y) = H(x) − H(x|y) = H(y) − H(y|x)
▷Properties:
• Symmetric
• Non-negative
• I(x; y) = 0 iff x and y are independent
▷Allows us to compare different (x, y) pairs
[Venn diagram: H(x,y) splits into H(x|y), I(x;y), and H(y|x); H(x) = H(x|y) + I(x;y) and H(y) = H(y|x) + I(x;y).]
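A sketch computing I(x;y) from a 2×2 joint distribution of word presence, via the equivalent identity I(x;y) = H(x) + H(y) − H(x,y) (the joint probabilities are made up):

    import math

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # joint probabilities over segments: both words present, only w1, only w2, neither
    p11, p10, p01, p00 = 0.2, 0.1, 0.1, 0.6

    Hx = H([p11 + p10, p01 + p00])    # H(x_w1)
    Hy = H([p11 + p01, p10 + p00])    # H(x_w2)
    Hxy = H([p11, p10, p01, p00])     # H(x_w1, x_w2)
    print(Hx + Hy - Hxy)              # I(x;y) > 0: the words are correlated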
103. Topic Mining
Assume we already know the word relationships
104. Landscape of Text Mining
[Same landscape diagram as above.]
Mining knowledge about languages: Natural Language Processing & Text Representation; Word Association and Mining
105. Topic Mining: Motivation
▷Topic: a key idea in text data
• Theme/subject
• Different granularities (e.g., sentence, article)
▷Motivating applications, e.g.:
• Hot topics during the 2016 presidential election debates
• What do people like about Windows 10?
• What are Facebook users talking about today?
• What are the most-watched news stories?
106. Tasks of Topic Mining
[Diagram: documents (Doc1, Doc2, …) mapped onto discovered topics (Topic 1, Topic 2, … Topic n).]
107. Formal Definition of Topic Mining
▷Input
• A collection of N text documents S = {d_1, d_2, d_3, … d_N}
• Number of topics: k
▷Output
• k topics: θ_1, θ_2, θ_3, … θ_k
• Coverage of topics in each d_i: μ_i1, μ_i2, μ_i3, … μ_ik
▷How do we define a topic θ_i?
• Topic = term (word)?
• Topic = class?
108. Tasks of Topic Mining (Terms as Topics)
[Diagram: documents mapped onto single-term topics such as Politics, Weather, Sports, Travel, Technology.]
109. Problems with "Terms as Topics"
▷Not generic
• Can only represent simple/general topics
• Cannot represent complicated topics
→ E.g., "uber issue": political or transportation-related?
▷Incomplete coverage
• Cannot capture variations of vocabulary
▷Word sense ambiguity
• E.g., Hollywood star vs. stars in the sky; Apple Watch vs. apple recipes
110. Improved Ideas
▷Idea 1 (Probabilistic topic models): topic = word distribution
• E.g.: Sports = {(Sports, 0.2), (Game, 0.01), (basketball, 0.005), (play, 0.003), (NBA, 0.01)…}
• Pros: generic, easy to implement
▷Idea 2 (Concept topic models): topic = concept
• Maintain concepts (manually or automatically)
→ E.g., ConceptNet
111. Possible Approaches for Probabilistic Topic Models
▷Bag-of-words approaches (generative models):
• Mixture of unigram language models
• Expectation-maximization algorithm
• Probabilistic latent semantic analysis
• Latent Dirichlet allocation (LDA)
▷Graph-based approaches:
• TextRank (Mihalcea and Tarau, 2004)
• Reinforcement Approach (Xiaojun et al., 2007)
• CollabRank (Xiaojun et al., 2008)
112. Bag-of-words Assumption
▷Word order is ignored
▷"bag-of-words" – exchangeability
▷Theorem (De Finetti, 1935): if (x_1, x_2, …, x_N) are infinitely exchangeable, then the joint probability p(x_1, x_2, …, x_N) has a representation as a mixture:
p(x_1, x_2, …, x_N) = ∫ p(θ) [ Π_{n=1..N} p(x_n | θ) ] dθ
for some random variable θ
113. Latent Dirichlet Allocation
▷ Latent Dirichlet Allocation (Blei, Ng, and Jordan, 2003) — not to be confused with Linear Discriminant Analysis
For a document of N words w = (w_1 … w_N) with topic assignments z:
(2) p(θ, z, w | α, β) = p(θ | α) Π_{n=1..N} p(z_n | θ) p(w_n | z_n, β)
(3) p(w | α, β) = ∫ p(θ | α) [ Π_{n=1..N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ] dθ
For a corpus D of M documents:
p(D | α, β) = Π_{d=1..M} ∫ p(θ_d | α) [ Π_{n=1..N_d} Σ_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β) ] dθ_d
114. LDA Assumption
▷Assume that when writing a document, you:
1. Decide how many words it has
2. Decide the distributions (P = Dir(α), P = Dir(β))
3. Choose a topic (from the Dirichlet-distributed mixture)
4. Choose a word from that topic (Dirichlet)
5. Repeat from 3
• Example: the document "bread adorable dog eating banana"
1. 5 words in the document
2. 50% food & 50% cute animals
3. 1st word – food topic gives the word "bread"
4. 2nd word – cute-animals topic, "adorable"
5. 3rd word – cute-animals topic, "dog"
6. 4th word – food topic, "eating"
7. 5th word – food topic, "banana"
115. LDA Learning (Gibbs)
▷ Decide how many topics you think there are (here: 2)
▷ Randomly assign words to topics
▷ Check and update the topic assignments iteratively:
• p(topic t | document d)
• p(word w | topic t)
• Reassign w a new topic with probability p(topic t | document d) × p(word w | topic t)
Corpus: "I eat fish and vegetables." / "Dog and fish are pets." / "My kitten eats fish."
Example iteration (topics shown as "red" and "purple"):
p(red|1)=0.67; p(purple|1)=0.33; p(red|2)=0.67; p(purple|2)=0.33; p(red|3)=0.67; p(purple|3)=0.33
p(eat|red)=0.17; p(eat|purple)=0.33; p(fish|red)=0.33; p(fish|purple)=0.33; p(vegetable|red)=0.17; p(dog|purple)=0.33; p(pet|red)=0.17; p(kitten|red)=0.17
Reassigning "fish" in sentence 2: p(purple|2)×p(fish|purple)=0.5×0.33=0.165 > p(red|2)×p(fish|red)=0.5×0.2=0.1
After the update:
p(red|1)=0.67; p(purple|1)=0.33; p(red|2)=0.50; p(purple|2)=0.50; p(red|3)=0.67; p(purple|3)=0.33
p(eat|red)=0.20; p(eat|purple)=0.33; p(fish|red)=0.20; p(fish|purple)=0.33; p(vegetable|red)=0.20; p(dog|purple)=0.33; p(pet|red)=0.20; p(kitten|red)=0.20
116. Related Work – Topic Model (LDA)
▷I eat fish and vegetables.
▷Dog and fish are pets.
▷My kitten eats fish.
LDA output:
Sentence 1: 14.67% Topic 1, 85.33% Topic 2
Sentence 2: 85.44% Topic 1, 14.56% Topic 2
Sentence 3: 19.95% Topic 1, 80.05% Topic 2
Topic 1: 0.268 fish, 0.210 pet, 0.210 dog, 0.147 kitten
Topic 2: 0.296 eat, 0.265 fish, 0.189 vegetable, 0.121 kitten
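A gensim sketch of fitting LDA to these three sentences (the tiny corpus is only illustrative, and results vary across runs):

    from gensim import corpora, models

    texts = [["eat", "fish", "vegetables"],
             ["dog", "fish", "pets"],
             ["kitten", "eats", "fish"]]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words per document

    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=50)
    print(lda.print_topics())    # word distribution per topic
    print(lda[corpus[0]])        # topic coverage of sentence 1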
117. Possible Approaches for Probabilistic Topic Models
▷Bag-of-words approaches:
• Mixture of unigram language models
• Expectation-maximization algorithm
• Probabilistic latent semantic analysis
• Latent Dirichlet allocation (LDA)
▷Graph-based approaches:
• TextRank (Mihalcea and Tarau, 2004)
• Reinforcement Approach (Xiaojun et al., 2007)
• CollabRank (Xiaojun et al., 2008)
118. Construct Graph
▷ Directed graph
▷ Elements in the graph
→ Terms
→ Phrases
→ Sentences
119. Connect Term Nodes
▷Connect terms based on their slop (the allowed distance between them)
I love the new ipod shuffle.
It is the smallest ipod.
[Graph: with slop = 0, adjacent words are connected (I–love, love–the, …); with slop = 1, words one position apart are also connected.]
120. Connect Phrase Nodes
▷Connect phrases to
• Compound words
• Neighbouring words
[Graph: the phrase node "ipod shuffle" is linked to its component terms "ipod" and "shuffle" and to neighbouring words in "I love the new ipod shuffle. It is the smallest ipod."]
121. Connect Sentence Nodes
▷Connect sentences to
• Neighbouring sentences
• Component terms
• Component phrases
[Graph: the sentence nodes "I love the new ipod shuffle." and "It is the smallest ipod." are linked to each other and to the term/phrase nodes they contain.]
122. Edge Weight Types
[Graph: the term, phrase, and sentence nodes from "I love the new ipod shuffle. It is the smallest ipod.", with different weight types on the edges.]
123. Graph-based Ranking
▷Score each node (TextRank, 2004):
WS(V_i) = (1 − d) + d × Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] × WS(V_j)
where d is the damping factor (0.85) and w_ji is the weight of edge j→i.
[Worked example from the figure: node "the" receives score from parent nodes "new", "love", "is" with edge weights 0.1, 0.2, 0.3 and parent scores 0.5, 0.6, 0.7:
y_the = (1 − 0.85) + 0.85 × (0.1×0.5 + 0.2×0.6 + 0.3×0.7)]
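A sketch of the same idea with networkx's weighted PageRank, a close cousin of TextRank (the word graph and weights are made up):

    import networkx as nx

    G = nx.Graph()   # co-occurrence graph of words
    G.add_weighted_edges_from([
        ("ipod", "shuffle", 0.9), ("love", "ipod", 0.7),
        ("new", "ipod", 0.6), ("smallest", "ipod", 0.8),
        ("love", "new", 0.3),
    ])

    scores = nx.pagerank(G, alpha=0.85, weight="weight")   # alpha = damping factor
    print(sorted(scores.items(), key=lambda kv: -kv[1]))   # "ipod" ranks highest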
125. Graph-based Extraction
• Pros
→ Uses structure and syntax information
→ Mutual influence between nodes
• Cons
→ Common words get higher scores
126. Summary: Probabilistic Topic Models
▷Probabilistic topic models: topic = word distribution
• E.g.: Sports = {(Sports, 0.2), (Game, 0.01), (basketball, 0.005), (play, 0.003), (NBA, 0.01)…}
• Pros: generic, easy to implement
• Cons: not easy to understand/communicate
• Cons: not easy to construct semantic relationships between topics
Topic? Topic = {(Crooked, 0.02), (dishonest, 0.001), (News, 0.0008), (totally, 0.0009), (total, 0.000009), (failed, 0.0006), (bad, 0.0015), (failing, 0.00001), (presidential, 0.0000008), (States, 0.0000004), (terrible, 0.0000085), (failed, 0.000021), (lightweight, 0.00001), (weak, 0.0000075), ……}
→ the most commonly used words in Trump insults
127. Improved Ideas
▷Idea 1 (Probabilistic topic models): topic = word distribution
• E.g.: Sports = {(Sports, 0.2), (Game, 0.01), (basketball, 0.005), (play, 0.003), (NBA, 0.01)…}
• Pros: generic, easy to implement
▷Idea 2 (Concept topic models): topic = concept
• Maintain concepts (manually or automatically)
→ E.g., ConceptNet
128. NLP Related Approach:
Named Entity Recognition
▷Find and classify all the named entities in a text.
▷What’s a named entity?
• A mention of an entity using its name.
→ Kansas Jayhawks
• This is a subset of the possible mentions...
→ Kansas, Jayhawks, the team, it, they
▷Find means identify the exact span of the mention
▷Classify means determine the category of the entity
being referred to
129. Named Entity Recognition Approaches
▷As with partial parsing and chunking there are two
basic approaches (and hybrids)
• Rule-based (regular expressions)
→ Lists of names
→ Patterns to match things that look like names
→ Patterns to match the environments that classes of names
tend to occur in.
• Machine Learning-based approaches
→ Get annotated training data
→ Extract features
→ Train systems to replicate the annotation
131. Landscape of Text Mining
[Same landscape diagram as above.]
Mining content about the observers: Opinion Mining and Sentiment Analysis
132. Opinion
▷A subjective statement describing a person's perspective about something
• Opinion holder: personalized/customized; depends on background, culture, context
• Target: the something
(In contrast, an objective or factual statement can be proved right or wrong.)
133. Opinion Representation
▷Opinion holder: user
▷Opinion target: object
▷Opinion content: keywords?
▷Opinion context: time, location, others?
▷Opinion sentiment (emotion): positive/negative,
happy or sad
134. Sentiment Analysis
▷Input: an opinionated text object
▷Output: a sentiment tag/emotion label
• Polarity analysis: {positive, negative, neutral}
• Emotion analysis: happy, sad, anger, …
▷Naive approach:
• Apply classification/clustering to extracted text features
136. Text Features
▷Character n-grams
• Usually for spelling/recognition proofing
• Less meaningful
▷Word n-grams
• n should be larger than 1 for sentiment analysis
▷POS-tag n-grams
• Can be mixed with words and POS tags
→ E.g., "adj noun", "sad noun"
137. More Text Features
▷Word classes
• Thesaurus: LIWC
• Ontology: WordNet, Yago, DBPedia,
SentiWordNet
• Recognized entities: DBPedia, Yago
▷Frequent patterns in text
• Could utilize pattern discovery algorithms
• Optimizing the tradeoff between coverage and
specificity is essential
138. LIWC
▷Linguistic Inquiry and Word Count
• LIWC2015
▷Home page: http://liwc.wpengine.com/
▷>70 word classes
▷Developed by researchers with interests in social, clinical, health, and cognitive psychology
▷Cost: US$89.95
139. Chinese Corpora and Resources
▷Provided by the NLP Lab of Academia Sinica
• http://academiasinicanlplab.github.io/
• Corpora
→ NTCIR MOAT (Multilingual Opinion Analysis Task) corpus
→ EmotionLines: An Emotion Corpus of Multi-Party Conversations
• Resources
→ Chinese sentiment dictionaries (documentation): NTUSD – NTU Sentiment Dictionary; ANTUSD – Augmented NTU Sentiment Dictionary
→ ACBiMA – Advanced Chinese Bi-Character Word Morphological Analyzer
→ CSentiPackage 1.0 (click here to request the password)
141. Emotion Analysis: NLP Approach
▷Aggregation
• Summing up opinion scores of characters/words
▷Weighted by structures
• Morphological structures
→ Intra-word structures
• Sentence syntactic structures
→ Inter-word structures
▷Composition
• With deep neural networks
142. Emotion Analysis: Text Mining Approach
▷ Graph and pattern approach
• Carlos Argueta, Fernando Calderon, and Yi-Shin Chen, Multilingual Emotion Classifier using Unsupervised Pattern Extraction from Microblog Data, Intelligent Data Analysis – An International Journal, 2016
149. Preprocessing Steps
▷Hint: remove troublesome tweets
o Too short
→ Too short to yield important features
o Contain too many hashtags
→ Too much information to process
o Are retweets
→ Increase the complexity
o Have URLs
→ Too troublesome to collect the page data
o Convert user mentions to <usermention> and hashtags to <hashtag>
→ Removes identifying information. We should not peek at the answers!
150. Basic Guidelines
▷ Identify the commonalities and differences between the experimental and control groups
• Analyze the frequency of words
→ TF•IDF (term frequency, inverse document frequency)
• Analyze the co-occurrence between words/patterns
→ Co-occurrence graph
• Analyze the importance of words
→ Centrality
151. Graph Construction
▷Construct two graphs
• E.g.
→ Emotion one: I love the World of Warcraft new game
→ Not-emotion one: 3,000 killed in the world by ebola
[Two word graphs, one per corpus, with weighted edges between co-occurring words (weights such as 0.9, 0.84, 0.65, … in the emotion graph; 0.93, 0.87, 0.83, … in the not-emotion graph).]
152. Graph Processing
▷Remove what is common between the two graphs
• Keep only the significant elements that appear in the emotion graph
▷Analyze the centrality of words
• Betweenness, closeness, eigenvector, degree, Katz
→ Free/open software can be used, e.g., Gephi, GraphDB
▷Analyze the clustering degree
• Clustering coefficient
→ Graph → key patterns
153. Essence Only
Only key phrases
→emotion patterns
154. Emotion Pattern Extraction
o The goals:
o Language-independent extraction – not based on grammar or manual templates
o A more representative set of features – balance between generality and specificity
o High recall/coverage – adapt to unseen words
o Requiring only a relatively small number of patterns – high reliability
o Efficient – fast extraction and utilization
o Meaningful – even if there are no recognizable emotion words in the text
155. Pattern Definition
o Patterns are constructed from two types of elements:
o Surface tokens: hello, lol, house, …
o Wildcards: * (matches every word)
o A pattern contains at least 2 elements
o A pattern contains at least one element of each type
Examples:

Pattern    | Matches
* this *   | "Hate this weather", "love this drawing"
* *        | "so happy ", "to me "
luv my *   | "luv my gift", "luv my price"
* that     | "want that", "love that", "hate that"
156. Pattern Construction
o Patterns are constructed from instances
o An instance is a sequence of 2 or more words drawn from connector words (CW) and subject words (SW)
o It contains at least one CW and one SW
Examples:
Subject words (SW): love, hate, gift, weather, …
Connector words (CW): this, luv, my, …
Instances: "hate this weather", "so happy ", "luv my gift", "love this drawing", "luv my price", "to me ", "kill this idiot", "finish this task"
157. Pattern Construction (2)
o Find all instances in a corpus with their frequencies
o Aggregate the counts by grouping instances based on length and position of the matching CW

Instance              | Count
"hate this weather"   | 5
"so happy "           | 4
"luv my gift"         | 7
"love this drawing"   | 2
"luv my price"        | 1
"to me "              | 3
"kill this idiot"     | 1
"finish this task"    | 4

Group                                                                            | Count
"hate this weather", "love this drawing", "kill this idiot", "finish this task" | 12
"so happy ", "to me "                                                            | 7
"luv my gift", "luv my price"                                                    | 8
158. Pattern Construction (3)
o Replace all SWs with a wildcard * and keep the CWs, converting each instance group into its representing pattern
o The wildcard matches any word and is used for term generalization
o Infrequent patterns are filtered out (see the sketch below)

Pattern   | Group                                                                            | Count
* this *  | "Hate this weather", "love this drawing", "kill this idiot", "finish this task" | 12
* *       | "so happy ", "to me "                                                            | 7
luv my *  | "luv my gift", "luv my price"                                                    | 8
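A minimal sketch of this CW/SW pattern extraction (the word lists, corpus, and threshold are made up; the paper's full method additionally ranks patterns by exclusiveness and diversity):

    from collections import Counter

    CW = {"this", "luv", "my"}            # connector words (assumed given)
    corpus = ["hate this weather", "luv my gift", "love this drawing",
              "kill this idiot", "finish this task", "luv my price"]

    patterns = Counter()
    for instance in corpus:
        words = instance.split()
        if any(w in CW for w in words) and any(w not in CW for w in words):
            # keep CWs, replace subject words with the wildcard
            patterns[" ".join(w if w in CW else "*" for w in words)] += 1

    MIN_COUNT = 2
    print({p: c for p, c in patterns.items() if c >= MIN_COUNT})
    # {'* this *': 4, 'luv my *': 2}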
159. Ranking Emotion Patterns
▷ Rank the emotion patterns for each emotion
• By frequency, exclusiveness, diversity
• One ranked list per emotion (e.g., Joy, Sad, Anger)
160. Samples of Emotion Patterns
Joy: finally * my | tomorrow !!! * | <hashtag> birthday .+ | * yay ! | :) * ! | princess * | * hehe | prom dress *
Sad: memories * | * without my | sucks * <hashtag> | * tonight :( | * anymore .. | felt so * | . :( * | * :((
Anger: my * always | shut the * | teachers * | people say * | -.- * | understand why * | why are * | with these *
161. Accuracy

          | Naïve Bayes | SVM    | NRCWE  | Our Approach
English   | 81.90%      | 76.60% | 35.40% | 81.20%
Spanish   | 70.00%      | 52.00% | 0.00%  | 80.00%
French    | 72.00%      | 61.00% | 0.00%  | 84.00%

[Bar chart of the same accuracies, with a LIWC / No LIWC legend.]
162. Chinese Dataset
▷ Facebook Graph API
• Fan-page posts
• Post comments
▷ Find comments that received emoji reactions
166. Context
▷Text usually has rich context information
• Direct context (meta-data): time, location, author
• Indirect context: social networks of authors,
other text related to the same source
• Any other related text
▷Context could be used for:
• Partition the data
• Provide extra features
167. Contextual Text Mining
▷Query log + user = personalized search
▷Tweet + time = event identification
▷Tweet + location-related patterns = location identification
▷Tweet + sentiment = opinion mining
▷Text mining + context → contextual text mining
168. Partition Text
[Diagram: the same text collection partitioned along context attributes, e.g., users above age 65 vs. users under age 12; data within year 2000 (on a 1998–2006 timeline); posts containing #sad.]
169. Generative Model of Text
[Diagram: documents ("I eat fish and vegetables." / "Dog and fish are pets." / "My kitten eats fish.") are generated word by word from a model; analysis inverts the generation.]
Generation: P(word | Model)
Analyzing the model yields, e.g.:
Topic 1: 0.268 fish, 0.210 pet, 0.210 dog, 0.147 kitten
Topic 2: 0.296 eat, 0.265 fish, 0.189 vegetable, 0.121 kitten
with P(word | Topic) and P(Topic | Document)
170. Contextualized Models of Text
[Same generative diagram, but each document carries context attributes such as Year=2008, Location=Taiwan, Source=FB, emotion=happy, Gender=Man.]
Generation: P(word | Model, Context)
171. Naïve Contextual Topic Model
[Same corpus, now generated with one topic model per context value, e.g., one pair of topics for Year=2007 and another for Year=2008:
Topic 1: 0.268 fish, 0.210 pet, 0.210 dog, 0.147 kitten
Topic 2: 0.296 eat, 0.265 fish, 0.189 vegetable, 0.121 kitten]
P(w) = Σ_{j=1..C} p(c_j) Σ_{i=1..K} p(z = i | Context_j) p(w | Topic_i, Context_j)
How do we estimate it? → Different approaches for different contextual data and problems
172. Contextual Probabilistic Latent Semantic
Analysis (CPLSA) (Mei, Zhai, KDD2006)
▷An extension of PLSA model ([Hofmann 99]) by
• Introducing context variables
• Modeling views of topics
• Modeling coverage variations of topics
▷Process of contextual text mining
• Instantiation of CPLSA (context, views, coverage)
• Fit the model to text data (EM algorithm)
• Compare a topic from different views
• Compute strength dynamics of topics from coverages
• Compute other probabilistic topic patterns
173. The Probabilistic Model
• A probabilistic model explaining the generation of a document D and its context features C: if an author wants to write such a document, he will
– Choose a view v_i according to the view distribution p(v_i | D, C)
– Choose a coverage κ_j according to the coverage distribution p(κ_j | D, C)
– Choose a theme θ_il according to the coverage κ_j
– Generate a word using θ_il
– The likelihood of the document collection is:
log p(D | C) = Σ_{(D,C)} Σ_{w ∈ V} c(w, D) log [ Σ_{i=1..n} Σ_{j=1..m} Σ_{l=1..k} p(v_i | D, C) p(κ_j | D, C) p(l | κ_j) p(w | θ_il) ]
174. Contextual Text Mining Example
Contextual hate speech code words
https://arxiv.org/pdf/1711.10093.pdf
175. Culture Effect
▷Emotion expressions might be affected by our culture
[Bar chart: emotion-expression counts vary widely across groups, e.g., 182,413 and 112,944 in some categories vs. 5,719 and 6,435 in others.]
176. Code Word
▷Code word: "a word or phrase that has a secret meaning or that is used instead of another word or phrase to avoid speaking directly" (Merriam-Webster)
▷Example:
Anyone who isn't white doesn't deserve to live here. Those foreign ________ should be deported.
• niggers — a known hate term, easy to recognize
• animals — a non-provocative word used instead, recoverable by inference
• skypes — isn't Skype a messaging app? From the context, something should feel off
177. Detecting Code Word
▷Detecting the words not in the dictionary
▷Expand the word set with context
180. Context
▷Relatedness 關聯性
• word collocation, via word2vec
▷Similarity 相似性
• behavioral similarity, via dependency-based word embeddings
[Example: "the man/boy jumps/plays/talks" — words that behave alike vs. words that co-occur.]
181. Dependency-based Word Embedding
▷Use dependency-based contexts instead of linear BoW contexts [Levy and Goldberg, 2014]
182. Why Does Word Context Matter?
▷Input: skypes
▷ Create different embeddings to capture different word usages.
▷ Build similarity and relatedness embeddings for hate data & clean data
[Table: nearest neighbours of "skypes" under similarity vs. relatedness embeddings, built separately from hate data and clean data. Clean-data neighbours are messaging terms: skyped, Skype-ing, facetime, phone, whatsapp, line, snapchat, imessage, chat, dropbox. Hate-data neighbours include slurs and dehumanizing terms — kike, cockroaches, negroes, animals — alongside facebook and Line.]
183. Finding Word Neighbours with PageRank
▷Rank the words by the differences between the data sets
[Graph: the neighbourhood of "skypes", with PageRank-difference scores (e.g., 40.92) surfacing words such as niggers, faggots, monkeys, cunts, animals, asshole, negroe.]
185. Evaluation: Control Results
Most participants were able to correctly answer the controls across all 3 experiments.
[Two rating-distribution charts: the control word "niggers" (positive control for hate speech) is rated mostly "Very Likely"; the control word "water" (negative control) is rated mostly "Very Unlikely".]
186. Evaluation: Aggregate Classification

Dataset       | Metric    | Hate Speech | Not Hate Speech
HateCommunity | Precision | 0.88        | 1.00
              | Recall    | 1.00        | 0.67
              | F1        | 0.93        | 0.80
CleanTexts    | Precision | 1.00        | 0.86
              | Recall    | 0.75        | 1.00
              | F1        | 0.86        | 0.92
HateTexts     | Precision | 0.75        | 0.83
              | Recall    | 0.75        | 0.83
              | F1        | 0.75        | 0.83
189. External Resources or NOT?
▷Yes to resources
• Steady
• Emphasis on certain terms
• Few mistakes
• Conservative
▷No
• Keep changing
• Relationships between words matter more
• Many mistakes
• Experimental
190. Some Algorithms Do Not Work?
▷What are the motivations behind these algorithms?
• E.g., TFIDF is designed to find representative terms
▷What are the assumptions of these algorithms?
• E.g., LDA
▷Does our current data fit our problem?
• How should we modify it?
191. How to Find Relationships?
▷Rely on machines
• Machine learning
• Association rules
• Regression
▷Human first
• E.g., the graph approach in emotion analysis
→ Find the centrality relationships and patterns in the graph
192. Always Remember
▷Have a good and solid objective
• No goal, no gold
• Know the relationships between them