1. From Natural Language Processing to Text Mining
Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
yishin@gmail.com
2. About Speaker
Currently
Associate Professor, Department of Computer Science, National Tsing Hua University
Director of the Intelligent Data Engineering and Applications Lab (IDEA Lab)
Education
Ph.D. in Computer Science, USC, USA
M.B.A. in Information Management, NCU, TW
B.B.A. in Information Management, NCU, TW
陳宜欣 Yi-Shin Chen
16. Language
Definition:
"The method of human communication" (Oxford dictionary)
"Speech produced by humans, composed of phonetics, vocabulary, and grammar; an essential tool for expressing feelings and conveying thought" (Ministry of Education Mandarin Dictionary)
Characteristics of communication:
Carried out through writing, speech, or body language
Involves word choice
Structured, usually by convention
A common problem of communication — Claude Shannon (1916–2001):
"Reproducing at one point either exactly or approximately a message selected at another point"
@ Yi-Shin Chen, NLP to Text Mining 16
19. The Father of Computational Linguistics
Noam Chomsky (1928–present)
"a language [is] a set (finite or infinite) of sentences, each finite in length and constructed out of a finite set of elements"
"the structure of language is biologically determined"
"humans are born with an innate linguistic ability that constitutes a Universal Grammar"
Key work: Syntactic Structures
20. Basic Concepts of Natural Language Processing
Example sentence: This is the best thing happened in my life.
Lexical analysis (part-of-speech tagging): label each word with its part of speech (Det., Verb, Adj., NN, ...)
Syntactic analysis (parsing): group the tagged words into noun phrases, prepositional phrases, and a sentence tree
21. Parsing
Parsing determines whether an input string can be generated by the grammar.
(Compiler front-end diagram: Input → Lexical Analyzer → token / getNextToken → Parser → parse tree → rest of front end → intermediate representation, with both stages consulting the symbol table.)
The front end's output should be equivalent to its input.
22. Parsing Example (a four-function arithmetic compiler)
Grammar:
E ::= E op E | -E | (E) | id
op ::= + | - | * | /
Input: a * - ( b + c )
(The slide shows the corresponding parse tree: E expands to id op E, where E is -(E) and the parenthesized E expands to id op id.)
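A recognizer for this grammar can be sketched in a few lines. The grammar as written is ambiguous and left-recursive, so this hypothetical sketch assumes the standard refactoring E ::= T (op T)*, T ::= -T | (E) | id:

```python
# Minimal recursive-descent recognizer for the slide's expression grammar,
# using the refactored (non-left-recursive) form E ::= T (op T)*.
import re

def tokenize(s):
    return re.findall(r'[A-Za-z_]\w*|[-+*/()]', s)

def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def term():                      # T ::= -T | (E) | id
        nonlocal pos
        t = peek()
        if t == '-':
            pos += 1
            return term()
        if t == '(':
            pos += 1
            if not expr() or peek() != ')':
                return False
            pos += 1
            return True
        if t is not None and re.fullmatch(r'[A-Za-z_]\w*', t):
            pos += 1
            return True
        return False
    def expr():                      # E ::= T (op T)*
        nonlocal pos
        if not term():
            return False
        while peek() in ('+', '-', '*', '/'):
            pos += 1
            if not term():
                return False
        return True
    ok = expr()
    return ok and pos == len(tokens)

print(parse(tokenize('a * - ( b + c )')))  # True
print(parse(tokenize('a + * b')))          # False
```

A full parser would build the parse tree shown on the slide instead of just returning a boolean.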
23. Basic Concepts of Natural Language Processing
Example sentence: This is the best thing happened in my life.
Lexical analysis (part-of-speech tagging), then syntactic analysis (parsing), as on the earlier slide.
Semantic analysis: map phrases to logical terms, e.g. This (t1), Best thing (t2), My (m1), Happened (t1, t2, m1)
Inference: Happy(x) if Happened(t1, 'Best', m1) → Happy
24. NLP to Natural Language Understanding (NLU)
https://nlp.stanford.edu/~wcmac/papers/20140716-UNLU.pdf
Tasks along the NLP-to-NLU spectrum: Named Entity Recognition (NER), Part-Of-Speech Tagging (POS), Text Categorization, Co-Reference Resolution, Machine Translation, Syntactic Parsing, Question Answering (QA), Relation Extraction, Semantic Parsing, Paraphrase & Natural Language Inference, Sentiment Analysis, Dialogue Agents, Summarization, Automatic Speech Recognition (ASR), Text-To-Speech (TTS)
27. Natural Language Processing Techniques
Word segmentation*
Part-of-speech (POS) tagging
Stemming*
Syntactic parsing
Named entity recognition
Co-reference resolution
Text categorization
28. Word Segmentation
In some languages there is no explicit boundary between words:
這地面積還真不小
人体内存在很多微生物
うふふふふ 楽しむ ありがとうございます
Chinese requires segmentation tools:
Jieba: https://github.com/fxsjy/jieba
CKIP (Sinica): http://ckipsvr.iis.sinica.edu.tw/
or other statistical methods
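A toy illustration of why segmentation is hard: greedy forward maximum matching against a hypothetical mini-dictionary. Real tools such as Jieba or CKIP use statistical models instead of this sketch:

```python
# Forward maximum matching: at each position, take the longest dictionary
# word. The dictionary below is a made-up toy; note that on the slide's
# example it greedily picks 地面 ("ground"), although the intended reading
# is 地/面積 ("the area of this plot") -- exactly the ambiguity shown above.
def segment(text, dictionary, max_len=4):
    words, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            if L == 1 or text[i:i+L] in dictionary:
                words.append(text[i:i+L])
                i += L
                break
    return words

vocab = {'面積', '地面', '不小'}  # hypothetical dictionary entries
print(segment('這地面積還真不小', vocab))
```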
29. POS Tagging
Analyze the text and label each word with its part of speech.
Many tagging schemes (and tag sets) exist: noun (N), adjective (A), verb (V), URL (U), ...
Happy Easter! I went to work and came home to an empty house now im
going for a quick run
Happy_A Easter_N !_, I_O went_V to_P work_N and_& came_V home_N to_P
an_D empty_A house_N now_R im_L going_V for_P a_D quick_A run_N
(JJ)Happy (NNP) Easter! (PRP) I (VBD) went (TO) to (VB) work (CC) and
(VBD) came (NN) home (TO) to (DT) an (JJ) empty (NN) house (RB) now
(VBP) im (VBG) going (IN) for (DT) a (JJ) quick (NN) run
30. Stemmer
Reduces inflected words to their stem or root form.
E.g., the Porter stemmer: http://textanalysisonline.com/nltk-porter-stemmer
Word-frequency statistics become more accurate after stemming.
Example (Porter stemmer):
Now, AI is poised to start an equally large transformation on many industries
→ Now , AI is pois to start an equal larg transform on mani industry
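The idea can be sketched with a toy suffix-stripper; the real Porter stemmer applies about 60 context-sensitive rules (NLTK ships an implementation), so this illustration with a hand-picked suffix list is only a rough approximation:

```python
# Toy stemmer: strip the first matching suffix, keeping at least a
# 3-character stem. The suffix list is hypothetical, not Porter's rules.
SUFFIXES = ['ation', 'ies', 'ing', 'ed', 'ly', 's']

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]
    return word

print([stem(w) for w in ['industries', 'poised', 'equally', 'transformation']])
# ['industr', 'pois', 'equal', 'transform']
```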
31. Natural Language Processing Techniques
Word segmentation*
Part-of-speech (POS) tagging
Stemming*
Syntactic parsing
Named entity extraction
Co-reference resolution
Text categorization
Representations
40. Word Context
Intuition: context conveys meaning.
Hypothesis: training a simpler model on much more data yields better word representations.
[Work done by Mikolov et al. in 2013]
A medical doctor is a person who uses medicine to treat illness and injuries
Some medical doctors only work on certain diseases or injuries
Medical doctors examine, diagnose and treat patients
The shared contexts characterize doctor/doctors.
41. Example
"king" – "man" + "woman" = "queen"
42. The Word2Vec Idea
Two models:
Continuous bag-of-words model
Continuous skip-gram model
A neural network learns the weights of the word vectors.
(草船借箭 — "borrowing arrows with straw boats")
43. Continuous Bag-of-Words Model
Use the context to predict the target word.
E.g., with window size 2, the sentence below yields ([features], label) pairs:
I am eating good pizza now  (target word: eating; context on both sides)
([I, am, good, pizza], eating)
([am, eating, pizza, now], good)
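Generating these training pairs is a simple sliding-window pass; a minimal sketch over the slide's sentence:

```python
# Build ([context], target) CBOW training pairs with a symmetric window.
def cbow_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((ctx, target))
    return pairs

sent = 'I am eating good pizza now'.split()
pairs = cbow_pairs(sent)
print(pairs[2])  # (['I', 'am', 'good', 'pizza'], 'eating')
print(pairs[3])  # (['am', 'eating', 'pizza', 'now'], 'good')
```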
44. Continuous Bag-of-Words Model (Contd.)
(Network diagram: four 1x6 one-hot input vectors for the context words feed a shared 6x3 weight matrix W into a 3-unit hidden layer; a 3x6 matrix W' maps the hidden layer to a softmax over the vocabulary, compared against the one-hot actual label.)
I am eating good pizza now
One-hot codes: 000001 000010 000100 001000 010000
45. One-Hot Encoding
A way to encode categorical variables.
Suppose there are four categories 1, 2, 3, 4:
Integer/label encoding: 1, 2, 3, 4
— easily misleads the algorithm into assuming 4 > 2
One-hot encoding: 0001, 0010, 0100, 1000
— each dimension is 1 only when that category is present
— the algorithm knows the dimensions are not ordered
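The encoding itself is one line; a minimal sketch for the four categories above:

```python
# One-hot encode a label against a fixed category list: exactly one
# dimension is 1, so no spurious ordering is implied.
def one_hot(label, categories):
    return [1 if c == label else 0 for c in categories]

cats = [1, 2, 3, 4]
print(one_hot(3, cats))  # [0, 0, 1, 0]
```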
46. Continuous Bag-of-Words Model (Contd.)
(Same network diagram as the previous slide; the prediction error against the actual label is backpropagated to correct the weights.)
47. Continuous Bag-of-Words Model (Contd.)
(Same network diagram; the key part is the input-to-hidden weight matrix W.)
48. Continuous Bag-of-Words Model (Contd.)
The hidden-layer output is the input word's vector.
Input layer (1x6) times W (6x3):
[0 0 0 0 0 1] × W = [.2 .13 .23], i.e. the one-hot input simply selects one row of W.
(W's rows in the example: [.2 .7 .2], [.2 .3 .8], [.19 .28 .22], [.22 .23 .21], [.1 .5 .3], [.2 .13 .23])
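The point of the slide, that multiplying a one-hot vector by W just selects one row, so W's rows are the word vectors, can be checked directly (matrix values are the slide's example):

```python
# One-hot row vector times a matrix, done with plain lists: the result
# equals the row of W picked out by the 1 in the one-hot vector.
def onehot_times_W(onehot, W):
    return [sum(x * w for x, w in zip(onehot, col)) for col in zip(*W)]

W = [[.2, .7, .2],     # example 6x3 weight matrix from the slide
     [.2, .3, .8],
     [.19, .28, .22],
     [.22, .23, .21],
     [.1, .5, .3],
     [.2, .13, .23]]
print(onehot_times_W([0, 0, 0, 0, 0, 1], W))  # [0.2, 0.13, 0.23]
print(W[5])                                    # the same row
```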
49. Continuous Bag-of-Words Model (Contd.)
Drawback: cannot handle rare words.
E.g., "This is a good movie theater" vs. the rare target word "Marvelous" in the same context.
Remedy: use the skip-gram algorithm.
50. Continuous Skip-gram Model
The reverse of CBOW:
input the target word, predict its context
again with a given context window size
51. Continuous Skip-gram Model (Contd.)
Advantages:
Handles rare words (their contexts are usually common)
Captures semantic similarity:
"intelligent" and "smart" are likely to have very similar contexts
58. "陳**" "蔡**" "張**"
Named Entity Recognition Approaches
▷Two basic approaches (plus hybrids)
Rule-based (regular expressions)
→Lists of names
→Patterns that look like names
→Patterns in the context that typically surrounds names
If "陳宜欣" AND "清華大學" then "university faculty"
If "蔡英文" AND "民進黨" then "politician"
If "陳昇瑋" AND "資料科學" then "Academia Sinica researcher"
(Example news passage: E.SUN Financial Holding recruited 陳昇瑋, big-data and AI expert and executive director of the Taiwan AI Academy, as its Chief Technology Officer, making it the first financial institution in Taiwan with a CTO; his key task is to integrate E.SUN's "technology league" of over a thousand people. After finishing his Ph.D. at NTU Electrical Engineering in 2006, he joined Academia Sinica's data science institute as a research fellow working on big data and AI. Beyond research he actively promoted industry-academia collaboration, founding the Taiwan Data Science Conference in 2014 and the Taiwan AI Conference in 2017, where AlphaGo's lead designer 黃士傑 gave his first public talk in Taiwan; he also serves as executive director of the Taiwan AI Academy, opened in late January to train AI talent for research and industry.)
59. Rule-Based Approaches
▷Extract data with regular expressions
▷Examples (backslashes restored from the original export):
Telephone number: (\d{3}[-. ()]){1,2}[\dA-Z]{4}
→800-865-1125
→800.865.1125
→(800)865-CARE
Software name extraction: ([A-Z][a-z]*\s*)+
→Installation Designer v1.1
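A hedged sketch of such extraction rules in Python (the slide's patterns lost their backslashes in the export; the phone pattern below is a slightly adjusted variant that matches all three examples):

```python
# Rule-based extraction with regular expressions: a phone-number pattern
# (two groups of 3 digits plus a 4-character tail, allowing -, ., space,
# or parentheses) and a capitalized-words pattern for software names.
import re

phone = re.compile(r'(\(?\d{3}\)?[-. ]?){2}[\dA-Z]{4}')
for s in ['800-865-1125', '800.865.1125', '(800)865-CARE']:
    print(bool(phone.fullmatch(s)))  # True for all three

software = re.compile(r'([A-Z][a-z]*\s*)+')
print(bool(software.match('Installation Designer')))  # True
```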
60. Named Entity Recognition Approaches
▷Two basic approaches (plus hybrids)
Rule-based (regular expressions)
→Lists of names
→Patterns that look like names
→Patterns in the context that typically surrounds names
Machine learning
→Use annotated training data
→Extract features
→Train a system to reproduce the annotations
63. Hearst's Patterns for IS-A Relations
"Y such as X ((, X)* (, and|or) X)"
"such Y as X"
"X or other Y"
"X and other Y"
"Y including X"
"Y, especially X"
(Hearst, 1992): Automatic Acquisition of Hyponyms
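One of these patterns can be turned into an extractor in a few lines; a simplified sketch for "Y such as X" (single-word entities only, unlike Hearst's full patterns):

```python
# Extract (hyponym, hypernym) IS-A pairs with the "Y such as X" pattern.
import re

def hearst_such_as(text):
    pairs = []
    for m in re.finditer(r'(\w+) such as (\w+)', text):
        pairs.append((m.group(2), m.group(1)))  # (X, Y) = (hyponym, hypernym)
    return pairs

print(hearst_such_as('He likes fruits such as apples and sports such as tennis.'))
# [('apples', 'fruits'), ('tennis', 'sports')]
```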
64. Extracting More Relations with Rules
Basic idea: specific entity types tend to participate in specific relations:
located-in (ORGANIZATION, LOCATION)
founded (PERSON, ORGANIZATION)
cures (DRUG, DISEASE)
The named-entity tags from the previous step help identify the relations.
Content slides by Prof. Dan Jurafsky
66. Relation Bootstrapping (Hearst 1992)
Gather seed pairs that have relation R
Iterate:
1. Find sentences matching the current pairs
2. Look at the context around each pair and generalize it into more patterns
3. Use the new patterns to collect more pairs
67. Bootstrapping
<Mark Twain, Elmira> Seed tuple
Grep (google) for the environments of the seed tuple
“Mark Twain is buried in Elmira, NY.”
X is buried in Y
“The grave of Mark Twain is in Elmira”
The grave of X is in Y
“Elmira is Mark Twain’s final resting place”
Y is X’s final resting place
Use those patterns to grep for new tuples
Iterate
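The pattern-induction step above can be sketched directly: replace the seed entities in each matching sentence with placeholders (real systems generalize further and score the patterns):

```python
# One bootstrapping step: turn sentences containing the seed tuple into
# extraction patterns by substituting the entities with X and Y.
def make_pattern(sentence, x, y):
    return sentence.replace(x, 'X').replace(y, 'Y')

seed = ('Mark Twain', 'Elmira')
sents = ['Mark Twain is buried in Elmira, NY.',
         'The grave of Mark Twain is in Elmira']
for s in sents:
    print(make_pattern(s, *seed))
# X is buried in Y, NY.
# The grave of X is in Y
```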
68. A Possible Seed Set: Wikipedia Infoboxes
Infoboxes are kept separately from other Wikipedia article content (in a namespace)
Namespace examples: Special:SpecialPages; Wikipedia:List of infoboxes
Example:
{{Infobox person
|name = Casanova
|image = Casanova_self_portrait.jpg
|caption = A self portrait of Casanova
...
|website = }}
69. Concept-based Model
ESA (Egozi, Markovitch, 2011)
Every Wikipedia article represents a concept
TF-IDF over concepts is used to infer a document's concepts
A manually curated dataset
70. YAGO
YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW 2007
Integrates Wikipedia and WordNet
Exploits various kinds of structured information:
Infoboxes, category pages, etc.
74. More Tools
NLTK (Python): tokenization, tagging, NE extraction, parse-tree display
Porter stemmer
n-grams
spaCy: industrial-strength NLP in Python
77. Data (Text vs. Non-text)
World → sensor → data → interpretation → report:
Weather → thermometer, hygrometer → 24°C, 55%
Location → GPS → 37°N, 123°E
Body → sphygmomanometer, MRI, etc. → 126/70 mmHg
World → humans (as subjective sensors) → text: "To be or not to be.."
Non-text data is objective; text data is subjective.
78. Data Mining vs. Text Mining
Non-text data (numeric, categorical, relational) → data mining:
• clustering
• classification
• association rules
• …
Text data → text processing (including natural language processing) first, then the same mining step.
80. @ Yi-Shin Chen, NLP to Text Mining 80
General Data:
Occupation: none
Ethnicity: Hakka
Marital status: married
Travel history: No recent travel history in three months
Contact history: none
Cluster: none
Occupational disease history: Nil
Source of information: Patient herself and her daughter
Chief Complaint:
Sudden onest short of breath with cold sweating noted five days ago. ( since 06/09)
Present Illness:
This 60 year-old female had hypertension for 10 years and diabetes mellitus for 5 years that had regular medical control. As her mentioned, she got similar episode attacked three months ago with initail presentations of short of breat, DOE, orthopnea. She went to LMD for help with CAD, 3-V-D, s/p PTCA stenting x 3 for one vessels in 2012/03. She got regular CV OPD f/u at there and the medications was tooked. Since after, she had increatment of the oral water intake amounts. The urine output seems to be adequate and no body weight change or legs edema noted in recent three months. This time, acute onset severe dyspnea with orthopnea, DOE, heavy sweating and oral thirsty noted on 06/09. He had no fever, chills, nausea, vomiting, palpitation, cough, chest tightness, chest pain, palpitation, abdominal discomfort noticed. For the symptoms intolerable, he came to our ED for help with chest x film revealed cardiomegaly and elevations of cardiac markers noted. The cardiologist was consulted and the heparinization was applied. The CPK level had no elevation at regular f/u examinations, and her symptoms got much improved after. The cardiosonogram reported impaired LV systolic function and moderate MR. She was admitted for further suvery and managements to the acute ischemic heart disease. (Clinical text kept verbatim, including its original errors, as an example of noisy real-world data.)
81.
(Same medical record as the previous slide.)
Task: correctly align and classify each attribute.
82. Detecting the Language
Detect which language (or language family) the input text belongs to.
Difficulties:
Very short texts
Multiple languages mixed in one sentence
Noise
Example (the mixed-language record fields below):
職業: 無
種族: 客家
婚姻: married
旅遊史:No recent travel history in three months
接觸史:無
群聚:無
職業病史:Nil
資料來源:Patient herself and her daughter
83. Language-Detection Errors
Twitter examples (detected language before removing noise -> after):
@sayidatynet top song #LailaGhofran shokran ya garh new album #listen  (en -> id)
中華隊的服裝挺特別的,好藍。。。#ChineseTaipei #Sochi #2014冬奧  (it -> zh-tw)
授業前の雪合戦w http://t.co/d9b5peaq7J  (en -> ja)
85. Data Cleaning
Special characters: clean the data with regular expressions
Unicode emoticons ☺, ♥…
Symbol icons ☏, ✉…
Currency symbols €, £, $...
Tweet URLs
Filter out anything that is not a letter, space, punctuation, or digit
Examples: ◕‿◕ Friendship is everything ♥ ✉ xxxx@gmail.com / I added a video to a @YouTube playlist http://t.co/ceYX62StGO Jamie Riepe
Patterns (backslashes restored):
(^|\s*)http(\S+)?(\s*|$)
(\p{L}+)|(\p{Z}+)|(\p{Punct}+)|(\p{Digit}+)
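Python's standard re module has no \p{...} Unicode classes (the third-party regex module does), so a sketch of the same cleaning using unicodedata categories instead:

```python
# Drop URLs, then keep only characters whose Unicode general category is
# Letter (L), Number (N), Punctuation (P), or Separator/space (Z).
import re
import unicodedata

def clean(text):
    text = re.sub(r'(^|\s+)http(\S+)?(\s+|$)', ' ', text)  # remove URLs
    keep = set('LNPZ')
    return ''.join(c for c in text if unicodedata.category(c)[0] in keep)

print(clean('◕‿◕ Friendship is everything ♥ ✉'))
print(clean('see http://t.co/x now'))
```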
86. POS Tagging
Analyze the text and label each word with its part of speech.
Many tagging schemes (and tag sets) exist: noun (N), adjective (A), verb (V), URL (U)…
This 60 year-old female had hypertension for 10 years and diabetes mellitus
for 5 years that had regular medical control.
This(D) 60(Num) year-old(Adj) female(N) had(V) hypertension (N) for(Pre)
10(Num) years(N) and(Con) diabetes(N) mellitus(N) for(pre) 5(Num) years(N)
that(Det) had(V) regular(Adj) medical(Adj) control(N).
87. Stemming
Drawback: e.g., Diabetes -> diabete
This 60 year-old female had hypertension for 10 years and diabetes mellitus for 5 years that had regular medical control.
(The slide annotates the stemmed forms: had -> have, years -> year.)
95. Supervised vs. Unsupervised Learning
▷ Supervised learning
Supervision: the training data carry labels indicating each observation's class
New data are classified based on what was learned from the training set
▷ Unsupervised learning
No class information
Use measurements or observations to find clusters
96. Supervised vs. Unsupervised Learning
Supervised learning
Supervision: The training data (observations, measurements, etc.)
are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
97. Classification
Given a training set:
each record has several attributes
one of the attributes is the class label
Learn a suitable model for the class label based on the other attributes
— the model can be thought of as a function of the other attribute values
Goal: assign a sufficiently accurate class to records the model has not seen before
A test set is needed to estimate the trained model's accuracy:
the collected data are usually split into a training set and a test set
train the model on the training set
validate its accuracy on the test set
104. Different Possibilities of Text Mining
(Diagram: the objective world is measured by sensors into data such as "24°C, 55%", then interpreted into reports; the subjective world is perceived by humans, who express it as text such as "To be or not to be..".)
Mining knowledge about language itself:
natural language processing & text representation & mining word-to-word associations
105. Different Possibilities of Text Mining
(Same world-sensor-data diagram as above.)
Understanding the relation between the text and its observer:
opinion mining & emotion mining
106. Different Possibilities of Text Mining
(Same world-sensor-data diagram as above.)
Learning about the world:
topic mining, contextual text mining
107. From Natural Language Processing to Text Mining
This is the best thing happened in my life.
String of characters → string of words → POS tags → entities and relations (This?, Best thing, Happened, My life; entity, period) → emotion (happy) → understanding (logic predicates): the author is joyfully welcoming his newborn son
The more NLP is applied, the less precise the result, but the closer we get to the knowledge we want.
108. Natural Language Processing vs. Text Mining
Text mining goals:
Overview
Understanding trends
Noise is acceptable
Natural language processing goals:
Understanding
Ability to answer
Very precise
109. An Illustration for Natural Language Processing
The story begins in 1983. That year, the struggling news magazine U.S. News & World Report decided to launch an ambitious project: it would evaluate 1,800 American colleges and universities and rank them from best to worst.
If the project succeeded, the resulting ranking would become a useful tool helping millions of young people make the first major decision of their lives. For many of them, the choice of college shapes their future career path and determines the lifelong friends (quite possibly including a spouse) they will make.
The magazine also hoped the rankings issue would be a newsstand sensation, letting U.S. News catch up, at least for one week, with its main rivals such as Time and Newsweek.
What the U.S. News staff set out to measure was "educational excellence", something far fuzzier than the cost of corn or the micrograms of protein in a kernel. They had no direct way to quantify the effect of four years of college on one student, let alone on tens of millions of them. They could not measure every aspect of four years of student life: how much was learned, how happy the students were, the effect on their confidence, the friendships gained. Their model did not reflect President Johnson's ideal of higher education — "a way to deepen personal fulfillment, raise personal productivity, and increase personal reward."
They therefore relied on proxies that seemed correlated with educational success: students' SAT scores, student-teacher ratios, and acceptance rates. They analyzed the percentage of freshmen returning for sophomore year, and graduation rates. They computed the percentage of living alumni donating to their alma mater, assuming that willingness to donate signals satisfaction with the quality of the education received.
116. Paradigmatic Word Associations
John’s cat eats fish in Saturday
Mary’s dog eats meat in Sunday
John’s cat drinks milk in Sunday
Mary’s dog drinks beer in Tuesday
John's --- eats fish in Saturday
Mary's --- eats meat in Sunday
John's --- drinks milk in Sunday
Mary's --- drinks beer in Tuesday
Compare the similarity of the left-hand contexts
Compare the similarity of the right-hand contexts
Compare the similarity of the rest of the context
How similar is the context of "cat" to the context of "dog"?
How similar is the context of "cat" to the context of "John"?
Expected Overlap of Words in Context (EOWC)
Overlap ("cat", "dog")
Overlap ("cat", "John")
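The context-overlap idea can be sketched with bags of surrounding words over the four example sentences (a simplification of EOWC; real formulations use probability vectors and IDF weighting):

```python
# Represent each word by the multiset of words co-occurring with it in the
# same sentence, then measure the overlap of two such bags.
from collections import Counter

sents = ["John's cat eats fish in Saturday",
         "Mary's dog eats meat in Sunday",
         "John's cat drinks milk in Sunday",
         "Mary's dog drinks beer in Tuesday"]

def context(word):
    bag = Counter()
    for s in sents:
        toks = s.split()
        if word in toks:
            bag.update(t for t in toks if t != word)
    return bag

def overlap(w1, w2):
    c1, c2 = context(w1), context(w2)
    return sum((c1 & c2).values()) / max(1, sum(c1.values()))

print(overlap('cat', 'dog'))      # 0.5: paradigmatic pair, shared contexts
print(overlap('cat', "John's"))   # 0.8: raw overlap also rewards words that
                                  # co-occur together -- one reason EOWC
                                  # alone is too crude and needs weighting
```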
129. Different Possibilities of Text Mining
(Same world-sensor-data diagram as before.)
Learning about the world:
topic mining, contextual text mining
131. Tasks of Topic Mining
(Diagram: each document Doc1, Doc2, ... in the text data is mapped onto topics Topic 1 ... Topic n with different coverages.)
132. Topic Mining: Formal Definition
Input:
A collection of N text documents S = {d1, d2, d3, … dN}
Number of topics: k
Output:
k topics: θ1, θ2, θ3, … θk
Coverage of topics in each di: μi1, μi2, μi3, … μik
How should a topic θi be defined?
Topic = term (word)?
Topic = classes?
133. Tasks of Topic Mining (Terms as Topics)
(Same diagram, with the topics now being terms: Politics, Weather, Sports, Travel, Technology.)
134. Difficulties with "Terms as Topics"
Not general enough:
can only express simple/common topics
cannot express more complex topics
E.g., "uber issue": is it a political issue or a transportation issue?
Not complete:
cannot cover the different senses of a word
word-sense ambiguity
E.g., Hollywood star vs. stars in the sky; apple watch vs. apple recipes
138. Latent Dirichlet Allocation
Latent Dirichlet Allocation (D. M. Blei, A. Y. Ng, 2003)
(Not to be confused with Linear Discriminant Analysis.)
(Plate diagram: hyperparameters α and β; for each of the M documents, topic proportions θ_d; for each of its N words, a topic assignment z_{dn} and an observed word w_{dn}.)
The generative probabilities (reconstructed from Blei et al. 2003; the slide's equations (2) and (3) were garbled in the export):
p(\theta, z, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)
139. LDA Assumptions
Assumption: when writing an article, the author will
1. Decide how many words to write
2. Decide the word-occurrence probabilities (P = Dir(α), P = Dir(β))
3. Choose the document's topics (Dirichlet)
4. Pick each word according to its topic (Dirichlet)
5. Repeat from step 3
Example:
1. 5 words
2. 50% food & 50% cute animals
3. 1st word - food topic, gives you the word "bread".
4. 2nd word - cute animals topic, "adorable".
5. 3rd word - cute animals topic, "dog".
6. 4th word - food topic, "eating".
7. 5th word - food topic, "banana".
Document: "bread adorable dog eating banana" (one choice of topics and words)
140. LDA Learning (Gibbs)
Decide the number of topics
Randomly assign words to topics
Iteratively check and revise the topic/word assignments:
p(topic t | document d)
p(word w | topic t)
Reassign w a new topic maximizing p(topic t | document d) * p(word w | topic t)
Corpus (#topics = 2):
I eat fish and vegetables.
Dog and fish are pets.
My kitten eats fish.
Example iteration (red and purple denote the two topics in the slide's coloring):
p(red|1)=0.67; p(purple|1)=0.33; p(red|2)=0.67; p(purple|2)=0.33; p(red|3)=0.67; p(purple|3)=0.33
p(eat|red)=0.17; p(eat|purple)=0.33; p(fish|red)=0.33; p(fish|purple)=0.33
p(vegetable|red)=0.17; p(dog|purple)=0.33; p(pet|red)=0.17; p(kitten|red)=0.17
Reassigning "fish" in sentence 2: p(purple|2)*p(fish|purple)=0.5*0.33=0.165 > p(red|2)*p(fish|red)=0.5*0.2=0.1
After the update:
p(red|1)=0.67; p(purple|1)=0.33; p(red|2)=0.50; p(purple|2)=0.50; p(red|3)=0.67; p(purple|3)=0.33
p(eat|red)=0.20; p(eat|purple)=0.33; p(fish|red)=0.20; p(fish|purple)=0.33
p(vegetable|red)=0.20; p(dog|purple)=0.33; p(pet|red)=0.20; p(kitten|red)=0.20
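The reassignment rule at the heart of the sampler can be sketched directly with the slide's numbers for "fish" in sentence 2 (the probability tables here are illustrative, not a full Gibbs sampler):

```python
# Reassign a word to the topic maximizing p(topic|doc) * p(word|topic).
def reassign(word, doc, p_t_d, p_w_t):
    return max(p_t_d[doc], key=lambda t: p_t_d[doc][t] * p_w_t[t].get(word, 0))

p_t_d = {2: {'red': 0.5, 'purple': 0.5}}           # p(topic | document 2)
p_w_t = {'red': {'fish': 0.2}, 'purple': {'fish': 0.33}}  # p(word | topic)
print(reassign('fish', 2, p_t_d, p_w_t))  # purple (0.165 beats 0.1)
```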
141. Related Work – Topic Model (LDA)
I eat fish and vegetables.
Dog and fish are pets.
My kitten eats fish.
Sentence 1: 14.67% Topic 1, 85.33% Topic 2
Sentence 2: 85.44% Topic 1, 14.56% Topic 2
Sentence 3: 19.95% Topic 1, 80.05% Topic 2
LDA
Topic 1
0.268 fish
0.210 pet
0.210 dog
0.147 kitten
Topic 2
0.296 eat
0.265 fish
0.189 vegetable
0.121 kitten
142. Possible Probabilistic Topic Models
Bag-of-words approaches:
Mixture of unigram language models
Expectation-maximization algorithm
Probabilistic latent semantic analysis
Latent Dirichlet allocation (LDA)
Graph-based approaches:
TextRank (Mihalcea and Tarau, 2004)
Reinforcement approach (Xiaojun et al., 2007)
CollabRank (Xiaojun et al., 2008)
144. Connecting Word Nodes
Connect word nodes according to the slop (allowed gap) between words.
Sentences: I love the new ipod shuffle. It is the smallest ipod.
(Graph: with slop=0 only adjacent words are linked; slop=1 additionally links words one position apart: I, love, the, new, ipod, shuffle, it, is, smallest.)
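The edge-building rule can be sketched as a sliding window, where slop is the number of words allowed in between the two endpoints:

```python
# Connect tokens whose positions differ by at most slop+1.
def word_edges(tokens, slop=1):
    edges = set()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 2 + slop, len(tokens))):
            edges.add((tokens[i], tokens[j]))
    return edges

e = word_edges('I love the new ipod shuffle'.split(), slop=1)
print(('love', 'new') in e)   # True: one word ("the") in between
print(('I', 'new') in e)      # False: two words in between
```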
145. Connecting Phrase Nodes
Connect phrases:
Compound words
Neighbor words
(Graph: the compound node "ipod shuffle" joins the word nodes from the previous slide.)
146. Connecting Sentence Nodes
Connect to:
Neighbor sentences
Compound terms
Compound phrases
(Graph: the two sentence nodes link to the word and phrase nodes: ipod, new, shuffle, it, smallest, ipod shuffle.)
147. Weights for the Different Edge Types
(Same graph as the previous slides; each kind of edge carries its own weight.)
148. Ranking Graph Nodes
Score each node (TextRank, 2004). Reconstructed formula (the export garbled it):
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)
where d is the damping factor.
Example from the slide: node "the" receives scores 0.1, 0.2, 0.3 from its parent nodes new, love, is, over edges weighted 0.5, 0.6, 0.7:
y_the = (1 - 0.85) + 0.85 * (0.5·0.1 + 0.6·0.2 + 0.7·0.3) (before out-weight normalization)
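One weighted-TextRank update can be sketched with the slide's example numbers (here each parent is assumed, for simplicity, to have only this one outgoing edge, so the normalizer equals the edge weight):

```python
# One TextRank update: WS(Vi) = (1-d) + d * sum over parents Vj of
# w_ji / sum_out(Vj) * WS(Vj).
def textrank_score(parents, d=0.85):
    # parents: list of (parent_score, edge_weight, parent_out_weight_sum)
    return (1 - d) + d * sum(s * w / out for s, w, out in parents)

# Slide's node "the": parent scores 0.1, 0.2, 0.3 over edges 0.5, 0.6, 0.7.
y_the = textrank_score([(0.1, 0.5, 0.5), (0.2, 0.6, 0.6), (0.3, 0.7, 0.7)])
print(round(y_the, 2))  # 0.66
```

In the full algorithm this update is iterated over all nodes until the scores converge, yielding rankings like the ones on the next slide.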
149. Result
ipod 1.21
new 1.02
shuffle 0.99
smallest 0.77
love 0.57
156. Different Possibilities of Text Mining
(Same world-sensor-data diagram as before.)
Understanding the relation between the text and its observer:
opinion mining & emotion mining
157. Opinion
An opinion is a subjective statement describing a person's perspective about something.
Objective or factual statement: can be proved to be right or wrong
Opinion holder: personalized/customized; depends on background, culture, context
Target
169. Emotion Analysis: A Text-Mining Approach
Graph and pattern approach:
Carlos Argueta, Fernando Calderon, and Yi-Shin Chen, "Multilingual Emotion Classifier using Unsupervised Pattern Extraction from Microblog Data", Intelligent Data Analysis - An International Journal, 2016
170. Big Data of Subconscious Emotion
Twitter: currently the easiest source for bulk downloads.
Throwing my phone always calms me down #anger
My sister always makes things look much more worse than they seem >:[ #anger
Why my brother always crabby !?!? #rude #youranadult #anger #issues
WHY DOES MY COMPUTER ALWAYS FREEZE??? NEVER FAILS. #anger
Im wanna crazy,if my life always sucks like this. #anger
Hashtags and emoticons are the best markers of emotion, so they can serve as (weak) human labels.
177. Preprocessing After Collection
Key point: drop what is troublesome or unprocessable.
o Too short → too short to extract features from
o Contain too many hashtags → too much information to handle
o Are retweets → would add computational complexity
o Have URLs → would require fetching yet more data
o Convert user mentions to <usermention> and hashtags to <hashtag>
→ removes identifiers and keeps the labels from leaking
(It's big data anyway.)
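The filtering and normalization rules above can be sketched as follows (the thresholds are hypothetical, not the paper's exact values):

```python
# Pre-filter tweets and normalize mentions/hashtags into placeholders.
import re

def keep(tweet):
    if len(tweet.split()) < 4:    return False  # too short
    if tweet.count('#') > 3:      return False  # too many hashtags
    if tweet.startswith('RT '):   return False  # retweet
    if 'http' in tweet:           return False  # contains a URL
    return True

def normalize(tweet):
    tweet = re.sub(r'@\w+', '<usermention>', tweet)
    return re.sub(r'#\w+', '<hashtag>', tweet)

print(keep('RT too lazy to process'))   # False
print(normalize('@bob loves #anger'))   # <usermention> loves <hashtag>
```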
179. Graph Construction
Build two graphs (an emotion graph & a non-emotion graph). E.g.:
Emotion text: I love the World of Warcraft new game
Non-emotion text: 3,000 killed in the world by ebola
(Two weighted word graphs follow: the emotion graph over I, love, the, World, of, Warcraft, new, game with edge weights 0.9, 0.84, 0.65, 0.12, 0.12, 0.53, 0.67, 0.45; the non-emotion graph over 3,000, killed, in, the, world, by, ebola with edge weights 0.49, 0.87, 0.93, 0.83, 0.55, 0.25.)
182. Emotion Patterns
o Composed of two kinds of elements:
o Surface tokens: hello, , lol, house, …
o Wildcard: * (matches every word)
o An emotion pattern consists of at least two elements
o with at least one element of each kind
Examples:
Pattern | Matches
* this * | "Hate this weather", "love this drawing"
* * | "so happy ", "to me "
luv my * | "luv my gift", "luv my price"
* that | "want that", "love that", "hate that"
183. Emotion Pattern Construction
o Each pattern is built from several instances
o Each instance contains at least two elements: center words and community words
o at least one center word and one community word
Examples:
Center words: this, luv, my, …
Community words: love, hate, gift, weather, …
Instances: "hate this weather", "so happy ", "luv my gift", "love this drawing", "luv my price", "to me ", "kill this idiot", "finish this task"
184. Emotion Pattern Construction (2)
o Count the frequency of every instance
o Group the instances by their center word and its position
Instance counts: "hate this weather" 5, "so happy " 4, "luv my gift" 7, "love this drawing" 2, "luv my price" 1, "to me " 3, "kill this idiot" 1, "finish this task" 4
Groups and counts:
"Hate this weather", "love this drawing", "kill this idiot", "finish this task" → 12
"so happy ", "to me " → 7
"luv my gift", "luv my price" → 8
…
185. Emotion Pattern Construction (3)
o Replace all community words with the wildcard
o Keep the center words
o Filter out infrequent emotion patterns
Pattern | Group | Count
* this * | "Hate this weather", "love this drawing", "kill this idiot", "finish this task" | 12
* * | "so happy ", "to me " | 7
luv my * | "luv my gift", "luv my price" | 8
… | … | …
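The generalization step above can be sketched directly over the slide's word-based instances (the emoticon instances are omitted, and the center-word list is taken as given):

```python
# Group instances sharing a center word in the same position, replace the
# community words with '*', and accumulate the counts per pattern.
from collections import defaultdict

CENTER = {'this', 'luv', 'my'}  # center words from the slide

instances = {('hate', 'this', 'weather'): 5, ('love', 'this', 'drawing'): 2,
             ('kill', 'this', 'idiot'): 1, ('finish', 'this', 'task'): 4,
             ('luv', 'my', 'gift'): 7, ('luv', 'my', 'price'): 1}

patterns = defaultdict(int)
for toks, cnt in instances.items():
    pat = tuple(t if t in CENTER else '*' for t in toks)
    patterns[pat] += cnt

print(patterns[('*', 'this', '*')])   # 12, as in the slide's table
print(patterns[('luv', 'my', '*')])   # 8
```

A frequency threshold would then drop the rare patterns.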
196. Contextual Text Mining
Query log + user = personalized search
Tweet + time = event detection
Tweet + location-related patterns = location detection
Tweet + sentiment = opinion mining
Text mining + context = contextual text mining
197. Partition Text
(Diagram: the corpus can be partitioned by context, e.g. users above age 65 vs. users under age 12, data within year 2000 along a 1998-2006 timeline, or posts containing #sad.)
198. Generative Model of Text
(The example sentences "I eat fish and vegetables. / Dog and fish are pets. / My kitten eats fish." are generated from, and analyzed back into, a model over the words eat, fish, vegetables, dog, pets, are, kitten, my, and, I.)
Generation: p(word | Model)
Analysis estimates p(word | Topic) and p(Topic | Document), e.g.:
Topic 1: 0.268 fish, 0.210 pet, 0.210 dog, 0.147 kitten
Topic 2: 0.296 eat, 0.265 fish, 0.189 vegetable, 0.121 kitten
199. Contextualized Models of Text
(Same generation diagram, now conditioned on context features such as Year=2008, Location=Taiwan, Source=FB, emotion=happy, Gender=Man.)
Generation: p(word | Model, Context)
200. Naïve Contextual Topic Model
(Same corpus, now generated under contexts Year=2007 and Year=2008, each context with its own two topics: Topic 1 — 0.268 fish, 0.210 pet, 0.210 dog, 0.147 kitten; Topic 2 — 0.296 eat, 0.265 fish, 0.189 vegetable, 0.121 kitten.)
Reconstructed formula (the export garbled it):
p(w) = \sum_{j=1}^{C} p(c_j) \sum_{i=1}^{K} p(z = i \mid Context_j)\, p(w \mid Topic_i, Context_j)
How do we estimate it? → Different approaches for different contextual data and problems
201. Contextual Probabilistic Latent Semantic
Analysis (CPLSA) (Mei, Zhai, KDD2006)
An extension of PLSA model ([Hofmann 99]) by
Introducing context variables
Modeling views of topics
Modeling coverage variations of topics
Process of contextual text mining
Instantiation of CPLSA (context, views, coverage)
Fit the model to text data (EM algorithm)
Compare a topic from different views
Compute strength dynamics of topics from coverages
Compute other probabilistic topic patterns
202. The Probabilistic Model
• A probabilistic model explaining the generation of a document D and its context features C: if an author wants to write such a document, he will
– Choose a view v_i according to the view distribution p(v_i | D, C)
– Choose a coverage κ_j according to the coverage distribution p(κ_j | D, C)
– Choose a theme θ_il according to the coverage κ_j
– Generate a word using θ_il
– The likelihood of the document collection is (reconstructed; the export garbled the formula):
\log p(D \mid C) = \sum_{(D,C)} \sum_{w \in V} c(w, D) \log \sum_{i=1}^{n} p(v_i \mid D, C) \sum_{j=1}^{m} p(\kappa_j \mid D, C) \sum_{l=1}^{k} p(l \mid \kappa_j)\, p(w \mid \theta_{il})
203. Contextual Text Mining Example
Detecting hateful code words
https://arxiv.org/pdf/1711.10093.pdf
205. Code Words for Hate Speech
Code word: "a word or phrase that has a secret meaning or that is used instead of another word or phrase to avoid speaking directly" (Merriam-Webster)
Example:
Anyone who isn't white doesn't deserve to live here. Those foreign ________ should be deported.
"niggers": a known hate term, easy to detect
"animals": a non-provocative word whose intent can still be inferred
"skypes": isn't Skype a messaging app? The surrounding context should feel off.
209. Context
Relatedness, with word2vec
Similarity, with dependency2vec
Example: "the man/boy jumps/plays/talks"
Relatedness: word collocation
Similarity: behavior
210. Dependency-based Word Embedding
Use dependency-based contexts instead of linear BoW [Levy and Goldberg, 2014]
214. Validation: Results on Control Data
Participants could usually tell the data apart.
(Two rating histograms of majority percentage: for the control word "niggers", positive for hate speech, ratings peak toward "Very Likely"; for the control word "water", negative for hate speech, ratings peak toward "Very Unlikely".)
215. Validation: Classification Results
Dataset | Class | Precision | Recall | F1
HateCommunity | hate | 0.88 | 1.00 | 0.93
HateCommunity | non-hate | 1.00 | 0.67 | 0.80
CleanTexts | hate | 1.00 | 0.75 | 0.86
CleanTexts | non-hate | 0.86 | 1.00 | 0.92
HateTexts | hate | 0.75 | 0.75 | 0.75
HateTexts | non-hate | 0.83 | 0.83 | 0.83