1. From Natural Language Processing to Text Mining
Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
yishin@gmail.com
2. About Speaker
Currently
Associate Professor, Department of Computer Science, National Tsing Hua University
Director of the Intelligent Data Engineering and Applications Lab (IDEA Lab)
Education
Ph.D. in Computer Science, USC, USA
M.B.A. in Information Management, NCU, TW
B.B.A. in Information Management, NCU, TW
陳宜欣 Yi-Shin Chen
16. Language
Definition:
"The method of human communication" (Oxford dictionary)
"Speech produced by humans, composed of phonetics, vocabulary, and grammar; an essential tool for expressing feelings and conveying thought" (Ministry of Education Mandarin Dictionary)
Characteristics of communication:
Carried out through writing, speech, or body language
Involves word choice
Structured, usually by convention
A common problem of communication — Claude Shannon (1916–2001):
"Reproducing at one point either exactly or approximately a message selected at another point"
@ Yi-Shin Chen, NLP to Text Mining 16
19. The Father of Computational Linguistics
Noam Chomsky (1928–present)
"a language [is] a set (finite or infinite) of sentences, each finite in length and constructed out of a finite set of elements"
"the structure of language is biologically determined"
"humans are born with an innate linguistic ability that constitutes a Universal Grammar"
Key work: Syntactic Structures
20. Basic Concepts of Natural Language Processing
Example sentence: This is the best thing happened in my life.
Lexical analysis (part-of-speech tagging): label each word with its part of speech (Det., Verb, Adj., NN, ...)
Syntactic analysis (parsing): group the tagged words into noun phrases, prepositional phrases, and a sentence tree
21. Parsing
Parsing determines whether an input string can be generated by the grammar.
(Compiler front-end diagram: Input → Lexical Analyzer → token / getNextToken → Parser → parse tree → rest of front end → intermediate representation, with both stages consulting the symbol table.)
The front end's output should be equivalent to its input.
22. Parsing Example (a four-function arithmetic compiler)
Grammar:
E ::= E op E | -E | (E) | id
op ::= + | - | * | /
Input: a * - ( b + c )
(The slide shows the corresponding parse tree: E expands to id op E, where E is -(E) and the parenthesized E expands to id op id.)
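A recognizer for this grammar can be sketched in a few lines. The grammar as written is ambiguous and left-recursive, so this hypothetical sketch assumes the standard refactoring E ::= T (op T)*, T ::= -T | (E) | id:

```python
# Minimal recursive-descent recognizer for the slide's expression grammar,
# using the refactored (non-left-recursive) form E ::= T (op T)*.
import re

def tokenize(s):
    return re.findall(r'[A-Za-z_]\w*|[-+*/()]', s)

def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def term():                      # T ::= -T | (E) | id
        nonlocal pos
        t = peek()
        if t == '-':
            pos += 1
            return term()
        if t == '(':
            pos += 1
            if not expr() or peek() != ')':
                return False
            pos += 1
            return True
        if t is not None and re.fullmatch(r'[A-Za-z_]\w*', t):
            pos += 1
            return True
        return False
    def expr():                      # E ::= T (op T)*
        nonlocal pos
        if not term():
            return False
        while peek() in ('+', '-', '*', '/'):
            pos += 1
            if not term():
                return False
        return True
    ok = expr()
    return ok and pos == len(tokens)

print(parse(tokenize('a * - ( b + c )')))  # True
print(parse(tokenize('a + * b')))          # False
```

A full parser would build the parse tree shown on the slide instead of just returning a boolean.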
23. Basic Concepts of Natural Language Processing
Example sentence: This is the best thing happened in my life.
Lexical analysis (part-of-speech tagging), then syntactic analysis (parsing), as on the earlier slide.
Semantic analysis: map phrases to logical terms, e.g. This (t1), Best thing (t2), My (m1), Happened (t1, t2, m1)
Inference: Happy(x) if Happened(t1, 'Best', m1) → Happy
24. NLP to Natural Language Understanding (NLU)
https://nlp.stanford.edu/~wcmac/papers/20140716-UNLU.pdf
Tasks along the NLP-to-NLU spectrum: Named Entity Recognition (NER), Part-Of-Speech Tagging (POS), Text Categorization, Co-Reference Resolution, Machine Translation, Syntactic Parsing, Question Answering (QA), Relation Extraction, Semantic Parsing, Paraphrase & Natural Language Inference, Sentiment Analysis, Dialogue Agents, Summarization, Automatic Speech Recognition (ASR), Text-To-Speech (TTS)
27. Natural Language Processing Techniques
Word segmentation*
Part-of-speech (POS) tagging
Stemming*
Syntactic parsing
Named entity recognition
Co-reference resolution
Text categorization
28. Word Segmentation
In some languages there is no explicit boundary between words:
這地面積還真不小
人体内存在很多微生物
うふふふふ 楽しむ ありがとうございます
Chinese requires segmentation tools:
Jieba: https://github.com/fxsjy/jieba
CKIP (Sinica): http://ckipsvr.iis.sinica.edu.tw/
or other statistical methods
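A toy illustration of why segmentation is hard: greedy forward maximum matching against a hypothetical mini-dictionary. Real tools such as Jieba or CKIP use statistical models instead of this sketch:

```python
# Forward maximum matching: at each position, take the longest dictionary
# word. The dictionary below is a made-up toy; note that on the slide's
# example it greedily picks 地面 ("ground"), although the intended reading
# is 地/面積 ("the area of this plot") -- exactly the ambiguity shown above.
def segment(text, dictionary, max_len=4):
    words, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            if L == 1 or text[i:i+L] in dictionary:
                words.append(text[i:i+L])
                i += L
                break
    return words

vocab = {'面積', '地面', '不小'}  # hypothetical dictionary entries
print(segment('這地面積還真不小', vocab))
```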
29. POS Tagging
Analyze the text and label each word with its part of speech.
Many tagging schemes (and tag sets) exist: noun (N), adjective (A), verb (V), URL (U), ...
Happy Easter! I went to work and came home to an empty house now im
going for a quick run
Happy_A Easter_N !_, I_O went_V to_P work_N and_& came_V home_N to_P
an_D empty_A house_N now_R im_L going_V for_P a_D quick_A run_N
(JJ)Happy (NNP) Easter! (PRP) I (VBD) went (TO) to (VB) work (CC) and
(VBD) came (NN) home (TO) to (DT) an (JJ) empty (NN) house (RB) now
(VBP) im (VBG) going (IN) for (DT) a (JJ) quick (NN) run
30. Stemmer
Reduces inflected words to their stem or root form.
E.g., the Porter stemmer: http://textanalysisonline.com/nltk-porter-stemmer
Word-frequency statistics become more accurate after stemming.
Example (Porter stemmer):
Now, AI is poised to start an equally large transformation on many industries
→ Now , AI is pois to start an equal larg transform on mani industry
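The idea can be sketched with a toy suffix-stripper; the real Porter stemmer applies about 60 context-sensitive rules (NLTK ships an implementation), so this illustration with a hand-picked suffix list is only a rough approximation:

```python
# Toy stemmer: strip the first matching suffix, keeping at least a
# 3-character stem. The suffix list is hypothetical, not Porter's rules.
SUFFIXES = ['ation', 'ies', 'ing', 'ed', 'ly', 's']

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]
    return word

print([stem(w) for w in ['industries', 'poised', 'equally', 'transformation']])
# ['industr', 'pois', 'equal', 'transform']
```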
31. Natural Language Processing Techniques
Word segmentation*
Part-of-speech (POS) tagging
Stemming*
Syntactic parsing
Named entity extraction
Co-reference resolution
Text categorization
Representations
40. Word Context
Intuition: context conveys meaning.
Hypothesis: training a simpler model on much more data yields better word representations.
[Work done by Mikolov et al. in 2013]
A medical doctor is a person who uses medicine to treat illness and injuries
Some medical doctors only work on certain diseases or injuries
Medical doctors examine, diagnose and treat patients
The shared contexts characterize doctor/doctors.
41. Example
"king" – "man" + "woman" = "queen"
42. The Word2Vec Idea
Two models:
Continuous bag-of-words model
Continuous skip-gram model
A neural network learns the weights of the word vectors.
(草船借箭 — "borrowing arrows with straw boats")
43. Continuous Bag-of-Words Model
Use the context to predict the target word.
E.g., with window size 2, the sentence below yields ([features], label) pairs:
I am eating good pizza now  (target word: eating; context on both sides)
([I, am, good, pizza], eating)
([am, eating, pizza, now], good)
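Generating these training pairs is a simple sliding-window pass; a minimal sketch over the slide's sentence:

```python
# Build ([context], target) CBOW training pairs with a symmetric window.
def cbow_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((ctx, target))
    return pairs

sent = 'I am eating good pizza now'.split()
pairs = cbow_pairs(sent)
print(pairs[2])  # (['I', 'am', 'good', 'pizza'], 'eating')
print(pairs[3])  # (['am', 'eating', 'pizza', 'now'], 'good')
```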
44. Continuous Bag-of-Words Model (Contd.)
(Network diagram: four 1x6 one-hot input vectors for the context words feed a shared 6x3 weight matrix W into a 3-unit hidden layer; a 3x6 matrix W' maps the hidden layer to a softmax over the vocabulary, compared against the one-hot actual label.)
I am eating good pizza now
One-hot codes: 000001 000010 000100 001000 010000
45. One-Hot Encoding
A way to encode categorical variables.
Suppose there are four categories 1, 2, 3, 4:
Integer/label encoding: 1, 2, 3, 4
— easily misleads the algorithm into assuming 4 > 2
One-hot encoding: 0001, 0010, 0100, 1000
— each dimension is 1 only when that category is present
— the algorithm knows the dimensions are not ordered
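The encoding itself is one line; a minimal sketch for the four categories above:

```python
# One-hot encode a label against a fixed category list: exactly one
# dimension is 1, so no spurious ordering is implied.
def one_hot(label, categories):
    return [1 if c == label else 0 for c in categories]

cats = [1, 2, 3, 4]
print(one_hot(3, cats))  # [0, 0, 1, 0]
```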
46. Continuous Bag-of-Words Model (Contd.)
(Same network diagram as the previous slide; the prediction error against the actual label is backpropagated to correct the weights.)
47. Continuous Bag-of-Words Model (Contd.)
(Same network diagram; the key part is the input-to-hidden weight matrix W.)
48. Continuous Bag-of-Words Model (Contd.)
The hidden-layer output is the input word's vector.
Input layer (1x6) times W (6x3):
[0 0 0 0 0 1] × W = [.2 .13 .23], i.e. the one-hot input simply selects one row of W.
(W's rows in the example: [.2 .7 .2], [.2 .3 .8], [.19 .28 .22], [.22 .23 .21], [.1 .5 .3], [.2 .13 .23])
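The point of the slide, that multiplying a one-hot vector by W just selects one row, so W's rows are the word vectors, can be checked directly (matrix values are the slide's example):

```python
# One-hot row vector times a matrix, done with plain lists: the result
# equals the row of W picked out by the 1 in the one-hot vector.
def onehot_times_W(onehot, W):
    return [sum(x * w for x, w in zip(onehot, col)) for col in zip(*W)]

W = [[.2, .7, .2],     # example 6x3 weight matrix from the slide
     [.2, .3, .8],
     [.19, .28, .22],
     [.22, .23, .21],
     [.1, .5, .3],
     [.2, .13, .23]]
print(onehot_times_W([0, 0, 0, 0, 0, 1], W))  # [0.2, 0.13, 0.23]
print(W[5])                                    # the same row
```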
49. Continuous Bag-of-Words Model (Contd.)
Drawback: cannot handle rare words.
E.g., "This is a good movie theater" vs. the rare target word "Marvelous" in the same context.
Remedy: use the skip-gram algorithm.
50. Continuous Skip-gram Model
The reverse of CBOW:
input the target word, predict its context
again with a given context window size
51. Continuous Skip-gram Model (Contd.)
Advantages:
Handles rare words (their contexts are usually common)
Captures semantic similarity:
"intelligent" and "smart" are likely to have very similar contexts
58. "陳**" "蔡**" "張**"
Named Entity Recognition Approaches
▷Two basic approaches (plus hybrids)
Rule-based (regular expressions)
→Lists of names
→Patterns that look like names
→Patterns in the context that typically surrounds names
If "陳宜欣" AND "清華大學" then "university faculty"
If "蔡英文" AND "民進黨" then "politician"
If "陳昇瑋" AND "資料科學" then "Academia Sinica researcher"
(Example news passage: E.SUN Financial Holding recruited 陳昇瑋, big-data and AI expert and executive director of the Taiwan AI Academy, as its Chief Technology Officer, making it the first financial institution in Taiwan with a CTO; his key task is to integrate E.SUN's "technology league" of over a thousand people. After finishing his Ph.D. at NTU Electrical Engineering in 2006, he joined Academia Sinica's data science institute as a research fellow working on big data and AI. Beyond research he actively promoted industry-academia collaboration, founding the Taiwan Data Science Conference in 2014 and the Taiwan AI Conference in 2017, where AlphaGo's lead designer 黃士傑 gave his first public talk in Taiwan; he also serves as executive director of the Taiwan AI Academy, opened in late January to train AI talent for research and industry.)
59. Rule-Based Approaches
▷Extract data with regular expressions
▷Examples (backslashes restored from the original export):
Telephone number: (\d{3}[-. ()]){1,2}[\dA-Z]{4}
→800-865-1125
→800.865.1125
→(800)865-CARE
Software name extraction: ([A-Z][a-z]*\s*)+
→Installation Designer v1.1
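A hedged sketch of such extraction rules in Python (the slide's patterns lost their backslashes in the export; the phone pattern below is a slightly adjusted variant that matches all three examples):

```python
# Rule-based extraction with regular expressions: a phone-number pattern
# (two groups of 3 digits plus a 4-character tail, allowing -, ., space,
# or parentheses) and a capitalized-words pattern for software names.
import re

phone = re.compile(r'(\(?\d{3}\)?[-. ]?){2}[\dA-Z]{4}')
for s in ['800-865-1125', '800.865.1125', '(800)865-CARE']:
    print(bool(phone.fullmatch(s)))  # True for all three

software = re.compile(r'([A-Z][a-z]*\s*)+')
print(bool(software.match('Installation Designer')))  # True
```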
60. Named Entity Recognition Approaches
▷Two basic approaches (plus hybrids)
Rule-based (regular expressions)
→Lists of names
→Patterns that look like names
→Patterns in the context that typically surrounds names
Machine learning
→Use annotated training data
→Extract features
→Train a system to reproduce the annotations
63. Hearst's Patterns for IS-A Relations
"Y such as X ((, X)* (, and|or) X)"
"such Y as X"
"X or other Y"
"X and other Y"
"Y including X"
"Y, especially X"
(Hearst, 1992): Automatic Acquisition of Hyponyms
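One of these patterns can be turned into an extractor in a few lines; a simplified sketch for "Y such as X" (single-word entities only, unlike Hearst's full patterns):

```python
# Extract (hyponym, hypernym) IS-A pairs with the "Y such as X" pattern.
import re

def hearst_such_as(text):
    pairs = []
    for m in re.finditer(r'(\w+) such as (\w+)', text):
        pairs.append((m.group(2), m.group(1)))  # (X, Y) = (hyponym, hypernym)
    return pairs

print(hearst_such_as('He likes fruits such as apples and sports such as tennis.'))
# [('apples', 'fruits'), ('tennis', 'sports')]
```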
64. Extracting More Relations with Rules
Basic idea: specific entity types tend to participate in specific relations:
located-in (ORGANIZATION, LOCATION)
founded (PERSON, ORGANIZATION)
cures (DRUG, DISEASE)
The named-entity tags from the previous step help identify the relations.
Content slides by Prof. Dan Jurafsky
66. Relation Bootstrapping (Hearst 1992)
Gather seed pairs that have relation R
Iterate:
1. Find sentences matching the current pairs
2. Look at the context around each pair and generalize it into more patterns
3. Use the new patterns to collect more pairs
67. Bootstrapping
<Mark Twain, Elmira> Seed tuple
Grep (google) for the environments of the seed tuple
“Mark Twain is buried in Elmira, NY.”
X is buried in Y
“The grave of Mark Twain is in Elmira”
The grave of X is in Y
“Elmira is Mark Twain’s final resting place”
Y is X’s final resting place
Use those patterns to grep for new tuples
Iterate
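The pattern-induction step above can be sketched directly: replace the seed entities in each matching sentence with placeholders (real systems generalize further and score the patterns):

```python
# One bootstrapping step: turn sentences containing the seed tuple into
# extraction patterns by substituting the entities with X and Y.
def make_pattern(sentence, x, y):
    return sentence.replace(x, 'X').replace(y, 'Y')

seed = ('Mark Twain', 'Elmira')
sents = ['Mark Twain is buried in Elmira, NY.',
         'The grave of Mark Twain is in Elmira']
for s in sents:
    print(make_pattern(s, *seed))
# X is buried in Y, NY.
# The grave of X is in Y
```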
68. A Possible Seed Set: Wikipedia Infoboxes
Infoboxes are kept separately from other Wikipedia article content (in a namespace)
Namespace examples: Special:SpecialPages; Wikipedia:List of infoboxes
Example:
{{Infobox person
|name = Casanova
|image = Casanova_self_portrait.jpg
|caption = A self portrait of Casanova
...
|website = }}
69. Concept-based Model
ESA (Egozi, Markovitch, 2011)
Every Wikipedia article represents a concept
TF-IDF over concepts is used to infer a document's concepts
A manually curated dataset
70. YAGO
YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW 2007
Integrates Wikipedia and WordNet
Exploits various kinds of structured information:
Infoboxes, category pages, etc.
74. More Tools
NLTK (Python): tokenization, tagging, NE extraction, parse-tree display
Porter stemmer
n-grams
spaCy: industrial-strength NLP in Python
77. Data (Text vs. Non-text)
World → sensor → data → interpretation → report:
Weather → thermometer, hygrometer → 24°C, 55%
Location → GPS → 37°N, 123°E
Body → sphygmomanometer, MRI, etc. → 126/70 mmHg
World → humans (as subjective sensors) → text: "To be or not to be.."
Non-text data is objective; text data is subjective.
78. Data Mining vs. Text Mining
Non-text data (numeric, categorical, relational) → data mining:
• clustering
• classification
• association rules
• …
Text data → text processing (including natural language processing) first, then the same mining step.
80. @ Yi-Shin Chen, NLP to Text Mining 80
General Data:
Occupation: none
Ethnicity: Hakka
Marital status: married
Travel history: No recent travel history in three months
Contact history: none
Cluster: none
Occupational disease history: Nil
Source of information: Patient herself and her daughter
Chief Complaint:
Sudden onest short of breath with cold sweating noted five days ago. ( since 06/09)
Present Illness:
This 60 year-old female had hypertension for 10 years and diabetes mellitus for 5 years that had regular medical control. As her mentioned, she got similar episode attacked three months ago with initail presentations of short of breat, DOE, orthopnea. She went to LMD for help with CAD, 3-V-D, s/p PTCA stenting x 3 for one vessels in 2012/03. She got regular CV OPD f/u at there and the medications was tooked. Since after, she had increatment of the oral water intake amounts. The urine output seems to be adequate and no body weight change or legs edema noted in recent three months. This time, acute onset severe dyspnea with orthopnea, DOE, heavy sweating and oral thirsty noted on 06/09. He had no fever, chills, nausea, vomiting, palpitation, cough, chest tightness, chest pain, palpitation, abdominal discomfort noticed. For the symptoms intolerable, he came to our ED for help with chest x film revealed cardiomegaly and elevations of cardiac markers noted. The cardiologist was consulted and the heparinization was applied. The CPK level had no elevation at regular f/u examinations, and her symptoms got much improved after. The cardiosonogram reported impaired LV systolic function and moderate MR. She was admitted for further suvery and managements to the acute ischemic heart disease. (Clinical text kept verbatim, including its original errors, as an example of noisy real-world data.)
81.
(Same medical record as the previous slide.)
Task: correctly align and classify each attribute.
82. Detecting the Language
Detect which language (or language family) the input text belongs to.
Difficulties:
Very short texts
Multiple languages mixed in one sentence
Noise
Example (the mixed-language record fields below):
職業: 無
種族: 客家
婚姻: married
旅遊史:No recent travel history in three months
接觸史:無
群聚:無
職業病史:Nil
資料來源:Patient herself and her daughter
83. Language-Detection Errors
Twitter examples (detected language before removing noise -> after):
@sayidatynet top song #LailaGhofran shokran ya garh new album #listen  (en -> id)
中華隊的服裝挺特別的,好藍。。。#ChineseTaipei #Sochi #2014冬奧  (it -> zh-tw)
授業前の雪合戦w http://t.co/d9b5peaq7J  (en -> ja)
85. Data Cleaning
Special characters: clean the data with regular expressions
Unicode emoticons ☺, ♥…
Symbol icons ☏, ✉…
Currency symbols €, £, $...
Tweet URLs
Filter out anything that is not a letter, space, punctuation, or digit
Examples: ◕‿◕ Friendship is everything ♥ ✉ xxxx@gmail.com / I added a video to a @YouTube playlist http://t.co/ceYX62StGO Jamie Riepe
Patterns (backslashes restored):
(^|\s*)http(\S+)?(\s*|$)
(\p{L}+)|(\p{Z}+)|(\p{Punct}+)|(\p{Digit}+)
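Python's standard re module has no \p{...} Unicode classes (the third-party regex module does), so a sketch of the same cleaning using unicodedata categories instead:

```python
# Drop URLs, then keep only characters whose Unicode general category is
# Letter (L), Number (N), Punctuation (P), or Separator/space (Z).
import re
import unicodedata

def clean(text):
    text = re.sub(r'(^|\s+)http(\S+)?(\s+|$)', ' ', text)  # remove URLs
    keep = set('LNPZ')
    return ''.join(c for c in text if unicodedata.category(c)[0] in keep)

print(clean('◕‿◕ Friendship is everything ♥ ✉'))
print(clean('see http://t.co/x now'))
```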
86. POS Tagging
Analyze the text and label each word with its part of speech.
Many tagging schemes (and tag sets) exist: noun (N), adjective (A), verb (V), URL (U)…
This 60 year-old female had hypertension for 10 years and diabetes mellitus
for 5 years that had regular medical control.
This(D) 60(Num) year-old(Adj) female(N) had(V) hypertension (N) for(Pre)
10(Num) years(N) and(Con) diabetes(N) mellitus(N) for(pre) 5(Num) years(N)
that(Det) had(V) regular(Adj) medical(Adj) control(N).
87. Stemming
Drawback: e.g., Diabetes -> diabete
This 60 year-old female had hypertension for 10 years and diabetes mellitus for 5 years that had regular medical control.
(The slide annotates the stemmed forms: had -> have, years -> year.)
95. Supervised vs. Unsupervised Learning
▷ Supervised learning
Supervision: the training data carry labels indicating each observation's class
New data are classified based on what was learned from the training set
▷ Unsupervised learning
No class information
Use measurements or observations to find clusters
96. Supervised vs. Unsupervised Learning
Supervised learning
Supervision: The training data (observations, measurements, etc.)
are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
97. Classification
Given a training set:
each record has several attributes
one of the attributes is the class label
Learn a suitable model for the class label based on the other attributes
— the model can be thought of as a function of the other attribute values
Goal: assign a sufficiently accurate class to records the model has not seen before
A test set is needed to estimate the trained model's accuracy:
the collected data are usually split into a training set and a test set
train the model on the training set
validate its accuracy on the test set
104. Different Possibilities of Text Mining
(Diagram: the objective world is measured by sensors into data such as "24°C, 55%", then interpreted into reports; the subjective world is perceived by humans, who express it as text such as "To be or not to be..".)
Mining knowledge about language itself:
natural language processing & text representation & mining word-to-word associations
105. Different Possibilities of Text Mining
(Same world-sensor-data diagram as above.)
Understanding the relation between the text and its observer:
opinion mining & emotion mining
106. Different Possibilities of Text Mining
(Same world-sensor-data diagram as above.)
Learning about the world:
topic mining, contextual text mining
107. From Natural Language Processing to Text Mining
This is the best thing happened in my life.
String of characters → string of words → POS tags → entities and relations (This?, Best thing, Happened, My life; entity, period) → emotion (happy) → understanding (logic predicates): the author is joyfully welcoming his newborn son
The more NLP is applied, the less precise the result, but the closer we get to the knowledge we want.
108. Natural Language Processing vs. Text Mining
Text mining goals:
Overview
Understanding trends
Noise is acceptable
Natural language processing goals:
Understanding
Ability to answer
Very precise
109. An Illustration for Natural Language Processing
The story begins in 1983. That year, the struggling news magazine U.S. News & World Report decided to launch an ambitious project: it would evaluate 1,800 American colleges and universities and rank them from best to worst.
If the project succeeded, the resulting ranking would become a useful tool helping millions of young people make the first major decision of their lives. For many of them, the choice of college shapes their future career path and determines the lifelong friends (quite possibly including a spouse) they will make.
The magazine also hoped the rankings issue would be a newsstand sensation, letting U.S. News catch up, at least for one week, with its main rivals such as Time and Newsweek.
What the U.S. News staff set out to measure was "educational excellence", something far fuzzier than the cost of corn or the micrograms of protein in a kernel. They had no direct way to quantify the effect of four years of college on one student, let alone on tens of millions of them. They could not measure every aspect of four years of student life: how much was learned, how happy the students were, the effect on their confidence, the friendships gained. Their model did not reflect President Johnson's ideal of higher education — "a way to deepen personal fulfillment, raise personal productivity, and increase personal reward."
They therefore relied on proxies that seemed correlated with educational success: students' SAT scores, student-teacher ratios, and acceptance rates. They analyzed the percentage of freshmen returning for sophomore year, and graduation rates. They computed the percentage of living alumni donating to their alma mater, assuming that willingness to donate signals satisfaction with the quality of the education received.
116. Paradigmatic Word Associations
John’s cat eats fish in Saturday
Mary’s dog eats meat in Sunday
John’s cat drinks milk in Sunday
Mary’s dog drinks beer in Tuesday
John's --- eats fish in Saturday
Mary's --- eats meat in Sunday
John's --- drinks milk in Sunday
Mary's --- drinks beer in Tuesday
Compare the similarity of the left-hand contexts
Compare the similarity of the right-hand contexts
Compare the similarity of the rest of the context
How similar is the context of "cat" to the context of "dog"?
How similar is the context of "cat" to the context of "John"?
Expected Overlap of Words in Context (EOWC)
Overlap ("cat", "dog")
Overlap ("cat", "John")
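The context-overlap idea can be sketched with bags of surrounding words over the four example sentences (a simplification of EOWC; real formulations use probability vectors and IDF weighting):

```python
# Represent each word by the multiset of words co-occurring with it in the
# same sentence, then measure the overlap of two such bags.
from collections import Counter

sents = ["John's cat eats fish in Saturday",
         "Mary's dog eats meat in Sunday",
         "John's cat drinks milk in Sunday",
         "Mary's dog drinks beer in Tuesday"]

def context(word):
    bag = Counter()
    for s in sents:
        toks = s.split()
        if word in toks:
            bag.update(t for t in toks if t != word)
    return bag

def overlap(w1, w2):
    c1, c2 = context(w1), context(w2)
    return sum((c1 & c2).values()) / max(1, sum(c1.values()))

print(overlap('cat', 'dog'))      # 0.5: paradigmatic pair, shared contexts
print(overlap('cat', "John's"))   # 0.8: raw overlap also rewards words that
                                  # co-occur together -- one reason EOWC
                                  # alone is too crude and needs weighting
```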
129. Different Possibilities of Text Mining
(Same world-sensor-data diagram as before.)
Learning about the world:
topic mining, contextual text mining
131. Tasks of Topic Mining
(Diagram: each document Doc1, Doc2, ... in the text data is mapped onto topics Topic 1 ... Topic n with different coverages.)
132. Topic Mining: Formal Definition
Input:
A collection of N text documents S = {d1, d2, d3, … dN}
Number of topics: k
Output:
k topics: θ1, θ2, θ3, … θk
Coverage of topics in each di: μi1, μi2, μi3, … μik
How should a topic θi be defined?
Topic = term (word)?
Topic = classes?
133. Tasks of Topic Mining (Terms as Topics)
(Same diagram, with the topics now being terms: Politics, Weather, Sports, Travel, Technology.)
134. Difficulties with "Terms as Topics"
Not general enough:
can only express simple/common topics
cannot express more complex topics
E.g., "uber issue": is it a political issue or a transportation issue?
Not complete:
cannot cover the different senses of a word
word-sense ambiguity
E.g., Hollywood star vs. stars in the sky; apple watch vs. apple recipes
138. Latent Dirichlet Allocation
Latent Dirichlet Allocation (D. M. Blei, A. Y. Ng, 2003)
(Not to be confused with Linear Discriminant Analysis.)
(Plate diagram: hyperparameters α and β; for each of the M documents, topic proportions θ_d; for each of its N words, a topic assignment z_{dn} and an observed word w_{dn}.)
The generative probabilities (reconstructed from Blei et al. 2003; the slide's equations (2) and (3) were garbled in the export):
p(\theta, z, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta
p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)
139. LDA Assumptions
Assumption: when writing an article, the author will
1. Decide how many words to write
2. Decide the word-occurrence probabilities (P = Dir(α), P = Dir(β))
3. Choose the document's topics (Dirichlet)
4. Pick each word according to its topic (Dirichlet)
5. Repeat from step 3
Example:
1. 5 words
2. 50% food & 50% cute animals
3. 1st word - food topic, gives you the word "bread".
4. 2nd word - cute animals topic, "adorable".
5. 3rd word - cute animals topic, "dog".
6. 4th word - food topic, "eating".
7. 5th word - food topic, "banana".
Document: "bread adorable dog eating banana" (one choice of topics and words)
140. LDA Learning (Gibbs)
Decide the number of topics
Randomly assign words to topics
Iteratively check and revise the topic/word assignments:
p(topic t | document d)
p(word w | topic t)
Reassign w a new topic maximizing p(topic t | document d) * p(word w | topic t)
Corpus (#topics = 2):
I eat fish and vegetables.
Dog and fish are pets.
My kitten eats fish.
Example iteration (red and purple denote the two topics in the slide's coloring):
p(red|1)=0.67; p(purple|1)=0.33; p(red|2)=0.67; p(purple|2)=0.33; p(red|3)=0.67; p(purple|3)=0.33
p(eat|red)=0.17; p(eat|purple)=0.33; p(fish|red)=0.33; p(fish|purple)=0.33
p(vegetable|red)=0.17; p(dog|purple)=0.33; p(pet|red)=0.17; p(kitten|red)=0.17
Reassigning "fish" in sentence 2: p(purple|2)*p(fish|purple)=0.5*0.33=0.165 > p(red|2)*p(fish|red)=0.5*0.2=0.1
After the update:
p(red|1)=0.67; p(purple|1)=0.33; p(red|2)=0.50; p(purple|2)=0.50; p(red|3)=0.67; p(purple|3)=0.33
p(eat|red)=0.20; p(eat|purple)=0.33; p(fish|red)=0.20; p(fish|purple)=0.33
p(vegetable|red)=0.20; p(dog|purple)=0.33; p(pet|red)=0.20; p(kitten|red)=0.20
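The reassignment rule at the heart of the sampler can be sketched directly with the slide's numbers for "fish" in sentence 2 (the probability tables here are illustrative, not a full Gibbs sampler):

```python
# Reassign a word to the topic maximizing p(topic|doc) * p(word|topic).
def reassign(word, doc, p_t_d, p_w_t):
    return max(p_t_d[doc], key=lambda t: p_t_d[doc][t] * p_w_t[t].get(word, 0))

p_t_d = {2: {'red': 0.5, 'purple': 0.5}}           # p(topic | document 2)
p_w_t = {'red': {'fish': 0.2}, 'purple': {'fish': 0.33}}  # p(word | topic)
print(reassign('fish', 2, p_t_d, p_w_t))  # purple (0.165 beats 0.1)
```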
141. Related Work – Topic Model (LDA)
I eat fish and vegetables.
Dog and fish are pets.
My kitten eats fish.
Sentence 1: 14.67% Topic 1, 85.33% Topic 2
Sentence 2: 85.44% Topic 1, 14.56% Topic 2
Sentence 3: 19.95% Topic 1, 80.05% Topic 2
LDA
Topic 1
0.268 fish
0.210 pet
0.210 dog
0.147 kitten
Topic 2
0.296 eat
0.265 fish
0.189 vegetable
0.121 kitten
142. Possible Probabilistic Topic Models
Bag-of-words approaches:
Mixture of unigram language models
Expectation-maximization algorithm
Probabilistic latent semantic analysis
Latent Dirichlet allocation (LDA)
Graph-based approaches:
TextRank (Mihalcea and Tarau, 2004)
Reinforcement approach (Xiaojun et al., 2007)
CollabRank (Xiaojun et al., 2008)
144. Connecting Word Nodes
Connect word nodes according to the slop (allowed gap) between words.
Sentences: I love the new ipod shuffle. It is the smallest ipod.
(Graph: with slop=0 only adjacent words are linked; slop=1 additionally links words one position apart: I, love, the, new, ipod, shuffle, it, is, smallest.)
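The edge-building rule can be sketched as a sliding window, where slop is the number of words allowed in between the two endpoints:

```python
# Connect tokens whose positions differ by at most slop+1.
def word_edges(tokens, slop=1):
    edges = set()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 2 + slop, len(tokens))):
            edges.add((tokens[i], tokens[j]))
    return edges

e = word_edges('I love the new ipod shuffle'.split(), slop=1)
print(('love', 'new') in e)   # True: one word ("the") in between
print(('I', 'new') in e)      # False: two words in between
```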
145. Connecting Phrase Nodes
Connect phrases:
Compound words
Neighbor words
(Graph: the compound node "ipod shuffle" joins the word nodes from the previous slide.)
146. Connecting Sentence Nodes
Connect to:
Neighbor sentences
Compound terms
Compound phrases
(Graph: the two sentence nodes link to the word and phrase nodes: ipod, new, shuffle, it, smallest, ipod shuffle.)
147. Weights for the Different Edge Types
(Same graph as the previous slides; each kind of edge carries its own weight.)
148. Ranking Graph Nodes
Score each node (TextRank, 2004). Reconstructed formula (the export garbled it):
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)
where d is the damping factor.
Example from the slide: node "the" receives scores 0.1, 0.2, 0.3 from its parent nodes new, love, is, over edges weighted 0.5, 0.6, 0.7:
y_the = (1 - 0.85) + 0.85 * (0.5·0.1 + 0.6·0.2 + 0.7·0.3) (before out-weight normalization)
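One weighted-TextRank update can be sketched with the slide's example numbers (here each parent is assumed, for simplicity, to have only this one outgoing edge, so the normalizer equals the edge weight):

```python
# One TextRank update: WS(Vi) = (1-d) + d * sum over parents Vj of
# w_ji / sum_out(Vj) * WS(Vj).
def textrank_score(parents, d=0.85):
    # parents: list of (parent_score, edge_weight, parent_out_weight_sum)
    return (1 - d) + d * sum(s * w / out for s, w, out in parents)

# Slide's node "the": parent scores 0.1, 0.2, 0.3 over edges 0.5, 0.6, 0.7.
y_the = textrank_score([(0.1, 0.5, 0.5), (0.2, 0.6, 0.6), (0.3, 0.7, 0.7)])
print(round(y_the, 2))  # 0.66
```

In the full algorithm this update is iterated over all nodes until the scores converge, yielding rankings like the ones on the next slide.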
149. Result
ipod 1.21
new 1.02
shuffle 0.99
smallest 0.77
love 0.57
156. Different Possibilities of Text Mining
(Same world-sensor-data diagram as before.)
Understanding the relation between the text and its observer:
opinion mining & emotion mining
157. Opinion
An opinion is a subjective statement describing a person's perspective about something.
Objective or factual statement: can be proved to be right or wrong
Opinion holder: personalized/customized; depends on background, culture, context
Target
169. Emotion Analysis: A Text-Mining Approach
Graph and pattern approach:
Carlos Argueta, Fernando Calderon, and Yi-Shin Chen, "Multilingual Emotion Classifier using Unsupervised Pattern Extraction from Microblog Data", Intelligent Data Analysis - An International Journal, 2016
170. Big Data of Subconscious Emotion
Twitter: currently the easiest source for bulk downloads.
Throwing my phone always calms me down #anger
My sister always makes things look much more worse than they seem >:[ #anger
Why my brother always crabby !?!? #rude #youranadult #anger #issues
WHY DOES MY COMPUTER ALWAYS FREEZE??? NEVER FAILS. #anger
Im wanna crazy,if my life always sucks like this. #anger
Hashtags and emoticons are the best markers of emotion, so they can serve as (weak) human labels.
177. Preprocessing After Collection
Key point: drop what is troublesome or unprocessable.
o Too short → too short to extract features from
o Contain too many hashtags → too much information to handle
o Are retweets → would add computational complexity
o Have URLs → would require fetching yet more data
o Convert user mentions to <usermention> and hashtags to <hashtag>
→ removes identifiers and keeps the labels from leaking
(It's big data anyway.)
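The filtering and normalization rules above can be sketched as follows (the thresholds are hypothetical, not the paper's exact values):

```python
# Pre-filter tweets and normalize mentions/hashtags into placeholders.
import re

def keep(tweet):
    if len(tweet.split()) < 4:    return False  # too short
    if tweet.count('#') > 3:      return False  # too many hashtags
    if tweet.startswith('RT '):   return False  # retweet
    if 'http' in tweet:           return False  # contains a URL
    return True

def normalize(tweet):
    tweet = re.sub(r'@\w+', '<usermention>', tweet)
    return re.sub(r'#\w+', '<hashtag>', tweet)

print(keep('RT too lazy to process'))   # False
print(normalize('@bob loves #anger'))   # <usermention> loves <hashtag>
```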
179. Graph Construction
Build two graphs (an emotion graph & a non-emotion graph). E.g.:
Emotion text: I love the World of Warcraft new game
Non-emotion text: 3,000 killed in the world by ebola
(Two weighted word graphs follow: the emotion graph over I, love, the, World, of, Warcraft, new, game with edge weights 0.9, 0.84, 0.65, 0.12, 0.12, 0.53, 0.67, 0.45; the non-emotion graph over 3,000, killed, in, the, world, by, ebola with edge weights 0.49, 0.87, 0.93, 0.83, 0.55, 0.25.)
182. Emotion Patterns
o Composed of two kinds of elements:
o Surface tokens: hello, , lol, house, …
o Wildcard: * (matches every word)
o An emotion pattern consists of at least two elements
o with at least one element of each kind
Examples:
Pattern | Matches
* this * | "Hate this weather", "love this drawing"
* * | "so happy ", "to me "
luv my * | "luv my gift", "luv my price"
* that | "want that", "love that", "hate that"
183. Emotion Pattern Construction
o Each pattern is built from several instances
o Each instance contains at least two elements: center words and community words
o at least one center word and one community word
Examples:
Center words: this, luv, my, …
Community words: love, hate, gift, weather, …
Instances: "hate this weather", "so happy ", "luv my gift", "love this drawing", "luv my price", "to me ", "kill this idiot", "finish this task"
184. Emotion Pattern Construction (2)
o Count the frequency of every instance
o Group the instances by their center word and its position
Instance counts: "hate this weather" 5, "so happy " 4, "luv my gift" 7, "love this drawing" 2, "luv my price" 1, "to me " 3, "kill this idiot" 1, "finish this task" 4
Groups and counts:
"Hate this weather", "love this drawing", "kill this idiot", "finish this task" → 12
"so happy ", "to me " → 7
"luv my gift", "luv my price" → 8
…
185. Emotion Pattern Construction (3)
o Replace all community words with the wildcard
o Keep the center words
o Filter out infrequent emotion patterns
Pattern | Group | Count
* this * | "Hate this weather", "love this drawing", "kill this idiot", "finish this task" | 12
* * | "so happy ", "to me " | 7
luv my * | "luv my gift", "luv my price" | 8
… | … | …
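The generalization step above can be sketched directly over the slide's word-based instances (the emoticon instances are omitted, and the center-word list is taken as given):

```python
# Group instances sharing a center word in the same position, replace the
# community words with '*', and accumulate the counts per pattern.
from collections import defaultdict

CENTER = {'this', 'luv', 'my'}  # center words from the slide

instances = {('hate', 'this', 'weather'): 5, ('love', 'this', 'drawing'): 2,
             ('kill', 'this', 'idiot'): 1, ('finish', 'this', 'task'): 4,
             ('luv', 'my', 'gift'): 7, ('luv', 'my', 'price'): 1}

patterns = defaultdict(int)
for toks, cnt in instances.items():
    pat = tuple(t if t in CENTER else '*' for t in toks)
    patterns[pat] += cnt

print(patterns[('*', 'this', '*')])   # 12, as in the slide's table
print(patterns[('luv', 'my', '*')])   # 8
```

A frequency threshold would then drop the rare patterns.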
196. Contextual Text Mining
Query log + user = personalized search
Tweet + time = event detection
Tweet + location-related patterns = location detection
Tweet + sentiment = opinion mining
Text mining + context = contextual text mining
197. Partition Text
(Diagram: the corpus can be partitioned by context, e.g. users above age 65 vs. users under age 12, data within year 2000 along a 1998-2006 timeline, or posts containing #sad.)
198. Generative Model of Text
(The example sentences "I eat fish and vegetables. / Dog and fish are pets. / My kitten eats fish." are generated from, and analyzed back into, a model over the words eat, fish, vegetables, dog, pets, are, kitten, my, and, I.)
Generation: p(word | Model)
Analysis estimates p(word | Topic) and p(Topic | Document), e.g.:
Topic 1: 0.268 fish, 0.210 pet, 0.210 dog, 0.147 kitten
Topic 2: 0.296 eat, 0.265 fish, 0.189 vegetable, 0.121 kitten
199. Contextualized Models of Text
(Same generation diagram, now conditioned on context features such as Year=2008, Location=Taiwan, Source=FB, emotion=happy, Gender=Man.)
Generation: p(word | Model, Context)
200. Naïve Contextual Topic Model
(Same corpus, now generated under contexts Year=2007 and Year=2008, each context with its own two topics: Topic 1 — 0.268 fish, 0.210 pet, 0.210 dog, 0.147 kitten; Topic 2 — 0.296 eat, 0.265 fish, 0.189 vegetable, 0.121 kitten.)
Reconstructed formula (the export garbled it):
p(w) = \sum_{j=1}^{C} p(c_j) \sum_{i=1}^{K} p(z = i \mid Context_j)\, p(w \mid Topic_i, Context_j)
How do we estimate it? → Different approaches for different contextual data and problems
201. Contextual Probabilistic Latent Semantic
Analysis (CPLSA) (Mei, Zhai, KDD2006)
An extension of PLSA model ([Hofmann 99]) by
Introducing context variables
Modeling views of topics
Modeling coverage variations of topics
Process of contextual text mining
Instantiation of CPLSA (context, views, coverage)
Fit the model to text data (EM algorithm)
Compare a topic from different views
Compute strength dynamics of topics from coverages
Compute other probabilistic topic patterns
202. The Probabilistic Model
• A probabilistic model explaining the generation of a document D and its context features C: if an author wants to write such a document, he will
– Choose a view v_i according to the view distribution p(v_i | D, C)
– Choose a coverage κ_j according to the coverage distribution p(κ_j | D, C)
– Choose a theme θ_il according to the coverage κ_j
– Generate a word using θ_il
– The likelihood of the document collection is (reconstructed; the export garbled the formula):
\log p(D \mid C) = \sum_{(D,C)} \sum_{w \in V} c(w, D) \log \sum_{i=1}^{n} p(v_i \mid D, C) \sum_{j=1}^{m} p(\kappa_j \mid D, C) \sum_{l=1}^{k} p(l \mid \kappa_j)\, p(w \mid \theta_{il})
203. Contextual Text Mining Example
Detecting hateful code words
https://arxiv.org/pdf/1711.10093.pdf
205. Code Words for Hate Speech
Code word: "a word or phrase that has a secret meaning or that is used instead of another word or phrase to avoid speaking directly" (Merriam-Webster)
Example:
Anyone who isn't white doesn't deserve to live here. Those foreign ________ should be deported.
"niggers": a known hate term, easy to detect
"animals": a non-provocative word whose intent can still be inferred
"skypes": isn't Skype a messaging app? The surrounding context should feel off.
209. Context
Relatedness, with word2vec
Similarity, with dependency2vec
Example: "the man/boy jumps/plays/talks"
Relatedness: word collocation
Similarity: behavior
210. Dependency-based Word Embedding
Use dependency-based contexts instead of linear BoW [Levy and Goldberg, 2014]
214. Validation: Results on Control Data
Participants could usually tell the data apart.
(Two rating histograms of majority percentage: for the control word "niggers", positive for hate speech, ratings peak toward "Very Likely"; for the control word "water", negative for hate speech, ratings peak toward "Very Unlikely".)
215. Validation: Classification Results
Dataset | Class | Precision | Recall | F1
HateCommunity | hate | 0.88 | 1.00 | 0.93
HateCommunity | non-hate | 1.00 | 0.67 | 0.80
CleanTexts | hate | 1.00 | 0.75 | 0.86
CleanTexts | non-hate | 0.86 | 1.00 | 0.92
HateTexts | hate | 0.75 | 0.75 | 0.75
HateTexts | non-hate | 0.83 | 0.83 | 0.83