2. About Speaker
陳宜欣 Yi-Shin Chen
▷ Currently
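• Associate Professor, Department of Computer Science, National Tsing Hua University
• Director of the Intelligent Data Engineering and Applications Lab (IDEA Lab)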
▷ Education
• Ph.D. in Computer Science, USC, USA
• M.B.A. in Information Management, NCU, TW
• B.B.A. in Information Management, NCU, TW
▷ Current Research
• Music therapy + artificial intelligence
• Emotion analysis and psychoanalysis of text data
5. Language Analysis
▷Noam Chomsky
• “a language to be a set (finite or infinite) of
sentences, each finite in length and
constructed out of a finite set of elements”
• “the structure of language is biologically
determined”
• “that humans are born with an innate linguistic
ability that constitutes a Universal Grammar
will also be examined”
→ Syntactic Structures
Noam Chomsky (1928 - present)
6. Basic Concepts in NLP
Example sentence: This is the best thing happened in my life.
▷Lexical analysis (part-of-speech tagging)
• Tags shown: Det., Det., NN, PN, Pre., Verb, Verb, Adj
▷Syntactic analysis (parsing)
• Noun Phrase, Prep Phrase, Prep Phrase, Noun Phrase → Sentence
▷Semantic analysis
• This? (t1)
• Best thing (t2)
• My (m1)
• Happened (t1, t2, m1)
▷Inference (emotion analysis)
• Happy (x) if Happened (t1, ‘Best’, m1) → Happy
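The chain from lexical analysis to inference can be sketched in a few lines of Python. This is only a toy illustration, not the speaker's system: the `LEXICON` table, the word-to-tag alignment, and the helper names `pos_tag` and `infer_happy` are all assumptions made for this example, and the rule is a literal reading of "Happy (x) if Happened (t1, ‘Best’, m1)".

```python
# Hypothetical hand-written lexicon for the slide's example sentence.
# The word-to-tag alignment is an assumption; the slide only lists the tags.
LEXICON = {
    "this": "Det.", "is": "Verb", "the": "Det.", "best": "Adj",
    "thing": "NN", "happened": "Verb", "in": "Pre.", "my": "PN", "life": "NN",
}


def pos_tag(sentence):
    """Lexical analysis: look each token up in the toy lexicon."""
    tokens = sentence.rstrip(".").lower().split()
    return [(tok, LEXICON.get(tok, "UNK")) for tok in tokens]


def infer_happy(tagged):
    """Inference: fire the rule when 'best', 'happened' and 'my' all occur."""
    words = {word for word, _ in tagged}
    return {"best", "happened", "my"} <= words


tagged = pos_tag("This is the best thing happened in my life.")
print(tagged)
print("Happy" if infer_happy(tagged) else "No emotion inferred")
```

A real pipeline would replace the lookup table with a trained tagger and parser; the point here is only the order of the stages.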
13. Subconscious Crowdsourcing
▷Crowdsourcing
• Merriam-Webster: Obtaining needed services, ideas,
or content by soliciting contributions from a large
group of people, especially an online community
▷The subconscious wisdom of the crowd
• Extract the shared subconscious from people's everyday records
Chun-Hao Chang, Elvis Saravia and Yi-Shin Chen, Subconscious Crowdsourcing: A Feasible Data Collection
Mechanism for Mental Disorder Detection on Social Media, The 2016 IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining (ASONAM 2016), San Francisco, CA, USA, 18 - 21 August, 2016
Callouts: conventional crowdsourcing is expensive; subconscious crowdsourcing costs nothing
14. Big Data on Subconscious Emotions
▷Twitter: currently the easiest data source to download in bulk
Throwing my phone always calms me down #anger
My sister always makes things look much more worse than they seem >:[ #anger
Why my brother always crabby !?!? #rude #youranadult #anger #issues
WHY DOES MY COMPUTER ALWAYS FREEZE??? NEVER FAILS. #anger
Im wanna crazy,if my life always sucks like this. #anger
Hashtags and emoticons are the strongest markers of emotion, so they can stand in for human-annotated labels
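Using a hashtag as the label can be sketched as follows. This is a minimal illustration under assumptions: the label set `EMOTION_HASHTAGS` contains only `anger` because that is the only emotion tag shown on the slide, and the function name `label_from_hashtags` is hypothetical.

```python
import re

# Hypothetical label set; the slide only shows tweets tagged #anger.
EMOTION_HASHTAGS = {"anger"}

HASHTAG_RE = re.compile(r"#(\w+)")


def label_from_hashtags(tweet):
    """Return the first emotion hashtag found in the tweet, or None."""
    tags = [tag.lower() for tag in HASHTAG_RE.findall(tweet)]
    labels = [tag for tag in tags if tag in EMOTION_HASHTAGS]
    return labels[0] if labels else None


tweets = [
    "Throwing my phone always calms me down #anger",
    "Why my brother always crabby !?!? #rude #youranadult #anger #issues",
]
for tweet in tweets:
    print(label_from_hashtags(tweet), "<-", tweet)
```

Tweets without a recognised emotion hashtag return `None` and would simply be left unlabeled.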
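22. Preprocessing After Data Collection
▷Key point: drop whatever is troublesome or cannot be handled
o Too short
→ Too short to yield any features
o Contain too many hashtags
→ Too much information, hard to process
o Are retweets
→ They increase computational complexity
o Have URLs
→ The linked content would have to be fetched as well, which is too much work
o Convert user mentions to <usermention> and hashtags to <hashtag>
→ Removes identifiers, so the model cannot peek at the answer
Callout: it is big data anyway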
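The filtering and masking rules above can be sketched as one function. The thresholds `MIN_TOKENS` and `MAX_HASHTAGS`, the `RT @` retweet heuristic, and the regular expressions are assumptions for illustration; the slide does not give exact values.

```python
import re

MIN_TOKENS = 3    # hypothetical threshold for "too short"
MAX_HASHTAGS = 3  # hypothetical threshold for "too many hashtags"

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def preprocess(tweet):
    """Return the masked tweet, or None if the tweet should be dropped."""
    text = tweet.strip()
    if text.startswith("RT @"):          # retweet (simple heuristic)
        return None
    if URL_RE.search(text):              # contains a URL
        return None
    if len(HASHTAG_RE.findall(text)) > MAX_HASHTAGS:
        return None
    if len(text.split()) < MIN_TOKENS:   # too short to yield features
        return None
    # Mask identifiers so the label cannot leak into the features.
    text = MENTION_RE.sub("<usermention>", text)
    text = HASHTAG_RE.sub("<hashtag>", text)
    return text


print(preprocess("@amy WHY DOES MY COMPUTER ALWAYS FREEZE??? #anger"))
```

The emotion label has to be read from the hashtag before this masking step, since the masked text no longer contains it.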
24. Graph Construction
▷Build two kinds of graphs (an emotion graph and a non-emotion graph)
• E.g.
→ Emotional text: I love the World of Warcraft new game
→ Non-emotional text: 3,000 killed in the world by ebola
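[Figure: two weighted word graphs. Emotion graph nodes: I, Love, the, World, of, Warcraft, new, game; edge weights shown: 0.9, 0.84, 0.65, 0.12, 0.12, 0.53, 0.67, 0.45. Non-emotion graph nodes: 3,000, killed, in, the, world, by, ebola; edge weights shown: 0.49, 0.87, 0.93, 0.83, 0.55, 0.25.]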
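One way to build such a word graph is sketched below. The slide does not say how edges or weights are defined, so this sketch assumes that edges link adjacent words and that each weight is the share of a word's outgoing transitions that go to the next word; `build_word_graph` is a hypothetical helper, not the speaker's implementation.

```python
from collections import Counter


def build_word_graph(texts):
    """Return {(word_a, word_b): weight} for adjacent word pairs.

    Assumed weighting: count(a -> b) divided by the count of all
    transitions leaving a. The slide's actual weighting is not stated.
    """
    edge_counts = Counter()
    out_totals = Counter()
    for text in texts:
        tokens = text.lower().split()
        for a, b in zip(tokens, tokens[1:]):
            edge_counts[(a, b)] += 1
            out_totals[a] += 1
    return {edge: n / out_totals[edge[0]] for edge, n in edge_counts.items()}


emotion_graph = build_word_graph(["I love the World of Warcraft new game"])
non_emotion_graph = build_word_graph(["3,000 killed in the world by ebola"])
print(sorted(emotion_graph))
print(sorted(non_emotion_graph))
```

With a single sentence per graph every weight is 1.0; the weights only become informative once each graph is built from a large collection of emotional or non-emotional tweets.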
27. Emotion Features
o Composed of two kinds of elements:
o Surface tokens: hello, , lol, house, …
o Wildcard: * (matches any word)
o An emotion feature consists of at least two elements
o At least one element of each kind is required
Examples:
Emotion pattern    Matches
* this *           “Hate this weather”, “love this drawing”
* *                “so happy ”, “to me ”
luv my *           “luv my gift”, “luv my price”
* that             “want that”, “love that”, “hate that”
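Matching a pattern against a phrase can be sketched as a token-by-token comparison. The function names `matches` and `is_valid_pattern` are hypothetical, and the sketch assumes whitespace tokenization and case-insensitive comparison, which the slide does not specify.

```python
def matches(pattern, text):
    """True if every pattern element equals the token or is the wildcard '*'."""
    pattern_tokens = pattern.lower().split()
    text_tokens = text.lower().split()
    if len(pattern_tokens) != len(text_tokens):
        return False
    return all(p == "*" or p == t for p, t in zip(pattern_tokens, text_tokens))


def is_valid_pattern(pattern):
    """At least two elements, with at least one surface token and one wildcard."""
    parts = pattern.split()
    has_wildcard = "*" in parts
    has_surface = any(part != "*" for part in parts)
    return len(parts) >= 2 and has_wildcard and has_surface


for phrase in ["Hate this weather", "love this drawing", "love that"]:
    print(phrase, "->", matches("* this *", phrase))
```

Because the wildcard matches exactly one word, a pattern only matches phrases of the same length.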
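28. Composing Emotion Features
o Each emotion feature is built from several instances
o Each instance must contain at least two words drawn from the center words and community words
o At least one center word and at least one community word are required
Examples
Community words: love, hate, gift, weather, …
Center words: this, luv, my, …
Instances:
“hate this weather”
“so happy ”
“luv my gift”
“love this drawing”
“luv my price”
“to me ”
“kill this idiot”
“finish this task”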
29. Composing Emotion Features (2)
o Count the frequency of every instance
o Group the instances by their center words and the positions where those words occur
Instances              Count
“hate this weather”    5
“so happy ”            4
“luv my gift”          7
“love this drawing”    2
“luv my price”         1
“to me ”               3
“kill this idiot”      1
“finish this task”     4
Groups                                                                             Count
“Hate this weather”, “love this drawing”, “kill this idiot”, “finish this task”    12
“so happy ”, “to me ”                                                              7
“luv my gift”, “luv my price”                                                      8
…                                                                                  …
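The grouping step can be sketched by replacing every non-center word with a wildcard and summing the counts per resulting pattern. This is an illustrative sketch: `CENTER_WORDS` holds only the three center words listed on slide 28 (the slide's list is truncated), the helper names are hypothetical, and the two instances whose third token did not survive extraction are left out.

```python
from collections import Counter

# Center words listed on slide 28; the original list is truncated with "…".
CENTER_WORDS = {"this", "luv", "my"}

# Instance counts from the slide (instances with a lost token omitted).
INSTANCE_COUNTS = {
    "hate this weather": 5,
    "luv my gift": 7,
    "love this drawing": 2,
    "luv my price": 1,
    "kill this idiot": 1,
    "finish this task": 4,
}


def to_pattern(instance):
    """Keep center words in place and replace every other word with '*'."""
    return " ".join(
        word if word in CENTER_WORDS else "*" for word in instance.lower().split()
    )


def group_instances(instance_counts):
    """Sum instance counts per (center word, position) pattern."""
    groups = Counter()
    for instance, count in instance_counts.items():
        groups[to_pattern(instance)] += count
    return groups


for pattern, total in group_instances(INSTANCE_COUNTS).items():
    print(pattern, total)
```

The totals for `* this *` and `luv my *` reproduce the group counts of 12 and 8 in the table above.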
42. Code Words for Hate
▷Code Word: “a word or phrase that has a secret meaning or that is used instead of another word or phrase to avoid speaking directly” (Merriam-Webster)
▷Example:
Anyone who isn’t white doesn’t deserve to live here.
Those foreign ________ should be deported.
• niggers: a known hate word, easy to identify
• animals: a less provocative word is substituted; the meaning can still be inferred
• skypes: isn't Skype the name of a messaging app? The context signals that something is off
46. Context
▷Relatedness
• with word2vec
• Relatedness: word collocation
▷Similarity
• with dependency2vec
• Similarity: behaviour
[Figure: the phrase “the man jumps”, shown with the further words “boy”, “plays”, and “talks”]