CML's Presentation at FengChia University

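Security Challenges of Social Media:
Cognition, Cross-Platform, and Push Algorithms
Tunghai University, Department of Computer Science & Cloud Innovation School
Assistant Professor Chun-Ming Lai (賴俊鳴)
<cmlai@thu.edu.tw>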

 AWS Certified Instructor
 AWS SAA & CLF Certificate
 Research interests: social network data analysis, applied big data,
human-computer interaction
 WSET Level I & II
 Travel points enthusiast
2021/10/18 2
About Me
Chun-Ming Lai
AWS Certified Solutions Architect - Associate
Nov 03, 2019
Nov 03, 2022
Validation Number WM1ENMBCGEE1QY51
Validate at: http://aws.amazon.com/verification
Chun-Ming Lai
Sep 21, 2019
Sep 21, 2022
Validation Number MGFL2VQCGE1E1PC5
Validate at: http://aws.amazon.com/verification

Advertising
Recommendation Engines
Public Impersonal Pages
Public Personal Profiles
Private Groups
Private Personal
Profiles
Group
Messages
1:1
10/18/2021 3
Embrace transparency and restraint
on communication behavior
Amplification Privacy Concern
• Confidentiality,
• Policy
• Law

 Abuse with an Internal Victim
• Cyber Bullying
• Doxing (exposing private information)
• Child grooming
• Sextortion (extortion and blackmail)
• Terrorist recruiting
2021/10/18 4
Security Issues With Targets (1/2)

Abuse with an External Victim
• CSAM Trading (Child Sexual Abuse Material)
• Conspiracy
• Hate Speech
• Anti-Vax (anti-vaccination)
• Disinformation
2021/10/18 5
Security Issues With Target (2/2)

 Do you try hard to find the News that you
like to receive?
 Or, is there a special “force” to push the
News in front of you?
2021/10/18 6
Ask??

12/06/2019
Media Sources
Social Algorithms
Online Participants
Content
Comments
Reactions
ML is learning how to select the
information you like to read
Addictive Design
A major design change around 2012~2013

12/06/2019 8
$\sum_{\text{edges } e} u_e \, w_e \, d_e$
• $u_e$ is the user affinity
• $w_e$ is how the content is weighted
• $d_e$ is a time-based decay parameter
Social Algorithms
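
A minimal Python sketch of a score that sums affinity × weight × decay over edges. It is an illustration only, not the platform's actual implementation: the `Edge` record, the exponential form of the decay, and the `half_life_hours` parameter are assumptions introduced for this example.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    affinity: float   # u_e: how close the viewer is to the edge's creator
    weight: float     # w_e: weight of this kind of content or interaction
    age_hours: float  # time elapsed since the edge was created

def decay(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Hypothetical time decay d_e: the value halves every half_life_hours."""
    return 0.5 ** (age_hours / half_life_hours)

def score(edges: list[Edge]) -> float:
    """Sum of u_e * w_e * d_e over all edges attached to a story."""
    return sum(e.affinity * e.weight * decay(e.age_hours) for e in edges)

# Same interactions, once fresh and once two days old (made-up numbers).
fresh = [Edge(0.8, 2.0, 0.0), Edge(0.5, 1.0, 0.0)]
stale = [Edge(0.8, 2.0, 48.0), Edge(0.5, 1.0, 48.0)]
print(score(fresh), score(stale))
```

With this assumed decay, the same set of interactions scores lower as it ages, which is the effect the decay term is meant to capture.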

OUR ALGORITHM
Attack
Disinformation cannot spread widely without a push from the social algorithm
Disinformation is ineffective without successful manipulation of the social algorithm
Detect Attack behavior-OUR ALGORITHM

11/28/2019
Media Sources
Manipulated Social Algorithms
Online Participants
Content
Comments
Reactions
Online Participants

12/06/2019
Systematic AI and Security
Media Sources
Social Algorithms
Online Participants
Content
Comments
Reactions

 Capitalism is a spy for the Soviet Union and China
 Communism is a spy for Europe and America
12/06/2019 13
Joke

12/06/2019
Systematic AI and Security
Media Sources
Restricted
Social Algorithms
Online Participants
Cleaned
Content
Comments
Reactions
Suspicious Accounts Detection
Fact Check sites What’s the problem?

12/06/2019
Media Sources
Personalized
Social Algorithms
Participants
Restricted
Social
Informatics
Comments
Reactions
Individual
Participant
Rather than filtering the content universally,
remove the influence of suspicious accounts
on a personalized basis

12/06/2019 16
Temporal/
Spatial
Interaction
Graph
Accounts Credibility
Fact Checker
Research Directions

12/06/2019 17
Our Work in National Defense
Conference

 Giants / Big Brother
 Information Gathering
 Internet Archive / Wayback Machine
 Political Correctness
12/06/2019 18
Challenges

2021/10/18 21
Crisis Informatics

 Systematic AI covers not just the data, but also the
adaptive process and ecological system around
all the data
• "Systematic" refers to the depth of the domain
 Move from universally collecting all the data to
systematically selecting the data (or knowing what we
do not have)
• We need systematic AI to know what to do
• We cannot easily learn the ecological system when
adversaries are present (so we need to filter them out)
Decentralized (at least virtually) information
ownership → better resistance and robustness
12/06/2019 22
Data-Centric Computing
Takeaways

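Reaction
 Timeliness of replies
 Whether the reply addresses the key point; open a case and track it
 Life cycle of a post
 1.5 hours on average, affecting people's lives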
10/18/2021 24

Security Threat
Severe Threat
• Phishing
• Malware, drive-by-download
Medium to light Threat
• Advertisement
• Spamming (Fund-raising, porn, canned messages, etc.)
New type Threat
• Rumors, Media manipulation, sign up, vote stuffing, etc.
• Fake News
• Crowdturfing = CrowdSourcing + Astroturfing
10/18/2021 25

Outline
 Suitable Target, Lifecycle Analysis
 Multiple Accounts Detection
 Geolocation Identification
 Personal words
10/18/2021 26

10/18/2021 27
Facebook.com/63811549237/posts/10153038271604238
2014, 12-19, 03:06 am GMT
Social Media— Climate Change

10/18/2021 29
Total: 609 comments

Suitable Targets Problem
Given any post thread p on a social media
platform, predict whether p
contains at least one malicious
comment, via a classifier
c: p → {target, nontarget}
10/18/2021 30

Key idea: Life Cycle of Posts
10/18/2021 31
10 hrs

Definition
 Time Series (TS)
• TScreated(post): the time at which the original article is posted
• TSj: a time period of length j following the creation time of the original article
• TSfinal: the end of our observation
 Accumulated Number of participants (AccNcomment)
• The number of post comments between TS(i-1) and TSi
 Discussion Atmosphere Vector (DAV)
10/18/2021 32

Example
TScreated(Climate) = 2014-12-19 03:06:42
Suppose j = 5 (minutes), final = 120 (minutes)
DAV(Climate) = [ # of comments in 03:06:42 ~ 03:11:42   (1st window)
                 # of comments in 03:11:42 ~ 03:16:42   (2nd window)
                 …
                 # of comments in 05:01:42 ~ 05:06:42 ] (24th window)
10/18/2021 33
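
A minimal Python sketch of building a Discussion Atmosphere Vector from comment timestamps, assuming 5-minute windows over a 120-minute observation period as in the example. The function name `dav`, the half-open window boundaries, and the sample comment times are assumptions for illustration and do not come from the dataset.

```python
from datetime import datetime, timedelta

def dav(created: datetime, comment_times: list[datetime],
        window_min: int = 5, final_min: int = 120) -> list[int]:
    """Count comments in each consecutive window after the post time."""
    n_windows = final_min // window_min
    counts = [0] * n_windows
    for t in comment_times:
        offset = (t - created).total_seconds()
        if offset < 0:
            continue  # comment timestamp predates the post; ignore it
        idx = int(offset // (window_min * 60))
        if idx < n_windows:  # drop comments after the final observation time
            counts[idx] += 1
    return counts

created = datetime(2014, 12, 19, 3, 6, 42)
# Hypothetical comments 1, 2, 7, 119 and 130 minutes after the post.
comments = [created + timedelta(minutes=m) for m in (1, 2, 7, 119, 130)]
vector = dav(created, comments)
print(len(vector), vector[0], vector[1], vector[-1])
```

The vector has 120 / 5 = 24 entries, and the comment at 130 minutes falls outside the observation period and is not counted.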

Dataset
Ten main media pages on Facebook,
2011~2014
42,703,463 in total
10/18/2021 34

Feature Engineering
 # of comments, # of likes, # of shares
 Spanning time (Last comment time – first comment time)
 Temporal Feature with Delta Time window, with a final
observation time
 Context-free, don’t need to address Natural Language
Processing
[Figure: time elapsed until the 1st comment, 1st like, and 1st share]
Results
10/18/2021 36
Near Real Time

Discussion: Do you understand Facebook enough?
10/18/2021 37
• Attackers’ preference
• Selected by Facebook
• Audience reaction
• Bandwagon Effect
• Rich get Richer
• People love biased and
debate-provoking posts

Life Cycle and Influence Ratio
10/18/2021 38
CNN 2012 all post threads
>70%
mURL

DAV Predict IR (1/2)
10/18/2021 39

DAV Predict IR (2/2)
10/18/2021 40

Accounts Activity within a week around election date
10/18/2021 41
Active = Count(Activities) within 1 week >= threshold
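
A minimal Python sketch of the "active" rule above: count each account's activities inside a one-week window and keep accounts that reach the threshold. The function name, the log format of (account, timestamp) pairs, the window start date, and the sample data are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

def active_accounts(activities, week_start, threshold):
    """Accounts with at least `threshold` activities in the 7 days from week_start."""
    week_end = week_start + timedelta(days=7)
    counts = Counter(account for account, ts in activities
                     if week_start <= ts < week_end)
    return {account for account, n in counts.items() if n >= threshold}

week_start = datetime(2016, 11, 5)  # hypothetical start of the observed week
log = ([("a", week_start + timedelta(days=d)) for d in (1, 2, 3)]
       + [("b", week_start + timedelta(days=1))]
       + [("c", week_start + timedelta(days=8))] * 3)  # outside the window
active = active_accounts(log, week_start, threshold=3)
print(active)
```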

10/18/2021 42
Clinton, 1st week
Clinton, 2nd week

10/18/2021 43
Trump, 2nd week
Trump, 1st week
All accounts: periodic
Attacker accounts: random

Conclusion
Suitable targets can be predicted successfully with temporal
features
• Attackers: follow or not?
• Defenders: where to deploy resources
Temporal analysis with different variables
• Influence Ratio: will it increase or decrease in the next time
window?
• 24-hour pattern: links online and offline behavior
10/18/2021 44

Outline
 Multiple Accounts Detection
 Personal words
10/18/2021 45

Semi-Supervised Learning on Graphs
Motivation of detecting multiple accounts on FB
Crawler 1, Crawler 2, Crawler 3 → Facebook API
When calling the Facebook API:
the API gives each crawler a different scope ID for the same user,
so the same user appears under different scope IDs in the dataset.

100003468896671 高婷婷 https://www.facebook.com/mayuko.sakamoto.503
100004123536871 賴婷婷 https://www.facebook.com/profile.php?id=100004123536871
100003251795795 陳婷婷 https://www.facebook.com/rika.etoh
100000681128139 高婷婷 https://www.facebook.com/vincenzo.muscari.5
100002630019886 陳婷婷 https://www.facebook.com/sven.erkens.98
813243492 高婷婷 https://www.facebook.com/profile.php?id=813243492
Ting-Ting’s Family

Number of friends (Facebook allows at most 5000)
100003468896671 高婷婷 45xx
100004123536871 賴婷婷 45xx
100003251795795 陳婷婷 4xxx

Multiple Accounts Detection using
Semi-Supervised Learning on Graphs
When data is crawled from FB with multiple crawlers, the API returns a scope ID for each crawler
instead of the user's primary ID.
For example, a user whose primary ID is mohamed.aimane.98 has multiple scope IDs:
1815396745342476, 1815402648675219, 1815411572007660,
1815468805335270, 1815515615330589, 1815482155333935, 1815488781999939,
1816157185266432. This implies that mohamed's data was crawled by 8 different crawlers.
As a result, all of these records in our dataset carry the user name mohamed aimane, but
many IDs share that same user name.
Problem: given 2 scope IDs with the same user name, do they belong to the same user (same
primary ID) or not?
Motivation of detecting multiple accounts on FB

Graph Construction
U = {users}, V = {pages}; an edge (u, v) means user u had an activity on page v
Activities

Main Algorithms
Unsupervised learning using Katz Similarity
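$P_{xy}^{(i)} = (x, x_1, x_2, \ldots, y)$ is a path of length $i$ from x to y
u1 and u2 are similar if their activity paths are similar
Katz similarity can be computed by:
$K = \sum_{i=1}^{\infty} \beta^i M^i = (I - \beta M)^{-1} - I$
where M is the adjacency matrix of graph G, $\beta$ is a scalar smaller than $1/\lVert M \rVert_2$ to
ensure convergence, and I is the identity matrix.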
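
A minimal pure-Python sketch of Katz similarity on a small user-page graph. It approximates the standard Katz index (the sum of β^i M^i over path lengths i ≥ 1) by truncating the series. Several choices are assumptions made for this example: the helper names, the truncation length `max_len`, the toy graph, and the default β = 0.5 / (maximum degree). For a symmetric adjacency matrix the maximum row sum is an upper bound on the spectral norm, so this β stays below 1/‖M‖₂.

```python
def matmul(a, b):
    """Plain-Python product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def katz(adj, beta=None, max_len=50):
    """Approximate sum_{i>=1} beta^i M^i, truncated at paths of length max_len."""
    n = len(adj)
    if beta is None:
        max_degree = max(sum(row) for row in adj)  # assumes at least one edge
        beta = 0.5 / max_degree
    step = [[beta * x for x in row] for row in adj]  # beta * M
    term = [row[:] for row in step]                  # beta^1 M^1
    total = [row[:] for row in step]
    for _ in range(max_len - 1):
        term = matmul(term, step)                    # next power of beta * M
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total

# Hypothetical toy graph: nodes 0-2 are users, nodes 3-5 are pages.
edges = [(0, 3), (0, 4), (1, 3), (1, 4), (2, 5)]
n = 6
adj = [[0] * n for _ in range(n)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1
K = katz(adj)
print(K[0][1], K[0][2])
```

Users 0 and 1 act on the same two pages and get a positive similarity, while user 2 shares no path with user 0 and scores zero.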

Main Algorithms
Unsupervised learning using Katz Similarity

Katz matrix is
1    0.9  0.2  0.3
0.9  1    0.5  0.5
0.2  0.6  1    0.8
0.3  0.5  0.8  1
The threshold we use is 0.8
Then the 1st and 2nd nodes belong to the same user, and the 3rd and 4th
nodes belong to the same user; the others do not.
Example of Algorithm 1
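
A minimal Python sketch of the thresholding step in this example. The matrix values are the ones shown on the slide, but their placement in the off-diagonal blocks is an assumption (the extracted layout is ambiguous), and the function name `same_user_pairs` and the inclusive comparison (≥ 0.8) are choices made for this illustration.

```python
def same_user_pairs(sim, threshold=0.8):
    """Return 1-based node pairs whose similarity reaches the threshold."""
    n = len(sim)
    return [(i + 1, j + 1)
            for i in range(n)
            for j in range(i + 1, n)  # upper triangle only: each pair once
            if sim[i][j] >= threshold]

# Values from the slide; off-diagonal placement is assumed.
katz_matrix = [
    [1.0, 0.9, 0.2, 0.3],
    [0.9, 1.0, 0.5, 0.5],
    [0.2, 0.6, 1.0, 0.8],
    [0.3, 0.5, 0.8, 1.0],
]
pairs = same_user_pairs(katz_matrix)
print(pairs)
```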

Main Algorithms
Semi-Supervised Method using Graph Embedding

Classical ML Tasks in Networks
• Node Classification
• Predict type of a node
• Link Prediction
• Predict friends
• Community Detection
• Network Similarity
• How similar are two networks?

Node2vec(1/4)
Many Possible ways:
• PageRank score, Degree, centrality, # of edges…etc.
Features

Node2vec(2/4)
Mixture of BFS and DFS
BFS --- LocalView (u and S1)
DFS --- GlobalView (u and S6)

Node2vec(3/4)
• Two Parameters:
• Return parameter p:
• Return back to the previous node
• In-out parameter q:
• Moving outwards (DFS) vs. inwards (BFS)
• The ratio of BFS vs.DFS
• Biased 2nd-order random walks explore network neighborhoods.
Parameters

Node2vec(4/4)
• Simulate r random walks of length l starting from each node u
• Optimize the node2vec objective using Stochastic Gradient Descent
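
A minimal Python sketch of the biased 2nd-order random walk with the return parameter p and the in-out parameter q, for an unweighted graph. Only the walk simulation is shown; the embedding optimization with stochastic gradient descent is omitted. The function name, the toy graph, and the parameter values are assumptions for illustration.

```python
import random

def biased_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """One 2nd-order random walk visiting `length` nodes."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break  # dead end
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif nxt in graph[prev]:
                weights.append(1.0)      # stay close to prev (BFS-like)
            else:
                weights.append(1.0 / q)  # move outwards (DFS-like)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Hypothetical undirected toy graph as adjacency sets.
graph = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2, 5}, 5: {4}}
# r = 3 walks of length l = 6 starting from each node u.
walks = [biased_walk(graph, u, length=6, p=0.5, q=2.0, rng=random.Random(0))
         for u in graph for _ in range(3)]
print(walks[0])
```

A small q makes outward moves more likely (more DFS-like exploration), while a small p makes the walk more likely to step back to the node it just left.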

Embedding for node 1: (0.1, 0.3, 0.2, 0.4); embedding for node 2: (0.2, 0.3, 0.2, 0.4)
We sample some ground truth, e.g. that node 1 and node 2 belong to the same user,
etc.
L looks like: ((1,2), 1), ((1,3), 0), ((2,3), 0), ((2,4), -1), ((3,4), -1), …
X comes from the embeddings: for example, ((1,2), (0.1, 0, 0, 0)), …
We then feed X and L into a label spreading model and obtain that the 1st and 2nd nodes
belong to the same user, and the 3rd and 4th nodes belong to the same user;
the others do not.
Example of Algorithm 3
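
A minimal Python sketch of how the pair features X and labels L in this example could be assembled. The embeddings of nodes 1 and 2 are taken from the slide; those of nodes 3 and 4 are made up. Using the element-wise absolute difference as the pair feature is an assumption that matches the (0.1, 0, 0, 0) example, and -1 marks unlabeled pairs. The label spreading step itself is left out; scikit-learn's `LabelSpreading`, which also treats -1 as unlabeled, is one model that could consume X and L.

```python
from itertools import combinations

embeddings = {
    1: (0.1, 0.3, 0.2, 0.4),  # from the slide
    2: (0.2, 0.3, 0.2, 0.4),  # from the slide
    3: (0.9, 0.1, 0.5, 0.2),  # hypothetical
    4: (0.8, 0.1, 0.5, 0.2),  # hypothetical
}

def pair_feature(a, b):
    """Element-wise absolute difference of two embedding vectors."""
    return tuple(round(abs(x - y), 6) for x, y in zip(a, b))

# Sampled ground truth: 1 = same user, 0 = different users.
known = {(1, 2): 1, (1, 3): 0, (2, 3): 0}

pairs = list(combinations(sorted(embeddings), 2))
X = [pair_feature(embeddings[i], embeddings[j]) for i, j in pairs]
L = [known.get(pair, -1) for pair in pairs]  # -1 = unlabeled pair
print(list(zip(pairs, L)))
```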

Main Algorithms
Different measurement of Embedding Vectors

Experiments and Evaluation
Comparison among the Three Methods
Two simple datasets : dataset 1: 188 nodes and 262 activities (links);
dataset 2: 4188 accounts and 6715 activities(links).

Outline
 Multiple Account Detection
 Personal words
10/18/2021 64

Page Information and Page-like Graph
10/18/2021
Page-like graph example: Sports Illustrated, Golden State Warriors, Oakland Museum, and Giving Tuesday, connected by "like" edges
Field        Example
Page ID      47657117525
Name         Golden State Warriors
Category     Sports Team
Country      United States
Fan Count    11,019,236
Description  The Official Facebook page of the Golden State Warriors

10/18/2021
• Facebook public pages are public profiles used by local businesses, companies, organizations, or public figures
• Likes: promoting other pages to community participants

Data Collection
Facebook Graph API version 2.8 was used to collect our
data [1]
• 38,831,367 pages (for this work)
• 2,430,873 US
• 12,685,090 other countries
• 23,715,404 unknown
 [1] https://developers.facebook.com/docs/graph-api/reference/page
10/18/2021 67

Majority Vote Algorithm
10/18/2021
• location designated as state
information in this scenario
• The location labeling is determined by
the most votes
• Overall accuracy is only 59.4%
• This algorithm works well in page nationality
prediction task, with 90.25% accuracy
68
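
A minimal Python sketch of majority-vote labeling: a page receives the location that occurs most often among its neighbors with a known location. The function name, the neighbor and location dictionaries, and the sample data are assumptions for illustration; how ties are broken in the original work is not stated.

```python
from collections import Counter

def majority_vote(page, neighbors, known_location):
    """Label `page` with the most common known location among its neighbors."""
    votes = Counter(known_location[n]
                    for n in neighbors.get(page, ())
                    if n in known_location)
    if not votes:
        return None  # no labeled neighbor, so no prediction
    return votes.most_common(1)[0][0]

# Hypothetical page P with four neighbors, three of them labeled.
neighbors = {"P": ["A", "B", "C", "D"]}
known_location = {"A": "California", "B": "California", "C": "Texas"}
label = majority_vote("P", neighbors, known_location)
print(label)
```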

Baseline Algorithm
Utilizes locality of states to find pages
belonging to their corresponding states
• Pick out anchored pages, with local property, as
multiple seeds to start BFS from
Target classifier: 51 classes
• 50 classes of US states and a class of ”others (OT)”
State Distance Vector (SDV)
10/18/2021 69

State anchors: Alabama, Alaska, Arkansas, Arizona, ……, Wyoming
Example for a page P: IHOP(P, S_Arizona) == 4, OHOP(P, S_Arizona) == 3
Graph size: 31M+ nodes, 600M+ edges
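
A minimal Python sketch of a State Distance Vector built with BFS from anchored pages. It computes one hop count per state along directed "like" edges. The function names, the -1 marker for unreachable pages, and the toy graph are assumptions for illustration. Which edge direction corresponds to IHOP and which to OHOP is not stated in the slides, so only a single direction is shown.

```python
from collections import deque

def hops_from(graph, source):
    """BFS hop counts from `source` along directed edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def state_distance_vector(graph, anchors, page, unreachable=-1):
    """One hop count per state anchor, in the order of `anchors`."""
    return [hops_from(graph, anchor).get(page, unreachable)
            for anchor in anchors.values()]

# Hypothetical toy graph: P is 3 hops from the Arizona anchor.
graph = {"S_Arizona": ["a"], "a": ["b"], "b": ["P"], "S_Alaska": ["x"]}
anchors = {"Arizona": "S_Arizona", "Alaska": "S_Alaska"}
sdv = state_distance_vector(graph, anchors, "P")
print(sdv)
```

On a graph with tens of millions of nodes, one BFS per anchor would be run once and reused for all pages rather than repeated per page as in this sketch.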

Anchor Page Selection (1/2)
10/18/2021
Effectiveness of BFS-based algorithms
• It depends on anchored page selection
Anchored pages have to be local such that SDV can provide authentic
tendency of a page’s locality
Suitable examples (focusing on local communities)
• state universities, government, park or police organizations
Ill-suited examples (popular and thus having global impact)
• NBA, MLB, or NFL sports teams
71

• We adopt all subsidiary pages of "OnlyInYourState.com" as a set of anchored pages
• It has a distinct page for each state
• Each subsidiary page mostly connects local communities
Anchor Page Selection (2/2)
Page Name                      Page ID
Only In Alabama                783744898386760
Only In Alaska                 686107314826906
Only In Southern California    1840349052857006
Only In Northern California    856450181102963
Idaho Only                     435099846671531
Only In New York               386608421546055
Only In Virginia               1560515737540492
Only In West Virginia          1509709509286532
Only In Wisconsin              1390297064627420
Only In Wyoming                1724174364476381
10/18/2021 72

51 Anchors
Arizona
Northern
California
10/18/2021 73

Advanced Algorithm
Baseline algorithm’s drawback
• A local page can have a few connections to pages that are far away
• This kind of connection noise greatly reduces prediction accuracy
State Neighborhood Probability (SNP)
Both SDV and SNP are taken as feature vectors for ML models
• Utilize locality and neighborhood context for better identification
10/18/2021 74

Dataset
California accounts for 20% of all US pages, and half of all
pages (49.49%) are located in top 5 states
• California, New York, Florida, Illinois, and Texas
10/18/2021 75

Accuracy Summary
Classifier Precision Recall F1 score
Naive Bayes (Baseline BFS) 0.44 0.27 0.26
Adaboost (Baseline BFS) 0.46 0.40 0.37
Random Forest (Baseline BFS) 0.69 0.69 0.68
Random Forest (Advanced BFS) 0.89 0.88 0.88
10/18/2021 76

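By production method, alcoholic beverages are classified as:
• Fermented beverages, e.g. beer, wine, rice wine, Shaoxing wine, Japanese sake…
(15% or lower)
• Distilled spirits, e.g. kaoliang liquor, brandy, whisky, vodka, rum… (40% or higher)
• Compounded liquors, e.g. medicinal liquor, cream liqueur…

Grape stems → tannin
Grape skins → tannin, color
Grape pulp → sugar, acidity
Grape seeds → bitter oils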

Sunlight Warmth
Water Nutrients

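Temperature, sugar level, alcohol content, color,
tannin, acidity
$C_6H_{12}O_6 \rightarrow 2\,C_2H_5OH + 2\,CO_2$
Glucose → alcohol
Fermentation

Temperature, sugar level, alcohol content, color,
tannin, acidity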
South Africa
Germany

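1. Color:
• Red wine
• White wine
• Rosé wine

Is red wine simply made from red grapes,
and white wine from white grapes?

Red wine
White wine
White wine: Crush → Press → Ferment
Red wine: Crush → Ferment on the skins → Press
Rosé: Crush → Ferment on the skins (12-36 hrs) → Press

Red grapes can make both red and white wine;
white grapes can only make white wine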

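2. Sparkling wine:
• Champagne (France)
• Asti (Italy)
• Cava (Spain)

Sparkling wine: secondary fermentation
Fermented base wine → add sugar and yeast →
ferment in a sealed vessel → yeast autolysis after fermentation
→ filter out the sediment → age in bottle or in tank

• Champagne: fermented in the bottle, yeast added to each bottle individually,
sediment removed by freezing the bottle neck
• Ordinary sparkling wine: fermented in tank, yeast added in bulk,
bottled under high pressure

Only sparkling wine produced in the Champagne region of France in compliance with
the Champagne production regulations may be called Champagne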

3. Sweet wine:
• Ice wine
• Noble rot (botrytis) sweet wine
• Port
• Sherry

• Ice wine
• Port

• Noble rot (botrytis) sweet wine
• Port

Noble rot (botrytis) sweet wine
• Timing: morning mist, afternoon sun
• Location: where two rivers meet
• Representative regions:
• France: Sauternes
• Hungary: Tokaji
• Germany: TBA (Trockenbeerenauslese)

Guess how much it costs?
The king of sweet wines: Château d'Yquem

• Port
• Sherry

                 Port                  Sherry
Country          Portugal              Spain
Grapes           Red grapes            White grapes
Fortification    During fermentation   After fermentation
Taste            Sweet                 Dry

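Common corkscrews
• Wing (butterfly) corkscrew
• Waiter's friend (sommelier
knife)
• Old-wine opener (Ah-So)

Upright or on its side?
Refrigerator or wine cabinet?
Constant temperature and humidity
Storage environment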

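Divide into smaller portions
• Airtight containers
Wine extraction tool:
• Coravin™
What if you cannot finish the bottle?

Principle: do not touch the wine directly

No more than 1/3 full, except for sparkling wine

Maximize contact with oxygen
Filter out sediment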

All red wines
Premium white wines (aged in oak barrels)
Follow your own taste!

Depends on the grape variety, region,
and that year's weather

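• Old World:
 e.g. Spain
• New World:
 e.g. USA, Chile, NZ, Australia…
• Hypermarkets also carry decent wines
• Websites:
 e.g. 葡萄酒新手選…

• Red wine with red meat, white wine with white meat
• Pair by terroir (regional pairing)
• Dry sparkling wine goes with everything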

Classic food and wine pairings
• Chablis white wine + oysters
• Champagne + caviar
• Noble rot sweet wine + blue cheese or foie gras
• Taiwanese cuisine + German Riesling
Champagne goes with everything!

Five wine-tasting terms
Sweetness
Acidity
Tannin
Fruit
Body

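Sight
Color
White wine: lemon → gold → amber (young → old)
Red wine: purple → ruby/garnet → tawny (young → old)

Smell
Intensity: light → medium → pronounced
Aromas: which floral, fruit, spice, or herbal notes, etc.

Taste
Sweetness: dry → off-dry → medium → sweet
Acidity: low → medium → high
Tannin (not assessed for white wine): low → medium → high
Body: light → medium → full
Flavors: which floral, fruit, spice, or herbal notes, etc.
Finish: short → medium → long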

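Sight
Color
White wine: lemon → gold → amber (young → old)
Red wine: purple → ruby/garnet → tawny (young → old)
Smell
Intensity: light → medium → pronounced
Aromas: which floral, fruit, spice, or herbal notes, etc.
Taste
Sweetness: dry → off-dry → medium → sweet
Acidity: low → medium → high
Tannin: low → medium → high
Body: light → medium → full
Flavors: which floral, fruit, spice, or herbal notes, etc.
Finish: short → medium → long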

Balance
Finish
Complexity
Typicity
Intensity
What makes a good wine?

Instead of Drinking the best
wine, try to drink wine the
best.

12/06/2019 175
Thank you for listening; comments are welcome
