2. AWS Certified Instructor
AWS SAA & CLF Certificate
Research interests: social network data analysis, applied big data,
human-computer interaction
WSET Level I & II
Travel points hobbyist
2021/10/18 2
About me
Chun-Ming Lai
AWS Certified Solutions Architect - Associate
Nov 03, 2019
Nov 03, 2022
Validation Number WM1ENMBCGEE1QY51
Validate at: http://aws.amazon.com/verification
Chun-Ming Lai
Sep 21, 2019
Sep 21, 2022
Validation Number MGFL2VQCGE1E1PC5
Validate at: http://aws.amazon.com/verification
3. Advertising
Recommendation Engines
Public Impersonal Pages
Public Personal Profiles
Private Groups
Private Personal
Profiles
Group
Messages
1:1
Embrace transparency and restraint
on communication behavior
Amplification Privacy Concern
• Confidentiality,
• Policy
• Law
4. Abuse with an Internal Victim
• Cyber Bullying
• Doxing (exposing private information)
• Child grooming
• Sextortion (extortion and blackmail)
• Terrorist recruiting
Security Issues With Targets (1/2)
5. Abuse with an External Victim
• CSAM Trading (Child Sexual Abuse Material)
• Conspiracy
• Hate Speech
• Anti-Vax (anti-vaccine)
• Disinformation
Security Issues With Targets (2/2)
6. Do you try hard to find the News that you
like to receive?
Or, is there a special “force” to push the
News in front of you?
Ask??
7. 12/06/2019
Media Sources
Social Algorithms
Online Participants
Content
Comments
Reactions
ML is learning how to select the
information you like to read
Addictive Design
A major design change around 2012~2013
8. 12/06/2019 8
$\sum_{\text{edges } e} u_e \, w_e \, d_e$
• $u_e$ is user affinity
• $w_e$ is how the content is weighted
• $d_e$ is a time-based decay parameter
Social Algorithms
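A minimal sketch of the score on this slide, assuming the three listed factors are multiplied per edge and summed over all edges of a story. The `Edge` class, the `feed_score` name, and the numeric values are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    affinity: float  # u_e: user affinity
    weight: float    # w_e: how the content is weighted
    decay: float     # d_e: time-based decay parameter


def feed_score(edges):
    """Sum u_e * w_e * d_e over all edges attached to one story."""
    return sum(e.affinity * e.weight * e.decay for e in edges)


# Hypothetical values: an older, heavily weighted edge and a fresh, lighter one.
edges = [Edge(0.8, 2.0, 0.5), Edge(0.5, 1.0, 1.0)]
score = feed_score(edges)
```

A story with more recent edges from high-affinity users gets a larger score, which is the "force" that pushes it in front of the reader.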
22. Systematic AI: not just the data, but the
adaptive process and ecological system around
all the data
• Systematic means the depth of the domain
From universally collecting all the data to
systematically selecting the data (or knowing what we
don't have)
• We need systematic AI to know what to do
• We cannot easily learn the ecological system with
adversaries present (so we need to filter them out)
Decentralized (at least virtually) information
ownership gives better resistance and robustness
Data-Centric Computing
Takeaways
30. Suitable Targets Problem
Given any post thread p on a social media
platform, predict whether p
contains at least one malicious
comment, via a classifier c: p → {target, nontarget}
32. Definition
Time Series (TS)
• TScreated(post): the time an original article is posted
• TSj: a time period of length j following the time of the original post
• TSfinal: the end of our observation
Accumulated Number of participants (AccNcomment)
• The number of post comments between TS(i-1) and TSi
Discussion Atmosphere Vector (DAV)
33. Example
TScreated(Climate) = 2014-12-19 03:06:42
Suppose j = 5 minutes, final = 120 minutes
DAV(Climate) = [# of comments 03:06:42 ~ 03:11:42 (1st),
# of comments 03:11:42 ~ 03:16:42 (2nd),
…,
# of comments 05:01:42 ~ 05:06:42 (24th)]
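The example above can be sketched in a few lines. This is an illustration, not the authors' implementation: the function name, the half-open windows (a comment exactly on a boundary falls into the later window), and the comment timestamps are assumptions.

```python
from datetime import datetime, timedelta


def discussion_atmosphere_vector(created, comment_times, window_min=5, final_min=120):
    """Count comments in consecutive windows of `window_min` minutes after `created`."""
    n_windows = final_min // window_min
    dav = [0] * n_windows
    width = timedelta(minutes=window_min)
    for t in comment_times:
        if t < created:
            continue  # ignore anything stamped before the post itself
        idx = (t - created) // width  # timedelta // timedelta gives an int
        if idx < n_windows:
            dav[idx] += 1
    return dav


created = datetime(2014, 12, 19, 3, 6, 42)  # TScreated(Climate) from the slide
# Hypothetical comments at 1, 4, 7, 118 and 130 minutes after posting.
comments = [created + timedelta(minutes=m) for m in (1, 4, 7, 118, 130)]
dav = discussion_atmosphere_vector(created, comments)
```

With j = 5 and final = 120 the vector has 24 entries; the comment at 130 minutes falls outside the observation period and is not counted.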
35. Feature Engineering
# of comments, # of likes, # of shares
Spanning time (Last comment time – first comment time)
Temporal Feature with Delta Time window, with a final
observation time
Context-free, don’t need to address Natural Language
Processing
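A small sketch of the context-free features listed on this slide. It uses only counts and timestamps, so no text processing is involved. The function name, the dictionary keys, and the sample values are hypothetical.

```python
from datetime import datetime


def thread_features(comment_times, n_likes, n_shares):
    """Counts plus spanning time in seconds (last comment time minus first comment time)."""
    span = 0.0
    if comment_times:
        span = (max(comment_times) - min(comment_times)).total_seconds()
    return {
        "n_comments": len(comment_times),
        "n_likes": n_likes,
        "n_shares": n_shares,
        "span_seconds": span,
    }


# Hypothetical thread: two comments 23 minutes apart.
times = [datetime(2014, 12, 19, 3, 7, 0), datetime(2014, 12, 19, 3, 30, 0)]
features = thread_features(times, n_likes=10, n_shares=2)
```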
[Figure: Time Elapsed; 1st Comments, 1st Likes, 1st Shares]
44. Conclusion
Predict Suitable Targets successfully with temporal
features
• Attackers: Follow or not?
• Defenders: Deploy resources
Temporal Analysis with different variables
• Influence Ratio, increase or decrease for next time
window?
• 24 hours pattern, link online and offline behavior
45. Outline
Suitable Target, Lifecycle Analysis
Multiple Accounts Detection
Geolocation Identification
Personal words
46. Semi-Supervised Learning on Graphs
Motivation of detecting multiple accounts on FB
[Diagram: Crawler 1, Crawler 2, and Crawler 3 each call the Facebook API]
When calling the Facebook API:
the API gives each crawler a different scope ID. Thus the same user
appears with different scope IDs in the dataset.
50. Multiple Accounts Detection using
Semi-Supervised Learning on Graphs
When crawling data from FB using multiple crawlers, the API gives each crawler a scope ID for a user
instead of the user's primary ID.
For example, a user's primary ID is mohamed.aimane.98. He has multiple scope IDs:
1815396745342476, 1815402648675219, 1815411572007660,
1815468805335270, 1815515615330589, 1815482155333935, 1815488781999939,
1816157185266432. This implies that mohamed's data was crawled by 8 different crawlers.
As a result, in our dataset we know the user names are all mohamed aimane, but
there are many IDs with the same user name.
Problem: Given 2 scope IDs with the same user name, are they the same user (same
primary ID) or not?
Motivation of detecting multiple accounts on FB
52. Main Algorithms
Unsupervised learning using Katz Similarity
P_xy(i) = (x, x1, x2, …, y), a path of length i
u1, u2 are similar if their activity paths are similar
Katz similarity can be computed by:
$S = \sum_{i=1}^{\infty} \beta^i M^i = (I - \beta M)^{-1} - I$
where M is the adjacency matrix of graph G, $\beta$ is a scalar smaller than $1/\lVert M \rVert_2$ to
ensure convergence, and I is the identity matrix.
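A minimal sketch, assuming the standard Katz index S = Σ β^i M^i, which equals (I − βM)^(-1) − I when the series converges. It truncates the series at a maximum path length instead of inverting a matrix, so it needs no external library. The function names, the toy graph, and β = 0.5 are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def katz_similarity(adj, beta, max_len=50):
    """Approximate sum_{i=1..max_len} beta^i * M^i for adjacency matrix M."""
    n = len(adj)
    sim = [[0.0] * n for _ in range(n)]
    power = [[float(i == j) for j in range(n)] for i in range(n)]  # (beta*M)^0
    scaled = [[beta * v for v in row] for row in adj]
    for _ in range(max_len):
        power = mat_mul(power, scaled)  # now (beta*M)^i
        for i in range(n):
            for j in range(n):
                sim[i][j] += power[i][j]
    return sim


# Toy path graph 0 - 1 - 2; its spectral norm is sqrt(2), so beta = 0.5 converges.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
sim = katz_similarity(adj, beta=0.5)
```

Shorter paths contribute more: the adjacent pair (0, 1) scores about 1.0, while the two-hop pair (0, 2) scores about 0.5.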
54. Katz matrix is
1 0.9
0.9 1
0.2 0.3
0.5 0.5
0.2 0.6
0.3 0.5
1 0.8
0.8 1
The threshold we use is 0.8
Then the 1st node and the 2nd node belong to the same user, and the 3rd and 4th
nodes belong to the same user; the others do not.
Example of Algorithm 1
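The thresholding step can be sketched as follows. The 4×4 matrix is one possible reading of the values on this slide, arranged here only for illustration; the function name is hypothetical, and requiring both directions to reach the threshold is an assumption.

```python
def same_user_pairs(sim, threshold=0.8):
    """Return 1-based node pairs whose similarity reaches the threshold in both directions."""
    n = len(sim)
    return [(i + 1, j + 1)
            for i in range(n)
            for j in range(i + 1, n)
            if sim[i][j] >= threshold and sim[j][i] >= threshold]


# Assumed arrangement of the slide's values (illustrative only).
katz = [
    [1.0, 0.9, 0.2, 0.6],
    [0.9, 1.0, 0.3, 0.5],
    [0.2, 0.3, 1.0, 0.8],
    [0.5, 0.5, 0.8, 1.0],
]
pairs = same_user_pairs(katz, threshold=0.8)
```

Only the pairs (1, 2) and (3, 4) pass the 0.8 threshold, matching the conclusion on the slide.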
56. Classical ML Tasks in Networks
• Node Classification
• Predict type of a node
• Link Prediction
• Predict friends
• Community Detection
• Network Similarity
• How similar are two networks
59. Node2vec(3/4)
• Two Parameters:
• Return parameter p:
• Return back to the previous node
• In-out parameter q:
• Moving outwards (DFS) vs. inwards (BFS)
• The ratio of BFS vs.DFS
• Biased 2nd-order random walks explore network neighborhoods.
Parameters
60. Node2vec(4/4)
• Simulate r random walks of length l starting from each node u
• Optimize the node2vec objective using Stochastic Gradient Descent
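A sketch of the walk simulation step only; the SGD optimization is not shown. It assumes the usual node2vec bias: weight 1/p for returning to the previous node, 1 for a neighbor that is also adjacent to the previous node, and 1/q for moving further away. The toy graph, the function names, and the values of r, l, p, and q are hypothetical.

```python
import random


def node2vec_walk(graph, start, length, p, q, rng):
    """One biased 2nd-order random walk; `graph` maps node -> set of neighbors."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif nxt in graph[prev]:
                weights.append(1.0)      # stay close to prev (BFS-like)
            else:
                weights.append(1.0 / q)  # move outward (DFS-like)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def simulate_walks(graph, r, length, p, q, seed=0):
    """Simulate r walks of the given length starting from each node."""
    rng = random.Random(seed)
    return [node2vec_walk(graph, u, length, p, q, rng)
            for _ in range(r) for u in sorted(graph)]


graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
walks = simulate_walks(graph, r=2, length=5, p=1.0, q=0.5)
```

The resulting walks play the role of sentences for a skip-gram style objective, which is where the embeddings come from.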
61. Embedding for node 1: (0.1, 0.3, 0.2, 0.4); embedding for node 2: (0.2, 0.3, 0.2, 0.4)
We sample some ground truth, e.g. that node 1 and node 2 belong to the same user,
etc.
L looks like: ((1,2), 1), ((1,3), 0), ((2,3), 0), ((2,4), -1), ((3,4), -1), …
X comes from the embeddings: for example, ((1,2), (0.1, 0, 0, 0)), …
Then we feed X and L into a label spreading model and obtain: the 1st and 2nd nodes
belong to the same user, the 3rd and 4th nodes belong to the same user,
and the others do not.
Example of Algorithm 3
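A sketch of how X and L might be assembled before the label spreading step. It assumes the pair feature is the element-wise absolute difference of the two embeddings, which reproduces the (0.1, 0, 0, 0) example for pair (1, 2). The embeddings for nodes 3 and 4 and the helper name are hypothetical, and the model itself is not run here.

```python
from itertools import combinations


def pair_features(embeddings):
    """Map each node pair to the element-wise absolute difference of its embeddings."""
    feats = {}
    for a, b in combinations(sorted(embeddings), 2):
        feats[(a, b)] = tuple(round(abs(x - y), 6)
                              for x, y in zip(embeddings[a], embeddings[b]))
    return feats


embeddings = {
    1: (0.1, 0.3, 0.2, 0.4),  # from the slide
    2: (0.2, 0.3, 0.2, 0.4),  # from the slide
    3: (0.9, 0.1, 0.7, 0.2),  # hypothetical
    4: (0.8, 0.1, 0.7, 0.2),  # hypothetical
}
# Sampled ground truth: 1 = same user, 0 = different users; missing pairs are unlabeled.
known = {(1, 2): 1, (1, 3): 0, (2, 3): 0}

X = pair_features(embeddings)
pairs = sorted(X)
L = [known.get(pair, -1) for pair in pairs]  # -1 marks an unlabeled pair
```

The -1 convention matches what scikit-learn's `LabelSpreading` expects for unlabeled samples, so the rows of X (in `pairs` order) and L could be passed to its `fit` method.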
63. Experiments and Evaluation
Comparison among the Three Methods
Two simple datasets : dataset 1: 188 nodes and 262 activities (links);
dataset 2: 4188 accounts and 6715 activities(links).
64. Outline
Suitable Target, Lifecycle Analysis
Multiple Account Detection
Geolocation Identification
Personal words
65. Page Information and Page-like Graph
[Figure: page-like graph linking Sports Illustrated, Golden State Warriors, Oakland Museum, and Giving Tuesday with "like" edges]
Field        Example
Page ID      47657117525
Name         Golden State Warriors
Category     Sports Team
Country      United States
Fan Count    11,019,236
Description  The Official Facebook page of the Golden State Warriors
66. • Facebook public pages are public profiles used by local businesses,
companies, organizations or public figures
Likes
Promoting other pages to community participants
67. Data Collection
Facebook Graph API version 2.8 was used to collect our
data [1]
• 38,831,367 pages (for this work)
• 2,430,873 US
• 12,685,090 other countries
• 23,715,404 unknown
[1] https://developers.facebook.com/docs/graph-api/reference/page
68. Majority Vote Algorithm
• Location is designated as state
information in this scenario
• The location labeling is determined by
the most votes
• Overall accuracy is only 59.4%
• This algorithm works well in page nationality
prediction task, with 90.25% accuracy
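A minimal sketch of a majority vote, assuming the votes come from adjacent pages whose location is already known. The slide does not say which neighbors vote, so the data layout, names, and sample pages below are hypothetical.

```python
from collections import Counter


def majority_vote(page, neighbors, known_location):
    """Label a page with the most common known location among its adjacent pages."""
    votes = Counter(known_location[n]
                    for n in neighbors.get(page, ())
                    if n in known_location)
    if not votes:
        return None  # no neighbor has a known location
    return votes.most_common(1)[0][0]


# Hypothetical page P connected to four pages, three of them with known states.
neighbors = {"P": ["A", "B", "C", "D"]}
known_location = {"A": "CA", "B": "CA", "C": "NY"}
label = majority_vote("P", neighbors, known_location)
```

On a tie, `Counter.most_common` returns the location that was counted first, so a real implementation would need an explicit tie-breaking rule.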
69. Baseline Algorithm
Utilizes locality of states to find pages
belonging to their corresponding states
• Pick out anchored pages, with local property, as
multiple seeds to start BFS from
Target classifier: 51 classes
• 50 classes of US states and a class of ”others (OT)”
State Distance Vector (SDV)
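A sketch of how an SDV could be built with BFS from the anchored pages, following the speaker notes' IHOP (inward edges) and OHOP (outward edges). The concatenated [IHOP, OHOP] layout per seed, the -1 marker for unreachable pages, and the toy graph are assumptions.

```python
from collections import deque


def bfs_hops(adj, seed):
    """Hop distance from `seed` to every reachable node; `adj` maps node -> list of nodes."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def state_distance_vector(page, out_edges, seeds, unreachable=-1):
    """Concatenate [IHOP, OHOP] for each anchored seed page."""
    in_edges = {}
    for u, targets in out_edges.items():
        for v in targets:
            in_edges.setdefault(v, []).append(u)
    sdv = []
    for s in seeds:
        ohop = bfs_hops(out_edges, s).get(page, unreachable)  # follow outward edges
        ihop = bfs_hops(in_edges, s).get(page, unreachable)   # follow inward edges
        sdv.extend([ihop, ohop])
    return sdv


# Hypothetical like graph: seed_CA -> PageA -> PageB -> seed_NY
out_edges = {"seed_CA": ["PageA"], "PageA": ["PageB"], "PageB": ["seed_NY"]}
sdv = state_distance_vector("PageB", out_edges, ["seed_CA", "seed_NY"])
```

With the 51 anchored pages described in the notes, this layout would give a 102-entry vector per page. A production version would run one BFS per seed and reuse the result across all pages.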
71. Anchor Page Selection (1/2)
Effectiveness of BFS-based algorithms
• It depends on anchored page selection
Anchored pages have to be local such that SDV can provide authentic
tendency of a page’s locality
Suitable examples (focusing on local communities)
• state universities, government, park or police organizations
Ill-suited examples (popular and thus having global impact)
• NBA, MLB, or NFL sports teams
72. • We adopt all subsidiary pages of "OnlyInYourState.com" as
a set of anchored pages
• It has a distinct page for each state
• Each subsidiary page mostly
connects local communities
Anchor Page Selection (2/2)
Page Name                    Page ID
Only In Alabama              783744898386760
Only In Alaska               686107314826906
Only In Southern California  1840349052857006
Only In Northern California  856450181102963
Idaho Only                   435099846671531
Only In New York             386608421546055
Only In Virginia             1560515737540492
Only In West Virginia        1509709509286532
Only In Wisconsin            1390297064627420
Only In Wyoming              1724174364476381
74. Advanced Algorithm
Baseline algorithm’s drawback
• A local page can have a few connections with pages that are far away
• This kind of connection noise can substantially reduce prediction accuracy
State Neighborhood Probability (SNP)
Both SDV and SNP are taken as feature vectors for ML models
• Utilize locality and neighborhood context for better identification
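A sketch of one neighborhood probability, assuming it is the share of a page's inward edges that come from each region, i.e. IE(Page, Ri) divided by the total over all regions. The slides do not show the formula, so this normalization, the function name, and the sample data are assumptions.

```python
from collections import Counter


def neighborhood_probability(neighbor_regions, regions):
    """Fraction of edges per region; `neighbor_regions` holds one region label per edge."""
    counts = Counter(r for r in neighbor_regions if r in regions)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(regions)
    return [counts[r] / total for r in regions]


regions = ["CA", "NY", "OT"]            # shortened class list for illustration
inward = ["CA", "CA", "NY", "CA"]       # hypothetical regions of pages liking this page
inp = neighborhood_probability(inward, regions)
```

A single far-away connection shifts these fractions only slightly, which is one way to see why neighborhood context can dampen the connection noise described above.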
75. Dataset
California accounts for 20% of all US pages, and half of all
pages (49.49%) are located in top 5 states
• California, New York, Florida, Illinois, and Texas
Sometimes it’s hard to evaluate “spamming”
New
SFW – Likefarm? Is that ContentFarm?
Every principle has its mind, reason, everything has its causality
SFW – we need to have a better organized presentation for problems.
SFW – the defenders concern might be different – we need to consider the risk factor
Shelf Life, skim messages, can “catch” ones eyes only , enlarge the influence
https://www.facebook.com/barackobama/posts/10151673679836749
https://www.facebook.com/cnn/posts/313652498762911
SFW – ask the audience “which post has higher prob to be attacked”?
SFW – watch out for the transition into this slide.
SFW – do you want to provide one example for all or most of the slides?
SFW – I feel that you should give an example to explain.
SFW – Definition**s**
SFW – how to interpret 10 minutes? (what is the total time and attack time)?
Naive Bayes: DAV elements are not independent of each other
AdaBoost: not good with outliers; number of estimators = 50 and learning rate = 1.
Decision Tree: good for social network data
We set minimum samples split = 2 and minimum samples leaf = 1; as for depth, nodes are expanded until all leaves are pure.
1. IR is learnable?
2. No difference between Light and Critical malicious URLs, since their performance is quite similar
3. Increase recall result is high
SFW – explain “Exact time after last attack”
Why do you choose similarity
Fast
Read the slide
Our first thought is majority vote algorithm
where IHOP(Page,Si) denotes hop distance between page and seed Si, using inward edges as connection for BFS;
OHOP(Page, Si) denotes hop distance between page and seed Si, using outward edges as connection for BFS.
In particular, since California is much larger than other states in perspectives of population and economy,
“OnlyInYourState.com” splits California into Northern and Southern regions, as shown in Table.
Therefore, both "Only In Northern California" and "Only In Southern California" are used as anchored pages to calculate IHOP(Page, Si) and OHOP(Page, Si),
in addition to the other forty-nine anchored pages. Hence N_{anchored pages} is set to 51.
Furthermore, since ”Only In Idaho” had been registered, OnlyInYourState.com named its Idaho counterpart as ”Idaho Only” instead.
In general, more anchored pages involved would enlarge the BFS coverage of pages.
This probability is not high; however, the baseline BFS-based ML algorithm only cares about the hop distances to the anchored pages.
where INP(Page,Ri) denotes inward neighborhood location probability between this page and the adjacent pages belonging to the region Ri;
where IE(Page,Ri) is the number of inward edges between this page and the adjacent pages belonging to the region Ri;
We took the pages with declared location information of country and city as ground truth data.
Few pages are excluded because their city names exist in multiple states, which can result in ambiguous city-to-state mapping.
There are 29,849 cities in total in the US.
The training set utilized 80% of data while test set employed the rest.
Since number of classes is rather large, Random Forest classifier is preferably adopted, instead of Gradient Boosting classifier [23].
The default parameter sets were applied when using the implementations available in the scikit-learn package [54].
As shown in Table 4.2, the precision, recall, f1 score of the Random Forest classifier are at least 20% better than the counterparts of the Naive Bayes classifier and the Adaboost classifier.
Thus in the following, we only present results done with the Random Forest classifier.
The baseline BFS-based ML algorithm with the Random Forest classifier achieved 69% accuracy, which is 10% better than the accuracy of the majority vote algorithm.
With the addition of SNP, the advanced BFS-based ML algorithm achieved 89% prediction accuracy, a 20% improvement over the baseline.