Understanding Big Data Analytics for Research Activity

BIG DATA ANALYTICS
UNDERSTANDING FOR RESEARCH ACTIVITY
Dr. Andry Alamsyah
Asosiasi Ilmuwan Data Indonesia
School of Economics and Business,Telkom University

Research Field :
Social Computing, Social Network, Complex Network / Network Science, Computational Social
Science, Data Analytics, Big Data, Data Mining, Graph Theory, Disruptive Innovation / Disruptive
Economy, ICT Entrepreneurial Business, Data / Information Business
Andry Alamsyah
• Researcher / Data Scientist
• Director of Digital Business Ecosystem Research Centre
• Chief and Founder of Lab. Social Computing & Big Data
• Chairman & Founder Indonesian Data Scientist Society (AIDI)
email andry.alamsyah@gmail.com
blog andrya.staff.telkomuniversity.ac.id
repository telkomuniversity.academia.edu/andryalamsyah
repository researchgate.net/profile/Andry_Alamsyah
linkedin linkedin.com/andry.alamsyah
twitter twitter.com/andrybrew
Education :
S1 : Mathematics - ITB, Topic: Statistics
S2 : Informatics - UPJV, France, Topic: Information System, and Multimedia
S3 : Electro and Informatics - ITB, Topic: Social Network, and Big Data
Links :
Introduction

• Background and Motivation
• Big Data Definition and Related Fields
• Understanding Pattern
• Data Analytics / Machine Learning Fundamentals (Prediction and
Recommendation)
• Social Media Analytics (by Case Study)
• Conclusion
• Working on Your Computer (Machine Learning Practice)
Agenda

> information overload,
> technological based society
> acquire new value => new culture
> empowered individuals
> more data available
> building contextual story / search
Digital Ocean

• Industry 4.0 -> cyber-physical systems -> enabling humans to produce
large-scale data -> human behaviour quantification
• Key technologies: data, computational power, and connectivity; analytics
and intelligence; human-machine interaction; advanced production methods
the environment
Deloitte, Industry 4.0
Industry 4.0

Cheap Changes Everything
efficient economy
new value proposition
• cutting through the BIG DATA hype
• cheap means everywhere
• cheap creates value
• from cheap to strategy
complex
human behaviour
market uncertainty
business
sustainability
disruptive economic
coopetitive, cooperative,
competitive
business ecosystem
/ platform
programmable economy
event driven
API economy
toward large-scale and massive
socio-economic impact
Industry 4.0

Big Data Definition and
Related Fields

Big Data Definition
• a term => describes extremely large amounts of structured and
unstructured data

• the activity => capture / storage / processing / sharing / reporting of
data => beyond the ability of legacy software tools and hardware
infrastructure

• related to many branches of "science" => data analytics, data science,
machine learning, artificial intelligence, IoT, and many more

• the application => in many fields => efficient, cost-effective, faster, and
more accurate decision making
Gigabyte 10^9 = 1,000,000,000
Terabyte 10^12 = 1,000,000,000,000
Petabyte 10^15 = 1,000,000,000,000,000
Exabyte 10^18 = 1,000,000,000,000,000,000
Zettabyte 10^21 = 1,000,000,000,000,000,000,000
                   1990          2010
Storage            1,400 MB      1 TB
Transfer speed     4.5 MB/s      100 MB/s
Full-drive read    ~5 minutes    ~3 hours
Hadoop: 100 drives working at the same time can read 1 TB of data in about 2 minutes
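The read times on this slide follow from dividing capacity by transfer speed. The sketch below reproduces that arithmetic; the function name is illustrative, and it assumes 1 TB = 1,000,000 MB (decimal units) and that the 100 drives each hold an equal share of the data and read in parallel without overhead.

```python
def read_time_seconds(capacity_mb: float, speed_mb_per_s: float, drives: int = 1) -> float:
    """Seconds needed to read capacity_mb when `drives` disks stream in parallel."""
    return capacity_mb / (speed_mb_per_s * drives)

# Figures taken from the slide; 1 TB is treated as 1,000,000 MB (assumption).
t_1990 = read_time_seconds(1_400, 4.5)              # one 1990 drive
t_2010 = read_time_seconds(1_000_000, 100)          # one 2010 drive
t_hadoop = read_time_seconds(1_000_000, 100, 100)   # 100 drives in parallel

print(f"1990:   {t_1990 / 60:.1f} minutes")
print(f"2010:   {t_2010 / 3600:.1f} hours")
print(f"Hadoop: {t_hadoop / 60:.1f} minutes")
```

Capacity grew roughly 700-fold while transfer speed grew only about 22-fold, which is why parallel reads across many drives became attractive.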

Volume, Variety, and Velocity are the "essential" characteristics of Big Data
Veracity and Value are the "quality" of Big Data
The 5 V's

DATA ANALYTICS
- the discovery, interpretation, and communication of meaningful patterns in data (Wikipedia)
- the process of uncovering hidden patterns, unknown correlations, and other useful information that can help organisations make
more informed business decisions
SOURCE
review, opinion,
historical data,
conversation,
network friendship,
CCTV, Vlog,
location tagging,
etc
BIG DATA
large, fast, complex
the 5V’s data
DATA SCIENCE
the science to extract
knowledge / pattern from data
SOCIAL COMPUTING
quantification of human / social
behaviour
INSIGHT
market segmentation, risk analytics
information dissemination,
recommended investment, fraud
detection, personalised adv, customer
acquisition and retention, purchase
behaviour, early detection event,
brand awareness, etc
opportunity activity
methodology
benefit
application
Big Data Related Terms (Use Case)

Data Analytics
• The discovery, interpretation, and communication of meaningful patterns in data (Wikipedia)

• The process of uncovering hidden patterns, unknown correlations, and other useful information that
can help organisations make more informed business decisions. Types: predictive, descriptive, diagnostic,
prescriptive.

Predictive Analytics
• Study the past if you want to study the future (Confucius)

• Predictive analytics is the art of building and using models that make predictions
based on patterns extracted from historical data. Predictive analytics applications
include price prediction, dosage prediction, risk assessment, propensity/likelihood
modelling, diagnosis, and document classification.
• Prediction is the assignment of a value to any unknown variable.

• A model is trained to make predictions based on a set of historical examples (we use
machine learning).

Data Science
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to
extract knowledge and insights from structured and unstructured data.

CRISP-DM
CRISP-DM (Cross-Industry Standard Process for Data Mining) is an open standard process model that
describes common approaches used by data mining experts. It is the most widely used analytics model.

Structure Data Type
Structured: a high degree of organisation, such as a relational database
Column | Value
Patient | Andry Alamsyah
Date of Birth | 12/07/1995
Date Admitted | 02/03/2019
VS
Unstructured: information that is difficult to organise using traditional mechanisms
"The patient came in complaining of chest pain, shortness of breath,
and lingering headaches.. Smokes 2 packs a day.. Family history of
heart disease.. Has been experiencing similar symptoms for
the past 12 hours…"

Working with Unstructured Data
brand A brand B

Understanding Pattern
Structured Data
Mapping Position

Unstructured Data
Friendship Network

Unstructured Data
Growth Friendship Network

Unstructured Data
Conversational Network

Structured Data
Regional Economic Value
based on Checkin
Mechanism

How Can (Big) Data Analytics Help?
by describing the phenomenon,
by predicting the value,
by estimating the future outcome,
by optimising the resources and the
decision,
by simulating all the possible scenarios ..

Data Analytics / Machine Learning
Fundamentals (Prediction and
Recommendation)

• Machine learning is defined as an automated process that extracts
patterns from data to build the models used in predictive analytics
applications.

• A branch of artificial intelligence, concerned with the design and
development of algorithms that allow computers to evolve behaviours
based on empirical data.
Machine Learning

Machine Learning
Machine Learning is an idea to learn from
examples and experience, without being
explicitly programmed.
Instead of writing code, we feed data to the
generic algorithm, and it builds logic based
on the data given.
• Traditional Programming: Data + Program -> Computer -> Output
• Machine Learning: Data + Output -> Computer -> Program

Machine Learning
Machine learning (ML) is the science of
getting computers to act without being
explicitly programmed. ML has given us self-
driving cars, practical speech recognition,
effective web search, and a vastly improved
understanding of the human genome. ML is
pervasive today, we probably use it dozens of
times a day without knowing it. It is the best
way to make progress towards human-level
AI. (Stanford/Coursera)
ML is a type of artificial intelligence (AI)
that provides computers with the ability to
learn without being explicitly
programmed. ML focuses on the
development of computer programs that can
teach themselves to grow and change when
exposed to new data. (whatis.com)

Machine Learning in Business
Finance and Banking
• Credit scoring
• Fraud detection
• Risk Analysis
• Portfolio Optimization
• Client Analysis
• Trading Exchange Forecasting
Retail and E-Commerce
• Price Optimization
• Recommendation
• Predictive Inventory Planning
• Fraud Detection
• Customer Segmentation
Manufacturing
• Predictive Maintenance or Condition
Monitoring.
• Warranty reserve estimation
• Demand forecasting
• Process Optimization
Marketing and Sales
• Market and Customer Segmentation
• Price Optimization
• Customer Churn Analysis
• Customer lifetime value prediction
• Sentiment Analysis in Social Networks

1. Formula / Function
• T = 0.48O + 0.23TL + 0.5D
2. Decision Tree
3. Correlation or Association
4. Rule
• IF IPS3=2.8 THEN graduate_ontime
5. Cluster
Output / Pattern / Model / Knowledge

Learning Illustration
[Figure: data points labelled A and B (Data) -> Two Possible Solutions (1 and 2)]

• It is based on a labeled training set.
• The class of each piece of data in the
training set is known.
• Class labels are pre-determined
and provided in the training phase.
Supervised Learning
[Figure: data points labelled A and B, each assigned to one of two classes]
"What is the class of this data point?"
Task performed: classification, pattern recognition

Supervised Learning
•Prediction methods are commonly referred to as
supervised learning. Supervised methods are
thought to attempt the discovery of the
relationships between input attributes and a target
attribute.
•A training set is given and the objective is to form a
description that can be used to predict unseen
examples.

Supervised Learning
Problems :
• Classification
• The domain of the target attribute is finite and categorical.
• A classifier must assign a class to an unseen example.
• Regression
• The target attribute takes continuous (infinitely many) values.
• The goal is to fit a model that learns the target attribute as a function of the input
attributes.
• Time Series Analysis
• Making predictions over time.

Supervised Learning
Class / Label / Target Attribute / Feature
Nominal
Numeric

Unsupervised Learning
• Input: a set of patterns P from an n-dimensional space S, but little or no
information about their classification, evaluation, interesting features,
etc.
It must learn these by itself! :)
• Tasks:
- Clustering - group patterns based on similarity
- Vector Quantisation - fully divide S into a small set of regions (defined by
codebook vectors) that also helps cluster P
- Feature Extraction - reduce the dimensionality of S by removing unimportant
features (i.e. those that do not help in clustering P)
• There is no supervisor and only input data is available.
• The aim is to find regularities, irregularities, relationships,
similarities and associations in the input.

Problems :
• Clustering
• Association Rules
• Pattern Mining
• It is adopted as a more general term than frequent pattern mining or
association mining
• Outlier Detection
• It is the process of finding data whose behaviour is very different
from what is expected (outliers or anomalies)

Attribute/Feature

Background :
• How to learn a new skill
• Learning and intelligence
• Interaction with environment
• Goal-oriented learning
• Agent – Environment interactions
• Activities
- What to do
- How to map situations to actions
- Process positive and negative rewards
Reinforcement Learning
Reinforcement learning (RL) is an area of machine learning concerned with
how software agents ought to take actions in an environment so as to maximise
some notion of cumulative reward.
Basic reinforcement learning is modeled as a Markov decision process, which is often stochastic.

The Analogy
• A child learns to walk
• The child is an agent trying to manipulate the
environment
• The child is taking actions (state 1, state 2, state
3, and so on)
• Positive rewards when able to walk
• Negative rewards when not able to walk

Various Practical applications of Reinforcement Learning
• RL can be used in robotics for industrial automation.
• RL can be used in machine learning and data processing
• RL can be used to create training systems that provide custom instruction and materials according
to the requirement of students.

Data Preparation (CRISP-DM)
Data Preprocessing
• Measures for data quality: A multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, …
• Timeliness: timely update?
• Believability: how much are the data trusted to be correct?
• Interpretability: how easily the data can be understood?

1. Data Cleaning
a. Fill in missing values
b. Smooth noisy data
c. Identify or remove outliers
d. Resolve inconsistencies
2. Data Reduction
a. Dimensionality reduction
b. Numerosity reduction
c. Data compression
3. Data Transformation and Data Discretisation
a. Normalisation
b. Concept hierarchy generation
4. Data Integration
a. Integration of multiple databases or files
Data Preprocessing Tasks
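Two of the tasks listed above, filling in missing values and normalisation, can be sketched in a few lines. This is a minimal illustration rather than a complete preprocessing pipeline: the function names and sample values are hypothetical, missing values are assumed to be encoded as None, mean imputation is only one of several possible fill strategies, and min-max scaling is only one form of normalisation.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalise(values):
    """Data transformation: rescale values linearly into the range [0, 1].

    Assumes at least two distinct values; a constant column would divide by zero.
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

raw = [10, None, 30, 20]          # hypothetical column with one missing value
cleaned = fill_missing_with_mean(raw)
scaled = min_max_normalise(cleaned)
print(cleaned, scaled)
```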

Common Data Analytics Rules
Task | Description | Algorithms | Examples
Classification | Predict whether a data point belongs to one of the predefined classes. Prediction is based on learning from a known dataset. | Decision tree, neural network | Bucketing new customers into one of the known customer groups
Regression | Predict the numeric target label of a data point. Prediction is based on learning from a known dataset. | Linear regression, logistic regression | Estimating insurance premiums
Clustering | Identify natural clusters within the dataset based on inherent properties of the data. | K-Means, density-based clustering | Finding customer segments in a company based on transaction and call data
Association Rules | Identify relationships within an item set based on transaction data. | FP-Growth algorithm, Apriori | Finding cross-selling opportunities for a retailer based on transaction purchase history
Anomaly Detection | Predict whether a data point is an outlier compared to the other data points in the dataset. | Distance-based, density-based, Local Outlier Factor (LOF) | Detecting fraudulent credit card transactions

Estimation
Customer | Order | Number of Traffic Lights | Distance | Travel Time (label)
1 | 3 | 3 | 3 | 16
2 | 1 | 7 | 4 | 20
3 | 2 | 4 | 6 | 18
4 | 4 | 6 | 8 | 36
...
1000 | 2 | 4 | 2 | 12
Learning: model built using an estimation method (linear regression)
Knowledge: Travel Time = 0.48O + 0.23TL + 0.5D
Pizza Delivery Time
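Once a linear model has been learned, using it is just a matter of plugging new inputs into the formula. The sketch below applies the coefficients exactly as printed on the slide; the function name is illustrative, the units are not stated in the source, and the visible table rows are too few to re-derive the coefficients, so no claim is made that the outputs match the Travel Time column.

```python
def predict_travel_time(orders: float, traffic_lights: float, distance: float) -> float:
    """Apply the slide's learned formula: Travel Time = 0.48*O + 0.23*TL + 0.5*D."""
    return 0.48 * orders + 0.23 * traffic_lights + 0.5 * distance

# Hypothetical new delivery: 3 orders, 3 traffic lights, distance 3.
estimate = predict_travel_time(3, 3, 3)
print(round(estimate, 2))
```

The point of the "knowledge" step is that the formula, not the 1000 historical rows, is what gets deployed.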

Predictions
stock price dataset in
time series format
label
prediction using
Neural Network
Learning
prediction plot

Classification
NIM | Gender | National Exam Score | School of Origin | IPS1 | IPS2 | IPS3 | IPS4 | ... | Graduated on Time (label)
10001 | M | 28 | SMAN 2 | 3.3 | 3.6 | 2.89 | 2.9 | | Yes
10002 | F | 27 | SMA DK | 4.0 | 3.2 | 3.8 | 3.7 | | No
10003 | F | 24 | SMAN 1 | 2.7 | 3.4 | 4.0 | 3.5 | | No
10004 | M | 26.4 | SMAN 3 | 3.2 | 2.7 | 3.6 | 3.4 | | Yes
...
11000 | M | 23.4 | SMAN 5 | 3.3 | 2.8 | 3.1 | 3.2 | | Yes
Learning: model built using the C4.5 classification method

input : golf playing recommendation
output (rules) :
If outlook = sunny and humidity = high then play = no 
If outlook = rainy and windy = true then play = no 
If outlook = overcast then play = yes 
If humidity = normal then play = yes 
If none of the above then play = yes
output (tree) :
Classification
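The rule output above is an ordered list: the first rule whose condition matches decides the answer. A minimal sketch of that rule set as a function follows; the function name and the encoding of `windy` as a boolean are assumptions made for illustration.

```python
def play_golf(outlook: str, humidity: str, windy: bool) -> str:
    """Evaluate the slide's rules in order; the first matching rule wins."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # none of the above

print(play_golf("sunny", "high", False))     # first rule applies
print(play_golf("overcast", "high", True))   # third rule applies
```

Because the last three rules all return "yes", the only way to get "no" is through one of the first two rules.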

Clustering
dataset without label
learning using K-means
clustering methods
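A minimal sketch of the K-means idea: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. The data points, starting centroids, and fixed iteration count are hypothetical; production implementations usually choose starting centroids randomly and stop when assignments no longer change.

```python
from math import dist  # Euclidean distance, Python 3.8+

def k_means(points, centroids, iterations=10):
    """Alternate the assignment step and the centroid update step."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two visually obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```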

Association
learning using FP-Growth
association methods
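FP-Growth itself is too long to show here, but the quantities it searches for are easy to compute directly. The sketch below is not FP-Growth: it simply scores one candidate rule by support, confidence, and lift over a hypothetical list of shopping baskets, with all names chosen for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    count_a = sum(1 for t in transactions if antecedent <= t)
    count_c = sum(1 for t in transactions if consequent <= t)
    count_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = count_both / n               # how often both sides occur together
    confidence = count_both / count_a      # how often the rule holds when it applies
    lift = confidence / (count_c / n)      # above 1 suggests a positive association
    return support, confidence, lift

# Hypothetical baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
support, confidence, lift = rule_metrics(baskets, {"bread"}, {"butter"})
print(support, confidence, lift)
```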

1. Estimation:
- Linear Regression, Neural Network, Support Vector Machine, etc
2. Prediction/Forecasting:
- Linear Regression, Neural Network, Support Vector Machine, etc
3. Classification:
- Naive Bayes, K-Nearest Neighbor, C4.5, ID3, CART, Linear Discriminant
Analysis, Logistic Regression, etc
4. Clustering:
- K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means, etc
5. Association:
- FP-Growth, Apriori, Coefficient of Correlation, Chi Square, etc
Algorithms in Data Analytics

Based on information theory; for example, a decision tree model is
induced using the concepts of entropy and information gain.
Information-Based Learning
Is it a man?
Does the person wear glasses?
Similarity-Based Learning
Training
Records
Test Record
Compute
Distance
Basic idea => if it walks like a duck and quacks like a duck, then it's
probably a duck
The best way to make a prediction is simply to look at what has worked well in the past and
predict the same thing again; examples are the k-NN and k-means algorithms.
Similarity can be represented as a distance (e.g. Euclidean).
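The "compute distance" step can be sketched as a small k-nearest-neighbour classifier: measure the Euclidean distance from the test record to every training record, take the k closest, and let them vote. The training points, labels, and the choice k=3 are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    ranked = sorted(zip(train_points, train_labels), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training records with two classes.
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_points, train_labels, (2, 2)))
print(knn_predict(train_points, train_labels, (8.5, 8.5)))
```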

Probability-based prediction approaches are heavily based on Bayes' Theorem
Probability-Based Learning
• A probabilistic framework for solving classification problems
• Conditional Probability / Bayes' Theorem
$$P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)}$$
• Given:
• A doctor knows that meningitis causes stiff neck 50% of the time
• Prior probability of any patient having meningitis is 1/50,000
• Prior probability of any patient having stiff neck is 1/20
• If a patient has stiff neck, what's the probability he/she has meningitis?
$$P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$$
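The meningitis calculation can be checked numerically. This is a minimal sketch that plugs the slide's three probabilities into Bayes' Theorem; the function and argument names are illustrative.

```python
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Bayes' Theorem: P(C|A) = P(A|C) * P(C) / P(A)."""
    return likelihood * prior / evidence

p_meningitis_given_stiff_neck = posterior(
    likelihood=0.5,        # P(S|M): meningitis causes stiff neck 50% of the time
    prior=1 / 50_000,      # P(M): prior probability of meningitis
    evidence=1 / 20,       # P(S): prior probability of stiff neck
)
print(f"{p_meningitis_given_stiff_neck:.4f}")
```

Even though stiff neck is common among meningitis patients, the posterior stays small because the prior for meningitis is so low.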

Performs a search for the set of parameters of a parameterised model that minimises the
total error across the predictions made by that model with respect to a set of training
instances. Examples: multivariable linear regression with gradient descent, support
vector machine
Error-Based Learning
[Figure: two candidate separating hyperplanes, B1 and B2, with margin boundaries b11, b12 and b21, b22]
Linear Regression
Find the hyperplane that maximizes the margin

=> B1 is better than B2
Support Vector Machine
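The parameter search described above can be sketched with gradient descent on a one-variable linear regression. The data, learning rate, and epoch count below are hypothetical; the data were generated from y = 2x + 1, so the search should recover a slope near 2 and an intercept near 1.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))  # d(MSE)/dw
        grad_b = 2 / n * sum(errors)                             # d(MSE)/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical training instances lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

Each step moves the parameters a little in the direction that reduces the total error, which is the sense in which the model is "error-based".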

Model Evaluation
1.Estimation:
- Error: Root Mean Square Error (RMSE), MSE, MAPE, etc
2.Prediction/Forecasting
- Error: Root Mean Square Error (RMSE) , MSE, MAPE, etc
3.Classification:
- Confusion Matrix: Accuracy
- ROC Curve: Area Under Curve (AUC)
4.Clustering:
- Internal Evaluation: Davies–Bouldin index, Dunn index,
- External Evaluation: Rand measure, F-measure, Jaccard index,
Fowlkes–Mallows index, Confusion matrix
5.Association:
- Lift Charts: Lift Ratio
- Precision and Recall (F-measure)

learning and evaluation process; confusion matrix
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a           b
CLASS    Class=No     c           d
a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
$$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision } (p) = \frac{a}{a + c}$$
$$\text{Recall } (r) = \frac{a}{a + b}$$
$$\text{F-measure } (F) = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$$
Model Evaluation
evaluation metric
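The four metrics can be computed directly from the confusion matrix cells. The sketch below uses hypothetical counts, and the function name is illustrative; the argument order follows the a, b, c, d cells of the matrix.

```python
def classification_metrics(tp: int, fn: int, fp: int, tn: int):
    """Return (accuracy, precision, recall, f_measure) from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Hypothetical counts: a=40 (TP), b=10 (FN), c=5 (FP), d=45 (TN).
accuracy, precision, recall, f_measure = classification_metrics(tp=40, fn=10, fp=5, tn=45)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F={f_measure:.2f}")
```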

Model Evaluation
• Learning curve shows how
accuracy changes with varying
sample size
• Requires a sampling schedule for
creating learning curve:
- Arithmetic sampling
- Geometric sampling
• Effect of small sample size:
- Bias in the estimate
- Variance of estimate

Increase Coverage
Experiment | Dataset | Accuracy
1 | | 93%
2 | | 91%
3 | | 90%
4 | | 93%
5 | | 93%
6 | | 91%
7 | | 94%
8 | | 93%
9 | | 91%
10 | | 90%
Average Accuracy | | 92%
Orange box: k-th subset (testing data)
K-Fold Cross-Validation
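In k-fold cross-validation the dataset is cut into k subsets; each subset takes one turn as the testing data while the rest is used for training, and the k accuracies are averaged. The sketch below only generates the index splits and averages the ten accuracies listed on the slide; the helper name is illustrative, the folds are contiguous without shuffling (an assumption), and no model is actually trained.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Every sample lands in the test fold exactly once.
for train, test in k_fold_indices(n_samples=10, k=5):
    print("train:", train, "test:", test)

# Accuracies of the ten experiments on the slide, in percent.
accuracies = [93, 91, 90, 93, 93, 91, 94, 93, 91, 90]
average_accuracy = sum(accuracies) / len(accuracies)
print(f"{average_accuracy:.1f}%")  # the slide rounds this to 92%
```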

The Future ML Trends
artificial neural network
convolutional neural network
deep learning

Workflow
Application Programming Interface (API) -> Crawling Process -> Pattern Mining and Analytics Process
> Network Structure (Social Network Analysis)
> Content Analysis (Text Analytics)

First Topic
Identified
Topic Modelling
• Topic modelling is a type of statistical modelling for discovering the abstract
"topics" that occur in a collection of documents.

• LDA (Latent Dirichlet Allocation) is the most popular (and typically most effective)
topic modelling technique

TOP BRAND ALTERNATIVE MEASUREMENT BASED ON
CONSUMER NETWORK ACTIVITY
Abstract:
In Business Intelligence efforts, the legacy methodology for
measuring product brand awareness uses techniques such as
surveys, interviews, and questionnaires. This methodology
requires expensive effort to collect data from respondents and
takes considerable time to accomplish. The availability of Big
Data in the form of social media interaction can benefit us.
The conversations and user-generated content from social
media can be used to measure brand awareness
through consumer activity. We use Social Network Analysis
methodology to measure the dynamics and evolution of brand
conversations in social media. By comparing the network
properties, we propose new alternative measurement
methods for product brand awareness. Our proposed
methodology is better adapted to large-scale conversational
data in social media. This measurement also enhances the
current methodology by viewing consumer opinions as a
whole network and not as separate individuals. This study was
conducted on social networking conversations on Twitter using
two industry case studies: mobile operators and
mobile phone brands in Indonesia.
mobile phone rank
mobile operator rank

A COMPARISON OF INDONESIA E-COMMERCE SENTIMENT ANALYSIS FOR
MARKETING INTELLIGENCE EFFORT
CASE STUDY : BUKALAPAK, TOKOPEDIA, ELEVENIA
Abstract: The rapid growth of the e-commerce market in Indonesia has led various e-commerce companies
to appear, and competition among them is high. Marketing intelligence is an important activity for
measuring competitive position. One element of marketing intelligence is assessing customer satisfaction.
Many Indonesian customers express their satisfaction or dissatisfaction with a company
through social media. Hence, social media data provides a new practical way to measure
marketing intelligence effort. This research performs sentiment analysis using the naive Bayes
classification method with TF-IDF weighting. We compare the sentiments towards the top three most visited
e-commerce sites: Bukalapak, Tokopedia and Elevenia. We use Twitter data for sentiment
analysis because it is faster, cheaper and easier from both the customer and the researcher side. The
purpose of this research is to find out how to process the huge volume of customer sentiment on Twitter into
useful information for e-commerce companies, and which of those top three e-commerce companies has
the highest level of customer satisfaction. The experiment results show that the method can be used
to classify customer sentiments on Twitter automatically and that Elevenia is the e-commerce
company with the highest customer satisfaction.
COMPARABLE RESULT
AMONG THREE CASE STUDY

NETWORK TEXT ANALYSIS TO SUMMARISE ONLINE CONVERSATIONS FOR
MARKETING INTELLIGENCE EFFORTS IN TELECOMMUNICATION
INDUSTRY
Abstract - Tight market competition puts pressure on companies to employ new and faster ways to support their
marketing intelligence effort. Marketing intelligence includes gathering and analysing data for confident
decision making about the market and its competition. Today, the abundant large-scale data from online social network
services has made it possible to extract valuable information such as user opinions and sentiment from the
conversations in the market. As competition rises, new challenges emerge, including faster data
summarisation. The common practice for summarising content is to use a word cloud or a weighted list of words by frequency of appearance.
This approach lacks sense and contextual relations between the words in question, because each word has no
connection with other words that might form an important phrase. With the help of graph formulation, we
propose a methodology of network text analysis to summarise large conversations in online social network services.
The proposed methodology captures complex relations between words while still maintaining fast summarisation. In this
paper, we compare three major telecommunication providers in Indonesia: Telkomsel, XL and Indosat. The
conversations about those brands on the online social network service Twitter are collected, and a text network for each
brand is constructed and analysed.

NETWORK MARKET ANALYSIS USING LARGE SCALE SOCIAL
NETWORK CONVERSATION OF INDONESIA FAST FOOD INDUSTRY
Abstract - The high competitiveness of the Indonesian fast food market has forced the industry to find new ways to understand market behaviour. The new challenge
should include faster data collection and analytical processes, preferably with delivery time close to real-time. The common practice of gathering market data using
questionnaires and interviews is considered expensive and time-consuming compared to mining online conversations within the respective brand communities. With the
availability of large-scale data from online social network services (oSNS), we can extract valuable information that represents the dynamic behaviour of the market. Many brands have
a presence in oSNS as part of their customer relationship management (CRM) effort. The social interactions formed in oSNS can be modeled using Social Network
Analysis (SNA) methodology. In this paper, we compare the brand communities of two head-to-head competing products in the fast food industry: McDonald's and Burger
King. The SNA model constructs a large-scale network whose size reaches close to a million nodes and edges. The result gives us insight into what is important in
understanding the dynamic market besides the market size represented by the community conversations.

SOCIAL NETWORK AND SENTIMENT ANALYSIS FOR SOCIAL CUSTOMER
RELATIONSHIP MANAGEMENT IN INDONESIA BANKING SECTOR

SCRM Network
BCA BNI MANDIRI
Abstract - The increasing number of social media users affects both individual and corporate users. The banking sector, for example, uses social media to support
its Social Customer Relationship Management activity. We investigate the dynamics and evolution of the conversation network among bank customers using
Social Network Analysis methodology. Measurement is conducted by calculating network properties to see the characteristics of the network and how active it is.
Customers talking about banks' services can also express their opinions on social media. Therefore, we perform sentiment analysis to classify customers' opinions
into positive, negative and neutral classes. This research was performed on Twitter conversations about Bank Mandiri, Bank Central Asia (BCA) and Bank
Negara Indonesia (BNI). The result of this research is beneficial for business intelligence purposes to support decision making.

MEASURING MARKETING COMMUNICATIONS MIX EFFORT USING
MAGNITUDE OF INFLUENCE AND INFLUENCE RANK METRIC
Abstract: In the context of modern marketing, Twitter is considered a communication platform for spreading information. Many companies create and acquire several Twitter
accounts to support and perform a variety of marketing mix activities. Initially, each account is used to capture a specific market profile. Together, the accounts create a network of
information that provides consumers with the information they need depending on their contextual utilisation. With many accounts available, the fundamental question is
how to measure the influence of each account in the market based not only on their relations, but also on the effects of their postings. The Magnitude of Influence (MOI) metric is adapted
together with the Influence Rank (IR) measurement of accounts in their social network neighbourhood. We use a social network analysis approach to analyse 65 accounts in the social
network of an Indonesian mobile phone network operator, Telkomsel, which are involved in marketing communications mix activities through a series of related tweets. Using the social
network provides an idea of the activity involved in building and maintaining relationships with the target audience. This paper shows the most potential accounts based on
network structure and engagement. Based on this research, the more followers an account has, the more responsibility it has to generate interaction from
its followers in order to achieve the expected effectiveness. The focus of this paper is to determine the most potential accounts in the application of the marketing communications
mix on Twitter.
ratio of affection
magnitude of influence
LCRT function
influence rank (based on pagerank)

MAPPING ONLINE TRANSPORTATION SERVICE QUALITY AND MULTI-CLASS
CLASSIFICATION PROBLEM SOLVING PRIORITIES
CASE STUDY : GOJEK AND GRAB
Abstract. Online transportation services are known for their accessibility, transparency, and affordable tariffs. These points give online transportation
advantages over existing conventional transportation services. Online transportation is an example of a disruptive technology that changes the
relationship between customers and companies. In Indonesia, competition among online transportation providers is high, hence the companies must
maintain and monitor their service level. To understand their position, we apply both sentiment analysis and multiclass classification to understand customer
opinions. From negative sentiments, we can identify problems and establish problem-solving priorities. As a case study, we use the most popular online
transportation providers in Indonesia: Gojek and Grab. Since many customers actively give compliments and complaints about a company's service level on
Twitter, we collected 61,721 tweets in Bahasa during one month of observation. We apply Naive Bayes and Support Vector Machine methods to see which
model performs best for our data. The result reveals that Gojek has better service quality, with 19.76% positive and 80.23% negative sentiments, than Grab, with 9.2%
positive and 90.8% negative. Gojek's highest problem-solving priority concerns application problems, while Grab's concerns unusable promos. The overall
result shows that the general problems of both case studies are related to the accessibility dimension, which indicates a lack of capability to provide good digital access to end
users.

HYBRID SENTIMENT AND NETWORK ANALYSIS OF SOCIAL
OPINION POLARIZATION
Abstract: The rapid growth of social media and user-generated content (UGC)
has provided a rich source of potentially relevant data. The problem is
how to summarise those data in order to understand them and transform them into
information. Twitter, as one of the most popular social networking and micro-
blogging services, can be analysed in terms of the content produced using sentiment
analysis. On the other hand, several types of networks can also be constructed to
analyse the social network structure and network properties. This research
intends to combine the content and structural approaches into a hybrid
approach for identifying social opinion polarisation in the form of a
conversation network. Sentiment analysis is used to determine public sentiment,
and social network analysis is used to analyse the structure of the network,
detecting communities and influential actors in the network. Using this hybrid
approach, we gain a comprehensive understanding of social opinion
polarisation. As a case study, we present real social opinion polarisation about
the reclamation issue in Indonesia.

DYNAMIC LARGE SCALE DATA ON TWITTER USING
SENTIMENT ANALYSIS AND TOPIC MODELLING
Case Study: Uber
Digital flows now exert a larger impact. The world is more connected than
ever, and the amount of cross-border bandwidth in use has grown 45 times larger
since 2005. With the massive amount of data spreading across the net, including
social media, speed is one of the most essential factors in business. Companies can
take advantage of social media as a source to analyse and extract
customers' opinions, and can therefore respond quickly
to conditions.
The main purpose of this research is content analysis. To reach that goal, we
need to extract the information as well as summarise the topics inside it.
However, to analyse the content quickly, there is a variety of tools to choose from,
each with its own specific output, which creates challenges in the process. We use Naïve
Bayes sentiment analysis on a daily time-series basis and
topic modeling based on Latent Dirichlet Allocation (LDA) to evaluate the
sentiment of the topics as well as the model of the topics discussed.
The purpose of this research is to help both companies and individuals to map
the public opinion towards certain topic by analyzing the sentiment of the text
and create a topic model. Therefore, a real-time information for determining the
consumer opinion become a crucial part. Twitter can serve the purpose as one
source of real-time information from user-generated content. We pick Uber as
the case study, viewed as one of the most favored transportation methods in
most part of the world. Data collection period is from 10th February 2017 until
28th February 2017 with 1.048.576 tweets collected.
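
The daily sentiment step described above can be sketched in a few lines. This is a minimal illustration only, not the study's pipeline: the training sentences, the tweets, and the function names `train_nb`, `classify` and `daily_sentiment` are all hypothetical, and the LDA topic model is left out.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled tweets, used only to train the toy classifier.
TRAIN = [
    ("great ride friendly driver", "pos"),
    ("love the cheap fare", "pos"),
    ("driver was late and rude", "neg"),
    ("terrible app overcharged me", "neg"),
]

def train_nb(samples):
    """Count words per class for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def classify(text, model):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

def daily_sentiment(tweets, model):
    """Aggregate predicted labels per day: {date: Counter(label -> count)}."""
    per_day = defaultdict(Counter)
    for date, text in tweets:
        per_day[date][classify(text, model)] += 1
    return dict(per_day)

model = train_nb(TRAIN)
# Hypothetical (date, text) pairs standing in for collected tweets.
tweets = [
    ("2017-02-10", "friendly driver great fare"),
    ("2017-02-10", "rude driver late again"),
    ("2017-02-11", "overcharged terrible ride"),
]
series = daily_sentiment(tweets, model)
```

Here `series` maps each date to the number of tweets predicted positive and negative, which is the kind of per-day series a sentiment timeline can be plotted from.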

ANALYSING EMPLOYEE VOICE USING REAL-TIME FEEDBACK
Abstract: People nowadays tend to use social media as a platform to share their
reviews, emotions, and opinions, including about their jobs. Thus, a lot of data is
available on the web, and a rapid response is needed to analyse and interpret
it. Unfortunately, many organisations still use annual surveys to assess
satisfaction, engagement, and culture in the workplace. Compared with
conventional datasets such as company surveys and questionnaires, interpreted
real-time feedback lets decision-makers decide more effectively and efficiently. This
may be done with the help of sentiment analysis.
In this research, we classify the feedback based on its category and sentiment.
Several classification algorithms are used in opinion mining; two of them are the Naive
Bayes Classifier (NBC) and the Support Vector Machine (SVM). This paper aims to
classify feedback based on sentiment using NBC and SVM.
*ICST, 2018

MONTE CARLO SIMULATION AND CLUSTERING FOR
CUSTOMER SEGMENTATION IN BUSINESS ORGANISATION
Abstract: Utilising data for segmentation analysis offers a streamlined way to gain insight for decision support in a business organisation. Using
an appropriate data analytics technique helps organisations profile their customer segments accurately, which in turn leads to an effective marketing strategy. However, there
are times when the analysis needs a variable whose values are unavailable, for example customers' income data, which is mostly hard to
collect. Using Monte Carlo simulation, customers' income values can be generated and then compared with customer spending to construct a customer segmentation
model. Unsupervised learning with K-Means clustering lets us see the grouping patterns of customers' income against their
spending. Clusters in the dataset can be interpreted as groups of customers with similar characteristics. This paper shows how to generate customers' income data
and cluster the data to optimise customer potential. Furthermore, the result gives insight into which groups of customers might not be served properly
considering their average income and their spending behaviour.
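
A minimal sketch of the two steps named in the abstract: generate the missing income variable by Monte Carlo simulation, then cluster income against spending with K-Means. Everything concrete here is an assumption made for illustration: the normal income distribution and its parameters, the spending figures, the choice of three clusters, and the function names `simulate_income` and `kmeans`.

```python
import random

def simulate_income(n, mean=5.0, stdev=2.0, seed=42):
    """Monte Carlo draw of n incomes from an assumed normal distribution,
    clipped at zero. Distribution and parameters are illustrative only."""
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(mean, stdev)) for _ in range(n)]

def kmeans(points, k, iterations=20, seed=0):
    """Plain K-Means (Lloyd's algorithm) on 2-D points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids

# Hypothetical observed spending per customer (same arbitrary unit as income).
spending = [1.2, 4.8, 2.5, 6.1, 3.3, 0.9, 5.5, 2.0, 4.1, 3.7]
income = simulate_income(len(spending))
points = list(zip(income, spending))
labels, centroids = kmeans(points, k=3)
```

Each entry of `labels` assigns one customer to a segment. A segment whose centroid shows high income but low spending would be a candidate for the "not served properly" group the abstract mentions.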

MAPPING ORGANISATION KNOWLEDGE NETWORK AND
SOCIAL MEDIA BASED REPUTATION MANAGEMENT
Abstract: Knowledge management and reputation are important aspects of an
organization, especially in the ICT industry. Managing knowledge and
modelling personal reputation through social media are essential for the organization
because they show how employees build relationships with their peer
networks or clients virtually and how the knowledge network can support organizational
performance. The purpose of this research is to map the knowledge network and
formulate reputation in order to fully understand how knowledge flows in an
organization and whether employee reputation has a higher degree of influence in
the organization's knowledge network. In particular, we develop formulas to measure
the knowledge network and personal reputation based on employees' social media activities.
As a case study, we pick an Indonesian ICT company that actively builds its
business around its employees' peer knowledge outside the company. For
the knowledge network, we collect data by conducting interviews. For
reputation management, we crawl data from several popular social media platforms. We base
our work on Social Network Analysis methodology. The results show that employees'
knowledge is directly proportional to their reputation, but reputation
levels differ across the social media observed in this research.
Reputation formula for Twitter, Instagram and LinkedIn

PREDICTION MODELS BASED ON FLIGHT TICKETS AND HOTEL ROOMS DATA
SALES FOR RECOMMENDATION SYSTEM IN ONLINE TRAVEL AGENT BUSINESS
Abstract: Indonesia is one of the favourite vacation destinations of domestic and foreign travellers, so the value of investment in its tourism industry continues to
grow significantly. This has created more Online Travel Agent businesses in recent years. However, many conventional travel and Umrah travel businesses in Indonesia are
threatened with bankruptcy now that online travel businesses are widespread in the conventional ticket sales and travel tour market. The case study in this research
differs from Online Travel Agent businesses in general, because it works with real-time analytics using flight ticket and hotel room sales data to create
a prediction or recommendation model. Data mining, the extraction of hidden predictive information from large databases, is a powerful technique with great potential
to help companies focus on the most important information in their data warehouse. Using a classification method from data mining, the objective of this paper is to
create predictive models from flight ticket and hotel room sales data using the decision tree classification approach. The result is beneficial for
businesses as a basic algorithm for programming the recommendation feature of an Online Travel Agent.
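
As a rough illustration of the decision tree classification approach, the sketch below shows only the split-selection step that a decision tree repeats at every node: pick the attribute value that gives the lowest weighted Gini impurity. The booking records, attribute values, labels and the function names `gini` and `best_split` are hypothetical; the paper's actual features and tree algorithm are not given in the abstract.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, value, weighted_gini) of the best
    'feature == value' split, or None if no split separates the rows."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for v in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] == v]
            right = [lab for row, lab in zip(rows, labels) if row[f] != v]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, v, score)
    return best

# Hypothetical flight sales records: (destination, booking timing).
rows = [
    ("Bali", "early"), ("Bali", "late"), ("Jakarta", "early"),
    ("Jakarta", "late"), ("Bali", "early"), ("Jakarta", "early"),
]
# Hypothetical target: did the same customer also book a hotel room?
labels = ["hotel", "hotel", "no_hotel", "no_hotel", "hotel", "no_hotel"]

split = best_split(rows, labels)
```

In this toy data the destination separates the two classes perfectly, so a recommendation feature built on the resulting rule would offer hotel rooms to customers flying to that destination.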

EFFECTIVE KNOWLEDGE MANAGEMENT USING
BIG DATA AND SOCIAL NETWORK ANALYSIS
Visualisation of the hierarchical organization structure and the knowledge
flow of the informal organization
Abstract: Knowledge management consists of identifying, creating, representing, distributing, and
enabling adoption of insights and experiences in an organization. One approach to modelling knowledge
management is using a network model. Big Data is an important part of the ICT technological roadmap, whose
main function is modelling behaviour and supporting organizational decisions. Social Network
Analysis is a micro version of Big Data in which we can model and quantify social networks.
In this paper we show how Social Network Analysis can help an organization apply Knowledge
Management strategies and practices, through an experiment using a real-world large dataset containing 360000+
email exchanges between 36000+ employees inside an organization.
Business case resolved using SNA methodology
Map of the full network of email exchanges between employees in Enron
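
One basic Social Network Analysis measurement on an email exchange network is degree centrality: the share of other people each employee exchanges email with. The sketch below is illustrative only. The sender and recipient names are placeholders, not records from the dataset, and `build_graph` and `degree_centrality` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical (sender, recipient) email log; names are placeholders.
emails = [
    ("ana", "budi"),
    ("ana", "citra"),
    ("budi", "citra"),
    ("dedi", "ana"),
    ("ana", "budi"),  # repeated exchanges do not add new links
]

def build_graph(pairs):
    """Undirected graph: two people are linked if they exchanged any email."""
    graph = defaultdict(set)
    for sender, recipient in pairs:
        if sender != recipient:
            graph[sender].add(recipient)
            graph[recipient].add(sender)
    return graph

def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    if n <= 1:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}

graph = build_graph(emails)
centrality = degree_centrality(graph)
most_connected = max(centrality, key=centrality.get)
```

People with high centrality are candidates for the informal knowledge hubs that a hierarchical organization chart does not show.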

INDONESIA INFRASTRUCTURE AND CONSUMER STOCK PORTFOLIO
PREDICTION USING ARTIFICIAL NEURAL NETWORK BACKPROPAGATION
*ICOICT, 2017
Abstract: The Artificial Neural Network (ANN) method is increasingly popular for building predictive
models that generate small prediction error. To produce a good model, ANN needs a large dataset as
input. ANN backpropagation is a gradient descent method that minimises the squared output error.
Stock price movements suit the ANN requirement: they form a large dataset because stock prices
are recorded as often as every second, usually called high-frequency data. The implementation of stock
price prediction using the ANN approach is quite new. The predictive model helps investors build a
stock portfolio and supports their decision-making process. Holding several stocks in a portfolio reduces
risk through diversification and increases the chance of a higher return. In this paper, we show how to generate
a prediction model of stock prices using artificial neural network backpropagation, and how to form a
portfolio from the predicted prices that yields the portfolio prediction with the smallest error. The
dataset we use is historical stock price data from ten different company stocks in the infrastructure and
consumer sectors of the Indonesia Stock Exchange. The results show that under lower-risk conditions the ANN predictive
model gives a higher expected return than the real return, while under higher-risk conditions the
real return is higher than that of the ANN predictive model.
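
To make "gradient descent on the squared output error" concrete, here is a minimal sketch of backpropagation for a network with one hidden layer, trained to predict the next value of a series from the two previous values. The price series, network size, learning rate and the function name `train_ann` are assumptions chosen for illustration; they are not the configuration used in the paper.

```python
import math
import random

def train_ann(samples, hidden=3, epochs=500, lr=0.05, seed=1):
    """Train a one-hidden-layer network (tanh hidden units, linear output)
    with backpropagation. Returns (predict, mse_per_epoch)."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        h = [math.tanh(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j])
             for j in range(hidden)]
        y = sum(w2[j] * h[j] for j in range(hidden)) + b2
        return h, y

    history = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            h, y = forward(x)
            err = y - target  # derivative of 0.5 * err**2 with respect to y
            total += err * err
            for j in range(hidden):
                # Error pushed back through the output weight and tanh.
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * err * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h * x[i]
                b1[j] -= lr * grad_h
            b2 -= lr * err
        history.append(total / len(samples))
    return (lambda x: forward(x)[1]), history

# Hypothetical closing prices, already scaled to the range 0..1.
prices = [0.10, 0.15, 0.22, 0.30, 0.36, 0.45, 0.52, 0.60, 0.68, 0.75]
samples = [((prices[i], prices[i + 1]), prices[i + 2])
           for i in range(len(prices) - 2)]

predict, history = train_ann(samples)
next_price = predict((prices[-2], prices[-1]))
```

`history` records the mean squared error of each epoch, so comparing its first and last entries shows whether training reduced the error. Predicted prices such as `next_price` would then feed the portfolio-forming step.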

THE DYNAMIC OF BANKING NETWORK TOPOLOGY
Case Study: Indonesian Presidential Election Event
Abstract: Information and communication technologies have brought major changes in data storage and processing. Data of various types and in high volume
has been digitalised and supports mining-based data processing to provide knowledge in a modern and efficient way. Banking transaction data is
stored digitally and is suitable for mining, especially with network science models. Understanding transaction system risk requires a fundamental study of
payment flows and bank behaviour in various situations. The failure of Lehman Brothers, whose contagion spread in a short time, indicates that financial markets
are interdependent and connected to each other in a large network. Thus, an overall system network approach becomes more important than a single-bank
view. Political conditions greatly affect economic stability, including the banking and financial sectors. A presidential election is a major political event for a
nation, affecting community sentiment and financial markets. However, the linkage between political events and topological changes is poorly
understood. This research presents insight into event-driven dynamic network topology with banking transactions as a case study. We search for the
dynamics of the banking transaction network topology driven by the 2014 Indonesian presidential election. We discover that banks are more engaged with others, at
larger values, 3 days before the end of the campaign period and less engaged with others, at smaller values, at the end of the campaign period. Unique transaction activity
between banks remains stable, with a slight decline at the end of the campaign period. This scenario makes it possible to learn the banking transaction
pattern and to support supervision of financial system stability.

A COMPARATIVE STUDY OF EMPLOYEE CHURN PREDICTION
MODEL
Abstract: The churn phenomenon commonly occurs in customer loyalty towards
a brand's products or services. It becomes a critical issue that any industry
would make its best effort to avoid. The churn problem may also arise within the
organisation, where it is called employee churn. Employee churn creates myriad
adverse effects for the organisation, as it correlates with unfair workload
distribution, a great deal of money lost, and the extra time needed to find a
replacement, which may result in a rise in the customer dissatisfaction rate. The
purpose of this study is to find the best model to predict employee churn. A
successful prediction model for employee churn is much needed in
order to avert various negative impacts on the organisation. There are three
popular classification models for prediction, namely naïve Bayes, decision
tree, and random forest. This study compares the performance of these
models using Human Resource Information System
(HRIS) data from one of Indonesia's renowned telecommunication companies. The
data collected for the study spans a 2-year period, from 2015 until
2017. The findings suggest that the best classification model is
random forest, with an accuracy of 97.5%. The second-best
method is naïve Bayes with 96.6%, and the least accurate classification
model is decision tree with 88.7%. The study concludes that the most reliable
and accurate classification model to predict employee churn is random forest.

STATISTICS                    DATA ANALYTICS
Confirmative                  Explorative
Small Data Set                Large Data Set
Small Number of Variables     Large Number of Variables
Deductive (no predictions)    Inductive
Numeric Data                  Numeric and Non-Numeric Data
Clean Data                    Data Cleaning
Complementary Methods

Opportunities
• big data provides granular, micro data
• big data provides a relatively fast and cheap process
• research opportunities in data science methods, implementation and evaluation
maturity
• data scientists help big data initiatives towards future and sustainable economic
activities
• uncovering hidden truths and democratisation by data are the primary objectives of data
scientists
Challenges
• hard to find data scientist talent
• high cost to maintain data scientist talent
• big data is often a population study, so no sampling error => methods familiarity
• benefit > data quality + costs + security
• ML result credibility (different algorithm, different conclusion)

The Power of Data is …
every breath you take
every move you make
every bond you break
every step you take
I’ll be watching you

without Big Data, you are blind
and deaf in the middle of a freeway
- Geoffrey Moore -
