SlideShare une entreprise Scribd logo
1  sur  42
Télécharger pour lire hors ligne
Building a Location
Based Social Graph in
Spark at InMobi
Seinjuti Chatterjee
Ian Anderson
Talk Agenda
• InMobi background
• Privacy
• Location Service
• Data collection
• Location based social groups
• Static
• Dynamic
• Spark at InMobi
InMobi: Engaging 1bn users across the globe
Global Premium Publishers
Rich MediaBanner Video Interstitial Native
Example Mobile Ads
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterjee and Ian Anderson, InMobi)
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterjee and Ian Anderson, InMobi)
Ad Request
Data
Enrichment /
Processing
Ad Auction
Serve Ad
InMobi SDK and Data Collection
Target behaviours not users
Hadoop / Pig with Falcon / Oozie scheduling
HDFS store
Spark - Standalone Scheduler
Daily snapshots
Classified data
per geo region
Hadoop and Spark Setup
Feature generation pipeline
hRaven
monitoring
Decision
tree
generation
Connected
components
generation
Location
based app
similarity
Social Groups
• “A social group within social sciences has been defined
as two or more people who interact with one another,
share similar characteristics, and collectively have a
sense of unity.”
- http://en.wikipedia.org/wiki/Social_group
Uni. Friends 49ers fans Film club Work Friends Car club
Location-Based Social Groups
• Focus on location as the group membership criteria
• Shared behaviours
• Location can indicate social connection e.g. 49ers
fans at Levi’s Stadium
• San Francisco example
• We want to reach a group of people that travel into
San Francisco for business purposes
• Visual demonstration
• Implementation walkthrough
Identifying POIs
Location Data
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterjee and Ian Anderson, InMobi)
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterjee and Ian Anderson, InMobi)
POI Classifier
“Given a geographic location e.g the United States, United Kingdom, India we want to understand the
underlying nature of the location, in order to enrich the context of the ad request at the time of
origin”
• Types of POI can be Public Terminal, Hotel or Inn or B&B, Eatery, Sports Center, Community
Center, Academic Institution, Retail Store.
• POI classification helps profile user as a commuter or a frequent flyer or a university student or
frequent business traveller or avid retail therapist
• We have trained a Decision Tree Model on labelled data of POI visitation frequency patterns which
gives us upto 70% accuracy now for predicting a POI type for a location.
• The final classifier was trained on 70K labelled locations and used to label 500K locations and public
wifi’s
Average merit Average rank Attribute
1.331 +/- 0.003 1.0 +/- 0.0 normalized ssid groupsize
0.251 +/- 0.006 2.1 +/- 0.3 avg hour spread
0.241 +/- 0.001 2.9 +/- 0.3 ratio of avg weekend daytime adreq to avg weekend nighttime adreq
0.227 +/- 0.001 4.0 +/- 0.0 ratio of weekend early hours to weekend adreq
0.224 +/- 0.001 5.2 +/- 0.4 ratio of weekend late hours to weekend adreq
0.222 +/- 0.002 5.8 +/- 0.4 ratio of weekend lunch hours to weekend adreq
0.208 +/- 0.002 7.0 +/- 0.0 percentage of devices who visited at least 1 uniq days
0.202 +/- 0.001 8.0 +/- 0.0 ratio of avg weekday daytime adreq to avg weekday nighttime adreq
0.194 +/- 0.001 9.0 +/- 0.0 ratio of avg weekday to weekend adreq
0.184 +/- 0.001 10.0 +/- 0.0 ratio of weekday lunch hours to weekday adreq
0.178 +/- 0.001 11.0 +/- 0.0 ratio_of_weekend_breakfast_hours_to_weekend_adreq
0.172 +/- 0.001 12.0 +/- 0.0 percentage_of_devices_who_visited_atleast_2_uniq_days
Feature Ranking using Information Gain
Spark Implementation
• Number of Data Points = 69,465 Num Classes = 7
• Test-Train Split = 30-70%
• Impurity=”gini”, maxDepth=30, maxBins=32, numTrees=13,
featureSubsetStrategy=”auto”
• DecisionTree Classifier with holdout set
• RandomForest Classifier with holdout set
• More tuning and 10 fold cross validation is required
TP Rate FP Rate Precision Recall F-Measure Class
0.54% 0.05% 0.54% 0.54% 0.54% UNIVERSlTY_OR_COLLEGE
0.71% 0.11% 0.72% 0.72% 0.71% lNN_AND_HOTELS
0.70% 0.15% 0.71% 0.71% 0.70% EATERY
0.29% 0.08% 0.28% 0.28% 0.29% COMMUNlTY_CENTER
0.60% 0.06% 0.60% 0.60% 0.60% STORE
0.24% 0.05% 0.23% 0.24% 0.24% SPORTS_AND_FlTNESS
0.19% 0.00% 0.20% 0.19% 0.19% AlRPORT_BUS_RAlL_TERMlNAL
Detailed Accuracy by Class
DecisionTreeModel classifier of depth 30 with 22351 nodes
Decision Tree Model Results: Spark
a b c d e f g <- classified as
1100 189 174 375 90 88 4 a = UNIVERSlTY_OR_COLLEGE
201 4154 648 304 269 217 22 b = lNN_AND_HOTELS
208 633 5047 463 451 364 23 c = EATERY
335 284 389 581 158 226 9 d = COMMUNlTY_CENTER
103 246 433 157 1561 94 17 e = STORE
95 228 342 213 74 301 6 f = SPORTS_AND_FlTNESS
6 23 27 17 8 4 20 g = AlRPORT_BUS_RAlL_TERMlNAL
Decision Tree Model Results: Spark
Confusion Matrix
TP Rate FP Rate Precision Recall F-Measure Class
0.64% 0.03% 0.73% 0.68% 0.64% UNIVERSITY_OR_COLLEGE
0.82% 0.09% 0.77% 0.80% 0.82% INN_AND_HOTELS
0.85% 0.21% 0.68% 0.75% 0.85% EATERY
0.26% 0.04% 0.42% 0.32% 0.26% COMMUNITY_CENTER
0.60% 0.03% 0.77% 0.68% 0.60% STORE
0.18% 0.03% 0.31% 0.23% 0.18% SPORTS_AND_FITNESS
0.30% 0.00% 0.83% 0.44% 0.30% AIRPORT_BUS_RAIL_TERMINA
L
Detailed Accuracy by Class
TreeEnsembleModel classifier with 13 trees, Test Error = 0.308
Random Forest Results: Spark
a b c d e f g <- classified as
1303 167 251 202 53 52 0 a = UNIVERSlTY_OR_COLLEGE
47 4747 721 69 113 97 4 b = lNN_AND_HOTELS
51 537 6060 178 141 168 0 c = EATERY
327 259 632 506 109 130 2 d = COMMUNlTY_CENTER
37 170 695 94 1587 48 0 e = STORE
23 231 568 151 38 228 0 f = SPORTS_AND_FlTNESS
1 21 29 8 11 1 30 g = AlRPORT_BUS_RAlL_TERMlNAL
Random Forest Results: Spark
Confusion Matrix
Weka Implementation
Trained a Decision Tree Model in WEKA
• Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
• Relation: us.publicPOI.classification
• Filter: weka.filters.supervised.instance.SMOTE
• Instances: 69465
• Attributes: 28
• Number of Leaves : 6377
• Size of the tree : 12753
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.671 0.043 0.628 0.671 0.649 0.815 UNIVERSlTY_OR_COLLEGE
0.805 0.089 0.776 0.805 0.790 0.873 lNN_AND_HOTELS
0.802 0.113 0.788 0.802 0.795 0.859 EATERY
0.333 0.060 0.368 0.333 0.350 0.672 COMMUNlTY_CENTER
0.786 0.025 0.820 0.786 0.803 0.887 STORE
0.234 0.040 0.269 0.234 0.250 0.617 SPORTS_AND_FlTNESS
0.405 0.001 0.599 0.405 0.484 0.757 AlRPORT_BUS_RAlL_TERMlNAL
Detailed Accuracy by Class
Decision Tree Model Results: Weka
a b c d e f g <- classified as
4500 562 478 768 158 226 10 a = UNIVERSlTY_OR_COLLEGE
517 15458 1569 765 292 580 31 b = lNN_AND_HOTELS
486 1673 19145 1144 497 902 11 c = EATERY
1229 957 1219 2194 315 643 26 d = COMMUNlTY_CENTER
169 443 653 363 6849 230 7 e = STORE
257 737 1195 703 219 953 4 f = SPORTS_AND_FlTNESS
12 78 47 28 18 12 133 g = AlRPORT_BUS_RAlL_TERMlNAL
Confusion Matrix
Decision Tree Model Results: Weka
Model built with 10 fold cross validation
Model Results Comparison
Decision
Tree
Spark
Random
Forest
Spark
Baseline
Decision
Tree Weka
Weighted F-measure 0.618 0.676 0.705
Weighted Precision 0.620 0.676 0.702
Weighted Recall 0.616 0.693 0.709
Weighted TPR 0.616 0.693 0.709
Weighted FPR 0.098 0.109 0.079
Test Error Rate 0.384 0.307 0.290
Hotel and Inn F-measure 0.71% 0.79% 0.79%
University/College F-measure 0.54% 0.68% 0.65%
Eatery F-measure 0.70% 0.76% 0.79%
Store F-measure 0.60% 0.76% 0.80%
Decision
Tree
Spark
Random
Forest
Spark
Baseline
Decision
Tree Weka
Input Training
Sample SIze
70K 70K 70K
Time taken to
build model
2 mins 2 mins 35.69
seconds
Resource
Usage
Single Node
Cluster
Single Node
Cluster
Single Node
Cluster
Scalability YES YES NO
Parallelism YES YES NO
Accuracy 60% 69% 70%
Classifier Visualization: Inn/Hotel
Visualization: Airport
Visualization: University / College
Visualization: Eatery
Visualization: Store
• A connected component is a set of locations which have been frequently
co-visited by users over a month.
• Conceptually it is a subgraph of frequent visitation trends which
transforms into a profile of the user.
• 5265 connected components generated for 576093 locations in 2 hours
where each location has seen on an average 4 devices.
Examples:
• University students who like eating @BuffaloWing
• Frequent business travellers to SFO who stay at hotels and rent a car
Connected Component
Spark Implementation
L(1)
L(2)
L(3)
SSID: shopwifi_67
GPS: 37.21, -122.43
Label: STORE
L(1)
L(2)
L(3)
Days revisited: 7
Avg. hour spread: 1
Req. frequency: 0.1
1. Create vertices 2. Create edges
3. Run connected component algorithm 4. Rank users
1 user covisited
2userscovisited
● Connection strength given by number
of edges (ie. common users) between
L(i) and L(j)
● Cluster locations sorted by component
size
● Rank the users per connected
component
● Profile the user with the profile of the
connected component
University, Eateries, Target Store
Connected Component
Students, Coffee Shops, Library
Connected Component
Frequent Business Travellers to SFO
Connected Component
Frequent Business Travellers to LAX
Connected Component
University
Students
General
Population
Category
0.01% 6.65% Productivity
0.14% 5.77% Health & Fitness
0.01% 4.44% Lifestyle
3.14% 4.26% Entertainment
0.00% 3.33% News
0.03% 2.91% Reference
9.55% 2.42% Tools
4.39% 0.28% Social
5.26% 0.21% Media & Video
Using classifications: App Usage
Using classifications: App Usage
Using classifications: App Usage
• Migration towards spark as the runtime of
choice for data processing
• Legacy Pig jobs being switched to the
Spark backend using Pig on Spark
• Pure MR applications being rewritten to
use the Spark Java API
Wider Spark Usage at InMobi
Special Thanks to Paul Duff, Senior Research Scientist
Thank you
&
Questions

Contenu connexe

Tendances

Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsDatabricks
 
Improving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot InstancesImproving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot InstancesDatabricks
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsDatabricks
 
Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Databricks
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Spark Summit
 
Self-Service Apache Spark Structured Streaming Applications and Analytics
Self-Service Apache Spark Structured Streaming Applications and AnalyticsSelf-Service Apache Spark Structured Streaming Applications and Analytics
Self-Service Apache Spark Structured Streaming Applications and AnalyticsDatabricks
 
What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0Databricks
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUsCreating an 86,000 Hour Speech Dataset with Apache Spark and TPUs
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUsDatabricks
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerDatabricks
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Databricks
 
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusMonitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusDatabricks
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaDatabricks
 
Spark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureDatabricks
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Databricks
 

Tendances (20)

Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale Platforms
 
Improving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot InstancesImproving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot Instances
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science Workflows
 
Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...Informational Referential Integrity Constraints Support in Apache Spark with ...
Informational Referential Integrity Constraints Support in Apache Spark with ...
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 
Self-Service Apache Spark Structured Streaming Applications and Analytics
Self-Service Apache Spark Structured Streaming Applications and AnalyticsSelf-Service Apache Spark Structured Streaming Applications and Analytics
Self-Service Apache Spark Structured Streaming Applications and Analytics
 
What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUsCreating an 86,000 Hour Speech Dataset with Apache Spark and TPUs
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs
 
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn SchedulerCloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusMonitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
 
Best Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and DeltaBest Practices for Building Robust Data Platform with Apache Spark and Delta
Best Practices for Building Robust Data Platform with Apache Spark and Delta
 
Spark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon Whitear
 
Conviva spark
Conviva sparkConviva spark
Conviva spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices Architecture
 
Distributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop ClustersDistributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop Clusters
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
 

En vedette

Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Introductory Keynote at Hadoop Workshop by Ospcon (2014)Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Introductory Keynote at Hadoop Workshop by Ospcon (2014)Andrei Nikolaenko
 
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )Shamim bhuiyan
 
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++Антон Шестаков
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Spark Summit
 
Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15Alexey Zinoviev
 
Community detection (Поиск сообществ в графах)
Community detection (Поиск сообществ в графах)Community detection (Поиск сообществ в графах)
Community detection (Поиск сообществ в графах)Kirill Rybachuk
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLSpark Summit
 
Using spark for timeseries graph analytics
Using spark for timeseries graph analyticsUsing spark for timeseries graph analytics
Using spark for timeseries graph analyticsSigmoid
 
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosSpark Summit
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXAndrea Iacono
 
Лекция 12. Spark
Лекция 12. SparkЛекция 12. Spark
Лекция 12. SparkTechnopark
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
 
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveGraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveSpark Summit
 
Credit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisCredit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisJen Aman
 
Top 2017 Mobile Advertising Trends in Indonesia
Top 2017 Mobile Advertising Trends in IndonesiaTop 2017 Mobile Advertising Trends in Indonesia
Top 2017 Mobile Advertising Trends in IndonesiaInMobi
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...
Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...
Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...Daniele Loiacono
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezBig Data Spain
 

En vedette (19)

Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Introductory Keynote at Hadoop Workshop by Ospcon (2014)Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Introductory Keynote at Hadoop Workshop by Ospcon (2014)
 
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
 
Apache spark
Apache sparkApache spark
Apache spark
 
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
 
Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15
 
Community detection (Поиск сообществ в графах)
Community detection (Поиск сообществ в графах)Community detection (Поиск сообществ в графах)
Community detection (Поиск сообществ в графах)
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
 
Using spark for timeseries graph analytics
Using spark for timeseries graph analyticsUsing spark for timeseries graph analytics
Using spark for timeseries graph analytics
 
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphX
 
Лекция 12. Spark
Лекция 12. SparkЛекция 12. Spark
Лекция 12. Spark
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveGraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
 
Credit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisCredit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph Analysis
 
Top 2017 Mobile Advertising Trends in Indonesia
Top 2017 Mobile Advertising Trends in IndonesiaTop 2017 Mobile Advertising Trends in Indonesia
Top 2017 Mobile Advertising Trends in Indonesia
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...
Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...
Confusion Matrices for Improving Performance of Feature Pattern Classifier Sy...
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
 

Similaire à Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterjee and Ian Anderson, InMobi)

RS in the context of Big Data-v4
RS in the context of Big Data-v4RS in the context of Big Data-v4
RS in the context of Big Data-v4Khadija Atiya
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014nkabra
 
Reducing energy consumption of computing
Reducing energy consumption of computing Reducing energy consumption of computing
Reducing energy consumption of computing NGUYEN VAN LUONG
 
Fisheye-Omnidirectional View in Autonomous Driving III
Fisheye-Omnidirectional View in Autonomous Driving IIIFisheye-Omnidirectional View in Autonomous Driving III
Fisheye-Omnidirectional View in Autonomous Driving IIIYu Huang
 
System and User Aspects of Web Search Latency
System and User Aspects of Web Search LatencySystem and User Aspects of Web Search Latency
System and User Aspects of Web Search LatencyTelefonica Research
 
Where Should You Deliver Database Services From?
Where Should You Deliver Database Services From?Where Should You Deliver Database Services From?
Where Should You Deliver Database Services From?EDB
 
SCQAA-SF Meeting on May 21 2014
SCQAA-SF Meeting on May 21 2014 SCQAA-SF Meeting on May 21 2014
SCQAA-SF Meeting on May 21 2014 Sujit Ghosh
 
Local collaborative autoencoders (WSDM2021)
Local collaborative autoencoders (WSDM2021)Local collaborative autoencoders (WSDM2021)
Local collaborative autoencoders (WSDM2021)민진 최
 
Agile Finance for Project Success
Agile Finance for Project SuccessAgile Finance for Project Success
Agile Finance for Project SuccessStephen Milligan
 
Explain Yourself: Why You Get the Recommendations You Do
Explain Yourself: Why You Get the Recommendations You DoExplain Yourself: Why You Get the Recommendations You Do
Explain Yourself: Why You Get the Recommendations You DoDatabricks
 
Extend Your Journey: Introducing Signal Strength into Location-based Applicat...
Extend Your Journey: Introducing Signal Strength into Location-based Applicat...Extend Your Journey: Introducing Signal Strength into Location-based Applicat...
Extend Your Journey: Introducing Signal Strength into Location-based Applicat...Chih-Chuan Cheng
 
Thesis_presentation_arda_tasci
Thesis_presentation_arda_tasciThesis_presentation_arda_tasci
Thesis_presentation_arda_tasciArda Taşcı
 
Modularization compass - Navigating white waters of feature-oriented modularity
Modularization compass - Navigating white waters of feature-oriented modularityModularization compass - Navigating white waters of feature-oriented modularity
Modularization compass - Navigating white waters of feature-oriented modularityAndrzej Olszak
 
TOSCA and Cloudify
TOSCA and CloudifyTOSCA and Cloudify
TOSCA and Cloudifydfilppi
 
Unsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object trackingUnsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object trackingYu Huang
 
Semantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media PostsSemantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media PostsGiulio Carducci
 
Machine learning algorithm for classification of activity of daily life’s
Machine learning algorithm for classification of activity of daily life’sMachine learning algorithm for classification of activity of daily life’s
Machine learning algorithm for classification of activity of daily life’sSiddharth Chakravarty
 
[AFEL] Neighborhood Troubles: On the Value of User Pre-Filtering To Speed Up ...
[AFEL] Neighborhood Troubles: On the Value of User Pre-Filtering To Speed Up ...[AFEL] Neighborhood Troubles: On the Value of User Pre-Filtering To Speed Up ...
[AFEL] Neighborhood Troubles: On the Value of User Pre-Filtering To Speed Up ...Emanuel Lacić
 
Introduction to GS1 EPCIS standard and Oliot EPCIS X (EPCIS v2.0 prototype)
Introduction to GS1 EPCIS standard and Oliot EPCIS X (EPCIS v2.0 prototype)Introduction to GS1 EPCIS standard and Oliot EPCIS X (EPCIS v2.0 prototype)
Introduction to GS1 EPCIS standard and Oliot EPCIS X (EPCIS v2.0 prototype)Jaewook Byun
 
Cassandra Summit 2014: A Train of Thoughts About Growing and Scalability — Bu...
Cassandra Summit 2014: A Train of Thoughts About Growing and Scalability — Bu...Cassandra Summit 2014: A Train of Thoughts About Growing and Scalability — Bu...
Cassandra Summit 2014: A Train of Thoughts About Growing and Scalability — Bu...DataStax Academy
 

Similaire à Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterjee and Ian Anderson, InMobi) (20)

RS in the context of Big Data-v4
RS in the context of Big Data-v4RS in the context of Big Data-v4
RS in the context of Big Data-v4
 
Solr and ElasticSearch demo and speaker feb 2014
Solr  and ElasticSearch demo and speaker feb 2014Solr  and ElasticSearch demo and speaker feb 2014
Solr and ElasticSearch demo and speaker feb 2014
 
Reducing energy consumption of computing
Reducing energy consumption of computing Reducing energy consumption of computing
Reducing energy consumption of computing
 
Fisheye-Omnidirectional View in Autonomous Driving III
Fisheye-Omnidirectional View in Autonomous Driving IIIFisheye-Omnidirectional View in Autonomous Driving III
Fisheye-Omnidirectional View in Autonomous Driving III
 
System and User Aspects of Web Search Latency
System and User Aspects of Web Search LatencySystem and User Aspects of Web Search Latency
System and User Aspects of Web Search Latency
 
Where Should You Deliver Database Services From?
Where Should You Deliver Database Services From?Where Should You Deliver Database Services From?
Where Should You Deliver Database Services From?
 
SCQAA-SF Meeting on May 21 2014
SCQAA-SF Meeting on May 21 2014 SCQAA-SF Meeting on May 21 2014
SCQAA-SF Meeting on May 21 2014
 
Local collaborative autoencoders (WSDM2021)
Local collaborative autoencoders (WSDM2021)Local collaborative autoencoders (WSDM2021)
Local collaborative autoencoders (WSDM2021)
 
Agile Finance for Project Success
Agile Finance for Project SuccessAgile Finance for Project Success
Agile Finance for Project Success
 
Explain Yourself: Why You Get the Recommendations You Do
Explain Yourself: Why You Get the Recommendations You DoExplain Yourself: Why You Get the Recommendations You Do
Explain Yourself: Why You Get the Recommendations You Do
 
Extend Your Journey: Introducing Signal Strength into Location-based Applicat...
Extend Your Journey: Introducing Signal Strength into Location-based Applicat...Extend Your Journey: Introducing Signal Strength into Location-based Applicat...
Extend Your Journey: Introducing Signal Strength into Location-based Applicat...
 
Thesis_presentation_arda_tasci
Thesis_presentation_arda_tasciThesis_presentation_arda_tasci
Thesis_presentation_arda_tasci
 
Modularization compass - Navigating white waters of feature-oriented modularity
Modularization compass - Navigating white waters of feature-oriented modularityModularization compass - Navigating white waters of feature-oriented modularity
Modularization compass - Navigating white waters of feature-oriented modularity
 
TOSCA and Cloudify
TOSCA and CloudifyTOSCA and Cloudify
TOSCA and Cloudify
 
Unsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object trackingUnsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object tracking
 
Semantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media PostsSemantic Analysis to Compute Personality Traits from Social Media Posts
Semantic Analysis to Compute Personality Traits from Social Media Posts
 
Machine learning algorithm for classification of activity of daily life’s
Machine learning algorithm for classification of activity of daily life’sMachine learning algorithm for classification of activity of daily life’s
Machine learning algorithm for classification of activity of daily life’s
 
[AFEL] Neighborhood Troubles: On the Value of User Pre-Filtering To Speed Up ...
[AFEL] Neighborhood Troubles: On the Value of User Pre-Filtering To Speed Up ...[AFEL] Neighborhood Troubles: On the Value of User Pre-Filtering To Speed Up ...
[AFEL] Neighborhood Troubles: On the Value of User Pre-Filtering To Speed Up ...
 
Introduction to GS1 EPCIS standard and Oliot EPCIS X (EPCIS v2.0 prototype)
Introduction to GS1 EPCIS standard and Oliot EPCIS X (EPCIS v2.0 prototype)Introduction to GS1 EPCIS standard and Oliot EPCIS X (EPCIS v2.0 prototype)
Introduction to GS1 EPCIS standard and Oliot EPCIS X (EPCIS v2.0 prototype)
 
Cassandra Summit 2014: A Train of Thoughts About Growing and Scalability — Bu...
Cassandra Summit 2014: A Train of Thoughts About Growing and Scalability — Bu...Cassandra Summit 2014: A Train of Thoughts About Growing and Scalability — Bu...
Cassandra Summit 2014: A Train of Thoughts About Growing and Scalability — Bu...
 

Plus de Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Plus de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Dernier

SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxDwiAyuSitiHartinah
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)Data & Analytics Magazin
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionajayrajaganeshkayala
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityAggregage
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?sonikadigital1
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introductionsanjaymuralee1
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 

Dernier (17)

SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 

Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterjee and Ian Anderson, InMobi)

  • 1. Building a Location Based Social Graph in Spark at InMobi Seinjuti Chatterjee Ian Anderson
  • 2. Talk Agenda • InMobi background • Privacy • Location Service • Data collection • Location based social groups • Static • Dynamic • Spark at InMobi
  • 3. InMobi: Engaging 1bn users across the globe
  • 5. Rich MediaBanner Video Interstitial Native Example Mobile Ads
  • 8. Ad Request Data Enrichment / Processing Ad Auction Serve Ad InMobi SDK and Data Collection Target behaviours not users
  • 9. Hadoop / Pig with Falcon / Oozie scheduling HDFS store Spark - Standalone Scheduler Daily snapshots Classified data per geo region Hadoop and Spark Setup Feature generation pipeline hRaven monitoring Decision tree generation Connected components generation Location based app similarity
  • 10. Social Groups • “A social group within social sciences has been defined as two or more people who interact with one another, share similar characteristics, and collectively have a sense of unity.” - http://en.wikipedia.org/wiki/Social_group Uni. Friends 49ers fans Film club Work Friends Car club
  • 11. Location-Based Social Groups • Focus on location as the group membership criteria • Shared behaviours • Location can indicate social connection e.g. 49ers fans at Levi’s Stadium • San Francisco example • We want to reach a group of people that travel into San Francisco for business purposes • Visual demonstration • Implementation walkthrough
  • 16. POI Classifier “Given a geographic location e.g the United States, United Kingdom, India we want to understand the underlying nature of the location, in order to enrich the context of the ad request at the time of origin” • Types of POI can be Public Terminal, Hotel or Inn or B&B, Eatery, Sports Center, Community Center, Academic Institution, Retail Store. • POI classification helps profile user as a commuter or a frequent flyer or a university student or frequent business traveller or avid retail therapist • We have trained a Decision Tree Model on labelled data of POI visitation frequency patterns which gives us upto 70% accuracy now for predicting a POI type for a location. • The final classifier was trained on 70K labelled locations and used to label 500K locations and public wifi’s
  • 17. Average merit Average rank Attribute 1.331 +/- 0.003 1.0 +/- 0.0 normalized ssid groupsize 0.251 +/- 0.006 2.1 +/- 0.3 avg hour spread 0.241 +/- 0.001 2.9 +/- 0.3 ratio of avg weekend daytime adreq to avg weekend nighttime adreq 0.227 +/- 0.001 4.0 +/- 0.0 ratio of weekend early hours to weekend adreq 0.224 +/- 0.001 5.2 +/- 0.4 ratio of weekend late hours to weekend adreq 0.222 +/- 0.002 5.8 +/- 0.4 ratio of weekend lunch hours to weekend adreq 0.208 +/- 0.002 7.0 +/- 0.0 percentage of devices who visited at least 1 uniq days 0.202 +/- 0.001 8.0 +/- 0.0 ratio of avg weekday daytime adreq to avg weekday nighttime adreq 0.194 +/- 0.001 9.0 +/- 0.0 ratio of avg weekday to weekend adreq 0.184 +/- 0.001 10.0 +/- 0.0 ratio of weekday lunch hours to weekday adreq 0.178 +/- 0.001 11.0 +/- 0.0 ratio_of_weekend_breakfast_hours_to_weekend_adreq 0.172 +/- 0.001 12.0 +/- 0.0 percentage_of_devices_who_visited_atleast_2_uniq_days Feature Ranking using Information Gain
  • 18. Spark Implementation • Number of Data Points = 69,465 Num Classes = 7 • Test-Train Split = 30-70% • Impurity=”gini”, maxDepth=30, maxBins=32, numTrees=13, featureSubsetStrategy=”auto” • DecisionTree Classifier with holdout set • RandomForest Classifier with holdout set • More tuning and 10 fold cross validation is required
  • 19. TP Rate FP Rate Precision Recall F-Measure Class 0.54% 0.05% 0.54% 0.54% 0.54% UNIVERSlTY_OR_COLLEGE 0.71% 0.11% 0.72% 0.72% 0.71% lNN_AND_HOTELS 0.70% 0.15% 0.71% 0.71% 0.70% EATERY 0.29% 0.08% 0.28% 0.28% 0.29% COMMUNlTY_CENTER 0.60% 0.06% 0.60% 0.60% 0.60% STORE 0.24% 0.05% 0.23% 0.24% 0.24% SPORTS_AND_FlTNESS 0.19% 0.00% 0.20% 0.19% 0.19% AlRPORT_BUS_RAlL_TERMlNAL Detailed Accuracy by Class DecisionTreeModel classifier of depth 30 with 22351 nodes Decision Tree Model Results: Spark
  • 20. a b c d e f g <- classified as 1100 189 174 375 90 88 4 a = UNIVERSlTY_OR_COLLEGE 201 4154 648 304 269 217 22 b = lNN_AND_HOTELS 208 633 5047 463 451 364 23 c = EATERY 335 284 389 581 158 226 9 d = COMMUNlTY_CENTER 103 246 433 157 1561 94 17 e = STORE 95 228 342 213 74 301 6 f = SPORTS_AND_FlTNESS 6 23 27 17 8 4 20 g = AlRPORT_BUS_RAlL_TERMlNAL Decision Tree Model Results: Spark Confusion Matrix
  • 21. TP Rate FP Rate Precision Recall F-Measure Class 0.64% 0.03% 0.73% 0.68% 0.64% UNIVERSITY_OR_COLLEGE 0.82% 0.09% 0.77% 0.80% 0.82% INN_AND_HOTELS 0.85% 0.21% 0.68% 0.75% 0.85% EATERY 0.26% 0.04% 0.42% 0.32% 0.26% COMMUNITY_CENTER 0.60% 0.03% 0.77% 0.68% 0.60% STORE 0.18% 0.03% 0.31% 0.23% 0.18% SPORTS_AND_FITNESS 0.30% 0.00% 0.83% 0.44% 0.30% AIRPORT_BUS_RAIL_TERMINA L Detailed Accuracy by Class TreeEnsembleModel classifier with 13 trees, Test Error = 0.308 Random Forest Results: Spark
  • 22. a b c d e f g <- classified as 1303 167 251 202 53 52 0 a = UNIVERSlTY_OR_COLLEGE 47 4747 721 69 113 97 4 b = lNN_AND_HOTELS 51 537 6060 178 141 168 0 c = EATERY 327 259 632 506 109 130 2 d = COMMUNlTY_CENTER 37 170 695 94 1587 48 0 e = STORE 23 231 568 151 38 228 0 f = SPORTS_AND_FlTNESS 1 21 29 8 11 1 30 g = AlRPORT_BUS_RAlL_TERMlNAL Random Forest Results: Spark Confusion Matrix
  • 23. Weka Implementation Trained a Decision Tree Model in WEKA • Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2 • Relation: us.publicPOI.classification • Filter: weka.filters.supervised.instance.SMOTE • Instances: 69465 • Attributes: 28 • Number of Leaves : 6377 • Size of the tree : 12753
  • 24. TP Rate FP Rate Precision Recall F-Measure ROC Area Class 0.671 0.043 0.628 0.671 0.649 0.815 UNIVERSlTY_OR_COLLEGE 0.805 0.089 0.776 0.805 0.790 0.873 lNN_AND_HOTELS 0.802 0.113 0.788 0.802 0.795 0.859 EATERY 0.333 0.060 0.368 0.333 0.350 0.672 COMMUNlTY_CENTER 0.786 0.025 0.820 0.786 0.803 0.887 STORE 0.234 0.040 0.269 0.234 0.250 0.617 SPORTS_AND_FlTNESS 0.405 0.001 0.599 0.405 0.484 0.757 AlRPORT_BUS_RAlL_TERMlNAL Detailed Accuracy by Class Decision Tree Model Results: Weka
  • 25. a b c d e f g <- classified as 4500 562 478 768 158 226 10 a = UNIVERSlTY_OR_COLLEGE 517 15458 1569 765 292 580 31 b = lNN_AND_HOTELS 486 1673 19145 1144 497 902 11 c = EATERY 1229 957 1219 2194 315 643 26 d = COMMUNlTY_CENTER 169 443 653 363 6849 230 7 e = STORE 257 737 1195 703 219 953 4 f = SPORTS_AND_FlTNESS 12 78 47 28 18 12 133 g = AlRPORT_BUS_RAlL_TERMlNAL Confusion Matrix Decision Tree Model Results: Weka Model built with 10 fold cross validation
  • 26. Model Results Comparison Decision Tree Spark Random Forest Spark Baseline Decision Tree Weka Weighted F-measure 0.618 0.676 0.705 Weighted Precision 0.620 0.676 0.702 Weighted Recall 0.616 0.693 0.709 Weighted TPR 0.616 0.693 0.709 Weighted FPR 0.098 0.109 0.079 Test Error Rate 0.384 0.307 0.290 Hotel and Inn F-measure 0.71% 0.79% 0.79% University/College F-measure 0.54% 0.68% 0.65% Eatery F-measure 0.70% 0.76% 0.79% Store F-measure 0.60% 0.76% 0.80% Decision Tree Spark Random Forest Spark Baseline Decision Tree Weka Input Training Sample SIze 70K 70K 70K Time taken to build model 2 mins 2 mins 35.69 seconds Resource Usage Single Node Cluster Single Node Cluster Single Node Cluster Scalability YES YES NO Parallelism YES YES NO Accuracy 60% 69% 70%
  • 32. • A connected component is a set of locations which have been frequently co-visited by users over a month. • Conceptually it is a subgraph of frequent visitation trends which transforms into a profile of the user. • 5265 connected components generated for 576093 locations in 2 hours where each location has seen on an average 4 devices. Examples: • University students who like eating @BuffaloWing • Frequent business travellers to SFO who stay at hotels and rent a car Connected Component
  • 33. Spark Implementation L(1) L(2) L(3) SSID: shopwifi_67 GPS: 37.21, -122.43 Label: STORE L(1) L(2) L(3) Days revisited: 7 Avg. hour spread: 1 Req. frequency: 0.1 1. Create vertices 2. Create edges 3. Run connected component algorithm 4. Rank users 1 user covisited 2userscovisited ● Connection strength given by number of edges (ie. common users) between L(i) and L(j) ● Cluster locations sorted by component size ● Rank the users per connected component ● Profile the user with the profile of the connected component
  • 34. University, Eateries, Target Store Connected Component
  • 35. Students, Coffee Shops, Library Connected Component
  • 36. Frequent Business Travellers to SFO Connected Component
  • 37. Frequent Business Travellers to LAX Connected Component
  • 38. University Students General Population Category 0.01% 6.65% Productivity 0.14% 5.77% Health & Fitness 0.01% 4.44% Lifestyle 3.14% 4.26% Entertainment 0.00% 3.33% News 0.03% 2.91% Reference 9.55% 2.42% Tools 4.39% 0.28% Social 5.26% 0.21% Media & Video Using classifications: App Usage
  • 41. • Migration towards spark as the runtime of choice for data processing • Legacy Pig jobs being switched to the Spark backend using Pig on Spark • Pure MR applications being rewritten to use the Spark Java API Wider Spark Usage at InMobi
  • 42. Special Thanks to Paul Duff, Senior Research Scientist Thank you & Questions