A Real Time Sentiment Analysis Application using
Hadoop and HBase in the Cloud




Jagane Sundar
Founder, AltoScale Inc.



June 14, 2012                      Hadoop Summit 2012


     AltoScale
About me


- Extensive knowledge of Hadoop, cloud computing, and virtualization
- Co-founder of AltoScale, where we developed the Workbench
- Worked on Hadoop management and performance at Yahoo
- Primarily a systems and storage guy: I have written TCP stacks,
  NFS clients, and Livebackup for KVM




2
My Motivation


- Build a cool real-time big data app in order to acquire a deep
  understanding of real-time big data systems in the cloud




3
What will you get out of this?


- See how easy it is to build a highly scalable real-time big data
  application using a variety of open source tools and technologies




4
Real Time Sentiment Analysis


- Easily accessible real-time signals
    - Twitter public status updates
    - Blog entries




5
Real Time Sentiment Analysis


- Two types of solutions to real-time sentiment analysis:
    - Keywords known a priori
        - Filter tweets by keyword
    - Open-ended sentiment analysis (no a priori knowledge of keywords)
        - Random sample of all public tweets
            - About 1% of public tweets is easily available
            - A larger sample (about 10%) or the full Twitter firehose
              may be available for purchase
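The "filter tweets by keyword" case can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class name `KeywordFilterSketch` and the whole-word, case-insensitive matching rule are hypothetical and not taken from the talk, where the filtering may well be done by the Twitter API itself.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not from the slides: finds which tracked keywords a tweet mentions.
final class KeywordFilterSketch {
    private final List<String> keywords = new ArrayList<>();

    KeywordFilterSketch(List<String> trackedKeywords) {
        // Normalize once so every lookup is case-insensitive.
        for (String k : trackedKeywords) {
            keywords.add(k.toLowerCase(Locale.ROOT));
        }
    }

    // Returns the tracked keywords that appear as whole words in the text.
    // "#obama" matches "obama"; "obamacare" does not.
    Set<String> matches(String tweetText) {
        Set<String> words = new HashSet<>();
        // Split on anything that is not a letter or a digit.
        for (String w : tweetText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        Set<String> hits = new LinkedHashSet<>();
        for (String k : keywords) {
            if (words.contains(k)) {
                hits.add(k);
            }
        }
        return hits;
    }
}
```

A caller would build the filter once from the known keywords and call `matches(status text)` for each incoming tweet, updating one counter set per returned keyword.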




6
Real Time Sentiment Analysis: Application Architecture

[Architecture diagram]
Hadoop/HBase cluster:
- Service Node: runs TwitterSampler and the HBase REST Gateway
    - TwitterSampler: analyzes sentiment, then writes a few new rows
      to HBase every minute
    - HBase REST Gateway: scans the HTable
- Master: NameNode (NN), HBase Master
- Hadoop Slave (x3): DataNode, Region Server

7
Real Time Sentiment Analysis: Twitter Streaming API Overview

[Diagram: Twitter APIs]
- REST APIs (request/response)
- Streaming APIs (persistent HTTP connection)
    - Public Streams (sample of all public updates)
        - filter
        - sample (we use this API to collect tweets)
        - firehose
    - User Streams (one user's updates)
    - Site Streams (multiple users' updates)

8
Real Time Sentiment Analysis: Time Series Database


- Inspired by OpenTSDB (TSDB), but does not use it
- See Benoît “tsuna” Sigoure’s slides from HBaseCon 2012




9
Real Time Sentiment Analysis: in HBase

Row                          NEUTRAL  POSITIVE  NEGATIVE  Sample Tweets
obama:2012:06:04:13:34       1        4         0         sdac soasp few
romney:2012:06:04:13:34      2        3         1         Smsm djcn dje jdj
davebarry:2012:06:04:13:34   0        9         0         cs dsjw ausj
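The row keys in the table follow the pattern keyword:yyyy:MM:dd:HH:mm. Because every time field is zero-padded, keys for one keyword sort in time order, so a time range becomes a simple start/stop row pair. A minimal sketch, assuming ASCII keywords; the class and method names are hypothetical and `java.time` is used here only for brevity:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not from the slides.
final class RowKeySketch {
    // Same layout as the row keys shown in the table: yyyy:MM:dd:HH:mm
    private static final DateTimeFormatter MINUTE =
            DateTimeFormatter.ofPattern("yyyy:MM:dd:HH:mm");

    // Builds the row key for one keyword and one minute.
    static String rowKey(String keyword, LocalDateTime minute) {
        return keyword + ":" + MINUTE.format(minute);
    }

    // Start (inclusive) and stop (exclusive) rows for scanning one keyword
    // over a time range. Zero-padded fields make string order match time order.
    static String[] scanRange(String keyword, LocalDateTime from, LocalDateTime toExclusive) {
        return new String[] { rowKey(keyword, from), rowKey(keyword, toExclusive) };
    }
}
```

For example, `rowKey("obama", LocalDateTime.of(2012, 6, 4, 13, 34))` yields the first key in the table, and the pair returned by `scanRange` could be used as the start and stop rows of a scan.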




    10
Real Time Sentiment Analysis: Front Page




11
Real Time Sentiment Analysis: Results Page




12
Real Time Sentiment Analysis: Standing on the Shoulders of Giants

- Hadoop and HBase, of course
- Twitter4J library for reading the Twitter stream
- Sentiment analysis
    - https://code.google.com/p/twitter-sentiment-analysis/
    - Weka library

- Tomcat
- jQuery and Dojo for the JavaScript client




13
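Real Time Sentiment Analysis: Twitter Stream API - TsStatusListener

import twitter4j.Status;
import twitter4j.StatusListener;

// Nested class; wm (the classifier) and updateKeywordTrackers()
// are static members of the enclosing class.
public static class TsStatusListener implements StatusListener {
    // Called by Twitter4J for every status received on the stream.
    @Override
    public void onStatus(Status status) {
        // Classify the tweet text; the polarity comes back as a string.
        Item item = wm.weightedClassify(status.getText());
        int polarity = 0;
        try {
            polarity = Integer.parseInt(item.getPolarity().trim());
        } catch (NumberFormatException nfe) {
            // Unparseable polarity: keep the default of 0.
        }
        updateKeywordTrackers(status, polarity);
    }
    // Other StatusListener callbacks (onException, onDeletionNotice, ...) not shown.
}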
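The listener hands each tweet's polarity to `updateKeywordTrackers`, which the slides do not show. One way the per-keyword counters might look is sketched below. Everything here is an assumption: the class names are hypothetical, and the polarity encoding (0 = negative, 2 = neutral, 4 = positive, the convention of the Sentiment140 data set) is a guess that should be checked against the classifier actually used.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical counters for one keyword, not from the slides.
final class KeywordTrackerSketch {
    final AtomicLong neutral = new AtomicLong();
    final AtomicLong positive = new AtomicLong();
    final AtomicLong negative = new AtomicLong();

    // Assumed encoding: 4 = positive, 0 = negative, anything else = neutral.
    void record(int polarity) {
        switch (polarity) {
            case 4:
                positive.incrementAndGet();
                break;
            case 0:
                negative.incrementAndGet();
                break;
            default:
                neutral.incrementAndGet();
        }
    }
}

// Hypothetical registry mapping each tracked keyword to its counters.
final class TrackerRegistrySketch {
    private final Map<String, KeywordTrackerSketch> trackers = new ConcurrentHashMap<>();

    // Creates the tracker on first use, then counts the tweet.
    void update(String keyword, int polarity) {
        trackers.computeIfAbsent(keyword, k -> new KeywordTrackerSketch()).record(polarity);
    }

    // Returns the counters for a keyword, or null if it has not been seen.
    KeywordTrackerSketch get(String keyword) {
        return trackers.get(keyword);
    }
}
```

Atomic counters and a concurrent map are used because the stream callback and the once-a-minute writer would typically run on different threads.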
14
Real Time Sentiment Analysis: Writing to HBase

import java.util.Calendar;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Writes one row per keyword per minute; the row key is keyword:yyyy:MM:dd:HH:mm.
private void writeToHBase() {
    Calendar cal = Calendar.getInstance();
    // Calendar.MONTH is zero-based, hence the + 1.
    String calStr = String.format("%04d:%02d:%02d:%02d:%02d",
            cal.get(Calendar.YEAR), cal.get(Calendar.MONTH) + 1,
            cal.get(Calendar.DAY_OF_MONTH), cal.get(Calendar.HOUR_OF_DAY),
            cal.get(Calendar.MINUTE));
    String rowKey = keyword + ":" + calStr;
    // Bytes.toBytes encodes as UTF-8 regardless of the platform default charset.
    Put put = new Put(Bytes.toBytes(rowKey));
    byte[] family = Bytes.toBytes(COLFAM1);
    // Counts are stored as strings. put.add is the HBase 0.9x call; 1.0+ uses addColumn.
    put.add(family, Bytes.toBytes("NEUTRAL"), Bytes.toBytes(tracker.getNeutralCount()));
    put.add(family, Bytes.toBytes("POSITIVE"), Bytes.toBytes(tracker.getPositiveCount()));
    put.add(family, Bytes.toBytes("NEGATIVE"), Bytes.toBytes(tracker.getNegativeCount()));

    try {
        table.put(put);
    } catch (Exception ex) {
        // Best effort: report the failure and keep the sampler running.
        System.err.println(ex);
    }
}
    15
Reading from HBase: Various Options
Technologies for Writing HBase Clients

Option 1: HBase Client
    Service Node: Java client linked to the HBase client classes

Option 2: Thrift RPC
    Service Node: Thrift client
        --(Thrift protocol)-->
    Service Node: HBase Thrift Gateway

Option 3: REST API
    Client
        --(REST over HTTP or HTTPS)-->
    Service Node: HBase REST Gateway

16
Reading from HBase and presenting to the user’s browser

[Architecture diagram]
Hadoop/HBase in the cloud:
- Service Node: HBase REST Gateway, which scans the HTable
- Tomcat proxy: serves the static HTML and forwards REST scan
  requests to the HBase REST Gateway
- Master: NameNode (NN), HBase Master
- Hadoop Slave (x3): DataNode, Region Server
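For the REST scan step, the HBase REST interface as documented creates a scanner from a small XML description sent to the table's `scanner` resource, with row keys and returned cell values base64-encoded. The sketch below only builds and decodes those strings and sends nothing over the network. The class and method names are hypothetical, and the attribute names (`batch`, `startRow`, `endRow`) should be verified against the HBase version in use.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not from the slides.
final class RestScanSketch {
    // Base64-encodes a row key or other string as UTF-8 bytes.
    static String b64(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    // Builds the XML body that describes a scanner over [startRow, endRow).
    static String scannerBody(String startRow, String endRow, int batch) {
        return "<Scanner batch=\"" + batch + "\""
                + " startRow=\"" + b64(startRow) + "\""
                + " endRow=\"" + b64(endRow) + "\"/>";
    }

    // Decodes a base64 cell value, such as a count that was stored as a string.
    static String decodeCell(String base64Value) {
        return new String(Base64.getDecoder().decode(base64Value), StandardCharsets.UTF_8);
    }
}
```

A JavaScript client would do the equivalent encoding and decoding in the browser; the Java version is shown only to keep one language across the examples.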

17
Tomcat as HTTP Proxy


- The HBase Stargate REST server runs on port 8081 and is connected
  to HBase
- The REST server can scan tables
- A JavaScript web page is the client
- Problem:
    - JavaScript security restrictions (the same-origin policy) do not
      allow the JavaScript to make REST calls to any server other than
      the one it was loaded from
- Solution: Tomcat is used as a proxy. It serves:
    - Static HTML pages with the JavaScript client, images, etc.
    - REST requests from the JavaScript client, which it proxies to the
      HBase Stargate server running on port 8081
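The routing decision behind such a proxy can be sketched as a single function: requests under one path prefix are forwarded to the REST server on port 8081, and everything else is served as static content. The prefix `/hbase/`, the upstream host `localhost`, and the class name are assumptions for illustration; the slides do not say how the Tomcat proxy was configured.

```java
import java.util.Optional;

// Hypothetical routing rule, not from the slides.
final class ProxyRouteSketch {
    static final String PROXY_PREFIX = "/hbase/";             // assumed prefix
    static final String UPSTREAM = "http://localhost:8081/";  // assumed REST server address

    // Returns the upstream URL for a request that should be proxied,
    // or an empty Optional when the request is for static content.
    static Optional<String> route(String path, String query) {
        // Refuse path traversal rather than forwarding it upstream.
        if (!path.startsWith(PROXY_PREFIX) || path.contains("..")) {
            return Optional.empty();
        }
        String target = UPSTREAM + path.substring(PROXY_PREFIX.length());
        if (query != null && !query.isEmpty()) {
            target += "?" + query;
        }
        return Optional.of(target);
    }
}
```

Because the browser only ever talks to Tomcat, the same-origin restriction is satisfied while the REST server stays reachable through the forwarded path.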
18
Future Improvements


- Elastic HBase in the cloud
- At night, use one VM to receive tweets and write them out as
  SequenceFiles in S3
- Before business hours, start up HBase and run a MapReduce job to
  process these SequenceFiles and write the results into HBase
- Cost-effective real-time HBase application in the cloud




19
Big Data Apps in the Cloud


- The cloud is suitable for big data apps that use big data from the
  Internet. For example:
    - Twitter public status updates
    - Blog entries
    - Web crawl data

- Big data apps in the cloud are not useful if all your data is
  generated inside your network. For example:
    - Router, storage device, and authentication device logs
    - Logs from web servers located inside your network




20
Questions, Comments, Flames?


Thanks!
Jagane Sundar
jagane@altoscale.com




21

Contenu connexe

Tendances

Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Unit 1-introduction to scripts
Unit 1-introduction to scriptsUnit 1-introduction to scripts
Unit 1-introduction to scriptssana mateen
 
Context model
Context modelContext model
Context modelUbaid423
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL DatabasesDerek Stainer
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and predictionDataminingTools Inc
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBRavi Teja
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBaseAnil Gupta
 
Unit 1-uses for scripting languages,web scripting
Unit 1-uses for scripting languages,web scriptingUnit 1-uses for scripting languages,web scripting
Unit 1-uses for scripting languages,web scriptingsana mateen
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless DatabasesDan Gunter
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streamsKrish_ver2
 

Tendances (20)

Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Unit 1-introduction to scripts
Unit 1-introduction to scriptsUnit 1-introduction to scripts
Unit 1-introduction to scripts
 
Building Aneka clouds.ppt
Building Aneka clouds.pptBuilding Aneka clouds.ppt
Building Aneka clouds.ppt
 
Php Lecture Notes
Php Lecture NotesPhp Lecture Notes
Php Lecture Notes
 
Context model
Context modelContext model
Context model
 
Introduction to NoSQL Databases
Introduction to NoSQL DatabasesIntroduction to NoSQL Databases
Introduction to NoSQL Databases
 
Message passing in Distributed Computing Systems
Message passing in Distributed Computing SystemsMessage passing in Distributed Computing Systems
Message passing in Distributed Computing Systems
 
Web servers
Web serversWeb servers
Web servers
 
Mvc architecture
Mvc architectureMvc architecture
Mvc architecture
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
 
Unit 1-uses for scripting languages,web scripting
Unit 1-uses for scripting languages,web scriptingUnit 1-uses for scripting languages,web scripting
Unit 1-uses for scripting languages,web scripting
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Html
HtmlHtml
Html
 
Beyond syllabus for web technology
Beyond syllabus for web technologyBeyond syllabus for web technology
Beyond syllabus for web technology
 
CSS
CSSCSS
CSS
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
 

En vedette

Introduction to Sentiment Analysis
Introduction to Sentiment AnalysisIntroduction to Sentiment Analysis
Introduction to Sentiment AnalysisJaganadh Gopinadhan
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in TwitterAyushi Dalmia
 
Social media & sentiment analysis splunk conf2012
Social media & sentiment analysis   splunk conf2012Social media & sentiment analysis   splunk conf2012
Social media & sentiment analysis splunk conf2012Michael Wilde
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSumit Raj
 
How Sentiment Analysis works
How Sentiment Analysis worksHow Sentiment Analysis works
How Sentiment Analysis worksCJ Jenkins
 
Building a Sentiment Analytics Solution powered by Machine Learning- Webinar QA
Building a Sentiment Analytics Solution powered by Machine Learning- Webinar QABuilding a Sentiment Analytics Solution powered by Machine Learning- Webinar QA
Building a Sentiment Analytics Solution powered by Machine Learning- Webinar QAImpetus Technologies
 
Social media mining and multimedia analysis research and applications
Social media mining and multimedia analysis research and applicationsSocial media mining and multimedia analysis research and applications
Social media mining and multimedia analysis research and applicationsYiannis Kompatsiaris
 
Intro to Algebra II
Intro to Algebra IIIntro to Algebra II
Intro to Algebra IIteamxxlp
 
Packet capture and network traffic analysis
Packet capture and network traffic analysisPacket capture and network traffic analysis
Packet capture and network traffic analysisCARMEN ALCIVAR
 
Top 10 senior administrative officer interview questions and answers
Top 10 senior administrative officer interview questions and answersTop 10 senior administrative officer interview questions and answers
Top 10 senior administrative officer interview questions and answersannababy1245
 
Virtualization In Software Testing
Virtualization In Software TestingVirtualization In Software Testing
Virtualization In Software TestingColloquium
 
Vendor quality management
Vendor quality managementVendor quality management
Vendor quality managementG2Link
 
Video Quality Measurements
Video Quality MeasurementsVideo Quality Measurements
Video Quality MeasurementsYoss Cohen
 
Digital Platform Selection Best Practices
Digital Platform Selection Best PracticesDigital Platform Selection Best Practices
Digital Platform Selection Best Practicesedynamic
 
Hands-On Lab: Let's Build an ITSM Dashboard
Hands-On Lab: Let's Build an ITSM DashboardHands-On Lab: Let's Build an ITSM Dashboard
Hands-On Lab: Let's Build an ITSM DashboardCA Technologies
 
Defining Workplace Safety
Defining Workplace SafetyDefining Workplace Safety
Defining Workplace SafetyBruce Lambert
 
Which test cases to automate
Which test cases to automateWhich test cases to automate
Which test cases to automatesachxn1
 

En vedette (20)

Introduction to Sentiment Analysis
Introduction to Sentiment AnalysisIntroduction to Sentiment Analysis
Introduction to Sentiment Analysis
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in Twitter
 
Social media & sentiment analysis splunk conf2012
Social media & sentiment analysis   splunk conf2012Social media & sentiment analysis   splunk conf2012
Social media & sentiment analysis splunk conf2012
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
Chem Lab Report (1)
Chem Lab Report (1)Chem Lab Report (1)
Chem Lab Report (1)
 
How Sentiment Analysis works
How Sentiment Analysis worksHow Sentiment Analysis works
How Sentiment Analysis works
 
Building a Sentiment Analytics Solution powered by Machine Learning- Webinar QA
Building a Sentiment Analytics Solution powered by Machine Learning- Webinar QABuilding a Sentiment Analytics Solution powered by Machine Learning- Webinar QA
Building a Sentiment Analytics Solution powered by Machine Learning- Webinar QA
 
Social media mining and multimedia analysis research and applications
Social media mining and multimedia analysis research and applicationsSocial media mining and multimedia analysis research and applications
Social media mining and multimedia analysis research and applications
 
Intro to Algebra II
Intro to Algebra IIIntro to Algebra II
Intro to Algebra II
 
Orbital Notation
Orbital NotationOrbital Notation
Orbital Notation
 
Packet capture and network traffic analysis
Packet capture and network traffic analysisPacket capture and network traffic analysis
Packet capture and network traffic analysis
 
Top 10 senior administrative officer interview questions and answers
Top 10 senior administrative officer interview questions and answersTop 10 senior administrative officer interview questions and answers
Top 10 senior administrative officer interview questions and answers
 
Virtualization In Software Testing
Virtualization In Software TestingVirtualization In Software Testing
Virtualization In Software Testing
 
Vendor quality management
Vendor quality managementVendor quality management
Vendor quality management
 
Video Quality Measurements
Video Quality MeasurementsVideo Quality Measurements
Video Quality Measurements
 
Digital Platform Selection Best Practices
Digital Platform Selection Best PracticesDigital Platform Selection Best Practices
Digital Platform Selection Best Practices
 
Analysis of water pollution presentaion by m.nadeem ashraf
Analysis of water pollution presentaion by m.nadeem ashrafAnalysis of water pollution presentaion by m.nadeem ashraf
Analysis of water pollution presentaion by m.nadeem ashraf
 
Hands-On Lab: Let's Build an ITSM Dashboard
Hands-On Lab: Let's Build an ITSM DashboardHands-On Lab: Let's Build an ITSM Dashboard
Hands-On Lab: Let's Build an ITSM Dashboard
 
Defining Workplace Safety
Defining Workplace SafetyDefining Workplace Safety
Defining Workplace Safety
 
Which test cases to automate
Which test cases to automateWhich test cases to automate
Which test cases to automate
 

Similaire à Realtime Sentiment Analysis Application Using Hadoop and HBase

Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionCloudera, Inc.
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Hadoop User Group
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerCodemotion
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizonArtem Ervits
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseYahoo Developer Network
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airshipdave_revell
 
Serverless by Examples and Case Studies
Serverless by Examples and Case StudiesServerless by Examples and Case Studies
Serverless by Examples and Case StudiesSrushith Repakula
 
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Sparkhbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and SparkMichael Stack
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconYiwei Ma
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统yongboy
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase强 王
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Ashish Narasimham
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Gwen (Chen) Shapira
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detectionhadooparchbook
 

Similaire à Realtime Sentiment Analysis Application Using Hadoop and HBase (20)

Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe Seiler
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
מיכאל
מיכאלמיכאל
מיכאל
 
Mar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBaseMar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBase
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airship
 
Serverless by Examples and Case Studies
Serverless by Examples and Case StudiesServerless by Examples and Case Studies
Serverless by Examples and Case Studies
 
Serverless by examples and case studies
Serverless by examples and case studiesServerless by examples and case studies
Serverless by examples and case studies
 
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Sparkhbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 

Plus de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine

Realtime Sentiment Analysis Application Using Hadoop and HBase

  • 1. A Real Time Sentiment Analysis Application using Hadoop and HBase in the Cloud Jagane Sundar Founder, AltoScale Inc. June 14, 2012 Hadoop Summit 2012 AltoScale
  • 2. About me
      - Extensive knowledge of Hadoop, cloud computing, and virtualization
      - Co-founder of AltoScale; we developed the Workbench
      - Worked on Hadoop management and performance at Yahoo
      - Primarily a systems and storage guy: have written TCP stacks, NFS clients, and Livebackup for KVM
  • 3. My Motivation
      - Build a cool real-time big data app in order to acquire a deep understanding of real-time Big Data systems in the cloud
  • 4. What will you get out of this?
      - See how easy it is to build a highly scalable real-time Big Data application using a variety of open source tools and technologies
  • 5. Real Time Sentiment Analysis
      - Easily accessible real-time signals
          - Twitter public status updates
          - Blog entries
  • 6. Real Time Sentiment Analysis
      - Two types of solutions to real-time sentiment analysis
          - Keywords known a priori
              - Filter tweets by keyword
          - Open-ended sentiment analysis (no a priori knowledge of keywords)
              - Random sample of all public tweets
                  - 1% of public tweets easily available
                  - 10% (Twitter firehose) may be available for purchase
  • 7. Real Time Sentiment Analysis: Application Architecture
      [Diagram] Service node: TwitterSampler (analyzes sentiment and writes a few new rows to HBase every minute) and the HBase REST Gateway (scans the HTable). Hadoop/HBase cluster: one master (NameNode, HBase Master) and three Hadoop slaves (each a DataNode and Region Server).
  • 8. Real Time Sentiment Analysis: Twitter Streaming API Overview
      Twitter APIs
        - REST APIs (request/response)
        - Streaming APIs (persistent HTTP connection)
            - Public Streams (sample of all public updates): filter, sample, firehose
            - User Streams (one user's updates)
            - Site Streams (multiple users' updates)
      Callout on the diagram: "We use this API to collect tweets"
  • 9. Real Time Sentiment Analysis: Time Series Database
      - Inspired by TSDB, but does not use TSDB
      - Read Benoît “tsuna” Sigoure’s slides from HBaseCon 2012
  • 10. Real Time Sentiment Analysis: in HBase
      Row                         NEUTRAL  POSITIVE  NEGATIVE  Sample Tweets
      obama:2012:06:04:13:34      1        4         0         sdac soasp few
      romney:2012:06:04:13:34     2        3         1         Smsm djcn dje jdj
      davebarry:2012:06:04:13:34  0        9         0         cs dsjw ausj
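The row keys on slide 10 follow the pattern keyword:yyyy:MM:dd:HH:mm. Because every date field is zero-padded, keys for one keyword sort lexicographically in time order, so a time window can be read as a single range scan. The following is a minimal sketch of that idea using only the JDK; the class and method names (`RowKeys`, `rowKey`) are hypothetical and do not come from the talk.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class RowKeys {
    // Same layout as the rows shown on the slide: keyword:yyyy:MM:dd:HH:mm
    private static final DateTimeFormatter MINUTE =
            DateTimeFormatter.ofPattern("yyyy:MM:dd:HH:mm");

    /** Builds the row key for one keyword and one minute (hypothetical helper). */
    public static String rowKey(String keyword, LocalDateTime minute) {
        return keyword + ":" + minute.format(MINUTE);
    }

    public static void main(String[] args) {
        LocalDateTime from = LocalDateTime.of(2012, 6, 4, 13, 34);
        LocalDateTime to = from.plusHours(1);

        // Start (inclusive) and stop (exclusive) rows for a one-hour scan.
        String startRow = rowKey("obama", from);
        String stopRow = rowKey("obama", to);
        System.out.println(startRow + " .. " + stopRow);

        // Zero padding makes string order agree with time order.
        System.out.println(startRow.compareTo(stopRow) < 0);
    }
}
```

One consequence of this layout is that all minutes for a keyword are adjacent in the table, which suits per-keyword charts but concentrates writes for a hot keyword in one region.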
  • 11. Real Time Sentiment Analysis: Front Page
  • 12. Real Time Sentiment Analysis: Results Page
  • 13. Real Time Sentiment Analysis: Standing on the Shoulders of Giants
      - Hadoop and HBase, of course
      - Twitter4j library for getting the Twitter stream
      - Sentiment analysis
          - https://code.google.com/p/twitter-sentiment-analysis/
          - Weka library
      - Tomcat
      - jQuery and Dojo for the JavaScript client
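  • 14. Real Time Sentiment Analysis: Twitter Stream API - TsStatusListener
      // Twitter4j types used by this excerpt
      import twitter4j.Status;
      import twitter4j.StatusListener;

      public static class TsStatusListener implements StatusListener {
          // Called by Twitter4j for every status received on the stream.
          public void onStatus(Status status) {
              // Classify the tweet text with the sentiment classifier (wm).
              Item item = wm.weightedClassify(status.getText());
              int polarity = 0;
              try {
                  polarity = Integer.parseInt(item.getPolarity().trim());
              } catch (NumberFormatException nfe) {
                  // Polarity was not an integer; keep the default of 0.
              }
              updateKeywordTrackers(status, polarity);
          }
          // The other StatusListener callbacks are not shown on the slide.
      }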
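The slide calls `updateKeywordTrackers(status, polarity)` but does not show it. The following is a rough sketch of what a per-keyword tracker could look like, not the talk's implementation. Everything here is an assumption: the class name `KeywordTracker`, the case-insensitive substring match, and in particular the mapping of the integer polarity to a class (negative values as NEGATIVE, zero as NEUTRAL, positive values as POSITIVE), since the encoding used by the sentiment library is not stated on the slides.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

public class KeywordTracker {
    private final String keyword;
    private final AtomicInteger neutral = new AtomicInteger();
    private final AtomicInteger positive = new AtomicInteger();
    private final AtomicInteger negative = new AtomicInteger();

    public KeywordTracker(String keyword) {
        this.keyword = keyword.toLowerCase(Locale.ROOT);
    }

    /** Assumed matching rule: case-insensitive substring match on the tweet text. */
    public boolean matches(String tweetText) {
        return tweetText.toLowerCase(Locale.ROOT).contains(keyword);
    }

    /** Assumed polarity encoding: < 0 negative, 0 neutral, > 0 positive. */
    public void record(int polarity) {
        if (polarity < 0) {
            negative.incrementAndGet();
        } else if (polarity == 0) {
            neutral.incrementAndGet();
        } else {
            positive.incrementAndGet();
        }
    }

    /**
     * Returns {neutral, positive, negative} and resets the counters, so the
     * once-a-minute writer starts each minute from zero. Each counter is reset
     * atomically, but the three resets are not one atomic step.
     */
    public int[] snapshotAndReset() {
        return new int[] {
            neutral.getAndSet(0), positive.getAndSet(0), negative.getAndSet(0)
        };
    }
}
```

Atomic counters are used because the stream listener and the once-a-minute writer would typically run on different threads.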
  • 15. Real Time Sentiment Analysis: Writing to HBase
      import java.util.Calendar;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.util.Bytes;

      private void writeToHBase() {
          // Row key is keyword:yyyy:MM:dd:HH:mm. Calendar.MONTH is zero-based, hence the + 1.
          Calendar cal = Calendar.getInstance();
          String calStr = String.format("%04d:%02d:%02d:%02d:%02d",
                  cal.get(Calendar.YEAR), cal.get(Calendar.MONTH) + 1,
                  cal.get(Calendar.DAY_OF_MONTH), cal.get(Calendar.HOUR_OF_DAY),
                  cal.get(Calendar.MINUTE));
          String rowKey = keyword + ":" + calStr;

          // One column per sentiment class. Bytes.toBytes encodes strings as UTF-8
          // rather than the platform default charset used by String.getBytes().
          // Newer HBase releases use Put.addColumn in place of Put.add.
          Put put = new Put(Bytes.toBytes(rowKey));
          put.add(Bytes.toBytes(COLFAM1), Bytes.toBytes("NEUTRAL"), Bytes.toBytes(tracker.getNeutralCount()));
          put.add(Bytes.toBytes(COLFAM1), Bytes.toBytes("POSITIVE"), Bytes.toBytes(tracker.getPositiveCount()));
          put.add(Bytes.toBytes(COLFAM1), Bytes.toBytes("NEGATIVE"), Bytes.toBytes(tracker.getNegativeCount()));
          try {
              table.put(put);
          } catch (Exception ex) {
              // The write is best effort: log the failure and carry on.
              System.err.println(ex);
          }
      }
  • 16. Reading from HBase: Various Options (Technologies for Writing HBase Clients)
      - Option 1: HBase Client. A Java client on the service node, linked to the HBase client classes
      - Option 2: Thrift RPC. A Thrift client on the service node talks to the HBase Thrift Gateway over the Thrift protocol
      - Option 3: REST API. The service node talks to the HBase REST Gateway over REST (HTTP or HTTPS)
  • 17. Reading from HBase and presenting to the user’s browser
      [Diagram] Hadoop/HBase in the cloud. Service node: Tomcat (serves static HTML and proxies REST scans) and the HBase REST Gateway (scans the HTable). Cluster: one master (NameNode, HBase Master) and three Hadoop slaves (each a DataNode and Region Server).
  • 18. Tomcat as HTTP Proxy
      - The HBase Stargate REST server runs on port 8081 and is connected to HBase
      - The REST server can scan tables
      - A JavaScript web page is the client
      - Problem: JavaScript security restrictions (the same-origin policy) do not allow the page's JavaScript to make REST calls to any server other than the one it was loaded from
      - Solution: Tomcat is used as a proxy. It:
          - serves static HTML pages with the JavaScript client, images, etc.
          - proxies REST requests from the JavaScript client to the HBase Stargate server running on port 8081
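The core of the proxy arrangement on slide 18 is a path rewrite: the browser only ever talks to the origin that served the page, and the proxy maps a reserved path prefix onto the REST server on port 8081. The sketch below shows only that mapping step, with the JDK alone. The `/rest` prefix, the `ProxyRoute` class, and the `sentiment` table name in the usage example are hypothetical; the talk does not say how its Tomcat proxy was configured.

```java
import java.net.URI;

public class ProxyRoute {
    // Hypothetical prefix under which the proxy exposes the REST server.
    private static final String PREFIX = "/rest";

    /**
     * Maps a path requested from the proxy (for example /rest/table/row) to the
     * corresponding URI on the backend REST server. The prefix is stripped and
     * the query string, if any, is passed through unchanged.
     */
    public static URI toBackend(URI backendBase, String requestPath, String rawQuery) {
        if (!requestPath.startsWith(PREFIX + "/")) {
            throw new IllegalArgumentException("not a proxied path: " + requestPath);
        }
        String target = backendBase.toString() + requestPath.substring(PREFIX.length());
        if (rawQuery != null && !rawQuery.isEmpty()) {
            target += "?" + rawQuery;
        }
        return URI.create(target);
    }

    public static void main(String[] args) {
        URI backend = URI.create("http://localhost:8081");
        // "sentiment" is a made-up table name used only for illustration.
        System.out.println(toBackend(backend, "/rest/sentiment/obama:2012:06:04:13:34", null));
    }
}
```

Keeping static pages and proxied calls under one origin means the JavaScript client needs no cross-origin configuration, at the cost of routing every REST call through Tomcat.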
  • 19. Future Improvements
      - Elastic HBase in the cloud
      - At night, use one VM to receive tweets and write them out into SequenceFiles in S3
      - Before business hours, start up HBase and run a MapReduce job to process all these SequenceFiles and write the results into HBase
      - Result: a cost-effective real-time HBase application in the cloud
  • 20. Big Data Apps in the Cloud
      - The cloud is suitable for Big Data apps that use Big Data from the Internet. For example:
          - Twitter public status updates
          - Blog entries
          - Web crawl data
      - Big Data apps in the cloud are not useful if all your data is generated inside your network
          - Router, storage device, and authentication device logs
          - Logs from web servers located inside your network
  • 21. Questions, Comments, Flames?
      - Thanks!
      - Jagane Sundar
      - jagane@altoscale.com