Powering Social Business Intelligence
Cassandra and Hadoop at the Dachis Group
Social Business Wha?
Big Data meets Big Budgets




• Brand marketers spend
  • $450 billion (~£270 billion) annually on traditional media
  • $50 billion (~£30 billion) annually on SEO/SEM
• Starting to transition to social media
Effectiveness - Traditional
Measure all the things!




(Image: measuring traditional marketing effectiveness)
Effectiveness - Social
Measure all the things!




(Image: measuring social marketing effectiveness)
The Dachis Group
Measure all the things!




• Jeff Dachis amasses small army of social strategists
• Funds team to create social analytics platform
  • Measure business outcomes of social media strategies
  • Track social media surrounding Forbes Global 2000
  • Include all brands, all subsidiaries, all social media types
Architecture

• Raw data in S3
• Cassandra
  • Realtime queries to return raw data
  • Hadoop analytic integration for foundational measures
  • Horizontal scalability
  • Operationally simple
• RDBMS
  • Time rollups of measures
  • Aggregates and composite measures
  • Arbitrary dimensional queries
  • Mini data warehouse
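
The "time rollups of measures" role of the RDBMS can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 instead of the Postgres store named on the Pipeline slide, and the table `measure_hourly`, its columns, and the sample values are hypothetical.

```python
import sqlite3

# Hypothetical schema: one row per brand, measure, and hour.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measure_hourly (brand TEXT, measure TEXT, hour TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO measure_hourly VALUES (?, ?, ?, ?)",
    [
        ("acme", "signal_volume", "2011-11-25T09", 120.0),
        ("acme", "signal_volume", "2011-11-25T10", 80.0),
        ("acme", "signal_volume", "2011-11-26T09", 50.0),
    ],
)


def daily_rollup(connection):
    """Roll hourly values up to one total per brand, measure, and day."""
    return connection.execute(
        """
        SELECT brand, measure, substr(hour, 1, 10) AS day, SUM(value) AS total
        FROM measure_hourly
        GROUP BY brand, measure, day
        ORDER BY brand, measure, day
        """
    ).fetchall()


rollup = daily_rollup(conn)
```

The same GROUP BY shape extends to weekly or monthly rollups by changing how the time bucket is derived.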
Pipeline

• AWS S3: Raw Signal Storage
• Cassandra: Signal Repository
• Memcached / Postgres: Metrics Store
• Processing stages, left to right: Normalization, Enrichment, Analysis
Normalization




• Parallel copy from S3 to HDFS
• MapReduce to Cassandra from Raw to Normalized CF
• Normalized data model
  • Decent investment to get right
  • Mostly for conceptual reasons rather than concerns about queries
  • Secondary indexes vs app maintained indexes
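
The "app maintained indexes" option can be illustrated with a small sketch. Plain Python dicts stand in for two column families here; the names `signals` and `signals_by_brand` and the field names are hypothetical, not the deck's actual data model.

```python
from collections import defaultdict

# Plain dicts stand in for two column families (hypothetical names).
signals = {}                           # row key: signal id -> normalized columns
signals_by_brand = defaultdict(dict)   # row key: brand -> {signal id: ""}


def write_signal(signal_id, columns):
    """Write the signal, then the index entry the application maintains."""
    signals[signal_id] = columns
    signals_by_brand[columns["brand"]][signal_id] = ""


def signals_for_brand(brand):
    """Read the index row first, then fetch each signal by key."""
    return [signals[sid] for sid in sorted(signals_by_brand[brand])]


write_signal("s1", {"brand": "acme", "type": "microblog", "text": "love it"})
write_signal("s2", {"brand": "globex", "type": "blog", "text": "long review"})
write_signal("s3", {"brand": "acme", "type": "blog", "text": "follow-up post"})
```

The trade-off is that the application has to keep both writes in sync, whereas a built-in secondary index is maintained by the database.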
Enrichment



• Enrich with
  • Unique company/brand information
  • Sentiment
  • Relationships
  • Conversations
  • Social graph information
• Enter Pig
• Enter Oozie
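
Enrichment is essentially a join between signals and reference data such as company/brand information. The deck does this with Pig; the sketch below only shows the join logic in plain Python, and every field name in it (`account_id`, `brand`, `company`) is an assumption made for illustration.

```python
def enrich(raw_signals, brands_by_account):
    """Attach brand information to each signal via its account id (a left join)."""
    enriched = []
    for signal in raw_signals:
        # Signals with no known account keep None for the brand fields.
        info = brands_by_account.get(signal["account_id"], {})
        enriched.append(
            {**signal, "brand": info.get("brand"), "company": info.get("company")}
        )
    return enriched


# Hypothetical sample data.
brands_by_account = {
    "acct-1": {"brand": "Acme Cola", "company": "Acme Corp"},
}
raw_signals = [
    {"id": "s1", "account_id": "acct-1", "text": "new flavor today"},
    {"id": "s2", "account_id": "acct-9", "text": "unrelated post"},
]

enriched_signals = enrich(raw_signals, brands_by_account)
```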
The Bleeding Edge
Pig


 • newlogicalplan in 0.8.0
 • Debugging/tracing?
 • Incremental development
 • Working with Cassandra
      • Pygmalion - eases moving data to and from Cassandra
 • Experience, unit test framework, UDFs, community



 (Image pair, captioned "slowly became")
The Bleeding Edge
Oozie
• Learning curve and common errors
  • User impersonation
  • Logs, we haz them, lots of them
  • Web UI needs love
• Specific to Cassandra
  • mapreduce.fileoutputcommitter.marksuccessfuljobs
  • See http://wiki.apache.org/cassandra/HadoopSupport#Oozie
• Still very good DAG workflow crunching tool
  • Subworkflows, fork/join, regular scheduling, dataset detection
  • Extensible
  • Apache Incubator (@oozie on Twitter, #oozie on freenode)
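
The fork/join idea behind a DAG workflow can be sketched in a few lines. This is not Oozie syntax (Oozie workflows are defined in XML); it uses Python's standard graphlib module (Python 3.9+) purely to show the ordering, and the action names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it waits on.
workflow = {
    "normalize": set(),
    "enrich_sentiment": {"normalize"},        # fork: two branches after normalize
    "enrich_relationships": {"normalize"},
    "analyze": {"enrich_sentiment", "enrich_relationships"},  # join
}


def run_order(dag):
    """Return batches of actions; actions within a batch could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = run_order(workflow)
```

The middle batch holds the two forked branches; `analyze` only becomes ready once both have finished, which is the join.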
The Bleeding Edge
Cassandra

 • Rack aware snitch and replication
   • Always rotate racks in order in topology
   • In EC2 this likely means rotate AZs
 • Dealing with scanning over column families
   • Project early
 • General tuning and unique workload
   • Mahout and other higher-memory Hadoop tasks
   • EC2 instance types
 • Visualization tools helped (OpsCenter; Acunu has Control Center)
 • Community++
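
Why rotating racks in ring order matters can be shown with a toy model. The sketch below is a simplified stand-in for rack-aware replica placement, not Cassandra's actual code: it walks the ring clockwise from each node and prefers nodes on racks not yet used. Node and rack names are hypothetical.

```python
from collections import Counter


def replicas(ring, start, rf):
    """Pick rf replicas walking clockwise from start, preferring unused racks."""
    chosen, racks_used, same_rack = [], set(), []
    for step in range(len(ring)):
        node, rack = ring[(start + step) % len(ring)]
        if rack in racks_used:
            same_rack.append(node)  # remember it in case racks run out
        else:
            chosen.append(node)
            racks_used.add(rack)
        if len(chosen) == rf:
            return chosen
    # Fewer racks than replicas: fall back to skipped nodes in ring order.
    return (chosen + same_rack)[:rf]


def replica_load(ring, rf):
    """Count how many token ranges each node ends up holding a replica for."""
    load = Counter()
    for start in range(len(ring)):
        load.update(replicas(ring, start, rf))
    return dict(load)


# Ring order is list order; each entry is (node, rack or availability zone).
alternating = [("a1", "A"), ("b1", "B"), ("a2", "A"), ("b2", "B")]
clumped = [("a1", "A"), ("a2", "A"), ("b1", "B"), ("b2", "B")]

even_load = replica_load(alternating, rf=2)
skewed_load = replica_load(clumped, rf=2)
```

In this model the alternating ring gives every node the same number of replicas, while the clumped ring piles extra replicas onto the first node of each rack.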
Social Business Index
Launches September 2011


                          • Global Ranking of Companies
                          • Industry Rankings
                          • Visualization of strategy
This might actually work!


 • Fall 2011, built up the team
 • Expertise in Pig, Lucene/Solr, machine learning, statistics, event prediction and analysis
 • Making everlasting gobstoppers
Social Performance Monitor
The measures behind the score
Topics topics topics




• Black Friday
  • Science project!
  • Mallet, Pig
  • Custom analysis
• Super Bowl
• Oscars
Productizing Topics



• Ongoing automated topic detection
• Lessons from one-off topic analysis
• Represented by term distributions
• Threads with detail like
  • Signal volume
  • Participants
  • Links
  • Sentiment gauge
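
"Represented by term distributions" can be made concrete with a small sketch. The one-off analyses in the deck used Mallet; the code below is not Mallet's topic model, just a minimal illustration of describing a topic as relative term frequencies and matching new text to the closest topic by cosine similarity. The topic names and sample texts are hypothetical.

```python
import math
from collections import Counter


def term_distribution(text):
    """Relative frequency of each lowercase, whitespace-separated term."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse term distributions."""
    dot = sum(weight * q.get(term, 0.0) for term, weight in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical topics, each summarized by a term distribution.
topics = {
    "black_friday": term_distribution("black friday deals sale doorbuster deals"),
    "oscars": term_distribution("oscars red carpet best picture award"),
}


def closest_topic(text):
    """Return the name of the topic whose distribution is most similar."""
    dist = term_distribution(text)
    return max(topics, key=lambda name: cosine(dist, topics[name]))
```

For example, `closest_topic("best deals this black friday")` picks `black_friday` even though "best" also appears in the other topic, because more of the text's weight overlaps with the Black Friday terms.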
Advocates




• Auto-discovery of potential advocates
• Curated set of known advocates
• Example signal (from Cassandra)
• Reports and other useful bits
Lessons learned




• Emerging products are sometimes frustrating, but well worth the pain in their respective niche.
• “Never underestimate the massive impact of small bugs in big data.” (@peteskomoroch at LinkedIn)
• Community karma
A Note on Community

• Community involvement
  • IRC, mailing lists, Twitter, conferences, meetups
  • Newer projects have little or outdated docs
  • Some features may be
    • Deprecated
    • Not ready for primetime
    • Not a fit for your use case
• Community karma
  • Don’t just take
  • Be a bridge builder
  • Positive karma helps
Questions?




• We’re hiring
• Ping me @jeromatron (Twitter and IRC)

Editor's notes (numbered by slide)

  3. This is how they see their capability after, for example, the Super Bowl.
  4. When managers ask how effective the campaign was, the marketing department says it was awesome. When asked how they know that, they say that Zoltar told them so. In reality there are a lot of home-grown methods, some good, some not so good. Some of what we did grew out of a spreadsheet that was manually updated, validated and refined over time with one of our major customers.
  5. What brands does Berkshire Hathaway have under its gigantic umbrella?!?
     Can mention Red Bull, Disney, HP, Levi’s, Samsung, Honda, etc.
  6. Operationally simple doesn’t mean that you don’t need to learn a lot about it, just that there aren’t a lot of moving parts.
     Unique use case in that it’s hybrid: lots of writes as well as analytics and reads.
  8. It’s just scads of text, but we do classify conversations as long or short (the difference between blogs and microblogs).
     We may use Hadoop to generate alternate CFs for specific queries as we need them.
  9. Company information is unique because we had to buy, borrow, steal and, yes, crowdsource that data.
     Pig handles joins really well, for example joining account snapshots and signals for enrichment.
  10. Mention Brandon’s work to make things better with CassandraStorage and newer versions of Pig, including regression tests.
      Speculative execution.
  11. Mention having looked at Azkaban as well.
      No real way around the logs; it just takes getting used to. User impersonation is a product of the authorization framework; patch added to DSE.
  12. Mention consistency level choices.
      Rotate racks - yeah, wasn’t documented except in the code.
      Backup/restore.
      Root causes sometimes difficult to determine.
      Scaling up - each order of magnitude jump has its own problems.
  13. But the long sleepless summer finally pays off...
  14. Everlasting gobstoppers are a fun phase for the projects.
  15. Reveals numbers.
  16. Explanation.
      Great working as a team.
      Mention Boxing Day.
  17. Also customer-curated topics in the future.
  19. Data consistency - periodic checks, staging cluster, unit tests, integration testing.
      Reparable data. Sometimes incredibly painful, but possible.
      Mention backup/restore.
      Mention root causes.
  20. Be active in the communities of these new projects.
      If necessary, start building communities around them.
      Don’t just take: answer questions, follow mailing lists (but have a filter), and contribute docs, bug submissions, feature requests, votes, representation, tests, patches/pull requests.