Scaling Analytics with elasticsearch


  1. Scaling Analytics with elasticsearch
     Dan Noble, @dwnoble
  2. Background
     • Technologist at The HumanGeo
     • We use elasticsearch to build social media analysis tools
     • 100 million documents indexed
     • 600GB+ index size
     • Author of the Python elasticsearch driver “rawes”: https://github.com/humangeo/rawes
  3. Overview
     • What is elasticsearch?
     • Scaling with elasticsearch
     • How can I use elasticsearch to help with analytics?
     • Use Case: Social Media Analytics
  4. What is elasticsearch?
  5. Search Engine
     • Open source
     • Distributed
     • Automatic failover
     • Crazy fast
  6. Search Engine
     • Actively maintained
     • REST API
     • JSON messages
     • Lucene based
  7. Search
     [Diagram: an Elasticsearch “cluster” of one host holding the index “Articles”]
     • Simple case: one host
     • One index containing a set of articles
  8. Distributed Search
     [Diagram: an Elasticsearch “cluster” of two hosts holding shards Articles (a) and Articles (b)]
     • Too much data?
     • Add another host
     • Indices can be broken up into “shards” that live on different machines
  9. Redundancy
     [Diagram: an Elasticsearch cluster of two hosts holding two copies each of shards Articles (a) and Articles (b)]
     • Shards can be replicated to improve availability
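
The sharding and replication described on slides 8 and 9 are configured per index. What follows is a minimal sketch of an index-creation body, assuming two primary shards, (a) and (b), and one replica of each, which matches the four shard copies in the slide 9 diagram. The helper names `index_settings` and `total_shard_copies` are hypothetical, and the commented-out call assumes the rawes client and the `elastic-00` host shown on slide 14.

```python
def index_settings(primary_shards, replicas_per_shard):
    """Build an index-creation body (hypothetical helper, for illustration)."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas_per_shard,
            }
        }
    }


def total_shard_copies(primary_shards, replicas_per_shard):
    """Each primary shard exists once, plus once per replica."""
    return primary_shards * (1 + replicas_per_shard)


body = index_settings(primary_shards=2, replicas_per_shard=1)
copies = total_shard_copies(2, 1)

# With a reachable cluster, the body could be sent like this (assumption,
# not taken from the slides):
# import rawes
# es = rawes.Elastic('elastic-00:9200')
# es.put('articles', data=body)
```

The number of primary shards is fixed when the index is created, while the replica count can be changed later, so it is usually the shard count that needs thought up front.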
  10. Node Auto Discovery
      [Diagram: an Elasticsearch cluster of three hosts holding three copies each of shards Articles (a) and Articles (b)]
      • Say we add a third host
      • elasticsearch will automatically start moving shards to this new host to distribute load
  11. Failover
      [Diagram: an Elasticsearch cluster of three hosts holding three copies each of shards Articles (a) and Articles (b)]
      • Say a host goes down
      • Shards on that host are no longer available for search
      • Elasticsearch automatically rebuilds these two shards on other hosts
  12. Querying
      [Diagram: a client (web application) sends the query “Barack Obama” to an Elasticsearch cluster of three hosts holding shards Articles (a) and Articles (b)]
      • Client (web application) searches for articles
      • Can query against any host
      • Request is sent to other shards if needed
  13. REST API
      • JSON query syntax
      • Developer friendly
      • Easy to get started
  14. Python Example

        import rawes

        # Connect to one host in the cluster
        es = rawes.Elastic('elastic-00:9200')

        # Full-text search of the articles index
        es.get('articles/_search', data={
            "query": {
                "filtered": {
                    "query": {
                        "query_string": {
                            "query": "Barack Obama"
                        }
                    }
                }
            }
        })
  15. Community
  16. Elasticsearch Summary
      • Scales horizontally
      • Redundancy
      • Configures itself automatically
      • Developer friendly
  17. Analytics and elasticsearch
      • Date histograms
      • Statistical facets
      • Geospatial queries
      • All with arbitrary search parameters
      • Again: fast
  18. Use Case: Social Media Analysis
      • Use social media APIs to search for data on a topic of interest
      • 100 million documents indexed
      • Sentiment analysis
      • Location extraction (“geotagging”)
  19. Sample Document

        # Index one Facebook post into the articles index
        es.post('articles/facebook', data={
            "date": "2012-09-01 08:37:55",
            "tags": {
                "sentiment": {
                    "positive": 0.36,
                    "negative": 0.10
                },
                "geotags": [{
                    "term": "Cairo",
                    "location": "30.0566,31.2262",
                    "type": "geo_point"
                }],
                "search_terms": ["Mohamed Morsi"]
            },
            "item": {
                "publisher": "Facebook",
                "source_domain": "www.facebook.com",
                "author": "James Smith",
                "source_url": "http://www.facebook.com/5551231234/posts/414141414141",
                "content_text": "Mohamed Morsi visits Iran for first time since 1979 ....",
                "title": "James Smith posted a note to Facebook",
                "author_url": "http://www.facebook.com/profile.php?id=5551231234"
            }
        })
  20. Analytical Queries
  21. Date Histogram for Sentiment

        # Bucket matching documents by day and summarize positive sentiment per bucket
        es.get('articles/_search', data={
            "query": {
                "query_string": {"query": "Mohamed Morsi"}
            },
            "facets": {
                "sentiment_histogram": {
                    "date_histogram": {
                        "key_field": "date_of_information.$date",
                        "value_field": "tags.sentiment.positive",
                        "interval": "day"
                    }
                }
            }
        })

  22. Date Histogram for Sentiment
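
To make the date histogram facet on slide 21 more concrete, here is a rough pure-Python sketch of what it computes: documents are grouped into one bucket per day, and the value field is summarized within each bucket. This is an illustration only, not how elasticsearch implements it. The function name `daily_sentiment` and the three documents are made up, and the sketch reads the `date` field from the sample document on slide 19 rather than the `date_of_information.$date` key used in the query.

```python
from collections import defaultdict
from datetime import datetime


def daily_sentiment(docs):
    """Group documents by calendar day and summarize positive sentiment."""
    buckets = defaultdict(list)
    for doc in docs:
        day = datetime.strptime(doc["date"], "%Y-%m-%d %H:%M:%S").date()
        buckets[day].append(doc["tags"]["sentiment"]["positive"])
    return [
        {
            "day": day.isoformat(),
            "count": len(values),
            "total": sum(values),
            "mean": sum(values) / len(values),
        }
        for day, values in sorted(buckets.items())
    ]


# Made-up documents shaped like the sample on slide 19
docs = [
    {"date": "2012-09-01 08:37:55", "tags": {"sentiment": {"positive": 0.36}}},
    {"date": "2012-09-01 17:02:10", "tags": {"sentiment": {"positive": 0.12}}},
    {"date": "2012-09-02 09:15:00", "tags": {"sentiment": {"positive": 0.25}}},
]

histogram = daily_sentiment(docs)
```

Plotting `mean` against `day` gives the kind of sentiment-over-time chart the slide title suggests.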
  23. Statistical Facet for Sentiment: Query

        # Summary statistics of positive sentiment over all matching documents
        es.get('articles/_search', data={
            "query": {
                "query_string": {"query": "Mohamed Morsi"}
            },
            "facets": {
                "sentiment_stats": {
                    "statistical": {
                        "field": "tags.sentiment.positive"
                    }
                }
            }
        })

  24. Statistical Facet for Sentiment: Result

        {
            "facets": {
                "sentiment_stats": {
                    "_type": "statistical",
                    "count": 8825,
                    "max": 0.375,
                    "mean": 0.008503991588291782,
                    "min": 0.0,
                    "std_deviation": 0.021251077265305472,
                    "sum_of_squares": 4.623648343200283,
                    "total": 75.04772576667497,
                    "variance": 0.00045160828493598306
                }
            },
            "hits": {
                "hits": [],
                "max_score": 1.1120162,
                "total": 8825
            },
            "took": 60
        }
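
The fields in the result on slide 24 are related by simple arithmetic, which is a handy sanity check when reading facet output. The sketch below recomputes `mean`, `variance` and `std_deviation` from `count`, `total` and `sum_of_squares`, assuming the facet reports the population variance (mean of squares minus squared mean); that assumption is consistent with the numbers on the slide.

```python
import math

# Values copied from the result on slide 24
count = 8825
total = 75.04772576667497
sum_of_squares = 4.623648343200283

# Derived statistics
mean = total / count
variance = sum_of_squares / count - mean ** 2  # population variance
std_deviation = math.sqrt(variance)

# Values reported by the facet, for comparison
reported = {
    "mean": 0.008503991588291782,
    "variance": 0.00045160828493598306,
    "std_deviation": 0.021251077265305472,
}
```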
  25. Top Keywords

        # The three most frequent search terms across all documents
        es.get('articles/_search', data={
            "query": {
                "match_all": {}
            },
            "facets": {
                "search_terms": {
                    "terms": {
                        "field": "tags.search_terms",
                        "size": 3
                    }
                }
            }
        })

  26. Top Search Terms
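
As a rough picture of what the terms facet on slide 25 returns, the sketch below counts how many documents carry each search term and keeps the most frequent ones. It is plain Python for illustration, not elasticsearch's implementation; the function name `top_terms` and the four documents are made up, and each search term is treated as a single unsplit value.

```python
from collections import Counter


def top_terms(docs, size=3):
    """Return the `size` most frequent search terms with their counts."""
    counts = Counter()
    for doc in docs:
        counts.update(doc["tags"]["search_terms"])
    return counts.most_common(size)


# Made-up documents; only the field used by the facet is shown
docs = [
    {"tags": {"search_terms": ["Mohamed Morsi", "Barack Obama", "Cairo", "Iran"]}},
    {"tags": {"search_terms": ["Mohamed Morsi", "Barack Obama", "Cairo"]}},
    {"tags": {"search_terms": ["Mohamed Morsi", "Barack Obama"]}},
    {"tags": {"search_terms": ["Mohamed Morsi"]}},
]

top = top_terms(docs, size=3)
```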
  27. Geospatial search

        # Documents geotagged within 20 km of latitude 30, longitude 31
        es.get('articles/_search', data={
            "query": {
                "filtered": {
                    "filter": {
                        "geo_distance": {
                            "distance": "20km",
                            "tags.geotags.location": {
                                "lat": 30,
                                "lon": 31
                            }
                        }
                    }
                }
            }
        })
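
The `geo_distance` filter on slide 27 keeps documents whose geotag lies within a given distance of a center point. The sketch below shows the idea with the haversine great-circle formula, assuming a spherical Earth with a mean radius of 6371 km. elasticsearch's own distance calculation may differ in detail, and the names `haversine_km` and `within_distance` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(point, center, distance_km):
    """True if `point` is at most `distance_km` from `center`."""
    return haversine_km(point[0], point[1], center[0], center[1]) <= distance_km


center = (30.0, 31.0)        # lat/lon from the query on slide 27
cairo = (30.0566, 31.2262)   # geotag from the sample document on slide 19

distance = haversine_km(center[0], center[1], cairo[0], cairo[1])
matches_20km = within_distance(cairo, center, 20)
```

Under these assumptions the Cairo geotag from the sample document sits slightly more than 20 km from the query center, so the sample would only match with a somewhat larger `distance`.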
  28. Questions