Solr, Lucene & Hadoop @ Etsy
Monday, May 14, 12

david@etsy
4 Years Lucene and Solr @ Etsy

History of Search @ Etsy
Hadoop + HBase Indexing (in development)
Replication

About Us

13MM Listings
39MM Unique Visitors
880K Shops / 150 Countries
100+ Engineers

Architecture Overview

Overview
    Search       Web         Database
    +n slaves    +n webs     +n db shards

                 Memcached
                 +n caches

Thrift
    Search                                Web
    slave                                 web
             query = hats for cats
    slave                                 web
             result = 402, 283, 837
    +n slaves                             +n webs

Hydration
    Database     Memcached    Web
    shard        cache        web
    shard        cache        web
    +n shards    +n caches    +n webs

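The Thrift and Hydration slides suggest that a search for "hats for cats" returns only listing ids, and that the web tier then loads the full records from Memcached or the database shards. Below is a minimal sketch of that flow. It assumes plain dictionaries stand in for the cache and the shards, and it uses a hypothetical modulo sharding scheme; none of these names or choices come from the deck.

```python
from typing import Dict, List

# Hypothetical in-memory stand-ins for Memcached and two DB shards.
cache: Dict[int, dict] = {402: {"id": 402, "title": "cat top hat"}}
shards: List[Dict[int, dict]] = [
    {402: {"id": 402, "title": "cat top hat"}},       # shard 0: even ids
    {283: {"id": 283, "title": "cat beret"},
     837: {"id": 837, "title": "cat sombrero"}},      # shard 1: odd ids
]


def shard_for(listing_id: int) -> Dict[int, dict]:
    """Pick a shard by id modulo shard count (an assumed scheme)."""
    return shards[listing_id % len(shards)]


def hydrate(ids: List[int]) -> List[dict]:
    """Turn search result ids into full records: cache first, then shard."""
    records = []
    for listing_id in ids:
        record = cache.get(listing_id)
        if record is None:
            record = shard_for(listing_id)[listing_id]
            cache[listing_id] = record  # populate the cache for next time
        records.append(record)
    return records


results = hydrate([402, 283, 837])  # ids as returned by a search slave
```

Keeping only ids in the search response keeps the Thrift payload small and leaves the database as the single source of record data.
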
The Results

History of Search

History of Search: 2007
•1 Million Listings
•A Single “Master” Postgres Database
•PHP > Twisted > Stored Proc > TSearch

History of Search: 2008
•2 Million Listings
•A Single “Master” Postgres Database
•PHP > Solr
•4 Solr Slaves + 2

History of Search: 2009
•4 Million Listings
•A Single “Master” Postgres Database
•PHP > Solr
•6 Solr Slaves + 2

History of Search: 2010
•7 Million Listings
•A Single “Master” Postgres Database
•PHP > Thrift > Solr
•10 Solr Slaves + 1

History of Search: 2011
•10 Million Listings
•“Master” Postgres Database + DB SHARDS!
•PHP > Thrift > Solr
•24 Solr Slaves + 1

Future of Search: 2012
•?? Million Listings
•MORE DB SHARDS!
•PHP > Thrift > Solr
•?? Solr Slaves + 1 Master

What Did We Learn?

Lucene + Solr > TSearch
http://www.depesz.com/2010/10/17/why-im-not-fan-of-tsearch-2/

Love Lucene + Solr Trunk!

Run, Don’t Walk...

Deployinator
Fork it: https://github.com/etsy/deployinator

Smoker

StatsD, Graph Everything!
Fork it: https://github.com/etsy/statsd

95th Percentile

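A percentile is easy to compute from raw timings, and it shows tail latency that a mean tends to hide. The sketch below uses the nearest-rank definition, which is one of several common definitions and not necessarily the one behind the graphs in the talk. The latency samples are made up for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct% of n) in sorted order."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 17,
                13, 14, 15, 12, 11, 19, 16, 14, 13, 900]
p95 = percentile(latencies_ms, 95)
p50 = percentile(latencies_ms, 50)
```

With these samples the median stays near the typical request while the 95th percentile lands on one of the slow outliers, which is the behavior worth graphing.
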
start · build_query · perform_search · receive_search_ads · search_side_response · create_event_logger · set_tpl_vars · tpl_render · receive_search_ads_post_render

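Per-step timings like the ones named above can be reported as StatsD timers, whose wire format is `<bucket>:<value>|ms` sent over UDP. The following is a rough sketch of that idea, not Etsy's actual instrumentation: the `search.page` prefix, the helper names, and the no-op step runner are all hypothetical.

```python
import socket
import time
from typing import Callable, List


def timing_packet(bucket: str, elapsed_ms: int) -> bytes:
    """Format a StatsD timer datagram: '<bucket>:<value>|ms'."""
    return f"{bucket}:{elapsed_ms}|ms".encode("ascii")


def timed_steps(steps: List[str], run: Callable[[str], None],
                prefix: str = "search.page") -> List[bytes]:
    """Run each named step and build one timer packet per step."""
    packets = []
    for step in steps:
        started = time.monotonic()
        run(step)
        elapsed_ms = int((time.monotonic() - started) * 1000)
        packets.append(timing_packet(f"{prefix}.{step}", elapsed_ms))
    return packets


def send(packets: List[bytes], host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; 8125 is the StatsD default port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for packet in packets:
            sock.sendto(packet, (host, port))


# A no-op stand-in for the real work done in each step.
packets = timed_steps(["build_query", "perform_search", "tpl_render"],
                      run=lambda step: None)
```

Calling `send(packets)` would emit the datagrams to a local StatsD daemon. Because UDP is fire-and-forget, a missing daemon does not slow down the request.
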
Solr Top Level Cache > Memcached

etsy-index.properties
    $ cat /search/data/person/index/etsy-index.properties
    #Tue Mar 27 13:05:51 EDT 2012
    max_update_time=2012-03-27T17:05:51.955Z

Check Index Size
Don’t Install if < 50% Current Size

Check if Index is Too Old
Don’t Update if > 10 Days Old

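The two safety checks above can be sketched in a few lines. This version assumes that "size" is any comparable number (bytes or document count) and that index age is measured from the `max_update_time` value in `etsy-index.properties`. Both are guesses, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def read_max_update_time(properties_text: str) -> datetime:
    """Parse max_update_time from etsy-index.properties content."""
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith("max_update_time="):
            value = line.split("=", 1)[1]
            # The slide shows UTC timestamps such as 2012-03-27T17:05:51.955Z.
            parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("max_update_time not found")


def safe_to_install(new_size: int, current_size: int) -> bool:
    """Refuse a new index smaller than 50% of the current one."""
    return new_size >= 0.5 * current_size


def safe_to_update(max_update_time: datetime, now: datetime,
                   max_age: timedelta = timedelta(days=10)) -> bool:
    """Refuse incremental updates on an index older than max_age."""
    return now - max_update_time <= max_age


props = "#Tue Mar 27 13:05:51 EDT 2012\nmax_update_time=2012-03-27T17:05:51.955Z\n"
updated = read_max_update_time(props)
```

Guards like these fail closed: a truncated or stale index is rejected and the index already being served stays in place.
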
What Did We Learn?
Store Nothing

Keep Denormalized Data

DB Shards > PHP Denormalizer > JSON > Search Database

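The denormalizer step flattens rows that live in separate tables, possibly on separate shards, into one JSON document per listing. The deck says this component is written in PHP. The sketch below is in Python for illustration only, and its table shapes and field names are invented.

```python
import json
from typing import Dict, Iterable, List

# Hypothetical rows as they might come back from the DB shards.
listings = [{"listing_id": 402, "shop_id": 7, "title": "cat top hat"}]
shops = {7: {"shop_name": "HatsForCats"}}
tags = {402: ["cat", "hat"]}


def denormalize(listing: Dict, shops: Dict[int, Dict],
                tags: Dict[int, List[str]]) -> str:
    """Flatten one listing and its related rows into a single JSON document."""
    doc = dict(listing)
    doc.update(shops.get(listing["shop_id"], {}))
    doc["tags"] = tags.get(listing["listing_id"], [])
    return json.dumps(doc, sort_keys=True)


def denormalize_all(rows: Iterable[Dict]) -> List[str]:
    """Denormalize every listing row into its JSON document."""
    return [denormalize(row, shops, tags) for row in rows]


documents = denormalize_all(listings)
```

Storing the flattened JSON means a reindex can read one record per listing instead of repeating the joins against the shards.
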
Full Reindex > Apply Incremental > Install

Full Reindex > Apply Incremental > Apply Incremental > Install

Indexer
Database

HBase + Hadoop

Why HBase?

DB Shards > PHP Denormalizer > JSON > HBase

listings_denormalized
    {NAME => 'listings_denormalized', FAMILIES
    => [{NAME => 'listing_data', BLOOMFILTER =>
    'ROW', REPLICATION_SCOPE => '0',
    COMPRESSION => 'SNAPPY', VERSIONS => '1',
    TTL => '-1', BLOCKSIZE => '65536',
    IN_MEMORY => 'false', BLOCKCACHE =>

listings_denormalized_modified_index
    {NAME =>
    'listings_denormalized_modified_index',
    FAMILIES => [{NAME => 'pks', BLOOMFILTER
    => 'ROW', REPLICATION_SCOPE => '0',
    COMPRESSION => 'SNAPPY', VERSIONS => '1',
    TTL => '-1', BLOCKSIZE => '65536',

SOLR-1301
https://issues.apache.org/jira/browse/SOLR-1301

Solr Output Format
Disk
HDFS
•Solr Document Converter
•Solr Requires

•Not Great with Multi-Core Configs
•Added Solr Multi-Core Support
•Solr Config Issues
•Added ENV support

SolrInputDocumentWritable
    public class SolrInputDocumentWritable extends SolrInputDocument
            implements org.apache.hadoop.io.Writable {
        // write(DataOutput) and readFields(DataInput) are not shown on the slide
    }

Oozie

Oozie + HBase?

ScanStringGenerator
http://blog.ozbuyucusu.com/2011/07/21/using-hbase-tablemapper-via-oozie-workflow/

    Hadoop                       Indexer
    Oozie                        Start
    Map            HBase         Copy
    Reduce         HDFS          Merge
    Solr Output    Disk          Install

IndexerActionMain

Deployinator

IndexCompare

    $ ./compare

    ERROR: please provide two index directories

    example: ./compare -p 0.1 -i user_id ./index ./index-1332867952588
    options:
        -p --percent= percent of the index to check
        -i --id=      primary key id field in the index
        -h --hash=    comparison or hash field in the index
        <index> <index>




    $ ./compare \
        /search/data/person/index-1332867952588/ \
        /search/data/person/index-1335378487672

      id field: user_id
    hash field: hash
    percentage: 0.0010
         files: /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672

    /search/data/person/index-1332867952588 contains 1515512 docs
    /search/data/person/index-1335378487672 contains 14837972 docs
    1516 of 1516 documents are the same

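The output above is consistent with sampling about 0.1% of the first index (1516 of 1515512 documents) and comparing a hash field by primary key. Here is a minimal sketch of that idea, assuming each index can be viewed as a mapping from id to hash value. The real tool reads Lucene indexes, and its sampling method is not shown in the deck.

```python
import random
from typing import Dict, Tuple


def compare_indexes(old: Dict[int, str], new: Dict[int, str],
                    fraction: float, seed: int = 0) -> Tuple[int, int]:
    """Sample a fraction of ids from `old` and count matching hashes in `new`.

    Returns (same, checked).
    """
    sample_size = max(1, round(len(old) * fraction))
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sampled_ids = rng.sample(sorted(old), sample_size)
    same = sum(1 for doc_id in sampled_ids if new.get(doc_id) == old[doc_id])
    return same, len(sampled_ids)


# Toy indexes: id -> hash field value. The new index only adds a document.
old_index = {i: f"hash-{i}" for i in range(1000)}
new_index = dict(old_index)
new_index[1000] = "hash-1000"

same, checked = compare_indexes(old_index, new_index, fraction=0.1)
```

Sampling keeps the check cheap enough to run before every install, at the cost of possibly missing a small number of differing documents.
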
Copy and Merge

Open Source

Replication

Master > Slaves (+n slaves)

BitTorrent Replication

BitTorrent
Using BitTornado:

BitTorrent + Solr

Fork of ttorrent: https://github.com/etsy/ttorrent
Multi-File Support
Large File Support

Fork BitTorrent: Coming Soon

Need a job?

Thanks!

david@etsy
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 

Dernier (20)

2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 

Solr, Lucene and Hadoop @ Etsy

  • 1. Solr, Lucene & Hadoop @ Monday, May 14, 12
  • 2. david@etsy: 4 Years Lucene and Solr @ Etsy
  • 3. History of Search @ Etsy; Hadoop + HBase Indexing (in development); Replication
  • 4. About Us
  • 8. 13MM Listings, 39MM Unique Visitors, 880K Shops / 150 Countries, 100+ Engineers
  • 9. Architecture Overview
  • 10. Overview: Search (+n slaves), Web (+n webs), Database (+n db shards), Memcached (+n caches)
  • 11. Thrift: Search (slave, slave, +n slaves) and Web (web, web, +n webs); query = hats for cats; result = 402, 283, 837
  • 12. Hydration: Database (shard, shard, +n shards), Web (web, web, +n webs), Memcached (cache, cache, +n caches)
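
Slides 11 and 12 suggest that search returns only listing IDs over Thrift and that the web tier then "hydrates" those IDs into full records from Memcached and the database shards. A minimal sketch of that read path follows, assuming a cache-first lookup with a database fallback. The function name `hydrate`, the dict standing in for Memcached, and the `load_from_shard` callback are hypothetical and are not Etsy's code.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]


def hydrate(
    ids: List[int],
    cache: Dict[int, Record],
    load_from_shard: Callable[[int], Optional[Record]],
) -> List[Record]:
    """Turn search-result IDs into full records, cache first, database second."""
    records: Dict[int, Record] = {}
    for listing_id in ids:
        hit = cache.get(listing_id)
        if hit is None:
            # Cache miss: fall back to the (hypothetical) shard lookup.
            hit = load_from_shard(listing_id)
            if hit is None:
                continue  # the ID is in the index but no longer in the database
            cache[listing_id] = hit  # populate the cache for the next request
        records[listing_id] = hit
    # Preserve the ranking order returned by search.
    return [records[i] for i in ids if i in records]


# Illustrative data only; the IDs are the ones shown on slide 11.
example_cache = {402: {"id": 402, "title": "cat hat"}}
example_db = {283: {"id": 283, "title": "tiny top hat"}, 837: {"id": 837, "title": "cat beret"}}
example_result = hydrate([402, 283, 837], example_cache, example_db.get)
```

Keeping only IDs in the search response keeps the index small and leaves the database as the single source of record data, at the cost of one extra lookup per result.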
  • 14. History of Search
  • 15. History of Search 2007 • 1 Million Listings • A Single “Master” Postgres Database • PHP > Twisted > Stored Proc > TSearch
  • 16. History of Search 2008 • 2 Million Listings • A Single “Master” Postgres Database • PHP > Solr • 4 Solr Slaves + 2
  • 17. History of Search 2009 • 4 Million Listings • A Single “Master” Postgres Database • PHP > Solr • 6 Solr Slaves + 2
  • 18. History of Search 2010 • 7 Million Listings • A Single “Master” Postgres Database • PHP > Thrift > Solr • 10 Solr Slaves + 1
  • 19. History of Search 2011 • 10 Million Listings • “Master” Postgres Database + DB SHARDS! • PHP > Thrift > Solr • 24 Solr Slaves + 1
  • 20. Future of Search 2012 • ?? Million Listings • MORE DB SHARDS! • PHP > Thrift > Solr • ?? Solr Slaves + 1 Master
  • 21. What Did We Learn?
  • 22. Lucene + Solr > TSearch http://www.depesz.com/2010/10/17/why-im-not-fan-of-tsearch-2/
  • 23. Love Lucene + Solr Trunk!
  • 24. Run, Don’t Walk...
  • 25. Deployinator. Fork it: https://github.com/etsy/deployinator
  • 27. StatsD, Graph Everything! Fork it: https://github.com/etsy/statsd
  • 30. start · build_query · perform_search · receive_search_ads · search_side_response · create_event_logger · set_tpl_vars · tpl_render · receive_search_ads_post_render
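
Slide 30 lists the stages of a search request, which reads like a set of per-stage timers graphed through StatsD. As a rough illustration, the sketch below times a stage and emits it in StatsD's plain-text timer format (`bucket:value|ms`) over UDP. The `search.` prefix, the class name, and the default host and port are assumptions for the example, not details taken from the slides.

```python
import socket
import time
from contextlib import contextmanager


def format_timing(bucket: str, millis: int) -> bytes:
    """Encode one timer sample in StatsD's line format: <bucket>:<value>|ms."""
    return f"{bucket}:{millis}|ms".encode("ascii")


class StageTimer:
    """Best-effort StatsD timer; names and defaults here are illustrative."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125, prefix: str = "search"):
        self.addr = (host, port)
        self.prefix = prefix

    def send(self, payload: bytes) -> None:
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
                sock.sendto(payload, self.addr)
        except OSError:
            pass  # metrics are fire-and-forget; never fail the request over them

    @contextmanager
    def stage(self, name: str):
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed_ms = int((time.monotonic() - start) * 1000)
            self.send(format_timing(f"{self.prefix}.{name}", elapsed_ms))


timer = StageTimer()
with timer.stage("build_query"):
    query = "hats for cats".split()  # stand-in for the real stage body
```

Because the metrics travel over UDP and errors are swallowed, an unreachable StatsD daemon does not slow down or break the request being measured.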
  • 31. Solr Top Level Cache > Memcached
  • 32. etsy-index.properties $ cat /search/data/person/index/etsy-index.properties #Tue Mar 27 13:05:51 EDT 2012 max_update_time=2012-03-27T17:05:51.955Z
  • 33. Check Index Size: Don’t Install if < 50% Current Size
  • 34. Check if Index is Too Old: Don’t Update if > 10 Days Old
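
The two safety checks on slides 33 and 34 can be sketched together with the `etsy-index.properties` file from slide 32. This is a minimal sketch under stated assumptions: the function names are hypothetical, index sizes are compared in bytes, and index age is measured from `max_update_time`. The slides do not say how Etsy implements either check.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def parse_max_update_time(props: Dict[str, str]) -> datetime:
    """Read max_update_time (for example 2012-03-27T17:05:51.955Z) as UTC."""
    stamp = datetime.strptime(props["max_update_time"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return stamp.replace(tzinfo=timezone.utc)


def safe_to_install(new_size: int, current_size: int, min_ratio: float = 0.5) -> bool:
    """Refuse a new index that is smaller than half of the current one."""
    return new_size >= min_ratio * current_size


def safe_to_update(max_update_time: datetime, now: datetime,
                   max_age: timedelta = timedelta(days=10)) -> bool:
    """Refuse incremental updates against an index older than max_age."""
    return now - max_update_time <= max_age


# The file contents shown on slide 32.
sample = "#Tue Mar 27 13:05:51 EDT 2012\nmax_update_time=2012-03-27T17:05:51.955Z\n"
last_update = parse_max_update_time(parse_properties(sample))
```

Both checks are cheap guards against installing a truncated build or replaying a very long backlog of changes onto a stale index.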
  • 35. What Did We Learn? Store Nothing
  • 36. Keep Denormalized Data
  • 37. DB Shard PHP JSON Search DB Shard Denormalizer Database DB Shard
  • 38. Full Apply Install Reindex Incremental
  • 39. Full Apply Apply Install Reindex Incremental Incremental
  • 40. Database Indexer
  • 41. HBase + Hadoop
  • 42. HBase + Hadoop: Why HBase?
  • 43. HBase + Hadoop: DB Shard PHP JSON DB Shard Denormalizer HBase DB Shard
  • 44. HBase + Hadoop: listings_denormalized {NAME => 'listings_denormalized', FAMILIES => [{NAME => 'listing_data', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE =>
  • 45. HBase + Hadoop: listings_denormalized_modified_index {NAME => 'listings_denormalized_modified_index', FAMILIES => [{NAME => 'pks', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '-1', BLOCKSIZE => '65536',
  • 46. HBase + Hadoop: SOLR-1301 https://issues.apache.org/jira/browse/SOLR-1301
  • 47. HBase + Hadoop: Disk • Solr Solr Output Format Document HDFS Converter • Solr Requires
  • 48. HBase + Hadoop • Not Great with Multi-Core Configs • Added Solr Multi-Core Support • Solr Config Issues • Added ENV support
  • 49. HBase + Hadoop: SolrInputDocumentWritable public class SolrInputDocumentWritable extends SolrInputDocument implements org.apache.hadoop.io.Writable {
  • 50. HBase + Hadoop: Oozie
  • 51. HBase + Hadoop: Oozie + HBase?
  • 52. HBase + Hadoop: ScanStringGenerator http://blog.ozbuyucusu.com/2011/07/21/using-hbase-tablemapper-via-oozie-workflow/
  • 53. HBase + Hadoop: Hadoop Indexer Oozie Start Map HBase Copy Reduce HDFS Merge Solr Disk Install Output
  • 54. HBase + Hadoop: IndexerActionMain
  • 55. HBase + Hadoop: Deployinator
  • 56. HBase + Hadoop: IndexCompare
  • 57. HBase + Hadoop $ ./compare ERROR: please provide two index directories example: ./compare -p 0.1 -i user_id ./index ./index-1332867952588 options: -p --percent= percent of the index to check -i --id= primary key id field in the index -h --hash= comparison or hash field in the index <index> <index>
  • 58. HBase + Hadoop $ ./compare /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672 id field: user_id hash field: hash percentage: 0.0010 files: /search/data/person/index-1332867952588/ /search/data/person/index-1335378487672 /search/data/person/index-1332867952588 contains 1515512 docs /search/data/person/index-1335378487672 contains 14837972 docs 1516 of 1516 documents are the same
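
The IndexCompare output on slides 57 and 58 suggests a sampled comparison: pick a fraction of documents from one index by primary key, then check that the hash field matches in the other index. The slides do not show how the tool selects documents, so the sketch below assumes uniform random sampling. The dicts mapping primary key to hash are a stand-in for real Lucene indexes, and every name is hypothetical.

```python
import math
import random
from typing import Dict, Tuple


def sample_size(doc_count: int, fraction: float) -> int:
    """Number of documents to check, rounded up and capped at the index size."""
    return min(doc_count, math.ceil(doc_count * fraction))


def compare_indexes(old: Dict[int, str], new: Dict[int, str],
                    fraction: float, seed: int = 0) -> Tuple[int, int]:
    """Return (matching, checked) for a random sample of primary keys from old.

    old and new map a primary key (for example user_id) to that document's hash.
    """
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    ids = rng.sample(sorted(old), sample_size(len(old), fraction))
    same = sum(1 for doc_id in ids if new.get(doc_id) == old[doc_id])
    return same, len(ids)


# Toy indexes: 1000 documents, identical in both builds.
old_index = {i: f"hash-{i}" for i in range(1000)}
new_index = dict(old_index)
matching, checked = compare_indexes(old_index, new_index, 0.5)
```

Under these assumptions, `sample_size(1515512, 0.001)` evaluates to 1516, which is consistent with the "1516 of 1516 documents" line on slide 58 if the `-p` value is read as a fraction of the first index.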
  • 59. HBase + Hadoop: Copy and Merge
  • 60. HBase + Hadoop: Open Source
  • 63. Replication: Slaves, Master, +n slaves
  • 65. BitTorrent Replication
  • 66. BitTorrent Using BitTornado:
  • 67. Replication: BitTorrent + Solr
  • 68. Replication: BitTorrent + Solr
  • 71. Replication: Fork of TTorrent: https://github.com/etsy/ttorrent • Multi-File Support • Large File Support • Fork BitTorrent: Coming Soon
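
Slide 71 calls out multi-file and large-file support, which matters because a Solr index is a directory of many segment files. In the BitTorrent metainfo format the content is split into fixed-size pieces, each identified by its SHA-1 hash, and in multi-file mode the files are treated as one concatenated stream in listed order. The sketch below illustrates only that piece-hashing step on in-memory bytes. The file names and the tiny piece length are made up for the example, and nothing here is taken from the TTorrent fork.

```python
import hashlib
from typing import List, Tuple


def piece_hashes(files: List[Tuple[str, bytes]], piece_length: int) -> List[bytes]:
    """SHA-1 each fixed-size piece of the files, concatenated in listed order."""
    stream = b"".join(data for _name, data in files)
    return [
        hashlib.sha1(stream[offset:offset + piece_length]).digest()
        for offset in range(0, len(stream), piece_length)
    ]


def verify_piece(piece: bytes, expected: bytes) -> bool:
    """A downloader accepts a piece only if its SHA-1 matches the metainfo."""
    return hashlib.sha1(piece).digest() == expected


# Hypothetical index files; a real index would be read from disk in chunks.
index_files = [("_0.fdt", b"a" * 10), ("segments_1", b"b" * 6)]
hashes = piece_hashes(index_files, piece_length=8)
```

Because pieces span file boundaries, a peer can verify and share any part of the index directory without waiting for whole files, which is the property that makes fan-out to many slaves cheap for the master.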
  • 72. Need a job?