SlideShare une entreprise Scribd logo
1  sur  46
Télécharger pour lire hors ligne
Apache




                               Nutch as a Web mining platform
Nutch – Berlin Buzzwords '10




                                   the present and the future

                                           Andrzej Białecki
                                           ab@sigram.com
Intro
                               ●   Started using Lucene in 2003 (1.2-dev?)
                               ●   Created Luke – the Lucene Index Toolbox
                               ●   Nutch, Lucene committer, Lucene PMC member
Nutch – Berlin Buzzwords '10




                               ●   Nutch project lead
Agenda
                               ●   Nutch architecture overview
                               ●   Crawling in general – strategies and challenges
                               ●   Nutch workflow
                               ●   Web data mining with Nutch
                                                    
Nutch – Berlin Buzzwords '10




                                   with examples
                               ●   Nutch present and future
                               ●   Questions and answers




                                                                                     3
Apache Nutch project
                               ●       Founded in 2003 by Doug Cutting, the Lucene
                                       creator, and Mike Cafarella
                               ●       Apache project since 2004 (sub-project of Lucene)
                               ●       Spin-offs:
Nutch – Berlin Buzzwords '10




                                   –   Map-Reduce and distributed FS → Hadoop
                                   –   Content type detection and parsing → Tika
                               ●       Many installations in operation, mostly vertical
                                       search
                               ●       Collections typically 1 mln - 200 mln documents
                               ●       Apache Top-Level Project since May
                               ●       Current release 1.1
                                                                                           4
What's in a search engine?
                               … a few things that may surprise you! 
Nutch – Berlin Buzzwords '10




                                                                         5
Search engine building blocks

                               Injector    Scheduler             Crawler     Searcher
Nutch – Berlin Buzzwords '10




                                                                                        Indexer
                                  Web graph            Updater         Content
                                  -page info
                                  -links (in/out)
                                                                      repository
                                                                                        Parser




                                                       Crawling frontier controls



                                                                                                  6
Nutch features at a glance
                               ●       Plugin-based, highly modular:
                                         ●
                                             Most behaviors can be changed via plugins
                               ●       Data repository:
                                   –   Page status database and link database (web graph)
                                   –   Content and parsed data database (shards)
Nutch – Berlin Buzzwords '10




                               ●       Multi-protocol, multi-threaded, distributed crawler
                               ●       Robust crawling frontier controls
                               ●       Scalable data processing framework
                                         ●
                                             Hadoop MapReduce processing
                               ●       Full-text indexer & search front-end
                                         ●
                                             Using Solr (or Lucene)
                                         ●
                                             Support for distributed search
                               ●       Flexible integration options
                                                                                             7
Search engine building blocks

                               Injector    Scheduler             Crawler     Searcher
Nutch – Berlin Buzzwords '10




                                                                                        Indexer
                                  Web graph            Updater         Content
                                  -page info
                                  -links (in/out)
                                                                      repository
                                                                                        Parser




                                                       Crawling frontier controls



                                                                                                  8
Nutch building blocks

                               Injector   Generator              Fetcher      Searcher
Nutch – Berlin Buzzwords '10




                                                                                           Indexer
                                   CrawlDB            Updater
                                                                       Shards
                                                                     (segments)
                                                        Link                               Parser
                                    LinkDB            inverter


                                  URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                         9
Nutch data

                                                  Maintains info on all known URL-s:
                               Injector Generator ● Fetch schedule
                                                           Fetcher       Searcher
                                                  ● Fetch status

                                                  ● Page signature

                                                  ● Metadata
Nutch – Berlin Buzzwords '10




                                                                                  Indexer
                                    CrawlDB      Updater
                                                                       Shards
                                                                     (segments)
                                                       Link                                Parser
                                   LinkDB            inverter


                                  URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                         10
Nutch data

                               Injector  Generator For each Fetcher
                                                            target URL keeps info on
                                                                             Searcher
                                                   incoming links, i.e. list of source
                                                   URL-s and their associated anchor
                                                   text
Nutch – Berlin Buzzwords '10




                                                                                       Indexer
                                   CrawlDB        Updater
                                                                       Shards
                                                                     (segments)
                                                       Link                                Parser
                                    LinkDB           inverter


                                  URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                         11
Nutch data

                               Shards (“segments”) keep:
                               Injector Generator
                               ● Raw page content            Fetcher          Searcher
                               ● Parsed content + discovered

                               metadata + outlinks
                               ● Plain text for indexing and
Nutch – Berlin Buzzwords '10




                                                                                           Indexer
                               snippets
                                    CrawlDB         Updater
                                                                       Shards
                                                                     (segments)
                                                       Link                                Parser
                                    LinkDB           inverter


                                  URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                         12
Shard-based workflow
                               ●  Unit of work (batch) – easier to process massive datasets
                               ●  Convenience placeholder, using predefined directory names
                               ●  Unit of deployment to the search infrastructure
                                 – Solr-based search may discard shards once indexed
                               ●  Once completed they are basically unmodifiable
                                   –   No in-place updates of content, or replacing of obsolete content
Nutch – Berlin Buzzwords '10




                               ●    Periodically phased-out by new, re-crawled shards
                                   – Solr-based search can update Solr index in-place

                                                                  200904301234/
                                                                  2009043012345
                                                                 2009043012345
                                             Generator
                                                                  crawl_generate/
                                                                  crawl_generate
                                                                 crawl_generate
                                                                  crawl_fetch/
                                               Fetcher            crawl_fetch
                                                                 crawl_fetch
                                                                  content/
                                                                  content
                                                                  crawl_parse/            “cached” view
                                                                 content
                                               Parser             crawl_parse
                                                                  parse_data/
                                                                 crawl_parse
                                                                  parse_text/
                                                                  parse_data
                                                                 parse_data               snippets
                                               Indexer            parse_text
                                                                 parse_text
                                                                                                          13
Crawling frontier challenge
         ●                     No authoritative catalog of web pages
         ●                     Crawlers need to discover their view of web universe
                                  ●
                                      Start from “seed list” & follow (walk) some (useful? interesting?) outlinks
         ●                     Many dangers of simply wandering around
                                  ●
                                      explosion or collapse of the frontier; collecting unwanted content (spam,
                                       junk, offensive)
Nutch – Berlin Buzzwords '10




                                I need a few
                                 interesting
                                   items...




                                                                                                               14
High-quality seed list
                               ●       Reference sites:
                                   –    Wikipedia, FreeBase, DMOZ         seed + 1 hop
                                   –    Existing verticals
                               ●       Seeding from existing
Nutch – Berlin Buzzwords '10




                                       search engines
                                   –    Collect top-N URL-s for              seed
                                         characteristic keywords                 i=1
                               ●       Seed URL-s plus 1:
                                   –    First hop usually retains high-
                                          quality and focus
                                   –    Remove blatantly obvious junk

                                   15                                                    15
Controlling the crawling frontier
                               ●       URL filter plugins
                                   –   White-list, black-list, regex
                                   –   May use external resources
                                        (DB-s, services ...)
                               ●       URL normalizer plugins
Nutch – Berlin Buzzwords '10




                                   –   Resolving relative path
                                         elements                        seed
                                   –   “Equivalent” URLs                    i=1

                                                                                  i=2
                               ●       Additional controls                              i=3

                                   –   priority, metadata select/block
                                   –   Breadth first, depth first,
                                         per‑site mixed ...
                                                                                              16
Wide vs. focused crawling
                               ●       Differences:
                                   –   Little technical difference in configuration
                                   –   Big difference in operations, maintenance and quality
                               ●       Wide crawling:
                                         ●
                                             (Almost) Unlimited crawling frontier
Nutch – Berlin Buzzwords '10




                                         ●
                                             High risk of spamming and junk content
                                         ●
                                             “Politeness” a very important limiting factor
                                         ●
                                             Bandwidth & DNS considerations
                               ●       Focused (vertical or enterprise) crawling:
                                         ●
                                             Limited crawling frontier
                                         ●
                                             Bandwidth or politeness is often not an issue
                                         ●
                                             Low risk of spamming and junk content




                                                                                               17
Vertical & enterprise search
                               ●       Vertical search
                                   –   Range of selected “reference” sites
                                   –   Robust control of the crawling frontier
                                   –   Extensive content post-processing
                                   –   Business-driven decisions about ranking
Nutch – Berlin Buzzwords '10




                               ●       Enterprise search
                                   –   Variety of data sources and data formats
                                   –   Well-defined and limited crawling frontier
                                   –   Integration with in-house data sources
                                   –   Little danger of spam
                                   –   PageRank-like scoring usually works poorly

                                                                                    18
Nutch – Berlin Buzzwords '10




                                    ?
                                        Face to face with Nutch




19
Installation & basic config
                                ●       http://nutch.apache.org
                                ●       Java 1.5+
                                ●       Single-node out of the box
                                    –   Comes also as a “job” jar to run on existing Hadoop cluster
                                ●       File-based configuration: conf/
Nutch – Berlin Buzzwords '10




                                    –   Plugin list
                                    –   Per-plugin configuration

                                ●       … much, much more on this on the Wiki




                               20                                                                     20
Main Nutch workflow
                                                                                        Command-line:
                                                                                        bin/nutch
                                          ●       Inject: initial creation of CrawlDB    inject
                                              –   Insert seed URLs
                                              –   Initial LinkDB is empty
Nutch – Berlin Buzzwords '10




                                          ●       Generate new shard's fetchlist         generate
                                          ●       Fetch raw content                      fetch
                                          ●       Parse content (discovers outlinks)     parse
                                          ●       Update CrawlDB from shards             updatedb
                                          ●       Update LinkDB from shards              invertlinks
                                          ●       Index shards                           index /
                                                                                            solrindex
                               (repeat)                                                                 21
Injecting new URL-s

                               Injector   Generator              Fetcher      Searcher
Nutch – Berlin Buzzwords '10




                                                                                           Indexer
                                   CrawlDB            Updater
                                                                       Shards
                                                                     (segments)
                                                        Link                               Parser
                                    LinkDB            inverter


                                  URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                         22
Generating fetchlists

                               Injector   Generator              Fetcher      Searcher
Nutch – Berlin Buzzwords '10




                                                                                           Indexer
                                   CrawlDB            Updater
                                                                       Shards
                                                                     (segments)
                                                        Link                               Parser
                                    LinkDB            inverter


                                  URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                         23
Fetching content

                               Injector   Generator              Fetcher      Searcher
Nutch – Berlin Buzzwords '10




                                                                                           Indexer
                                   CrawlDB            Updater
                                                                       Shards
                                                                     (segments)
                                                        Link                               Parser
                                    LinkDB            inverter


                                  URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                         24
Content processing

                               Injector   Generator              Fetcher      Searcher
Nutch – Berlin Buzzwords '10




                                                                                           Indexer
                                   CrawlDB            Updater
                                                                       Shards
                                                                     (segments)
                                                        Link                               Parser
                                    LinkDB            inverter


                                  URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                         25
Link inversion

                               Injector   Generator              Fetcher      Searcher
Nutch – Berlin Buzzwords '10




                                                                                           Indexer
                                   CrawlDB            Updater
                                                                       Shards
                                                                     (segments)
                                                        Link                               Parser
                                    LinkDB            inverter


                                  URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                         26
Page importance - scoring

                               Injector   Generator              Fetcher      Searcher
Nutch – Berlin Buzzwords '10




                                                                                           Indexer
                                   CrawlDB            Updater
                                                                       Shards
                                                                     (segments)
                                                        Link                               Parser
                                    LinkDB            inverter


                                  URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                         27
Indexing

                               Injector   Generator              Fetcher      Searcher
Nutch – Berlin Buzzwords '10




                                                                                           Indexer
                                   CrawlDB            Updater
                                                                       Shards
                                                                     (segments)
                                                        Link                               Parser
                                    LinkDB            inverter


                                  URL filters & normalizers, parsing/indexing filters, scoring plugins



                                                                                                         28
Map-reduce indexing
                               ●       Map() just assembles all parts of documents
                               ●       Reduce() performs text analysis + indexing:
                                   –   Sends assembled documents to Solr
                                       or
                                   –   Adds to a local Lucene index
Nutch – Berlin Buzzwords '10




                               ●       Other possible MR indexing models:
                                   –   Hadoop contrib/indexing model:
                                         ●
                                             analysis and indexing on map() side
                                         ●
                                             Index merging on reduce() side
                                   –   Modified Nutch model:
                                         ●
                                             Analysis on map() side
                                         ●
                                             Indexing on reduce() side

                                                                                     29
Nutch integration
                               ●       Nutch search & tools API
                                   –   Search via REST-style interaction, XML / JSON response
                                   –   Tools CLI and API to access bulk & single Nutch items
                                   –   Single-node, embedded, distributed (Hadoop cluster)
                               ●       Data-level integration: direct MapFile /
Nutch – Berlin Buzzwords '10




                                       SequenceFile reading
                                   –   More complicated (and still requires using Nutch classes)
                                   –   May be more efficient
                                   –   Future: native tools related to data stores (HBase, SQL, ...)
                               ●       Exporting Nutch data
                                   –   All data can be exported to plain text formats
                                   –   bin/nutch read*
                                         ●   ...db – read CrawlDB and dump some/all records
                                         ●   ...linkdb – read LinkDb and dump some/all records
                                         ●   ...seg – read segments (shards) and dump some/all records   30
Web data mining with Nutch
Nutch – Berlin Buzzwords '10




                                                            31
Nutch search
                                ●       Solr indexing and searching (preferred)
                                    –   Simple Lucene indexing / search available too
                                ●       Using Solr search:
                                    –   DisMax search over several fields (url, title, body, anchors)
                                    –   Faceted search
Nutch – Berlin Buzzwords '10




                                    –   Search results clustering
                                    –   SolrCloud:
                                          ●
                                              Automatic shard replication and load-balancing
                                          ●
                                              Hashing update handler to distribute docs to Solr shards




                               30                                                                        32
Search-based analytics
                               ●       Keyword search → crude topic mining
                               ●       Phrase search → crude collocation mining
                               ●       Anchor search → crude semantic enrichment
                               ●       Feedback loop from search results:
                                   –   Faceting and on-line clustering may discover latent topics
Nutch – Berlin Buzzwords '10




                                   –   Top-N results for reference queries may prioritize further crawling
                               ●       Example: question answering system
                                   –   Source documents from reference sites
                                   –   NLP document analysis: key-phrase detection, POS-tagging,
                                         noun-verb / subject-predicate detection, enrichment from DBs
                                         and semantic nets
                                   –   NLP query analysis: expected answer type (e.g. person, place,
                                         date, activity, method, ...), key-phrases, synonyms
                                   –   Regular search
                                   –   Evaluation of raw results (further NLP analysis of each document) 33
Web as a corpus
                               ●       Examples:
                                   –   Source of raw text in a specific language
                                   –   Source of text on a given subject
                                         ●
                                             Selection by e.g. a presence of keywords, or full-blown NLP
                                         ●
                                             Add data from known reference sites (Wikipedia, Freebase) or
                                              databases (Medline) or semantic nets (WordNet, OpenCyc)
Nutch – Berlin Buzzwords '10




                                   –   Source of documents in a specific format (e.g. PDF)
                               ●       Nutch setup:
                                   –   URLFilters define the crawling frontier and content types
                                   –   Parse plugins determine the content extraction / processing
                                         ●
                                             e.g. language detection
                               ●       Nutch shards:
                                   –   Extracted text, metadata, outlinks / anchors


                                                                                                            34
Web as a corpus (2)
                               ●       Concept mining
                                   –   Harvesting human-created concept descriptions and
                                         associations
                                   –   “kind of”, “contains”, “includes”, “application of”
                                   –   Co-occurrence of concepts has some meaning too!
                               ●       Example: medical search engine
Nutch – Berlin Buzzwords '10




                                   –   Controlled vocabulary of diseases, symptoms, procedures
                                   –   Identifiable metadata: author, journal, publication date, etc.
                                   –   Nutch crawl of reference sites and DBs
                                         ●
                                             Co-occurrence of controlled vocabulary
                                              –
                                                  BloomFilter-s for quick trimming of map-side data
                                              –
                                                  Or Mahout collocation mining for uncontrolled concepts
                                         ●
                                             Cube of co-occurring (related) concepts
                                         ●
                                             Several dimensions to traverse
                                              –
                                                  “authors who publish most often together on treatment of myocardial infarction”
                                         ●
                                             10 nodes, 100k phrases in vocabulary, 20 mln pages, ~300bln
                                              phrases on map side → ~5GB data cube                                                  35
Web as a directed graph
                               ●   Nodes (vertices): URL-s as unique identifiers
                               ●   Edges (links): hyperlinks like <a href=”targetUrl”/>
                               ●   Edge labels: <a href=”..”>anchor text</a>
                               ●   Often represented as adjacency (neighbor) lists
                               ●   Inverted graph: LinkDB in Nutch
Nutch – Berlin Buzzwords '10




                                                                               Straight (outlink) graph:
                                                                               1 → 2a, 3b, 4c, 5d, 6e
                                       8                                       5 → 6f, 9g
                                                           j                   7 → 3h, 4i, 8j, 9k
                                                                       k
                                                                   7       9
                                               3           h
                                   2
                                                                   i           Inverted (inlink) graph:
                                       a       b                               2 ← 1a
                                                   c                           3 ← 1b, 7h
                                           1                   4
                                                                               4 ← 1c, 7i
                                       e       d                               5 ← 1d
                                                                   g           6 ← 1e, 5f
                                   6           f
                                                       5                       8 ← 7j
                                                                               9 ← 5g, 7k
                                                                                                           36
Link inversion
                               ●   Pages have outgoing links (outlinks)
                                        … I know where I'm pointing to
                               ●   Question: who points to me?
                                        … I don't know, there is no catalog of pages
                                        … NOBODY knows for sure either!
Nutch – Berlin Buzzwords '10




                               ●   In-degree may indicate importance of the page
                               ●   Anchor text provides important semantic info
                               ●   Answer: invert the outlinks that I know about,
                                   and group by target (Nutch 'invertlinks')
                                              tgt 2                           src 2
                                    tgt 1                            src 1

                                            src 1      tgt 3                 tgt 1     src 3

                                    tgt 5      tgt 4                 src 5     src 4           37
Web as a recommender
                               ●       Links as recommendations:
                                   –   Link represents an association
                                   –   Anchor text represents a recommended topic
                                         ●
                                             … with some surrounding text of a hyperlink?
                               ●       Not all pages are created equal
Nutch – Berlin Buzzwords '10




                                   –   Recommendations from good pages are useful
                                   –   Recommendations from bad pages may be useless
                                   –   Merit / guilt by association:
                                         ●
                                             Links from good pages should improve the target's reputation
                                         ●
                                             Links from bad pages may compromise good pages' reputation
                               ●       Not all recommendations are trustworthy
                                   –   What links to trust, and to what degree?
                                   –   Social aspects: popularity, fashion, mobbing, fallacy of
                                        “common belief”
                                                                                                            38
Link analysis and scoring
                               ●       PageRank                                                  1      1


                                   –   Query-independent page weight                             1      1
                                   –   Based on the flow of weight along link paths
                                         ●
                                             Dampening factor α to stabilize the flow
                                         ●
                                             Weight from “dangling nodes” redistributed         1.25   1.25


                                       Other models
Nutch – Berlin Buzzwords '10




                               ●
                                                                                                0.75   0.75
                                   –   Hyperlink-Induced Topic Search (HITS)
                                         ●
                                             Query-dependent, local iterations, hub/authority
                                   –   TrustRank                                                1.06   1.31
                                         ●
                                             Propagation of “trust” based on human expert
                                              evaluation of seed sites                          0.69   0.94

                               ●       Challenges
                                   –   Loops, link spam, cliques, loosely connected
                                         subgraphs, mobbing, etc
                                                                                                              39
Nutch link analysis tools
                               ●       Tools for PageRank calculation with loop detection
                                   –   LinkDb: source of anchor text (think “recommended topics”)
                                   –   Page in-degree ≈ popularity / importance / quality
                                   –   Scoring API (and plugins) to control the flow of page importance
                                         along link paths
                               ●       Nutch shards:
Nutch – Berlin Buzzwords '10




                                   –   Source of outlinks → expanding the crawling frontier
                                   –   Page linked-ness vs. its content: hub or authority
                               ●       Example: porn / junk detection
                                   –   Links to “porn” pages poisonous to importance / quality
                                   –   Links from “porn” pages decrease the confidence in quality of the
                                         target page
                               ●       Example: vertical crawl
                                   –   Expanding to pages “on topic” == with sufficient in-link support
                                         from known on topic pages
                                                                                                           40
Web of gossip and opinions
                               ●       General Web – not considering special-purpose
                                       networks here...
                               ●       Example:
                                   –   Who / what is in the news?
                                   –   How often a name is mentioned?
                                             today Google yields 44,500 hits for ab@getopt.org 
Nutch – Berlin Buzzwords '10




                                         ●


                                   –   What facts about me are publicly available?
                                   –   What is the sentiment associated with a name (person,
                                        organization, trademark)?
                               ●       Nutch setup:
                                   –   Seed from a few reference news sites, blogs, Twitter, etc
                                   –   Use Nutch plugin for RSS/Atom crawling
                                   –   NLP parsing plugins (NER, classification, sentiment analysis)
                               ●       Nutch shards:
                                   –   Capture temporal aspect
                                                                                                   41
Web as a source of … anything
                               ●       The data is there, just lost among irrelevant stuff
                                   –   Difficult to find → good seed list + crawling frontier controls
                                   –   Mixed with junk & irrelevant data → URL & content filtering


                                       Be creative – combine multiple strategies:
Nutch – Berlin Buzzwords '10




                               ●

                                   –   Crawl for raw data, stay on topic – filter out junk early
                                   –   Use plain indexing & search as a crude analytic tool
                                   –   Use creative post-processing to filter and enhance the data
                                   –   Export data from Nutch and pipe it to other tools (Pig,
                                        HBase, Mahout, ...)



                                                                                                         42
Future of Nutch
                               ●       Nutch 2.0 re-design
                                   –   Refactoring, cleanup, better scale-up / scale-down
                                   –   Avoid code duplication
                                   –   Expected release ~Q4 2010
                               ●       Share code with other crawler projects →
                                       crawler-commons
Nutch – Berlin Buzzwords '10




                               ●       Indexing & Search → Solr, SolrCloud
                                   –   Distributed and replicated search is difficult
                                   –   Initial integration needs significant improvement
                                   –   Shard management – SolrCloud / Zookeeper
                               ●       Web-graph & page repository → ORM layer
                                   –   Combine CrawlDB, LinkDB and shard storage
                                   –   Avoid tedious shard management
                                   –   Gora ORM mapping: HBase, SQL, Cassandra? BerkeleyDB?
                                   –   Benefit from native tools specific to storage → easier integration   43
Future of Nutch (2)
                               ●       What's left then?
                                   –   Crawling frontier management, discovery
                                   –   Re-crawl algorithms
                                   –   Spider trap handling
                                   –   Fetcher
Nutch – Berlin Buzzwords '10




                                   –   Ranking: enterprise-specific, user-feedback
                                   –   Duplicate detection, URL aliasing (mirror detection)
                                   –   Template detection and cleanup, pagelet-level crawling
                                   –   Spam & junk control
                               ●       Vision: á la carte toolkit, scalable from
                                       1-1000s nodes
                                   –   Easier setup for small 1 node installs
                                   –   Focus on a reliable, easy to integrate framework
                                                                                                44
Conclusions
                                          (This overview is a tip of the iceberg)

                                                         Nutch
                               ●   Implements all core search engine components
Nutch – Berlin Buzzwords '10




                               ●   Extremely configurable and modular
                               ●   Scales well
                               ●   A complete crawl & search platform – and a toolkit
                               ●   Easy to use as an input feed to data collecting and
                                   data mining tools



                                                                                     45
Q&A

                               ●       Further information:
                                   –   http://nutch.apache.org/
                                   –   user@nutch.apache.org
Nutch – Berlin Buzzwords '10




                               ●       Contact author:
                                   –   ab@sigram.com




                                                                   46

Contenu connexe

Tendances

Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutchsebastian_nagel
 
StormCrawler at Bristech
StormCrawler at BristechStormCrawler at Bristech
StormCrawler at BristechJulien Nioche
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Sameer Tiwari
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrRahul Jain
 
Intro to Apache Hadoop
Intro to Apache HadoopIntro to Apache Hadoop
Intro to Apache HadoopSufi Nawaz
 
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventApache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventGruter
 
Tajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopTajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopHyunsik Choi
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearchJoey Wen
 
On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN Jim Dowling
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoHyunsik Choi
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCCloudera, Inc.
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Vinay Kumar
 
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012Amazon Web Services
 
Hadoop & Zing
Hadoop & ZingHadoop & Zing
Hadoop & ZingLong Dao
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 

Tendances (20)

Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutch
 
StormCrawler at Bristech
StormCrawler at BristechStormCrawler at Bristech
StormCrawler at Bristech
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Intro to Apache Hadoop
Intro to Apache HadoopIntro to Apache Hadoop
Intro to Apache Hadoop
 
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special EventApache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
Apache Tajo - Bay Area HUG Nov. 2013 LinkedIn Special Event
 
Presto - SQL on anything
Presto  - SQL on anythingPresto  - SQL on anything
Presto - SQL on anything
 
Tajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for HadoopTajo: A Distributed Data Warehouse System for Hadoop
Tajo: A Distributed Data Warehouse System for Hadoop
 
Intro to elasticsearch
Intro to elasticsearchIntro to elasticsearch
Intro to elasticsearch
 
On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN On-premise Spark as a Service with YARN
On-premise Spark as a Service with YARN
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018
 
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
AWS Customer Presentation: Freie Univerisitat - Berlin Summit 2012
 
Presto+MySQLで分散SQL
Presto+MySQLで分散SQLPresto+MySQLで分散SQL
Presto+MySQLで分散SQL
 
Hadoop & Zing
Hadoop & ZingHadoop & Zing
Hadoop & Zing
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 

En vedette

Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Mark Kerzner
 
Lab_syllabus_sp16_updated
Lab_syllabus_sp16_updatedLab_syllabus_sp16_updated
Lab_syllabus_sp16_updatedXiangzhen Sun
 
How to NLProc from .NET
How to NLProc from .NETHow to NLProc from .NET
How to NLProc from .NETSergey Tihon
 
Weekend Malware Research 2012
Weekend Malware Research 2012Weekend Malware Research 2012
Weekend Malware Research 2012Andrew Morris
 
Flaying the Blockchain Ledger for Fun, Profit, and Hip Hop
Flaying the Blockchain Ledger for Fun, Profit, and Hip HopFlaying the Blockchain Ledger for Fun, Profit, and Hip Hop
Flaying the Blockchain Ledger for Fun, Profit, and Hip HopAndrew Morris
 
ShmooCon 2015: No Budget Threat Intelligence - Tracking Malware Campaigns on ...
ShmooCon 2015: No Budget Threat Intelligence - Tracking Malware Campaigns on ...ShmooCon 2015: No Budget Threat Intelligence - Tracking Malware Campaigns on ...
ShmooCon 2015: No Budget Threat Intelligence - Tracking Malware Campaigns on ...Andrew Morris
 
Shmoocon Epilogue 2013 - Ruining security models with SSH
Shmoocon Epilogue 2013 - Ruining security models with SSHShmoocon Epilogue 2013 - Ruining security models with SSH
Shmoocon Epilogue 2013 - Ruining security models with SSHAndrew Morris
 
BSidesCharleston2014 - Ballin on a Budget: Tracking Chinese Malware Campaigns...
BSidesCharleston2014 - Ballin on a Budget: Tracking Chinese Malware Campaigns...BSidesCharleston2014 - Ballin on a Budget: Tracking Chinese Malware Campaigns...
BSidesCharleston2014 - Ballin on a Budget: Tracking Chinese Malware Campaigns...Andrew Morris
 
Behavior Driven Development (BDD)
Behavior Driven Development (BDD)Behavior Driven Development (BDD)
Behavior Driven Development (BDD)Ajay Danait
 
Machine Learning with Apache Mahout
Machine Learning with Apache MahoutMachine Learning with Apache Mahout
Machine Learning with Apache MahoutDaniel Glauser
 
Asynchronous API in Java8, how to use CompletableFuture
Asynchronous API in Java8, how to use CompletableFutureAsynchronous API in Java8, how to use CompletableFuture
Asynchronous API in Java8, how to use CompletableFutureJosé Paumard
 
Text categorization
Text categorizationText categorization
Text categorizationKU Leuven
 
Continuous Integration and PHP
Continuous Integration and PHPContinuous Integration and PHP
Continuous Integration and PHPArno Schneider
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonIgor Anishchenko
 
SOLID, DRY, SLAP design principles
SOLID, DRY, SLAP design principlesSOLID, DRY, SLAP design principles
SOLID, DRY, SLAP design principlesSergey Karpushin
 
Clustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache SparkClustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache SparkThamme Gowda
 
Spring boot introduction
Spring boot introductionSpring boot introduction
Spring boot introductionRasheed Waraich
 
Designing Teams for Emerging Challenges
Designing Teams for Emerging ChallengesDesigning Teams for Emerging Challenges
Designing Teams for Emerging ChallengesAaron Irizarry
 
Visual Design with Data
Visual Design with DataVisual Design with Data
Visual Design with DataSeth Familian
 

En vedette (19)

Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
 
Lab_syllabus_sp16_updated
Lab_syllabus_sp16_updatedLab_syllabus_sp16_updated
Lab_syllabus_sp16_updated
 
How to NLProc from .NET
How to NLProc from .NETHow to NLProc from .NET
How to NLProc from .NET
 
Weekend Malware Research 2012
Weekend Malware Research 2012Weekend Malware Research 2012
Weekend Malware Research 2012
 
Flaying the Blockchain Ledger for Fun, Profit, and Hip Hop
Flaying the Blockchain Ledger for Fun, Profit, and Hip HopFlaying the Blockchain Ledger for Fun, Profit, and Hip Hop
Flaying the Blockchain Ledger for Fun, Profit, and Hip Hop
 
ShmooCon 2015: No Budget Threat Intelligence - Tracking Malware Campaigns on ...
ShmooCon 2015: No Budget Threat Intelligence - Tracking Malware Campaigns on ...ShmooCon 2015: No Budget Threat Intelligence - Tracking Malware Campaigns on ...
ShmooCon 2015: No Budget Threat Intelligence - Tracking Malware Campaigns on ...
 
Shmoocon Epilogue 2013 - Ruining security models with SSH
Shmoocon Epilogue 2013 - Ruining security models with SSHShmoocon Epilogue 2013 - Ruining security models with SSH
Shmoocon Epilogue 2013 - Ruining security models with SSH
 
BSidesCharleston2014 - Ballin on a Budget: Tracking Chinese Malware Campaigns...
BSidesCharleston2014 - Ballin on a Budget: Tracking Chinese Malware Campaigns...BSidesCharleston2014 - Ballin on a Budget: Tracking Chinese Malware Campaigns...
BSidesCharleston2014 - Ballin on a Budget: Tracking Chinese Malware Campaigns...
 
Behavior Driven Development (BDD)
Behavior Driven Development (BDD)Behavior Driven Development (BDD)
Behavior Driven Development (BDD)
 
Machine Learning with Apache Mahout
Machine Learning with Apache MahoutMachine Learning with Apache Mahout
Machine Learning with Apache Mahout
 
Asynchronous API in Java8, how to use CompletableFuture
Asynchronous API in Java8, how to use CompletableFutureAsynchronous API in Java8, how to use CompletableFuture
Asynchronous API in Java8, how to use CompletableFuture
 
Text categorization
Text categorizationText categorization
Text categorization
 
Continuous Integration and PHP
Continuous Integration and PHPContinuous Integration and PHP
Continuous Integration and PHP
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
SOLID, DRY, SLAP design principles
SOLID, DRY, SLAP design principlesSOLID, DRY, SLAP design principles
SOLID, DRY, SLAP design principles
 
Clustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache SparkClustering output of Apache Nutch using Apache Spark
Clustering output of Apache Nutch using Apache Spark
 
Spring boot introduction
Spring boot introductionSpring boot introduction
Spring boot introduction
 
Designing Teams for Emerging Challenges
Designing Teams for Emerging ChallengesDesigning Teams for Emerging Challenges
Designing Teams for Emerging Challenges
 
Visual Design with Data
Visual Design with DataVisual Design with Data
Visual Design with Data
 

Similaire à Nutch as a Web data mining platform

Net flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_finalNet flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_finalYeounhee Lee
 
A Provenance-Aware Linked Data Application for Trip Management and Organization
A Provenance-Aware Linked Data Application for Trip Management and OrganizationA Provenance-Aware Linked Data Application for Trip Management and Organization
A Provenance-Aware Linked Data Application for Trip Management and OrganizationBoris Villazón-Terrazas
 
OpenNaaS Overview Complete
OpenNaaS Overview CompleteOpenNaaS Overview Complete
OpenNaaS Overview CompleteJoan Garcia
 
Building apps with HBase - Big Data TechCon Boston
Building apps with HBase - Big Data TechCon BostonBuilding apps with HBase - Big Data TechCon Boston
Building apps with HBase - Big Data TechCon Bostonamansk
 
Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015ontopic
 
Hadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchHadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchMapR Technologies
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillDataWorks Summit
 
OPEN NETWORK OPERATING SYSTEM.PPTX
OPEN NETWORK OPERATING SYSTEM.PPTXOPEN NETWORK OPERATING SYSTEM.PPTX
OPEN NETWORK OPERATING SYSTEM.PPTXAhmed59616
 
Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic WebNuxeo
 
Opentracing jaeger
Opentracing jaegerOpentracing jaeger
Opentracing jaegerOracle Korea
 
Distributed Tracing with Jaeger
Distributed Tracing with JaegerDistributed Tracing with Jaeger
Distributed Tracing with JaegerInho Kang
 
Tail-f Webinar OpenFlow Switch Management Using NETCONF and YANG
Tail-f Webinar OpenFlow Switch Management Using NETCONF and YANGTail-f Webinar OpenFlow Switch Management Using NETCONF and YANG
Tail-f Webinar OpenFlow Switch Management Using NETCONF and YANGTail-f Systems
 
Openstack Neutron, interconnections with BGP/MPLS VPNs
Openstack Neutron, interconnections with BGP/MPLS VPNsOpenstack Neutron, interconnections with BGP/MPLS VPNs
Openstack Neutron, interconnections with BGP/MPLS VPNsThomas Morin
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFiHortonworks
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014spinningmatt
 
Jaeger and OpenTracing Cloud Native Computing (CNCF) meetup Zurich
Jaeger and OpenTracing Cloud Native Computing (CNCF) meetup ZurichJaeger and OpenTracing Cloud Native Computing (CNCF) meetup Zurich
Jaeger and OpenTracing Cloud Native Computing (CNCF) meetup Zurich⛑ Pavol Loffay
 

Similaire à Nutch as a Web data mining platform (20)

Net flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_finalNet flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_final
 
A Provenance-Aware Linked Data Application for Trip Management and Organization
A Provenance-Aware Linked Data Application for Trip Management and OrganizationA Provenance-Aware Linked Data Application for Trip Management and Organization
A Provenance-Aware Linked Data Application for Trip Management and Organization
 
OpenNaaS Overview Complete
OpenNaaS Overview CompleteOpenNaaS Overview Complete
OpenNaaS Overview Complete
 
Common crawlpresentation
Common crawlpresentationCommon crawlpresentation
Common crawlpresentation
 
Publishing Linked Data from RDB
Publishing Linked Data from RDBPublishing Linked Data from RDB
Publishing Linked Data from RDB
 
Using R with Hadoop
Using R with HadoopUsing R with Hadoop
Using R with Hadoop
 
Building apps with HBase - Big Data TechCon Boston
Building apps with HBase - Big Data TechCon BostonBuilding apps with HBase - Big Data TechCon Boston
Building apps with HBase - Big Data TechCon Boston
 
Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015
 
Hadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchHadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 March
 
Understanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache DrillUnderstanding the Value and Architecture of Apache Drill
Understanding the Value and Architecture of Apache Drill
 
OPEN NETWORK OPERATING SYSTEM.PPTX
OPEN NETWORK OPERATING SYSTEM.PPTXOPEN NETWORK OPERATING SYSTEM.PPTX
OPEN NETWORK OPERATING SYSTEM.PPTX
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 
Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic Web
 
Opentracing jaeger
Opentracing jaegerOpentracing jaeger
Opentracing jaeger
 
Distributed Tracing with Jaeger
Distributed Tracing with JaegerDistributed Tracing with Jaeger
Distributed Tracing with Jaeger
 
Tail-f Webinar OpenFlow Switch Management Using NETCONF and YANG
Tail-f Webinar OpenFlow Switch Management Using NETCONF and YANGTail-f Webinar OpenFlow Switch Management Using NETCONF and YANG
Tail-f Webinar OpenFlow Switch Management Using NETCONF and YANG
 
Openstack Neutron, interconnections with BGP/MPLS VPNs
Openstack Neutron, interconnections with BGP/MPLS VPNsOpenstack Neutron, interconnections with BGP/MPLS VPNs
Openstack Neutron, interconnections with BGP/MPLS VPNs
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014
 
Jaeger and OpenTracing Cloud Native Computing (CNCF) meetup Zurich
Jaeger and OpenTracing Cloud Native Computing (CNCF) meetup ZurichJaeger and OpenTracing Cloud Native Computing (CNCF) meetup Zurich
Jaeger and OpenTracing Cloud Native Computing (CNCF) meetup Zurich
 

Dernier

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 

Dernier (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 

Nutch as a Web data mining platform

  • 1. Apache Nutch as a Web mining platform Nutch – Berlin Buzzwords '10 the present and the future Andrzej Białecki ab@sigram.com
  • 2. Intro ● Started using Lucene in 2003 (1.2-dev?) ● Created Luke – the Lucene Index Toolbox ● Nutch, Lucene committer, Lucene PMC member Nutch – Berlin Buzzwords '10 ● Nutch project lead
  • 3. Agenda ● Nutch architecture overview ● Crawling in general – strategies and challenges ● Nutch workflow ● Web data mining with Nutch  Nutch – Berlin Buzzwords '10 with examples ● Nutch present and future ● Questions and answers 3
  • 4. Apache Nutch project ● Founded in 2003 by Doug Cutting, the Lucene creator, and Mike Cafarella ● Apache project since 2004 (sub-project of Lucene) ● Spin-offs: Nutch – Berlin Buzzwords '10 – Map-Reduce and distributed FS → Hadoop – Content type detection and parsing → Tika ● Many installations in operation, mostly vertical search ● Collections typically 1 mln - 200 mln documents ● Apache Top-Level Project since May ● Current release 1.1 4
  • 5. What's in a search engine? … a few things that may surprise you!  Nutch – Berlin Buzzwords '10 5
  • 6. Search engine building blocks Injector Scheduler Crawler Searcher Nutch – Berlin Buzzwords '10 Indexer Web graph Updater Content -page info -links (in/out) repository Parser Crawling frontier controls 6
  • 7. Nutch features at a glance ● Plugin-based, highly modular: ● Most behaviors can be changed via plugins ● Data repository: – Page status database and link database (web graph) – Content and parsed data database (shards) Nutch – Berlin Buzzwords '10 ● Multi-protocol, multi-threaded, distributed crawler ● Robust crawling frontier controls ● Scalable data processing framework ● Hadoop MapReduce processing ● Full-text indexer & search front-end ● Using Solr (or Lucene) ● Support for distributed search ● Flexible integration options 7
  • 8. Search engine building blocks Injector Scheduler Crawler Searcher Nutch – Berlin Buzzwords '10 Indexer Web graph Updater Content -page info -links (in/out) repository Parser Crawling frontier controls 8
  • 9. Nutch building blocks Injector Generator Fetcher Searcher Nutch – Berlin Buzzwords '10 Indexer CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 9
  • 10. Nutch data Maintains info on all known URL-s: Injector Generator ● Fetch schedule Fetcher Searcher ● Fetch status ● Page signature ● Metadata Nutch – Berlin Buzzwords '10 Indexer CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 10
  • 11. Nutch data Injector Generator For each Fetcher target URL keeps info on Searcher incoming links, i.e. list of source URL-s and their associated anchor text Nutch – Berlin Buzzwords '10 Indexer CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 11
  • 12. Nutch data Shards (“segments”) keep: Injector Generator ● Raw page content Fetcher Searcher ● Parsed content + discovered metadata + outlinks ● Plain text for indexing and Nutch – Berlin Buzzwords '10 Indexer snippets CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 12
  • 13. Shard-based workflow ● Unit of work (batch) – easier to process massive datasets ● Convenience placeholder, using predefined directory names ● Unit of deployment to the search infrastructure – Solr-based search may discard shards once indexed ● Once completed they are basically unmodifiable – No in-place updates of content, or replacing of obsolete content Nutch – Berlin Buzzwords '10 ● Periodically phased-out by new, re-crawled shards – Solr-based search can update Solr index in-place 200904301234/ 2009043012345 2009043012345 Generator crawl_generate/ crawl_generate crawl_generate crawl_fetch/ Fetcher crawl_fetch crawl_fetch content/ content crawl_parse/ “cached” view content Parser crawl_parse parse_data/ crawl_parse parse_text/ parse_data parse_data snippets Indexer parse_text parse_text 13
  • 14. Crawling frontier challenge ● No authoritative catalog of web pages ● Crawlers need to discover their view of web universe ● Start from “seed list” & follow (walk) some (useful? interesting?) outlinks ● Many dangers of simply wandering around ● explosion or collapse of the frontier; collecting unwanted content (spam, junk, offensive) Nutch – Berlin Buzzwords '10 I need a few interesting items... 14
  • 15. High-quality seed list ● Reference sites: – Wikipedia, FreeBase, DMOZ seed + 1 hop – Existing verticals ● Seeding from existing Nutch – Berlin Buzzwords '10 search engines – Collect top-N URL-s for seed characteristic keywords i=1 ● Seed URL-s plus 1: – First hop usually retains high- quality and focus – Remove blatantly obvious junk 15 15
  • 16. Controlling the crawling frontier ● URL filter plugins – White-list, black-list, regex – May use external resources (DB-s, services ...) ● URL normalizer plugins Nutch – Berlin Buzzwords '10 – Resolving relative path elements seed – “Equivalent” URLs i=1 i=2 ● Additional controls i=3 – priority, metadata select/block – Breadth first, depth first, per‑site mixed ... 16
  • 17. Wide vs. focused crawling ● Differences: – Little technical difference in configuration – Big difference in operations, maintenance and quality ● Wide crawling: ● (Almost) Unlimited crawling frontier Nutch – Berlin Buzzwords '10 ● High risk of spamming and junk content ● “Politeness” a very important limiting factor ● Bandwidth & DNS considerations ● Focused (vertical or enterprise) crawling: ● Limited crawling frontier ● Bandwidth or politeness is often not an issue ● Low risk of spamming and junk content 17
  • 18. Vertical & enterprise search ● Vertical search – Range of selected “reference” sites – Robust control of the crawling frontier – Extensive content post-processing – Business-driven decisions about ranking Nutch – Berlin Buzzwords '10 ● Enterprise search – Variety of data sources and data formats – Well-defined and limited crawling frontier – Integration with in-house data sources – Little danger of spam – PageRank-like scoring usually works poorly 18
  • 19. Nutch – Berlin Buzzwords '10 ? Face to face with Nutch 19
  • 20. Installation & basic config ● http://nutch.apache.org ● Java 1.5+ ● Single-node out of the box – Comes also as a “job” jar to run on existing Hadoop cluster ● File-based configuration: conf/ Nutch – Berlin Buzzwords '10 – Plugin list – Per-plugin configuration ● … much, much more on this on the Wiki 20 20
  • 21. Main Nutch workflow Command-line: bin/nutch ● Inject: initial creation of CrawlDB inject – Insert seed URLs – Initial LinkDB is empty Nutch – Berlin Buzzwords '10 ● Generate new shard's fetchlist generate ● Fetch raw content fetch ● Parse content (discovers outlinks) parse ● Update CrawlDB from shards updatedb ● Update LinkDB from shards invertlinks ● Index shards index / solrindex (repeat) 21
  • 22. Injecting new URL-s Injector Generator Fetcher Searcher Nutch – Berlin Buzzwords '10 Indexer CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 22
  • 23. Generating fetchlists Injector Generator Fetcher Searcher Nutch – Berlin Buzzwords '10 Indexer CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 23
  • 24. Fetching content Injector Generator Fetcher Searcher Nutch – Berlin Buzzwords '10 Indexer CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 24
  • 25. Content processing Injector Generator Fetcher Searcher Nutch – Berlin Buzzwords '10 Indexer CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 25
  • 26. Link inversion Injector Generator Fetcher Searcher Nutch – Berlin Buzzwords '10 Indexer CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 26
  • 27. Page importance - scoring Injector Generator Fetcher Searcher Nutch – Berlin Buzzwords '10 Indexer CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 27
  • 28. Indexing Injector Generator Fetcher Searcher Nutch – Berlin Buzzwords '10 Indexer CrawlDB Updater Shards (segments) Link Parser LinkDB inverter URL filters & normalizers, parsing/indexing filters, scoring plugins 28
  • 29. Map-reduce indexing ● Map() just assembles all parts of documents ● Reduce() performs text analysis + indexing: – Sends assembled documents to Solr or – Adds to a local Lucene index Nutch – Berlin Buzzwords '10 ● Other possible MR indexing models: – Hadoop contrib/indexing model: ● analysis and indexing on map() side ● Index merging on reduce() side – Modified Nutch model: ● Analysis on map() side ● Indexing on reduce() side 29
  • 30. Nutch integration ● Nutch search & tools API – Search via REST-style interaction, XML / JSON response – Tools CLI and API to access bulk & single Nutch items – Single-node, embedded, distributed (Hadoop cluster) ● Data-level integration: direct MapFile / Nutch – Berlin Buzzwords '10 SequenceFile reading – More complicated (and still requires using Nutch classes) – May be more efficient – Future: native tools related to data stores (HBase, SQL, ...) ● Exporting Nutch data – All data can be exported to plain text formats – bin/nutch read* ● ...db – read CrawlDB and dump some/all records ● ...linkdb – read LinkDb and dump some/all records ● ...seg – read segments (shards) and dump some/all records 30
  • 31. Web data mining with Nutch Nutch – Berlin Buzzwords '10 31
  • 32. Nutch search ● Solr indexing and searching (preferred) – Simple Lucene indexing / search available too ● Using Solr search: – DisMax search over several fields (url, title, body, anchors) – Faceted search Nutch – Berlin Buzzwords '10 – Search results clustering – SolrCloud: ● Automatic shard replication and load-balancing ● Hashing update handler to distribute docs to Solr shards 30 32
  • 33. Search-based analytics ● Keyword search → crude topic mining ● Phrase search → crude collocation mining ● Anchor search → crude semantic enrichment ● Feedback loop from search results: – Faceting and on-line clustering may discover latent topics Nutch – Berlin Buzzwords '10 – Top-N results for reference queries may prioritize further crawling ● Example: question answering system – Source documents from reference sites – NLP document analysis: key-phrase detection, POS-tagging, noun-verb / subject-predicate detection, enrichment from DBs and semantic nets – NLP query analysis: expected answer type (e.g. person, place, date, activity, method, ...), key-phrases, synonyms – Regular search – Evaluation of raw results (further NLP analysis of each document) 33
  • 34. Web as a corpus ● Examples: – Source of raw text in a specific language – Source of text on a given subject ● Selection by e.g. a presence of keywords, or full-blown NLP ● Add data from known reference sites (Wikipedia, Freebase) or databases (Medline) or semantic nets (WordNet, OpenCyc) Nutch – Berlin Buzzwords '10 – Source of documents in a specific format (e.g. PDF) ● Nutch setup: – URLFilters define the crawling frontier and content types – Parse plugins determine the content extraction / processing ● e.g. language detection ● Nutch shards: – Extracted text, metadata, outlinks / anchors 34
  • 35. Web as a corpus (2) ● Concept mining – Harvesting human-created concept descriptions and associations – “kind of”, “contains”, “includes”, “application of” – Co-occurrence of concepts has some meaning too! ● Example: medical search engine Nutch – Berlin Buzzwords '10 – Controlled vocabulary of diseases, symptoms, procedures – Identifiable metadata: author, journal, publication date, etc. – Nutch crawl of reference sites and DBs ● Co-occurrence of controlled vocabulary – BloomFilter-s for quick trimming of map-side data – Or Mahout collocation mining for uncontrolled concepts ● Cube of co-occurring (related) concepts ● Several dimensions to traverse – “authors who publish most often together on treatment of myocardial infarction” ● 10 nodes, 100k phrases in vocabulary, 20 mln pages, ~300bln phrases on map side → ~5GB data cube 35
  • 36. Web as a directed graph ● Nodes (vertices): URL-s as unique identifiers ● Edges (links): hyperlinks like <a href=”targetUrl”/> ● Edge labels: <a href=”..”>anchor text</a> ● Often represented as adjacency (neighbor) lists ● Inverted graph: LinkDB in Nutch Nutch – Berlin Buzzwords '10 Straight (outlink) graph: 1 → 2a, 3b, 4c, 5d, 6e 8 5 → 6f, 9g j 7 → 3h, 4i, 8j, 9k k 7 9 3 h 2 i Inverted (inlink) graph: a b 2 ← 1a c 3 ← 1b, 7h 1 4 4 ← 1c, 7i e d 5 ← 1d g 6 ← 1e, 5f 6 f 5 8 ← 7j 9 ← 5g, 7k 36
  • 37. Link inversion ● Pages have outgoing links (outlinks) … I know where I'm pointing to ● Question: who points to me? … I don't know, there is no catalog of pages … NOBODY knows for sure either! Nutch – Berlin Buzzwords '10 ● In-degree may indicate importance of the page ● Anchor text provides important semantic info ● Answer: invert the outlinks that I know about, and group by target (Nutch 'invertlinks') tgt 2 src 2 tgt 1 src 1 src 1 tgt 3 tgt 1 src 3 tgt 5 tgt 4 src 5 src 4 37
  • 38. Web as a recommender ● Links as recommendations: – Link represents an association – Anchor text represents a recommended topic ● … with some surrounding text of a hyperlink? ● Not all pages are created equal Nutch – Berlin Buzzwords '10 – Recommendations from good pages are useful – Recommendations from bad pages may be useless – Merit / guilt by association: ● Links from good pages should improve the target's reputation ● Links from bad pages may compromise good pages' reputation ● Not all recommendations are trustworthy – What links to trust, and to what degree? – Social aspects: popularity, fashion, mobbing, fallacy of “common belief” 38
  • 39. Link analysis and scoring ● PageRank 1 1 – Query-independent page weight 1 1 – Based on the flow of weight along link paths ● Dampening factor α to stabilize the flow ● Weight from “dangling nodes” redistributed 1.25 1.25 Other models Nutch – Berlin Buzzwords '10 ● 0.75 0.75 – Hyperlink-Induced Topic Search (HITS) ● Query-dependent, local iterations, hub/authority – TrustRank 1.06 1.31 ● Propagation of “trust” based on human expert evaluation of seed sites 0.69 0.94 ● Challenges – Loops, link spam, cliques, loosely connected subgraphs, mobbing, etc 39
  • 40. Nutch link analysis tools ● Tools for PageRank calculation with loop detection – LinkDb: source of anchor text (think “recommended topics”) – Page in-degree ≈ popularity / importance / quality – Scoring API (and plugins) to control the flow of page importance along link paths ● Nutch shards: Nutch – Berlin Buzzwords '10 – Source of outlinks → expanding the crawling frontier – Page linked-ness vs. its content: hub or authority ● Example: porn / junk detection – Links to “porn” pages poisonous to importance / quality – Links from “porn” pages decrease the confidence in quality of the target page ● Example: vertical crawl – Expanding to pages “on topic” == with sufficient in-link support from known on topic pages 40
  • 41. Web of gossip and opinions ● General Web – not considering special-purpose networks here... ● Example: – Who / what is in the news? – How often a name is mentioned? today Google yields 44,500 hits for ab@getopt.org  Nutch – Berlin Buzzwords '10 ● – What facts about me are publicly available? – What is the sentiment associated with a name (person, organization, trademark)? ● Nutch setup: – Seed from a few reference news sites, blogs, Twitter, etc – Use Nutch plugin for RSS/Atom crawling – NLP parsing plugins (NER, classification, sentiment analysis) ● Nutch shards: – Capture temporal aspect 41
  • 42. Web as a source of … anything ● The data is there, just lost among irrelevant stuff – Difficult to find → good seed list + crawling frontier controls – Mixed with junk & irrelevant data → URL & content filtering Be creative – combine multiple strategies: Nutch – Berlin Buzzwords '10 ● – Crawl for raw data, stay on topic – filter out junk early – Use plain indexing & search as a crude analytic tool – Use creative post-processing to filter and enhance the data – Export data from Nutch and pipe it to other tools (Pig, HBase, Mahout, ...) 42
  • 43. Future of Nutch ● Nutch 2.0 re-design – Refactoring, cleanup, better scale-up / scale-down – Avoid code duplication – Expected release ~Q4 2010 ● Share code with other crawler projects → crawler-commons Nutch – Berlin Buzzwords '10 ● Indexing & Search → Solr, SolrCloud – Distributed and replicated search is difficult – Initial integration needs significant improvement – Shard management – SolrCloud / Zookeeper ● Web-graph & page repository → ORM layer – Combine CrawlDB, LinkDB and shard storage – Avoid tedious shard management – Gora ORM mapping: HBase, SQL, Cassandra? BerkeleyDB? – Benefit from native tools specific to storage → easier integration 43
  • 44. Future of Nutch (2) ● What's left then? – Crawling frontier management, discovery – Re-crawl algorithms – Spider trap handling – Fetcher Nutch – Berlin Buzzwords '10 – Ranking: enterprise-specific, user-feedback – Duplicate detection, URL aliasing (mirror detection) – Template detection and cleanup, pagelet-level crawling – Spam & junk control ● Vision: á la carte toolkit, scalable from 1-1000s nodes – Easier setup for small 1 node installs – Focus on a reliable, easy to integrate framework 44
  • 45. Conclusions (This overview is a tip of the iceberg) Nutch ● Implements all core search engine components Nutch – Berlin Buzzwords '10 ● Extremely configurable and modular ● Scales well ● A complete crawl & search platform – and a toolkit ● Easy to use as an input feed to data collecting and data mining tools 45
  • 46. Q&A ● Further information: – http://nutch.apache.org/ – user@nutch.apache.org Nutch – Berlin Buzzwords '10 ● Contact author: – ab@sigram.com 46