SlideShare une entreprise Scribd logo
1  sur  21
Télécharger pour lire hors ligne
Frontera: Large-Scale Open Source Web
Crawling Framework
Alexander Sibiryakov, 20 July 2015
sibiryakov@scrapinghub.com
• Born in Yekaterinburg, RU
• 5 years at Yandex, search
quality department: social and
QA search, snippets.
• 2 years at Avast! antivirus,
research team: automatic false
positive solving, large scale
prediction of malicious
download attempts.
Hola los participantes!
2
–Wikipedia: Web Crawler article, July 2015
3
«A Web crawler starts with a list of URLs to visit,
called the seeds. As the crawler visits these
URLs, it identifies all the hyperlinks in the page
and adds them to the list of URLs to visit, called
the crawl frontier.».
tophdart.com
• Client needed to crawl 1B+
pages/week, and identify
frequently changing HUB
pages.
• Scrapy is hard for broad
crawling and had no crawl
frontier capabilities, out of the
box,
• People were tend to favor
Apache Nutch instead of
Scrapy.
Motivation
Hyperlink-Induced Topic Search,
Jon Kleinberg, 1999
5
• Frontera is all about knowing what to crawl next
and when to stop.
• Single-Threaded mode can be used for up to 100
websites (parallel downloading),
• for performance broad crawls there is a distributed
mode.
Frontera: single-threaded and distributed
6
• Online operation: scheduling of new batch,
updating of DB state.
• Storage abstraction: write your own backend
(sqlalchemy, HBase is included).
• Canonical URLs resolution abstraction: each
document has many URLs, which to use?
• Scrapy ecosystem: good documentation, big
community, ease of customization.
Main features
7
• Need of URL metadata and content storage,
• Need of isolation of URL ordering/queueing logic
from the spider
• Advanced URL ordering logic (big websites, or
revisiting)
Single-threaded use cases
8
Single-threaded architecture
9
• Frontera is implemented as a
set of custom scheduler and
spider middleware for Scrapy.
• Frontera doesn’t require
Scrapy, and can be used
separately.
• Scrapy role is process
management and fetching
operation.
• And we’re friends forever!
Frontera and Scrapy
10
• $pip install frontera
• write a spider, or take example one from Frontera
repo,
• edit spider settings.py changing scheduler and
add Frontera’s spider middleware,
• $scrapy crawl [your_spider]
• Check your chosen DB contents after crawl.
Single-threaded Frontera quickstart
• You have set of URLs and need to revisit them (e.g.
to track changes).
• Building a search engine with content retrieval from the
Web.
• All kinds of research work on web graph: gathering links 
statistics, structure of graph, tracking domain count, etc.
• You have a topic and you want to crawl the documents
about that topic.
• More general focused crawling tasks: e.g. you search for 
pages that are big hubs, and frequently changing in time.
Distributed use cases: broad crawls
12
Frontera architecture: distributed
Kafka topic
SW
DB
Strategy workers
DB workers
13
• Communication layer is Apache Kafka: topic
partitioning, offsets mechanism.
• Crawling strategy abstraction: crawling goal, url
ordering, scoring model is coded in separate
module.
• Polite by design: each website is downloaded by
at most one spider.
• Python: workers, spiders.
Main features: distributed
• Apache HBase,
• Apache Kafka,
• Python 2.7+,
• Scrapy 0.24+,
• DNS Service.
Software requirements
CDH (100% Open source
Hadoop distribution)
• Single-thread Scrapy spider gives 1200 pages/min.
from about 100 websites in parallel.
• Spiders to workers ratio is 4:1 (without content)
• 1 Gb of RAM for every SW (state cache, tunable).
• Example:
• 12 spiders ~ 14.4K pages/min.,
• 3 SW and 3 DB workers,
• Total 18 cores.
Hardware requirements
• Network could be a bottleneck
for internal communication.
Solution: increase count of
network interfaces.
• HBase can be backed by
HDDs, and free RAM would
be great for caching the
priority queue.
• Kafka throughput is key
performance issue, make sure
that Kafka brokers has enough
IOPS.
Hardware requirements: gotchas
Quickstart for distributed Frontera
• $pip install distributed-frontera
• prepare HBase and Kafka,
• simple Scrapy spider, passing links and/or content,
• configure Frontera workers and spiders,
• run workers, spiders and pull in the seeds.
Consult http://distributed-frontera.readthedocs.org/ for
more information.
Quick spanish (.es) internet crawl
• fnac.es, rakuten.es, adidas.es,
equiposdefutbol2014.es, druni.es,
docentesconeducacion.es - are
the biggest websites
• 68.7K domains found,
• 46.5M crawled pages overall,
• 1.5 months,
• 22 websites with more than 50M
pages
For more info and graphs check
the poster
Feature plans: distributed version
• Revisit strategy,
• PageRank or HITS-based
strategy,
• Own url parsing and html
parsing,
• Integration to Scrapinghub’s
paid services,
• Testing at larger scales.
Preguntas!
Thank you!
Alexander Sibiryakov,
sibiryakov@scrapinghub.com

Contenu connexe

Tendances

Integrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsIntegrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsDamien Dallimore
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...Chris Fregly
 
The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)Eva Tse
 
Introduction to Node.js
Introduction to Node.jsIntroduction to Node.js
Introduction to Node.jsVikash Singh
 
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...Databricks
 
Running Spring Boot Applications as GraalVM Native Images
Running Spring Boot Applications as GraalVM Native ImagesRunning Spring Boot Applications as GraalVM Native Images
Running Spring Boot Applications as GraalVM Native ImagesVMware Tanzu
 
Recommendation system
Recommendation system Recommendation system
Recommendation system Vikrant Arya
 
3 Kafka patterns to deliver Streaming Machine Learning models with Andrea Spi...
3 Kafka patterns to deliver Streaming Machine Learning models with Andrea Spi...3 Kafka patterns to deliver Streaming Machine Learning models with Andrea Spi...
3 Kafka patterns to deliver Streaming Machine Learning models with Andrea Spi...HostedbyConfluent
 
Apache Kafka in the Airline, Aviation and Travel Industry
Apache Kafka in the Airline, Aviation and Travel IndustryApache Kafka in the Airline, Aviation and Travel Industry
Apache Kafka in the Airline, Aviation and Travel IndustryKai Wähner
 
Apache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningApache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningKai Wähner
 
NLU / Intent Detection Benchmark by Intento, August 2017
NLU / Intent Detection Benchmark by Intento, August 2017NLU / Intent Detection Benchmark by Intento, August 2017
NLU / Intent Detection Benchmark by Intento, August 2017Konstantin Savenkov
 
Realtime Indexing for Fast Queries on Massive Semi-Structured Data
Realtime Indexing for Fast Queries on Massive Semi-Structured DataRealtime Indexing for Fast Queries on Massive Semi-Structured Data
Realtime Indexing for Fast Queries on Massive Semi-Structured DataScyllaDB
 
Building Next-Generation Web APIs with JSON-LD and Hydra
Building Next-Generation Web APIs with JSON-LD and HydraBuilding Next-Generation Web APIs with JSON-LD and Hydra
Building Next-Generation Web APIs with JSON-LD and HydraMarkus Lanthaler
 
Introduction to Spring Cloud
Introduction to Spring Cloud           Introduction to Spring Cloud
Introduction to Spring Cloud VMware Tanzu
 
Cloud native-apps-architectures
Cloud native-apps-architecturesCloud native-apps-architectures
Cloud native-apps-architecturesCapgemini
 
Closures in Javascript
Closures in JavascriptClosures in Javascript
Closures in JavascriptDavid Semeria
 
Netflix Case Study - AWS
Netflix Case Study - AWSNetflix Case Study - AWS
Netflix Case Study - AWSHeath Dorn
 
React Router: React Meetup XXL
React Router: React Meetup XXLReact Router: React Meetup XXL
React Router: React Meetup XXLRob Gietema
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com confluent
 

Tendances (20)

Integrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsIntegrating Splunk into your Spring Applications
Integrating Splunk into your Spring Applications
 
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
KubeFlow + GPU + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTo...
 
The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)The evolution of the big data platform @ Netflix (OSCON 2015)
The evolution of the big data platform @ Netflix (OSCON 2015)
 
Introduction to Node.js
Introduction to Node.jsIntroduction to Node.js
Introduction to Node.js
 
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
The Killer Feature Store: Orchestrating Spark ML Pipelines and MLflow for Pro...
 
Running Spring Boot Applications as GraalVM Native Images
Running Spring Boot Applications as GraalVM Native ImagesRunning Spring Boot Applications as GraalVM Native Images
Running Spring Boot Applications as GraalVM Native Images
 
Recommendation system
Recommendation system Recommendation system
Recommendation system
 
3 Kafka patterns to deliver Streaming Machine Learning models with Andrea Spi...
3 Kafka patterns to deliver Streaming Machine Learning models with Andrea Spi...3 Kafka patterns to deliver Streaming Machine Learning models with Andrea Spi...
3 Kafka patterns to deliver Streaming Machine Learning models with Andrea Spi...
 
NEXT.JS
NEXT.JSNEXT.JS
NEXT.JS
 
Apache Kafka in the Airline, Aviation and Travel Industry
Apache Kafka in the Airline, Aviation and Travel IndustryApache Kafka in the Airline, Aviation and Travel Industry
Apache Kafka in the Airline, Aviation and Travel Industry
 
Apache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep LearningApache Kafka Streams + Machine Learning / Deep Learning
Apache Kafka Streams + Machine Learning / Deep Learning
 
NLU / Intent Detection Benchmark by Intento, August 2017
NLU / Intent Detection Benchmark by Intento, August 2017NLU / Intent Detection Benchmark by Intento, August 2017
NLU / Intent Detection Benchmark by Intento, August 2017
 
Realtime Indexing for Fast Queries on Massive Semi-Structured Data
Realtime Indexing for Fast Queries on Massive Semi-Structured DataRealtime Indexing for Fast Queries on Massive Semi-Structured Data
Realtime Indexing for Fast Queries on Massive Semi-Structured Data
 
Building Next-Generation Web APIs with JSON-LD and Hydra
Building Next-Generation Web APIs with JSON-LD and HydraBuilding Next-Generation Web APIs with JSON-LD and Hydra
Building Next-Generation Web APIs with JSON-LD and Hydra
 
Introduction to Spring Cloud
Introduction to Spring Cloud           Introduction to Spring Cloud
Introduction to Spring Cloud
 
Cloud native-apps-architectures
Cloud native-apps-architecturesCloud native-apps-architectures
Cloud native-apps-architectures
 
Closures in Javascript
Closures in JavascriptClosures in Javascript
Closures in Javascript
 
Netflix Case Study - AWS
Netflix Case Study - AWSNetflix Case Study - AWS
Netflix Case Study - AWS
 
React Router: React Meetup XXL
React Router: React Meetup XXLReact Router: React Meetup XXL
React Router: React Meetup XXL
 
Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com Data Streaming Ecosystem Management at Booking.com
Data Streaming Ecosystem Management at Booking.com
 

En vedette

Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkScrapinghub
 
Big data at scrapinghub
Big data at scrapinghubBig data at scrapinghub
Big data at scrapinghubDana Brophy
 
Using Web Data for Finance
Using Web Data for FinanceUsing Web Data for Finance
Using Web Data for FinanceScrapinghub
 
Mining the web, no experience required
Mining the web, no experience requiredMining the web, no experience required
Mining the web, no experience requiredScrapinghub
 
Relational machine-learning
Relational machine-learningRelational machine-learning
Relational machine-learningBhushan Kotnis
 
Scrapinghub Deck for Startups
Scrapinghub Deck for StartupsScrapinghub Deck for Startups
Scrapinghub Deck for StartupsScrapinghub
 
The story behind La Paella King brand
The story behind La Paella King brandThe story behind La Paella King brand
The story behind La Paella King brandMichael Robert Gill
 
Large scale social recommender systems at LinkedIn
Large scale social recommender systems at LinkedInLarge scale social recommender systems at LinkedIn
Large scale social recommender systems at LinkedInMitul Tiwari
 
Web scraping with nutch solr
Web scraping with nutch solrWeb scraping with nutch solr
Web scraping with nutch solrMike Frampton
 
XPath for web scraping
XPath for web scrapingXPath for web scraping
XPath for web scrapingScrapinghub
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopNikolai Avteniev
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkHakka Labs
 
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutchsebastian_nagel
 
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Spark Summit
 
Using a Graph Database for Next-Gen MDM
Using a Graph Database for Next-Gen MDMUsing a Graph Database for Next-Gen MDM
Using a Graph Database for Next-Gen MDMNeo4j
 
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...CloudTechnologies
 
Spirit of Valencia Team introduction
Spirit of Valencia Team introductionSpirit of Valencia Team introduction
Spirit of Valencia Team introductionMichael Robert Gill
 

En vedette (20)

Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling framework
 
Big data at scrapinghub
Big data at scrapinghubBig data at scrapinghub
Big data at scrapinghub
 
Using Web Data for Finance
Using Web Data for FinanceUsing Web Data for Finance
Using Web Data for Finance
 
Mining the web, no experience required
Mining the web, no experience requiredMining the web, no experience required
Mining the web, no experience required
 
Relational machine-learning
Relational machine-learningRelational machine-learning
Relational machine-learning
 
Scrapinghub Deck for Startups
Scrapinghub Deck for StartupsScrapinghub Deck for Startups
Scrapinghub Deck for Startups
 
The story behind La Paella King brand
The story behind La Paella King brandThe story behind La Paella King brand
The story behind La Paella King brand
 
Large scale social recommender systems at LinkedIn
Large scale social recommender systems at LinkedInLarge scale social recommender systems at LinkedIn
Large scale social recommender systems at LinkedIn
 
Web scraping with nutch solr
Web scraping with nutch solrWeb scraping with nutch solr
Web scraping with nutch solr
 
Mining legal texts with Python
Mining legal texts with PythonMining legal texts with Python
Mining legal texts with Python
 
XPath for web scraping
XPath for web scrapingXPath for web scraping
XPath for web scraping
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On Hadoop
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
 
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutch
 
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
Real Time Fuzzy Matching with Spark and Elastic Search-(Sonal Goyal, Nube)
 
Using a Graph Database for Next-Gen MDM
Using a Graph Database for Next-Gen MDMUsing a Graph Database for Next-Gen MDM
Using a Graph Database for Next-Gen MDM
 
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
SMART CRAWLER: A TWO-STAGE CRAWLER FOR EFFICIENTLY HARVESTING DEEP-WEB INTERF...
 
Spirit of Valencia Team introduction
Spirit of Valencia Team introductionSpirit of Valencia Team introduction
Spirit of Valencia Team introduction
 

Similaire à Frontera-Open Source Large Scale Web Crawling Framework

Alexander Sibiryakov- Frontera
Alexander Sibiryakov- FronteraAlexander Sibiryakov- Frontera
Alexander Sibiryakov- FronteraPyData
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015ontopic
 
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Ontico
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypresNekoGato
 
Restful风格ž„web服务架构
Restful风格ž„web服务架构Restful风格ž„web服务架构
Restful风格ž„web服务架构Benjamin Tan
 
Mi Domain Wheel Slides
Mi Domain Wheel SlidesMi Domain Wheel Slides
Mi Domain Wheel Slideslancesfa
 
The Real-time Web in the Age of Agents
The Real-time Web in the Age of AgentsThe Real-time Web in the Age of Agents
The Real-time Web in the Age of AgentsJoshua Shinavier
 
Apache Geode - The First Six Months
Apache Geode -  The First Six MonthsApache Geode -  The First Six Months
Apache Geode - The First Six MonthsAnthony Baker
 
My Site is slow - Drupal Camp London 2013
My Site is slow - Drupal Camp London 2013My Site is slow - Drupal Camp London 2013
My Site is slow - Drupal Camp London 2013hernanibf
 
Online Collections Crawlability for Libraries, Archives, and Museums
Online Collections Crawlability for Libraries, Archives, and MuseumsOnline Collections Crawlability for Libraries, Archives, and Museums
Online Collections Crawlability for Libraries, Archives, and Museumsmherbison
 
4 Web Crawler.pptx
4 Web Crawler.pptx4 Web Crawler.pptx
4 Web Crawler.pptxDEEPAK948083
 

Similaire à Frontera-Open Source Large Scale Web Crawling Framework (20)

Alexander Sibiryakov- Frontera
Alexander Sibiryakov- FronteraAlexander Sibiryakov- Frontera
Alexander Sibiryakov- Frontera
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Common crawlpresentation
Common crawlpresentationCommon crawlpresentation
Common crawlpresentation
 
Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015
 
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
 
Scalability andefficiencypres
Scalability andefficiencypresScalability andefficiencypres
Scalability andefficiencypres
 
webcrawler.pptx
webcrawler.pptxwebcrawler.pptx
webcrawler.pptx
 
Restful风格ž„web服务架构
Restful风格ž„web服务架构Restful风格ž„web服务架构
Restful风格ž„web服务架构
 
Web Crawling & Crawler
Web Crawling & CrawlerWeb Crawling & Crawler
Web Crawling & Crawler
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Webcrawler
WebcrawlerWebcrawler
Webcrawler
 
Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale
 
Webcrawler
Webcrawler Webcrawler
Webcrawler
 
Mi Domain Wheel Slides
Mi Domain Wheel SlidesMi Domain Wheel Slides
Mi Domain Wheel Slides
 
The Real-time Web in the Age of Agents
The Real-time Web in the Age of AgentsThe Real-time Web in the Age of Agents
The Real-time Web in the Age of Agents
 
Apache Geode - The First Six Months
Apache Geode -  The First Six MonthsApache Geode -  The First Six Months
Apache Geode - The First Six Months
 
My Site is slow - Drupal Camp London 2013
My Site is slow - Drupal Camp London 2013My Site is slow - Drupal Camp London 2013
My Site is slow - Drupal Camp London 2013
 
Online Collections Crawlability for Libraries, Archives, and Museums
Online Collections Crawlability for Libraries, Archives, and MuseumsOnline Collections Crawlability for Libraries, Archives, and Museums
Online Collections Crawlability for Libraries, Archives, and Museums
 
4 Web Crawler.pptx
4 Web Crawler.pptx4 Web Crawler.pptx
4 Web Crawler.pptx
 
DC presentation 1
DC presentation 1DC presentation 1
DC presentation 1
 

Frontera-Open Source Large Scale Web Crawling Framework

  • 1. Frontera: Large-Scale Open Source Web Crawling Framework Alexander Sibiryakov, 20 July 2015 sibiryakov@scrapinghub.com
  • 2. • Born in Yekaterinburg, RU • 5 years at Yandex, search quality department: social and QA search, snippets. • 2 years at Avast! antivirus, research team: automatic false positive solving, large scale prediction of malicious download attempts. Hola los participantes! 2
  • 3. –Wikipedia: Web Crawler article, July 2015 3 «A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier.».
  • 5. • Client needed to crawl 1B+ pages/week, and identify frequently changing HUB pages. • Scrapy is hard for broad crawling and had no crawl frontier capabilities, out of the box, • People were tend to favor Apache Nutch instead of Scrapy. Motivation Hyperlink-Induced Topic Search, Jon Kleinberg, 1999 5
  • 6. • Frontera is all about knowing what to crawl next and when to stop. • Single-Threaded mode can be used for up to 100 websites (parallel downloading), • for performance broad crawls there is a distributed mode. Frontera: single-threaded and distributed 6
  • 7. • Online operation: scheduling of new batch, updating of DB state. • Storage abstraction: write your own backend (sqlalchemy, HBase is included). • Canonical URLs resolution abstraction: each document has many URLs, which to use? • Scrapy ecosystem: good documentation, big community, ease of customization. Main features 7
  • 8. • Need of URL metadata and content storage, • Need of isolation of URL ordering/queueing logic from the spider • Advanced URL ordering logic (big websites, or revisiting) Single-threaded use cases 8
  • 10. • Frontera is implemented as a set of custom scheduler and spider middleware for Scrapy. • Frontera doesn’t require Scrapy, and can be used separately. • Scrapy role is process management and fetching operation. • And we’re friends forever! Frontera and Scrapy 10
  • 11. • $pip install frontera • write a spider, or take example one from Frontera repo, • edit spider settings.py changing scheduler and add Frontera’s spider middleware, • $scrapy crawl [your_spider] • Check your chosen DB contents after crawl. Single-threaded Frontera quickstart
  • 12. • You have set of URLs and need to revisit them (e.g. to track changes). • Building a search engine with content retrieval from the Web. • All kinds of research work on web graph: gathering links  statistics, structure of graph, tracking domain count, etc. • You have a topic and you want to crawl the documents about that topic. • More general focused crawling tasks: e.g. you search for  pages that are big hubs, and frequently changing in time. Distributed use cases: broad crawls 12
  • 13. Frontera architecture: distributed Kafka topic SW DB Strategy workers DB workers 13
  • 14. • Communication layer is Apache Kafka: topic partitioning, offsets mechanism. • Crawling strategy abstraction: crawling goal, url ordering, scoring model is coded in separate module. • Polite by design: each website is downloaded by at most one spider. • Python: workers, spiders. Main features: distributed
  • 15. • Apache HBase, • Apache Kafka, • Python 2.7+, • Scrapy 0.24+, • DNS Service. Software requirements CDH (100% Open source Hadoop distribution)
  • 16. • Single-thread Scrapy spider gives 1200 pages/min. from about 100 websites in parallel. • Spiders to workers ratio is 4:1 (without content) • 1 Gb of RAM for every SW (state cache, tunable). • Example: • 12 spiders ~ 14.4K pages/min., • 3 SW and 3 DB workers, • Total 18 cores. Hardware requirements
  • 17. • Network could be a bottleneck for internal communication. Solution: increase count of network interfaces. • HBase can be backed by HDDs, and free RAM would be great for caching the priority queue. • Kafka throughput is key performance issue, make sure that Kafka brokers has enough IOPS. Hardware requirements: gotchas
  • 18. Quickstart for distributed Frontera • $pip install distributed-frontera • prepare HBase and Kafka, • simple Scrapy spider, passing links and/or content, • configure Frontera workers and spiders, • run workers, spiders and pull in the seeds. Consult http://distributed-frontera.readthedocs.org/ for more information.
  • 19. Quick spanish (.es) internet crawl • fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es - are the biggest websites • 68.7K domains found, • 46.5M crawled pages overall, • 1.5 months, • 22 websites with more than 50M pages For more info and graphs check the poster
  • 20. Feature plans: distributed version • Revisit strategy, • PageRank or HITS-based strategy, • Own url parsing and html parsing, • Integration to Scrapinghub’s paid services, • Testing at larger scales.