Crawling with NodeJS
JSMeetup2@Paris 24.11.2010
@sylvinus
Crawling?
Web crawling
Grab
Process
Store
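The three steps above can be sketched as a small loop. This is only an illustration under assumptions: `crawl`, `extractLinks`, and the injected `grab` callback are hypothetical names that do not come from the talk, and link extraction uses a naive regular expression rather than a real HTML parser.

```javascript
// Hypothetical sketch of the Grab -> Process -> Store loop.
// `grab(url, callback)` is injected so any HTTP client can be plugged in.

// Process: pull href values out of the HTML (naive regex, for illustration only).
function extractLinks(html) {
  const links = [];
  const pattern = /href="([^"]+)"/g;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

function crawl(startUrl, grab, maxPages, done) {
  const store = {};              // Store: url -> list of links found on that page
  const queue = [startUrl];
  const seen = new Set(queue);   // never queue the same URL twice
  let stored = 0;

  function next() {
    if (queue.length === 0 || stored >= maxPages) {
      return done(store);
    }
    const url = queue.shift();
    grab(url, function (error, html) {        // Grab
      if (!error) {
        const links = extractLinks(html);     // Process
        store[url] = links;                   // Store
        stored += 1;
        links.forEach(function (link) {
          if (!seen.has(link)) {
            seen.add(link);
            queue.push(link);
          }
        });
      }
      next();                                 // failed pages are simply skipped
    });
  }

  next();
}
```

Pages are fetched one at a time here to keep the sketch short; a real crawler would fetch several in parallel.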
?
NodeJS
Server-side Javascript
Async / Event-driven / Reactor pattern
Small stdlib, Exploding module ecosystem
Why?
Boldly going where no one has g...
Threads vs. Async
ZOMG Server-side CSS3 selectors!
Apricot
https://github.com/silentrob/Apricot
HTML/DOM Parser, inspired by Hpricot
Sizzle + JSDOM + XUI
Problems w/ Apricot
if (file.match(/^https?:\/\//)) {
    var urlInfo = url.parse(file, parseQueryString=false),
    host = http.createClient(((urlInfo.protocol === 'http:') ? 80 : 443),
urlInfo.hostname),
    req_url = urlInfo.pathname;
    if (urlInfo.search) {
      req_url += urlInfo.search;
    }
    var request = host.request('GET', req_url, { host: urlInfo.hostname });
    request.addListener('response', function (response) {
      var data = '';
      response.addListener('data', function (chunk) {
        data += chunk;
      });
      response.addListener("end", function() {
        fnLoaderHandle(null, data);
      });
    });
    if (request.end) {
      request.end();
    } else {
      request.close();
    }
      
  } else {
    fs.readFile(file, encoding='utf8', fnLoaderHandle);
  }
Problems w/ Apricot
No advanced HTTP client in Node’s lib
npm install request
https + redirects + buffering
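For a sense of what "https + redirects + buffering" involves, here is a rough sketch built only on Node's core `http`, `https`, and `url` modules. It is not how the `request` module is implemented; `fetchPage`, `pickTransport`, and `resolveRedirect` are hypothetical helper names chosen for this illustration.

```javascript
const http = require('http');
const https = require('https');
const { URL } = require('url');

// Choose the core module that matches the URL scheme.
function pickTransport(pageUrl) {
  return new URL(pageUrl).protocol === 'https:' ? https : http;
}

// Redirect targets may be relative, so resolve them against the current URL.
function resolveRedirect(currentUrl, location) {
  return new URL(location, currentUrl).href;
}

// Fetch a page, following up to `redirectsLeft` redirects and buffering the body.
function fetchPage(pageUrl, redirectsLeft, callback) {
  const request = pickTransport(pageUrl).get(pageUrl, function (response) {
    const status = response.statusCode;
    const location = response.headers.location;

    if (status >= 300 && status < 400 && location) {
      response.resume(); // discard the redirect body
      if (redirectsLeft <= 0) {
        return callback(new Error('Too many redirects'));
      }
      return fetchPage(resolveRedirect(pageUrl, location), redirectsLeft - 1, callback);
    }

    // Buffering: collect every chunk before handing the page to the caller.
    let body = '';
    response.setEncoding('utf8');
    response.on('data', function (chunk) { body += chunk; });
    response.on('end', function () { callback(null, body); });
  });
  request.on('error', callback);
}
```

A call would look like `fetchPage('http://jamendo.com/', 5, function (error, body) { ... })`, where the limit of 5 redirects is an arbitrary assumption.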
Problems w/ Apricot
Apricot.parse("<p id='test'>An HTML Fragment</p>", function(doc) {
doc.find("selector"); // Populates internal collection, See Sizzle selector syntax (rules)
doc.each(callback); // Iterates over the collection, applying a callback to each match (element)
doc.remove(); // Removes all elements in the internal collection (See XUI Syntax)
doc.inner("fragment"); // See XUI Syntax
doc.outer("fragment"); // See XUI Syntax
doc.top("fragment"); // See XUI Syntax
doc.bottom("fragment"); // See XUI Syntax
doc.before("fragment"); // See XUI Syntax
doc.after("fragment"); // See XUI Syntax
doc.hasClass("class"); // See XUI Syntax
doc.addClass("class"); // See XUI Syntax
doc.removeClass("class"); // See XUI Syntax
doc.toHTML; // Returns the HTML
doc.innerHTML; // Returns the innerHTML of the body.
doc.toDOM; // Returns the DOM representation
// Most methods are chainable, so this works
doc.find("selector").addClass('foo').after(", just because");
});
Problems w/ Apricot
XUI api?!
jQuery please :)
require("jsdom").jQueryify !!
var jsdom = require("jsdom"),
window = jsdom.jsdom().createWindow();
jsdom.jQueryify(window, 'http://code.jquery.com/jquery-1.4.2.min.js' , function() {
window.$('body').append('<div class="testing">Hello World, It works</div>');
console.log(window.$('.testing').text());
});
Concurrency?
https://github.com/coopernurse/node-pool
npm install generic-pool
generic-pool
// Create a MySQL connection pool with
// a max of 10 connections and a 30 second max idle time
var poolModule = require('generic-pool');
var pool = poolModule.Pool({
name : 'mysql',
create : function(callback) {
var Client = require('mysql').Client;
var c = new Client();
c.user = 'scott';
c.password = 'tiger';
c.database = 'mydb';
c.connect();
callback(c);
},
destroy : function(client) { client.end(); },
max : 10,
idleTimeoutMillis : 30000,
log : false
});
// borrow connection - callback function is called
// once a resource becomes available
pool.borrow(function(client) {
client.query("select * from foo", [], function() {
// return object back to pool
pool.returnToPool(client);
});
});
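To make the idea behind the pool concrete, here is a minimal sketch of how a cap on concurrent resources can be enforced. It is not generic-pool's actual implementation: `createPool` is a hypothetical name, the `borrow` / `returnToPool` method names simply mirror the example above, and idle timeouts and `destroy` are left out.

```javascript
// Hypothetical minimal pool: hands out at most `max` resources at a time
// and queues further borrowers until a resource is returned.
function createPool(options) {
  const idle = [];      // resources already created and free to reuse
  const waiting = [];   // callbacks waiting for a resource
  let created = 0;      // total resources created so far

  function dispatch() {
    while (waiting.length > 0) {
      if (idle.length > 0) {
        const callback = waiting.shift();
        callback(idle.pop());            // reuse a free resource
      } else if (created < options.max) {
        created += 1;
        options.create(waiting.shift()); // create a new one for this borrower
      } else {
        return;                          // pool exhausted: wait for returnToPool
      }
    }
  }

  return {
    borrow: function (callback) {
      waiting.push(callback);
      dispatch();
    },
    returnToPool: function (resource) {
      idle.push(resource);
      dispatch();
    }
  };
}
```

For a crawler the pooled resource does not have to be a database connection; it can be a slot that represents permission to open one more HTTP request.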
So what?
Apricot - XUI + jQuery
+ request + generic-pool
+ qunit + ?
=
??
Simple API?
var Crawler = require("node-crawler").Crawler;
var c = new Crawler({
"maxConnections":10,
"timeout":60,
"defaultHandler":function(error,result,$) {
$("#content a:link").each(function(index, a) {
c.queue(a.href);
})
}
});
c.queue(["http://jamendo.com/","http://tedxparis.com", ...]);
c.queue([{
"uri":"http://parisjs.org/register",
"method":"POST",
"handler":function(error,result,$) {
$("div:contains(Thank you)").after(" very much");
}
}]);
Name contest! :)
node-crawler ?
Crawly ?
?????
Thanks!
First code on github tonight
Help & Forks welcomed
(We’re hiring HTML5/JS hackers ;-)
Also, http://html5weekend.org/
