Presentation delivered on 12.04.19, at April's edition of Brighton SEO. Contains an introduction, basic and more advanced uses of Chrome Puppeteer and Headless Chrome, and how you can use them to monitor your site!
Like many of us, I'm constantly trying to find new ways to make my (and my team's) job easier.
So this awesome guy - Eric Bidelman - is a software engineer at Google, and works on Headless Chrome, Lighthouse and DevTools.
I can use Chrome Puppeteer to help me with my job.
So I went away and did a shit ton of research that is worth sharing.
So, the first thing I was looking for was a basic definition.
Contrary to what I wanted to believe, it did not involve any decapitation.
So when you open up Google Chrome normally, you get a wonderful User Interface with bookmarks
And a search bar, plugins, buttons, tabs
And usable functionality.
With headless chrome, you get none of that shit.
So here I am running headless chrome
And we can see that it is in the background, but I have no Chrome windows open.
So Google Chrome is Running, but with NO User Interface.
So it is running without the UX/UI 'head'.
Why should you even care about this sort of stuff though?
Through this research journey, I found out that you can do a bunch of stuff with it!
Scrape the hell out of JavaScript websites (as well as basic HTML scraping)
You can copy the DOM and paste it into a text file, with which you can
Compare the site's source code against the DOM, and then export the differences. This can allow you to identify any potential rendering issues.
You can use it to generate screenshots of pages,
And effectively crawl single page applications.
JS can be a bit of a pain to work with, but unfortunately, it is not going away!
So Screaming Frog (and the majority of crawling software) utilise something like Headless Chrome to emulate a browser and provide JS rendering features.
And we all know about issues that Google can have with crawling JS, ranging from having slight issues with rendering, to completely drawing a blank.
So there have been a bunch of JS indexing and rendering case studies over the past couple of years.
So it can help you crawl these JS-heavy sites.
We can also use Headless Chrome to automate web page checks, and I provide an in-depth investigation into this later in this deck.
AND it can be used for general webpage testing. Including clicking on stuff, filling in forms, general fuckery with the mouse and keyboard.
It is really good for emulating user behaviour. So great for pretending to be a user, and browsing around a site.
So it is basically really great for seeing exactly how much shit a website can take before it breaks!
However, the problem with running all of these tasks is:
You have to run basic headless chrome through the command line interface
So first you gotta install some dependencies, and have a shit ton of errors hit you in the face, and you gotta know where chrome is stored on your local machine...
Then you gotta run directly from that location
Then specify that Chrome should launch headless
Then open a port to use
Then you gotta disable GPU
Then you can add a single URL, or a URL list into the command line
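Pieced together, that raw CLI workflow looks roughly like this. A sketch only: the Chrome path below is the default macOS install location, which is an assumption - yours may differ.

```shell
# Default macOS install location - an assumption; find where Chrome lives on YOUR machine
CHROME="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"

# Launch headless, disable the GPU, open a debugging port, and dump the rendered DOM for one URL
"$CHROME" --headless --disable-gpu --remote-debugging-port=9222 --dump-dom https://example.com
```

Swap `--dump-dom` for `--screenshot` to capture an image instead of printing the DOM.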
Now then
I really, really love using the command line.
In fact, so much so that I spoke about it at Brighton last year.
But doing all of this really made me wanna cry.
So how do I make utilising Headless Chrome - which is freaking awesome - easy?
Like I said a few minutes ago, I’m always trying to find ways to make my job easier
And doing all of these boring-ass steps was really not easy. At all.
So I went away and did an even bigger shit ton of research.
So, in this talk at Google I/O, Eric mentions something called Chrome Puppeteer (shoutout Eric!).
So what is Chrome Puppeteer?
Doing a simple Google Search for Chrome Puppeteer reveals all.
But the stuff I'm interested in is this: a Node library, and...
Ooooh, an API!
So Node - for those who do not have dev experience - can be used for making some pretty kick-ass applications.
It can also be used to help control headless chrome in an easy to digest and utilise package
So how can you actually get Chrome Puppeteer?
If you want to run tests on your local machine, you have to install a few things first.
Node.js, which is a runtime environment, and npm, which is a package manager for Node.
Chill out though, it’s fairly straightforward
Someone made this easy a while ago.
So if you are on a PC, it's fairly simple to get and install;
You've just got to install these things from the Node.js website.
I’ve linked to a guide here - that takes you through step by step.
If, like me, you are on a Mac
It's not that easy.
There’s a wicked awesome guide here that takes you through step by step what you need to do.
So you wanna start off by opening up terminal
And then typing in a few lines of shit
This installs Homebrew, which makes everything even easier.
So when Homebrew is downloaded - it shouldn't take too long, five minutes max -
You have to install two more things and we'll be ready to rock. These are node and npm.
So just type in this. It installs Node through Homebrew, directly onto your machine with no fuckery.
So this installs Node and npm; you'll get a nice progress bar telling you how far along it is.
Then you want to use npm to install the latest version of Puppeteer.
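On a Mac, once Homebrew is in place, the whole install boils down to a couple of commands (Homebrew's own install one-liner is on brew.sh, so I've left it out here):

```shell
# Install Node (which bundles npm) via Homebrew
brew install node

# Then pull in the latest Puppeteer from npm - run this inside your project folder
npm install puppeteer
```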
Now that’s it, you are all good and groovy!
So, for example, if I wanted to take a screenshot of a single page...
So just type in this, and you should be good to go.
You'll need to code some stuff up - but I've put everything together into a single Google Doc that makes it simple and easy to understand what each bit does. I'm going to walk through it now.
So we are starting up a headless browser, in true headless mode, so you won't see what goes on (it's running in the background).
And then we are opening up a new tab/page
And then we specify exactly what URL we want to go to. So in this instance, we are testing the BlueArray homepage.
Then we are taking a screenshot. We have to specify 2 things to allow the code to work correctly
So the path, so where and what we want the file to be saved as
And then saving as a specific filetype. You can play around with this and pick the filetype that works best for you.
And then we close the page, and then close the browser.
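Put together, the script I've just described looks roughly like this - a sketch, assuming Puppeteer is installed; the function name and the BlueArray URL in the comment are mine, not the deck's exact code:

```javascript
// screenshot.js - a minimal sketch of the steps above.
// Puppeteer is required inside the function so the file loads even before `npm install puppeteer`.
async function takeScreenshot(url, path) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: true }); // true headless: runs in the background
  const page = await browser.newPage();                       // open a new tab/page
  await page.goto(url);                                       // go to the URL we want to test
  await page.screenshot({ path, type: 'png' });               // where to save it, and as what filetype
  await page.close();                                         // close the page...
  await browser.close();                                      // ...then the browser
}

// Run with: node screenshot.js
// takeScreenshot('https://www.bluearray.co.uk/', 'bluearray.png');
```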
Go to terminal, make sure you are in the same folder as your code, and type in
node screenshot.js
And then a couple of seconds later, you’ll see
A nice screenshot get added to your folder with your code in
If you wanted to see the browser test this exactly for you,
Just change the headless mode to false. This is great for seeing exactly what the browser sees, and looks pretty cool, having a chrome window doing all sorts of shit in front of you!
You can also modify the script slightly to run through a list of provided URLs
And then get a bunch of screenshots!
Now I’m sure that you guys can see where this is headed
Faking Googlebot and seeing what it would see.
So with a few little tweaks to the code that we have for the first example
Adding in a user agent string, and setting it to what Googlebot uses.
FYI, the Googlebot user agent string is not just 'Googlebot' - it is fucking massive
And wouldn’t fit on the slide
node screenshot.js - screenshot.js is the name of the file.
Using the await page.setViewport option,
So we have to specify the width and the height of the viewport that we want to use
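With those two tweaks, the relevant lines look something like this. The UA string below is the smartphone Googlebot string as Google documented it around 2019 (note the Chrome/41 in the middle), and the viewport numbers are just a plausible mobile size - both are assumptions to adjust as needed:

```javascript
// Googlebot smartphone user agent, circa 2019 - too long to fit on a slide!
const GOOGLEBOT_UA =
  'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) ' +
  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 ' +
  '(compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

async function screenshotAsGooglebot(url, path) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.setUserAgent(GOOGLEBOT_UA);               // pretend to be Googlebot
  await page.setViewport({ width: 412, height: 732 }); // width and height of the viewport
  await page.goto(url);
  await page.screenshot({ path });
  await browser.close();
}
```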
This isn't really Googlebot, just a decent attempt at emulation,
As, unfortunately,
Puppeteer was launched way after Chrome 41, so we cannot tell it to use that version of Chrome :'(
However
This can be persuasive in getting a client to ensure that their content is rendered server-side, as opposed to client-side, if needed.
We can then provide a list of URLs that we want to get screenshotted
And show how they would appear to Google through puppeteer rendering, instead of
In the case of some rather shit JS sites:
Absolutely nothing - a blank page.
Which is pretty cool, and allows for bulk page testing
But the really cool stuff is yet to come!
So who here has heard of, or even used, ContentKing?
It’s a fairly awesome piece of software
That allows you to monitor a site in real time (ish),
With it alerting you of any issues such as
Meta data changes, New pages that 404, Updated links, redirects, indexable and non-indexable pages….
However!
Like most really good tools, it costs money.
Maybe you don't wanna eat into your budget for ContentKing for a personal project site, or you don't need the level of detail those guys provide for a smaller, shitter site?
This next example shows how we can use Puppeteer to
Monitor a chosen site whenever you want, and report on any changes to key areas,
Including:
Meta title changes
Meta description updates
Any increase or decrease in the word count of the page.
Any changes to robots directives, with differences between them highlighted
Any differences in canonical elements
So basically the really important shit from an HTML webpage.
So I wrote some code
So I’ll be tweeting this out after for those who are interested..
As with all coding, this required a bit of research
Ahem stackoverflow ahem
And with a little bit of luck
We now have a way to monitor these basic areas for web pages
This is how it works
There are about 200 lines of code in total.
Here's a small snapshot.
And I don't have time to go through the full thing today,
but
There are a few really interesting snippets that I’d really like to share, that can come in handy
So we launch headless chrome as highlighted a few minutes ago
Like so. So we launch the browser, and then create a new page within the browser, awaiting further instruction...
And then we provide a list of URLs for Puppeteer to go and fuck around with
So here we are referencing the file that we will use for this program; we parse (or read) it using a couple more lines that don't really look that exciting!
And then we pull in the relevant metadata that I mentioned.
So, for example, I'm going to show you how we pull in meta titles.
So we are just pulling the title from the page. If there isn't one we'd get an error, so we add in a fallback - 'n/a'.
And then create an array of all the meta data - so a nice, formatted list of data that we can use later on within the script
So this just tells the script to treat all this data as one line, that we can then refer back to later
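A sketch of those two steps - the fallback title pull and the one-line-per-URL formatting. The selector, field names and tab separator are my assumptions, not the deck's 200-line original:

```javascript
// Pull the title, falling back to 'n/a' so a missing tag doesn't throw an error
async function getTitle(page) {
  try {
    return await page.$eval('title', (el) => el.textContent);
  } catch (err) {
    return 'n/a'; // no <title> on the page
  }
}

// Flatten all the metadata for one URL into a single tab-separated line,
// so each run can be saved and compared field-by-field later
function formatRow(meta) {
  return [meta.url, meta.title, meta.description, meta.robots, meta.canonical, meta.wordCount].join('\t');
}
```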
And we then push all this data to a text file.
The script then loops through every URL that is provided, pulling out all the data for each.
It then checks for differences in the data - so compares this run with the previous one.
If there are any differences between the two sets of data, these get saved within a changes.txt file
That I can then check whenever,
So I can see what has changed from yesterday, or whenever I last ran the code
This required me to run the code manually each day,
Which I completely forgot to do.
So, I went one step further, to make my life even easier
Chucked the code on a Raspberry Pi
And set up a cron job within my local machine to automatically run the script at the same time
Every day
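The crontab entry for that daily run looks something like this - the script path, log path and 7am time are all assumptions (edit yours with `crontab -e`):

```shell
# m h dom mon dow  command
0 7 * * * /usr/bin/node /home/pi/monitor.js >> /home/pi/monitor.log 2>&1
```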
And then
This was the bit that took the longest, by far.