Combining Web Crawler Data with Server Logs to highlight Crawl Budget opportunities. Get Google crawling and indexing more of your pages in Organic Search Results!
3. Here’s the problem…
> Google doesn’t crawl every page of your website
>> If a page isn’t crawled it won’t be indexed
>>> If a page isn’t indexed, it won’t make you money
5. New Initiative Planning Process:
• Identify Desired Outcomes and Objectives
• Information Gathering
• Action Planning
• Implementation and Review
Dawn Anderson's slide deck "BRINGING IN THE FAMILY DURING CRAWLING" is an insightful guide to help you identify crawl budget opportunities. Dawn also suggests powerful actions you should explore.
7. Hypertext Transfer Protocol (HTTP)

HTTP Request (Client → Server):

GET /index.html HTTP/1.1
Host: www.exampleshop.com
User-Agent: Mozilla/5.0

HTTP Response (Server → Client):

HTTP/1.1 200 OK
Date: Mon, 11 Jul 2016 08:06:45 GMT
Server: Apache/1.3.27 (Unix) (Red-Hat/Linux)
Last-Modified: Wed, 04 Feb 2016 23:11:55 GMT
ETag: "3f84f-1b9-3e1cd16b"
Accept-Ranges: bytes
Content-Length: 458
Connection: close
Content-Type: text/html; charset=UTF-8

Fig 1: HTTP Client/Server Communication

This is a standard HTTP/1.1 exchange between a client (e.g. a browser or Googlebot) and your server.
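To make the exchange concrete, here is a minimal sketch in Python (standard library only) that sends the request above and prints the response status and headers. The URL reuses the slide's placeholder host, www.exampleshop.com; point it at any real page you want to inspect.

from urllib.request import Request, urlopen

# Reproduce the request from Fig 1 (placeholder host from the slide).
req = Request(
    "http://www.exampleshop.com/index.html",
    headers={"User-Agent": "Mozilla/5.0"},
)

with urlopen(req, timeout=10) as resp:
    print(resp.status, resp.reason)        # e.g. "200 OK"
    for name, value in resp.getheaders():  # Date, Server, ETag, ...
        print(f"{name}: {value}")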
8. Server Log Files

188.65.114.122 - - [19/Jul/2016:08:07:05 -0400] "GET /women/shoes/converse14579/ HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Each entry records:
• Client IP (who's requesting?)
• Timestamp, date & time (when?)
• Method, GET or POST (how?)
• Request URI (what file?)
• HTTP status code (the server response)
• User-agent

Fig 2: Example Server Output

Server logs are the SINGLE SOURCE OF TRUTH when it comes to seeing how search engines, such as Googlebot, access your website. Your webserver keeps a record of every hit it receives during exchanges like the one on the previous slide. Your very own data treasure chest.
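If you want to start digging into that treasure chest, here is a minimal parsing sketch in Python. The regex assumes the exact field layout shown in Fig 2; real log formats vary, so adjust it to your server's configuration.

import re

LOG_LINE = (
    '188.65.114.122 - - [19/Jul/2016:08:07:05 -0400] '
    '"GET /women/shoes/converse14579/ HTTP/1.1" 200 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]+" (?P<status>\d{3}) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

match = PATTERN.match(LOG_LINE)
if match:
    fields = match.groupdict()
    # Flag Googlebot hits by user-agent (verify via reverse DNS in
    # production, since spammers spoof the Googlebot string).
    fields["is_googlebot"] = "Googlebot" in fields["user_agent"]
    print(fields)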
9. "[Clean up your architecture because] we get lost crawling unnecessary URLs and we might not be able to crawl and index your new and updated content as quickly as we would otherwise… There are a number of crawlers you can use to crawl your website on your own, to run across your website."
@JohnMu, Google Webmaster Central office hours hangout, 16 Oct 2015

Two actions follow from this:
>> Crawl your website with a THIRD-PARTY CRAWLER
>> Conduct LOG FILE ANALYSIS
10. How does Log File Analysis differ from Web Crawl Analysis?
11. Web Crawl: systematically fetch, retrieve, and validate the HTML on every page of your website to simulate Googlebot's/Bingbot's analysis of your pages.

[Diagram: site architecture crawled top-down: Home → Category → Subcategory → Detail]
Let's consider how the information is collected… This is great for optimising your HTML code and helps you produce a best-in-class website.
12. But that’s not how search engines operate and crawling alone lacks the evidence to
back up your strategy.
For example, Googlebot might enter through a popular category and crawl the same
pages time after time. Search Console won’t tell you this and neither does simulating a
crawl from your homepage.
So, you need to crawl your architecture and compare the data to Google's activity (via your log files) to gain insight into how you'll get more of your money-making pages crawled and indexed.
13. What barriers do people face when trying to study this vital information?
• Access to Server Logs
• File Sizes
• Misplacing trust in Search Console
• Time required to process the data
But I don’t think you should be deterred and here’s why…
14. Accessing your logs is simpler than you think.
Your organisation is probably already using them.
Common Log Analysis use cases for eCommerce organisations include:
>> Application Management
>> Access Management
>> Network Forensics
>> Compliance
Popular products used by Applications and Security teams at major Enterprise
companies include: LogRhythm, Loggly, and Splunk.
15. Splunk (a log file storage and processing company): Market Cap $8.6bn, 11,000 Customers
17. It's true that the volume of data involved can make working with the files prohibitive.
For example, if a site receives 50,000 visitors a day browsing an average of 5 pages per session, that's 250,000 log entries per day for the HTML alone: 7.5M entries per month.
Now add 10 assets requested from the server for each page view, and that's another 75,000,000 lines in your log files per month.
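The arithmetic, as a quick sanity check (the inputs are this slide's assumptions, not universal constants):

visitors_per_day = 50_000
pages_per_session = 5
assets_per_page = 10
days_per_month = 30

html_per_month = visitors_per_day * pages_per_session * days_per_month  # 7,500,000
assets_per_month = html_per_month * assets_per_page                     # 75,000,000
print(f"{html_per_month:,} HTML entries; {assets_per_month:,} asset entries per month")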
18. SEOs regularly monitor and trend site architecture data (HTTP codes, etc.) in third-party apps, but it's not possible to scrutinise Search Console's crawling and indexing charts in the same way, even though you really should be scrutinising that data.
19. So, how is engineering helping us overcome these barriers and expand our knowledge?
>>> Secure File Transfer Protocol (SFTP)
>>> Storing and trending log data thanks to cloud services
>>> Processing automation (saving TIME)
>>> Diffing log data with simulated crawl data for greater insights (see the sketch below)
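As a sketch of that diffing step: with two plain-text URL lists, one exported from your crawler and one filtered from your logs (the file names here are hypothetical), a pair of set differences already surfaces the two most interesting buckets.

def load_urls(path: str) -> set[str]:
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

in_structure = load_urls("site_crawl_urls.txt")    # your third-party crawl
seen_by_google = load_urls("googlebot_urls.txt")   # Googlebot hits from your logs

# Structure pages Googlebot never requested: candidates for wasted crawl budget.
never_crawled = in_structure - seen_by_google
# URLs Googlebot requests that your crawl can't reach: orphans, parameters, etc.
orphans = seen_by_google - in_structure

print(f"{len(never_crawled):,} structure URLs not crawled by Google")
print(f"{len(orphans):,} crawled URLs outside your structure")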
20. Let's move on to the questions I think you should be looking to answer.
21. What are the typical questions SEOs try to answer with Log Analysis?
• Where do I have accessibility errors?
• Which pages are being spidered most frequently?
• Is spammer activity proving detrimental to performance?
• Which pages haven’t been crawled by search engines?
These are all valid and helpful, but I suggest looking at the next list too…
22. 5 Critical Questions / KPIs (score yourself on each):
1. What is my 'Crawl Ratio'?
2. What percentage of my compliant pages (2xx & unique) will Google crawl each month?
3. How deep will Google crawl into my site architecture?
4. What does Google consider to be my Top, Middle and Long Tail pages?
5. What is my 'Crawl Window' score?
HOW MANY MORE PAGES NOW HAVE THE POTENTIAL TO MAKE US MONEY (THANKS TO MY EFFORTS OVER THE PAST 30 DAYS)?
23. I've mentioned a few terms you might not be familiar with, so here's a list of old friends with a couple of new additions:

• Crawl Rate: requests per second Googlebot makes to your site when it is crawling it
• Crawl Budget: the maximum number of pages that Google crawls on a website
• Crawl Frequency: the program determining which sites to crawl, how often, and how many pages to fetch from each site
• Crawl Rank: the frequency a page is crawled compared with the ranking position of that page
• Crawl Space: the totality of possible URLs for a website
• Crawl Ratio: the percentage of my website structure Google is crawling every 30 days
• Crawl Window: the percentage of the compliant (unique & 200) pages on my website Google usually crawls in a 14-day period
25. Crawl Ratio: the percentage of my website structure Google is crawling every 30 days

Crawl Ratio = (Total pages in the website structure crawled by Google in 30 days ÷ Total pages in the website structure) × 100
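In code, the formula is one line over the two URL sets from the diffing sketch earlier (the set names are hypothetical):

def crawl_ratio(in_structure: set[str], seen_by_google: set[str]) -> float:
    """Percentage of the site structure Google crawled in 30 days of logs."""
    if not in_structure:
        return 0.0
    return len(in_structure & seen_by_google) / len(in_structure) * 100

# Example: 8,200 of 20,000 structure URLs seen in the logs -> 41.0%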
26. Organic Growth Opportunities

[Venn diagrams for the three example sites: Lifestyle Publisher, Business Equipment Retail, Real Estate Classified]

The Venn diagrams clearly illustrate the mismatch between the URLs you hope Google is looking at and the accurate picture from your server logs.
27. Critical Question 2 – what percentage of my compliant pages (200 & unique) will Google crawl each month?
28. % of key pages crawled = (Total compliant pages crawled by Google in 30 days ÷ Total compliant pages in the website structure) × 100

[Chart: % crawled vs. % potential]
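A sketch of the compliance filter that feeds this formula; the record fields ("url", "status", "canonical") are hypothetical, so map them to whatever your crawler exports:

def compliant_urls(crawl_records: list[dict]) -> set[str]:
    # Compliant = responds 200 and is unique (self-canonical or no canonical).
    return {
        r["url"]
        for r in crawl_records
        if r["status"] == 200 and r.get("canonical", r["url"]) == r["url"]
    }

def pct_compliant_crawled(compliant: set[str], seen_by_google: set[str]) -> float:
    if not compliant:
        return 0.0
    return len(compliant & seen_by_google) / len(compliant) * 100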
30. Critical Question 3 – how deep will Google crawl into my website architecture?

31. What depths will Google plunge?

[Charts for the three example sites: Lifestyle Publisher, Business Equipment Retail, Real Estate Classified]

This chart indicates the correlation between the depth of your content and Google's crawling activity.
32. [Charts for the three example sites: Lifestyle Publisher, Business Equipment Retail, Real Estate Classified]

This chart indicates Google's crawl rate (URL crawled or not by any bot) by internal PageRank. How can I more effectively use PageRank to increase visibility?
33. Critical Question 4 – what does Google consider to be my Top, Middle and Long Tail pages?

34. [Charts for the three example sites: Lifestyle Publisher, Business Equipment Retail, Real Estate Classified]

This graph details visit frequency from Google search result pages for all URLs analysed by the crawler: how often URLs get organic visits from Google.
35. Then compare organic traffic with a measure of how often URLs are crawled by any Google bot. Increase your Middle Tail.

[Charts for the three example sites: Lifestyle Publisher, Business Equipment Retail, Real Estate Classified]
37. Crawl Window: the percentage of my compliant URLs Google usually crawls in a 14-day period*

When a change appears on the website, whether voluntary or involuntary, knowing your Crawl Window value will tell you precisely how long it should take to identify a positive/negative impact.

Example scores:
• Real Estate Classified: 25.5%
• Business Equipment Retail: 80.8%
• Lifestyle Publisher: 66.3%

*This is a simplified calculation of Botify's Crawl Window metric.
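One way to approximate this simplified score, assuming you've parsed Googlebot hits into (url, timestamp) pairs as in the earlier sketches:

from datetime import datetime, timedelta

def crawl_window_score(
    compliant: set[str],
    googlebot_hits: list[tuple[str, datetime]],
    days: int = 14,
) -> float:
    # Share of compliant URLs Googlebot requested within the window.
    cutoff = datetime.now() - timedelta(days=days)
    recent = {url for url, ts in googlebot_hits if ts >= cutoff}
    return len(compliant & recent) / len(compliant) * 100 if compliant else 0.0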
38. 5 Critical Questions / KPIs (score yourself on each):
1. What is my 'Crawl Ratio'?
2. What percentage of my compliant pages (2xx & unique) will Google crawl each month?
3. How deep will Google crawl into my site architecture?
4. What does Google consider to be my Top, Middle and Long Tail pages?
5. What is my 'Crawl Window' score?
HOW MANY MORE PAGES NOW HAVE THE POTENTIAL TO MAKE US MONEY (THANKS TO MY EFFORTS OVER THE PAST 30 DAYS)?
You might find this checklist helpful.
39. THANK YOU!
Take a Free Trial via www.botify.com
#BrightonSEO | @SearchMATH