Search Engines
Sudarsun Santhiappan, M.Tech.
Director – R & D,
Burning Glass Technologies
Copyleft (ɔ) 2009 Sudarsun Santhiappan
Today's Coverage
● Introduction
● Types of Search Engines
● Components of a Search Engine
● Semantics and Relevancy
● Search Engine Optimization
What is a Search Engine ?
● What is a Search ?
● Why do we need a Search Engine ?
● What are we searching against ?
● How good is a Search Engine ?
● What is Search on Search (Meta SE) ?
● Have you compared Search Engines side-by-side ?
● How are Images and Videos searched ?
● Apart from Web Search, what else ?
Introduction
● A Web search engine is a software program that searches
the Internet (a collection of websites) based on the words
that you designate as search terms (query words).
● Search engines look through their own databases of
information in order to find what you are looking for.
● Web search engines are a good example of massively
sized Information Retrieval systems.
– Have you tried the “Similar pages” link in a Google result set ?
Dictionary Definitions
Search
COMPUTING (transitive verb) to examine a computer file, disk,
database, or network for particular information
Engine
something that supplies the driving force or energy to a movement,
system, or trend
Search Engine
a computer program that searches for particular keywords and
returns a list of documents in which they were found, especially a
commercial service that scans documents on the Internet
About the definition of search engines
● oh well … search engines do not search
only for keywords; some search for other
stuff as well
● and they are really not “engines” in the
classical sense
– but then a mouse is not a “mouse”
use of search engines
… among others
Types of Search Engines
● Text Search Engines
– General: AltaVista, AskJeeves, Bing, Google
– Specialized: Google Scholar, Scirus, Citeseer
● Intranet vs Internet Search Engines
● Image Search Engines
– How can we search on the Image content ?
● Video Search Engines
– Image Search with Time dimension !!
Types of Search Engine
● Crawler Powered Indexes
– Guruji.com, Google.com
● Human Powered Indexes
– www.dmoz.org
● Hybrid Models
– Have you submitted URLs to a search engine ?
● Semantic Indexes
– Hakia.com
Have you tried Hakia ?
● What is Semantic Search ?
● How's it different from Keyword Search?
● What is categorized search ?
● Side-by-Side comparison with Google!!
● Have you compared Bing with Google ?
Directories
● www.dmoz.org
● Websites classified into a taxonomy
● Websites are categorically arranged
● Searching vs Navigation
● Instead of Query, you Click and navigate
● Accurate search always! (if data is available)
● Problem: Mostly Manually created
How does a Search Engine work ?
How Search Engines Work (Sherman 2003)
[Diagram: your browser sends a query (“Eggs?”) to the search engine; a crawler fetches URLs from the Web, an indexer builds the search engine database, and ranked results come back — “Eggs - 90%”, “Eggo - 81%”, “Ego - 40%”, “Huh? - 10%” — e.g. “All About Eggs” by S. I. Am.]
how do search engines work? elaboration
• crawlers, spiders: go out to find content
– in various ways they go through the web looking
for new & changed sites
– periodic, not for each query
● no search engine works in real time
– some search engines crawl for themselves,
others do not
● they buy content from companies such as Inktomi
– for a number of reasons crawlers do not
cover all of the web – just a fraction
– what is not covered is the “invisible web”
Elaboration …
• organizing content: labeling, arranging
– indexing for searching – automatic
● keywords and other fields
● arranging by URL popularity – PageRank, as in Google
– classifying into a directory
● mostly human handpicked & classified
● as a result of the different organization we have
basically two kinds of search engines:
● search – input is a query that is searched & displayed
● directory – classified content – a class is displayed
– and fused: directories have search capabilities & vice versa
Elaboration (cont.)
• databases, caches: storing content
– humongous files usually distributed over many computers
• query processor: searching, retrieval, display
– takes your query as input
● engines have differing rules on how queries are handled
– displays ranked output
● some engines also cluster output and provide visualization
● some engines provide categorically structured results
● at the other end is your browser
Similarities & Differences
● All search engines have these basic parts in
common
● BUT the actual processes – the methods they
use – are based on various algorithms and
they differ significantly
– most are proprietary (patented) with details kept
mostly secret (or protected) but based on well
known principles from information retrieval or
classification
– to some extent Google is an exception – they
published their method
Google Search
● In the beginning it ran on Stanford computers
● Basic approach has been described in their
famous paper “The Anatomy of a Large-Scale
Hypertextual Web Search Engine”
– well written, simple language, has their pictures
– in the acknowledgements they cite support from NSF’s Digital Library
Initiative, i.e. initially Google came out of government-sponsored research
– describes their PageRank method – based on ranking hyperlinks as in
citation indexing
– “We chose our system name, Google, because it is a common spelling of
googol, or ten to the hundredth power”
coverage differences
● no engine covers more than a fraction of WWW
– estimates: none more than 16%
– hard (even impossible) to discern & compare coverage, but they differ
substantially in what they cover
● in addition:
– many national search engines
● own coverage, orientation, governance
– many specialized or domain search engines
● own coverage geared to subject of interest
– many comprehensive sources independent of search engines
● some have compilations of evaluated web sources
searching differences
● substantial differences among search engines
in searching, retrieval & display
– need to know how they work & differ with respect to
● defaults in searching a query
● searching of phrases, case sensitivity, categories
● searching of different fields, formats, types of resources
● advanced search capabilities and features
● possibilities for refinement, using relevance feedback
● display options
● personalization options
Limitations
● every search engine has limitations as to
– coverage: meta engines inherit these coverage
limitations & add more of their own
– finding quality information
● some have compromised search with economics
– becoming little more than advertisers
● but search engines are also many times victims
of spamdexing
– affecting what is included and how it is ranked
Spamming a search engine
● use of techniques that push rankings higher
than they belong is also called spamdexing
– methods typically include textual as well as link-
based techniques
– like e-mail spam, search engine spam is a form of
adversarial information retrieval
● the conflicting goals: accurate results for search
providers & high positioning for content providers
Meta Search Engines
Search on Search
Meta search engines
● meta engines search multiple engines
– getting combined results from a variety of
engines
● do not have their own databases
– but have their own business models affecting
results
● a number of techniques used
– interesting ones: clustering, statistical analysis
Some Meta engines - with organized results
Dogpile : results from a number of leading
search engines; gives the source, so overlap can
be compared (also has a (bad) joke of the day)
Surfwax : gives statistics and text sources &
linking to sources; for some terms gives related
terms to focus
Teoma : results with suggestions for narrowing;
links resources derived; originated at Rutgers
Turbo10 : provides results in clusters; engines
searched can be edited
Some Meta Engines (cont.)
● Large directory
– Complete Planet
– directory of over 70,000 databases & specialty engines
● Results with graphical displays
– Vivisimo clusters results; innovative
– Webbrain results in tree structure – fun to use
– Kartoo results in display by topics of query
Domain Specific Search Engines
Domain Search Engines &
Catalogs
● cover specific subjects & topics
● important tool for subject searches
– particularly for subject specialist
– valued by professional searchers
● selection mostly hand-picked rather than
by crawlers, following inclusion criteria
– often not readily discernible
– but content more trustworthy
Domain Search Engines …
Open Directory Project
large edited catalog of the web – global, run by volunteers
BUBL LINK
selected Internet resources covering all academic subject
areas; organized by Dewey Decimal System – from UK
Profusion
search in categories for resources & search
engines
Resource Discovery Network – UK
“UK's free national gateway to Internet
resources for the learning, teaching and
research community”
Domain Engines … sample
Think Quest – Oracle Education Foundation
• education resources, programs; web sites created by students
All Music Guide
• resource about musicians, albums, and songs
Internet Movie Database
• treasure trove of American and British movies
Genealogy links and surname search engines
well.. that is getting really specialized (and popular)
Daypop
searches the “living web” “The living web is composed of sites that update on a
daily basis: newspapers, online magazines, and weblogs”
Science, scholarship engines …sample
● Psychcrawler – Amer Psychological Association
– web index for psychology
● Entrez PubMed – Nat Library of Medicine
– biomedical literature from MEDLINE & health journals
● CiteSeer – NEC Research Center
– scientific literature, citation index; strong in computer science
Google Scholar
searches for scholarly articles & resources
Infomine
scholarly internet research collections
Scirus
scientific information in journals & on the web
Science, scholarship engines …sample
commercial access
● in addition to freely accessible engines, many
provide search free but access to full text is paid
– by subscription or per item
– RUL provides access to these & many more:
ScienceDirect Elsevier: “world's largest electronic collection of science, technology
and medicine full text and bibliographic information”
ACM Portal Association for Computing Machinery: access to ACM Digital Library &
Guide to Computing
Search Engine Internals
Search Engine Internals
● Crawlers
● Indexers
● Searching
● Semantics
● Ranking
Standard Web Search Engine Architecture
[Diagram: crawl the web → check for duplicates, store the documents → create an inverted index → inverted index (DocIds) → search engine servers; the user’s query comes in and results are shown to the user.]
Typical Search Engine
Crawlers
● What is Crawling ?
● How does Crawling happen ?
● Have you tried “wget -r <url>” in Linux ?
● Have you tried “DAP” to download entire site?
● Page Walk
● Spidering & Crawlbots
Spidering the Web
● Replicating the Spider's behavior of building
the Internet (web) by adding spirals (sites)
● But, can the web be fully crawled ?
● By the time, one round of indexing is over, the
page might have changed already!
● That's why we have cached page link in the
search result!
Crawler Bots
● How to make your website Crawlable ?
● White-listing and Black-listing!
● Meta Tags to control the Bots
● Can HTTPS pages be crawled ?
● Are sessions maintained while crawling ?
● Can dynamic pages be crawled ?
● URL normalization
– cool.com?page=2 [crawler unfriendly]
– cool.com/page/2 [norm'd and crawler friendly]
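The rewrite above can be sketched in a few lines. The rule shown (query-string keys and values turned into path segments) is purely illustrative — real sites do this with server rewrite rules — and the `normalize` helper is a hypothetical name:

```python
# Sketch: turn a crawler-unfriendly query-string URL into a
# path-style URL, as in the cool.com example above.
# This rewrite rule is illustrative, not any engine's actual policy.
from urllib.parse import urlsplit, parse_qsl

def normalize(url):
    parts = urlsplit(url)
    if not parts.query:
        return url                      # already path-style
    path = parts.path.rstrip('/')
    for key, value in parse_qsl(parts.query):
        path += f'/{key}/{value}'       # ?page=2 -> /page/2
    return f'{parts.scheme}://{parts.netloc}{path}'
```

For example, `normalize('http://cool.com?page=2')` yields `http://cool.com/page/2`.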
How to control Robots ?
<HTML>
<HEAD>
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
<TITLE>...</TITLE>
</HEAD>
<BODY>
Index: This tells the spider/bot that it’s OK to index this page.
Noindex: Spiders/bots see this and don’t index any of the content on this page.
Follow: This lets the spider/bot know that it’s OK to travel down links found on this
page.
Nofollow: This tells the spider/bot not to follow any of the links on this page.
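A well-behaved bot has to read that tag before acting. Here is a minimal sketch of how a crawler might extract the ROBOTS directives, using only the standard library's HTMLParser (the class and function names are invented for illustration):

```python
# Sketch: honor the ROBOTS meta tag shown above.
# HTMLParser lowercases tag and attribute names for us.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.index = True    # may we index this page's content?
        self.follow = True   # may we follow its links?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'meta' and attrs.get('name', '').upper() == 'ROBOTS':
            directives = attrs.get('content', '').upper()
            if 'NOINDEX' in directives:
                self.index = False
            if 'NOFOLLOW' in directives:
                self.follow = False

def robots_policy(html):
    """Return (may_index, may_follow) for a page's HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow
```

Feeding it the slide's `<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">` gives `(False, False)`.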
Crawling – Process Flow
Data Structures
● A tree is the primary structure while crawling
● Both depth-first search and breadth-first
search are used
● Every page that the crawler visits is
added as a node to the tree
● Fan-out information is represented as the
children of a node (page).
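The breadth-first variant can be sketched on a toy in-memory link graph. A real crawler fetches pages over HTTP; here the "web" is just a dict mapping a page to the pages it links to (its fan-out), and `crawl_bfs` is an invented name:

```python
# Sketch: breadth-first crawl that builds the tree described above —
# each visited page becomes a node, its newly discovered links its children.
from collections import deque

def crawl_bfs(web, seed):
    """web: {page: [linked pages]}. Returns {page: [children]}."""
    tree = {seed: []}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for link in web.get(page, []):
            if link not in tree:          # not yet visited
                tree[page].append(link)   # fan-out = children of the node
                tree[link] = []
                queue.append(link)
    return tree
```

Depth-first crawling is the same loop with a stack (`pop()` instead of `popleft()`).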
Inverted Indexes the IR Way
How Inverted Files
Are Created
● Periodically rebuilt, static otherwise.
● Documents are parsed to extract
tokens. These are saved with the
Document ID.
Now is the time
for all good men
to come to the aid
of their country
Doc 1
It was a dark and
stormy night in
the country
manor. The time
was past midnight
Doc 2
Term Doc #
now 1
is 1
the 1
time 1
for 1
all 1
good 1
men 1
to 1
come 1
to 1
the 1
aid 1
of 1
their 1
country 1
it 2
was 2
a 2
dark 2
and 2
stormy 2
night 2
in 2
the 2
country 2
manor 2
the 2
time 2
was 2
past 2
midnight 2
How Inverted
Files are Created
● After all
documents have
been parsed the
inverted file is
sorted
alphabetically.
Term Doc #
a 2
aid 1
all 1
and 2
come 1
country 1
country 2
dark 2
for 1
good 1
in 2
is 1
it 2
manor 2
men 1
midnight 2
night 2
now 1
of 1
past 2
stormy 2
the 1
the 1
the 2
the 2
their 1
time 1
time 2
to 1
to 1
was 2
was 2
Term Doc #
now 1
is 1
the 1
time 1
for 1
all 1
good 1
men 1
to 1
come 1
to 1
the 1
aid 1
of 1
their 1
country 1
it 2
was 2
a 2
dark 2
and 2
stormy 2
night 2
in 2
the 2
country 2
manor 2
the 2
time 2
was 2
past 2
midnight 2
How Inverted
Files are Created
● Multiple term
entries for a
single
document are
merged.
● Within-
document term
frequency
information is
compiled.
Term Doc # Freq
a 2 1
aid 1 1
all 1 1
and 2 1
come 1 1
country 1 1
country 2 1
dark 2 1
for 1 1
good 1 1
in 2 1
is 1 1
it 2 1
manor 2 1
men 1 1
midnight 2 1
night 2 1
now 1 1
of 1 1
past 2 1
stormy 2 1
the 1 2
the 2 2
their 1 1
time 1 1
time 2 1
to 1 2
was 2 2
Term Doc #
a 2
aid 1
all 1
and 2
come 1
country 1
country 2
dark 2
for 1
good 1
in 2
is 1
it 2
manor 2
men 1
midnight 2
night 2
now 1
of 1
past 2
stormy 2
the 1
the 1
the 2
the 2
their 1
time 1
time 2
to 1
to 1
was 2
was 2
How Inverted Files are Created
● Finally, the file can be split into
– A Dictionary or Lexicon file
and
– A Postings file
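The whole pipeline shown on the preceding slides — parse into term entries, sort alphabetically, merge per-document frequencies, split into a lexicon and a postings file — can be sketched compactly. The function name and the in-memory layout are illustrative; real systems write these structures to disk:

```python
# Sketch: build a (lexicon, postings) pair from raw documents,
# mirroring the steps on the slides above.
import re
from collections import Counter

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns (lexicon, postings):
    lexicon[term] = [N docs, Tot Freq, offset into postings],
    postings = flat list of (doc_id, within-doc freq)."""
    entries = []
    for doc_id, text in docs.items():
        counts = Counter(re.findall(r'[a-z]+', text.lower()))
        for term, freq in counts.items():
            entries.append((term, doc_id, freq))
    entries.sort()                        # alphabetical, then by doc id
    lexicon, postings = {}, []
    for term, doc_id, freq in entries:
        if term not in lexicon:
            lexicon[term] = [0, 0, len(postings)]
        lexicon[term][0] += 1             # N docs
        lexicon[term][1] += freq          # Tot Freq
        postings.append((doc_id, freq))
    return lexicon, postings
```

Run on the two example documents ("Now is the time …" and "It was a dark and stormy night …"), it reproduces the slide's numbers: "the" appears in 2 docs with total frequency 4, "country" in 2 docs with total frequency 2.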
How Inverted Files are Created
Dictionary/Lexicon
Postings
Term Doc # Freq
a 2 1
aid 1 1
all 1 1
and 2 1
come 1 1
country 1 1
country 2 1
dark 2 1
for 1 1
good 1 1
in 2 1
is 1 1
it 2 1
manor 2 1
men 1 1
midnight 2 1
night 2 1
now 1 1
of 1 1
past 2 1
stormy 2 1
the 1 2
the 2 2
their 1 1
time 1 1
time 2 1
to 1 2
was 2 2
Doc # Freq
2 1
1 1
1 1
2 1
1 1
1 1
2 1
2 1
1 1
1 1
2 1
1 1
2 1
2 1
1 1
2 1
2 1
1 1
1 1
2 1
2 1
1 2
2 2
1 1
1 1
2 1
1 2
2 2
Term N docs Tot Freq
a 1 1
aid 1 1
all 1 1
and 1 1
come 1 1
country 2 2
dark 1 1
for 1 1
good 1 1
in 1 1
is 1 1
it 1 1
manor 1 1
men 1 1
midnight 1 1
night 1 1
now 1 1
of 1 1
past 1 1
stormy 1 1
the 2 4
their 1 1
time 2 2
to 1 2
was 1 2
Inverted indexes
● Permit fast search for individual terms
● For each term, you get a list consisting of:
– document ID
– frequency of term in doc (optional)
– position of term in doc (optional)
● These lists can be used to solve Boolean queries:
● country -> d1, d2
● manor -> d2
● country AND manor -> d2
● Also used for statistical ranking algorithms
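The "country AND manor" example above is resolved by merge-intersecting the two sorted doc-ID lists — the standard way engines AND two postings lists:

```python
# Sketch: sorted-list intersection for Boolean AND queries.
def intersect(list_a, list_b):
    """Merge-intersect two ascending lists of document IDs."""
    result, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i]); i += 1; j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result

# The slide's example lists:
index = {'country': [1, 2], 'manor': [2]}
```

`intersect(index['country'], index['manor'])` returns `[2]`, i.e. only d2 matches both terms.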
Inverted Indexes for Web Search
Engines
● Inverted indexes are still used, even though
the web is so huge.
● Some systems partition the indexes across
different machines. Each machine handles
different parts of the data.
● Other systems duplicate the data across
many machines; queries are distributed
among the machines.
● Most do a combination of these.
From description of the FAST search engine, by Knut Risvik
In this example, the data
for the pages is
partitioned across
machines. Additionally,
each partition is allocated
multiple machines to
handle the queries.
Each row can handle 120
queries per second
Each column can handle
7M pages
To handle more queries,
add another row.
Cascading Allocation of CPUs
● A variation on this that produces cost
savings:
– Put high-quality/common pages on many
machines
– Put lower quality/less common pages on fewer
machines
– Query goes to high quality machines first
– If no hits found there, go to other machines
The Search Process
Searching – Process Flow
Google Query Evaluation
1. Parse the Query
2. Convert words to WordID
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that
matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek
to the start of the doclist in the full barrel for every word and
go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the
top k.
Queries
● Search engines are one tool used to answer info needs
● Users express their information needs as queries
● Usually informally expressed as two or three words (we
call this a ranked query)
● A recent study showed the mean query length was 2.4
words per query with a median of 2
● Around 48.4% of users submit just one query in a
session, 20.8% submit two, and about 31% submit
three or more
● Less than 5% of queries use Boolean operators (AND,
OR, and NOT), and around 5% contain quoted phrases
Queries...
● About 1.28 million different words were used in queries in
the Excite log studied (which contained 1.03 million
queries)
● Around 75 words account for 9% of all words used in
queries. The top-ten non-trivial words occurring in
531,000 queries are “sex” (10,757), “free” (9,710),
“nude” (7,047), “pictures” (5,939), “university” (4,383),
“pics” (3,815), “chat” (3,515), “adult” (3,385), “women”
(3,211), and “new” (3,109)
● 16.9% of the queries were about entertainment, 16.8%
about sex, pornography, or preferences, and 13.3%
concerned commerce, travel, employment, and the
economy
Answers
What is a good answer to a query?
● One that is relevant to the user’s information need!
● Search engines typically return ten answers per page,
where each answer is a short summary of a web
document
● Likely relevance to an information need is approximated
by statistical similarity between web documents and
the query
● Users favour search engines that have high precision,
that is, those that return relevant answers in the first
page of results
Approximating Relevance
● Statistical similarity is used to estimate the relevance of a
query to an answer
● Consider the query “Richardson Richmond Football”
● A good answer contains all three words, and the more
frequently the better; we call this term frequency (TF)
● Some query terms are more important—have better
discriminating power—than others. For example, an
answer containing only “Richardson” is likely to be
better than an answer containing only “Football”; we
call this inverse document frequency (IDF)
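TF and IDF combine into a single weight per (term, document) pair. A sketch of the classic log-based variant follows; the exact formula differs between engines, and the tiny collection is invented for illustration:

```python
# Sketch: TF x IDF weighting for the query example above.
import math

def tf_idf(term, doc, collection):
    """doc: list of terms; collection: list of such docs."""
    tf = doc.count(term)                       # term frequency in doc
    df = sum(1 for d in collection if term in d)
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(collection) / df)       # inverse document frequency
    return tf * idf
```

A rare term like "Richardson" (appearing in one document) gets a higher weight than "Football" (appearing in all of them, so its IDF is zero), matching the discriminating-power argument above.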
Ranking
To improve the accuracy of search engines:
● Google Inc. uses its patented PageRank(tm) technology.
Google ranks a page higher if it links to pages that are
an authoritative source, and a link from an authoritative
source to a page ranks that page higher
● Relevance feedback is a technique that adds words to a
query based on a user selecting a more like this option
● Query expansion adds words to a query using thesaural or
other techniques
● Searching within categories or groups to narrow a search
Resolving Queries
● Queries are resolved using the inverted index
● Consider the example query “Cat Mat Hat”. This is
evaluated as follows:
– Select a word from the query (say, “Cat”)
– Retrieve the inverted list from disk for the word
– Process the list. For each document the word occurs in, add
weight to an accumulator for that document based on the
TF, IDF, and document length
– Repeat for each word in the query
– Find the best-ranked documents with the highest weights
– Lookup the document in the mapping table
– Retrieve and summarize the docs, and present to the user
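The accumulator loop above can be sketched in miniature. The index maps each term to its inverted list of (doc_id, tf) pairs; weights here use a simple tf × idf, whereas real engines also fold in document length, as the slide notes:

```python
# Sketch: accumulator-based query resolution for "Cat Mat Hat".
import math

def rank(query, index, n_docs, top_k=10):
    """index: term -> list of (doc_id, tf). Returns doc ids by score."""
    accumulators = {}
    for term in query.lower().split():         # one query word at a time
        postings = index.get(term, [])         # inverted list from "disk"
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings:            # add weight per document
            accumulators[doc_id] = accumulators.get(doc_id, 0.0) + tf * idf
    ranked = sorted(accumulators.items(), key=lambda kv: -kv[1])
    return [doc_id for doc_id, score in ranked[:top_k]]
```

The final sort corresponds to the "find the best-ranked documents with the highest weights" step; the mapping-table lookup and summarization are left out of the sketch.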
Fast Search Engines
● Inverted lists are stored in a compressed format. This
allows more information per second to be retrieved
from disk, and it lowers disk head seek times
● As long as decompression is fast, there is a beneficial
trade-off in time
● Documents are stored in a compressed format for the
same reason
● Different compression schemes are used for lists (which
are integers) and documents (which are multimedia,
but mostly text)
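One widely used integer scheme is variable-byte encoding, sketched below over doc-ID gaps (deltas), which keep the integers small. The flag-bit convention here (high bit marks the final byte) is one of several in use:

```python
# Sketch: variable-byte compression of gap-encoded doc-ID lists.
def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]            # low 7 bits first
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80              # flag the final (low) byte
        out.extend(reversed(chunk))   # emit high bytes first
    return bytes(out)

def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:               # final byte of this integer
            numbers.append(n)
            n = 0
    return numbers

def gaps(doc_ids):
    """Turn ascending doc ids into small deltas before encoding."""
    return [b - a for a, b in zip([0] + doc_ids, doc_ids)]
```

Since most gaps fit in one byte, a list like [3, 7, 154, 1000] compresses well below the 4 bytes per entry of fixed 32-bit integers, and decompression is a single cheap pass.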
Fast Search Engines
● Sort disk accesses to minimise disk head movement
when retrieving lists or documents
● Use hash tables in memory to store the vocabulary; avoid
slow hash functions that use modulo
● Pre-calculate and store constants in ranking formulae
● Carefully choose integer compression schemes
● Organise inverted lists so that the information frequently
needed is at the start of the list
● Use heap structures when partial sorting is required
● Develop a query plan for each query
Search Engine Architecture
Search Engine architecture
● The inverted lists are divided amongst a number of servers,
where each is known as a shard
● If an inverted list is required for a particular range of words,
then that shard server is contacted
● Each shard server can be replicated as many times as
required; each server in a shard is identical
● Documents are also divided amongst a number of servers
● Again, if a document is required within a particular range,
then the appropriate document server is contacted
● Each document server can also be replicated as many times
as required
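A toy sketch of that routing: a word maps to a shard by range, and one of the shard's identical replicas is then picked round-robin. The boundaries, replica counts, and server names are invented for illustration:

```python
# Sketch: range-based shard routing with replicated shard servers.
import bisect

SHARD_BOUNDARIES = ['h', 'p']   # shard 0: a-g, shard 1: h-o, shard 2: p-z

def shard_for(word):
    """Pick the shard whose word range contains this word."""
    return bisect.bisect_right(SHARD_BOUNDARIES, word[0].lower())

REPLICAS = {0: ['shard0-a', 'shard0-b'],   # identical servers per shard
            1: ['shard1-a'],
            2: ['shard2-a']}

def server_for(word, request_id):
    """Route: shard by word range, then replica round-robin."""
    replicas = REPLICAS[shard_for(word)]
    return replicas[request_id % len(replicas)]
```

Document servers follow the same pattern with doc-ID ranges instead of word ranges; adding replicas to a shard raises its query throughput without touching the data layout.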
Google, Case Study
Google Architecture
Components
● URL Server: Bunch of URLs (white-list)
● Crawler: Fetch the page
● Store Server: To store the fetched pages
● Repository: Compressed pages are put here
● Every unique page has a DocID
● Anchor: Page transition [to, from] information
● URLResolver: Relative URL to Absolute URL
● Lexicon: list of known words
Indexer
● Parses the document
● Builds a word-frequency table {word, position, font,
capitalization} [hits]
● Pushes the hits to barrels as a partially sorted
forward index
● Identifies anchors (page-transition-out info)
Searcher
● Forward Index to Inverted Index
– Maps keywords to DocIds
– DocIds mapped to URLs
● Reranker
– Uses anchor information to rank the pages for
the given query keyword.
– Rule of thumb: fan-in increases page rank
Reranking
What about Ranking?
● Lots of variation here
– Often messy; details proprietary and fluctuating
● Combining subsets of:
– IR-style relevance: Based on term frequencies,
proximities, position (e.g., in title), font, etc.
– Popularity information
– Link analysis information
● Most use a variant of vector space ranking to
combine these. Here’s how it might work:
– Make a vector of weights for each feature
– Multiply this by the counts for each feature
Relevance: Going Beyond IR
● Page “popularity” (e.g., DirectHit)
– Frequently visited pages (in general)
– Frequently visited pages as a result of a query
● Link “co-citation” (e.g., Google)
– Which sites are linked to by other sites?
– Draws upon sociology research on
bibliographic citations to identify “authoritative
sources”
Link Analysis for Ranking Pages
● Assumption: If the pages pointing to this
page are good, then this is also a good
page.
● References: Kleinberg 98, Page et al. 98
● Draws upon earlier research in sociology
and bibliometrics.
– Kleinberg’s model includes “authorities” (highly
referenced pages) and “hubs” (pages
containing good reference lists).
– Google model is a version with no hubs, and is
closely related to work on influence weights by
Pinski-Narin (1976).
Link Analysis for Ranking Pages
● Why does this work?
– The official Toyota site will be linked to by lots
of other official (or high-quality) sites
– The best Toyota fan-club site probably also has
many links pointing to it
– Less high-quality sites do not have as many
high-quality sites linking to them
PageRank
● Let A1, A2, …, An be the pages that point
to page A. Let C(P) be the # links out of
page P. The PageRank (PR) of page A is
defined as:
PR(A) = (1-d) + d ( PR(A1)/C(A1) + … + PR(An)/C(An) )
● PageRank is the principal eigenvector of the
link matrix of the web.
● Can be computed as the fixpoint of the
above equation.
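The fixpoint can be found by simple iteration. Below is a sketch on a tiny invented graph, using the formula exactly as written (with d = 0.85, the damping value reported in the Google paper) and assuming every page has at least one outlink:

```python
# Sketch: iterate the PageRank equation to its fixpoint.
def pagerank(links, d=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: PR}.
    Assumes every page has at least one outlink."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {}
        for page in pages:
            # sum PR(Ai)/C(Ai) over pages Ai that point to `page`
            incoming = sum(pr[q] / len(links[q])
                           for q in pages if page in links[q])
            new[page] = (1 - d) + d * incoming
        pr = new
    return pr
```

On a graph where two pages both link to a third, the third page ends up with the highest rank — links from elsewhere raise a page's score, exactly the fan-in intuition above. (As written, the ranks sum to the number of pages; dividing by N gives the probability view of the next slide.)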
PageRank: User Model
● PageRanks form a probability distribution over web pages:
sum of all pages’ ranks is one.
● User model: “Random surfer” selects a page, keeps
clicking links (never “back”), until “bored”: then randomly
selects another page and continues.
– PageRank(A) is the probability that such a user visits A
– d is the probability of getting bored at a page
● Google computes relevance of a page for a given search
by first computing an IR relevance and then modifying that
by taking into account PageRank for the top pages.
Search Engine Optimization
How Search Engines Rank Pages
● Location, Location, Location... and Frequency
● Tags (<title>, <meta>, <b>, top of the page)
● How close words (from the query) are to each other on the
website
● Quality of links going to and from a page
● Penalization for “spamming”, when a word is repeated
hundreds of times on a page to increase the frequency and
propel the page higher in the listings
● Off-the-page ranking criteria:
● By analyzing how pages link to each other.
Why do results differ ?
● Some search engines index more web pages
than others.
● Some search engines also index web pages
more often than others.
● The result is that no search engine has the
exact same collection of web pages to
search through.
● Different algorithms to compute relevance of
the page to a particular query
Search Engine Placement Tips
● Why is it important to be on the first page of the
results?
– Most users do not go beyond the first page.
● How to optimize your website?
– Pick your target keywords: How do you think people will
search for your web page? The words you imagine them
typing into the search box are your target keywords.
– Pick target words differently for each page on your website.
– Your target keywords should always be at least two or more
words long.
Position your Keywords
● Make sure your target keywords appear in the
crucial locations on your web pages. The page's
HTML <title> tag is most important.
● The titles should be relatively short and attractive.
Several phrases are enough for the description.
● Search engines also like pages where keywords
appear "high" on the page: headline, first
paragraphs of your web page.
● Keep in mind that tables and large JavaScript
sections can make your keywords less relevant
because they appear lower on the page.
Have Relevant Content
● Keywords need to be reflected in the page's content.
● Put more text than graphics on a page
● Don't use frames
● Use the <ALT….> tag
● Make good use of <TITLE> and <H1>
● Consider using the <META> tag
● Get people to link to your page
Hiding Web pages
● You may wish to have web pages that are not
indexed (for example, test pages).
● It is also possible to hide web content from robots,
using the Robots.txt file and the robots meta tag.
● Not all crawlers will obey this, so this is not foolproof.
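The standard library can evaluate a Robots.txt policy directly; the rules string below is an invented example disallowing a /test/ area, and `may_fetch` is a hypothetical helper name:

```python
# Sketch: check a Robots.txt policy before crawling a URL.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /test/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

def may_fetch(url, agent='*'):
    """True if the given user agent is allowed to crawl the URL."""
    return parser.can_fetch(agent, url)
```

A polite crawler calls such a check before every fetch — but as noted above, nothing forces a crawler to obey, so Robots.txt is a convention, not protection.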
Submitting To Search Engines
 Search engines should find you naturally, but
submitting helps speed the process and can
increase your representation
 Look for Add URL link at bottom of home page
 Submit your home page and a few key
“section” pages
 Turnaround from a few days to 2 months
Deep Crawlers
 AltaVista, Inktomi, Northern Light will add the
most, usually within a month
 Excite, Go (Infoseek) will gather a fair amount;
Lycos gathers little
 Index sizes are going up, but the web is
outpacing them…nor is size everything
 Here are more actions to help even the odds…
“Deep” Submit
 A “deep” submit is directly submitting pages from
“inside” the web site – can help improve the odds
these will get listed.
 At Go, you can email hundreds of URLs. Consider
doing this.
 At HotBot/Inktomi, you can submit up to 50 pages per
day. Possibly worth doing.
 At AltaVista, you can submit up to 5 pages per day.
Probably not worth the effort.
 Elsewhere, not worth doing a “deep” submit.
Big Site? Split It Up
 Expect search engines to max out at around
500 pages from any particular site
 Increase representation by subdividing large
sites logically into subdomains
 Search engines will crawl each subsite to more
depth
 Here’s an example...
Subdomains vs. Subdirectories
Instead of           Do this
gold.ac.uk/science/  science.gold.ac.uk
gold.ac.uk/english/  english.gold.ac.uk
gold.ac.uk/admin/    admin.gold.ac.uk
I Was Framed
 Don't use them. Period.
 If you do use them, search engines will have
difficulty crawling your site.
Dynamic Roadblocks
 Dynamic delivery systems that use ? symbols in the
URL string prevent search engines from getting to
your pages
 http://www.nike.com/ObjectBuilder/ObjectBuilder.iwx?ProcessName=IndexPage&Section_Id=17200&NewApplication=t
 Eliminate the ? symbol, and your life will be rosy
 Look for workarounds, such as Apache rewrite or
Cold Fusion alternatives
 Before you move to a dynamic delivery system,
check out any potential problems.
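One common workaround of the kind mentioned above is an Apache mod_rewrite rule that exposes clean, crawlable paths and maps them back to the dynamic handler. The path and parameter names below are illustrative only, loosely modeled on the example URL:

```apache
RewriteEngine On
# Map a clean URL like /section/17200/ back to the dynamic script,
# so crawlers never see the "?" in the links you publish.
RewriteRule ^section/([0-9]+)/?$ /ObjectBuilder/ObjectBuilder.iwx?ProcessName=IndexPage&Section_Id=$1 [L]
```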
How Directories Work
 Editors find sites, describe them, put them in a
category
 Site owners can also submit to be listed
 A short description represents the entire web
site
 Usually has secondary results from a crawler-based search engine
The Major Directories
 Yahoo
 The Open Directory
 (Netscape, Lycos, AOL Search, others)
 LookSmart
 UK Plus
 Snap
Submitting To Directories
 Directories probably won't find you or may list you
badly unless you submit
 Find the right category (more in a moment), then use
Add URL link at top or bottom of page
 Write down who submitted (and email address), when
submitted, which category submitted to and other
details
 You’ll need this info for the inevitable resubmission attempt
– it will save you time.
Submitting To Directories
 Take your time and submit to these right
 Write 3 descriptions: 15, 20 and 25 words long, which
incorporate your key terms
 Search for the most important term you want to be
found for and submit to first category that's listed
which seems appropriate for your site
 Be sure to note the contact name and email address
you provided on the submit form
 If you don't get in, keep trying
Subdomain Advantage
 Directories tend not to list subsections of a
web site.
 In contrast, they do tend to see subdomains as
independent web sites deserving their own
listings
 So, another reason to go with subdomains
over subdirectories
How to Search ?
What do we search ?
● Information
● Reviews, news
● Advice, methods
● Bugs
● Education stuff
● Examples:
– Access Violation 0xC0000005
– Search Engine ppt
Main Steps
● Make a decision about the search
● Formulate a topic. Define a type of resources that
you are looking for
● Find relevant words for description
● Find websites with information
● Choose the best out of them
● Feedback: How did you search?
Main Problems
Why is it difficult to search?
– Know the problem, don’t know what to look for
– Lose focus (go to interesting but non-relevant
sites)
– Perform superficial (shallow) search
– Search Spam
Typical Problems
● Links are often out of date
● Usually too many links are returned
● Returned links are not very relevant
● The Engines don't know about enough pages
● Different engines return different results
● Political bias
Typical Mistakes
● Unnecessary words in a query
● Unsuitable choice of keywords
● Not enough flexibility in changing keywords (or search engines)
● Poorly dividing the time devoted to searching and to
evaluating the search results
● “Your search did not match any documents. ” – Bad
Query!
Search Tricks
What can we search for?
● Thematic resource (http://www.topicmaps.org)
● Community
● Collection of articles
● Forum
● Catalog of resources, links
● File (file types)
● Encyclopedia article
● Digital library
● Contact information (i.e. email)
Improving Query Results
● To look for a particular page use an unusual phrase
you know is on that page
● Use phrase queries where possible
● Check your spelling!
● Progressively use more terms
● If you don't find what you want, use another Search
Engine!
Useful words
● download
● pdf, ppt, doc, zip, mp3
● forum, directory, links
● faq, for newbies, for beginners, guide, rules,
checklist
● lecture notes, survey, tutorials
● how, where, correct, howto
● Copy-pasting the exact error message
● Have you tried http://del.icio.us/ ?
Search Engine Features
Features
● Indexing features
● Search features
● Results display
● Costs, licensing and registration requirements
● Unique features (if any)
Indexing Features
● File/document formats supported: HTML, ASCII, PDF, SQL, spreadsheets,
WYSIWYG formats (MS-Word, WP, etc.)
● Indexing level support: File/directory level, multi-record files
● Standard formats recognized: MARC, Medline, etc
● Customization of document formats
– Stemming: If yes, is this an optional or mandatory feature?
– Stop words support: If yes, is this an optional or mandatory
feature?
Searching Features
● Boolean Searching: Use of Boolean operators AND, OR and NOT as
search term connectors
● Natural Language: Allows users to enter the query in natural language
● Phrase: Users can search for exact phrase
● Truncation/wild card: Variations of search terms and plural forms can
be searched
● Exact match: Allows users to search for terms exactly as they are entered
● Duplicate detection: Remove duplicate records from the retrieved
records
● Proximity: With connectors such as WITH, NEAR, and ADJ, one can
specify the position of a search term relative to others
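The Boolean connectors above can be made concrete with a toy inverted index; the documents are invented:

```python
# Minimal sketch of Boolean searching over an inverted index (toy data).
docs = {
    1: "search engines index the web",
    2: "boolean operators combine search terms",
    3: "the web grows faster than any index",
}

# Build the inverted index: term -> set of document ids.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def AND(a, b):
    # Documents containing both terms.
    return index.get(a, set()) & index.get(b, set())

def OR(a, b):
    # Documents containing either term.
    return index.get(a, set()) | index.get(b, set())

def NOT(a, b):
    # Documents containing the first term but not the second.
    return index.get(a, set()) - index.get(b, set())

print(AND("search", "index"))   # {1}
print(NOT("index", "search"))   # {3}
```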
Searching Features
● Field Searching: Query for a specific field value in the database
● Thesaurus searching: Search for Broader or Narrower or Related terms or
Related concepts
● Query by example: Enables users to search for similar documents
● Soundex searching: Search for records whose spelling sounds similar to the search term
● Relevance ranking: Ranking the retrieved records in some order
● Search set manipulation: Saving the search results as sets and allowing users to
view search history
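Relevance ranking, as listed above, can be illustrated with a toy term-frequency scorer; the documents and query are invented:

```python
from collections import Counter

# Toy relevance ranking: score each document by how often the query terms occur in it.
docs = {
    "d1": "cheap flights cheap hotels",
    "d2": "flights to cheap destinations",
    "d3": "hotel reviews",
}

def rank(query):
    terms = query.split()
    scores = {}
    for name, text in docs.items():
        tf = Counter(text.split())          # term frequencies in this document
        scores[name] = sum(tf[t] for t in terms)
    # Highest-scoring documents first.
    return sorted(scores, key=scores.get, reverse=True)

print(rank("cheap flights"))  # ['d1', 'd2', 'd3']
```

Real engines combine many more signals (fields, link popularity, proximity), but the shape of the computation is the same: score every candidate record, then sort.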
Results Display
● Formats supported: Can it display in native format or just HTML; Display in
different formats, Display number of records retrieved
● Relevancy ranking: If the retrieved records are ranked, how the relevance score
is indicated
● Keyword-in-context: KWIC or highlighting of matching search terms
● Customization of results display: allow users to select different display formats
● Saving options: Saving in different formats; number of records that can be saved
at a time
Evaluation of Search Engines
CRITICAL EVALUATION
Why Evaluate What You Find on the Web?
● Anyone can put up a Web page
– about anything
● Many pages not kept up-to-date
● No quality control
– most sites not “peer-reviewed”
● less trustworthy than scholarly publications
– no selection guidelines for search engines
Web Evaluation Techniques
Before you click to view the page...
● Look at the URL - personal page or site ?
~ or % or users or members
● Domain name appropriate for the content ?
edu, com, org, net, gov, ca.us, uk, etc.
● Published by an entity that makes sense ?
● News from its source?
www.nytimes.com
● Advice from valid agency?
www.nih.gov/
www.nlm.nih.gov/
www.nimh.nih.gov/
Web Evaluation Techniques
Scan the perimeter of the page
● Can you tell who wrote it ?
● name of page author
● organization, institution, agency you recognize
● e-mail contact by itself not enough
● Credentials for the subject matter ?
– Look for links to:
“About us” “Philosophy” “Background” “Biography”
● Is it recent or current enough ?
● Look for “last updated” date - usually at bottom
● If no links or other clues...
● truncate back the URL
http://hs.houstonisd.org/hspva/academic/Science/Thinkquest/gail/text/ethics.html
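"Truncating back the URL" just means dropping path segments one at a time until you reach the host; a small sketch with a made-up URL:

```python
from urllib.parse import urlsplit

def truncations(url):
    # Drop the last path segment repeatedly until only the host is left.
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    base = parts.scheme + "://" + parts.netloc
    out = []
    while segments:
        segments.pop()
        out.append(base + "/" + "/".join(segments) + ("/" if segments else ""))
    return out

for u in truncations("http://example.org/a/b/page.html"):
    print(u)
# http://example.org/a/b/
# http://example.org/a/
# http://example.org/
```

Visiting each truncated URL in turn often reveals the page's parent site, author, or sponsoring organization.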
Web Evaluation Techniques
Indicators of quality
● Sources documented
● links, footnotes, etc.
– As detailed as you expect in print publications ?
● do the links work ?
● Information retyped or forged
● why not a link to published version instead ?
● Links to other resources
● biased, slanted ?
Web Evaluation Techniques
What Do Others Say ?
● Search the URL in alexa.com
– Who links to the site? Who owns the domain?
– Type or paste the URL into the basic search box
– Traffic for top 100,000 sites
● See what links are in Google’s Similar pages
● Look up the page author in Google
Web Evaluation Techniques
STEP BACK & ASK: Does it all add up ?
● Why was the page put on the Web ?
● inform with facts and data?
● explain, persuade?
● sell, entice?
● share, disclose?
● as a parody or satire?
● Is it appropriate for your purpose?
Try evaluating some sites...
Search a controversial topic in Google:
 "nuclear armageddon"
 prions danger
 “stem cells” abortion
Scan the first two pages of results
Visit one or two sites
 try to evaluate their quality and reliability
Ufff, The End
● Have you learned something today ?
● Try whatever we've discussed today!
● If you need help, let me know at
sudarsun@gmail.com

More Related Content

What's hot

google search engine
google search enginegoogle search engine
google search engineway2go
 
Web search engines ( Mr.Mirza )
Web search engines ( Mr.Mirza )Web search engines ( Mr.Mirza )
Web search engines ( Mr.Mirza )Ali Saif Mirza
 
Search Engine Powerpoint
Search Engine PowerpointSearch Engine Powerpoint
Search Engine Powerpoint201014161
 
Searching the Web
Searching the WebSearching the Web
Searching the Webcshieh
 
Search engines and its types
Search engines and its typesSearch engines and its types
Search engines and its typesNagarjuna Kalluru
 
Search Engines Presentation
Search Engines PresentationSearch Engines Presentation
Search Engines PresentationJSCHO9
 
Introduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalIntroduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalA. LE
 
Learn the Search Engine Type and Its Functions!
Learn the Search Engine Type and Its Functions!Learn the Search Engine Type and Its Functions!
Learn the Search Engine Type and Its Functions!aashokkr
 
Working of search engine
Working of search engineWorking of search engine
Working of search engineNikhil Deswal
 

What's hot (20)

google search engine
google search enginegoogle search engine
google search engine
 
Web search engines ( Mr.Mirza )
Web search engines ( Mr.Mirza )Web search engines ( Mr.Mirza )
Web search engines ( Mr.Mirza )
 
Search engine
Search engineSearch engine
Search engine
 
Search Engine Powerpoint
Search Engine PowerpointSearch Engine Powerpoint
Search Engine Powerpoint
 
search engines
search enginessearch engines
search engines
 
Search engine
Search engineSearch engine
Search engine
 
Search engines
Search enginesSearch engines
Search engines
 
Searching the Web
Searching the WebSearching the Web
Searching the Web
 
Search engine
Search engineSearch engine
Search engine
 
Search Engines
Search EnginesSearch Engines
Search Engines
 
Search engines and its types
Search engines and its typesSearch engines and its types
Search engines and its types
 
Search Engines Presentation
Search Engines PresentationSearch Engines Presentation
Search Engines Presentation
 
Introduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalIntroduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information Retrieval
 
Learn the Search Engine Type and Its Functions!
Learn the Search Engine Type and Its Functions!Learn the Search Engine Type and Its Functions!
Learn the Search Engine Type and Its Functions!
 
Search engine
Search engineSearch engine
Search engine
 
Search engine
Search engineSearch engine
Search engine
 
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
 
Search engines
Search enginesSearch engines
Search engines
 
Search Engine
Search EngineSearch Engine
Search Engine
 
Working of search engine
Working of search engineWorking of search engine
Working of search engine
 

Similar to Search Engine Demystified

How google works and functions: A complete Approach
How google works and functions: A complete ApproachHow google works and functions: A complete Approach
How google works and functions: A complete ApproachPrakhar Gethe
 
Introduction to Search Engine Optimization
Introduction to Search Engine OptimizationIntroduction to Search Engine Optimization
Introduction to Search Engine OptimizationGauravPrajapati39
 
Lost in the Net? Navigating Search Engines
Lost in the Net?  Navigating Search EnginesLost in the Net?  Navigating Search Engines
Lost in the Net? Navigating Search EnginesJohan Koren
 
searchengineppt-171025105119 (1).docx
searchengineppt-171025105119 (1).docxsearchengineppt-171025105119 (1).docx
searchengineppt-171025105119 (1).docxNiteshRaj48
 
Google Search Engine
Google Search Engine Google Search Engine
Google Search Engine Aniket_1415
 
Lost in the Net: Navigating Search Engines
Lost in the Net:  Navigating Search EnginesLost in the Net:  Navigating Search Engines
Lost in the Net: Navigating Search EnginesJohan Koren
 
Lost in the net: Navigating search engines
Lost in the net:  Navigating search enginesLost in the net:  Navigating search engines
Lost in the net: Navigating search enginesJohan Koren
 
Week 9 10 ppt-how_searchworks
Week 9 10 ppt-how_searchworksWeek 9 10 ppt-how_searchworks
Week 9 10 ppt-how_searchworkscarolyn oldham
 
Navigating Semantic Search
Navigating Semantic SearchNavigating Semantic Search
Navigating Semantic SearchMonster
 
Week 12 how searchenginessearch
Week 12 how searchenginessearchWeek 12 how searchenginessearch
Week 12 how searchenginessearchcarolyn oldham
 
Search engines by Gulshan K Maheshwari(QAU)
Search engines by Gulshan  K Maheshwari(QAU)Search engines by Gulshan  K Maheshwari(QAU)
Search engines by Gulshan K Maheshwari(QAU)GulshanKumar368
 
Search Engines Other than Google
Search Engines Other than GoogleSearch Engines Other than Google
Search Engines Other than GoogleDr Trivedi
 
Tech Headline - SEO For Non-Technical Professionals
Tech Headline - SEO For Non-Technical ProfessionalsTech Headline - SEO For Non-Technical Professionals
Tech Headline - SEO For Non-Technical ProfessionalsRodrigo Castilho
 
Digital marketing
Digital marketing Digital marketing
Digital marketing M Manas
 

Similar to Search Engine Demystified (20)

Search Engine
Search EngineSearch Engine
Search Engine
 
How google works and functions: A complete Approach
How google works and functions: A complete ApproachHow google works and functions: A complete Approach
How google works and functions: A complete Approach
 
Search Engines
Search EnginesSearch Engines
Search Engines
 
Introduction to Search Engine Optimization
Introduction to Search Engine OptimizationIntroduction to Search Engine Optimization
Introduction to Search Engine Optimization
 
Search engine
Search engineSearch engine
Search engine
 
Lost in the Net? Navigating Search Engines
Lost in the Net?  Navigating Search EnginesLost in the Net?  Navigating Search Engines
Lost in the Net? Navigating Search Engines
 
searchengineppt-171025105119 (1).docx
searchengineppt-171025105119 (1).docxsearchengineppt-171025105119 (1).docx
searchengineppt-171025105119 (1).docx
 
Google Search Engine
Google Search Engine Google Search Engine
Google Search Engine
 
Lost in the Net: Navigating Search Engines
Lost in the Net:  Navigating Search EnginesLost in the Net:  Navigating Search Engines
Lost in the Net: Navigating Search Engines
 
Lost in the net: Navigating search engines
Lost in the net:  Navigating search enginesLost in the net:  Navigating search engines
Lost in the net: Navigating search engines
 
Search engine
Search engineSearch engine
Search engine
 
Search engine ppt
Search engine pptSearch engine ppt
Search engine ppt
 
Week 9 10 ppt-how_searchworks
Week 9 10 ppt-how_searchworksWeek 9 10 ppt-how_searchworks
Week 9 10 ppt-how_searchworks
 
Search engine
Search engineSearch engine
Search engine
 
Navigating Semantic Search
Navigating Semantic SearchNavigating Semantic Search
Navigating Semantic Search
 
Week 12 how searchenginessearch
Week 12 how searchenginessearchWeek 12 how searchenginessearch
Week 12 how searchenginessearch
 
Search engines by Gulshan K Maheshwari(QAU)
Search engines by Gulshan  K Maheshwari(QAU)Search engines by Gulshan  K Maheshwari(QAU)
Search engines by Gulshan K Maheshwari(QAU)
 
Search Engines Other than Google
Search Engines Other than GoogleSearch Engines Other than Google
Search Engines Other than Google
 
Tech Headline - SEO For Non-Technical Professionals
Tech Headline - SEO For Non-Technical ProfessionalsTech Headline - SEO For Non-Technical Professionals
Tech Headline - SEO For Non-Technical Professionals
 
Digital marketing
Digital marketing Digital marketing
Digital marketing
 

More from Sudarsun Santhiappan

More from Sudarsun Santhiappan (13)

Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine Learning
 
Software Patterns
Software PatternsSoftware Patterns
Software Patterns
 
Distributed Computing
Distributed ComputingDistributed Computing
Distributed Computing
 
Essentials for a Budding IT professional
Essentials for a Budding IT professionalEssentials for a Budding IT professional
Essentials for a Budding IT professional
 
What it takes to be the Best IT Trainer
What it takes to be the Best IT TrainerWhat it takes to be the Best IT Trainer
What it takes to be the Best IT Trainer
 
Using Behavioral Patterns In Treating Autistic
Using Behavioral Patterns In Treating AutisticUsing Behavioral Patterns In Treating Autistic
Using Behavioral Patterns In Treating Autistic
 
Topic Models Based Personalized Spam Filter
Topic Models Based Personalized Spam FilterTopic Models Based Personalized Spam Filter
Topic Models Based Personalized Spam Filter
 
Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information Retrieval
 
Audio And Video Over Internet
Audio And Video Over InternetAudio And Video Over Internet
Audio And Video Over Internet
 
Practical Network Security
Practical Network SecurityPractical Network Security
Practical Network Security
 
How To Do A Project
How To Do A ProjectHow To Do A Project
How To Do A Project
 
Ontology
OntologyOntology
Ontology
 
Object Oriented Design
Object Oriented DesignObject Oriented Design
Object Oriented Design
 

Recently uploaded

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 

Recently uploaded (20)

"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 

Search Engine Demystified

  • 1. Search Engines Sudarsun Santhiappan., M.Tech., Director – R & D, Burning Glass Technologies
  • 2. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 2 Today's Coverage ● Introduction ● Types of Search Engines ● Components of a Search Engine ● Semantics and Relevancy ● Search Engine Optimization
  • 3. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 3 What is a Search Engine ? ● What is a Search ? ● Why do we need a Search Engine ? ● What are we searching against ? ● How good is a Search Engine ? ● What is Search on Search (Meta SE) ? ● Compared Search Engines Side-by-Side ? ● How are Images and Videos searched ? ● Apart from Web Search, what else ?
  • 4. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 4 Introduction ● Web Search Engine is a software program that searches the Internet (bunch of websites) based on the words that you designate as search terms (query words). ● Search engines look through their own databases of information in order to find what it is that you are looking for. ● Web Search Engines are a good example for massively sized Information Retrieval Systems. – Tried “Similar pages” Link in Google result set ?
  • 5. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 5 Dictionary Definitions Search COMPUTING (transitive verb) to examine a computer file, disk, database, or network for particular information Engine something that supplies the driving force or energy to a movement, system, or trend Search Engine a computer program that searches for particular keywords and returns a list of documents in which they were found, especially a commercial service that scans documents on the Internet
  • 6. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 6 About definition of search engines ● oh well … search engines do not search only for keywords, some search for other stuff as well ● and they are really not “engines” in the classical sense – but then mouse is not a “mouse”
  • 7. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 7 use of search engines … among others
  • 8. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 8 Types of Search Engines ● Text Search Engines – General: AltaVista, AskJeeves, Bing, Google – Specialized: Google Scholar, Scirus, Citeseer ● Intranet vs Internet Search Engines ● Image Search Engines – How can we search on the Image content ? ● Video Search Engines – Image Search with Time dimension !!
  • 9. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 9 Types of Search Engine ● Crawler Powered Indexes – Guruji.com, Google.com ● Human Powered Indexes – www.dmoz.org ● Hybrid Models – Submitted URLs to a search engine ? ● Semantic Indexes – Hakia.com,
  • 10. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 10 Have you tried Hakia ? ● What is Semantic Search ? ● How's it different from Keyword Search? ● What is categorized search ? ● Side-by-Side comparison with Google!! ● Have you compared Bing with Google ?
  • 13. Directories ● www.dmoz.org ● Websites classified into a taxonomy ● Websites are categorically arranged ● Searching vs Navigation ● Instead of a query, you click and navigate ● Accurate search always! (if data is available) ● Problem: mostly manually created
  • 16. How does a Search Engine work?
  • 17. How Search Engines Work (Sherman 2003) [diagram: the Web (URL1 … URL4) → Crawler → Indexer → Search Engine Database; the query “Eggs?” from your browser returns ranked matches: Eggs 90%, Eggo 81%, Ego 40%, Huh? 10%, e.g. “All About Eggs by S. I. Am”]
  • 18. How do search engines work? Elaboration • crawlers, spiders: go out to find content – in various ways they go through the web looking for new & changed sites – periodically, not for each query ● no search engine works in real time – some search engines do the crawling themselves, others do not ● they buy content from companies such as Inktomi – for a number of reasons crawlers do not cover all of the web – just a fraction – what is not covered is the “invisible web”
  • 19. Elaboration … • organizing content: labeling, arranging – indexing for searching – automatic ● keywords and other fields ● arranging by URL popularity – PageRank, as in Google – classifying as a directory ● mostly human-handpicked & classified ● as a result of different organization we have basically two kinds of search engines: ● search – input is a query that is searched & displayed ● directory – classified content – a class is displayed ● and fused: directories have search capabilities & vice versa
  • 20. Elaboration (cont.) • databases, caches: storing content – humongous files, usually distributed over many computers • query processor: searching, retrieval, display – takes your query as input ● engines have differing rules on how queries are handled – displays ranked output ● some engines also cluster output and provide visualization ● some engines provide categorically structured results ● at the other end is your browser
  • 21. Similarities & Differences ● All search engines have these basic parts in common ● BUT the actual processes – the methods they use – are based on various algorithms, and they differ significantly – most are proprietary (patented), with details kept mostly secret (or protected) but based on well-known principles from information retrieval or classification – to some extent Google is an exception – they published their method
  • 22. Google Search ● In the beginning it ran on Stanford computers ● The basic approach has been described in their famous paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine” – well written, simple language, has their pictures – in the acknowledgements they cite the support of NSF’s Digital Library Initiative, i.e. initially, Google came out of government-sponsored research – they describe their method PageRank – based on ranking hyperlinks as in citation indexing – “We chose our system name, Google, because it is a common spelling of googol, or ten to the hundredth power”
  • 23. Coverage differences ● no engine covers more than a fraction of the WWW – estimates: none more than 16% – hard (even impossible) to discern & compare coverage, but they differ substantially in what they cover ● in addition: – many national search engines ● own coverage, orientation, governance – many specialized or domain search engines ● own coverage geared to the subject of interest – many comprehensive sources independent of search engines ● some have compilations of evaluated web sources
  • 24. Searching differences ● substantial differences among search engines in searching, retrieval & display – need to know how they work & differ in respect to ● defaults in searching a query ● searching of phrases, case sensitivity, categories ● searching of different fields, formats, types of resources ● advanced search capabilities and features ● possibilities for refinement, using relevance feedback ● display options ● personalization options
  • 27. Limitations ● every search engine has limitations as to – coverage: meta engines just inherit these coverage limitations & add more of their own – search capabilities – finding quality information ● some have compromised search with economics – becoming little more than advertisers ● but search engines are also many times victims of spamdexing – affecting what is included and how it is ranked
  • 28. Spamming a search engine ● use of techniques that push rankings higher than they belong is also called spamdexing – methods typically include textual as well as link-based techniques – like e-mail spam, search engine spam is a form of adversarial information retrieval ● the conflicting goals: accurate results for search providers & high positioning for content page rank
  • 29. Meta Search Engines: Search on Search
  • 30. Meta search engines ● meta engines search multiple engines – getting combined results from a variety of engines ● they do not have their own databases – but have their own business models, which affect results ● a number of techniques used – interesting ones: clustering, statistical analysis
  • 31. Some meta engines – with organized results Dogpile: results from a number of leading search engines; gives the source, so overlap can be compared (also has a (bad) joke of the day) Surfwax: gives statistics and text sources & linking to sources; for some terms gives related terms to focus Teoma: results with suggestions for narrowing; links to derived resources; originated at Rutgers Turbo10: provides results in clusters; the engines searched can be edited
  • 34. Some Meta Engines (cont.) ● Large directory – Complete Planet – directory of over 70,000 databases & specialty engines ● Results with graphical displays – Vivisimo clusters results; innovative – Webbrain results in tree structure – fun to use – Kartoo results in display by topics of query
  • 35. Domain Specific Search Engines
  • 36. Domain Search Engines & Catalogs ● cover specific subjects & topics ● important tool for subject searches – particularly for subject specialists – valued by professional searchers ● selection mostly hand-picked rather than by crawlers, following inclusion criteria – often not readily discernible – but content more trustworthy
  • 37. Domain Search Engines … Open Directory Project: large edited catalog of the web – global, run by volunteers BUBL LINK: selected Internet resources covering all academic subject areas; organized by the Dewey Decimal System – from the UK Profusion: search in categories for resources & search engines Resource Discovery Network – UK: “UK's free national gateway to Internet resources for the learning, teaching and research community”
  • 38. Domain Engines … sample Think Quest – Oracle Education Foundation • education resources, programs; web sites created by students All Music Guide • resource about musicians, albums, and songs Internet Movie Database • treasure trove of American and British movies Genealogy links and surname search engines • well… that is getting really specialized (and popular) Daypop • searches the “living web”: “The living web is composed of sites that update on a daily basis: newspapers, online magazines, and weblogs”
  • 39. Science & scholarship engines … sample Psychcrawler – American Psychological Association: web index for psychology Entrez PubMed – National Library of Medicine: biomedical literature from MEDLINE & health journals CiteSeer – NEC Research Center: scientific literature, citation index; strong in computer science Google Scholar: searches for scholarly articles & resources Infomine: scholarly internet research collections Scirus: scientific information in journals & on the web
  • 40. Science & scholarship engines … commercial access ● in addition to freely accessible engines, many provide search free but access to full text is paid – by subscription or per item – RUL provides access to these & many more: ScienceDirect – Elsevier: “world's largest electronic collection of science, technology and medicine full text and bibliographic information” ACM Portal – Association for Computing Machinery: access to the ACM Digital Library & Guide to Computing
  • 41. Search Engine Internals
  • 42. Search Engine Internals ● Crawlers ● Indexers ● Searching ● Semantics ● Ranking
  • 43. Standard Web Search Engine Architecture [diagram: crawl the web → check for duplicates, store the documents → create an inverted index → search engine servers answer the user query against the inverted index and return DocIds → show results to user]
  • 44. Typical Search Engine
  • 47. Crawlers ● What is crawling? ● How does crawling happen? ● Have you tried “wget -r <url>” on Linux? ● Have you tried “DAP” to download an entire site? ● Page walk ● Spidering & crawl bots
  • 50. Spidering the Web ● Replicating the spider's behavior of building the web by adding spirals (sites) ● But can the web be fully crawled? ● By the time one round of indexing is over, the page might have changed already! ● That's why we have the cached-page link in the search result!
  • 52. Crawler Bots ● How to make your website crawlable? ● White-listing and black-listing! ● Meta tags to control the bots ● Can HTTPS pages be crawled? ● Are sessions maintained while crawling? ● Can dynamic pages be crawled? ● URL normalization – cool.com?page=2 [crawler-unfriendly] – cool.com/page/2 [normalized and crawler-friendly]
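The URL normalization idea above can be sketched in a few lines (a hedged illustration only: the rewrite rule, the `cool.com` URLs, and the `normalize` helper are hypothetical, not part of any real crawler API):

```python
from urllib.parse import urlparse, parse_qs

def normalize(url):
    """Rewrite ?key=value query URLs into crawler-friendly /key/value paths."""
    parts = urlparse(url)
    params = parse_qs(parts.query)
    path = parts.path.rstrip("/")
    # Sort parameters so the same page always yields one canonical URL.
    for key, values in sorted(params.items()):
        path += f"/{key}/{values[0]}"
    return f"{parts.scheme}://{parts.netloc}{path}"
```

Sorting the parameters is deliberate: it gives every dynamic page a single canonical form, which also helps the crawler deduplicate pages.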
  • 53. How to control robots? <HTML> <HEAD> <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> <TITLE>...</TITLE> </HEAD> <BODY> Index: tells the spider/bot that it’s OK to index this page. Noindex: spiders/bots see this and don’t index any of the content on this page. Follow: lets the spider/bot know that it’s OK to travel down links found on this page. Nofollow: tells the spider/bot not to follow any of the links on this page.
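Crawlers are also expected to honor robots.txt rules alongside these meta tags; Python's standard `urllib.robotparser` can check a URL against such rules (a minimal sketch; the rules and URLs below are made up):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: all bots are asked to stay out of /private/.
rules = """User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A well-behaved crawler checks before every fetch.
allowed = rp.can_fetch("MyBot", "http://example.com/index.html")
blocked = rp.can_fetch("MyBot", "http://example.com/private/notes.html")
```

As the slides note, obeying these rules is voluntary: the check happens in the crawler, so a misbehaving bot can simply skip it.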
  • 54. Crawling – Process Flow
  • 55. Data Structures ● Primarily a tree while crawling ● Both depth-first search and breadth-first search are used ● Every page that the crawler visits is added as a node to the tree ● Fan-out information is represented as the children of a node (page)
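A minimal sketch of the breadth-first page walk described above, using a made-up in-memory link graph in place of real HTTP fetches (`LINKS` and `crawl_bfs` are illustrative names, not a real crawler's API):

```python
from collections import deque

# Hypothetical link graph standing in for fetched pages and their out-links.
LINKS = {
    "a.com": ["a.com/b", "a.com/c"],
    "a.com/b": ["a.com/c", "a.com/d"],
    "a.com/c": [],
    "a.com/d": [],
}

def crawl_bfs(seed):
    """Breadth-first page walk: each visited URL becomes a tree node,
    and its out-links (fan-out) become that node's children."""
    visited, tree = set(), {}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in visited:          # skip pages already crawled
            continue
        visited.add(url)
        children = LINKS.get(url, [])
        tree[url] = children
        queue.extend(children)
    return tree
```

Swapping the deque's `popleft` for a `pop` turns the same loop into the depth-first variant the slide mentions.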
  • 56. Inverted Indexes the IR Way
  • 57. How Inverted Files Are Created ● Periodically rebuilt, static otherwise. ● Documents are parsed to extract tokens. These are saved with the Document ID. Doc 1: “Now is the time for all good men to come to the aid of their country” Doc 2: “It was a dark and stormy night in the country manor. The time was past midnight” [table: one (term, doc #) row per token, in document order]
  • 58. How Inverted Files are Created ● After all documents have been parsed, the inverted file is sorted alphabetically. [table: the same (term, doc #) rows, now ordered by term: a 2, aid 1, all 1, and 2, come 1, country 1, country 2, …, was 2]
  • 59. How Inverted Files are Created ● Multiple term entries for a single document are merged. ● Within-document term frequency information is compiled. [table: (term, doc #, freq) rows, e.g. country 1 1, country 2 1, the 1 2, the 2 2, time 1 1, time 2 1, to 1 2, was 2 2]
  • 60. How Inverted Files are Created ● Finally, the file can be split into – a Dictionary or Lexicon file and – a Postings file
  • 61. How Inverted Files are Created [tables: the Dictionary/Lexicon holds (term, # docs, total freq) entries, e.g. country 2 2, the 2 4, time 2 2; the Postings file holds the (doc #, freq) pairs that each lexicon entry points to]
  • 62. Inverted indexes ● Permit fast search for individual terms ● For each term, you get a list consisting of: – document ID – frequency of term in doc (optional) – position of term in doc (optional) ● These lists can be used to solve Boolean queries: ● country -> d1, d2 ● manor -> d2 ● country AND manor -> d2 ● Also used for statistical ranking algorithms
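The construction and Boolean lookup sketched on the last few slides can be written compactly (a toy in-memory sketch using the two sample documents; real engines keep postings compressed on disk):

```python
from collections import defaultdict

docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

# Build: term -> {doc_id: within-document term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for token in text.split():
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

def boolean_and(*terms):
    """Intersect postings lists to answer an AND query."""
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings))
```

Here `boolean_and("country")` yields both documents, while `boolean_and("country", "manor")` intersects down to document 2, matching the slide's example.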
  • 63. Inverted Indexes for Web Search Engines ● Inverted indexes are still used, even though the web is so huge. ● Some systems partition the indexes across different machines. Each machine handles different parts of the data. ● Other systems duplicate the data across many machines; queries are distributed among the machines. ● Most do a combination of these.
  • 64. From a description of the FAST search engine, by Knut Risvik: In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries. Each row can handle 120 queries per second. Each column can handle 7M pages. To handle more queries, add another row.
  • 65. Cascading Allocation of CPUs ● A variation on this that produces a cost savings: – Put high-quality/common pages on many machines – Put lower-quality/less common pages on fewer machines – Query goes to the high-quality machines first – If no hits are found there, go to the other machines
  • 66. The Search Process
  • 67. Searching – Process Flow
  • 68. Google Query Evaluation 1. Parse the query. 2. Convert words to WordIDs. 3. Seek to the start of the doclist in the short barrel for every word. 4. Scan through the doclists until there is a document that matches all the search terms. 5. Compute the rank of that document for the query. 6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4. 7. If we are not at the end of any doclist, go to step 4. 8. Sort the documents that have matched by rank and return the top k.
  • 69. Queries ● Search engines are one tool used to answer information needs ● Users express their information needs as queries ● Usually informally expressed as two or three words (we call this a ranked query) ● A recent study showed the mean query length was 2.4 words per query, with a median of 2 ● Around 48.4% of users submit just one query in a session, 20.8% submit two, and about 31% submit three or more ● Less than 5% of queries use Boolean operators (AND, OR, and NOT), and around 5% contain quoted phrases
  • 70. Queries... ● About 1.28 million different words were used in queries in the Excite log studied (which contained 1.03 million queries) ● Around 75 words account for 9% of all words used in queries. The top-ten non-trivial words occurring in 531,000 queries are “sex” (10,757), “free” (9,710), “nude” (7,047), “pictures” (5,939), “university” (4,383), “pics” (3,815), “chat” (3,515), “adult” (3,385), “women” (3,211), and “new” (3,109) ● 16.9% of the queries were about entertainment, 16.8% about sex, pornography, or preferences, and 13.3% concerned commerce, travel, employment, and the economy
  • 71. Answers What is a good answer to a query? ● One that is relevant to the user’s information need! ● Search engines typically return ten answers per page, where each answer is a short summary of a web document ● Likely relevance to an information need is approximated by statistical similarity between web documents and the query ● Users favour search engines that have high precision, that is, those that return relevant answers in the first page of results
  • 72. Approximating Relevance ● Statistical similarity is used to estimate the relevance of a query to an answer ● Consider the query “Richardson Richmond Football” ● A good answer contains all three words, and the more frequently the better; we call this term frequency (TF) ● Some query terms are more important – have better discriminating power – than others. For example, an answer containing only “Richardson” is likely to be better than an answer containing only “Football”; we call this inverse document frequency (IDF)
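As a rough sketch of how TF and IDF interact (the tiny corpus and the log-based weighting below are illustrative assumptions, not any specific engine's formula):

```python
import math

# Hypothetical three-document corpus for the slide's example query.
corpus = [
    "richardson richmond football club",
    "richmond football ground",
    "football football news",
]

def tf_idf(term, doc, corpus):
    """Term frequency in the document times inverse document frequency."""
    tf = doc.split().count(term)
    df = sum(1 for d in corpus if term in d.split())
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)  # rarer terms weigh more
```

“richardson” appears in only one document, so it outscores the ubiquitous “football”, which occurs in every document and therefore carries no discriminating power.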
  • 73. Ranking To improve the accuracy of search engines: ● Google Inc. use their patented PageRank™ technology. Google ranks a page higher if it links to pages that are an authoritative source, and a link from an authoritative source to a page ranks that page higher ● Relevance feedback is a technique that adds words to a query based on a user selecting a “more like this” option ● Query expansion adds words to a query using thesaural or other techniques ● Searching within categories or groups to narrow a search
  • 74. Resolving Queries ● Queries are resolved using the inverted index ● Consider the example query “Cat Mat Hat”. This is evaluated as follows: – Select a word from the query (say, “Cat”) – Retrieve the inverted list from disk for the word – Process the list. For each document the word occurs in, add weight to an accumulator for that document based on the TF, IDF, and document length – Repeat for each word in the query – Find the best-ranked documents, those with the highest weights – Look up the documents in the mapping table – Retrieve and summarize the docs, and present them to the user
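The accumulator loop above can be sketched as follows (the toy documents, the simple TF·IDF weight, and the crude length normalization are all assumptions made for illustration):

```python
import math
from collections import defaultdict

docs = {
    1: "the cat sat on the mat",
    2: "the cat wore a hat",
    3: "a mat and a hat",
}

# Inverted index: term -> {doc_id: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for tok in text.split():
        index[tok][doc_id] = index[tok].get(doc_id, 0) + 1

def rank(query, top_k=10):
    """One weight accumulator per candidate document, term by term."""
    acc = defaultdict(float)
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + len(docs) / len(postings))
        for doc_id, tf in postings.items():
            length = len(docs[doc_id].split())
            acc[doc_id] += (tf * idf) / length   # crude length normalization
    return sorted(acc, key=acc.get, reverse=True)[:top_k]
```

The best-ranked document IDs returned here would then be looked up in the mapping table and summarized, as the slide describes.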
  • 75. Fast Search Engines ● Inverted lists are stored in a compressed format. This allows more information per second to be retrieved from disk, and it lowers disk head seek times ● As long as decompression is fast, there is a beneficial trade-off in time ● Documents are stored in a compressed format for the same reason ● Different compression schemes are used for lists (which are integers) and documents (which are multimedia, but mostly text)
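One common integer scheme of the kind mentioned here is variable-byte coding (a sketch of the general technique, not any particular engine's on-disk format):

```python
def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers: 7 payload bits per byte,
    with the high bit set on the final (lowest-order) byte of each number."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk[0] |= 0x80            # mark the terminating byte
        out.extend(reversed(chunk))
    return bytes(out)

def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:             # terminating byte: emit the number
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers
```

Postings are usually stored as gaps between successive document IDs before encoding, so most numbers are small and fit in a single byte.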
  • 76. Fast Search Engines ● Sort disk accesses to minimise disk head movement when retrieving lists or documents ● Use hash tables in memory to store the vocabulary; avoid slow hash functions that use modulo ● Pre-calculate and store constants in ranking formulae ● Carefully choose integer compression schemes ● Organise inverted lists so that the information frequently needed is at the start of the list ● Use heap structures when partial sorting is required ● Develop a query plan for each query
  • 77. Search Engine Architecture
  • 78. Search engine architecture ● The inverted lists are divided amongst a number of servers, where each is known as a shard ● If an inverted list is required for a particular range of words, then that shard server is contacted ● Each shard server can be replicated as many times as required; each server in a shard is identical ● Documents are also divided amongst a number of servers ● Again, if a document is required within a particular range, then the appropriate document server is contacted ● Each document server can also be replicated as many times as required
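Routing a term to the shard holding its word range might look like this (the alphabetic split points and shard names are invented for illustration):

```python
import bisect

# Hypothetical alphabetic split points: shard-0 holds terms starting before "h",
# shard-1 holds "h" through "p", shard-2 holds the rest.
BOUNDARIES = ["h", "q"]
SHARDS = ["shard-0", "shard-1", "shard-2"]

def shard_for(term):
    """Route a term to the shard server holding that word range."""
    return SHARDS[bisect.bisect_right(BOUNDARIES, term[0].lower())]
```

Replication then sits underneath this: each shard name maps to a pool of identical servers, any of which can answer for that word range.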
  • 79. Google, a Case Study
  • 80. Google Architecture
  • 81. Components ● URL Server: bunch of URLs (white-list) ● Crawler: fetches the pages ● Store Server: stores the fetched pages ● Repository: compressed pages are put here ● Every unique page has a DocID ● Anchors: page transition [to, from] information ● URLResolver: converts relative URLs to absolute URLs ● Lexicon: list of known words
  • 82. Indexer ● Parses the document ● Builds a word-frequency table {word, position, font, capitalization} [hits] ● Pushes the hits to barrels as a partially sorted forward index ● Identifies anchors (page-transition-out info)
  • 83. Searcher ● Forward index to inverted index – Maps keywords to DocIDs – DocIDs mapped to URLs ● Reranker – Uses anchor information to rank the pages for the given query keyword – Rule of thumb: fan-in increases page rank
  • 84. Reranking
  • 85. What about Ranking? ● Lots of variation here – Often messy; details proprietary and fluctuating ● Combining subsets of: – IR-style relevance: based on term frequencies, proximities, position (e.g., in title), font, etc. – Popularity information – Link analysis information ● Most use a variant of vector space ranking to combine these. Here’s how it might work: – Make a vector of weights for each feature – Multiply this by the counts for each feature
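That last step is just a dot product between a feature-weight vector and per-document feature counts (the weights and feature names below are made up):

```python
# Hypothetical per-feature weights: a title hit counts far more than a body hit.
WEIGHTS = {"title_hit": 5.0, "body_hit": 1.0, "anchor_hit": 3.0, "popularity": 2.0}

def score(feature_counts):
    """Combine features as a weight-vector dot product, as described above."""
    return sum(WEIGHTS[f] * c for f, c in feature_counts.items())
```

Tuning those weights (by hand or by learning) is where the proprietary, fluctuating detail the slide mentions actually lives.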
  • 86. Relevance: Going Beyond IR ● Page “popularity” (e.g., DirectHit) – Frequently visited pages (in general) – Frequently visited pages as a result of a query ● Link “co-citation” (e.g., Google) – Which sites are linked to by other sites? – Draws upon sociology research on bibliographic citations to identify “authoritative sources”
  • 87. Link Analysis for Ranking Pages ● Assumption: if the pages pointing to this page are good, then this is also a good page. ● References: Kleinberg 98, Page et al. 98 ● Draws upon earlier research in sociology and bibliometrics. – Kleinberg’s model includes “authorities” (highly referenced pages) and “hubs” (pages containing good reference lists). – The Google model is a version with no hubs, and is closely related to work on influence weights by Pinski-Narin (1976).
  • 88. Link Analysis for Ranking Pages ● Why does this work? – The official Toyota site will be linked to by lots of other official (or high-quality) sites – The best Toyota fan-club site probably also has many links pointing to it – Less high-quality sites do not have as many high-quality sites linking to them
  • 89. PageRank ● Let A1, A2, …, An be the pages that point to page A. Let C(P) be the # links out of page P. The PageRank (PR) of page A is defined as: PR(A) = (1-d) + d ( PR(A1)/C(A1) + … + PR(An)/C(An) ) ● PageRank is the principal eigenvector of the link matrix of the web. ● Can be computed as the fixpoint of the above equation.
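Computing that fixpoint by repeated substitution (power iteration) on a tiny made-up link graph, with d = 0.85 (note that with this formulation the ranks sum to the number of pages, not to one):

```python
# Hypothetical three-page link graph: A links to B and C, B to C, C back to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(Ai)/C(Ai)) until it settles."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new = {}
        for page in links:
            inlinks = [q for q, outs in links.items() if page in outs]
            new[page] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in inlinks)
        pr = new
    return pr
```

Page C, which collects links from both A and B, ends up with the highest rank, matching the intuition of the preceding slides.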
  • 90. PageRank: User Model ● PageRanks form a probability distribution over web pages: the sum of all pages’ ranks is one. ● User model: a “random surfer” selects a page, keeps clicking links (never “back”), until “bored”: then randomly selects another page and continues. – PageRank(A) is the probability that such a user visits A – d is the probability of getting bored at a page ● Google computes the relevance of a page for a given search by first computing an IR relevance and then modifying that by taking into account PageRank for the top pages.
  • 91. Search Engine Optimization
  • 92. How Search Engines Rank Pages ● Location, location, location... and frequency ● Tags (<title>, <meta>, <b>, top of the page) ● How close words (from the query) are to each other on the website ● Quality of links going to and from a page ● Penalization for “spamming”, when a word is repeated hundreds of times on a page to increase the frequency and propel the page higher in the listings ● Off-the-page ranking criteria: ● analyzing how pages link to each other
  • 93. Why do results differ? ● Some search engines index more web pages than others. ● Some search engines also index web pages more often than others. ● The result is that no search engine has the exact same collection of web pages to search through. ● Different algorithms to compute the relevance of a page to a particular query
  • 94. Search Engine Placement Tips ● Why is it important to be on the first page of the results? – Most users do not go beyond the first page. ● How to optimize your website? – Pick your target keywords: how do you think people will search for your web page? The words you imagine them typing into the search box are your target keywords. – Pick target words differently for each page on your website. – Your target keywords should always be at least two or more words long.
  • 95. Position your Keywords ● Make sure your target keywords appear in the crucial locations on your web pages. The page's HTML <title> tag is most important. ● The titles should be relatively short and attractive. Several phrases are enough for the description. ● Search engines also like pages where keywords appear “high” on the page: the headline and first paragraphs of your web page. ● Keep in mind that tables and large JavaScript sections can make your keywords less relevant because they appear lower on the page.
  • 96. Have Relevant Content ● Keywords need to be reflected in the page's content. ● Put more text than graphics on a page ● Don't use frames ● Use the <ALT…> tag ● Make good use of <TITLE> and <H1> ● Consider using the <META> tag ● Get people to link to your page
  • 97. Hiding Web pages ● You may wish to have web pages that are not indexed (for example, test pages). ● It is also possible to hide web content from robots, using the robots.txt file and the robots meta tag. ● Not all crawlers will obey this, so it is not foolproof.
  • 98. Submitting To Search Engines ● Search engines should find you naturally, but submitting helps speed the process and can increase your representation ● Look for the Add URL link at the bottom of the home page ● Submit your home page and a few key “section” pages ● Turnaround ranges from a few days to 2 months
  • 99. Deep Crawlers ● AltaVista, Inktomi, Northern Light will add the most, usually within a month ● Excite, Go (Infoseek) will gather a fair amount; Lycos gathers little ● Index sizes are going up, but the web is outpacing them… nor is size everything ● Here are more actions to help even the odds…
  • 100. “Deep” Submit ● A “deep” submit is directly submitting pages from “inside” the web site – it can help improve the odds these will get listed. ● At Go, you can email hundreds of URLs. Consider doing this. ● At HotBot/Inktomi, you can submit up to 50 pages per day. Possibly worth doing. ● At AltaVista, you can submit up to 5 pages per day. Probably not worth the effort. ● Elsewhere, not worth doing a “deep” submit.
  • 101. Big Site? Split It Up ● Expect search engines to max out at around 500 pages from any particular site ● Increase representation by subdividing large sites logically into subdomains ● Search engines will crawl each subsite to more depth ● Here’s an example...
  • 102. Subdomains vs. Subdirectories Instead of gold.ac.uk/science/, use science.gold.ac.uk; instead of gold.ac.uk/english/, use english.gold.ac.uk; instead of gold.ac.uk/admin/, use admin.gold.ac.uk
  • 103. I Was Framed ● Don't use frames. Period. ● If you do use them, search engines will have difficulty crawling your site.
  • 104. Dynamic Roadblocks ● Dynamic delivery systems that use ? symbols in the URL string prevent search engines from getting to your pages ● http://www.nike.com/ObjectBuilder/ObjectBuilder.iwx?ProcessName= IndexPage&Section_Id=17200&NewApplication=t ● Eliminate the ? symbol, and your life will be rosy ● Look for workarounds, such as Apache rewrite or Cold Fusion alternatives ● Before you move to a dynamic delivery system, check out any potential problems.
  • 105. How Directories Work ● Editors find sites, describe them, put them in a category ● Site owners can also submit to be listed ● A short description represents the entire web site ● Usually has secondary results from a crawler-based search engine
  • 106. The Major Directories ● Yahoo ● The Open Directory (Netscape, Lycos, AOL Search, others) ● LookSmart ● UK Plus ● Snap
  • 107. Submitting To Directories ● Directories probably won't find you, or may list you badly, unless you submit ● Find the right category (more in a moment), then use the Add URL link at the top or bottom of the page ● Write down who submitted (and their email address), when submitted, which category submitted to, and other details ● You’ll need this info for the inevitable resubmission attempt – it will save you time.
  • 108. Submitting To Directories ● Take your time and submit to these right ● Write 3 descriptions: 15, 20 and 25 words long, which incorporate your key terms ● Search for the most important term you want to be found for and submit to the first category listed that seems appropriate for your site ● Be sure to note the contact name and email address you provided on the submit form ● If you don't get in, keep trying
  • 109. Subdomain Advantage ● Directories tend not to list subsections of a web site. ● In contrast, they do tend to see subdomains as independent web sites deserving their own listings ● So, another reason to go with subdomains over subdirectories
  • 110. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 110 How to do Search ?
  • 111. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 111 What do we search ? ● Information ● Reviews, news ● Advice, methods ● Bugs ● Education stuff ● Examples: – Access Violation 0xC0000005 – Search Engine ppt
  • 112. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 112
  • 113. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 113 Main Steps ● Make a decision about the search ● Formulate a topic. Define a type of resources that you are looking for ● Find relevant words for description ● Find websites with information ● Choose the best out of them ● Feedback: How did you search?
  • 114. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 114 Main Problems Why is it difficult to search? – Know the problem, don’t know what to look for – Lose focus (go to interesting but non-relevant sites) – Perform superficial (shallow) search – Search Spam
  • 115. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 115 Typical Problems ● Links are often out of date ● Usually too many links are returned ● Returned links are not very relevant ● The Engines don't know about enough pages ● Different engines return different results ● Political bias
  • 116. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 116 Typical Mistakes ● Unnecessary words in a query ● Unsuitable choice of keywords ● Not enough flexibility in changing keywords (or search engines) ● Failing to divide time between searching and evaluating the search results ● “Your search did not match any documents.” – Bad query!
  • 117. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 117 Search Tricks What can we search for? ● Thematic resource (http://www.topicmaps.org) ● Community ● Collection of articles ● Forum ● Catalog of resources, links ● File (file types) ● Encyclopedia article ● Digital library ● Contact information (i.e. email)
  • 118. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 118 Improving Query Results ● To look for a particular page use an unusual phrase you know is on that page ● Use phrase queries where possible ● Check your spelling! ● Progressively use more terms ● If you don't find what you want, use another Search Engine!
  • 119. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 119 Useful words ● download ● pdf, ppt, doc, zip, mp3 ● forum, directory, links ● faq, for newbies, for beginners, guide, rules, checklist ● lecture notes, survey, tutorials ● how, where, correct, howto ● Copy-pasting the exact error message ● Have you tried http://del.icio.us/ ?
  • 120. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 120 Search Engine Features
  • 121. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 121 Features ● Indexing features ● Search features ● Results display ● Costs, licensing and registration requirements ● Unique features (if any)
  • 122. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 122 Indexing Features ● File/document formats supported: HTML, ASCII, PDF, SQL, spreadsheets, WYSIWYG (MS-Word, WP, etc.) ● Indexing level support: file/directory level, multi-record files ● Standard formats recognized: MARC, Medline, etc. ● Customization of document formats – Stemming: if yes, is this an optional or mandatory feature? – Stop-word support: if yes, is this an optional or mandatory feature?
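To make the stemming and stop-word options above concrete, here is a toy indexing pass. The suffix rules and the stop list are deliberately simplified assumptions for illustration; real stemmers (Porter, for instance) are far subtler:

```python
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "is"}

def crude_stem(word):
    # Naive suffix stripping -- only a sketch of what a real stemmer does.
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text, stem=True, stop=True):
    """Tokenize, optionally drop stop words, optionally stem."""
    terms = text.lower().split()
    if stop:
        terms = [t for t in terms if t not in STOP_WORDS]
    if stem:
        terms = [crude_stem(t) for t in terms]
    return terms

print(index_terms("the engines crawling of pages"))
```

Whether stemming and stop-word removal are optional or mandatory is exactly the question the checklist asks: here they are flags, but in some engines they are baked in.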
  • 123. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 123 Searching Features ● Boolean Searching: Use of the Boolean operators AND, OR and NOT as search-term connectors ● Natural Language: Allows users to enter the query in natural language ● Phrase: Users can search for an exact phrase ● Truncation/wild card: Variations of search terms and plural forms can be searched ● Exact match: Allows users to search for terms exactly as they are entered ● Duplicate detection: Removes duplicate records from the retrieved set ● Proximity: With connectors such as WITH, NEAR and ADJacent, one can specify the position of a search term w.r.t. the others
  • 124. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 124 Searching Features ● Field Searching: Query for a specific field value in the database ● Thesaurus searching: Search for broader, narrower or related terms and concepts ● Query by example: Enables users to search for similar documents ● Soundex searching: Search for records that sound like the search term ● Relevance ranking: Ranking the retrieved records in order of estimated relevance ● Search set manipulation: Saving search results as sets and allowing users to view their search history
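Soundex searching matches sound-alike spellings by reducing each word to a letter-plus-three-digits code. A simplified sketch of the classic algorithm (implementations differ slightly on edge cases such as H/W handling):

```python
def soundex(word):
    # Classic Soundex letter groups: similar-sounding consonants share a digit.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:   # skip vowels and repeated codes
            result += code
        if ch not in "hw":          # h and w do not separate duplicate codes
            prev = code
    return (result + "000")[:4]     # pad or truncate to four characters

print(soundex("Smith"), soundex("Smyth"))  # same code for both spellings
```

Because "Smith" and "Smyth" share the code S530, a Soundex-enabled engine retrieves both for either spelling.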
  • 125. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 125 Results Display ● Formats supported: Can it display documents in their native format, or only HTML? Can it display in different formats and show the number of records retrieved? ● Relevance ranking: If the retrieved records are ranked, how is the relevance score indicated? ● Keyword-in-context: KWIC display or highlighting of matching search terms ● Customization of results display: Allows users to select different display formats ● Saving options: Saving in different formats; number of records that can be saved at a time
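Keyword-in-context (KWIC) display is simple to sketch: find each match and show it with a window of surrounding text, with the match highlighted. A minimal version (the `<b>` markup and window width are illustrative choices):

```python
def kwic(text, term, width=20):
    """Return keyword-in-context snippets: each occurrence of `term`
    wrapped in <b> tags, with `width` characters of context either side."""
    snippets = []
    lower = text.lower()
    start = lower.find(term.lower())
    while start != -1:
        end = start + len(term)
        snippet = (text[max(0, start - width):start]
                   + "<b>" + text[start:end] + "</b>"
                   + text[end:end + width])
        snippets.append(snippet.strip())
        start = lower.find(term.lower(), end)
    return snippets

for s in kwic("Search engines index the web. A search engine crawls pages.",
              "search"):
    print(s)
```

Real engines typically pick the snippet window so that several query terms fall in one excerpt, rather than showing every occurrence.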
  • 126. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 126 Evaluation of Search Engines
  • 127. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 127 CRITICAL EVALUATION Why Evaluate What You Find on the Web? ● Anyone can put up a Web page – about anything ● Many pages not kept up-to-date ● No quality control – most sites not “peer-reviewed” ● less trustworthy than scholarly publications – no selection guidelines for search engines
  • 128. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 128 Web Evaluation Techniques Before you click to view the page... ● Look at the URL - personal page or site ? ~ or % or users or members ● Domain name appropriate for the content ? edu, com, org, net, gov, ca.us, uk, etc. ● Published by an entity that makes sense ? ● News from its source? www.nytimes.com ● Advice from valid agency? www.nih.gov/ www.nlm.nih.gov/ www.nimh.nih.gov/
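Some of these pre-click checks can be partly automated. A rough heuristic sketch, assuming the marker lists below (tilde, percent, "users", "members" for personal pages; a handful of vetted TLDs) — they are illustrative, not a complete or authoritative test:

```python
from urllib.parse import urlparse

# Assumed heuristics, mirroring the slide's checklist.
PERSONAL_MARKERS = ("~", "%", "/users/", "/members/")
TRUSTED_TLDS = (".edu", ".gov", ".ac.uk", ".gov.uk")

def quick_checks(url):
    """Flag personal-page markers in the path and vetted top-level domains."""
    parsed = urlparse(url)
    path = parsed.path.lower()
    host = parsed.hostname or ""
    return {
        "personal_page": any(m in path for m in PERSONAL_MARKERS),
        "trusted_tld": host.endswith(TRUSTED_TLDS),
    }

print(quick_checks("http://www.aol.com/~jbarker/page.html"))
print(quick_checks("http://www.nih.gov/health/"))
```

Such flags only prioritise scepticism; they cannot replace reading the page and its "About us" links.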
  • 129. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 129 Web Evaluation Techniques Scan the perimeter of the page ● Can you tell who wrote it ? ● name of page author ● organization, institution, agency you recognize ● e-mail contact by itself not enough ● Credentials for the subject matter ? – Look for links to: “About us” “Philosophy” “Background” “Biography” ● Is it recent or current enough ? ● Look for “last updated” date - usually at bottom ● If no links or other clues... ● truncate back the URL http://hs.houstonisd.org/hspva/academic/Science/Thinkquest/gail/text/ethics.html
  • 130. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 130 Web Evaluation Techniques Indicators of quality ● Sources documented ● links, footnotes, etc. – as detailed as you would expect in print publications? ● do the links work? ● Information retyped (or forged?) ● why not a link to the published version instead? ● Links to other resources ● biased or slanted?
  • 131. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 131 Web Evaluation Techniques What Do Others Say ? ● Search the URL in alexa.com – Who links to the site? Who owns the domain? – Type or paste the URL into the basic search box – Traffic for top 100,000 sites ● See what links are in Google’s Similar pages ● Look up the page author in Google
  • 132. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 132
  • 133. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 133 Web Evaluation Techniques STEP BACK & ASK: Does it all add up ? ● Why was the page put on the Web ? ● inform with facts and data? ● explain, persuade? ● sell, entice? ● share, disclose? ● as a parody or satire? ● Is it appropriate for your purpose?
  • 134. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 134 Try evaluating some sites... Search a controversial topic in Google:  "nuclear armageddon"  prions danger  “stem cells” abortion Scan the first two pages of results Visit one or two sites  try to evaluate their quality and reliability
  • 135. Copyleft ( ) 2009 Sudarsun Santhiappanɔ 135 Ufff, The End ● Have you learned something today ? ● Try whatever we've discussed today! ● If you need help, let me know at sudarsun@gmail.com

Editor's Notes

  2. Go through these procedures fairly quickly: there’s an exercise to learn this. You want them to be able to understand the form and what it says. DOMAIN APPROPRIATE FOR THE CONTENT: do you trust a NYT article from a personal page as much as one from nytimes.com? A copy of Jackie Onassis’s will from a personal page as much as one from the California Bar Assn.? An example of a personal page would be www.aol.com/~jbarker. They are loosely paralleled by the sequence of the form in the next exercise.
  4. You can trust lii.org more than many referrals. If there are annotations by professionals, that helps. The burden is on you, always. Demonstrate link: search example in Google. Use http://www.hanksville.org/yucatan/mayacal.html