# Yahoo Slurp spider

Discussion in 'Computer Science & Culture' started by kaduseus, Apr 30, 2006.

1. ### kaduseus · melencolia I · Registered Senior Member

Messages:
213
Does anyone know what the Slurp spider is doing?
It seems to be multiplying and eating bandwidth, but what is it actually doing?
Is it a case of bad programming? The Slurp spider doesn't seem to know that other Slurp spiders are on the same site.
Is it 'skynet' by another name?

Is Yahoo trying to develop a virtual internet, so that if you find a page with their search engine, you surf Yahoo's cache without ever visiting the actual site? They would supply the adverts.

At what point does the Slurp spider's activity become illegal, as theft of bandwidth?

3. ### Sci-Phenomena · Reality is in the Minds Eye · Registered Senior Member

Messages:
869
I think it should be illegal the moment there is more than one spider on one webpage.

5. ### Stryder · Keeper of "good" ideas. · Valued Senior Member

Messages:
13,101

The help information gives you a clue as to how to make a decent robots.txt file to disallow crawling of particular folders and files. There is even information on something that's Yahoo-specific, where the robot's caching time can be controlled via the robots.txt file.

There is also information on how to add meta tags to your pages to stop the spider indexing them.

It does, however, miss information like this:
(Place inside the <HEAD> section of the HTML)

<META NAME="Revisit-After" CONTENT="5 Days">
Alter "5 Days" to something else to tell spiders when they should come back and check for updates.

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
This is mentioned in the Slurp help, but it misses "NOFOLLOW", which asks spiders not to follow links from that page.
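
For context, here is how both tags might sit together in a page head. A minimal sketch; the title and body text are placeholders:

Code:
<HTML>
<HEAD>
<META NAME="Revisit-After" CONTENT="5 Days">
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
<TITLE>Placeholder title</TITLE>
</HEAD>
<BODY>
Page content here.
</BODY>
</HTML>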

Example robots.txt
Code:
User-agent: *
Disallow: /*.js
Disallow: /cgi-bin
Disallow: /images
Disallow: /*.css
Disallow: /*.cgi
Disallow: /*.log
Disallow: /script
Allow: /*.html

User-agent: Slurp
Crawl-delay: 10
Disallow: /*.jpg
Disallow: /*.gif
Disallow: /*.png
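
If you want to sanity-check rules like these before deploying them, Python's standard urllib.robotparser module is one option. A minimal sketch, assuming Python 3.6+ for crawl_delay(); note the stdlib parser follows the original robots.txt spec and doesn't understand wildcard patterns like /*.jpg, so only plain path prefixes are tested here:

Code:
# Minimal sketch: test robots.txt rules locally with Python's stdlib.
# urllib.robotparser follows the original robots.txt spec, so wildcard
# patterns such as /*.jpg are not understood; plain prefixes only.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /cgi-bin
Disallow: /images
Disallow: /script

User-agent: Slurp
Crawl-delay: 10
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Would a generic crawler be allowed to fetch these paths?
print(parser.can_fetch("*", "/index.html"))   # True
print(parser.can_fetch("*", "/cgi-bin/run"))  # False

# How long should Slurp wait between requests? (Python 3.6+)
print(parser.crawl_delay("Slurp"))            # 10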

You could use something like Webtracer to test how your website's links connect together. (Please note the program can be buggy; however, it's a different way of viewing how your pages interlink.)