Yahoo slurp spider

Discussion in 'Computer Science & Culture' started by kaduseus, Apr 30, 2006.

Thread Status:
Not open for further replies.
  1. kaduseus melencolia I Registered Senior Member

    Messages:
    213
    Does anyone know what the Slurp spider is doing?
    It seems to be multiplying and eating bandwidth, but what is it actually doing?
    Is it a case of bad programming? The Slurp spider doesn't seem to know that other Slurp spiders are on the same site.
    Is it 'skynet' by another name?

    Is Yahoo trying to develop a virtual internet, so that if you find a page with their search engine, you surf Yahoo's cache without ever visiting the actual site? They would supply the adverts.

    At what point does the slurp spider activity become illegal, with the theft of bandwidth?
     
  3. Sci-Phenomena Reality is in the Minds Eye Registered Senior Member

    Messages:
    869
    I think it should be illegal the moment there is more than one spider on one webpage.
     
  5. Stryder Keeper of "good" ideas. Valued Senior Member

    Messages:
    13,101
    You should perhaps read up on your foe.

    The help information gives you a clue as to how to make a decent robots.txt file that disallows crawling of particular folders and files. There is even information on something that's Yahoo-specific, where the robot's caching time can be controlled from the robots.txt file.

    There is also information on how to input headers into your pages to stop the spider indexing them.

    It does, however, miss information like this:
    (Place inside the <HEAD> section of the HTML, before the <TITLE> tag)

    <META NAME="Revisit-After" CONTENT="5 Days">
    Obviously alter "5 Days" to something else to tell spiders when they should come back and check for updates. (This is only a hint; not every spider honours it.)

    <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
    This is mentioned in the Slurp help, but it leaves out "NOFOLLOW", which basically asks spiders not to follow URLs from that page.
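To see what a spider would make of the robots meta tag above, here is a minimal Python sketch that pulls the directives out of a page. The class name `RobotsMetaParser`, the helper `robots_directives` and the sample `PAGE` are made up for illustration; real spiders have their own parsers. It accepts the tag written with either `NAME` or `HTTP-EQUIV`, since both spellings turn up in the wild.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any robots meta tag in a page (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        # Accept both <meta name="robots"> and <meta http-equiv="robots">.
        kind = attr.get("name") or attr.get("http-equiv") or ""
        if kind.lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page used only for this example.
PAGE = (
    '<html><head><meta name="ROBOTS" content="NOINDEX,NOFOLLOW">'
    "<title>Example</title></head><body></body></html>"
)

directives = robots_directives(PAGE)
print("index this page:", "noindex" not in directives)
print("follow its links:", "nofollow" not in directives)
```

Running it against your own pages is a quick way to confirm the tag is actually where a spider can find it.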

    Example robots.txt
    Code:
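    # Rules for every crawler that has no more specific group below.
    # The * wildcard inside a path is an extension that not every crawler supports.
    User-agent: *
    Disallow: /*.js
    Disallow: /cgi-bin
    Disallow: /images
    Disallow: /*.css
    Disallow: /*.cgi
    Disallow: /*.log
    Disallow: /script
    Allow: /*.html

    # Rules for Yahoo's Slurp only. A crawler that finds its own group
    # normally ignores the * group, so repeat any rules you still want applied.
    User-agent: Slurp
    Crawl-delay: 10
    Disallow: /*.jpg
    Disallow: /*.gif
    Disallow: /*.png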
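A quick way to sanity-check a robots.txt before uploading it is Python's standard `urllib.robotparser`. The sketch below is only a rough check under a couple of assumptions: `ROBOTS_TXT` is a cut-down version of the example file using plain path prefixes (the standard-library parser does simple prefix matching, so the `*` wildcard rules are left out), and "OtherBot" is a made-up crawler name. It shows how Python's parser reads the file, which is not guaranteed to match what Slurp itself does.

```python
from urllib.robotparser import RobotFileParser

# Simplified robots.txt for illustration: prefix rules only, no wildcards.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /images
Disallow: /script

User-agent: Slurp
Crawl-delay: 10
Disallow: /images
"""


def load_rules(text):
    """Parse robots.txt content held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Delay requested from Slurp, in seconds.
print("Slurp crawl delay:", rules.crawl_delay("Slurp"))

# Slurp has its own group, so only the rules in that group are applied to it.
print("Slurp /images/logo.gif:", rules.can_fetch("Slurp", "/images/logo.gif"))
print("Slurp /cgi-bin/form.cgi:", rules.can_fetch("Slurp", "/cgi-bin/form.cgi"))

# A crawler without its own group falls back to the * group.
print("OtherBot /cgi-bin/form.cgi:", rules.can_fetch("OtherBot", "/cgi-bin/form.cgi"))
```

With this parser, Slurp is refused `/images` but allowed into `/cgi-bin`, because its own group replaces the `*` group rather than adding to it. If you want a rule to apply to Slurp as well, list it under `User-agent: Slurp` too.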
    
    You could use something like Webtracer to test how your website's links connect together. (Please note the program can be buggy, but it's a different way of viewing how your pages interlink.)
     