ir.webutils
Class SafeHTMLPageRetriever

java.lang.Object
  |
  +--ir.webutils.HTMLPageRetriever
        |
        +--ir.webutils.SafeHTMLPageRetriever

public final class SafeHTMLPageRetriever
extends HTMLPageRetriever

Keeps track of Robots Exclusion information. Clients can use this class to ensure that they do not access pages prohibited by either the Robots Exclusion Protocol (robots.txt) or Robots META tags.


Constructor Summary
SafeHTMLPageRetriever()
           
 
Method Summary
 HTMLPage getHTMLPage(Link link)
          Tries to download the given web page.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SafeHTMLPageRetriever

public SafeHTMLPageRetriever()
Method Detail

getHTMLPage

public HTMLPage getHTMLPage(Link link)
                     throws PathDisallowedException
Tries to download the given web page. Throws PathDisallowedException if access to the page is prohibited. Also updates Robots Exclusion information based on the new page.
Overrides:
getHTMLPage in class HTMLPageRetriever
Parameters:
link - The link to the page to try to download.
Returns:
The web page that the link points to.
Throws:
PathDisallowedException - If the link's URL is disallowed by a robots.txt file or a Robots META tag.
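
Example:
The calling pattern implied by the signature above can be sketched as follows. This is a minimal illustration, not the library's implementation: the nested Link, HTMLPage, PathDisallowedException, HTMLPageRetriever, and SafeHTMLPageRetriever classes are hypothetical stand-ins included only so the sketch compiles without ir.webutils on the classpath. The Link(String) constructor, the checked nature of PathDisallowedException, and the "/private/" rule (a placeholder for real robots.txt and META tag checks) are all assumptions. Only the shape of crawl() is the point: call getHTMLPage inside a try block and skip any link that is refused.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SafeRetrieverSketch {

    // --- Hypothetical stand-ins for the ir.webutils types (assumptions) ---
    static class Link {
        final String url;
        Link(String url) { this.url = url; }
    }

    static class HTMLPage {
        final Link link;
        HTMLPage(Link link) { this.link = link; }
    }

    // Assumed here to be a checked exception.
    static class PathDisallowedException extends Exception {
        PathDisallowedException(String message) { super(message); }
    }

    static class HTMLPageRetriever {
        HTMLPage getHTMLPage(Link link) throws PathDisallowedException {
            return new HTMLPage(link); // no network access in this sketch
        }
    }

    static final class SafeHTMLPageRetriever extends HTMLPageRetriever {
        @Override
        HTMLPage getHTMLPage(Link link) throws PathDisallowedException {
            // Placeholder rule standing in for robots.txt / Robots META checks.
            if (link.url.contains("/private/")) {
                throw new PathDisallowedException(link.url);
            }
            return super.getHTMLPage(link);
        }
    }

    // --- The calling pattern: fetch what is allowed, skip what is not ---
    static List<String> crawl(List<Link> links) {
        SafeHTMLPageRetriever retriever = new SafeHTMLPageRetriever();
        List<String> fetched = new ArrayList<>();
        for (Link link : links) {
            try {
                HTMLPage page = retriever.getHTMLPage(link);
                fetched.add(page.link.url);
            } catch (PathDisallowedException e) {
                // Access is prohibited for this link; move on to the next one.
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        List<Link> links = Arrays.asList(
            new Link("http://example.com/index.html"),
            new Link("http://example.com/private/notes.html"));
        System.out.println(crawl(links)); // prints [http://example.com/index.html]
    }
}
```

Reusing one retriever instance across all links, as crawl() does, matches the description that the class keeps track of exclusion information and updates it as new pages are retrieved.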