ir.webutils
Class AnchoredSpider

java.lang.Object
  |
  +--ir.webutils.Spider
        |
        +--ir.webutils.AnchoredSpider
Direct Known Subclasses:
AnchoredDirectorySpider, AnchoredSiteSpider

public class AnchoredSpider
extends Spider


Field Summary
protected  java.util.HashMap urlMap
          Map from URLs to file names where they are stored.
 
Fields inherited from class ir.webutils.Spider
count, linksToVisit, maxCount, saveDir, slow, visited, webpr
 
Constructor Summary
AnchoredSpider()
           
 
Method Summary
protected  void addAnchorText(java.lang.String fileName, AnchoredLink link, boolean firstLink)
          Adds the anchor text of the link to the end of a page file.
 void doCrawl()
          This crawl differs from the default by extracting AnchoredLink objects and appending the anchor text of each link to the bottom of the indexed page it points to.
protected  void handleUCommandLineOption(java.lang.String value)
          Called when "-u" is passed in on the command line.
static void main(java.lang.String[] args)
          Spiders the web according to the command-line options, but only below the start URL directory.
protected  void processPage(HTMLPage page)
          This version extracts anchored links, stores the file name for this URL in urlMap, and adds the anchor text of the initial link to the bottom of the file.
 
Methods inherited from class ir.webutils.Spider
getNewLinks, go, handleCCommandLineOption, handleDCommandLineOption, handleSafeCommandLineOption, handleSlowCommandLineOption, linkToHTMLPage, processArgs
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

urlMap

protected java.util.HashMap urlMap
Map from URLs to the names of the files where their pages are stored. This allows the anchor text of links that point to a page to be appended to the bottom of that page's file.
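
The lookup described above can be sketched as follows. This is a minimal illustration, not the class's actual code: the class name UrlMapSketch, the methods recordPage and fileFor, and the file name "P001.html" are all hypothetical, and URLs are represented as plain strings.

```java
import java.util.HashMap;
import java.util.Map;

final class UrlMapSketch {
    // Hypothetical stand-in for urlMap: URL -> name of the file holding that page.
    private final Map<String, String> urlMap = new HashMap<>();

    /** Records where a fetched page was saved. */
    void recordPage(String url, String fileName) {
        urlMap.put(url, fileName);
    }

    /** Returns the file that stores the link target, or null if it has not been stored. */
    String fileFor(String targetUrl) {
        return urlMap.get(targetUrl);
    }

    public static void main(String[] args) {
        UrlMapSketch sketch = new UrlMapSketch();
        sketch.recordPage("http://example.com/a.html", "P001.html");
        // A link to a stored page resolves to its file, so anchor text can be appended there.
        System.out.println(sketch.fileFor("http://example.com/a.html")); // P001.html
        // A link to a page that has not been stored yields null.
        System.out.println(sketch.fileFor("http://example.com/b.html")); // null
    }
}
```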
Constructor Detail

AnchoredSpider

public AnchoredSpider()
Method Detail

handleUCommandLineOption

protected void handleUCommandLineOption(java.lang.String value)
Called when "-u" is passed in on the command line.

This version adds value to the list of links to visit as an initial anchored link.

Overrides:
handleUCommandLineOption in class Spider
Parameters:
value - The value associated with the "-u" option.

doCrawl

public void doCrawl()
This crawl differs from the default by extracting AnchoredLink objects and appending the anchor text of each link to the bottom of the indexed page it points to.
Overrides:
doCrawl in class Spider
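
The idea of collecting anchor text for the pages that links point to can be sketched over an in-memory link graph. This is only an illustration under stated assumptions: the Link class, the crawl method, and the breadth-first order are hypothetical, no pages are fetched or written, and the real doCrawl may differ in every detail.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AnchoredCrawlSketch {
    /** Hypothetical link record: target URL plus the anchor text of the link. */
    static final class Link {
        final String url;
        final String anchorText;
        Link(String url, String anchorText) {
            this.url = url;
            this.anchorText = anchorText;
        }
    }

    /**
     * Crawls an in-memory "web" (URL -> outgoing links) breadth-first from start,
     * storing at most maxCount pages. Returns, for each stored URL, the anchor
     * texts of the links that pointed to it, in the order they were encountered.
     */
    static Map<String, List<String>> crawl(Map<String, List<Link>> web, Link start, int maxCount) {
        Map<String, List<String>> anchorsByUrl = new LinkedHashMap<>();
        Deque<Link> linksToVisit = new ArrayDeque<>();
        linksToVisit.add(start);
        while (!linksToVisit.isEmpty()) {
            Link link = linksToVisit.remove();
            List<String> anchors = anchorsByUrl.get(link.url);
            if (anchors == null) {
                // Page not stored yet: skip it if the limit is reached or it is unknown.
                if (anchorsByUrl.size() >= maxCount || !web.containsKey(link.url)) {
                    continue;
                }
                anchors = new ArrayList<>();
                anchorsByUrl.put(link.url, anchors);
                linksToVisit.addAll(web.get(link.url));
            }
            // Attach this link's anchor text to the page it points to.
            if (!link.anchorText.isEmpty()) {
                anchors.add(link.anchorText);
            }
        }
        return anchorsByUrl;
    }

    public static void main(String[] args) {
        Map<String, List<Link>> web = new HashMap<>();
        web.put("a", Arrays.asList(new Link("b", "see B"), new Link("c", "see C")));
        web.put("b", Arrays.asList(new Link("c", "C from B"), new Link("a", "back to A")));
        web.put("c", Collections.<Link>emptyList());
        // The start link has no anchor text, so an empty string is used.
        System.out.println(crawl(web, new Link("a", ""), 10));
        // {a=[back to A], b=[see B], c=[see C, C from B]}
    }
}
```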

processPage

protected void processPage(HTMLPage page)
This version extracts anchored links, stores the file name for this URL in urlMap, and adds the anchor text of the initial link to the bottom of the file.
Overrides:
processPage in class Spider
Parameters:
page - An HTMLPage that contains the page to index.

addAnchorText

protected void addAnchorText(java.lang.String fileName,
                             AnchoredLink link,
                             boolean firstLink)
Adds the anchor text of the link to the end of a page file.
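
Appending text to the end of an existing page file might look like the sketch below. It is an assumption-laden illustration: it takes the anchor text as a plain String rather than an AnchoredLink, and it guesses that firstLink marks the first anchor text added to a file, in which case a blank separator line is written first. The documentation above does not state what firstLink does.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class AnchorTextAppender {
    /**
     * Appends anchorText to the end of pageFile.
     * Assumption: firstLink == true means no anchor text has been added to this
     * file yet, so a blank separator line is written before the text.
     */
    static void appendAnchorText(Path pageFile, String anchorText, boolean firstLink)
            throws IOException {
        StringBuilder addition = new StringBuilder();
        if (firstLink) {
            addition.append('\n');
        }
        addition.append(anchorText).append('\n');
        // APPEND keeps the existing page content and writes after it.
        Files.write(pageFile,
                addition.toString().getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path page = Files.createTempFile("page", ".html");
        Files.write(page, "<html>stored page</html>".getBytes(StandardCharsets.UTF_8));
        appendAnchorText(page, "home page", true);
        appendAnchorText(page, "course notes", false);
        System.out.print(new String(Files.readAllBytes(page), StandardCharsets.UTF_8));
        Files.delete(page);
    }
}
```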

main

public static void main(java.lang.String[] args)
Spiders the web according to the command-line options, but only below the start URL directory. This version also appends anchor text to the end of the stored pages.