ir.webutils
Class AnchoredSpider
java.lang.Object
|
+--ir.webutils.Spider
|
+--ir.webutils.AnchoredSpider
- Direct Known Subclasses:
- AnchoredDirectorySpider, AnchoredSiteSpider
- public class AnchoredSpider
- extends Spider
Field Summary
protected java.util.HashMap urlMap
- Map from URLs to file names where they are stored.
Method Summary
protected void addAnchorText(java.lang.String fileName, AnchoredLink link, boolean firstLink)
- Add anchor text of the link to the end of a page file.
void doCrawl()
- This crawl differs from the default by extracting AnchoredLinks
and attaching all anchor text to the bottom of indexed pages
to which they point.
protected void handleUCommandLineOption(java.lang.String value)
- Called when "-u" is passed in on the command line.
static void main(java.lang.String[] args)
- Spider the web according to the following command options,
but only below the start URL directory.
protected void processPage(HTMLPage page)
- This version extracts anchored links, stores file name for this
URL in the urlMap, and adds initial link anchor text to the
bottom of the file.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
urlMap
protected java.util.HashMap urlMap
- Map from URLs to the names of the files where they are stored.
This allows the anchor text of links that point to a page
to be attached to the bottom of that page's file.
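The lookup this field supports can be sketched as follows. This is a minimal illustration, not the class's actual code: the real field is a raw java.util.HashMap whose key and value types are not stated here, so the String-to-String typing, the class name UrlMapSketch, and both method names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

class UrlMapSketch {
    // Hypothetical stand-in for urlMap: URL string -> name of the file holding that page.
    private final Map<String, String> urlMap = new HashMap<>();

    /** Remembers which file a crawled URL was stored in. */
    void recordStoredPage(String url, String fileName) {
        urlMap.put(url, fileName);
    }

    /** Returns the file name for a stored URL, or null if the URL has not been stored. */
    String fileNameFor(String url) {
        return urlMap.get(url);
    }
}
```

When a later link points at a URL that is already in the map, the returned file name identifies the page file that should receive the link's anchor text.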
AnchoredSpider
public AnchoredSpider()
handleUCommandLineOption
protected void handleUCommandLineOption(java.lang.String value)
- Called when "-u" is passed in on the command line.
This implementation adds value to the list of links to visit.
This version creates an initial anchored link.
- Overrides:
handleUCommandLineOption in class Spider
- Parameters:
value - The value associated with the "-u" option.
doCrawl
public void doCrawl()
- This crawl differs from the default by extracting AnchoredLinks
and attaching all anchor text to the bottom of the indexed pages
to which they point.
- Overrides:
doCrawl in class Spider
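The crawl described above can be sketched on an in-memory "web". This is an illustration under stated assumptions, not the method's implementation: the Link class is a hypothetical stand-in for AnchoredLink, pages are kept in a map rather than written to files, and the breadth-first order and the "start" label for the initial link are choices made for this example only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AnchorCrawlSketch {
    /** Hypothetical stand-in for AnchoredLink: a target URL plus its anchor text. */
    static final class Link {
        final String url;
        final String anchorText;
        Link(String url, String anchorText) {
            this.url = url;
            this.anchorText = anchorText;
        }
    }

    /**
     * Crawls an in-memory web (URL -> outgoing links) breadth-first from start.
     * Returns, for each stored page, the anchor texts of the links that point to it.
     */
    static Map<String, List<String>> crawl(Map<String, List<Link>> web, Link start) {
        // Plays the role of urlMap plus the anchor text appended to each page file.
        Map<String, List<String>> stored = new LinkedHashMap<>();
        Deque<Link> toVisit = new ArrayDeque<>();
        toVisit.add(start);
        while (!toVisit.isEmpty()) {
            Link link = toVisit.poll();
            List<String> anchors = stored.get(link.url);
            if (anchors != null) {
                // Target already stored: attach this link's anchor text to it.
                anchors.add(link.anchorText);
                continue;
            }
            if (!web.containsKey(link.url)) {
                continue; // page cannot be fetched in this toy web, so skip it
            }
            // First visit: store the page together with the anchor text that led here.
            anchors = new ArrayList<>();
            anchors.add(link.anchorText);
            stored.put(link.url, anchors);
            toVisit.addAll(web.get(link.url));
        }
        return stored;
    }
}
```

In this sketch a page collects one anchor text per incoming link that the crawl follows, including links discovered after the page was first stored.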
processPage
protected void processPage(HTMLPage page)
- This version extracts anchored links, stores the file name for this
URL in the urlMap, and adds the initial link's anchor text to the
bottom of the file.
- Overrides:
processPage in class Spider
- Parameters:
page - An HTMLPage that contains the page to index.
addAnchorText
protected void addAnchorText(java.lang.String fileName,
AnchoredLink link,
boolean firstLink)
- Adds the anchor text of the link to the end of a page file.
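Appending text to the end of an existing page file might look like the sketch below. It is an illustration only: the class name AnchorTextAppender and both methods are hypothetical, the anchor text is passed as a plain String instead of an AnchoredLink, and the firstLink flag is left out because its role is not documented here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class AnchorTextAppender {
    /** Appends anchorText as its own line at the end of pageFile, creating the file if needed. */
    static void appendAnchorText(Path pageFile, String anchorText) {
        byte[] line = (anchorText + "\n").getBytes(StandardCharsets.UTF_8);
        try {
            Files.write(pageFile, line, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: stores pageText in a temporary file, appends anchorText, returns the result. */
    static String demo(String pageText, String anchorText) {
        try {
            Path file = Files.createTempFile("page", ".html");
            try {
                Files.write(file, pageText.getBytes(StandardCharsets.UTF_8));
                appendAnchorText(file, anchorText);
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file); // clean up the temporary page file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Opening the file in append mode leaves the stored page content untouched, so repeated calls accumulate one line of anchor text per incoming link.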
main
public static void main(java.lang.String[] args)
- Spiders the web according to the following command-line options,
but only below the start URL directory. This version also stores
anchor text at the end of pages.
- -safe : Check for and obey robots.txt and robots META tag directives.
- -d <directory> : Store indexed files in <directory>.
- -c <count> : Store at most <count> files.
- -u <url> : Start at <url>.
- -slow : Pause briefly before getting a page. This can be useful when debugging.
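How the options above map onto settings can be sketched with a small parser. This is not the argument handling used by Spider or AnchoredSpider: the class SpiderOptionsSketch, its fields, and the placeholder defaults (null directory, -1 for "no count given") are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SpiderOptionsSketch {
    boolean safe = false;            // -safe
    boolean slow = false;            // -slow
    String directory = null;         // -d <directory>; null means not given
    int maxCount = -1;               // -c <count>; -1 means not given
    final List<String> startUrls = new ArrayList<>(); // -u <url>

    /** Parses the documented options; throws IllegalArgumentException on bad input. */
    static SpiderOptionsSketch parse(String[] args) {
        SpiderOptionsSketch options = new SpiderOptionsSketch();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-safe":
                    options.safe = true;
                    break;
                case "-slow":
                    options.slow = true;
                    break;
                case "-d":
                    options.directory = valueAt(args, ++i, "-d");
                    break;
                case "-c":
                    options.maxCount = Integer.parseInt(valueAt(args, ++i, "-c"));
                    break;
                case "-u":
                    options.startUrls.add(valueAt(args, ++i, "-u"));
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
        }
        return options;
    }

    /** Returns the value following an option, or fails if the option is the last argument. */
    private static String valueAt(String[] args, int index, String option) {
        if (index >= args.length) {
            throw new IllegalArgumentException("Missing value for " + option);
        }
        return args[index];
    }
}
```

For example, parsing the arguments -safe -d pages -c 50 -u http://example.com/ (a placeholder URL) yields safe = true, directory "pages", maxCount 50, and one start URL.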