Spider

java.lang.Object
- ir.webutils.Spider

Direct Known Subclasses:

DirectorySpider, SiteSpider
```
public class Spider
extends java.lang.Object
```
Spider defines a framework for writing a web crawler. Users can change the behavior of the spider by overriding methods. The default spider performs a breadth-first crawl starting from a given URL, up to a specified maximum number of pages, saving (caching) the pages in a given directory. It also adds a "BASE" HTML tag to cached pages so that links can be followed from the cached version.
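
The role of the "BASE" tag can be illustrated with a small, self-contained sketch. It is not the ir.webutils code: the class `BaseTagSketch` and the method `addBase` are hypothetical names, and the sketch assumes the page is available as a string and that the tag is placed right after the opening head tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not part of ir.webutils. */
final class BaseTagSketch {

    // Matches an opening <head> tag (with optional attributes) but not <header>.
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the HTML with a BASE tag pointing at pageUrl, so that relative
     * links in a locally cached copy still resolve against the original site.
     * Simplification: pageUrl is inserted as-is, without attribute escaping.
     */
    static String addBase(String html, String pageUrl) {
        String base = "<base href=\"" + pageUrl + "\">";
        Matcher m = HEAD_OPEN.matcher(html);
        if (m.find()) {
            // Insert immediately after the opening head tag.
            return html.substring(0, m.end()) + base + html.substring(m.end());
        }
        // No head element found: fall back to prepending the tag.
        return base + html;
    }
}
```

With such a tag in place, a browser opening the cached file resolves a relative link like `page2.html` against the original URL rather than against the cache directory.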

Field Summary

Fields
Modifier and Type	Field and Description
`protected int`	`count` The number of pages indexed.
`protected java.util.List<Link>`	`linksToVisit` The queue of links maintained by the spider
`protected int`	`maxCount` The maximum number of pages to be indexed.
`protected HTMLPageRetriever`	`retriever` The object to be used to retrieve pages
`protected java.io.File`	`saveDir` The directory to save the downloaded files to.
`protected boolean`	`slow` Flag to purposely slow the crawl for debugging purposes
`protected java.util.HashSet<Link>`	`visited` The URLs that have already been visited.

Constructor Summary

Constructors
Constructor and Description
`Spider()`

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`doCrawl()` Performs the crawl.
`protected java.util.List<Link>`	`getNewLinks(HTMLPage page)` Returns a list of links to follow from a given page.
`void`	`go(java.lang.String[] args)` Checks command line arguments and performs the crawl.
`protected void`	`handleCCommandLineOption(java.lang.String value)` Called when "-c" is passed in on the command line.
`protected void`	`handleDCommandLineOption(java.lang.String value)` Called when "-d" is passed in on the command line.
`protected void`	`handleSafeCommandLineOption()` Called when "-safe" is passed in on the command line.
`protected void`	`handleSlowCommandLineOption()` Called when "-slow" is passed in on the command line.
`protected void`	`handleUCommandLineOption(java.lang.String value)` Called when "-u" is passed in on the command line.
`protected void`	`indexPage(HTMLPage page)` "Indexes" an `HTMLPage`.
`protected boolean`	`linkToHTMLPage(Link link)` Check if this is a link to an HTML page.
`static void`	`main(java.lang.String[] args)` Spider the web according to the following command options: -safe : Check for and obey robots.txt and robots META tag directives. -d <directory> : Store indexed files in <directory>. -c <maxCount> : Store at most <maxCount> files (default is 10,000). -u <url> : Start at <url>. -slow : Pause briefly before getting a page.
`void`	`processArgs(java.lang.String[] args)` Processes command-line arguments.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - linksToVisit
```
protected java.util.List<Link> linksToVisit
```
    The queue of links maintained by the spider
  - slow
```
protected boolean slow
```
    Flag to purposely slow the crawl for debugging purposes
  - retriever
```
protected HTMLPageRetriever retriever
```
    The object to be used to retrieve pages
  - saveDir
```
protected java.io.File saveDir
```
    The directory to save the downloaded files to.
  - count
```
protected int count
```
    The number of pages indexed. In the default implementation a page is considered to be indexed only if it is written to a file.
  - maxCount
```
protected int maxCount
```
    The maximum number of pages to be indexed.
  - visited
```
protected java.util.HashSet<Link> visited
```
    The URLs that have already been visited.
- Constructor Detail
  - Spider
```
public Spider()
```
- Method Detail
  - go
```
public void go(java.lang.String[] args)
```
    Checks command line arguments and performs the crawl.
    This implementation calls processArgs and doCrawl.
    
    Parameters:
    args - Command line arguments.
  - processArgs
```
public void processArgs(java.lang.String[] args)
```
    Processes command-line arguments.
    The following options are handled by this function:
    - -safe : Check for and obey robots.txt and robots META tag directives.
    - -d <directory> : Store indexed files in <directory>.
    - -c <count> : Store at most <count> files.
    - -u <url> : Start at <url>.
    - -slow : Pause briefly before getting a page. This can be useful when debugging.
    Each option has a corresponding handleXXXCommandLineOption function that will be called when the option is found. Subclasses may find it convenient to change how options are handled by overriding those methods instead of this one. Only the above options will be dealt with by this function, and the input array will remain unchanged. Note that if the flag for an option appears in the input array, any value associated with that option will be assumed to follow. Thus if a "-c" flag appears in args, the next value in args will be blindly treated as the count.
    Parameters:
    args - Array of arguments as passed in from the command line.
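
    The dispatch described above might look roughly like the following sketch. It is an assumption-laden illustration, not the actual implementation: `ArgsSketch`, its fields, and the shortened handler names are hypothetical stand-ins, and the handlers store plain values instead of the real `retriever`, `saveDir`, and `Link` objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Spider's option handling; not the ir.webutils code. */
final class ArgsSketch {
    boolean safe;
    boolean slow;
    String saveDir;
    int maxCount = 10000; // default taken from the documentation of main
    final List<String> startUrls = new ArrayList<>();

    /** Scans args without modifying it; unrecognized entries are ignored. */
    void processArgs(String[] args) {
        int i = 0;
        while (i < args.length) {
            switch (args[i]) {
                case "-safe": handleSafe(); break;
                case "-slow": handleSlow(); break;
                // The next element is taken blindly as the option's value,
                // so a trailing flag with no value fails with an
                // ArrayIndexOutOfBoundsException.
                case "-d": handleD(args[++i]); break;
                case "-c": handleC(args[++i]); break;
                case "-u": handleU(args[++i]); break;
                default: break;
            }
            i++;
        }
    }

    // Subclasses would override these handlers rather than processArgs itself.
    protected void handleSafe() { safe = true; }
    protected void handleSlow() { slow = true; }
    protected void handleD(String value) { saveDir = value; }
    protected void handleC(String value) { maxCount = Integer.parseInt(value); }
    protected void handleU(String value) { startUrls.add(value); }
}
```

    Keeping one handler per option means a subclass can change how, say, "-u" is interpreted without re-implementing the scanning loop.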
  - handleSafeCommandLineOption
```
protected void handleSafeCommandLineOption()
```
    Called when "-safe" is passed in on the command line.
    This implementation sets retriever to a SafeHTMLPageRetriever.
  - handleDCommandLineOption
```
protected void handleDCommandLineOption(java.lang.String value)
```
    Called when "-d" is passed in on the command line.
    This implementation sets saveDir to value.
    
    Parameters:
    value - The value associated with the "-d" option.
  - handleCCommandLineOption
```
protected void handleCCommandLineOption(java.lang.String value)
```
    Called when "-c" is passed in on the command line.
    This implementation sets maxCount to the integer represented by value.
    
    Parameters:
    value - The value associated with the "-c" option.
  - handleUCommandLineOption
```
protected void handleUCommandLineOption(java.lang.String value)
```
    Called when "-u" is passed in on the command line.
    This implementation adds value to the list of links to visit.
    
    Parameters:
    value - The value associated with the "-u" option.
  - handleSlowCommandLineOption
```
protected void handleSlowCommandLineOption()
```
    Called when "-slow" is passed in on the command line.
    This implementation sets a flag that will be used in go to pause briefly before downloading each page.
  - doCrawl
```
public void doCrawl()
```
    Performs the crawl. Should be called after processArgs has been called. Assumes that the starting URL has been set.
    This implementation iterates through a list of links to visit. For each link, a check is performed using visited to make sure the link has not already been visited. If it has not, the link is added to visited and the page is retrieved. If access to the page has been disallowed by a robots.txt file or a robots META tag, or if there is some other problem retrieving the page, then the page is skipped. If the page is downloaded successfully, indexPage and getNewLinks are called if allowed. The crawl terminates when there are no more links to visit or count >= maxCount.
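
    The loop described above can be sketched over an in-memory link graph. This is a simplified illustration under stated assumptions, not the ir.webutils implementation: `CrawlLoopSketch` and `crawl` are hypothetical names, URLs are plain strings, and a URL missing from the map stands in for a page that is disallowed or cannot be retrieved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the crawl loop; not the ir.webutils code. */
final class CrawlLoopSketch {

    /**
     * Breadth-first walk. graph maps each URL to the URLs it links to.
     * Returns the URLs that were "indexed", in crawl order.
     */
    static List<String> crawl(String start, Map<String, List<String>> graph, int maxCount) {
        Deque<String> linksToVisit = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        List<String> indexed = new ArrayList<>();
        linksToVisit.add(start);

        // Stop when the queue is empty or the page budget is used up.
        while (!linksToVisit.isEmpty() && indexed.size() < maxCount) {
            String url = linksToVisit.removeFirst();
            if (!visited.add(url)) {
                continue; // already visited
            }
            List<String> outLinks = graph.get(url);
            if (outLinks == null) {
                continue; // retrieval failed or was disallowed: skip the page
            }
            indexed.add(url);              // stands in for indexPage
            linksToVisit.addAll(outLinks); // stands in for getNewLinks
        }
        return indexed;
    }
}
```

    For a graph where `a` links to `b` and `c`, `b` links to `a` and `d`, and `c` links to `d`, calling `crawl("a", graph, 3)` yields `[a, b, c]`: the first-in, first-out queue gives the breadth-first order, and the budget stops the crawl before `d` is reached.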
  - linkToHTMLPage
```
protected boolean linkToHTMLPage(Link link)
```
    Check if this is a link to an HTML page.
    
    Returns:
    true if the link points to a directory or is clearly an HTML page
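
    The documentation does not say how this check is made. One plausible heuristic, shown purely as an assumption, inspects the path of the URL: `HtmlLinkCheckSketch` and `looksLikeHtmlPage` are hypothetical names, and `java.net.URI` stands in for the `Link` class.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical heuristic; the real linkToHTMLPage may work differently. */
final class HtmlLinkCheckSketch {

    static boolean looksLikeHtmlPage(URI link) {
        String path = link.getPath();
        if (path == null) {
            return false; // opaque URI such as mailto:, not a page
        }
        if (path.isEmpty() || path.endsWith("/")) {
            return true; // site root or directory
        }
        // Compare extensions case-insensitively.
        String lower = path.toLowerCase(Locale.ROOT);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }
}
```

    A check of this kind lets a crawler skip images, archives, and other non-HTML resources without downloading them first.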
  - getNewLinks
```
protected java.util.List<Link> getNewLinks(HTMLPage page)
```
    Returns a list of links to follow from a given page. Subclasses can use this method to direct the spider's path over the web by returning a subset of the links on the page.
    
    Parameters:
    page - The current page.
    
    Returns:
    Links to be visited from this page
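
    As an example of returning a subset of links, the following sketch keeps only links on the same host as the current page. It is a stand-alone illustration, not an override of the real method: `SameHostFilterSketch` and `sameHostLinks` are hypothetical names, and `java.net.URI` is used in place of `HTMLPage` and `Link`.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical link filter; not part of ir.webutils. */
final class SameHostFilterSketch {

    /** Returns the links whose host matches the host of the current page. */
    static List<URI> sameHostLinks(URI page, List<URI> links) {
        List<URI> kept = new ArrayList<>();
        String host = page.getHost();
        if (host == null) {
            return kept; // no host to compare against
        }
        for (URI link : links) {
            // Host names are case-insensitive; a link without a host is dropped.
            if (host.equalsIgnoreCase(link.getHost())) {
                kept.add(link);
            }
        }
        return kept;
    }
}
```

    A subclass could apply a filter like this to the links found on the page before returning them, which confines the crawl to a single site.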
  - indexPage
```
protected void indexPage(HTMLPage page)
```
    "Indexes" an HTMLPage. This version just writes it out to a file in the specified directory with a "P.html" file name.
    
    Parameters:
    page - An HTMLPage that contains the page to index.
  - main
```
public static void main(java.lang.String[] args)
```
    Spider the web according to the following command options:
    - -safe : Check for and obey robots.txt and robots META tag directives.
    - -d <directory> : Store indexed files in <directory>.
    - -c <maxCount> : Store at most <maxCount> files (default is 10,000).
    - -u <url> : Start at <url>.
    - -slow : Pause briefly before getting a page. This can be useful when debugging.
