java.lang.Object ir.webutils.Spider
public class Spider
Spider defines a framework for writing a web crawler. Users can change the behavior of the spider by overriding methods. The default spider performs a breadth-first crawl starting from a given URL, up to a specified maximum number of pages, and saves (caches) the pages in a given directory. It also adds a "BASE" HTML tag to cached pages so that links can be followed from the cached version.
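The effect of the "BASE" tag can be illustrated with a minimal sketch. The class BaseTagSketch and the helper addBaseTag are hypothetical names introduced for illustration only; they are not part of ir.webutils, and the actual Spider may insert the tag differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BaseTagSketch {
    // Opening <head> tag with optional attributes; does not match <header>.
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    /** Hypothetical helper: returns html with a BASE element pointing at pageUrl. */
    static String addBaseTag(String html, String pageUrl) {
        String base = "<base href=\"" + pageUrl + "\">";
        Matcher m = HEAD_OPEN.matcher(html);
        if (m.find()) {
            // Insert immediately after the opening <head> tag.
            return html.substring(0, m.end()) + base + html.substring(m.end());
        }
        // No <head> found: prepend the tag (an assumption of this sketch).
        return base + html;
    }

    public static void main(String[] args) {
        String page = "<html><head><title>T</title></head>"
                + "<body><a href=\"next.html\">next</a></body></html>";
        System.out.println(addBaseTag(page, "http://example.com/docs/index.html"));
    }
}
```

With the tag in place, a browser opening the cached copy resolves the relative link next.html against the original page URL rather than against the cache directory.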
Field Summary | |
---|---|
protected int |
count
The number of pages indexed. |
protected java.util.List<Link> |
linksToVisit
The queue of links maintained by the spider. |
protected int |
maxCount
The maximum number of pages to be indexed. |
protected HTMLPageRetriever |
retriever
The object to be used to retrieve pages. |
protected java.io.File |
saveDir
The directory to save the downloaded files to. |
protected boolean |
slow
Flag to purposely slow the crawl for debugging purposes. |
protected java.util.HashSet<Link> |
visited
The URLs that have already been visited. |
Constructor Summary | |
---|---|
Spider()
|
Method Summary | |
---|---|
void |
doCrawl()
Performs the crawl. |
protected java.util.List<Link> |
getNewLinks(HTMLPage page)
Returns a list of links to follow from a given page. |
void |
go(java.lang.String[] args)
Checks command line arguments and performs the crawl. |
protected void |
handleCCommandLineOption(java.lang.String value)
Called when "-c" is passed in on the command line. |
protected void |
handleDCommandLineOption(java.lang.String value)
Called when "-d" is passed in on the command line. |
protected void |
handleSafeCommandLineOption()
Called when "-safe" is passed in on the command line. |
protected void |
handleSlowCommandLineOption()
Called when "-slow" is passed in on the command line. |
protected void |
handleUCommandLineOption(java.lang.String value)
Called when "-u" is passed in on the command line. |
protected void |
indexPage(HTMLPage page)
"Indexes" an HTMLPage. |
protected boolean |
linkToHTMLPage(Link link)
Check if this is a link to an HTML page. |
static void |
main(java.lang.String[] args)
Spider the web according to the following command options: -safe : Check for and obey robots.txt and robots META tag directives. -d <directory> : Store indexed files in <directory>. -c <maxCount> : Store at most <maxCount> files (default is 10,000). -u <url> : Start at <url>. -slow : Pause briefly before getting a page. |
void |
processArgs(java.lang.String[] args)
Processes command-line arguments. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected java.util.List<Link> linksToVisit
protected boolean slow
protected HTMLPageRetriever retriever
protected java.io.File saveDir
protected int count
protected int maxCount
protected java.util.HashSet<Link> visited
Constructor Detail |
---|
public Spider()
Method Detail |
---|
public void go(java.lang.String[] args)
Checks command line arguments and performs the crawl. This implementation calls processArgs and doCrawl.
args - Command line arguments.
public void processArgs(java.lang.String[] args)
Processes command-line arguments. The following options are handled by this function: -safe, -d, -c, -u, and -slow. Each option has a corresponding handleXXXCommandLineOption function that will be called when the option is found. Subclasses may find it convenient to change how options are handled by overriding those methods instead of this one. Only the above options will be dealt with by this function, and the input array will remain unchanged. Note that if the flag for an option appears in the input array, any value associated with that option will be assumed to follow. Thus if a "-c" flag appears in args, the next value in args will be blindly treated as the count.
args - Array of arguments as passed in from the command line.
protected void handleSafeCommandLineOption()
Called when "-safe" is passed in on the command line. This implementation sets retriever to a SafeHTMLPageRetriever.
protected void handleDCommandLineOption(java.lang.String value)
Called when "-d" is passed in on the command line. This implementation sets saveDir to value.
value - The value associated with the "-d" option.
protected void handleCCommandLineOption(java.lang.String value)
Called when "-c" is passed in on the command line. This implementation sets maxCount to the integer represented by value.
value - The value associated with the "-c" option.
protected void handleUCommandLineOption(java.lang.String value)
Called when "-u" is passed in on the command line. This implementation adds value to the list of links to visit.
value - The value associated with the "-u" option.
protected void handleSlowCommandLineOption()
Called when "-slow" is passed in on the command line. This implementation sets a flag that will be used in go to pause briefly before downloading each page.
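The dispatch pattern described for processArgs and the handleXXXCommandLineOption methods can be sketched as follows. This is a self-contained illustration under stated assumptions, not the ir.webutils source: the class OptionSketch is hypothetical, plain strings stand in for Link and java.io.File, and a boolean named safe stands in for installing a SafeHTMLPageRetriever.

```java
import java.util.ArrayList;
import java.util.List;

class OptionSketch {
    protected final List<String> linksToVisit = new ArrayList<>(); // stands in for List<Link>
    protected String saveDir;        // stands in for java.io.File
    protected int maxCount = 10000;  // documented default for "-c"
    protected boolean slow = false;
    protected boolean safe = false;  // stands in for a SafeHTMLPageRetriever

    /** Walks the array without modifying it and dispatches to one handler per option. */
    public void processArgs(String[] args) {
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-safe": handleSafeCommandLineOption(); break;
                case "-slow": handleSlowCommandLineOption(); break;
                // The value is blindly taken from the next position; a trailing flag
                // with no value throws ArrayIndexOutOfBoundsException in this sketch.
                case "-d": handleDCommandLineOption(args[++i]); break;
                case "-c": handleCCommandLineOption(args[++i]); break;
                case "-u": handleUCommandLineOption(args[++i]); break;
                default: break; // other arguments are ignored
            }
        }
    }

    protected void handleSafeCommandLineOption() { safe = true; }
    protected void handleSlowCommandLineOption() { slow = true; }
    protected void handleDCommandLineOption(String value) { saveDir = value; }
    protected void handleCCommandLineOption(String value) { maxCount = Integer.parseInt(value); }
    protected void handleUCommandLineOption(String value) { linksToVisit.add(value); }

    public static void main(String[] args) {
        OptionSketch s = new OptionSketch();
        s.processArgs(new String[] {"-safe", "-u", "http://example.com/", "-d", "cache", "-c", "25"});
        System.out.println(s.linksToVisit + " " + s.saveDir + " " + s.maxCount + " " + s.safe);
    }
}
```

Because each option has its own handler, a subclass can change how a single option is treated by overriding one small method and leaving the parsing loop alone. With the real class, the equivalent command line would pass the same flags to ir.webutils.Spider, for example -safe -u <url> -d <directory> -c <maxCount>.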
public void doCrawl()
Performs the crawl. Should be called after processArgs has been called. Assumes that the starting URL has been set. This implementation iterates through a list of links to visit. For each link, a check is performed using visited to make sure the link has not already been visited. If it has not, the link is added to visited and the page is retrieved. If access to the page has been disallowed by a robots.txt file or a robots META tag, or if there is some other problem retrieving the page, then the page is skipped. If the page is downloaded successfully, indexPage and getNewLinks are called if allowed. go terminates when there are no more links to visit or count >= maxCount.
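The crawl loop described above can be sketched in a self-contained form. The sketch makes several assumptions that are not in the Spider source: the class CrawlSketch is hypothetical, an in-memory map from URL to outgoing links stands in for the web, a URL missing from the map simulates a page that is disallowed or cannot be retrieved, and appending to a list stands in for indexPage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlSketch {
    private final Map<String, List<String>> web;   // hypothetical stand-in for the web
    private final LinkedList<String> linksToVisit = new LinkedList<>();
    private final Set<String> visited = new HashSet<>();
    private final List<String> indexed = new ArrayList<>();
    private final int maxCount;

    CrawlSketch(Map<String, List<String>> web, String startUrl, int maxCount) {
        this.web = web;
        this.maxCount = maxCount;
        linksToVisit.add(startUrl);
    }

    /** Breadth-first crawl; returns the pages "indexed", in visiting order. */
    List<String> doCrawl() {
        while (!linksToVisit.isEmpty() && indexed.size() < maxCount) {
            String link = linksToVisit.removeFirst();   // FIFO queue gives breadth-first order
            if (!visited.add(link)) {
                continue;                               // already visited
            }
            List<String> outgoing = web.get(link);      // "retrieve" the page
            if (outgoing == null) {
                continue;                               // disallowed or failed: skip the page
            }
            indexed.add(link);                          // stands in for indexPage
            linksToVisit.addAll(outgoing);              // stands in for getNewLinks
        }
        return indexed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("a", Arrays.asList("b", "c", "missing"));
        web.put("b", Arrays.asList("a", "d"));
        web.put("c", Arrays.asList("d"));
        web.put("d", Collections.<String>emptyList());
        System.out.println(new CrawlSketch(web, "a", 3).doCrawl()); // prints [a, b, c]
    }
}
```

Adding a link to visited before retrieval, as the description specifies, means a page that fails once is not retried later in the same crawl.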
protected boolean linkToHTMLPage(Link link)
Check if this is a link to an HTML page.
protected java.util.List<Link> getNewLinks(HTMLPage page)
Returns a list of links to follow from a given page.
page - The current page.
protected void indexPage(HTMLPage page)
"Indexes" an HTMLPage. This version just writes it out to a file in the specified directory with a "P
page - An HTMLPage that contains the page to index.
public static void main(java.lang.String[] args)