java.lang.Object
  ir.webutils.Spider
    ir.webutils.BeamSearchSpider
public class BeamSearchSpider
extends Spider
A spider that uses heuristic beam search to find a web page that contains a set of "want strings", using a set of "help strings" to guide the search. It searches a space of ScoredAnchoredLinks for a page that satisfies the goal, i.e., one that contains all of the "want strings".
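The search strategy described above can be illustrated with a minimal, self-contained sketch. It is not the ir.webutils implementation: pages are plain strings, and the `outLinks`, `score`, and `goal` parameters are hypothetical stand-ins for link extraction, the LinkHeuristic, and the PageGoal. The sketch assumes the queue is re-sorted and pruned to the beam width after each expansion.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

/** Hypothetical, simplified beam search over string-named "pages". */
class BeamSearchSketch {

    /**
     * Returns the first node that satisfies the goal, or null if the queue
     * empties or maxCount nodes have been expanded without success.
     */
    static String search(String start,
                         Function<String, List<String>> outLinks,
                         ToDoubleFunction<String> score,
                         Predicate<String> goal,
                         int beamSize,
                         int maxCount) {
        Comparator<String> byScore = Comparator.comparingDouble(score);
        List<String> queue = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        queue.add(start);
        int count = 0;
        while (!queue.isEmpty() && count < maxCount) {
            String node = queue.remove(0);        // best-scored entry first
            if (!visited.add(node)) {
                continue;                         // already expanded
            }
            count++;
            if (goal.test(node)) {
                return node;
            }
            for (String next : outLinks.apply(node)) {
                if (!visited.contains(next)) {
                    queue.add(next);
                }
            }
            // Sort by descending score, then keep only the best beamSize entries.
            queue.sort(byScore.reversed());
            if (queue.size() > beamSize) {
                queue.subList(beamSize, queue.size()).clear();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("d"),
            "c", List.of("goal"),
            "d", List.of());
        Map<String, Double> scores =
            Map.of("b", 0.1, "c", 0.9, "d", 0.0, "goal", 1.0);
        String found = search("a",
            n -> graph.getOrDefault(n, List.of()),
            n -> scores.getOrDefault(n, 0.0),
            n -> n.equals("goal"),
            2, 100);
        System.out.println(found); // prints "goal"
    }
}
```

A narrow beam keeps memory bounded but can discard the only path to the goal, which is why the beam width is exposed as an option.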
Field Summary | |
---|---|
protected int | beamSize - The beam width to use. |
protected PageGoal | goal - Defines the goal predicate over HTMLPage objects that is to be satisfied. |
protected HTMLPage | goalPage - The page found that satisfies the goal. |
protected LinkHeuristic | heuristic - Defines the heuristic that is used to sort ScoredAnchoredLink objects in the queue. |
Fields inherited from class ir.webutils.Spider |
---|
count, linksToVisit, maxCount, retriever, saveDir, slow, visited |
Constructor Summary | |
---|---|
BeamSearchSpider()
|
Method Summary | |
---|---|
protected LinkHeuristic | constructLinkHeuristic() - Returns the default LinkHeuristic. |
void | doCrawl() - Crawls the web using beam search with the given heuristic to find a page that satisfies the goal. |
protected java.util.List<Link> | getNewLinks(HTMLPage page) - Returns a list of scored links to follow from a given page. |
void | go(java.lang.String[] args) - Interprets command line arguments and performs the crawl. |
protected void | handleBCommandLineOption(java.lang.String value) - Called when "-b" is passed in on the command line to set the beam width. |
protected void | handleHCommandLineOption(java.lang.String value) - Called when "-h" is passed in on the command line to set "help strings". |
protected void | handleUCommandLineOption(java.lang.String value) - Called when "-u" is passed in on the command line. |
protected void | handleWCommandLineOption(java.lang.String value) - Called when "-w" is passed in on the command line to set "want strings". |
static void | main(java.lang.String[] args) - Searches the web using beam search according to the following command options. -safe: Check for and obey robots.txt and robots META tag directives. -c <maxCount>: Download at most <maxCount> pages (default is 10,000). -u <url>: Start at <url>. -w <strings>: <strings> should be a list of "want strings" separated by ";" characters. -h <strings>: <strings> should be a list of "help strings" separated by ";" characters. -b <size>: Use a beam width of the given <size> (default is 100). -slow: Pause briefly before getting a page. |
void | processArgs(java.lang.String[] args) - Processes command-line arguments. |
protected void | scoreLinks(java.util.List<Link> links, HTMLPage page) - Uses the heuristic to score each of the new links on a given page that was expanded. |
Methods inherited from class ir.webutils.Spider |
---|
handleCCommandLineOption, handleDCommandLineOption, handleSafeCommandLineOption, handleSlowCommandLineOption, indexPage, linkToHTMLPage |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected PageGoal goal
protected LinkHeuristic heuristic
protected int beamSize
protected HTMLPage goalPage
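As an illustration of the goal field, the sketch below shows one way a "contains all want strings" predicate could be written, together with parsing of a ";"-separated option value. The class and method names are hypothetical, the page is reduced to its text, and case-insensitive matching is an assumption; the actual PageGoal interface is not shown in this document.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical goal test: a page satisfies the goal if it contains every want string. */
class WantStringGoalSketch {

    /** Splits a ";"-separated option value, dropping surrounding spaces and empty entries. */
    static List<String> parseStrings(String value) {
        return Arrays.stream(value.split(";"))
                     .map(String::trim)
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toList());
    }

    /** Assumption: matching is case-insensitive substring containment. */
    static boolean satisfies(String pageText, List<String> wantStrings) {
        String lower = pageText.toLowerCase();
        for (String want : wantStrings) {
            if (!lower.contains(want.toLowerCase())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> wants = parseStrings("beam search; spider");
        System.out.println(satisfies("A spider that uses heuristic beam search.", wants));
    }
}
```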
Constructor Detail |
---|
public BeamSearchSpider()
Method Detail |
---|
public void go(java.lang.String[] args)
Overrides: go in class Spider
Parameters: args - Command line arguments.
public void processArgs(java.lang.String[] args)
Each option handled by this function has a corresponding handleXXXCommandLineOption function that is called when the option is found. Subclasses may find it convenient to change how options are handled by overriding those methods instead of this one. Only those options are dealt with by this function, and the input array remains unchanged. Note that if the flag for an option appears in the input array, any value associated with that option is assumed to follow it. Thus, if a "-c" flag appears in args, the next value in args is blindly treated as the count.
Overrides: processArgs in class Spider
Parameters: args - Array of arguments as passed in from the command line.
protected void handleUCommandLineOption(java.lang.String value)
This implementation adds value to the list of links to visit. This version creates an initial ScoredAnchoredLink.
Overrides: handleUCommandLineOption in class Spider
Parameters: value - The value associated with the "-u" option.
protected void handleWCommandLineOption(java.lang.String value)
protected void handleHCommandLineOption(java.lang.String value)
protected LinkHeuristic constructLinkHeuristic()
protected void handleBCommandLineOption(java.lang.String value)
public void doCrawl()
Overrides: doCrawl in class Spider
protected java.util.List<Link> getNewLinks(HTMLPage page)
Overrides: getNewLinks in class Spider
Parameters: page - The current page.
protected void scoreLinks(java.util.List<Link> links, HTMLPage page)
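To make the idea of scoring links concrete, here is a minimal sketch of one possible heuristic. It assumes, purely for illustration, that a link's score is the number of "help strings" occurring in its anchor text; the LinkHeuristic used by this class may compute scores differently. `ScoredLink` is a hypothetical stand-in, not the ScoredAnchoredLink class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical link scoring by help-string matches in the anchor text. */
class ScoreLinksSketch {

    /** Stand-in for a scored link; not the ir.webutils class. */
    static final class ScoredLink {
        final String url;
        final String anchorText;
        double score;

        ScoredLink(String url, String anchorText) {
            this.url = url;
            this.anchorText = anchorText;
        }
    }

    /** Assumed heuristic: count of help strings found in the anchor text, ignoring case. */
    static double score(String anchorText, List<String> helpStrings) {
        String lower = anchorText.toLowerCase();
        int hits = 0;
        for (String help : helpStrings) {
            if (lower.contains(help.toLowerCase())) {
                hits++;
            }
        }
        return hits;
    }

    /** Scores every link in place, then orders the list best-first. */
    static void scoreLinks(List<ScoredLink> links, List<String> helpStrings) {
        for (ScoredLink link : links) {
            link.score = score(link.anchorText, helpStrings);
        }
        links.sort(Comparator.comparingDouble((ScoredLink l) -> l.score).reversed());
    }

    public static void main(String[] args) {
        List<ScoredLink> links = new ArrayList<>();
        links.add(new ScoredLink("http://example.com/a", "Contact us"));
        links.add(new ScoredLink("http://example.com/b", "Faculty research"));
        scoreLinks(links, List.of("faculty", "research"));
        for (ScoredLink link : links) {
            System.out.println(link.score + " " + link.url);
        }
    }
}
```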
public static void main(java.lang.String[] args)
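The option handling described for processArgs can be sketched as a flag-to-handler dispatch in which subclasses customize behavior by overriding handlers. Everything below is a hypothetical illustration: `BaseSpider` and `BeamSpider` are stand-ins rather than the real classes, only "-c", "-u", and "-b" are covered, and the bounds check before reading a value is an addition of this sketch. The defaults of 10,000 pages and a beam width of 100 are taken from the option descriptions for main.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of flag dispatch to overridable handleXXXCommandLineOption methods. */
class OptionDispatchSketch {

    static class BaseSpider {
        protected int maxCount = 10000;
        protected final List<String> linksToVisit = new ArrayList<>();

        /** Scans args without modifying it; the entry after a flag is taken as its value. */
        public void processArgs(String[] args) {
            for (int i = 0; i < args.length; i++) {
                boolean hasValue = i + 1 < args.length;
                if (args[i].equals("-c") && hasValue) {
                    handleCCommandLineOption(args[++i]);
                } else if (args[i].equals("-u") && hasValue) {
                    handleUCommandLineOption(args[++i]);
                }
            }
        }

        protected void handleCCommandLineOption(String value) {
            maxCount = Integer.parseInt(value);
        }

        protected void handleUCommandLineOption(String value) {
            linksToVisit.add(value);
        }
    }

    static class BeamSpider extends BaseSpider {
        protected int beamSize = 100;

        /** Lets the base class handle its options, then looks for "-b". */
        @Override
        public void processArgs(String[] args) {
            super.processArgs(args);
            for (int i = 0; i + 1 < args.length; i++) {
                if (args[i].equals("-b")) {
                    handleBCommandLineOption(args[++i]);
                }
            }
        }

        protected void handleBCommandLineOption(String value) {
            beamSize = Integer.parseInt(value);
        }
    }

    public static void main(String[] args) {
        String[] sample = {"-u", "http://example.com/", "-b", "25", "-c", "50"};
        BeamSpider spider = new BeamSpider();
        spider.processArgs(sample);
        System.out.println(spider.linksToVisit + " " + spider.beamSize + " " + spider.maxCount);
        System.out.println(Arrays.toString(sample)); // the input array is left unchanged
    }
}
```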