public class YahooSpider
extends java.lang.Object
Modifier and Type | Field and Description |
---|---|
protected java.util.List<Link> | categoryLinks: List of category links found for the current directory page. |
java.util.Map<Link,java.util.List<Link>> | categoryLinksMap: The HashMap for storing categoryLinks for already downloaded Links. |
protected int | count: The number of pages indexed. |
protected java.lang.String | filePrefix: Prefix to add to the name of all saved files for the current category. |
protected int | maxCount: The number of pages to be found and indexed. |
protected java.util.Random | random: Random number generator to use. |
protected HTMLPageRetriever | retriever: The object to be used to retrieve pages. |
protected java.io.File | saveDir: The directory to save the downloaded files to. |
protected java.util.List<Link> | siteLinks: List of site links found for the current directory page. |
java.util.Map<Link,java.util.List<Link>> | siteLinksMap: The HashMap for storing siteLinks for already downloaded Links. |
protected boolean | slow: Flag to purposely slow the crawl for debugging purposes. |
protected Link | topCategoryLink: Link for the main topic Yahoo category. |
protected java.util.HashSet<Link> | visitedSites: The sites that have already been indexed. |
Constructor and Description |
---|
YahooSpider() |
Modifier and Type | Method and Description |
---|---|
void | doCrawl(): Performs the crawl. |
protected Link | getRandomLink(java.util.List<Link> links): Pick a random link from a list of links. |
void | go(java.lang.String[] args): Checks command line arguments and performs the crawl. |
protected void | handleCCommandLineOption(java.lang.String value): Called when "-c" is passed in on the command line. |
protected void | handleDCommandLineOption(java.lang.String value): Called when "-d" is passed in on the command line. |
protected void | handlePCommandLineOption(java.lang.String value): Called when "-p" is passed in on the command line. |
protected void | handleSlowCommandLineOption(): Called when "-slow" is passed in on the command line. |
protected void | handleUCommandLineOption(java.lang.String value): Called when "-u" is passed in on the command line. |
protected void | indexPage(HTMLPage page): "Indexes" an HTMLPage. |
protected boolean | linkToHTMLPage(Link link): Check if this is a link to an HTML page. |
static void | main(java.lang.String[] args): Spider Yahoo category to randomly collect pages according to the following command options:<br>-d <directory> : Store indexed files in <directory>.<br>-c <maxCount> : Find <maxCount> files (default is 10,000).<br>-u <url> : Start at Yahoo directory page given by <url>.<br>-p <prefix> : Prefix saved file names with <prefix>.<br>-slow : Pause briefly before getting a page. |
void | processArgs(java.lang.String[] args): Processes command-line arguments. |
protected Link topCategoryLink
protected java.lang.String filePrefix
protected java.util.List<Link> categoryLinks
protected java.util.List<Link> siteLinks
protected boolean slow
protected HTMLPageRetriever retriever
protected java.io.File saveDir
protected int count
protected int maxCount
public java.util.Map<Link,java.util.List<Link>> categoryLinksMap
public java.util.Map<Link,java.util.List<Link>> siteLinksMap
protected java.util.HashSet<Link> visitedSites
protected java.util.Random random
public void go(java.lang.String[] args)
This implementation calls processArgs and doCrawl.
args - Command line arguments.
public void processArgs(java.lang.String[] args)
The following options are handled by this function: -d, -c, -u, -p, and -slow. Each option has a corresponding handleXXXCommandLineOption function that will be called when the option is found. Subclasses may find it convenient to change how options are handled by overriding those methods instead of this one. Only the above options will be dealt with by this function, and the input array will remain unchanged. Note that if the flag for an option appears in the input array, any value associated with that option will be assumed to follow. Thus if a "-c" flag appears in args, the next value in args will be blindly treated as the count.
args - Array of arguments as passed in from the command line.
protected void handleDCommandLineOption(java.lang.String value)
This implementation sets saveDir to value.
value - The value associated with the "-d" option.
protected void handleCCommandLineOption(java.lang.String value)
This implementation sets maxCount to the integer represented by value.
value - The value associated with the "-c" option.
protected void handleUCommandLineOption(java.lang.String value)
This implementation sets the top level Yahoo directory category link to value.
value - The value associated with the "-u" option.
protected void handlePCommandLineOption(java.lang.String value)
protected void handleSlowCommandLineOption()
This implementation sets a flag that will be used in go to pause briefly before downloading each page.
public void doCrawl()
Performs the crawl. Assumes that processArgs has been called and that the starting URL has been set. This implementation iterates until count >= maxCount.
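The stopping rule described for doCrawl can be illustrated with a small stand-alone loop. This is only a sketch under stated assumptions: the class name CrawlLoopSketch, the use of plain strings in place of Link, the in-memory map standing in for downloaded directory pages, and the attempt limit are all hypothetical and are not taken from YahooSpider itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class CrawlLoopSketch {
    // Hypothetical stand-in for the directory: each category maps to the
    // site links listed on it. Strings replace the Link type for brevity.
    static List<String> crawl(Map<String, List<String>> siteLinksMap,
                              int maxCount, Random random) {
        List<String> categories = new ArrayList<>(siteLinksMap.keySet());
        Set<String> visitedSites = new HashSet<>();
        List<String> indexed = new ArrayList<>();
        if (categories.isEmpty()) {
            return indexed;
        }
        int count = 0;
        // Assumption: an attempt limit so the sketch stops even when fewer
        // than maxCount distinct sites exist.
        long maxAttempts = 1000L * Math.max(1, maxCount);
        for (long attempts = 0; count < maxCount && attempts < maxAttempts; attempts++) {
            String category = categories.get(random.nextInt(categories.size()));
            List<String> sites = siteLinksMap.get(category);
            if (sites.isEmpty()) {
                continue;
            }
            String site = sites.get(random.nextInt(sites.size()));
            // Set.add returns false for sites that were already indexed.
            if (visitedSites.add(site)) {
                indexed.add(site); // stands in for retrieving and indexing the page
                count++;
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> directory = new LinkedHashMap<>();
        directory.put("Sports", Arrays.asList("http://example.com/s1", "http://example.com/s2"));
        directory.put("Science", Arrays.asList("http://example.com/c1"));
        System.out.println(crawl(directory, 2, new Random()));
    }
}
```

The visited set plays the role that visitedSites appears to play in the class: a site counts toward count only the first time it is selected.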
protected Link getRandomLink(java.util.List<Link> links)
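A uniform random pick from a list, as the summary for getRandomLink describes, might look like the following. This is a minimal sketch, not the class's actual code: the generic element type, the explicit Random parameter, and the null result for an empty list are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class RandomLinkSketch {
    // Returns one element chosen uniformly at random.
    // Assumption: null is returned for a null or empty list; the documented
    // method does not say what happens in that case.
    static <T> T getRandomLink(List<T> links, Random random) {
        if (links == null || links.isEmpty()) {
            return null;
        }
        return links.get(random.nextInt(links.size()));
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://example.com/a", "http://example.com/b", "http://example.com/c");
        System.out.println(getRandomLink(links, new Random()));
    }
}
```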
protected boolean linkToHTMLPage(Link link)
protected void indexPage(HTMLPage page)
"Indexes" an HTMLPage. This version just writes it out to a file in the specified directory with a filePrefix<count>.html file name.
page - An HTMLPage that contains the page to index.
public static void main(java.lang.String[] args)
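The command options accepted by main, and the handler-per-option design described for processArgs, can be sketched as follows. The class ArgDispatchSketch and its String-typed fields are hypothetical simplifications (the real class stores a java.io.File and a Link); only the flag names, the handler method names, and the 10,000 default come from the documentation above.

```java
public class ArgDispatchSketch {
    protected String saveDir;
    protected int maxCount = 10000; // default stated for the -c option
    protected String startUrl;      // simplified stand-in for topCategoryLink
    protected String filePrefix;
    protected boolean slow;

    // Scans the arguments without modifying the array. The value after a
    // flag is taken as-is, with no validation; unknown arguments are
    // ignored (an assumption of this sketch).
    public void processArgs(String[] args) {
        for (int i = 0; i < args.length; i++) {
            String flag = args[i];
            boolean hasValue = i + 1 < args.length;
            if ("-slow".equals(flag)) {
                handleSlowCommandLineOption();
            } else if ("-d".equals(flag) && hasValue) {
                handleDCommandLineOption(args[++i]);
            } else if ("-c".equals(flag) && hasValue) {
                handleCCommandLineOption(args[++i]);
            } else if ("-u".equals(flag) && hasValue) {
                handleUCommandLineOption(args[++i]);
            } else if ("-p".equals(flag) && hasValue) {
                handlePCommandLineOption(args[++i]);
            }
        }
    }

    // Subclasses can override any handler to change how one option is treated.
    protected void handleDCommandLineOption(String value) { saveDir = value; }
    protected void handleCCommandLineOption(String value) { maxCount = Integer.parseInt(value); }
    protected void handleUCommandLineOption(String value) { startUrl = value; }
    protected void handlePCommandLineOption(String value) { filePrefix = value; }
    protected void handleSlowCommandLineOption() { slow = true; }

    public static void main(String[] args) {
        ArgDispatchSketch sketch = new ArgDispatchSketch();
        sketch.processArgs(args);
        System.out.println("saveDir=" + sketch.saveDir + " maxCount=" + sketch.maxCount
                + " startUrl=" + sketch.startUrl + " filePrefix=" + sketch.filePrefix
                + " slow=" + sketch.slow);
    }
}
```

Keeping one small handler per flag means a subclass that wants, for example, a different interpretation of "-c" can override just that handler and leave the scanning loop untouched.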