A C D E G H I L M O P R S T U V W

A

absoluteCopy - Variable in class ir.webutils.LinkExtractor
Copy of the text of the page with absolute links
absoluteText - Variable in class ir.webutils.HTMLPage
Copy of the text with relative links replaced by absolute ones
add(Object) - Method in class ir.webutils.RobotExclusionSet
 
addAnchorText(String, AnchoredLink, boolean) - Method in class ir.webutils.AnchoredSpider
Add anchor text of the link to the end of a page file
addEndSlash(URL) - Static method in class ir.webutils.LinkExtractor
If the URL looks like a directory rather than a file, adds a "/" at the end so that it acts as a proper base URL for completing URLs in this page
addLink(MutableAttributeSet, HTML.Attribute) - Method in class ir.webutils.LinkExtractor
Retrieves a link from an attribute set and completes it against the base URL.
addLink(MutableAttributeSet, HTML.Attribute) - Method in class ir.webutils.AnchoredLinkExtractor
Retrieves a link from an attribute set and completes it against the base URL.
AnchoredDirectorySpider - class ir.webutils.AnchoredDirectorySpider.
Anchored spider that limits itself to the directory it started in.
AnchoredDirectorySpider() - Constructor for class ir.webutils.AnchoredDirectorySpider
 
AnchoredLink - class ir.webutils.AnchoredLink.
Link with included anchor text
AnchoredLink(String) - Constructor for class ir.webutils.AnchoredLink
Construct a link with specified URL string
AnchoredLink(URL) - Constructor for class ir.webutils.AnchoredLink
Constructs a link with specified URL
AnchoredLink(URL, String) - Constructor for class ir.webutils.AnchoredLink
Constructs a link with specified URL and anchor text
AnchoredLinkExtractor - class ir.webutils.AnchoredLinkExtractor.
Extractor for AnchoredLink objects.
AnchoredLinkExtractor(HTMLPage) - Constructor for class ir.webutils.AnchoredLinkExtractor
Create an anchored link extractor for the given page
AnchoredSiteSpider - class ir.webutils.AnchoredSiteSpider.
An anchored spider that limits itself to a given site.
AnchoredSiteSpider() - Constructor for class ir.webutils.AnchoredSiteSpider
 
AnchoredSpider - class ir.webutils.AnchoredSpider.
 
AnchoredSpider() - Constructor for class ir.webutils.AnchoredSpider
 
anchorText - Variable in class ir.webutils.AnchoredLinkExtractor
Buffer to store anchor text encountered between an "a" start tag and end tag.
appendTag(StringBuffer, HTML.Tag, MutableAttributeSet) - Static method in class ir.webutils.LinkExtractor
Write this tag with attributes out to the buffer
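The addEndSlash and addLink entries above describe completing relative links against a base URL. The following is a minimal sketch of that idea, not the ir.webutils implementation: the class name BaseUrlSketch, the helper complete, and the "last path segment contains no dot" directory heuristic are assumptions made for illustration, and java.net.URI stands in for java.net.URL.

```java
import java.net.URI;
import java.net.URISyntaxException;

class BaseUrlSketch {
    /**
     * Appends "/" when the URL looks like a directory.
     * Assumed heuristic: the last path segment contains no '.'.
     */
    static URI addEndSlash(URI url) {
        String path = url.getPath() == null ? "" : url.getPath();
        if (path.endsWith("/")) {
            return url; // already usable as a base URL
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        if (lastSegment.contains(".")) {
            return url; // looks like a file such as index.html
        }
        try {
            return new URI(url.getScheme(), url.getAuthority(), path + "/",
                           url.getQuery(), url.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Completes a possibly relative href against the page URL. */
    static URI complete(URI pageUrl, String href) {
        return addEndSlash(pageUrl).resolve(href);
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.com/docs");
        System.out.println(complete(page, "page2.html"));
    }
}
```

Without the trailing slash, standard relative resolution would treat "docs" as a file name and replace it, which is why the slash matters for directory-like URLs.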

C

cleanURL(URL) - Static method in class ir.webutils.Link
Standardize URL
contains(Object) - Method in class ir.webutils.RobotExclusionSet
Checks to see if a path is prohibited by this set.
count - Variable in class ir.webutils.Spider
The number of pages indexed.
currentLink - Variable in class ir.webutils.AnchoredLinkExtractor
The current link being processed
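The contains entry for RobotExclusionSet says it checks whether a path is prohibited. Under the Robots Exclusion Protocol a Disallow rule is a path prefix, so one plausible reading is a prefix test. The sketch below assumes that reading; the class ExclusionSetSketch and its String-based methods are hypothetical and do not reproduce the Object-based signatures listed in this index.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class ExclusionSetSketch {
    // Disallowed path prefixes, as they might be read from robots.txt.
    private final Set<String> disallowedPrefixes = new LinkedHashSet<>();

    void add(String prefix) {
        disallowedPrefixes.add(prefix);
    }

    /** Returns true if the path starts with any non-empty disallowed prefix. */
    boolean contains(String path) {
        for (String prefix : disallowedPrefixes) {
            // An empty Disallow value conventionally means "nothing is disallowed".
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ExclusionSetSketch set = new ExclusionSetSketch();
        set.add("/private/");
        System.out.println(set.contains("/private/notes.html"));
        System.out.println(set.contains("/public/index.html"));
    }
}
```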

D

DirectorySpider - class ir.webutils.DirectorySpider.
Spider that limits itself to the directory it started in.
DirectorySpider() - Constructor for class ir.webutils.DirectorySpider
 
displayURL(URL) - Method in class ir.webutils.WebPageViewer
 
doCrawl() - Method in class ir.webutils.Spider
Performs the crawl.
doCrawl() - Method in class ir.webutils.AnchoredSpider
This crawl differs from the default by extracting AnchoredLink objects and attaching all anchor text to the bottom of the indexed pages to which they point.
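The Spider fields listed in this index (linksToVisit, visited, count, maxCount) suggest a queue-driven crawl. The sketch below shows one way such a loop might look. It is an assumption-laden illustration, not the doCrawl source: the class CrawlLoopSketch is hypothetical, and an in-memory map from page to out-links stands in for real page retrieval.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlLoopSketch {
    /**
     * Crawls breadth-first from start and returns the pages "indexed",
     * stopping once maxCount pages have been stored.
     */
    static List<String> crawl(String start, Map<String, List<String>> web, int maxCount) {
        Deque<String> linksToVisit = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        List<String> indexed = new ArrayList<>();
        linksToVisit.add(start);
        while (!linksToVisit.isEmpty() && indexed.size() < maxCount) {
            String link = linksToVisit.removeFirst();
            if (!visited.add(link)) {
                continue; // already processed
            }
            List<String> outLinks = web.get(link);
            if (outLinks == null) {
                continue; // stands in for a page that could not be retrieved
            }
            indexed.add(link);
            for (String out : outLinks) {
                if (!visited.contains(out)) {
                    linksToVisit.addLast(out);
                }
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("a", "c"),
            "c", List.of());
        System.out.println(crawl("a", web, 10));
    }
}
```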

E

empty() - Method in class ir.webutils.HTMLPage
Returns true if the page is empty or a 404 error.
equals(Object) - Method in class ir.webutils.Link
 
extractLinks() - Method in class ir.webutils.LinkExtractor
Extracts links from the given page.
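The handleStartTag, handleEndTag, handleSimpleTag and handleText signatures listed under H match the callback methods of javax.swing.text.html.HTMLEditorKit.ParserCallback, so link extraction is presumably driven by that parser. The sketch below is a guess at the general pattern rather than the LinkExtractor source; the class LinkCallbackSketch is hypothetical and collects only the href attribute of "a" tags.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkCallbackSketch extends HTMLEditorKit.ParserCallback {
    // Raw href values in document order; not yet completed against a base URL.
    final List<String> links = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag == HTML.Tag.A) {
            Object href = attributes.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    /** Parses the HTML text and returns the href values it found. */
    static List<String> extractLinks(String html) {
        LinkCallbackSketch callback = new LinkCallbackSketch();
        try {
            // The final argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"a.html\">A</a></body></html>";
        System.out.println(extractLinks(html));
    }
}
```

A fuller extractor would also override handleSimpleTag for tags without a closing tag and handleText to capture anchor text, as the AnchoredLinkExtractor entries indicate.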

G

getAbsoluteText() - Method in class ir.webutils.HTMLPage
Get the absolute link version of this page
getAnchorText() - Method in class ir.webutils.AnchoredLink
Return anchor text for link
getHTMLPage(Link) - Method in class ir.webutils.HTMLPageRetriever
Downloads a web page from a given URL.
getHTMLPage(Link) - Method in class ir.webutils.SafeHTMLPageRetriever
Tries to download the given web page.
getLink() - Method in class ir.webutils.HTMLPage
Returns the Link object that was used to access this page.
getNewLinks(HTMLPage) - Method in class ir.webutils.Spider
Returns a list of links to follow from a given page.
getNewLinks(HTMLPage) - Method in class ir.webutils.AnchoredSiteSpider
Gets links from the given page that are on the same host as the page.
getNewLinks(HTMLPage) - Method in class ir.webutils.DirectorySpider
Gets links from the page that are in or below the starting directory.
getNewLinks(HTMLPage) - Method in class ir.webutils.AnchoredDirectorySpider
Gets links from the page that are in or below the starting directory.
getNewLinks(HTMLPage) - Method in class ir.webutils.SiteSpider
Gets links from the given page that are on the same host as the page.
getOutLinks() - Method in class ir.webutils.HTMLPage
Get the list of out links from this page.
getParser() - Method in class ir.webutils.HTMLParserMaker
Returns a parser.
getText() - Method in class ir.webutils.HTMLPage
Returns the full text of this page.
getURL() - Method in class ir.webutils.Link
Returns the URL of this link.
getURL(String) - Static method in class ir.webutils.URLChecker
 
getWebPage(String) - Static method in class ir.webutils.WebPage
Downloads the web page specified by the URL represented by a given string.
getWebPage(URL) - Static method in class ir.webutils.WebPage
Downloads the web page specified by the given URL object.
go(String[]) - Method in class ir.webutils.Spider
Checks command line arguments and performs the crawl.
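The getNewLinks overrides above restrict a crawl either to the page's host (SiteSpider, AnchoredSiteSpider) or to the starting directory and below (DirectorySpider, AnchoredDirectorySpider). A minimal sketch of both filters follows. The class LinkFilterSketch and its method names are hypothetical, java.net.URI is used in place of the library's Link type, and the directory test is assumed to be a simple string-prefix comparison.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

class LinkFilterSketch {
    /** Keeps links whose host matches the host of the page they came from. */
    static List<URI> sameHost(URI page, List<URI> links) {
        List<URI> kept = new ArrayList<>();
        for (URI link : links) {
            if (page.getHost() != null && page.getHost().equalsIgnoreCase(link.getHost())) {
                kept.add(link);
            }
        }
        return kept;
    }

    /** Keeps links that are in or below the starting directory (prefix test). */
    static List<URI> inOrBelow(URI startDirectory, List<URI> links) {
        String prefix = startDirectory.toString();
        List<URI> kept = new ArrayList<>();
        for (URI link : links) {
            if (link.toString().startsWith(prefix)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.com/a/index.html");
        List<URI> links = List.of(
            URI.create("http://example.com/a/x.html"),
            URI.create("http://example.com/b.html"),
            URI.create("http://other.org/c.html"));
        System.out.println(sameHost(page, links));
        System.out.println(inOrBelow(URI.create("http://example.com/a/"), links));
    }
}
```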

H

handleCCommandLineOption(String) - Method in class ir.webutils.Spider
Called when "-c" is passed in on the command line.
handleDCommandLineOption(String) - Method in class ir.webutils.Spider
Called when "-d" is passed in on the command line.
handleEndTag(HTML.Tag, int) - Method in class ir.webutils.LinkExtractor
Executed when a closing HTML tag is found in the document.
handleEndTag(HTML.Tag, int) - Method in class ir.webutils.AnchoredLinkExtractor
Executed when a closing HTML tag is found in the document.
handleSafeCommandLineOption() - Method in class ir.webutils.Spider
Called when "-safe" is passed in on the command line.
handleSimpleTag(HTML.Tag, MutableAttributeSet, int) - Method in class ir.webutils.LinkExtractor
Executed when an HTML tag that has no closing tag is found in the document.
handleSimpleTag(HTML.Tag, MutableAttributeSet, int) - Method in class ir.webutils.AnchoredLinkExtractor
Executed when an HTML tag that has no closing tag is found in the document.
handleSimpleTag(HTML.Tag, MutableAttributeSet, int) - Method in class ir.webutils.RobotsMetaTagParser
Checks for robots META tags.
handleSlowCommandLineOption() - Method in class ir.webutils.Spider
Called when "-slow" is passed in on the command line.
handleStartTag(HTML.Tag, MutableAttributeSet, int) - Method in class ir.webutils.LinkExtractor
Executed when an opening HTML tag is found in the document.
handleStartTag(HTML.Tag, MutableAttributeSet, int) - Method in class ir.webutils.AnchoredLinkExtractor
Executed when an opening HTML tag is found in the document.
handleText(char[], int) - Method in class ir.webutils.LinkExtractor
Executed when a block of text is encountered.
handleText(char[], int) - Method in class ir.webutils.AnchoredLinkExtractor
Executed when a block of text is encountered.
handleUCommandLineOption(String) - Method in class ir.webutils.Spider
Called when "-u" is passed in on the command line.
handleUCommandLineOption(String) - Method in class ir.webutils.AnchoredSpider
Called when "-u" is passed in on the command line.
handleUCommandLineOption(String) - Method in class ir.webutils.DirectorySpider
Sets the initial URL from the "-u" argument, then calls the corresponding superclass method.
handleUCommandLineOption(String) - Method in class ir.webutils.AnchoredDirectorySpider
Sets the initial URL from the "-u" argument, then calls the corresponding superclass method.
hashCode() - Method in class ir.webutils.Link
 
HTMLPage - class ir.webutils.HTMLPage.
HTMLPage is a representation of information about a web page.
HTMLPage(Link, String) - Constructor for class ir.webutils.HTMLPage
Constructs an HTMLPage with the given link and text.
HTMLPageRetriever - class ir.webutils.HTMLPageRetriever.
HTMLPageRetriever allows clients to download web pages from URLs.
HTMLPageRetriever() - Constructor for class ir.webutils.HTMLPageRetriever
Constructs an HTMLPageRetriever object.
HTMLParserMaker - class ir.webutils.HTMLParserMaker.
HTMLParserMaker allows clients to retrieve an HTMLEditorKit.Parser instance.
HTMLParserMaker() - Constructor for class ir.webutils.HTMLParserMaker
 

I

index() - Method in class ir.webutils.RobotsMetaTagParser
Indicates whether the page can be indexed.
indexAllowed() - Method in class ir.webutils.HTMLPage
Clients should always call this method before indexing an HTML page if they want to obey the "NOINDEX" directive in the Robots META tag.
indexAllowed() - Method in class ir.webutils.SafeHTMLPage
Indicates whether or not indexing has been disallowed by a Robots META tag.
ir.webutils - package ir.webutils
 
iterator() - Method in class ir.webutils.RobotExclusionSet
 

L

link - Variable in class ir.webutils.HTMLPage
The original link to this page
Link - class ir.webutils.Link.
Link is a class that contains a URL.
Link() - Constructor for class ir.webutils.Link
May be subclassed.
Link(String) - Constructor for class ir.webutils.Link
Construct a link with specified URL string
Link(URL) - Constructor for class ir.webutils.Link
Constructs a link with specified URL.
LinkExtractor - class ir.webutils.LinkExtractor.
LinkExtractor defines a callback that extracts the links from an HTML document and provides functionality to parse a document.
LinkExtractor(HTMLPage) - Constructor for class ir.webutils.LinkExtractor
Create a link extractor for the given page
links - Variable in class ir.webutils.LinkExtractor
The current list of extracted links
linksToVisit - Variable in class ir.webutils.Spider
The queue of links maintained by the spider
linkToHTMLPage(Link) - Method in class ir.webutils.Spider
Check if this is a link to an HTML page.

M

main(String[]) - Static method in class ir.webutils.AnchoredLinkExtractor
 
main(String[]) - Static method in class ir.webutils.Spider
Spider the web according to the following command options: -safe: check for and obey robots.txt and robots META tag directives; -d <directory>: store indexed files in <directory>; -c <count>: store at most <count> files; -u <url>: start at <url>; -slow: pause briefly before getting a page.
main(String[]) - Static method in class ir.webutils.AnchoredSpider
Spider the web according to the following command options, but only below the start URL directory.
main(String[]) - Static method in class ir.webutils.AnchoredSiteSpider
Spider the web according to the following command options, but stay within the given site (same URL host) and include anchor text of links to page.
main(String[]) - Static method in class ir.webutils.WebPage
 
main(String[]) - Static method in class ir.webutils.Link
 
main(String[]) - Static method in class ir.webutils.WebPageViewer
 
main(String[]) - Static method in class ir.webutils.DirectorySpider
Spider the web according to the following command options, but only below the start URL directory.
main(String[]) - Static method in class ir.webutils.AnchoredDirectorySpider
Spider the web according to the following command options, but only below the start URL directory and include anchor text of links to page.
main(String[]) - Static method in class ir.webutils.SiteSpider
Spider the web according to the following command options, but stay within the given site (same URL host).
maxCount - Variable in class ir.webutils.Spider
The maximum number of pages to be indexed.
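The Spider.main entry above lists five command options. The sketch below shows how such options might be parsed; it is not the processArgs source. The class SpiderOptionsSketch, its field names, the unlimited default for the page count, and the rejection of unknown options are all assumptions made for illustration.

```java
class SpiderOptionsSketch {
    boolean safe;                       // -safe: obey robots.txt and robots META tags
    boolean slow;                       // -slow: pause briefly before getting a page
    String saveDir;                     // -d <directory>
    int maxCount = Integer.MAX_VALUE;   // -c <count>; default is an assumption
    String startUrl;                    // -u <url>

    static SpiderOptionsSketch parse(String[] args) {
        SpiderOptionsSketch options = new SpiderOptionsSketch();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-safe": options.safe = true; break;
                case "-slow": options.slow = true; break;
                case "-d": options.saveDir = value(args, ++i); break;
                case "-c": options.maxCount = Integer.parseInt(value(args, ++i)); break;
                case "-u": options.startUrl = value(args, ++i); break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
        }
        return options;
    }

    // Returns the argument following an option, or fails if it is missing.
    private static String value(String[] args, int index) {
        if (index >= args.length) {
            throw new IllegalArgumentException("Missing value after " + args[index - 1]);
        }
        return args[index];
    }

    public static void main(String[] args) {
        SpiderOptionsSketch o = parse(new String[] {"-safe", "-c", "5", "-u", "http://example.com/"});
        System.out.println(o.safe + " " + o.maxCount + " " + o.startUrl);
    }
}
```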

O

outLinks - Variable in class ir.webutils.HTMLPage
The links on this page

P

page - Variable in class ir.webutils.LinkExtractor
The page from which to extract links
parseMetaTags() - Method in class ir.webutils.RobotsMetaTagParser
Parses the document and returns a list of links that cannot be followed.
PathDisallowedException - exception ir.webutils.PathDisallowedException.
Thrown to indicate that a client program tried to access a path that was disallowed by either a robots.txt file or a robots META tag.
PathDisallowedException() - Constructor for class ir.webutils.PathDisallowedException
 
PathDisallowedException(String) - Constructor for class ir.webutils.PathDisallowedException
 
processArgs(String[]) - Method in class ir.webutils.Spider
Processes command-line arguments.
processPage(HTMLPage) - Method in class ir.webutils.Spider
"Indexes" an HTMLPage.
processPage(HTMLPage) - Method in class ir.webutils.AnchoredSpider
This version extracts anchored links, stores file name for this URL in the urlMap, and adds initial link anchor text to the bottom of the file.
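The parseMetaTags and indexAllowed entries refer to the robots META tag, whose content attribute carries comma-separated directives such as NOINDEX and NOFOLLOW. The sketch below shows how that content string might be interpreted. The class RobotsMetaSketch and its methods are hypothetical, and the treatment of "none" as both NOINDEX and NOFOLLOW follows the common convention rather than anything stated in this index.

```java
import java.util.Locale;

class RobotsMetaSketch {
    /** True unless the content forbids indexing (NOINDEX or NONE). */
    static boolean indexAllowed(String content) {
        return !hasDirective(content, "noindex") && !hasDirective(content, "none");
    }

    /** True unless the content forbids following links (NOFOLLOW or NONE). */
    static boolean followAllowed(String content) {
        return !hasDirective(content, "nofollow") && !hasDirective(content, "none");
    }

    // Case-insensitive search for one directive in a comma-separated list.
    private static boolean hasDirective(String content, String directive) {
        if (content == null) {
            return false; // no robots META tag on the page
        }
        for (String part : content.toLowerCase(Locale.ROOT).split(",")) {
            if (part.trim().equals(directive)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String content = "NOINDEX, FOLLOW";
        System.out.println(indexAllowed(content) + " " + followAllowed(content));
    }
}
```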

R

removeEndSlash(URL) - Static method in class ir.webutils.Link
Removes slash at end of URL to normalize
removeRef(URL) - Static method in class ir.webutils.Link
Remove the internal "ref" pointer in a URL if there is one.
RobotExclusionSet - class ir.webutils.RobotExclusionSet.
RobotExclusionSet provides support for the Robots Exclusion Protocol.
RobotExclusionSet() - Constructor for class ir.webutils.RobotExclusionSet
Constructs an empty set.
RobotExclusionSet(String) - Constructor for class ir.webutils.RobotExclusionSet
Constructs a set containing the paths in the robots.txt file for this site.
RobotsMetaTagParser - class ir.webutils.RobotsMetaTagParser.
Parser callback that extracts robots META tag information.
RobotsMetaTagParser() - Constructor for class ir.webutils.RobotsMetaTagParser
 
RobotsMetaTagParser(URL) - Constructor for class ir.webutils.RobotsMetaTagParser
 
RobotsMetaTagParser(URL, String) - Constructor for class ir.webutils.RobotsMetaTagParser
 

S

SafeHTMLPage - class ir.webutils.SafeHTMLPage.
SafeHTMLPage is an immutable representation of information about a web page that includes information about whether or not this page can be indexed.
SafeHTMLPage(Link, String, boolean) - Constructor for class ir.webutils.SafeHTMLPage
Constructs a SafeHTMLPage with the given link, text, and indication whether or not indexing is allowed.
SafeHTMLPageRetriever - class ir.webutils.SafeHTMLPageRetriever.
Keeps track of Robot Exclusion information.
SafeHTMLPageRetriever() - Constructor for class ir.webutils.SafeHTMLPageRetriever
 
saveDir - Variable in class ir.webutils.Spider
The directory to save the downloaded files to.
setAbsoluteText(String) - Method in class ir.webutils.HTMLPage
Set the absolute link version of this page
setAnchorText(String) - Method in class ir.webutils.AnchoredLink
Set anchor text for link
setOutLinks(List) - Method in class ir.webutils.HTMLPage
Set the outLinks for this page to the given list
setPage(String) - Method in class ir.webutils.RobotsMetaTagParser
 
setUrl(URL) - Method in class ir.webutils.RobotsMetaTagParser
 
SiteSpider - class ir.webutils.SiteSpider.
A spider that limits itself to a given site.
SiteSpider() - Constructor for class ir.webutils.SiteSpider
 
size() - Method in class ir.webutils.RobotExclusionSet
 
slow - Variable in class ir.webutils.Spider
Flag to purposely slow the crawl for debugging purposes
Spider - class ir.webutils.Spider.
Spider defines a framework for writing a web crawler.
Spider() - Constructor for class ir.webutils.Spider
 

T

text - Variable in class ir.webutils.HTMLPage
The text of the page
toString() - Method in class ir.webutils.Link
 
toString() - Method in class ir.webutils.AnchoredLink
 

U

url - Variable in class ir.webutils.LinkExtractor
The URL for this page
URLChecker - class ir.webutils.URLChecker.
URLChecker tries to clean up some URLs that do not conform to the standard and cause confusion.
urlMap - Variable in class ir.webutils.AnchoredSpider
Map from URLs to file names where they are stored.

V

visited - Variable in class ir.webutils.Spider
The URLs that have already been visited.

W

WebPage - class ir.webutils.WebPage.
WebPage is a static utility class that provides operations for downloading web pages.
WebPage() - Constructor for class ir.webutils.WebPage
 
WebPageViewer - class ir.webutils.WebPageViewer.
WebPageViewer contains utilities to download and display HTML pages.
WebPageViewer() - Constructor for class ir.webutils.WebPageViewer
 
webpr - Variable in class ir.webutils.Spider
The object to be used to retrieve pages
writeAbsolute(File, String) - Method in class ir.webutils.HTMLPage
Writes web page to a file with absolute links and a comment with the original URL.
