public class YahooSiteLinkExtractor
extends javax.swing.text.html.HTMLEditorKit.ParserCallback
Modifier and Type | Field and Description |
---|---|
protected boolean | inSiteSection: Flag that is true during parsing while the HTML parser is in the section of the webpage that lists site links. |
protected java.util.List<Link> | links: The current list of extracted site links. |
protected java.lang.String | moreURL: Set during parsing while the HTML parser is inside the anchor text of a Yahoo link that refers to more sites not listed on the current page. Stores the URL for this link while in its anchor text. |
protected HTMLPage | page: The page from which to extract links. |
protected java.net.URL | url: The URL for this page. |
Constructor and Description |
---|
YahooSiteLinkExtractor(HTMLPage page): Create a link extractor for the given page. |
Modifier and Type | Method and Description |
---|---|
protected void | addLink(javax.swing.text.MutableAttributeSet attributes, javax.swing.text.html.HTML.Attribute attr): Retrieves a link from an attribute set and completes it against the base URL. |
java.util.List<Link> | extractLinks(): Extracts site links from the given Yahoo page. |
void | handleEndTag(javax.swing.text.html.HTML.Tag tag, int position): Executed when a closing HTML tag is found in the document. |
void | handleSimpleTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position): Executed when an HTML tag that has no closing tag is found in the document. |
void | handleStartTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position): Executed when an opening HTML tag is found in the document. |
void | handleText(char[] text, int position): Executed when a block of text is encountered. |
static void | main(java.lang.String[] args): Given a Yahoo directory URL as a single argument, tests extraction of site links from this page. |
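
The callback methods summarized above follow the usual javax.swing.text.html.HTMLEditorKit.ParserCallback pattern: a flag is switched on and off as tags open and close, and links are collected only while the flag is set. The following is a minimal sketch of that pattern, not the actual implementation of this class. SectionLinkCollector and collect are hypothetical names, and treating a <ul> element as the "site section" is an assumption made purely for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in for illustration only; not YahooSiteLinkExtractor itself.
class SectionLinkCollector extends HTMLEditorKit.ParserCallback {
    // True while the parser is inside the element assumed to hold site links.
    private boolean inSiteSection = false;
    private final List<String> links = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag == HTML.Tag.UL) {
            // Assumption: the site links live inside a <ul> element.
            inSiteSection = true;
        } else if (tag == HTML.Tag.A && inSiteSection) {
            Object href = attributes.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int position) {
        if (tag == HTML.Tag.UL) {
            inSiteSection = false;
        }
    }

    // Runs the Swing HTML parser over the given markup with a fresh callback.
    static List<String> collect(String html) {
        SectionLinkCollector callback = new SectionLinkCollector();
        try (Reader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"/nav\">nav</a>"
                + "<ul><li><a href=\"http://example.com/a\">A</a></li></ul>"
                + "</body></html>";
        // Only the anchor inside the <ul> is collected.
        System.out.println(collect(html));
    }
}
```

In this sketch the anchor outside the list is skipped because the flag is still false when its start tag arrives, which is the same role the inSiteSection field is documented to play.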
protected java.util.List<Link> links
protected HTMLPage page
protected java.net.URL url
protected boolean inSiteSection
protected java.lang.String moreURL
public YahooSiteLinkExtractor(HTMLPage page)
public void handleText(char[] text, int position)
Overrides: handleText in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
text - A char array representation of the text.
position - The position of the text in the document.
public void handleStartTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position)
Overrides: handleStartTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
tag - The tag that caused this function to be executed.
attributes - The attributes of tag.
position - The start of the tag in the document. If the tag is implied (filled in by the parser but not actually present in the document) then position will correspond to that of the next encountered tag.
public void handleEndTag(javax.swing.text.html.HTML.Tag tag, int position)
Overrides: handleEndTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
tag - The tag found.
position - The position of the tag in the document.
public void handleSimpleTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position)
Overrides: handleSimpleTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
tag - The tag that caused this function to be executed.
attributes - The attributes of tag.
position - The start of the tag in the document. If the tag is implied (filled in by the parser but not actually present in the document) then position will correspond to that of the next encountered tag.
public java.util.List<Link> extractLinks()
Parses the page using this as the callback.
Returns: A list of Link objects containing the links found on this page. The links will all be absolute links.
protected void addLink(javax.swing.text.MutableAttributeSet attributes, javax.swing.text.html.HTML.Attribute attr)
Parameters:
attributes - The attribute set.
attr - The attribute that should be treated as a URL. For example, attr should be HTML.Attribute.HREF if attributes is from an anchor tag.
public static void main(java.lang.String[] args)
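
For reference, addLink is documented as completing a link against the base URL, and extractLinks as returning only absolute links. How the class does this internally is not shown here. The sketch below illustrates one way such completion can work, using java.net.URI.resolve from the standard library as an assumed mechanism. LinkCompleter, complete, and the example.com addresses are hypothetical.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of YahooSiteLinkExtractor.
class LinkCompleter {
    // Resolves href against baseUrl and returns the resulting absolute URL.
    // URI.create throws IllegalArgumentException if either string is not a valid URI.
    static String complete(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String base = "http://dir.example.com/Science/index.html";
        // Relative to the directory of the base page.
        System.out.println(complete(base, "Physics/"));
        // Relative to the host root.
        System.out.println(complete(base, "/Arts/"));
        // Already absolute, so it is returned unchanged.
        System.out.println(complete(base, "http://other.example.org/"));
    }
}
```

A value read from HTML.Attribute.HREF can be relative to the page, relative to the host root, or already absolute, and a single resolve call covers all three cases.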