public class YahooCategoryLinkExtractor
extends javax.swing.text.html.HTMLEditorKit.ParserCallback
Modifier and Type | Field and Description |
---|---|
protected boolean | inCategorySection: Flag that is true during parsing while the HTML parser is in the section of the webpage that lists subcategory links |
protected java.util.List<Link> | links: The current list of extracted category links |
protected HTMLPage | page: The page from which to extract links |
protected java.net.URL | url: The URL for this page |
Constructor and Description |
---|
YahooCategoryLinkExtractor(HTMLPage page): Creates a link extractor for the given page |
Modifier and Type | Method and Description |
---|---|
protected void | addLink(javax.swing.text.MutableAttributeSet attributes, javax.swing.text.html.HTML.Attribute attr): Retrieves a link from an attribute set and completes it against the base URL. |
java.util.List<Link> | extractLinks(): Extracts category links from the given Yahoo page. |
void | handleEndTag(javax.swing.text.html.HTML.Tag tag, int position): Executed when a closing HTML tag is found in the document. |
void | handleSimpleTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position): Executed when an HTML tag that has no closing tag is found in the document. |
void | handleStartTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position): Executed when an opening HTML tag is found in the document. |
void | handleText(char[] text, int position): Executed when a block of text is encountered. |
static void | main(java.lang.String[] args): Given a Yahoo directory URL as a single argument, tests extraction of category links from that page. |
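The callback methods listed above can be illustrated with a minimal sketch of the same pattern: a ParserCallback subclass that flips a boolean flag when it sees marker text and collects anchor HREFs only while the flag is set. The class name SectionLinkSketch, the marker strings "Categories" and "Listings", and the use of plain strings instead of Link objects are all assumptions made for illustration; how the actual class detects the category section is not documented here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SectionLinkSketch extends HTMLEditorKit.ParserCallback {
    // Hypothetical section markers, chosen only for this sketch.
    private static final String SECTION_START = "Categories";
    private static final String SECTION_END = "Listings";

    // True while the parser is between the two marker texts.
    protected boolean inCategorySection = false;
    // Raw HREF values collected inside the section.
    protected final List<String> links = new ArrayList<>();

    @Override
    public void handleText(char[] text, int position) {
        String s = new String(text).trim();
        if (s.equals(SECTION_START)) {
            inCategorySection = true;
        } else if (s.equals(SECTION_END)) {
            inCategorySection = false;
        }
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        // Only anchors seen while the flag is set are recorded.
        if (inCategorySection && tag == HTML.Tag.A) {
            Object href = attributes.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    /** Parses the given HTML with a fresh callback and returns the collected HREFs. */
    public static List<String> extract(String html) throws IOException {
        SectionLinkSketch callback = new SectionLinkSketch();
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return callback.links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href=\"/skip\">nav</a>"
                + "<h2>Categories</h2>"
                + "<a href=\"/Arts/\">Arts</a><a href=\"/Science/\">Science</a>"
                + "<h2>Listings</h2><a href=\"/other\">other</a></body></html>";
        System.out.println(extract(html));
    }
}
```

Keeping the section state in a single field works because the parser invokes the callbacks in document order, so each handler can rely on what earlier handlers recorded.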
protected java.util.List<Link> links
protected HTMLPage page
protected java.net.URL url
protected boolean inCategorySection
public YahooCategoryLinkExtractor(HTMLPage page)
public void handleText(char[] text, int position)
Overrides: handleText in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
text - A char array representation of the text.
position - The position of the text in the document.

public void handleStartTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position)
Overrides: handleStartTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
tag - The tag that caused this function to be executed.
attributes - The attributes of tag.
position - The start of the tag in the document. If the tag is implied (filled in by the parser but not actually present in the document) then position will correspond to that of the next encountered tag.

public void handleEndTag(javax.swing.text.html.HTML.Tag tag, int position)
Overrides: handleEndTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
tag - The tag found.
position - The position of the tag in the document.

public void handleSimpleTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position)
Overrides: handleSimpleTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
tag - The tag that caused this function to be executed.
attributes - The attributes of tag.
position - The start of the tag in the document. If the tag is implied (filled in by the parser but not actually present in the document) then position will correspond to that of the next encountered tag.

public java.util.List<Link> extractLinks()
Extracts category links from the given Yahoo page, parsing it with this as the callback.
Returns:
A list of Link objects containing the links found on this page. The links will all be absolute links.

protected void addLink(javax.swing.text.MutableAttributeSet attributes, javax.swing.text.html.HTML.Attribute attr)
Parameters:
attributes - The attribute set.
attr - The attribute that should be treated as a URL. For example, attr should be HTML.Attribute.HREF if attributes is from an anchor tag.

public static void main(java.lang.String[] args)
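
The link completion described for addLink (read one attribute from the set, then resolve it against the page's base URL) might look roughly like the sketch below. The class AddLinkSketch, the helper completeLink, and the example.com addresses are hypothetical. The sketch also uses java.net.URI for resolution rather than the java.net.URL type of the url field, so it shows the general technique and not the class's actual implementation.

```java
import java.net.URI;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;

public class AddLinkSketch {
    /**
     * Reads the value stored under attr and resolves it against base.
     * Returns null when the attribute is absent or is not a valid URI reference.
     */
    public static URI completeLink(URI base, MutableAttributeSet attributes, HTML.Attribute attr) {
        Object value = attributes.getAttribute(attr);
        if (value == null) {
            return null;
        }
        try {
            // Relative references become absolute; absolute ones are returned unchanged.
            return base.resolve(value.toString());
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.com/Science/"); // placeholder base URL
        SimpleAttributeSet attrs = new SimpleAttributeSet();
        attrs.addAttribute(HTML.Attribute.HREF, "Biology/");
        System.out.println(completeLink(base, attrs, HTML.Attribute.HREF));
    }
}
```

Passing the attribute key as a parameter, as the addLink signature does, lets the same routine serve anchors (HTML.Attribute.HREF) and any other tag whose URL lives under a different attribute.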