YahooCategoryLinkExtractor

java.lang.Object
- javax.swing.text.html.HTMLEditorKit.ParserCallback
- - ir.webutils.YahooCategoryLinkExtractor

```
public class YahooCategoryLinkExtractor
extends javax.swing.text.html.HTMLEditorKit.ParserCallback
```
YahooCategoryLinkExtractor defines a callback for the Swing HTML parser that extracts links to subcategories from a Yahoo directory page. It uses the HTML parser in Java Swing to parse the document, find the links, and convert relative URLs to absolute ones, so every extracted link is absolute.

Field Summary

Fields
Modifier and Type	Field and Description
`protected boolean`	`inCategorySection` Flag that is true during parsing while the HTML parser is in the section of the webpage that lists subcategory links
`protected java.util.List<Link>`	`links` The current list of extracted category links
`protected HTMLPage`	`page` The page from which to extract links
`protected java.net.URL`	`url` The URL for this page

Fields inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
IMPLIED

Constructor Summary

Constructors
Constructor and Description
`YahooCategoryLinkExtractor(HTMLPage page)` Creates a link extractor for the given page

Method Summary

Methods
Modifier and Type	Method and Description
`protected void`	`addLink(javax.swing.text.MutableAttributeSet attributes, javax.swing.text.html.HTML.Attribute attr)` Retrieves a link from an attribute set and completes it against the base URL.
`java.util.List<Link>`	`extractLinks()` Extracts category links from the given Yahoo page.
`void`	`handleEndTag(javax.swing.text.html.HTML.Tag tag, int position)` Executed when a closing HTML tag is found in the document.
`void`	`handleSimpleTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position)` Executed when an HTML tag that has no closing tag is found in the document.
`void`	`handleStartTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position)` Executed when an opening HTML tag is found in the document.
`void`	`handleText(char[] text, int position)` Executed when a block of text is encountered.
`static void`	`main(java.lang.String[] args)` Given a Yahoo directory URL as a single argument, tests extraction of category links from that page.

Methods inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
flush, handleComment, handleEndOfLineString, handleError

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - links
```
protected java.util.List<Link> links
```
    The current list of extracted category links
  - page
```
protected HTMLPage page
```
    The page from which to extract links
  - url
```
protected java.net.URL url
```
    The URL for this page
  - inCategorySection
```
protected boolean inCategorySection
```
    Flag that is true during parsing while the HTML parser is in the section of the webpage that lists subcategory links
- Constructor Detail
  - YahooCategoryLinkExtractor
```
public YahooCategoryLinkExtractor(HTMLPage page)
```
    Creates a link extractor for the given page
- Method Detail
  - handleText
```
public void handleText(char[] text,
              int position)
```
    Executed when a block of text is encountered. If it sees text indicating the start of the categories section of the Yahoo page, it sets the inCategorySection flag to true.
    
    Overrides:
    
    handleText in class javax.swing.text.html.HTMLEditorKit.ParserCallback
    
    Parameters:
    text - A char array representation of the text.
    position - The position of the text in the document.
  - handleStartTag
```
public void handleStartTag(javax.swing.text.html.HTML.Tag tag,
                  javax.swing.text.MutableAttributeSet attributes,
                  int position)
```
    Executed when an opening HTML tag is found in the document. Note that this method only handles tags that also have a closing tag. It catches "a" (anchor) tags: if the parser is currently in the category section, the tag's link is completed against the base URL and saved in the list of extracted links.
    
    Overrides:
    
    handleStartTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
    
    Parameters:
    tag - The tag that caused this function to be executed.
    attributes - The attributes of tag.
    position - The start of the tag in the document. If the tag is implied (filled in by the parser but not actually present in the document) then position will correspond to that of the next encountered tag.
  - handleEndTag
```
public void handleEndTag(javax.swing.text.html.HTML.Tag tag,
                int position)
```
    Executed when a closing HTML tag is found in the document. Note that the parser may add "implied" closing tags. For example, the default parser adds closing <p> tags. If the end of a TABLE tag is encountered while in the category section of the Yahoo page, this marks the end of that section and the inCategorySection flag is set to false.
    
    Overrides:
    
    handleEndTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
    
    Parameters:
    tag - The tag found.
    position - The position of the tag in the document.
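
    The way handleText, handleStartTag, and handleEndTag cooperate through the inCategorySection flag can be sketched as a small state machine. This is a minimal illustration, not the class's actual source: the class name `SectionFlagSketch`, the marker text `"Categories"`, and the helper `collect` are assumptions made for the example, and the sketch stores raw href strings rather than Link objects.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SectionFlagSketch extends HTMLEditorKit.ParserCallback {
    // Hypothetical marker text; the marker used for the real Yahoo page is not documented here.
    static final String MARKER = "Categories";

    boolean inSection = false;                      // plays the role of inCategorySection
    final List<String> hrefs = new ArrayList<>();   // raw href values seen inside the section

    @Override
    public void handleText(char[] text, int position) {
        // Text containing the marker opens the section.
        if (new String(text).contains(MARKER)) {
            inSection = true;
        }
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        // Only anchors seen while the flag is set are kept.
        if (inSection && tag == HTML.Tag.A) {
            Object href = attributes.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                hrefs.add(href.toString());
            }
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int position) {
        // The closing TABLE tag ends the section.
        if (inSection && tag == HTML.Tag.TABLE) {
            inSection = false;
        }
    }

    /** Parses the given HTML and returns the hrefs found inside the marked section. */
    static List<String> collect(String html) {
        SectionFlagSketch callback = new SectionFlagSketch();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.hrefs;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"/home\">Home</a><p>Categories</p>"
                + "<table><tr><td><a href=\"Biology/\">Biology</a></td></tr></table>"
                + "<a href=\"/about\">About</a></body></html>";
        System.out.println(collect(html));
    }
}
```

    In this sketch the links before the marker and after the closing TABLE tag are ignored, so only the anchor inside the table is collected.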
  - handleSimpleTag
```
public void handleSimpleTag(javax.swing.text.html.HTML.Tag tag,
                   javax.swing.text.MutableAttributeSet attributes,
                   int position)
```
    Executed when an HTML tag that has no closing tag is found in the document. Nothing to do here.
    
    Overrides:
    
    handleSimpleTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
    
    Parameters:
    tag - The tag that caused this function to be executed.
    attributes - The attributes of tag.
    position - The start of the tag in the document. If the tag is implied (filled in by the parser but not actually present in the document) then position will correspond to that of the next encountered tag.
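
    To make the split between handleStartTag and handleSimpleTag concrete, the following sketch records which callback the Swing parser invokes for each tag. It is an illustration only; the class name `TagDispatchSketch` and its fields are hypothetical and not part of ir.webutils.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TagDispatchSketch extends HTMLEditorKit.ParserCallback {
    final List<String> startTags = new ArrayList<>();   // tags delivered to handleStartTag
    final List<String> simpleTags = new ArrayList<>();  // tags delivered to handleSimpleTag

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        startTags.add(tag.toString());
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        simpleTags.add(tag.toString());
    }

    /** Parses the given HTML and returns the callback with both lists filled in. */
    static TagDispatchSketch run(String html) {
        TagDispatchSketch callback = new TagDispatchSketch();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback;
    }

    public static void main(String[] args) {
        TagDispatchSketch result = run(
                "<html><body><p>See <a href=\"x.html\">x</a><br><img src=\"y.gif\"></p></body></html>");
        System.out.println("start tags:  " + result.startTags);
        System.out.println("simple tags: " + result.simpleTags);
    }
}
```

    Tags such as br and img, which have no closing tag, arrive through handleSimpleTag, while p and a arrive through handleStartTag. Since a link extractor only cares about anchors, it can leave handleSimpleTag empty.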
  - extractLinks
```
public java.util.List<Link> extractLinks()
```
    Extracts category links from the given Yahoo page. This method constructs a parser and registers this object as the callback.
    
    Returns:
    A list of Link objects containing the links found on this page. The links will all be absolute links.
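
    The pattern described here, constructing a parser and registering a callback with it, might look roughly like the sketch below. It uses `javax.swing.text.html.parser.ParserDelegator`, which is an assumption about the parser involved, and it returns plain strings because the constructors of Link and HTMLPage are not shown in this documentation. The name `AnchorCollectorSketch` and the example URLs are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class AnchorCollectorSketch extends HTMLEditorKit.ParserCallback {
    private final URI base;                                 // URL of the page being parsed
    private final List<String> links = new ArrayList<>();   // absolute links found so far

    AnchorCollectorSketch(String baseUrl) {
        this.base = URI.create(baseUrl);
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag != HTML.Tag.A) {
            return;
        }
        Object href = attributes.getAttribute(HTML.Attribute.HREF);
        if (href == null) {
            return;
        }
        try {
            // Complete the (possibly relative) href against the page URL.
            links.add(base.resolve(href.toString()).toString());
        } catch (IllegalArgumentException e) {
            // Skip hrefs that are not valid URI references.
        }
    }

    /** Builds a parser, registers a fresh callback, and returns the absolute links it collects. */
    static List<String> extract(String html, String baseUrl) {
        AnchorCollectorSketch callback = new AnchorCollectorSketch(baseUrl);
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"Biology/\">Biology</a>"
                + "<a href=\"/Arts/\">Arts</a></body></html>";
        System.out.println(extract(html, "http://dir.example.com/Science/"));
    }
}
```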
  - addLink
```
protected void addLink(javax.swing.text.MutableAttributeSet attributes,
           javax.swing.text.html.HTML.Attribute attr)
```
    Retrieves a link from an attribute set and completes it against the base URL.
    
    Parameters:
    attributes - The attribute set.
    attr - The attribute that should be treated as a URL. For example, attr should be HTML.Attribute.HREF if attributes is from an anchor tag.
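
    Completing a link against a base URL means resolving a relative reference into an absolute one. The sketch below shows the idea with `java.net.URI.resolve`; how addLink performs this step internally is not stated here, so the choice of URI, the class name `LinkCompletionSketch`, and the example URLs are all assumptions.

```java
import java.net.URI;

class LinkCompletionSketch {
    /**
     * Resolves href against base and returns the absolute form,
     * or null if href is not a valid URI reference.
     */
    static String complete(String base, String href) {
        try {
            return URI.create(base).resolve(href).toString();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String base = "http://dir.example.com/Science/Biology/";
        System.out.println(complete(base, "Genetics/"));                   // relative to the current directory
        System.out.println(complete(base, "../Arts/"));                    // relative to the parent directory
        System.out.println(complete(base, "/News/"));                      // relative to the site root
        System.out.println(complete(base, "http://other.example.org/x"));  // already absolute
    }
}
```

    An href that is already absolute passes through unchanged, which is why the same completion step can be applied to every link.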
  - main
```
public static void main(java.lang.String[] args)
```
    Given a Yahoo directory URL as a single argument, tests extraction of category links from that page.
