ir.webutils
Class LinkExtractor

java.lang.Object
  |
  +--javax.swing.text.html.HTMLEditorKit.ParserCallback
        |
        +--ir.webutils.LinkExtractor
Direct Known Subclasses:
AnchoredLinkExtractor

public class LinkExtractor
extends javax.swing.text.html.HTMLEditorKit.ParserCallback

LinkExtractor defines a parser callback that extracts the links from an HTML document, and it also provides the functionality to parse that document. It uses the HTML parser in Java Swing to find links and to replace relative URLs with absolute ones, so all extracted links are absolute.


Field Summary
protected  java.lang.StringBuffer absoluteCopy
          Copy of the text of the page with absolute links
protected  java.util.List links
          The current list of extracted links
protected  HTMLPage page
          The page from which to extract links
protected  java.net.URL url
          The URL for this page
 
Fields inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
IMPLIED
 
Constructor Summary
LinkExtractor(HTMLPage page)
          Create a link extractor for the given page
 
Method Summary
protected static java.net.URL addEndSlash(java.net.URL url)
          If the URL looks like a directory rather than a file, adds a "/" at the end so that it acts as a proper base URL for completing relative URLs in this page
protected  void addLink(javax.swing.text.MutableAttributeSet attributes, javax.swing.text.html.HTML.Attribute attr)
          Retrieves a link from an attribute set and completes it against the base URL.
static void appendTag(java.lang.StringBuffer buffer, javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes)
          Write this tag with attributes out to the buffer
 java.util.List extractLinks()
          Extracts links from the given page.
 void handleEndTag(javax.swing.text.html.HTML.Tag tag, int position)
          Executed when a closing HTML tag is found in the document.
 void handleSimpleTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position)
          Executed when an HTML tag that has no closing tag is found in the document.
 void handleStartTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position)
          Executed when an opening HTML tag is found in the document.
 void handleText(char[] text, int position)
          Executed when a block of text is encountered.
 
Methods inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
flush, handleComment, handleEndOfLineString, handleError
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

links

protected java.util.List links
The current list of extracted links

page

protected HTMLPage page
The page from which to extract links

url

protected java.net.URL url
The URL for this page

absoluteCopy

protected java.lang.StringBuffer absoluteCopy
Copy of the text of the page with absolute links
Constructor Detail

LinkExtractor

public LinkExtractor(HTMLPage page)
Create a link extractor for the given page
Method Detail

addEndSlash

protected static java.net.URL addEndSlash(java.net.URL url)
If the URL looks like a directory rather than a file, adds a "/" at the end so that it acts as a proper base URL for completing relative URLs in this page
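
The documentation does not say how a URL is judged to "look like a directory", so the following is only a minimal sketch under an assumed heuristic: a URL with no query or fragment whose last path segment contains no "." is treated as a directory. The class name EndSlashSketch and the heuristic itself are hypothetical and may differ from what LinkExtractor actually does.

```java
import java.net.URI;
import java.net.URL;

class EndSlashSketch {
    /**
     * Assumed heuristic (not taken from LinkExtractor): a URL with no query
     * or fragment whose last path segment contains no '.' is treated as a
     * directory and gets a trailing "/".
     */
    static URL addEndSlash(URL url) {
        String path = url.getPath();
        if (url.getQuery() != null || url.getRef() != null || path.endsWith("/")) {
            return url;
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        if (lastSegment.contains(".")) {
            return url; // looks like a file, such as index.html
        }
        try {
            return URI.create(url.toExternalForm() + "/").toURL();
        } catch (Exception e) {
            return url; // leave the URL unchanged if it cannot be rebuilt
        }
    }

    public static void main(String[] args) throws Exception {
        URL bare = URI.create("http://example.com/docs").toURL();
        URL fixed = addEndSlash(bare);
        // Relative links resolve differently depending on the trailing slash.
        System.out.println(bare.toURI().resolve("a.html"));  // http://example.com/a.html
        System.out.println(fixed.toURI().resolve("a.html")); // http://example.com/docs/a.html
    }
}
```

The two printed lines show why the slash matters: without it, the last path segment of the base is dropped when a relative link is completed.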

handleText

public void handleText(char[] text,
                       int position)
Executed when a block of text is encountered. Just stores it in the absolute copy.
Overrides:
handleText in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
text - A char array representation of the text.
position - The position of the text in the document.

handleStartTag

public void handleStartTag(javax.swing.text.html.HTML.Tag tag,
                           javax.swing.text.MutableAttributeSet attributes,
                           int position)
Executed when an opening HTML tag is found in the document. Note that this method only handles tags that also have a closing tag. Catches "a" tags and adds links for them (after completing them). Also stores the completed URL in the absolute copy.
Overrides:
handleStartTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
tag - The tag that caused this function to be executed.
attributes - The attributes of tag.
position - The start of the tag in the document. If the tag is implied (filled in by the parser but not actually present in the document) then position will correspond to that of the next encountered tag.

appendTag

public static void appendTag(java.lang.StringBuffer buffer,
                             javax.swing.text.html.HTML.Tag tag,
                             javax.swing.text.MutableAttributeSet attributes)
Write this tag with attributes out to the buffer
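
As an illustration of what writing a tag with its attributes to a buffer can look like, here is a minimal sketch. The class name AppendTagSketch and the exact output format (double-quoted values, no escaping of special characters) are assumptions, not the documented behavior of LinkExtractor.appendTag.

```java
import java.util.Enumeration;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;

class AppendTagSketch {
    /** Appends an opening tag such as <a href="..."> to the buffer. */
    static void appendTag(StringBuffer buffer, HTML.Tag tag, MutableAttributeSet attributes) {
        buffer.append('<').append(tag);
        Enumeration<?> names = attributes.getAttributeNames();
        while (names.hasMoreElements()) {
            Object name = names.nextElement();
            // Assumption: values are written in double quotes without escaping.
            buffer.append(' ').append(name)
                  .append("=\"").append(attributes.getAttribute(name)).append('"');
        }
        buffer.append('>');
    }

    public static void main(String[] args) {
        SimpleAttributeSet attrs = new SimpleAttributeSet();
        attrs.addAttribute(HTML.Attribute.HREF, "http://example.com/");
        StringBuffer out = new StringBuffer();
        appendTag(out, HTML.Tag.A, attrs);
        System.out.println(out); // <a href="http://example.com/">
    }
}
```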

handleEndTag

public void handleEndTag(javax.swing.text.html.HTML.Tag tag,
                         int position)
Executed when a closing HTML tag is found in the document. Note that the parser may add "implied" closing tags. For example, the default parser adds closing </p> tags. This just writes the end tag out to the absolute copy.
Overrides:
handleEndTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
tag - The tag found.
position - The position of the tag in the document.

handleSimpleTag

public void handleSimpleTag(javax.swing.text.html.HTML.Tag tag,
                            javax.swing.text.MutableAttributeSet attributes,
                            int position)
Executed when an HTML tag that has no closing tag is found in the document. Adds links for FRAME tags and writes them out to the absolute copy.
Overrides:
handleSimpleTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
tag - The tag that caused this function to be executed.
attributes - The attributes of tag.
position - The start of the tag in the document. If the tag is implied (filled in by the parser but not actually present in the document) then position will correspond to that of the next encountered tag.

extractLinks

public java.util.List extractLinks()
Extracts links from the given page. This method constructs a parser and registers this object as the callback.
Returns:
A list of Link objects containing the links found on this page. The links will all be absolute links.
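
The parse-and-callback pattern described here can be sketched with the standard Swing classes alone. The sketch below is not the LinkExtractor source: the class LinkCollector, its String-based signature, and the use of java.net.URI for completing links are assumptions made so that the example runs without HTMLPage or Link.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkCollector extends HTMLEditorKit.ParserCallback {
    private final URI base;
    final List<String> links = new ArrayList<>();

    LinkCollector(String baseUrl) {
        this.base = URI.create(baseUrl);
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag == HTML.Tag.A) {
            collect(attributes, HTML.Attribute.HREF);
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag == HTML.Tag.FRAME) {
            collect(attributes, HTML.Attribute.SRC);
        }
    }

    private void collect(MutableAttributeSet attributes, HTML.Attribute attr) {
        Object value = attributes.getAttribute(attr);
        if (value == null) {
            return;
        }
        try {
            links.add(base.resolve(value.toString().trim()).toString());
        } catch (IllegalArgumentException e) {
            // Not a valid URI reference; this sketch simply skips it.
        }
    }

    /** Constructs a parser, registers a collector as the callback, and parses. */
    static List<String> extractLinks(String html, String baseUrl) throws IOException {
        LinkCollector collector = new LinkCollector(baseUrl);
        // The third argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(new StringReader(html), collector, true);
        return collector.links;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href=\"b.html\">B</a></body></html>";
        System.out.println(extractLinks(html, "http://example.com/docs/"));
    }
}
```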

addLink

protected void addLink(javax.swing.text.MutableAttributeSet attributes,
                       javax.swing.text.html.HTML.Attribute attr)
Retrieves a link from an attribute set and completes it against the base URL.
Parameters:
attributes - The attribute set.
attr - The attribute that should be treated as a URL. For example, attr should be HTML.Attribute.HREF if attributes is from an anchor tag.
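
A minimal sketch of this step, assuming that links are completed with java.net.URI.resolve and that a missing or malformed attribute value is silently skipped. The class AddLinkSketch and both of those behaviors are illustrative assumptions rather than the documented implementation.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;

class AddLinkSketch {
    final URI base;
    final List<URI> links = new ArrayList<>();

    AddLinkSketch(String baseUrl) {
        this.base = URI.create(baseUrl);
    }

    /** Reads attr from the attribute set and completes it against the base. */
    void addLink(MutableAttributeSet attributes, HTML.Attribute attr) {
        Object value = attributes.getAttribute(attr);
        if (value == null) {
            return; // for example, a named anchor that has no HREF
        }
        try {
            links.add(base.resolve(value.toString().trim()));
        } catch (IllegalArgumentException e) {
            // Not a syntactically valid URI reference; skipped in this sketch.
        }
    }

    public static void main(String[] args) {
        AddLinkSketch sketch = new AddLinkSketch("http://example.com/docs/");
        SimpleAttributeSet anchor = new SimpleAttributeSet();
        anchor.addAttribute(HTML.Attribute.HREF, "../img/logo.png");
        sketch.addLink(anchor, HTML.Attribute.HREF);
        System.out.println(sketch.links); // [http://example.com/img/logo.png]
    }
}
```

For a FRAME tag the caller would pass HTML.Attribute.SRC instead of HTML.Attribute.HREF.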