nz.ac.waikato.mcennis.rat.parser
Class BaseHTMLParser

java.lang.Object
  extended by nz.ac.waikato.mcennis.rat.parser.BaseHTMLParser
All Implemented Interfaces:
Parser

public class BaseHTMLParser
extends java.lang.Object
implements Parser

Class for transforming WebPage data into bag-of-words format.


Constructor Summary
BaseHTMLParser()
           
 
Method Summary
 Parser duplicate()
          Create an exact copy of this object
 ParsedObject get()
          Return the histogram object. FIX: currently returns null.
 java.lang.String getName()
           
 void parse(java.io.InputStream data)
          Parse the document into bag-of-words format.
 void parse(java.io.InputStream data, Crawler crawler)
          Parse a data stream while spidering for more pages
protected  void processLinks(java.lang.String content, Crawler crawler)
          Creates a new URL to crawl from the anchor text
 void set(ParsedObject o)
          Set the parsed object to be loaded
 void setName(java.lang.String name)
          Give this parser an id that should be globally unique
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BaseHTMLParser

public BaseHTMLParser()

Method Detail

parse

public void parse(java.io.InputStream data)
Parse the document into bag-of-words format.

Specified by:
parse in interface Parser
Parameters:
data - data stream to be parsed
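
Example (illustrative only):
The following is a minimal sketch of what converting an HTML stream into a bag-of-words histogram can look like. It is not the implementation of BaseHTMLParser. The class name BagOfWordsSketch, the method histogram, the UTF-8 assumption, and the regex-based tag stripping are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the documented API.
public class BagOfWordsSketch {

    /** Reads an HTML stream and returns a word-to-count histogram. */
    public static Map<String, Integer> histogram(InputStream data) {
        String html;
        try {
            // Assumption: the page is UTF-8 encoded.
            html = new String(data.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Crude tag removal; a real HTML parser would be more robust.
        String text = html.replaceAll("(?s)<[^>]*>", " ");

        // TreeMap keeps the words sorted, which makes the output stable.
        Map<String, Integer> counts = new TreeMap<>();
        // Split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        byte[] page = "<html><body><p>Hello world, hello parser</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        // prints {hello=2, parser=1, world=1}
        System.out.println(histogram(new ByteArrayInputStream(page)));
    }
}
```

The sketch strips tags but keeps the text inside script and style elements, so a production parser would likely need to filter those as well.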

parse

public void parse(java.io.InputStream data,
                  Crawler crawler)
Parse a data stream while spidering for more pages

Specified by:
parse in interface Parser
Parameters:
data - stream to be parsed
crawler - crawler for crawling new pages

processLinks

protected void processLinks(java.lang.String content,
                            Crawler crawler)
Creates a new URL to crawl from the anchor text

Parameters:
content - line containing the anchor text
crawler - crawler to crawl the new URLs
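
Example (illustrative only):
The following is a rough sketch of how links might be pulled out of a line of HTML and resolved into absolute addresses. It is not the implementation of processLinks. Because the Crawler interface is not described here, the sketch returns the links as a list instead of handing them to a crawler. The class name LinkExtractionSketch, the method extractLinks, and the base parameter are assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the documented API.
public class LinkExtractionSketch {

    // Matches href="..." or href='...' inside an anchor tag, ignoring case.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns every anchor target in content, resolved against base. */
    public static List<URI> extractLinks(String content, URI base) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(content);
        while (m.find()) {
            try {
                // Relative targets become absolute; absolute ones pass through.
                links.add(base.resolve(m.group(1).trim()));
            } catch (IllegalArgumentException e) {
                // Skip targets that are not syntactically valid URIs.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.org/docs/index.html");
        String line = "<a href=\"page2.html\">next</a> <A HREF='http://example.com/'>ext</A>";
        // prints [http://example.org/docs/page2.html, http://example.com/]
        System.out.println(extractLinks(line, base));
    }
}
```

A regular expression only handles simple, well-formed anchors; unquoted attribute values and anchors split across lines are not matched by this sketch.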

duplicate

public Parser duplicate()
Description copied from interface: Parser
Create an exact copy of this object

Specified by:
duplicate in interface Parser
Returns:
new Parser object

get

public ParsedObject get()
Return the histogram object. FIX: currently returns null.

Specified by:
get in interface Parser
Returns:
parsed object

set

public void set(ParsedObject o)
Description copied from interface: Parser
Set the parsed object to be loaded

Specified by:
set in interface Parser
Parameters:
o - object to be loaded

setName

public void setName(java.lang.String name)
Description copied from interface: Parser
Give this parser an id that should be globally unique

Specified by:
setName in interface Parser
Parameters:
name - id for this parser

getName

public java.lang.String getName()
Specified by:
getName in interface Parser