nz.ac.waikato.mcennis.rat.parser
Class BaseHTMLParser

java.lang.Object
  extended by nz.ac.waikato.mcennis.rat.parser.AbstractParser
      extended by nz.ac.waikato.mcennis.rat.parser.BaseHTMLParser
All Implemented Interfaces:
Parser

public class BaseHTMLParser
extends AbstractParser
implements Parser

Parser that transforms web page (HTML) data into bag-of-words format.


Constructor Summary
BaseHTMLParser()
           
 
Method Summary
 Parser duplicate()
          Create an exact copy of this object
 ParsedObject get()
          Returns the histogram object. Known issue: currently returns null.
 void parse(java.io.InputStream data, Crawler crawler, Properties properties)
          Parse a data stream while spidering for more pages
 void parse(java.io.InputStream data, Properties properties)
          Parse the document into bag-of-words format.
protected  void processLinks(java.lang.String content, Crawler crawler)
          Creates a new URL to crawl from anchor text.
 void set(ParsedObject o)
          Set the parsed object to be loaded
 
Methods inherited from class nz.ac.waikato.mcennis.rat.parser.AbstractParser
check, check, getName, getParameter, getParameter, init, setName
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface nz.ac.waikato.mcennis.rat.parser.Parser
check, check, getName, getParameter, getParameter, init, setName
 

Constructor Detail

BaseHTMLParser

public BaseHTMLParser()
Method Detail

parse

public void parse(java.io.InputStream data,
                  Properties properties)
Parse the document into bag-of-words format.

Specified by:
parse in interface Parser
Specified by:
parse in class AbstractParser
Parameters:
data - data stream to be parsed
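
A minimal sketch of what "bag-of-words format" can mean for an HTML stream, assuming tags are stripped and the remaining text is split into lowercase tokens. The class BagOfWordsSketch and the method bagOfWords are hypothetical names used only for illustration; they are not part of RAT, and the actual tokenization in BaseHTMLParser may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class BagOfWordsSketch {

    // Hypothetical helper (not RAT API): reads the whole stream, removes
    // anything that looks like a tag, and counts the remaining words.
    static Map<String, Integer> bagOfWords(InputStream data) {
        String html;
        try {
            html = new String(data.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Assumption: a simple regex is enough to drop tags for this sketch.
        String text = html.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "<html><body><p>Data about data</p></body></html>"
                        .getBytes(StandardCharsets.UTF_8));
        System.out.println(bagOfWords(in)); // prints {about=1, data=2}
    }
}
```

The sorted map keeps the printed histogram deterministic; word order in the document is discarded, which is the defining property of a bag-of-words representation.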

parse

public void parse(java.io.InputStream data,
                  Crawler crawler,
                  Properties properties)
Parse a data stream while spidering for more pages

Specified by:
parse in interface Parser
Specified by:
parse in class AbstractParser
Parameters:
data - stream to be parsed
crawler - crawler for crawling new pages
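
One way to picture parsing "while spidering" is a single pass over the stream in which every line that appears to contain an anchor is handed to a link handler. The sketch below assumes line-by-line reading and uses a java.util.function.Consumer as a stand-in for the Crawler, whose real interface is not shown on this page; SpideringParseSketch and its parse method are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

public class SpideringParseSketch {

    // Hypothetical stand-in for parse(data, crawler, properties): forwards
    // each line containing an anchor tag to linkHandler and returns how many
    // lines were forwarded.
    static int parse(InputStream data, Consumer<String> linkHandler) {
        int forwarded = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(data, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: a case-insensitive "<a " check marks anchor lines.
                if (line.toLowerCase(Locale.ROOT).contains("<a ")) {
                    linkHandler.accept(line);
                    forwarded++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return forwarded;
    }

    public static void main(String[] args) {
        String page = "<p>intro</p>\n<a href=\"next.html\">next</a>\n<p>end</p>";
        List<String> anchorLines = new ArrayList<>();
        int count = parse(new ByteArrayInputStream(
                page.getBytes(StandardCharsets.UTF_8)), anchorLines::add);
        System.out.println(count + " line(s) forwarded: " + anchorLines);
    }
}
```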

processLinks

protected void processLinks(java.lang.String content,
                            Crawler crawler)
Creates a new URL to crawl from anchor text.

Parameters:
content - line containing the anchor text
crawler - crawler to crawl the new URLs
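
As a rough illustration of turning anchor text into crawlable URLs, the sketch below pulls href values out of a line with a regular expression and resolves them against a base address using java.net.URI. The class LinkExtractionSketch, the method extractLinks, and the baseUrl parameter are assumptions made for this example; how processLinks actually locates links and passes them to the Crawler is not documented here.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractionSketch {

    // Matches the quoted href value inside an <a ...> tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper (not RAT API): returns absolute URLs for every
    // anchor found in content, resolved against baseUrl.
    static List<String> extractLinks(String content, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(content);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip href values that are not valid URI references.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<a href=\"/docs/index.html\">Docs</a> "
                + "<a class=\"nav\" href='page2.html'>Next</a>";
        System.out.println(extractLinks(line, "http://example.org/a/b.html"));
        // prints [http://example.org/docs/index.html, http://example.org/a/page2.html]
    }
}
```

In a real crawler each resolved URL would be handed to the crawl queue rather than collected in a list; the list simply keeps the sketch self-contained.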

duplicate

public Parser duplicate()
Description copied from interface: Parser
Create an exact copy of this object

Specified by:
duplicate in interface Parser
Specified by:
duplicate in class AbstractParser
Returns:
new Parser object

get

public ParsedObject get()
Returns the histogram object. Known issue: currently returns null.

Specified by:
get in interface Parser
Specified by:
get in class AbstractParser
Returns:
parsed object

set

public void set(ParsedObject o)
Description copied from interface: Parser
Set the parsed object to be loaded

Specified by:
set in interface Parser
Specified by:
set in class AbstractParser
Parameters:
o - object to be loaded
