nz.ac.waikato.mcennis.rat.crawler
Class CrawlerBase

java.lang.Object
  extended by nz.ac.waikato.mcennis.rat.crawler.CrawlerBase
All Implemented Interfaces:
Crawler
Direct Known Subclasses:
WebCrawler.Spider

public class CrawlerBase
extends java.lang.Object
implements Crawler

Base class implementing basic web access capabilities. It executes in the current thread and accesses pages sequentially. It is designed to be the base class for other web-accessing crawlers. In particular, it is the superclass of the WebCrawler crawler.
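
A minimal usage sketch follows; HTMLParser is a hypothetical Parser implementation standing in for whatever parsers the surrounding application provides:

    CrawlerBase crawler = new CrawlerBase();
    // Parsers must be set before crawling, otherwise a null pointer
    // exception is thrown on parse (see the constructor note below).
    crawler.set(new Parser[]{new HTMLParser()});
    try {
        crawler.crawl("http://www.example.com/");
    } catch (java.net.MalformedURLException e) {
        // the site string was not a valid URL
    } catch (java.io.IOException e) {
        // the document could not be retrieved
    }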


Field Summary
protected  boolean cache
          Perform caching so that each parser gets a cached copy of the original.
protected  Parser[] parser
          Parser array handling all parsers utilized in processing documents.
protected  boolean proxy
          Is there a proxy that needs to be used in collecting data.
protected  boolean spider
          Is this object operating as a spider (automatically following links).
 
Constructor Summary
CrawlerBase()
          Base constructor.
 
Method Summary
 void crawl(java.lang.String site)
          Identical to crawl(String, String[]) except that all parsers are used.
 void crawl(java.lang.String site, java.lang.String[] parsers)
          Fetches the named site and processes it with the named parsers.
protected  void doParse(byte[] raw_data, java.lang.String[] parsers)
          Helper function separated from the public parse logic to allow easy overriding in subclasses.
 Parser[] getParser()
          Returns the parsers that are associated with this crawler.
 java.lang.String getProxyHost()
          Returns the string used to determine the proxy host for this connection.
 java.lang.String getProxyPort()
          Returns the string describing the port the proxy is operating on.
 java.lang.String getProxyType()
          Returns the string the crawler will use to set the system property determining the proxy type.
 boolean isCaching()
          Is the crawler caching the page, or is it re-acquiring the page for each parser?
 boolean isSpidering()
          Is this crawler following links?
 void set(Parser[] parser)
          Takes the given array of parsers and stores a copy of each, created with the parser's duplicate method.
 void setCaching(boolean b)
          Sets caching on or off. FIXME: caching is currently permanently on.
 void setProxy(boolean proxy)
          Sets whether the crawler should go through a proxy to access the internet.
 void setProxyHost(java.lang.String proxyHost)
          Sets the string to be used to determine the proxy host for this connection.
 void setProxyPort(java.lang.String proxyPort)
          Sets the string describing the port the proxy is operating on.
 void setProxyType(java.lang.String proxyType)
          Sets the string the crawler will use to set the system property determining the proxy type.
 void setSpidering(boolean s)
          Sets whether links to newly discovered documents should also be read.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

parser

protected Parser[] parser
Parser array handling all parsers utilized in processing documents.


spider

protected boolean spider
Is this object operating as a spider (automatically following links)? This is used to determine which procedure to call on the parsers.


proxy

protected boolean proxy
Is there a proxy that needs to be used in collecting data.


cache

protected boolean cache
Perform caching so that each parser gets a cached copy of the original. FIXME: always performs caching regardless of this value

Constructor Detail

CrawlerBase

public CrawlerBase()
Base constructor. The parsers must be set before this object is used, or a null pointer exception will be thrown on parse.

Method Detail

set

public void set(Parser[] parser)
Takes the given array of parsers and stores a copy of each, created with the parser's duplicate method.

Specified by:
set in interface Crawler
Parameters:
parser - Array of parsers to be duplicated and set for parsing documents

crawl

public void crawl(java.lang.String site)
           throws java.net.MalformedURLException,
                  java.io.IOException
Identical to crawl(String, String[]) except that all parsers are used.

Specified by:
crawl in interface Crawler
Parameters:
site - site to be crawled
Throws:
java.net.MalformedURLException - if the site URL is invalid
java.io.IOException - if an error occurs during retrieval

crawl

public void crawl(java.lang.String site,
                  java.lang.String[] parsers)
           throws java.net.MalformedURLException,
                  java.io.IOException
Specified by:
crawl in interface Crawler
Parameters:
site - Name of the document to be fetched
parsers - names of the parsers that should process this site
Throws:
java.net.MalformedURLException - If the site to crawl is not a valid document. Only thrown if the underlying crawler is retrieving documents via HTTP or a similar protocol.
java.io.IOException - Thrown if there is a problem retrieving the document to be processed.
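
Continuing the sketch under the class description, a hedged example of restricting a crawl to a subset of the registered parsers; the parser name is hypothetical:

    // Process the site with only the named parsers rather than all of them.
    crawler.crawl("http://www.example.com/", new String[]{"HTMLParser"});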

getProxyHost

public java.lang.String getProxyHost()
Returns the string used to determine the proxy host for this connection.

Returns:
proxy host descriptor

setProxyHost

public void setProxyHost(java.lang.String proxyHost)
Sets the string to be used to determine the proxy host for this connection.

Parameters:
proxyHost - proxy host descriptor

getProxyPort

public java.lang.String getProxyPort()
Returns the string describing the port the proxy is operating on.

Returns:
port of proxy

setProxyPort

public void setProxyPort(java.lang.String proxyPort)
Sets the string describing the port the proxy is operating on.

Parameters:
proxyPort - port of the proxy

getProxyType

public java.lang.String getProxyType()
Returns the string the crawler will use to set the system property determining the proxy type.

Returns:
type of the proxy

setProxyType

public void setProxyType(java.lang.String proxyType)
Sets the string the crawler will use to set the system property determining the proxy type.

Parameters:
proxyType - type of the proxy
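
Taken together, the proxy setters are typically called before crawling. A sketch with placeholder values; the exact strings these setters expect (in particular the proxy type) are not specified here:

    CrawlerBase crawler = new CrawlerBase();
    crawler.setProxy(true);                     // route requests through a proxy
    crawler.setProxyHost("proxy.example.com");  // placeholder host
    crawler.setProxyPort("8080");               // placeholder port
    crawler.setProxyType("HTTP");               // placeholder; the expected value is undocumented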

doParse

protected void doParse(byte[] raw_data,
                       java.lang.String[] parsers)
                throws java.io.IOException,
                       java.lang.Exception
Helper function separated from the public parse logic to allow easy overriding in subclasses. Passes a copy of the given raw data to each of the named parsers.

Parameters:
raw_data - byte array containing the raw contents of the document to be parsed
parsers - names of the parsers that should process this document
Throws:
java.io.IOException
java.lang.Exception
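
Because doParse is protected, subclasses can hook in before parsing. A minimal sketch of such an override, based only on the signature above; the logging line is illustrative:

    public class LoggingCrawler extends CrawlerBase {
        @Override
        protected void doParse(byte[] raw_data, java.lang.String[] parsers)
                throws java.io.IOException, java.lang.Exception {
            // Illustrative pre-processing hook before delegating to the base parse.
            System.err.println("Parsing " + raw_data.length + " bytes");
            super.doParse(raw_data, parsers);
        }
    }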

getParser

public Parser[] getParser()
Returns the parsers that are associated with this crawler. Returns null if no parsers have been set.

Specified by:
getParser in interface Crawler
Returns:
Parser array containing the parsers used to parse documents.

setProxy

public void setProxy(boolean proxy)
Sets whether the crawler should go through a proxy to access the internet.

Specified by:
setProxy in interface Crawler
Parameters:
proxy - Should a proxy be used to access the internet

setCaching

public void setCaching(boolean b)
Sets caching on or off. FIXME: caching is currently permanently on.

Specified by:
setCaching in interface Crawler
Parameters:
b - should caching be enabled

isCaching

public boolean isCaching()
Description copied from interface: Crawler
Is the crawler caching the page, or is it re-acquiring the page for each parser?

Specified by:
isCaching in interface Crawler
Returns:
is caching enabled

setSpidering

public void setSpidering(boolean s)
Description copied from interface: Crawler
Sets whether links to newly discovered documents should also be read.

Specified by:
setSpidering in interface Crawler
Parameters:
s - should newly discovered links also be followed

isSpidering

public boolean isSpidering()
Description copied from interface: Crawler
Is this crawler following links?

Specified by:
isSpidering in interface Crawler
Returns:
true if the crawler is following links, false otherwise
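
A short sketch of toggling link-following, continuing the earlier example; note that CrawlerBase itself retrieves pages sequentially in the current thread, so automated spidering behavior comes from subclasses such as WebCrawler:

    crawler.setSpidering(true);   // follow links discovered in documents
    if (crawler.isSpidering()) {
        crawler.crawl("http://www.example.com/");
    }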