nz.ac.waikato.mcennis.rat.crawler
Class GZipFileCrawler

java.lang.Object
  extended by nz.ac.waikato.mcennis.rat.crawler.GZipFileCrawler
All Implemented Interfaces:
Crawler

public class GZipFileCrawler
extends java.lang.Object
implements Crawler
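
A minimal usage sketch follows. It assumes the classes from this package and its parser package are imported; XMLParser is a hypothetical placeholder for any concrete Parser implementation, and the file name is illustrative:

    // Sketch only: configure and run a GZipFileCrawler.
    GZipFileCrawler crawler = new GZipFileCrawler();
    crawler.set(new Parser[]{new XMLParser()}); // XMLParser is a placeholder Parser
    crawler.setCaching(true);    // hold each document in memory for all parsers
    crawler.setSpidering(false); // do not follow links found while parsing
    try {
        crawler.crawl("data/pages.xml.gz");   // apply every registered parser
    } catch (java.io.IOException e) {         // also covers MalformedURLException
        e.printStackTrace();
    }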


Constructor Summary
GZipFileCrawler()
          Creates a new instance of GZipFileCrawler
 
Method Summary
 void crawl(java.lang.String site)
          Identical to crawl(String, String[]) except that all registered parsers are used
 void crawl(java.lang.String site, java.lang.String[] parsers)
          Fetches the named document and parses it with the given subset of parsers.
 Parser[] getParser()
          Returns an array of Parser objects used by this crawler to parse pages.
 boolean isCaching()
          Is the crawler caching the page or is it re-acquiring the page for each parser.
 boolean isSpidering()
          Is this crawler following links
 void set(Parser[] parser)
          Set the parsers that are to be utilized by this crawler to interpret the documents that are fetched.
 void setCaching(boolean b)
          Set whether or not the crawler should cache the web page or reload it for each individual parser.
 void setProxy(boolean proxy)
          Establishes whether a proxy is needed to access documents
 void setSpidering(boolean s)
          Set whether links to newly discovered documents should also be read
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

GZipFileCrawler

public GZipFileCrawler()
Creates a new instance of GZipFileCrawler

Method Detail

crawl

public void crawl(java.lang.String site)
           throws java.net.MalformedURLException,
                  java.io.IOException
Identical to crawl(String, String[]) except that all registered parsers are used

Specified by:
crawl in interface Crawler
Parameters:
site - site to be crawled
Throws:
java.net.MalformedURLException - if the site URL is invalid
java.io.IOException - error occurs during retrieval
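
For example, a hedged sketch of crawling a single gzipped file with every registered parser, given a GZipFileCrawler instance crawler configured via set(Parser[]) (the path is illustrative):

    try {
        crawler.crawl("archive/dump.html.gz");
    } catch (java.net.MalformedURLException e) {
        // the site string was not a valid location
    } catch (java.io.IOException e) {
        // retrieval or decompression failed
    }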

crawl

public void crawl(java.lang.String site,
                  java.lang.String[] parsers)
           throws java.net.MalformedURLException,
                  java.io.IOException
Specified by:
crawl in interface Crawler
Parameters:
site - Name of the document to be fetched
parsers - index of parsers to parse this site
Throws:
java.net.MalformedURLException - If the site to crawl is not a valid URL. Thrown only when the underlying crawler retrieves documents via HTTP or a similar protocol.
java.io.IOException - Thrown if there is a problem retrieving the document to be processed.
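
A sketch of the parser-subset variant. How the String[] entries index parsers (by name or otherwise) is not specified on this page, so "XMLParser" is an assumed identifier:

    try {
        crawler.crawl("archive/dump.html.gz", new String[]{"XMLParser"});
    } catch (java.io.IOException e) {   // also covers MalformedURLException
        e.printStackTrace();
    }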

set

public void set(Parser[] parser)
Description copied from interface: Crawler
Set the parsers that are to be utilized by this crawler to interpret the documents that are fetched.

Specified by:
set in interface Crawler
Parameters:
parser - Array of parsing objects to be utilized by the crawler to process documents fetched
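
For instance (XMLParser and ToFileParser are hypothetical Parser implementations, not confirmed by this page):

    Parser[] parsers = new Parser[]{new XMLParser(), new ToFileParser()};
    crawler.set(parsers);   // these parsers will process every fetched document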

getParser

public Parser[] getParser()
Description copied from interface: Crawler
Returns an array of Parser objects used by this crawler to parse pages.

Specified by:
getParser in interface Crawler
Returns:
Array of parsers utilized to parse fetched documents.
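
A small sketch of inspecting the registered parsers, given a GZipFileCrawler instance crawler:

    Parser[] active = crawler.getParser();
    System.out.println("Registered parsers: " + active.length);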

setProxy

public void setProxy(boolean proxy)
Description copied from interface: Crawler
Establishes whether a proxy is needed to access documents

Specified by:
setProxy in interface Crawler
Parameters:
proxy - Whether or not a proxy is needed for accessing documents
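
For example (how proxy details such as host and port are supplied is not documented on this page; presumably via standard mechanisms such as system properties):

    crawler.setProxy(true);   // route document retrieval through a proxy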

setCaching

public void setCaching(boolean b)
Description copied from interface: Crawler
Set whether or not the crawler should cache the web page or reload it for each individual parser. This is a trade-off between the memory needed to load potentially large files and the cost of repeatedly reloading the web page.

Specified by:
setCaching in interface Crawler
Parameters:
b - should caching occur

isCaching

public boolean isCaching()
Description copied from interface: Crawler
Is the crawler caching the page or is it re-acquiring the page for each parser.

Specified by:
isCaching in interface Crawler
Returns:
is caching enabled
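
A sketch combining setCaching(boolean) with this check, given a GZipFileCrawler instance crawler:

    crawler.setCaching(true);
    if (crawler.isCaching()) {
        // each document is read once and held in memory for all parsers
    }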

setSpidering

public void setSpidering(boolean s)
Description copied from interface: Crawler
Set whether links to newly discovered documents should also be read

Specified by:
setSpidering in interface Crawler
Parameters:
s - should spidering occur

isSpidering

public boolean isSpidering()
Description copied from interface: Crawler
Is this crawler following links

Specified by:
isSpidering in interface Crawler
Returns:
true if the crawler follows links
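
A sketch combining setSpidering(boolean) with this check, given a GZipFileCrawler instance crawler:

    crawler.setSpidering(true);   // follow links discovered while parsing
    if (crawler.isSpidering()) {
        // newly discovered documents will also be crawled
    }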