nz.ac.waikato.mcennis.rat.crawler
Class GZipFileCrawler

java.lang.Object
  extended by nz.ac.waikato.mcennis.rat.crawler.GZipFileCrawler
All Implemented Interfaces:
Crawler

public class GZipFileCrawler
extends java.lang.Object
implements Crawler

Class for reading the contents of a gzip-compressed file.
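
A minimal usage sketch, assuming a Parser implementation is in scope (myParser is a placeholder, and the Parser import path is assumed from this toolkit's layout):

    import nz.ac.waikato.mcennis.rat.crawler.GZipFileCrawler;
    import nz.ac.waikato.mcennis.rat.parser.Parser;  // package location assumed

    GZipFileCrawler crawler = new GZipFileCrawler();
    crawler.setParsers(new Parser[]{myParser});  // myParser: any Parser implementation (assumed)
    try {
        crawler.crawl("/data/pages.xml.gz");     // hypothetical gzip file to read
    } catch (java.io.IOException e) {            // MalformedURLException is an IOException subclass
        System.err.println("Crawl failed: " + e.getMessage());
    }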


Constructor Summary
GZipFileCrawler()
          Creates a new instance of GZipFileCrawler
 
Method Summary
 void block(java.lang.String site)
          Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site) method
 void block(java.lang.String site, Properties props)
          Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site, Properties props) method
 void crawl(java.lang.String site)
          Identical to crawl(String site, Properties parsers) except that all parsers are used
 void crawl(java.lang.String site, Properties parsers)
          Fetches the named document, processing it with the parsers described in the given Properties object.
 Crawler getCrawler()
          Returns a reference to the crawler used when spidering is enabled.
 CrawlerFilter getFilter()
          Returns the current CrawlerFilter set for this crawler.
 Parser[] getParsers()
          Returns an array of Parser objects used by this crawler to parse pages.
 Properties getProperties()
          Returns the Properties object associated with this crawler.
 java.lang.String getProxyHost()
          Returns the string to be used to determine the proxy host for this connection
 java.lang.String getProxyPort()
          Returns the string describing the port the proxy is operating on
 java.lang.String getProxyType()
          Returns the string the crawler will use to set the system property determining the proxy type.
 boolean isCaching()
          Returns whether the crawler caches the page or re-acquires it for each parser.
 boolean isSpidering()
          Returns whether this crawler follows links.
 void set(Properties parser)
          Takes the current array of parsers and creates a copy of them utilizing the duplicate method on each parser.
 void setCaching(boolean b)
          Sets caching on or off
 void setCrawler(Crawler c)
          Sets the crawler used when spidering.
 void setFilter(CrawlerFilter filter)
          Sets the filter used to determine whether or not a given URL should be added to the list of URLs to be crawled.
 void setParsers(Parser[] parsers)
          Sets the parsers available to this crawler.
 void setProxy(boolean proxy)
          Sets whether the crawler should go through a proxy to access the internet.
 void setProxyHost(java.lang.String proxyHost)
          Sets the string to be used to determine the proxy host for this connection
 void setProxyPort(java.lang.String proxyPort)
          Sets the string describing the port the proxy is operating on
 void setProxyType(java.lang.String proxyType)
          Sets the string the crawler will use to set the system property determining the proxy type.
 void setSpidering(boolean s)
          Should links to newly discovered documents also be read? This also sets the value of the 'Spidering' parameter.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

GZipFileCrawler

public GZipFileCrawler()
Creates a new instance of GZipFileCrawler

Method Detail

crawl

public void crawl(java.lang.String site)
           throws java.net.MalformedURLException,
                  java.io.IOException
Identical to crawl(String site, Properties parsers) except that all parsers are used.

Specified by:
crawl in interface Crawler
Parameters:
site - site to be crawled
Throws:
java.net.MalformedURLException - if the site URL is invalid
java.io.IOException - error occurs during retrieval

crawl

public void crawl(java.lang.String site,
                  Properties parsers)
           throws java.net.MalformedURLException,
                  java.io.IOException
Specified by:
crawl in interface Crawler
Parameters:
site - Name of the document to be fetched
parsers - Properties object describing the parsers to be used for this site
Throws:
java.net.MalformedURLException - If the site to crawl is not a valid document. Only thrown if the underlying crawler is retrieving documents via HTTP or a similar protocol.
java.io.IOException - Thrown if there is a problem retrieving the document to be processed.

getProxyHost

public java.lang.String getProxyHost()
Returns the string to be used to determine the proxy host for this connection

Returns:
proxy host descriptor

setProxyHost

public void setProxyHost(java.lang.String proxyHost)
Sets the string to be used to determine the proxy host for this connection

Parameters:
proxyHost - proxy host descriptor

set

public void set(Properties parser)
Takes the current array of parsers and creates a copy of them utilizing the duplicate method on each parser.

Specified by:
set in interface Crawler
Parameters:
parser - Properties object describing the parsers to be duplicated and set for parsing documents

getProxyPort

public java.lang.String getProxyPort()
Returns the string describing the port the proxy is operating on

Returns:
port of proxy

setProxyPort

public void setProxyPort(java.lang.String proxyPort)
Sets the string describing the port the proxy is operating on

Parameters:
proxyPort - port of the proxy

getProxyType

public java.lang.String getProxyType()
Returns the string the crawler will use to set the system property determining the proxy type.

Returns:
type of the proxy

setProxyType

public void setProxyType(java.lang.String proxyType)
Sets the string the crawler will use to set the system property determining the proxy type.

Parameters:
proxyType - type of the proxy
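
A hedged configuration sketch combining the proxy setters; the host, port, and type values below are placeholders rather than defaults of this API:

    GZipFileCrawler crawler = new GZipFileCrawler();
    crawler.setProxy(true);                     // route document retrieval through a proxy
    crawler.setProxyHost("proxy.example.com");  // hypothetical proxy host
    crawler.setProxyPort("3128");               // hypothetical proxy port
    crawler.setProxyType("HTTP");               // placeholder; this value is written into the proxy system property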

getCrawler

public Crawler getCrawler()
Description copied from interface: Crawler
Returns a reference to the crawler used when spidering is enabled. By default, this is a self-reference.

Specified by:
getCrawler in interface Crawler
Returns:
reference to the crawler used in spidering

setCrawler

public void setCrawler(Crawler c)
Description copied from interface: Crawler
Sets the crawler used when spidering. A null value disables spidering.

Specified by:
setCrawler in interface Crawler
Parameters:
c - crawler to use for spidering

getProperties

public Properties getProperties()
Returns the Properties object associated with this crawler. Returns null if no properties have been set.

Specified by:
getProperties in interface Crawler
Returns:
Properties object associated with this crawler.

setProxy

public void setProxy(boolean proxy)
Sets whether the crawler should go through a proxy to access the internet.

Specified by:
setProxy in interface Crawler
Parameters:
proxy - Should a proxy be used to access the internet

setCaching

public void setCaching(boolean b)
Sets caching on or off

Specified by:
setCaching in interface Crawler
Parameters:
b - should caching be enabled

isCaching

public boolean isCaching()
Description copied from interface: Crawler
Returns whether the crawler caches the page or re-acquires it for each parser. The value returned is also the value of the 'cacheSource' parameter.

Specified by:
isCaching in interface Crawler
Returns:
is caching enabled

setSpidering

public void setSpidering(boolean s)
Description copied from interface: Crawler
Should links to newly discovered documents also be read? This also sets the value of the 'Spidering' parameter.

Specified by:
setSpidering in interface Crawler
Parameters:
s - whether newly discovered links should also be read

isSpidering

public boolean isSpidering()
Description copied from interface: Crawler
Returns whether this crawler follows links. The value is the value of the 'Spidering' parameter.

Specified by:
isSpidering in interface Crawler
Returns:
follows links or not
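
A short sketch tying the spidering methods together; per getCrawler above, the crawler defaults to a self-reference, and per setCrawler a null value disables spidering:

    GZipFileCrawler crawler = new GZipFileCrawler();
    crawler.setCrawler(crawler);  // follow discovered links with this same crawler (self-reference)
    crawler.setSpidering(true);   // newly discovered links will also be read
    boolean following = crawler.isSpidering();  // true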

setFilter

public void setFilter(CrawlerFilter filter)
Description copied from interface: Crawler
Sets the filter used to determine whether or not a given URL should be added to the list of URLs to be crawled. The results are stored in the 'CrawlerFilter' parameter

Specified by:
setFilter in interface Crawler
Parameters:
filter - Function to determine whether a URL should be parsed or not.

getFilter

public CrawlerFilter getFilter()
Description copied from interface: Crawler
Returns the current CrawlerFilter set for this crawler. This is the value of the 'CrawlerFilter' parameter.

Specified by:
getFilter in interface Crawler
Returns:
filter used to determine whether or not to schedule a URL for parsing

getParsers

public Parser[] getParsers()
Description copied from interface: Crawler
Returns an array of Parser objects used by this crawler to parse pages.

Specified by:
getParsers in interface Crawler
Returns:
Array of parsers utilized to parse fetched documents.

setParsers

public void setParsers(Parser[] parsers)
Description copied from interface: Crawler
Sets the parsers available to this crawler. Null is permitted, setting the parser list to an empty set. Also sets the value of the 'Parsers' parameter. This procedure also adds the children of the parsers to the parser set.

Specified by:
setParsers in interface Crawler
Parameters:
parsers - Parsers that can be called in a given crawler.
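
A minimal sketch of installing parsers; MyParser is a hypothetical Parser implementation, not a class in this package:

    Parser[] parsers = new Parser[]{new MyParser()};
    crawler.setParsers(parsers);             // children of these parsers are added to the set as well
    Parser[] active = crawler.getParsers();  // the full parser set now in use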

block

public void block(java.lang.String site)
Description copied from interface: Crawler
Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site) method

Specified by:
block in interface Crawler
Parameters:
site - URL to pass to the filter without passing to the parsers.
See Also:
nz.ac.waikato.mcennis.rat.crawler.filter.CrawlerFilter.load(String site)

block

public void block(java.lang.String site,
                  Properties props)
Description copied from interface: Crawler
Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site, Properties props) method

Specified by:
block in interface Crawler
Parameters:
site - URL to pass to the filter without passing to the parsers.
props - Properties object defining parameters associated with parsing the site.
See Also:
nz.ac.waikato.mcennis.rat.crawler.filter.CrawlerFilter.load(String site, Properties props)
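
A hedged example of pre-seeding the filter with both block variants; the URLs and the props object are placeholders:

    crawler.block("http://example.com/already-processed");  // recorded by the filter, never parsed
    crawler.block("http://example.com/skip-me", props);     // props: a Properties object assumed in scope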
