nz.ac.waikato.mcennis.rat.crawler
Interface Crawler

All Known Implementing Classes:
CrawlerBase, FileListCrawler, GZipFileCrawler, WebCrawler, WebCrawler.Spider

public interface Crawler

Interface for accessing files via IO. Designed to abstract away the difference between file and web access. Utilizes parsing objects to parse the documents collected. Multiple parsers can be utilized. It is crawler dependent whether all parsers are used against all documents or only a subset.
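
A minimal configuration sketch is shown below. The WebCrawler no-argument constructor and the someParser Parser instance are assumptions for illustration only; any implementing class and any Parser implementation may be substituted.

    Crawler crawler = new WebCrawler();      // assumed no-argument constructor
    crawler.set(new Parser[]{someParser});   // parsers applied to fetched documents
    crawler.setProxy(false);                 // no proxy needed for document access
    crawler.setCaching(true);                // parse each page from an in-memory copy
    crawler.setSpidering(false);             // do not follow links in fetched documents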


Method Summary
 void crawl(java.lang.String site)
          Fetches the site designated by site.
 void crawl(java.lang.String site, java.lang.String[] parsers)
          Fetches the site designated by site, applying only the indicated subset of parsers.
 Parser[] getParser()
          Returns an array of Parser objects used by this crawler to parse pages.
 boolean isCaching()
          Is the crawler caching the page or is it re-acquiring the page for each parser?
 boolean isSpidering()
          Is this crawler following links?
 void set(Parser[] parser)
          Set the parsers that are to be utilized by this crawler to interpret the documents that are fetched.
 void setCaching(boolean b)
          Set whether or not the crawler should cache the web page or reload it for each individual parser.
 void setProxy(boolean proxy)
          Establishes whether a proxy is needed to access documents.
 void setSpidering(boolean spider)
          Set whether links to newly discovered documents should also be read.
 

Method Detail

crawl

void crawl(java.lang.String site)
           throws java.net.MalformedURLException,
                  java.io.IOException
Fetches the site designated by site. The meaning and interpretation of 'site' is dependent on the crawler used. This can spark new secondary crawls depending on the crawler.

Parameters:
site - Name of the document to be fetched
Throws:
java.net.MalformedURLException - If the site to crawl is not a valid document. Only thrown if the underlying crawler is retrieving documents via HTTP or a similar protocol.
java.io.IOException - Thrown if there is a problem retrieving the document to be processed.
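
A hedged example of invoking this method on an already configured crawler; the URL is hypothetical:

    try {
        crawler.crawl("http://www.example.org/index.html");
    } catch (java.net.MalformedURLException e) {
        // the site string is not a valid URL for this crawler
    } catch (java.io.IOException e) {
        // the document could not be retrieved
    }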

crawl

void crawl(java.lang.String site,
           java.lang.String[] parsers)
           throws java.net.MalformedURLException,
                  java.io.IOException
Fetches the site designated by site, applying only the subset of parsers indicated by parsers.

Parameters:
site - Name of the document to be fetched
parsers - index of parsers to parse this site
Throws:
java.net.MalformedURLException - If the site to crawl is not a valid document. Only thrown if the underlying crawler is retrieving documents via HTTP or a similar protocol.
java.io.IOException - Thrown if there is a problem retrieving the document to be processed.
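
A sketch of restricting a crawl to a subset of the registered parsers; the parser identifier string is a hypothetical example, since how the strings map to parsers is implementation dependent:

    try {
        crawler.crawl("http://www.example.org/artist.html",
                      new String[]{"ArtistParser"});   // hypothetical parser identifier
    } catch (java.net.MalformedURLException e) {
        // the site string is not a valid URL for this crawler
    } catch (java.io.IOException e) {
        // the document could not be retrieved
    }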

set

void set(Parser[] parser)
Set the parsers that are to be utilized by this crawler to interpret the documents that are fetched.

Parameters:
parser - Array of parsing objects to be utilized by the crawler to process documents fetched

getParser

Parser[] getParser()
Returns an array of Parser objects used by this crawler to parse pages.

Returns:
Array of parsers utilized to parse fetched documents.

setProxy

void setProxy(boolean proxy)
Establishes whether a proxy is needed to access documents.

Parameters:
proxy - Whether or not a proxy is needed for accessing documents

setCaching

void setCaching(boolean b)
Set whether or not the crawler should cache the web page or reload it for each individual parser. This is a trade-off between the memory needed to hold potentially large files and the cost of continually reloading the web page.

Parameters:
b - should caching occur
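
A sketch of how an implementing class might honour this flag when processing a single site; fetchDocument, the site variable, and the Parser parse call are assumptions for illustration, not part of this interface:

    String document = null;
    for (Parser p : getParser()) {
        if (isCaching()) {
            if (document == null) {
                document = fetchDocument(site);   // fetch once and keep in memory
            }
            p.parse(document);                    // every parser reads the cached copy
        } else {
            p.parse(fetchDocument(site));         // re-acquire the page for each parser
        }
    }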

isCaching

boolean isCaching()
Is the crawler caching the page or is it re-acquiring the page for each parser?

Returns:
Whether caching is enabled.

setSpidering

void setSpidering(boolean spider)
Set whether links to newly discovered documents should also be read.

Parameters:
spider - Whether links to newly discovered documents should also be crawled
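
A sketch of how an implementing crawler might act on this flag after parsing a fetched document; extractLinks and the document variable are hypothetical helpers, not part of this interface:

    if (isSpidering()) {
        for (String link : extractLinks(document)) {   // hypothetical link extraction
            try {
                crawl(link);                           // secondary crawl of the discovered link
            } catch (java.io.IOException e) {
                // skip links that cannot be retrieved
            }
        }
    }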

isSpidering

boolean isSpidering()
Is this crawler following links?

Returns:
Whether this crawler follows links.