nz.ac.waikato.mcennis.rat.crawler
Interface Crawler

All Known Implementing Classes:
CrawlerBase, FileListCrawler, GZipFileCrawler, WebCrawler, WebCrawler.Spider

public interface Crawler

Interface for accessing files via I/O. Designed to abstract away the difference between file and web access. Utilizes parsing objects to parse the documents collected. Multiple parsers can be utilized; it is crawler-dependent whether all parsers are applied to all documents or only to a subset.


Method Summary
 void block(java.lang.String site)
          Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site) method
 void block(java.lang.String site, Properties props)
          Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site, Properties props) method
 void crawl(java.lang.String site)
          Fetches the site designated by site.
 void crawl(java.lang.String site, Properties props)
          Fetches the site designated by site, using the supplied Properties object.
 Crawler getCrawler()
          Return a reference to the crawler used when spidering is enabled.
 CrawlerFilter getFilter()
          Returns the current CrawlerFilter set for this crawler.
 Parser[] getParsers()
          Returns an array of Parser objects used by this crawler to parse pages.
 Properties getProperties()
          Get the Properties object backing this factory.
 boolean isCaching()
          Is the crawler caching the page or is it re-acquiring the page for each parser.
 boolean isSpidering()
          Is this crawler following links.
 void set(Properties parameters)
          Set the Properties object that parameterizes this crawler.
 void setCaching(boolean b)
          Set whether or not the crawler should cache the web page or reload for each individual parser.
 void setCrawler(Crawler c)
          Sets the crawler used when spidering.
 void setFilter(CrawlerFilter filter)
          Sets the filter used to determine whether or not a given URL should be added to the list of URLs to be crawled.
 void setParsers(Parser[] parsers)
          Sets the parsers available to this crawler.
 void setProxy(boolean proxy)
          Establishes whether a proxy is needed to access documents.
 void setSpidering(boolean spider)
          Should links to newly discovered documents also be followed. This also sets the value of the 'Spidering' parameter.
 

Method Detail

crawl

void crawl(java.lang.String site)
           throws java.net.MalformedURLException,
                  java.io.IOException
Fetches the site designated by site. The meaning and interpretation of 'site' is dependent on the crawler used. This can spark new secondary crawls depending on the crawler.

Parameters:
site - Name of the document to be fetched
Throws:
java.net.MalformedURLException - If the site to crawl is not a valid URL. Only thrown when the underlying crawler retrieves documents via HTTP or a similar protocol.
java.io.IOException - Thrown if there is a problem retrieving the document to be processed.
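The crawl(String) contract above can be sketched as follows. This is a hypothetical, self-contained illustration, not the toolkit's actual classes: the fetch step is faked with an in-memory string so the example runs without I/O, where a real crawler would perform file or HTTP access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Minimal sketch of a crawler that fetches a "site" and hands the resulting
// document to every registered parser, as crawl(String site) describes.
class SketchCrawler {
    private final List<Function<String, String>> parsers = new ArrayList<>();

    void addParser(Function<String, String> p) { parsers.add(p); }

    // Stand-in for real I/O: in the toolkit this would be file or web access.
    private String fetch(String site) {
        return "<html><title>" + site + "</title></html>";
    }

    // Fetches the document, then runs every parser over it.
    List<String> crawl(String site) {
        String document = fetch(site);
        List<String> parsed = new ArrayList<>();
        for (Function<String, String> p : parsers) {
            parsed.add(p.apply(document));
        }
        return parsed;
    }
}
```

Because 'site' is interpreted by the crawler, the same calling code works whether the backing implementation reads local files or web pages, which is the abstraction this interface is designed for.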

crawl

void crawl(java.lang.String site,
           Properties props)
           throws java.net.MalformedURLException,
                  java.io.IOException
Fetches the site designated by site, using the supplied Properties object.

Parameters:
site - Name of the document to be fetched
props - Properties object defining parameters associated with parsing the site
Throws:
java.net.MalformedURLException - If the site to crawl is not a valid URL. Only thrown when the underlying crawler retrieves documents via HTTP or a similar protocol.
java.io.IOException - Thrown if there is a problem retrieving the document to be processed.

set

void set(Properties parameters)
Set the Properties object that parameterizes this crawler.

Parameters:
parameters - Properties object defining the configuration of this crawler

getParsers

Parser[] getParsers()
Returns an array of Parser objects used by this crawler to parse pages.

Returns:
Array of parsers utilized to parse fetched documents.

setParsers

void setParsers(Parser[] parsers)
Sets the parsers available to this crawler. Null is permitted, setting the parser list to an empty set. Also sets the value of the 'Parsers' parameter. This procedure also adds the children of the parsers to the parser set.

Parameters:
parsers - Parsers that can be called in a given crawler.

getProperties

Properties getProperties()
Get the Properties object backing this factory.

Returns:
parameter set

setProxy

void setProxy(boolean proxy)
Establishes whether a proxy is needed to access documents.

Parameters:
proxy - Whether or not a proxy is needed for accessing documents

setCaching

void setCaching(boolean b)
Set whether or not the crawler should cache the web page or reload it for each individual parser. This is a trade-off between the memory needed for holding potentially large files and the cost of continually reloading the web page. This also sets the 'CacheSource' parameter of the properties object.

Parameters:
b - should caching occur

isCaching

boolean isCaching()
Is the crawler caching the page or is it re-acquiring the page for each parser. The value returned is also the value of the 'CacheSource' parameter.

Returns:
is caching enabled
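The caching trade-off described by setCaching/isCaching can be sketched as below. This is a hypothetical illustration (the class and counter are invented for the example, not taken from the toolkit): with caching on, one fetch serves all parsers; with it off, each parser triggers its own fetch.

```java
import java.util.List;
import java.util.function.Consumer;

// Sketch of the caching trade-off: memory held for the cached document
// versus repeated I/O when reloading for every parser.
class CachingDemo {
    int fetchCount = 0;        // counts how often the document is retrieved
    boolean caching = true;    // analogue of setCaching(boolean)

    private String fetch(String site) {
        fetchCount++;
        return "contents of " + site;
    }

    void crawl(String site, List<Consumer<String>> parsers) {
        if (caching) {
            String cached = fetch(site);               // fetched once, kept in memory
            for (Consumer<String> p : parsers) p.accept(cached);
        } else {
            for (Consumer<String> p : parsers) {
                p.accept(fetch(site));                 // reloaded for each parser
            }
        }
    }
}
```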

setSpidering

void setSpidering(boolean spider)
Should links to newly discovered documents also be followed. This also sets the value of the 'Spidering' parameter.

Parameters:
spider - whether links to newly discovered documents should also be crawled

isSpidering

boolean isSpidering()
Is this crawler following links. The value is the value of the 'Spidering' parameter.

Returns:
follows links or not

setFilter

void setFilter(CrawlerFilter filter)
Sets the filter used to determine whether or not a given URL should be added to the list of URLs to be crawled. The results are stored in the 'CrawlerFilter' parameter.

Parameters:
filter - Function to determine whether a URL should be parsed or not.

getFilter

CrawlerFilter getFilter()
Returns the current CrawlerFilter set for this crawler. This is the value of the 'CrawlerFilter' parameter.

Returns:
filter used to determine whether or not to schedule a URL for parsing
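The interaction between the filter and the block methods below can be sketched like this. It is a hypothetical stand-in (the seen-set and predicate are invented for the example): the filter rejects URLs already seen, and block(site) marks a URL as seen without ever passing it to the parsers.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Sketch of a CrawlerFilter-style gate: shouldCrawl consults the filter,
// and block() feeds a URL to the filter without crawling it.
class FilterDemo {
    private final Set<String> seen = new HashSet<>();

    // Filter in the spirit of CrawlerFilter: accept only unseen URLs.
    final Predicate<String> filter = url -> !seen.contains(url);

    // Analogue of block(String site): record the URL, skip parsing entirely.
    void block(String site) { seen.add(site); }

    boolean shouldCrawl(String site) { return filter.test(site); }
}
```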

block

void block(java.lang.String site)
Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site) method

Parameters:
site - URL to pass to the filter without passing to the parsers.
See Also:
nz.ac.waikato.mcennis.rat.crawler.filter.CrawlerFilter.load(String site)

block

void block(java.lang.String site,
           Properties props)
Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site, Properties props) method

Parameters:
site - URL to pass to the filter without passing to the parsers.
props - Properties object defining parameters associated with parsing the site.
See Also:
nz.ac.waikato.mcennis.rat.crawler.filter.CrawlerFilter.load(String site, Properties props)

getCrawler

Crawler getCrawler()
Return a reference to the crawler used when spidering is enabled. By default, this is a self-reference.

Returns:
reference to the crawler used in spidering

setCrawler

void setCrawler(Crawler c)
Sets the crawler used when spidering. A null value disables spidering.

Parameters:
c - crawler to use for spidering
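The getCrawler/setCrawler contract, where spidering delegates discovered links to another crawler that defaults to a self-reference, can be sketched as follows. This is a hypothetical illustration: the link map is an in-memory stand-in for real pages, and the visited set guards against crawling the same site twice.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of spidering with a delegate crawler: discovered links are handed
// to the delegate, which by default is the crawler itself.
class SpiderDemo {
    private final Map<String, List<String>> links; // fake web: page -> outgoing links
    private SpiderDemo delegate = this;            // default: self-reference
    final Set<String> visited = new HashSet<>();
    boolean spidering = true;                      // analogue of setSpidering(boolean)

    SpiderDemo(Map<String, List<String>> links) { this.links = links; }

    // Analogue of setCrawler(Crawler c).
    void setCrawler(SpiderDemo c) { delegate = c; }

    void crawl(String site) {
        if (!visited.add(site)) return;            // already crawled, stop here
        if (!spidering) return;                    // crawl the page but follow no links
        for (String link : links.getOrDefault(site, List.of())) {
            delegate.crawl(link);                  // secondary crawls via the delegate
        }
    }
}
```

Passing a different crawler to setCrawler lets secondary crawls run under a separately configured instance (for example, with different parsers), while the default self-reference reproduces ordinary recursive spidering.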
