nz.ac.waikato.mcennis.rat.crawler
Interface Crawler

All Known Implementing Classes:
CrawlerBase, FileListCrawler, GZipFileCrawler, WebCrawler, WebCrawler.Spider

public interface Crawler

Interface for accessing files via IO. Designed to abstract away the difference between file and web access. Utilizes parsing objects to parse the documents collected. Multiple parsers can be utilized. It is crawler dependent whether all parsers are used against all documents or only a subset.
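
A minimal configuration sketch is shown below. The WebCrawler no-argument constructor and the someParser Parser instance are assumptions for illustration only; any implementing class and any Parser implementation may be substituted.

    Crawler crawler = new WebCrawler();      // assumed no-argument constructor
    crawler.set(new Parser[]{someParser});   // parsers applied to fetched documents
    crawler.setProxy(false);                 // no proxy needed for document access
    crawler.setCaching(true);                // parse each page from an in-memory copy
    crawler.setSpidering(false);             // do not follow links in fetched documents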


Method Summary
 void crawl(java.lang.String site)
          Fetches the site designated by site.
 void crawl(java.lang.String site, java.lang.String[] parsers)
          Fetches the site designated by site, applying only the indicated subset of parsers.
 Parser[] getParser()
          Returns an array of Parser objects used by this crawler to parse pages.
 boolean isCaching()
          Is the crawler caching the page or is it re-acquiring the page for each parser?
 boolean isSpidering()
          Is this crawler following links?
 void set(Parser[] parser)
          Set the parsers that are to be utilized by this crawler to interpret the documents that are fetched.
 void setCaching(boolean b)
          Set whether or not the crawler should cache the web page or reload it for each individual parser.
 void setProxy(boolean proxy)
          Establishes whether a proxy is needed to access documents.
 void setSpidering(boolean spider)
          Set whether links to newly discovered documents should also be read.
 

Method Detail

crawl

void crawl(java.lang.String site)
           throws java.net.MalformedURLException,
                  java.io.IOException
Fetches the site designated by site. The meaning and interpretation of 'site' is dependent on the crawler used. This can spark new secondary crawls depending on the crawler.

Parameters:
site - Name of the document to be fetched
Throws:
java.net.MalformedURLException - If the site to crawl is not a valid document. Only thrown if the underlying crawler is retrieving documents via HTTP or a similar protocol.
java.io.IOException - Thrown if there is a problem retrieving the document to be processed.
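
A hedged example of invoking this method on an already configured crawler; the URL is hypothetical:

    try {
        crawler.crawl("http://www.example.org/index.html");
    } catch (java.net.MalformedURLException e) {
        // the site string is not a valid URL for this crawler
    } catch (java.io.IOException e) {
        // the document could not be retrieved
    }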

crawl

void crawl(java.lang.String site,
           java.lang.String[] parsers)
           throws java.net.MalformedURLException,
                  java.io.IOException
Fetches the site designated by site, applying only the subset of parsers indicated by parsers.

Parameters:
site - Name of the document to be fetched
parsers - index of parsers to parse this site
Throws:
java.net.MalformedURLException - If the site to crawl is not a valid document. Only thrown if the underlying crawler is retrieving documents via HTTP or a similar protocol.
java.io.IOException - Thrown if there is a problem retrieving the document to be processed.
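
A sketch of restricting a crawl to a subset of the registered parsers; the parser identifier string is a hypothetical example, since how the strings map to parsers is implementation dependent:

    try {
        crawler.crawl("http://www.example.org/artist.html",
                      new String[]{"ArtistParser"});   // hypothetical parser identifier
    } catch (java.net.MalformedURLException e) {
        // the site string is not a valid URL for this crawler
    } catch (java.io.IOException e) {
        // the document could not be retrieved
    }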

set

void set(Parser[] parser)
Set the parsers that are to be utilized by this crawler to interpret the documents that are fetched.

Parameters:
parser - Array of parsing objects to be utilized by the crawler to process documents fetched

getParser

Parser[] getParser()
Returns an array of Parser objects used by this crawler to parse pages.

Returns:
Array of parsers utilized to parse fetched documents.

setProxy

void setProxy(boolean proxy)
Establishes whether a proxy is needed to access documents.

Parameters:
proxy - Whether or not a proxy is needed for accessing documents

setCaching

void setCaching(boolean b)
Set whether or not the crawler should cache the web page or reload it for each individual parser. This is a trade-off between the memory needed to hold potentially large files and the cost of continually reloading the web page.

Parameters:
b - should caching occur
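
A sketch of how an implementing class might honour this flag when processing a single site; fetchDocument, the site variable, and the Parser parse call are assumptions for illustration, not part of this interface:

    String document = null;
    for (Parser p : getParser()) {
        if (isCaching()) {
            if (document == null) {
                document = fetchDocument(site);   // fetch once and keep in memory
            }
            p.parse(document);                    // every parser reads the cached copy
        } else {
            p.parse(fetchDocument(site));         // re-acquire the page for each parser
        }
    }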

isCaching

boolean isCaching()
Is the crawler caching the page or is it re-acquiring the page for each parser?

Returns:
Whether caching is enabled.

setSpidering

void setSpidering(boolean spider)
Set whether links to newly discovered documents should also be read.

Parameters:
spider - Whether links to newly discovered documents should also be crawled
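
A sketch of how an implementing crawler might act on this flag after parsing a fetched document; extractLinks and the document variable are hypothetical helpers, not part of this interface:

    if (isSpidering()) {
        for (String link : extractLinks(document)) {   // hypothetical link extraction
            try {
                crawl(link);                           // secondary crawl of the discovered link
            } catch (java.io.IOException e) {
                // skip links that cannot be retrieved
            }
        }
    }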

isSpidering

boolean isSpidering()
Is this crawler following links?

Returns:
Whether this crawler follows links.