nz.ac.waikato.mcennis.rat.crawler
Class CrawlerBase

java.lang.Object
  extended by nz.ac.waikato.mcennis.rat.crawler.CrawlerBase
All Implemented Interfaces:
Crawler
Direct Known Subclasses:
WebCrawler.Spider

public class CrawlerBase
extends java.lang.Object
implements Crawler

Base class implementing basic web access capabilities. It executes in the current thread and accesses pages sequentially. It is designed to be the base class for other web-accessing crawlers. In particular, it is the superclass of the WebCrawler crawler.
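
A minimal usage sketch follows; HTMLParser is a hypothetical Parser implementation standing in for whatever parsers the surrounding application provides:

    CrawlerBase crawler = new CrawlerBase();
    // Parsers must be set before crawling, otherwise a null pointer
    // exception is thrown on parse (see the constructor note below).
    crawler.set(new Parser[]{new HTMLParser()});
    try {
        crawler.crawl("http://www.example.com/");
    } catch (java.net.MalformedURLException e) {
        // the site string was not a valid URL
    } catch (java.io.IOException e) {
        // the document could not be retrieved
    }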


Field Summary
protected  boolean cache
          Perform caching so that each parser gets a cached copy of the original.
protected  Parser[] parser
          Parser array handling all parsers utilized in processing documents.
protected  boolean proxy
          Is there a proxy that needs to be used in collecting data.
protected  boolean spider
          Is this object operating as a spider (automatically following links).
 
Constructor Summary
CrawlerBase()
          Base constructor.
 
Method Summary
 void crawl(java.lang.String site)
          Identical to crawl(String, String[]) except that all parsers are used.
 void crawl(java.lang.String site, java.lang.String[] parsers)
          Fetches the named site and processes it with the named parsers.
protected  void doParse(byte[] raw_data, java.lang.String[] parsers)
          Helper function separated from the public parse logic to allow easy overriding in subclasses.
 Parser[] getParser()
          Returns the parsers that are associated with this crawler.
 java.lang.String getProxyHost()
          Returns the string used to determine the proxy host for this connection.
 java.lang.String getProxyPort()
          Returns the string describing the port the proxy is operating on.
 java.lang.String getProxyType()
          Returns the string the crawler will use to set the system property determining the proxy type.
 boolean isCaching()
          Is the crawler caching the page, or is it re-acquiring the page for each parser?
 boolean isSpidering()
          Is this crawler following links?
 void set(Parser[] parser)
          Takes the given array of parsers and stores a copy of each, created with the parser's duplicate method.
 void setCaching(boolean b)
          Sets caching on or off. FIXME: caching is currently permanently on.
 void setProxy(boolean proxy)
          Sets whether the crawler should go through a proxy to access the internet.
 void setProxyHost(java.lang.String proxyHost)
          Sets the string to be used to determine the proxy host for this connection.
 void setProxyPort(java.lang.String proxyPort)
          Sets the string describing the port the proxy is operating on.
 void setProxyType(java.lang.String proxyType)
          Sets the string the crawler will use to set the system property determining the proxy type.
 void setSpidering(boolean s)
          Sets whether links to newly discovered documents should also be read.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

parser

protected Parser[] parser
Parser array handling all parsers utilized in processing documents.


spider

protected boolean spider
Is this object operating as a spider (automatically following links)? This is used to determine which procedure to call on the parsers.


proxy

protected boolean proxy
Is there a proxy that needs to be used in collecting data.


cache

protected boolean cache
Perform caching so that each parser gets a cached copy of the original. FIXME: always performs caching regardless of this value

Constructor Detail

CrawlerBase

public CrawlerBase()
Base constructor. The parsers must be set before this object is used, or a null pointer exception will be thrown on parse.

Method Detail

set

public void set(Parser[] parser)
Takes the given array of parsers and stores a copy of each, created with the parser's duplicate method.

Specified by:
set in interface Crawler
Parameters:
parser - Array of parsers to be duplicated and set for parsing documents

crawl

public void crawl(java.lang.String site)
           throws java.net.MalformedURLException,
                  java.io.IOException
Identical to crawl(String, String[]) except that all parsers are used.

Specified by:
crawl in interface Crawler
Parameters:
site - site to be crawled
Throws:
java.net.MalformedURLException - if the site URL is invalid
java.io.IOException - if an error occurs during retrieval

crawl

public void crawl(java.lang.String site,
                  java.lang.String[] parsers)
           throws java.net.MalformedURLException,
                  java.io.IOException
Specified by:
crawl in interface Crawler
Parameters:
site - Name of the document to be fetched
parsers - names of the parsers that should process this site
Throws:
java.net.MalformedURLException - If the site to crawl is not a valid document. Only thrown if the underlying crawler is retrieving documents via HTTP or a similar protocol.
java.io.IOException - Thrown if there is a problem retrieving the document to be processed.
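
Continuing the sketch under the class description, a hedged example of restricting a crawl to a subset of the registered parsers; the parser name is hypothetical:

    // Process the site with only the named parsers rather than all of them.
    crawler.crawl("http://www.example.com/", new String[]{"HTMLParser"});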

getProxyHost

public java.lang.String getProxyHost()
Returns the string used to determine the proxy host for this connection.

Returns:
proxy host descriptor

setProxyHost

public void setProxyHost(java.lang.String proxyHost)
Sets the string to be used to determine the proxy host for this connection.

Parameters:
proxyHost - proxy host descriptor

getProxyPort

public java.lang.String getProxyPort()
Returns the string describing the port the proxy is operating on.

Returns:
port of proxy

setProxyPort

public void setProxyPort(java.lang.String proxyPort)
Sets the string describing the port the proxy is operating on.

Parameters:
proxyPort - port of the proxy

getProxyType

public java.lang.String getProxyType()
Returns the string the crawler will use to set the system property determining the proxy type.

Returns:
type of the proxy

setProxyType

public void setProxyType(java.lang.String proxyType)
Sets the string the crawler will use to set the system property determining the proxy type.

Parameters:
proxyType - type of the proxy
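
Taken together, the proxy setters are typically called before crawling. A sketch with placeholder values; the exact strings these setters expect (in particular the proxy type) are not specified here:

    CrawlerBase crawler = new CrawlerBase();
    crawler.setProxy(true);                     // route requests through a proxy
    crawler.setProxyHost("proxy.example.com");  // placeholder host
    crawler.setProxyPort("8080");               // placeholder port
    crawler.setProxyType("HTTP");               // placeholder; the expected value is undocumented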

doParse

protected void doParse(byte[] raw_data,
                       java.lang.String[] parsers)
                throws java.io.IOException,
                       java.lang.Exception
Helper function separated from the public parse logic to allow easy overriding in subclasses. Passes a copy of the given raw data to each of the named parsers.

Parameters:
raw_data - byte array containing the raw contents of the document to be parsed
parsers - names of the parsers that should process this document
Throws:
java.io.IOException
java.lang.Exception
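
Because doParse is protected, subclasses can hook in before parsing. A minimal sketch of such an override, based only on the signature above; the logging line is illustrative:

    public class LoggingCrawler extends CrawlerBase {
        @Override
        protected void doParse(byte[] raw_data, java.lang.String[] parsers)
                throws java.io.IOException, java.lang.Exception {
            // Illustrative pre-processing hook before delegating to the base parse.
            System.err.println("Parsing " + raw_data.length + " bytes");
            super.doParse(raw_data, parsers);
        }
    }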

getParser

public Parser[] getParser()
Returns the parsers that are associated with this crawler. Returns null if no parsers have been set.

Specified by:
getParser in interface Crawler
Returns:
Parser array containing the parsers used to parse documents.

setProxy

public void setProxy(boolean proxy)
Sets whether the crawler should go through a proxy to access the internet.

Specified by:
setProxy in interface Crawler
Parameters:
proxy - Should a proxy be used to access the internet

setCaching

public void setCaching(boolean b)
Sets caching on or off. FIXME: caching is currently permanently on.

Specified by:
setCaching in interface Crawler
Parameters:
b - should caching be enabled

isCaching

public boolean isCaching()
Description copied from interface: Crawler
Is the crawler caching the page, or is it re-acquiring the page for each parser?

Specified by:
isCaching in interface Crawler
Returns:
is caching enabled

setSpidering

public void setSpidering(boolean s)
Description copied from interface: Crawler
Sets whether links to newly discovered documents should also be read.

Specified by:
setSpidering in interface Crawler
Parameters:
s - should newly discovered links also be followed

isSpidering

public boolean isSpidering()
Description copied from interface: Crawler
Is this crawler following links?

Specified by:
isSpidering in interface Crawler
Returns:
true if the crawler is following links, false otherwise
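
A short sketch of toggling link-following, continuing the earlier example; note that CrawlerBase itself retrieves pages sequentially in the current thread, so automated spidering behavior comes from subclasses such as WebCrawler:

    crawler.setSpidering(true);   // follow links discovered in documents
    if (crawler.isSpidering()) {
        crawler.crawl("http://www.example.com/");
    }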