nz.ac.waikato.mcennis.rat.crawler
Class WebCrawler

java.lang.Object
  extended by nz.ac.waikato.mcennis.rat.crawler.WebCrawler
All Implemented Interfaces:
Crawler

public class WebCrawler
extends java.lang.Object
implements Crawler

Multithreaded web crawler designed to pull web pages. Each thread utilizes CrawlerBase as its base class. The threads themselves are an inner class.
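
A minimal usage sketch (not part of the original documentation), assuming one or more Parser implementations are already available in a Parser[] named myParsers; the call order follows the methods documented below:

    // assumes this code runs in a method that declares or handles
    // java.net.MalformedURLException and java.io.IOException
    WebCrawler crawler = new WebCrawler();
    crawler.setParsers(myParsers);       // Parser[] prepared elsewhere
    crawler.createThreads(4);            // must be called before crawling
    crawler.startThreads();              // start the spider threads
    crawler.crawl("http://www.example.com/index.html");
    // ... poll isDone() until all queued sites have been parsed ...
    crawler.stopThreads();               // each thread stops after its current document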


Nested Class Summary
 class WebCrawler.Spider
          Helper class that is used for threads.
 
Field Summary
 WebCrawler.Spider[] spiders
          Array of runnable objects for multithreaded parsing.
 
Constructor Summary
WebCrawler()
          Base constructor that initializes threads to null.
 
Method Summary
 void block(java.lang.String site)
          Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site) method
 void block(java.lang.String site, Properties props)
          Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site, Properties props) method
 void crawl(java.lang.String site)
          Identical to crawl(String, Properties) except that all parsers are used.
 void crawl(java.lang.String site, Properties parsers)
          If this site has not been crawled before, adds it to the set of all sites parsed, adds it to the queue of the next spider, and advances the spider index.
 void createThreads(int numThreads)
          Creates threads to be utilized by this web crawler.
 Crawler getCrawler()
          Returns a reference to the crawler used when spidering is enabled.
 CrawlerFilter getFilter()
          Returns the current CrawlerFilter set for this crawler.
 Parser[] getParsers()
          Returns an array of Parser objects used by this crawler to parse pages.
 Properties getProperties()
          Returns the Properties object describing the parameters of this crawler.
 java.lang.String getProxyHost()
          Returns the string used to determine the proxy host for this connection.
 java.lang.String getProxyPort()
          Returns the string describing the port the proxy is operating on
 java.lang.String getProxyType()
          Returns the string the crawler will use to set the system property determining the proxy type.
 long getThreadDelay()
          Returns the delay between the end of retrieving and parsing one page and the start of retrieving the next.
 boolean isCaching()
          Is the crawler caching the page, or is it re-acquiring the page for each parser?
 boolean isDone()
          Returns whether this object has finished parsing all sites added to be crawled.
 boolean isSpidering()
          Is this crawler following links?
 void set(Properties parser)
          Set the parser for all spiders in this object.
 void setCaching(boolean b)
          Sets caching on or off
 void setCrawler(Crawler c)
          Sets the crawler used when spidering.
 void setFilter(CrawlerFilter filter)
          Sets the filter used to determine whether or not a given URL should be added to the list of URLs to be crawled.
 void setParsers(Parser[] parsers)
          Sets the parsers available to this crawler.
 void setProxy(boolean proxy)
          Set or unset whether the crawler should be going through a proxy to access the internet.
 void setProxyHost(java.lang.String proxyHost)
          Sets the string to be used to determine the proxy host for this connection
 void setProxyPort(java.lang.String proxyPort)
          Sets the string describing the port the proxy is operating on
 void setProxyType(java.lang.String proxyType)
          Sets the string the crawler will use to set the system property determining the proxy type.
 void setSpidering(boolean s)
          Should links to newly discovered documents also be read? This also sets the value of the 'Spidering' parameter.
 void setThreadDelay(long threadDelay)
          Set the delay between the end of retrieving and parsing one page and the start of retrieving the next.
 void startThreads()
          Sets the threads running.
 void stopThreads()
          Tells each thread to stop execution after parsing the current document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

spiders

public WebCrawler.Spider[] spiders
Array of runnable objects for multithreaded parsing. Initially set to null.

Constructor Detail

WebCrawler

public WebCrawler()
Base constructor that initializes threads to null.

Method Detail

createThreads

public void createThreads(int numThreads)
Creates threads to be utilized by this web crawler. Must be called before parsing begins, or a NullPointerException is thrown.

Parameters:
numThreads - number of threads to be created
Throws:
java.lang.NullPointerException - if no threads have been created yet

startThreads

public void startThreads()
Sets the threads running. A null operation if no threads exist yet.


stopThreads

public void stopThreads()
Tells each thread to stop execution after parsing the current document.


crawl

public void crawl(java.lang.String site)
           throws java.net.MalformedURLException,
                  java.io.IOException
Identical to crawl(String, Properties) except that all parsers are used.

Specified by:
crawl in interface Crawler
Parameters:
site - site to be crawled
Throws:
java.net.MalformedURLException - if the site URL is invalid
java.io.IOException - error occurs during retrieval
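
For illustration only (not part of the original Javadoc), a sketch of calling this overload and handling its declared exceptions, assuming crawler is a WebCrawler set up as in the class-level example:

    try {
        crawler.crawl("http://www.example.com/index.html");
    } catch (java.net.MalformedURLException e) {
        // the site URL is invalid
    } catch (java.io.IOException e) {
        // an error occurred during retrieval
    }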

crawl

public void crawl(java.lang.String site,
                  Properties parsers)
If this site has not been crawled before, adds it to the set of all sites parsed, adds it to the queue of the next spider, and advances the spider index.

Specified by:
crawl in interface Crawler
Parameters:
site - web site to be retrieved and parsed
See Also:
nz.ac.waikato.mcennis.rat.crawler.Crawler#crawl(java.lang.String)

set

public void set(Properties parser)
Set the parser for all spiders in this object. This is a null operation if no spider threads have been created yet.

Specified by:
set in interface Crawler
Parameters:
parser - set of parsers to be duplicated across all spiders
See Also:
nz.ac.waikato.mcennis.rat.crawler.Crawler#set(nz.ac.waikato.mcennis.rat.parser.Parser[])

isDone

public boolean isDone()
Returns whether this object has finished parsing all sites added to be crawled.

Returns:
if all sites added have been parsed

getProxyHost

public java.lang.String getProxyHost()
Returns the string used to determine the proxy host for this connection.

Returns:
proxy host descriptor

setProxyHost

public void setProxyHost(java.lang.String proxyHost)
Sets the string to be used to determine the proxy host for this connection

Parameters:
proxyHost - proxy host descriptor

getProxyPort

public java.lang.String getProxyPort()
Returns the string describing the port the proxy is operating on

Returns:
port of proxy

setProxyPort

public void setProxyPort(java.lang.String proxyPort)
Sets the string describing the port the proxy is operating on

Parameters:
proxyPort - port of the proxy

getProxyType

public java.lang.String getProxyType()
Returns the string the crawler will use to set the system property determining the proxy type.

Returns:
type of the proxy

setProxyType

public void setProxyType(java.lang.String proxyType)
Sets the string the crawler will use to set the system property determining the proxy type.

Parameters:
proxyType - type of the proxy

getProperties

public Properties getProperties()
Returns the Properties object describing the parameters of this crawler. Returns null if no properties have been set.

Specified by:
getProperties in interface Crawler
Returns:
Properties object describing the parameters of this crawler.

setProxy

public void setProxy(boolean proxy)
Set or unset whether the crawler should be going through a proxy to access the internet.

Specified by:
setProxy in interface Crawler
Parameters:
proxy - Should a proxy be used to access the internet
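
A hedged configuration sketch (not from the original documentation) combining the proxy-related setters documented on this page; the host, port, and type values below are placeholders:

    crawler.setProxyHost("proxy.example.com");   // proxy host descriptor (placeholder)
    crawler.setProxyPort("3128");                // port the proxy is operating on (placeholder)
    crawler.setProxyType("HTTP");                // value for the proxy-type system property (placeholder)
    crawler.setProxy(true);                      // route requests through the proxy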

setCaching

public void setCaching(boolean b)
Sets caching on or off

Specified by:
setCaching in interface Crawler
Parameters:
b - should caching be enabled

isCaching

public boolean isCaching()
Description copied from interface: Crawler
Is the crawler caching the page, or is it re-acquiring the page for each parser? The value returned is also the value of the 'cacheSource' parameter.

Specified by:
isCaching in interface Crawler
Returns:
is caching enabled
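
A brief sketch (an illustration, not from the original documentation) of enabling caching so each fetched page is parsed from a cached copy rather than re-acquired for every parser:

    crawler.setCaching(true);              // cache each page once for all parsers
    boolean cached = crawler.isCaching();  // reflects the 'cacheSource' parameter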

setSpidering

public void setSpidering(boolean s)
Description copied from interface: Crawler
Should links to newly discovered documents also be read? This also sets the value of the 'Spidering' parameter.

Specified by:
setSpidering in interface Crawler

isSpidering

public boolean isSpidering()
Description copied from interface: Crawler
Is this crawler following links? The value returned is that of the 'Spidering' parameter.

Specified by:
isSpidering in interface Crawler
Returns:
follows links or not
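
A hedged sketch (not part of the original documentation) of enabling spidering so that discovered links are also crawled; by default the crawler used for spidering is a self-reference, as noted under getCrawler():

    crawler.setSpidering(true);      // follow links found in parsed documents
    crawler.setCrawler(crawler);     // optional: spider using this crawler (the default)
    if (crawler.isSpidering()) {
        // links discovered during parsing will also be queued
    }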

setFilter

public void setFilter(CrawlerFilter filter)
Description copied from interface: Crawler
Sets the filter used to determine whether or not a given URL should be added to the list of URLs to be crawled. The results are stored in the 'CrawlerFilter' parameter

Specified by:
setFilter in interface Crawler
Parameters:
filter - Function to determine whether a URL should be parsed or not.

getFilter

public CrawlerFilter getFilter()
Description copied from interface: Crawler
Returns the current CrawlerFilter set for this crawler. This is the value of the 'CrawlerFilter' parameter.

Specified by:
getFilter in interface Crawler
Returns:
filter used to determine whether or not to schedule a URL for parsing
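
An illustrative sketch only; myFilter stands for an existing CrawlerFilter implementation, which this page does not name:

    // myFilter: a hypothetical, previously constructed CrawlerFilter implementation
    crawler.setFilter(myFilter);                  // stored in the 'CrawlerFilter' parameter
    CrawlerFilter active = crawler.getFilter();   // returns the filter just set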

getThreadDelay

public long getThreadDelay()
Returns the delay between the end of retrieving and parsing one page and the start of retrieving the next.

Returns:
delay between parsing web pages

setThreadDelay

public void setThreadDelay(long threadDelay)
Set the delay between the end of retrieving and parsing one page and the start of retrieving the next.

Parameters:
threadDelay - delay between parsing web pages
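
A small sketch (an illustration, not from the original documentation) of setting a politeness delay between pages; the delay is assumed here to be in milliseconds, which this page does not state explicitly:

    crawler.setThreadDelay(1000L);           // pause between finishing one page and fetching the next
    long delay = crawler.getThreadDelay();   // currently configured delay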

getParsers

public Parser[] getParsers()
Description copied from interface: Crawler
Returns an array of Parser objects used by this crawler to parse pages.

Specified by:
getParsers in interface Crawler
Returns:
Array of parsers utilized to parse fetched documents.

setParsers

public void setParsers(Parser[] parsers)
Description copied from interface: Crawler
Sets the parsers available to this crawler. Null is permitted, setting the parser list to an empty set. Also sets the value of the 'Parsers' parameter. This procedure also adds the children of the parsers to the parser set.

Specified by:
setParsers in interface Crawler
Parameters:
parsers - Parsers that can be called in a given crawler.
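
An illustrative sketch only; HTMLParser is a hypothetical Parser implementation used as a stand-in, since this page does not list concrete parsers:

    Parser[] myParsers = new Parser[]{ new HTMLParser() };   // HTMLParser is hypothetical
    crawler.setParsers(myParsers);            // children of these parsers are added as well
    Parser[] active = crawler.getParsers();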

block

public void block(java.lang.String site)
Description copied from interface: Crawler
Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site) method

Specified by:
block in interface Crawler
Parameters:
site - URL to pass to the filter without passing it to the parsers.
See Also:
nz.ac.waikato.mcennis.rat.crawler.filter.CrawlerFilter.load(String site)

block

public void block(java.lang.String site,
                  Properties props)
Description copied from interface: Crawler
Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site, Properties props) method

Specified by:
block in interface Crawler
Parameters:
site - URL to pass to the filter without passing it to the parsers.
props - Properties object defining parameters associated with parsing the site.
See Also:
nz.ac.waikato.mcennis.rat.crawler.filter.CrawlerFilter.load(String site, Properties props)
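
A hedged sketch (not from the original documentation) of marking URLs as already handled so the filter will not schedule them again, assuming crawler.getProperties() supplies a suitable Properties object:

    crawler.block("http://www.example.com/already-seen.html");
    crawler.block("http://www.example.com/skip-me.html", crawler.getProperties());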

getCrawler

public Crawler getCrawler()
Description copied from interface: Crawler
Returns a reference to the crawler used when spidering is enabled. By default, this is a self-reference.

Specified by:
getCrawler in interface Crawler
Returns:
reference to the crawler used in spidering

setCrawler

public void setCrawler(Crawler c)
Description copied from interface: Crawler
Sets the crawler used when spidering. A null value disables spidering.

Specified by:
setCrawler in interface Crawler
Parameters:
c - crawler to use for spidering
