nz.ac.waikato.mcennis.rat.crawler
Class WebCrawler

java.lang.Object
  extended by nz.ac.waikato.mcennis.rat.crawler.WebCrawler
All Implemented Interfaces:
Crawler

public class WebCrawler
extends java.lang.Object
implements Crawler

Multithreaded web crawler designed to pull web pages. Each thread uses CrawlerBase as its base class. The threads themselves are an inner class. FIXME: Does not handle sites parsed using a subset of the given parsers
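
A minimal lifecycle sketch, assuming Parser lives in the companion parser package and MyParser is a hypothetical Parser implementation:

    import nz.ac.waikato.mcennis.rat.crawler.WebCrawler;
    import nz.ac.waikato.mcennis.rat.parser.Parser; // assumed package for Parser

    public class CrawlDemo {
        public static void main(String[] args) throws Exception {
            WebCrawler crawler = new WebCrawler();
            crawler.createThreads(4);                  // must be called before any crawling
            crawler.set(new Parser[]{new MyParser()}); // MyParser is hypothetical
            crawler.startThreads();
            crawler.crawl("http://example.com/");      // queue a seed site
            while (!crawler.isDone()) {                // poll until every queued site is parsed
                Thread.sleep(1000);
            }
            crawler.stopThreads();
        }
    }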


Nested Class Summary
 class WebCrawler.SiteReference
           
 class WebCrawler.Spider
          Helper class that is used for threads.
 
Field Summary
 WebCrawler.Spider[] spiders
          Array of runnable objects for multithreaded parsing.
 
Constructor Summary
WebCrawler()
          Base constructor that initializes threads to null.
 
Method Summary
 void block(java.lang.String site)
          Add a site to the list of parsed sites without parsing it.
 void block(java.lang.String site, java.lang.String[] parsers)
           
 void crawl(java.lang.String site)
          Identical to crawl(java.lang.String, java.lang.String[]) except that all parsers are used.
 void crawl(java.lang.String site, java.lang.String[] parsers)
          If this site has not been crawled before, adds it to the set of all sites parsed and to the queue of the next spider, advancing the spider index to the next spider.
 void createThreads(int numThreads)
          Creates threads to be utilized by this web crawler.
 int getMaxCount()
           
 Parser[] getParser()
          Returns the parsers of the first spider.
 java.lang.String getProxyHost()
           
 java.lang.String getProxyPort()
           
 java.lang.String getProxyType()
           
 boolean getSiteCheck()
           
 long getThreadDelay()
           
 boolean isCaching()
          Is the crawler caching the page or is it re-acquiring the page for each parser.
 boolean isDone()
          Returns whether this object has finished parsing all sites added to be crawled.
 boolean isSpidering()
          Is this crawler following links
 void set(Parser[] parser)
          Set the parser for all spiders in this object.
 void setCaching(boolean b)
          Set whether or not the crawler should cache the web page or reload it for each individual parser.
 void setMaxCount(int maxCount)
           
 void setProxy(boolean proxy)
          Indicates whether or not this crawler's threads should use a proxy to connect to the Internet.
 void setProxyHost(java.lang.String proxyHost)
           
 void setProxyPort(java.lang.String proxyPort)
           
 void setProxyType(java.lang.String proxyType)
           
 void setSiteCheck(boolean check)
           
 void setSpidering(boolean s)
          Should links discovered in new documents also be read.
 void setThreadDelay(long threadDelay)
           
 void startThreads()
          Sets the threads running.
 void stopThreads()
          Tells each thread to stop execution after parsing the current document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

spiders

public WebCrawler.Spider[] spiders
Array of runnable objects for multithreaded parsing. Initially set to null.

Constructor Detail

WebCrawler

public WebCrawler()
Base constructor that initializes threads to null.

Method Detail

createThreads

public void createThreads(int numThreads)
Creates threads to be utilized by this web crawler. Must be called before crawling begins; otherwise a NullPointerException is thrown when parsing is attempted.

Parameters:
numThreads - number of threads to be created
Throws:
java.lang.NullPointerException - if crawling is attempted before threads have been created
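
A fragment showing the required call ordering; the commented-out line illustrates the failure mode described above:

    WebCrawler crawler = new WebCrawler();
    // crawler.crawl("http://example.com/"); // NullPointerException: no spider threads yet
    crawler.createThreads(2);                // allocate two spider threads
    crawler.startThreads();                  // now safe to begin crawling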

startThreads

public void startThreads()
Sets the threads running. A null operation if no threads exist yet.


stopThreads

public void stopThreads()
Tells each thread to stop execution after parsing the current document.


crawl

public void crawl(java.lang.String site)
           throws java.net.MalformedURLException,
                  java.io.IOException
Identical to crawl(java.lang.String, java.lang.String[]) except that all parsers are used.

Specified by:
crawl in interface Crawler
Parameters:
site - site to be crawled
Throws:
java.net.MalformedURLException - if the site URL is invalid
java.io.IOException - if an error occurs during retrieval
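
A fragment handling the declared exceptions, assuming crawler was set up as in the class-level sketch:

    try {
        crawler.crawl("http://example.com/index.html");
    } catch (java.net.MalformedURLException e) {
        System.err.println("Bad URL: " + e.getMessage());
    } catch (java.io.IOException e) {
        System.err.println("Retrieval failed: " + e.getMessage());
    }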

crawl

public void crawl(java.lang.String site,
                  java.lang.String[] parsers)
If this site has not been crawled before, adds it to the set of all sites parsed and to the queue of the next spider, advancing the spider index to the next spider.

Specified by:
crawl in interface Crawler
Parameters:
site - web site to be retrieved and parsed
parsers - names of the parsers to apply to this site
See Also:
nz.ac.waikato.mcennis.arm.crawler.Crawler#crawl(java.lang.String)
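
A fragment restricting one site to a subset of the configured parsers; the parser name used here is hypothetical:

    String[] linkParsers = {"HTMLLinkParser"}; // hypothetical parser name
    crawler.crawl("http://example.com/links.html", linkParsers);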

set

public void set(Parser[] parser)
Set the parser for all spiders in this object. This is a null operation if no spider threads have been created yet.

Specified by:
set in interface Crawler
Parameters:
parser - set of parsers to be duplicated across all spiders
See Also:
nz.ac.waikato.mcennis.arm.crawler.Crawler#set(nz.ac.waikato.mcennis.arm.parser.Parser[])

isDone

public boolean isDone()
Returns whether this object has finished parsing all sites added to be crawled.

Returns:
true if all sites added have been parsed

getParser

public Parser[] getParser()
Returns the parsers of the first spider. Parsers should use static storage so that this accessor reaches all parsed data. This is a null operation if threads have not been constructed yet.

Specified by:
getParser in interface Crawler
Returns:
array of parsers stored on the first thread.
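
A fragment collecting results once the crawl finishes; because parsers are expected to use static storage, the first spider's parsers see all parsed data:

    while (!crawler.isDone()) {  // wait for every queued site to be parsed
        Thread.sleep(1000);
    }
    Parser[] results = crawler.getParser(); // parsers from the first spider
    // how results are inspected depends on the concrete Parser implementation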

block

public void block(java.lang.String site)
Add a site to the list of parsed sites without parsing it. Useful for restarting a site crawl without reparsing all the data it contained.

Parameters:
site - web site to be added to the set of sites already parsed
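
A fragment sketching the restart scenario described above; alreadyParsed is a hypothetical collection of URLs recovered from a previous run:

    for (String url : alreadyParsed) { // hypothetical list of URLs from the prior run
        crawler.block(url);            // mark as parsed without fetching
    }
    crawler.crawl("http://example.com/"); // resume; blocked URLs are not re-fetched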

block

public void block(java.lang.String site,
                  java.lang.String[] parsers)

setProxy

public void setProxy(boolean proxy)
Indicates whether or not this crawler's threads should use a proxy to connect to the Internet. This is a null operation if threads have not been constructed yet.

Specified by:
setProxy in interface Crawler
Parameters:
proxy - should a proxy be used.
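
A configuration fragment combining setProxy with the proxy accessors declared on this class; the host, port, and type values are placeholders:

    crawler.setProxyHost("proxy.example.com"); // placeholder host
    crawler.setProxyPort("3128");              // note: the port is a String in this API
    crawler.setProxyType("4");                 // value semantics are undocumented; assumption
    crawler.setProxy(true);                    // enable the proxy on the spider threads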

setCaching

public void setCaching(boolean b)
Description copied from interface: Crawler
Set whether or not the crawler should cache the web page or reload it for each individual parser. This is a trade-off between the memory needed to hold potentially large files and the cost of repeatedly reloading the web page.

Specified by:
setCaching in interface Crawler
Parameters:
b - should caching occur

isCaching

public boolean isCaching()
Description copied from interface: Crawler
Is the crawler caching the page or is it re-acquiring the page for each parser.

Specified by:
isCaching in interface Crawler
Returns:
is caching enabled
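
A fragment illustrating the trade-off: with several parsers registered, caching fetches each page once at the cost of holding it in memory. OtherParser, like MyParser, is hypothetical:

    crawler.set(new Parser[]{new MyParser(), new OtherParser()});
    crawler.setCaching(true);    // fetch each page once and share it across parsers
    // crawler.setCaching(false); // alternatively, re-fetch per parser to save memory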

setSpidering

public void setSpidering(boolean s)
Description copied from interface: Crawler
Should links discovered in new documents also be read.

Specified by:
setSpidering in interface Crawler

isSpidering

public boolean isSpidering()
Description copied from interface: Crawler
Is this crawler following links

Specified by:
isSpidering in interface Crawler
Returns:
true if links are followed
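
A fragment enabling link following; setMaxCount is used here to bound the crawl, on the assumption that maxCount caps the number of sites visited:

    crawler.setSpidering(true); // follow links discovered in fetched documents
    crawler.setMaxCount(500);   // assumed: cap on the number of sites visited
    crawler.crawl("http://example.com/");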

setSiteCheck

public void setSiteCheck(boolean check)

getSiteCheck

public boolean getSiteCheck()

getThreadDelay

public long getThreadDelay()

setThreadDelay

public void setThreadDelay(long threadDelay)

getMaxCount

public int getMaxCount()

setMaxCount

public void setMaxCount(int maxCount)
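
The remaining accessors are undocumented; a politeness-oriented sketch, assuming threadDelay is a pause in milliseconds between fetches and siteCheck toggles validation of queued sites:

    crawler.setThreadDelay(1000L); // assumed: 1 s pause between fetches per thread
    crawler.setSiteCheck(true);    // assumed semantics; undocumented in this API
    System.out.println("delay=" + crawler.getThreadDelay()
            + " maxCount=" + crawler.getMaxCount());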

getProxyHost

public java.lang.String getProxyHost()

setProxyHost

public void setProxyHost(java.lang.String proxyHost)

getProxyPort

public java.lang.String getProxyPort()

setProxyPort

public void setProxyPort(java.lang.String proxyPort)

getProxyType

public java.lang.String getProxyType()

setProxyType

public void setProxyType(java.lang.String proxyType)