nz.ac.waikato.mcennis.rat.crawler
Class WebCrawler

java.lang.Object
  extended by nz.ac.waikato.mcennis.rat.crawler.WebCrawler
All Implemented Interfaces:
Crawler

public class WebCrawler
extends java.lang.Object
implements Crawler

Multithreaded web crawler designed to pull web pages. Each thread utilizes CrawlerBase as its base class. The threads themselves are an inner class.
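
A minimal usage sketch (not part of the original documentation), assuming one or more Parser implementations are already available in a Parser[] named myParsers; the call order follows the methods documented below:

    // assumes this code runs in a method that declares or handles
    // java.net.MalformedURLException and java.io.IOException
    WebCrawler crawler = new WebCrawler();
    crawler.setParsers(myParsers);       // Parser[] prepared elsewhere
    crawler.createThreads(4);            // must be called before crawling
    crawler.startThreads();              // start the spider threads
    crawler.crawl("http://www.example.com/index.html");
    // ... poll isDone() until all queued sites have been parsed ...
    crawler.stopThreads();               // each thread stops after its current document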


Nested Class Summary
 class WebCrawler.Spider
          Helper class that is used for threads.
 
Field Summary
 WebCrawler.Spider[] spiders
          Array of runnable objects for multithreaded parsing.
 
Constructor Summary
WebCrawler()
          Base constructor that initializes threads to null.
 
Method Summary
 void block(java.lang.String site)
          Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site) method
 void block(java.lang.String site, Properties props)
          Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site, Properties props) method
 void crawl(java.lang.String site)
          Identical to crawl(String, Properties) except that all parsers are used.
 void crawl(java.lang.String site, Properties parsers)
          If this site has not been crawled before, adds it to the set of all sites parsed, adds it to the queue of the next spider, and advances the spider index.
 void createThreads(int numThreads)
          Creates threads to be utilized by this web crawler.
 Crawler getCrawler()
          Returns a reference to the crawler used when spidering is enabled.
 CrawlerFilter getFilter()
          Returns the current CrawlerFilter set for this crawler.
 Parser[] getParsers()
          Returns an array of Parser objects used by this crawler to parse pages.
 Properties getProperties()
          Returns the Properties object describing the parameters of this crawler.
 java.lang.String getProxyHost()
          Returns the string used to determine the proxy host for this connection.
 java.lang.String getProxyPort()
          Returns the string describing the port the proxy is operating on
 java.lang.String getProxyType()
          Returns the string the crawler will use to set the system property determining the proxy type.
 long getThreadDelay()
          Returns the delay between the end of retrieving and parsing one page and the start of retrieving the next.
 boolean isCaching()
          Is the crawler caching the page, or is it re-acquiring the page for each parser?
 boolean isDone()
          Returns whether this object has finished parsing all sites added to be crawled.
 boolean isSpidering()
          Is this crawler following links?
 void set(Properties parser)
          Set the parser for all spiders in this object.
 void setCaching(boolean b)
          Sets caching on or off
 void setCrawler(Crawler c)
          Sets the crawler used when spidering.
 void setFilter(CrawlerFilter filter)
          Sets the filter used to determine whether or not a given URL should be added to the list of URLs to be crawled.
 void setParsers(Parser[] parsers)
          Sets the parsers available to this crawler.
 void setProxy(boolean proxy)
          Set or unset whether the crawler should be going through a proxy to access the internet.
 void setProxyHost(java.lang.String proxyHost)
          Sets the string to be used to determine the proxy host for this connection
 void setProxyPort(java.lang.String proxyPort)
          Sets the string describing the port the proxy is operating on
 void setProxyType(java.lang.String proxyType)
          Sets the string the crawler will use to set the system property determining the proxy type.
 void setSpidering(boolean s)
          Should links to newly discovered documents also be read? This also sets the value of the 'Spidering' parameter.
 void setThreadDelay(long threadDelay)
          Set the delay between the end of retrieving and parsing one page and the start of retrieving the next.
 void startThreads()
          Sets the threads running.
 void stopThreads()
          Tells each thread to stop execution after parsing the current document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

spiders

public WebCrawler.Spider[] spiders
Array of runnable objects for multithreaded parsing. Initially set to null.

Constructor Detail

WebCrawler

public WebCrawler()
Base constructor that initializes threads to null.

Method Detail

createThreads

public void createThreads(int numThreads)
Creates threads to be utilized by this web crawler. Must be called before parsing begins, or a NullPointerException is thrown.

Parameters:
numThreads - number of threads to be created
Throws:
java.lang.NullPointerException - if no threads have been created yet

startThreads

public void startThreads()
Sets the threads running. A null operation if no threads exist yet.


stopThreads

public void stopThreads()
Tells each thread to stop execution after parsing the current document.


crawl

public void crawl(java.lang.String site)
           throws java.net.MalformedURLException,
                  java.io.IOException
Identical to crawl(String, Properties) except that all parsers are used.

Specified by:
crawl in interface Crawler
Parameters:
site - site to be crawled
Throws:
java.net.MalformedURLException - if the site URL is invalid
java.io.IOException - error occurs during retrieval
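
For illustration only (not part of the original Javadoc), a sketch of calling this overload and handling its declared exceptions, assuming crawler is a WebCrawler set up as in the class-level example:

    try {
        crawler.crawl("http://www.example.com/index.html");
    } catch (java.net.MalformedURLException e) {
        // the site URL is invalid
    } catch (java.io.IOException e) {
        // an error occurred during retrieval
    }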

crawl

public void crawl(java.lang.String site,
                  Properties parsers)
If this site has not been crawled before, adds it to the set of all sites parsed, adds it to the queue of the next spider, and advances the spider index.

Specified by:
crawl in interface Crawler
Parameters:
site - web site to be retrieved and parsed
See Also:
nz.ac.waikato.mcennis.rat.crawler.Crawler#crawl(java.lang.String)

set

public void set(Properties parser)
Set the parser for all spiders in this object. This is a null operation if no spider threads have been created yet.

Specified by:
set in interface Crawler
Parameters:
parser - set of parsers to be duplicated across all spiders
See Also:
nz.ac.waikato.mcennis.rat.crawler.Crawler#set(nz.ac.waikato.mcennis.rat.parser.Parser[])

isDone

public boolean isDone()
Returns whether this object has finished parsing all sites added to be crawled.

Returns:
if all sites added have been parsed

getProxyHost

public java.lang.String getProxyHost()
Returns the string used to determine the proxy host for this connection.

Returns:
proxy host descriptor

setProxyHost

public void setProxyHost(java.lang.String proxyHost)
Sets the string to be used to determine the proxy host for this connection

Parameters:
proxyHost - proxy host descriptor

getProxyPort

public java.lang.String getProxyPort()
Returns the string describing the port the proxy is operating on

Returns:
port of proxy

setProxyPort

public void setProxyPort(java.lang.String proxyPort)
Sets the string describing the port the proxy is operating on

Parameters:
proxyPort - port of the proxy

getProxyType

public java.lang.String getProxyType()
Returns the string the crawler will use to set the system property determining the proxy type.

Returns:
type of the proxy

setProxyType

public void setProxyType(java.lang.String proxyType)
Sets the string the crawler will use to set the system property determining the proxy type.

Parameters:
proxyType - type of the proxy

getProperties

public Properties getProperties()
Returns the Properties object describing the parameters of this crawler. Returns null if no properties have been set.

Specified by:
getProperties in interface Crawler
Returns:
Properties object describing the parameters of this crawler.

setProxy

public void setProxy(boolean proxy)
Set or unset whether the crawler should be going through a proxy to access the internet.

Specified by:
setProxy in interface Crawler
Parameters:
proxy - Should a proxy be used to access the internet
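
A hedged configuration sketch (not from the original documentation) combining the proxy-related setters documented on this page; the host, port, and type values below are placeholders:

    crawler.setProxyHost("proxy.example.com");   // proxy host descriptor (placeholder)
    crawler.setProxyPort("3128");                // port the proxy is operating on (placeholder)
    crawler.setProxyType("HTTP");                // value for the proxy-type system property (placeholder)
    crawler.setProxy(true);                      // route requests through the proxy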

setCaching

public void setCaching(boolean b)
Sets caching on or off

Specified by:
setCaching in interface Crawler
Parameters:
b - should caching be enabled

isCaching

public boolean isCaching()
Description copied from interface: Crawler
Is the crawler caching the page, or is it re-acquiring the page for each parser? The value returned is also the value of the 'cacheSource' parameter.

Specified by:
isCaching in interface Crawler
Returns:
is caching enabled
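
A brief sketch (an illustration, not from the original documentation) of enabling caching so each fetched page is parsed from a cached copy rather than re-acquired for every parser:

    crawler.setCaching(true);              // cache each page once for all parsers
    boolean cached = crawler.isCaching();  // reflects the 'cacheSource' parameter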

setSpidering

public void setSpidering(boolean s)
Description copied from interface: Crawler
Should links to newly discovered documents also be read? This also sets the value of the 'Spidering' parameter.

Specified by:
setSpidering in interface Crawler

isSpidering

public boolean isSpidering()
Description copied from interface: Crawler
Is this crawler following links? The value returned is that of the 'Spidering' parameter.

Specified by:
isSpidering in interface Crawler
Returns:
follows links or not
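
A hedged sketch (not part of the original documentation) of enabling spidering so that discovered links are also crawled; by default the crawler used for spidering is a self-reference, as noted under getCrawler():

    crawler.setSpidering(true);      // follow links found in parsed documents
    crawler.setCrawler(crawler);     // optional: spider using this crawler (the default)
    if (crawler.isSpidering()) {
        // links discovered during parsing will also be queued
    }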

setFilter

public void setFilter(CrawlerFilter filter)
Description copied from interface: Crawler
Sets the filter used to determine whether or not a given URL should be added to the list of URLs to be crawled. The results are stored in the 'CrawlerFilter' parameter

Specified by:
setFilter in interface Crawler
Parameters:
filter - Function to determine whether a URL should be parsed or not.

getFilter

public CrawlerFilter getFilter()
Description copied from interface: Crawler
Returns the current CrawlerFilter set for this crawler. This is the value of the 'CrawlerFilter' parameter.

Specified by:
getFilter in interface Crawler
Returns:
filter used to determine whether or not to schedule a URL for parsing
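
An illustrative sketch only; myFilter stands for an existing CrawlerFilter implementation, which this page does not name:

    // myFilter: a hypothetical, previously constructed CrawlerFilter implementation
    crawler.setFilter(myFilter);                  // stored in the 'CrawlerFilter' parameter
    CrawlerFilter active = crawler.getFilter();   // returns the filter just set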

getThreadDelay

public long getThreadDelay()
Returns the delay between the end of retrieving and parsing one page and the start of retrieving the next.

Returns:
delay between parsing web pages

setThreadDelay

public void setThreadDelay(long threadDelay)
Set the delay between the end of retrieving and parsing one page and the start of retrieving the next.

Parameters:
threadDelay - delay between parsing web pages
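
A small sketch (an illustration, not from the original documentation) of setting a politeness delay between pages; the delay is assumed here to be in milliseconds, which this page does not state explicitly:

    crawler.setThreadDelay(1000L);           // pause between finishing one page and fetching the next
    long delay = crawler.getThreadDelay();   // currently configured delay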

getParsers

public Parser[] getParsers()
Description copied from interface: Crawler
Returns an array of Parser objects used by this crawler to parse pages.

Specified by:
getParsers in interface Crawler
Returns:
Array of parsers utilized to parse fetched documents.

setParsers

public void setParsers(Parser[] parsers)
Description copied from interface: Crawler
Sets the parsers available to this crawler. Null is permitted, setting the parser list to an empty set. Also sets the value of the 'Parsers' parameter. This procedure also adds the children of the parsers to the parser set.

Specified by:
setParsers in interface Crawler
Parameters:
parsers - Parsers that can be called in a given crawler.
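
An illustrative sketch only; HTMLParser is a hypothetical Parser implementation used as a stand-in, since this page does not list concrete parsers:

    Parser[] myParsers = new Parser[]{ new HTMLParser() };   // HTMLParser is hypothetical
    crawler.setParsers(myParsers);            // children of these parsers are added as well
    Parser[] active = crawler.getParsers();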

block

public void block(java.lang.String site)
Description copied from interface: Crawler
Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site) method

Specified by:
block in interface Crawler
Parameters:
site - URL to pass to the filter without passing it to the parsers.
See Also:
nz.ac.waikato.mcennis.rat.crawler.filter.CrawlerFilter.load(String site)

block

public void block(java.lang.String site,
                  Properties props)
Description copied from interface: Crawler
Pass the given URL (as a string) to the filter object via the CrawlerFilter.load(String site, Properties props) method

Specified by:
block in interface Crawler
Parameters:
site - URL to pass to the filter without passing it to the parsers.
props - Properties object defining parameters associated with parsing the site.
See Also:
nz.ac.waikato.mcennis.rat.crawler.filter.CrawlerFilter.load(String site, Properties props)
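
A hedged sketch (not from the original documentation) of marking URLs as already handled so the filter will not schedule them again, assuming crawler.getProperties() supplies a suitable Properties object:

    crawler.block("http://www.example.com/already-seen.html");
    crawler.block("http://www.example.com/skip-me.html", crawler.getProperties());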

getCrawler

public Crawler getCrawler()
Description copied from interface: Crawler
Returns a reference to the crawler used when spidering is enabled. By default, this is a self-reference.

Specified by:
getCrawler in interface Crawler
Returns:
reference to the crawler used in spidering

setCrawler

public void setCrawler(Crawler c)
Description copied from interface: Crawler
Sets the crawler used when spidering. A null value disables spidering.

Specified by:
setCrawler in interface Crawler
Parameters:
c - crawler to use for spidering
