nz.ac.waikato.mcennis.rat.crawler
Class WebCrawler

java.lang.Object
  extended by nz.ac.waikato.mcennis.rat.crawler.WebCrawler
All Implemented Interfaces:
Crawler

public class WebCrawler
extends java.lang.Object
implements Crawler

Multithreaded web crawler designed to pull web pages. Each thread uses CrawlerBase as its base class. The threads themselves are an inner class. FIXME: Does not handle sites parsed using a subset of the given parsers
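
A minimal lifecycle sketch, assuming Parser lives in the companion parser package and MyParser is a hypothetical Parser implementation:

    import nz.ac.waikato.mcennis.rat.crawler.WebCrawler;
    import nz.ac.waikato.mcennis.rat.parser.Parser; // assumed package for Parser

    public class CrawlDemo {
        public static void main(String[] args) throws Exception {
            WebCrawler crawler = new WebCrawler();
            crawler.createThreads(4);                  // must be called before any crawling
            crawler.set(new Parser[]{new MyParser()}); // MyParser is hypothetical
            crawler.startThreads();
            crawler.crawl("http://example.com/");      // queue a seed site
            while (!crawler.isDone()) {                // poll until every queued site is parsed
                Thread.sleep(1000);
            }
            crawler.stopThreads();
        }
    }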


Nested Class Summary
 class WebCrawler.SiteReference
           
 class WebCrawler.Spider
          Helper class that is used for threads.
 
Field Summary
 WebCrawler.Spider[] spiders
          Array of runnable objects for multithreaded parsing.
 
Constructor Summary
WebCrawler()
          Base constructor that initializes threads to null.
 
Method Summary
 void block(java.lang.String site)
          Add a site to the list of parsed sites without parsing it.
 void block(java.lang.String site, java.lang.String[] parsers)
           
 void crawl(java.lang.String site)
          Identical to crawl(java.lang.String, java.lang.String[]) except that all parsers are used.
 void crawl(java.lang.String site, java.lang.String[] parsers)
          If this site has not been crawled before, adds it to the set of all sites parsed and to the queue of the next spider, advancing the spider index to the next spider.
 void createThreads(int numThreads)
          Creates threads to be utilized by this web crawler.
 int getMaxCount()
           
 Parser[] getParser()
          Returns the parsers of the first spider.
 java.lang.String getProxyHost()
           
 java.lang.String getProxyPort()
           
 java.lang.String getProxyType()
           
 boolean getSiteCheck()
           
 long getThreadDelay()
           
 boolean isCaching()
          Is the crawler caching the page or is it re-acquiring the page for each parser.
 boolean isDone()
          Returns whether this object has finished parsing all sites added to be crawled.
 boolean isSpidering()
          Is this crawler following links
 void set(Parser[] parser)
          Set the parser for all spiders in this object.
 void setCaching(boolean b)
          Set whether or not the crawler should cache the web page or reload it for each individual parser.
 void setMaxCount(int maxCount)
           
 void setProxy(boolean proxy)
          Indicates whether or not this crawler's threads should use a proxy to connect to the Internet.
 void setProxyHost(java.lang.String proxyHost)
           
 void setProxyPort(java.lang.String proxyPort)
           
 void setProxyType(java.lang.String proxyType)
           
 void setSiteCheck(boolean check)
           
 void setSpidering(boolean s)
          Should links discovered in new documents also be read.
 void setThreadDelay(long threadDelay)
           
 void startThreads()
          Sets the threads running.
 void stopThreads()
          Tells each thread to stop execution after parsing the current document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

spiders

public WebCrawler.Spider[] spiders
Array of runnable objects for multithreaded parsing. Initially set to null.

Constructor Detail

WebCrawler

public WebCrawler()
Base constructor that initializes threads to null.

Method Detail

createThreads

public void createThreads(int numThreads)
Creates threads to be utilized by this web crawler. Must be called before crawling begins; otherwise a NullPointerException is thrown when parsing is attempted.

Parameters:
numThreads - number of threads to be created
Throws:
java.lang.NullPointerException - if crawling is attempted before threads have been created
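
A fragment showing the required call ordering; the commented-out line illustrates the failure mode described above:

    WebCrawler crawler = new WebCrawler();
    // crawler.crawl("http://example.com/"); // NullPointerException: no spider threads yet
    crawler.createThreads(2);                // allocate two spider threads
    crawler.startThreads();                  // now safe to begin crawling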

startThreads

public void startThreads()
Sets the threads running. A null operation if no threads exist yet.


stopThreads

public void stopThreads()
Tells each thread to stop execution after parsing the current document.


crawl

public void crawl(java.lang.String site)
           throws java.net.MalformedURLException,
                  java.io.IOException
Identical to crawl(java.lang.String, java.lang.String[]) except that all parsers are used.

Specified by:
crawl in interface Crawler
Parameters:
site - site to be crawled
Throws:
java.net.MalformedURLException - if the site URL is invalid
java.io.IOException - if an error occurs during retrieval
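
A fragment handling the declared exceptions, assuming crawler was set up as in the class-level sketch:

    try {
        crawler.crawl("http://example.com/index.html");
    } catch (java.net.MalformedURLException e) {
        System.err.println("Bad URL: " + e.getMessage());
    } catch (java.io.IOException e) {
        System.err.println("Retrieval failed: " + e.getMessage());
    }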

crawl

public void crawl(java.lang.String site,
                  java.lang.String[] parsers)
If this site has not been crawled before, adds it to the set of all sites parsed and to the queue of the next spider, advancing the spider index to the next spider.

Specified by:
crawl in interface Crawler
Parameters:
site - web site to be retrieved and parsed
parsers - names of the parsers to apply to this site
See Also:
nz.ac.waikato.mcennis.arm.crawler.Crawler#crawl(java.lang.String)
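
A fragment restricting one site to a subset of the configured parsers; the parser name used here is hypothetical:

    String[] linkParsers = {"HTMLLinkParser"}; // hypothetical parser name
    crawler.crawl("http://example.com/links.html", linkParsers);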

set

public void set(Parser[] parser)
Set the parser for all spiders in this object. This is a null operation if no spider threads have been created yet.

Specified by:
set in interface Crawler
Parameters:
parser - set of parsers to be duplicated across all spiders
See Also:
nz.ac.waikato.mcennis.arm.crawler.Crawler#set(nz.ac.waikato.mcennis.arm.parser.Parser[])

isDone

public boolean isDone()
Returns whether this object has finished parsing all sites added to be crawled.

Returns:
true if all sites added have been parsed

getParser

public Parser[] getParser()
Returns the parsers of the first spider. Parsers should use static storage so that this accessor reaches all parsed data. This is a null operation if threads have not been constructed yet.

Specified by:
getParser in interface Crawler
Returns:
array of parsers stored on the first thread.
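
A fragment collecting results once the crawl finishes; because parsers are expected to use static storage, the first spider's parsers see all parsed data:

    while (!crawler.isDone()) {  // wait for every queued site to be parsed
        Thread.sleep(1000);
    }
    Parser[] results = crawler.getParser(); // parsers from the first spider
    // how results are inspected depends on the concrete Parser implementation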

block

public void block(java.lang.String site)
Add a site to the list of parsed sites without parsing it. Useful for restarting a site crawl without reparsing all the data it contained.

Parameters:
site - web site to be added to the set of sites already parsed
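
A fragment sketching the restart scenario described above; alreadyParsed is a hypothetical collection of URLs recovered from a previous run:

    for (String url : alreadyParsed) { // hypothetical list of URLs from the prior run
        crawler.block(url);            // mark as parsed without fetching
    }
    crawler.crawl("http://example.com/"); // resume; blocked URLs are not re-fetched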

block

public void block(java.lang.String site,
                  java.lang.String[] parsers)

setProxy

public void setProxy(boolean proxy)
Indicates whether or not this crawler's threads should use a proxy to connect to the Internet. This is a null operation if threads have not been constructed yet.

Specified by:
setProxy in interface Crawler
Parameters:
proxy - should a proxy be used.
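
A configuration fragment combining setProxy with the proxy accessors declared on this class; the host, port, and type values are placeholders:

    crawler.setProxyHost("proxy.example.com"); // placeholder host
    crawler.setProxyPort("3128");              // note: the port is a String in this API
    crawler.setProxyType("4");                 // value semantics are undocumented; assumption
    crawler.setProxy(true);                    // enable the proxy on the spider threads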

setCaching

public void setCaching(boolean b)
Description copied from interface: Crawler
Set whether or not the crawler should cache the web page or reload it for each individual parser. This is a trade-off between the memory needed to hold potentially large files and the cost of repeatedly reloading the web page.

Specified by:
setCaching in interface Crawler
Parameters:
b - should caching occur

isCaching

public boolean isCaching()
Description copied from interface: Crawler
Is the crawler caching the page or is it re-acquiring the page for each parser.

Specified by:
isCaching in interface Crawler
Returns:
is caching enabled
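
A fragment illustrating the trade-off: with several parsers registered, caching fetches each page once at the cost of holding it in memory. OtherParser, like MyParser, is hypothetical:

    crawler.set(new Parser[]{new MyParser(), new OtherParser()});
    crawler.setCaching(true);    // fetch each page once and share it across parsers
    // crawler.setCaching(false); // alternatively, re-fetch per parser to save memory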

setSpidering

public void setSpidering(boolean s)
Description copied from interface: Crawler
Should links discovered in new documents also be read.

Specified by:
setSpidering in interface Crawler

isSpidering

public boolean isSpidering()
Description copied from interface: Crawler
Is this crawler following links

Specified by:
isSpidering in interface Crawler
Returns:
true if links are followed
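
A fragment enabling link following; setMaxCount is used here to bound the crawl, on the assumption that maxCount caps the number of sites visited:

    crawler.setSpidering(true); // follow links discovered in fetched documents
    crawler.setMaxCount(500);   // assumed: cap on the number of sites visited
    crawler.crawl("http://example.com/");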

setSiteCheck

public void setSiteCheck(boolean check)

getSiteCheck

public boolean getSiteCheck()

getThreadDelay

public long getThreadDelay()

setThreadDelay

public void setThreadDelay(long threadDelay)

getMaxCount

public int getMaxCount()

setMaxCount

public void setMaxCount(int maxCount)
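
The remaining accessors are undocumented; a politeness-oriented sketch, assuming threadDelay is a pause in milliseconds between fetches and siteCheck toggles validation of queued sites:

    crawler.setThreadDelay(1000L); // assumed: 1 s pause between fetches per thread
    crawler.setSiteCheck(true);    // assumed semantics; undocumented in this API
    System.out.println("delay=" + crawler.getThreadDelay()
            + " maxCount=" + crawler.getMaxCount());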

getProxyHost

public java.lang.String getProxyHost()

setProxyHost

public void setProxyHost(java.lang.String proxyHost)

getProxyPort

public java.lang.String getProxyPort()

setProxyPort

public void setProxyPort(java.lang.String proxyPort)

getProxyType

public java.lang.String getProxyType()

setProxyType

public void setProxyType(java.lang.String proxyType)