java.lang.Object
  nz.ac.waikato.mcennis.rat.crawler.WebCrawler
public class WebCrawler
extends java.lang.Object
implements Crawler

Multithreaded web crawler designed to pull web pages. Each thread utilizes CrawlerBase as its base class. The threads themselves are instances of an inner class.
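The library source is not reproduced here, but the pattern the description names (an outer crawler owning an array of worker threads implemented as an inner Runnable class, each draining its own queue) can be sketched in a minimal, self-contained form. All names below (MiniCrawler, fetched, running) are illustrative, not the library's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of the outer-crawler / inner-spider threading pattern.
class MiniCrawler {

    // Inner class used for threads: each spider drains its own queue.
    class Spider implements Runnable {
        final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        volatile boolean running = true;
        final List<String> fetched = new ArrayList<>();

        public void run() {
            // Keep working while running, then drain whatever remains queued.
            while (running || !queue.isEmpty()) {
                String site = queue.poll();
                if (site == null) {
                    if (!running) break;
                    Thread.yield();
                    continue;
                }
                fetched.add(site); // stand-in for retrieve-and-parse
            }
        }
    }

    Spider[] spiders;
    Thread[] threads;

    void createThreads(int numThreads) {
        spiders = new Spider[numThreads];
        threads = new Thread[numThreads];
        for (int i = 0; i < numThreads; ++i) {
            spiders[i] = new Spider();
            threads[i] = new Thread(spiders[i]);
        }
    }

    void startThreads() {
        for (Thread t : threads) t.start();
    }

    void stopThreads() {
        // Signal each spider to finish its current work, then wait for exit.
        for (Spider s : spiders) s.running = false;
        for (Thread t : threads) {
            try { t.join(); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }
}
```

The `join()` in `stopThreads` gives the caller a happens-before edge, so each spider's results are safely visible after shutdown.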
Nested Class Summary

 class  WebCrawler.Spider
            Helper class that is used for threads.

Field Summary

 WebCrawler.Spider[]  spiders
            Array of runnable objects for multithreaded parsing.

Constructor Summary

 WebCrawler()
            Base constructor that initializes threads to null.
Method Summary

 void           block(java.lang.String site)
                    Passes the given URL (as a string) to the filter object via the CrawlerFilter.load(String site) method.
 void           block(java.lang.String site, Properties props)
                    Passes the given URL (as a string) to the filter object via the CrawlerFilter.load(String site, Properties props) method.
 void           crawl(java.lang.String site)
                    Identical to crawl(String, Properties) except that all parsers are used.
 void           crawl(java.lang.String site, Properties parsers)
                    If this site has not been crawled before, adds it to the set of all sites parsed and to the queue of the next spider, incrementing the spider index to the next spider.
 void           createThreads(int numThreads)
                    Creates the threads to be utilized by this web crawler.
 Crawler        getCrawler()
                    Returns a reference to the crawler used when spidering is enabled.
 CrawlerFilter  getFilter()
                    Returns the current CrawlerFilter set for this crawler.
 Parser[]       getParsers()
                    Returns the array of Parser objects used by this crawler to parse pages.
 Properties     getProperties()
                    Returns the properties object associated with this crawler.
 java.lang.String  getProxyHost()
                    Returns the string used to determine the proxy host for this connection.
 java.lang.String  getProxyPort()
                    Returns the string describing the port the proxy is operating on.
 java.lang.String  getProxyType()
                    Returns the string the crawler will use to set the system property determining the proxy type.
 long           getThreadDelay()
                    Returns the delay between the end of retrieving and parsing one page and the start of retrieving the next.
 boolean        isCaching()
                    Returns whether the crawler caches each page or re-acquires it for each parser.
 boolean        isDone()
                    Returns whether this object has finished parsing all sites added to be crawled.
 boolean        isSpidering()
                    Returns whether this crawler follows links.
 void           set(Properties parser)
                    Sets the parser for all spiders in this object.
 void           setCaching(boolean b)
                    Sets caching on or off.
 void           setCrawler(Crawler c)
                    Sets the crawler used when spidering.
 void           setFilter(CrawlerFilter filter)
                    Sets the filter used to determine whether or not a given URL should be added to the list of URLs to be crawled.
 void           setParsers(Parser[] parsers)
                    Sets the parsers available to this crawler.
 void           setProxy(boolean proxy)
                    Sets whether the crawler should go through a proxy to access the internet.
 void           setProxyHost(java.lang.String proxyHost)
                    Sets the string used to determine the proxy host for this connection.
 void           setProxyPort(java.lang.String proxyPort)
                    Sets the string describing the port the proxy is operating on.
 void           setProxyType(java.lang.String proxyType)
                    Sets the string the crawler will use to set the system property determining the proxy type.
 void           setSpidering(boolean s)
                    Sets whether links to newly discovered documents should also be read; this also sets the value of the 'Spidering' parameter.
 void           setThreadDelay(long threadDelay)
                    Sets the delay between the end of retrieving and parsing one page and the start of retrieving the next.
 void           startThreads()
                    Sets the threads running.
 void           stopThreads()
                    Tells each thread to stop execution after parsing the current document.
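The crawl(String, Properties) entry above describes a dedup-then-round-robin dispatch: a site already seen is ignored, otherwise it goes onto the next spider's queue and the spider index advances. A self-contained sketch of that logic (names are illustrative, not the library's API):

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Sketch of the dispatch logic crawl(String, Properties) describes.
class RoundRobinDispatcher {
    final Set<String> seen = new HashSet<>();
    final Queue<String>[] queues;
    int next = 0; // index of the spider that receives the next new site

    @SuppressWarnings("unchecked")
    RoundRobinDispatcher(int numSpiders) {
        queues = new Queue[numSpiders];
        for (int i = 0; i < numSpiders; ++i) queues[i] = new ArrayDeque<>();
    }

    void crawl(String site) {
        if (!seen.add(site)) return;        // already crawled: do nothing
        queues[next].add(site);             // hand to the current spider
        next = (next + 1) % queues.length;  // advance to the next spider
    }
}
```

Dispatching "a", "b", "a", "c" across two queues leaves "a" and "c" on the first queue and "b" on the second, with the duplicate "a" dropped.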
Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

public WebCrawler.Spider[] spiders
    Array of runnable objects for multithreaded parsing.

Constructor Detail

public WebCrawler()
    Base constructor that initializes threads to null.

Method Detail
public void createThreads(int numThreads)
    Creates the threads to be utilized by this web crawler.
    Parameters:
        numThreads - number of threads to be created
    Throws:
        java.lang.NullPointerException - if no threads have been created yet

public void startThreads()
    Sets the threads running.

public void stopThreads()
    Tells each thread to stop execution after parsing the current document.

public void crawl(java.lang.String site)
           throws java.net.MalformedURLException,
                  java.io.IOException
    Identical to crawl(String, Properties) except that all parsers are used.
    Specified by:
        crawl in interface Crawler
    Parameters:
        site - site to be crawled
    Throws:
        java.net.MalformedURLException - if the site URL is invalid
        java.io.IOException - if an error occurs during retrieval

public void crawl(java.lang.String site, Properties parsers)
    If this site has not been crawled before, adds it to the set of all sites parsed and to the queue of the next spider, incrementing the spider index to the next spider.
    Specified by:
        crawl in interface Crawler
    Parameters:
        site - web site to be retrieved and parsed
    See Also:
        nz.ac.waikato.mcennis.arm.crawler.Crawler#crawl(java.lang.String)
public void set(Properties parser)
    Sets the parser for all spiders in this object.
    Specified by:
        set in interface Crawler
    Parameters:
        parser - set of parsers to be duplicated across all spiders
    See Also:
        nz.ac.waikato.mcennis.arm.crawler.Crawler#set(nz.ac.waikato.mcennis.arm.parser.Parser[])

public boolean isDone()
    Returns whether this object has finished parsing all sites added to be crawled.

public java.lang.String getProxyHost()
    Returns the string used to determine the proxy host for this connection.

public void setProxyHost(java.lang.String proxyHost)
    Sets the string used to determine the proxy host for this connection.
    Parameters:
        proxyHost - proxy host descriptor

public java.lang.String getProxyPort()
    Returns the string describing the port the proxy is operating on.

public void setProxyPort(java.lang.String proxyPort)
    Sets the string describing the port the proxy is operating on.
    Parameters:
        proxyPort - port of the proxy

public java.lang.String getProxyType()
    Returns the string the crawler will use to set the system property determining the proxy type.

public void setProxyType(java.lang.String proxyType)
    Sets the string the crawler will use to set the system property determining the proxy type.
    Parameters:
        proxyType - type of the proxy

public Properties getProperties()
    Specified by:
        getProperties in interface Crawler
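The proxy setters above state that the crawler configures proxying through system properties. Java's URL connections honor the standard http.proxyHost / http.proxyPort system properties; the exact property names the library uses are an assumption based on the descriptions, and the ProxyConfig class below is an illustrative stand-in:

```java
// Sketch of proxy configuration via the standard Java system properties.
// Property names are an assumption; the library's actual keys may differ.
class ProxyConfig {
    static void setProxy(String host, String port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", port);
    }

    static void clearProxy() {
        System.clearProperty("http.proxyHost");
        System.clearProperty("http.proxyPort");
    }
}
```

Once set, subsequent java.net.URL connections in the same JVM route HTTP traffic through the named proxy.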
public void setProxy(boolean proxy)
    Sets whether the crawler should go through a proxy to access the internet.
    Specified by:
        setProxy in interface Crawler
    Parameters:
        proxy - whether a proxy should be used to access the internet

public void setCaching(boolean b)
    Sets caching on or off.
    Specified by:
        setCaching in interface Crawler
    Parameters:
        b - whether caching should be enabled

public boolean isCaching()
    Returns whether the crawler caches each page or re-acquires it for each parser.
    Specified by:
        isCaching in interface Crawler
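The caching distinction above (fetch once and share the page across parsers, versus re-acquire it for each parser) can be sketched in a self-contained form; all names here are illustrative, not the library's API:

```java
import java.util.List;
import java.util.function.Function;

// Sketch of the isCaching/setCaching behavior: one fetch shared by all
// parsers when caching is on, one fetch per parser when it is off.
class PageFeeder {
    boolean caching = true;
    int fetchCount = 0;

    String fetch(String site) { // stand-in for an HTTP GET
        ++fetchCount;
        return "<html>" + site + "</html>";
    }

    void parseAll(String site, List<Function<String, String>> parsers) {
        if (caching) {
            String page = fetch(site);  // one fetch, shared by all parsers
            for (Function<String, String> p : parsers) p.apply(page);
        } else {
            // re-acquire the page for each parser
            for (Function<String, String> p : parsers) p.apply(fetch(site));
        }
    }
}
```

Caching trades memory for bandwidth: with three parsers, caching on costs one fetch, caching off costs three.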
public void setSpidering(boolean s)
    Sets whether links to newly discovered documents should also be read; this also sets the value of the 'Spidering' parameter.
    Specified by:
        setSpidering in interface Crawler

public boolean isSpidering()
    Returns whether this crawler follows links.
    Specified by:
        isSpidering in interface Crawler

public void setFilter(CrawlerFilter filter)
    Sets the filter used to determine whether or not a given URL should be added to the list of URLs to be crawled.
    Specified by:
        setFilter in interface Crawler
    Parameters:
        filter - function to determine whether a URL should be parsed or not

public CrawlerFilter getFilter()
    Returns the current CrawlerFilter set for this crawler.
    Specified by:
        getFilter in interface Crawler

public long getThreadDelay()
    Returns the delay between the end of retrieving and parsing one page and the start of retrieving the next.

public void setThreadDelay(long threadDelay)
    Parameters:
        threadDelay - delay between parsing web pages

public Parser[] getParsers()
    Returns the array of Parser objects used by this crawler to parse pages.
    Specified by:
        getParsers in interface Crawler
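The thread delay described above is a standard politeness mechanism: each worker sleeps between finishing one page and fetching the next. A self-contained sketch, with all names illustrative rather than the library's API:

```java
import java.util.List;
import java.util.function.Consumer;

// Sketch of a per-worker politeness delay like threadDelay: sleep after
// each page before starting the next retrieval.
class PoliteFetcher {
    long threadDelay = 100; // milliseconds between pages

    void fetchAll(List<String> sites, Consumer<String> fetch) {
        for (int i = 0; i < sites.size(); ++i) {
            fetch.accept(sites.get(i)); // retrieve and parse one page
            if (i < sites.size() - 1) {
                try {
                    Thread.sleep(threadDelay); // wait before the next page
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return; // stop early if interrupted
                }
            }
        }
    }
}
```

With N pages the fetcher sleeps N-1 times, so total added latency is roughly (N-1) * threadDelay milliseconds.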
public void setParsers(Parser[] parsers)
    Sets the parsers available to this crawler.
    Specified by:
        setParsers in interface Crawler
    Parameters:
        parsers - parsers that can be called in a given crawler

public void block(java.lang.String site)
    Passes the given URL to the filter object without passing it to the parsers.
    Specified by:
        block in interface Crawler
    Parameters:
        site - URL to pass to the filter without passing to the parsers
    See Also:
        nz.ac.waikato.mcennis.rat.crawler.filter.CrawlerFilter.load(String site)

public void block(java.lang.String site, Properties props)
    Passes the given URL to the filter object without passing it to the parsers.
    Specified by:
        block in interface Crawler
    Parameters:
        site - URL to pass to the filter without passing to the parsers
        props - Properties object defining parameters associated with parsing the site
    See Also:
        nz.ac.waikato.mcennis.rat.crawler.filter.CrawlerFilter.load(String site, Properties props)

public Crawler getCrawler()
    Returns a reference to the crawler used when spidering is enabled.
    Specified by:
        getCrawler in interface Crawler

public void setCrawler(Crawler c)
    Sets the crawler used when spidering.
    Specified by:
        setCrawler in interface Crawler
    Parameters:
        c - crawler to use for spidering