java.lang.Object
  nz.ac.waikato.mcennis.rat.crawler.WebCrawler

public class WebCrawler
Multithreaded web crawler designed to pull web pages. Each thread uses CrawlerBase as its base class. The threads themselves are implemented as an inner class. FIXME: Does not handle sites parsed using a subset of the given parsers.
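The overall design described above (a pool of spider threads draining a shared queue of sites) can be sketched in plain Java. The names here (MiniCrawler, fetched) are illustrative stand-ins, not the actual WebCrawler API:

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch of a multithreaded crawler: spider threads
// drain a shared queue of sites. Hypothetical names throughout.
public class MiniCrawler {
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();
    private final List<String> fetched = new CopyOnWriteArrayList<>();
    private final Thread[] spiders; // analogous to WebCrawler.Spider[]

    public MiniCrawler(int numThreads) {
        spiders = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            spiders[i] = new Thread(() -> {
                String site;
                while ((site = queue.poll()) != null) {
                    // stand-in for retrieving and parsing the page
                    fetched.add("fetched:" + site);
                }
            });
        }
    }

    public void crawl(String site) {
        queue.add(site); // enqueue a site for the spiders
    }

    public List<String> run() throws InterruptedException {
        for (Thread t : spiders) t.start();
        for (Thread t : spiders) t.join(); // wait until the queue is drained
        return fetched;
    }
}
```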
Nested Class Summary

| Modifier | Class | Description |
|---|---|---|
| class | WebCrawler.SiteReference | |
| class | WebCrawler.Spider | Helper class that is used for threads. |
Field Summary

| Type | Field | Description |
|---|---|---|
| WebCrawler.Spider[] | spiders | Array of runnable objects for multithreaded parsing. |
Constructor Summary

| Constructor | Description |
|---|---|
| WebCrawler() | Base constructor that initializes threads to null. |
Method Summary

| Return type | Method | Description |
|---|---|---|
| void | block(java.lang.String site) | Add a site to the list of parsed sites without parsing it. |
| void | block(java.lang.String site, java.lang.String[] parsers) | |
| void | crawl(java.lang.String site) | Identical to crawl(site, parsers) except that all parsers are used. |
| void | crawl(java.lang.String site, java.lang.String[] parsers) | If this site has not been crawled before, adds it to the set of all parsed sites and to the queue of the next spider, incrementing the spider index. |
| void | createThreads(int numThreads) | Creates threads to be utilized by this web crawler. |
| int | getMaxCount() | |
| Parser[] | getParser() | Returns the parsers of the first spider. |
| java.lang.String | getProxyHost() | |
| java.lang.String | getProxyPort() | |
| java.lang.String | getProxyType() | |
| boolean | getSiteCheck() | |
| long | getThreadDelay() | |
| boolean | isCaching() | Returns whether the crawler caches each page or re-acquires it for each parser. |
| boolean | isDone() | Returns whether this object has finished parsing all sites added to be crawled. |
| boolean | isSpidering() | Returns whether this crawler follows links. |
| void | set(Parser[] parser) | Set the parser for all spiders in this object. |
| void | setCaching(boolean b) | Set whether or not the crawler should cache the web page or reload it for each individual parser. |
| void | setMaxCount(int maxCount) | |
| void | setProxy(boolean proxy) | Indicates whether or not this crawler's threads should use a proxy to connect to the internet. |
| void | setProxyHost(java.lang.String proxyHost) | |
| void | setProxyPort(java.lang.String proxyPort) | |
| void | setProxyType(java.lang.String proxyType) | |
| void | setSiteCheck(boolean check) | |
| void | setSpidering(boolean s) | Sets whether links to newly discovered documents should also be read. |
| void | setThreadDelay(long threadDelay) | |
| void | startThreads() | Sets the threads running. |
| void | stopThreads() | Tells each thread to stop execution after parsing the current document. |
Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail

public WebCrawler.Spider[] spiders

Array of runnable objects for multithreaded parsing.
Constructor Detail

public WebCrawler()

Base constructor that initializes threads to null.
Method Detail

public void createThreads(int numThreads)

Creates threads to be utilized by this web crawler.

Parameters:
    numThreads - number of threads to be created
Throws:
    java.lang.NullPointerException - if no threads have been created yet

public void startThreads()

Sets the threads running.
public void stopThreads()

Tells each thread to stop execution after parsing the current document.
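The createThreads/startThreads/stopThreads lifecycle can be sketched with a volatile stop flag, so that each worker finishes its current document before exiting. This is an illustrative reconstruction under assumed semantics, not the real Spider implementation:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the thread lifecycle: create, start, then
// signal a cooperative stop that takes effect after the current item.
public class SpiderPool {
    private final BlockingQueue<String> work = new LinkedBlockingQueue<>();
    private final AtomicInteger parsed = new AtomicInteger();
    private volatile boolean running = true;
    private Thread[] threads;

    public void createThreads(int n) {
        threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = new Thread(() -> {
                while (running) {
                    try {
                        // take one document, then re-check the stop flag
                        String site = work.poll(50, TimeUnit.MILLISECONDS);
                        if (site != null) parsed.incrementAndGet();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
        }
    }

    public void startThreads() {
        for (Thread t : threads) t.start();
    }

    public void stopThreads() throws InterruptedException {
        running = false; // each thread exits after its current poll
        for (Thread t : threads) t.join();
    }

    public void submit(String site) { work.add(site); }
    public int parsedCount() { return parsed.get(); }
}
```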
public void crawl(java.lang.String site)
           throws java.net.MalformedURLException,
                  java.io.IOException

Identical to crawl(site, parsers) except that all parsers are used.

Specified by:
    crawl in interface Crawler
Parameters:
    site - site to be crawled
Throws:
    java.net.MalformedURLException - if the site URL is invalid
    java.io.IOException - if an error occurs during retrieval

public void crawl(java.lang.String site, java.lang.String[] parsers)
Specified by:
    crawl in interface Crawler
Parameters:
    site - web site to be retrieved and parsed
    parsers - index of parsers to parse this site
See Also:
    nz.ac.waikato.mcennis.arm.crawler.Crawler#crawl(java.lang.String)
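The behaviour described for crawl(site, parsers) in the summary (skip sites already seen, enqueue new ones on the next spider in round-robin order, then advance the spider index) can be sketched as follows; the Dispatcher class and its names are hypothetical, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of round-robin dispatch: each new site goes to
// the next spider's queue; duplicates are dropped.
public class Dispatcher {
    private final Set<String> seen = new HashSet<>();
    private final List<List<String>> spiderQueues;
    private int index = 0; // index of the next spider to receive work

    public Dispatcher(int spiders) {
        spiderQueues = new ArrayList<>();
        for (int i = 0; i < spiders; i++) spiderQueues.add(new ArrayList<>());
    }

    public void crawl(String site) {
        if (!seen.add(site)) return;               // already parsed or queued
        spiderQueues.get(index).add(site);         // enqueue on current spider
        index = (index + 1) % spiderQueues.size(); // advance to next spider
    }

    public List<List<String>> queues() { return spiderQueues; }
}
```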
public void set(Parser[] parser)

Set the parser for all spiders in this object.

Specified by:
    set in interface Crawler
Parameters:
    parser - set of parsers to be duplicated across all spiders
See Also:
    nz.ac.waikato.mcennis.arm.crawler.Crawler#set(nz.ac.waikato.mcennis.arm.parser.Parser[])
public boolean isDone()

Returns whether this object has finished parsing all sites added to be crawled.
public Parser[] getParser()

Returns the parsers of the first spider.

Specified by:
    getParser in interface Crawler
public void block(java.lang.String site)

Add a site to the list of parsed sites without parsing it.

Parameters:
    site - web site to be added to the set of sites already parsed

public void block(java.lang.String site, java.lang.String[] parsers)
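As described above, block(site) amounts to pre-seeding the set of already-parsed sites so that a later crawl of the same URL is treated as a duplicate. A minimal sketch, with hypothetical names:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of block(site): mark a site as parsed without
// ever fetching it, so a later crawl of the same URL is skipped.
public class BlockList {
    private final Set<String> parsed = new HashSet<>();

    public void block(String site) {
        parsed.add(site); // record without fetching or parsing
    }

    public boolean shouldCrawl(String site) {
        return parsed.add(site); // true only the first time a site is seen
    }
}
```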
public void setProxy(boolean proxy)

Indicates whether or not this crawler's threads should use a proxy to connect to the internet.

Specified by:
    setProxy in interface Crawler
Parameters:
    proxy - should a proxy be used

public void setCaching(boolean b)
Set whether or not the crawler should cache the web page or reload it for each individual parser.

Specified by:
    setCaching in interface Crawler
Parameters:
    b - should caching occur

public boolean isCaching()
Returns whether the crawler caches each page or re-acquires it for each parser.

Specified by:
    isCaching in interface Crawler
public void setSpidering(boolean s)

Sets whether links to newly discovered documents should also be read.

Specified by:
    setSpidering in interface Crawler
public boolean isSpidering()

Returns whether this crawler follows links.

Specified by:
    isSpidering in interface Crawler
public void setSiteCheck(boolean check)
public boolean getSiteCheck()
public long getThreadDelay()
public void setThreadDelay(long threadDelay)
public int getMaxCount()
public void setMaxCount(int maxCount)
public java.lang.String getProxyHost()
public void setProxyHost(java.lang.String proxyHost)
public java.lang.String getProxyPort()
public void setProxyPort(java.lang.String proxyPort)
public java.lang.String getProxyType()
public void setProxyType(java.lang.String proxyType)
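The javadoc does not show how the proxy host, port, and type settings above are applied. One common approach in Java is to route HTTP connections through the standard http.proxyHost and http.proxyPort system properties; this sketch assumes that mechanism, and the real WebCrawler implementation may differ:

```java
// Sketch of proxy configuration via the standard Java networking
// system properties. ProxyConfig is a hypothetical helper, not the
// actual WebCrawler code.
public class ProxyConfig {
    public static void enableProxy(String host, String port) {
        // HttpURLConnection honors these properties for HTTP traffic
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", port);
    }

    public static void disableProxy() {
        System.clearProperty("http.proxyHost");
        System.clearProperty("http.proxyPort");
    }
}
```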