java.lang.Object
  nz.ac.waikato.mcennis.rat.crawler.WebCrawler
public class WebCrawler
extends java.lang.Object
implements Crawler
Multithreaded web crawler designed to pull web pages. Each thread uses CrawlerBase as its base class. The threads themselves are instances of an inner class. FIXME: Does not handle sites parsed using a subset of the given parsers
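The typical lifecycle implied by the methods below (create threads, register parsers, seed the crawl, start, wait, stop) can be sketched as follows. This is an illustrative sketch only: the `Parser` import path and the `myParsers` array are assumptions, since concrete `Parser` implementations are not described on this page.

```java
import nz.ac.waikato.mcennis.rat.crawler.WebCrawler;
import nz.ac.waikato.mcennis.rat.parser.Parser; // package path assumed

public class CrawlExample {
    public static void main(String[] args) throws Exception {
        WebCrawler crawler = new WebCrawler();

        // Create the worker threads before starting the crawl.
        crawler.createThreads(4);

        // Duplicate the same parser set across all spiders.
        // myParsers is a hypothetical array of Parser implementations.
        Parser[] myParsers = /* your Parser implementations */ null;
        crawler.set(myParsers);

        // Seed the crawl queue, then set the threads running.
        crawler.crawl("http://www.example.com/");
        crawler.startThreads();

        // Poll until every queued site has been parsed, then shut down.
        while (!crawler.isDone()) {
            Thread.sleep(1000);
        }
        crawler.stopThreads();
    }
}
```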
| Nested Class Summary | |
|---|---|
| class | WebCrawler.SiteReference |
| class | WebCrawler.Spider: Helper class that is used for threads. |
| Field Summary | |
|---|---|
| WebCrawler.Spider[] | spiders: Array of runnable objects for multithreaded parsing. |
| Constructor Summary |
|---|
| WebCrawler(): Base constructor that initializes threads to null. |
| Method Summary | |
|---|---|
| void | block(java.lang.String site): Add a site to the list of parsed sites without parsing it. |
| void | block(java.lang.String site, java.lang.String[] parsers) |
| void | crawl(java.lang.String site): Identical to crawl(String, String[]) except that all parsers are used. |
| void | crawl(java.lang.String site, java.lang.String[] parsers): If this site has not been crawled before, adds it to the set of all sites parsed and to the queue of the next spider, incrementing the spider index to the next spider. |
| void | createThreads(int numThreads): Creates threads to be utilized by this web crawler. |
| int | getMaxCount() |
| Parser[] | getParser(): Returns the parsers of the first spider. |
| java.lang.String | getProxyHost() |
| java.lang.String | getProxyPort() |
| java.lang.String | getProxyType() |
| boolean | getSiteCheck() |
| long | getThreadDelay() |
| boolean | isCaching(): Is the crawler caching the page, or is it re-acquiring the page for each parser? |
| boolean | isDone(): Returns whether this object has finished parsing all sites added to be crawled. |
| boolean | isSpidering(): Is this crawler following links? |
| void | set(Parser[] parser): Set the parser for all spiders in this object. |
| void | setCaching(boolean b): Set whether or not the crawler should cache the web page or reload it for each individual parser. |
| void | setMaxCount(int maxCount) |
| void | setProxy(boolean proxy): Indicates whether or not this crawler's threads should use a proxy to connect to the internet. |
| void | setProxyHost(java.lang.String proxyHost) |
| void | setProxyPort(java.lang.String proxyPort) |
| void | setProxyType(java.lang.String proxyType) |
| void | setSiteCheck(boolean check) |
| void | setSpidering(boolean s): Sets whether links discovered in new documents should also be read. |
| void | setThreadDelay(long threadDelay) |
| void | startThreads(): Sets the threads running. |
| void | stopThreads(): Tells each thread to stop execution after parsing the current document. |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public WebCrawler.Spider[] spiders
| Constructor Detail |
|---|
public WebCrawler()
| Method Detail |
|---|
public void createThreads(int numThreads)

Creates threads to be utilized by this web crawler.

Parameters:
numThreads - number of threads to be created

public void startThreads()

Sets the threads running.

Throws:
java.lang.NullPointerException - if no threads have been created yet

public void stopThreads()

Tells each thread to stop execution after parsing the current document.
public void crawl(java.lang.String site)
           throws java.net.MalformedURLException,
                  java.io.IOException

Identical to crawl(String, String[]) except that all parsers are used.

Specified by:
crawl in interface Crawler

Parameters:
site - site to be crawled

Throws:
java.net.MalformedURLException - if the site URL is invalid
java.io.IOException - if an error occurs during retrieval
public void crawl(java.lang.String site,
                  java.lang.String[] parsers)

If this site has not been crawled before, adds it to the set of all sites parsed and to the queue of the next spider, incrementing the spider index to the next spider.

Specified by:
crawl in interface Crawler

Parameters:
site - web site to be retrieved and parsed
parsers - index of parsers to parse this site

See Also:
nz.ac.waikato.mcennis.arm.crawler.Crawler#crawl(java.lang.String)

public void set(Parser[] parser)

Set the parser for all spiders in this object.

Specified by:
set in interface Crawler

Parameters:
parser - set of parsers to be duplicated across all spiders

See Also:
nz.ac.waikato.mcennis.arm.crawler.Crawler#set(nz.ac.waikato.mcennis.arm.parser.Parser[])

public boolean isDone()

Returns whether this object has finished parsing all sites added to be crawled.
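The two crawl variants above differ only in parser selection: crawl(site) runs every registered parser against the page, while crawl(site, parsers) restricts parsing to the named subset. A sketch, assuming a configured crawler as in the class overview; the parser name "XMLParser" is hypothetical and would have to match a parser previously registered via set():

```java
// Parse with every registered parser.
crawler.crawl("http://www.example.com/index.html");

// Parse with only a named subset of the registered parsers
// (the name here is a placeholder assumption).
crawler.crawl("http://www.example.com/feed.xml",
              new String[]{"XMLParser"});
```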
public Parser[] getParser()

Returns the parsers of the first spider.

Specified by:
getParser in interface Crawler

public void block(java.lang.String site)

Add a site to the list of parsed sites without parsing it.

Parameters:
site - web site to be added to the set of sites already parsed

public void block(java.lang.String site,
                  java.lang.String[] parsers)
public void setProxy(boolean proxy)

Indicates whether or not this crawler's threads should use a proxy to connect to the internet.

Specified by:
setProxy in interface Crawler

Parameters:
proxy - should a proxy be used

public void setCaching(boolean b)

Set whether or not the crawler should cache the web page or reload it for each individual parser.

Specified by:
setCaching in interface Crawler

Parameters:
b - should caching occur

public boolean isCaching()

Is the crawler caching the page, or is it re-acquiring the page for each parser?

Specified by:
isCaching in interface Crawler

public void setSpidering(boolean s)

Sets whether links discovered in new documents should also be read.

Specified by:
setSpidering in interface Crawler

public boolean isSpidering()

Is this crawler following links?

Specified by:
isSpidering in interface Crawler

public void setSiteCheck(boolean check)
public boolean getSiteCheck()
public long getThreadDelay()
public void setThreadDelay(long threadDelay)
public int getMaxCount()
public void setMaxCount(int maxCount)
public java.lang.String getProxyHost()
public void setProxyHost(java.lang.String proxyHost)
public java.lang.String getProxyPort()
public void setProxyPort(java.lang.String proxyPort)
public java.lang.String getProxyType()
public void setProxyType(java.lang.String proxyType)
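The configuration setters above are typically applied before startThreads(). A sketch combining them; the proxy host, port, and type values, and the millisecond interpretation of the thread delay, are placeholder assumptions not stated on this page:

```java
WebCrawler crawler = new WebCrawler();
crawler.createThreads(2);

// Route fetches through a proxy (host/port/type values are placeholders).
crawler.setProxy(true);
crawler.setProxyHost("proxy.example.com");
crawler.setProxyPort("8080");
crawler.setProxyType("4"); // value format assumed; see getProxyType()

// Fetch each page once and feed the cached copy to every parser.
crawler.setCaching(true);

// Follow links found in retrieved documents.
crawler.setSpidering(true);

// Pause between requests (unit assumed to be milliseconds).
crawler.setThreadDelay(1000);
```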