java.lang.Object
  nz.ac.waikato.mcennis.rat.crawler.CrawlerBase
public class CrawlerBase
Base class implementing basic web access capabilities. It executes in the current thread and accesses pages sequentially. It is designed to be the base class for other web-accessing crawlers. In particular, it is the superclass of the WebCrawler crawler.
Field Summary | |
---|---|
protected boolean | cache - Perform caching so that each parser gets a cached copy of the original document. |
protected Parser[] | parser - Parser array handling all parsers used in processing documents. |
protected boolean | proxy - Whether a proxy must be used when collecting data. |
protected boolean | spider - Whether this object operates as a spider (automatically following links). |
Constructor Summary | |
---|---|
CrawlerBase() | Base constructor. |
Method Summary | |
---|---|
void | crawl(java.lang.String site) - Identical to crawl(String, String[]) except that all parsers are used. |
void | crawl(java.lang.String site, java.lang.String[] parsers) |
protected void | doParse(byte[] raw_data, java.lang.String[] parsers) - Helper function separated from public parse to allow easy overloading. |
Parser[] | getParser() - Returns the parsers that are associated with this crawler. |
java.lang.String | getProxyHost() - Returns the string used to determine the proxy host for this connection. |
java.lang.String | getProxyPort() - Returns the string describing the port the proxy is operating on. |
java.lang.String | getProxyType() - Returns the string the crawler will use to set the system property determining the proxy type. |
boolean | isCaching() - Whether the crawler caches the page or re-acquires it for each parser. |
boolean | isSpidering() - Whether this crawler is following links. |
void | set(Parser[] parser) - Takes the current array of parsers and creates a copy of them using the duplicate method on each parser. |
void | setCaching(boolean b) - Sets caching on or off. FIXME: caching is currently permanently on. |
void | setProxy(boolean proxy) - Sets or unsets whether the crawler should go through a proxy to access the internet. |
void | setProxyHost(java.lang.String proxyHost) - Sets the string used to determine the proxy host for this connection. |
void | setProxyPort(java.lang.String proxyPort) - Sets the string describing the port the proxy is operating on. |
void | setProxyType(java.lang.String proxyType) - Sets the string the crawler will use to set the system property determining the proxy type. |
void | setSpidering(boolean s) - Sets whether links to newly discovered documents should also be read. |
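The proxy accessors above describe strings that end up in system properties. A minimal sketch of how such settings could be applied, assuming the standard Java networking keys `http.proxyHost` and `http.proxyPort`. The class `ProxySettingsSketch`, its `apply` method, and the idea that the proxy type is a protocol prefix such as `http` are all assumptions made for illustration; this page does not say how CrawlerBase maps these values.

```java
import java.util.Properties;

/** Hypothetical helper for illustration; not part of CrawlerBase. */
final class ProxySettingsSketch {
    private ProxySettingsSketch() {}

    /**
     * Writes proxy settings into the given properties.
     * Assumption: type is a protocol prefix such as "http" or "https",
     * which yields the standard keys http.proxyHost / http.proxyPort.
     */
    static void apply(Properties props, boolean useProxy,
                      String type, String host, String port) {
        String hostKey = type + ".proxyHost";
        String portKey = type + ".proxyPort";
        if (useProxy) {
            props.setProperty(hostKey, host);
            props.setProperty(portKey, port);
        } else {
            // Clear any previously configured proxy for this protocol.
            props.remove(hostKey);
            props.remove(portKey);
        }
    }

    public static void main(String[] args) {
        // Placeholder host and port, for illustration only.
        apply(System.getProperties(), true, "http", "proxy.example.org", "8080");
        System.out.println(System.getProperty("http.proxyHost"));
    }
}
```

Passing a `Properties` object rather than writing to `System` directly keeps the sketch easy to exercise without changing JVM-wide settings.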
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected Parser[] parser
protected boolean spider
protected boolean proxy
protected boolean cache
Constructor Detail |
---|
public CrawlerBase()
Method Detail |
---|
public void set(Parser[] parser)
Specified by: set in interface Crawler
Parameters:
parser - Array of parsers to be duplicated and set for parsing documents
public void crawl(java.lang.String site) throws java.net.MalformedURLException, java.io.IOException
Specified by: crawl in interface Crawler
Parameters:
site - site to be crawled
Throws:
java.net.MalformedURLException - if the site URL is invalid
java.io.IOException - if an error occurs during retrieval
public void crawl(java.lang.String site, java.lang.String[] parsers) throws java.net.MalformedURLException, java.io.IOException
Specified by: crawl in interface Crawler
Parameters:
site - Name of the document to be fetched
parsers - index of parsers to parse this site
Throws:
java.net.MalformedURLException - If the site to crawl is not a valid document. Only thrown if the underlying crawler is retrieving documents via HTTP or a similar protocol.
java.io.IOException - Thrown if there is a problem retrieving the document to be processed.
public java.lang.String getProxyHost()
public void setProxyHost(java.lang.String proxyHost)
Parameters:
proxyHost - proxy host descriptor
public java.lang.String getProxyPort()
public void setProxyPort(java.lang.String proxyPort)
Parameters:
proxyPort - port of the proxy
public java.lang.String getProxyType()
public void setProxyType(java.lang.String proxyType)
Parameters:
proxyType - type of the proxy
protected void doParse(byte[] raw_data, java.lang.String[] parsers) throws java.io.IOException, java.lang.Exception
Parameters:
raw_data -
Throws:
java.io.IOException
java.lang.Exception
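The doParse signature takes the raw bytes of a document plus a selection of parsers, which fits the caching behaviour described for the cache field: the page is fetched once and each parser works from its own copy. The following is a rough sketch of that idea, not the library's implementation. `NamedParser`, `ByteCountingParser`, and `CachedParseSketch` are hypothetical names, and selecting parsers by name is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the library's Parser; only what the sketch needs. */
interface NamedParser {
    String getName();
    void parse(InputStream data) throws IOException;
}

/** Toy parser that only counts the bytes it is given. */
final class ByteCountingParser implements NamedParser {
    private final String name;
    long bytesSeen;

    ByteCountingParser(String name) { this.name = name; }

    public String getName() { return name; }

    public void parse(InputStream data) throws IOException {
        while (data.read() != -1) {
            bytesSeen++;
        }
    }
}

final class CachedParseSketch {
    private CachedParseSketch() {}

    /**
     * Hands the same downloaded bytes to every selected parser.
     * Each parser gets a fresh stream over the cached array, so one parser
     * cannot consume the data before the next one reads it.
     * Returns the names of the parsers that ran, in array order.
     * Unlike doParse, this sketch rethrows IOException unchecked.
     */
    static List<String> parseAll(byte[] rawData, NamedParser[] available, String[] selected) {
        Set<String> wanted = new HashSet<>(Arrays.asList(selected));
        List<String> ran = new ArrayList<>();
        for (NamedParser p : available) {
            if (!wanted.contains(p.getName())) {
                continue;
            }
            try {
                p.parse(new ByteArrayInputStream(rawData));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            ran.add(p.getName());
        }
        return ran;
    }
}
```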
public Parser[] getParser()
Specified by: getParser in interface Crawler
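The summary for set(Parser[]) says the crawler stores copies made with each parser's duplicate method rather than the caller's objects. A small sketch of that pattern follows, under stated assumptions: `DuplicableParser`, `CountingParser`, and `ParserHolderSketch` are hypothetical stand-ins, and whether the real getParser returns the internal array or a copy is not stated on this page.

```java
/** Hypothetical stand-in for the library's Parser interface. */
interface DuplicableParser {
    DuplicableParser duplicate();
}

/** Toy parser with a little state, so that copies can be told apart. */
final class CountingParser implements DuplicableParser {
    int documentsSeen;

    public DuplicableParser duplicate() {
        CountingParser copy = new CountingParser();
        copy.documentsSeen = documentsSeen;
        return copy;
    }
}

final class ParserHolderSketch {
    private DuplicableParser[] parser = new DuplicableParser[0];

    /** Stores duplicates, so later changes to the caller's parsers do not leak in. */
    void set(DuplicableParser[] source) {
        DuplicableParser[] copy = new DuplicableParser[source.length];
        for (int i = 0; i < source.length; i++) {
            copy[i] = source[i].duplicate();
        }
        parser = copy;
    }

    /** Assumption: returns the stored array directly. */
    DuplicableParser[] getParser() {
        return parser;
    }
}
```

Duplicating on set means a caller can keep reusing its own parser instances, for example with a second crawler, without the two sharing state.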
public void setProxy(boolean proxy)
Specified by: setProxy in interface Crawler
Parameters:
proxy - Should a proxy be used to access the internet
public void setCaching(boolean b)
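Specified by: setCaching in interface Crawler
Parameters:
b - should caching be enabled
public boolean isCaching()
Description copied from interface: Crawler
Whether the crawler caches the page or re-acquires it for each parser.
Specified by: isCaching in interface Crawler
public void setSpidering(boolean s)
Description copied from interface: Crawler
Sets whether links to newly discovered documents should also be read.
Specified by: setSpidering in interface Crawler
public boolean isSpidering()
Description copied from interface: Crawler
Whether this crawler is following links.
Specified by: isSpidering in interface Crawler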
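The spidering flag, combined with the class description (pages accessed sequentially in the current thread), suggests a simple work-queue loop. The sketch below illustrates that control flow only. `SpiderSketch` is a hypothetical name, and the `links` map stands in for fetching a page and extracting its links, which this page does not describe.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SpiderSketch {
    private SpiderSketch() {}

    /**
     * Visits pages one at a time in the calling thread.
     * links maps a page name to the links found on that page.
     * With spidering off, only the start page is visited.
     * Returns the pages in the order they were visited.
     */
    static List<String> crawl(String start, Map<String, List<String>> links, boolean spidering) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            visited.add(page);
            if (!spidering) {
                break;
            }
            for (String next : links.getOrDefault(page, Collections.emptyList())) {
                // Set.add returns false for pages already queued, which prevents loops.
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return visited;
    }
}
```

The `seen` set matters as soon as spidering is on, because pages that link to each other would otherwise be queued forever.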