public interface Crawler
Interface for accessing files via I/O. Designed to abstract away the difference between file and web access. Uses parser objects to parse the collected documents. Multiple parsers can be used. It is crawler-dependent whether all parsers are run against all documents or only a subset.
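The description above can be illustrated with a minimal sketch of a file-backed crawler that hands one fetched document to several parsers. The names `SimpleParser`, `SimpleCrawler`, and `FileCrawler` are hypothetical stand-ins introduced for this example and are not the library's own types; only `crawl(String)`, `setParsers`, and `getParsers` mirror method names listed on this page.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class FileCrawlerSketch {

    /** Hypothetical stand-in for the library's Parser type. */
    interface SimpleParser {
        void parse(String site, String content);
    }

    /** Hypothetical reduced view of Crawler: crawl(String) plus the parser accessors. */
    interface SimpleCrawler {
        void crawl(String site) throws MalformedURLException, IOException;
        void setParsers(SimpleParser[] parsers);
        SimpleParser[] getParsers();
    }

    /** Treats 'site' as a local file path, so it never throws MalformedURLException. */
    static final class FileCrawler implements SimpleCrawler {
        private SimpleParser[] parsers = new SimpleParser[0];

        @Override
        public void crawl(String site) throws IOException {
            // Read the document once, then hand it to every configured parser.
            String content = new String(Files.readAllBytes(Paths.get(site)), StandardCharsets.UTF_8);
            for (SimpleParser parser : parsers) {
                parser.parse(site, content);
            }
        }

        @Override
        public void setParsers(SimpleParser[] parsers) { this.parsers = parsers.clone(); }

        @Override
        public SimpleParser[] getParsers() { return parsers.clone(); }
    }

    /** Writes a temporary document, crawls it with two parsers, and returns what they recorded. */
    static List<String> demo() {
        List<String> seen = new ArrayList<>();
        try {
            Path doc = Files.createTempFile("crawler-sketch", ".txt");
            try {
                Files.write(doc, "hello crawler".getBytes(StandardCharsets.UTF_8));
                SimpleCrawler crawler = new FileCrawler();
                crawler.setParsers(new SimpleParser[] {
                    (site, content) -> seen.add("length=" + content.length()),
                    (site, content) -> seen.add("upper=" + content.toUpperCase())
                });
                crawler.crawl(doc.toString());
            } finally {
                Files.deleteIfExists(doc);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A web-backed implementation of the same reduced interface would interpret `site` as a URL instead of a path, which is the case where `MalformedURLException` becomes relevant to callers.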
Method Summary | |
---|---|
void |
block(java.lang.String site)
Passes the given URL (as a string) to the filter object via the CrawlerFilter.load(String site) method. |
void |
block(java.lang.String site,
Properties props)
Passes the given URL (as a string) to the filter object via the CrawlerFilter.load(String site, Properties props) method. |
void |
crawl(java.lang.String site)
Fetches the document designated by site. |
void |
crawl(java.lang.String site,
Properties props)
|
Crawler |
getCrawler()
Returns a reference to the crawler used when spidering is enabled. |
CrawlerFilter |
getFilter()
Returns the current CrawlerFilter set for this crawler. |
Parser[] |
getParsers()
Returns an array of Parser objects used by this crawler to parse pages. |
Properties |
getProperties()
Get the Properties object backing this factory. |
boolean |
isCaching()
Returns whether the crawler caches the page or re-acquires it for each parser. |
boolean |
isSpidering()
Returns whether this crawler follows links. |
void |
set(Properties parameters)
Sets the parsers used by this crawler to interpret the documents it fetches. |
void |
setCaching(boolean b)
Sets whether the crawler should cache the web page or reload it for each individual parser. |
void |
setCrawler(Crawler c)
Sets the crawler used when spidering. |
void |
setFilter(CrawlerFilter filter)
Sets the filter used to determine whether or not a given URL should be added to the list of URLs to be crawled. |
void |
setParsers(Parser[] parsers)
Sets the parsers available to this crawler. |
void |
setProxy(boolean proxy)
Establishes whether a proxy is needed to access documents. |
void |
setSpidering(boolean spider)
Sets whether links to newly discovered documents should also be read. This also sets the value of the 'Spidering' parameter. |
Method Detail |
---|
void crawl(java.lang.String site) throws java.net.MalformedURLException, java.io.IOException
site
- Name of the document to be fetched
java.net.MalformedURLException
- If the site to crawl is not a valid document. Only thrown if the underlying crawler is retrieving documents via HTTP or a similar protocol.
java.io.IOException
- Thrown if there is a problem retrieving the document to be processed.
void crawl(java.lang.String site, Properties props) throws java.net.MalformedURLException, java.io.IOException
site
- Name of the document to be fetched
parsers
- index of parsers to parse this site
java.net.MalformedURLException
- If the site to crawl is not a valid document. Only thrown if the underlying crawler is retrieving documents via HTTP or a similar protocol.
java.io.IOException
- Thrown if there is a problem retrieving the document to be processed.
void set(Properties parameters)
parser
- Array of parsing objects to be utilized by the crawler to process the fetched documents.
Parser[] getParsers()
void setParsers(Parser[] parsers)
parsers
- Parsers that can be called in a given crawler.
Properties getProperties()
void setProxy(boolean proxy)
proxy
- Whether or not a proxy is needed for accessing documents
void setCaching(boolean b)
b
- Whether caching should occur
boolean isCaching()
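The difference between the two caching settings can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `CountingCrawler`, `SimpleParser`, the `fetch` helper, and `getFetchCount` are hypothetical names, and the "fetch" just builds a string so the example stays self-contained.

```java
final class CachingSketch {

    /** Hypothetical stand-in for the library's Parser type. */
    interface SimpleParser {
        void parse(String content);
    }

    /** Toy crawler that counts how often the document is acquired. */
    static final class CountingCrawler {
        private final SimpleParser[] parsers;
        private boolean caching = true;
        private int fetchCount = 0;

        CountingCrawler(SimpleParser... parsers) {
            this.parsers = parsers;
        }

        void setCaching(boolean b) { this.caching = b; }

        boolean isCaching() { return caching; }

        int getFetchCount() { return fetchCount; }

        // Placeholder for real file or network access (assumed helper).
        private String fetch(String site) {
            fetchCount++;
            return "<html>" + site + "</html>";
        }

        void crawl(String site) {
            if (caching) {
                // Acquire the page once and reuse it for every parser.
                String page = fetch(site);
                for (SimpleParser parser : parsers) {
                    parser.parse(page);
                }
            } else {
                // Re-acquire the page separately for each parser.
                for (SimpleParser parser : parsers) {
                    parser.parse(fetch(site));
                }
            }
        }
    }

    /** Crawls one document with three parsers and reports the number of fetches. */
    static int fetchesFor(boolean caching) {
        SimpleParser noop = content -> { };
        CountingCrawler crawler = new CountingCrawler(noop, noop, noop);
        crawler.setCaching(caching);
        crawler.crawl("example-document");
        return crawler.getFetchCount();
    }

    public static void main(String[] args) {
        System.out.println("caching on:  " + fetchesFor(true));
        System.out.println("caching off: " + fetchesFor(false));
    }
}
```

With three parsers, the cached path acquires the document once and the uncached path acquires it three times, which is the trade-off the flag controls.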
void setSpidering(boolean spider)
spider
- Whether links to newly discovered documents should also be read
boolean isSpidering()
void setFilter(CrawlerFilter filter)
filter
- Function to determine whether a URL should be parsed or not.
CrawlerFilter getFilter()
void block(java.lang.String site)
site
- URL to pass to the filter without passing to the parsers.
nz.ac.waikato.mcennis.rat.crawler.filter.Crawler.lod(String site)
void block(java.lang.String site, Properties props)
site
- URL to pass to the filter without passing to the parsers.
props
- Properties object defining parameters associated with parsing the site.
nz.ac.waikato.mcennis.rat.crawler.filter.Crawler.lod(String site, Properties props)
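How `block` interacts with the filter can be sketched with a toy model. Everything here is an assumption made for illustration: `SeenUrlFilter` is a hypothetical filter whose `load` records a URL, `accept` is an invented name for the filter's decision (this page does not name that method), and `offer` is a hypothetical hook standing in for "a link was discovered while spidering".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BlockSketch {

    /** Hypothetical filter: load() records a URL, accept() rejects recorded URLs. */
    static final class SeenUrlFilter {
        private final Set<String> seen = new HashSet<>();

        void load(String site) { seen.add(site); }

        // Assumed name for the decision method; not taken from the library.
        boolean accept(String site) { return !seen.contains(site); }
    }

    /** Toy crawler keeping a list of URLs that are still to be crawled. */
    static final class ToyCrawler {
        private SeenUrlFilter filter = new SeenUrlFilter();
        private final List<String> toCrawl = new ArrayList<>();

        void setFilter(SeenUrlFilter filter) { this.filter = filter; }

        SeenUrlFilter getFilter() { return filter; }

        /** Passes the URL to the filter without passing it to any parser. */
        void block(String site) { filter.load(site); }

        /** Hypothetical hook for a discovered link: queue it only if the filter accepts it. */
        boolean offer(String site) {
            if (!filter.accept(site)) {
                return false;
            }
            filter.load(site);
            toCrawl.add(site);
            return true;
        }

        List<String> pending() { return new ArrayList<>(toCrawl); }
    }

    /** Blocks one URL up front, then offers a blocked URL and a fresh URL (twice). */
    static List<String> demo() {
        ToyCrawler crawler = new ToyCrawler();
        crawler.block("http://example.org/private");
        crawler.offer("http://example.org/private"); // rejected: already known to the filter
        crawler.offer("http://example.org/public");  // accepted and queued
        crawler.offer("http://example.org/public");  // rejected: queued a moment ago
        return crawler.pending();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this model, blocking a URL before crawling starts is a way to pre-seed the filter so that the URL is never queued, even if a crawled page links to it.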
Crawler getCrawler()
void setCrawler(Crawler c)
c
- Crawler to use for spidering
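The relationship between `setSpidering` and `setCrawler` can be sketched as below. This is a toy model under assumptions, not the library's behaviour: `ToyCrawler` is hypothetical, the in-memory `LINKS` map replaces real link extraction, and defaulting the link crawler to the crawler itself is a guess made only so the example has a sensible initial state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SpideringSketch {

    /** Toy "web": each document name maps to the links it contains. */
    static final Map<String, List<String>> LINKS = new HashMap<>();
    static {
        LINKS.put("a", Arrays.asList("b", "c"));
        LINKS.put("b", Collections.singletonList("c"));
        LINKS.put("c", Collections.<String>emptyList());
    }

    static final class ToyCrawler {
        private final String name;
        private final List<String> log;
        private boolean spidering = false;
        private ToyCrawler linkCrawler = this; // assumption: defaults to itself

        ToyCrawler(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        void setSpidering(boolean spider) { this.spidering = spider; }

        boolean isSpidering() { return spidering; }

        void setCrawler(ToyCrawler c) { this.linkCrawler = c; }

        ToyCrawler getCrawler() { return linkCrawler; }

        void crawl(String site) {
            log.add(name + ":" + site);
            if (!spidering) {
                return; // links are ignored when spidering is off
            }
            // Hand every discovered link to the crawler configured for spidering.
            for (String link : LINKS.getOrDefault(site, Collections.<String>emptyList())) {
                linkCrawler.crawl(link);
            }
        }
    }

    /** Crawls "a" with a primary crawler that delegates links to a non-spidering secondary. */
    static List<String> run(boolean spidering) {
        List<String> log = new ArrayList<>();
        ToyCrawler primary = new ToyCrawler("primary", log);
        ToyCrawler secondary = new ToyCrawler("secondary", log);
        primary.setCrawler(secondary);
        primary.setSpidering(spidering);
        primary.crawl("a");
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(true));
        System.out.println(run(false));
    }
}
```

The sketch has no protection against link cycles; a real crawler would presumably rely on its filter to avoid revisiting documents.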