Package us.codecraft.webmagic
Main class "Spider" and models.

Interface Summary
Interface         Description
MultiPageModel    Extracts an object that spans more than one page, such as a news story or an article.
SpiderListener    Listener for Spider page-processing events.
Task              Interface for identifying different tasks.

Class Summary
Class             Description
Page              Object storing the extracted results and the URLs to fetch.
                  Not thread safe.
                  Main methods:
                    Page.getUrl()
                      get the URL of the current page
                    Page.getHtml()
                      get the content of the current page
                    Page.putField(String, Object)
                      save an extracted result
                    Page.getResultItems()
                      get the extracted results to be used in Pipeline
                    Page.addTargetRequests(Iterable)
                    Page.addTargetRequest(String)
                      add URLs to fetch
Request           Object containing the URL to crawl.
                  It also carries some additional information.
ResultItems       Object containing the extracted results.
                  It is contained in Page and will be processed in the pipeline.
SimpleHttpClient
Site              Object containing the settings for the crawler.
Spider            Entry point of a crawler.
                  A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline.
                  Every module is a field of Spider.
SpiderScheduler
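
The Page methods listed in the class summary can be illustrated with a minimal sketch. It does not use the library itself: SketchPage and PageSketchDemo are hypothetical stand-ins that mirror only the method names from the summary, getTargetRequests() is an assumed accessor added for inspection, and the URLs are placeholders. The real Page probably returns richer types than the String and Map used here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Page. It mirrors only the method names listed in
// the summary; the real class returns richer types than String and Map.
class SketchPage {
    private final String url;
    private final String html;
    private final Map<String, Object> resultItems = new LinkedHashMap<>();
    private final List<String> targetRequests = new ArrayList<>();

    SketchPage(String url, String html) {
        this.url = url;
        this.html = html;
    }

    String getUrl() { return url; }
    String getHtml() { return html; }

    // Save one extracted result under a key.
    void putField(String key, Object value) { resultItems.put(key, value); }

    // Results that a pipeline would later consume.
    Map<String, Object> getResultItems() { return resultItems; }

    // Queue one more URL to fetch.
    void addTargetRequest(String requestUrl) { targetRequests.add(requestUrl); }

    // Queue several URLs to fetch.
    void addTargetRequests(Iterable<String> requestUrls) {
        for (String requestUrl : requestUrls) {
            addTargetRequest(requestUrl);
        }
    }

    // Assumed accessor so the demo can inspect the queued URLs.
    List<String> getTargetRequests() { return targetRequests; }
}

class PageSketchDemo {
    // Plays the role of a page processor: read the page, store fields, queue a link.
    static void process(SketchPage page) {
        String html = page.getHtml();
        String open = "<title>";
        int start = html.indexOf(open);
        int end = html.indexOf("</title>");
        if (start >= 0 && end > start) {
            page.putField("title", html.substring(start + open.length(), end));
        }
        page.putField("url", page.getUrl());
        page.addTargetRequest("https://example.com/next"); // placeholder URL
    }

    public static void main(String[] args) {
        SketchPage page = new SketchPage("https://example.com/start",
                "<html><head><title>Hello</title></head></html>");
        process(page);
        System.out.println(page.getResultItems());
        System.out.println(page.getTargetRequests());
    }
}
```

The stand-in keeps no synchronization, in line with the "Not thread safe" note: one page object is meant to be filled by a single processing call and then handed on.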

Enum Summary
Enum              Description
Spider.Status
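
The statement that a spider holds its four modules as fields can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: every type here (MiniDownloader, MiniScheduler, MiniPageProcessor, MiniPipeline, MiniResult, MiniQueueScheduler, MiniSpider, MiniSpiderDemo) is hypothetical, the "site" is an in-memory map, and the "item|link,link" page format is invented for the demo.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical, simplified module contracts (not WebMagic's real interfaces).
interface MiniDownloader { String download(String url); }
interface MiniScheduler { void push(String url); String poll(); }
interface MiniPageProcessor { MiniResult process(String url, String content); }
interface MiniPipeline { void handle(String url, String item); }

// What the processor hands back: one extracted item plus follow-up links.
class MiniResult {
    final String item;
    final List<String> links;
    MiniResult(String item, List<String> links) { this.item = item; this.links = links; }
}

// FIFO scheduler that skips URLs it has already seen.
class MiniQueueScheduler implements MiniScheduler {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    public void push(String url) { if (seen.add(url)) queue.add(url); }
    public String poll() { return queue.poll(); } // null when the queue is empty
}

class MiniSpider {
    // Each of the four modules is a field of the spider.
    private final MiniDownloader downloader;
    private final MiniScheduler scheduler;
    private final MiniPageProcessor pageProcessor;
    private final MiniPipeline pipeline;

    MiniSpider(MiniDownloader downloader, MiniScheduler scheduler,
               MiniPageProcessor pageProcessor, MiniPipeline pipeline) {
        this.downloader = downloader;
        this.scheduler = scheduler;
        this.pageProcessor = pageProcessor;
        this.pipeline = pipeline;
    }

    // Single-threaded loop: schedule, download, process, hand results to the pipeline.
    void run(String startUrl) {
        scheduler.push(startUrl);
        String url;
        while ((url = scheduler.poll()) != null) {
            String content = downloader.download(url);
            if (content == null) {
                continue; // nothing downloaded for this URL
            }
            MiniResult result = pageProcessor.process(url, content);
            pipeline.handle(url, result.item);
            for (String link : result.links) {
                scheduler.push(link);
            }
        }
    }
}

class MiniSpiderDemo {
    static List<String> crawl() {
        // Invented page format: "item|link,link".
        Map<String, String> site = new HashMap<>();
        site.put("a", "A|b,c");
        site.put("b", "B|c");
        site.put("c", "C|");
        List<String> items = new ArrayList<>();
        MiniPageProcessor processor = (url, content) -> {
            int bar = content.indexOf('|');
            String rest = content.substring(bar + 1);
            List<String> links = rest.isEmpty()
                    ? Collections.<String>emptyList()
                    : Arrays.asList(rest.split(","));
            return new MiniResult(content.substring(0, bar), links);
        };
        MiniSpider spider = new MiniSpider(site::get, new MiniQueueScheduler(),
                processor, (url, item) -> items.add(item));
        spider.run("a");
        return items;
    }

    public static void main(String[] args) {
        System.out.println(crawl());
    }
}
```

Keeping each module behind its own small interface means any one of them can be swapped (for example, a different scheduler) without touching the crawl loop.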