Main class "Spider" and models.
Interface Summary
MultiPageModel: Extracts an object that spans more than one page, such as news and articles.
SpiderListener: Listener for Spider page-processing events.
Task: Interface for identifying different tasks.
Class Summary
Page: Object storing extracted results and URLs to fetch.
Not thread safe.
Page.getUrl(): get the URL of the current page
Page.getHtml(): get the content of the current page
Page.putField(String, Object): save an extracted result
Page.getResultItems(): get the extracted results to be used in
Page.addTargetRequest(String): add a URL to fetch
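The Page methods listed above can be illustrated with a minimal, self-contained sketch. SketchPage and SketchPageDemo are hypothetical stand-ins written for this example, not the library's classes: they mirror only the method names in the list, use plain String and Map types for simplicity (the library's actual return types may differ), and add a getTargetRequests() accessor that is an assumption made so the queued URLs can be inspected.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Page; it mirrors only the method names listed above.
class SketchPage {
    private final String url;
    private final String html;
    private final Map<String, Object> resultItems = new LinkedHashMap<>();
    private final List<String> targetRequests = new ArrayList<>();

    SketchPage(String url, String html) {
        this.url = url;
        this.html = html;
    }

    String getUrl() { return url; }

    String getHtml() { return html; }

    // Save one extracted result under a field name.
    void putField(String key, Object value) { resultItems.put(key, value); }

    // Results collected so far, in insertion order.
    Map<String, Object> getResultItems() { return resultItems; }

    // Queue another URL to fetch later.
    void addTargetRequest(String requestUrl) { targetRequests.add(requestUrl); }

    // Assumed accessor (not in the list above) so the queued URLs can be read back.
    List<String> getTargetRequests() { return targetRequests; }
}

class SketchPageDemo {
    // Hypothetical processing step: extract the <title> text and queue one follow-up URL.
    static SketchPage process(SketchPage page) {
        String html = page.getHtml();
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        if (start >= 0 && end > start) {
            page.putField("title", html.substring(start + "<title>".length(), end));
        }
        page.putField("url", page.getUrl());
        page.addTargetRequest(page.getUrl() + "?page=2");
        return page;
    }

    public static void main(String[] args) {
        SketchPage page = process(
                new SketchPage("http://example.com/news", "<html><title>Hello</title></html>"));
        System.out.println(page.getResultItems());
        System.out.println(page.getTargetRequests());
    }
}
```

Because the stand-in keeps its state in unsynchronized collections, it shares the "not thread safe" property noted for Page: one instance should be handled by one thread at a time.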
Request: Object containing the URL to crawl.
It contains some additional information.
ResultItems: Object containing the extracted results.
It is contained in Page and will be processed in pipeline.
SimpleHttpClient
Site: Object containing settings for the crawler.
Spider: Entry point of a crawler.
A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline.
Every module is a field of Spider.
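The four-module layout described above can be sketched as a small, single-threaded crawl loop. Everything in this block is a hypothetical stand-in: the Sketch* names, the method signatures, and the in-memory "web" map are assumptions made for illustration and are not the library's actual interfaces. The only parts taken from the description are the four module roles and the fact that each one is a field of the spider.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical module interfaces; the signatures are assumptions made for this sketch.
interface SketchDownloader { String download(String url); }
interface SketchScheduler { void push(String url); String poll(); }
interface SketchPageProcessor { String process(String url, String content, List<String> newUrls); }
interface SketchPipeline { void process(String url, String result); }

// FIFO scheduler that skips URLs it has already seen.
class SketchQueueScheduler implements SketchScheduler {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    public void push(String url) {
        if (seen.add(url)) {
            queue.add(url);
        }
    }

    public String poll() { return queue.poll(); }
}

class SketchSpider {
    // Every module is a field of the spider.
    private final SketchDownloader downloader;
    private final SketchScheduler scheduler;
    private final SketchPageProcessor pageProcessor;
    private final SketchPipeline pipeline;

    SketchSpider(SketchDownloader downloader, SketchScheduler scheduler,
                 SketchPageProcessor pageProcessor, SketchPipeline pipeline) {
        this.downloader = downloader;
        this.scheduler = scheduler;
        this.pageProcessor = pageProcessor;
        this.pipeline = pipeline;
    }

    // Schedule -> download -> process -> pipeline, until no URLs remain.
    void run(String startUrl) {
        scheduler.push(startUrl);
        String url;
        while ((url = scheduler.poll()) != null) {
            String content = downloader.download(url);
            List<String> newUrls = new ArrayList<>();
            String result = pageProcessor.process(url, content, newUrls);
            pipeline.process(url, result);
            for (String next : newUrls) {
                scheduler.push(next);
            }
        }
    }
}

class SketchSpiderDemo {
    // Crawls an in-memory map of url -> content, where "link:x" points to url x.
    static List<String> crawl(Map<String, String> web, String startUrl) {
        List<String> stored = new ArrayList<>();
        SketchSpider spider = new SketchSpider(
                url -> web.getOrDefault(url, ""),
                new SketchQueueScheduler(),
                (url, content, newUrls) -> {
                    if (content.startsWith("link:")) {
                        newUrls.add(content.substring("link:".length()));
                    }
                    return content;
                },
                (url, result) -> stored.add(url + "=" + result));
        spider.run(startUrl);
        return stored;
    }

    public static void main(String[] args) {
        System.out.println(crawl(Map.of("a", "link:b", "b", "link:a"), "a"));
    }
}
```

Keeping each module behind its own interface means any one of them can be swapped (for example, a different scheduler) without touching the loop in run(). The duplicate check in the sketch scheduler is what stops the two-page cycle in main from looping forever.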
Enum Summary
Spider.Status