All Classes and Interfaces

Base class for downloaders with some common methods.
All selectors will be arranged as a pipeline.
Pipeline that can collect and store results.
Write results to the console. Usually used in tests.
Thread pool for workers. Uses an ExecutorService as its inner implementation.
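For illustration, a minimal sketch of such a countable pool: an ExecutorService wrapper that tracks how many workers are busy and blocks new submissions while all threads are occupied. The class and member names below are made up for this sketch.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical sketch of a countable worker pool.
    public class SimpleCountablePool {
        private final int threadNum;
        private final ExecutorService executor;
        private final AtomicInteger aliveCount = new AtomicInteger(0);
        private final ReentrantLock lock = new ReentrantLock();
        private final Condition notFull = lock.newCondition();

        public SimpleCountablePool(int threadNum) {
            this.threadNum = threadNum;
            this.executor = Executors.newFixedThreadPool(threadNum);
        }

        // Number of workers currently running a task.
        public int getAliveCount() {
            return aliveCount.get();
        }

        // Blocks until a worker slot is free, then runs the task on the pool.
        public void execute(Runnable task) throws InterruptedException {
            lock.lock();
            try {
                while (aliveCount.get() >= threadNum) {
                    notFull.await();
                }
                aliveCount.incrementAndGet();
            } finally {
                lock.unlock();
            }
            executor.execute(() -> {
                try {
                    task.run();
                } finally {
                    aliveCount.decrementAndGet();
                    lock.lock();
                    try {
                        notFull.signal();
                    } finally {
                        lock.unlock();
                    }
                }
            });
        }

        public void shutdown() {
            executor.shutdown();
        }
    }

Blocking in execute() lets a crawler stop pulling new URLs from the scheduler while every worker is busy, instead of queueing requests without bound.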
CSS selector.
Redirect strategy implementation that supports POST 302 redirects. The default way to enable redirects in HttpClient is httpClientBuilder.setRedirectStrategy(new LaxRedirectStrategy()); in the POST/redirect/POST case, however, that code does not carry the original request's data over to the redirected request. This class therefore follows the redirect strategy of the SeimiCrawler project. Original code: https://github.com/zhegexiaohuozi/SeimiCrawler/blob/master/project/src/main/java/cn/wanghaomiao/seimi/http/hc/SeimiRedirectStrategy.java
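As a sketch of the idea (not this class's actual code), an HttpClient LaxRedirectStrategy subclass that re-issues a redirected POST as a POST and keeps the original request body:

    import java.net.URI;

    import org.apache.http.HttpEntityEnclosingRequest;
    import org.apache.http.HttpRequest;
    import org.apache.http.HttpResponse;
    import org.apache.http.ProtocolException;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.client.methods.HttpUriRequest;
    import org.apache.http.impl.client.LaxRedirectStrategy;
    import org.apache.http.protocol.HttpContext;

    // Sketch: keep the POST body across a redirect instead of degrading
    // to a bodyless GET, as a plain redirect strategy would.
    public class PostPreservingRedirectStrategy extends LaxRedirectStrategy {
        @Override
        public HttpUriRequest getRedirect(HttpRequest request, HttpResponse response,
                                          HttpContext context) throws ProtocolException {
            String method = request.getRequestLine().getMethod();
            if ("POST".equalsIgnoreCase(method)
                    && request instanceof HttpEntityEnclosingRequest) {
                URI uri = getLocationURI(request, response, context);
                HttpPost redirected = new HttpPost(uri);
                redirected.setEntity(((HttpEntityEnclosingRequest) request).getEntity());
                return redirected;
            }
            return super.getRedirect(request, response, context);
        }
    }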
Downloader is the part that downloads web pages and stores them in a Page object.
Remove duplicate URLs and push only those that have not been seen before.
Remove duplicate requests.
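A minimal sketch of the set-based de-duplication idea, assuming equality is decided by URL alone; the class name is illustrative:

    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical de-duplicator: a concurrent set records every URL pushed
    // so far, and a URL is accepted the first time and rejected afterwards.
    public class UrlDeduplicator {
        private final Set<String> seenUrls =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

        // Returns true if the URL was seen before; records it otherwise.
        // Set.add() is atomic and returns false when the URL is already present.
        public boolean isDuplicate(String url) {
            return !seenUrls.add(url);
        }

        public int seenCount() {
            return seenUrls.size();
        }
    }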
Selector (extractor) for HTML elements.
Marks a feature as unstable.
Base object of file persistence.
Store results in files.
Selectable HTML.
The http downloader based on HttpClient.
Some constants of the HTTP protocol.
Parse JSON.
JsonPath selector.
Used to extract content from JSON.
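For illustration, extracting a field with a JsonPath expression, assuming a selector that takes the expression in its constructor and selects from a raw JSON string:

    import us.codecraft.webmagic.selector.JsonPathSelector;

    // Minimal sketch: pull one field out of a JSON document.
    public class JsonPathExample {
        public static void main(String[] args) {
            String json = "{\"store\":{\"book\":[{\"title\":\"A\"},{\"title\":\"B\"}]}}";
            // "$.store.book[0].title" addresses the first book's title.
            String firstTitle = new JsonPathSelector("$.store.book[0].title").select(json);
            System.out.println(firstTitle);
        }
    }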
Links selector based on jsoup.
A scheduler whose requests can be counted for monitoring.
All extractors will extract separately, and their results will be combined as the final result.
Object storing the extracted results and the URLs to fetch.
Not thread safe.
Main methods:
Page.getUrl(): get the URL of the current page
Page.getHtml(): get the content of the current page
Page.putField(String, Object): save an extracted result
Page.getResultItems(): get the extracted results to be used in a Pipeline
Page.addTargetRequests(Iterable) / Page.addTargetRequest(String): add URLs to fetch
Interface to be implemented to customize a crawler.
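As an illustration of both entries above, a minimal sketch of a custom processor that uses the Page methods listed; the XPath expressions, the link pattern, and the field name are placeholders:

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.processor.PageProcessor;

    // Sketch of a custom crawler: extract the title of every page,
    // follow in-site links, and store the result under the "title" field.
    public class ExamplePageProcessor implements PageProcessor {

        private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

        @Override
        public void process(Page page) {
            // Save an extracted result; Pipelines receive it later.
            page.putField("title", page.getHtml().xpath("//title/text()").toString());
            // Add more URLs to fetch.
            page.addTargetRequests(page.getHtml().links().regex(".*example\\.com.*").all());
        }

        @Override
        public Site getSite() {
            return site;
        }
    }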
Pipeline is the persistence and offline-processing part of the crawler.
The Pipeline interface can be implemented to customize how results are persisted.
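A minimal sketch of a custom Pipeline that just prints every extracted field; a real implementation might write to a database or a message queue instead:

    import us.codecraft.webmagic.ResultItems;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.pipeline.Pipeline;

    // Hypothetical pipeline: log each extracted key/value pair per task.
    public class LoggingPipeline implements Pipeline {
        @Override
        public void process(ResultItems resultItems, Task task) {
            resultItems.getAll().forEach((key, value) ->
                    System.out.println(task.getUUID() + "\t" + key + "=" + value));
        }
    }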
Selectable plain text.
Cannot be selected by XPath or CSS selectors.
Priority scheduler.
Proxy provider.
Pooled proxy object.
Basic Scheduler implementation.
Store URLs to fetch in a LinkedBlockingQueue and remove duplicate URLs with a HashMap.
Selector based on regular expressions.
Replace selector.
Object containing a URL to crawl.
It also carries some additional information.
Object containing the extracted results.
It is contained in a Page and will be processed by the Pipeline.
Scheduler is the URL-management part of the crawler.
You can implement the Scheduler interface to manage the URLs to fetch and to remove duplicate URLs.
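A minimal sketch of such a scheduler, in the spirit of the queue-based implementation described above (a blocking queue of pending requests plus a concurrent set for de-duplication); the class name is illustrative:

    import java.util.Collections;
    import java.util.Queue;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;

    import us.codecraft.webmagic.Request;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.scheduler.Scheduler;

    // Hypothetical scheduler: pending requests wait in a queue, and a
    // concurrent set drops URLs that were already pushed once.
    public class SimpleQueueScheduler implements Scheduler {

        private final Queue<Request> queue = new LinkedBlockingQueue<>();
        private final Set<String> seen =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

        @Override
        public void push(Request request, Task task) {
            // Only enqueue URLs that have not been pushed before.
            if (seen.add(request.getUrl())) {
                queue.add(request);
            }
        }

        @Override
        public Request poll(Task task) {
            return queue.poll(); // null when nothing is left to fetch
        }
    }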
Selectable text.
Selector(extractor) for text.
Convenient methods for selectors.
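For illustration, chaining a few of these selectors over one Html object; the markup and expressions are made up:

    import us.codecraft.webmagic.selector.Html;

    // Minimal sketch: run XPath, CSS and regex selectors on the same HTML.
    public class SelectorExample {
        public static void main(String[] args) {
            Html html = new Html(
                    "<html><body><div class='title'>Hello WebMagic</div></body></html>");
            // XPath selector (based on Xsoup)
            System.out.println(html.xpath("//div[@class='title']/text()").toString());
            // CSS selector (based on Jsoup), then a regex selector on its result
            System.out.println(html.css("div.title", "text").regex("Hello\\s(\\w+)").toString());
        }
    }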
A simple PageProcessor.
A simple ProxyProvider.
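For example, a sketch of wiring a fixed proxy list into the HttpClient-based downloader, assuming the 0.7.x-style Proxy and SimpleProxyProvider API; the addresses are placeholders:

    import us.codecraft.webmagic.downloader.HttpClientDownloader;
    import us.codecraft.webmagic.proxy.Proxy;
    import us.codecraft.webmagic.proxy.SimpleProxyProvider;

    // Sketch: make the downloader rotate through a fixed set of proxies.
    public class ProxySetup {
        public static HttpClientDownloader proxiedDownloader() {
            HttpClientDownloader downloader = new HttpClientDownloader();
            downloader.setProxyProvider(SimpleProxyProvider.from(
                    new Proxy("127.0.0.1", 1087),
                    new Proxy("127.0.0.1", 1088)));
            return downloader;
        }
    }

The returned downloader can then be attached to a Spider with setDownloader().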
Object containing the settings for a crawler.
Borrowed from https://code.google.com/p/cx-extractor/
Entry point of a crawler.
A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline.
Every module is a field of Spider.
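Putting the modules together, a minimal sketch of assembling and running a Spider, reusing the illustrative processor sketched earlier; the seed URL is a placeholder:

    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.pipeline.ConsolePipeline;

    // Minimal sketch: assemble a Spider from its modules and run it.
    public class CrawlerMain {
        public static void main(String[] args) {
            Spider.create(new ExamplePageProcessor())
                    .addUrl("https://example.com/")     // seed URL (placeholder)
                    .addPipeline(new ConsolePipeline()) // print results to console
                    .thread(5)                          // five worker threads
                    .run();                             // block until the crawl finishes
        }
    }

Modules that are not set explicitly fall back to the default implementations.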
Listener for a Spider's page processing.
Interface for identifying different tasks.
URL and HTML utilities.
XPath selector based on Xsoup.