All Classes and Interfaces

Base class for downloaders with some common methods.
All selectors will be arranged as a pipeline.
Pipeline that can collect and store results.
Write results to the console. Usually used in tests.
Thread pool for workers. Uses an ExecutorService as its inner implementation.
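For illustration, a minimal sketch of such a countable pool: an ExecutorService wrapper that tracks how many workers are busy and blocks new submissions while all threads are occupied. The class and member names below are made up for this sketch.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical sketch of a countable worker pool.
    public class SimpleCountablePool {
        private final int threadNum;
        private final ExecutorService executor;
        private final AtomicInteger aliveCount = new AtomicInteger(0);
        private final ReentrantLock lock = new ReentrantLock();
        private final Condition notFull = lock.newCondition();

        public SimpleCountablePool(int threadNum) {
            this.threadNum = threadNum;
            this.executor = Executors.newFixedThreadPool(threadNum);
        }

        // Number of workers currently running a task.
        public int getAliveCount() {
            return aliveCount.get();
        }

        // Blocks until a worker slot is free, then runs the task on the pool.
        public void execute(Runnable task) throws InterruptedException {
            lock.lock();
            try {
                while (aliveCount.get() >= threadNum) {
                    notFull.await();
                }
                aliveCount.incrementAndGet();
            } finally {
                lock.unlock();
            }
            executor.execute(() -> {
                try {
                    task.run();
                } finally {
                    aliveCount.decrementAndGet();
                    lock.lock();
                    try {
                        notFull.signal();
                    } finally {
                        lock.unlock();
                    }
                }
            });
        }

        public void shutdown() {
            executor.shutdown();
        }
    }

Blocking in execute() lets a crawler stop pulling new URLs from the scheduler while every worker is busy, instead of queueing requests without bound.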
CSS selector.
Redirect strategy implementation that supports POST 302 redirects. The default way to enable redirects in HttpClient is httpClientBuilder.setRedirectStrategy(new LaxRedirectStrategy()); in the POST/redirect/POST case, however, that code does not carry the original request's data over to the redirected request. This class therefore follows the redirect strategy of the SeimiCrawler project. Original code: https://github.com/zhegexiaohuozi/SeimiCrawler/blob/master/project/src/main/java/cn/wanghaomiao/seimi/http/hc/SeimiRedirectStrategy.java
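As a sketch of the idea (not this class's actual code), an HttpClient LaxRedirectStrategy subclass that re-issues a redirected POST as a POST and keeps the original request body:

    import java.net.URI;

    import org.apache.http.HttpEntityEnclosingRequest;
    import org.apache.http.HttpRequest;
    import org.apache.http.HttpResponse;
    import org.apache.http.ProtocolException;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.client.methods.HttpUriRequest;
    import org.apache.http.impl.client.LaxRedirectStrategy;
    import org.apache.http.protocol.HttpContext;

    // Sketch: keep the POST body across a redirect instead of degrading
    // to a bodyless GET, as a plain redirect strategy would.
    public class PostPreservingRedirectStrategy extends LaxRedirectStrategy {
        @Override
        public HttpUriRequest getRedirect(HttpRequest request, HttpResponse response,
                                          HttpContext context) throws ProtocolException {
            String method = request.getRequestLine().getMethod();
            if ("POST".equalsIgnoreCase(method)
                    && request instanceof HttpEntityEnclosingRequest) {
                URI uri = getLocationURI(request, response, context);
                HttpPost redirected = new HttpPost(uri);
                redirected.setEntity(((HttpEntityEnclosingRequest) request).getEntity());
                return redirected;
            }
            return super.getRedirect(request, response, context);
        }
    }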
Downloader is the part that downloads web pages and stores them in a Page object.
Remove duplicate URLs and push only those that have not been seen before.
Remove duplicate requests.
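A minimal sketch of the set-based de-duplication idea, assuming equality is decided by URL alone; the class name is illustrative:

    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical de-duplicator: a concurrent set records every URL pushed
    // so far, and a URL is accepted the first time and rejected afterwards.
    public class UrlDeduplicator {
        private final Set<String> seenUrls =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

        // Returns true if the URL was seen before; records it otherwise.
        // Set.add() is atomic and returns false when the URL is already present.
        public boolean isDuplicate(String url) {
            return !seenUrls.add(url);
        }

        public int seenCount() {
            return seenUrls.size();
        }
    }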
Selector (extractor) for HTML elements.
Marks a feature as unstable.
Base object of file persistence.
Store results in files.
Selectable HTML.
The http downloader based on HttpClient.
Some constants of the HTTP protocol.
Parse JSON.
JsonPath selector.
Used to extract content from JSON.
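For illustration, extracting a field with a JsonPath expression, assuming a selector that takes the expression in its constructor and selects from a raw JSON string:

    import us.codecraft.webmagic.selector.JsonPathSelector;

    // Minimal sketch: pull one field out of a JSON document.
    public class JsonPathExample {
        public static void main(String[] args) {
            String json = "{\"store\":{\"book\":[{\"title\":\"A\"},{\"title\":\"B\"}]}}";
            // "$.store.book[0].title" addresses the first book's title.
            String firstTitle = new JsonPathSelector("$.store.book[0].title").select(json);
            System.out.println(firstTitle);
        }
    }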
Links selector based on jsoup.
A scheduler whose requests can be counted for monitoring.
All extractors will extract separately, and their results will be combined as the final result.
Object storing the extracted results and the URLs to fetch.
Not thread safe.
Main methods:
Page.getUrl(): get the URL of the current page
Page.getHtml(): get the content of the current page
Page.putField(String, Object): save an extracted result
Page.getResultItems(): get the extracted results to be used in a Pipeline
Page.addTargetRequests(Iterable) / Page.addTargetRequest(String): add URLs to fetch
Interface to be implemented to customize a crawler.
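As an illustration of both entries above, a minimal sketch of a custom processor that uses the Page methods listed; the XPath expressions, the link pattern, and the field name are placeholders:

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.processor.PageProcessor;

    // Sketch of a custom crawler: extract the title of every page,
    // follow in-site links, and store the result under the "title" field.
    public class ExamplePageProcessor implements PageProcessor {

        private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

        @Override
        public void process(Page page) {
            // Save an extracted result; Pipelines receive it later.
            page.putField("title", page.getHtml().xpath("//title/text()").toString());
            // Add more URLs to fetch.
            page.addTargetRequests(page.getHtml().links().regex(".*example\\.com.*").all());
        }

        @Override
        public Site getSite() {
            return site;
        }
    }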
Pipeline is the persistence and offline-processing part of the crawler.
The Pipeline interface can be implemented to customize how results are persisted.
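A minimal sketch of a custom Pipeline that just prints every extracted field; a real implementation might write to a database or a message queue instead:

    import us.codecraft.webmagic.ResultItems;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.pipeline.Pipeline;

    // Hypothetical pipeline: log each extracted key/value pair per task.
    public class LoggingPipeline implements Pipeline {
        @Override
        public void process(ResultItems resultItems, Task task) {
            resultItems.getAll().forEach((key, value) ->
                    System.out.println(task.getUUID() + "\t" + key + "=" + value));
        }
    }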
Selectable plain text.
Cannot be selected by XPath or CSS selectors.
Priority scheduler.
Proxy provider.
Pooled proxy object.
Basic Scheduler implementation.
Store URLs to fetch in a LinkedBlockingQueue and remove duplicate URLs with a HashMap.
Selector based on regular expressions.
Replace selector.
Object containing a URL to crawl.
It also carries some additional information.
Object containing the extracted results.
It is contained in a Page and will be processed by the Pipeline.
Scheduler is the URL-management part of the crawler.
You can implement the Scheduler interface to manage the URLs to fetch and to remove duplicate URLs.
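A minimal sketch of such a scheduler, in the spirit of the queue-based implementation described above (a blocking queue of pending requests plus a concurrent set for de-duplication); the class name is illustrative:

    import java.util.Collections;
    import java.util.Queue;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;

    import us.codecraft.webmagic.Request;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.scheduler.Scheduler;

    // Hypothetical scheduler: pending requests wait in a queue, and a
    // concurrent set drops URLs that were already pushed once.
    public class SimpleQueueScheduler implements Scheduler {

        private final Queue<Request> queue = new LinkedBlockingQueue<>();
        private final Set<String> seen =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

        @Override
        public void push(Request request, Task task) {
            // Only enqueue URLs that have not been pushed before.
            if (seen.add(request.getUrl())) {
                queue.add(request);
            }
        }

        @Override
        public Request poll(Task task) {
            return queue.poll(); // null when nothing is left to fetch
        }
    }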
Selectable text.
Selector(extractor) for text.
Convenient methods for selectors.
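For illustration, chaining a few of these selectors over one Html object; the markup and expressions are made up:

    import us.codecraft.webmagic.selector.Html;

    // Minimal sketch: run XPath, CSS and regex selectors on the same HTML.
    public class SelectorExample {
        public static void main(String[] args) {
            Html html = new Html(
                    "<html><body><div class='title'>Hello WebMagic</div></body></html>");
            // XPath selector (based on Xsoup)
            System.out.println(html.xpath("//div[@class='title']/text()").toString());
            // CSS selector (based on Jsoup), then a regex selector on its result
            System.out.println(html.css("div.title", "text").regex("Hello\\s(\\w+)").toString());
        }
    }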
A simple PageProcessor.
A simple ProxyProvider.
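For example, a sketch of wiring a fixed proxy list into the HttpClient-based downloader, assuming the 0.7.x-style Proxy and SimpleProxyProvider API; the addresses are placeholders:

    import us.codecraft.webmagic.downloader.HttpClientDownloader;
    import us.codecraft.webmagic.proxy.Proxy;
    import us.codecraft.webmagic.proxy.SimpleProxyProvider;

    // Sketch: make the downloader rotate through a fixed set of proxies.
    public class ProxySetup {
        public static HttpClientDownloader proxiedDownloader() {
            HttpClientDownloader downloader = new HttpClientDownloader();
            downloader.setProxyProvider(SimpleProxyProvider.from(
                    new Proxy("127.0.0.1", 1087),
                    new Proxy("127.0.0.1", 1088)));
            return downloader;
        }
    }

The returned downloader can then be attached to a Spider with setDownloader().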
Object containing the settings for a crawler.
Borrowed from https://code.google.com/p/cx-extractor/
Entry point of a crawler.
A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline.
Every module is a field of Spider.
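Putting the modules together, a minimal sketch of assembling and running a Spider, reusing the illustrative processor sketched earlier; the seed URL is a placeholder:

    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.pipeline.ConsolePipeline;

    // Minimal sketch: assemble a Spider from its modules and run it.
    public class CrawlerMain {
        public static void main(String[] args) {
            Spider.create(new ExamplePageProcessor())
                    .addUrl("https://example.com/")     // seed URL (placeholder)
                    .addPipeline(new ConsolePipeline()) // print results to console
                    .thread(5)                          // five worker threads
                    .run();                             // block until the crawl finishes
        }
    }

Modules that are not set explicitly fall back to the default implementations.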
Listener for a Spider's page processing.
Interface for identifying different tasks.
URL and HTML utilities.
XPath selector based on Xsoup.