All Classes and Interfaces (webmagic 1.0.3-SNAPSHOT API)

支持post 302跳转策略实现类 HttpClient默认跳转：httpClientBuilder.setRedirectStrategy(new LaxRedirectStrategy()); 上述代码在post/redirect/post这种情况下不会传递原有请求的数据信息。所以参考了下SeimiCrawler这个项目的重定向策略。原代码地址：https://github.com/zhegexiaohuozi/SeimiCrawler/blob/master/project/src/main/java/cn/wanghaomiao/seimi/http/hc/SeimiRedirectStrategy.java

DateFormatter

DelayQueueScheduler

DiandianBlogProcessor

DianpingFtlDataScanner

DiaoyuwengProcessor

DoubleKeyMap<K1,K2,V>

Downloader

Downloader is the part that downloads web pages and store in Page object.

DuplicateRemovedScheduler

Remove duplicate urls and only push urls which are not duplicate.

DuplicateRemover

Remove duplicate requests.

DuplicateStorageRemover

ElementSelector

Selector(extractor) for html elements.

Experimental

Stands for features unstable.

ExpressionType

ExtractBy

Define the extractor for field or class.

ExtractBy.Source

types of source for extracting.

ExtractBy.Type

types of extractor expressions

ExtractByUrl

Define a extractor to extract data in url of current page.

Extractor

The object contains 'ExtractBy' information.

ExtractorUtils

Tools for annotation converting.

ExtractRule

F58PageProcesser

FieldExtractor

Wrapper of field and extractor.

FileCacheQueueScheduler

Store urls and cursor in files so that a Spider can resume the status when shutdown.

FilePageModelPipeline

Store results objects (page models) to files in plain format.
Use model.getKey() as file name if the model implements HasKey.
Otherwise use SHA1 as file name.

FilePersistentBase

Base object of file persistence.

FilePipeline

Store results in files.

Formatter

Define how the result string is convert to an object for field.

GithubRepo

GithubRepoApi

GithubRepoPageMapper

GithubRepoPageProcessor

HashSetDuplicateRemover

HasKey

Interface to be implemented by page mode.
Can be used to identify a page model, or be used as name of file storing the object.

HelpUrl

Define the 'help' url patterns for class.

Html

Selectable html.

HtmlNode

HttpClientDownloader

The http downloader based on HttpClient.

HttpClientGenerator

HttpClientRequestContext

HttpClientUtils

HttpConstant

Some constants of Http protocal.

HttpConstant.Header

HttpConstant.Method

HttpConstant.StatusCode

HttpRequestBody

HttpRequestBody.ContentType

HttpUriRequestConverter

HuxiuProcessor

InfoQMiniBookProcessor

parse json

JsonFilePageModelPipeline

Store results objects (page models) to files in JSON format.
Use model.getKey() as file name if the model implements HasKey.
Otherwise use SHA1 as file name.

JsonFilePipeline

Store results to files in JSON format.

JsonPathSelector

JsonPath selector.
Used to extract content from JSON.

Links selector based on jsoup.

The scheduler whose requests can be counted for monitor.

MonitorExample

MultiKeyMapBase

multi-key map, some basic objects *

MultiPageModel

Extract an object of more than one pages, such as news and articles.

MultiPagePipeline

A pipeline combines the result in more than one page together.
Used for news and articles containing more than one web page.

Selector(extractor) for html node.

NumberUtils

ObjectFormatter<T>

ObjectFormatterBuilder

ObjectFormatters

OneFilePipeline

OOSpider<T>

The spider for page model extractor.
In webmagic, we call a POJO containing extract result as "page model".

OrSelector

All extractors will do extracting separately,
and the results of extractors will combined as the final result.

OschinaAnswer

OschinaBlog

Page

Object storing extracted result and urls to fetch.
Not thread safe.
Main method：
Page.getUrl() get url of current page
Page.getHtml() get content of current page
Page.putField(String, Object) save extracted result
Page.getResultItems() get extract results to be used in Pipeline
Page.addTargetRequests(Iterable) Page.addTargetRequest(String) add urls to fetch

PageField

PageMapper<T>

PageModelPipeline<T>

Implements PageModelPipeline to persistent your page model.

PageProcessor

Interface to be implemented to customize a crawler.

Params

PatternProcessor

PatternProcessorExample

Created with IntelliJ IDEA.

PatternRequestMatcher

Created with IntelliJ IDEA.