All Classes and Interfaces

Base class for downloaders, with some common methods.
Interface to be implemented by page models that need to do something after fields are extracted.
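A minimal sketch of such a page model, assuming webmagic's AfterExtractor interface with its single afterProcess(Page) hook; the BlogPost class and its xpath are invented for illustration:

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.model.AfterExtractor;
    import us.codecraft.webmagic.model.annotation.ExtractBy;

    public class BlogPost implements AfterExtractor {

        @ExtractBy("//h1[@class='title']/text()") // illustrative xpath
        private String title;

        // Called after the annotated fields have been filled.
        @Override
        public void afterProcess(Page page) {
            if (title != null) {
                title = title.trim(); // e.g. clean up the extracted value
            }
        }
    }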
All selectors will be arranged as a pipeline.
BloomFilterDuplicateRemover for a huge number of urls.
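A sketch of plugging it in, assuming the extension module's BloomFilterDuplicateRemover on a QueueScheduler; the 10-million capacity, the seed url, and MyProcessor (any PageProcessor) are placeholders:

    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.scheduler.QueueScheduler;
    import us.codecraft.webmagic.scheduler.component.BloomFilterDuplicateRemover;

    public class BloomFilterExample {
        public static void main(String[] args) {
            Spider.create(new MyProcessor()) // MyProcessor: any PageProcessor
                    .setScheduler(new QueueScheduler()
                            // trade exact duplicate checking for constant memory
                            .setDuplicateRemover(new BloomFilterDuplicateRemover(10000000)))
                    .addUrl("https://example.com/")
                    .run();
        }
    }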
Pipeline that can collect and store results.
Combined 'ExtractBy' extractor with and/or operators.
Types of source for extracting.
Prints the page model to the console. Usually used in tests.
Writes results to the console. Usually used in tests.
Thread pool for workers. Uses ExecutorService as the inner implementation.
CSS selector.
Redirect strategy implementation that supports 302 redirects for POST. HttpClient's default is httpClientBuilder.setRedirectStrategy(new LaxRedirectStrategy()); in the post/redirect/post case that default does not pass along the original request's data, so this class follows the redirect strategy of the SeimiCrawler project. Original source: https://github.com/zhegexiaohuozi/SeimiCrawler/blob/master/project/src/main/java/cn/wanghaomiao/seimi/http/hc/SeimiRedirectStrategy.java
Downloader is the part that downloads web pages and stores them in a Page object.
Removes duplicate urls and only pushes urls that are not duplicates.
Removes duplicate requests.
Selector(extractor) for html elements.
Marks a feature as unstable.
Defines the extractor for a field or class.
Types of source for extracting.
Types of extractor expressions.
Defines an extractor to extract data from the url of the current page.
The object containing 'ExtractBy' information.
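Together, these annotations drive a page model like the following, adapted from webmagic's README; the xpath and regex values are examples only:

    import us.codecraft.webmagic.model.annotation.ExtractBy;
    import us.codecraft.webmagic.model.annotation.ExtractByUrl;
    import us.codecraft.webmagic.model.annotation.HelpUrl;
    import us.codecraft.webmagic.model.annotation.TargetUrl;

    @TargetUrl("https://github.com/\\w+/\\w+") // pages to extract
    @HelpUrl("https://github.com/\\w+")        // pages that only provide links
    public class GithubRepo {

        // Filled from the page content by xpath.
        @ExtractBy(value = "//h1[@class='entry-title public']/strong/a/text()", notNull = true)
        private String name;

        // Filled from the url of the current page by regex.
        @ExtractByUrl("https://github\\.com/(\\w+)/.*")
        private String author;
    }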
Tools for annotation conversion.
Wrapper of field and extractor.
Stores urls and the cursor in files so that a Spider can resume its status after a shutdown.
Stores result objects (page models) to files in plain format.
Uses model.getKey() as the file name if the model implements HasKey; otherwise uses SHA1.
Base object of file persistence.
Stores results in files.
Defines how the result string is converted to an object for a field.
Interface to be implemented by page models.
Can be used to identify a page model, or as the name of the file storing the object.
Defines the 'help' url patterns for a class.
Selectable html.
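A small sketch of the chained Selectable calls on Html; the markup is invented:

    import us.codecraft.webmagic.selector.Html;

    public class HtmlExample {
        public static void main(String[] args) {
            Html html = new Html("<div class='post'><a href='/a/1'>First post</a></div>");
            // Each call narrows the selection and returns another Selectable;
            // toString() returns the first match.
            System.out.println(html.css("div.post").links().toString());
            System.out.println(html.xpath("//div[@class='post']/a/text()").toString());
        }
    }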
The http downloader based on HttpClient.
Some constants of the HTTP protocol.
Parses JSON.
Stores result objects (page models) to files in JSON format.
Uses model.getKey() as the file name if the model implements HasKey; otherwise uses SHA1.
Stores results to files in JSON format.
JsonPath selector.
Used to extract content from JSON.
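A minimal sketch, assuming JsonPathSelector's select(String) method; the JSON is invented:

    import us.codecraft.webmagic.selector.JsonPathSelector;

    public class JsonPathExample {
        public static void main(String[] args) {
            String json = "{\"store\": {\"name\": \"foo\", \"price\": 10}}";
            // Extract one value from a JSON string by JsonPath expression.
            String name = new JsonPathSelector("$.store.name").select(json);
            System.out.println(name); // foo
        }
    }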
Links selector based on jsoup.
A scheduler whose requests can be counted for monitoring.
Multi-key map for some basic objects.
Extracts an object spanning more than one page, such as a news article split across several pages.
A pipeline that combines the results of more than one page.
Used for news and articles that span more than one web page.
Selector(extractor) for html node.
The spider for page model extraction.
In webmagic, we call a POJO containing the extract results a "page model".
All extractors extract separately, and their results are combined into the final result.
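A bootstrap sketch adapted from webmagic's README, reusing the GithubRepo page model shown earlier; the pipeline class and package names follow webmagic-extension and should be treated as assumptions:

    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.model.OOSpider;
    import us.codecraft.webmagic.pipeline.ConsolePageModelPipeline;

    public class OOSpiderExample {
        public static void main(String[] args) {
            // Each crawled page matching @TargetUrl becomes a GithubRepo object.
            OOSpider.create(Site.me().setSleepTime(1000),
                            new ConsolePageModelPipeline(), GithubRepo.class)
                    .addUrl("https://github.com/code4craft")
                    .thread(5)
                    .run();
        }
    }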
Object storing the extracted results and the urls to fetch.
Not thread safe.
Main methods:
Page.getUrl() gets the url of the current page.
Page.getHtml() gets the content of the current page.
Page.putField(String, Object) saves an extracted result.
Page.getResultItems() gets the extract results to be used in a Pipeline.
Page.addTargetRequests(Iterable) and Page.addTargetRequest(String) add urls to fetch.
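These methods are normally called inside a PageProcessor (see that entry below); a sketch with invented xpath and link patterns:

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.processor.PageProcessor;

    public class MyProcessor implements PageProcessor {

        private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

        @Override
        public void process(Page page) {
            // Queue more urls to fetch (the pattern is illustrative).
            page.addTargetRequests(page.getHtml().links().regex(".*/post/\\d+").all());
            // Save an extracted field; pipelines read it from ResultItems.
            page.putField("title", page.getHtml().xpath("//title/text()").toString());
        }

        @Override
        public Site getSite() {
            return site;
        }
    }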
Implement PageModelPipeline to persist your page models.
Interface to be implemented to customize a crawler.
This downloader is used to download pages that need JavaScript rendering.
Pipeline is the persistence and offline processing part of the crawler.
The Pipeline interface can be implemented to customize how results are persisted.
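A minimal custom Pipeline, assuming the two-argument process(ResultItems, Task) signature:

    import java.util.Map;

    import us.codecraft.webmagic.ResultItems;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.pipeline.Pipeline;

    public class PrintFieldsPipeline implements Pipeline {

        @Override
        public void process(ResultItems resultItems, Task task) {
            // Print every field saved with Page.putField(...).
            for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
                System.out.println(entry.getKey() + ":\t" + entry.getValue());
            }
        }
    }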
Selectable plain text.
Cannot be selected by XPath or CSS selectors.
Priority scheduler.
Proxy provider.
Pooled proxy object.
Basic Scheduler implementation.
Stores urls to fetch in a LinkedBlockingQueue and removes duplicate urls with a HashMap.
The Redis scheduler with priority support.
Uses Redis as the url scheduler for distributed crawlers.
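A sketch of pointing a spider at a shared Redis instance; the host value is a placeholder and MyProcessor is the PageProcessor sketched earlier:

    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.scheduler.RedisScheduler;

    public class DistributedExample {
        public static void main(String[] args) {
            // Crawler processes sharing this Redis host also share the
            // url queue and the duplicate check.
            Spider.create(new MyProcessor())
                    .setScheduler(new RedisScheduler("127.0.0.1"))
                    .addUrl("https://example.com/")
                    .run();
        }
    }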
Selector based on regular expressions.
Replace selector.
Object containing a url to crawl.
It also carries some additional information.
Object containing the extract results.
It is contained in Page and will be processed by the pipeline.
Scheduler is the url management part of the crawler.
You can implement the Scheduler interface to manage the urls to fetch and to remove duplicate urls.
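A minimal in-memory sketch, assuming the push(Request, Task)/poll(Task) pair; duplicate removal is omitted for brevity:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    import us.codecraft.webmagic.Request;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.scheduler.Scheduler;

    public class FifoScheduler implements Scheduler {

        private final BlockingQueue<Request> queue = new LinkedBlockingQueue<Request>();

        @Override
        public void push(Request request, Task task) {
            queue.add(request); // no duplicate check here, for brevity
        }

        @Override
        public Request poll(Task task) {
            return queue.poll(); // null when the queue is empty
        }
    }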
Selectable text.
Selector(extractor) for text.
Convenient methods for selectors.
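A sketch using the static factories, assuming Selectors exposes xpath(...), $(...) and or(...); the markup is invented:

    import static us.codecraft.webmagic.selector.Selectors.$;
    import static us.codecraft.webmagic.selector.Selectors.or;
    import static us.codecraft.webmagic.selector.Selectors.xpath;

    public class SelectorsExample {
        public static void main(String[] args) {
            String html = "<div><a href='/x'>hello</a></div>";
            // or(...) returns the first selector's result that matches.
            String link = or(xpath("//a/@href"), $("a", "href")).select(html);
            System.out.println(link); // /x
        }
    }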
Uses Selenium to drive a browser for rendering; currently only Chrome is supported.
Requires downloading the matching Selenium driver.
A simple PageProcessor.
A simple ProxyProvider.
Object containing the settings for a crawler.
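Typical chained usage; the concrete values are arbitrary examples:

    import us.codecraft.webmagic.Site;

    public class SiteExample {

        static final Site SITE = Site.me()
                .setRetryTimes(3)    // retry a failed download up to 3 times
                .setSleepTime(1000)  // pause 1000 ms between requests
                .setTimeOut(10000)   // 10 s download timeout
                .setUserAgent("Mozilla/5.0 (compatible; MyCrawler/1.0)");
    }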
Borrowed from https://code.google.com/p/cx-extractor/
Entry point of a crawler.
A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline.
Every module is a field of Spider.
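Wiring the modules together, reusing MyProcessor from the Page sketch earlier; the scheduler and pipeline choices are examples (Spider supplies defaults when they are omitted):

    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.pipeline.ConsolePipeline;
    import us.codecraft.webmagic.scheduler.QueueScheduler;

    public class SpiderExample {
        public static void main(String[] args) {
            Spider.create(new MyProcessor())            // PageProcessor
                    .setScheduler(new QueueScheduler()) // Scheduler
                    .addPipeline(new ConsolePipeline()) // Pipeline
                    .addUrl("https://example.com/")     // seed url (an example)
                    .thread(5)                          // 5 download threads
                    .run();
        }
    }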
Listener for a Spider's page processing.
Defines the url patterns for a class.
Interface for identifying different tasks.
Url and html utilities.
Selector supporting XPath 2.0. Wraps HtmlCleaner and Saxon HE.
XPath selector based on Xsoup.
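A direct Xsoup sketch, assuming the compile(...).evaluate(...).list() chain shown in the Xsoup README:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import us.codecraft.xsoup.Xsoup;

    public class XsoupExample {
        public static void main(String[] args) {
            Document doc = Jsoup.parse("<ul><li><a href='/1'>one</a></li><li><a href='/2'>two</a></li></ul>");
            // Evaluate an xpath expression against a jsoup document.
            System.out.println(Xsoup.compile("//li/a/@href").evaluate(doc).list()); // [/1, /2]
        }
    }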