All Classes and Interfaces

Class
Description
Interface to be implemented by page models that need to do something after fields are extracted.
 
 
 
 
 
 
 
 
 
 
 
 
BloomFilterDuplicateRemover for a huge number of URLs.
 
 
Combo 'ExtractBy' extractor with and/or operator.
 
Types of source for extracting.
 
 
 
Print the page model to the console.
Usually used in tests.
 
 
 
Define the extractor for a field or class.
Types of source for extracting.
Types of extractor expressions.
Define an extractor to extract data from the URL of the current page.
The object contains 'ExtractBy' information.
Tools for annotation converting.
 
Wrapper of field and extractor.
Store URLs and the cursor in files so that a Spider can resume its state after a shutdown.
Store result objects (page models) to files in plain format.
Use model.getKey() as file name if the model implements HasKey.
Otherwise use SHA1 as file name.
Define how the result string is converted to an object for a field.
 
 
 
Interface to be implemented by page models.
Can be used to identify a page model, or as the name of the file storing the object.
Define the 'help' URL patterns for a class.
 
Store result objects (page models) to files in JSON format.
Use model.getKey() as file name if the model implements HasKey.
Otherwise use SHA1 as file name.
Store results to files in JSON format.
 
Multi-key map, some basic objects.
Extract an object that spans more than one page, such as news and articles.
A pipeline that combines the results from more than one page.
Used for news and articles that span more than one web page.
 
 
 
 
The spider for the page model extractor.
In WebMagic, a POJO containing the extracted result is called a "page model".
 
 
 
Implement PageModelPipeline to persist your page model.
 
This downloader is used to download pages that need JavaScript rendering.
The Redis scheduler with priority.
Use Redis as the URL scheduler for distributed crawlers.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Define the URL patterns for a class.
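
The file-naming rule given above for the file-storing pipelines (use model.getKey() when the model implements HasKey, otherwise use SHA1) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the library's own code: the class FileNameSketch, the helpers sha1Hex and fileNameFor, and the local HasKey interface are hypothetical stand-ins. The index does not say what input is hashed, so hashing toString() is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileNameSketch {

    /** Hypothetical local stand-in for the HasKey interface named in the index. */
    public interface HasKey {
        String getKey();
    }

    /** Returns the lowercase hex SHA-1 digest of the given text (UTF-8 bytes). */
    public static String sha1Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /**
     * Picks a file name for a page model: its key if it implements HasKey,
     * otherwise a SHA-1 hash. Hashing toString() is an assumption of this sketch.
     */
    public static String fileNameFor(Object model) {
        if (model instanceof HasKey) {
            return ((HasKey) model).getKey();
        }
        return sha1Hex(model.toString());
    }

    public static void main(String[] args) {
        HasKey keyed = () -> "article-42";
        System.out.println(fileNameFor(keyed));          // name taken from the key
        System.out.println(fileNameFor("plain model"));  // 40-character hex digest
    }
}
```

A model-supplied key gives stable, readable file names, while the hash fallback still yields a deterministic name for models that do not provide one.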