Index


$

$(String) - Method in class us.codecraft.webmagic.selector.HtmlNode
 
$(String) - Method in class us.codecraft.webmagic.selector.PlainText
 
$(String) - Method in interface us.codecraft.webmagic.selector.Selectable
select list with css selector
$(String) - Static method in class us.codecraft.webmagic.selector.Selectors
 
$(String, String) - Method in class us.codecraft.webmagic.selector.HtmlNode
 
$(String, String) - Method in class us.codecraft.webmagic.selector.PlainText
 
$(String, String) - Method in interface us.codecraft.webmagic.selector.Selectable
select list with css selector
$(String, String) - Static method in class us.codecraft.webmagic.selector.Selectors
 

A

AbstractDownloader - Class in us.codecraft.webmagic.downloader
Base class of downloader with some common methods.
AbstractDownloader() - Constructor for class us.codecraft.webmagic.downloader.AbstractDownloader
 
AbstractSelectable - Class in us.codecraft.webmagic.selector
 
AbstractSelectable() - Constructor for class us.codecraft.webmagic.selector.AbstractSelectable
 
addCookie(String, String) - Method in class us.codecraft.webmagic.Request
 
addCookie(String, String) - Method in class us.codecraft.webmagic.Site
Add a cookie with domain Site.getDomain()
addCookie(String, String, String) - Method in class us.codecraft.webmagic.Site
Add a cookie with specific domain.
addHeader(String, String) - Method in class us.codecraft.webmagic.Request
 
addHeader(String, String) - Method in class us.codecraft.webmagic.Site
Put an HTTP header for the downloader.
addPipeline(Pipeline) - Method in class us.codecraft.webmagic.Spider
add a pipeline for Spider
addRequest(Request...) - Method in class us.codecraft.webmagic.Spider
Add urls with information to crawl.
addTargetRequest(String) - Method in class us.codecraft.webmagic.Page
add url to fetch
addTargetRequest(Request) - Method in class us.codecraft.webmagic.Page
add requests to fetch
addTargetRequests(Iterable<String>) - Method in class us.codecraft.webmagic.Page
add urls to fetch
addTargetRequests(Iterable<String>, long) - Method in class us.codecraft.webmagic.Page
add urls to fetch
addUrl(String...) - Method in class us.codecraft.webmagic.Spider
Add urls to crawl.
all() - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
all() - Method in interface us.codecraft.webmagic.selector.Selectable
multi string result
and(Selector...) - Static method in class us.codecraft.webmagic.selector.Selectors
 
AndSelector - Class in us.codecraft.webmagic.selector
All selectors will be arranged as a pipeline.
AndSelector(List<Selector>) - Constructor for class us.codecraft.webmagic.selector.AndSelector
 
AndSelector(Selector...) - Constructor for class us.codecraft.webmagic.selector.AndSelector
 

B

BaiduBaikePageProcessor - Class in us.codecraft.webmagic.processor.example
 
BaiduBaikePageProcessor() - Constructor for class us.codecraft.webmagic.processor.example.BaiduBaikePageProcessor
 
BaseElementSelector - Class in us.codecraft.webmagic.selector
 
BaseElementSelector() - Constructor for class us.codecraft.webmagic.selector.BaseElementSelector
 
BaseSelectorUtils - Class in us.codecraft.webmagic.utils
 
BaseSelectorUtils() - Constructor for class us.codecraft.webmagic.utils.BaseSelectorUtils
 

C

canonicalizeUrl(String, String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
canonicalizeUrl
Borrowed from Jsoup.
CharsetUtils - Class in us.codecraft.webmagic.utils
 
checkAndMakeParentDirecotry(String) - Method in class us.codecraft.webmagic.utils.FilePersistentBase
 
checkIfRunning() - Method in class us.codecraft.webmagic.Spider
 
clearPipeline() - Method in class us.codecraft.webmagic.Spider
clear the pipelines set
close() - Method in class us.codecraft.webmagic.Spider
 
CODE_200 - Static variable in class us.codecraft.webmagic.utils.HttpConstant.StatusCode
 
CollectorPipeline<T> - Interface in us.codecraft.webmagic.pipeline
Pipeline that can collect and store results.
compareLong(long, long) - Static method in class us.codecraft.webmagic.utils.NumberUtils
 
CONNECT - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Method
 
ConsolePipeline - Class in us.codecraft.webmagic.pipeline
Write results to the console.
Usually used in tests.
ConsolePipeline() - Constructor for class us.codecraft.webmagic.pipeline.ConsolePipeline
 
ContentType() - Constructor for class us.codecraft.webmagic.model.HttpRequestBody.ContentType
 
convert(Request, Site, Proxy) - Method in class us.codecraft.webmagic.downloader.HttpUriRequestConverter
 
convertHeaders(Header[]) - Static method in class us.codecraft.webmagic.utils.HttpClientUtils
 
convertToRequests(Collection<String>) - Static method in class us.codecraft.webmagic.utils.UrlUtils
 
convertToUrls(Collection<Request>) - Static method in class us.codecraft.webmagic.utils.UrlUtils
 
CountableThreadPool - Class in us.codecraft.webmagic.thread
Thread pool for workers.
Uses ExecutorService as the internal implementation.
CountableThreadPool(int) - Constructor for class us.codecraft.webmagic.thread.CountableThreadPool
 
CountableThreadPool(int, ExecutorService) - Constructor for class us.codecraft.webmagic.thread.CountableThreadPool
 
create(String) - Static method in class us.codecraft.webmagic.selector.Html
 
create(String) - Static method in class us.codecraft.webmagic.selector.PlainText
 
create(URI) - Static method in class us.codecraft.webmagic.proxy.Proxy
 
create(PageProcessor) - Static method in class us.codecraft.webmagic.Spider
create a spider with pageProcessor.
css(String) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
css(String) - Method in interface us.codecraft.webmagic.selector.Selectable
select list with css selector
css(String, String) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
css(String, String) - Method in interface us.codecraft.webmagic.selector.Selectable
select list with css selector
CssSelector - Class in us.codecraft.webmagic.selector
CSS selector.
CssSelector(String) - Constructor for class us.codecraft.webmagic.selector.CssSelector
 
CssSelector(String, String) - Constructor for class us.codecraft.webmagic.selector.CssSelector
 
custom(byte[], String, String) - Static method in class us.codecraft.webmagic.model.HttpRequestBody
 
CustomRedirectStrategy - Class in us.codecraft.webmagic.downloader
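Redirect strategy implementation that supports 302 redirects for POST requests. HttpClient's default redirect handling is httpClientBuilder.setRedirectStrategy(new LaxRedirectStrategy()); in the post/redirect/post case, that code does not pass along the data of the original request, so this class follows the redirect strategy of the SeimiCrawler project. Original code: https://github.com/zhegexiaohuozi/SeimiCrawler/blob/master/project/src/main/java/cn/wanghaomiao/seimi/http/hc/SeimiRedirectStrategy.java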
CustomRedirectStrategy() - Constructor for class us.codecraft.webmagic.downloader.CustomRedirectStrategy
 
CYCLE_TRIED_TIMES - Static variable in class us.codecraft.webmagic.Request
 

D

DELETE - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Method
 
destroyWhenExit - Variable in class us.codecraft.webmagic.Spider
 
detectCharset(String, byte[]) - Static method in class us.codecraft.webmagic.utils.CharsetUtils
 
DISABLE_HTML_ENTITY_ESCAPE - Static variable in class us.codecraft.webmagic.selector.Html
Deprecated. 
download(String) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
A simple method to download a url.
download(String, String) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
A simple method to download a url.
download(Request, Task) - Method in interface us.codecraft.webmagic.downloader.Downloader
Downloads web pages and stores them in a Page object.
download(Request, Task) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
 
downloader - Variable in class us.codecraft.webmagic.Spider
 
downloader(Downloader) - Method in class us.codecraft.webmagic.Spider
Deprecated. 
Downloader - Interface in us.codecraft.webmagic.downloader
Downloader is the part that downloads web pages and stores them in a Page object.
DuplicateRemovedScheduler - Class in us.codecraft.webmagic.scheduler
Removes duplicate urls and pushes only urls that are not duplicates.
DuplicateRemovedScheduler() - Constructor for class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
 
DuplicateRemover - Interface in us.codecraft.webmagic.scheduler.component
Remove duplicate requests.

E

ElementSelector - Interface in us.codecraft.webmagic.selector
Selector (extractor) for HTML elements.
encodeIllegalCharacterInUrl(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
Deprecated. 
equals(Object) - Method in class us.codecraft.webmagic.proxy.Proxy
 
equals(Object) - Method in class us.codecraft.webmagic.Request
 
equals(Object) - Method in class us.codecraft.webmagic.Site
 
execute(Runnable) - Method in class us.codecraft.webmagic.thread.CountableThreadPool
 
executorService - Variable in class us.codecraft.webmagic.Spider
 
exitWhenComplete - Variable in class us.codecraft.webmagic.Spider
 
Experimental - Annotation Type in us.codecraft.webmagic.utils
Marks features that are unstable.
extractAndAddRequests(Page, boolean) - Method in class us.codecraft.webmagic.Spider
 

F

fail() - Static method in class us.codecraft.webmagic.Page
Deprecated.
Use Page.fail(Request) instead.
fail(Request) - Static method in class us.codecraft.webmagic.Page
Returns a Page with Page.downloadSuccess set to false and Page.request set to the given request.
FilePersistentBase - Class in us.codecraft.webmagic.utils
Base object of file persistence.
FilePersistentBase() - Constructor for class us.codecraft.webmagic.utils.FilePersistentBase
 
FilePipeline - Class in us.codecraft.webmagic.pipeline
Store results in files.
FilePipeline() - Constructor for class us.codecraft.webmagic.pipeline.FilePipeline
create a FilePipeline with the default path "/data/webmagic/"
FilePipeline(String) - Constructor for class us.codecraft.webmagic.pipeline.FilePipeline
 
fixIllegalCharacterInUrl(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
 
form(Map<String, Object>, String) - Static method in class us.codecraft.webmagic.model.HttpRequestBody
 
FORM - Static variable in class us.codecraft.webmagic.model.HttpRequestBody.ContentType
 
from(Proxy...) - Static method in class us.codecraft.webmagic.proxy.SimpleProxyProvider
 
fromValue(int) - Static method in enum us.codecraft.webmagic.Spider.Status
 

G

get() - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
get() - Method in interface us.codecraft.webmagic.selector.Selectable
single string result
get(String) - Method in class us.codecraft.webmagic.ResultItems
 
get(String) - Method in class us.codecraft.webmagic.Spider
 
GET - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Method
 
getAcceptStatCode() - Method in class us.codecraft.webmagic.Site
get acceptStatCode
getAll() - Method in class us.codecraft.webmagic.ResultItems
 
getAll(Collection<String>) - Method in class us.codecraft.webmagic.Spider
Download urls synchronously.
getAllCookies() - Method in class us.codecraft.webmagic.Site
get cookies of all domains
getBody() - Method in class us.codecraft.webmagic.model.HttpRequestBody
 
getBytes() - Method in class us.codecraft.webmagic.Page
 
getCharset() - Method in class us.codecraft.webmagic.Page
 
getCharset() - Method in class us.codecraft.webmagic.Request
 
getCharset() - Method in class us.codecraft.webmagic.Site
get charset set manually
getCharset(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
 
getClient(Site) - Method in class us.codecraft.webmagic.downloader.HttpClientGenerator
 
getCollected() - Method in interface us.codecraft.webmagic.pipeline.CollectorPipeline
Get all results collected.
getCollected() - Method in class us.codecraft.webmagic.pipeline.ResultItemsCollectorPipeline
 
getCollectorPipeline() - Method in class us.codecraft.webmagic.Spider
 
getContentType() - Method in class us.codecraft.webmagic.model.HttpRequestBody
 
getCookies() - Method in class us.codecraft.webmagic.Request
 
getCookies() - Method in class us.codecraft.webmagic.Site
get cookies
getCycleRetryTimes() - Method in class us.codecraft.webmagic.Site
When cycleRetryTimes is greater than 0, the request is added back to the scheduler and the download is tried again.
getDefaultCharset() - Method in class us.codecraft.webmagic.Site
The default charset used if charset detection fails.
getDocument() - Method in class us.codecraft.webmagic.selector.Html
 
getDomain() - Method in class us.codecraft.webmagic.Site
get domain
getDomain(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
 
getDownloader() - Method in class us.codecraft.webmagic.Request
 
getDuplicateRemover() - Method in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
 
getElements() - Method in class us.codecraft.webmagic.selector.Html
 
getElements() - Method in class us.codecraft.webmagic.selector.HtmlNode
 
getEncoding() - Method in class us.codecraft.webmagic.model.HttpRequestBody
 
getExtra(String) - Method in class us.codecraft.webmagic.Request
 
getExtras() - Method in class us.codecraft.webmagic.Request
 
getFile(String) - Method in class us.codecraft.webmagic.utils.FilePersistentBase
 
getFirstSourceText() - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
getHeaders() - Method in class us.codecraft.webmagic.Page
 
getHeaders() - Method in class us.codecraft.webmagic.Request
 
getHeaders() - Method in class us.codecraft.webmagic.Site
 
getHost() - Method in class us.codecraft.webmagic.proxy.Proxy
 
getHost(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
 
getHtml() - Method in class us.codecraft.webmagic.Page
get html content of page
getHttpClientContext() - Method in class us.codecraft.webmagic.downloader.HttpClientRequestContext
 
getHttpUriRequest() - Method in class us.codecraft.webmagic.downloader.HttpClientRequestContext
 
getJson() - Method in class us.codecraft.webmagic.Page
get json content of page
getJsonPathStr() - Method in class us.codecraft.webmagic.selector.JsonPathSelector
 
getLeftRequestsCount(Task) - Method in interface us.codecraft.webmagic.scheduler.MonitorableScheduler
 
getLeftRequestsCount(Task) - Method in class us.codecraft.webmagic.scheduler.PriorityScheduler
 
getLeftRequestsCount(Task) - Method in class us.codecraft.webmagic.scheduler.QueueScheduler
 
getMethod() - Method in class us.codecraft.webmagic.Request
The http method of the request.
getPageCount() - Method in class us.codecraft.webmagic.Spider
Get the number of pages downloaded by the spider.
getPassword() - Method in class us.codecraft.webmagic.proxy.Proxy
 
getPath() - Method in class us.codecraft.webmagic.utils.FilePersistentBase
 
getPort() - Method in class us.codecraft.webmagic.proxy.Proxy
 
getPriority() - Method in class us.codecraft.webmagic.Request
 
getProxy(Request, Task) - Method in interface us.codecraft.webmagic.proxy.ProxyProvider
Returns a proxy for the request.
getProxy(Request, Task) - Method in class us.codecraft.webmagic.proxy.SimpleProxyProvider
 
getProxy(Task) - Method in interface us.codecraft.webmagic.proxy.ProxyProvider
Deprecated.
getRawText() - Method in class us.codecraft.webmagic.Page
 
getRedirect(HttpRequest, HttpResponse, HttpContext) - Method in class us.codecraft.webmagic.downloader.CustomRedirectStrategy
 
getRequest() - Method in class us.codecraft.webmagic.Page
get request of current page
getRequest() - Method in class us.codecraft.webmagic.ResultItems
 
getRequestBody() - Method in class us.codecraft.webmagic.Request
 
getResultItems() - Method in class us.codecraft.webmagic.Page
 
getRetrySleepTime() - Method in class us.codecraft.webmagic.Site
 
getRetryTimes() - Method in class us.codecraft.webmagic.Site
Get the number of immediate retries when a download fails, 0 by default.
getScheduler() - Method in class us.codecraft.webmagic.Spider
 
getScheduler() - Method in class us.codecraft.webmagic.SpiderScheduler
 
getScheme() - Method in class us.codecraft.webmagic.proxy.Proxy
 
getSite() - Method in class us.codecraft.webmagic.processor.example.BaiduBaikePageProcessor
 
getSite() - Method in class us.codecraft.webmagic.processor.example.GithubRepoPageProcessor
 
getSite() - Method in class us.codecraft.webmagic.processor.example.ZhihuPageProcessor
 
getSite() - Method in interface us.codecraft.webmagic.processor.PageProcessor
Returns the site settings.
getSite() - Method in class us.codecraft.webmagic.processor.SimplePageProcessor
 
getSite() - Method in class us.codecraft.webmagic.Spider
 
getSite() - Method in interface us.codecraft.webmagic.Task
site of a task
getSleepTime() - Method in class us.codecraft.webmagic.Site
Get the interval between the processing of two pages.
Time unit is milliseconds.
getSourceTexts() - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
getSourceTexts() - Method in class us.codecraft.webmagic.selector.HtmlNode
 
getSourceTexts() - Method in class us.codecraft.webmagic.selector.PlainText
 
getSpiderListeners() - Method in class us.codecraft.webmagic.Spider
 
getStartTime() - Method in class us.codecraft.webmagic.Spider
 
getStatus() - Method in class us.codecraft.webmagic.Spider
Get the running status of the spider.
getStatusCode() - Method in class us.codecraft.webmagic.Page
 
getTargetRequests() - Method in class us.codecraft.webmagic.Page
 
getText(Element) - Method in class us.codecraft.webmagic.selector.CssSelector
 
getThreadAlive() - Method in class us.codecraft.webmagic.Spider
Get the number of threads that are running.
getThreadAlive() - Method in class us.codecraft.webmagic.thread.CountableThreadPool
 
getThreadNum() - Method in class us.codecraft.webmagic.thread.CountableThreadPool
 
getTimeOut() - Method in class us.codecraft.webmagic.Site
 
getTotalRequestsCount(Task) - Method in interface us.codecraft.webmagic.scheduler.component.DuplicateRemover
Get TotalRequestsCount for monitoring.
getTotalRequestsCount(Task) - Method in class us.codecraft.webmagic.scheduler.component.HashSetDuplicateRemover
 
getTotalRequestsCount(Task) - Method in interface us.codecraft.webmagic.scheduler.MonitorableScheduler
 
getTotalRequestsCount(Task) - Method in class us.codecraft.webmagic.scheduler.PriorityScheduler
 
getTotalRequestsCount(Task) - Method in class us.codecraft.webmagic.scheduler.QueueScheduler
 
getUrl() - Method in class us.codecraft.webmagic.Page
get url of current page
getUrl() - Method in class us.codecraft.webmagic.Request
 
getUrl(Request) - Method in class us.codecraft.webmagic.scheduler.component.HashSetDuplicateRemover
 
getUserAgent() - Method in class us.codecraft.webmagic.Site
get user agent
getUsername() - Method in class us.codecraft.webmagic.proxy.Proxy
 
getUUID() - Method in class us.codecraft.webmagic.Spider
 
getUUID() - Method in interface us.codecraft.webmagic.Task
unique id for a task.
GithubRepoPageProcessor - Class in us.codecraft.webmagic.processor.example
 
GithubRepoPageProcessor() - Constructor for class us.codecraft.webmagic.processor.example.GithubRepoPageProcessor
 

H

handleResponse(Request, String, HttpResponse, Task) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
 
hasAttribute() - Method in class us.codecraft.webmagic.selector.BaseElementSelector
 
hasAttribute() - Method in class us.codecraft.webmagic.selector.CssSelector
 
hasAttribute() - Method in class us.codecraft.webmagic.selector.LinksSelector
 
hasAttribute() - Method in class us.codecraft.webmagic.selector.XpathSelector
 
hashCode() - Method in class us.codecraft.webmagic.proxy.Proxy
 
hashCode() - Method in class us.codecraft.webmagic.Request
 
hashCode() - Method in class us.codecraft.webmagic.Site
 
HashSetDuplicateRemover - Class in us.codecraft.webmagic.scheduler.component
 
HashSetDuplicateRemover() - Constructor for class us.codecraft.webmagic.scheduler.component.HashSetDuplicateRemover
 
HEAD - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Method
 
Header() - Constructor for class us.codecraft.webmagic.utils.HttpConstant.Header
 
Html - Class in us.codecraft.webmagic.selector
Selectable html.
Html(String) - Constructor for class us.codecraft.webmagic.selector.Html
 
Html(String, String) - Constructor for class us.codecraft.webmagic.selector.Html
 
Html(Document) - Constructor for class us.codecraft.webmagic.selector.Html
 
HtmlNode - Class in us.codecraft.webmagic.selector
 
HtmlNode() - Constructor for class us.codecraft.webmagic.selector.HtmlNode
 
HtmlNode(List<Element>) - Constructor for class us.codecraft.webmagic.selector.HtmlNode
 
HttpClientDownloader - Class in us.codecraft.webmagic.downloader
The http downloader based on HttpClient.
HttpClientDownloader() - Constructor for class us.codecraft.webmagic.downloader.HttpClientDownloader
 
HttpClientGenerator - Class in us.codecraft.webmagic.downloader
 
HttpClientGenerator() - Constructor for class us.codecraft.webmagic.downloader.HttpClientGenerator
 
HttpClientRequestContext - Class in us.codecraft.webmagic.downloader
 
HttpClientRequestContext() - Constructor for class us.codecraft.webmagic.downloader.HttpClientRequestContext
 
HttpClientUtils - Class in us.codecraft.webmagic.utils
 
HttpClientUtils() - Constructor for class us.codecraft.webmagic.utils.HttpClientUtils
 
HttpConstant - Class in us.codecraft.webmagic.utils
Some constants of the HTTP protocol.
HttpConstant() - Constructor for class us.codecraft.webmagic.utils.HttpConstant
 
HttpConstant.Header - Class in us.codecraft.webmagic.utils
 
HttpConstant.Method - Class in us.codecraft.webmagic.utils
 
HttpConstant.StatusCode - Class in us.codecraft.webmagic.utils
 
HttpRequestBody - Class in us.codecraft.webmagic.model
 
HttpRequestBody() - Constructor for class us.codecraft.webmagic.model.HttpRequestBody
 
HttpRequestBody(byte[], String, String) - Constructor for class us.codecraft.webmagic.model.HttpRequestBody
 
HttpRequestBody.ContentType - Class in us.codecraft.webmagic.model
 
HttpUriRequestConverter - Class in us.codecraft.webmagic.downloader
 
HttpUriRequestConverter() - Constructor for class us.codecraft.webmagic.downloader.HttpUriRequestConverter
 

I

Init - Enum constant in enum us.codecraft.webmagic.Spider.Status
 
initComponent() - Method in class us.codecraft.webmagic.Spider
 
INITIAL_CAPACITY - Static variable in class us.codecraft.webmagic.scheduler.PriorityScheduler
 
isBinaryContent() - Method in class us.codecraft.webmagic.Request
 
isDisableCookieManagement() - Method in class us.codecraft.webmagic.Site
 
isDownloadSuccess() - Method in class us.codecraft.webmagic.Page
 
isDuplicate(Request, Task) - Method in interface us.codecraft.webmagic.scheduler.component.DuplicateRemover
Check whether the request is duplicate.
isDuplicate(Request, Task) - Method in class us.codecraft.webmagic.scheduler.component.HashSetDuplicateRemover
 
isExitWhenComplete() - Method in class us.codecraft.webmagic.Spider
 
isShutdown() - Method in class us.codecraft.webmagic.thread.CountableThreadPool
 
isSkip() - Method in class us.codecraft.webmagic.ResultItems
Whether to skip the result.
Result which is skipped will not be processed by Pipeline.
isSpawnUrl() - Method in class us.codecraft.webmagic.Spider
 
isUseGzip() - Method in class us.codecraft.webmagic.Site
 

J

json(String, String) - Static method in class us.codecraft.webmagic.model.HttpRequestBody
 
Json - Class in us.codecraft.webmagic.selector
parse json
Json(String) - Constructor for class us.codecraft.webmagic.selector.Json
 
Json(List<String>) - Constructor for class us.codecraft.webmagic.selector.Json
 
JSON - Static variable in class us.codecraft.webmagic.model.HttpRequestBody.ContentType
 
jsonPath(String) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
jsonPath(String) - Method in class us.codecraft.webmagic.selector.Json
 
jsonPath(String) - Method in interface us.codecraft.webmagic.selector.Selectable
extract by JSON Path expression
JsonPathSelector - Class in us.codecraft.webmagic.selector
JsonPath selector.
Used to extract content from JSON.
JsonPathSelector(String) - Constructor for class us.codecraft.webmagic.selector.JsonPathSelector
 

L

links() - Method in class us.codecraft.webmagic.selector.HtmlNode
 
links() - Method in class us.codecraft.webmagic.selector.PlainText
 
links() - Method in interface us.codecraft.webmagic.selector.Selectable
select all links
LinksSelector - Class in us.codecraft.webmagic.selector
Links selector based on jsoup.
LinksSelector() - Constructor for class us.codecraft.webmagic.selector.LinksSelector
 
logger - Variable in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
 
logger - Variable in class us.codecraft.webmagic.Spider
 

M

main(String[]) - Static method in class us.codecraft.webmagic.processor.example.BaiduBaikePageProcessor
 
main(String[]) - Static method in class us.codecraft.webmagic.processor.example.GithubRepoPageProcessor
 
main(String[]) - Static method in class us.codecraft.webmagic.processor.example.ZhihuPageProcessor
 
match() - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
match() - Method in interface us.codecraft.webmagic.selector.Selectable
whether a result exists for the selection
me() - Static method in class us.codecraft.webmagic.Site
create a new Site
Method() - Constructor for class us.codecraft.webmagic.utils.HttpConstant.Method
 
MonitorableScheduler - Interface in us.codecraft.webmagic.scheduler
A scheduler whose requests can be counted for monitoring.
MULTIPART - Static variable in class us.codecraft.webmagic.model.HttpRequestBody.ContentType
 

N

newArrayList(T...) - Static method in class us.codecraft.webmagic.utils.WMCollections
 
newHashSet(T...) - Static method in class us.codecraft.webmagic.utils.WMCollections
 
nodes() - Method in class us.codecraft.webmagic.selector.HtmlNode
 
nodes() - Method in class us.codecraft.webmagic.selector.PlainText
 
nodes() - Method in interface us.codecraft.webmagic.selector.Selectable
get all nodes
noNeedToRemoveDuplicate(Request) - Method in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
 
NumberUtils - Class in us.codecraft.webmagic.utils
 
NumberUtils() - Constructor for class us.codecraft.webmagic.utils.NumberUtils
 

O

onError(Page, Task, Throwable) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
 
onError(Request) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
onError(Request) - Method in class us.codecraft.webmagic.Spider
Deprecated.
onError(Request) - Method in interface us.codecraft.webmagic.SpiderListener
onError(Request, Exception) - Method in class us.codecraft.webmagic.Spider
 
onError(Request, Exception) - Method in interface us.codecraft.webmagic.SpiderListener
 
onError(Request, Task, Throwable) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
onSuccess(Page, Task) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
 
onSuccess(Request) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
onSuccess(Request) - Method in class us.codecraft.webmagic.Spider
 
onSuccess(Request) - Method in interface us.codecraft.webmagic.SpiderListener
 
onSuccess(Request, Task) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
or(Selector...) - Static method in class us.codecraft.webmagic.selector.Selectors
 
OrSelector - Class in us.codecraft.webmagic.selector
All extractors run separately,
and their results are combined into the final result.
OrSelector(List<Selector>) - Constructor for class us.codecraft.webmagic.selector.OrSelector
 
OrSelector(Selector...) - Constructor for class us.codecraft.webmagic.selector.OrSelector
 
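The OrSelector entry above says each extractor runs separately and the results are combined, while the AndSelector entry says selectors are arranged as a pipeline. The difference can be sketched with plain java.util.function.Function values. This is only an illustration under assumptions: SelectorCombinatorSketch, and(...) and or(...) are hypothetical names, not WebMagic classes or methods, and the null handling is a guess rather than the library's documented behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not WebMagic's AndSelector or OrSelector.
class SelectorCombinatorSketch {

    /** "And" style: each extractor receives the previous extractor's output. */
    @SafeVarargs
    static String and(String text, Function<String, String>... extractors) {
        String current = text;
        for (Function<String, String> extractor : extractors) {
            if (current == null) {
                return null; // assumption: a missing result stops the pipeline
            }
            current = extractor.apply(current);
        }
        return current;
    }

    /** "Or" style: each extractor receives the original text; results are collected. */
    @SafeVarargs
    static List<String> or(String text, Function<String, String>... extractors) {
        List<String> results = new ArrayList<>();
        for (Function<String, String> extractor : extractors) {
            String result = extractor.apply(text);
            if (result != null) { // assumption: missing results are skipped
                results.add(result);
            }
        }
        return results;
    }
}
```

With these helpers, and(text, f, g) computes g(f(text)), whereas or(text, f, g) returns a list holding f(text) and g(text).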

P

Page - Class in us.codecraft.webmagic
Object storing the extracted results and the urls to fetch.
Not thread safe.
Main methods:
Page.getUrl() get url of current page
Page.getHtml() get content of current page
Page.putField(String, Object) save extracted result
Page.getResultItems() get extracted results to be used in Pipeline
Page.addTargetRequests(Iterable) Page.addTargetRequest(String) add urls to fetch
Page() - Constructor for class us.codecraft.webmagic.Page
 
pageProcessor - Variable in class us.codecraft.webmagic.Spider
 
PageProcessor - Interface in us.codecraft.webmagic.processor
Interface to be implemented to customize a crawler.
path - Variable in class us.codecraft.webmagic.utils.FilePersistentBase
 
PATH_SEPERATOR - Static variable in class us.codecraft.webmagic.utils.FilePersistentBase
 
pipeline(Pipeline) - Method in class us.codecraft.webmagic.Spider
Deprecated. 
Pipeline - Interface in us.codecraft.webmagic.pipeline
Pipeline is the persistence and offline-processing part of the crawler.
The Pipeline interface can be implemented to customize how results are persisted.
pipelines - Variable in class us.codecraft.webmagic.Spider
 
PlainText - Class in us.codecraft.webmagic.selector
Selectable plain text.
Cannot be selected by XPath or CSS selector.
PlainText(String) - Constructor for class us.codecraft.webmagic.selector.PlainText
 
PlainText(List<String>) - Constructor for class us.codecraft.webmagic.selector.PlainText
 
poll(Spider) - Method in class us.codecraft.webmagic.SpiderScheduler
 
poll(Task) - Method in class us.codecraft.webmagic.scheduler.PriorityScheduler
 
poll(Task) - Method in class us.codecraft.webmagic.scheduler.QueueScheduler
 
poll(Task) - Method in interface us.codecraft.webmagic.scheduler.Scheduler
get a url to crawl
POST - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Method
 
preParse(String) - Static method in class us.codecraft.webmagic.utils.BaseSelectorUtils
Jsoup/HtmlCleaner cannot parse "tr" or "td" tags directly; see https://stackoverflow.com/questions/63607740/jsoup-couldnt-parse-tr-tag
PriorityScheduler - Class in us.codecraft.webmagic.scheduler
Priority scheduler.
PriorityScheduler() - Constructor for class us.codecraft.webmagic.scheduler.PriorityScheduler
 
process(Page) - Method in class us.codecraft.webmagic.processor.example.BaiduBaikePageProcessor
 
process(Page) - Method in class us.codecraft.webmagic.processor.example.GithubRepoPageProcessor
 
process(Page) - Method in class us.codecraft.webmagic.processor.example.ZhihuPageProcessor
 
process(Page) - Method in interface us.codecraft.webmagic.processor.PageProcessor
Processes the page: extracts the URLs to fetch, then extracts and stores the data.
process(Page) - Method in class us.codecraft.webmagic.processor.SimplePageProcessor
 
process(ResultItems, Task) - Method in class us.codecraft.webmagic.pipeline.ConsolePipeline
 
process(ResultItems, Task) - Method in class us.codecraft.webmagic.pipeline.FilePipeline
 
process(ResultItems, Task) - Method in interface us.codecraft.webmagic.pipeline.Pipeline
Process extracted results.
process(ResultItems, Task) - Method in class us.codecraft.webmagic.pipeline.ResultItemsCollectorPipeline
 
Proxy - Class in us.codecraft.webmagic.proxy
 
Proxy(String, int) - Constructor for class us.codecraft.webmagic.proxy.Proxy
 
Proxy(String, int, String) - Constructor for class us.codecraft.webmagic.proxy.Proxy
 
Proxy(String, int, String, String) - Constructor for class us.codecraft.webmagic.proxy.Proxy
 
ProxyProvider - Interface in us.codecraft.webmagic.proxy
Proxy provider.
ProxyUtils - Class in us.codecraft.webmagic.utils
Pooled Proxy Object
ProxyUtils() - Constructor for class us.codecraft.webmagic.utils.ProxyUtils
 
push(Request, Spider) - Method in class us.codecraft.webmagic.SpiderScheduler
 
push(Request, Task) - Method in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
 
push(Request, Task) - Method in interface us.codecraft.webmagic.scheduler.Scheduler
add a url to fetch
pushWhenNoDuplicate(Request, Task) - Method in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
 
pushWhenNoDuplicate(Request, Task) - Method in class us.codecraft.webmagic.scheduler.PriorityScheduler
 
pushWhenNoDuplicate(Request, Task) - Method in class us.codecraft.webmagic.scheduler.QueueScheduler
 
put(String, T) - Method in class us.codecraft.webmagic.ResultItems
 
PUT - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Method
 
putExtra(String, T) - Method in class us.codecraft.webmagic.Request
 
putField(String, Object) - Method in class us.codecraft.webmagic.Page
Store extracted results.

Q

QueueScheduler - Class in us.codecraft.webmagic.scheduler
Basic Scheduler implementation.
Stores URLs to fetch in a LinkedBlockingQueue and removes duplicate URLs using a HashMap.
QueueScheduler() - Constructor for class us.codecraft.webmagic.scheduler.QueueScheduler
 
QueueScheduler(int) - Constructor for class us.codecraft.webmagic.scheduler.QueueScheduler
Creates a QueueScheduler with the given (fixed) capacity.

R

REFERER - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Header
 
regex(String) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
regex(String) - Method in interface us.codecraft.webmagic.selector.Selectable
Select a list with a regex; the default group is group 1.
regex(String) - Static method in class us.codecraft.webmagic.selector.Selectors
 
regex(String, int) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
regex(String, int) - Method in interface us.codecraft.webmagic.selector.Selectable
select list with regex
regex(String, int) - Static method in class us.codecraft.webmagic.selector.Selectors
 
RegexSelector - Class in us.codecraft.webmagic.selector
Regex-based selector.
RegexSelector(String) - Constructor for class us.codecraft.webmagic.selector.RegexSelector
Create a RegexSelector.
RegexSelector(String, int) - Constructor for class us.codecraft.webmagic.selector.RegexSelector
 
removePadding(String) - Method in class us.codecraft.webmagic.selector.Json
remove padding for JSONP
removePort(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
 
removeProtocol(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
 
replace(String, String) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
replace(String, String) - Method in interface us.codecraft.webmagic.selector.Selectable
replace with regex
ReplaceSelector - Class in us.codecraft.webmagic.selector
Replace selector.
ReplaceSelector(String, String) - Constructor for class us.codecraft.webmagic.selector.ReplaceSelector
 
Request - Class in us.codecraft.webmagic
Object containing the URL to crawl.
It also carries some additional information.
Request() - Constructor for class us.codecraft.webmagic.Request
 
Request(String) - Constructor for class us.codecraft.webmagic.Request
 
resetDuplicateCheck(Task) - Method in interface us.codecraft.webmagic.scheduler.component.DuplicateRemover
Reset duplicate check.
resetDuplicateCheck(Task) - Method in class us.codecraft.webmagic.scheduler.component.HashSetDuplicateRemover
 
ResultItems - Class in us.codecraft.webmagic
Object containing the extracted results.
It is held by Page and will be processed by the pipeline.
ResultItems() - Constructor for class us.codecraft.webmagic.ResultItems
 
ResultItemsCollectorPipeline - Class in us.codecraft.webmagic.pipeline
 
ResultItemsCollectorPipeline() - Constructor for class us.codecraft.webmagic.pipeline.ResultItemsCollectorPipeline
 
returnProxy(Proxy, Page, Task) - Method in interface us.codecraft.webmagic.proxy.ProxyProvider
Return the proxy to the provider when a download completes.
returnProxy(Proxy, Page, Task) - Method in class us.codecraft.webmagic.proxy.SimpleProxyProvider
 
run() - Method in class us.codecraft.webmagic.Spider
 
runAsync() - Method in class us.codecraft.webmagic.Spider
 
Running - Enum constant in enum us.codecraft.webmagic.Spider.Status
 

S

scheduler - Variable in class us.codecraft.webmagic.Spider
 
scheduler(Scheduler) - Method in class us.codecraft.webmagic.Spider
Deprecated.
Scheduler - Interface in us.codecraft.webmagic.scheduler
Scheduler is the URL-management part of a crawler.
You can implement the Scheduler interface to manage URLs to fetch and to remove duplicate URLs.
select(String) - Method in class us.codecraft.webmagic.selector.AndSelector
 
select(String) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
 
select(String) - Method in class us.codecraft.webmagic.selector.JsonPathSelector
 
select(String) - Method in class us.codecraft.webmagic.selector.OrSelector
 
select(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
 
select(String) - Method in class us.codecraft.webmagic.selector.ReplaceSelector
 
select(String) - Method in interface us.codecraft.webmagic.selector.Selector
Extract a single result from text.
If there is more than one result, only the first is chosen.
select(String) - Method in class us.codecraft.webmagic.selector.SmartContentSelector
 
select(Element) - Method in class us.codecraft.webmagic.selector.CssSelector
 
select(Element) - Method in interface us.codecraft.webmagic.selector.ElementSelector
Extract a single result from text.
If there is more than one result, only the first is chosen.
select(Element) - Method in class us.codecraft.webmagic.selector.LinksSelector
 
select(Element) - Method in class us.codecraft.webmagic.selector.XpathSelector
 
select(Selector) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
select(Selector) - Method in class us.codecraft.webmagic.selector.HtmlNode
 
select(Selector) - Method in interface us.codecraft.webmagic.selector.Selectable
extract by custom selector
select(Selector, List<String>) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
Selectable - Interface in us.codecraft.webmagic.selector
Selectable text.
selectDocument(Selector) - Method in class us.codecraft.webmagic.selector.Html
 
selectDocumentForList(Selector) - Method in class us.codecraft.webmagic.selector.Html
 
selectElement(String) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
 
selectElement(Element) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
 
selectElement(Element) - Method in class us.codecraft.webmagic.selector.CssSelector
 
selectElement(Element) - Method in class us.codecraft.webmagic.selector.LinksSelector
 
selectElement(Element) - Method in class us.codecraft.webmagic.selector.XpathSelector
 
selectElements(String) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
 
selectElements(Element) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
 
selectElements(Element) - Method in class us.codecraft.webmagic.selector.CssSelector
 
selectElements(Element) - Method in class us.codecraft.webmagic.selector.LinksSelector
 
selectElements(Element) - Method in class us.codecraft.webmagic.selector.XpathSelector
 
selectElements(BaseElementSelector) - Method in class us.codecraft.webmagic.selector.HtmlNode
select elements
selectGroup(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
 
selectGroupList(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
 
selectList(String) - Method in class us.codecraft.webmagic.selector.AndSelector
 
selectList(String) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
 
selectList(String) - Method in class us.codecraft.webmagic.selector.JsonPathSelector
 
selectList(String) - Method in class us.codecraft.webmagic.selector.OrSelector
 
selectList(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
 
selectList(String) - Method in class us.codecraft.webmagic.selector.ReplaceSelector
 
selectList(String) - Method in interface us.codecraft.webmagic.selector.Selector
Extract all results in text.
selectList(String) - Method in class us.codecraft.webmagic.selector.SmartContentSelector
 
selectList(Element) - Method in class us.codecraft.webmagic.selector.CssSelector
 
selectList(Element) - Method in interface us.codecraft.webmagic.selector.ElementSelector
Extract all results in text.
selectList(Element) - Method in class us.codecraft.webmagic.selector.LinksSelector
 
selectList(Element) - Method in class us.codecraft.webmagic.selector.XpathSelector
 
selectList(Selector) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
selectList(Selector) - Method in class us.codecraft.webmagic.selector.HtmlNode
 
selectList(Selector) - Method in interface us.codecraft.webmagic.selector.Selectable
extract by custom selector
selectList(Selector, List<String>) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
Selector - Interface in us.codecraft.webmagic.selector
Selector (extractor) for text.
Selectors - Class in us.codecraft.webmagic.selector
Convenient methods for selectors.
Selectors() - Constructor for class us.codecraft.webmagic.selector.Selectors
 
setAcceptStatCode(Set<Integer>) - Method in class us.codecraft.webmagic.Site
Set acceptStatCode.
A response is processed when its HTTP status code is in acceptStatCodes.
{200} by default.
Setting it is not required.
setBinaryContent(boolean) - Method in class us.codecraft.webmagic.Request
 
setBody(byte[]) - Method in class us.codecraft.webmagic.model.HttpRequestBody
 
setBytes(byte[]) - Method in class us.codecraft.webmagic.Page
 
setCharset(String) - Method in class us.codecraft.webmagic.Page
 
setCharset(String) - Method in class us.codecraft.webmagic.Request
 
setCharset(String) - Method in class us.codecraft.webmagic.Site
Set charset of page manually.
When the charset is not set or is set to null, it can be auto-detected from the HTTP header.
setContentType(String) - Method in class us.codecraft.webmagic.model.HttpRequestBody
 
setCycleRetryTimes(int) - Method in class us.codecraft.webmagic.Site
Set the number of cycle retries when a download fails; 0 by default.
setDefaultCharset(String) - Method in class us.codecraft.webmagic.Site
Set default charset of page.
setDisableCookieManagement(boolean) - Method in class us.codecraft.webmagic.Site
Downloader is supposed to store response cookie.
setDomain(String) - Method in class us.codecraft.webmagic.Site
set the domain of site.
setDownloader(Downloader) - Method in class us.codecraft.webmagic.Request
 
setDownloader(Downloader) - Method in class us.codecraft.webmagic.Spider
set the downloader of spider
setDownloadSuccess(boolean) - Method in class us.codecraft.webmagic.Page
 
setDuplicateRemover(DuplicateRemover) - Method in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
 
setEmptySleepTime(long) - Method in class us.codecraft.webmagic.Spider
Set the wait time when no URL is polled.
setEncoding(String) - Method in class us.codecraft.webmagic.model.HttpRequestBody
 
setExecutorService(ExecutorService) - Method in class us.codecraft.webmagic.Spider
 
setExecutorService(ExecutorService) - Method in class us.codecraft.webmagic.thread.CountableThreadPool
 
setExitWhenComplete(boolean) - Method in class us.codecraft.webmagic.Spider
Exit when complete.
setExtras(Map<String, Object>) - Method in class us.codecraft.webmagic.Request
 
setHeaders(Map<String, List<String>>) - Method in class us.codecraft.webmagic.Page
 
setHtml(Html) - Method in class us.codecraft.webmagic.Page
Deprecated.
Since 0.4.0, the HTML is parsed only on the first call to Page.getHtml(), so use Page.setRawText(String) instead.
setHttpClientContext(HttpClientContext) - Method in class us.codecraft.webmagic.downloader.HttpClientRequestContext
 
setHttpUriRequest(HttpUriRequest) - Method in class us.codecraft.webmagic.downloader.HttpClientRequestContext
 
setHttpUriRequestConverter(HttpUriRequestConverter) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
 
setMethod(String) - Method in class us.codecraft.webmagic.Request
 
setPath(String) - Method in class us.codecraft.webmagic.utils.FilePersistentBase
 
setPipelines(List<Pipeline>) - Method in class us.codecraft.webmagic.Spider
set pipelines for Spider
setPoolSize(int) - Method in class us.codecraft.webmagic.downloader.HttpClientGenerator
 
setPriority(long) - Method in class us.codecraft.webmagic.Request
Set the priority of request for sorting.
Requires a scheduler that supports priority.
setProxyProvider(ProxyProvider) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
 
setRawText(String) - Method in class us.codecraft.webmagic.Page
 
setRequest(Request) - Method in class us.codecraft.webmagic.Page
 
setRequest(Request) - Method in class us.codecraft.webmagic.ResultItems
 
setRequestBody(HttpRequestBody) - Method in class us.codecraft.webmagic.Request
 
setRetrySleepTime(int) - Method in class us.codecraft.webmagic.Site
Set the sleep time between retries when a download fails; 1000 by default.
setRetryTimes(int) - Method in class us.codecraft.webmagic.Site
Set the number of retries when a download fails; 0 by default.
setScheduler(Scheduler) - Method in class us.codecraft.webmagic.Spider
set scheduler for Spider
setScheduler(Scheduler) - Method in class us.codecraft.webmagic.SpiderScheduler
 
setScheme(String) - Method in class us.codecraft.webmagic.proxy.Proxy
 
setSkip(boolean) - Method in class us.codecraft.webmagic.Page
 
setSkip(boolean) - Method in class us.codecraft.webmagic.ResultItems
Set whether to skip the result.
A skipped result will not be processed by the Pipeline.
setSleepTime(int) - Method in class us.codecraft.webmagic.Site
Set the interval between the processing of two pages.
Time unit is milliseconds.
setSpawnUrl(boolean) - Method in class us.codecraft.webmagic.Spider
Whether to add extracted URLs to the download queue.
When true, extracted URLs are added for download; when false, only the seed URLs are downloaded.
setSpiderListeners(List<SpiderListener>) - Method in class us.codecraft.webmagic.Spider
 
setStatusCode(int) - Method in class us.codecraft.webmagic.Page
 
setThread(int) - Method in interface us.codecraft.webmagic.downloader.Downloader
Tell the downloader how many threads the spider uses.
setThread(int) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
 
setTimeOut(int) - Method in class us.codecraft.webmagic.Site
set timeout for downloader in ms
setUrl(String) - Method in class us.codecraft.webmagic.Request
 
setUrl(Selectable) - Method in class us.codecraft.webmagic.Page
 
setUseGzip(boolean) - Method in class us.codecraft.webmagic.Site
Whether to use gzip.
setUserAgent(String) - Method in class us.codecraft.webmagic.Site
set user agent
setUUID(String) - Method in class us.codecraft.webmagic.Spider
Set a UUID for the spider.
The default UUID is the domain of the site.
shouldReserved(Request) - Method in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
 
shutdown() - Method in class us.codecraft.webmagic.thread.CountableThreadPool
 
signalNewUrl() - Method in class us.codecraft.webmagic.SpiderScheduler
 
SimplePageProcessor - Class in us.codecraft.webmagic.processor
A simple PageProcessor.
SimplePageProcessor(String) - Constructor for class us.codecraft.webmagic.processor.SimplePageProcessor
 
SimpleProxyProvider - Class in us.codecraft.webmagic.proxy
A simple ProxyProvider.
SimpleProxyProvider(List<Proxy>) - Constructor for class us.codecraft.webmagic.proxy.SimpleProxyProvider
 
site - Variable in class us.codecraft.webmagic.Spider
 
Site - Class in us.codecraft.webmagic
Object containing the settings for the crawler.
Site() - Constructor for class us.codecraft.webmagic.Site
 
sleep(int) - Method in class us.codecraft.webmagic.Spider
 
smartContent() - Method in class us.codecraft.webmagic.selector.HtmlNode
 
smartContent() - Static method in class us.codecraft.webmagic.selector.Selectors
 
SmartContentSelector - Class in us.codecraft.webmagic.selector
Borrowed from https://code.google.com/p/cx-extractor/
SmartContentSelector() - Constructor for class us.codecraft.webmagic.selector.SmartContentSelector
 
sourceTexts - Variable in class us.codecraft.webmagic.selector.PlainText
 
spawnUrl - Variable in class us.codecraft.webmagic.Spider
 
Spider - Class in us.codecraft.webmagic
Entry point of a crawler.
A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline.
Every module is a field of Spider.
Spider(PageProcessor) - Constructor for class us.codecraft.webmagic.Spider
create a spider with pageProcessor.
Spider.Status - Enum in us.codecraft.webmagic
 
SpiderListener - Interface in us.codecraft.webmagic
Listener of Spider on page processing.
SpiderScheduler - Class in us.codecraft.webmagic
 
SpiderScheduler(Scheduler) - Constructor for class us.codecraft.webmagic.SpiderScheduler
 
start() - Method in class us.codecraft.webmagic.Spider
 
startRequest(List<Request>) - Method in class us.codecraft.webmagic.Spider
Set startUrls of Spider.
Takes priority over startUrls of Site.
startRequests - Variable in class us.codecraft.webmagic.Spider
 
startUrls(List<String>) - Method in class us.codecraft.webmagic.Spider
Set startUrls of Spider.
Takes priority over startUrls of Site.
stat - Variable in class us.codecraft.webmagic.Spider
 
STAT_INIT - Static variable in class us.codecraft.webmagic.Spider
 
STAT_RUNNING - Static variable in class us.codecraft.webmagic.Spider
 
STAT_STOPPED - Static variable in class us.codecraft.webmagic.Spider
 
StatusCode() - Constructor for class us.codecraft.webmagic.utils.HttpConstant.StatusCode
 
stop() - Method in class us.codecraft.webmagic.Spider
 
Stopped - Enum constant in enum us.codecraft.webmagic.Spider.Status
 
stopWhenComplete() - Method in class us.codecraft.webmagic.Spider
Stop when all tasks in the queue and all worker threads have completed.

T

Task - Interface in us.codecraft.webmagic
Interface for identifying different tasks.
test(String...) - Method in class us.codecraft.webmagic.Spider
Process specific URLs without URL discovery.
thread(int) - Method in class us.codecraft.webmagic.Spider
Start with more than one thread.
thread(ExecutorService, int) - Method in class us.codecraft.webmagic.Spider
Start with more than one thread.
threadNum - Variable in class us.codecraft.webmagic.Spider
 
threadPool - Variable in class us.codecraft.webmagic.Spider
 
toList(Class<T>) - Method in class us.codecraft.webmagic.selector.Json
 
toObject(Class<T>) - Method in class us.codecraft.webmagic.selector.Json
 
toString() - Method in class us.codecraft.webmagic.Page
 
toString() - Method in class us.codecraft.webmagic.proxy.Proxy
 
toString() - Method in class us.codecraft.webmagic.Request
 
toString() - Method in class us.codecraft.webmagic.ResultItems
 
toString() - Method in class us.codecraft.webmagic.selector.AbstractSelectable
 
toString() - Method in class us.codecraft.webmagic.selector.RegexSelector
 
toString() - Method in class us.codecraft.webmagic.selector.ReplaceSelector
 
toString() - Method in interface us.codecraft.webmagic.selector.Selectable
single string result
toString() - Method in class us.codecraft.webmagic.Site
 
toTask() - Method in class us.codecraft.webmagic.Site
 
toURI() - Method in class us.codecraft.webmagic.proxy.Proxy
 
TRACE - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Method
 

U

UrlUtils - Class in us.codecraft.webmagic.utils
URL and HTML utilities.
UrlUtils() - Constructor for class us.codecraft.webmagic.utils.UrlUtils
 
us.codecraft.webmagic - package us.codecraft.webmagic
Main class "Spider" and models.
us.codecraft.webmagic.downloader - package us.codecraft.webmagic.downloader
Downloader is the part that downloads web pages and stores them in a Page object.
us.codecraft.webmagic.model - package us.codecraft.webmagic.model
 
us.codecraft.webmagic.pipeline - package us.codecraft.webmagic.pipeline
Pipeline is the persistence and offline-processing part of a crawler.
us.codecraft.webmagic.processor - package us.codecraft.webmagic.processor
PageProcessor is the customizable part of a crawler for a specific site.
us.codecraft.webmagic.processor.example - package us.codecraft.webmagic.processor.example
 
us.codecraft.webmagic.proxy - package us.codecraft.webmagic.proxy
 
us.codecraft.webmagic.scheduler - package us.codecraft.webmagic.scheduler
Scheduler is the URL-management part of a crawler.
us.codecraft.webmagic.scheduler.component - package us.codecraft.webmagic.scheduler.component
Component of scheduler.
us.codecraft.webmagic.selector - package us.codecraft.webmagic.selector
Selectors for page extraction.
us.codecraft.webmagic.thread - package us.codecraft.webmagic.thread
 
us.codecraft.webmagic.utils - package us.codecraft.webmagic.utils
Static utils of webmagic.
USER_AGENT - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Header
 
uuid - Variable in class us.codecraft.webmagic.Spider
 

V

validateProxy(Proxy) - Static method in class us.codecraft.webmagic.utils.ProxyUtils
 
valueOf(String) - Static method in enum us.codecraft.webmagic.Spider.Status
Returns the enum constant of this type with the specified name.
values() - Static method in enum us.codecraft.webmagic.Spider.Status
Returns an array containing the constants of this enum type, in the order they are declared.

W

waitNewUrl(CountableThreadPool, long) - Method in class us.codecraft.webmagic.SpiderScheduler
 
WMCollections - Class in us.codecraft.webmagic.utils
 
WMCollections() - Constructor for class us.codecraft.webmagic.utils.WMCollections
 

X

xml(String, String) - Static method in class us.codecraft.webmagic.model.HttpRequestBody
 
XML - Static variable in class us.codecraft.webmagic.model.HttpRequestBody.ContentType
 
xpath(String) - Method in class us.codecraft.webmagic.selector.HtmlNode
 
xpath(String) - Method in class us.codecraft.webmagic.selector.PlainText
 
xpath(String) - Method in interface us.codecraft.webmagic.selector.Selectable
select list with xpath
xpath(String) - Static method in class us.codecraft.webmagic.selector.Selectors
 
XpathSelector - Class in us.codecraft.webmagic.selector
XPath selector based on Xsoup.
XpathSelector(String) - Constructor for class us.codecraft.webmagic.selector.XpathSelector
 
xsoup(String) - Static method in class us.codecraft.webmagic.selector.Selectors
Deprecated.

Z

ZhihuPageProcessor - Class in us.codecraft.webmagic.processor.example
 
ZhihuPageProcessor() - Constructor for class us.codecraft.webmagic.processor.example.ZhihuPageProcessor
 