Index
$
- $(String) - Method in class us.codecraft.webmagic.selector.HtmlNode
- $(String) - Method in class us.codecraft.webmagic.selector.PlainText
- $(String) - Method in interface us.codecraft.webmagic.selector.Selectable
-
select list with css selector
- $(String) - Static method in class us.codecraft.webmagic.selector.Selectors
- $(String, String) - Method in class us.codecraft.webmagic.selector.HtmlNode
- $(String, String) - Method in class us.codecraft.webmagic.selector.PlainText
- $(String, String) - Method in interface us.codecraft.webmagic.selector.Selectable
-
select list with css selector
- $(String, String) - Static method in class us.codecraft.webmagic.selector.Selectors
A
- AbstractDownloader - Class in us.codecraft.webmagic.downloader
-
Base class of downloader with some common methods.
- AbstractDownloader() - Constructor for class us.codecraft.webmagic.downloader.AbstractDownloader
- AbstractSelectable - Class in us.codecraft.webmagic.selector
- AbstractSelectable() - Constructor for class us.codecraft.webmagic.selector.AbstractSelectable
- addCookie(String, String) - Method in class us.codecraft.webmagic.Request
- addCookie(String, String) - Method in class us.codecraft.webmagic.Site
-
Add a cookie with domain Site.getDomain()
- addCookie(String, String, String) - Method in class us.codecraft.webmagic.Site
-
Add a cookie with specific domain.
- addHeader(String, String) - Method in class us.codecraft.webmagic.Request
- addHeader(String, String) - Method in class us.codecraft.webmagic.Site
-
Put an Http header for downloader.
- addPipeline(Pipeline) - Method in class us.codecraft.webmagic.Spider
-
add a pipeline for Spider
- addRequest(Request...) - Method in class us.codecraft.webmagic.Spider
-
Add urls with information to crawl.
- addTargetRequest(String) - Method in class us.codecraft.webmagic.Page
-
add url to fetch
- addTargetRequest(Request) - Method in class us.codecraft.webmagic.Page
-
add requests to fetch
- addTargetRequests(Iterable<String>) - Method in class us.codecraft.webmagic.Page
-
add urls to fetch
- addTargetRequests(Iterable<String>, long) - Method in class us.codecraft.webmagic.Page
-
add urls to fetch
- addUrl(String...) - Method in class us.codecraft.webmagic.Spider
-
Add urls to crawl.
- all() - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- all() - Method in interface us.codecraft.webmagic.selector.Selectable
-
multi string result
- and(Selector...) - Static method in class us.codecraft.webmagic.selector.Selectors
- AndSelector - Class in us.codecraft.webmagic.selector
-
All selectors will be arranged as a pipeline.
- AndSelector(List<Selector>) - Constructor for class us.codecraft.webmagic.selector.AndSelector
- AndSelector(Selector...) - Constructor for class us.codecraft.webmagic.selector.AndSelector
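The AndSelector entry above says only that the selectors are "arranged as a pipeline". One way to read that is sketched below using plain JDK types. The class `AndPipelineSketch` and its `firstGroup` helper are hypothetical stand-ins for illustration, not WebMagic's actual classes, and the stop-on-null behaviour is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, not WebMagic's AndSelector: each step receives the
// previous step's output, and a null result stops the chain (an assumption).
final class AndPipelineSketch {
    private final List<Function<String, String>> steps;

    @SafeVarargs
    AndPipelineSketch(Function<String, String>... steps) {
        this.steps = Arrays.asList(steps);
    }

    // Feeds the text through every step in order.
    String select(String text) {
        String current = text;
        for (Function<String, String> step : steps) {
            if (current == null) {
                return null; // an earlier step found nothing
            }
            current = step.apply(current);
        }
        return current;
    }

    // Illustrative helper: group 1 of the first regex match, or null.
    static Function<String, String> firstGroup(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return text -> {
            Matcher matcher = pattern.matcher(text);
            return matcher.find() ? matcher.group(1) : null;
        };
    }
}
```

Chaining `firstGroup("<a href=\"([^\"]+)\"")` with `firstGroup("https?://([^/]+)")` would first pull the href value out of an anchor tag and then the host out of that value.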
B
- BaiduBaikePageProcessor - Class in us.codecraft.webmagic.processor.example
- BaiduBaikePageProcessor() - Constructor for class us.codecraft.webmagic.processor.example.BaiduBaikePageProcessor
- BaseElementSelector - Class in us.codecraft.webmagic.selector
- BaseElementSelector() - Constructor for class us.codecraft.webmagic.selector.BaseElementSelector
- BaseSelectorUtils - Class in us.codecraft.webmagic.utils
- BaseSelectorUtils() - Constructor for class us.codecraft.webmagic.utils.BaseSelectorUtils
C
- canonicalizeUrl(String, String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
-
canonicalizeUrl
Borrowed from Jsoup.
- CharsetUtils - Class in us.codecraft.webmagic.utils
- checkAndMakeParentDirecotry(String) - Method in class us.codecraft.webmagic.utils.FilePersistentBase
- checkIfRunning() - Method in class us.codecraft.webmagic.Spider
- clearPipeline() - Method in class us.codecraft.webmagic.Spider
-
clear the pipelines set
- close() - Method in class us.codecraft.webmagic.Spider
- CODE_200 - Static variable in class us.codecraft.webmagic.utils.HttpConstant.StatusCode
- CollectorPipeline<T> - Interface in us.codecraft.webmagic.pipeline
-
Pipeline that can collect and store results.
- compareLong(long, long) - Static method in class us.codecraft.webmagic.utils.NumberUtils
- CONNECT - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Method
- ConsolePipeline - Class in us.codecraft.webmagic.pipeline
-
Write results to the console.
Usually used in tests.
- ConsolePipeline() - Constructor for class us.codecraft.webmagic.pipeline.ConsolePipeline
- ContentType() - Constructor for class us.codecraft.webmagic.model.HttpRequestBody.ContentType
- convert(Request, Site, Proxy) - Method in class us.codecraft.webmagic.downloader.HttpUriRequestConverter
- convertHeaders(Header[]) - Static method in class us.codecraft.webmagic.utils.HttpClientUtils
- convertToRequests(Collection<String>) - Static method in class us.codecraft.webmagic.utils.UrlUtils
- convertToUrls(Collection<Request>) - Static method in class us.codecraft.webmagic.utils.UrlUtils
- CountableThreadPool - Class in us.codecraft.webmagic.thread
- CountableThreadPool(int) - Constructor for class us.codecraft.webmagic.thread.CountableThreadPool
- CountableThreadPool(int, ExecutorService) - Constructor for class us.codecraft.webmagic.thread.CountableThreadPool
- create(String) - Static method in class us.codecraft.webmagic.selector.Html
- create(String) - Static method in class us.codecraft.webmagic.selector.PlainText
- create(URI) - Static method in class us.codecraft.webmagic.proxy.Proxy
- create(PageProcessor) - Static method in class us.codecraft.webmagic.Spider
-
create a spider with pageProcessor.
- css(String) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- css(String) - Method in interface us.codecraft.webmagic.selector.Selectable
-
select list with css selector
- css(String, String) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- css(String, String) - Method in interface us.codecraft.webmagic.selector.Selectable
-
select list with css selector
- CssSelector - Class in us.codecraft.webmagic.selector
-
CSS selector.
- CssSelector(String) - Constructor for class us.codecraft.webmagic.selector.CssSelector
- CssSelector(String, String) - Constructor for class us.codecraft.webmagic.selector.CssSelector
- custom(byte[], String, String) - Static method in class us.codecraft.webmagic.model.HttpRequestBody
- CustomRedirectStrategy - Class in us.codecraft.webmagic.downloader
-
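Redirect strategy implementation that supports 302 redirects for POST requests. HttpClient's default redirect setup is httpClientBuilder.setRedirectStrategy(new LaxRedirectStrategy()); in the post/redirect/post case, that code does not pass along the data of the original request. This class therefore follows the redirect strategy of the SeimiCrawler project. Original source: https://github.com/zhegexiaohuozi/SeimiCrawler/blob/master/project/src/main/java/cn/wanghaomiao/seimi/http/hc/SeimiRedirectStrategy.java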
- CustomRedirectStrategy() - Constructor for class us.codecraft.webmagic.downloader.CustomRedirectStrategy
- CYCLE_TRIED_TIMES - Static variable in class us.codecraft.webmagic.Request
D
- DELETE - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Method
- destroyWhenExit - Variable in class us.codecraft.webmagic.Spider
- detectCharset(String, byte[]) - Static method in class us.codecraft.webmagic.utils.CharsetUtils
- DISABLE_HTML_ENTITY_ESCAPE - Static variable in class us.codecraft.webmagic.selector.Html
-
Deprecated.
- download(String) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
-
A simple method to download a url.
- download(String, String) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
-
A simple method to download a url.
- download(Request, Task) - Method in interface us.codecraft.webmagic.downloader.Downloader
-
Downloads web pages and stores them in a Page object.
- download(Request, Task) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
- downloader - Variable in class us.codecraft.webmagic.Spider
- downloader(Downloader) - Method in class us.codecraft.webmagic.Spider
-
Deprecated.
- Downloader - Interface in us.codecraft.webmagic.downloader
-
Downloader is the part that downloads web pages and stores them in a Page object.
- DuplicateRemovedScheduler - Class in us.codecraft.webmagic.scheduler
-
Remove duplicate urls and push only urls that are not duplicates.
- DuplicateRemovedScheduler() - Constructor for class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
- DuplicateRemover - Interface in us.codecraft.webmagic.scheduler.component
-
Remove duplicate requests.
E
- ElementSelector - Interface in us.codecraft.webmagic.selector
-
Selector(extractor) for html elements.
- encodeIllegalCharacterInUrl(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
-
Deprecated.
- equals(Object) - Method in class us.codecraft.webmagic.proxy.Proxy
- equals(Object) - Method in class us.codecraft.webmagic.Request
- equals(Object) - Method in class us.codecraft.webmagic.Site
- execute(Runnable) - Method in class us.codecraft.webmagic.thread.CountableThreadPool
- executorService - Variable in class us.codecraft.webmagic.Spider
- exitWhenComplete - Variable in class us.codecraft.webmagic.Spider
- Experimental - Annotation Type in us.codecraft.webmagic.utils
-
Marks features that are unstable.
- extractAndAddRequests(Page, boolean) - Method in class us.codecraft.webmagic.Spider
F
- fail() - Static method in class us.codecraft.webmagic.Page
-
Deprecated.
- fail(Request) - Static method in class us.codecraft.webmagic.Page
-
Deprecated, for removal: This API element is subject to removal in a future version. Use Page.ofFailure(Request) instead.
- FilePersistentBase - Class in us.codecraft.webmagic.utils
-
Base object of file persistence.
- FilePersistentBase() - Constructor for class us.codecraft.webmagic.utils.FilePersistentBase
- FilePipeline - Class in us.codecraft.webmagic.pipeline
-
Store results in files.
- FilePipeline() - Constructor for class us.codecraft.webmagic.pipeline.FilePipeline
-
create a FilePipeline with default path "/data/webmagic/"
- FilePipeline(String) - Constructor for class us.codecraft.webmagic.pipeline.FilePipeline
- fixIllegalCharacterInUrl(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
- form(Map<String, Object>, String) - Static method in class us.codecraft.webmagic.model.HttpRequestBody
- FORM - Static variable in class us.codecraft.webmagic.model.HttpRequestBody.ContentType
- from(Proxy...) - Static method in class us.codecraft.webmagic.proxy.SimpleProxyProvider
- fromValue(int) - Static method in enum us.codecraft.webmagic.Spider.Status
G
- get() - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- get() - Method in interface us.codecraft.webmagic.selector.Selectable
-
single string result
- get(String) - Method in class us.codecraft.webmagic.ResultItems
- get(String) - Method in class us.codecraft.webmagic.Spider
- GET - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Method
- getAcceptStatCode() - Method in class us.codecraft.webmagic.Site
-
get acceptStatCode
- getAll() - Method in class us.codecraft.webmagic.ResultItems
- getAll(Collection<String>) - Method in class us.codecraft.webmagic.Spider
-
Download urls synchronously.
- getAllCookies() - Method in class us.codecraft.webmagic.Site
-
get cookies of all domains
- getBody() - Method in class us.codecraft.webmagic.model.HttpRequestBody
- getBytes() - Method in class us.codecraft.webmagic.Page
- getCharset() - Method in class us.codecraft.webmagic.Page
- getCharset() - Method in class us.codecraft.webmagic.Request
- getCharset() - Method in class us.codecraft.webmagic.Site
-
get charset set manually
- getCharset(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
- getClient(Site) - Method in class us.codecraft.webmagic.downloader.HttpClientGenerator
- getCollected() - Method in interface us.codecraft.webmagic.pipeline.CollectorPipeline
-
Get all results collected.
- getCollected() - Method in class us.codecraft.webmagic.pipeline.ResultItemsCollectorPipeline
- getCollectorPipeline() - Method in class us.codecraft.webmagic.Spider
- getContentType() - Method in class us.codecraft.webmagic.model.HttpRequestBody
- getCookies() - Method in class us.codecraft.webmagic.Request
- getCookies() - Method in class us.codecraft.webmagic.Site
-
get cookies
- getCycleRetryTimes() - Method in class us.codecraft.webmagic.Site
-
When cycleRetryTimes is greater than 0, the request is added back to the scheduler and the download is tried again.
- getDefaultCharset() - Method in class us.codecraft.webmagic.Site
-
The default charset to use if charset detection fails.
- getDocument() - Method in class us.codecraft.webmagic.selector.Html
- getDomain() - Method in class us.codecraft.webmagic.Site
-
get domain
- getDomain(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
- getDownloader() - Method in class us.codecraft.webmagic.Request
- getDuplicateRemover() - Method in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
- getElements() - Method in class us.codecraft.webmagic.selector.Html
- getElements() - Method in class us.codecraft.webmagic.selector.HtmlNode
- getEncoding() - Method in class us.codecraft.webmagic.model.HttpRequestBody
- getExtra(String) - Method in class us.codecraft.webmagic.Request
- getExtras() - Method in class us.codecraft.webmagic.Request
- getFile(String) - Method in class us.codecraft.webmagic.utils.FilePersistentBase
- getFirstSourceText() - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- getHeaders() - Method in class us.codecraft.webmagic.Page
- getHeaders() - Method in class us.codecraft.webmagic.Request
- getHeaders() - Method in class us.codecraft.webmagic.Site
- getHost() - Method in class us.codecraft.webmagic.proxy.Proxy
- getHost(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
- getHtml() - Method in class us.codecraft.webmagic.Page
-
get html content of page
- getHttpClientContext() - Method in class us.codecraft.webmagic.downloader.HttpClientRequestContext
- getHttpUriRequest() - Method in class us.codecraft.webmagic.downloader.HttpClientRequestContext
- getJson() - Method in class us.codecraft.webmagic.Page
-
get json content of page
- getJsonPathStr() - Method in class us.codecraft.webmagic.selector.JsonPathSelector
- getLeftRequestsCount(Task) - Method in interface us.codecraft.webmagic.scheduler.MonitorableScheduler
- getLeftRequestsCount(Task) - Method in class us.codecraft.webmagic.scheduler.PriorityScheduler
- getLeftRequestsCount(Task) - Method in class us.codecraft.webmagic.scheduler.QueueScheduler
- getMethod() - Method in class us.codecraft.webmagic.Request
-
The http method of the request.
- getPageCount() - Method in class us.codecraft.webmagic.Spider
-
Get page count downloaded by spider.
- getPassword() - Method in class us.codecraft.webmagic.proxy.Proxy
- getPath() - Method in class us.codecraft.webmagic.utils.FilePersistentBase
- getPort() - Method in class us.codecraft.webmagic.proxy.Proxy
- getPriority() - Method in class us.codecraft.webmagic.Request
- getProxy(Request, Task) - Method in interface us.codecraft.webmagic.proxy.ProxyProvider
-
Returns a proxy for the request.
- getProxy(Request, Task) - Method in class us.codecraft.webmagic.proxy.SimpleProxyProvider
- getProxy(Task) - Method in interface us.codecraft.webmagic.proxy.ProxyProvider
-
Deprecated. Use ProxyProvider.getProxy(Request, Task) instead.
- getRawText() - Method in class us.codecraft.webmagic.Page
- getRedirect(HttpRequest, HttpResponse, HttpContext) - Method in class us.codecraft.webmagic.downloader.CustomRedirectStrategy
- getRequest() - Method in class us.codecraft.webmagic.Page
-
get request of current page
- getRequest() - Method in class us.codecraft.webmagic.ResultItems
- getRequestBody() - Method in class us.codecraft.webmagic.Request
- getResultItems() - Method in class us.codecraft.webmagic.Page
- getRetrySleepTime() - Method in class us.codecraft.webmagic.Site
- getRetryTimes() - Method in class us.codecraft.webmagic.Site
-
Get the number of immediate retries when a download fails; 0 by default.
- getScheduler() - Method in class us.codecraft.webmagic.Spider
- getScheduler() - Method in class us.codecraft.webmagic.SpiderScheduler
- getScheme() - Method in class us.codecraft.webmagic.proxy.Proxy
- getSite() - Method in class us.codecraft.webmagic.processor.example.BaiduBaikePageProcessor
- getSite() - Method in class us.codecraft.webmagic.processor.example.GithubRepoPageProcessor
- getSite() - Method in class us.codecraft.webmagic.processor.example.ZhihuPageProcessor
- getSite() - Method in interface us.codecraft.webmagic.processor.PageProcessor
-
Returns the site settings.
- getSite() - Method in class us.codecraft.webmagic.processor.SimplePageProcessor
- getSite() - Method in class us.codecraft.webmagic.Spider
- getSite() - Method in interface us.codecraft.webmagic.Task
-
site of a task
- getSleepTime() - Method in class us.codecraft.webmagic.Site
-
Get the interval between the processing of two pages.
Time unit is milliseconds.
- getSourceTexts() - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- getSourceTexts() - Method in class us.codecraft.webmagic.selector.HtmlNode
- getSourceTexts() - Method in class us.codecraft.webmagic.selector.PlainText
- getSpiderListeners() - Method in class us.codecraft.webmagic.Spider
- getStartTime() - Method in class us.codecraft.webmagic.Spider
- getStatus() - Method in class us.codecraft.webmagic.Spider
-
Get the running status of the spider.
- getStatusCode() - Method in class us.codecraft.webmagic.Page
- getTargetRequests() - Method in class us.codecraft.webmagic.Page
- getText(Element) - Method in class us.codecraft.webmagic.selector.CssSelector
- getThreadAlive() - Method in class us.codecraft.webmagic.Spider
-
Get the number of threads that are running
- getThreadAlive() - Method in class us.codecraft.webmagic.thread.CountableThreadPool
- getThreadNum() - Method in class us.codecraft.webmagic.thread.CountableThreadPool
- getTimeOut() - Method in class us.codecraft.webmagic.Site
- getTotalRequestsCount(Task) - Method in interface us.codecraft.webmagic.scheduler.component.DuplicateRemover
-
Get TotalRequestsCount for monitor.
- getTotalRequestsCount(Task) - Method in class us.codecraft.webmagic.scheduler.component.HashSetDuplicateRemover
- getTotalRequestsCount(Task) - Method in interface us.codecraft.webmagic.scheduler.MonitorableScheduler
- getTotalRequestsCount(Task) - Method in class us.codecraft.webmagic.scheduler.PriorityScheduler
- getTotalRequestsCount(Task) - Method in class us.codecraft.webmagic.scheduler.QueueScheduler
- getUrl() - Method in class us.codecraft.webmagic.Page
-
get url of current page
- getUrl() - Method in class us.codecraft.webmagic.Request
- getUrl(Request) - Method in class us.codecraft.webmagic.scheduler.component.HashSetDuplicateRemover
- getUserAgent() - Method in class us.codecraft.webmagic.Site
-
get user agent
- getUsername() - Method in class us.codecraft.webmagic.proxy.Proxy
- getUUID() - Method in class us.codecraft.webmagic.Spider
- getUUID() - Method in interface us.codecraft.webmagic.Task
-
unique id for a task.
- GithubRepoPageProcessor - Class in us.codecraft.webmagic.processor.example
- GithubRepoPageProcessor() - Constructor for class us.codecraft.webmagic.processor.example.GithubRepoPageProcessor
H
- handleResponse(Request, String, HttpResponse, Task) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
- hasAttribute() - Method in class us.codecraft.webmagic.selector.BaseElementSelector
- hasAttribute() - Method in class us.codecraft.webmagic.selector.CssSelector
- hasAttribute() - Method in class us.codecraft.webmagic.selector.LinksSelector
- hasAttribute() - Method in class us.codecraft.webmagic.selector.XpathSelector
- hashCode() - Method in class us.codecraft.webmagic.proxy.Proxy
- hashCode() - Method in class us.codecraft.webmagic.Request
- hashCode() - Method in class us.codecraft.webmagic.Site
- HashSetDuplicateRemover - Class in us.codecraft.webmagic.scheduler.component
- HashSetDuplicateRemover() - Constructor for class us.codecraft.webmagic.scheduler.component.HashSetDuplicateRemover
- HEAD - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Method
- Header() - Constructor for class us.codecraft.webmagic.utils.HttpConstant.Header
- Html - Class in us.codecraft.webmagic.selector
-
Selectable html.
- Html(String) - Constructor for class us.codecraft.webmagic.selector.Html
- Html(String, String) - Constructor for class us.codecraft.webmagic.selector.Html
- Html(Document) - Constructor for class us.codecraft.webmagic.selector.Html
- HtmlNode - Class in us.codecraft.webmagic.selector
- HtmlNode() - Constructor for class us.codecraft.webmagic.selector.HtmlNode
- HtmlNode(List<Element>) - Constructor for class us.codecraft.webmagic.selector.HtmlNode
- HttpClientDownloader - Class in us.codecraft.webmagic.downloader
-
The http downloader based on HttpClient.
- HttpClientDownloader() - Constructor for class us.codecraft.webmagic.downloader.HttpClientDownloader
- HttpClientGenerator - Class in us.codecraft.webmagic.downloader
- HttpClientGenerator() - Constructor for class us.codecraft.webmagic.downloader.HttpClientGenerator
- HttpClientRequestContext - Class in us.codecraft.webmagic.downloader
- HttpClientRequestContext() - Constructor for class us.codecraft.webmagic.downloader.HttpClientRequestContext
- HttpClientUtils - Class in us.codecraft.webmagic.utils
- HttpClientUtils() - Constructor for class us.codecraft.webmagic.utils.HttpClientUtils
- HttpConstant - Class in us.codecraft.webmagic.utils
-
Some constants of the Http protocol.
- HttpConstant() - Constructor for class us.codecraft.webmagic.utils.HttpConstant
- HttpConstant.Header - Class in us.codecraft.webmagic.utils
- HttpConstant.Method - Class in us.codecraft.webmagic.utils
- HttpConstant.StatusCode - Class in us.codecraft.webmagic.utils
- HttpRequestBody - Class in us.codecraft.webmagic.model
- HttpRequestBody() - Constructor for class us.codecraft.webmagic.model.HttpRequestBody
- HttpRequestBody(byte[], String, String) - Constructor for class us.codecraft.webmagic.model.HttpRequestBody
- HttpRequestBody.ContentType - Class in us.codecraft.webmagic.model
- HttpUriRequestConverter - Class in us.codecraft.webmagic.downloader
- HttpUriRequestConverter() - Constructor for class us.codecraft.webmagic.downloader.HttpUriRequestConverter
I
- Init - Enum constant in enum us.codecraft.webmagic.Spider.Status
- initComponent() - Method in class us.codecraft.webmagic.Spider
- INITIAL_CAPACITY - Static variable in class us.codecraft.webmagic.scheduler.PriorityScheduler
- isBinaryContent() - Method in class us.codecraft.webmagic.Request
- isDisableCookieManagement() - Method in class us.codecraft.webmagic.Site
- isDownloadSuccess() - Method in class us.codecraft.webmagic.Page
- isDuplicate(Request, Task) - Method in interface us.codecraft.webmagic.scheduler.component.DuplicateRemover
-
Check whether the request is duplicate.
- isDuplicate(Request, Task) - Method in class us.codecraft.webmagic.scheduler.component.HashSetDuplicateRemover
- isExitWhenComplete() - Method in class us.codecraft.webmagic.Spider
- isShutdown() - Method in class us.codecraft.webmagic.thread.CountableThreadPool
- isSkip() - Method in class us.codecraft.webmagic.ResultItems
-
Whether to skip the result.
A result that is skipped will not be processed by Pipeline.
- isSpawnUrl() - Method in class us.codecraft.webmagic.Spider
- isUseGzip() - Method in class us.codecraft.webmagic.Site
J
- json(String, String) - Static method in class us.codecraft.webmagic.model.HttpRequestBody
- Json - Class in us.codecraft.webmagic.selector
-
parse json
- Json(String) - Constructor for class us.codecraft.webmagic.selector.Json
- Json(List<String>) - Constructor for class us.codecraft.webmagic.selector.Json
- JSON - Static variable in class us.codecraft.webmagic.model.HttpRequestBody.ContentType
- jsonPath(String) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- jsonPath(String) - Method in class us.codecraft.webmagic.selector.Json
- jsonPath(String) - Method in interface us.codecraft.webmagic.selector.Selectable
-
extract by JSON Path expression
- JsonPathSelector - Class in us.codecraft.webmagic.selector
-
JsonPath selector.
Used to extract content from JSON.
- JsonPathSelector(String) - Constructor for class us.codecraft.webmagic.selector.JsonPathSelector
L
- links() - Method in class us.codecraft.webmagic.selector.HtmlNode
- links() - Method in class us.codecraft.webmagic.selector.PlainText
- links() - Method in interface us.codecraft.webmagic.selector.Selectable
-
select all links
- LinksSelector - Class in us.codecraft.webmagic.selector
-
Links selector based on jsoup.
- LinksSelector() - Constructor for class us.codecraft.webmagic.selector.LinksSelector
- logger - Variable in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
- logger - Variable in class us.codecraft.webmagic.Spider
M
- main(String[]) - Static method in class us.codecraft.webmagic.processor.example.BaiduBaikePageProcessor
- main(String[]) - Static method in class us.codecraft.webmagic.processor.example.GithubRepoPageProcessor
- main(String[]) - Static method in class us.codecraft.webmagic.processor.example.ZhihuPageProcessor
- match() - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- match() - Method in interface us.codecraft.webmagic.selector.Selectable
-
whether any result exists for the selection
- me() - Static method in class us.codecraft.webmagic.Site
-
create a new Site
- Method() - Constructor for class us.codecraft.webmagic.utils.HttpConstant.Method
- MonitorableScheduler - Interface in us.codecraft.webmagic.scheduler
-
The scheduler whose requests can be counted for monitor.
- MULTIPART - Static variable in class us.codecraft.webmagic.model.HttpRequestBody.ContentType
N
- newArrayList(T...) - Static method in class us.codecraft.webmagic.utils.WMCollections
- newHashSet(T...) - Static method in class us.codecraft.webmagic.utils.WMCollections
- nodes() - Method in class us.codecraft.webmagic.selector.HtmlNode
- nodes() - Method in class us.codecraft.webmagic.selector.PlainText
- nodes() - Method in interface us.codecraft.webmagic.selector.Selectable
-
get all nodes
- noNeedToRemoveDuplicate(Request) - Method in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
- NumberUtils - Class in us.codecraft.webmagic.utils
- NumberUtils() - Constructor for class us.codecraft.webmagic.utils.NumberUtils
O
- ofFailure(Request) - Static method in class us.codecraft.webmagic.Page
- ofSuccess(Request) - Static method in class us.codecraft.webmagic.Page
- onError(Page, Task, Throwable) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
- onError(Request) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
-
Deprecated. Use AbstractDownloader.onError(Page, Task, Throwable) instead.
- onError(Request) - Method in class us.codecraft.webmagic.Spider
-
Deprecated. Use Spider.onError(Request, Exception) instead.
- onError(Request) - Method in interface us.codecraft.webmagic.SpiderListener
-
Deprecated. Use SpiderListener.onError(Request, Exception) instead.
- onError(Request, Exception) - Method in class us.codecraft.webmagic.Spider
- onError(Request, Exception) - Method in interface us.codecraft.webmagic.SpiderListener
- onError(Request, Task, Throwable) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
-
Deprecated. Use AbstractDownloader.onError(Page, Task, Throwable) instead.
- onSuccess(Page, Task) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
- onSuccess(Request) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
-
Deprecated. Use AbstractDownloader.onSuccess(Page, Task) instead.
- onSuccess(Request) - Method in class us.codecraft.webmagic.Spider
- onSuccess(Request) - Method in interface us.codecraft.webmagic.SpiderListener
- onSuccess(Request, Task) - Method in class us.codecraft.webmagic.downloader.AbstractDownloader
-
Deprecated. Use AbstractDownloader.onSuccess(Page, Task) instead.
- or(Selector...) - Static method in class us.codecraft.webmagic.selector.Selectors
- OrSelector - Class in us.codecraft.webmagic.selector
-
All extractors extract separately, and their results are combined into the final result.
- OrSelector(List<Selector>) - Constructor for class us.codecraft.webmagic.selector.OrSelector
- OrSelector(Selector...) - Constructor for class us.codecraft.webmagic.selector.OrSelector
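The OrSelector description above (every extractor runs separately and the results are combined) can be illustrated with a small JDK-only sketch. `OrCombinationSketch` and `allMatches` are hypothetical names for illustration, not part of WebMagic, and concatenating the results in extractor order is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in, not WebMagic's OrSelector: every extractor sees the
// same input, and the outputs are concatenated in extractor order (assumed).
final class OrCombinationSketch {
    private final List<Function<String, List<String>>> extractors;

    @SafeVarargs
    OrCombinationSketch(Function<String, List<String>>... extractors) {
        this.extractors = Arrays.asList(extractors);
    }

    // Runs each extractor on the original text and merges the results.
    List<String> selectList(String text) {
        List<String> combined = new ArrayList<>();
        for (Function<String, List<String>> extractor : extractors) {
            combined.addAll(extractor.apply(text));
        }
        return combined;
    }

    // Illustrative helper: every full match of the regex, in order.
    static Function<String, List<String>> allMatches(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return text -> {
            List<String> found = new ArrayList<>();
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                found.add(matcher.group());
            }
            return found;
        };
    }
}
```

Unlike a pipeline, no extractor here sees another extractor's output, so one extractor returning nothing does not affect the others.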
P
- Page - Class in us.codecraft.webmagic
-
Object storing extracted result and urls to fetch.
Not thread safe.
Main methods:
Page.getUrl(): get url of current page
Page.getHtml(): get content of current page
Page.putField(String, Object): save extracted result
Page.getResultItems(): get extracted results to be used in Pipeline
Page.addTargetRequests(Iterable), Page.addTargetRequest(String): add urls to fetch
- Page() - Constructor for class us.codecraft.webmagic.Page
- pageProcessor - Variable in class us.codecraft.webmagic.Spider
- PageProcessor - Interface in us.codecraft.webmagic.processor
-
Interface to be implemented to customize a crawler.
- path - Variable in class us.codecraft.webmagic.utils.FilePersistentBase
- PATH_SEPERATOR - Static variable in class us.codecraft.webmagic.utils.FilePersistentBase
- pipeline(Pipeline) - Method in class us.codecraft.webmagic.Spider
-
Deprecated.
- Pipeline - Interface in us.codecraft.webmagic.pipeline
-
Pipeline is the persistence and offline-processing part of the crawler.
The Pipeline interface can be implemented to customize how results are persisted.
- pipelines - Variable in class us.codecraft.webmagic.Spider
- PlainText - Class in us.codecraft.webmagic.selector
-
Selectable plain text.
Cannot be selected by XPath or CSS Selector.
- PlainText(String) - Constructor for class us.codecraft.webmagic.selector.PlainText
- PlainText(List<String>) - Constructor for class us.codecraft.webmagic.selector.PlainText
- poll(Spider) - Method in class us.codecraft.webmagic.SpiderScheduler
- poll(Task) - Method in class us.codecraft.webmagic.scheduler.PriorityScheduler
- poll(Task) - Method in class us.codecraft.webmagic.scheduler.QueueScheduler
- poll(Task) - Method in interface us.codecraft.webmagic.scheduler.Scheduler
-
get a url to crawl
- POST - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Method
- preParse(String) - Static method in class us.codecraft.webmagic.utils.BaseSelectorUtils
-
Jsoup/HtmlCleaner cannot parse a "tr" or "td" tag directly; see https://stackoverflow.com/questions/63607740/jsoup-couldnt-parse-tr-tag
- PriorityScheduler - Class in us.codecraft.webmagic.scheduler
-
Priority scheduler.
- PriorityScheduler() - Constructor for class us.codecraft.webmagic.scheduler.PriorityScheduler
- process(Page) - Method in class us.codecraft.webmagic.processor.example.BaiduBaikePageProcessor
- process(Page) - Method in class us.codecraft.webmagic.processor.example.GithubRepoPageProcessor
- process(Page) - Method in class us.codecraft.webmagic.processor.example.ZhihuPageProcessor
- process(Page) - Method in interface us.codecraft.webmagic.processor.PageProcessor
-
Processes the page: extracts URLs to fetch, extracts the data, and stores it.
- process(Page) - Method in class us.codecraft.webmagic.processor.SimplePageProcessor
- process(ResultItems, Task) - Method in class us.codecraft.webmagic.pipeline.ConsolePipeline
- process(ResultItems, Task) - Method in class us.codecraft.webmagic.pipeline.FilePipeline
- process(ResultItems, Task) - Method in interface us.codecraft.webmagic.pipeline.Pipeline
-
Process extracted results.
- process(ResultItems, Task) - Method in class us.codecraft.webmagic.pipeline.ResultItemsCollectorPipeline
- Proxy - Class in us.codecraft.webmagic.proxy
- Proxy(String, int) - Constructor for class us.codecraft.webmagic.proxy.Proxy
- Proxy(String, int, String) - Constructor for class us.codecraft.webmagic.proxy.Proxy
- Proxy(String, int, String, String) - Constructor for class us.codecraft.webmagic.proxy.Proxy
- ProxyProvider - Interface in us.codecraft.webmagic.proxy
-
Proxy provider.
- ProxyUtils - Class in us.codecraft.webmagic.utils
-
Pooled Proxy Object
- ProxyUtils() - Constructor for class us.codecraft.webmagic.utils.ProxyUtils
- push(Request, Spider) - Method in class us.codecraft.webmagic.SpiderScheduler
- push(Request, Task) - Method in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
- push(Request, Task) - Method in interface us.codecraft.webmagic.scheduler.Scheduler
-
add a url to fetch
- pushWhenNoDuplicate(Request, Task) - Method in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
- pushWhenNoDuplicate(Request, Task) - Method in class us.codecraft.webmagic.scheduler.PriorityScheduler
- pushWhenNoDuplicate(Request, Task) - Method in class us.codecraft.webmagic.scheduler.QueueScheduler
- put(String, T) - Method in class us.codecraft.webmagic.ResultItems
- PUT - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Method
- putExtra(String, T) - Method in class us.codecraft.webmagic.Request
- putField(String, Object) - Method in class us.codecraft.webmagic.Page
-
store extracted results
Q
- QueueScheduler - Class in us.codecraft.webmagic.scheduler
-
Basic Scheduler implementation.
Store urls to fetch in LinkedBlockingQueue and remove duplicate urls by HashMap.
- QueueScheduler() - Constructor for class us.codecraft.webmagic.scheduler.QueueScheduler
- QueueScheduler(int) - Constructor for class us.codecraft.webmagic.scheduler.QueueScheduler
-
Creates a QueueScheduler with the given (fixed) capacity.
R
- REFERER - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Header
- regex(String) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- regex(String) - Method in interface us.codecraft.webmagic.selector.Selectable
-
select list with regex, default group is group 1
- regex(String) - Static method in class us.codecraft.webmagic.selector.Selectors
- regex(String, int) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- regex(String, int) - Method in interface us.codecraft.webmagic.selector.Selectable
-
select list with regex
- regex(String, int) - Static method in class us.codecraft.webmagic.selector.Selectors
- RegexSelector - Class in us.codecraft.webmagic.selector
-
Selector in regex.
- RegexSelector(String) - Constructor for class us.codecraft.webmagic.selector.RegexSelector
-
Create a RegexSelector.
- RegexSelector(String, int) - Constructor for class us.codecraft.webmagic.selector.RegexSelector
- removePadding(String) - Method in class us.codecraft.webmagic.selector.Json
-
remove padding for JSONP
- removePort(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
- removeProtocol(String) - Static method in class us.codecraft.webmagic.utils.UrlUtils
- replace(String, String) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- replace(String, String) - Method in interface us.codecraft.webmagic.selector.Selectable
-
replace with regex
- ReplaceSelector - Class in us.codecraft.webmagic.selector
-
Replace selector.
- ReplaceSelector(String, String) - Constructor for class us.codecraft.webmagic.selector.ReplaceSelector
- Request - Class in us.codecraft.webmagic
-
Object contains url to crawl.
It contains some additional information.
- Request() - Constructor for class us.codecraft.webmagic.Request
- Request(String) - Constructor for class us.codecraft.webmagic.Request
- resetDuplicateCheck(Task) - Method in interface us.codecraft.webmagic.scheduler.component.DuplicateRemover
-
Reset duplicate check.
- resetDuplicateCheck(Task) - Method in class us.codecraft.webmagic.scheduler.component.HashSetDuplicateRemover
- ResultItems - Class in us.codecraft.webmagic
-
Object contains extract results.
It is contained in Page and will be processed in pipeline.
- ResultItems() - Constructor for class us.codecraft.webmagic.ResultItems
- ResultItemsCollectorPipeline - Class in us.codecraft.webmagic.pipeline
- ResultItemsCollectorPipeline() - Constructor for class us.codecraft.webmagic.pipeline.ResultItemsCollectorPipeline
- returnProxy(Proxy, Page, Task) - Method in interface us.codecraft.webmagic.proxy.ProxyProvider
-
Return proxy to Provider when complete a download.
- returnProxy(Proxy, Page, Task) - Method in class us.codecraft.webmagic.proxy.SimpleProxyProvider
- run() - Method in class us.codecraft.webmagic.Spider
- runAsync() - Method in class us.codecraft.webmagic.Spider
- Running - Enum constant in enum us.codecraft.webmagic.Spider.Status
S
- scheduler - Variable in class us.codecraft.webmagic.Spider
- scheduler(Scheduler) - Method in class us.codecraft.webmagic.Spider
-
Deprecated.
- Scheduler - Interface in us.codecraft.webmagic.scheduler
-
Scheduler is the part of url management.
You can implement interface Scheduler to do: manage urls to fetch; remove duplicate urls.
- select(String) - Method in class us.codecraft.webmagic.selector.AndSelector
- select(String) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
- select(String) - Method in class us.codecraft.webmagic.selector.JsonPathSelector
- select(String) - Method in class us.codecraft.webmagic.selector.OrSelector
- select(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
- select(String) - Method in class us.codecraft.webmagic.selector.ReplaceSelector
- select(String) - Method in interface us.codecraft.webmagic.selector.Selector
-
Extract single result in text.
If there is more than one result, only the first will be chosen.
- select(String) - Method in class us.codecraft.webmagic.selector.SmartContentSelector
- select(Element) - Method in class us.codecraft.webmagic.selector.CssSelector
- select(Element) - Method in interface us.codecraft.webmagic.selector.ElementSelector
-
Extract single result in text.
If there is more than one result, only the first will be chosen.
- select(Element) - Method in class us.codecraft.webmagic.selector.LinksSelector
- select(Element) - Method in class us.codecraft.webmagic.selector.XpathSelector
- select(Selector) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- select(Selector) - Method in class us.codecraft.webmagic.selector.HtmlNode
- select(Selector) - Method in interface us.codecraft.webmagic.selector.Selectable
-
extract by custom selector
- select(Selector, List<String>) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- Selectable - Interface in us.codecraft.webmagic.selector
-
Selectable text.
- selectDocument(Selector) - Method in class us.codecraft.webmagic.selector.Html
- selectDocumentForList(Selector) - Method in class us.codecraft.webmagic.selector.Html
- selectElement(String) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
- selectElement(Element) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
- selectElement(Element) - Method in class us.codecraft.webmagic.selector.CssSelector
- selectElement(Element) - Method in class us.codecraft.webmagic.selector.LinksSelector
- selectElement(Element) - Method in class us.codecraft.webmagic.selector.XpathSelector
- selectElements(String) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
- selectElements(Element) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
- selectElements(Element) - Method in class us.codecraft.webmagic.selector.CssSelector
- selectElements(Element) - Method in class us.codecraft.webmagic.selector.LinksSelector
- selectElements(Element) - Method in class us.codecraft.webmagic.selector.XpathSelector
- selectElements(BaseElementSelector) - Method in class us.codecraft.webmagic.selector.HtmlNode
-
select elements
- selectGroup(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
- selectGroupList(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
- selectList(String) - Method in class us.codecraft.webmagic.selector.AndSelector
- selectList(String) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
- selectList(String) - Method in class us.codecraft.webmagic.selector.JsonPathSelector
- selectList(String) - Method in class us.codecraft.webmagic.selector.OrSelector
- selectList(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
- selectList(String) - Method in class us.codecraft.webmagic.selector.ReplaceSelector
- selectList(String) - Method in interface us.codecraft.webmagic.selector.Selector
-
Extract all results in text.
- selectList(String) - Method in class us.codecraft.webmagic.selector.SmartContentSelector
- selectList(Element) - Method in class us.codecraft.webmagic.selector.CssSelector
- selectList(Element) - Method in interface us.codecraft.webmagic.selector.ElementSelector
-
Extract all results in text.
- selectList(Element) - Method in class us.codecraft.webmagic.selector.LinksSelector
- selectList(Element) - Method in class us.codecraft.webmagic.selector.XpathSelector
- selectList(Selector) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- selectList(Selector) - Method in class us.codecraft.webmagic.selector.HtmlNode
- selectList(Selector) - Method in interface us.codecraft.webmagic.selector.Selectable
-
extract by custom selector
- selectList(Selector, List<String>) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- Selector - Interface in us.codecraft.webmagic.selector
-
Selector(extractor) for text.
- Selectors - Class in us.codecraft.webmagic.selector
-
Convenient methods for selectors.
- Selectors() - Constructor for class us.codecraft.webmagic.selector.Selectors
- setAcceptStatCode(Set<Integer>) - Method in class us.codecraft.webmagic.Site
-
Set acceptStatCode.
When status code of http response is in acceptStatCodes, it will be processed.
{200} by default.
It does not necessarily need to be set.
- setBinaryContent(boolean) - Method in class us.codecraft.webmagic.Request
- setBody(byte[]) - Method in class us.codecraft.webmagic.model.HttpRequestBody
- setBytes(byte[]) - Method in class us.codecraft.webmagic.Page
- setCharset(String) - Method in class us.codecraft.webmagic.Page
- setCharset(String) - Method in class us.codecraft.webmagic.Request
- setCharset(String) - Method in class us.codecraft.webmagic.Site
-
Set charset of page manually.
When charset is not set or set to null, it can be auto detected by Http header.
- setContentType(String) - Method in class us.codecraft.webmagic.model.HttpRequestBody
- setCycleRetryTimes(int) - Method in class us.codecraft.webmagic.Site
-
Set the number of cycle retries when a download fails; 0 by default.
- setDefaultCharset(String) - Method in class us.codecraft.webmagic.Site
-
Set default charset of page.
- setDisableCookieManagement(boolean) - Method in class us.codecraft.webmagic.Site
-
Downloader is supposed to store response cookie.
- setDomain(String) - Method in class us.codecraft.webmagic.Site
-
set the domain of site.
- setDownloader(Downloader) - Method in class us.codecraft.webmagic.Request
- setDownloader(Downloader) - Method in class us.codecraft.webmagic.Spider
-
set the downloader of spider
- setDownloadSuccess(boolean) - Method in class us.codecraft.webmagic.Page
- setDuplicateRemover(DuplicateRemover) - Method in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
- setEmptySleepTime(long) - Method in class us.codecraft.webmagic.Spider
-
Set wait time when no url is polled.
- setEncoding(String) - Method in class us.codecraft.webmagic.model.HttpRequestBody
- setExecutorService(ExecutorService) - Method in class us.codecraft.webmagic.Spider
- setExecutorService(ExecutorService) - Method in class us.codecraft.webmagic.thread.CountableThreadPool
- setExitWhenComplete(boolean) - Method in class us.codecraft.webmagic.Spider
-
Exit when complete.
- setExtras(Map<String, Object>) - Method in class us.codecraft.webmagic.Request
- setHeaders(Map<String, List<String>>) - Method in class us.codecraft.webmagic.Page
- setHtml(Html) - Method in class us.codecraft.webmagic.Page
-
Deprecated. Since 0.4.0, the html is parsed only on the first call to Page.getHtml(), so use Page.setRawText(String) instead.
- setHttpClientContext(HttpClientContext) - Method in class us.codecraft.webmagic.downloader.HttpClientRequestContext
- setHttpUriRequest(HttpUriRequest) - Method in class us.codecraft.webmagic.downloader.HttpClientRequestContext
- setHttpUriRequestConverter(HttpUriRequestConverter) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
- setMethod(String) - Method in class us.codecraft.webmagic.Request
- setPath(String) - Method in class us.codecraft.webmagic.utils.FilePersistentBase
- setPipelines(List<Pipeline>) - Method in class us.codecraft.webmagic.Spider
-
set pipelines for Spider
- setPoolSize(int) - Method in class us.codecraft.webmagic.downloader.HttpClientGenerator
- setPriority(long) - Method in class us.codecraft.webmagic.Request
-
Set the priority of request for sorting.
Need a scheduler supporting priority.
- setProxyProvider(ProxyProvider) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
- setRawText(String) - Method in class us.codecraft.webmagic.Page
- setRequest(Request) - Method in class us.codecraft.webmagic.Page
- setRequest(Request) - Method in class us.codecraft.webmagic.ResultItems
- setRequestBody(HttpRequestBody) - Method in class us.codecraft.webmagic.Request
- setRetrySleepTime(int) - Method in class us.codecraft.webmagic.Site
-
Set the sleep time between retries when a download fails; 1000 by default.
- setRetryTimes(int) - Method in class us.codecraft.webmagic.Site
-
Set the number of retries when a download fails; 0 by default.
- setScheduler(Scheduler) - Method in class us.codecraft.webmagic.Spider
-
set scheduler for Spider
- setScheduler(Scheduler) - Method in class us.codecraft.webmagic.SpiderScheduler
- setScheme(String) - Method in class us.codecraft.webmagic.proxy.Proxy
- setSkip(boolean) - Method in class us.codecraft.webmagic.Page
- setSkip(boolean) - Method in class us.codecraft.webmagic.ResultItems
-
Set whether to skip the result.
Result which is skipped will not be processed by Pipeline.
- setSleepTime(int) - Method in class us.codecraft.webmagic.Site
-
Set the interval between the processing of two pages.
Time unit is milliseconds.
- setSpawnUrl(boolean) - Method in class us.codecraft.webmagic.Spider
-
Whether to add extracted urls for download.
Add urls to download when it is true, and just download seed urls when it is false.
- setSpiderListeners(List<SpiderListener>) - Method in class us.codecraft.webmagic.Spider
- setStatusCode(int) - Method in class us.codecraft.webmagic.Page
- setThread(int) - Method in interface us.codecraft.webmagic.downloader.Downloader
-
Tell the downloader how many threads the spider uses.
- setThread(int) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
- setTimeOut(int) - Method in class us.codecraft.webmagic.Site
-
set timeout for downloader in ms
- setUrl(String) - Method in class us.codecraft.webmagic.Request
- setUrl(Selectable) - Method in class us.codecraft.webmagic.Page
- setUseGzip(boolean) - Method in class us.codecraft.webmagic.Site
-
Whether to use gzip.
- setUserAgent(String) - Method in class us.codecraft.webmagic.Site
-
set user agent
- setUUID(String) - Method in class us.codecraft.webmagic.Spider
-
Set a uuid for the spider.
Default uuid is the domain of the site.
- shouldReserved(Request) - Method in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
- shutdown() - Method in class us.codecraft.webmagic.thread.CountableThreadPool
- signalNewUrl() - Method in class us.codecraft.webmagic.SpiderScheduler
- SimplePageProcessor - Class in us.codecraft.webmagic.processor
-
A simple PageProcessor.
- SimplePageProcessor(String) - Constructor for class us.codecraft.webmagic.processor.SimplePageProcessor
- SimpleProxyProvider - Class in us.codecraft.webmagic.proxy
-
A simple ProxyProvider.
- SimpleProxyProvider(List<Proxy>) - Constructor for class us.codecraft.webmagic.proxy.SimpleProxyProvider
- site - Variable in class us.codecraft.webmagic.Spider
- Site - Class in us.codecraft.webmagic
-
Object contains setting for crawler.
- Site() - Constructor for class us.codecraft.webmagic.Site
- sleep(int) - Method in class us.codecraft.webmagic.Spider
- smartContent() - Method in class us.codecraft.webmagic.selector.HtmlNode
- smartContent() - Static method in class us.codecraft.webmagic.selector.Selectors
- SmartContentSelector - Class in us.codecraft.webmagic.selector
-
Borrowed from https://code.google.com/p/cx-extractor/
- SmartContentSelector() - Constructor for class us.codecraft.webmagic.selector.SmartContentSelector
- sourceTexts - Variable in class us.codecraft.webmagic.selector.PlainText
- spawnUrl - Variable in class us.codecraft.webmagic.Spider
- Spider - Class in us.codecraft.webmagic
-
Entrance of a crawler.
A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline.
Every module is a field of Spider.
- Spider(PageProcessor) - Constructor for class us.codecraft.webmagic.Spider
-
create a spider with pageProcessor.
- Spider.Status - Enum in us.codecraft.webmagic
- SpiderListener - Interface in us.codecraft.webmagic
-
Listener of Spider on page processing.
- SpiderScheduler - Class in us.codecraft.webmagic
- SpiderScheduler(Scheduler) - Constructor for class us.codecraft.webmagic.SpiderScheduler
- start() - Method in class us.codecraft.webmagic.Spider
- startRequest(List<Request>) - Method in class us.codecraft.webmagic.Spider
-
Set startUrls of Spider.
Takes priority over startUrls of Site.
- startRequests - Variable in class us.codecraft.webmagic.Spider
- startUrls(List<String>) - Method in class us.codecraft.webmagic.Spider
-
Set startUrls of Spider.
Takes priority over startUrls of Site.
- stat - Variable in class us.codecraft.webmagic.Spider
- STAT_INIT - Static variable in class us.codecraft.webmagic.Spider
- STAT_RUNNING - Static variable in class us.codecraft.webmagic.Spider
- STAT_STOPPED - Static variable in class us.codecraft.webmagic.Spider
- StatusCode() - Constructor for class us.codecraft.webmagic.utils.HttpConstant.StatusCode
- stop() - Method in class us.codecraft.webmagic.Spider
- Stopped - Enum constant in enum us.codecraft.webmagic.Spider.Status
- stopWhenComplete() - Method in class us.codecraft.webmagic.Spider
-
Stop when all tasks in the queue are completed and all worker threads are also completed
T
- Task - Interface in us.codecraft.webmagic
-
Interface for identifying different tasks.
- test(String...) - Method in class us.codecraft.webmagic.Spider
-
Process specific urls without url discovering.
- thread(int) - Method in class us.codecraft.webmagic.Spider
-
start with more than one thread
- thread(ExecutorService, int) - Method in class us.codecraft.webmagic.Spider
-
start with more than one thread
- threadNum - Variable in class us.codecraft.webmagic.Spider
- threadPool - Variable in class us.codecraft.webmagic.Spider
- toList(Class<T>) - Method in class us.codecraft.webmagic.selector.Json
- toObject(Class<T>) - Method in class us.codecraft.webmagic.selector.Json
- toString() - Method in class us.codecraft.webmagic.Page
- toString() - Method in class us.codecraft.webmagic.proxy.Proxy
- toString() - Method in class us.codecraft.webmagic.Request
- toString() - Method in class us.codecraft.webmagic.ResultItems
- toString() - Method in class us.codecraft.webmagic.selector.AbstractSelectable
- toString() - Method in class us.codecraft.webmagic.selector.RegexSelector
- toString() - Method in class us.codecraft.webmagic.selector.ReplaceSelector
- toString() - Method in interface us.codecraft.webmagic.selector.Selectable
-
single string result
- toString() - Method in class us.codecraft.webmagic.Site
- toTask() - Method in class us.codecraft.webmagic.Site
- toURI() - Method in class us.codecraft.webmagic.proxy.Proxy
- TRACE - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Method
U
- UrlUtils - Class in us.codecraft.webmagic.utils
-
url and html utils.
- UrlUtils() - Constructor for class us.codecraft.webmagic.utils.UrlUtils
- us.codecraft.webmagic - package us.codecraft.webmagic
-
Main class "Spider" and models.
- us.codecraft.webmagic.downloader - package us.codecraft.webmagic.downloader
-
Downloader is the part that downloads web pages and stores them in a Page object.
- us.codecraft.webmagic.model - package us.codecraft.webmagic.model
- us.codecraft.webmagic.pipeline - package us.codecraft.webmagic.pipeline
-
Pipeline is the persistent and offline process part of crawler.
- us.codecraft.webmagic.processor - package us.codecraft.webmagic.processor
-
PageProcessor is the custom part of a crawler for a specific site.
- us.codecraft.webmagic.processor.example - package us.codecraft.webmagic.processor.example
- us.codecraft.webmagic.proxy - package us.codecraft.webmagic.proxy
- us.codecraft.webmagic.scheduler - package us.codecraft.webmagic.scheduler
-
Scheduler is the part of url management.
- us.codecraft.webmagic.scheduler.component - package us.codecraft.webmagic.scheduler.component
-
Component of scheduler.
- us.codecraft.webmagic.selector - package us.codecraft.webmagic.selector
-
Selectors for page extraction.
- us.codecraft.webmagic.thread - package us.codecraft.webmagic.thread
- us.codecraft.webmagic.utils - package us.codecraft.webmagic.utils
-
Static utils of webmagic.
- USER_AGENT - Static variable in class us.codecraft.webmagic.utils.HttpConstant.Header
- uuid - Variable in class us.codecraft.webmagic.Spider
V
- validateProxy(Proxy) - Static method in class us.codecraft.webmagic.utils.ProxyUtils
- valueOf(String) - Static method in enum us.codecraft.webmagic.Spider.Status
-
Returns the enum constant of this type with the specified name.
- values() - Static method in enum us.codecraft.webmagic.Spider.Status
-
Returns an array containing the constants of this enum type, in the order they are declared.
W
- waitNewUrl(CountableThreadPool, long) - Method in class us.codecraft.webmagic.SpiderScheduler
- WMCollections - Class in us.codecraft.webmagic.utils
- WMCollections() - Constructor for class us.codecraft.webmagic.utils.WMCollections
X
- xml(String, String) - Static method in class us.codecraft.webmagic.model.HttpRequestBody
- XML - Static variable in class us.codecraft.webmagic.model.HttpRequestBody.ContentType
- xpath(String) - Method in class us.codecraft.webmagic.selector.HtmlNode
- xpath(String) - Method in class us.codecraft.webmagic.selector.PlainText
- xpath(String) - Method in interface us.codecraft.webmagic.selector.Selectable
-
select list with xpath
- xpath(String) - Static method in class us.codecraft.webmagic.selector.Selectors
- XpathSelector - Class in us.codecraft.webmagic.selector
-
XPath selector based on Xsoup.
- XpathSelector(String) - Constructor for class us.codecraft.webmagic.selector.XpathSelector
- xsoup(String) - Static method in class us.codecraft.webmagic.selector.Selectors
-
Deprecated.
Z
- ZhihuPageProcessor - Class in us.codecraft.webmagic.processor.example
- ZhihuPageProcessor() - Constructor for class us.codecraft.webmagic.processor.example.ZhihuPageProcessor