Package us.codecraft.webmagic.model
Class OOSpider<T>
java.lang.Object
us.codecraft.webmagic.Spider
us.codecraft.webmagic.model.OOSpider<T>
The spider for page model extractors.
In webmagic, a POJO that holds the extraction result is called a "page model".
You can customize a crawler by writing a page model with annotations.
For example:
@TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
public class OschinaBlog {

    @ExtractBy("//title")
    private String title;

    @ExtractBy(value = "div.BlogContent", type = ExtractBy.Type.Css)
    private String content;

    @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
    private List<String> tags;
}

Then start the spider with:

OOSpider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"),
        new JsonFilePageModelPipeline(), OschinaBlog.class).run();
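To make the page model idea concrete, the following is a minimal, self-contained sketch of how annotations on a POJO can drive extraction through reflection. It is not webmagic's implementation: the `ExtractByRegex` annotation, the `BlogModel` class, and the `PageModelSketch.extract` helper are hypothetical names introduced only for illustration, and plain regular expressions stand in for the XPath and CSS selectors shown above.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for an extraction annotation; NOT webmagic's @ExtractBy.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface ExtractByRegex {
    String value(); // regex whose first capture group becomes the field value
}

// A "page model": a plain object whose fields declare how they are filled.
class BlogModel {
    @ExtractByRegex("<title>(.*?)</title>")
    String title;

    @ExtractByRegex("<div class=\"BlogContent\">(.*?)</div>")
    String content;
}

class PageModelSketch {
    // Instantiates modelClass and fills every annotated String field from html.
    // Fields whose pattern does not match are left as null.
    static <T> T extract(String html, Class<T> modelClass) {
        try {
            T model = modelClass.getDeclaredConstructor().newInstance();
            for (Field field : modelClass.getDeclaredFields()) {
                ExtractByRegex rule = field.getAnnotation(ExtractByRegex.class);
                if (rule == null) {
                    continue; // not an extraction target
                }
                Matcher m = Pattern.compile(rule.value(), Pattern.DOTALL).matcher(html);
                if (m.find()) {
                    field.setAccessible(true);
                    field.set(model, m.group(1));
                }
            }
            return model;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot populate " + modelClass.getName(), e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><title>Hello</title>"
                + "<div class=\"BlogContent\">Body text</div></html>";
        BlogModel blog = extract(html, BlogModel.class);
        System.out.println(blog.title + " / " + blog.content);
    }
}
```

In this sketch the model class carries all extraction rules, so supporting a new page type means writing a new annotated class rather than new crawling logic. That is the same division of labor the `OschinaBlog` example above relies on.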
- Since:
- 0.2.0
- Author:
- code4crafter@gmail.com
-
Nested Class Summary
Nested classes/interfaces inherited from class us.codecraft.webmagic.Spider
Spider.Status
-
Field Summary
Fields inherited from class us.codecraft.webmagic.Spider
destroyWhenExit, downloader, executorService, exitWhenComplete, logger, pageProcessor, pipelines, scheduler, site, spawnUrl, startRequests, stat, STAT_INIT, STAT_RUNNING, STAT_STOPPED, threadNum, threadPool, uuid
-
Constructor Summary
Constructors
Modifier     Constructor     Description
protected    OOSpider(us.codecraft.webmagic.model.ModelPageProcessor modelPageProcessor)
             OOSpider(PageProcessor pageProcessor)
             OOSpider(Site site, PageModelPipeline pageModelPipeline, Class... pageModels)     create a spider
-
Method Summary
Modifier and Type              Method     Description
                               addPageModel(PageModelPipeline pageModelPipeline, Class... pageModels)
static OOSpider                create
static OOSpider                create(Site site, PageModelPipeline pageModelPipeline, Class... pageModels)
protected CollectorPipeline    getCollectorPipeline()
                               setIsExtractLinks(boolean isExtractLinks)
Methods inherited from class us.codecraft.webmagic.Spider
addPipeline, addRequest, addUrl, checkIfRunning, clearPipeline, close, create, downloader, extractAndAddRequests, get, getAll, getPageCount, getScheduler, getSite, getSpiderListeners, getStartTime, getStatus, getThreadAlive, getUUID, initComponent, isExitWhenComplete, isSpawnUrl, onError, onError, onSuccess, pipeline, run, runAsync, scheduler, setDownloader, setEmptySleepTime, setExecutorService, setExitWhenComplete, setPipelines, setScheduler, setSpawnUrl, setSpiderListeners, setUUID, sleep, start, startRequest, startUrls, stop, stopWhenComplete, test, thread, thread
-
Constructor Details
-
OOSpider
protected OOSpider(us.codecraft.webmagic.model.ModelPageProcessor modelPageProcessor) -
OOSpider
-
OOSpider
create a spider
- Parameters:
site - site
pageModelPipeline - pageModelPipeline
pageModels - pageModels
-
-
Method Details
-
getCollectorPipeline
- Overrides:
getCollectorPipeline in class Spider
-
create
-
create
-
addPageModel
-
setIsExtractLinks
-