Package us.codecraft.webmagic
Class Spider
java.lang.Object
  us.codecraft.webmagic.Spider
- All Implemented Interfaces:
java.lang.Runnable, Task
- Direct Known Subclasses:
OOSpider
public class Spider extends java.lang.Object implements java.lang.Runnable, Task
Entry point of a crawler.
A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline.
Every module is a field of Spider.
Each module is defined by an interface.
You can customize a spider by supplying different implementations of them.
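To make the division of labour between the four modules concrete, here is a minimal, dependency-free sketch. The nested interfaces, `QueueScheduler`, `ToyProcessor` and the "title|link1,link2" page format are hypothetical stand-ins invented for this illustration; they are not the WebMagic interfaces and do not show how Spider is implemented.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class FourModuleSketch {
    // Hypothetical stand-ins for the four module roles; NOT the WebMagic interfaces.
    interface Downloader { String download(String url); }
    interface Scheduler { void push(String url); String poll(); }
    interface PageProcessor { String extract(String body); List<String> links(String body); }
    interface Pipeline { void save(String url, String result); }

    /** FIFO scheduler that drops URLs it has already seen. */
    static class QueueScheduler implements Scheduler {
        private final Queue<String> queue = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();
        @Override public void push(String url) { if (seen.add(url)) queue.add(url); }
        @Override public String poll() { return queue.poll(); }
    }

    /** Page bodies look like "title|link1,link2" in this toy example. */
    static class ToyProcessor implements PageProcessor {
        @Override public String extract(String body) {
            return body.substring(0, body.indexOf('|'));
        }
        @Override public List<String> links(String body) {
            String tail = body.substring(body.indexOf('|') + 1);
            if (tail.isEmpty()) return new ArrayList<>();
            return Arrays.asList(tail.split(","));
        }
    }

    /** Single-threaded loop that wires the four modules together. */
    static void crawl(String seed, Downloader d, Scheduler s, PageProcessor p, Pipeline out) {
        s.push(seed);
        for (String url = s.poll(); url != null; url = s.poll()) {
            String body = d.download(url);
            if (body == null) continue; // unknown page: nothing to process
            out.save(url, p.extract(body));
            for (String link : p.links(body)) s.push(link);
        }
    }

    /** Crawls a three-page in-memory site and returns what the pipeline received. */
    static List<String> demo() {
        Map<String, String> site = new HashMap<>();
        site.put("a", "Title A|b,c");
        site.put("b", "Title B|c");
        site.put("c", "Title C|");
        List<String> saved = new ArrayList<>();
        crawl("a", site::get, new QueueScheduler(), new ToyProcessor(),
              (url, result) -> saved.add(url + "=" + result));
        return saved;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because every role sits behind an interface, swapping one implementation (for example a file-backed scheduler) leaves the loop untouched, which is the customization point the description refers to.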
Examples:
A simple crawler:
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*")).run();
Store results to files with FilePipeline (addPipeline replaces the deprecated pipeline method):
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*"))
.addPipeline(new FilePipeline("/data/temp/webmagic/")).run();
Use FileCacheQueueScheduler to store urls and the cursor in files, so that a Spider can resume from where it left off after a shutdown (setScheduler replaces the deprecated scheduler method):
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*"))
.setScheduler(new FileCacheQueueScheduler("/data/temp/webmagic/cache/")).run();
- Since:
- 0.1.0
- Author:
- code4crafter@gmail.com
- See Also:
Downloader, Scheduler, PageProcessor, Pipeline
-
-
Nested Class Summary
Modifier and Type, Class
static class Spider.Status
-
Field Summary
Modifier and Type, Field
protected boolean destroyWhenExit
protected Downloader downloader
protected java.util.concurrent.ExecutorService executorService
protected boolean exitWhenComplete
protected org.slf4j.Logger logger
protected PageProcessor pageProcessor
protected java.util.List<Pipeline> pipelines
protected SpiderScheduler scheduler
protected Site site
protected boolean spawnUrl
protected java.util.List<Request> startRequests
protected java.util.concurrent.atomic.AtomicInteger stat
protected static int STAT_INIT
protected static int STAT_RUNNING
protected static int STAT_STOPPED
protected int threadNum
protected CountableThreadPool threadPool
protected java.lang.String uuid
-
Constructor Summary
Constructor, Description
Spider(PageProcessor pageProcessor)
  Create a spider with the given pageProcessor.
-
Method Summary
Modifier and Type, Method, Description
Spider addPipeline(Pipeline pipeline)
  Add a pipeline to the Spider.
Spider addRequest(Request... requests)
  Add urls with information to crawl.
Spider addUrl(java.lang.String... urls)
  Add urls to crawl.
protected void checkIfRunning()
Spider clearPipeline()
  Clear the pipelines that have been set.
void close()
static Spider create(PageProcessor pageProcessor)
  Create a spider with the given pageProcessor.
Spider downloader(Downloader downloader)
  Deprecated.
protected void extractAndAddRequests(Page page, boolean spawnUrl)
<T> T get(java.lang.String url)
<T> java.util.List<T> getAll(java.util.Collection<java.lang.String> urls)
  Download urls synchronously.
protected CollectorPipeline getCollectorPipeline()
long getPageCount()
  Get the number of pages downloaded by the spider.
Scheduler getScheduler()
Site getSite()
  Site of a task.
java.util.List<SpiderListener> getSpiderListeners()
java.util.Date getStartTime()
Spider.Status getStatus()
  Get the running status of the spider.
int getThreadAlive()
  Get the number of running threads.
java.lang.String getUUID()
  Unique id for a task.
protected void initComponent()
boolean isExitWhenComplete()
boolean isSpawnUrl()
protected void onError(Request request)
  Deprecated. Use onError(Request, Exception) instead.
protected void onError(Request request, java.lang.Exception e)
protected void onSuccess(Request request)
Spider pipeline(Pipeline pipeline)
  Deprecated.
void run()
void runAsync()
Spider scheduler(Scheduler scheduler)
  Deprecated.
Spider setDownloader(Downloader downloader)
  Set the downloader of the spider.
Spider setEmptySleepTime(long emptySleepTime)
  Set the wait time when no url is polled.
Spider setExecutorService(java.util.concurrent.ExecutorService executorService)
Spider setExitWhenComplete(boolean exitWhenComplete)
  Exit when complete.
Spider setPipelines(java.util.List<Pipeline> pipelines)
  Set the pipelines for the Spider.
Spider setScheduler(Scheduler updateScheduler)
  Set the scheduler for the Spider.
Spider setSpawnUrl(boolean spawnUrl)
  Whether to add extracted urls to the download queue. When true, extracted urls are added for download; when false, only the seed urls are downloaded.
Spider setSpiderListeners(java.util.List<SpiderListener> spiderListeners)
Spider setUUID(java.lang.String uuid)
  Set a uuid for the spider. The default uuid is the domain of the site.
protected void sleep(int time)
void start()
Spider startRequest(java.util.List<Request> startRequests)
  Set the start requests of the Spider. These take priority over the startUrls of Site.
Spider startUrls(java.util.List<java.lang.String> startUrls)
  Set the start URLs of the Spider. These take priority over the startUrls of Site.
void stop()
void test(java.lang.String... urls)
  Process specific urls without url discovery.
Spider thread(int threadNum)
  Start with more than one thread.
Spider thread(java.util.concurrent.ExecutorService executorService, int threadNum)
  Start with more than one thread.
-
-
-
Field Detail
-
downloader
protected Downloader downloader
-
pipelines
protected java.util.List<Pipeline> pipelines
-
pageProcessor
protected PageProcessor pageProcessor
-
startRequests
protected java.util.List<Request> startRequests
-
site
protected Site site
-
uuid
protected java.lang.String uuid
-
scheduler
protected SpiderScheduler scheduler
-
logger
protected org.slf4j.Logger logger
-
threadPool
protected CountableThreadPool threadPool
-
executorService
protected java.util.concurrent.ExecutorService executorService
-
threadNum
protected int threadNum
-
stat
protected java.util.concurrent.atomic.AtomicInteger stat
-
exitWhenComplete
protected boolean exitWhenComplete
-
STAT_INIT
protected static final int STAT_INIT
- See Also:
- Constant Field Values
-
STAT_RUNNING
protected static final int STAT_RUNNING
- See Also:
- Constant Field Values
-
STAT_STOPPED
protected static final int STAT_STOPPED
- See Also:
- Constant Field Values
-
spawnUrl
protected boolean spawnUrl
-
destroyWhenExit
protected boolean destroyWhenExit
-
-
Constructor Detail
-
Spider
public Spider(PageProcessor pageProcessor)
Create a spider with the given pageProcessor.
- Parameters:
pageProcessor - pageProcessor
-
-
Method Detail
-
create
public static Spider create(PageProcessor pageProcessor)
Create a spider with the given pageProcessor.
- Parameters:
pageProcessor - pageProcessor
- Returns:
- new spider
- See Also:
PageProcessor
-
startUrls
public Spider startUrls(java.util.List<java.lang.String> startUrls)
Set the start URLs of the Spider.
These take priority over the startUrls of Site.
- Parameters:
startUrls - startUrls
- Returns:
- this
-
startRequest
public Spider startRequest(java.util.List<Request> startRequests)
Set the start requests of the Spider.
These take priority over the startUrls of Site.
- Parameters:
startRequests - startRequests
- Returns:
- this
-
setUUID
public Spider setUUID(java.lang.String uuid)
Set a uuid for the spider.
The default uuid is the domain of the site.
- Parameters:
uuid - uuid
- Returns:
- this
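The fallback rule described above (an explicitly set uuid wins, otherwise the domain of the site is used) can be sketched as follows. `UuidSketch` and `resolveUuid` are hypothetical names for this illustration, and taking the host of a URL via `java.net.URI` is an assumption about what "domain" means here, not WebMagic's actual code.

```java
import java.net.URI;

class UuidSketch {
    /**
     * Returns the explicitly set uuid if there is one; otherwise falls back
     * to the host part of the site URL (assumed meaning of "domain").
     */
    static String resolveUuid(String explicitUuid, String siteUrl) {
        if (explicitUuid != null) {
            return explicitUuid;
        }
        return URI.create(siteUrl).getHost();
    }

    public static void main(String[] args) {
        System.out.println(resolveUuid(null, "http://my.oschina.net/"));
        System.out.println(resolveUuid("blog-crawl", "http://my.oschina.net/"));
    }
}
```

Under this rule, two spiders crawling the same domain end up with the same default id, so setting a distinct uuid is worth considering whenever they must be told apart.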
-
scheduler
@Deprecated public Spider scheduler(Scheduler scheduler)
Deprecated.
Set the scheduler for the Spider.
- Parameters:
scheduler - scheduler
- Returns:
- this
- See Also:
setScheduler(us.codecraft.webmagic.scheduler.Scheduler)
-
setScheduler
public Spider setScheduler(Scheduler updateScheduler)
Set the scheduler for the Spider.
- Parameters:
updateScheduler - scheduler
- Returns:
- this
- Since:
- 0.2.1
- See Also:
Scheduler
-
pipeline
@Deprecated public Spider pipeline(Pipeline pipeline)
Deprecated.
Add a pipeline to the Spider.
- Parameters:
pipeline - pipeline
- Returns:
- this
- See Also:
addPipeline(us.codecraft.webmagic.pipeline.Pipeline)
-
addPipeline
public Spider addPipeline(Pipeline pipeline)
Add a pipeline to the Spider.
- Parameters:
pipeline - pipeline
- Returns:
- this
- Since:
- 0.2.1
- See Also:
Pipeline
-
setPipelines
public Spider setPipelines(java.util.List<Pipeline> pipelines)
Set the pipelines for the Spider.
- Parameters:
pipelines - pipelines
- Returns:
- this
- Since:
- 0.4.1
- See Also:
Pipeline
-
clearPipeline
public Spider clearPipeline()
Clear the pipelines that have been set.
- Returns:
- this
-
downloader
@Deprecated public Spider downloader(Downloader downloader)
Deprecated.
Set the downloader of the spider.
- Parameters:
downloader - downloader
- Returns:
- this
- See Also:
setDownloader(us.codecraft.webmagic.downloader.Downloader)
-
setDownloader
public Spider setDownloader(Downloader downloader)
Set the downloader of the spider.
- Parameters:
downloader - downloader
- Returns:
- this
- See Also:
Downloader
-
initComponent
protected void initComponent()
-
run
public void run()
- Specified by:
run in interface java.lang.Runnable
-
onError
@Deprecated protected void onError(Request request)
Deprecated. Use onError(Request, Exception) instead.
-
onError
protected void onError(Request request, java.lang.Exception e)
-
onSuccess
protected void onSuccess(Request request)
-
close
public void close()
-
test
public void test(java.lang.String... urls)
Process specific urls without url discovery.
- Parameters:
urls - urls to process
-
sleep
protected void sleep(int time)
-
extractAndAddRequests
protected void extractAndAddRequests(Page page, boolean spawnUrl)
-
checkIfRunning
protected void checkIfRunning()
-
runAsync
public void runAsync()
-
addUrl
public Spider addUrl(java.lang.String... urls)
Add urls to crawl.
- Parameters:
urls - urls
- Returns:
- this
-
getAll
public <T> java.util.List<T> getAll(java.util.Collection<java.lang.String> urls)
Download urls synchronously.
- Type Parameters:
T - type of process result
- Parameters:
urls - urls
- Returns:
- list downloaded
-
getCollectorPipeline
protected CollectorPipeline getCollectorPipeline()
-
get
public <T> T get(java.lang.String url)
-
addRequest
public Spider addRequest(Request... requests)
Add urls with information to crawl.
- Parameters:
requests - requests
- Returns:
- this
-
start
public void start()
-
stop
public void stop()
-
thread
public Spider thread(int threadNum)
Start with more than one thread.
- Parameters:
threadNum - threadNum
- Returns:
- this
-
thread
public Spider thread(java.util.concurrent.ExecutorService executorService, int threadNum)
Start with more than one thread.
- Parameters:
executorService - executorService to run the spider
threadNum - threadNum
- Returns:
- this
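As a rough picture of what running with several threads means, the sketch below hands each url to a fixed-size pool from `java.util.concurrent` and counts completed pages with an `AtomicLong`. `ThreadPoolSketch` and `processAll` are hypothetical names; Spider's own fields list a CountableThreadPool, whose behaviour is not documented here, so this should be read as a generic illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class ThreadPoolSketch {
    /** Processes every url on a pool of threadNum workers and returns the page count. */
    static long processAll(List<String> urls, int threadNum) {
        if (threadNum <= 0) {
            throw new IllegalArgumentException("threadNum must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threadNum);
        AtomicLong pageCount = new AtomicLong();
        for (String url : urls) {
            // Each task stands in for "download and process this url".
            pool.execute(() -> pageCount.incrementAndGet());
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to callers
            throw new IllegalStateException(e);
        }
        return pageCount.get();
    }

    public static void main(String[] args) {
        System.out.println(processAll(Arrays.asList("a", "b", "c", "d"), 2));
    }
}
```

Passing an existing ExecutorService, as the second overload allows, would let several spiders share one pool instead of each creating its own.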
-
isExitWhenComplete
public boolean isExitWhenComplete()
-
setExitWhenComplete
public Spider setExitWhenComplete(boolean exitWhenComplete)
Exit when complete.
True: exit when all urls of the site have been downloaded.
False: do not exit until stop() is called manually.
- Parameters:
exitWhenComplete - exitWhenComplete
- Returns:
- this
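The two modes can be pictured with a small worker loop. This is a simplified model under stated assumptions: `ExitWhenCompleteSketch`, its in-memory queue and the 10 ms pause are all hypothetical and only mirror the behaviour described above, not Spider's actual run loop.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

class ExitWhenCompleteSketch implements Runnable {
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();
    private final List<String> processed = new CopyOnWriteArrayList<>();
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final boolean exitWhenComplete;

    ExitWhenCompleteSketch(boolean exitWhenComplete) {
        this.exitWhenComplete = exitWhenComplete;
    }

    ExitWhenCompleteSketch addUrl(String url) { queue.add(url); return this; }
    void stop() { stopped.set(true); }
    List<String> processed() { return processed; }

    @Override
    public void run() {
        while (!stopped.get()) {
            String url = queue.poll();
            if (url != null) {
                processed.add(url);   // stand-in for download + process
            } else if (exitWhenComplete) {
                return;               // queue drained: finish without being told
            } else {
                try {
                    Thread.sleep(10); // idle until more urls arrive or stop() is called
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    public static void main(String[] args) {
        ExitWhenCompleteSketch s = new ExitWhenCompleteSketch(true).addUrl("a").addUrl("b");
        s.run();
        System.out.println(s.processed());
    }
}
```

With the flag set to false the loop keeps idling after the queue empties, which suits long-running crawlers that receive urls over time.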
-
isSpawnUrl
public boolean isSpawnUrl()
-
getPageCount
public long getPageCount()
Get the number of pages downloaded by the spider.
- Returns:
- total downloaded page count
- Since:
- 0.4.1
-
getStatus
public Spider.Status getStatus()
Get the running status of the spider.
- Returns:
- running status
- Since:
- 0.4.1
- See Also:
Spider.Status
-
getThreadAlive
public int getThreadAlive()
Get the number of running threads.
- Returns:
- thread count which is running
- Since:
- 0.4.1
-
setSpawnUrl
public Spider setSpawnUrl(boolean spawnUrl)
Whether to add extracted urls to the download queue.
When true, extracted urls are added for download; when false, only the seed urls are downloaded.
DO NOT set it unless you know what it means!
- Parameters:
spawnUrl - spawnUrl
- Returns:
- this
- Since:
- 0.4.0
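The effect of the flag can be shown on a toy link graph. In this sketch `SpawnUrlSketch`, `crawl` and the map-based "site" are hypothetical; it only models the rule stated above (follow discovered links when true, visit seeds only when false).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SpawnUrlSketch {
    /** Visits pages from the seeds; follows discovered links only when spawnUrl is true. */
    static List<String> crawl(Map<String, List<String>> linksByUrl,
                              List<String> seeds, boolean spawnUrl) {
        Deque<String> queue = new ArrayDeque<>(seeds);
        Set<String> seen = new HashSet<>(seeds);
        List<String> visited = new ArrayList<>();
        while (!queue.isEmpty()) {
            String url = queue.poll();
            visited.add(url);
            if (!spawnUrl) {
                continue; // seed-only mode: ignore links found on the page
            }
            for (String link : linksByUrl.getOrDefault(url, Collections.emptyList())) {
                if (seen.add(link)) {
                    queue.add(link); // first time we see this link: schedule it
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = new HashMap<>();
        site.put("a", Arrays.asList("b"));
        site.put("b", Arrays.asList("c"));
        System.out.println(crawl(site, Arrays.asList("a"), true));
        System.out.println(crawl(site, Arrays.asList("a"), false));
    }
}
```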
-
getUUID
public java.lang.String getUUID()
Description copied from interface: Task
unique id for a task.
-
setExecutorService
public Spider setExecutorService(java.util.concurrent.ExecutorService executorService)
-
getSpiderListeners
public java.util.List<SpiderListener> getSpiderListeners()
-
setSpiderListeners
public Spider setSpiderListeners(java.util.List<SpiderListener> spiderListeners)
-
getStartTime
public java.util.Date getStartTime()
-
getScheduler
public Scheduler getScheduler()
-
setEmptySleepTime
public Spider setEmptySleepTime(long emptySleepTime)
Set the wait time when no url is polled.
- Parameters:
emptySleepTime - wait time, in milliseconds
- Returns:
- this
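One way to picture a bounded wait on an empty queue is the timed `poll` of `java.util.concurrent.BlockingQueue`. The sketch below is an analogy only: `EmptySleepSketch` and `pollOrWait` are hypothetical names, and Spider's internal waiting mechanism is not described in this documentation.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class EmptySleepSketch {
    /** Waits up to emptySleepTimeMillis for a url; returns null if none arrives in time. */
    static String pollOrWait(BlockingQueue<String> urls, long emptySleepTimeMillis) {
        try {
            return urls.poll(emptySleepTimeMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> urls = new LinkedBlockingQueue<>();
        urls.add("http://my.oschina.net/");
        System.out.println(pollOrWait(urls, 50)); // a url is available, returned at once
        System.out.println(pollOrWait(urls, 50)); // queue is empty, gives up after the wait
    }
}
```

A longer wait reduces idle polling at the cost of reacting more slowly when the queue really is exhausted.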
-
-