Package us.codecraft.webmagic
Class Spider
java.lang.Object
us.codecraft.webmagic.Spider
- Direct Known Subclasses:
OOSpider
Entrance of a crawler.
A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline.
Every module is a field of Spider.
Each module is defined as an interface, so you can customize a spider by supplying different implementations of them.
Examples:
A simple crawler:
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*")).run();
Store results to files by FilePipeline:
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*"))
.pipeline(new FilePipeline("/data/temp/webmagic/")).run();
Use FileCacheQueueScheduler to store urls and the crawl cursor in files, so that a Spider can resume from where it stopped after a shutdown.
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*"))
.scheduler(new FileCacheQueueScheduler("/data/temp/webmagic/cache/")).run();
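All four modules can be customized in one chain. A sketch using implementations shipped with WebMagic (HttpClientDownloader, QueueScheduler, ConsolePipeline; any other implementation of the same interfaces works identically):
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*"))
.setDownloader(new HttpClientDownloader()) // Downloader: fetches pages
.setScheduler(new QueueScheduler()) // Scheduler: manages the url queue
.addPipeline(new ConsolePipeline()) // Pipeline: consumes extraction results
.thread(5).run();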
- Since:
- 0.1.0
- Author:
- code4crafter@gmail.com
-
Nested Class Summary
static enum Spider.Status
-
Field Summary
Modifier and Type    Field
protected boolean    destroyWhenExit
protected Downloader    downloader
protected ExecutorService    executorService
protected boolean    exitWhenComplete
protected org.slf4j.Logger    logger
protected PageProcessor    pageProcessor
protected List<Pipeline>    pipelines
protected Scheduler    scheduler
protected Site    site
protected boolean    spawnUrl
protected List<Request>    startRequests
protected AtomicInteger    stat
protected static final int    STAT_INIT
protected static final int    STAT_RUNNING
protected static final int    STAT_STOPPED
protected int    threadNum
protected CountableThreadPool    threadPool
protected String    uuid
-
Constructor Summary
Spider(PageProcessor pageProcessor)
create a spider with pageProcessor.
-
Method Summary
Modifier and Type    Method    Description
Spider    addPipeline(Pipeline pipeline)    add a pipeline for Spider
Spider    addRequest(Request... requests)    Add urls with extra information to crawl.
Spider    addUrl(String... urls)    Add urls to crawl.
protected void    checkIfRunning()
Spider    clearPipeline()    clear the pipelines set
void    close()
static Spider    create(PageProcessor pageProcessor)    create a spider with pageProcessor.
Spider    downloader(Downloader downloader)    Deprecated.
protected void    extractAndAddRequests(Page page, boolean spawnUrl)
<T> T    get(String url)    Download one url synchronously.
<T> List<T>    getAll(Collection<String> urls)    Download urls synchronously.
protected CollectorPipeline    getCollectorPipeline()
long    getPageCount()    Get the count of pages downloaded by the spider.
Scheduler    getScheduler()
Site    getSite()    site of a task
List<SpiderListener>    getSpiderListeners()
Date    getStartTime()
Spider.Status    getStatus()    Get the running status of the spider.
int    getThreadAlive()    Get the count of currently running threads.
String    getUUID()    unique id for a task.
protected void    initComponent()
boolean    isExitWhenComplete()
boolean    isSpawnUrl()
protected void    onError(Request request)    Deprecated.
protected void    onError(Request request, Exception e)
protected void    onSuccess(Request request)
Spider    pipeline(Pipeline pipeline)    Deprecated.
void    run()
void    runAsync()
Spider    scheduler(Scheduler scheduler)    Deprecated.
Spider    setDownloader(Downloader downloader)    set the downloader of spider
Spider    setEmptySleepTime(long emptySleepTime)    Set the wait time when no url is polled.
Spider    setExecutorService(ExecutorService executorService)
Spider    setExitWhenComplete(boolean exitWhenComplete)    Exit when complete.
Spider    setPipelines(List<Pipeline> pipelines)    set pipelines for Spider
Spider    setScheduler(Scheduler updateScheduler)    set scheduler for Spider
Spider    setSpawnUrl(boolean spawnUrl)    Whether to add extracted urls to the download queue; when false, only seed urls are downloaded.
void    setSpiderListeners(List<SpiderListener> spiderListeners)
Spider    setUUID(String uuid)    Set a uuid for the spider; the default uuid is the domain of the site.
protected void    sleep(int time)
void    start()
Spider    startRequest(List<Request> startRequests)    Set the start Requests of Spider; these take priority over the startUrls of Site.
Spider    startUrls(List<String> startUrls)    Set the startUrls of Spider; these take priority over the startUrls of Site.
void    stop()
void    stopWhenComplete()    Stop when all tasks in the queue have completed and all worker threads have finished.
void    test(String... urls)    Process specific urls without url discovery.
Spider    thread(int threadNum)    start with more than one thread
Spider    thread(ExecutorService executorService, int threadNum)    start with more than one thread
-
Field Details
-
downloader
protected Downloader downloader
-
pipelines
protected List<Pipeline> pipelines
-
pageProcessor
protected PageProcessor pageProcessor
-
startRequests
protected List<Request> startRequests
-
site
protected Site site
-
uuid
protected String uuid
-
scheduler
protected Scheduler scheduler
-
logger
protected org.slf4j.Logger logger
-
threadPool
protected CountableThreadPool threadPool
-
executorService
protected ExecutorService executorService
-
threadNum
protected int threadNum
-
stat
protected AtomicInteger stat
-
exitWhenComplete
protected volatile boolean exitWhenComplete
-
STAT_INIT
protected static final int STAT_INIT
-
STAT_RUNNING
protected static final int STAT_RUNNING
-
STAT_STOPPED
protected static final int STAT_STOPPED
-
spawnUrl
protected boolean spawnUrl
-
destroyWhenExit
protected boolean destroyWhenExit
-
-
Constructor Details
-
Spider
public Spider(PageProcessor pageProcessor)
create a spider with pageProcessor.
- Parameters:
pageProcessor - pageProcessor
-
-
Method Details
-
create
public static Spider create(PageProcessor pageProcessor)
create a spider with pageProcessor.
- Parameters:
pageProcessor - pageProcessor
- Returns:
- new spider
-
startUrls
Set the startUrls of Spider.
These take priority over the startUrls of Site.
- Parameters:
startUrls - startUrls
- Returns:
- this
-
startRequest
Set the start Requests of Spider.
These take priority over the startUrls of Site.
- Parameters:
startRequests - startRequests
- Returns:
- this
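For example, a sketch of seeding a spider with prepared Request objects (pageProcessor stands for any PageProcessor instance; the urls are illustrative):
List<Request> seeds = new ArrayList<Request>();
seeds.add(new Request("http://my.oschina.net/"));
seeds.add(new Request("http://www.oschina.net/"));
Spider.create(pageProcessor).startRequest(seeds).run();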
-
setUUID
Set a uuid for the spider.
The default uuid is the domain of the site.
- Parameters:
uuid - uuid
- Returns:
- this
-
scheduler
Deprecated. Use setScheduler(Scheduler) instead.
set scheduler for Spider
- Parameters:
scheduler - scheduler
- Returns:
- this
-
setScheduler
set scheduler for Spider
- Parameters:
updateScheduler - scheduler
- Returns:
- this
- Since:
- 0.2.1
-
pipeline
Deprecated. Use addPipeline(Pipeline) instead.
add a pipeline for Spider
- Parameters:
pipeline - pipeline
- Returns:
- this
-
addPipeline
add a pipeline for Spider
- Parameters:
pipeline - pipeline
- Returns:
- this
- Since:
- 0.2.1
-
setPipelines
set pipelines for Spider
- Parameters:
pipelines - pipelines
- Returns:
- this
- Since:
- 0.4.1
-
clearPipeline
clear the pipelines set
- Returns:
- this
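A sketch of replacing previously set pipelines (the default ConsolePipeline is installed only when no pipeline is set, so after clearing, add at least one):
spider.clearPipeline(); // drop all pipelines set so far
spider.addPipeline(new FilePipeline("/data/temp/webmagic/")); // persist results to files instead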
-
downloader
Deprecated. Use setDownloader(Downloader) instead.
set the downloader of spider
- Parameters:
downloader - downloader
- Returns:
- this
-
setDownloader
set the downloader of spider
- Parameters:
downloader - downloader
- Returns:
- this
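For instance, passing the stock HTTP downloader explicitly (HttpClientDownloader ships with WebMagic; a custom Downloader implementation is passed the same way):
Spider.create(pageProcessor).setDownloader(new HttpClientDownloader()).run();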
-
initComponent
protected void initComponent() -
run
public void run() -
onError
Deprecated. Use onError(Request, Exception) instead.
-
onError
-
onSuccess
-
close
public void close() -
test
Process specific urls without url discovery.
- Parameters:
urls - urls to process
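A sketch of debugging a PageProcessor against a single page; extracted links are not followed (pageProcessor and the url are illustrative):
Spider.create(pageProcessor).test("http://my.oschina.net/");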
-
sleep
protected void sleep(int time) -
extractAndAddRequests
-
checkIfRunning
protected void checkIfRunning() -
runAsync
public void runAsync() -
addUrl
Add urls to crawl.
- Parameters:
urls - urls
- Returns:
- this
-
getAll
Download urls synchronously.
- Type Parameters:
T - type of process result
- Parameters:
urls - urls
- Returns:
- list of downloaded results
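A sketch of synchronous downloading: with a plain Spider the collected results are ResultItems holding the fields put by the PageProcessor (pageProcessor and the urls are illustrative):
List<ResultItems> items = Spider.create(pageProcessor)
.getAll(Arrays.asList("http://my.oschina.net/", "http://www.oschina.net/"));
for (ResultItems item : items) {
    System.out.println(item.getAll()); // map of all extracted fields
}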
-
getCollectorPipeline
-
get
-
addRequest
Add urls with extra information to crawl.
- Parameters:
requests - requests
- Returns:
- this
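For example, attaching extra information that travels with the Request and is readable during processing (the key "source" is illustrative):
Request request = new Request("http://my.oschina.net/");
request.putExtra("source", "seed-list"); // later available via request.getExtra("source")
spider.addRequest(request);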
-
start
public void start() -
stop
public void stop() -
stopWhenComplete
public void stopWhenComplete()
Stop when all tasks in the queue have completed and all worker threads have finished.
-
thread
start with more than one thread
- Parameters:
threadNum - threadNum
- Returns:
- this
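For example, crawling with five worker threads (a sketch; pageProcessor stands for any PageProcessor instance):
Spider.create(pageProcessor).thread(5).run();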
-
thread
start with more than one thread
- Parameters:
executorService - executorService to run the spider
threadNum - threadNum
- Returns:
- this
-
isExitWhenComplete
public boolean isExitWhenComplete() -
setExitWhenComplete
Exit when complete.
True: exit when all urls of the site have been downloaded.
False: do not exit until stop() is called manually.
- Parameters:
exitWhenComplete - exitWhenComplete
- Returns:
- this
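A sketch of a long-running spider that stays alive after the queue drains, so urls can be fed in later; stop() must then be called manually:
Spider spider = Spider.create(pageProcessor).setExitWhenComplete(false);
spider.start(); // non-blocking
spider.addUrl("http://my.oschina.net/"); // urls may be added while running
// ... later, when done:
spider.stop();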
-
isSpawnUrl
public boolean isSpawnUrl() -
getPageCount
public long getPageCount()
Get the count of pages downloaded by the spider.
- Returns:
- total downloaded page count
- Since:
- 0.4.1
-
getStatus
Get the running status of the spider.
- Returns:
- running status
- Since:
- 0.4.1
-
getThreadAlive
public int getThreadAlive()
Get the count of currently running threads.
- Returns:
- count of currently running threads
- Since:
- 0.4.1
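A sketch of polling these counters to monitor a running spider (assumes the enclosing method declares throws InterruptedException):
Spider spider = Spider.create(pageProcessor).thread(5);
spider.start();
while (spider.getStatus() == Spider.Status.Running) {
    System.out.println(spider.getPageCount() + " pages, " + spider.getThreadAlive() + " threads alive");
    Thread.sleep(5000);
}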
-
setSpawnUrl
Whether to add extracted urls to the download queue.
When true, extracted urls are added for download; when false, only the seed urls are downloaded.
DO NOT set it unless you know what it means!
- Parameters:
spawnUrl - spawnUrl
- Returns:
- this
- Since:
- 0.4.0
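A sketch of a spider restricted to its seed urls, with link extraction results ignored:
Spider.create(pageProcessor)
.addUrl("http://my.oschina.net/")
.setSpawnUrl(false) // do not follow extracted links; download seeds only
.run();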
-
getUUID
Description copied from interface: Task
unique id for a task.
-
setExecutorService
-
getSite
Description copied from interface: Task
site of a task -
getSpiderListeners
-
setSpiderListeners
-
getStartTime
-
getScheduler
-
setEmptySleepTime
Set the wait time when no url is polled.
- Parameters:
emptySleepTime - in MILLISECONDS.
- Returns:
- this
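For example, a spider that re-checks for new urls every 30 seconds instead of exiting (a sketch, combined with setExitWhenComplete(false)):
Spider.create(pageProcessor)
.setEmptySleepTime(30000) // wait 30 s when the queue is empty
.setExitWhenComplete(false)
.start();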
-