us.codecraft.webmagic.Spider

All Implemented Interfaces: Runnable, Task

Direct Known Subclasses: OOSpider

public class Spider extends Object implements Runnable, Task

Entry point of a crawler.
A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline.
Every module is a field of Spider.
Each module is defined as an interface,
so you can customize a spider by supplying different implementations of them.
Examples:

A simple crawler:
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*")).run();

Store results in files with FilePipeline:
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*"))
    .addPipeline(new FilePipeline("/data/temp/webmagic/")).run();

Use FileCacheQueueScheduler to store URLs and the cursor in files, so that a Spider can resume from its previous state after a shutdown:
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*"))
    .setScheduler(new FileCacheQueueScheduler("/data/temp/webmagic/cache/")).run();
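
To make the four-module design more concrete, the following is a minimal, self-contained model of how a crawl loop of this shape might be wired together. It does not use WebMagic itself: the `Mini*` interfaces, their method signatures, `DedupQueueScheduler` and the in-memory "site" are all assumptions made for illustration, and the real Downloader, Scheduler, PageProcessor and Pipeline interfaces are richer than this.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical stand-ins for the four modules. These are NOT the WebMagic interfaces.
interface MiniDownloader { String download(String url); }
interface MiniScheduler { void push(String url); String poll(); }
interface MiniPageProcessor {
    // Fills 'fields' with extracted data and returns the links found in the body.
    List<String> process(String body, Map<String, String> fields);
}
interface MiniPipeline { void save(String url, Map<String, String> fields); }

// A FIFO scheduler that ignores URLs it has already seen.
class DedupQueueScheduler implements MiniScheduler {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    public void push(String url) { if (seen.add(url)) queue.add(url); }
    public String poll() { return queue.poll(); }
}

// Every module is a field, so each one can be swapped independently.
class MiniSpider {
    private final MiniPageProcessor pageProcessor;
    private MiniDownloader downloader;
    private MiniScheduler scheduler = new DedupQueueScheduler();
    private final List<MiniPipeline> pipelines = new ArrayList<>();

    MiniSpider(MiniPageProcessor pageProcessor) { this.pageProcessor = pageProcessor; }
    MiniSpider setDownloader(MiniDownloader d) { this.downloader = d; return this; }
    MiniSpider setScheduler(MiniScheduler s) { this.scheduler = s; return this; }
    MiniSpider addPipeline(MiniPipeline p) { pipelines.add(p); return this; }
    MiniSpider addUrl(String... urls) {
        for (String url : urls) scheduler.push(url);
        return this;
    }

    // Poll a URL, download it, process it, then hand the result to every pipeline.
    void run() {
        String url;
        while ((url = scheduler.poll()) != null) {
            String body = downloader.download(url);
            if (body == null) continue; // treat a missing page as a failed download
            Map<String, String> fields = new LinkedHashMap<>();
            for (String link : pageProcessor.process(body, fields)) scheduler.push(link);
            for (MiniPipeline pipeline : pipelines) pipeline.save(url, fields);
        }
    }
}

class MiniSpiderDemo {
    // Crawls a three-page in-memory "site" and returns url -> title in visit order.
    static Map<String, String> crawl() {
        Map<String, String> web = new HashMap<>(); // body format: "title|link,link"
        web.put("/", "Home|/blog,/about");
        web.put("/blog", "Blog|/");
        web.put("/about", "About|");

        Map<String, String> titles = new LinkedHashMap<>();
        new MiniSpider((body, fields) -> {
                String[] parts = body.split("\\|", -1);
                fields.put("title", parts[0]);
                return parts[1].isEmpty()
                        ? Collections.<String>emptyList()
                        : Arrays.asList(parts[1].split(","));
            })
            .setDownloader(web::get)
            .addPipeline((url, fields) -> titles.put(url, fields.get("title")))
            .addUrl("/")
            .run();
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(crawl());
    }
}
```

Because each module sits behind its own interface, replacing the scheduler or adding a second pipeline in this model changes one call in the chain and nothing else, which is the customization style the class description refers to.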

Since:

0.1.0

Author:

code4crafter@gmail.com

Nested Class Summary

Nested Classes

| Modifier and Type | Class | Description |
|---|---|---|
| static enum | Spider.Status | |

Field Summary

Fields

| Modifier and Type | Field | Description |
|---|---|---|
| protected boolean | destroyWhenExit | |
| protected Downloader | downloader | |
| protected ExecutorService | executorService | |
| protected boolean | exitWhenComplete | |
| protected org.slf4j.Logger | logger | |
| protected PageProcessor | pageProcessor | |
| protected List<Pipeline> | pipelines | |
| protected SpiderScheduler | scheduler | |
| protected Site | site | |
| protected boolean | spawnUrl | |
| protected List<Request> | startRequests | |
| protected AtomicInteger | stat | |
| protected static final int | STAT_INIT | |
| protected static final int | STAT_RUNNING | |
| protected static final int | STAT_STOPPED | |
| protected int | threadNum | |
| protected CountableThreadPool | threadPool | |
| protected String | uuid | |

Constructor Summary

Constructors

| Constructor | Description |
|---|---|
| Spider(PageProcessor pageProcessor) | Create a spider with the given pageProcessor. |

Method Summary

| Modifier and Type | Method | Description |
|---|---|---|
| Spider | addPipeline(Pipeline pipeline) | Add a pipeline to the Spider. |
| Spider | addRequest(Request... requests) | Add requests (URLs with additional information) to crawl. |
| Spider | addUrl(String... urls) | Add URLs to crawl. |
| protected void | checkIfRunning() | |
| Spider | clearPipeline() | Clear the list of pipelines. |
| void | close() | |
| static Spider | create(PageProcessor pageProcessor) | Create a spider with the given pageProcessor. |
| Spider | downloader(Downloader downloader) | Deprecated. |
| protected void | extractAndAddRequests(Page page, boolean spawnUrl) | |
| <T> T | get(String url) | |
| <T> List<T> | getAll(Collection<String> urls) | Download URLs synchronously. |
| protected CollectorPipeline | getCollectorPipeline() | |
| long | getPageCount() | Get the number of pages downloaded by the spider. |
| Scheduler | getScheduler() | |
| Site | getSite() | Site of a task. |
| List<SpiderListener> | getSpiderListeners() | |
| Date | getStartTime() | |
| Spider.Status | getStatus() | Get the running status of the spider. |
| int | getThreadAlive() | Get the number of running threads. |
| String | getUUID() | Unique id for a task. |
| protected void | initComponent() | |
| boolean | isExitWhenComplete() | |
| boolean | isSpawnUrl() | |
| protected void | onError(Request request) | Deprecated. Use onError(Request, Exception) instead. |
| protected void | onError(Request request, Exception e) | |
| protected void | onSuccess(Request request) | |
| Spider | pipeline(Pipeline pipeline) | Deprecated. |
| void | run() | |
| void | runAsync() | |
| Spider | scheduler(Scheduler scheduler) | Deprecated. |
| Spider | setDownloader(Downloader downloader) | Set the downloader of the spider. |
| Spider | setEmptySleepTime(long emptySleepTime) | Set the wait time when no URL is polled. |
| Spider | setExecutorService(ExecutorService executorService) | |
| Spider | setExitWhenComplete(boolean exitWhenComplete) | Exit when complete. |
| Spider | setPipelines(List<Pipeline> pipelines) | Set the pipelines for the Spider. |
| Spider | setScheduler(Scheduler updateScheduler) | Set the scheduler for the Spider. |
| Spider | setSpawnUrl(boolean spawnUrl) | Whether to add extracted URLs to the download queue. When true, extracted URLs are added for download; when false, only the seed URLs are downloaded. |
| Spider | setSpiderListeners(List<SpiderListener> spiderListeners) | |
| Spider | setUUID(String uuid) | Set a UUID for the spider. The default UUID is the domain of the site. |
| protected void | sleep(int time) | |
| void | start() | |
| Spider | startRequest(List<Request> startRequests) | Set the start requests of the Spider. These take priority over the startUrls of Site. |
| Spider | startUrls(List<String> startUrls) | Set the startUrls of the Spider. These take priority over the startUrls of Site. |
| void | stop() | |
| void | stopWhenComplete() | Stop when all tasks in the queue are completed and all worker threads have finished. |
| void | test(String... urls) | Process specific URLs without URL discovery. |
| Spider | thread(int threadNum) | Start with more than one thread. |
| Spider | thread(ExecutorService executorService, int threadNum) | Start with more than one thread. |
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- downloader
  
  protected Downloader downloader
- pipelines
  
  protected List<Pipeline> pipelines
- pageProcessor
  
  protected PageProcessor pageProcessor
- startRequests
  
  protected List<Request> startRequests
- site
  
  protected Site site
- uuid
  
  protected String uuid
- scheduler
  
  protected SpiderScheduler scheduler
- logger
  
  protected org.slf4j.Logger logger
- threadPool
  
  protected CountableThreadPool threadPool
- executorService
  
  protected ExecutorService executorService
- threadNum
  
  protected int threadNum
- stat
  
  protected AtomicInteger stat
- exitWhenComplete
  
  protected volatile boolean exitWhenComplete
- STAT_INIT
  
  protected static final int STAT_INIT
  See Also:
  
  Constant Field Values
- STAT_RUNNING
  
  protected static final int STAT_RUNNING
  See Also:
  
  Constant Field Values
- STAT_STOPPED
  
  protected static final int STAT_STOPPED
  See Also:
  
  Constant Field Values
- spawnUrl
  
  protected boolean spawnUrl
- destroyWhenExit
  
  protected boolean destroyWhenExit
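
The `stat` field and the three `STAT_*` constants listed above suggest that the lifecycle of a spider is tracked as an integer state held in an `AtomicInteger`. The sketch below shows one common way to guard such a state with compare-and-set, so that the same task cannot be started twice from different threads. It is a standalone illustration rather than WebMagic's code: the class name `LifecycleSketch`, the method names, and the constant values 0, 1 and 2 are assumptions (the actual values are listed under "Constant Field Values").

```java
import java.util.concurrent.atomic.AtomicInteger;

// Standalone model of an AtomicInteger-based lifecycle; NOT WebMagic's implementation.
class LifecycleSketch {
    // Assumed values for illustration only.
    static final int STAT_INIT = 0;
    static final int STAT_RUNNING = 1;
    static final int STAT_STOPPED = 2;

    private final AtomicInteger stat = new AtomicInteger(STAT_INIT);

    // Moves INIT or STOPPED to RUNNING; throws if the state is already RUNNING.
    void enterRunning() {
        while (true) {
            int current = stat.get();
            if (current == STAT_RUNNING) {
                throw new IllegalStateException("already running");
            }
            // Retry if another thread changed the state between get() and here.
            if (stat.compareAndSet(current, STAT_RUNNING)) {
                return;
            }
        }
    }

    // Moves RUNNING to STOPPED; returns false if the state was not RUNNING.
    boolean stop() {
        return stat.compareAndSet(STAT_RUNNING, STAT_STOPPED);
    }

    int status() {
        return stat.get();
    }

    public static void main(String[] args) {
        LifecycleSketch lifecycle = new LifecycleSketch();
        lifecycle.enterRunning();
        System.out.println("stopped: " + lifecycle.stop());
    }
}
```

Using compare-and-set rather than a plain `int` with a lock keeps the check and the transition atomic without blocking, which fits a state that is read often and written rarely.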
Constructor Details
- Spider
  
  public Spider(PageProcessor pageProcessor)
  
  Create a spider with the given pageProcessor.
  
  Parameters:
  
  pageProcessor - pageProcessor
Method Details
- create
  
  public static Spider create(PageProcessor pageProcessor)
  
  Create a spider with the given pageProcessor.
  
  Parameters:
  
  pageProcessor - pageProcessor
  
  Returns:
  
  new spider
  
  See Also:
  
  PageProcessor
- startUrls
  
  public Spider startUrls(List<String> startUrls)
  
  Set the startUrls of the Spider.
  These take priority over the startUrls of Site.
  
  Parameters:
  
  startUrls - startUrls
  
  Returns:
  
  this
- startRequest
  
  public Spider startRequest(List<Request> startRequests)
  
  Set the start requests of the Spider.
  These take priority over the startUrls of Site.
  
  Parameters:
  
  startRequests - startRequests
  
  Returns:
  
  this
- setUUID
  
  public Spider setUUID(String uuid)
  
  Set a UUID for the spider.
  The default UUID is the domain of the site.
  
  Parameters:
  
  uuid - uuid
  
  Returns:
  
  this
- scheduler
  
  @Deprecated public Spider scheduler(Scheduler scheduler)
  
  Deprecated.
  
  Set the scheduler for the Spider.
  
  Parameters:
  
  scheduler - scheduler
  
  Returns:
  
  this
  
  See Also:
  
  setScheduler(us.codecraft.webmagic.scheduler.Scheduler)
- setScheduler
  
  public Spider setScheduler(Scheduler updateScheduler)
  
  Set the scheduler for the Spider.
  
  Parameters:
  
  updateScheduler - scheduler
  
  Returns:
  
  this
  
  Since:
  
  0.2.1
  
  See Also:
  
  Scheduler
- pipeline
  
  @Deprecated public Spider pipeline(Pipeline pipeline)
  
  Deprecated.
  
  Add a pipeline to the Spider.
  
  Parameters:
  
  pipeline - pipeline
  
  Returns:
  
  this
  
  See Also:
  
  addPipeline(us.codecraft.webmagic.pipeline.Pipeline)
- addPipeline
  
  public Spider addPipeline(Pipeline pipeline)
  
  Add a pipeline to the Spider.
  
  Parameters:
  
  pipeline - pipeline
  
  Returns:
  
  this
  
  Since:
  
  0.2.1
  
  See Also:
  
  Pipeline
- setPipelines
  
  public Spider setPipelines(List<Pipeline> pipelines)
  
  Set the pipelines for the Spider.
  
  Parameters:
  
  pipelines - pipelines
  
  Returns:
  
  this
  
  Since:
  
  0.4.1
  
  See Also:
  
  Pipeline
- clearPipeline
  
  public Spider clearPipeline()
  
  Clear the list of pipelines.
  
  Returns:
  
  this
- downloader
  
  @Deprecated public Spider downloader(Downloader downloader)
  
  Deprecated.
  
  Set the downloader of the spider.
  
  Parameters:
  
  downloader - downloader
  
  Returns:
  
  this
  
  See Also:
  
  setDownloader(us.codecraft.webmagic.downloader.Downloader)
- setDownloader
  
  public Spider setDownloader(Downloader downloader)
  
  Set the downloader of the spider.
  
  Parameters:
  
  downloader - downloader
  
  Returns:
  
  this
  
  See Also:
  
  Downloader
- initComponent
  
  protected void initComponent()
- run
  
  public void run()
  
  Specified by:
  
  run in interface Runnable
- onError
  
  @Deprecated protected void onError(Request request)
  
  Deprecated.
  Use onError(Request, Exception) instead.
- onError
  
  protected void onError(Request request, Exception e)
- onSuccess
  
  protected void onSuccess(Request request)
- close
  
  public void close()
- test
  
  public void test(String... urls)
  
  Process specific URLs without URL discovery.
  
  Parameters:
  
  urls - urls to process
- sleep
  
  protected void sleep(int time)
- extractAndAddRequests
  
  protected void extractAndAddRequests(Page page, boolean spawnUrl)
- checkIfRunning
  
  protected void checkIfRunning()
- runAsync
  
  public void runAsync()
- addUrl
  
  public Spider addUrl(String... urls)
  
  Add URLs to crawl.
  
  Parameters:
  
  urls - urls
  
  Returns:
  
  this
- getAll
  
  public <T> List<T> getAll(Collection<String> urls)
  
  Download URLs synchronously.
  
  Type Parameters:
  
  T - type of process result
  
  Parameters:
  
  urls - urls
  
  Returns:
  
  list of downloaded results
- getCollectorPipeline
  
  protected CollectorPipeline getCollectorPipeline()
- get
  
  public <T> T get(String url)
- addRequest
  
  public Spider addRequest(Request... requests)
  
  Add requests (URLs with additional information) to crawl.
  
  Parameters:
  
  requests - requests
  
  Returns:
  
  this
- start
  
  public void start()
- stop
  
  public void stop()
- stopWhenComplete
  
  public void stopWhenComplete()
  
  Stop when all tasks in the queue are completed and all worker threads have finished.
- thread
  
  public Spider thread(int threadNum)
  
  Start with more than one thread.
  
  Parameters:
  
  threadNum - threadNum
  
  Returns:
  
  this
- thread
  
  public Spider thread(ExecutorService executorService, int threadNum)
  
  Start with more than one thread.
  
  Parameters:
  
  executorService - executorService to run the spider
  
  threadNum - threadNum
  
  Returns:
  
  this
- isExitWhenComplete
  
  public boolean isExitWhenComplete()
- setExitWhenComplete
  
  public Spider setExitWhenComplete(boolean exitWhenComplete)
  
  Exit when complete.
  True: exit when all URLs of the site have been downloaded.
  False: do not exit until stop() is called manually.
  
  Parameters:
  
  exitWhenComplete - exitWhenComplete
  
  Returns:
  
  this
- isSpawnUrl
  
  public boolean isSpawnUrl()
- getPageCount
  
  public long getPageCount()
  
  Get the number of pages downloaded by the spider.
  
  Returns:
  
  total downloaded page count
  
  Since:
  
  0.4.1
- getStatus
  
  public Spider.Status getStatus()
  
  Get the running status of the spider.
  
  Returns:
  
  running status
  
  Since:
  
  0.4.1
  
  See Also:
  
  Spider.Status
- getThreadAlive
  
  public int getThreadAlive()
  
  Get the number of running threads.
  
  Returns:
  
  number of running threads
  
  Since:
  
  0.4.1
- setSpawnUrl
  
  public Spider setSpawnUrl(boolean spawnUrl)
  
  Whether to add extracted URLs to the download queue.
  When true, extracted URLs are added for download; when false, only the seed URLs are downloaded.
  Do NOT set it unless you know what it means!
  
  Parameters:
  
  spawnUrl - spawnUrl
  
  Returns:
  
  this
  
  Since:
  
  0.4.0
- getUUID
  
  public String getUUID()
  
  Description copied from interface: Task
  
  unique id for a task.
  
  Specified by:
  
  getUUID in interface Task
  
  Returns:
  
  uuid
- setExecutorService
  
  public Spider setExecutorService(ExecutorService executorService)
- getSite
  
  public Site getSite()
  
  Description copied from interface: Task
  
  site of a task
  
  Specified by:
  
  getSite in interface Task
  
  Returns:
  
  site
- getSpiderListeners
  
  public List<SpiderListener> getSpiderListeners()
- setSpiderListeners
  
  public Spider setSpiderListeners(List<SpiderListener> spiderListeners)
- getStartTime
  
  public Date getStartTime()
- getScheduler
  
  public Scheduler getScheduler()
- setEmptySleepTime
  
  public Spider setEmptySleepTime(long emptySleepTime)
  
  Set the wait time when no URL is polled.
  
  Parameters:
  
  emptySleepTime - wait time in milliseconds
  
  Returns:
  
  this
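
The descriptions of setExitWhenComplete(boolean) and setEmptySleepTime(long) above imply a loop that either ends or idles when the queue runs dry. The following standalone sketch models that behaviour with a single worker. It is not WebMagic's implementation: the class `IdleLoopSketch`, its fields, the `idleWaits` counter and the 1000 ms default are assumptions chosen for illustration.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Standalone model of an idle-aware crawl loop; NOT WebMagic's implementation.
class IdleLoopSketch {
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();
    private final List<String> processed = new CopyOnWriteArrayList<>();
    private final AtomicInteger idleWaits = new AtomicInteger();
    private volatile boolean stopped = false;
    private volatile boolean exitWhenComplete = true;
    private volatile long emptySleepTime = 1000; // milliseconds; arbitrary value for this sketch

    IdleLoopSketch setExitWhenComplete(boolean value) { exitWhenComplete = value; return this; }
    IdleLoopSketch setEmptySleepTime(long millis) { emptySleepTime = millis; return this; }
    IdleLoopSketch addUrl(String... urls) {
        for (String url : urls) queue.add(url);
        return this;
    }

    void stop() { stopped = true; }
    List<String> processed() { return processed; }
    int idleWaits() { return idleWaits.get(); }

    void run() {
        while (!stopped) {
            String url = queue.poll();
            if (url != null) {
                processed.add(url); // stands in for download + process + pipeline
                continue;
            }
            if (exitWhenComplete) {
                break; // queue is empty and we were told to exit on completion
            }
            // Queue is empty but we must stay alive: wait, then poll again.
            idleWaits.incrementAndGet();
            try {
                Thread.sleep(emptySleepTime);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }

    public static void main(String[] args) {
        IdleLoopSketch sketch = new IdleLoopSketch().addUrl("u1", "u2");
        sketch.run(); // returns once the queue is empty
        System.out.println(sketch.processed());
    }
}
```

With `setExitWhenComplete(false)` the loop in this model keeps polling every `emptySleepTime` milliseconds until `stop()` is called, so URLs added later by another thread are still picked up.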
