Class Spider

java.lang.Object
us.codecraft.webmagic.Spider
All Implemented Interfaces:
Runnable, Task
Direct Known Subclasses:
OOSpider

public class Spider extends Object implements Runnable, Task
Entry point of a crawler.
A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline.
Each module is held as a field of Spider.
Each module is defined as an interface.
You can customize a spider by supplying different implementations of them.
Examples:

A simple crawler:
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*")).run();

Store results to files with FilePipeline:
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*"))
.addPipeline(new FilePipeline("/data/temp/webmagic/")).run();

Use FileCacheQueueScheduler to store urls and the cursor in files, so that a Spider can resume from its previous state after a shutdown:
Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*"))
.setScheduler(new FileCacheQueueScheduler("/data/temp/webmagic/cache/")).run();
Since:
0.1.0
Author:
code4crafter@gmail.com
  • Field Details

    • downloader

      protected Downloader downloader
    • pipelines

      protected List<Pipeline> pipelines
    • pageProcessor

      protected PageProcessor pageProcessor
    • startRequests

      protected List<Request> startRequests
    • site

      protected Site site
    • uuid

      protected String uuid
    • scheduler

      protected SpiderScheduler scheduler
    • logger

      protected org.slf4j.Logger logger
    • threadPool

      protected CountableThreadPool threadPool
    • executorService

      protected ExecutorService executorService
    • threadNum

      protected int threadNum
    • stat

      protected AtomicInteger stat
    • exitWhenComplete

      protected volatile boolean exitWhenComplete
    • STAT_INIT

      protected static final int STAT_INIT
    • STAT_RUNNING

      protected static final int STAT_RUNNING
    • STAT_STOPPED

      protected static final int STAT_STOPPED
    • spawnUrl

      protected boolean spawnUrl
    • destroyWhenExit

      protected boolean destroyWhenExit
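
The stat field together with STAT_INIT, STAT_RUNNING and STAT_STOPPED suggests a small lifecycle state machine. The following is a minimal sketch of how such a guard could be built on compare-and-set. The class name LifecycleSketch, the constant values 0, 1 and 2, and the exception message are assumptions made for illustration; they are not taken from this page and this is not the WebMagic implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a spider's lifecycle guard; not the WebMagic source.
class LifecycleSketch {
    // Assumed values: this page lists the constants but not what they are set to.
    static final int STAT_INIT = 0;
    static final int STAT_RUNNING = 1;
    static final int STAT_STOPPED = 2;

    private final AtomicInteger stat = new AtomicInteger(STAT_INIT);

    // Moves to RUNNING from INIT or STOPPED; fails fast if already running.
    void checkIfRunning() {
        while (true) {
            int current = stat.get();
            if (current == STAT_RUNNING) {
                throw new IllegalStateException("Spider is already running!");
            }
            // Retry if another thread changed the state between get() and here.
            if (stat.compareAndSet(current, STAT_RUNNING)) {
                return;
            }
        }
    }

    // Moves RUNNING -> STOPPED; returns false if the spider was not running.
    boolean stop() {
        return stat.compareAndSet(STAT_RUNNING, STAT_STOPPED);
    }

    int status() {
        return stat.get();
    }

    public static void main(String[] args) {
        LifecycleSketch sketch = new LifecycleSketch();
        sketch.checkIfRunning();
        System.out.println("running: " + (sketch.status() == STAT_RUNNING));
        System.out.println("stopped: " + sketch.stop());
    }
}
```

Using an atomic integer rather than a plain boolean lets two threads race on start or stop with only one of them winning the transition.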
  • Constructor Details

    • Spider

      public Spider(PageProcessor pageProcessor)
      Create a spider with the given pageProcessor.
      Parameters:
      pageProcessor - pageProcessor
  • Method Details

    • create

      public static Spider create(PageProcessor pageProcessor)
      Create a spider with the given pageProcessor.
      Parameters:
      pageProcessor - pageProcessor
      Returns:
      new spider
    • startUrls

      public Spider startUrls(List<String> startUrls)
      Set the startUrls of the Spider.
      These take priority over the startUrls of Site.
      Parameters:
      startUrls - startUrls
      Returns:
      this
    • startRequest

      public Spider startRequest(List<Request> startRequests)
      Set the startRequests of the Spider.
      These take priority over the startUrls of Site.
      Parameters:
      startRequests - startRequests
      Returns:
      this
    • setUUID

      public Spider setUUID(String uuid)
      Set a uuid for the spider.
      The default uuid is the domain of the site.
      Parameters:
      uuid - uuid
      Returns:
      this
    • scheduler

      @Deprecated public Spider scheduler(Scheduler scheduler)
      Deprecated.
      Set the scheduler for the Spider.
      Parameters:
      scheduler - scheduler
      Returns:
      this
    • setScheduler

      public Spider setScheduler(Scheduler updateScheduler)
      Set the scheduler for the Spider.
      Parameters:
      updateScheduler - scheduler
      Returns:
      this
      Since:
      0.2.1
    • pipeline

      @Deprecated public Spider pipeline(Pipeline pipeline)
      Deprecated.
      Add a pipeline to the Spider.
      Parameters:
      pipeline - pipeline
      Returns:
      this
    • addPipeline

      public Spider addPipeline(Pipeline pipeline)
      Add a pipeline to the Spider.
      Parameters:
      pipeline - pipeline
      Returns:
      this
      Since:
      0.2.1
    • setPipelines

      public Spider setPipelines(List<Pipeline> pipelines)
      Set the pipelines for the Spider.
      Parameters:
      pipelines - pipelines
      Returns:
      this
      Since:
      0.4.1
    • clearPipeline

      public Spider clearPipeline()
      Clear all pipelines that have been set.
      Returns:
      this
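
The pipeline methods above maintain an ordered list, and each returns this so that calls can be chained. The sketch below models that behaviour under stated assumptions: PipelineChainSketch and emit are hypothetical names, a java.util.function.Consumer stands in for the Pipeline interface, and handing each result to every pipeline in registration order is an assumption about typical use, not a statement of the library's internals.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical model of a pipeline list; Consumer stands in for the Pipeline interface.
class PipelineChainSketch {
    private final List<Consumer<Map<String, Object>>> pipelines = new ArrayList<>();

    // Appends one pipeline and returns this for chaining.
    PipelineChainSketch addPipeline(Consumer<Map<String, Object>> pipeline) {
        pipelines.add(pipeline);
        return this;
    }

    // Replaces the whole list with the given pipelines.
    PipelineChainSketch setPipelines(List<Consumer<Map<String, Object>>> newPipelines) {
        pipelines.clear();
        pipelines.addAll(newPipelines);
        return this;
    }

    // Removes every registered pipeline.
    PipelineChainSketch clearPipeline() {
        pipelines.clear();
        return this;
    }

    // Hands one extracted result to every pipeline, in registration order.
    void emit(Map<String, Object> result) {
        for (Consumer<Map<String, Object>> pipeline : pipelines) {
            pipeline.accept(result);
        }
    }

    public static void main(String[] args) {
        new PipelineChainSketch()
                .addPipeline(r -> System.out.println("console: " + r))
                .addPipeline(r -> System.out.println("title only: " + r.get("title")))
                .emit(Collections.<String, Object>singletonMap("title", "Hello"));
    }
}
```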
    • downloader

      @Deprecated public Spider downloader(Downloader downloader)
      Deprecated.
      Set the downloader of the spider.
      Parameters:
      downloader - downloader
      Returns:
      this
    • setDownloader

      public Spider setDownloader(Downloader downloader)
      Set the downloader of the spider.
      Parameters:
      downloader - downloader
      Returns:
      this
    • initComponent

      protected void initComponent()
    • run

      public void run()
      Specified by:
      run in interface Runnable
    • onError

      @Deprecated protected void onError(Request request)
      Deprecated.
    • onError

      protected void onError(Request request, Exception e)
    • onSuccess

      protected void onSuccess(Request request)
    • close

      public void close()
    • test

      public void test(String... urls)
      Process the given urls without discovering new urls.
      Parameters:
      urls - urls to process
    • sleep

      protected void sleep(int time)
    • extractAndAddRequests

      protected void extractAndAddRequests(Page page, boolean spawnUrl)
    • checkIfRunning

      protected void checkIfRunning()
    • runAsync

      public void runAsync()
    • addUrl

      public Spider addUrl(String... urls)
      Add urls to crawl.
      Parameters:
      urls - urls
      Returns:
      this
    • getAll

      public <T> List<T> getAll(Collection<String> urls)
      Download urls synchronously.
      Type Parameters:
      T - type of process result
      Parameters:
      urls - urls
      Returns:
      list of process results for the downloaded urls
    • getCollectorPipeline

      protected CollectorPipeline getCollectorPipeline()
    • get

      public <T> T get(String url)
    • addRequest

      public Spider addRequest(Request... requests)
      Add requests (urls with additional information) to crawl.
      Parameters:
      requests - requests
      Returns:
      this
    • start

      public void start()
    • stop

      public void stop()
    • stopWhenComplete

      public void stopWhenComplete()
      Stop once all tasks in the queue have completed and all worker threads have finished.
    • thread

      public Spider thread(int threadNum)
      Start with more than one thread.
      Parameters:
      threadNum - threadNum
      Returns:
      this
    • thread

      public Spider thread(ExecutorService executorService, int threadNum)
      Start with more than one thread, using the given executorService.
      Parameters:
      executorService - executorService to run the spider
      threadNum - threadNum
      Returns:
      this
    • isExitWhenComplete

      public boolean isExitWhenComplete()
    • setExitWhenComplete

      public Spider setExitWhenComplete(boolean exitWhenComplete)
      Set whether to exit when complete.
      True: exit once all urls of the site have been downloaded.
      False: do not exit until stop() is called manually.
      Parameters:
      exitWhenComplete - exitWhenComplete
      Returns:
      this
    • isSpawnUrl

      public boolean isSpawnUrl()
    • getPageCount

      public long getPageCount()
      Get the number of pages downloaded by the spider.
      Returns:
      total downloaded page count
      Since:
      0.4.1
    • getStatus

      public Spider.Status getStatus()
      Get the running status of the spider.
      Returns:
      running status
      Since:
      0.4.1
    • getThreadAlive

      public int getThreadAlive()
      Get the number of threads that are currently running.
      Returns:
      number of running threads
      Since:
      0.4.1
    • setSpawnUrl

      public Spider setSpawnUrl(boolean spawnUrl)
      Set whether extracted urls are added to the download queue.
      When true, extracted urls are added for download; when false, only the seed urls are downloaded.
      DO NOT set it unless you know what it means!
      Parameters:
      spawnUrl - spawnUrl
      Returns:
      this
      Since:
      0.4.0
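
The effect of spawnUrl can be illustrated with a toy crawl over an in-memory link graph. This is a sketch under stated assumptions: SpawnUrlSketch, crawl and the links map are hypothetical names, the graph replaces real downloads, and the duplicate check with a set is an illustrative choice rather than a description of how the library deduplicates.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy crawler; the links map stands in for pages and the urls they contain.
class SpawnUrlSketch {
    // Returns the urls "downloaded", in the order they were taken from the queue.
    static List<String> crawl(Map<String, List<String>> links, List<String> seeds, boolean spawnUrl) {
        Deque<String> queue = new ArrayDeque<>(seeds);
        Set<String> seen = new HashSet<>(seeds);
        List<String> downloaded = new ArrayList<>();
        while (!queue.isEmpty()) {
            String url = queue.poll();
            downloaded.add(url);
            if (!spawnUrl) {
                continue; // seed urls only: extracted links are ignored
            }
            for (String next : links.getOrDefault(url, Collections.<String>emptyList())) {
                if (seen.add(next)) { // enqueue each discovered url at most once
                    queue.add(next);
                }
            }
        }
        return downloaded;
    }
}
```

With a graph where a links to b and c, and b links to c and d, a crawl seeded with a visits all four urls when spawnUrl is true and only a when it is false.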
    • getUUID

      public String getUUID()
      Description copied from interface: Task
      unique id for a task.
      Specified by:
      getUUID in interface Task
      Returns:
      uuid
    • setExecutorService

      public Spider setExecutorService(ExecutorService executorService)
    • getSite

      public Site getSite()
      Description copied from interface: Task
      site of a task
      Specified by:
      getSite in interface Task
      Returns:
      site
    • getSpiderListeners

      public List<SpiderListener> getSpiderListeners()
    • setSpiderListeners

      public Spider setSpiderListeners(List<SpiderListener> spiderListeners)
    • getStartTime

      public Date getStartTime()
    • getScheduler

      public Scheduler getScheduler()
    • setEmptySleepTime

      public Spider setEmptySleepTime(long emptySleepTime)
      Set the wait time when no url is polled.
      Parameters:
      emptySleepTime - In MILLISECONDS.
      Returns:
      this
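
How exitWhenComplete and emptySleepTime interact can be sketched with a simplified worker loop. Everything here is an assumption made for illustration: IdleLoopSketch is a hypothetical class, the loop is single-threaded, Thread.sleep is used for idling, and the initial sleep value is arbitrary. It models the documented behaviour and is not WebMagic's run() method.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical worker loop; it models the two settings, not WebMagic's actual run().
class IdleLoopSketch {
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger processed = new AtomicInteger();
    private volatile boolean stopped = false;
    private volatile boolean exitWhenComplete = true;
    private volatile long emptySleepTime = 10; // milliseconds; arbitrary value for this sketch

    IdleLoopSketch addUrl(String url) { queue.add(url); return this; }
    IdleLoopSketch setExitWhenComplete(boolean value) { exitWhenComplete = value; return this; }
    IdleLoopSketch setEmptySleepTime(long millis) { emptySleepTime = millis; return this; }
    void stop() { stopped = true; }
    int processed() { return processed.get(); }

    void run() {
        while (!stopped) {
            String url = queue.poll();
            if (url != null) {
                processed.incrementAndGet(); // stand-in for download and process
                continue;
            }
            if (exitWhenComplete) {
                return; // queue drained: finish without waiting for stop()
            }
            try {
                Thread.sleep(emptySleepTime); // idle until new urls arrive or stop() is called
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

With exitWhenComplete left at true, run() returns as soon as the queue is empty. With it set to false, the loop keeps polling every emptySleepTime milliseconds until another thread calls stop().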