Class Spider

  • All Implemented Interfaces:
    java.lang.Runnable, Task
    Direct Known Subclasses:
    OOSpider

    public class Spider
    extends java.lang.Object
    implements java.lang.Runnable, Task
    Entry point of a crawler.
    A spider contains four modules: Downloader, Scheduler, PageProcessor and Pipeline.
    Every module is a field of Spider.
    Each module is defined by an interface.
    You can customize a spider by supplying different implementations of them.
    Examples:

    A simple crawler:
    Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*")).run();

    Store results in files with FilePipeline:
    Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*"))
    .addPipeline(new FilePipeline("/data/temp/webmagic/")).run();

    Use FileCacheQueueScheduler to store URLs and the cursor in files, so that a Spider can resume from its previous state after a shutdown:
    Spider.create(new SimplePageProcessor("http://my.oschina.net/", "http://my.oschina.net/*blog/*"))
    .setScheduler(new FileCacheQueueScheduler("/data/temp/webmagic/cache/")).run();
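
The four-module design described above can be pictured with a small self-contained sketch. Everything in it is an assumption made for illustration: the `Sketch*` interfaces, `SketchPage`, `MiniSpider` and `MiniSpiderDemo` are simplified stand-ins, not the real WebMagic types or signatures, and the "site" is an in-memory map rather than a network download.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Simplified stand-ins for the four modules. These are NOT the real WebMagic interfaces.
interface SketchDownloader { String download(String url); }          // returns a page body, or null
interface SketchScheduler { void push(String url); String poll(); }  // holds URLs waiting to be crawled
interface SketchPageProcessor { void process(SketchPage page); }     // extracts results and links
interface SketchPipeline { void process(List<String> results); }     // consumes extracted results

// Hypothetical page holder handed from the downloader to the processor.
class SketchPage {
    final String url;
    final String body;
    final List<String> results = new ArrayList<>();
    final List<String> links = new ArrayList<>();
    SketchPage(String url, String body) { this.url = url; this.body = body; }
}

// In-memory FIFO scheduler that ignores URLs it has already seen.
class SketchQueueScheduler implements SketchScheduler {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();
    public void push(String url) { if (seen.add(url)) queue.add(url); }
    public String poll() { return queue.poll(); }
}

// Hypothetical mini-crawler: every module is a field and can be swapped out.
class MiniSpider {
    private final SketchPageProcessor pageProcessor;
    private final SketchDownloader downloader;
    private SketchScheduler scheduler = new SketchQueueScheduler();
    private final List<SketchPipeline> pipelines = new ArrayList<>();

    MiniSpider(SketchPageProcessor pageProcessor, SketchDownloader downloader) {
        this.pageProcessor = pageProcessor;
        this.downloader = downloader;
    }

    MiniSpider setScheduler(SketchScheduler s) { this.scheduler = s; return this; }
    MiniSpider addPipeline(SketchPipeline p) { pipelines.add(p); return this; }
    MiniSpider addUrl(String... urls) { for (String u : urls) scheduler.push(u); return this; }

    void run() {
        String url;
        while ((url = scheduler.poll()) != null) {
            String body = downloader.download(url);
            if (body == null) continue;                          // nothing downloaded, skip
            SketchPage page = new SketchPage(url, body);
            pageProcessor.process(page);                         // fill results and links
            for (String link : page.links) scheduler.push(link); // schedule discovered links
            for (SketchPipeline p : pipelines) p.process(page.results);
        }
    }
}

class MiniSpiderDemo {
    // Crawls a tiny in-memory "site" and returns what the pipeline collected.
    static List<String> crawl() {
        Map<String, String> site = new HashMap<>();
        site.put("a", "b,c");   // page "a" links to "b" and "c"
        site.put("b", "c");
        site.put("c", "");
        List<String> collected = new ArrayList<>();
        new MiniSpider(
                page -> {
                    page.results.add("visited " + page.url);
                    for (String link : page.body.split(",")) {
                        if (!link.isEmpty()) page.links.add(link);
                    }
                },
                site::get)
            .addPipeline(collected::addAll)
            .addUrl("a")
            .run();
        return collected;
    }
}
```

Swapping `SketchQueueScheduler` for another `SketchScheduler`, or adding a second `SketchPipeline`, changes behavior without touching `MiniSpider.run()`, which is the point of keeping each module behind an interface.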
    Since:
    0.1.0
    Author:
    code4crafter@gmail.com
    See Also:
    Downloader, Scheduler, PageProcessor, Pipeline
    • Field Detail

      • pipelines

        protected java.util.List<Pipeline> pipelines
      • startRequests

        protected java.util.List<Request> startRequests
      • site

        protected Site site
      • uuid

        protected java.lang.String uuid
      • logger

        protected org.slf4j.Logger logger
      • executorService

        protected java.util.concurrent.ExecutorService executorService
      • threadNum

        protected int threadNum
      • stat

        protected java.util.concurrent.atomic.AtomicInteger stat
      • exitWhenComplete

        protected boolean exitWhenComplete
      • spawnUrl

        protected boolean spawnUrl
      • destroyWhenExit

        protected boolean destroyWhenExit
    • Constructor Detail

      • Spider

        public Spider(PageProcessor pageProcessor)
        Create a spider with the given pageProcessor.
        Parameters:
        pageProcessor - pageProcessor
    • Method Detail

      • create

        public static Spider create(PageProcessor pageProcessor)
        Create a spider with the given pageProcessor.
        Parameters:
        pageProcessor - pageProcessor
        Returns:
        new spider
        See Also:
        PageProcessor
      • startUrls

        public Spider startUrls(java.util.List<java.lang.String> startUrls)
        Set the startUrls of Spider.
        Takes precedence over the startUrls of Site.
        Parameters:
        startUrls - startUrls
        Returns:
        this
      • startRequest

        public Spider startRequest(java.util.List<Request> startRequests)
        Set the startRequests of Spider.
        Takes precedence over the startUrls of Site.
        Parameters:
        startRequests - startRequests
        Returns:
        this
      • setUUID

        public Spider setUUID(java.lang.String uuid)
        Set a uuid for the spider.
        The default uuid is the domain of the site.
        Parameters:
        uuid - uuid
        Returns:
        this
      • setScheduler

        public Spider setScheduler(Scheduler updateScheduler)
        Set the scheduler for Spider.
        Parameters:
        updateScheduler - scheduler
        Returns:
        this
        Since:
        0.2.1
        See Also:
        Scheduler
      • addPipeline

        public Spider addPipeline(Pipeline pipeline)
        Add a pipeline to Spider.
        Parameters:
        pipeline - pipeline
        Returns:
        this
        Since:
        0.2.1
        See Also:
        Pipeline
      • setPipelines

        public Spider setPipelines(java.util.List<Pipeline> pipelines)
        Set the pipelines for Spider.
        Parameters:
        pipelines - pipelines
        Returns:
        this
        Since:
        0.4.1
        See Also:
        Pipeline
      • clearPipeline

        public Spider clearPipeline()
        Clear all pipelines that have been set.
        Returns:
        this
      • setDownloader

        public Spider setDownloader(Downloader downloader)
        Set the downloader of the spider.
        Parameters:
        downloader - downloader
        Returns:
        this
        See Also:
        Downloader
      • initComponent

        protected void initComponent()
      • run

        public void run()
        Specified by:
        run in interface java.lang.Runnable
      • onError

        protected void onError(Request request,
                               java.lang.Exception e)
      • onSuccess

        protected void onSuccess(Request request)
      • close

        public void close()
      • test

        public void test(java.lang.String... urls)
        Process the specified URLs without discovering further URLs.
        Parameters:
        urls - urls to process
      • sleep

        protected void sleep(int time)
      • extractAndAddRequests

        protected void extractAndAddRequests(Page page,
                                             boolean spawnUrl)
      • checkIfRunning

        protected void checkIfRunning()
      • runAsync

        public void runAsync()
      • addUrl

        public Spider addUrl(java.lang.String... urls)
        Add urls to crawl.
        Parameters:
        urls - urls
        Returns:
        this
      • getAll

        public <T> java.util.List<T> getAll(java.util.Collection<java.lang.String> urls)
        Download URLs synchronously.
        Type Parameters:
        T - type of process result
        Parameters:
        urls - urls
        Returns:
        list of process results
      • get

        public <T> T get(java.lang.String url)
      • addRequest

        public Spider addRequest(Request... requests)
        Add requests (URLs with additional information) to crawl.
        Parameters:
        requests - requests
        Returns:
        this
      • start

        public void start()
      • stop

        public void stop()
      • thread

        public Spider thread(int threadNum)
        Run the spider with more than one thread.
        Parameters:
        threadNum - threadNum
        Returns:
        this
      • thread

        public Spider thread(java.util.concurrent.ExecutorService executorService,
                             int threadNum)
        Run the spider with more than one thread.
        Parameters:
        executorService - executorService to run the spider
        threadNum - threadNum
        Returns:
        this
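
As a rough picture of what running with several threads on a supplied ExecutorService means, the sketch below drains a shared URL queue with `threadNum` workers. It is an illustration under stated assumptions, not the Spider implementation: `ThreadedCrawlSketch` and `processAll` are hypothetical names, and "processing" a URL only increments a counter.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper; not part of WebMagic.
class ThreadedCrawlSketch {
    // Submits threadNum workers to the given executor and waits until the queue is drained.
    static int processAll(List<String> urls, ExecutorService executorService, int threadNum) {
        if (threadNum <= 0) {
            throw new IllegalArgumentException("threadNum must be positive");
        }
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(urls);
        AtomicInteger pageCount = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(threadNum);
        for (int i = 0; i < threadNum; i++) {
            executorService.execute(() -> {
                try {
                    // Each worker polls until no URL is left; poll() is thread-safe.
                    while (queue.poll() != null) {
                        pageCount.incrementAndGet(); // stand-in for download + process
                    }
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            done.await(); // block until every worker has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return pageCount.get();
    }
}
```

In this sketch the caller owns the executor, so it is also the caller's job to shut it down once the work is done.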
      • isExitWhenComplete

        public boolean isExitWhenComplete()
      • setExitWhenComplete

        public Spider setExitWhenComplete(boolean exitWhenComplete)
        Whether to exit when the crawl is complete.
        True: exit when all URLs of the site have been downloaded.
        False: do not exit until stop() is called manually.
        Parameters:
        exitWhenComplete - exitWhenComplete
        Returns:
        this
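
The two settings described above can be sketched as a polling loop. This is a simplified single-threaded illustration, not the Spider's actual run loop: `ExitPolicyLoop`, its fields and `emptySleepMillis` are hypothetical names, and "processing" a URL only increments a counter.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical crawl loop; it counts URLs instead of downloading them.
class ExitPolicyLoop implements Runnable {
    private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final AtomicInteger processed = new AtomicInteger();
    private final boolean exitWhenComplete;
    private final long emptySleepMillis;

    ExitPolicyLoop(boolean exitWhenComplete, long emptySleepMillis) {
        this.exitWhenComplete = exitWhenComplete;
        this.emptySleepMillis = emptySleepMillis;
    }

    void addUrl(String url) { queue.add(url); }
    void stop() { stopped.set(true); }
    int processedCount() { return processed.get(); }

    @Override
    public void run() {
        while (!stopped.get()) {
            String url = queue.poll();
            if (url != null) {
                processed.incrementAndGet();    // stand-in for download + process
                continue;
            }
            if (exitWhenComplete) {
                break;                          // queue drained: finish without waiting
            }
            try {
                Thread.sleep(emptySleepMillis); // nothing to do: wait, then poll again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }
}
```

With `exitWhenComplete` set to true, `run()` returns as soon as the queue is empty. With false, the loop keeps polling, so URLs added later are still picked up, and it ends only after `stop()` is called.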
      • isSpawnUrl

        public boolean isSpawnUrl()
      • getPageCount

        public long getPageCount()
        Get the number of pages downloaded by the spider.
        Returns:
        total downloaded page count
        Since:
        0.4.1
      • getStatus

        public Spider.Status getStatus()
        Get the running status of the spider.
        Returns:
        running status
        Since:
        0.4.1
        See Also:
        Spider.Status
      • getThreadAlive

        public int getThreadAlive()
        Get the number of threads currently running.
        Returns:
        number of running threads
        Since:
        0.4.1
      • setSpawnUrl

        public Spider setSpawnUrl(boolean spawnUrl)
        Whether to add extracted URLs to the download queue.
        When true, extracted URLs are added for download; when false, only the seed URLs are downloaded.
        DO NOT set it unless you know what it means!
        Parameters:
        spawnUrl - spawnUrl
        Returns:
        this
        Since:
        0.4.0
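
The effect of this flag can be illustrated with a tiny in-memory crawl. The sketch below is a hypothetical model, not WebMagic code: `SpawnUrlSketch.crawl` and the `linksByUrl` map (which stands in for links extracted from each page) are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of seed-only versus link-following crawling.
class SpawnUrlSketch {
    // Returns the URLs "downloaded", in order, starting from the seeds.
    static List<String> crawl(List<String> seeds,
                              Map<String, List<String>> linksByUrl,
                              boolean spawnUrl) {
        Deque<String> queue = new ArrayDeque<>(seeds);
        Set<String> seen = new HashSet<>(seeds);
        List<String> downloaded = new ArrayList<>();
        while (!queue.isEmpty()) {
            String url = queue.poll();
            downloaded.add(url);
            if (!spawnUrl) {
                continue; // seed-only mode: ignore links found on the page
            }
            for (String link : linksByUrl.getOrDefault(url, Collections.emptyList())) {
                if (seen.add(link)) {
                    queue.add(link); // newly discovered URL: schedule it
                }
            }
        }
        return downloaded;
    }
}
```

For a site where `a` links to `b` and `b` links to `c`, crawling from seed `a` visits all three pages with `spawnUrl` true and only `a` with `spawnUrl` false.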
      • getUUID

        public java.lang.String getUUID()
        Description copied from interface: Task
        unique id for a task.
        Specified by:
        getUUID in interface Task
        Returns:
        uuid
      • setExecutorService

        public Spider setExecutorService(java.util.concurrent.ExecutorService executorService)
      • getSite

        public Site getSite()
        Description copied from interface: Task
        site of a task
        Specified by:
        getSite in interface Task
        Returns:
        site
      • getSpiderListeners

        public java.util.List<SpiderListener> getSpiderListeners()
      • setSpiderListeners

        public Spider setSpiderListeners(java.util.List<SpiderListener> spiderListeners)
      • getStartTime

        public java.util.Date getStartTime()
      • getScheduler

        public Scheduler getScheduler()
      • setEmptySleepTime

        public Spider setEmptySleepTime(long emptySleepTime)
        Set the wait time used when no URL is polled.
        Parameters:
        emptySleepTime - In MILLISECONDS.
        Returns:
        this