Class PhantomJSDownloader

  • All Implemented Interfaces:
    Downloader

    public class PhantomJSDownloader
    extends AbstractDownloader
    This downloader is used to download pages that need JavaScript rendering.
    Version:
    0.5.3
    Author:
    dolphineor@gmail.com
    • Constructor Summary

      Constructors 
      Constructor Description
      PhantomJSDownloader()  
      PhantomJSDownloader​(java.lang.String phantomJsCommand)
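      Adds a constructor that accepts a custom phantomjs command. Examples: phantomjs.exe (for Windows environments); phantomjs --ignore-ssl-errors=yes (ignores certain errors when the target URL uses HTTPS); /usr/local/bin/phantomjs (absolute path to the command, which avoids an IOException caused by system environment variables).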
      PhantomJSDownloader​(java.lang.String phantomJsCommand, java.lang.String crawlJsPath)
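      Adds a constructor that accepts a custom crawl.js path, because when another project depends on this jar, the phantomjs command run through runtime.exec() cannot use the crawl.js packaged inside the jar.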
    • Constructor Detail

      • PhantomJSDownloader

        public PhantomJSDownloader()
      • PhantomJSDownloader

        public PhantomJSDownloader​(java.lang.String phantomJsCommand)
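        Adds a constructor that accepts a custom phantomjs command. Examples:
          phantomjs.exe - for Windows environments
          phantomjs --ignore-ssl-errors=yes - ignores certain errors when the target URL uses HTTPS
          /usr/local/bin/phantomjs - absolute path to the command, which avoids an IOException caused by system environment variables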
        Parameters:
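        phantomJsCommand - the command used to launch phantomjs: an executable name or absolute path, optionally followed by command-line options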
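The sketch below illustrates why options such as --ignore-ssl-errors=yes can be appended to the command string. It assumes that the downloader joins the command, the crawl.js path, and the URL into one string and passes it to Runtime.exec(String), which splits its argument on whitespace; that layout is an assumption and is not confirmed by this page. PhantomCommandSketch, buildCommand, and tokenize are hypothetical names used for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper, not part of WebMagic.
class PhantomCommandSketch {

    // Assumed layout of the full command line: <phantomJsCommand> <crawlJsPath> <url>.
    static String buildCommand(String phantomJsCommand, String crawlJsPath, String url) {
        return phantomJsCommand + " " + crawlJsPath + " " + url;
    }

    // Runtime.exec(String) breaks its argument into tokens with a StringTokenizer.
    // This method mirrors that splitting so the resulting arguments can be inspected.
    static List<String> tokenize(String command) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(command);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String command = buildCommand(
                "phantomjs --ignore-ssl-errors=yes",
                "/your/path/crawl.js",
                "https://example.com/");
        // Prints the executable followed by its three arguments.
        System.out.println(tokenize(command));
    }
}
```

Under the same assumption, a path that contains spaces would also be split into several tokens, so paths without spaces are the safer choice.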
      • PhantomJSDownloader

        public PhantomJSDownloader​(java.lang.String phantomJsCommand,
                                   java.lang.String crawlJsPath)
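        Adds a constructor that accepts a custom crawl.js path, because when another project depends on this jar, the phantomjs command run through runtime.exec() cannot use the crawl.js packaged inside the jar.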
         crawl.js start --
         
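           // The target URL is passed as the first command-line argument.
           var system = require('system');
           var url = system.args[1];
           
           var page = require('webpage').create();
           // Skip images and give up on any single resource after 5 seconds.
           page.settings.loadImages = false;
           page.settings.resourceTimeout = 5000;
           
           page.open(url, function (status) {
               if (status !== 'success') {
                   console.log("HTTP request failed!");
               } else {
                   // Print the rendered HTML to standard output.
                   console.log(page.content);
               }
           
               page.close();
               phantom.exit();
           });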
           
         -- crawl.js end
         
        In your own project, copy the JavaScript above into a file and pass its path. Example: new PhantomJSDownloader("/your/path/phantomjs", "/your/path/crawl.js");
        Parameters:
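        phantomJsCommand - the command used to launch phantomjs: an executable name or absolute path, optionally followed by command-line options
        crawlJsPath - path on the file system to the crawl.js script that phantomjs should run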
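One way to obtain a usable crawlJsPath is to write the crawl.js listing above to a file at startup. The following is a minimal sketch of that idea using only the Java standard library. CrawlJsFileSketch and writeToTempFile are hypothetical names, and the temporary-file location is an assumption; a fixed path in your project works just as well.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of WebMagic.
class CrawlJsFileSketch {

    // The crawl.js script from the constructor documentation, one line per entry.
    static final String CRAWL_JS = String.join("\n",
            "var system = require('system');",
            "var url = system.args[1];",
            "var page = require('webpage').create();",
            "page.settings.loadImages = false;",
            "page.settings.resourceTimeout = 5000;",
            "page.open(url, function (status) {",
            "    if (status != 'success') {",
            "        console.log(\"HTTP request failed!\");",
            "    } else {",
            "        console.log(page.content);",
            "    }",
            "    page.close();",
            "    phantom.exit();",
            "});",
            "");

    // Writes the script to a new temporary file and returns its path.
    static Path writeToTempFile() {
        try {
            Path file = Files.createTempFile("crawl", ".js");
            Files.write(file, CRAWL_JS.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path crawlJs = writeToTempFile();
        // The printed path is what would be passed as crawlJsPath, for example:
        // new PhantomJSDownloader("/your/path/phantomjs", crawlJs.toString());
        System.out.println(crawlJs.toAbsolutePath());
    }
}
```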
    • Method Detail

      • download

        public Page download​(Request request,
                             Task task)
        Description copied from interface: Downloader
        Downloads a web page and stores it in a Page object.
        Parameters:
        request - request
        task - task
        Returns:
        page
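As a rough illustration of what a process-based download involves, the sketch below launches an external command and collects its standard output as a string, which is where crawl.js prints the rendered HTML. It is an assumption that the downloader works along these lines, and this is not its actual implementation. ProcessOutputSketch, readAll, and run are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of WebMagic.
class ProcessOutputSketch {

    // Reads the whole stream as UTF-8 text, ending every line with '\n'.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Runs the command and returns everything it wrote to standard output.
    static String run(List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                    // Forward stderr to this JVM's stderr so it stays out of the result.
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            String output = readAll(process.getInputStream());
            process.waitFor();
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Example arguments: /your/path/phantomjs /your/path/crawl.js https://example.com/
        if (args.length > 0) {
            System.out.print(run(Arrays.asList(args)));
        }
    }
}
```

Note that crawl.js reports a failed load by printing "HTTP request failed!" to standard output, so a caller built this way would need to check for that text before treating the output as page content.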
      • setThread

        public void setThread​(int threadNum)
        Description copied from interface: Downloader
        Tells the downloader how many threads the spider uses.
        Parameters:
        threadNum - number of threads
      • getPage

        protected java.lang.String getPage​(Request request)
      • getRetryNum

        public int getRetryNum()