Class PhantomJSDownloader

  • All Implemented Interfaces:
    Downloader

    public class PhantomJSDownloader
    extends AbstractDownloader
    This downloader is used to download pages that need JavaScript rendering.
    Version:
    0.5.3
    Author:
    dolphineor@gmail.com
    • Constructor Detail

      • PhantomJSDownloader

        public PhantomJSDownloader()
      • PhantomJSDownloader

        public PhantomJSDownloader​(java.lang.String phantomJsCommand)
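        Adds a constructor that supports a custom phantomjs command.

        Examples:
        phantomjs.exe - supports Windows environments
        phantomjs --ignore-ssl-errors=yes - ignores certain errors when the crawled URL uses https
        /usr/local/bin/phantomjs - the absolute path of the command, which avoids an IOException caused by system environment variables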

        Parameters:
        phantomJsCommand - the command used to launch phantomjs
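
The command variants listed for this constructor can be sketched as a small helper. This is only an illustration: `PhantomJsCommands`, `forOs` and `ignoringSslErrors` are hypothetical names that are not part of WebMagic, and the sketch assumes phantomjs is on the PATH.

```java
import java.util.Locale;

// Hypothetical helper, not part of WebMagic: builds a phantomjs command string.
final class PhantomJsCommands {

    private PhantomJsCommands() {
    }

    /** Returns "phantomjs.exe" for Windows OS names and "phantomjs" otherwise. */
    static String forOs(String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows ? "phantomjs.exe" : "phantomjs";
    }

    /** Appends the flag from the examples above that ignores SSL errors. */
    static String ignoringSslErrors(String command) {
        return command + " --ignore-ssl-errors=yes";
    }

    public static void main(String[] args) {
        String command = ignoringSslErrors(forOs(System.getProperty("os.name")));
        // The result is what would be passed as phantomJsCommand,
        // e.g. new PhantomJSDownloader(command).
        System.out.println(command);
    }
}
```

If phantomjs is not on the PATH, an absolute path such as /usr/local/bin/phantomjs can be passed instead of the value returned by `forOs`.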
      • PhantomJSDownloader

        public PhantomJSDownloader​(java.lang.String phantomJsCommand,
                                   java.lang.String crawlJsPath)
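        Adds a constructor that supports a custom crawl.js path. This is needed because, when another project depends on this jar, runtime.exec() cannot use the crawl.js packaged inside the jar when it runs the phantomjs command.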
         crawl.js start --
        
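           var system = require('system');
           // The URL to render is passed as the first command-line argument.
           var url = system.args[1];
        
           var page = require('webpage').create();
           // Skip images and give up on any resource after 5 seconds.
           page.settings.loadImages = false;
           page.settings.resourceTimeout = 5000;
        
           page.open(url, function (status) {
               if (status !== 'success') {
                   console.log("HTTP request failed!");
               } else {
                   // Print the rendered HTML to stdout for the caller to read.
                   console.log(page.content);
               }
        
               page.close();
               phantom.exit();
           });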
        
         -- crawl.js end
         
        In your own project, you can copy the JavaScript code above and use it.

        example: new PhantomJSDownloader("/your/path/phantomjs", "/your/path/crawl.js");

        Parameters:
        phantomJsCommand - the command used to launch phantomjs
        crawlJsPath - the path to the crawl.js script
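
Because the script has to exist as a real file that phantomjs can open, one option is to write its source to a temporary file and pass that path as crawlJsPath. The following is a minimal sketch under that assumption. `CrawlJsFiles` and `writeToTempFile` are hypothetical names, not part of WebMagic.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of WebMagic: materializes crawl.js on disk.
final class CrawlJsFiles {

    private CrawlJsFiles() {
    }

    /** Writes the given script source to a temporary .js file and returns its path. */
    static Path writeToTempFile(String scriptSource) {
        try {
            Path file = Files.createTempFile("crawl", ".js");
            Files.write(file, scriptSource.getBytes(StandardCharsets.UTF_8));
            // Ask the JVM to remove the file when it exits normally.
            file.toFile().deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned path, converted with `toString()`, would then be the second constructor argument, for example `new PhantomJSDownloader("/your/path/phantomjs", path.toString())`.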
    • Method Detail

      • download

        public Page download​(Request request,
                             Task task)
        Description copied from interface: Downloader
        Downloads web pages and stores them in a Page object.
        Parameters:
        request - request
        task - task
        Returns:
        page
      • setThread

        public void setThread​(int threadNum)
        Description copied from interface: Downloader
        Tells the downloader how many threads the spider uses.
        Parameters:
        threadNum - number of threads
      • getPage

        protected java.lang.String getPage​(Request request)
                                    throws java.lang.Exception
        Throws:
        java.lang.Exception
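
The constructor notes say that the phantomjs command is run through runtime.exec() together with crawl.js, and that crawl.js prints the rendered page to standard output. A rough sketch of that kind of invocation is shown below. It is an assumption about the general shape, not the actual body of getPage: `PhantomJsInvocationSketch`, `buildCommand` and `run` are hypothetical names, and the sketch uses `ProcessBuilder` rather than `Runtime.exec()`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the WebMagic implementation.
final class PhantomJsInvocationSketch {

    private PhantomJsInvocationSketch() {
    }

    /** Joins command, script path and URL, assuming crawl.js reads the URL from system.args[1]. */
    static String buildCommand(String phantomJsCommand, String crawlJsPath, String url) {
        return phantomJsCommand + " " + crawlJsPath + " " + url;
    }

    /** Runs the command and returns everything the process printed. */
    static String run(String command) throws IOException, InterruptedException {
        // Splitting on whitespace breaks paths that contain spaces.
        ProcessBuilder builder = new ProcessBuilder(command.trim().split("\\s+"));
        // Merge stderr into stdout so the child cannot block on a full stderr pipe.
        builder.redirectErrorStream(true);
        Process process = builder.start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        process.waitFor();
        return output.toString();
    }
}
```

With the crawl.js shown earlier, a failed load would produce the text "HTTP request failed!" instead of HTML, so a caller may want to check for that marker before parsing the output.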