Class PhantomJSDownloader

  • All Implemented Interfaces:
    us.codecraft.webmagic.downloader.Downloader

    public class PhantomJSDownloader
    extends us.codecraft.webmagic.downloader.AbstractDownloader
    Downloader for pages whose content is only available after their JavaScript has been rendered.
    Version:
    0.5.3
    Author:
    dolphineor@gmail.com
    • Constructor Summary

      Constructors 
      Constructor Description
      PhantomJSDownloader()  
      PhantomJSDownloader​(java.lang.String phantomJsCommand)
      Adds a constructor that accepts a custom phantomjs command.
      PhantomJSDownloader​(java.lang.String phantomJsCommand, java.lang.String crawlJsPath)
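      Adds a constructor that accepts a custom crawl.js path. When another project depends on this jar, the phantomjs command launched through runtime.exec() cannot use the crawl.js packaged inside the jar.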
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      us.codecraft.webmagic.Page download​(us.codecraft.webmagic.Request request, us.codecraft.webmagic.Task task)  
      protected java.lang.String getPage​(us.codecraft.webmagic.Request request)  
      void setThread​(int threadNum)  
      • Methods inherited from class us.codecraft.webmagic.downloader.AbstractDownloader

        download, download, onError, onError, onError, onSuccess, onSuccess, onSuccess
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • PhantomJSDownloader

        public PhantomJSDownloader()
      • PhantomJSDownloader

        public PhantomJSDownloader​(java.lang.String phantomJsCommand)
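        Adds a constructor that accepts a custom phantomjs command.

        Examples:
        phantomjs.exe - for Windows environments
        phantomjs --ignore-ssl-errors=yes - ignores some of the errors raised when the crawled URL uses HTTPS
        /usr/local/bin/phantomjs - absolute path to the command, which avoids an IOException caused by system environment variables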

        Parameters:
        phantomJsCommand - the command used to launch phantomjs, optionally including flags
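
The examples above put flags into the same string as the executable name. `Runtime.exec(String)` splits its argument on whitespace with a `StringTokenizer`, so a sketch of that splitting shows what phantomjs would receive. It assumes the downloader joins the command, the script path, and the URL with single spaces; the class name `CommandTokens` and the sample path and URL are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class CommandTokens {

    // Splits a command line the way Runtime.exec(String) does:
    // on whitespace only, with no support for quoting.
    public static List<String> tokenize(String commandLine) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(commandLine);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical values, for illustration only.
        String phantomJsCommand = "phantomjs --ignore-ssl-errors=yes";
        String commandLine = phantomJsCommand + " /your/path/crawl.js https://example.com/";
        for (String token : tokenize(commandLine)) {
            System.out.println(token);
        }
    }
}
```

Because the split is purely on whitespace, a command or script path that contains spaces would be broken into several arguments.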
      • PhantomJSDownloader

        public PhantomJSDownloader​(java.lang.String phantomJsCommand,
                                   java.lang.String crawlJsPath)
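        Adds a constructor that accepts a custom crawl.js path. When another project depends on this jar, the phantomjs command launched through runtime.exec() cannot use the crawl.js packaged inside the jar.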
         crawl.js start --
        
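           // The URL to fetch is passed as the first command-line argument.
           var system = require('system');
           var url = system.args[1];
        
           var page = require('webpage').create();
           // Skip images and give up on any single resource after 5000 ms.
           page.settings.loadImages = false;
           page.settings.resourceTimeout = 5000;
        
           page.open(url, function (status) {
               if (status != 'success') {
                   console.log("HTTP request failed!");
               } else {
                   // Print the rendered page so the caller can read it from stdout.
                   console.log(page.content);
               }
        
               page.close();
               phantom.exit();
           });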
        
         -- crawl.js end
         
        In your own project, you can copy the JavaScript above and use it as your crawl.js.

        example: new PhantomJSDownloader("/your/path/phantomjs", "/your/path/crawl.js");

        Parameters:
        phantomJsCommand - the command used to launch phantomjs, optionally including flags
        crawlJsPath - path to the crawl.js script that phantomjs should run
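
Since the script has to exist as a real file on disk, one way to supply it is to write it out at startup and pass the resulting path as crawlJsPath. The following is a minimal sketch using only the JDK; the class name `CrawlJsWriter` and the choice of a temporary directory are assumptions made for illustration, not part of this library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CrawlJsWriter {

    // The crawl.js script from the constructor documentation, one line per element.
    public static final String CRAWL_JS = String.join("\n",
        "var system = require('system');",
        "var url = system.args[1];",
        "",
        "var page = require('webpage').create();",
        "page.settings.loadImages = false;",
        "page.settings.resourceTimeout = 5000;",
        "",
        "page.open(url, function (status) {",
        "    if (status != 'success') {",
        "        console.log(\"HTTP request failed!\");",
        "    } else {",
        "        console.log(page.content);",
        "    }",
        "",
        "    page.close();",
        "    phantom.exit();",
        "});",
        "");

    // Writes crawl.js into the given directory and returns its absolute path.
    public static Path writeCrawlJs(Path directory) {
        Path script = directory.resolve("crawl.js");
        try {
            Files.write(script, CRAWL_JS.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return script.toAbsolutePath();
    }

    public static void main(String[] args) {
        // Hypothetical location; any writable directory would do.
        Path directory = Paths.get(System.getProperty("java.io.tmpdir"));
        Path script = writeCrawlJs(directory);
        System.out.println(script);
        // The path could then be handed to the two-argument constructor, e.g.
        // new PhantomJSDownloader("/your/path/phantomjs", script.toString());
    }
}
```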
    • Method Detail

      • download

        public us.codecraft.webmagic.Page download​(us.codecraft.webmagic.Request request,
                                                   us.codecraft.webmagic.Task task)
      • setThread

        public void setThread​(int threadNum)
      • getPage

        protected java.lang.String getPage​(us.codecraft.webmagic.Request request)
                                    throws java.lang.Exception
        Throws:
        java.lang.Exception
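
This page does not describe how getPage obtains its result. Based on the constructor notes (a phantomjs command run through runtime.exec() with a crawl.js script that prints the page to the console), a plausible flow is to launch the process and collect its standard output. The sketch below illustrates that assumed flow with the JDK only; `PhantomPageFetchSketch`, `fetch`, and `readAll` are hypothetical names and this is not the library's actual implementation.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class PhantomPageFetchSketch {

    // Reads a stream line by line and terminates every line with '\n'.
    public static String readAll(InputStream in) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Assumed flow: run "<command> <script> <url>" and return what the script prints.
    public static String fetch(String phantomJsCommand, String crawlJsPath, String url)
            throws IOException, InterruptedException {
        Process process =
            Runtime.getRuntime().exec(phantomJsCommand + " " + crawlJsPath + " " + url);
        String html = readAll(process.getInputStream());
        process.waitFor();
        return html;
    }

    public static void main(String[] args) {
        // Exercises only the stream-reading half, so no phantomjs install is needed.
        InputStream fake =
            new ByteArrayInputStream("<html>\n</html>".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(fake));
    }
}
```

Under this assumed flow, a failed load would surface as the text "HTTP request failed!" printed by crawl.js rather than as an exception, so a caller may want to check for that string.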