Class PhantomJSDownloader

java.lang.Object
us.codecraft.webmagic.downloader.AbstractDownloader
us.codecraft.webmagic.downloader.PhantomJSDownloader
All Implemented Interfaces:
us.codecraft.webmagic.downloader.Downloader

public class PhantomJSDownloader extends us.codecraft.webmagic.downloader.AbstractDownloader
Downloader for pages whose content must be rendered by JavaScript before it can be extracted. Rendering is delegated to PhantomJS.
Version:
0.5.3
Author:
dolphineor@gmail.com
  • Constructor Summary

    Constructors
    Constructor
    Description
     
    PhantomJSDownloader(String phantomJsCommand)
    Adds a constructor that accepts a custom PhantomJS command.
    PhantomJSDownloader(String phantomJsCommand, String crawlJsPath)
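    Adds a constructor that accepts a custom crawl.js path. When another project depends on this jar, the phantomjs command run through runtime.exec() cannot use the crawl.js bundled inside the jar.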
  • Method Summary

    Modifier and Type
    Method
    Description
    us.codecraft.webmagic.Page
    download(us.codecraft.webmagic.Request request, us.codecraft.webmagic.Task task)
     
    protected String
    getPage(us.codecraft.webmagic.Request request)
     
    void
    setThread(int threadNum)
     

    Methods inherited from class us.codecraft.webmagic.downloader.AbstractDownloader

    download, download, onError, onError, onError, onSuccess, onSuccess, onSuccess

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • PhantomJSDownloader

      public PhantomJSDownloader()
    • PhantomJSDownloader

      public PhantomJSDownloader(String phantomJsCommand)
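      Adds a constructor that accepts a custom PhantomJS command.

      Examples:
      phantomjs.exe - for Windows environments
      phantomjs --ignore-ssl-errors=yes - ignores certain errors raised when the target URL uses HTTPS
      /usr/local/bin/phantomjs - absolute path to the command, which avoids an IOException caused by the system environment variables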

      Parameters:
      phantomJsCommand - the command used to launch PhantomJS, optionally followed by command-line flags
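
The three example commands above are plain strings, so how they are split into arguments matters. The sketch below assumes the downloader passes the command to the single-string form of Runtime.exec, which breaks the string on whitespace with a StringTokenizer. The class name PhantomCommandTokens and the method tokenize are hypothetical helpers for illustration, not part of WebMagic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class PhantomCommandTokens {

    // Splits a command string on whitespace, the same way
    // Runtime.exec(String) does before starting the process.
    public static List<String> tokenize(String command) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(command);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The example commands from the constructor documentation.
        String[] commands = {
            "phantomjs.exe",
            "phantomjs --ignore-ssl-errors=yes",
            "/usr/local/bin/phantomjs"
        };
        for (String command : commands) {
            System.out.println(command + " -> " + tokenize(command));
        }
    }
}
```

Under that assumption, a flag such as --ignore-ssl-errors=yes becomes its own argument, while an executable path containing spaces would be split into several tokens and is best avoided.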
    • PhantomJSDownloader

      public PhantomJSDownloader(String phantomJsCommand, String crawlJsPath)
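      Adds a constructor that accepts a custom crawl.js path. When another project depends on this jar, the phantomjs command run through runtime.exec() cannot use the crawl.js bundled inside the jar.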
       crawl.js start --
      
         var system = require('system');
         // args[0] is the script path; args[1] is the URL to fetch
         var url = system.args[1];
      
         var page = require('webpage').create();
         // Skip images and give up on any resource after 5000 ms
         page.settings.loadImages = false;
         page.settings.resourceTimeout = 5000;
      
         page.open(url, function (status) {
             if (status != 'success') {
                 console.log("HTTP request failed!");
             } else {
                 // Print the rendered HTML to standard output
                 console.log(page.content);
             }
      
             page.close();
             phantom.exit();
         });
      
       -- crawl.js end
       
      In your own project, you can copy the JavaScript above and use it as your crawl.js.

      example: new PhantomJSDownloader("/your/path/phantomjs", "/your/path/crawl.js");

      Parameters:
      phantomJsCommand - the command used to launch PhantomJS
      crawlJsPath - path to the crawl.js script on the file system
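
One way to follow the advice about copying crawl.js is to keep the script text in your own project and write it to a real file at startup, then pass that file's path as crawlJsPath. The following is a minimal sketch under that assumption. The class CrawlJsExtractor and the method writeCrawlJs are hypothetical names, and the script body is copied from the crawl.js listing in this section.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CrawlJsExtractor {

    // Script body copied from the crawl.js listing in the constructor documentation.
    public static final String CRAWL_JS = String.join("\n",
        "var system = require('system');",
        "var url = system.args[1];",
        "",
        "var page = require('webpage').create();",
        "page.settings.loadImages = false;",
        "page.settings.resourceTimeout = 5000;",
        "",
        "page.open(url, function (status) {",
        "    if (status != 'success') {",
        "        console.log(\"HTTP request failed!\");",
        "    } else {",
        "        console.log(page.content);",
        "    }",
        "",
        "    page.close();",
        "    phantom.exit();",
        "});",
        "");

    // Writes the script to a temporary file that phantomjs can read from disk.
    public static Path writeCrawlJs() {
        try {
            Path script = Files.createTempFile("crawl", ".js");
            Files.write(script, CRAWL_JS.getBytes(StandardCharsets.UTF_8));
            script.toFile().deleteOnExit(); // remove the file when the JVM exits
            return script;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path script = writeCrawlJs();
        // With WebMagic on the classpath, the path would then be passed as:
        // new PhantomJSDownloader("/your/path/phantomjs", script.toString());
        System.out.println(script.toAbsolutePath());
    }
}
```

A temporary file is used here only to keep the sketch self-contained. A fixed location inside your deployment directory would work equally well.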
  • Method Details

    • download

      public us.codecraft.webmagic.Page download(us.codecraft.webmagic.Request request, us.codecraft.webmagic.Task task)
    • setThread

      public void setThread(int threadNum)
    • getPage

      protected String getPage(us.codecraft.webmagic.Request request) throws Exception
      Throws:
      Exception
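
This page does not describe the body of getPage. Given the constructor notes, a plausible reading is that it launches phantomjs with crawl.js and the request URL, then returns whatever the script prints to standard output. The sketch below illustrates that flow with the standard library only. It is an assumption, not the WebMagic implementation, and the names PhantomPageReader, buildCommand, readAll, and renderPage are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PhantomPageReader {

    // Builds the argument list: command (plus any flags), script path, URL.
    public static List<String> buildCommand(String phantomJsCommand, String crawlJsPath, String url) {
        List<String> command = new ArrayList<>(Arrays.asList(phantomJsCommand.trim().split("\\s+")));
        command.add(crawlJsPath);
        command.add(url);
        return command;
    }

    // Reads a stream line by line into one string, ending each line with '\n'.
    public static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Runs phantomjs and returns its standard output (the rendered HTML on success).
    public static String renderPage(String phantomJsCommand, String crawlJsPath, String url)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(phantomJsCommand, crawlJsPath, url))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // keep stderr out of the page text
                .start();
        String output = readAll(process.getInputStream());
        process.waitFor();
        return output;
    }

    public static void main(String[] args) {
        // Only prints the command; calling renderPage requires phantomjs to be installed.
        System.out.println(buildCommand("phantomjs --ignore-ssl-errors=yes",
                "/your/path/crawl.js", "https://example.com"));
    }
}
```

Because crawl.js prints "HTTP request failed!" when the page cannot be opened, a caller following this pattern would receive that string instead of HTML and may want to check for it.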