Class BloomFilterDuplicateRemover

  • All Implemented Interfaces:
    us.codecraft.webmagic.scheduler.component.DuplicateRemover

    public class BloomFilterDuplicateRemover
    extends java.lang.Object
    implements us.codecraft.webmagic.scheduler.component.DuplicateRemover
    A DuplicateRemover backed by a Bloom filter, intended for a huge number of URLs.
    Since:
    0.5.1
    Author:
    code4crafer@gmail.com
    • Method Summary

      Modifier and Type                                                     Method
      int                                                                   getTotalRequestsCount(us.codecraft.webmagic.Task task)
      protected java.lang.String                                            getUrl(us.codecraft.webmagic.Request request)
      boolean                                                               isDuplicate(us.codecraft.webmagic.Request request, us.codecraft.webmagic.Task task)
      protected com.google.common.hash.BloomFilter<java.lang.CharSequence>  rebuildBloomFilter()
      void                                                                  resetDuplicateCheck(us.codecraft.webmagic.Task task)
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • BloomFilterDuplicateRemover

        public BloomFilterDuplicateRemover(int expectedInsertions)
      • BloomFilterDuplicateRemover

        public BloomFilterDuplicateRemover(int expectedInsertions,
                                           double fpp)
        Parameters:
        expectedInsertions - the number of expected insertions into the constructed Bloom filter
        fpp - the desired false positive probability (must be positive and less than 1.0)
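
        The two parameters correspond to the inputs of the standard Bloom filter sizing formulas. As a rough illustration only (BloomSizing and its methods are hypothetical names, and nothing here is taken from WebMagic or Guava source), a filter expecting n insertions at false positive probability p needs about m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions:

```java
// Hypothetical helper for illustration; not part of WebMagic or Guava.
class BloomSizing {

    // Number of bits needed for n expected insertions at false positive probability p.
    static long optimalNumOfBits(long n, double p) {
        if (n <= 0 || p <= 0.0 || p >= 1.0) {
            throw new IllegalArgumentException("need n > 0 and 0 < p < 1");
        }
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    // Number of hash functions that minimises the false positive rate for m bits and n insertions.
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 10_000_000L;   // plays the role of expectedInsertions
        double p = 0.01;        // plays the role of fpp
        long m = optimalNumOfBits(n, p);
        int k = optimalNumOfHashFunctions(n, m);
        System.out.printf("n=%d p=%.2f -> %d bits, %d hash functions%n", n, p, m, k);
    }
}
```

        Under these formulas, ten million URLs at a 1% false positive probability come to roughly 12 MB of bits, which is the usual argument for a Bloom filter over an exact set when the URL count is very large. The price is that a small fraction of unseen URLs may be reported as duplicates.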
    • Method Detail

      • rebuildBloomFilter

        protected com.google.common.hash.BloomFilter<java.lang.CharSequence> rebuildBloomFilter()
      • isDuplicate

        public boolean isDuplicate(us.codecraft.webmagic.Request request,
                                   us.codecraft.webmagic.Task task)
        Specified by:
        isDuplicate in interface us.codecraft.webmagic.scheduler.component.DuplicateRemover
      • getUrl

        protected java.lang.String getUrl(us.codecraft.webmagic.Request request)
      • resetDuplicateCheck

        public void resetDuplicateCheck(us.codecraft.webmagic.Task task)
        Specified by:
        resetDuplicateCheck in interface us.codecraft.webmagic.scheduler.component.DuplicateRemover
      • getTotalRequestsCount

        public int getTotalRequestsCount(us.codecraft.webmagic.Task task)
        Specified by:
        getTotalRequestsCount in interface us.codecraft.webmagic.scheduler.component.DuplicateRemover
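
        This page gives no descriptions for the three interface methods, so the following is only a sketch of how a Bloom-filter-backed duplicate check could fit together. It is a stand-in written against plain URL strings and java.util.BitSet, not the WebMagic class. The class name, the hashing scheme, and the constructor arguments (bit count and hash count passed directly) are all assumptions, as is the behaviour that isDuplicate records an unseen URL and that a reset also zeroes the counter:

```java
import java.util.BitSet;

// Hypothetical stand-in for illustration; not the WebMagic implementation.
class SketchBloomDuplicateRemover {
    private final int numBits;
    private final int numHashes;
    private final BitSet bits;
    private int count; // URLs accepted as new since the last reset

    SketchBloomDuplicateRemover(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // FNV-1a style mix over the UTF-16 chars, used as a second hash next to String.hashCode().
    private static int secondHash(String s) {
        int h = 0x811C9DC5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 16777619;
        }
        return h;
    }

    // i-th probe position via double hashing: h1 + i * h2, reduced into [0, numBits).
    private int index(String url, int i) {
        int h1 = url.hashCode();
        int h2 = secondHash(url) | 1; // force an odd step so it is never zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Returns true if the URL was probably seen before; otherwise records it and returns false.
    synchronized boolean isDuplicate(String url) {
        boolean seen = true;
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(url, i))) {
                seen = false;
                break;
            }
        }
        if (!seen) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(url, i));
            }
            count++;
        }
        return seen;
    }

    // Forgets every recorded URL (sketch choice: the counter is zeroed as well).
    synchronized void resetDuplicateCheck() {
        bits.clear();
        count = 0;
    }

    synchronized int getTotalRequestsCount() {
        return count;
    }
}
```

        A Bloom filter never yields a false negative, so a URL that was recorded is always reported as a duplicate afterwards. The reverse does not hold: a URL that was never recorded can occasionally be reported as a duplicate, and in a crawler that URL would be skipped.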