Package us.codecraft.webmagic.scheduler
Class BloomFilterDuplicateRemover
- java.lang.Object
  - us.codecraft.webmagic.scheduler.BloomFilterDuplicateRemover
- All Implemented Interfaces:
us.codecraft.webmagic.scheduler.component.DuplicateRemover
public class BloomFilterDuplicateRemover extends java.lang.Object implements us.codecraft.webmagic.scheduler.component.DuplicateRemover
A Bloom-filter-based DuplicateRemover for crawls with a huge number of URLs.
- Since:
- 0.5.1
- Author:
- code4crafer@gmail.com
Constructor Summary
Constructors:
BloomFilterDuplicateRemover(int expectedInsertions)
BloomFilterDuplicateRemover(int expectedInsertions, double fpp)
-
Method Summary
Methods (modifier and type, then signature):
int getTotalRequestsCount(us.codecraft.webmagic.Task task)
protected java.lang.String getUrl(us.codecraft.webmagic.Request request)
boolean isDuplicate(us.codecraft.webmagic.Request request, us.codecraft.webmagic.Task task)
protected com.google.common.hash.BloomFilter<java.lang.CharSequence> rebuildBloomFilter()
void resetDuplicateCheck(us.codecraft.webmagic.Task task)
Constructor Detail
-
BloomFilterDuplicateRemover
public BloomFilterDuplicateRemover(int expectedInsertions)
-
BloomFilterDuplicateRemover
public BloomFilterDuplicateRemover(int expectedInsertions, double fpp)
- Parameters:
expectedInsertions - the number of expected insertions to the constructed Bloom filter
fpp - the desired false positive probability (must be positive and less than 1.0)
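The two constructor parameters together determine how much memory the filter needs. As a rough illustration, the textbook Bloom filter sizing formulas m = -n ln p / (ln 2)^2 (bits) and k = (m / n) ln 2 (hash functions) can be evaluated as below. This is a sketch of the standard formulas only and may not match the exact arithmetic of the underlying library; the names BloomSizing, optimalBits and optimalHashes are hypothetical and not part of WebMagic.

```java
// Hypothetical helper, not part of WebMagic: evaluates the textbook
// Bloom filter sizing formulas for a given expectedInsertions and fpp.
final class BloomSizing {

    /** Bits needed for n expected insertions at false positive probability p. */
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Hash function count that minimises false positives for m bits and n insertions. */
    static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long expectedInsertions = 1_000_000L; // assumed crawl size
        double fpp = 0.01;                    // assumed target false positive probability
        long bits = optimalBits(expectedInsertions, fpp);
        int hashes = optimalHashes(expectedInsertions, bits);
        System.out.printf("bits=%d (%.2f MiB), hashes=%d%n",
                bits, bits / 8.0 / 1024 / 1024, hashes);
    }
}
```

Under these formulas, one million expected insertions at an fpp of 0.01 need roughly 9.6 million bits (a little over 1 MiB) and 7 hash functions. Lowering fpp or raising expectedInsertions increases the memory cost accordingly.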
Method Detail
-
rebuildBloomFilter
protected com.google.common.hash.BloomFilter<java.lang.CharSequence> rebuildBloomFilter()
-
isDuplicate
public boolean isDuplicate(us.codecraft.webmagic.Request request, us.codecraft.webmagic.Task task)
- Specified by:
isDuplicate in interface us.codecraft.webmagic.scheduler.component.DuplicateRemover
-
getUrl
protected java.lang.String getUrl(us.codecraft.webmagic.Request request)
-
resetDuplicateCheck
public void resetDuplicateCheck(us.codecraft.webmagic.Task task)
- Specified by:
resetDuplicateCheck in interface us.codecraft.webmagic.scheduler.component.DuplicateRemover
-
getTotalRequestsCount
public int getTotalRequestsCount(us.codecraft.webmagic.Task task)
- Specified by:
getTotalRequestsCount in interface us.codecraft.webmagic.scheduler.component.DuplicateRemover
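To make the roles of isDuplicate, resetDuplicateCheck and getTotalRequestsCount more concrete, here is a minimal JDK-only sketch of a Bloom-filter duplicate check. It is a stand-in for illustration, not the WebMagic implementation: the real class works with a com.google.common.hash.BloomFilter and takes Request and Task arguments, whereas the hypothetical SimpleBloomDuplicateCheck below uses a java.util.BitSet, plain URL strings, and two simple hash functions chosen for brevity. Resetting the counter together with the filter is also an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical stand-in, not part of WebMagic: a tiny Bloom filter over URL strings.
final class SimpleBloomDuplicateCheck {
    private final int numBits;
    private final int numHashes;
    private BitSet bits;
    private int counter; // number of URLs accepted as new

    SimpleBloomDuplicateCheck(int expectedInsertions, double fpp) {
        double ln2 = Math.log(2);
        // Textbook sizing formulas; at least 64 bits so tiny inputs still work.
        this.numBits = (int) Math.max(64, Math.ceil(-expectedInsertions * Math.log(fpp) / (ln2 * ln2)));
        this.numHashes = Math.max(1, (int) Math.round((double) numBits / expectedInsertions * ln2));
        this.bits = new BitSet(numBits);
    }

    /** Returns true if the URL was probably seen before; otherwise records it and returns false. */
    synchronized boolean isDuplicate(String url) {
        int h1 = url.hashCode();
        int h2 = fnv1a(url);
        boolean allSet = true;
        for (int i = 0; i < numHashes; i++) {
            // Double hashing: derive the i-th probe position from two base hashes.
            int index = Math.floorMod(h1 + i * h2, numBits);
            if (!bits.get(index)) {
                allSet = false;
                bits.set(index);
            }
        }
        if (!allSet) {
            counter++;
        }
        return allSet;
    }

    /** Forgets every URL seen so far (this sketch also resets the counter). */
    synchronized void resetDuplicateCheck() {
        bits = new BitSet(numBits);
        counter = 0;
    }

    synchronized int getTotalRequestsCount() {
        return counter;
    }

    // 32-bit FNV-1a, used here only as a second, independent-looking hash.
    private static int fnv1a(String s) {
        int hash = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    public static void main(String[] args) {
        SimpleBloomDuplicateCheck check = new SimpleBloomDuplicateCheck(1000, 0.01);
        System.out.println(check.isDuplicate("https://example.com/a")); // first sighting
        System.out.println(check.isDuplicate("https://example.com/a")); // repeat sighting
        System.out.println(check.getTotalRequestsCount());
    }
}
```

The trade-off this illustrates is the usual one for Bloom filters: a URL that was added is always reported as a duplicate afterwards, but a URL that was never added can occasionally be reported as a duplicate too (at a rate governed by fpp), so a small fraction of pages may be skipped in exchange for a fixed, small memory footprint.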