Package us.codecraft.webmagic.scheduler
The scheduler is the component that manages URLs.

Interface Summary

Interface             Description
MonitorableScheduler  A scheduler whose requests can be counted for monitoring.
Scheduler             The scheduler is the component that manages URLs.
                      Implement the Scheduler interface to:
                      - manage the URLs to fetch
                      - remove duplicate URLs
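
The two responsibilities named for Scheduler (managing the URLs to fetch and removing duplicates) can be illustrated with a minimal, self-contained sketch. It does not use the WebMagic API: the `UrlScheduler` interface, its `push`/`poll` methods on plain strings, and the class names are assumptions made for illustration only.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical, simplified scheduler contract; NOT the WebMagic Scheduler interface.
interface UrlScheduler {
    void push(String url); // offer a URL for fetching
    String poll();         // next URL to fetch, or null if none is pending
}

// Keeps pending URLs in a FIFO queue and drops URLs that were already pushed.
class DedupQueueScheduler implements UrlScheduler {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    @Override
    public void push(String url) {
        // Set.add returns false when the URL is already known, so duplicates are skipped.
        if (seen.add(url)) {
            queue.add(url);
        }
    }

    @Override
    public String poll() {
        // Non-blocking: returns null when the queue is empty.
        return queue.poll();
    }
}

class SchedulerSketchDemo {
    public static void main(String[] args) {
        UrlScheduler scheduler = new DedupQueueScheduler();
        scheduler.push("https://example.com/a");
        scheduler.push("https://example.com/b");
        scheduler.push("https://example.com/a"); // duplicate, ignored
        System.out.println(scheduler.poll()); // https://example.com/a
        System.out.println(scheduler.poll()); // https://example.com/b
        System.out.println(scheduler.poll()); // null
    }
}
```

The "seen" set is kept separate from the queue so that a URL stays marked as visited after it has been polled; otherwise a page that links back to an already fetched URL would be queued again.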

Class Summary

Class                        Description
BloomFilterDuplicateRemover  Duplicate remover for a huge number of URLs.
DuplicateRemovedScheduler    Removes duplicate URLs and only pushes URLs that are not duplicates.
FileCacheQueueScheduler      Stores URLs and a cursor in files so that a Spider can resume its state after a shutdown.
PriorityScheduler            Priority scheduler.
QueueScheduler               Basic Scheduler implementation.
                             Stores the URLs to fetch in a LinkedBlockingQueue and removes duplicate URLs with a HashMap.
RedisPriorityScheduler       The Redis scheduler with priority.
RedisScheduler               Uses Redis as the URL scheduler for distributed crawlers.
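
To show why a Bloom filter suits duplicate removal over a huge number of URLs, the following is a rough, self-contained sketch of the general technique. It is not the code of BloomFilterDuplicateRemover: the class name `BloomUrlFilter`, the `isDuplicate` method, the FNV-1a style hashing, and the sizing parameters are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration of Bloom-filter de-duplication; NOT WebMagic's implementation.
class BloomUrlFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of bit positions probed per URL

    BloomUrlFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Returns true if the URL was possibly seen before (false positives are possible),
    // false if it is definitely new. The URL is recorded in either case.
    boolean isDuplicate(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01234567) | 1; // force an odd step for double hashing
        boolean allSet = true;
        for (int i = 0; i < hashCount; i++) {
            int index = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(index)) {
                allSet = false;
                bits.set(index);
            }
        }
        return allSet;
    }

    // FNV-1a style 32-bit hash with a configurable starting value.
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }
}

class BloomSketchDemo {
    public static void main(String[] args) {
        BloomUrlFilter filter = new BloomUrlFilter(1 << 20, 4);
        System.out.println(filter.isDuplicate("https://example.com/a")); // false
        System.out.println(filter.isDuplicate("https://example.com/a")); // true
    }
}
```

The trade-off is memory against accuracy: the filter uses a fixed number of bits regardless of how many URLs pass through it, but it may occasionally report an unseen URL as a duplicate, so a small fraction of pages could be skipped. A set-based remover has no false positives but grows with every URL.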