Crawl-frontier

Latest version: v0.2.0



0.7.1

Thanks to voith, a problem dating from the start of Python 3 support is now solved: Frontera supported only keys and values stored as bytes in .meta fields, so many Scrapy middlewares either didn't work or worked incorrectly. The fix is still not properly tested, so please report any bugs.
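To illustrate the kind of mismatch involved: on Python 3, a meta dict written with bytes keys does not answer lookups made with `str` keys. The sketch below is not Frontera's code. `to_native_str` and `normalize_meta` are hypothetical helpers, the sample meta is made up, and UTF-8 is an assumed encoding.

```python
def to_native_str(value, encoding="utf-8"):
    """Decode bytes to str; leave every other type untouched."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def normalize_meta(meta, encoding="utf-8"):
    """Return a copy of a (possibly nested) meta structure with bytes decoded."""
    if isinstance(meta, dict):
        return {
            to_native_str(key, encoding): normalize_meta(value, encoding)
            for key, value in meta.items()
        }
    if isinstance(meta, (list, tuple)):
        return type(meta)(normalize_meta(item, encoding) for item in meta)
    return to_native_str(meta, encoding)


# Hypothetical meta as a bytes-only storage layer might return it.
raw_meta = {b"depth": 2, b"redirect_urls": [b"http://example.com/a"]}
meta = normalize_meta(raw_meta)

# Middleware-style lookups with str keys now succeed.
print(meta["depth"])
print(meta["redirect_urls"])
```

Without the conversion, `raw_meta["depth"]` raises `KeyError` on Python 3, which is roughly how a middleware expecting native strings would fail.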

Other improvements include:
- batched states refresh in crawling strategy,
- proper access to redirects in Scrapy converters,
- more readable and simple OverusedBuffer implementation,
- examples, tests and docs fixes.

Thank you all for your contributions!

0.7.0

Long-awaited support for the kafka-python 1.x.x client. Frontera is now much more resistant to physical connectivity loss and uses the new asynchronous Kafka API.
Other improvements:
- SW consumes less CPU (because state is flushed less often),
- the request creation API in BaseCrawlingStrategy has changed and is now batch-oriented,
- new article in the docs on cluster setup,
- option to disable scoring log consumption in DB worker,
- fix for dropping HBase tables,
- improved test coverage.

0.6.0

- Full Python 3 support 👏 👍 🍻 (https://github.com/scrapinghub/frontera/issues/106), all thanks go to Preetwinder.
- `canonicalize_url` method removed in favor of the w3lib implementation.
- The whole `Request` (incl. meta) is propagated to DB Worker, by means of scoring log (fixes https://github.com/scrapinghub/frontera/issues/131)
- CRC32 is now generated from the hostname the same way on both Python 2 and Python 3.
- `HBaseQueue` now supports delayed requests. A `crawl_at` field in meta, holding a timestamp, makes the request available to spiders only after that moment has passed. This is an important feature for revisiting.
- The `Request` object is now persisted in `HBaseQueue`, allowing requests to be scheduled with specific meta, headers, body, and cookies parameters.
- New `MESSAGE_BUS_CODEC` option for choosing a message bus codec other than the default.
- Strategy worker refactored to simplify its customization from subclasses.
- Fixed a bug with extracted links distribution over spider log partitions (https://github.com/scrapinghub/frontera/issues/129).
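
The delayed-request behavior described for `crawl_at` can be pictured with a small in-memory queue. This is only a sketch of the idea, not the `HBaseQueue` implementation: the `DelayedQueue` class, its method names, and the URLs are assumptions made for illustration; only the `crawl_at` meta field comes from the notes above.

```python
import heapq
import itertools
import time


class DelayedQueue:
    """In-memory stand-in for a queue that honours a 'crawl_at' timestamp."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal timestamps

    def schedule(self, url, meta=None, now=None):
        meta = dict(meta or {})
        now = time.time() if now is None else now
        # Requests without 'crawl_at' are available immediately.
        crawl_at = meta.get("crawl_at", now)
        heapq.heappush(self._heap, (crawl_at, next(self._counter), url, meta))

    def get_next_requests(self, max_n, now=None):
        """Pop up to max_n requests whose crawl_at is not in the future."""
        now = time.time() if now is None else now
        ready = []
        while self._heap and len(ready) < max_n and self._heap[0][0] <= now:
            _, _, url, meta = heapq.heappop(self._heap)
            ready.append((url, meta))
        return ready


queue = DelayedQueue()
queue.schedule("http://example.com/", now=1000)
queue.schedule("http://example.com/revisit", {"crawl_at": 1600}, now=1000)

first = queue.get_next_requests(10, now=1000)   # only the immediate request
later = queue.get_next_requests(10, now=1600)   # the revisit is released now
```

Passing `now` explicitly keeps the example deterministic; a real worker would rely on the wall clock instead.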

0.5.3

New options for managing the broad crawling queue's get algorithm, and improved logging in the manager and strategy worker.

0.5.2.3

See https://github.com/scrapinghub/frontera/issues/173

0.5.2.2

- `CONSUMER_BATCH_SIZE` is removed and two new options are introduced: `SPIDER_LOG_CONSUMER_BATCH_SIZE` and `SCORING_LOG_CONSUMER_BATCH_SIZE`.
- A traceback is written to the log when SIGUSR1 is received in DBW or SW.
- SW now finishes correctly when the crawling strategy reports that it is finished.
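
Logging a traceback on SIGUSR1 is a common way to see where a long-running worker is stuck. The sketch below shows the general technique with the standard library only. It is not Frontera's handler: the logger name and function names are assumptions, and SIGUSR1 is unavailable on Windows, hence the guard.

```python
import logging
import os
import signal
import sys
import traceback

logger = logging.getLogger("worker")  # hypothetical logger name


def format_stack(frame):
    """Render the call stack ending at `frame` as a single string."""
    return "".join(traceback.format_stack(frame))


def log_traceback(signum, frame):
    """Signal handler: write the interrupted stack to the log."""
    logger.info("Received signal %s, current stack:\n%s", signum, format_stack(frame))


def install_handler():
    """Register the handler where SIGUSR1 exists (it does not on Windows)."""
    if hasattr(signal, "SIGUSR1"):
        signal.signal(signal.SIGUSR1, log_traceback)
        return True
    return False


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    if install_handler():
        # Send the signal to ourselves to trigger one dump.
        os.kill(os.getpid(), signal.SIGUSR1)
```

From a shell, the same dump would be triggered with `kill -USR1 <pid>` against a process that has installed such a handler.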

