Frontera

Latest version: v0.8.1

0.8.1

Several bugs are fixed as a result of dependency updates.

0.8.0.1

* general-spider example is fixed,
* strategy worker (SW) crashes with ZeroMQ are fixed (stats output is wiped),
* documentation update.

0.8.0

This is a major release containing many architectural changes. The goal of these changes is to make development and debugging of crawling strategies easier. The documentation now includes an extensive guide on writing a custom crawling strategy. A new single-process mode makes it much easier to debug a crawling strategy locally, while the old distributed mode remains available for production systems. Starting from this version, there is no requirement to set up Apache Kafka or HBase to experiment with crawling strategies on your local computer.

We also removed unnecessary, rarely used features (the distributed spiders run mode and the prioritisation logic in backends) to make Frontera easier to use and understand.

Here is a (somewhat) full change log:
* PyPy (2.7.*) support,
* Redis backend (kudos to khellan),
* LRU cache and two cache generations for HBaseStates,
* Discovery crawling strategy, respecting robots.txt and leveraging sitemaps to discover links faster,
* Breadth-first and depth-first crawling strategies,
* new mandatory component in backend: DomainMetadata,
* filter_links_extracted method in crawling strategy API to optimise calls to backends for state data,
* create_request in crawling strategy is now using FronteraManager middlewares,
* many batch gen instances,
* support for the latest kafka-python,
* statistics are sent to message bus from all parts of Frontera,
* overall reliability improvements,
* settings for OverusedBuffer,
* DBWorker was refactored and divided into components (kudos to vshlapakov),
* seeds can now be added using S3,
* Python 3.7 compatibility.
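
To illustrate how the breadth-first and depth-first strategies listed above differ, here is a minimal sketch in plain Python. It is a toy model, not Frontera's implementation: `bfs_score`, `dfs_score`, `crawl_order` and the in-memory `graph` are hypothetical names introduced for this example. It assumes that a strategy assigns each link a score based on its depth and that higher-scored links are fetched first.

```python
import heapq


def bfs_score(depth):
    # Assumption: shallower pages get higher scores, giving breadth-first order.
    return 1.0 / (depth + 1.0)


def dfs_score(depth):
    # Assumption: deeper pages get higher scores, giving depth-first order.
    return depth + 1.0


def crawl_order(graph, seed, score_fn):
    """Return pages in the order a score-driven queue would visit them.

    graph maps a page to the list of pages it links to (a stand-in for
    links extracted from real responses).
    """
    counter = 0  # tie-breaker: equal scores are served in insertion order
    heap = [(-score_fn(0), counter, seed, 0)]
    seen = {seed}
    order = []
    while heap:
        _, _, page, depth = heapq.heappop(heap)
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                counter += 1
                heapq.heappush(
                    heap, (-score_fn(depth + 1), counter, link, depth + 1)
                )
    return order


graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"]}
print(crawl_order(graph, "a", bfs_score))
print(crawl_order(graph, "a", dfs_score))
```

Only the scoring function changes between the two runs; the queue itself stays the same, which is one reason a score-based design keeps strategies small.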

0.7.1

Thanks to voith, a problem introduced when Python 3 support began is now solved: Frontera supported only keys and values stored as bytes in .meta fields, so many Scrapy middlewares either did not work or worked incorrectly. The fix is still not properly tested, so please report any bugs.
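
The underlying mismatch is easy to reproduce in plain Python 3, where `b"depth"` and `"depth"` are different dictionary keys. The sketch below only illustrates the problem; `meta_to_native` is a hypothetical helper written for this example and is not how the fix is implemented in Frontera.

```python
def meta_to_native(meta):
    """Hypothetical helper: decode bytes keys and values to str, recursing into dicts."""

    def decode(value):
        if isinstance(value, bytes):
            return value.decode("utf-8")
        if isinstance(value, dict):
            return {decode(k): decode(v) for k, v in value.items()}
        return value

    return decode(meta)


# A meta dict that stores keys and values as bytes only.
bytes_meta = {b"depth": 1, b"domain": {b"name": b"example.com"}}

# Code that looks up str keys will not find the bytes key on Python 3.
print("depth" in bytes_meta)
print("depth" in meta_to_native(bytes_meta))
```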

Other improvements include:
- batched states refresh in crawling strategy,
- proper access to redirects in Scrapy converters,
- simpler and more readable OverusedBuffer implementation,
- examples, tests and docs fixes.

Thank you all for your contributions!

0.7.0

Long-awaited support for the kafka-python 1.x.x client. Frontera is now much more resilient to loss of physical connectivity and uses the new asynchronous Kafka API.
Other improvements:
- SW consumes less CPU (because states are flushed less often),
- the request creation API in BaseCrawlingStrategy has changed and is now batch-oriented,
- new article in the docs on cluster setup,
- option to disable scoring log consumption in DB worker,
- fix for HBase drop table,
- improved test coverage.

0.6.0

- Full Python 3 support 👏 👍 🍻 (https://github.com/scrapinghub/frontera/issues/106), all thanks go to Preetwinder.
- canonicalize_url method removed in favor of the w3lib implementation.
- The whole `Request` (incl. meta) is propagated to DB Worker by means of the scoring log (fixes https://github.com/scrapinghub/frontera/issues/131).
- CRC32 is now generated from the hostname the same way on both platforms: Python 2 and 3.
- `HBaseQueue` now supports delayed requests. A `crawl_at` field in meta containing a timestamp makes the request available to spiders only after that moment has passed. This is an important feature for revisiting.
- `Request` object is now persisted in `HBaseQueue`, which allows scheduling requests with specific meta, headers, body and cookies parameters.
- New `MESSAGE_BUS_CODEC` option for choosing a message bus codec other than the default.
- Strategy worker refactoring to simplify its customization from subclasses.
- Fixed a bug with extracted links distribution over spider log partitions (https://github.com/scrapinghub/frontera/issues/129).
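
The delayed-request behaviour described above can be sketched with a small in-memory queue. This is a toy model under stated assumptions, not `HBaseQueue` itself: the `DelayedQueue` class, its method names and the plain-dict request shape are hypothetical, and only the `crawl_at` meta field comes from the release notes.

```python
import heapq
import time


class DelayedQueue:
    """Toy in-memory queue illustrating 'crawl_at' semantics (not HBaseQueue)."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so request dicts are never compared

    def schedule(self, request):
        # Assumption: a request without 'crawl_at' is available immediately.
        crawl_at = request.get("meta", {}).get("crawl_at", 0)
        heapq.heappush(self._heap, (crawl_at, self._counter, request))
        self._counter += 1

    def get_next_requests(self, max_n, now=None):
        # Hand out only requests whose 'crawl_at' timestamp has passed.
        now = time.time() if now is None else now
        batch = []
        while self._heap and len(batch) < max_n and self._heap[0][0] <= now:
            batch.append(heapq.heappop(self._heap)[2])
        return batch


queue = DelayedQueue()
# Revisit this page one hour from now.
queue.schedule({"url": "http://example.com/", "meta": {"crawl_at": time.time() + 3600}})
queue.schedule({"url": "http://example.com/new", "meta": {}})
print([r["url"] for r in queue.get_next_requests(10)])
```

For revisiting, a strategy would schedule an already crawled page again with `crawl_at` set to the desired revisit time, and the queue would hold it back until then.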
