Trove-newspaper-harvester

Latest version: v0.7.2


0.7.2

Minor update that adds `name` and `description` fields to the root of the generated RO-Crate file.
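As a rough illustration of where these fields sit, the sketch below looks up the root data entity in an RO-Crate metadata document and reads its `name` and `description`. It assumes that "root" here means the root data entity that the metadata descriptor points to through `about`, as in the RO-Crate specification. The `root_dataset` helper and the sample values are hypothetical and are not taken from the harvester.

```python
def root_dataset(crate):
    """Return the root data entity of an RO-Crate metadata document, or None."""
    graph = crate.get("@graph", [])
    root_id = "./"  # conventional identifier, used if no descriptor is found
    for entity in graph:
        # The metadata descriptor names the root entity in its "about" property.
        if entity.get("@id") == "ro-crate-metadata.json":
            root_id = entity.get("about", {}).get("@id", "./")
            break
    for entity in graph:
        if entity.get("@id") == root_id:
            return entity
    return None


# Hypothetical, heavily trimmed crate; a real harvest records much more.
sample_crate = {
    "@graph": [
        {"@id": "ro-crate-metadata.json", "about": {"@id": "./"}},
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example harvest",
            "description": "Placeholder description of a harvest.",
        },
    ]
}

root = root_dataset(sample_crate)
print(root["name"], "|", root["description"])
```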

0.7.1

Major update to work with v3 of the Trove API. Other changes include:

- Each harvest is now documented by two automatically generated files. The first, `ro-crate-metadata.json`, contains contextual information about the harvest, such as the date and harvester version, saved in [RO-Crate](https://www.researchobject.org/ro-crate/) format. The second, `harvester_config.json`, saves the query parameters sent to the Trove API and the harvester options. This means you always know how and when a harvest was created, and can share information about the provenance of a harvested dataset in a standard way.
- You can now initiate a harvest by pointing to an existing `harvester_config.json` file. This makes it easy to rerun a harvest at some future date in order to pick up changes in the Trove corpus.
- The way `results.csv` files are generated has been changed to use less memory: the whole ndjson file is no longer loaded into Pandas.
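
The memory-friendly approach in the last point can be sketched as a line-by-line conversion from ndjson to CSV. This is a minimal sketch using only the standard library, not the harvester's own code: the `ndjson_to_csv` function, the file names, and the field list are all assumptions made for illustration.

```python
import csv
import json
from pathlib import Path


def ndjson_to_csv(ndjson_path, csv_path, fields):
    """Convert an ndjson file to CSV one record at a time.

    Only a single record is held in memory at any point.
    Returns the number of rows written.
    """
    rows_written = 0
    with Path(ndjson_path).open(encoding="utf-8") as src, \
            Path(csv_path).open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Keep only the requested fields; missing ones become empty cells.
            writer.writerow({field: record.get(field, "") for field in fields})
            rows_written += 1
    return rows_written
```

A call might look like `ndjson_to_csv("articles.ndjson", "results.csv", ["id", "title"])`, where both file names and field names are placeholders. Memory use stays roughly constant regardless of the size of the harvest, at the cost of needing the column list up front.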

0.6.6

Minor changes:

- Handle articles without page URLs (e.g. 'Coming soon' articles)
- Use correct package name for BeautifulSoup in requirements

0.6.5

- Better error messages for CLI
- Better handling of exceptions

0.6.3

- Fix to handle articles with missing metadata
- Don't try to re-download existing text and PDF files on restart
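
The restart behaviour in the second point can be sketched as a check for files already on disk before downloading. This is a minimal sketch under stated assumptions: the `needs_download` and `pending_downloads` helpers and the `<article id><suffix>` file naming are hypothetical, since the harvester's actual naming scheme is not described here.

```python
from pathlib import Path


def needs_download(path):
    """Return True if the file is missing or empty."""
    p = Path(path)
    return not (p.exists() and p.stat().st_size > 0)


def pending_downloads(article_ids, output_dir, suffix=".txt"):
    """List the article ids whose files still need to be fetched.

    Assumes one file per article named '<id><suffix>' (hypothetical naming).
    """
    out = Path(output_dir)
    return [a for a in article_ids if needs_download(out / f"{a}{suffix}")]
```

Treating empty files as missing means a download that was interrupted before any data was written is retried on restart.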

0.6.2

This is the first release of the updated Trove Newspaper Harvester. The version numbering carries on from the old troveharvester package that it supersedes. See [this post for more details](https://updates.timsherratt.org/2022/09/22/do-you-want.html).
