Matricula-online-scraper

Latest version: v0.4.1

0.4.1

What's Changed

Fixed
* Previously, scraping locations with `fetch location --place "something"` produced improperly formatted or incomplete output (4). This is now fixed (24).

The most notable changes are listed above. See the full changelog for all changes.
**Full Changelog**: https://github.com/lsg551/matricula-online-scraper/compare/v0.4.0...v0.4.1

0.4.0

What's Changed

Added Features

* Coordinates of parishes are now included by default when scraping locations. The new fields `longitude` and `latitude` contain floats. Because this effectively doubles the number of requests, the extra information comes at a cost. If performance suffers, disable the feature with the flag `--exclude-coordinates`. (20)
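
As a minimal sketch of consuming the new fields, the following Python snippet reads location records from JSON Lines text and keeps those that carry numeric coordinates. Only the field names `longitude` and `latitude` come from the note above. The `name` field, the sample records and the helper `with_coordinates` are hypothetical, and it is an assumption that records scraped with `--exclude-coordinates` simply lack both fields.

```python
import json


def with_coordinates(jsonl_text):
    """Yield (record, (latitude, longitude)) for records that carry both fields as numbers."""
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        lat, lon = record.get("latitude"), record.get("longitude")
        if isinstance(lat, (int, float)) and isinstance(lon, (int, float)):
            yield record, (float(lat), float(lon))


# Hypothetical sample data; real records contain more fields.
SAMPLE = "\n".join([
    json.dumps({"name": "Example parish A", "latitude": 50.77, "longitude": 6.08}),
    json.dumps({"name": "Example parish B"}),  # assumed shape without coordinates
])

located = list(with_coordinates(SAMPLE))
```

Filtering on the presence of both fields keeps the same code usable whether or not coordinates were scraped.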

The most notable changes are listed above. See the full changelog for all changes.
**Full Changelog**: https://github.com/lsg551/matricula-online-scraper/compare/v0.3.0...v0.4.0

0.3.0

What's Changed


Added Features
* Support for JSON and CSV. Use the new option `--file-format` (`-e`) to specify the format of the output / extracted data. You can choose between JSON Lines, regular JSON and CSV. (16)

Fixed or Changed
* ⚠️ _[BREAKING CHANGE]_ The CLI argument `output_file_name` in `fetch parish` and `fetch location` is no longer required because it now has a default value. If it is omitted, a file is created automatically in the current working directory. Its name depends on the subcommand and is shown in the help menu.
* ⚠️ _[BREAKING CHANGE]_ Additionally, the export format (and with it the file extension) is no longer configurable through the filename. Instead, choose from `json`, `jsonl` and `csv` with the new CLI option `--file-format` (`-e`).
* ⚠️ _[BREAKING CHANGE]_ Previously, the CLI aborted if the specified path already existed. Now, the new option `--append` is enabled by default and appends data to existing files instead of exiting. Pass `--no-append` to turn off this behaviour.
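
For downstream processing, the three export formats can be read back with the Python standard library. The sketch below is illustrative only: the helper `parse_export` is hypothetical, and it assumes that the regular JSON export is a top-level array of records and that the CSV export starts with a header row. Neither detail is stated in these notes.

```python
import csv
import io
import json


def parse_export(text, file_format):
    """Parse exported text into a list of dicts, based on the format name."""
    if file_format == "jsonl":
        # JSON Lines: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if file_format == "json":
        # Assumption: a top-level array; a single object is wrapped in a list.
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    if file_format == "csv":
        # Assumption: the first row holds the column names.
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"unsupported format: {file_format!r}")


rows_jsonl = parse_export('{"a": 1}\n{"a": 2}\n', "jsonl")
rows_json = parse_export('[{"a": 1}]', "json")
rows_csv = parse_export("a,b\n1,2\n", "csv")
```

Note that `csv.DictReader` returns every value as a string, so numeric fields need an explicit conversion after loading a CSV export.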

The most notable changes are listed above. See the full changelog for all changes.
**Full Changelog**: https://github.com/lsg551/matricula-online-scraper/compare/v0.2.2...v0.3.0

0.2.2

What's Changed

Added Features

* A new CLI option `--version` now prints the CLI's version. Run `$ matricula-online-scraper --version`. (11)

Fixed or Changed

* ⚠️ _[BREAKING CHANGE]_ The CLI option `--urls` for `fetch parish` was renamed to `--url` (short `-u`). This option specifies which parish URLs on Matricula should be fetched. It must be given at least once and can be repeated to fetch multiple parishes. E.g. previously you could run `$ matricula-online-scraper fetch parish ./out --urls https://data.matricula-online.eu/en/deutschland/aachen/aachen-hl-kreuz/ --urls https://data.matricula-online.eu/en/slovenia/maribor/bizeljsko/` to fetch all sources of the two specified parishes. However, the singular seems better suited, so the option was renamed without any further changes. (#14)
* ⚠️ _[BREAKING CHANGE]_ If you look at the listed sources of any parish in the tabular data section ([example](https://data.matricula-online.eu/de/deutschland/aachen/aachen-hl-kreuz/)), you will notice that two adjacent rows belong together. The main row contains a URL to the images, an accession number, a type and a date range. Clicking the book icon in the main row unfolds a second, collapsible row below it that contains extra information. This row was already scraped before, but because its fields are inconsistent, not all of them could be included. Now, all fields are scraped and included. The previously hardcoded fields `type` and `comment` have been removed explicitly. Both, and more, are still included, but they are now named dynamically in the output according to the Matricula reference row. (#13)
* Sometimes the pages of parishes are blank ([example](https://data.matricula-online.eu/en/italien/bozen-brixen/Suedtirol_01/)). This is mostly intentional: instead of listing all sources in a table, the page gives an external URL to a third-party service, most often the parish's own system. Previously, these pages were ignored. Now the URLs are scraped too and included in the output as `{ "external_url": "http://some.other.parish/" }`. (#13)
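
Because the output of `fetch parish` can now mix regular source records with records that only carry an `external_url`, downstream code may want to separate the two. The following Python sketch shows one way to do so. Only the `external_url` key and its example value come from the note above. The helper `split_records` and the `name` field of the second sample record are hypothetical.

```python
def split_records(records):
    """Separate external-service pointers from regular source records."""
    external, sources = [], []
    for record in records:
        if "external_url" in record:
            # Blank parish page: only a link to a third-party service was scraped.
            external.append(record["external_url"])
        else:
            sources.append(record)
    return external, sources


records = [
    {"external_url": "http://some.other.parish/"},
    {"name": "hypothetical source"},  # field name is made up for illustration
]
external, sources = split_records(records)
```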



The most notable changes are listed above. See the full changelog for all changes.
**Full Changelog**: https://github.com/lsg551/matricula-online-scraper/compare/v0.1.0...v0.2.2

0.1.0

This first version of the scraper is a rudimentary implementation and offers basic functionality.
- One can scrape information about available locations, i.e. regions, places, cities or parishes as well as virtual entities that Matricula Online has digitized content of. Usually this is a parish with digitized parish registers or similar content. The data consists only of metadata about these locations (geographical information, URL, name, date range, notes), including a URL to the parish's main page with the actual digitized sources (see below). This operation can be filtered by various parameters, or all locations can be scraped. https://data.matricula-online.eu/en/suchen/ is the scraped page.
- Information about all the digitized sources of parishes can be scraped too. An example of a parish's page is https://data.matricula-online.eu/de/deutschland/muenster/muenster-st-servatii/. This operation also scrapes metadata only (name of the source, type, date range, URL to the actual content, notes).

Note that this very first version is not feature-complete. Not all resources Matricula offers can be scraped with this version (e.g. the actual content, i.e. images of parish registers such as https://data.matricula-online.eu/de/deutschland/muenster/muenster-st-servatii/KB001_2/?pg=1).

⚠️ This is a semver version < 1.0.0. Bugs and breaking changes are to be expected. Please report any issues.
