This is a major release with many breaking changes, but most of them are easy to fix.
It focuses on type safety with the introduction of runtime checks: any call to zimscraperlib API must match the type definition or an exception will be raised.
Documentation is available as docstrings and on https://python-scraperlib.readthedocs.io
Main changes include:
- ZIM metadata handling has completely changed with new types for each kind of metadata.
- `i18n` module has been redesigned around a single main class `Language`
- New `rewriting` module for HTML/CSS/JS (JS rewriting is done at runtime via Wombat)
- Now supporting only Python 3.12
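
To illustrate what "runtime checks" means for callers, here is a minimal sketch using only the standard library. It is not how the library or beartype implements its checks: the `check_types` decorator and the `set_title` function are hypothetical, only plain-class annotations are handled, and the exception class raised by the real library may differ from the `TypeError` used here.

```python
import functools
import inspect
from typing import get_type_hints


def check_types(func):
    """Illustrative stand-in: reject calls whose arguments do not match
    plain-class annotations. Real checkers such as beartype cover far more."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes are checked in this sketch (no generics or unions).
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name}={value!r} is not an instance of {expected.__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@check_types
def set_title(title: str, max_length: int = 30) -> str:
    # Hypothetical function, not part of zimscraperlib.
    return title[:max_length]
```

With this sketch, `set_title("Hello", 3)` succeeds, while `set_title(123)` fails at call time instead of misbehaving later. Code that previously passed loosely typed values to the library should expect the same kind of early failure.
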
Added
- Documentation using `mkdocs`, published on readthedocs.io (92)
- `rewriting` module to rewrite URLs in content for generic scrapers
- `rewriting.css` to rewrite URLs in CSS files
- `rewriting.html` to rewrite URLs in HTML files
- `rewriting.js` to rewrite URLs in JS files (at runtime, using `wombat`)
- `wombat-setup` javascript module in `javascript/`
- `typing` module with custom types:
  - `Callback` to use where we expect callbacks
  - `SupportsWrite`, `SupportsRead`, `SupportsSeeking`, `SupportsSeekableRead` and `SupportsSeekableWrite`: protocols for IO type annotations
- `zim.metadata` module with a type-based approach for each kind of metadata and helpers for custom ones
- [`zim.metadata`] `APPLY_RECOMMENDATIONS`: general flag to toggle openZIM-recommended constraints
- [`zim.metadata`] Type-based classes: `Metadata`, `TextBasedMetadata`, `TextListBasedMetadata`, `DateBasedMetadata`, `IllustrationBasedMetadata`
- [`zim.metadata`] Usage-based classes: `NameMetadata`, `LanguageMetadata`, `DefaultIllustrationMetadata`, etc.
- [`zim.metadata`] `StandardMetadataList` to package the standard metadata
- See details for additional API endpoints and variables
- [`constants`] `DEFAULT_WEB_REQUESTS_TIMEOUT` exposed for `download` module
- [`download`] `stream_file()` now accepts `timeout: int` param (defaults to constant timeout) (222)
- [`filesystem`] `path_from` context manager to acquire a pathlib `Path` from `Path` or `TemporaryDirectory`
- [`i18n`] `Language`, `get_language()` and `get_language_or_none()`. See breaking changes
- [`image.optimization`] `OptimizePngOptions` dataclass to store PNG options
- [`image.optimization`] `OptimizeJpgOptions` dataclass to store JPEG options
- [`image.optimization`] `OptimizeWebpOptions` dataclass to store WebP options
- [`image.optimization`] `OptimizeGifOptions` dataclass to store GIF options
- [`image.optimization`] `OptimizeOptions` dataclass to store cross-formats options
- [`inputs`] `unique_values()` to deduplicate a list while preserving order
- [`logging`] `DEFAULT_FORMAT_WITH_THREADS` as many scrapers use threads
- [`video.encoding`] `reencode()`'s `existing_tmp_path` param
- [`zim.filesystem`] `validate_folder_writable()` to ensure one can write into a folder (200)
- [`zim.creator`] `Creator._get_first_language_metadata_value()` to retrieve first language from metadata
- [`zim.items`] `no_indexing_indexdata()` to get an IndexData that disables indexing
- [`zim.items`] `URLItem.get_mimetype()` now only returning `str`
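
As an illustration of the "deduplicate a list while preserving order" behavior described for `unique_values()`, here is a minimal standard-library sketch. The name `unique_values_sketch` is hypothetical and this is not the library's implementation or signature; it only shows the expected result on hashable items.

```python
from collections.abc import Hashable, Iterable


def unique_values_sketch(items: Iterable[Hashable]) -> list[Hashable]:
    # dict keys keep insertion order, so the first occurrence
    # of each value wins and later duplicates are dropped.
    return list(dict.fromkeys(items))


languages = ["eng", "fra", "eng", "deu", "fra"]
deduplicated = unique_values_sketch(languages)
```

Unlike `list(set(items))`, this keeps the values in the order they first appeared, which matters for inputs such as a list of language codes where the first entry is the primary one.
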
Changed (Breaking)
- Entire API is now type-protected using beartype. Any call to scraperlib that doesn't satisfy the annotated types will raise an exception
- [`constants`] `MANDATORY_ZIM_METADATA_KEYS` and `DEFAULT_DEV_ZIM_METADATA` moved to `zim/metadata`
- [`download`] `YoutubeDownloader.download`'s `options` parameter now expects a `dict[str, Any]` instead of `dict`
- [`download`] `YoutubeConfig` options now limited to `str | bool | int | None`
- [`download`] `_get_retry_adapter()` now exposed as `get_retry_adapter()`
- [`download`] `stream_file`'s `byte_stream` param now more flexible, accepting `SupportsWrite[bytes] | SupportsSeekableWrite[bytes]`
- [`download`] `stream_file`'s `proxies` param now accepting `dict[str, str]` instead of `dict`
- [`filesystem`] `delete_callback()` is now a simple callback accepting an `fpath` and deleting it (it no longer chains other callbacks).
- [`filesystem`] `delete_callback()` doesn't fail on missing file (192)
- [`i18n`] Redesigned API around a single object:
  - `Language`, which is initialized with any acceptable code and raises `NotFoundError` when no ISO 639-3 match is found
  - `find_language_names()` is retained but only accepts a `query: str`
  - added `get_language()` and `get_language_or_none()` as shortcuts around `Language`
  - `is_valid_iso_639_3()` is retained
- [`image.conversion`] `convert_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src` and `dst`.
- [`image.conversion`] `convert_svg2png()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src` and `dst`.
- [`image.optimization`] `optimize_png()` now accepts `options: OptimizePngOptions` instead of individual params.
- [`image.optimization`] `optimize_jpeg()` now accepts `options: OptimizeJpgOptions` instead of individual params.
- [`image.optimization`] `optimize_webp()` now accepts `options: OptimizeWebpOptions` instead of individual params.
- [`image.optimization`] `optimize_gif()` now accepts `options: OptimizeGifOptions` instead of individual params.
- [`image.presets`] All presets now use the new options dataclass instead of ClassVar dict
- [`image.probing`] `format_for()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src`.
- [`image.probing`] `is_valid_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `image`.
- [`image.utils`] `save_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `dst`.
- [`video.config`] `Config` now uses type annotations (it previously had almost none).
- [`video.config`] `Config` options only expecting `str | None`
- [`video.presets`] All options only expecting `str | None`
- [`video.encoding`] `reencode()` now always returning a `tuple[bool, CompletedProcess]`
- [`zim._libkiwix`] `MimetypeAndCounter` now expects specific types for `mimetype: str` and `value: int`
- [`zim.filesystem`] `make_zim_file()` `publisher` param now properly expects a `str`
- [`zim.filesystem`] `IncorrectZIMPathError` renamed to `IncorrectPathError`
- [`zim.filesystem`] `MissingZIMFolderError` renamed to `MissingFolderError`
- [`zim.filesystem`] `NotADirectoryZIMFolderError` renamed to `NotADirectoryFolderError`
- [`zim.filesystem`] `NotWritableZIMFolderError` renamed to `NotWritableFolderError`
- [`zim.filesystem`] `IncorrectZIMFilenameError` renamed to `IncorrectFilenameError`
- [`zim.filesystem`] `validate_zimfile_creatable()` renamed to `validate_file_creatable()`
- [`zim.items`] `Item` and `StaticItem` now expecting `hints` as `dict[libzim.writer.Hint, int]` instead of `dict`
- [`zim.items`] `Item.get_hints()` now returning `dict[libzim.writer.Hint, int]` instead of `dict`
- [`zim.items`] `URLItem.download_for_size()` now specifying type annotations and reordered params
- [`zim.providers`] `FileLikeProvider.gen_blob()` and `URLProvider.gen_blob()` now properly annotate their return type (`Generator[libzim.writer.Blob, None, None]`)
- [`zim.providers`] `URLProvider.get_size_of()` param `url` now explicitly expects a `str`
- [`zim.creator`] `Creator.config_metadata()` signature changed, now mainly accepting a `StandardMetadataList`
- [`zim.creator`] `Creator.config_dev_metadata()` signature changed to accept new metadata types
- [`zim.creator`] `Creator.add_item_for()`'s `callback` renamed to `callbacks` and accepting `Callback`
- [`zim.creator`] `Creator.add_item()`'s `callback` renamed to `callbacks` and accepting `Callback`
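
Several entries above replace concrete IO types with protocols such as `SupportsWrite[bytes]`. The sketch below shows the general idea of protocol-based (structural) typing using only the standard library. `WritesBytes`, `save_payload` and `CountingSink` are hypothetical names for illustration and are not part of zimscraperlib; the library's own protocols may be defined differently.

```python
import io
from typing import Protocol, runtime_checkable


@runtime_checkable
class WritesBytes(Protocol):
    # Hypothetical protocol, modelled on the idea of `SupportsWrite[bytes]`.
    def write(self, data: bytes, /) -> int: ...


def save_payload(payload: bytes, byte_stream: WritesBytes) -> int:
    # Any object exposing a compatible `write()` method is accepted.
    return byte_stream.write(payload)


class CountingSink:
    """Custom sink that is not an io class but still satisfies the protocol."""

    def __init__(self) -> None:
        self.total = 0

    def write(self, data: bytes, /) -> int:
        self.total += len(data)
        return len(data)
```

Both `io.BytesIO()` and `CountingSink()` can be passed to `save_payload()`, since matching is based on the methods an object provides and not on its class hierarchy. This is why a protocol-typed parameter such as `byte_stream` can be described as "more flexible" even though the API is now strictly type-checked.
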
Changed
- [deps] `iso639-lang` now requires at least v2.4.0
- [`download`] `stream_file()` now returns `tuple[int, requests.structures.CaseInsensitiveDict[str]]` instead of `tuple[int, requests.structures.CaseInsensitiveDict]`
- [`download`] `stream_file()` now accepts both `fpath` and `byte_stream` params (writes to both)
- [`image.utils`] `save_image()` now accepts `Any` `**params`.
- [`zim.archive`] `Archive.counters` now returning `CounterMap` (compatible with previous `dict[str, int]`)
Fixed
- Direct dependencies are now properly referenced: pillow, urllib3, piexif, idna (226)
- [`download`] `YoutubeDownloader.download` now respects its return type (`bool | Future[Any]`)
- [`image.conversion`] `convert_image()` `**params` properly declared as accepting `None`.
- [`logging`] `getLogger()`'s `console` now properly accepting `TextIO | io.StringIO | None`
- [`video.probing`] `get_media_info()` type annotation for `src_path`
- [`zim.archive`] `Archive.get_item()` return type (`libzim.reader.Item`)
Removed
- Support for Python 3.8/3.9/3.10/3.11. Only Python 3.12 is supported now.
- [`i18n`] `Lang` (See breaking changes)
- [`i18n`] `get_iso_lang_data()` (See breaking changes)
- [`i18n`] `update_with_macro()` (See breaking changes)
- [`i18n`] `get_language_details()` (See breaking changes)
- [`uri`] `rebuild_uri()`'s `failsafe` param (was only handling incorrect types)
- [`video.encoding`] `reencode()`'s `with_process` param
- [`zim.creator`] `Creator.validate_metadata()`
- [`zim.creator`] `Creator.convert_and_check_metadata()`