Zimscraperlib

Latest version: v5.1.1

5.0.0rc1

This is a major release with many breaking changes, most of which are easy to fix.

It focuses on type safety with the introduction of runtime checks: any call to the zimscraperlib API must match the type definition or an exception will be raised.
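
To make the runtime-check behaviour concrete, here is a minimal, stdlib-only sketch of a decorator that validates arguments against their annotations at call time. It is not how zimscraperlib does it (the changelog names beartype as the actual mechanism); `checked` and `set_title` are hypothetical names, and the sketch only understands annotations that are plain classes.

```python
import functools
import inspect


def checked(func):
    """Raise TypeError when an argument does not match its annotation.

    Hypothetical stand-in for a real runtime type checker: it only
    handles annotations that are plain classes (str, int, ...).
    """
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            param = signature.parameters[name]
            expected = param.annotation
            # Skip unannotated parameters and *args / **kwargs collectors
            if expected is inspect.Parameter.empty:
                continue
            if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
                continue
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@checked
def set_title(title: str, max_length: int = 30) -> str:
    # Hypothetical API function, used only to exercise the decorator
    return title[:max_length]
```

With this in place, `set_title("Wikipedia")` returns normally while `set_title(123)` raises `TypeError` before the function body runs, which is the kind of failure callers should expect when upgrading code that passed loosely typed values.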

Documentation is available as docstrings and on https://python-scraperlib.readthedocs.io

Main changes include:

- ZIM metadata handling has completely changed with new types for each kind of metadata.
- `i18n` module has been redesigned around a single main class `Language`
- New `rewriting` module for HTML/CSS/JS (JS rewriting being done at runtime via Wombat)
- Now supporting only Python 3.12

Added

- Documentation using `mkdocs`, published on readthedocs.com (92)
- `rewriting` module to rewrite URLs in content for generic scrapers
- `rewriting.css` to rewrite URLs in CSS files
- `rewriting.html` to rewrite URLs in HTML files
- `rewriting.js` to rewrite URLs in JS files (at runtime, using `wombat`)
- `wombat-setup` javascript module in `javascript/`
- `typing` module with custom types:
  - `Callback` to use where we expect callbacks
  - `SupportsWrite`, `SupportsRead`, `SupportsSeeking`, `SupportsSeekableRead` and `SupportsSeekableWrite`: protocols for IO type annotations
- `zim.metadata` module with a type-based approach for each kind of metadata and helpers for custom ones
- [`zim.metadata`] `APPLY_RECOMMENDATIONS`: general flag to toggle openZIM-recommended constraints
- [`zim.metadata`] Type-based classes: `Metadata`, `TextBasedMetadata`, `TextListBasedMetadata`, `DateBasedMetadata`, `IllustrationBasedMetadata`
- [`zim.metadata`] Usage-based classes: `NameMetadata`, `LanguageMetadata`, `DefaultIllustrationMetadata`, etc.
- [`zim.metadata`] `StandardMetadataList` to package the standard metadata
- See details for additional API endpoints and variables
- [`constants`] `DEFAULT_WEB_REQUESTS_TIMEOUT` exposed for `download` module
- [`download`] `stream_file()` now accepts `timeout: int` param (defaults to constant timeout) (222)
- [`filesystem`] `path_from` context manager to acquire a pathlib `Path` from `Path` or `TemporaryDirectory`
- [`i18n`] `Language`, `get_language()` and `get_language_or_none()`. See breaking changes
- [`image.optimization`] `OptimizePngOptions` dataclass to store PNG options
- [`image.optimization`] `OptimizeJpgOptions` dataclass to store JPEG options
- [`image.optimization`] `OptimizeWebpOptions` dataclass to store WebP options
- [`image.optimization`] `OptimizeGifOptions` dataclass to store GIF options
- [`image.optimization`] `OptimizeOptions` dataclass to store cross-formats options
- [`inputs`] `unique_values()` to deduplicate a list while preserving order
- [`logging`] `DEFAULT_FORMAT_WITH_THREADS` as many scrapers use threads
- [`video.encoding`] `reencode()`'s `existing_tmp_path` param
- [`zim.filesystem`] `validate_folder_writable()` to ensure one can write into a folder (200)
- [`zim.creator`] `Creator._get_first_language_metadata_value()` to retrieve first language from metadata
- [`zim.items`] `no_indexing_indexdata()` to get an IndexData that disables indexing
- [`zim.items`] `URLItem.get_mimetype()` now only returning `str`

Changed (Breaking)

- Entire API is now type-protected using beartype. Any call to scraperlib that doesn't satisfy the annotated types will raise an exception
- [`constants`] `MANDATORY_ZIM_METADATA_KEYS` and `DEFAULT_DEV_ZIM_METADATA` moved to `zim/metadata`
- [`download`] `YoutubeDownloader.download`'s `options` parameter now expects a `dict[str, Any]` instead of `dict`
- [`download`] `YoutubeConfig` options now limited to `str | bool | int | None`
- [`download`] `_get_retry_adapter()` now exposed as `get_retry_adapter()`
- [`download`] `stream_file`'s `byte_stream` param now more flexible, accepting `SupportsWrite[bytes] | SupportsSeekableWrite[bytes]`
- [`download`] `stream_file`'s `proxies` param now accepting `dict[str, str]` instead of `dict`
- [`filesystem`] `delete_callback()` is now a simple callback accepting an `fpath` and deleting it (doesn't chain other callback anymore).
- [`filesystem`] `delete_callback()` doesn't fail on missing file (192)
- [`i18n`] Redesigned API around a single object:
  - `Language` which is inited with any acceptable code. Raises `NotFoundError` on 639-3 matching failure
  - `find_language_names()` is retained but only accepts a `query: str`
  - added `get_language()` and `get_language_or_none()` as shortcuts around `Language`
  - `is_valid_iso_639_3()` is retained
- [`image.conversion`] `convert_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src` and `dst`.
- [`image.conversion`] `convert_svg2png()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src` and `dst`.
- [`image.optimization`] `optimize_png()` now accepts `options: OptimizePngOptions` instead of individual params.
- [`image.optimization`] `optimize_jpeg()` now accepts `options: OptimizeJpgOptions` instead of individual params.
- [`image.optimization`] `optimize_webp()` now accepts `options: OptimizeWebpOptions` instead of individual params.
- [`image.optimization`] `optimize_gif()` now accepts `options: OptimizeGifOptions` instead of individual params.
- [`image.presets`] All presets now use the new options dataclass instead of ClassVar dict
- [`image.probing`] `format_for()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src`.
- [`image.probing`] `is_valid_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `image`.
- [`image.utils`] `save_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `dst`.
- [`video.config`] `Config` was mostly not using type annotations.
- [`video.config`] `Config` options only expecting `str | None`
- [`video.presets`] All options only expecting `str | None`
- [`video.encoding`] `reencode()` now always returning a `tuple[bool, CompletedProcess]`
- [`zim._libkiwix`] `MimetypeAndCounter` now expects specific types for `mimetype: str` and `value: int`
- [`zim.filesystem`] `make_zim_file()` `publisher` param now properly expects a `str`
- [`zim.filesystem`] `IncorrectZIMPathError` renamed to `IncorrectPathError`
- [`zim.filesystem`] `MissingZIMFolderError` renamed to `MissingFolderError`
- [`zim.filesystem`] `NotADirectoryZIMFolderError` renamed to `NotADirectoryFolderError`
- [`zim.filesystem`] `NotWritableZIMFolderError` renamed to `NotWritableFolderError`
- [`zim.filesystem`] `IncorrectZIMFilenameError` renamed to `IncorrectFilenameError`
- [`zim.filesystem`] `validate_zimfile_creatable()` renamed to `validate_file_creatable()`
- [`zim.items`] `Item` and `StaticItem` now expecting `hints` as `dict[libzim.writer.Hint, int]` instead of `dict`
- [`zim.items`] `Item.get_hints()` now returning `dict[libzim.writer.Hint, int]` instead of `dict`
- [`zim.items`] `URLItem.download_for_size()` now specifying type annotations and reordered params
- [`zim.providers`] `FileLikeProvider.gen_blob()` and `URLProvider.gen_blob()` now properly annotate their return type (`Generator[libzim.writer.Blob, None, None]`)
- [`zim.providers`] `URLProvider.get_size_of()` param `url` now explicitly expects a `str`
- [`zim.creator`] `Creator.config_metadata()` signature changed, now mainly accepting a `StandardMetadataList`
- [`zim.creator`] `Creator.config_dev_metadata()` signature changed to accept new metadata types
- [`zim.creator`] `Creator.add_item_for()`'s `callback` renamed to `callbacks` and accepting `Callback`
- [`zim.creator`] `Creator.add_item()`'s `callback` renamed to `callbacks` and accepting `Callback`
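
The move from individual keyword parameters to options dataclasses in `image.optimization` follows a familiar pattern, sketched below. Everything in this block is hypothetical: `PngOptionsSketch`, its field names, and `plan_png_optimization` are invented for illustration and do not reflect the actual fields of `OptimizePngOptions`, which are documented in the library itself.

```python
from dataclasses import asdict, dataclass, replace


@dataclass
class PngOptionsSketch:
    """Hypothetical option set; field names are illustrative only."""

    max_colors: int = 256
    reduce_colors: bool = False
    background_color: tuple[int, int, int] = (255, 255, 255)


def plan_png_optimization(src_name: str, options: PngOptionsSketch) -> dict:
    """Stand-in for an optimizer: returns the settings it would apply."""
    if not 2 <= options.max_colors <= 256:
        raise ValueError("max_colors must be between 2 and 256")
    return {"src": src_name, **asdict(options)}


# A preset becomes a reusable instance; variants derive from it with replace()
LOW_QUALITY = PngOptionsSketch(max_colors=64, reduce_colors=True)
THUMBNAIL = replace(LOW_QUALITY, max_colors=16)

plan = plan_png_optimization("logo.png", THUMBNAIL)
```

The practical upside of this shape is that presets can be stored, compared, and tweaked as values instead of being threaded through every call as loose keyword arguments.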

Changed

- [deps] `iso639-lang` now requires at least v2.4.0
- [`download`] `stream_file()` now returns `tuple[int, requests.structures.CaseInsensitiveDict[str]]` instead of `tuple[int, requests.structures.CaseInsensitiveDict]`
- [`download`] `stream_file()` now accepts both `fpath` and `byte_stream` params (writes to both)
- [`image.utils`] `save_image()` now accepts `Any` `**params`.
- [`zim.archive`] `Archive.counters` now returning `CounterMap` (compatible with previous `dict[str, int]`)

Fixed

- Direct dependencies are now properly referenced: pillow, urllib3, piexif, idna (226)
- [`download`] `YoutubeDownloader.download` now respects its return type (`bool | Future[Any]`)
- [`image.conversion`] `convert_image()` `**params` properly declared as accepting `None`.
- [`logging`] `getLogger()`'s `console` now properly accepting `TextIO | io.StringIO | None`
- [`video.probing`] `get_media_info()` type annotation for `src_path`
- [`zim.archive`] `Archive.get_item()` return type (`libzim.reader.Item`)

Removed

- Support for Python 3.8/3.9/3.10/3.11. Only Python 3.12 is supported now.
- [`i18n`] `Lang` (See breaking changes)
- [`i18n`] `get_iso_lang_data()` (See breaking changes)
- [`i18n`] `update_with_macro()` (See breaking changes)
- [`i18n`] `get_language_details()` (See breaking changes)
- [`uri`] `rebuild_uri` `failsafe` param (was only handling incorrect types)
- [`video.encoding`] `reencode()`'s `with_process` param
- [`zim.creator`] `Creator.validate_metadata()`
- [`zim.creator`] `Creator.convert_and_check_metadata()`

4.0.0

Added

- Add utility function to compute ZIM Tags 164, including deduplication 156
- Metadata does not automatically drop control characters 159
- New `indexing.IndexData` class to hold title, content and keywords to pass to libzim to index an item
- Automatically index PDF documents content 167
- Automatically set proper title on PDF documents 168
- Expose new `optimization.get_optimization_method` to get the proper optimization method to call for a given image format
- New `creator.Creator.convert_and_check_metadata` to convert metadata to bytes or str for known use cases and check proper type is passed to libzim
- Add `conversion.convert_svg2png` image conversion function + support for SVG in `probing.format_for` 113
- Add `i18n.Lang` class used as typed result of i18n operations 151

Changed

- **BREAKING** Renamed `zimscraperlib.image.convertion` to `zimscraperlib.image.conversion` to fix typo
- **BREAKING** Many changes in type hints to match the real underlying code
- **BREAKING** Force all boolean arguments (and some other non-obvious parameters) to be keyword-only in function calls for clarity / disambiguation (see ruff rule FBT002)
- Prefer to use `IO[bytes]` to `io.BytesIO` when possible since it is more generic
- **BREAKING** `types.get_mime_for_name` now returns `str | None`
- **BREAKING** `creator.Creator.add_metadata` and `creator.Creator.validate_metadata` now only accept `bytes | str` as value (it must have been converted before the call)
- **BREAKING** second argument of `creator.Creator.add_metadata` has been renamed to `value` instead of `content` to align with other methods
- When a type issue arises in metadata checks, wrong value type is displayed in exception
- **BREAKING** `i18n.get_language_details()`, `i18n.get_iso_lang_data()`, `i18n.find_language_names()` and `i18n.update_with_macro` now process / return a new typed `Lang` class 151
- **BREAKING** Rename `i18n.NotFound` to `i18n.NotFoundError`
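
The keyword-only change for boolean arguments affects call sites more than any other item in this list. As a generic Python illustration (the function below is hypothetical and not part of zimscraperlib), parameters declared after a bare `*` can no longer be passed positionally:

```python
def save_page(
    content: str, *, minify: bool = False, strip_comments: bool = False
) -> str:
    """Hypothetical helper: everything after the bare `*` is keyword-only."""
    if strip_comments:
        # Drop lines that start with '#', ignoring leading whitespace
        content = "\n".join(
            line
            for line in content.splitlines()
            if not line.lstrip().startswith("#")
        )
    if minify:
        # Collapse every run of whitespace into a single space
        content = " ".join(content.split())
    return content


# Valid: the flag is named at the call site
compact = save_page("a   b", minify=True)
```

A call such as `save_page("a   b", True)` raises `TypeError`, so upgrading code mostly means naming the flags that were previously passed by position.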

Removed

- **BREAKING** Remove translation features in `i18n`: `Locale` class + `_` and `setlocale` functions 134

Fixed

- Metadata length validation is buggy for unicode strings 158
- Pillow 10.4.0 reveals improper type hints for image probing functions 177
- Enhance error when locale fails to set up 157

3.4.0

Added

- `zim.creator.Creator._log_metadata()` to log (DEBUG) all metadata set on `_metadata` (prior to start()) 155
- New utility function to confirm ZIM can be created at given location / name 163

Changed

- Migrate the **VideoWebmLow** and **VideoWebmHigh** presets to VP9 for smaller file size 79
  - New preset versions are v3 and v2 respectively
- Simplify type annotations by replacing Union and Optional with pipe character ("|") for improved readability and clarity 150
- Calling `Creator._log_metadata()` on `Creator.start()` if running in DEBUG 155

Fixed

- Add back the `--runinstalled` flag for test execution to allow smooth testing on other build chains 139

3.3.2

Added

- Add support for `disable_metadata_checks` and `ignore_duplicates` arguments in `make_zim_file` function ("zimwritefs-mode")

Changed

- Relaxed constraints on Python dependencies
- Upgraded optional dependencies used for test and QA

3.3.1

Added

- Set a user-agent for `handle_user_provided_file` 103

Changed

- Migrate to generic syntax in all std collections 140

Fixed

- Do not modify the ffmpeg_args in reencode function 144
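
The class of bug behind this fix is a function mutating a list that the caller still owns. The changelog does not show how `reencode` assembles its command, so the sketch below is a hypothetical builder (`build_command` is an invented name) that shows the non-mutating approach: build a new list instead of extending the argument in place.

```python
def build_command(src: str, dst: str, ffmpeg_args: list[str]) -> list[str]:
    """Hypothetical command builder that leaves the caller's list untouched."""
    # Unpack into a fresh list instead of calling ffmpeg_args.extend()/insert()
    return ["ffmpeg", "-y", "-i", src, *ffmpeg_args, dst]


# The same argument list can be reused safely across calls
shared_args = ["-codec:v", "libvpx-vp9"]
first = build_command("a.mp4", "a.webm", shared_args)
second = build_command("b.mp4", "b.webm", shared_args)
```

Had the builder appended `dst` to `ffmpeg_args` directly, the second call would have inherited the first call's output path, which is the kind of cross-call leakage this fix avoids.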

3.3.0

Added

- New `disable_metadata_checks` parameter in `zimscraperlib.zim.creator.Creator` initializer, allowing metadata checks to be disabled at startup (assuming the user will validate the metadata on their own) 119

Changed

- Rework the **VideoWebmLow** preset for faster encoding and smaller file size 122
  - preset has been bumped to **version 2**
  - when using an S3 cache, all videos using this preset will be reencoded and uploaded to cache again (it will replace the same file encoded with preset version 1)
- When reencoding a video, ffmpeg now uses only 1 CPU thread by default (a new arg to `reencode` allows overriding this default value)
- Using openZIM Python bootstrap conventions (including hatch-openzim plugin) 120
- Add support for Python 3.12, drop Python 3.7 support 118
- Replace "iso-639" with "iso639-lang" library
- Replace "file-magic" with "python-magic" library for Alpine Linux support and better maintenance

Fixed

- Fixed type hints of `zimscraperlib.zim.Item` and subclasses, and `zimscraperlib.image.optimization:convert_image`
