Warc2zim

Latest version: v2.2.2

2.1.0

Added

- New fuzzy rules for cheatography.com (342), der-postillon.com (330), iranwire.com (363)
- Properly rewrite redirect target url when present in <meta> HTML tag (237)
- New `--encoding-aliases` argument to pass encoding/charset aliases (331)
- Add support for SVG favicon (148)
- Automatically index PDF content and use PDF title (289 and 290)

Changed

- Upgrade to python-scraperlib 4.0.0
- Generate fuzzy rules tests in Python and JavaScript (284)
- Refactor HTML rewriter class to make it more open to change and expressive (305)
- Detect charset in document header only for HTML documents (331)
- Use `software` property from `warcinfo` record to set ZIM `Scraper` metadata (357)
- Store `ContentDate` as metadata, based on `WARC-Date` (358)
- Remove domain specific rules (328)
- Revisit retrieve_illustration logic to prefer best favicons (352 and 369)
- Upgrade dependencies (zimscraperlib 4.0.0, wombat.js 3.7.12 and others) (376)

Fixed

- Handle case where the redirect target is bad / unsupported (332 and 356)
- Fixed WARC files handling order to follow creation order (366)
- Remove subsequent slashes in URLs, both in Python and JS (365)
- Ignore non HTTP(S) WARC records (351)
- Fix `vimeo_cdn_fix` fuzzy rule for proper operation in JavaScript (348)
- Performance issue linked to new "extensible" HTML rewriting rules (370)
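
The slash cleanup mentioned above (365) can be illustrated with a minimal Python sketch. This is not the warc2zim implementation: the function name `collapse_slashes` is hypothetical, and the choice to collapse repeated slashes only in the URL path (leaving the query string and fragment untouched) is an assumption made for the example.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url: str) -> str:
    """Collapse runs of '/' in the path of a URL (hypothetical helper).

    Assumption: only the path is normalized; scheme, host, query and
    fragment are passed through unchanged.
    """
    parts = urlsplit(url)
    # Replace any run of two or more slashes in the path with a single slash.
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


print(collapse_slashes("https://example.com//a///b/?q=1//2"))
```

Applying the same normalization both when writing entries and when resolving links at read time is what keeps the two sides in agreement, which is presumably why the fix covers both Python and JS.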

2.0.3

Changed

- Moved rules definition from JSON to YAML and documented update process (216)
- Upgrade to wombat.js 3.7.11

Added

- Exit with cleaner message when no entries are expected in the ZIM (336) and when main entry is not processable (337)
- Add debug log for items whose content is empty (344)

Fixed

- The rewrite mode of some resources is still not correctly identified (326)

2.0.2

Added

- Add `--ignore-content-header-charsets` option to disable automatic retrieval of content charsets from the first bytes of the content (318)
- Add `--content-header-bytes-length` option to specify how many leading bytes to consider when searching for content charsets in the content header (320)
- Add `--ignore-http-header-charsets` option to disable automatic retrieval of content charsets from the HTTP `Content-Type` header (318)
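
How these three options could interact is sketched below in Python. This is an illustration only, not the warc2zim code: the function `pick_charset`, its keyword arguments, the 1024-byte default, the `utf-8` fallback, and the order of precedence (content bytes first, then the HTTP header) are all assumptions made for the example.

```python
import re
from typing import Optional

# Matches 'charset=...' declarations such as <meta charset="utf-8">.
_CHARSET_BYTES = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)
_CHARSET_TEXT = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)


def pick_charset(
    content: bytes,
    http_content_type: Optional[str],
    *,
    ignore_content_header: bool = False,     # mirrors --ignore-content-header-charsets
    ignore_http_header: bool = False,        # mirrors --ignore-http-header-charsets
    content_header_bytes_length: int = 1024, # mirrors --content-header-bytes-length (default is assumed)
    default: str = "utf-8",                  # assumed fallback
) -> str:
    """Decide a charset without statistical guessing (hypothetical helper)."""
    if not ignore_content_header:
        # Look only at the first N bytes of the document.
        match = _CHARSET_BYTES.search(content[:content_header_bytes_length])
        if match:
            return match.group(1).decode("ascii").lower()
    if not ignore_http_header and http_content_type:
        match = _CHARSET_TEXT.search(http_content_type)
        if match:
            return match.group(1).lower()
    return default


html = b'<meta charset="ISO-8859-1"><p>hello</p>'
print(pick_charset(html, "text/html; charset=utf-8"))
print(pick_charset(html, "text/html; charset=utf-8", ignore_content_header=True))
```

Each source of charset information can be switched off independently, which is useful when a site declares one charset in its markup and serves another in its HTTP headers.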

Changed

- Simplify logic deciding content charset, stop guessing with chardet (312)

Fixed

- Rewrite only content with mimetype `text/html` when `WARC-Resource-Type` is `html` (313)

2.0.1

Added

- Add support for multiple languages in `--lang` CLI argument (300)

Changed

- Use the new `WARC-Resource-Type` header to decide rewrite mode (when present in WARC) (296)
- Upgrade Python dependencies + wombat.js 3.7.5

Fixed

- Drop `integrity` attribute in HTML `<script>` and `<link>` tags (298)
- Use automatic detection of content encoding also for JS, JSON and CSS files (301)
- Set correct charset in HTML documents (253)
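
Dropping `integrity` matters because rewritten scripts and stylesheets no longer match their original Subresource Integrity hash, so a browser would refuse to load them. The Python sketch below shows one way to strip the attribute using only the standard library. It is an illustration, not the warc2zim rewriter: the class `IntegrityStripper` and the function `strip_integrity` are hypothetical names, and the re-serialization is simplified (for example, attribute quoting is normalized to double quotes).

```python
from html import escape
from html.parser import HTMLParser


class IntegrityStripper(HTMLParser):
    """Re-emit HTML, omitting `integrity` on <script> and <link> (sketch)."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, closing=""):
        if tag in ("script", "link"):
            attrs = [(k, v) for k, v in attrs if k != "integrity"]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
        )
        self.out.append(f"<{tag}{rendered}{closing}>")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " /")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_integrity(html: str) -> str:
    parser = IntegrityStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_integrity('<script src="a.js" integrity="sha384-x" async></script>'))
```

Other tags keep their attributes untouched; only `<script>` and `<link>` are affected, matching the scope named in the entry above.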

2.0.0

Added

- Allow specifying a scraper suffix for the ZIM scraper metadata at the CLI (168)
- New test website to test many known situations supposed to be handled (166)

Changed

- Replace the **Service Worker** approach with **scraper-side rewriting** of static content (https://github.com/kiwix/overview/issues/95)
- Adopted Python bootstrap conventions (152)
- Upgrade dependencies, especially move to **Python 3.12** (only) and zimscraperlib 3.3.2
- Change wording in logs about the return code 100 (which is not an error code)
- Added checks in `converter.py` to verify that the output directory exists, log appropriate error messages, and exit cleanly if the checks fail (106)
- Added check for invalid zim file names (232)
- Changed default publisher metadata from 'Kiwix' to 'openZIM' (150)

1.5.5

Changed

- Code restructuring in preparation for 2.x
