Trafilatura

Latest version: v1.12.2



1.12.2

- downloads: add support for SOCKS proxies with gremid (682)
- extraction fix: ValueError in table spans (685)
- spider: `prune_xpath` parameter added by felipehertzer (684)
- spider: relax strict parameter for link extraction (687)
- sitemaps: `max_sitemaps` parameter added by felipehertzer (690)
- maintenance: make compression libraries optional (691)
- metadata: review and lint code (694)

1.12.1

Navigation:
- spider: restrict search to sections containing URL path (673)
- crawler: add parameter class and types, **breaking change** for undocumented functions (675)
- maintenance: simplify link discovery and extend tests (674)
- CLI: review code, add types and tests (677)

Bugfixes:
- fix `AttributeError` in element deletion (668)
- fix `MemoryError` in table header columns (665)

Docs:
- docs: fix variable name for `extract_metadata` in quickstart by jpigla in 678

1.12.0

Breaking change:
- enforce fixed list of output formats, deprecate `-out` on the CLI (647)
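
Since unknown format names are now rejected, code that passes a user-supplied format may want to validate it first. The following is a minimal sketch only: the helper `check_output_format` is hypothetical, and the set of names is an assumption collected from formats mentioned in these release notes, not the library's authoritative list.

```python
# Illustrative set only: names assumed from formats mentioned in these notes.
# The authoritative list is defined by the library itself.
ASSUMED_FORMATS = {"html", "json", "markdown", "txt", "xml", "xmltei"}


def check_output_format(name: str) -> str:
    """Return a normalized format name or raise ValueError (hypothetical helper)."""
    normalized = name.strip().lower()
    if normalized not in ASSUMED_FORMATS:
        raise ValueError(f"unsupported output format: {name!r}")
    return normalized
```

Normalizing before the check means that input such as `" Markdown "` is accepted, while anything outside the assumed set fails early with a clear message.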

Faster, more accurate extraction:
- review link and structure checks (653)
- improve justext fallback (652)
- baseline: prevent LXML error in JSON-LD (643), do not use as backup extraction (646)
- review XPaths for undesirable content (645)

Bugfixes and maintenance:
- CLI fix: markdown format should trigger `include_formatting` (649)
- images fix: use a length threshold on src attribute (654)
- XML-TEI: replace RelaxNG by DTD, remove pickle, and update (655)
- formatting & markdown fix: add newlines (656)
- table fix: prevent `MemoryError` & `ValueError` during conversion to text (658)

Documentation:
- update `crawls.rst`: `known` is an unexpected argument, by tommytyc in 638

1.11.0

Breaking change:
- metadata now skipped by default (613); to trigger inclusion in all output formats:
  - `with_metadata=True` (Python)
  - `--with-metadata` (CLI)
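
A minimal Python sketch of opting back in to metadata. The sample HTML and the wrapper `extract_json_with_metadata` are made up for illustration; the import is guarded so the snippet still runs where the package is not installed, and very short documents may yield `None`.

```python
SAMPLE_HTML = """<html><head><title>Example title</title></head>
<body><article><p>Some paragraph text for the extractor to work on.</p>
</article></body></html>"""


def extract_json_with_metadata(html: str):
    """Return JSON output including metadata, or None if unavailable."""
    try:
        from trafilatura import extract  # third-party; may not be installed
    except ImportError:
        return None
    # with_metadata=True opts in to metadata, which 1.11.0 skips by default.
    return extract(html, output_format="json", with_metadata=True)


result = extract_json_with_metadata(SAMPLE_HTML)
```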

Extraction:
- add HTML as output format (614)
- better and faster baseline extraction (619)
- better handling of HTML/XML elements (628)
- XPath rules added with felipehertzer (540)
- fix: avoid faulty readability_lxml content (635)

Evaluation:
- new scripts and data with LydiaKoerber (606, 615)
- additional data with swetepete (197)

Maintenance:
- docs extended and updated, added page on deduplication (618)
- review code, add tests and types in part of the submodules (620, 623, 624, 625)

1.10.0

Breaking changes:
- raise errors on deprecated CLI and function arguments (581)
- regroup classes and functions linked to deduplication (582): `trafilatura.hashing` → `trafilatura.deduplication`
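
Code that has to run against versions on both sides of the module rename could try the new location first and fall back to the old one. This is a sketch under that assumption; `import_first` is a hypothetical helper, not part of the library.

```python
import importlib


def import_first(*module_names):
    """Return the first module that imports successfully, else None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Prefer the new location, fall back to the pre-1.10.0 one.
dedup_module = import_first("trafilatura.deduplication", "trafilatura.hashing")
```

If neither module can be imported (for example, because the package is not installed), `dedup_module` is `None` and the caller can decide how to proceed.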

Extraction:
- port of `is_probably_readerable` from readability.js by zirkelc in 587
- Markdown table fixes by naktinis in 601
- fix list spacing in TXT output (598)
- CLI fixes: file processing options, mtime, and tests (605)
- CLI fix: read standard input as binary (607)

Downloads:
- fix deflate and add optional zstd to accepted encodings (594)
- spider fix: use internal download utilities for robots.txt (590)

Maintenance:
- add author XPaths (567)
- update justext and lxml dependencies (593)
- simplify code: unique function for length tests (591)

Docs:
- fix typos by RainRat in 603

1.9.0

Extraction:
- add markdown as explicit output (550)
- improve recall preset (571)
- speedup for readability-lxml (547)
- add global options object for extraction and use it in CLI (552)
- fix: better encoding detection (548)
- recall: fix for lists inside tables with mikhainin (534)
- add symbol to preserve vertical spacing in Markdown (499)
- fix: table cell separators in non-XML output (563)
- slightly better accuracy and execution speed overall

Metadata:
- add file creation date (date extraction, JSON & XML-TEI) (561)
- fix: empty content in meta tag by felipehertzer (545)

Maintenance:
- restructure and simplify code (543, 556)
- CLI & downloads: revamp and use global options (565)
- eval: review code, add guidelines and small benchmark (542)
- fix: raise error if config file does not exist (554)
- deprecate `process_record()` (549)
- docs: convert readme to markdown and update info (564, 578)


© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.