Trafilatura

Latest version: v1.10.0

1.10.0

Breaking changes:
- raise errors on deprecated CLI and function arguments (#581)
- regroup classes and functions linked to deduplication (#582):
  ``trafilatura.hashing`` → ``trafilatura.deduplication``
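
Code that has to run on both sides of this module rename can try the new path first and fall back to the old one. The following is a minimal sketch, not part of Trafilatura: `import_first_available` is a hypothetical helper, and only the two module paths are taken from the entry above.

```python
import importlib
from types import ModuleType
from typing import Optional, Sequence


def import_first_available(names: Sequence[str]) -> Optional[ModuleType]:
    """Return the first module in `names` that can be imported, or None."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # The module (or its parent package) is missing: try the next one.
            continue
    return None


# New location (1.10.0 and later) first, then the pre-1.10.0 location.
dedup = import_first_available(
    ["trafilatura.deduplication", "trafilatura.hashing"]
)
```

`dedup` is `None` when neither module can be imported, so callers should check for that before using it.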

Extraction:
- port of `is_probably_readerable` from readability.js by @zirkelc in #587
- Markdown table fixes by @naktinis in #601
- fix list spacing in TXT output (#598)
- CLI fixes: file processing options, mtime, and tests (#605)
- CLI fix: read standard input as binary (#607)

Downloads:
- fix deflate and add optional zstd to accepted encodings (#594)
- spider fix: use internal download utilities for robots.txt (#590)
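
The changelog does not describe how the encoding changes were implemented. As a rough illustration of the ideas involved, the sketch below advertises `zstd` only when a decoder is installed and decodes `deflate` bodies in either of the two forms servers send in practice (zlib-wrapped or raw). The function names and the choice of the `zstandard` package are assumptions made for this example, not Trafilatura's code.

```python
import importlib.util
import zlib


def accepted_encodings() -> str:
    """Build an Accept-Encoding value, adding zstd only if a decoder exists."""
    encodings = ["gzip", "deflate"]
    # Assumption: the third-party `zstandard` package provides zstd support.
    if importlib.util.find_spec("zstandard") is not None:
        encodings.append("zstd")
    return ", ".join(encodings)


def decode_deflate(payload: bytes) -> bytes:
    """Decode a deflate body, whether zlib-wrapped or a raw deflate stream."""
    try:
        return zlib.decompress(payload)  # zlib header and checksum present
    except zlib.error:
        # Negative wbits tells zlib to expect a raw stream without a header.
        return zlib.decompress(payload, -zlib.MAX_WBITS)


headers = {"Accept-Encoding": accepted_encodings()}
```

Trying the zlib-wrapped form first keeps the common case on a single call, with the raw-stream retry only when the header check fails.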

Maintenance:
- add author XPaths (#567)
- update justext and lxml dependencies (#593)
- simplify code: unique function for length tests (#591)

Docs:
- fix typos by @RainRat in #603

1.9.0

Extraction:
- add markdown as explicit output (#550)
- improve recall preset (#571)
- speedup for readability-lxml (#547)
- add global options object for extraction and use it in CLI (#552)
- fix: better encoding detection (#548)
- recall: fix for lists inside tables with @mikhainin (#534)
- add symbol to preserve vertical spacing in Markdown (#499)
- fix: table cell separators in non-XML output (#563)
- slightly better accuracy and execution speed overall
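
With Markdown available as an explicit output format, it can be requested through the `output_format` argument of `extract()`. This is a minimal sketch, assuming Trafilatura 1.9.0 or later is installed; the sample HTML and the `to_markdown` wrapper are made up for illustration, and the import guard only keeps the snippet loadable where the package is absent.

```python
from typing import Optional

try:
    from trafilatura import extract
except ImportError:  # Trafilatura is not installed in this environment
    extract = None

# Made-up sample document; very short pages may yield no extraction at all.
SAMPLE_HTML = """<html><body><article>
<h1>Release notes</h1>
<p>Markdown output keeps headings, lists and emphasis such as <b>bold</b> text.</p>
<ul><li>first item</li><li>second item</li></ul>
</article></body></html>"""


def to_markdown(html: str) -> Optional[str]:
    """Return the main content as Markdown, or None if nothing was extracted."""
    if extract is None:
        return None
    return extract(html, output_format="markdown")


print(to_markdown(SAMPLE_HTML))
```

`extract()` returns `None` when it finds no usable content, so the result should be checked before further processing.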

Metadata:
- add file creation date (date extraction, JSON & XML-TEI) (#561)
- fix: empty content in meta tag by @felipehertzer (#545)

Maintenance:
- restructure and simplify code (#543, #556)
- CLI & downloads: revamp and use global options (#565)
- eval: review code, add guidelines and small benchmark (#542)
- fix: raise error if config file does not exist (#554)
- deprecate `process_record()` (#549)
- docs: convert readme to markdown and update info (#564, #578)

1.8.1

Maintenance:
- Pin LXML to prevent broken dependency (#535)

Extraction:
- Improve extraction accuracy for major news outlets (#530)
- Fix formatting by correcting order of element generation and space handling with @dlwh (#528)
- Fix: prevent tail insertion before children in nested elements by @knit-bee (#536)

1.8.0

Extraction:
- Better precision by @felipehertzer (#509, #520)
- Code formatting in TXT/Markdown output added (#498)
- Improved CSV output (#496)
- LXML: compile XPath expressions (#504)
- Overall speedup of about 5%

Downloads and Navigation:
- More robust scans with `is_live_page()` (#501)
- Better sitemap start and safeguards (#503, #506)
- Fix for headers in response object (#513)

Maintenance:
- License changed to Apache 2.0
- `Response` class: convenience functions added (#497)
- `lxml.html.Cleaner` removed (#491)
- CLI fixes: parallel cores and processing (#524)

1.7.0

Extraction:
- improved `html2txt()` function

Downloads:
- add advanced `fetch_response()` function
  → pending deprecation for `fetch_url(decode=False)`

Maintenance:
- support for LXML v5+ (#484 by @knit-bee, #485)
- update [htmldate](https://github.com/adbar/htmldate/releases/tag/v1.7.0)

1.6.4

Maintenance:
- macOS: fix setup, update htmldate and add tests (#460)
- drop invalid XML element attributes with @vbarbaresi in #462
- remove cyclic imports (#458)

Navigation:
- introduce `MAX_REDIRECTS` config setting and fix urllib3 redirect handling by @vbarbaresi in #461
- improve feed detection (#457)

Documentation:
- enhancements to documentation and testing with @Maddesea in #456
