Mtcleanse

Latest version: v0.2.1

0.2.1

MTCleanse is a Python library for cleaning and processing parallel text datasets, particularly useful for machine translation and other NLP tasks.

New Feature

The `.clean_file` method now writes an HTML file containing a detailed report of the filtering process.

<img src="https://github.com/user-attachments/assets/fe23e9e3-1c59-44ea-81ff-23a94c199b1d" width="600">

```python
# Import path is an assumption; adjust it if your installed version differs.
from mtcleanse.cleaning import ParallelTextCleaner

# Configure length limits and domain filtering.
cleaner = ParallelTextCleaner({
    "min_chars": 10,
    "max_chars": 500,
    "min_words": 3,
    "max_words": 50,
    "enable_domain_filtering": True,
    "domain_contamination": 0.2
})

# This method saves the cleaned data to disk and generates an HTML report
cleaner.clean_file(
    source_file="source.en",
    target_file="target.fr",
    output_source="clean_source.en",
    output_target="clean_target.fr",
    html_report="report.html"
)
```


**Full Changelog**: https://github.com/Ancastal/mtcleanse/compare/v0.2.0...v0.2.1

0.2.0

New Features

- Quality filter based on COMET-KIWI, a reference-free quality estimation metric.
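
A reference-free quality estimation metric scores each pair from the source and its translation alone, so a quality filter reduces to thresholding those scores. The sketch below illustrates that idea only; it is not MTCleanse's implementation. `filter_by_quality`, `length_ratio_scores`, and the `0.5` threshold are hypothetical names and values, and the stand-in scorer is a crude length-ratio heuristic where a real setup would plug in COMET-KIWI scores.

```python
def length_ratio_scores(pairs):
    """Stand-in scorer (NOT COMET-KIWI): ratio of shorter to longer side, in [0, 1]."""
    scores = []
    for source, target in pairs:
        longest = max(len(source), len(target))
        shortest = min(len(source), len(target))
        scores.append(shortest / longest if longest else 0.0)
    return scores


def filter_by_quality(pairs, score_fn, threshold=0.5):
    """Keep (source, target) pairs whose estimated quality reaches the threshold."""
    scores = score_fn(pairs)
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]


pairs = [
    ("Hello world", "Bonjour le monde"),
    ("Hello world", "x"),
]
kept = filter_by_quality(pairs, length_ratio_scores, threshold=0.5)
```

Passing the scorer in as a function keeps the thresholding logic independent of the model that produces the scores.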

0.1.0

Features

- Clean parallel text datasets with configurable parameters
- Remove noise such as URLs, emails, and control characters
- Filter texts based on length constraints
- Detect and remove statistical outliers
- Domain-based filtering using sentence embeddings
- Export cleaned data in various formats (text files, JSON)
- Comprehensive statistics on the cleaning process
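
To make the first few features concrete, here is a minimal sketch of noise removal, length filtering, and outlier removal using only the standard library. It does not reproduce MTCleanse's internals: the regular expressions, the function names (`strip_noise`, `within_limits`, `clean_pairs`), and the use of a z-score on the source/target character-length ratio are all assumptions made for illustration. The default limits mirror the values in the 0.2.1 configuration example.

```python
import re
import statistics

# Illustrative patterns; a production cleaner would likely be more thorough.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_noise(text):
    """Remove URLs, emails, and control characters, then collapse whitespace."""
    for pattern in (URL_RE, EMAIL_RE, CONTROL_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def within_limits(text, min_chars=10, max_chars=500, min_words=3, max_words=50):
    """Check character and word counts against inclusive bounds."""
    n_words = len(text.split())
    return min_chars <= len(text) <= max_chars and min_words <= n_words <= max_words


def clean_pairs(pairs, z_max=3.0):
    """Clean (source, target) pairs, then drop length-ratio outliers."""
    cleaned = [(strip_noise(s), strip_noise(t)) for s, t in pairs]
    kept = [(s, t) for s, t in cleaned if within_limits(s) and within_limits(t)]
    if len(kept) < 3:
        return kept  # too few pairs for a meaningful standard deviation
    # within_limits guarantees non-empty targets, so the division is safe.
    ratios = [len(s) / len(t) for s, t in kept]
    mean = statistics.mean(ratios)
    stdev = statistics.stdev(ratios)
    if stdev == 0:
        return kept
    return [pair for pair, r in zip(kept, ratios) if abs(r - mean) / stdev <= z_max]
```

A pair is dropped if either side fails the length limits, which keeps the source and target files aligned line by line.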
