MTCleanse is a Python library for cleaning and processing parallel text datasets, particularly useful for machine translation and other NLP tasks.
## New Feature
The `.clean_file` method now writes an HTML file containing a detailed report of the filtering process.
<img src="https://github.com/user-attachments/assets/fe23e9e3-1c59-44ea-81ff-23a94c199b1d" width="600">
```python
# Import path assumed here; adjust it to match your installed mtcleanse version.
from mtcleanse.cleaning import ParallelTextCleaner

cleaner = ParallelTextCleaner({
    "min_chars": 10,
    "max_chars": 500,
    "min_words": 3,
    "max_words": 50,
    "enable_domain_filtering": True,
    "domain_contamination": 0.2
})

# This method saves the cleaned data to disk and generates an HTML report
cleaner.clean_file(
    source_file="source.en",
    target_file="target.fr",
    output_source="clean_source.en",
    output_target="clean_target.fr",
    html_report="report.html"
)
```
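
To give a feel for what the length thresholds in the configuration above mean, here is a minimal standalone sketch. It is not MTCleanse's implementation: the helper `passes_length_filters` is hypothetical, and it assumes that both sides of a pair must fall inside the character and word bounds and that words are split on whitespace.

```python
def passes_length_filters(source: str, target: str, config: dict) -> bool:
    """Return True if both sentences fall within the configured bounds.

    Hypothetical helper for illustration only; MTCleanse may count
    characters and words differently.
    """
    for text in (source, target):
        n_chars = len(text.strip())   # characters, ignoring outer whitespace
        n_words = len(text.split())   # whitespace-separated tokens
        if not (config["min_chars"] <= n_chars <= config["max_chars"]):
            return False
        if not (config["min_words"] <= n_words <= config["max_words"]):
            return False
    return True


config = {"min_chars": 10, "max_chars": 500, "min_words": 3, "max_words": 50}

pairs = [
    ("The cat sat on the mat.", "Le chat est assis sur le tapis."),
    ("Hi", "Salut"),  # too short on both sides
]

# Keep only the pairs whose source and target both pass the length checks.
kept = [pair for pair in pairs if passes_length_filters(*pair, config)]
```

With these thresholds, the first pair is kept and the second is dropped because it is shorter than `min_chars` and `min_words`.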
**Full Changelog**: https://github.com/Ancastal/mtcleanse/compare/v0.2.0...v0.2.1