Hojichar

Latest version: v0.11.4

Page 1 of 4

0.11.4

Bug fixes:
- https://github.com/HojiChar/HojiChar/issues/65
- Fix division by zero in some filters.

0.11.3

Bug fix
The following bug has been fixed:

An `ImportError` was raised because the `requests` package was missing when HojiChar was installed without the `[all]` option.

0.11.2

Bug fix
The following bug has been fixed:

An `ImportError` was raised when importing the `hojichar` package after installing it with `pip install hojichar`, without the `[all]` option.

0.11.1

Changes
This update adds a `__repr__` method to the `Document` class, giving a clearer object representation for easier debugging.

For instance, inspecting a `Document` object with `repr` now shows a detailed representation:
```python
>>> from hojichar import Document
>>> doc = Document("Hello, world", extras={"date": "2024-10-03"})
>>> repr(doc)
"Document(text='Hello, world', is_rejected=False, extras={'date': '2024-10-03'})"
>>> eval(repr(doc))
Document(text='Hello, world', is_rejected=False, extras={'date': '2024-10-03'})
```
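
For readers curious how a `repr` string can round-trip through `eval`, here is a minimal sketch. The `Doc` class is a hypothetical stand-in written for illustration only; it is not HojiChar's `Document` implementation, and it simply mirrors the three fields shown in the output above.

```python
class Doc:
    """Hypothetical stand-in for illustration; not HojiChar's Document."""

    def __init__(self, text, is_rejected=False, extras=None):
        self.text = text
        self.is_rejected = is_rejected
        self.extras = extras if extras is not None else {}

    def __repr__(self):
        # !r formats each field with its own repr, so strings keep their quotes
        # and the result is a valid constructor call.
        return (
            f"{type(self).__name__}(text={self.text!r}, "
            f"is_rejected={self.is_rejected!r}, extras={self.extras!r})"
        )


doc = Doc("Hello, world", extras={"date": "2024-10-03"})
rebuilt = eval(repr(doc))  # eval is only safe on strings you trust
```

The round trip works because every field's own `repr` is itself valid Python source, which holds for strings, booleans, and dictionaries of such values.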

0.11.0

What's New in This Release
We're excited to introduce a series of new filters in this version. They extend the existing filter set and are particularly useful for handling noisy datasets such as Common Crawl.

New Filters Added:
`hojichar.filters.document_filters`:
- `DiscardTooManyNouns`: Removes "word salad" with excessive nouns in Japanese texts.
- `CharRepetitionRatioFilter`: Filters out entries based on character repetition ratios to reduce noise.
- `WordRepetitionRatioFilter`: Discards entries with repetitive word patterns in Japanese texts.
- `DiscardTooManySpecialTokens`: Cleans up entries overloaded with special tokens or symbols, judged as noise.
- `SingleCharacterRepetitionFilter`: Removes entries where single characters are overly repeated.
- `DiscardTooManyEndingEllipsis`: Eliminates entries ending with multiple ellipses such as `...`.
- `DiscardTooShortLines`: Filters out repetitions of unusually short lines.
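
To give a feel for what a repetition-ratio filter measures, the sketch below computes one possible metric: the share of character n-grams that occur more than once. The function names, the n-gram size, and the threshold are all assumptions made for this illustration; HojiChar's actual filters may define and tune their ratios differently.

```python
from collections import Counter


def char_ngram_repetition_ratio(text: str, n: int = 3) -> float:
    """Share of character n-grams that occur more than once (illustrative metric)."""
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not ngrams:
        # Text shorter than n has no n-grams; return 0.0 instead of dividing by zero.
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def looks_repetitive(text: str, threshold: float = 0.5, n: int = 3) -> bool:
    """Hypothetical decision rule: flag text whose ratio exceeds the threshold."""
    return char_ngram_repetition_ratio(text, n) > threshold
```

With these assumed defaults, a string such as `"abababababab"` is flagged, while an ordinary short sentence with no repeated trigrams is not.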

`hojichar.filters.language_identification`:
- `LanguageIdentificationByFastText`: Employs FastText for high-performance language identification.
- `AcceptJapaneseByFastText`: Japanese LID filter.

Installation Notes:
- Some of these new filters require additional dependency libraries. To get them, install hojichar by running:
```bash
pip install 'hojichar[all]'
```

- The `mmh3` package, which the `hojichar.filters.deduplication` module depends on, has also been added to the extras package. To use it, you will need to specify `hojichar[all]` in the same way as above.
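
If you are unsure whether the optional dependencies are present in your environment, a quick check along the following lines can help. This is a generic sketch using only the standard library; the helper name `missing_optional_modules` is hypothetical and not part of HojiChar.

```python
import importlib.util


def missing_optional_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# `mmh3` is the module mentioned in the note above.
missing = missing_optional_modules(["mmh3"])
if missing:
    print(f"Missing optional modules: {missing}. Try: pip install 'hojichar[all]'")
```

`importlib.util.find_spec` only looks the module up without importing it, so the check itself does not trigger an `ImportError`.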

Additional Updates:
- **Full Support for Python 3.12**: Python 3.12 is now officially supported, extending compatibility to more environments.

0.10.1

Fixes
- Fix access when the `extras` argument is passed to the `Document`: https://github.com/HojiChar/HojiChar/pull/55
