Changes
These changes make it easier to analyze discarded text.
- `ignore_filtered` flag of `hojichar.Compose` is removed.
- `skip_rejected` flag is added to `hojichar.Filter`.
- `--all` option is added to the CLI.
- `dump_reason` flag is added to `hojichar.document_filters.JSONDumper`.
- When a filter discards a document, the filter and its primitive member variables are logged.
- Discarded text is no longer converted to blank characters.
With these changes, we can inspect output like the following.
Profile `myprofile.py`:
```python
from hojichar import Compose
from hojichar.filters import document_filters as dflt

FILTER = Compose(
    [
        dflt.JSONLoader(key="text", ignore=True),
        dflt.DiscardAdultContentJa(p=0.9, ignore_confused=True),
        # Keep rejected documents in the output and attach the rejection reason.
        dflt.JSONDumper(dump_reason=True, skip_rejected=False),
    ],
)
```
Then run:

```
cat dirty_texts.jsonl | hojichar -p myprofile.py --all | jq
```

This produces records such as:
```json
{
  "text": "劇訳表示。 : 経済産業省「国民の皆さん、トイレットペーパーは余分に備えを」【防災の日】\n< 【防衛省】15年度予算、概算要求が過去最高額へ←「GO!日本」\n「初キッスの年齢を教えろ」【海外反応】 >\n経済産業省「国民の皆さん、トイレットペーパーは余分に備えを」【防災の日】\n<防災の日>経産省トイレットペーパー備蓄PR「1か月分」",
  "is_rejected": true,
  "reason": {
    "name": "DiscardAdultContentJa",
    "p": 0.9,
    "matched_text": "えろ",
    "matched_text_neighbor": "へ←「GO!日本」\n「初キッスの年齢を教えろ」【海外反応】 >\n経済産業省「国民の皆"
  }
}
```
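
As a rough illustration of the kind of analysis this enables, the sketch below tallies rejected documents by the filter that rejected them. It uses only the Python standard library and relies on the `is_rejected`, `reason`, and `name` fields shown above. The helper name `summarize_rejections` is hypothetical and not part of hojichar. It assumes one JSON object per line (the CLI output without the `| jq` pretty-printing step) and that `reason` may be empty or missing for documents that were not rejected.

```python
import json
from collections import Counter


def summarize_rejections(lines):
    """Count rejected documents per filter name from JSON Lines records.

    Hypothetical helper for illustration; not part of hojichar.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("is_rejected"):
            # Assumption: "reason" may be absent or empty for some records.
            reason = record.get("reason") or {}
            counts[reason.get("name", "unknown")] += 1
    return counts


# Small in-memory sample shaped like the output above.
sample = [
    '{"text": "a", "is_rejected": true, "reason": {"name": "DiscardAdultContentJa", "p": 0.9}}',
    '{"text": "b", "is_rejected": false, "reason": {}}',
    '{"text": "c", "is_rejected": true, "reason": {"name": "DiscardAdultContentJa", "p": 0.9}}',
]
rejections = summarize_rejections(sample)
print(rejections.most_common())
```

In practice the same function could be fed an open file handle or `sys.stdin` instead of the in-memory sample.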