- Added a Stanza tokenizer to support sentence splitting for more languages (for example, Chinese):
```python
text = "我喜欢读书。天气很好。我们去公园吧。今天是星期五。早上好。这是我的朋友。请帮我。吃饭了吗?我在学中文。晚安。"
expected = ["我喜欢读书。", "天气很好。", "我们去公园吧。", "今天是星期五。", "早上好。", "这是我的朋友。", "请帮我。", "吃饭了吗?", "我在学中文。晚安。"]
sentences = list(generate_sentences(text, minimum_sentence_length=2, context_size=2, tokenizer="stanza", language="zh"))
self.assertEqual(sentences, expected)
```
- Added the emoji library to filter emojis out of the stream more precisely (the previous emoji filtering method did not work well with some languages).
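  For context, hand-maintained codepoint-range filters are the kind of approach that tends to break down here: the ranges miss newer emoji and risk colliding with symbols used in some languages. A minimal sketch of such a range-based filter follows (illustrative only; the ranges and the `strip_emojis` name are assumptions, not this library's actual implementation):

  ```python
  import re

  # Common emoji codepoint ranges (illustrative, deliberately not exhaustive --
  # incompleteness like this is exactly why a dedicated library is preferable)
  EMOJI_PATTERN = re.compile(
      "["
      "\U0001F300-\U0001F5FF"  # symbols & pictographs
      "\U0001F600-\U0001F64F"  # emoticons
      "\U0001F680-\U0001F6FF"  # transport & map symbols
      "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
      "\u2600-\u27BF"          # misc symbols and dingbats
      "]+"
  )

  def strip_emojis(text: str) -> str:
      """Remove characters falling in the hardcoded emoji ranges."""
      return EMOJI_PATTERN.sub("", text)
  ```

  The emoji package instead matches against the full Unicode emoji data set, which keeps filtering accurate across languages and new emoji revisions.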