Textgen

Latest version: v1.1.1

1.3

* Added `encode_text_vectors` to encode text using the trained network.
* Added `similarity` to quickly calculate cosine similarity and return the most similar texts.

See [this notebook](https://github.com/minimaxir/textgenrnn/blob/master/docs/textgenrnn-encode-text.ipynb) for details.
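As a rough illustration of what ranking texts by cosine similarity involves, here is a minimal pure-Python sketch. The function names `cosine_similarity` and `most_similar` and the toy vectors are hypothetical stand-ins; they are not the library's API, and real encoded vectors would come from the trained network.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def most_similar(query_vec, text_vecs, texts, top_n=3):
    """Return the top_n (text, score) pairs, highest similarity first."""
    scored = [
        (text, cosine_similarity(query_vec, vec))
        for text, vec in zip(texts, text_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors standing in for encoded texts (hypothetical data).
texts = ["cats purr", "dogs bark", "kittens meow"]
vectors = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.9, 0.2, 0.1]]
ranked = most_similar([1.0, 0.0, 0.0], vectors, texts, top_n=2)
```

Here `ranked` lists "cats purr" first and "kittens meow" second, since their vectors point most nearly in the query's direction.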

1.2.2

* Made `is_csv` actually work downstream.
* Description tweaks

1.2.1

* Added `validation`, which can be used to disable validation during training for speed.
* Added `is_csv`: Use with `train_from_file` if the source file is a one-column CSV (e.g. an export from BigQuery or Google Sheets) for proper quote/newline escaping.
* README tweaks
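To show why a one-column CSV needs proper parsing rather than line splitting, here is a small sketch using Python's standard `csv` module. The helper `read_one_column_csv` and the assumption that the file starts with a header row are illustrative only; this is not the library's own loading code.

```python
import csv
import io


def read_one_column_csv(handle, has_header=True):
    """Read the first column of a CSV, honoring quotes and embedded newlines."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # assumption: the export begins with a header row
    return [row[0] for row in reader if row]


# A tiny export with an escaped quote and a newline inside a quoted field.
raw = 'text\n"She said ""hi"" and left"\n"line one\nline two"\nplain row\n'
texts = read_one_column_csv(io.StringIO(raw, newline=""))

# Naive line splitting would break the quoted multi-line field apart.
naive_lines = raw.splitlines()[1:]
```

The CSV reader yields three texts, while the naive split produces four fragments with stray quote characters.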

1.2

* Renamed `prop_keep` to `train_size`; the remaining data is now used for validation.
* Added `dropout`, which randomly excludes input tokens each epoch.
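The two ideas above can be sketched in plain Python: a `train_size` fraction decides how many texts go to training (the rest go to validation), and `dropout` randomly removes input tokens. The helpers `split_train_validation` and `drop_tokens` are hypothetical names for illustration, and the shuffling and seeding choices are assumptions rather than the library's behavior.

```python
import random


def split_train_validation(texts, train_size=0.8, seed=0):
    """Shuffle indices, keep a train_size fraction for training, rest for validation."""
    rng = random.Random(seed)
    indices = list(range(len(texts)))
    rng.shuffle(indices)
    cut = int(len(texts) * train_size)
    train = [texts[i] for i in indices[:cut]]
    validation = [texts[i] for i in indices[cut:]]
    return train, validation


def drop_tokens(tokens, dropout=0.1, rng=None):
    """Return a copy of tokens with each one removed with probability `dropout`."""
    rng = rng or random
    return [token for token in tokens if rng.random() >= dropout]


train, validation = split_train_validation(list(range(10)), train_size=0.8)
noisy = drop_tokens(list("hello world"), dropout=0.2, rng=random.Random(1))
```

Calling `drop_tokens` again at the start of each epoch would give the model a slightly different view of the same text every time.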

1.1.1

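1. Added multi-GPU inference, which doubles inference speed. Calling the textgen library for batch inference is now more convenient and faster across multiple GPUs.

Multi-GPU data parallelism with batch inference: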

cd examples/gpt
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 inference_multigpu_demo.py --model_type chatglm --base_model THUDM/chatglm-6b



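2. Optimized the multi-turn dialogue SFT code for ChatGLM-6B/Baichuan/LLaMA2/BLOOM: the logic is now merged under textgen/gpt, fine-tuning for multiple models is handled in a unified way, and prompt template support has been added.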
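The data-parallel batch inference mentioned in item 1 can be pictured with the following sketch, which only shows how prompts might be divided among processes. It relies on the `RANK` and `WORLD_SIZE` environment variables that `torchrun` sets for each worker; the helpers `shard_for_rank` and `batched` are hypothetical, and this is not the code of `inference_multigpu_demo.py`.

```python
import os


def shard_for_rank(items, rank, world_size):
    """Give each process every world_size-th item, starting at its own rank."""
    return items[rank::world_size]


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# torchrun sets RANK and WORLD_SIZE; the defaults let the sketch run standalone.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

prompts = [f"prompt {i}" for i in range(10)]  # placeholder inputs
local_prompts = shard_for_rank(prompts, rank, world_size)

for batch in batched(local_prompts, batch_size=4):
    # A real script would run the model on `batch` on this process's GPU here.
    pass
```

With two processes, each handles half of the prompts, which is where the roughly twofold speedup would come from.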

**Full Changelog**: https://github.com/shibing624/textgen/compare/1.1.0...1.1.1

1.1

- Switched to a `fit_generator` implementation that generates training sequences on the fly instead of loading all sequences into memory. This allows training on large text files (10MB+) without requiring ridiculous amounts of RAM.
- Better `word_level` support:
  - The model will only keep `max_words` words and discard the rest.
  - The model will not train to predict words that are not in the vocabulary.
  - Every punctuation mark (including smart quotes) is its own token.
  - When generating, newlines/tabs have surrounding whitespace stripped. (This is not the case for other punctuation, as there are too many rules around that.)
- Training on a single text no longer uses meta tokens to indicate the start/end of the text and does not use them when generating, which results in slightly better output.
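A minimal sketch of the ideas in this release, under stated assumptions: a generator yields training batches lazily rather than building every sequence up front, punctuation is split into separate tokens, only the `max_words` most frequent words are kept, and out-of-vocabulary targets are skipped. The names `tokenize_words`, `build_vocab`, and `training_batches` are hypothetical, and the regular expression is an illustrative choice, not the library's tokenizer.

```python
import re
from collections import Counter


def tokenize_words(text):
    """Split into words, with every punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(tokens, max_words):
    """Keep only the max_words most frequent tokens."""
    return {word for word, _ in Counter(tokens).most_common(max_words)}


def training_batches(tokens, vocab, context_len=3, batch_size=2):
    """Lazily yield batches of (context, target) pairs.

    Targets outside the vocabulary are skipped, so the model is never
    trained to predict them. Contexts may still contain such tokens.
    """
    batch = []
    for i in range(context_len, len(tokens)):
        target = tokens[i]
        if target not in vocab:
            continue
        batch.append((tokens[i - context_len:i], target))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


tokens = tokenize_words("a b a c a b")
vocab = build_vocab(tokens, max_words=2)
batches = training_batches(tokens, vocab, context_len=2, batch_size=2)
```

Because `training_batches` is a generator, only one batch is held in memory at a time, which is the property that makes large input files practical.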
