PaddleNLP

Latest version: v2.8.1


3.0.0-beta0

* finetune support continue_training by tianhaodongbd in https://github.com/PaddlePaddle/PaddleNLP/pull/8615
* [PaddleNLP 3.0] Refactor/3 part1- remove fast tokenizer. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8613
* Repo adjustment by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/8605
* [PaddleNLP 3.0] Refactor, merge examples/language_model model_zoo to legacy/model_zoo by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8614
* [PaddleNLP 3.0] Refactor RLHF by gongel in https://github.com/PaddlePaddle/PaddleNLP/pull/8617
* Remove delay_scale_loss and release_grads for llama-2 13B's benchmark. by Xreki in https://github.com/PaddlePaddle/PaddleNLP/pull/8623
* [PaddleNLP 3.0] Fix dead link by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8626
* Update PaddleNLP to fix PPO by sneaxiy in https://github.com/PaddlePaddle/PaddleNLP/pull/8618
* [LLM] support sparse attention for LLAMA by GuoxiaWang in https://github.com/PaddlePaddle/PaddleNLP/pull/8592
* remove fast generation by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/8625
* fix npu llama by zhink in https://github.com/PaddlePaddle/PaddleNLP/pull/8628
* [PaddleNLP 3.0] Refactor/3 part3, move pipelines. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8619
* [PaddleNLP 3.0] update dataset preprocess by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8629
* [LLM] Support prefix tuning and lora for qwen2 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8601
* modify path of model_zoo in ci_case_auto.sh and ci_case_dy.sh by jeff41404 in https://github.com/PaddlePaddle/PaddleNLP/pull/8633
* [benchmark] fix model_zoo path by mmglove in https://github.com/PaddlePaddle/PaddleNLP/pull/8643
* [PaddleNLP 3.0] [LLM] change llm content by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/8627
* [LLM] Add sequence_parallel support for qwen by Difers in https://github.com/PaddlePaddle/PaddleNLP/pull/8558
* [NPU][LLM] add README & reformat llama scripts by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/8642
* align llama auto_parallel dataloader with manual_parallel by zhiqiu in https://github.com/PaddlePaddle/PaddleNLP/pull/8639
* fix fast_ln compile error by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/8650
* Apache License by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8658
* Fix different length for numpy>=1.24.x by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8655
* [LLM][NPU] fix on readme by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/8659
* [DOC] Fix dead link by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8662
* fix benchmark dir because of PR8627 by fightfat in https://github.com/PaddlePaddle/PaddleNLP/pull/8649
* fix llama alibi pretrain by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/8668
* inference support llama3(wint8|4/a8w8) by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8630
* [benchmark] fix benchmark script by mmglove in https://github.com/PaddlePaddle/PaddleNLP/pull/8648
* [cpu]llama avx model inference supports by bukejiyu in https://github.com/PaddlePaddle/PaddleNLP/pull/8634
* [AutoParallel] Change benchmark config for llama2-7b by heavyrain-lzy in https://github.com/PaddlePaddle/PaddleNLP/pull/8667
* support flashmask by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/8670
* [PaddleNLP 3.0] Update README.md by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8666
* adjust llm readme by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/8672
* Update export model by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8671
* Update version by gongel in https://github.com/PaddlePaddle/PaddleNLP/pull/8675
* Sft flash mask by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/8664
* Update version by gongel in https://github.com/PaddlePaddle/PaddleNLP/pull/8676

New Contributors
* Southpika made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8082
* cxa-unique made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8331
* dynamicheart made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8282
* EnflameGCU made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8445
* cqulilujia made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8459
* yinfan98 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8440
* zhangyuqin1998 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8396
* ming1753 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8456
* asr-sheep1 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8472
* NeroLoh made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8515
* bukejiyu made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8634

**Full Changelog**: https://github.com/PaddlePaddle/PaddleNLP/compare/v2.8.1...v3.0.0-beta0

3.0.0-beta4

In this release, we fully integrate DeepSeek R1-style reasoning models. The inference team has deeply optimized model inference, achieving industry-leading speed. We also release PP-UIE, our self-developed information extraction model. The key updates are as follows.

Highlights:
* New models
  * Full support for the popular reasoning models DeepSeek V3/R1, R1-Distill, and QwQ-32B. Browse and download all models from the [official model list](https://paddlenlp.readthedocs.io/zh/latest/model_list.html).
  * Brand-new release of PP-UIE, PaddlePaddle's self-developed next-generation universal information extraction tool, supporting extraction over contexts up to 8K tokens. [Usage docs](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/application/information_extraction).
* Inference and deployment
  * Full support for FP8, INT8, and 4-bit quantized inference of the full-size DeepSeek V3/R1, plus MTP speculative decoding.
  * FP8 inference delivers over 1,400 output tokens/s on a single machine; 4-bit single-machine deployment exceeds 2,500 output tokens/s.
  * In collaboration with the inference team, we publish a unified inference/deployment image for one-click deployment of popular models, together with fully refreshed deployment documentation. See the [docs](https://paddlenlp.readthedocs.io/zh/latest/llm/server/docs/general_model_inference.html).
* Model training:
  * Added large-model embedding training, with INF-CL support for very large batch sizes.
  * Added the MergeKit model-merging tool to mitigate the alignment tax. See the [docs](https://paddlenlp.readthedocs.io/zh/latest/llm/docs/mergekit.html).
  * Comprehensive low-resource training optimizations: training now runs smoothly on GPUs with as little as 16 GB of memory.
* Other highlights:
  * The documentation site now includes a model list page for browsing and downloading model files. See the [docs](https://paddlenlp.readthedocs.io/zh/latest/model_list.html).
  * Added the adam-mini optimizer; the AdamW optimizer now supports BF16 momentum.
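The model merging that MergeKit provides can be pictured, at its simplest, as a weighted average of checkpoint parameters. The following is a minimal NumPy sketch of linear merging; it illustrates the concept only and is not PaddleNLP's actual MergeKit API:

```python
import numpy as np

def linear_merge(state_dicts, weights):
    """Weighted linear average of model state dicts (toy illustration)."""
    assert abs(sum(weights) - 1.0) < 1e-8, "merge weights should sum to 1"
    return {
        key: sum(w * sd[key] for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# Two toy "checkpoints" with identical parameter shapes.
sd_a = {"linear.weight": np.full((2, 2), 1.0)}
sd_b = {"linear.weight": np.full((2, 2), 3.0)}
merged = linear_merge([sd_a, sd_b], [0.5, 0.5])  # each entry averages to 2.0
```

Real merge methods (sparsified merging, LoRA merging) add per-tensor masking and rescaling on top of this basic averaging step.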


Some of the corresponding update details:

1. Model and framework component updates
* New models
  * Newly added models:
    * paddlenlp/PP-UIE-0.5B, paddlenlp/PP-UIE-1.5B, paddlenlp/PP-UIE-7B, paddlenlp/PP-UIE-14B
    * deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-V3-Base, deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-R1-Zero
    * deepseek-ai/DeepSeek-R1-Distill-Llama-70B, deepseek-ai/DeepSeek-R1-Distill-Llama-8B, deepseek-ai/DeepSeek-R1-Distill-Qwen-14B, deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
    * Qwen/Qwen2.5-7B-Instruct-1M, Qwen/Qwen2.5-14B-Instruct-1M, Qwen/QwQ-32B, Qwen/QwQ-32B-Preview
  * PR 9738: add the DeepSeek V3 model. PR 9876: add MTP support. PR 9797: fix a TP issue. PR 9643: add model notes for the new DeepSeek and Llama 3.3 models (DrownFish19)
  * PR 9906: DeepSeek V3 can load Float8 weights directly in dynamic-graph mode for inference (ZHUI)
  * PR 9845: add the PP-UIE model series (Fantasy-02). PR 9911 & PR 9913: PP-UIE documentation updates (DrownFish19)
* Tokenizer improvements
  * PR 9548, PR 9577, PR 9594: "Hackathon No.43" series improving TokenizerFast support (yinfan98)
  * PR 9745: fix AutoTokenizer (DrownFish19). PR 9837: save extra special tokens (DesmonDay)
* Unified Checkpoint:
  * PR 9540: fix loading master weights. PR 9523: fix missing keys on load.
  * PR 9669: Unified Checkpoint bug fixes. PR 9935: fix loading parameters directly when optimizer merging is skipped.
  * PR 9741 & PR 9821: fix expert parallel support
* [MergeKit enhancements and optimizations](https://github.com/PaddlePaddle/PaddleNLP/pull/9811)
  * New features and optimizations
    * PR 9561: add mergekit_with_sparsify to support sparsified merging (Mangodadada).
    * PR 9702: improve MergeKit's GPU support for faster processing (Mangodadada).
    * PR 9811: add LoRA (low-rank adapter) merging, extending model-fusion capabilities (lugimzzz).
  * Tooling updates and maintenance
    * PR 9885: code updates and maintenance for the MergeKit tool, streamlining the overall logic.
  * Logging and debugging
    * PR 9948: add logging to improve debugging and progress tracking (lugimzzz).
* Low-resource optimizations
  * PR 9804: add use_fused_linear_cross_entropy support to reduce GPU memory; add pre_divided_factor to avoid FP16 overflow.
* Documentation and miscellaneous:
  * PR 9634: update the unified_checkpoint docs
  * PR 9734: refactor custom-device code (ZHUI)
  * PR 9715: add offload_recompute_inputs (will-jl944)
  * PR 9800: add trained-token counting (lugimzzz)
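The memory saving behind a fused linear + cross-entropy loss comes from never materializing the full [tokens, vocab] logits matrix at once. A minimal NumPy sketch of the underlying chunking idea follows; it is illustrative only, and the real fused op performs this on the GPU in a single kernel:

```python
import numpy as np

def chunked_lm_loss(hidden, head_w, labels, chunk=2):
    """Cross-entropy over a large vocab without holding all logits at once.

    hidden: [n_tokens, d], head_w: [d, vocab], labels: [n_tokens].
    Tokens are processed in chunks, so peak memory is chunk*vocab
    rather than n_tokens*vocab.
    """
    losses = []
    for i in range(0, hidden.shape[0], chunk):
        logits = hidden[i:i + chunk] @ head_w             # [chunk, vocab]
        logits -= logits.max(axis=-1, keepdims=True)      # numerical stability
        logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        losses.append(-logp[np.arange(len(logits)), labels[i:i + chunk]])
    return np.concatenate(losses).mean()
```

Because each chunk produces the same per-token losses as the unchunked computation, the result is identical for any chunk size; only peak memory changes.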

2. LLM training updates
* General training
  * PR 9204: update tensor/pipeline parallelism for chatglmv2 (DrownFish19)
  * PR 9827: add pipeline and flashmask support for Qwen2Moe and Deepseek (DrownFish19)
* Embedding training
  * PR 9508: add the embedding trainer (DesmonDay). PR 9673: add INF-CL support for very-large-batch training (jie-z-0607)
  * PR 9656: fix loading the rng state in the Trainer (DesmonDay)
  * PR 9721: fix embedding randomness (DesmonDay)
* DPO training
  * PR 9543: flashmask support for qwen2 in the LLM dpo module (wtmlon)
  * PR 9620: update the dpo criterion (lugimzzz)
  * PR 9695: support dpo pp for qwen and llama (lugimzzz)
* New features
  * PR 9542: add the adam-mini optimizer (lugimzzz)
  * PR 9732: support AdamW training with BF16 momentum (lugimzzz)
  * PR 9830: fix checkpoint saving in non-flash mode (SylarTiaNII)
  * PR 9705: cherry-pick: validate loss before the optimizer step (SylarTiaNII)
  * PR 9704: cherry-pick: add an asynchronous metrics dumper for LLM training (SylarTiaNII)
* Training docs and fixes
  * PR 9689: add KTO (lugimzzz)
  * PR 9655: update the peft docs (lugimzzz)
  * PR 9659: fix lora issues (lugimzzz)
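The adam-mini optimizer added above reduces optimizer memory by keeping a single shared second-moment scalar per parameter block instead of one per element. A toy NumPy sketch of one update step, under that simplified reading of the method (not PaddleNLP's actual implementation):

```python
import numpy as np

def adam_mini_step(param, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One optimizer step with a block-shared second moment.

    Unlike Adam, which stores v with the same shape as the parameter,
    this sketch keeps one scalar v per parameter block, roughly halving
    optimizer-state memory.
    """
    state["m"] = b1 * state["m"] + (1 - b1) * grad   # per-element momentum
    v_block = float(np.mean(grad * grad))            # one scalar per block
    state["v"] = b2 * state["v"] + (1 - b2) * v_block
    state["t"] += 1
    m_hat = state["m"] / (1 - b1 ** state["t"])      # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)
```

The BF16-momentum AdamW variant attacks the same cost from the other direction, halving the precision of the momentum state rather than its shape.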

3. Inference updates
* Predictor & Flask updates
  * PR 9831: fix multibatch inference (DrownFish19)
  * PR 9841: fix position_ids (DrownFish19)
  * PR 9864: update Deepseek inference (DrownFish19)
  * PR 9828: make the Flask inference server compatible with the OpenAI API (ZHUI)
* MTP optimizations
  * PR 9856: support mtp with Deepseek-v3 in Inference (freeliuzc)
  * PR 9894: fix Deepseek_v3 mtp in multi-GPU mode (freeliuzc)
  * PR 9936: add mtp serving support (freeliuzc)
* Deployment
  * PR 9872: support multi-machine LLM deployment (ltd0924)
  * PR 9791: merge parts of the fastdeploy code (kevincheng2)
* Kernel optimizations
  * PR 9707: optimize the gemm_dequant OP using CUDA cores for int8_sq (zhink)
* Docs and tests
  * PR 9613: support llama3.2 in the Inference module and update docs (yuanlehome)
  * PR 9921: fix the block_size setting for llama (zhaohaixu)
  * PR 9711: add common-model and common-parameter unit tests for the LLM predictor (aooxin)
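MTP-style speculative decoding, as used in the PRs above, rests on a verify-and-accept loop: a cheap draft proposes several tokens and the target model keeps the longest prefix it agrees with, so agreed tokens cost one target pass for the whole batch. A toy greedy-verification sketch (illustrative only; `target_next_token` is a hypothetical stand-in for the target model, not the actual speculate_verify_and_update op):

```python
def greedy_verify(draft_tokens, target_next_token):
    """Accept the longest draft prefix the target model agrees with.

    draft_tokens: tokens proposed by the draft (e.g. MTP heads).
    target_next_token(prefix) -> the target model's greedy next token.
    Returns accepted tokens, ending with one token from the target:
    either its correction, or a free bonus token if all drafts matched.
    """
    accepted = []
    for tok in draft_tokens:
        t = target_next_token(accepted)
        if t != tok:
            accepted.append(t)   # target disagrees: take its token and stop
            return accepted
        accepted.append(tok)     # target agrees: draft token accepted
    accepted.append(target_next_token(accepted))  # bonus token
    return accepted
```

In a real engine the per-position target predictions come from a single batched forward pass over the drafted tokens, which is where the speedup comes from.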


4. AutoParallel / distributed training updates
* Auto parallel
  * PR 9578: add a llama2-7b-cinn test (zhangbo9674)
* Base configuration and CI integration
  * PR 9538: add qwen model_auto and CI (blacksheep-Aristotle)
  * PR 9541: add a llama3.1 auto-parallel config (zhiqiu)
  * PR 9551: add auto CI support for gpt and baichuan (blacksheep-Aristotle)
  * PR 9591: add ce support for gpt, baichuan, and qwen (blacksheep-Aristotle)
  * PR 9412: add a single_model network and use the intermediate API (blacksheep-Aristotle)
  * PR 9943: control split input via training_args (blacksheep-Aristotle)
* Tests, validation, and feature switches
  * PR 9621: add a PIR recompute test (waliwali777)
  * PR 9647: change loss_base after dropout supports SPMD (deepllz)
  * PR 9714: add stage-1 tensor fusion switches (AndSonder)
  * PR 9672: fix the recompute test running under to_static=1 (waliwali777)
  * PR 9688: merge ckpt for inference under auto parallel (xuxinyi389)
  * PR 9750 & PR 9753: fix ernie auto trainer CI errors (blacksheep-Aristotle)
  * PR 9749: enable tensor fusion for the benchmark (AndSonder)
  * PR 9810: add a sharding tensor fusion save/load switch (AndSonder)
  * PR 9862: support DP/MP for deepseekv2 (xuxinyi389)
  * PR 9823: add ppo ckpt support (xuxinyi389)

5. CI, documentation, benchmark, and test-script updates
* CI scripts and warning filters
  * PR 9547: update CI scripts (Liujie0926)
  * PR 9612: filter paddle.to_tensor warnings in CI (DrownFish19)
  * PR 9626: update the a100 loss_base config (Liujie0926)
  * PR 9889: CI script updates (Liujie0926)
  * PR 9524: add qwen2.5-7b to the LLM benchmark (Liujie0926)
  * PR 9662 & PR 9722: update the LLM_benchmark scripts (Liujie0926)
* Documentation improvements
  * PR 9585: fix dead links in the docs (DrownFish19)
  * PR 9668: update README.md (ZHUI)
  * PR 9785: update the documentation-facing README (ZHUI)
  * PR 9746: docs fixes (DrownFish19)
  * PR 9725: adjust benchmark environment variables and model configs (XieYunshen)
  * PR 9877: fix the inference and serving docs (ZHUI)
  * PR 9834: publish DeepSeek news and notes (DrownFish19)
  * PR 9922: fix errors in the fine-tuning docs (sijunhe)
* Benchmark configuration and tests
  * PR 9651: fix handling of abnormal exits in multi-node benchmark tasks (XieYunshen)
  * PR 9891: update the best config for gpt-13b in dygraph mode (liym27)


6. NPU/XPU and hardware-related updates
* NPU adaptation and fixes
  * PR 9499: adapt NPU for FusedHeadAndCrossEntropy (tianhaodongbd)
  * PR 9573: fix where on NPU (tianhaodongbd)
  * PR 9762: adapt to the new flash_attention_npu API (will-jl944)
* XPU features and optimizations
  * PR 9549: qwen2 supports flash_attn on XPU (will-jl944)
  * PR 9660: qwen2 supports fused_rope (will-jl944)
  * PR 9789: support empty_cache on XPU (will-jl944)
  * PR 9796: support XPU for auto-parallel LLaMa (From00)
  * PR 9854: add an XPU fused op for deepseek (QingshuChen)

7. Bug fixes, performance optimizations, and other improvements
* State loading and multithreading
  * PR 9464: fix multi-threaded load_state_dict (DesmonDay)
* Model and operator fixes
  * PR 9603: fix a d2s bug in qwen2 modeling (wawltor)
  * PR 9569: fix norm outputs in dynamic and static modes (Wangzheee)
  * PR 9652: fix paddle.where (will-jl944)
  * PR 9638: add the config option replace_with_c_embedding (Xing-lil)
  * PR 9699: fix loraga amp (greycooker)
  * PR 9752: fix a bug in get_block_shape_and_split_kv_block (lizhenyun01)
  * PR 9759: fix the speculate_verify_and_update op (Wanglongzhi2001)
  * PR 9674: merge speculate_step into the step op (Wanglongzhi2001)
  * PR 9757: update sequence parallel in the Trainer module (DesmonDay)
  * PR 9765: fix loraga merge (greycooker)
  * PR 9777: cherry-pick: support fuse optimizer in distributed training (SylarTiaNII)
  * PR 9783: fix a ce error (blacksheep-Aristotle)
  * PR 9779: fix an unsafe pickle load (DrownFish19)
  * PR 9760: fix expert parallel in the MoE module (DesmonDay)
  * PR 9790: add a pir_model path for server infer (aooxin)
  * PR 9706: cherry-pick: integrate the PDC SDK for LLM training fault tolerance (SylarTiaNII)
  * PR 9624: add FLAGS to replace four parameters for better speedup (zhink)
  * PR 9806: fix a LLAMA argument-parsing bug (will-jl944)
  * PR 9829: update mixtral.md (yuanlehome)
  * PR 9859: fix the dsk rope diff (yuanlehome)

8. Environment/dependency and version-compatibility updates
* Requirements and installation updates
  * PR 9514: update requirements.txt for py38 (ZHUI)
  * PR 9118: update installation dependencies (DrownFish19)
  * PR 9953: add the tokenizers dependency for py38 (DrownFish19)
* Python version compatibility
  * PR 9853: fix Python-version compatibility of type annotations (zty-king)
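One common way to keep modern type-annotation syntax importable on older interpreters (an assumption about the general technique, not necessarily PR 9853's exact change) is PEP 563's postponed evaluation: with the future import, annotations are stored as strings and never evaluated at import time, so PEP 604 unions and PEP 585 generics parse even on Python 3.8:

```python
from __future__ import annotations  # annotations become lazily-evaluated strings

def find(name: str | None = None) -> list[str]:
    """Uses `str | None` (PEP 604) and `list[str]` (PEP 585) syntax.

    With the future import these annotations are stored as plain strings,
    so this module also imports cleanly on interpreters that predate
    the new syntax at runtime.
    """
    return [] if name is None else [name]
```

The trade-off is that tools needing the real annotation objects must resolve the strings themselves, e.g. via `typing.get_type_hints`.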


What's Changed
* Update requirements.txt for py38 by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9514
* [Unified Checkpoint] fix single card loading without master weights by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9540
* Fix multi-threading load_state_dict by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9464
* delete generate_rank_mapping when export multi cards model by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9552
* [LLM] dpo support qwen2 with flashmask by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9543
* [XPU] qwen2 supports flash_attn on XPU by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9549
* [AutoParallel]: add qwen model_auto and ci by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9538
* add llama3.1 config for auto_parallel by zhiqiu in https://github.com/PaddlePaddle/PaddleNLP/pull/9541
* Add more model support for speculate_decoding and refactor speculate_decoding by Wanglongzhi2001 in https://github.com/PaddlePaddle/PaddleNLP/pull/9504
* [Intel_HPU]FSDPA custom kernel API update by yanfeich in https://github.com/PaddlePaddle/PaddleNLP/pull/9556
* [Unified Checkpoint] fix load missing keys by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9523
* [Hackathon 7th No.43] Improve TokenizerFast support, part 3 by yinfan98 in https://github.com/PaddlePaddle/PaddleNLP/pull/9548
* adapt code to amsgrad supported adamw by HydrogenSulfate in https://github.com/PaddlePaddle/PaddleNLP/pull/9568
* [CI]update scripts by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9547
* Adapting npu for FusedHeadAndCrossEntropy by tianhaodongbd in https://github.com/PaddlePaddle/PaddleNLP/pull/9499
* [Hackathon 7th No.43] Improve TokenizerFast support, part 4 by yinfan98 in https://github.com/PaddlePaddle/PaddleNLP/pull/9577
* fix(export_model): fix export_model.py python path by thinking-computer in https://github.com/PaddlePaddle/PaddleNLP/pull/9571
* Fix_ckpt_oom_paddlenlp by Xing-lil in https://github.com/PaddlePaddle/PaddleNLP/pull/9507
* Add GPUEventTimer by sneaxiy in https://github.com/PaddlePaddle/PaddleNLP/pull/9582
* [npu] fix where bug by tianhaodongbd in https://github.com/PaddlePaddle/PaddleNLP/pull/9573
* [doc] Fix dead links by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9585
* [AutoParallel]:add gpt & baichuan auto ci by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9551
* Add llama2-7b-cinn test by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/9578
* [AutoParallel]:add gpt&baichuan&qwen ce by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9591
* fix dpo pp eval by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9607
* [LLM] update tensor and pipeline parallel for chatglmv2 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9204
* [Install] Update requirment.txt by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9118
* [Trainer]Fix _get_eval_sampler by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/9374
* fix benchmark scripts by XieYunshen in https://github.com/PaddlePaddle/PaddleNLP/pull/9597
* [Trainer] Add embedding trainer by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9608
* [CI] filter paddle.to_tensor warnings when set_state_dict by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9612
* fix ckpt quant log by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9606
* fix the d2s bug in qwen2 modeling by wawltor in https://github.com/PaddlePaddle/PaddleNLP/pull/9603
* [Hackathon 7th No.43] Improve TokenizerFast support, part 5 by yinfan98 in https://github.com/PaddlePaddle/PaddleNLP/pull/9594
* fix pp_config bug by tianhaodongbd in https://github.com/PaddlePaddle/PaddleNLP/pull/9605
* Speedup FusedHeadAndCrossEntropy by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9601
* fix get_save_output op and refactor specu_decoding by Wanglongzhi2001 in https://github.com/PaddlePaddle/PaddleNLP/pull/9576
* [Inference] Fix docs and support llama3.2 by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9613
* fix by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9628
* fix norm outputs in dynamic and static mode by Wangzheee in https://github.com/PaddlePaddle/PaddleNLP/pull/9569
* [CI]update a100 loss_base for gpt by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9626
* [LLM benchmark]add qwen2.5-7b by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9524
* Checkpoint Compression Doc by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9614
* Update unified_checkpoint.md by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9634
* add llama and nv-embed training by Li-Z-Q in https://github.com/PaddlePaddle/PaddleNLP/pull/9323
* [News] Unified Checkpoint by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9632
* feat(sdaa): support sdaa backend infer by thinking-computer in https://github.com/PaddlePaddle/PaddleNLP/pull/9570
* [llm]update dpo criterion by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9620
* [llm]add adam-mini by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9542
* Update version for beta3 by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9553
* [LLM DOCs] Add deepseek llama3.3 new models by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9643
* [Tokenizer] Fix tokenizer of llama3.3 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9641
* [AutoParallel] Add test for PIR recompute by waliwali777 in https://github.com/PaddlePaddle/PaddleNLP/pull/9621
* Update README.md for 3.0 beta3 by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9644
* Add replace_with_parallel_cross_entropy flag by waliwali777 in https://github.com/PaddlePaddle/PaddleNLP/pull/9579
* [AutoParallel] change loss_base after dropout support spmd by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/9647
* [Embedding] Add embedding training by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9508
* [PEFT]Add LoRA-GA by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/9592
* mergekit_with_sparsify by Mangodadada in https://github.com/PaddlePaddle/PaddleNLP/pull/9561
* Fix paddle.where by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9652
* Add config replace_with_c_embedding by Xing-lil in https://github.com/PaddlePaddle/PaddleNLP/pull/9638
* Update embedding trainer state by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9629
* MoRA Implementation by lcykww in https://github.com/PaddlePaddle/PaddleNLP/pull/9562
* [llm]update peft docs by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9655
* [Trainer] Fix loading rng state by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9656
* fix qwen&baichaun&gpt ci error by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9650
* [llm] fix lora by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9659
* [XPU] qwen2 supports fused_rope by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9660
* update hygon dcu docs by TimeYWL in https://github.com/PaddlePaddle/PaddleNLP/pull/9298
* Make the timer compatible with devices other than GPU by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/9665
* [Trainer] update remove_master_weight by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9640
* [DOC] Update README.md by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9668
* [Mthreads] support llama 13B train by shang-mt in https://github.com/PaddlePaddle/PaddleNLP/pull/9666
* Structured Index of Documents by dfmz759837901 in https://github.com/PaddlePaddle/PaddleNLP/pull/9411
* [Qwen2-VL Inference] add qwen2-vl high performance inference by chang-wenbin in https://github.com/PaddlePaddle/PaddleNLP/pull/9575
* merge docs by Mangodadada in https://github.com/PaddlePaddle/PaddleNLP/pull/9657
* [CI]update blacklist for gpt3 by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9555
* [UX optimization] Consolidate the training CUDA and Triton kernels into paddlenlp_kernel by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/9471
* [Unified Checkpoint] bug fix by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9669
* Add tied_weight_keys for pipeline model by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9663
* Optimize performance for Qwen2 model by sneaxiy in https://github.com/PaddlePaddle/PaddleNLP/pull/9616
* [MLU] add mlu llama readme by PeiyuLau in https://github.com/PaddlePaddle/PaddleNLP/pull/9671
* Set tensor parallel name mapping when fusion is used by sneaxiy in https://github.com/PaddlePaddle/PaddleNLP/pull/9685
* [LLM] add deploy server by kevincheng2 in https://github.com/PaddlePaddle/PaddleNLP/pull/9581
* [Embedding] Add inf-cl in embedding trainer by jie-z-0607 in https://github.com/PaddlePaddle/PaddleNLP/pull/9673
* [Fix]fix loraga amp by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/9699
* [LLM INFER] cutlass 3.x gemm on sm90 by ckl117 in https://github.com/PaddlePaddle/PaddleNLP/pull/9398
* [Iluvatar] Add readme for llama-13b by tianyuzhou668 in https://github.com/PaddlePaddle/PaddleNLP/pull/9670
* [AutoParallel] merge ckpt for inference by xuxinyi389 in https://github.com/PaddlePaddle/PaddleNLP/pull/9688
* update gpt&baichuan&qwen ce name by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9697
* fix docs by xuxinyi389 in https://github.com/PaddlePaddle/PaddleNLP/pull/9703
* [Inference] Use cuda core(int8_sq) for m <=4 in gemm_dequant OP by zhink in https://github.com/PaddlePaddle/PaddleNLP/pull/9707
* [LLM] [Cherry-Pick] valid loss before optimizer step (9255) by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9705
* [llm]support dpo pp for qwen & llama by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9695
* support qwen dpo fused kernel by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9686
* [AutoParallel] Fix recompute test running under `to_static=1` by waliwali777 in https://github.com/PaddlePaddle/PaddleNLP/pull/9672
* [LLM_benchmark]update LLM_benchmark scripts by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9662
* [LLM] [Cherry-Pick] add asynchronous metrics dumper for llm training by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9704
* [llm] Add KTO by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9689
* [Embedding] Fix embedding random by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9721
* remove refined recompute deep copy by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/9617
* add single_model network and use intermediate api by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9412
* Refactor custom devices. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9734
* Add offload_recompute_inputs by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9715
* [LLM] [Cherry-Pick] Integrate PDC SDK for LLM training fault tolerance platform by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9706
* add common models and common params unit test for llm predictor. by aooxin in https://github.com/PaddlePaddle/PaddleNLP/pull/9711
* Added FLAGS to replace four params and the value can be adjusted for better speedup by zhink in https://github.com/PaddlePaddle/PaddleNLP/pull/9624
* [AutoParallel] add parameter enable_stage1_tensor_fusion_blanced_save_load and enable_stage1_tensor_fusion by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9714
* Adapt to new npu flash_attention api by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9735
* [AutoParallel] Add test for PIR refined recompute by waliwali777 in https://github.com/PaddlePaddle/PaddleNLP/pull/9679
* [Docs] Fix by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9746
* Bugfix update predictor.py by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9742
* Modify the environment variables and model configuration of the bench… by XieYunshen in https://github.com/PaddlePaddle/PaddleNLP/pull/9725
* [Unified Checkpoint] Fix expert parallel by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9741
* [AutoParallel]:ufix ernie ci error by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9750
* fix import bugs. by aooxin in https://github.com/PaddlePaddle/PaddleNLP/pull/9751
* [AutoParallel]ckpt support local views keys to global views keys by xuxinyi389 in https://github.com/PaddlePaddle/PaddleNLP/pull/9604
* Add XLMRoBERTaModel in paddlenlp by jie-z-0607 in https://github.com/PaddlePaddle/PaddleNLP/pull/9720
* [AutoParallel]:fix ernine auto_trainer error by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9753
* fix get_block_shape_and_split_kv_block by lizhenyun01 in https://github.com/PaddlePaddle/PaddleNLP/pull/9752
* fix speculate_verify_and_update op by Wanglongzhi2001 in https://github.com/PaddlePaddle/PaddleNLP/pull/9759
* [Inference]merge speculate_step into step op by Wanglongzhi2001 in https://github.com/PaddlePaddle/PaddleNLP/pull/9674
* [NPU] Adapt to new flash_attention_npu api by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9762
* [Trainer] update sequence parallel by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9757
* [tokenizer] Fix AutoTokenizer by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9745
* [LLM] Add DeepseekV3 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9738
* [AutoParallel] open tensor_fusion for benchmark by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9749
* fix loraga merge by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/9765
* Fix ernie ci auto trainer error by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9758
* Update README.md by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9766
* Fix matryoshka norm loss by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9774
* [Distributed] [Cherry-Pick] support fuse optimizer (9519) by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9777
* Update register_sequence_parallel_allreduce_hooks by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9782
* Fix ce error by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9783
* fix pickle unsafe-load by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9779
* [MoE] fix expert parallel by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9760
* fix dpo pp criterion by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9786
* add pir_model path for server infer. by aooxin in https://github.com/PaddlePaddle/PaddleNLP/pull/9790
* [LLM] [Cherry-Pick] support flash device on static model (9619) by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9787
* [LLM Benchmark]update scripts by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9722
* mergekit gpu 1226 by Mangodadada in https://github.com/PaddlePaddle/PaddleNLP/pull/9702
* [LLM] merge code from fastdeploy by kevincheng2 in https://github.com/PaddlePaddle/PaddleNLP/pull/9791
* support eagle for llama by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/9812
* [CI] Fix by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9633
* wrap model when lora is ON and only do evaluation. by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9803
* Update README.md for documention by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9785
* [Checkpoint compression] Support sharding stage1 v2 by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9817
* [LLM] Update model convert and fix TP for deepseekv3 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9797
* [AutoParallel] add sharding tensor_fusion save load switch by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9810
* Fix handling of abnormal exits in multi-node benchmark tasks by XieYunshen in https://github.com/PaddlePaddle/PaddleNLP/pull/9651
* Fix LLAMA arg parsing bug in pp by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9806
* Update mixtral.md by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9829
* [XPU] Support empty_cache on XPUs by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9789
* [Inference] Fix multibatch inference by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9831
* Fix position_ids for infra by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9841
* [LLM] Add pipeline and flashmask for Qwen2Moe and Deepseek by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9827
* [Mergekit]update & add LoRA merge by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9811
* [Unified Checkpoint] Fix expert parallel by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9821
* [Inference] Flask server compatible with OpenAI api. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9828
* [LLM] fix checkpoint save for non flash mode by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9830
* [DSK] support deepseek-v3/r1 (mha/fp16/bf16/wint8/wint4) by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9769
* Fix Python-version compatibility of type annotations by zty-king in https://github.com/PaddlePaddle/PaddleNLP/pull/9853
* [Tokenizer] save extra special tokens by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9837
* [Bugfix] Fix dsk rope diff by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9859
* Support lower memory cards. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9804
* Support XPU for auto-paralllel LLaMa by From00 in https://github.com/PaddlePaddle/PaddleNLP/pull/9796
* [XPU] Add fused op for deepseek by QingshuChen in https://github.com/PaddlePaddle/PaddleNLP/pull/9854
* [Inference] Update deepseek by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9864
* [PreTrain] Support deepseek mfu for pretraining and fix tflops for pretrain pipe model by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9855
* [Inference]Support mtp with deepseek-v3 by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/9856
* [AutoParallel] Support deepseekv2 with DP/MP by xuxinyi389 in https://github.com/PaddlePaddle/PaddleNLP/pull/9862
* [LLM] move modeling.py and modeling_nv.py to transformers by Li-Z-Q in https://github.com/PaddlePaddle/PaddleNLP/pull/9676
* [Docs] fix docs for inference and servering by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9877
* [Docs] news of DeepSeek by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9834
* [AutoParallel]support_ppo_ckpt by xuxinyi389 in https://github.com/PaddlePaddle/PaddleNLP/pull/9823
* support intermediate_api llama test by liym27 in https://github.com/PaddlePaddle/PaddleNLP/pull/9850
* Update MergeKit by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9885
* [LLM] Support multi machine deployment by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/9872
* [SpecInfer] Fix low acceptance rate bug in InferenceWithReference by Wanglongzhi2001 in https://github.com/PaddlePaddle/PaddleNLP/pull/9880
* update the best conf for gpt-13b in dygraph mode by liym27 in https://github.com/PaddlePaddle/PaddleNLP/pull/9891
* [Inference]fix deepseek_v3 with mtp in multi-gpu mode by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/9894
* [TaskFlow] Fix pir for taskflow by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9822
* [LLM-IE] Add pp-uie to Taskflow by Fantasy-02 in https://github.com/PaddlePaddle/PaddleNLP/pull/9845
* [DOC] Update README for PP-UIE by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9911
* [benchmark] align benchmark conf for static baichuan2 gpt3 by liym27 in https://github.com/PaddlePaddle/PaddleNLP/pull/9901
* [DOC] PP-UIE by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9913
* add gpu whl by bukejiyu in https://github.com/PaddlePaddle/PaddleNLP/pull/9890
* add count trained tokens by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9800
* Fix errors in the fine-tuning docs by sijunhe in https://github.com/PaddlePaddle/PaddleNLP/pull/9922
* [CI]update ci scripts by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9889
* [LLM]: fix block_size setting for llama. by zhaohaixu in https://github.com/PaddlePaddle/PaddleNLP/pull/9921
* support qwen2_5_vl by chang-wenbin in https://github.com/PaddlePaddle/PaddleNLP/pull/9924
* [DSK] Fix some bugs for dsk-v3 by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9874
* support intermediate_api gpt-3 test by Function-Samuel in https://github.com/PaddlePaddle/PaddleNLP/pull/9912
* support intermediate_api qwen test by Function-Samuel in https://github.com/PaddlePaddle/PaddleNLP/pull/9910
* [LLM] Add MTP for Deepseekv3 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9876
* [taskflow] Fix taskflow bug by Fantasy-02 in https://github.com/PaddlePaddle/PaddleNLP/pull/9930
* [Inference] Support mtp serving by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/9936
* [Autoparallel] Mtp for DeepSeekV3 by xuxinyi389 in https://github.com/PaddlePaddle/PaddleNLP/pull/9884
* [Unified Checkpoint] Fix split param loading directly when using ignore_merge_optimizer by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9935
* [DSK] Implement mla use matrix-absorption by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9875
* use training_args to control split input by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9943
* [requirements] tokenizers for py38 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9953
* [LLM] update llm server dockerfiles by kevincheng2 in https://github.com/PaddlePaddle/PaddleNLP/pull/9940
* [Inference] fix dynamic_forward of mtp by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/9947
* [RL] Fix PPO and add GRPO by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9925
* [doc] update config and add docs for grpo by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9962
* Add Process Reward Model. by XuLingnan in https://github.com/PaddlePaddle/PaddleNLP/pull/9598
* [Feature] Support float8 dtype storage and deepseek v3 with fp8 inference. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9906
* [AutoParallel] Add auto parallel moe layer by pkuzyc in https://github.com/PaddlePaddle/PaddleNLP/pull/9886
* [llm]add bf16 moment adamw by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9732
* [MergeKit]add log by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9948
* Longlora by micelvrice in https://github.com/PaddlePaddle/PaddleNLP/pull/9970
* Fix update paddle_patch.py by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9968
* support MMLU eval by vivienfanghuagood in https://github.com/PaddlePaddle/PaddleNLP/pull/9967
* Update paddle_patch.py by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9978
* [XPU] change llama loss func on xpu by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9973
* [Inference] refine csrc/tools/build_wheel.sh by bukejiyu in https://github.com/PaddlePaddle/PaddleNLP/pull/9971
* [DSK] mla use tensor core by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9952
* Update paddle_patch.py by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9984
* [LLM]fix ci by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9986
* Fix mtp speed by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/9987
* [Trainer]fix wandb proxy by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/9960
* [LLM] add moe parallel groups by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9982
* [AutoParallel] Fix pipeline visualization tool by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9976
* [llm]fix ci by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9989
* [DSK] DeepSeek Support FP8 by ming1753 in https://github.com/PaddlePaddle/PaddleNLP/pull/9956
* [LLM INFER] update step_paddle by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9991
* [AutoParallel] Add pp stage id by xuxinyi389 in https://github.com/PaddlePaddle/PaddleNLP/pull/9965
* [CI] Fix tokenizer load in PRM by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9997
* fix tensor core precision while splitting kv by lizhenyun01 in https://github.com/PaddlePaddle/PaddleNLP/pull/9994
* support intermediate_api baichuan test by Function-Samuel in https://github.com/PaddlePaddle/PaddleNLP/pull/9988
* [Inference] Add benchmark client test scripts by gzy19990617 in https://github.com/PaddlePaddle/PaddleNLP/pull/9996
* [LLM] Support for automatic deployment of services, modification of environment variable names by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/9966
* Add moe flex dispatcher by umiswing in https://github.com/PaddlePaddle/PaddleNLP/pull/9977
* add deepseek doc by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9964
* [Feat] Sage Attention Kernels Support for sm80, sm89, sm90 by l1cacheDell in https://github.com/PaddlePaddle/PaddleNLP/pull/9848
* [LLM] support fix seq len and cmd run service by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/10004
* [Doc] Add Qwen/QwQ-32B model ids by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/10005
* [LLM] Fix MTP for pipeline parallel by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9972
* [LLM] Update license by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/10003
* [Infer] remove some buggy configs for block gemm by ckl117 in https://github.com/PaddlePaddle/PaddleNLP/pull/10002
* Default set FLAGS_cascade_attention_max_partition_size as 32K by lizhenyun01 in https://github.com/PaddlePaddle/PaddleNLP/pull/10013
* [Distribution] Support DualPipeV for GPT3 by zhangyuqin1998 in https://github.com/PaddlePaddle/PaddleNLP/pull/9993
* [inference]add docker doc by bukejiyu in https://github.com/PaddlePaddle/PaddleNLP/pull/9998
* [Docs] Update speculative decoding docs by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/10017
* [LLM] fix llm model path and support download from txt by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/10029
* [CherryPick] add import check of local_layer by pkuzyc in https://github.com/PaddlePaddle/PaddleNLP/pull/10038
* fix mla nan in mtp by lizhenyun01 in https://github.com/PaddlePaddle/PaddleNLP/pull/10041
* [cherry-pick] update doc by bukejiyu in https://github.com/PaddlePaddle/PaddleNLP/pull/10043
* [CI] fix install issue for requirements-dev.txt by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/10051
* [Doc] Allow users to download the static graph themselves by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/10046
* [cherry-pick] (PR10034 [server]Add a model download script and fix bugs for the server) by bukejiyu in https://github.com/PaddlePaddle/PaddleNLP/pull/10035
* [LLM] Add version number by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/10056
* check mtp triton cache by ckl117 in https://github.com/PaddlePaddle/PaddleNLP/pull/10065
* Update version setup.py by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/10070
* [Doc] Improve docs and add model environment and device requirements by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/10073
* change_h100_to_h800 by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/10091
* [Doc] Improve docs and update example models by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/10085
* [Serving] Fix serving bug release by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/10101

New Contributors
* thinking-computer made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9571
* Wangzheee made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9569
* lcykww made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9562
* shang-mt made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9666
* dfmz759837901 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9411
* jie-z-0607 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9673
* aooxin made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9711
* zty-king made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9853
* Fantasy-02 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9845
* zhaohaixu made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9921
* XuLingnan made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9598
* micelvrice made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9970

**Full Changelog**: https://github.com/PaddlePaddle/PaddleNLP/compare/v3.0.0-beta3...v3.0.0-beta4

3.0.0beta3

This release improves PaddleNLP's core experience: it adds the Llama-3.2 and DeepSeekV2 models, upgrades TokenizerFast, and refactors SFTTrainer.

In addition, PaddleNLP now supports offloading and reloading optimizer states and implements fine-grained recomputation, improving training performance by 7%. For Unified Checkpoint, the asynchronous save logic has been further optimized, and a new checkpoint compression feature saves up to 78.5% of storage space.
Finally, we have made deep optimizations to LLM inference, auto parallelism, multi-hardware support, and documentation.



Key Updates and Enhancements

1. **New models**:
- Added the Llama-3.2 (9199) and DeepSeekV2 (9250) models, further broadening the range of available large models.

2. **Infrastructure improvements**:
- Refactored SFTTrainer and SFTConfig, improving code maintainability. (9318)
- Added support for offloading and reloading optimizer states (9467), effectively reducing memory usage.
- Implemented fine-grained recomputation support via hooks; on the llama model, for example, this improves training performance by 7%. (9396)
- **Unified Checkpoint optimizations**:
  - Updated the asynchronous save logic (9173, 9274, 9321), significantly improving checkpoint save and load efficiency.
  - Added support for expert parallelism (9055), making model training more flexible.
  - Supported Unified Checkpoint with sharding_comm_overlap enabled. (9392)
  - Added checkpoint compression, saving up to 78.5% of storage space. ([9183](https://github.com/PaddlePaddle/PaddleNLP/pull/9183))
  - Reduced checkpoint loading time with multi-threading (9034).

- **Tokenizer enhancements**:
  - The `padding_side` parameter can now be specified at tokenizer call time (9258), improving usability.
  - The Qwen tokenizer now supports adding special tokens (9344), increasing its flexibility.
  - Fixed the missing `clean_up_tokenization_spaces` option in TokenizerFast (9304), improving text-processing accuracy.
  - Unified the tokenizers' `_pad` function into the base class. [9280](https://github.com/PaddlePaddle/PaddleNLP/pull/9280)
  - Added `BertTokenizerFast` and support for registering new tokenizers. ([9353](https://github.com/PaddlePaddle/PaddleNLP/pull/9353))
  - Improved special-input handling in the chat templates of the Qwen, Gemma, and Yuan models. ([9462](https://github.com/PaddlePaddle/PaddleNLP/pull/9462))
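As a minimal plain-Python sketch (not the PaddleNLP API itself) of what the call-time `padding_side` argument mentioned above (PR 9258) controls, here is the padding semantics on raw token-id lists; the `pad_batch` helper is made up for illustration:

```python
# Hypothetical helper mimicking what padding_side controls in a tokenizer call:
# pad every sequence in the batch to the longest one, on the left or the right.
def pad_batch(batch, pad_id=0, padding_side="right"):
    max_len = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        pad = [pad_id] * (max_len - len(seq))
        padded.append(seq + pad if padding_side == "right" else pad + seq)
    return padded

right = pad_batch([[1, 2, 3], [4]], padding_side="right")  # [[1, 2, 3], [4, 0, 0]]
left = pad_batch([[1, 2, 3], [4]], padding_side="left")    # [[1, 2, 3], [0, 0, 4]]
```

Left padding is the usual choice for decoder-only generation, which is why exposing the switch at call time rather than only at tokenizer construction is convenient.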

3. **Inference performance improvements**:
- LLM inference can now directly quantize built-in BOS models (9197).
- Strengthened FP8 quantization support in LLM inference (e.g. 9328, 9423), meeting diverse precision requirements.
- Enhanced support for speculative decoding (9180) and Append Attention (9244).
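To illustrate the idea behind speculative decoding mentioned above: a cheap draft model proposes several tokens, and the target model verifies them, keeping the longest agreeing prefix. The toy sketch below uses greedy verification and made-up next-token callables; it is not PaddleNLP's implementation:

```python
# Toy speculative-decoding step: draft k tokens, then let the target model
# accept the agreeing prefix and emit one correction at the first mismatch.
def speculative_step(prefix, draft_next, target_next, k=4):
    # Draft k candidate tokens autoregressively.
    candidates = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        candidates.append(t)
        ctx.append(t)
    # Target verifies: accept until the first disagreement, then substitute
    # its own token (so every step still makes progress).
    accepted = []
    ctx = list(prefix)
    for t in candidates:
        want = target_next(ctx)
        if want != t:
            accepted.append(want)
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

# Hypothetical next-token functions: the draft counts up; the target agrees
# everywhere except after token 2, where it insists on 99.
draft = lambda ctx: ctx[-1] + 1
target = lambda ctx: ctx[-1] + 1 if ctx[-1] != 2 else 99
out = speculative_step([0], draft, target, k=4)  # [1, 2, 99]
```

When draft and target agree, all k tokens are accepted in a single target pass, which is where the speedup comes from.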

4. **Hardware compatibility**:
- Strengthened support for Intel HPU (9273), which now supports dynamic-graph prediction.
- Provided the Unified Checkpoint feature for domestic hardware such as XPU (9312).
- Fixed bugs in XPU and DCU support and improved performance. [9414](https://github.com/PaddlePaddle/PaddleNLP/pull/9414) and [#9433](https://github.com/PaddlePaddle/PaddleNLP/pull/9433)

5. **Auto-parallel optimizations**:
- Fixed multiple issues in auto-parallel training (e.g. 9217, 9355), ensuring stable parallel training.
- Updated the auto-parallel configuration and checkpoint converter (e.g. 9136, 9432), improving training flexibility and stability.

6. **Documentation and test updates**:
- Updated multiple documents, including the LLM model docs (e.g. 9314) and quantization docs (e.g. 9330), keeping information current and accurate.
- Added test cases such as distributed data-loading tests (9438), improving test coverage.
- Fixed broken links and layout issues in the docs (e.g. 9127, 9515), improving the user experience.


This release marks another step forward for PaddleNLP, delivering a more comprehensive, efficient, and stable NLP solution. We look forward to bringing users more innovation and value in future releases.

What's Changed
* [Unified Checkpoint] update async_save_info in develop by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9173
* add flashmask rm by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9154
* [LLM_INFER] Support quantized model from bos and fix docs by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9197
* fix ci not set no_proxy and modify tests in pir mode by fightfat in https://github.com/PaddlePaddle/PaddleNLP/pull/9205
* [Models] Add Llama-3.2 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9199
* move some auto_parallel args into class AutoTrainingArguments by Wennie396 in https://github.com/PaddlePaddle/PaddleNLP/pull/9155
* [Performance] Compatible with flashmask API rename upgrade by GuoxiaWang in https://github.com/PaddlePaddle/PaddleNLP/pull/9019
* [AutoParallel] add vpp align and pp amp test by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9176
* fix auto ci return bug when run in v100 by fightfat in https://github.com/PaddlePaddle/PaddleNLP/pull/9216
* fix auto ci return bug when run in v100 by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9228
* [LLM] Add tools for parameters by Hanyonggong in https://github.com/PaddlePaddle/PaddleNLP/pull/9137
* [AutoParallel] Add test for fuse_ffn and fuse_attention_qkv pass by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/9203
* [CI] Fix ci import. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9239
* [Version] Update version info by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9241
* [Auto Parallel] Adding align mode support by zhangyuqin1998 in https://github.com/PaddlePaddle/PaddleNLP/pull/9150
* [LLM INFER] top_p_sampling_reject support top_p=0 and custom seed by gzy19990617 in https://github.com/PaddlePaddle/PaddleNLP/pull/9202
* [INFER] update tune_cublaslt_gemm op and fix some bugs by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9222
* Reduce the time spent on git downloading third-party libraries by vivienfanghuagood in https://github.com/PaddlePaddle/PaddleNLP/pull/9246
* [PIR] fix pir open bugs by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9248
* Cherry-pick some PRs from incubate/paddlenlp-fleety by sneaxiy in https://github.com/PaddlePaddle/PaddleNLP/pull/9245
* [Unified Checkpoint] Support expert parallel by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9055
* [PIR] fix pir dt2st for chatglm_v2 by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9251
* Cherry-pick some PRs from incubate/paddlenlp-fleety by LiYuRio in https://github.com/PaddlePaddle/PaddleNLP/pull/9253
* [Unified Checkpoint] Fix generation config save by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9223
* [AutoParallel] Fix tests for pass paddle AutoParallel CI by liym27 in https://github.com/PaddlePaddle/PaddleNLP/pull/9267
* change dataset by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9266
* [Unified Checkpoint] update async save logic by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9274
* add config file for model chatglm2,gemma,yuan by Mangodadada in https://github.com/PaddlePaddle/PaddleNLP/pull/9139
* Fix async hang by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9276
* [AutoParallel] Change llama test from sharding stage2 to stage1 by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/9281
* [Tokenizer] Enable padding_side as call time kwargs by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9258
* [Trainer] fix save_model by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9286
* [CI] Skip inference test cases by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9270
* [LLM] Add deepseekv2 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9250
* [Tokenizer] Unify tokenizer _pad by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9280
* [CI] Fix llm/alignment/rm/flashmask path by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9289
* support attention mask using causal=True by GuoxiaWang in https://github.com/PaddlePaddle/PaddleNLP/pull/9268
* [FlashMask] Add FlashMask for Qwen2 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9264
* bug fix for xpu_parallel_matmul by FeixLiu in https://github.com/PaddlePaddle/PaddleNLP/pull/9297
* fix lora sharding v2 by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9300
* [LLM INFER] Append attn by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9244
* [Auto Parallel] fix bugs for split_batches_for_accumulation && fix bu… by zhangyuqin1998 in https://github.com/PaddlePaddle/PaddleNLP/pull/9217
* [Tokenizer] Fix TokenizerFast missing clean_up_tokenization_spaces by dynamicheart in https://github.com/PaddlePaddle/PaddleNLP/pull/9304
* clean llama static modeling file by zhiqiu in https://github.com/PaddlePaddle/PaddleNLP/pull/9301
* [Unified Checkpoint] Accelerate loading checkpoint by multi-thread by Crystal-X-111 in https://github.com/PaddlePaddle/PaddleNLP/pull/9034
* fix non-pipelinelayer to distributed by gongel in https://github.com/PaddlePaddle/PaddleNLP/pull/9310
* change the legacy to slm by wawltor in https://github.com/PaddlePaddle/PaddleNLP/pull/9311
* [TRL] Rename sft trainer. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9292
* [XPU] support unified ckpt function by cqulilujia in https://github.com/PaddlePaddle/PaddleNLP/pull/9312
* [LLM INFER] Fix some bugs and chatglm_v2 support block_attn by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9271
* [Readme] Add flash mask by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9219
* update llm infer docs by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9314
* [Unified Checkpoint] Add split param and refactor code by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9240
* [METAX] Support llama for MX C550 by idontkonwher in https://github.com/PaddlePaddle/PaddleNLP/pull/9186
* update QR code by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9325
* add flash_attention on model chatglm_v2 by Mangodadada in https://github.com/PaddlePaddle/PaddleNLP/pull/9296
* fix readme by Mangodadada in https://github.com/PaddlePaddle/PaddleNLP/pull/9326
* [Unified Checkpoint] update non-merge checkpoint loading, move async_save_info.json location by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9321
* [paddle cpu inference]fix cpu doc by bukejiyu in https://github.com/PaddlePaddle/PaddleNLP/pull/9299
* [LLM INFER] add rope_theta for block_multihead_attention by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9334
* Fix pr 9334 by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9335
* fix parameter calculation in auto_parallel mode by zhiqiu in https://github.com/PaddlePaddle/PaddleNLP/pull/9327
* [Docs] Update flashmask by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9330
* Update load_save_single_card.py by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9337
* Update README.md by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9339
* [Tokenizer] Support reading Tiktoken tokenizer.model. by lvdongyi in https://github.com/PaddlePaddle/PaddleNLP/pull/9215
* align default custom black/white list for dygraph and static graph by zhiqiu in https://github.com/PaddlePaddle/PaddleNLP/pull/9340
* [intel_hpu] initial commit for intel_hpu support by yanfeich in https://github.com/PaddlePaddle/PaddleNLP/pull/9273
* Compatible with Tensor.to change to out_of_place. by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9343
* [Tokenizer] Fix Llama3Tokenizer import by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9341
* [Docs] Add precision alignment doc by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9346
* [Tokenizer] Support adding special tokens to Qwen tokenizer by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9344
* Add ordered save to avoid OOM by ForFishes in https://github.com/PaddlePaddle/PaddleNLP/pull/9347
* [AutoParallel]Bugfix Hang for VPP-Sharding by JZ-LIANG in https://github.com/PaddlePaddle/PaddleNLP/pull/9336
* Add CI testing for A100 and V100 device by waliwali777 in https://github.com/PaddlePaddle/PaddleNLP/pull/9324
* [Inference] Append attn FP8 quant by ckl117 in https://github.com/PaddlePaddle/PaddleNLP/pull/9328
* [Tokenizer] Add BertTokenizerFast, support register new tokenizer by lvdongyi in https://github.com/PaddlePaddle/PaddleNLP/pull/9353
* clean print in auto_trainer by zhiqiu in https://github.com/PaddlePaddle/PaddleNLP/pull/9357
* [Unified Checkpoint] Fix fp32 dtype for using newest paddle by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9360
* [UIE] Fix tokenizer output with return_token_type_ids by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9363
* Add offload/reload for optimizer by ForFishes in https://github.com/PaddlePaddle/PaddleNLP/pull/9359
* refine dtype use by wanghuancoder in https://github.com/PaddlePaddle/PaddleNLP/pull/9366
* Add check for sharding stage1-v2 using amp master grad by ForFishes in https://github.com/PaddlePaddle/PaddleNLP/pull/9333
* [Trainer] Update assert to warning by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9332
* [Auto Parallel] fix adapt_stale_fwd_patch for to_static mode by zhangyuqin1998 in https://github.com/PaddlePaddle/PaddleNLP/pull/9372
* [LLM INFER] Optimize fuse some kernels in postprocess by gzy19990617 in https://github.com/PaddlePaddle/PaddleNLP/pull/9201
* [AutoParallel] Fix `EXCODE` bug of AutoParallel CI by waliwali777 in https://github.com/PaddlePaddle/PaddleNLP/pull/9355
* Support pp + no_recompute_layer. by tianyuzhou668 in https://github.com/PaddlePaddle/PaddleNLP/pull/9373
* [Unified Checkpoint] Support empty state_dict saving by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9380
* Add submodule by risemeup1 in https://github.com/PaddlePaddle/PaddleNLP/pull/9385
* [CI] add recursive for submodule by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9389
* [CI]fix scripts by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9394
* [LLM]add ktotrainer by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9393
* Refine log freq by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/9397
* [XPU] Llama XPU's swiglu uses phi's swiglu by dynamicheart in https://github.com/PaddlePaddle/PaddleNLP/pull/9414
* fix hip paddlenlp_ops bug by TBD1 in https://github.com/PaddlePaddle/PaddleNLP/pull/9418
* [CI]update target_lists_for_llm by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9417
* [INFER][LLM] Add the AutoModel for inference mode by zeroRains in https://github.com/PaddlePaddle/PaddleNLP/pull/9416
* [Unified Checkpoint] Support sharding_comm_overlap by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9392
* [DCU] update dcu paddlenlp_ops by TBD1 in https://github.com/PaddlePaddle/PaddleNLP/pull/9433
* Change core.LoDTensor to core.DenseTensor by co63oc in https://github.com/PaddlePaddle/PaddleNLP/pull/9434
* Change LOD_TENSOR to DENSE_TENSOR by co63oc in https://github.com/PaddlePaddle/PaddleNLP/pull/9419
* [LLM] Fix deepseekv2 import in py38 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9446
* [Distributed Dataloader] change process new_group creation by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9438
* Update dist_dataloader.py by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9451
* [llm]fix pp no drop last by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9439
* Reduce long duration for the `exit -6 re-run` process. by waliwali777 in https://github.com/PaddlePaddle/PaddleNLP/pull/9400
* Fix row parallel lora layers parameters initialization bug by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9427
* Refactor tool of creating pretrain dataset by gongel in https://github.com/PaddlePaddle/PaddleNLP/pull/9454
* 【Auto-Parallel】update conf for sharding overlap in static by liym27 in https://github.com/PaddlePaddle/PaddleNLP/pull/9456
* [AutoParallel] add release_gradients and comm_buffer_size_MB to strategy by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9432
* [LLM] Skip zero loss by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9447
* [ChatTemplate] Fix chat template when answer is contained within question. by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9444
* [LLM] Add expert parallel by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9368
* Add handling of abnormal exits to the multi-node benchmark task scripts by XieYunshen in https://github.com/PaddlePaddle/PaddleNLP/pull/9442
* [llm]add set_seed by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9429
* [AutoParallel] Reconstruct sharding mesh dimension inference logic - Part2 add sharding_mesh_dimension param by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9382
* Fix auto parallel CI exit -6 by waliwali777 in https://github.com/PaddlePaddle/PaddleNLP/pull/9460
* [ChatTemplate] Fix chat template for `Gemma` when answer is contained within question. by lvdongyi in https://github.com/PaddlePaddle/PaddleNLP/pull/9462
* Use paddle.cast instead of Tensor.astype by HydrogenSulfate in https://github.com/PaddlePaddle/PaddleNLP/pull/9461
* fixed the init problem in tensor parallel by wawltor in https://github.com/PaddlePaddle/PaddleNLP/pull/9452
* Revised PoSE by whf313 in https://github.com/PaddlePaddle/PaddleNLP/pull/8822
* fix AutoInferenceModel for qwen-vl by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9463
* add reft method by TranscenderNing in https://github.com/PaddlePaddle/PaddleNLP/pull/8819
* [AutoParallel]: llama_model_auto support alibi by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9422
* [AutoParallel]:gpt 13b model support fused_linear sp fused_attention … by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9477
* add Moslora by TranscenderNing in https://github.com/PaddlePaddle/PaddleNLP/pull/9331
* [Trainer] Fix eval for map dataset by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9472
* [Inference]Move quantization code from run_finetune.py to run_quantization.py by lixcli in https://github.com/PaddlePaddle/PaddleNLP/pull/9450
* [AutoParallel] Fix parameter passing for comm_buffer_size_MB and release_gradients by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9481
* [AutoParallel]:fix run llama_13b_auto error by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9480
* [Unified Checkpoint] Checkpoint compression by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9183
* fixbug for chatglm_v2's RetaryEmbedding dtype by mingMelody in https://github.com/PaddlePaddle/PaddleNLP/pull/9476
* [LLM INFER] Support speculative decoding (llama) by Wanglongzhi2001 in https://github.com/PaddlePaddle/PaddleNLP/pull/9180
* [Fix] Remove data args print by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9486
* [AutoParallel] open vpp test case at v100 machines by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9468
* [ChatTemplate] Fix chat template for `Yuan` when answer is contained within question. by lvdongyi in https://github.com/PaddlePaddle/PaddleNLP/pull/9485
* [AutoParallel]:fix baichuan d2s fail by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9478
* [Tokenizer] Support fast tokenizer within AutoTokenizer import by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9466
* [Inference] use fp8 cuda core gemm kernel when M<=4 by zhink in https://github.com/PaddlePaddle/PaddleNLP/pull/9423
* [XPU] set appropriate mask value for xpu by runzhech in https://github.com/PaddlePaddle/PaddleNLP/pull/9495
* [LLM INFER] not use gemm_dequant default and fix bug by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9498
* [NEW Feature] Add hook-based refined_recompute support by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/9396
* [Hackathon 7th No.43] Improve TokenizerFast feature support, part 1 by yinfan98 in https://github.com/PaddlePaddle/PaddleNLP/pull/9407
* [BUG] fix pp eval shape bug by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/9505
* Adding LoKrModel Class to paddle.peft library by WhuanY in https://github.com/PaddlePaddle/PaddleNLP/pull/9269
* Remove the CUDA_DEVICE_MAX_CONNECTIONS environment variable and optimize the benchmark scripts by XieYunshen in https://github.com/PaddlePaddle/PaddleNLP/pull/9500
* [Refactor] SFTTrainer SFTConfig by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9318
* fix csrc readme by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9515
* Add document for speculative decoding by Wanglongzhi2001 in https://github.com/PaddlePaddle/PaddleNLP/pull/9492
* [News] FlashRAG-Paddle by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9511
* support quant ckpt limit strategy by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9494
* Fix ckpt convert bug by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/9521
* support pp accuracy calculation by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9379
* Fix ckpt convert bug1 by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/9522
* [CI] Compatible with paddle.where by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9534
* [Inference] Update DygraphInferencePredictor by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9491
* support offload/reload optimizer's states for custom device by tianhaodongbd in https://github.com/PaddlePaddle/PaddleNLP/pull/9467
* [LLM INFER] fix tune_cublaslt_int8_gemm.py and remove dist_config by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9520
* [Hackathon 7th No.43] TokenizerFast for Qwen2 by yinfan98 in https://github.com/PaddlePaddle/PaddleNLP/pull/9532
* [INFER][LLM] Add the AutoPredictor for inference by zeroRains in https://github.com/PaddlePaddle/PaddleNLP/pull/9445
* Support call sft training with clone PaddleNLP by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9516

New Contributors
* Crystal-X-111 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9034
* idontkonwher made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9186
* waliwali777 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9324
* tianyuzhou668 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9373
* risemeup1 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9385
* TBD1 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9418
* zeroRains made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9416
* XieYunshen made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9442
* whf313 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8822
* mingMelody made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9476
* runzhech made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9495
* WhuanY made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9269

**Full Changelog**: https://github.com/PaddlePaddle/PaddleNLP/compare/v3.0.0-beta2...v3.0.0-beta3

3.0.0beta2

This release strengthens PaddleNLP's infrastructure, adds the Qwen2.5 and Mixtral 8*22B models, upgrades the Tokenizer, and renames the data indexing tool.

It also fixes issues such as MoE model parameter saving and loading, improves text-processing accuracy, and updates documentation and test cases. Inference performance, hardware support, and auto parallelism have also been optimized, including support for more models and parameter configurations, multi-GPU inference, enhanced support for domestic hardware, and a streamlined distributed training workflow.


Core Changes and Enhancements

1. **Infrastructure**:
- Added the Qwen2.5 (9157) and Mixtral 8*22B models, further enriching the model library.
- Upgraded the Tokenizer to support loading extra decode tokens via `added_tokens_decoder` (8997), improving flexibility.
- Renamed the data indexing tool `tool_helpers` to `fast_dataindex` (9134) to better reflect its function.
- Implemented skipping intervals of data during training (8989), improving data-processing efficiency.
- **Unified Checkpoint optimizations**:
  - Updated the optimizer's asynchronous save signal (8975), ensuring stable saving.
  - Fixed several issues in Unified Checkpoint (9082), ensuring correctness.

3. **问题修复**:
- 解决了MoE模型参数保存与加载的问题(9045 )。
- 修正Tokenizer中空格与特殊符号处理的不足(9010 , 9144 ),提升文本处理准确性。

4. **文档与测试更新**:
- 更新多个文档,涵盖LLM模型文档(如8990 , 8999 )及量化文档(9057 )等,确保信息的时效性与准确性。
- 新增测试用例,如针对PIR模式序列并行的测试(9015 ),强化测试覆盖度。
- 修复文档中的链接错误(如9127 ),提升用户体验。

5. **其他关键变更**:
- **推理性能优化**:
- LLM推理代码得到优化,支持更多模型与参数配置(如8986 , 8995 ),拓宽应用场景。
- 实现Qwen2_Moe多GPU推理(9121 )及wint4量化(9129 ),提升推理效率。
- 加强LLM推理对FP8与INT8的支持(如9032 , 9151 ),满足多样化精度需求。
- **硬件支持拓展**:
- 增强对DCU、XPU、MLU等国产硬件的支持(如8983 , 8504 , 9075 ),促进国产化替代。
- 优化上述硬件上的模型训练与推理性能,提升整体运算效率。
- **自动并行优化**:
- 修复训练过程中数据重复跳过的问题(8980 ),确保数据处理的正确性。
- 更新自动并行配置与检查点转换器(如8847 , 9136 ),提升并行训练的灵活性与稳定性。
- 新增损失NaN/Inf检查器(8943 ),及时发现并处理潜在数值问题。
- 优化分布式训练中的数据加载与梯度合并流程(如9120 , 9179 ),提升训练速度与稳定性。
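The interval-skipping feature (8989) is easiest to picture with a small standalone sketch; the interval representation below is an illustrative assumption, not PaddleNLP's actual implementation:

```python
def should_skip(step, skip_intervals):
    """True if a 1-based training step falls in any inclusive [start, end] interval.
    The (start, end) tuple format is a hypothetical representation."""
    return any(start <= step <= end for start, end in skip_intervals)

def iter_training_steps(batches, skip_intervals):
    """Yield (step, batch) pairs, silently dropping steps inside skip intervals."""
    for step, batch in enumerate(batches, start=1):
        if not should_skip(step, skip_intervals):
            yield step, batch

# Skipping steps 2-3 leaves only steps 1 and 4.
kept = [step for step, _ in iter_training_steps(["a", "b", "c", "d"], [(2, 3)])]
```

The actual trainer option additionally has to keep learning-rate schedules and consumed-sample counters consistent across the skipped range, which this sketch omits.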


What's Changed
* [Unified checkpoint] update optimizer async save signal by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8975
* Correct the run_dpo.py file path by Mangodadada in https://github.com/PaddlePaddle/PaddleNLP/pull/8952
* fix the loss base in llama_align_dygraph_dy2st_auto_bs2_bf16_DP2-MP1-… by winter-wang in https://github.com/PaddlePaddle/PaddleNLP/pull/8986
* [Bug fix] fix skip consumed_samples twice bug by zhangyuqin1998 in https://github.com/PaddlePaddle/PaddleNLP/pull/8980
* fix pip error in legacy benchmarks by fightfat in https://github.com/PaddlePaddle/PaddleNLP/pull/8978
* 【auto_parallel】Add checkpoint convertor by xingmingyyj in https://github.com/PaddlePaddle/PaddleNLP/pull/8847
* [llm]update finetune.md by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/8990
* tool_helpers now supports up to 32766 datasets after the upgrade. by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/8994
* add DCU inference docs by YanhuiDua in https://github.com/PaddlePaddle/PaddleNLP/pull/8983
* [Distributed]Add loss nan/inf checker by ForFishes in https://github.com/PaddlePaddle/PaddleNLP/pull/8943
* 【llm】update docs by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/8999
* [Feature] Fused Mixtral support by penPenf28 in https://github.com/PaddlePaddle/PaddleNLP/pull/8901
* [XPU] Add README.md for llama2-7b by xiguapipi in https://github.com/PaddlePaddle/PaddleNLP/pull/8979
* Add gcu llama readme by EnflameGCU in https://github.com/PaddlePaddle/PaddleNLP/pull/8950
* fix qwen model use_casual_mask by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/9009
* [ZeroPadding] revert zero_padding 8973 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9003
* [LLM Inference] Fix step.cu bug by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8995
* Refine checkpoint converter by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/9001
* [Feature] fused mixtral wint4 by penPenf28 in https://github.com/PaddlePaddle/PaddleNLP/pull/9013
* llm inference docs by Sunny-bot1 in https://github.com/PaddlePaddle/PaddleNLP/pull/8976
* [LLM Inference] Support Qwen2_Moe Inference Model by CJ77Qi in https://github.com/PaddlePaddle/PaddleNLP/pull/8892
* fix llama3 static run by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8849
* [paddle inference cpu]update cpu inference by bukejiyu in https://github.com/PaddlePaddle/PaddleNLP/pull/8984
* fix the tipc ce case by wawltor in https://github.com/PaddlePaddle/PaddleNLP/pull/8748
* [Cherry-pick] Add is_distributed field in sharding reshard param_meta by sneaxiy in https://github.com/PaddlePaddle/PaddleNLP/pull/9028
* [Tokenizer] Support for loading added_tokens_decoder by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8997
* [Inference] Add a8w8(fp8) a8w8c8(int8) quant_type support by lixcli in https://github.com/PaddlePaddle/PaddleNLP/pull/9032
* Fix checker of nan/inf by ForFishes in https://github.com/PaddlePaddle/PaddleNLP/pull/9029
* [Cherry-pick] add comm buffer size (8963) by ForFishes in https://github.com/PaddlePaddle/PaddleNLP/pull/9031
* [Unified Checkpoint] Update async save info by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8982
* [llm]support pad to max_length & fix sp bug by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9040
* [Bugfix] fix bias optional by penPenf28 in https://github.com/PaddlePaddle/PaddleNLP/pull/9037
* fix setup.py for llm inference by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9041
* [Inference] Add cutlass gemm dequant op by gzy19990617 in https://github.com/PaddlePaddle/PaddleNLP/pull/8909
* [Inference] update fakequant support by lixcli in https://github.com/PaddlePaddle/PaddleNLP/pull/9047
* add test for pir sequence parallel on llama model by liym27 in https://github.com/PaddlePaddle/PaddleNLP/pull/9015
* Fix moe save load by Meiyim in https://github.com/PaddlePaddle/PaddleNLP/pull/9045
* Update quantization.md by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9057
* 【Fix】Initialize dp degree in single GPU by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/9056
* fix bos download by westfish in https://github.com/PaddlePaddle/PaddleNLP/pull/9023
* [Inference] Update fakequant script by lixcli in https://github.com/PaddlePaddle/PaddleNLP/pull/9054
* [AutoParallel][PIR] Fit pir grad merge by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/8985
* [MLU] Support rms_norm_mlu by PeiyuLau in https://github.com/PaddlePaddle/PaddleNLP/pull/8504
* [Inference] support llama3 a8w8c8_fp8 inference and cutlass_fp8_gemm by ckl117 in https://github.com/PaddlePaddle/PaddleNLP/pull/8953
* [Inference] Qwen2 support fp8 inference by ckl117 in https://github.com/PaddlePaddle/PaddleNLP/pull/8954
* [Version] update version info by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9060
* [NPU] Fix baichuan2-13b-chat infer by ronny1996 in https://github.com/PaddlePaddle/PaddleNLP/pull/9070
* [MLU] Fix Llama attention_mask in npu and mlu by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9075
* Fix the memory overflow bug of the tune_cublaslt_gemm operator by Hanyonggong in https://github.com/PaddlePaddle/PaddleNLP/pull/9076
* [Inference] Fix weight_only_int4 bug by lixcli in https://github.com/PaddlePaddle/PaddleNLP/pull/9073
* [Auto Parallel] fix data stream bug of dist.to_static by zhangyuqin1998 in https://github.com/PaddlePaddle/PaddleNLP/pull/9077
* fix hang when Flag_dataloader_use_file_descriptor=True by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/9080
* fix llm predict install error by fightfat in https://github.com/PaddlePaddle/PaddleNLP/pull/9088
* [PIR] add pir grad merge test by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9074
* Update readme by EnflameGCU in https://github.com/PaddlePaddle/PaddleNLP/pull/9046
* [LLM] Add tensor parallel for chatglmv2 by SevenSamon in https://github.com/PaddlePaddle/PaddleNLP/pull/9014
* [data] update tool_helpers version and add unittest by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/9093
* fix baseline because of PR8769 by fightfat in https://github.com/PaddlePaddle/PaddleNLP/pull/9092
* fix use paddle.incubate.jit.inference(model) errors by chang-wenbin in https://github.com/PaddlePaddle/PaddleNLP/pull/9016
* [CI] Fix paddlepaddle install by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9102
* [LLM] fix train on npu by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9101
* Disable ut by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/9108
* [AutoParallel] Enable CI for gradclip by JZ-LIANG in https://github.com/PaddlePaddle/PaddleNLP/pull/9059
* [Inference] Remove ceval from run_finetune by lixcli in https://github.com/PaddlePaddle/PaddleNLP/pull/9100
* [Bugfix] fix multi-gpu infer by penPenf28 in https://github.com/PaddlePaddle/PaddleNLP/pull/9107
* 【Inference】fix step kernel by gzy19990617 in https://github.com/PaddlePaddle/PaddleNLP/pull/9122
* [DCU] fix DCU w8a8c8 GEMM shape by YanhuiDua in https://github.com/PaddlePaddle/PaddleNLP/pull/9115
* [Inference] FP8 gemm auto-tune by ckl117 in https://github.com/PaddlePaddle/PaddleNLP/pull/9094
* Open ut llama_align_dygraph_dy2st_pir_auto_grad_merge_bs2_fp32_DP1-MP1-PP1 by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/9120
* [LLM Inference] Support Qwen2_Moe Inference with MultiGPU by CJ77Qi in https://github.com/PaddlePaddle/PaddleNLP/pull/9121
* [Unified Checkpoint] Fix uc lora config, fix release_grads by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9082
* [Inference]qwen2-a8w8c8 support use_fake_parameter by ckl117 in https://github.com/PaddlePaddle/PaddleNLP/pull/9109
* Add fast_ln spmd rules by From00 in https://github.com/PaddlePaddle/PaddleNLP/pull/9125
* fix pir dtype by wanghuancoder in https://github.com/PaddlePaddle/PaddleNLP/pull/9130
* Remove ring_flash_attention warning by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9119
* [DOC] Fix LLM page 404 Not Found by DrRyanHuang in https://github.com/PaddlePaddle/PaddleNLP/pull/9127
* Add hardware flops for pretraining by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9069
* [Benchmark] Fix amp level bug in some gpt tests by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/9116
* [Auto Parallel] Fix ckpt_converter for auto_parallel by zhangyuqin1998 in https://github.com/PaddlePaddle/PaddleNLP/pull/9136
* [Inference] Update fakequant by lixcli in https://github.com/PaddlePaddle/PaddleNLP/pull/9140
* [DOC] Update docs by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9141
* [LLM Inference] Qwen2_Moe Support wint4 by CJ77Qi in https://github.com/PaddlePaddle/PaddleNLP/pull/9129
* add multy devices supported models by a31413510 in https://github.com/PaddlePaddle/PaddleNLP/pull/9079
* [fix] Avoid redundant storage of frozen parameters; compatible with shard-reshard (9067) by bo-ke in https://github.com/PaddlePaddle/PaddleNLP/pull/9148
* [Docs] Update LLM docs by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9143
* fix llm ce predict run error by fightfat in https://github.com/PaddlePaddle/PaddleNLP/pull/9149
* [Tokenizer] Add replace_additional_special_tokens parameter to add_special_tokens by lvdongyi in https://github.com/PaddlePaddle/PaddleNLP/pull/9144
* [Tokenizer] Fix decode output with space in decode_token by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9010
* 【Inference】Optimize top_p kernel performance by gzy19990617 in https://github.com/PaddlePaddle/PaddleNLP/pull/9132
* [Models] Add Qwen2.5 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9157
* Update README.md by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9160
* [Inference] FP8 dual gemm auto-tune and support compile parallelization by ckl117 in https://github.com/PaddlePaddle/PaddleNLP/pull/9151
* [AutoParallel] enable ci for dp amp clip by JZ-LIANG in https://github.com/PaddlePaddle/PaddleNLP/pull/9062
* [llm]support dpo pp by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9039
* [Tools] Rename tool_helpers to fast_dataindex. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9134
* [Trainer] Support skip data intervals by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/8989
* remove run_pretrain_auto_static.py CI when open PIR by fightfat in https://github.com/PaddlePaddle/PaddleNLP/pull/9177
* [Tokenizer] Enable padding_side as call time kwargs by lvdongyi in https://github.com/PaddlePaddle/PaddleNLP/pull/9161
* Revert "[Tokenizer] Enable padding_side as call time kwargs" by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9192
* [XPU] add xpu support for llama sft by tizhou86 in https://github.com/PaddlePaddle/PaddleNLP/pull/9152
* [AutoParallel] Add FLAGS_enable_fused_ffn_qkv_pass for llama by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/9182
* [AutoParallel] Fix ckpt convert bug for sharding v2 by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/9179
* [Test] Disable dynamic to static test case for paddle PIR by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9196
* Fix ppt eval hang by gongel in https://github.com/PaddlePaddle/PaddleNLP/pull/9218
* Update branch version to 3.0.0b2 by gongel in https://github.com/PaddlePaddle/PaddleNLP/pull/9220
* Update branch version to 3.0.0b2 by gongel in https://github.com/PaddlePaddle/PaddleNLP/pull/9221
* Revert "Fix ppt eval hang" by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9229

New Contributors
* Mangodadada made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8952
* xingmingyyj made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8847
* penPenf28 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8901
* xiguapipi made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8979
* Sunny-bot1 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8976
* CJ77Qi made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8892
* lixcli made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9032
* gzy19990617 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8909
* SevenSamon made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9014
* chang-wenbin made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9016
* DrRyanHuang made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9127
* a31413510 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9079
* lvdongyi made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9144
* tizhou86 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9152

**Full Changelog**: https://github.com/PaddlePaddle/PaddleNLP/compare/v3.0.0-beta1...v3.0.0-beta2

3.0.0beta1

PaddleNLP v3.0.0-beta1 brings several important updates and enhancements over v3.0.0-beta0. It introduces the Yuan, mamba, and jamba models and optimizes the LLM inference code for better compatibility and efficiency.

On the performance side, this release adds a fast tokenizer, implements MoE optimizer-parameter broadcasting, and accelerates layer normalization. It also fixes several bugs, including a safetensors shape-slicing issue and an mmap issue on Windows, improving stability and compatibility.

Documentation and tests were comprehensively updated for accuracy and readability. Support for domestic hardware was also strengthened, including DCU and XPU optimizations, along with PIR-mode support and auto-parallel configuration updates.


Major Changes and New Features

1. New models and features
- **New models**: Added the Yuan model in 8654, and the mamba and jamba models in 8513 and 8517 respectively, with follow-up pull requests fixing related bugs to keep the models running stably.
- **LLM inference optimization**: Optimized the LLM inference code across multiple pull requests and added support for new models and parameters, further improving inference efficiency and compatibility.

2. Performance optimizations
- **Fast tokenizer**: Added a fast tokenizer based on the `tokenizers` library in 8832, significantly improving tokenization speed.
- **MoE optimization**: Implemented broadcasting of MoE (Mixture of Experts) optimizer parameters in 8810, improving training efficiency.
- **Layer-normalization acceleration**: Added fast_rmsnorm, enabled use_fast_layer_norm, and updated benchmark configurations across multiple pull requests, further accelerating training. In particular, 8717 enables use_fast_layer_norm during fine-tuning, giving users more flexibility.
- **Training performance**: Added the `enable_sp_async_reduce_scatter` option in 8803, effectively optimizing training performance.
- **Dict parameters**: Added support for dict parameters in the trainer's argparser in 8446, making parameter passing more flexible. Also updated the tensorboard requirement in 8904 for compatibility with the latest version.

3. Bug fixes
- **safetensors**: Fixed a safetensors shape issue in 8702.
- **Windows mmap**: Fixed an mmap issue in 8734, improving Windows compatibility.
- **Other fixes**: Additional bug fixes, including 8687 and 8730.

4. Documentation and test updates
- **Documentation**: Updated documents, cleaned up code style, and refreshed version information across multiple pull requests, keeping the documentation accurate and readable.
- **README fixes and enhancements**: Fixed broken links in the README in 8741; several contributors also updated the README and added new test cases, keeping the docs in sync with the code.

5. Other important changes

Domestic hardware support
- **DCU**: Implemented high-performance LLM training and inference for DCU in 8580, extending PaddleNLP's hardware coverage.
- **XPU**: Added LoRA optimizations for XPU in 8527; implemented XPU allgather in 8697 and fixed the unified-checkpoint gather in 8710, further improving training efficiency on XPU.

PIR mode support
- **Export and loading**: Changed how the llama model is exported in PIR mode in 8689; 8712 and 8766 added support for loading and saving Llama2-7b in three modes (old IR, PIR model files, PIR JSON files), giving users more flexibility and compatibility.

Auto-parallel optimization
- **Configuration updates**: Changed `max_steps` in the Llama2-7b config to suit auto-parallel in 8679; optimized auto-trainer save and load in 8767 and 8828; updated the loss function for global clipping in 8750, further improving auto-parallel efficiency and accuracy.
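The dict-parameter feature for the trainer's argparser (8446) can be approximated with plain `argparse` plus JSON parsing; this is an illustrative stand-in (the `--sharding_config` flag is hypothetical), not the PaddleNLP implementation:

```python
import argparse
import json

def dict_arg(value):
    """Parse a command-line string as a JSON object, rejecting non-dict values."""
    parsed = json.loads(value)
    if not isinstance(parsed, dict):
        raise argparse.ArgumentTypeError("expected a JSON object")
    return parsed

parser = argparse.ArgumentParser()
# Hypothetical flag used only to demonstrate dict-valued arguments.
parser.add_argument("--sharding_config", type=dict_arg, default={})

args = parser.parse_args(["--sharding_config", '{"degree": 2, "stage": 1}'])
# args.sharding_config is now a real dict rather than a raw string.
```

Accepting structured values this way lets one flag carry a whole nested configuration instead of spreading it across many scalar flags.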


What's Changed
* [DCU] high performance LLM train and inference for DCU by yuguo-Jack in https://github.com/PaddlePaddle/PaddleNLP/pull/8580
* fix benchmark dir and add CUDA_DEVICE_MAX_CONNECTIONS to qwen by fightfat in https://github.com/PaddlePaddle/PaddleNLP/pull/8678
* bug fix by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/8687
* [XPU] add lora optimization by dynamicheart in https://github.com/PaddlePaddle/PaddleNLP/pull/8527
* [pir save] Modiy export llama model file in pir mode by xiaoguoguo626807 in https://github.com/PaddlePaddle/PaddleNLP/pull/8689
* [AutoParallel]Change `max_steps` in Llama2-7b config for auto-parallel. by heavyrain-lzy in https://github.com/PaddlePaddle/PaddleNLP/pull/8679
* [benchmark] Change the mirror source for pip by mmglove in https://github.com/PaddlePaddle/PaddleNLP/pull/8699
* update loss base of auto-parallel tests by zhiqiu in https://github.com/PaddlePaddle/PaddleNLP/pull/8701
* Add new mistral by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/7425
* [Safetensors] Fix safetensors shape by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8702
* [BUG] Round num_samples down to prevent prefetch from exceeding the maximum dataset length... by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/8690
* xpu use allgather by FeixLiu in https://github.com/PaddlePaddle/PaddleNLP/pull/8697
* add fast_rmsnorm by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/8680
* enable use_fast_layer_norm for llama2 benchmark by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/8714
* fix xpu gather for unified ckpt by FeixLiu in https://github.com/PaddlePaddle/PaddleNLP/pull/8710
* [inference] support load or save Llama2-7b in three patterns by lizexu123 in https://github.com/PaddlePaddle/PaddleNLP/pull/8712
* fix fast_ln backward by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/8719
* finetune support use_fast_layer_norm by tianhaodongbd in https://github.com/PaddlePaddle/PaddleNLP/pull/8717
* bug fix by FeixLiu in https://github.com/PaddlePaddle/PaddleNLP/pull/8730
* disable lora by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/8674
* [Safetensors] Fix mmap for Windows system by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8734
* correct broken links in readme by jzhang533 in https://github.com/PaddlePaddle/PaddleNLP/pull/8741
* revert benchmark fix by ronny1996 in https://github.com/PaddlePaddle/PaddleNLP/pull/8747
* [LLM] Add Yuan model by zhaogf01 in https://github.com/PaddlePaddle/PaddleNLP/pull/8654
* fix nlp dir and auto_parallel_ci exit -6 by fightfat in https://github.com/PaddlePaddle/PaddleNLP/pull/8744
* [LLM] Update sequence parallel linear import by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8706
* [Bug fixes] Fix ring attention by zhangyuqin1998 in https://github.com/PaddlePaddle/PaddleNLP/pull/8740
* update a100 loss by zhiqiu in https://github.com/PaddlePaddle/PaddleNLP/pull/8708
* [PaddleNLP 3.0] Update README by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8681
* [AutoParallel] update loss for global clip by JZ-LIANG in https://github.com/PaddlePaddle/PaddleNLP/pull/8750
* [NPU] Fix sequence parallel lib import by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8760
* [DEV] Update develop version show by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8754
* [inference] support load or save Llama2-7b in three patterns by lizexu123 in https://github.com/PaddlePaddle/PaddleNLP/pull/8766
* add benchmark baichuan2 scripts by fightfat in https://github.com/PaddlePaddle/PaddleNLP/pull/8683
* Add the missing truncation=True in llm/predictor.py by lszxb in https://github.com/PaddlePaddle/PaddleNLP/pull/8768
* fix the ce for the unittest by wawltor in https://github.com/PaddlePaddle/PaddleNLP/pull/8772
* Enable parallel_config to use commas as delimiters. by Difers in https://github.com/PaddlePaddle/PaddleNLP/pull/8677
* fix incorrect token counting in `llm/predictor.py` by lszxb in https://github.com/PaddlePaddle/PaddleNLP/pull/8769
* Refine savable by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8758
* [CodeStyle] remove markdownlint-cli by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8779
* [XPU] use allgather and fp32 multinomial for XPU by houj04 in https://github.com/PaddlePaddle/PaddleNLP/pull/8787
* fix version show by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8791
* [BUG] Add 20 redundant data in post pretrain by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/8789
* vera-pissa method added by TranscenderNing in https://github.com/PaddlePaddle/PaddleNLP/pull/8722
* update version by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8792
* [Inference LLM] refine some code in llama wint8/4 by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8796
* [DCU] Llama a8w8 inference performance optimization by Deleter-D in https://github.com/PaddlePaddle/PaddleNLP/pull/8800
* [Prediction] Update LLM prediction. by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8778
* [Trainer] Add enable_sp_async_reduce_scatter by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8803
* [AutoParallel] Refine auto_trainer save load by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/8767
* [MoE] Optimizer parameter broadcast by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8810
* [Doc] Update README by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8817
* support Llama3.1 8B 128K generation on single GPU 80GB by GuoxiaWang in https://github.com/PaddlePaddle/PaddleNLP/pull/8811
* add paddle nv-embed-v1 by Li-Z-Q in https://github.com/PaddlePaddle/PaddleNLP/pull/8785
* fix pad_token_id bug by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8814
* [DCU] fix llama inference bug on DCU by Deleter-D in https://github.com/PaddlePaddle/PaddleNLP/pull/8815
* [Doc] Add LLaMA3.1 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8824
* [BUG] Fix build train valid test datasets by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/8826
* Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file by Hanyonggong in https://github.com/PaddlePaddle/PaddleNLP/pull/8799
* fix tune_cublaslt_gemm compile bug by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8844
* [AutoParallel] Refine save and load ckpt for auto_trainer by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/8828
* [Unified Checkpoint] update merge tensor parallel by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8856
* [Trainer] update clear_grad by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8829
* [Unified Checkpoint] Fix tie_word_embeddings by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8795
* [Inference LLM] support static c8 by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8833
* support sft mapdataset by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/8840
* Cherry pick some changes from incubate branch by sneaxiy in https://github.com/PaddlePaddle/PaddleNLP/pull/8862
* support nested list of dict inputs by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/8876
* Fix the bug with issues code 8641. by smallbenxiong in https://github.com/PaddlePaddle/PaddleNLP/pull/8880
* Fix the issue of P-tuning official sample error by guangyunms in https://github.com/PaddlePaddle/PaddleNLP/pull/8884
* modify Paddlemix qwen dytostatic by xiaoguoguo626807 in https://github.com/PaddlePaddle/PaddleNLP/pull/8869
* [llm]fix zeropadding by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/8895
* Fix fast_ln operator error when dynamic semi-auto parallel is enabled by Wennie396 in https://github.com/PaddlePaddle/PaddleNLP/pull/8891
* enable_sp_async_reduce_scatter for qwen_72b && llama2_70b by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/8897
* Update run_pretrain.py by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8902
* [doc] Update readme by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8905
* [AutoParallel] Bugfix auto parallel FA by JZ-LIANG in https://github.com/PaddlePaddle/PaddleNLP/pull/8903
* [Readme] Update README.md by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8908
* [cherry-pick] Optimize async save by ForFishes in https://github.com/PaddlePaddle/PaddleNLP/pull/8878
* [LLM Inference] Refactor BlockInferencePredictor by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8879
* [Fix] modify tensorboard requirements by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/8904
* [LLM Inference] Support qwen2 by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8893
* modify dict include none to aviod pir dytostatic bug in while op by xiaoguoguo626807 in https://github.com/PaddlePaddle/PaddleNLP/pull/8898
* [LLM]Update yuan model by zhaogf01 in https://github.com/PaddlePaddle/PaddleNLP/pull/8786
* update qwen && baichuan benchmark config by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/8920
* [doc] Update README by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8922
* [ New features]Trainer support dict parameter by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/8446
* set logging_step to 5 with baichuan && qwen benchmark by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/8928
* [Cherry-pick]fix pipeline eval by gongel in https://github.com/PaddlePaddle/PaddleNLP/pull/8924
* fix test_wint8 ut by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8930
* [LLM Inference] support llama3.1 by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8929
* Fix tokens count for benchmark by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8938
* [bug fix] fix create_optimizer_and_scheduler for auto_parallel by zhangyuqin1998 in https://github.com/PaddlePaddle/PaddleNLP/pull/8937
* [LLM Inference] fix _get_tensor_parallel_mappings in llama by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8939
* [Unified Checkpoint] Fix load best checkpoint by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8935
* fix bug by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8947
* [LLM Inference] move llm.utils.utils.py to paddlenlp.utils.llm_utils.py by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8946
* support amp in pir dy2st mode. by winter-wang in https://github.com/PaddlePaddle/PaddleNLP/pull/8485
* [Trainer] Fix distributed dataloader by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8932
* [Tokenizer] Add Fast Tokenizer by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8832
* [ZeroPadding] add greedy_zero_padding by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8933
* [NEW Model] Add mamba by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/8513
* [BUG] fix mamba tokenizer by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/8958
* [NEW Model] add jamba by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/8517
* [LLM Inference] add --use_fake_parameter option for ptq fake scales and fix compute error of total_max_length by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8955
* [LLM Inference] support qwen2 a8w8c8 inference by ckl117 in https://github.com/PaddlePaddle/PaddleNLP/pull/8925
* fix JambaModelIntegrationTest by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/8965
* [Fix] Enable tensor parallel tests. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8757
* [CI] Fix by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8793
* [Unified Checkpoint] update async save by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8801
* [AutoParallel] Support save model for auto trainer by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/8927
* fix qwen benchmark by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/8969
* [ZeroPadding] padding to max_length for sequence parallel by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8973
* add amp unit test case for auto_parallel ci. by winter-wang in https://github.com/PaddlePaddle/PaddleNLP/pull/8966
* [New Version] Upgrade to 3.0 b1 by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8977

New Contributors
* yuguo-Jack made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8580
* ruisunyc made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8698
* xiaoguoguo626807 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8689
* lizexu123 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8712
* jzhang533 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8741
* zhaogf01 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8654
* lszxb made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8768
* TranscenderNing made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8722
* Deleter-D made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8800
* Li-Z-Q made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8785
* Hanyonggong made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8799
* smallbenxiong made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8880
* guangyunms made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8884
* winter-wang made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8485
* ckl117 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/8925

**Full Changelog**: https://github.com/PaddlePaddle/PaddleNLP/compare/v3.0.0-beta0...v3.0.0-beta1

3.0.0beta0

We are pleased to announce v3.0.0-beta0 of the PaddlePaddle large-model toolkit: embrace large models with a fully upgraded experience. Highlights:
* Unified the LLM toolchain, with end-to-end integration of domestic compute chips;
* Full support for PaddlePaddle's industrial-grade LLM workflow: 4D parallel configuration, efficient fine-tuning strategies, efficient alignment algorithms, and high-performance inference;
* Self-developed RsLoRA+ algorithm with excellent convergence, the auto-scaling Unified Checkpoint storage mechanism, and generalized FastFFN and FusedQKV support to accelerate LLM training and inference;
* Continued support and updates for mainstream models, with efficient solutions.

LLM Fine-tuning, Alignment, Training, and Inference Optimizations

* PEFT:
* Added scaling strategies supporting the rslora and pissa algorithms in https://github.com/PaddlePaddle/PaddleNLP/pull/8256
* Adapted FusedQKV and FastFFN parameters in https://github.com/PaddlePaddle/PaddleNLP/pull/8372 https://github.com/PaddlePaddle/PaddleNLP/pull/8526
* DPO:
* Support DPO (llama, qwen) in https://github.com/PaddlePaddle/PaddleNLP/pull/8474
* Support sequence parallelism in https://github.com/PaddlePaddle/PaddleNLP/pull/7953
* Domestic chip support:
* Adapted NPU in https://github.com/PaddlePaddle/PaddleNLP/pull/8303 https://github.com/PaddlePaddle/PaddleNLP/pull/8342 https://github.com/PaddlePaddle/PaddleNLP/pull/8359 https://github.com/PaddlePaddle/PaddleNLP/pull/8399 https://github.com/PaddlePaddle/PaddleNLP/pull/8409 https://github.com/PaddlePaddle/PaddleNLP/pull/8401 https://github.com/PaddlePaddle/PaddleNLP/pull/8431 https://github.com/PaddlePaddle/PaddleNLP/pull/8439 https://github.com/PaddlePaddle/PaddleNLP/pull/8438 https://github.com/PaddlePaddle/PaddleNLP/pull/8442 https://github.com/PaddlePaddle/PaddleNLP/pull/8528 https://github.com/PaddlePaddle/PaddleNLP/pull/8642
* Adapted XPU in https://github.com/PaddlePaddle/PaddleNLP/pull/8282 https://github.com/PaddlePaddle/PaddleNLP/pull/8505 https://github.com/PaddlePaddle/PaddleNLP/pull/8515 https://github.com/PaddlePaddle/PaddleNLP/pull/8588 https://github.com/PaddlePaddle/PaddleNLP/pull/8595 https://github.com/PaddlePaddle/PaddleNLP/pull/8598
* Adapted GCU in https://github.com/PaddlePaddle/PaddleNLP/pull/8445 https://github.com/PaddlePaddle/PaddleNLP/pull/8470

* Performance optimization:
* Optimized the Unified Checkpoint mechanism in https://github.com/PaddlePaddle/PaddleNLP/pull/8204 https://github.com/PaddlePaddle/PaddleNLP/pull/8409 https://github.com/PaddlePaddle/PaddleNLP/pull/8422 https://github.com/PaddlePaddle/PaddleNLP/pull/8512
* Model-parallel optimization in https://github.com/PaddlePaddle/PaddleNLP/pull/8370
* Sequence-parallel optimization in https://github.com/PaddlePaddle/PaddleNLP/pull/8551
* Support llama3 (wint8|4/a8w8) in https://github.com/PaddlePaddle/PaddleNLP/pull/8630

* Others:
* Added model memory monitoring in https://github.com/PaddlePaddle/PaddleNLP/pull/8269
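The rslora scaling strategy (8256) changes only how the LoRA scaling factor is computed from the rank `r` and `lora_alpha`; below is a minimal sketch of the two formulas, following the rsLoRA paper rather than PaddleNLP's code:

```python
import math

def lora_scaling(lora_alpha, r, use_rslora=False):
    """Standard LoRA scales updates by alpha / r; rsLoRA uses alpha / sqrt(r),
    which keeps the update magnitude stable as the rank grows."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r

# At rank 16 with alpha 32: LoRA gives 2.0, rsLoRA gives 8.0.
standard = lora_scaling(32, 16)
rs = lora_scaling(32, 16, use_rslora=True)
```

The practical consequence is that with rsLoRA, raising the rank no longer shrinks the effective learning rate of the adapter, so higher ranks can actually be exploited.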

New Models

* Added the Gemma model in https://github.com/PaddlePaddle/PaddleNLP/pull/8082
* google/gemma-7b
* google/gemma-7b-it
* google/gemma-2b
* google/gemma-2b-it

* Add the llama3 models in https://github.com/PaddlePaddle/PaddleNLP/pull/8307 https://github.com/PaddlePaddle/PaddleNLP/pull/8371
* meta-llama/Meta-Llama-3-8B
* meta-llama/Meta-Llama-3-8B-Instruct
* meta-llama/Meta-Llama-3-70B
* meta-llama/Meta-Llama-3-70B-Instruct

* Add the Qwen2 models in https://github.com/PaddlePaddle/PaddleNLP/pull/8338 https://github.com/PaddlePaddle/PaddleNLP/pull/8584 https://github.com/PaddlePaddle/PaddleNLP/pull/8601
* Qwen/Qwen1.5-0.5B
* Qwen/Qwen1.5-0.5B-Chat
* Qwen/Qwen1.5-1.8B
* Qwen/Qwen1.5-1.8B-Chat
* Qwen/Qwen1.5-4B
* Qwen/Qwen1.5-4B-Chat
* Qwen/Qwen1.5-7B
* Qwen/Qwen1.5-7B-Chat
* Qwen/Qwen1.5-14B
* Qwen/Qwen1.5-14B-Chat
* Qwen/Qwen1.5-32B
* Qwen/Qwen1.5-32B-Chat
* Qwen/Qwen1.5-72B
* Qwen/Qwen1.5-72B-Chat
* Qwen/Qwen1.5-110B
* Qwen/Qwen1.5-110B-Chat
* Qwen/Qwen1.5-MoE-A2.7B
* Qwen/Qwen1.5-MoE-A2.7B-Chat
* Qwen/Qwen2-0.5B
* Qwen/Qwen2-0.5B-Instruct
* Qwen/Qwen2-1.5B
* Qwen/Qwen2-1.5B-Instruct
* Qwen/Qwen2-7B
* Qwen/Qwen2-7B-Instruct
* Qwen/Qwen2-72B
* Qwen/Qwen2-72B-Instruct
* Qwen/Qwen2-57B-A14B
* Qwen/Qwen2-57B-A14B-Instruct

Base Framework Upgrades

* Feature improvements:
* Support automatic fusion and splitting of FusedQKV and FastFFN weights in https://github.com/PaddlePaddle/PaddleNLP/pull/8202 https://github.com/PaddlePaddle/PaddleNLP/pull/8378 https://github.com/PaddlePaddle/PaddleNLP/pull/8432
* Support parameter synchronization settings for model parallelism in https://github.com/PaddlePaddle/PaddleNLP/pull/8311
* Support setting theta for the RoPE operator in https://github.com/PaddlePaddle/PaddleNLP/pull/8440
* Communication overlap optimization in https://github.com/PaddlePaddle/PaddleNLP/pull/8276 https://github.com/PaddlePaddle/PaddleNLP/pull/8473 https://github.com/PaddlePaddle/PaddleNLP/pull/8499 https://github.com/PaddlePaddle/PaddleNLP/pull/8594
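Making theta configurable in the RoPE operator changes how quickly each channel pair rotates with position. The pure-Python sketch below shows only the frequency schedule (illustrative, with names of our choosing; the real operator applies these angles inside fused attention kernels):

```python
def rope_frequencies(head_dim, theta=10000.0):
    """Per-pair inverse frequencies for rotary position embedding.

    theta is the base of the geometric progression; llama-style models
    use 10000.0, and long-context variants raise it to slow rotation.
    """
    return [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rope_angles(position, head_dim, theta=10000.0):
    # Rotation angle applied to each (even, odd) channel pair at `position`.
    return [position * f for f in rope_frequencies(head_dim, theta)]

# A larger theta makes high-index channel pairs rotate more slowly,
# which stretches the usable context length.
print(rope_angles(100, 8, theta=10000.0))
print(rope_angles(100, 8, theta=1000000.0))
```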

* AutoParallel improvements
* Support the recompute mechanism for llama in https://github.com/PaddlePaddle/PaddleNLP/pull/8265
* Adapt llama3 in https://github.com/PaddlePaddle/PaddleNLP/pull/8395
* position_ids optimization in https://github.com/PaddlePaddle/PaddleNLP/pull/8363
* Support split_backward for pipeline parallelism in https://github.com/PaddlePaddle/PaddleNLP/pull/8479
* Adapt qwen in https://github.com/PaddlePaddle/PaddleNLP/pull/8312


* Distributed-training improvements:
* Fix incorrect parameters with enable_sharding_comm_overlap in pipeline parallelism in https://github.com/PaddlePaddle/PaddleNLP/pull/8333
* MoE parallelism support in https://github.com/PaddlePaddle/PaddleNLP/pull/8498 https://github.com/PaddlePaddle/PaddleNLP/pull/8522

* Chat improvements:
* Add chat template in https://github.com/PaddlePaddle/PaddleNLP/pull/8226
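A chat template turns a list of role-tagged messages into the single prompt string a model expects. The sketch below is a generic illustration with made-up control tokens; the templates shipped with each PaddleNLP tokenizer use their model's own special tokens:

```python
def apply_chat_template(messages, add_generation_prompt=True):
    """Render a conversation into one prompt string (illustrative format)."""
    parts = []
    for m in messages:
        # Each turn is wrapped in a role marker; real templates vary per model.
        parts.append(f"<|{m['role']}|>\n{m['content']}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|assistant|>\n")
    return "".join(parts)

prompt = apply_chat_template([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is PaddleNLP?"},
])
print(prompt)
```

The `add_generation_prompt` switch matters at inference time: training data ends with the assistant's answer, while generation prompts must stop right before it.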

* Others
* Documentation in https://github.com/PaddlePaddle/PaddleNLP/pull/8336 https://github.com/PaddlePaddle/PaddleNLP/pull/8393
* Update nested operations in https://github.com/PaddlePaddle/PaddleNLP/pull/8380
* Randomness (RNG state) updates in https://github.com/PaddlePaddle/PaddleNLP/pull/8450 https://github.com/PaddlePaddle/PaddleNLP/pull/8396
* Operator updates in https://github.com/PaddlePaddle/PaddleNLP/pull/8472
* Example updates in https://github.com/PaddlePaddle/PaddleNLP/pull/8538

Bug Fixes

* Fix a bug when the sharding degree is less than 100 in https://github.com/PaddlePaddle/PaddleNLP/pull/8146
* Fix TP/PP parameter merging in https://github.com/PaddlePaddle/PaddleNLP/pull/8239
* Fix inconsistency between tensor.shape and paddle.shape(tensor) in https://github.com/PaddlePaddle/PaddleNLP/pull/8260
* Fix a bug with fp16 + delay_scale_loss_scale + sharding_stage1_overlap in https://github.com/PaddlePaddle/PaddleNLP/pull/8314
* Add documentation and tips for running pipelines in https://github.com/PaddlePaddle/PaddleNLP/pull/8292 https://github.com/PaddlePaddle/PaddleNLP/pull/8308 https://github.com/PaddlePaddle/PaddleNLP/pull/8202 https://github.com/PaddlePaddle/PaddleNLP/pull/8353
* Fix the tokenizer input in the text feature extraction task in https://github.com/PaddlePaddle/PaddleNLP/pull/8331
* Fix import errors in https://github.com/PaddlePaddle/PaddleNLP/pull/8332 https://github.com/PaddlePaddle/PaddleNLP/pull/8367

Structure Adjustments

PaddleNLP file-structure adjustments in https://github.com/PaddlePaddle/PaddleNLP/pull/8609 https://github.com/PaddlePaddle/PaddleNLP/pull/8613 https://github.com/PaddlePaddle/PaddleNLP/pull/8605 https://github.com/PaddlePaddle/PaddleNLP/pull/8614 https://github.com/PaddlePaddle/PaddleNLP/pull/8617 https://github.com/PaddlePaddle/PaddleNLP/pull/8626 https://github.com/PaddlePaddle/PaddleNLP/pull/8618 https://github.com/PaddlePaddle/PaddleNLP/pull/8625 https://github.com/PaddlePaddle/PaddleNLP/pull/8619 https://github.com/PaddlePaddle/PaddleNLP/pull/8629 https://github.com/PaddlePaddle/PaddleNLP/pull/8601 https://github.com/PaddlePaddle/PaddleNLP/pull/8627 https://github.com/PaddlePaddle/PaddleNLP/pull/8666

What's Changed
* [dist]pip requirements-dev.txt by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/8258
* add scaling by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/8256
* [LLM]Support Gemma model by Southpika in https://github.com/PaddlePaddle/PaddleNLP/pull/8082
* [BugFix] Try except sequence parallel utils by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8189
* Update CodeCov GitHub Action by sijunhe in https://github.com/PaddlePaddle/PaddleNLP/pull/8268
* [AutoParallel] Open recompute strategy for llama model by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/8265
* Fix sharding < 100 limitation bug by sneaxiy in https://github.com/PaddlePaddle/PaddleNLP/pull/8146
* use tensor.shape bug not paddle.shape(tensor) by wanghuancoder in https://github.com/PaddlePaddle/PaddleNLP/pull/8260
* [dist CI]update paddlenlp install for CI by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/8267
* [Bug Fix]Fix merge parameters in pp by Southpika in https://github.com/PaddlePaddle/PaddleNLP/pull/8239
* [LLM] add memory stats to logger of trainer by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/8269
* Add p2p_comm_overlap for Llama-2-70b benchmark. by Xreki in https://github.com/PaddlePaddle/PaddleNLP/pull/8276
* add a100 test ground truth by zhiqiu in https://github.com/PaddlePaddle/PaddleNLP/pull/8249
* [paddle-pipelines] faq semantic search question answering reamde by w5688414 in https://github.com/PaddlePaddle/PaddleNLP/pull/8292
* [paddle-pipelines] Add pipelines documentation by w5688414 in https://github.com/PaddlePaddle/PaddleNLP/pull/8308
* Support llama-3 by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8307
* [Distributed] [CustomDevices] Adapt SP on lora && polish MC2 APIs by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/8303
* fix bug for fp16 + delay_scale_loss_scale + sharding_stage1_overlap by FeixLiu in https://github.com/PaddlePaddle/PaddleNLP/pull/8314
* [paddle-pipelines] Update mkdocs by w5688414 in https://github.com/PaddlePaddle/PaddleNLP/pull/8310
* [benchmark]update llama2_ips by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/8322
* [dist CI]fix before_hook by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/8283
* benchmark llama worker=1 by wanghuancoder in https://github.com/PaddlePaddle/PaddleNLP/pull/8305
* 【AutoParallel】Add llama2 UT for auto-parallel by heavyrain-lzy in https://github.com/PaddlePaddle/PaddleNLP/pull/8300
* Add system env log for llama test by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/8321
* [LLM] Support fuse attention q, k, v weights by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8202
* [Distributed] fix lora by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/8325
* fix try import by w5688414 in https://github.com/PaddlePaddle/PaddleNLP/pull/8332
* [DEV] Support sync params in tensor parallel config by From00 in https://github.com/PaddlePaddle/PaddleNLP/pull/8311
* cherry pick paddlenlp 2.8 by w5688414 in https://github.com/PaddlePaddle/PaddleNLP/pull/8323
* textfeature_queryinput by cxa-unique in https://github.com/PaddlePaddle/PaddleNLP/pull/8331
* [BugFix] Fix gpu ci by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8337
* [Trainer] Fix sharding overlap bug by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8333
* [Tokenizer]Add Chat template by Southpika in https://github.com/PaddlePaddle/PaddleNLP/pull/8226
* [AutoParallel]Refine lr warm_up configuration strategy for llama by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/8329
* Add num_hidden_layer config for llama run_pretrain by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/8288
* [XPU] llama add xpu support by dynamicheart in https://github.com/PaddlePaddle/PaddleNLP/pull/8282
* add eliminate_transpose arg by zhiqiu in https://github.com/PaddlePaddle/PaddleNLP/pull/8339
* change llama/modeling.py to opt npu performence by Galaxy1458 in https://github.com/PaddlePaddle/PaddleNLP/pull/8342
* Update llm docs requirements by w5688414 in https://github.com/PaddlePaddle/PaddleNLP/pull/8336
* Disable eval and predict for llama-2 benchmark. by Xreki in https://github.com/PaddlePaddle/PaddleNLP/pull/8366
* update by Galaxy1458 in https://github.com/PaddlePaddle/PaddleNLP/pull/8359
* [LLM] fix lora target modules on llama by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/8372
* [paddle-pipelines] Update offline ann by w5688414 in https://github.com/PaddlePaddle/PaddleNLP/pull/8353
* refine benchmard bert ips stat by wanghuancoder in https://github.com/PaddlePaddle/PaddleNLP/pull/8361
* [BugFix] Update truncate in distributed training by KB-Ding in https://github.com/PaddlePaddle/PaddleNLP/pull/8362
* [dist benchmark]Fix llama2 benchmark by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/8376
* Revert "update" by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8389
* Fix test init by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8377
* [Performance] Optimize unified checkpoint save/load speed. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8204
* [npu model bug]fix_global_bug by Galaxy1458 in https://github.com/PaddlePaddle/PaddleNLP/pull/8399
* [Bugfix] Fix fast tokenizer import error by w5688414 in https://github.com/PaddlePaddle/PaddleNLP/pull/8367
* [bugfix] fix uie by w5688414 in https://github.com/PaddlePaddle/PaddleNLP/pull/8379
* fit for llama3 for auto_parallel by zhiqiu in https://github.com/PaddlePaddle/PaddleNLP/pull/8395
* [DistDataloader] Update implementation, add nested.py by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8380
* [LLM] Fix fuse or split with same key by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8378
* [UC] Fix compatible with npu by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8409
* pre copy pinned data to gpu by wanghuancoder in https://github.com/PaddlePaddle/PaddleNLP/pull/8386
* Refine position_ids for auto parallel training of llama by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/8363
* [Distributed] enable tensor_parallel_output for finetuning by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/8370
* fix type promotion problem. by zxcd in https://github.com/PaddlePaddle/PaddleNLP/pull/8414
* Fix ckpt done by gongel in https://github.com/PaddlePaddle/PaddleNLP/pull/8402
* [LLM] rename logits_tensor_parallel_output to avoid conflict by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/8419
* [Trainer] fix distdataloader by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8420
* fix safe open. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8422
* adapter new type promotion rule for Paddle 2.6 by zxcd in https://github.com/PaddlePaddle/PaddleNLP/pull/8421
* [BugFix] Fix llama3 `eot_id` by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8371
* add npu-llama-opt0-script by Galaxy1458 in https://github.com/PaddlePaddle/PaddleNLP/pull/8401
* [LLM] add assertion for enable_stage1_overlap in lora mode by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/8425
* [NPU]Custom fusion operator unification by Galaxy1458 in https://github.com/PaddlePaddle/PaddleNLP/pull/8431
* delete csrc/generation/reset_need_stop_value.cc by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/8413
* Update llama_npu_opt_lora.sh by Galaxy1458 in https://github.com/PaddlePaddle/PaddleNLP/pull/8439
* [CI]add scripts for unittest by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/8433
* fix npu sft ckpt load bug and no FA bug by NINGBENZHE in https://github.com/PaddlePaddle/PaddleNLP/pull/8438
* Fix CI bugs by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8430
* Fix/test gpu by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8452
* Support fused_attention_qkv for auto_parallel llama by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/8432
* [BugFix] Fix load rng compatibility. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8450
* update by Galaxy1458 in https://github.com/PaddlePaddle/PaddleNLP/pull/8448
* [GCU] Support llama for GCU by EnflameGCU in https://github.com/PaddlePaddle/PaddleNLP/pull/8445
* [bugfix] fix erniedoc by w5688414 in https://github.com/PaddlePaddle/PaddleNLP/pull/8393
* [benchmark]Add llama2 auto by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/8424
* Add llama2-70b for test_tipc by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/8455
* Fix ci tests. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8471
* [NPU] support npu llama2-13B export & inference by ronny1996 in https://github.com/PaddlePaddle/PaddleNLP/pull/8442
* [LLM] fix bug when loss is None in llama modeling.py by cqulilujia in https://github.com/PaddlePaddle/PaddleNLP/pull/8459
* fix rotary_emb for llama by EnflameGCU in https://github.com/PaddlePaddle/PaddleNLP/pull/8470
* [Ops] RoPE kernel support theta input by yinfan98 in https://github.com/PaddlePaddle/PaddleNLP/pull/8440
* Support Sharding Overlap by iosmers in https://github.com/PaddlePaddle/PaddleNLP/pull/8473
* Revert "Support Sharding Overlap (8473)" by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/8491
* fix run_benchmark for llama2_70b in auto_parallel by fightfat in https://github.com/PaddlePaddle/PaddleNLP/pull/8484
* 【AutoParallel】Add split_backward for vpp by heavyrain-lzy in https://github.com/PaddlePaddle/PaddleNLP/pull/8479
* Quick fix from_pretrained. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8486
* Fix rng_state in llm models by zhangyuqin1998 in https://github.com/PaddlePaddle/PaddleNLP/pull/8396
* [AutoParallel] Support qwen for auto_parallel by GhostScreaming in https://github.com/PaddlePaddle/PaddleNLP/pull/8312
* modify block_multihead_attention api by ming1753 in https://github.com/PaddlePaddle/PaddleNLP/pull/8456
* [LLM] disable part of MC2 in lora by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/8505
* Update model_utils.py by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8509
* Update merge_lora_params.py by Galaxy1458 in https://github.com/PaddlePaddle/PaddleNLP/pull/8514
* [fea] moe support by bo-ke in https://github.com/PaddlePaddle/PaddleNLP/pull/8498
* Add Sharding V1 broadcast and V2 allgather overlap optimize by iosmers in https://github.com/PaddlePaddle/PaddleNLP/pull/8499
* [fix] Broadcast optimizer state using broadcast_dp without shard-resh… by bo-ke in https://github.com/PaddlePaddle/PaddleNLP/pull/8522
* Update README.md by wawltor in https://github.com/PaddlePaddle/PaddleNLP/pull/8524
* [Safetensors] Fix fast safe open slice. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8512
* Update Benchmark scripts by iosmers in https://github.com/PaddlePaddle/PaddleNLP/pull/8519
* fix eval. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8529
* [BugFix][NPU] fix llama attn_mask astype error by tianhaodongbd in https://github.com/PaddlePaddle/PaddleNLP/pull/8528
* fused_ln:Added implementation for the HIP platform by asr-sheep1 in https://github.com/PaddlePaddle/PaddleNLP/pull/8472
* [CI] Update pip source. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8540
* [PIP] Update run_ci.sh by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8552
* add mteb evaluation by cxa-unique in https://github.com/PaddlePaddle/PaddleNLP/pull/8538
* [Cherry-pick] Add release grad & sharding format & decorate_exclude_layers by ForFishes in https://github.com/PaddlePaddle/PaddleNLP/pull/8545
* Add RingFlashAttention for context parallel by zhangyuqin1998 in https://github.com/PaddlePaddle/PaddleNLP/pull/8383
* fix codecov conflicts by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/8555
* support fused weights for export_model by ronny1996 in https://github.com/PaddlePaddle/PaddleNLP/pull/8554
* 【benchmark】 add llama-7b_auto_dp2mp2pp2 benchmark script for cinn by mmglove in https://github.com/PaddlePaddle/PaddleNLP/pull/8423
* Fix memory leak bug by sneaxiy in https://github.com/PaddlePaddle/PaddleNLP/pull/8546
* Update sequence_parallel for predict by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/8551
* [GPT][CE] Update modeling.py by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8548
* add fuse_attention_ffn support for qwen by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/8526
* Update generation_utils.py by carryyu in https://github.com/PaddlePaddle/PaddleNLP/pull/8502
* fix llama export by ronny1996 in https://github.com/PaddlePaddle/PaddleNLP/pull/8561
* Update llama_npu_opt_lora.sh by Galaxy1458 in https://github.com/PaddlePaddle/PaddleNLP/pull/8562
* [FIX DDP] fix ddp by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8549
* [AutoParallel] Add benchmark for llama-7b-dy2st. by GhostScreaming in https://github.com/PaddlePaddle/PaddleNLP/pull/8559
* [Cherry pick] Sharding reshard function enhancement by sneaxiy in https://github.com/PaddlePaddle/PaddleNLP/pull/8544
* [BugFix] Fix test_long_sequence_strategies by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8568
* Fix/ci pip by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8541
* Add async save for optimizer by ForFishes in https://github.com/PaddlePaddle/PaddleNLP/pull/8557
* add llama & qwen dpo by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/8474
* [LLM] support Qwen2 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8338
* [LLM] Fix Qwen2 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/8584
* fix autotunner benchmark error and fix llama2 dy2st benchmark by fightfat in https://github.com/PaddlePaddle/PaddleNLP/pull/8587
* fix autoruner resume case by Difers in https://github.com/PaddlePaddle/PaddleNLP/pull/8259
* Enable test with re-try. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8590
* [xpu] add xpu custom ops support for llama2-7b by NeroLoh in https://github.com/PaddlePaddle/PaddleNLP/pull/8515
* xpu devices support llama-7b basic mode inference (turn on BlockAtten… by zhink in https://github.com/PaddlePaddle/PaddleNLP/pull/8588
* Add Pipeline Parallel for PPO training and support generation with InferenceModel by guoshengCS in https://github.com/PaddlePaddle/PaddleNLP/pull/7953
* [xpu] change xpu setup.py to paddlenlp_ops by NeroLoh in https://github.com/PaddlePaddle/PaddleNLP/pull/8595
* Clean RLHF main script by guoshengCS in https://github.com/PaddlePaddle/PaddleNLP/pull/8596
* Fix dataset with empty char. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8469
* XPU open ir pass by zhink in https://github.com/PaddlePaddle/PaddleNLP/pull/8598
* [bug fix] fix sharding stage1 allgather overlap bug, which needs to forbiden pin memory by iosmers in https://github.com/PaddlePaddle/PaddleNLP/pull/8594
* Add main process print function by ForFishes in https://github.com/PaddlePaddle/PaddleNLP/pull/8604
* [Feature] Optimize config saving. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/8490
* to_json_string compatibility upgrade by sneaxiy in https://github.com/PaddlePaddle/PaddleNLP/pull/8608
