本次版本中,我们全面集成了 DeepSeek R1类的思考模型。推理团队深度优化了模型推理,速度业界领先。此外,我们还发布了自研PP-UIE信息抽取模型。本次重点更新如下。
重点更新:
* 模型新增
* DeepSeek V3/R1, R1-distill, QwQ-32B 热门思考模型,全面支持。用户可以点击[官方模型文档列表](https://paddlenlp.readthedocs.io/zh/latest/model_list.html)查看、下载所有模型。
* 飞桨自研发布下一代通用信息抽取工具 PP-UIE 全新发布。支持8K长度信息抽取。[使用文档](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm/application/information_extraction)。
* 推理部署
* 全面支持DeepSeek V3/R1满血版FP8、INT8、4比特量化推理,MTP投机解码。
* FP8推理,单机输出超1400 tokens/s;4比特单机部署,输出超2500 tokens/s!
* 首次协同推理团队,发布统一推理部署镜像,热门模型一键部署。推理部署使用文档全面更新,体验全面提升!见[文档](https://paddlenlp.readthedocs.io/zh/latest/llm/server/docs/general_model_inference.html)。
* 模型训练:
* 新增大模型 Embedding 训练,支持INF-CL超大batch size训练。
* 新增MergeKit模型融合工具,缓解对齐代价。见[文档](https://paddlenlp.readthedocs.io/zh/latest/llm/docs/mergekit.html)。
* 低资源训练 全面优化。16G小显存可以流畅训练。
* 其他重点特性:
* 文档页面,新增模型列表展示。用户可查看、下载对应模型文件。见[文档](https://paddlenlp.readthedocs.io/zh/latest/model_list.html)。
* 训练新增 adam-mini 优化器。AdamW优化器支持 BF16 动量。
下面是一些对应的更新细节:
1. 模型、框架组件更新
* 模型新增
* 模型新增列表:
* paddlenlp/PP-UIE-0.5B, paddlenlp/PP-UIE-1.5B, paddlenlp/PP-UIE-7B, paddlenlp/PP-UIE-14B
* deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-V3-Base,deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-R1-Zero,
* deepseek-ai/DeepSeek-R1-Distill-Llama-70B, deepseek-ai/DeepSeek-R1-Distill-Llama-8B, deepseek-ai/DeepSeek-R1-Distill-Qwen-14B, deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
* Qwen/Qwen2.5-7B-Instruct-1M,Qwen/Qwen2.5-14B-Instruct-1M, Qwen/QwQ-32B, Qwen/QwQ-32B-Preview
* PR 9738: Deepseek V3 模型新增。PR 9876: 增加 MTP 支持。PR 9797:修复 TP问题。 PR 9643: Deepseek llama3.3 新增模型说明(DrownFish19)
* PR 9906: Deepseek V3 支持动态图直接加载 Float8 参数并进行推理 (ZHUI)
* PR 9845: 新增PP-UIE系列模型 Fantasy-02 i PR 9911 & PR 9913: PP-UIE 相关文档更新(DrownFish19)
* Tokenizer 改进
* PR 9548、PR 9577、PR 9594: “Hackathon No.43” 系列,完善 TokenizerFast 功能支持(yinfan98)
* PR 9745: 修复 AutoTokenizer 问题(DrownFish19)PR 9837: 保存额外的 special tokens(DesmonDay)
* Unified Checkpoint 相关:
* PR 9540: 修复加载master weight PR 9523: 修复缺失key问题。
* PR 9669: 统一检查点的 Bug 修复 PR 9935: 针对忽略 merge optimizer 时直接加载参数的问题进行修复
* PR 9741 & PR 9821: 修复专家并行支持问题
* [MergeKit 功能增强与优化](https://github.com/PaddlePaddle/PaddleNLP/pull/9811)
* 新增功能与优化
* PR 9561: 新增 mergekit_with_sparsify 功能,支持稀疏化合并(Mangodadada)。
* PR 9702: 优化 MergeKit 的 GPU 支持,提升处理效率(Mangodadada)。
* PR 9811: 添加 LoRA(低秩适配器)合并功能,扩展模型融合能力(lugimzzz)。
* 工具更新与维护
* PR 9885: 对 MergeKit 工具进行代码更新与维护,优化整体逻辑。
* 日志与调试支持
* PR 9948: 添加日志记录功能,增强调试与过程追踪能力(lugimzzz)。
* 低资源特性优化
* PR 9804: 添加 use_fused_linear_cross_entropy 支持,减小显存。加入 pre_divided_factor 避免FP16溢出。
* 文档更新、其他:
* PR 9634: unified_checkpoint 文档更新
* PR 9734: 自定义设备代码重构(ZHUI)
* PR 9715: 增加 offload_recompute_inputs(will-jl944)
* PR 9800: 增加训练 token 计数功能(lugimzzz)
2. LLM 训练更新
* 通用训练
* PR 9204: 更新 chatglmv2 的 tensor/pipeline 并行(DrownFish19)
* PR 9827: 为 Qwen2Moe 和 Deepseek 增加 pipeline 与 flashmask 支持(DrownFish19)
* Embedding 训练
* PR 9508: Embedding trainer 新增(DesmonDay)PR 9673: 增加 INF-CL 超大batch训练支持(jie-z-0607)
* PR 9656: Trainer 中修复加载 rng 状态问题(DesmonDay)
* PR 9721: 修复 embedding 随机性问题(DesmonDay)
* DPO训练
* PR 9543: LLM 模块中 dpo 对 qwen2 的 flashmask 支持(wtmlon)
* PR 9620: 更新 dpo criterion(lugimzzz)
* PR 9695: 支持 qwen 与 llama 的 dpo pp(lugimzzz)
* 新功能和特性
* PR 9542: 增加 adam-mini 优化器支持(lugimzzz)
* PR 9732: 支持BF16动量adamw 训练 (lugimzzz)
* PR 9830: 修复非 flash 模式下 checkpoint 保存的问题(SylarTiaNII)
* PR 9705: Cherry-Pick:在 optimizer step 前校验 loss(SylarTiaNII)
* PR 9704: Cherry-Pick:为 LLM 训练增加异步 metrics dumper(SylarTiaNII)
* 训练文档及问题修复
* PR 9689: 增加 KTO 功能(lugimzzz)
* PR 9655: 更新 peft 文档(lugimzzz)
* PR 9659: 修复 lora 相关问题(lugimzzz)
3. Inference 更新
* Predictor & Flask 更新
* PR 9831: 修复 multibatch 推理问题(DrownFish19)
* PR 9841: 修复 position_ids 相关问题(DrownFish19)
* PR 9864: 更新 Deepseek 推理(DrownFish19)
* PR 9828: Flask 服务使 Inference 兼容 OpenAI API(ZHUI)
* MTP功能优化
* PR 9856: Inference 中支持 mtp 与 Deepseek-v3(freeliuzc)
* PR 9894: 修复 Deepseek_v3 在多 GPU 模式下的 mtp 问题(freeliuzc)
* PR 9936: 增加 mtp serving 支持(freeliuzc)
* 部署优化
* PR 9872: 支持多机部署 LLM(ltd0924)
* PR 9791: 合并 fastdeploy 部分代码(kevincheng2)
* Kernel优化
* PR 9707: 优化 gemm_dequant OP,利用 CUDA 核进行 int8_sq 运算(zhink)
* 文档更新、测试
* PR 9613: Inference 模块支持 llama3.2 及文档更新(yuanlehome)
* PR 9921: 修复 llama 的 block_size 设置(zhaohaixu)
* PR 9711: 为 LLM predictor 增加 common models 和参数单元测试(aooxin)
4. AutoParallel / 分布式训练更新
* 自动并行
* PR 9578: 增加 llama2-7b-cinn 的测试(zhangbo9674)
* 基础配置与 CI 集成
* PR 9538: 增加 qwen model_auto 与 CI(blacksheep-Aristotle)
* PR 9541: 增加 llama3.1 自动并行配置(zhiqiu)
* PR 9551: 为 gpt 和 baichuan 自动 CI 加入支持(blacksheep-Aristotle)
* PR 9591: 增加 gpt、baichuan 及 qwen 的 ce 支持(blacksheep-Aristotle)
* PR 9412: 增加 single_model 网络和使用 intermediate API(blacksheep-Aristotle)
* PR 9943: 通过 training_args 控制 split input(blacksheep-Aristotle)
* 测试、验证与功能开关
* PR 9621: 增加 PIR recompute 测试(waliwali777)
* PR 9647: 修改 loss_base 以支持 dropout 后 SPMD(deepllz)
* PR 9714: 增加阶段 1 tensor fusion 相关开关(AndSonder)
* PR 9672: 修复 recompute 测试在 to_static=1 下运行问题(waliwali777)
* PR 9688: 自动并行下合并 ckpt 供推理使用(xuxinyi389)
* PR 9750 & PR 9753: 修复 ernine auto trainer 相关 CI 错误(blacksheep-Aristotle)
* PR 9749: 为 benchmark 开启 tensor fusion(AndSonder)
* PR 9810: 增加 sharding tensor fusion save/load 开关(AndSonder)
* PR 9862: 支持 deepseekv2 下的 DP/MP(xuxinyi389)
* PR 9823: 增加 support ppo ckpt 功能(xuxinyi389)
5. CI、文档、Benchmark 及测试脚本更新
* CI 脚本及警告过滤
* PR 9547: 更新 CI 脚本(Liujie0926)
* PR 9612: CI 中过滤 paddle.to_tensor 警告(DrownFish19)
* PR 9626: 更新 a100 loss_base 配置(Liujie0926)
* PR 9889: CI 脚本更新(Liujie0926)
* PR 9524: LLM benchmark 中新增 qwen2.5-7b(Liujie0926)
* PR 9662 & PR 9722: 更新 LLM_benchmark 脚本(Liujie0926)
* 文档与说明改进
* PR 9585: 修复文档中失效链接(DrownFish19)
* PR 9668: 更新 README.md(ZHUI)
* PR 9785: 更新面向文档的 README(ZHUI)
* PR 9746: 文档修复(DrownFish19)
* PR 9725: 调整 benchmark 环境变量和模型配置(XieYunshen)
* PR 9877: 修正 inference 和 servering 的文档(ZHUI)
* PR 9834: 发布 DeepSeek 新闻及说明(DrownFish19)
* PR 9922: 更正精调文档错误(sijunhe)
* Benchmark 配置与测试
* PR 9651: 修复 benchmark 多机任务异常退出的问题(XieYunshen)
* PR 9891: 更新 gpt-13b 在 dygraph 模式下的最佳配置(liym27)
6. NPU/XPU 及硬件相关更新
* NPU 适配与修复
* PR 9499: 适配 NPU 用于 FusedHeadAndCrossEntropy(tianhaodongbd)
* PR 9573: 修复 NPU 下的 where 问题(tianhaodongbd)
* PR 9762: 适配新版 flash_attention_npu API(will-jl944)
* XPU 功能与优化
* PR 9549: qwen2 支持 flash_attn on XPU(will-jl944)
* PR 9660: qwen2 支持 fused_rope(will-jl944)
* PR 9789: 支持 XPU 下的 empty_cache(will-jl944)
* PR 9796: 支持 XPU 用于自动并行 LLaMa(From00)
* PR 9854: 为 deepseek 增加 XPU 下 fused op(QingshuChen)
7. Bug 修复、性能优化及其他改进
* 状态加载与多线程问题
* PR 9464: 修复多线程下 load_state_dict 的问题(DesmonDay)
* 各类模型与算子问题修复
* PR 9603: 修复 qwen2 modeling 中 d2s bug(wawltor)
* PR 9569: 修复 dynamic 与 static 模式下的 norm outputs 问题(Wangzheee)
* PR 9652: 修复 paddle.where 问题(will-jl944)
* PR 9638: 增加 config replace_with_c_embedding(Xing-lil)
* PR 9699: 修复 loraga amp 问题(greycooker)
* PR 9752: 修复 get_block_shape_and_split_kv_block 的 bug(lizhenyun01)
* PR 9759: 修复 speculate_verify_and_update op(Wanglongzhi2001)
* PR 9674: 将 speculate_step 合并到 step op 中(Wanglongzhi2001)
* PR 9757: Trainer 模块中更新 sequence parallel(DesmonDay)
* PR 9765: 修复 loraga merge 问题(greycooker)
* PR 9777: 分布式训练下 Cherry-Pick 支持 fuse optimizer(SylarTiaNII)
* PR 9783: 修复 ce 错误(blacksheep-Aristotle)
* PR 9779: 修复 pickle unsafe-load 问题(DrownFish19)
* PR 9760: MoE 模块修复 expert parallel(DesmonDay)
* PR 9790: 为 server infer 添加 pir_model 路径(aooxin)
* PR 9706: Cherry-Pick 集成 PDC SDK 用于 LLM 训练容错(SylarTiaNII)
* PR 9624: 添加 FLAGS 用于替换四个参数以便更好地加速(zhink)
* PR 9806: 修复 LLAMA 参数解析 bug(will-jl944)
* PR 9829: 更新 mixtral.md 文件(yuanlehome)
* PR 9859: 修复 dsk rope 差异问题(yuanlehome)
8. 环境/依赖及版本兼容更新
* requirements 及安装更新
* PR 9514: 更新 py38 下的 requirements.txt (ZHUI)
* PR 9118: 更新安装依赖(DrownFish19)
* PR 9953: 针对 py38 增加 tokenizers 依赖(DrownFish19)
* Python 版本兼容性
* PR 9853: 解决类型注解在不同 Python 版本下的兼容性问题(zty-king)
What's Changed
* Update requirements.txt for py38 by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9514
* [Unified Checkpoint] fix single card loading without master weights by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9540
* Fix multi-threading load_state_dict by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9464
* delete generate_rank_mapping when export multi cards model by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9552
* [LLM] dpo support qwen2 with flashmask by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9543
* [XPU] qwen2 supports flash_attn on XPU by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9549
* [AutoParallel]: add qwen model_auto and ci by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9538
* add llama3.1 config for auto_parallel by zhiqiu in https://github.com/PaddlePaddle/PaddleNLP/pull/9541
* Add more model support for speculate_decoding and refactor speculate_decoding by Wanglongzhi2001 in https://github.com/PaddlePaddle/PaddleNLP/pull/9504
* [Intel_HPU]FSDPA custom kernel API update by yanfeich in https://github.com/PaddlePaddle/PaddleNLP/pull/9556
* [Unified Checkpoint] fix load missing keys by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9523
* 【Hackathon 7th No.43】完善 TokenizerFast 功能支持 part 3 by yinfan98 in https://github.com/PaddlePaddle/PaddleNLP/pull/9548
* adapt code to amsgrad supported adamw by HydrogenSulfate in https://github.com/PaddlePaddle/PaddleNLP/pull/9568
* [CI]update scripts by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9547
* Adapting npu for FusedHeadAndCrossEntropy by tianhaodongbd in https://github.com/PaddlePaddle/PaddleNLP/pull/9499
* 【Hackathon 7th No.43】完善 TokenizerFast 功能支持 part 4 by yinfan98 in https://github.com/PaddlePaddle/PaddleNLP/pull/9577
* fix(export_model): fix export_model.py python path by thinking-computer in https://github.com/PaddlePaddle/PaddleNLP/pull/9571
* Fix_ckpt_oom_paddlenlp by Xing-lil in https://github.com/PaddlePaddle/PaddleNLP/pull/9507
* Add GPUEventTimer by sneaxiy in https://github.com/PaddlePaddle/PaddleNLP/pull/9582
* [npu] fix where bug by tianhaodongbd in https://github.com/PaddlePaddle/PaddleNLP/pull/9573
* [doc] Fix dead links by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9585
* [AutoParallel]:add gpt & baichuan auto ci by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9551
* Add llama2-7b-cinn test by zhangbo9674 in https://github.com/PaddlePaddle/PaddleNLP/pull/9578
* [AutoParallel]:add gpt&baichuan&qwen ce by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9591
* fix dpo pp eval by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9607
* [LLM] update tensor and pipeline parallel for chatglmv2 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9204
* [Install] Update requirment.txt by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9118
* [Trainer]Fix _get_eval_sampler by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/9374
* fix benchmark scripts by XieYunshen in https://github.com/PaddlePaddle/PaddleNLP/pull/9597
* [Trainer] Add embedding trainer by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9608
* [CI] filter paddle.to_tensor warnings when set_state_dict by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9612
* fix ckpt quant log by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9606
* fix the d2s bug in qwen2 modeling by wawltor in https://github.com/PaddlePaddle/PaddleNLP/pull/9603
* 【Hackathon 7th No.43】完善 TokenizerFast 功能支持 part 5 by yinfan98 in https://github.com/PaddlePaddle/PaddleNLP/pull/9594
* fix pp_config bug by tianhaodongbd in https://github.com/PaddlePaddle/PaddleNLP/pull/9605
* Speedup FusedHeadAndCrossEntropy by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9601
* fix get_save_output op and refactor specu_decoding by Wanglongzhi2001 in https://github.com/PaddlePaddle/PaddleNLP/pull/9576
* [Inference] Fix docs and support llama3.2 by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9613
* fix by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9628
* fix norm outputs in dynamic and static mode by Wangzheee in https://github.com/PaddlePaddle/PaddleNLP/pull/9569
* [CI]update a100 loss_base for gpt by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9626
* [LLM benchmark]add qwen2.5-7b by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9524
* Checkpoint Compression Doc by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9614
* Update unified_checkpoint.md by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9634
* add llama and nv-embed training by Li-Z-Q in https://github.com/PaddlePaddle/PaddleNLP/pull/9323
* [News] Unified Checkpoint by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9632
* feat(sdaa): support sdaa backend infer by thinking-computer in https://github.com/PaddlePaddle/PaddleNLP/pull/9570
* [llm]update dpo criterion by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9620
* [llm]add adam-mini by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9542
* Update version for beta3 by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9553
* [LLM DOCs] Add deepseek llama3.3 new models by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9643
* [Tokenizer] Fix tokenizer of llama3.3 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9641
* [AutoParallel] Add test for PIR recompute by waliwali777 in https://github.com/PaddlePaddle/PaddleNLP/pull/9621
* Update README.md for 3.0 beta3 by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9644
* Add replace_with_parallel_cross_entropy flag by waliwali777 in https://github.com/PaddlePaddle/PaddleNLP/pull/9579
* [AutoParallel] change loss_base after dropout support spmd by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/9647
* [Embedding] Add embedding training by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9508
* [PEFT]Add LoRA-GA by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/9592
* mergekit_with_sparsify by Mangodadada in https://github.com/PaddlePaddle/PaddleNLP/pull/9561
* Fix paddle.where by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9652
* Add config replace_with_c_embedding by Xing-lil in https://github.com/PaddlePaddle/PaddleNLP/pull/9638
* Update embedding trainer state by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9629
* MoRA Implementation by lcykww in https://github.com/PaddlePaddle/PaddleNLP/pull/9562
* [llm]update peft docs by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9655
* [Trainer] Fix loading rng state by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9656
* fix qwen&baichaun&gpt ci error by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9650
* [llm] fix lora by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9659
* [XPU] qwen2 supports fused_rope by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9660
* update hygon dcu docs by TimeYWL in https://github.com/PaddlePaddle/PaddleNLP/pull/9298
* Make the timer compatible with devices other than GPU by deepllz in https://github.com/PaddlePaddle/PaddleNLP/pull/9665
* [Trainer] update remove_master_weight by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9640
* [DOC] Update README.md by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9668
* [Mthreads] support llama 13B train by shang-mt in https://github.com/PaddlePaddle/PaddleNLP/pull/9666
* Structured Index of Documents by dfmz759837901 in https://github.com/PaddlePaddle/PaddleNLP/pull/9411
* 【Qwen2-VL Inference】add qwen2-vl high performance inference by chang-wenbin in https://github.com/PaddlePaddle/PaddleNLP/pull/9575
* merge docs by Mangodadada in https://github.com/PaddlePaddle/PaddleNLP/pull/9657
* [CI]update blacklist for gpt3 by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9555
* [体验优化] 整合训练的CUDA和Triton算子为 paddlenlp_kernel by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/9471
* [Unified Checkpoint] bug fix by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9669
* Add tied_weight_keys for pipeline model by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9663
* Optimize performance for Qwen2 model by sneaxiy in https://github.com/PaddlePaddle/PaddleNLP/pull/9616
* [MLU] add mlu llama readme by PeiyuLau in https://github.com/PaddlePaddle/PaddleNLP/pull/9671
* Set tensor parallel name mapping when fusion is used by sneaxiy in https://github.com/PaddlePaddle/PaddleNLP/pull/9685
* [LLM] add deploy server by kevincheng2 in https://github.com/PaddlePaddle/PaddleNLP/pull/9581
* [Embedding] Add inf-cl in embedding trainer by jie-z-0607 in https://github.com/PaddlePaddle/PaddleNLP/pull/9673
* [Fix]fix loraga amp by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/9699
* [LLM INFER] cutlass 3.x gemm on sm90 by ckl117 in https://github.com/PaddlePaddle/PaddleNLP/pull/9398
* [Iluvatar] Add readme for llama-13b by tianyuzhou668 in https://github.com/PaddlePaddle/PaddleNLP/pull/9670
* [AutoParallel] merge ckpt for inference by xuxinyi389 in https://github.com/PaddlePaddle/PaddleNLP/pull/9688
* update gpt&baichuan&qwen ce name by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9697
* fix docs by xuxinyi389 in https://github.com/PaddlePaddle/PaddleNLP/pull/9703
* [Inference] Use cuda core(int8_sq) for m <=4 in gemm_dequant OP by zhink in https://github.com/PaddlePaddle/PaddleNLP/pull/9707
* [LLM] [Cherry-Pick] valid loss before optimizer step (9255) by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9705
* [llm]support dpo pp for qwen & llama by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9695
* support qwen dpo fused kernel by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9686
* [AutoParallel] Fix recompute test running under `to_static=1` by waliwali777 in https://github.com/PaddlePaddle/PaddleNLP/pull/9672
* [LLM_benchmark]update LLM_benchmark scripts by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9662
* [LLM] [Cherry-Pick] add asynchronous metrics dumper for llm training by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9704
* [llm] Add KTO by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9689
* [Embedding] Fix embedding random by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9721
* remove refined recompute deep copy by JunnYu in https://github.com/PaddlePaddle/PaddleNLP/pull/9617
* add single_model network and use intermediate api by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9412
* Refactor custom devices. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9734
* Add offload_recompute_inputs by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9715
* [LLM] [Cherry-Pick] Integrate PDC SDK for LLM training fault tolerance platform by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9706
* add common models and common params unit test for llm predictor. by aooxin in https://github.com/PaddlePaddle/PaddleNLP/pull/9711
* Added FLAGS to replace four params and the value can be adjusted for better speedup by zhink in https://github.com/PaddlePaddle/PaddleNLP/pull/9624
* [AutoParallel] add parameter enable_stage1_tensor_fusion_blanced_save_load and enable_stage1_tensor_fusion by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9714
* Adapt to new npu flash_attention api by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9735
* [AutoParallel] Add test for PIR refined recompute by waliwali777 in https://github.com/PaddlePaddle/PaddleNLP/pull/9679
* [Docs] Fix by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9746
* Bugfix update predictor.py by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9742
* Modify the environment variables and model configuration of the bench… by XieYunshen in https://github.com/PaddlePaddle/PaddleNLP/pull/9725
* [Unified Checkpoint] Fix expert parallel by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9741
* [AutoParallel]:ufix ernie ci error by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9750
* fix import bugs. by aooxin in https://github.com/PaddlePaddle/PaddleNLP/pull/9751
* [AutoParallel]ckpt support local views keys to global views keys by xuxinyi389 in https://github.com/PaddlePaddle/PaddleNLP/pull/9604
* Add XLMRoBERTaModel in paddlenlp by jie-z-0607 in https://github.com/PaddlePaddle/PaddleNLP/pull/9720
* [AutoParallel]:fix ernine auto_trainer error by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9753
* fix get_block_shape_and_split_kv_block by lizhenyun01 in https://github.com/PaddlePaddle/PaddleNLP/pull/9752
* fix speculate_verify_and_update op by Wanglongzhi2001 in https://github.com/PaddlePaddle/PaddleNLP/pull/9759
* [Inference]merge speculate_step into step op by Wanglongzhi2001 in https://github.com/PaddlePaddle/PaddleNLP/pull/9674
* [NPU] Adapt to new flash_attention_npu api by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9762
* [Trainer] update sequence parallel by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9757
* [tokenizer] Fix AutoTokenizer by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9745
* [LLM] Add DeepseekV3 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9738
* [AutoParallel] open tensor_fusion for benchmark by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9749
* fix loraga merge by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/9765
* Fix ernie ci auto trainer error by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9758
* Update README.md by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9766
* Fix matryoshka norm loss by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9774
* [Distributed] [Cherry-Pick] support fuse optimizer (9519) by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9777
* Update register_sequence_parallel_allreduce_hooks by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9782
* Fix ce error by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9783
* fix pickle unsafe-load by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9779
* [MoE] fix expert parallel by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9760
* fix dpo pp criterion by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9786
* add pir_model path for server infer. by aooxin in https://github.com/PaddlePaddle/PaddleNLP/pull/9790
* [LLM] [Cherry-Pick] support flash device on static model (9619) by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9787
* [LLM Benchmark]update scripts by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9722
* mergekit gpu 1226 by Mangodadada in https://github.com/PaddlePaddle/PaddleNLP/pull/9702
* [LLM] merge code from fastdeploy by kevincheng2 in https://github.com/PaddlePaddle/PaddleNLP/pull/9791
* support eagle for llama by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/9812
* [CI] Fix by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9633
* wrap model when lora is ON and only do evaluation. by wtmlon in https://github.com/PaddlePaddle/PaddleNLP/pull/9803
* Update README.md for documention by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9785
* [Checkpoint compression] Support sharding stage1 v2 by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9817
* [LLM] Update model convert and fix TP for deepseekv3 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9797
* [AutoParallel] add sharding tensor_fusion save load switch by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9810
* 修复benchmark多机任务异常退出的处理 by XieYunshen in https://github.com/PaddlePaddle/PaddleNLP/pull/9651
* Fix LLAMA arg parsing bug in pp by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9806
* Update mixtral.md by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9829
* [XPU] Support empty_cache on XPUs by will-jl944 in https://github.com/PaddlePaddle/PaddleNLP/pull/9789
* [Inference] Fix multibatch inference by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9831
* Fix position_ids for infra by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9841
* [LLM] Add pipeline and flashmask for Qwen2Moe and Deepseek by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9827
* [Mergekit]update & add LoRA merge by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9811
* [Unified Checkpoint] Fix expert parallel by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9821
* [Inference] Flask server compatible with OpenAI api. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9828
* [LLM] fix checkpoint save for non flash mode by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9830
* [DSK] support deepseek-v3/r1 (mha/fp16/bf16/wint8/wint4) by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9769
* 解决类型注解Python版本兼容性问题 by zty-king in https://github.com/PaddlePaddle/PaddleNLP/pull/9853
* [Tokenizer] save extra special tokens by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9837
* [Bugfix] Fix dsk rope diff by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9859
* Support lower memory cards. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9804
* Support XPU for auto-paralllel LLaMa by From00 in https://github.com/PaddlePaddle/PaddleNLP/pull/9796
* [XPU] Add fused op for deepseek by QingshuChen in https://github.com/PaddlePaddle/PaddleNLP/pull/9854
* [Inference] Update deepseek by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9864
* [PreTrain] Support deepseek mfu for pretraining and fix tflops for pretrain pipe model by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9855
* [Inference]Support mtp with deepseek-v3 by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/9856
* [AutoParallel] Support deepseekv2 with DP/MP by xuxinyi389 in https://github.com/PaddlePaddle/PaddleNLP/pull/9862
* [LLM] move modeling.py and modeling_nv.py to transformers by Li-Z-Q in https://github.com/PaddlePaddle/PaddleNLP/pull/9676
* [Docs] fix docs for inference and servering by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9877
* [Docs] news of DeepSeek by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9834
* [AutoParallel]support_ppo_ckpt by xuxinyi389 in https://github.com/PaddlePaddle/PaddleNLP/pull/9823
* suppport intermediate_api llama test by liym27 in https://github.com/PaddlePaddle/PaddleNLP/pull/9850
* Update MergeKit by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9885
* [LLM] Support multi machine deployment by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/9872
* 【SpecInfer】修复 InferenceWithReference 接收率不高的 bug by Wanglongzhi2001 in https://github.com/PaddlePaddle/PaddleNLP/pull/9880
* update the best conf for gpt-13b in dygraph mode by liym27 in https://github.com/PaddlePaddle/PaddleNLP/pull/9891
* [Inference]fix deepseek_v3 with mtp in multi-gpu mode by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/9894
* [TaskFlow] Fix pir for taskflow by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9822
* [LLM-IE] Add pp-uie to Taskflow by Fantasy-02 in https://github.com/PaddlePaddle/PaddleNLP/pull/9845
* [DOC] Update README for PP-UIE by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9911
* 【benchmark】align benchmark conf for static baichuan2 gpt3 by liym27 in https://github.com/PaddlePaddle/PaddleNLP/pull/9901
* [DOC] PP-UIE by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9913
* add gpu whl by bukejiyu in https://github.com/PaddlePaddle/PaddleNLP/pull/9890
* add count trained tokens by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9800
* 更正精调文档错误 by sijunhe in https://github.com/PaddlePaddle/PaddleNLP/pull/9922
* [CI]update ci scripts by Liujie0926 in https://github.com/PaddlePaddle/PaddleNLP/pull/9889
* [LLM]: fix block_size setting for llama. by zhaohaixu in https://github.com/PaddlePaddle/PaddleNLP/pull/9921
* support qwen2_5_vl by chang-wenbin in https://github.com/PaddlePaddle/PaddleNLP/pull/9924
* [DSK] Fix some bugs for dsk-v3 by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9874
* support intermediate_api gpt-3 test by Function-Samuel in https://github.com/PaddlePaddle/PaddleNLP/pull/9912
* support intermediate_api qwen test by Function-Samuel in https://github.com/PaddlePaddle/PaddleNLP/pull/9910
* [LLM] Add MTP for Deepseekv3 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9876
* [taskflow] Fix taskflow bug by Fantasy-02 in https://github.com/PaddlePaddle/PaddleNLP/pull/9930
* 【Inference】Support mtp serving by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/9936
* [Autoparallel] Mtp for DeepSeekV3 by xuxinyi389 in https://github.com/PaddlePaddle/PaddleNLP/pull/9884
* [Unified Checkpoint] Fix split param loading directly when using ignore_merge_optimizer by DesmonDay in https://github.com/PaddlePaddle/PaddleNLP/pull/9935
* [DSK] Implement mla use matrix-absorption by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9875
* use training_args to contorl split input by blacksheep-Aristotle in https://github.com/PaddlePaddle/PaddleNLP/pull/9943
* [requirements] tokenizers for py38 by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9953
* [LLM] update llm server dockerfiles by kevincheng2 in https://github.com/PaddlePaddle/PaddleNLP/pull/9940
* 【Inference】fix dynamic_forward of mtp by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/9947
* [RL] Fix PPO and add GRPO by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9925
* [doc] update config and add docs for grpo by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9962
* Add Process Reward Model. by XuLingnan in https://github.com/PaddlePaddle/PaddleNLP/pull/9598
* [Feature] Support float8 dtype storage and deepseek v3 with fp8 inference. by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9906
* [AutoParallel] Add auto parallel moe layer by pkuzyc in https://github.com/PaddlePaddle/PaddleNLP/pull/9886
* [llm]add bf16 moment adamw by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9732
* [MergeKit]add log by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9948
* Longlora by micelvrice in https://github.com/PaddlePaddle/PaddleNLP/pull/9970
* Fix update paddle_patch.py by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9968
* support MMLU eval by vivienfanghuagood in https://github.com/PaddlePaddle/PaddleNLP/pull/9967
* Update paddle_patch.py by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9978
* [XPU] change llama loss func on xpu by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9973
* [Inference] refine csrc/tools/build_wheel.sh by bukejiyu in https://github.com/PaddlePaddle/PaddleNLP/pull/9971
* [DSK] mla use tensor core by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9952
* Update paddle_patch.py by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/9984
* [LLM]fix ci by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9986
* Fix mtp speed by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/9987
* [Trainer]fix wandb proxy by greycooker in https://github.com/PaddlePaddle/PaddleNLP/pull/9960
* [LLM] add moe parallel groups by SylarTiaNII in https://github.com/PaddlePaddle/PaddleNLP/pull/9982
* [AutoParallel] Fix pipeline visualization tool by AndSonder in https://github.com/PaddlePaddle/PaddleNLP/pull/9976
* [llm]fix ci by lugimzzz in https://github.com/PaddlePaddle/PaddleNLP/pull/9989
* [DSK] DeepSeek Support FP8 by ming1753 in https://github.com/PaddlePaddle/PaddleNLP/pull/9956
* [LLM INFER] update step_paddle by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9991
* [AutoParallel] Add pp stage id by xuxinyi389 in https://github.com/PaddlePaddle/PaddleNLP/pull/9965
* [CI] Fix tokenizer load in PRM by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9997
* fix tenercore precision while split kv by lizhenyun01 in https://github.com/PaddlePaddle/PaddleNLP/pull/9994
* support intermediate_api baichuan test by Function-Samuel in https://github.com/PaddlePaddle/PaddleNLP/pull/9988
* 【Inference】Add benchmark client test scripts by gzy19990617 in https://github.com/PaddlePaddle/PaddleNLP/pull/9996
* [LLM] Support for automatic deployment of services, modification of environment variable names by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/9966
* Add moe flex dispatcher by umiswing in https://github.com/PaddlePaddle/PaddleNLP/pull/9977
* add deepseek doc by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/9964
* [Feat] Sage Attention Kernels Support for sm80, sm89, sm90 by l1cacheDell in https://github.com/PaddlePaddle/PaddleNLP/pull/9848
* [LLM] support fix seq len and cmd run service by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/10004
* [Doc] Add Qwen/QwQ-32B model ids by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/10005
* [LLM] Fix MTP for pipeline parallel by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/9972
* [LLM] Update license by DrownFish19 in https://github.com/PaddlePaddle/PaddleNLP/pull/10003
* 【Infer】remove some bug config for block gemm by ckl117 in https://github.com/PaddlePaddle/PaddleNLP/pull/10002
* Default set FLAGS_cascade_attention_max_partition_size as 32K by lizhenyun01 in https://github.com/PaddlePaddle/PaddleNLP/pull/10013
* [Distribution] Support DualPipeV for GPT3 by zhangyuqin1998 in https://github.com/PaddlePaddle/PaddleNLP/pull/9993
* [inference]add docker doc by bukejiyu in https://github.com/PaddlePaddle/PaddleNLP/pull/9998
* 【Docs】Update speculate decoding docs by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/10017
* [LLM] fix llm model path and support download from txt by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/10029
* [CherryPick] add import check of local_layer by pkuzyc in https://github.com/PaddlePaddle/PaddleNLP/pull/10038
* fix mla nan in mtp by lizhenyun01 in https://github.com/PaddlePaddle/PaddleNLP/pull/10041
* [cherry-pick] update doc by bukejiyu in https://github.com/PaddlePaddle/PaddleNLP/pull/10043
* [CI] fix install issue for requirements-dev.txt by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/10051
* [Doc] 支持用户自行下载静态图 by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/10046
* [cherry-pick] (PR10034 [server]Add a model download script and fix bugs for the server) by bukejiyu in https://github.com/PaddlePaddle/PaddleNLP/pull/10035
* [LLM] 增加版本号 by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/10056
* check mtp triton cache by ckl117 in https://github.com/PaddlePaddle/PaddleNLP/pull/10065
* Update version setup.py by ZHUI in https://github.com/PaddlePaddle/PaddleNLP/pull/10070
* [Doc] 文档完善,新增模型环境设备需求 by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/10073
* change_h100_to_h800 by yuanlehome in https://github.com/PaddlePaddle/PaddleNLP/pull/10091
* [Doc] 完善文档,更新示例模型 by ltd0924 in https://github.com/PaddlePaddle/PaddleNLP/pull/10085
* 【Serving】Fix serving bug release by freeliuzc in https://github.com/PaddlePaddle/PaddleNLP/pull/10101
New Contributors
* thinking-computer made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9571
* Wangzheee made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9569
* lcykww made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9562
* shang-mt made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9666
* dfmz759837901 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9411
* jie-z-0607 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9673
* aooxin made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9711
* zty-king made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9853
* Fantasy-02 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9845
* zhaohaixu made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9921
* XuLingnan made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9598
* micelvrice made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/9970
**Full Changelog**: https://github.com/PaddlePaddle/PaddleNLP/compare/v3.0.0-beta3...v3.0.0-beta4