Latest update (March 17, 2023): [funasr-0.3.0](https://github.com/alibaba-damo-academy/FunASR/tree/main), modelscope-1.4.1

- New Features:
- Added a GPU runtime solution, [nv-triton](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/triton_gpu), which makes it easy to export Paraformer models from ModelScope and deploy them as Triton services. Benchmarked on a single V100 GPU, it reached an RTF of 0.0032 with a throughput rate of 300, [benchmark](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/triton_gpu#performance-benchmark).
- Added a CPU runtime [quantization solution](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/export), which exports quantized ONNX and Libtorch models from ModelScope (see the export sketch after this list). In [benchmark](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python) tests on a CPU-8369B, quantization improved the RTF by 50% (0.00438 -> 0.00226) and doubled the throughput (228 -> 442).
- Added a C++ gRPC service deployment solution. Combined with the C++ [ONNXRuntime runtime](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/onnxruntime) and the quantization solution, it doubles the performance of the Python runtime, [demo](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/grpc).
- Added a streaming inference pipeline to the [16k VAD model](https://www.modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) and [8k VAD model](https://www.modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-8k-common/summary), supporting audio input streams in chunks as small as 10 ms (see the streaming sketch after this list), [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/236).
- Improved the [punctuation prediction model](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary), raising punctuation accuracy (F-score up from 55.6 to 56.5).
- Added a real-time subtitle example based on the gRPC service, using a 2-pass recognition setup: the [Paraformer streaming model](https://www.modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online/summary) outputs text in real time, while the [Paraformer-large offline model](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) corrects the recognition results, [demo](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/grpc).
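As a quick illustration of the quantized-export workflow mentioned above, here is a minimal sketch that drives FunASR's export entry point from Python. The flag names (`--model-name`, `--export-dir`, `--type`, `--quantize`) follow the linked export directory's documentation; treat them as assumptions and consult that page if they differ in your version.

```python
# Hedged sketch: export a quantized ONNX model via FunASR's export entry
# point. Flag names are assumptions taken from the linked export docs.
import subprocess

subprocess.run(
    [
        "python", "-m", "funasr.export.export_model",
        "--model-name", "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
        "--export-dir", "./export",   # where the exported files are written
        "--type", "onnx",             # or "torch" for a Libtorch export
        "--quantize", "True",         # also emit the quantized variant
    ],
    check=True,
)
```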
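And a minimal sketch of the new streaming VAD pipeline, assuming 16 kHz mono input. The `param_dict` keys (`in_cache`, `is_final`) and the 100 ms chunking follow the pattern in the linked discussion and are assumptions rather than a verified API reference; `example.wav` is a placeholder.

```python
# Hedged sketch: streaming VAD via the ModelScope pipeline, fed 100 ms
# chunks of 16 kHz mono audio. The param_dict keys below are assumptions.
import soundfile
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

vad = pipeline(
    task=Tasks.voice_activity_detection,
    model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
)

speech, sample_rate = soundfile.read("example.wav")  # placeholder input
chunk = 1600  # 100 ms at 16 kHz; the pipeline accepts chunks down to 10 ms
param_dict = {"in_cache": {}}  # persistent state across chunks (assumption)

for start in range(0, len(speech), chunk):
    param_dict["is_final"] = start + chunk >= len(speech)
    result = vad(audio_in=speech[start:start + chunk], param_dict=param_dict)
    if result:  # finalized [start_ms, end_ms] segments, when available
        print(result)
```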
- New Models:
- Added the [16k Paraformer streaming model](https://www.modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online/summary), which supports real-time speech recognition on streaming audio input, [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/241). It can be deployed via the gRPC service to implement real-time subtitles (see the streaming recognition sketch after this list).
- Added a [streaming punctuation model](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727/summary), which adds punctuation in streaming speech recognition scenarios and is invoked at VAD-detected endpoints. Used together with real-time ASR models, it produces readable real-time subtitles, [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/238).
- Added the [TP-Aligner timestamp model](https://www.modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary), which takes audio and its transcript as input and outputs word-level timestamps, with accuracy comparable to the Kaldi FA model (60.3 ms vs. 69.3 ms). It can be freely combined with ASR models (see the timestamp sketch after this list), [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/246).
- Added a financial-domain model ([8k Paraformer-large-3445vocab](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-8k-finance-vocab3445/summary)), fine-tuned on 1,000 hours of data. Recognition accuracy on the financial-domain test set improved by 5% relative, and the recall of domain keywords improved by 7% relative.
- Added an audio-visual-domain model ([16k Paraformer-large-3445vocab](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-audio_and_video-vocab3445/summary)), fine-tuned on 10,000 hours of data. Recognition accuracy on the audio-visual-domain test set improved by 8% relative.
- Added an [8k speaker verification model](https://www.modelscope.cn/models/damo/speech_xvector_sv-en-us-callhome-8k-spk6135-pytorch/summary), an English model trained on the CallHome dataset, which can also be used for speaker embedding extraction.
- Added speaker diarization models, the [16k SOND Chinese model](https://www.modelscope.cn/models/damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/summary) and the [8k SOND English model](https://www.modelscope.cn/models/damo/speech_diarization_sond-en-us-callhome-8k-n16k4-pytorch/summary), which achieve the best reported performance on AliMeeting and CallHome, with DERs of 4.46% and 11.13%, respectively.
- Added UniASR models that unify streaming and offline recognition: [16k UniASR Burmese](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch/summary), [16k UniASR Hebrew](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch/summary), [16k UniASR Urdu](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch/summary), [8k UniASR Mandarin financial domain](https://www.modelscope.cn/models/damo/speech_UniASR_asr_2pass-zh-cn-8k-finance-vocab3445-online/summary), and [16k UniASR Mandarin audio-visual domain](https://www.modelscope.cn/models/damo/speech_UniASR_asr_2pass-zh-cn-16k-audio_and_video-vocab3445-online/summary).
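For the streaming Paraformer model above, a hedged sketch of chunk-wise recognition with the ModelScope pipeline. The chunk size and `param_dict` cache keys mirror the linked demo and are assumptions, not a verified API reference; `example.wav` is a placeholder.

```python
# Hedged sketch: chunk-wise streaming recognition with the 16k Paraformer
# online model. Chunk size and cache keys are assumptions from the demo.
import soundfile
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

asr = pipeline(
    task=Tasks.auto_speech_recognition,
    model="damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online",
)

speech, _ = soundfile.read("example.wav")  # placeholder 16 kHz input
chunk = 9600  # 600 ms at 16 kHz (assumption; see the linked demo)
param_dict = {"cache": {}}  # decoder state carried across chunks (assumption)

for start in range(0, len(speech), chunk):
    param_dict["is_final"] = start + chunk >= len(speech)
    partial = asr(audio_in=speech[start:start + chunk], param_dict=param_dict)
    print(partial)  # incremental text, suitable for on-screen display
```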
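And a minimal sketch of the TP-Aligner timestamp model, which takes audio plus its transcript. The `Tasks.speech_timestamp` constant and the `text_in` keyword follow the linked demo and are assumptions.

```python
# Hedged sketch: word-level timestamp prediction from audio plus transcript.
# The task constant and text_in keyword are assumptions from the demo.
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

tp = pipeline(
    task=Tasks.speech_timestamp,
    model="damo/speech_timestamp_prediction-v1-16k-offline",
)

result = tp(
    audio_in="example.wav",          # placeholder 16 kHz audio
    text_in="今 天 天 气 怎 么 样",  # space-separated transcript
)
print(result)  # word-level [start_ms, end_ms] timestamps
```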
New Contributors
* dingbig made their first contribution in https://github.com/alibaba-damo-academy/FunASR/pull/147
* yuekaizhang made their first contribution in https://github.com/alibaba-damo-academy/FunASR/pull/161
* zhuzizyf made their first contribution in https://github.com/alibaba-damo-academy/FunASR/pull/180
* znsoftm made their first contribution in https://github.com/alibaba-damo-academy/FunASR/pull/185
* songtaoshi made their first contribution in https://github.com/alibaba-damo-academy/FunASR/pull/227
**Full Changelog**: https://github.com/alibaba-damo-academy/FunASR/compare/v0.2.0...v0.3.0