- Upgrade optimum-habana diffusers dependency from 0.26.3 to 0.29.2 1150 dsocek
Stable Diffusion 3
- SD3 1153 dsocek
- Refactor SD3 1199 dsocek
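A minimal inference sketch for the new SD3 support, assuming `GaudiStableDiffusion3Pipeline` mirrors the diffusers `StableDiffusion3Pipeline` API and takes the same `use_habana`/`use_hpu_graphs`/`gaudi_config` kwargs as the existing Gaudi diffusers pipelines:

```python
# Sketch only: assumes GaudiStableDiffusion3Pipeline follows the same
# from_pretrained/__call__ conventions as the other Gaudi diffusers pipelines.
import torch
from optimum.habana.diffusers import GaudiStableDiffusion3Pipeline

pipeline = GaudiStableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.bfloat16,
    use_habana=True,
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion",
)
image = pipeline(
    prompt="An astronaut riding a green horse",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sd3_output.png")
```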
Training with Sentence Transformers
- Enable Sentence Transformer Trainer with Gaudi 1111 ZhengHongming888
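A minimal training sketch, under the assumption that the Gaudi port keeps the upstream `SentenceTransformerTrainer` API; the class names and the `gaudi_config_name` value below are assumptions:

```python
# Sketch only: class names assumed to mirror sentence-transformers' trainer API.
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses
from optimum.habana.sentence_transformers import (
    SentenceTransformerGaudiTrainer,
    SentenceTransformerGaudiTrainingArguments,
)

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_dataset = Dataset.from_dict({
    "sentence1": ["A plane is taking off.", "A man is playing a flute."],
    "sentence2": ["An air plane is taking off.", "A man plays a flute."],
    "score": [1.0, 0.9],
})

args = SentenceTransformerGaudiTrainingArguments(
    output_dir="./st-gaudi-out",
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name="Habana/distilbert-base-uncased",  # assumed: any suitable Gaudi config
    num_train_epochs=1,
)
trainer = SentenceTransformerGaudiTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=losses.CosineSimilarityLoss(model),
)
trainer.train()
```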
Model optimizations
- Fix starcoder2 accuracy issue and optimize performance with fused rope 1095 mandy-li
- Enable FusedRoPE using float32 for gpt-neox model 1104 yeonsily
- Initial Mamba enablement 1122 libinta
- Add fused QKV support along with config 1102 bhargaveede
- Enhance Qwen2 with fast softmax, bf16 RoPE and cache optimization 1087 Zhiwei35
- Enable FP8 inference for Llava-Next and add FusedSDPA 1120 tthakkal
- Support bucket_internal for MPT 1137 pk1d3v
- Enable Flash Attention (Fused SDPA) for Starcoder 1114 abhilash1910
- gpt_bigcode: added FusedSDPA kernel 1138 mgonchar
- Enable torch.compile for Granite20B 1185 dvarshney-habana
- Refine use_cache handling for the MPT model 1158 Jing1Ling
- Support reuse_cache in GPT-J 1094 atakaha
- Use fast softmax only on prefill 1159 jaygala223
- Starcoder2: KV cache and flash attention (FusedSDPA) enablement 1149 abhatkal
- GPT-BigCode: fused SDPA 1260 yeonsily
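Most of these optimizations surface as generation-time switches in the text-generation example. A hedged sketch of what enabling them can look like; the exact kwarg names follow the example's flags and should be treated as assumptions:

```python
# Sketch only: kwarg names (use_flash_attention, reuse_cache, lazy_mode) are
# assumptions based on the flags exposed by examples/text-generation.
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()  # patch transformers with the Gaudi-optimized code paths

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-7b")
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-7b", torch_dtype=torch.bfloat16
).to("hpu")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("hpu")
out = model.generate(
    **inputs,
    max_new_tokens=64,
    use_flash_attention=True,  # assumed kwarg: Fused SDPA / flash attention path
    reuse_cache=True,          # assumed kwarg: preallocate and reuse the KV cache
    lazy_mode=True,            # assumed kwarg: run in HPU lazy mode
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```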
SAM, FastViT, VideoMAE, OpenCLIP, DETR, Table Transformer, DeciLM
- Add an example of Segment Anything Model [Inference] 814 cfgfung
- Add an example of FastViT model (Inference) 826 cfgfung
- VideoMAE model enablement and examples 922 pi314ever
- OpenCLIP sample for visual question answering 977 vidyasiv
- Enabled DETR (Object Detection) model 1046 cfgfung
- Table Transformer enablement 978 pi314ever
- DeciLM support 1133 sywangyi
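Each of these ships with a dedicated example script in the repository. As a flavor, here is a sketch of SAM inference on HPU using the standard transformers API; the checkpoint choice and the plain `.to("hpu")` usage are illustrative, not the exact settings of the maintained example:

```python
# Sketch only: standard transformers SAM usage, run on the HPU device.
import requests
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
from PIL import Image
from transformers import SamModel, SamProcessor

model = SamModel.from_pretrained("facebook/sam-vit-huge").to("hpu")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
inputs = processor(image, input_points=[[[450, 600]]], return_tensors="pt").to("hpu")

with torch.no_grad():
    outputs = model(**inputs)

# Resize the predicted masks back to the original image resolution.
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
print(masks[0].shape)
```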
Stable Diffusion inpainting, unconditional image generation
- Add Stable Diffusion inpainting support 869 yuanwu2017
- Enable Unconditional Image Generation on Gaudi 2 [Diffuser/Tasks] 859 cfgfung
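A hedged inpainting sketch, assuming the new pipeline is exposed as `GaudiStableDiffusionInpaintPipeline` with the usual Gaudi pipeline kwargs:

```python
# Sketch only: the pipeline class name and kwargs are assumed to follow the
# existing Gaudi diffusers pipelines.
import torch
from diffusers.utils import load_image
from optimum.habana.diffusers import GaudiStableDiffusionInpaintPipeline

pipe = GaudiStableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.bfloat16,
    use_habana=True,
    use_hpu_graphs=True,
    gaudi_config="Habana/stable-diffusion",
)

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
image = pipe(
    prompt="Face of a yellow cat, high resolution",
    image=load_image(img_url),
    mask_image=load_image(mask_url),  # white pixels are repainted
).images[0]
image.save("inpainted.png")
```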
Text feature extraction example
- Feature extraction enabling 994 pi314ever
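Conceptually, the new example embeds text by running an encoder and pooling its hidden states. A minimal sketch on HPU; the model choice and mean pooling here are illustrative, not the example's exact settings:

```python
# Sketch only: plain transformers feature extraction, run on the HPU device.
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2").to("hpu")

inputs = tokenizer("Gaudi runs feature extraction.", return_tensors="pt").to("hpu")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

# Mean-pool the token embeddings, ignoring padding.
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)
```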
Tensor parallelism
- Tensor-parallel distributed strategy without using DeepSpeed 1121 kalyanjk
- Disable torch.compile for all_reduce when parallel_strategy is set to "tp" 1174 kalyanjk
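The strategy is selected by setting `parallel_strategy` to `"tp"` (see the items above). For intuition, here is a self-contained sketch of the underlying idea: a linear layer's weight is sharded column-wise across ranks and the partial outputs are recombined with all-gather. This is a conceptual illustration, not optimum-habana's implementation:

```python
# Conceptual sketch of tensor parallelism: each rank holds one shard of the
# weight, computes a partial output, and the shards are concatenated.
# Assumes torch.distributed is already initialized with one process per
# Gaudi card (e.g., using the hccl backend).
import torch
import torch.distributed as dist

def tensor_parallel_linear(x: torch.Tensor, full_weight: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    # Shard the output dimension across ranks: each rank computes one slice.
    shard = full_weight.chunk(world_size, dim=0)[rank]
    partial = x @ shard.t()
    gathered = [torch.empty_like(partial) for _ in range(world_size)]
    dist.all_gather(gathered, partial)
    return torch.cat(gathered, dim=-1)
```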
Kubernetes cluster example
- Add a Helm chart, Dockerfile, and instructions for running examples on a Kubernetes cluster 1099 dmsuehir
- Fix PyTorch version in the Kubernetes docker-compose to match image 1246 dmsuehir
FP8 training
- TE FP8 integration 1096 SanjuCSudhakaran
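A hedged sketch of turning on Transformer Engine FP8 through the trainer; the `fp8` switch on `GaudiTrainingArguments` is an assumption based on this release's integration, the rest follows the usual optimum-habana trainer setup:

```python
# Sketch only: the fp8 flag name is an assumption; gaudi_config_name should be
# a Gaudi config suited to the model being fine-tuned.
from optimum.habana import GaudiTrainer, GaudiTrainingArguments

args = GaudiTrainingArguments(
    output_dir="./fp8-out",
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name="Habana/llama",  # assumed config name
    fp8=True,  # assumed flag: wrap eligible layers with Transformer Engine FP8
)
# trainer = GaudiTrainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```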
Other
- Updates run_lora_clm.py with enhanced dataset support 955 dmsuehir
- Fix prefix tuning finetune issue and update test 975 sywangyi
- Fix throughput calculation in image-to-text example 1070 regisss
- SDXL training: fixed CI, changed gated dataset, fixes for non-square datasets 1038 imangohari1
- Updating batch_size of Albert-XXL in README 1063 vineethanandh
- Fix the error when running run_pipeline.py in the text-generation example 1055 yuanwu2017
- Add a test for llama finetuning with FP8 precision 1106 SanjuCSudhakaran
- Beam-search fix 1113 ssarkar2
- Add chat format support dataset in SFT 1066 libinta
- Fix nan loss of gemma and crash if dataset_concatenation is not set 1088 sywangyi
- torch.compile: keep input mutation in the graph to avoid unnecessary memcpy 1069 sushildubey171
- Update LangChain text-generation pipeline to work with the latest release (0.2.5) 1084 rbrugaro
- Add the MC example 891 yuanwu2017
- Fix recompiles if limit_hpu_graph is False 1129 ssarkar2
- Update examples batchsize in README 1123 shepark
- Fix OOM error in SDXL Fine-Tuning validation stage 1134 dsocek
- Added an example code to demonstrate how to use deterministic image generation 878 cfgfung
- SD image variation/InstructPix2Pix/StableDiffusionXLImg2ImgPipeline pipeline 988 sywangyi
- Add CI tests for TRL reward modeling and PPO, fix backward failure in PPO caused by RMS fusion 1020 sywangyi
- Add Llama-Adapter support 983 sywangyi
- torch.flip issue is fixed in SynapseAI 1.16, so remove the workaround 1092 sywangyi
- Fix test CausalLanguageModelingLORAExampleTester KeyError 1139 dmsuehir
- fix(ci): new runs-on 1136 XciD
- Add trust_remote_code for loading datasets in the audio classification example 1074 regisss
- Generation example: print number of warmup iterations 1145 mgonchar
- CI updates: text-gen to receive ranks/bs, updated bs/metric for baselines 1140 imangohari1
- Support for custom files for run_lora_clm.py 1039 vidyasiv
- Change the device_id for FSDP plugin 1086 ckvermaAI
- Set KV Cache update as static method 1160 ulivne
- Fix CPU tensor issue 1157 mkumargarg
- Add missing `__init__.py` to mistral and mixtral test packages 1188 rkumar2patel
- Add example of multitask_prompt/poly tuning 915 sywangyi
- Fix data-type mismatch for mlperf_inference accuracy test 1146 kalyanjk
- Fix spawn MP context, limit cpu and download data 1131 polisettyvarma
- T5 multi card 1222 yafshar
- Add trust_remote_code for t5 poly-tuning test 1220 yafshar
- Resolve "empty tensor optional" error with hpu_graphs + kv cache for StarCoder 1181 vidyasiv
- Fix ViT, add wav2vec comment 1223 ssarkar2
- Fix RoBERTa tests that were running on CPU 1229 ssarkar2
- Fix bert/roberta contrastive search tests 1226 skavulya
- Remove the default env variable to trust remote code by default 1225 yafshar
- Improve style check workflow 1230 regisss
- Added scheduler selection for SDXL fine-tuning 867 kplau1128
- Clarify help message for ignore_eos to avoid misunderstanding sywangyi
- Support loading Hugging Face checkpoints 1165 ulivne
- Change triggering event for code style check 1238 regisss
- gptj: fix missing token_idx 1234 envsp
- fix(nltk): pin the version to a working one 1247 imangohari1
- Updating to avoid hardcoding tests in CI framework 1221 vidyasiv
- Fix FSDP graph error due to Transformers 4.43 update 1251 jiminha
- Fix SD README commands 1250 imangohari1
- Fix spelling errors 1252 changwangss
- Set HLS_MODULE_ID only if it wasn't set previously 1254 astachowiczhabana
- Fix overflow of steps in SDXL for default diffusers scheduler dsocek
- fix(test_diffusers): automated the checking for tests without upstream HF 1232 imangohari1
- fix(nltk): revert 1247, update the version, add the punkt_tab download 1258 imangohari1
- Set input_embeds before it gets used 1261 tthakkal
- Update README and more changes, rebase to main 1259 shepark
Known limitations
- For Llama, some large batch sizes lead to out-of-memory errors although they previously worked