8. Upgraded node to version 18 #2663 @agunapal
## Blogs
+ [High performance Llama 2 deployments with AWS Inferentia2 using TorchServe](https://pytorch.org/blog/high-performance-llama/)
+ [ML Model Server Resource Saving - Transition From High-Cost GPUs to Intel CPUs and oneAPI powered Software with performance](https://pytorch.org/blog/ml-model-server-resource-saving/)
+ [Run multiple generative AI models on GPU using Amazon SageMaker multi-model endpoints with TorchServe and save up to 75% in inference costs](https://aws.amazon.com/blogs/machine-learning/run-multiple-generative-ai-models-on-gpu-using-amazon-sagemaker-multi-model-endpoints-with-torchserve-and-save-up-to-75-in-inference-costs/)
## New Features
+ Supported PyTorch 2.1.0 and Python 3.11 #2621 #2691 #2697 @agunapal
+ Supported continuous batching for LLM inference #2628 @mreso @lxning (see the client sketch after this list)
+ Supported dynamic loading of third-party packages on SageMaker Multi-Model Endpoint #2535 @lxning
+ Added a DALI handler for preprocessing and updated the Nvidia DALI example #2485 @jagadeeshi2i
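Continuous batching merges requests that arrive mid-generation into the in-flight decoding batch, so concurrent clients share the same GPU work. A minimal client sketch, assuming a model registered under the hypothetical name `llama2` with `continuousBatching: true` set in its model-config.yaml:

```python
# Sketch: fire concurrent prompts at a TorchServe endpoint whose model was
# registered with continuous batching enabled (continuousBatching: true in
# model-config.yaml). The model name "llama2" is an assumption.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/predictions/llama2"  # default inference port

def infer(prompt: str) -> str:
    # Each in-flight request can be merged into the running batch server-side.
    return requests.post(URL, data=prompt, timeout=300).text

prompts = ["What is TorchServe?", "Explain continuous batching in one line."]
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, reply in zip(prompts, pool.map(infer, prompts)):
        print(f"{prompt!r} -> {reply!r}")
```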
## New Examples
1. Deploy Llama2 on Inferentia2 #2458 @namannandan
2. [Using TorchServe on SageMaker Inf2.24xlarge with Llama2-13B](https://github.com/aws/amazon-sagemaker-examples-community/blob/main/torchserve/inf2/llama2/llama-2-13b.ipynb) @lxning
3. PyTorch tensor parallelism on Llama2 example #2623 #2689 @HamidShojanazeri
4. Enabled Better Transformer (i.e., Flash Attention 2) on Llama2 #2700 @HamidShojanazeri @lxning (see the sketch after this list)
5. Llama2 chatbot on Mac #2618 @agunapal
6. Speech recognition (ASR) example #2047 @husenzhang
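For example 4, Better Transformer swaps the eager attention path for fused kernels such as Flash Attention. A minimal sketch of the core change, assuming the Hugging Face `transformers` and `optimum` packages are installed; the checkpoint name is illustrative (Llama2 weights are gated):

```python
# Sketch: enable Better Transformer (fused/flash attention kernels) on a
# Hugging Face causal LM. Requires `pip install transformers optimum`.
# The checkpoint below is illustrative; Llama2 weights need access approval.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)
model = model.to_bettertransformer()  # swap in the fused attention path
```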
## Improvements
+ Fixed typo in BaseHandler #2547 @a-ys
+ Created merge_queue workflow for CI #2548 @msaroufim
+ Fixed typo in artifact terminology unification #2551 @park12sj
+ Added env hints in model_service_worker #2540 @ZachOBrien
+ Refactored conda build scripts to publish all binaries #2561 @agunapal
+ Fixed response return type in KServe #2566 @jagadeeshi2i
+ Added torchserve-kfs nightly build #2574 @jagadeeshi2i
+ Added regression tests for all CPU binaries #2562 @agunapal
+ Updated CI/CD runners #2586 #2597 #2636 #2627 #2677 #2710 #2696 @agunapal @msaroufim
+ Upgraded newman version to 5.3.2 #2598 #2603 @agunapal
+ Updated OPT benchmark config for Inf2 #2617 @namannandan
+ Added ModelRequestEncoderTest #2580 @abergmeier
+ Added manually dispatched workflow #2686 @msaroufim
+ Updated test wheels with PyTorch 2.1.0 #2684 @agunapal
+ Allowed parallel level = 1 to run in torchrun mode #2608 @lxning
+ Fixed backward compatibility of metric unit assignment #2693 @namannandan
## Documentation
+ Updated MPS readme #2543 @sekyondaMeta
+ Updated large model inference readme #2542 @sekyondaMeta
+ Fixed bash snippets in examples/image_classifier/mnist/Docker.md #2345 @dmitsf
+ Fixed typo in kubernetes/autoscale.md #2393 @CandiedCode
+ Fixed path in examples/image_classifier/resnet_18/README.md #2568 @udaij12
+ Added model loading guidance #2592 @agunapal
+ Updated Metrics readme #2560 @sekyondaMeta
+ Displayed nightly workflow status badge in README #2619 #2666 @agunapal @msaroufim
+ Updated torch.compile information in examples/pt2/README.md #2706 @agunapal (see the sketch after this list)
+ [Deploy model using TorchServe on SageMaker tutorial](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-models-frameworks-torchserve.html) @lxning
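The torch.compile guidance in `examples/pt2/README.md` boils down to compiling the eager model before inference. A standalone sketch of the idea (not the actual handler wiring from the example):

```python
# Sketch: compile a model with the inductor backend before running inference.
# TorchServe's pt2 example wires this into a handler; this is the bare idea.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
compiled = torch.compile(model, backend="inductor")

with torch.inference_mode():
    out = compiled(torch.randn(1, 3, 224, 224))  # first call triggers compilation
print(out.shape)  # torch.Size([1, 1000])
```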
## Platform Support
Ubuntu 16.04, Ubuntu 18.04, Ubuntu 20.04, macOS 10.14+, Windows 10 Pro, Windows Server 2019, and Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK 17.
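A quick sketch for checking the new minimums before installing; the JDK detection parses the `java -version` banner, whose format varies by vendor, so treat it as best-effort:

```python
# Best-effort check for the new minimums: Python >= 3.8 and JDK 17.
import re
import subprocess
import sys

assert sys.version_info >= (3, 8), "TorchServe requires Python 3.8 or above"

# `java -version` prints its banner to stderr, e.g. 'openjdk version "17.0.8"'.
banner = subprocess.run(["java", "-version"], capture_output=True, text=True).stderr
match = re.search(r'version "(\d+)', banner)
assert match and int(match.group(1)) >= 17, "TorchServe requires JDK 17"
print("Environment OK")
```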
## GPU Support