Release note#
v0.7.3rc2#
This is the 2nd release candidate of v0.7.3 for vllm-ascend. Please follow the official doc to start the journey.
Quickstart with container: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/quick_start.html
Installation: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/installation.html
Highlights#
Add Ascend Custom Ops framework. Developers can now write custom ops using AscendC. An example op, `rotary_embedding`, is added. More tutorials will come soon. Custom Ops compilation is disabled by default when installing vllm-ascend. Set `COMPILE_CUSTOM_KERNELS=1` to enable it (see the build sketch below this list). #371
V1 engine is basically supported in this release. Full support will be done in the 0.8.X release. If you hit any issue or have any requirement for the V1 engine, please tell us here. #376
Prefix cache feature works now. You can set `enable_prefix_caching=True` to enable it (see the usage sketch below this list). #282
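A minimal sketch of enabling custom-kernel compilation when installing vllm-ascend from a source checkout. Only the `COMPILE_CUSTOM_KERNELS=1` flag comes from this release note; the editable `pip install` invocation and the assumption that you run it inside a source checkout are illustrative.

```python
# Sketch: install vllm-ascend from source with custom AscendC kernels compiled.
# Assumes the current working directory is a vllm-ascend source checkout.
import os
import subprocess
import sys

env = os.environ.copy()
env["COMPILE_CUSTOM_KERNELS"] = "1"  # custom-op compilation is off by default

# Editable install is just one way to build; any pip install of the source works.
subprocess.check_call([sys.executable, "-m", "pip", "install", "-e", "."], env=env)
```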
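A usage sketch for prefix caching through the standard vLLM `LLM` entry point. The model name and prompts are placeholders; only `enable_prefix_caching=True` comes from this release note.

```python
# Sketch: turn on prefix caching so requests sharing a long common prefix
# can reuse cached KV blocks. Model and prompts are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    enable_prefix_caching=True,        # feature enabled by #282
)

prompts = [
    "Long shared system prompt... question A",
    "Long shared system prompt... question B",
]
for out in llm.generate(prompts, SamplingParams(max_tokens=32)):
    print(out.outputs[0].text)
```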
Core#
Bump torch_npu version to dev20250320.3 to improve accuracy and fix the `!!!` output problem. #406
Model#
The performance of Qwen2-VL is improved by optimizing the patch embedding (Conv3D). #398
Other#
v0.7.3rc1#
🎉 Hello, World! This is the first release candidate of v0.7.3 for vllm-ascend. Please follow the official doc to start the journey.
Quickstart with container: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/quick_start.html
Installation: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/installation.html
Highlights#
DeepSeek V3/R1 works well now. Read the official guide to start! #242
Speculative decoding feature is supported. #252
Multi step scheduler feature is supported. #300
Core#
Bump torch_npu version to dev20250308.3 to improve `_exponential` accuracy.
Added initial support for pooling models. Bert-based models, such as `BAAI/bge-base-en-v1.5` and `BAAI/bge-reranker-v2-m3`, work now (see the embedding sketch below this list). #229
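A minimal sketch of running one of the Bert-based pooling models mentioned above. The `task="embed"` argument and the `embed()` call follow upstream vLLM's pooling-model interface and are assumptions here; exact method names can vary slightly between vLLM versions.

```python
# Sketch: serve an embedding model on the pooling path.
from vllm import LLM

llm = LLM(model="BAAI/bge-base-en-v1.5", task="embed")  # pooling model from this note

outputs = llm.embed(["vllm-ascend runs pooling models on Ascend NPUs."])
# Each output carries the pooled embedding vector for its prompt.
print(len(outputs[0].outputs.embedding))
```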
Model#
Other#
Support MTP (Multi-Token Prediction) for DeepSeek V3/R1. #236
[Docs] Added more model tutorials, including DeepSeek, QwQ, Qwen and Qwen2.5-VL. See the official doc for details.
Pin modelscope<1.23.0 on vLLM v0.7.3 to resolve: https://github.com/vllm-project/vllm/pull/13807
Known issues#
In some cases, especially when the input/output is very long, the model output may be inaccurate. We are working on it; it'll be fixed in the next release.
Garbled model output has been improved and reduced. If you still hit the issue, try changing a generation config value, such as `temperature`, and try again (a sketch follows this list). There is also a known issue shown below. Any feedback is welcome. #277
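A sketch of the workaround above using standard vLLM sampling parameters; the model name and the concrete values are only examples, not recommendations from this release.

```python
# Sketch: tweak generation config values if you see garbled output.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model

# Lowering temperature (and optionally tightening top_p) is the suggested
# workaround; the values below are illustrative.
params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
out = llm.generate(["Explain prefix caching in one sentence."], params)
print(out[0].outputs[0].text)
```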
v0.7.1rc1#
🎉 Hello, World!
We are excited to announce the first release candidate of v0.7.1 for vllm-ascend.
vLLM Ascend Plugin (vllm-ascend) is a community maintained hardware plugin for running vLLM on the Ascend NPU. With this release, users can now enjoy the latest features and improvements of vLLM on the Ascend NPU.
Please follow the official doc to start the journey. Note that this is a release candidate, and there may be some bugs or issues. We appreciate your feedback and suggestions here.
Highlights#
Core#
Other#
Known issues#
This release relies on an unreleased torch_npu version. It has already been installed within the official container image. Please install it manually if you are using a non-container environment.
There are logs like `No platform detected, vLLM is running on UnspecifiedPlatform` or `Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")` shown when running vllm-ascend. This doesn't affect any functionality or performance; you can just ignore it. It has been fixed in this PR, which will be included in v0.7.3 soon.
There are logs like `# CPU blocks: 35064, # CPU blocks: 2730` shown when running vllm-ascend which should read `# NPU blocks:`. This doesn't affect any functionality or performance; you can just ignore it. It has been fixed in this PR, which will be included in v0.7.3 soon.