
DeepSeek Unveils V3.2-Exp LLM to Slash Costs and Reduce Dependence on NVIDIA CUDA

DeepSeek, one of China’s leading AI companies, has officially launched its latest large language model, DeepSeek-V3.2-Exp, on September 29, 2025. The new model, now available on Hugging Face with open-source code and checkpoints, is designed to significantly reduce long-context inference costs through an advanced sparse attention mechanism that minimizes memory and computation requirements while maintaining high-quality outputs.

Built on the V3.1-Terminus architecture, V3.2-Exp introduces DeepSeek Sparse Attention (DSA). The system pairs a Lightning Indexer, which maintains a compact key cache of just 128 entries per token (compared to 512 in the model's Multi-head Latent Attention, MLA), with a fine-grained selection stage that forwards only the 2,048 most relevant tokens from a long input sequence to the full attention computation. These optimizations yield faster training, improved energy efficiency, and benchmark performance on par with V3.1-Terminus.
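The core idea, scoring cached tokens with a cheap indexer and running full attention only over the top-ranked subset, can be sketched in a few lines of Python. This is an illustrative toy, not DeepSeek's implementation: the `index_scores` array stands in for the Lightning Indexer's output, and the tiny sizes (8 cached tokens, top-4 selection) replace the model's actual 2,048-token budget.

```python
import numpy as np

def sparse_attention(q, keys, values, index_scores, k=4):
    """Attend only to the top-k cached tokens ranked by an indexer score.

    q: (d,) query vector; keys/values: (T, d) caches; index_scores: (T,)
    relevance scores from a lightweight indexer (a hypothetical stand-in
    for DeepSeek's Lightning Indexer).
    """
    # 1. Selection pass: keep the k highest-scoring token positions.
    topk = np.argsort(index_scores)[-k:]
    # 2. Ordinary softmax attention, but only over the selected tokens,
    #    so cost scales with k rather than the full context length T.
    logits = keys[topk] @ q / np.sqrt(len(q))
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[topk]

# Toy example: 8 cached tokens, attend over the 4 highest-scoring ones.
rng = np.random.default_rng(0)
T, d = 8, 16
q = rng.normal(size=d)
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))
index_scores = rng.normal(size=T)
out = sparse_attention(q, keys, values, index_scores, k=4)
print(out.shape)  # (16,)
```

Because the expensive attention step touches a fixed number of tokens regardless of context length, memory and compute for very long inputs grow with the selection budget rather than the full sequence, which is the source of the cost savings the release emphasizes.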

The industry response was immediate. Huawei's Ascend team and the vLLM-Ascend community quickly updated their repositories with custom operators and kernels to support V3.2-Exp on Ascend NPUs, and Huawei's CANN team released inference solutions tuned for its hardware. Together these enabled same-day deployment across Chinese platforms, reducing dependence on NVIDIA CUDA.

Other domestic chipmakers followed suit. Cambricon updated its vLLM-MLU fork to integrate V3.2-Exp, reporting significant cost reductions for long-sequence inference. Hygon announced that its DCU accelerators support zero-wait deployment through the DTK software stack, allowing the model to run on its hardware without additional adaptation work.

SGLang confirmed support for V3.2-Exp across multiple backends, including Ascend, while DeepSeek's GitHub documentation notes day-one compatibility with vLLM, with open-source kernels provided in both TileLang and CUDA.

This rapid ecosystem adoption underscores China’s accelerating efforts to build an AI infrastructure independent of NVIDIA hardware. While CUDA remains dominant for both training and inference workloads, DeepSeek-V3.2-Exp stands out as one of the first major Chinese LLMs optimized for non-CUDA environments from day one. Its release marks a pivotal shift in the global AI landscape, signaling a future where high-performance AI no longer relies exclusively on Western GPU technology.
