Researchers at DeepSeek on Monday released a new experimental model, V3.2-exp, designed to have dramatically lower inference costs when used in long-context operations. DeepSeek announced the model in a post on Hugging Face, which also linked to an academic paper, hosted on GitHub, detailing its architecture and performance.
The model's most important feature is called DeepSeek Sparse Attention. The system uses a module called a “lightning indexer” to prioritize specific excerpts from the context window. A second stage, a “fine-grained token selection system,” then chooses specific tokens from within those excerpts and loads them into the module's limited attention window. Together, the two stages let the Sparse Attention model operate over long stretches of context with comparatively small server loads.
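To make the two-stage idea concrete, here is a minimal Python sketch of that pattern: a cheap indexer scores coarse chunks of the context, then individual tokens inside the surviving chunks are ranked and only a fixed budget of them is handed to full attention. Everything here is illustrative and assumed, not DeepSeek's actual design; the chunk size, the mean-key scoring function, and the token budgets are invented for the example.

```python
# Illustrative two-stage sparse attention (NOT DeepSeek's implementation).
# Stage 1 ("lightning indexer" analogue): score coarse chunks cheaply.
# Stage 2 ("fine-grained token selection" analogue): keep top tokens
# within the surviving chunks, then attend only over those tokens.
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 4096, 64          # toy long context, small dimension
chunk = 128                          # assumed chunk size
top_chunks, top_tokens = 8, 512      # assumed selection budgets

q = rng.standard_normal(d_model)                 # current query vector
keys = rng.standard_normal((seq_len, d_model))   # cached keys
values = rng.standard_normal((seq_len, d_model)) # cached values

# Stage 1: score each chunk by the query's dot product with the
# chunk's mean key, and keep only the highest-scoring chunks.
chunk_keys = keys.reshape(-1, chunk, d_model).mean(axis=1)
kept_chunks = np.argsort(chunk_keys @ q)[-top_chunks:]

# Stage 2: inside the kept chunks, rank individual tokens and keep
# only a fixed budget of them.
token_idx = np.concatenate(
    [np.arange(c * chunk, (c + 1) * chunk) for c in kept_chunks]
)
token_scores = keys[token_idx] @ q
kept_tokens = token_idx[np.argsort(token_scores)[-top_tokens:]]

# Full attention now runs over top_tokens entries instead of seq_len,
# which is where the per-query compute savings come from.
logits = keys[kept_tokens] @ q / np.sqrt(d_model)
weights = np.exp(logits - logits.max())
weights /= weights.sum()
output = weights @ values[kept_tokens]
print(output.shape)  # (64,)
```

The key design point the sketch captures is that the expensive softmax attention only ever sees a small, fixed number of tokens per query, while the cheap indexer is the only component that touches the whole context.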
The benefits are most pronounced in long-context operations. In DeepSeek's preliminary testing, the price of a simple API call in those scenarios could be cut by as much as half. Further testing will be required for a more robust assessment, but because the model is open-weight and freely available on Hugging Face, third parties will be able to evaluate the claims made in the paper.
DeepSeek's new model is part of a string of recent breakthroughs addressing the problem of inference costs: the server expenses of operating a pre-trained AI model, as distinct from the cost of training it. In this case, DeepSeek's researchers were looking for ways to make the fundamental transformer architecture operate more efficiently, and found that significant improvements were possible.
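For intuition on why long contexts dominate inference costs: standard attention computes a score for every pair of tokens, so its cost grows quadratically with context length, while a sparse scheme with a fixed per-query token budget grows only linearly. The numbers below are a back-of-the-envelope illustration with assumed values, not figures from DeepSeek's paper.

```python
# Rough scaling comparison with assumed numbers. Dense attention does
# work proportional to seq_len**2 per layer; a sparse scheme attending
# to a fixed budget of k tokens per query does seq_len * k.
seq_len, k = 128_000, 2_048
dense = seq_len ** 2
sparse = seq_len * k
print(f"dense/sparse ratio: {dense / sparse:.0f}x")  # ~62x fewer attention scores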
Based in China, DeepSeek has been an unusual figure in the AI sector, particularly for those who view AI research as a nationalist struggle between the U.S. and China. The company gained attention at the beginning of the year with its R1 model, trained primarily using reinforcement learning at a far lower cost than its American competitors. However, the model did not spark the wholesale revolution in AI training that some predicted, and the company has receded from the spotlight in the months since.
The new “sparse attention” approach is unlikely to produce the same uproar as R1, but it could still teach U.S. providers some much-needed tricks to help keep inference costs low.