Kimi Linear, Minimax M2, Yang Songlin's "archaeology" of the history of algorithm variants, and a preview of future architecture improvements.
This episode takes on a crucial topic: **AI algorithm and architecture innovation.** Our guest is MIT PhD student Yang Songlin, who specializes in linear attention mechanisms. We dig into the newly released models Kimi Linear, Minimax M2, and Qwen3-Next. Songlin contributed to Kimi Linear and Qwen3-Next and **is a co-author of the Kimi Linear paper.**

Why does algorithm innovation matter so much in 2025? AI is driven by data, compute, and algorithms. With data running short, model companies are "sculpting" model architectures in the hope that scaling laws keep holding, and China's constrained compute **has pushed Chinese AI algorithm innovation to the forefront.**

You'll hear the argument that **DeepSeek's MoE (Mixture of Experts) work is the biggest architectural breakthrough of recent years, turning MoE into a global consensus, and that the next breakthrough may come from attention.** Chinese companies are betting on different attention techniques:

* DeepSeek is exploring sparse attention.
* Kimi is exploring linear attention.
* Minimax explored linear attention in its earlier M1 but reverted to full attention in the released M2.

Songlin discusses her work on **《Kimi Linear: An Expressive, Efficient Attention Architecture》**, analyzes these companies' attention choices, and **walks us through the history of attention-algorithm variants before predicting future algorithm and architecture improvements.**

> *This episode is technical and may be challenging; listen according to your needs. The guest's working environment mixes Chinese and English.*

* **04:00** Personal background, research focus, and how Songlin got into linear attention.
* **06:27** Songlin's open-source library: flash-linear-attention (FLA).
* **07:04** What the "linear" in linear attention means, in plain terms. (A minimal math sketch follows the show notes below.)
* **11:19** Recent work: the newly released 《Kimi Linear: An Expressive, Efficient Attention Architecture》. (Songlin was invited onto the project by Zhang Yu, another FLA author.)
* **12:20** Why did Kimi need to redesign its attention mechanism at the start of the year? The background and design goals: linear attention sharply cuts compute and memory costs at inference time, whereas full attention is very expensive for long-context decoding.
* **14:39** **Key explanation of the Kimi Linear paper: the KDA module** (Kimi Delta Attention, an "incremental", delta-rule-style attention mechanism). (A sketch of the delta rule follows the show notes below.)
* **18:56** Kimi has a scaling ladder: good performance at one scale earns a run at the next.
* **20:20** **Kimi Linear Attention vs. DeepSeek Sparse Attention:** Kimi uses linear attention, DeepSeek uses sparse attention, and both aim to make long-context decoding efficient.
* **23:01** **Minimax's architecture change from M1 to M2, reverting from linear attention to full attention:** why?
* **27:00** Silicon Valley's attention mechanisms can't be discussed in full, but OpenAI's published solutions get a brief mention.
* **28:05** How linear attention has evolved since its invention in 2020. Linear attention gets revisited whenever people hit the context wall, and the recent focus on long-context decoding is prompting another re-evaluation.
* **38:16** Pure linear attention does not work well on its own; hybrid attention keeps some global attention layers to guarantee a quality floor.
* **40:30** **Kimi Linear inserts one full-attention layer after every three KDA layers, and the 3:1 ratio is becoming a consensus.** Minimax previously used 7:1 but is gradually moving back toward 3:1, a consensus within the non-consensus of hybrid attention. (A toy layer-stacking sketch follows the show notes below.)
* **42:32** The trade-off between expressivity and efficiency.
* **Minimax also noted that hybrid linear attention / hybrid sliding-window attention falls short on "multi-hop reasoning."** The gap may narrow if we can build hardware-efficient RNNs (recurrent neural networks) with better expressivity for multi-hop reasoning.
* **46:28** The chunkwise algorithm for parallelization. (A short chunkwise sketch follows the show notes below.)
* **47:55** How to design attention: two mainstream routes and a few non-mainstream ones.
* **49:36** **A future ideal: combining linear attention and sparse attention.** The two are not competitors; linear attention's real competitor may be sliding-window attention. Industry exploration of combining linear and sparse attention seems not to have started yet. **Songlin's ideal design: replace the global (full) attention in today's hybrids with sparse attention.** Sparse attention could fully replace full attention if the relevant tokens could be selected accurately, but for now they cannot.
* **55:36** A fair comparison: linear attention vs. sliding-window attention.
* **57:05** The Transformer → MoE → linear/sparse attention line of algorithmic evolution, driven by one goal: lower loss at the same FLOPs (floating-point operations). MoE (Mixture of Experts) is a more efficient replacement for the FFN (feed-forward network). (A minimal MoE sketch follows the show notes below.)
* **58:26** **The biggest architectural breakthrough of recent years is MoE; the next may be attention. A Transformer has two modules, the FFN and attention; the FFN has been "sculpted" into MoE, and now attention can be sculpted too.**
* **01:01:28** Data, algorithms, and compute drive AI; when data is limited, algorithm innovation becomes more important.
* **01:02:48** The future of architecture: 1) Can global attention be eliminated? It is the main bottleneck blocking context-window scale-up. 2) Continual learning, letting AI keep learning on its own.
* **01:04:30** How do we keep scaling up linear-attention Transformers?
* **01:07:43** Chinese AI algorithm innovation is stronger than overseas because there are fewer cards (less compute). US companies invest more in optimizers; Chinese teams are gradually paying attention to them too.
* **01:10:56** Other training details: NoPE vs. RoPE.
* **01:12:09** DeepSeek-OCR.
* **01:12:55** Songlin also took part in Qwen3-Next, but not in Minimax M2.
* **01:13:39** The people who "sculpt" architectures.
* **01:15:16** Personal journey: "When you know exactly what you want to do, you won't encounter any setbacks." Experience sharing: the PhD has gone smoothly, thanks to six months of "archaeology" on prior work before it started.
* **01:23:12** **Speaking of archaeology: the history of algorithm variants, starting from the Transformer.**
* **01:29:50** The delta rule, hardware affinity, and DeepSeek's strong pursuit of hardware-algorithm co-design.
* **01:42:23** Advice for younger researchers.

Previous episode with this guest: 《In-depth Explanation of DeepSeek, Kimi, MiniMax's New Attention Mechanism Papers – "Violent Aesthetics on Hardware"》

Papers mentioned:

* 《Kimi Linear: An Expressive, Efficient Attention Architecture》
* 《MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention》
* 《DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models》
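For the 07:04 and 12:20 segments, here is a minimal sketch (not taken from any of the papers above) of why linear attention is "linear": dropping the softmax lets the whole history be folded into a fixed-size recurrent state, so per-token decoding cost stops growing with context length. Feature maps and normalizers used by some variants are omitted.

```latex
% Causal softmax attention: each new token re-reads the entire KV cache,
% so per-token compute and memory grow with position t.
o_t = \sum_{i \le t} \frac{\exp(q_t^{\top} k_i)}{\sum_{j \le t} \exp(q_t^{\top} k_j)}\, v_i

% Linear attention (Katharopoulos et al., 2020), simplified: with the softmax
% removed, the history collapses into a constant-size matrix state S_t.
S_t = S_{t-1} + k_t v_t^{\top}, \qquad o_t = S_t^{\top} q_t
```

Because $S_t$ has a fixed size, decoding memory stays constant in sequence length, which is exactly the contrast with full attention's ever-growing KV cache that the episode keeps returning to.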
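For the 14:39 KDA segment (and the 01:29:50 delta-rule discussion), a hedged sketch of the classical delta-rule update that DeltaNet-style linear attention builds on, in the same notation as above. This is background rather than KDA itself: KDA layers finer-grained gating/decay on top of a delta-rule update, and the exact formulation is in the Kimi Linear paper.

```latex
% Delta rule: instead of blindly accumulating k_t v_t^T, first read out the
% state's current prediction for key k_t, then write back only the error,
% scaled by a learned per-token strength beta_t.
S_t = S_{t-1} + \beta_t \, k_t \left( v_t - S_{t-1}^{\top} k_t \right)^{\top}
    = \left( I - \beta_t k_t k_t^{\top} \right) S_{t-1} + \beta_t k_t v_t^{\top}
```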
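For the 38:16 and 40:30 segments, a toy sketch of what a 3:1 hybrid stack means in practice. The function and layer labels are hypothetical, not Kimi Linear's actual code; the point is simply that only every fourth layer keeps a growing KV cache, while the KDA-style layers carry a fixed-size state.

```python
def hybrid_layer_plan(num_layers: int, linear_per_full: int = 3) -> list[str]:
    """Hypothetical sketch: one global full-attention layer after every
    `linear_per_full` linear-attention layers (the 3:1 pattern by default)."""
    plan = []
    for i in range(num_layers):
        if (i + 1) % (linear_per_full + 1) == 0:
            plan.append("full_attention")    # global layer: growing KV cache
        else:
            plan.append("linear_attention")  # KDA-style layer: fixed-size state
    return plan


print(hybrid_layer_plan(8))
# ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention',
#  'linear_attention', 'linear_attention', 'linear_attention', 'full_attention']
```

Minimax's earlier 7:1 ratio corresponds to `linear_per_full=7`: a smaller share of global layers, hence a smaller KV cache but, per the episode, a weaker quality floor.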
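For the 46:28 segment, a minimal PyTorch sketch of the chunkwise trick for plain (ungated) linear attention, assuming single-head tensors; production kernels such as those in flash-linear-attention fuse and tile this on the GPU and handle gated and delta-rule variants. The idea: ordinary quadratic attention inside each chunk, while everything before the chunk is carried in a small recurrent state, so training parallelizes well and total work stays linear in sequence length.

```python
import torch


def chunkwise_linear_attention(q, k, v, chunk_size: int = 64):
    """Plain linear attention computed chunk by chunk.

    Per-token recurrence being parallelized:
        S_t = S_{t-1} + k_t v_t^T,   o_t = S_t^T q_t
    Shapes: q, k: (seq_len, d_k); v: (seq_len, d_v); returns (seq_len, d_v).
    """
    seq_len, d_k = q.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v, dtype=q.dtype)       # inter-chunk recurrent state
    out = torch.empty(seq_len, d_v, dtype=q.dtype)

    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        Q, K, V = q[start:end], k[start:end], v[start:end]
        causal = torch.tril(torch.ones(end - start, end - start, dtype=q.dtype))
        intra = (Q @ K.T * causal) @ V             # quadratic only within the chunk
        inter = Q @ S                              # one matmul against the carried state
        out[start:end] = intra + inter
        S = S + K.T @ V                            # fold this chunk into the state
    return out


# Sanity check against the naive quadratic formulation.
q, k, v = (torch.randn(256, 32, dtype=torch.float64) for _ in range(3))
assert torch.allclose(chunkwise_linear_attention(q, k, v), torch.tril(q @ k.T) @ v)
```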
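For the 57:05 and 58:26 segments, a generic top-k routed MoE sketch next to the dense FFN it replaces. This is a textbook illustration, not DeepSeekMoE (which adds shared experts, fine-grained expert segmentation, and load balancing); the point is only that each token activates k small experts rather than the whole dense block, which is how MoE buys more total parameters at roughly the same per-token FLOPs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseFFN(nn.Module):
    """Standard Transformer FFN: every token pays for the full hidden width."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))


class TopKMoE(nn.Module):
    """Generic top-k routed mixture of experts as a drop-in FFN replacement.

    Sketch only: no load-balancing loss, no capacity limits, no shared experts.
    """

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([DenseFFN(d_model, d_hidden) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, x):                           # x: (num_tokens, d_model)
        scores = self.router(x)                     # (num_tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # each token picks k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Same interface as the dense FFN, but only k of num_experts experts run per token.
moe = TopKMoE(d_model=64, d_hidden=256)
print(moe(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```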
Original title: 119. Kimi Linear, Minimax M2? Digging through the history of algorithm variants with Yang Songlin, and previewing future architecture improvements