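It's mostly not the attention itself; what changed is how the KV cache is managed. He splits the KV cache into chunks and loads them in batches, which brings peak memory down, and that's why a 70B model fits in 24 GB. The attention layer is still standard multi-head, just paired with flash attention. If your 3090 is the 24 GB version, it should run fine, but tokens/s probably won't be great, so estimate for yourself whether that's acceptable.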
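To get a rough feel for why chunking the KV cache matters, here's a back-of-the-envelope sketch. Everything in it is an assumption for illustration, not his actual code: the layer count, head count, head dimension, fp16 storage, the 512-token chunk size, and the idea that only one chunk sits on the GPU at a time. Swap in the real numbers from the model config.

```python
# Rough KV cache size estimate. All model numbers below are hypothetical
# placeholders, not taken from the actual implementation.

GIB = 1024 ** 3


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token needs across all layers."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(n_tokens, **cfg):
    """Total KV cache bytes for n_tokens of context."""
    return n_tokens * kv_bytes_per_token(**cfg)


# Assumed 70B-class shape with standard multi-head attention, fp16 cache.
cfg = dict(n_layers=80, n_kv_heads=64, head_dim=128, bytes_per_elem=2)

full = kv_cache_bytes(4096, **cfg)   # whole 4096-token cache resident at once
chunk = kv_cache_bytes(512, **cfg)   # one 512-token chunk resident at a time

print(f"full cache: {full / GIB:.2f} GiB")   # full cache: 10.00 GiB
print(f"one chunk:  {chunk / GIB:.2f} GiB")  # one chunk:  1.25 GiB
```

Under those assumptions the resident KV footprint drops from 10 GiB to 1.25 GiB, at the cost of shuttling chunks in and out, which is also where the slower tokens/s would come from. This only covers the cache; weights and activations are a separate budget.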
