These are neat, but it feels like they're optimizing a problem that shouldn't exist. The paper I linked dropped half of the attention layers without any noticeable impact on performance. Architecture changes like that could have a much larger impact. Wish I had time to tinker with this stuff…
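For anyone curious what that kind of change looks like in code, here's a minimal sketch (not the linked paper's exact method; the block structure and names are my own assumptions): a pre-norm transformer where every other block keeps only its MLP.

    import torch
    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self, dim, heads, use_attention=True):
            super().__init__()
            self.use_attention = use_attention
            if use_attention:
                self.norm1 = nn.LayerNorm(dim)
                self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
            )

        def forward(self, x):
            if self.use_attention:
                h = self.norm1(x)
                a, _ = self.attn(h, h, h)  # self-attention, skipped in half the blocks
                x = x + a
            return x + self.mlp(self.norm2(x))

    # "Drop half the attention layers": disable attention in every odd block.
    blocks = nn.ModuleList(
        Block(512, 8, use_attention=(i % 2 == 0)) for i in range(12)
    )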
There are multiple competing goals: we want the rocks to think better, and we also want them to think faster. We hear more about the former than the latter, but some are proposing radically simpler networks with similar capabilities: https://www.alphaxiv.org/abs/2410.01201v2
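If I'm reading that paper right, its minGRU boils down to something like the sketch below: the gate and candidate state depend only on the current input, not the previous hidden state, which is what lets training run as a parallel scan instead of a step-by-step recurrence. (The sequential loop here is just for clarity; details and names are my paraphrase, not the authors' code.)

    import torch
    import torch.nn as nn

    class MinGRU(nn.Module):
        def __init__(self, dim_in, dim_hidden):
            super().__init__()
            self.to_z = nn.Linear(dim_in, dim_hidden)        # gate: depends on x_t only
            self.to_h_tilde = nn.Linear(dim_in, dim_hidden)  # candidate: x_t only

        def forward(self, x):  # x: (batch, seq, dim_in)
            h = torch.zeros(x.size(0), self.to_z.out_features, device=x.device)
            outs = []
            for t in range(x.size(1)):
                z = torch.sigmoid(self.to_z(x[:, t]))
                h_tilde = self.to_h_tilde(x[:, t])
                h = (1 - z) * h + z * h_tilde  # no reset gate, no tanh
                outs.append(h)
            return torch.stack(outs, dim=1)

Compare that with a standard GRU, where both gates take h_{t-1} as input and force the recurrence to be computed one step at a time.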