 Yeah, llama.cpp is great, I should look at what they're doing. 
 Justine made some nice performance improvements while working on llamafile. If you haven't followed APE (Actually Portable Executable), that's a fun rabbit hole as well. 

https://justine.lol/matmul/ 
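A rough flavor of what that post is about: most of the CPU matmul win comes from classic cache blocking plus SIMD and unrolling. Here's a minimal numpy sketch of just the blocking idea; the tile size is a made-up illustration value and this is nothing like the actual hand-tuned kernels in the post:

```python
# Conceptual sketch of cache blocking for matmul, not the llamafile kernel
# (which uses SIMD intrinsics and careful unrolling). Tile size is illustrative.
import numpy as np

def blocked_matmul(A, B, tile=64):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    # Walk the output in tiles so each block of A and B stays hot in cache
    # while it is reused, instead of streaming whole rows/columns every pass.
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)
```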
 These are neat, but it feels like it's optimizing a problem that shouldn't exist. The paper I linked dropped half of the attention layers without any noticeable impact on performance. Architecture changes like that could have a much larger impact. Wish I had time to tinker with this stuff … 
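For anyone wondering what "dropping attention layers" looks like mechanically, here's a toy sketch: a stack of transformer-style blocks where every other block skips its attention sublayer and keeps only the MLP. This is my own illustration of the general idea, not the layer-selection method from that paper:

```python
# Toy illustration: transformer-ish blocks where half skip attention entirely.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def block(x, p, use_attention):
    # Residual attention sublayer, optionally skipped (identity) ...
    if use_attention:
        x = x + attention(x, p["Wq"], p["Wk"], p["Wv"])
    # ... followed by a residual two-layer MLP.
    h = np.maximum(x @ p["W1"], 0.0)
    return x + h @ p["W2"]

rng = np.random.default_rng(0)
d, seq, n_blocks = 64, 16, 8
params = [{k: rng.normal(0, 0.02, (d, d)) for k in ("Wq", "Wk", "Wv", "W1", "W2")}
          for _ in range(n_blocks)]

x = rng.normal(size=(seq, d))
for i, p in enumerate(params):
    x = block(x, p, use_attention=(i % 2 == 0))  # attention in only half the blocks
```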
 I don't read into it tho 
 There are multiple competing goals: we want the rocks to think better, and we also want them to think faster. We hear more about the former than the latter, but some are proposing radically simpler networks with similar capabilities: https://www.alphaxiv.org/abs/2410.01201v2
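If that link is the minimal-RNN paper I think it is, the pitch is roughly a GRU stripped down so the gate and candidate depend only on the current input, which turns the recurrence into a cheap scan. A rough sketch from memory (run sequentially here for clarity; the shapes and names are mine, and the paper's point is that this form also admits a parallel scan):

```python
# Rough minGRU-style recurrence as I understand it; gate and candidate depend
# only on x_t, not on h_{t-1}. Sequential loop here purely for readability.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def min_gru(x, Wz, Wh, h0):
    h, hs = h0, []
    for t in range(x.shape[0]):
        z = sigmoid(x[t] @ Wz)           # update gate from input only
        h_tilde = x[t] @ Wh              # candidate state from input only
        h = (1.0 - z) * h + z * h_tilde  # convex mix with previous state
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
d_in, d_hidden, T = 32, 64, 10
out = min_gru(rng.normal(size=(T, d_in)),
              rng.normal(0, 0.1, (d_in, d_hidden)),
              rng.normal(0, 0.1, (d_in, d_hidden)),
              np.zeros(d_hidden))
print(out.shape)  # (10, 64)
```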