As a performance optimization enjoyer, I can't help but look at the transformer architecture in LLMs and notice how incredibly inefficient it is, specifically the attention mechanism.

Looks like I'm not the only one who has noticed this, and it seems like people are working on it.

https://arxiv.org/pdf/2406.15786

Lots of AI researchers are not performance engineers, and it shows. I suspect we can reach similar results with much less computational complexity. That would be good news if you want to run these things on your phone. https://i.nostr.build/LFm0CEhOtaSXV5rq.jpg
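To make the inefficiency concrete, here's a rough numpy sketch of plain scaled dot-product attention (names and shapes are just illustrative, not from any real codebase). The n x n score matrix is where the quadratic blowup in sequence length comes from:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (n, d) for a single head; n = sequence length
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (n, n): O(n^2 * d) time, O(n^2) memory
    return softmax(scores) @ V      # another O(n^2 * d)

n, d = 2048, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = attention(Q, K, V)
```

Double the context length and that score matrix quadruples, which is exactly the kind of cost you feel on a phone.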
 nice & enticed t-y 
 Efficiency doesn't sell new GPUs 
don't feel bad for Nvidia, they'll still sell plenty of GPUs to train the models 🙂
 Secure, Correct, then Fast (a software take on the Vitruvian Triad). Inference performance has gotten orders of magnitude faster since llama.cpp, and it continues to get faster. 
yeah, llama.cpp is great, I should look at what they're doing.
 Justine made some nice performance improvements while working on llamafile. If you haven't followed APE, that's a fun rabbit hole as well. 

https://justine.lol/matmul/ 
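The gist of that post, as I understand it, is squeezing more out of the CPU by unrolling and tiling the matmul loops so the hot kernel stays in cache/registers and vectorizes well. Here's a toy numpy version of just the blocking idea (the real thing is hand-written C++ with SIMD; this only shows the loop structure):

```python
import numpy as np

def matmul_blocked(A, B, tile=64):
    # Toy blocked matmul: work on (tile x tile) sub-blocks so the working set
    # stays cache-sized. Real kernels do this in C++ with SIMD intrinsics.
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.randn(512, 512)
B = np.random.randn(512, 512)
assert np.allclose(matmul_blocked(A, B), A @ B)
```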
These are neat, but it feels like they're optimizing a problem that shouldn't exist. The paper I linked dropped half of the attention layers without any noticeable impact on performance. Architecture changes like that could have a much larger impact. Wish I had time to tinker with this stuff…
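Roughly what I mean by an architecture change, as a toy PyTorch sketch (not the paper's code; which blocks can safely lose their attention sublayer is exactly what the paper measures empirically):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # Toy pre-norm transformer block; use_attn=False drops the attention
    # sublayer entirely and keeps only the MLP.
    def __init__(self, d_model=256, n_heads=4, use_attn=True):
        super().__init__()
        self.use_attn = use_attn
        if use_attn:
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        if self.use_attn:
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))

# keep attention only in every other block -- illustrative, not the paper's recipe
blocks = nn.Sequential(*[Block(use_attn=(i % 2 == 0)) for i in range(12)])
out = blocks(torch.randn(1, 128, 256))
```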
I don't read into it tho
There are multiple competing goals: we want the rocks to think better, and we also want them to think faster. We hear more about the former than the latter, but some are proposing radically simpler networks with similar capabilities: https://www.alphaxiv.org/abs/2410.01201v2
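If I remember that paper right, the core trick is a GRU stripped down so the gates no longer depend on the previous hidden state, which is what makes training parallelizable. Rough sketch of that kind of minimal cell (my reconstruction, not their code):

```python
import torch
import torch.nn as nn

class MinGRUCell(nn.Module):
    # Minimal GRU-style cell in the spirit of that paper (as I recall it):
    # the gate and candidate depend only on x_t, not on h_{t-1}.
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.to_z = nn.Linear(d_in, d_hidden)   # update gate
        self.to_h = nn.Linear(d_in, d_hidden)   # candidate state

    def forward(self, x, h):
        # x: (batch, d_in), h: (batch, d_hidden)
        z = torch.sigmoid(self.to_z(x))
        h_tilde = self.to_h(x)
        return (1 - z) * h + z * h_tilde        # no matmul involving h at all

cell = MinGRUCell(32, 64)
h = torch.zeros(8, 64)
for t in range(16):                 # sequential loop for clarity;
    h = cell(torch.randn(8, 32), h) # the paper trains this with a parallel scan
```

Per-token cost stays constant instead of growing with context, which is the appeal for running things on-device.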
it's optimized for looking pretty on slides
I would say efficiency is fairly low on the list of attributes held by normal engineers; I like to install a screw by mashing it with a hammer.