As a performance optimization enjoyer I can't help but look at the transformer architecture in LLMs and notice how incredibly inefficient it is, specifically the attention mechanism.
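To be concrete about where the cost comes from, here's a minimal sketch of my own (not taken from the paper below): standard scaled dot-product attention materializes an n x n score matrix, so compute and memory both grow quadratically with sequence length.

```python
import numpy as np

def attention(Q, K, V):
    # Q, K, V: (n, d) arrays for a single head
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) -- the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # (n, d)

n, d = 4096, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = attention(Q, K, V)  # the (n, n) weight matrix alone is ~16M floats here
```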

Looks like I'm not the only one who has noticed this, and it seems people are working on it.

https://arxiv.org/pdf/2406.15786

Lots of AI researchers are not performance engineers, and it shows. I suspect we can reach similar results with much less computational complexity. That will be good news if you want to run these things on your phone. https://i.nostr.build/LFm0CEhOtaSXV5rq.jpg
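As one example of the kind of complexity reduction I mean, here's a hedged sketch of kernelized (linear) attention, a known trick and not necessarily what the linked paper does: replace the softmax with a feature map phi, and you can reorder (QK^T)V into Q(K^T V) so the n x n matrix is never built, dropping the cost from O(n^2 d) to O(n d^2).

```python
import numpy as np

def phi(x):
    # Simple positive feature map (elu(x) + 1); other choices exist.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Sketch only: approximates softmax attention, trading accuracy for cost.
    Qf, Kf = phi(Q), phi(K)                       # (n, d) each
    KV = Kf.T @ V                                 # (d, d) -- no (n, n) matrix
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T      # (n, 1) normalizer
    return (Qf @ KV) / Z                          # O(n * d^2) instead of O(n^2 * d)
```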