Secure, Correct, then Fast (a software take on the Vitruvian Triad). Inference has gotten orders of magnitude faster since llama.cpp, and it keeps getting faster.
Yeah, llama.cpp is great. I should look at what they’re doing.
Justine made some nice performance improvements while working on llamafile. If you haven't followed APE, that's a fun rabbit hole as well.
https://justine.lol/matmul/
These are neat, but it feels like it’s optimizing a problem that shouldn’t exist. The paper I linked dropped half of the attention layers without any noticeable impact on performance. Architecture changes like that could have a much larger impact. Wish I had time to tinker with this stuff …
There are multiple competing goals: we want the rocks to think better, and we also want them to think faster. We hear more about the former than the latter, but some are proposing radically simpler networks with similar capabilities: https://www.alphaxiv.org/abs/2410.01201v2