The key idea behind vLLM is stuffing the GPU as full of prompts as possible. So there's one model and many prompts, and as prompts finish, their slots get refilled with new prompts (continuous batching).
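A toy sketch of why that refilling helps (this is not vLLM's actual scheduler, just a minimal simulation I made up: each "step" decodes one token for every active prompt, and a finished prompt's slot is immediately handed to the next waiting prompt):

```python
from collections import deque

def continuous_batching_steps(prompt_lens, batch_size):
    """Toy continuous-batching model: count decode steps needed to
    finish all prompts when free slots are refilled immediately.
    prompt_lens = tokens left to generate per prompt (prefill ignored)."""
    queue = deque(prompt_lens)
    active = []
    steps = 0
    while queue or active:
        # refill any free slots from the waiting queue
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # each active prompt emits one token; drop the ones that finished
        active = [n - 1 for n in active if n > 1]
    return steps

lens = [10, 10, 10, 10]
one_at_a_time = continuous_batching_steps(lens, batch_size=1)  # 40 steps
batched       = continuous_batching_steps(lens, batch_size=4)  # 10 steps
```

With 4 slots the same work takes a quarter of the steps, which is roughly the shape of the 1 → 6 pages/sec jump below (real gains depend on batch size, memory, and prompt lengths).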
Old way (one prompt at a time): 1 page/sec
vLLM on 1x 3090: 6 pages/sec

Realized it wasn't using both cards; now it's at 12 pages/sec

Gonna lift the voltage limit on the cards next and try a different quant; idk if it's even quantized right now