The key idea behind vLLM is stuffing the GPU as full of prompts as possible: one model, many prompts in flight. As prompts finish, their slots get refilled with new ones (continuous batching)
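That refill-as-you-finish idea can be sketched without any GPU code. This is a toy simulation of the scheduling pattern, not vLLM's actual scheduler: `slots` prompts run at once, and the moment one finishes, the next queued prompt takes its place.

```python
from collections import deque

def simulate(prompts, slots=3):
    """Toy continuous batching: `slots` prompts run concurrently; a
    finished prompt's slot is refilled from the queue immediately.
    Each prompt is (name, steps_remaining)."""
    queue = deque(prompts)
    active = [queue.popleft() for _ in range(min(slots, len(queue)))]
    finished, step = [], 0
    while active:
        step += 1
        still_running = []
        for name, remaining in active:
            if remaining <= 1:
                finished.append((name, step))   # prompt done this step
                if queue:                       # refill the slot at once
                    still_running.append(queue.popleft())
            else:
                still_running.append((name, remaining - 1))
        active = still_running
    return finished

done = simulate([("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)])
print(done)
```

Short prompts drain out early and new ones slide in, so the slots stay full the whole time, which is why throughput scales so well versus running prompts one at a time.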
Old way: 1 page/sec
vLLM, 1x 3090: 6 pages/sec
Realized it wasn't using both cards; with both, it's now at 12 pages/sec
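Getting both cards in play is presumably vLLM's tensor parallelism (`tensor_parallel_size=2` shards the weights across GPUs). A minimal sketch, with the model name a placeholder and the `vllm` import kept inside the functions since it only loads on a CUDA machine:

```python
def build_engine(model_id: str, num_gpus: int = 2):
    # tensor_parallel_size=2 shards the model across both 3090s;
    # import is deferred so the sketch parses on non-GPU machines
    from vllm import LLM
    return LLM(model=model_id, tensor_parallel_size=num_gpus)

def run_pages(engine, pages):
    # hand every page to generate() at once; vLLM's scheduler batches
    # them and refills slots as individual pages finish
    from vllm import SamplingParams
    params = SamplingParams(temperature=0.0, max_tokens=512)
    return [out.outputs[0].text for out in engine.generate(pages, params)]
```

Usage would be `run_pages(build_engine("some/model-id"), list_of_page_texts)`; the throughput jump comes from never leaving either GPU idle between pages.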
Next up: gonna remove the voltage limit on the cards and try a different quant; not even sure it's quantized right now