The key idea behind vLLM is stuffing the GPU as full of prompts as possible. So there's one model and many prompts, and as prompts finish, their slots get refilled with new prompts (continuous batching).
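A toy sketch of why that refilling helps (this is not vLLM's actual scheduler, just a minimal simulation I made up: each "step" decodes one token for every active prompt, and a finished prompt's slot is immediately handed to the next waiting prompt):

```python
from collections import deque

def continuous_batching_steps(prompt_lens, batch_size):
    """Toy continuous-batching model: count decode steps needed to
    finish all prompts when free slots are refilled immediately.
    prompt_lens = tokens left to generate per prompt (prefill ignored)."""
    queue = deque(prompt_lens)
    active = []
    steps = 0
    while queue or active:
        # refill any free slots from the waiting queue
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # each active prompt emits one token; drop the ones that finished
        active = [n - 1 for n in active if n > 1]
    return steps

lens = [10, 10, 10, 10]
one_at_a_time = continuous_batching_steps(lens, batch_size=1)  # 40 steps
batched       = continuous_batching_steps(lens, batch_size=4)  # 10 steps
```

With 4 slots the same work takes a quarter of the steps, which is roughly the shape of the 1 → 6 pages/sec jump below (real gains depend on batch size, memory, and prompt lengths).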
Old way (one prompt at a time): 1 page/sec
vLLM on 1x 3090: 6 pages/sec

Realized it wasn't using both cards; now it's at 12 pages/sec

Gonna lift the voltage limit on the cards next and try a different quant; idk if it's even quantized right now