the training needs a large number of GPUs on a single machine, but I'm sure someone will figure out how to scale it out. mixture of experts is the popular thing nowadays. maybe each of those experts could be trained on a single machine and combined later.
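a rough sketch of what "train experts separately, combine later" could look like (all names and sizes here are made up for illustration): each expert is a standalone model that could in principle be trained on its own machine, and a small gating function weights their outputs at inference time.

```python
# Minimal mixture-of-experts sketch (hypothetical, NumPy only).
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_OUT, N_EXPERTS = 4, 2, 3  # toy sizes, chosen arbitrarily

# Stand-ins for experts trained independently (e.g. on separate
# machines); here each is just a random linear map.
experts = [rng.standard_normal((D_IN, D_OUT)) for _ in range(N_EXPERTS)]

# Gating weights: map the input to one score per expert.
gate_w = rng.standard_normal((D_IN, N_EXPERTS))

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def moe_forward(x):
    """Combine expert outputs, weighted by the gate's softmax scores."""
    scores = softmax(x @ gate_w)                  # shape (N_EXPERTS,)
    outputs = np.stack([x @ w for w in experts])  # (N_EXPERTS, D_OUT)
    return scores @ outputs                       # weighted sum, (D_OUT,)

x = rng.standard_normal(D_IN)
y = moe_forward(x)
print(y.shape)  # (2,)
```

the hard part this sketch skips is the gate: in real MoE systems the gate is trained jointly with the experts, so naively gluing independently trained experts together would still need some way to learn the routing afterwards.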