yeah, that would save a lot of space! it's just a nice compression mechanism. next step would be to find the probability distribution over words and represent an article as a prefix-coded bit vector, Huffman style :P
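
fwiw, here's a rough sketch of what the Huffman idea could look like. It assumes the "probability distribution" is just word counts from the article itself and that words are split on whitespace. The names (`huffman_codes`, `encode`, `decode`) and the sample sentence are made up for illustration, not taken from any existing code:

```python
import heapq
from collections import Counter


def huffman_codes(words):
    """Build a prefix code (word -> bit string) from empirical word counts."""
    freq = Counter(words)
    if not freq:
        return {}
    if len(freq) == 1:
        # Only one distinct word: it still needs at least one bit.
        return {next(iter(freq)): "0"}
    # Heap entries are (count, tiebreaker, {word: partial code}).
    # The unique tiebreaker keeps Python from ever comparing the dicts.
    heap = [(n, i, {w: ""}) for i, (w, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in c1.items()}
        merged.update({w: "1" + c for w, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(words, codes):
    """Concatenate the code of each word into one bit string."""
    return "".join(codes[w] for w in words)


def decode(bits, codes):
    """Read bits left to right; prefix-freeness makes each match unambiguous."""
    inverse = {c: w for w, c in codes.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return out


# Hypothetical sample "article".
article = "the cat sat on the mat and the cat ran".split()
codes = huffman_codes(article)
bits = encode(article, codes)
assert decode(bits, codes) == article
```

Frequent words like "the" end up with shorter codes than one-off words, which is where the savings come from. In practice you'd also have to store or share the code table, and a real bit vector would pack the bits instead of using a string of "0"/"1" characters.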