 Back in 2004, I developed a wiki where the text was always split into words and the words were saved in the database. Then a post was just a list of word indexes.

It saved a TON of memory because most texts are just lots of duplicated words.
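
A minimal sketch of that word-dictionary scheme, assuming a simple whitespace split (all names here are hypothetical, not from the 2004 wiki):

```kotlin
// Every distinct word gets an integer id; a post is stored as a list of ids,
// so a repeated word costs one int instead of another copy of the string.
class WordDictionary {
    private val idByWord = HashMap<String, Int>()
    private val words = ArrayList<String>()

    fun idFor(word: String): Int =
        idByWord.getOrPut(word) { words.add(word); words.size - 1 }

    fun wordFor(id: Int): String = words[id]

    fun encode(text: String): List<Int> = text.split(" ").map { idFor(it) }

    fun decode(ids: List<Int>): String = ids.joinToString(" ") { wordFor(it) }
}
```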

If you have that option, then you could quickly re-compute muted events by splitting the muted phrase into individual words, finding the posts where those words appear in sequence, and setting the mute flag.
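
Building on the dictionary sketch above, a hedged sketch of that mute re-computation: encode the muted phrase with the same dictionary and look for the id sequence appearing contiguously in each stored post (posts, idByWord and the function names are illustration-only assumptions):

```kotlin
// True if the phrase's word ids appear as a contiguous run inside the post.
fun containsSequence(post: List<Int>, phrase: List<Int>): Boolean {
    if (phrase.isEmpty() || phrase.size > post.size) return false
    return (0..post.size - phrase.size).any { start ->
        phrase.indices.all { i -> post[start + i] == phrase[i] }
    }
}

// posts: post id -> word ids; idByWord: the dictionary's word -> id map.
// A mute word the database has never seen cannot match anything, so bail out early.
fun mutedPostIds(
    posts: Map<String, List<Int>>,
    idByWord: Map<String, Int>,
    mutePhrase: String
): Set<String> {
    val phraseIds = mutePhrase.split(" ").map { idByWord[it] ?: return emptySet() }
    return posts.filterValues { containsSequence(it, phraseIds) }.keys
}
```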
 that is bizarre but interesting 
 If Nostr was only long form, that would be how I would save the .content of every event. 
 yeah that would save a lot of space! it's just a nice compression mechanism. next step would be to find the probability distribution over words and represent an article as a prefix-coded bit vector, huffman style :P 
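
For what it's worth, a small sketch of that word-level Huffman idea, with the bit vector represented as a string of 0s and 1s purely for readability (nothing here is a real implementation):

```kotlin
import java.util.PriorityQueue

sealed class Node(val freq: Int)
class Leaf(val word: String, freq: Int) : Node(freq)
class Branch(val left: Node, val right: Node) : Node(left.freq + right.freq)

// Build a prefix code from word frequencies by repeatedly merging the two
// least frequent nodes (classic Huffman construction).
fun buildCodes(freqs: Map<String, Int>): Map<String, String> {
    val queue = PriorityQueue<Node>(compareBy<Node> { it.freq })
    freqs.forEach { (word, freq) -> queue.add(Leaf(word, freq)) }
    while (queue.size > 1) queue.add(Branch(queue.poll(), queue.poll()))

    val codes = mutableMapOf<String, String>()
    fun walk(node: Node, prefix: String) {
        when (node) {
            is Leaf -> codes[node.word] = if (prefix.isEmpty()) "0" else prefix
            is Branch -> { walk(node.left, prefix + "0"); walk(node.right, prefix + "1") }
        }
    }
    walk(queue.poll(), "")
    return codes
}

fun main() {
    val article = "the cat sat on the mat and the cat slept".split(" ")
    val codes = buildCodes(article.groupingBy { it }.eachCount())
    val bits = article.joinToString("") { codes.getValue(it) }
    println(codes)  // frequent words get shorter codes
    println(bits)   // the article as a prefix-coded bit string
}
```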
 this would effectively be a compressed database. I like this idea a lot. I kinda want to try it... but the database would be completely different.

but tbh, most of the storage space seems to be from keys and signatures anyways. i guess you can have a dictionary over pubkeys, but .... eh 
 Yeah, the dictionary of keys and IDs was the first thing I did on Amethyst. It's why our memory use is minimal. There are no duplicated bytearrays anywhere in memory. 
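
For readers following along, a tiny sketch of what interning keys looks like in general; this is not Amethyst's actual code, just the shape of the idea (KeyInterner is a made-up name):

```kotlin
// Equal byte arrays get collapsed to a single shared instance. ByteArray
// compares by identity in Kotlin, so the cache is keyed by the hex string.
object KeyInterner {
    private val cache = HashMap<String, ByteArray>()

    fun intern(key: ByteArray): ByteArray {
        val hex = key.joinToString("") { "%02x".format(it) }
        return cache.getOrPut(hex) { key }
    }
}
// Running every event's pubkey and id through intern() means duplicates
// share one ByteArray instead of each event holding its own copy.
```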
 Bonus points if you do all of that WHILE also storing the markdown/AsciiDoc/Nostr node information. 

For instance, the database would store an nprofile word, but you have to load the user, render the user's name with a link, and then run the muted-words procedure against the user's name, not the nprofile string :)
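
A hedged sketch of that rendering step, with resolveProfile and isMuted as assumed callbacks rather than real APIs:

```kotlin
data class Profile(val pubkeyHex: String, val displayName: String)

// Returns the rendered text plus whether the token should be muted. For an
// nprofile token we resolve the user, render the display name as a link, and
// run the mute check against the name instead of the bech32 string.
fun renderToken(
    token: String,
    resolveProfile: (String) -> Profile?,
    isMuted: (String) -> Boolean
): Pair<String, Boolean> {
    if (!token.startsWith("nprofile1")) return token to isMuted(token)
    val profile = resolveProfile(token) ?: return token to false
    val rendered = "[${profile.displayName}](nostr:$token)"
    return rendered to isMuted(profile.displayName)
}
```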
 it reminds me of the kind of tokenization that LLMs do (though that isn't strictly per word)