 Back in 2004, I developed a wiki where the text was always split into words and the words were saved in the database. Then a post was just a list of word indexes.

It saved a TON of memory because most texts are just lots of duplicated words.
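
A minimal sketch of that word-dictionary scheme, assuming a simple whitespace split (all names here are hypothetical, not from the 2004 wiki):

```kotlin
// Every distinct word gets an integer id; a post is stored as a list of ids,
// so a repeated word costs one int instead of another copy of the string.
class WordDictionary {
    private val idByWord = HashMap<String, Int>()
    private val words = ArrayList<String>()

    fun idFor(word: String): Int =
        idByWord.getOrPut(word) { words.add(word); words.size - 1 }

    fun wordFor(id: Int): String = words[id]

    fun encode(text: String): List<Int> = text.split(" ").map { idFor(it) }

    fun decode(ids: List<Int>): String = ids.joinToString(" ") { wordFor(it) }
}
```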

If you have that option, then you could quickly re-compute muted events by splitting the muted phrase into individual words, finding the posts where those words appear in sequence, and setting the mute flag.
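
Building on the dictionary sketch above, a hedged sketch of that mute re-computation: encode the muted phrase with the same dictionary and look for the id sequence appearing contiguously in each stored post (posts, idByWord and the function names are illustration-only assumptions):

```kotlin
// True if the phrase's word ids appear as a contiguous run inside the post.
fun containsSequence(post: List<Int>, phrase: List<Int>): Boolean {
    if (phrase.isEmpty() || phrase.size > post.size) return false
    return (0..post.size - phrase.size).any { start ->
        phrase.indices.all { i -> post[start + i] == phrase[i] }
    }
}

// posts: post id -> word ids; idByWord: the dictionary's word -> id map.
// A mute word the database has never seen cannot match anything, so bail out early.
fun mutedPostIds(
    posts: Map<String, List<Int>>,
    idByWord: Map<String, Int>,
    mutePhrase: String
): Set<String> {
    val phraseIds = mutePhrase.split(" ").map { idByWord[it] ?: return emptySet() }
    return posts.filterValues { containsSequence(it, phraseIds) }.keys
}
```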
 that is bizarre but interesting 
 If Nostr was only long form, that would be how I would save the .content of every event. 
 yeah that would save a lot of space! it's just a nice compression mechanism. next step would be to find the probability distribution over words and represent an article as a prefix-coded bit vector, huffman style :P 
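
For what it's worth, a small sketch of that word-level Huffman idea, with the bit vector represented as a string of 0s and 1s purely for readability (nothing here is a real implementation):

```kotlin
import java.util.PriorityQueue

sealed class Node(val freq: Int)
class Leaf(val word: String, freq: Int) : Node(freq)
class Branch(val left: Node, val right: Node) : Node(left.freq + right.freq)

// Build a prefix code from word frequencies by repeatedly merging the two
// least frequent nodes (classic Huffman construction).
fun buildCodes(freqs: Map<String, Int>): Map<String, String> {
    val queue = PriorityQueue<Node>(compareBy<Node> { it.freq })
    freqs.forEach { (word, freq) -> queue.add(Leaf(word, freq)) }
    while (queue.size > 1) queue.add(Branch(queue.poll(), queue.poll()))

    val codes = mutableMapOf<String, String>()
    fun walk(node: Node, prefix: String) {
        when (node) {
            is Leaf -> codes[node.word] = if (prefix.isEmpty()) "0" else prefix
            is Branch -> { walk(node.left, prefix + "0"); walk(node.right, prefix + "1") }
        }
    }
    walk(queue.poll(), "")
    return codes
}

fun main() {
    val article = "the cat sat on the mat and the cat slept".split(" ")
    val codes = buildCodes(article.groupingBy { it }.eachCount())
    val bits = article.joinToString("") { codes.getValue(it) }
    println(codes)  // frequent words get shorter codes
    println(bits)   // the article as a prefix-coded bit string
}
```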
 this would effectively be a compressed database. I like this idea a lot. I kinda want to try it... but the database would be completely different.

but tbh, most of the storage space seems to be from keys and signatures anyways. i guess you can have a dictionary over pubkeys, but .... eh 
 Yeah, the dictionary of keys and IDs was the first thing I did on Amethyst. It's why our memory use is minimal. There are no duplicated bytearrays anywhere in memory. 
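
For readers following along, a tiny sketch of what interning keys looks like in general; this is not Amethyst's actual code, just the shape of the idea (KeyInterner is a made-up name):

```kotlin
// Equal byte arrays get collapsed to a single shared instance. ByteArray
// compares by identity in Kotlin, so the cache is keyed by the hex string.
object KeyInterner {
    private val cache = HashMap<String, ByteArray>()

    fun intern(key: ByteArray): ByteArray {
        val hex = key.joinToString("") { "%02x".format(it) }
        return cache.getOrPut(hex) { key }
    }
}
// Running every event's pubkey and id through intern() means duplicates
// share one ByteArray instead of each event holding its own copy.
```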
 Bonus points if you do all of that WHILE also storing the markdown/AsciiDoc/Nostr node information. 

For instance, the database would store an nprofile word, but you have to load the user, render the user's name with a link, and then run the muted-words procedure against the user's name, not the nprofile string :)
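
A hedged sketch of that rendering step, with resolveProfile and isMuted as assumed callbacks rather than real APIs:

```kotlin
data class Profile(val pubkeyHex: String, val displayName: String)

// Returns the rendered text plus whether the token should be muted. For an
// nprofile token we resolve the user, render the display name as a link, and
// run the mute check against the name instead of the bech32 string.
fun renderToken(
    token: String,
    resolveProfile: (String) -> Profile?,
    isMuted: (String) -> Boolean
): Pair<String, Boolean> {
    if (!token.startsWith("nprofile1")) return token to isMuted(token)
    val profile = resolveProfile(token) ?: return token to false
    val rendered = "[${profile.displayName}](nostr:$token)"
    return rendered to isMuted(profile.displayName)
}
```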
 it reminds me of the kind of tokenization that LLMs do (though that isn't strictly per word)