researching a way to build a minimal perfect hashmap of words to store alongside the binary note in nostrdb, so that nostrdb clients (damus ios, damus notedeck, zapstream rust) can do O(1) mute-word matching, which is pretty important for performance when you have 100s of mute words.
 You'll like this video.  This guy uses lots of optimization techniques to get a PHT that runs more than twice as fast as the gperf one (the best one he found that isn't specifically constructed)  https://www.youtube.com/watch?v=DMQ_HcNSOAI 
I'm currently looking at BoomPHF, which seems faster than what I used the last time I built an MPHT

https://github.com/10XGenomics/rust-boomphf

https://github.com/rizkg/BBHash 
 great video 
 what I'm proposing should be way faster than anything in this video, because in a *minimal* perfect hashtable I don't even need to do a compare in many cases, since if the word is out of the range it can be rejected outright. 
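A rough Rust sketch of the per-note idea proposed at the top of the thread, using the boomphf crate linked above (struct names and tokenization are illustrative, and the Mphf::new / hash / try_hash calls are assumed from boomphf's documented API; it assumes the note has at least one word). The point it illustrates: try_hash rejects most foreign words without any compare, and a hit only needs one string compare to confirm.

```rust
use boomphf::Mphf;

/// Built once at ingest time and stored alongside the note (illustrative name).
struct NoteWordSet {
    mphf: Mphf<String>,
    /// Words laid out by their MPHF index, so a hit costs one compare.
    words: Vec<String>,
}

impl NoteWordSet {
    fn build(note_text: &str) -> Self {
        let mut words: Vec<String> = note_text
            .split_whitespace()
            .map(|w| w.to_lowercase())
            .collect();
        words.sort();
        words.dedup();

        // gamma = 1.7 is just a typical space/speed trade-off value.
        let mphf = Mphf::new(1.7, &words);

        // Reorder so words[i] is the word whose MPHF value is i.
        let mut by_index = vec![String::new(); words.len()];
        for w in words {
            let i = mphf.hash(&w) as usize;
            by_index[i] = w;
        }
        NoteWordSet { mphf, words: by_index }
    }

    /// O(1) probe: most non-members are rejected by the MPHF itself
    /// (try_hash returns None); a colliding foreign word is caught by
    /// the single confirming compare.
    fn contains(&self, word: &str) -> bool {
        match self.mphf.try_hash(&word.to_string()) {
            Some(i) => self.words.get(i as usize).map_or(false, |w| w == word),
            None => false,
        }
    }
}

fn is_muted(note: &NoteWordSet, mute_words: &[&str]) -> bool {
    mute_words.iter().any(|m| note.contains(&m.to_lowercase()))
}
```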
 they have butts 
 Sounds like a Bloom filter in front of a radix trie 
BBHash does say it's bloom-filter based. hmm!

https://github.com/rizkg/BBHash 
 I think this is just a bloom filter. With only a few target words and a lot of potential ones, even with a multihash scheme you'll end up with a lot of false positives. For those you'll also need some way to reliably rule them out, and a radix tree is one implementation. 
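A small illustrative sketch (not from the thread) of that "filter in front of an exact structure" shape, with a std HashSet standing in for the radix tree and arbitrary parameters:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

struct FilteredSet {
    bits: Vec<bool>,        // Bloom filter bit array
    hashes: u64,            // number of hash functions (k)
    exact: HashSet<String>, // resolves the false positives
}

impl FilteredSet {
    fn new(words: &[&str], bits: usize, hashes: u64) -> Self {
        let mut s = FilteredSet { bits: vec![false; bits], hashes, exact: HashSet::new() };
        for w in words {
            for k in 0..hashes {
                let i = Self::index(w, k, bits);
                s.bits[i] = true;
            }
            s.exact.insert(w.to_string());
        }
        s
    }

    fn index(word: &str, k: u64, bits: usize) -> usize {
        let mut h = DefaultHasher::new();
        (word, k).hash(&mut h);
        (h.finish() as usize) % bits
    }

    fn contains(&self, word: &str) -> bool {
        // Cheap probabilistic check first: a single unset bit means "definitely not".
        let maybe = (0..self.hashes)
            .all(|k| self.bits[Self::index(word, k, self.bits.len())]);
        // Only on a "maybe" do we pay for the exact lookup.
        maybe && self.exact.contains(word)
    }
}
```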
Back in 2004, I developed a wiki where the text was always split into words and the words were saved in the database. Then a post was just a list of word indexes.

It saved a TON of memory because most texts are just lots of duplicated words.

If you have that option, then you could quickly re-compute muted events by splitting the mute phrase into its words, finding posts where each word appears in sequence, and setting the mute flag.
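A sketch of that wiki-style layout (names are illustrative, and it matches single mute words rather than phrases): a global word dictionary, with each post stored as indexes into it, so mute matching becomes an integer scan.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct WordDict {
    ids: HashMap<String, u32>,
    words: Vec<String>,
}

impl WordDict {
    fn intern(&mut self, word: &str) -> u32 {
        if let Some(&id) = self.ids.get(word) {
            return id;
        }
        let id = self.words.len() as u32;
        self.words.push(word.to_string());
        self.ids.insert(word.to_string(), id);
        id
    }

    fn id(&self, word: &str) -> Option<u32> {
        self.ids.get(word).copied()
    }
}

/// A post is just a sequence of word ids; duplicated words cost 4 bytes each.
fn encode_post(dict: &mut WordDict, text: &str) -> Vec<u32> {
    text.split_whitespace().map(|w| dict.intern(w)).collect()
}

/// Look up each mute word's id once, then check whether the post contains it.
fn is_muted(dict: &WordDict, post: &[u32], mute_words: &[&str]) -> bool {
    mute_words
        .iter()
        .filter_map(|w| dict.id(w))
        .any(|id| post.contains(&id))
}
```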
 that is bizarre but interesting 
If Nostr were only long form, that would be how I would save the .content of every event.
yeah that would save a lot of space! it's just a nice compression mechanism. next step would be to find the probability distribution over words and represent an article as a prefix-coded bit vector, Huffman style :P
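A toy sketch of that prefix-coding idea (purely illustrative, everything here is an assumption): build a Huffman code over word frequencies and encode an article as a bit string of per-word codes.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

enum Node {
    Leaf(String),
    Internal(Box<Node>, Box<Node>),
}

fn huffman_codes(freqs: &HashMap<String, u64>) -> HashMap<String, String> {
    // Keep nodes in an arena and push (frequency, arena index) into a min-heap,
    // so the tree nodes themselves never need Ord.
    let mut arena: Vec<Option<Node>> = Vec::new();
    let mut heap: BinaryHeap<Reverse<(u64, usize)>> = BinaryHeap::new();
    for (word, &f) in freqs {
        arena.push(Some(Node::Leaf(word.clone())));
        heap.push(Reverse((f, arena.len() - 1)));
    }
    // Repeatedly merge the two lowest-frequency subtrees.
    while heap.len() > 1 {
        let Reverse((f1, i1)) = heap.pop().unwrap();
        let Reverse((f2, i2)) = heap.pop().unwrap();
        let left = arena[i1].take().unwrap();
        let right = arena[i2].take().unwrap();
        arena.push(Some(Node::Internal(Box::new(left), Box::new(right))));
        heap.push(Reverse((f1 + f2, arena.len() - 1)));
    }
    let mut codes = HashMap::new();
    if let Some(Reverse((_, root))) = heap.pop() {
        assign(arena[root].take().unwrap(), String::new(), &mut codes);
    }
    codes
}

fn assign(node: Node, prefix: String, codes: &mut HashMap<String, String>) {
    match node {
        // A single-word corpus gets an empty code; real encoders special-case this.
        Node::Leaf(w) => {
            codes.insert(w, prefix);
        }
        Node::Internal(l, r) => {
            assign(*l, format!("{prefix}0"), codes);
            assign(*r, format!("{prefix}1"), codes);
        }
    }
}

fn encode(text: &str, codes: &HashMap<String, String>) -> String {
    text.split_whitespace()
        .filter_map(|w| codes.get(w).cloned())
        .collect()
}
```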
 this would effectively be a compressed database. I like this idea a lot. I kinda want to try it... but the database would be completely different.

but tbh, most of the storage space seems to be from keys and signatures anyways. i guess you can have a dictionary over pubkeys, but .... eh
 Yeah, the dictionary of keys and IDs was the first thing I did on Amethyst. It's why our memory use is minimal. There are no duplicated bytearrays anywhere in memory. 
Bonus points if you do all of that WHILE also storing the markdown/AsciiDoc/Nostr node information.

For instance, the database would store an nprofile word, but you have to load the user, write the user's name with a link, and then run the muted-words procedure against the user's name, not the nprofile string :)
 it reminds me of the kind of tokenization that LLMs do (though that isn't strictly per word) 
 ccan/htable.  
Yeah, you don't need minimal. It's overkill: you'll never get those cycles back in savings. But ccan/htable will keep the damage to a single cache line for the common case (a miss), which is what you want.
 I mainly care about space savings, since I will be saving this alongside each note in the DB 
I have realtime requirements in notedeck (144fps+), so I like to have important things precomputed by nostrdb ahead of time, like a fast and small data structure for testing mute words.
I could do this outside of nostrdb in a separate worker pool, but my ingester threads are already worker pools, and it seems like this will be such a common operation that having a small and optimized data structure on each note makes sense to me.
I also have this for note blocks, which are the parsed sections of a note; these are also stored in the database. This way, each frame I can simply walk and interact with these structures immediately after they come out of the local database subscription and enter the timeline.
I'm confused. If you want to know "does this have muted words in it?", that's one bit? If you want to support dynamically changing mute words, then you probably do a pair of generation numbers (one for words added to the mute list, one for words removed) and then recalc on the fly if it could be a false negative/positive.

The performance difference between a 50% full hash table and a 100% full one is minimal in practice. And if you don't use the bitstuffing tricks of ccan/htable, you will get a second cache hit to actually validate the miss.
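A sketch of that generation-number idea (illustrative names, not nostrdb code): cache one bit per note plus the mute-list generations it was computed against, and only recompute when the cached answer could have become wrong.

```rust
#[derive(Clone, Copy)]
struct MuteGen {
    added: u64,   // bumped whenever a word is added to the mute list
    removed: u64, // bumped whenever a word is removed
}

#[derive(Clone, Copy)]
struct CachedMuteBit {
    muted: bool,
    gen: MuteGen, // generations at the time this bit was computed
}

fn is_muted_cached(
    cache: &mut CachedMuteBit,
    current: MuteGen,
    recompute: impl Fn() -> bool,
) -> bool {
    let stale = if cache.muted {
        // A "muted" answer can only be invalidated by removing words.
        cache.gen.removed != current.removed
    } else {
        // A "not muted" answer can only be invalidated by adding words.
        cache.gen.added != current.added
    };
    if stale {
        cache.muted = recompute();
        cache.gen = current;
    }
    cache.muted
}
```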
 would a bloom or similar filter be better? 
is htable a perfect hashmap? what's the perfect_bitnum thing
 when i say that censorship is wasteful, this is what i mean

nostr:nevent1qvzqqqqqqypzqvhpsfmr23gwhv795lgjc8uw0v44z3pe4sg2vlh08k0an3wx3cj9qqs8plq5d29mmemz3nhh8e2wjez4awswqka9vlh5jxls36kee428wqqy52s4t 
 Spell check dictionaries use BK trees.
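A minimal BK-tree sketch (everything here is illustrative; the post above doesn't specify an implementation), keyed on Levenshtein distance the way spell-check lookups typically are: the triangle inequality lets a query skip whole subtrees whose distance key is out of range.

```rust
use std::collections::HashMap;

struct BkNode {
    word: String,
    children: HashMap<usize, BkNode>, // keyed by edit distance to this word
}

fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

impl BkNode {
    fn new(word: &str) -> Self {
        BkNode { word: word.to_string(), children: HashMap::new() }
    }

    fn insert(&mut self, word: &str) {
        let d = levenshtein(&self.word, word);
        if d == 0 {
            return; // already present
        }
        self.children
            .entry(d)
            .or_insert_with(|| BkNode::new(word))
            .insert(word);
    }

    /// Collect all words within `max` edits of `query`, pruning children whose
    /// distance key can't possibly contain a match.
    fn query<'a>(&'a self, query: &str, max: usize, out: &mut Vec<&'a str>) {
        let d = levenshtein(&self.word, query);
        if d <= max {
            out.push(&self.word);
        }
        for (&k, child) in &self.children {
            if k + max >= d && k <= d + max {
                child.query(query, max, out);
            }
        }
    }
}
```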