Oddbean

Can we create a Nostr filter where the client sends a bloomfilter instead of a massive list of pubkeys of the 1000 people an account is following? At .01% error rate, the filter size drops from ~67kb to 2.3Kb. It looks like a good data-saving feature. 🤔

I have no clue what you are talking about (bloom filter) but I will take "less data" any day over "100 % accuracy" .. no one needs exact number of people they follow .. much less the followers ..

You can use a Golomb-Rice encoded list then. It can encode all the PKs in a compact way and the relay can expand it without needing to use any other data source. It doesn't have errors and it is less cpu intensive, specially for the relay. Take it into account.

Isn't that in the same ballpark of what zip/zlib compressions alredy do? The Websocket connection can be compressed using gzip, etc. So, whatever we do needs to offer a significant again when compared to a an already compressed representation of the keys in hex strings.

Zip won't help much when compressing lists of pubkeys. Even for a mobile client, uploading 67kB is not much, especially compared to what relays send in reply. If you want to make your client more responsive on startup, maybe you can query frequent posters first, to get something on the screen and send the full filter later. If you consider involving the relay ... maybe we can reference collections of pubkeys via list events? The relay would get the relevant list event via it's naddr and then expand that list into the original query. Instead of {"authors":["c88821bdafd6c608107c9c4969672b57c3a8c85f2a0f6bc7b749c3c5609b3424"]}, {"authorLists":["naddr1qvzqqqr4xypzq3huhccxt6h34eupz3jeynjgjgek8lel2f4adaea0svyk94a3njdqy88wumn8ghj7mn0wvhxcmmv9uq3uamnwvaz7tmwdaehgu3dwp6kytnhv4kxcmmjv3jhytnwv46z7qqlgd6hyun9de6zq3n0d3kx7aeqf35hxapvyqcnzte3xqhnyvpjxvx3tr5w",...]}

Very costly for the server then. But I think there are DBs that support probabilistic filters. Cuckoo filters maybe. Redis appears to support Cuckoo and Bloom. On the other hand, a 1/10000 FPP would give you the events of 1/10000 users which hopefully will be a lot. Yet on the other hand, these filters are typically used with much lower FPP.

Aren't bloom filters faster than string comparisons? Maybe they aren't faster than pre-computed indexes? Also, I assume relays would pre-compute and store the set of hashes needed for a standard Nostr Bloom filter.

Bloom filters use sort of string comparisons. They use bitmask comparisons. I'm not sure how an index could work for such a bitmask but apparently redis does something there. Pre-computing the hashes would make things super expensive again. In your example of a 67kB Bloom filter, pubkeys extended with hashes would be 67kB each, too. I tried to do this myself but the best I got was to use probabilistic filters to save memory for active filters, not for querying the DB and thus not for the communications neither. The nice part is that you can mix event IDs and authors in the same filter. https://github.com/Giszmo/NostrPostr/blob/master/nostrpostrlib/src/main/java/nostr/postr/Filter.kt#L195

Leo, I need to define a bloom/cuckoo filter format for pubkeys so that implementation A creates the filter that implementation B will use. Do you have any recommendations on how to encode the resulting filter itself? The filter will be saved in every event as a tag, so it must be something the other client can easily decode and apply. Bonus if the writer of the filter can specify the precision. Any tips?

I'm not sure what you want to achieve exactly but bloom filters are really trivial in how they work and thus easy to serialize. Now as you are talking about use across implementations, you would need some serialization standard or come up with one. I know there are libraries that allow serialization but apparently the serializations they consume or produce are not compatible from one library to the other. If you have a compelling use case for a nip for example, I bet we can come up with a standard for nostr. Are you working on something? With regards to your original question, I think it's easier and more efficient to maybe tag lists in queries so you can refer to these tags in subsequent queries as 67kB big data structures aren't great to keep around with every pubkey for example and neither are 67kB great to index.

Data saving in filters for extra computation and logic in relays if I understand correctly. Could this also provide some privacy benefits? The relay is not sure about the exact pubkeys the client is interested in.

Isn't that 0.01% error rate over the total number of public keys in the set? E.g. if I follow 1,000 out of the 1 million publeys on nostr, I'll get guaranteed updates from my 1k follows plus updates from about 100 random people (1e6 x 0.0001)?

% of all nostr keys. But once the client receives the data, it can filter further to the matching IDs. So, you will see the same thing, but there will be X number of posts returned that are not in your follow lists. i.e. errors consume data but doesn't change your UX

Good point, I am not sure how heavy the error rates would be. We might need to decrease the error % as new users join the protocol 🤔 But keep in mind filters change every second on Amethyst. So, it's 60kb per second in exchange of the added payloads from the errors ONCE.

i don't really know much about how bloom filters work, but I had envisioned that requests could use an event id that the relay already has and use the content/tags of that event to get the pubkeys/tags for the request, the request would be a single note id instead of 1000's. it doesn't maybe impact the burden on relays but helps keep the request small. not sure how any of this is actually feasible and relevant to your post...