Oddbean
 I read the source on wikistr for the kind 1987 specification.
I like the idea of precalculating embeddings at content-generation time for efficiency.

One main question though:
You cannot avoid forged embeddings, can you?
This is SEO back to square one. The content “does not speak for itself”.
You cannot get around some concept of source reputation, I guess.

Or is there a fast way to check on an embedding by verifying it against the content? Like we do for hashes? 
If you are grabbing embeddings from other users, the event should link to the model that produced them. There are known techniques for recovering approximate text from an embedding, and you can also just re-embed the text yourself and compare the resulting vectors.

But if you're grabbing enough of them to make comparisons, you're extending some level of trust, because at that scale you might as well compute the embeddings yourself. And if the embeddings are already being used in a recommendation system, I'd imagine they're there because they're useful and help organize the content, so I'd expect less incentive for users to maliciously insert embeddings.
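To make the "re-embed and compare" check concrete, here is a minimal sketch. It assumes you have access to the same model named in the event; the `toy_embed` character-count function below is a stand-in for a real embedding model, and the `verify_embedding` name and threshold are my own invention, not from the spec. Unlike a hash check, this is approximate, so you accept anything above a similarity threshold rather than demanding exact equality.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def verify_embedding(content, claimed_vector, embed, threshold=0.99):
    # Re-embed the content with the model the event claims to have used,
    # then compare. Approximate by nature: minor numeric drift between
    # model builds means we check a threshold, not exact equality.
    own_vector = embed(content)
    return cosine_similarity(own_vector, claimed_vector) >= threshold

# Stand-in "model": letter-frequency vector. A real check would call the
# actual embedding model referenced by the event.
def toy_embed(text):
    return [float(text.count(chr(ord("a") + i))) for i in range(26)]
```

Note the cost asymmetry versus hashes: verifying still requires one full forward pass of the model per note, so it is no cheaper than producing the embedding yourself.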
 I will add that I would expect to see embedding-only relays for search purposes.  The relays would be write-restricted so embeddings can't be forged.

To make those workable, they would probably be a paid service, but they'd be much less vulnerable to forgery.  Such services would likely need to employ bots to generate embeddings on new content when it is published, but as we saw with ReplyGuy, it's not hard to make bots that can interact with all new content. 
 Yup, specialized relays = lowest hanging fruit for trust/anti-forgery 
The key here is that relays are the only ones that can truly moderate content for a community: they can hard-delete, block universally, allow universally, define allowed or blocked keywords, manually curate, etc.
Everyone else can only issue events and hope for the best, but a relay is a real server and directly controls the data set on a machine. That's much more persistent.

And communities need to determine what they trust, and what they don't. Then individuals can choose a community whose moderation they prefer, or run their own. Because to trust data, is to value its usefulness and accuracy, but valuations are always subjective. 
 “Because to trust data, is to value its usefulness and accuracy, but valuations are always subjective.”
This is nicely put and consistent.

But if you follow this road further, you're taking a step orthogonal to other approaches to digital search. You must make it clear to client developers that they may need to verify content themselves.
Because the end user conducting a RAG search may or may not be protected from forged results. That's a difficult UX question to answer.
I like the idea of long-lived embedding bots more than that of specialized relays.

A long-lived bot attaching embeddings to notes could build a track record of signed events and become a trusted npub in the nostr community. This doesn't have much to do with the relay (except perhaps retrieval of notes, which requires a separate filter implementation).
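A sketch of what such a bot-issued embedding event could look like, assuming kind 1987 carries the vector in `content` and points at the embedded note with an "e" tag; the "model" tag name is my own guess, not taken from the spec. The id is computed as in NIP-01 (sha256 over the serialized `[0, pubkey, created_at, kind, tags, content]` array); a real event would additionally carry a Schnorr signature, omitted here.

```python
import hashlib
import json
import time

def build_embedding_event(bot_pubkey, note_id, vector, model_name):
    # Hypothetical shape for a kind-1987 embedding event issued by a
    # long-lived bot npub. The "model" tag is an assumption.
    event = {
        "pubkey": bot_pubkey,
        "created_at": int(time.time()),
        "kind": 1987,
        "tags": [["e", note_id], ["model", model_name]],
        "content": json.dumps(vector),
    }
    # NIP-01 id: sha256 of the compact JSON serialization of
    # [0, pubkey, created_at, kind, tags, content].
    serialized = json.dumps(
        [0, event["pubkey"], event["created_at"], event["kind"],
         event["tags"], event["content"]],
        separators=(",", ":"), ensure_ascii=False,
    )
    event["id"] = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
    return event
```

Because every such event is signed by the bot's npub, its whole history is publicly auditable, which is exactly the track record the trust argument relies on.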

To make an analogy:
Prefer trusted librarians over a good library building.

This all fits squarely into the topic of DVMs. The clients should be diverse and heavy, not the relays.

And most importantly:
We should not forget that bot npubs can be long-lasting as well! 
Perhaps not specialized relays that do a few functions well, but an additional feature for existing relay architecture. Embeddings need the database for efficient storage and retrieval, so a relay that supports them should use this "module" to know what to do with those event kinds. Otherwise, block the events: you don't want to be holding embeddings unless you actually care about them.

With that, you can automatically embed events as they come in, or you can have an npub act as an agent to compute embeddings on demand. I'd expect embedding every kind 1 flowing through a relay, or even a set of relays, to be intensive, so you optimize for the quality of the content you embed. Users calling a bot for the job is definitely a signal of quality.
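The opt-in "module" idea above can be sketched as a small ingest policy: a relay either rejects kind-1987 events outright or, if it has opted in, optionally embeds plain notes at ingest time. Class and parameter names are hypothetical; kind 1 is the text-note kind from NIP-01.

```python
EMBEDDING_KIND = 1987  # the embedding kind discussed in this thread
TEXT_NOTE = 1          # plain text note per NIP-01

class EmbeddingAwareRelayPolicy:
    def __init__(self, accept_embeddings, embed=None):
        self.accept_embeddings = accept_embeddings
        self.embed = embed  # optional model callback for ingest-time embedding

    def on_event(self, event):
        # A relay that doesn't care about embeddings blocks them outright,
        # so it never stores vectors it can't use.
        if event["kind"] == EMBEDDING_KIND and not self.accept_embeddings:
            return ("reject", "embeddings not supported here")
        # Opted-in relays may embed plain notes as they come in.
        if event["kind"] == TEXT_NOTE and self.accept_embeddings and self.embed:
            event["embedding"] = self.embed(event["content"])
        return ("accept", "")
```

Embedding selectively (only notes someone paid a bot to process, say) rather than every kind 1 is what keeps the compute cost bounded.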
The relay could also call a bot, because someone needs to sign the embedding-enriched notes.
 I think our plan was to have our own bots that subscribe to our own relay. We have an automation server, too. 
 I think librarian versus library might be a false dichotomy, as a very good librarian might declare that she will be found in one particular library, where she stores the actual books, rather than merely making recommendations. 
 The job description of a librarian probably splits into 3 areas:

- Incoming: catalog, archive, quality-check, fine
- Outgoing: filter, reference, recommend, make lists, order, rate limit
- Async: reorganize, refresh, order, archive

Am I missing some?