Oddbean new post about | logout
 Imagine a nostr search engine. 

Like Google keyword search, but for nostr.

Clean, simple UX.

How?

PageRank for pubkeys.

Why not? 
 I think @brugeman was working on this but I haven’t seen any new developments on it for a while.  
 interesting, I’d like to check that out 
 There are others too. @primal and @damus both have great search built in too 
 Anyone using a PageRank-like algo to rank results? 
 https://trust.nostr.band/ 
 Yup, exactly what I’m looking for. 🙏🏼 
 noogle? noogle.lol 
 Very cool - is this foss? I cannot find the link to the repo … 
 Good question, dunno 
 Did you change your mind on Pagerank? 
 Nope. I’m still using GrapeRank, as per our previous discussions — a modification of PageRank that introduces a nonlinear term which means we can’t use the eigenvalue method, we must use the iterative method which is computationally more expensive, but not prohibitively so. 
 I’m thinking there might be a place for PageRank and a place for GrapeRank. If the goal is stratification of keyword search results, maybe PageRank is the way to go bc it’s easier to calculate. If the goal is to use the score as a weight, like weighted averages, then GrapeRank is preferable. 
 Interesting 
 Sim, mais ou menos isso, você pode pesquisar por usuários e por relays, só serão exibidos usuários ativos mapeados e relays ativos. 
 Yes, you can search for users and relays...

https://nosbook.org 
 Nostr keyword search - @Vic have you taken a look at this? 
 How are the results stratified? 
 It is a centralized solution divided into three stages (indexer, engine, and web application).

The indexer basically starts from my public key and retrieves the people I follow, then iterates through the list of people I follow and does the same, and so on, until the searches return no more results.

The indexer inserts users into a Trie tree, where the public key is the address in the tree, and at the end of each branch is a user. This user has a linked list of addresses in the tree for the users they follow. So, I have a mix of a graph and a tree for efficient searching.

Finally, this "search engine" application is not directly exposed; there is a web application that actually makes requests to the search engine, similar to an unauthenticated graph database. The API does the same.

The search is conducted in the graph starting from the user initiating the search; if no public key is provided, the search begins from my profile.

P.S.: There is still a minor bug that's distorting the search results; I'll fix it as soon as I have time. 
 This is very interesting. @ManiMe , @cloud fodder and I are working on something with a similar starting point. We start with one or more “seed user” pubkeys, usually one pubkey which is the end-user (the customer for the service we are creating); fetch seed users’ follows, their follows, their follows, ad infinitum until no new pubkeys are obtained. If I understand correctly, that’s your starting point as well.

Right now my implementation typically ends up with a list of about 150k to 200k connected pubkeys. Is that about how many you get? 
 Hey Mises, im taking a look at your indexer now. 
https://github.com/misesdev/nostr-indexer

You should take a look at our GrapeRank engine. Looks like we’re trying to solve the same problem … in almost the same manner. … maybe we can colab?

https://github.com/Pretty-Good-Freedom-Tech/graperank-nodejs

Live demo at
https://grapevine.my 
 Also check this out:

https://grapevine-brainstorm.vercel.app/

As of right now, @ManiMe ‘s implementation is more up to date than mine in terms of follows bc fresh follows data is fetched; my implementation is more thorough bc follows data is scraped ahead of time, but data is less up to date. (And maybe @ManiMe is catching up with mine wrt thoroughness of data). 
 Sure, I'll take a look; we can collaborate!

The engine repository itself is private for now, but I'll make it public soon. It's equivalent to an in-memory database; the project is in C and is still a bit amateurish, haha. 
 If I’m following you correctly, the search results will be stratified by how many hops away the content author is from the user initiating the search (or from you). So I’ll see content from a user 3 hops away from me ranked more highly than a user 4 hops away. Is that right? 
 Not exactly. I did some tests, and initially I wanted the search to be performed in a "network" starting from the user who is doing the search, considering the "distance" within that network as a factor for relevance — meaning, in a search, users closer to me, with more similarity to my query, would have higher relevance, and those farther away would have lower relevance.

In theory, it looks great, but when I tested it in practice, perhaps because the number of users was still small, the results were quite poor. So, I added a "similarity" property to the users resulting from the search and simply sorted the list at the end in descending order by the "similarity" property.

In the form described above, the results were much more accurate, so once I got the desired outcome, I optimized the code to make the search as fast as possible. To summarize, the most efficient approach given the current number of users is to take the list and perform a simple search sorted by similarity. I’m still not making good use of the graph properties, and I’ll revisit the analysis when the user base is large enough for graph-based search results to be more relevant.

This week, I’ll solve an issue in the method that calculates the similarity between users’ names in the list and the search term. There’s a bug causing it to return poor results quite often. 
 How do you calculate the similarity score? Is it PageRank or something like it?

And are you using a graph db like neo4j? (Which has powerful tools for calculation of various similarity scores) 
 I'm a few steps behind, my project is still a bit primitive haha

Right now, it only performs a simple breadth-first search (BFS) on the graph, where I calculate the similarity between user names and the search term using an algorithm I implemented based on this paper: http://www.catalysoft.com/articles/StrikeAMatch.html

I'm not using any database; I implemented everything from scratch in C to achieve the highest possible performance. By using the paper above and implementing it in the most optimized way possible with bitwise operations and everything, the similarity calculation became orders of magnitude faster than all existing methods—about 300% faster than cosine similarity, for example.

But it's still a pretty primitive project, as I mentioned. Right now, it's just a simple search in a graph and list. Later on, depending on the needs that come up, I might implement something like PageRank when considering user posts. For now, the search is just done by profile. 
 So do you query relays for follows and build the graph on the fly at the time of the search request? 
 In the case of the relay search, it doesn't involve a graph—it's just a simple list, since the number of relays is very small and is not likely to increase significantly. The similarity calculation is the same, but it's done by iterating over a simple array.

On the front end, a POST request is made to the relay, which eliminates any relays that don't respond to the request, listing only the active relays. This ensures that only active relays will appear in the search results. Another thing I implemented is a 2-second timeout for the requests made to the relays, which also ensures that only the relays closest to you, and thus more relevant, will appear.

I only use a graph for the users. The relays are mapped in kind 3 events, which contain a list of relays and are sent to the engine that saves them in a list loaded into RAM. The graph is also loaded into RAM at the application startup, and everything runs in memory.
 
 @ManiMe we should dig into this to compare it to what we are doing 
 Nostr keyword search - @Vic have you taken a look at this? 
 How are the results stratified? 
 It is a centralized solution divided into three stages (indexer, engine, and web application).

The indexer basically starts from my public key and retrieves the people I follow, then iterates through the list of people I follow and does the same, and so on, until the searches return no more results.

The indexer inserts users into a Trie tree, where the public key is the address in the tree, and at the end of each branch is a user. This user has a linked list of addresses in the tree for the users they follow. So, I have a mix of a graph and a tree for efficient searching.

Finally, this "search engine" application is not directly exposed; there is a web application that actually makes requests to the search engine, similar to an unauthenticated graph database. The API does the same.

The search is conducted in the graph starting from the user initiating the search; if no public key is provided, the search begins from my profile.

P.S.: There is still a minor bug that's distorting the search results; I'll fix it as soon as I have time. 
 This is very interesting. @ManiMe , @cloud fodder and I are working on something with a similar starting point. We start with one or more “seed user” pubkeys, usually one pubkey which is the end-user (the customer for the service we are creating); fetch seed users’ follows, their follows, their follows, ad infinitum until no new pubkeys are obtained. If I understand correctly, that’s your starting point as well.

Right now my implementation typically ends up with a list of about 150k to 200k connected pubkeys. Is that about how many you get? 
 Hey Mises, im taking a look at your indexer now. 
https://github.com/misesdev/nostr-indexer

You should take a look at our GrapeRank engine. Looks like we’re trying to solve the same problem … in almost the same manner. … maybe we can colab?

https://github.com/Pretty-Good-Freedom-Tech/graperank-nodejs

Live demo at
https://grapevine.my 
 Also check this out:

https://grapevine-brainstorm.vercel.app/

As of right now, @ManiMe ‘s implementation is more up to date than mine in terms of follows bc fresh follows data is fetched; my implementation is more thorough bc follows data is scraped ahead of time, but data is less up to date. (And maybe @ManiMe is catching up with mine wrt thoroughness of data). 
 Sure, I'll take a look; we can collaborate!

The engine repository itself is private for now, but I'll make it public soon. It's equivalent to an in-memory database; the project is in C and is still a bit amateurish, haha. 
 If I’m following you correctly, the search results will be stratified by how many hops away the content author is from the user initiating the search (or from you). So I’ll see content from a user 3 hops away from me ranked more highly than a user 4 hops away. Is that right? 
 Not exactly. I did some tests, and initially I wanted the search to be performed in a "network" starting from the user who is doing the search, considering the "distance" within that network as a factor for relevance — meaning, in a search, users closer to me, with more similarity to my query, would have higher relevance, and those farther away would have lower relevance.

In theory, it looks great, but when I tested it in practice, perhaps because the number of users was still small, the results were quite poor. So, I added a "similarity" property to the users resulting from the search and simply sorted the list at the end in descending order by the "similarity" property.

In the form described above, the results were much more accurate, so once I got the desired outcome, I optimized the code to make the search as fast as possible. To summarize, the most efficient approach given the current number of users is to take the list and perform a simple search sorted by similarity. I’m still not making good use of the graph properties, and I’ll revisit the analysis when the user base is large enough for graph-based search results to be more relevant.

This week, I’ll solve an issue in the method that calculates the similarity between users’ names in the list and the search term. There’s a bug causing it to return poor results quite often. 
 How do you calculate the similarity score? Is it PageRank or something like it?

And are you using a graph db like neo4j? (Which has powerful tools for calculation of various similarity scores) 
 I'm a few steps behind, my project is still a bit primitive haha

Right now, it only performs a simple breadth-first search (BFS) on the graph, where I calculate the similarity between user names and the search term using an algorithm I implemented based on this paper: http://www.catalysoft.com/articles/StrikeAMatch.html

I'm not using any database; I implemented everything from scratch in C to achieve the highest possible performance. By using the paper above and implementing it in the most optimized way possible with bitwise operations and everything, the similarity calculation became orders of magnitude faster than all existing methods—about 300% faster than cosine similarity, for example.

But it's still a pretty primitive project, as I mentioned. Right now, it's just a simple search in a graph and list. Later on, depending on the needs that come up, I might implement something like PageRank when considering user posts. For now, the search is just done by profile. 
 So do you query relays for follows and build the graph on the fly at the time of the search request? 
 In the case of the relay search, it doesn't involve a graph—it's just a simple list, since the number of relays is very small and is not likely to increase significantly. The similarity calculation is the same, but it's done by iterating over a simple array.

On the front end, a POST request is made to the relay, which eliminates any relays that don't respond to the request, listing only the active relays. This ensures that only active relays will appear in the search results. Another thing I implemented is a 2-second timeout for the requests made to the relays, which also ensures that only the relays closest to you, and thus more relevant, will appear.

I only use a graph for the users. The relays are mapped in kind 3 events, which contain a list of relays and are sent to the engine that saves them in a list loaded into RAM. The graph is also loaded into RAM at the application startup, and everything runs in memory.
 
 @ManiMe we should dig into this to compare it to what we are doing 
 @cloud fodder and @straycat, the number of iteration to reach convergence (~10^-6 tollerance) for pagerank is roughly 100.

Why do you perform only 8 iterations?  
 My version of GrapeRank determines “convergence” without arbitrary iteration limits … by excluding “calculated” scorecards (whose scores have not changed beyond a “precision” threshold)  from future iterations.

https://github.com/Pretty-Good-Freedom-Tech/graperank-nodejs/blob/main/src/Calculator/index.ts#L255-L261

Number of iterations varies with “depth” of follows calculated (and other parameters) but usually converges around 10 or 20 . 
 that's a good approach  
 We just picked 8 as a place to start. Still needs work.

My most recent implementation is at grapevine-brainstorm.vercel.app (link to repo in the footer) and I set it to iterate until convergence, defined as sum over (delta-score)^2 is less than some threshold, and I think around 12 or 15 iterations is where it stopped. @ManiMe uses a similar method. 
 My version of GrapeRank determines “convergence” without arbitrary iteration limits … by excluding “calculated” scorecards (whose scores have not changed beyond a “precision” threshold)  from future iterations.

https://github.com/Pretty-Good-Freedom-Tech/graperank-nodejs/blob/main/src/Calculator/index.ts#L255-L261

Number of iterations varies with “depth” of follows calculated (and other parameters) but usually converges around 10 or 20 . 
 that's a good approach