Not exactly. I did some tests, and initially I wanted the search to be performed in a "network" starting from the user who is doing the search, considering the "distance" within that network as a factor for relevance — meaning, in a search, users closer to me, with more similarity to my query, would have higher relevance, and those farther away would have lower relevance.
In theory, it looks great, but when I tested it in practice, perhaps because the number of users was still small, the results were quite poor. So, I added a "similarity" property to the users resulting from the search and simply sorted the list at the end in descending order by the "similarity" property.
With that approach, the results were much more accurate, so once I had the desired outcome I optimized the code to make the search as fast as possible. In short, the most efficient approach at the current number of users is to take the list, run a simple search, and sort the results by similarity. I'm still not making good use of the graph's properties, and I'll revisit the analysis once the user base is large enough for graph-based search results to be more relevant.
This week, I’ll solve an issue in the method that calculates the similarity between users’ names in the list and the search term. There’s a bug causing it to return poor results quite often.
How do you calculate the similarity score? Is it PageRank or something like it?
And are you using a graph db like neo4j? (Which has powerful tools for calculation of various similarity scores)
I'm a few steps behind, my project is still a bit primitive haha
Right now, it only performs a simple breadth-first search (BFS) on the graph, where I calculate the similarity between user names and the search term using an algorithm I implemented based on this paper: http://www.catalysoft.com/articles/StrikeAMatch.html
I'm not using any database; I implemented everything from scratch in C to get the highest possible performance. Using the paper above and optimizing the implementation as much as possible, with bitwise operations and so on, the similarity calculation ended up much faster than the alternatives I tested: about 300% faster than cosine similarity, for example.
But it's still a pretty primitive project, as I mentioned. Right now, it's just a simple search in a graph and list. Later on, depending on the needs that come up, I might implement something like PageRank when considering user posts. For now, the search is just done by profile.
So do you query relays for follows and build the graph on the fly at the time of the search request?
In the case of the relay search, it doesn't involve a graph—it's just a simple list, since the number of relays is very small and is not likely to increase significantly. The similarity calculation is the same, but it's done by iterating over a simple array.
On the front end, a POST request is made to the relay, which eliminates any relays that don't respond to the request, listing only the active relays. This ensures that only active relays will appear in the search results. Another thing I implemented is a 2-second timeout for the requests made to the relays, which also ensures that only the relays closest to you, and thus more relevant, will appear.
I only use a graph for the users. The relays are mapped in kind 3 events, which contain a list of relays and are sent to the engine that saves them in a list loaded into RAM. The graph is also loaded into RAM at the application startup, and everything runs in memory.
@ManiMe we should dig into this to compare it to what we are doing