 From 0ec9e624e6172fb1c216d70ce148459a7ad385a0 Mon Sep 17 00:00:00 2001
Subject: [PATCH 0/5] Add "Pharos" AsciiDoc Parser

This proposal adds a custom extension to the Asciidoctor library that generates Nostr Knowledge Base events from an AsciiDoc document.  The `Pharos` class extends the `Asciidoctor` class with a custom tree parser that maps the relationships between the AsciiDoc sections and blocks.  Index (kind 30040) and zettel (kind 30041) events are generated based on these relationships and on the content of the respective blocks and sections.  Asciidoctor generates unique IDs for each AsciiDoc block, and these unique IDs are used as d tag identifiers.  Future developments will parse metadata such as author, edition, and publication date from the AsciiDoc document, and will add methods to support remapping the relationships between generated Nostr events. 
 For those interested, you can see my progress thus far on an AsciiDoc parser and event generator for Alexandria here: 

https://gitworkshop.dev/r/naddr1qvzqqqrhnypzplfq3m5v3u5r0q9f255fdeyz8nyac6lagssx8zy4wugxjs8ajf7pqq9yzmr90pskuerjd9ssphy5dc/proposals/note187a6pmzk8py7pyz095g6jz9w38axc68yusdq0mr4kejm2v5rrj8qcm4f82

#NostrDev #Alexandria #GitCitadel 
 It looks like you've made author a required field, but NKBIP-01 says it's optional.
On the uploader, I have a construct_d_tag function with a fall-through statement, that adds on each additional available variable, ending with $dTagTitle . "-by-" . $dTagAuthor . "-v-" . $dTagVersion

So, something like The-fox-and-the-hound-by-Aesop-v-fourth-revised 
 But The-fox-and-the-hound or The-fox-and-the-hound-v-fourth-revised or The-fox-and-the-hound-by-Aesop are also correct. 
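A minimal TypeScript sketch of that fall-through, not the uploader's actual PHP code. The name `constructDTag` and the whitespace-to-hyphen normalization are assumptions for illustration; only the `-by-` and `-v-` separators come from the description above.

```typescript
// Hypothetical helper: builds a d tag identifier from whichever fields are available.
function constructDTag(title: string, author?: string, version?: string): string {
  // Assumed normalization: runs of whitespace become single hyphens.
  const slug = (s: string): string => s.trim().replace(/\s+/g, "-");

  let dTag = slug(title);
  if (author) dTag += "-by-" + slug(author);    // only added when an author is known
  if (version) dTag += "-v-" + slug(version);   // only added when a version is known
  return dTag;
}

console.log(constructDTag("The fox and the hound", "Aesop", "fourth revised"));
console.log(constructDTag("The fox and the hound"));
```

Because each optional part is appended independently, all four variants listed above fall out of the same function.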
 I don't do anything to pull author from the document yet, though I will add that soon.  Asciidoctor has some nice convenience methods that automatically parse fields like author out of the document.

Thank you for reminding me to include author and version information in the d tag, I hadn't thought about that detail yet. 
 Why is page.ts empty? Am I missing some information or is it just a placeholder? 
 I just haven't filled that part in, yet.  I wanted to get the parser work up so the team could look at it, because feedback on the parser may heavily impact the UI. 
 Wow, five levels. Nice. 
 What happens if it has more than 5 sections? 
 Find it a bit confusing that the indices are sections and the sections are blocks, and the nodes can be indices or sections, which means that they can be sections or blocks, and the commentary  and function "zettel", but they aren't called zettel in the function.
Clashes too much with the domain, as described in NKBIP-01 and in my Uploader. Can we maybe agree on one naming convention and write it into the spec?
Also problematic because only 30041 are zettel, so adding more kinds would make it even more confusing because you'd be referring to wiki pages as "zettel". 
 Or is a wiki page actually a zettel, in that context and we need to redo the spec?
The spec calls them "section/article". 
 just responding as I can right now, 30041 are the Zettel analogues, not the full article 
 Okay, it's all a bit fuzzy. The spec assumes 30040 is always an article and each linked element is a section of the article. But we also allowed for 30040 to be a book and each linked element is then a chapter. And we allowed for 30040 to be a collection and each linked element is then an article, in its own right. I made a 30040 that was an audiobook list and each 30041 contained an audiobook episode.

I actually prefer 30040 index and 30041 section, as we don't presume to know the content of the events, we're just describing what they "do": one contains a list of events, the other contains a portion of the content. 
 Yeah, bit fuzzy, but that's why in some earlier conversations I drew analogies to 30040s being some type of container event. They also got inspiration from lists and how you can have a top level note with other comments chained under it.

Containers can hold other containers, but collectively the pairing of 30040s with a sequence of various events constructs a modular article, not that 30040s are the modular article event. 
 Recording here what we discussed in Slack:

- 'Zettel' is the term we're using specifically for kind 30041 events.  They are intended to be bite-sized notes.
- 'Index' is the term we're using for kind 30040 events.
- 'Event' is the generic term.  Wiki entries, long-form articles, indices, and zettels are all 'events.'  If we don't have a specific term for an event kind in a given instance, we'll call it an 'event.'

Terms like 'node,' 'block,' and 'section' are specific to Asciidoctor.  I can add details in the code comments explaining how we are using different terms throughout the parser logic. 
 In terms of feed based implementation, maybe some 30040s should have an explicit 'top level' indicator. That way chapters won't be seen in the feed. 
 What happens if the first section contains no entries? Seems to just return an empty relay. Shouldn't it return FALSE? 
 @Michael J did you see this one? 
 I'm having a hard time figuring out the question.

Are you asking what happens if the first  index event doesn't have any child events?

In general, each paragraph will become its own kind 30041 zettel, so if there are no subsections, the document will be turned into a single index with a bunch of one-paragraph zettels.

Does that answer your question, or did I just misunderstand it? 
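To make the flat case concrete, here is a rough TypeScript sketch, not the Pharos implementation. `EventTemplate`, `flatDocumentToEvents`, and the numbered d tag scheme are hypothetical; the references from the index to its zettels are left out because they depend on the signed events.

```typescript
// Unsigned event template; id, pubkey, created_at, and sig are omitted for brevity.
interface EventTemplate {
  kind: number;
  tags: string[][];
  content: string;
}

// Hypothetical sketch: a titled document with no subsections becomes
// one kind 30040 index plus one kind 30041 zettel per paragraph.
function flatDocumentToEvents(title: string, body: string): EventTemplate[] {
  const paragraphs = body
    .split(/\n\s*\n/)                 // paragraphs are separated by blank lines
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  // Assumed d tag scheme, for illustration only.
  const slug = title.trim().toLowerCase().replace(/\s+/g, "-");

  const index: EventTemplate = {
    kind: 30040,
    tags: [["d", slug], ["title", title]],
    content: "",
  };

  const zettels: EventTemplate[] = paragraphs.map((p, i) => ({
    kind: 30041,
    tags: [["d", `${slug}-${i + 1}`]],
    content: p,
  }));

  return [index, ...zettels];
}

const events = flatDocumentToEvents("My Note", "First paragraph.\n\nSecond paragraph.\n");
console.log(events.map((e) => e.kind));
```

With an empty body the same function returns only the index, which is the header-only situation raised in the next post.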
 What if it's only 

"Document metadata

=== This is a header"

And then it ends. Maybe because someone copy-pasted wrong or pushed the button too early. Or maybe they just want to create an index and update it later, with the relevant e tags.

Then it would only generate a 30040, right? Or would it return an error? Or would they need to confirm that it should only generate a 30040? 
 It will parse the document down to 5 levels of nested sections (`===== Section`).  Currently, any sections deeper than that will be lumped into the content of the kind 30041 events generated from fifth-level blocks.

In theory, we can go arbitrarily deep, but at a certain point, we don't need that much depth in an editor workflow that is driven by direct user input (versus a programmatic upload of something large and complex, like the Bible).  Setting a cutoff makes things a bit easier for both us and, IMO, for the end user. 
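A simplified sketch of that cutoff in TypeScript. It scans raw lines instead of walking Asciidoctor's parsed tree as Pharos does, and `Section`, `MAX_LEVEL`, and `splitSections` are names made up for this example. "Level" here simply counts the leading `=` characters.

```typescript
interface Section {
  level: number;      // number of leading "=" characters in the heading
  title: string;
  content: string[];  // lines that belong to this section
}

const MAX_LEVEL = 5;  // headings deeper than "=====" are not split out

function splitSections(source: string): Section[] {
  const sections: Section[] = [];
  let current: Section | undefined;

  for (const line of source.split("\n")) {
    const match = /^(=+)\s+(.+)$/.exec(line);
    const marker = match?.[1] ?? "";
    const title = match?.[2] ?? "";

    if (marker.length > 0 && marker.length <= MAX_LEVEL) {
      // A heading within the cutoff starts a new section.
      current = { level: marker.length, title: title.trim(), content: [] };
      sections.push(current);
    } else if (current) {
      // Body text, or a heading that is too deep: keep it as content.
      current.content.push(line);
    }
    // Lines before the first heading are ignored in this sketch.
  }
  return sections;
}

const parsed = splitSections("== A\ntext\n===== Deep\nmore\n====== Too deep\nstill deep content");
console.log(parsed.map((s) => s.title));
```

The `====== Too deep` heading ends up inside the content of the `Deep` section rather than producing its own entry.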
 Okay, sounds good. That's how I handle it, as well. 
 would be nice on a first pass for how to examine and test some of the features like the parser. Like general workflows and expected behaviors. How'd one go about that? nostr:nprofile1qqs06gywary09qmcp2249ztwfq3ue8wxhl2yyp3c39thzp55plvj0sgprdmhxue69uhhg6r9vehhyetnwshxummnw3erztnrdakj7qguwaehxw309a6xsetrd96xzer9dshxummnw3erztnrdakj7qgcwaehxw309ahx7um5wgh8xmmkvf5hgtngdaehgtcw45ql5 have ideas? Would Gherkin be a use case? 
 Yeah, we need to write some Gherkin. We can already see the problem with the naming, arising out of a lack of consensus. 
 @DanConwayDev on ngit 1.4.5, when I try to run `ngit push` or `ngit pull`, I get the following message: 

> Error: cannot find proposal that matches the current branch name

The branch name on my local computer is `article-editor`, and I'm trying to push updates to the below proposal on GitWorkshop: 

nostr:nevent1qvzqqqqx2ypzquqjyy5zww7uq7hehemjt7juf0q0c9rgv6lv8r2yxcxuf0rvcx9eqydhwumn8ghj7argv43kjarpv3jkctnwdaehgu339e3k7mgqyqlmhg8v2cuyncysfuk3r2gg46yl5mrgunjp5plvwkmxtdfjsvwgu45txfp 
 I'm sorry, but I can't tell what you do with the article-title, how you get it into the d-tag and formulate the d-tag, with the title, author, and version. I must be blind. I just see a node ID, but I can't tell where it comes from or what it consists of. 
 Ah, it comes from Asciidoctor. What is in it? Is it a random string UID? 
 Asciidoctor parses a document into blocks.  A block can be a paragraph, an image, a chart, a code block, a mathematical formula, a section, or even a whole document.

For sections, the ID consists of the underscore-separated section title.  So if you had a section header `== Part 3: The Reckoning` in your document, then Asciidoctor's ID for that section would be something like `part_3_the_reckoning`.

It does something similar to generate IDs for paragraph blocks that don't have an explicit title, but I'm not sure yet exactly what the results come out to be.

If two IDs clash, Asciidoctor appends numbers to them to guarantee uniqueness. 
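As a rough approximation of that behavior (not Asciidoctor's actual code), the section ID scheme could be sketched like this in TypeScript. `makeIdGenerator` is a hypothetical name, the sketch only handles ASCII titles, and it leaves out Asciidoctor's configurable `idprefix`, which by default adds a leading underscore.

```typescript
// Rough approximation of section ID generation; the real algorithm handles more cases.
function makeIdGenerator(): (title: string) => string {
  const seen = new Set<string>();

  return (title: string): string => {
    const base = title
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, "_")  // collapse anything non-alphanumeric into "_"
      .replace(/^_+|_+$/g, "");     // trim separators from both ends

    let id = base;
    let n = 2;
    while (seen.has(id)) {
      id = `${base}_${n}`;          // append a counter when the ID is already taken
      n += 1;
    }
    seen.add(id);
    return id;
  };
}

const nextId = makeIdGenerator();
console.log(nextId("Part 3: The Reckoning"));
console.log(nextId("Part 3: The Reckoning"));
```

The generator keeps its own set of seen IDs, so uniqueness holds per document as long as one generator is used for the whole parse.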
 Same, bro. 😂

We were trying to think through the nested index structure, yesterday, and it turned into a multi-hour debate in three channels, to straighten it out.

https://gitworkshop.dev/r/naddr1qq9yzmr90pskuerjd9sszrthwden5te0dehhxtnvdakqygxufnggdntuukccx2klflw3yyfgnzqd93lzjk7tpe5ycqdvaemuqcpsgqqqw7vskyayug/proposals/note187a6pmzk8py7pyz095g6jz9w38axc68yusdq0mr4kejm2v5rrj8qcm4f82 
 A totally empty document should throw an error.  But I think if the user tries to publish a document, as long as we can pull the required fields for the index event tags from the document, we should go ahead and publish it.  Trust that our users know what they're doing, or are at least smart enough to figure it out sooner or later. 
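One way that rule might look in code, as a sketch only. `ParsedDocument` and `validateForPublish` are hypothetical, and the sketch assumes the title is the only required field for the index event.

```typescript
interface ParsedDocument {
  title?: string;      // assumed to be the only required field
  sections: string[];  // may be empty: a header-only document is still publishable
}

// Throws when the document cannot be published; returns normally otherwise.
function validateForPublish(doc: ParsedDocument, rawSource: string): void {
  if (rawSource.trim().length === 0) {
    throw new Error("Cannot publish an empty document.");
  }
  if (!doc.title || doc.title.trim().length === 0) {
    throw new Error("Cannot publish: the index event needs a title.");
  }
  // Deliberately no check on doc.sections: an index without zettels is allowed.
}

validateForPublish({ title: "This is a header", sections: [] }, "=== This is a header");
console.log("header-only document accepted");
```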
 I'm thinking through d tag conventions more, and we may have to revise our pattern somewhat.

The pattern `<title>-<author>-<version>` only really works for the root index of a whole book.  Chapters/sections within that book won't have author and version attached to them by default.  I see at least two options:

1. Carry author and version information down into the book's hierarchy, and attach it to every chapter and section index.  This makes the ids of _everything_ longer (which increases URI length within our app), but it increases the likelihood of name uniqueness even of sections within a book.

2. Don't include author and version information in the d tag identifier of every chapter and section, but include that information in the tags.  If we know the author and edition of the book within which a chapter resides, for instance, we'll include it in the index event generated for that chapter.

2.5. If we take this latter approach, I'm interested in exploring adding additional fields to the d tag array.  The question is whether the first identifier in that array is _always_ used to generate note identifiers for 30000-series events, or whether the whole d tag array is used.  Perhaps we can include full author and version information in the d tag array, but not have it be mandatory for looking up an event.  The details probably depend on relay indexing and search implementations, there.

Let me know which option you prefer, or if you have other alternatives. 
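For comparison, options 1 and 2 might produce tags along these lines. This is an illustrative TypeScript sketch: `BookMeta`, `toSlug`, both function names, and the `author` and `version` tag names are assumptions, not taken from the spec.

```typescript
interface BookMeta {
  title: string;
  author: string;
  version: string;
}

// Assumed normalization for identifiers.
const toSlug = (s: string): string => s.trim().toLowerCase().replace(/\s+/g, "-");

// Option 1: carry author and version down into the chapter's d tag identifier.
function chapterTagsOption1(book: BookMeta, chapter: string): string[][] {
  return [["d", `${toSlug(chapter)}-by-${toSlug(book.author)}-v-${toSlug(book.version)}`]];
}

// Option 2: keep the identifier short and put the metadata in separate tags.
function chapterTagsOption2(book: BookMeta, chapter: string): string[][] {
  return [
    ["d", toSlug(chapter)],
    ["author", book.author],    // hypothetical tag name
    ["version", book.version],  // hypothetical tag name
  ];
}

const book: BookMeta = { title: "War and Peace", author: "Leo Tolstoy", version: "Penguin Classics" };
console.log(chapterTagsOption1(book, "Chapter 1"));
console.log(chapterTagsOption2(book, "Chapter 1"));
```

Option 1 yields one long identifier per chapter, while option 2 keeps `chapter-1` as the identifier and relies on the extra tags to tell editions apart.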
 I like 2. For 2.5, is there a link to the d tag spec? The nips repo points to NIP01 but I don't see any reference to it.
 https://wikistr.com/nip-54*266815e0c9210dfa324c6cba3573b14bee49da4209a9456f9484e5106cd408a5 
 I think we were leaning into the definition for wiki pages, since that's the closest event kind. 
 Hmm. I do like 2.5 then. By default we keep the first entry as the title. Any other important searchable metadata (which can be different, depending on domain) could go there. Could this essentially combine nip36 and 54 to reduce redundancy? 
 Don't know. I don't fully understand 2.5. Need a map or something. 😅  
 I'm thinking something like:
```
"tags": [
  ["d", "chapter-1", "book:example-book", "author:John Doe", "version:1.0"],
  // other tags...
]
``` 
 Ah, I get it, thanks. I forget that the tags are arrays. 
 Y'all I searched and the NIPs repo doesn't formally define the `d` tag anywhere.

That means no one can tell us we can't extend it lol.

I was thinking we use positional values to extend the `d` tag array.  Something like:

```json
"tags": [
  ["d", "war-and-peace", "leo-tolstoy", "penguin-classics-edition"]
]
```

The general format would be `["d", <title>, <author>, <edition>]` where edition is a human-readable edition name (as opposed to just a number).

The event _should_ still be addressable by `#d` and the title, just like other event kinds' d tags, but we can increase address specificity for the Nostr Knowledge Base use case.  Then we can bake this specificity into wikilinks (see NIP-54 for an existing wikilinks specification).  A document might contain `[[War and Peace. Leo Tolstoy. Penguin Classics Edition.]]` and we can specify that clients should split that at the periods and normalize it into a d tag array reference, so it becomes `["#d", "war-and-peace", "leo-tolstoy", "penguin-classics-edition"]`.  The client can then use this tag array to search its relays, find the closest match, and display to the user a hyperlink to the referenced event.

Basically, we define a citation format for NKB events.

Since the event ID is derived from a serialization of the whole event, including tags, increasing identifier specificity will always generate a unique event ID.

We'll have to experiment with how relays index events by d tag, and how they respond to queries for such events.  Maybe Stella's PHP utilities can help us with that.  Worst case, relays only support searching by the first `d` tag value and return a bunch of matching results, and then we have Alexandria walk through the results and find matches by author and edition. 
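The wikilink normalization and the worst-case client-side fallback described in this post could be sketched roughly as follows. All names here are hypothetical, the normalization rules are assumed, and "closest match" is simplified to a positional prefix match on the d tag values.

```typescript
interface NostrEvent {
  kind: number;
  pubkey: string;
  tags: string[][];
}

// Turn "[[War and Peace. Leo Tolstoy.]]" into ["war-and-peace", "leo-tolstoy"].
function wikilinkToReference(wikilink: string): string[] {
  const inner = wikilink.replace(/^\[\[/, "").replace(/\]\]$/, "");
  return inner
    .split(".")
    .map((part) =>
      part
        .toLowerCase()
        .replace(/[^a-z0-9]+/g, "-")  // assumed normalization: non-alphanumerics become "-"
        .replace(/^-+|-+$/g, "")      // trim hyphens left over at the edges
    )
    .filter((part) => part.length > 0);
}

// Values of the first d tag, without the leading "d".
function dTagValues(event: NostrEvent): string[] {
  const dTag = event.tags.find((tag) => tag[0] === "d");
  return dTag ? dTag.slice(1) : [];
}

// Worst-case fallback: the relay matched on the title only, so the client
// keeps the events whose remaining d tag values also match the reference.
function matchReference(events: NostrEvent[], reference: string[]): NostrEvent[] {
  return events.filter((event) => {
    const values = dTagValues(event);
    return reference.every((value, i) => values[i] === value);
  });
}

const reference = wikilinkToReference("[[War and Peace. Leo Tolstoy. Penguin Classics Edition.]]");
const candidates: NostrEvent[] = [
  { kind: 30040, pubkey: "pubkey-a", tags: [["d", "war-and-peace", "leo-tolstoy", "penguin-classics-edition"]] },
  { kind: 30040, pubkey: "pubkey-b", tags: [["d", "war-and-peace", "leo-tolstoy", "oxford-edition"]] },
];
console.log(reference);
console.log(matchReference(candidates, reference).length);
```

One caveat with splitting at periods: titles or author names that contain periods themselves (initials, for example) would be cut into extra parts, so the citation format would probably need an escape rule or a different separator.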
 You think we should fine grain a difference between [[leo-tolstoy]] the topic and [[author:leo-tolstoy]] anything written by him? 
 Maybe using `[[author:leo-tolstoy]]` should link to a search page showing results with that `author` tag.  We can use a similar format for wikilinks to search results or tag-based feeds.  Perhaps you want to link to a feed of literature on a specific topic. 
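A small sketch of how a client might tell the two wikilink forms apart. `LinkTarget` and `parseWikilink` are hypothetical names, and treating everything before the first colon as the tag name is an assumption.

```typescript
type LinkTarget =
  | { type: "search"; tag: string; value: string }  // e.g. [[author:leo-tolstoy]]
  | { type: "reference"; value: string };           // e.g. [[leo-tolstoy]]

function parseWikilink(wikilink: string): LinkTarget {
  const inner = wikilink.replace(/^\[\[/, "").replace(/\]\]$/, "").trim();
  const separator = inner.indexOf(":");

  if (separator > 0) {
    // A "{noun}:" prefix means: link to search results for that tag.
    return {
      type: "search",
      tag: inner.slice(0, separator),
      value: inner.slice(separator + 1),
    };
  }
  // No prefix: treat the link as a plain reference to an event or topic.
  return { type: "reference", value: inner };
}

console.log(parseWikilink("[[author:leo-tolstoy]]"));
console.log(parseWikilink("[[leo-tolstoy]]"));
```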
 Yeah, just allowing for a multitude of ways for human navigation. The general case would be to remove the '{noun}:' and just find anything matching the same string across the categories.

Kind of an aside, but I'm wondering if it would make sense backend-wise if we paired an optional db for caching on relays we want to associate with to improve performance. I'd imagine without the db search performance would degrade as events increase. 
 The app in general probably shouldn't have its own DB, but our instance, on our website, could easily have its own DB, server, etc. to provide premium features. 
 GIVE ME ELASTIC SEARCH! 😂 
 I think the caching DB could be part of our premium offering.

The client is open-source, so anyone can run an instance.  The data?  Not everybody will have that.

We can attach a caching service to TheCitadel relay and introduce a paid tier for those who want that additional performance. 
 Love the idea 
 >>The index MUST also be uniquely identifiable using a combination of the d tag's first value (usually containing the title), the pubkey, and the kind fields.<<

This is the part in the NKBIP-01 that pertains to the d tag. We could change the text to read

>>The index MUST also be uniquely identifiable using a combination of the d tag's values (at least including the first value, usually the title), the pubkey, and the kind fields.<<

 
 That sounds good.  It gives us flexibility, but maintains basic compatibility with standard event identification by d tag. 
 The spec doesn't assume that a 30040 is a book, that's why it's that way. It could also be an index of articles or wiki pages from different authors. Or pages from various books from various authors in various editions.

That's why only the title is required, as everything has some sort of title. 
 I like #1, for cases where the lower-level stuff doesn't have their own tags.  I had to do this, when I ran into the problem that I had two different editions of one audiobook and they kept overwriting each others' events, even though I changed the version of the 30040. 
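The overwriting happens because addressable events (the 30000-series kinds) are replaced whenever kind, pubkey, and d tag value all match. A toy TypeScript model of that, with a hypothetical `AddressableStore` standing in for a relay and assuming events arrive in chronological order:

```typescript
interface AddressableEvent {
  kind: number;
  pubkey: string;
  d: string;        // first value of the d tag
  content: string;
}

// Toy stand-in for a relay: keeps one event per (kind, pubkey, d).
class AddressableStore {
  private readonly events = new Map<string, AddressableEvent>();

  put(event: AddressableEvent): void {
    // A later event with the same address replaces the earlier one.
    this.events.set(`${event.kind}:${event.pubkey}:${event.d}`, event);
  }

  count(): number {
    return this.events.size;
  }
}

const store = new AddressableStore();
// Two editions of the same audiobook, with identical chapter identifiers.
store.put({ kind: 30041, pubkey: "pubkey-a", d: "chapter-1", content: "first edition" });
store.put({ kind: 30041, pubkey: "pubkey-a", d: "chapter-1", content: "second edition" });
console.log(store.count());
```

Only one `chapter-1` survives here. Carrying the edition down into the lower-level identifiers, as in option 1, gives each edition its own address.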