 #realy just got switched over from using my broken bespoke binary encoder to using protobuf 3

because i'm tired of fixing the bugs

only took me a few hours and done... realy works much better now

anyone who might be using it, note that the latest tag v1.1.10 is a breaking change... hmm, it should really be v1.2.0 i guess... ok, done

for now it is a v1.x.x but really it's a v0.x.x, so anything can be breaking... mainly this change just causes the database format to change... to migrate, all you really have to do is export the db as it is, then import it back in after upgrading

but the old version was breaking so much of the data, maybe half of the events weren't coming back out of the database the same as they went in... this is now fixed 
 my biggest reason for this change is simple

the json encoder is actually the damned fastest, and hell, i'd store the events in the DB directly as json if it were as space efficient as protobuf3... but besides all that, when events are queried they have to be unmarshaled and then re-marshaled when dispatching to the clients, and when events come in, checking them against matching subscriptions requires unmarshaling them as well.

i keep saying this but for some reason the hipsters all like CBOR which is a pile of shit - the broadest and most complete binary codec support for all languages is protobuf

i'm of the opinion that a little way down the track there will be many codecs in use, because it's very simple to add an encoding header to the http websocket upgrade, which is one of the biggest advantages of using websockets... then all the hipsters can make their cbor clients and relays, but until then it's simply insane to put json in the database, for space reasons
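
on the relay side it would be something like this (a sketch only, using gorilla/websocket's subprotocol negotiation; these codec names are made up, realy doesn't advertise them today):

```go
package main

import (
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

// hypothetical codec names the relay could offer; the client asks for one via
// the Sec-WebSocket-Protocol header and the upgrader picks the first match
var upgrader = websocket.Upgrader{
	Subprotocols: []string{"nostr+protobuf", "nostr+json"},
}

func handle(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println("upgrade:", err)
		return
	}
	defer conn.Close()

	// whatever was negotiated decides how envelopes get encoded on this connection
	switch conn.Subprotocol() {
	case "nostr+protobuf":
		// speak binary to this client
	default:
		// fall back to plain json, today's behaviour
	}
	// ... read/write loop elided
}

func main() {
	http.HandleFunc("/", handle)
	log.Fatal(http.ListenAndServe(":3334", nil))
}
```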

oh yeah, and not to forget... one thing that is still in my code: the runtime versions of field 2 of the a, e and p tags are kept in binary form... the filter is like this as well, and it nearly doubles the speed of performing a match on them because the id/pubkey and the tag data are natively in the same format

that's the main reason why my json encoder is so fast, it doesn't double-process anything 
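
as a toy illustration (not the actual realy matching code): with raw bytes the match is a single bytes.Equal, with hex you either decode first or compare 64 ascii characters every time:

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
)

// when the tag value is already raw bytes, a filter match is one comparison
func matchBinary(tagValue, want []byte) bool {
	return bytes.Equal(tagValue, want)
}

// when it is a hex string you either decode it first or compare 64 ascii
// characters, roughly doubling the work per candidate tag
func matchHex(tagValue string, want []byte) bool {
	raw, err := hex.DecodeString(tagValue)
	if err != nil {
		return false
	}
	return bytes.Equal(raw, want)
}

func main() {
	want := bytes.Repeat([]byte{0xab}, 32) // stand-in for a 32-byte pubkey/id
	fmt.Println(matchBinary(want, want))                  // true
	fmt.Println(matchHex(hex.EncodeToString(want), want)) // true
}
```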
 I built a "kappa" application which processes terabytes of data on top of kafka as a storage backend where messages are stored in message pack binary format (and gzip compressed by kafka).

Must say I'm quite happy with message pack. 
 yeah, messagepack is pretty good too, it's just that i think more people are familiar with protobuf, and its tooling isn't painful to set up, unlike flatbuffers

i was pretty much shocked when i learned how slow the Gob encoder that comes with the Go standard library actually is... it's slower than the stdlib encoding/json!
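
you can see it for yourself with nothing but the stdlib; this is just a toy sketch with a made-up Event struct, not realy's actual type:

```go
package codecbench

import (
	"bytes"
	"encoding/gob"
	"encoding/json"
	"testing"
)

// Event is a hypothetical stand-in for a nostr event, not the real realy type.
type Event struct {
	ID, PubKey, Content, Sig string
	CreatedAt                int64
	Kind                     int
	Tags                     [][]string
}

var ev = Event{
	ID:      "0000000000000000000000000000000000000000000000000000000000000000",
	PubKey:  "1111111111111111111111111111111111111111111111111111111111111111",
	Kind:    1,
	Tags:    [][]string{{"p", "2222222222222222222222222222222222222222222222222222222222222222"}},
	Content: "hello world",
}

// one-shot gob encode, the way a relay would do it per event
func BenchmarkGobEncode(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		if err := gob.NewEncoder(&buf).Encode(&ev); err != nil {
			b.Fatal(err)
		}
	}
}

// same event through encoding/json for comparison
func BenchmarkJSONEncode(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if _, err := json.Marshal(&ev); err != nil {
			b.Fatal(err)
		}
	}
}
```

run it with go test -bench . -benchmem and compare the ns/op and allocations yourself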

i figure that out of all the options, protobuf is probably the most battle-hardened, because it's used by so many big tech systems 
 agreed. protobuf is the king 
 i just learned about this tho

https://github.com/deneonet/benc

from this:

https://alecthomas.github.io/go_serialization_benchmarks/

and i can't help myself but implement it lol

faster than flatbuffers... faster than capn proto

the benefits of those flat/decode-on-demand style codecs are zero when you actually need the whole data, and taking advantage of them would require refactoring an entire search matching library (filters, in nostr)

so, i'm just gonna go with this

its schema syntax is like Go but also a bit like Protobuf

it has some scheme about versioning and shit, but honestly for this use case, i.e. replacing the entire nostr encoding, with the envelopes, filters and events... this would be hands down the best option, because the data usually needs to be matched on several criteria in the "extra filter" used in the database indexing scheme that fiatjaf devised, so there's extra logic to decode them before running the match even though half the time only half the fields need to be decoded, and most of the fields except for two are already just raw bytes in my data structure format...

this is it, no more fucking around, the end, faster than this is impossiburu 
 i can reserve the option of changing it again later, but just following the same pattern i used for protobuf it should be easy to do the same thing... the only extra overhead is copying out the slice headers of the benc formatted structure and allocating/copying into the "native" realy style 
 well, on the basis of its raw throughput in my bulk decode/encode/decode test it's all about the same

probably the reason is that my data structure is more suited to protobuf than benc, so even though protobuf does reflection the overhead does not make it slower

they are basically the same for this use case

the throughput is about 58MB/s for a single thread doing unmarshal from json, encode to binary, decode from binary; about 30MB/s to do that plus check the ID hash and round-trip through json as well
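
the test loop is roughly this shape (a sketch only: the Event struct is a stand-in, and gob fills in for the binary codec just so it compiles without generated protobuf code):

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/gob"
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// hypothetical event struct standing in for the real runtime type
type Event struct {
	ID        string     `json:"id"`
	PubKey    string     `json:"pubkey"`
	CreatedAt int64      `json:"created_at"`
	Kind      int        `json:"kind"`
	Tags      [][]string `json:"tags"`
	Content   string     `json:"content"`
	Sig       string     `json:"sig"`
}

func main() {
	f, err := os.Open("events.jsonl") // one json event per line
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var total int
	start := time.Now()
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // allow large events
	for sc.Scan() {
		line := sc.Bytes()
		total += len(line)

		var ev Event
		if err := json.Unmarshal(line, &ev); err != nil { // json -> struct
			panic(err)
		}
		var buf bytes.Buffer
		if err := gob.NewEncoder(&buf).Encode(&ev); err != nil { // struct -> binary
			panic(err)
		}
		var back Event
		if err := gob.NewDecoder(&buf).Decode(&back); err != nil { // binary -> struct
			panic(err)
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
	secs := time.Since(start).Seconds()
	fmt.Printf("%.1f MB/s\n", float64(total)/secs/1e6)
}
```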

my JSON encoder is the bomb for this shit...

if it weren't for the bigger data size i'd say this is the way to go

in any case, i'm leaving the code in there for the different binary codecs in case for some reason it seems like a good idea to work with them later

the big problem i foresee is that to make it go any faster i'd need to adapt the runtime data structure of my events to BE the benc encoded version, which uses slices wrapped in structs... both the protobuf and benc versions have this problem of needing a shim to change the data structure, and that shim is probably the last bit of overhead... so, i'm just leaving it with protobuf for simplicity 
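
the shim is basically this kind of thing (both structs here are hypothetical stand-ins for the generated and native types, not the real code):

```go
package shim

// CodecEvent stands in for what a generated benc/protobuf struct looks like:
// plain slices and strings wrapped in a struct.
type CodecEvent struct {
	ID        []byte
	PubKey    []byte
	CreatedAt int64
	Kind      uint16
	Tags      [][]string
	Content   string
	Sig       []byte
}

// Event stands in for the relay's native runtime structure.
type Event struct {
	ID        []byte
	PubKey    []byte
	CreatedAt int64
	Kind      uint16
	Tags      [][]string
	Content   string
	Sig       []byte
}

// FromCodec is the shim: no re-encoding, just copying slice headers across and
// allocating the outer struct. Cheap, but not free, and it's the last bit of
// overhead between the decoder and the native form.
func FromCodec(c *CodecEvent) *Event {
	return &Event{
		ID:        c.ID,
		PubKey:    c.PubKey,
		CreatedAt: c.CreatedAt,
		Kind:      c.Kind,
		Tags:      c.Tags,
		Content:   c.Content,
		Sig:       c.Sig,
	}
}
```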
 I learnt about your #realy just now. At first I dismissed it because my eyes would just read it as #relay. We as a species are not fully evolved for reading :) 
 that's why i call it #realy so people are puzzled

mostly it just passes you by, and then one moment you realise it's a typo, and all the spidey senses are tingling 
 check, t-y