my categoriser thing writes events from the giant trove i got from semisol (180gb) into a set of separate files depending on where in the process they fail: unmarshalling from json, checking the event ID, encoding to binary, and finally decoding from binary and checking that the original unmarshaled/re-marshaled json matches the encoded/decoded binary form
it exercises every part of my codec, so now i have a set of grouped events that failed at different points, and i can take each group in isolation to test whether there are bugs in my code for that step of the process
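roughly, the loop looks like this - just a sketch, the Event and Codec interfaces and their method names are placeholders i'm using for illustration rather than the real codec's API, and the exact comparison behind each fail bucket is an approximation:

    package categoriser

    import (
        "bufio"
        "bytes"
        "os"
    )

    // Event and Codec are stand-ins for the real codec's types.
    type Event interface {
        ID() string        // id field as it appears in the event
        ComputeID() string // id recomputed per nip-01
    }

    type Codec interface {
        UnmarshalJSON(line []byte) (Event, error)
        MarshalJSON(ev Event) ([]byte, error)
        EncodeBinary(ev Event) ([]byte, error)
        DecodeBinary(bin []byte) (Event, error)
    }

    // categorise copies every line of the dump to read.jsonl and appends
    // each failing line to the file for the first step it failed at.
    func categorise(c Codec, dumpPath string) error {
        in, err := os.Open(dumpPath)
        if err != nil {
            return err
        }
        defer in.Close()

        files := map[string]*os.File{}
        for _, name := range []string{
            "read.jsonl", "fail_unmar.jsonl", "fail_ids.jsonl",
            "fail_tobin.jsonl", "fail_frombin.jsonl", "fail_reser.jsonl",
        } {
            f, err := os.Create(name)
            if err != nil {
                return err
            }
            defer f.Close()
            files[name] = f
        }
        write := func(name string, line []byte) {
            files[name].Write(line)
            files[name].Write([]byte{'\n'})
        }

        sc := bufio.NewScanner(in)
        sc.Buffer(make([]byte, 0, 1<<20), 64<<20) // some events are huge
        for sc.Scan() {
            line := append([]byte(nil), sc.Bytes()...)
            write("read.jsonl", line) // write back out before any processing

            ev, err := c.UnmarshalJSON(line) // step 1: json -> event
            if err != nil {
                write("fail_unmar.jsonl", line)
                continue
            }
            if ev.ComputeID() != ev.ID() { // step 2: nip-01 id check
                write("fail_ids.jsonl", line)
                continue
            }
            bin, err := c.EncodeBinary(ev) // step 3: event -> binary
            if err != nil {
                write("fail_tobin.jsonl", line)
                continue
            }
            ev2, err := c.DecodeBinary(bin) // step 4: binary -> event, roundtrip must match
            if err != nil {
                write("fail_frombin.jsonl", line)
                continue
            }
            j1, err1 := c.MarshalJSON(ev)
            j2, err2 := c.MarshalJSON(ev2)
            if err1 != nil || err2 != nil || !bytes.Equal(j1, j2) {
                write("fail_frombin.jsonl", line)
                continue
            }
            // step 5: re-serialising must reproduce the original line
            if !bytes.Equal(j1, line) {
                write("fail_reser.jsonl", line)
            }
        }
        return sc.Err()
    }

first failure wins, so each event lands in at most one fail file, and read.jsonl gets everything regardless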
this event cache dates back almost 2 years, the newest in it is i think about 4 months ago:
168G dump.jsonl
5.6G fail_frombin.jsonl
7.2G fail_ids.jsonl
551M fail_reser.jsonl
2.0G fail_tobin.jsonl
436K fail_unmar.jsonl
168G read.jsonl
dump is the original, and read is more or less a copy of it, reconstructed by writing each line back to disk before any processing (more than a few don't decode at all)
fail_unmar.jsonl is mostly events with broken json string escaping and things like literal line breaks in the middle of strings, but some could be bugs in my json decoding code... so they will be examined, first just by confirming the syntax really is wrong according to another decoder (the one in my IDE), and those that are valid according to the second decoder will be used to debug the unmarshal code.
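rather than eyeballing each one in the IDE, go's encoding/json can play the second-decoder role over the whole file in one pass (suspect_unmar.jsonl is just a made-up name for the output):

    package main

    import (
        "bufio"
        "encoding/json"
        "fmt"
        "os"
    )

    func main() {
        in, err := os.Open("fail_unmar.jsonl")
        if err != nil {
            panic(err)
        }
        defer in.Close()

        // lines the second decoder accepts are candidates for bugs in my
        // unmarshal code; lines it also rejects are genuinely broken json
        out, err := os.Create("suspect_unmar.jsonl")
        if err != nil {
            panic(err)
        }
        defer out.Close()

        broken, suspect := 0, 0
        sc := bufio.NewScanner(in)
        sc.Buffer(make([]byte, 0, 1<<20), 64<<20)
        for sc.Scan() {
            if json.Valid(sc.Bytes()) {
                suspect++
                out.Write(sc.Bytes())
                out.Write([]byte{'\n'})
            } else {
                broken++
            }
        }
        if err := sc.Err(); err != nil {
            panic(err)
        }
        fmt.Println("genuinely broken:", broken, "valid per encoding/json (my bug?):", suspect)
    }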
fail_ids.jsonl are the ones whose computed ID does not match the ID listed in the event (idk what to do with these, they could be rewritten with their real ID but then that breaks inbound references) - 7.2gb out of 180 have wrongly computed IDs... mostly they are old though... i think in the early days there was some ambiguity in the escaping rules... and more than a few of them have bogus extra fields that are invalid according to nip-01. these may also have other errors in them, some of which i can adjust my code to decode or interpret correctly, but not if that breaks the ID.
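for reference, the nip-01 id is the sha256 of the serialised array [0, pubkey, created_at, kind, tags, content] with no whitespace. a minimal recomputation in go might look like this - the Event struct is generic, not my codec's type, and note that go's encoder escapes some control characters as \u00XX where nip-01 says to emit them verbatim, which is exactly the kind of escaping ambiguity that probably produced a bunch of these:

    package idcheck

    import (
        "bytes"
        "crypto/sha256"
        "encoding/hex"
        "encoding/json"
    )

    type Event struct {
        ID        string     `json:"id"`
        PubKey    string     `json:"pubkey"`
        CreatedAt int64      `json:"created_at"`
        Kind      int        `json:"kind"`
        Tags      [][]string `json:"tags"`
        Content   string     `json:"content"`
        Sig       string     `json:"sig"`
    }

    // computeID recomputes the nip-01 event id: sha256 over the compact
    // json serialisation of [0, pubkey, created_at, kind, tags, content].
    func computeID(ev *Event) (string, error) {
        arr := []any{0, ev.PubKey, ev.CreatedAt, ev.Kind, ev.Tags, ev.Content}
        var buf bytes.Buffer
        enc := json.NewEncoder(&buf)
        enc.SetEscapeHTML(false) // nip-01 does not escape <, > or &
        if err := enc.Encode(arr); err != nil {
            return "", err
        }
        ser := bytes.TrimRight(buf.Bytes(), "\n") // Encode appends a newline
        sum := sha256.Sum256(ser)
        return hex.EncodeToString(sum[:]), nil
    }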
2gb of events failed to marshal to binary; that's purely my marshalling code, which is good to know - it's step 2, after making sure my unmarshal code is not failing to decode valid json
5.6gb of events went to binary fine but were either encoded wrongly in binary or failed to decode from the binary... i know there are several issues there, with the tags and content specifically
551M of events fail to re-serialize the same way as the originally unmarshaled data, so this is more stuff involving the json encoding library
separating them makes it much easier for me to focus a debugging session on one section of the codebase, so this will hopefully let me chase the low-hanging fruit and chip away at the larger chunks of events failing at a given step
so much work to do... this is the burden of rewriting this whole encoder... i'm still quite sure it's very fast, but i will probably find that fixing the bugs i see dials back some of the speedup in order to handle these error cases correctly
but probably not much