Oddbean

@c2f9cd0a Agreed. The interesting question to me is whether, if you live in a NixOS universe and can analyze your whole infrastructure with NixOS expressions, you can derive the right magic recipes to synchronize, snapshot and perform recovery. Though, for many people, being able to roll back single systems separately is already pretty good. Doing this for distributed systems is probably a research question, albeit a solvable one IMHO.

@7c36db82 That sounds too simplistic/hopeful. I don't think we have the necessary low level operations to create a *GENERIC* solution that allows atomic/synchronous snapshots at arbitrary scale. Also: rollbacks are intended for fast recovery. If I have to dive into analyzing NixOS expressions, the "fast" is out of the window. Roll forward may allow you to fix it quicker on an application level and pay attention to the outside world that your application already communicated with.

@c2f9cd0a Note that I am not arguing against roll forward. Also, I am not sure why you are convinced we do not have the right primitives to execute such operations on a large set of services (maybe not all of them, but most of them?). Finally, analysis of the NixOS expressions can be performed *automatically*, which is why I am framing this as a research problem. I am not saying "maybe", I am saying there are ways to frame the problem as a theoretical computer science statement.

@c2f9cd0a Fast recovery is subjective, I'd say. If the recovery costs you a week but you get all the data back, some organizations may accept that and set up an alternative on the side. In general, there are plenty of ways to make recovery fast by preparing for it (filesystem snapshots, large networking pipes, etc.)

@7c36db82 From my perspective the theoretical argument is not consistent with my experience. You might be overestimating how much complexity you can deal with in an automated fashion. For example, one quickly ends up with merged sets of expressions that are contradictory - in essence a (very rough) Goedel statement, where everything complex enough to be interesting will break in interesting ways due to self-referential properties and people outsmarting your assumptions.

@c2f9cd0a I would not say you are wrong. I'd rather say it is important that we figure out what is true on that matter and put words on the classes of situations out there. Being able to say "this particular set of services fulfilling those conditions can be reasonably handled" is already a very interesting statement, because it helps in understanding how to design such systems.

@7c36db82 Just to circle back to my very original statement: the amount of nuance we needed to tease out of our systems here goes way beyond what, in my experience, the majority of (semi- or even non-technical) people understand about rollbacks and then extrapolate from ... ;)

@c2f9cd0a And I completely agree with you on this. Nevertheless, I am interested in seeing how to push the boundaries, make better use of the existing capabilities and steer future developments so that rollback is not just a nice "gimmick" but a theoretically understood concept, even if it may end up being completely useless due to too many blockers (impossible to track data dependencies between a service and its database, etc.)

@7c36db82 I think it IS more than a gimmick at the moment. It is more like a safety belt or airbag right now. I don't make *ANY* plans that involve deployment of either. Yet I regularly see those as the "de facto strategies" if "anything should go wrong". ;)

@c2f9cd0a Right, I agree with that too. I called it a gimmick because honestly, when you roll Gitea forward and then roll back NixOS, Gitea is down, because programs are not forward-compatible with their potentially new database schema. Therefore, there is a whole class of issues where rollbacks will just add to your problems.
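
To make that last failure mode concrete, here is a minimal Python sketch of the check a rollback would need to pass. Everything in it is an assumption made for illustration: the `Generation` type, the `rollback_is_safe` helper, the generation names and the schema version numbers are hypothetical and do not correspond to anything NixOS or Gitea actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Generation:
    """Hypothetical system generation, tagged with the newest database
    schema version that the service shipped in it can read."""
    name: str
    max_supported_schema: int


def rollback_is_safe(target: Generation, current_db_schema: int) -> bool:
    """A rollback only helps if the older service can still read the
    schema that the newer service left behind in the database."""
    return current_db_schema <= target.max_supported_schema


# Hypothetical numbers: the upgrade migrated the database from schema 10 to 12.
old_generation = Generation(name="gen-41", max_supported_schema=10)
new_generation = Generation(name="gen-42", max_supported_schema=12)
db_schema_after_upgrade = 12

for generation in (old_generation, new_generation):
    verdict = "safe" if rollback_is_safe(generation, db_schema_after_upgrade) else "unsafe"
    print(f"switching to {generation.name}: {verdict}")
```

The system state and the database state version independently here, which is the point being made in the thread: rolling back the former does nothing to the latter.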