 so I've been using Nostr as an escape from a problem I can't figure out at work.

Today I'm flipping this on its head and talking to myself on Nostr to solve my problem:

Workflow logic bug:
A main task contains 5-10 sub tasks, all of which need to complete for the main task to complete.

Error: every day a couple hundred (out of 100,000) main tasks are marked as complete even though they should not be, because not all required subtasks have been created and/or marked as complete.

How it works:

Main task gets created (Status incomplete)
-->sub task 1 (status incomplete) starts, ends (status: complete) creates sub task 2
-->sub task 2 starts, ends & creates sub task 3
-->etc

every 3 minutes, a separate procedure checks main tasks:
if all sub tasks are complete, main task is updated to complete.


So, for this error to occur, either:
1.  the chain of sub tasks fails to create the next incomplete subtask (look in log)  
2. the commit for the completed sub task happens before the commit for the insert of the new incomplete subtask, and the main task is updated to "complete" in this short window of time
3....
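
Possibility 2 can be reproduced in a small sketch. Everything here is an assumption for illustration: the table names (main_task, sub_task), the columns, and the SQL of the supervisor check are made up, and an in-memory SQLite database stands in for whatever the real system uses.

```python
import sqlite3

# Hypothetical schema; the real tables and columns are unknown.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE main_task (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE sub_task (id INTEGER PRIMARY KEY, main_id INTEGER,
                           step INTEGER, status TEXT);
""")

def supervisor_check(db):
    """Stand-in for the every-3-minutes procedure: complete any main task
    that currently has no incomplete subtasks."""
    db.execute("""
        UPDATE main_task SET status = 'complete'
        WHERE status = 'incomplete'
          AND NOT EXISTS (SELECT 1 FROM sub_task s
                          WHERE s.main_id = main_task.id
                            AND s.status = 'incomplete')
    """)
    db.commit()

db.execute("INSERT INTO main_task (id, status) VALUES (1, 'incomplete')")
db.execute("INSERT INTO sub_task (main_id, step, status) VALUES (1, 1, 'incomplete')")
db.commit()

# Subtask 1 commits its own completion first...
db.execute("UPDATE sub_task SET status = 'complete' WHERE main_id = 1 AND step = 1")
db.commit()

# ...the supervisor happens to run inside the window...
supervisor_check(db)

# ...and only then is subtask 2 inserted and committed.
db.execute("INSERT INTO sub_task (main_id, step, status) VALUES (1, 2, 'incomplete')")
db.commit()

main_status = db.execute("SELECT status FROM main_task WHERE id = 1").fetchone()[0]
open_subtasks = db.execute(
    "SELECT COUNT(*) FROM sub_task WHERE main_id = 1 AND status = 'incomplete'"
).fetchone()[0]
print(main_status, open_subtasks)  # complete 1
```

The main task ends up complete while subtask 2 is still incomplete, which matches the symptom. Note that a check written this way would also complete a main task that has no subtasks at all yet.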

Questions to look up:
Are all subtasks marked as complete for the incorrectly completed main tasks?
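
One way to answer that question is a query that lists completed main tasks that still have an incomplete subtask. This is only a sketch: the schema (main_task, sub_task, status values) and the sample rows are invented, and the real query would need the real table and column names.

```python
import sqlite3

# Hypothetical schema and sample data, for illustration only.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE main_task (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE sub_task (id INTEGER PRIMARY KEY, main_id INTEGER,
                           step INTEGER, status TEXT);
    INSERT INTO main_task (id, status) VALUES
        (1, 'complete'), (2, 'complete'), (3, 'incomplete');
    INSERT INTO sub_task (main_id, step, status) VALUES
        (1, 1, 'complete'), (1, 2, 'incomplete'),   -- complete main, open subtask
        (2, 1, 'complete'), (2, 2, 'complete'),     -- consistent
        (3, 1, 'incomplete');                       -- still running
""")

def find_suspect_tasks(db):
    """Return (main_id, subtask_count, incomplete_count) for every main task
    marked complete that still has at least one incomplete subtask."""
    return db.execute("""
        SELECT m.id, COUNT(s.id), SUM(s.status = 'incomplete')
        FROM main_task m
        LEFT JOIN sub_task s ON s.main_id = m.id
        WHERE m.status = 'complete'
        GROUP BY m.id
        HAVING SUM(s.status = 'incomplete') > 0
        ORDER BY m.id
    """).fetchall()

suspects = find_suspect_tasks(db)
print(suspects)  # [(1, 2, 1)]
```

If the bad main tasks show up here, a subtask was inserted after the main task was already completed, which points toward possibility 2. If they do not show up and simply have fewer subtasks than expected, possibility 1 looks more likely.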
 Do they happen concurrently? Sounds like no. If that’s the case, instead of having something check that all subtasks are complete, maybe there should just be a final subtask that handles marking the task complete (and other cleanup/finalization). Simplify the conditions for your expected behavior.
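
A rough sketch of that idea, with no database involved: each step decides whether another step follows, and the main task is only marked complete once the chain itself says there is nothing left. The MainTask class, the score field, and the rules in next_step are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MainTask:
    score: int                      # hypothetical input the later steps inspect
    status: str = "incomplete"
    done_steps: list = field(default_factory=list)

def next_step(task, step):
    """Invented per-step rule: steps 1-5 always run, step 6 and step 7 only
    run when the task still fails some made-up criteria."""
    if step < 5:
        return step + 1
    if step == 5 and task.score < 50:
        return 6
    if step == 6 and task.score < 20:
        return 7
    return None                     # no further step: the chain is finished

def run_chain(task):
    step = 1
    while step is not None:
        task.done_steps.append(step)     # the step's real work would go here
        step = next_step(task, step)
    # Finalization: the only place the main task is marked complete.
    task.status = "complete"
    return task

print(run_chain(MainTask(score=80)).done_steps)  # [1, 2, 3, 4, 5]
print(run_chain(MainTask(score=10)).done_steps)  # [1, 2, 3, 4, 5, 6, 7]
```

With this shape there is no separate checker that can catch the task between two steps, because completion is decided by the same code that decides whether another step exists.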
 Welp, you're right. Sounds like a design flaw. 

Perhaps it's made this way bc only step 5 knows if it should continue to additional step 6, and only step 6 knows if it passes criteria already or needs to go to step 7, etc. 

Hmm. The way it overall knows it's finished is when there are no incomplete subtasks. But the criteria for propagating subtasks is at the subtask level.

I probably can't actually change the whole structure, but I can recommend moving a "commit" around.

 
 Just a wild guess, but could it be a race condition between a subtask marking itself complete and your “supervisor” process that checks whether any subtasks are still incomplete?

So when subtask 3, for example, marks itself complete, if your “supervisor” checks before subtask 4 is added, it would see all subtasks complete. A bandaid could be only marking a subtask complete after the next subtask is inserted.
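
A minimal sketch of that bandaid, assuming a made-up schema (main_task, sub_task) and an in-memory SQLite database in place of the real one. The hypothetical finish_step helper inserts the next subtask first, then marks the current one complete, and commits both in one transaction so the supervisor never sees the gap.

```python
import sqlite3

# Hypothetical schema; the real tables and columns are unknown.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE main_task (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE sub_task (id INTEGER PRIMARY KEY, main_id INTEGER,
                           step INTEGER, status TEXT);
""")

def supervisor_check(db):
    """Stand-in for the periodic check: complete main tasks with no
    incomplete subtasks."""
    db.execute("""
        UPDATE main_task SET status = 'complete'
        WHERE status = 'incomplete'
          AND NOT EXISTS (SELECT 1 FROM sub_task s
                          WHERE s.main_id = main_task.id
                            AND s.status = 'incomplete')
    """)
    db.commit()

def finish_step(db, main_id, step, next_step=None):
    """Insert the next subtask (if any) BEFORE completing the current one,
    and commit both changes together."""
    with db:  # commits on success, rolls back if an exception is raised
        if next_step is not None:
            db.execute(
                "INSERT INTO sub_task (main_id, step, status) "
                "VALUES (?, ?, 'incomplete')", (main_id, next_step))
        db.execute(
            "UPDATE sub_task SET status = 'complete' "
            "WHERE main_id = ? AND step = ?", (main_id, step))

def main_status(db, main_id):
    return db.execute(
        "SELECT status FROM main_task WHERE id = ?", (main_id,)).fetchone()[0]

with db:
    db.execute("INSERT INTO main_task (id, status) VALUES (1, 'incomplete')")
    db.execute("INSERT INTO sub_task (main_id, step, status) VALUES (1, 1, 'incomplete')")

finish_step(db, 1, 1, next_step=2)
supervisor_check(db)
after_step_1 = main_status(db, 1)   # subtask 2 is open, so still incomplete

finish_step(db, 1, 2)               # last step: nothing to insert
supervisor_check(db)
after_step_2 = main_status(db, 1)

print(after_step_1, after_step_2)  # incomplete complete
```

Either change alone (reordering, or a single commit) should close the window in this toy version. Whether it is enough in the real system depends on how the actual code commits, which this sketch cannot know.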

Sounds like you’re using a DB to track the jobs. Is that because your workers are distributed? 
 I'm hoping for the bandaid problem. I'm getting ready to pop open the code now.

It's a DB... I think it's a DB because it's interfacing with web pages that could be open in different locations by multiple people at once.

 I'm not a native computer programmer so I wouldn't know what another solution would look like. I just take the code as it was created by others and assume it's the best solution and just try to find the bandaid problems. 
 Fair enough. I'm an engineer, I see a problem and I have to try to solve it :D