Resilience in Distributed Systems
Jimmy Bogard is in the middle of an excellent series on Refactoring towards Resilience. As it happens, I had the privilege of working with Jimmy on that project, and I have an anecdote to relate that he’s probably unaware of.
A few days after the project went live, I received a frantic email from the client’s Finance department (names changed):
“Late this afternoon I noticed that there was one refund on our Stripe account. From looking at the transaction the card belongs to Ricky Bobby, and he does have an account in SSO, but I see no teams for him. The charge came in on 1/23 at 7pm. The refund was processed only 1 second later. I contacted Stripe and they looked into what triggered the refund. They determined it came from us and it wasn’t myself, Jim or Bob. I am concerned that there was a glitch here that caused this refund. I have suggested that Customer Service reach out to him to see what he experienced but can you please look into why this happened?”
I knew immediately what had happened, of course. For some reason our overall transaction had failed, so the system gave the customer an immediate refund. It was how we’d coded it, and it worked as designed, hooray!
Except… we’d never really explained that to the business.
Resilience is the ability to respond well to failure. To paraphrase Jimmy somewhat, you have four options when a failure occurs: Ignore, Retry, Undo and Coordinate. Organizations are distributed systems too, but for them Coordinate is often the only option. And in this instance we failed to coordinate.
Ignore, Retry, Undo, Coordinate. Each option leaves a different signature on the business, so it’s important that the business knows what to expect. Had we chosen “Ignore”, the Finance folks wouldn’t have seen any problems, but Customer Service would have received irate calls and had to issue manual refunds, and IT would have had to work out what happened from the logs. Had we chosen “Retry”, Finance would have had to maintain lists of transactions that failed after 24 hours and process them somehow. The option we chose, “Undo”, is interesting because it puts the least strain on the business, but it puts the onus on the customer to retry.
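To make those signatures concrete, here is a minimal Python sketch of the four options applied to a charge followed by a step that fails. It is not the code from the project: `complete_order`, `Strategy`, `Outcome` and the `charge`, `fulfil` and `refund` callables are all hypothetical names, and “Coordinate” is modelled, purely as an assumption, as flagging the failure for someone else to reconcile.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Strategy(Enum):
    IGNORE = "ignore"
    RETRY = "retry"
    UNDO = "undo"
    COORDINATE = "coordinate"


@dataclass
class Outcome:
    succeeded: bool
    # What the business would be able to observe afterwards.
    notes: List[str] = field(default_factory=list)


def complete_order(
    charge: Callable[[], None],
    fulfil: Callable[[], None],
    refund: Callable[[], None],
    strategy: Strategy,
    max_attempts: int = 3,
) -> Outcome:
    """Charge the card, then run the follow-up step, handling its failure."""
    charge()
    notes: List[str] = []

    # Only RETRY gets more than one attempt at the follow-up step.
    attempts = max_attempts if strategy is Strategy.RETRY else 1
    for attempt in range(1, attempts + 1):
        try:
            fulfil()
            return Outcome(True, notes)
        except Exception as exc:  # sketch only: real code would be narrower
            notes.append(f"attempt {attempt} failed: {exc}")

    if strategy is Strategy.UNDO:
        # Compensate: the customer sees a charge and an immediate refund.
        refund()
        notes.append("charge refunded automatically")
    elif strategy is Strategy.RETRY:
        # Retries exhausted: someone has to process this by hand.
        notes.append("retries exhausted; needs manual processing")
    elif strategy is Strategy.COORDINATE:
        # Assumption: hand the failure to a separate reconciliation process.
        notes.append("flagged for reconciliation")
    # IGNORE: nothing further; the charge stands and only the notes record it.

    return Outcome(False, notes)
```

With “Undo”, a failing `fulfil` results in `charge` followed by `refund`, which is the charge-then-refund pair Finance spotted. The `notes` list stands in for whatever each strategy leaves behind for Finance, Customer Service or IT to find.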