At work, you've created a service that half the company uses, and on the weekend you go camping the service crashes and nobody knows how to restart it. Half of the company is now in a failure mode.
A failure mode is a degradation of quality, but it is not yet a catastrophe. In your personal life and in your work, you should always think about what kind of quality you'll be limping along with if some component or assumption were to fail. If you find that quality is unpalatable, then it's time to go back to the drawing board and try again.
Back at home--cold, shivering, and with the neighbors watching--you grope around under the doormat and find the spare key you've hidden there. You've now recovered from the fault and exited the failure mode.
And at the office, your service is being monitored by some kind of watchdog process that detects the crash and restarts the service.
The spare key and the watchdog process are insurance policies, but they should not be confused with the quality of your failure mode. You enter failure mode when a component fails, but insurance policies are just another kind of component. Maybe the spare key wasn't there this time, or maybe the watchdog itself crashed. Either way, you're still in a condition of degraded quality.
Having an insurance policy doesn't absolve you from engineering as much quality as possible into each level of failure mode. All that you've done is create more possible failure modes.
Having learned your lesson, you now either get dressed and put on a coat before going outside to fetch the paper, or you switch to an online subscription, or you defer reading the newspaper until after you're dressed and on the train for work. You've now either eliminated or changed the quality of the failure mode to something more acceptable.
At work it's not likely to be so simple, but consider some of the general techniques below as a starting point.
An example is checking credit card numbers for validity and sufficient funds, which will depend on the success of both your code and the gateway service you're paying for. Some systems are set up to refuse a transaction if it's unable to validate them, even if the transaction would be valid otherwise. You need to design your system so that validation takes place after a transaction has been received and recorded, and that the transaction is considered "on hold" until validation is performed
Transferring money between two accounts is the canonical example of why Transactions are important; if a fault happens after you've decreased the balance of the sender, but haven't yet increased the balance of the receiver, then the money disappears. Databases support Transactions, but now so do filesystems, queues, and even memory. If your platform supports Distributed Transactions, then you can even tie everything together: pulling a message off a queue, updating multiple tables on multiple databases, and writing to an audit log file can now all happen in the same transaction, so that if the final step of writing to the audit log fails, then everything is rewound all the way back to the queue for the chance to try again. If the operation still continues to fail, you'll still have a system with consistent quality
This morning you get a notification that a Big Box Store wants you to ship with "signature required" service on large orders, and to do that they're changing part of the EDI spec to contain a new boolean value. The problem is that notification was sent a month ago and nobody remembered to forward it to the IT department until the change had already gone into effect. If you were deleting their EDI files after importing them into the database then you'd be in hot water. It may still be awkward even if you had backups to restore from, because now you've got a thousand workflows in various stages that have to be changed. To solve this, you'll want a system that records changes to a workflow's state, but doesn't actually change any data. The workflow has to begin with the EDI document and use it as the bible all the way to the end. You pull the minimum details into your database for tracking and indexing, and build an abstraction layer to query the original document for everything that isn't performance sensitive (such as getting the spec for the shipping label). Now when the document's format changes, you only have to change the abstraction layer
Every system has failure modes, all the way from the trivialities of your personal life to the global economy, and the truth is that we are always operating in at least one failure mode all of the time. My car's suspension needs work, we just lost an employee who walked out with a lot of unwritten knowledge, access to credit has dried up and the economy is shrinking. And yet my car still runs, the company is still in business, and we can still buy and sell things. We just cope with the degradation of quality and work to improve it somehow.