Briefing Nine: Your systems have a tolerance limit

The most dangerous moment in any complex operation isn't when things go wrong. It's when the person managing it stops knowing what they don't know.

Think about the last time you were truly overwhelmed at work.

Not just busy. Overwhelmed. The kind where you're staring at your screen and you genuinely cannot decide what to do next. Where everything feels equally urgent and equally impossible. Where you've been working for hours and have nothing meaningful to show for it.

That feeling has a name — it's called saturation.

The highest-performing operators in the most complex fields in the world have spent decades building systems specifically to prevent it — and recover from it fast when it happens anyway.

Today I want to share the principle behind those systems. It's called Fault Tolerance Design. And once you understand it, you'll look at your overwhelm very differently.

What fault tolerance actually means

In engineering, a fault is a deviation — a component behaving outside its expected parameters. It might be minor. It might be correctable. Left unaddressed, faults cascade.

A fault-tolerant system is one designed with the assumption that faults will happen. The question the engineer asks is: when a fault occurs, how does the system continue to function?

This is a profound shift in thinking. Most systems — and most people — are designed for the ideal case. Everything working as expected, all inputs within normal range, no surprises.

The moment conditions deviate from ideal, those systems degrade fast. They were designed for smooth running, so deviation breaks them.

A fault-tolerant system doesn't assume perfect conditions. It assumes imperfect ones — and performs anyway.

The three layers of fault tolerance

Engineers build fault tolerance in three layers. Each one translates directly into how you manage your workload.

Layer 1 — Redundancy

Critical systems are never single points of failure. If one component fails, a backup takes over seamlessly. The backup is already in place, already warm, already ready — before the fault arrives.

For you, this means never holding critical information only in one place.

When a task, a deadline, a decision, or a commitment exists only in your head, you have created a single point of failure. The moment you hit saturation — and you will — that information is at risk.

Fault-tolerant operators externalise everything. Write it down, capture it, get it out of your head and into a trusted external system. Your capture list is your redundancy layer. Your head is the processor. The external system is the storage.

Layer 2 — Graceful degradation

When a fault occurs, a well-designed system sheds non-critical functions to preserve the most important ones. A spacecraft losing power doesn't shut off everything. It sheds what's least essential and protects what cannot be lost.

Most people, when overwhelmed, do the opposite. They try to hold everything together simultaneously. They keep all the plates spinning. They refuse to let anything drop.

That is how systems fail catastrophically instead of gracefully.

Graceful degradation at work means having a pre-decided hierarchy of what gets protected when capacity is under pressure. Which work is mission-critical — the thing that cannot be compromised? Which work can be simplified? Which work can be set aside until the pressure passes?

Make this decision in advance, when your thinking is clear. When saturation hits, the response is already decided.

The people who seem calm under pressure have simply already answered the question: if I can only do one thing today, what is it? Everything else degrades gracefully around that answer.

Layer 3 — Safe mode

When faults exceed what normal operation can handle, complex systems enter safe mode. Non-essential processes are suspended. The system stabilises. Recovery becomes possible.

Safe mode is a designed response to conditions that exceed operating parameters. It is part of the system, not a failure of it.

Burnout is what happens to a person with no safe mode.

When faults keep accumulating and the system keeps operating at full capacity regardless, something eventually gives way. The crash, when it happens, is always bigger than it needed to be.

Your safe mode is a valid operational state that your performance depends on. A shortened workday. A moratorium on new commitments. A week of only one priority. Whatever the minimum viable version of your professional function looks like — that is your safe mode. Define it before you need it, because when you need it, you will already be past the point of designing it clearly.

The question worth sitting with today

Most high performers know their field intimately. They are excellent at their work.

And they have no redundancy layer, no graceful degradation plan, and no defined safe mode.

From an engineering perspective, they are a single point of failure running at full capacity with no margin and no recovery protocol.

You would never design a critical system this way. Why are you running yourself this way?

Today's question — just one, and it deserves ten minutes:

If your capacity dropped by 40% tomorrow, what would you protect, what would you simplify, and what would you set aside?

Write it down. Three columns. Ten minutes.

That document is the beginning of your fault tolerance design.

If this gave you one useful idea today — forward this to one person who needs it. A colleague running on empty. A friend who keeps saying they'll sort out their workload but hasn't yet. A peer carrying more than they're showing.

One forward. That's it.

— Sumana.


Mission Control Club