Resilience Engineering

My introduction to resilience engineering came by way of John Allspaw's talk at DevOps Enterprise Summit 2017, "How Your Systems Keep Running Day After Day." So I was delighted by my great fortune to work with him as part of the SNAFUcatchers consortium when I joined New Relic.

The software industry has fairly mediocre standard practice for learning from incidents. We are only starting to adopt practices that are better developed in other high-risk industries like aerospace, transportation, power generation & distribution, and healthcare.

What's more, our current tools for understanding our systems revolve around dashboards of time series charts, which offer little support for the essential sense-making that high-pressure incidents demand.

There is a huge opportunity for Observable to fill this niche of helping teams of engineers to make sense of the complexity in their systems.

One way I might drive adoption of Observable is to collaborate with my colleagues in the Learning From Incidents in Software community to create Notebooks that tell the stories of their incidents.

In so doing we could gradually grow visual languages for describing software systems supported by libraries and examples that empower deeper sense-making for other learners.

Observable already knows this. See "Why Visualization Helps Developers" blog

One of my favorite tools for making sense of messy code is to Draw a Sequence Diagram

For an example of a visual language for software systems, see Visualize System Quality

The newer collaboration features in Observable will likely grow up to become vital support for shared learning. See Joint Cognitive Whiteboard.

Not known outside of New Relic (maybe poorly known even within New Relic), El Dorado helps curious engineers make sense of a system under continual and relentless change. Ward Cunningham, Explore Domain Driven Design 2017, Denver CO. Ward presents El Dorado as a way of observing a dynamically changing software system.


Collecting a few presentations we've given around learning from incidents.

New Relic's incident response was world class. My teammate, Beth Adele Long, co-wrote a paper with Richard Cook about the NERF program in which elite incident commanders volunteered to support the most severe incidents. pdf

Beth Long and I have published "Building and revising adaptive capacity sharing for technical incident response: A case of resilience engineering" twitter pdf

Laura Maguire elicits the expertise of five software leaders and in so doing operationalizes common grounding in a tangled layered network! "Designing & Managing for Resilience" infoq

There's an influential paper about common ground. Despite its name, common ground is not a geographic location you can visit. It is a state of being, a relationship between people, the essential context in all human communication, continually changing.

How Your Systems Keep Running Day After Day. John Allspaw at DevOps Enterprise Summit 2017. youtube

John Allspaw reviews current software industry practice for learning from incidents and suggests how we can improve to get more value out of our incidents. youtube

John Allspaw shared a paper he loves: Common Ground and Coordination in Joint Activity: pdf. Capturing notes here especially for the notion of constant repair of common ground and the fundamental breakdown of common ground. youtube

Seeing the Invisible: Perceptual-Cognitive Aspects of Expertise. Novices see only what is there; experts see what is not there. pdf

Yesterday I learned something new to me about joint cognition and socio-technical systems. A group of software engineers gathered around a whiteboard are a joint cognitive system. The scrawls on the board are spatial cues for building a shared model of a complex system.

"How a Cockpit Remembers Its Speeds". Edwin Hutchins applies cognitive science to a unit of analysis that is larger than one person. Hutchins is a pilot and an ethnographer of flight decks, and his paper is a fascinating deep dive into an essential moment in the complex socio-technical system of commercial aviation. pdf

I got the code checked out, got secrets set up, and launched the local dev environment. The response in my browser was less than helpful: 502 Bad Gateway.

Colin Breck offers Quality Views to represent the system holistically, describing current investments, highlighting risks, and demonstrating how the system is evolving. They have been useful for communicating within our team, as well as externally, with both technical and non-technical stakeholders. They have been invaluable for aligning our mental models of the system. 2016 article
