Resolving Production Incidents
Production incidents are a nasty problem, and they have been a part of my career. I don't see them going away, so I've found an approach to handling them that has normalized them for me. The strategy I use to resolve production incidents is two-pronged: I use short-term solutions and long-term solutions.
Each solution focuses on a different facet of an incident. First, an incident isn't an incident unless it affects users. Sometimes it's annoying to users, and sometimes it's catastrophic. In either case, users are angry, and if the incident continues, they may take their money and leave. Second, to correct the incident, you need to find and analyze the root cause. However, with large systems, finding it can take a long time, and correcting it might take even longer. There have been countless times I thought I had fixed the root cause only to realize I had created another problem. Finding the root cause and the correct solution is vital if I don't want to wake up again at 4 am.
The angry-users facet conflicts with the root-cause-analysis facet. If I looked for the root cause and tried to solve it first, I would prolong the time my users spend angry and complaining. This conflict is where using short-term and long-term solutions can help.
First, I solve the angry users problem. I want to let my users know that I am aware of the situation and working on a solution. Therefore, I look for fast short-term solutions. I might not know what the root cause of the incident is, but I can take actions to appease my users. The solutions here can take many forms, but speed is the most crucial factor. For instance, I can:
- Deploy the previous version of the code
- Turn on a maintenance mode with a status message
- Turn off feature flags for high load features
- Delay asynchronous job processing
- Hot patch the code to disable certain functionality
- SSH into servers and kill processes
- Launch new cache servers to take load off the database
- Scale instances up or down
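To make two of the options above concrete (maintenance mode and turning off a high-load feature), here is a minimal Python sketch of a kill switch. Everything in it is hypothetical and only for illustration: the `KillSwitches` class, the `handle_request` function, the flag names `maintenance_mode` and `heavy_reports`, and the `/reports` path are all made up. It also assumes in-memory flags, whereas a real system would more likely read them from a shared store so that every server sees the change at once.

```python
class KillSwitches:
    """Hypothetical in-memory flag store used only for this sketch."""

    def __init__(self, flags):
        self._flags = dict(flags)

    def set(self, name, enabled):
        # Flip a flag at runtime, without deploying new code.
        self._flags[name] = bool(enabled)

    def is_enabled(self, name):
        # Unknown flags default to off, so a typo never enables anything.
        return self._flags.get(name, False)


def handle_request(path, switches):
    """Return a (status_code, body) pair for a request path."""
    # Maintenance mode wins over everything and tells users we know.
    if switches.is_enabled("maintenance_mode"):
        return 503, "We are aware of the issue and are working on a fix."
    # The expensive feature degrades gracefully when its flag is off.
    if path == "/reports" and not switches.is_enabled("heavy_reports"):
        return 200, "Reports are temporarily unavailable."
    return 200, f"OK {path}"


if __name__ == "__main__":
    switches = KillSwitches({"heavy_reports": True, "maintenance_mode": False})
    print(handle_request("/reports", switches))
    switches.set("heavy_reports", False)  # short-term fix during an incident
    print(handle_request("/reports", switches))
```

The point of the design is speed: flipping a flag takes seconds and needs no knowledge of the root cause, which is exactly what a short-term solution is for.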
Besides calming down my users, the other reason I implement short-term solutions first is to buy time, and time is what's needed to implement a long-term solution. In the past, I've gone through incidents where the long-term solution required days to months of effort (I've gone through some terrible incidents). Calming down users also lets me work on the root cause without angry customers breathing down my neck.
Incidents Will Always Find You
Production incidents are a normal part of software engineering. I work tirelessly at my job to reduce the number of incidents that happen each week, but incidents still happen. Resolving them with the short-term and long-term solutions strategy has helped me approach incidents in a calm and relaxed manner. Hopefully, this helps you out too!