tips

Things I find useful.

Ask for the Coffee

Most discussions about coding cloud infrastructure lead to cloud orchestration tools like Terraform, CloudFormation, or Google Cloud Deployment Manager. These infrastructure-as-code tools manage the complexity of launching resources. They only need a description of the resources to launch. We provide these descriptions by declaratively programming the configuration files that these tools consume. But what is declarative programming?

Let's review a YAML configuration used by CloudFormation to launch an EC2 instance with a Security Group:

Resources:
  EC2Instance:
    Properties:
      ImageId: ami-00a208c7cdba991ea
      InstanceType: t2.small
      KeyName: ilikecoffeeinamug
      SecurityGroups:
      - Ref: InstanceSecurityGroup
    Type: AWS::EC2::Instance
  InstanceSecurityGroup:
    Properties:
      GroupDescription: Enable SSH access via port 22
      SecurityGroupIngress:
      - CidrIp: 127.0.0.1/32
        FromPort: 22
        IpProtocol: tcp
        ToPort: 22
    Type: AWS::EC2::SecurityGroup
CloudFormation Template for an EC2 Instance and a Security Group

The configuration file does not have any control flow. There are no procedural steps for launching these resources. It's only a description of the resources. The steps to launch the EC2 instance and the security group are managed by CloudFormation and hidden from us. This is declarative programming.

Declarative programming is like walking into a coffee shop and asking, "Could I please have a medium roast coffee, no room, to go please?" I describe the coffee I want and wait for it. After a short amount of time, and assuming I didn't talk to an angry barista, I'd have my coffee in a to-go cup and I'd be on my way.

Contrast this with imperative programming, or what most programmers consider coding. In the coffee example, imperative programming is the list of steps the barista follows to make my cup of coffee:

  1. Grind the beans
  2. Load the beans into the coffee maker
  3. Pour water into the coffee maker
  4. Press the button to brew the coffee
  5. Wait for the coffee to brew
  6. Fill a to-go coffee cup to the top with the brewed coffee
  7. Serve the coffee to the customer

The order of operations is important. If the steps were rearranged or skipped, I would not have my cup of coffee. Also, much to my dismay, this is only one way to make a cup of coffee. There are countless ways to make a cup of coffee.

The same can be said about provisioning and configuring cloud infrastructure for use with software. As a developer, I don't want to spend my time making cloud resources. I want to ask for them so I can sip my coffee.
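To make the contrast concrete, here is a rough dry-run sketch of what the imperative version of the template above might look like with the AWS CLI, next to the single declarative request. It only prints the commands rather than running them. The group name ssh-access, the file name template.yaml, and the stack name coffee-stack are made-up names for illustration, and the commands assume a default VPC where security groups can be referenced by name.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the AWS CLI commands instead of executing them.
# "ssh-access", "template.yaml", and "coffee-stack" are hypothetical names;
# the other values are copied from the template above.

run() {
  # Print the command rather than running it, so this is safe to try.
  echo "+ $*"
}

imperative_launch() {
  # Step 1: the security group has to exist before anything can use it.
  run aws ec2 create-security-group \
    --group-name ssh-access \
    --description "Enable SSH access via port 22"
  # Step 2: open port 22 on that group.
  run aws ec2 authorize-security-group-ingress \
    --group-name ssh-access --protocol tcp --port 22 --cidr 127.0.0.1/32
  # Step 3: only now can the instance be launched into the group.
  run aws ec2 run-instances \
    --image-id ami-00a208c7cdba991ea --instance-type t2.small \
    --key-name ilikecoffeeinamug --security-groups ssh-access
}

declarative_launch() {
  # One request: hand over the description and let CloudFormation order the work.
  run aws cloudformation deploy \
    --template-file template.yaml --stack-name coffee-stack
}

imperative_launch
declarative_launch
```

In the imperative version I own the ordering: swap steps 1 and 3 and the launch fails. In the declarative version the ordering is CloudFormation's problem.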

Resolving Production Incidents

Production incidents are a nasty problem and have been a part of my career. I don't see them going away, so I've found an approach to handling them that has normalized them for me. The strategy is two-pronged: I use short-term solutions and long-term solutions.

Each solution focuses on a different facet of an incident. First, an incident isn't an incident if it doesn't affect users. Sometimes it's annoying to users, and sometimes it's catastrophic. In either case, users are angry, and if the incident continues, they may take their money and leave. Second, to correct the incident, you need to find and analyze the root cause. With large systems, this can take a long time, and the fix might take even longer. There are countless times I thought I had fixed the root cause only to realize I had created another problem. Finding the root cause and the correct solution is vital if I don't want to wake up again at 4 am.

The angry-users facet conflicts with the root-cause-analysis facet. If I looked for the root cause first and tried to solve it, I might prolong the time my users spend angry and complaining. This conflict is where using short-term and long-term solutions can help.

First, I solve the angry users problem. I want to let my users know that I am aware of the situation and working on a solution. Therefore, I look for fast short-term solutions. I might not know what the root cause of the incident is, but I can take actions to appease my users. The solutions here can take many forms, but speed is the most crucial factor. For instance, I can:

  • Deploy the previous version of the code
  • Turn on a maintenance mode with a status message
  • Turn off feature flags for high-load features
  • Delay asynchronous job processing
  • Hot patch the code to disable certain functionality
  • SSH into servers and kill processes
  • Launch new cache servers to take load off the database
  • Scale instances up or down

Besides calming down my users, the other reason I implement the short-term solutions first is to buy time, and time is what's needed to implement a long-term solution. In the past, I've gone through incidents where the long-term solution required days to months of effort (I've gone through some terrible incidents). Another benefit of calming down users is the ability to work on the root cause without angry customers breathing down my neck.

Incidents Will Always Find You

Production incidents are a normal part of software engineering. I work tirelessly at my job to reduce the number of incidents that happen weekly, but incidents happen. Resolving incidents by using the short-term and long-term solutions strategy has helped me approach incidents in a calm and relaxed manner. Hopefully, this helps you out too!

Send UDP messages with /dev/udp

The bash shell comes with two pseudo-devices for TCP and UDP network communication at /dev/tcp and /dev/udp. Bash handles these paths itself when they appear in a redirection, so they work even when no such files exist under /dev. To use either one, you read from or write to the device path with the host and port appended, e.g., /dev/tcp/google.com/80. The primary reason I use the pseudo-devices is that they're easier for me to remember than netcat. But if I wanted a portable solution, netcat is the winner. Either way, learning of bash's TCP and UDP pseudo-devices tickled my brain.
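As a small illustration of the TCP side, here is a sketch of a port check built on /dev/tcp. The helper name port_open is my own invention, not something bash provides, and the example assumes a bash build with network redirections enabled.

```shell
#!/usr/bin/env bash
# port_open HOST PORT: succeed if a TCP connection to HOST:PORT can be opened.
# port_open is a hypothetical helper name, not a bash builtin.
port_open() {
  local host="$1" port="$2"
  # The subshell opens fd 3 on the pseudo-device. If the connection fails,
  # the subshell exits non-zero; either way the fd closes with the subshell.
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

if port_open 127.0.0.1 22; then
  echo "ssh port is open"
else
  echo "ssh port is closed"
fi
```

Against a remote host that silently drops packets, the connection attempt can hang until the operating system gives up, so this is best suited to quick checks against hosts you expect to answer.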

For my day-to-day work, /dev/udp comes in handy for sending statsd metrics to the local statsd server process on PlanGrid servers. statsd has a wire protocol that looks like:

METRIC_NAME:METRIC_VAL|TYPE

# Example: counting 200 status codes for nginx:
nginx.status_200:1|c

Here is how to send the above metric via bash's UDP pseudo-device:

echo "nginx.status_200:1|c" >/dev/udp/127.0.0.1/8125

The other, more portable way is with netcat (nc):

echo "nginx.status_200:1|c" | nc -u -w0 127.0.0.1 8125
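If you send metrics often, the bash one-liner can be wrapped in a couple of small functions. This is only a sketch: statsd_line and statsd_send are names I made up, and the defaults assume a statsd server listening on 127.0.0.1:8125 as in the examples above.

```shell
#!/usr/bin/env bash
# statsd_line and statsd_send are hypothetical helper names for illustration.

# statsd_line NAME VALUE TYPE: print one metric in the statsd wire format.
statsd_line() {
  printf '%s:%s|%s\n' "$1" "$2" "$3"
}

# statsd_send NAME VALUE TYPE [HOST] [PORT]: send one metric over UDP.
# Defaults assume a statsd server listening on 127.0.0.1:8125.
statsd_send() {
  local host="${4:-127.0.0.1}" port="${5:-8125}"
  statsd_line "$1" "$2" "$3" >"/dev/udp/${host}/${port}"
}

# Print the wire format without sending anything.
statsd_line nginx.status_200 1 c

# Fire and forget: UDP gives no delivery receipt, so success here only
# means the datagram was handed to the network stack.
statsd_send nginx.status_200 1 c || echo "send failed" >&2
```

Keeping the formatting separate from the sending makes it easy to eyeball a metric before it goes out.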

This post was originally published on medium.com.