20200920: cloudctl, Testing, and YAML Thoughts

I added tests to ensure Stack.Deploy() is covered. I developed the mockDeployer struct to implement stack.Deployer. I want to ensure that errors and responses from Deployer are handled correctly. From my limited research, I haven't seen many mocks in go packages. I'm unsure why this is. Maybe it's too complicated and using the concrete value is enough? Either way, I added mockDeployer.Verify() so it could report errors when methods were not called.

I'll probably learn why the community does/doesn't follow this behavior in time. For now, the mocks and tests make me feel warm and fuzzy inside. I'm sure I'll eat my own words later.

Oh, I made a similar mock in the cfn package. It's likely a little too complicated. It uses maps and consts to keep track of method calls. Just read the code, you'll see. I spent a lot of effort on it yesterday, but I kept going because I wanted to see where it led me. I'll let another feature force me to change it.

Finally, I'll add that the aws-sdk-go library's cloudformation API is insanely large. Implementing cloudformationiface in tests is unwieldy. Also, why doesn't cloudformation have a CreateOrUpdate function? awscli has the deploy subcommand, but the go api doesn't have it. Why?
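One common way to tame a huge SDK interface is to depend on a small interface that names only the calls you use. The sketch below shows that idea together with a create-or-update flow. The stackAPI interface and its methods are hypothetical, not part of aws-sdk-go, and the fake is an in-memory stand-in for tests.

```go
package main

import "fmt"

// stackAPI is a narrow, hypothetical interface with only the calls this
// sketch needs. It is not the aws-sdk-go cloudformationiface interface.
type stackAPI interface {
	Exists(name string) (bool, error)
	Create(name, template string) error
	Update(name, template string) error
}

// createOrUpdate creates the stack when it is missing and updates it
// otherwise. It returns which action was taken.
func createOrUpdate(api stackAPI, name, template string) (string, error) {
	exists, err := api.Exists(name)
	if err != nil {
		return "", fmt.Errorf("check stack %s: %w", name, err)
	}
	if exists {
		return "update", api.Update(name, template)
	}
	return "create", api.Create(name, template)
}

// fakeAPI is an in-memory stackAPI for tests.
type fakeAPI struct {
	stacks map[string]string // stack name -> template body
}

func (f *fakeAPI) Exists(name string) (bool, error) {
	_, ok := f.stacks[name]
	return ok, nil
}

func (f *fakeAPI) Create(name, template string) error {
	f.stacks[name] = template
	return nil
}

func (f *fakeAPI) Update(name, template string) error {
	f.stacks[name] = template
	return nil
}

func main() {
	api := &fakeAPI{stacks: map[string]string{}}
	first, _ := createOrUpdate(api, "web", "v1")
	second, _ := createOrUpdate(api, "web", "v2")
	fmt.Println(first, second, api.stacks["web"]) // prints: create update v2
}
```

A real adapter would implement stackAPI on top of the SDK client and would have to deal with cases this sketch ignores, such as stacks that exist but are in a failed state.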

For next time

I want to look at YAML usage for cloudctl. I like that YAML is simple. But I hate that YAML can become complicated. I'll still choose YAML as a format, even with its flaws. But I hope to define a tight interface into cloudctl with YAML. This keeps learning the tool easy. And in time, I could write validations on top of YAML to ease writing it. I've always hated tools that didn't have an easy way to validate YAML.

After working with tools like salt and ansible, there are a few things off the top of my head that I want/don't want:

  • I don't want to manage very large YAML files. At work, there's a YAML file with 10K lines. There are too many reasons why this exists.
  • I want to have cloudctl semantically validate YAML configuration. For this to work, the set of possible keys should be fairly small and well defined.
  • I do not want Jinja (or go equivalent) in YAML! No! No! No!
  • If YAML becomes complicated and requires logic, I'd rather recommend jsonnet support. Hopefully, I don't have to get to this point.
  • I want to keep away from YAML references and anchors. I've found I have to be careful when using these features of YAML and have noticed with infrastructure tools that I tend to abuse them. For instance, a simple example is creating an anchor for an EC2 AMI image that's used profusely when defining Cloudformation stacks. The anchor allows DRY configuration, so it seems like a great first step to manage configuration files. But changing the anchor will force all Cloudformation stacks to update. If you have a 10K-line YAML file like I do, then the related Cloudformation stacks will be blasted.
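To make the semantic validation idea concrete, here's a small sketch that rejects keys outside a known set. It assumes the YAML mapping has already been decoded into a Go map by some YAML library, and the key names in allowedKeys are made up; cloudctl's real keys aren't defined in this post.

```go
package main

import (
	"fmt"
	"sort"
)

// allowedKeys is a hypothetical schema for one stack entry.
var allowedKeys = map[string]bool{
	"name":       true,
	"template":   true,
	"parameters": true,
}

// validateKeys returns an error naming every key outside allowedKeys.
// doc stands for a YAML mapping already decoded into a Go map.
func validateKeys(doc map[string]interface{}) error {
	var unknown []string
	for key := range doc {
		if !allowedKeys[key] {
			unknown = append(unknown, key)
		}
	}
	if len(unknown) == 0 {
		return nil
	}
	sort.Strings(unknown) // stable message regardless of map iteration order
	return fmt.Errorf("unknown keys: %v", unknown)
}

func main() {
	doc := map[string]interface{}{
		"name":     "web",
		"tempalte": "web.yml", // typo on purpose
	}
	fmt.Println(validateKeys(doc)) // prints: unknown keys: [tempalte]
}
```

This only works well when the set of keys is small and well defined, which is the reason to keep the interface tight in the first place.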

Checklist: Boring Production Tasks

  1. Write out the shell commands to execute, URLs to visit, copy/paste information, and any manual steps required to perform the production task.
  2. Ask yourself, could any of the steps fail? Are there alternative steps that could yield the same result and reduce the chance of failure? Add or update steps to reduce the likelihood of failure.
  3. Ask yourself, could you reorder any of the steps to increase the speed of the task? Could any of the steps be done at the beginning of the checklist? If so, make changes.
  4. Ask yourself, could a less experienced developer handle this checklist? What would they need to know? Could that knowledge be baked into a step? Edit steps to incorporate more details.
  5. Share the checklist with colleagues experienced with the system. Ask them if there are missing or inaccurate steps. Make changes if necessary.
  6. Ask a less experienced teammate to perform your checklist in a staging environment. While your teammate steps through the list, record any missteps, questions, and failed steps.
  7. Review the completed checklist and incorporate changes into a new checklist. Replace failed steps with any actions you and your teammate took in Step 6.
  8. Send the revised checklist to colleagues for review again.
  9. Repeat steps 6-8 as many times as necessary until you're confident of the steps.
  10. Breathe.
  11. Perform the checklist on your production system. Follow each step. Do not skip ahead. Slow and steady.
  12. Sigh in relief. Huzzah! You're done!

Next time you have to perform a manual, sweat-inducing, heart-pumping production task, use this checklist to write yourself a checklist. You might find that performing your production task is now dull. And maybe, you'll like this.

Make production tasks boring.

It's Always DNS

I bought an Ikea shelf (Bekant) to house the stormlight NUCs and my Synology NAS. I shut down the entire cluster so that I could move the machines into the shelf. I built the shelf. I placed all the machines into the shelf and meticulously looped cables behind the shelf and plugged them into a powered-off surge protector. I hit the power switch on the surge protector and booted everything.

Since the NUCs were built with NVMe disks, they boot up instantly. Fatty was unfortunately behind.

I checked the kubernetes dashboard for the cluster and noticed that the docker registry and the nfs-client-provisioner services were down. They could not reach fatty (Synology NAS). Makes sense, fatty is old and has four spindle drives it needs to validate.

Once fatty was available, I restarted the broken pods in stormlight, with no luck. The pods were still down.

I looked into the error and realized they were failing to connect to https://fatty.stormlight.home.

Mutha F'er. It's always F'n DNS.

This is what I get for making my router's primary DNS server fatty.

When fatty was offline, my router fell back to the secondary resolver. When the NUCs came online and started querying fatty.stormlight.home, they cached the secondary resolver's response of NXDOMAIN. And that's my problem.

The fix? Flush the DNS cache and restart.

I flushed dns (with ansible: ansible -i hosts all --become -a 'systemd-resolve --flush-caches') and then verified the NUCs were able to resolve fatty's DNS. Afterwards, I checked the kubernetes dashboard and everything came back online.

Anyways, here's what my corner looks like now:

Stormlight: My Intel NUC Kubernetes Cluster

When I began building a kubernetes cluster in December 2019, I didn't have a great plan. I wanted to program with go, I wanted to learn kubernetes, and I definitely wanted Intel NUCs. As I researched technical decision after technical decision, I finally came to a list of desires for the cluster. First, I had to have a name. I named the cluster Stormlight.

Second, I crafted user stories I wanted for myself.

  1. I want to access my cluster services at the domain .stormlight.home so that I don't have to remember IP addresses and port numbers.
  2. I want a simple (to me) deployment system that doesn't require touching DNS configuration, storage configuration, and TLS certificate configuration for each service I deploy. If I do have to configure these components for a service, the settings should live within the kubernetes manifests.
  3. When I open cluster HTTP services with Chrome, I want a locked icon in the URL bar. I don't want to see the "Your connection is not private" warning and then click the "Proceed to ..." link. These annoy me.

With these out of the way, let's dive into the components that make up Stormlight.

Physical Hardware

Obviously, I'm using Intel NUCs. But, there are a few other devices on the network that help fulfill my needs. Here are the machines, their names, and some specs.

  • lightweaver
    • Kubernetes master
    • Intel NUC 8i3BEK M.2 SSD
    • 32GB RAM
    • 250GB SSD
  • skybreaker
    • Kubernetes worker
    • Intel NUC 8i3BEK M.2 SSD
    • 32GB RAM
    • 250GB SSD
  • windrunner
    • Kubernetes worker
    • Intel NUC 8i3BEK M.2 SSD
    • 64GB RAM (I got lucky here! Amazon shipped me 64GB instead of the originally purchased 32GB!)
    • 250GB SSD
  • fatty
    • Synology DS413j
    • 8TB storage
    • Some pathetic amount of CPU and RAM. This thing is old, slow, and still works. I can't really complain.
  • TRENDnet 8-Port Gigabit GREENnet Switch
  • Netgear Nighthawk Wifi Router

My home's network diagram looks like this.

Network Diagram

For all machines, I configure a static IP address on the wifi router. For NUCs, I assign the IP address when the computer is installing the operating system. There's likely a simpler way to do this, but this worked, and I only had to do it three times.

Software

  • Ubuntu 18.04 Server
  • DNS Server
    • Runs on fatty using Synology's DNS server package
    • Hosts the private zone stormlight.home
    • All machines have hostnames defined to make SSH easier
    • *.stormlight.home record points to lightweaver (more on this later)
  • NFS Server
  • Certificate Authority (self-hosted) for SSL certificate signing
    • Root CA for creating intermediate CAs
    • Intermediate CA for signing server certs
    • Both Root and Intermediate certs are installed on my laptop and all cluster machines
  • HAProxy
    • All traffic to the Stormlight is directed here (via *.stormlight.home DNS above)
    • Runs on lightweaver
    • Uses a wildcard cert for *.stormlight.home giving me a nice 🔒 icon in Chrome
    • Terminates SSL traffic
    • Proxies traffic to the local kubernetes ingress (see kubernetes configuration below)
  • Kubernetes v1.17
    • Installed with kubeadm
    • Uses a single host as the master (lightweaver)
    • Uses kubernetes self-signed certs (I built the CA after I set up kubernetes, so I didn't use my own CA at the time).

stormlight.home Domain

Kubernetes relies on load balancers in the cloud or on-premise to handle ingress traffic. For Stormlight, I could deploy services using NodePorts, and I would be able to access the service at <ip address of any NUC>:<NodePort>. But I find this inelegant. I want a domain name for Stormlight.

Therefore, I looked at using DNS. Initially, I wanted to use a public top-level domain. But this costs money, and I'm cheap. So I decided on the stormlight.home private domain. It's not entirely clear to me that the .home TLD is suitable for private use, but I'm comfortable dealing with this in the future.

stormlight.home is served by the DNS server running on fatty. My home router is configured to request records with fatty first before hopping out to 1.1.1.1. Therefore, stormlight.home is available while I'm connected to my home network.

stormlight.home has a handful of configured DNS records. Every machine in Stormlight has an entry. This keeps me from saving IP addresses in my SSH config to connect to my NUC machines. Aside from machine records, the DNS server has a wildcard record handling all other subdomains. This is how I send traffic to Stormlight.
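The interplay between machine records and the wildcard can be sketched in a few lines. This is a simplified model, not how Synology's DNS server is implemented: an exact record wins, otherwise the closest wildcard above the name answers. Real DNS wildcard matching has more rules than this. The IP addresses are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// resolve returns the exact record for name when one exists. Otherwise
// it strips labels from the left and tries "*.<rest>" at each level.
func resolve(records map[string]string, name string) (string, error) {
	if ip, ok := records[name]; ok {
		return ip, nil
	}
	rest := name
	for {
		dot := strings.Index(rest, ".")
		if dot < 0 {
			return "", fmt.Errorf("no record for %q", name)
		}
		rest = rest[dot+1:]
		if ip, ok := records["*."+rest]; ok {
			return ip, nil
		}
	}
}

func main() {
	// Hypothetical addresses; the wildcard points at lightweaver.
	records := map[string]string{
		"fatty.stormlight.home":       "192.168.1.10",
		"lightweaver.stormlight.home": "192.168.1.11",
		"*.stormlight.home":           "192.168.1.11",
	}
	for _, name := range []string{"fatty.stormlight.home", "kuard.stormlight.home"} {
		ip, err := resolve(records, name)
		fmt.Println(name, ip, err)
	}
}
```

In this model fatty.stormlight.home gets its own address, while any name without a record, such as kuard.stormlight.home, lands on lightweaver.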

Ingress Traffic to Stormlight

Running on lightweaver's port 443 is HAProxy, enabling traffic into Stormlight from *.stormlight.home subdomains. HAProxy terminates SSL (user stories #2 and #3) and forwards traffic, locally, to the nginx-ingress NodePort service running in kubernetes.

Tracing an HTTPS Request

Let's move our attention to a simplified HTTPS request for the fictional service mysvc. Below is a diagram of an HTTPS request into Stormlight. I simplified kubernetes to make the diagram easier to comprehend. In reality, I remove the taints on the master so that the kubernetes scheduler can place pods on all NUCs. So theoretically, traffic could stay entirely on lightweaver if there were mysvc pods running there.

HTTPS Request Tracing
  1. A user requests https://mysvc.stormlight.home. This resolves to lightweaver's IP address because I have the wildcard DNS *.stormlight.home record on my Synology NAS.
  2. The request routes to lightweaver's HAProxy.
  3. HAProxy terminates the SSL request.
  4. HAProxy forwards the HTTP request to the local kubernetes cluster's nginx ingress port.
  5. nginx ingress forwards to the mysvc kubernetes service.
  6. mysvc processes the request by forwarding to whatever deployment/replica/pods are running within the cluster.
  7. mysvc sends the response back to the nginx ingress.
  8. nginx ingress responds to HAProxy.
  9. HAProxy responds to the user.
  10. Hopefully, the user is happy. The user is me. I am happy.

Kubernetes

Let's move our attention to Kubernetes. Aside from nginx-ingress, there are a few other services.

Here is the complete list of services on Stormlight.

nginx-ingress

Stormlight uses nginx-ingress to route all HTTP traffic into the cluster. This really helps me with user story #2. I can configure the subdomain/path of a service running in stormlight simply by creating an Ingress resource. I don't have to configure anything in fatty's DNS server or the HAProxy. I really dig this setup.

For example, here's a basic configuration for kuard (a handy debugging application) that uses https://kuard.stormlight.home as the domain.

---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: ingress-kuard
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  # I can set this to whatever I want
  - host: kuard.stormlight.home
    http:
      paths:
      - path: /
        backend:
          serviceName: kuard
          servicePort: 80

nfs-client

A part of user story #2 deals with storage. While I could configure services to use local storage, I wanted to use fatty as well. Data on fatty has better durability (four 2TB drives) and is configured for cloud backups (I did this a long time ago). I want to use local storage for performance, but anything important would be stored on fatty.

I dug into Kubernetes docs on storage options and ran around in circles. Should I use Volumes, Persistent Volumes, or Container Storage Interface plugins? After several days of reading, I landed on nfs-client from the external-storage github repo. To add to my initial confusion, external-storage states that the repository is deprecated and that I should use sig-storage-lib-external-provisioner instead. But, on sig-storage-lib-external-provisioner page, it links back to external-storage for examples. Sigh. Luckily, nfs-client worked well and was easy to set up.

Here's how I configured nfs-client.

nfs-client creates a dynamic provisioner for fatty's NFS shares. With the provisioner, I expose two types of storage classes. The first one, called fatty-archives, archives the data when a PersistentVolumeClaim (PVC) is deleted. Therefore, I don't have to worry about losing data when I muck around with the cluster and accidentally delete PVCs.

The second storage class, called fatty, deletes data when a PVC is removed. Honestly, I don't have much use for the fatty storage class yet, but it enables scaling pods across nodes and using the NFS mount for shared data.

Here's the configuration I use for my docker registry setup.

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: registry-data
  annotations:
    volume.beta.kubernetes.io/storage-class: "fatty-archives"
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100G
---

apiVersion: apps/v1
kind: Deployment
metadata:
  name: registry
spec:
  selector:
    matchLabels:
      app: registry
  replicas: 1
  template:
    metadata:
      labels:
        app: registry
    spec:
      containers:
        - name: registry
          image: registry
          imagePullPolicy: Always
          ports:
            - containerPort: 5000
          volumeMounts:
            - name: registry-data
              mountPath: /var/lib/registry
      volumes:
        - name: registry-data
          persistentVolumeClaim:
            claimName: registry-data

If you look carefully, you don't see fatty's host or mount shares in this configuration. The NFS configuration is managed in a single place -- the nfs-client manifests. From a service perspective, there's no dependency on NFS. The service depends on a PVC (i.e., a storage request) and the volume configuration for the Deployment. In the future, if I decide to replace fatty with a new NAS (I'm due for an upgrade), then I only need to configure a new storage class, move data from the old NAS to the new NAS, and change the PVCs for all my services. I don't have to find all the places where the NFS information is configured in service manifests. Lovely!

Docker Registry

One of the biggest reasons I wanted durable network storage was to run a docker registry. Remember, I'm cheap. So paying for a registry was not something I wanted to do. I also didn't want to store the data on a single node. If I wanted to scale up the number of pods running the registry, I'd like to do so without worrying about where the data lives.

So Stormlight runs Docker's Registry at registry.stormlight.home. The images are stored on fatty through the nfs-client provisioner.

Master Component Backups

I worry about failures. I work with cloud providers at my day job, so failures are common and expected. At home, my computers fail far less than cloud instances (case in point: fatty). But, they will fail. And that makes me nervous. So, I made some contingency plans.

Kubernetes depends on etcd as the backing database. In a single-master configuration, there's one etcd instance. Failure of etcd renders the master services unusable. Services running on the cluster continue to run, but if a service pod dies, the master is unable to provision a replacement. Therefore, backing up etcd is useful.

I found this nice post on backing up the master. I tweaked the script so that it also backs up kubernetes' self-signed certs. This backup runs every hour and stores data on fatty. The restoration process is scripted with ansible and very similar to what's described in the linked blog post.

This plan leaves me a little less nervous, but it doesn't consider a full lightweaver failure. With a full failure of lightweaver, the entire cluster is unusable. All *.stormlight.home subdomains will fail because the machine is offline.

Even writing this makes me nervous, but I think I'll deal with the failure in the future. I've built Stormlight using code. There are no manual steps. So, if I lose the cluster, I can rebuild by relying on my code. I'll open source my code in a future post.

That said, I might buy a few Raspberry Pis to build a multi-master cluster later. That could be another fun side project.

The Results: My Deploy Process

With all the above in place, I can begin programming my own services (about damn time). To deploy, all I need is a single manifest file with a Deployment, a Service, and an Ingress. Here's what kuard's manifest file looks like:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kuard
spec:
  selector:
    matchLabels:
      app: kuard
  replicas: 1
  template:
    metadata:
      labels:
        app: kuard
    spec:
      containers:
      - image: gcr.io/kuar-demo/kuard-amd64:1
        imagePullPolicy: Always
        name: kuard
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: kuard
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
  selector:
    app: kuard

---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: ingress-kuard
  annotations:
    # use the shared ingress-nginx
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: kuard.stormlight.home
    http:
      paths:
      - path: /
        backend:
          serviceName: kuard
          servicePort: 80

With the above file, I run kubectl apply -f kuard.yml and then visit https://kuard.stormlight.home. That's my whole deploy process.

And if I need durable storage? I can use the fatty or fatty-archives storage class, or hook into a NUC's local disk.

Onwards

That's Stormlight! I'll share my code with future posts. I need to spend some time cleaning up the code to make it a little easier to use first.

The Side Project: Keeping The Maker In Me Alive

I started building a homelab using Intel NUCs in December, but I never wrote down why I wanted to do this. Since my mind runs rampant with thoughts, I figure I should write this down before I lose it.

For the past two years, my work activities have led me farther away from coding. It was a conscious decision. I became a tech lead for a year. After that, I formed a new team as the engineering manager. I'll likely stick to this role for a while because it's challenging to solve problems amongst humans. I finally understand the saying that all problems are human problems. I definitely don't know how to solve them, though. That said, I sorely miss being in a flow state and building projects.

So in the middle of last year, I thought about what I could work on.

Firstly, the project should keep my coding skills sharp. I immediately thought of go and how I wanted to learn it. Like really learn it. I built a few services with the language, but honestly, I'm still looking up packages and syntax rules.

Secondly, I want to learn kubernetes. I helped build the original kubernetes cluster at work (v1.3 yeesh!) but moved onto other projects that kept me at arm's length. I never learned the details well enough. I want to know kubernetes like I know the back of my hand.

Finally, right around the time I thought about these ideas, I fell in love with Intel NUCs. They're mini-computers (4" x 4"), and I love the look. I'm a big sucker for dope-ass-looking devices. They're also decently priced, and I had closet space to spare. So I purchased three of them back in December 2019.

My goals are:

  • set up a local kubernetes cluster so that I can launch go projects
  • write posts about the process
  • open source code for the system (first one was the stormlight-iso)

I've already built quite a bit of this system, so in the next post, I'll present what I've built.

Oh, and the name of the cluster is stormlight because I love the Stormlight Archive books from Brandon Sanderson.