Write out the shell commands to execute, URLs to visit, copy/paste information, and any manual steps required to perform the production task.
Ask yourself, could any of the steps fail? Are there alternative steps that could yield the same result and reduce the chance of failure? Add or update steps to reduce the likelihood of failure.
Ask yourself, could you reorder any of the steps to increase the speed of the task? Could any of the steps be done at the beginning of the checklist? If so, make changes.
Ask yourself, could a less experienced developer handle this checklist? What would they need to know? Could that knowledge be baked into a step? Edit steps to incorporate more details.
Share the checklist with colleagues experienced with the system. Ask them if there are missing or inaccurate steps. Make changes if necessary.
Ask a less experienced teammate to perform your checklist in a staging environment. While your teammate steps through the list, record any missteps, questions, and failed steps.
Review the completed checklist and incorporate changes into a new checklist. Replace failed steps with any actions you and your teammate took in Step 6.
Send the revised checklist to colleagues for review again.
Repeat steps 6-8 as many times as necessary until you're confident of the steps.
Perform the checklist on your production system. Follow each step. Do not skip ahead. Slow and steady.
Sigh in relief. Huzzah! You're done!
Next time you have to perform a manual, sweat-inducing, heart-pumping production task, use this checklist to write yourself a checklist. You might find that performing your production task is now dull. And maybe you'll like this.
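If some of the steps are scriptable, the "follow each step, do not skip ahead" rule can be enforced in code. The following is only a minimal sketch: the `run_checklist` helper and the step names are hypothetical, and each step is assumed to be a function that returns True on success.

```python
from typing import Callable, List, Tuple

# A step is a human-readable name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_checklist(steps: List[Step]) -> List[str]:
    """Run steps strictly in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            print(f"FAILED: {name} (stopping here, nothing is skipped)")
            break
        print(f"ok: {name}")
        completed.append(name)
    return completed
```

A failed step halts the run, so the remaining steps never execute against a half-finished state; the returned list doubles as a record for the review in the next iteration.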
I bought an Ikea shelf (Bekant) to house the stormlight NUCs and my Synology NAS. I shut down the entire cluster so that I could move the machines into the shelf. I built the shelf. I placed all the machines into the shelf and meticulously looped cables behind the shelf and plugged them into a powered-off surge protector. I hit the power switch on the surge protector and booted everything.
Since the NUCs were built with NVMe disks, they boot up instantly. Fatty was unfortunately behind.
I checked the kubernetes dashboard for the cluster and noticed that the docker registry and the nfs-client-provisioner services were down. They could not reach fatty (Synology NAS). Makes sense: fatty is old and has four spindle drives it needs to validate.
Once fatty was available, I restarted the broken pods in stormlight, with no luck. The pods stayed down.
I looked into the error and realized they were failing to connect to https://fatty.stormlight.home.
Mutha F'er. It's always F'n DNS.
This is what I get for making my router's primary DNS server fatty.
When fatty was offline, my router fell back to the secondary resolver. When the NUCs came online and started querying fatty.stormlight.home, they cached the secondary resolver's response of NXDOMAIN. And that's my problem.
The fix? Flush the DNS cache and restart.
I flushed the DNS cache on every node with ansible (ansible -i hosts all --become -a 'systemd-resolve --flush-caches') and then verified that the NUCs could resolve fatty's hostname. Afterwards, I checked the kubernetes dashboard and everything came back online.
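The verification step can be scripted so that pods are only restarted once the name actually resolves. This is a rough sketch, not what was run on the cluster: `wait_for_dns` is a hypothetical helper, and the resolver and sleep functions are injectable so the logic can be exercised without a network.

```python
import socket
import time
from typing import Callable, Optional


def wait_for_dns(
    hostname: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the resolved address, or None if every attempt fails."""
    for attempt in range(attempts):
        try:
            return resolve(hostname)
        except OSError:
            # socket.gaierror (raised for NXDOMAIN and similar) subclasses OSError.
            if attempt < attempts - 1:
                sleep(delay_seconds)
    return None
```

Calling `wait_for_dns("fatty.stormlight.home")` on a node after the flush returns an address once the lookup succeeds. It does not flush anything itself, so a cached NXDOMAIN will keep failing until the cache is cleared or expires.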
When I began building a kubernetes cluster in December 2019, I didn't have a great plan. I wanted to program with go, I wanted to learn kubernetes, and I definitely wanted Intel NUCs. As I researched technical decision after technical decision, I finally came to a list of desires for the cluster. First, I had to have a name. I named the cluster Stormlight.
Second, I crafted user stories I wanted for myself.
I want to access my cluster services at the domain .stormlight.home so that I don't have to remember IP addresses and port numbers.
I want a simple (to me) deployment system that doesn't require touching DNS configuration, storage configuration, and TLS certificate configuration for each service I deploy. If I do have to configure these components for a service, the settings should live within the kubernetes manifests.
When I open cluster HTTP services with Chrome, I want a locked icon in the URL bar. I don't want to see the "Your connection is not private" warning and then click the "Proceed to ..." link. These annoy me.
With these out of the way, let's dive into the components that make up the Stormlight.
Obviously, I'm using Intel NUCs. But, there are a few other devices on the network that help fulfill my needs. Here are the machines, their names, and some specs.
Intel NUC 8i3BEK M.2 SSD
Intel NUC 8i3BEK M.2 SSD
Intel NUC 8i3BEK M.2 SSD
64GB RAM (I got lucky here! Amazon shipped me 64GB instead of the originally purchased 32GB!)
Some pathetic amount of CPU and RAM. This thing is old, slow, and still works. I can't really complain.
TRENDnet 8-Port Gigabit GREENnet Switch
Netgear Nighthawk Wifi Router
My home's network diagram looks like this.
For all machines, I configure a static IP address on the wifi router. For NUCs, I assign the IP address when the computer is installing the operating system. There's likely a simpler way to do this, but this worked, and I only had to do it three times.
Used for durable storage in the kubernetes cluster
Certificate Authority (self-hosted) for SSL certificate signing
Root CA for creating intermediate CAs
Intermediate CA for signing server certs
Both Root and Intermediate certs are installed on my laptop and all cluster machines
All traffic to the Stormlight is directed here (via *.stormlight.home DNS above)
Runs on lightweaver
Uses a wildcard cert for *.stormlight.home giving me a nice 🔒 icon in Chrome
Terminates SSL traffic
Proxies traffic to the local kubernetes ingress (see kubernetes configuration below)
Installed with kubeadm
Uses a single host as the master (lightweaver)
Uses kubernetes self-signed certs (I built the CA after I set up kubernetes, so I didn't use my own CA at the time).
Kubernetes relies on load balancers in the cloud or on-premise to handle ingress traffic. For Stormlight, I could deploy services using NodePorts, and I would be able to access the service at <ip address of any NUC>:<NodePort>. But I find this inelegant. I want a domain name for Stormlight.
Therefore, I looked at using DNS. Initially, I wanted to use a public top-level domain. But this costs money, and I'm cheap. So I decided on the stormlight.home private domain. It's not entirely clear to me that the .home TLD is suitable for private use, but I'm comfortable dealing with this in the future.
stormlight.home is served by the DNS server running on fatty. My home router is configured to request records from fatty first before hopping out to 126.96.36.199. Therefore, stormlight.home is available while I'm connected to my home network.
stormlight.home has a handful of configured DNS records. Every machine in Stormlight has an entry. This keeps me from saving IP addresses in my SSH config to connect to my NUC machines. Aside from machine records, the DNS server has a wildcard record handling all other subdomains. This is how I send traffic to Stormlight.
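To illustrate how machine records and the wildcard record interact, here is a simplified lookup sketch. The IP addresses are made up, and real DNS wildcard matching has more rules than this; the point is only that an exact record wins and everything else under the domain falls through to the wildcard.

```python
from typing import Dict, Optional

# Hypothetical zone contents: the addresses are placeholders, not the real ones.
RECORDS: Dict[str, str] = {
    "fatty.stormlight.home": "192.168.1.2",
    "lightweaver.stormlight.home": "192.168.1.10",
    "*.stormlight.home": "192.168.1.10",  # wildcard points at lightweaver
}


def lookup(name: str, records: Dict[str, str]) -> Optional[str]:
    """Return the address for name: exact match first, then a wildcard."""
    name = name.lower().rstrip(".")
    if name in records:
        return records[name]
    labels = name.split(".")
    # Try "*.<parent>" for each enclosing domain, nearest parent first.
    for i in range(1, len(labels)):
        wildcard = "*." + ".".join(labels[i:])
        if wildcard in records:
            return records[wildcard]
    return None
```

With these records, `lookup("mysvc.stormlight.home", RECORDS)` falls through to the wildcard and returns lightweaver's address, while `lookup("fatty.stormlight.home", RECORDS)` returns fatty's own record.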
Ingress Traffic to Stormlight
Running on lightweaver's port 443 is HAProxy enabling traffic into Stormlight from *.stormlight.home subdomains. HAProxy terminates SSL (user story #2 and #3) and forwards traffic, locally, to the nginx-ingress NodePort service running in kubernetes.
Tracing an HTTPS Request
Let's move our attention to a simplified HTTPS request for the fictional service mysvc. Below is a diagram of an HTTPS request into Stormlight. I simplified kubernetes to make the diagram easier to comprehend. In reality, I remove the taints on the master so that the kubernetes scheduler can place pods on all NUCs. So theoretically, traffic could stay entirely on lightweaver if there were mysvc pods running there.
A user requests https://mysvc.stormlight.home. This resolves to lightweaver's IP address because I have the wildcard DNS *.stormlight.home record on my Synology NAS.
The request routes to lightweaver's HAProxy.
HAProxy terminates the SSL request.
HAProxy forwards the HTTP request to the local kubernetes cluster's nginx ingress port.
nginx ingress forwards to the mysvc kubernetes service.
mysvc processes the request by forwarding to whatever deployment/replica/pods are running within the cluster.
mysvc sends the response back to the nginx ingress.
nginx ingress responds to HAProxy.
HAProxy responds to the user.
Hopefully, the user is happy. The user is me. I am happy.
Let's move our attention to Kubernetes. Aside from nginx-ingress, there are a few other services.
Here is the complete list of services on Stormlight.
Stormlight uses nginx-ingress to route all HTTP traffic into the cluster. This really helps me with user story #2. I can configure the subdomain/path of a service running in stormlight simply by creating an Ingress resource. I don't have to configure anything in fatty's DNS server or the HAProxy. I really dig this setup.
For example, here's a basic configuration for kuard (a handy debugging application) so that it uses https://kuard.stormlight.home as the domain.
# I can set this to whatever I want
- host: kuard.stormlight.home
- path: /
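To make the shape of such a rule concrete, here is a sketch that builds a comparable Ingress as a Python dict and prints it as JSON, which kubectl accepts as well as YAML. The apiVersion shown is the current networking.k8s.io/v1 form, and the resource name, the service name kuard, and port 80 are assumptions for illustration.

```python
import json
from typing import Any, Dict


def ingress_manifest(name: str, host: str, service: str, port: int,
                     path: str = "/") -> Dict[str, Any]:
    """Build an Ingress that routes host + path to a single Service."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {
            "rules": [{
                # The host can be set to any subdomain under the wildcard.
                "host": host,
                "http": {"paths": [{
                    "path": path,
                    "pathType": "Prefix",
                    "backend": {
                        "service": {"name": service, "port": {"number": port}},
                    },
                }]},
            }],
        },
    }


# Hypothetical values: service name and port are not taken from the post.
print(json.dumps(ingress_manifest("kuard", "kuard.stormlight.home", "kuard", 80), indent=2))
```

The output could be piped to `kubectl apply -f -`. Depending on how the ingress controller is installed, an ingress class may also need to be set.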
A part of user story #2 deals with storage. While I could configure services to use local storage, I wanted to use fatty as well. Data on fatty has better durability (four 2TB drives) and is configured for cloud backups (I did this a long time ago). I want to use local storage for performance, but anything important would be stored on fatty.
I dug into Kubernetes docs on storage options and ran around in circles. Should I use Volumes, Persistent Volumes, or Container Storage Interface plugins? After several days of reading, I landed on nfs-client from the external-storage github repo. To add to my initial confusion, external-storage states that the repository is deprecated and that I should use sig-storage-lib-external-provisioner instead. But the sig-storage-lib-external-provisioner page links back to external-storage for examples. Sigh. Luckily, nfs-client worked well and was easy to set up.
Here's how I configured nfs-client.
nfs-client creates a dynamic provisioner for fatty's NFS shares. With the provisioner, I expose two types of storage classes. The first one, called fatty-archives, archives the data when a PersistentVolumeClaim (PVC) is deleted. Therefore, I don't have to worry about losing data when I muck around with the cluster and accidentally delete PVCs.
The second storage class, called fatty, deletes data when a PVC is removed. Honestly, I don't have much use for the fatty storage class yet, but it enables scaling pods across nodes and using the NFS mount for shared data.
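The difference between the two classes comes down to one parameter. In the nfs-client provisioner, the archiveOnDelete parameter on a StorageClass controls whether a deleted claim's directory is kept as an archive or removed. The sketch below builds both classes as dicts; the provisioner name is a placeholder and would have to match whatever name the provisioner deployment was given.

```python
from typing import Any, Dict

# Placeholder: must match the provisioner name configured in the nfs-client deployment.
PROVISIONER = "example.com/nfs-client"


def storage_class(name: str, archive_on_delete: bool,
                  provisioner: str = PROVISIONER) -> Dict[str, Any]:
    """Build a StorageClass for the nfs-client provisioner."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        # StorageClass parameters are strings, so the flag is "true"/"false".
        "parameters": {"archiveOnDelete": "true" if archive_on_delete else "false"},
    }


fatty_archives = storage_class("fatty-archives", archive_on_delete=True)
fatty = storage_class("fatty", archive_on_delete=False)
```

Both classes point at the same provisioner, so they share the NFS server and export; only the cleanup behavior differs.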
Here's the configuration I use for my docker registry setup.
If you look carefully, you don't see fatty's host or mount shares in this configuration. The NFS configuration is managed in a single place -- the nfs-client manifests. From a service perspective, there's no dependency on NFS. The service depends on a PVC (i.e., a storage request) and the volume configuration for the Deployment. In the future, if I decide to replace fatty with a new NAS (I'm due for an upgrade), I only need to configure a new storage class, move data from the old NAS to the new NAS, and then change the PVCs for all my services. I don't have to find all the places where the NFS information is configured in service manifests. Lovely!
One of the biggest reasons I wanted durable network storage was to run a docker registry. Remember, I'm cheap. So paying for a registry was not something I wanted to do. I also didn't want to store the data on a single node. If I wanted to scale up the number of pods running the registry, I'd like to do so without worrying about where the data lives.
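As an illustration of that decoupling, a claim only names a storage class and a size. The sketch below builds such a PersistentVolumeClaim as a dict; the claim name, the 20Gi size, and the ReadWriteMany access mode are assumptions, not values from the registry setup.

```python
from typing import Any, Dict


def pvc_manifest(name: str, storage_class: str, size: str,
                 access_mode: str = "ReadWriteMany") -> Dict[str, Any]:
    """Build a PersistentVolumeClaim that refers to storage only by class name."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class,
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical claim for the registry; no NFS host or export path appears here.
registry_claim = pvc_manifest("registry-data", "fatty-archives", "20Gi")
```

Swapping the NAS would change the storage class definition, while a claim like this one keeps the same shape.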
So Stormlight runs Docker's Registry at registry.stormlight.home. The images are stored on fatty through the nfs-client provisioner.
Master Component Backups
I worry about failures. I work with cloud providers at my day job, so failures are common and expected. At home, my computers fail far less than cloud instances (case in point: fatty). But, they will fail. And that makes me nervous. So, I made some contingency plans.
Kubernetes depends on etcd as the backing database. In a single-master configuration, there's one etcd instance. Failure of etcd renders the master services unusable. Services running on the cluster continue to run, but if a service pod dies, the master is unable to provision a replacement. Therefore, backing up etcd is useful.
I found this nice post on backing up the master. I tweaked the script so that it also backs up kubernetes' self-signed certs. This backup runs every hour and stores data on fatty. The restoration process is scripted with ansible and very similar to what's described in the linked blog post.
This plan leaves me a little less nervous, but it doesn't consider a full lightweaver failure. With a full failure of lightweaver, the entire cluster is unusable. All *.stormlight.home subdomains will fail because the machine is offline.
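The core of such a backup is an etcdctl snapshot. The sketch below is not the tweaked script itself; it only shows one way to assemble the command. The certificate paths are the usual kubeadm locations as far as I know, and the destination directory is a hypothetical mount of a share on fatty.

```python
import os
import subprocess
from datetime import datetime, timezone
from typing import List


def snapshot_command(dest_dir: str, now: datetime,
                     pki_dir: str = "/etc/kubernetes/pki/etcd",
                     endpoint: str = "https://127.0.0.1:2379") -> List[str]:
    """Build the etcdctl (v3 API) command that saves a timestamped snapshot."""
    filename = f"etcd-{now.strftime('%Y%m%dT%H%M%SZ')}.db"
    return [
        "etcdctl",
        f"--endpoints={endpoint}",
        f"--cacert={pki_dir}/ca.crt",
        f"--cert={pki_dir}/server.crt",
        f"--key={pki_dir}/server.key",
        "snapshot", "save", f"{dest_dir}/{filename}",
    ]


def run_backup(dest_dir: str) -> None:
    """Take one snapshot now; meant to be called from an hourly cron job or timer."""
    env = dict(os.environ, ETCDCTL_API="3")
    subprocess.run(snapshot_command(dest_dir, datetime.now(timezone.utc)),
                   env=env, check=True)
```

Calling `run_backup("/mnt/fatty/etcd-backups")` (a made-up path) on the master would write one snapshot per run. Copying the contents of /etc/kubernetes/pki alongside it would cover the certificates.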
Even writing this makes me nervous, but I think I'll deal with the failure in the future. I've built Stormlight using code. There are no manual steps. So, if I lose the cluster, I can rebuild by relying on my code. I'll open source my code in a future post.
That said, I might buy a few Raspberry Pis to build a multi-master cluster later. That could be another fun side project.
The Results: My Deploy Process
With all the above in place, I can begin programming my own services (about damn time). To deploy, all I need is a single manifest file with a Deployment, a Service, and an Ingress. Here's what kuard's manifest file looks like:
I started building a homelab using Intel NUCs in December, but I never wrote down why I wanted to do this. Since my mind runs rampant with thoughts, I figure I should write this down before I lose it.
For the past two years, my work activities led me farther away from coding. It was a conscious decision. I became a tech lead for a year. After that, I formed a new team as the engineering manager. I'll likely stick to this role for a while because it's challenging to solve problems amongst humans. I finally understand the saying that all problems are human problems. I definitely don't know how to solve them, though. That said, I sorely miss being in a flow state and building projects.
So in the middle of last year, I thought about what I could work on.
Firstly, the project should keep my coding skills sharp. I immediately thought of go and how I wanted to learn it. Like really learn it. I built a few services with the language, but honestly, I'm still looking up packages and syntax rules.
Secondly, I want to learn kubernetes. I helped build the original kubernetes cluster at work (v1.3 yeesh!) but moved on to other projects that kept me at arm's length. I never learned the details well enough. I want to know kubernetes like I know the back of my hand.
Finally, right around the time I thought about these ideas, I fell in love with Intel NUCs. They're mini-computers (4" x 4"), and I love the look. I'm a big sucker for dope-ass-looking devices. They're also decently priced, and I had closet space to spare. So I purchased three of them back in December 2019.
My goals are:
set up a local kubernetes cluster so that I can launch go projects
I decided in December that I want to start coding again. I've been in engineering leadership roles for the past two and a half years, which has kept me away from coding. I miss being in a flow state working on low-level technical problems for hours on end. One of the areas I'm missing out on is mastering kubernetes. I use kubernetes at work, but I'm removed enough that it's difficult to grok the details.
Therefore, last winter, I bought three Intel NUCs (8i3BEK1) to be the basis of my kubernetes homelab. After physically building each machine, connecting them to my network, and manually installing Ubuntu 18.04 three times, I decided that I didn't want to do this again. I am, to my wife's annoyance, forgetful. I have difficulty remembering to buy milk when going to the grocery store with the sole intention of buying milk. Imagine what the three NUCs looked like after I installed Ubuntu three times. Not very similar.
Luckily, Ubuntu had an installation method called preseeding to install itself with pre-configured answers to the dialog prompts. Essentially, this allowed me to remaster the installation ISO so that I did not have to manually enter responses to dialog prompts. After following the instructions from the wiki, I created an ISO that installed Ubuntu Server from start to finish without any keyboard prompts. With the ISO, I installed Ubuntu identically on my three NUCs and went about my business installing Kubernetes.
This development took several weeks because I became a father at the same time. And apparently, newborns need to feed every few hours. Though, I admit that's a cover-up for the real reason it took so long. I didn't know how to do this. I'd never dealt with the debian installer (what Ubuntu uses for installation), manipulating initrd, or configuring VirtualBox images to mimic Intel NUCs for development. And then to top it all off, I still had to deal with differences in linux and mac tools.
Nevertheless, I codified my work into the stormlight-iso project on GitHub (stormlight is the name of my kubernetes cluster). Now I can forget the entire process without guilt. And if you'd like, you can forget how to do it too!
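A preseed file is a list of lines, each naming an installer question, its type, and the answer. As a rough illustration (not taken from the stormlight-iso project), the sketch below renders a few such lines; the question names are standard debian-installer keys, while the hostname and IP address values are placeholders.

```python
from typing import Dict, Tuple, Union

Answer = Tuple[str, Union[str, bool]]  # (value type, value)


def render_preseed(answers: Dict[str, Answer]) -> str:
    """Render answers as debian-installer preseed lines: owner, question, type, value."""
    lines = []
    for question, (value_type, value) in answers.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"d-i {question} {value_type} {value}")
    return "\n".join(lines) + "\n"


# Placeholder answers for one machine; a real file needs many more questions.
example = render_preseed({
    "netcfg/get_hostname": ("string", "lightweaver"),
    "netcfg/disable_autoconfig": ("boolean", True),
    "netcfg/get_ipaddress": ("string", "192.168.1.10"),
    "debian-installer/exit/poweroff": ("boolean", True),
})
print(example)
```

Generating the file per machine keeps the only differences between NUCs, such as hostname and IP address, in one small table rather than in hand-edited copies.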
With that, I'll leave you at the beginning of the README.
This project builds an Ubuntu 18.04 ISO to install Ubuntu unattended (no keyboard interaction) on Intel NUC 8 Core i3 machines.
This project assumes:
Installation of Ubuntu via USB stick
ISO built on a Mac OSX machine
Intel NUC has a static IP assigned to it to SSH to the machine (or some way for you to find the IP of your machine after Ubuntu has been installed and booted)
A USB stick with minimum 100MB of space
The project is designed to minimize the amount of physical effort to set up an Intel NUC because the author is lazy and forgetful. Also, the author has several Intel NUCs, and manually entering configuration values is error-prone. Here's what the installation process looks like.
Build the stormlight.iso with preseed config and an ssh public key
Create a bootable USB from the stormlight.iso
Walk over to the Intel NUC, plug in USB stick, and power on the machine
Wait until the machine powers itself down after the installation (roughly 10-15 mins). "Look ma, no keyboard!"
Unplug USB stick and power on the machine.
Walk back to your computer and SSH into the machine.
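For the last step, it can help to know when the freshly booted machine is accepting SSH connections. This is a small sketch and not part of the project: `wait_for_port` is a hypothetical helper that polls a TCP port until it opens or a timeout passes.

```python
import socket
import time


def wait_for_port(host: str, port: int = 22,
                  timeout_seconds: float = 300.0,
                  interval_seconds: float = 5.0) -> bool:
    """Poll host:port until a TCP connection succeeds; False if the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            # A successful connect means something (ideally sshd) is listening.
            with socket.create_connection((host, port), timeout=3):
                return True
        except OSError:
            if time.monotonic() + interval_seconds > deadline:
                return False
            time.sleep(interval_seconds)
```

Running `wait_for_port("192.168.1.10")` with the machine's static IP (the address here is a placeholder) returns True once port 22 answers, at which point the SSH login should work.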