Saturday, February 24, 2018

Building a Reclaimed Kubernetes Cluster

Thanks to family hand-me-downs, over the years I have become a repository of unwanted laptops. Some of them barely boot anymore, but three-quarters of them have two cores and 2 GB of RAM or more. One actually has four cores and 8 GB of RAM... making it a veritable workhorse. I could make them disposable workstations, but instead I wove them together and created a personal Kubernetes cluster.

The base OS for the nodes is Ubuntu's latest LTS release. Rather than using Conjure on MaaS to set up the cluster (which required an isolated network for bootp and DNS and meh), I leveraged Kubespray's flurry of Ansible scripts to prep an inventory of machines over SSH. This ended up being surprisingly low impact and worked perfectly for the use case of building a test lab with piecemeal hardware.

Laptops work just fine as server nodes with a few tweaks:
  • Even if you use the server distribution of Ubuntu, laptop events such as closing the lid will still result in a suspend/hibernate/resume action. Edit /etc/systemd/logind.conf and set HandleLidSwitch=ignore so the laptop keeps running when closed, then restart systemd-logind:
    sudo vi /etc/systemd/logind.conf
    sudo service systemd-logind restart
  • The display will remain on once you start ignoring LidSwitch events - run a script at startup to turn the display off and save energy.
  • Even if you are running in console mode, NVIDIA Optimus laptops will go nuts and seemingly run the discrete and on-chip GPUs nonstop, overheating the machine. Install Ubuntu's Bumblebee packages to prevent this:
    sudo apt-get install bumblebee bumblebee-nvidia primus linux-headers-generic
  • As with all Kubernetes nodes, disable swap by commenting out the swap partition in /etc/fstab. Since the laptop no longer needs to resume from hibernate mode, swap can be safely disabled.
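
The lid-switch and swap tweaks above can also be scripted. What follows is a minimal sketch rather than a tested provisioning script: the function names are made up for illustration, and it assumes the stock Ubuntu layout where logind.conf carries a (possibly commented-out) HandleLidSwitch line and swap entries in /etc/fstab have "swap" as a whitespace-separated field.

```shell
# Hypothetical helpers - the names are illustrative, not part of any tool.

# Set HandleLidSwitch=ignore in a logind.conf-style file.
# An existing (possibly commented-out) entry is rewritten in place;
# if there is none, the setting is appended to the end of the file.
set_lid_switch_ignore() {
  local file="$1" tmp
  tmp="$(mktemp)"
  if grep -Eq '^#?HandleLidSwitch=' "$file"; then
    sed -E 's/^#?HandleLidSwitch=.*/HandleLidSwitch=ignore/' "$file" > "$tmp"
  else
    cat "$file" > "$tmp"
    echo 'HandleLidSwitch=ignore' >> "$tmp"
  fi
  # Write back with cat so the original file keeps its owner and mode.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Comment out every active fstab-style line that has a "swap" field.
comment_out_swap() {
  local file="$1" tmp
  tmp="$(mktemp)"
  sed -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "$file" > "$tmp"
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

On a real node these would be run as root against /etc/systemd/logind.conf and /etc/fstab, followed by restarting systemd-logind and running swapoff -a so the change takes effect without a reboot. Trying them on copies of the files first is the safer route.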

Once you have the laptops prepped and the latest updates applied, you will need to make sure each node has a copy of python-netaddr installed:

sudo apt-get install python-netaddr

Ansible issues its commands over SSH, so ensure you have keyfile-based authentication set up from the machine you will be running Kubespray on to each of the nodes. If you don't already have an SSH key generated (for example, if you will run Kubespray on the master node), then you can generate a passwordless one via ssh-keygen. After that, copy the public key to each node with:

ssh-copy-id node1
ssh-copy-id node2
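
With more than a couple of nodes, the key copy is easier as a loop. A small sketch, assuming the node names (or IP addresses) are reachable from the Kubespray machine; copy_key_to_nodes and the DRY_RUN variable are hypothetical names used only for illustration:

```shell
# Copy the local public key to every node named on the command line.
# With DRY_RUN=1 the commands are printed instead of executed.
copy_key_to_nodes() {
  local node
  for node in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "ssh-copy-id $node"
    else
      ssh-copy-id "$node"
    fi
  done
}
```

Running it once with DRY_RUN=1 shows what would be executed before any passwords are asked for.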

After that, the machine you are running Kubespray on will need Ansible installed. I ran Kubespray on the master node to keep things simple - so on that Ubuntu box I issued:

sudo apt-add-repository ppa:ansible/ansible
sudo apt-get update
sudo apt-get install ansible
git clone
cp -rfp inventory/sample inventory/mycluster

This will:
  1. Install Ansible on the box
  2. Download the Ansible scripts from Kubespray
  3. Create a new Ansible inventory called "mycluster" that is a clone of the Kubespray sample
An important thing to remember is that you address nodes by straight IP address - not by hostname. This is especially important with Ansible scripts because the node's hostname may well change as part of the installation process. If your nodes are fetching their IP address via a DHCP server, ensure the DHCP server has static IP allocations for your nodes.

Once you have all the IP addresses for your nodes, set them in your inventory file. An easy way to do this at the command line is:

declare -a IPS=(
CONFIG_FILE=inventory/mycluster/hosts.ini python3 contrib/inventory_builder/ ${IPS[@]}

Verify the inventory is correct by cracking open inventory/mycluster/hosts.ini - if you want to change hostnames, now is the time.
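
Since nodes should be addressed by IP rather than hostname, a quick scan of the inventory can catch stragglers. This is only a sketch: find_non_ip_hosts is a made-up helper, and it assumes the inventory lists hosts on lines of the form "node1 ansible_host=10.0.0.11 ip=10.0.0.11". It checks the shape of the address (four dot-separated numbers), not whether each octet is in range.

```shell
# Print "<host> <address>" for every inventory entry whose ansible_host
# value does not look like a dotted-quad IPv4 address.
# Assumed line format: node1 ansible_host=10.0.0.11 ip=10.0.0.11
find_non_ip_hosts() {
  awk '
    $1 ~ /^#/ { next }   # skip commented-out entries
    {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^ansible_host=/) {
          addr = substr($i, length("ansible_host=") + 1)
          if (addr !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) print $1, addr
        }
      }
    }
  ' "$1"
}
```

Calling find_non_ip_hosts inventory/mycluster/hosts.ini should print nothing when every entry uses an IP address.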

I would recommend having Kubespray build a kubectl configuration file automagically for you. To have this generated as an artifact, change inventory/mycluster/group_vars/k8s-cluster.yml to have the following entry set:

kubeconfig_localhost: true

After these tweaks you should be ready to launch Kubespray's Ansible playbook. Note that Ubuntu's convention is to have you operate as a normal user and sudo all of your commands, so you will need to use Ansible's --become parameter:

ansible-playbook -i inventory/mycluster/hosts.ini cluster.yml --ask-become-pass --become

At this point Kubespray will try its best to get a cluster up and running on the nodes specified in your inventory file. At the very end Kubespray will provide you with a kubectl configuration file in artifacts/admin.conf, which you can then copy or merge into another workstation's ~/.kube/config file.

Once you have the Kubernetes configuration file set on your workstation, you can use it to fetch an auth token to get into the Kubernetes Dashboard. The proper way to do this is to generate a new system secret that has the appropriate permissions to interrogate the running cluster... but the lazy way is to just steal the token used by Kubernetes' namespace controller.

I'm lazy, so first I list all the secrets in the kube-system namespace:

kubectl -n kube-system get secrets

And then fetch the token for the namespace controller:

kubectl -n kube-system describe secret namespace-controller-token-???

So that I can use it to log in to the web dashboard:

kubectl proxy &
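
The random suffix on the namespace controller's secret means its name has to be looked up each time. A small sketch of doing that lookup in a script; first_secret_with_prefix is a hypothetical helper, and it assumes the secret name is the first column of the kubectl get secrets output:

```shell
# Read `kubectl get secrets` output on stdin and print the name of the
# first secret that starts with the given prefix.
first_secret_with_prefix() {
  awk -v prefix="$1" 'index($1, prefix) == 1 { print $1; exit }'
}
```

It would be used by piping kubectl -n kube-system get secrets into first_secret_with_prefix namespace-controller-token- and handing the resulting name to kubectl -n kube-system describe secret.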

Now you should have a working cluster you can mess with!

So that the laptops were properly ventilated, I placed each vertically into a metal document sorter from an office supply store. This gives me a nifty vertical rack for the laptops that has plenty of air circulation and allows me to route cables out of the way.

I've constructed one weird frankencluster - but it works!

Tuesday, January 23, 2018

Your Documents Under the Magnifying Glass

A few years ago I moved my household administrivia to a paperless system. Instead of stacking file folders deep with bills and statements, everything would be scanned & shredded. This greatly helped with storage space - but in a couple of years I ended up with a network drive filled with over 3,000 PDFs, images and documents. Bear in mind the majority of these are scanned documents - so the contents are images instead of machine-readable text. Everything was dumped into a single directory and files were named based on the timestamp of when they were scanned, so organizing the documents into folders and sub-folders would have taken hours.

Instead of burning hours sorting documents I started burning hours building a simple set of applications that would read document metadata, attempt to convert the images to text, group documents by common letterhead and then provide a simple search interface over all of it. Since optical character recognition is hit-and-miss, any full-text search should permit approximate indexing and searching to allow for fuzzy matches.

In the end I created two apps: DocMag and DocIndex. DocMag serves as the search front-end and allows users to perform full-text searches on scanned documents, label them with tags and automagically group other documents with the same letterhead or logo. The interface is pretty spartan and uses Spring Boot to build a straightforward integration into Elasticsearch. DocIndex is the batch process that crawls a filesystem and parses the documents using OCR, generates thumbnails, tags similar documents using computer vision-based template matching, and stores document metadata within Elasticsearch.

DocMag was created in Groovy using Spring Boot (Spring Web, Spring Data, etc). I did this mainly to understand how Spring Boot's conventions translated over to the Groovy world... it had been quite a while since I had worked with Grails. It turns out that Groovy, Spring Boot and Thymeleaf complement each other quite well and make for fairly simple web development.

DocIndex was created with Spring Boot and Java 9 initially. I griped in an earlier post about my problems with Java 9's dependency management, so instead I fell back to the lambda expressions and work queue management within Java 8. This permits multithreaded parsing of discovered files, which then allows for vertically scaling document indexing by adding cores. Horizontal scaling should be possible by replacing the in-memory work queue with a proper shared message broker. There is a "reminder" issue I've already filed to migrate to a proper broker so this can be done sometime in the future.

Both DocMag and DocIndex are published as container images on Docker Hub. This was especially necessary with DocIndex, as it relies heavily on native libraries for Tesseract OCR and OpenCV. OpenCV was the most contentious - each Linux distribution ships a different version of OpenCV, and the version changes quite rapidly. Building containers for distribution allowed me to ensure users get the version of the native libraries that works with their Java bindings.

Another nice feature of the containerized deployment model was composition - I was able to pair the correct revision of Elasticsearch, conditionally include Kibana, and provide a simple web application firewall by placing DocMag behind modsecurity and Apache. Network connections could be maintained between Elasticsearch, modsecurity, and DocMag without any of these interconnects leaking to the "outside" world, allowing me to do things such as exposing only modsecurity to outside traffic and permitting DocMag to receive requests only through modsecurity. Elasticsearch could be hidden as well, only available on the internal network managed by Docker Compose.

Deployment can be relatively straightforward; since everything is deployed to Docker Hub as a container, one should just need to download the docker-compose.yml file and issue export DOCUMENT_HOST_DIR=/mnt/documents && docker-compose up -d. This should provision a single-node Elasticsearch instance, start DocMag behind modsecurity, and begin indexing with DocIndex.
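
Before bringing the stack up, it can help to confirm the document directory is really mounted. A tiny sketch of such a guard; require_document_dir is a hypothetical helper and not part of DocMag or DocIndex:

```shell
# Succeed only if the given path exists and is a directory;
# otherwise print a message to stderr and return non-zero.
require_document_dir() {
  if [ ! -d "$1" ]; then
    echo "document directory not found: $1" >&2
    return 1
  fi
}
```

Chaining it in front of the commands above, as in require_document_dir /mnt/documents && export DOCUMENT_HOST_DIR=/mnt/documents && docker-compose up -d, stops the deployment early when the share is missing.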

If you are stuck digging through mountains of scanned documents, give DocMag a try. Ease of installation is one of its primary goals - so let me know if you find any issues getting it running!