The last time I set up a Spark cluster, I installed Spark manually and configured each node directly. For the small-scale cluster I had, that was fine. This time the cluster is still relatively small. However, I do want to take advantage of prebuilt containers, if possible. The standard for that is Docker. So in this post I will set up a Docker Swarm on the Personal Compute Cluster.
Installing Docker
Before we begin, if you set up SETI@Home at the end of the last post, we need to stop it first (skip this if you didn’t set up SETI@Home):
parallel-ssh -i -h ~/cluster/all.txt -l root "service boinc-client stop"
We need to ensure that swap is off, as various applications we will be working with later do not like it:
parallel-ssh -i -h ~/cluster/all.txt -l root "swapoff -a"
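Note that swapoff -a only disables swap until the next reboot. To keep it off permanently, you can also comment out any swap entries in each node’s /etc/fstab. A minimal sketch (it assumes the swap lines have the word swap as a whitespace-delimited field, and it keeps a .bak backup of the original file on each node):

```shell
# Comment out swap entries in /etc/fstab so swap stays off after a reboot
parallel-ssh -i -h ~/cluster/all.txt -l root "sed -i.bak '/\sswap\s/s/^/#/' /etc/fstab"
```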
Now we install all the needed software:
parallel-ssh -i -h ~/cluster/all.txt -l root "apt-get update -y"
parallel-ssh -i -h ~/cluster/all.txt -l root "apt-get upgrade -y"
parallel-ssh -i -h ~/cluster/all.txt -l root "apt-get install apt-transport-https software-properties-common ca-certificates -y"
parallel-ssh -i -h ~/cluster/all.txt -l root "wget https://download.docker.com/linux/ubuntu/gpg && apt-key add gpg"
parallel-ssh -i -h ~/cluster/all.txt -l root "echo 'deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable' >> /etc/apt/sources.list"
parallel-ssh -i -h ~/cluster/all.txt -l root "apt-get update -y"
parallel-ssh -i -h ~/cluster/all.txt -l root "apt-get install docker-ce -y"
The installation of the software should have created the user group docker. Now we need to add our user account to that group so we can run docker without sudo (change the user account name as needed):
parallel-ssh -i -h ~/cluster/all.txt -l root "usermod -aG docker michael"
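It is worth a quick check that the group change took on every node (again, adjust the user account name as needed):

```shell
# Each node should list "docker" among the user's groups
parallel-ssh -i -h ~/cluster/all.txt -l root "id -nG michael | grep -qw docker && echo docker group OK"
```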
Now we need to configure Docker to use our nodes’ /mnt/data partitions for all its storage. We also want to tell Docker that it is OK to use the local registry that we will be setting up later in this post, and force it to use the Google DNS. To do this, we need to create a configuration file for Docker. Start by creating or editing the /etc/docker/daemon.json file:
sudo vi /etc/docker/daemon.json
If this daemon.json file was new, make the contents look like this:
{
    "data-root" : "/mnt/data/docker/",
    "insecure-registries" : ["master:5000"],
    "dns" : ["8.8.8.8"],
    "metrics-addr" : "0.0.0.0:9323",
    "experimental" : true
}
If the file already existed, simply add the elements shown above to it. Now distribute this file to the slave nodes:
parallel-scp -h ~/cluster/slaves.txt -l root /etc/docker/daemon.json /etc/docker/daemon.json
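A malformed daemon.json will prevent the Docker daemon from starting at all, so before starting Docker it is worth validating the file’s syntax on every node. A quick sketch using Python’s built-in JSON parser (python3 ships with Ubuntu 18.04):

```shell
# json.tool exits non-zero and prints the parse error if the file is not valid JSON
parallel-ssh -i -h ~/cluster/all.txt -l root "python3 -m json.tool /etc/docker/daemon.json > /dev/null && echo daemon.json OK"
```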
Finally, start docker on all machines:
parallel-ssh -i -h ~/cluster/all.txt -l root "systemctl restart docker && systemctl enable docker"
You need to log out of the master node and log back in for the group membership to take effect. Once you have, try docker out:
docker run hello-world
You should get a pleasant message from Docker. Now let’s set up the Docker swarm. On the master node, if you did not allow all traffic into the cluster via the ufw firewall, we first need to open the firewall for docker:
sudo ufw allow 2376/tcp
sudo ufw allow 7946/udp
sudo ufw allow 7946/tcp
sudo ufw allow 80/tcp
sudo ufw allow 2377/tcp
sudo ufw allow 4789/udp
sudo ufw reload
sudo ufw enable
sudo systemctl restart docker
Now we init the swarm:
docker swarm init --advertise-addr 10.1.1.1
That should give you a message which has a command to run on the slave nodes for them to join the swarm. You can do that easily with pssh. Update the following with the join command and token you got from the prior step:
parallel-ssh -i -h ~/cluster/slaves.txt -l michael "docker swarm join --token SWMTKN-1-16bn159k6y9rtyitecha7q5cts8ru1qkojlykual4119fsppxm-bavnczg8eiutgmv0r34zzswgx 10.1.1.1:2377"
Check that it worked and the nodes are all in the swarm:
docker node ls
Now let’s create a simple service just to prove it all worked:
docker service create --name webserver -p 80:80 httpd
docker service ls
You can now go to a web browser on your development computer, and visit the URL of the public IP address for your cluster, which for me would be http://192.168.1.60. You should see a web page with the phrase “It Works!”. Once you are satisfied with your accomplishment, we can remove the web service:
docker service rm webserver
docker service ls
The web page should no longer work if you try to refresh it in your browser.
Monitoring Docker Swarm
If you want to set up a nice lightweight management UI for your docker swarm, install Portainer:
sudo mkdir -p /mnt/data/portainer/
docker service create \
    --name portainer \
    --publish 9999:9000 \
    --replicas=1 \
    --limit-memory=128M \
    --constraint 'node.role == manager' \
    --mount=type=bind,src=/mnt/data/portainer/,dst=/data \
    --mount=type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock \
    portainer/portainer
Now point a browser at http://master-node-ip:9999 (replacing master-node-ip with the public WAN address of your master node), create an account, and enjoy!
Creating a Local Docker Registry
We will be running a Docker registry within the swarm in order to manage the images we will be creating later. The registry is needed to easily distribute your custom docker images to all nodes when a service is created. Normally registries are set up with security. However, because our cluster is private and we do not have a signed certificate from a certificate authority, we will set up the registry as insecure with a self-signed certificate. On the master node do the following:
cd
mkdir certs
cd certs/
openssl req -newkey rsa:4096 -nodes -sha256 -keyout registry.key -x509 -days 365 -out registry.crt
The certificate generation will pose a few questions to you. Fill them in as you see fit, except for the question on “Common Name”. For that, you should enter the hostname of the master node, which is master. Once you do that, we need to distribute the certificate to the docker engine on all nodes:
parallel-ssh -i -h ~/cluster/all.txt -l root "mkdir -p /etc/docker/certs.d/10.1.1.1:5000"
parallel-scp -h ~/cluster/all.txt -l root registry.crt /etc/docker/certs.d/10.1.1.1:5000/ca.crt
parallel-ssh -i -h ~/cluster/all.txt -l root "mkdir -p /etc/docker/certs.d/master:5000"
parallel-scp -h ~/cluster/all.txt -l root registry.crt /etc/docker/certs.d/master:5000/ca.crt
parallel-ssh -i -h ~/cluster/all.txt -l root "service docker restart"
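If you want to sanity-check the certificate, say to confirm the Common Name and the expiration date, openssl can print them:

```shell
# The subject should show CN = master; the dates give the validity window
openssl x509 -in ~/certs/registry.crt -noout -subject -dates
```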
Now, start the registry in the swarm with the following, adjusting the path /home/michael/certs to wherever the certs folder sits in your home directory on the master node:
docker service create --name registry --publish=5000:5000 \
    --constraint=node.role==manager \
    --mount=type=bind,src=/home/michael/certs,dst=/certs \
    --limit-memory=500m \
    -e REGISTRY_HTTP_ADDR=0.0.0.0:5000 \
    -e REGISTRY_HTTP_TLS_CERTIFICATE=/certs/registry.crt \
    -e REGISTRY_HTTP_TLS_KEY=/certs/registry.key \
    registry:latest
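Once the registry service is up, you can query it through the Docker Registry HTTP API to confirm it is answering; the -k flag tells curl to accept our self-signed certificate:

```shell
# A fresh registry should return {"repositories":[]}
curl -sk https://master:5000/v2/_catalog
```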
In order to track what’s in the registry, we can load a simple registry visualizer:
docker service create --name registry-browser -p 5050:8080 \
    -e DOCKER_REGISTRY_URL=https://10.1.1.1:5000 \
    -e NO_SSL_VERIFICATION=true \
    --limit-memory=128m \
    klausmeyer/docker-registry-browser
Now point your browser at http://master-node-ip:5050 (change the IP address to your cluster’s WAN address), and you can see that your local docker registry is empty. But we will be changing that in the upcoming posts.
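If you would rather not wait, you can push a small test image now and watch it appear in the browser. The sketch below tags the stock hello-world image with the registry’s host and port, which is how docker push knows where to send it:

```shell
# Tag and push a small test image into the local registry
docker pull hello-world
docker tag hello-world master:5000/hello-world
docker push master:5000/hello-world
# The catalog should now list hello-world
curl -sk https://master:5000/v2/_catalog
```

You can remove the local tag afterwards with docker rmi master:5000/hello-world.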
Fun Things to do on your Docker Swarm
Run SETI@Home on Docker Swarm
We still have some work to do before we are ready to run Spark. As with the last post, that will be another day. For now, we can leverage our new Docker Swarm to run and manage SETI@Home processes. A Docker image for BOINC, the software that enables SETI@Home, has been prebuilt and is publicly available, and you can use that. There is even a version optimized for Intel CPUs, which the Personal Cluster uses. To launch the service into the Docker Swarm such that each node gets an instance to run, use this command:
docker network create -d overlay --attachable boinc
parallel-ssh -i -h ~/cluster/all.txt -l root "mkdir -p /mnt/data/boinc"
parallel-ssh -i -h ~/cluster/all.txt -l root "chgrp docker /mnt/data/boinc"
docker service create \
    --mode global \
    --name boinc \
    --network=boinc \
    --hostname="boinc-{{.Node.Hostname}}" \
    --mount type=bind,src=/mnt/data/boinc,dst=/var/lib/boinc \
    -p 31416:31416 \
    -e BOINC_GUI_RPC_PASSWORD="123" \
    -e BOINC_CMD_LINE_OPTIONS="--allow_remote_gui_rpc" \
    --limit-memory=1G \
    --limit-cpu=6 \
    boinc/client:intel
docker run \
    --rm \
    --network boinc \
    boinc/client \
    boinccmd_swarm \
    --passwd 123 \
    --project_attach http://setiathome.berkeley.edu \
    <*insert your account key here*>
Replace <*insert your account key here*> with your SETI@Home account key. One thing I will point out is the use of the --limit-cpu option in the service creation command. Without it, each instance of the service will use 100% of the CPUs. The option limits the effective number of CPUs each instance can utilize. Since each of our nodes has 12 virtual CPUs (6 cores, 2 threads per core), setting --limit-cpu to 6 has the effect of limiting the process to 50% of the CPU capacity. The reason I did this is that at 100%, the nodes’ cooling fans run at full blast. While these EGLOBAL S200 computers are pretty quiet, at 100% CPU utilization and 100% fan speed, things do get noticeably noisy. So, I limited CPU usage to 50% and the nodes run without the fans turning on.
You can read more about the BOINC client and operating it within Docker here.
Run a Nethack Server
If you can recollect the days of crawling a dungeon made up of ASCII characters, you might enjoy this one: you can easily deploy a Nethack server for anyone on your home network to play on. This really shows off the power of Docker in a fun way: it makes it easy for people to share whole software deployments. In this case, we can pull Matsuu Takuto’s image for a Nethack server and set it up to run on our swarm:
docker service create --replicas 1 --publish=23:23 --name nethack matsuu/nethack-server
Then to play the game, simply telnet into port 23 on the cluster’s master node using its external IP address:
telnet 192.168.1.60 23
Best wishes on your ascension!