Fixstars Solutions, a software acceleration and development company, wanted to run jobs that use both CPUs and GPUs. At first, the only option was to rent a Kubernetes cluster with GPU support from a cloud service provider. This was a problem for Fixstars: GPU instance rates were expensive, and some jobs ran for long periods. Rather than rely solely on cloud service providers, Fixstars chose to build hybrid clusters. A hybrid cluster lets Fixstars use any cloud service provider while also running jobs on its own on-premise machines, which gives more control over hardware and security and saves money by reducing reliance on third-party equipment.
Methods:
To implement hybrid clusters, Fixstars installed Kubernetes with WireGuard, using AWS as the cloud service provider. Kubernetes manages the cluster; it is a “portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation” (https://kubernetes.io/docs/concepts/overview/). WireGuard is a “fast and modern VPN that utilizes state-of-the-art cryptography” (https://www.wireguard.com/). A major benefit of these open-source tools is that they work with any cloud service provider.
The first step was to establish the networking layer with WireGuard. An EC2 instance serves as both the WireGuard server and the Kubernetes control plane, so it must accept UDP traffic on port 51820. WireGuard needs one configuration file on the server and a separate one on each client, and every client reaches the server through the EC2 instance. Fixstars verified that each client had a working connection to the server and that each configuration file produced the expected result.
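One way to confirm that every client has a live tunnel to the server is to look at the most recent handshake per peer. The sketch below assumes the interface is named wg0 and reads the output of `wg show wg0 latest-handshakes`, which prints one public key and one Unix timestamp per line (0 if no handshake has happened yet). The function name `stale_peers` and the 180-second threshold are illustrative choices, not part of the Fixstars setup.

```shell
# stale_peers NOW MAX_AGE
# Reads "<public-key> <unix-timestamp>" lines on stdin and prints the keys of
# peers that never completed a handshake or whose last handshake is too old.
stale_peers() {
  awk -v now="$1" -v max_age="$2" '
    NF >= 2 && ($2 == 0 || now - $2 > max_age) { print $1 }
  '
}

# Example (needs root and a configured wg0 interface):
#   sudo wg show wg0 latest-handshakes | stale_peers "$(date +%s)" 180
```

An empty result means every peer has completed a handshake within the chosen window.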
After completing the WireGuard setup, the next step was to install Kubernetes. Each host required its own configuration, as well as CRI-O and Cilium installations. “CRI-O is an optimized container engine specifically designed for Kubernetes” (https://earthly.dev/blog/deploy-kubernetes-cri-o-container-runtime/#:~:text=CRI%2DO%20pronounced%20as%20(cry,containers%20in%20a%20Kubernetes%20cluster). It offers a secure, stable, and reliable container runtime for Kubernetes. Cilium is an “open source, cloud native solution for providing, securing, and observing network connectivity between workloads” (https://cilium.io/#:~:text=Cilium%20is%20an%20open%20source,the%20revolutionary%20Kernel%20technology%20eBPF). After installation, Fixstars connected and tested each worker node, one at a time.
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter
# sysctl params required by setup, params persist across reboots
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
# Apply sysctl params without reboot
sudo sysctl --system
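Pre-configuration for Kubernetes. After installing WireGuard, Fixstars applied these settings to prepare each host: they load the overlay and br_netfilter kernel modules and enable IP forwarding and iptables processing of bridged traffic.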
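A quick check that the three sysctl settings above actually took effect can be sketched as follows. The helper name `missing_k8s_sysctls` is hypothetical; it relies only on the fact that both /etc/sysctl.d/k8s.conf and the output of `sysctl -a` use the "key = value" format.

```shell
# missing_k8s_sysctls
# Reads "key = value" lines on stdin and prints, in sorted order, each
# required key whose last assigned value is not 1 (or that is absent).
missing_k8s_sysctls() {
  awk -F'[ \t]*=[ \t]*' '
    BEGIN {
      need["net.bridge.bridge-nf-call-iptables"] = 1
      need["net.bridge.bridge-nf-call-ip6tables"] = 1
      need["net.ipv4.ip_forward"] = 1
    }
    {
      key = $1
      sub(/^[ \t]+/, "", key)            # tolerate leading whitespace
      if (key in need) val[key] = $2 + 0 # later lines override earlier ones
    }
    END { for (k in need) if (val[k] != 1) print k }
  ' | LC_ALL=C sort
}

# Example: check the live kernel values after `sudo sysctl --system`
#   sysctl -a 2>/dev/null | missing_k8s_sysctls
```

No output means all three parameters are set to 1. If the br_netfilter module is not loaded, the two net.bridge keys do not exist and are reported as missing.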
sudo apt-get install -y curl jq tar
curl https://raw.githubusercontent.com/cri-o/cri-o/main/scripts/get | sudo bash
sudo systemctl enable crio
sudo systemctl start crio
Installation of CRI-O. Fixstars ran these commands on each host.
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/master/stable.txt)
CLI_ARCH=amd64
if [ "$(uname -m)" = "aarch64" ]; then CLI_ARCH=arm64; fi
curl -L --fail --remote-name-all https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
sha256sum --check cilium-linux-${CLI_ARCH}.tar.gz.sha256sum
sudo tar xzvfC cilium-linux-${CLI_ARCH}.tar.gz /usr/local/bin
rm cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
cilium install
cilium status
    /¯¯\
 /¯¯\__/¯¯\    cilium:             OK
 \__/¯¯\__/    Operator:           OK
 /¯¯\__/¯¯\    Envoy DaemonSet:    disabled (using embedded mode)
 \__/¯¯\__/    Hubble Relay:       disabled
    \__/       ClusterMesh:        disabled

DaemonSet              cilium             Desired: 1, Ready: 1/1, Available: 1/1
Deployment             cilium-operator    Desired: 1, Ready: 1/1, Available: 1/1
Containers:            cilium             Running: 1
                       cilium-operator    Running: 1
Cluster Pods:          2/2 managed by Cilium
Helm chart version:    1.13.4
Image versions         cilium             quay.io/cilium/cilium:v1.13.4@sha256:bde8800d61aaad8b8451b10e247ac7bdeb7af187bb698f83d40ad75a38...
                       cilium-operator    quay.io/cilium/operator-generic:v1.13.4@sha256:09ab77d324ef4d31f7d341f97ec5a2a4860910076046d57a...
kubectl get pods -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
cilium-operator-d5157588-nrbhg             1/1     Running   0          69s
cilium-w7nx2                               1/1     Running   0          69s
coredns-5d78c9869d-S8qup                   1/1     Running   0          29s
coredns-5d78c9869d-gs54p                   1/1     Running   0          14s
etcd-ip-172-31-39-216                      1/1     Running   0          2m54s
kube-apiserver-ip-172-31-39-216            1/1     Running   0          2m54s
kube-controller-manager-ip-172-31-39-216   1/1     Running   0          2m54s
kube-proxy-9gqf6                           1/1     Running   0          2m40s
kube-scheduler-ip-172-31-39-216            1/1     Running   0          2m54s
Installation of the Cilium CLI and of Cilium itself, followed by output confirming a successful launch. Cilium is designed to provide networking for Kubernetes.
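The pod listing above can also be checked mechanically instead of by eye. This is a minimal sketch, assuming kubectl is already pointed at the cluster; the function name `not_ready_pods` is hypothetical. It treats any pod whose STATUS is not Running, or whose READY counts differ, as not yet up, so pods from finished jobs would also be listed.

```shell
# not_ready_pods
# Reads `kubectl get pods --no-headers` lines (NAME READY STATUS ...) on stdin
# and prints "name status ready" for every pod that is not fully up.
not_ready_pods() {
  awk '
    NF >= 3 {
      split($2, ready, "/")              # READY looks like "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
    }
  '
}

# Example:
#   kubectl get pods -n kube-system --no-headers | not_ready_pods
```

For the listing shown above, the function prints nothing, since every pod is Running with 1/1 containers ready.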
Process:
Fixstars first tried to build the network layer by hand, creating routing tables on the different hosts and forwarding ports manually so that the different addresses could communicate with each other. This proved difficult to implement. At that stage, WireGuard was used only to bridge the on-premise machine to the cloud. Because WireGuard is versatile, Fixstars switched to using it to manage the entire network. This simplified the network layer: WireGuard takes care of routing between hosts and places every node on a single network, which removes the need for manual port forwarding.
From there, Fixstars carefully entered the correct configuration for each node by hand. After checking and testing, the network layer ran smoothly.
[Interface]
Address = 20.0.0.1/24
SaveConfig = true
ListenPort = 51820
PrivateKey = <Private Key of Server>
[Peer]
PublicKey = <Public Key of Client>
# Allowed IPs for client to server, must be /32
AllowedIPs = 20.0.0.2/32
[Peer]
PublicKey = <Public Key of Client>
# Allowed IPs for client to server, must be /32
AllowedIPs = 20.0.0.3/32
Configuration file for the WireGuard server. Fixstars used this configuration to set up WireGuard on the EC2 instance, with one [Peer] section per client.
[Interface]
# wg0 IP address on the client, needs to be unique per client
Address = 20.0.0.2/24
PrivateKey = <Private Key of Client>
[Peer]
PublicKey = <Public Key of Server>
# Public IP address on the server
Endpoint = <EC2 Server Public IP>:51820
# IP addresses to set the route for wg0 on the client
AllowedIPs = 20.0.0.0/24
PersistentKeepalive = 21
Configuration file for each client. Each client has its own unique address and private key.
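Because every new node needs a matching pair of entries (a [Peer] section on the server and a full configuration on the client), entering them by hand is error-prone. The sketch below generates both from the same inputs. The function names `wg_server_peer` and `wg_client_conf` are hypothetical; the 20.0.0.0/24 network, port 51820, and keepalive value are taken from the configurations above, and the keys are assumed to have been created beforehand with `wg genkey` and `wg pubkey`.

```shell
# wg_server_peer CLIENT_PUBLIC_KEY CLIENT_IP
# Prints the [Peer] section to add to the server configuration.
wg_server_peer() {
  printf '[Peer]\nPublicKey = %s\nAllowedIPs = %s/32\n' "$1" "$2"
}

# wg_client_conf CLIENT_IP CLIENT_PRIVATE_KEY SERVER_PUBLIC_KEY SERVER_HOST
# Prints a complete client configuration for the 20.0.0.0/24 tunnel network.
wg_client_conf() {
  cat <<EOF
[Interface]
Address = $1/24
PrivateKey = $2

[Peer]
PublicKey = $3
Endpoint = $4:51820
AllowedIPs = 20.0.0.0/24
PersistentKeepalive = 21
EOF
}

# Example (placeholders in angle brackets must be filled in):
#   wg_server_peer "<Public Key of Client>" 20.0.0.4
#   wg_client_conf 20.0.0.4 "<Private Key of Client>" \
#     "<Public Key of Server>" "<EC2 Server Public IP>"
```

Since the server configuration sets SaveConfig = true, wg-quick rewrites the file from the running interface state when the interface is brought down, so the new [Peer] section should be added while wg0 is down, or the peer should be added to the running interface instead.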
Results:
Fixstars successfully created an on-premise node that connects to the cloud control plane. The hybrid cluster gave Fixstars full control over cost, flexibility, security, data governance, and the hardware itself. WireGuard proved to be an effective way to set up the network layer for Kubernetes, as its versatility and simplified communication helped the hybrid cluster run efficiently. CRI-O and Cilium also proved useful: CRI-O provides the container runtime, and Cilium provides and secures network connectivity between workloads. Fixstars entered the configuration for each node and verified that each one was configured correctly.
Conclusion:
Fixstars successfully created hybrid clusters, which allow it to run jobs on its own on-premise machines with high security, give it more control over its hardware, and improve overall cost efficiency, all while maintaining computational speed and performance. In addition, Fixstars no longer relies solely on cloud service providers for GPU-supported instances. The hybrid clusters will support Fixstars’ short- and long-term success in decision-making, efficiency, and finances. Fixstars gained valuable experience through this process and hopes that other companies can adopt its hybrid cluster method as well.
References:
- Kubernetes. Overview. 2023. https://kubernetes.io/docs/concepts/overview/
- WireGuard. Overview. 2022. https://www.wireguard.com/
- Bassey, Mercy. How to Deploy a Kubernetes Cluster Using the CRI-O Container Runtime. 2023. https://earthly.dev/blog/deploy-kubernetes-cri-o-container-runtime/#:~:text=CRI%2DO%20pronounced%20as%20(cry,containers%20in%20a%20Kubernetes%20cluster)
- Cilium. Overview. https://cilium.io/#:~:text=Cilium%20is%20an%20open%20source,the%20revolutionary%20Kernel%20technology%20eBPF