Node Management

Rebooting a node

Key Considerations

Check if the node has any rook-ceph-osd-* pods. If it does, verify the health of the corresponding Ceph cluster first, and bring down only one such node at a time.
Check for haproxy-ingress-* pods. If the node will be down for an extended period, disable its record in Constellix DNS.
Check if the node has the nautilus.io/linstor-server label. A node with this label serves as a Linstor server. Some Linstor servers are redundant, while others are critical.
Check if the node has the nautilus.io/bgp-speaker label. Two nodes act as BGP speakers for MetalLB IPs, so ensure that at least one of them remains active.
Check if the node has the node-role.kubernetes.io/master label. Rebooting this node will make the cluster inaccessible, unless the node is an Admiralty virtual node.
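
The checks above can be gathered in one pass before touching the node. The following is a minimal sketch, not an official tool: the function name preflight_node is hypothetical, and it assumes kubectl is already configured for the cluster and that the pod-name prefixes and label keys are exactly the ones listed above. It only reports what it finds; deciding whether the reboot is safe is still up to the operator.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Nautilus tooling): print the
# reboot-relevant facts about one node, following the checklist above.
preflight_node() {
  local node="$1"
  local pods labels prefix label

  # Names of all pods currently scheduled on the node, across namespaces.
  pods=$(kubectl get pods --all-namespaces \
    --field-selector "spec.nodeName=${node}" \
    --no-headers -o custom-columns=NAME:.metadata.name)

  # The node's labels, as printed by kubectl.
  labels=$(kubectl get node "${node}" --show-labels --no-headers)

  # Report pods whose names start with one of the sensitive prefixes.
  for prefix in rook-ceph-osd- haproxy-ingress-; do
    if printf '%s\n' "${pods}" | grep -q "^${prefix}"; then
      echo "POD   ${prefix}* is running on ${node}"
    fi
  done

  # Report the labels that call for extra care.
  for label in nautilus.io/linstor-server nautilus.io/bgp-speaker \
               node-role.kubernetes.io/master; do
    if printf '%s\n' "${labels}" | grep -qF "${label}"; then
      echo "LABEL ${label} is set on ${node}"
    fi
  done
}

# Usage: preflight_node <nodename>
```

No output means none of the listed conditions were detected for that node.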

Prerequisites

Install Ansible on your local computer.
Clone the repository of Ansible playbooks:

git clone https://gitlab.nrp-nautilus.io/prp/nautilus-ansible.git

Pull the latest updates from the playbook repository:

cd nautilus-ansible
git pull
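
The clone and pull steps can be combined so that the same command works on the first run and on later runs. This is a sketch only: sync_playbooks is a hypothetical name, the checkout directory is assumed to be nautilus-ansible in the current directory, and the --ff-only option is an added precaution rather than something the original instructions require.

```shell
#!/usr/bin/env bash
# Hypothetical helper: clone the playbooks on first use, update them afterwards.
sync_playbooks() {
  local repo_url="https://gitlab.nrp-nautilus.io/prp/nautilus-ansible.git"
  local dir="nautilus-ansible"

  if [ -d "${dir}/.git" ]; then
    # Existing checkout: fast-forward only, so local edits are never merged silently.
    git -C "${dir}" pull --ff-only
  else
    # No checkout yet: clone into the assumed directory name.
    git clone "${repo_url}" "${dir}"
  fi
}

# Usage: sync_playbooks
```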

Reboot a Node Due to GPU Failure

Use the following command to reboot the node:

ansible-playbook reboot.yaml -i nautilus-ansible/nautilus-hosts.yaml -l <nodename>
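
Because the -l limit accepts patterns, it can be worth confirming which hosts it selects before anything is rebooted. The sketch below reuses the command above and adds the standard --list-hosts option of ansible-playbook, which prints the matching hosts without executing any task. The function name preview_reboot_targets is hypothetical, and the playbook and inventory paths are copied unchanged from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show which inventory hosts the limit pattern selects,
# without running any task of the playbook.
preview_reboot_targets() {
  local node="$1"
  ansible-playbook reboot.yaml \
    -i nautilus-ansible/nautilus-hosts.yaml \
    -l "${node}" --list-hosts
}

# Usage: preview_reboot_targets <nodename>
```

If the listed hosts are exactly the intended node, rerun the original command without --list-hosts.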

Special Instructions to Reboot Ceph Nodes

To maintain redundancy in the Ceph cluster, only one node can be rebooted at a time.

Run this command to enter the rook-ceph-tools pod shell. Replace <namespace> with the appropriate Ceph cluster namespace (e.g., rook, rook-east, rook-pacific, rook-haosu, rook-suncave):

kubectl exec -it -n <namespace> $(kubectl get pods -n <namespace> --selector=app=rook-ceph-tools --output=jsonpath='{.items..metadata.name}') -- bash

Once inside the pod shell, run:

watch ceph health detail

Wait until [WRN] OSD_DOWN: 1 osds down disappears from the ceph health detail output before rebooting the next node.
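
The wait can also be scripted from outside the pod. The following is a rough sketch under stated assumptions: wait_for_osds_up is a hypothetical name, the tools pod is located with the same app=rook-ceph-tools selector used above (taking the first match), and the only criterion checked is the one described here, namely that OSD_DOWN no longer appears in the ceph health detail output. It does not check for other warnings, so a final manual look at the health output is still advisable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll `ceph health detail` through the rook-ceph-tools
# pod until the output no longer mentions OSD_DOWN.
wait_for_osds_up() {
  local ns="$1"
  local interval="${2:-30}"   # seconds between checks (assumed default)
  local pod health

  # Same selector as the interactive command above; the first match is used.
  pod=$(kubectl get pods -n "${ns}" --selector=app=rook-ceph-tools \
    --output=jsonpath='{.items[0].metadata.name}') || return 1

  while true; do
    # The exit status is ignored on purpose; empty output is treated as a failed query.
    health=$(kubectl exec -n "${ns}" "${pod}" -- ceph health detail 2>/dev/null) || true
    case "${health}" in
      "")         echo "Could not query Ceph health; retrying in ${interval}s" >&2 ;;
      *OSD_DOWN*) echo "OSD_DOWN still reported; rechecking in ${interval}s" ;;
      *)          echo "No OSD_DOWN warning reported in ${ns}"; return 0 ;;
    esac
    sleep "${interval}"
  done
}

# Usage: wait_for_osds_up <namespace> [interval-seconds]
```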

Recycling a node

If possible, run kubeadm reset on the node.
Delete the node from the Kubernetes cluster.
Delete the node from NetBox.
Close all GitLab issues related to the node.
Check whether any volumeattachments remain for the node in Kubernetes.
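
The Kubernetes-side steps above might look like the sketch below. It is illustrative only: leftover_volumeattachments is a hypothetical helper name, and it relies on the spec.nodeName field of the VolumeAttachment resource to match attachments to a node. The NetBox and GitLab steps are manual and are not covered.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list VolumeAttachment objects that still reference a node.
# Columns printed: attachment name, node name, persistent volume name.
leftover_volumeattachments() {
  local node="$1"
  kubectl get volumeattachments --no-headers \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName |
    awk -v n="${node}" '$2 == n'
}

# Possible order of the steps (kubeadm reset is run on the node itself):
#   kubeadm reset
#   kubectl delete node <nodename>
#   leftover_volumeattachments <nodename>   # no output means nothing is left
```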

This work was supported in part by National Science Foundation (NSF) awards CNS-1730158, ACI-1540112, ACI-1541349, OAC-1826967, OAC-2112167, CNS-2100237, CNS-2120019.