Node Management
Rebooting a node
Key Considerations
- Check if the node has any rook-ceph-osd-* pods. Verify the health of the corresponding Ceph cluster and bring down only one node at a time.
- Check for haproxy-ingress-* pods. If the node will be down for an extended period, disable its record in Constellix DNS.
- Check if the node has the nautilus.io/linstor-server label. Such a node serves as a Linstor server. Some Linstor servers are redundant, while others are critical.
- Check if the node has the nautilus.io/bgp-speaker label. Two nodes are used for MetalLB IPs; ensure one remains active.
- Check if the node has the node-role.kubernetes.io/master label. Rebooting this node will make the cluster inaccessible, unless it is an Admiralty virtual node.
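The checks above can be gathered into one pass over a node's pods and labels. The following is a minimal sketch, not part of the nautilus-ansible repository: precheck_node is a hypothetical helper name, the warning texts are illustrative, and the live-cluster part assumes kubectl is already configured for the cluster.

```shell
#!/usr/bin/env bash
# Sketch of a pre-reboot check. precheck_node is a hypothetical helper,
# not part of the nautilus-ansible repository.
#   $1: names of the pods scheduled on the node, one per line
#   $2: the node's labels as text (any format that contains the label keys)
precheck_node() {
  local pods="$1" labels="$2"
  case "$pods" in
    *rook-ceph-osd-*) echo "CEPH: verify Ceph health; reboot one node at a time" ;;
  esac
  case "$pods" in
    *haproxy-ingress-*) echo "INGRESS: disable the Constellix DNS record for long outages" ;;
  esac
  case "$labels" in
    *nautilus.io/linstor-server*) echo "LINSTOR: check whether this server is redundant or critical" ;;
  esac
  case "$labels" in
    *nautilus.io/bgp-speaker*) echo "BGP: keep the other MetalLB speaker active" ;;
  esac
  case "$labels" in
    *node-role.kubernetes.io/master*) echo "MASTER: rebooting may make the cluster inaccessible" ;;
  esac
  return 0
}

# Query a live cluster only when a node name is given as the first argument.
if [ -n "${1:-}" ]; then
  node="$1"
  pods=$(kubectl get pods --all-namespaces --no-headers \
    --field-selector "spec.nodeName=$node" \
    -o custom-columns=NAME:.metadata.name)
  labels=$(kubectl get node "$node" --no-headers --show-labels)
  precheck_node "$pods" "$labels"
fi
```

Saved under a name of your choice and run with a node name as its only argument, the script prints one line per consideration that applies. No output means none of the five checks matched; it does not replace reviewing the node manually.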
Prerequisites
- Install Ansible on your local computer.
- Clone the repository of Ansible playbooks:
git clone https://gitlab.nrp-nautilus.io/prp/nautilus-ansible.git
- Pull the latest updates from the playbook repository:
cd nautilus-ansible && git pull
Reboot a Node Due to GPU Failure
Use the following command to reboot the node:
ansible-playbook reboot.yaml -i nautilus-ansible/nautilus-hosts.yaml -l <nodename>
Special Instructions to Reboot Ceph Nodes
To maintain redundancy in the Ceph cluster, only one node can be rebooted at a time.
Run this command to enter the rook-ceph-tools pod shell. Replace <namespace> with the appropriate Ceph cluster namespace (e.g., rook, rook-east, rook-pacific, rook-haosu, rook-suncave):
kubectl exec -it -n <namespace> $(kubectl get pods -n <namespace> --selector=app=rook-ceph-tools --output=jsonpath='{.items..metadata.name}') -- bash
Once inside the pod shell, run:
watch ceph health detail
Wait until [WRN] OSD_DOWN: 1 osds down disappears from the ceph health detail output before rebooting the next node.
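The wait can also be scripted instead of watched by eye. The sketch below polls ceph health detail until the output no longer mentions OSD_DOWN. It is an illustration only: wait_for_osds_up is a hypothetical helper name, the 10-second default interval is an assumption, and the function is meant to be pasted into the rook-ceph-tools pod shell.

```shell
#!/usr/bin/env bash
# Sketch: block until "ceph health detail" no longer reports OSD_DOWN.
# wait_for_osds_up is a hypothetical helper; run it inside the
# rook-ceph-tools pod, where the ceph command is available.
#   $1: seconds to sleep between polls (default 10, an assumed value)
wait_for_osds_up() {
  local interval="${1:-10}" out
  while :; do
    # Stop with an error if the health query itself fails, so a broken
    # query is never mistaken for a healthy cluster.
    out=$(ceph health detail) || { echo "ceph health detail failed" >&2; return 1; }
    case "$out" in
      *OSD_DOWN*)
        echo "OSD_DOWN still reported; retrying in ${interval}s"
        sleep "$interval"
        ;;
      *)
        echo "OSD_DOWN no longer reported"
        return 0
        ;;
    esac
  done
}
```

Calling wait_for_osds_up 30 after a node comes back returns once the warning is gone. The function only looks for the OSD_DOWN string, so other warnings in the health output still need to be read separately.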
Recycling a node
- If possible, run kubeadm reset on the node.
- Delete the node from the Kubernetes cluster.
- Delete the node from NetBox.
- Close all GitLab issues related to the node.
- Check if there are any volumeattachments left for the node in Kubernetes.
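For the volumeattachments check, one possible approach is to list every VolumeAttachment together with the node it references and keep only the rows for the recycled node. This is a sketch under assumptions: attachments_for_node is a hypothetical helper name, and the live-cluster part assumes kubectl is configured for the cluster.

```shell
#!/usr/bin/env bash
# Sketch: list VolumeAttachments that still reference a node.
# attachments_for_node is a hypothetical helper. It reads "NAME NODE" rows
# on stdin and prints the NAME of every row whose NODE equals $1.
attachments_for_node() {
  awk -v node="$1" '$2 == node { print $1 }'
}

# Query a live cluster only when a node name is given as the first argument.
if [ -n "${1:-}" ]; then
  kubectl get volumeattachments --no-headers \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName |
    attachments_for_node "$1"
fi
```

Run with the node name as its only argument, the script prints the names of any attachments still bound to that node. Empty output means none are left.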

This work was supported in part by National Science Foundation (NSF) awards CNS-1730158, ACI-1540112, ACI-1541349, OAC-1826967, OAC-2112167, CNS-2100237, CNS-2120019.