
Troubleshooting Ceph After Proxmox Node Restarts

The incident was triggered by a "cascading dependency failure." After a reboot, the cluster failed to automatically re-establish iSCSI sessions to the backend SAN. This caused the Multipath and LVM layers to fail, which in turn prevented the Ceph OSD (Object Storage Daemon) from mounting its BlueStore backend. The situation was further complicated by missing OSD metadata and hung lxc-info processes that paralyzed the Proxmox API.



Date: 2025-11-09
Environment: Proxmox VE Cluster (node1, node2, node3)


Main Problem

Symptoms

  • Unable to log in to the GUI on node1 and node2 (only node3 worked)
  • "login failed" error for all user accounts
  • Proxmox API not responding on node1 and node2
  • Problem returned after machine restarts
  • 27 VMs/containers using Ceph RBD storage were inaccessible

Error Logs

proxy detected vanished client connection
command 'lxc-info -n 10025 -p' failed: got signal 9

Diagnosis and Resolution - Phase 1: GUI Issues

Problem 1.1: GUI/API Not Responding

Diagnosis:

# Check cluster status
pvecm status  # All 3 nodes in quorum

# Check services
systemctl status pveproxy  # running
systemctl status pvedaemon  # running

# Test API
curl -k https://node1:8006/api2/json/access/ticket  # timeout

Root cause: lxc-info processes were blocking pvedaemon while checking container status.

Solution:

# Kill stuck processes
pkill -9 lxc-info

# Restart pvedaemon
systemctl restart pvedaemon

# Verification
curl -k https://node1:8006/api2/json/access/ticket  # OK

Status: ✅ Temporarily fixed (problem returned)


Problem 1.2: Problem Returns After Restart/VM Start

Diagnosis:

# Check stuck processes
ps aux | grep lxc-info
# Multiple lxc-info processes in hung state

# Identify containers
# CT 10025 on node1 - es-coord-1
# CT 10026 on node2 - es-coord-2

Root cause: Containers using Ceph RBD were blocking during status checks via lxc-info.
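
To see which containers the hung status checks belong to, the `ps` output can be reduced to container IDs. The sketch below is an illustration, not part of the original procedure: `hung_lxc_ids` is a hypothetical helper name and the 30-second threshold is an assumption. Processes shown in state `D` are in uninterruptible sleep (typically waiting on I/O), and a SIGKILL sent to them only takes effect once that wait ends.

```shell
# hung_lxc_ids MIN_AGE_SECONDS
# Reads `ps -eo pid,stat,etimes,args` output on stdin and prints
# "<ctid> <pid> <state>" for every lxc-info process older than MIN_AGE_SECONDS.
hung_lxc_ids() {
    awk -v min_age="$1" '
        $4 ~ /(^|\/)lxc-info$/ && $3 + 0 > min_age + 0 {
            ctid = "?"
            # the container ID is the argument that follows -n
            for (i = 5; i < NF; i++) if ($i == "-n") ctid = $(i + 1)
            print ctid, $1, $2
        }'
}
```

Possible usage on a node: `ps -eo pid,stat,etimes,args | hung_lxc_ids 30`.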


Diagnosis and Resolution - Phase 2: Automatic Workaround

Problem 2.1: Need for Continuous Manual Process Cleanup

Solution - Automatic cleanup:

Script: /usr/local/bin/lxc-cleanup.sh

#!/bin/bash
# Kill lxc-info processes older than 30 seconds
for pid in $(ps -eo pid=,etimes=,comm= | awk '$3 == "lxc-info" && $2 > 30 {print $1}'); do
    kill -9 "$pid" 2>/dev/null
done

Cron job: /etc/cron.d/lxc-cleanup

* * * * * root /usr/local/bin/lxc-cleanup.sh

Deployment:

# Deploy on all nodes
for node in node1 node2 node3; do
    ssh $node "cat > /usr/local/bin/lxc-cleanup.sh << 'EOF'
[script as above]
EOF
chmod +x /usr/local/bin/lxc-cleanup.sh"

    ssh $node "cat > /etc/cron.d/lxc-cleanup << 'EOF'
* * * * * root /usr/local/bin/lxc-cleanup.sh
EOF"
done

Status: ✅ Workaround working, but not solving root cause


Diagnosis and Resolution - Phase 3: Root Cause - Ceph

Problem 3.1: Identifying Source of Blocks

Diagnosis:

# Check container configurations
grep -l 'ceph-rbd' /etc/pve/nodes/*/lxc/*.conf /etc/pve/nodes/*/qemu-server/*.conf

# List containers/VMs using Ceph
# Found 27 VMs/containers with rootfs/disks on ceph-rbd

# Check Ceph status
ceph -s

Result:

cluster:
  health: HEALTH_WARN
  1 osds down
  OSD count 1 < osd_pool_default_size 3

services:
  mon: 3 daemons, quorum node1,node2,node3
  mgr: node1(active)
  osd: 1 osds: 0 up, 1 in

data:
  usage: 0 B used, 0 B / 0 B avail
  pgs: 32 unknown

Root cause: OSD.0 was down, preventing containers/VMs with Ceph from accessing disks.
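
The "osds down" condition can also be checked mechanically. This is a minimal sketch that parses the human-readable `osd:` line in the format shown above; `osds_down` is a hypothetical helper, and the text layout may differ between Ceph releases, so the JSON output of `ceph -s --format json` is likely the more robust source for real monitoring.

```shell
# osds_down: reads `ceph -s` text on stdin and prints (total OSDs - OSDs up).
# Exits with status 2 if no "osd:" line was found.
osds_down() {
    awk '
        $1 == "osd:" { print $2 - $4; found = 1 }   # "osd: <total> osds: <up> up, ..."
        END { if (!found) exit 2 }'
}
```

Possible usage: `ceph -s | osds_down` prints `1` for the status shown above.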


Problem 3.2: OSD.0 Cannot Start

Diagnosis:

# Check OSD status
systemctl status ceph-osd@0
# Failed - exit code 1

# Logs
journalctl -u ceph-osd@0
# auth: unable to find a keyring on /var/lib/ceph/osd/ceph-0/keyring
# no keyring found, disabling cephx

Root cause: Missing keyring for OSD.0

Solution Attempt 1 - Generate keyring:

# Generate new key
ceph auth get-or-create osd.0 mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *'
# [osd.0]
# key = <key>

# Save to file
ceph auth get osd.0 -o /var/lib/ceph/osd/ceph-0/keyring
chown ceph:ceph /var/lib/ceph/osd/ceph-0/keyring
chmod 600 /var/lib/ceph/osd/ceph-0/keyring

Next error:

# Start attempt
systemctl start ceph-osd@0
# Failed

# New logs
journalctl -u ceph-osd@0
# missing 'type' file and unable to infer osd type

Status: ❌ Keyring not sufficient, more metadata missing


Problem 3.3: Missing OSD Metadata

Diagnosis:

# OSD directory contents
ls -la /var/lib/ceph/osd/ceph-0/
# Only keyring - no other files

Root cause: OSD directory is nearly empty - missing:
- type - backend type (bluestore/filestore)
- whoami - OSD ID
- block - symlink to block device
- fsid - OSD UUID
- ceph_fsid - Cluster UUID
- ready - ready marker
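
The same check can be scripted so it is repeatable during diagnosis. A minimal sketch, assuming the file list above plus the keyring; `missing_osd_files` is a hypothetical helper name:

```shell
# missing_osd_files OSD_DIR
# Prints the name of every expected OSD metadata entry that is absent.
missing_osd_files() {
    dir=$1
    for name in type whoami block fsid ceph_fsid ready keyring; do
        # -e follows symlinks, so a dangling `block` link also counts as missing
        [ -e "$dir/$name" ] || printf '%s\n' "$name"
    done
}
```

Possible usage: `missing_osd_files /var/lib/ceph/osd/ceph-0`. Empty output means all entries are present.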

Status: ❌ Deeper problem - check backend storage


Diagnosis and Resolution - Phase 4: Storage Backend

Problem 4.1: Missing Device for OSD

Diagnosis:

# Check OSD metadata from Ceph
ceph osd metadata 0

Result - Key Information:

{
    "osd_objectstore": "bluestore",
    "bluestore_bdev_dev_node": "/dev/dm-13",
    "bluestore_bdev_devices": "sdf,sdg,sdh,sdi,sdj,sdk,sdl,sdm",
    "bluestore_bdev_partition_path": "/dev/dm-13",
    "bluestore_bdev_type": "ssd",
    "bluestore_bdev_size": "6597065572352",
    "osd_data": "/var/lib/ceph/osd/ceph-0"
}

Key Discovery:
- OSD uses BlueStore (not FileStore)
- Backend: /dev/dm-13 (device mapper multipath)
- Physical devices: 8x iSCSI disks (sdf-sdm)
- Size: about 6 TiB (6597065572352 bytes)
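
The byte count from the metadata and the rounded sizes reported by other tools line up once binary and decimal units are kept apart. A small worked conversion (the helper names are hypothetical):

```shell
# bytes_to_tib BYTES: binary terabytes (1 TiB = 1024^4 bytes)
bytes_to_tib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 ^ 4) }'
}

# bytes_to_tb BYTES: decimal terabytes (1 TB = 10^12 bytes)
bytes_to_tb() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1e12 }'
}
```

`bytes_to_tib 6597065572352` prints `6.00` and `bytes_to_tb 6597065572352` prints `6.60`, so the device is 6 TiB in binary units.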

Check device:

ls -la /dev/dm-13
# ls: cannot access '/dev/dm-13': No such file or directory

Root cause: Device mapper /dev/dm-13 doesn't exist - no storage connection!


Problem 4.2: No iSCSI Sessions

Diagnosis:

# Check iSCSI sessions
iscsiadm -m session
# iscsiadm: No active sessions.

# Check disks
lsblk | grep 'sd[f-m]'
# sdf-sdm exist but are tiny (16K)

# Check iSCSI configuration
ls -la /etc/iscsi/nodes/
# 8 configured targets to storage array

Root cause: After the node restart, the iSCSI initiator did not log back in to its targets automatically.
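
Whether a target logs in at boot is controlled by the `node.startup` setting in its open-iscsi node record. The sketch below lists the records still set to `manual`; `manual_startup_nodes` is a hypothetical helper, and the default directory is taken from the listing above.

```shell
# manual_startup_nodes [NODES_DIR]
# Prints every node record file that still has node.startup = manual.
manual_startup_nodes() {
    grep -rl '^node\.startup = manual' "${1:-/etc/iscsi/nodes}" 2>/dev/null | sort
}
```

Possible usage: `manual_startup_nodes /etc/iscsi/nodes`. Any file it prints belongs to a target that will not be logged in automatically after a reboot.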


Diagnosis and Resolution - Phase 5: Restore iSCSI and Multipath

Solution 5.1: Restore iSCSI Connections

Restore sessions:

# Login to all targets
iscsiadm -m node -L all

Result:

Logging in to [iface: default, target: iqn.yyyy-mm.tld.vendor:target:..., portal: 10.0.1.10,3260]
...
Login to [...] successful. (x8 targets)

Verification:

# Check active sessions
iscsiadm -m session
# tcp: [1] 10.0.1.10:3260,13 iqn.yyyy-mm.tld.vendor:target:...
# ... (8 sessions)

# Check multipath device
ls -la /dev/dm-13
# brw-rw---- 1 root disk 252, 13 Nov  9 01:54 /dev/dm-13

# Verify multipath
multipath -ll | grep -A15 dm-13

Multipath result:

ceph-lun-1 (36xxxxxxxxxxxxxxxxxxxx) dm-13 VENDOR,MODEL
size=6.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| |- 18:0:0:1 sdo 8:224 active ready running
| |- 22:0:0:1 sds 65:32 active ready running
| |- 20:0:0:1 sdr 65:16 active ready running
| `- 24:0:0:1 sdu 65:64 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  |- 17:0:0:1 sdn 8:208 active ready running
  |- 19:0:0:1 sdp 8:240 active ready running
  |- 21:0:0:1 sdq 65:0  active ready running
  `- 23:0:0:1 sdt 65:48 active ready running

Status: ✅ Device dm-13 exists and is active (8 multipath paths)
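
Counting healthy paths by eye is error-prone once several maps exist. A minimal sketch that counts the `active ready running` paths of a single map from `multipath -ll` output in the layout shown above; `ready_paths` is a hypothetical helper name.

```shell
# ready_paths MAP_NAME
# Reads `multipath -ll` output on stdin and prints the number of paths of
# MAP_NAME that are in state "active ready running".
ready_paths() {
    awk -v map="$1" '
        # a map header starts in column 0 and names its dm-N device
        /^[^ |`].* dm-[0-9]+ / { in_map = ($1 == map) }
        in_map && /active ready running/ { n++ }
        END { print n + 0 }'
}
```

Possible usage: `multipath -ll | ready_paths ceph-lun-1`, which should print `8` for the output above.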


Problem 5.2: Device Exists But OSD Still Doesn't Work

Diagnosis:

# Try reading BlueStore label
ceph-bluestore-tool show-label --dev /dev/dm-13
# unable to decode label
# No valid bdev label found

Root cause: /dev/dm-13 exists but carries no BlueStore label of its own. The OSD most likely sits on an LVM logical volume on top of this device.

Check LVM:

# Read beginning of disk
dd if=/dev/dm-13 bs=4096 count=10 | hexdump -C | head -50

Discovery:

00000200  4c 41 42 45 4c 4f 4e 45  01 00 00 00 00 00 00 00  |LABELONE........|
00000210  3a c4 ab 25 20 00 00 00  4c 56 4d 32 20 30 30 31  |:..% ...LVM2 001|
...
00001200  63 65 70 68 2d 39 37 36  33 62 30 34 62 2d 61 38  |ceph-9763b04b-a8|
...
device = "/dev/mapper/ceph-lun-1"

Key discovery: OSD uses LVM!
- Volume Group: ceph-9763b04b-a846-47bc-984c-3c9da95d7329
- Physical Volume: /dev/mapper/ceph-lun-1 (alias for dm-13)
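
The hexdump inspection can be reduced to a yes/no test. LVM2 writes its `LABELONE` label into one of the first four 512-byte sectors of a physical volume, which matches the offset 0x200 seen above. A minimal sketch with a hypothetical helper name; on a live system `pvs` or `blkid` would normally answer the same question directly.

```shell
# has_lvm_label DEVICE_OR_FILE
# Succeeds if an LVM2 label signature is found in the first four sectors.
has_lvm_label() {
    dd if="$1" bs=512 count=4 2>/dev/null | grep -aq 'LABELONE'
}
```

Possible usage: `has_lvm_label /dev/dm-13 && echo "LVM physical volume"`.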


Solution 5.3: Activate LVM OSD

Scan and activate:

# Scan LVM
pvscan
vgscan
lvscan | grep ceph
# ACTIVE '/dev/ceph-9763b04b-a846-47bc-984c-3c9da95d7329/osd-block-b59f609a-4c09-4795-89f6-29b30800a3c7' [<6.00 TiB]

# Activate VG
vgchange -ay ceph-9763b04b-a846-47bc-984c-3c9da95d7329
# 1 logical volume(s) in volume group now active

# Check
ls -la /dev/ceph-9763b04b-a846-47bc-984c-3c9da95d7329/
# osd-block-b59f609a-4c09-4795-89f6-29b30800a3c7 -> ../dm-14

Verify data:

# Read BlueStore label from LV
ceph-bluestore-tool show-label --dev /dev/ceph-9763b04b-a846-47bc-984c-3c9da95d7329/osd-block-b59f609a-4c09-4795-89f6-29b30800a3c7

Result:

{
    "osd_uuid": "b59f609a-4c09-4795-89f6-29b30800a3c7",
    "size": 6597065572352,
    "description": "main",
    "bluefs": "1",
    "ceph_fsid": "79e21f45-2ab5-4caa-8848-30d0ee790bc8",
    "osd_key": "<key>",
    "ready": "ready",
    "type": "bluestore",
    "whoami": "0"
}

Status: ✅ OSD data is complete and readable! LVM works, data exists.


Diagnosis and Resolution - Phase 6: OSD Metadata Reconstruction

Solution 6.1: Create Block Symlink

Create symlink:

cd /var/lib/ceph/osd/ceph-0
ln -sf /dev/ceph-9763b04b-a846-47bc-984c-3c9da95d7329/osd-block-b59f609a-4c09-4795-89f6-29b30800a3c7 block

# Verify
ls -la block
# block -> /dev/ceph-.../osd-block-b59f609a-...

Status: ✅ Symlink created


Solution 6.2: Add Metadata Files

Create all required files:

# OSD type
echo bluestore > /var/lib/ceph/osd/ceph-0/type

# OSD ID
echo 0 > /var/lib/ceph/osd/ceph-0/whoami

# Ready marker
echo ready > /var/lib/ceph/osd/ceph-0/ready

# OSD UUID (NOT cluster UUID!)
echo 'b59f609a-4c09-4795-89f6-29b30800a3c7' > /var/lib/ceph/osd/ceph-0/fsid

# Ceph cluster UUID
echo '79e21f45-2ab5-4caa-8848-30d0ee790bc8' > /var/lib/ceph/osd/ceph-0/ceph_fsid

# Fix permissions
chown ceph:ceph /var/lib/ceph/osd/ceph-0/*
chmod 644 /var/lib/ceph/osd/ceph-0/type /var/lib/ceph/osd/ceph-0/whoami /var/lib/ceph/osd/ceph-0/ready

Verify contents:

ls -la /var/lib/ceph/osd/ceph-0/
# block -> /dev/ceph-.../osd-block-...
# ceph_fsid
# fsid
# keyring
# ready
# type
# whoami

Status: ✅ All metadata created
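
Typing the UUIDs by hand is what makes the fsid mix-up described in Solution 6.3 possible. The sketch below derives the files from a saved copy of the `show-label` output instead. It is an illustration under assumptions: `label_value` and `write_osd_metadata` are hypothetical helpers, and the label is assumed to be laid out one key per line as shown in Phase 5. Ceph also ships `ceph-volume lvm activate` and `ceph-bluestore-tool prime-osd-dir`, which exist to populate this directory and may be worth trying first; they were not used in this incident.

```shell
# label_value KEY: reads show-label JSON on stdin, prints the value of KEY.
label_value() {
    awk -v key="$1" -F'"' '
        $2 == key {
            if (NF >= 5) { print $4 }                        # "key": "string"
            else { v = $3; gsub(/[ :,]/, "", v); print v }   # "key": number
            exit
        }'
}

# write_osd_metadata LABEL_FILE OSD_DIR
# Writes the small metadata files from the values stored in the label.
write_osd_metadata() {
    label=$1
    dir=$2
    label_value type      < "$label" > "$dir/type"
    label_value whoami    < "$label" > "$dir/whoami"
    label_value ready     < "$label" > "$dir/ready"
    label_value osd_uuid  < "$label" > "$dir/fsid"        # fsid holds the OSD UUID
    label_value ceph_fsid < "$label" > "$dir/ceph_fsid"   # cluster UUID
}
```

Possible usage: save the label with `ceph-bluestore-tool show-label --dev <lv path> > /root/osd0-label.json`, then run `write_osd_metadata /root/osd0-label.json /var/lib/ceph/osd/ceph-0` and fix ownership as above.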


Solution 6.3: Start OSD

First attempt - error:

systemctl start ceph-osd@0
journalctl -u ceph-osd@0
# bluestore(...) _read_multi_bdev_label label correct, but osd_uuid=b59f609a... need=79e21f45...
# No valid bdev label found

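Problem: On the first attempt the fsid file contained the cluster UUID instead of the OSD UUID, so BlueStore rejected the label. (The commands in Solution 6.2 already show the corrected value.)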

Correction:

# Fix - fsid should be OSD UUID, not cluster UUID
echo 'b59f609a-4c09-4795-89f6-29b30800a3c7' > /var/lib/ceph/osd/ceph-0/fsid

Restart:

systemctl reset-failed ceph-osd@0
systemctl start ceph-osd@0
systemctl status ceph-osd@0

Result:

ceph-osd@0.service - Ceph object storage daemon osd.0
   Active: active (running)
   Main PID: 28977 (ceph-osd)
   Memory: 293.6M

Status: ✅ OSD started successfully!


Diagnosis and Resolution - Phase 7: Ceph Verification

Verification 7.1: Cluster Status

Check after 10 seconds:

sleep 10
ceph -s

Result:

cluster:
  id:     79e21f45-2ab5-4caa-8848-30d0ee790bc8
  health: HEALTH_WARN
          mons are allowing insecure global_id reclaim
          1 pool(s) have no replicas configured
          OSD count 1 < osd_pool_default_size 3

services:
  mon: 3 daemons, quorum node1,node2,node3 (age 23m)
  mgr: node1(active, since 23m)
  osd: 1 osds: 1 up (since 27s), 1 in (since 4M)

data:
  pools:   1 pools, 32 pgs
  objects: 970.87k objects, 3.7 TiB
  usage:   3.7 TiB used, 2.3 TiB / 6.0 TiB avail
  pgs:     29 active+clean
           2  active+clean+scrubbing
           1  active+clean+scrubbing+deep

Key Metrics:
- ✅ OSD: 1 up, 1 in - ACTIVE
- ✅ Objects: 970.87k objects, 3.7 TiB - DATA AVAILABLE!
- ✅ PGs: 29 active+clean - placement groups healthy
- ✅ Scrubbing: automatic data integrity verification

Status: ✅ Ceph fully functional!


Verification 7.2: RBD Image List

Check image availability:

rbd ls rbd | head -30

Result:

base-template-disk-0
vm-10025-disk-0
vm-10026-disk-0
...
vm-10061-cloudinit
vm-10061-disk-0
vm-10061-disk-1
vm-10061-disk-2
...
vm-10068-disk-0
...
vm-10230-disk-0

Status: ✅ All RBD images available (30+ images)


Diagnosis and Resolution - Phase 8: Storage Restoration

Solution 8.1: Enable ceph-rbd Storage in Proxmox

Edit storage configuration:

# Fix /etc/pve/storage.cfg
cat > /etc/pve/storage.cfg << 'EOF'
dir: local
    path /var/lib/vz
    content vztmpl,iso,snippets
    shared 0

zfspool: local-zfs
    pool rpool/data
    content rootdir
    sparse 1

dir: shared-nfs
    path /mnt/shared/proxmox
    content import,rootdir,snippets,images
    nodes node2,node1,node3
    prune-backups keep-all=1
    shared 1

lvmthin: ssd-thin
    thinpool thin_pool
    vgname vg_data
    content images,rootdir
    nodes node2,node1,node3

pbs: backup-server
    datastore local-backup
    server 10.0.0.100
    content backup
    fingerprint xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx
    namespace proxmox-cluster
    nodes node1,node3,node2
    prune-backups keep-all=1
    username backup-user@pbs

rbd: ceph-rbd
    content rootdir,images
    nodes node3,node1,node2
    pool rbd
EOF

Status: ✅ Storage ceph-rbd active (removed disable 1 flag)


Diagnosis and Resolution - Phase 9: VM/Container Testing

Test 9.1: Start Container 10068

Start container:

pct start 10068
sleep 5
pct status 10068

Result:

status: running

Status: ✅ Container running!


Test 9.2: Verify VM 10061

Check status:

qm status 10061

Result:

status: running

Status: ✅ VM running (was already started earlier)


Final Summary

State After Repair

Ceph cluster:

Health: HEALTH_WARN (non-critical warnings)
OSD: 1 osds: 1 up, 1 in
Data: 970.88k objects, 3.7 TiB
PGs: 30 active+clean, 2 scrubbing+deep
I/O: active operations (25 KiB/s rd, 153 KiB/s wr)

VMs/Containers:
- All 27 VMs/containers using Ceph have access to data
- Critical machines (10061, 10068) tested and working

Warnings (non-critical):
- insecure global_id reclaim - security issue, can be fixed later
- no replicas configured - only 1 OSD, no replication (normal for single-OSD setup)
- OSD count 1 < osd_pool_default_size 3 - default config requires 3 OSDs


Complete List of Executed Steps

1. Disable Autostart for Ceph Containers

# Set onboot: 0 (intended for the 27 Ceph-backed guests; note that these globs match every guest config)
sed -i 's/^onboot: 1$/onboot: 0/' /etc/pve/nodes/*/lxc/*.conf
sed -i 's/^onboot: 1$/onboot: 0/' /etc/pve/nodes/*/qemu-server/*.conf
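
If only the guests with disks on the Ceph storage should lose autostart, the change can be limited to configs that reference that storage. A minimal sketch, assuming the usual `key: storage:volume` layout of Proxmox guest configs and GNU `sed -i` as used above; `disable_onboot_for_storage` is a hypothetical helper name.

```shell
# disable_onboot_for_storage STORAGE CONF_FILE...
# Sets onboot to 0 only in configs with a volume on STORAGE and
# prints the path of each config that references it.
disable_onboot_for_storage() {
    storage=$1
    shift
    for conf in "$@"; do
        [ -f "$conf" ] || continue
        # matches lines such as "rootfs: ceph-rbd:..." or "scsi0: ceph-rbd:..."
        if grep -q "^[a-z]*[0-9]*: $storage:" "$conf"; then
            sed -i 's/^onboot: 1$/onboot: 0/' "$conf" && printf '%s\n' "$conf"
        fi
    done
}
```

Possible usage: `disable_onboot_for_storage ceph-rbd /etc/pve/nodes/*/lxc/*.conf /etc/pve/nodes/*/qemu-server/*.conf`.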

2. Restart All Nodes

# Sequentially: node3 -> node2 -> node1
ssh node3 "reboot"
# Wait for boot
ssh node2 "reboot"
# Wait for boot
ssh node1 "reboot"

3. Restore iSCSI

# Login to all targets
iscsiadm -m node -L all
# Verify 8 sessions
iscsiadm -m session

4. Activate Multipath

# Automatically created dm-13
multipath -ll

5. Activate LVM

# Scan and activate VG
pvscan
vgscan
vgchange -ay ceph-9763b04b-a846-47bc-984c-3c9da95d7329

6. Create OSD Metadata

cd /var/lib/ceph/osd/ceph-0

# Symlink to device
ln -sf /dev/ceph-9763b04b-a846-47bc-984c-3c9da95d7329/osd-block-b59f609a-4c09-4795-89f6-29b30800a3c7 block

# Metadata files
echo bluestore > type
echo 0 > whoami
echo ready > ready
echo 'b59f609a-4c09-4795-89f6-29b30800a3c7' > fsid
echo '79e21f45-2ab5-4caa-8848-30d0ee790bc8' > ceph_fsid

# Keyring (already existed)
# /var/lib/ceph/osd/ceph-0/keyring

# Permissions
chown ceph:ceph /var/lib/ceph/osd/ceph-0/*
chmod 644 type whoami ready fsid ceph_fsid

7. Start OSD

systemctl reset-failed ceph-osd@0
systemctl start ceph-osd@0
systemctl enable ceph-osd@0

8. Enable Proxmox Storage

# Edit /etc/pve/storage.cfg
# Remove 'disable 1' line from ceph-rbd section

9. Test VMs/Containers

pct start 10068  # Container
qm status 10061  # VM (already running)

Key Error Logs and Solutions

Error 1: Missing Keyring

Error: auth: unable to find a keyring on /var/lib/ceph/osd/ceph-0/keyring
Solution: ceph auth get osd.0 -o /var/lib/ceph/osd/ceph-0/keyring

Error 2: Missing type File

Error: missing 'type' file and unable to infer osd type
Solution: echo bluestore > /var/lib/ceph/osd/ceph-0/type

Error 3: Missing fsid

Error: bluestore(/var/lib/ceph/osd/ceph-0) _open_fsid (2) No such file or directory
Solution: echo 'b59f609a-4c09-4795-89f6-29b30800a3c7' > /var/lib/ceph/osd/ceph-0/fsid

Error 4: Incorrect UUID in fsid

Error: osd_uuid=b59f609a... need=79e21f45... (mixed OSD UUID with cluster UUID)
Solution: Use OSD UUID (not cluster) in fsid file

Error 5: Missing dm-13 Device

Error: ls: cannot access '/dev/dm-13': No such file or directory
Solution: Restore iSCSI sessions (iscsiadm -m node -L all)

Ceph Storage Architecture in This Environment

SAN Storage Array (10.0.1.x, 10.0.2.x)
  |
  |-- iSCSI Targets (8 LUNs)
  |   |
  |   |-- controller-a: 4 paths
  |   `-- controller-b: 4 paths
  |
  v
Proxmox node1
  |
  |-- iSCSI Initiator (8 sessions)
  |   |
  |   `-- Disks: sdf, sdg, sdh, sdi, sdj, sdk, sdl, sdm
  |
  |-- Multipath (dm-13)
  |   |
  |   `-- Device: /dev/mapper/ceph-lun-1 (6TB)
  |
  |-- LVM
  |   |
  |   |-- VG: ceph-9763b04b-a846-47bc-984c-3c9da95d7329
  |   `-- LV: osd-block-b59f609a-4c09-4795-89f6-29b30800a3c7
  |       |
  |       `-- /dev/dm-14
  |
  |-- Ceph OSD.0
  |   |
  |   |-- BlueStore backend
  |   |-- Data: 3.7 TiB used
  |   `-- Objects: 970k
  |
  `-- Ceph RBD Pool 'rbd'
      |
      `-- 30+ VM/container images

Conclusions and Recommendations

What Went Wrong

  1. No automatic iSCSI login - after node restart, iSCSI sessions weren't restored automatically
  2. Empty OSD directory - OSD metadata was previously deleted (probably during some repair attempt)
  3. Ceph container autostart - containers tried to start during boot before Ceph availability

Recommendations

  1. Enable automatic iSCSI login:

    # For each target in /etc/iscsi/nodes/
    sed -i 's/^node.startup = manual$/node.startup = automatic/' /etc/iscsi/nodes/*/*/default

  2. Back up OSD metadata:

    # Regular backups of directory
    tar czf /root/ceph-osd-0-metadata-$(date +%Y%m%d).tar.gz /var/lib/ceph/osd/ceph-0/

  3. Monitor iSCSI and Ceph:

    # Add to cron: check for iSCSI sessions
    */5 * * * * /usr/local/bin/check-iscsi-sessions.sh

  4. Dependency documentation:
    - Ceph containers should have delayed start
    - Or systemd dependency on ceph-osd

  5. Keep autostart disabled for Ceph containers or add delay:

    # In container config add:
    # (up=300 makes Proxmox wait 300 seconds after this guest starts before starting the next one)
    startup: order=100,up=300
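
The monitoring script named in the cron entry is not shown in this write-up. A minimal sketch of what it might contain, assuming TCP sessions listed one per line as in the `iscsiadm -m session` output above; both function names are hypothetical.

```shell
# count_iscsi_sessions: reads `iscsiadm -m session` output on stdin
# and prints the number of TCP sessions.
count_iscsi_sessions() {
    grep -c '^tcp: \[' || true   # grep -c prints 0 but exits non-zero on no match
}

# check_iscsi_sessions EXPECTED
# Prints a one-line verdict and returns 1 if fewer sessions than EXPECTED exist.
check_iscsi_sessions() {
    expected=$1
    actual=$(count_iscsi_sessions)
    if [ "$actual" -lt "$expected" ]; then
        echo "WARNING: $actual of $expected iSCSI sessions active"
        return 1
    fi
    echo "OK: $actual iSCSI sessions active"
}
```

Possible usage inside the script: `iscsiadm -m session 2>/dev/null | check_iscsi_sessions 8`, with the non-zero exit status driving whatever alerting is in place.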

Automated Recovery Script After Restart

Location: /usr/local/bin/ceph-recovery.sh

#!/bin/bash
# Automatic Ceph recovery after restart

echo "=== Ceph Recovery Script ==="
date

# 1. Check and restore iSCSI sessions
echo "Checking iSCSI sessions..."
if [ "$(iscsiadm -m session 2>/dev/null | wc -l)" -lt 8 ]; then
    echo "Restoring iSCSI sessions..."
    iscsiadm -m node -L all
    sleep 5
fi

# 2. Check multipath
echo "Checking multipath..."
if [ ! -e /dev/mapper/ceph-lun-1 ]; then
    echo "ERROR: Multipath device not found!"
    multipath -r
    sleep 5
fi

# 3. Activate LVM
echo "Activating Ceph LVM..."
vgchange -ay ceph-9763b04b-a846-47bc-984c-3c9da95d7329

# 4. Check OSD
echo "Checking OSD status..."
if ! systemctl is-active --quiet ceph-osd@0; then
    echo "Starting OSD..."
    systemctl start ceph-osd@0
fi

# 5. Wait for Ceph
echo "Waiting for Ceph cluster..."
for i in {1..30}; do
    if ceph health &>/dev/null; then
        echo "Ceph cluster is responding"
        break
    fi
    sleep 2
done

# 6. Display status
echo "=== Final Status ==="
ceph -s
echo "=== Recovery Complete ==="
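
The fixed `sleep 5` calls in the script assume each layer is ready within five seconds. One alternative is to poll for the condition with a bounded number of attempts. This is a sketch of that idea, not part of the script as deployed; `wait_until` is a hypothetical helper.

```shell
# wait_until TRIES DELAY CMD [ARGS...]
# Runs CMD until it succeeds, at most TRIES times, sleeping DELAY seconds
# between attempts. Returns 0 on success, 1 if every attempt failed.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}
```

Possible usage: `wait_until 30 2 test -e /dev/mapper/ceph-lun-1 || exit 1` before activating LVM, so the script stops early instead of continuing against a missing device.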

Add to systemd:

# /etc/systemd/system/ceph-recovery.service
[Unit]
Description=Ceph Recovery After Reboot
After=network-online.target iscsid.service
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/ceph-recovery.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Enable the service:

systemctl daemon-reload
systemctl enable ceph-recovery.service

Checklists

Checklist: Single Node Restart

  • [ ] Check active VMs/containers on node
  • [ ] Migrate critical machines to other nodes
  • [ ] Perform restart
  • [ ] After boot check iSCSI sessions: iscsiadm -m session
  • [ ] Check multipath: multipath -ll
  • [ ] Check LVM: lvs | grep ceph
  • [ ] Check OSD: systemctl status ceph-osd@0
  • [ ] Check Ceph: ceph -s

Checklist: Full Cluster Restart

  • [ ] Disable autostart for VMs/CTs with Ceph (onboot: 0)
  • [ ] Restart node3
  • [ ] Wait for node3 boot + check Ceph
  • [ ] Restart node2
  • [ ] Wait for node2 boot + check Ceph
  • [ ] Restart node1 (MON+MGR+OSD)
  • [ ] Wait for node1 boot
  • [ ] Execute recovery: /usr/local/bin/ceph-recovery.sh
  • [ ] Check ceph -s
  • [ ] Start critical machines manually

Checklist: OSD Problem Diagnosis

  • [ ] Check iSCSI sessions: iscsiadm -m session
  • [ ] Check disks: lsblk | grep 'sd[f-m]'
  • [ ] Check multipath: ls -la /dev/dm-13
  • [ ] Check LVM: pvs; vgs; lvs
  • [ ] Check OSD directory: ls -la /var/lib/ceph/osd/ceph-0/
  • [ ] Check logs: journalctl -u ceph-osd@0 -n 50
  • [ ] Check status: systemctl status ceph-osd@0
  • [ ] Check cluster: ceph -s