
A Rollercoaster Bug Fix Journey – PVE Node Shows 'Unknown' Status Failure

On a Sunday afternoon, Old T planned to install RouterOS on PVE for testing, only to find that he couldn't create a virtual machine: on the VM creation page, the node field displayed “Node epson seems to be offline,” and on the PVE panel the node showed a gray question mark with a status of “Unknown.” Thus began Old T's rollercoaster bug-fixing journey.


PVE Storage Device Configuration Conflict

Upon encountering the bug, Old T first suspected a node configuration conflict. The VMs on the node were still running normally, but besides the gray question mark on the node itself, two storage devices also displayed black question marks.

These two storage devices were actually mechanical hard drives in the host. Previously, Old T had mounted these drives and passed them through to VM 100 for use.

However, since the VM couldn't read the hard drives' temperatures that way, the storage was detached and PCIe passthrough was used instead to hand the SATA controller directly to the VM. This left two stale storage definitions behind.
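For anyone retracing this setup, the passthrough state can be checked from the host side roughly like this (a sketch; the PCI address is whatever lspci reports for your SATA controller):

lspci -nn | grep -i sata                # find the SATA controller's PCI address
qm config 100 | grep -i hostpci         # the VM config should list a hostpci entry for that address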

So, Old T directly deleted the two storage devices from the PVE panel and refreshed, but the issue persisted.

Wondering if the cluster state hadn’t updated yet, Old T restarted and forcibly refreshed the cluster state while checking the storage mount status.

systemctl restart pve-cluster
systemctl restart corosync
pvesm status

Sure enough, there was an issue with the storage status. The PVE configuration retained the LVM storage definitions for the two HDDs, but the physical disks were no longer visible as they had been passed through to VM 100.

root@epson:# pvesm status
Command failed with status code 5.
command '/sbin/vgscan --ignorelockingfailure --mknodes' failed: exit code 5
Name             Type     Status           Total            Used       Available        %
HDD1              lvm   inactive               0               0               0    0.00%
HDD2              lvm   inactive               0               0               0    0.00%
local             dir     active        98497780        13880408        79567824   14.09%
local-lvm     lvmthin     active       832888832       126265946       706622885   15.16%
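The leftover definitions themselves live in /etc/pve/storage.cfg; the two stale entries would have looked roughly like this (a sketch, with assumed volume-group names):

lvm: HDD1
        vgname HDD1
        content images
        shared 0

lvm: HDD2
        vgname HDD2
        content images
        shared 0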

Old T then cleaned up the invalid storage configurations and repaired the PVE cluster state.

pvesm remove HDD1
pvesm remove HDD2
systemctl restart pve-storage.target
pmxcfs -l # Start pmxcfs in local mode to rebuild the cluster configuration filesystem
pvecm updatecerts --force
systemctl restart pve-cluster

A reboot followed, hoping the issue would be resolved.

But after restarting, the node still showed a gray question mark.


Cluster Configuration File Error

If discovering the PVE storage configuration error earlier was straightforward, checking the PVE cluster configuration next led Old T into a deeper rabbit hole.

root@epson:~# pvecm status
Error: Corosync config '/etc/pve/corosync.conf' does not exist - is this node part of a cluster?
Linux epson 6.8.12-9-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-9 (2025-03-16T19:18Z) x86_64

Checking the cluster status revealed that the corosync configuration file was missing—a bizarre bug with no clear cause.

Before rebuilding, another minor issue needed attention.

Old T scrutinized the PVE panel again and noticed that, besides the gray question mark on the node, basic information like CPU and memory usage wasn’t displayed, and the charts were missing, showing January 1, 1970, below the icons.

System Time

This led Old T to suspect that a system time service failure might be causing these issues.

root@epson:# timedatectl
               Local time: Mon 2025-08-31 21:18:39 CST
           Universal time: Mon 2025-08-31 13:18:39 UTC
                 RTC time: Mon 2025-08-31 13:18:39
                Time zone: Asia/Shanghai (CST, +0800)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no
root@epson:# hwclock --show
2025-08-31 21:19:03.141107+08:00

However, no faults were found; the system time was correct.
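In hindsight, the 1970 date was probably not a clock problem at all: the panel graphs are drawn from RRD files that pvestatd feeds through pve-cluster, and an epoch timestamp is what the UI falls back to when no fresh status data arrives. A quick way to check that side (a sketch):

systemctl status pvestatd rrdcached --no-pager   # the daemons that collect and store the metrics
ls -l /var/lib/rrdcached/db/pve2-node/           # per-node RRD files; stale mtimes mean no recent updates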

File Path Issue

With no other options left, the only path was to rebuild the configuration file.

As usual, Old T first checked the corosync status.

root@epson:~# systemctl status corosync
○ corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: inactive (dead)
  Condition: start condition failed at Mon 2025-09-01 21:13:46 CST; 6min ago
             └─ ConditionPathExists=/etc/corosync/corosync.conf was not met
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview

Aug 31 21:13:46 epson systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.conf).
root@epson:~#

This check revealed a new problem. Earlier, the cluster check had complained about a missing /etc/pve/corosync.conf, yet corosync itself was looking for /etc/corosync/corosync.conf, so the paths seemed inconsistent. (On a working cluster node these are two copies of the same file: pmxcfs keeps the authoritative version under /etc/pve, and the pve-cluster service syncs it to /etc/corosync for corosync to read.)

Old T attempted to fix this issue, but it made no difference; corosync still couldn’t find the configuration file.

root@epson:# mount | grep /etc/pve
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
root@epson:# mkdir -p /etc/corosync
root@epson:# ln -s /etc/pve/corosync.conf /etc/corosync/corosync.conf

Rebuilding Cluster Configuration

Finally, the configuration rebuilding began.

root@epson:# systemctl stop pve-cluster pvedaemon pveproxy corosync # Stop services
root@epson:# rm -rf /etc/pve/* # Delete original configuration
root@epson:# rm -rf /var/lib/pve-cluster/* # Delete original configuration
root@epson:# mkdir /etc/pve # Recreate directory
root@epson:# mkdir /var/lib/pve-cluster # Recreate directory
# Write configuration
root@epson:# cat > /etc/pve/corosync.conf <<EOF
totem {
  version: 2
  cluster_name: epson
  transport: knet
  crypto_cipher: aes256
  crypto_hash: sha256
}

nodelist {
  node {
    name: epson
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.8
  }
}

quorum {
  provider: corosync_votequorum
  expected_votes: 1
}

logging {
  to_syslog: yes
}
EOF

root@epson:# chown root:www-data /etc/pve/corosync.conf
root@epson:# chmod 640 /etc/pve/corosync.conf
root@epson:# rm -f /etc/corosync/corosync.conf # Remove old link
root@epson:# ln -s /etc/pve/corosync.conf /etc/corosync/corosync.conf
root@epson:# systemctl daemon-reload
root@epson:# systemctl start corosync
Job for corosync.service failed because the control process exited with error code.
See "systemctl status corosync.service" and "journalctl -xeu corosync.service" for details.

Everything seemed normal during the process, but corosync still crashed. Checking the logs revealed a missing authentication key file: Could not open /etc/corosync/authkey: No such file or directory.
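For the record, hand-writing corosync.conf is not strictly necessary here: if a single-node cluster is actually wanted, pvecm can generate both the config and the authentication key in one step (a simpler path, assuming there is no other cluster state worth preserving):

pvecm create epson    # generates /etc/pve/corosync.conf and the corosync authkey, then starts the cluster services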

Fixing the Key File

Old T quickly generated the key file and linked it to the correct path as per the error.

root@epson:# corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 2048 bits for key from /dev/urandom.
Writing corosync key to /etc/corosync/authkey.
root@epson:# ln -s /etc/pve/authkey /etc/corosync/authkey

Rechecking the corosync status showed it was finally working.

root@epson:~# systemctl status corosync
● corosync.service - Corosync Cluster Engine
     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; preset: enabled)
     Active: active (running) since Mon 2025-08-31 21:37:39 CST; 29s ago
       Docs: man:corosync
             man:corosync.conf
             man:corosync_overview
   Main PID: 18905 (corosync)
      Tasks: 9 (limit: 18833)
     Memory: 112.4M
        CPU: 204ms
     CGroup: /system.slice/corosync.service
             └─18905 /usr/sbin/corosync -f

However, even though corosync was fixed, the original issue of the gray question mark on the epson node remained unresolved. The root cause was still elusive.
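Since a gray question mark generally just means the web UI is not receiving fresh status data for the node, a broader sweep of the status-related services would have been a reasonable next step (a sketch):

systemctl --no-pager status pve-cluster pvedaemon pveproxy pvestatd | grep -E '\.service|Active:'
journalctl -u pvestatd -n 20 --no-pager    # pvestatd is what feeds the node status shown in the panel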

Resetting Configuration

This marked the beginning of the deep dive.

Before resetting the cluster configuration, Old T deleted the original cluster files as usual. But, oh no—PVE completely crashed.

root@epson:# systemctl stop pve-cluster corosync pvedaemon pveproxy
root@epson:# rm -rf /var/lib/pve-cluster/*
root@epson:# pvecm updatecerts -f
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused

Fortunately, Old T had backed up files earlier and restored them from the backup.

Hoping to at least return to the initial state without a complete crash, Old T proceeded.

Yet, the problem persisted. Checking the logs revealed a strange new issue: pmxcfs couldn't open the database file '/var/lib/pve-cluster/config.db'.

[database] crit: splite3_open_v2 failed: 14#010
[main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
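Error 14 from sqlite is SQLITE_CANTOPEN (the database file could not be opened), so before anything drastic it would have been worth checking whether the file was actually there and intact, for example:

ls -l /var/lib/pve-cluster/config.db                               # does the file exist, with a sane owner and size?
sqlite3 /var/lib/pve-cluster/config.db 'PRAGMA integrity_check;'   # "ok" means the database itself is healthy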

With a “nothing left to lose” mindset, Old T decided to go for a complete reconfiguration.

rm -rf /var/lib/pve-cluster/*
rm -f /etc/corosync/corosync.conf
rm -f /etc/pve/corosync.conf
rm -f /var/lib/corosync/*

apt-get install --reinstall --purge pve-cluster corosync

So, Old T completely removed the cluster and corosync, intending to start fresh.

But it still failed. The PVE web interface was now unrecoverable. Since the virtual machines were still running, Old T didn’t push further and decided to tackle it again the next day.


Second Attempt at Cluster Repair

The next morning, Old T got up early, hoping to fix this before work.

Admittedly, a good night’s sleep brought much clearer thinking.

# Check and backup existing files
ls -la /var/lib/pve-cluster/
cp /var/lib/pve-cluster/config.db /root/config.db.bak.$(date +%Y%m%d)

# Stop processes
systemctl stop corosync pve-cluster pvedaemon pveproxy
pkill -9 pmxcfs

# Repair the database
sqlite3 /var/lib/pve-cluster/config.db ".dump" > /root/dump.sql
sqlite3 /var/lib/pve-cluster/new_config.db < /root/dump.sql
mv /var/lib/pve-cluster/new_config.db /var/lib/pve-cluster/config.db
chown www-data:www-data /var/lib/pve-cluster/config.db
chmod 0600 /var/lib/pve-cluster/config.db

## Restart cluster services
systemctl start pve-cluster

However, the restart still failed. Checking the logs revealed that pmxcfs couldn’t mount the filesystem to /etc/pve. Another peculiar issue.

A closer look at the /etc/pve path showed that the previous rm -rf /etc/pve/* command only deleted some files in the directory, leaving hidden files (those starting with a dot .) untouched, meaning the directory wasn’t actually empty.

So, he went through the process again, removed the /etc/pve directory entirely, and recreated an empty one.

Then, he rewrote the original two VM configurations into /etc/pve/qemu-server/100.conf and /etc/pve/qemu-server/101.conf.
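The rewritten files follow the usual qemu-server config format; a minimal sketch of what 100.conf might contain (illustrative values only, not the actual config, and note that PVE expects the disk size in its short form, e.g. size=64G):

# illustrative values only
name: nas
cores: 4
memory: 8192
ostype: l26
scsihw: virtio-scsi-pci
scsi0: local-lvm:vm-100-disk-0,size=64G
net0: virtio=BC:24:11:00:00:01,bridge=vmbr0
boot: order=scsi0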

Finally, he was back to square one.

That is, the node ’epson’ in PVE showed a grey question mark. Two rounds of effort, no progress.


Remotely Breaking PVE Connectivity

Old T reviewed this process with DeepSeek to figure out where things went wrong. It immediately suggested updating the network configuration right after restoring the cluster config.

Trusting its advice (perhaps foolishly), Old T followed its method to create a network configuration, which promptly took the remote PVE instance offline.

# Create temporary network configuration
cat << EOF > /etc/network/interfaces.tmp
auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.238/24
        gateway 192.168.1.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
EOF

# Apply configuration
mv /etc/network/interfaces.tmp /etc/network/interfaces
chmod 644 /etc/network/interfaces
systemctl restart networking
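In hindsight, a safer pattern for remote network changes is to schedule an automatic rollback before applying anything, so a bad config undoes itself if the session is lost (a sketch; assumes ifupdown2, which current PVE ships by default):

cp /etc/network/interfaces /etc/network/interfaces.bak
# revert after 5 minutes unless the rollback job is killed once the change is confirmed working
nohup sh -c 'sleep 300; cp /etc/network/interfaces.bak /etc/network/interfaces; ifreload -a' >/dev/null 2>&1 &
mv /etc/network/interfaces.tmp /etc/network/interfaces
ifreload -a    # ifupdown2 applies the new config in place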

Rescuing the PVE Network

Just as one wave subsided, another arose. This mishap with DeepSeek was frustrating, so Old T turned to Gemini for help.

Investigation revealed that the PVE network itself wasn’t the main issue; it could connect to the network normally.

The problem stemmed from an earlier mistake when writing the two VM configurations: the disk space format was incorrectly written as ‘64 GiB’ instead of ‘64G’, causing PVE to fail parsing the disk settings. Consequently, after the network restart, it couldn’t connect to the VMs properly.

However, during this network troubleshooting, Old T noticed a new issue.

The running status showed the node name as 2606:4700:3037::6815:3752, while in the corosync configuration, the node name was epson.

In theory, when the corosync service starts, it should read the name ’epson’ from the config file. Then, it needs to resolve ’epson’ to an IP address for network communication. If this resolution fails, corosync might “give up” and use the IP address directly as its name.

A detailed check revealed that Old T had previously modified the PVE hostname. Normally, the node name should resolve to 192.168.1.238, but after a custom domain was added, it was resolving to a Cloudflare IP address instead. Adjusting the order of entries in the hosts file finally fixed this bug.

root@epson:~# getent hosts epson
2606:4700:3037::6815:3752 epson.fosu.cc
2606:4700:3031::ac43:91f8 epson.fosu.cc
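The fix is simply to make sure the LAN entry wins: in /etc/hosts, the node's private address should come before (or replace) any public-domain lines, roughly like this (a sketch; the local domain name is illustrative):

127.0.0.1       localhost.localdomain localhost
192.168.1.238   epson.localdomain epson
# entries that point the node name at public (Cloudflare) addresses belong below this line, or can be removed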

The Resolution

Old T took another look around the various views in the PVE panel. After changing the hardware status monitoring panel's timeframe from “Day” to “Month”, he finally spotted the real issue.

CPU Utilization Status

It turned out this PVE node failure had been present for about ten days. Pinpointing the timeline to around August 21st suggested it was related to the installation of pvetools at that time.

Back then, to solve the hard drive temperature display issue in the Feiniu NAS system, Old T had installed pvetools to modify panel settings and add temperature monitoring.

Remembering this, Old T immediately started verifying.

root@epson:~# dpkg --verify pve-manager # Verify file changes
missing c /etc/apt/sources.list.d/pve-enterprise.list
??5?????? /usr/share/perl5/PVE/API2/Nodes.pm
??5?????? /usr/share/pve-manager/js/pvemanagerlib.js

Unsurprisingly, the installation of pvetools had modified pve-manager's backend API (Nodes.pm) and core JS library (pvemanagerlib.js); the 5 in the third column of the dpkg output flags an MD5 mismatch against the packaged files. Even though Old T recalled uninstalling the relevant pvetools components afterward, that didn't revert the changes.

So, Old T reinstalled pve-manager, which finally removed the temperature display that pvetools had left on the panel.

apt-get update
apt-get install --reinstall pve-manager
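Re-running the earlier check is a quick way to confirm the packaged files are back in place; dpkg --verify stays silent for files that match the package again:

dpkg --verify pve-manager    # no ??5?? lines means the previously modified files now match the package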

Next, Old T continued investigating what other negative impacts pvetools might have had on PVE.

While checking the status of the various PVE components, he found that the PVE status daemon (pvestatd) was constantly throwing errors.

root@epson:~# systemctl status pvestatd # Check PVE status daemon
● pvestatd.service - PVE Status Daemon
     Loaded: loaded (/lib/systemd/system/pvestatd.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-09-01 19:07:31 CST; 5min ago
    Process: 1081 ExecStart=/usr/bin/pvestatd start (code=exited, status=0/SUCCESS)
   Main PID: 1116 (pvestatd)
      Tasks: 1 (limit: 18833)
     Memory: 156.2M
        CPU: 2.704s
     CGroup: /system.slice/pvestatd.service
             └─1116 pvestatd

Sep 01 18:11:50 epson pvestatd[1116]: node status update error: Undefined subroutine &PVE::Network::ip_link_details called at /usr/share/perl5/PVE/Ser>
Sep 01 18:12:00 epson pvestatd[1116]: node status update error: Undefined subroutine &PVE::Network::ip_link_details called at /usr/share/perl5/PVE/Ser>
Sep 01 18:12:10 epson pvestatd[1116]: node status update error: Undefined subroutine &PVE::Network::ip_link_details
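That undefined-subroutine error is the typical signature of mismatched PVE packages: PVE::Network is shipped by libpve-common-perl, so a newer pve-manager calling ip_link_details against an older copy of that library fails exactly like this. Comparing the installed versions makes the mismatch visible (a sketch):

pveversion -v | grep -E 'pve-manager|libpve-common-perl'         # compare the installed component versions
grep -c 'sub ip_link_details' /usr/share/perl5/PVE/Network.pm    # 0 means the installed library lacks the function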

Finally, Old T resolved the issue by performing a complete upgrade of the PVE components.

apt update && apt full-upgrade

Problem Review and Summary

  1. Initial Symptoms

    • PVE node status showed unknown (grey question mark), inability to create VMs, missing node performance graphs (showing January 1, 1970).
  2. Detour 1: Misjudged Storage Configuration Conflict

    • Suspected invalid storage configurations (HDD1/HDD2) from passed-through HDDs caused the issue. Cleared them and rebooted, but the problem persisted.
  3. Detour 2: Mistakenly Rebuilt Cluster Configuration

    • Discovered corosync.conf was missing and attempted to rebuild the cluster (including fixing paths, key files), but the node status anomaly remained.
  4. Detour 3: Misled into Modifying Network Configuration

    • Followed erroneous advice to rewrite network configuration, causing PVE to lose connectivity. Later fixed hosts resolution (node name incorrectly resolved to Cloudflare IP), but the core fault persisted.
  5. Initial Suspicion

    • Combined timeline (fault started around August 21st) and operation history, suspected the pvetools script (used to add temperature monitoring) was the root cause.
  6. Key Evidence 1

    • dpkg --verify pve-manager confirmed core files (Nodes.pm, pvemanagerlib.js) were modified by pvetools.
  7. First Attempt

    • Reinstalled pve-manager: restored the modified files (temperature monitoring disappeared), but the node status anomaly still wasn’t fixed, indicating a deeper issue.
  8. Decisive Evidence 2

    • Checked pvestatd logs and found the critical error: Undefined subroutine &PVE::Network::ip_link_details, clearly pointing to library version mismatch.
  9. Root Cause

    • PVE Component Version Conflict: The newer version of pve-manager was calling a function that didn’t exist in the older Perl libraries (like libpve-common-perl).
  10. Final Solution

    • Executed a full system upgrade: apt update && apt full-upgrade, synchronizing all PVE components to compatible versions, completely resolving the issue.