On a Sunday afternoon, Old T planned to install RouterOS on PVE for testing, only to find that creating a virtual machine was impossible. The specific issue was that on the VM creation page, the node field displayed “Node epson seems to be offline,” and on the PVE panel, the node showed a gray question mark with a status of “Unknown.” Thus began Old T’s rollercoaster journey of bug fixing.
PVE Storage Device Configuration Conflict
Upon encountering the bug, Old T first suspected a node configuration conflict. The VMs on the node were still running normally, but besides the gray question mark on the node itself, two storage devices also displayed black question marks.
These two storage devices were actually mechanical hard drives in the host. Previously, Old T had mounted these drives and passed them through to VM 100 for use.
However, since the VM couldn’t read the hard drive temperatures that way, the storage was detached, and PCIe passthrough was used instead to hand the SATA controller directly to the VM. This left two stale storage entries behind.
So, Old T directly deleted the two storage devices from the PVE panel and refreshed, but the issue persisted.
Wondering if the cluster state hadn’t updated yet, Old T restarted and forcibly refreshed the cluster state while checking the storage mount status.
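The exact commands aren’t preserved here, but the check was roughly along these lines (a sketch):

```bash
# Restart the cluster filesystem and the status daemon, then look again
systemctl restart pve-cluster pvestatd

# Compare the storage PVE thinks it has with the block devices actually
# visible on the host
pvesm status
lsblk
```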
Sure enough, there was an issue with the storage status. The PVE configuration retained the LVM storage definitions for the two HDDs, but the physical disks were no longer visible as they had been passed through to VM 100.
Old T then cleaned up the invalid storage configurations and repaired the PVE cluster state.
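A minimal sketch of that cleanup, assuming the stale entries were the HDD1/HDD2 definitions mentioned in the summary at the end:

```bash
# Drop the stale LVM storage definitions from the PVE storage config
pvesm remove HDD1
pvesm remove HDD2

# Restart the cluster filesystem and status daemon so the panel reflects it
systemctl restart pve-cluster pvestatd
```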
Old T then rebooted, hoping the issue would be resolved.
But after restarting, the node still showed a gray question mark.
Cluster Configuration File Error
If spotting the earlier PVE storage configuration error had been straightforward, the next step, checking the PVE cluster configuration, led Old T down a much deeper rabbit hole.
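The check itself is a one-liner; roughly (a sketch, the original output isn’t preserved):

```bash
# Query the cluster state and confirm whether the cluster-managed
# corosync config actually exists
pvecm status
ls -l /etc/pve/corosync.conf
```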
Checking the cluster status revealed that the corosync configuration file was missing—a bizarre bug with no clear cause.
Before rebuilding, another minor issue needed attention.
Old T scrutinized the PVE panel again and noticed that, besides the gray question mark on the node, basic information like CPU and memory usage wasn’t displayed, and the charts were missing, showing January 1, 1970, below the icons.
System Time
This led Old T to suspect that a system time service failure might be causing these issues.
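Verifying the time service is quick; something like this (a sketch):

```bash
# Show the current time, time zone, and whether NTP sync is active
timedatectl status

# Check the time sync service (chrony on current PVE releases;
# older installs may use systemd-timesyncd instead)
systemctl status chrony
```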
However, no faults were found; the system time was correct.
File Path Issue
With no other options left, the only path was to rebuild the configuration file.
As usual, Old T first checked the corosync status.
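The status check was along these lines (a sketch):

```bash
# Is corosync running, and if not, why did it fail?
systemctl status corosync
journalctl -u corosync -n 50 --no-pager
```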
This check revealed a new problem. Earlier, the cluster check had indicated a missing /etc/pve/corosync.conf, but the corosync status showed it was looking for /etc/corosync/corosync.conf; the paths seemed inconsistent.
Old T attempted to fix this issue, but it made no difference; corosync still couldn’t find the configuration file.
Rebuilding Cluster Configuration
Finally, the configuration rebuilding began.
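The rebuilt file would have followed the standard single-node PVE layout. A sketch, assuming /etc/pve was still writable; the cluster name and node ID are assumptions, while the node name and LAN address come from later in the post:

```bash
# Write a minimal single-node corosync configuration
cat > /etc/pve/corosync.conf <<'EOF'
totem {
  version: 2
  cluster_name: pve
  config_version: 1
  transport: knet
}

nodelist {
  node {
    name: epson
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.238
  }
}

quorum {
  provider: corosync_votequorum
}
EOF

systemctl restart corosync
```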
Everything seemed normal during the process, but corosync still crashed. Checking the logs revealed a missing authentication key file: Could not open /etc/corosync/authkey: No such file or directory.
Fixing the Key File
Old T quickly generated the key file and linked it to the correct path as per the error.
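The exact linking step isn’t preserved, but since corosync-keygen writes its key to /etc/corosync/authkey by default, a minimal version of the fix looks like this:

```bash
# Generate a new corosync authentication key at /etc/corosync/authkey
corosync-keygen

# Restart corosync and confirm it comes up
systemctl restart corosync
systemctl status corosync
```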
Rechecking the corosync status showed it was finally working.
However, even though corosync was fixed, the original issue of the gray question mark on the epson node remained unresolved. The root cause was still elusive.
Resetting Configuration
This marked the beginning of the deep dive.
Before resetting the cluster configuration, Old T deleted the original cluster files as usual. But, oh no—PVE completely crashed.
Fortunately, Old T had backed up files earlier and restored them from the backup.
Hoping to at least return to the initial state without a complete crash, Old T proceeded.
Yet, the problem persisted. Checking the logs revealed a strange new issue: pmxcfs couldn’t open the database file ‘/var/lib/pve-cluster/config.db’.
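A quick way to check whether the cluster database behind /etc/pve is intact (a sketch; sqlite3 may need to be installed first):

```bash
# Confirm the database file exists and is readable
ls -l /var/lib/pve-cluster/

# Run an integrity check on the SQLite database that backs /etc/pve
sqlite3 /var/lib/pve-cluster/config.db "PRAGMA integrity_check;"
```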
With a “nothing left to lose” mindset, Old T decided to go for a complete reconfiguration.
So, Old T completely removed the cluster and corosync, intending to start fresh.
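The removal roughly follows the standard “separate a node without reinstalling” sequence from the PVE documentation; a sketch, since the exact commands aren’t preserved:

```bash
# Stop the cluster stack
systemctl stop pve-cluster corosync

# Start pmxcfs in local mode so /etc/pve is writable without quorum
pmxcfs -l

# Remove the corosync configuration on both sides
rm -f /etc/pve/corosync.conf
rm -rf /etc/corosync/*

# Stop the local pmxcfs instance and bring the service back up
killall pmxcfs
systemctl start pve-cluster
```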
But it still failed. The PVE web interface was now unrecoverable. Since the virtual machines were still running, Old T didn’t push further and decided to tackle it again the next day.
Second Attempt at Cluster Repair
The next morning, Old T got up early, hoping to fix this before work.
Admittedly, a good night’s sleep brought much clearer thinking.
However, the restart still failed. Checking the logs revealed that pmxcfs couldn’t mount the filesystem to /etc/pve. Another peculiar issue.
A closer look at the /etc/pve path showed that the previous rm -rf /etc/pve/* command had only deleted some of the files in the directory: hidden files (those starting with a dot) were untouched, so the directory wasn’t actually empty.
So, he went through the process again, removed the /etc/pve directory entirely, and recreated an empty one, as sketched below.
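The distinction matters because the shell glob in rm -rf /etc/pve/* does not match dotfiles. A sketch of the corrected cleanup, assuming pve-cluster is stopped so /etc/pve is no longer a mounted FUSE filesystem:

```bash
# Stop the service so /etc/pve is just a plain directory again
systemctl stop pve-cluster

# Remove the directory itself instead of globbing its contents
rm -rf /etc/pve
mkdir /etc/pve

systemctl start pve-cluster
```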
Then, he rewrote the original two VM configurations into /etc/pve/qemu-server/100.conf and /etc/pve/qemu-server/101.conf.
Finally, he was back to square one.
That is, the node ’epson’ in PVE showed a grey question mark. Two rounds of effort, no progress.
Remotely Breaking PVE Connectivity
Old T reviewed this process with DeepSeek to figure out where things went wrong. It immediately suggested updating the network configuration right after restoring the cluster config.
Trusting its advice (perhaps foolishly), Old T followed its method to create a network configuration, which promptly took the remote PVE instance offline.
Rescuing the PVE Network
Just as one wave subsided, another arose. Frustrated by the DeepSeek mishap, Old T turned to Gemini for help.
Investigation revealed that the network itself wasn’t the main issue; the PVE host could still reach the network normally.
The real problem stemmed from an earlier mistake when rewriting the two VM configurations: the disk size had been written as ‘64 GiB’ instead of ‘64G’, so PVE failed to parse the disk settings, and after the network restart it couldn’t bring the VMs up properly.
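For reference, the offending line in 100.conf would have looked something like the following; the storage and disk names here are made up, only the size format matters:

```
# Wrong: qemu-server rejects a space and the 'GiB' suffix in the size field
scsi0: local-lvm:vm-100-disk-0,size=64 GiB

# Correct:
scsi0: local-lvm:vm-100-disk-0,size=64G
```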
However, during this network troubleshooting, Old T noticed a new issue.
The running status showed the node name as 2606:4700:3037::6815:3752, while in the corosync configuration the node name was epson.
In theory, when the corosync service starts, it reads the name ’epson’ from the config file and then resolves ’epson’ to an IP address for network communication. If that resolution goes wrong, corosync can end up identifying the node by the resolved address rather than by its name.
A detailed check revealed that Old T had previously modified the PVE hostname. Normally the node name should resolve to 192.168.1.238, but after a custom domain had been added, it was resolving to a Cloudflare IP address instead. Adjusting the order of entries in the hosts file finally fixed this bug.
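In practice the fix boils down to making sure the node name resolves to the LAN address first; a sketch (the exact hosts entries aren’t preserved):

```bash
# Check what the node name actually resolves to on the host
getent hosts epson

# /etc/hosts should give the LAN address priority, roughly:
#   192.168.1.238   epson.localdomain epson
# Any line mapping the hostname or custom domain to the public Cloudflare
# address must come after this one, or be removed.
```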
The Resolution
Old T took another look at the various functions in the PVE panel. After switching the hardware status monitoring panel’s timeframe from “Day” to “Month”, he finally spotted the major issue.
It turned out this PVE node failure had been present for about ten days. The timeline pointed to around August 21st, which lined up with the installation of pvetools at that time.
Back then, to solve the hard drive temperature display issue in the Feiniu system, Old T had installed pvetools to modify the panel settings and add temperature monitoring.
Remembering this, he immediately started verifying.
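The verification used dpkg’s built-in file check (the command is also noted in the summary below):

```bash
# Compare installed pve-manager files against the package's recorded checksums;
# a '5' in the output flags a file whose MD5 no longer matches, i.e. modified
dpkg --verify pve-manager
```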
Unsurprisingly, the pvetools installation had modified pve-manager’s backend API module and core JS library. Even though Old T recalled uninstalling the relevant pvetools components afterward, that hadn’t reverted the changes.
So, Old T reinstalled pve-manager. This finally removed the temperature display that pvetools had left on the panel.
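Reinstalling simply restores the package’s own files; a sketch:

```bash
# Re-download and reinstall pve-manager, overwriting locally modified files
apt reinstall pve-manager

# Reload the web UI afterwards (an assumed follow-up step, not confirmed above)
systemctl restart pveproxy
```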
Next, he continued investigating what other negative impact pvetools might have had on PVE.
While checking the status of the various PVE components, he found that the PVE status daemon, pvestatd, was constantly throwing errors.
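The check itself is just the status daemon’s log; the recurring error it surfaced is the one quoted in the summary below:

```bash
# Check the PVE status daemon and tail its log for recurring failures
systemctl status pvestatd
journalctl -u pvestatd -n 50 --no-pager
# Repeated line: Undefined subroutine &PVE::Network::ip_link_details
```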
Finally, Old T resolved the issue by performing a complete upgrade of the PVE components.
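The actual fix, as recorded in the summary, was nothing more than a full upgrade so that pve-manager and the PVE Perl libraries end up at matching versions:

```bash
# Bring every PVE component to a mutually compatible version
apt update && apt full-upgrade

# Restart the status daemon so it loads the updated libraries
# (an assumed follow-up; a reboot achieves the same)
systemctl restart pvestatd
```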
Problem Review and Summary
Initial Symptoms
- PVE node status showed unknown (grey question mark), VMs could not be created, and the node performance graphs were missing (showing January 1, 1970).
Detour 1: Misjudged Storage Configuration Conflict
- Suspected that invalid storage configurations (HDD1/HDD2) left over from the passed-through HDDs caused the issue. Cleared them and rebooted, but the problem persisted.
Detour 2: Mistakenly Rebuilt Cluster Configuration
- Discovered corosync.conf was missing and attempted to rebuild the cluster (including fixing paths and key files), but the node status anomaly remained.
Detour 3: Misled into Modifying Network Configuration
- Followed erroneous advice to rewrite the network configuration, causing PVE to lose connectivity. Later fixed hosts resolution (the node name was incorrectly resolving to a Cloudflare IP), but the core fault persisted.
Initial Suspicion
- Combining the timeline (fault started around August 21st) with the operation history, suspected the pvetools script (used to add temperature monitoring) was the root cause.
Key Evidence 1
- dpkg --verify pve-manager confirmed that core files (Nodes.pm, pvemanagerlib.js) had been modified by pvetools.
First Attempt
- Reinstalled pve-manager: this restored the modified files (temperature monitoring disappeared), but the node status anomaly still wasn’t fixed, indicating a deeper issue.
Decisive Evidence 2
- Checked the pvestatd logs and found the critical error: Undefined subroutine &PVE::Network::ip_link_details, clearly pointing to a library version mismatch.
Root Cause
- PVE component version conflict: the newer pve-manager was calling a function that didn’t exist in the older Perl libraries (likely libpve-common-perl).
Final Solution
- Executed a full system upgrade: apt update && apt full-upgrade, synchronizing all PVE components to compatible versions and completely resolving the issue.