If you've worked in the virtual machine space for long, you have very likely used VMware's products, especially ESXi, the bare-metal hypervisor you could activate with a free key and use for various small-scale, non-high-availability purposes. A lot of us learned the VMware ecosystem that way.
With VMware's recent shift in focus toward larger customers, the free ESXi licenses are gone; instead, VMware Workstation is now free. Workstation isn't quite the same. It is considered a "Type 2" hypervisor, which is more like an application that runs on top of an existing operating system, whereas ESXi is a "Type 1" hypervisor, which is its own host OS. A Type 1 hypervisor may be based on a typical OS like Linux, but it is highly optimized and offers more direct hardware access to guest virtual machines; with a Type 2 hypervisor, all access goes through the host OS and virtual machines are more isolated from the hardware.
Many of our clients use virtualization to varying degrees, from simple gains in resource utilization and flexibility to redundancy in services. The cost of VMware's vSphere products is now far too prohibitive for these uses, so we have been investigating the free - or really, "freemium" with paid support available - options that provide mature, stable solutions at a much lower price point, along with free versions for development and educational purposes.
We had some criteria for what we wanted in our solution:
Must Haves:
✅ Stability and reliability
✅ Easy to use interface
✅ Multiple VLAN support
✅ Run Linux and Windows VMs
✅ Can run on older hardware
✅ Import VMware VMs/OVAs
✅ Snapshots
✅ Backups
✅ Virtual TPM support
✅ Thin provisioning
Nice to haves:
✅ Live migrations, both for host and storage
✅ High Availability
✅ Clustering for centralized control
✅ Native container (e.g. "docker") support
✅ CPU Masking support for dissimilar CPU migration
✅ Native hyperconverged shared storage support
✅ Support for common automation tools like terraform
There are many options to choose from, but we focused on two mature entries in particular: XCP-ng and Proxmox VE. Both are based on Linux, so they can run on almost any hardware Linux supports.
The Basics
XCP-ng - Xen Cloud Platform - next generation
XCP-ng started as XCP in 2010, an open-sourced version of Citrix XenServer - probably the closest competitor to VMware's vSphere at the time. Citrix later open-sourced XenServer itself but then stopped providing updates, so XCP was revived in 2018 as XCP-ng, a fork of XenServer, and the project maintains ties to upstream XenServer.
XCP-ng uses the Xen hypervisor, first released in 2003, with a Linux-based control domain. The host OS is based on a CentOS 7 user space, so it will feel familiar to RHEL users, but with a highly customized kernel and updates fully maintained by the XCP-ng project, it has essentially become its own distribution.
XCP-ng follows a Long-Term Support release model intended to provide at least five years of support for an LTS release, although this appears to be evolving: the current LTS release (8.2) is expected to reach end of support in June 2025, while the current intermediate release (8.3) is expected to receive support through November 2028. We used 8.3 for this evaluation.
Using XCP-ng is best done through Xen Orchestra (XO), an appliance VM you can download and deploy for free. The downloadable appliance is very naggy about support contracts; if you are good at rolling your own from source, you can build your own XO VM with fewer nags, but then you also need to maintain updates yourself. An "XO Lite" interface is built into the host but offers limited functionality at this time - mainly starting and stopping VMs - useful if you accidentally shut down XO. XO offers a fairly simple web interface, and command-line operations are also available on the hosts.
Clustering is done by simply forming a pool with the member nodes. This allows you to allocate networking across the cluster, provided the network devices are named identically at the host level. Through XO, you can do live migrations between hosts and storage in the pool. Local storage is typically an ext4 filesystem (LVM and ZFS are also options, with the latter requiring manual setup) with the VM disks stored as VHD files. Snapshots are fully supported. Thin provisioning is applied at the storage level and is automatic with ext4 storage. High Availability is available in a pool, but as you would expect, it requires shared storage of some form and at least three hosts to maintain quorum.
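Most of this is point-and-click in XO, but the same operations are available from the hosts' xe CLI. A minimal sketch - the addresses, host names, and VM names here are examples, not values from our environment:

    # On a host you want to add: join it to an existing pool
    xe pool-join master-address=10.0.0.10 master-username=root master-password='secret'

    # Live-migrate a running VM to another host in the pool
    xe vm-migrate vm=web01 host=xcp-host2 live=true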
Networking itself was straightforward, with easy support for VLAN tagging and named networks. In some screens, multi-VLAN networks can show up as the raw device as well as the VLAN networks, which can be a bit confusing. Also confusing at times is that a network may show as "Disconnected" yet "Physically connected" - the main "Connected/Disconnected" state actually refers to its active use in the cluster. If no IP is assigned to the host itself on the interface, and no running VMs use that network, it will show as Disconnected.
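Creating a tagged network can also be done from the CLI; a sketch, assuming the trunk NIC shows up as eth0 in dom0 and VLAN 20 is the tag you want:

    # Create a named network, find the physical interface (PIF), then attach VLAN 20 to it
    xe network-create name-label=vlan20-servers
    xe pif-list device=eth0 params=uuid
    xe vlan-create network-uuid=<network-uuid> pif-uuid=<pif-uuid> vlan=20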
Backups of VMs can be made directly in the XO interface, can be scheduled using a cron-like syntax, and support both full and incremental (delta) backups. Note that backups with the downloaded XO appliance require a license. The roll-your-own edition does not require a license for backups, but file-level restores of a Linux VM did not work for us.
When dealing with dissimilar CPUs in a pool, XCP-ng automatically applies CPU masking to the lowest common feature set across all members of the pool. There is no need to manually configure CPU masks.
Alas, there is no native container support. As for hyperconverged replicated storage, XOSTOR, based on LINSTOR/DRBD, is available, but we were not able to test it at this time due to a lack of available network ports and its speed and memory requirements. At least three nodes are required, and dedicated 10G networking is highly recommended.
Importing VMs from VMware (which requires a non-free ESXi license to enable the API) was fairly simple, with a direct interface to ESXi/vCenter. The complicating factor is that the Xen hypervisor requires a different set of drivers than most, so it is a good idea to install the drivers before attempting a migration. We had some trouble importing Windows 10 VMs during our tests, likely because of the drivers, but Linux worked fine. Imports also appeared to place some sort of lock on the VM in vSphere, which prevents vCenter from starting the VM again after the import. Taking a snapshot prior to import and reverting to it afterward seems to fix the problem.
In Linux guests, disks and network interfaces are presented differently (/dev/xvda instead of /dev/sda, enXn instead of enpNsN, etc.). Modern Linux handles the disk name change as long as you use LVM paths, UUIDs, or labels for the mounts. Not so for the network interfaces, which come up under different names and must be adjusted.
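For example (a sketch for illustration - the connection profile name and interface names are assumptions, not taken from our test VMs):

    # Mount by UUID so the sda -> xvda rename doesn't matter; find the UUID with blkid
    blkid /dev/xvda2
    # then in /etc/fstab:  UUID=<uuid-from-blkid>  /data  ext4  defaults  0 2

    # NIC names change too; point the existing NetworkManager profile at the new name
    nmcli connection modify "System enp1s0" connection.interface-name enX0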
OVA/XVA templates, VHD, VMDK, and RAW disks can be imported as well.
In terms of automation - terraform being our preferred target - a community terraform provider is listed under the name "xenorchestra", although it seems fairly young. You can manage VMs and networks with it.
Proxmox VE - Proxmox Virtual Environment
Proxmox VE (aka PVE) was first publicly released in 2008, built on the newer "Kernel-based Virtual Machine" (KVM) hypervisor rather than Xen. The focus of PVE was to provide a proper GUI and backup tools, and it supports full virtual machines as well as LXC containers, which are lighter weight than VMs for Linux workloads.
PVE is built on a Debian base OS (Debian 12 "Bookworm" as of PVE 8.3), but is highly customized.
Proxmox provides its own built-in, web-based GUI with no need for an appliance, and any node's GUI can manage the entire cluster. Nodes join the cluster via a token obtained in the GUI of the first node once you create the cluster. As with most clusters, HA generally requires shared storage and three hosts; however, you can run Corosync (the underlying cluster technology) on a "tie-breaking" host that is not otherwise part of the cluster (a Raspberry Pi, for example), or you can give a specific host more "votes" in the quorum, though that host then becomes vital to the cluster as a whole.
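The CLI equivalents are simple as well; a sketch with example names and addresses (the QDevice step assumes corosync-qnetd is running on the external tie-breaker host):

    # On the first node: create the cluster
    pvecm create prod-cluster

    # On each additional node: join it, pointing at an existing cluster node
    pvecm add 10.0.0.10

    # Optional external tie-breaker vote for small clusters
    pvecm qdevice setup 10.0.0.50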
Removing nodes from the cluster, however, can be very cumbersome and result in problems.
The GUI presents a pretty familiar, if complex, interface with an "explorer"-style layout similar to vCenter: VMs, storage, etc. are laid out in an expandable tree on the left, with details presented in a larger pane. There are lots of configurable options, especially around VMs, and it can take some time and testing to get used to all of them.
Networking can also be a bit more complicated, especially around tagged VLANs. Both "Linux" and "Open vSwitch" networking are supported, but "Linux" is generally recommended. For VLANs, you create a VLAN-aware bridge, on which you can then create VLAN interfaces - but these only affect the host, not the VMs. You would create a VLAN interface on the host if you wanted to give the host an IP on that VLAN. VMs do not see these VLAN interfaces, only the bridges - you must still provide the VLAN tag in the VM's network configuration.
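A minimal sketch of the host side, assuming a single trunk NIC named eno1 and VLAN 20 for the host's own IP (names and addresses are examples):

    # /etc/network/interfaces fragment: a VLAN-aware bridge on the trunk NIC,
    # plus a VLAN 20 sub-interface only because the host itself needs an IP there
    auto vmbr0
    iface vmbr0 inet manual
            bridge-ports eno1
            bridge-stp off
            bridge-fd 0
            bridge-vlan-aware yes
            bridge-vids 2-4094

    auto vmbr0.20
    iface vmbr0.20 inet static
            address 192.168.20.5/24

On the VM side, the tag still goes on the virtual NIC itself, e.g. 'qm set 101 --net0 virtio,bridge=vmbr0,tag=20'.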
An alternative when you are in a cluster is to use Software Defined Networking (SDN). Like XCP-ng's pool-based networking, this allows you to create named networks across the cluster, which can be seen in the network configuration of the VMs and does not require tagging at the VM level.
Snapshots and thin provisioning are provided but can depend on the storage and VM disk types. For example, lvm-thin storage supports snapshots (raw is the only supported disk type there) and thin provisioning for all VMs in the store, but raw disks on ext4 stores do not support snapshots - for that, you need to use qcow2 disks.
Backups are provided built-in, but only at the full-VM level - no file restores. To get file-level restores, you can deploy the Proxmox Backup Server, which acts as a storage proxy to datastores either local to itself (it can be deployed on bare metal) or on other network stores. It can also manage retention, offloading that work from the PVE hosts. File restores worked easily, although you need to know something about Linux filesystem structures to find the files.
CPU masking is a manual process: when creating each VM, you set the CPU type. The default is "x86-64-v2-AES", and there are many to choose from - and unfortunately, there does not currently seem to be a way to change the default. If you have a homogeneous cluster, the ideal type is "host", which does no masking and gives you the maximum capabilities of the CPU. Otherwise, to support live migrations and the like, you need to find the type that matches the lowest-common-featured (often the oldest) CPU in the cluster.
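The type can also be changed after the fact from the CLI (the VM ID 101 here is just an example):

    # Homogeneous cluster: expose the full CPU
    qm set 101 --cpu host

    # Mixed cluster: pick a common baseline so live migration keeps working
    qm set 101 --cpu x86-64-v2-AES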
PVE supports LXC containers natively. LXC is a container type where resources are isolated but the containers share the host's kernel. They give you much of the control you have with a VM while using far fewer resources. Other container types, if needed, would have to run inside a VM.
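Containers are managed with the pct tool. A sketch (the template file name below is an example - list what is actually available with pveam):

    # Download a container template to local storage, then create and start a small container
    pveam download local debian-12-standard_12.7-1_amd64.tar.zst
    pct create 200 local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst \
        --hostname lxc-test --memory 1024 --cores 2 \
        --net0 name=eth0,bridge=vmbr0,ip=dhcp \
        --rootfs local-lvm:8
    pct start 200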
Imports from licensed ESXi servers are fairly simple to set up and mostly seamless, as PVE supports the VMware paravirtual drivers as well as the common ones. You can switch a Linux guest to the preferred "VirtIO SCSI" controller by first adding the "virtio_scsi" driver to the initramfs with 'dracut --add-drivers "virtio_scsi" --force'. Even our Windows 10 import tests were seamless. It is recommended that you remove other guest agents BEFORE importing.
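A sketch of that sequence for a Linux guest (the VM ID is an example):

    # Inside the guest, before switching controllers: make sure the initramfs includes the driver
    dracut --add-drivers "virtio_scsi" --force

    # On the PVE host: switch the imported VM to the VirtIO SCSI controller
    qm set 101 --scsihw virtio-scsi-pci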
Hyperconverged replicated storage is available via Ceph, but similar to XOSTOR, it has high resource requirements, and enterprise-grade SSDs are strongly recommended.
Local storage has several options - at installation time, you are typically offered lvm-thin and ext4. There are performance implications here, covered in the next section. ZFS is also an option, but beware that ZFS's write amplification can prematurely wear out your SSDs, especially if you are deploying on consumer-grade SSDs.
For automation with terraform, there are A LOT of community providers listed, some of which appear to be forks of others. We briefly tested the one from "telmate", and it seems pretty solid and simple to set up. There is another from "bpg" that appears to have many more features. It remains to be seen which provider, if any, will become an "official" one.
Performance Comparisons
To test the performance of a virtual machine in each environment, we made sure that all hosts in the clusters were identical. In this case, we used 100% identical Dell Precision 3431 small form factor units with 32GB DDR4 RAM, Intel i7-9700 CPU, and a Crucial BX500 2TB SSD.
To start, we used all the defaults for creating a Linux VM (except setting the CPU type in PVE to "host", though in this case it made no noticeable difference). Each VM ran an identical AlmaLinux 9.5 install with 4GB of RAM, a 20GB disk, and the sysbench utility for measuring performance. No other VMs or services ran on the test host.
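For reference, sysbench invocations along these lines exercise the areas discussed below (illustrative only - the exact sizes, thread counts, and run times used in our tests are not reproduced here):

    # CPU and memory
    sysbench cpu run
    sysbench memory run
    sysbench memory --memory-access-mode=rnd run

    # Disk I/O: prepare the test files once, then run sequential and random passes
    sysbench fileio prepare
    sysbench fileio --file-test-mode=seqrd run
    sysbench fileio --file-test-mode=seqwr run
    sysbench fileio --file-test-mode=rndrd run
    sysbench fileio --file-test-mode=rndwr run
    sysbench fileio cleanup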
I won't bore you with the details of the CPU tests…no difference was seen. Memory, however, showed a significant difference. This graph shows 10 runs of "sysbench memory run" each:
XCP-ng very consistently delivers less than 30% of PVE's memory performance for sequential access and about 50% less for random access. We have not been able to determine why.
The other performance tests I ran were for disk I/O, both sequential and random reads and writes. The results here were a bit surprising at first…for the read tests, PVE beats XCP-ng by about 15%...
But the write performance flips this on its head:
Note the dropoff from XCP-ng in one of the runs - we fairly consistently saw a momentary write performance drop on XCP-ng that we did not notice on PVE. More importantly, though, XCP-ng overall beat the pants off PVE…and there had to be a reason. The reason is the storage type each system uses.
XCP-ng uses the tried-and-true Linux ext4 filesystem on a standard LVM logical volume and handles thin provisioning within the VHD disk format used for VM storage. ext4 performs well in a variety of situations. PVE, on the other hand, gives you a choice at install time of ext4 or "lvm-thin" (the default), which uses a thinpool logical volume, so thin provisioning is actually a function of LVM; VMs provisioned on it use the "raw" storage format, which does not itself handle thin provisioning. lvm-thin is the apparent bottleneck here.
We decided to check what the alternatives offered, so we removed the lvm-thin volume and made it ext4 instead. Note that how the VM disks are stored also matters: since neither the raw disk format nor ext4 handles thin provisioning, the default VM creation results in a thick-provisioned disk. Also, since snapshots were a feature of lvm-thin, snapshotting was no longer available for the VM. We ran the tests anyway, then repeated them with the VM using the "qcow2" disk format, which supports thin provisioning and snapshots regardless of the storage type. Ironically, PVE also supports the VMware "vmdk" format, but does not support snapshots with it - and it performed much more poorly in our tests, so we do not recommend it.
As a side note, if you migrate a VM from one storage type to another, PVE will convert the VM's disk format to a supported one as well. So if you have a qcow2 VM on ext4 and migrate it to an lvm-thin volume, it will be converted to raw. If you migrate it back to ext4, it is NOT converted back to qcow2, and you now have a thick-provisioned VM.
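If you move disks from the CLI, you can at least force the target format explicitly; a sketch, with an example VM ID and storage name (the subcommand is spelled "qm move-disk" on older PVE releases):

    # Move the VM's scsi0 disk to the "local" ext4 storage, explicitly requesting qcow2
    qm disk move 101 scsi0 local --format qcow2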
We also tested the three disk cache options, which also affect performance: NoCache (the default!), WriteBack, and WriteThrough.
The blue bars represent the default lvm-thin with the raw format (raw is the only format supported on an lvm-thin volume), orange is the raw format on ext4, and green is qcow2 on ext4. The line represents the XCP-ng average, as XCP-ng does not offer the same adjustments as PVE. As you can see, for sequential reads ext4 outperforms lvm-thin in all cases, while for random writes the raw format on ext4 struggled more for some reason. But ext4/qcow2 is significantly faster across the board.
Also, you can see that the cache policy set on the VM's disk has an impact. Counterintuitively, WriteThrough very consistently performed better than WriteBack in our tests, even though it really should be the opposite: WriteThrough is the safer option, ensuring writes make it to storage so that in the case of power loss (with no battery-backed storage cache) nothing is lost. I reran both tests to be sure. This may be due to the testing hardware and how the OS cache and SSD interact, so you may wish to run your own tests on your particular hardware - and understand the implications of the various cache options.
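If you want to experiment with cache modes yourself, they can be set per disk from the CLI as well; a sketch only - the volume ID shown is an example, you re-specify the disk's existing volume when changing options, and the GUI's disk Edit dialog is the simpler route:

    # Change the cache mode on an existing disk by re-specifying it with the new option
    qm set 101 --scsi0 local:101/vm-101-disk-0.qcow2,cache=writethrough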
Pros and Cons
XCP-ng
Pros:
✅ Simple interface, although not without idiosyncrasies
✅ Easy to get started
✅ Clustering was easy
✅ CPU masking is automatic within a heterogeneous pool
✅ Networking can easily be applied across a pool
✅ Performs decently without needing lots of tweaks
Cons:
🚫 Unable to get file-level restores to work - possibly a license issue or XO bug
🚫 XO GUI can give some false or confusing results on the connected status of networks
🚫 Imports are more problematic due to driver and hardware differences
🚫 The central GUI is an appliance that consumes resources in the cluster…though that is not much different from vCenter, and it is lighter weight
🚫 Standalone interface is far from ready
🚫 No native container support
🚫 Easy pool-wide networking setup requires identically named NICs, etc. across pool members
Proxmox
Pros:
✅ Familiar interface for vCenter refugees
✅ Greater adoption means more community support
✅ Multiple VM console options like SPICE, VNC
✅ Supports Open vSwitch networking as an option (adds overhead, so stick to Linux networking unless needed)
Cons:
🚫 Parts of it seem outdated when configuring VMs, such as for a Linux VM only being able to choose between a Linux kernel version of 2.4 or 2.6 (always choose 2.6 for anything later)
🚫 CPU masking is manually set per VM
🚫 Migrations require clustering and a quorum, so you need at least 3 nodes (preferably an odd number) or make a node more important
🚫 Migrations can change the disk type of a VM - migrating a qcow2 disk to an lvm-thin volume converts it to raw, but migrating it back leaves it as raw
🚫 Software-Defined Networking doesn't allow the host to have an IP on the SDN network (say, for migrations), so a regular Linux Bridge has to be used instead for the host’s IP
🚫 Removing nodes from a cluster is problematic
🚫 So…many…options. Can easily get into the weeds trying to figure out the best options for any given VM
Conclusion
Despite XCP-ng's pedigree, some parts still feel rough around the edges. We encountered several glitches in the UI and operations that failed for no obvious reason - for example, all migration tasks currently report as failed, even when they succeed. But for a simple-to-operate, quick-to-deploy virtual machine infrastructure, it is a good choice.
Proxmox VE is the choice for power users. It can be adjusted and tweaked in many ways to tune performance. But this is also its weakness: it is not obvious that you will likely need to make these adjustments to get optimal performance, the right adjustments vary with the hardware, and it is not currently possible to change the defaults - and the built-in presets are not necessarily optimal. It does enjoy a large community, so help is available. It is certainly the choice for administrators who are familiar with how virtualization works.
How iuvo Can Help
Ready to break free from costly VMware licensing but need expert guidance? Our team can help you transition smoothly. Contact us today!