Building a Private Cloud with Open Source
Why Build a Private Cloud
The decision to build a private cloud rather than using public cloud services comes down to a few concrete factors that vary by organization. The first is cost at sustained scale. Public cloud pricing is consumption-based, which works well for variable workloads but becomes expensive for stable baseline capacity. An organization running 100 VMs continuously on AWS EC2 pays significantly more per year than the amortized cost of owning the hardware and electricity to run those same VMs on a private cloud. The crossover point where private cloud becomes cheaper than public cloud depends on workload stability, data transfer volumes, and local electricity costs, but most analyses place it somewhere between 30 and 60 consistently utilized servers.
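As a rough illustration of how such a crossover point falls out of the arithmetic, the sketch below compares a flat hourly public cloud rate against amortized hardware plus a fixed yearly operations cost. Every figure in it (the hourly rate, hardware price, lifetime, power cost, and operations overhead) is a hypothetical placeholder rather than a quoted price, and the model deliberately ignores data transfer charges and committed-use discounts.

```python
from typing import Optional

HOURS_PER_YEAR = 24 * 365


def public_cost_per_year(servers: int, hourly_rate: float) -> float:
    """Consumption-based cost of running `servers` instances all year."""
    return servers * hourly_rate * HOURS_PER_YEAR


def private_cost_per_year(servers: int, hardware_price: float,
                          lifetime_years: float, power_per_server: float,
                          fixed_operations: float) -> float:
    """Amortized hardware and power per server, plus a fixed yearly overhead."""
    per_server = hardware_price / lifetime_years + power_per_server
    return servers * per_server + fixed_operations


def crossover(hourly_rate: float, hardware_price: float, lifetime_years: float,
              power_per_server: float, fixed_operations: float,
              max_servers: int = 1000) -> Optional[int]:
    """Smallest server count at which owning is cheaper, or None if never."""
    for n in range(1, max_servers + 1):
        private = private_cost_per_year(n, hardware_price, lifetime_years,
                                        power_per_server, fixed_operations)
        if private < public_cost_per_year(n, hourly_rate):
            return n
    return None


# Hypothetical inputs, chosen only to make the arithmetic concrete.
print(crossover(hourly_rate=0.40, hardware_price=8000, lifetime_years=4,
                power_per_server=700, fixed_operations=30000))  # 38
```

The fixed operations cost is what creates a crossover at all: below it, the per-server savings never pay for the overhead of running your own platform. Substitute your own quotes and staffing costs before drawing any conclusion from the result.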
Data sovereignty is the second major driver. Organizations subject to GDPR, HIPAA, PCI DSS, ITAR, or government classification requirements often need to demonstrate that data never leaves specific physical locations and is accessible only to authorized personnel with appropriate clearances. A private cloud on owned hardware in a controlled facility lets the organization demonstrate both directly, because it controls the physical location and the access list itself, while public cloud compliance depends on the provider's documentation, audit reports, and the specific region and service configuration being used.
Performance predictability matters for latency-sensitive workloads. In a private cloud, your VMs do not share physical hosts with unknown workloads from other tenants. There are no noisy neighbors competing for I/O bandwidth, no hypervisor-level CPU throttling imposed by the provider, and no network congestion from co-located customers. Financial trading platforms, real-time communication systems, and databases serving latency-critical applications benefit significantly from the consistent performance that dedicated hardware provides.
Strategic independence rounds out the case. Organizations that build core business functionality on proprietary cloud services (AWS Lambda functions, Azure Cognitive Services, Google Cloud Spanner) create dependencies that are expensive and disruptive to unwind. A private cloud built on open source software and standard Linux, KVM, and Ceph components can be replicated, migrated, or supplemented with capacity from any provider that supports similar open standards.
Hardware Planning
The hardware foundation of a private cloud consists of three categories: compute nodes, storage nodes (or converged storage), and networking infrastructure. Getting the sizing right requires understanding your current workload profile and projected growth, because both under-provisioning (running out of capacity too soon) and over-provisioning (spending capital on hardware that sits idle) waste money.
Compute nodes are servers that run virtual machines and containers. Each compute node contributes its CPU cores and RAM to the shared resource pool. The key specifications are CPU core count (which determines how many VMs can run concurrently, accounting for typical overcommit ratios of 2:1 to 4:1 for CPU), RAM capacity (which is the most common bottleneck, because memory cannot be overcommitted as aggressively as CPU without performance degradation), and local storage for the operating system and VM ephemeral disks. Modern compute nodes typically use dual-socket servers with 32 to 128 cores per node and 256 GB to 1 TB of RAM. For a starting private cloud, three to five compute nodes provide enough capacity for meaningful workloads while establishing the infrastructure patterns that scale to larger deployments.
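The interaction between CPU overcommit and RAM can be checked with a few lines of arithmetic. The following sketch estimates how many identical VMs fit on one node and which resource runs out first. The 16 GB host reservation, the 1:1 memory ratio, and the example VM size are assumptions for illustration, not recommendations from any platform.

```python
from dataclasses import dataclass


@dataclass
class ComputeNode:
    cores: int
    ram_gb: int


def vm_capacity(node: ComputeNode, vm_vcpus: int, vm_ram_gb: int,
                cpu_overcommit: float = 4.0, ram_overcommit: float = 1.0,
                reserved_ram_gb: int = 16):
    """Return (max VMs on the node, name of the limiting resource)."""
    # vCPUs available after applying the CPU overcommit ratio.
    by_cpu = int(node.cores * cpu_overcommit // vm_vcpus)
    # RAM left after reserving some for the hypervisor and host services.
    by_ram = int((node.ram_gb - reserved_ram_gb) * ram_overcommit // vm_ram_gb)
    limit = "ram" if by_ram <= by_cpu else "cpu"
    return min(by_cpu, by_ram), limit


# A 64-core, 512 GB node hosting 4 vCPU / 16 GB VMs at 4:1 CPU overcommit.
print(vm_capacity(ComputeNode(cores=64, ram_gb=512), vm_vcpus=4, vm_ram_gb=16))
# (31, 'ram')
```

In this example the CPU budget would allow 64 VMs, but memory caps the node at 31, which is the pattern the paragraph above describes.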
Storage can be handled in two ways. Dedicated storage nodes run distributed storage software like Ceph, providing a shared storage pool that all compute nodes access over the network. This approach separates compute and storage scaling, allowing you to add more storage capacity independently of compute capacity. Alternatively, converged or hyper-converged infrastructure (HCI) runs storage software directly on the compute nodes, using local drives on each compute node as part of the distributed storage pool. Proxmox VE's built-in Ceph integration and OpenStack with Ceph on compute nodes both support this model. HCI uses hardware more efficiently, but storage and compute can no longer scale independently: every node added for storage capacity also adds compute, which may not match your scaling needs.
For Ceph deployments, plan for a minimum of three storage nodes (to satisfy Ceph's default three-way replication), each with a mix of fast SSDs (for the metadata database and write-ahead log of each OSD, the role that journals played in Ceph's older FileStore backend) and larger HDDs or NVMe drives (for data). Capacity sizing must account for the replication factor: with three-way replication, 30 TB of raw disk yields about 10 TB of usable storage, and somewhat less in practice because a cluster should not be run completely full.
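The replication arithmetic is simple enough to script when comparing hardware quotes. This is a minimal sketch for replicated pools only; the `fill_ceiling` parameter is an assumption added here to model the practice of keeping a cluster below a chosen fill level (0.85 matches Ceph's default near-full warning ratio).

```python
def usable_tb(raw_tb: float, replicas: int = 3, fill_ceiling: float = 1.0) -> float:
    """Usable capacity of a replicated pool built on `raw_tb` of raw disk."""
    return raw_tb / replicas * fill_ceiling


def raw_tb_needed(target_usable_tb: float, replicas: int = 3,
                  fill_ceiling: float = 1.0) -> float:
    """Raw disk required to offer `target_usable_tb` of usable capacity."""
    return target_usable_tb * replicas / fill_ceiling


print(usable_tb(30))                       # 10.0, the figure from the text
print(usable_tb(30, fill_ceiling=0.85))    # about 8.5 when kept below 85% full
print(raw_tb_needed(50))                   # 150.0 raw TB for 50 TB usable
```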
Networking requires switches that support the throughput your VMs will generate and the isolation features your multi-tenant model needs. At minimum, deploy separate management and data networks, either as physically separate switches or as VLANs on managed switches. For Ceph storage traffic, a dedicated storage network (ideally 25 Gbps or higher) prevents storage I/O from competing with VM traffic and management traffic. Spine-leaf network topologies scale better than traditional three-tier architectures and are recommended for deployments that will grow beyond a single rack.
Choosing a Platform
The platform choice depends primarily on the level of cloud functionality you need and the size of your operations team.
Proxmox VE is the right choice for organizations that need VM and container management without full cloud features like self-service provisioning, API-driven orchestration, or multi-tenant isolation. Proxmox installs from an ISO in minutes, includes a web management interface, supports Ceph and ZFS natively, and can be operated by a single administrator. It is the most practical choice for deployments of 3 to 20 nodes where the primary users are infrastructure administrators rather than self-service development teams.
Apache CloudStack provides full cloud functionality with significantly lower operational complexity than OpenStack. It is a strong choice for organizations that need self-service VM provisioning, tenant isolation, usage metering, and API access, but do not have the team to maintain a distributed-services architecture. CloudStack's monolithic design means fewer components to troubleshoot, simpler upgrades, and faster incident recovery. It works particularly well for managed hosting companies and internal IT teams serving multiple departments.
OpenNebula fits organizations that need a lightweight cloud platform with minimal infrastructure overhead. Its centralized architecture makes it the easiest IaaS platform to install and operate, and its edge provisioning features make it strong for geographically distributed deployments. OpenNebula 7.2's GPU support also makes it competitive for AI and machine learning infrastructure.
OpenStack is appropriate when you need the deepest feature set, the most flexible architecture, and the ability to scale beyond several hundred compute nodes. OpenStack requires a larger operations team (typically three or more dedicated engineers) and deeper expertise, but it provides capabilities that simpler platforms cannot match, including sophisticated software-defined networking, bare-metal provisioning, container infrastructure management, and integration with dozens of commercial storage and networking vendors.
Networking Architecture
Network design is where most private cloud projects succeed or fail. A well-designed network provides isolation between tenants, predictable performance for all workloads, and secure management access. A poorly designed network produces intermittent connectivity failures, performance bottlenecks, and security exposures that are difficult to diagnose and fix.
The standard approach uses three separate networks. The management network carries API traffic, database replication, message queue communication, and administrative SSH access. This network should be on a private, non-routable subnet accessible only from trusted management workstations. The data network (also called the tenant or overlay network) carries VM-to-VM traffic using VXLAN or GRE encapsulation, isolating each tenant's traffic within virtual tunnels regardless of the physical network topology. The provider or external network connects VMs to the internet or the organization's corporate network through floating IPs or direct provider network attachment.
For storage traffic, a fourth network dedicated to Ceph or other storage replication traffic prevents storage I/O from competing with management and tenant traffic. This network should use the highest available bandwidth (25 Gbps or 100 Gbps) because storage replication and recovery operations generate sustained high throughput that can saturate slower links.
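To see why link speed matters during recovery, consider how long it takes to re-replicate the contents of a failed node. The sketch below is back-of-the-envelope arithmetic only: the 70 percent link efficiency and the 20 TB example are assumed figures, and real Ceph recovery is also bounded by disk speed and recovery throttling settings.

```python
def recovery_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours needed to move `data_tb` terabytes over one link.

    `efficiency` is the assumed fraction of line rate actually achieved.
    """
    bits_to_move = data_tb * 1e12 * 8
    bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / bits_per_second / 3600


# Re-replicating 20 TB after a node failure at three common link speeds.
for gbps in (10, 25, 100):
    print(f"{gbps:>3} Gbps: {recovery_hours(20, gbps):.1f} hours")
```

Under these assumptions the same recovery takes roughly 6.3 hours at 10 Gbps, 2.5 hours at 25 Gbps, and 0.6 hours at 100 Gbps, and the cluster runs with reduced redundancy for that whole window.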
Each compute node needs at least three network interfaces (or one high-bandwidth interface with VLAN tagging) to connect to these networks. On production deployments, bond the interfaces for redundancy: use LACP (802.3ad) for active-active bonding where the switches support it, and active-backup bonding where they do not. Document every VLAN ID, IP subnet, and gateway before starting platform installation, because changing the network design after the platform is deployed is disruptive and error-prone.
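One way to keep that documentation honest is to store the plan as data and check it mechanically before installation. The sketch below uses Python's standard `ipaddress` module; the network names, VLAN IDs, subnets, and gateways are hypothetical placeholders, and `validate_plan` is an illustrative helper rather than part of any platform's tooling.

```python
import ipaddress
from itertools import combinations

# Hypothetical plan: every VLAN ID, subnet, and gateway here is a placeholder.
NETWORK_PLAN = {
    "management": {"vlan": 10, "subnet": "10.10.0.0/24", "gateway": "10.10.0.1"},
    "tenant": {"vlan": 20, "subnet": "10.20.0.0/24", "gateway": None},
    "provider": {"vlan": 30, "subnet": "192.0.2.0/24", "gateway": "192.0.2.1"},
    "storage": {"vlan": 40, "subnet": "10.40.0.0/24", "gateway": None},
}


def validate_plan(plan):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    seen_vlans = {}
    for name, net in plan.items():
        vlan = net["vlan"]
        if not 1 <= vlan <= 4094:
            problems.append(f"{name}: VLAN {vlan} is outside 1-4094")
        if vlan in seen_vlans:
            problems.append(f"{name}: VLAN {vlan} already used by {seen_vlans[vlan]}")
        seen_vlans.setdefault(vlan, name)
        subnet = ipaddress.ip_network(net["subnet"])
        gateway = net["gateway"]
        if gateway is not None and ipaddress.ip_address(gateway) not in subnet:
            problems.append(f"{name}: gateway {gateway} is not inside {subnet}")
    # Any two networks with overlapping address space will cause routing surprises.
    for (a, net_a), (b, net_b) in combinations(plan.items(), 2):
        sub_a = ipaddress.ip_network(net_a["subnet"])
        sub_b = ipaddress.ip_network(net_b["subnet"])
        if sub_a.overlaps(sub_b):
            problems.append(f"{a} and {b}: subnets overlap")
    return problems


print(validate_plan(NETWORK_PLAN))  # [] for the placeholder plan above
```

Keeping the plan in version control alongside a check like this makes it harder for a duplicate VLAN or overlapping subnet to reach the switches unnoticed.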
Storage Architecture
Storage is the second most common source of private cloud problems after networking. The storage system must provide adequate IOPS for the expected VM workloads, sufficient throughput for bulk operations like backups and migrations, and enough redundancy to survive disk and node failures without data loss.
Ceph is the dominant open source storage solution for private clouds, used as the backend for OpenStack Cinder (block storage), Glance (image storage), and Nova (ephemeral instance storage), and integrated directly into Proxmox VE's management interface. Ceph distributes data across multiple nodes with configurable replication (typically three copies), providing both high availability and horizontal scalability. Adding storage capacity means adding more drives to the cluster, each managed by its own OSD (Object Storage Daemon), after which Ceph rebalances existing data onto them.
For performance-sensitive workloads, use NVMe SSDs for Ceph OSDs rather than SATA SSDs or HDDs. NVMe provides the IOPS and latency needed for database workloads, high-frequency logging, and other I/O-intensive applications. Use a tiered approach where NVMe pools serve performance-critical VMs and HDD pools serve archival, backup, and bulk storage workloads.
If Ceph's complexity is more than your team wants to manage, NFS-based storage from a NAS appliance or a Linux NFS server provides a simpler alternative for smaller deployments. NFS lacks Ceph's distributed redundancy but works reliably for environments where the NAS itself provides hardware RAID and high availability. ZFS on local disks, which Proxmox supports natively, provides another simple option for environments where VMs do not need to migrate between hosts frequently. ZFS replication can synchronize storage between nodes, but it is not as seamless as Ceph's shared storage model.
Operational Readiness
Building the platform is only half the work. Operating it reliably requires monitoring, backup, capacity planning, and incident response procedures that many organizations underestimate during the initial build phase.
Monitoring should cover the physical layer (server hardware health, disk SMART status, temperature, power supply status), the platform layer (API response times, service health, message queue depths, database replication lag), the storage layer (Ceph cluster health, OSD status, pool utilization, IOPS and throughput metrics), and the workload layer (VM CPU, memory, and disk utilization by tenant). Prometheus with Grafana dashboards and Alertmanager is the most common open source monitoring stack for private clouds. Configure alerts for conditions that indicate emerging problems, not just for outages that have already occurred.
Backup must cover the platform's configuration and databases (so you can rebuild the management plane if needed), tenant VM data (through scheduled snapshots, Ceph RBD snapshots, or external backup tools like Proxmox Backup Server, Restic, or Bacula), and storage system metadata (Ceph monitor data, CRUSH maps). Store backups off-site, ideally in a geographically separate location. Test restore procedures regularly.
Capacity planning requires trending resource utilization over time to predict when you will need to add compute, storage, or network capacity. Start tracking utilization metrics from day one, even if the cloud is lightly loaded initially. Ordering, racking, and provisioning new hardware takes weeks, so you need enough lead time to add capacity before existing resources are exhausted.
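A simple linear trend is often enough to turn utilization history into an ordering decision. The following sketch fits a least-squares line to (day, used capacity) samples and estimates the days remaining until a pool is full. The sample values and the 60-day procurement lead time are made up for illustration, and a straight line is an assumption that will not hold for bursty or seasonal growth.

```python
def days_until_full(samples, capacity):
    """Estimate days from the last sample until `capacity` is reached.

    `samples` is a list of (day_number, used) pairs in the same unit as
    `capacity`. Returns None when no upward trend can be fitted.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None
    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    return max(0.0, full_day - samples[-1][0])


# Hypothetical monthly samples of used storage in TB on a 100 TB pool.
history = [(0, 40), (30, 46), (60, 52), (90, 58)]
LEAD_TIME_DAYS = 60  # assumed time to order, rack, and provision hardware

remaining = days_until_full(history, capacity=100)
print(remaining)  # about 210 days at the current growth rate
print(remaining is not None and remaining <= LEAD_TIME_DAYS)  # False
```

In practice the same calculation would be fed from the monitoring system's stored metrics and compared against a threshold well below 100 percent, since storage and memory both degrade before they are completely exhausted.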
A successful private cloud is 20 percent platform selection, 40 percent networking and storage architecture, and 40 percent operational readiness. The platform itself is the easiest part to get right. Invest heavily in network design and monitoring from day one, because these are the areas where shortcuts cause the most long-term pain.