INFRASTRUCTURE 7 min read Jan 25, 2026

OpenStack at Scale: What We Learned Running 2,000+ VMs

OpenStack is powerful but unforgiving. Here are the hard-won lessons from deploying and operating it for production cloud services.

Dario Ristić

CTO & Founder

OpenStack is not a product — it's a toolkit. Out of the box, it gives you the building blocks for a cloud platform. Turning those blocks into a reliable, production-grade service that customers trust with their workloads is a different story entirely.

After deploying OpenStack for multiple clients on PLATFORMA and managing over 2,000 VMs in production, here's what we've learned.

Lesson 1: Networking Will Break Your Spirit

Neutron, OpenStack's networking component, is the single biggest source of complexity and failure. VLAN configuration, floating IP allocation, security groups, and DNS integration all need to work perfectly for every tenant. One misconfigured network bridge and an entire compute node goes dark.

Our approach: standardize the network topology. Every deployment uses the same VLAN layout, the same security group templates, and the same floating IP pools. We don't let clients customize network architecture — that's where chaos lives. Instead, we provide a tested, validated topology that works reliably.
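The standardized-topology idea can be sketched as a validation step: compare each deployment against one canonical layout and flag any deviation. This is an illustrative sketch, not our actual tooling; the VLAN IDs, pool CIDR, and field names below are hypothetical placeholders.

```python
# Hypothetical standard tenant topology. VLAN IDs and the floating IP pool
# (an RFC 5737 documentation range) are illustrative values, not real config.
STANDARD_TOPOLOGY = {
    "vlans": {"tenant": 100, "storage": 200, "management": 300},
    "floating_ip_pool": "203.0.113.0/24",
}

def validate_topology(deployment: dict) -> list[str]:
    """Return a list of deviations from the standard layout (empty = compliant)."""
    problems = []
    for name, vlan_id in STANDARD_TOPOLOGY["vlans"].items():
        if deployment.get("vlans", {}).get(name) != vlan_id:
            problems.append(f"VLAN '{name}' deviates from standard ID {vlan_id}")
    if deployment.get("floating_ip_pool") != STANDARD_TOPOLOGY["floating_ip_pool"]:
        problems.append("non-standard floating IP pool")
    return problems
```

Running this check in CI for every deployment definition is one way to make "clients can't customize the network architecture" an enforced invariant rather than a policy document.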

Lesson 2: Image Management Matters

The base images your VMs boot from directly impact provisioning speed and customer experience. We maintain a curated set of images — Ubuntu 22.04, 24.04, Debian 12, Rocky Linux 9 — each optimized for fast boot. Cloud-init is pre-configured, monitoring agents are baked in, and the image size is minimized.
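A curated image set only stays curated if the policy is checked automatically. The sketch below audits a registry of image metadata against the three requirements named above; the field names, size budget, and example data are all made up for illustration.

```python
# Hypothetical image registry. Field names, sizes, and the failing entry
# are illustrative; they do not describe any real image set.
MAX_IMAGE_GB = 3.0

CURATED_IMAGES = [
    {"name": "ubuntu-22.04", "cloud_init": True, "agent_baked_in": True,  "size_gb": 2.0},
    {"name": "ubuntu-24.04", "cloud_init": True, "agent_baked_in": True,  "size_gb": 2.1},
    {"name": "debian-12",    "cloud_init": True, "agent_baked_in": True,  "size_gb": 1.8},
    {"name": "example-img",  "cloud_init": True, "agent_baked_in": False, "size_gb": 2.4},
]

def audit_images(images: list[dict], max_gb: float = MAX_IMAGE_GB) -> list[str]:
    """Return names of images that violate the curation policy."""
    return [
        img["name"]
        for img in images
        if not (img["cloud_init"] and img["agent_baked_in"] and img["size_gb"] <= max_gb)
    ]
```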

[Stats: OpenStack fleet running Nova, Neutron, and Cinder; 2,000+ VMs managed; 200+ metrics; 99.9% uptime]

We pre-cache images on every compute node. Without pre-caching, the first VM boot on a node requires downloading the image from Glance, which can take 30-60 seconds. With pre-caching, the image is already on local storage and boot starts immediately.
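The pre-caching logic itself is simple set arithmetic: fetch whatever the node doesn't already have. Here is a minimal sketch with the actual Glance download abstracted behind a `fetch` callback, since the transport details vary by deployment.

```python
def precache(required: set[str], cached: set[str], fetch) -> set[str]:
    """Ensure every required image is on local storage.

    `required`: image names this node must be able to boot instantly.
    `cached`:   image names already present on local storage.
    `fetch`:    callback that downloads one image (a Glance pull in practice).
    Returns the set of images that had to be downloaded.
    """
    missing = required - cached
    for name in sorted(missing):  # deterministic order for logging/retries
        fetch(name)
    return missing
```

Run on a schedule (and on image catalog changes), this keeps first-boot latency flat across the fleet instead of penalizing whichever tenant lands first on a cold node.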

Lesson 3: Cinder Storage Backend Selection

The storage backend you choose for Cinder (block storage) determines your performance ceiling. We standardized on local NVMe with LVM for performance-sensitive workloads and Ceph for workloads that need replication and snapshots.

The mistake many operators make is using Ceph for everything. Ceph is excellent for reliability but adds latency compared to local storage. For database workloads that need consistent IOPS, local NVMe is 10-50x faster. We let the product catalog define which storage backend each product tier uses.
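"Let the product catalog define the backend" reduces to a small lookup that fails loudly on anything unrecognized. The tier and backend names below are illustrative, not the real catalog.

```python
# Hypothetical catalog mapping product tiers to Cinder backends.
BACKEND_BY_TIER = {
    "performance": "lvm-nvme",  # local NVMe via LVM: lowest latency, no replication
    "standard":    "ceph-rbd",  # Ceph RBD: replicated, snapshot-capable
}

def volume_backend(tier: str) -> str:
    """Resolve a product tier to its Cinder backend; reject unknown tiers."""
    try:
        return BACKEND_BY_TIER[tier]
    except KeyError:
        raise ValueError(f"unknown product tier: {tier!r}") from None
```

The point of the hard failure is that a typo in a product definition should break provisioning in the lab, not silently land a latency-sensitive database on replicated storage.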

Lesson 4: Monitoring Is Not Optional

OpenStack doesn't fail gracefully. A Nova scheduler running out of memory doesn't return an error — it silently stops scheduling. A Neutron agent that loses RabbitMQ connectivity doesn't alert — it just stops processing network changes. Without deep monitoring, you only discover failures when customers complain.

We monitor 200+ metrics per OpenStack deployment: service health, RabbitMQ queue depths, database connection pools, API response times, hypervisor resource utilization, and Ceph cluster health. Every metric has an alert threshold, and every alert has a runbook.
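"Every metric has a threshold, every alert has a runbook" can be encoded directly in the alert table, so an alert without a runbook is unrepresentable. A minimal sketch, with made-up metric names and thresholds:

```python
# Hypothetical alert table: metric -> (threshold, runbook). Values are illustrative.
ALERTS = {
    "rabbitmq_queue_depth":    (1_000, "runbooks/rabbitmq-backlog.md"),
    "nova_api_p95_latency_ms": (500,   "runbooks/nova-api-latency.md"),
    "db_pool_in_use_pct":      (85,    "runbooks/db-pool-exhaustion.md"),
}

def evaluate(metrics: dict[str, float]) -> list[tuple[str, str]]:
    """Return (metric, runbook) pairs for every metric over its threshold."""
    return [
        (name, runbook)
        for name, (limit, runbook) in ALERTS.items()
        if metrics.get(name, 0) > limit
    ]
```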

Lesson 5: Upgrades Are the Hardest Part

Deploying OpenStack is hard. Upgrading it is harder. The upgrade process involves database migrations, service-by-service restarts, and API compatibility windows. We've developed a rolling upgrade process that upgrades one service at a time with automated health checks between each step.
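The skeleton of a rolling upgrade is an ordered loop with a hard stop: upgrade one service, verify health, and only then move on. This sketch abstracts the upgrade and health-check steps behind callbacks, since both are deployment-specific; it is an outline of the control flow, not our actual pipeline.

```python
def rolling_upgrade(services: list[str], upgrade, healthy) -> list[str]:
    """Upgrade services one at a time, halting on the first failed health check.

    `upgrade`: callback that upgrades one service (migrations, restart, etc.).
    `healthy`: callback returning True if the service passes its health check.
    Returns the list of services upgraded successfully.
    """
    done = []
    for svc in services:
        upgrade(svc)
        if not healthy(svc):
            # Stop immediately: later services depend on this one being sound.
            raise RuntimeError(f"{svc} failed post-upgrade health check; halting")
        done.append(svc)
    return done
```

Ordering matters in practice (identity and messaging layers before the services that depend on them), which is why the input is an explicit list rather than a set.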

The key is testing. Every upgrade is first applied to our lab environment with production-realistic data. We run the full provisioning test suite against the upgraded environment. Only when everything passes do we schedule the production upgrade — always during a maintenance window with customer notification.