Boot Storm Testing Guide

A "boot storm" occurs when many VMs start simultaneously, creating high demand on storage I/O, network resources, compute resources, and hypervisor scheduling. This guide explains how to test and understand boot storm performance.

What is Boot Storm Testing?

Boot storm testing helps you understand:

Concurrent Startup Performance: How your infrastructure handles simultaneous VM startups
Performance Degradation: Impact of load on individual VM boot times
Bottleneck Identification: Discover limits in storage, network, or compute
Recovery Time Objectives (RTO): Realistic expectations for disaster recovery scenarios

Boot Storm Test Workflow

The boot storm test follows a four-phase workflow:

Phase 1: Initial VM Creation

Creates all test namespaces in parallel batches
Creates and starts all VMs simultaneously
Measures time to Running state for each VM
Measures time to network readiness (ping) for each VM
Displays initial creation performance results

This phase establishes a baseline for comparison.

Phase 2: Shutdown All VMs

Issues stop commands to all VMs in parallel
Waits for all VMIs to be deleted (VMs fully stopped)
Confirms all VMs are in stopped state

This ensures a clean starting point for the boot storm test.

Phase 3: Boot Storm (Simultaneous Startup)

Issues start commands to ALL VMs at once
Creates maximum load on infrastructure
Measures time to Running state for each VM
Measures time to network readiness for each VM
Displays boot storm performance results

This is the actual boot storm test.

Phase 4: Comparison

Compare initial creation vs boot storm metrics to understand: - Performance differences between cold start and warm start - Impact of concurrent operations - Storage backend behavior under load - Infrastructure capacity limits

Testing Scenarios

Single Node Boot Storm

Tests VM startup performance on a single node when powering on multiple VMs simultaneously.

Use Case: Validates node-level capacity and boot storm performance (e.g., how many VMs can a single node handle during boot storm).

Command:

virtbench datasource-clone \
  --start 1 \
  --end 50 \
  --storage-class YOUR-STORAGE-CLASS \
  --single-node \
  --boot-storm \
  --save-results

What it does: 1. Selects a single node (random or specified with --node-name) 2. Creates and starts all VMs on that node (initial test) 3. Stops all VMs and waits for complete shutdown 4. Starts all VMs simultaneously on the same node (boot storm) 5. Measures time to Running state and time to ping for each VM 6. Provides separate statistics for initial creation and boot storm

Multi-Node Boot Storm

Tests VM startup performance across all nodes when powering on multiple VMs simultaneously.

Use Case: Validates cluster-wide performance under boot storm conditions (e.g., after maintenance, power outage recovery).

Command:

virtbench datasource-clone \
  --start 1 \
  --end 100 \
  --storage-class YOUR-STORAGE-CLASS \
  --boot-storm \
  --save-results

What it does: 1. Creates and starts all VMs (distributed across nodes) 2. Stops all VMs and waits for complete shutdown 3. Starts all VMs simultaneously (boot storm) 4. Measures time to Running state and time to ping for each VM 5. Provides separate statistics for initial creation and boot storm

Interpreting Boot Storm Results

Key Metrics

Time to Running: How long until VM reaches Running state
Time to Ping: How long until VM is network-reachable
Average Times: Mean performance across all VMs
Max Times: Worst-case performance (important for SLA planning)
Success Rate: Percentage of VMs that successfully started

What to Look For

Good Performance Indicators: - Boot storm times similar to initial creation times - Consistent performance across all VMs - High success rate (100%) - Predictable max times

Performance Issues: - Boot storm times significantly higher than initial creation - Wide variance in boot times - VMs failing to start - Increasing times as more VMs start

Common Bottlenecks

Storage I/O: High disk read/write contention
Network: Bandwidth saturation during image pulls
Compute: CPU/memory exhaustion on nodes
Hypervisor: KubeVirt scheduling delays

Best Practices

Start Small: Begin with 10-20 VMs to establish baseline
Incremental Testing: Gradually increase VM count to find limits
Monitor Resources: Watch node CPU, memory, and storage I/O during tests
Multiple Runs: Run tests multiple times for consistent results
Save Results: Always use --save-results to track performance over time
Clean Environment: Ensure cluster is not under load before testing

Advanced Options

Namespace Batch Size

Control how many namespaces are created in parallel:

virtbench datasource-clone \
  --start 1 \
  --end 100 \
  --storage-class YOUR-STORAGE-CLASS \
  --boot-storm \
  --namespace-batch-size 50

Concurrency Control

Adjust monitoring concurrency for large-scale tests:

virtbench datasource-clone \
  --start 1 \
  --end 200 \
  --storage-class YOUR-STORAGE-CLASS \
  --boot-storm \
  --concurrency 200