6.18.5. OpenStack Neutron Density Testing report

Abstract

This document includes OpenStack Networking (aka Neutron) density test results against a 200-node OpenStack environment. All tests have been performed according to the OpenStack Neutron Density Testing test plan.

6.18.5.1. Environment description

6.18.5.1.1. Lab A (200 nodes)

3 controllers, 196 computes, 1 node for Grafana/Prometheus

6.18.5.1.1.1. Hardware configuration of each server

Description of controller servers

  server    vendor,model      Supermicro MBD-X10DRI
  CPU       vendor,model      Intel Xeon E5-2650v3
            processor_count   2
            core_count        10
            frequency_MHz     2300
  RAM       vendor,model      8x Samsung M393A2G40DB0-CPB
            amount_MB         2097152
  NETWORK   vendor,model      Intel,I350 Dual Port
            bandwidth         1G
            vendor,model      Intel,82599ES Dual Port
            bandwidth         10G
  STORAGE   vendor,model      Intel SSD DC S3500 Series
            SSD/HDD           SSD
            size              240GB
            vendor,model      2x WD WD5003AZEX
            SSD/HDD           HDD
            size              500GB

Description of compute servers

  server    vendor,model      SUPERMICRO 5037MR-H8TRF
  CPU       vendor,model      INTEL XEON Ivy Bridge 6C E5-2620
            processor_count   1
            core_count        6
            frequency_MHz     2100
  RAM       vendor,model      4x Samsung DDRIII 8GB DDR3-1866
            amount_MB         32768
  NETWORK   vendor,model      AOC-STGN-i2S - 2-port
            bandwidth         10G
  STORAGE   vendor,model      Intel SSD DC S3500 Series
            SSD/HDD           SSD
            size              80GB
            vendor,model      1x WD Scorpio Black BP WD7500BPKT
            SSD/HDD           HDD
            size              750GB

6.18.5.2. Test results

6.18.5.2.1. Test Case: VM density check

The idea was to boot as many VMs as possible (in batches of 200-1000 VMs) and make sure they are properly wired and have access to the external network. The test makes it possible to measure the maximum number of VMs that can be deployed without degrading cloud operability.

External access was verified via an external server to which the VMs connect upon spawning. The server logs incoming connections from provisioned VMs, which send their instance IPs to it via POST requests. Each instance also reports the number of attempts it took to obtain an IP address from the metadata server and to connect to the HTTP server, respectively.
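
The instance-side script is not included in the report; a minimal sketch of what such user data might look like is shown below, assuming the standard EC2-compatible metadata endpoint and a hypothetical report server address (REPORT_SERVER):

  #!/bin/sh
  # Hypothetical address of the external report server (not from the report).
  REPORT_SERVER="http://203.0.113.10:8080/report"

  # Poll the metadata service until the instance IP is available,
  # counting the number of attempts it takes.
  ATTEMPTS=0
  IP=""
  while [ -z "$IP" ]; do
      ATTEMPTS=$((ATTEMPTS + 1))
      IP=$(curl -sf http://169.254.169.254/latest/meta-data/local-ipv4) || IP=""
      [ -z "$IP" ] && sleep 5
  done

  # Report the instance IP and the attempt count via a POST request.
  curl -sf -X POST -d "ip=${IP}&metadata_attempts=${ATTEMPTS}" "$REPORT_SERVER"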

A Heat template was used to create 1 network with a subnet, 1 DVR router, and a VM per compute node. Heat stacks were created in batches of 1 to 5 (5 most of the time), so 1 iteration effectively means 5 new networks/routers and 196 * 5 = 980 VMs. During the execution of the test we constantly monitored the lab's status using a Grafana dashboard and checked the agents' status.
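
The template itself is not reproduced in the report; the HOT sketch below shows the resource layout described above. The image, flavor, external network name (ext-net), and CIDR are placeholders:

  heat_template_version: 2015-04-30

  resources:
    net:
      type: OS::Neutron::Net
    subnet:
      type: OS::Neutron::Subnet
      properties:
        network: { get_resource: net }
        cidr: 10.1.0.0/16              # placeholder CIDR
    router:
      type: OS::Neutron::Router
      properties:
        distributed: true              # DVR router
        external_gateway_info:
          network: ext-net             # placeholder external network
    router_iface:
      type: OS::Neutron::RouterInterface
      properties:
        router: { get_resource: router }
        subnet: { get_resource: subnet }
    vms:
      type: OS::Heat::ResourceGroup
      properties:
        count: 196                     # one VM per compute node
        resource_def:
          type: OS::Nova::Server
          properties:
            image: density-test-image  # placeholder image
            flavor: m1.micro           # placeholder flavor
            networks:
              - network: { get_resource: net }

A batch of 5 such stacks per iteration could then be launched with, e.g., openstack stack create -t density_test.yaml density-<n>.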

As a result, we were able to successfully create 125 Heat stacks, which gives a total of 125 * 196 = 24500 VMs.

Iteration 1:

[Figure: ../../../_images/iteration1.png]

Iteration i:

[Figure: ../../../_images/iterationi.png]

Example of Grafana dashboard during density test:

[Figure: ../../../_images/grafana.png]

6.18.5.2.2. Observed issues

Issues faced during testing:

During testing we also needed to tune the node configuration to keep up with the growing number of VMs per node:

  • Increase the ARP table size on compute nodes and controllers (see the sysctl sketch after this list)

  • Raise cpu_allocation_ratio from 8.0 to 12.0 in nova.conf to prevent hitting the Nova vCPU limit on computes (see the nova.conf snippet after this list)
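
The report does not give the exact values used; the sketches below show one plausible way to apply both changes. The neighbour-table thresholds are assumptions and should be sized to the expected number of ARP entries per node:

  # /etc/sysctl.d/90-arp.conf -- raise the kernel neighbour (ARP) table limits.
  # Values are illustrative, not taken from the report.
  net.ipv4.neigh.default.gc_thresh1 = 4096     # entries kept without any GC
  net.ipv4.neigh.default.gc_thresh2 = 8192     # soft limit
  net.ipv4.neigh.default.gc_thresh3 = 16384    # hard limit on table size

The overcommit change is a single option on the compute nodes:

  # nova.conf on compute nodes
  [DEFAULT]
  # Raised from the previous lab value of 8.0 to allow 12 vCPUs per physical core.
  cpu_allocation_ratio = 12.0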

At ~16000 VMs we reached the ARP table size limit on compute nodes, so Heat stack creation started to fail. Having increased the maximum table size, we decided to clean up the failed stacks; in attempting to do so we ran into the following Nova issue (LP #1606825, "nova-compute hangs while executing a blocking call to librbd"): on VM deletion, nova-compute may hang for a while executing a call to librbd and eventually go down in the Nova service-list output. This issue was fixed with the help of the Mirantis Nova team, and the fix was applied to the lab as a patch.

After launching ~20000 VMs, the cluster started experiencing problems with RabbitMQ and Ceph. When the number of VMs reached 24500, control plane services and agents started to go down en masse. The initial failure might have been caused by the lack of allowed PIDs on the OSD nodes (https://bugs.launchpad.net/fuel/+bug/1536271); the Ceph failure then affected all services, e.g. MySQL errors in the Neutron server led to agents going down and to massive resource rescheduling/resync. After the Ceph failure the control plane cluster could not be recovered, so the density test had to be stopped before the capacity of the compute nodes was exhausted.
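
The report does not name the exact limit that was hit; on the kernel side the system-wide caps are kernel.pid_max and kernel.threads-max, so a mitigation on the OSD nodes might look like the following (values are illustrative):

  # Raise system-wide PID/thread limits on the Ceph OSD nodes.
  # Values are illustrative, not taken from the report.
  sysctl -w kernel.pid_max=4194304
  sysctl -w kernel.threads-max=2097152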

The Ceph team commented that 3 Ceph monitors are not enough for over 20000 VMs (each having 2 drives) and recommended having at least 1 monitor per ~1000 client connections, or moving the monitors to dedicated nodes.

Note

The connectivity check of the integrity test passed 100% even while the control plane cluster was failing. That is a good illustration of control plane failures not affecting the data plane.

Note

Final result: 24500 VMs on the cluster.