5.30.1. OpenStack reliability testing

status: ready

version: 1.0

Abstract

This document describes an abstract methodology for testing and analyzing the high availability of an OpenStack cluster. OpenStack data plane testing is currently out of scope and will be described in the future.

Conventions

  • OpenStack cluster: consists of server nodes with a deployed and fully operational OpenStack environment in a high-availability configuration.

  • Fault-injection operation: represents a common type of failure that can occur in a production environment: service-hang, service-crash, network-partition, network-flapping, or node-crash.

  • Service-hang: faults are injected into the specified OpenStack service by sending the SIGSTOP and SIGCONT POSIX signals.

  • Service-crash: faults are injected by sending the SIGKILL signal to the specified OpenStack service.

  • Node-crash: faults are injected into an OpenStack cluster by rebooting or shutting down a server node.

  • Network-partition: faults are injected by inserting iptables rules on the OpenStack cluster nodes that host the service to be network-partitioned.

  • Network-flapping: faults are injected into OpenStack cluster nodes by inserting and deleting iptables rules on the fly, which affects the service under test.

  • Factor: consists of a set of atomic fault-injection operations. For example: reboot-random-controller, reboot-random-rabbitmq.

  • Test plan: contains two elements: a test scenario execution graph and the fault-injection factors.

  • SLA: Service-level agreement

  • Testing-cycles: the number of test cycles run for each factor.

  • Inf: indicates that the cluster does not auto-heal after fault-factor injection, that is, the auto-healing time is treated as infinite.
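The service-hang, service-crash, and network-partition operations defined above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tooling used for this test plan: all function names are hypothetical, the "service" is a stand-in child process, and the iptables helper only builds command lines (using the standard -I, -D, -p, --dport, and -j DROP options) for a service assumed to listen on a single TCP port. It does not execute them.

```python
import os
import signal
import subprocess
import sys
import time


def inject_service_hang(pid: int, hang_seconds: float) -> None:
    """Service-hang: freeze a process with SIGSTOP, then resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(hang_seconds)
    finally:
        # Always resume the process, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)


def inject_service_crash(pid: int) -> None:
    """Service-crash: terminate a process immediately with SIGKILL."""
    os.kill(pid, signal.SIGKILL)


def partition_commands(port: int) -> dict:
    """Network-partition: build (but do not run) iptables commands.

    Assumption: dropping inbound TCP traffic to one port is enough to
    isolate the service. Alternating "insert" and "delete" would model
    network-flapping.
    """
    rule = ["INPUT", "-p", "tcp", "--dport", str(port), "-j", "DROP"]
    return {
        "insert": ["iptables", "-I"] + rule,
        "delete": ["iptables", "-D"] + rule,
    }


def demo() -> tuple:
    """Hang and then crash a stand-in service (a sleeping child process)."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    inject_service_hang(child.pid, 0.2)
    alive_after_hang = child.poll() is None
    inject_service_crash(child.pid)
    exit_code = child.wait(timeout=5)
    return alive_after_hang, exit_code
```

In a real run the PID would be that of the targeted OpenStack service on a cluster node, and the iptables commands would be executed with root privileges on that node.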

5.30.1.1. Test Plan

5.30.1.1.1. Test Environment

This section should contain all information about the deployed OpenStack environment, including an archive of the contents of the /etc directory from every node.

5.30.1.1.1.1. Preparation

This section should contain all steps needed to reproduce the deployment of the OpenStack environment and the client node. For example, if the testing environment is deployed with DevStack, this section should contain all DevStack configuration files, the DevStack version, and all deployment steps.

5.30.1.1.1.2. Environment description

This section should contain all cluster hardware information, including processor model and its frequency, memory size, storage type and its capacity, network interfaces, and others. A separate client node must be used to drive the tests.

5.30.1.1.1.2.1. Hardware

This section should contain the full hardware specification of all nodes.

Description of server hardware

Group    Parameter
-------  ----------------
SERVER   name
         role
         vendor,model
         operating_system
CPU      vendor,model
         processor_count
         core_count
         frequency_MHz
RAM      vendor,model
         amount_MB
NETWORK  interface_name
         vendor,model
         bandwidth
STORAGE  dev_name
         vendor,model
         SSD/HDD
         size

5.30.1.1.1.2.2. Networking

This section should contain a full description of the network equipment used in the OpenStack cluster. Include the network topology diagram and the network hardware configuration files.

5.30.1.1.2. Factors description

Describe here the factors used during the test runs. For example:

  • reboot-random-controller: consists of a node-crash fault injection on a random OpenStack controller node.

  • reboot-random-rabbitmq: consists of a node-crash fault injection on the master RabbitMQ messaging node.

  • sigstop-random-nova-api: consists of a service-hang fault injection on a random nova-api service.

  • sigkill-random-mysql: consists of a service-crash fault injection on a random MySQL node.

  • network-partition-random-mysql: consists of a network-partition fault injection on a random MySQL node.
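The example factor names above follow a visible pattern: an action, the word "random", and a target. The sketch below shows one way a test harness might map such a name to the fault-injection type it represents. The Factor class, the parse_factor function, and the action table are hypothetical and are derived only from the five examples listed here.

```python
from dataclasses import dataclass

# Action prefix -> fault-injection operation, taken from the examples above.
ACTION_TO_FAULT = {
    "reboot": "node-crash",
    "sigstop": "service-hang",
    "sigkill": "service-crash",
    "network-partition": "network-partition",
}


@dataclass(frozen=True)
class Factor:
    name: str
    fault_type: str
    target: str


def parse_factor(name: str) -> Factor:
    """Split '<action>-random-<target>' and look up the fault type."""
    action, separator, target = name.partition("-random-")
    if not separator or not target or action not in ACTION_TO_FAULT:
        raise ValueError(f"unrecognised factor name: {name!r}")
    return Factor(name=name, fault_type=ACTION_TO_FAULT[action], target=target)


if __name__ == "__main__":
    for factor_name in ("reboot-random-controller", "sigstop-random-nova-api"):
        print(parse_factor(factor_name))
```

Splitting on "-random-" rather than on every hyphen keeps multi-word actions (network-partition) and targets (nova-api) intact.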

5.30.1.1.3. Test Case 1: NovaServers.boot_and_delete_server

5.30.1.1.3.1. Description

This Rally scenario boots and deletes virtual instances through the OpenStack Nova API while fault factors are injected.

5.30.1.1.3.2. Service-level agreement

In this section, specify SLA values. For example:

Parameter         Value
----------------  -----
MTTR (sec)        <=240
Failure rate (%)  <=95
Auto-healing      Yes
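An SLA of this shape can be evaluated mechanically once a test cycle has produced its measurements. The following is a minimal sketch using the example thresholds from the table above. The Sla class and the sla_met function are hypothetical names, and the rule that every condition must hold at once is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sla:
    # Defaults mirror the example values in the table above.
    max_mttr_sec: float = 240.0
    max_failure_rate_pct: float = 95.0
    require_auto_healing: bool = True


def sla_met(sla: Sla, mttr_sec: float, failure_rate_pct: float,
            auto_healed: bool) -> bool:
    """Return True only if every SLA condition holds (assumed semantics)."""
    if sla.require_auto_healing and not auto_healed:
        return False
    return (mttr_sec <= sla.max_mttr_sec
            and failure_rate_pct <= sla.max_failure_rate_pct)


if __name__ == "__main__":
    print(sla_met(Sla(), mttr_sec=120.0, failure_rate_pct=10.0, auto_healed=True))
    # An MTTR of Inf (no auto-healing observed) can never satisfy the SLA.
    print(sla_met(Sla(), mttr_sec=math.inf, failure_rate_pct=10.0, auto_healed=False))
```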

5.30.1.1.3.3. Parameters

In this section, specify load parameters during the test. For example:

Parameter            Value
-------------------  --------
Runner               constant
Concurrency          X
Times                Y
Injection-iteration  Z
Testing-cycles       N
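For orientation, the Runner, Concurrency, and Times parameters correspond to the runner section of a Rally task file. The sketch below builds such a task as a Python dictionary and serializes it to JSON. The flavor and image names are placeholders, build_task is a hypothetical helper, and the concurrency check is only a sanity guard added here. Injection-iteration and Testing-cycles are left out because this document does not say how they are passed to Rally.

```python
import json


def build_task(concurrency: int, times: int) -> dict:
    """Build a Rally task dict for NovaServers.boot_and_delete_server."""
    if concurrency < 1 or times < concurrency:
        raise ValueError("expected 1 <= concurrency <= times")
    return {
        "NovaServers.boot_and_delete_server": [
            {
                "args": {
                    # Placeholder names; use a flavor and image that
                    # exist in the environment under test.
                    "flavor": {"name": "m1.tiny"},
                    "image": {"name": "cirros"},
                },
                "runner": {
                    "type": "constant",
                    "concurrency": concurrency,
                    "times": times,
                },
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_task(concurrency=2, times=10), indent=2))
```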

5.30.1.1.3.4. List of reliability metrics

Priority  Value          Measurement Units  Description
--------  -------------  -----------------  -----------
1         SLA            Boolean            Service-level agreement result
2         Auto-healing   Boolean            Whether the cluster auto-healed after fault injection
3         Failure rate   Percent            Test iteration failure ratio
4         MTTR (auto)    Seconds            Automatic mean time to repair
5         MTTR (manual)  Seconds            Manual mean time to repair, if MTTR (auto) is Inf
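This document does not define how the failure rate and the automatic recovery time are derived from raw iteration results, so the following is only one possible operationalization. It assumes each iteration is recorded with a start time, a duration, and a failed flag. It takes the recovery time of a cycle as the span from the start of the first failed iteration to the end of the last failed one, and reports Inf when the final iteration still fails. All names are hypothetical.

```python
import math
from typing import NamedTuple, Sequence


class Iteration(NamedTuple):
    start: float     # seconds since the start of the cycle
    duration: float  # seconds
    failed: bool


def failure_rate(iterations: Sequence[Iteration]) -> float:
    """Percentage of iterations that failed."""
    if not iterations:
        raise ValueError("no iterations recorded")
    failed = sum(1 for it in iterations if it.failed)
    return 100.0 * failed / len(iterations)


def recovery_time(iterations: Sequence[Iteration]) -> float:
    """Seconds from the first failure to the end of the last failure.

    Returns 0.0 if nothing failed and math.inf if the last iteration
    failed, i.e. no recovery was observed within the cycle.
    """
    ordered = sorted(iterations, key=lambda it: it.start)
    failures = [it for it in ordered if it.failed]
    if not failures:
        return 0.0
    if ordered[-1].failed:
        return math.inf
    first, last = failures[0], failures[-1]
    return (last.start + last.duration) - first.start


if __name__ == "__main__":
    cycle = [
        Iteration(0.0, 1.0, False),
        Iteration(1.0, 1.0, True),
        Iteration(2.0, 1.0, True),
        Iteration(3.0, 1.0, False),
    ]
    print(failure_rate(cycle), recovery_time(cycle))
```

Averaging recovery_time over the cycles that did recover would then give a mean time to repair for the factor.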

5.30.1.1.3.5. Results

5.30.1.1.3.5.1. reboot-random-controller

Full description of cyclic execution results

Cycles  MTTR(sec)  Failure rate(%)  Auto-healing  Performance degradation
------  ---------  ---------------  ------------  -----------------------
1       X          Y                Yes           Yes
2       X          Y                Yes           Yes
3       X          Y                No            Yes
4       X          Y                Yes           Yes
5       X          Y                Yes           Yes

Place here a link to the Rally report file with the results of testing this factor.

Testing results summary

Value  MTTR  Failure rate
-----  ----  ------------
Min    X     Y
Max    X     Y
SLA    X     Y
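The Min and Max rows of the summary can be derived from the per-cycle table. The sketch below assumes each cycle is reduced to a pair of MTTR in seconds and failure rate in percent, and that an MTTR of Inf is represented by math.inf. The summarize function is a hypothetical name, and the SLA row is left out because this document does not define how it is computed.

```python
import math
from typing import Dict, Sequence, Tuple


def summarize(cycles: Sequence[Tuple[float, float]]) -> Dict[str, Tuple[float, float]]:
    """Return the Min and Max of (MTTR, failure rate) over all cycles."""
    if not cycles:
        raise ValueError("no cycles recorded")
    mttrs = [mttr for mttr, _ in cycles]
    rates = [rate for _, rate in cycles]
    return {
        "Min": (min(mttrs), min(rates)),
        "Max": (max(mttrs), max(rates)),
    }


if __name__ == "__main__":
    # The second cycle did not auto-heal, so its MTTR is Inf.
    results = [(10.0, 5.0), (math.inf, 40.0), (30.0, 0.0)]
    for row, (mttr, rate) in summarize(results).items():
        print(row, mttr, rate)
```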

5.30.1.1.3.5.2. Detailed results description

In this section, provide a detailed description of the test results, including the impact of the factor.

5.30.1.1.3.5.3. reboot-random-rabbitmq

Full description of cyclic execution results

Cycles  MTTR(sec)  Failure rate(%)  Auto-healing  Performance degradation
------  ---------  ---------------  ------------  -----------------------
1       X          Y                Yes           Yes
2       X          Y                Yes           Yes
3       X          Y                No            Yes
4       X          Y                Yes           Yes
5       X          Y                Yes           Yes

Place here a link to the Rally report file with the results of testing this factor.

Testing results summary

Value  MTTR  Failure rate
-----  ----  ------------
Min    X     Y
Max    X     Y
SLA    X     Y

5.30.1.1.3.5.4. Detailed results description

In this section, provide a detailed description of the test results, including the impact of the factor.

5.30.1.1.4. Test Case 2: GlanceImages.create_and_delete_image

5.30.1.1.4.1. Description

This Rally scenario creates and deletes images through the OpenStack Glance API while fault factors are injected.

5.30.1.1.4.2. Service-level agreement

In this section, specify SLA values. For example:

Parameter         Value
----------------  -----
MTTR (sec)        <=120
Failure rate (%)  <=95
Auto-healing      Yes

5.30.1.1.4.3. Parameters

In this section, specify load parameters during the test. For example:

Parameter            Value
-------------------  --------
Runner               constant
Concurrency          X
Times                Y
Injection-iteration  Z
Testing-cycles       N

5.30.1.1.4.4. List of reliability metrics

Priority  Value          Measurement Units  Description
--------  -------------  -----------------  -----------
1         SLA            Boolean            Service-level agreement result
2         Auto-healing   Boolean            Whether the cluster auto-healed after fault injection
3         Failure rate   Percent            Test iteration failure ratio
4         MTTR (auto)    Seconds            Automatic mean time to repair
5         MTTR (manual)  Seconds            Manual mean time to repair, if MTTR (auto) is Inf

5.30.1.1.4.5. Results

5.30.1.1.4.5.1. reboot-random-controller

Full description of cyclic execution results

Cycles  MTTR(sec)  Failure rate(%)  Auto-healing  Performance degradation
------  ---------  ---------------  ------------  -----------------------
1       X          Y                Yes           Yes
2       X          Y                Yes           Yes
3       X          Y                No            Yes
4       X          Y                Yes           Yes
5       X          Y                Yes           Yes

Place here a link to the Rally report file with the results of testing this factor.

Testing results summary

Value  MTTR  Failure rate
-----  ----  ------------
Min    X     Y
Max    X     Y
SLA    X     Y

5.30.1.1.4.5.2. Detailed results description

In this section, provide a detailed description of the test results, including the impact of the factor.

5.30.1.1.4.5.3. reboot-random-rabbitmq

Full description of cyclic execution results

Cycles  MTTR(sec)  Failure rate(%)  Auto-healing  Performance degradation
------  ---------  ---------------  ------------  -----------------------
1       X          Y                Yes           Yes
2       X          Y                Yes           Yes
3       X          Y                No            Yes
4       X          Y                Yes           Yes
5       X          Y                Yes           Yes

Place here a link to the Rally report file with the results of testing this factor.

Testing results summary

Value  MTTR  Failure rate
-----  ----  ------------
Min    X     Y
Max    X     Y
SLA    X     Y

5.30.1.1.4.5.4. Detailed results description

In this section, provide a detailed description of the test results, including the impact of the factor.

5.30.1.2. Reports

Test plan execution reports: