6.18.2. Neutron L3 HA test results Mitaka

Since initial testing of L3 HA functionality was already done for Liberty, the load parameters are increased in this round of testing (for example, concurrency for the Rally tests and the number of routers for the manual tests).

This report is generated for the OpenStack Neutron L3 HA Test Plan.

6.18.2.1. Environment description

6.18.2.1.1. Cluster description

  • 3 controllers

  • 45 compute nodes

6.18.2.1.2. Software versions

MOS 9.0

6.18.2.1.3. Hardware configuration of each server

Description of server hardware

Compute:

  • Vendor: 1x SUPERMICRO SUPERSERVER 5037MR-H8TRF MICRO-CLOUD
    http://www.supermicro.com/products/system/3u/5037/sys-5037mr-h8trf.cfm

  • CPU: 1x INTEL XEON Ivy Bridge 6C E5-2620 V2 2.1G 15M 7.2GT/s QPI 80w SOCKET 2011R 1600
    http://ark.intel.com/products/75789/Intel-Xeon-Processor-E5-2620-v2-15M-Cache-2_10-GHz

  • RAM: 4x Samsung DDRIII 8GB DDR3-1866 1Rx4 ECC REG RoHS M393B1G70QH0-CMA

  • NIC: 1x AOC-STGN-i2S - 2-port 10 Gigabit Ethernet SFP+

6.18.2.2. Rally test results

L3 HA has a restriction of 255 routers per HA network per tenant. At the moment there is no ability to create a new HA network for a tenant once the number of VIPs exceeds this limit. Because of this, the number of tenants was increased for some tests (NeutronNetworks.create_and_list_routers).
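As an illustration of the objects involved, below is a minimal sketch using the Mitaka-era neutron CLI; the router name is arbitrary and the HA network naming pattern is the usual default, not a value captured from this environment:

    # creating an HA router consumes one VRRP VRID (range 1-255) on the
    # tenant's HA network, which is where the 255-router limit comes from
    neutron router-create --ha True demo-ha-router

    # the backing HA network is visible to the admin, usually named
    # "HA network tenant <project-id>"
    neutron net-list | grep "HA network"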

The most important results are provided by the test_create_delete_routers test, as it allows catching possible race conditions during creation and deletion of HA routers, HA networks and HA interfaces. Several known bugs related to this have already been fixed upstream.

To uncover more possible issues, test_create_delete_routers has been run multiple times with different concurrency. Another important aspect is the number of users per tenant, which further increases the load.
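For reference, a minimal sketch of a Rally task that drives this scenario is shown below. The times, concurrency and users_per_tenant values correspond to one of the rows in the table that follows; the tenant count, CIDR and quota settings are assumptions, not values taken from the actual task files used in this testing.

    cat > create_and_delete_routers.json <<'EOF'
    {
        "NeutronNetworks.create_and_delete_routers": [
            {
                "args": {
                    "network_create_args": {},
                    "subnet_create_args": {},
                    "subnet_cidr_start": "1.1.0.0/30",
                    "subnets_per_network": 1,
                    "router_create_args": {}
                },
                "runner": {"type": "constant", "times": 300, "concurrency": 40},
                "context": {
                    "users": {"tenants": 1, "users_per_tenant": 5},
                    "quotas": {"neutron": {"network": -1, "subnet": -1, "router": -1, "port": -1}}
                }
            }
        ]
    }
    EOF
    rally task start create_and_delete_routers.json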

Results of test_create_delete_routers

Times | Concurrency | Users per tenant | Number of errors | Link for rally report
------+-------------+------------------+------------------+----------------------
150   | 50          | 1                | 0                | rally report
250   | 10          | 3                | 9                | rally report
300   | 10          | 2                | 1                | rally report
300   | 30          | 5                | 3                | rally report
300   | 40          | 1                | 4                | rally report
300   | 40          | 5                | 4                | rally report
500   | 50          | 3                | 55               | rally report
300   | 100         | 1                | 60               | rally report
300   | 30          | 4                | 0                | rally report

In every suite with multiple scenarios, boot_and_delete_server_with_secgroups was executed to show the overall performance and validity of the environment.

Test                                  | Number of tenants | Times | Concurrency | Number of errors | Link for rally report
--------------------------------------+-------------------+-------+-------------+------------------+----------------------
create_and_delete_routers             | 3                 | 10    | 3           | 1                | rally report
create_and_list_routers               | 4                 | 10    | 3           | 0                |
create_and_update_routers             | 3                 | 10    | 3           | 0                |
boot_and_delete_server_with_secgroups | 1                 | 10    | 3           | 0                |
create_and_delete_routers             | 3                 | 300   | 50          | 4                | rally report
create_and_list_routers               | 4                 | 300   | 50          | 118              |
create_and_update_routers             | 3                 | 300   | 50          | 3                |
boot_and_delete_server_with_secgroups | 5                 | 450   | 30          | 63               |
create_and_delete_routers             | 3                 | 300   | 50          | 30               | rally report
create_and_list_routers               | 10                | 300   | 50          | 0                |
create_and_update_routers             | 3                 | 300   | 50          | 20               |
boot_and_delete_server_with_secgroups | 5                 | 450   | 30          | 180              |

The errors discovered have been classified as the following bugs:

[Figure: graf_rally_M.png]

The frequency of appearance of the bugs above is shown in the following table:

Comparative analysis of failures

Test                                  | Tests executed | Failed tests | Tests with root cause determined
--------------------------------------+----------------+--------------+---------------------------------
create_delete_routers                 | 3310           | 171 (5.2 %)  | 121 (3.6 %) - The server didn't respond in time; 11 (0.33 %) - SubnetInUse: Unable to complete operation on subnet
create_and_update_routers             | 910            | 23 (2.5 %)   | 20 (2.1 %) - The server didn't respond in time
create_and_list_routers               | 610            | 118 (19.3 %) | Incorrect test setup; a larger number of tenants was required
boot_and_delete_server_with_secgroups | 910            | 243 (26.7 %) | 243 (26.7 %) - TimeoutException: Rally tired waiting for Server to become ('ACTIVE'), current status BUILD

6.18.2.2.1. Summary:

1. In comparison with the results for Liberty, neutron-server does not cope well with the higher load (a lot of "The server didn't respond in time" errors).

2. Among the races during creation and deletion of HA routers, the race with HA networks (bug/1562878) and the race with deleting routers (bug/1605546) remain.

6.18.2.3. Shaker test results

Scenario                                             | L3 HA (Lost / Errors / Report) | L3 HA during L3 agents restart (Lost / Errors / Report)
-----------------------------------------------------+--------------------------------+---------------------------------------------------------
OpenStack L3 East-West                               | 0 / 0 / report                 | 6 / 0 / report
OpenStack L3 East-West Performance                   | 0 / 0 / report                 | 0 / 0 / report
OpenStack L3 North-South                             | 0 / 0 / report                 | 30 / 0 / report
OpenStack L3 North-South UDP                         | 0 / 1 / report                 | 4 / 0 / report
OpenStack L3 North-South Performance (concurrency 5) | 0 / 0 / report                 | 0 / 0 / report
OpenStack L3 North-South Dense                       | 0 / 0 / report                 | 29 / 0 / report
OpenStack L3 East-West Dense                         | 0 / 0 / report                 | 0 / 0 / report

[Figures: graf_M_East-West.png, graf_M_North-South.png]

Shaker provides statistics about the maximum, minimum and mean values of different connection measurements. For each test, the maximum among all maximum values and the minimum among all minimum values were taken, and the mean value was computed from all mean values. These values are presented in the table below.

Type                     | L3 HA (min / mean / max)   | L3 HA during l3 agents restart (min / mean / max)
-------------------------+----------------------------+---------------------------------------------------
OpenStack L3 East-West
  ping_icmp, ms          | 0.29 / 9.1 / 21.08         | 0.04 / 9.57 / 972.05
  tcp_download, Mbits/s  | 84.5 / 789.6 / 3614.99     | 0.07 / 886.38 / 5519.7
  tcp_upload, Mbits/s    | 87.97 / 617.14 / 3364.86   | 85.69 / 604.72 / 4898.11
  Bandwidth, Mbit/s      | 151.13 / 933.28 / 3232.75  | 0.0 / 990.99 / 4340.5

OpenStack L3 East-West Performance
  Bandwidth, Mbit/s      | 760.16 / 1316.44 / 2879.94 | 0.0 / 1220.97 / 4315.06

OpenStack L3 North-South
  ping_icmp, ms          | 0.12 / 13.54 / 130.15      | 0.38 / 65.64 / 369.95
  tcp_download, Mbits/s  | 0.11 / 204.71 / 771.85     | 11.07 / 156.67 / 731.95
  tcp_upload, Mbits/s    | 1.46 / 131.58 / 719.26     | 41.01 / 240.1 / 864.65
  Bandwidth, Mbit/s      | 4.25 / 198.02 / 680.56     | 0.0 / 184.97 / 900.81

OpenStack L3 North-South Performance (concurrency 5)
  Bandwidth, Mbit/s      | 52.38 / 472.69 / 768.68    | 0.0 / 450.13 / 768.44

OpenStack L3 East-West Dense
  ping_icmp, ms          | 0.06 / 6.87 / 53.34        | 0.37 / 7.51 / 50.08
  Bandwidth, Mbit/s      | 497.88 / 1832.61 / 3754.25 | 0.0 / 1580.26 / 3386.44
  tcp_download, Mbits/s  | 332.42 / 1536.71 / 3771.26 | 63.62 / 1436.54 / 3902.18
  tcp_upload, Mbits/s    | 333.37 / 1091.59 / 2692.93 | 8.06 / 1047.41 / 3376.56

OpenStack L3 North-South Dense
  ping_icmp, ms          | 0.33 / 14.27 / 78.47       | 0.38 / 1.0 / 2.3
  Bandwidth, Mbit/s      | 106.25 / 375.94 / 721.31   | 0.0 / 267.31 / 758.63
  tcp_download, Mbits/s  | 66.48 / 294.2 / 668.21     | 493.42 / 535.22 / 552.0
  tcp_upload, Mbits/s    | 61.12 / 245.19 / 658.95    |

The average difference between these values without and with restart is presented in the next table:

     | ping_icmp, ms | tcp_download, Mbits/s | tcp_upload, Mbits/s | Bandwidth, Mbit/s
-----+---------------+-----------------------+---------------------+------------------
min  | -0.0925       | -21.1675              | 87.29               | 262
mean | -9.985        | -65.68                | -85.4875            | 72.39
max  | -277.835      | -469.88               | -563.83             | 169.67
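For example, the mean ping_icmp difference is the average, over the four scenarios that measure ping_icmp, of the mean value without restart minus the mean value with restart: ((9.1 - 9.57) + (13.54 - 65.64) + (6.87 - 7.51) + (14.27 - 1.0)) / 4 = -9.985 ms.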

6.18.2.3.1. Summary:

1. The results show that the mean values of the metrics do not degrade dramatically while the active L3 agent is stopped.

6.18.2.4. Manual tests execution

During manual testing, the following scenarios were tested:

  • Ping to an external network from a VM during reset of the primary (non-primary) controller

  • Ping from one VM to another VM in a different network during ban of the L3 agent

  • Iperf UDP testing between VMs in different networks during ban of the L3 agent

All tests were performed with a large number of routers.

6.18.2.4.1. Ping to external network from VM during reset of primary (non-primary) controller

[Figure: ping_external1.png]

Command: ping 8.8.8.8

Iteration | Number of routers | Number of lost packets
----------+-------------------+-----------------------
1         | 10                | 14
1         | 10                | 2
2         | 50                | 42
3         | 50                | 41
4         | 100               | 43
5         | 100               | 42
6         | 100               | 47
7         | 150               | 4
8         | 150               | 15
9         | 150               | 32
10        | 200               | 48
11        | 200               | 53
11        | 200               | 57
11        | 225               | 40

[Figure: graf_M_reboot.png]

After a reboot of the controller on which the l3 agent was active, another l3 agent becomes active. When the rebooted node recovers, its l3 agent becomes active as well, which leads to extra loss of external connectivity in the tenant network.

After some time only one agent remains active: the one from the rebooted node.

The root cause of this behavior is that routers are processed by the l3 agent before the openvswitch agent sets up the appropriate HA ports, so for some time the recovered HA router is isolated from the HA routers on other hosts and becomes active.
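One way to observe this window (a sketch; the router ID is a placeholder) is to list the L3 agents hosting an HA router together with their ha_state right after the rebooted controller comes back:

    # for HA routers the output includes an ha_state column; during the
    # window described above more than one agent may briefly report "active"
    neutron l3-agent-list-hosting-router <router-id>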

This issue needs special attention and will be investigated as bug/1597461.

Workaround for this issue: start the l3 agent after the openvswitch agent (one possible way to enforce this ordering is sketched after the table below). Results after applying this fix:

Number of routers | Command      | Number of lost packets
------------------+--------------+-----------------------
50                | ping 8.8.8.8 | 3
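A sketch of one possible way to express this ordering on an upstart-based controller is shown below; the job names neutron-l3-agent and neutron-plugin-openvswitch-agent are assumptions and should be checked against the actual names on the node before applying:

    # replace the l3 agent's start condition so it starts only after the
    # openvswitch agent; an .override file overrides stanzas of the same name
    echo "start on started neutron-plugin-openvswitch-agent" \
        > /etc/init/neutron-l3-agent.override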

6.18.2.4.2. Ping from one VM to another VM in a different network during ban of the L3 agent

[Figure: ping_floating.png]

Command: ping 172.16.45.139

Iteration | Number of routers | Number of lost packets
----------+-------------------+-----------------------
1         | 1                 | 2
2         | 1                 | 3
3         | 1                 | 3
4         | 10                | 1
5         | 10                | 5
6         | 10                | 3
7         | 50                | 4
8         | 50                | 3
9         | 50                | 4
10        | 100               | 1
11        | 100               | 3
12        | 150               | 4
13        | 150               | 26
14        | 150               | 3
15        | 200               | 3
16        | 200               | 14
17        | 200               | 3
18        | 225               | 5
19        | 225               | 3
20        | 250               | 6
21        | 250               | 6

[Figure: graf_M_stop.png]

With 250 routers, l3 agents started to fall into an unmanaged state.

6.18.2.4.3. Iperf UDP testing between VMs in different networks during ban of the L3 agent

[Figure: iperf_addresses1.png]

Number of routers | Command                                                            | Loss (%)
------------------+--------------------------------------------------------------------+--------------------------------------------------------------
50                | iperf -c 10.2.0.4 -p 5001 -t 60 -i 10 --bandwidth 30M --len 64 -u  | 5.8
50                | iperf -c 10.2.0.4 -p 5001 -t 60 -i 10 --bandwidth 30M --len 64 -u  | WARNING: did not receive ack of last datagram after 10 tries
50                | iperf -c 10.2.0.4 -p 5001 -t 60 -i 10 --bandwidth 30M --len 64 -u  | 12
100               | iperf -c 10.2.0.4 -p 5001 -t 60 -i 10 --bandwidth 30M --len 64 -u  | 16
100               | iperf -c 10.2.0.4 -p 5001 -t 60 -i 10 --bandwidth 30M --len 64 -u  | 6.1
150               | iperf -c 10.2.0.4 -p 5001 -t 120 -i 10 --bandwidth 30M --len 64 -u | 4.3
150               | iperf -c 10.2.0.4 -p 5001 -t 120 -i 10 --bandwidth 30M --len 64 -u | 4.8
150               | iperf -c 10.2.0.4 -p 5001 -t 60 -i 10 --bandwidth 30M --len 64 -u  | 31
150               | iperf -c 10.2.0.4 -p 5001 -t 60 -i 10 --bandwidth 30M --len 64 -u  | 6.9
200               | iperf -c 10.2.0.4 -p 5001 -t 90 -i 10 --bandwidth 30M --len 64 -u  | 7.5
200               | iperf -c 10.2.0.4 -p 5001 -t 90 -i 10 --bandwidth 30M --len 64 -u  | WARNING: did not receive ack of last datagram after 10 tries
227               | iperf -c 10.2.0.4 -p 5001 -t 120 -i 10 --bandwidth 30M --len 64 -u | 8.9
227               | iperf -c 10.2.0.4 -p 5001 -t 120 -i 10 --bandwidth 30M --len 64 -u | WARNING: did not receive ack of last datagram after 10 tries

[Figure: graf_M_stop_iperf.png]

With 227 routers, l3 agents started to fall into an unmanaged state.

6.18.2.4.4. Summary:

In comparison with the results for MOS 8.0 (Liberty):

  1. The root cause of the unstable L3 HA behaviour (bug/1563298) was found and filed as bug/1597461.

  2. Results for stop/start of the L3 agent became more stable.

  3. With more than 227 routers, the agent's recovery leads to it falling into an unmanaged state.