Testing Down Cells¶
This document describes how to recreate a down-cell scenario in a single-node devstack environment. This can be useful for testing the reliability of the controller services when a cell in the deployment is down.
Setup¶
DevStack config¶
This guide is based on a devstack install from the Train release, using an Ubuntu Bionic 18.04 VM with 8 vCPUs, 8 GB RAM and 200 GB of disk, and following the All-In-One Single Machine guide.
The following minimal local.conf was used:
[[local|localrc]]
# Define passwords
OS_PASSWORD=openstack1
SERVICE_TOKEN=$OS_PASSWORD
ADMIN_PASSWORD=$OS_PASSWORD
MYSQL_PASSWORD=$OS_PASSWORD
RABBIT_PASSWORD=$OS_PASSWORD
SERVICE_PASSWORD=$OS_PASSWORD
# Logging config
LOGFILE=$DEST/logs/stack.sh.log
LOGDAYS=2
# Disable non-essential services
disable_service horizon tempest
Populate cell1¶
Create a test server first so there is something in cell1:
$ source openrc admin admin
$ IMAGE=$(openstack image list -f value -c ID)
$ openstack server create --wait --flavor m1.tiny --image $IMAGE cell1-server
Take down cell1¶
Break the connection to the cell1 database by changing the database_connection URL, in this case with an invalid host IP:
mysql> select database_connection from cell_mappings where name='cell1';
+-------------------------------------------------------------------+
| database_connection                                               |
+-------------------------------------------------------------------+
| mysql+pymysql://root:openstack1@127.0.0.1/nova_cell1?charset=utf8 |
+-------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> update cell_mappings set database_connection='mysql+pymysql://root:openstack1@192.0.0.1/nova_cell1?charset=utf8' where name='cell1';
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
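The host swap above can also be scripted, which makes it easier to restore the working URL once testing is finished. The following is a minimal sketch, not part of nova or devstack: swap_db_host is a hypothetical helper name, and it assumes the password in the URL contains no "@" or "/" characters.

```shell
# Hypothetical helper: replace the host part of a database connection URL.
# $1 = connection URL, $2 = new host
swap_db_host() {
    # Rewrite everything between the first "@" and the next "/".
    printf '%s\n' "$1" | sed -E "s|@[^/]+/|@$2/|"
}

# Example (not run here): restore the working URL for cell1.
# Assumes the devstack API database is named nova_api and the root
# password from the local.conf above.
#   broken='mysql+pymysql://root:openstack1@192.0.0.1/nova_cell1?charset=utf8'
#   good=$(swap_db_host "$broken" 127.0.0.1)
#   mysql -uroot -popenstack1 nova_api -e \
#     "update cell_mappings set database_connection='$good' where name='cell1';"
```

After restoring the URL, the controller services need another restart so that the cached cell mapping is refreshed.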
Update controller services¶
Prepare the controller services for the down cell. See Handling cell failures for details.
Modify nova.conf¶
Configure the API to avoid long timeouts and slow start times due to bug 1815697 by modifying /etc/nova/nova.conf:
[database]
...
max_retries = 1
retry_interval = 1
[upgrade_levels]
...
# N-1 from train release, just something other than "auto"
compute = stein
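Before restarting anything, it can help to confirm that each option landed in the intended section. The following is a minimal sketch using awk; ini_get is a hypothetical helper (not a nova or devstack command) and it only understands simple "key = value" lines.

```shell
# Hypothetical helper: print the value of an option in an INI-style file.
# $1 = file, $2 = section name (without brackets), $3 = option name
ini_get() {
    awk -v section="[$2]" -v key="$3" '
        # A section header switches matching on or off.
        /^\[/ { in_section = ($0 == section); next }
        in_section && index($0, "=") > 0 {
            k = substr($0, 1, index($0, "=") - 1)
            v = substr($0, index($0, "=") + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) { print v; exit }
        }
    ' "$1"
}

# Example (not run here):
#   ini_get /etc/nova/nova.conf database max_retries
#   ini_get /etc/nova/nova.conf upgrade_levels compute
```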
Restart services¶
Note
It is useful to tail the n-api service logs in another screen to watch for errors / warnings in the logs due to down cells:
$ sudo journalctl -f -a -u devstack@n-api.service
Restart controller services to flush the cell cache:
$ sudo systemctl restart devstack@n-api.service devstack@n-super-cond.service devstack@n-sch.service
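The services can take a few seconds to come back, so scripted test runs may want to poll before issuing API requests. The following is a minimal sketch of a generic polling loop; wait_for is a hypothetical helper name, and the one-second interval is an arbitrary choice.

```shell
# Hypothetical helper: retry a command once per second until it succeeds.
# $1 = maximum number of attempts, remaining arguments = command to run
wait_for() {
    attempts=$1
    shift
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Succeed as soon as the polled command exits 0.
        "$@" >/dev/null 2>&1 && return 0
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Example (not run here): wait up to 30 seconds for nova-api to be active.
#   wait_for 30 systemctl is-active --quiet devstack@n-api.service
```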
Test cases¶
Try to create a server, which should fail and go to cell0.
$ openstack server create --wait --flavor m1.tiny --image $IMAGE cell0-server
You can expect to see errors like this in the n-api logs:
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context [None req-fdaff415-48b9-44a7-b4c3-015214e80b90 None None] Error gathering result from cell 4f495a21-294a-4051-9a3d-8b34a250bbb4: DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on u'192.0.0.1' ([Errno 101] ENETUNREACH)") (Background on this error at: http://sqlalche.me/e/e3q8)
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context Traceback (most recent call last):
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/opt/stack/nova/nova/context.py", line 441, in gather_result
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     result = fn(cctxt, *args, **kwargs)
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 211, in wrapper
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     with reader_mode.using(context):
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     return self.gen.next()
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 1061, in _transaction_scope
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     context=context) as resource:
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     return self.gen.next()
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 659, in _session
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     bind=self.connection, mode=self.mode)
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 418, in _create_session
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     self._start()
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 510, in _start
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     engine_args, maker_args)
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 534, in _setup_for_connection
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     sql_connection=sql_connection, **engine_kwargs)
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/debtcollector/renames.py", line 43, in decorator
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     return wrapped(*args, **kwargs)
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py", line 201, in create_engine
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     test_conn = _test_connection(engine, max_retries, retry_interval)
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py", line 387, in _test_connection
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     six.reraise(type(de_ref), de_ref)
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "<string>", line 3, in reraise
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on u'192.0.0.1' ([Errno 101] ENETUNREACH)") (Background on this error at: http://sqlalche.me/e/e3q8)
Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context
Apr 04 20:48:22 train devstack@n-api.service[10884]: WARNING nova.objects.service [None req-1cf4bf5c-2f74-4be0-a18d-51ff81df57dd admin admin] Failed to get minimum service version for cell 4f495a21-294a-4051-9a3d-8b34a250bbb4
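When the journal is noisy, it can be easier to reduce these messages to the set of affected cell UUIDs. The following is a minimal sketch that relies only on the two message texts shown above; down_cells_from_log is a hypothetical helper name, and other nova releases may word these messages differently.

```shell
# Hypothetical helper: read n-api log lines on stdin and print the distinct
# UUIDs of cells that produced down-cell errors or warnings.
down_cells_from_log() {
    grep -E 'Error gathering result from cell|Failed to get minimum service version for cell' |
        grep -oE 'cell [0-9a-f-]{36}' |
        awk '{ print $2 }' |
        sort -u
}

# Example (not run here):
#   sudo journalctl -a -u devstack@n-api.service --since "10 minutes ago" | down_cells_from_log
```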
List servers with the 2.69 microversion for down cells.
Note
Requires python-openstackclient >= 3.18.0 for v2.69 support.
The server in cell1 (which is down) will show up with status UNKNOWN:
$ openstack --os-compute-api-version 2.69 server list
+--------------------------------------+--------------+---------+----------+--------------------------+--------+
| ID                                   | Name         | Status  | Networks | Image                    | Flavor |
+--------------------------------------+--------------+---------+----------+--------------------------+--------+
| 8e90f1f0-e8dd-4783-8bb3-ec8d594e60f1 |              | UNKNOWN |          |                          |        |
| afd45d84-2bd7-4e49-9dff-93359f742bc1 | cell0-server | ERROR   |          | cirros-0.4.0-x86_64-disk |        |
+--------------------------------------+--------------+---------+----------+--------------------------+--------+
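For scripted checks, the same listing can be reduced to the IDs of servers reported as UNKNOWN. The following is a minimal sketch; unknown_servers is a hypothetical helper name, and it assumes the client is asked for exactly the ID and Status columns in value format.

```shell
# Hypothetical helper: read "<ID> <Status>" pairs on stdin and print the
# IDs whose status is UNKNOWN.
unknown_servers() {
    awk '$2 == "UNKNOWN" { print $1 }'
}

# Example (not run here):
#   openstack --os-compute-api-version 2.69 server list -f value -c ID -c Status | unknown_servers
```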
Using v2.1 the UNKNOWN server is filtered out by default due to api.list_records_by_skipping_down_cells:
$ openstack --os-compute-api-version 2.1 server list
+--------------------------------------+--------------+--------+----------+--------------------------+---------+
| ID                                   | Name         | Status | Networks | Image                    | Flavor  |
+--------------------------------------+--------------+--------+----------+--------------------------+---------+
| afd45d84-2bd7-4e49-9dff-93359f742bc1 | cell0-server | ERROR  |          | cirros-0.4.0-x86_64-disk | m1.tiny |
+--------------------------------------+--------------+--------+----------+--------------------------+---------+
Configure nova-api with list_records_by_skipping_down_cells=False:
[api]
list_records_by_skipping_down_cells = False
Restart nova-api and then listing servers should fail:
$ sudo systemctl restart devstack@n-api.service
$ openstack --os-compute-api-version 2.1 server list
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'nova.exception.NovaException'> (HTTP 500) (Request-ID: req-e2264d67-5b6c-4f17-ae3d-16c7562f1b69)
Try listing compute services with a down cell.
The services from the down cell are skipped:
$ openstack --os-compute-api-version 2.1 compute service list
+----+------------------+-------+----------+---------+-------+----------------------------+
| ID | Binary           | Host  | Zone     | Status  | State | Updated At                 |
+----+------------------+-------+----------+---------+-------+----------------------------+
| 2  | nova-scheduler   | train | internal | enabled | up    | 2019-04-04T21:12:47.000000 |
| 6  | nova-consoleauth | train | internal | enabled | up    | 2019-04-04T21:12:38.000000 |
| 7  | nova-conductor   | train | internal | enabled | up    | 2019-04-04T21:12:47.000000 |
+----+------------------+-------+----------+---------+-------+----------------------------+
With 2.69 the nova-compute service from cell1 is shown with status UNKNOWN:
$ openstack --os-compute-api-version 2.69 compute service list
+--------------------------------------+------------------+-------+----------+---------+-------+----------------------------+
| ID                                   | Binary           | Host  | Zone     | Status  | State | Updated At                 |
+--------------------------------------+------------------+-------+----------+---------+-------+----------------------------+
| f68a96d9-d994-4122-a8f9-1b0f68ed69c2 | nova-scheduler   | train | internal | enabled | up    | 2019-04-04T21:13:47.000000 |
| 70cd668a-6d60-4a9a-ad83-f863920d4c44 | nova-consoleauth | train | internal | enabled | up    | 2019-04-04T21:13:38.000000 |
| ca88f023-1de4-49e0-90b0-581e16bebaed | nova-conductor   | train | internal | enabled | up    | 2019-04-04T21:13:47.000000 |
|                                      | nova-compute     | train |          | UNKNOWN |       |                            |
+--------------------------------------+------------------+-------+----------+---------+-------+----------------------------+
Future¶
This guide could be expanded to cover deployments with multiple non-cell0 cells, where one cell is down while another is available, and to walk through scenarios where the down cell is marked as disabled to take it out of scheduling consideration.