Testing Down Cells

This document describes how to recreate a down-cell scenario in a single-node devstack environment. This can be useful for testing the reliability of the controller services when a cell in the deployment is down.

Setup

DevStack config

This guide is based on a devstack install from the Train release using an Ubuntu Bionic 18.04 VM with 8 VCPU, 8 GB RAM and 200 GB of disk following the All-In-One Single Machine guide.

The following minimal local.conf was used:

[[local|localrc]]
# Define passwords
OS_PASSWORD=openstack1
SERVICE_TOKEN=$OS_PASSWORD
ADMIN_PASSWORD=$OS_PASSWORD
MYSQL_PASSWORD=$OS_PASSWORD
RABBIT_PASSWORD=$OS_PASSWORD
SERVICE_PASSWORD=$OS_PASSWORD
# Logging config
LOGFILE=$DEST/logs/stack.sh.log
LOGDAYS=2
# Disable non-essential services
disable_service horizon tempest

Populate cell1

Create a test server first so there is something in cell1:

$ source openrc admin admin
$ IMAGE=$(openstack image list -f value -c ID)
$ openstack server create --wait --flavor m1.tiny --image $IMAGE cell1-server

Take down cell1

Break the connection to the cell1 database by changing the database_connection URL, in this case with an invalid host IP:

mysql> select database_connection from cell_mappings where name='cell1';
+-------------------------------------------------------------------+
| database_connection                                               |
+-------------------------------------------------------------------+
| mysql+pymysql://root:openstack1@127.0.0.1/nova_cell1?charset=utf8 |
+-------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> update cell_mappings set database_connection='mysql+pymysql://root:openstack1@192.0.0.1/nova_cell1?charset=utf8' where name='cell1';
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

Update controller services

Prepare the controller services for the down cell. See Handling cell failures for details.

Modify nova.conf

Configure the API to avoid long timeouts and slow start times due to bug 1815697 by modifying /etc/nova/nova.conf:

[database]
...
max_retries = 1
retry_interval = 1

[upgrade_levels]
...
compute = stein  # N-1 from train release, just something other than "auto"

Restart services

Note

It is useful to tail the n-api service logs in another screen to watch for errors / warnings in the logs due to down cells:

$ sudo journalctl -f -a -u devstack@n-api.service

Restart controller services to flush the cell cache:

$ sudo systemctl restart devstack@n-api.service devstack@n-super-cond.service devstack@n-sch.service

Test cases

  1. Try to create a server which should fail and go to cell0.

    $ openstack server create --wait --flavor m1.tiny --image $IMAGE cell0-server
    

    You can expect to see errors like this in the n-api logs:

    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context [None req-fdaff415-48b9-44a7-b4c3-015214e80b90 None None] Error gathering result from cell 4f495a21-294a-4051-9a3d-8b34a250bbb4: DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on u'192.0.0.1' ([Errno 101] ENETUNREACH)") (Background on this error at: http://sqlalche.me/e/e3q8)
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context Traceback (most recent call last):
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/opt/stack/nova/nova/context.py", line 441, in gather_result
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     result = fn(cctxt, *args, **kwargs)
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 211, in wrapper
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     with reader_mode.using(context):
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     return self.gen.next()
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 1061, in _transaction_scope
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     context=context) as resource:
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     return self.gen.next()
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 659, in _session
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     bind=self.connection, mode=self.mode)
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 418, in _create_session
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     self._start()
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 510, in _start
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     engine_args, maker_args)
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 534, in _setup_for_connection
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     sql_connection=sql_connection, **engine_kwargs)
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/debtcollector/renames.py", line 43, in decorator
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     return wrapped(*args, **kwargs)
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py", line 201, in create_engine
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     test_conn = _test_connection(engine, max_retries, retry_interval)
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py", line 387, in _test_connection
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     six.reraise(type(de_ref), de_ref)
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "<string>", line 3, in reraise
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on u'192.0.0.1' ([Errno 101] ENETUNREACH)") (Background on this error at: http://sqlalche.me/e/e3q8)
    Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context
    Apr 04 20:48:22 train devstack@n-api.service[10884]: WARNING nova.objects.service [None req-1cf4bf5c-2f74-4be0-a18d-51ff81df57dd admin admin] Failed to get minimum service version for cell 4f495a21-294a-4051-9a3d-8b34a250bbb4
    
  2. List servers with the 2.69 microversion for down cells.

    Note

    Requires python-openstackclient >= 3.18.0 for v2.69 support.

    The server in cell1 (which is down) will show up with status UNKNOWN:

    $ openstack --os-compute-api-version 2.69 server list
    +--------------------------------------+--------------+---------+----------+--------------------------+--------+
    | ID                                   | Name         | Status  | Networks | Image                    | Flavor |
    +--------------------------------------+--------------+---------+----------+--------------------------+--------+
    | 8e90f1f0-e8dd-4783-8bb3-ec8d594e60f1 |              | UNKNOWN |          |                          |        |
    | afd45d84-2bd7-4e49-9dff-93359f742bc1 | cell0-server | ERROR   |          | cirros-0.4.0-x86_64-disk |        |
    +--------------------------------------+--------------+---------+----------+--------------------------+--------+
    
  3. Using v2.1 the UNKNOWN server is filtered out by default due to api.list_records_by_skipping_down_cells:

    $ openstack --os-compute-api-version 2.1 server list
    +--------------------------------------+--------------+--------+----------+--------------------------+---------+
    | ID                                   | Name         | Status | Networks | Image                    | Flavor  |
    +--------------------------------------+--------------+--------+----------+--------------------------+---------+
    | afd45d84-2bd7-4e49-9dff-93359f742bc1 | cell0-server | ERROR  |          | cirros-0.4.0-x86_64-disk | m1.tiny |
    +--------------------------------------+--------------+--------+----------+--------------------------+---------+
    
  4. Configure nova-api with list_records_by_skipping_down_cells=False

    [api]
    list_records_by_skipping_down_cells = False
    
  5. Restart nova-api and then listing servers should fail:

    $ sudo systemctl restart devstack@n-api.service
    $ openstack --os-compute-api-version 2.1 server list
    Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
    <class 'nova.exception.NovaException'> (HTTP 500) (Request-ID: req-e2264d67-5b6c-4f17-ae3d-16c7562f1b69)
    
  6. Try listing compute services with a down cell.

    The services from the down cell are skipped:

    $ openstack --os-compute-api-version 2.1 compute service list
    +----+------------------+-------+----------+---------+-------+----------------------------+
    | ID | Binary           | Host  | Zone     | Status  | State | Updated At                 |
    +----+------------------+-------+----------+---------+-------+----------------------------+
    |  2 | nova-scheduler   | train | internal | enabled | up    | 2019-04-04T21:12:47.000000 |
    |  6 | nova-consoleauth | train | internal | enabled | up    | 2019-04-04T21:12:38.000000 |
    |  7 | nova-conductor   | train | internal | enabled | up    | 2019-04-04T21:12:47.000000 |
    +----+------------------+-------+----------+---------+-------+----------------------------+
    

    With 2.69 the nova-compute service from cell1 is shown with status UNKNOWN:

    $ openstack --os-compute-api-version 2.69 compute service list
    +--------------------------------------+------------------+-------+----------+---------+-------+----------------------------+
    | ID                                   | Binary           | Host  | Zone     | Status  | State | Updated At                 |
    +--------------------------------------+------------------+-------+----------+---------+-------+----------------------------+
    | f68a96d9-d994-4122-a8f9-1b0f68ed69c2 | nova-scheduler   | train | internal | enabled | up    | 2019-04-04T21:13:47.000000 |
    | 70cd668a-6d60-4a9a-ad83-f863920d4c44 | nova-consoleauth | train | internal | enabled | up    | 2019-04-04T21:13:38.000000 |
    | ca88f023-1de4-49e0-90b0-581e16bebaed | nova-conductor   | train | internal | enabled | up    | 2019-04-04T21:13:47.000000 |
    |                                      | nova-compute     | train |          | UNKNOWN |       |                            |
    +--------------------------------------+------------------+-------+----------+---------+-------+----------------------------+
    

Future

This guide could be expanded for having multiple non-cell0 cells where one cell is down while the other is available and go through scenarios where the down cell is marked as disabled to take it out of scheduling consideration.