This section describes recommended practices for running and maintaining CellsV2 for admins and operators. For more details on the basic concepts of CellsV2 and its layout, see the main Cells Layout (v2) page.
For an explanation of how nova-api handles cell failures, see the Handling Down Cells section of the Compute API guide. Below are some recommended practices and considerations for effectively tolerating cell failure situations.
Since a cell being reachable or not is determined through timeouts, it is suggested to provide suitable values for the following settings based on your requirements.
database.max_retries is 10 by default, meaning that every time a cell becomes unreachable, nova retries 10 times before it can declare the cell a “down” cell.
database.retry_interval is 10 seconds and oslo_messaging_rabbit.rabbit_retry_interval is 1 second by default, meaning that every time a cell becomes unreachable, nova retries every 10 seconds or every 1 second, depending on whether it is a database or a message queue problem.
CELL_TIMEOUT is hardcoded to 60 seconds; this is the total time nova-api waits before returning partial results for the “down” cells.
The values of the above settings affect the time required for nova to decide whether a cell is unreachable and then take the necessary actions, such as returning partial results.
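As a rough illustration of how these defaults add up, the sketch below multiplies the retry count by the retry interval. It assumes that retries run back to back and that each one waits the full interval; the constant and function names are illustrative and are not nova identifiers.

```python
# Defaults quoted in the text above; the names here are illustrative only.
DB_MAX_RETRIES = 10      # database.max_retries
DB_RETRY_INTERVAL = 10   # database.retry_interval, in seconds
CELL_TIMEOUT = 60        # hardcoded in nova, in seconds


def retry_window(max_retries, retry_interval):
    """Upper bound on time spent retrying, assuming each retry waits the full interval."""
    return max_retries * retry_interval


db_window = retry_window(DB_MAX_RETRIES, DB_RETRY_INTERVAL)
print(f"database retry window: {db_window} s")
print(f"CELL_TIMEOUT: {CELL_TIMEOUT} s")
print(f"retry window exceeds CELL_TIMEOUT: {db_window > CELL_TIMEOUT}")
```

Under these assumptions, lowering either database setting shortens the time nova may spend retrying against an unreachable cell database.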
The operator can also control the results of certain actions, such as listing servers and services, through the api.list_records_by_skipping_down_cells config option. If this is true, the results from the unreachable cells are skipped; if it is false, the request fails with an API error in situations where partial constructs cannot be computed.
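The effect of this option can be modelled with a small sketch. This is not nova's implementation: CellDownError and list_records are hypothetical names, None marks an unreachable cell, and the model simplifies by treating every down cell as a case where partial constructs cannot be computed.

```python
class CellDownError(Exception):
    """Hypothetical stand-in for the API error returned to the client."""


def list_records(results_by_cell, skip_down_cells):
    """Merge per-cell results; a value of None marks a cell that could not be reached."""
    records = []
    for cell_name, cell_records in results_by_cell.items():
        if cell_records is None:
            if skip_down_cells:
                continue  # leave the unreachable cell out of the listing
            raise CellDownError(f"cell {cell_name} is unreachable")
        records.extend(cell_records)
    return records


results = {"cell1": ["server-a", "server-b"], "cell2": None}
print(list_records(results, skip_down_cells=True))
```

With skip_down_cells=False, the same input raises CellDownError instead of returning the records from cell1.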
While the temporary outage in the infrastructure is being fixed, the affected cells can be disabled so that they are no longer scheduling candidates. To disable a cell, use nova-manage cell_v2 update_cell --cell_uuid <cell_uuid> --disable; to enable it again, pass --enable instead of --disable. See the Nova Cells v2 man page for details on command usage.
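For operators who script this step, the following sketch builds the command line without running it. The helper name update_cell_command is hypothetical, and the --enable flag is assumed from the nova-manage man page, so check it against your release.

```python
import shlex


def update_cell_command(cell_uuid, disable=True):
    """Build the nova-manage argument list that disables or enables a cell."""
    state_flag = "--disable" if disable else "--enable"
    return ["nova-manage", "cell_v2", "update_cell",
            "--cell_uuid", cell_uuid, state_flag]


# Placeholder UUID; substitute the UUID of the affected cell.
cmd = update_cell_command("<cell_uuid>")
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a host where nova-manage is installed and configured.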
When upgrade_levels.compute is set to auto, the nova-api service hangs on startup if at least one cell is unreachable. This is because it needs to connect to all the cells to gather each compute service’s version and determine the compute version cap to use. The current workaround is to pin upgrade_levels.compute to a particular version such as “rocky” so that the service can start in such situations. See bug 1815697 for more details.
Also note that, in general, while cells are unreachable, operations that need to hit all the cells may be slower because of the aforementioned configurable timeout/retry values.
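A preflight check for the auto setting could look like the sketch below. It assumes the configuration text is simple enough for Python's configparser, which only approximates how nova parses its configuration; compute_cap_is_auto and the sample strings are hypothetical.

```python
import configparser


def compute_cap_is_auto(conf_text):
    """Return True if [upgrade_levels] compute is set to auto in the given text."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    value = parser.get("upgrade_levels", "compute", fallback="")
    return value.strip().lower() == "auto"


AUTO_CONF = """
[upgrade_levels]
compute = auto
"""

PINNED_CONF = """
[upgrade_levels]
compute = rocky
"""

print(compute_cap_is_auto(AUTO_CONF))
print(compute_cap_is_auto(PINNED_CONF))
```

If the check returns True while a cell is known to be unreachable, pinning the value as in PINNED_CONF before starting nova-api follows the workaround described above.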