Restoring the Overcloud control plane services

Restoring the Overcloud control plane from a failed state depends on the specific issue the operator is facing.

This section provides a restore method for the backups created in previous steps.

The general strategy for restoring an Overcloud control plane is to bring the services back to a working state so that the update/upgrade tasks can be re-run.

YUM update rollback

Depending on the updated packages, running a yum rollback based on the yum history command might not be a good idea. In the specific case of an OpenStack minor update or major upgrade, a rollback is particularly hard: there are many dependencies and packages to downgrade, spread across the number of transactions yum had to run to update all the packages on the node. Also, using yum history to roll back transactions can end up removing packages that the system needs to work correctly.
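
If a rollback is still attempted, the yum transaction history can be inspected first to understand how many transactions and packages would be affected. The following is only a sketch for reference, not a recommended recovery path:

# List recent yum transactions to find the update transaction ID
sudo yum history list
# Show exactly which packages a given transaction touched
sudo yum history info <transaction_id>
# Undoing a transaction downgrades or removes the affected packages;
# review the proposed changes carefully before confirming
sudo yum history undo <transaction_id>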

Database restore

If the packages were updated correctly but an issue appears while updating the database schemas, the database cluster might need to be restored.

With all the services stopped in the Overcloud controllers (except MySQL), go through the following procedure:
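If any services are still running on the controllers, they can be stopped first. The exact resource names depend on the deployment, so the following is only an example for pacemaker-managed resources other than galera:

# Example only: disable pacemaker-managed resources other than galera
pcs resource disable rabbitmq
pcs resource disable redis
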

On all the controller nodes, drop connections to the database port via the VIP by running:

MYSQLIP=$(grep -A1 'listen mysql' /var/lib/config-data/haproxy/etc/haproxy/haproxy.cfg | grep bind | awk '{print $2}' | awk -F":" '{print $1}')
sudo nft add rule inet filter TRIPLEO_INPUT tcp dport 3306 ip daddr $MYSQLIP drop

This blocks all MySQL traffic going through the VIP while the restore is in progress.
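
To confirm the rule is in place on each node, list the chain and look for the MySQL port, for example:

sudo nft list chain inet filter TRIPLEO_INPUT | grep 3306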

On only one controller node, unmanage galera so that it is out of pacemaker’s control:

pcs resource unmanage galera

Remove the wsrep_cluster_address option from /var/lib/config-data/mysql/etc/my.cnf.d/galera.cnf. This needs to be executed on all nodes:

grep wsrep_cluster_address /var/lib/config-data/mysql/etc/my.cnf.d/galera.cnf
vi /var/lib/config-data/mysql/etc/my.cnf.d/galera.cnf
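
Alternatively, the line can be removed non-interactively on each node, for example with sed (this assumes the option starts at the beginning of its line; -i.bak keeps a backup of the original file):

sudo sed -i.bak '/^wsrep_cluster_address/d' /var/lib/config-data/mysql/etc/my.cnf.d/galera.cnf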

On all the controller nodes, stop the MariaDB database:

mysqladmin -u root shutdown

On all the controller nodes, move existing MariaDB data directories and prepare new data directories:

sudo -i
# Keep the old data directory in case it is needed again
mv /var/lib/mysql/ /var/lib/mysql.old
mkdir /var/lib/mysql
chown mysql:mysql /var/lib/mysql
chmod 0755 /var/lib/mysql
# Initialize a fresh, empty MariaDB data directory
mysql_install_db --datadir=/var/lib/mysql --user=mysql
chown -R mysql:mysql /var/lib/mysql/
# Restore the SELinux context on the new data directory
restorecon -R /var/lib/mysql

On all the controller nodes, move the root and clustercheck configuration files aside as backups:

sudo mv /root/.my.cnf /root/.my.cnf.old
sudo mv /etc/sysconfig/clustercheck /etc/sysconfig/clustercheck.old

On the controller node we previously set to unmanaged, bring the galera cluster up with pacemaker:

pcs resource manage galera
pcs resource cleanup galera

Wait for the galera cluster to come up properly, and run the following command until all nodes are reported as masters:

pcs status | grep -C3 galera
# Master/Slave Set: galera-master [galera]
# Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

NOTE: If the cleanup does not show all controller nodes as masters, re-run the following command:

pcs resource cleanup galera

On the controller node that was previously unmanaged and is now managed again by pacemaker, restore the OpenStack database that was backed up in a previous section. Galera will replicate it to the other controllers:

mysql -u root < openstack_database.sql
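
At this point it can be useful to verify that the data was restored and that Galera is replicating to all controllers, for example:

# Confirm the restored databases are present
mysql -u root -e "SHOW DATABASES;"
# Confirm all controllers joined the cluster (3 on a typical deployment)
mysql -u root -e "SHOW STATUS LIKE 'wsrep_cluster_size';"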

On the same controller node, restore the users and permissions:

mysql -u root < grants.sql

pcs status will show the galera resource in error because it is now using the wrong user/password to poll the database status. On all the controller nodes, restore the root and clustercheck configuration from the backup files:

sudo mv /root/.my.cnf.old /root/.my.cnf
sudo mv /etc/sysconfig/clustercheck.old /etc/sysconfig/clustercheck

Test the clustercheck locally on each controller node:

/bin/clustercheck

Perform a cleanup in pacemaker to reprobe the state of the galera nodes:

pcs resource cleanup galera

Test clustercheck on each controller node via xinetd.d:

curl overcloud-controller-0:9200
curl overcloud-controller-1:9200
curl overcloud-controller-2:9200

On each node, remove the firewall rule that was added earlier so that the services regain access to the database:

sudo nft -a list chain inet filter TRIPLEO_INPUT | grep 3306
[...]
tcp dport 3306 ip daddr $MYSQLIP drop # handle 499
# Use the handle number printed by the previous command (499 is only an example)
sudo nft delete rule inet filter TRIPLEO_INPUT handle 499

Filesystem restore

On all overcloud nodes, copy the backup tar file to a temporary directory and uncompress all the data:

mkdir -p /var/tmp/filesystem_backup/data/
cd /var/tmp/filesystem_backup/data/
mv <path_to_the_backup_file> .
tar --xattrs -xvzf <backup_file>.tar.gz

NOTE: Untarring directly into the / directory will overwrite your current files. It is recommended to untar the file in a different directory.
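
How the extracted data is copied back into place depends on which files actually need to be recovered. As an illustration only, a single configuration directory (here a hypothetical /etc/keystone) could be restored from the extracted tree while preserving attributes:

# Example only: restore one directory from the extracted backup;
# the tree under /var/tmp/filesystem_backup/data/ mirrors the original layout
sudo rsync -aAX /var/tmp/filesystem_backup/data/etc/keystone/ /etc/keystone/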

Clean up the redis resource

Run:

pcs resource cleanup redis

Start up the services on all the controller nodes

The operator must check that all services start correctly. The services installed on the controllers depend on the operator's needs, so the following commands might not apply completely. The goal of this section is to show that all services must be started correctly before retrying an update or upgrade, or using the Overcloud on a regular basis.
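
A quick way to get an overview of what is not running on a controller is to check the pacemaker resources and the failed systemd units, for example:

# Show the state of all pacemaker-managed resources
sudo pcs status
# List systemd units that failed to start
sudo systemctl list-units --state=failed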

Non-containerized environment

Commands to start the services:

sudo -i
systemctl start openstack-ceilometer-central
systemctl start memcached
pcs resource enable rabbitmq
systemctl start openstack-nova-scheduler
systemctl start openstack-heat-api
systemctl start mongod
systemctl start redis
systemctl start httpd
systemctl start neutron-ovs-cleanup

Once all the controller nodes are up, start the compute node services on all the compute nodes:

sudo -i
systemctl start openstack-ceilometer-compute.service
systemctl start openstack-nova-compute.service
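
Once the compute services are up, they can optionally be verified from a node that has the Overcloud credentials, assuming an overcloudrc file from the deployment is available:

source overcloudrc
openstack compute service list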

Containerized environment

The operator must check that all containerized services are running correctly. List the currently running containers with:

sudo docker ps
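
Since docker ps only lists running containers, the stopped ones can be listed explicitly, for example:

# List containers that are not running
sudo docker ps --all --filter "status=exited"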

Once a stopped service is identified, start it by running:

sudo docker start <service name>
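
If a container does not stay up after being started, its recent log output can be inspected, for example:

sudo docker logs --tail 50 <service name>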