Rehoming Subcloud with Expired Certificates

The rehoming procedure for a subcloud that has been powered off for a long period of time differs from the regular rehoming procedure. Depending on how long the subcloud has been offline, the platform certificates may have expired and require regeneration.

If the certificates are recoverable, the rehoming playbook will automatically recover most of them. However, some certificates require manual intervention. In that case, the playbook fails, and the output of the dcmanager subcloud errors <subcloud-name> command indicates the actions that need to be taken.
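
If the playbook fails, a quick way to see what action is required is to query the error details from the system controller; the subcloud name below is only a placeholder:

# On the system controller; "subcloud1" is a placeholder subcloud name.
dcmanager subcloud show subcloud1      # current deploy status
dcmanager subcloud errors subcloud1    # detailed failure and required action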

Procedure

  1. Power on controller-0 of the subcloud.

    Note

    Ensure that you can ping the OAM floating IP of the subcloud from the new system controller before proceeding; a connectivity check sketch follows this procedure.

  2. SSH to the subcloud as sysadmin. If the password has expired, you will be prompted to update the sysadmin password.

  3. Proceed with rehoming.
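
The following is a minimal connectivity sketch run from the new system controller before starting the rehoming; the OAM floating IP 10.10.10.2 is only a placeholder:

# From the new system controller; 10.10.10.2 is a placeholder OAM floating IP.
ping -c 3 10.10.10.2
ssh sysadmin@10.10.10.2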

Multi-node system

In DX and standard subclouds (subclouds with two controllers), the subcloud may be stuck in a continuous swact/reboot cycle, leaving no stable controller for the rehoming procedure to target.

  • Ensure that you power off controller-1 before attempting the rehoming procedure. Otherwise, the playbook will fail with the error Certificate recovery in progress. Please power-off controller-1 and try again.

  • The rehoming playbook will run and recover the active controller of the subcloud, after which it will display Running certificate recovery on other nodes. Connect to the subcloud and run 'tail -f /root/ansible.log' to follow the logs. This means that another Ansible process is running on the subcloud and you can review the log for more details (see the sketch at the end of this section).

  • At the Running certificate recovery on other nodes step, controller-1 should be powered on automatically. If not, a message will be written to /root/ansible.log asking for manual intervention to power it on.

    The following error indicates that controller-1 should be powered off first for subcloud active controller certificate recovery:

    [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
    FAILED rehoming playbook of (subcloud3).
     detail: fatal: [subcloud3]: FAILED! => changed=false
      msg: Certificate recovery in progress. Please power-off controller-1 and try again.
    FAILED TASK: TASK [common/recover-subcloud-certificates : Fail if controller-1 is running]  Thursday 15 March 2035  00:01:03 +0000 (0:00:00.439)       0:00:08.467
    

    If you get this error, power off controller-1 and try again.
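
To follow the certificate recovery on the subcloud, as suggested by the playbook message above, a simple sketch is:

# On the subcloud's active controller; sudo is assumed to be required
# because the log lives under /root.
sudo tail -f /root/ansible.log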

Manually Managed Certificates

Manually managed certificates are those installed by the user with the system certificate-install command. Examples include the StarlingX REST API & Horizon Server certificate and the Local Registry Server certificate. These certificates cannot be recovered automatically.

As automatic recovery is not possible, the rehoming procedure will fail and ask for manual intervention:

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
FAILED rehoming playbook of (subcloud3).
 detail: fatal: [subcloud3]: FAILED! => changed=false
  msg: |-
    Rest API and Docker Registry certificates are expired.  Manual action required! On the subcloud, please update the expired certificates with `system certificate-install` and then run "dcmanager subcloud delete" and "dcmanager subcloud add" again to restart the procedure.
TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
Wednesday 14 March 2035  22:52:22 +0000 (0:00:00.026)       0:03:12.115 *******
skipping: [subcloud3]

If you get this error, obtain new versions of the expired certificates, install them with system certificate-install (see the sketch after the note below), and try again.

Note

This will not be required if the certificates are already managed by cert-manager.
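
A minimal recovery sketch follows. The certificate file names and the subcloud name are placeholders, and the -m modes shown (ssl for the REST API & Horizon certificate, docker_registry for the Local Registry certificate) are assumed to be the ones in use on your system:

# On the subcloud: install the renewed certificates (file names are placeholders).
system certificate-install -m ssl ./rest-api-and-horizon.pem
system certificate-install -m docker_registry ./local-registry.pem

# On the system controller: restart the procedure as instructed by the error.
dcmanager subcloud delete subcloud3
dcmanager subcloud add ...   # re-issue the original add/migrate command and parameters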

Cert-manager Certificates using a Custom CA Issuer

If you are using a Cert-manager Issuer other than system-local-ca for platform certificates, you will get the following error:

[sysadmin@controller-0 dc-config(keystone_admin)]$ dcmanager subcloud errors subcloud1
FAILED rehoming playbook of (subcloud1).
 detail: fatal: [subcloud1]: FAILED! => changed=false
  msg: Cert-manager certificate(s) with their issuer expired. Please verify secret(s)
deployment/cloudplatform-rootca-secret on the subcloud, manually update and try
again.
TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
Saturday 03 March 2035  18:56:00 +0000 (0:00:00.042)       0:02:42.799 ********
skipping: [subcloud1]
FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes]  Saturday 03 March 2035  18:56:00 +0000 (0:00:00.042)
 0:02:42.799

In this case, you must manually update the underlying Issuer's secret.

As an example, the above error mentions deployment/cloudplatform-rootca-secret, where deployment is the Kubernetes namespace and cloudplatform-rootca-secret is the secret name. To update the CA certificate in this secret, use the following commands:

kubectl -n deployment delete secret cloudplatform-rootca-secret
kubectl -n deployment create secret tls cloudplatform-rootca-secret --key=./ca.key --cert=./ca.crt
rm ca.crt ca.key

ca.crt and ca.key are in PEM format. They can be obtained from security personnel or the team responsible for certificate management.
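
Before recreating the secret, it may be worth confirming that the replacement ca.crt and ca.key are valid and actually form a pair; a sketch using openssl:

# Check the validity window and subject of the replacement CA certificate.
openssl x509 -in ca.crt -noout -subject -dates

# The two digests below must match, confirming the key belongs to the certificate.
openssl x509 -in ca.crt -noout -pubkey | openssl sha256
openssl pkey -in ca.key -pubout | openssl sha256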

Management Affecting Alarms

After the certificate recovery process completes, the subcloud should be free of management affecting alarms; any management affecting alarms will cause the rehoming procedure to fail. The subcloud may still be recoverable, and the alarms should indicate the condition and provide information on the next steps.

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud errors subcloud3
FAILED rehoming playbook of (subcloud3).
 detail: fatal: [subcloud3]: FAILED! => changed=false
  msg: The subcloud has management affecting alarms which are blocking the rehoming
procedure from continuing. The subcloud may still be recoverable, connect to it and
run "fm alarm-list --mgmt_affecting" to check the alarms. Please resolve the alarm
condition(s) then try again.
TASK [common/recover-subcloud-certificates : Delete root ca key file after use in compute nodes] ***
Wednesday 14 March 2035  23:45:44 +0000 (0:00:00.020)       0:42:53.295 *******
skipping: [subcloud3]
FAILED TASK: TASK [common/recover-subcloud-certificates : Delete root ca key file
after use in compute nodes]  Wednesday 14 March 2035  23:45:44 +0000 (0:00:00.020)
0:42:53.295

In this case, review the active alarms and take the necessary actions to resolve them.
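
The command referenced in the error message lists only the alarms that block the procedure:

# On the subcloud: list management affecting alarms only.
fm alarm-list --mgmt_affecting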

SSL CA Certificates

SSL CA certificates are not automatically recovered as part of the rehoming procedure.

After a successful rehoming, the system raises an alarm to notify users that SSL CA certificates have expired:

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
| Alarm ID | Reason Text                                                                                              | Entity ID                            | Severity | Time Stamp          |
+----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+
| 500.210  | Certificate 'system certificate-show 9062a088-8c71-46c6-b194-6a65908f1080' (mode=ssl_ca) expired.        | system.certificate.mode=ssl_ca.uuid= | critical | 2035-03-19T23:50:22 |
|          |                                                                                                          | 9062a088-8c71-46c6-b194-6a65908f1080 |          | .917781             |
+----------+----------------------------------------------------------------------------------------------------------+--------------------------------------+----------+---------------------+

The alarm indicates that the certificate has expired. For more information about the certificate, run sudo show-certs.sh. The following are the two possible resolutions:

  • If the certificate is no longer needed, uninstall it:

    system certificate-list | grep ssl_ca
    system certificate-uninstall -m ssl_ca <expired_certificate_uuid>
    
  • If the certificate is still needed, first uninstall the expired version:

    system certificate-list | grep ssl_ca
    system certificate-uninstall -m ssl_ca <expired_certificate_uuid>
    

    Obtain and install the new version of the required certificate:

    system certificate-install -m ssl_ca <new_ssl_ca>
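
After either resolution, it may be worth confirming that the expired entry is gone and that the 500.210 alarm has cleared; for example:

# Verify the expired ssl_ca entry has been removed and the alarm has cleared.
system certificate-list | grep ssl_ca
fm alarm-list | grep 500.210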