Node cleaning

Overview

Ironic provides two modes for node cleaning: automated and manual.

Automated cleaning runs before the first workload is assigned to a node and whenever hardware is recycled from one workload to another.

Manual cleaning must be invoked by the operator.

Automated cleaning

When hardware is recycled from one workload to another, ironic performs automated cleaning on the node to ensure it’s ready for another workload. This ensures the tenant will get a consistent bare metal node deployed every time.

Ironic implements automated cleaning by collecting a list of cleaning steps to perform on a node from the Power, Deploy, Management, BIOS, and RAID interfaces of the driver assigned to the node. These steps are then ordered by priority and executed on the node when the node is moved to cleaning state, if automated cleaning is enabled.

With automated cleaning, nodes move to cleaning state when moving from active -> available state (when the hardware is recycled from one workload to another). Nodes also traverse cleaning when going from manageable -> available state (before the first workload is assigned to the nodes). For a full understanding of all state transitions into cleaning, please see Ironic’s State Machine.
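The two automated-cleaning paths described above can be summarized as a small lookup table. This is only an illustrative sketch: the names AUTOMATED_CLEAN_PATHS and state_after_cleaning are hypothetical and not part of ironic, and only the transitions mentioned in this document are covered.

```python
# Illustrative only: the automated-cleaning paths described in this section.
# AUTOMATED_CLEAN_PATHS and state_after_cleaning are hypothetical names.
AUTOMATED_CLEAN_PATHS = {
    # starting state -> (intermediate state, final state on success)
    "active": ("cleaning", "available"),      # hardware recycled between workloads
    "manageable": ("cleaning", "available"),  # before the first workload
}


def state_after_cleaning(start_state, succeeded=True):
    """Return the state a node ends in after automated cleaning."""
    if start_state not in AUTOMATED_CLEAN_PATHS:
        raise ValueError(f"no automated cleaning path from {start_state!r}")
    _, final_state = AUTOMATED_CLEAN_PATHS[start_state]
    # A failed clean leaves the node in the "clean failed" state.
    return final_state if succeeded else "clean failed"
```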

Ironic added support for automated cleaning in the Kilo release.

Enabling automated cleaning

To enable automated cleaning, ensure that your ironic.conf is set as follows:

[conductor]
automated_clean=true

This enables the default set of cleaning steps, which depends on your hardware and on the ironic hardware types assigned to your nodes. By default, these steps include erasing all of the previous tenant’s data.

You may also need to configure a Cleaning Network.

Cleaning steps

Cleaning steps used for automated cleaning are ordered from higher to lower priority, where a larger integer is a higher priority. In case of a conflict between priorities across interfaces, the following resolution order is used: Power, Management, Deploy, BIOS, and RAID interfaces.

You can skip a cleaning step by setting the priority for that cleaning step to zero or ‘None’.

You can reorder the cleaning steps by modifying the integer priorities of the cleaning steps.

See How do I change the priority of a cleaning step? for more information.
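The ordering and skipping rules above can be sketched in a few lines of Python. This is a minimal illustration of the rules as stated in this section, not ironic's actual implementation; the function name and the sample priorities are assumptions chosen for the example.

```python
# Resolution order used when two steps share a priority (from the text above).
INTERFACE_ORDER = ["power", "management", "deploy", "bios", "raid"]


def order_clean_steps(steps):
    """Drop disabled steps and sort the rest for automated cleaning.

    Each step is a dict with "interface", "step" and "priority" keys.
    """
    enabled = [s for s in steps if s.get("priority")]  # skips 0 and None
    return sorted(
        enabled,
        key=lambda s: (-s["priority"], INTERFACE_ORDER.index(s["interface"])),
    )


# Sample data; the priorities are illustrative.
steps = [
    {"interface": "deploy", "step": "erase_devices", "priority": 10},
    {"interface": "management", "step": "reset_bios_to_default", "priority": 10},
    {"interface": "management", "step": "reset_ilo_credential", "priority": 30},
    {"interface": "management", "step": "reset_ilo", "priority": 0},
    {"interface": "raid", "step": "delete_configuration", "priority": None},
]
ordered = [s["step"] for s in order_clean_steps(steps)]
print(ordered)
# ['reset_ilo_credential', 'reset_bios_to_default', 'erase_devices']
```

The two steps with priority 10 tie, so the management step runs before the deploy step; the steps with priority 0 and None are dropped.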

Management Interface

idrac cleaning steps

Name | Details | Priority | Stoppable | Arguments
clear_job_queue | Clear the job queue. | 0 | no | (none)
known_good_state | Reset the iDRAC and clear the job queue. | 0 | no | (none)
reset_idrac | Reset the iDRAC. | 0 | no | (none)

idrac-wsman cleaning steps

Name | Details | Priority | Stoppable | Arguments
clear_job_queue | Clear the job queue. | 0 | no | (none)
known_good_state | Reset the iDRAC and clear the job queue. | 0 | no | (none)
reset_idrac | Reset the iDRAC. | 0 | no | (none)

ilo cleaning steps

Name | Details | Priority | Stoppable | Arguments
activate_license | Activates iLO Advanced license. | 0 | no | ilo_license_key (required) – The HPE iLO Advanced license key to activate enterprise features.
clear_secure_boot_keys | Clears all the secure boot keys. This operation is supported only on HP ProLiant Gen9 and above servers. | 0 | no | (none)
reset_bios_to_default | Resets the BIOS settings to default values. This operation is currently supported only on HP ProLiant Gen9 and above servers. | 10 | no | (none)
reset_ilo | Resets the iLO. | 0 | no | (none)
reset_ilo_credential | Resets the iLO password. | 30 | no | (none)
reset_secure_boot_keys_to_default | Resets the secure boot keys to manufacturing defaults. This operation is supported only on HP ProLiant Gen9 and above servers. | 20 | no | (none)
update_firmware | Updates the firmware. | 0 | no | • firmware_images (required) – This argument represents the ordered list of JSON dictionaries of firmware images. Each firmware image dictionary consists of three mandatory fields, namely ‘url’, ‘checksum’ and ‘component’. These fields represent firmware image location URL, MD5 checksum of image file and firmware component type respectively. The supported firmware URL schemes are ‘file’, ‘http’, ‘https’ and ‘swift’. The supported values for firmware component are ‘ilo’, ‘cpld’, ‘power_pic’, ‘bios’ and ‘chassis’. The firmware images will be applied (in the order given) one by one on the baremetal server. For more information, see https://docs.openstack.org/ironic/latest/admin/drivers/ilo.html#initiating-firmware-update-as-manual-clean-step • firmware_update_mode (required) – This argument indicates the mode (or mechanism) of firmware update procedure. Supported value is ‘ilo’.
update_firmware_sum | Updates the firmware using Smart Update Manager (SUM). | 0 | no | • checksum (required) – The MD5 checksum of the SPP image file. • components – The list of firmware component filenames. If not specified, SUM updates all the firmware components. • url (required) – The image location for SPP (Service Pack for ProLiant) ISO.

ilo5 cleaning steps

Name | Details | Priority | Stoppable | Arguments
activate_license | Activates iLO Advanced license. | 0 | no | ilo_license_key (required) – The HPE iLO Advanced license key to activate enterprise features.
clear_secure_boot_keys | Clears all the secure boot keys. This operation is supported only on HP ProLiant Gen9 and above servers. | 0 | no | (none)
erase_devices | Erase all the drives on the node. This method performs out-of-band sanitize disk erase on all the supported physical drives in the node. This erase cannot be performed on logical drives. | 0 | no | erase_pattern – Dictionary of disk type and corresponding erase pattern to be used to perform specific out-of-band sanitize disk erase. Supported values are, for “hdd”: (“overwrite”, “crypto”, “zero”), for “ssd”: (“block”, “crypto”, “zero”). Default pattern is: {“hdd”: “overwrite”, “ssd”: “block”}.
reset_bios_to_default | Resets the BIOS settings to default values. This operation is currently supported only on HP ProLiant Gen9 and above servers. | 10 | no | (none)
reset_ilo | Resets the iLO. | 0 | no | (none)
reset_ilo_credential | Resets the iLO password. | 30 | no | (none)
reset_secure_boot_keys_to_default | Resets the secure boot keys to manufacturing defaults. This operation is supported only on HP ProLiant Gen9 and above servers. | 20 | no | (none)
update_firmware | Updates the firmware. | 0 | no | • firmware_images (required) – This argument represents the ordered list of JSON dictionaries of firmware images. Each firmware image dictionary consists of three mandatory fields, namely ‘url’, ‘checksum’ and ‘component’. These fields represent firmware image location URL, MD5 checksum of image file and firmware component type respectively. The supported firmware URL schemes are ‘file’, ‘http’, ‘https’ and ‘swift’. The supported values for firmware component are ‘ilo’, ‘cpld’, ‘power_pic’, ‘bios’ and ‘chassis’. The firmware images will be applied (in the order given) one by one on the baremetal server. For more information, see https://docs.openstack.org/ironic/latest/admin/drivers/ilo.html#initiating-firmware-update-as-manual-clean-step • firmware_update_mode (required) – This argument indicates the mode (or mechanism) of firmware update procedure. Supported value is ‘ilo’.
update_firmware_sum | Updates the firmware using Smart Update Manager (SUM). | 0 | no | • checksum (required) – The MD5 checksum of the SPP image file. • components – The list of firmware component filenames. If not specified, SUM updates all the firmware components. • url (required) – The image location for SPP (Service Pack for ProLiant) ISO.

irmc cleaning steps

Name | Details | Priority | Stoppable | Arguments
restore_irmc_bios_config | Restore BIOS config for a node. | 0 | no | (none)

BIOS Interface

idrac-wsman cleaning steps

Name | Details | Priority | Stoppable | Arguments
apply_configuration | Apply the BIOS configuration to the node. Parameters: task – a TaskManager instance containing the node to act on; settings – List of BIOS settings to apply. Raises: DRACOperationError upon an error from python-dracclient. | 0 | no | settings (required) – List of BIOS settings to apply
factory_reset | Reset the BIOS settings of the node to the factory default. This uses the Lifecycle Controller configuration to perform the BIOS configuration reset, leveraging the python-dracclient methods already available. | 0 | no | (none)

ilo cleaning steps

Name | Details | Priority | Stoppable | Arguments
apply_configuration | Applies the provided configuration on the node. | 0 | no | settings (required) – Dictionary with current BIOS configuration.
factory_reset | Reset the BIOS settings to factory configuration. | 0 | no | (none)

irmc cleaning steps

Name | Details | Priority | Stoppable | Arguments
apply_configuration | Applies BIOS configuration on the given node. This method takes the BIOS settings from the settings param and applies BIOS configuration on the given node. After the BIOS configuration is done, self.cache_bios_settings() may be called to sync the node’s BIOS-related information with the BIOS configuration applied on the node. It will also validate the given settings before applying any settings and manage failures when setting an invalid BIOS config. In the case of needing password to update the BIOS config, it will be taken from the driver_info properties. | 0 | no | settings (required) – Dictionary containing the BIOS configuration.

redfish cleaning steps

Name | Details | Priority | Stoppable | Arguments
apply_configuration | Apply the BIOS settings to the node. | 0 | no | settings (required) – A list of BIOS settings to be applied
factory_reset | Reset the BIOS settings of the node to the factory default. | 0 | no | (none)

RAID Interface

agent cleaning steps

Name | Details | Priority | Stoppable | Arguments
create_configuration | Create a RAID configuration on a bare metal using agent ramdisk. This method creates a RAID configuration on the given node. | 0 | no | (none)
delete_configuration | Deletes RAID configuration on the given node. | 0 | no | (none)

idrac cleaning steps

Name | Details | Priority | Stoppable | Arguments
create_configuration | Create the RAID configuration. This method creates the RAID configuration on the given node. | 0 | no | • create_nonroot_volumes – This specifies whether to create the non-root volumes. Defaults to True. • create_root_volume – This specifies whether to create the root volume. Defaults to True. • delete_existing – Setting this to ‘True’ indicates to delete existing RAID configuration prior to creating the new configuration. Default value is ‘False’.
delete_configuration | Delete the RAID configuration. | 0 | no | (none)

idrac-wsman cleaning steps

Name | Details | Priority | Stoppable | Arguments
create_configuration | Create the RAID configuration. This method creates the RAID configuration on the given node. | 0 | no | • create_nonroot_volumes – This specifies whether to create the non-root volumes. Defaults to True. • create_root_volume – This specifies whether to create the root volume. Defaults to True. • delete_existing – Setting this to ‘True’ indicates to delete existing RAID configuration prior to creating the new configuration. Default value is ‘False’.
delete_configuration | Delete the RAID configuration. | 0 | no | (none)

ilo5 cleaning steps

Name | Details | Priority | Stoppable | Arguments
create_configuration | Create a RAID configuration on a bare metal using agent ramdisk. This method creates a RAID configuration on the given node. | 0 | no | • create_nonroot_volumes – This specifies whether to create the non-root volumes. Defaults to True. • create_root_volume – This specifies whether to create the root volume. Defaults to True.
delete_configuration | Delete the RAID configuration. | 0 | no | (none)

irmc cleaning steps

Name | Details | Priority | Stoppable | Arguments
create_configuration | Create the RAID configuration. This method creates the RAID configuration on the given node. | 0 | no | • create_nonroot_volumes – This specifies whether to create the non-root volumes. Defaults to True. • create_root_volume – This specifies whether to create the root volume. Defaults to True.
delete_configuration | Delete the RAID configuration. | 0 | no | (none)

Manual cleaning

Manual cleaning is typically used to handle long-running, manual, or destructive tasks that an operator wishes to perform either before the first workload has been assigned to a node or between workloads. When initiating a manual clean, the operator specifies the cleaning steps to be performed. Manual cleaning can only be performed when a node is in the manageable state. Once manual cleaning is finished, the node is put back in the manageable state.

Ironic added support for manual cleaning in the 4.4 (Mitaka series) release.

Setup

In order for manual cleaning to work, you may need to configure a Cleaning Network.

Starting manual cleaning via API

Manual cleaning can only be performed when a node is in the manageable state. The REST API request to initiate it is available in API version 1.15 and higher:

PUT /v1/nodes/<node_ident>/states/provision

(Additional information is available in the Bare Metal API reference.)

This API will allow operators to put a node directly into cleaning provision state from manageable state via ‘target’: ‘clean’. The PUT will also require the argument ‘clean_steps’ to be specified. This is an ordered list of cleaning steps. A cleaning step is represented by a dictionary (JSON), in the form:

{
    "interface": "<interface>",
    "step": "<name of cleaning step>",
    "args": {"<arg1>": "<value1>", ..., "<argn>": <valuen>}
}

The ‘interface’ and ‘step’ keys are required for all steps. If a cleaning step method takes keyword arguments, the ‘args’ key may be specified. It is a dictionary of keyword variable arguments, with each keyword-argument entry being <name>: <value>.

If any step is missing a required keyword argument, manual cleaning will not be performed and the node will be put in clean failed provision state with an appropriate error message.

If, during the cleaning process, a cleaning step determines that it has incorrect keyword arguments, all earlier steps will be performed and then the node will be put in clean failed provision state with an appropriate error message.
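Because a malformed step puts the node in the clean failed state, it can help to check the shape of each step on the client side before sending it. The sketch below only checks the structure described above (required ‘interface’ and ‘step’ keys, optional ‘args’ dictionary); validate_clean_step is a hypothetical helper, and ironic still performs its own validation, including the per-step argument checks that this code cannot do.

```python
def validate_clean_step(step):
    """Return a list of problems found in one manual clean step dict."""
    if not isinstance(step, dict):
        return ["step must be a JSON object (dict)"]
    problems = []
    # 'interface' and 'step' are required for all steps.
    for key in ("interface", "step"):
        value = step.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"missing required key: {key}")
    # 'args' is optional, but must be a dictionary when present.
    if "args" in step and not isinstance(step["args"], dict):
        problems.append("'args' must be a dictionary of keyword arguments")
    return problems


print(validate_clean_step({"interface": "deploy", "step": "erase_devices"}))
# []
print(validate_clean_step({"step": "erase_devices"}))
# ['missing required key: interface']
```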

An example of the request body for this API:

{
  "target":"clean",
  "clean_steps": [{
    "interface": "raid",
    "step": "create_configuration",
    "args": {"create_nonroot_volumes": false}
  },
  {
    "interface": "deploy",
    "step": "erase_devices"
  }]
}

In the above example, the node’s RAID interface would configure hardware RAID without non-root volumes, and then all devices would be erased (in that order).
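As a rough sketch, the same request could be assembled with the Python standard library as shown below. The endpoint URL, node identifier and token are placeholders, and the use of an X-Auth-Token header assumes token-based authentication; adapt these to your deployment. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical values: replace with your endpoint, node and token.
IRONIC_ENDPOINT = "http://ironic.example.com:6385"
NODE_IDENT = "node-0"
AUTH_TOKEN = "<token>"


def build_manual_clean_request(endpoint, node_ident, clean_steps, token):
    """Build (but do not send) the PUT request that starts manual cleaning."""
    body = json.dumps({"target": "clean", "clean_steps": clean_steps})
    return urllib.request.Request(
        url=f"{endpoint}/v1/nodes/{node_ident}/states/provision",
        data=body.encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumption: token-based auth
            # Manual cleaning needs API version 1.15 or higher.
            "X-OpenStack-Ironic-API-Version": "1.15",
        },
    )


request = build_manual_clean_request(
    IRONIC_ENDPOINT,
    NODE_IDENT,
    [
        {"interface": "raid", "step": "create_configuration",
         "args": {"create_nonroot_volumes": False}},
        {"interface": "deploy", "step": "erase_devices"},
    ],
    AUTH_TOKEN,
)
# urllib.request.urlopen(request) would send it; omitted here on purpose.
print(request.get_method(), request.full_url)
```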

Starting manual cleaning via “openstack baremetal” CLI

Manual cleaning is available via the openstack baremetal node clean command, starting with Bare Metal API version 1.15.

The argument --clean-steps must be specified. Its value is one of:

  • a JSON string

  • path to a JSON file whose contents are passed to the API

  • ‘-’, to read from stdin. This allows piping in the clean steps. Using ‘-’ to signify stdin is common in Unix utilities.

The following examples assume that the Bare Metal API version was set via the OS_BAREMETAL_API_VERSION environment variable (the alternative is to add --os-baremetal-api-version 1.15 to each command):

export OS_BAREMETAL_API_VERSION=1.15

Examples of doing this with a JSON string:

openstack baremetal node clean <node> \
    --clean-steps '[{"interface": "deploy", "step": "erase_devices_metadata"}]'

openstack baremetal node clean <node> \
    --clean-steps '[{"interface": "deploy", "step": "erase_devices"}]'

Or with a file:

openstack baremetal node clean <node> \
    --clean-steps my-clean-steps.txt

Or with stdin:

cat my-clean-steps.txt | openstack baremetal node clean <node> \
    --clean-steps -
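The file passed to --clean-steps can be written by hand, or generated so that the JSON is always well formed. The following sketch writes such a file with Python's json module; write_clean_steps is a hypothetical helper, and the temporary directory is used only to keep the example self-contained.

```python
import json
import os
import tempfile


def write_clean_steps(path, clean_steps):
    """Serialize clean steps to a JSON file usable with --clean-steps."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(clean_steps, f, indent=2)
    return path


steps_dir = tempfile.mkdtemp()
steps_file = write_clean_steps(
    os.path.join(steps_dir, "my-clean-steps.txt"),
    [{"interface": "deploy", "step": "erase_devices_metadata"}],
)
print(steps_file)
```

The resulting path can then be given to openstack baremetal node clean via --clean-steps, as in the file example above.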

Cleaning Network

If you are using the Neutron DHCP provider (the default) you will also need to ensure you have configured a cleaning network. This network will be used to boot the ramdisk for in-band cleaning. You can use the same network as your tenant network. For steps to set up the cleaning network, please see Configure the Bare Metal service for cleaning.

In-band vs out-of-band

Ironic uses two main methods to perform actions on a node: in-band and out-of-band. Ironic supports using both methods to clean a node.

In-band

In-band steps are performed by ironic making API calls to a ramdisk running on the node using a deploy interface. Currently, all the deploy interfaces support in-band cleaning. By default, ironic-python-agent ships with a minimal cleaning configuration, only erasing disks. However, you can add your own cleaning steps and/or override default cleaning steps with a custom Hardware Manager.

Out-of-band

Out-of-band steps are actions performed by your management controller, such as IPMI, iLO, or DRAC. Ironic performs out-of-band steps using a power or management interface. Which steps are performed depends on the hardware type and on the hardware itself.

For out-of-band cleaning operations supported by iLO hardware types, refer to Node Cleaning Support.

FAQ

How are cleaning steps ordered?

For automated cleaning, cleaning steps are ordered by integer priority, where a larger integer is a higher priority. In case of a conflict between priorities across hardware interfaces, the following resolution order is used:

  1. Power interface

  2. Management interface

  3. Deploy interface

  4. BIOS interface

  5. RAID interface

For manual cleaning, the cleaning steps should be specified in the desired order.

How do I skip a cleaning step?

For automated cleaning, cleaning steps with a priority of 0 or None are skipped.

How do I change the priority of a cleaning step?

For manual cleaning, specify the cleaning steps in the desired order.

For automated cleaning, it depends on whether the cleaning steps are out-of-band or in-band.

Most out-of-band cleaning steps have an explicit configuration option for priority.

Changing the priority of an in-band (ironic-python-agent) cleaning step requires use of a custom HardwareManager. The only exception is erase_devices, which can have its priority set in ironic.conf. For instance, to disable erase_devices, you’d set the following configuration option:

[deploy]
erase_devices_priority=0

To enable/disable the in-band disk erase using ilo hardware type, use the following configuration option:

[ilo]
clean_priority_erase_devices=0

The generic hardware manager first tries to perform an ATA disk erase using the hdparm utility. If ATA disk erase is not supported, it performs a software-based disk erase using the shred utility. By default, shred performs 1 iteration for a software-based disk erase. To configure the number of iterations, use the following configuration option:

[deploy]
erase_devices_iterations=1
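To audit the cleaning-related options discussed in this document, a configuration file can be read with Python's configparser, as sketched below. Ironic itself reads its configuration through oslo.config, so treat this as an approximation for simple INI-style files; cleaning_settings is a hypothetical helper, and a value of None means the option is not set in the text being parsed.

```python
import configparser

# Sample fragment mirroring the options discussed in this document.
SAMPLE_CONF = """
[conductor]
automated_clean=true

[deploy]
erase_devices_priority=0
erase_devices_iterations=1
"""


def cleaning_settings(conf_text):
    """Extract the cleaning-related options from ironic.conf-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return {
        "automated_clean": parser.getboolean(
            "conductor", "automated_clean", fallback=None),
        "erase_devices_priority": parser.getint(
            "deploy", "erase_devices_priority", fallback=None),
        "erase_devices_iterations": parser.getint(
            "deploy", "erase_devices_iterations", fallback=None),
    }


settings = cleaning_settings(SAMPLE_CONF)
print(settings)
# {'automated_clean': True, 'erase_devices_priority': 0, 'erase_devices_iterations': 1}
```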

What cleaning step is running?

To check which cleaning step a node is performing, or which step it attempted and failed, run the following command. It returns the node’s driver_internal_info field:

openstack baremetal node show $node_ident -f value -c driver_internal_info

The clean_steps field contains a list of all remaining steps with their priorities. The first step listed is either the step currently in progress or the step on which the node failed before entering the clean failed state.
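When the command is run with -f json instead of -f value, the output can be parsed to pick out that first step. The sketch below assumes the JSON output is an object keyed by the column name (driver_internal_info) and also accepts the bare field; current_clean_step is a hypothetical helper and the sample data is invented for illustration.

```python
import json


def current_clean_step(driver_internal_info_json):
    """Return the first entry of clean_steps, or None if there is none."""
    info = json.loads(driver_internal_info_json)
    # Assumption: "-f json -c driver_internal_info" wraps the field in an
    # object keyed by the column name; accept the bare field as well.
    info = info.get("driver_internal_info", info)
    steps = info.get("clean_steps") or []
    return steps[0] if steps else None


# Hypothetical output, shortened for illustration.
sample = json.dumps({
    "driver_internal_info": {
        "clean_steps": [
            {"interface": "deploy", "step": "erase_devices", "priority": 10},
            {"interface": "management", "step": "reset_ilo", "priority": 5},
        ]
    }
})
step = current_clean_step(sample)
print(step["interface"], step["step"])
# deploy erase_devices
```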

Should I disable automated cleaning?

Automated cleaning is recommended for ironic deployments. However, there are some tradeoffs to having it enabled. For instance, ironic cannot deploy a new instance to a node that is currently cleaning, and cleaning can be a time-consuming process. To mitigate this, we suggest using disks with support for cryptographic ATA Security Erase, as the erase_devices step in the deploy interface typically takes the longest of all cleaning steps to complete.

Why can’t I power on/off a node while it’s cleaning?

During cleaning, nodes may be performing actions that shouldn’t be interrupted, such as BIOS or firmware updates. As a result, operators are forbidden from changing power state via the ironic API while a node is cleaning.

Troubleshooting

If cleaning fails on a node, the node will be put into clean failed state and placed in maintenance mode, to prevent ironic from taking actions on the node.

Nodes in clean failed will not be powered off, as the node might be in a state such that powering it off could damage the node or remove useful information about the nature of the cleaning failure.

A clean failed node can be moved to manageable state, where it cannot be scheduled by nova and you can safely attempt to fix the node. To move a node from clean failed to manageable:

openstack baremetal node manage $node_ident

You can now take actions on the node, such as replacing a bad disk drive.

Strategies for determining why a cleaning step failed include checking the ironic conductor logs, viewing logs on the still-running ironic-python-agent (if an in-band step failed), or performing general hardware troubleshooting on the node.

When the node is repaired, you can move the node back to available state, to allow it to be scheduled by nova.

# First, move it out of maintenance mode
openstack baremetal node maintenance unset $node_ident

# Now, make the node available for scheduling by nova
openstack baremetal node provide $node_ident

The node will begin automated cleaning from the start, and move to available state when complete.