Rocky Series (11.0.0 - 11.1.x) Release Notes

11.1.4-12

Bug Fixes

  • Fixes ‘Invalid parameter value for SpanLength’ when configuring RAID using Python 3. Under Python 3, an incorrect data type was passed to iDRAC, e.g., 2.0 instead of 2. See story 2004265.
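
    As a minimal illustration of this class of bug: the note does not state the exact root cause, so the arithmetic below is an assumption. In Python 3 the / operator always returns a float, so a value computed by division must be kept an integer before it is sent. The helper name span_length is hypothetical.

```python
def span_length(physical_disks: int, span_depth: int) -> int:
    """Return the number of disks per span as an int (hypothetical helper)."""
    if span_depth <= 0 or physical_disks % span_depth:
        raise ValueError("disks must divide evenly across spans")
    # Floor division keeps the result an int; '/' would yield a float.
    return physical_disks // span_depth


print(repr(4 / 2))              # true division gives a float
print(repr(span_length(4, 2)))  # floor division gives an int
```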

  • Cleans up nodes stuck in the deleting state on conductor restart.

  • Fixes vague node last_error field reporting upon deploy step failure by providing the exception error message in addition to the step that failed.

  • Kills the ipmitool process invoked by ironic to read a node’s power state if the process does not exit after the configured timeout expires. It appears to be fairly common for ipmitool to run for five minutes (with the current ironic defaults) once it hits a non-responsive bare metal node. This could slow down the management of other nodes due to exhaustion of periodic task slots. The new behaviour is enabled by default, but can be disabled via the [ipmi]kill_on_timeout ironic configuration option.
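
    The behaviour described above can be sketched with the Python standard library. This is not ironic's implementation; the function name run_with_timeout and its kill_on_timeout argument are illustrative assumptions, and a sleeping Python child stands in for a hung ipmitool call.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout, kill_on_timeout=True):
    """Run cmd; return (returncode, stdout), or (None, '') if it was killed."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    try:
        out, _ = proc.communicate(timeout=timeout)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        if not kill_on_timeout:
            raise
        proc.kill()         # forcibly stop the child process
        proc.communicate()  # reap the child so it does not linger
        return None, ""


# Stand-in for a hung ipmitool call: a child that sleeps past the timeout.
hung = [sys.executable, "-c", "import time; time.sleep(30)"]
print(run_with_timeout(hung, timeout=0.5))
```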

  • Fixed a bug where rebooting a node managed by the idrac hardware type when using the WS-MAN power interface sometimes fails with a ‘The command failed to set RequestedState’ error. See bug 2007487 for details.

  • Adds command_timeout and max_command_attempts configuration options to IPA, so when connection errors occur the command will be executed again.
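
    A rough sketch of the retry behaviour, assuming a plain loop around a callable. Only the option name max_command_attempts comes from the note; the function name, the retry_delay argument and the use of ConnectionError are assumptions, and the per-request command_timeout is not modelled.

```python
import time


def execute_with_retries(command, max_command_attempts=3, retry_delay=0.0):
    """Call command(); retry on ConnectionError up to max_command_attempts."""
    for attempt in range(1, max_command_attempts + 1):
        try:
            return command()
        except ConnectionError:
            if attempt == max_command_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(retry_delay)


calls = {"count": 0}


def flaky_command():
    # Fails twice, then succeeds, to imitate a transient connection problem.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("agent unreachable")
    return "done"


print(execute_with_retries(flaky_command))
```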

  • Fixes an issue where ironic-conductor initialization could return a NodeNotLocked error for requests requiring locks when the conductor was starting. This was due to the conductor removing locks after it had begun accepting new work. The lock removal now happens after database connectivity has been established but before the RPC bus is initialized.

11.1.4

Bug Fixes

  • Fixes a deployment issue encountered during the configdrive partition creation step. On some devices, such as NVMe drives, the created configdrive partition could not be correctly identified, which is required to write data onto it afterward. See https://storyboard.openstack.org/#!/story/2005764

  • Fixes an issue with using serial number as root device hints with the ansible deploy interface.

  • Fixes an issue regarding the ansible deploy interface. Node deployment was broken for any image that was not public because the original request context was not available anymore at the time some image information was fetched.

  • Fixes an issue where the resource list API returned results with only the requested fields until the API MAX_LIMIT was reached, after which the API ignored the user-requested fields. With this fix, the next URL generated by the pagination code includes the user-requested fields as a query parameter.
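
    The idea behind the fix can be sketched as follows. The function name next_link and the example host are hypothetical; the sketch only shows the requested fields being carried into the next-page URL and is not the API's actual pagination code.

```python
from urllib.parse import urlencode


def next_link(base_url, resource, limit, marker, fields=None):
    """Build a 'next' URL that preserves the caller's requested fields."""
    params = {"limit": limit, "marker": marker}
    if fields:
        # Carry the field selection forward so later pages match the first.
        params["fields"] = ",".join(fields)
    return f"{base_url}/{resource}?{urlencode(params)}"


print(next_link("http://ironic.example/v1", "nodes", 2, "abc", ["uuid", "name"]))
```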

  • Fixes an issue where the pagination marker was not being set if uuid was not in the list of requested fields when executing a list query. The affected API endpoints were: port, portgroup, volume_target, volume_connector, node and chassis. See story 2003192 for more details.

  • Fixes an issue where baremetal node deployment would fail on clouds with a high number of security groups. Listing the security groups took too long. Instead of listing all security groups, a query filter was added to list only the security groups to be used for the network. (See bug 2006256.)

  • Fixes a bug with the grub ramdisk boot template handling, such that the template now properly references the user-provided kernel and ramdisk. Previously the deployment ramdisk and kernel were referenced in the template.

  • Fixes an issue where updating firmware using the update_firmware_sum clean step from the management interface of the ilo hardware type failed with an error stating that it was unable to connect to the iLO address due to an authentication failure. See story 2006223 for details.

11.1.3

Deprecation Notes

  • Using the fake management interface with the manual-management hardware type is deprecated; please use noop instead. Existing nodes will have to be updated after the upgrade.

Bug Fixes

  • Fixes an issue regarding the ansible deployment interface cleaning workflow. Handling the error in the driver and returning nothing caused the manager to consider the step done and go to the next one instead of interrupting the cleaning workflow.

  • Fixes an issue with the ansible deployment interface where raw images could not be streamed correctly to the host.

  • Fixes deployment with the ansible deploy interface and instance images with GPT partition table.

  • Fixes an issue where the sensor data parsing method for the ipmitool interface lacked the ability to handle the automatically included ipmitool debugging information when the debug option is set to True in the ironic.conf file. As such, extra debugging information supplied by the underlying ipmitool command is disregarded. More information can be found in story 2005331.

  • Fixes an issue where deploy fails during node preparation if the node capabilities are passed as a string.

  • Fixes an issue where checksum validation failed with a UnicodeDecodeError while calculating the actual checksum. The fix uses the oslo_utils library to calculate the actual checksum.
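
    The following standard-library sketch shows why reading an image as raw bytes avoids decode errors. It is not the oslo_utils implementation; the function name file_checksum and the choice of sha256 are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=64 * 1024):
    """Hash a file by reading raw bytes in chunks; no text decoding happens."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as image:  # binary mode: bytes are never decoded
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


payload = b"\xff\xfe\x00binary image data"  # not valid UTF-8
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
matches = file_checksum(tmp.name) == hashlib.sha256(payload).hexdigest()
os.remove(tmp.name)
print(matches)
```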

  • The manual-management hardware type now defaults to the noop management interface. Unlike the fake management interface, it does not fail on attempt to set the boot device to the local disk.

  • Fixes a bug where cinder block storage service volumes fail to attach with an error expecting the mountpoint to be a valid string. See story 2004864 for additional information.

  • Returns the correct error message on providing an invalid reference to image_source. Previously an internal error was raised.

  • Reverts the fix to the idrac hardware type creating port objects during inspection with pxe_enabled fields not set to reflect the configuration of the physical ports. It is inconsistent with the stable branch policy [1]. It requires python-dracclient version 1.5.0 and greater; however, driver-requirements.txt specifies version 1.3.0 and greater can be used on this branch.

    [1] https://docs.openstack.org/project-team-guide/stable-branches.html

11.1.2

Bug Fixes

  • A bug has been fixed in the node update code that could cause the nodes to become not updatable if their driver is no longer available.

  • Fixes an issue where the master instance image cache could not be disabled. The configuration option [pxe]/instance_master_path may now be set to the empty string to disable the cache.

  • Fixes an issue where the master TFTP image cache could not be disabled. The configuration option [pxe]/tftp_master_path may now be set to the empty string to disable the cache. For more information, see story 2004608.
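
    As a hedged illustration of the two cache options set to the empty string: ironic reads its configuration through oslo.config, so the standard-library configparser below is only a stand-in, and the helper cache_enabled is hypothetical.

```python
import configparser

SAMPLE = """
[pxe]
instance_master_path =
tftp_master_path =
"""


def cache_enabled(conf, option):
    """An empty path means the corresponding master image cache is disabled."""
    return bool(conf.get("pxe", option).strip())


conf = configparser.ConfigParser()
conf.read_string(SAMPLE)
for option in ("instance_master_path", "tftp_master_path"):
    print(option, "cache enabled:", cache_enabled(conf, option))
```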

  • Fixes a bug where, for the idrac hardware type, ironic ports were not updated during node introspection to reflect the PXE enabled setting. See bug 2004340 for details.

11.1.1

New Features

  • Setting these configuration options to 0 will disable the periodic tasks:

    • [conductor]sync_power_state_interval: sync power states for the nodes

    • [conductor]check_provision_state_interval:

      • check deployments and time out if the deployment takes too long

      • check the status of cleaning a node and time out if it takes too long

      • check the status of inspecting a node and time out if it takes too long

      • check for and handle nodes that are taken over by new conductors (if an old conductor disappeared)

    • [conductor]send_sensor_data_interval: send sensor data to ceilometer

    • [conductor]sync_local_state_interval: refresh a conductor’s copy of the consistent hash ring. If any mappings have changed, determines which, if any, nodes need to be “taken over”. The ensuing actions could include preparing a PXE environment, updating the DHCP server, and so on.

    • [oneview]periodic_check_interval:

      • check for nodes taken over by OneView users

      • check for nodes freed by OneView users
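
    The rule above (an interval of 0 disables the task) can be sketched as a simple filter. The function name and the interval values are illustrative assumptions, not ironic's scheduling code or its defaults.

```python
def enabled_periodic_tasks(intervals):
    """Return {option: interval} for tasks that should run; 0 disables a task."""
    return {option: value for option, value in intervals.items() if value > 0}


settings = {
    "sync_power_state_interval": 60,
    "check_provision_state_interval": 0,  # disabled
    "send_sensor_data_interval": 600,
    "sync_local_state_interval": 0,       # disabled
}
print(sorted(enabled_periodic_tasks(settings)))
```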

Known Issues

  • Building RAID1 is known to not work with Dell BOSS cards using python-dracclient 1.4.0 or earlier. Upgrade to python-dracclient 1.5.0 to use this feature.

Upgrade Notes

  • The hash_ring_reset_interval configuration option was changed from 180 to 15 seconds. Previously, this option was essentially ignored on the API side, because the hash ring was reset on each API access. The lower value minimizes the probability of a request being routed to the wrong conductor when the ring needs rebalancing.

  • If you are doing a minor version upgrade, please re-run the ironic-dbsync online_data_migrations command to properly update the versions of the Objects in the database. Otherwise, the next major upgrade may fail.

Critical Issues

  • The ironic-dbsync online_data_migrations command was not updating the objects to their latest versions, which could prevent upgrades from working (i.e. when running the next release’s ironic-dbsync upgrade). Objects are updated to their latest versions now when running that command. See story 2004174 for more information.

Bug Fixes

  • Fixes an issue with a baremetal node that times out during cleaning. The ironic-conductor was attempting to change the node’s provision state to ‘clean failed’ twice, resulting in the node’s last_error being set incorrectly. This no longer happens. For more information, see story 2004299.

  • Fixes an issue where setting these configuration options to 0 caused a ValueError exception to be raised. You can now set them to 0 to disable the associated periodic tasks (for more information, see story 2002059):

    • [conductor]sync_power_state_interval: sync power states for the nodes

    • [conductor]check_provision_state_interval:

      • check deployments and time out if the deployment takes too long

      • check the status of cleaning a node and time out if it takes too long

      • check the status of inspecting a node and time out if it takes too long

      • check for and handle nodes that are taken over by new conductors (if an old conductor disappeared)

    • [conductor]send_sensor_data_interval: send sensor data to ceilometer

    • [conductor]sync_local_state_interval: refresh a conductor’s copy of the consistent hash ring. If any mappings have changed, determines which, if any, nodes need to be “taken over”. The ensuing actions could include preparing a PXE environment, updating the DHCP server, and so on.

    • [oneview]periodic_check_interval:

      • check for nodes taken over by OneView users

      • check for nodes freed by OneView users

  • Fixes an issue where Neutron ports would be left with a baremetal MAC address associated after an instance is deleted from a baremetal host. This caused problems with MAC address conflicts in follow-up deployments to the same baremetal host. See bug 2004428.

  • Fixes an issue where a flat Neutron port would be left with a host ID associated with it after an instance is deleted from a baremetal host. This caused problems with reusing the same port for a new instance as it is already bound to the old instance.

  • Fixes a bug where the number of CPU sockets was being returned by the idrac hardware type during introspection, instead of the number of virtual CPUs. See bug 2004155 for details.

  • Fixes a race condition in the hash ring implementation that could cause an internal server error on any request. See story 2003966 for details.

  • Properly reports an error when the image cache and the image HTTP or TFTP location are on different file systems, which causes hard linking to fail.
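
    A minimal sketch of detecting this condition, assuming the hard link is made with os.link. The function name link_from_cache and the choice of RuntimeError are hypothetical; the sketch only shows the cross-device error (errno EXDEV) being turned into a clear message.

```python
import errno
import os
import tempfile


def link_from_cache(cached_path, dest_path):
    """Hard-link a cached image into place, explaining cross-device failures."""
    try:
        os.link(cached_path, dest_path)
    except OSError as exc:
        if exc.errno == errno.EXDEV:
            raise RuntimeError(
                f"{cached_path} and {dest_path} are on different file "
                "systems; hard links cannot cross file system boundaries"
            ) from exc
        raise


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "master.img")
    target = os.path.join(workdir, "node.img")
    with open(source, "wb") as image:
        image.write(b"image")
    link_from_cache(source, target)  # same directory, so same file system
    same_file = os.path.samefile(source, target)
print(same_file)
```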

  • Fixes an issue where iSCSI based deployments fail if the cpu_arch property is not specified on a node.

  • Fixes the redfish hardware type to reuse HTTP session tokens when talking to the BMC using session authentication. Prior to this fix, the redfish hardware type never tried to reuse the session token issued by the BMC during a previous connection, which could sometimes lead to session pool exhaustion with some BMC implementations.

  • Fixes an issue wherein provisioning fails if an ironic node is configured with the ramdisk deploy interface. See bug 2003532 for more details.

  • The IPMI hardware type unconditionally instructed the BMC not to automatically clear the boot flag valid bit if a Chassis Control command is not received within the 60-second timeout (the countdown restarts when a Chassis Control command is received). Some BMCs do not support this setting; if it is sent, the boot is aborted instead. For the IPMI hardware type, a new driver option node['driver_info']['ipmi_disable_boot_timeout'] can be specified. It is True by default; set it to False to bypass sending this command. See story 2004266 for additional information.
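
    How the option might be evaluated can be sketched as below. The function name is hypothetical, the node is modelled as a plain dict, and the accepted string spellings are an assumption; only the option name and its default of True come from the note.

```python
def should_disable_boot_timeout(node):
    """Return True unless driver_info explicitly turns the behaviour off."""
    value = node.get("driver_info", {}).get("ipmi_disable_boot_timeout", True)
    if isinstance(value, str):
        # Assumption: values may arrive as strings; accept common spellings.
        return value.strip().lower() not in ("false", "0", "no", "off")
    return bool(value)


default_node = {"driver_info": {}}
quirky_bmc_node = {"driver_info": {"ipmi_disable_boot_timeout": False}}
print(should_disable_boot_timeout(default_node))
print(should_disable_boot_timeout(quirky_bmc_node))
```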

11.1.0

Prelude

Ironic 11.1… Where the volume dial turned more!

While Pixie Boots has rocked out to Rock and Roll, the Bare Metal as a Service team has wrapped up our Rocky release with 11.1. This new release contains a number of major features that we hope will improve the lives of bare metal operators everywhere!

  • Conductor grouping enabling nodes to be assigned to groups of different conductors.

  • Deployment steps framework enabling greater flexibility for deployers to request specific steps.

  • BIOS setting interfaces for the ilo and irmc hardware types.

  • Ramdisk deployment interface for disk-less deployments.

  • Capability to reset nodes to their default interfaces via the API when resetting the node’s driver.

New Features

  • Added support for local booting a partition image for ppc64* hardware. If a PReP partition is detected when deploying to a ppc64* machine, the partition will be specified to IPA causing the bootloader to be installed there directly. This feature requires an ironic-python-agent ramdisk with ironic-lib >=2.14.

  • Adds new optional snmp_community_read and snmp_community_write properties to snmp driver configuration (specified via a node’s driver_info field). If present, the value(s) will be used respectively for SNMP reads and/or writes to the PDU. When not present, snmp_community value will be used instead.
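
    The fallback rule can be sketched with a plain dict standing in for a node’s driver_info. The function name resolve_snmp_communities and the community strings are hypothetical; the three property names are taken from the note.

```python
def resolve_snmp_communities(driver_info):
    """Return (read, write) communities, falling back to snmp_community."""
    default = driver_info.get("snmp_community")
    read = driver_info.get("snmp_community_read", default)
    write = driver_info.get("snmp_community_write", default)
    return read, write


only_default = {"snmp_community": "public"}
split = {"snmp_community": "public", "snmp_community_write": "private"}
print(resolve_snmp_communities(only_default))
print(resolve_snmp_communities(split))
```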

  • The iRMC driver can now automatically update the node.traits field with CUSTOM_CPU_FPGA value based on information provided by the node during node inspection.

  • Adds a ramdisk deploy interface for deployments that wish to network boot to a ramdisk, as opposed to performing a complete traditional deployment to physical media. This may be useful in scientific use cases or where ephemeral baremetal machines are desired.

    The ramdisk deploy interface is intended for advanced users and has some particular operational caveats that users should be aware of prior to use, such as network access list requirements, configuration drive architectural restrictions, and the inability to leverage configuration drives.

  • Adds a new configuration option [pxe]pxe_config_subdir to allow operators to define the specific directory that may be used inside of /tftpboot or /httpboot for a boot loader to locate the configuration file for the node. This option defaults to pxelinux.cfg, which is the directory that the Syslinux pxelinux.0 bootloader uses. Operators may wish to change the directory name if they are using other boot loaders such as GRUB or iPXE.
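
    As a small sketch of how the option composes with the boot directory: the function name pxe_config_dir and the alternative directory name are hypothetical, and this is not ironic's path-building code.

```python
import posixpath


def pxe_config_dir(boot_root, pxe_config_subdir="pxelinux.cfg"):
    """Return the directory a boot loader would search for node configs."""
    return posixpath.join(boot_root, pxe_config_subdir)


print(pxe_config_dir("/tftpboot"))                 # default subdirectory
print(pxe_config_dir("/httpboot", "boot-config"))  # operator-chosen name
```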

  • Conductors and nodes may be arbitrarily grouped to provide a basic level of affinity between conductors and nodes. Conductors use the [conductor]/conductor_group configuration option to set the group which they belong to. The same value may be set on one or more nodes in the conductor_group field (available in API version 1.46), and these will be matched such that only conductors with a given group will manage nodes with the same group.

    A group name may be up to 255 characters long and may contain a-z, 0-9, _, -, and the period character (.). The group is case-insensitive. The default group is the empty string ("").

    The “node list” API endpoint (GET /v1/nodes) may also be filtered by conductor group in API version 1.46.
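
    The naming and matching rules above can be sketched as follows. The function names are hypothetical and the sketch is not ironic's validation code; it only encodes the stated character set, the 255-character limit and case-insensitive comparison.

```python
import re

# Allowed characters after lowercasing: a-z, 0-9, underscore, dash, period.
_GROUP_RE = re.compile(r"[a-z0-9_.-]*")


def normalize_group(name):
    """Lowercase a conductor group name and check it against the rules."""
    group = name.lower()
    if len(group) > 255 or not _GROUP_RE.fullmatch(group):
        raise ValueError(f"invalid conductor group: {name!r}")
    return group


def conductor_manages(conductor_group, node_group):
    """A conductor only manages nodes whose group equals its own."""
    return normalize_group(conductor_group) == normalize_group(node_group)


print(conductor_manages("Rack-1", "rack-1"))  # case-insensitive match
print(conductor_manages("", ""))              # default (empty) group
print(conductor_manages("rack-1", "rack-2"))
```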

  • The framework for deployment steps is in place. All in-tree drivers (DeployInterfaces) have one (big) deploy step; the conductor executes this step when deploying a node.

    Starting with the Bare Metal REST API version 1.44, the current deploy step (if any) being executed is available in a node’s deploy_step field in the responses for the following queries:

    • GET /v1/nodes/<node identifier>

    • GET /v1/nodes/detail

    • GET /v1/nodes?fields=deploy_step,...
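
    A sketch of requesting the deploy_step field at API version 1.44 with the standard library. The host name and token are placeholders, the function name is hypothetical, and the request is only built, not sent.

```python
import urllib.request

API_VERSION_HEADER = "X-OpenStack-Ironic-API-Version"


def deploy_step_request(endpoint, token, api_version="1.44"):
    """Build (but do not send) a node list request that asks for deploy_step."""
    url = f"{endpoint}/v1/nodes?fields=uuid,deploy_step"
    headers = {API_VERSION_HEADER: api_version, "X-Auth-Token": token}
    return urllib.request.Request(url, headers=headers, method="GET")


request = deploy_step_request("http://ironic.example:6385", "placeholder-token")
print(request.get_method(), request.full_url)
```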

  • Implements bios interface for ilo hardware type. Adds the list of supported bios interfaces for the ilo hardware type. Adds manual cleaning steps apply_configuration and factory_reset which support managing the BIOS settings for the iLO servers using ilo hardware type.

  • Adds support for the new noop interface to the ipmi hardware type. This interface targets hardware that does not correctly change boot mode via the IPMI protocol. Using it requires pre-configuring the boot order on a node to try PXE, then fall back to local booting.

  • Adds new bios interface to irmc hardware type. This provides out-of-band BIOS configuration solution for iRMC driver which makes the functionality available via manual cleaning.

  • Adds out-of-band RAID configuration solution for the iRMC driver which makes the functionality available via manual cleaning. See iRMC hardware type documentation for more details.

  • Starting with API version 1.45, PATCH requests to /v1/nodes/<NODE> accept the new query parameter reset_interfaces. It can be provided whenever the driver field is updated. If set to ‘true’, all hardware interfaces will be reset to their defaults, except for ones updated in the same request.
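
    A sketch of such a PATCH request, built with the standard library and not sent. The host name, node name and function name are placeholders chosen for illustration; the body is a JSON patch that replaces the driver field.

```python
import json
import urllib.request


def driver_update_request(endpoint, node_ident, new_driver, reset_interfaces=True):
    """Build (but do not send) a PATCH that changes a node's driver."""
    url = f"{endpoint}/v1/nodes/{node_ident}"
    if reset_interfaces:
        url += "?reset_interfaces=true"
    patch = [{"op": "replace", "path": "/driver", "value": new_driver}]
    return urllib.request.Request(
        url,
        data=json.dumps(patch).encode("utf-8"),
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            "X-OpenStack-Ironic-API-Version": "1.45",
        },
    )


patch_request = driver_update_request("http://ironic.example:6385", "node-1", "ipmi")
print(patch_request.get_method(), patch_request.full_url)
```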

Upgrade Notes

  • Operators utilizing grub for PXE booting, typically with UEFI, should change their deployed master PXE configuration file provided for nodes PXE booting using grub. Ironic 11.1 now writes both MAC address and IP address based PXE configuration links for network booting via grub. The grub variable should be changed from $net_default_ip to $net_default_mac. IP address support is deprecated and will be removed in the Stein release.

  • The minimum required version of pysnmp has been bumped to 4.3. This pysnmp version introduces a simpler, faster and more functional high-level SNMP API, to which the ironic snmp driver has been migrated.

  • The minimum required version of the osprofiler library is now 1.5.0. This is not a new dependency; ironic has not been able to start with 1.4.0 since the Pike release, when this dependency was introduced.

  • The swift/endpoint_type configuration option is now removed. python-swiftclient 3.2.0 (Ocata) and above removed support for the native URL type used by radosgw. Since using a swift/endpoint_type value of radosgw would fail anyway, it is removed. Deployers must now configure ceph with rgw swift account in url = True. This must be set before upgrading to this release.

  • The snmp hardware type now uses the noop management interface instead of fake used previously. Support for fake is left for backward compatibility.

Deprecation Notes

  • All drivers must implement their deployment process using deploy steps. Out-of-tree drivers without deploy steps will be supported until the Stein release. For more details, see story 1753128.

  • The xclarity hardware type, as well as the supporting driver interfaces have been deprecated and are scheduled to be removed from ironic in the Stein development cycle. This is due to the lack of operational Third Party testing to help ensure that the support for Lenovo XClarity is functional.

    The xclarity hardware type was introduced at the end of the Queens development cycle. During implementation of Third Party CI, the Lenovo team encountered some unforeseen delays. Lenovo is continuing to work towards Third Party CI, and upon establishment and verification of functional Third Party CI, this deprecation will be rescinded.

  • Support for ironic to link PXE boot configuration files via the assigned interface IP address has been deprecated. This was only used when [pxe]ipxe_enabled was set to false and the node was being deployed using UEFI.

  • Using the fake management interface with the snmp hardware type is now deprecated; please use noop instead.

Bug Fixes

  • Better handles the case when an operator attempts to perform an upgrade from a release older than Pike, directly to a release newer than Pike, skipping one or more releases in between (i.e. a “skip version upgrade”). Instead of crashing, the operator will be informed that upgrading from a version older than the previous release is not supported (skip version upgrades) and that (as of Pike) all database migrations need to be performed using the previous releases for a fast-forward upgrade. [Bug 2002558]

  • Fixes support for grub based UEFI PXE booting by enabling links to the PXE configuration files to be written using the MAC address of the node in addition to the interface IP address. If the [dhcp]dhcp_provider option is set to none, only the MAC based links will be created.

  • Fixes an issue that caused the integrated Dell Remote Access Controller (iDRAC) management hardware interface implementation, idrac, to fail to boot nodes in Unified Extensible Firmware Interface (UEFI) boot mode. That interface is supported by the idrac hardware type. The issue is resolved for Dell EMC PowerEdge 13th and 14th generation servers. It is not resolved for PowerEdge 12th generation and earlier servers. For more information, see story 1656841.

  • If a node gets stuck in one of the states deploying, cleaning, verifying, inspecting, adopting, rescuing, unrescuing for some reason (e.g., the conductor goes down when executing a task), it will be moved to an appropriate failure state the next time the conductor starts.

  • Changes the iPXE behavior to retry a total of 10 times with an increasing backoff time between each retry in order to not create a Denial of Service situation with the iPXE HTTP server. Should the retries fail, the node will be powered-off after a warning is displayed on the console for 30 seconds. For more information, see story.

  • The cleaning operation could fail if an in-band clean step executed after the completion of an out-of-band clean step that reboots the node. The failure was caused by a race condition wherein cleaning was resumed before the Ironic Python Agent (IPA) was ready to execute clean steps. This has been fixed. For more information, see bug 2002731.

Other Notes

  • The deprecated configuration option [ipmi]retry_timeout was removed; use [ipmi]command_retry_timeout instead.