R10.0 Release Notes¶
About this task
StarlingX is a fully integrated edge cloud software stack that provides everything needed to deploy an edge cloud on one, two, or up to 100 servers.
This section describes the new capabilities, known limitations and workarounds, fixed defects, and deprecated information in StarlingX 10.0.
ISO image¶
The pre-built ISO (Debian) for StarlingX 10.0 is located at the StarlingX mirror repo.
Source Code for StarlingX 10.0¶
The source code for StarlingX 10.0 is available on the r/stx.10.0 branch in the StarlingX repositories.
Deployment¶
To deploy StarlingX 10.0, see Consuming StarlingX.
For detailed installation instructions, see StarlingX 10.0 Installation Guides.
New Features and Enhancements¶
The sections below provide a detailed list of new features and links to the associated user guides (if applicable).
Platform Component Upversion¶
The auto_update attribute supported for StarlingX applications enables apps to be automatically updated when a new app version tarball is installed on a system.
See: https://wiki.openstack.org/wiki/StarlingX/Containers/Applications/AppIntegration
The following platform component versions have been updated in StarlingX 10.0.
sriov-fec-operator 2.9.0
kubernetes-power-manager 2.5.1
kubevirt-app: 1.1.0
security-profiles-operator 0.8.7
nginx-ingress-controller
ingress-nginx 4.11.1
secret-observer 0.1.1
auditd 1.0.5
snmp 1.0.3
cert-manager 1.15.3
ceph-csi-rbd 3.11.0
node-interface-metrics-exporter 0.1.3
node-feature-discovery 0.16.4
app-rook-ceph
rook-ceph 1.13.7
rook-ceph-cluster 1.13.7
rook-ceph-floating-monitor 1.0.0
rook-ceph-provisioner 2.0.0
dell-storage
csi-powerstore 2.10.0
csi-unity 2.10.0
csi-powerscale 2.10.0
csi-powerflex 2.10.1
csi-powermax 2.10.0
csm-replication 1.8.0
csm-observability 1.8.0
csm-resiliency 1.9.0
portieris 0.13.16
metrics-server 3.12.1 (0.7.1)
FluxCD helm-controller 1.0.1 (for Helm 3.12.2)
power-metrics
cadvisor 0.50.0
telegraf 1.1.30
security-profiles-operator 0.8.7
vault
vault 1.14.0
vault-manager 1.0.1
oidc-auth-apps
oidc-auth-secret-observer secret-observer 0.1.6 1.0
oidc-dex dex-0.18.0+STX.4 2.40.0
oidc-oidc-client oidc-client 0.1.22 1.0
platform-integ-apps
ceph-csi-cephfs 3.11.0
ceph-pools-audit 0.2.0
app-istio
istio-operator 1.22.1
kiali-server 1.85.0
harbor 1.12.4
ptp-notification 2.0.55
intel-device-plugins-operator
intel-device-plugins-operator 0.30.3
intel-device-plugins-qat 0.30.1
intel-device-plugins-gpu 0.30.0
intel-device-plugins-dsa 0.30.1
secret-observer 0.1.1
node-interface-metrics-exporter 0.1.3
oran-o2 2.0.4
helm 3.14.4 for K8s 1.21 - 1.29
Redfish Tool 1.1.8-1
Kubernetes Upversion¶
StarlingX Release r10.0 supports Kubernetes 1.29.2.
Distributed Cloud Scalability Improvement¶
StarlingX System Controller scalability has been improved in StarlingX 10.0, increasing both the maximum number of managed nodes (up to 5000) and the maximum number of parallel operations.
Unified Software Delivery and Management¶
In StarlingX 10.0, the Software Patching functionality and the Software Upgrades functionality have been re-designed into a single Unified Software Management framework. There is now a single procedure for managing the deployment of new software, regardless of whether the new software is a new Patch Release or a new Major Release: the same APIs/CLIs, the same procedures, the same VIM / Host Orchestration strategies, and the same Distributed Cloud / Subcloud Orchestration strategies are used in both cases.
See: Appendix A - Commands replaced by USM for Updates (Patches) and Upgrades for a detailed list of deprecated commands and new commands.
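For orientation, a representative Patch Release deployment sequence on a single host is sketched below. This is a sketch only; the release name is a placeholder and the authoritative command set and options are listed in the appendix referenced above.
~(keystone_admin)]$ software upload starlingx-10.0.1.patch
~(keystone_admin)]$ software deploy precheck starlingx-10.0.1
~(keystone_admin)]$ software deploy start starlingx-10.0.1
~(keystone_admin)]$ software deploy host controller-0
~(keystone_admin)]$ software deploy activate
~(keystone_admin)]$ software deploy complete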
Infrastructure Management Component Updates¶
In StarlingX 10.0, the new Unified Software Management framework supports enhanced Patch Release packaging and enhanced Major Release deployments.
Patch Release packaging has been simplified to deliver new or modified Debian packages, instead of the opaque OSTree build deltas used previously. This allows for inspection and validation of Patch Release content prior to deployment, and allows for future flexibility in Patch Release packaging.
Major Release deployments have been enhanced to fully leverage OSTree. An OSTree deploy is now used to update the host software. The new software’s root filesystem can be installed on the host, while the host is still running the software of the old root filesystem. The host is simply rebooted into the new software’s root filesystem. This provides a significant improvement in both the upgrade duration and the upgrade service impact (especially for AIO-SX systems), as previously upgrading hosts needed to have disks/root-filesystems wiped and then software re-installed.
See
Unified Software Management - Rollback Orchestration AIO-SX¶
VIM Patch Orchestration has been enhanced to support the abort and rollback of a Patch Release software deployment. VIM Patch Orchestration rollback will automate the abort and rollback steps across all hosts of a Cloud configuration.
Note
In StarlingX 10.0, VIM Patch Orchestration Rollback is only supported for AIO-SX configurations.
In StarlingX 10.0 VIM Patch Orchestration Rollback is only supported if the Patch Release software deployment has been aborted or failed prior to the ‘software deploy activate’ step. If the Patch Release software deployment is at or beyond the ‘software deploy activate’ step, then an install plus restore of the Cloud is required in order to rollback the Patch Release deployment.
Enhancements to Full Debian Support¶
The kernel can be switched at runtime between the standard and lowlatency variants.
Support for Kernel Live Patching (for possible scenarios)¶
StarlingX supports kernel live patching, which enables critical functions to be fixed without rebooting the system, keeping the system functional and running. The live-patching modules are built into the upgraded StarlingX binary patch.
The upgraded binary patch is generated as the in-service type (non-reboot-required). The kernel modules are matched with the correct kernel release version during binary patch upgrading.
The relevant kernel module can be found in the location: ‘/lib/modules/<release-kernel-version>/extra/kpatch’
During binary patch upgrading, the user space tool kpatch is used for the following operations (a usage sketch follows this list):
installing the kernel module to ${installdir}
loading (insmod) the kernel module for the running kernel
unloading (rmmod) the kernel module from the running kernel
uninstalling the kernel module from ${installdir}
listing the enabled live patch kernel module
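As a usage sketch only, assuming a live-patch module named kpatch_example.ko delivered under /lib/modules/<release-kernel-version>/extra/kpatch (the module name is a placeholder), the kpatch tool maps to these operations roughly as follows:
# load (insmod) the module into the running kernel
sudo kpatch load /lib/modules/$(uname -r)/extra/kpatch/kpatch_example.ko
# list the enabled live patch kernel modules
sudo kpatch list
# unload (rmmod) the module from the running kernel
sudo kpatch unload kpatch_example
# install / uninstall the module to/from the install directory
sudo kpatch install /lib/modules/$(uname -r)/extra/kpatch/kpatch_example.ko
sudo kpatch uninstall kpatch_example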
Subcloud Phased Deployment¶
Subclouds can be deployed using individual phases. Therefore, instead of using a single operation, a subcloud can be deployed by executing each phase individually. Users have the flexibility to proactively abort the deployment based on their needs. When the deployment is resumed, previously installed contents will be still valid.
Kubernetes Local Client Access¶
You can configure Kubernetes access for a user logged in to the active controller either through SSH or by using the system console.
Kubernetes Remote Client Access¶
The access to the Kubernetes cluster from outside the controller can be done using the remote CLI container or using the host directly.
IPv4/IPv6 Dual Stack support for Platform Networks¶
Migration of a single-stack deployment to a dual-stack network deployment will not cause service disruptions.
Dual-stack networking enables the simultaneous use of both IPv4 and IPv6 addresses, or the continued use of a single IP version independently. To accomplish this, platform networks can be associated with one or two address pools, one for each IP version (IPv4 or IPv6). The first pool is linked to the network upon creation and cannot be subsequently removed. The second pool can be added or removed to transition the system between dual-stack and single-stack modes.
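As an illustration only, the sketch below shows how a second (IPv6) address pool might be added to the management network to move from single-stack to dual-stack. The command names, pool name, and addresses are assumptions for illustration and may differ from the actual CLI on your system; consult the dual-stack documentation for the authoritative procedure.
~(keystone_admin)]$ system addrpool-add management-pool-ipv6 fd01:: 64 --ranges fd01::2-fd01::fffe
~(keystone_admin)]$ system network-addrpool-assign mgmt management-pool-ipv6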
Run Kata Containers in Kubernetes¶
There are two methods to run Kata Containers in Kubernetes: by runtime class or by annotation. Runtime class has been supported in Kubernetes since v1.12.0 and is the recommended method for running Kata Containers.
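A minimal sketch of the runtime-class method, assuming a RuntimeClass named kata is installed on the cluster (the pod and image names are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: kata-demo
spec:
  runtimeClassName: kata        # run this pod's containers under the Kata runtime
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]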
External DNS Alternative: Adding Local Host Entries¶
You can configure user-defined host entries for external resources that are not
maintained by DNS records resolvable by the external DNS server(s) (i.e.
nameservers
in system dns-show/dns-modify
). This functionality enables
the configuration of local host records, supplementing hosts resolvable by
external DNS server(s).
Power Metrics Enablement - vRAN Integration¶
StarlingX 10.0 integrates an enhanced power metrics tool with reduced impact on vRAN field deployments.
Power Metrics may increase scheduling latency due to perf and MSR readings. A latency impact of around 3 µs on average was observed, plus spikes with significant increases in maximum latency values. There was also an impact on kernel processing time. Applications that run with priorities at or above 50 on real-time kernel isolated CPUs should allow kernel services to run in order to avoid unexpected system behavior.
Crash dump File Size Setting Enhancements¶
The Linux kernel can be configured to perform a crash dump and reboot in response to specific serious events. A crash dump event produces a crash dump report with a bundle of files that represent the state of the kernel at the time of the event, which is useful for post-event root cause analysis.
Crash dump files generated by Linux kdump (produced during kernel panics by default) are managed by the crashDumpMgr utility. The utility saves crash dump files, but previously used a fixed configuration when doing so. To provide more flexible handling, the crashDumpMgr utility has been enhanced to support the following configuration parameters that control the storage and rotation of crash dump files.
Maximum Files: New configuration parameter for the number of saved crash dump files (default 4).
Maximum Size: Limit the maximum size of an individual crash dump file (support for unlimited, default 5GB).
Maximum Used: Limit the maximum storage used by saved crash dump files (support for unlimited, default unlimited).
Minimum Available: Limit the minimum available storage on the crash dump file system (restricted to minimum 1GB, default 10%).
The service parameters must be specified using the following service hierarchy. It is recommended to model the parameters after the platform coredump service parameters for consistency.
platform crashdump <parameter>=<value>
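For example, assuming a parameter name modeled on the coredump service parameters (the parameter name below is illustrative, not authoritative), the maximum number of saved crash dump files would be set with the service-parameter commands:
~(keystone_admin)]$ system service-parameter-add platform crashdump max_files=4
~(keystone_admin)]$ system service-parameter-apply platform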
Subcloud Install or Restore of Previous Release¶
The StarlingX 10.0 System Controller supports fresh install or restore of both StarlingX 9.0 and StarlingX 10.0 subclouds.
If the upgrade is from StarlingX 9.0 to a higher release, the prestage status
and prestage versions fields in the output of the
dcmanager subcloud list command will be empty, regardless of whether
the deployment status of the subcloud was prestage-complete
before the upgrade.
These fields will only be updated with values if you run subcloud prestage
or prestage orchestration
again.
See: Subclouds Previous Major Release Management
For non-prestaged subcloud remote installations
The ISO imported via load-import --active
should always be at the same patch
level as the system controller. This is to ensure that the subcloud boot image
aligns with the patch level of the load to be installed on the subcloud.
See: Installing a Subcloud Using Redfish Platform Management Service
For prestaged remote subcloud installations
The ISO imported via load-import --inactive
should be at the same patch level
as the system controller. If the system controller is patched after subclouds
have been prestaged, it is recommended to repeat the prestaging for each
subcloud. This is to ensure that the subcloud boot image aligns with the patch
level of the load to be installed on the subcloud.
See: Prestaging Requirements
WAD Users Access Right Control via Group¶
You can configure an LDAP / WAD user with ‘sys_protected’ group or ‘sudo all’.
An LDAP / WAD user in the ‘sys_protected’ group on StarlingX:
is equivalent to the special ‘sysadmin’ bootstrap user via “source /etc/platform/openrc”
has Keystone admin/admin identity and credentials, and
has Kubernetes /etc/kubernetes/admin.conf credentials
Only a small number of users have this capability.
An LDAP / WAD user with ‘sudo all’ capability on StarlingX can perform the following StarlingX-type operations:
sw_patch to unauthenticated endpoint
docker/crictl to communicate with the respective daemons
using some utilities, like show-certs.sh and license-install (recovery only)
IP configuration for local network setup
password changes of Linux users (i.e. local LDAP)
access to restricted files, including some logs
manual reboots
The local LDAP server by default serves both HTTPS on port 636 and HTTP on port 389.
The HTTPS server certificate is issued by cert-manager ClusterIssuer
system-local-ca
and is managed internally by cert-manager. The certificate
will be automatically renewed when the expiration date approaches. The
certificate is called system-openldap-local-certificate
with its secret
having the same name system-openldap-local-certificate
in the
deployment
namespace. The server certificate and private key files are
stored in the /etc/ldap/certs/
system directory.
See:
Accessing Collect Command with ‘sudo’ privileges and membership in ‘sys-protected’ Group¶
StarlingX 10.0 adds support for running Collect from any local LDAP or remote WAD user account with ‘sudo’ capability and membership in the ‘sys_protected’ group.
The Collect tool continues to support being run from the ‘sysadmin’ user account, as well as from any other successfully created LDAP or WAD account with ‘sudo’ capability and membership in the ‘sys_protected’ group.
For security reasons, passwordless ‘sudo’ continues to be unsupported.
Support for Intel In-tree Driver¶
The system supports both in-tree and out-of-tree versions of the Intel ice
,
i40e
, and iavf
drivers. On initial installation, the system uses the
default out-of-tree driver version. You can switch between the in-tree and
out-of-tree driver versions. For further details:
See: Switch Intel Driver Versions
Note
The ice in-tree driver does not support SyncE/GNSS deployments.
Password Rules Enhancement¶
You can check current password expiry settings by running the chage -l <username> command, replacing <username> with the name of the user whose password expiry settings you wish to view.
You can also change password expiry settings by running the sudo chage -M <days_to_expiry> <username> command.
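For example, to view the settings for a hypothetical user account jsmith and then set its password to expire after 90 days:
chage -l jsmith
sudo chage -M 90 jsmith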
The following new password rules apply:
Passwords must have a minimum length of 12 characters.
The password must contain at least one letter, one number, and one special character.
Do not reuse the past 5 passwords.
The password expiration period can be defined by users; by default it is set to 90 days.
See:
Management Network Reconfiguration after Deployment Completion Phase 1 AIO-SX¶
StarlingX 10.0 supports changes to the management IP addresses for a standalone AIO-SX and for an AIO-SX subcloud after the node is completely deployed.
See:
Networking Statistic Support¶
The Node Interface Metrics Exporter application is designed to fetch and
display node statistics in a Kubernetes environment. It deploys an Interface
Metrics Exporter DaemonSet on all nodes with the
starlingx.io/interface-metrics=true node
label. It uses the Netlink library
to gather data directly from the kernel, offering real-time insights into node
performance.
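A sketch of enabling the exporter on a node is shown below. The host name is a placeholder, and whether a re-apply of the application is needed depends on how it was deployed; the label itself is the one documented above.
~(keystone_admin)]$ system host-label-assign worker-0 starlingx.io/interface-metrics=true
~(keystone_admin)]$ system application-apply node-interface-metrics-exporter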
Add Existing Cloud as Subcloud Without Reinstallation¶
The subcloud enrollment feature converts a system that was factory pre-installed, or initially deployed as a standalone cloud, into a subcloud of a Distributed Cloud. Factory pre-installed standalone systems are installed locally in the factory, and later deployed and configured on-site as a DC subcloud without re-installing the system.
See: Enroll a Factory Installed Non Distributed Standalone System as a Subcloud
Rook Support for freshly Installed StarlingX¶
The new Rook Ceph application will be used for deploying the latest version of Ceph via Rook.
Rook Ceph is an orchestrator that provides a containerized solution for Ceph Storage with a specialized Kubernetes Operator to automate the management of the cluster. It is an alternative solution to the bare metal Ceph storage. See https://rook.io/docs/rook/latest-release/Getting-Started/intro/ for more details.
The deployment model is the topology strategy that defines the storage backend capabilities of the deployment. The deployment model dictates what the storage solution will look like by defining rules for the placement of storage cluster elements.
Enhanced Availability for Ceph on AIO-DX¶
Ceph on AIO-DX now works with 3 Ceph monitors providing High Availability and enhancing uptime and resilience.
Available Deployment Models¶
Each deployment model works with different deployment strategies and rules to fit different needs. The following models are available to match the requirements of your cluster:
Controller Model (default)
Dedicated Model
Open Model
Storage Backend¶
Configuration of the storage backend defines the deployment models characteristics and main configurations.
Migration with Rook container based Ceph Installations¶
When you migrate an AIO-SX to an AIO-DX subcloud with Rook container-based Ceph installations in StarlingX 10.0, you need to follow the additional procedural steps below:
Procedure
After you configure controller-1, follow the steps below:
Add a new Ceph monitor on controller-1.
Add an OSD on controller-1.
List host’s disks and identify disks you want to use for Ceph OSDs. Ensure you note the UUIDs.
Add disks as an OSD storage.
List OSD storage devices.
Unlock controller-1 and follow the steps below:
Wait until Ceph is updated with two active monitors. To verify the updates, run the ceph -s command and ensure the output shows mon: 2 daemons, quorum a,b. This confirms that both monitors are active.
Add the floating monitor.
Wait for the controller to reset and come back up to an operational state.
Re-apply the
rook-ceph
application.
To Install and Uninstall Rook Ceph¶
See:
Performance Configurations on Rook Ceph¶
When using Rook Ceph it is important to consider resource allocation and configuration adjustments to ensure optimal performance. Rook introduces additional management overhead compared to a traditional bare-metal Ceph setup and needs more infrastructure resources.
Protecting against L2 Network Attackers - Securing local traffic on MGMT networks¶
A new security solution is introduced for the StarlingX inter-host management network to defend against attackers with direct access to local StarlingX L2 VLANs. It specifically protects LOCAL traffic on the MGMT network, which is used for private/internal infrastructure management of the StarlingX cluster.
It provides protection against both passive and active attackers accessing private/internal data, which could risk the security of the cluster:
passive attackers that are snooping traffic on L2 VLANs (MGMT), and
active attackers attempting to connect to private internal endpoints on StarlingX L2 interfaces (MGMT) on StarlingX hosts.
IPsec is a set of communication rules or protocols for setting up secure connections over a network. StarlingX utilizes IPsec to protect local traffic on the internal management network of multi-node systems.
StarlingX uses strongSwan as the IPsec implementation. strongSwan is an opensource IPsec solution. See https://strongswan.org/ for more details.
For the most part, IPsec on StarlingX is transparent to users.
See:
Vault application support for running on application cores¶
By default the Vault application’s pods will run on platform cores.
If static kube-cpu-mgr-policy is selected and the label app.starlingx.io/component is overridden for the Vault namespace or pods, there are two requirements:
The Vault server pods need to be restarted as directed by Hashicorp Vault documentation. Restart each of the standby server pods in turn, then restart the active server pod.
Ensure that sufficient hosts with worker function are available to run the Vault server pods on application cores.
See:
Restart the Vault Server pods¶
The Vault server pods do not restart automatically. If the pods are to be re-labelled to switch execution from platform to application cores, or vice-versa, then the pods need to be restarted.
Under Kubernetes the pods are restarted using the kubectl delete pod command. See, Hashicorp Vault documentation for the recommended procedure for restarting server pods in HA configuration, https://support.hashicorp.com/hc/en-us/articles/23744227055635-How-to-safely-restart-a-Vault-cluster-running-on-Kubernetes.
Ensure that sufficient hosts are available to run the server pods on application cores¶
A standard cluster with fewer than 3 worker nodes does not support Vault HA on the application cores. In this configuration (fewer than three cluster hosts with worker function):
When setting label app.starlingx.io/component=application with the Vault app already applied in HA configuration (3 Vault server pods), ensure that there are 3 nodes with worker function to support the HA configuration.
When applying Vault for the first time with app.starlingx.io/component set to “application”: ensure that the server replicas are also set to 1 for a non-HA configuration. The replicas for the Vault server are overridden for both the Vault Helm chart and the Vault manager Helm chart:
cat <<EOF > vault_overrides.yaml
server:
  extraLabels:
    app.starlingx.io/component: application
  ha:
    replicas: 1
injector:
  extraLabels:
    app.starlingx.io/component: application
EOF

cat <<EOF > vault-manager_overrides.yaml
manager:
  extraLabels:
    app.starlingx.io/component: application
server:
  ha:
    replicas: 1
EOF

$ system helm-override-update vault vault vault --values vault_overrides.yaml
$ system helm-override-update vault vault-manager vault --values vault-manager_overrides.yaml
Component Based Upgrade and Update - VIM Orchestration¶
VIM Patch Orchestration in StarlingX 10.0 has been updated to interwork with the new underlying Unified Software Management APIs.
As before, VIM Patch Orchestration automates the patching of software across all hosts of a Cloud configuration. All Cloud configurations are supported; AIO-SX, AIO-DX, AIO-DX with worker nodes, Standard configuration with controller storage and Standard configuration with dedicated storage.
Note
This includes the automation of both applying a Patch and removing a Patch.
See
Subcloud Remote Install, Upgrade and Prestaging Adaptation¶
StarlingX 10.0 supports a software management upgrade/update process that does not require re-installation. The procedure for upgrading a system is simplified, since the existing filesystem and associated release configuration remain intact in the version-controlled paths (e.g. /opt/platform/config/<version>). In addition, the /var and /etc directories are retained, meaning that updates can be done directly as part of the software migration procedure. This eliminates the need to perform a backup and restore procedure for AIO-SX based systems. The rollback procedure can also revert to the existing versioned or saved configuration if an error occurs and the system must be reverted to the older software release.
With this change, prestaging for an upgrade involves populating a new OSTree deployment directory in preparation for an atomic upgrade and pulling new container image versions into the local container registry. Since the system is not reinstalled, there is no requirement to save container images to a protected partition during the prestaging process; the new container images can be populated in the local container registry directly.
See: Prestage a Subcloud
Update Default Certificate Configuration on Installation¶
You can configure default certificates during install for both standalone and Distributed Cloud systems.
New bootstrap overrides for system-local-ca (Platform Issuer)
You can customize the Platform Issuer (system-local-ca) used to sign the platform certificates with an external Intermediate CA from bootstrap, using the new bootstrap overrides.
See: Platform Issuer
Note
It is recommended to configure these overrides. If they are not configured, system-local-ca will be configured using a local auto-generated Kubernetes Root CA.
REST API / Horizon GUI and Docker Registry certificates are issued during bootstrap
The certificates for StarlingX REST APIs / Horizon GUI access and the Local Docker Registry will be automatically issued by system-local-ca during bootstrap. They will be anchored to the system-local-ca Root CA public certificate, so only this certificate needs to be added to the user's list of trusted CAs.
HTTPS enabled by default for StarlingX REST API access
The system is now configured by default with HTTPS enabled for access to StarlingX API and the Horizon GUI. The certificate used to secure this will be anchored to
system-local-ca
Root CA public certificate.
Playbook to update system-local-ca and re-sign the renamed platform certificates
The migrate_platform_certificates_to_certmanager.yml playbook is renamed to update_platform_certificates.yml.
External certificates provided in bootstrap overrides can now be provided as base64 strings, such that they can be securely stored with Ansible Vault
The following bootstrap overrides for certificate data CAN be provided as the certificate / key converted into single line base64 strings instead of the filepath for the certificate / key:
ssl_ca_cert
k8s_root_ca_cert and k8s_root_ca_key
etcd_root_ca_cert and etcd_root_ca_key
system_root_ca_cert, system_local_ca_cert and system_local_ca_key
Note
You can secure the certificate data in an encrypted bootstrap overrides file using Ansible Vault.
The base64 string can be obtained using the base64 -w0 <cert_file> command. The string can be included in the overrides YAML file (secured via Ansible Vault), and the insecurely managed cert_file can then be removed from the system, as sketched below.
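A sketch of this workflow; the file name and the truncated base64 value are placeholders.
# convert the CA certificate into a single-line base64 string
base64 -w0 my-ca-cert.pem

# add the resulting string to the bootstrap overrides (localhost.yml), e.g.:
ssl_ca_cert: "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0t..."

# then encrypt the overrides file with Ansible Vault
ansible-vault encrypt localhost.yml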
Dell CSI Driver Support - Test with Dell PowerStore¶
StarlingX 10.0 adds a new system application that provides Kubernetes CSM/CSI support for Dell Storage Platforms. With this application, the user can communicate with Dell PowerScale, PowerMax, PowerFlex, PowerStore and Unity XT storage platforms to provision PVCs and use them in Kubernetes stateful applications.
See: Dell Storage File System Provisioner for details on installation and configurations.
O-RAN O2 IMS and DMS Interface Compliancy Update¶
With the new updates to the Infrastructure Management Services (IMS) and Deployment Management Services (DMS) for the O-RAN O2 J-release, OAuth2 and mTLS are mandatory options. The implementation is fully compliant with the latest O-RAN specs: O2 IMS interface R003-v05.00 and O2 DMS interface K8s profile R003-v04.00. Kubernetes Secrets are no longer required.
The services implemented include:
O2 API with mTLS enabled
O2 API supported OAuth2.0
Compliance with O2 IMS and DMS specs
See: O-RAN O2 Application
Configure Liveness Probes for PTP Notification Pods¶
Helm overrides can be used to configure liveness probes for ptp-notification
containers.
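The override keys are defined by the ptp-notification Helm chart; the values below are an illustrative sketch only (the key names, chart name, and namespace are assumptions, not confirmed chart values).
cat <<EOF > ptp-notification-overrides.yaml
livenessProbe:
  enabled: true
  initialDelaySeconds: 10
  periodSeconds: 30
EOF
~(keystone_admin)]$ system helm-override-update ptp-notification ptp-notification notification --values ptp-notification-overrides.yaml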
Intel QAT and GPU Plugins¶
The QAT and GPU applications provide a set of plugins developed by Intel to facilitate the use of Intel hardware features in Kubernetes clusters. These plugins are designed to enable and optimize the use of Intel-specific hardware capabilities in a Kubernetes environment.
Intel GPU plugin enables Kubernetes clusters to utilize Intel GPUs for hardware acceleration of various workloads.
Intel® QuickAssist Technology (Intel® QAT) accelerates cryptographic workloads by offloading the data to hardware capable of optimizing those functions.
The following QAT and GPU plugins are supported in StarlingX 10.0.
See:
Support for Sapphire Rapids Integrated QAT¶
Intel 4th generation Xeon Scalable Processor (Sapphire Rapids) support has been introduced in StarlingX 10.0.
Drivers for QAT Gen 4 Intel Xeon Gold Scalable processor (Sapphire Rapids)
Intel Xeon Gold 6428N
Sapphire Rapids Data Streaming Accelerator Support¶
Intel® DSA is a high-performance data copy and transformation accelerator integrated into Intel® processors starting with 4th Generation Intel® Xeon® processors. It is targeted for optimizing streaming data movement and transformation operations common with applications for high-performance storage, networking, persistent memory, and various data processing applications.
DPDK Private Mode Support¶
For the purpose of enabling and using needVhostNet
, SR-IOV needs to be
configured on a worker host.
SR-IOV FEC Operator Support¶
FEC Operator 2.9.0 is adopted based on Intel recommendations offering features for various Intel hardware accelerators used for field deployments.
See: Configure Intel Wireless FEC Accelerators using SR-IOV FEC operator
Support for Advanced VMs on Stx Platform with KubeVirt¶
The KubeVirt system application kubevirt-app-1.1.0 in StarlingX includes KubeVirt, Containerized Data Importer (CDI) v1.58.0, and the virtctl client tool. StarlingX 10.0 adds enhancements for this application; the documentation describes the KubeVirt architecture, provides steps to install KubeVirt, and gives examples for effective implementation in your environment.
See:
Support Harbor Registry (Harbor System Application)¶
Harbor registry is integrated as a System Application. End users can use Harbor, running on StarlingX, for holding and managing their container images. The Harbor registry is currently not used by the platform.
Harbor is an open-source registry that secures artifacts with policies and role-based access control, ensures images are scanned and free from vulnerabilities, and signs images as trusted. Harbor has evolved into a complete OCI-compliant cloud-native artifact registry.
With Harbor v2.0, users can manage images, manifest lists, Helm charts, CNABs, OPAs and other artifacts that adhere to the OCI image specification. Harbor also allows pulling, pushing, deleting, tagging, replicating, and scanning these artifacts. Signing images and manifest lists is also possible.
Note
When using local LDAP for authentication of the Harbor system application, you cannot use local LDAP groups for authorization; use only individual local LDAP users for authorization.
Support for DTLS over SCTP¶
DTLS (Datagram Transport Layer Security) v1.2 is supported in StarlingX 10.0.
The SCTP module is now autoloaded by default.
The socket buffer size values have been increased:
Old values (in Bytes):
net.core.rmem_max=425984
net.core.wmem_max=212992
New Values (In Bytes):
net.core.rmem_max=10485760
net.core.wmem_max=10485760
To enable each SCTP socket association to have its own buffer space, the socket accounting policies have been updated as follows:
net.sctp.sndbuf_policy=1
net.sctp.rcvbuf_policy=1
Old value:
net.sctp.auth_enable=0
New value:
net.sctp.auth_enable=1
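You can verify the values in effect on a running host with sysctl, for example:
sysctl net.core.rmem_max net.core.wmem_max
sysctl net.sctp.sndbuf_policy net.sctp.rcvbuf_policy net.sctp.auth_enable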
Hardware Updates¶
See:
Bug status¶
Fixed bugs¶
This release provides fixes for a number of defects. Refer to the StarlingX bug database to review the R10.0 Fixed Bugs.
Known Limitations and Procedural Changes¶
The following are known limitations you may encounter with StarlingX 10.0 and earlier releases. Workarounds are suggested where applicable.
Note
These limitations are considered temporary and will likely be resolved in a future release.
Ceph Daemon Crash and Health Warning¶
After a Ceph daemon crash, an alarm is displayed to verify Ceph health.
Run ceph -s
to display the following message:
cluster:
id: <id>
health: HEALTH_WARN
1 daemons have recently crashed
One or more Ceph daemons have crashed, and the crash has not yet been archived or acknowledged by the administrator.
Procedural Changes: Archive the crash to clear the health check warning and the alarm.
List the timestamp/uuid crash-ids for all new crash information:
[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash ls-new
Display details of a saved crash.
[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash info <crash-id>
Archive the crash so it no longer appears in the ceph crash ls-new output.
[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash archive <crash-id>
After archiving the crash, make sure the recent crash is not displayed.
[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash ls-new
If more than one crash needs to be archived, run the following command.
[sysadmin@controller-0 ~(keystone_admin)]$ ceph crash archive-all
Rook Ceph Application Limitation¶
After applying Rook Ceph application in an AIO-DX configuration the
800.001 - Storage Alarm Condition: HEALTH_WARN
alarm may be triggered.
Procedural Changes: Restart the pod of the monitor associated with the slow operations detected by Ceph (check ceph -s).
Subcloud failed during rehoming while creating RootCA update strategy¶
Subcloud rehoming may fail while creating the RootCA update strategy.
Procedural Changes: Delete the subcloud from the new System Controller and rehome it again.
RSA required to be the platform issuer private key¶
The system-local-ca
issuer needs to use RSA type certificate/key. The usage
of other types of private keys is currently not supported during bootstrap
or with the Update system-local-ca or Migrate Platform Certificates to use
Cert Manager
procedures.
Procedural Changes: N/A.
Host lock/unlock may interfere with application apply¶
Host lock and unlock operations may interfere with applications that are in the applying state.
Procedural Changes: Re-applying or removing / installing applications may be required. Application status can be checked using the system application-list command.
Add / delete operations on pods may result in errors¶
Under some circumstances, add / delete operations on pods may result in pods staying in the ContainerCreating/Terminating state and reporting an 'error getting ClusterInformation: connection is unauthorized: Unauthorized' error. This error may also prevent users from locking the host.
Procedural Changes: If this error occurs, run the kubectl describe pod -n <namespace> <pod name> command. The following message is displayed:
error getting ClusterInformation: connection is unauthorized: Unauthorized
Note
There is a known issue with the Calico CNI that may occur in rare occasions if the Calico token required for communication with the kube-apiserver becomes out of sync due to NTP skew or issues refreshing the token.
Procedural Changes: Delete the calico-node pod (causing it to automatically restart) using the following commands:
$ kubectl get pods -n kube-system --show-labels | grep calico
$ kubectl delete pods -n kube-system -l k8s-app=calico-node
Deploy does not fail after a system reboot¶
Deploy does not fail after a system reboot.
Procedural Changes: Run the sudo software-deploy-set-failed --hostname/-h <hostname> --confirm utility to manually move the deploy and deploy host to a failed state after a failover, lost power, network outage, etc. You can only run this utility with root privileges on the active controller.
The utility displays the current state and warns the user about the next steps to be taken if the user needs to continue executing the utility. It also displays the new states and the next operation to be executed.
Rook-ceph application limitations¶
This section documents the following known limitations you may encounter with the rook-ceph application, and procedural changes that you can use to resolve the issues.
Remove all OSDs in a host
The procedure to remove OSDs will not work as expected when removing all
OSDs from a host. The Ceph cluster gets stuck in HEALTH_WARN
state.
Note
Use the Procedural change only if the cluster is stuck in HEALTH_WARN
state after removing all OSDs on a host.
Procedural Changes:
Check the cluster health status.
Check crushmap tree.
Remove the host(s) shown as empty in the crushmap tree from the previous command.
Check the cluster health status.
Use the rook-ceph apply command when a host with OSD is in offline state
The rook-ceph apply will not allocate the OSDs correctly if the host is offline.
Note
Use either of the procedural changes below only if the OSDs are not allocated in the Ceph cluster.
Procedural Changes 1:
Check if the OSD is not in crushmap tree.
Restart the rook-ceph operator pod.
Note
Wait about 5 minutes to let the operator try to recover the OSDs.
Check if the OSDs have been added in crushmap tree.
Procedural Changes 2:
Check if the OSD is not in the crushmap tree OR it is in the crushmap tree but not allocated in the correct location (within a host).
Lock the host
Wait for the host to be locked.
Get the list from the OSDs inventory from the host.
Remove the OSDs from the inventory.
Reapply the rook-ceph application.
Wait for OSDs prepare pods to be recreated.
Add the OSDs in the inventory.
Reapply the rook-ceph application.
Wait for new OSD pods to be created and running.
Unable to set maximum VFs for NICs using out-of-tree ice driver v1.14.9.2 on systems with a large number of cores¶
On systems with a large number of cores (>= 32 physical cores / 64 threads), it is not possible to set the maximum number of VFs (32) for NICs using the out-of-tree ice driver v1.14.9.2.
If the issue is encountered, the following error logs will be reported in kern.log:
[ 83.322344] ice 0000:51:00.1: Only 59 MSI-X interrupts available for SR-IOV. Not enough to support minimum of 2 MSI-X interrupts per VF for 32 VFs
[ 83.322362] ice 0000:51:00.1: Not enough resources for 32 VFs, err -28. Try with fewer number of VFs
The impacted NICs are:
Intel E810
Silicom STS2
Procedural Changes: Reduce the number of configured VFs. To determine the maximum number of supported VFs:
Check /sys/class/net/<interface name>/device/sriov_vf_total_msix. Example:
cat /sys/class/net/enp81s0f0/device/sriov_vf_total_msix
59
Calculate the maximum number of VFs as sriov_vf_total_msix / 2. Example:
max_VFs = 59/2 = 29
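The same calculation can be done directly in the shell (the interface name is an example):
echo $(( $(cat /sys/class/net/enp81s0f0/device/sriov_vf_total_msix) / 2 ))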
Critical alarm 800.001 after Backup and Restore on AIO-SX Systems¶
A Critical alarm 800.001 may be triggered after running the Restore Playbook. The alarm details are as follows:
~(keystone_admin)]$ fm alarm-list
+-------+----------------------------------------------------------------------+--------------------------------------+----------+---------------+
| Alarm | Reason Text | Entity ID | Severity | Time Stamp |
| ID | | | | |
+-------+----------------------------------------------------------------------+--------------------------------------+----------+---------------+
| 800. | Storage Alarm Condition: HEALTH_ERR. Please check 'ceph -s' for more | cluster= | critical | 2024-08-29T06 |
| 001 | details. | 96ebcfd4-3ea5-4114-b473-7fd0b4a65616 | | :57:59.701792 |
| | | | | |
+-------+----------------------------------------------------------------------+--------------------------------------+----------+---------------+
Procedural Changes: To clear this alarm run the following commands:
Note
Applies only to AIO-SX systems.
FS_NAME=kube-cephfs
METADATA_POOL_NAME=kube-cephfs-metadata
DATA_POOL_NAME=kube-cephfs-data
# Ensure that the Ceph MDS is stopped
sudo rm -f /etc/pmon.d/ceph-mds.conf
sudo /etc/init.d/ceph stop mds
# Recover MDS state from filesystem
ceph fs new ${FS_NAME} ${METADATA_POOL_NAME} ${DATA_POOL_NAME} --force
# Try to recover from some common errors
sudo ceph fs reset ${FS_NAME} --yes-i-really-mean-it
cephfs-journal-tool --rank=${FS_NAME}:0 event recover_dentries summary
cephfs-journal-tool --rank=${FS_NAME}:0 journal reset
cephfs-table-tool ${FS_NAME}:0 reset session
cephfs-table-tool ${FS_NAME}:0 reset snap
cephfs-table-tool ${FS_NAME}:0 reset inode
sudo /etc/init.d/ceph start mds
Error installing Rook Ceph on AIO-DX with host-fs-add before controllerfs-add¶
When you provision controller-0 manually prior to unlock, the following sequence of commands fails:
~(keystone_admin)]$ system storage-backend-add ceph-rook --confirmed
~(keystone_admin)]$ system host-fs-add controller-0 ceph=20
~(keystone_admin)]$ system controllerfs-add ceph-float=20
The following error occurs when you run the controllerfs-add command:
“Failed to create controller filesystem ceph-float: controllers have pending LVG updates, please retry again later”.
Procedural Changes: To avoid this issue, run the commands in the following sequence:
~(keystone_admin)]$ system storage-backend-add ceph-rook --confirmed
~(keystone_admin)]$ system controllerfs-add ceph-float=20
~(keystone_admin)]$ system host-fs-add controller-0 ceph=20
Intermittent installation of Rook-Ceph on Distributed Cloud¶
If the rook-ceph installation fails, this may be due to ceph-mgr-provision not being provisioned correctly.
Procedural Changes: It is recommended to run the system application-remove rook-ceph --force command and then re-initiate the rook-ceph installation.
Vault application is not supported during Bootstrap¶
The Vault application cannot be configured during Bootstrap.
Procedural Changes:
The application must be configured after the platform nodes are unlocked /
enabled / available, a storage backend is configured, and platform-integ-apps
is applied. If Vault is to be run in HA configuration (3 vault server pods)
then at least three controller / worker nodes must be unlocked / enabled / available.
cert-manager cm-acme-http-solver pod fails¶
On a multinode setup, when you deploy an ACME issuer to issue a certificate, the cm-acme-http-solver pod might fail and stay in the "ImagePullBackOff" state due to the following defect: https://github.com/cert-manager/cert-manager/issues/5959.
Procedural Changes:
If you are using the namespace “test”, create a docker-registry secret “testkey” with local registry credentials in the “test” namespace.
~(keystone_admin)]$ kubectl create secret docker-registry testkey --docker-server=registry.local:9001 --docker-username=admin --docker-password=Password*1234 -n test
Use the secret “testkey” in the issuer spec as follows:
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: stepca-issuer
  namespace: test
spec:
  acme:
    server: https://test.com:8080/acme/acme/directory
    skipTLSVerify: true
    email: test@test.com
    privateKeySecretRef:
      name: stepca-issuer
    solvers:
    - http01:
        ingress:
          podTemplate:
            spec:
              imagePullSecrets:
              - name: testkey
          class: nginx
ptp-notification application is not supported during bootstrap¶
Deployment of ptp-notification during bootstrap is not supported due to dependencies on the system PTP configuration, which is handled post-bootstrap.
Procedural Changes: N/A.
The helm-chart-attribute-modify command is not supported for ptp-notification because the application consists of a single chart. Disabling the chart would render ptp-notification non-functional. See Application Commands and Helm Overrides for details on this command.
Procedural Changes: N/A.
Harbor cannot be deployed during bootstrap¶
The Harbor application cannot be deployed during bootstrap due to the bootstrap deployment dependencies such as early availability of storage class.
Procedural Changes: N/A.
Kubevirt Limitations¶
The following limitations apply to Kubevirt in StarlingX 10.0:
Limitation: Kubernetes does not provide CPU Manager detection.
Procedural Changes: Add cpumanager to KubeVirt:
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
      - LiveMigration
      - Macvtap
      - Snapshot
      - CPUManager
Check the label using the following command; the output should include cpumanager=true:
~(keystone_admin)]$ kubectl describe node | grep cpumanager
Limitation: Huge pages do not show up under cat /proc/meminfo inside a guest VM, although resources are being consumed on the host. For example, if a VM is using 4GB of huge pages, the host shows the same 4GB of huge pages used. The huge page memory is exposed as normal memory to the VM.
Procedural Changes: You need to configure Huge pages inside the guest OS.
See Installation Guides for more details.
Limitation: Virtual machines using Persistent Volume Claim (PVC) must have a shared ReadWriteMany (RWX) access mode to be live migrated.
Procedural Changes: Ensure PVC is created with RWX.
$ virtctl image-upload --pvc-name=cirros-vm-disk-test-2 --pvc-size=500Mi --storage-class=cephfs --access-mode=ReadWriteMany --image-path=/home/sysadmin/Kubevirt-GA-testing/latest-manifest/kubevirt-GA-testing/cirros-0.5.1-x86_64-disk.img --uploadproxy-url=https://10.111.54.246 --insecure
Note
Live migration is not allowed with a pod network binding of bridge interface type.
Live migration requires ports 49152 and 49153 to be available in the virt-launcher pod. If these ports are explicitly specified in the masquerade interface, live migration will not function.
For live migration with an SR-IOV interface:
specify networkData: in cloudinit, so that when the VM moves to another node it does not lose the IP config
specify the nameserver and internal FQDNs to connect to the cluster metadata server, otherwise cloudinit will not work
fix the MAC address, otherwise when the VM moves to another node the MAC address will change and cause a problem establishing the link
Example:
cloudInitNoCloud:
  networkData: |
    ethernets:
      sriov-net1:
        addresses:
        - 128.224.248.152/23
        gateway: 128.224.248.1
        match:
          macAddress: "02:00:00:00:00:01"
        nameservers:
          addresses:
          - 10.96.0.10
          search:
          - default.svc.cluster.local
          - svc.cluster.local
          - cluster.local
        set-name: sriov-link-enabled
    version: 2
Limitation: Snapshot CRDs and controllers are not present by default and need to be installed on StarlingX.
Procedural Changes: To install snapshot CRDs and controllers on Kubernetes, see:
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml
Additionally, create a VolumeSnapshotClass for CephFS and RBD:
cat <<EOF > cephfs-storageclass.yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-cephfsplugin-snapclass
driver: cephfs.csi.ceph.com
parameters:
  clusterID: 60ee9439-6204-4b11-9b02-3f2c2f0a4344
  csi.storage.k8s.io/snapshotter-secret-name: ceph-pool-kube-cephfs-data
  csi.storage.k8s.io/snapshotter-secret-namespace: default
deletionPolicy: Delete
EOF

cat <<EOF > rbd-storageclass.yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-rbdplugin-snapclass
driver: rbd.csi.ceph.com
parameters:
  clusterID: 60ee9439-6204-4b11-9b02-3f2c2f0a4344
  csi.storage.k8s.io/snapshotter-secret-name: ceph-pool-kube-rbd
  csi.storage.k8s.io/snapshotter-secret-namespace: default
deletionPolicy: Delete
EOF
Note
Get the cluster ID from: kubectl describe sc cephfs, rbd
Limitation: Live migration is not possible when using a configmap as a filesystem. Currently, virtual machine instances (VMIs) cannot be live migrated as virtiofs does not support live migration.
Procedural Changes: N/A.
Limitation: Live migration is not possible when a VM is using a secret exposed as a filesystem. Currently, virtual machine instances cannot be live migrated since virtiofs does not support live migration.
Procedural Changes: N/A.
Limitation: Live migration will not work when a VM is using a ServiceAccount exposed as a filesystem. Currently, VMIs cannot be live migrated since virtiofs does not support live migration.
Procedural Changes: N/A.
synce4l CLI options are not supported¶
SyncE configuration using synce4l is not supported in StarlingX 10.0.
The service type of synce4l
in the ptp-instance-add command
is not supported in StarlingX 10.0.
Procedural Changes: N/A.
Kubernetes Pod Core Dump Handler may fail due to a missing Kubernetes token¶
In certain cases the Kubernetes Pod Core Dump Handler may fail due to a missing Kubernetes token, which disables per-pod coredump configuration and limits namespace access. If application coredumps are not being generated, verify whether the k8s-coredump token is empty in the configuration file /etc/k8s-coredump-conf.json using the following command:
~(keystone_admin)]$ sudo cat /etc/k8s-coredump-conf.json
{
"k8s_coredump_token": ""
}
Procedural Changes: If the k8s-coredump token is empty in the configuration file and the kube-apiserver is verified to be responsive, users can re-execute the create-k8s-account.sh script in order to generate the appropriate token after a successful connection to kube-apiserver using the following commands:
~(keystone_admin)]$ sudo chmod +x /etc/k8s-coredump/create-k8s-account.sh
~(keystone_admin)]$ sudo /etc/k8s-coredump/create-k8s-account.sh
Limitations from previous releases
Impact of Kubernetes Upgrade to v1.24¶
In Kubernetes v1.24 support for the RemoveSelfLink
feature gate was removed.
In previous releases of StarlingX this has been set to “false” for backward
compatibility, but this is no longer an option and it is now hardcoded to “true”.
Procedural Changes: Any application that relies on this feature gate being disabled (i.e. assumes the existence of the "self link") must be updated before upgrading to Kubernetes v1.24.
Console Session Issues during Installation¶
After bootstrap and before unlocking the controller, if the console session times
out (or the user logs out), systemd
does not work properly. fm, sysinv and
mtcAgent
do not initialize.
Procedural Changes: If the console times out or the user logs out between bootstrap and unlock of controller-0, then, to recover from this issue, you must re-install the ISO.
PTP O-RAN Spec Compliant Timing API Notification¶
The v1 API only supports monitoring a single ptp4l + phc2sys instance.
Procedural Changes: Ensure the system is not configured with multiple instances when using the v1 API.
The O-RAN Cloud Notification defines a /././sync API v2 endpoint intended to allow a client to subscribe to all notifications from a node. This endpoint is not supported in StarlingX.
Procedural Changes: A specific subscription for each resource type must be created instead.
v1 / v2
v1: Support for monitoring a single ptp4l instance per host - no other services can be queried/subscribed to.
v2: The API conforms to O-RAN.WG6.O-Cloud Notification API-v02.01 with the following exceptions, which are not supported in StarlingX:
O-RAN SyncE Lock-Status-Extended notifications
O-RAN SyncE Clock Quality Change notifications
O-RAN Custom cluster names
Procedural Changes: See the respective PTP-notification v1 and v2 document subsections for further details.
Upper case characters in host names cause issues with kubernetes labelling¶
Upper case characters in host names cause issues with kubernetes labelling.
Procedural Changes: Host names should be lower case.
Installing a Debian ISO¶
The disks and disk partitions need to be wiped before the install. Installing a Debian ISO may fail with a message that the system is in emergency mode if the disks and disk partitions are not completely wiped before the install, especially if the server was previously running a CentOS ISO.
Procedural Changes: When installing a system for any Debian install, the disks must first be completely wiped using the following procedure before starting an install.
Run the following wipedisk commands before any Debian install, for each disk (e.g. sda, sdb, etc.):
sudo wipedisk
# Show
sudo sgdisk -p /dev/sda
# Clear part table
sudo sgdisk -o /dev/sda
Note
The above commands must be run before any Debian install. The above commands must also be run if the same lab is used for CentOS installs after the lab was previously running a Debian ISO.
Security Audit Logging for K8s API¶
A custom policy file can only be created at bootstrap in apiserver_extra_volumes
.
If a custom policy file was configured at bootstrap, then after bootstrap the
user has the option to configure the parameter audit-policy-file
to either
this custom policy file (/etc/kubernetes/my-audit-policy-file.yml
) or the
default policy file /etc/kubernetes/default-audit-policy.yaml
. If no
custom policy file was configured at bootstrap, then the user can only
configure the parameter audit-policy-file
to the default policy file.
Only the parameter audit-policy-file
is configurable after bootstrap, so
the other parameters (audit-log-path
, audit-log-maxsize
,
audit-log-maxage
and audit-log-maxbackup
) cannot be changed at
runtime.
Procedural Changes: NA
PTP is not supported on Broadcom 57504 NIC¶
PTP is not supported on the Broadcom 57504 NIC.
Procedural Changes: None. Do not configure PTP instances on the Broadcom 57504 NIC.
Deploying an App using nginx controller fails with internal error after controller.name override¶
A Helm override of controller.name for the nginx-ingress-controller app may result in errors when creating ingress resources later on.
Example of Helm override:
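As an illustrative sketch only (the chart name is taken from the component list above; the namespace and controller name are assumptions):
cat <<EOF > nginx-overrides.yaml
controller:
  name: my-nginx-controller
EOF
~(keystone_admin)]$ system helm-override-update nginx-ingress-controller ingress-nginx kube-system --values nginx-overrides.yaml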
Procedural Changes: NA
Optimization with a Large number of OSDs¶
Since storage nodes are not optimized by default, you may need to tune your Ceph configuration for balanced operation across deployments with a high number of OSDs. Otherwise, an alarm may be generated even if the installation succeeds.
800.001 - Storage Alarm Condition: HEALTH_WARN. Please check ‘ceph -s’
Procedural Changes: To optimize your storage nodes with a large number of OSDs, it is recommended to use the following commands:
~(keystone_admin)]$ ceph osd pool set kube-rbd pg_num 256
~(keystone_admin)]$ ceph osd pool set kube-rbd pgp_num 256
BPF is disabled¶
BPF cannot be used in the PREEMPT_RT/low latency kernel, due to the inherent incompatibility between PREEMPT_RT and BPF, see, https://lwn.net/Articles/802884/.
Some packages might be affected when PREEMPT_RT and BPF are used together. This includes, but is not limited to, the following packages:
libpcap
libnet
dnsmasq
qemu
nmap-ncat
libv4l
elfutils
iptables
tcpdump
iproute
gdb
valgrind
kubernetes
cni
strace
mariadb
libvirt
dpdk
libteam
libseccomp
binutils
libbpf
dhcp
lldpd
containernetworking-plugins
golang
i40e
ice
Procedural Changes: It is recommended not to use BPF with the real-time kernel. If required, it can still be used, for example for debugging only.
Control Group parameter¶
The control group (cgroup) parameter kmem.limit_in_bytes has been deprecated, and results in the following message in the kernel’s log buffer (dmesg) during boot-up and/or during the Ansible bootstrap procedure: “kmem.limit_in_bytes is deprecated and will be removed. Please report your use case to linux-mm@kvack.org if you depend on this functionality.” This parameter is used by a number of software packages in StarlingX, including, but not limited to, systemd, docker, containerd, libvirt etc.
Procedural Changes: NA. This is only a warning message about the future deprecation of an interface.
Kubernetes Taint on Controllers for Standard Systems¶
In Standard systems, a Kubernetes taint is applied to controller nodes in order to prevent application pods from being scheduled on those nodes; since controllers in Standard systems are intended ONLY for platform services. If application pods MUST run on controllers, a Kubernetes toleration of the taint can be specified in the application’s pod specifications.
Procedural Changes: Customer applications that need to run on controllers on Standard systems will need to be enabled/configured for Kubernetes toleration in order to ensure the applications continue working after an upgrade from StarlingX 6.0 to future StarlingX releases. It is suggested to add the Kubernetes toleration to your application prior to upgrading to StarlingX 8.0.
You can specify toleration for a pod through the pod specification (PodSpec). For example:
spec:
....
template:
....
spec
tolerations:
- key: "node-role.kubernetes.io/master"
operator: "Exists"
effect: "NoSchedule"
- key: "node-role.kubernetes.io/control-plane"
operator: "Exists"
effect: "NoSchedule"
See: Taints and Tolerations.
Storage Nodes are not considered part of the Kubernetes cluster¶
When running the system kube-host-upgrade-list command, the output only displays controller and worker hosts that have control-plane and kubelet components. Storage nodes do not have any of those components and so are not considered part of the Kubernetes cluster.
Procedural Changes: Do not include Storage nodes as part of the Kubernetes upgrade.
Application Pods with SRIOV Interfaces¶
Application Pods with SR-IOV Interfaces require a restart-on-reboot: “true” label in their pod spec template.
Pods with SR-IOV interfaces may fail to start after a platform restore or Simplex upgrade and persist in the Container Creating state due to missing PCI address information in the CNI configuration.
Procedural Changes: Application pods that require SR-IOV should add the label restart-on-reboot: “true” to their pod spec template metadata. All pods with this label will be deleted and recreated after system initialization; therefore, all pods must be restartable and managed by a Kubernetes controller (i.e. DaemonSet, Deployment, or StatefulSet) for automatic recovery.
Pod Spec template example:
template:
  metadata:
    labels:
      tier: node
      app: sriovdp
      restart-on-reboot: "true"
Storage Nodes Recovery on Power Outage¶
Storage nodes take 10-15 minutes longer to recover in the event of a full power outage.
Procedural Changes: NA
Ceph Recovery on an AIO-DX System¶
In certain instances Ceph may not recover on an AIO-DX system and remains in the down state when viewed using the ceph -s command; for example, if an OSD comes up after a controller reboot and a swact occurs, or due to other possible causes such as hardware failure of the disk or the entire host, a power outage, or a switch going down.
Procedural Changes: There is no specific command or procedure that solves the problem for all possible causes. Each case needs to be analyzed individually to find the root cause of the problem and the solution. It is recommended to contact Customer Support at, http://www.windriver.com/support.
Cert-manager does not work with uppercase letters in IPv6 addresses¶
Cert-manager does not work with uppercase letters in IPv6 addresses.
Procedural Changes: Replace the uppercase letters in IPv6 addresses with lowercase letters.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: oidc-auth-apps-certificate
  namespace: test
spec:
  secretName: oidc-auth-apps-certificate
  dnsNames:
  - ahost.com
  ipAddresses:
  - fe80::903a:1c1a:e802::11e4
  issuerRef:
    name: cloudplatform-interca-issuer
    kind: Issuer
Kubernetes Root CA Certificates¶
Kubernetes does not properly support k8s_root_ca_cert and k8s_root_ca_key being an Intermediate CA.
Procedural Changes: Accept internally generated k8s_root_ca_cert/key or customize only with a Root CA certificate and key.
Windows Active Directory¶
Limitation: The Kubernetes API does not support uppercase IPv6 addresses.
Procedural Changes: The issuer_url IPv6 address must be specified as lowercase.
Limitation: The refresh token does not work.
Procedural Changes: If the token expires, manually replace the ID token. For more information, see Configure Kubernetes Client Access.
Limitation: TLS error logs are reported in the oidc-dex container on subclouds. These logs should not have any system impact.
Procedural Changes: NA
BMC Password¶
The BMC password cannot be updated.
Procedural Changes: To update the BMC password, de-provision the BMC, and then re-provision it with the new password.
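A minimal sketch of this workaround using the system host-update command, assuming an IPMI-managed BMC; the host name, address, and credentials are placeholders:
~(keystone_admin)]$ system host-update controller-0 bm_type=none
~(keystone_admin)]$ system host-update controller-0 bm_type=ipmi bm_ip=<bmc-ip> bm_username=<bmc-user> bm_password=<new-password>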
Application Fails After Host Lock/Unlock¶
In some situations, an application may fail to apply after a host lock/unlock due to previously evicted pods.
Procedural Changes: Use the kubectl delete command to delete the evicted pods and reapply the application.
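For example, one way to find and remove the evicted (Failed) pods before re-applying the application; the application name is a placeholder:
~(keystone_admin)]$ kubectl get pods --all-namespaces --field-selector=status.phase=Failed
~(keystone_admin)]$ kubectl delete pods --all-namespaces --field-selector=status.phase=Failed
~(keystone_admin)]$ system application-apply <application-name>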
Application Apply Failure if Host Reset¶
If an application apply is in progress when a host is reset, the apply will likely fail.
Procedural Changes: Once the host recovers and the system is stable, re-apply the application.
Platform CPU Usage Alarms¶
Alarms may occur indicating platform CPU usage is greater than 90% if a large number of pods are configured with liveness probes that run every second.
Procedural Changes: To mitigate, either reduce the frequency of the liveness probes or increase the number of platform cores. An illustrative probe change is shown below.
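For example, a liveness probe interval can be relaxed in the pod spec; the probe definition and values below are illustrative only:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  # probe every 10 seconds instead of every second to reduce platform CPU load
  periodSeconds: 10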
Pods Using isolcpus¶
The isolcpus feature currently does not support allocation of thread siblings for CPU requests (i.e., a physical core plus its HT sibling).
Procedural Changes: For optimal results, if hyperthreading is enabled then isolcpus should be allocated in multiples of two in order to ensure that both SMT siblings are allocated to the same container.
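As an illustration, assuming isolated cores are exposed to pods as the windriver.com/isolcpus extended resource (the resource name is an assumption for this sketch), request them in multiples of two when hyperthreading is enabled:
resources:
  requests:
    # request both SMT siblings of a physical core
    windriver.com/isolcpus: 2
  limits:
    windriver.com/isolcpus: 2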
Restrictions on the Size of Persistent Volume Claims (PVCs)¶
There is a limitation on the size of Persistent Volume Claims (PVCs) that can be used for all StarlingX Releases.
Procedural Changes: It is recommended that all PVCs have a minimum size of 1GB. For more information, see https://bugs.launchpad.net/starlingx/+bug/1814595.
Sub-NUMA Cluster Configuration not Supported on Skylake Servers¶
Sub-NUMA cluster configuration is not supported on Skylake servers.
Procedural Changes: For servers with Skylake Gold or Platinum CPUs, Sub-NUMA clustering must be disabled in the BIOS.
The ptp-notification-demo App is Not a System-Managed Application¶
The ptp-notification-demo app is provided for demonstration purposes only. Therefore, it is not supported on typical platform operations such as Upgrades and Backup and Restore.
Procedural Changes: NA
Unable to create Kubernetes Upgrade Strategy for Subclouds using Horizon GUI¶
When creating a Kubernetes Upgrade Strategy for a subcloud using the Horizon GUI, it fails and displays the following error:
kube upgrade pre-check: Invalid kube version(s), left: (v1.24.4), right:
(1.24.4)
Procedural Changes: Use the following steps to create the strategy:
Procedure
Create a strategy for subcloud Kubernetes upgrade using the dcmanager kube-upgrade-strategy create --to-version <version> command.
Apply the strategy using the Horizon GUI or the CLI using the command dcmanager kube-upgrade-strategy apply.
Power Metrics Application in Real Time Kernels¶
When running the Power Metrics application on real-time kernels, the overall scheduling latency may increase due to inter-core interruptions caused by reading the MSRs (Model-Specific Registers).
Under intensive workloads, the kernel may not be able to handle the MSR-reading interruptions, stalling data collection because the collector is not scheduled on the affected core.
Procedural Changes: N/A.
k8s-coredump only supports lowercase annotation¶
Creating a K8s pod core dump fails when the starlingx.io/core_pattern parameter is set with upper-case characters in the pod manifest. This results in the pod being unable to find the target directory, and the coredump file is not created.
Procedural Changes: The starlingx.io/core_pattern parameter only accepts lower-case characters for the path and file name where the core dump is saved.
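A minimal pod metadata sketch with the annotation set to a lower-case path; the path and file-name pattern are illustrative:
metadata:
  annotations:
    # lower-case path and file name only (illustrative value)
    starlingx.io/core_pattern: "/var/lib/coredump/core.%e.%p"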
NetApp Permission Error¶
When installing or upgrading to Trident 20.07.1 and later with Kubernetes version 1.17 or higher, newly created volumes will not be writable if:
The storageClass does not specify parameter.fsType
The pod using the requested PVC has an fsGroup enforced as part of a security constraint
Procedural Changes: Specify parameter.fsType in the localhost.yml file under the netapp_k8s_storageclasses parameters as shown below.
The following example shows a minimal configuration in localhost.yml:
ansible_become_pass: xx43U~a96DN*m.?
trident_setup_dir: /tmp/trident
netapp_k8s_storageclasses:
  - metadata:
      name: netapp-nas-backend
    provisioner: netapp.io/trident
    parameters:
      backendType: "ontap-nas"
      fsType: "nfs"
netapp_k8s_snapshotstorageclasses:
  - metadata:
      name: csi-snapclass
See: Configure an External NetApp Deployment as the Storage Backend
Huge Page Limitation on Postgres¶
The Debian version of postgres supports huge pages and, by default, uses one huge page if available on the system, reducing the number of available huge pages by one.
Procedural Changes: The huge page setting must be disabled by setting huge_pages = off in /etc/postgresql/postgresql.conf. The postgres service then needs to be restarted using the Service Manager command sudo sm-restart service postgres.
Warning
This procedural change is not persistent; if the host is rebooted, it must be applied again. This will be fixed in a future release.
Password Expiry does not work on LDAP user login¶
On Debian, the warning message is not displayed for Active Directory users when a user logs in and the password is nearing expiry. Similarly, when a user logs in with an already expired password, the password change prompt is not displayed.
Procedural Changes: It is recommended that users rely on Directory administration tools for “Windows Active Directory” servers to handle password updates, reminders and expiration. It is also recommended that passwords should be updated every 3 months.
Note
The expired password can be reset via Active Directory by IT administrators.
Silicom TimeSync (STS) card limitations¶
Silicom and Intel based Time Sync NICs may not be deployed on the same system due to conflicting time sync services and operations.
PTP configuration for Silicom TimeSync (STS) cards is handled separately from StarlingX host PTP configuration and may result in configuration conflicts if both are used at the same time.
The sts-silicom application provides a dedicated phc2sys instance which synchronizes the local system clock to the Silicom TimeSync (STS) card. Users should ensure that phc2sys is not configured via StarlingX PTP Host Configuration when the sts-silicom application is in use.
Additionally, if StarlingX PTP Host Configuration is being used in parallel for non-STS NICs, users should ensure that all ptp4l instances do not use conflicting domainNumber values.
When the Silicom TimeSync (STS) card is configured in timing mode using the sts-silicom application, the card goes through an initialization process on application apply and server reboots. The ports will bounce up and down several times during the initialization process, causing network traffic disruption. Therefore, configuring the platform networks on the Silicom TimeSync (STS) card is not supported since it will cause platform instability.
Procedural Changes: N/A.
N3000 Image in the containerd cache¶
The StarlingX system without an N3000 image in the containerd cache fails to configure during a reboot cycle, and results in a failed / disabled node.
The N3000 device requires a reset early in the startup sequence. The reset is done by the n3000-opae image. The image is automatically downloaded on bootstrap and is expected to be in the cache to allow the reset to succeed. If the image is not in the cache for any reason, it cannot be downloaded because registry.local is not up yet at this point in the startup. This will result in the impacted host going through multiple reboot cycles and coming up in an enabled/degraded state. To avoid this issue:
Ensure that the docker filesystem is properly engineered to avoid the image being automatically removed by the system if flagged as unused. For instructions to resize the filesystem, see Increase Controller Filesystem Storage Allotments Using the CLI
Do not manually prune the N3000 image.
Procedural Changes: Use the procedure below.
Procedure
Lock the node.
~(keystone_admin)]$ system host-lock controller-0
Pull the required N3000 image into the containerd cache.
~(keystone_admin)]$ crictl pull registry.local:9001/docker.io/starlingx/n3000-opae:stx.8.0-v1.0.2
Unlock the node.
~(keystone_admin)]$ system host-unlock controller-0
Quartzville Tools¶
The celo64e and nvmupdate64e commands are not supported in StarlingX due to a known issue in the Quartzville tools that crashes the host.
Procedural Changes: Reboot the host using the boot screen menu.
ptp4l error “timed out while polling for tx timestamp” reported for NICs using the Intel ice driver¶
NICs using the Intel® ice driver may report the following error in the ptp4l logs, which results in a PTP port switching to FAULTY before re-initializing.
Note
PTP ports frequently switching to FAULTY may degrade the accuracy of the PTP timing.
ptp4l[80330.489]: timed out while polling for tx timestamp
ptp4l[80330.489]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
Note
This is due to a limitation with the Intel® ice driver: the driver cannot guarantee the time interval to return the timestamp to the ptp4l user space process, which results in the occasional timeout error message.
Procedural Changes: The change recommended by Intel is to increase the tx_timestamp_timeout parameter in the ptp4l configuration. The increased timeout value gives the ice driver more time to provide the timestamp to the ptp4l user space process. Timeout values of 50 ms and 700 ms have been validated; however, you can use a different value if it is more suitable for your system.
~(keystone_admin)]$ system ptp-instance-parameter-add <instance_name> tx_timestamp_timeout=700
~(keystone_admin)]$ system ptp-instance-apply
Note
The ptp4l timeout error log may also be caused by other underlying issues, such as NIC port instability. Therefore, it is recommended to confirm that the NIC port is stable before adjusting the timeout values.
Cert-manager accepts only short hand IPv6 addresses¶
Cert-manager accepts only short hand IPv6 addresses.
Procedural Changes: You must use the following rules when defining IPv6 addresses to be used by Cert-manager.
all letters must be in lower case
each group of hexadecimal values must not have any leading 0s (use :12: instead of :0012:)
the longest sequence of consecutive all-zero fields must be shorthanded with ::
:: must not be used to shorthand an IPv6 address with 7 groups of hexadecimal values; use :0: instead of ::
Note
Use the rules above to set the IPv6 address related to the management and OAM network in the Ansible bootstrap overrides file, localhost.yml.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: oidc-auth-apps-certificate
  namespace: test
spec:
  secretName: oidc-auth-apps-certificate
  dnsNames:
  - ahost.com
  ipAddresses:
  - fe80:12:903a:1c1a:e802::11e4
  issuerRef:
    name: cloudplatform-interca-issuer
    kind: Issuer
Deprecated Notices¶
Bare metal Ceph¶
Host-based Ceph will be deprecated in a future release. Adoption of Rook-Ceph is recommended for new deployments as some host-based Ceph deployments may not be upgradable.
No support for system_platform_certificate.subject_prefix¶
StarlingX 10.0 no longer supports system_platform_certificate.subject_prefix. This was an optional field used to add a prefix to further identify the certificate, for example, StarlingX.
Static Configuration for Hardware Accelerator Cards¶
Static configuration for hardware accelerator cards is deprecated in StarlingX 10.0 and will be discontinued in future releases. Use the FEC operator instead.
See Switch between Static Method Hardware Accelerator and SR-IOV FEC Operator
N3000 FPGA Firmware Update Orchestration¶
The N3000 FPGA Firmware Update Orchestration has been deprecated in StarlingX 10.0. For more information, see N3000 FPGA Overview.
show-certs.sh Script¶
The show-certs.sh script that is available when you ssh to a controller is deprecated in StarlingX 10.0.
The new response format of the ‘system certificate-list’ RESTAPI / CLI now provides the same information as provided by show-certs.sh.
Kubernetes APIs¶
Kubernetes APIs that will be removed in K8s 1.25 are listed at the following link:
See: https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-25
ptp-notification v1 API¶
The ptp-notification v1 API can still be used in StarlingX 10.0. The v1 API will be removed in a future release and only the O-RAN Compliant Notification API (ptp-notification v2 API) will be supported.
Note
It is recommended that all new deployments use the O-RAN Compliant Notification API (ptp-notification v2 API).
Removed in Stx 10.0¶
kube-ignore-isol-cpus is no longer supported in StarlingX 10.0.
Pod Security Policy¶
Pod Security Policy (PSP) is removed in StarlingX 10.0 and K8s v1.25 and ONLY applies if running on K8s v1.24 or earlier. Instead of using Pod Security Policy, you can enforce similar restrictions on Pods using Pod Security Admission Controller (PSAC) supporting K8s v1.25.
Note
Although StarlingX 10.0 still supports K8s v1.24, which supports PSP, StarlingX 10.0 has removed the default PSP policies, roles, and role-bindings that made PSP usable in StarlingX. It is important to note that StarlingX 10.0 officially does NOT support the use of PSP in its Kubernetes deployment.
Important
Upgrades
PSP should be removed from hosted applications and converted to the PSA Controller before upgrading to StarlingX 10.0.
System certificate CLI Commands¶
The following commands are removed in StarlingX 10.0 and have been replaced as follows:
system certificate-install -m ssl <pemfile>
has been replaced by an automatically installed ‘system-restapi-gui-certificate’ CERTIFICATE (in the ‘deployment’ namespace), which can be modified using the ‘update_platform_certificates’ Ansible playbook
system certificate-install -m openstack <pemfile>
has been replaced by ‘system os-certificate-install <pemfile>’
system certificate-install -m docker_registry <pemfile>
has been replaced by an automatically installed ‘system-registry-local-certificate’ CERTIFICATE (in the ‘deployment’ namespace), which can be modified using the ‘update_platform_certificates’ Ansible playbook
system certificate-install -m ssl_ca <pemfile> and system certificate-uninstall -m ssl_ca <pemfile>
have been replaced by:
'system ca-certificate-install <pemfile>'
'system ca-certificate-uninstall <uuid>'
Appendix A - Commands replaced by USM for Updates (Patches) and Upgrades¶
Manually Managing Software Patches¶
The sudo sw-patch commands for manually managing software patches have been replaced by software commands as listed below.
The following commands for manually managing software patches are no longer supported:
sw-patch upload <patch file>
sw-patch upload-dir <directory with patch files>
sw-patch query
sw-patch show <patch-id>
sw-patch apply <patch-id>
sw-patch query-hosts
sw-patch host-install <hostname>
sw-patch host-install-async <hostname>
sw-patch remove <patch-id>
sw-patch delete <patch-id>
sw-patch what-requires <patch-id>
sw-patch query-dependencies <patch-id>
sw-patch is-applied <patch-id>
sw-patch is-available <patch-id>
sw-patch install-local
sw-patch drop-host <hostname-or-id>
sw-patch commit <patch-id>
Software patching is now manually managed by the software commands described in the Manual Host Software Deployment procedure; an example command sequence follows the list below.
software upload <patch file>
software upload-dir <directory with patch files>
software list
software delete <patch-release-id>
software show <patch-release-id>
software deploy precheck <patch-release-id>
software deploy start <patch-release-id>
software deploy show
software deploy host <hostname>
software deploy host-rollback <hostname>
software deploy localhost
software deploy host-list
software deploy activate
software deploy complete
software deploy delete
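For example, a typical manual patch deployment could run the commands in the following order; the patch release ID and hostname are placeholders, and reboot-required patches also involve locking and unlocking the host as described in the referenced procedure:
~(keystone_admin)]$ software upload <patch file>
~(keystone_admin)]$ software deploy precheck <patch-release-id>
~(keystone_admin)]$ software deploy start <patch-release-id>
~(keystone_admin)]$ software deploy host <hostname>
~(keystone_admin)]$ software deploy activate
~(keystone_admin)]$ software deploy complete
~(keystone_admin)]$ software deploy delete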
Manual Software Upgrades¶
The system load-delete/import/list/show, system upgrade-start/show/activate/abort/abort-complete/complete, and system host-upgrade/upgrade-list/downgrade commands for manually managing software upgrades have been replaced by software commands.
The following commands for manually managing software upgrades are no longer supported:
system load-import <major-release-ISO> <ISO-signature-file>
system load-list
system load-show <load-id>
system load-delete <load-id>
system upgrade-start
system upgrade-show
system host-upgrade <hostname>
system host-upgrade-list
system upgrade-activate
system upgrade-complete
system upgrade-abort
system host-downgrade <hostname>
system upgrade-abort-complete
Software upgrade is now manually managed by the software commands described in the Manual Host Software Deployment procedure.
software upload <patch file>
software upload-dir <directory with patch files>
software list
software delete <patch-release-id>
software show <patch-release-id>
software deploy precheck <patch-release-id>
software deploy start <patch-release-id>
software deploy show
software deploy host <hostname>
software deploy localhost
software deploy host-list
software deploy activate
software deploy complete
software deploy delete
software deploy abort
software deploy host-rollback <hostname>
software deploy activate-rollback
Orchestration of Software Patches¶
The sw-manager patch-strategy-create/apply/show/abort/delete commands for managing the orchestration of software patches have been replaced by sw-manager sw-deploy-strategy-create/apply/show/abort/delete commands.
The following commands for managing the orchestration of software patches are no longer supported:
sw-manager patch-strategy create … <options> …
sw-manager patch-strategy show
sw-manager patch-strategy apply
sw-manager patch-strategy abort
sw-manager patch-strategy delete
Orchestrated software patching is now managed by the sw-manager sw-deploy-strategy-create/apply/show/abort/delete commands described in the Orchestrated Deployment Host Software Deployment procedure; an example sequence follows the list below.
sw-manager sw-deploy-strategy create <patch-release-id> … <options> …
sw-manager sw-deploy-strategy show
sw-manager sw-deploy-strategy apply
sw-manager sw-deploy-strategy abort
sw-manager sw-deploy-strategy delete
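For example, an orchestrated patch deployment might be driven as follows; the patch release ID is a placeholder and additional create options are omitted:
~(keystone_admin)]$ sw-manager sw-deploy-strategy create <patch-release-id>
~(keystone_admin)]$ sw-manager sw-deploy-strategy show
~(keystone_admin)]$ sw-manager sw-deploy-strategy apply
~(keystone_admin)]$ sw-manager sw-deploy-strategy delete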
Orchestration of Software Upgrades¶
The sw-manager upgrade-strategy-create/apply/show/abort/delete commands for managing the orchestration of software upgrades have been replaced by sw-manager sw-deploy-strategy-create/apply/show/abort/delete commands.
The following commands for managing the orchestration of software upgrades are no longer supported:
sw-manager upgrade-strategy create … <options> …
sw-manager upgrade-strategy show
sw-manager upgrade-strategy apply
sw-manager upgrade-strategy abort
sw-manager upgrade-strategy delete
Orchestrated software upgrade is now managed by the sw-manager sw-deploy-strategy-create/apply/show/abort/delete commands described in the Orchestrated Deployment Host Software Deployment procedure.
sw-manager sw-deploy-strategy create <major-release-id> … <options> …
sw-manager sw-deploy-strategy show
sw-manager sw-deploy-strategy apply
sw-manager sw-deploy-strategy abort
sw-manager sw-deploy-strategy delete
Release Information for other versions¶
You can find details about a release on the specific release page.
Version | Release Date | Notes | Status
StarlingX R10.0 | 2025-02 | https://docs.starlingx.io/r/stx.10.0/releasenotes/index.html | Maintained
StarlingX R9.0 | 2024-03 | | Maintained
StarlingX R8.0 | 2023-02 | | Maintained
StarlingX R7.0 | 2022-07 | | Maintained
StarlingX R6.0 | 2021-12 | | Maintained
StarlingX R5.0.1 | 2021-09 | | EOL
StarlingX R5.0 | 2021-05 | | EOL
StarlingX R4.0 | 2020-08 | | EOL
StarlingX R3.0 | 2019-12 | | EOL
StarlingX R2.0.1 | 2019-10 | | EOL
StarlingX R2.0 | 2019-09 | | EOL
StarlingX R1.0 | 2018-10 | | EOL
StarlingX follows the release maintenance timelines in the StarlingX Release Plan.
The Status column uses OpenStack maintenance phase definitions.