PCI Passthrough¶
Introduction¶
PCI passthrough allows a compute node to pass a physical PCI device to a hosted VM, giving the VM direct access to that device, for example a GPU or a physical network interface. The OpenStack charms fully support this feature. This document covers the deployment and configuration of the feature using the OpenStack charms, Juju and MAAS. It is worth familiarizing yourself with the generic OpenStack Nova documentation at https://docs.openstack.org/nova/latest/admin/pci-passthrough.html.
Example Hardware¶
This document assumes a MAAS environment in which each physical compute host has two GPUs. In this case they are Quadro P5000 cards with PCI addresses 83:00.0 and 84:00.0 and a vendor and product ID of 10de:1bb0.
lspci output:
lspci -nn | grep VGA
83:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1bb0] (rev a1) (prog-if 00 [VGA controller])
84:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1bb0] (rev a1) (prog-if 00 [VGA controller])
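The PCI addresses and the vendor and product ID are needed again in later steps, so it can help to extract them mechanically. The following is a minimal sketch, not part of the charms or MAAS: the function name list_gpus is hypothetical, and the sample input is simply the lspci output shown above.

```shell
#!/bin/sh
# Print "<pci-address> <vendor>:<product>" for every VGA controller found in
# `lspci -nn` style output read from stdin. list_gpus is a hypothetical helper.
list_gpus() {
    grep 'VGA compatible controller' |
        sed -n 's/^\([0-9a-f:.]*\) .*\[\([0-9a-f]\{4\}:[0-9a-f]\{4\}\)].*$/\1 \2/p'
}

# Sample input copied from the lspci output above. On a real host use:
#   lspci -nn | list_gpus
sample='83:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1bb0] (rev a1) (prog-if 00 [VGA controller])
84:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1bb0] (rev a1) (prog-if 00 [VGA controller])'

gpus=$(printf '%s\n' "$sample" | list_gpus)
printf '%s\n' "$gpus"
```

The class code in the first pair of brackets ([0300]) is skipped because the pattern only accepts a bracketed pair of four-digit hexadecimal values separated by a colon.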
Infrastructure¶
Note
To pass through PCI devices, IOMMU must be enabled on the hardware. Depending on the hardware vendor (Intel or AMD), enable the virtualisation feature in the BIOS and set the corresponding kernel parameter (intel_iommu or amd_iommu) as described below.
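One common heuristic for checking whether the IOMMU is active on a running host is to look for entries under /sys/kernel/iommu_groups, which normally stays empty when the IOMMU is disabled. The sketch below assumes that behaviour. The function name iommu_group_count is hypothetical, and its optional directory argument exists only so the function can be exercised against a test directory.

```shell
#!/bin/sh
# Count the entries under an IOMMU groups directory (default: the sysfs path).
# iommu_group_count is a hypothetical helper, not an existing tool.
iommu_group_count() {
    dir="${1:-/sys/kernel/iommu_groups}"
    if [ -d "$dir" ]; then
        find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
    else
        echo 0
    fi
}

if [ "$(iommu_group_count)" -gt 0 ]; then
    echo "IOMMU appears to be enabled"
else
    echo "No IOMMU groups found; check the BIOS settings and kernel parameters"
fi
```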
Blacklisting the PCI device¶
There are certain cases, particularly for GPUs, where we must blacklist the PCI device on the physical host so that the host's kernel does not load a driver for it. In this example, that means stopping the physical host's kernel from loading the NVIDIA or nouveau drivers for the GPU. The host must be able to cleanly pass the device to libvirt.
Most PCI devices can be hot unbound and then bound via vfio-pci or the pci-stub legacy method and therefore do not require blacklisting. However, GPU devices do require blacklisting. This documentation covers the GPU case. The blacklisting documentation is based on the following: https://www.pugetsystems.com/labs/articles/Multiheaded-NVIDIA-Gaming-using-Ubuntu-14-04-KVM-585/
MAAS¶
In addition to selecting machines, MAAS tags can set kernel boot parameters. The tag therefore serves two purposes: to select the physical hosts which have the PCI devices installed, and to pass the kernel parameters that blacklist the PCI device, stopping the kernel from loading a driver for it.
maas $MAAS_PROFILE tags create name="pci-pass" comment="PCI Passthrough kernel parameters" kernel_opts="intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1"
Note
A MAAS host with more than one tag that sets kernel parameters will execute only one set. Good practice is to only have one kernel parameter setting tag on any given host to avoid confusion.
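Because only one set of kernel parameters is applied, it can be worth confirming on a deployed host that the expected options actually reached the kernel command line. The following is a minimal sketch; cmdline_has is a hypothetical helper, and it takes the file to inspect as its first argument so that it can be tested against something other than /proc/cmdline.

```shell
#!/bin/sh
# Succeed only if every given option appears as a whole word in the kernel
# command line file passed as the first argument. Hypothetical helper.
cmdline_has() {
    # Pad with spaces so that options are matched as whole words only.
    cmdline=" $(tr '\n' ' ' < "$1") "
    shift
    for opt in "$@"; do
        case "$cmdline" in
            *" $opt "*) ;;
            *) echo "missing kernel option: $opt" >&2; return 1 ;;
        esac
    done
}

if [ -r /proc/cmdline ]; then
    if cmdline_has /proc/cmdline intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1; then
        echo "kernel parameters from the MAAS tag are active"
    else
        echo "kernel parameters from the MAAS tag are not all active"
    fi
fi
```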
In Target Changes¶
The MAAS tag and kernel parameters are only half of the blacklisting solution. The host itself must also enforce the blacklisting, which requires setting up some configuration at deploy time. There are a few ways to accomplish this. This document uses cloud-init configuration at the Juju model level.
For other options to customize the deployed host see: https://docs.maas.io/2.5/en/nodes-custom
Juju can customize the host configuration using model level configuration for cloudinit-userdata. We will leverage this feature to do the final configuration of the physical host.
Note
Model level cloudinit userdata will get executed on all “machines” in a model. This includes physical hosts and containers. Take care to ensure any customization scripts will work in all cases.
Determine the vendor and product ID of the PCI device from lspci. In our case the GPUs' ID is 10de:1bb0.
Determine the PCI addresses of the devices from lspci. In our case the GPUs are at 83:00.0 and 84:00.0.
Determine which drivers to blacklist. This information can be found in the “Kernel modules” field of the lspci output (shown by lspci -k). In our case the drivers are nvidiafb and nouveau.
Base64 encode an /etc/default/grub with the GRUB_CMDLINE_LINUX_DEFAULT line set:
echo 'GRUB_DEFAULT=0
GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1 modprobe.blacklist=nvidiafb,nouveau"
GRUB_CMDLINE_LINUX=""' | base64
Base64 encode the following /etc/rc.local script with the correct PCI device addresses. In our case 0000:83:00.0 and 0000:84:00.0.
echo '#!/bin/sh
# Unbind a PCI device from its current driver, if any, and hand it to vfio-pci.
vfiobind() {
    dev="$1"
    vendor=$(cat /sys/bus/pci/devices/$dev/vendor)
    device=$(cat /sys/bus/pci/devices/$dev/device)
    if [ -e /sys/bus/pci/devices/$dev/driver ]; then
        echo $dev > /sys/bus/pci/devices/$dev/driver/unbind
    fi
    echo $vendor $device > /sys/bus/pci/drivers/vfio-pci/new_id
}
vfiobind 0000:83:00.0
vfiobind 0000:84:00.0' | base64
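GNU base64 wraps its output at 76 columns by default, so longer files produce several lines of output, which is awkward to paste into a single YAML value. The sketch below shows one way to produce a single-line encoding and confirm that it decodes back to the original. The function name encode_file and the temporary file names are hypothetical, and the file content is a shortened stand-in for the real /etc/default/grub.

```shell
#!/bin/sh
# Encode a file as one base64 line by stripping the line wrapping.
# encode_file is a hypothetical helper.
encode_file() {
    base64 < "$1" | tr -d '\n'
}

workdir=$(mktemp -d)
# Stand-in content; use the full file from the steps above in practice.
printf '%s\n' 'GRUB_DEFAULT=0' \
    'GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1 modprobe.blacklist=nvidiafb,nouveau"' \
    > "$workdir/grub"

encoded=$(encode_file "$workdir/grub")

# Decode again and compare with the source before using the value.
printf '%s' "$encoded" | base64 --decode > "$workdir/grub.decoded"
if cmp -s "$workdir/grub" "$workdir/grub.decoded"; then
    echo "round trip OK (${#encoded} characters on one line)"
else
    echo "round trip FAILED" >&2
fi
```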
Take all the information from above to create the following YAML. Set the pci_stub ids value, $BASE64_ENCODED_RC_LOCAL and $BASE64_ENCODED_DEFAULT_GRUB correctly.
cloudinit-userdata: |
  postruncmd:
    - "update-initramfs -u > /root/initramfs-update.log"
    - "update-grub > /root/grub-update.log"
  write_files:
    - path: /etc/initramfs-tools/modules
      content: pci_stub ids=10de:1bb0
    - path: /etc/modules
      content: |
        pci_stub
        vfio
        vfio_iommu_type1
        vfio_pci
    - path: /etc/rc.local
      encoding: b64
      permissions: '0755'
      content: $BASE64_ENCODED_RC_LOCAL
    - path: /etc/default/grub
      encoding: b64
      content: $BASE64_ENCODED_DEFAULT_GRUB
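Pasting long base64 strings by hand is error prone, so the two placeholders can instead be filled in by a small script. This is only a sketch under stated assumptions: fill_userdata is a hypothetical helper, the template is assumed to contain the literal placeholder names used above, and the demo files are throwaway stand-ins rather than the real rc.local and grub files.

```shell
#!/bin/sh
# Usage: fill_userdata TEMPLATE RC_LOCAL_FILE GRUB_FILE > cloudinit-userdata.yaml
# Replaces the two placeholders with single-line base64 copies of the files.
fill_userdata() {
    rc_b64=$(base64 < "$2" | tr -d '\n')
    grub_b64=$(base64 < "$3" | tr -d '\n')
    # "|" is a safe sed delimiter because it never occurs in base64 output.
    sed -e "s|[\$]BASE64_ENCODED_RC_LOCAL|$rc_b64|" \
        -e "s|[\$]BASE64_ENCODED_DEFAULT_GRUB|$grub_b64|" "$1"
}

# Demonstration with stand-in files and a fragment of the YAML above.
demo=$(mktemp -d)
printf '%s\n' '#!/bin/sh' 'echo placeholder rc.local' > "$demo/rc.local"
printf '%s\n' 'GRUB_DEFAULT=0' > "$demo/grub"
cat > "$demo/template.yaml" <<'EOF'
    - path: /etc/rc.local
      encoding: b64
      permissions: '0755'
      content: $BASE64_ENCODED_RC_LOCAL
    - path: /etc/default/grub
      encoding: b64
      content: $BASE64_ENCODED_DEFAULT_GRUB
EOF

rendered=$(fill_userdata "$demo/template.yaml" "$demo/rc.local" "$demo/grub")
printf '%s\n' "$rendered"
```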
Create the juju model with cloudinit-userdata set with this YAML:
juju add-model openstack-deployment --config cloudinit-userdata.yaml
For further cloud init documentation for customization see: https://cloudinit.readthedocs.io/en/latest/topics/examples.html
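Once a host has been deployed with this configuration, it is useful to confirm that each GPU ended up bound to vfio-pci rather than to nvidiafb or nouveau. The following is a minimal sketch that reads the driver symlink sysfs exposes for each device. The function name pci_driver is hypothetical, and its optional second argument (an alternative sysfs root) exists only to make the function testable.

```shell
#!/bin/sh
# Print the kernel driver currently bound to a PCI device, or "none".
# Usage: pci_driver <pci-address> [sysfs-root]   (hypothetical helper)
pci_driver() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo none
    fi
}

# Addresses from this document's example hardware.
for dev in 0000:83:00.0 0000:84:00.0; do
    echo "$dev -> $(pci_driver "$dev")"
done
```

On a correctly configured host the expectation would be vfio-pci for both devices.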
Deploy OpenStack¶
At this point we are ready to deploy OpenStack using the OpenStack charms with Juju and MAAS. The charm deployment guide already documents this process. The only additional settings required are the PCI passthrough options shown below.
Manually:
juju config nova-cloud-controller pci-alias='{"vendor_id":"10de", "product_id":"1bb0", "name":"gpu"}'
juju config nova-cloud-controller scheduler-default-filters="RetryFilter,AvailabilityZoneFilter,CoreFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,DifferentHostFilter,SameHostFilter,AggregateInstanceExtraSpecsFilter,PciPassthroughFilter"
juju config nova-compute pci-alias='{"vendor_id":"10de", "product_id":"1bb0", "name":"gpu"}'
juju config nova-compute pci-passthrough-whitelist='{ "vendor_id": "10de", "product_id": "1bb0" }'
# If passing through a GPU use spice for console which creates a usable VGA device for the VMs
juju config nova-cloud-controller console-access-protocol=spice
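The same vendor ID, product ID and alias name appear in several of these options, and they must agree. One way to avoid drift is to build the JSON values from shell variables. The sketch below only prints the resulting juju config commands for review rather than running them, and the variable names are hypothetical.

```shell
#!/bin/sh
# Values from this document's example hardware; adjust for other devices.
vendor_id=10de
product_id=1bb0
alias_name=gpu

# Build the JSON strings once so every option uses identical values.
pci_alias=$(printf '{"vendor_id":"%s", "product_id":"%s", "name":"%s"}' \
    "$vendor_id" "$product_id" "$alias_name")
pci_whitelist=$(printf '{ "vendor_id": "%s", "product_id": "%s" }' \
    "$vendor_id" "$product_id")

# Print the commands instead of executing them.
echo "juju config nova-cloud-controller pci-alias='$pci_alias'"
echo "juju config nova-compute pci-alias='$pci_alias'"
echo "juju config nova-compute pci-passthrough-whitelist='$pci_whitelist'"
```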
Alternatively, update the OpenStack bundle. Example bundle snippet:
machines:
  '0':
    series: bionic
    # Use the MAAS tag pci-pass for hosts with the PCI device installed.
    constraints: tags=pci-pass
  '1':
    series: bionic
    # Use the inverse (NOT) ^pci-pass tag for hosts without the PCI device.
    constraints: tags=^pci-pass
applications:
  nova-compute:
    charm: cs:nova-compute
    num_units: 1
    options:
      pci-alias: '{"vendor_id":"10de", "product_id":"1bb0", "name":"gpu"}'
      pci-passthrough-whitelist: '{ "vendor_id": "10de", "product_id": "1bb0" }'
    to:
    - '0'
  nova-cloud-controller:
    charm: cs:nova-cloud-controller
    num_units: 1
    options:
      pci-alias: '{"vendor_id":"10de", "product_id":"1bb0", "name":"gpu"}'
      scheduler-default-filters: "RetryFilter,AvailabilityZoneFilter,CoreFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,DifferentHostFilter,SameHostFilter,AggregateInstanceExtraSpecsFilter,PciPassthroughFilter"
      console-access-protocol: spice
    to:
    - lxd:1
Post Deployment¶
Create a flavor. Set the pci_passthrough:alias property to the alias name set above (in our case gpu) followed by the number of devices to pass through (in this case 1).
openstack flavor create --ram 8192 --disk 100 --vcpus 8 m1.gpu
openstack flavor set m1.gpu --property "pci_passthrough:alias"="gpu:1"
Boot an instance with the PCI device passed through. Use the flavor just created:
openstack server create --key-name $KEY --image $IMAGE --nic net-id=$NETWORK --flavor m1.gpu gpu-enabled-vm
SSH onto the VM and run lspci to see the PCI device in the VM. In our case the NVIDIA 1bb0.
$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Red Hat, Inc. QXL paravirtual graphic card (rev 04)
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
00:04.0 Communication controller: Red Hat, Inc Virtio console
00:05.0 SCSI storage controller: Red Hat, Inc Virtio block device
00:06.0 VGA compatible controller: NVIDIA Corporation Device 1bb0 (rev a1)
00:07.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon
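Boot an instance without a PCI device passed through. Use any flavor without the pci_passthrough:alias property set. The PciPassthroughFilter only restricts scheduling for instances that request PCI devices, so this instance can be placed on any suitable host.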
openstack server create --key-name $KEY --image $IMAGE --nic net-id=$NETWORK --flavor m1.medium no-gpu-vm