KB2149592 - ESXi IO Connectivity Issues or PSOD with VT-d Interrupt Remapper Disabled


 

What is the VT-d interrupt remapper?

ESXi/ESX 4.1 and later versions include interrupt remapping code that is enabled by default. Intel introduced this technology to route IRQs more efficiently and to improve the performance and security of VMs. The interrupt-remapping feature lets the VMM isolate interrupts to the CPUs assigned to a given VM and remap or reroute interrupts from physical I/O devices. When enabled, it helps ensure efficient migration of interrupts across CPUs.

 

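The interrupt remapper is controlled by the ESXi kernel setting iovDisableIR. The logic is inverted: TRUE disables the remapper and FALSE enables it. A change to this setting takes effect after the host is rebooted.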

 

To show the current setting, use the following ESXCLI command:

esxcli system settings kernel list -o iovDisableIR

 

To enable the remapper:

esxcli system settings kernel set --setting=iovDisableIR -v FALSE

 

To disable the remapper:

esxcli system settings kernel set --setting=iovDisableIR -v TRUE
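
Because the setting is named for disabling the remapper, its value reads inverted: TRUE means the remapper is off. The following is a minimal sketch of a check that turns the output of the list command into a plain enabled/disabled answer. The function names kernel_setting_column and remapper_state are hypothetical, and the parsing assumes that esxcli prints a fixed-width table with a header row containing Configured and Runtime columns and that awk is available in the shell. The exact layout may differ between releases, so verify it on your own hosts.

```shell
#!/bin/sh
# Hypothetical helper: reads the table printed by
#   esxcli system settings kernel list -o iovDisableIR
# on stdin and prints the value found under the named column.
# Assumption: the output is a fixed-width table whose header names the column.
kernel_setting_column() {
    awk -v col="$1" '
        !start { start = index($0, col); next }   # header row: remember the column offset
        /^-/   { next }                           # skip the dashed separator row
               { split(substr($0, start), f, " "); print f[1]; exit }
    '
}

# Translate the iovDisableIR value into the state of the remapper itself.
remapper_state() {
    case "$1" in
        TRUE|true)   echo "disabled" ;;
        FALSE|false) echo "enabled" ;;
        *)           echo "unknown" ;;
    esac
}

# Usage on an ESXi host (shown as comments because it needs esxcli):
#   value=$(esxcli system settings kernel list -o iovDisableIR | kernel_setting_column Runtime)
#   echo "Interrupt remapper is $(remapper_state "$value")"
```

Comparing the Configured and Runtime columns is a simple way to spot a host where the setting was changed but the reboot has not happened yet.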

 

Over the years, there have been several problems related to this setting. Provided below are multiple related Knowledge Base articles:

 

vHBAs and other PCI devices may stop responding in ESXi 6.0.x, ESXi 5.x and ESXi/ESX 4.1 when using Interrupt Remapping (1030265)

 

Possible symptoms:

  • ESXi hosts are non-responsive
  • Virtual machines are non-responsive
  • HBAs stop responding
  • Other PCI devices stop responding
  • You may receive the Degraded path for an Unknown Device alerts in vCenter Server

 

VMware advised disabling the remapper, that is, setting the iovDisableIR parameter to TRUE.

 

A couple of years ago, when I worked as a system engineer, we always disabled it on ESXi servers before putting them into production.

 

On to the next article:

 

ESXi host fails with a diagnostic screen due to an Intel Virtualization Technology Erratum (2147325)

 

This article reconfirmed that the VT-d interrupt remapper should be disabled on multiple Intel processors:

 

Intel Xeon Processor 55xx Series

Intel Xeon Processor 56xx Series

Intel Xeon Processor 65xx Series

Intel Xeon Processor 75xx Series

Intel Xeon Processor E5-1400 v2 Product Family

Intel Xeon Processor E5-1600 v2 Product Family

Intel Xeon Processor E5-1600 v3 Product Family

Intel Xeon Processor E5-2400 Product Family

Intel Xeon Processor E5-2400 v2 Product Family

Intel Xeon Processor E5-2600 Product Family

Intel Xeon Processor E5-2600 v2 Product Family

Intel Xeon Processor E5-2600 v3 Product Family

Intel Xeon Processor E5-2600 v4 Product Family

Intel Xeon Processor E5-4600 Product Family

Intel Xeon Processor E5-4600 v2 Product Family

Intel Xeon Processor E5-4600 v3 Product Family

Intel Xeon Processor E5-4600 v4 Product Family

Intel Xeon Processor E7-2800 Product Family

Intel Xeon Processor E7-4800 Product Family

Intel Xeon Processor E7-8800 Product Family

Intel Xeon Processor E7-8800/4800/2800 v2 Product Families

Intel Xeon Processor E7-8800/4800 v3 Product Families

Intel Xeon Processor E7-8800/4800 v4 Product Families

 

VMware even made the disabled remapper (iovDisableIR set to TRUE) the default in the following versions, listed with release date and build number:

 

ESXi 5.5 Patch 10 2016-12-20 4722766
ESXi 5.5 Express Patch 11 2017-03-28 5230635
ESXi 6.0 Patch 4 2016-11-22 4600944
ESXi 6.0 Update 3 2017-02-24 5050593
ESXi 6.5 GA 2016-11-15 4564106
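
To check a host against this list, the build number can be pulled from the version string and compared with the builds above. This is only a sketch: the function names are hypothetical, the match is exact (builds released between the listed ones are not covered), and it assumes the version string has the form that `vmware -v` prints, for example "VMware ESXi 6.0.0 build-5050593". Note that build 5230635 also appears in this article's later list of versions that reinstated remapping, so treat a match on it as inconclusive and check the setting on the host directly.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the build number is one of those the
# article lists as shipping with the remapper disabled by default.
# Build numbers are copied from the list above; exact matches only.
remapper_disabled_by_default() {
    case "$1" in
        4722766|5230635|4600944|5050593|4564106) return 0 ;;
        *) return 1 ;;
    esac
}

# Extract the build number from a version string such as
# "VMware ESXi 6.0.0 build-5050593". Prints nothing if no build is found.
build_from_version() {
    printf '%s\n' "$1" | sed -n 's/.*build-\([0-9][0-9]*\).*/\1/p'
}

# Usage on an ESXi host (shown as comments because it needs the vmware command):
#   build=$(build_from_version "$(vmware -v)")
#   if remapper_disabled_by_default "$build"; then
#       echo "Build $build is on the disabled-by-default list"
#   fi
```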

 

But this, in turn, caused a new PSOD issue, affecting HP ProLiant Gen8 servers:

ESXi host fails with intermittent NMI PSOD on HP ProLiant Gen8 servers (2149043)

HPE also covered this in Customer Advisory c05392947.

 

ESXi IO connectivity issues or PSOD with VT-d interrupt remapper disabled (2149592)

 

VMware reinstated interrupt remapping as the default in the following ESXi versions:

 

ESXi 5.5 Express Patch 11 2017-03-28 5230635
ESXi 6.0 Update 3a (ESXi 6.0 Patch 5) 2017-07-11 5572656
ESXi 6.5 Patch 01 2017-03-09 5146846

 

Interrupt remapping is also enabled by default on:

 

ESXi 6.7 GA 2018-04-17 8169922

 

As you can see, even a single parameter can lead to deeper problems, whichever value it is set to.

 

Use Runecast Analyzer to verify if your specific ESXi hosts are affected by this parameter. It shows if your servers are vulnerable to a specific problem, and why:

 

[Screenshot: ESXi host fails with intermittent NMI PSOD]

 

[Screenshot: ESXi IO connectivity issues]

 

At Runecast, we are constantly updating the automatic checks based on Knowledge Base articles, Best Practices and Security Hardening Guides. These updates combined with automatic monitoring ensure that your vSphere environment can be continuously protected using the latest industry knowledge.

 

Constantin Ivanov

Head of R&D


09-06-2018 00:00

