How To Deal With PSOD - The Purple Screen of Death


 

Contents

What is it? 

Why does it happen?

What's the impact?

 

What to do when it happens?

How to prevent it?

 

What is it?

PSOD stands for Purple Screen of Diagnostics, commonly called the Purple Screen of Death by analogy with the better-known Blue Screen of Death on Microsoft Windows.

It’s a diagnostic screen displayed by VMware ESXi when the kernel detects a fatal error from which it either cannot safely recover or cannot continue running without a much higher risk of major data loss.

It shows the memory state at the time of the crash, along with additional details that are important for troubleshooting the cause: ESXi version and build, exception type, register dump, backtrace, server uptime, error messages and information about the core dump (a file generated after the error, containing further diagnostic information).

This screen is visible on the console of the server. To see it, you will need to either be in the datacenter and connect a monitor, or connect remotely using the server’s out-of-band management (iLO, iDRAC, IMM… depending on your vendor).

DID YOU KNOW?
The screen is referred to as either Purple or Pink, but in fact the color is Dark Magenta (RGB: 171, 0, 171 | CMYK: 0.00, 1.00, 0.00, 0.33).

Why does it happen? 

The PSOD is a kernel panic. Even though we all know that ESXi is not based on UNIX, the panic implementation fits the UNIX definition. The ESXi kernel (vmkernel) triggers this safety measure in response to an unrecoverable event or error, when continuing to run would pose a high risk to the services and VMs. To put it simply: when the ESXi host feels it has become corrupted, it commits “seppuku” and, while bleeding its purple blood, writes a suicide letter detailing why it did it!

The most common causes for a PSOD are:
1. Hardware failures, mostly RAM or CPU related. They normally throw an “MCE” or “NMI” error.

  • “MCE” - Machine Check Exception, a mechanism within the CPU for detecting and reporting hardware issues. The codes displayed on the purple screen carry important details for identifying the root cause.
  • “NMI” - Non-Maskable Interrupt, a hardware interrupt that cannot be ignored by the processor. Since an NMI is a very important message about a hardware failure, the default response starting with ESXi 5.0 is to trigger a PSOD. Earlier versions just logged the error and continued. As with MCEs, a purple screen caused by an NMI provides codes that are crucial for troubleshooting.

2. Software bugs

3. Misbehaving drivers: bugs in drivers that try to access an incorrect index or a non-existent method (e.g. KB2146526, KB2148123)

DID YOU KNOW? 
You can even trigger a PSOD manually, for testing purposes or if you are just curious to see it happen.
Log in to the ESXi host via DCUI or SSH with a privileged account and run:

vsish -e set /reliability/crashMe/Panic 1

Obviously a test system is recommended, ideally a virtual nested ESXi so you can easily observe the console. Also make sure you finish reading this article to understand the implications of this action and the effect on your test system. 

What’s the impact?

When the panic occurs and the host crashes, it terminates all the services running on it together with all the virtual machines it hosts. The VMs are not gracefully shut down, but abruptly powered off. If the host is part of a cluster and you’ve configured HA, these VMs will be started on the other hosts in the cluster. Besides the outage while the VMs are down, some critical applications like database servers, message queues or backup jobs may be affected by the “dirty” shutdown.

Additionally, all other services provided by the host will be terminated, so if your host is a member of a vSAN cluster, a PSOD will impact vSAN as well.

For me, the most troublesome aspect of a PSOD is that it makes you lose trust in your infrastructure, and the anxiety that creates lasts at least until you get to the bottom of it. OK, you can recover by rebooting, and you may have HA or even FT so the impact may not be devastating… but until you solve the root cause, the thought that this can happen again, or on another server, can keep you up at night.

What to do when it happens?

1. Analyze the purple screen message
One of the most important things to do when you have a PSOD is to take a screenshot. If you are connecting to the console remotely (IMM, iLO, iDRAC, ...), taking a screenshot is easy, but if you have to go to the datacenter, you may need to literally take out your phone and snap a picture of the screen. There’s a lot of useful information about the cause of the crash on that screen.


 

2. Contact VMware support
Before you start further investigation and troubleshooting, it is advisable to contact VMware support, if you have a support contract. In parallel with your own investigation, they will be able to assist you with the Root Cause Analysis (RCA).

3. Reboot the affected ESXi host
To recover the server you will need to reboot it. I would also advise keeping it in maintenance mode until you have performed the full RCA, identified the cause and fixed it. If you can’t afford to keep it in maintenance mode, at least fine-tune your DRS rules so that only unimportant VMs run on it, so that if another PSOD hits, the impact will be minimal.

4. Get the core dump
After the server boots up you should collect the core dump. The core dump, also called vmkernel-zdump, is a file containing logs with similar but more detailed information than what is shown on the purple diagnostic screen, and it will be used in further troubleshooting. Even if the cause of the crash seems obvious from the PSOD message that you analyzed in step 1, it is advisable to confirm it by looking at the logs from the core dump.

Depending on your configuration you may have the core dump in one of these forms:

a. On the scratch partition
b. As a .dump file on one of the host’s datastores
c. As a .dump file on the vCenter - through the netdump service

 

The core dump becomes especially important if the host is configured to reset automatically after a PSOD, in which case you will not get to see the message on screen.

You can copy the dump file off the ESXi host using SCP and then open it in a text editor (like Notepad++). It holds the contents of memory at the time of the crash, and its first part contains the messages you saw on the purple screen. VMware support may request the whole file, but you can also extract just the vmkernel log, which is a bit more … digestible.
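If a text editor struggles with the binary parts of the file, the readable portion can be pulled out programmatically. The following Python sketch is only an illustration, assuming (as described above) that the start of the copied dump file contains readable text; the function names, the 1 MiB read size and the minimum string length are arbitrary choices of mine, not part of any VMware tooling.

```python
import re


def read_head(path, size=1 << 20):
    """Read only the first `size` bytes of a (possibly very large) dump file."""
    with open(path, "rb") as handle:
        return handle.read(size)


def printable_strings(data, min_length=8):
    """Return runs of printable ASCII at least `min_length` characters long.

    This mimics what the Unix `strings` utility does: binary noise is
    skipped and only human-readable fragments are kept.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


if __name__ == "__main__":
    # Made-up bytes standing in for the head of a dump file.
    sample = b"\x00\x01Hypothetical panic message\x00\xffab\x00Exception 14\x00"
    for text in printable_strings(sample):
        print(text)
```

On a real file you would call printable_strings(read_head("path/to/your/dump")) and search the result for the panic message you saw on screen.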


5. Decipher the error

Troubleshooting and Root Cause Analysis can make one feel like Sherlock Holmes. PSODs can sometimes turn into an Arthur Conan Doyle-inspired story, but in most cases it’s a pretty straightforward process in which you will rarely need to reach the fifth “why” of the 5 Whys technique.

The most important symptom, and the one you should start with, is the error message generated by the purple screen. Luckily, the number of error messages that can be produced is finite: 

Exception Type 0 #DE: Divide Error
Exception Type 1 #DB: Debug Exception
Exception Type 2 NMI: Non-Maskable Interrupt
Exception Type 3 #BP: Breakpoint Exception
Exception Type 4 #OF: Overflow (INTO instruction)
Exception Type 5 #BR: Bounds check (BOUND instruction)
Exception Type 6 #UD: Invalid Opcode
Exception Type 7 #NM: Coprocessor not available
Exception Type 8 #DF: Double Fault
Exception Type 10 #TS: Invalid TSS
Exception Type 11 #NP: Segment Not Present
Exception Type 12 #SS: Stack Segment Fault
Exception Type 13 #GP: General Protection Fault
Exception Type 14 #PF: Page Fault
Exception Type 16 #MF: Coprocessor error
Exception Type 17 #AC: Alignment Check
Exception Type 18 #MC: Machine Check Exception
Exception Type 19 #XF: SIMD Floating-Point Exception
Exception Type 20-31: Reserved
Exception Type 32-255: User-defined (clock scheduler)

These exception types are defined by the CPU architecture, so for more information about them see the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture, and Volume 3A.
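As a rough illustration of how the exception list can be used, the Python sketch below looks up an exception number and picks apart a headline in the two "#GP"/"#PF" formats that appear as examples in this article. The regular expression and the function names are my own assumptions for illustration; a real purple screen may word the line differently, so treat this as a starting point rather than a complete parser.

```python
import re

# Exception numbers and names exactly as listed in the article.
EXCEPTION_NAMES = {
    0: "Divide Error",
    1: "Debug Exception",
    2: "Non-Maskable Interrupt",
    3: "Breakpoint Exception",
    4: "Overflow (INTO instruction)",
    5: "Bounds check (BOUND instruction)",
    6: "Invalid Opcode",
    7: "Coprocessor not available",
    8: "Double Fault",
    10: "Invalid TSS",
    11: "Segment Not Present",
    12: "Stack Segment Fault",
    13: "General Protection Fault",
    14: "Page Fault",
    16: "Coprocessor error",
    17: "Alignment Check",
    18: "Machine Check Exception",
    19: "SIMD Floating-Point Exception",
}

# Accepts "#GP Exception(13) in world ..." and "#PF Exception type 14 in world ...".
PSOD_LINE = re.compile(
    r"#(?P<mnemonic>[A-Z]{2})\s+Exception\s*(?:type\s*)?\(?(?P<number>\d+)\)?"
    r"\s+in world\s+(?P<world_id>\d+):(?P<world_name>\S+)"
    r"\s+@\s+(?P<address>0x[0-9a-fA-F]+)"
)


def describe_exception(number):
    """Map an exception number to the name used in the list above."""
    if number in EXCEPTION_NAMES:
        return EXCEPTION_NAMES[number]
    if 20 <= number <= 31:
        return "Reserved"
    if 32 <= number <= 255:
        return "User-defined"
    return "Unknown"


def parse_psod_line(line):
    """Return the fields of an exception headline, or None if it does not match."""
    match = PSOD_LINE.search(line)
    if match is None:
        return None
    number = int(match.group("number"))
    return {
        "mnemonic": match.group("mnemonic"),
        "number": number,
        "name": describe_exception(number),
        "world_id": int(match.group("world_id")),
        "world_name": match.group("world_name"),
        "address": match.group("address"),
    }


if __name__ == "__main__":
    print(parse_psod_line("#PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e"))
```

The world name and the faulting address are usually the first things worth searching for in the KB or handing to support, which is why they are extracted as separate fields.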

The most common cases are covered in separate VMware KB articles and I will just maintain a reference table of such errors here since the articles are very detailed and well documented. So use this table as an index for the PSOD errors:

Example Error | Detailed KB Article
LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed | Using hardware NMI facilities to troubleshoot unresponsive hosts (1014767)
Panic requested by one or more 3rd party NMI handlers | Using hardware NMI facilities to troubleshoot unresponsive hosts (1014767)
COS Error: Oops | Understanding an "Oops" purple diagnostic screen (1006802)
Lost Heartbeat | Understanding a "Lost Heartbeat" purple diagnostic screen (1009525)
ASSERT bora/vmkernel/main/pframe_int.h:527 | Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956)
NOT_IMPLEMENTED /build/mts/release/bora-84374/bora/vmkernel/main/util.c:83 | Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956)
Spin count exceeded (iplLock) - possible deadlock | Understanding a "Spin count exceeded" purple diagnostic screen (1020105)
PCPU 1 locked up. Failed to ack TLB invalidate | Understanding a Failed to ack TLB invalidate purple diagnostic screen (1020214)
#GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303 | Understanding Exception 13 and Exception 14 purple diagnostic screen events (1020181)
#PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e | Understanding Exception 13 and Exception 14 purple diagnostic screen events (1020181)
Machine Check Exception: Unable to continue | Decoding Machine Check Exception (MCE) output after a purple screen error (1005184)
Hardware (Machine) Error | Decoding Machine Check Exception (MCE) output after a purple screen error (1005184)
PCPU: 1 hardware errors seen since boot (1 corrected by hardware) | Decoding Machine Check Exception (MCE) output after a purple screen error (1005184)
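For the Machine Check Exception rows, the codes on the purple screen are essentially CPU status register values. As a hedged sketch of what "decoding" means, the Python snippet below splits a 64-bit machine-check status value into its flag bits and error codes, following the IA32_MCi_STATUS layout from the Intel manual (flags in bits 63 to 57, model-specific code in bits 31 to 16, MCA error code in bits 15 to 0). The function name and the sample value are invented for illustration; interpreting the resulting codes for a specific CPU model still requires the KB article and the vendor documentation.

```python
# Flag bits of the Intel IA32_MCi_STATUS register, highest bit first.
MCI_STATUS_FLAGS = {
    63: "VAL",    # the register contains valid error information
    62: "OVER",   # an earlier error was overwritten by this one
    61: "UC",     # the error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # the matching MISC register holds extra information
    58: "ADDRV",  # the matching ADDR register holds the failing address
    57: "PCC",    # processor context may be corrupted
}


def decode_mci_status(status):
    """Split a 64-bit machine-check status value into flags and codes."""
    flags = [name for bit, name in MCI_STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # Hypothetical value, built by hand so the expected flags are obvious.
    sample = (1 << 63) | (1 << 61) | (1 << 60) | (1 << 57) | 0x0135
    decoded = decode_mci_status(sample)
    print(decoded["flags"], hex(decoded["mca_error_code"]))
```

A value with UC and PCC set points to an uncorrected error that the CPU could not safely continue from, which is consistent with the host choosing to panic.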


6. Check logs

Sometimes the cause is not obvious from the purple screen message or the core dump log, so the next place to look for clues is the host logs, especially the time interval just before the PSOD. Even when you feel you have located the cause, it’s still advisable not to cut corners and to confirm it by looking at the logs.

If you are administering an enterprise environment, it’s likely you have a specialized log management solution at hand (like VMware Log Insight or SolarWinds LEM), so it will be easy to browse through those logs. If you don’t have a log management solution, you can easily export them from the host.

The most interesting log files to explore would be:

Component | Location | What is it
System messages | /var/log/syslog.log | Contains all general log messages and can be used for troubleshooting.
VMkernel | /var/log/vmkernel.log | Records activities related to virtual machines and ESXi. Most PSOD-relevant entries will be in this log, so pay special attention to it.
ESXi host agent log | /var/log/hostd.log | Contains information about the agent that manages and configures the ESXi host and its virtual machines.
VMkernel warnings | /var/log/vmkwarning.log | Records activities related to virtual machines. Watch for log entries related to heap exhaustion (Heap WorkHeap).
vCenter agent log | /var/log/vpxa.log | Contains information about the agent that communicates with vCenter, so you can use it to spot tasks that were triggered by vCenter and might have caused the PSOD.
Shell log | /var/log/shell.log | Contains a record of all commands typed, so you can correlate the PSOD with a command that was executed.
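If you have exported the logs and have no log management tool, narrowing them down to the interval just before the crash can be scripted. Here is a minimal Python sketch, assuming each log line starts with a UTC timestamp such as 2018-06-14T10:15:30.123Z; the function names, the ten-minute default window and the sample lines are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout at the start of each log line.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    token = line.split(" ", 1)[0]
    try:
        return datetime.strptime(token, TIMESTAMP_FORMAT)
    except ValueError:
        return None


def lines_before_crash(lines, crash_time, window=timedelta(minutes=10)):
    """Keep only the lines logged within `window` before `crash_time`."""
    start = crash_time - window
    selected = []
    for line in lines:
        timestamp = parse_timestamp(line)
        if timestamp is not None and start <= timestamp <= crash_time:
            selected.append(line)
    return selected


if __name__ == "__main__":
    # Made-up log lines; real entries will look different.
    sample = [
        "2018-06-14T09:40:00.000Z cpu1:1000)example message long before the crash",
        "2018-06-14T10:07:12.500Z cpu2:1001)example warning shortly before the crash",
        "continuation line without a timestamp",
        "2018-06-14T10:14:59.900Z cpu3:1002)example last message before the crash",
    ]
    crash = datetime(2018, 6, 14, 10, 15, 0)
    for entry in lines_before_crash(sample, crash):
        print(entry)
```

Running the same filter over vmkernel.log, vmkwarning.log and hostd.log with one crash time gives you a small set of entries to read side by side.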

 

How to prevent it?

Most software-related PSODs are resolved by patches, so make sure you are up to date with the latest versions.

Make sure that your servers, together with all their devices and adapters, are on VMware’s Hardware Compatibility List (HCL). This will protect you from some unexpected hardware-related issues, and it will also ensure that VMware support is able to help you in case of a PSOD.

As described above in “Why does it happen?”, misbehaving drivers are also a frequent cause of PSODs, so it’s imperative to check vendors’ support websites regularly for updated firmware and drivers, and especially for drivers documented as causing PSODs, so that you can upgrade them as soon as possible.

At Runecast, we regularly analyze the entire VMware Knowledge Base (kb.vmware.com) which consists of more than 30,000 articles. We are extracting actionable insights from the KBs in order to proactively make virtualized infrastructures more resilient, secure and efficient. We are very familiar with the PSOD and are able to identify most of the preconditions that can lead to this problem. By proactively analyzing your environment, Runecast Analyzer will help you steer away from these issues, so you can have the peace of mind that most PSODs lurking in your environment are prevented.

About the author:

Aylin Sali, Runecast CTO


Aylin Sali is a virtualization and cloud enthusiast with more than 10 years of IT experience and an overwhelming desire for automation. He is a VCAP DCA & DCD and 5x vExpert.


June 14, 2018

