Troubleshooting a hung or non-responsive Windows server can be challenging. Simply hitting the reset button is no longer acceptable, because a growing number of these servers run business-critical operations.
A server can hang for many reasons, spanning both hardware and software: for example, a bad NIC, device driver conflicts, or resource pool depletion.
Here are the basic steps for troubleshooting an incident in which a physical Windows server is in a non-responsive or hung state:
1. Try to ping the hung server.
2. If the server answers ping, try to connect over RDP and see whether you can log in.
3. If ping works but RDP does not, try to manage the server remotely with Computer Management, PsTools, or perfmon from another server on the same network.
4. If the server answers ping but cannot be reached with any other remote management tool, log in to the hardware remote console (RSA/DRAC/iLO), if available.
5. Check the server status in the hardware console and see whether you can log in to the server.
6. If you can log in to the server console through RSA/DRAC/iLO, use perfmon to generate a performance report, or use Task Manager to find the process that is consuming high CPU or memory, and look for errors in the event logs.
7. If the server is non-responsive even from the RSA/DRAC/iLO server console, perform an NMI reboot. An NMI (non-maskable interrupt) forces a crash dump, so use it only when other means of troubleshooting have proved unsuccessful.
8. The NMI reboot option is found under the power or diagnostics options in RSA/DRAC/iLO, depending on the vendor (see the screenshots below).
9. Before performing the NMI reboot to generate a memory dump, make sure crash control is enabled in the Windows registry (the settings live under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl).
10. After the crash dump file (memory.dmp) has been created, you can use WinDbg to determine what caused the server to hang.
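The reachability checks in steps 1 to 7 amount to a small decision tree. The Python sketch below is one way to write it down, not a definitive tool: the function names and action labels are hypothetical, and the sending of a server that does not answer ping to the hardware console is an assumption, since the checklist does not cover that case explicitly.

```python
import socket

# Hypothetical labels for the next action; the wording is illustrative only.
USE_RDP = "log in over RDP and investigate"
USE_REMOTE_TOOLS = "manage remotely (Computer Management, PsTools, perfmon)"
USE_CONSOLE = "investigate from the hardware console (perfmon, Task Manager, event logs)"
USE_NMI = "perform an NMI reboot to force a crash dump"


def tcp_port_open(host, port=3389, timeout=3.0):
    """Return True if a TCP connection succeeds; 3389 is the default RDP port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def next_step(ping_ok, rdp_ok, remote_tools_ok, console_ok):
    """Map the observed reachability results to the next action in the checklist."""
    if ping_ok and rdp_ok:
        return USE_RDP
    if ping_ok and remote_tools_ok:
        return USE_REMOTE_TOOLS
    # Assumption: a server that does not answer ping is also handled from the
    # hardware console, which the checklist does not state explicitly.
    if console_ok:
        return USE_CONSOLE
    return USE_NMI
```

For example, next_step(True, False, False, True) returns the hardware console action. Note that tcp_port_open only shows that the RDP port accepts connections; a hung server may still accept the connection and then fail at login, so the result of an actual login attempt is what should be passed as rdp_ok.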
NOTE: Take a screenshot at every step you perform; screenshots are important when drafting an RCA.
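Step 9 asks for crash control to be enabled before an NMI is sent, which is best verified while the server is still healthy. The Python sketch below shows one way to check the relevant registry values. It assumes the standard CrashControl key and the CrashDumpEnabled and NMICrashDump values; the function names are hypothetical, and the check is deliberately conservative, because newer Windows Server releases generally honor an NMI without NMICrashDump being set.

```python
# Known CrashDumpEnabled settings.
DUMP_TYPES = {0: "none", 1: "complete", 2: "kernel", 3: "small", 7: "automatic"}


def check_crash_control(values):
    """Return a list of findings for a dict of CrashControl value names to integers."""
    findings = []
    dump_type = values.get("CrashDumpEnabled")
    if dump_type is None or dump_type == 0:
        findings.append("CrashDumpEnabled is missing or 0: no memory dump will be written")
    elif dump_type not in DUMP_TYPES:
        findings.append(f"CrashDumpEnabled has unexpected value {dump_type}")
    if values.get("NMICrashDump") != 1:
        # Conservative check: older Windows versions need this value set to 1.
        findings.append("NMICrashDump is not 1: an NMI may not produce a dump on older Windows versions")
    return findings


def read_crash_control():
    """Read the two values from the local registry (Windows only)."""
    import winreg  # imported here so the rest of the module loads on any OS

    path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        for name in ("CrashDumpEnabled", "NMICrashDump"):
            try:
                values[name], _ = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                pass  # value not present; check_crash_control reports it
    return values
```

On the server itself, check_crash_control(read_crash_control()) returns an empty list when both values are set as expected. Keeping the evaluation separate from the registry read lets the same check run against values exported from another machine.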