Monitoring Windows 64-bit environments and the role of WMI

Introduction

Monitoring Microsoft Windows servers from Nagios is a straightforward affair and can return rich, detailed information depending on which checks are used. In some cases it is easier to set up than Linux environments, and it can give the administrator in-depth insight into what is happening on their server at a given moment. However, as Windows servers tend to differ in specification and build, issues can sometimes crop up. In this article, I will describe an issue I faced and how it was overcome. It is assumed that the reader has a working knowledge of Windows servers, Nagios and basic networking protocols. In this case study, the NRPE executables are the ones that we install by default; however, there are many others available at Nagios Exchange.

The Headache

Whilst trying to monitor some Windows servers, a problem occurred: our standard NRPE executables would not return accurate data. The following was being displayed constantly for the checks on the system’s RAM and CPU:

OS_RAM; OK; HARD; 4; Mem: 0 MB (0%) / 4095 MB (100%) Paged Mem: 0 MB (0%) / 4095 MB (100%)
OS_CPU_LOAD;OK;HARD;4;NOW: Mean:0.000000% Variance: 0.000000% CUMULATIVE: Mean:0.000000% Variance: 0.000000%

The memory check is designed to show the total amount of physical memory and paged memory available, and the amount of each being used. So it was a surprise to see both metrics showing as 0% used. As this was a busy environment, I was expecting (as a minimum) a good percentage of the physical memory to be used and certainly some sort of CPU load. But this was not the case. The checks showed that this server was effectively not doing anything. Considering that this server runs a production Oracle 11g instance, I found that hard to believe.
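A reading like this can be caught automatically rather than by eye. The following is a minimal sketch in Python, assuming the plugin output keeps the exact `Mem: <used> MB (<pct>%) / <total> MB (100%)` layout of the sample above. The function names and the rule "0 MB used is suspicious" are illustrative assumptions and are not part of the NRPE package.

```python
import re

# Matches the physical-memory part, e.g. "Mem: 0 MB (0%) / 4095 MB (100%)".
# The lookbehind skips the "Paged Mem:" section. Layout assumed from the sample output.
MEM_PATTERN = re.compile(r"(?<!Paged )Mem: (\d+) MB \((\d+)%\) / (\d+) MB")

def parse_mem_output(line):
    """Return (used_mb, used_pct, total_mb) for physical memory, or None if no match."""
    match = MEM_PATTERN.search(line)
    if match is None:
        return None
    used_mb, used_pct, total_mb = (int(group) for group in match.groups())
    return used_mb, used_pct, total_mb

def looks_implausible(line):
    """Flag a reading that claims no physical memory is in use (hypothetical sanity rule)."""
    parsed = parse_mem_output(line)
    return parsed is not None and parsed[0] == 0

sample = "OS_RAM; OK; HARD; 4; Mem: 0 MB (0%) / 4095 MB (100%) Paged Mem: 0 MB (0%) / 4095 MB (100%)"
print(parse_mem_output(sample))   # (0, 0, 4095)
print(looks_implausible(sample))  # True
```

A rule like this would only be a safety net: it flags that the check is returning nonsense, it does not say why.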

So, my initial investigation took me to the nrpe.cfg of the server being monitored. Looking at the config file, everything appeared to be fine:

command[nt_cpuload]=..\nrpe_nt_plugins\bin\check_cpu_load.exe 85 90
command[check_mem]=..\nrpe_nt_plugins\bin\check_mem_load.exe 85 90

Nothing seemed to be amiss here so the next thing to check was whether or not the service was running:

gb_131016_image1

Nothing wrong with this either. Maybe the monitoring server’s configuration was wrong?

**Service Definition**

define service {
        use                             	generic-service
        name                            	generic-win-cpu-load-service
        service_description             	OS_CPU_LOAD
        check_command                           check_nt!nt_cpuload
        servicegroups                   	OS_Services
        register                        	0
}

**Host Service Definition**

define service{
        use                             	generic-win-cpu-load-service
        host_name                       	CLIENT_SERVERNAME
        }

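Nothing wrong here either. So why wouldn’t it work? The setup was definitely correct, and the fact that all the other checks on the server were working told me my configuration was sound. So it’s reasonable to assume at this point that the check itself is at fault. The 4095 MB total in the output is a clue: that is the most a 32-bit value can describe, which suggests the older NT executables cannot read memory figures correctly on a 64-bit server.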

The world of WMI and how to find it

WMI, or Windows Management Instrumentation, is a Windows facility used to collect data and management information from local and remote machines that run Windows operating systems. This can be anything from RAM and CPU usage to disk health and free space.
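To make this concrete: the WMI class `Win32_OperatingSystem` exposes `TotalVisibleMemorySize` and `FreePhysicalMemory`, both reported in kilobytes. The sketch below, in Python, shows how two such values could be turned into a percentage used. Fetching live values needs a WMI client on Windows, which is not shown here; the figures are made up, the function name is hypothetical, and this is not necessarily how any particular check script does its calculation.

```python
def percent_used(total_kb, free_kb):
    """Percentage of memory in use, given total and free amounts in kilobytes."""
    if total_kb <= 0:
        raise ValueError("total_kb must be positive")
    if not 0 <= free_kb <= total_kb:
        raise ValueError("free_kb must be between 0 and total_kb")
    return round(100.0 * (total_kb - free_kb) / total_kb, 1)

# Made-up figures for a server with 16 GB of RAM and 4 GB free.
total_visible_memory_kb = 16 * 1024 * 1024
free_physical_memory_kb = 4 * 1024 * 1024
print(percent_used(total_visible_memory_kb, free_physical_memory_kb))  # 75.0
```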

The great thing about WMI is it can be used with practically any scripting language that can utilize the inbuilt Windows Script Host.

Checking whether it is enabled is fairly straightforward: just head to Services and make sure the Windows Management Instrumentation service is started and set to run automatically at start-up, as shown below:

gb_131016_image2

Once the service has been started, it’s a case of having a script that checks the service or function you require.

Which script to use and choosing the right parameters!

The scripts that can be used with WMI are widely available on the internet, and using your favorite search engine will bring back plenty of results and ideas. In this case, though, I used a script called check_memory_percentage_space_used.vbs, which is installed as part of the NRPE package. This is quite an in-depth memory script, as it allows different instances of memory to be selected to give a richer insight into the server being monitored. I will show the three that I used.

Firstly, the script needs to be located somewhere sensible, but preferably separate from your old nt executables. Then it is a case of editing your nrpe.cfg to reflect the new check. Below is an extract from the one I worked on:

command[check_ram]=cscript.exe //nologo //T:60 ..\wmi\check_memory_percentage_space_used.vbs -h 127.0.0.1 -inst RAM -w 85 -c 90

command[check_paging]=cscript.exe //nologo //T:60 ..\wmi\check_memory_percentage_space_used.vbs -h 127.0.0.1 -inst PAGING -w 85 -c 90

command[check_total_mem]=cscript.exe //nologo //T:60 ..\wmi\check_memory_percentage_space_used.vbs -h 127.0.0.1 -inst _TOTAL -w 85 -c 90

The first part of each check after command[command_name]=, namely cscript.exe //nologo //T:60, is generic and needs to be included in every WMI check in your configuration file (//nologo suppresses the Windows Script Host banner and //T:60 limits the script to 60 seconds). The piece we need to focus on here is the “-h”, “-inst”, “-w” and “-c” part, which is where you define your parameters. In this case, we have set the “-h” parameter to the local host address, as we want the script to query the server on which it resides.

The second parameter is the most important. The “-inst” parameter defines which part of the memory structure you want information on. In the first check it is set to ‘RAM’, which returns information on the server’s physical memory. In the second it is ‘PAGING’, which returns the server’s virtual memory, and in the final one it is ‘_TOTAL’, which is a combination of the previous two.

As this is a percentage used check, the “-w” and “-c” parameters represent a percentage value at which the alert will be Warning and Critical respectively. Simple really.
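The threshold logic can be sketched in a few lines of Python. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin return codes; the function name and the choice of a "greater than or equal" comparison are assumptions made for illustration, and the actual script may compare differently.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def evaluate(percent_used, warn, crit):
    """Map a percentage-used reading to a Nagios exit code.

    Assumes the alert fires when the reading reaches or exceeds a threshold;
    the real script may use a strict comparison instead.
    """
    if warn > crit:
        return UNKNOWN  # thresholds supplied the wrong way round
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK

# With "-w 85 -c 90" as in the extract above:
for reading in (42.0, 87.5, 93.0):
    print(reading, evaluate(reading, warn=85, crit=90))
```

With these thresholds, the three sample readings come back as OK, WARNING and CRITICAL respectively.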

However, your Nagios config also needs to be updated to reflect the new checks. The new service definitions are shown below:

**Service Definitions**

define service {
        use                             generic-service
        name                            generic-win-total-mem-service
        service_description             OS_TOTAL_MEMORY
        check_command                   check_nt!check_total_mem
        register                        0
        }

define service {
        use                             generic-service
        name                            generic-win-ram-service
        service_description             OS_RAM
        check_command                   check_nt!check_ram
        register                        0
        }

define service {
        use                             generic-service
        name                            generic-win-paging-service
        service_description             OS_PAGING
        check_command                   check_nt!check_paging
        register                        0
        }

**Host Service Definitions**

define service{
        use                             generic-win-total-mem-service
        host_name                       	CLIENT_SERVERNAME
        }

define service{
        use                             	generic-win-ram-service
        host_name                       	CLIENT_SERVERNAME
        }

define service{
        use                             	generic-win-paging-service
        host_name                       	CLIENT_SERVERNAME
        }
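A stray space or a misspelled template name in a `use` line is an easy mistake to make in definitions like these. The following is a minimal sketch in Python of a pre-flight check that every `use` value resolves to a template `name`. It assumes simple `define service { ... }` blocks with one directive per line; the function names are hypothetical, and the parser is far less thorough than Nagios's own configuration verification.

```python
import re

# Captures the body of each "define service { ... }" block.
DEFINE_BLOCK = re.compile(r"define\s+service\s*\{(.*?)\}", re.DOTALL)

def parse_services(text):
    """Return one dict of directive -> value per 'define service' block."""
    services = []
    for body in DEFINE_BLOCK.findall(text):
        directives = {}
        for line in body.splitlines():
            parts = line.strip().split(None, 1)  # directive, then the rest of the line
            if len(parts) == 2 and not parts[0].startswith("#"):
                directives[parts[0]] = parts[1].strip()
        services.append(directives)
    return services

def unresolved_templates(text, known=()):
    """List 'use' values that match neither a template in the text nor a known one."""
    services = parse_services(text)
    names = {s["name"] for s in services if "name" in s} | set(known)
    missing = []
    for service in services:
        for template in service.get("use", "").split(","):
            template = template.strip()
            if template and template not in names:
                missing.append(template)
    return missing

config = """
define service {
        use                     generic-service
        name                    generic-win-ram-service
        service_description     OS_RAM
        check_command           check_nt!check_ram
        register                0
        }

define service {
        use                     generic-win-ram -service
        host_name               CLIENT_SERVERNAME
        }
"""
# generic-service is assumed to be defined elsewhere in the Nagios configuration.
print(unresolved_templates(config, known={"generic-service"}))
# ['generic-win-ram -service']
```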

And that’s it really. Once you have reloaded the Nagios service (which will error out if there is any mistake in the configuration), you should see some good results like the ones below:

gb_131016_image3 gb_131016_image4 gb_131016_image5

Here’s one for CPU load as well, which uses ‘*’ as the “-inst” parameter. This checks all CPUs available on the server and also outputs a total, all in the same check:

gb_131016_image6 gb_131016_image7

With Nagios and the addition of WMI checks, there is a wealth of information that can be pulled out of Windows servers, so that your monitoring can be spot on and give you the best chance to react to any issues.

Graham Barnes
