iCore IT Services

Windows Server Checks

CPU

CPU: Interrupts/sec. The number of interrupts the processor was asked to respond to. Interrupts are generated by hardware components such as hard disk controller adapters and network interface cards. A sustained value over 1000 per processor usually indicates a problem, such as poorly configured drivers, driver errors, excessive utilization of a device (like a NIC on an IIS server), or hardware failure. Compare this value with the System Calls/sec counter. If Interrupts/sec is much larger over a sustained period, you probably have a hardware issue. High Interrupts/sec indicates high utilization caused by hardware devices.

CPU: System Calls/sec. This counter measures the number of calls made to system components (kernel-mode services). It shows how busy the system is servicing applications and services, that is, software. Compared with Interrupts/sec, it indicates whether processor issues are hardware or software related. See CPU: Interrupts/sec for more information. High System Calls/sec indicates high utilization caused by software.
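
The comparison between the two counters can be sketched as a small helper. This is only an illustration: the function name, the `ratio` cut-off for "much larger", and the sample values are assumptions, not part of the check itself.

```python
def classify_cpu_pressure(interrupts_per_sec, syscalls_per_sec, ratio=2.0):
    """Guess whether sustained CPU pressure looks hardware- or software-driven.

    Both arguments are lists of samples collected over the same period.
    `ratio` is a hypothetical cut-off for "much larger".
    """
    if not interrupts_per_sec or not syscalls_per_sec:
        raise ValueError("need at least one sample of each counter")

    # Average over the period so that short spikes do not decide the result.
    avg_interrupts = sum(interrupts_per_sec) / len(interrupts_per_sec)
    avg_syscalls = sum(syscalls_per_sec) / len(syscalls_per_sec)

    if avg_interrupts > avg_syscalls * ratio:
        return "hardware"
    if avg_syscalls > avg_interrupts * ratio:
        return "software"
    return "inconclusive"
```

For example, `classify_cpu_pressure([5000, 6000], [1000, 1200])` returns "hardware" because the interrupt rate dominates over the whole period.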

CPU Load: This check calculates the average CPU usage over a specified period of time.

CPU: Processor Queue Length: This is a rough indicator of the number of threads that are ready to run but waiting for processor time. If the Processor Queue Length exceeds 2 per CPU for continuous periods (over 10 minutes or so), you probably have a CPU bottleneck. For example, if you have 4 CPUs in your server, the Processor Queue Length should not exceed a total of 8 for the entire server.
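
The "2 per CPU for about 10 minutes" rule could be checked along these lines. The function name, the one-sample-per-minute interval, and the parameter names are assumptions made for this sketch.

```python
def queue_bottleneck(queue_samples, cpu_count, interval_sec=60,
                     sustain_sec=600, per_cpu_limit=2):
    """Return True if the queue stays above the limit for sustain_sec.

    queue_samples: Processor Queue Length readings, one per interval_sec.
    The limit for the whole server is per_cpu_limit * cpu_count.
    """
    limit = per_cpu_limit * cpu_count
    needed = max(1, sustain_sec // interval_sec)  # consecutive samples required

    run = 0
    for value in queue_samples:
        # Count consecutive samples over the limit; reset on any dip.
        run = run + 1 if value > limit else 0
        if run >= needed:
            return True
    return False
```

With 4 CPUs the limit is 8, so ten consecutive one-minute readings of 9 would be reported as a bottleneck, while a single reading of 8 in between resets the count.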

CPU: Processor Time: Determines the percentage of time the processor is busy by measuring the percentage of time the thread of the Idle process is running and subtracting that from 100 percent. This measurement is the amount of processor utilization. Although you might sometimes see high values for the Processor\% Processor Time counter (70 percent or greater, depending on your workload and environment), this might not indicate a problem; you need more data to understand the activity. For example, high processor-time values typically occur when you are starting a new process and should not cause concern.

Processor Bottleneck

Processor bottlenecks occur when the processor is so busy that it cannot respond to requests quickly. Excessive processor activity can be identified by:

  • a high rate of processor activity
  • a long, sustained processor queue, which is a more certain indicator

As you monitor processor and related counters, you can recognize a developing bottleneck by the following conditions:

  • CPU: Processor Time often exceeds 80 percent.
  • CPU: Processor Queue Length is often greater than 2 per processor (on a multiprocessor system, multiply by the number of processors).
  • Unusually high values appear for CPU: Interrupts/sec.
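
The three conditions above can be evaluated together over a set of samples. In this sketch, "often" is interpreted as "at least half of the samples", and "unusually high" interrupts as more than 1000 per second per processor; both interpretations, and all names, are assumptions.

```python
def fraction_over(samples, threshold):
    """Fraction of samples strictly above threshold (0.0 for no samples)."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s > threshold) / len(samples)


def developing_bottleneck(cpu_time, queue_len, interrupts,
                          cpu_count=1, often=0.5, interrupt_limit=1000):
    """Report which of the three bottleneck conditions currently hold."""
    return {
        # Processor Time often exceeds 80 percent.
        "processor_time": fraction_over(cpu_time, 80) >= often,
        # Queue length often greater than 2 per processor.
        "queue_length": fraction_over(queue_len, 2 * cpu_count) >= often,
        # Interrupts/sec often above the per-processor limit.
        "interrupts": fraction_over(interrupts, interrupt_limit * cpu_count) >= often,
    }
```

Returning the individual flags rather than a single verdict keeps the decision about how many conditions must hold with the operator.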

Solutions for processor bottlenecks
If the high processor utilization is caused by faulty drivers or faulty application software, the software needs to be fixed or replaced. If the high processor activity is caused by normal activity, the options are to upgrade the hardware or to reorganize the software schedule.

  • Add more processors or upgrade the processor
  • Replace Programmed I/O (PIO) devices
    • Use SCSI or Ultra DMA instead of normal IDE
    • Use bus mastering devices
  • Distribute applications onto other servers
  • If the utilization is caused by software, schedule processor-intensive tasks at less busy times, such as at night
    • Schedule the tasks via the Control Panel

 

MEMORY

The Memory performance object consists of counters that describe the behavior of physical and virtual memory on the computer. Physical memory is the amount of RAM on the computer. Virtual memory consists of space in physical memory and on disk. Many of the memory counters monitor paging, which is the movement of pages of code and data between disk and physical memory. Excessive paging is a symptom of a memory shortage and can cause delays that interfere with all system processes.

Memory-Physical: Shows the total, used and free memory on the system. Measured in GB.

Memory-Physical available: Shows the amount of physical memory, in megabytes, immediately available for allocation to a process or for system use. The more memory is available, the faster the server can respond. It is equal to the sum of memory assigned to the standby (cached), free, and zero page lists.

Memory-Committed Bytes: Tracks the amount of virtual memory that’s in use. The Available Bytes counter monitors how much memory is actually available. As you might expect, as the Available Bytes counter decreases, paging increases, slowing down your machine. If you determine that Available Bytes is often in short supply, you can correct the problem by adding memory. However, before you do, try watching both counters together as you open and close programs. If the committed bytes don’t decrease and the available bytes don’t increase as you close programs, the system may have a memory leak, which is caused by a software problem rather than insufficient RAM.

Memory Pages per sec: The less paging, the better your server's performance. Most authorities agree that Memory: Pages/sec is a key memory counter. This counter measures 'hard' page faults; in other words, the page is nowhere in memory, so the VMM (Virtual Memory Manager) has to fetch the data from the pagefile on the disk, which in computing terms takes an age. The recommended threshold for the Memory: Pages/sec counter is 20.
Shows the rate, in incidents per second, at which pages were read from or written to disk to resolve hard page faults. This counter is a primary indicator for the kinds of faults that cause system-wide delays. It is the sum of Pages Input/sec and Pages Output/sec. It is counted in numbers of pages, so it can be directly compared to other counts of pages such as Page Faults/sec. It includes pages retrieved to satisfy faults in the file system cache (usually requested by applications) and noncached mapped memory files. In other words, it indicates the number of requested pages that were not immediately available in RAM and had to be read from the disk, or had to be written to the disk to make room in RAM for other pages. If your system experiences a high rate of hard page faults, the value for Memory\Pages/sec can be high.
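
As a small worked example of the relationship above, Pages/sec is the sum of Pages Input/sec and Pages Output/sec, compared against the recommended threshold of 20. The function names are made up for this sketch.

```python
def pages_per_sec(pages_input_per_sec, pages_output_per_sec):
    """Memory: Pages/sec is the sum of the input and output page rates."""
    return pages_input_per_sec + pages_output_per_sec


def paging_alert(pages_input_per_sec, pages_output_per_sec, threshold=20):
    """True when the combined paging rate is above the threshold."""
    return pages_per_sec(pages_input_per_sec, pages_output_per_sec) > threshold
```

For instance, 12 pages in and 5 pages out per second gives 17, which is under the threshold; 15 in and 10 out gives 25, which is over it.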

Memory Page Faults per sec: Measures page faults, which occur when an application attempts to read from a virtual memory location that is marked "not present." Zero is the optimum measurement. Any measurement higher than zero delays response time. The Memory: Page Faults/sec counter measures both hard page faults and soft page faults. Hard page faults occur when a page has to be retrieved from the hard disk rather than from physical memory. Soft page faults occur when the faulted page is found elsewhere in physical memory; they still interrupt the processor but have much less effect on performance.

Memory Pagefile usage: Shows the Windows pagefile usage (in percent).

Memory Pagefile usage Peak:  The peak value of the Windows Pagefile usage (in percent). If it approaches 90% or above you might want to increase the size of the paging file.

Memory NonPaged-leaks: This counts memory that cannot be paged out to disk and must stay in physical RAM. Normally, if this value is too high, you’ll have to add more memory. Anything over 200MB should be investigated.

Memory-Virtual: Number of pages of swap currently in use (only swap). According to Microsoft this is the size of unreserved and uncommitted memory in the user mode portion of the virtual address space of the calling process, in bytes.

Solutions to Memory Problems

Hardware solution

Not enough available memory? The easiest cure for memory problems is to open up the server and add another stick of RAM. We have a suggestion for future purchases: always buy machines with more RAM than you need now.

Software solution
The most common memory problem is a memory leak due to incorrect application code. Following are some recommendations to remedy memory issues:

  • Investigate the minimum memory requirement for your applications to run. This can be done easily with Task Manager (read the memory values before and after the application is loaded into memory). Make sure the available memory exceeds this value. Add more physical RAM to the machine if it is not sufficient.
  • Create multiple paging files on multiple disks. This spreads paging I/O across the disks and allows faster access.
  • Reevaluate the paging file size. It is recommended that the paging file size be 1.5 times the physical RAM installed. If the paging file/virtual memory used exceeds this limit, add extra physical memory or decrease the page file size.
  • Run your most memory-intensive applications on your highest performing computers. You can also reschedule such applications to run when the system workload is light.
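
The 1.5 times RAM guideline from the list above is simple arithmetic. The helper names below are hypothetical, and sizes are assumed to be in megabytes.

```python
def recommended_pagefile_mb(ram_mb, factor=1.5):
    """Recommended paging file size: factor times the installed RAM."""
    return int(ram_mb * factor)


def pagefile_exceeds_recommendation(pagefile_used_mb, ram_mb, factor=1.5):
    """True when paging file usage is above the recommended size."""
    return pagefile_used_mb > recommended_pagefile_mb(ram_mb, factor)
```

A server with 4096 MB of RAM would get a recommended paging file of 6144 MB under this rule.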

The first step in detecting a memory leak is to observe the memory data by using the Memory: Available Bytes and Memory: Committed Bytes performance counters. You should suspect a memory leak when the available memory figure declines by more than 4 MB. You need to isolate the applications and run them against these counters to determine which application is causing the memory leak.
You might need to monitor the Process: Private Bytes, Process: Working Set and Process: Handle Count counters on the suspected process to confirm the memory leak. A kernel-mode application can also be leaking memory. In that case, you need to use the Memory: Pool Nonpaged Bytes, Memory: Pool Nonpaged Allocs, and Process (process name): Pool Nonpaged Bytes counters. Kernel-mode applications do not use the paging mechanisms; therefore you should use the nonpaged counters.
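
A first-pass leak check along these lines might compare the first and last samples of the two counters. The 4 MB decline comes from the text; requiring committed bytes to have grown as well is an extra assumption of this sketch, as are the function and parameter names.

```python
MB = 1024 * 1024


def suspect_memory_leak(available_bytes, committed_bytes, decline_limit_mb=4):
    """Flag a possible leak from Available Bytes / Committed Bytes samples.

    Each argument is a list of readings in bytes, oldest first.
    """
    if len(available_bytes) < 2 or len(committed_bytes) < 2:
        raise ValueError("need at least two samples of each counter")

    # How much available memory has been lost over the observation window.
    available_drop = available_bytes[0] - available_bytes[-1]
    # Assumption: a leak also shows up as growth in committed memory.
    committed_growth = committed_bytes[-1] - committed_bytes[0]

    return available_drop > decline_limit_mb * MB and committed_growth > 0
```

A positive result is only a hint; the per-process counters are still needed to find the application responsible.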

 

DISK

The Disk counters help you to evaluate the performance of the disk subsystem. The disk subsystem is more than the disk itself. It includes the disk controller card, the I/O bus of the system, and the disk. When measuring disk performance, it is usually better to have a good baseline for performance than to evaluate disk performance on a case-by-case basis.
There are two objects for the disk: PhysicalDisk and LogicalDisk. The counters for the two are identical. However, in some cases they may lead to slightly different conclusions. The PhysicalDisk object is used for the analysis of the overall disk, regardless of the partitions that may be on the disk. When evaluating overall disk performance, this is the one to select. The LogicalDisk object analyzes information for a single partition. Its values are therefore limited to the activity occurring on that partition and are not necessarily representative of the entire load on the disk. The LogicalDisk object is useful primarily when looking at the effects of a particular application, like SQL Server, on disk performance. Again, PhysicalDisk is primarily for looking at the performance of the entire disk subsystem. In the list that follows, the favored object is indicated with the counter. When the LogicalDisk and PhysicalDisk objects are especially different, the counter will be listed twice and the difference specifically mentioned.

Disk Queue Length: This counter provides a primary measure of disk congestion. Just as the processor queue was an indication of waiting threads, the disk queue is an indication of the number of transactions that are waiting to be processed. Recall that the queue is an important measure for services that operate on a transaction basis. Just like the line at the supermarket, the queue will be representative of not only the number of transactions, but also the length and frequency of each transaction.

Disk Time: Much like % Processor Time, this counter is a general mark of how busy the disk is. You will see many similarities between the disk and processor since they are both transaction-based services. This counter can indicate a disk problem, but it must be observed in conjunction with the Current Disk Queue Length counter to be truly informative. Recall also that the disk could be a bottleneck before % Disk Time reaches 100%.

Disk Idle Time: Shows the percentage of elapsed time during the sample interval that the selected disk drive was idle.

Disk Transfers per sec: Shows the rate, in incidents per second, at which read and write operations were performed on the disk.

Drive Space X: Shows the total, used and free space in gigabytes.
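
For illustration, the same three figures can be read with Python's standard `shutil.disk_usage`. This is not how the check itself is implemented; the function name and the choice of 1 GB = 1024^3 bytes are assumptions.

```python
import shutil

GB = 1024 ** 3


def drive_space_gb(path):
    """Return total, used and free space for the drive holding `path`, in GB."""
    usage = shutil.disk_usage(path)
    return {
        "total": round(usage.total / GB, 2),
        "used": round(usage.used / GB, 2),
        "free": round(usage.free / GB, 2),
    }
```

On Windows, a call such as `drive_space_gb("C:\\")` would report on drive C.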

 

Network Counters

The network performance counters are not typically installed. The Network Segment object that is referred to here is installed when the Network Monitor Agent is installed. The Network Interface object is installed when the SNMP service is installed. Many of the counters have to do with TCP/IP components, such as the SNMP service, which relies on TCP/IP.

Network Sent Load: (Bytes Sent/sec) This is how many bytes of data are sent to the NIC. This is a raw measure of throughput for the network interface. We are really measuring the information sent to the interface, which is the lowest point we can measure. If you have multiple NICs, you will see multiple instances of this counter.

Network Received Load: (Bytes Received/sec) This is how many bytes are received from the NIC. This is a measure of the inbound traffic. In measuring the bytes, NT isn't too particular at this level, so every byte is counted, whatever it is. This includes the framing bytes as well as the data.
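
If a monitoring agent only has access to a cumulative byte total, a per-second rate can be derived from two readings. This sketch assumes such a cumulative total and does not handle counter wrap-around; the function name is hypothetical.

```python
def bytes_per_sec(prev_total, curr_total, elapsed_sec):
    """Average byte rate between two readings of a cumulative byte counter."""
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    return (curr_total - prev_total) / elapsed_sec
```

Two readings of 1,000,000 and 1,600,000 bytes taken 60 seconds apart correspond to 10,000 bytes per second.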

 

System

Host Alive: The role of this check is to detect if the Host (server) is in up or down state. If it is down, an alert message is sent to the client/administrator.

Service State: This checks the state of one or more services on the server and generates a critical state if any service is not in the required state.

Process State: Checks the state of one or more processes on the system and generates a critical state if any process is not in the required state (started or stopped). This check can also count the number of instances of a process and generate an alert if the count is below or above the number of instances specified.
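
The instance-count part of this check amounts to a range test. The function name, the "OK"/"CRITICAL" labels, and the default limits below are assumptions for illustration.

```python
def process_count_state(instances, minimum=1, maximum=None):
    """Return "CRITICAL" if the instance count is outside [minimum, maximum]."""
    if instances < minimum:
        return "CRITICAL"  # too few instances running
    if maximum is not None and instances > maximum:
        return "CRITICAL"  # too many instances running
    return "OK"
```

For example, with `minimum=2, maximum=4`, three running instances would be "OK" and five would be "CRITICAL".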

Uptime: It returns the uptime of the server. Example: System Uptime - 44 day(s) 4 hour(s) 28 minute(s) 
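
A message in the format shown in the example can be built from an uptime in seconds. The function name is hypothetical; only the output format is taken from the example above.

```python
def format_uptime(total_seconds):
    """Format an uptime in seconds as days, hours and minutes."""
    minutes_total = int(total_seconds) // 60
    days, remainder = divmod(minutes_total, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"System Uptime - {days} day(s) {hours} hour(s) {minutes} minute(s)"
```

An uptime of 3,817,680 seconds produces the example string with 44 days, 4 hours and 28 minutes.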

 

Terminal Services

TS Active sessions: Shows the number of active Terminal Server connections/sessions.

TS Inactive sessions: Shows the number of inactive Terminal Server connections/sessions.

 

Dell Server Status: (Windows)

Checks the hardware health of Dell PowerEdge and some PowerVault servers. It uses the Dell OpenManage Server Administrator (OMSA) software, which must be running on the monitored system. This check makes sure the following components are running correctly:


Storage components checked:
  • Controllers
  • Physical drives
  • Logical drives
  • Cache batteries
  • Connectors (channels)
  • Enclosures
  • Enclosure fans
  • Enclosure power supplies
  • Enclosure temperature probes
  • Enclosure management modules (EMMs)


Chassis components checked:

  • Processors
  • Memory modules
  • Cooling fans
  • Temperature probes
  • Power supplies
  • Batteries
  • Voltage probes
  • Power usage
  • Chassis intrusion
 

HP Server Status: (Windows - SNMP)

Checks the hardware health of HP ProLiant servers. It uses the Windows Insight Management Agents software, which must be running on the monitored system. This check makes sure the following components are running correctly:

Storage components checked:

  • Controllers
  • Physical drives
  • Logical drives
  • Cache batteries
  • Connectors (channels)
  • Enclosures
  • Enclosure fans
  • Enclosure power supplies
  • Enclosure temperature probes

Chassis components checked:

  • Processors
  • Memory modules
  • Cooling fans
  • Temperature probes (Board, Ambient, CPU, PowerSupply)
  • Power supplies
 

IBM Server Status: (Windows)

Soon... 

 

OTHER Checks

Internet Connection: Checks the health of the internet connection. It returns round-trip times and lost packets.
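
The two figures this check returns can be summarized from a list of probe results. In this sketch, each probe is a round-trip time in milliseconds, with `None` marking a lost packet; that input format and the function name are assumptions.

```python
def summarize_ping(rtts_ms):
    """Summarize probe results: one entry per probe, None for a lost packet."""
    sent = len(rtts_ms)
    if sent == 0:
        raise ValueError("no probes to summarize")

    replies = [r for r in rtts_ms if r is not None]
    lost = sent - len(replies)
    return {
        "sent": sent,
        "lost": lost,
        "loss_percent": 100.0 * lost / sent,
        # RTT statistics are undefined when every packet was lost.
        "rtt_avg_ms": sum(replies) / len(replies) if replies else None,
        "rtt_max_ms": max(replies) if replies else None,
    }
```

Four probes with one loss and replies of 20, 30 and 40 ms give 25 percent loss and an average round-trip time of 30 ms.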

FTP Server Check: Checks if the FTP server is Up or Down.