Countdown to TechEd 2010 in New Orleans, LA: 2010-06-07 00:00:00 GMT-08:00

Tuesday, December 29, 2009

Hotfix ID – What Does This GUID Stand For?

Recently, I came across a problem when running the Cluster Validation Wizard where the two nodes did not match in the Validate Software Update Levels section.

You must run the Validate test on fully configured solutions before you configure the Failover Cluster to verify the proposed solution. All tests must pass with either a green checkmark (passed) or a yellow yield sign (warning), in order to obtain product support from Microsoft. See the Microsoft Support Policy for Windows Server 2008 Failover Clusters.

The yellow yield sign indicates that this particular aspect of the proposed solution is not in alignment with Microsoft best practices. However, this aspect will still work and will be considered a supported configuration. Personally, I never deploy a production cluster unless I get a completely green result.

As shown above, one of the Windows Server 2008 servers was indicating a warning of "Software Updates missing on 'servername'" and the missing updates are listed only as a GUIDs, with no description.

I searched the Interwebs for anything on related to either GUID, with no luck. Then I came across a nifty script by Guy Teverovsky, a Premier Field Engineer for Platforms at Microsoft Israel. You run the script on the node that's missing the updates.

Here's the syntax:

C:\>cscript GetPatchInfo.vbs /?
Displays details of installed patches/hotfixes
Usage: cscript GetPatchInfo.vbs [/guid:]
/guid: The GUID of the hotfix
Running the script without parameters will enumerate all
the patches installed.

Sample output:

C:\>cscript GetPatchInfo.vbs /guid:{47740627-D81D-4A45-A215-03B075A18EC7}
-------------------------------------------------------
Patch Name: Microsoft Office SharePoint Designer 2007 Service Pack 1 (SP1)
Patch Code: {47740627-D81D-4A45-A215-03B075A18EC7}
More Info URL:
http://support.microsoft.com/kb/937162Patch
State: Installed
Product Code:{90120000-00A4-0409-0000-0000000FF1CE}
Product Name: Microsoft Office 2003 Web Components

I'm also hosting the script here on my blog, just in case it becomes unavailable from his site sometime in the future.

Download GetPatchInfo.zip

In my case, the GUIDs {DEBD1C94-5AAB-4E46-A130-359A52D2bb65} and {2B3A711E-1265-4D05-ACBB-B7677EA6E860} refer to the SCOM 2007 agent, which was missing on one of the nodes.

Labels: , , , , ,


Subscribe in a reader Subscribe by Email

Friday, June 12, 2009

Failure of FSW Causes Cluster Group to Failover

The following information was written for Exchange 2007 CCR mailbox clusters, but it pertains to any clustering solution that uses the Windows Server 2008 Node and File Share Majority cluster quorum configuration.

How Does Node and File Share Majority Clustering Work?

Exchange 2007 CCR uses two clustered Exchange mailbox nodes, called a Clustered Mailbox Server (CMS). In order for Windows to know which node is active, it utilizes a File Share Witness (FSW) to maintain quorum. The FSW is a network share on a third computer (typically a Hub Transport server in the normally active node's physical site). The active node writes information to files in that share and locks them for writing, preventing the passive node from writing to the FSW and taking quorum. It always take two out of three votes to maintain quorum.

If the active node becomes unavailable, the passive node can write to the FSW and the cluster group fails over. In the case of a total site failure where both the active node and the FSW are offline, both the cluster group and the CMS will fail since there is no quorum (there's only one vote).

What Happens When the FSW Becomes Unavailable?

When the FSW fails, the active CMS node (Exchange) does not fail over because there are still two votes (the active and passive nodes). However, the Windows cluster group will fail over to the other node if the FSW does not come back online within 60 seconds. This is because File Share Witness resource in Windows Server 2008 is configured to fail over the cluster group when the FSW fails, as shown below.


Worse, the FSW resource will not come back online for another 60 minutes. During this time, a failure of either one of the nodes will cause the cluster to fail, even if the FSW is back online.

These default settings are provided so that the cluster event logs don't fill up with constant "Trying to start the resource", "The resource failed to start" events during a prolonged outage.

This is what happens when the FSW server is rebooted (during patch management, for example):

  • The server holding the FSW resource is rebooted.
  • The cluster tries to connect to the FSW one minute after failure is detected.
  • If the FSW is still unavailable (which usually happens - most servers take longer than 60 seconds to restart), the cluster group fails over to another node.
  • Wait one hour and try connecting to the FSW again. The FSW is finally brought online.
Note: This behavior only pertains to Windows Server 2008. Windows Server 2008 R2 does not have this issue.

It's important to know that even though the cluster group fails over, there really is no effect on Exchange, even with a geographically disbursed CCR cluster (geo-cluster). However, if you're like me, you like symmetry and order. The cluster group should be with the active CMS node.

Here's how to minimize the time that the cluster group is on the (normally) passive node:

  • Open the Failover Cluster Management console
  • Add the cluster name, if necessary, and select it
  • Double-click Cluster Core Resources in the middle pane to expand it
  • Right-click File Share Witness (\\servername\sharename) and select Properties
  • Click the Policies tab
  • For optimal restart performance, change "If all the restart attempts fail, begin restarting again after the specified period (hh:mm)" to 15 minutes, as shown below:

This configuration will cause the cluster service to attempt to bring the FSW resource to online once every 15 minutes, instead of an hour.

Next, logon to the server holding the FSW resource (typically a Hub Transport server in the active site and install the Failover Clustering Tools feature. You'll find it in Remote Server Administration Tools > Feature Administration Tools.

Now create a batch file called FSW_Online.bat. Enter the following two lines:

  • cluster EXCLUSTER1 res "File Share Witness (\\server\mns_fsw_excluster1)" /online
  • cluster EXCLUSTER1 group “Cluster Group” /move:node.yourdomain.com

Note: Replace EXCLUSTER1 with your cluster name. Replace \\server\mns_fsw_excluster1 with the name of your FSW resource (enter "cluster res" at a command prompt to find it). Replace node.yourdomain.com with the FQDN of the CMS node you want to keep the cluster group on.

Lastly, configure FSW_Online.bat to run at startup on the FSW resource server:

  • Open Local Group Policy Editor
  • Navigate to Computer Configuration > Windows Settings > Scripts (Startup/Shutdown) > Startup
  • Click Add and browse to the FSW_Online.bat file you created
  • Click OK twice and close Local Group Policy Editor

This is my current best practice for configuring the File Share Witness resource failure policy.

Special thanks go to Tim McMichael, Senior Support Escalation Engineer on the Exchange product support team, for assisting me with this article.

Labels: , , , ,


Subscribe in a reader Subscribe by Email