Tuesday, November 30, 2010

SCOM R2 Gateway Server not communicating with the SCOM Management Group: EventID 20070 on the GW server and EventID 20000 on the RMS

Normally when a SCOM Gateway is installed and all prereqs are met, things run like clockwork. In the years that I have worked with SCOM I have installed many SCOM GWs, all without any real issues whatsoever. And when something was amiss, it turned out to be something simple like a firewall blocking some traffic, an incorrect certificate or a missing certificate chain. With just a few mouse clicks, all was fine and life was good again.

Until last week that is. I bumped into a GW that wouldn’t work. AT ALL! I could reproduce it as well with another GW, installed in a totally different environment. Strangest thing was that another SCOM R2 GW server was already installed and fully functional. So what was happening? And more importantly, how to solve it?
The Situation:
The SCOM R2 GW is installed and everything is in place (certs, SCOM GW Approval Tool has been run, firewalls have been configured and the lot). So there is a connection from the GW to the MG.
However, the GW throws EventID 20070 with the message ‘…Check the event log on the server for the presence of 20000 events, indicating that the agents which are not approved are attempting to connect’.
On the RMS side of things, EventID 20000 is logged, stating that the SCOM R2 GW tries to connect but isn’t recognized as part of this Management Group (A device which is not part of this management group has attempted to access this Health Service. Requesting Device Name : …).
Things we tried: Wow! We did many things in order to get it all up & running:
  1. Of course, we checked the firewalls, routers and switches;
  2. Even installed Network Monitor on the RMS;
  3. Renewed the certs on the GW side of it all, reinstalled the SCOM GW;
  4. Reran the GW Approval Tool many times;
  5. Flushed the Health Service State on the RMS and the MS which the GW should report to in order to get a fresh config file (~:\Program Files\System Center Operations Manager 2007\Health Service State\Connector Configuration Cache\\OpsMgrConnector.Config.xml);
  6. Installed the SCOM GW on a totally new server;
  7. Renamed the SCOM GW to see whether the computer name was causing it all;
  8. Ran some verbose logging on the RMS, MS and GWs which only showed EventID 20000 happening and nothing more;
  9. Deleted the SCOM GW and its SITE entry from the SCOM DB, waited until they were groomed out and started all over totally CLEAN;
  10. Ran some good tracing on the firewalls involved as well, showing us the connection was closed by the RMS (EventID 20000).
All to no avail. Nothing solid came out of it.
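For what it is worth, the basic reachability part of step 1 can be scripted. Below is a minimal sketch in Python, assuming the Management Group uses the default SCOM port 5723; the server names in the usage note are made up for illustration. Keep in mind that a successful TCP connect only tells you the port is reachable from where the script runs. It says nothing about whether that server accepts the GW as part of the Management Group, which is exactly what EventID 20000 is about.

```python
import socket


def can_connect(host, port=5723, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    5723 is the default SCOM agent/gateway port; pass another value
    if your Management Group was set up with a non-default port.
    """
    try:
        # create_connection resolves the name and opens the socket;
        # the 'with' block closes it again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run from the GW server, something like `can_connect("rms01.example.local")` and `can_connect("ms01.example.local")` (hypothetical names) shows which of the management servers the GW can actually reach through the firewall.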
So I installed a new SCOM GW in a totally different forest. And experienced the same issue! And all that time, the GW server which was installed some weeks ago was running just fine.
Dive Dive!: So it was time for a deep dive. We copied the file OpsMgrConnector.Config.xml from the RMS and the MS to another location and went through both copies. Soon we noticed a difference: the file from the RMS contained the Connector information for the fully functional GW server, while the one from the MS didn’t.
That’s strange! That GW server was installed by me using the GW Approval Tool, telling SCOM that the GW server should report to the MS and not the RMS. So this entry should be found in the file located on the MS, not the RMS! I checked my installation document for that particular environment and indeed, I referred to the MS, not the RMS….
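The comparison of the two copies can be sketched in a few lines of Python. This is only an illustration, not the way we did it at the time: it does a plain, case-insensitive text search for the GW name and deliberately assumes nothing about the XML layout of OpsMgrConnector.Config.xml. The function names, file names and GW name are hypothetical.

```python
from pathlib import Path


def read_text_any(path):
    """Read a file as text, handling UTF-16 (with BOM) as well as UTF-8."""
    data = Path(path).read_bytes()
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return data.decode("utf-16")
    return data.decode("utf-8", errors="replace")


def files_mentioning(name, paths):
    """Map each path to True/False: does the file mention 'name' anywhere?"""
    needle = name.lower()
    return {str(p): needle in read_text_any(p).lower() for p in paths}


if __name__ == "__main__":
    import sys

    # Usage: python find_gw.py <GW SERVER NAME> <copy of file 1> <copy of file 2> ...
    if len(sys.argv) >= 3:
        for path, found in files_mentioning(sys.argv[1], sys.argv[2:]).items():
            print(f"{path}: {'mentions' if found else 'does NOT mention'} {sys.argv[1]}")
```

Pointed at the copies taken from the RMS and the MS, this reports which management server actually holds the Connector information for a given GW. In our case that was the RMS only.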
Time to run a PS-cmdlet which shows WHICH MS the GW server is primarily talking to: Get-GatewayManagementServer | where {$_.Name -like '<GW SERVER NAME>'} | Get-PrimaryManagementServer.
And the output really puzzled me: the functional GW Server wasn’t talking to the MS but the RMS. Also the people running the firewall (TMG) told me that ONLY the RMS was being published, not the MS!
Now it all hit home! Wow!
The Solution: I stopped the Health Service on the problematic test GW server, removed the GW server from the SCOM R2 Console and reran the GW Approval Tool, this time referring to the RMS as the Management Server. Then I adjusted the registry on the GW server in order to reflect the RMS and not the MS, and restarted the Health Service on the GW.
All was working now!
Did the same for the problematic production GW server and hit the jackpot there as well!
However, some additional work still needs to be done, which will be planned for the days to come:
  1. Publish the MS instead of the RMS on the TMG;
  2. Reconfigure the GWs to talk to the MS and not the RMS (some simple PS-cmdlets will do the trick here);
  3. Adjust the registry entries on the GWs in order to reflect the changes.
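Why? It is not good practice to have GWs reporting to the RMS, which already carries the Management Group’s own workload (the SDK and Config services, among others).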
Puzzled: Yes, I am still puzzled. WHY does the first functional GW server talk to the RMS instead of the MS, while I ran the GW Approval Tool in such a manner that it should talk to the MS? Got the screen dumps showing it. Really felt stupid and taken by surprise. Also learned a valuable lesson: How to troubleshoot SCOM R2…