Tuesday, June 29, 2010

OpsMgr - Cross Platform Discovery Errors

The key to being able to monitor a server is being able to discover that server :). Until you can get the server into Operations Manager, you aren't going to be able to do much with it. While the discovery process for Unix and Linux servers seems simple enough, there is a lot going on behind the scenes that is hidden by the wizard. In a previous entry I went over a successful discovery path (OpsMgr and Cross Plat-Getting Started); for this post I'm going to go over some of the errors that can occur and how to resolve them.
The first one I'll talk about is Not Enough Entropy. This one required a little digging to figure out what was wrong. The exact error is: Failed to allocate resource of type random data: Failed to get random data - not enough entropy.

I've had this issue when discovering both RHEL and SLES servers, and it is related to certificate generation.
There are two ways to solve this problem: you can recreate the /dev/random device file or do a manual agent install.
For both fixes, first clean off the partially installed agent using the commands:

  1. rpm -e scx
  2. rm -rf /etc/opt/microsoft/scx
Then, if you want discovery to work from the wizard, recreate /dev/random with the device numbers of /dev/urandom (major 1, minor 9), which does not block waiting for entropy:
  1. rm /dev/random
  2. mknod -m 644 /dev/random c 1 9
  3. chown root:root /dev/random
A manual install requires copying the appropriate package from %ProgramFiles%\System Center Operations Manager 2007\AgentManagement\UnixAgents to the Unix/Linux machine and installing it directly.
After fixing the install issue, switch /dev/random back to the standard blocking random device (major 1, minor 8) using the commands:
  1. rm /dev/random
  2. mknod -m 644 /dev/random c 1 8
  3. chown root:root /dev/random
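Before and after swapping the device, it can help to confirm which device /dev/random actually is. Here is a minimal sketch, assuming GNU stat (its %t and %T formats print the major and minor device numbers in hex). The function names are my own and are not part of the agent:

```shell
# Map a "major:minor" pair (hex, as printed by GNU stat) to a device name.
classify_random_dev() {
  case "$1" in
    1:8) echo "random" ;;   # the blocking /dev/random device
    1:9) echo "urandom" ;;  # the non-blocking /dev/urandom device
    *)   echo "unknown" ;;
  esac
}

# Report which device a path currently is (defaults to /dev/random).
check_dev_random() {
  classify_random_dev "$(stat -c '%t:%T' "${1:-/dev/random}")"
}
```

Running check_dev_random should print urandom while the workaround is in place and random once you have switched back. On Linux, cat /proc/sys/kernel/random/entropy_avail shows how much entropy the kernel currently has available.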
Next, let's look at Unspecified Problem. I am sure there is a whole gamut of reasons why this one occurs. The text is: Starting Microsoft SCX CIM Server: Unspecified Problem.
The key here is that the statement "Generating certificate with hostname..." shows the certificate was generated, so we know we need to look at things after the certificate creation. The only cause I have found for this error is the firewall: after installation and certificate generation there is a validation step. If you watch the steps in the wizard, the error pops up almost immediately, so the wizard is unable to verify the agent, which suggests a communication issue. Ensure that port 1270 has been opened on the firewall and try to discover again.
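A quick way to test the firewall theory is to try a TCP connection to port 1270 from another Linux machine on the same side of the firewall as the management server. This is only a sketch, assuming bash (for its /dev/tcp feature) and the coreutils timeout command are available; the function names are hypothetical:

```shell
# Return 0 if a TCP connection to host ($1) on port ($2) succeeds within 5 seconds.
# Assumes bash (for /dev/tcp) and coreutils timeout are installed.
port_open() {
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Report whether the agent port (1270) is reachable on the given host.
check_agent_port() {
  if port_open "$1" 1270; then
    echo "port 1270 reachable on $1"
  else
    echo "port 1270 NOT reachable on $1"
  fi
}
```

Call it as check_agent_port followed by the name of the server you are trying to discover. A "NOT reachable" result points at the firewall (or at the agent service not running) rather than at the certificate.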
Some of the other errors I've run into over time are:
Access is Denied: this one pops up from time to time when an agent installation failed for some reason, you fixed the underlying cause, and tried again. The problem is that the partially installed agent is blocking the re-install. The fix is to clean off the agent and do a fresh install, the same way we did for Not Enough Entropy.
Cannot connect to port 1270: this one typically occurs when there is a library path issue on the monitored server. If you go to the server, you'll likely see that the service failed to start. Trying to restart the service will give you the name of the library that cannot be found.
The typical resolution path for Linux is:
  1. scxadmin -restart all
  2. See what library is missing 
  3. find / -name <name of the missing library>
  4. vi /etc/ld.so.conf 
  5. add path to missing library  
  6. ldconfig to reload dynamic loader  
  7. scxadmin -restart all   
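Step 2 can also be done without restarting the service, by asking ldd which shared libraries a binary cannot resolve. A minimal sketch follows; the helper names are my own, and the agent binary path in the usage note is an assumption about where the agent is installed, so adjust it for your system:

```shell
# Read ldd output on stdin and print the name of each library ldd could not find.
missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Given the full path to a library (from step 3), print the directory
# that needs to be added to /etc/ld.so.conf.
lib_dir() {
  dirname "$1"
}
```

For example, ldd /opt/microsoft/scx/bin/scxcimserver | missing_libs lists the unresolved libraries, and lib_dir on the path that find returns gives the line to add in step 5.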

The path for Solaris is the same for steps 1 - 3 but differs when it comes to setting the library path:
  1. crle to see the current path
  2. crle -l to update the path (include the old path plus the new path because the command is a replacement, not an append) 
  3. scxadmin -restart all  
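Because crle -l replaces the path rather than appending to it, it is worth building the new value from the current one instead of typing it by hand. This is a sketch only: it assumes crle prints a line starting with "Default Library Path (ELF):", and the helper names and the example directory are hypothetical:

```shell
# Extract the default library path from crle output read on stdin.
# Assumes a line of the form "Default Library Path (ELF): <path> ...".
current_crle_path() {
  sed -n 's/^[[:space:]]*Default Library Path (ELF):[[:space:]]*\([^[:space:]]*\).*$/\1/p'
}

# Build the full replacement path: existing entries first, then the new directory.
combined_lib_path() {
  echo "$1:$2"
}
```

Usage would look something like crle -l "$(combined_lib_path "$(crle | current_crle_path)" /usr/local/ssl/lib)", where /usr/local/ssl/lib stands in for whatever directory find reported.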

Can not resign certificate, /etc/opt/microsoft/ssl/scx-host-.pem already exists: in this situation the re-creation of a certificate was attempted but failed because there was a previously generated certificate on the target host. If you want to generate a new certificate, simply delete the contents of the /etc/opt/microsoft/ssl directory. Alternatively, you can export the certificate and trust it on the management server.

winrm failed to connect in a timely manner: this can happen if the target server is overloaded. OpenPegasus will time out after 20 seconds or so, and this can result in a failure to validate that the agent was properly installed. The fix here is to confirm the agent was in fact installed by running scxcimcli ei -n root/scx CIM_ManagedElement on the target server, and then retry the discovery.
There are many other things that could go wrong during discovery, but in most cases the error message you receive should help you determine how to fix the problem. One thing to watch is the phase at which the error occurred: Initial discovery (name resolution issues), Installation (user account issues), Signing (certificate issues), or Validation (configuration issues). Knowing where to start looking is half the battle in getting our servers successfully discovered.