Patched our lab this afternoon. All seemed well until I ran the SQL scripts on the OpsDB and DW. When I went back to my RMS to open the console and import the new MPs, I got an error saying the SDK service had not initialized and could not connect.
I saw a lot of 2115 warnings and 26319 errors. I reached out to my buddy Almquist and he recommended bouncing the SQL service. He said this was an issue that was supposed to be fixed in CU4. Apparently we ran into it during our upgrade from CU3 to CU5.
So if you have this issue, you now know the easy fix: bounce SQL and you will be good to go. Other than that I had no issues in my lab. I hope this continues in our pre-prod environment and then production.
Get it here.
Cumulative Update 5 for Operations Manager 2007 R2 resolves the following issues:
- Restart of non-Operations Manager services when the agent is updated.
- Updated ACS reports.
- TCP Port Probe incorrectly reports negative ping latency.
- MissingEvent Manual Reset Monitor does not work as expected.
- Drillthrough fails because of rsParameterTypeMismatch in the EnterpriseManagementChartControl.
- ACS – Event log message is truncated or corrupted in SCDW.
- UI hang caused by SDK locking.
- ACS Filter fails for certain wildcard queries.
- Edit Schedule button is disabled with SQL 2008 R2.
- Web console times out when you open the left navigation tree.
- Scheduled Reports view for Windows Server 2003 and for Microsoft SQL Server 2005 Reporting Services SP3 CU9 returns “System.IndexOutOfRangeException: Index was outside the bounds of the array.”
- Signed MPs cannot be imported when new attributes are added to existing classes.
Cross Platform Cumulative Update 5 for Operations Manager 2007 R2 resolves the following issues:
- Performance data for LVM managed partitions is not available.
- Process monitor does not keep name if run by using symbolic link.
- AIX with large number of processes crashes with bad alloc.
Cross Platform Cumulative Update 5 for Operations Manager 2007 R2 adds the following feature:
- Support for Red Hat 6
The SCOM team has released the R2 Admin ResKit. Go get it here!
The System Center Operations Manager 2007 Administration Resource Kit provides the following features to aid in management group administration:
- Scheduled Maintenance Mode – Ability to schedule and manage maintenance mode in the management group.
- Clean MOM – Helps remove all installed R2 components.
- MP Event Analyzer – MP Event Analyzer tool is designed to help a user with functional and exploratory testing and debugging of event based management pack workflows like rules and monitors.
Scheduled Maintenance Mode is designed to use the Operations Manager platform. Because of this, Maintenance Mode can be managed centrally instead of through a scheduled task solution. Also, all information is stored in the Operations Manager database, so nothing is lost in a disaster as long as the database has been backed up.
You will find a detailed guide on how to set up and use this tool in the download package.
The tool provides the following features:
- Ability to schedule any type of object to be placed into maintenance mode in the form of a Job
- Group support including nested groups
- Automatically places Health Service Watcher in maintenance with computer
- Blocks RMS from being placed in maintenance
- Support for Run Once, daily, weekly, and monthly schedules (including complex scenarios like “second Tuesday of the month”)
- Ability to cancel a maintenance Job where everything will be removed from maintenance automatically
- History Report
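To make the “second Tuesday of the month” scenario concrete, here is a minimal Python sketch of how such a date can be worked out. It is not how the ResKit tool does it internally (the download package does not say); the function name `nth_weekday_of_month` is my own, and it only uses the standard library.

```python
import calendar
from datetime import date


def nth_weekday_of_month(year: int, month: int, weekday: int, n: int) -> date:
    """Return the date of the n-th given weekday (Monday=0) in a month."""
    if n < 1:
        raise ValueError("n must be 1 or greater")
    # Weekday of the 1st of the month (Monday=0) and the number of days in it.
    first_weekday, days_in_month = calendar.monthrange(year, month)
    # Days from the 1st to the first occurrence of the requested weekday.
    offset = (weekday - first_weekday) % 7
    day = 1 + offset + 7 * (n - 1)
    if day > days_in_month:
        raise ValueError("that occurrence does not exist in this month")
    return date(year, month, day)


# Second Tuesday of January 2024.
print(nth_weekday_of_month(2024, 1, calendar.TUESDAY, 2))
```

A monthly job would call this once per month to find its next start date; asking for an occurrence that does not exist (a fifth Tuesday in a short month) raises an error instead of silently rolling into the next month.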
I applied the CU3 update to our infrastructure yesterday. Later, in the afternoon, I started to approve and process agent updates. In the evening I got pinged on OCS by our OCS and Group Chat engineer. He asked if I was doing an install on OCS because “SCOM” was restarting all of the OCS and GroupChat services. I told him that this wasn’t possible, that the agent install shouldn’t bounce application services. After looking at one of the boxes, it was apparent that RestartManager was bouncing several services after the SCOM agent update took place. I had patched other Windows 2008 servers earlier that day without any issue. I am still uncertain what caused this to happen on our OCS and GroupChat servers; however, if it happens to you, here is what you need to look for and what you need to do to resolve it.
Despite the push showing as “Successful”, you will find that some of the updates were not. The quick way to find them is through an alert view and/or this view in the console:
All of the above Critical states are agents that experienced problems during install. Pick one and log onto that box. Check the SCOM agent service and you will find it in a “Starting” state:
After you verify that the SCOM service is “Starting”, open Task Manager and you should find MOMAgentInstaller.exe still running:
Kill this and the HealthService.exe process:
Now start the SCOM agent service and verify your DLLs have been updated to the .49 version. The Application and SCOM event logs show what potentially happened. In the Application log we notice that after the SCOM agent install started, RestartManager started to cycle several services, and the SCOM agent had been hung since the incident started:
So be careful about pushing agent updates to Windows 2008 servers if the Restart Manager service is running and is allowed to run, as it may cause some application outages for you.
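If you have a handful of boxes stuck like this, the manual recovery above (kill the two processes, then start the agent service) could be scripted roughly as follows. This is only a sketch: it prints the commands and does nothing else unless you turn off `dry_run`. `taskkill` and `net start` are standard Windows commands, but the service name `HealthService` is my assumption for the SCOM agent service, so confirm it on your own servers first.

```python
import subprocess

# The two processes the post says to kill on a hung agent.
PROCESSES = ["MOMAgentInstaller.exe", "HealthService.exe"]


def recovery_commands(service="HealthService"):
    """Build the kill commands followed by the service start command.

    The default service name is an assumption, not taken from the post.
    """
    cmds = [["taskkill", "/F", "/IM", name] for name in PROCESSES]
    cmds.append(["net", "start", service])
    return cmds


def recover_agent(dry_run=True):
    """Print each command; run it locally only when dry_run is False."""
    for cmd in recovery_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=False: a process that is already gone should not
            # stop us from starting the service afterwards.
            subprocess.run(cmd, check=False)


recover_agent()  # dry run: prints the three commands
```

You would still want to check the DLL versions by hand afterwards, as described above.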
So I had to roll CU3 to production today and one of my agents was throwing an odd error:
The Agent Management Operation Agent Install failed for remote computer servername.domain.com.
Install account: myaccount
Error Code: 80070641
Error Description: The Windows Installer Service could not be accessed. This can occur if you are running Windows in safe mode, or if the Windows Installer is not correctly installed. Contact your support personnel for assistance.
Microsoft Installer Error Description:
For more information, see Windows Installer log file “(null)” on the Management Server.
I thought this was odd and had never seen it before. A quick “google” search turned up this KB, which mentioned that the Windows Installer service could be unregistered or corrupt. After I followed the steps in the article, I tried the agent update again and it was successful. Very nice!
So I rolled CU3 into our labs in the past week. Doing this on virtual machines and terminal servers with very low free space and a great distance between them on the network was not that fun. Regardless, I was able to patch my SCOM infrastructure. Our labs are a shared environment, so I moved on to agent updates and ran into a few problems. Two agents that I have noticed so far could not update. The agent logs referred to the CU1 update bits and how the agent was unable to locate them. I thought that was odd, so I had to jump on the box and see what was going on. I tried to do a remote uninstall, but that failed. I tried to do an uninstall of the agent from the actual server itself, but that failed as well with a pop-up box asking for the location of the momagent.msi file. I suspect something got corrupted in the registry, and I will now have to follow Jonathan’s blog post on how to brute force uninstall the agent from the few servers that are behaving like this.
Then, a few days after patching several agents, I checked the patch list and saw the following:
Before deploying CU3 I read Kevin’s post for guidance. I noticed he had commented that the patch list may not appear correctly on some patched agents:
“Note: experienced 100% success rate on the agent updates…. however, some of my agents are still reporting both the CU2 and CU3 in patchlist. I am investigating this as it should not be reporting this way.”
At the bottom of his post he did address this:
“4. Agentpatchlist information incomplete. The agent Patchlist is showing CU3, but also CU2 or CU1. The localization ENU update is not showing in patchlist. This appears to be related to the agents needing a reboot. Once they are rebooted, and a repair initiated, the patchlist column looks correct.”
I have quite a few like this, and didn’t want to have to do all of this in order to get it fixed. I verified that the DLLs on the agent were updated, and then I looked for the value that the discovery pulls from the registry of an agent that displayed a mixed up Patch List:
The key where this information is stored: HKLM\Software\Microsoft\CurrentVersion\Installer\UserData\S-1-5-18\Products7779052F1B26F94B\Patches
The reg keys represent the different patches applied and dictate the order they appear in the patch list. If we look at the values in the key we will notice something different between those that list the correct CU3 patch and those that list the CU3 and older patches:
If the State value is 1, then this patch display name will be listed in your patch list. If the State value is 2, then it will not be listed in the patch list view.
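The rule is simple enough to express as a small Python sketch. This only models the behaviour I observed; it does not read the registry, and the patch names in the example are made up for illustration.

```python
def visible_patches(patches):
    """Return the display names of patches whose State value is 1.

    `patches` is a list of (display_name, state) pairs, in the order
    the patch keys appear. Anything with a State other than 1 is
    left out, matching the behaviour described above.
    """
    return [name for name, state in patches if state == 1]


# Hypothetical agent that still lists an older CU alongside CU3.
mixed_up = [("CU2 (example name)", 1), ("CU3 (example name)", 1)]
# The same agent after the older patch's State is changed to 2.
fixed = [("CU2 (example name)", 2), ("CU3 (example name)", 1)]

print(visible_patches(mixed_up))
print(visible_patches(fixed))
```

In other words, flipping State from 1 to 2 on the older patch keys should be enough to drop them from the patch list view.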
When I followed Kevin’s advice, it did resolve the issue, but that meant I had several servers that I would first have to reboot and then repair (basically a reinstallation of the agent). In the lab that might be OK, but production may pose a bigger issue, especially if my lab patching is any indication of the percentage I will see in production. Furthermore, if the DLL files are updated on the agent, I would rather just use PSEXEC to batch a reg change on the State value and then bounce the health service on that agent. This would save a lot of time for me, and a lot of outages for our mission critical applications.
In the screen shot I say a repair is not necessary. This is not the “official” word from MSFT, just my observations from my lab. I will fix the remainder of my agents by modifying the registry and bouncing the health service, then let it cook for a while before I decide whether to use this method should this problem appear in production when we patch. I recommend you test this in your own environment before coming to any conclusions as to whether this is a viable workaround. If you feel comfortable with this solution, and have ensured all your workflows and monitoring are working as expected (also ensuring the DLLs are updated), then you may have saved yourself a lot of time wasted rebooting servers and doing repairs on agents. 😉
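Here is a rough Python sketch of what batching that change with PSEXEC could look like. It only builds and prints the command lines; nothing runs against a server. Several things are assumptions on my part: that the State value is a REG_DWORD, that the agent service is named `HealthService`, and the host name. The patch key is left as a placeholder on purpose, because you need to take the full key from your own agent's registry.

```python
def build_fix_commands(host, patch_key, state=2, service="HealthService"):
    """Build PsExec command lines that set State and bounce the service.

    host      -- target server name (hypothetical in the example below)
    patch_key -- full registry key of the patch entry to hide (placeholder)
    state     -- 2 hides the patch from the patch list, per the post
    service   -- assumed service name of the SCOM agent
    """
    target = "\\\\" + host  # PsExec expects \\computername
    return [
        # Assumes State is stored as a REG_DWORD; verify before use.
        ["psexec", target, "reg", "add", patch_key,
         "/v", "State", "/t", "REG_DWORD", "/d", str(state), "/f"],
        ["psexec", target, "net", "stop", service],
        ["psexec", target, "net", "start", service],
    ]


# Placeholder key: substitute the real Patches subkey from your agent.
PATCH_KEY = "HKLM\\<path to the Patches key>\\<patch id>"

for cmd in build_fix_commands("server01", PATCH_KEY):
    print(" ".join(cmd))
```

Printing first and running later is deliberate: it lets you review exactly what would hit each server before you hand the list to `subprocess.run` or paste it into a batch file.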
The only caveat I have seen so far is that you may still have an inconsistent patch list afterwards: on the agent I repaired, the patch list showed two CU3 patches (the one with and the one without the ENU Components), while the ones that I repaired now show just the CU3 patch without the ENU addition. If you are worried about that, just pick the one you want displayed and disable the rest.