We’ve had a support ticket open since August 2015 regarding hosts spontaneously disconnecting from our Windows-based vCenter Server. It’s actually the latest in a long line of support tickets related to the same issue, with the original going back to November 2014!
The ticket was created as high/critical, yet support has been very lax, and the prevailing suggestion is basically just to move our vCenter to a new server. I’d like to avoid such an ordeal.
It was suggested that we open a ticket with Microsoft because VMware thought the issue was with the OS, but Microsoft has been even less helpful. They suggested a single hotfix ( https://support.microsoft.com/en-us/kb/2775511 ) to remedy the situation, and it didn’t work. Originally we were told we needed another hotfix that covered kernel socket leaks, http://support.microsoft.com/kb/2577795; that one was applied in December 2014 but did not help.
Multiple times a day we get ‘Hosts not responding’ alerts from Log Insight. These emails are basically just a summary of events on vCenter that match the alarm for a host not responding. If you happen to be logged into vCenter via the vSphere Client, you’ll see the host get disconnected along with all of its VMs. This lasts for under a minute, but the logs on the host itself show no sign that it was aware of any networking issue. This leads us to believe the issue is with the vCenter server.
We’ve run perfmon and special scripts created by VMware support to monitor resource usage and network connectivity/port exhaustion while these events happen, but no smoking gun has been found. At one point VMware suggested we add more CPUs to the VM, going from an already overprovisioned 8 vCPU to 10, but that didn’t help.
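For anyone who wants to reproduce the port-exhaustion check, here is a minimal sketch of the kind of thing we watched for. This is not the script VMware support gave us; the function names `count_tcp_states` and `snapshot` are my own, and it assumes Python 3.7+ on the Windows vCenter server, where `netstat -an -p TCP` prints rows of the form `TCP  local  remote  STATE`.

```python
import subprocess
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per state from `netstat -an` style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed Windows row layout: TCP  local_addr  remote_addr  STATE
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            states[fields[3]] += 1
    return states


def snapshot() -> Counter:
    """Run netstat once and summarise it (Windows flag syntax assumed)."""
    result = subprocess.run(
        ["netstat", "-an", "-p", "TCP"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_tcp_states(result.stdout)
```

Calling `snapshot()` every few seconds and logging the result with a timestamp would show whether something like the `TIME_WAIT` count climbs just before a host drops. In our case nothing along these lines stood out.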
This is a relatively small environment of under 60 hosts and approximately 425 VMs, yet the vCenter server is configured with 8 vCPU and 20 GB of memory.
No other products are installed on the vCenter server other than the related VMware products: Web Client, VUM, Dump Collector, SSO, and Inventory Service. We run McAfee MOVE off-server virus scanning, which was called out by VMware support, but we need to run something, and this should be the best option since scanning is not done on the server itself. We updated the client to the latest version at their suggestion, but saw no change. We also have the Veritas NetBackup client installed for backups. That is another item we need, but we tried removing it temporarily and it didn’t help.
Support also thought that having vROps tied into this vCenter might be a problem, but we temporarily stopped data collection and saw no improvement.
At one point they blamed VDP and said an update to it would fix the problem. It didn’t.
At this point I believe they have given up on troubleshooting and would like us to just move this to a new server. I feel like that’s an easy way out for them and a lot of work on my part. I’m fairly confident the issue lies with the vCenter server, and surely more can be done to narrow things down and possibly fix the issue.
This VM was originally deployed as vCenter Server 5.1 and has been upgraded over the years to the current 5.5U3b. It runs Windows Server 2008 R2 Enterprise SP1, and SQL is running on a separate VM.
Has anybody had a similar issue they were able to solve, or have any suggestions other than a slash and burn approach?
Thanks