Since most of our helpdesk customers are sysadmins, we are starting a new series of posts on our blog, targeted specifically at server administrators: resolving common server issues, reviews of handy tools, etc.
Yesterday, after routinely installing updates on one of our database servers (Windows Server 2016 running SQL Server), we faced extremely high disk usage by the Windows Update service and its lieutenants, "TiWorker.exe" and the like. The high load didn't stop after 1 hour, 2 hours, 5 hours... 12 hours... 24 hours... So here are the steps we took to resolve it.
If your cloud server is sluggish, but the CPU and memory load look normal, you're probably facing a disk bottleneck. Cloud disks are often slow. In AWS, for example, once a disk drains all its "burst credits" it is throttled down to its baseline performance and becomes really slow.
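To get a feel for how quickly burst credits can run out, here is a rough back-of-the-envelope sketch. It assumes a gp2 (General Purpose SSD) EBS volume, which this post does not actually specify, and uses the figures AWS documents for that volume type: a baseline of 3 IOPS per GiB (minimum 100), bursting up to 3,000 IOPS, and a full bucket of 5.4 million I/O credits. The function name is made up for illustration.

```python
# Assumed gp2 figures (not taken from this post): 3 IOPS/GiB baseline,
# 100 IOPS minimum, 3,000 IOPS burst, 5.4 million credits in a full bucket.
BURST_IOPS = 3000
FULL_BUCKET_CREDITS = 5_400_000


def gp2_burst_seconds(size_gib):
    """Seconds a full credit bucket lasts under sustained 3,000 IOPS load.

    Returns None when the baseline already reaches the burst rate,
    because such a volume never depends on burst credits.
    """
    baseline_iops = max(100, 3 * size_gib)
    if baseline_iops >= BURST_IOPS:
        return None
    # Credits refill at the baseline rate while being spent at the burst rate.
    return FULL_BUCKET_CREDITS / (BURST_IOPS - baseline_iops)


if __name__ == "__main__":
    for size in (30, 100, 500):
        minutes = gp2_burst_seconds(size) / 60
        print(f"{size} GiB volume: bucket empties after ~{minutes:.0f} min")
```

Under these assumptions a small system disk can stay at full burst speed for well under an hour, which is nothing compared to a process that hammers the disk for a whole day.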
How do you tell which process is eating all the I/O? A simple way is to go to Task Manager - Details - Right-click the column headers - Select Columns - Add "I/O read bytes" and "I/O write bytes".
If you see the "svchost" process among the top I/O consumers, it means the load is caused by some system service running behind this process. To discover the actual service hiding behind it, right-click the process and choose "Go to services".
The "Services" tab will highlight the services using that PID.
This is how we discovered it was Windows Update that was hammering our disks. You can further use Process Monitor (Procmon) from Sysinternals to discover which files are being accessed (filtering the output by svchost's PID), but if it's been going on for hours, I bet it's reading from and writing to the "SoftwareDistribution" folder.
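If you'd rather script this check than click through Task Manager, here is a minimal sketch. It assumes the third-party psutil package is installed (it is not part of the Python standard library), and the helper names are ours. Like the Task Manager columns, the counters are cumulative since each process started.

```python
def top_io(samples, n=5):
    """Return the n samples with the most total I/O, biggest first.

    Each sample is a (pid, name, read_bytes, write_bytes) tuple.
    """
    return sorted(samples, key=lambda s: s[2] + s[3], reverse=True)[:n]


def collect_samples():
    """Collect cumulative I/O counters for all visible processes.

    Assumption: psutil is installed. It is imported here so that
    top_io() stays usable without it.
    """
    import psutil

    samples = []
    for proc in psutil.process_iter(["pid", "name", "io_counters"]):
        counters = proc.info["io_counters"]
        if counters is None:  # access denied for this process, skip it
            continue
        samples.append((proc.info["pid"], proc.info["name"],
                        counters.read_bytes, counters.write_bytes))
    return samples


if __name__ == "__main__":
    for pid, name, read_b, write_b in top_io(collect_samples()):
        print(f"{pid:>7}  {name:<25} read={read_b:>15,}  write={write_b:>15,}")
```

Run it from an elevated prompt, otherwise most system processes (svchost included) will be skipped as "access denied".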
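The same PID-to-service lookup can be done from the command line with the built-in `tasklist /svc` command. Below is a small sketch that parses its CSV output. The function names are ours, and the parser assumes the usual three columns (image name, PID, services). It reads columns by position, because the header text depends on the Windows display language.

```python
import csv
import io
import subprocess


def services_for_pid(tasklist_csv, pid):
    """Return the service names hosted by the given PID.

    tasklist_csv is the text printed by `tasklist /svc /fo csv`.
    """
    reader = csv.reader(io.StringIO(tasklist_csv))
    next(reader, None)  # skip the (localized) header row
    for row in reader:
        if len(row) >= 3 and row[1] == str(pid):
            names = [name.strip() for name in row[2].split(",")]
            # Processes that host no services report "N/A".
            return [name for name in names if name and name != "N/A"]
    return []


def live_services_for_pid(pid):
    """Run tasklist on this machine (Windows only) and look up the PID."""
    result = subprocess.run(["tasklist", "/svc", "/fo", "csv"],
                            capture_output=True, text=True, check=True)
    return services_for_pid(result.stdout, pid)
```

For example, `live_services_for_pid(1234)` (1234 being a made-up PID) returns the list of services running inside that svchost instance. Windows Update shows up under its short service name, `wuauserv`.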
Even after we temporarily upgraded our Amazon EBS disk drive to a super-fast Provisioned IOPS (PIOPS) disk (thanks to Amazon's new feature that allows modifying drives on-the-fly!) the load still continued. The update process just started taking more CPU, which had previously been "throttled" naturally by the slow disk.
The problem is probably caused by some corrupted downloaded files in the "SoftwareDistribution" folder, which is where Windows Update puts all the downloaded stuff.
First, stop the Windows Update service (this can take a while).
Then rename the C:\Windows\SoftwareDistribution\ folder to SoftwareDistribution.old (don't worry, the folder will be recreated after you restart the service).
Restart the service and enjoy the disk load falling back to normal.
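The three steps above can also be scripted. Here is a minimal sketch, assuming an elevated (administrator) prompt and the default folder location. It uses `net stop` and `net start` with `wuauserv`, the short name of the Windows Update service. The function name and the `run` parameter are ours: `run` is only there so the sequence can be rehearsed without touching a real service.

```python
import subprocess
from pathlib import Path

SERVICE = "wuauserv"  # short name of the Windows Update service


def reset_update_cache(windows_dir=r"C:\Windows", run=subprocess.run):
    """Stop Windows Update, set its download folder aside, start it again.

    Returns the path of the renamed folder so it can be deleted later.
    """
    folder = Path(windows_dir) / "SoftwareDistribution"
    backup = folder.with_name("SoftwareDistribution.old")

    # "net stop" waits until the service has actually stopped.
    run(["net", "stop", SERVICE], check=True)
    # The rename fails if a SoftwareDistribution.old folder already exists.
    folder.rename(backup)
    # The service recreates an empty SoftwareDistribution folder as needed.
    run(["net", "start", SERVICE], check=True)
    return backup


if __name__ == "__main__":
    print("Old download cache moved to", reset_update_cache())
```

If the service is still stopping when you try the rename by hand, Windows will complain that the folder is in use. Running the stop synchronously, as above, avoids that.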
Wait for your next outage window and check for updates again. After that, you can delete the ".old" directory.