Auto deploy – host register error

Not many are using Auto Deploy with stateless hosts, but I’m a big fan of it. Though sometimes I’m pulling my hair trying to figure out why stuff isn’t working since there is also very little documentation and experiences from others.

Recently we ran into an issue that after upgrading vCenter Server 6.5 to 7.0u2b, non of the hosts would come back after reboot. When trying to upgrade the hosts from 6.5 to 6.7, they failed picking up their boot image from the TFTP server.

/vmw/rbd/host-register… HTTP 5xx Server error
Could not boot: HTTP 5xx Server error
Could not boot image: HTTP 5xx Server error
Network error encountered while PXE booting.
Scanning the local disk for cached image.
<3>BANK5: not a VMware boot bank
<3>BANK6: not a VMware boot bank
<3>No hypervisor found
<3>Rebooting

TFTP was responding and delevering the tramp information, but something went wrong after that step. We tried deploying a different image to the host, forcing the image cache to update. We tested the host association with the deploy rules ( powershell: get-vmhost | Test-DeployRuleSetCompliance ). Removed the host from inventory and then boot again, but nothing worked.

Also the auto deploy database was restored from two days ago, using the following steps, but I’m not sure if this contributed to the solution. You might not need to do this, I’m just writing it down to be complete in the steps I took to solve it. Since after solving it, I can’t reproduce the issue, I can’t test if restoring the db was essential:

  • stop auto deploy service
  • rename the original database file to a backup name: rm /var/lib/rbd/db /var/lib/rbd/db.backup
  • in the /var/lib/rbd/ directory you’ll see other db files from previous days, copy one of those to db. In my case: cp /var/lib/rbd/db.2021-06-09 /var/lib/rbd/db
  • start the auto deploy service

With VMware Support we finally found an easy solution, which is to clear the cache.

  • find /storage/autodeploy/cache/ -name .hostprofile-secrets

This gave the following result in my case (just 3 out of 20 lines):

/storage/autodeploy/cache/5a/5a85137aaf3a16dd601217ac24a0de/.hostprofile-secrets
/storage/autodeploy/cache/5a/71789ae29999b51bebdc4e1ebfa058/.hostprofile-secrets
/storage/autodeploy/cache/db/ac07ad661c6957f4edd599b8c92450/.hostprofile-secrets

Next I had to delete all those cache directories, one by one:

  • rm -rf /storage/autodeploy/cache/5a/5a85137aaf3a16dd601217ac24a0de

After this the host immediately picked up it’s image and booted without issues.