vCenter appliance database issue - Gabes Virtual World

Recently I had an issue in my homelab environment. Because of some power outages, my vCenter Appliance hadn’t been shut down cleanly, and vCenter no longer started correctly. After some searching I found that the database could not be loaded. In the VMware KBs I couldn’t find anything that fixes the startup of the database itself. Most of them are about resetting the database, but even though my environment is quite small, I had VSAN running in it and was afraid of what would happen if I connected a clean vCenter to the existing hosts. So I decided to dive in and try to fix it at the database level.

To see what was going on, I first checked the vpxd.log (/var/log/vmware/vpxd/vpxd.log) and found that a login to the database was not possible:

info vpxd[7FF9A8AD97A0] [Originator@6876 sub=vpxdVdb] [VpxdVdb::SetDBType] Logging in to DSN: VMware VirtualCenter with username vc

error vpxd[7FF9A8AD97A0] [Originator@6876 sub=vpxdVdb] [VpxdVdb::SetDBType] Failed to connect to database: ODBC error: (08001) - [unixODBC]Could not connect to the server; --> Connection refused [127.0.0.1:5432].  Retry attempt: 1 ...
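On a busy appliance the vpxd.log is large, so it can help to filter it down to the database connection messages. The following is only a minimal sketch: the function name vpxd_db_errors is my own invention, and the pattern simply matches the two messages shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the vpxd log lines that concern the database login.
vpxd_db_errors() {
    # $1 = path to vpxd.log
    grep -E 'sub=vpxdVdb.*(Logging in to DSN|Failed to connect to database)' "$1"
}

# Only run against the real log when it exists (path as used in this post).
LOG=/var/log/vmware/vpxd/vpxd.log
if [ -f "$LOG" ]; then
    vpxd_db_errors "$LOG" | tail -n 20
fi
```

A "Connection refused [127.0.0.1:5432]" in that output points at the embedded PostgreSQL instance not listening, rather than at a credentials problem.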

Then I wanted to check if the database was running at all. In the database logs (/storage/db/vpostgres/pg_log/postgresql.log) I saw the following lines:

2016-09-10 19:02:12.294 UTC 57d458b4.21d8 0   LOG:  database system was interrupted; last known up at 2016-05-16 22:58:35 UTC

2016-09-10 19:02:14.920 UTC 57d458b4.21d8 0   LOG:  unexpected pageaddr E/C8000000 in log segment 000000010000000E000000CC, offset 0

2016-09-10 19:02:14.920 UTC 57d458b4.21d8 0   LOG:  invalid primary checkpoint record

2016-09-10 19:02:14.920 UTC 57d458b4.21d8 0   LOG:  unexpected pageaddr E/C8000000 in log segment 000000010000000E000000CC, offset 0

2016-09-10 19:02:14.920 UTC 57d458b4.21d8 0   LOG:  invalid secondary checkpoint record

2016-09-10 19:02:14.920 UTC 57d458b4.21d8 0   PANIC:  could not locate a valid checkpoint record

2016-09-10 19:02:14.920 UTC 57d458b1.20bf 0   LOG:  startup process (PID 8664) was terminated by signal 6: Aborted

2016-09-10 19:02:14.920 UTC 57d458b1.20bf 0   LOG:  aborting startup due to startup process failure
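To spot lines like these quickly, the PostgreSQL log can be filtered for its most severe message levels. This is a rough sketch under the assumption that the log uses the format shown above; pg_fatal_lines is a name I made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: show only PANIC and FATAL messages from a PostgreSQL log.
pg_fatal_lines() {
    # $1 = path to postgresql.log
    grep -E '(PANIC|FATAL):' "$1"
}

# Only run against the real log when it exists (path as used in this post).
PGLOG=/storage/db/vpostgres/pg_log/postgresql.log
if [ -f "$PGLOG" ]; then
    pg_fatal_lines "$PGLOG" | tail -n 5
fi
```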

Some Google assistance on “PANIC: could not locate a valid checkpoint record” taught me that the unclean shutdown had probably left the write-ahead log without a valid checkpoint record to recover from. The suggested solutions talked about using pg_resetxlog, which resets the write-ahead log and other control information of a PostgreSQL database cluster.

** Warning ** I can’t find anything on this command in the VMware KBs, so I want to emphasise that the next steps are unsupported, and I expect that resetting the write-ahead log will also cause some data loss. You’re on your own from here :-)

The command line for pg_resetxlog would be:

/opt/vmware/vpostgres/9.3/bin/pg_resetxlog -f  {Location of the database}

First I needed to find out where the database was located. This can be found in /etc/vmware-vpx/embedded_db.cfg in the following line:

EMB_DB_STORAGE='/storage/db/vpostgres'
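If you would rather not read the file by eye, the value can be pulled out with standard tools. This is a small sketch that assumes the line looks exactly like the one above (single-quoted value, no spaces around the equals sign); emb_db_storage is a hypothetical function name.

```shell
#!/bin/sh
# Hypothetical helper: print the value of EMB_DB_STORAGE from the config file.
emb_db_storage() {
    # $1 = path to embedded_db.cfg
    grep '^EMB_DB_STORAGE=' "$1" | cut -d= -f2 | tr -d "'"
}

# Only run against the real file when it exists (path as used in this post).
CFG=/etc/vmware-vpx/embedded_db.cfg
if [ -f "$CFG" ]; then
    emb_db_storage "$CFG"
fi
```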

Then when running the pg_resetxlog command, I received an error:

/opt/vmware/vpostgres/9.3/bin/pg_resetxlog -f  /storage/db/vpostgres

You must run pg_resetxlog as the PostgreSQL superuser

Hmm, the superuser? When looking at the directory contents of the /storage/db/vpostgres directory, I saw the user vpostgres had rights on this directory. So I tried running the command as the vpostgres user:

su vpostgres -s /bin/sh

/opt/vmware/vpostgres/9.3/bin/pg_resetxlog -f  /storage/db/vpostgres

This returned: Transaction log reset
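Since the reset is destructive, a more careful variant would be to archive the data directory first and to do a dry run before the real thing. The sketch below is not what I ran; backup_pgdata is a hypothetical helper and the archive path is a placeholder you would have to choose yourself. Standard PostgreSQL documents a -n option for pg_resetxlog that prints the values it would use without modifying anything, and I am assuming the vPostgres build behaves the same way.

```shell
#!/bin/sh
# Hypothetical helper: archive a data directory before touching it.
backup_pgdata() {
    # $1 = data directory, $2 = archive file to create
    tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# Example usage (not executed here; pick an archive location with enough free space):
#   backup_pgdata /storage/db/vpostgres /some/path/vpostgres-before-reset.tar.gz
#
# Dry run as the vpostgres user (assumes -n works as in stock PostgreSQL):
#   /opt/vmware/vpostgres/9.3/bin/pg_resetxlog -n /storage/db/vpostgres
```

With an archive in place, a failed reset can at least be rolled back to the state the database was in before the attempt.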

I then tried to start vpxd again (service vmware-vpxd start), but again it took a long time. I could see in the logs that it was waiting for services on port 8089, and since I had stopped and started a number of services during my troubleshooting, I decided to just reboot the appliance. After the reboot, vCenter was up and running again and I could reconnect without any issues.