Overview
We recently discovered an issue affecting versions 1.1.0, 1.1.1, and 1.1.2 of the Genesis Cloud Foundry Kit. Locket and Diego were incorrectly configured to place their tables in the `postgres` database, instead of their dedicated `locketdb` and `diegodb` databases.
While this does not affect day-to-day operation of Cloud Foundry, we still see this as an issue that needs to be corrected as soon as possible.
Please note that although your PostgreSQL instance may already contain `locketdb` and `diegodb` databases, they are empty and unused. All tables that should be in those databases were instead created in the `postgres` database.
Analysis
A major rewrite of the kit's internal Genesis YAML occurred for the 1.1.0 release, as various components were upgraded and required configuration changes. One of these changes was to the database connection information.
The BOSH releases for `diego` and `locket` use the term `db_schema` to refer to the name of the desired database, and `db_driver` to set the type of database used (typically `mysql` or `postgres`):

```
sql.db_schema:
  description: "Database name to use for connecting to SQL backend"
  default: ""
sql.db_driver:
  description: "Database driver to use for SQL backend (for example: mysql,postgres)"
  default: mysql
```
Most other SQL configuration YAML blocks, by contrast, use a variation of `db_name` or `database.name` to refer to the desired database name, and `db_type` or `database.type` to refer to the type of database used:

```
database.name:
  description: "Name of logical database to use."
database.type:
  description: "Type of database: postgres or mysql"
```
Mistakenly, we set `sql.db_schema` to an internal `db_scheme` variable that Genesis sets according to operator configuration. Depending on environment settings, `sql.db_schema` was therefore set to either `postgres` or `mysql`:

```
sql:
  db_host:     (( grab params.locket_db_host ))
  db_port:     (( grab params.locket_db_port ))
  db_schema:   (( grab params.locket_db_scheme ))
  db_username: (( grab params.locket_db_user ))
  db_password: (( grab params.locket_db_password ))
  db_driver:   (( grab params.locket_db_scheme ))
```
Because PostgreSQL has a default database named `postgres`, and MySQL has a default database named `mysql`, our mistake went unnoticed: Cloud Foundry still operated normally, as it was able to access and write to those databases.
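For illustration, a stock PostgreSQL instance already has a `postgres` database out of the box, so connections configured with `db_schema: postgres` succeeded without complaint (abridged, illustrative `psql` output; owners and encodings will vary by deployment):

```
$ psql -U vcap -p 5432 -l
   Name    | Owner | Encoding
-----------+-------+----------
 diegodb   | vcap  | UTF8
 locketdb  | vcap  | UTF8
 postgres  | vcap  | UTF8
```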
To fix this, we are making changes to `db_schema` to reference the proper `db_name` variable set. To mitigate future occurrences of internal YAML variable misuse, we plan to organize the Cloud Foundry kit's `params` variables into `meta` and `params`, a paradigm that we've used successfully for newer kits.
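As a rough sketch of that paradigm (illustrative names only, not necessarily the kit's final layout), internal wiring moves under `meta`, leaving `params` purely operator-facing:

```
meta:
  locket:
    # internal plumbing, derived from operator input
    db_name:   (( grab params.locket_db_name ))
    db_scheme: (( grab params.locket_db_scheme ))

params:
  # operator-facing configuration only
  locket_db_name:   locketdb
  locket_db_scheme: postgres
```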
Impact
We believe the impact to the stability and runtime of your Cloud Foundry deployments to be minimal. However, it is in your best interest to correct this problem, as confusion stemming from disorganized data may arise. Specifically, it's possible that selective backups of Cloud Foundry databases expected Diego and Locket data to be in their properly named databases; however, backups taken of `diegodb` and `locketdb` are empty.
The fix for this issue will require downtime. In our testing, applications were inaccessible for 15 minutes while the fix was being applied.
The Process
The following sections contain step-by-step instructions on how to successfully migrate Locket and Diego data out of `postgres` and into their respective databases. This guide will note approximately when downtime starts and ends.
There are different sets of steps to take depending on your Cloud Foundry environment. Please choose the one that matches your deployment, using the matrix below:
Kit Version | Genesis Feature | Relevant Section
---|---|---
1.1.0, 1.1.1, 1.1.2 | local-db | Standalone Local PostgreSQL
 | local-ha-db | Highly-Available Local PostgreSQL
 | postgres | External PostgreSQL
 | mysql | External MySQL
Standalone Local PostgreSQL
This process applies if you deployed with the `local-db` feature.
Am I Affected?
Execute the following command, which will list the tables currently in the `postgres` database:
```
bosh ssh -e [environment name] -d [deployment name] postgres/0 -rc \
  "/var/vcap/packages/postgres/bin/psql -U vcap -p 5432 postgres -t -c '\dt' 2>&1"
```
Next to the `stdout` column, there will be a list of tables. Look for any of the following:
```
public | actual_lrps    | table | diegoadmin
public | configurations | table | diegoadmin
public | desired_lrps   | table | diegoadmin
public | domains        | table | diegoadmin
public | locks          | table | locketadmin
public | tasks          | table | diegoadmin
```
If any of your output matches the above, you are affected and must follow the rest of this guide to upgrade to 1.2.0 or beyond. Otherwise, you can go brew yourself a cup of tea and pursue other endeavors.
Step A1: Getting Started
We'll frequently reference `[environment name]` and `[deployment name]` during this guide. `[environment name]` refers to the name of the BOSH director that your Cloud Foundry is deployed to. `[deployment name]` refers to the name of your Cloud Foundry deployment.
Before starting, ensure your Cloud Foundry deployment is running properly by running smoke tests. If the tests pass, continue on to Step A2. Otherwise, please diagnose and correct any errors; do not continue until smoke tests pass.
Once you finish this guide, you will once again execute smoke tests to validate your Cloud Foundry deployment. It's a good idea to run smoke tests now so that, if the final smoke tests fail, you can determine whether it was this migration process that caused the failure rather than previous changes.
To execute smoke tests, run the following command on your Cloud Foundry deployment:
```
bosh -e [environment name] -d [deployment name] run-errand smoke-tests
```
Step A2: Stage Genesis Environment File for Deploy
The Genesis Cloud Foundry Kit version 1.2.0 includes the fix for GMP-CF-001. Download it via `genesis download cf/1.2.0` in your Cloud Foundry deployments folder. Then, edit your environment file and set `version:` to `version: 1.2.0`. Do not deploy yet; that step will come last.
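Your environment file's kit stanza should then look something like this (a sketch of the standard Genesis layout; your other settings remain untouched):

```
kit:
  name:    cf
  version: 1.2.0
```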
Step A3: Stop All BBS VMs
In order to migrate the data, we first need to stop the processes that access the affected DB tables. We'll do this by BOSH SSHing into all of the BBS instances.
To do so, find out how many BBS instances are running via:
```
bosh -e [environment name] -d [deployment name] vms | grep bbs
```
This will print out a list of all BBS instances deployed in your CF environment.
For each entry on the list, copy the `bbs/instance-guid-here` value and run:
```
bosh -e [environment name] -d [deployment name] ssh bbs/instance-guid-here
```
Once connected to your VM, become the root user:

```
sudo -i
```

Stop all processes on the VM:

```
monit stop all
```

Log out of the root user:

```
exit
```

End the SSH session:

```
exit
```
Repeat these steps for each BBS instance listed above.
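If you have several BBS instances, you can optionally stop them all in a single pass; the bosh CLI will run a command on every instance of a group when given only the group name (a convenience sketch; confirm the behavior with your bosh CLI version before relying on it):

```
# Stop all monit-managed processes on every bbs instance at once
bosh -e [environment name] -d [deployment name] ssh bbs -c \
  "sudo /var/vcap/bosh/bin/monit stop all"
```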
Approximately 2 minutes after stopping all processes on all VMs, routes to your apps will no longer work and downtime begins.
Step A4: Connect to your Postgres VM
To BOSH SSH into your Postgres VM, execute the following:
```
bosh -e [environment name] -d [deployment name] ssh postgres
```
Once connected, become the root user:
```
sudo -i
```
It is necessary to add a directory to your PATH, as the commands in the following steps will require it. To do so, run:

```
export PATH=$PATH:/var/vcap/packages/postgres/bin
```
Step A5: Migrate the Data
We'll need a place to store SQL dumps, so create a directory under the persistent data directory by running:

```
mkdir /var/vcap/store/migration
cd /var/vcap/store/migration
```
Dump the data in the `postgres` database into files named after their proper database names:

```
pg_dump -U vcap -p 5432 -t locks postgres > locketdb.sql
pg_dump -U vcap -p 5432 -c -t actual_lrps -t configurations -t desired_lrps -t domains -t tasks postgres > diegodb.sql
```
You will then have two files under `/var/vcap/store/migration`: `locketdb.sql` will contain the Locket database data, and `diegodb.sql` will contain the Diego database data.
Now that we've exported the data from `postgres`, we want to import it back into the properly named databases:

```
psql -U vcap -p 5432 locketdb < locketdb.sql
psql -U vcap -p 5432 diegodb < diegodb.sql
```
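Before the full diff in the next step, a quick row-count comparison can catch gross import failures early (illustrative; the `tasks` table is shown, but any of the migrated tables works):

```
# Row counts should match between the old and new locations
psql -U vcap -p 5432 postgres -t -c "SELECT count(*) FROM tasks"
psql -U vcap -p 5432 diegodb  -t -c "SELECT count(*) FROM tasks"
```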
Step A6: Verify the Data Migration was Successful
To make certain that no GUIDs were changed and that all primary keys stayed the same, we want to re-export the recently imported data and verify against our original export:
```
pg_dump -U vcap -p 5432 -t locks locketdb > locketdb.new.sql
pg_dump -U vcap -p 5432 -c -t actual_lrps -t configurations -t desired_lrps -t domains -t tasks diegodb > diegodb.new.sql
```
Verify that the recently exported data has no differences from the originally exported data:

```
diff locketdb.sql locketdb.new.sql
diff diegodb.sql diegodb.new.sql
```
If all went well, there should be no output from either of those `diff` commands.
Step A7: Rename the Old Tables
Now that we've migrated the data to their proper databases, we want to prevent any old configurations from referencing the stale copies in `postgres`. To do so:

```
psql -U vcap -p 5432 postgres -c "ALTER TABLE locks RENAME TO locks_old"
psql -U vcap -p 5432 postgres -c "ALTER TABLE actual_lrps RENAME TO actual_lrps_old"
psql -U vcap -p 5432 postgres -c "ALTER TABLE configurations RENAME TO configurations_old"
psql -U vcap -p 5432 postgres -c "ALTER TABLE desired_lrps RENAME TO desired_lrps_old"
psql -U vcap -p 5432 postgres -c "ALTER TABLE domains RENAME TO domains_old"
psql -U vcap -p 5432 postgres -c "ALTER TABLE tasks RENAME TO tasks_old"
```
This way, any software that references the old tables should fail, making it easier to discover improperly configured processes.
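For example, a process still pointed at the old location now fails loudly instead of silently reading stale data:

```
$ psql -U vcap -p 5432 postgres -c "SELECT * FROM locks LIMIT 1"
ERROR:  relation "locks" does not exist
```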
Step A8: Redeploy Genesis Environment
If all went well, the next step is to redeploy your Cloud Foundry environment with the v1.2.0 fix, which will point Locket and Diego at their properly named databases:

```
genesis deploy [genesis environment name]
```

After the redeploy, BBS will begin running again and routes will be restored within a minute. Downtime ends here.
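One simple way to watch for recovery is to poll one of your own application routes until it responds again (illustrative; run it from any machine that can reach your apps, substituting a real route):

```
# Prints the HTTP status code every 5 seconds; a 200 means routes are back
watch -n 5 'curl -s -o /dev/null -w "%{http_code}\n" https://your-app.example.com'
```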
Step A9: Cleanup & Verification (Final step)
Execute smoke tests to verify that your redeployed Cloud Foundry environment is operating properly. Once you are satisfied with the results of the migration, you may delete the old tables:
```
psql -U vcap -p 5432 postgres -c "DROP TABLE locks_old CASCADE"
psql -U vcap -p 5432 postgres -c "DROP TABLE actual_lrps_old CASCADE"
psql -U vcap -p 5432 postgres -c "DROP TABLE configurations_old CASCADE"
psql -U vcap -p 5432 postgres -c "DROP TABLE desired_lrps_old CASCADE"
psql -U vcap -p 5432 postgres -c "DROP TABLE domains_old CASCADE"
psql -U vcap -p 5432 postgres -c "DROP TABLE tasks_old CASCADE"
```
Highly-Available Local PostgreSQL
This process applies if you deployed with the `local-ha-db` feature.
Am I Affected?
Execute the following command, which will list the tables currently in the `postgres` database:
```
bosh ssh -e [environment name] -d [deployment name] postgres/0 -rc \
  "/var/vcap/packages/postgres/bin/psql -U vcap -p 6432 postgres -t -c '\dt' 2>&1"
```
Next to the `stdout` column, there will be a list of tables. Look for any of the following:
```
public | actual_lrps    | table | diegoadmin
public | configurations | table | diegoadmin
public | desired_lrps   | table | diegoadmin
public | domains        | table | diegoadmin
public | locks          | table | locketadmin
public | tasks          | table | diegoadmin
```
If any of your output matches the above, you are affected and must follow the rest of this guide to upgrade to 1.2.0 or beyond. Otherwise, you can go brew yourself a cup of tea and pursue other endeavors.
Step B1: Getting Started
We'll frequently reference `[environment name]` and `[deployment name]` during this guide. `[environment name]` refers to the name of the BOSH director that your Cloud Foundry is deployed to. `[deployment name]` refers to the name of your Cloud Foundry deployment.
Before starting, ensure your Cloud Foundry deployment is running properly by running smoke tests. If the tests pass, continue on to Step B2. Otherwise, please diagnose and correct any errors; do not continue until smoke tests pass.
Once you finish this guide, you will once again execute smoke tests to validate your Cloud Foundry deployment. It's a good idea to run smoke tests now so that, if the final smoke tests fail, you can determine whether it was this migration process that caused the failure rather than previous changes.
To execute smoke tests, run the following command on your Cloud Foundry deployment:
```
bosh -e [environment name] -d [deployment name] run-errand smoke-tests
```
Step B2: Stage Genesis Environment File for Deploy
The Genesis Cloud Foundry Kit version 1.2.0 includes the fix for GMP-CF-001. Download it via `genesis download cf/1.2.0` in your Cloud Foundry deployments folder. Then, edit your environment file and set `version:` to `version: 1.2.0`. Do not deploy yet; that step will come last.
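Your environment file's kit stanza should then look something like this (a sketch of the standard Genesis layout; your other settings remain untouched):

```
kit:
  name:    cf
  version: 1.2.0
```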
Step B3: Stop All BBS VMs
In order to migrate the data, we first need to stop the processes that access the affected DB tables. We'll do this by BOSH SSHing into all of the BBS instances.
To do so, find out how many BBS instances are running via:
```
bosh -e [environment name] -d [deployment name] vms | grep bbs
```
This will print out a list of all BBS instances deployed in your CF environment.
For each entry on the list, copy the `bbs/instance-guid-here` value and run:
```
bosh -e [environment name] -d [deployment name] ssh bbs/instance-guid-here
```
Once connected to your VM, become the root user:

```
sudo -i
```

Stop all processes on the VM:

```
monit stop all
```

Log out of the root user:

```
exit
```

End the SSH session:

```
exit
```
Repeat these steps for each BBS instance listed above.
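If you have several BBS instances, you can optionally stop them all in a single pass; the bosh CLI will run a command on every instance of a group when given only the group name (a convenience sketch; confirm the behavior with your bosh CLI version before relying on it):

```
# Stop all monit-managed processes on every bbs instance at once
bosh -e [environment name] -d [deployment name] ssh bbs -c \
  "sudo /var/vcap/bosh/bin/monit stop all"
```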
Approximately 2 minutes after stopping all processes on all VMs, routes to your apps will no longer work and downtime begins.
Step B4: Connect to your Postgres VM
For Highly-Available PostgreSQL installations, all data manipulation commands must execute against the write master. To determine which of your two `postgres/*` nodes is the master, execute the following command:

```
bosh ssh -e [environment name] -d [deployment name] postgres/0 -rc \
  "/var/vcap/packages/postgres/bin/psql -U vcap -p 6432 postgres -t -c 'SELECT pg_is_in_recovery()' 2>&1"
```
If the `stdout` column reads `f`, then `postgres/0` is your master. If the printed output is `t`, then `postgres/1` is your master.
To BOSH SSH into your Postgres VM, execute the following, substituting `#` with the number of your master node:
```
bosh -e [environment name] -d [deployment name] ssh postgres/#
```
Once connected, become the root user:
```
sudo -i
```
It is necessary to add a directory to your PATH, as the commands in the following steps will require it. To do so, run:

```
export PATH=$PATH:/var/vcap/packages/postgres/bin
```
Step B5: Migrate the Data
We'll need a place to store SQL dumps, so create a directory under the persistent data directory by running:

```
mkdir /var/vcap/store/migration
cd /var/vcap/store/migration
```
Dump the data in the `postgres` database into files named after their proper database names:

```
pg_dump -U vcap -p 6432 -t locks postgres > locketdb.sql
pg_dump -U vcap -p 6432 -c -t actual_lrps -t configurations -t desired_lrps -t domains -t tasks postgres > diegodb.sql
```
You will then have two files under `/var/vcap/store/migration`: `locketdb.sql` will contain the Locket database data, and `diegodb.sql` will contain the Diego database data.
Now that we've exported the data from `postgres`, we want to import it back into the properly named databases:

```
psql -U vcap -p 6432 locketdb < locketdb.sql
psql -U vcap -p 6432 diegodb < diegodb.sql
```
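Before the full diff in the next step, a quick row-count comparison can catch gross import failures early (illustrative; the `tasks` table is shown, but any of the migrated tables works):

```
# Row counts should match between the old and new locations
psql -U vcap -p 6432 postgres -t -c "SELECT count(*) FROM tasks"
psql -U vcap -p 6432 diegodb  -t -c "SELECT count(*) FROM tasks"
```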
Step B6: Verify the Data Migration was Successful
To make certain that no GUIDs were changed and that all primary keys stayed the same, we want to re-export the recently imported data and verify against our original export:
```
pg_dump -U vcap -p 6432 -t locks locketdb > locketdb.new.sql
pg_dump -U vcap -p 6432 -c -t actual_lrps -t configurations -t desired_lrps -t domains -t tasks diegodb > diegodb.new.sql
```
Verify that the recently exported data has no differences from the originally exported data:

```
diff locketdb.sql locketdb.new.sql
diff diegodb.sql diegodb.new.sql
```
If all went well, there should be no output from either of those `diff` commands.
Step B7: Rename the Old Tables
Now that we've migrated the data to their proper databases, we want to prevent any old configurations from referencing the stale copies in `postgres`. To do so:

```
psql -U vcap -p 6432 postgres -c "ALTER TABLE locks RENAME TO locks_old"
psql -U vcap -p 6432 postgres -c "ALTER TABLE actual_lrps RENAME TO actual_lrps_old"
psql -U vcap -p 6432 postgres -c "ALTER TABLE configurations RENAME TO configurations_old"
psql -U vcap -p 6432 postgres -c "ALTER TABLE desired_lrps RENAME TO desired_lrps_old"
psql -U vcap -p 6432 postgres -c "ALTER TABLE domains RENAME TO domains_old"
psql -U vcap -p 6432 postgres -c "ALTER TABLE tasks RENAME TO tasks_old"
```
This way, any software that references the old tables should fail, making it easier to discover improperly configured processes.
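For example, a process still pointed at the old location now fails loudly instead of silently reading stale data:

```
$ psql -U vcap -p 6432 postgres -c "SELECT * FROM locks LIMIT 1"
ERROR:  relation "locks" does not exist
```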
Step B8: Redeploy Genesis Environment
If all went well, the next step is to redeploy your Cloud Foundry environment with the v1.2.0 fix, which will point Locket and Diego at their properly named databases:

```
genesis deploy [genesis environment name]
```

After the redeploy, BBS will begin running again and routes will be restored within a minute. Downtime ends here.
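One simple way to watch for recovery is to poll one of your own application routes until it responds again (illustrative; run it from any machine that can reach your apps, substituting a real route):

```
# Prints the HTTP status code every 5 seconds; a 200 means routes are back
watch -n 5 'curl -s -o /dev/null -w "%{http_code}\n" https://your-app.example.com'
```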
Step B9: Cleanup & Verification (Final step)
Execute smoke tests to verify that your redeployed Cloud Foundry environment is operating properly. Once you are satisfied with the results of the migration, you may delete the old tables:
```
psql -U vcap -p 6432 postgres -c "DROP TABLE locks_old CASCADE"
psql -U vcap -p 6432 postgres -c "DROP TABLE actual_lrps_old CASCADE"
psql -U vcap -p 6432 postgres -c "DROP TABLE configurations_old CASCADE"
psql -U vcap -p 6432 postgres -c "DROP TABLE desired_lrps_old CASCADE"
psql -U vcap -p 6432 postgres -c "DROP TABLE domains_old CASCADE"
psql -U vcap -p 6432 postgres -c "DROP TABLE tasks_old CASCADE"
```
External PostgreSQL
This process applies if you deployed with the `postgres` feature.
We do not currently have migration steps for this scenario. Please reach out to us on Genesis Slack.
External MySQL
This process applies if you deployed with the `mysql` feature.
We do not currently have migration steps for this scenario. Please reach out to us on Genesis Slack.
Help & Support
If you have concerns about the impact of this migration process, or need assistance running through it, please don't hesitate to find us in #help on Slack.