Tuesday, October 28, 2014

One more time: Replication is no substitute for good backups.

I don't know how many times I have had to try to drum this into clients' heads. Having an up to date replica won't protect you against certain kinds of failures. If you really want to protect your data, you need to use a proper backup solution - preferable a continuous backup solution. The ones I prefer to use are barman and wal-e. Both have strengths and weaknesses, but both are incredibly useful, and fairly well documented and simple to set up. If you're not using one of them, or something similar, your data is at risk.

(In case you haven't guessed, today is another of those days when I'm called in to help someone where the master and the replica are corrupted and the last trusted pg_dump backup is four days old and rolling back to it would cost a world of pain. I like these jobs. They can stretch your ingenuity, and no two are exactly alike. But I'd still rather be paid for something more productive.)

Monday, October 6, 2014

pg_repack pitfalls

pg_repack is a terrific tool for allowing you to reorganize a table without needing to hold long running strong locks on the table. That means that that your normal inserts, updates and deletes can continue to run against the table while the reorganization is proceeding.

I have had clients who have run into problems with it, however. In particular, it is possible to get it wedged so that the table is inaccessible and nothing can proceed, unless you either kill the repack operation or kill what is blocking it. Here is a simple example of how to cause problems.

In session 1, do:
pg_repack -e -t foo dbnameset

and in session 2 in psql do:
select pg_sleep(10); lock table foo; rollback;
The sleep gets us past the time when pg_repack is setting up, and happens while it is is doing its CREATE TABLE ... AS SELECT .... When that CREATE TABLE statement finishes, both sessions will be wedged.  Session 2 will be hung because it is unable to lock the table, since pg_repack's other session will hold a weak lock on the table. And nothing, including pg_repack, will be able to do anything with the table.

The solution is to make sure that nothing holds or even tries to obtain any strong long running locks on the table.

One useful thing is to use the check_postgres.pl monitor script to look for things like long running transactions and processes waiting for locks.

Or you can create a more customized test to look for this exact situation.

Most importantly, you need to be aware that problems can occur, and to protect against them happening in the first place.





Friday, October 3, 2014

Towards a PostgreSQL Benchfarm

For years I have been wanting to set up a farm of machines, modelled after the buildfarm, that will run some benchmarks and let us see performance regressions. Today I'm publishing some progress on that front, namely a recipe for vagrant to set up an instance on AWS of the client I have been testing with. All this can be seen on the PostgreSQL Buildfarm Github Repository on a repo called aws-vagrant-benchfarm-client. The README explains how to set it up. The only requirement is that you have vagrant installed and the vagrant-aws provider set up (and, of course, an Amazon AWS account to use).

Of course, we don't want to run members of the benchfarm on smallish AWS instances. But this gives me (and you, if you want to play along) something to work on, and the provisioning script documents all the setup steps rather than relying on complex instructions.

The provisioner installs a bleeding edge version of the buildfarm client's experimental Pgbench module, which currently only exists on the "benchfarm" topic branch. This module essentially runs Greg Smith's pgbench-tools suite, gets the results from the results database's "tests" table, and bundles it as a CSV for upload to the server.

Currently the server does nothing with it. This will just look like another buildfarm step. So the next thing to do is to get the server to start producing some pretty and useful graphs. Also, we need to decide what else we might want to capture.