Open Source Posts

Shaun M. Thomas: Finally Done With High Availability

From Planet PostgreSQL. Published on Jul 29, 2014.

Well, my publisher recently informed me that the book I’ve been slaving over for almost a year is finally finished. I must admit that PostgreSQL 9 High Availability Cookbook is somewhat awkward as a title, but that doesn’t detract from the contents. I’d like to discuss primarily why I wrote it.

When Packt first approached me in October of 2013, I was skeptical. I have to admit that I’m not a huge fan of the “cookbook” style they’ve been pushing lately. Yet, the more I thought about it, the more I realized it was something the community needed. I’ve worked almost exclusively with PostgreSQL since late 2005, with databases big and small. It was always the big ones that presented difficulties.

Back then, disaster recovery nodes were warm standby through continuous recovery at best, and pg_basebackup didn’t exist. Nor did pg_upgrade, actually. Everyone had their own favorite backup script, and major upgrades required dumping the entire database and importing it in the new version. To work with PostgreSQL then required a much deeper understanding than is necessary now. Those days forced me to really understand how PostgreSQL functions, which caveats to acknowledge, and which needed redress.

One of those caveats that still called out to me was adoption. With a lot of the rough edges removed in recent releases of PostgreSQL came increased usage in small and large businesses alike. I fully expected PostgreSQL to be used in a relatively small customer acquisition firm, for instance, but then I started seeing it in heavy-duty financial platforms. Corporate deployments of PostgreSQL require various levels of high availability, from redundant hardware all the way to WAL stream management and automated failover systems.

When I started working with OptionsHouse in 2010, their platform handled 8,000 database transactions per second. Over the years, that has increased to around 17k, and I’ve seen spikes over 20k. At these levels, standard storage solutions break down, and even failover systems are disruptive. Any outage must be as short as possible, and the replacement system must be instantly usable with little to no dependency on cache warming. Our backup system had to run on the warm standby or risk slowing down our primary database. Little by little, I broke the database cluster into assigned roles to stave off the total destruction I felt was imminent.

I was mostly scared of the size of the installation and its amount of activity. Basic calculations told me the database handled over a billion queries per day, at such a rate that even one minute of downtime could potentially cost us tens of thousands in commissions. But I had no playbook. There was nothing I could use as a guide so that I knew what to look for when things went wrong, or how I could build a stable stack that generally took care of itself. It was overwhelming.

This book, as overly verbose as the title might be, is my contribution to all of the DBAs out there who might have to administer a database that demands high availability. It’s as in-depth as I could get without diverging too much from the cookbook style, and there are plenty of links for those who want to learn beyond the scope of its content. The core, however, is there. Anyone with a good understanding of Linux could pick it up and weave a highly available cluster of PostgreSQL systems without worrying or having to build too many of their own tools.

If I’ve helped even one DBA with this high availability book, I’ll consider my mission accomplished. It’s the culmination of years of experimentation, research, and performance testing. I owe it to the PostgreSQL community—which has helped me out of many jams—to share my experience however I can.

Thanks, everyone!

Josh Berkus: Improved TOAST corruption function

From Planet PostgreSQL. Published on Jul 28, 2014.

Recently I had reason to try to locate TOAST corruption on an older server (running 8.4.0), and found that the prior script I posted didn't always find the corruption.  As such, below is an improved function, which can easily be changed to a DO statement or whatever you want.

Note that TOAST corruption isn't a common thing; in fact, if you've kept up with your PostgreSQL updates like the update last Thursday, you should never see it.  This issue was subsequently fixed in 8.4.3, but the owner never applied that update.

Oh, also note that version 8.4 is now EOL.  If you haven't upgraded already, you should have a schedule for doing so.

This function assumes that you have a primary key column which is a SERIAL.  It would need to be altered for other designs.  It also assumes the postgres user can write to /tmp/testout.  See the original posts for how to use this.

create or replace function find_bad_toast2 (
   tablename text,
   pk_col text
)
returns text
language plpgsql
as
$f$
declare
   curid BIGINT := 0;
   badid BIGINT;
begin
-- Walk the table by primary key, forcing each row to be read in full.
FOR badid IN EXECUTE 'SELECT ' || pk_col || ' FROM ' || tablename LOOP
   curid = curid + 1;
   if curid % 100000 = 0 then
       raise notice '% rows inspected', curid;
   end if;
   begin
       -- COPY of the single row detoasts every column; corrupt TOAST data raises an error here.
       EXECUTE 'COPY ( SELECT * FROM ' || tablename || ' WHERE ' ||
            pk_col || ' = ' || cast(badid as text) || ') TO ''/tmp/testout'';';
   exception
       when others then
           -- log the bad primary key value and keep scanning
           raise notice 'data for id % is corrupt', badid;
           continue;
   end;
end loop;
return 'done checking all rows';
end;
$f$;
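
As a hypothetical usage example (the table and column names here are placeholders, not from the original post), checking a table named my_table whose SERIAL primary key is id would look like this, with corrupt rows reported as NOTICE messages:

SELECT find_bad_toast2('my_table', 'id');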

Michael Paquier: Dynamic tracing with Postgres

From Planet PostgreSQL. Published on Jul 27, 2014.

Postgres has in-core support for dynamic tracing, that is, the ability to use an external utility to track specific code path calls and obtain an execution trace for further analysis. This feature is enabled at compile time by passing --enable-dtrace at the configure step. Note as well that on Linux systems you will need systemtap installed to compile code with this support (a development package may be needed depending on your platform), plus extra kernel-level packages like kernel-devel to actually take traces. To check whether your installation is able to do dynamic tracing, simply run the following "Hello World" command.

$ stap -e 'probe begin { printf("Hello World\n") exit() }'
Hello World

Now, there are many things that can be done using the probes that are defined within Postgres and the functions defined natively, the most intuitive thing being to print on-the-fly information about things being done on the Postgres server. Here is for example a script able to track transaction start and commit, giving at the same time some extra information about the process doing the operation:

probe process("/path/to/bin/postgres").mark("transaction__start")
{
    printf ("Start PID: %d, CPU: %d\n", pid(), cpu())
}

probe process("/path/to/bin/postgres").mark("transaction__commit")
{
    printf ("Commit PID: %d, CPU: %d\n", pid(), cpu())
}

When using systemtap, the separator for mark points is not a single dash "-" but a double underscore "__". The argument values within a probe mark can be accessed in the stap script as $arg1, $arg2, etc. Now, running a simple transaction like this one...

=# BEGIN;
BEGIN
=# CREATE TABLE dtrace_tab (a int);
CREATE TABLE
=# select pg_backend_pid();
 pg_backend_pid
----------------
          14411
(1 row)
=# COMMIT;
COMMIT

Results in the following output when running stap and the script.

$ sudo stap tx_track.d
Start PID: 14411, CPU: 0
Commit PID: 14411, CPU: 0

It is also possible to track function calls. Here is an example gathering the same information as the last script, but this time by probing calls to StartTransaction and CommitTransaction. Probing functions like this is more interesting for following specific code paths, like the planner or the executor, though:

probe process("/path/to/bin/postgres").function("StartTransaction")
{
    printf ("Start PID: %d, CPU: %d\n", pid(), cpu())
}

probe process("/path/to/bin/postgres").function("CommitTransaction")
{
    printf ("Commit PID: %d, CPU: %d\n", pid(), cpu())
}

And a transaction similar to the previous one results in this output (transaction run with the same session):

$ sudo stap tx_func.d
Start PID: 14411, CPU: 0
Commit PID: 14411, CPU: 0

Using probes with a timer makes it possible to print information about server activity at a chosen interval. Here is an example that calculates the server TPS by counting transaction commits and printing the result every second:

global commit_count

probe process("/path/to/bin/postgres").mark("transaction__commit") {
    commit_count++
}

probe timer.s(1) {
    printf("tps: %d\n", commit_count)
    commit_count=0
}

A simple run of pgbench like this one (no tuning of any kind):

$ pgbench -S -T 5 -c 24
starting vacuum...end.
transaction type: SELECT only
scaling factor: 1
query mode: simple
number of clients: 24
number of threads: 1
duration: 5 s
number of transactions actually processed: 14756
latency average: 8.132 ms
tps = 2941.653159 (including connections establishing)
tps = 2975.574118 (excluding connections establishing)

Results in the following output:

$ sudo stap tps_count.d
tps: 2091
tps: 2991
tps: 3089
tps: 2960
tps: 3034
tps: 624

Printing information gathered only once the stap process is stopped can be done using probe end. Here is for example a script counting the number of buffers read and flushed.

probe process("/path/to/bin/postgres").mark("buffer__read__done")
{
    buffer_read++
}
probe process("/path/to/bin/postgres").mark("buffer__flush__done")
{
    buffer_flush++
}
probe end
{
    printf("\nNumber of buffers read/flushed\n")
    printf("Read = %d\n", buffer_read)
    printf("Flushed = %d\n", buffer_flush)
}

And now, here is a short pgbench run followed by a checkpoint:

$ pgbench -T 60 -c 24 | tail -n4
starting vacuum...end.
number of transactions actually processed: 24608
latency average: 58.518 ms
tps = 394.921180 (including connections establishing)
tps = 395.289723 (excluding connections establishing)
$ psql -c 'checkpoint'
CHECKPOINT

The following results are obtained:

$ sudo stap buffer_track.d
^C
Number of buffers read/flushed
Read = 927040
Flushed = 2204

There are of course many possibilities with this facility, so be sure to adapt it to your own needs and use it wisely.

Craig Ringer: PostgreSQL anti-patterns: read-modify-write cycles

From Planet PostgreSQL. Published on Jul 26, 2014.

Shaun Thomas’s recent post about client-side loops as an SQL anti-pattern is well worth a read if you’re relatively new to SQL-based application development.

It’s reminded me of another SQL coding anti-pattern that I see quite a lot: the naïve read-modify-write cycle. I’ll explain what this common development mistake is, how to identify it, and options for how to fix it.

Imagine your code wants to look up a user’s balance, subtract 100 from it if doing so won’t make it negative, and save it.

It’s common to see this written as three steps:

SELECT balance FROM accounts WHERE user_id = 1;
-- in the application, subtract 100 from balance if it's at least
-- 100; then, where ? is the new balance:
UPDATE accounts SET balance = ? WHERE user_id = 1;

and everything will appear to work fine to the developer. However, this code is critically wrong, and will malfunction as soon as the same user is updated by two different sessions at the same time.

Imagine two concurrent sessions, each subtracting 100 from the user’s balance, starting with an initial value of 300.

Session 1: SELECT balance FROM accounts WHERE user_id = 1; (returns 300)
Session 2: SELECT balance FROM accounts WHERE user_id = 1; (also returns 300)
Session 1: UPDATE accounts SET balance = 200 WHERE user_id = 1; (300 – 100 = 200)
Session 2: UPDATE accounts SET balance = 200 WHERE user_id = 1; (300 – 100 = 200)

Whoops! The balance is 200, but you took out 200 from a starting point of 300. So 100 has just vanished.

Most testing and development is done on standalone servers running single sessions, so unless you’re doing rigorous testing this sort of thing often doesn’t get noticed until production, and can be painful to debug. It’s important to know about it so you can code defensively.

I often have people on Stack Overflow ask things to the tune of “Don’t transactions prevent this?”. Unfortunately, while great, transactions aren’t magic secret sauce you can add for easy concurrency. The only way to let you completely ignore concurrency issues is to LOCK TABLE every table you might use before starting the transaction (and even then you have to always lock in the same order to prevent deadlocks).

In a transaction the outcome in this case is exactly the same:

Session 1: BEGIN;
Session 2: BEGIN;
Session 1: SELECT balance FROM accounts WHERE user_id = 1; (returns 300)
Session 2: SELECT balance FROM accounts WHERE user_id = 1; (also returns 300)
Session 1: UPDATE accounts SET balance = 200 WHERE user_id = 1; (300 – 100 = 200)
Session 2: UPDATE accounts SET balance = 200 WHERE user_id = 1; (300 – 100 = 200)
Session 1: COMMIT;
Session 2: COMMIT;

Solutions

Thankfully PostgreSQL (and SQL in general) has a few tools that will help you, and there are some application side options too. Some popular solutions to this problem are:

  • Avoiding the read-modify-write with a calculated update
  • Row level locking with SELECT ... FOR UPDATE
  • Use of SERIALIZABLE transactions
  • Optimistic concurrency control, otherwise known as optimistic locking

Avoiding the read-modify-write cycle

The best solution is often to just do the work in SQL, avoiding the read-modify-write-cycle entirely.

Just write:

Session 1: UPDATE accounts SET balance = balance - 100 WHERE user_id = 1; (sets balance=200)
Session 2: UPDATE accounts SET balance = balance - 100 WHERE user_id = 1; (sets balance=100)

This works even if the two statements are in concurrent transactions, because the first takes a lock on the row and the second waits on the lock until the first commits or rolls back. The transaction isolation documentation covers this in more detail under READ COMMITTED.

This option is only viable for simpler cases, though. It isn’t useful if the application needs to do some complex logic based on the current balance to determine if the update should proceed, for example.

It’s usually the simplest and fastest choice where it’s applicable.
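
And if the only application "logic" is the don't-go-negative check from the original example, even that can be pushed into the statement itself; a sketch:

-- Subtract 100 only when it will not make the balance negative;
-- check the affected-row count to know whether the withdrawal happened.
UPDATE accounts
   SET balance = balance - 100
 WHERE user_id = 1
   AND balance >= 100;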

Row level locking

The simplest solution to fix an existing broken application with the fewest changes is generally to add row level locking.

Instead of SELECT balance FROM accounts WHERE user_id = 1 write SELECT balance FROM accounts WHERE user_id = 1 FOR UPDATE. This takes a row-level write lock. Any other transaction that tries to UPDATE the row or lock it with SELECT ... FOR UPDATE (or FOR SHARE) will pause until the transaction that holds the lock rolls back or commits.

In the example above, under PostgreSQL’s default transaction isolation level, the second SELECT would simply not return until the first transaction ran its UPDATE then COMMITed. Then the second transaction would continue, but its SELECT would return 200 instead of 300, so the correct value would be produced.

Session 1: BEGIN;
Session 2: BEGIN;
Session 1: SELECT balance FROM accounts WHERE user_id = 1 FOR UPDATE; (returns 300)
Session 2: SELECT balance FROM accounts WHERE user_id = 1 FOR UPDATE; (gets stuck and waits for transaction 1)
Session 1: UPDATE accounts SET balance = 200 WHERE user_id = 1; (300 – 100 = 200)
Session 1: COMMIT;
Session 2: (the blocked SELECT now returns 200)
Session 2: UPDATE accounts SET balance = 100 WHERE user_id = 1; (200 – 100 = 100)
Session 2: COMMIT;

The explicit locking chapter in the documentation discusses this in more detail.

Note that this only works if the read-modify-write cycle is contained within a single transaction, as locks only exist for the lifetime of a transaction.
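
In application code, the change is typically just appending FOR UPDATE to the existing SELECT. Here is a minimal psycopg2 sketch of the pattern, assuming the accounts table from the example above (the connection string and the literal user id are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=bank")
with conn:  # commits on success, rolls back on error
    with conn.cursor() as cur:
        # FOR UPDATE takes the row lock for the rest of the transaction
        cur.execute("SELECT balance FROM accounts WHERE user_id = %s FOR UPDATE", (1,))
        balance = cur.fetchone()[0]
        if balance >= 100:  # the application-side decision
            cur.execute("UPDATE accounts SET balance = %s WHERE user_id = %s",
                        (balance - 100, 1))
conn.close()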

SERIALIZABLE transactions

If the read-modify-write cycle is always contained within a single transaction and if you’re running PostgreSQL 9.1 or newer, you can use a SERIALIZABLE transaction instead of explicit SELECT ... FOR UPDATE.

In this case both SELECTs would proceed normally, as would both UPDATEs. The first transaction would COMMIT fine. However, when you went to COMMIT the second transaction it would instead abort with a serialization error. You would then re-run the failed transaction from the beginning*:

Session 1: BEGIN;
Session 2: BEGIN;
Session 1: SELECT balance FROM accounts WHERE user_id = 1; (returns 300)
Session 2: SELECT balance FROM accounts WHERE user_id = 1; (also returns 300)
Session 1: UPDATE accounts SET balance = 200 WHERE user_id = 1; (300 – 100 = 200)
Session 2: UPDATE accounts SET balance = 200 WHERE user_id = 1; (300 – 100 = 200)
Session 1: COMMIT – succeeds, setting balance=200
Session 2: COMMIT – fails with a serialization error, leaving balance unchanged

SERIALIZABLE isolation can force applications to repeat a lot of work if a big transaction aborts or if conflicts are common. It’s very useful for complex cases where attempting to use row locking might just cause deadlocks, though.
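
Since any SERIALIZABLE transaction can abort this way, the application needs a retry loop around the whole transaction. A rough psycopg2 sketch, assuming the same accounts table (the connection string and the withdraw_100 helper are assumptions, not from the original post):

import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=bank")
conn.set_session(isolation_level=psycopg2.extensions.ISOLATION_LEVEL_SERIALIZABLE)

def withdraw_100(user_id, max_retries=5):
    for attempt in range(max_retries):
        try:
            with conn:  # commit on success, roll back and re-raise on error
                with conn.cursor() as cur:
                    cur.execute("SELECT balance FROM accounts WHERE user_id = %s", (user_id,))
                    balance = cur.fetchone()[0]
                    if balance >= 100:
                        cur.execute("UPDATE accounts SET balance = %s WHERE user_id = %s",
                                    (balance - 100, user_id))
            return
        except psycopg2.extensions.TransactionRollbackError:
            continue  # serialization failure: re-run the whole transaction
    raise RuntimeError("transaction still failing after %d attempts" % max_retries)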

Optimistic concurrency control

Optimistic concurrency control (or “optimistic locking”) is usually implemented as an application-side method for handling concurrency, often by object relational mapping tools like Hibernate.

In this scheme, all tables have a version column or last-updated timestamp, and all updates have an extra WHERE clause entry that checks to make sure the version column hasn’t changed since the row was read. The application checks to see if any rows were affected by the UPDATE and if none were affected, treats it as an error and aborts the transaction.

The example above becomes:

Session 1: BEGIN;
Session 2: BEGIN;
Session 1: SELECT balance, version FROM accounts WHERE user_id = 1; (returns balance=300, version=1)
Session 2: SELECT balance, version FROM accounts WHERE user_id = 1; (also returns balance=300, version=1)
Session 1: COMMIT;
Session 2: COMMIT;
Session 1: BEGIN;
Session 2: BEGIN;
Session 1: UPDATE accounts SET balance = 200, version = 2 WHERE user_id = 1 AND version = 1; (300 – 100 = 200; succeeds, reporting 1 row changed)
Session 2: UPDATE accounts SET balance = 200, version = 2 WHERE user_id = 1 AND version = 1; (300 – 100 = 200; blocks on session 1's lock)
Session 1: COMMIT;
Session 2: (the UPDATE returns, matching zero rows because it now sees version=2 in the WHERE clause)
Session 2: ROLLBACK, because the application detected the zero-row update

Because it’s fiddly to code, optimistic concurrency control is usually used via an ORM or query building tool.
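
Hand-rolled, the critical part is checking the affected-row count after the UPDATE. A small psycopg2 sketch, with table and column names as in the example above and the version bumped in the UPDATE itself (a common variant of the scheme):

# Sketch of a hand-rolled optimistic update: the caller passes the balance and
# version it read earlier; a zero-row UPDATE means someone else got there first.
def optimistic_withdraw_100(conn, user_id, balance, version):
    with conn:  # commit on success, roll back on the exception below
        with conn.cursor() as cur:
            cur.execute(
                """UPDATE accounts
                      SET balance = %s, version = version + 1
                    WHERE user_id = %s AND version = %s""",
                (balance - 100, user_id, version))
            if cur.rowcount == 0:
                raise RuntimeError("optimistic lock failure: re-read and retry")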

Unlike SERIALIZABLE isolation, it works even in autocommit mode or if the statements are in separate transactions. For this reason it’s often a good choice for web applications that might have very long user “think time” pauses or where clients might just vanish mid-session, as it doesn’t need long-running transactions that can cause performance problems.

Optimistic concurrency control can co-exist with traditional locking based approaches if you use triggers to enforce the optimistic locking rules. Nonetheless it's generally used instead of other explicit concurrency control methods.

Which to use?

The appropriate choice depends on what you're doing, your scalability requirements, how easy it is to write retry loops to re-run failed transactions, etc.

There is no right answer for everybody. If there was, there'd only be one way to do it, not several.

There is, however, a definitively wrong way, and that's to ignore concurrency issues and expect the database to just take care of them for you.


* The topic of transaction re-tries brings me to an anti-pattern I want to discuss later: testing connections and writing cannot-afford-to-fail transactions.

Shaun M. Thomas: Friends Don’t Let Friends Use Loops

From Planet PostgreSQL. Published on Jul 25, 2014.

Programming is fun. I love programming! Ever since I changed my career from programming to database work, I’ve still occasionally dabbled in my former craft. As such, I believe I can say this with a fair amount of accuracy: programmers don’t understand databases. This isn’t something small, either; there’s a fundamental misunderstanding at play. Unless the coder happens to work primarily with graphics, bulk set-based transformations are not something they’ll generally work with.

For instance, if tasked with inserting ten thousand records into a database table, a programmer might simply open the data source and insert them one by one. Consider this basic (non-normalized) table with a couple basic indexes:

CREATE TABLE sensor_log (
  sensor_log_id    SERIAL PRIMARY KEY,
  location         VARCHAR NOT NULL,
  reading          BIGINT NOT NULL,
  reading_date     TIMESTAMP NOT NULL
);

CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

Now suppose we have a file with ten thousand lines of something like this:

38c-1401,293857,2014-07-25 10:18:38-05:00
69a-8921,209574,2014-07-25 10:18:25-05:00
9e5-0942,690134,2014-07-25 10:18:16-05:00

To load this data, our coder chooses Python and whips up an insert script. Let’s even give the programmer the benefit of the doubt, and say they know that prepared queries are faster due to less overhead. I see scripts like this all the time, written in languages from Java to Erlang. This one is no different:

import psycopg2

db_conn = psycopg2.connect(database = 'postgres', user = 'postgres')
cur = db_conn.cursor()

cur.execute(
  """PREPARE log_add AS
     INSERT INTO sensor_log (location, reading, reading_date)
     VALUES ($1, $2, $3);"""
)

file_input = open('/tmp/input.csv', 'r')
for line in file_input:
    cur.execute("EXECUTE log_add(%s, %s, %s)", line.strip().split(','))
file_input.close()

cur.execute("DEALLOCATE log_add");

db_conn.commit()
db_conn.close()

It’s unlikely we have the /tmp/input.csv file itself, but we can generate one. Suppose we have 100 locations each with 100 sensors. We could produce a fake input file with this SQL:

COPY (
    SELECT substring(md5((a.id % 100)::TEXT), 1, 3) || '-' ||
           to_char(a.id % 100, 'FM0999') AS location,
           (a.id * random() * 1000)::INT AS reading,
           now() - (a.id % 60 || 's')::INTERVAL AS reading_date
      FROM generate_series(1, 10000) a (id)
) TO '/tmp/input.csv' WITH CSV;

Whew! That was a lot of work. Now, let’s see what happens when we time the inserts on an empty import table:

real    0m1.162s
user    0m0.168s
sys     0m0.122s

Well, a little over one second isn’t that bad. But suppose we rewrote the python script a bit. Bear with me; I’m going to be silly and use the python script as a simple pass-through. This should simulate a process that applies transformations and outputs another file for direct database import. Here we go:

import psycopg2

file_input = open('/tmp/input.csv', 'r')
processed = open('/tmp/processed.csv', 'w+')

for line in file_input:
    parts = line.strip().split(',')
    processed.write(','.join(parts) + '\n')

file_input.close()
processed.seek(0)

db_conn = psycopg2.connect(database = 'postgres', user = 'postgres')
cur = db_conn.cursor()
cur.copy_from(
    processed, 'sensor_log', ',',
    columns = ('location', 'reading', 'reading_date')
)

processed.close()

db_conn.commit()
db_conn.close()

Now let’s look at the timings involved again:

real    0m0.365s
user    0m0.056s
sys     0m0.004s

That’s about three times faster! Considering how simple this example is, that’s actually pretty drastic. We don’t have many indexes, the table has few columns, and the number of rows is relatively small. The situation gets far worse as all of those things increase.

It’s also not the end of our story. What happens if we enable autocommit, so that each insert gets its own transaction? Some ORMs might do this, or a naive developer might try generating a single script full of insert statements, and not know much about transactions. Let's see:

real    1m31.794s
user    0m0.233s
sys     0m0.172s

Oh. Ouch. What normally takes around a third of a second can balloon all the way out to a minute and a half. This is one of the reasons I strongly advocate educating developers on proper data import techniques. It’s one thing for a job to be three to five times slower due to inefficient design. It’s quite another to be nearly 250 times slower simply because a programmer believes producing the output file was fast, so logically, inserting it should be similarly speedy. Both scenarios can be avoided by educating anyone involved with data manipulation.
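
For reference, the change that produces this worst case in the first psycopg2 script is presumably nothing more than turning autocommit on, so every EXECUTE commits on its own:

import psycopg2

# Same connection setup as the first script, but with autocommit enabled:
# each of the ten thousand EXECUTE statements now commits individually.
db_conn = psycopg2.connect(database = 'postgres', user = 'postgres')
db_conn.autocommit = True
cur = db_conn.cursor()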

This doesn’t just apply to new hires. Keeping everyone up to date on new techniques is equally important, as are refresher courses. I care about my database, so when possible, I try to over-share as much information as I can manage. I even wrote and presented several talks which I periodically give to our application developers to encourage better database use. Our company Wiki is similarly full of information, which I also present on occasion, if only because reading technical manuals can be quite boring.

If my database is being abused, it’s my job as a DBA to try and alleviate the situation any way I can. Sometimes, that means telling people what they’re doing wrong, and how they can fix it. I certainly didn’t know all of this ten years ago when I was primarily a coder. But I would have appreciated being pointed in the right direction by someone with more experience in the field.

Your database and users deserve it.

Magnus Hagander: The end of PostgreSQL 8.4 - a small tribute

From Planet PostgreSQL. Published on Jul 24, 2014.

Today's release marks the end of life for PostgreSQL 8.4, after 5 years of service.

I've found it interesting that PostgreSQL 8.4 was probably the first release to actually make it possible to run fairly busy PostgreSQL installations without any hands-on at all. There were many important parts to 8.4 that made that happen, such as much better standby database support (though not hot standby yet - that came in 9.0) and better statistics and monitoring systems. 8.4 also came with Recursive CTEs (WITH-queries) and Window Aggregates, both of which made a whole new set of applications possible.

But I think nothing actually brought about this "run without hands-on" as much as the new code for the Visibility Map (VM) and in particular the automatic sizing of the Free Space Map (FSM). Anybody who deployed 8.3 or earlier in any kind of busy environment knows the pain of trying to configure max_fsm_pages correctly. It was almost impossible to get it right, and the "right" value kept changing with your load and data. And the worst part of it all was that if you got it wrong, you were in trouble - there wasn't really any remedy other than taking your database offline (or at least read-only) for manual full database vacuums (there were tricks to get around the vacuum specifics, but the core problem was there). So what happened was that a lot of those people who knew what to do just increased that number to something that was "big enough", which usually meant "way too big", thus wasting resources. And even with that, they sometimes ran into it not being big enough because the workload changed.

In fact, I think more or less every single system I reviewed for customers on those versions had a substantially wrong number in max_fsm_pages - usually the default value, because they had no idea what to do with it - and was paying the price for it.

Extra funny is that I've been talking to Heikki (who wrote the dynamic FSM and VM code) a few times recently, and he's still surprised that these changes had such a big impact on real life deployments. Probably because it's one of those things that is usually not a big problem at all in small installations, developer installations, and testing systems, but can quickly snowball in real life production. I think many of us didn't realize before the release actually made it out there how useful it would be.

This feature, together with things like Recursive CTEs and Window Functions really makes 8.4 one of the landmark PostgreSQL releases. It's easier to point to releases like 9.0 which had Streaming Replication and Hot Standby because they are sexier and more marketable features, but infrastructure like this is more important than we often think.

8.4 got rid of some of the hardest things to tune. Let's stay on the path of trying to keep the system simple enough to use that it doesn't need handholding, and look forward to the upcoming 9.4 release!

Greg Sabino Mullane: Postgresql conflict handling with Bucardo and multiple data sources

From Planet PostgreSQL. Published on Jul 23, 2014.


Image by Flickr user Rebecca Siegel (cropped)

Bucardo's much publicized ability to handle multiple data sources often raises questions about conflict resolution. People wonder, for example, what happens when a row in one source database gets updated one way, and the same row in another source database gets updated a different way? This article will explain some of the solutions Bucardo uses to solve conflicts. The recently released Bucardo 5.1.1 has some new features for conflict handling, so make sure you use at least that version.

Bucardo does multi-source replication, meaning that users can write to more than one source at the same time. (This is also called multi-master replication, but "source" is a much more accurate description than "master"). Bucardo deals in primary keys as a way to identify rows. If the same row has changed on one or more sources since the last Bucardo run, a conflict has arisen and Bucardo must be told how to handle it. In other words, Bucardo must decide which row is the "winner" and thus gets replicated to all the other databases.

For this demo, we will again use an Amazon AWS instance. See the earlier post about Bucardo 5 for directions on installing Bucardo itself. Once it is installed (after the './bucardo install' step), we can create some test databases for our conflict testing. Recall that we have a handy database named "shake1". As this name can get a bit long for some of the examples below, let's make a few database copies with shorter names. We will also teach Bucardo about the databases, and create a sync named "ctest" to replicate between them all:

createdb aa -T shake1
createdb bb -T shake1
createdb cc -T shake1
bucardo add db A,B,C dbname=aa,bb,cc
## autokick=0 means new data won't replicate right away; useful for conflict testing!
bucardo add sync ctest dbs=A:source,B:source,C:source tables=all autokick=0
bucardo start

Bucardo has three general ways to handle conflicts: built in strategies, a list of databases, or using custom conflict handlers. The primary strategy, and also the default one for all syncs, is known as bucardo_latest. When this strategy is invoked, Bucardo scans all copies of the conflicted table across all source databases, and then orders the databases according to when they were last changed. This generates a list of databases, for example "B C A". For each conflicting row, the database most recently updated - of all the ones involved in the conflict for that row - is the winner. The other built in strategy is called "bucardo_latest_all_tables", which scans all the tables in the sync across all source databases to find a winner.

There may be other built in strategies added as experience/demand dictates, but it is hard to develop generic solutions to the complex problem of conflicts, so non built-in strategies are preferred. Before getting into those other solutions, let's see the default strategy (bucardo_latest) in action:

## This is the default, but it never hurts to be explicit:
bucardo update sync ctest conflict=bucardo_latest
Set conflict strategy to 'bucardo_latest'
psql aa -c "update work set totalwords=11 where title~'Juliet'"; \
psql bb -c "update work set totalwords=21 where title~'Juliet'"; \
psql cc -c "update work set totalwords=31 where title~'Juliet'"
UPDATE 1
UPDATE 1
UPDATE 1
bucardo kick sync ctest 0
Kick ctest: [1 s] DONE!
## Because cc was the last to be changed, it wins:
for i in {aa,bb,cc}; do psql $i -tc "select current_database(), \
totalwords from work where title ~ 'Juliet'"; done
aa   |   31
bb   |   31
cc   |   31

Under the hood, Bucardo actually applies the list of winning databases to each conflicting row, such that the example above of "B C A" means that database B wins in a conflict in which a row was updated by B and C, or B and A, or B and C and A. However, if B did not change the row, and the conflict is only between C and A, then C will win.

As an alternative to the built-ins, you can set conflict_strategy to a list of the databases in the sync, ordered from highest priority to lowest, for example "C B A". The list does not have to include all the databases, but it is a good idea to do so. Let's see it in action. We will change the conflict_strategy for our test sync and then reload the sync to have it take effect:


bucardo update sync ctest conflict='B A C'
Set conflict strategy to 'B A C'
bucardo reload sync ctest
Reloading sync ctest...Reload of sync ctest successful
psql aa -c "update work set totalwords=12 where title~'Juliet'"; \
psql bb -c "update work set totalwords=22 where title~'Juliet'"; \
psql cc -c "update work set totalwords=32 where title~'Juliet'"
UPDATE 1
UPDATE 1
UPDATE 1
bucardo kick sync ctest 10
Kick ctest: [1 s] DONE!
## This time bb wins, because B comes before A and C
for i in {aa,bb,cc}; do psql $i -tc "select current_database(), \
totalwords from work where title ~ 'Juliet'"; done
aa   |   22
bb   |   22
cc   |   22

The final strategy for handling conflicts is to write your own code. Many will argue this is the best approach. It is certainly the only one that will allow you to embed your business logic into the conflict handling.

Bucardo allows loading of snippets of Perl code known as "customcodes". These codes take effect at specified times, such as after triggers are disabled, or when a sync has failed because of an exception. The specific time we want is called "conflict", and it is an argument to the "whenrun" attribute of the customcode. A customcode needs a name, the whenrun argument, and a file to read in for its content. They can also be associated with one or more syncs or tables.

Once a conflict customcode is in place and a conflict is encountered, the code will be invoked, and it will in turn pass information back to Bucardo telling it how to handle the conflict.

The code should expect a single argument, a hashref containing information about the current sync. This hashref tells the current table, and gives a list of all conflicted rows. The code can tell Bucardo which database to consider the winner for each conflicted row, or it can simply declare a winning database for all rows, or even for all tables. It can even modify the data in any of the tables itself. What it cannot do (thanks to the magic of DBIx::Safe) is commit, rollback, or do other dangerous actions since we are in the middle of an important transaction.

It's probably best to show by example at this point. Here is a file called ctest1.pl that asks Bucardo to skip to the next applicable customcode if the conflict is in the table "chapter". Otherwise, it tells Bucardo to have database "C" win all conflicts for this table, falling back to "B" and then "A".

## ctest1.pl - a sample conflict handler for Bucardo
use strict;
use warnings;

my $info = shift;
## If this table is named 'chapter', do nothing
if ($info->{tablename} eq 'chapter') {
    $info->{skip} = 1;
}
else {
    ## Winning databases, in order
    $info->{tablewinner} = 'C B A';
}
return;

Let's add in this customcode, and associate it with our sync. Then we will reload the sync and cause a conflict.

bucardo add customcode ctest \
  whenrun=conflict src_code=ctest1.pl sync=ctest
Added customcode "ctest"
bucardo reload sync ctest
Reloading sync ctest...Reload of sync ctest successful
psql aa -c "update work set totalwords=13 where title~'Juliet'"; \
psql bb -c "update work set totalwords=23 where title~'Juliet'"; \
psql cc -c "update work set totalwords=33 where title~'Juliet'"
UPDATE 1
UPDATE 1
UPDATE 1
bucardo kick sync ctest 0
Kick ctest: [1 s] DONE!
## This time cc wins, because we set all rows to 'C B A'
for i in {aa,bb,cc}; do psql $i -tc "select current_database(), \
totalwords from work where title ~ 'Juliet'"; done
aa   |   33
bb   |   33
cc   |   33

We used the 'skip' hash value to tell Bucardo to not do anything if the table is named "chapter". In real life, we would have another customcode that will handle the skipped table, else any conflict in it will cause the sync to stop. Any number of customcodes can be attached to syncs or tables.

The database preference will last for the remainder of this sync's run, so any other conflicts in other tables will not even bother to invoke the code. You can use the hash key "tablewinneralways" to make this decision sticky, in that it will apply for all future runs by this sync (its KID technically) - which effectively means the decision stays until Bucardo restarts.

One of the important structures sent to the code is a hash named "conflicts", which contains all the changed primary keys, and, for each one, a list of which databases were involved in the sync. A Data::Dumper peek at it would look like this:

$VAR1 = {
  'romeojuliet' => {
    'C' => 1,
    'A' => 1,
    'B' => 1,
  }
};

The job of the conflict handling code (unless using one of the "winner" hash keys) is to change each of those conflicted rows from a hash of involved databases into a string describing the preferred order of databases. The Data::Dumper output would thus look like this:

$VAR1 = {
  'romeojuliet' => 'B'
};

The code snippet would look like this:

## ctest2.pl - a simple conflict handler for Bucardo.
use strict;
use warnings;

my $info = shift;
for my $row (keys %{ $info->{conflicts} }) {
  ## Equivalent to 'A C B'
  $info->{conflicts}{$row} = exists $info->{conflicts}{$row}{A} ? 'A' : 'C';
}

## We don't want any other customcodes to fire: we have handled this!
$info->{lastcode} = 1;
return;

Let's see that code in action. Assuming the above "bucardo add customcode" command was run, we will need to load an updated version, and then reload the sync. We create some conflicts, and check on the results:


bucardo update customcode ctest src_code=ctest2.pl
Changed customcode "ctest" src_code with content of file "ctest2.pl"
bucardo reload sync ctest
Reloading sync ctest...Reload of sync ctest successful
psql aa -c "update work set totalwords=14 where title~'Juliet'"; \
psql bb -c "update work set totalwords=24 where title~'Juliet'"; \
psql cc -c "update work set totalwords=34 where title~'Juliet'"
UPDATE 1
UPDATE 1
UPDATE 1
bucardo kick sync ctest 10
Kick ctest: [2 s] DONE!
## This time aa wins, because we set all rows to 'A C B'
for i in {aa,bb,cc}; do psql $i -tc "select current_database(), \
totalwords from work where title ~ 'Juliet'"; done
aa   |   14
bb   |   14
cc   |   14

That was an obviously oversimplified example, as we picked 'A' for no discernible reason! These conflict handlers can be quite complex, and are only limited by your imagination - and your business logic. As a final example, let's have the code examine some other things in the database, as well as jump out of the database itself(!), to determine the resolution to the conflict:

## ctest3.pl - a somewhat silly conflict handler for Bucardo.
use strict;
use warnings;
use LWP;

my $info = shift;

## What is the weather in Walla Walla, Washington?
## If it's really hot, we cannot trust server A
my $max_temp = 100;
my $weather_url = 'http://wxdata.weather.com/wxdata/weather/rss/local/USWA0476?cm_ven=LWO&cm_cat=rss';
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => $weather_url);
my $response = $ua->request($req)->content();
my $temp = ($response =~ /(\d+) \°/) ? $1 : 75;
## Store in our shared hash so we don't have to look it up every run
## Ideally we'd add something so we only call it if the temp has not been checked in last hour
$info->{shared}{wallawallatemp} = $temp;

## We want to count the number of sessions on each source database
my $SQL = 'SELECT count(*) FROM pg_stat_activity';
for my $db (sort keys %{ $info->{dbinfo} }) {
    ## Only source databases can have conflicting rows
    next if ! $info->{dbinfo}{$db}{issource};
    ## The safe database handles are stored in $info->{dbh}
    my $dbh = $info->{dbh}{$db};
    my $sth = $dbh->prepare($SQL);
    $sth->execute();
    $info->{shared}{dbcount}{$db} = $sth->fetchall_arrayref()->[0][0];
}

for my $row (keys %{ $info->{conflicts} }) {
    ## If the temp is too high, remove server A from consideration!
    if ($info->{shared}{wallawallatemp} > $max_temp) {
        delete $info->{conflicts}{$row}{A}; ## May not exist, but we delete anyway
    }

    ## Now we can sort by number of connections and let the least busy db win
    (my $winner) = sort {
        $info->{shared}{dbcount}{$a} <=> $info->{shared}{dbcount}{$b}
        or
        ## Fallback to reverse alphabetical if the session counts are the same
        $b cmp $a
    } keys %{ $info->{conflicts}{$row} };

    $info->{conflicts}{$row} = $winner;
}

## We don't want any other customcodes to fire: we have handled this!
$info->{lastcode} = 1;
return;

We'll forego the demo: suffice to say that B always won in my tests, as Walla Walla never got over 97, and all my test databases had the same number of connections. Note some of the other items in the $info hash: "shared" allows arbitrary data to be stored across invocations of the code. The "lastcode" key tells Bucardo not to fire any more customcodes. While this example is very impractical, it does demonstrate the power available to you when solving conflicts.

Hopefully this article answers many of the questions about conflict handling with Bucardo. Suggestions for new default handlers and examples of real-world conflict handlers are particularly welcome, as well as any other questions or comments. You can find the mailing list at bucardo-general@bucardo.org, and subscribe by visiting the bucardo-general Info Page.

Gurjeet Singh: Announcing TPC-C.js; a Lightweight Implementation of TPC-C

From Planet PostgreSQL. Published on Jul 23, 2014.

I am glad to announce the beta release of TPC-C.js, which implements one of the most popular database benchmarks, TPC-C. It’s not a coincidence that today is also the 22nd anniversary of the TPC-C benchmark.

It currently supports the Postgres database, but can be easily extended to test other database systems.

You might ask “Why another TPC-C implementation when we already have so many of them?”

Short answer: This one is very light on system resources, so you can

  1. Run the benchmark strictly adhering to the specification, and
  2. Invest more in database hardware, rather than client hardware.

Long answer: It’s covered in the Motivation section of TPC-C.js, which I’ll quote here:

Motivation

The TPC-C benchmark drivers currently available to us, like TPCC-UVa, DBT2, HammerDB, BenchmarkSQL, etc., all run one process (or thread) per simulated client. Because the TPC-C benchmark specification limits the max tpmC metric (transactions per minute of benchmark-C) from any single client to be 1.286 tpmC, this means that to get a result of, say, 1 million tpmC we have to run about 833,000 clients. Even for a decent number as low as 100,000 tpmC, one has to run 83,000 clients.

Given that running a process/thread, even on modern operating systems, is a bit expensive, it requires a big upfront investment in hardware to run the thousands of clients required for driving a decent tpmC number. For example, the current TPC-C record holder had to run 6.8 million clients to achieve 8.55 million tpmC, and they used 16 high-end servers to run these clients, which cost them about $ 220,000 (plus $ 550,000 in client-side software).

So, to avoid those high costs, these existing open-source implementations of TPC-C compromise on one of the core requirements of the TPC-C benchmark: keying and thinking times. These implementations resort to just hammering the SUT (system under test) with a constant barrage of transactions from a few clients (ranging from 10-50).

So you can see that even though a decent modern database (running on a single machine) can serve a few hundred clients simultaneously, it ends up serving very few (10-50) clients. I strongly believe that this way the database is not being tested to its full capacity; at least not as the TPC-C specification intended.

The web servers of yesteryear also suffered from the same problem; using one process for each client request prohibited them from scaling, because the underlying operating system cannot run thousands of processes efficiently. The web servers solved this problem (known as the c10k problem) by using an event-driven architecture capable of handling thousands of clients in a single process, with minimal effort on the operating system’s part.

So this implementation of TPC-C uses a similar architecture, relying on NodeJS and its event-driven model to run thousands of clients against a database.

Hans-Juergen Schoenig: Optimization issues: Cross column correlation

From Planet PostgreSQL. Published on Jul 23, 2014.

Planner estimates have already been discussed on this blog in my previous posting and also in some posting before that. A couple of years ago I stumbled over an interesting issue which is commonly known as “cross correlation”. Let us consider the following example: test=# CREATE TABLE t_test (a int, b int); CREATE TABLE test=# […]

Andrew Dunstan: Code size

From Planet PostgreSQL. Published on Jul 21, 2014.

Someone was just talking about the size of some source files in PostgreSQL. The source code (.c, .h, .y and .l files) weighs in at a bit over 1 million lines of code. The documentation source has another roughly 300,000 lines. That's a large project, but by no means enormous by today's standards. The biggest source code file is pg_dump.c, at around 15,700 lines. The biggest documentation file is funcs.sgml, at around 17,600 lines. Both of these might well do with a bit of reorganization.

Michael Paquier: Postgres 9.5 feature highlight: Display failed queries in psql

From Planet PostgreSQL. Published on Jul 19, 2014.

Postgres 9.5 is coming up with a new ECHO mode for psql that has been introduced by this commit:

commit: 5b214c5dd1de37764797b3fb9164af3c885a7b86
author: Fujii Masao <fujii@postgresql.org>
date: Thu, 10 Jul 2014 14:27:54 +0900
Add new ECHO mode 'errors' that displays only failed commands in psql.

When the psql variable ECHO is set to 'errors', only failed SQL commands
are printed to standard error output. Also this patch adds -b option
into psql. This is equivalent to setting the variable ECHO to 'errors'.

Pavel Stehule, reviewed by Fabrízio de Royes Mello, Samrat Revagade,
Kumar Rajeev Rastogi, Abhijit Menon-Sen, and me.

Up to now, there have been two ECHO modes:

  • "all", to print to the standard output all the queries before they are parsed or executed. This can be set when starting psql with option -a.
  • "queries", to have psql print all the queries sent to server. This can be set additionally with option -e of psql.

The new mode is called "errors" and can be set either with the option -b when starting psql or with the \set command in a psql session like this:

=# \set ECHO errors

The feature added is simple: have psql print all the failed queries in the standard error output. The failed query is printed in an additional field prefixed with STATEMENT:

=# CREATE TABLES po ();
ERROR:  42601: syntax error at or near "TABLES"
LINE 1: CREATE TABLES po ();
           ^
LOCATION:  scanner_yyerror, scan.l:1053
STATEMENT:  CREATE TABLES po ();

If multiple queries are specified within a single input, only the query that failed is displayed:

=# CREATE TABLE aa (a int); CREATE FOO po; CREATE TABLE bb (a int);
CREATE TABLE
ERROR:  42601: syntax error at or near "FOO"
LINE 1: CREATE FOO po;
               ^
LOCATION:  scanner_yyerror, scan.l:1053
STATEMENT:  CREATE FOO po;
CREATE TABLE

Also, queries that are typed across multiple lines are shown as they are, spaces included:

=# SELECT
      col1_not_here,
      col2_not_here
   FROM
      table_not_here;
ERROR:  42P01: relation "table_not_here" does not exist
LINE 5:     table_not_here;
            ^
LOCATION:  parserOpenTable, parse_relation.c:986
STATEMENT:  SELECT
    col1_not_here,
    col2_not_here
FROM
    table_not_here;

Hans-Juergen Schoenig: Detecting wrong planner estimates

From Planet PostgreSQL. Published on Jul 17, 2014.

In 99% of all the cases the PostgreSQL planner is doing a perfect job to optimize your queries and to make sure, that you can enjoy high performance and low response times. The infrastructure ensuring this is both sophisticated as well as robust. However, there are some corner cases, which can turn out to be quite […]

Hubert 'depesz' Lubaczewski: Waiting for 9.5 – psql: Show tablespace size in \db+

From Planet PostgreSQL. Published on Jul 15, 2014.

On 14th of July, Alvaro Herrera committed patch: psql: Show tablespace size in \db+   Fabrízio de Royes Mello As I previously mentioned – I'm sucker for psql additions. And while todays patch is rather small, I really like it. What it does? It simply makes \db+ show size of all objects in given tablespace. […]

gabrielle roth: PDXPUG: July meeting – OSCON BoF Session

From Planet PostgreSQL. Published on Jul 15, 2014.

When: 7-8pm Tuesday
Where: Oregon Convention Center, Room E147

We’re having a Birds of a Feather session at OSCON instead of our usual July meeting. Come hang out with Pg peeps who are in town for the conference! You do not need to be registered for the conference to attend the BoF.

There will be an additional social hour afterwards should we need it.

See you there!


Hubert 'depesz' Lubaczewski: Waiting for 9.5 – Implement IMPORT FOREIGN SCHEMA.

From Planet PostgreSQL. Published on Jul 14, 2014.

On 10th of July, Tom Lane committed patch: Implement IMPORT FOREIGN SCHEMA.   This command provides an automated way to create foreign table definitions that match remote tables, thereby reducing tedium and chances for error. In this patch, we provide the necessary core-server infrastructure and implement the feature fully in the postgres_fdw foreign-data wrapper. Other […]

Magnus Hagander: Getting information about PostgreSQL SSL connections

From Planet PostgreSQL. Published on Jul 14, 2014.

PostgreSQL can, as many other products, use SSL to secure client/server communications. It can be configured to be optional or required, to require a client certificate, or even to use the SSL client certificate to log in. However, the DBA tools are currently a bit lacking in this regard. In particular, there is no way for a DBA to see what SSL parameters are in force (or indeed if SSL is enabled at all) for an already existing connection.

There are multiple ways to see the status of a specific connection (such as the libpq PQgetssl() function, the psql startup message, or the sslinfo module). Unfortunately all these methods have one thing in common - they are only available to the process making the connection to the database, not to a DBA inspecting the system from the outside.

9.4 will make it a little bit better, because log_connections now includes SSL information when the user connects, similar to:
LOG:  connection authorized: user=mha database=postgres SSL enabled (protocol=TLSv1.1, cipher=ECDHE-RSA-AES256-SHA)

But this can still be a bit of a pain to figure out for existing connections of course.

To deal with this problem, I've written a small PostgreSQL extension called pg_sslstatus. When loaded using shared_preload_libraries it will keep track of the SSL status for each connection, and expose it through a system view named pg_sslstatus like this:

postgres=# SELECT * FROM pg_sslstatus;
  pid  | ssl | bits | compression | version |        cipher        |                         clientdn                         
-------+-----+------+-------------+---------+----------------------+----------------------------------------------------------
 27286 | t   |  256 | f           | TLSv1   | ECDHE-RSA-AES256-SHA | 
 26682 | t   |  256 | t           | TLSv1   | ECDHE-RSA-AES256-SHA | /C=AU/ST=Some-State/O=Internet Widgits Pty Ltd/CN=magnus
 26693 | f   |      |             |         |                      | 
(3 rows)

It will include one row for each connection, and can then be joined with either pg_stat_activity or pg_stat_replication to view the SSL data in a bigger context:

postgres=# SELECT datname, usename, client_addr, ssl, cipher FROM pg_stat_activity INNER JOIN pg_sslstatus ON pg_stat_activity.pid=pg_sslstatus.pid;
 datname  | usename | client_addr | ssl |        cipher        
----------+---------+-------------+-----+----------------------
 postgres | mha     | 127.0.0.1   | t   | ECDHE-RSA-AES256-SHA
 postgres | mha     |             | f   | 
 pgweb    | pgweb   |             | f   | 
(3 rows)
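
The same idea works for replication connections; here is a sketch of the equivalent join against pg_stat_replication (column names as of PostgreSQL 9.2 and later, not an output from the original post):

SELECT application_name, client_addr, ssl, cipher
  FROM pg_stat_replication
  JOIN pg_sslstatus USING (pid);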

The extension is available from my github account today for existing versions of PostgreSQL, under the standard PostgreSQL license. My plan is to work on having a similar view included in PostgreSQL 9.5 by default, but it's unfortunately too late to include it in 9.4 at this point.

Michael Paquier: Postgres 9.5 feature highlight: IMPORT FOREIGN SCHEMA

From Planet PostgreSQL. Published on Jul 11, 2014.

IMPORT FOREIGN SCHEMA is a SQL command defined in the SQL specification that allows importing, from a foreign source, a schema made of foreign tables. Its support has been added in Postgres 9.5 with the following commit:

commit 59efda3e50ca4de6a9d5aa4491464e22b6329b1e
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Thu Jul 10 15:01:31 2014 -0400

Implement IMPORT FOREIGN SCHEMA.

This command provides an automated way to create foreign table definitions
that match remote tables, thereby reducing tedium and chances for error.
In this patch, we provide the necessary core-server infrastructure and
implement the feature fully in the postgres_fdw foreign-data wrapper.
Other wrappers will throw a "feature not supported" error until/unless
they are updated.

Ronan Dunklau and Michael Paquier, additional work by me

This feature is made of two parts:

  • New API available for foreign data wrappers to support this SQL query
  • Support for this query in postgres_fdw, foreign-data wrapper (FDW) for PostgreSQL available in core.

The new API available has the following shape:

List *
ImportForeignSchema (ImportForeignSchemaStmt *stmt, Oid serverOid);

ImportForeignSchemaStmt is a parsed representation of the raw IMPORT FOREIGN SCHEMA query and serverOid is the OID of the FDW server used for the import. The parsed statement contains all the information a FDW needs to rebuild a schema fetched from a remote source, mainly:

  • Type of import, available in stmt->list_type, along with the table list (not present for ALL):
    • FDW_IMPORT_SCHEMA_LIMIT_TO (LIMIT TO clause specified in the query) for a restricted list of table names to import
    • FDW_IMPORT_SCHEMA_EXCEPT (EXCEPT clause specified in the query) for a list of tables not to fetch during the import
    • FDW_IMPORT_SCHEMA_ALL (no LIMIT TO or EXCEPT clause in the query) to let the FDW know that all the tables of the foreign schema should be imported
  • Remote schema name
  • List of options to customize the import

This API then needs to return a list of raw queries that will be applied as-is by the server after parsing them. The local schema is overridden by the server to avoid any abuse. The documentation should be used as a reference for more details.

The second part of the feature is the support of IMPORT FOREIGN SCHEMA in postgres_fdw itself, allowing a schema to be imported from a different node. For example, let's take the case of two instances on the same server. The first node listens on port 5432 and uses postgres_fdw to connect to a second node listening on port 5433 (for more details on how to set that up, refer to earlier posts on this blog or directly have a look at the official documentation).

On the remote node (listening on 5433), the following two tables are created in the default schema, aka "public":

=# CREATE TABLE remote_tab1 (a int not null);
CREATE TABLE
=# CREATE TABLE remote_tab2 (b timestamp default now());
CREATE TABLE

Importing them locally into schema public is a matter of running this command on the local node (the local schema name is defined with the INTO clause, and the foreign schema at the beginning of the query):

=# IMPORT FOREIGN SCHEMA public FROM SERVER postgres_server INTO public;
IMPORT FOREIGN SCHEMA
=# \d
                List of relations
 Schema |    Name     |     Type      | Owner  
--------+-------------+---------------+--------
 public | remote_tab1 | foreign table | ioltas
 public | remote_tab2 | foreign table | ioltas
(2 rows)

IMPORT FOREIGN SCHEMA offers some control over the list of tables imported thanks to LIMIT TO and EXCEPT. For example, this query imports only the table remote_tab1 into schema test_import1:

=# CREATE SCHEMA test_import1;
CREATE SCHEMA
=# IMPORT FOREIGN SCHEMA public LIMIT TO (remote_tab1)
   FROM SERVER postgres_server INTO test_import1;
IMPORT FOREIGN SCHEMA
=# \d test_import1.*
      Foreign table "test_import1.remote_tab1"
  Column |  Type   | Modifiers |    FDW Options    
 --------+---------+-----------+-------------------
  a      | integer | not null  | (column_name 'a')
 Server: postgres_server
 FDW Options: (schema_name 'public', table_name 'remote_tab1')

And this query would import everything except remote_tab1 into schema test_import2:

 =# CREATE SCHEMA test_import2;
 CREATE SCHEMA
 =# IMPORT FOREIGN SCHEMA public EXCEPT (remote_tab1)
    FROM SERVER postgres_server INTO test_import2;
 IMPORT FOREIGN SCHEMA
 =# \d test_import2.*
                Foreign table "test_import2.remote_tab2"
  Column |            Type             | Modifiers |    FDW Options    
 --------+-----------------------------+-----------+-------------------
  b      | timestamp without time zone |           | (column_name 'b')
Server: postgres_server
FDW Options: (schema_name 'public', table_name 'remote_tab2')

By default, the import will try to import collations and NOT NULL constraints. So, coming back to what has been imported into schema public, relation remote_tab1 is defined as follows, with a NOT NULL constraint:

=# \d remote_tab1
         Foreign table "public.remote_tab1"
  Column |  Type   | Modifiers |    FDW Options    
 --------+---------+-----------+-------------------
  a      | integer | not null  | (column_name 'a')
 Server: postgres_server
 FDW Options: (schema_name 'public', table_name 'remote_tab1')

Note that this can be controlled with the OPTIONS clause of IMPORT FOREIGN SCHEMA, postgres_fdw offering 3 options:

  • import_collate to import collations, default is true
  • import_default to import default expressions, default is false
  • import_not_null to import NOT NULL constraints, default is true

import_default is the tricky one, particularly for volatile expressions. The import will also fail if the default expression depends on objects that do not exist locally, as would happen when trying to import a relation with a SERIAL column, whose default refers to a sequence that only exists on the remote side.
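
As a sketch of the syntax (test_import3 is a schema name made up for this example; the server and table names reuse the ones above), default expressions can be requested and NOT NULL import disabled like this:

-- test_import3 is a hypothetical schema created just for this illustration
CREATE SCHEMA test_import3;
IMPORT FOREIGN SCHEMA public LIMIT TO (remote_tab2)
    FROM SERVER postgres_server INTO test_import3
    OPTIONS (import_default 'true', import_not_null 'false');

Importing the default of remote_tab2 should be fine here since now() exists locally, while a SERIAL column would still fail for the reason mentioned above.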

Dave Cramer: PostgreSQL JDBC Driver version 9_3_1102 released

From Planet PostgreSQL. Published on Jul 10, 2014.

This is a maintenance release with a few interesting upgrades

Version 9.3-1102 (2014-07-10)

Author: epgrubmair
    fix copyOut close hanging, bug #161 from epgrubmair

Author: romank0
    backpatch exception during close of fully read stream from romank0

Author: Christophe Canovas
    Added caching for ResultSetMetaData (complete commit)

Author: Elizabeth Chatman
    NullPointerException in AbstractJdbc2DatabaseMetaData.getUDTs
    setNull, setString, setObject may fail if a specified type cannot be transferred in a binary mode #151
    backpatch fix for changing datestyle before copy

Author: TomonariKatsumata
    binary transfer fixes, new feature: -1 for forceBinaryTransfer

Author: Sergey Chernov
    connectTimeout property support backpatch

Author: Naoya Anzai
    fix prepared statement ERROR due to EMPTY_QUERY defined as static.

9.4 jars can also be found on the site 

Greg Sabino Mullane: Version differences via Github from the command line

From Planet PostgreSQL. Published on Jul 09, 2014.

I work with a lot of open source projects, and I use the command-line for almost everything. It often happens that I need to examine a file from a project, and thanks to bash, Github, and curl, I can do so easily, without even needing to have the repo handy. One of the things I do sometimes is compare a file across versions to see what has changed. For example, I needed to see what changes were made between versions 1.21 and 1.22 to the file includes/UserMailer.php, which is part of the MediaWiki project. For this trick to work, the project must be on Github, and must label their versions in a consistent manner, either via git branches or git tags.

MediaWiki exists on Github as wikimedia/mediawiki-core. The MediaWiki project tags all of their releases in the format X.Y.Z, so in this example we can use the git tags 1.21.0 and 1.22.0 (or the corresponding branches REL1_21 and REL1_22). Github is very nice because you can view a specific file at a certain commit (aka a tag), and even grab it over the web as a plain text file. The format is:

https://raw.githubusercontent.com/PROJECTNAME/BRANCH-OR-TAG/FILE

Note that you can use a tag OR a branch! So to compare these two files, we can use one of these pairs:

https://raw.githubusercontent.com/wikimedia/mediawiki-core/REL1_21/includes/UserMailer.php
https://raw.githubusercontent.com/wikimedia/mediawiki-core/REL1_22/includes/UserMailer.php

https://raw.githubusercontent.com/wikimedia/mediawiki-core/1.21.0/includes/UserMailer.php
https://raw.githubusercontent.com/wikimedia/mediawiki-core/1.22.0/includes/UserMailer.php

All that is left is to treat git as a web service and compare the two files at the command line ourselves. The program curl is a great tool for downloading the files, as it dumps to stdout by default. We will add a -s flag (for "silent") to prevent it from showing the progress meter as it usually does. The last bit of the puzzle is to use <(), bash's process substitution feature, to trick diff into comparing the curl outputs as if they were files. So our final command is:

diff <(curl -s https://raw.githubusercontent.com/wikimedia/mediawiki-core/1.21.0/includes/UserMailer.php) \
<(curl -s https://raw.githubusercontent.com/wikimedia/mediawiki-core/1.22.0/includes/UserMailer.php) \
| more

Voila! A quick and simple glance at what changed between those two tags. This should work for any project on Github. You can also replace the branch or tag with the word "master" to see the current version. For example, the PostgreSQL project lives on github as postgres/postgres. They use the format RELX_Y_Z in their tags. To see what has changed since release 9.3.4 in the psql help file (as a context diff), run:

diff -c <(curl -s https://raw.githubusercontent.com/postgres/postgres/REL9_3_4/src/bin/psql/help.c) \
<(curl -s https://raw.githubusercontent.com/postgres/postgres/master/src/bin/psql/help.c)

You are not limited to diff, of course. For a final example, let's see how many times Tom Lane is mentioned in the version 9 release notes:

for i in {0,1,2,3,4}
do grep -Fc 'Tom Lane' \
<(curl -s https://raw.githubusercontent.com/postgres/postgres/master/doc/src/sgml/release-9.$i.sgml)
done
272
206
174
115
16

The last number is so low relative to the rest because 9.4 is still under development. Rest assured Tom's contributions have not slowed down! :) Thanks to Github for providing such a useful service for so many open source projects, and for providing the raw text to allow useful hacks like this.

Kirk Roybal: DFWPUG Meetup August 6, 2014

From Planet PostgreSQL. Published on Jul 08, 2014.

Just a quick reminder that Dallas/Fort Worth PostgreSQL Users Group has a Meetup the first Wednesday of every month.

What: PostgreSQL-backed Websites: Middleware Considered Harmful

Who: Justin Tocci

When: Wednesday, August 6, 2014 7:00 PM

Where:
Improving Enterprises
16633 Dallas Parkway Suite 110 Addison, TX 75001

DFW PUG on Meetup

Paul Ramsey: FOSS4G 2014 in Portland, Oregon, September 2014

From Planet PostgreSQL. Published on Jul 07, 2014.

Just a quick public service announcement for blog followers in the Pacific Northwest and environs: you've got a once in a not-quite-lifetime opportunity to attend the "Free and Open Source Software for Geospatial" (aka FOSS4G) conference this year in nearby Portland, Oregon, a city so hip they have trouble seeing over their pelvis.

Anyone in the GIS / mapping world should take the opportunity to go, to learn about what technology the open source world has available for you, to meet the folks writing the software, and to learn from other folks like you who are building cool things.

September 8th-13th, be there and be square.

Baji Shaik: Updating pg_cast helps, however sometimes !!

From Planet PostgreSQL. Published on Jul 05, 2014.

One of our customers was migrating a table from SQL Server to PostgreSQL using EnterpriseDB's Migration Toolkit (MTK). The table has a boolean column. During the migration, MTK converts the "boolean" datatype to "bit" in PostgreSQL, and the process was taking 6 hours to complete. The customer then wanted to change the column type from "bit" to "integer", and the ALTER command to change the type took another 6 hours in PostgreSQL. If he migrates only the structure to PostgreSQL first and then changes the type from "bit" to "integer" before loading, the data load fails with the error below. If the load were allowed, the migration would take only 6 hours, since there would be no need to alter the type after the data load.

ERROR: column "hidehelm" is of type integer but expression is of type boolean
Hint: You will need to rewrite or cast the expression.

So I found a workaround: make the cast from "boolean" to "integer" implicit by updating the "pg_cast" catalog as shown below. With this, he can load the boolean data directly into the integer column, which saves the time spent altering the type from bit to integer after the migration.
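
For reference, the session below uses a test table named tarik whose definition is not shown in the original post; based on the error message, it presumably looks like this:

CREATE TABLE tarik (t integer);  -- presumed definition, not shown in the original post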

postgres=# insert into tarik values (1::boolean);
ERROR:  column "t" is of type integer but expression is of type boolean
LINE 1: insert into tarik values (1::boolean);
HINT:  You will need to rewrite or cast the expression.
postgres=# select * from pg_cast where castsource='boolean'::regtype and casttarget='int4'::regtype;
 castsource | casttarget | castfunc | castcontext | castmethod 
------------+------------+----------+-------------+------------
         16 |         23 |     2558 | e           | f
(1 row)
postgres=# update pg_cast set castcontext ='i' where castsource='boolean'::regtype and casttarget='int4'::regtype;
UPDATE 1
postgres=# select * from pg_cast where castsource='boolean'::regtype and casttarget='int4'::regtype;
 castsource | casttarget | castfunc | castcontext | castmethod 
------------+------------+----------+-------------+------------
         16 |         23 |     2558 | i           | f
(1 row)
postgres=# insert into tarik values (1::boolean);
INSERT 0 1

Irrespective of any side effects of this workaround (none that I am aware of), it worked, and the migration took only 6 hours. Of course, the customer rolled back this setting after migrating the table. Updating catalogs is very dangerous, so maybe I should have concentrated on reducing the time of the ALTER command after migration instead; anyway, it worked, so I was happy !! ;-)

Thanks for any suggestions/comments.

Baji Shaik: Oops I corrupted my table, of-course just to recover salvaged data.

From Planet PostgreSQL. Published on Jul 04, 2014.


The way I started the title might be confusing, "I corrupted my table", so everyone starts with "Crap !! Why did you do that?!", so let me justify it.
I see many customers asking to recover corrupted tables for which they have no backup. In such cases it is hard to recover the table completely and it takes a lot of work, but we can recover the salvageable data if they do not care about the corrupted rows.

Let me corrupt the table first.. :-)

I created a million-row table called "to_be_damaged":
postgres=# select count(*) from to_be_damaged ;
  count
---------
 1000000
(1 row)
postgres=# select relfilenode,relname from pg_class where relname='to_be_damaged';
 relfilenode |    relname    
-------------+---------------
      461257 | to_be_damaged
(1 row)
I've used "hexedit" to damage it. Open relfilenode file from OS level using hexedit and try picking a line which is the start of an 8K boundary and typing hex DE AD BE EF across it.

postgres=# select count(*) from to_be_damaged ;
ERROR:  invalid page in block 0 of relation base/12896/461257

Now create an identically structured table, "salvaged", to hold the data recovered from the "to_be_damaged" table.
postgres=# create table salvaged(t int);
CREATE TABLE
I prepared the function below, which copies the rows that are still salvageable:
create or replace function salvage_damaged()
  returns void
  language plpgsql
as $$
declare
  pageno int;
  tupno int;
  pos tid;
begin
  <<pageloop>>
  for pageno in 0..35930 loop  -- pg_class.relpages for the damaged table
    for tupno in 1..1000 loop
      pos = ('(' || pageno || ',' || tupno || ')')::tid;
      begin
        insert into salvaged select * from to_be_damaged where ctid = pos;
      exception
        when sqlstate 'XX001' then
          raise warning 'skipping page %', pageno;
          continue pageloop;
        when others then
          raise warning 'skipping row %', pos;
      end;
    end loop;
  end loop;
end;
$$;

Now run the function to copy the salvageable rows:
postgres# select salvage_damaged();
WARNING: skipping page 0
salvage_damaged
-----------------

(1 row)
postgres=# select count(*) from salvaged ;
count
----------
12999815
(1 row)

postgres=# select 13000000-12999815;
?column?
----------
185
(1 row)

I hope it helps someone. Thanks for reading.

Michael Paquier: Postgres 9.5 feature highlight: WHERE clause pushdown in subqueries with window functions

From Planet PostgreSQL. Published on Jul 04, 2014.

Postgres 9.5 is going to improve the performance of subqueries using window functions by allowing the pushdown of WHERE clauses within them. Here is a commit, done during commit fest 1, that is the origin of this improvement:

commit d222585a9f7a18f2d793785c82be4c877b90c461
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Fri Jun 27 23:08:08 2014 -0700

Allow pushdown of WHERE quals into subqueries with window functions.

We can allow this even without any specific knowledge of the semantics
of the window function, so long as pushed-down quals will either accept
every row in a given window partition, or reject every such row.  Because
window functions act only within a partition, such a case can't result
in changing the window functions' outputs for any surviving row.
Eliminating entire partitions in this way obviously can reduce the cost
of the window-function computations substantially.

David Rowley, reviewed by Vik Fearing; some credit is due also to
Thomas Mayer who did considerable preliminary investigation.

The pushdown of the WHERE qual is done only if two conditions are satisfied:

  • Only the partitioning columns are referenced
  • The qual contains no volatile functions

Let's have a look at how things are improved by using a simple data set: a list of individuals referenced by an ID, with information about the city where they live and their respective age (the authorities lacked imagination for the city names).

=# CREATE TABLE population (
     person_id serial,
     age int,
     city text);
CREATE TABLE
=# INSERT INTO population (age, city)
     SELECT round(random() * 100),
            'city ' || round(random() * 200)
     FROM generate_series(1, 1000000);
INSERT 0 1000000

Now, here is a query that evaluates the average, minimum and maximum age of the cities where the population is living. In 9.4, this query returns the following plan:

=# EXPLAIN SELECT * FROM
     (SELECT city,
        avg(age) OVER (PARTITION BY city) avg_age,
        min(age) OVER (PARTITION BY city) min_age,
        max(age) OVER (PARTITION BY city) max_age
      FROM population) age_all
   WHERE city in ('city 26', 'city 47')
   GROUP BY avg_age, city, min_age, max_age;
                                         QUERY PLAN
 ---------------------------------------------------------------------------------------------
  HashAggregate  (cost=184834.34..184844.29 rows=995 width=48)
   Group Key: age_all.avg_age, age_all.city, age_all.min_age, age_all.max_age
    ->  Subquery Scan on age_all  (cost=149734.84..184734.84 rows=9950 width=48)
          Filter: (age_all.city = ANY ('{"city 26","city 47"}'::text[]))
          ->  WindowAgg  (cost=149734.84..172234.84 rows=1000000 width=12)
                ->  Sort  (cost=149734.84..152234.84 rows=1000000 width=12)
                      Sort Key: population.city
                      ->  Seq Scan on population  (cost=0.00..15896.00 rows=1000000 width=12)
  Planning time: 0.227 ms
 (9 rows)

As you can see, the subquery does a sequential scan on the whole table ("Seq Scan on population"), and the WHERE clause is applied only after the results have been generated through a costly sort operation on all the rows. This query took 2 seconds to run on the author's machine (the sort did not spill to disk as work_mem was set high enough).

 =# SELECT * FROM
     (SELECT city,
        avg(age) OVER (PARTITION BY city) avg_age,
        min(age) OVER (PARTITION BY city) min_age,
        max(age) OVER (PARTITION BY city) max_age
      FROM population) age_all
    WHERE city in ('city 26', 'city 47')
    GROUP BY avg_age, city, min_age, max_age;
   city   |       avg_age       | min_age | max_age
 ---------+---------------------+---------+---------
  city 47 | 49.6150433555152248 |       0 |     100
  city 26 | 49.7953169156237384 |       0 |     100
 (2 rows)
 Time: 2276.422 ms

In Postgres 9.5, here is the plan obtained for the same query (the plan has been reformatted a bit here to fit on this blog page):

                                        QUERY PLAN
 ------------------------------------------------------------------------------------
  HashAggregate  (cost=15171.49..15178.33 rows=684 width=72)
    Group Key: avg(population.age) OVER (?),
               population.city,
               min(population.age) OVER (?),
               max(population.age) OVER (?)
    ->  WindowAgg  (cost=14880.83..15034.71 rows=6839 width=36)
          ->  Sort  (cost=14880.83..14897.93 rows=6839 width=36)
                Sort Key: population.city
                ->  Seq Scan on population  (cost=0.00..14445.20 rows=6839 width=36)
                      Filter: (city = ANY ('{"city 26","city 47"}'::text[]))
 Planning time: 0.203 ms
(6 rows)

Things get better: the WHERE clause is evaluated within the subquery, drastically reducing the cost of the sort by cutting down the number of tuples selected. The query now runs in only 300ms, a nice improvement compared to the pre-commit behavior.
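
Coming back to the two conditions listed above, here is a sketch of a counter-example: since age is not a partitioning column, a qual like the one below cannot be pushed down without changing the window results, so the filter stays above the WindowAgg node (plan not shown):

SELECT * FROM
    (SELECT city, age,
        avg(age) OVER (PARTITION BY city) avg_age
     FROM population) age_all
  WHERE age < 30;  -- age is not a PARTITION BY column, so this qual is not pushed down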

Devrim GÜNDÜZ: Using huge pages on RHEL 7 and PostgreSQL 9.4

From Planet PostgreSQL. Published on Jul 04, 2014.

PostgreSQL 9.4 will finally support huge pages. This article covers configuring huge pages on a RHEL 7 box.



Continue reading "Using huge pages on RHEL 7 and PostgreSQL 9.4"

Devrim GÜNDÜZ: Running more than one 9.4 /9.3 instance in parallel on RHEL 7

From Planet PostgreSQL. Published on Jul 03, 2014.

In the third post of the "PostgreSQL on RHEL 7" series, I describe running more than one 9.4 instance on RHEL 7. This article is also valid for 9.3.
Continue reading "Running more than one 9.4 /9.3 instance in parallel on RHEL 7"

Hubert 'depesz' Lubaczewski: Waiting for 9.5 – Add cluster_name GUC which is included in process titles if set.

From Planet PostgreSQL. Published on Jul 02, 2014.

On 29th of June, Andres Freund committed patch: Add cluster_name GUC which is included in process titles if set.   When running several postgres clusters on one OS instance it's often inconveniently hard to identify which "postgres" process belongs to which postgres instance.   Add the cluster_name GUC, whose value will be included as part […]
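
As a quick hedged sketch of how this would be used on a 9.5 development build (the cluster name below is made up), the new GUC can be set with ALTER SYSTEM and takes effect after a server restart:

-- 'main_cluster' is a made-up name; the setting requires a server restart to take effect
ALTER SYSTEM SET cluster_name = 'main_cluster';

After the restart, the background process titles visible in ps output include the name, roughly like "postgres: main_cluster: checkpointer process".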

Pavel Golub: Get PostgreSQL sources using SVN

From Planet PostgreSQL. Published on Jul 02, 2014.

As you probably know, the PostgreSQL sources are managed with the Git version control system. Which is great, but for me as an SVN user it would be much better to access the sources using familiar tools, e.g. TortoiseSVN, mainly because I don’t need write access, read-only mode is enough. There is a workaround for people like me.

First of all, there is a mirror of the PostgreSQL sources on GitHub. Second, GitHub supports the SVN protocol via a bridge that translates svn commands for GitHub.

So in my case I only need trunk with the latest commits. I’ve created an empty folder and made a checkout from the https://github.com/postgres/postgres/trunk URL.



Devrim GÜNDÜZ: Installing and configuring PostgreSQL 9.3 and 9.4 on RHEL 7

From Planet PostgreSQL. Published on Jul 01, 2014.

Red Hat Enterprise Linux 7 was released recently. The day after the release, we announced PostgreSQL 9.4 RPMs for RHEL 7, and last week we pushed PostgreSQL 9.3 RPMs for RHEL 7.

Since Red Hat switched to systemd as of RHEL 7, there are some changes in the packaging as compared to RHEL 5 and RHEL 6. So, how do you install and configure PostgreSQL on RHEL 7?
Continue reading "Installing and configuring PostgreSQL 9.3 and 9.4 on RHEL 7"

PlanetDjango.org shutting down

By Adomas Paltanavičius from Planet Django. Published on Mar 08, 2014.

Is Open Source Consulting Dead?

By chrism from plope. Published on Sep 10, 2013.

Has Elvis left the building? Will we be able to sustain ourselves as open source consultants?

Consulting and Patent Indemnification

By chrism from plope. Published on Aug 09, 2013.

Article about consulting and patent indemnification

Python Advent Calendar 2012 Topic

By chrism from plope. Published on Dec 24, 2012.

An entry for the 2012 Japanese advent calendar at http://connpass.com/event/1439/

Why I Like ZODB

By chrism from plope. Published on May 15, 2012.

Why I like ZODB better than other persistence systems for writing real-world web applications.

A str.__iter__ Gotcha in Cross-Compatible Py2/Py3 Code

By chrism from plope. Published on Mar 03, 2012.

A bug caused by a minor incompatibility can remain latent for long periods of time in a cross-compatible Python 2 / Python 3 codebase.

In Praise of Complaining

By chrism from plope. Published on Jan 01, 2012.

In praise of complaining, even when the complaints are absurd.

2012 Python Meme

By chrism from plope. Published on Dec 24, 2011.

My "Python meme" replies.

In Defense of Zope Libraries

By chrism from plope. Published on Dec 19, 2011.

A much too long defense of Pyramid's use of Zope libraries.

Plone Conference 2011 Pyramid Sprint

By chrism from plope. Published on Nov 10, 2011.

An update about the happenings at the recent 2011 Plone Conference Pyramid sprint.

Jobs-Ification of Software Development

By chrism from plope. Published on Oct 17, 2011.

Try not to Jobs-ify the task of software development.

WebOb Now on Python 3

By chrism from plope. Published on Oct 15, 2011.

Report about porting to Python 3.

Open Source Project Maintainer Sarcastic Response Cheat Sheet

By chrism from plope. Published on Jun 12, 2011.

Need a sarcastic response to a support interaction as an open source project maintainer? Look no further!

Pylons Miniconference #0 Wrapup

By chrism from plope. Published on May 04, 2011.

Last week, I visited the lovely Bay Area to attend the 0th Pylons Miniconference in San Francisco.

Pylons Project Meetup / Minicon

By chrism from plope. Published on Apr 14, 2011.

In the SF Bay Area on the 28th, 29th, and 30th of this month (April), 3 separate Pylons Project events.

PyCon 2011 Report

By chrism from plope. Published on Mar 19, 2011.

My personal PyCon 2011 Report