
Andrew Dunstan: PostgreSQL 9.4 and beyond

From Planet PostgreSQL. Published on Dec 18, 2014.

PostgreSQL 9.4 is released today, and once again my main input has been in the JSON area. We've introduced a new spiffy JSON type called "jsonb" that makes performing operations on JSON much, much faster, because it's not stored as plain text, but rather decomposed into the structures and scalar values that go to make up JSON. Much of the credit for this goes to Oleg Bartunov and Teodor Sigaev, the original authors of the hstore module, who created a "nested hstore" module that we eventually based jsonb on. Credit also goes to Alexander Korotkov and Peter Geoghegan, especially for their indexing work, which will make many operations super fast. My role was to make sure that all (or almost all) of the operations that are available for the original json type are also available for the jsonb type.

There have been some benchmark tests suggesting that our jsonb is faster in many respects than other database engines that use JSON as their native storage.

The nice thing about this is that you can now have a highly performing mixture of traditional relational data and semi-structured treeish data with proper transactional semantics, all in one engine. That's a very attractive proposition for a lot of people looking to design new data stores.

My other JSON contribution was to add a number of functions that allow the construction of arbitrarily complex json documents from data in the database. This will be particularly useful for clients that expect json in a particular, non-regular shape, and it is based on the needs of some actual json-using customers.

Already committed for 9.5, to be released some time next year, are jsonb versions of those json functions we didn't provide jsonb versions of in 9.4, and functions to allow the stripping of object fields with 'null' as the value from json and jsonb values (this can result in much more compact storage in some cases, with no significant semantic loss).
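For the curious, the null-stripping functions shipped in 9.5 as json_strip_nulls() and jsonb_strip_nulls(); a quick illustration (note that nulls inside arrays are deliberately left alone, since stripping them would change array positions):

```sql
SELECT jsonb_strip_nulls('{"a": 1, "b": null, "c": {"d": null}}'::jsonb);
-- {"a": 1, "c": {}}
```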

I made a few other contributions to 9.4, but they are not worth mentioning in detail.

I'm now looking for new PostgreSQL projects, not involving JSON. I'm not sure what those should be. For the most part my contributions over the last 12 years have tended not to be in  the area that involves the planning and execution of queries, so maybe it's time for me to dive into that pool. But to do what I don't yet know.

Josh Berkus: Your Hanukkah Present: PostgreSQL 9.4

From Planet PostgreSQL. Published on Dec 18, 2014.

For the third night of Hanukkah, we have a nice present for you: PostgreSQL 9.4.0.  Now eat your latkes.

Let's talk about 9.4 by talking about one of my favorite features: JSONB.  A lot of people worked to bring you JSONB in this release, including my colleague Andrew Dunstan; Oleg Bartunov and Teodor Sigaev (sponsored by Engine Yard); Peter Geoghegan; and several others.

For example, imagine we have a large set of JSON documents which contain publication data about various books.  One reason we might have this is that we receive the data in JSON or HTML format from various publishers, and it is not normalized, so we keep it in its original format for data mining.  We can store the documents in a JSONB column like so:

    -- citext comes from the citext extension; the isbn type from the isn extension
    create table booksdata (
        title citext not null,
        isbn isbn not null primary key,
        pubinfo jsonb not null
    );

Some example pubinfo might be:

    {"authority" : [ "James Michaels", "Nina Totenberg" ], "cost": 16, "edition" : 3,
     "format": "Trade PB", "publisher": "HarperOne",
     "published_on": "1995-08-05", "revised" : [ "1996-06-01", "1999-01-01" ] }

You'll note that the keys are already sorted.  This is a feature of JSONB: whitespace gets cleaned up, duplicate keys get removed, and keys get sorted on import.
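A quick way to see this normalization in action (the duplicate "a" key here is deliberate, to show that the last value wins):

```sql
SELECT '{"b": 2,  "a": 1, "a": 3}'::jsonb;
-- {"a": 3, "b": 2}
```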

Then we can create a general index on the JSONB column like so:

    CREATE INDEX ON booksdata USING GIN (pubinfo);

or, alternatively, a path-ops index:

    CREATE INDEX ON booksdata USING GIN (pubinfo jsonb_path_ops);

There are two different versions depending on the operations you expect to run.  The standard GIN index supports every kind of operation we support for JSONB.  The jsonb_path_ops index supports only the containment operator "@>" (see below), but produces a smaller and faster index for those kinds of searches.  However, we're going to do a bunch of different JSON operations for data mining, so we'll use the regular GIN index.

Once we have the index, we can do arbitrary path searches on the JSONB, and these can be index searches, depending on what the planner picks.  Examples of JSONB queries:

Count all books published by "It Books":

    SELECT count(*) FROM booksdata
    WHERE pubinfo @> '{ "publisher" : "It Books" }';


Get the ISBN, and extract the cost of the book as a string, for all books:

    SELECT isbn, pubinfo #>> '{"cost"}' as cost
    FROM booksdata;
        isbn      │ cost
    ──────────────┼──────
    0-06-203728-5 │ 16
    0-06-204980-1 │ 27
    0-06-206888-1 │ 180
    0-06-201744-6 │ 10
    0-06-162698-8 │ 20
    0-06-199895-8 │ 10

Give me the count of all books which have both the "publisher" and "cost" keys:

    SELECT count(*) FROM booksdata
    WHERE pubinfo ?& array['publisher','cost'];


That goofy "?&" is a special operator which says "has all of these keys".  There are other operators for "has any of these keys", and you can negate it with a NOT.
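For example, a hypothetical query against the same table, combining the any-of-these-keys operator "?|" with the single-key existence operator "?":

```sql
-- Books that record either a revision history or an edition number,
-- but carry no cost information:
SELECT count(*) FROM booksdata
WHERE pubinfo ?| array['revised','edition']
  AND NOT pubinfo ? 'cost';
```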

JSON operations can be combined with standard PostgreSQL aggregates and other query operations.  This allows us to provide far more functionality than non-relational databases have.  For example:

Give me the average cost of all books from "It Books":

    SELECT avg((pubinfo #>> '{"cost"}')::NUMERIC)
    FROM booksdata
    WHERE pubinfo @> '{ "publisher" : "It Books" }';


Now, you'll notice that I had to cast "cost" to numeric, even though it's stored as a number in the JSONB.  That's a current limitation.  By 9.5, expect to have more operators which use native JSON types.

Return the publisher name as a string, and the average cost for all books grouped by each publisher name:

    SELECT pubinfo #>> '{"publisher"}' as publisher,
        round(avg((pubinfo #>> '{"cost"}')::NUMERIC),2)
    FROM booksdata
    GROUP BY 1
    ORDER BY publisher;

            publisher         │ round
    ──────────────────────────┼────────
    Avon                      │  43.39
    Collins Reference         │  24.57
    Harper                    │  45.40
    HarperBusiness            │  26.29
    HarperOne                 │  21.96
    Harper Paperbacks         │  16.00
    Harper Voyager            │  29.18
    Igniter                   │  19.50
    It Books                  │  35.00
    William Morrow            │ 348.00
    William Morrow Paperbacks │  15.83

Wow, William Morrow is expensive!   I suspect some kind of data problem here.

Further, JSONB is sortable.  It sorts in a sensible way, first by keys and then by values, according to their primitive JSON types (numeric, boolean and text).  This allows JSONB values to be compared and even paginated.  For example, this query pulls the full pubinfo from all "It Books" titles, sorts them, and limits to 4:

    SELECT pubinfo
    FROM booksdata
    WHERE pubinfo @> '{ "publisher" : "It Books" }'
    ORDER BY pubinfo LIMIT 4 OFFSET 0;

    {"cost": 14, "format": "Trade PB", "publisher": "It Books", ...
    {"cost": 15, "format": "Hardcover", "publisher": "It Books", ...
    {"cost": 15, "format": "Trade PB", "publisher": "It Books", ...
    {"cost": 15, "format": "Trade PB", "publisher": "It Books", ...

This query grabs all pubinfo for "It Books" titles and then sorts them by the JSON of the publication date.  Since JSON has no internal type for dates, you'll want to be careful to store dates in a string format that sorts correctly, such as ISO 8601 (YYYY-MM-DD):

    SELECT pubinfo
    FROM booksdata
    WHERE pubinfo @> '{ "publisher" : "It Books" }'
    ORDER BY pubinfo #> '{"published_on"}';

    {"cost": 260, "format": "Hardcover ", "publisher": "It Books",
        "published_on": "2006-03-01"}
    {"cost": 17, "format": "Trade PB", "publisher": "It Books",
        "published_on": "2006-03-01"}
    {"cost": 90, "format": "Hardcover ", "publisher": "It Books",
        "published_on": "2006-11-01"}
    {"cost": 15, "format": "Trade PB", "publisher": "It Books",
        "published_on": "2006-12-01"}
    {"cost": 22, "format": "Trade PB", "publisher": "It Books",
        "published_on": "2007-02-01"}

So, some awesome functionality to let PostgreSQL 9.4 be your document database solution.  Download and enjoy!

gabrielle roth: Simple test for autovacuum log messages

From Planet PostgreSQL. Published on Dec 15, 2014.

I had reason recently to suspect that autovacuum jobs weren’t being properly logged on my RDS instance. Instead of compulsively re-running “SELECT relname, last_autovacuum FROM pg_stat_user_tables” while I waited for one to happen, I set up this quick test: connect to database and do the following: -- verify settings -- log_autovacuum_min_duration should be 0 ("log […]

Robins Tharakan: PostgreSQL Certification

From Planet PostgreSQL. Published on Dec 14, 2014.

Recently, someone asked for some inputs before going ahead with a PostgreSQL certification. From what I know, outside Japan, EnterpriseDB is the only consultancy offering certification services consistently over the past few years. Having recently cleared one from EnterpriseDB, I had a few things to say, and thought I'd rather list them as a full-blown post here. An important

Pavel Stehule: plpgsql_check is ready for OS X

From Planet PostgreSQL. Published on Dec 14, 2014.

mscottie found a magic option: -undefined dynamic_lookup. With this option we are able to compile plpgsql_check on the OS X platform.

Raghavendra Rao: Switchover/Switchback in Slony-I while upgrading PostgreSQL major versions 8.4.x/9.3.x

From Planet PostgreSQL. Published on Dec 13, 2014.

Every new release of PostgreSQL comes packed with exciting features. To benefit from them, the database server should be upgraded. Choosing a traditional upgrade path like pg_dump/pg_restore or pg_upgrade requires significant application downtime. Today, if you are looking for a minimum-downtime upgrade path between major PostgreSQL versions with a solid rollback plan, it can be accomplished with asynchronous Slony-I replication. Since Slony-I (know more about it here) can replicate between different PostgreSQL versions, operating systems, and bit architectures, upgrades are doable without substantial downtime. In addition, it has consistent switchover and switchback functionality in its design.

IMO, while doing major version upgrades there should be a proper fallback plan, because in case the application turns out to be buggy or fails to perform well on the upgraded version, we should be able to roll back to the older version immediately. Slony-I provides such functionality by way of switchback. This post demonstrates a minimum-downtime upgrade, including the switchover/switchback steps.

Before going to the demo, one important point should be noted: prior to PG 9.0.x, bytea columns stored data in ESCAPE format, whereas later versions default to HEX format. When performing a switchback (newer version to older version), this kind of bytea format difference is not supported by Slony-I, so the ESCAPE format should be maintained throughout the upgrade; otherwise you may encounter an error:
ERROR  remoteWorkerThread_1_1: error at end of COPY IN: ERROR:  invalid input syntax for type bytea
CONTEXT: COPY sl_log_1, line 1: "1 991380 1 100001 public foo I 0 {id,500003,name,"A ",b,"\\\\x41"}"
ERROR remoteWorkerThread_1: SYNC aborted
To fix this, no changes are required on PG 8.4.x, but on PG 9.3.5 the bytea_output parameter should be set from HEX to ESCAPE as shown. We can set it at the cluster level ($PGDATA/postgresql.conf) or the user level (ALTER USER ... SET); I have preferred to go with the user-level change.
slavedb=# alter user postgres set bytea_output to escape;
Let's proceed with the upgrade steps. Below are the details of the two servers used in this demo; change them according to your own setup if you are following along:
Origin Node (Master/Primary)                Subscriber Node (Slave/Secondary)
------------------------------------------  ------------------------------------------
Host IP     :
OS Version  : RHEL 6.5 64 bit               RHEL 6.5 64 bit
PG Version  : 8.4.22 (port 5432)            9.3.5 (port 5432)
Slony Vers. : 2.2.2                         2.2.2
PG Binaries : /usr/local/pg84/bin           /opt/PostgreSQL/9.3/
Database    : masterdb                      slavedb
PK Table    : foo(id int primary key,       ...restore PK table structure
              name char(20), image bytea)      from Origin...
For simple understanding and easy implementation, I have divided the demo into three sections:

1. Compiling Slony-I binaries against PostgreSQL versions
2. Creating Replication Scripts and executing
3. Testing Switchover/Switchback.

1. Compiling Slony-I binaries against PostgreSQL versions
Download the Slony-I sources from here, and perform a source installation against the PostgreSQL binaries on the Origin and Subscriber nodes.
On the Origin Node:
# tar -xvf slony1-2.2.2.tar.bz2
# cd slony1-2.2.2
# ./configure --with-pgbindir=/usr/local/pg84/bin
# make install

On the Subscriber Node: (assuming PG 9.3.5 is installed)
# tar -xvf slony1-2.2.2.tar.bz2
# cd slony1-2.2.2
# ./configure --with-pgconfigdir=/opt/PostgreSQL/9.3/bin
# make install
2. Creating Replication Scripts and executing
To set up replication, we need to create a few scripts that take care of the replication, including switchover/switchback.

1. initialize.slonik - This script holds the Origin/Subscriber node connection information.
2. create_set.slonik - This script lists all the Origin PK tables to replicate to the Subscriber node.
3. subscribe_set.slonik - This script starts replicating the set's data to the Subscriber node.
4. switchover.slonik - This script helps move control from the Origin to the Subscriber.
5. switchback.slonik - This script helps fall back control from the Subscriber to the Origin.

Finally, two more startup scripts "" and "" that start the slon processes according to the binaries compiled on the Origin/Subscriber nodes.

Download all scripts from here.

Here's the sample data in the foo table on the Origin node (8.4.22), including a bytea column, which we will replicate to the Subscriber node (9.3.5) with the help of the scripts created above.
masterdb=# select * from foo;
 id |  name  | image
----+--------+-------
  1 | Raghav | test1
  2 | Rao    | test2
  3 | Rags   | test3
(3 rows)
-bash-4.1$ slonik initialize.slonik
-bash-4.1$ slonik create_set.slonik
create_set.slonik:13: Set 1 ...created
create_set.slonik:16: PKey table *** *** added.
-bash-4.1$ sh
-bash-4.1$ sh //ON SUBSCRIBER NODE
-bash-4.1$ slonik subscribe_set.slonik
After successful execution of the above scripts, you can see that the data on the Origin (masterdb) has replicated to the Subscriber (slavedb). Note also that DML operations are no longer allowed on the Subscriber node:
slavedb=# select * from foo;
 id |  name  | image
----+--------+-------
  1 | Raghav | test1
  2 | Rao    | test2
  3 | Rags   | test3
(3 rows)

slavedb=# insert into foo values (4,'PG-Experts','Image2');
ERROR: Slony-I: Table foo is replicated and cannot be modified on a subscriber node - role=0
Cool... We have moved the data to the newer version, PostgreSQL 9.3.5. At this stage, if you are satisfied that all data has replicated to the Subscriber node, you can do the switchover.

3. Testing Switchover/Switchback.

Let's switch over to the latest version with the switchover script, and try inserting data on the Subscriber/Origin nodes.
-bash-4.1$ slonik switchover.slonik
switchover.slonik:8: Set 1 has been moved from Node 1 to Node 2

slavedb=# insert into foo values (4,'PG-Experts','Image2');

masterdb=# select * from foo ;
 id |    name    | image
----+------------+--------
  1 | Raghav     | test1
  2 | Rao        | test2
  3 | Rags       | test3
  4 | PG-Experts | Image2
(4 rows)

masterdb=# insert into foo values (5,'PG-Experts','Image3');
ERROR: Slony-I: Table foo is replicated and cannot be modified on a subscriber node - role=0
Perfect... This is what we were looking for: now slavedb (the Subscriber node), running PG 9.3.5, accepts data, and masterdb (the Origin node) receives slavedb's data. masterdb also rejects DMLs executed against it.

Slony-I Logs shows the origin/subscriber node id movements at the time of Switchover:
2014-12-12 04:55:06 PST CONFIG moveSet: set_id=1 old_origin=1 new_origin=2
2014-12-12 04:55:06 PST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2014-12-12 04:55:06 PST CONFIG remoteWorkerThread_1: update provider configuration
2014-12-12 04:55:06 PST CONFIG remoteWorkerThread_1: helper thread for provider 1 terminated
2014-12-12 04:55:06 PST CONFIG remoteWorkerThread_1: disconnecting from data provider 1
2014-12-12 04:55:11 PST INFO start processing ACCEPT_SET
2014-12-12 04:55:11 PST INFO ACCEPT: set=1
2014-12-12 04:55:11 PST INFO ACCEPT: old origin=1
2014-12-12 04:55:11 PST INFO ACCEPT: new origin=2
2014-12-12 04:55:11 PST INFO ACCEPT: move set seq=5000006393
2014-12-12 04:55:11 PST INFO got parms ACCEPT_SET
If you encounter any issues at this stage, you can switch back to the older version. After switchback you can continue with the older version until your application or other issues are fixed. This is a proper rollback plan that wastes little time in case of issues after switchover:
-bash-4.1$ slonik switchback.slonik
switchback.slonik:8: Set 1 has been moved from Node 2 to Node 1

slavedb=# insert into foo values (5,'PG-Experts','Image3');
ERROR: Slony-I: Table foo is replicated and cannot be modified on a subscriber node - role=0

masterdb=# insert into foo values (5,'PG-Experts','Image3');

slavedb=# select * from foo ;
 id |    name    | image
----+------------+--------
  1 | Raghav     | test1
  2 | Rao        | test2
  3 | Rags       | test3
  4 | PG-Experts | Image2
  5 | PG-Experts | Image3
(5 rows)
Very nice...!!! Is this not an exact rollback with minimum downtime? Yes, it's a clean switch between nodes without missing a transaction.

Logs showing the switchback from Subscriber to Origin Node:
2014-12-12 04:58:45 PST CONFIG moveSet: set_id=1 old_origin=2 new_origin=1
2014-12-12 04:58:45 PST CONFIG storeListen: li_origin=2 li_receiver=1 li_provider=2
2014-12-12 04:58:45 PST CONFIG remoteWorkerThread_2: update provider configuration
2014-12-12 04:58:45 PST CONFIG remoteWorkerThread_2: helper thread for provider 2 terminated
2014-12-12 04:58:45 PST CONFIG remoteWorkerThread_2: disconnecting from data provider 2
2014-12-12 04:58:46 PST CONFIG storeListen: li_origin=2 li_receiver=1 li_provider=2
2014-12-12 04:58:47 PST INFO start processing ACCEPT_SET
2014-12-12 04:58:47 PST INFO ACCEPT: set=1
2014-12-12 04:58:47 PST INFO ACCEPT: old origin=2
2014-12-12 04:58:47 PST INFO ACCEPT: new origin=1
2014-12-12 04:58:47 PST INFO ACCEPT: move set seq=5000006403
2014-12-12 04:58:47 PST INFO got parms ACCEPT_SET
2014-12-12 04:58:48 PST CONFIG moveSet: set_id=1 old_origin=2 new_origin=1
By this time you may have noticed that none of the transactions are lost during the switching operations between PostgreSQL versions. The only downtime might be restarting your application so that it connects to the right Origin and Subscriber nodes; the Origin/Subscriber nodes themselves are never taken down, they remain up and running.

Remember, the method shown here is not only useful for upgrades; it is the same method used in Slony-I for moving a replication set between nodes.

Thank you for your patience :). Hope this post helps you upgrade PostgreSQL with minimum downtime using Slony-I, including a proper rollback plan.


Ian Barwick: Japan PostgreSQL Conference 2014

From Planet PostgreSQL. Published on Dec 11, 2014.

Japan has been an early and vigorous adopter of PostgreSQL (back in 2006, when PostgreSQL was still...


Josh Berkus: SQLnoSQL: pg_shard and JSONB, part 1

From Planet PostgreSQL. Published on Dec 10, 2014.

Of course, the first thing which occurred to me with pg_shard was to create a distributed "NoSQL" database using 9.4's JSONB.  I had to wait a bit because there were some issues with 9.4 compatibility which needed resolving, but this week I was able to build it.  I was hoping to show it off at the SFPUG meeting, but sadly the weather gods (in the form of the PostgreSQL-powered NWS) interfered.

So we'll go over it here because it makes a good introduction to pg_shard and to JSONB.

First step was to spin up an AWS m3.xlarge instance to be our master.  While I don't need much in the way of other capabilities, I do want the master node to have AWS's best network option, since it'll be doing a LOT of networking.  Do, however, make sure to configure it with instance storage, because you'll want that space for temp tables and any tables which aren't sharded.  This also becomes our template for the shards, because it's easier that way.  I created the instance running Ubuntu 14.04.  Then I had to install all of the PostgreSQL 9.4 components and a few others, including:

  • postgresql-9.4, postgresql-contrib-9.4, postgresql-server-dev-9.4
  • python-psycopg2 (for my benchmark test)
  • git, make

After that I "git cloned" the pg_shard repo and did a make and make install in it.  Currently it emits some spurious warnings with 9.4; ignore those.

Next was configuring PostgreSQL for pg_shard.  This is the master node, so it's where we're going to use pg_shard.  In addition to the usual PostgreSQL configuration, you want to change two settings:

shared_preload_libraries = 'pg_shard' #load pg_shard
temp_buffers = 64MB #pg_shard uses temp tables internally

Then start (or restart) PostgreSQL.  After that, we create a "psocial" database to represent our dummy social networking application, and set up a .pgpass file and an authorized_hosts file for passwordless access by the postgres user.  Then we shut PostgreSQL down and use the AWS API to create eight r3.large instances based on that master instance to be our shards.  This way they already have the right configuration and the psocial database created.

We then harvest the IP addresses for those shards and turn them into some local DNS, naming them shard#.demo.  After that, we can set up the pg_worker_list.conf file, which goes in the data directory for PostgreSQL (not the configuration directory, but the actual data directory).

    #hostname       port
    shard1.demo     5432
    shard2.demo     5432
    shard3.demo     5432
    shard4.demo     5432
    shard5.demo     5432
    shard6.demo     5432
    shard7.demo     5432
    shard8.demo     5432

Unfortunately, we do have to configure the shards a bit manually.  First, we'll need to mount the instance storage on each shard, and move the PGDATA directory to it.  Second, we'll need to start Postgres on each shard.  If you use Puppet, Salt, Chef, CFEngine, or similar, then it will be easy to do this; otherwise, you're in for a little SSH time.

Now we are ready to create the sharded "members" table for our app.  Let's create the pg_shard extension (this only needs to be done on the master):

    CREATE EXTENSION pg_shard;

Then we can create the master table:

    create table members (
        id text not null unique,
        email text,
        profile jsonb
    );

    create index on members(email);
    create index on members using gin(profile);

You can see that we've created this table as a highly denormalized collection of JSON documents, using indexed JSONB.  We're doing this mainly to test the capabilities of 9.4's JSONB in a simulation of a large, sharded, NoSQL application.  If this were a real application, I'd normalize it a bit.  The id column is TEXT because we're going to put a fake GUID into it.

This empty table is our template for the shards and, like the master table in partitioning, acts as a target for queries, which will be redirected by pg_shard to the shards.  It's not sharded yet; we have to do two things to make that happen.  One is to "register" it as a master, which creates entries for it in pg_shard's metadata:

    SELECT master_create_distributed_table('members','id');

This tells pg_shard that "members" is a sharded table, and that it will be sharded on a hash of the column "id".  Now to create the actual shards:

    SELECT master_create_worker_shards('members',32,2);

The first number is the number of shards to create.  While not required, it helps for this to be evenly divisible by the number of hosts you have.  You generally want to create more shards than hosts, first to take advantage of some parallelism on each host, and second to have room for growth by "splitting" hosts.  That is, in our psocial example, we could double to 16 hosts and give each half of the shards.

If you look at the pg_shard metadata, you can see that things are now sharded.  The "partition" table holds our list of master tables:

    psocial=# select * from pgs_distribution_metadata.partition ;
     relation_id | partition_method | key
    -------------+------------------+-----
           16450 | h                | id

The "shard" table shows us the list of shards:

      id   | relation_id | storage |  min_value  |  max_value
    -------+-------------+---------+-------------+-------------
     10000 |       16450 | t       | -2147483648 | -2013265922
     10001 |       16450 | t       | -2013265921 | -1879048195
     10002 |       16450 | t       | -1879048194 | -1744830468
     10003 |       16450 | t       | -1744830467 | -1610612741

You can see that each shard is defined as a range of hash values, hashed using PostgreSQL's internal hashing function.  Finally, the actual locations of shards are listed in the shard_placement table:

    id | shard_id | shard_state |  node_name  | node_port
   ----+----------+-------------+-------------+-----------
     1 |    10000 |           1 | shard1.demo |      5432
     2 |    10000 |           1 | shard2.demo |      5432
     3 |    10001 |           1 | shard2.demo |      5432
     4 |    10001 |           1 | shard3.demo |      5432

You'll notice that each shard exists on two hosts.  That's that "redundancy" number we gave pg_shard when we created the shards; it says how many hosts should have a copy of each shard.  As you can see, these shards are assigned on a round-robin basis.
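The round-robin pattern is easy to see by generating it yourself. This is a hypothetical sketch of the assignment scheme for 32 shards, 8 workers, and redundancy 2 (an illustration of the pattern, not pg_shard's actual placement code):

```sql
-- Each shard i gets `redundancy` consecutive workers, wrapping around;
-- the first few rows reproduce the shard_placement listing above:
-- 10000 → shard1.demo, shard2.demo; 10001 → shard2.demo, shard3.demo; ...
SELECT 10000 + i AS shard_id,
       format('shard%s.demo', 1 + (i + r) % 8) AS node_name
FROM generate_series(0, 31) AS i,
     generate_series(0, 1) AS r
ORDER BY shard_id, node_name;
```

With 32 shards times 2 replicas spread over 8 workers, each worker name appears 32 × 2 / 8 = 8 times in the output.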

If we actually look on one of the shards, we see that each shard is a numbered table:

                List of relations
     Schema |     Name      | Type  |  Owner
    --------+---------------+-------+----------
     public | members_10000 | table | postgres
     public | members_10007 | table | postgres
     public | members_10008 | table | postgres
     public | members_10015 | table | postgres
     public | members_10016 | table | postgres
     public | members_10023 | table | postgres
     public | members_10024 | table | postgres
     public | members_10031 | table | postgres

Again, you can see that there are eight shards per host, because of the 2X redundancy we asked for.

Next post, we'll populate the shards with a bunch of dummy data.

Ian Barwick: Japan PostgreSQL Conference 2014

From Planet PostgreSQL. Published on Dec 10, 2014.

Japan has been an early and vigorous adopter of PostgreSQL (back in 2006, when PostgreSQL was still emerging from obscurity in the western hemisphere, I noted that in Tokyo bookstores, PostgreSQL books outweighed MySQL ones by about 5:3), and it’s no surprise that by nationality, only the USA and Germany have more committers. The Japan PostgreSQL User Group (JPUG) has been around a while too and is one of the oldest and most active worldwide (and has been around long enough to have established a Japanese logo for PostgreSQL which is often used in place of Slonik [*]). This year JPUG celebrates its 15th anniversary, and the 2014 conference – held in Tokyo on December 5th – was the largest yet with over 20 speakers, 5 (five!) parallel tracks and around 280 attendees.

Continuing a tradition started in 2012, the keynote speaker has been a foreign guest, and this year’s speaker was 2ndQuadrant’s very own Simon Riggs, who gave a well-received talk about the open source development process and PostgreSQL. (The interpreter did a fantastic job despite the occasional idiomatic diversion from the script.)

A key theme of this year’s conference was international exchange in general, and Asia in particular. As you can see from the schedule below, apart from the keynote speech there was an international track (rightmost column) with talks in English by speakers from a variety of Asian countries, and it’s hoped this will develop into a PgCon Asia.

Another theme to emerge was the need for increased communication between the Japanese PostgreSQL community and the rest of the planet – Japan is home to many interesting developments which often go unnoticed in the wider world – and 2ndQuadrant hopes to play a part in improving their visibility.

Finally, no modern conference is complete without lightning talks, which this year took place during the post-conference party. By an ancient Japanese tradition dating back to the earlier part of this year, each speaker (including yours truly, who burbled on about the ideas contained in the preceding paragraph) received a pie in the face (tragically no photographic evidence is available).

On behalf of 2ndQuadrant I’d like to thank JPUG and the conference organisers for their efforts in putting this conference together and Satoshi Nagayasu in particular for his work on the international track.

[*] The choice of turtle is not entirely random – back in the days of PostgreSQL’s predecessor projects, a turtle was used as the INGRES mascot and retained for POSTGRES.  The precise reasons for its adoption in Japan appear to have already been lost in the mists of time, but a popular theory suggests a blue elephant was unsuitable due to similarities with the symbolism of the cult behind the Sarin gas attack on the Tokyo underground network.

Dan Robinson: Speeding Up PostgreSQL With Partial Indexes

From Planet PostgreSQL. Published on Dec 10, 2014.

Did you know PostgreSQL supports indexing a subset of your table? This enables very fast reads from that subset with almost no index overhead. It’s often the best way to index your data if you want to repeatedly analyze rows that match a given WHERE clause. This makes PostgreSQL a great fit for workflows that involve pre-aggregation with additional ad hoc analysis. In this post, I’ll walk through an example query optimization in which partial indexes are well suited.

Consider a table of events with the following schema:

Each event is associated with a user and has an ID, a timestamp, and a JSON representation of the event. The JSON includes the page path, the event type (e.g. click, page view, form submission), and any other properties that describe the event.
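The schema itself was shown as an image in the original post. A sketch consistent with the description might be as follows (the time and data column names are taken from the partial index definition later in the post; id and user_id are my assumptions):

```sql
create table event (
    id bigint not null,
    user_id bigint not null,
    time timestamptz not null,
    data jsonb not null
);
```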

We can use this table to store many different kinds of events, and let’s assume we have some automatic event tracking in place that logs every click, page view, and form submission so that we can analyze it later. We might want to have an internal dashboard that displays some high-value metrics, such as the number of weekly signups or the amount of revenue we are collecting per day. The events that are relevant to this dashboard make up a small fraction of this table — a very small percentage of clicks on your website are to finalize purchases! But they are mixed in with the rest of the table, so our signal-to-noise ratio is low.

We would like to index our data to make these dashboard queries fast.[1] Let’s start with a signup event, which we define to be a form submission on our /signup/ page. Getting the number of such signups in the first week of September translates to:

On a synthetic dataset with 10 million events, of which 3000 are signups, and without any indexes, this query takes 45 seconds.

Full Indexes On Single Columns: A Mixed Bag

A naive way to improve this performance is by creating single-column indexes for each of the relevant event features: (data->>'type'), (data->>'path'), and time. We can use a bitmap join between results from three indexed scans, which should be fast if the query is selective and the relevant index portions are in memory. Indeed, with these indexes in place, this query takes 200 ms initially, and 20 ms in subsequent runs on our synthetic dataset — a significant improvement over the 45 seconds required by a sequential scan.

But this indexing strategy has a few significant downsides:

  • Write overhead. We need to write to all three indexes on every INSERT/UPDATE/DELETE run against the table.[2] For a write-heavy workload like the one in this example, that might be too expensive.
  • Limited query set. This strategy restricts our ability to define types of high-value events. It won’t work if we need something more complicated than a range query on one of those JSON fields. What if we want to match a regex, or all pageview paths that start with /signup/ but can include anything thereafter?
  • Disk use. The table in our test dataset takes up 6660 MB, and the three indexes take up a combined 1026 MB, a substantial increase in the amount of hard drive space we need to use to support this table.[3]

Enter Partial Indexes

We’re only analyzing the 0.03% of the table that constitutes signup events, but this strategy indexes all the rows. We want to be able to run efficient queries over a tiny fraction of a table. In a case like this, a partial index is going to give us the best results.

If we index an unrelated column and restrict our index to events that match our signup definition, PostgreSQL can easily determine where the signup event rows are, and we can query over them even more efficiently than with three complete indexes on the relevant fields. In particular, consider indexing the time field, but only on the rows that match the filter for signup events. That is:

CREATE INDEX event_signups ON event (time)
WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/'

Using this index, our test query runs in 200 ms initially and 2 ms on subsequent runs, so there’s a performance improvement to be had if we are running this query often. More importantly, the partial index addresses all three disadvantages of the triple-index approach described above.

  • It only takes up 96 kB, a 10000x improvement over the 1026 MB required to fully index all three fields.
  • We only need to write to the partial index when new rows match our filter for signup events. Since those are 0.03% of our events, this constitutes a significant improvement in write performance: we essentially get this index for free.
  • The partial index allows for filtering with the full expressiveness of PostgreSQL. Our index’s WHERE clause can use any row-filtering expression that we could otherwise use in a query. We can use this approach to match more complicated event definitions, such as regexes, function results, or the prefix-matching example mentioned above.
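As a sketch of that last point, the prefix-matching definition mentioned earlier can be captured directly in the index predicate (the index name is illustrative):

```sql
-- Index any event whose path merely starts with /signup/
CREATE INDEX event_signup_prefix ON event (time)
WHERE (data->>'type') = 'submit' AND (data->>'path') LIKE '/signup/%';
```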

Avoid Indexing Predicate Results As Booleans

Another approach I’ve seen is to attempt to index the boolean expression

(data->>'type') = 'submit' AND (data->>'path') = '/signup/'

directly, with a second term for time. I.e., something like:

CREATE INDEX event_signup_time ON event
(((data->>'type') = 'submit' AND (data->>'path') = '/signup/'), time)

This is worse than either of the approaches above, as it turns out PostgreSQL’s query planner won’t figure out that our example query restricts our result to rows for which the first term in the index is true. That is, the planner doesn’t know that the WHERE clause:

WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/'

is equivalent to saying that the

((data->>'type') = 'submit' AND (data->>'path') = '/signup/')

field of our index must be true. So it uses this index to restrict the time range of events in our scan, but it reads events for which the boolean expression is true as well as the events for which it is false, and then checks that condition for each row after loading it.[4]

So, we’re reading a lot more rows from disk than necessary, and also executing a nontrivial check for each one. On our synthetic dataset, this query takes 25 seconds for the first run and 8 seconds on successive executions. In effect, this is no better than just indexing on the time field, and it performs comparably to doing just that.

Partial indexes are a powerful way to precompute the subset of a table that matches a predicate. Judging by traffic in the #postgresql IRC channel, they are considerably underused. Compared to full indexes, they allow a greater range of predicates. They can be dramatically more lightweight, requiring fewer writes and less disk space, especially for highly selective filters. Partial indexes should be your default strategy if you’re repeatedly querying a table for a small slice of rows.

Love PostgreSQL? Know of a feature the world should know more about? Ping me @danlovesproofs.

Interested in building systems that make powerful technology easy to use? Shoot us a note at

[1] We might deal with this kind of a situation with table partitioning, in which high-value events and low-value events can live in separate child tables, but this approach is going to be unwieldy if there are many different kinds of high-value events, and we would need to repartition the table every time we wanted to add a new type of high-value event.
[2] We may get some of the updates for free via the heap-only tuples optimization, but, at least, every INSERT and DELETE will require writing to all three indexes.
[3] We could index all three fields in one multi-column index, e.g. on ((data->>'type'), (data->>'path'), time). That would take up 755 MB, a 26% savings, but this is still large, and the other drawbacks still apply. What’s more, this index would be less applicable to other queries we might want to run on this data, so, if we’re supporting a few different kinds of high-value events, this might not end up saving us any space.
[4] The relevant query plan:

Hubert 'depesz' Lubaczewski: Waiting for 9.5 – REINDEX SCHEMA

From Planet PostgreSQL. Published on Dec 08, 2014.

On 8th of December, Simon Riggs committed patch: REINDEX SCHEMA   Add new SCHEMA option to REINDEX and reindexdb.   Sawada Masahiko   Reviewed by Michael Paquier and Fabrízio de Royes Mello This is very simple thing, so the description will not be long. But – I'm mostly a DBA, so all things that give […]

Josh Berkus: Loading pg_partman on RDS or Heroku

From Planet PostgreSQL. Published on Dec 08, 2014.

One of the limitations with both RDS and Heroku is that you can only install the extensions they've approved and made available to you.  Except ... you can install "SQL-only" extensions which don't have a binary component.  Like pg_partman.

pg_partman is a set of functions and tables which help you manage your partitioned tables, including creating new partitions and triggers on the partition master.  I love it because it means I never have to write ad-hoc code for a client to do that again.  So of course I want to install it for our clients who are running on Heroku and RDS and need partitioning.  However, I have no access to "create extension partman" on the RDS command line, so how do I make it work?  Mostly by loading the extension the old way.

1) create a new schema on the target database called "partman".

2) Install PostgreSQL 9.3 on my laptop, including the development files.

3) git clone pg_partman.

4) make && make install pg_partman on my laptop.  Inside the pg_partman source directory, that will create a file in the "sql" directory called (currently): pg_partman--1.7.2.sql.

5) Since that file is designed to be installed as an extension, I need to alter it.  First, I replace all occurrences of "@extschema@" with "partman".  Second, I delete the line "SELECT pg_catalog.pg_extension_config_dump('part_config', '');"

6) Load partman into your database by using "psql -f pg_partman--1.7.2.sql YOUR_DB_NAME -h YOUR_HOST".  In the RDS case, this needs to be done as the instance superuser.
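Steps 3 through 6 can be sketched as shell commands (the repository URL, file name, and version number are illustrative and will drift over time):

```shell
# Steps 3-4: fetch and build pg_partman (requires the PostgreSQL dev files)
git clone https://github.com/keithf4/pg_partman.git
cd pg_partman
make && make install

# Step 5: retarget the extension script at the "partman" schema and
# drop the extension-only config_dump line
sed -i 's/@extschema@/partman/g' sql/pg_partman--1.7.2.sql
sed -i "/pg_extension_config_dump('part_config', '')/d" sql/pg_partman--1.7.2.sql

# Step 6: load the script, as the instance superuser in the RDS case
psql -f sql/pg_partman--1.7.2.sql YOUR_DB_NAME -h YOUR_HOST
```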

pg_partman can now be used in the target database by calling all of the functions as "partman.function", or by putting the partman schema in your search_path.

Now, this has the major drawback that upgrading pg_partman once 1.7.3 comes out will be a very manual process.  But that's better than not using it at all.

gabrielle roth: PDXPUG lab report – BDR

From Planet PostgreSQL. Published on Dec 08, 2014.

For the last PDXPUG lab of the year, we tried out BiDirectional Replication (BDR). Five of us set up VMs on our laptops and followed the instructions on the wiki. We only had about 90 minutes for this lab, so the goal was to get a basic configuration up & running, understand the […]

Michael Paquier: Postgres 9.5 feature highlight: REINDEX SCHEMA

From Planet PostgreSQL. Published on Dec 08, 2014.

PostgreSQL 9.5 has added a new mode for REINDEX with this commit:

commit: fe263d115a7dd16095b8b8f1e943aff2bb4574d2
author: Simon Riggs <>
date: Tue, 9 Dec 2014 00:28:00 +0900

Add new SCHEMA option to REINDEX and reindexdb.

Sawada Masahiko

Reviewed by Michael Paquier and Fabrízio de Royes Mello

Up to 9.4, REINDEX is able to run on different things:

  • INDEX, to reindex a given index.
  • TABLE, to reindex a table entirely, including its underlying toast index if the table has one.
  • DATABASE, to reindex all the relations of the database the user is currently connected to, including catalog tables.
  • SYSTEM, to reindex all the system catalogs.

A couple of things to note though:

  • Reindexed indexes are locked with a strong exclusive lock, preventing any other session from touching them.
  • Parent tables are locked with a share lock.
  • DATABASE and SYSTEM cannot run in a transaction, and they process tables one by one, with pg_class reindexed first because REINDEX updates it (this last point is an implementation detail, not mentioned in the docs).

The new mode for SCHEMA mixes those things, and behaves similarly to DATABASE and SYSTEM, restricted to a single schema:

  • It cannot run in a transaction.
  • Each table in the schema is processed one by one.
  • pg_class is reindexed first only if pg_catalog is processed.

That's what the following output shows; note that pg_class is at the top of the relations reindexed.

=# REINDEX SCHEMA pg_catalog;
NOTICE:  00000: table "pg_catalog.pg_class" was reindexed
LOCATION:  ReindexObject, indexcmds.c:1942
NOTICE:  00000: table "pg_catalog.pg_depend" was reindexed
LOCATION:  ReindexObject, indexcmds.c:1942

And this operation cannot run inside a transaction block:

=# BEGIN;
BEGIN
=# REINDEX SCHEMA pg_catalog;
ERROR:  25001: REINDEX SCHEMA cannot run inside a transaction block
LOCATION:  PreventTransactionChain, xact.c:2976

A last thing to note is that a user who has no access to a schema will logically not be able to run REINDEX on it:

=# REINDEX SCHEMA public;
ERROR:  42501: must be owner of schema public
LOCATION:  aclcheck_error, aclchk.c:3376

This feature is particularly helpful when working on a server with multiple schemas and you want to reindex all the relations in a single schema, as it removes the need to list every relation or to write a custom function. It also has the advantage of avoiding the successive locks on every object that a database-wide REINDEX would take.

Hubert 'depesz' Lubaczewski: Waiting for 9.5 – Event Trigger for table_rewrite

From Planet PostgreSQL. Published on Dec 08, 2014.

On 7th of December, Simon Riggs committed patch: Event Trigger for table_rewrite   Generate a table_rewrite event when ALTER TABLE attempts to rewrite a table. Provide helper functions to identify table and reason.   Intended use case is to help assess or to react to schema changes that might hold exclusive locks for long periods. […]

Craig Ringer: BDR for PostgreSQL: Present and future

From Planet PostgreSQL. Published on Dec 07, 2014.

For a couple of years now a team at 2ndQuadrant led by Andres Freund has been working on adding bi-directional asynchronous multi-master replication support for PostgreSQL. This effort has become known as the BDR project.

We’re really excited to see these efforts leading to new PostgreSQL features and have a great deal more still to come.

Incremental Development

As a large development project it is neither practical nor desirable to deliver all the changes to PostgreSQL as a single huge patch. That way lies madness and unmaintainable code. It would also be resoundingly rejected by the PostgreSQL community, as it should be.

Instead, BDR has been developed as a series of discrete changes to core PostgreSQL, plus an extension that uses those core changes to implement multi-master asynchronous logical replication.

To make this separation clearer we’ve split BDR into two separate trees since the last official release. We now manage and develop the set of patches to PostgreSQL separately from the BDR extension itself.

BDR is destined for community PostgreSQL

2ndQuadrant’s goal is to get all of BDR – the changes to the core server as well as the extension or its functionality – included in a future open source community PostgreSQL release.

When that’s achieved you’ll be able to use all the features currently provided by the BDR patches and extension with no 3rd party (open source or otherwise) patches, extensions and tools.

All of BDR is open source, released under the PostgreSQL license, and is developed on PostgreSQL’s public infrastructure. The git repository containing the authoritative current version of BDR is on

Getting BDR into users’ hands

With a project as big as BDR and with the PostgreSQL project’s safety-first, conservative and careful development style it’ll be a long time before all of BDR is included in community PostgreSQL. As a result we’ve had to make some compromises in order to get BDR functionality out to users prior to the expected end of the 9.5/9.6 release cycles in a couple of years’ time.

The main compromise is that we have merged all the not-yet-committed patches to PostgreSQL core that BDR requires into a patched variant of PostgreSQL 9.4 that can be used to run BDR. The need for this patched version to exist is temporary. If all the patches required to run BDR (see below) get into PostgreSQL 9.5 it will be possible to run the BDR extension on a stock community version of PostgreSQL 9.5. We would’ve strongly preferred not to need to make a non-trivially patched version of PostgreSQL available to the public, even temporarily, so hopefully we can do away with it for 9.5.

BDR features in PostgreSQL 9.3, 9.4, 9.5 and onward

Many of the core server changes developed for or required by BDR have already been committed to community PostgreSQL. That means they’re part of released or soon-to-be-released PostgreSQL versions. These features include:

Other changes have been submitted for potential inclusion in PostgreSQL 9.5 or are due to be submitted shortly, including:

BDR-related feature development has also led to other useful work by others, like pg_dump support for using snapshots.

Each of these features is/was a significant development effort in itself – and they’re just the foundations required to make the BDR extension possible.

Adoption of features developed for BDR by other projects

The BDR team is really excited at how far BDR has come and how much of the work done for BDR has already entered community PostgreSQL.

Some of those BDR building blocks are already seeing adoption and use for other purposes. For example, the Slony-I team are working on adopting the logical decoding feature developed for BDR to reduce replication overheads.

Background workers are sure to be a crucial foundation for new PostgreSQL features and capabilities such as built-in task scheduling, and in time I think that we’ll wonder how we ever did without them.

UDR – one-way BDR for unmodified community PostgreSQL 9.4

We’re also pleased to report that a reduced-functionality version of the BDR extension currently known as “UDR” will soon be available for use with PostgreSQL 9.4.

An extension running on unmodified PostgreSQL 9.4 can’t use the not-yet-in-core-PostgreSQL features that BDR relies on to implement full multi-master support. Nonetheless it provides a useful path toward live major-version upgrades and an efficient one-way replication tool.

Because PostgreSQL 9.4 does not include support for DDL deparse, full DROP tracking, commit timestamps, or replication identifiers we can’t support DDL replication or multi-master in UDR.

Functionally UDR is quite similar to Londiste, but it avoids the write-amplification of trigger based replication and runs entirely within PostgreSQL, without any need for an external daemon. UDR will offer a useful live-migration path for moving from stock 9.4 to BDR and vice versa, and in future for online upgrades of 9.4 to later major releases.

More details on UDR will be available soon, as will RPMs.

What’s next for BDR

PostgreSQL 9.4 is due to be released any minute now, and we’re working to have a new BDR release ready to follow shortly afterwards.

Otherwise, the BDR team is focusing on getting all the functionality required to run BDR on top of unmodified PostgreSQL into PostgreSQL 9.5.

Support for replication sets was recently added to the BDR extension. This permits replication of subsets of tables in a database to some nodes, asymmetric replication, and various other configurations that are useful for workloads like data aggregation.

We’re also working on dynamic configuration, where node addition and removal is done using SQL commands instead of by modifying postgresql.conf and restarting the server. That work will lead to a smoother process for node addition and removal.

There are lots of goodies coming, both for core PostgreSQL and for users of the BDR extension.

Dave Page: The story of pgAdmin

From Planet PostgreSQL. Published on Dec 07, 2014.

I’m often asked how I first became involved in PostgreSQL, and how the pgAdmin project got started. Much as I’m happy to tell the story over beer, it becomes tedious to do so over email after the first half-dozen or so times. So in a vain attempt to save my fingers from future carpal tunnel syndrome, here’s the tale…

I first discovered PostgreSQL in 1997. I was working at a social housing provider in the UK where, at the time, we were running Informix SE and Pick databases. We wanted to start building in-house web applications for managing additional areas of the business. There were no useful interfaces for Pick that could be used with any of the web technologies available to us at the time and the licensing model for Informix and its ODBC driver was prohibitively expensive, so I started looking for alternatives. 

After spending some time researching mSQL and MySQL, I eventually found PostgreSQL, which seemed to offer everything we needed—a SQL-based DBMS with ODBC and C interfaces, a liberal license, a rich set of features, and a vibrant community supporting it that was responsive to bug reports and open to feature requests. My colleagues and I developed a number of applications using PostgreSQL in the following years and the database became a critical part of our business. We had applications written in PHP for the web as well as Visual Basic for the desktop users. 

However, it was early on in our journey with Postgres that, as a Windows shop (on the desktop at least), we found ourselves wishing for a good graphical management tool for the database. At the time, the only option was pgAccess, which was a basic tool written in TCL/TK that was not easy to get to work on Windows. So I decided to write my own in Visual Basic. The first version was never released to the public and was modeled on the pgAccess design. Called pgManager, it was quickly abandoned as we found the UI to be restrictive and, well, not overly usable. The second incarnation was pgAdmin, which was redesigned from the ground up. It was released as an Open Source project and quickly became a popular tool amongst Postgres users (which was extremely gratifying as it was my first Open Source project). 

Some years later, we once again found ourselves suffering due to the way the application had been designed. I shut myself away for a couple of weeks whilst my family were away visiting relatives in Scotland and wrote pgAdmin II, using a far more structured and maintainable design that implemented a much more usable user interface. I was particularly proud of the design and cleanliness of that code, but by 2002, we needed to start over again. This time it wasn't the application design that was holding us back, but rather the choice of technology. Visual Basic didn't handle internationalization or localization well, nor did it run on platforms other than Windows without the use of WINE, under which it was never particularly stable. The hard decision was made to rewrite everything again, this time using C++ with the wxWidgets framework. pgAdmin III looked much like pgAdmin II, but it solved all of the problems our (in particular) Japanese Linux-using friends were experiencing. Now in its third incarnation, pgAdmin remains the most popular Open Source GUI tool for managing Postgres.

I continued to work at the housing provider and make good use of Postgres until 2007, at which time I moved to EnterpriseDB so I could concentrate on my Postgres work full time. At EDB, not only do I work on our own tools, I also continue to contribute to the PostgreSQL community in various ways. I have the privilege of having full support of the management at EDB for my community work, allowing me the freedom to work on whatever I need to in order to fulfill my community roles on the core team, the PostgreSQL Europe board, and of course, pgAdmin. One of the products I’m responsible for at EDB is Postgres Enterprise Manager, which has a fat client interface that’s based on pgAdmin. This has allowed us to put more effort into the maintenance of pgAdmin, with members of the PEM team at EDB regularly contributing code, fixing bugs and reviewing the patches from elsewhere. Ashesh Vashi has even been made a committer for his efforts.

Despite the hard work over the last 12 or so years, the pgAdmin developers have come to the conclusion that there is limited life left in the current code. Aside from it being extremely hard to find C++ developers these days, we’ve also run into numerous difficult-to-fix bugs that can be traced back to issues or design choices in our code, the underlying wxWidgets framework, and some versions of GTK (the choice of which we typically have no control over, as we inherit GTK from the user’s Linux distribution). What’s more, the world is shifting to a web based model these days. Cloud deployments are becoming more widely used, as well as simple instances supplied through hosting providers. Users are getting used to being able to pick up their laptop or a tablet and do whatever they need to do without having to install software – open a browser, read and write email, build a spreadsheet or create a database!

Consequently, we're now starting to design pgAdmin 4. The plan is to build a single Python-based pgAdmin that users can either deploy on a web server or run from their desktop. It’ll be designed to “live query” databases rather than query and cache schema details like the current implementation (which was designed with use over a dialup connection in mind!), and we’ll be making the application much more task focused (and therefore usable) than it is at present, as well as implementing some of the oft-requested features that are hard to do in the current code such as direct data editing in the query tool and tabbed query windows. The new code will also be extensible from the ground up, allowing support for new object types or functionality to be simply “dropped in” and auto-discovered at runtime.

We're still in the proof of concept stage at the moment though, so there's nothing available to download just yet. But stay tuned as work kicks up a gear over the next 12 months!

Hubert 'depesz' Lubaczewski: How much slower are numerics?

From Planet PostgreSQL. Published on Dec 06, 2014.

The general knowledge is that numerics are slower than integers/float, but offer precision and ranges that are better. While I understand what is slow, I don't really know how much slower numerics are. So let's test it. There are couple of things to consider: table size (related to disk usage per column) speed of inserts […]

Leo Hsu and Regina Obe: Oracle FDW 1.1.0 with SDO_Geometry PostGIS spatial support

From Planet PostgreSQL. Published on Dec 06, 2014.

Oracle FDW is a foreign data wrapper PostgreSQL extension that allows you to read Oracle database tables from a PostgreSQL database. You can get it via the PGXN network or the main website

What is new about the latest 1.1.0 release is that there is now support for the Oracle SDO_GEOMETRY type, which allows you to map the most common geometry types POINT, LINE, POLYGON, MULTIPOINT, MULTILINE and MULTIPOLYGON to the PostGIS geometry type. Much of the spatial plumbing work was done by Vincent Mora of Oslandia. If we have any Windows Oracle users out there, yes, there are binaries available for Windows for PostgreSQL 9.1–9.4, for both 32-bit and 64-bit. The FDW does have a dependency on OCI.dll, which I think comes shipped with Oracle products. Unfortunately, we are not Oracle database users, so we can't kick the tires on this one.
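As a rough sketch (untested here, for the same reason; the server name, credentials, and table names are all placeholders), declaring a spatial foreign table with oracle_fdw would look something like this:

```sql
CREATE EXTENSION oracle_fdw;

CREATE SERVER oradb FOREIGN DATA WRAPPER oracle_fdw
  OPTIONS (dbserver '//oracle-host:1521/ORCL');

CREATE USER MAPPING FOR CURRENT_USER SERVER oradb
  OPTIONS (user 'scott', password 'tiger');

-- The Oracle SDO_GEOMETRY column is declared as a PostGIS geometry
CREATE FOREIGN TABLE cities (
  id   integer,
  name varchar(100),
  geom geometry
) SERVER oradb OPTIONS (schema 'SCOTT', table 'CITIES');
```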

Josh Berkus: pgDay deadline, SFPUG December

From Planet PostgreSQL. Published on Dec 04, 2014.

First, SFPUG December meeting has been scheduled. It's about pg_shard, which has been covered elsewhere.

Second, a reminder that talk submissions for pgDay SF 2015 are due Monday for regular length talks. Please submit your proposals ASAP!

Josh Berkus: What's this pg_shard thing?

From Planet PostgreSQL. Published on Dec 04, 2014.

I've been working with CitusData for a while, and I'm excited that they've finally released pg_shard, a generic "sharding" extension for PostgreSQL databases.  The press releases have all of the noise about this, so this blog post is for the nuts-and-bolts of pg_shard.


What does pg_shard do?

pg_shard takes one or more tables in your database and distributes them over multiple databases on multiple servers, or "shards".  Rows are distributed based on a hash function on the primary key which you define.  The hash function used is Postgres' own hash function for hash indexes.

A single node is the "master node", which contains templates for the distributed tables and transparently handles making sure that inserts, updates, and selects go to the right shards.  To ensure redundancy, sharded tables can be defined as having multiple copies across multiple shards.

SELECTs are distributed based on comparing constants in your WHERE clause and other filter conditions against the hashed key, allowing you to query against only the shards which have compliant data.  This should speed up queries for really large tables a great deal.
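Setting this up looks roughly like the following sketch (based on pg_shard's announced API; the table name and shard counts are illustrative):

```sql
-- On the master node: distribute the table by a hash of its key
SELECT master_create_distributed_table('events', 'user_id');

-- Create 16 shards across the worker nodes, with 2 copies of
-- each shard for redundancy
SELECT master_create_worker_shards('events', 16, 2);
```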

Where there are multiple copies of shards, pg_shard notices timeouts in contacting the shards and marks certain shards as offline. 

All of this is done inside PostgreSQL, most of it using our defined APIs and hooks.  The only thing the Citus team had to fork was ruleutils.c, and they're hoping to change that in 9.5 so forking that isn't required either.


What does pg_shard not do?

First, since the rows are distributed using a hash function, only equality comparisons are currently supported.  This means that if you query your table on ranges (such as BETWEEN two timestamps) it will scan all shards for that data.  Adding range partitioning to pg_shard is planned.

Second, unlike Postgres-XC/Postgres-XL, transactions between shards and distributed JOINs are not supported.  Those features are among the things which distinguish CitusData's proprietary product.

Third, currently only one master node is supported.  That means that pg_shard isn't yet a solution for supporting applications which need massive numbers of connections.   Supporting multiple query nodes is planned for the future.

Complex distribution rules, like requiring each shard to be copied to a specific subset of hosts in a different availability zone or data center, are also not supported.  It's not clear when/if they will be.


So what is pg_shard useful for right now?

A lot of web applications consist of "One Big Table" (OBT) which by size is 90% or more of the database. This might be a users table, or a messages table, a log table, or a pictures table.  Years ago, this was exactly the problem we faced with Instagram.

pg_shard will allow users to distribute the OBT across many nodes in a public cloud, letting users scale to much larger sizes.  Particularly, it makes it much easier to redistribute the OBT so that it fits in RAM on each cloud node, both decreasing overall cloud hosting cost and improving throughput and response times.  And since there's shard redundancy, they are protected against certain kinds of cloud failures.

As Instagram and Wanelo have done, you can implement this kind of sharding on the application layer ... and when you get to their scale, you'll probably have to.  But most PostgreSQL users aren't at that scale yet, and pg_shard can help them get over the "hump" of going from one server to several with a lot less pain.


Can you give some examples of pg_sharding a database?

First, over the next week I'll be building a "Postgre-NoSQL" database using PostgreSQL 9.4, pg_shard and JSONB, to demonstrate that you can replace some non-relational databases with PostgreSQL.  Expect more on this blog.

If you're in the Bay Area, then join us for an SFPUG meeting next week, which will be all about pg_shard.  If possible, we'll have live video, but no promises.

Michael Paquier: Postgres 9.5 feature highlight: standby actions at the end of recovery

From Planet PostgreSQL. Published on Dec 04, 2014.

PostgreSQL offers many ways to define how a node reaches the end of recovery, with parameters for the recovery target (a timestamp, a transaction ID, or a target name) and for the recovery timeline, via recovery_target_timeline. A parameter called pause_at_recovery_target, which has existed since 9.1, also makes it possible to pause a standby when the recovery target is reached. The upcoming release 9.5 has made some improvements in this area with the following commit:

commit: aedccb1f6fef988af1d1a25b78151f3773954b4c
author: Simon Riggs <>
date: Tue, 25 Nov 2014 20:13:30 +0000
action_at_recovery_target recovery config option

action_at_recovery_target = pause | promote | shutdown

Petr Jelinek

Reviewed by Muhammad Asif Naeem, Fujji Masao and
Simon Riggs

Its use is rather simple: when a standby has hot_standby enabled in postgresql.conf, meaning that it is able to execute read queries while being in recovery, it is possible to perform the set of actions defined above using action_at_recovery_target in recovery.conf:

  • pause, which acts the same as enabling pause_at_recovery_target: the standby pauses replay once the recovery target has been reached, so that its state can be checked. Note that replay can then be resumed using pg_xlog_replay_resume().
  • promote, to automatically promote the node and have it jump to the next timeline, making it available for write queries as well. This is the behavior when neither pause_at_recovery_target nor action_at_recovery_target is used, or when only pause_at_recovery_target is used and set to false.
  • shutdown, to simply shut down the standby once the target is reached. This is the real new addition this feature brings, because it can be used to make a standby instance immediately ready for use after it has finished its recovery. Note that in this case the node will also need to re-apply all the WAL since the last checkpoint, so there is some cost in this mode. Moreover, recovery.conf is not renamed to recovery.done automatically, so this setting needs to be removed from recovery.conf (simply removing the file is not something to recommend, as the server would miss post-end-recovery sanity checks).

Now let's put in recovery a standby that has the following parameters in recovery.conf:

recovery_target_time = '2014-12-04 22:21:52.922328'
action_at_recovery_target = 'shutdown'
restore_command = 'cp -i /path/to/archive/%f %p'

When a recovery target is reached (timestamp, XID, name), the following logs will show up if shutdown is set up for the end of recovery.

LOG:  recovery stopping before commit of transaction 1004, time 2014-12-04 22:22:19.554052+09
LOG:  shutdown at recovery target
LOG:  shutting down 

Also, note that both parameters cannot be used at the same time; the server returns an error as follows, as pause_at_recovery_target is logically deprecated:

FATAL:  cannot set both "pause_at_recovery_target" and
        "action_at_recovery_target" recovery parameters
HINT:  The "pause_at_recovery_target" is deprecated.

That's nice stuff, useful for controlling nodes to be promoted when checking their data consistency.

Paul Ramsey: The Tyranny of Environment

From Planet PostgreSQL. Published on Dec 02, 2014.

Most users of PostGIS are safely ensconced in the world of Linux, and their build/deploy environments are pretty similar to the ones used by the developers, so any problems they might experience are quickly found and removed early in development.

Some users are on Windows, but since they are our most numerous user base, we at least test that platform preemptively before release and make sure it is as good as we can make it.

And then there's the rest. We've had a passel of FreeBSD bugs lately, and I've found myself doing Solaris builds for customers, and don't get me started on the poor buggers running AIX. One of the annoyances of trying to fix a problem for a "weird platform" user is just getting the platform setup and running in the first place.

So, having recently learned a bit about vagrant, and seeing that some of the "weird" platforms have boxes already, I thought I would whip off a couple vagrant configurations so it's easy in the future to throw up a Solaris or FreeBSD box, or even a quick Centos box for debugging purposes.

I've just been setting up my Solaris Vagrantfile and using my favourite Solaris crutch: the OpenCSW software repository. But as I use it, I'm not just adding the "things I need", I'm implicitly choosing an environment:

  • my libxml2 is from OpenCSW
  • so is my gcc, which is version 4, not version 3
  • so is my postgres

This is convenient for me, but what are the chances that it'll be the environment used by someone on Solaris having problems? They might be compiling against libraries from /usr/sfw/bin, or using the Solaris gcc-3 package, or any number of other variants. At the end of the day, when testing on such a Solaris environment, will I be testing against a real situation, or a fantasyland of my own making?

For platforms like Ubuntu (apt) or Red Hat (yum) or FreeBSD (ports) where there is One True Way to get software, the difficulties are smaller, but even then there is no easy way to get a "standard environment", or to quickly replicate the combination of versions a user might have run into that is causing problems (libjson is a current source of pain). DLL hell has never really gone away, it has just found new ways to express itself.

(I will send a psychic wedgie to anyone who says "docker", I'm not kidding.)

Dan Robinson: PostgreSQL’s Powerful New Join Type: LATERAL

From Planet PostgreSQL. Published on Dec 02, 2014.

PostgreSQL 9.3 has a new join type! Lateral joins arrived without a lot of fanfare, but they enable some powerful new queries that were previously only tractable with procedural code. In this post, I’ll walk through a conversion funnel analysis that wouldn’t be possible in PostgreSQL 9.2.

What is a LATERAL join?

The best description in the documentation comes at the bottom of the list of FROM clause options:

The LATERAL key word can precede a sub-SELECT FROM item. This allows the sub-SELECT to refer to columns of FROM items that appear before it in the FROM list. (Without LATERAL, each sub-SELECT is evaluated independently and so cannot cross-reference any other FROM item.)

When a FROM item contains LATERAL cross-references, evaluation proceeds as follows: for each row of the FROM item providing the cross-referenced column(s), or set of rows of multiple FROM items providing the columns, the LATERAL item is evaluated using that row or row set’s values of the columns. The resulting row(s) are joined as usual with the rows they were computed from. This is repeated for each row or set of rows from the column source table(s).

This is a bit dense. Loosely, it means that a LATERAL join is like a SQL foreach loop, in which PostgreSQL will iterate over each row in a result set and evaluate a subquery using that row as a parameter.

What can we do with this?

Consider a table of click events with the following schema:
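A minimal sketch of such a table (the column names here are assumptions, chosen to match the expressions used later in the post: user_id, data->>'type', time):

```sql
-- Sketch of the click-event table (names assumed, not from the post)
CREATE TABLE event (
    id      bigserial PRIMARY KEY,
    user_id bigint      NOT NULL,
    time    timestamptz NOT NULL,
    data    json        NOT NULL   -- event properties as a JSON blob
);
```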

Each event is associated with a user and has an ID, a timestamp, and a JSON blob with the event’s properties. At Heap, these properties might include the DOM hierarchy of a click, the window title, the session referrer, and so forth.

Let’s say we want to optimize our landing page to increase signups. The first step is to figure out where we’re losing users in our conversion funnel.

An example conversion funnel between four steps in a signup flow.
An example conversion funnel between four steps in a signup flow.

We’ll assume that we’ve instrumented our frontend to log events along this flow and that all of the data lives in the event table specified above.[1] As an initial question, let’s figure out how many people view our homepage and what percentage of them enter a credit card within two weeks of that initial homepage view. If we were using an older version of PostgreSQL, we might write some custom functions in PL/pgSQL, PostgreSQL’s builtin procedural language. But, in 9.3, we can use a lateral join to compute this in one efficient query, with no extensions or PL/pgSQL.

Nobody likes 30-line SQL queries, so let’s break this down into pieces. The first chunk of this is vanilla SQL:
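A sketch of what that vanilla first chunk might look like (table and column names are assumptions based on the expressions quoted later in the post):

```sql
-- First time each user viewed the homepage (a sketch; names assumed)
SELECT user_id,
       min(time) AS view_homepage_time
FROM event
WHERE data->>'type' = 'view_homepage'
GROUP BY user_id
```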

That is, get the initial time each user did a view_homepage event. Then our lateral join allows us to iterate over each resulting row and perform a parametrized version of the next subquery. This is equivalent to taking the query below and running it for each resulting row:
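That parametrized subquery might look like this (a sketch; :user_id and :view_homepage_time stand in for the outer row's values, and all names are assumptions):

```sql
SELECT time AS enter_credit_card_time
FROM event
WHERE user_id = :user_id
  AND data->>'type' = 'enter_credit_card'
  AND time BETWEEN :view_homepage_time
               AND :view_homepage_time + interval '2 weeks'
ORDER BY time
LIMIT 1
```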

I.e., for each user, get the first time he or she performed the enter_credit_card event within two weeks of view_homepage_time. Because this is a lateral join, our subquery can make reference to the view_homepage_time results from the previous subquery. Otherwise, the subqueries would be evaluated independently and there would be no way to access results from one when evaluating the other.
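Combined, the two pieces joined with LEFT JOIN LATERAL might look like this (a sketch under the same naming assumptions, not the post's exact query):

```sql
SELECT user_id,
       view_homepage_time,
       enter_credit_card_time
FROM (
    SELECT user_id, min(time) AS view_homepage_time
    FROM event
    WHERE data->>'type' = 'view_homepage'
    GROUP BY user_id
) e1
LEFT JOIN LATERAL (
    SELECT time AS enter_credit_card_time
    FROM event
    WHERE user_id = e1.user_id
      AND data->>'type' = 'enter_credit_card'
      AND time BETWEEN e1.view_homepage_time
                   AND e1.view_homepage_time + interval '2 weeks'
    ORDER BY time
    LIMIT 1
) e2 ON true;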

Then we wrap the whole thing in a select, which returns something like this:

Because this is a LEFT JOIN, the query still produces result rows for users with no matching enter_credit_card event, as long as there is a view_homepage event. If we aggregate over the numerical columns, we get a tidy summary of this conversion funnel:

… which produces:

We can add intermediate steps to this funnel with more lateral joins to evaluate which portions of our flow we should focus on improving.[2] Let’s add a use_demo step between viewing the homepage and entering a credit card.

This yields:[3]

This gives us the three-step conversion funnel from viewing the homepage to using the demo within a week to entering the credit card within a week of that. From here, the expressive power of PostgreSQL allows us to drill down on these results and thoroughly analyze the performance of our website. We might follow up with:

  • Does using the demo increase the likelihood of a signup?
  • Do users who discover our homepage via an ad convert with the same likelihood as users from other sources?
  • How do these conversion rates change over time or with different A/B test variants?

The answers to these questions apply directly to product changes and can be determined in PostgreSQL now that it supports lateral joins.

Without lateral joins, we would need to resort to PL/pgSQL to do this analysis. Or, if our data set were small, we could get away with complex, inefficient queries. In an exploratory data science use case, you might just pull your data out of PostgreSQL and analyze it with your scripting language of choice. But there is considerable power in being able to express these questions in SQL, especially if you’re wrapping it all in an approachable UI and exposing the functionality to nontechnical users.

Note that these queries can be tuned so as to be very efficient. In this example, if we create a btree index on (user_id, (data->>'type'), time), we can evaluate each funnel step for each user with a single indexed lookup. If you're using SSDs, on which seeks are cheap, this might be good enough. If not, you might need to schematize your data a bit differently, but I'll leave the details of that for another post.
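The index described here, as DDL (the index name is an assumption):

```sql
CREATE INDEX event_user_type_time_idx
    ON event (user_id, (data->>'type'), time);
```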

Have a favorite new PostgreSQL feature or a neat lateral join use case? Ping me @danlovesproofs.

Interested in building systems that make powerful features easy to use? Shoot us a note at

[1] Or we can use Heap, which captures everything for us as soon as we install it! No need to write any logging code, and no risk of forgetting to log something that you’ll want to analyze later.
[2] Note that adding additional steps to a conversion funnel would be particularly easy if we were using a product like Heap, since we would already have the relevant data.
[3] The number of users with enter_credit_card events is lower in this query than in the previous one, as this query only returns enter_credit_card events for users who do so after doing the use_demo event, and 17 of the users signed up without using the demo.

Peter Eisentraut: ccache and clang, part 3

From Planet PostgreSQL. Published on Dec 01, 2014.

In part 1 and part 2 I investigated how to use ccache with clang. That was more than three years ago.

I got an email the other day that ccache bug 8118, which I filed while writing part 1, was closed, as ccache 3.2 was released. The release notes of ccache 3.2 contain several items related to clang. So it was time to give this another look.

Basically, the conclusions from part 2 still stand: You cannot use ccache with clang without using CCACHE_CPP2. And it is now becoming clear to me that this is an issue that is not going to go away, and it’s not really even Clang’s fault.


The problem is that clang’s -Wall can cause warnings when compiling the preprocessed version of otherwise harmless C code. This can be illustrated by this piece of C code:

        int
        f(void)
        {
                int *p, *q;

                p = q = 0;
                p = p;
                if (p == p)
                        return 1;
                if ((p == q))
                        return 2;
                return 0;
        }

When compiled by gcc-4.9 -Wall, this gives no warnings. When compiled by clang-3.5 -Wall, this results in:

        test.c:7:4: warning: explicitly assigning value of variable of type 'int *' to itself [-Wself-assign]
        test.c:8:8: warning: self-comparison always evaluates to true [-Wtautological-compare]
        test.c:10:9: warning: equality comparison with extraneous parentheses [-Wparentheses-equality]
        test.c:10:9: note: remove extraneous parentheses around the comparison to silence this warning
        test.c:10:9: note: use '=' to turn this equality comparison into an assignment

You wouldn’t normally write code like this, but the C preprocessor could create code with self-assignments, self-comparisons, extra parentheses, and so on.
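To make this concrete, here is a hypothetical illustration (the macro names are invented, not from the post) of how innocuous-looking macros can expand into exactly the constructs clang warns about once the source has been preprocessed:

```c
/* Invented macros for illustration: each one expands into code that is
 * harmless but trips one of the clang -Wall warnings on the
 * *preprocessed* source. */
#define DEFAULT_PTR(p)  (p)           /* p = DEFAULT_PTR(p) -> p = (p);  (-Wself-assign) */
#define PTR_EQ(a, b)    ((a) == (b))  /* if (PTR_EQ(p, q)) -> if (((p) == (q)))  (-Wparentheses-equality) */

int
check(int *p, int *q)
{
	p = DEFAULT_PTR(p);     /* self-assignment after expansion */
	if (PTR_EQ(p, p))       /* tautological comparison after expansion */
		return 1;
	if (PTR_EQ(p, q))       /* extraneous parentheses after expansion */
		return 2;
	return 0;
}
```

The unexpanded source looks perfectly reasonable; only the preprocessed output that ccache hands to the compiler contains the suspicious patterns.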

This example represents the issues I saw when trying to compile PostgreSQL 9.4 with ccache and clang; there might be others.

You can address this issue in two ways:

  1. Use CCACHE_CPP2, as discussed in part 2. With ccache 3.2, you can now also put this into a configuration file: run_second_cpp = true in ~/.ccache/ccache.conf

  2. Turn off the warnings mentioned above: -Wno-parentheses-equality, -Wno-tautological-compare, -Wno-self-assign (and any others you might find). One might think that these are actually useful warnings that one might want to keep, but GCC doesn’t warn about them, and if you develop primarily with GCC, your code might contain these issues anyway. In particular, I have found that -Wno-tautological-compare is necessary for legitimate code.

I think CCACHE_CPP2 is the way to go, for two reasons. Firstly, having to add more and more options to turn off warnings is obviously somewhat stupid. Secondly and more importantly, there is nothing stopping GCC from adding warnings similar to Clang’s that would trigger on preprocessed versions of otherwise harmless C code. Unless they come up with a clever way to annotate the preprocessed code to the effect of “this code might look wrong to you, but it looked OK before preprocessing, so don’t warn about it”, in a way that creates no extra warnings and doesn’t lose any warnings, I don’t think this issue can be solved.


Now the question is, how much would globally setting CCACHE_CPP2 slow things down?

To test this, I have built PostgreSQL 9.4rc1 with clang-3.5 and gcc-4.8 (not gcc-4.9 because it creates some unrelated warnings that I don’t want to deal with here). I have set export CCACHE_RECACHE=true so that the cache is not read but new cache entries are computed. That way, the overhead of ccache on top of the compiler is measured.


  • clang-3.5
    • Using ccache is 10% slower than not using it at all.
    • Using ccache with CCACHE_CPP2 on is another 10% slower.
  • gcc-4.8
    • Using ccache is 19% slower than not using it at all.
    • Using ccache with CCACHE_CPP2 is another 9% slower.

(The different percentages for gcc and clang arise because gcc is faster than clang (yes, really; more on that in a future post), while the absolute overhead of ccache doesn't change.)

10% or so is not to be dismissed, but let’s remember that this applies only if there is a cache miss. If everything is cached, both methods do the same thing. Also, if you use parallel make, the overhead is divided by the number of parallel jobs.

With that in mind, I have decided to put the issue to rest for myself and have made myself a ~/.ccache/ccache.conf containing

run_second_cpp = true

Now Clang or any other compiler should run without problems through ccache.


There is one more piece of news in the new ccache release: Another thing I talked about in part 1 was that ccache will disable the colored output of clang, and I suggested workarounds. This was actually fixed in ccache 3.2, so the workarounds are no longer necessary, and the above configuration change is really the only thing to make Clang work smoothly with ccache.

Tomas Vondra: Compiler optimizations vs PostgreSQL

From Planet PostgreSQL. Published on Nov 30, 2014.

About two weeks ago I posted a performance comparison of PostgreSQL compiled using various compilers and compiler versions. The conclusion was that for pgbench-like workloads (lots of small transactions), the compilers make very little difference. Maybe one or two percent, if you're using a reasonably fresh compiler version. For analytical workloads (queries processing large amounts of data), the compilers make a bit more difference - gcc 4.9.2 giving the best results (about 15% faster than gcc 4.1.2, with the other compilers/versions somewhere between those two).

Those results were however measured with the default optimization level (which is -O2 for PostgreSQL), and one of the questions in the discussion below that article was what difference the other optimization levels (like -O3, -march=native and -flto) would make. So here we go!

I repeated the two tests described in the previous post - pgbench and TPC-DS, with almost exactly the same configuration. If you're interested in the details, read that post.

The only thing that really changed is that while compiling the PostgreSQL sources, I modified the optimization level or enabled the additional options. In total, I decided to test these combinations:

  • clang: -O3, and -O4 (available since clang-3.4)
  • gcc: -O3, -O3 -march=native (since gcc-4.2), and -O3 -march=native -flto (since gcc-4.5)

When combined with all the clang and gcc versions, this amounts to 43 combinations. I haven't done the tests for the Intel C Compiler.

BTW if you're interested in more detailed results, see this spreadsheet, or download this OpenDocument spreadsheet (same data).


For the small dataset (~150MB), the results of the read-only test are depicted on the following chart, showing the number of transactions per second (50k-58k range) - so the higher the number, the better.

The results are sorted by compiler (clang, gcc), optimization level and finally compiler version. The bars depict minimum, average and maximum tps (from 3 runs), to give an idea of how volatile the results are - I haven't found a better chart type in Google Drive or Libreoffice Calc.


Now, the first thing you probably notice is that for clang, the higher optimization levels mostly lower the performance. The newer the version, the worse the impact - while clang 3.5 gives >55k transactions with -O2, it drops to 54k with -O3 and 53k with -O4. The 2-4% difference is not a big one, but it's pretty consistent and it certainly is not in the direction we've hoped for.

With gcc, the situation is more complicated - the -O3 and -O3 -march=native levels result in slightly worse performance, although the difference is not as significant as for clang. The results however seem less volatile (e.g. 4.4 is a good example of that).

The "Link Time Optimization" is a different story, increasing the performance for most versions, especially compared to the -O3 -march=native results. For example 4.8 jumps from 53k to 55k tps, and 4.9 jumps from 54k to 56k. Compared to the -O2 results it's not that great, though (it's still faster, but the difference is smaller).

The other way to look at the data is by looking at the results like this:

compiler    -O2     -O3     -O4
clang 3.1   54808   54666   -
clang 3.2   55320   54957   -
clang 3.3   55314   54909   -
clang 3.4   55144   54474   54259
clang 3.5   55766   54628   53848

compiler    -O2     -O3     -O3 -march=native   -O3 -march=native -flto
gcc 4.1.2   52808   53019   -                   -
gcc 4.2.4   53474   52829   53053               -
gcc 4.3.6   52355   52634   52465               -
gcc 4.4.7   51685   52070   52194               -
gcc 4.5.4   53739   52828   52663               56085
gcc 4.6.4   53144   53632   52899               54973
gcc 4.7.4   54354   53572   53001               52451
gcc 4.8.3   54350   52390   52753               54842
gcc 4.9.2   54669   54036   53758               56151

And after computing the difference against the -O2 results for each version, you'll get this:

compiler    -O2     -O3      -O4
clang 3.1   54808   -0.26%   -
clang 3.2   55320   -0.66%   -
clang 3.3   55314   -0.73%   -
clang 3.4   55144   -1.21%   -0.40%
clang 3.5   55766   -2.04%   -1.43%

compiler    -O2     -O3      -O3 -march=native   -O3 -march=native -flto
gcc 4.1.2   52808   0.40%    -                   -
gcc 4.2.4   53474   -1.20%   -0.79%              -
gcc 4.3.6   52355   0.53%    0.21%               -
gcc 4.4.7   51685   0.74%    0.98%               -
gcc 4.5.4   53739   -1.70%   -2.00%              4.36%
gcc 4.6.4   53144   0.92%    -0.46%              3.44%
gcc 4.7.4   54354   -1.44%   -2.49%              -3.50%
gcc 4.8.3   54350   -3.61%   -2.94%              0.91%
gcc 4.9.2   54669   -1.16%   -1.67%              2.71%

This only confirms that on clang all the higher optimization levels hurt the performance (although only a tiny bit). For gcc, the only thing that makes a bit of difference in the right direction is the -flto flag. But even this makes less difference than the compiler version (gcc-4.9.2 with -O2 is almost as fast as gcc-4.8.3 with -flto).


Ok, so that was a transactional workload. Now let's see the impact on analytical workloads ... first, the data load, consisting of the same steps as before:

  • COPY data into all the tables
  • create indexes
  • VACUUM FULL (not really necessary)

but for all the various optimization combinations:


Clearly, no significant impact - exactly as in the initial post. In case you prefer the tabular form of the results (similar to the one presented for pgbench), this time tracking the total duration of the loading process (in seconds):

compiler    -O2   -O3   -O4
clang-3.1   407   407   -
clang-3.2   399   396   -
clang-3.3   399   395   -
clang-3.4   406   397   411
clang-3.5   405   405   411

compiler    -O2   -O3   -O3 -march=native   -O3 -march=native -flto
gcc-4.1.2   401   406   -                   -
gcc-4.2.4   407   402   397                 -
gcc-4.3.6   401   398   400                 -
gcc-4.4.7   400   402   398                 -
gcc-4.5.4   398   394   391                 393
gcc-4.6.4   406   398   400                 397
gcc-4.7.4   385   384   384                 387
gcc-4.8.3   390   384   390                 383
gcc-4.9.2   379   383   374                 379

and as a speedup versus the -O2 for the same compiler version (negative values mean slowdown):

compiler    -O2   -O3      -O4
clang-3.1   407   -0.10%   -
clang-3.2   399   0%       -
clang-3.3   399   1%       -
clang-3.4   406   2%       -1.18%
clang-3.5   405   0%       -1.38%

compiler    -O2   -O3      -O3 -march=native   -O3 -march=native -flto
gcc-4.1.2   401   -1.11%   -                   -
gcc-4.2.4   407   1%       2%                  -
gcc-4.3.6   401   0%       0%                  -
gcc-4.4.7   400   -0.61%   0%                  -
gcc-4.5.4   398   1%       1%                  1%
gcc-4.6.4   406   1%       1%                  2%
gcc-4.7.4   385   0%       0%                  -0.57%
gcc-4.8.3   390   1%       -0.02%              1%
gcc-4.9.2   379   -1.04%   1%                  -0.13%

Now, let's see the impact on query performance (notice the chart shows range 150-210, in seconds):


And the results in the tabular form:

compiler    -O2   -O3   -O4
clang-3.1   176   174   -
clang-3.2   176   172   -
clang-3.3   174   185   -
clang-3.4   189   176   181
clang-3.5   174   175   179

compiler    -O2   -O3   -O3 -march=native   -O3 -march=native -flto
gcc-4.1.2   186   200   -                   -
gcc-4.2.4   189   186   186                 -
gcc-4.3.6   189   186   185                 -
gcc-4.4.7   181   178   183                 -
gcc-4.5.4   173   169   166                 160
gcc-4.6.4   171   173   172                 153
gcc-4.7.4   171   166   183                 160
gcc-4.8.3   171   170   172                 161
gcc-4.9.2   164   167   162                 153

and as a speedup versus the -O2 for the same compiler version (negative values mean slowdown):

compiler    -O2   -O3      -O4
clang-3.1   176   1.14%    -
clang-3.2   176   2.27%    -
clang-3.3   174   -6.32%   -
clang-3.4   189   6.88%    4.23%
clang-3.5   174   -0.57%   -2.87%

compiler    -O2   -O3      -O3 -march=native   -O3 -march=native -flto
gcc-4.1.2   186   -7.53%   -                   -
gcc-4.2.4   189   1.59%    1.59%               -
gcc-4.3.6   189   1.59%    2.12%               -
gcc-4.4.7   181   1.66%    -1.10%              -
gcc-4.5.4   173   2.31%    4.05%               7.51%
gcc-4.6.4   171   -1.17%   -0.58%              10.53%
gcc-4.7.4   171   2.92%    -7.02%              6.43%
gcc-4.8.3   171   0.58%    -0.58%              5.85%
gcc-4.9.2   164   -1.83%   1.22%               6.71%

For clang, the results vary for each version - on 3.3 the -O3 results in ~6% slowdown, on 3.4 it's ~6% speed-up. For the last version (3.5) it's a slight slowdown for both -O3 and -O4.

For gcc, the -O3 and -O3 -march=native flags are a bit unpredictable - on older versions this might give either slight improvement or significant slowdown (see for example gcc-4.7.4 where -O3 gives ~3% speed-up, but -O3 -march=native results in ~7% slowdown).

The only flag that really matters on gcc is apparently -flto (i.e. Link Time Optimization), giving ~5-7% speedup for most versions. That's not negligible, although it's not a ground-breaking speed-up either.


  • The various optimization flags don't have much impact - in most cases it's ~1-2%.
  • When they do have an impact, it's often in the unexpected (and undesirable) direction, actually making it slower.
  • The one flag that apparently makes a measurable difference in the right direction is -flto, giving ~3% speed-up in pgbench and ~7% in TPC-DS.

Josh Berkus: Setting up Ghost with PostgreSQL and Apache vhosts

From Planet PostgreSQL. Published on Nov 28, 2014.

I'm currently in love with Ghost, a Node.js blogging platform.  It's simple and small, uses Markdown for blog editing, and has scads of themes available, most of which are responsive and typography-centric.  If there's a project which might be a Wordpress-killer -- especially for PostgreSQL users -- Ghost is probably it.  I've set up both my food and pottery blog and the new SFPUG page using Ghost.

Ghost comes configured by default to use SQLite, though, and of course I had to run it on Postgres.  Also, I needed to set up Apache vhosts to redirect to the various Ghost instances, since one Node instance only runs one Ghost instance, and needed supervisord for autostart.  At some point later, I'll repackage all this to use docker containers for each Ghost instance, but these instructions don't cover that.

First, install Node.js.  OS packages are fine for Ubuntu 14.04 or Fedora; for other OSes you may need to get funky with source installs.  I'm not going to cover that here.  You'll also need to install the Node PostgreSQL driver, and the postgresql-server-dev package.

Second, "git clone" the stable branch of the Ghost code into an appropriate directory.  This can be directly your web directory for Ghost, or you can download, build, and then copy.  If you do the latter, be aware that Ghost has several "dot" files it needs which a regular "cp -r" won't pick up.

Now, before you build Ghost, you're going to need to make some changes to support PostgreSQL.  Edit "package.json" in the main directory, and add PostgreSQL into the "optional dependencies" section like so:

    "optionalDependencies": {
        "mysql": "2.1.1",
        "pg" : "latest"

Now, build Ghost by running the build commands from the home directory:

    npm install -g grunt-cli
    npm install
    grunt init
    grunt prod

Make sure you scroll through the output of the above commands and look for errors; they don't always fail on error.

As a note, Ghost, like Rails, has "dev", "stage", and "production" modes.  Each of these modes can use a different database, even a different DBMS.  Or you can have Ghost connect to the same database for all three modes.  By default, Ghost starts in dev mode.

You'll need to configure your database connection.  Before editing config.js, make sure that you've set up PostgreSQL to accept local connections using an md5 password.  I additionally route Ghost through pgBouncer in case of getting slammed by connections; this requires you to configure pgBouncer as well:
  1. add Ghost user to the database
  2. create a database for the Ghost
  3. configure pg_hba.conf so the new user can connect with an md5 password
  4. if using pgBouncer, configure pgbouncer.ini and users.txt with the database and the user/password you're using for Ghost.
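Steps 1 and 2 above can be done from psql, for example (the user, password and database names here match the config.js example shown later in the post; the pg_hba.conf line is a minimal sketch):

```sql
-- Steps 1 and 2: create the Ghost role and its database
CREATE USER ghostblog WITH PASSWORD 'supercalifragalistic';
CREATE DATABASE ghostblog OWNER ghostblog;

-- Step 3: a matching pg_hba.conf entry (md5 password auth over local TCP)
-- host    ghostblog    ghostblog    md5
```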
Once that's all set up, you can configure Ghost to use Postgres.  Edit config.js, and in each section (dev, staging and prod) where you want to use a Postgres database, replace the SQLite connection information with something like this:

        database: {
            client: 'pg',
            connection: {
                host     : '',
                port     : '6432',
                user     : 'ghostblog',
                password : 'supercalifragalistic',
                database : 'ghostblog',
                charset  : 'utf8'
            }
        },
Then save.  As you can see, that's a configuration for pgBouncer.  If you're not using pgBouncer, then you can omit the host and port configuration.  If you're using Heroku, you'll need to point the Host at the public address for your Heroku Postgres instance. Restart Ghost at this point, or start it if you haven't already:

   npm start
   npm start --production

After this, you should be able to connect to Ghost on port 2368 of the defined interface ( if you haven't configured anything).  However, that doesn't actually get a site up.  For one, Ghost may have to share the server with non-Ghost sites and other Ghost instances.  Second, users aren't generally accustomed to looking for sites on port 2368.  Also, I'm not all that confident about Node.js security, given how new the platform is.

Enter Apache vhosts with proxying.

First, configure Ghost to listen only an alternate port and the internal IP address:

        server: {
            // Host to be passed to node's `net.Server#listen()`
            host: '',
            // Port to be passed to node's `net.Server#listen()`
            port: '4444'
        }

Note that if you have several Ghost instances, you'll need a different port for each.  If you're running Ghost under Docker, then redirect the Ghost port to a unique high port on the host.

Next, create a new vhost file (in /etc/apache2/servers-available on Ubuntu) pointing to Ghost:

    <VirtualHost *:80>

        ProxyRequests off
        ProxyPass /
        ProxyPassReverse / http:/

        CustomLog /var/log/apache2/ combined
        ErrorLog /var/log/apache2/
    </VirtualHost>

Those ProxyPass lines allow bi-directional pass-through from Apache to Ghost.  Note that this prevents you from putting any other Apache directives on the Vhost other than logging, such as an authentication config; they will get ignored, because the request gets passed to Ghost first.

Now, link that to servers-enabled, and restart Ghost and Apache.  You should now be able to connect to Ghost using the URL of the vhost.

But wait ... what's keeping Ghost running?  npm will quit on error, and has no ability to restart with a server restart.  You could craft your own init file, but an alternative is to use supervisord.

First, install supervisor from packages.  Then add a config file to /etc/supervisor/conf.d:

    [program:sfpostgres]
    command = node /var/www/
    directory = /var/www/
    user = ghost
    autostart = true
    autorestart = true
    stdout_logfile = /var/log/supervisor/sfpostgres.log
    stderr_logfile = /var/log/supervisor/sfpostgres_err.log
    environment = NODE_ENV="production"

Then add the service to supervisor and enable it:

    supervisorctl add sfpostgres
    supervisorctl start sfpostgres

.... and that's it!  You should now have a Ghost instance on a custom domain with autorestart able to share a server with other websites.  And most importantly, running on PostgreSQL.

Petr Jelínek: Progress on online upgrade

From Planet PostgreSQL. Published on Nov 28, 2014.

In last couple of months I’ve been working on online upgrade for very large databases as part of the AXLE project and I would like to share my thoughts on the topic and what progress we have made recently.

Before joining 2ndQuadrant I used to work in Skype where the business would not allow a maintenance window for our databases. This meant no downtime was allowed for deployments, upgrades, etc. That kind of rule makes you change the way you do things. Most changes are small, you don’t do any heavy locks, you have replicas to allow for fast fail-over. But while you can make your releases small and non-blocking, what happens when you need to do a major version upgrade of the PostgreSQL database?

You might be in a different situation, as most companies do have an upgrade window, and so you might be able to afford some downtime during the upgrade. This however brings two problems. For one, no company actually likes downtime even when it is allowed. More importantly, once your database grows beyond gigabytes in size into the range of terabytes or hundreds of terabytes, the downtime can take days or even weeks, and nobody can afford to stop their operations for that long. The result is that many companies often skip important upgrades, making the next one even more painful, while developers miss out on new features and performance improvements. Companies sometimes even risk running a PostgreSQL version that is no longer supported and has known data corruption or security problems. In the following paragraphs I will talk a little about my work on making upgrades less time consuming and, as a result, less painful and hopefully more frequent.

Let me start with a little history first. Before PostgreSQL 9.0, the only way to do a major version upgrade was to run pg_dump and restore the dump into an instance running a newer version of PostgreSQL. This method required the structure and all data to be read from the database and written into a file, then read back from the file and inserted into a new database, after which indexes had to be rebuilt, and so on.

As you can imagine this process can take quite some time. Performance improvements were made in 8.4 for pg_restore with the addition of the -j option, which lets you specify how many parallel jobs to run. This makes it possible to restore several tables (indexes, etc.) in parallel, making the restore process faster for custom-format dumps. The 9.3 version added a similar option to pg_dump, improving performance even further. But given how fast data volumes are growing, parallelization alone is not enough to make any serious gain in the time required for an upgrade.

Then in PostgreSQL 9.0 a utility called pg_upgrade arrived. Pg_upgrade dumps just the structures and restores them into the new cluster, but it copies the data files as they are on disk, which is much faster than dumping them into a logical format and then reinserting them. This is good enough for small databases because it means a downtime in the range of minutes or hours, a time acceptable for many scenarios. There is also the link mode, which just creates hard links (junction points on Windows), making the process even faster. But from my personal point of view it is too dangerous to run such a setup on a production master server. Let me briefly explain why. If something goes wrong once you start the new server that was upgraded using the link mode, you are suddenly without a production database and have to fail over, or worse, you have to restore from backup. That means you not only failed to upgrade, but you also caused additional downtime! Good luck getting approval next time.

Many people who can't afford long downtimes for upgrades use trigger-based replication solutions like Slony or Londiste to do the upgrade. This is a good approach, because you can replicate your data while the original server is running and then switch over with minimal downtime. In practice, however, there are several problems. One is that trigger-based solutions are often clunky to set up, especially if you are doing it only once every couple of years, and only to do the upgrade. It is also easy to miss a table, or to add tables in the wrong order, and thus not get a full copy. I have witnessed this in practice, and the people doing the upgrade were working with trigger-based replication on a daily basis. Another issue is that trigger-based solutions add considerable load to the source database, sometimes making the upgrade impossible because the database server becomes overloaded once replication is activated. And last but often not least, it can take a very long time for trigger-based replication to actually move the data to the new server. On the last upgrade project I was involved with, the trigger-based solution took around a month to copy the database and catch up with the changes. Yes, one month.

PostgreSQL 9.4 brings the logical decoding feature, which offers a fresh start for designing a new and better solution to the online upgrade problem. What we did, as part of the AXLE project, was create a tool which combines logical decoding with the techniques described above. The solution solves most of the problems of the previous approaches. The Uni-Directional Replication PostgreSQL extension (UDR for short) does logical replication using logical decoding of the write-ahead log (WAL). Thanks to this, the impact on the master server is almost on par with physical streaming replication, so the additional load caused by an ongoing upgrade is minimal on the running system. It also provides tools to initialize new nodes, using either a physical or a logical backup; you can even turn an existing physical slave into a UDR slave. And because it is a logical replication system, it can be designed in a way that supports cross-version replication.

What all this means is that we can now use UDR in combination with pg_upgrade to perform an online upgrade of a major PostgreSQL version with minimal downtime, in a short amount of absolute time, and with minimal impact on the running system.

An example of how this can look in practice:

  • Take a pg_basebackup of the existing instance.
  • Set up UDR replication between the original instance and the one created by the base backup.
  • Run pg_upgrade on the new instance.
  • Let UDR replay the changes that happened in the meantime.
  • Switch the traffic to the new instance.
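The steps above can be sketched on the command line roughly as follows; host names, users, data directories, and version paths are placeholders of my own, and the UDR configuration itself is covered by the wiki guide rather than shown here:

```shell
# 1. Take a base backup of the existing instance onto the new host
pg_basebackup -h oldhost -U replication \
    -D /var/lib/postgresql/old_copy -X stream -P

# 2. Set up UDR replication between the original instance and the copy
#    (configuration steps are described in the UDR Online Upgrade guide)

# 3. Run pg_upgrade on the new instance against the new major version
pg_upgrade -b /usr/lib/postgresql/9.4/bin -B /usr/lib/postgresql/9.5/bin \
    -d /var/lib/postgresql/old_copy -D /var/lib/postgresql/new_data

# 4. Start the upgraded instance, let UDR replay the changes that happened
#    in the meantime, then point application traffic at the new server
pg_ctl -D /var/lib/postgresql/new_data start
```

The key property is that the original server keeps serving traffic throughout steps 1-4; only the final switchover needs a short downtime window.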

For a howto with more detailed instructions, see the UDR Online Upgrade guide on the PostgreSQL wiki. The UDR sources are available in the 2ndquadrant_bdr repository on the PostgreSQL git server (bdr-plugin/next branch).

Finally, since UDR is not just an online upgrade tool but also a replication solution, it can be used for your normal replication needs instead of physical streaming replication. Furthermore, it provides several advantages, such as the ability to create temporary tables, to replicate from multiple OLTP databases into one big data warehouse database, or to replicate just part of the database.

My hope is that this effort will mean that downtime considerations are no longer a problem when it comes to upgrading from PostgreSQL 9.4 and above to a new major version.

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633.

gabrielle roth: PDXPUG December Lab; no regular December meeting

From Planet PostgreSQL. Published on Nov 27, 2014.

As is now our tradition, we will not have a regular meeting this month. Go to the Winter Coders’ Social instead.

We are, however, having a lab night to close out the year!

Thu Dec 4, 6pm-8pm
Where: Renewable Funding, 400 SW 6th Ave

Mark Wong is going to take us on a tour of Bi-Directional Replication.

This lab will require some prep work, as BDR uses a patched version of 9.4. Don’t let that put you off attending though, as you can always pair with someone else.

We do need to know who’s coming so we can notify building security, so please sign up in advance here.

Grab dinner at the carts & come on over!

Michael Paquier: Postgres 9.5 feature highlight: pageinspect extended for GIN indexes

From Planet PostgreSQL. Published on Nov 27, 2014.

The extension module pageinspect has already been bumped to version 1.3 in PostgreSQL 9.5 with the addition of functions for BRIN indexes. A couple of days back, a new set of functions was added for GIN indexes with this commit.

commit: 3a82bc6f8ab38be3ed095f1c86866900b145f0f6
author: Heikki Linnakangas <>
date: Fri, 21 Nov 2014 11:46:50 +0200
Add pageinspect functions for inspecting GIN indexes.

Patch by me, Peter Geoghegan and Michael Paquier, reviewed by Amit Kapila.

This consists of a set of 3 functions that take as an argument the raw content of a relation page (fetched using get_raw_page):

  • gin_page_opaque_info, able to fetch information about the flags located in the opaque area of a GIN page.
  • gin_metapage_info, able to translate information located in the meta page of a GIN index.
  • gin_leafpage_items, giving some information about the content of a GIN leaf page.

Such functions are useful for developers who want to look at the internals of GIN indexes. First, let's create an index using the GIN operators of pg_trgm on the book "Les Miserables", English translation.

=# CREATE EXTENSION pg_trgm;
=# CREATE EXTENSION pageinspect;
=# CREATE TABLE les_miserables (num serial, line text);
=# COPY les_miserables (line) FROM '/path/to/les_miserables.txt';
COPY 68116
=# CREATE INDEX les_miserables_idx ON les_miserables
   USING gin (line gin_trgm_ops);

First, gin_page_opaque_info provides information about the status of a page (plus extras, like the right-link page, if any). Here is, for example, the status of the meta page of the previous index and of one of its leaf pages.

=# SELECT * FROM gin_page_opaque_info(get_raw_page('les_miserables_idx', 0));
 rightlink  | maxoff | flags
------------+--------+-------
 4294967295 |      0 | {meta}
(1 row)
=# SELECT * FROM gin_page_opaque_info(get_raw_page('les_miserables_idx', 3));
 rightlink | maxoff |         flags
-----------+--------+------------------------
         5 |      0 | {data,leaf,compressed}
(1 row)

Using gin_metapage_info, a direct view of what is stored in GinMetaPageData is available (refer mainly to gin_private.h for more details); it fails if the page accessed does not contain this information.

=# SELECT * FROM gin_metapage_info(get_raw_page('les_miserables_idx', 0));
-[ RECORD 1 ]----+-----------
pending_head     | 4294967295
pending_tail     | 4294967295
tail_free_size   | 0
n_pending_pages  | 0
n_pending_tuples | 0
n_total_pages    | 1032
n_entry_pages    | 267
n_data_pages     | 764
n_entries        | 5791
version          | 2
=# SELECT * FROM gin_metapage_info(get_raw_page('les_miserables_idx', 1));
ERROR:  22023: input page is not a GIN metapage
DETAIL:  Flags 0000, expected 0008
LOCATION:  gin_metapage_info, ginfuncs.c:62

Finally, gin_leafpage_items can be used to retrieve details about the items stored in a leaf page (being either a posting tree or a posting list).

=# SELECT first_tid, nbytes, tids[2] AS second_tid
   FROM gin_leafpage_items(get_raw_page('les_miserables_idx', 3)) LIMIT 4;
 first_tid | nbytes | second_tid
-----------+--------+------------
 (149,94)  |    248 | (149,95)
 (154,11)  |    248 | (154,12)
 (158,62)  |    248 | (158,64)
 (163,14)  |    248 | (163,20)
(4 rows)

This new set of functions improves the existing coverage of btree and BRIN. Cool stuff for developers.

shutting down

By Adomas Paltanavičius from Planet Django. Published on Mar 08, 2014.

Is Open Source Consulting Dead?

By chrism from plope. Published on Sep 10, 2013.

Has Elvis left the building? Will we be able to sustain ourselves as open source consultants?

Consulting and Patent Indemification

By chrism from plope. Published on Aug 09, 2013.

Article about consulting and patent indemnification

Python Advent Calendar 2012 Topic

By chrism from plope. Published on Dec 24, 2012.

An entry for the 2012 Japanese advent calendar at

Why I Like ZODB

By chrism from plope. Published on May 15, 2012.

Why I like ZODB better than other persistence systems for writing real-world web applications.

A str. __iter__ Gotcha in Cross-Compatible Py2/Py3 Code

By chrism from plope. Published on Mar 03, 2012.

A bug caused by a minor incompatibility can remain latent for long periods of time in a cross-compatible Python 2 / Python 3 codebase.

In Praise of Complaining

By chrism from plope. Published on Jan 01, 2012.

In praise of complaining, even when the complaints are absurd.

2012 Python Meme

By chrism from plope. Published on Dec 24, 2011.

My "Python meme" replies.

In Defense of Zope Libraries

By chrism from plope. Published on Dec 19, 2011.

A much too long defense of Pyramid's use of Zope libraries.

Plone Conference 2011 Pyramid Sprint

By chrism from plope. Published on Nov 10, 2011.

An update about the happenings at the recent 2011 Plone Conference Pyramid sprint.

Jobs-Ification of Software Development

By chrism from plope. Published on Oct 17, 2011.

Try not to Jobs-ify the task of software development.

WebOb Now on Python 3

By chrism from plope. Published on Oct 15, 2011.

Report about porting to Python 3.

Open Source Project Maintainer Sarcastic Response Cheat Sheet

By chrism from plope. Published on Jun 12, 2011.

Need a sarcastic response to a support interaction as an open source project maintainer? Look no further!

Pylons Miniconference #0 Wrapup

By chrism from plope. Published on May 04, 2011.

Last week, I visited the lovely Bay Area to attend the 0th Pylons Miniconference in San Francisco.

Pylons Project Meetup / Minicon

By chrism from plope. Published on Apr 14, 2011.

In the SF Bay Area on the 28th, 29th, and 30th of this month (April), 3 separate Pylons Project events.

PyCon 2011 Report

By chrism from plope. Published on Mar 19, 2011.

My personal PyCon 2011 Report