
Open Source Posts

Josh Berkus: Finding Duplicate Indexes

From Planet PostgreSQL. Published on Sep 18, 2014.

Recently a client asked us to help them find and weed out duplicate indexes.  We had some old queries to do this, but they tended to produce a lot of false positives, and in a database with over 2000 indexes that wasn't going to cut it.  So I rewrote those queries to make them a bit more intelligent and discriminating, and to supply more information to the user on which to base decisions about whether to drop an index.

Here's the first query, which selects only indexes which have exactly the same columns.  Let me explain the columns of output it produces:
  • schema_name, table_name, index_name: the obvious
  • index_cols: a comma-delimited list of index columns
  • indexdef: a CREATE statement for how the index was created, per pg_indexes view
  • index_scans: the number of scans on this index per pg_stat_user_indexes
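The query itself didn't survive the reposting here; as a stand-in, a sketch of my own (not Josh's original SQL) that produces the columns described above might look like the following. Note that index_cols here is the attribute-number list from pg_index.indkey rather than column names:

```sql
-- Sketch: indexes on the same table with identical column lists,
-- plus definition and scan counts to judge which one to drop.
SELECT ns.nspname  AS schema_name,
       tab.relname AS table_name,
       idx.relname AS index_name,
       array_to_string(ind.indkey::int2[], ',') AS index_cols,
       pgi.indexdef,
       s.idx_scan  AS index_scans
FROM pg_index ind
JOIN pg_class idx ON idx.oid = ind.indexrelid
JOIN pg_class tab ON tab.oid = ind.indrelid
JOIN pg_namespace ns ON ns.oid = tab.relnamespace
JOIN pg_indexes pgi ON pgi.schemaname = ns.nspname
                   AND pgi.indexname = idx.relname
JOIN pg_stat_user_indexes s ON s.indexrelid = ind.indexrelid
WHERE EXISTS (SELECT 1 FROM pg_index ind2
              WHERE ind2.indrelid = ind.indrelid
                AND ind2.indkey = ind.indkey
                AND ind2.indexrelid <> ind.indexrelid)
ORDER BY 1, 2, 4;
```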
Now, go run it on your own databases.  I'll wait.

So, you probably noticed that we still get some false positives, yes?  That's because an index can have all the same columns but still be different.  For example, it could use varchar_pattern_ops, GiST, or be a partial index.  However, we want to see those because often they are functionally duplicates of other indexes even though they are not exactly the same.  For example, you probably don't need both an index on ( status WHERE cancelled is null ) and on ( status ).

What about indexes which contain all of the columns of another index, plus some more?  Like if you have one index on (id, name) you probably don't need another index on just (id).  Well, here's a query to find partial matches.

This second query looks for indexes where one index contains all of the same columns as a second index, plus some more, and they both share the same first column.  While a lot of these indexes might not actually be duplicates, a lot of them will be.
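The full second query was likewise lost in reposting; one way to write the overlap check this paragraph describes (my sketch, not the original) is:

```sql
-- Sketch: indexes on the same table where one column list contains the
-- other, both starting with the same column (attribute numbers compared).
SELECT ns.nspname  AS schema_name,
       tab.relname AS table_name,
       idx.relname AS index_name,
       pg_get_indexdef(ind.indexrelid) AS indexdef
FROM pg_index ind
JOIN pg_class idx ON idx.oid = ind.indexrelid
JOIN pg_class tab ON tab.oid = ind.indrelid
JOIN pg_namespace ns ON ns.oid = tab.relnamespace
WHERE EXISTS (
    SELECT 1
    FROM pg_index as ind2
    WHERE ind.indrelid = ind2.indrelid
    AND ( ind.indkey @> ind2.indkey
     OR ind.indkey <@ ind2.indkey )
    AND ind.indkey[0] = ind2.indkey[0]
    AND ind.indkey <> ind2.indkey
    AND ind.indexrelid <> ind2.indexrelid
)
ORDER BY 1, 2;
```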

Obviously, you could come up with other variations on this, for example searching for all multicolumn indexes with the same columns in a different order, or indexes with the same first two columns but others different.  To create your own variations, the key is to edit the filter criteria contained in this clause:

    FROM pg_index as ind2
    WHERE ind.indrelid = ind2.indrelid
    AND ( ind.indkey @> ind2.indkey
     OR ind.indkey <@ ind2.indkey )
    AND ind.indkey[0] = ind2.indkey[0]
    AND ind.indkey <> ind2.indkey
    AND ind.indexrelid <> ind2.indexrelid

... and change it to figure out the factors which give you the most real duplicates without missing anything.

Happy duplicate-hunting!

gabrielle roth: Happy 10th Birthday, Portal!

From Planet PostgreSQL. Published on Sep 18, 2014.

The Portal Project hosted at PSU is near & dear to our hearts here at PDXPUG. (It’s backed by an almost 3TB Postgres database.) We’ve had several talks about this project over the years:

eXtreme Database Makeover (Episode 2): PORTAL – Kristin Tufte
Metro Simulation Database – Jim Cser
R and Postgres – Chris Monsere (I think this is where we first heard about bus bunching)
Extreme Database Makeover – Portal Edition – William van Hevelingin

Kristin Tufte most recently spoke at the PDXPUG PgDay about current development on this project, which is now in its 10th year. Future plans include data from more rural locations, more detailed bus stats, and possibly a new bikeshare program. We look forward to hearing more about it!

Tomas Vondra: Introduction to MemoryContexts

From Planet PostgreSQL. Published on Sep 18, 2014.

If I had to name one thing that surprised me the most back when I started messing with C and PostgreSQL, I'd probably name memory contexts. I had never met this concept before, so it seemed rather strange, and there's not much documentation introducing it. I recently read an interesting paper summarizing the architecture of a database system (by Hellerstein, Stonebraker and Hamilton), and it actually devotes a whole section (7.2 Memory Allocator) to memory contexts (aka allocators). The section explicitly mentions PostgreSQL as having a fairly sophisticated allocator, but sadly it's very short (only ~2 pages) and describes only the general ideas, without discussing the code and challenges - which is understandable, because there are many possible implementations. BTW the paper is very nice, I definitely recommend reading it.

But this blog is a good place to present details of the PostgreSQL memory contexts, including the issues you'll face when using them. If you're a seasoned PostgreSQL hacker, chances are you know all of this (feel free to point out any inaccuracies), but if you're just starting hacking PostgreSQL in C, this blog post might be useful for you.

Now, when I said there's not much documentation about memory contexts, I was lying a bit. There are plenty of comments in memutils.h and aset.c, explaining the internals quite well - but who reads code comments, right? Also, you can only read them once you realize how important memory contexts are (and find the appropriate files). Another issue is that the comments only explain "how it works" and not some of the consequences (like palloc overhead, for example).


But why do we even need memory contexts? In C, you simply call malloc whenever you need to allocate memory on the heap, and when you're done with the memory, you call free. It's simple, and for short programs this is sufficient and manageable, but as the program gets more complex (passing allocated pieces between functions) it becomes really difficult to track all those little pieces of memory. Memory allocated at one place may be passed around and then freed at a completely different part of the code, far far away from the malloc that allocated it. If you free a piece too early, the application will eventually see garbage; if you free it too late (or never), you get excessive memory usage (or memory leaks).

And PostgreSQL is a quite complex piece of code - consider for example how tuples flow through execution plans. A tuple is allocated at one place, gets passed through sorting, aggregations, various transformations etc. and is eventually sent to the client.

Memory contexts are a clever way to deal with this - instead of tracking each little piece of memory separately, each piece is registered somewhere (in a context), and then the whole context is released at once. All you have to do is choose the memory context, and call palloc/pfree instead of malloc/free.

In the simplest case palloc simply determines the "current" memory context (more on this later), allocates an appropriate piece of memory (by calling malloc), associates it with the current memory context (by storing a pointer in the memory context and some info in a "header" of the allocated piece) and returns it to the caller. Freeing the memory is done either through pfree (which reverses the palloc logic) or by freeing the whole memory context (you can see it as a pfree loop over all allocated pieces).

This offers multiple optimization options - for example reducing the number of malloc/free calls by keeping a cache of released pieces, etc.
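As a toy illustration of the release-everything-at-once idea (plain C of my own, nothing to do with PostgreSQL's actual AllocSet code): every allocation is linked into a context, and resetting the context frees everything at once.

```c
#include <stdlib.h>

/* Toy "memory context": a linked list of allocated chunks. This only
 * sketches the basic idea; PostgreSQL's real allocator adds block
 * pooling, chunk reuse, size classes and much more. */
typedef struct ToyChunk
{
    struct ToyChunk *next;      /* link to the next chunk in the context */
} ToyChunk;

typedef struct ToyContext
{
    ToyChunk   *chunks;         /* head of the list of live chunks */
} ToyContext;

/* like palloc: allocate a piece of memory in the given context */
static void *
toy_alloc(ToyContext *ctx, size_t size)
{
    ToyChunk   *chunk = malloc(sizeof(ToyChunk) + size);

    if (chunk == NULL)
        return NULL;
    chunk->next = ctx->chunks;  /* register the chunk in the context */
    ctx->chunks = chunk;
    return (void *) (chunk + 1);    /* payload follows the header */
}

/* like resetting a context: free every registered chunk at once */
static void
toy_reset(ToyContext *ctx)
{
    while (ctx->chunks != NULL)
    {
        ToyChunk   *next = ctx->chunks->next;

        free(ctx->chunks);
        ctx->chunks = next;
    }
}
```

A real pfree also needs a back-pointer from each chunk to its owning context; this sketch omits that.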

The other thing is granularity and organization of memory contexts. We certainly don't want a single huge memory context, because that's almost exactly the same as having no contexts at all. So we know we need multiple contexts, but how many? Luckily, there's a quite natural way to split memory contexts, because all queries are evaluated through execution plans - a tree of operators (scans, joins, aggregations, ...).

Most executor nodes have their own memory context, released once that particular node completes. So for example when you have a join or aggregation, once this step finishes (and passes all the results to the downstream operator), it discards the context and frees the memory it allocated (and didn't free explicitly). Sometimes this is not perfectly accurate (e.g. some nodes create multiple separate memory contexts), but you get the idea.

The link to execution plans also gives us a hint on how to organize the memory contexts - the execution plan is a tree of nodes, and with memory contexts attached to nodes, it's natural to keep the memory contexts organized in a tree too.

That being said, it's worth mentioning that memory contexts are not used only when executing queries - pretty much everything in PostgreSQL is allocated within some memory context, including "global" structures like the various caches. That however does not contradict the tree-ish structure and per-node granularity.
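PostgreSQL keeps parent/firstchild/nextchild pointers for exactly this tree. A toy sketch of my own (invented names, not PostgreSQL code) of how deleting a context recurses into its children first:

```c
#include <stdlib.h>

/* Toy context tree mirroring the parent/firstchild/nextchild layout.
 * Error handling is omitted for brevity. */
typedef struct ToyCtx
{
    struct ToyCtx *firstchild;  /* head of the list of child contexts */
    struct ToyCtx *nextchild;   /* next sibling under the same parent */
    int        *deleted;        /* demo only: shared deletion counter */
} ToyCtx;

static ToyCtx *
toy_create(ToyCtx *parent, int *deleted)
{
    ToyCtx     *ctx = calloc(1, sizeof(ToyCtx));

    ctx->deleted = deleted;
    if (parent != NULL)
    {
        ctx->nextchild = parent->firstchild;    /* push onto parent's list */
        parent->firstchild = ctx;
    }
    return ctx;
}

/* like deleting a context: children go first, then the context itself */
static void
toy_delete(ToyCtx *ctx)
{
    ToyCtx     *child = ctx->firstchild;

    while (child != NULL)
    {
        ToyCtx     *next = child->nextchild;

        toy_delete(child);
        child = next;
    }
    (*ctx->deleted)++;
    free(ctx);
}
```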

MemoryContext API

The first thing you should probably get familiar with is MemoryContextMethods API which provides generic infrastructure for various possible implementations. It more or less captures the ideas outlined above. The memory context itself is defined as a simple structure:

typedef struct MemoryContextData
{
    NodeTag type;
    MemoryContextMethods *methods;
    MemoryContext parent;
    MemoryContext firstchild;
    MemoryContext nextchild;
    char    *name;
    bool    isReset;
} MemoryContextData;

This allows the tree structure of memory contexts (via the parent and first/next child fields). The methods field describes what "operations" are available for a context:

typedef struct MemoryContextMethods
{
    void    *(*alloc) (MemoryContext context, Size size);
    /* call this free_p in case someone #define's free() */
    void    (*free_p) (MemoryContext context, void *pointer);
    void    *(*realloc) (MemoryContext context, void *pointer, Size size);
    void    (*init) (MemoryContext context);
    void    (*reset) (MemoryContext context);
    void    (*delete_context) (MemoryContext context);
    Size    (*get_chunk_space) (MemoryContext context, void *pointer);
    bool    (*is_empty) (MemoryContext context);
    void    (*stats) (MemoryContext context, int level);
    void    (*check) (MemoryContext context);
} MemoryContextMethods;

This pretty much says that each memory context implementation provides methods to allocate, free and reallocate memory (alternatives to malloc, free and realloc) and also methods to manage the contexts themselves (e.g. initialize a new context, destroy it etc.).

There are also several helper methods wrapping this API, forwarding the calls to the proper instance of MemoryContextMethods.

And when I mentioned palloc and pfree before - these are pretty much just additional wrappers on top of these helper methods (grabbing the current context and passing it into the method).

Allocation Set (AllocSet) Allocator

Clearly, the MemoryContext API provides just the infrastructure, and was developed in anticipation of multiple allocators with different features. That however never happened, and so far there's a single memory context implementation - Allocation Set.

This often makes the discussion a bit confusing, because people mix the general concept of memory contexts and the (single) implementation available.

The Allocation Set implementation is quite sophisticated (aka complex). Let me quote the first comment in aset.c:

... it manages allocations in a block pool by itself, combining many small allocations in a few bigger blocks. AllocSetFree() normally doesn't free() memory really. It just add's the free'd area to some list for later reuse by AllocSetAlloc(). All memory blocks are free()'d at once on AllocSetReset(), which happens when the memory context gets destroyed.

To explain this a bit - AllocSet allocates blocks of memory (multiples of 1kB), and then "splits" this memory into smaller chunks, to satisfy the actual palloc requests. When you free a chunk (by calling pfree), it can't immediately pass it to free because the memory was allocated as a part of a larger block. So it keeps the chunk for reuse (for similarly-sized palloc requests), which has the nice benefit of lowering the number of malloc calls (and generally malloc-related book-keeping).

This works well as long as the palloc calls come with a mix of different request sizes, but pathological allocation patterns can break it. Similarly, it's possible to construct request sizes that interact badly with the logic that rounds requests up into size groups (which is what makes the chunks easy to reuse), resulting in a lot of wasted memory.

There's another optimization for requests over 8kB, which are handled differently - the largest blocks in the block pool are 8kB, and all requests exceeding this are allocated through malloc directly, and freed immediately using free.

The CurrentMemoryContext

Now, let's say you call palloc, which looks almost exactly the same as a malloc call:

char * x = palloc(128); // allocate 128B in the context

So how does it know which memory context to use? It's really simple - the memory context implementation defines a few global variables, tracking interesting memory contexts, and one of them is CurrentMemoryContext which means "we're currently allocating memory in this context."

Earlier I mentioned that each execution node has an associated context - so the first thing a node may do is set its associated memory context as the current one. This however is a problem on its own, because the child nodes may do the same, and the execution may be "interleaved" (the nodes pass tuples between each other in an iterative manner).

Thus what we usually see is this idiom:

MemoryContext oldcontext = MemoryContextSwitchTo(nodecontext);

char * x = palloc(128);
char * y = palloc(256);

MemoryContextSwitchTo(oldcontext);

which switches to the node's context for the allocations and then restores the current memory context to the original value.


I tried to explain the motivation and basics of memory contexts, and hopefully point you to the proper source files for more info.

The main points to remember are probably:

  • Memory contexts group allocated pieces of memory, making it easier to manage lifecycle.
  • Memory contexts are organized in a tree, roughly matching the execution plans.
  • There's a generic infrastructure allowing different implementations, but nowadays there's a single implementation - Allocation Set.
  • It attempts to minimize malloc calls/book-keeping, maximize memory reuse, and never really frees memory.

In the next post I'll look into the usual problems with palloc overhead.

Joshua Drake: Along the lines of GCE, here are some prices

From Planet PostgreSQL. Published on Sep 18, 2014.

I was doing some research for a customer who wanted to know where the real value to performance is. Here are some pricing structures between GCE, AWS and Softlayer. For comparison Softlayer is bare metal versus virtual.

GCE: 670.00
60G Memory
2500GB HD space

GCE: 763.08
104G Memory
2500GB HD space

Amazon: 911.88
30G Memory
3000GB HD Space

Amazon: 1534.00
122GB Memory
SSD 1 x 320
3000GB HD Space

Amazon: 1679.00
60GB Memory
SSD 2 x 320
3000GB HD Space

None of the above include egress bandwidth charges. Ingress is free.

Softlayer: ~815 (with 72GB memory ~ 950)
16 Cores
4TB (4 2TB drives)
48GB Memory

Softlayer: ~1035 (with 72GB memory ~ 1150)
16 Cores
3TB (6 1TB drives; I also looked at 8 750GB drives and the price was the same, and 2TB drives cost about the same as well)
48GB Memory

Jeff Frost: WAL-E with Rackspace CloudFiles over Servicenet

From Planet PostgreSQL. Published on Sep 18, 2014.

Found a great walkthrough on setting up WAL-E to use python-swiftclient for storage in Rackspace Cloud Files:

Unfortunately by default, your backups use the public URL for Cloud Files and eat into metered public bandwidth.

The way to work around this is to set the endpoint_type to internalURL instead of the default publicURL.

You do that by setting the following environment variable:
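The variable itself was lost from the original post; assuming WAL-E's python-swiftclient support reads a SWIFT_ENDPOINT_TYPE variable (verify the exact name against your WAL-E version's documentation), it would be:

```shell
# SWIFT_ENDPOINT_TYPE is an assumption based on WAL-E's Swift support;
# check your WAL-E release for the exact variable name.
export SWIFT_ENDPOINT_TYPE="internalURL"
```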


That allows WAL-E to use Servicenet for base backups and WAL archiving which will be much faster and not eat into your metered public bandwidth.

Denish Patel: Postgres in Amazon RDS

From Planet PostgreSQL. Published on Sep 18, 2014.

Today, I presented on “Postgres in Amazon RDS” topic at Postgres Open Conference in Chicago. Here is the slide deck:

Chris Travers: PGObject Cookbook Part 2.1: Serialization and Deserialization of Numeric Fields

From Planet PostgreSQL. Published on Sep 17, 2014.


This article demonstrates the simplest cases regarding autoserialization and deserialization to the database of objects in PGObject.   It also demonstrates a minimal subset of the problems that three-valued logic introduces and the most general solutions to those problems.  The next article in this series will address more specific solutions and more complex scenarios.

The Problems

Often we want database fields automatically turned into object types which are useful to an application.  The example here turns SQL numeric fields into Perl Math::BigFloat objects.  However, the transformation isn't perfect, and if not carefully done it can be lossy.  Most application types don't support database NULLs properly, and therefore a NULL making a round trip may end up with an unexpected value if we aren't careful.  Therefore we have to create our type in a way which can make round trips in a proper, lossless way.

NULLs introduce another subtle problem with such mappings, in that object methods are usually not prepared to handle them properly.  One solution here is to try to follow the basic functional programming approach and copy on write.  This prevents a lot of problems.  Most Math::BigFloat operations do not mutate the objects so we are relatively safe there, but we still have to be careful.

The simplest way to address this is to build into one's approach a basic sensitivity to three-valued logic.  However, this poses a number of problems, in that one can accidentally assign a value whose NULL state then affects computations elsewhere.

A key principle for all our types is that they should handle a NULL round trip properly for the data type, i.e. a NULL from the db should be turned into a NULL on database insert.  We generally allow programmers to check the types for NULLs, but don't explicitly handle them with three-valued logic in the application (that's the programmer's job).

The Example Module and Repository

This article follows the code of PGObject::Type::BigFloat.  The code is licensed under the two-clause BSD license, as is the rest of the PGObject framework.  You can read the code to see the boilerplate.  I won't be including it in here.  I will though note that this extends the Math::BigFloat library, which provides arbitrary precision arithmetic in Perl and is a good match for LedgerSMB's numeric types.

NULL handling

To solve the problem of null inputs we extend the hashref slightly with a key _pgobject_undef and allow this to be set or checked by applications with a function "is_undef."  This is fairly trivial:

sub is_undef {
    my ($self, $set) = @_;
    $self->{_pgobject_undef} = $set if defined $set;
    return $self->{_pgobject_undef};
}

How PGObject Serializes

When a stored procedure is called, the mapper class calls PGObject::call_procedure with an enumerated set of arguments.  A query is generated to call the procedure, and each argument is checked for a "to_db" method.  That method, if it exists, is called and the output used instead of the argument provided.  This allows an object to specify how it is serialized.

The to_db method may return either a literal value or a hashref with two keys, type and value.  If the latter, the value is used as the value literal and the type is the cast type (i.e. it generates ?::type for the placeholder and binds the value to it).  This hash approach is automatically used when bytea arguments are found.

The code used by PGObject::Type::BigFloat is simple:

sub to_db {
    my $self = shift @_;
    return undef if $self->is_undef;
    return $self->bstr;
}

Any type of course can specify a to_db method for serialization purposes.

How and When PGObject Deserializes

Unlike serialization, deserialization from the database can't happen automatically without the developer specifying which database types correspond to which application classes, because multiple types could serialize into the same application classes.  We might even want different portions of an application (for example in a database migration tool) to handle these differently.

For this reason, PGObject has what is called a "type registry" which specifies which types are deserialized and as what.  The type registry is optionally segmented into several "registries" but most uses will in fact simply use the default registry and assume the whole application wants to use the same mappings.  If a registry is not specified the default subregistry is used and that is consistent throughout the framework.

Registering a type is fairly straightforward but mostly amounts to boilerplate code in both the type handler and using scripts.  For this type handler:

sub register {
    my $self = shift @_;
    croak "Can't pass reference to register \n".
          "Hint: use the class instead of the object" if ref $self;
    my %args = @_;
    my $registry = $args{registry};
    $registry ||= 'default';
    my $types = $args{types};
    $types = ['float4', 'float8', 'numeric'] unless defined $types and @$types;
    for my $type (@$types){
        my $ret =
            PGObject->register_type(registry => $registry, pg_type => $type,
                                    perl_class => $self);
        return $ret unless $ret;
    }
    return 1;
}

Then we can just call this in another script as:

PGObject::Type::BigFloat->register;

Or we can specify a subset of types or different types, or the like.

The deserialization logic is handled by a method called 'from_db' which takes in the database literal and returns the blessed object.  In this case:

sub from_db {
    my ($self, $value) = @_;
    my $obj = "$self"->new($value);
    $obj->is_undef(1) if ! defined $value;
    return $obj;
}

This supports subclassing, which is in fact the major use case.

Use Cases

This module is used as the database interface for numeric types in the LedgerSMB 1.5 codebase.  We subclass this module and add support for localized input and output (with different decimal and thousands separators).  This gives us a data type which can present itself to the user as one format and to the database as another.  The module could be further subclassed to make NULLs contagious (which in this module they are not) and the like.


PGObject::Type::BigFloat does not currently make the NULL handling contagious, and this module as such probably never will, as this is part of our philosophy of handing control to the programmer.  Those who do want contagious NULLs can override additional methods from Math::BigFloat to provide such in subclasses.

A single null can go from the db into the application and return to the db and be serialized as a null, but a running total of nulls will be saved in the db as a 0.  To this point, that behavior is probably correct.  More specific handling of nulls in the application, however, is passed to the developer which can check the is_undef method.

Next In Series:  Advanced Serialization and Deserialization:  Dates, Times, and JSON

Quinn Weaver: RDS for Postgres: List of Supported Extensions

From Planet PostgreSQL. Published on Sep 17, 2014.

Today I learned that Amazon doesn't keep any list of extensions supported in PostgreSQL. Instead, their documentation tells you to start a psql session and run 'SHOW rds.extensions'. But that creates a chicken-and-egg situation if you have an app that needs extensions, and you're trying to decide whether to migrate.

So here's a list of extensions supported as of today, 2014-09-17 (RDS PostgreSQL 9.3.3). I'll try to keep this current.


Pavel Stehule: plpgsql_check rpm packages are available for PostgreSQL9.3 for RHEL7, 6

From Planet PostgreSQL. Published on Sep 17, 2014.

If you have a RHEL6- or RHEL7-based Linux distro and use PostgreSQL 9.3 from the community repository, you can install plpgsql_check simply via yum.
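For example (the exact package name is my assumption based on PGDG naming conventions, so check first):

```shell
# Install plpgsql_check for PostgreSQL 9.3 from the PGDG repository.
# Verify the package name on your system with: yum search plpgsql_check
sudo yum install plpgsql_check_93
```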

Hans-Juergen Schoenig: Next stop: Joining 1 million tables

From Planet PostgreSQL. Published on Sep 17, 2014.

This week I started my preparations for one of my talks in Madrid. The topic is: “Joining 1 million tables”. Actually 1 million tables is quite a lot and I am not sure if there is anybody out there who has already tried to do something similar. Basically the idea is to join 1 million […]

Paul Ramsey: PostGIS for Managers

From Planet PostgreSQL. Published on Sep 16, 2014.

At FOSS4G this year, I wanted to take a run at the decision process around open source with particular reference to the decision to adopt PostGIS: what do managers need to know before they can get comfortable with the idea of making the move.

The Manager's Guide to PostGIS — Paul Ramsey from FOSS4G on Vimeo.

Vasilis Ventirozos: Offsite replication problems and how to solve them.

From Planet PostgreSQL. Published on Sep 16, 2014.

Those of us who use (and abuse) replication on a daily basis know how cool and flexible it is. I've seen a lot of guides on how to set up streaming replication in 5 minutes, and how to set up basic archiving and/or WAL shipping replication, but I haven't seen many guides combining these or implementing an offsite setup simulating latency, packet corruption, and basically what happens under network degradation.

In this post I will describe a resilient replication setup of 2 nodes and I will put it to the test. For this post I will use 2 Debian VMs, PostgreSQL 9.4 beta2, OmniPITR 1.3.2 and netem.
Netem is available on all current (2.6+) distributions and can emulate variable delay, loss, duplication and re-ordering.

The Basics

Streaming replication is awesome: it's fast, easy to set up, lightweight and near-realtime. But how does it perform over the internet?

I set up a simple streaming replica, with wal_segments and wal_keep_segments set low (10 and 5). Now I want to emulate how it will perform over a slow internet connection.
From lab2, as root:
# tc qdisc add dev eth0 root tbf rate 20kbit buffer 1600 limit 3000

This will emulate an "almost" network outage, limiting eth0 to 20kbit.
Next, hammer lab1 with transactions... a bit later:
FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000000A has already been removed
Makes sense, right? lab2 couldn't keep up with lab1, lab1 rotated all its xlogs, and the replica is now broken. I know this example is a bit extreme; these settings would never be used for an offsite replica - in all fairness they aren't even suitable for a local replica.
But! Network outages happen, and especially with geographically distributed databases this WILL happen, because:
Matter will be damaged in direct proportion to its value.

So, let's configure something that will tolerate such failures - and it will do much, much more.

First of all, we want postgres to archive, and we want WAL files to be transferred compressed and over an encrypted channel. For all my WAL management I will use OmniPITR, a WAL management suite written by OmniTI that is simple to use, has almost no dependencies and makes everything so much better. I will use rsync over ssh to transfer my WALs; below are the archive_command and recovery.conf entries for streaming + WAL shipping replication.
(please keep in mind that these settings are for the sake of this post, they are not to be used directly on production systems)

archive_command = '/home/postgres/omnipitr/bin/omnipitr-archive -D /opt/postgres/pgdata/ -dr gzip=postgres@lab2:/opt/postgres/walarchive -l /home/postgres/archive.log -v  "%p"'           

recovery.conf :
standby_mode = 'on'
primary_conninfo = 'user=postgres host=lab1 port=5432'
restore_command = '/home/postgres/omnipitr/bin/omnipitr-restore -l /home/postgres/restore.log -s gzip=/opt/postgres/walarchive -f /home/postgres/ -p /home/postgres/pause.removal -t /tmp -ep hang -pp /opt/postgres/psql/bin/pg_controldata  -sr -r -v %f %p'

Again, with the same WAL settings, I hammered the database with transactions; the replica started having delays because of the 20kbit limitation, but eventually it caught up and everything was OK.

OmniPITR is hands down awesome, and it can do much more than just archive and restore WALs. You can delay your replica, you can take hot backups from the slave that can even be encrypted on creation, you can output the backup from a slave directly to an offsite backup server with no need for extra disk space on the replica, and more.
(On some occasions directly sending WALs to the slave might create issues; if you get into this situation, remember that you can always archive WALs locally on the master and schedule a script to transfer the generated WALs at intervals that serve your needs.)

This all proves that WAL shipping replication on top of streaming is still very usable when it comes to remote/offsite slaves, and you can always switch to only SR or only WAL shipping without problems.
After all, it's good to know that in case of a network outage, your only limitation is disk space for archived WALs.

Feedback and ideas for future posts are always welcome :)

Thanks for reading
- Vasilis

Pavel Stehule: nice unix filter pv

From Planet PostgreSQL. Published on Sep 16, 2014.

I was searching for a filter that can count processed rows and show progress. It exists, and it is pv:

# import to vertica
zcat data.sql | pv -s 16986105538 -p -t -r | vsql

0:13:56 [4.22MB/s] [==============> ] 14%


Chris Travers: PGObject Cookbook Part 1: Introduction

From Planet PostgreSQL. Published on Sep 15, 2014.


I have decided to put together a PGObject Cookbook, showing the power of this framework.  If anyone is interested in porting the db-looking sides to other languages, please let me know.  I would be glad to provide whatever help my time and skills allow.

The PGObject framework is a framework for integrating intelligent PostgreSQL databases into Perl applications.  It addresses some of the same problems as ORMs but does so in a very different way.  Some modules are almost ORM-like, and more such modules are likely to be added in the future.  However, unlike an ORM, PGObject mostly serves as an interface to stored procedures, and whatever code generation routines are added are not intended to be quickly changed.  Moreover it only supports PostgreSQL because we make extended use of PostgreSQL-only features.

For those who are clearly not interested in Perl, this series may still be interesting as it not only covers how to use the framework but also various problems that happen when we integrate databases with applications.  And there are people who should not use this framework because it is not the right tool for the job.  For example, if you are writing an application that must support many different database systems, you probably will get more out of an ORM than you will this framework.  But you still may get some interesting stuff from this series so feel free to enjoy it.

Along the way this will explore a lot of common problems that happen when writing database-centric applications and how these can be solved using the PGObject framework.  Other solutions of course exist and hopefully we can talk about these in the comments.

Much of the content here (outside of the prefaces) will go into a documentation module on CPAN.  However I expect it to also be of far more general interest since the problems are common problems across frameworks.


PGObject is written under the theory that the database will be built as a server of information and only loosely tied to the application.  Therefore stored procedures should be able to add additional parameters without expecting that the application knows what to put there; if a parameter can accept a NULL and provide the same answer as before, the application can be assured that the database is still usable.

The framework also includes a fairly large number of other capabilities.  As we work through we will go through the main areas of functionality one at a time, building on the simplest capabilities and moving onto the more advanced.  In general these capabilities can be grouped into basic, intermediate, and advanced:

Basic Functionality

  1. Registered types, autoserialization, and autodeserialization
  2. The simple stored procedure mapper
  3. Aggregates and ordering
  4. Declarative mapped methods

Intermediate Functionality

  1. The Bulk Loader
  2. The Composite Type stored procedure mapper
  3. The database admin functions

Advanced Functionality

  1. Memoization of Catalog Lookups
  2. Writing your own stored procedure mapper

This series will cover all the above functionality and likely more.  As we get through the series, I hope that it will start to make sense and we will start to get a lot more discussion (and hopefully use) surrounding the framework.

Design Principles

The PGObject framework came out of a few years of experience building and maintaining LedgerSMB 1.3.  In general we took what we liked and what seemed to work well and rewrote those things that didn't.  Our overall approach has been based on the following principles:
  • SQL-centric: Declarative, hand-coded SQL is usually more productive than application programming languages.  The system should leverage hand-coded SQL.
  • Leveraging Stored Procedures and Query Generators: The system should avoid having people generate SQL queries themselves as strings and execute them.  It's better to store them persistently in the db, or to generate well-understood queries in general ways where necessary.
  • Flexible and Robust: It should be possible to extend a stored procedure's functionality (and arguments) without breaking existing applications.
  • DB-centric but Loosely Coupled:  The framework assumes that databases are the center of the environment, and that it is a self-contained service in its own right.  Applications need not be broken because the db structure changed, and the DB should be able to tell the application what inputs it expects.
  • Don't Make Unnecessary Decisions for the Developer:  Applications may use a framework in many atypical ways, and we should support them.  This means that very often, instead of assuming a single database connection, we provide hooks in the framework so the developer can decide how to approach this.  Consequently you can expect to have to slightly extend the framework to configure it for your application.
This framework is likely to be very different from anything else you have used.  While it shares some similarities with iBatis in the Java world, it is unique in the sense that the SQL is stored in the database, not in config files.  And while it was originally inspired by a number of technologies (including both REST and SOAP/WSDL), it is very much unlike any other framework I have come across.

Next in Series:  Registered Types:  Autoserialization and Deserialization between Numeric and Math::BigFloat.

gabrielle roth: PDXPUGDay Recap

From Planet PostgreSQL. Published on Sep 15, 2014.

Last weekend we held the biggest PDXPUGDay we’ve had in a while! 5 speakers + a few lightning talks added up to a fun lineup. About 1/3 of the ~50 attendees were in town for FOSS4G; I think the guy from New Zealand will be holding the “visitor farthest from PDXPUG” for a good long […]

Joshua Drake: GCE, A little advertised cloud service that is perfect for PostgreSQL

From Planet PostgreSQL. Published on Sep 15, 2014.


I have yet to run PostgreSQL on GCE in production. I am still testing it but I have learned the following:

  1. A standard provisioned disk for GCE will give you ~ 80MB/s random write.
  2. A standard SSD-provisioned disk for GCE will give you ~ 240MB/s.

Either disk can be provisioned as a raw device, allowing you to use Linux software RAID to build a RAID 10, which further increases speed and reliability. Think about that: 4 SSD-provisioned disks in a RAID 10...

The downside I see, outside of the general arguments against cloud services (shared tenancy, all your data held by a big brother, lack of control over your resources, general distaste for $vendor, or whatever else we in our right minds can think up), is that GCE is currently limited to 16 virtual CPUs and 104GB of memory.

What does that mean? It means GCE is likely perfect for 99% of PostgreSQL workloads. By far the majority of PostgreSQL installations need less than 104GB of memory. Granted, we have customers with 256GB, 512GB and even more, but those are few and far between.

It also means that EC2 is no longer your only choice for dynamic cloud provisioned VMs for PostgreSQL. Give it a shot, the more competition in this space the better.

US PostgreSQL Association: PgUS Fall Update 2014

From Planet PostgreSQL. Published on Sep 15, 2014.

It has been a little quiet on the U.S. front of late. Alas, summer of 2014 has come and gone and it is time to strap on the gators and get a little muddy. Although we have been relatively quiet we have been doing some work. In 2013 the board appointed two new board members, Jonathan S. Katz and Jim Mlodgeski. We also affiliated with multiple PostgreSQL User Groups:

  • PhillyPUG
  • SeaPUG


  • Abdul Yadi: Delta Table Clean Up in Bucardo 5 Cascaded Slave Replication

    From Planet PostgreSQL. Published on Sep 14, 2014.

    Thanks to the Bucardo team for responding to my previous post. My cascaded slave replication works as expected.

    Today I notice there is still something to do related with delta and track tables.
    Single table replication scenario:
    Db-A/Tbl-T1 (master) => Db-B/Tbl-T2 (slave) => Db-C/Tbl-T3 (cascaded slave)

    Every change on table T1 is replicated to T2, then from T2 to T3. After a while, VAC successfully cleans the delta and track tables on Db-A, but not on Db-B.

    I detect 2 issues:
    1. If cascaded replication from T2 to T3 succeeds, the delta table on Db-B is not cleaned up by VAC.
    2. If cascaded replication from T2 to T3 fails before the VAC schedule, the delta table on Db-B will be cleaned up by VAC anyway, and the cascaded replication from T2 to T3 is lost.

    I fix it by modifying SQL inside bucardo.bucardo_purge_delta(text, text):

    -- Delete all txntimes from the delta table that:
    -- 1) Have been used by all dbgroups listed in bucardo_delta_targets
    -- 2) Have a matching txntime from the track table
    -- 3) Are older than the first argument interval
    myst = 'DELETE FROM bucardo.'
    || deltatable
    || ' USING (SELECT track.txntime AS tt FROM bucardo.'
    || tracktable
    || ' track INNER JOIN bucardo.bucardo_delta_targets bdt ON'
    || ' GROUP BY 1 HAVING COUNT(*) = '
    || drows
    || ') AS foo'
    || ' WHERE txntime = tt'
    || ' AND txntime < now() - interval '
    || quote_literal($1);

    Need advice from Bucardo team.

    Chris Travers: LedgerSMB 1.4.0 Released

    From Planet PostgreSQL. Published on Sep 14, 2014.

    15 September 2014, London. The LedgerSMB project - all-volunteer developers and contributors - today announced LedgerSMB 1.4.0.

    Based on an open source code base first released in 1999, the LedgerSMB project was formed in 2006 and saw its 1.0 release in the same year. It has now seen continuous development for over eight years, and that shows no signs of slowing down.

    "LedgerSMB 1.4 brings major improvements that many businesses need," said Chris Travers, who helped found the project. "Businesses which do manufacturing or retail, or need features like funds accounting will certainly get much more out of this new release."

    Better Productivity

    LedgerSMB 1.4 features a redesigned contact management framework that allows businesses to better keep track of customers, vendors, employers, sales leads, and more. Contacts can be stored and categorized, and leads can be converted into sales accounts.

    Additionally, a new import module has been included that allows businesses to upload csv text files to import financial transactions and much more. No longer is data entry something that needs to be done entirely by hand or involves customizing the software.

    Many smaller enhancements are here as well. For example, shipping labels can now be printed for invoices and orders, and user management workflows have been improved.

    Better Reporting

    The reporting interfaces have been rewritten in LedgerSMB 1.4.0 in order to provide greater flexibility in both reporting and in sharing reports. Almost all reports now include a variety of formatting options including PDF and CSV formats. Reports can also be easily shared within an organization using stable hyperlinks to reports. Additionally the inclusion of a reporting engine means that it is now relatively simple to write third-party reports which offer all these features. Such reports can easily integrate with LedgerSMB or be accessed via a third party web page.

    Additionally, the new reporting units system provides a great deal more flexibility in tracking money and resources as they travel through the system. Not only can one track by project or department, but funds accounting and other specialized reporting needs are possible to meet.

    Better Integration

    Integration of third-party line of business applications is also something which continues to improve. While all integration is possible, owing to the open nature of the code and db structure, it has become easier as more logic is moved to where it can be easily discovered by applications.

    There are two major improvement areas in 1.4. First additional critical information, particularly regarding manufacturing and cost of goods sold tracking, has been moved into the database where it can be easily shared by other applications. This also allows for better testability and support. Secondly LedgerSMB now offers a framework for web services, which are currently available for contact management purposes, allowing integrators to more easily connect programs together.

    Commercial Options

    LedgerSMB isn't just an open source project. A number of commercial companies offer support, hosting, and customization services for this ERP. A list of some of the most prominent commercial companies involved can be found at

    Hubert 'depesz' Lubaczewski: Waiting for 9.5 – Add width_bucket(anyelement, anyarray).

    From Planet PostgreSQL. Published on Sep 14, 2014.

    On 9th of September, Tom Lane committed patch: Add width_bucket(anyelement, anyarray).   This provides a convenient method of classifying input values into buckets that are not necessarily equal-width. It works on any sortable data type.   The choice of function name is a bit debatable, perhaps, but showing that there's a relationship to the SQL […]

    Michael Paquier: Postgres 9.5 feature highlight: Logging of replication commands

    From Planet PostgreSQL. Published on Sep 14, 2014.

    Postgres 9.5 will come with an additional logging option making it possible to log replication commands received by a node. It was introduced by this commit.

    commit: 4ad2a548050fdde07fed93e6c60a4d0a7eba0622
    author: Fujii Masao <>
    date: Sat, 13 Sep 2014 02:55:45 +0900
    Add GUC to enable logging of replication commands.
    Previously replication commands like IDENTIFY_COMMAND were not logged
    even when log_statements is set to all. Some users who want to audit
    all types of statements were not satisfied with this situation. To
    address the problem, this commit adds new GUC log_replication_commands.
    If it's enabled, all replication commands are logged in the server log.
    There are many ways to allow us to enable that logging. For example,
    we can extend log_statement so that replication commands are logged
    when it's set to all. But per discussion in the community, we reached
    the consensus to add separate GUC for that.
    Reviewed by Ian Barwick, Robert Haas and Heikki Linnakangas.

    The new parameter is called log_replication_commands and needs to be set in postgresql.conf. The default is off, so as not to log new information that might surprise existing users after an upgrade to 9.5 or newer. Note that replication commands received by a node were already logged at DEBUG1 level by the server. A last thing to note is that if log_replication_commands is enabled, all the commands will be printed at LOG level and not DEBUG1, which is kept for backward-compatibility purposes.
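    As a minimal configuration sketch (only the parameter itself comes from the commit above; the comment is mine):

    ```
    # postgresql.conf
    log_replication_commands = on   # replication commands now appear at LOG level
    ```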

    Now, a server enabling this logging mode...

    $ psql -At -c 'show log_replication_commands'

    ... is able to show replication commands at LOG level. Here is, for example, the set of commands sent by a standby starting up:

    LOG:  received replication command: IDENTIFY_SYSTEM
    LOG:  received replication command: START_REPLICATION 0/3000000 TIMELINE 1

    This will certainly help utilities and users auditing replication, so I am looking forward to seeing log-parsing tools like pgbadger produce some nice output using this information.

    Barry Jones: Video: SQL vs NoSQL Discussion at UpstatePHP

    From Planet PostgreSQL. Published on Sep 12, 2014.

    Here's the video from the August UpstatePHP meeting in Greenville discussing SQL vs NoSQL and where they are useful for your development process. I represented SQL solutions (*cough* PostgreSQL *cough*) while Benjamin Young represented NoSQL. Ben has actively contributed to CouchDB, worked for Cloudant, Couchbase, organizes the REST Fest Unconference (happening again September 25-27th) and is t...

    Craig Ringer: pg_sysdatetime: a simple cross-platform PostgreSQL extension

    From Planet PostgreSQL. Published on Sep 12, 2014.

    A while ago I wrote about compiling PostgreSQL extensions under Visual Studio – without having to recompile the whole PostgreSQL source tree.

    I just finished the pg_sysdatetime extension, which is mainly for Windows but also supports compilation with PGXS on *nix. It’s small enough that it serves as a useful example of how to support Windows compilation in your extension, so it’s something I think is worth sharing with the community.

    The actual Visual Studio project creation process took about twenty minutes, and would've taken less if I hadn't been working remotely over Remote Desktop on an AWS EC2 instance. Most of the time was taken by the simple but fiddly and annoying process of adding the include paths and library path for the x86 and x64 configurations. That's necessary because MSVC can't just get them from pg_config and doesn't seem to have user-defined project variables to let you specify a $(PGINSTALLDIR) in one place.

    Working on Windows isn’t always fun – but it’s not as hard as it’s often made out to be either. If you maintain an extension but haven’t added Windows support it might be easier than you expect to do so.

    Packaging it for x86 and x64 versions of each major PostgreSQL release, on the other hand… well, let's just say we could still use PGXS support for Windows with a “make installer” target.

    gabrielle roth: PDXPUGDay 2014 report

    From Planet PostgreSQL. Published on Sep 11, 2014.

    We had about 50 folks attend the PDXPUGDay 2014 last week, between DjangoCon and Foss4g. A lot of folks were already in town for one of the other confs, but several folks also day tripped from SeaPUG! Thanks for coming on down.

    Thanks again to our speakers:
    Josh Drake
    David Wheeler
    Eric Hanson
    Veronika Megler
    Kristin Tufte
    Josh Berkus

    (Plus our lightning talk speakers: Josh B, Mark W, and Basil!)

    And our sponsors:
    2nd Quadrant

    And of course, PSU for hosting us.

    Videos are linked from the wiki.

    Leo Hsu and Regina Obe: FOSS4G 2014 televised live

    From Planet PostgreSQL. Published on Sep 11, 2014.

    If you weren't able to make it to FOSS4G 2014 this year, you can still experience the event live. All the tracks are being televised live, and the reception is pretty good. Lots of GIS users are using PostGIS and PostgreSQL. People seem to love Node.js too.

    After hearing enough about Node.JS from all these people, and this guy (Bill Dollins), I decided to try this out for myself.

    I created a node.js web application - which you can download from here: . It's really a spin-off from my other viewers, but more raw. I borrowed the same ideas as Bill, but instead of using a native node Postgres driver, I went for the pure JavaScript one so it's easier to install on all platforms. I also experimented with using base64 encoding to embed raster output directly into the browser so I don't have to have that silly img src path reference to contend with.

    Keith Fiske: A Large Database Does Not Mean Large shared_buffers

    From Planet PostgreSQL. Published on Sep 11, 2014.

    A co-worker of mine did a blog post last year that I’ve found incredibly useful when assisting clients with getting shared_buffers tuned accurately.

    Setting shared_buffers the hard way

    You can follow his queries there for using pg_buffercache to find out how your shared_buffers are actually being used. But I had an incident recently that I thought would be interesting to share, which shows how shared_buffers may not need to be set nearly as high as you believe it should. Or it can equally show you that you definitely need to increase it. Object names have been sanitized to protect the innocent.

    To set the stage, the database total size is roughly 260GB and the use case is high data ingestion with some reporting done on just the most recent data at the time. shared_buffers is set to 8GB. The other thing to note is that this is the only database in the cluster. pg_buffercache is installed on a per database basis, so you’ll have to install it on each database in the cluster and do some additional totalling to figure out your optimal setting in the end.

    database=# SELECT c.relname
      , pg_size_pretty(count(*) * 8192) as buffered
      , round(100.0 * count(*) / ( SELECT setting FROM pg_settings WHERE name='shared_buffers')::integer,1) AS buffers_percent
      , round(100.0 * count(*) * 8192 / pg_relation_size(c.oid),1) AS percent_of_relation
     FROM pg_class c
     INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
     INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
 GROUP BY c.oid, c.relname
 ORDER BY 3 DESC
 LIMIT 10;
                   relname               | buffered | buffers_percent | percent_of_relation
     table1                              | 7479 MB  |            91.3 |                 9.3
     table2                              | 362 MB   |             4.4 |               100.0
     table3                              | 311 MB   |             3.8 |                 0.8
     table4                              | 21 MB    |             0.3 |               100.0
     pg_attrdef_adrelid_adnum_index      | 16 kB    |             0.0 |               100.0
     table4                              | 152 kB   |             0.0 |                 7.7
     index5                              | 16 kB    |             0.0 |                14.3
     pg_index_indrelid_index             | 40 kB    |             0.0 |                 8.8
     pg_depend_depender_index            | 56 kB    |             0.0 |                 1.0
     pg_cast_source_target_index         | 16 kB    |             0.0 |               100.0

    You can see that table1 is taking up a vast majority of the space here and it’s a large table, so only 9% of it is actually in shared_buffers. What’s more interesting though is how much of the space for that table is actually in high demand.

    database=# SELECT pg_size_pretty(count(*) * 8192) 
    FROM pg_class c
    INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
    INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
    WHERE c.oid::regclass = 'table1'::regclass
    AND usagecount >= 2;
     2016 kB

    Data blocks that go into and come out of postgres all go through shared_buffers. Just to review the blog post I linked to: whenever a block is used in shared memory, its usage count is incremented by the clock-sweep algorithm, ranging from 1 to 5, with 5 marking extremely high-use data blocks. This means high-usage blocks are likely to be kept in shared_buffers (if there's room) and low-usage blocks will get moved out if space for higher-usage ones is needed. We believe that a simple insert or update sets a usagecount of 1. So now we look at the difference when the usage count threshold is dropped to that.
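    To see how usage counts are distributed across the whole buffer cache, a query along these lines can help (assuming pg_buffercache is installed, as above; this query is my own addition, not from the original post):

    ```sql
    -- Distribution of clock-sweep usage counts across shared_buffers.
    -- Rows with a NULL usagecount are buffers not currently in use.
    SELECT usagecount,
           count(*) AS buffers,
           pg_size_pretty(count(*) * 8192) AS buffered
    FROM pg_buffercache
    GROUP BY usagecount
    ORDER BY usagecount;
    ```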

    database=# SELECT pg_size_pretty(count(*) * 8192) 
    FROM pg_class c
    INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
    INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
    WHERE c.oid::regclass = 'table1'::regclass
    AND usagecount >= 1;
     4946 MB

    So shared_buffers is actually getting filled mostly by the data ingestion process, but relatively little of it is of any further use afterwards. If anything of greater importance were needed in shared_buffers, there's plenty of space available for higher-priority data, and that inserted data would quickly get flushed out of shared memory due to its low usagecount.

    So, with pg_buffercache installed, we've found that the query below seems to give a good estimate of an optimal, minimal shared_buffers setting:

    database=# SELECT pg_size_pretty(count(*) * 8192) as ideal_shared_buffers
     FROM pg_class c
     INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
     INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
     WHERE usagecount >= 3;
     640 MB

    This is the sort of query you would run after you have had your database running through your expected workload for a while. Also, note my use of the key word minimal. This does not account for unexpected spikes in shared_buffers usage that may occur during a session of reporting queries or something like that. So you definitely want to set it higher than this, but it can at least show you how effectively postgres is using its shared memory. In general we’ve found the typical suggestion of 8GB to be a great starting point for shared_buffers.

    So, in the end, the purpose of this post was to show that shared_buffers is something that needs further investigation to really set optimally and there is a pretty easy method to figuring it out once you know where to look.

    Chris Travers: Math and SQL Part 6: The Problem with NULLs

    From Planet PostgreSQL. Published on Sep 11, 2014.

    This will be the final installment on Math and SQL and will cover the problem with NULLs.  NULL handling is probably the most poorly thought-out feature of SQL and is inconsistent generally with the relational model.  Worse, a clear mathematical approach to NULLs is impossible with SQL because too many different meanings are attached to the same value.

    Unfortunately, nulls are also indispensable because wider tables are more expressive than narrower tables.  This makes advice such as "don't allow nulls in your database" somewhat dangerous because one ends up having to add them back in fairly frequently.

    At the same time understanding the problems that NULLs introduce is key to avoiding the worst of the problems and managing the rest.

    Definition of a Null Set

    A null set is simply a set with no members.  This brings us to the most obvious use of a NULL: when an outer join results in a row not being found.  This sort of use by itself doesn't do too much harm, but the inherent semantic ambiguity of "what does that mean?" also means you can't just substitute join tables for nullable columns and solve the problems that NULLs bring into the database.  This will hopefully become clearer below.

    Null as Unknown

    The first major problem surfaces when we ask the question: when I do a left join and the row to the right is not found, does that mean we don't know the answer yet, or that there is no value associated?  A missing result from an outer join will sometimes mean that the answer is not yet known, if only because we are still inserting the data in stages.  But it can also mean that there is simply no value associated.  In almost all databases, both meanings can occur in this situation.

    But then there is no additional harm done in allowing NULLs to represent unknowns in the tables themselves, right?

    Handling NULLs as unknown values complicates database design and introduces enough problems that many experts, like Chris Date, tend to be against their use in general.  The problem is that using joins doesn't solve the problem; it only creates additional failure cases to be aware of.  So very often, people use NULL in the database to mean unknown despite the problems.

    NULL as unknown introduces problems into predicate logic because it brings in three-valued logic (true, false, and unknown), but these are typically only problems when one is storing a value (as opposed to a reference, such as a key) in the table.  1 + NULL IS NULL.  NULL OR FALSE IS NULL.  NULL OR TRUE IS TRUE.  This makes things complicated.  But sometimes we must....
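    These identities are easy to check directly in psql; a quick sketch:

    ```sql
    -- Three-valued logic in action: all three columns come out true.
    SELECT (1 + NULL) IS NULL      AS one_plus_null_is_null,
           (NULL OR FALSE) IS NULL AS null_or_false_is_null,
           (NULL OR TRUE)          AS null_or_true;
    ```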

    Null as Not Applicable

    One severe antipattern that is frequently seen is the use of NULL to mean "Not Applicable" or "No Value."  A few data types have no natural empty/no-op value; prime among these are the numeric types.  Worse, Oracle treats NULL as the same value as an empty string for VARCHAR types.

    Now, the obvious problem here is that the database doesn't know that NULL is not unknown, and therefore you end up having to track this yourself, use COALESCE() to convert to sane values, etc.  In general, if you can avoid using NULL to mean "Not Applicable", you will find it worthwhile.

    Now, if you have to do this, one strategy to make this manageable is to include other fields to tell you what the null means.  Consider for example:

    CREATE TABLE wage_class (
       id int primary key,  -- a unique key is required by the foreign key below
       label text not null
    );

    INSERT INTO wage_class VALUES(1, 'salary'), (2, 'hourly');

    CREATE TABLE wage (
       ssn text not null,
       emp_id int not null,
       wage_class int not null references wage_class(id),
       hourly_wage numeric,
       salary numeric,
       check (wage_class = 1 or salary is null),
       check (wage_class = 2 or hourly_wage is null)
    );
    This approach allows us to select and handle logic based on the wage class and therefore we know based on the wage_class field whether hourly_wage is applicable or not.  This is far cleaner and allows for better handling in queries than just putting nulls in and expecting them to be semantically meaningful.  This solution can also be quite helpful because it ensures that one does not accidentally process an hourly wage as a salary or vice versa.
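    As a sketch, queries against the tables above can then branch on wage_class explicitly:

    ```sql
    -- Pick the applicable pay figure for each employee.
    SELECT emp_id,
           CASE wage_class
                WHEN 1 THEN salary       -- salaried: hourly_wage is NULL by constraint
                WHEN 2 THEN hourly_wage  -- hourly: salary is NULL by constraint
           END AS pay_amount
    FROM wage;
    ```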

    What Nulls Do to Predicate Logic

    Because NULLs can represent unknowns, they introduce three-valued predicate logic.  This itself can be pretty nasty.  Consider the very subtle difference between:

       WHERE ssn like '1234%' AND salary < 50000

    and

        WHERE ssn like '1234%' AND salary < 50000 IS NOT FALSE

    The latter will pull in hourly employees as well, as they have a NULL salary.

    Nulls and Constraints

    Despite all the problems, NULLs have become a bit of a necessary evil.  Constraints are a big part of the reason why.

    Constraints are far simpler to maintain if they are self-contained in a tuple and therefore require no further table access to verify.  This means that wider tables allow more expressive constraints than narrower tables.

    In the example above, we can ensure that every hourly employee has no salary, and every salaried employee has no hourly wage.  This level of mutual exclusion would not be possible if we were to break off salaries and wages into separate, joined tables.

    Nulls and Foreign Keys

    Foreign keys are a special case of NULLs where the use is routine and poses no problems.  NULL always means "no record referenced" in this context, and because of the specifics of three-valued boolean logic, such rows always drop out of join conditions.

    NULLs in foreign keys make foreign key constraints and 5th Normal Form possible in many cases where it would not be otherwise.  Consequently they can be used routinely here with few if any ill effects.
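    A minimal sketch (tables invented for illustration) of how a NULL foreign key simply drops out of a join:

    ```sql
    CREATE TABLE parent (id int primary key);
    CREATE TABLE child  (id int primary key,
                         parent_id int references parent(id));  -- nullable FK

    INSERT INTO parent VALUES (1);
    INSERT INTO child  VALUES (1, 1), (2, NULL);

    -- Child 2's NULL reference drops out; only child 1 is returned.
    SELECT c.id FROM child c JOIN parent p ON p.id = c.parent_id;
    ```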

    What Nulls Should Have Looked Like:  NULL, NOVALUE, UNKNOWN

    In retrospect, SQL would be cleaner if we could be more verbose about what we mean by a NULL.  UNKNOWN could then be reserved for the rare cases where we really must store a record with incomplete data in it.  NULL could be returned from outer joins, and NOVALUE could be used for foreign keys and places where we know the field is not applicable.

    Jehan-Guillaume (ioguix) de Rorthais: Bloat estimation for tables

    From Planet PostgreSQL. Published on Sep 10, 2014.

    After my Btree bloat estimation query, I found some time to work on a new query for tables. The goal here is still to have a better bloat estimation using dedicated queries for each kind of objects.

    Compared to the well-known bloat query, this query pays attention to:

    • TOAST
    • headers of variable length types
    • easier to filter or parse

    You’ll find the queries here:


    I created the file sql/bloat_tables.sql with the query version for 9.0 and above. I edited the query to add the bloat reported by pgstattuple (free_percent + dead_tuple_percent), to compare both results, and added the following filter:

    -- remove Non Applicable tables
    NOT is_na
    -- remove tables with real bloat < 1 block
    AND tblpages*((pst).free_percent + (pst).dead_tuple_percent)::float4/100 >= 1
    -- filter on table name using the parameter :tblname
    AND tblname LIKE :'tblname'

    Here is the result on a fresh pagila database:

    postgres@pagila=# \set tblname %
    postgres@pagila=# \i sql/bloat_tables.sql 
     current_database | schemaname |    tblname     | real_size | bloat_size | tblpages | is_na |   bloat_ratio    | real_frag 
     pagila           | pg_catalog | pg_description |    253952 |       8192 |       31 | f     |  3.2258064516129 |      3.34
     pagila           | public     | city           |     40960 |       8192 |        5 | f     |               20 |     20.01
     pagila           | public     | customer       |     73728 |       8192 |        9 | f     | 11.1111111111111 |     11.47
     pagila           | public     | film           |    450560 |       8192 |       55 | f     | 1.81818181818182 |      3.26
     pagila           | public     | rental         |   1228800 |     131072 |      150 | f     | 10.6666666666667 |      0.67
    (5 rows)

    Well, not too bad. Let’s consider the largest table, clone it and create some bloat:

    postgres@pagila=# create table film2 as select * from film;
    SELECT 1000
    postgres@pagila=# analyze film2;
    postgres@pagila=# \set tblname film%
    postgres@pagila=# \i sql/bloat_tables.sql
     current_database | schemaname | tblname | real_size | bloat_size | tblpages | is_na |   bloat_ratio    | real_frag 
     pagila           | public     | film    |    450560 |       8192 |       55 | f     | 1.81818181818182 |      3.26
     pagila           | public     | film2   |    450560 |       8192 |       55 | f     | 1.81818181818182 |      3.26
    (2 rows)
    postgres@pagila=# delete from film2 where film_id < 250;
    DELETE 249
    postgres@pagila=# analyze film2;
    postgres@pagila=# \set tblname film2
    postgres@pagila=# \i sql/bloat_tables.sql
     current_database | schemaname | tblname | real_size | bloat_size | tblpages | is_na |   bloat_ratio    | real_frag 
     pagila           | public     | film2   |    450560 |     122880 |       55 | f     | 27.2727272727273 |     27.29
    (1 row)

    Again, the bloat reported here is pretty close to the reality!

    Some more tests:

    postgres@pagila=# delete from film2 where film_id < 333;
    DELETE 83
    postgres@pagila=# analyze film2;
    postgres@pagila=# \i sql/bloat_tables.sql
     current_database | schemaname | tblname | real_size | bloat_size | tblpages | is_na |   bloat_ratio    | real_frag 
     pagila           | public     | film2   |    450560 |     155648 |       55 | f     | 34.5454545454545 |     35.08
    (1 row)
    postgres@pagila=# delete from film2 where film_id < 666;
    DELETE 333
    postgres@pagila=# analyze film2;
    postgres@pagila=# \i sql/bloat_tables.sql
     current_database | schemaname | tblname | real_size | bloat_size | tblpages | is_na |   bloat_ratio    | real_frag 
     pagila           | public     | film2   |    450560 |     303104 |       55 | f     | 67.2727272727273 |     66.43
    (1 row)
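The estimate tracks the deletions nicely. As a rough sanity check (a sketch, not from the original post; the numbers are copied from the psql sessions above), the fraction of deleted rows should land close to the reported bloat ratio, the small gap coming from page-level rounding and per-page overhead:

```python
# (cumulative deleted rows, bloat_ratio reported by bloat_tables.sql)
checkpoints = [(249, 27.27), (249 + 83, 34.55), (249 + 83 + 333, 67.27)]
total_rows = 1000  # film2 was created from film's 1000 rows

for deleted, reported in checkpoints:
    naive = 100 * deleted / total_rows  # % of rows turned into dead space
    print(f"deleted {deleted}: naive {naive:.1f}% vs estimated {reported}%")
```

In each case the naive figure is within a couple of percentage points of what the query estimates.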

    Good, good, good. What next?

    The alignment deviation

    You might have noticed that I did not mention the table with a large deviation between the statistical bloat and the real one, called “rental”:

    postgres@pagila=# \set tblname rental
    postgres@pagila=# \i sql/bloat_tables.sql
     current_database | schemaname | tblname | real_size | bloat_size | tblpages | is_na |   bloat_ratio    | real_frag 
     pagila           | public     | rental  |   1228800 |     131072 |      150 | f     | 10.6666666666667 |      0.67
    (1 row)

    This particular situation is exactly why I loved writing these bloat queries (including the btree one): confronting the statistics with reality, and finding a logical explanation or a fix.

    Statistical and real bloat are actually both right here. The statistical one is simply measuring the bloat AND something else we usually don’t pay attention to. I’ll call it the alignment overhead.

    Depending on the field types, PostgreSQL adds some padding before the values to align them inside the row with respect to the CPU word size. This helps ensure a value fits in a single CPU register when possible. Alignment paddings are given in the pg_type page of the PostgreSQL documentation; see the typalign field.

    So let’s demonstrate how it influences the bloat here. Back to the rental table, here is its definition:

    postgres@pagila=# \d rental
                                              Table "public.rental"
        Column    |            Type             |                         Modifiers                          
     rental_id    | integer                     | not null default nextval('rental_rental_id_seq'::regclass)
     rental_date  | timestamp without time zone | not null
     inventory_id | integer                     | not null
     customer_id  | smallint                    | not null
     return_date  | timestamp without time zone | 
     staff_id     | smallint                    | not null
     last_update  | timestamp without time zone | not null default now()

    All the fields here are fixed-size types, so it is quite easy to compute the row size:

    • rental_id and inventory_id are 4-byte integers; possible alignment is every 4 bytes from the beginning of the row
    • customer_id and staff_id are 2-byte integers; possible alignment is every 2 bytes from the beginning of the row
    • rental_date, return_date and last_update are 8-byte timestamps; possible alignment is every 8 bytes from the beginning of the row

    The minimum row size would be 2*4 + 2*2 + 3*8 = 36 bytes. Considering the alignment optimization and the order of the fields, we actually have (ASCII art is easier to explain):

    |0     1     2     3     4     5     6     7     8     |
    |       rental_id       |***********PADDING************|
    |                     rental_date                      |
    |     inventory_id      |customer_id|******PADDING*****|
    |                     return_date                      |
    | staff_id  |*****************PADDING******************|
    |                     last_update                      |

    That makes 12 bytes of padding, for a total row size of 48 bytes instead of 36. Here is our 10%! Let’s double-check this experimentally:
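The padding rule above is easy to mimic in a few lines of Python (a sketch assuming only fixed-width columns; the real row also carries a 23-byte header, itself padded, which is ignored here): each value is placed at the next offset that is a multiple of its typalign.

```python
def row_data_size(fields):
    """Data size of a heap row made of fixed-width columns,
    applying PostgreSQL-style alignment padding (typalign)."""
    offset = 0
    for size, align in fields:
        offset += -offset % align  # pad up to the next multiple of align
        offset += size
    return offset

# (typlen, typalign) for each column of rental, in table order
rental = [
    (4, 4),  # rental_id    integer
    (8, 8),  # rental_date  timestamp
    (4, 4),  # inventory_id integer
    (2, 2),  # customer_id  smallint
    (8, 8),  # return_date  timestamp
    (2, 2),  # staff_id     smallint
    (8, 8),  # last_update  timestamp
]

print(row_data_size(rental))                               # 48
print(row_data_size(sorted(rental, key=lambda f: -f[1])))  # 36
```

Ordering the columns from the widest alignment to the narrowest removes all padding, which is exactly what the rental2 experiment demonstrates.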

    postgres@pagila=# create table rental2 as select rental_date, return_date, last_update, rental_id, inventory_id, customer_id, staff_id from public.rental;
    SELECT 16044
    postgres@pagila=# \d rental2
                     Table "public.rental2"
        Column    |            Type             | Modifiers 
     rental_date  | timestamp without time zone | 
     return_date  | timestamp without time zone | 
     last_update  | timestamp without time zone | 
     rental_id    | integer                     | 
     inventory_id | integer                     | 
     customer_id  | smallint                    | 
     staff_id     | smallint                    | 
    postgres@pagila=# \dt+ rental*
                          List of relations
     Schema |  Name   | Type  |  Owner   |  Size   | Description 
     public | rental  | table | postgres | 1200 kB | 
     public | rental2 | table | postgres | 1072 kB | 
    (2 rows)
    postgres@pagila=# select 100*(1200-1072)::float4/1200;
         ?column?     
     10.6666666666667
    (1 row)
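The same arithmetic outside SQL (sizes taken from the \dt+ output above) lands right on the ~10.7% statistical bloat reported for rental:

```python
size_kb = {"rental": 1200, "rental2": 1072}  # from \dt+ rental*
overhead = 100 * (size_kb["rental"] - size_kb["rental2"]) / size_kb["rental"]
print(round(overhead, 2))  # 10.67
```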

    Removing the “remove tables with real bloat < 1 block” filter from my demo query, here is what we now have:

    postgres@pagila=# \set tblname rental%
    postgres@pagila=# \i sql/bloat_tables.sql
     current_database | schemaname | tblname | real_size | bloat_size | tblpages | is_na |   bloat_ratio    | real_frag 
     pagila           | public     | rental  |   1228800 |     131072 |      150 | f     | 10.6666666666667 |      0.67
     pagila           | public     | rental2 |   1097728 |          0 |      134 | f     |                0 |      0.41
    (2 rows)


    Sadly, I couldn’t find a good way to measure this in the queries so far, so I will live with it. By the way, this alignment overhead might be a nice subject for a script that measures it per table.

    Known issues

    The same as for the Btree statistical bloat query: I’m pretty sure the query will give a pretty bad estimation with array types. I’ll investigate that later.

    Cheers, and happy monitoring!

    Hans-Juergen Schoenig: Checking per-memory context memory consumption

    From Planet PostgreSQL. Published on Sep 10, 2014.

    Writing a complex database server like PostgreSQL is not an easy task. Especially memory management is an important task, which needs special attention. Internally PostgreSQL makes use of so called “memory contexts”. The idea of a memory context is to organize memory in groups, which are organized hierarchically. The main advantage is that in case […]

    Paul Ramsey: PostGIS 2.1.4 Released

    From Planet PostgreSQL. Published on Sep 09, 2014.

    The 2.1.4 release of PostGIS is now available.

    The PostGIS development team is happy to release a patch for PostGIS 2.1, the 2.1.4 release. As befits a patch release, the focus is on bugs, breakages, and performance issues.

    shutting down

    By Adomas Paltanavičius from Planet Django. Published on Mar 08, 2014.

    Is Open Source Consulting Dead?

    By chrism from plope. Published on Sep 10, 2013.

    Has Elvis left the building? Will we be able to sustain ourselves as open source consultants?

    Consulting and Patent Indemnification

    By chrism from plope. Published on Aug 09, 2013.

    Article about consulting and patent indemnification

    Python Advent Calendar 2012 Topic

    By chrism from plope. Published on Dec 24, 2012.

    An entry for the 2012 Japanese advent calendar at

    Why I Like ZODB

    By chrism from plope. Published on May 15, 2012.

    Why I like ZODB better than other persistence systems for writing real-world web applications.

    A str.__iter__ Gotcha in Cross-Compatible Py2/Py3 Code

    By chrism from plope. Published on Mar 03, 2012.

    A bug caused by a minor incompatibility can remain latent for long periods of time in a cross-compatible Python 2 / Python 3 codebase.

    In Praise of Complaining

    By chrism from plope. Published on Jan 01, 2012.

    In praise of complaining, even when the complaints are absurd.

    2012 Python Meme

    By chrism from plope. Published on Dec 24, 2011.

    My "Python meme" replies.

    In Defense of Zope Libraries

    By chrism from plope. Published on Dec 19, 2011.

    A much too long defense of Pyramid's use of Zope libraries.

    Plone Conference 2011 Pyramid Sprint

    By chrism from plope. Published on Nov 10, 2011.

    An update about the happenings at the recent 2011 Plone Conference Pyramid sprint.

    Jobs-Ification of Software Development

    By chrism from plope. Published on Oct 17, 2011.

    Try not to Jobs-ify the task of software development.

    WebOb Now on Python 3

    By chrism from plope. Published on Oct 15, 2011.

    Report about porting to Python 3.

    Open Source Project Maintainer Sarcastic Response Cheat Sheet

    By chrism from plope. Published on Jun 12, 2011.

    Need a sarcastic response to a support interaction as an open source project maintainer? Look no further!

    Pylons Miniconference #0 Wrapup

    By chrism from plope. Published on May 04, 2011.

    Last week, I visited the lovely Bay Area to attend the 0th Pylons Miniconference in San Francisco.

    Pylons Project Meetup / Minicon

    By chrism from plope. Published on Apr 14, 2011.

    In the SF Bay Area on the 28th, 29th, and 30th of this month (April), 3 separate Pylons Project events.

    PyCon 2011 Report

    By chrism from plope. Published on Mar 19, 2011.

    My personal PyCon 2011 Report