
Open Source Posts

Improved Django Tests

By Revolution Systems Blog from Django community aggregator: Community blog posts. Published on May 29, 2015.


Shaun M. Thomas: PG Phriday: 10 Ways to Ruin Performance: IN-Sanity

From Planet PostgreSQL. Published on May 29, 2015.

When working with a database, sometimes performance problems are both far more subtle, and much worse than a query itself might suggest. The topic of this week’s PGDB (PostgreSQL) performance killers article concerns the use of the IN clause, and how misusing it can catastrophically obliterate the database in mysterious ways.

To that end, we’ll use a slightly revised single-table test case since it’s served us pretty well so far:

DROP TABLE sys_order;
CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now()
);
INSERT INTO sys_order (product_id, item_count, order_dt)
SELECT (id % 100) + 1, (id % 100) + 1,
       now() - (id % 1000 || 'd')::INTERVAL
  FROM generate_series(1, 1000000) a(id);
ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id
      PRIMARY KEY (order_id);
CREATE INDEX idx_order_product_id
    ON sys_order (product_id);
CREATE INDEX idx_order_order_dt
    ON sys_order (order_dt);
ANALYZE sys_order;

As usual, my test system is a simple dual-CPU VM with 16GB of RAM and some mysterious storage allocation from a SAN. All settings are default, and the version of PostgreSQL is the latest release of the 9.4 branch. I always recommend using the latest version of PGDB when possible, otherwise there’s a risk of missing important planner improvements.

As it turns out in this particular story, the IN clause is actually pretty well known to most developers I’ve worked with. This isn’t some obtuse syntax that only experts have encountered; it’s used regularly in applications and ORMs throughout the industry. It’s ubiquitous, and consequently, extremely dangerous.

Why dangerous? Let’s examine a sanitized query I encountered in an actual running production system. Now, our test case is scaled down by a couple orders of magnitude, so the results won’t be as drastic as what I encountered. Still, the query below performs much worse than anything we’ve discussed so far:

SELECT * FROM sys_order
 WHERE order_id IN (
        SELECT DISTINCT order_id
          FROM sys_order
         WHERE product_id = 10
       )
 ORDER BY order_dt DESC
 LIMIT 30;
                             QUERY PLAN                             
 Limit  (cost=27768.47..27768.55 rows=30 width=20)
        (actual time=1362.794..1362.840 rows=30 loops=1)
   ->  Sort  (cost=27768.47..27791.97 rows=9400 width=20)
             (actual time=1362.785..1362.801 rows=30 loops=1)
 [ Horrible ugly mess redacted ]
 Planning time: 0.699 ms
 Execution time: 1363.219 ms

What we’re looking at here is the planner ripping itself to shreds trying to optimize a query with several problems:

  • A subquery containing the same table with no aggregates.
  • Use of DISTINCT on a primary-key column.
  • Ordering the results on the outside query.
  • Using the LIMIT on the outer query.

Taking these in order, it should be obvious that the subquery is pointless in this example. The inside query is essentially the same as the outside query, minus the ordering and result limit. There are a number of reasons this might happen. The IN clause is primarily used as a type of glue. Often, a developer or ORM will take a working query and embed it as a subquery unchanged. The justification is fairly simple: I’m interested in how these records are related, and I already have this working query.

In most cases, IN can be simplified into some kind of JOIN, since that’s how databases tend to combine related data. By using IN and a subquery, the planner has to perform numerous unnecessary optimization steps in an attempt to reach the best plan. As the amount of complexity increases, so does the number of potential execution paths. How many elements from the inner query, for instance, can be collapsed into the outer one? What is the resource cost for each variant of doing so?
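As a minimal sketch of that simplification, using our sys_order table (the aliases are mine), the subquery form and its JOIN equivalent look like this:

```sql
-- IN form: the planner must work out how to collapse the subquery.
SELECT o.*
  FROM sys_order o
 WHERE o.order_id IN (SELECT order_id
                        FROM sys_order
                       WHERE product_id = 10);

-- JOIN form: the relationship between the rows is stated directly.
SELECT o.*
  FROM sys_order o
  JOIN sys_order sub USING (order_id)
 WHERE sub.product_id = 10;
```

Since order_id is unique, both return the same rows; and in this degenerate self-join case, a single WHERE product_id = 10 would of course do the job by itself.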

Then there’s the DISTINCT clause within the inner query. That sys_order table is not joined with anything, and there are no row-multiplying functions. It’s not possible for the results to contain more than one row with the same primary key. Yet there it is, making the planner do more work.

The last two are closely related. Since the outer query doesn’t add any new WHERE clauses, applying an order and limiting the results at that point is simply inefficient. The database must first execute the inner query to find the relevant order_id values, and afterwards throw away all but the top 30 results. The planner could have used the index on order_dt and read it backwards. Or it could have used the index on product_id and then ordered the results afterward, depending on which was more efficient based on statistics. Instead, it has to produce, and then subsequently discard, all data that matched the subquery.

Here’s what the query should have been:

SELECT *
  FROM sys_order
 WHERE product_id = 10
 ORDER BY order_dt DESC
 LIMIT 30;
                             QUERY PLAN                             
 Limit  (cost=0.42..172.64 rows=30 width=20)
        (actual time=5.049..5.101 rows=30 loops=1)
   ->  Index Scan Backward using idx_order_order_dt on sys_order
           (cost=0.42..53960.42 rows=9400 width=20)
           (actual time=5.046..5.073 rows=30 loops=1)
         Filter: (product_id = 10)
         Rows Removed by Filter: 9000
 Planning time: 0.099 ms
 Execution time: 5.137 ms

Oh, look! There’s that backward index scan I mentioned. The row estimates are a bit off, and we’ll probably want to increase statistics and analyze to produce better values, but this is a speed improvement of over 250x. In a production system, even a few milliseconds can be a huge problem with enough throughput. Multiply that by 250, and the issue is upgraded to a catastrophe.
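For reference, the statistics bump mentioned above is a two-liner; the target of 500 here is purely illustrative (the default is 100):

```sql
ALTER TABLE sys_order ALTER COLUMN product_id SET STATISTICS 500;
ANALYZE sys_order;
```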

So how did this happen?

In this particular instance, it was the fault of Java Hibernate. An object for the inner query was passed to another object to flesh out the order detail, and the result almost crashed a production system. The fix was to make smarter use of Hibernate capabilities so it didn’t generate such a terrible query. Indeed, code refactors are probably something we should all consider doing more often in order to reduce accumulated technical debt.

Unfortunately, human-generated queries aren’t free from fault, either. It’s far too tempting to smash two queries together rather than rewrite them as a merged version using proper JOIN syntax. I’ve done it myself when in a hurry to check something. The difference is that I know better than to commit such a hack to a permanent location in our code tree. And the reason I don’t is that I know the potential havoc such a query can wreak.

And now, so do you. By itself, IN is a wonderful tool. But its allure can bewitch and ultimately betray when it becomes a crutch. PGDB has a lot of powerful syntax potential, and it would be a shame to limit ourselves to the basics out of familiarity.

Joshua Drake: Let's delete contrib!

From Planet PostgreSQL. Published on May 28, 2015.

There has been a lot of discussion about the upcoming extension pg_audit and whether or not it should be in contrib. You can read about that here. The end result of the discussion is that pg_audit is going to be reverted and will not be in contrib. There were plenty of technical reasons why people didn't want it in contrib, but I have a different reason: it is an extension. It doesn't need to be in contrib. In fact, I argue that because of pgxs and extensions we don't need contrib at all. If you don't follow the mailing lists, my argument is below; please feel free to comment here. Discourse on this topic is very much needed.

This is a topic that has come up in various ways over the years. After the long thread on pg_audit, I thought it might be time to bring it up again.

Contrib according to the docs is:

"These include porting tools, analysis utilities, and plug-in features that are not part of the core PostgreSQL system, mainly because they address a limited audience or are too experimental to be part of the main source tree. This does not preclude their usefulness."

It has also been mentioned many times over the years that contrib is a holding tank for technology that would hopefully be pushed into core someday.

What I am suggesting:

1. Analyze the current contrib modules for inclusion into -core. A few of these are pretty obvious:


I am sure there will be plenty of fun to be had with what should or shouldn't be merged into core. I think if we argue about the guidelines of how to analyze what should be in core versus the merits of any particular module, life will be easier. Here are some for a start:

A. Must have been in contrib for at least two releases
B. Must have visible community (and thus use case)

2. Push the rest out into a .Org project called contrib. Let those who are interested in the technology work on them or use them. Since this project would be outside of core proper, it can work just like other extension projects. Alternatively, allow the maintainers to push them wherever they like (Launchpad, GitHub, Savannah, ...).

Why I am suggesting this:

1. Less code to maintain in core
2. Eliminates the mysticism of contrib
3. Removal of experimental code from core
4. Most of the distributions package contrib separately anyway
5. Some of contrib serves an extremely small use case (sepgsql, tsearch2, lo, ...)
6. Finding utilities for PostgreSQL used to be harder. It is dumb-simple, teenage-Snapchat-user easy now.
7. Isn't this what pgxs is for?
8. Everybody hates cleaning the closet until they see the end result.
9. Several of these modules would make PostgreSQL look good anyway (case-insensitive index searching by default with citext? It's a gimme)
10. Contrib has been getting smaller and smaller. Let's cut the cord.
11. Isn't this the whole point of extensions?
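To illustrate the citext point above, here is a minimal sketch (the table and values are hypothetical) of what case-insensitive matching looks like once the extension is installed:

```sql
CREATE EXTENSION IF NOT EXISTS citext;

CREATE TABLE users (email citext PRIMARY KEY);
INSERT INTO users VALUES ('Jane@Example.com');

-- Matches despite the case difference; a plain TEXT column would not.
SELECT * FROM users WHERE email = 'jane@example.COM';
```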



Greg Sabino Mullane: Postgres "unsupported frontend protocol" mystery

From Planet PostgreSQL. Published on May 28, 2015.

The wonderful tail_n_mail program continues to provide me with new mysteries from our Postgres clients. One of the main functions it provides is to send an immediate email to us when an unexpected FATAL (or ERROR or PANIC) message appears in the Postgres logs. While these are often simple application errors, or deeper problems such as running out of disk space, once in a blue moon you see something completely unexpected. Some time ago, I saw a bunch of these messages appear in a tail_n_mail email:

[1] From files A to B Count: 2
First: [A] 2015-12-01T06:30:00 server1 postgres[1948]
Last:  [B] 2015-12-01T06:30:00 server2 postgres[29107]
FATAL: unsupported frontend protocol 65363.19778: server supports 1.0 to 3.0

I knew what caused this error in general, but decided to get to the bottom of the problem. Before we go into the specific error, let's review what causes this particular message to appear. When a Postgres client (such as psql or DBD::Pg) connects to Postgres, the first thing it does is issue a startup message. One of the things included in this request is the version of the Postgres protocol the client wishes to use. Since 2003, Postgres servers have been using version 3.0. It is very rare to see a client or server that uses anything else. Because this protocol number request occurs at the very start of the connection attempt, non-Postgres programs often trigger this error, as the server is expecting a number at the start of the request.

We can verify this by use of a small Perl script that connects to the server, and sends an invalid protocol request:

#!/usr/bin/env perl

use strict;
use warnings;
use IO::Socket;

## Connect to the local Postgres Unix socket
my $server = IO::Socket::UNIX->new('/tmp/.s.PGSQL.5432')
  or die "Could not connect!: $@";

## Claim protocol version 1234.56789, then supply a user and database
my $packet = pack('nn', 1234, 56789) . "user\0pg\0\0";
## The startup packet begins with its own total length
$packet = pack('N', length($packet) + 4) . $packet;
$server->send($packet, 0);

After running the above program, a new error pops up in the Postgres logs as expected:

$ tail -1 /var/lib/pgsql/data/pg_log/postgres-2015-05-20.log
2015-05-21 12:00:00 EDT [unknown]@[unknown] [10281] FATAL:  unsupported frontend protocol 1234.56789: server supports 1.0 to 3.0

There is our error, as expected. The "unknown"s are because my log_line_prefix looks like this: %t %u@%d [%p] . While the timestamp (%t) and the process ID (%p) are easily filled in, the login failed, so both the user (%u) and database (%d) are still unknown.

Now on to our specific error, which you will recall is "unsupported frontend protocol 65363.19778". The above program shows that the protocol number is sent in a specific format. Let's use Perl to display the numbers 65363.19778 and see if there are any clues buried within it:

$ perl -e 'print pack "nn", 65363,19778'

Some sort of unprintable character in there; let's take a deeper look just for completeness:

$ perl -e 'print pack "nn", 65363,19778' | hd
00000000  ff 53 4d 42                                       |.SMB|

Aha! SMB is not just a random placement of three letters, it is a big clue as to what is causing this message. SMB stands for Server Message Block, and is used by a variety of things. We can guess that this is either some program randomly hitting the Postgres port without realizing what it is, or some sort of purposeful port scanner. Why would something want to connect to the port but not log in? For one, you can determine the version of Postgres without logging in.

To cut to the chase, the culprit is the nmap program. In addition to simply scanning ports, it has the ability to do a deeper inspection to determine not only what is running on each port, but what version it is as well (with the "-sV" argument). Let's see nmap in action. We will use a non-standard Postgres port so as not to give it any additional hints about what is on that port:

$ nmap -p 5930 localhost -sV
Starting Nmap 6.40 ( ) at 2015-05-20 12:00 EDT
Nmap scan report for localhost (
Host is up (0.000088s latency).
5930/tcp open  postgresql PostgreSQL DB
1 service unrecognized despite returning data. If you know the service/version, please submit the following fingerprint at :

Service detection performed. Please report any incorrect results at .
Nmap done: 1 IP address (1 host up) scanned in 6.73 seconds

It looks like it triggered the "unsupported protocol" message, based on what was returned. Taking a peek at the Postgres 9.3 logs shows our mystery message:

$ tail -1 /var/lib/pgsql/pg9.3/pg_log/postgres-2015-05-20.log
2015-05-21 12:00:00 EDT [unknown]@[unknown] [2318] FATAL:  unsupported frontend protocol 65363.19778: server supports 1.0 to 3.0

As a final check, let's confirm that nmap is using SMB when it runs the version check:

$ nmap localhost -p 5930 -sV --version-trace 2>/dev/null | grep SMB
Service scan sending probe SMBProgNeg to (tcp)
Service scan match (Probe SMBProgNeg matched with SMBProgNeg line 10662): is postgresql.  Version: |PostgreSQL DB|||

Bingo. Mystery solved. If you see that error in your logs, it is most likely caused by someone running nmap in version detection mode.

Michael Paquier: Postgres 9.5 feature highlight: Archiving of last segment on timeline after promotion

From Planet PostgreSQL. Published on May 28, 2015.

Postgres 9.5 is bringing a change in the way WAL is archived with the following commit:

commit: de7688442f5aaa03da60416a6aa3474738718803
author: Heikki Linnakangas <>
date: Fri, 8 May 2015 21:59:01 +0300
At promotion, archive last segment from old timeline with .partial suffix.

Previously, we would archive the possible-incomplete WAL segment with its
normal filename, but that causes trouble if the server owning that timeline
is still running, and tries to archive the same segment later. It's not nice
for the standby to trip up the master's archival like that. And it's pretty
confusing, anyway, to have an incomplete segment in the archive that's
indistinguishable from a normal, complete segment.


As mentioned in the commit log above, prior to 9.5 a standby would always, at promotion, try to archive the last, partial WAL segment of the old timeline it was recovering on. This behavior has been present in Postgres for ages, and there was no easy way to tell the difference between a completely full segment and a partially completed one.

The data of this last partial segment is available in the segment file of the new timeline on the standby, but under a different name: that file uses as its prefix the new timeline the standby was promoted on, and it contains data of the new timeline as well. Still, having the partial segment is useful when recovering along the master's old timeline.

Note as well that the pre-9.5 behavior can cause conflicts, particularly when a master and its standby(s) point to the same archive location: the master will try to archive the complete segment once it is done with it, while the standby will archive a partial one with exactly the same name. Advanced users are normally (hopefully) using archiving scripts more sophisticated than a single copy command, so they may have internal conflict handling that saves both files and records which node each segment came from; still, it is an annoyance not to be able to tell the difference on the server side.

Using a pair of nodes, like one master (listening to port 5432) and one standby (listening to port 5433) streaming from the first one, and both of them having the same archive_command on the same server, here is actually how things happen. First let's archive a couple of files on the master:

=# SELECT pg_is_in_recovery();
(1 row)
=# SHOW archive_command;
 cp -i %p /path/to/archive/%f
(1 row)
=# CREATE table aa AS SELECT generate_series(1,1000000);
SELECT 1000000
=# SELECT pg_current_xlog_location();
(1 row)
=# SELECT last_archived_wal FROM pg_stat_archiver;
(1 row)

After standby promotion, the last, partial and final WAL segment of the old timeline is archived by the standby with the suffix ".partial":

$ pg_ctl promote -D /to/standby/pgdata/
server promoting
$ psql -At -p 5432 -c 'SELECT pg_switch_xlog()'
$ ls /path/to/archive/

The partial file, archived by the promoted standby, as well as the completed segment, archived by the master, are both present in the WAL archive path.

Finally, note that the server cannot use a file suffixed with ".partial" during recovery, so a manual operation is necessary: to use such a segment when recovering a node, rename it to strip the ".partial" suffix.

Vilis Lacis: 3 Must-Read Books About Database Performance

From Planet PostgreSQL. Published on May 28, 2015.

SQL Performance Explained: Everything Developers Need to Know about SQL Performance

An in-depth book on how to improve database performance. The focus is on relational databases and it covers all major SQL databases without getting lost in the details of any one specific product. Starting with the basics of indexing and the WHERE clause, SQL Performance Explained guides developers through all parts of an SQL statement and explains the pitfalls of object-relational mapping (ORM) tools like Hibernate.

PostgreSQL 9.0 High Performance

An excellent book for intermediate to advanced PostgreSQL database administrators (DBAs). It teaches everything about building, monitoring, and maintaining a PostgreSQL installation, while also providing useful and interesting information about database internals. If you truly want to understand how PostgreSQL operates and behaves under a high load, this is the book for you.

High Performance MySQL: Optimization, Backups, and Replication

Advanced techniques for everything from designing schemas, indexes, and queries to tuning your MySQL server, operating system, and hardware to their fullest potential. This guide also teaches you safe and practical ways to scale applications through replication, load balancing, high availability, and failover.

Josh Berkus: Determining your danger of multixact wraparound corruption

From Planet PostgreSQL. Published on May 27, 2015.

The PostgreSQL Project released an update on May 22, and the only really important fix in that update was a patch to prevent data corruption due to "Multixact member wraparound".  However,  it turns out that there are some issues with that update release, including one with file permissions, and potentially another one with Multixact truncation.  As a result, users are feeling justifiably reluctant to apply our last update release.

If there's a lesson for the project in this, it's "don't wrap an update release the same week as Feature Freeze".

Anyway, given this it would be good for users to determine more definitively if they are in actual danger of Multixact member wraparound (MMW hereafter), so that they know if they need to apply the update right away despite issues.  Here's how to do that:
  1. navigate to the PostgreSQL data directory on your server(s), e.g. "/var/lib/postgresql/9.3/main"
  2. switch to the subdirectory "pg_multixact/members" 
  3. count the number of files in this directory, e.g. "ls -l | wc -l"
If you have a few dozen, hundred, or even a few thousand multixact member files, you're fine.  If you have over 10,000, and that number increases rapidly, then you are potentially in danger of MMW, and should apply the update sooner rather than later. 
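Alternatively, if shell access to the data directory is inconvenient, the same count can be taken from SQL as a superuser with pg_ls_dir (the relative path is resolved against the data directory):

```sql
SELECT count(*) AS member_files
  FROM pg_ls_dir('pg_multixact/members');
```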

Of course, there are other fixes in this latest update, and if one of them specifically affects you, you may have updated already.

Thank you to Thomas Munro for pointing out the simple way to determine this and giving me background material on the fixes.

Note: the above is my personal advice, and has not been approved by the PostgreSQL project, core team, or PostgreSQL Experts Inc. The PostgreSQL project generally advises applying all updates promptly.

My First Django Girls Event

By pydanny's blog from Django community aggregator: Community blog posts. Published on May 27, 2015.

Since the first Django Girls event I've watched the movement grow with a sense of awe and inevitability. There is something about it that is both contagious and powerful, in a very good way. This past weekend I had my first chance to attend one of their events, in Ensenada, Mexico.

This is what we saw. A room full of attendees with laser focus. The coaches were clearly inspired by the dedication of the women who had come to learn and grow.


A photo posted by Daniel Greenfeld (@pydanny) on

By the end of the day, the energy hadn't dwindled, it had accelerated.

Saying goodbye to #djangogirls Ensenada. Everyone stayed until the very end.


No one wanted the day to end.

#djangogirls Ensenada attendees so dedicated they stayed after the event finished! :-)


We did our small part. We coached and did our best to give an inspirational talk.

The Django Girls Ensenada organizers did a magnificent job of assembling a wonderful day that touched lives. We were very impressed!

The event had two sponsors, Hala Ken and the US Consulate in Tijuana, Mexico. They did more than just sponsor, they were part of the effort of organizing and running the event.

Cakti at CRS ICT4D 2015

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on May 27, 2015.

This is Caktus’ first year taking part in Catholic Relief Services’ (CRS) Information and Communication Technologies for Development (ICT4D) conference. The theme of this year’s conference is increasing the impact of aid and development tools through innovation. We’re especially looking forward to all of the speakers from organizations like the International Rescue Committee, USAID, World Vision, and the American Red Cross. In fact, the offerings are so vast that we thought we would provide a little cheat sheet to help you find Cakti throughout this year’s conference.

Wednesday, May 27th

How SMS Powers Democracy in Libya: Vinod Kurup will explain how Caktus used RapidSMS, a Django-based SMS framework, to build the world’s first SMS voter registration system in Libya.

Commodity Tracking System (CTS): Tracking Distribution of Commodities: Jack Byrne, the International Rescue Committee’s (IRC) Syria Response Director, will present on the Caktus-built system IRC uses to track humanitarian aid for Syrian refugees.

Friday, May 29th

Before the Pilot: Planning for Scale: Caktus’ CTO Colin Copeland will be part of a panel discussion on which technology concepts matter most at the start of a project and the various challenges of pilot programs. Also on the panel will be Jake Watson of IRC and Jeff Wishnie of Mercy Corps. Hao Nguyen, Caktus’ Strategy Director, will moderate.

Leveraging the Open Source Community for Truly Sustainable ICT4D: CEO Tobias McNulty will provide his insider’s perspective on the open source community and how best to use that community in the development of ICT4D tools and solutions.

Wednesday, Thursday, and Friday

Throughout the conference you can stop by the Caktus booth to read more about our ICT4D projects and services, meet Cakti, or play one of the mini-games from our Epic Allies app.

Not attending the conference? You can follow @caktusgroup and #ICT4D2015 for live updates!

PyPy.js: What? How? Why? by Ryan Kelly (PyCon 2015 Must-See Talk: 5/6)

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on May 26, 2015.

Part five of six in our PyCon 2015 Must-See Series, a weekly highlight of talks our staff enjoyed at PyCon.

From Ryan Kelly's talk I learned that it is actually possible, today, to run Python in a web browser (not something that interprets Python-like syntax and translates it into JavaScript, but an actual Python interpreter!). PyPy.js combines two technologies, PyPy (the Python interpreter written in Python) and Emscripten (an LLVM-to-JavaScript converter, typically used for getting games running in the browser), to run PyPy in the browser. This talk is a must-see for anyone who's longed before to write client-side Python instead of JavaScript for a web app. While realistically being able to do this in production may still be a ways off, at least in part due to the multiple megabytes of JavaScript one needs to download to get it working, I enjoyed the view Ryan's talk provided into the internals of this project. PyPy itself is always fascinating, and this talk made it even more so.

Raghavendra Rao: Compiling pg_repack extension on binary format of PostgreSQL installation

From Planet PostgreSQL. Published on May 26, 2015.

This post is about compiling the pg_repack extension against a binary-format PostgreSQL installation. As most of you know, pg_repack is one of the better-known PostgreSQL extensions; it is especially used for reclaiming space [bloat] ONLINE, without holding an EXCLUSIVE LOCK on tables or indexes.

To enable the pg_repack extension in a PostgreSQL database, it must be compiled from source. That's quite easy and simple on any installed variant (source, RPM, binary) of PostgreSQL; however, it's slightly different with the binary format of PostgreSQL [the One Click Installer], since that is a pre-built binary bundle with its dependency libraries. Let's compile it and see.

On a CentOS 7 virtual machine, I have installed the binary format of PostgreSQL 9.4 (download link), with its home directory at "/opt/PostgreSQL/9.4/". Next we need to download the pg_repack source from its official site.
[root@localhost ~]# git clone
Before compiling, the pg_config of PostgreSQL 9.4 should be set in the PATH.
[root@localhost pg_repack]# export PATH=/opt/PostgreSQL/9.4/bin:$PATH
[root@localhost pg_repack]# type pg_config
pg_config is /opt/PostgreSQL/9.4/bin/pg_config
Now we are ready to execute the usual source installation commands, "make" and "make install". Let's execute "make":
[root@localhost ~]# cd pg_repack/
[root@localhost pg_repack]# make
make[1]: Entering directory `/root/pg_repack/bin'
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -O2 pg_repack.o pgut/pgut.o pgut/pgut-fe.o -L/opt/PostgreSQL/9.4/lib -lpq -L/opt/PostgreSQL/9.4/lib -L/opt/local/Current/lib -Wl,--as-needed -Wl,-rpath,'/opt/PostgreSQL/9.4/lib',--enable-new-dtags -lpgcommon -lpgport -lssl -lcrypto -lgssapi_krb5 -lz -ledit -lrt -lcrypt -ldl -lm -o pg_repack
/usr/bin/ld: cannot find -ledit
collect2: ld returned 1 exit status
make[1]: *** [pg_repack] Error 1
make[1]: Leaving directory `/root/pg_repack/bin'
make: *** [all] Error 2
Oops 1...!! It seems there's an error: the -ledit (libedit) library is missing from the PostgreSQL lib directory. Let's run the same "ld -ledit" command that "make" tried and failed with, in verbose mode (-verbose), for more information.
[root@localhost pg_repack]# ld -ledit -verbose
GNU ld version 20100205
attempt to open /usr/x86_64-redhat-linux/lib64/ failed
attempt to open /usr/x86_64-redhat-linux/lib64/libedit.a failed
attempt to open /usr/local/lib64/ failed
attempt to open /usr/local/lib64/libedit.a failed
attempt to open /lib64/ failed
attempt to open /lib64/libedit.a failed
attempt to open /usr/lib64/ failed
attempt to open /usr/lib64/libedit.a failed
attempt to open /usr/x86_64-redhat-linux/lib/ failed
attempt to open /usr/x86_64-redhat-linux/lib/libedit.a failed
attempt to open /usr/lib64/ failed
attempt to open /usr/lib64/libedit.a failed
attempt to open /usr/local/lib/ failed
attempt to open /usr/local/lib/libedit.a failed
attempt to open /lib/ failed
attempt to open /lib/libedit.a failed
attempt to open /usr/lib/ failed
attempt to open /usr/lib/libedit.a failed
ld: cannot find -ledit
Okay, now it's clear that it's looking for the library in the PostgreSQL lib directory [/opt/PostgreSQL/9.4/lib]. Let's check for the library in that directory.
[root@localhost pg_repack]# cd /opt/PostgreSQL/9.4/lib
[root@localhost lib]# ls -l libedit*
-rwxr-xr-x. 1 root daemon 254702 Mar 22 23:32
Ah, we have the versioned copy of the libedit shared library, but not the plain ".so" name required by the "make" command. Creating a symbolic link should be a quick fix.
[root@localhost lib]# ln -s
[root@localhost lib]# ls -l libedit*
lrwxrwxrwx. 1 root root 12 May 19 22:25 ->
-rwxr-xr-x. 1 root daemon 254702 Mar 22 23:32
Re-run the "make" command.
[root@localhost pg_repack]# make
make[1]: Entering directory `/root/pg_repack/bin'
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -O2 pg_repack.o pgut/pgut.o pgut/pgut-fe.o -L/opt/PostgreSQL/9.4/lib -lpq -L/opt/PostgreSQL/9.4/lib -L/opt/local/Current/lib -Wl,--as-needed -Wl,-rpath,'/opt/PostgreSQL/9.4/lib',--enable-new-dtags -lpgcommon -lpgport -lssl -lcrypto -lgssapi_krb5 -lz -ledit -lrt -lcrypt -ldl -lm -o pg_repack
/usr/bin/ld: warning:, needed by /opt/PostgreSQL/9.4/lib/, may conflict with
/lib64/ undefined reference to `ber_sockbuf_io_udp'
collect2: ld returned 1 exit status
make[1]: *** [pg_repack] Error 1
make[1]: Leaving directory `/root/pg_repack/bin'
make: *** [all] Error 2
Oops 2...!!! Interesting: it got past the -ledit error and is now reporting on the libldap library [ ]. I'm not sure why it's looking in the /lib64 directory when my pg_config points to /opt/PostgreSQL/9.4/lib. Let's check what we have in both locations.
[root@localhost pg_repack]# ls -l /lib64/libldap*
lrwxrwxrwx. 1 root root 21 Jan 6 22:05 ->
-rwxr-xr-x. 1 root root 329696 Oct 15 2014
lrwxrwxrwx. 1 root root 23 May 19 06:43 ->
-rwxr-xr-x. 1 root root 351920 Oct 15 2014

[root@localhost pg_repack]# ls -l /opt/PostgreSQL/9.4/lib/libldap*
-rwxr-xr-x. 1 root daemon 404761 Mar 22 23:32 /opt/PostgreSQL/9.4/lib/
-rwxr-xr-x. 1 root daemon 442657 Mar 22 23:32 /opt/PostgreSQL/9.4/lib/
It seems there are two copies of "": one in the form of a symbolic link and another as a hard copy. I am guessing the conflict is due to the multiple library copies, so let's remove the symbolic link, retain the hard copy of the library, and try again.
[root@localhost lib64]# unlink

[root@localhost pg_repack]# make
make[1]: Entering directory `/root/pg_repack/bin'
sed 's,REPACK_VERSION,1.3.1,g' > pg_repack--1.3.1.sql;
sed 's,REPACK_VERSION,1.3.1,g' > pg_repack.control
make[1]: Leaving directory `/root/pg_repack/lib'
make[1]: Entering directory `/root/pg_repack/regress'
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/root/pg_repack/regress'
Wow... finally, it has compiled. Run "make install" to install the pg_repack binaries and libraries.
[root@localhost pg_repack]# make install
make[1]: Entering directory `/root/pg_repack/bin'
/bin/mkdir -p '/opt/PostgreSQL/9.4/bin'
/usr/bin/install -c pg_repack '/opt/PostgreSQL/9.4/bin'
make[1]: Leaving directory `/root/pg_repack/bin'
make[1]: Entering directory `/root/pg_repack/lib'
/bin/mkdir -p '/opt/PostgreSQL/9.4/lib/postgresql'
/bin/mkdir -p '/opt/PostgreSQL/9.4/share/postgresql/extension'
/bin/mkdir -p '/opt/PostgreSQL/9.4/share/postgresql/extension'
/usr/bin/install -c -m 755 '/opt/PostgreSQL/9.4/lib/postgresql/'
/usr/bin/install -c -m 644 pg_repack.control '/opt/PostgreSQL/9.4/share/postgresql/extension/'
/usr/bin/install -c -m 644 pg_repack--1.3.1.sql pg_repack.control '/opt/PostgreSQL/9.4/share/postgresql/extension/'
make[1]: Leaving directory `/root/pg_repack/lib'
make[1]: Entering directory `/root/pg_repack/regress'
make[1]: Nothing to be done for `install'.
make[1]: Leaving directory `/root/pg_repack/regress'
After compilation, the pg_repack utility is placed in the $PGHOME/bin directory and its library in $PGHOME/lib/postgresql/.
[root@localhost pg_repack]# ls -l /opt/PostgreSQL/9.4/bin/pg_rep*
-rwxr-xr-x. 1 root root 84030 May 20 00:07 /opt/PostgreSQL/9.4/bin/pg_repack

[root@localhost postgresql]# ls -l /opt/PostgreSQL/9.4/lib/postgresql/pg_rep*
-rwxr-xr-x. 1 root root 31028 May 20 00:07 /opt/PostgreSQL/9.4/lib/postgresql/
Now we are set to create the pg_repack extension inside the database.
-bash-4.1$ psql
psql.bin (9.4.1)
Type "help" for help.

postgres=# select * from pg_available_extensions where name='pg_repack';
   name    | default_version | installed_version |                           comment
-----------+-----------------+-------------------+--------------------------------------------------------------
 pg_repack | 1.3.1           | 1.3.1             | Reorganize tables in PostgreSQL databases with minimal locks
(1 row)

postgres=# create extension pg_repack;
Likewise, I attempted to compile against the EnterpriseDB product Postgres Plus Advanced Server 9.4 [PPAS], a pre-built binary package. I faced similar library issues, hence I used the linker option "LDFLAGS" to point to the library directory and compiled.
[root@localhost ~]# export PATH=/opt/PostgresPlus/9.4AS/bin:$PATH
[root@localhost ~]# export LDFLAGS=-L/opt/PostgresPlus/9.4AS/lib
[root@localhost ~]# cd pg_repack/
[root@localhost pg_repack]# make
[root@localhost pg_repack]# make install

[root@localhost pg_repack]# ls -l /opt/PostgresPlus/9.4AS/bin/pg_rep*
-rwxr-xr-x. 1 root root 201877 May 15 11:06 /opt/PostgresPlus/9.4AS/bin/pg_repack
[root@localhost pg_repack]# ls -l /opt/PostgresPlus/9.4AS/lib/pg_rep*
-rwxr-xr-x. 1 root root 94516 May 15 11:06 /opt/PostgresPlus/9.4AS/lib/
Cool, this too compiled smoothly. Now create the extension in PPAS 9.4.
-bash-4.1$ psql
psql.bin (
Type "help" for help.

edb=# create extension pg_repack;

Thank you.


My First Django Girls Event

By pydanny's blog from Django community aggregator: Community blog posts. Published on May 25, 2015.

Since the first Django Girls event I've watched the movement grow with a sense of awe and inevitability. There is something about it that is both contagious and powerful, and in a very good way. This past weekend I had my first chance to attend one of their events in Ensenada, Mexico.

This is what we saw. A room full of attendees with laser focus. The coaches were clearly inspired by the dedication of the women who had come to learn and grow.


A photo posted by Daniel Greenfeld (@pydanny)

By the end of the day, the energy hadn't dwindled, it had accelerated.

Saying goodbye to #djangogirls Ensenada. Everyone stayed until the very end.

A photo posted by Daniel Greenfeld (@pydanny)

No one wanted the day to end.

#djangogirls Ensenada attendees so dedicated they stayed after the event finished! :-)

A photo posted by Daniel Greenfeld (@pydanny)

We did our small part. We coached and did our best to give an inspirational talk.

Django interview questions ...

By Between engineering and real life from Django community aggregator: Community blog posts. Published on May 25, 2015.

... and some answers

Well, I haven't conducted any interviews recently, but this one has been lying in my drafts for quite a while, so it is time to take it out of the dust and finish it. As I said in Python Interview Questions and Answers, these are basic questions to establish the basic level of the candidates.

  1. Django's request response cycle
    You should be aware of the way Django handles incoming requests - the execution of the middlewares, the work of the URL dispatcher, and what the views should return. It is not necessary to know everything in the tiniest detail, but you should be generally aware of the whole picture. For reference you can check "the life of the request" slide from my Introduction to Django presentation.
  2. Middlewares - what they are and how they work
    Middlewares are one of the most important parts of Django. Not only because they are quite powerful and useful, but also because a lack of knowledge about how they work can lead to hours of debugging. From my experience the process_request and process_response hooks are the most frequently used, and those are the ones I always ask about. You should also know their execution order and when their execution starts/stops. For reference check the official docs.
  3. Do you write tests? How?
    Having well tested code benefits you, the project and the rest of the team that is going to maintain/modify your code. Django's built-in test client, Django Webtest, Factory Boy, Faker or whatever you use, I would like to hear about it and how you do it. There is not only one way/tool for it, so I would like to know what you are using and how. You should show that you understand how important it is to test your code and how to do it right.
  4. select_related and prefetch_related
    Django's ORM (as every other) can be both a blessing and a curse. Getting a foreign key property can easily lead you to executing a million queries. Using select_related and prefetch_related can come as a saviour, so it is important to know how to use them.
  5. Forms validation
    I expect candidates to know how to build forms, how to add custom validation for a single field, and how to validate fields that depend on each other.
  6. Building a REST API with Django
    The most popular choices out there are Django REST Framework and Django Tastypie. Have you used any of them, and why and how? What issues have you faced? Here is the place to show that you understand the concept of REST APIs, the correct request/response types and the serialisation of the data.
  7. Templates
    Building websites using the built-in template system is often part of your daily tasks. This means that you should understand template inheritance, how to reuse blocks and so on. Having experience with sekizai is a plus.
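The execution order in point 2 can be sketched without Django at all. Below is a framework-free toy model (the class and function names are mine, not Django's) of how process_request hooks run in declaration order while process_response hooks run in reverse:

```python
# Toy model of Django-style middleware ordering: process_request runs
# top-down through the middleware list, process_response bottom-up.

class TraceMiddleware:
    def __init__(self, name, log):
        self.name, self.log = name, log

    def process_request(self, request):
        self.log.append(f"{self.name}.request")

    def process_response(self, request, response):
        self.log.append(f"{self.name}.response")
        return response

def run(middlewares, view, request):
    for mw in middlewares:                 # declaration order
        mw.process_request(request)
    response = view(request)
    for mw in reversed(middlewares):       # reverse order
        response = mw.process_response(request, response)
    return response

log = []
mws = [TraceMiddleware("A", log), TraceMiddleware("B", log)]
result = run(mws, lambda request: "OK", {"path": "/"})
# log is now ["A.request", "B.request", "B.response", "A.response"]
```

The same symmetry is why an exception raised in one middleware's process_request can skip the request hooks below it while the response hooks above it still run.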

There are a lot of other things to ask about: internationalisation (i18n), localisation (l10n), South/migrations etc. Take a look at the docs and you will see them explained pretty well.
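Point 4 is easiest to see at the SQL level: select_related replaces the N+1 query pattern with a single JOIN. Here is an illustrative stand-alone sketch using sqlite3 (the tables and names are invented; this is not the Django API itself, just the query shapes it generates):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT,
                       author_id INTEGER REFERENCES author(id));
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book VALUES (1, 'X', 1), (2, 'Y', 2), (3, 'Z', 1);
""")

# Naive ORM-style access: one query for the books, then one more query
# per book to fetch its author -- the N+1 pattern.
books = conn.execute(
    "SELECT id, title, author_id FROM book ORDER BY id").fetchall()
naive = [(title,
          conn.execute("SELECT name FROM author WHERE id = ?",
                       (author_id,)).fetchone()[0])
         for _, title, author_id in books]

# What select_related('author') does instead: a single JOIN.
joined = conn.execute("""
    SELECT b.title, a.name
      FROM book b JOIN author a ON a.id = b.author_id
     ORDER BY b.id
""").fetchall()
# Same result, 1 query instead of N+1.
```

prefetch_related takes the other route: a second query with `WHERE id IN (...)` whose results are stitched together in Python, which is what you want for many-to-many relations.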

Giuseppe Broccolo: The WITHIN GROUP and FILTER SQL clauses of PostgreSQL 9.4

From Planet PostgreSQL. Published on May 25, 2015.

PostgreSQL 9.4 extends the SQL standard by introducing two new clauses that facilitate many operations required during the development of applications: the WITHIN GROUP and FILTER clauses.



The WITHIN GROUP clause is particularly useful when performing aggregations on ordered subsets of data.

PostgreSQL has included window functions since version 8.4, in order to work on subsets of data that can be correlated to each current record of a table, defining a sort of “aggregate” centred on any specific record as the query is gradually executed, via the SQL OVER (PARTITION BY/ORDER BY) clause and the functions that can be performed on those aggregations.

With version 9.4 of PostgreSQL, the SQL WITHIN GROUP clause was introduced: this simplifies many operations that had previously only been possible with the use of window functions, by defining aggregations of ordered subsets of data.
In addition, new functions were introduced that can be applied to these subsets and expand the collection of available window functions:

  • percentile_cont() and percentile_disc(), for the calculation of continuous or discrete percentiles;
  • mode(), a statistical function that calculates the mode on ordered subsets;
  • rank(), dense_rank(), percent_rank(), cume_dist(): window functions already available in PostgreSQL to be executed on the subsets obtained using the OVER (PARTITION BY/ORDER BY) clause and now able to take as a parameter ordered subsets produced by the WITHIN GROUP clause.

To better clarify the situation, let’s say, for example, that we want to calculate the 25th, the 50th, the 75th and the 100th percentile of the first 20 integers. Until now, this was only possible by dividing the numbers into 4 sets via the OVER (PARTITION BY/ORDER BY) clause, ordering them internally into 4 ordered subsets from which to then take the maximum value – for example, by using a CTE:

$ CREATE TABLE t AS SELECT generate_series(1,20) AS val;

$ WITH subset AS (
    SELECT val,
       ntile(4) OVER (ORDER BY val) AS tile
    FROM t
  )
  SELECT max(val)
  FROM subset GROUP BY tile ORDER BY tile;

 (4 rows)

With PostgreSQL 9.4 everything is reduced to a single SQL command, resulting in significant advantages in terms of readability of the scripts and execution of the commands:

$ CREATE TABLE t AS SELECT generate_series(1,20) AS val;

$ SELECT unnest(percentile_disc(array[0.25,0.5,0.75,1])
         WITHIN GROUP (ORDER BY val))
  FROM t;

 (4 rows)

The FILTER clause

This second clause of SQL commands is useful when you want to apply filters on subsets of data without necessarily performing aggregations.
For example, it is now possible to perform a total count of the records of a table and also a partial count of one of its subsets that satisfies a certain condition (expressed by the WHERE clause) within a single query, without having to use further queries performed on aggregations:

$ SELECT count(*) count_all,
         count(*) FILTER(WHERE bid=1) count_1,
         count(*) FILTER(WHERE bid=2) count_2
  FROM pgbench_history;

 count_all | count_1 | count_2
-----------+---------+---------
      7914 |     758 |     784
(1 row)

In this particular case, this also simplifies the readability of scripts and improves execution performance.


The extension of the SQL standard through the introduction of these new clauses further facilitates the work of developers, who are increasingly able to delegate manipulation and aggregation of subsets of data to the database.
By using the WITHIN GROUP clause, the management of subsets of data that can be ordered becomes easier through the introduction of new window functions.
The FILTER clause helps to manage subsets of data that meet certain conditions, thereby avoiding aggregations.

Last updated 2015-05-21 10:51:41 CEST

Squashing and optimizing migrations in Django

By Piotr Maliński from Django community aggregator: Community blog posts. Published on May 24, 2015.

With Django 1.7 we got built-in migrations and a management command to squash a set of existing migrations into one optimized migration - for faster test database building and to remove some legacy code/history. Squashing works, but it still has some rough edges and requires some manual work to get the best out of a squashed migration. Here are a few tips for squashing and optimizing squashed migrations.

Hubert 'depesz' Lubaczewski: Waiting for 9.5 – Support GROUPING SETS, CUBE and ROLLUP.

From Planet PostgreSQL. Published on May 23, 2015.

On 16th of May, Andres Freund committed patch: Support GROUPING SETS, CUBE and ROLLUP.   This SQL standard functionality allows to aggregate data by different GROUP BY clauses at once. Each grouping set returns rows with columns grouped by in other sets set to NULL.   This could previously be achieved by doing each grouping […]

Leo Hsu and Regina Obe: PostGIS 2.2 leveraging power of PostgreSQL 9.5

From Planet PostgreSQL. Published on May 23, 2015.

Things are shaping up nicely in PostGIS 2.2 development. We are going to hit feature freeze around June 30th 2015, and plan to ship late August or early September to be in line with the PostgreSQL 9.5 release. So far we have committed a couple of neat features, most itemized in PostGIS 2.2 New Functions. Many of the really sought after ones will require PostgreSQL 9.5 and GEOS 3.5. The geography measurement enhancements will require Proj 4.9.0+. Here are the things I'd like to highlight, and later dedicate full-length articles to in our BostonGIS Waiting for PostGIS 2.2 series once they've been stress tested.

Continue reading "PostGIS 2.2 leveraging power of PostgreSQL 9.5"

Hubert 'depesz' Lubaczewski: Waiting for 9.5 – TABLESAMPLE, SQL Standard and extensible

From Planet PostgreSQL. Published on May 22, 2015.

On 15th of May, Simon Riggs committed patch: TABLESAMPLE, SQL Standard and extensible   Add a TABLESAMPLE clause to SELECT statements that allows user to specify random BERNOULLI sampling or block level SYSTEM sampling. Implementation allows for extensible sampling functions to be written, using a standard API. Basic version follows SQLStandard exactly. Usable concrete use […]

Function caching decorator [reprise]

From Django community aggregator: Community blog posts. Published on May 22, 2015.

Sudden inspiration to write a blog.

A long time ago, I wrote a post about a decorator that could somehow cache an expensive function. There were some ideas in the comments, but I never really followed up on the idea I spawned in that particular post. And I never really gave it more thought or research either.

Today I was at the PyGrunn conference. Tom Levine, author of Vlermv, held a presentation about that package. He mentioned there that he wrote a decorator called `cache` (ctrl+f on the page I linked) that he uses for exactly the same goal as in my original post. He also noted that `cache` would probably be a bad name for a decorator like that.

At the end of the presentation there was some time for questions. A lot of people gave Tom tips on packages he could look into, and one helpful attendee called out Python's `memoised` decorator. I noted that one of the commenters on my original post also named memoize, but that commenter linked to a decorator inside a Plone package. I searched a bit on the internet today, and there is a class inside the PythonDecoratorLibrary that does exactly what I initially wanted.

So here's the link to that page.
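A minimal memoizing decorator along those lines (my own sketch, not the PythonDecoratorLibrary class itself; it only handles hashable positional arguments) can be this short:

```python
import functools

def memoize(func):
    """Cache results of an expensive function, keyed by its arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        # Hashable positional args only -- kwargs are left out of this sketch.
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper

calls = []

@memoize
def expensive(n):
    calls.append(n)   # record how often the real function body runs
    return n * n

expensive(4)
expensive(4)
# expensive's body ran only once: calls == [4]
```

In Python 3.2+ the standard library already ships this as functools.lru_cache, which also adds a size bound and cache statistics.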

Shaun M. Thomas: PG Phriday: 10 Ways to Ruin Performance: Forgetting it EXISTS

From Planet PostgreSQL. Published on May 22, 2015.

For the second of my ten part series on hidden PGDB (PostgreSQL) performance killers, I’m going to talk about something called an anti-join. It’s not a well-known approach outside of the database world, but due to how it works, it can impart almost magical plan revisions that drastically improve query performance in the right scenario. Developers can add it to a growing bag of tricks when working on database-driven content, since it comes in handy more often than you might expect.

Let’s build a test-case, shall we?

CREATE TABLE sys_product
(
    product_id   SERIAL  PRIMARY KEY,
    prod_name    TEXT    NOT NULL,
    quality      INT     NOT NULL,
    descr        TEXT    NOT NULL DEFAULT now()
);
INSERT INTO sys_product (prod_name, quality, descr)
SELECT 'Product ' || id,
       log((id % 100) + 1)::INT,
       'It does stuff.'
  FROM generate_series(1, 100000) a(id);
CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);
INSERT INTO sys_order (product_id, item_count, order_dt, valid_dt)
SELECT (id % 100000) + 1, (id % 100) + 1,
       now() - (id % 1000 || 'd')::INTERVAL,
       CASE WHEN id % 499 = 0
            THEN NULL
            ELSE now() - (id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 1000000) a(id);
ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id
      PRIMARY KEY (order_id);
CREATE INDEX idx_order_product_id
    ON sys_order (product_id);
CREATE INDEX idx_order_valid_dt
    ON sys_order (valid_dt);
ANALYZE sys_product;
ANALYZE sys_order;

This is a very basic product and order table structure, and we’ve used it before in the last installment of this series. The only columns we’ve added since last time are the quality column in sys_product and the valid_dt column in sys_order. This way, we can introduce some variability into the query plan due to data correlation effects. We’ve tried to distribute the data such that new orders have not been validated, and there are still some non-validated orders in the past. This distribution is more reliable than using random().

Now let’s build a query that retrieves all orders on products with a quality of two using a simple JOIN:

SELECT o.*
  FROM sys_order o
  JOIN sys_product p USING (product_id)
 WHERE o.valid_dt IS NULL
   AND p.quality = 2;
                             QUERY PLAN                             
 Hash Join  (cost=2985.66..7406.44 rows=1303 width=28)
            (actual time=80.438..90.406 rows=1383 loops=1)
   Hash Cond: (o.product_id = p.product_id)
   ->  Bitmap Heap Scan on sys_order o
              (cost=39.41..4423.00 rows=1933 width=28)
              (actual time=0.600..6.879 rows=2004 loops=1)
         Recheck Cond: (valid_dt IS NULL)
         Heap Blocks: exact=2004
         ->  Bitmap Index Scan on idx_order_valid_dt
                    (cost=0.00..38.92 rows=1933 width=0)
                    (actual time=0.327..0.327 rows=2004 loops=1)
               Index Cond: (valid_dt IS NULL)
   ->  Hash  (cost=2084.00..2084.00 rows=68980 width=4)
             (actual time=79.773..79.773 rows=69000 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 2426kB
         ->  Seq Scan on sys_product p
                 (cost=0.00..2084.00 rows=68980 width=4)
                 (actual time=0.013..41.150 rows=69000 loops=1)
               Filter: (quality = 2)
               Rows Removed by Filter: 31000
 Planning time: 0.732 ms
 Execution time: 91.875 ms

We can actually infer quite a bit from this execution plan.

  • The sys_order table was checked with the index on valid_dt as desired.
  • The sys_product table was scanned in its entirety.
  • The planner estimates all match up almost exactly.
  • Final result was obtained by hashing the two row sets together.

From this, we could say that this is the ideal query plan. The index we wanted was used, and it reduced one million rows to about 2000. Since we don’t have an index on product quality, we might as well fetch them all and filter out the wrong quality before executing the hash. Once the hash between both is computed, we get about 1300 rows as a result.

Of course, this approach makes a huge mistake. Even though the order and product tables are within an order of magnitude of each other, we’re only interested in a relatively small fraction of the order table. Thus, fetching all products at this scale is probably a losing proposition. However, since the planner doesn’t have data regarding how closely these tables are correlated, it can’t make that assumption. Indeed, is it fetching 2000 unique product IDs, or 200? Since it can’t fetch the related products at this point, it does the best it can.

But can we do better? As it turns out, we can do a whole lot better by telling the planner what we really wanted. Note that we didn’t actually use the results from the product table in the SELECT portion of the query. Further, we know that we are only interested in a maximum of about 1300 products, so why check all of them? Is there a way to force the planner to fetch the matching orders first, and then check to see which products correspond as a second step?

We already know from last time that this is what a nested loop does, and one way to encourage the planner to pick a nested loop is by using EXISTS to break apart our JOIN condition. This accomplishes two things:

  1. Don’t join against a table to fetch columns that will never be returned.
  2. Favor some kind of looping construct to avoid excessive record comparisons.

This is also what is known as an anti-join, and PGDB performs very well when guided in that direction. Here’s what our query and plan look like using this concept:

SELECT o.*
  FROM sys_order o
 WHERE o.valid_dt IS NULL
   AND EXISTS (
         SELECT 1
           FROM sys_product p
          WHERE p.product_id = o.product_id
            AND p.quality = 2
          LIMIT 1
       );
                             QUERY PLAN                             
 Bitmap Heap Scan on sys_order o
             (cost=39.16..20490.82 rows=967 width=28)
             (actual time=0.565..11.462 rows=1383 loops=1)
   Recheck Cond: (valid_dt IS NULL)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 621
   Heap Blocks: exact=2004
   ->  Bitmap Index Scan on idx_order_valid_dt
              (cost=0.00..38.92 rows=1933 width=0)
              (actual time=0.269..0.269 rows=2004 loops=1)
         Index Cond: (valid_dt IS NULL)
   SubPlan 1
     ->  Limit  (cost=0.29..8.31 rows=1 width=0)
                (actual time=0.003..0.003 rows=1 loops=2004)
           ->  Index Scan using sys_product_pkey on sys_product p
                     (cost=0.29..8.31 rows=1 width=0)
                     (actual time=0.002..0.002 rows=1 loops=2004)
                 Index Cond: (product_id = o.product_id)
                 Filter: (quality = 2)
                 Rows Removed by Filter: 0
 Planning time: 0.126 ms
 Execution time: 12.154 ms

This time, something drastically different happened:

  • The executor took our matching rows from the valid_dt index and stashed them in the memory heap.
  • Using those same rows, it executed about 2000 index lookups into the product table.
  • The results are produced almost 8x faster.

Before anyone takes this result and starts rewriting all queries everywhere to utilize anti-joins, keep in mind that they’re extremely situational. The example here was built based on an actual application query at a company I used to work with. In their case, execution went from 2500ms to 280ms following the rewrite. But in building this test-case to simulate their data, I ended up creating a dozen data distributions that were slightly slower using this approach.

In addition, take a look at the LIMIT 1 section of the subquery. We’re explicitly telling the planner that only one product can match per seek, which is what the word EXISTS should imply. Yet if we remove it, the execution plan actually reverts to the same approach used by the more naive JOIN. Ouch!

But that’s OK because queries are, by their nature, more of a linguistic construct than a computational one. Having a larger vocabulary can help convey more meaning and sometimes produce different results, but our audience is perfectly within their rights to summarize or ignore irrelevant prattle. The goal in this case is adding to our lexicon, and trying to frame applicable scenarios in a different way in hopes that the planner will catch on.

For data distributions that apply, anti-joins are a very powerful technique. If nothing else, they’re worth a try when other approaches fall flat. The only way to know for sure is experimentation! They’re just a tool, and like any tool, should only be used at opportune instances.

So remember the anti-join—it’s a great tool for that rare hex-socket screw you occasionally stumble across.

Marco Slot: Parallel data loading for pg_shard

From Planet PostgreSQL. Published on May 22, 2015.

The pg_shard extension helps you scale out PostgreSQL for large data sets by transparently distributing tables across many nodes, storing (replicated) shards of the data as regular PostgreSQL tables on worker nodes. With pg_shard, you can perform INSERT, UPDATE, DELETE, SELECT on a distributed table as if you were using a regular table.

While first-class COPY support is still on the horizon, we've already added a copy_to_distributed_table tool for loading data files, which supports the same formats as COPY. When using this tool, pg_shard creates a temporary 'insert proxy' table which has a trigger that INSERTs into the distributed table. The copy_to_distributed_table script copies a file into the insert proxy.

Below we give an example of loading customer reviews data using the example from the pg_shard github page and the copy_to_distributed_table script that comes with pg_shard.

copy_to_distributed_table -C customer_reviews_1998.csv customer_reviews

Some users have noticed that pg_shard performs poorly when performing many INSERTs over a single connection and the same applies to copy_to_distributed_table. The reason is simple: pg_shard has to wait for a network round-trip on every INSERT. Since there is no notion of parallelism within a single PostgreSQL connection, consecutive INSERTs on the same connection will have to wait. However, pg_shard can still handle more INSERTs on other connections while waiting.
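The effect of that round-trip waiting is easy to simulate. The toy model below involves no PostgreSQL or pg_shard at all (the 5 ms latency and the worker count are invented numbers): each "INSERT" is just a sleep standing in for one network round-trip, which is enough to show why spreading the same work over many connections is dramatically faster.

```python
import time
from concurrent.futures import ThreadPoolExecutor

ROUND_TRIP = 0.005  # pretend each INSERT waits on one 5 ms network round-trip

def insert(row):
    time.sleep(ROUND_TRIP)  # stand-in for the round-trip to a worker node

rows = list(range(200))

start = time.monotonic()
for row in rows:            # a single connection serializes every round-trip
    insert(row)
serial = time.monotonic() - start

start = time.monotonic()
with ThreadPoolExecutor(max_workers=16) as pool:  # 16 "connections"
    list(pool.map(insert, rows))
parallel = time.monotonic() - start

# Latency, not CPU, is the bottleneck here, so 16 concurrent "connections"
# finish roughly 16x sooner than one.
```

This is exactly the reasoning behind splitting the input file and running many copy_to_distributed_table processes with xargs -P, as shown in the commands that follow.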

When loading a large data file, it's therefore recommended to split the file into multiple chunks first and use xargs -P 64 to load the chunks concurrently. The commands below have the same effect as the command above, but run much faster.

mkdir chunks
split -n l/64 customer_reviews_1998.csv chunks/
find chunks/ -type f | xargs -n 1 -P 64 sh -c 'echo $0 `copy_to_distributed_table -C $0 customer_reviews`'

On a c3.8xlarge EC2 instance, using parallel data loading for pg_shard is roughly 40x faster and inserts around 50,000 rows per second in the above example, but there are many ways of scaling out pg_shard even further. Pg_shard can also be used in combination with CitusDB which has powerful built-in bulk loading and query parallelization features for distributed tables.

Pygrunn: ZeroMQ - Pieter Hintjens

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 21, 2015.

(One of the summaries of the 2015 Pygrunn conference)

Pieter Hintjens has quite some experience with distributed systems. Distributed systems are, to him, about making our systems look more like the real world. The real world is distributed.

Writing distributed systems is hard. You need a big stack. The reason we use http so much is that it was one of the first protocols that was pretty simple and that we could understand. Almost everything seems to be http now.

Three comments:

  • So: the costs of such a system must be low. He really likes ZeroMQ, especially because it makes it cheap.

  • We lack a lot of knowledge. The people that can do it well are few. Ideally, the community should be bigger. We have to build the culture, build the knowledge. Zeromq is one of the first bigger open source projects that succeeded.

  • Conway's law: an organization will build software that looks like itself. A centralized power-hungry organization will probably build centralized power-hungry software.

    So: if you want to write distributed systems stuff, your organization has to be distributed!

    Who has meetings in his company? They are bad bad bad. They're blocking. You have to "synchronize state" and wait for agreement. A conference like pygrunn is fine: meeting people is fine. At pygrunn, there's no state synchronization. Imagine that it were a meeting to agree on a standard editor...

In a distributed system, what you really want is participation. Open source development needs pull requests, so to say.

A question about making money from open source resulted in a rant about open source software being the only way to produce valuable software. "You might as well ask about how you can make money from a free school system". It is idiotic to ask the question. And somehow he managed to sort-of compare evil proprietary software to the pure evil of any religious education. Or something like it. ( He sounded like a very religious follower of atheism when making these comments, I don't know if that's appropriate for a tech conference. I might have mis-understood, too, of course :-) ).

Something to emulate: our food system. Nobody owns the complete food system. But it works! Lots of smaller and bigger actors. And everyone had breakfast and lunch today. The system works. This kind of distributed system is an example to emulate in our open source software.

Nice comparison when asked about successful commercial software. Gmail is a successful example, but that's something that grew pretty organically. Compare that with Google Wave or Google Plus: who even remembers them? Those were vision-driven software. Made based on money. A failure.

Pygrunn: Laurence de Jong - Towards a web framework for distributed apps

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 21, 2015.

(One of the summaries of the 2015 Pygrunn conference)

Laurence de Jong is a graduate student.

Everyone uses the internet. Many of the most-used sites are centralized. Centralization means control. It also gives scale advantages, like with gmail's great spam filter.

It also has drawbacks. If the site goes down, it is really down. Another drawback is the control they have over our data and what they do with it. If you're not paying for it, you're the product being sold. Also: eavesdropping. Centralized data makes it easy for agencies to collect the data. And: censorship!

A better way would be decentralized websites. There are existing decentralized things like Freenet, but they're a pain to install and the content on there is not the content you want to see... And part of it is stored on your harddisk...

See also Maelstrom, which distributes websites as torrents. A problem there is the non-existence of proper decentralized DNS: you have unreadable hashes.

A solution could be the blockchain system from bitcoin. It is called namecoin. This way, you could store secure DNS records to torrent hashes in a decentralized way. uses namecoin to have proper DNS addresses and to download the website via bittorrent. Not many people use it right now.

And.... the websites you download right now are all static. We want dynamic content! You can do even that with blockchains. An example is the decentralized twitter alternative . Mostly used by chinese people because twitter is mostly unavailable there.

There are problems, of course. Where do you store your data? Agencies can still do traffic analysis. How do you manage your private keys? Aren't we getting browsers wars all over again? And can your mom install it (answer: no, it is too hard).

An extra problem is more technical: distributed hash tables are considered unsafe.

And... in the end, if you use hashes for everything (like every individual tweet, email and webpage), that's a lot of hashes to store, partially locally. So it isn't the solution, but at least it is a solution.

Pygrunn: Python, WebRTC and You - Saúl Ibarra Corretgé

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 21, 2015.

(One of the summaries of the 2015 Pygrunn conference )

Saúl Ibarra Corretgé does telecom and VOIP stuff for his work, which is what webRTC calls legacy :-)

webRTC is Real-Time Communication for the web via simple APIs. So: voice calling, video chat, P2P file sharing without needing internal or external plugins.

Basically it is a big pile of C++ that sits in your browser. Some people say that webRTC stands for Well, Everybody Better Restart Their Chrome, because browser support is mostly limited to chrome. There's a plugin for IE/safari, though.

There are several javascript libraries for webRTC. They help you set up a secure connection to another person (an "RTCPeerConnection"). The connection is direct, if possible. If not, due to firewalls for instance, you can use an external server. It uses ICE, which stands for Interactive Connectivity Establishment (see ICE trickle, which he apparently used): a way to set up the connection.

Once you have a connection, you have an RTCDataChannel. Which you can use, for instance, to send a file from one browser to another.

As a testcase, he wrote Call Roulette. The app is in python, but in the browser javascript is used as that is more-or-less the native way to do it. The "call roulette" app connects a user to a random other user. Users will send simple json requests to the app. Once the app finds two candidates, both get the other's data to set up a subsequent webRTC connection.

He made the toy app in python 3.3 because it was new. It has websockets, and async via asyncio "because async is modern :-)". All nice, new and shiny.

So: users connect from their browser with a websocket connection to the app. They are paired up and the webRTC connection data is sent back. Very fast.
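The pairing step described above can be sketched with asyncio: users wait in a queue, and as soon as two are present, each receives the other's connection data. The actual Call Roulette code is not reproduced here, so all names below are made up for illustration:

```python
import asyncio

async def matchmaker(queue: asyncio.Queue):
    # Pair users two at a time; each gets the other's connection data.
    while True:
        a = await queue.get()
        b = await queue.get()
        a["peer"].set_result(b["data"])
        b["peer"].set_result(a["data"])

async def join(queue: asyncio.Queue, data: dict) -> dict:
    # A user announces itself with some json-ish data and waits for a peer.
    peer = asyncio.get_running_loop().create_future()
    await queue.put({"data": data, "peer": peer})
    return await peer

async def main():
    queue = asyncio.Queue()
    asyncio.create_task(matchmaker(queue))
    return await asyncio.gather(
        join(queue, {"name": "alice"}),
        join(queue, {"name": "bob"}),
    )

alice_peer, bob_peer = asyncio.run(main())
```

After this exchange, the browsers would use the received data to set up the RTCPeerConnection directly between themselves.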

Fun: light-weight django-models-like models via a small library! Look it up.

He did a live demo with web video with someone from the audience. Worked basically like a charm.

Pygrunn: Reliable distributed task scheduling - Niels Hageman

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 21, 2015.

(One of the summaries of the 2015 Pygrunn conference)

Note: see Niels Hageman's somewhat-related talk from 2012 . Niels works at Paylogic . Wow, the room was packed.

They discovered the normal problem of operations that took too long for the regular request/response cycle. The normal solution is to use a task queue. Some requirements:

  • Support python, as most of their code is in python.
  • It has to be super-reliable. It also needs to allow running in multiple data centers (for redundancy).
  • Ideally, a low-maintenance solution as they already have enough other work.

Option 1: celery + rabbitMQ. It is widely used and relatively easy to use. But rabbitMQ was unreliable. With alarming frequency, the two queues in the two datacenters lost sync. They also got clogged from time to time.

Option 2: celery + mysql. They already use mysql, which is an advantage. But... the combination was buggy and not production-ready.

Option 3: gearman with mysql. The python bindings were buggy and unmaintained. And you could only run one gearman bundle, so multiple datacenters were out of the window.

Option 4: do it yourself. They did this and ended up with "Taskman" (which I couldn't find online, they're planning on making it open source later on: they still need to add installation documentation).

The backend? They started with mysql. It is a great relational database, but it isn't a great queue. There is a saying on the internet: Thou shalt not use thine database as a task queue. With some adjustments, like autocommit, they got it working nicely anyway.

The task server consists of a python daemon (running under supervisor) and a separate task runner. It runs in a separate process to provide isolation and resource control.

Of course, the task server needs to be integrated in the main server. The task server is written as an independent application, so how does the task finder find the python functions it needs to run? They do this via "server plugins" that define which environment variables are needed, which python path you need and which function and which version you need. All this gets applied by the task runner and subsequently it can import and run the function.
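A sketch of how such a plugin lookup could work, using a "module:function" spec string to locate the task function. The real Taskman plugin format is not public, so this spec format is an assumption:

```python
import importlib

def resolve_task(spec: str):
    """Resolve a 'module:function' spec to a callable, the way a task
    runner could import the function a server plugin points at."""
    module_name, func_name = spec.split(":")
    module = importlib.import_module(module_name)
    return getattr(module, func_name)

# Hypothetical plugin entry pointing at a stdlib function for illustration.
task = resolve_task("math:sqrt")
result = task(9)
```

In the real system, the plugin would additionally set up the environment variables and python path before the import happens, since the task runner lives in a separate process.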

Some additional features of their task runner:

  • Tasks can report progress.
  • Tasks can be aborted.
  • Task start time can be constrained.
  • There's exception handling.

Some of the properties of taskman: it is optimized for long running tasks. And: it is designed for reliability. Very necessary, as Paylogic is a payment processor.

It also means it is less suited when you have lots of little tasks. Running everything as a separate process is fine for longer-running processes, but it is too heavy-weight for lots of small tasks. Oh, and there's no admin UI yet: he uses phpmysqladmin :-)

Pygrunn: Orchestrating Python projects using CoreOS - Oscar Vilaplana

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 21, 2015.

(One of the summaries of the 2015 Pygrunn conference)

(Note: Oscar Vilaplana had a lot of info in his presentation and also a lot on his slides, so this summary is not as elaborate as what he told us. Wait for the video for the full version.)

"Orchestrating python": why? He cares about reliability. You need a static application environment, reliable deployments, easy and reliable continuous integration, and self-healing. It's nice if it is also portable.

A common way to make scalable systems is to use microservices. You compose, mix and extend them into bigger wholes. Ideally it is "cluster-first": also locally you test with a couple of instances. A "microservices architecture".

Wouldn't it be nice to take the "blue pill" and move to a different reality? One where you have small services, each running in a separate container without a care for what occurs around it? No sysadmin stuff? And similarly, the smart infrastructure people only have to deal with generic containers that can't break anything.

He did a little demo with rethinkdb and flask.

For the demo it uses coreOS: kernel + docker + etcd. CoreOS uses a read-only root filesystem and it by design doesn't have a package manager. Journald for logging (it automatically captures the stdout). Systemd for managing processes.

etcd? It is a distributed configuration store. It has a http API.

Also: "fleet", "systemd for services". It starts up the containers. It coordinates across the cluster. It will re-start containers if they die.

How do we get containers to talk to each other? They're containerized... For that there's "flannel": "dhcp for containers". Per-cluster specific subnet. Per-machine smaller subnet. The best system to run all this is Kubernetes.

Kubernetes uses "replication controllers". The basis is a "pod", from which multiple replicas are made, depending on the amount of instances you need.

He then showed a demo, including a rolling update. Nice. Similarly for a rethinkdb cluster where he increased the number of nodes halfway through the demo. Nice, too.

In development, it might be easy to use "nspawn" instead of docker. It is mostly the same, only less isolated (which is handy for development).

Pygrunn: Data acquisition with the Vlermv database - Thomas Levine

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 21, 2015.

(One of the summaries of the 2015 Pygrunn conference)

Thomas Levine wrote vlermv. A simple "kind of database" by using folders and files. Python is always a bit verbose when dealing with files, so that's why he wrote vlermv.


from vlermv import Vlermv
vlermv = Vlermv('/tmp/a-directory')

vlermv['filename'] = 'something'
# ^^^ This saves a python pickle with 'something' to /tmp/a-directory/filename

The advantage is that the results are always readable, even if you lose the original program.

You can choose a different serializer, for instance json instead of pickle.

You can also choose your own key_transformer. A key_transformer translates a key to a filename. Handy if you want to use a datetime or tuple as a key, for instance.
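The key_transformer idea can be illustrated with a plain function that maps a structured key to a flat, filesystem-safe name. This is an illustration of the concept only, not vlermv's actual implementation:

```python
import datetime

def key_to_filename(key) -> str:
    # Turn tuples and datetimes into a flat, filesystem-safe name.
    if isinstance(key, datetime.date):
        return key.isoformat()
    if isinstance(key, tuple):
        return "_".join(key_to_filename(part) for part in key)
    return str(key)

name = key_to_filename(("pygrunn", datetime.date(2015, 5, 21)))
```

A store would then read and write `name` under its base directory instead of the raw key.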

The two hard things in computer science are:

  • Cache invalidation.
  • Naming things.

Cache invalidation? Well, vlermv doesn't do cache invalidation, so that's easy. Naming things? Well, the name 'vlermv' comes from typing randomly on his (dvorak) keyboard... :-)

Testing an app that uses vlermv is easy: you can mock the entire database with a simple python dictionary.

What if vlermv is too new for you? You can use the standard library shelve module that does mostly the same, only it stores everything in one file.
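For comparison, the stdlib shelve variant looks almost the same in use, except that everything ends up in one dbm-backed file instead of a directory of readable files:

```python
import os
import shelve
import tempfile

# shelve gives the same dict-like persistence, backed by a single file.
path = os.path.join(tempfile.mkdtemp(), "a-shelf")
with shelve.open(path) as db:
    db["filename"] = "something"  # pickled and written to disk

# Values survive reopening, just like files in a vlermv directory.
with shelve.open(path) as db:
    stored = db["filename"]
```

The trade-off is exactly the one mentioned above: a single opaque file is not readable if you lose the original program.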

A drawback of vlermv: it is quite slow.

Fancy full-featured databases are fast and nice, but do you really need all those features? If not, wouldn't you be better served by a simple vlermv database? You might even use it as a replacement for mongodb! That one is used often only because it is so easy to start with and so easy to create a database. If you don't have a lot of data, vlermv might be a much better fit.

Pygrunn: IPython and MongoDB as big data scratchpads - Jens de Smit

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 21, 2015.

(One of the summaries of the 2015 Pygrunn conference )

A show of hands: about half the people in the room have used mongodb and half have used ipython notebooks. There's not a lot of overlap.

Jens de Smit works for optiver, a financial company. A "high-frequency trader", so they use a lot of data and they do a lot of calculations. They do a lot of financial transactions and they need to monitor if they made the right trades.

Trading is now almost exclusively done electronically. Waving hands and shouting on the trading floor at a stock exchange is mostly a thing of the past. Match-making between supply and demand is done centrally. This shift started some 15 years ago, and the volume of transactions really exploded. Interesting fact: the response time has gone from 300ms to just 1ms!

So... being fast is important in electronic trading. If you're slow, you trade at the wrong prices. Trading at the wrong prices means losing money. So speed is important, as is making the right choices.

What he had to do is to figure out how fast an order was made and whether it was a good order. Non-intrusively. So: what market event did we react to? What was the automatic trade decision (made by an algorithm)? Was it a good one? How long did it all take?

So he monitors data going in and out of their system. He couldn't change the base system, so: log files, network data and an accounting database. Most of the data is poorly indexed. And a very low signal-to-noise ratio. And of course the logfiles aren't all consistent. And documentation is bad.

Oh, and the data size is of course also too big to fit in memory :-)

He used mongodb. A schemaless json (well, bson, binary version of json) store. Great for messy data. Easy to use. Just put in a python dictionary, basically. The data is persisted to disk, but as long as you have enough RAM, it'll keep it in memory. Very fast that way. You get indexes and speedups by default.

After he managed to get everything into mongodb, he had to make sense of things. So: correlate decision logs to network data. This is easy for humans to spot, but hard for computers. Computers are good at exact matches, humans are better at inexact pattern matches.
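One hedged way to let a computer approximate that inexact matching is a similarity ratio, for instance with the stdlib difflib. This is a generic illustration of fuzzy matching, not what was actually used in the talk; the log lines below are invented:

```python
import difflib

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; exact matches score 1.0, near-matches score high.
    return difflib.SequenceMatcher(None, a, b).ratio()

# Hypothetical decision-log line vs. network-capture line: not byte-equal,
# but clearly the same event to a human.
log_line = "order sent id=123 px=10.5"
net_line = "order sent id=123 px=10.50"
score = similarity(log_line, net_line)
```

Ranking candidate pairs by such a score turns the human "these look like the same event" judgment into something a script can do at scale.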

He used ipython notebook, a nice interactive python shell with a browser interface. Including matplotlib integration for easy graphs. Syntax highlighting; you can render html inside the shell; you can save your work at the end of the day (which you can't with a regular python shell!); inline editing.

Nice: since last week, rendering such notebooks is supported by github. (I guess he means this announcement ).

Now mongodb. It is very simple to create a directory and start mongodb. If you stop mongo and delete the directory, it is gone as if it was never there. Easy. And with pymongo it is just a few lines of python code and you're set. Including a handy query language.

He showed a couple of code examples. Looked pretty handy.

Creating an index is a oneliner. If you know beforehand what kinds of queries you want to do, you can quickly create an index for it, which speeds up your queries a lot. You can make complex indexes, but in his experience, simple single-field indexes are often enough.

Something to watch out for: mongo never returns disk space to the OS. If you delete lots of objects, the OS doesn't get the space back unless you shut mongodb down and "repair" the database. What he does is simply delete the database at the end of the day!

He showed one of the outputs: a graph with response times which immediately showed that several responses were too slow. Good, useful information. One year ago he wouldn't have dreamt of being able to do this sort of analysis.

Mongo is very useful for this kind of work. You use mongodb's strengths and you aren't bothered by many of the drawbacks, like missing transactions.

Pygrunn: Leveraging procedural knowledge - K Rain Leander

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 21, 2015.

(One of the summaries of the 2015 Pygrunn conference )

K Rain Leander works at Red Hat and yes, she wore a bright red hat :-) She's a python and django newbie. She knows how it is to be a newbie: there is so much in linux that there are always areas where you're a complete newbie. So everyone is helpful there.

"Amsterdam is the capital of the netherlands" is declarative knowledge. Procedural knowledge is things like learning to ride a bike or a new language. So: What versus How. You might know declaratively how to swim, but procedurally you might still drown: you need to practice and try.

Some background: she was a dancer in the USA. Unless you're famous, you barely scrape by financially. So she started teaching herself new languages. Both real-life languages and computer languages. Css, html for starters. And she kept learning.

She got a job at Red Hat. You have to pass a RHCE certification test within 90 days of starting work there - or you're fired. She made it.

She has military background. In bootcamp, the purpose is not the pushups and the long runs. The goal is to break you down so that you jump when they say "jump".

In the Red Hat bootcamp, the goal is not making the test. The goal is to figure out if you're able to drink from the firehose. Which means if you get a support request, you say "I'll figure it out for you" and you just dive in and try to figure it out. You have to be able to dive into a whole lot of new information without panicking. That's drinking from the firehose.

She re-used existing knowledge and previous skills to learn everything. The important part was not being afraid to dive in.

She moved towards programming. Python, django. She was new to it. One of the first steps? "Set up a virtualenv and....". It can frighten you, but it is just a question of RTFM. Just read the manual. Just read it and then start doing it.

She went to a Django Girls Workshop. Django Girls does a really good job of providing material and documentation. She had some problems installing everything, but continued (and succeeded) anyway.

... and then someone challenged her to deploy it on openshift. It hasn't succeeded completely yet. But she'll persevere and get it working.

She recommends to learn python.

What's next: she'll practice, practice, practice. And she'll contribute to the community. Probably build one or two apps. And she'll be a coach at the upcoming Groningen django girls workshop ("as a coach. No, I'm not worried....")

So: re-use your existing knowledge and build from there. Don't be afraid. Just do it.

Hubert 'depesz' Lubaczewski: Waiting for 9.5 – Add pg_audit, an auditing extension

From Planet PostgreSQL. Published on May 21, 2015.

On 14th of May, Stephen Frost committed patch: Add pg_audit, an auditing extension   This extension provides detailed logging classes, ability to control logging at a per-object level, and includes fully-qualified object names for logged statements (DML and DDL) in independent fields of the log output.   Authors: Ian Barwick, Abhijit Menon-Sen, David Steele Reviews […]

Feng Tian: Comparing two tables

From Planet PostgreSQL. Published on May 21, 2015.

Say you have carefully tuned your database and wow! it is ten times faster -- but, before you push it to the production cluster, how do you know the answer of the query is correct?   Or at least, same as before?   :-)    You may have same question if you want to upgrade or migrate your database, or, for us, we want to make sure the new great join algorithm actually produce correct answer.   Well, let's just put the result before, and after, in two tables, and compare the two tables.

This is great, but turns out to be quite tricky. You need to take care of the following cases:
  •  Either table may not have a primary key
  •  Rows in the two tables may be physically stored in different orders.
  • Columns could contain nulls. 
  • The two tables could be huge, so performance must be good.
Here are some things you could do.

Method 1: You can dump the tables to text or csv files, then run diff -- but you need to take care of ordering, so you have to copy the tables out with an order by clause. And good luck diff-ing 2TB of text files.

Method 2: Use except.  How about this query?  
select * from (
    select * from ta except select * from tb 
    union all 
    select * from tb except select * from ta
) foo; 

It is totally wrong!   See attached sql script for some surprise.   And this?

select * from (
    select * from (select * from ta except select * from tb) xxx
    union all
    select * from (select * from tb except select * from ta) yyy
) foo;
Better, but it did not take care of duplicate rows -- and, checking select count(*) is not good enough.

Method 3: Join by ourselves!  Here is a script from my friend CK.

with A as (
    select hashtext(textin(record_out(ta))) as h, count(*) as c
    from ta group by h
),
B as (
    select hashtext(textin(record_out(tb))) as h, count(*) as c
    from tb group by h
)
select * from A full outer join B on (A.h = B.h and A.c = B.c)
where A.h is null or B.h is null limit 5;

The hashtext part is optional -- for very wide rows, using hashtext will save some cycles and memory.   This is great except it has a full outer join.    Can we do better?   

Method 4: Group it ourselves,
select r, sum(cnt) from (
    select textin(record_out(ta)) as r, 1 as cnt from ta
    union all
    select textin(record_out(tb)) as r, -1 as cnt from tb
) foo
group by r having sum(cnt) <> 0;

No joins!  
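The same multiset trick as method 4 can be sketched in Python with collections.Counter: count each row +1 for one table and -1 for the other, and any nonzero remainder is a difference. Like the SQL version, it handles duplicate rows and needs no join:

```python
from collections import Counter

def diff_tables(rows_a, rows_b):
    # +1 for every row in a, -1 for every row in b; rows whose counts
    # don't cancel out differ between the tables (duplicates included).
    counts = Counter(rows_a)
    counts.subtract(rows_b)
    return {row: c for row, c in counts.items() if c != 0}

ta = [("alice", 1), ("bob", 2), ("bob", 2)]  # duplicate row on purpose
tb = [("alice", 1), ("bob", 2)]
delta = diff_tables(ta, tb)
```

An empty result means the tables are equal as multisets, which is exactly the guarantee the having sum(cnt) <> 0 clause gives.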

How to control the zoom level with automatic face detection based image cropping

By Cloudinary Blog - Django from Django community aggregator: Community blog posts. Published on May 21, 2015.

Zoom level of cropped images

Many websites now offer their users the ability to upload images and profile pictures, making it a challenge for web designers to maintain a consistent graphic design and style when subsequently displaying these images. The profile pictures may need to be smartly cropped to focus on the faces, with some sites preferring close-ups of faces and others preferring to include more background when displaying images of people.

Cloudinary's automatic Face Detection feature is an extremely useful method for identifying the face(s) in uploaded images and then making it a simple matter to intelligently crop, position and resize the picture accordingly. However, the resulting image is cropped at a default zoom level, so to be able to control how much of the original image surrounding the face to keep, Cloudinary has introduced the Zoom parameter that can be added to dynamic image delivery URLs with on-the-fly cloud-based manipulation.

The zoom parameter (z for URLs) accepts a decimal value that sets the new zoom level as a multiple of the default zoom setting: a value less than 1.0 zooms out and a value greater than 1.0 zooms in. For example, z_0.5 halves the default zoom to 50% and zooms out to include more of the background around the face, while z_2.0 doubles the default zoom to 200% and zooms in to include less of the background around the face. The zoom parameter works together with either the crop or thumb cropping modes while setting the gravity parameter to focus on either a single face (g_face), multiple faces (g_faces) or custom coordinates (g_custom) manually defined by the user when uploading the image. Here are a few examples to show how to use the zoom parameter:
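As a rough illustration of how those parameters compose into a delivery URL, the following helper joins "key_value" pairs in the z_0.5 / g_face style described above. The base path uses Cloudinary's public demo cloud as a placeholder; treat the exact URL layout as an assumption:

```python
def transformation_url(public_id: str, **params) -> str:
    # Join "key_value" pairs with commas, as in z_0.7 or g_face above.
    # Sorted only to make the output deterministic for this sketch.
    parts = ",".join(f"{k}_{v}" for k, v in sorted(params.items()))
    return f"https://res.cloudinary.com/demo/image/upload/{parts}/{public_id}"

url = transformation_url("lady.jpg", w=200, h=200, c="thumb", g="face", z=0.7)
```

Changing only z in such a URL is what lets the zoom level be tuned on the fly without re-uploading or pre-generating images.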

Single face thumbnail

The following example demonstrates how the zoom parameter works together with the thumb cropping mode. This mode generates a thumbnail of an image with the exact given width and height dimensions, while identifying the face and making it the center of the image.

The original proportions are retained, but the image might be scaled to fit in the given dimensions. The zoom parameter will now determine how much to scale the face within the given width and height, with a zoom value less than 1.0 shrinking the face, and a zoom value greater than 1.0 enlarging the face within the image.

Original photo

The following URL dynamically generates a 200x200 thumbnail focusing on the face in the image shown above with the default zoom level.

cl_image_tag("lady.jpg", :width=>200, :height=>200, :crop=>:thumb, :gravity=>:face)
cl_image_tag("lady.jpg", array("width"=>200, "height"=>200, "crop"=>"thumb", "gravity"=>"face"))
CloudinaryImage("lady.jpg").image(width=200, height=200, crop="thumb", gravity="face")
cloudinary.image("lady.jpg", {width: 200, height: 200, crop: "thumb", gravity: "face"})
cloudinary.url().transformation(new Transformation().width(200).height(200).crop("thumb").gravity("face")).imageTag("lady.jpg")
$.cloudinary.image("lady.jpg", {width: 200, height: 200, crop: "thumb", gravity: "face"})
cloudinary.Api.UrlImgUp.Transform(new Transformation().Width(200).Height(200).Crop("thumb").Gravity("face")).BuildImageTag("lady.jpg")
200x200 thumbnail with default zoom photo

To include more of the background around the face and decrease the relative size of the face within the image, we can decrease the zoom level to 70% of the default by setting the zoom parameter to 0.7:

cl_image_tag("lady.jpg", :zoom=>0.7, :width=>200, :height=>200, :crop=>:thumb, :gravity=>:face)
cl_image_tag("lady.jpg", array("zoom"=>0.7, "width"=>200, "height"=>200, "crop"=>"thumb", "gravity"=>"face"))
CloudinaryImage("lady.jpg").image(zoom=0.7, width=200, height=200, crop="thumb", gravity="face")
cloudinary.image("lady.jpg", {zoom: 0.7, width: 200, height: 200, crop: "thumb", gravity: "face"})
cloudinary.url().transformation(new Transformation().zoom(0.7).width(200).height(200).crop("thumb").gravity("face")).imageTag("lady.jpg")
$.cloudinary.image("lady.jpg", {zoom: 0.7, width: 200, height: 200, crop: "thumb", gravity: "face"})
cloudinary.Api.UrlImgUp.Transform(new Transformation().Zoom(0.7).Width(200).Height(200).Crop("thumb").Gravity("face")).BuildImageTag("lady.jpg")
200x200 thumbnail with 70% zoom photo

To include less of the background around the face and increase the relative size of the face within the image, we can increase the zoom level to 130% of the default by setting the zoom parameter to 1.3:

cl_image_tag("lady.jpg", :zoom=>1.3, :width=>200, :height=>200, :crop=>:thumb, :gravity=>:face)
cl_image_tag("lady.jpg", array("zoom"=>1.3, "width"=>200, "height"=>200, "crop"=>"thumb", "gravity"=>"face"))
CloudinaryImage("lady.jpg").image(zoom=1.3, width=200, height=200, crop="thumb", gravity="face")
cloudinary.image("lady.jpg", {zoom: 1.3, width: 200, height: 200, crop: "thumb", gravity: "face"})
cloudinary.url().transformation(new Transformation().zoom(1.3).width(200).height(200).crop("thumb").gravity("face")).imageTag("lady.jpg")
$.cloudinary.image("lady.jpg", {zoom: 1.3, width: 200, height: 200, crop: "thumb", gravity: "face"})
cloudinary.Api.UrlImgUp.Transform(new Transformation().Zoom(1.3).Width(200).Height(200).Crop("thumb").Gravity("face")).BuildImageTag("lady.jpg")
200x200 thumbnail with 130% zoom photo

Multiple faces cropping

The following example demonstrates how the zoom parameter works together with the crop mode. This mode is used to extract a section out of the original image, while retaining the size of the graphics, and together with the gravity parameter set to faces, cropping will center the derived image on the detected faces in the image. The zoom parameter will now determine the dimensions of the extracted image, resulting in a bigger image when the zoom value is less than 1.0, or a smaller image when the zoom value is greater than 1.0.

Original photo

The following URL dynamically crops the image shown above to only display the detected faces at the default zoom level:

cl_image_tag("young_couple.jpg", :crop=>:crop, :gravity=>:faces)
cl_image_tag("young_couple.jpg", array("crop"=>"crop", "gravity"=>"faces"))
CloudinaryImage("young_couple.jpg").image(crop="crop", gravity="faces")
cloudinary.image("young_couple.jpg", {crop: "crop", gravity: "faces"})
cloudinary.url().transformation(new Transformation().crop("crop").gravity("faces")).imageTag("young_couple.jpg")
$.cloudinary.image("young_couple.jpg", {crop: "crop", gravity: "faces"})
cloudinary.Api.UrlImgUp.Transform(new Transformation().Crop("crop").Gravity("faces")).BuildImageTag("young_couple.jpg")
Cropped to detected faces with default zoom

To include more of the background around the faces and increase the size of the resulting image, we can decrease the zoom level to 75% of the default by setting the zoom parameter to 0.75:

cl_image_tag("young_couple.jpg", :zoom=>0.75, :crop=>:crop, :gravity=>:faces)
cl_image_tag("young_couple.jpg", array("zoom"=>0.75, "crop"=>"crop", "gravity"=>"faces"))
CloudinaryImage("young_couple.jpg").image(zoom=0.75, crop="crop", gravity="faces")
cloudinary.image("young_couple.jpg", {zoom: 0.75, crop: "crop", gravity: "faces"})
cloudinary.url().transformation(new Transformation().zoom(0.75).crop("crop").gravity("faces")).imageTag("young_couple.jpg")
$.cloudinary.image("young_couple.jpg", {zoom: 0.75, crop: "crop", gravity: "faces"})
cloudinary.Api.UrlImgUp.Transform(new Transformation().Zoom(0.75).Crop("crop").Gravity("faces")).BuildImageTag("young_couple.jpg")
Cropped to detected faces with 75% zoom

To include less of the background around the faces and decrease the size of the resulting image, we can increase the zoom level to 150% of the default by setting the zoom parameter to 1.5:

cl_image_tag("young_couple.jpg", :zoom=>1.5, :crop=>:crop, :gravity=>:faces)
cl_image_tag("young_couple.jpg", array("zoom"=>1.5, "crop"=>"crop", "gravity"=>"faces"))
CloudinaryImage("young_couple.jpg").image(zoom=1.5, crop="crop", gravity="faces")
cloudinary.image("young_couple.jpg", {zoom: 1.5, crop: "crop", gravity: "faces"})
cloudinary.url().transformation(new Transformation().zoom(1.5).crop("crop").gravity("faces")).imageTag("young_couple.jpg")
$.cloudinary.image("young_couple.jpg", {zoom: 1.5, crop: "crop", gravity: "faces"})
cloudinary.Api.UrlImgUp.Transform(new Transformation().Zoom(1.5).Crop("crop").Gravity("faces")).BuildImageTag("young_couple.jpg")
Cropped to detected faces with 150% zoom

Single face cropping with multiple face thumbnail overlay

The following URL dynamically crops the lady.jpg image to only display the face at a zoom level of 90% and with rounded corners, then overlays a circular 150x150 thumbnail of the young_couple.jpg image with a zoom level of 75%. The overlay is also adjusted to display 10 pixels away from the north east corner by using the fl_layer_apply parameter, which allows multiple transformations to be applied on the overlay (see the blog post on manipulating overlays for more info).

cl_image_tag("lady.jpg", :transformation=>[
  {:radius=>50, :zoom=>0.9, :crop=>:crop, :gravity=>:face},
  {:radius=>"max", :zoom=>0.75, :overlay=>"young_couple", :width=>150, :height=>150, :crop=>:thumb, :gravity=>:faces},
  {:x=>10, :y=>10, :gravity=>:north_east, :flags=>:layer_apply}
  ])
cl_image_tag("lady.jpg", array("transformation"=>array(
  array("radius"=>50, "zoom"=>0.9, "crop"=>"crop", "gravity"=>"face"),
  array("radius"=>"max", "zoom"=>0.75, "overlay"=>"young_couple", "width"=>150, "height"=>150, "crop"=>"thumb", "gravity"=>"faces"),
  array("x"=>10, "y"=>10, "gravity"=>"north_east", "flags"=>"layer_apply")
  )))
CloudinaryImage("lady.jpg").image(transformation=[
  {"radius": 50, "zoom": 0.9, "crop": "crop", "gravity": "face"},
  {"radius": "max", "zoom": 0.75, "overlay": "young_couple", "width": 150, "height": 150, "crop": "thumb", "gravity": "faces"},
  {"x": 10, "y": 10, "gravity": "north_east", "flags": "layer_apply"}
  ])
cloudinary.image("lady.jpg", {transformation: [
  {radius: 50, zoom: 0.9, crop: "crop", gravity: "face"},
  {radius: "max", zoom: 0.75, overlay: "young_couple", width: 150, height: 150, crop: "thumb", gravity: "faces"},
  {x: 10, y: 10, gravity: "north_east", flags: "layer_apply"}
  ]})
cloudinary.url().transformation(new Transformation()
  .radius(50).zoom(0.9).crop("crop").gravity("face").chain()
  .radius("max").zoom(0.75).overlay("young_couple").width(150).height(150).crop("thumb").gravity("faces").chain()
  .x(10).y(10).gravity("north_east").flags("layer_apply")).imageTag("lady.jpg")
$.cloudinary.image("lady.jpg", {transformation: [
  {radius: 50, zoom: 0.9, crop: "crop", gravity: "face"},
  {radius: "max", zoom: 0.75, overlay: "young_couple", width: 150, height: 150, crop: "thumb", gravity: "faces"},
  {x: 10, y: 10, gravity: "north_east", flags: "layer_apply"}
  ]})
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .Radius(50).Zoom(0.9).Crop("crop").Gravity("face").Chain()
  .Radius("max").Zoom(0.75).Overlay("young_couple").Width(150).Height(150).Crop("thumb").Gravity("faces").Chain()
  .X(10).Y(10).Gravity("north_east").Flags("layer_apply")).BuildImageTag("lady.jpg")
Cropped to the detected face with 90% zoom and rounded corners, with 150x150 circular thumbnail overlay at 75% zoom and 10 pixels away from the north east corner
Cropped to the detected face with 90% zoom and rounded corners, with 150x150 circular thumbnail overlay at 75% zoom and 10 pixels away from the north east corner

As shown in the example above, any of Cloudinary's image transformations can also be applied to the delivered image.


When automatically cropping images to focus on the detected face(s), the zoom parameter is used to control how much of the original image surrounding the face to keep, and is a powerful method to modify and normalize user generated content to match the graphic design and style of your site.

The new Zoom feature is available to all our free and paid plans. If you don't have a Cloudinary account, you are welcome to sign up to our free account and try it out.

Joshua Drake: Updating the .Org docs on backups

From Planet PostgreSQL. Published on May 21, 2015.

I spent a great deal of time working through the SQL DUMP portion of the 9.5devel docs this past week. Below is the current text of what I have and it would be great if my readers would take a look and offer some thoughtful feedback. What would you like to see added? What would you like to see changed? Please note that this is reference documentation not tutorial documentation.

This is just the straight HTML dump that is generated from Docbook but since it is inline the links won't work. <!a href="">The current -devel docs are here and the updated version I am working is below:

<!a name="BACKUP-DUMP" id="BACKUP-DUMP">24.1. SQL Dump

PostgreSQL provides the program <!a href="app-pgdump.html">pg_dump for generating a backup file with SQL commands that, when fed back to the server, will recreate the database in the same state as it was at the time of the dump. The basic usage of <!a href="app-pgdump.html">pg_dump is:

pg_dump -C -F p -f outfile dbname

The use of -C ensures that the dump file will contain the requisite <!a href="sql-createdatabase.html">CREATE DATABASE command within the dump file. The use of -Fp ensures that you are using the plain text format and the use of -f allows you to specify the name of the file the dump will be written to. It is also possible for pg_dump to create files in other formats that allow for parallelism and fine-grained control of object backup or restoration. For more details on all options available to pg_dump please refer to the <!a href="app-pgdump.html">pg_dump reference page.

The pg_dump application requires read access to all objects within the database that it will be operating with. This generally requires database super-user access. It is possible for any database user to use pg_dump to back up the objects that they own regardless of super-user access. This can be achieved using options such as -n schema or -t table.

The primary advantage of using pg_dump over the other backup methods described is that pg_dump output is architecture independent. A backup made with pg_dump can generally be moved between operating systems and different architectures (32-bit, 64-bit, Sparc, Intel), whereas file-level backups and continuous archiving are both server-version-specific.

The text files created by pg_dump are internally consistent, meaning that the dump represents a snapshot of the database at the time pg_dump began running. pg_dump does not block other operations on the database while it is working. (Exceptions are those operations that need to operate with an exclusive lock, such as most forms of ALTER TABLE.)

Note: Like any other PostgreSQL client application, pg_dump will by default connect with the database user name that is equal to the current operating system user name. To override this, either specify the -U option or set the environment variable PGUSER.
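As a sketch of the two overrides mentioned in the note (the user name and file name here are illustrative), either form works against a running server:

```shell
# Override the connecting user explicitly...
pg_dump -U dumpuser -f outfile dbname

# ...or via the environment:
PGUSER=dumpuser pg_dump -f outfile dbname
```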

24.1.1. Advanced pg_dump

The pg_dump application provides other formats. The most notable are the custom format (-F c) and the directory format (-F d). The custom format is an excellent option for smaller databases when you need fine-grained control of the objects you choose to restore. The directory format allows for parallel connection-based backups. If you are performing a backup with many objects using pg_dump, the directory format will be the most efficient. This option also allows for fine-grained control of the objects you choose to restore.

If PostgreSQL was built on a system with the zlib compression library installed, the custom dump format will compress data as it writes it to the output file. This will produce dump file sizes similar to using gzip, but it has the added advantage that tables can be restored selectively.
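As an illustrative comparison (file names are hypothetical), both of the following produce compressed output, but only the custom-format dump can be restored selectively with pg_restore:

```shell
pg_dump -F p dbname | gzip > dump.sql.gz   # plain text piped through gzip
pg_dump -F c -f dump.custom dbname         # custom format, compressed internally via zlib
```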


Example 24-1. Backup a single table

  pg_dump -U user -h host1 -F c -f outfile -t table1 dbname

Example 24-2. Using wildcards with table list

  pg_dump -U user -h host1 -F c -f outfile -t table* dbname

Example 24-3. Using parallelism and a wildcard table list

  pg_dump -U user -h host1 -F d -f outfile -t table* -j 8 dbname

Note: The use of the custom or directory pg_dump formats requires the use of pg_restore and will not work with psql. There is more information on using pg_restore in Section 24.1.3.

24.1.2. Restoring the Dump

The psql application is the default client that ships with PostgreSQL. It is also the default application used when restoring text-based dumps created by the pg_dump application. For detailed information on psql, please see psql. For the purposes of restoring a dump, the basic usage is:

psql -f infile -d dbname 

Note: If you omitted -C when executing pg_dump, the CREATE DATABASE command will not be in the text file. You will need to create the database yourself from template0 before executing the restore (e.g., with createdb -T template0 dbname).


pg_dump does not back up users, roles and other global objects. To properly back up global objects you must use pg_dumpall with the -g parameter. If you do not restore the globals before restoring the text-based dump, all objects will implicitly be owned by the user passed via -U username. If -U is not passed, the operating system user executing psql will be used.
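A minimal sketch of that ordering, with illustrative file names:

```shell
pg_dumpall -g > globals.sql           # cluster-wide objects: roles, tablespaces
pg_dump -C -F p -f dbname.sql dbname  # the database itself

psql -f globals.sql postgres          # restore globals first...
psql -f dbname.sql postgres           # ...then the database dump
```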

Important: The dumps produced by pg_dump are relative to template0. This means that any languages, procedures, etc. added via template1 will also be dumped by pg_dump. As a result, when restoring, if you are using a customized template1, you must create the empty database from template0, as in the example above.

After restoring a backup, one should execute ANALYZE on each database so the query optimizer has useful statistics; see Section 23.1.3 and Section 23.1.6 for more information. For more advice on how to load large amounts of data into PostgreSQL efficiently, refer to Section 14.4.

24.1.3. Advanced restore


Example 24-4. Using pipes to restore to new server

   pg_dump -h host1 -d dbname | psql -h host2 -d dbname

If one is using the custom, directory or tar formats, the restore command is pg_restore. The pg_restore program has many benefits over the use of psql, including fine-grained object restore and parallelism.


Example 24-5. Extracting a text dump from a custom format backup

The following will extract the backup to the standard output. The use of -F is optional as pg_restore should be able to detect the format.

   pg_restore -F c infile 

Example 24-6. Restoring a single table

   pg_restore -U username -h host1 -d dbname -t table infile

Example 24-7. Using parallelism to restore databases

The use of parallelism will normally allow databases to restore much faster than a single-connection restore. A restore can only complete as quickly as its largest table can be restored, but for databases with many objects it is the fastest pg_dump-based restore.

   pg_restore -U username -h host1 -d dbname -t table -j 8 infile

24.1.4. Using pg_dumpall

pg_dump dumps only a single database at a time, and it does not dump information about roles or tablespaces (because those are cluster-wide rather than per-database). To support convenient dumping of the entire contents of a database cluster, the pg_dumpall program is provided. pg_dumpall backs up each database in a given cluster, and also preserves cluster-wide data such as role and tablespace definitions. The basic usage of this command is:

pg_dumpall > outfile

The resulting dump can be restored with psql:

psql -f infile postgres

It is necessary to have database superuser access when using a pg_dumpall dump; superuser access is required to restore the role and tablespace information.

pg_dumpall works by emitting commands to re-create roles, tablespaces, and empty databases, then invoking pg_dump for each database. This means that while each database will be internally consistent, the snapshots of different databases are not synchronized.

Cluster-wide data can be dumped alone using the pg_dumpall --globals-only option. This is necessary to fully back up the cluster if you are running pg_dump on individual databases.

Note: If you use tablespaces, make sure that the tablespace paths in the dump are appropriate for the new installation.

24.1.5. Handling Large Databases

Backing up a normal-sized database is relatively simple; backing up a large database (>500 GB) can be challenging. Fortunately, PostgreSQL is very flexible in its ability to provide a reliable backup. Here is a list of things you might want to consider when backing up a large database.

  1. Use the directory format and the -j NUM option. This will ensure the quickest and most flexible pg_dump-style backup.

  2. Use continuous archiving as described in Section 24.2. You can then back up the replica without putting load on the master.

  3. Use pg_basebackup as described in Section 24.2.2.


The pg_dump methods utilize at least one connection, if not many (via -j). They also utilize long-running transactions, which can cause problems with maintenance. If you find that your database contains a lot of growing bloat, consider using a backup method on the master that does not require pg_dump.

Satoshi Nagayasu: [Postgres Toolkit] pt-table-usage/pt-index-usage commands

From Planet PostgreSQL. Published on May 21, 2015.

In this entry, I would like to explain how to use the pt-table-usage and pt-index-usage commands in the Postgres Toolkit, which I introduced in the previous entry. If you have never heard about Postgres Toolkit, please visit the previous entry: A Hacker's Diary: Postgres Toolkit 0.2 released. pt-table-usage and pt-index-usage are the commands which can be used to obtain usage of the tables/indexes

Hubert 'depesz' Lubaczewski: Waiting for 9.5 – Add pg_settings.pending_restart column

From Planet PostgreSQL. Published on May 20, 2015.

On 15th of May, Peter Eisentraut committed patch: Add pg_settings.pending_restart column   with input from David G. Johnston, Robert Haas, Michael Paquier This is, for me, great. It might not sound like much, but it is actually something that I'm very excited about. Basically – let's assume you change system settings. Either by modifying postgresql.conf, […]

Andrew Dunstan: Placeholder blues

From Planet PostgreSQL. Published on May 20, 2015.

There has been some debate going on on the hackers list about problems certain drivers have with operators that contain the ? character. Since drivers such as the JDBC driver consider any such character outside of quote marks to be a parameter placeholder, its use as part of an operator makes life difficult for them. Part of the problem is that we allow almost every available non-alphanumeric ASCII character to be part of an operator. Why, for example, do we allow `? I've never seen it used that I can recall, but the fact that it's allowed means we can't use it for something else (such as marking an operator). That makes doing anything about the problem hard. Some people want to deprecate the use of ? in operators. I think it's far too late for that - its use is far too widespread.
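For a concrete illustration of the clash, the built-in jsonb key-existence operator introduced in 9.4 contains a bare ?, which a placeholder-scanning driver will misparse:

```sql
-- The jsonb ? operator tests whether a key exists at the top level.
-- A JDBC-style driver that treats every unquoted ? as a parameter
-- placeholder will mangle this otherwise valid query:
SELECT '{"a": 1, "b": 2}'::jsonb ? 'a';
```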

Hubert 'depesz' Lubaczewski: Waiting for 9.5 – Additional functions and operators for jsonb

From Planet PostgreSQL. Published on May 20, 2015.

On 12th of May, Andrew Dunstan committed patch: Additional functions and operators for jsonb   jsonb_pretty(jsonb) produces nicely indented json output. jsonb || jsonb concatenates two jsonb values. jsonb - text removes a key and its associated value from the json jsonb - int removes the designated array element jsonb - text[] removes a key […]

Michael Paquier: Postgres 9.5 feature highlight: Tracking parameters waiting for server restart

From Planet PostgreSQL. Published on May 19, 2015.

A couple of days back, the following commit landed in the Postgres world, for the upcoming 9.5 release:

commit: a486e35706eaea17e27e5fa0a2de6bc98546de1e
author: Peter Eisentraut <>
date: Thu, 14 May 2015 20:08:51 -0400
Add pg_settings.pending_restart column

with input from David G. Johnston, Robert Haas, Michael Paquier

Particularly useful for systems doing a lot of server parameter updates, this allows tracking parameters that need a server restart for an updated value to take effect. Note that this applies to all the parameters marked as PGC_POSTMASTER in guc.c (shared_buffers being one), as well as any custom parameters a system may have after being loaded by a plugin. This information is tracked by a new column called pending_restart in the system view pg_settings, with a boolean value set to "true" if a given GUC parameter is indeed waiting for a server restart.

For the fact that a parameter is waiting for a restart to become visible, the server's configuration needs to be reloaded with pg_reload_conf(), "pg_ctl reload" or a SIGHUP signal. Modifications made in postgresql.conf, in any included configuration files, or via ALTER SYSTEM are of course all taken into account. See for example:

=# \! echo "port = 6666" > $PGDATA/postgresql.conf
=# ALTER SYSTEM SET shared_buffers TO '1GB';
=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
=# SELECT name FROM pg_settings WHERE pending_restart;
      name
----------------
 port
 shared_buffers
(2 rows)

This will prove to be useful for many systems, like those doing automatic tuning of server parameters, or containers (though not limited to those, of course).

Keynote by Catherine Bracy (PyCon 2015 Must-See Talk: 4/6)

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on May 19, 2015.

Part four of six in our PyCon 2015 Must-See Series, a weekly highlight of talks our staff enjoyed at PyCon.

My recommendation would be Catherine Bracy's Keynote about Code for America. Cakti should be familiar with Code for America. Colin Copeland, Caktus CTO, is the founder of Code for Durham and many of us are members. Her talk made it clear how important this work is. She was funny, straight-talking, and inspirational. For a long time before I joined Caktus, I was a "hobbyist" programmer. I often had time to program, but wasn't sure what to build or make. Code for America is a great opportunity for people to contribute to something that will benefit all of us. I have joined Code for America and hope to contribute locally soon through Code for Durham.

More in the PyCon 2015 Must-See Talks Series.

Giulio Calacoci: PostgreSQL 9.5: IMPORT FOREIGN SCHEMA

From Planet PostgreSQL. Published on May 19, 2015.

The release of PostgreSQL 9.5 is imminent, so the time has come to analyse what's new in this latest version.

A very interesting feature of version 9.5 is the ability to import a schema from a remote database, using Foreign Data Wrapper and the IMPORT FOREIGN SCHEMA command.


Foreign Data Wrappers (FDW)

Before the introduction of Foreign Data Wrappers, the only way to connect a Postgres database with an external data source was using the dblink module.

In 2003, the set of rules for the standard management of external data sources was defined within the SQL language: SQL/MED (Management of External Data).

PostgreSQL 9.1 introduced a first implementation of the SQL/MED standard with Foreign Data Wrappers, which provided Postgres with direct access to data sources such as files or other databases (Oracle, MySQL…), allowing their use as tables.

The advantage of this approach is obvious: it gives you the ability to connect to an external data source to extract data natively by running a simple query. The fact that it is not necessary to use external modules to obtain this result considerably simplifies the DBAs’ work.

If you want to know more, take a look at our 2011 blog post: PostgreSQL 9.1: Tabelle esterne con SQL/MED (Warning: blog post in Italian).

Small example of an FDW

PostgreSQL 9.3 introduced write support for Foreign Data Wrappers (Warning: blog post in Italian) and also added a foreign data wrapper for PostgreSQL itself. Let's have a look at a simple example of how to use an FDW by connecting two Postgres databases.

First we create two databases:

CREATE DATABASE source;
CREATE DATABASE destination;

Within the source we create a test table with test data:

\c source
CREATE TABLE test1 AS SELECT id, md5(random()::text) FROM generate_series(1,5) id;

Now we connect to the destination db and then, connect the two databases:

\c destination
CREATE EXTENSION postgres_fdw ;
CREATE SERVER src_srv FOREIGN DATA WRAPPER postgres_fdw OPTIONS( dbname 'source' );
CREATE USER MAPPING FOR postgres SERVER src_srv OPTIONS ( user 'postgres' );

Many of you, quite rightly, will be up in arms, complaining about my poor choice in terms of security! Good!

For simplicity’s sake I decided to connect with the administrator user postgres — also to avoid deviating excessively from the main topic of this article. Be aware that, for security reasons, you will have to make other choices in a production environment (for example, using a specific user for your application).

However, once the connection is established, we can create on the target database an external table that points to test1 on the source database:

CREATE FOREIGN TABLE test1_ft (id integer, md5 text) server src_srv options(table_name 'test1');

We can now compare the content of the two test tables:

select * from test1_ft ;
 id |               md5
----+----------------------------------
  1 | 63e5bc545b45f5c3961522f2609bedd9
  2 | d74af95e495d946d4a0887c51eb2cbe2
  3 | acce7cba66967332d01d51b74eb293f7
  4 | c5bb57ca54036004de334cf793792d4e
  5 | 02f32751b09042cf28b78cc29321a32e
(5 rows)

\c source

select * from test1 ;
 id |               md5
----+----------------------------------
  1 | 63e5bc545b45f5c3961522f2609bedd9
  2 | d74af95e495d946d4a0887c51eb2cbe2
  3 | acce7cba66967332d01d51b74eb293f7
  4 | c5bb57ca54036004de334cf793792d4e
  5 | 02f32751b09042cf28b78cc29321a32e
(5 rows)

It is clear, looking at this example, that one of the greatest limitations to the use of Foreign Data Wrapper is the need to define each table separately, according to the appropriate structure. Access to external data is therefore laborious if you want to import more complex tables or even entire schemas.

Until now, such operations were performed through scripts that were able to connect to the source database and create the structure of the external tables automatically. Fortunately, the IMPORT FOREIGN SCHEMA function, present in the next release of PostgreSQL, will help us.


The IMPORT FOREIGN SCHEMA instruction allows importing of an entire schema from an external data source without having to specify the structure of each table:

IMPORT FOREIGN SCHEMA remote_schema_name
FROM SERVER server_name INTO destination_schema;

If it is not necessary to import an entire schema, it is possible to use the LIMIT TO clause and restrict imports only to the tables we are interested in:

IMPORT FOREIGN SCHEMA remote_schema_name LIMIT TO (table_name, table_name, ...)
FROM SERVER server_name INTO destination_schema;

Otherwise, if we only want to exclude certain tables from the schema, it is possible to filter them by the EXCLUDE clause:

IMPORT FOREIGN SCHEMA remote_schema_name EXCLUDE (table_name, table_name, ...)
FROM SERVER server_name INTO destination_schema;


We can see in detail how to use this command by extending the example used previously. We connect to the source database and we add two tables to the one that already exists:

\c source
create table test2 as select id, md5(random()::text) from generate_series(1,20) as id;
create table test3 as select id, md5(random()::text) from generate_series(1,50) as id;

Now we create in the target database a schema that we will use as the target of the instruction IMPORT FOREIGN SCHEMA:

\c destination
create schema imported;

Now we can import the schema we have just expanded, relying on the connection opened in the previous example:

IMPORT FOREIGN SCHEMA public FROM SERVER src_srv INTO imported;

Let’s make a quick inspection of all the tables on the target database to observe the outcome of the schema import:

\dE *.*

               List of relations
  Schema  |   Name   |     Type      |  Owner
----------+----------+---------------+----------
 imported | test1    | foreign table | postgres
 imported | test2    | foreign table | postgres
 imported | test3    | foreign table | postgres
 public   | test1_ft | foreign table | postgres

Within the public schema we see the table that we created earlier, while the result of the "mass" import is visible in the imported schema. This example shows how much faster and more efficient working with external tables becomes with IMPORT FOREIGN SCHEMA.


With PostgreSQL 9.5, thanks to this new feature, data migrations will become simpler and quicker. Currently, the IMPORT FOREIGN SCHEMA instruction is only supported by postgres_fdw, and it requires that the developers of individual drivers implement it in the manner most appropriate to the data source. Increasing the number of drivers able to support this feature opens up interesting scenarios for PostgreSQL and data integration.

Pavel Stehule: faster plpgsql in PostgreSQL 9.5 (second part)

From Planet PostgreSQL. Published on May 18, 2015.

Tom Lane introduced a new state for TOAST types - the expanded state. The side effect of this change, together with some changes in plpgsql internals, has a big impact on the speed of array operations in plpgsql.
I have two plpgsql blocks with loops. The first performs repeated array field updates, the second repeated array appends.
DO $$
DECLARE a int[] = array_fill(10, ARRAY[10000]);
BEGIN
  FOR i IN 1 .. 10000 LOOP
    a[i] := 0;
  END LOOP;
END;
$$;

DO $$
DECLARE a int[] = '{}';
BEGIN
  FOR i IN 1 .. 10000 LOOP
    a := a || 10;
  END LOOP;
END;
$$;
You can try this code on PostgreSQL 9.4 - you get times of about 450 and 220 ms. The same code needs only 6 and 5 ms on PostgreSQL 9.5! That is more than an order of magnitude speedup.

PyCon Sweden 2015

By Between engineering and real life from Django community aggregator: Community blog posts. Published on May 18, 2015.

In a few words, PyCon Sweden 2015 was awesome. Honestly, this was my first Python conference ever, but I really hope it won't be the last.

Beyond the awesome talks and great organisation, it was really nice to spend some time with like-minded people and talk about technology, the universe and everything else. I met some old friends and made some new ones, but let's get back to the talks. Unfortunately I was not able to see all of them, but here is a brief rundown of those I saw and found really interesting:

It all started with Ian Ozsvald and his awesome talk "Data Science Deployed" (slides). The most important points were:

  • log everything
  • think about data quality, don't use everything just what you need
  • think about turning data into business values
  • start using your data

Then Rebecca Meritz talked about "From Explicitness to Convention: A Journey from Django to Rails" (slides). Although the title sounds a bit provocative, this was not the usual Django vs Rails talk. At least to me it was more like a comparison between the two frameworks, showing their differences and their weak and strong sides. Although I am a Django user, I lean more and more toward the position that neither framework is better than the other; they are just two different approaches to building great web apps.

Flavia Missi and "Test-Driven Development with Python and Django" (slides). TDD will help you have cleaner, better structured and easier-to-maintain code. If you are not doing it, the best moment to start is now. Although it is hard at the beginning, you will pretty soon realise how beneficial it is. Especially if someone pushes a bad commit and the tests save your ass before the code goes to production.

Later Dain Nilsson talked about "U2F: Phishing-proof two-factor authentication for everyone". Although I don't use two-factor authentication at the moment, I am familiar with the concept and I really like it. The U2F protocol looks like a big step towards making it more applicable across different tools and applications, and the key-holding devices are more and more accessible nowadays. Maybe it is time for me to get one )))

The second day started with Kate Heddleston, who talked about ethics in computer programming: about how social networks can be used as a tool for ruining people's lives, and how we as developers should take responsibility and work towards making the internet a safer place for everyone. A place where you can have your privacy and have protection if harassed. It is a big problem which won't be solved overnight, but talking about it is the first step towards solving it.

"Why Django Sucks" by Emil Stenström. Don't rush into raging, because this was one of the best talks at the conference. Emil showed us the parts of Django where he sees flaws that need improvement. The main point was the lack of a common template language between Django and JavaScript, and what possible solutions are available or could be made.

Daniele Sluijters reminded us how easy it is to work with "Puppet and Python". No more fighting with Ruby; you can easily use your favorite language to build your own tools and helpers.

Dennis Ljungmark and "Embedded Python in Practice". The last time I programmed embedded devices was 15 years ago, as part of a short course in high school. Dennis' work is much more complex than what I did then, but his talk reminded me of things that are applicable to general programming. Although on non-embedded systems we often have much more memory and processing power available, that does not mean we should waste it. So think when you code - RAM is not endless and processors are not as fast as we often wish. Also don't forget that exceptions will sooner or later occur, so make your code ready to handle them.

"How to Build a Python Web Application with Flask and Neo4j" by Nicole White. Well, I had heard about Neo4j, but I had never used it or seen it in action, so this one was really exciting. Neo4j offers you a whole new perspective on building databases and relations between objects, but it is far from a panacea. Actually, I see it more as a special tool than as a general replacement for a relational database, but it is still worth trying. Oh, and the Neo4j browser - totally awesome.

In the lightning talks, Tome Cvitan talked about Faker. If you are still not familiar with it, now is a good time to try it. Especially if you are writing tests.

At the end, Kenneth Reitz told us about "Python for Humans": about the not-so-obvious things in Python and what solutions are out there, and also about The Hitchhiker's Guide to Python!, a great place not only for beginners. He also shared the idea of making Python easier and more welcoming by introducing better tools for higher-level operations.

Finally I want to thank everyone - the organisers, the speakers, the audience, and practically everyone who was a part of the conference. Without you it wouldn't be the same (or happen at all). Thanks, keep up the good work, and hopefully we will see each other again.

P.S. Have I mentioned that the whole conference was recorded on video? Hopefully we will be able to see all the talks pretty soon. I will try to keep this post updated with links to the videos and/or slides when they become available. Of course, if you know about any published slides from the conference that are not linked here, please let me know.

gabrielle roth: Simple test for lock_waits log messages

From Planet PostgreSQL. Published on May 18, 2015.

I do a lot of “is this !#@(* really working correctly” testing. (See also: simple test for autovacuum log messages.) Here’s a simple test to verify that log_lock_waits is configured correctly. This is essential for troubleshooting lock contention. -- verify settings -- log_lock_waits should be 'on'; it is off by default. SHOW log_lock_waits; -- lock_wait […]

Greg Sabino Mullane: Connected to PgBouncer or Postgres?

From Planet PostgreSQL. Published on May 18, 2015.

Determining if your current database connection is using PgBouncer, or going directly to Postgres itself, can be challenging, as PgBouncer is a very low-level, transparent interface. It is possible, and here are some detection methods you can use.

This was inspired by someone asking on the Perl DBD IRC channel if it was possible to easily tell if your current database handle (usually "$dbh") is connected to PgBouncer or not. Since I've seen this question asked in other venues, I decided to take a crack at it.

There are actually two questions to be answered: (1) are we connected to PgBouncer, and if so, (2) what pool_mode is being run? The quickest and easiest way I found to answer the first question is to try to connect to a non-existent database. Normally, this produces a FATAL message, as seen here:

$ psql testdb -p 5432
testdb=# \c ghostdb
FATAL:  database "ghostdb" does not exist
Previous connection kept

However, a slightly different ERROR message is returned if the same thing is attempted while connected to PgBouncer:

$ psql testdb -p 6432
testdb=# \c ghostdb
ERROR:  No such database: ghostdb
Previous connection kept

Thus, an ERROR will always indicate that you are connected to PgBouncer and not directly to Postgres, which will always issue a FATAL.

In the future, there will be an even simpler method. As of this writing, PgBouncer 1.6 has not been released, but it will have the ability to customize the application_name. This is a configurable session-level variable that is fairly new in Postgres. Andrew Dunstan wrote a patch which enables adding this to your pgbouncer.ini file:

application_name_add_host = 1

This will make PgBouncer modify the application_name to append some information to it such as the remote host, the remote port, and the local port. This is a feature many PgBouncer users will appreciate, as it offers an escape from the black hole of connection information that PgBouncer suffers from. Here is what it looks like on both a normal Postgres connection, and a PgBouncer connection. As you can see, this is an easier check than the "invalid database connection" check above:

## Postgres:
$ psql testdb -p 5432 -c 'show application_name'

## PgBouncer:
$ psql testdb -p 6432 -c 'show application_name'
 psql - unix(

## DBD::Pg connections to PgBouncer get a very similar change:
$ perl --port 6432
app - unix(

Now we have answered the question of "are we connected to PgBouncer or not?". The next question is which pool mode we are in. There are three pool modes you can set for PgBouncer, which control when your particular connection is returned to "the pool". For "session" mode, you keep the same Postgres backend the entire time you are connected. For "transaction", you keep the same Postgres backend until the end of a transaction. For "statement", you may get a new Postgres backend after each statement.

First, we can check if we are connected to PgBouncer in a statement level pool mode by taking advantage of the fact that multi-statement transactions are prohibited. PgBouncer enforces this by intercepting any attempts to enter a transaction (e.g. by issuing a BEGIN command). A very PgBouncer specific error about "Long transactions not allowed" is issued back to the client like so:

$ psql testdb -p 6432
testdb=# begin;
ERROR:  Long transactions not allowed

So, that takes care of detecting a pool_mode set to 'statement'. The other two modes, transaction and session, will *not* give the same error. Thus, seeing that error indicates you are using a statement-level PgBouncer connection.

The next pool mode is "transaction", which means that the server connection is released back to the pool at the end of a transaction. To figure out if we are in this mode, we take advantage of the fact that PgBouncer can be set to clean up the connection at the end of each transaction by issuing a specific command. By default, the command set by server_reset_query is DISCARD ALL, which invalidates any prepared statements, temporary tables, and other transaction-spanning, session-level items. Thus, our test will see if these session-level artifacts get discarded or not:

## Direct Postgres:
$ psql testdb -p 5432
testdb=# prepare abc(int) as select $1::text;
testdb=# execute abc(1);

## PgBouncer:
$ psql testdb -p 6432
testdb=# prepare abc(int) as select $1::text;
testdb=# execute abc(1);
ERROR:  prepared statement "abc" does not exist

Keep in mind that there are no true "transactionless" commands in Postgres. Even though we did not use a BEGIN in the psql prompt above, each command is treated as its own mini-transaction. In the case of the PgBouncer connection, the prepare is immediately followed with a DISCARD ALL, which means that our prepared statement no longer exists. Hence, we have determined that we are using a transaction-level PgBouncer connection.

Unfortunately, not getting an error does not necessarily mean your PgBouncer is NOT in transaction mode! It could be that server_reset_query is empty, meaning that temporary artifacts are not discarded at the end of the transaction. In such a case, we can take advantage of the fact that PgBouncer will allow other clients to share our current connection, and thus see the temporary items. If we create a temporary table in one PgBouncer connection, then connect again as a new client, the temporary table will only show up if we are sharing sessions but not transactions. Easier shown than explained, I suspect:

## Regular Postgres gets a fresh session:
$ psql test1 -p 5432
test1=# create temp table abc(a int);
test1=# select * from abc;
(No rows)
test1=# ^Z ## (we suspend with CTRL-Z)
[2]+  Stopped                 psql test1 -p 5432

$ psql test1 -p 5432
test1=# select * from abc;
ERROR:  relation "abc" does not exist

## PgBouncer will re-use the same session:
$ psql test1 -p 6432
test1=# create temp table abc(a int);
test1=# select * from abc;
(No rows)
test1=# ^Z
[2]+  Stopped                 psql test1 -p 6432

$ psql test1 -p 6432
test1=# select * from abc;
(No rows)

The final PgBouncer pool mode is "session", in which the only real savings over a normal Postgres connection is avoiding the overhead of starting up and connecting to a new Postgres backend. Thus, the PgBouncer connections are only returned to the pool upon disconnection. The only way to tell if you are in this mode is by determining that you are *not* in the other two modes. :)
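
Put together, the probes described above form a simple decision tree. Here is a sketch in Python (the helper and its boolean inputs are my own shorthand for the outcomes of the probes, which you would gather over live connections):

```python
# Hypothetical helper: encode the SQL-level probes as booleans and
# map them to the PgBouncer pool_mode they imply.

def guess_pool_mode(explicit_begin_fails, prepared_stmt_lost, temp_table_shared):
    """Infer pool_mode from three observations:

    explicit_begin_fails: opening an explicit transaction errors out
        -> every statement gets its own server connection.
    prepared_stmt_lost: PREPARE in one statement, EXECUTE in the next
        fails -> the connection was reset between transactions.
    temp_table_shared: a temp table made by one client is visible to a
        brand-new client -> sessions are shared across transactions.
    """
    if explicit_begin_fails:
        return 'statement'
    if prepared_stmt_lost or temp_table_shared:
        return 'transaction'
    return 'session'
```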

So, although PgBouncer is extremely transparent, there are some tricks to determine if you are connected to it, and at what pool_mode. If you can think of other (SQL-level!) ways to check, please let me know in the comments section.

Andrew Dunstan: Yet another data recovery job

From Planet PostgreSQL. Published on May 17, 2015.

One of the things I like about data recovery jobs is that each one is different. Right now I'm wrestling with this problem: how can I recover whatever is possible from a backup that's missing a handful of xlog files. We know this will result in a probably corrupt database, but we still want to see what data is available.

My first two approaches didn't work too well. I tried mangling the control file, and creating dummy xlog files to replace the handful that are missing.

Right now I'm trying to follow a suggestion from Andres Freund to drive the process via the debugger, moving it on to the next available log file when it gets to where it would want the first missing log file.  The trouble is it's hard to find what to set exactly and where to set it (this is Postgres 9.3). Every time I try something I get errors like "unexpected page_addr". If I were more familiar with this section of the code it might be easier, but it's something I've pretty much never dabbled in.

It all makes for an exciting life.

Robins Tharakan: Basic OLAP Support in PostgreSQL

From Planet PostgreSQL. Published on May 16, 2015.

While reviewing an existing application, I thought it'd be worthwhile to review how good / bad PostgreSQL is in terms of OLAP. This (growing) post is going to be my (un)learning of how ready PostgreSQL is. Row Numbering Support: Yes. Use: The row_number() function numbers rows generated in a result set. Example: SELECT row_number() OVER (ORDER BY marks DESC) AS rn, name FROM x; Review: Some
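
For intuition, row_number() amounts to enumerating an ordered result set; a rough Python analogue (the rows and column names are invented for illustration):

```python
# Toy stand-in for:
#   SELECT row_number() OVER (ORDER BY marks DESC) AS rn, name FROM x;
rows = [{'name': 'ann', 'marks': 91},
        {'name': 'bob', 'marks': 78},
        {'name': 'cid', 'marks': 85}]

# ORDER BY marks DESC, then number the rows starting at 1
numbered = [
    {'rn': rn, 'name': r['name']}
    for rn, r in enumerate(sorted(rows, key=lambda r: -r['marks']), start=1)
]
# ann (91) is numbered 1, cid (85) is 2, bob (78) is 3
```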

Robins Tharakan: Postgres finally has CUBE / ROLLUP / GROUPING SETS !

From Planet PostgreSQL. Published on May 16, 2015.

Finally! A *much* awaited feature. This attempt at adding the GROUPING SETS / ROLLUP / CUBE feature to PostgreSQL has been in the works for about a year (besides the many attempts in the past decade and a half that didn't get through), and thankfully it has finally got the approval of the powers that be, so the upcoming Postgres 9.5 will finally have this long-pending SQL feature. MSSQL and

Joel Jacobson: Finding missing foreign keys

From Planet PostgreSQL. Published on May 16, 2015.

By coincidence I stumbled upon a table where one of its columns didn’t have a foreign key. I found this strange, since our convention is to always add foreign keys when appropriate.

But humans are humans, and will eventually forget to add a foreign key, so we better have both belt and suspenders and find a way to deal with this inevitable problem in an efficient way.

It would be a tedious job to manually look for missing foreign keys in all tables and columns.
But if you’re lucky enough to have the pleasure of working with a system where all tables and columns have been given their names by following a strict naming convention, it might be possible to fully automate the task.

This is the naming convention we use:

  1. Table names are always in plural form with a trailing “s”, e.g. users
  2. Primary key column names are always equal to the table name in singular form with a trailing “id”, e.g. userid.
  3. Foreign key columns are always equal to the primary key they are referencing, e.g. transactions.userid -> users.userid
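
Since the convention makes the referenced table derivable from the column name alone, it can be captured in a few lines; a minimal sketch (the helper name is my own, and irregular plurals would defeat it):

```python
def referenced_table(column_name):
    """Given a key column named per the convention (<singular> + 'id'),
    return the table it should reference (<singular> + 's')."""
    if not column_name.endswith('id'):
        return None
    singular = column_name[:-2]   # 'userid' -> 'user'
    return singular + 's' if singular else None

# e.g. a transactions.userid column should reference users.userid
```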

This means you always know, based on the column name alone, which table (if any) that column might be referencing.

Example: If the table transactions has a column named userid, and there is a table where userid is also the primary key, but there isn’t any foreign key on transactions.userid, then it’s a missing foreign key, or else someone has not followed the naming convention.

Thanks to PostgreSQL’s fantastic pg_catalog system tables, we can write a query which uses the rules from the naming convention and returns all the columns which appears to be missing foreign keys. It doesn’t support multi-column keys, but we don’t have many of those, so it’s not a problem in my case.

Thanks to the view below, I automatically found three more missing foreign keys of the same type, which saves me many hours of boring work today.

pg1:joel=#* SELECT * FROM view_missing_foreign_keys;
 nspname |     relname     | attname 
---------+-----------------+---------
 public  | transactions    | userid
 public  | someohtertable1 | userid
 public  | someothertable2 | userid
(3 rows)

I’m posting the view I wrote here in hope it might be useful for others with a similar naming convention, and/or to inspire others to include the table name in their column names used as primary/foreign keys.

CREATE OR REPLACE VIEW view_missing_foreign_keys AS
SELECT pg_catalog.pg_namespace.nspname,
       pg_catalog.pg_class.relname,
       pg_catalog.pg_attribute.attname
FROM pg_catalog.pg_namespace
INNER JOIN pg_catalog.pg_class ON (pg_catalog.pg_class.relnamespace = pg_catalog.pg_namespace.oid)
INNER JOIN pg_catalog.pg_attribute ON (pg_catalog.pg_attribute.attrelid = pg_catalog.pg_class.oid)
WHERE pg_catalog.pg_class.relkind = 'r'
AND pg_catalog.pg_attribute.attnum > 0
AND NOT pg_catalog.pg_attribute.attisdropped
AND pg_catalog.pg_namespace.nspname NOT IN ('pg_toast','information_schema','pg_catalog')
AND pg_catalog.pg_attribute.attname LIKE '%id'
AND EXISTS (
    -- The column is PRIMARY KEY in some table
    SELECT 1 FROM pg_catalog.pg_constraint
    WHERE pg_catalog.pg_constraint.contype = 'p'
    AND pg_catalog.pg_get_constraintdef(pg_catalog.pg_constraint.oid) = format('PRIMARY KEY (%s)',pg_catalog.pg_attribute.attname)
)
AND NOT EXISTS (
    -- There is no FOREIGN KEY on this column
    SELECT 1 FROM pg_catalog.pg_constraint
    WHERE pg_catalog.pg_constraint.contype = 'f'
    AND pg_catalog.pg_constraint.conrelid = pg_catalog.pg_class.oid
    AND pg_catalog.pg_get_constraintdef(pg_catalog.pg_constraint.oid) LIKE (format('FOREIGN KEY (%s)',pg_catalog.pg_attribute.attname) || '%')
)
AND NOT EXISTS (
    -- This column is not the PRIMARY KEY of its own table,
    -- since if it was, we wouldn't require a FOREIGN KEY on it
    SELECT 1 FROM pg_catalog.pg_constraint
    WHERE pg_catalog.pg_constraint.contype = 'p'
    AND pg_catalog.pg_constraint.conrelid = pg_catalog.pg_class.oid
    AND pg_catalog.pg_get_constraintdef(pg_catalog.pg_constraint.oid) = format('PRIMARY KEY (%s)',pg_catalog.pg_attribute.attname)
);

Adding Maintenance Data pt 1

By GoDjango - Django Screencasts from Django community aggregator: Community blog posts. Published on May 15, 2015.

Join us as we continue building our product by starting to allow our users to add bike maintenance records to their bikes.
Watch Now...

Shaun M. Thomas: PG Phriday: 10 Ways to Ruin Performance: Forgetting to EXPLAIN

From Planet PostgreSQL. Published on May 15, 2015.

Yesterday I gave the developers at my company what I call a DBA Chat. It’s something I try to do every month to keep them apprised on various features, caveats, performance considerations, and so on. I find that educating the folks who regularly work with the database does wonders for application performance and my sanity. The benefit of this long format is that I can go over more information than a time constrained set of slides.

This month we went over one of my slightly older talks, and even though the material was three years old from my perspective, it was all new to them. And why not? Developers have a job to do, and while they do work with the database, it generally isn’t their responsibility to research the voluminous minutia and quirks commonly associated with a specific platform. That’s my job.

So here is the first of a ten part series on anti-patterns that can annihilate PGDB (PostgreSQL) performance, some of which are outright insidious. This week we’ll start slow with a short discussion on the importance of EXPLAIN. This is PGDB’s first line of defense against bad queries, and it’s overlooked more than you might expect, as well as somewhat difficult to interpret for the uninformed. I’ll probably have a more in-depth version of this article in the future for truly interpreting EXPLAIN output, but this time we’ll be focusing on using it in general.

First, we need to set up a use case to illustrate a couple of variants. Two tables with a close relationship should do it:

CREATE TABLE sys_product
(
    product_id   SERIAL  PRIMARY KEY,
    prod_name    TEXT    NOT NULL,
    descr        TEXT    NOT NULL
);

INSERT INTO sys_product (prod_name, descr)
SELECT 'Product ' || a.id, 'It does stuff.'
  FROM generate_series(1, 1000) a(id);

CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now()
);

INSERT INTO sys_order (product_id, item_count)
SELECT (a.id % 100) + 1, (random()*10)::INT + 1
  FROM generate_series(1, 1000000) a(id);

ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id
      PRIMARY KEY (order_id);

CREATE INDEX idx_order_product_id
    ON sys_order (product_id);

Now, EXPLAIN comes in many variants, but there are two that directly concern us: EXPLAIN and EXPLAIN ANALYZE. The first will list the estimated cost of the query, which indexes will be used, how many rows are estimated to match, and so on. The second will physically execute the query and obtain the actual match counts along with elapsed time for each step. Here’s what that looks like:

EXPLAIN SELECT * FROM sys_order;

                             QUERY PLAN                             
--------------------------------------------------------------------
 Seq Scan on sys_order  (cost=0.00..16370.00 rows=1000000 width=20)

EXPLAIN ANALYZE SELECT * FROM sys_order;

                             QUERY PLAN                             
--------------------------------------------------------------------
 Seq Scan on sys_order  (cost=0.00..16370.00 rows=1000000 width=20)
     (actual time=0.006..465.956 rows=1000000 loops=1)
 Planning time: 0.041 ms
 Execution time: 886.766 ms

I’ve reformatted the results a bit, but the difference in output should be readily apparent. Interpreting this isn’t very difficult at a high level. Begin by examining the first set of parentheses in both results, and ignore the cost for now; it’s essentially meaningless noise intended primarily for the query planner. What concerns us is the row count, which is one million in this case. It’s how many rows PGDB estimated the query would return.

The second set of parentheses in the ANALYZE version gives the actual values obtained by executing the query. In this case, the query matched exactly one million rows, and took 465ms to do so. If you don’t know already, Seq Scan is short for Sequential Scan, and means the database read the entire contents of the table. That should be expected, since we didn’t provide any WHERE clause to restrict the results.

Pay special attention to the rows section, because it can be a major clue to ferret out performance problems. PGDB keeps statistics on table contents, but these are necessarily sparse and in aggregate form. There’s also currently no way to correlate statistics between tables, so a JOIN can drastically affect row estimates, and since it’s how cost is calculated, can result in wildly different execution plans than the optimal case.

So we know that a Sequential Scan will read the entire table. Hopefully that point by itself is an obvious indication that doing so is generally undesirable. If we are querying a table with 300-million rows, reading its entire contents is almost never the intended approach, nor should it be.

What other variants should we look for? How about an Index Scan:

EXPLAIN ANALYZE
 SELECT * FROM sys_order
  WHERE order_id BETWEEN 5 AND 10;

                             QUERY PLAN                             
--------------------------------------------------------------------
 Index Scan using pk_order_order_id on sys_order 
       (cost=0.42..8.53 rows=5 width=20)
       (actual time=0.008..0.012 rows=6 loops=1)
   Index Cond: ((order_id >= 5) AND (order_id <= 10))
 Planning time: 0.112 ms
 Execution time: 0.041 ms

We can see a few things from these results:

  • Estimated row count is not the same as how many actually matched.
  • The rows were found with an Index Scan.
  • Execution time was much faster than the Sequential Scan case.

It’s not critical for estimated and actual row counts to match, but they should be within an order of magnitude of each other. If this isn’t the case, it’s likely the planner has insufficient information about the data: the statistics may be wrong, or the data may be more closely correlated than it assumed, resulting in fewer eliminations than expected, and so on. This is one of the first ways to see that there could be a problem with a query’s execution plan.
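
That rule of thumb is easy to automate when scanning many plans; a sketch (the one-order-of-magnitude threshold is a judgment call, not a PGDB setting):

```python
import math

def estimate_suspicious(estimated_rows, actual_rows, magnitudes=1):
    """True if planner estimate and actual row count differ by more
    than the given number of orders of magnitude."""
    if estimated_rows <= 0 or actual_rows <= 0:
        return True
    return abs(math.log10(actual_rows / estimated_rows)) > magnitudes

estimate_suspicious(5, 6)         # close enough: fine
estimate_suspicious(4000, 50000)  # off by more than 10x: suspicious
```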

Regarding the index scan, matching five or six values against a million isn’t a big deal, but there are limits. Indexes in PGDB work, in most cases, by fetching very quickly from the index file and then consulting the base table file to retrieve the matching row. This results in two random seeks to the underlying storage subsystem, and if you don’t know already, random reads are usually slower than sequential reads by an order of magnitude or more. This doesn’t apply to SSD devices, but most databases don’t reside entirely on such expensive hardware just yet.

That’s why the rows element of the EXPLAIN is important in respect to index scans; it’s entirely possible that reading the entire table would be faster than performing several hundred thousand index and table seeks. For the most part, the planner knows this, but calculating this relationship is still more of an art than a science. Consequentially, sometimes the planner guesses wrong, and query performance will suffer. By using EXPLAIN, we can see what happened, and find a way around the problem.
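
Back-of-the-envelope numbers make the tradeoff concrete (the latencies below are illustrative assumptions for spinning disks, not PGDB constants):

```python
# Assume ~8 ms per random seek, and a sequential scan that streams
# the whole table in about 2 seconds.
random_seek_ms = 8
seqscan_ms = 2000

def index_scan_ms(matching_rows):
    # Each match costs roughly two random seeks:
    # one in the index file, one in the table heap.
    return matching_rows * 2 * random_seek_ms

index_scan_ms(6)        # a handful of rows: the index wins easily
index_scan_ms(200_000)  # hundreds of thousands of seeks: far slower
                        # than just reading the whole table
```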

Let’s turn this into a JOIN to make it a little more fun:

EXPLAIN ANALYZE
 SELECT *
   FROM sys_order o
   JOIN sys_product p USING (product_id)
  WHERE p.product_id IN (5, 10, 20);

                             QUERY PLAN                             
--------------------------------------------------------------------
 Nested Loop  (cost=188.87..15715.29 rows=3614 width=20)
              (actual time=2.317..66.377 rows=30000 loops=1)
   ->  Index Only Scan using sys_product_pkey on sys_product p
             (cost=0.28..24.88 rows=3 width=4)
             (actual time=0.033..0.056 rows=3 loops=1)
         Index Cond: (product_id = ANY ('{5,10,20}'::integer[]))
         Heap Fetches: 3
   ->  Bitmap Heap Scan on sys_order o
              (cost=188.59..5130.14 rows=10000 width=20)
              (actual time=2.105..13.037 rows=10000 loops=3)
         Recheck Cond: (product_id = p.product_id)
         Heap Blocks: exact=19107
         ->  Bitmap Index Scan on idx_order_product_id
               (cost=0.00..186.09 rows=10000 width=0)
               (actual time=1.170..1.170 rows=10000 loops=3)
               Index Cond: (product_id = p.product_id)
 Planning time: 0.274 ms
 Execution time: 78.834 ms

Wow! Luckily we can ignore most of this output by focusing on the Nested Loop at the top. This is really just a fancy way of saying “For Loop”. This is also the outermost level of the query plan, as EXPLAIN output is somewhat inverted. Each nesting level is a prerequisite step to construct the layer above it, somewhat like an onion.

If we look at the actual execution information for each of these blocks, we can see that the inner nodes really ran three times (the nested entries show loops=3), even though the Nested Loop itself reports loops=1. The distinction is that the outermost loop ran only once, which can be somewhat misleading.

But basically PGDB fetched three rows from the product table, and looped through memory using those values to find corresponding data in the order table. And as you might imagine, this is fine for small numbers of loops. There are occasions where PostgreSQL will underestimate row counts and opt for a Nested Loop, when really there are hundreds of thousands or millions of potential loop candidates, and it will use the same process as we saw above.

In such cases, execution time is extremely adversely impacted. Why? Nested loops and other types of loop are computationally expensive. As far as the database is concerned, it’s almost always faster to create an in-memory object of some kind and perform a bulk operation on all values at once, such as a comparison or a transformation. Nested loops make this impossible, and impose a series of calculations and function callbacks on each individual row set at O(n*m) scale. For small loop counts, this is often cheaper than building those aforementioned memory representations, but scale is vastly important.
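
The difference is easy to see outside the database; a toy sketch of the two strategies (not PGDB’s executor, just the shape of the work):

```python
# 10,000 orders with product_id cycling 1..100; we want three products.
orders = [(order_id, order_id % 100 + 1) for order_id in range(10_000)]
wanted = {5, 10, 20}

# Nested loop: compare every order against every wanted id -> O(n*m)
nested = [o for o in orders for pid in wanted if o[1] == pid]

# Hash approach: build the lookup once, probe each row once -> O(n + m)
hashed = [o for o in orders if o[1] in wanted]

# Both find the same 300 matching orders; only the amount of work differs.
```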

Here’s an example of what I mean, and we don’t even need the whole EXPLAIN output to see the problem:

EXPLAIN ANALYZE
 SELECT *
   FROM sys_order o
   JOIN sys_product p USING (product_id)
  WHERE p.product_id BETWEEN 6 AND 10;

                             QUERY PLAN                             
--------------------------------------------------------------------
 Nested Loop  (cost=188.20..17588.55 rows=4000 width=20)
              (actual time=2.542..105.528 rows=50000 loops=1)

This should be ringing alarm bells. Note that PGDB expected 4,000 rows, but actually matched 50,000. At this scale that isn’t really a major issue, but expand these tables by a couple orders of magnitude, and that can quickly change. I’ve seen several queries where the estimated row count is five orders of magnitude less than the actual results. Disk fetches are exceptionally expensive, and multiplying random read times by a few thousand adds up. If a query using a nested loop is slow, always check the statistics in the EXPLAIN output; you may need to revise your query to compensate.

Those are the Big Three in a nutshell. Even if you know nothing else about PGDB, how it plans or executes queries, or even databases in general, try to examine the execution path of your queries whenever possible. This is especially true if you have a query that’s unexpectedly slow. Being cognizant of potential issues prevents malformed queries or structures from worming their way permanently into a production application, and that’s something we can all celebrate.

Happy querying.

Markup Language Faceoff: Lists

By pydanny from Django community aggregator: Community blog posts. Published on May 14, 2015.

Today I want to talk about lists. Not for shopping, not the programming data type, but the display of items in both unordered and ordered fashion. Specifically this:

  • Item A
  • Item B
    1. First Numbered Inner Item
    2. Second Numbered Inner Item
  • Item C

In other words, lists of bullets and numbers. This article explores some of the different tools used by the programming world to render display lists, specifically HTML, reStructuredText, Markdown, and LaTeX.

HTML
If you view the HTML source of this web page, you'll find this:

<ul class="simple">
<li>Item A</li>
<li>Item B<ol class="arabic">
<li>First Numbered Inner Item</li>
<li>Second Numbered Inner Item</li>
</ol></li>
<li>Item C</li>
</ul>

Or more clearly:

<ul class="simple">
    <li>Item A</li>
    <li>Item B
        <ol class="arabic">
            <li>First Numbered Inner Item</li>
            <li>Second Numbered Inner Item</li>
        </ol>
    </li>
    <li>Item C</li>
</ul>

This works, but is incredibly verbose. HTML requires closing tags on every element, so working with lists in HTML becomes tedious quickly. This is why so many people use WYSIWYG tools or markup languages like reStructuredText and Markdown, as they expedite the creation of lists (and many other things).

reStructuredText
This blog is written in reStructuredText and transformed into HTML. Let's see the markup for this blog post:

* Item A

* Item B

  #. First Numbered Inner Item

  #. Second Numbered Inner Item

* Item C

Notice the extra lines between bullets and numbers? A quirk of reStructuredText is that you have to put those in nested lists in order to make things display correctly. Also, 2 spaces indentation generates a different result than 4 spaces. I have no idea why this behavior exists, but I admit to finding both quirks annoying.

One thing to note about reStructuredText is that it's pretty much Python only. Outside the Python world if you are writing plaintext markup then odds are you are using Markdown.

Markdown
Markdown does lists really well. Terse and no weird quirks:

* Item A
* Item B
  1. First Numbered Inner Item
  1. Second Numbered Inner Item
* Item C

Another nice feature about Markdown is that it's in use everywhere. GitHub, Stack Overflow, my favorite tablet writing app, and a lot more.

Markdown vs. reStructuredText

Why don't I switch from reStructuredText to Markdown? Here are my reasons against switching:

  1. Force of habit.
  2. PyPI requires it to display package long descriptions nicely on Package pages.
  3. Sphinx is based on it.
  4. reStructuredText has one concrete standard, with extensions that people add. Markdown has many standards, which may or may not have shared features.
  5. I can use Pandoc to help transform reStructuredText to Markdown.

LaTeX
Finally, let's discuss LaTeX. While not a markup language it bears mentioning, and I'll explain why later in this section.

Up to about 8-10 years ago LaTeX was used in a lot of technical writing, including the Python core documentation. That ended with the rise of markup languages, relegating LaTeX to the world of academics, mathematicians, and computer scientists - anywhere complex equations need to be specified.

LaTeX belongs in this article because it is so commonly used with markup. In fact, as far as I can tell, in order to render reStructuredText and Markdown content into the PDF format, the most common approach is:

  1. Use a script to transform the markup into LaTeX.
  2. Use a tool like XeTeX to render the LaTeX into PDF.

Why the extra step? Why not just go directly from markup to PDF? Well, the content in reStructuredText and Markdown has to be formatted in order to be displayed, or it will just look like plaintext markup. When it is converted to HTML, the browser does the formatting for us. When it is translated to PDF, LaTeX is a very common choice. That is because LaTeX isn't a markup language, but a typesetting tool. Unlike reStructuredText and Markdown, which are designed for ease of use, LaTeX is designed to make documents look good.

Here is how I define my sample list in LaTeX:

\begin{itemize}
    \item Item A
    \item Item B
        \begin{enumerate}
            \item First Numbered Inner Item
            \item Second Numbered Inner Item
        \end{enumerate}
    \item Item C
\end{itemize}

Halfway between the markup languages and HTML in verbosity, LaTeX lists are of medium difficulty to write. If this example makes LaTeX look easy, please realize that while lists are easy to understand, other structures like LaTeX tables can quickly get out of hand. LaTeX's reputation for being an arcane tool is a well deserved one.

Modifying Generated LaTeX

Several book authors, including ourselves, have written books using reStructuredText or Markdown, generated the LaTeX, then modified the LaTeX before rendering the PDF. The approach is seductive: You get the ease of a markup language combined with the formatting precision of LaTeX.

Or do you?

The problem my wife and I have faced is that the combination of LaTeX packages and tools we've assembled for ourselves to write books like Two Scoops of Django is very, very different from what is rendered via docutils' rst2latex or Sphinx's make latex. We've tried to write migration scripts, but have found that we end up spending too much of our time on formatting. That's why we have stuck with hand-crafted artisan LaTeX.

That isn't to say it isn't possible. In fact, Matt Harrison has released a number handsome Python books following this path (reStructuredText to LaTeX). I'm certain there are Markdown books that follow this path too.

Closing Thoughts

For better or for worse, lists of bullets and numbers are a foundation of how we communicate via the written medium. They allow for terse communication of ideas and thought, but that same terseness can mean we skip over details. Interestingly enough, the very tools that we use to create lists can color our ability and desire to use them.

Naseby musket balls

Old-fashioned bullets, specifically matchlock musket balls, alleged to have been found at the site of the Battle of Naseby.

From the collection of Northampton Museum and Art Gallery.

Q2 2015 ShipIt Day ReCap

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on May 14, 2015.

Last Friday everyone at Caktus set aside their regular client projects for our quarterly ShipIt Day, a chance for Caktus employees to take some time for personal development and independent projects. People work individually or in groups to flex their creativity, tackle interesting problems, or expand their personal knowledge. This quarter’s ShipIt Day saw everything from game development to Bokeh data visualization, Lego robots to superhero animation. Read more about the various projects from our Q2 2015 ShipIt Day.

Victor worked on our version of Ultimate Tic Tac Toe, a hit at PyCon 2015. He added in Jeff Bradbury’s artificial intelligence component. Now you can play against the computer! Victor also cleaned up the code and open sourced the project, now available here:

Philip dove into @total_ordering, a Python decorator that fills in missing comparison methods on classes. Philip was curious as to why @total_ordering is necessary, and what might be the consequences of NOT using it. He discovered that though it is helpful when defining orderable classes, it is not as helpful as one would expect. In fact, rather than speeding things up, adding @total_ordering actually slows things down. But, he concluded, you should still use it to cover certain edge cases.
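
For reference, @total_ordering lives in functools: given __eq__ and one rich comparison method, it synthesizes the rest. A minimal example (the Version class is invented for illustration):

```python
from functools import total_ordering

@total_ordering
class Version:
    def __init__(self, major, minor):
        self.key = (major, minor)

    def __eq__(self, other):
        return self.key == other.key

    def __lt__(self, other):
        return self.key < other.key

# __gt__, __le__, and __ge__ are generated for us; they route through
# the methods above, which is the indirection that costs a bit of speed.
assert Version(1, 4) > Version(1, 2)
assert Version(2, 0) >= Version(2, 0)
```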

Karen updated our project template, the foundation for nearly all Caktus projects. The features she worked on will save us all a lot of time and daily annoyance. These included pulling DB from deployed environments, refreshing the staging environment from production, and more.

Erin explored Bokeh, a Python interactive data visualization library. She initially learned about building visualizations without javascript during PyCon (check out the video she recommended by Sarah Bird). She used Bokeh and the Google API to display data points on a map of Africa for potential use in one of our social impact projects.

Jeff B worked on Lisp implementation in Python. PyPy is written in a restricted version of Python (called RPython) and compiled down into highly efficient C or machine code. By implementing a toy version of Lisp on top of PyPy machinery, Jeff learned about how PyPy works.

Calvin and Colin built the beginnings of a live style guide into Caktus’ Django-project-template. The plan was loosely inspired by Mail Chimp's public style guide. They hope to eventually have a comprehensive guide of front-end elements to work with. Caktus will then be able to plug these elements in when building new client projects. This kind of design library should help things run smoothly between developers and the design team for front-end development.

Neil experimented with Mercury hoping the speed of the language would be a good addition to the Caktus toolkit. He then transitioned to building a project in Elm. He was able to develop some great looking hexagonal data visualizations. Most memorable was probably the final line of his presentation: “I was hoping to do more, but it turns out that teaching yourself a new programming language in six hours is really hard.” All Cakti developers nodded and smiled knowingly.

Caleb used Erlang and cowboy to build a small REST API. With more time, he hopes to provide a REST API that will provide geospatial searches for points of interest. This involves creating spatial indexes in Erlang’s built-in Mnesia database using geohashes.

Mark explored some of the issues raised in the Django-project-template and developed various fixes for them, including the way secrets are managed. Now anything that needs to be encrypted is encrypted with a public key generated when you bring up the SALT master. This fixes a very practical problem in the development workflow. He also developed a Django-project-template Heroku-style deploy, setting up a proof of concept project with a “git push” to deploy workflow.

Vinod took the time to read fellow developer Mark Lavin’s book Lightweight Django while I took up DRiVE by Daniel H. Pink to read about what motivates people to do good work or even complete rote tasks.

Scott worked with Dan to compare Salt states to Ansible playbooks. In addition, Dan took a look at Ember, working with the new framework as a potential for front-end app development. He built two simple apps, one for organizing albums in a playlist, and one for to-do lists. He had a lot of fun experimenting and working with the new framework.

Edward and Lucas built a minigame for our Epic Allies app. It was a fun, multi-slot, pinball machine game built with Unity3D.

Hunter built an HTML5 game using Phaser.js. Though he didn’t have the time to make a fully fledged video game, he did develop a fun looking boardgame with different characters, abilities, and animations.

NC developed several animations depicting running and jumping to be used to animate the superheros in our Epic Allies app. She loved learning about human movement, how to create realistic animations, and outputting the files in ways that will be useful to the rest of the Epic Allies team.

Wray showed us an ongoing project of his: a front-end framework called sassless, “the smallest CSS framework available.” It consists of front-end elements that allow you to set up a page in fractions so that they stay in position when resizing a browser window (to a point) rather than the elements stacking. In other words, you can build a responsive layout with a very lightweight CSS framework.

One of the most entertaining projects of the day was the collaboration between Rebecca C and Rob, who programmed Lego-bots to dance in a synced routine using the Lego NXT software. Aside from it being a lot of fun to watch robots (and coworkers) dance, the presence of programmable Lego-bots prompted a much welcome visit from Calvin’s son Caelan, who at the age of 9 is already learning to code!

Interactive Data for the Web by Sarah Bird (PyCon 2015 Must-See Talk: 3/6)

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on May 12, 2015.

Part three of six in our PyCon 2015 Must-See Series, a weekly highlight of talks our staff enjoyed at PyCon.

Sarah Bird's talk made me excited to try the Bokeh tutorials. The Bokeh library has very approachable methods for creating data visualizations inside of Canvas elements all via Python. No javascript necessary. Who should see this talk? Python developers who want to add a beautiful data visualization to their websites without writing any javascript. Also, Django developers who would like to use QuerySets to create data visualizations should watch the entire video, and then rewind to minute 8:50 for instructions on how to use Django QuerySets with a couple of lines of code.

After the talk, I wanted to build my own data visualization map of the world with plot points for one of my current Caktus projects. I followed up with one of the friendly developers from Continuum Analytics to find out that you do not need to spin up a separate Bokeh server to get your data visualizations running via Bokeh.

More in the PyCon 2015 Must-See Talks Series.

Cakti Comment on Django's Class-based Views

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on May 11, 2015.

After PyCon 2015, we were surprised to realize how many of the Cakti who attended had been asked about Django's class-based views (CBVs). We talked about why this might be, and this is a summary of what we came up with.

Lead Front-End Developer Calvin Spealman has noticed that there are many more tutorials on how to use CBVs than on how to decide whether to use them.

Astro Code School Lead Instructor Caleb Smith reminded us that while "less code" is sometimes given as an advantage of using CBVs, it really depends on what you're doing. Each case is different.

I pointed out that there seem to be some common misconceptions about CBVs.

Misconception: Functional views are deprecated and we're all supposed to be writing class-based views now.

Fact: Functional views are fully supported and not going anywhere. In many cases, they're a good choice.

Misconception: CBVs means using the generic class-based views that Django provides.

Fact: You can use as much or as little of Django's generic views as you like, and still be using class-based views. I like Vanilla Views as a simpler, easier-to-understand alternative to Django's generic views that still gives all the advantages of class-based views.

So, when to use class-based views? We decided the most common reason is if you want to reuse code across views. This is common, for example, when building APIs.

Caktus Technical Director Mark Lavin has a simple answer: "I default to writing functions and refactor to classes when needed writing Python. That doesn't change just because it's a Django view."

On the other hand, Developer Rebecca Muraya and I tend to just start with CBVs, since if the view ever needs to be refactored, that will be much easier if it was split into smaller pieces from the beginning. And so many views fall into the standard patterns of Browse, Read, Edit, Add, and Delete that you can often implement them very quickly by taking advantage of a library of common CBVs. But I'll fall back to Mark's system of starting with a functional view when I'm building something that has pretty unique behavior.
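The code-reuse argument comes down to the fact that a class-based view splits the request cycle into small, overridable methods. A minimal sketch of the idea in plain Python (this mimics, rather than imports, Django's `View.dispatch` pattern; the class and method names are illustrative):

```python
class View:
    """Toy version of a class-based view: dispatch() routes an HTTP
    method name to a handler method, so subclasses reuse the routing
    logic and override only the pieces they need."""

    def dispatch(self, method, **kwargs):
        handler = getattr(self, method.lower(), None)
        if handler is None:
            return "405 Method Not Allowed"
        return handler(**kwargs)


class GreetingView(View):
    # The subclass defines just the handlers it supports; everything
    # else (routing, the 405 fallback) is inherited.
    def get(self, name="world"):
        return f"Hello, {name}!"


view = GreetingView()
print(view.dispatch("GET", name="Caktus"))  # prints "Hello, Caktus!"
print(view.dispatch("DELETE"))              # prints "405 Method Not Allowed"
```

Refactoring a function-based view into this shape is mostly a matter of moving each branch of an `if request.method == ...` chain into its own method, which is why starting with small, single-purpose functions makes the later conversion easy.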

Griatch's Evennia musings

By Griatch's Evennia musings (MU* creation with Django+Twisted) from Django community aggregator: Community blog posts. Published on May 10, 2015.

We are currently in a maintenance and cleanup phase of Evennia, where bugs are found and reported and things are getting more and more stable as people learn and use the new features we merged a few months back.

Overall though, I must say the relatively big changes we made to the infrastructure (turning Evennia into a full library and completely overhauling the typeclass system behind the scenes) went over with surprising smoothness. There was a flurry of things to fix immediately after the devel-branch merger, but no more than expected. Given the size of the changes it worked very well, I think, with no big disaster stories. We have a few bugs lingering in the issue tracker that need to be addressed, but nothing show-stopping.

I have been a bit busy with various projects off-MUD, so to speak. I was contracted to make the cover and an illustration for a book (this is not hobby art for once, but a professional commission which I was quite excited to be asked to do). I also author and draw a fantasy comic as part of another project.

I've not been slacking off on the MUD side though: I have written and submitted an article for the revived Imaginary Realities e-zine (next issue should be out end of May/early June?) and another article (on Evennia) for the new Optional Realities MUD community website. I also contributed a game-design blurb for the latter's Dreaming Big contest, where you compete (for cash prizes, actually!) by submitting a game sales pitch of under 600 words.

The above-mentioned Optional Realities website is so far picking up pace, with quite good discussion in its forums (the similarity of its name to Imaginary Realities is unfortunate; apparently they were not aware of the former when choosing it). While targeted at discussions of RPI-style games (a sub-genre of roleplay-enforced MUDs with perma-death), it already hosts several invited articles on game design and general game development that can be interesting for any MU* dev.

People should know by now that I like to support MUD community efforts when possible, and Evennia is thus listed as an official "affiliate" of Optional Realities (which admittedly means little more than us linking to each other, but still). The team behind OR is, however, also using Evennia for their own "Project Redshift" sci-fi MUD, so we hope to get plenty of good feedback as their project matures.

Image: Truss from Gems NFX

Is Open Source Consulting Dead?

By chrism. Published on Sep 10, 2013.

Has Elvis left the building? Will we be able to sustain ourselves as open source consultants?

Consulting and Patent Indemnification

By chrism. Published on Aug 09, 2013.

Article about consulting and patent indemnification

Python Advent Calendar 2012 Topic

By chrism. Published on Dec 24, 2012.

An entry for the 2012 Japanese advent calendar at

Why I Like ZODB

By chrism. Published on May 15, 2012.

Why I like ZODB better than other persistence systems for writing real-world web applications.

A str.__iter__ Gotcha in Cross-Compatible Py2/Py3 Code

By chrism. Published on Mar 03, 2012.

A bug caused by a minor incompatibility can remain latent for long periods of time in a cross-compatible Python 2 / Python 3 codebase.

In Praise of Complaining

By chrism. Published on Jan 01, 2012.

In praise of complaining, even when the complaints are absurd.

2012 Python Meme

By chrism. Published on Dec 24, 2011.

My "Python meme" replies.

In Defense of Zope Libraries

By chrism. Published on Dec 19, 2011.

A much too long defense of Pyramid's use of Zope libraries.

Plone Conference 2011 Pyramid Sprint

By chrism. Published on Nov 10, 2011.

An update about the happenings at the recent 2011 Plone Conference Pyramid sprint.

Jobs-Ification of Software Development

By chrism. Published on Oct 17, 2011.

Try not to Jobs-ify the task of software development.

WebOb Now on Python 3

By chrism. Published on Oct 15, 2011.

Report about porting WebOb to Python 3.

Open Source Project Maintainer Sarcastic Response Cheat Sheet

By chrism. Published on Jun 12, 2011.

Need a sarcastic response to a support interaction as an open source project maintainer? Look no further!

Pylons Miniconference #0 Wrapup

By chrism. Published on May 04, 2011.

Last week, I visited the lovely Bay Area to attend the 0th Pylons Miniconference in San Francisco.

Pylons Project Meetup / Minicon

By chrism. Published on Apr 14, 2011.

In the SF Bay Area on the 28th, 29th, and 30th of this month (April), there will be 3 separate Pylons Project events.

PyCon 2011 Report

By chrism. Published on Mar 19, 2011.

My personal PyCon 2011 Report