Open Source Posts

Phasing Out Django Packages APIv1 & APIv2

By pydanny's blog from Django community aggregator: Community blog posts. Published on Mar 01, 2015.

It is time to upgrade Django Packages. If you are using the site's APIs in any way, this affects you.

This site, maintained by myself and Audrey Roy Greenfeld, is the directory of reusable apps, sites, and tools for Django-powered projects. And the site has been running on Django 1.4.x for about 2.5 years. Which in internet years is forever. It's time to upgrade!

Alas, we have a problem.

The Problem

The first REST API for the project, APIv1, is running on a very old version of django-tastypie (0.9.7), which blocks upgrading the Django version, even to Django 1.5. While we could do a lot of work to migrate upwards to modern django-tastypie, that would require a lot of time that we would rather spend adding new features or making other stuff. More importantly, there are elements of APIv1's design that we want to change.

While we are on the subject of legacy APIs, the second REST API for the project, the mostly undocumented APIv2, is powered by a relatively old version of Django Rest Framework (2.3.8). The design of APIv2 was a bit of an experiment in architectural design, one whose novel approach (API views in individual apps) was unwieldy and ultimately annoying. One might even call it an anti-pattern. That's fine, as sometimes you have to try things in order to determine better approaches for later, but it's time for this version of the API to die.

Our eventual goal is to get Django Packages running on Django 1.8 as well as implementing a brand-new REST API powered by Django Rest Framework. Getting there is the trick. Hence, we've created "The Plan".

The Plan

We have a four-stage plan for migrating the site upwards:

  1. On April 2, 2015 we will be turning off APIv1. All endpoints will redirect to APIv3.
  2. On April 3, 2015, we will be turning off APIv2. All endpoints will stop working.
  3. On April 4, we commence work on migrating the project to Django 1.8.
  4. Whenever we finish #3, we start work on APIv4, powered by Django Rest Framework.

About APIv3

Already implemented, this version of the API is powered by very simple JSON-powered views. Since both APIv1 and APIv2 only have GET endpoints, it was easy to roll something out that provided nearly all the data generated by both previous APIs.

Leo Hsu and Regina Obe: LATERAL WITH ORDINALITY - numbering sets

From Planet PostgreSQL. Published on Mar 01, 2015.

One of the neat little features that arrived in PostgreSQL 9.4 is the WITH ORDINALITY ANSI-SQL construct. This construct tacks on an additional column called ordinality when you use a set-returning function in the FROM part of an SQL statement.
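
As a rough analogy only (this is Python, not the PostgreSQL implementation), numbering the rows of a set-returning function resembles wrapping a generator in enumerate. The split_tags generator and the with_ordinality helper below are hypothetical names invented for this sketch:

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def with_ordinality(rows: Iterable[T]) -> Iterator[Tuple[T, int]]:
    """Pair each row with a 1-based position: value first, ordinality last."""
    for position, row in enumerate(rows, start=1):
        yield row, position


def split_tags(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a set-returning function."""
    for tag in text.split(","):
        yield tag.strip()


if __name__ == "__main__":
    for tag, ordinality in with_ordinality(split_tags("django, postgres, python")):
        print(ordinality, tag)
```

The numbering starts at 1 and follows the order in which the function emits its rows, which is the behaviour the ordinality column provides.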

Continue reading "LATERAL WITH ORDINALITY - numbering sets"

Guillaume LELARGE: pgstat 1.0.0 is out!

From Planet PostgreSQL. Published on Feb 28, 2015.

Since the last time I talked about it, I have received quite a bit of feedback, bug reports, pull requests, and so on. Many issues were fixed, the last of them tonight.

I also added two new reports. I had the idea while working on my customers' clusters.

One of them had a lot of writes on their databases, and I wanted to know how much was written to the WAL files. vmstat would only tell me how much was written to all files, but I was only interested in WAL writes. So I added a new report that grabs the current XLOG position and diffs it with the previous XLOG position. It gives something like this with a pgbench test:

$ ./pgstat -s xlog 
-------- filename -------- -- location -- ---- bytes ----
 00000001000000000000003E   0/3EC49940        1053071680
 00000001000000000000003E   0/3EC49940                 0
 00000001000000000000003E   0/3EC49940                 0
 00000001000000000000003E   0/3EC875F8            253112
 00000001000000000000003E   0/3ED585C8            856016
 00000001000000000000003E   0/3EE36C40            910968
 00000001000000000000003E   0/3EEFCC58            811032
 00000001000000000000003E   0/3EFAB9D0            716152
 00000001000000000000003F   0/3F06A3C0            780784
 00000001000000000000003F   0/3F0E79E0            513568
 00000001000000000000003F   0/3F1354E0            318208
 00000001000000000000003F   0/3F1F6218            789816
 00000001000000000000003F   0/3F2BCE00            814056
 00000001000000000000003F   0/3F323240            418880
 00000001000000000000003F   0/3F323240                 0
 00000001000000000000003F   0/3F323240                 0

These aren't big numbers, so it's easy to see that it writes at 253K/s, but if the numbers were bigger, they might get hard to read. One of my co-workers, Julien Rouhaud, added a human-readable option:

$ ./pgstat -s xlog -H
-------- filename -------- -- location -- ---- bytes ----
 00000001000000000000003F   0/3F32EDC0      1011 MB
 00000001000000000000003F   0/3F32EDC0      0 bytes
 00000001000000000000003F   0/3F32EDC0      0 bytes
 00000001000000000000003F   0/3F3ABC78      500 kB
 00000001000000000000003F   0/3F491C10      920 kB
 00000001000000000000003F   0/3F568548      858 kB
 00000001000000000000003F   0/3F634748      817 kB
 00000001000000000000003F   0/3F6F4378      767 kB
 00000001000000000000003F   0/3F7A56D8      709 kB
 00000001000000000000003F   0/3F8413D0      623 kB
 00000001000000000000003F   0/3F8D7590      600 kB
 00000001000000000000003F   0/3F970160      611 kB
 00000001000000000000003F   0/3F9F2840      522 kB
 00000001000000000000003F   0/3FA1FD88      181 kB
 00000001000000000000003F   0/3FA1FD88      0 bytes
 00000001000000000000003F   0/3FA1FD88      0 bytes
 00000001000000000000003F   0/3FA1FD88      0 bytes

That's indeed much more readable if you ask me.
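
The arithmetic behind the xlog report can be sketched in a few lines of Python. This is not pgstat's code, only an illustration that assumes the location column uses the usual high/low hexadecimal notation; the function names are made up for the example:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an XLOG position such as '0/3EC49940' into an absolute byte offset."""
    high, low = lsn.split("/")
    # The part before the slash holds the upper 32 bits, the part after it the lower 32 bits.
    return (int(high, 16) << 32) + int(low, 16)


def wal_bytes_written(previous: str, current: str) -> int:
    """Bytes of WAL generated between two samples of the XLOG position."""
    return lsn_to_bytes(current) - lsn_to_bytes(previous)


if __name__ == "__main__":
    # Two consecutive locations taken from the first report above.
    print(wal_bytes_written("0/3EC49940", "0/3EC875F8"))
```

Applied to consecutive locations from the first report, the differences match the bytes column shown there.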

Another customer wanted to know how many temporary files were written, and their sizes. Of course, you can get that with the pg_stat_database view, but it is only updated once the query is done. We wanted to see it while the query is executing. So I added another report:

$ ./pgstat -s tempfile
--- size --- --- count ---
         0             0
         0             0
  13082624             1
  34979840             1
  56016896             1
  56016896             1
  56016896             1
         0             0
         0             0

You see the file being stored.

Well, that's it for now. The 1.0.0 release is available on the github project.

Django Girls in Cardiff

By Blog - Django from Django community aggregator: Community blog posts. Published on Feb 28, 2015.

(From the Django Girls press release)

Django Girls is coming very soon to Cardiff - a first Wales edition of the international programming course directed at women. This non-profit event will take place during DjangoCon Europe, a large conference for programmers and will be attended by 45 women and girls who want to learn how to build web applications in just one day.

Hubert 'depesz' Lubaczewski: How to pg_upgrade …

From Planet PostgreSQL. Published on Feb 27, 2015.

One of my clients is upgrading some servers. The procedure we have took some time to get to current state, and we found some potential problems, so decided to write more about it. First, what we have, and what we want to have. We have usually 3 servers: master slave slave2 Both slaves use streaming […]

Shaun M. Thomas: PG Phriday: PostgreSQL Select Filters

From Planet PostgreSQL. Published on Feb 27, 2015.

Long have CASE statements been a double-edged sword in the database world. They’re functional, diverse, adaptive, and simple. Unfortunately they’re also somewhat bulky, and when it comes to using them to categorize aggregates, something of a hack. This is why I wanted to cry with joy when I found out that PostgreSQL 9.4 introduced a feature I’ve always wanted, but found difficult to express as a need. I mean, CASE statements are fine, right? Well, yes they are, but now we have something better. Now, we have the FILTER aggregate expression.

I always like working with examples, so let’s create some test data to illustrate just what I’m talking about.

CREATE TABLE sys_order
(
    order_id     SERIAL     NOT NULL,
    product_id   INT        NOT NULL,
    item_count   INT        NOT NULL,
    order_dt     TIMESTAMP  NOT NULL DEFAULT now()
);

INSERT INTO sys_order (product_id, item_count)
SELECT (a.id % 100) + 1, (random()*10)::INT + 1
  FROM generate_series(1, 1000000) a(id);

ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id
      PRIMARY KEY (order_id);

We now have a table for tracking fake orders, using 100 nonexistent products. I added the primary key after loading the table, which is a well-known DBA trick. Doing this after data loading means the index can be created in a single step, which is much more efficient than repeatedly extending an existing index.

With that out of the way, let’s do a basic product order count, since that’s something many people are already familiar with:

SELECT count(*) AS total
  FROM sys_order;

-[ RECORD 1 ]--
total | 1000000

No surprises here. But what happens when Jeff from Accounting wants to know how many people ordered five specific products as a column list? In the old days, we might do something like this:

SELECT sum(CASE WHEN product_id = 1 THEN item_count ELSE 0 END) AS horse_mask_count,
       sum(CASE WHEN product_id = 7 THEN item_count ELSE 0 END) AS eyeball_count,
       sum(CASE WHEN product_id = 13 THEN item_count ELSE 0 END) AS badger_count,
       sum(CASE WHEN product_id = 29 THEN item_count ELSE 0 END) AS orb_count,
       sum(CASE WHEN product_id = 73 THEN item_count ELSE 0 END) AS memebox_count
  FROM sys_order;

 horse_mask_count | eyeball_count | badger_count | orb_count | memebox_count 
            59870 |         59951 |        59601 |     59887 |         60189


As a DBA, I’ve seen more of these than I can reasonably stand, and hate them every single time. It’s not the use of the CASE statement that is so irksome, but the micromanaging methodology necessary to reduce the count to zero for unwanted items. With FILTER however, this query changes quite a bit:

SELECT sum(item_count) FILTER (WHERE product_id = 1) AS horse_mask_count,
       sum(item_count) FILTER (WHERE product_id = 7) AS eyeball_count,
       sum(item_count) FILTER (WHERE product_id = 13) AS badger_count,
       sum(item_count) FILTER (WHERE product_id = 29) AS orb_count,
       sum(item_count) FILTER (WHERE product_id = 73) AS memebox_count
  FROM sys_order;

 horse_mask_count | eyeball_count | badger_count | orb_count | memebox_count 
            59870 |         59951 |        59601 |     59887 |         60189

The query itself isn’t much shorter, but semantically, it’s far easier to understand what we’re trying to accomplish. It’s clear that we want the item count for each specific product, and nothing else. Further, since this is built-in functionality instead of a gross hack, it’s much faster. On the system we used for testing, after ten runs, the average time for the CASE variant was about 500ms; the FILTER version was about 300ms. In both cases, the query execution plan is identical. Internally however, invoking hundreds of thousands of CASE statements causes an immense CPU impact, where FILTER can utilize set grouping or another efficient stratagem based on the filter criteria. For large OLAP databases, this is a significant improvement in both query simplicity and performance.
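
One semantic difference between the two forms is easy to miss, so here is a small Python emulation of it. It is only a sketch of the SQL semantics, not of how PostgreSQL executes either query, and it says nothing about performance. The function names and the sample rows are invented for the illustration; None stands in for SQL NULL:

```python
from typing import List, Optional, Tuple

Order = Tuple[int, int]  # (product_id, item_count)


def sum_with_case(orders: List[Order], product_id: int) -> int:
    """Emulates sum(CASE WHEN product_id = X THEN item_count ELSE 0 END):
    every row feeds the aggregate, non-matching rows contribute 0."""
    return sum(count if pid == product_id else 0 for pid, count in orders)


def sum_with_filter(orders: List[Order], product_id: int) -> Optional[int]:
    """Emulates sum(item_count) FILTER (WHERE product_id = X):
    only matching rows reach the aggregate."""
    matching = [count for pid, count in orders if pid == product_id]
    # With no matching rows, sum() in SQL yields NULL rather than 0.
    return sum(matching) if matching else None


if __name__ == "__main__":
    orders = [(1, 3), (7, 5), (1, 2), (13, 4)]
    print(sum_with_case(orders, 1), sum_with_filter(orders, 1))
    print(sum_with_case(orders, 29), sum_with_filter(orders, 29))
```

For products that appear in the table both forms agree. For a product with no orders, the CASE form on a non-empty table reports 0 while the FILTER form reports NULL, so a report that needs zeros may want to wrap the FILTER version in coalesce.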

This is good stuff we are getting in these new releases, and I encourage everyone to enjoy all of the new toys we get every year. Some of them are much more than mere window-dressing.

gabrielle roth: PDXPUG February Meeting Recap

From Planet PostgreSQL. Published on Feb 26, 2015.

Mark & I left David Wheeler in charge of the PDXPUG February meeting while we were at SCALE last week.

Here’s David’s report:

This week Dave Kerr discussed using Bucardo multi-master replication to gradually migrate a production database from EC2 to RDS. This work allowed his team to switch back and forth between the two systems with the assurance that the data would be the same on both. It also allowed them a fallback in case the RDS database didn’t work out: the S3 system would still be there. The discussion allowed those present to complain about Bucardo, EC2, RDS, and the French.


Simple Django applications for writing Facebook applications and authentication

By Piotr Maliński from Django community aggregator: Community blog posts. Published on Feb 26, 2015.

Creating Facebook applications or integrating websites with Facebook isn't complex or very time consuming. In the case of Django there is, for example, the very big django-facebook package that provides a lot of Facebook-related features. A company I work with released a set of small applications that were used internally to build Facebook applications as well as Facebook authentication integrations. In this article I'll showcase those applications.

Ernst-Georg Schmid: The Long Tail - vertical table partitioning II

From Planet PostgreSQL. Published on Feb 26, 2015.

With all the parts from the previous post in place, some mechanism is needed to do the routine maintenance by calling the transfer function automagically.

Of course this could be done with pgAgent or it could be done with cron, but since it should be elegant, this calls for a background worker process.

For illustration purposes, I wrote a sample implementation called worker_ltt based on the worker_spi sample code. Sloppy - even the original comments are still in there.

Adding worker_ltt to shared_preload_libraries and

worker_ltt.naptime = 60
worker_ltt.database = 'yourdb'
worker_ltt.user = 'youruser'
worker_ltt.function = 'move_longtail'

to postgresql.conf starts executing move_longtail() every 60 seconds in yourdb as youruser. If the user is omitted, it runs with superuser rights!

Since move_longtail() basically can do anything, restricting the user is a good idea.

For more security, the SQL statements could be moved entirely into the background worker, but then the approach loses much of its flexibility... But this is a concept anyway, there is always room for improvement.

But it really works.

In part III I'll try to give a rough estimate of how big the performance penalty is when the partitioned table switches from fast to slow storage during a query. And there is another important problem to be solved...

Informing Users with django.contrib.messages

By GoDjango - Django Screencasts from Django community aggregator: Community blog posts. Published on Feb 26, 2015.

The messages framework can be a bit confusing to wrap your head around at first. Learn the basics of setting success and error messages and showing them to users. See the default Django way, then see how to do it with django-braces.

Daniel Pocock: PostBooks accounting and ERP suite coming to Fedora

From Planet PostgreSQL. Published on Feb 26, 2015.

PostBooks has been successful on Debian and Ubuntu for a while now and for all those who asked, it is finally coming to Fedora.

The review request has just been submitted and the spec files have also been submitted to xTuple as pull requests so future upstream releases can be used with rpmbuild to create packages.

Can you help?

A few small things outstanding:

  • Putting a launcher icon in the GNOME menus
  • Packaging the schemas - they are in separate packages on Debian/Ubuntu. Download them here and load the one you want into your PostgreSQL instance using the instructions from the Debian package.

Community support

The xTuple forum is a great place to ask any questions and get to know the community.


Here is a quick look at the login screen on a Fedora 19 host:

[Image: postbooks-fedora.png]

Josh Berkus: Why you might need statement_cost_limit

From Planet PostgreSQL. Published on Feb 26, 2015.

Here's a commonplace ops crisis: the developers push a new dashboard display widget for user homepages on your application. This seems to work fine in testing, and they push it out to production ... not realizing that for some large subset of users dissimilar from your tests, the generated query triggers a sequential scan on the second-largest table in the database. Suddenly your database servers are paralyzed with load, and you have to shut down the whole site and back out the changes.

Wouldn't it be nice if you could just tell the database server "don't run expensive queries for the 'web' user"?  Well, thanks to my colleague Andrew Dunstan, who wrote plan_filter with support from Twitch.TV, now you can.

Sort of.  Let me explain.

PostgreSQL has had statement_timeout for a while, which can be set on a per-user basis (or other places) to prevent application errors from running queries for hours.  However, this doesn't really solve the "overload" issue, because the query runs for that length of time, gobbling resources until it's terminated.  What you really want to do is return an error immediately if a query is going to be too costly.

plan_filter is a loadable module which allows you to set a limit on the cost of queries you can execute.  It works, as far as we know, with all versions of Postgres starting at 9.0 (we've tested 9.1, 9.3 and 9.4). 

Let me show you.  First, you have to load the module in postgresql.conf:

    shared_preload_libraries = 'plan_filter'

Then you alter the "web" user to have a strict limit:

    ALTER USER web SET plan_filter.statement_cost_limit = 200000.0;

Then try some brain-dead query as that user, like a blanket select from the 100m-row "edges" graph table:

    \c - web
    SELECT * FROM edges;

    STATEMENT:  select * from edges;
    ERROR:  plan cost limit exceeded
    HINT:  The plan for your query shows that it would probably
    have an excessive run time. This may be due to a logic error
    in the SQL, or it maybe just a very costly query. Rewrite 
    your query or increase the configuration parameter

Obviously, your application needs to handle this error gracefully, especially since you'll likely get it for hundreds or thousands of queries at once if you're sending bad queries due to a code change. But a bunch of errors is definitely better than having to restart your whole app cluster.   It's comparatively easy to just display a broken widget icon.

So why did I say "sort of", and why aren't we submitting this as a feature for PostgreSQL 9.5?

Well, there are some issues with limiting by plan cost.  The first is that if you can't run the query due to the cost limit, you also can't run an EXPLAIN to see why the query is so costly in the first place.  You'd need to set plan_filter.statement_cost_limit = 0 in your session to get the plan.

The second, and much bigger, issue is that plan cost estimates are just that: estimates.  They don't necessarily accurately show how long the query is actually going to take.  Also, unless you do a lot of cost tuning, costs do not necessarily consistently scale between very different queries.   Worst of all, some types of queries, especially those with LIMIT clauses, can return a cost in the plan which is much higher than the real cost because the planner expects to abort the query early.

So you're looking at a strong potential for false positives with statement_cost_limit.  This means that you need to set the limit very high (like 5000000) and work your way down, and also test this on your staging cluster to make sure that you're not bouncing lots of legitimate queries.  Overall, statement_cost_limit is mainly useful to DBAs who know their query workloads really well.

That means it's not ready for core Postgres (assuming it ever is).  Fortunately, PostgreSQL is extensible so you can use it right now while you wait for it to eventually become a feature, or to be supplanted by a better mechanism of resource control.

Ernst-Georg Schmid: The Long Tail - vertical table partitioning I

From Planet PostgreSQL. Published on Feb 26, 2015.

DISCLAIMER: This is just an idea, I haven't tried this in a production environment!

Having said that, any input is welcome. :-)

Storing tons of rows in a table of which only a small percentage of rows are frequently queried is a common scenario for RDBMS, especially with databases that have to keep historical information just in case they might be audited, e.g. in environments regulated by law.

With the advent of the first affordable 8TB harddisk and fast SSDs still being much more expensive per GB, I wondered if such long-tailed tables could be split over a SSD holding the frequently accessed pages and a near-line storage HDD keeping the archive - elegantly - with PostgreSQL.

By elegant, I mean without fiddling around with VIEWs or INSTEAD OF triggers, while exposing a clean and familiar interface to the developer.

OK, since PostgreSQL already supports horizontal partitioning, spreading one table transparently over many parallel tables, how about vertical partitioning, spreading one table over a hierarchy of speed?

The speed zones can be mapped to tablespaces:

CREATE TABLESPACE fast LOCATION '/mnt/fastdisk';
CREATE TABLESPACE slow LOCATION '/mnt/slowdisk';

Next comes the table(s):

CREATE TABLE the_table
(
  id integer NOT NULL,
  value real
)
TABLESPACE fast;

CREATE TABLE the_table_archive
(
)
INHERITS (the_table)
TABLESPACE slow;

Table inheritance in PostgreSQL is so cool...

And a function to move data from fast to slow:

CREATE OR REPLACE FUNCTION move_longtail()
  RETURNS boolean AS
$BODY$
DECLARE
  worked boolean;
  rowcount integer;
BEGIN
  worked := false;
  rowcount := count(*) FROM ONLY the_table WHERE id >= 5000000;
  IF (rowcount > 100000) THEN
    INSERT INTO the_table_archive SELECT * FROM ONLY the_table WHERE id >= 5000000;
    DELETE FROM ONLY the_table WHERE id >= 5000000;
    worked := true;
  END IF;
  RETURN worked;
END; $BODY$ LANGUAGE plpgsql;

This function runs only if a minimum of movable rows qualify. This is recommended since SMR disks like large contiguous writes due to how SMR technically works.

Notice the ONLY keyword. It allows precise control over which partition of the table is affected by the DML statement. Did I say already that table inheritance in PostgreSQL is so cool?

And basically that's it. All the developer sees is SELECT * FROM the_table; or something, being oblivious to the underlying machinery.

Some kind of automatic maintenance is missing. On to part II...

Michael Paquier: Postgres 9.5 feature highlight: Control WAL retrieval with wal_retrieve_retry_interval

From Planet PostgreSQL. Published on Feb 25, 2015.

Up to Postgres 9.4, when a node in recovery checks for the availability of WAL from a source, be it a WAL stream, WAL archive or local pg_xlog, and fails to obtain what it wanted, it has to wait for 5s, an amount of time hardcoded directly in xlog.c. 9.5 brings more flexibility with a built-in parameter that controls this interval, thanks to this commit:

commit: 5d2b45e3f78a85639f30431181c06d4c3221c5a1
author: Fujii Masao
date: Mon, 23 Feb 2015 20:55:17 +0900
Add GUC to control the time to wait before retrieving WAL after failed attempt.

Previously when the standby server failed to retrieve WAL files from any sources
(i.e., streaming replication, local pg_xlog directory or WAL archive), it always
waited for five seconds (hard-coded) before the next attempt. For example,
this is problematic in warm-standby because restore_command can fail
every five seconds even while new WAL file is expected to be unavailable for
a long time and flood the log files with its error messages.

This commit adds new parameter, wal_retrieve_retry_interval, to control that
wait time.

Alexey Vasiliev and Michael Paquier, reviewed by Andres Freund and me.

wal_retrieve_retry_interval is a SIGHUP parameter of postgresql.conf (it can be updated by reloading the configuration, without restarting the server) that controls this check interval when a node is in recovery. Setting it lower than its default of 5s increases, for example, how often a warm-standby node tries to get WAL from a source. On the contrary, a higher value can help reduce log noise and repeated attempts to retrieve a missing WAL archive, for example when WAL archives are located on an external instance that is priced based on the number of connections attempted or similar. Note as well that a longer interval can be achieved with some timestamp control in a script kicked by restore_command; still, it is good to have a built-in option instead of some scripting magic.

Using this parameter is simple, for example with a warm-standby node set as follows:

$ grep -e wal_retrieve_retry_interval -e log_line_prefix postgresql.conf
wal_retrieve_retry_interval = 100ms
log_line_prefix = 'time %m:'
$ cat recovery.conf
# Track milliseconds easily for each command kicked
restore_command = 'echo $(($(date +%%s%%N)/1000000)) && cp -i /path/to/wal/archive/%f %p'
standby_mode = on
recovery_target_timeline = 'latest'

The following successive attempts are done to try to get WAL:

cp: cannot stat '/home/ioltas/archive/5432/000000010000000000000004': No such file or directory
cp: cannot stat '/home/ioltas/archive/5432/000000010000000000000004': No such file or directory
# 101 ms of difference

And then after switching to 20s:

cp: cannot stat '/home/ioltas/archive/5432/000000010000000000000005': No such file or directory
cp: cannot stat '/home/ioltas/archive/5432/000000010000000000000005': No such file or directory
# 20023ms of difference

Something else to note is that the wait processing has been switched from pg_usleep, which may not stop on certain platforms after receiving a signal, to a latch, which in particular improves postmaster death detection.

Dying NTP daemons on vsphere vmware machines

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on Feb 25, 2015.

We (Nelen & Schuurmans) have quite some servers. Most of them are vmware virtual machines in a vsphere cluster.

Once in a while, one or more of the machines got reported by our monitoring tool (zabbix) as having a time drift problem. Weird, as we have NTP running everywhere. And weird if you look at django logfiles and see a negative jump in time all of a sudden.

We run ntpd everywhere to keep the time in sync with two windows domain servers. Every time a server drifted, the ntpd daemon turned out to have died. Without leaving any trace in any logfile.

ntpd kills itself when the time drift is more than 20 minutes or so, assuming that it hurts more than it helps. There's a switch to prevent this self-killing behaviour, but ntpd killed itself anyway.

In the end, an external sysadmin found the problem:

  • One of the physical vsphere host machines (big server, lots of blades) was mis-configured: the ntp daemon on the host machine itself was configured, but it was not configured to automatically start when you start up the server...
  • This host machine started to drift its time, naturally.
  • Several actions vsphere does on a VM result in a very very short period where the VM is frozen. Actions like "full backup", "snapshot" and "automatically moving from one host machine to another for performance reasons". Very short, but vmware does adjust the time inside the VM. It keeps track of how long the quick action took and adjusts the VM's time accordingly.
  • It adjusts the time relative to the host machine's time. So if an action took 1 second, the second is added to the host machine's time and the result is set as the VM time. All is still well if the VM stays on the same host.
  • If the action includes moving the VM to a different host... And that host is the one with the drifted time.... If the host machine's time has drifted by an hour, the VM that gets moved to that host suddenly gets its internal time moved by an hour...

So it was a combination of host machines with a drifted time and the fact that vmware adjusts the VM's time after certain actions.
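
The effect can be put into numbers with a small Python sketch. It only models the description above (VM time after an action = host time + duration of the action); the timestamps, the one-hour drift and the function name are made-up values for illustration, not anything vmware exposes:

```python
from datetime import datetime, timedelta


def vm_time_after_action(host_time: datetime, action_duration: timedelta) -> datetime:
    """Time set inside the VM after a freeze: the host's clock plus the time the action took."""
    return host_time + action_duration


if __name__ == "__main__":
    true_time = datetime(2015, 2, 25, 12, 0, 0)
    healthy_host_clock = true_time
    drifted_host_clock = true_time + timedelta(hours=1)  # hypothetical drift
    pause = timedelta(seconds=1)

    stays_put = vm_time_after_action(healthy_host_clock, pause)
    moved = vm_time_after_action(drifted_host_clock, pause)
    print("VM clock jump after the move:", moved - stays_put)
```

A jump of that size is far beyond what ntpd is willing to correct, which fits with the daemon giving up on the affected machines.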

Writing it down as it might help someone googling for this problem :-)

Andrew Dunstan: Stopping expensive queries before they start

From Planet PostgreSQL. Published on Feb 25, 2015.

Today we're releasing the code for a small PostgreSQL module called plan_filter that lets you stop queries from even starting if they meet certain criteria. Currently the module implements one such criterion: the estimated cost of the query.

After you have built and installed it, you add a couple of settings to the postgresql.conf file, like this:

shared_preload_libraries = 'plan_filter'
plan_filter.statement_cost_limit = 100000.0

Then if the planner estimates the cost as higher than the statement_cost_limit it will raise an error rather than allowing the query to run.

This module follows an idea from a discussion on the postgresql-hackers mailing list some time ago. It was developed by PostgreSQL Experts Inc for our client Twitch.TV, who have generously allowed us to make it publicly available.

Andrew Dunstan: Raspberry Pi 2 coming to the buildfarm

From Planet PostgreSQL. Published on Feb 25, 2015.

Yesterday I ordered a Raspberry Pi 2 Model B, and it should be delivered in a few days. I'm intending to set it up as a buildfarm member. The fact that you can purchase a general purpose computer the size of a credit card with a quad-core processor and 1GB of memory (I remember when RAM was counted in kilobytes), and all for USD 35.00, is amazing, even when you remember Moore's Law.

Jehan-Guillaume (ioguix) de Rorthais: New repository for bloat estimation queries

From Planet PostgreSQL. Published on Feb 25, 2015.

New repository

It has been almost a year since I wrote the first version of the btree bloat estimation query. Then came the first fixes, the bloat estimation queries for tables, more fixes, and so on. Maintaining these queries as gists on github was quite difficult and lacked some features: no documented history, multiple links, no doc, impossible to fork, etc.

So I decided to move everything to a git repository you can fork right away. There are already 10 commits for improvements and bug fixes.

Do not hesitate to fork this repo, play with the queries, test them or make pull requests. Another way to help is to discuss your results or report bugs by opening issues. This can lead to bug fixes or the creation of a FAQ.


Here is a quick changelog since my last post about bloat:

  • support for fillfactor! Previous versions of the queries were considering any extra space as bloat, even the fillfactor. Now, the bloat is reported without it. So a btree with the default fillfactor and no bloat will report a bloat of 0%, not 10%.
  • fix bad tuple header size for tables under 8.0, 8.1 or 8.2.
  • fix bad header size computation for varlena types.
  • fix illegal division by 0 for the btrees.
  • added some documentation! See
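
The fillfactor item in the changelog above can be illustrated with a bit of arithmetic. This Python sketch is a deliberately simplified model, not the formula used by the actual queries; the function and parameter names are assumptions made for the example, and 90 is the default fillfactor for btree indexes:

```python
def bloat_percent(actual_pages: int, packed_pages: int, fillfactor: int = 100) -> float:
    """Estimated bloat as a percentage of the relation's current size.

    packed_pages is the number of pages the data would need if every page
    were filled to 100%. With a fillfactor below 100, the free space that
    is reserved on purpose is counted as expected, not as bloat.
    """
    expected_pages = packed_pages * 100 / fillfactor
    return 100 * (actual_pages - expected_pages) / actual_pages


if __name__ == "__main__":
    # A freshly built btree: 1000 pages, each filled to the default 90%.
    print(bloat_percent(1000, 900))                 # fillfactor ignored
    print(bloat_percent(1000, 900, fillfactor=90))  # fillfactor taken into account
```

Ignoring the fillfactor makes the fresh index look 10% bloated, while accounting for it brings the estimate down to 0%.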

In conclusion, do not hesitate to use these queries in your projects, contribute to them and give some feedback!

Ernst-Georg Schmid: pgchem::tigress 3.2 released

From Planet PostgreSQL. Published on Feb 25, 2015.

pgchem::tigress 3.2 is finally out!

  • This builds against PostgreSQL 9.4 and OpenBabel 2.3.2 on Linux.
  • It contains all fixes and contributions of the previous versions.
  • Windows is not supported anymore - and since it builds and runs way better on Linux, probably never will be again.
  • Depiction functions have been removed. Their run time was too unpredictable to be run inside a database server.
  • Theoretical isotope pattern generation with MERCURY7 is now available with isotopes for 39 elements.

So: CREATE EXTENSION pgchem_tigress;

Hubert 'depesz' Lubaczewski: Waiting for 9.5 – Replace checkpoint_segments with min_wal_size and max_wal_size.

From Planet PostgreSQL. Published on Feb 25, 2015.

On 23rd of February, Heikki Linnakangas committed patch: Replace checkpoint_segments with min_wal_size and max_wal_size.   Instead of having a single knob (checkpoint_segments) that both triggers checkpoints, and determines how many checkpoints to recycle, they are now separate concerns. There is still an internal variable called CheckpointSegments, which triggers checkpoints. But it no longer determines how […]

Joshua Drake: PostgreSQL is King! Last week was quite busy being a servant.

From Planet PostgreSQL. Published on Feb 25, 2015.

Last week was one of the busiest community weeks I have had in a long time. It started with an excellent time in Vancouver, B.C. giving my presentation, "An evening with PostgreSQL!" at VanLUG. These are a great group of people. They took all my jibes with good humor (Canadians gave us Maple Syrup, we gave them Fox News) and we enjoyed not only technical discussion but discussions on technology in general. It is still amazing to me how many people don't realize that Linux 3.2 - 3.8 is a dead end for random IO performance.

After VanLUG I spent the next morning at the Vancouver Aquarium with my ladies. Nothing like beautiful weather, dolphins and jelly fish to brighten the week. Once back in Bellingham, we moved on to a WhatcomPUG meeting where I presented, "Practical PostgreSQL: AWS Edition". It was the inaugural meeting but was attended by more than just the founders which is a great start!

I got to rest from community work on Wednesday and instead dug my head into some performance problems on a client High Availability Cluster. It is amazing how much faster ASYNC rep is than SYNC rep, even with proper provisioning. After some detailed diagnosis and data to prove the point, we switched to ASYNC rep and all critical problems were resolved.

On Thursday it was off to Southern California Linux Expo where I presented, "Suck it! Webscale is dead; long live PostgreSQL!". The room was packed, people laughed and for those who might have been offended, I warned you. Your offense is your problem. Look inside yourself for your insecurities! All my talks are PG-13 and it is rare that I will shy away from any topic. My disclosure aside, I had two favorite moments:

  1. When someone was willing to admit they hadn't seen Terminator. I doubt that person will ever raise his hand to one of my questions again.
  2. When Berkus (who knew the real answer) suggested it was Elton John that wrote the lyrics at the end of the presentation.

Afterwards I spent the evening with JimmyM (BigJim, my brother), Joe Conway of SDPUG/Credativ, Jim Nasby of the fledgling bird that is Blue Treble, and the very enjoyable woman from Enova whose name I don't remember (Enova is a well known PostgreSQL installation). Flying out the next morning at 8am probably should have been avoided though.

I am glad to be on the ground for the next few weeks before I head off to PgConf.US. It is looking like this conference will once again prove why PostgreSQL is King! Bring your people from all the lands, you are about to enter utopia.

Tomas Vondra: Prague PostgreSQL Developer Day 2015

From Planet PostgreSQL. Published on Feb 24, 2015.

So ... Prague PostgreSQL Developer Day 2015, the local PostgreSQL conference, happened about two weeks ago. Now that we collected all the available feedback, it's probably the right time for a short report and sketch of plans for next year.


The first year of Prague PostgreSQL Developer Day (P2D2) happened in 2008, and from the very beginning was organized as a community event for developers - from students of software engineering to people who use PostgreSQL at work.

We've changed the venue a few times, but in most cases we've just moved from one university / faculty to another one, and the same happened this year for capacity reasons. The previous venue at Charles University served us well, but we couldn't stuff more than 120 people in, and we usually reached that limit within a week after opening the registration. The new venue, located at Czech Technical University can handle up to ~180 people, which should be enough for the near future - this year we registered 150 people, but a few more ended on a wait list.

The most obvious change was adding a full day of trainings on February 11 (i.e. the day before the main conference day), similarly to what happens at and various other conferences. The feedback to this is overwhelmingly good, so we're pretty sure we'll preserve this for the next years.

The main conference (on February 12) consisted of 9 talks, not counting the initial "welcome" speech. We had the usual mixture of talks, from a brief talk about features introduced in 9.4, talks about using PostgreSQL in actual projects, to a talk about Bi-Directional Replication.

Although the conference is aimed at local users, and thus the majority of talks are in either Czech or Slovak, every year we invite a few foreign speakers to give talks in English. This year we had the pleasure to welcome Marc Balmer, who gave a talk "Securing your PostgreSQL applications", and Keith Fiske explaining that "When PostgreSQL Can't You Can".

In the early years of the conference we kept getting "too many talks in English" feedback whenever we had more than two talks in English, but judging by how well the English talks were rated this year (Marc's talk even made it to "TOP 3"), the times are probably changing and we'll consider inviting more foreign speakers next year. So if you'd like to visit Prague in February 2016, watch out for our CfP or ping me sometime in September.

There were a few more changes, like recording most of the talks (will publish the talks in the future, but as most of them are in Czech ...), but those are mostly invisible to regular attendees. Nevertheless, there's plenty of things to improve in the next year, of course.

None of this would be possible without support from companies sponsoring the event - Avast, 2ndQuadrant, Elos Technologies, GoodData, ASW Systems, OptiSolutions, LinuxBox, mistaCMS and chemcomex.

Luca Ferrari: Thank you ITPUG

From Planet PostgreSQL. Published on Feb 24, 2015.

2014 was a very bad year, one I will remember forever for the things and the people I missed. It was also the first year I missed PGDay.IT, but today, thanks to the board of directors and volunteers, I received the shirts of the event. This is a great thing for me, as it makes me feel part of this great community. Special thanks also to OpenERP Italia!

Greg Sabino Mullane: Postgres ON_ERROR_ROLLBACK explained

From Planet PostgreSQL. Published on Feb 24, 2015.

Way back in 2005 I added the ON_ERROR_ROLLBACK feature to psql, the Postgres command line client. When enabled, any errors cause an immediate rollback to just before the previous command. What this means is that you can stay inside your transaction, even if you make a typo (the main error-causing problem and the reason I wrote it!). Since I sometimes see people wanting to emulate this feature in their application or driver, I thought I would explain exactly how it works in psql.

First, it must be understood that this is not a Postgres feature, and there is no way you can instruct Postgres itself to ignore errors inside of a transaction. The work must be done by a client (such as psql) that can do some voodoo behind the scenes. The ON_ERROR_ROLLBACK feature is available since psql version 8.1.

Normally, any error you make will throw an exception and cause your current transaction to be marked as aborted. This is sane and expected behavior, but it can be very, very annoying if it happens when you are in the middle of a large transaction and mistype something! At that point, the only thing you can do is rollback the transaction and lose all of your work. For example:

greg=# CREATE TABLE somi(fav_song TEXT, passphrase TEXT, avatar TEXT);
greg=# begin;
greg=# INSERT INTO somi VALUES ('The Perfect Partner', 'ZrgRQaa9ZsUHa', 'Andrastea');
greg=# INSERT INTO somi VALUES ('Holding Out For a Hero', 'dx8yGUbsfaely', 'Janus');
greg=# INSERT INTO somi BALUES ('Three Little Birds', '2pX9V8AKJRzy', 'Charon');
ERROR:  syntax error at or near "BALUES"
LINE 1: INSERT INTO somi BALUES ('Three Little Birds', '2pX9V8AKJRzy'...
greg=# INSERT INTO somi VALUES ('Three Little Birds', '2pX9V8AKJRzy', 'Charon');
ERROR:  current transaction is aborted, commands ignored until end of transaction block
greg=# rollback;
greg=# select count(*) from somi;

When ON_ERROR_ROLLBACK is enabled, psql will issue a SAVEPOINT before every command you send to Postgres. If an error is detected, it will then issue a ROLLBACK TO the previous savepoint, which basically rewinds history (good analogy needed time travel) to the point in time just before you issued the command. Which then gives you a chance to re-enter the command without the mistake. If an error was not detected, psql does a RELEASE savepoint behind the scenes, as there is no longer any reason to keep the savepoint around. So our example above becomes:

greg=# \set ON_ERROR_ROLLBACK interactive
greg=# begin;
greg=# INSERT INTO somi VALUES ('Volcano', 'jA0EBAMCV4e+-^', 'Phobos');
greg=# INSERT INTO somi VALUES ('Son of a Son of a Sailor', 'H0qHJ3kMoVR7e', 'Proteus');
greg=# INSERT INTO somi BALUES ('Xanadu', 'KaK/uxtgyT1ni', 'Metis');
ERROR:  syntax error at or near "BALUES"
LINE 1: INSERT INTO somi BALUES ('Xanadu', 'KaK/uxtgyT1ni'...
greg=# INSERT INTO somi VALUES ('Xanadu', 'KaK/uxtgyT1ni', 'Metis');
greg=# commit;
greg=# select count(*) from somi;

What about if you create a savepoint yourself? Or even a savepoint with the same name as the one that psql uses internally? Not a problem - Postgres allows multiple savepoints with the same name, and will rollback or release the latest one created, which allows ON_ERROR_ROLLBACK to work seamlessly with user-provided savepoints.

Note that the example above sets ON_ERROR_ROLLBACK (yes it is case sensitive!) to 'interactive', not just 'on'. This is a good idea, as you generally want it to catch human errors, and not just plow through a SQL script.

So, if you want to add this to your own application, you will need to wrap each command in a hidden savepoint, and then rollback or release it. The end-user should not see the SAVEPOINT, ROLLBACK TO, or RELEASE commands. Thus, the SQL sent to the backend will change from this:

BEGIN; ## entered by the user
INSERT INTO somi VALUES ('Mr. Roboto', 'H0qHJ3kMoVR7e', 'Triton');
INSERT INTO somi VALUES ('A Mountain We Will Climb', 'O2DMZfqnfj8Tle', 'Tethys');
INSERT INTO somi BALUES ('Samba de Janeiro', 'W2rQpGU0MfIrm', 'Dione');

to this:

BEGIN; ## entered by the user
SAVEPOINT myapp_temporary_savepoint ## entered by the application
INSERT INTO somi VALUES ('Mr. Roboto', 'H0qHJ3kMoVR7e', 'Triton');
RELEASE myapp_temporary_savepoint

SAVEPOINT myapp_temporary_savepoint
INSERT INTO somi VALUES ('A Mountain We Will Climb', 'O2DMZfqnfj8Tle', 'Tethys');
RELEASE myapp_temporary_savepoint

SAVEPOINT myapp_temporary_savepoint
INSERT INTO somi BALUES ('Samba de Janeiro', 'W2rQpGU0MfIrm', 'Dione');
ROLLBACK TO myapp_temporary_savepoint

Here is some pseudo-code illustrating the sequence of events. To see the actual implementation in psql, take a look at bin/psql/common.c

run("SAVEPOINT myapp_temporary_savepoint");
run(user_command);  ## the command entered by the user
if (txn_status == ERROR) {
  run("ROLLBACK TO myapp_temporary_savepoint");
}
if (command was "savepoint" or "release" or "rollback") {
  ## do nothing
}
elsif (txn_status == IN_TRANSACTION) {
  run("RELEASE myapp_temporary_savepoint");
}

While there is some overhead in constantly creating and tearing down so many savepoints, it is quite small, especially if you are using it in an interactive session. This ability to automatically roll things back is especially powerful when you remember that Postgres can roll everything back, including DDL (e.g. CREATE TABLE). Certain other expensive database systems do not play well when mixing DDL and transactions.
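For readers who want to try the wrapping logic without a Postgres server, here is a minimal Python sketch of the same sequence. It is an illustration under stated assumptions, not psql's implementation: it uses the standard library's sqlite3 module as a stand-in (SQLite also supports SAVEPOINT, ROLLBACK TO and RELEASE, but unlike Postgres it does not mark the whole transaction as aborted after an error), the helper name run_guarded is hypothetical, and it skips the special case for user-issued savepoint commands.

```python
import sqlite3

# Hypothetical savepoint name, matching the example above.
SAVEPOINT_NAME = "myapp_temporary_savepoint"


def run_guarded(conn, sql):
    """Run one user statement inside a hidden savepoint.

    Returns True if the statement succeeded, False if it was rolled back.
    """
    conn.execute(f"SAVEPOINT {SAVEPOINT_NAME}")
    try:
        conn.execute(sql)
    except sqlite3.Error as exc:
        # Rewind to just before the failed statement; the transaction stays open.
        conn.execute(f"ROLLBACK TO {SAVEPOINT_NAME}")
        print(f"ERROR: {exc}")
        return False
    # No error, so the savepoint is no longer needed.
    conn.execute(f"RELEASE {SAVEPOINT_NAME}")
    return True


# isolation_level=None stops the sqlite3 module from issuing its own BEGINs.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE somi(fav_song TEXT, passphrase TEXT, avatar TEXT)")
conn.execute("BEGIN")
run_guarded(conn, "INSERT INTO somi VALUES ('Volcano', 'jA0EBAMCV4e+-^', 'Phobos')")
run_guarded(conn, "INSERT INTO somi BALUES ('Xanadu', 'KaK/uxtgyT1ni', 'Metis')")  # typo
run_guarded(conn, "INSERT INTO somi VALUES ('Xanadu', 'KaK/uxtgyT1ni', 'Metis')")
conn.execute("COMMIT")

row_count = conn.execute("SELECT count(*) FROM somi").fetchone()[0]
print(row_count)
```

The typo is reported and rewound, while the two valid inserts survive the commit. A real driver would probably want to surface the error to the caller instead of printing it.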

Hans-Juergen Schoenig: PostgreSQL 9.4 aggregation filters: They do pay off

From Planet PostgreSQL. Published on Feb 24, 2015.

In my previous posting on PostgreSQL 9.4 I have shown aggregate FILTER clauses, which are a neat way to make partial aggregates more readable. Inspired by some comments to this blog post I decided to create a follow up posting to see which impact this new FILTER clause has on performance. Loading some demo data To […]

Luca Ferrari: ITPUG interview

From Planet PostgreSQL. Published on Feb 22, 2015.

Thanks to the effort of some of our associates, we were able to conduct a short interview with the associates themselves in order to see how ITPUG is working and how they feel within the association. The results, in Italian, are available here for a first brief description. As a general trend, ITPUG is going fine, or even better than it was going a few years ago. However there is still a

Hubert 'depesz' Lubaczewski: I have PostgreSQL, loaded some data, and have app using it. Now what?

From Planet PostgreSQL. Published on Feb 21, 2015.

I had to deal with this question, or some version of it, quite a few times. So, decided to write a summary on what one could (or should) do, after data is in database, and application is running. Namely – setup some kind of replication and backups. What to use, how, and why? This is […]

Ernst-Georg Schmid: Counting rows again - inheritance strangeness solved

From Planet PostgreSQL. Published on Feb 20, 2015.

Taking a second look at the execution plans, I've noticed that the scan on slow reads twice as many pages from disk as the one on fast:

Buffers: shared read=442478 vs. Buffers: shared read=221239

Since I loaded all rows into slow first and then moved 50% of them into fast, this makes sense, I guess.
If I understand it correctly, those pages in slow are now empty, but PostgreSQL keeps them for future use.

So I tried a VACUUM FULL on slow and ran my queries again. That changed the plans:

Buffers: shared read=221239 vs. Buffers: shared read=221239

And execution times are now about equal.

Jim Mlodgenski: PostgreSQL PL/pgSQL Profiler

From Planet PostgreSQL. Published on Feb 20, 2015.

Some of our customers really like writing their business logic inside of PostgreSQL. While it is really cool that PostgreSQL is capable of handling this, trying to performance tune large amounts of PL/pgSQL code becomes unwieldy. If your functions are small enough, it’s possible to add some logging statements, but that is not possible with hundreds or even thousands of lines of legacy code.

Several years ago as part of the PL/pgSQL debugger, Korry Douglas wrote a PL/pgSQL profiler, but over the years, it seems to have suffered from bit rot. A profiler for PL/pgSQL code helps solve a lot of problems and gives us insight into how your server side code is running.

Below is an example output from the profiler showing how many times each line of code executed and what was the time taken for each line.

PL/pgSQL Profiler



The plprofiler has not been tested in many different environments yet, so be careful in rolling it out to production servers. Check it out and let me know if you find any issues.


Michael Paquier: Short story with pg_dump, directory format and compression level

From Planet PostgreSQL. Published on Feb 20, 2015.

Just this week, a bug regarding pg_dump and compression with zlib when dumping data has been reported here.

The issue was that when calling -Fd, the compression level specified by -Z was ignored, so the compressed dump had the same size for any Z > 0. For example with a simple table:

=# CREATE TABLE dump_tab AS SELECT random() as a,
                                   random() as b
   FROM generate_series(1,10000000);
SELECT 10000000

A dump keeps the same size whatever the compression level specified:

$ for num in {0..4}; do pg_dump -Fd -t dump_tab -f \
    level_$num.dump -Z $num ; done
$ ls -l level_?.dump/????.dat.gz level_0.dump/????.dat
-rw-r--r--  1 michael  staff  419999247 Feb 20 22:13 level_0.dump/2308.dat
-rw-r--r--  1 michael  staff  195402899 Feb 20 22:13 level_1.dump/2308.dat.gz
-rw-r--r--  1 michael  staff  195402899 Feb 20 22:14 level_2.dump/2308.dat.gz
-rw-r--r--  1 michael  staff  195402899 Feb 20 22:15 level_3.dump/2308.dat.gz
-rw-r--r--  1 michael  staff  195402899 Feb 20 22:16 level_4.dump/2308.dat.gz

After a couple of emails exchanged, it was found out that a call to gzopen() missed the compression level: for example to do a compression of level 7, the compression mode (without a strategy) needs to be something like "w7" or "wb7" but the last digit was simply missing. An important thing to note is how quickly the bug has been addressed, the issue being fixed within one day with this commit (that will be available in the next series of minor releases 9.4.2, 9.3.7, etc.):

commit: 0e7e355f27302b62af3e1add93853ccd45678443
author: Tom Lane <>
date: Wed, 18 Feb 2015 11:43:00 -0500
Fix failure to honor -Z compression level option in pg_dump -Fd.

cfopen() and cfopen_write() failed to pass the compression level through
to zlib, so that you always got the default compression level if you got
any at all.

In passing, also fix these and related functions so that the correct errno
is reliably returned on failure; the original coding supposes that free()
cannot change errno, which is untrue on at least some platforms.

Per bug #12779 from Christoph Berg.  Back-patch to 9.1 where the faulty
code was introduced.

And thanks to that, the dump sizes look much better (it is interesting to see as well that a higher compression level is not synonymous with less data for this test case, which has low repetitiveness):

$ for num in {0..9}; do pg_dump -Fd -t dump_tab -f \
    level_$num.dump -Z $num ; done
$ ls -l level_?.dump/????.dat.gz level_0.dump/????.dat
-rw-r--r--  1 michael  staff  419999247 Feb 20 22:24 level_0.dump/2308.dat
-rw-r--r--  1 michael  staff  207503600 Feb 20 22:25 level_1.dump/2308.dat.gz
-rw-r--r--  1 michael  staff  207065206 Feb 20 22:25 level_2.dump/2308.dat.gz
-rw-r--r--  1 michael  staff  198538467 Feb 20 22:26 level_3.dump/2308.dat.gz
-rw-r--r--  1 michael  staff  199498961 Feb 20 22:26 level_4.dump/2308.dat.gz
-rw-r--r--  1 michael  staff  195780331 Feb 20 22:27 level_5.dump/2308.dat.gz
-rw-r--r--  1 michael  staff  195402899 Feb 20 22:28 level_6.dump/2308.dat.gz
-rw-r--r--  1 michael  staff  195046961 Feb 20 22:29 level_7.dump/2308.dat.gz
-rw-r--r--  1 michael  staff  194413125 Feb 20 22:30 level_8.dump/2308.dat.gz
-rw-r--r--  1 michael  staff  194413125 Feb 20 22:32 level_9.dump/2308.dat.gz

Nice community work to sort such things out very quickly.
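To get a feel for how the compression level affects size on this kind of data without running pg_dump, here is a small Python sketch. It is only an analogy, under stated assumptions: it uses the standard library's zlib module (where the level is a function argument, not a digit in a gzopen() mode string such as "wb7"), and the payload of random decimal digits is a made-up stand-in for the random() columns dumped above, so the sizes will not match the numbers in this post.

```python
import random
import zlib

random.seed(0)  # fixed seed so repeated runs compress the same bytes

# Text made of random decimal digits: low repetitiveness, similar in
# spirit to the two random() columns in dump_tab.
payload = "\n".join(f"{random.random():.15f}" for _ in range(20000)).encode("ascii")

# Compress the same payload at every zlib level, 0 (stored) to 9 (best).
sizes = {level: len(zlib.compress(payload, level)) for level in range(10)}

print(f"raw: {len(payload)} bytes")
for level, size in sizes.items():
    print(f"level {level}: {size} bytes")
```

Level 0 only stores the data, so it comes out slightly larger than the input, while every other level shrinks it. How much the higher levels gain over the lower ones depends on the data.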

Ernst-Georg Schmid: Counting rows again - inheritance strangeness

From Planet PostgreSQL. Published on Feb 20, 2015.

Hm, for something I'm trying at the moment on 9.4, I created two identical tables where the second one inherits all from the first:

  id integer NOT NULL,
  value real

INHERITS (everything.slow)

No indexes. Default tablespace.

If I then load 50 million records into each of those tables and query them individually using the ONLY restrictor, a count(*) on the parent table (slow) is slower than on the descendant (fast):

select count(*) from only everything.slow;

"Aggregate  (cost=1067783.10..1067783.11 rows=1 width=0) (actual time=4973.812..4973.813 rows=1 loops=1)"
"  Output: count(*)"
"  Buffers: shared read=442478"
"  ->  Seq Scan on everything.slow  (cost=0.00..942722.08 rows=50024408 width=0) (actual time=1012.708..3416.349 rows=50000000 loops=1)"
"        Output: id, value"
"        Buffers: shared read=442478"
"Planning time: 0.118 ms"
"Execution time: 4973.901 ms"

select count(*) from only;

"Aggregate  (cost=846239.00..846239.01 rows=1 width=0) (actual time=3988.235..3988.235 rows=1 loops=1)"
"  Output: count(*)"
"  Buffers: shared read=221239"
"  ->  Seq Scan on  (cost=0.00..721239.00 rows=50000000 width=0) (actual time=0.101..2403.813 rows=50000000 loops=1)"
"        Output: id, value"
"        Buffers: shared read=221239"
"Planning time: 0.086 ms"
"Execution time: 3988.302 ms"

This works with other aggregates like avg() too.

I had expected some overhead when querying without ONLY on slow, because of the traversal of the inheritance hierarchy, but not when I restrict the query to a specific table with ONLY...

Can someone explain this?

Jignesh Shah: CentOS 7, Docker, Postgres and DVDStore kit

From Planet PostgreSQL. Published on Feb 20, 2015.

It's been a long time since I have posted an entry. It has been a very busy year and more about that in a later post. Finally I had some time to try out new versions of Linux and new OSS technologies.

I started to learn by installing the latest version of CentOS 7. CentOS closely follows RHEL 7 and coming from SLES 11 and older CentOS 6.5, I saw many new changes which are pretty interesting.

New commands to learn immediately as I started navigating:

I admit that I missed my favorite files in /etc/init.d and looking at new location of /etc/systemd/system/ will take me a while to get used to.

firewall-cmd actually was more welcome considering how hard I found it to remember the exact rule syntax of iptables.

There is the new Grub2 but honestly lately I do not even worry about it (which is a good thing). Apart from that I see XFS is the new default file system and LVM now has snapshot support for Ext4 and XFS and many more.

However the biggest draw for me was the support for Linux Containers. As a Sun alumnus, I was always drawn to the battle of who did containers first and no longer worry about it, but as BSD Jails progressed to Solaris Containers to now the hottest technology: Docker container, it sure has its appeal.

In order to install docker however you need the "Extras" CentOS 7 repository enabled. However docker is being updated faster, so the "Extras" repository is getting old at 1.3 while the latest release (as of last week) is Docker 1.5. To get Docker 1.5 you will need to enable the "virt7-testing" repository on CentOS 7.

I took a shortcut to just create a file /etc/yum.repos.d/virt7-testing.repo with the following contents in it.


Then I was ready to install docker as follows

# yum install docker

I did find that it actually does not start the daemon immediately, so using the new systemctl command I enabled  and then started the daemon

# systemctl enable docker
# systemctl start docker

We now have the setup ready. However what good is the setup unless you have something to demonstrate quickly. This is where I see Docker winning over other container technology and probably their differentiator. There is an "AppStore" for the container images available to download images. Of course you need a login to access the Docker Hub as it is called at (which is for free fortunately). 

# docker login

That logs you in to the hub, and now you are ready to get new images.
I have uploaded two images for the demonstration for today
1. A Standard Postgres 9.4 image
2. A DVDStore benchmark application image based on kit from

Downloading the images is as simple as a pull:
# docker pull jkshah/postgres:9.4
# docker pull jkshah/dvdstore

Now let's see how to deploy them.
For PostgreSQL 9.4 since it is a database it will require storage for "Persistent Data" so first we make a location on the host that can be used for storing the data.

# mkdir /hostpath/pgdata

SELinux is enabled by default on CentOS 7 which means there is an additional step required to make the location read/write from Linux containers

# chcon -Rt svirt_sandbox_file_t /hostpath/pgdata

Now we will create a container as a daemon which will map the container port to host port 5432 and setup a database with a username and password that we set. (Please do not use secret as password :-) )
# docker run -d -p 5432:5432 --name postgres94 -v /hostpath/pgdata:/var/lib/postgresql/data -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=secret -t jkshah/postgres:9.4

Here now if you check /hostpath/pgdata you will see the database files on the host.
Now let's deploy an application using this database container.

# docker run -d -p 80:80 --name dvdstore2 --link postgres94:ds2db --env DS2DBINIT=1 jkshah/dvdstore

The above command starts another container based on the DVDStore image which expects a database "ds2db" defined, which is satisfied using the link option to link the database container created earlier. The application container also initializes the database so it is ready to serve requests at port 80 of the host.

This opens up new avenues to now benchmark your PostgreSQL hardware easily. (Wait the load test driver code is still on Windows  :-( )

How-to automatically identify similar images using pHash

By Cloudinary Blog - Django from Django community aggregator: Community blog posts. Published on Feb 19, 2015.

Photos today can be easily edited by means of resizing, cropping, adjusting the contrast, or changing an image’s format. As a result, new images are created that are similar to the original ones. Websites, web applications and mobile apps that allow user generated content uploads can benefit from identifying similar images.

Image de-duplication

If your site allows users to upload images, they can also upload various processed or manipulated versions of the same image. As described above, while the versions are not exactly identical, they are quite similar.

Obviously, it’s good practice to show several different images on a single page and avoid displaying similar images. For example, travel sites might want to show different images of a hotel room, but avoid having similar images of the room on the same page.

Additionally, if your web application deals with many uploaded images, you may want to be able to automatically recognize if newly uploaded images are similar to previously uploaded images. Recognizing similar images can prevent duplicate images from being used once they are uploaded, allowing you to better organize your site’s content.

Image similarity identification

Cloudinary uses perceptual hash (pHash), which acts as an image’s fingerprint. This mathematical algorithm analyzes an image's content and represents it using a 64-bit number fingerprint. Two images’ pHash values are "close" to one another if the images’ content features are similar. By comparing two images’ fingerprints, you can tell if they are similar.

You can request the pHash value of an image from Cloudinary for any uploaded image, either using Cloudinary's upload API, or for any previously uploaded image in your media library using our admin API. You can simply set the phash parameter to true, which produces the image's pHash value.

Using the following image for example: Original koala photo

Below is a code sample in Ruby that shows how to upload this image with a request for the pHash value:

Cloudinary::Uploader.upload("koala1.jpg", :public_id => "koala1", :phash => true)

The result below shows the returned response with the calculated pHash value:

{
     "public_id": "koala1",
     "version": 1424266415,
     "width": 887,
     "height": 562,
     "format": "jpg",
     "etag": "6f821ea4478af3e3a183721c0755cb1b",
     "phash": "ba19c8ab5fa05a59"
}

The examples below demonstrate multiple similar images and their pHash values. Let's compare the pHash values and find the distance between each pair. If you XOR two of the pHash values and count the “1’s” in the result, you get a value between 0 and 64. The lower the value, the more similar the images are. If all 64 bits are the same, the photos are very similar.

The similarity score of the examples below expresses how each image is similar to the original image. The score is calculated as 1 - (phash_distance(phash1, phash2) / 64.0) in order to give a result between 0.5 and 1 (phash_distance can be computed using bit_count(phash1 ^ phash2) in MySQL for example).
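The same calculation can be sketched in a few lines of Python. This is an illustration of the formula above and not part of Cloudinary's API: the function names phash_distance and similarity are hypothetical, and the two pHash values are the original and grayscale koala fingerprints listed in this post.

```python
def phash_distance(phash1, phash2):
    """Hamming distance between two 64-bit pHash values given as hex strings."""
    return bin(int(phash1, 16) ^ int(phash2, 16)).count("1")


def similarity(phash1, phash2):
    """Similarity score as defined above: 1 - distance / 64."""
    return 1 - phash_distance(phash1, phash2) / 64.0


original = "ba19c8ab5fa05a59"
grayscale = "ba19caab5f205a59"

print(phash_distance(original, grayscale))  # 2
print(similarity(original, grayscale))      # 0.96875
```

Only two bits differ between the two fingerprints, which gives the 0.96875 score reported for the grayscale image.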

Original koala thumbnail 887x562 JPEG, 180 KB
pHash: ba19c8ab5fa05a59

Grayscale koala 887x562 JPEG, 149 KB
Difference: grayscale.
pHash: ba19caab5f205a59
Similarity score: 0.96875

Cropped koala photo with increased saturation 797x562 JPEG, 179 KB
Difference: cropped, increased color saturation.
pHash: ba3dcfabbc004a49
Similarity score: 0.78125

Cropped koala photo with lower JPEG quality 887x509 JPEG, 30.6 KB
Difference: cropped, lower JPEG quality.
pHash: 1b39ccea7d304a59
Similarity score: 0.8125

Another koala photo 1000x667 JPEG, 608 KB
Difference: a different koala photo...
pHash: 3d419c23c42eb3db
Similarity score: 0.5625

Not a koala photo 1000x688 JPEG, 569 KB
Difference: not a koala...
pHash: f10773f1cd269246
Similarity score: 0.5

As you can see from the results above, the three images that appear similar to the original received a high score when they were compared, while the other comparisons showed significantly less similarity.

By using Cloudinary to upload users’ photos to your site or application, you can request the pHash values of the uploaded images and store them on your servers. That allows you to identify which images are similar and decide what the next step should be. You may want to keep similar images, classify them in your database, filter them out, or interactively allow users to decide which images they want to keep.


This feature is available for any Cloudinary plan, including the free tier. As explained above, you can use Cloudinary’s API to get an image’s fingerprint and start checking for similarities. In addition, it is on our roadmap to further enhance our similar image search and de-duplication capabilities.

David Fetter: Stalking the Wild Timezone

From Planet PostgreSQL. Published on Feb 19, 2015.

What time was it?

This is a question that may not always be easy to answer, even with the excellent TIMESTAMPTZ data type. While it stores the UTC timestamp equivalent to the time it sees, it throws away the time zone of the client.
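The same loss of information is easy to reproduce outside the database. The following Python sketch is only an analogy using the standard datetime module, not the technique from the linked post: the two client offsets are made up, and the helper store is hypothetical. It normalizes two client times to UTC and keeps the client's offset in a separate value, since the UTC instant alone no longer says where the client was.

```python
from datetime import datetime, timedelta, timezone

# Two clients report the same instant from different UTC offsets.
client_a = datetime(2015, 2, 19, 18, 0, tzinfo=timezone(timedelta(hours=1)))
client_b = datetime(2015, 2, 19, 9, 0, tzinfo=timezone(timedelta(hours=-8)))


def store(ts):
    """Return the UTC instant plus the client's UTC offset in minutes."""
    offset_minutes = int(ts.utcoffset().total_seconds() // 60)
    return ts.astimezone(timezone.utc), offset_minutes


utc_a, offset_a = store(client_a)
utc_b, offset_b = store(client_b)

print(utc_a == utc_b)      # True: the stored instants are identical
print(offset_a, offset_b)  # 60 -480: only the extra value tells them apart
```

Note that a numeric offset is not the same thing as a named time zone; it says nothing about daylight saving rules, for example.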

Here's how to capture it.
Continue reading "Stalking the Wild Timezone"

Josh Berkus: Spring/Summer 2015 Conference Schedule

From Planet PostgreSQL. Published on Feb 17, 2015.

What follows is my conference travel schedule through the early summer.  I'm posting it so that local PUGs will know when I'm going to be nearby, in case you want me to come talk to your members.  Also, so folks can find me at conference booths everywhere.

This list is also for anyone who was unaware of the amount of Postgres content available this year at conferences everywhere.
  • SCALE, Los Angeles, this week: 2-day Postgres track, booth.  Use code "SPEAK" if you still haven't registered for a small discount.  I'm speaking on 9.4 (Friday), and PostgreSQL on AWS (Sunday).
  • March 10, Burlingame, CA: pgDay SF 2015. Running the event, and giving a lightning talk.
  • March 25-27, NYC, NY: pgConf NYC: speaking on PostgreSQL on PAAS: a comparison of all the big ones.
  • April 25-26, Bellingham, WA: LinuxFest NorthWest, tentatively.  Talks haven't been chosen yet.  If I go, I'll also be working a booth no doubt.  I understand there are plans to have a bunch of Postgres stuff at this event.
  • June 16-20, Ottawa, Canada: pgCon of course.
  • July 20-24, Portland, OR: OSCON (tentatively, talks not selected).  Postgres talk of some sort, and probably booth duty.
Now you know.

    Checking for Python version and Vim version in your .vimrc

    By from Django community aggregator: Community blog posts. Published on Feb 17, 2015.

    Recently I’ve had to adjust a bunch of my dotfiles to support some old (Centos 5) systems, which means that I am using a Vim that has Python 2.4 built in… needless to say, it breaks some of my dotfiles ;)

    So here’s some tips on patching Vim version issues.

    First, checking if you have Python in your Vim and which version you are using. It returns a version similar to how Vim does it with its version. So 204 is the result for Python 2.4, 207 for Python 2.7 and so on.

    " Check python version if available
    if has("python")
        python import vim; from sys import version_info as v; vim.command('let python_version=%d' % (v[0] * 100 + v[1]))
    else
        let python_version=0
    endif
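The one-liner passed to the python command is dense, so here is the same arithmetic as a standalone Python sketch. The function name vim_style_version is hypothetical; inside Vim the result is handed back through vim.command as shown above.

```python
import sys


def vim_style_version(version_info=sys.version_info):
    """Encode (major, minor) the way Vim encodes its own version, e.g. 2.4 -> 204."""
    major, minor = version_info[0], version_info[1]
    return major * 100 + minor


print(vim_style_version((2, 4)))  # 204
print(vim_style_version((2, 7)))  # 207
print(vim_style_version())        # depends on the interpreter running this
```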

    Now we can make plugins/bundles dependent on versions:

    if python_version >= 205
        " Ultisnips requires Python 2.5 or higher due to the with_statement
        Bundle 'SirVer/ultisnips'
    else
        Bundle "MarcWeber/vim-addon-mw-utils"
        Bundle "tomtom/tlib_vim"
        Bundle "garbas/vim-snipmate"
    endif

    And checking for the Vim version to see if features are available:

    if version >= 703
        " Persistent undo is available from Vim 7.3 onwards
        set undofile
        set undodir=~/.vim/undo
        set undolevels=10000
        call system('mkdir ' . expand('~/.vim/undo'))
    endif

    That’s it, the examples can be found in my Vim config:

    David Fetter: Upgrading PostgreSQL on Ubuntu Without Automatically Restarting

    From Planet PostgreSQL. Published on Feb 17, 2015.

    You're in an environment where you can't take long (or worse, randomly timed) downtimes. It's a totally ordinary thing.

    Normally, you'd want to upgrade the software in one step, and restart the service that depends on it in a different step.

    Here's how.
    Continue reading "Upgrading PostgreSQL on Ubuntu Without Automatically Restarting"

    A method for rendering templates with Python

    By Will McGugan from Django community aggregator: Community blog posts. Published on Feb 15, 2015.

    I never intended to write a template system for Moya. Originally, I was going to offer a plugin system to use any template format you wish, with Jinja as the default. Jinja was certainly up to the task; it is blindingly fast, with a comfortable Django-like syntax. But it was never going to work exactly how I wanted it to, and since I don't have to be pragmatic on my hobby projects, I decided to re-invent the wheel. Because otherwise, how do we get better wheels?

    The challenge of writing a template language, I discovered, was keeping the code manageable. If you want to make it both flexible and fast, it can quickly descend into a mass of special cases and compromises. After a few aborted attempts, I worked out a system that was both flexible and reasonably fast. Not as fast as template systems that compile directly into Python, but not half bad. Moya's template system is about 10-25% faster than Django templates with a similar feature set.

    There are two main steps in rendering a template. First the template needs to be tokenized, i.e. split up into a data structure of text / tags. This part is less interesting I think, because it can be done in advance and cached. The interesting part is the following step that turns that data structure into HTML output.

    This post will explain how Moya renders templates, by implementing a new template system that works the same way.

    Let's render the following template:

    <h1>Hobbit Index</h1>
    <ul>
        {% for hobbit in hobbits %}
        <li{% if hobbit==active %} class="active"{% endif %}>
            {hobbit}
        </li>
        {% endfor %}
    </ul>

    This is somewhat similar to a Django or Moya template. It generates HTML with an unordered list of hobbits, one of which has the attribute class="active" on the <li>. You can see there is a loop and a conditional in there.

    The tokenizer scans the template and generates a hierarchical data structure of text and tag tokens (markup between {% and %}). Tag tokens consist of parameters extracted from the tag and child nodes (e.g. the tokens between the {% for %} and {% endfor %}).

    I'm going to omit the tokenize functionality as an exercise for the reader (sorry, I hate that too). We'll assume that we have implemented the tokenizer, and the end result is a data structure that looks like this:

    ["<h1>Hobbit Index</h1>", "<ul>",
     ForNode({"src": "hobbits", "dst": "hobbit"}, [
         "<li",
         IfNode({"test": "hobbit==active"}, [' class="active"']),
         ">{hobbit}</li>",
     ]),
     "</ul>"]

    Essentially this is a list of strings or nodes, where a node can contain further nested strings and other nodes. A node is defined as a class instance that handles the functionality of a given tag, i.e. IfNode for the {% if %} tag and ForNode for the {% for %} tag.

    Nodes have the following trivial base class, which stores the parameters and the list of children:

    class Node(object):
        def __init__(self, params, children):
            self.params = params
            self.children = children

    Nodes also have an additional method, render, which takes a mapping of the data we want to render (the context). This method should be a generator, which may yield one of two things: either strings containing output text or an iterator that yields further nodes. Let's look at the IfNode first:

    class IfNode(Node):
        def render(self, context):
            test = eval(self.params['test'], globals(), context)
            if test:
                yield iter(self.children)

    The first thing the render method does is to get the test parameter and evaluate it with the data in the context. If the result of that test is truthy, then the render method yields an iterator of its children. Essentially all this node object does is render its children (i.e. the template code between {% if %} and {% endif %}) if the test passes.

    The ForNode is similar, here's the implementation:

    class ForNode(Node):
        def render(self, context):
            src = eval(self.params['src'], globals(), context)
            dst = self.params['dst']
            for obj in src:
                context[dst] = obj
                yield iter(self.children)

    The ForNode render method iterates over each item in a sequence, and assigns the value to an intermediate variable. It also yields to its children each pass through the loop. So the code inside the {% for %} tag is rendered once per item in the sequence.

    Because we are using generators to handle the state for control structures, we can keep the main render loop free from such logic. This makes the code that renders the template trivially easy to follow:

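    def render(template, **context):
        output = []
        stack = [iter(template)]
        while stack:
            node = stack.pop()
            if isinstance(node, basestring):  # Python 2; use str on Python 3
                # Plain text: fill in context values and collect the output
                output.append(node.format(**context))
            elif isinstance(node, Node):
                # A tag: push its render generator on to the stack
                stack.append(node.render(context))
            else:
                # An iterator: take one value, keep the iterator for later
                new_node = next(node, None)
                if new_node is not None:
                    stack.append(node)
                    stack.append(new_node)
        return "".join(output)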

    The render loop manages a stack of iterators, initialized to the template data structure. Each pass through the loop it pops an item off the stack. If that item is a string, it performs a string format operation with the context data. If the item is a Node, it calls the render method and pushes the generator back on to the stack. When the stack item is an iterator (such as a generator created by Node.render), it gets one value from the iterator and pushes both the iterator and the value back on to the stack, or discards the iterator if it is empty.
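
    Putting the pieces together, here is a minimal, self-contained sketch of the whole pipeline. It is a Python 3 adaptation (str instead of basestring). The hand-built template list, which leaves out whitespace-only text, and the sample hobbit names are assumptions for illustration rather than the post's original data:

```python
class Node:
    """Base class: stores the tag parameters and the child tokens."""

    def __init__(self, params, children):
        self.params = params
        self.children = children


class IfNode(Node):
    def render(self, context):
        # Yield the children only when the test expression is truthy.
        if eval(self.params["test"], globals(), context):
            yield iter(self.children)


class ForNode(Node):
    def render(self, context):
        # Yield the children once per item, binding the loop variable first.
        for obj in eval(self.params["src"], globals(), context):
            context[self.params["dst"]] = obj
            yield iter(self.children)


def render(template, **context):
    output = []
    stack = [iter(template)]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            output.append(node.format(**context))
        elif isinstance(node, Node):
            stack.append(node.render(context))
        else:
            new_node = next(node, None)
            if new_node is not None:
                stack.append(node)      # resume this iterator later
                stack.append(new_node)  # but handle the new value first
    return "".join(output)


# Hand-built stand-in for what a tokenizer would produce.
template = [
    "<ul>",
    ForNode({"src": "hobbits", "dst": "hobbit"}, [
        "<li",
        IfNode({"test": "hobbit==active"}, [' class="active"']),
        ">{hobbit}</li>",
    ]),
    "</ul>",
]

html = render(template, hobbits=["Bilbo", "Frodo"], active="Frodo")
```

    With this sample data, only Frodo's <li> receives class="active". Note that eval runs arbitrary expressions, so a sketch like this is only suitable for trusted templates.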

    In essence, the inner loop is running the generators and collecting the output. A more naive approach might have the render methods also rendering their children and returning the result as a string. Using generators frees the nodes from having to build strings. Generators also make error reporting much easier, because exceptions won't be obscured by deeply nested render methods. Consider a node throwing an exception inside a for loop; if ForNode.render were responsible for rendering its children, it would also have to trap and report such errors. The generator system makes error reporting simpler, and confines it to one place.

    There is a very similar loop at the heart of Moya's template system. I suspect the main reason that Moya templates are moderately faster than Django's is this lean inner loop. See this GitHub gist for the code from this post.
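
    The tokenizer was left out of this post. As a rough idea of its first step only, the sketch below splits a template into a flat list of text and tag tokens with a regular expression. Nesting the tokens between {% for %} and {% endfor %} into nodes is not shown, and the names TAG_RE and tokenize_flat are invented here, not taken from Moya:

```python
import re

# Capture group so that re.split keeps the tags in the result.
TAG_RE = re.compile(r"(\{%.*?%\})")


def tokenize_flat(source):
    """Split template source into ("text", ...) and ("tag", ...) tuples."""
    tokens = []
    for part in TAG_RE.split(source):
        if not part:
            continue  # re.split yields empty strings between adjacent tags
        if part.startswith("{%"):
            tokens.append(("tag", part[2:-2].strip()))
        else:
            tokens.append(("text", part))
    return tokens


tokens = tokenize_flat('<li{% if hobbit==active %} class="active"{% endif %}>')
```

    A second pass would then walk this flat list and group everything between an opening tag and its matching end tag into the children of a node.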

    Don't import (too much) in your django settings

    By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on Feb 10, 2015.

    One of our production Django sites broke this afternoon with a database error "relation xyz doesn't exist". So: a missing table.

    Why 1

    I helped debug it and eventually found the cause by doing a select * from south_migrationhistory. This lists the south migrations and, lo and behold, a migration had been applied just 25 minutes earlier. The migration name suggested a rename of tables, which of course matches the "missing table" error.

    Why 2

    Cause found. But you have to ask yourself "why" again. So: "why was this migration applied?".

    Well, someone was working on a bit of database cleanup and refactoring. Naming consistency, proper use of permissions, that sort of thing. Of course, locally in a branch. And on a development database. Now why did the local command result in a migration on the production database?

    Why 3

    So, effectively, "why don't the development settings work as intended"? We normally use one settings file for production and a separate one for development. The development file imports from the production one and sets the debug mode, the development database, and so on.

    This project is a bit different in that there's only a single settings file. It does, however, try to import a localsettings module, which is generated for you when you set up your project environment with ansible. A bit less clear (in my opinion) than a real .py file in your github repository, but it works. We saw the generated localsettings file with the development database and DEBUG = True. This wasn't the cause. What then?

    Normally, calling django's diffsettings command (see the django documentation) shows you any settings errors by printing all the settings in your config that are different from Django's defaults. In this case, nothing was wrong. The DATABASES setting was the right one with the local development database. Huh?

    The developer mentioned one other thing he changed recently: importing some django signal registration module in the settings file. Ah! Django's signals often work on the database. Yes, the signals in this module did database work, too.

    So the settings file effectively looked like this:

    DATABASES = { .... 'server': 'productiondatabase' ....}
    import my_project.signal_stuff_that_works_on_the_database
    try:
        from .localsettings import *
        # This normally sets DATABASES = { .... 'server': 'developmentdatabase' ....}
    except ImportError:
        pass

    The import of the signal registration module apparently triggered something in Django's database layer so that the database connection was already active. The subsequent change of the DATABASES config to the local development database didn't have any effect anymore.
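
    The effect can be reproduced without Django. The toy Python sketch below is not Django's actual database layer, and every name in it (settings, Connection, get_connection) is made up for illustration. It only shows how a value that is read once, at first use, ignores a later override:

```python
# Toy stand-ins; none of these names come from Django.
settings = {"DATABASES": {"server": "productiondatabase"}}


class Connection:
    def __init__(self, config):
        # The server name is copied once, when the connection is created.
        self.server = config["DATABASES"]["server"]


_cache = {}


def get_connection():
    """Create the connection on first use, then keep returning the same one."""
    if "default" not in _cache:
        _cache["default"] = Connection(settings)
    return _cache["default"]


# 1. An import with side effects touches the database early.
get_connection()

# 2. The local settings override arrives afterwards.
settings["DATABASES"] = {"server": "developmentdatabase"}

shown_in_settings = settings["DATABASES"]["server"]  # what a settings dump shows
actually_used = get_connection().server              # what queries really hit
```

    Here shown_in_settings is the development database while actually_used is still the production one, which mirrors the mismatch between the settings output and the real behaviour.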

    diffsettings just shows you what the settings are and doesn't catch the fact that the DATABASES isn't really used in the form that comes out of diffsettings.

    Why 4

    Why the import, then?

    Well, it has to be executed when django starts up. The settings file looked like a good spot. It isn't, though.

    The traditional location to place imports like this is the models.py or urls.py file. That's why the admin.autodiscover() line is often in your urls.py, for instance.

    So... put imports like this in models.py or urls.py instead of in your settings file.

    Why 5

    Digging even deeper... isn't this sort of weird and ugly? Why isn't there a more obvious place for initialization code like this? Now you have to have the arcane knowledge to somehow know where you can import and where not, right?

    The answer: there is a good spot, in django 1.7. The AppConfig.ready() method! Quote from the documentation: Subclasses can override this method to perform initialization tasks such as registering signals. Bingo!

    Haystack & Tastypie Orgs

    By Daniel Lindsley from Django community aggregator: Community blog posts. Published on Feb 02, 2015.

    Haystack & Tastypie Orgs

    How to Find the Performance Bottlenecks in Your Django Views?

    By DjangoTricks from Django community aggregator: Community blog posts. Published on Jan 30, 2015.

    Once you have your Django projects running, you will run into situations where you need to optimize for performance. The rule of thumb is to find the bottlenecks and then take action to eliminate them with more idiomatic Python code, database denormalization, caching, or other techniques.

    What is a bottleneck? Literally, it refers to the narrow top part of a bottle. In engineering, a bottleneck is a case where the performance or capacity of an entire system is limited by a single component or resource, or by a small number of them.

    How do you find these parts of your code? The most trivial way is to check the current time before and after executing a specific piece of code, and then compute the difference:

    from datetime import datetime
    start = datetime.now()
    # heavy execution ...
    end = datetime.now()
    d = end - start # datetime.timedelta object
    print d.total_seconds() # prints something like 7.861985

    However, measuring code performance for Django projects like this is inefficient, because you need a lot of such wrappers for your code until you find which part is the most critical. Also you need a lot of manual computation to find the critical parts.
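
    If you do stay with manual timing for a quick check, a small context manager at least removes the repeated start/end bookkeeping. This is a minimal Python 3 sketch, not something from line_profiler or django-devserver, and the names timed and timings are invented for the example:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the duration of the wrapped block, in seconds, under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failures are timed as well.
        results[label] = time.perf_counter() - start


timings = {}

with timed("sum of squares", timings):
    total = sum(i * i for i in range(100000))

for label, seconds in timings.items():
    print("%s: %.6f s" % (label, seconds))
```

    This still measures only the blocks you wrap by hand, which is exactly the limitation a line-by-line profiler avoids.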

    Recently I found the line_profiler module, which can inspect the performance of code line by line. By default, to use line_profiler for your functions, you decorate them with the @profile decorator and then execute the script:

    $ kernprof -l

    This command will execute your script, analyze the decorated function, and save the results to a binary file that can later be inspected with:

    $ python -m line_profiler

    That's quite complicated, but to use line_profiler for Django views, you can install django-devserver which replaces the original development server of Django and will output the performance calculations immediately in the shell like this:

    [30/Jan/2015 02:26:40] "GET /quotes/json/ HTTP/1.1" 200 137
    [sql] 1 queries with 0 duplicates
    [profile] Total time to render was 0.01s
    [profile] Timer unit: 1e-06 s

    Total time: 0.001965 s
    File: /Users/archatas/Projects/quotes_env/project/inspirational/quotes/
    Function: quote_list_json at line 27

    Line #      Hits         Time  Per Hit   % Time  Line Contents
        27                                           def quote_list_json(request):
        28         1            2      2.0      0.1      quote_dict_list = []
        29         2         1184    592.0     60.3      for quote in InspirationQuote.objects.all():
        30         1            1      1.0      0.1          quote_dict = {
        31         1            1      1.0      0.1              'author':,
        32         1            1      1.0      0.1              'quote': quote.quote,
        33         1          363    363.0     18.5              'picture': quote.get_medium_picture_url(),
        34                                                   }
        35         1            1      1.0      0.1          quote_dict_list.append(quote_dict)
        37         1           42     42.0      2.1      json_data = json.dumps(quote_dict_list)
        38         1          370    370.0     18.8      return HttpResponse(json_data, content_type="application/json")

    The most interesting data in this table is the "% Time" column, giving an overview in percentage which lines of the Django view function are the most time-consuming. For example, here it says that I should pay the most attention to the QuerySet, the method get_medium_picture_url() and the HttpResponse object.

    To set up line profiling, install line_profiler and django-devserver into your virtual environment:

    (myproject_env)$ pip install line_profiler
    (myproject_env)$ pip install django-devserver

    Then make sure that you have the following settings in your or

    # ...

    # ...


    # Modules not enabled by default

    DEVSERVER_AUTO_PROFILE = True # profiles all views without the need of function decorator

    When you execute

    (myproject_env)$ python manage.py runserver

    it will run the development server from django-devserver and for each visited view, it will show the analysis of code performance. I have tested this setup with Django 1.7, but it should work since Django 1.3.

    Do you know any more useful tools to check for performance bottlenecks?

    Account Control part 1

    By GoDjango - Django Screencasts from Django community aggregator: Community blog posts. Published on Jan 30, 2015.

    This is the first in a series of videos on creating a site which utilizes other services to help your users stay informed. We start the series by getting our users set up with an account and giving them the ability to log in and out.
    Watch Now...

    Astro Code School Tapped to Teach App Development at UNC Journalism School

    By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jan 29, 2015.

    Our own Caleb Smith, Astro Code School lead instructor, is teaching this semester at UNC’s School of Journalism, one of the nation’s leading journalism schools. He’s sharing his enthusiasm for Django application development with undergraduate and graduate media students in a 500-level course, Advanced Interactive Development.

    For additional details about the course and why UNC School of Journalism selected Caktus and Astro Code School, please see our press release.

    Solinea is looking for a Senior Backend Engineer (Python, Django, Elasticsearch)

    By John DeRosa from Django community aggregator: Community blog posts. Published on Jan 28, 2015.

    This is my second week at Solinea, and I’m loving it! A position just opened up on our development team for a backend developer, and I wanted to share the love. :-)

    The company supports remote employees. Its headquarters is in Berkeley, CA, and I’m in Seattle, and I feel more connected now than, well, I did at some other companies I’ve worked for.

    If you’re in Seattle, I’d be happy to meet for coffee to talk at length about the job.

    To apply for this job, you can contact me at, or click the “Apply for this position” button at the bottom of the job’s Recruiterbox page.

    Senior Backend Engineer (Python, Django, Elasticsearch)


    Berkeley, CA, US, or remote

    This position is only open to candidates based in and eligible to work in the United States.


    As a backend developer at Solinea, you will be primarily working on our flagship product from the API back, as well as committing to the OpenStack codebase.

    You will work in a sprint-based agile development team, and will participate in the full cycle including release/sprint planning, feature design, story definition, daily standups, development, testing, code review, and release packaging. You will also work on the automated build, test, and package environment, as well as participate in maintenance of the development lab. Most of all, you will have the opportunity to have fun, be challenged, and grow as a developer while creating a game-changing product to help our fellow cloud operators.


    The ideal person to fill the role will have a solid track record of cloud, open source, virtualization, real-time data, and API development. You should have the ‘play all fields’ mentality required to be successful in a startup environment. Bring your passion for solving large problems, exploring new technology frontiers, and helping to bootstrap a development organization.

    The goldstone backend technology stack primarily consists of Python, Django, Celery, Redis, Logstash, and Elasticsearch. You should be an expert-level developer in the Python/Django ecosystem, and have hands-on experience with OpenStack or some other cloud management framework.

    In addition to the core skills, things like systems automation, machine learning, data visualization, and prior startup experience are definitely relevant to the position.

    The ideal candidate will have at least a BS degree in CS or related field along with relevant work experience.


    Solinea offers comprehensive benefits including:

    • Medical, dental, vision, life, disability insurance, 401k plan
    • Flexible spending accounts
    • Pre-tax commuter benefits
    • Free coffee/tea in offices
    • 20 days of PTO/yr
    • Flexible working environment
    • Joel Test score: 8 out of 12

    The Joel Test is a twelve-question measure of a software team’s quality.

    Do you use source control? Yes
    Can you make a build in one step? Yes
    Do you make daily builds? Yes
    Do you have a bug database? Yes
    Do you fix bugs before writing new code? Depends on severity
    Do you have an up-to-date schedule? No
    Do you have a spec? Yes
    Do programmers have quiet working conditions? Yes
    Do you use the best tools money can buy? Yes
    Do you have testers? No
    Do new candidates write code during their interview? Yes
    Do you do hallway usability testing? No

    Tagged: Django, jobs, Python

    Reading/writing 3D STL files with numpy-stl

    By from Django community aggregator: Community blog posts. Published on Jan 28, 2015.

    As a followup of my earlier article about reading and writing STL files with Numpy, I’ve created a library that can be used easily to read, modify and write STL files in both binary and ascii format.

    The library automatically detects whether your file is in ascii or binary STL format and is very fast due to all operations being done by numpy.

    First, install using pip or easy_install:

    pip install numpy-stl
    # Or if you don't have pip available
    easy_install numpy-stl

    Note that numpy and python-utils version 1.6 or greater are required. While these should both be installed automatically by pip/easy_install, for numpy it’s generally recommended to download a binary release so it installs a bit faster.

    Example usage:

    from stl import stl
    mesh = stl.StlMesh('some_file.stl')
    # The mesh normals (calculated automatically)
    mesh.normals
    # The mesh vectors
    mesh.v0, mesh.v1, mesh.v2
    # Accessing individual points (concatenation of v0, v1 and v2 in triplets)
    mesh.points[0] == mesh.v0[0]
    mesh.points[1] == mesh.v1[0]
    mesh.points[2] == mesh.v2[0]
    mesh.points[3] == mesh.v0[1]
    # Write the mesh to a new file
    mesh.save('new_stl_file.stl')

    Documentation can be found here:
    Please let me know if you have any problems using it or just to tell me that you like the project :)

    Django Logging Configuration: How the Default Settings Interfere with Yours

    By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jan 27, 2015.

    My colleague Vinod recently found the answer on Stack Overflow to something that's been bugging me for a long time - why do my Django logging configurations so often not do what I think they should?

    Short answer

    If you want your logging configuration to behave sensibly, set LOGGING_CONFIG to None in your Django settings, and do the logging configuration from scratch using the Python APIs:

    LOGGING_CONFIG = None
    LOGGING = {...}  # whatever you want
    import logging.config
    logging.config.dictConfig(LOGGING)


    The kernel of the explanation is in this Stack Overflow answer by jcotton; kudos to jcotton for the answer. Before processing your settings, Django establishes a default configuration for Python's logging system, but you can't override it the way you would think, because disable_existing_loggers doesn't work quite the way the Django documentation implies.

    The Django documentation for disable_existing_loggers in 1.6, 1.7, and dev (as of January 8, 2015) says "If the disable_existing_loggers key in the LOGGING dictConfig is set to True (which is the default) the default configuration is completely overridden." (emphasis added)

    That made me think that I could set disable_existing_loggers to True (or leave it out) and Django's previously established default configuration would have no effect.

    Unfortunately, that's not what happens. The disable_existing_loggers flag only does literally what it says: it disables the existing loggers, which is different from deleting them. The result is that they stay in place, they don't log any messages, but they also don't propagate any messages to any other loggers that might otherwise have logged them, regardless of whether they're configured to do so.

    What if you try the other option, and set disable_existing_loggers to False? Then your configuration is merged with the previous one (the default configuration that Django has already set up), without disabling the existing loggers. If you use Django's LOGGING setting with the default LOGGING_CONFIG, there is no setting that will simply replace Django's default configuration.

    Because Django installs several django loggers, the result is that unless you happened to have specified your own configuration for each of them (replacing Django's default loggers), you have some hidden loggers possibly blocking what you expect to happen.

    For example - when I wasn't sure what was going on in a Django project, sometimes I'd try just adding a root logger, to the console or to a file, so I could see everything. I didn't know that the default Django loggers were blocking most log messages from Django itself from ever reaching the root logger, and I would get very frustrated trying to see what was wrong with my logging configuration. In fact, my own logging configuration was probably fine; it was just being blocked by a hidden, overriding configuration I didn't know about.

    We could work around the problem by carefully providing our own configuration for each logger included in the Django default logging configuration, but that's subject to breaking if the Django default configuration changes.

    The most fool-proof solution is to disable Django's own log configuration mechanism by setting LOGGING_CONFIG to None, then setting the log configuration explicitly ourselves using the Python logging APIs. There's an example above.

    The nitty-gritty

    The Python documentation is more accurate: "disable_existing_loggers – If specified as False, loggers which exist when this call is made are left enabled. The default is True because this enables old behavior in a backward-compatible way. This behavior is to disable any existing loggers unless they or their ancestors are explicitly named in the logging configuration."

    In other words, disable_existing_loggers does literally what it says: it leaves existing loggers in place, it just changes them to disabled.

    Unfortunately, Python doesn't seem to document exactly what it means for a logger to be disabled, or even how to do it. The code seems to set a disabled attribute on the logger object. The effect is to stop the logger from calling any of its handlers on a log event. An additional effect of not calling any handlers is to also block propagation of the event to any parent loggers.
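
    The behaviour described above can be observed with the standard library alone. In this minimal Python 3 sketch, the logger names and the ListHandler class are made up for the demonstration. A logger created before dictConfig runs stands in for one of Django's default loggers:

```python
import logging
import logging.config


class ListHandler(logging.Handler):
    """Collect messages in a list so that they can be inspected afterwards."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


# Exists before configuration, like the loggers Django sets up by default.
early = logging.getLogger("demo.preexisting")

logging.config.dictConfig({
    "version": 1,
    "disable_existing_loggers": True,
    "root": {"level": "DEBUG"},
})

# Attach a catch-all handler to the root logger.
capture = ListHandler()
logging.getLogger().addHandler(capture)

early.warning("from the pre-existing logger")
logging.getLogger("demo.created_later").warning("from a new logger")
```

    After this runs, early.disabled is True and capture.messages contains only the message from the logger created later: the disabled logger neither handled its message nor passed it up to the root logger.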

    Status of the problem

    There's been some recent discussion on the developers' list about at least improving the documentation, with a core developer offering to review anything submitted. And that's where things stand.

    We’re launching a Django code school: Astro Code School

    By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jan 26, 2015.

    One of the best ways to grow the Django community is to have more high-quality Django developers. The good news is that we’ve seen sharply increasing demand for Django web applications. The challenge that we and many other firms face is that there’s much higher demand than there is supply: there aren’t enough high-quality Django developers. We’ve talked about this issue intensely internally and with our friends while at DjangoCon and PyCon. We decided that we can offer at least one solution: a new Django-focused code school.

    We’re pleased to announce the launch of Astro Code School in Spring 2015. Astro will be the first Django code school on the East Coast. Programs include private trainings and weekend, 3-week, and 12-week full-time courses. In addition to Django, students will learn Python (of course), HTML, CSS, and JavaScript. They will come away being able to build web applications. The shorter programs will be geared towards beginners. The longer program is for those with previous programming experience. Astro will also provide on-site, private corporate training, another area we frequently get asked about.

    Astro will be a separate company under Caktus. To support Astro, we welcome Brian Russell, the new director of Astro. Brian is the former owner of Carrboro Creative Coworking, the place where Caktus got its start. In addition to being a long-term supporter of new developers, Brian is also an artist and entrepreneur. He has a special interest in increasing diversity within open source. Django itself is one of the most respectful and welcoming places for women and minorities and he’s excited to contribute.

    Our first and leading instructor will be Caleb Smith, a Caktus developer since 2011. Caleb first joined Caktus as an intern, straight from his days as a public school music teacher. He continued to teach while at Caktus, supporting free and low-cost courses for women through the nonprofit Girl Develop It RDU. He’s also currently teaching an advanced web application course at the University of North Carolina’s School of Journalism and Mass Communication.

    We’re building out the space for Astro currently on the first floor of our new headquarters in Downtown Durham. Astro Code School will have a dedicated 1,795 square feet of space. Construction should be complete by April.

    Why I Love Technical Blogging

    By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jan 23, 2015.

    I love writing blog posts, and today I’m setting out to do something I’ve never tried before: write a blog post about writing blog posts. A big part of our mission at Caktus is to foster and help grow the Python and Django development communities, both locally and nationally. Part of how we’ve tried to accomplish this in the past is through hosting development sprints, sponsoring and attending conferences such as PyCon and DjangoCon, and building a knowledge base of common problems in Python and Django development in our blog. Many in the Django community first get to know Caktus through our blog, and it’s both gratifying and humbling when I meet someone at a conference and the person thanks me for a post Caktus wrote that helped him or her solve a technical problem at some point in the past.

    While I personally don’t do as much software development as I used to and hence no longer write as many technical posts, the Caktus blog and many others in the community continue as a constant source of inspiration and education to me. As software developers we are constantly trying to work ourselves out of a job, building tools that organize information and help people communicate. Sharing a brief, highly specific technical blog post serves in a similar capacity; after I’ve spent 1-2 hours or more researching something that ultimately took 5-10 minutes to fix, I’d hate for someone else to have to go through the same experience. Writing up a quick, 1-2 paragraph technical post about the issue not only helps me think through the problem, but also hopefully saves a few minutes of someone else’s life at some point in the future.

    To help me better understand what I like so much about blogging, I went back and reviewed the history of Caktus blogging efforts over the past 5 years and separated our posts into categories. While I’m sure there are innumerable ways to do this, in case it serves as a source of inspiration to others, what follows are the categories I came up with:

    • Technical Tidbits. These types of posts are small, usually a paragraph or two, along with a code or configuration snippet. They might cover upgrading a specific open source package or reusable app in your project, or augment existing Django release notes when you find the built-in Django documentation lacking for a specific use case. Posts in this category that we’ve written in the past at Caktus include upgrading django-photologue and changing the SHMMAX setting (for PostgreSQL) on a Mac. These are great posts to write after you’ve just done something for the first time. You’ll have a fresher perspective than someone who’s done the task many times before. Because of this, you can easily anticipate many of the common problems someone coming to the task for the first time might face.

    • Debugging Sugar. Posts handy for debugging purposes often rely on a specific error message or stack trace. Another good candidate for this type of post is documenting an existing Django bug that requires a specific workaround. Posts we’ve written in this category include using strace to debug stuck celery tasks and the (thankfully now obsolete) parsing microseconds in the Django admin. A good sign you need to write a post like this is that you had to spend more than 5-10 minutes Googling for an answer to something or asking your co-workers. If you’re looking for an answer and having trouble finding it, there’s a good chance someone else out there is doing the same and would benefit from your blog post.

    • Open Source Showcase. Open Source Showcase posts are a great way to spread the word about a project you or a teammate has written, or to validate a 3rd party app or Django feature you’ve found particularly helpful. These are typically longer, more in-depth analyses of a project or feature rather than an answer to a specific technical problem (though the two are not always mutually exclusive). At Caktus we’ve written about our django-scribbler app as well as several new features in Django, including bulk inserts, class-based views, and support for custom user models. While these posts can require a significant time investment to get right, their value as augmentation to or 3rd-party validation of Python and Django development patterns cannot be overstated. Patterns are set through a community rallying around an open source package or approach. Proposing and sharing these ideas openly is what drives the open source community forward.

    • Mini How-tos. Mini How-tos are generally a combination of other types of posts. They start with a specific goal in mind -- setting up a server, installing a reusable app -- and walk the reader through all the necessary steps, services, and packages required to get there. If you feel passionately that something should be done in a certain way, this is a great way to set a standard for the community to be aware of and potentially follow. This could cover anything from configuring a Jenkins slave to using Amazon S3 to store your static and uploaded media. Similar to an Open Source Showcase, Mini How-tos are an asset to the community insofar as they help advance and disseminate common approaches to software development problems. At the same time, they’re open to review and critique by the wider open source community.

    A big thank you to everyone in the Python and Django community for being open and willing to share your experiences and problem solving efforts. Without this, Caktus would not be where it is today and for that I am deeply grateful. If this post happens to inspire at least one short technical post from someone who hasn’t written one before, I’ll consider it a success.

    Why you should donate to the Django fellowship program

    By Greg Taylor from Django community aggregator: Community blog posts. Published on Jan 23, 2015.

    Disclaimer: I do not represent the Django Software Foundation in any way, nor has anything below been endorsed by the DSF. The following opinions are my own, unsolicited rambling.

    If you hadn’t been looking for it specifically, you may have missed it. The Django Software Foundation is running a fundraising effort for the new Django Fellowship program. It sounds like they’re still trying to figure out how to get the word out, so I wanted to do what I could to tell you why you should chip in.

    This particular blog post is going to focus on encouraging (peer-pressuring) commercial Django users in particular, though enthusiasts are welcome to read along!

    Humble beginnings

    Django is free and open source. Just provide the expertise and the infrastructure and you can build just about whatever web powered contraption you’d like. So you end up doing just that.

    Your first stop is the Django tutorial, written and maintained by a community of volunteers (just like the rest of the framework itself). You stumble along, slowly at first. Perhaps you find yourself frustrated at times, or maybe things move along at a faster pace. In no time, you’ve got "Hello World!" rendering, and here comes a business idea!

    One hundred lines of code turns into a thousand, then five thousand, and beyond. You start seeing signups, and revenue begins to trickle in. You toil away at your codebase, making improvements and dealing with the "accidental features" that crept in during one of your late night dev sessions.

    You could have built your business on one of any number of frameworks, but you chose Django. You like how it’s a very productive way to build a web app. You appreciate how it’s not impossible to find Django developers to work with you. There are probably some things you don’t like, but you might not have the time to work on fixing them yourself. You’re just busy shipping and growing.

    But it could be better still!

    You’re happily using Django, it serves you well. There are a few things you’d love to see fixed or improved, but you don’t really have the time or expertise to contribute directly. As luck would have it, all of the Django core developers have day jobs themselves. Things would progress much more quickly if we had someone working full-time on Django…

    Enter: Django Fellowship Program. The idea is to fund at least one Django developer to work for the DSF part or full-time for a while. During this fellowship, said developer sets aside some or all of their other responsibilities to focus on improving Django. The DSF, in turn, pays the developer a fair (but low) rate for their work.

    As per Tim Graham’s recent retrospective blog post, we’ve seen some huge leaps forward for the project during these fellowships. These are periods of focus and rapid improvement that everyone (including your business) benefits from.

    The only problem is that we’re not going to see the benefits of this program unless it gets (and stays) funded. A well-funded fellowship program could mean one (or more) developers working on Django full-time at any given point in time. That would be huge for the project (and for you and me).

    Why you should donate

    As a business, we are donating to the fellowship program to see one of our critical components improved. Due to the fellowship application process, you can be assured that your money will be paying a capable, trusted developer to get things done.

    Consequently, you can view a donation to the Django Fellowship program as an investment with an almost assuredly positive return. If you are making money with Django, consider making a (potentially tax-deductible) investment in what may be the foundation of your business.

    At the end of the first full day of fund-raising, there are precious few commercial donors listed in the "Django Heroes" leaderboard. Let’s help change that!

    If you don’t hold the purse strings at your business, get in touch with someone who does and tell them about this investment with near-guaranteed returns.

    Caktus is looking for a Web Design Director

    By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jan 22, 2015.

    Over the last two years Caktus’ design portfolio has rapidly been growing. We’ve taken on new projects primarily focused on design and have received community recognition for those efforts. We are happy to have grown our design capabilities to match the level of quality we demand from our Django developers. We have found it’s important to have strength on both sides of the table as each side challenges the other and forces the final product of our process to be as high quality as possible.

    In an effort to continue to push ourselves and expand our web design skill sets, Caktus is looking to hire a new Web Design Director. We’re searching for someone who can do a bit of wireframing and user experience and then has the tools necessary to design and code pages. We’re looking for someone who is attuned to both form and function and knows where to focus depending on clients’ needs. Caktus is committed to doing good in our development communities as well as through the projects that we choose to work on, so we are also interested in finding someone who is engaged in the design community.

    If you or someone you know would be a good fit, please apply to the position! If you have any questions get in touch.

    Introducing High Performance Django Expert Sessions

    By Lincoln Loop from Django community aggregator: Community blog posts. Published on Jan 21, 2015.

    With the launch of our book, High Performance Django, we’ve received a number of inquiries from people asking for advice, for which the answers are too specific to their application to give good general advice, and too short to sign a consulting engagement.

    Rather than decline to help, we now offer Expert Sessions - a one-hour online consultation with a member (or members) of the Lincoln Loop team.

    Schedule an Expert Session and we'll meet up with you via Google Hangouts, Skype, or phone to answer any questions or provide expertise on building and scaling your complex Django application.

    In the past, we've answered such questions as:

    • What technologies should we use for building a complex, high-performance application?

    • Our app is super complex and it takes us 3 days to onboard a new developer. How can we simplify things to speed up onboarding?

    • Should we move our infrastructure to Amazon Web Services or Heroku? What are the benefits and how do we perform the correct analysis?

    • How can we make our deploys more reliable?

    • How do we move from our legacy system to Django in order to improve reliability and cut costs?

    • Should we use MongoDB or Redis?

    • How do we properly load balance across our app servers?

    • How can we optimize our development workflow?

    We can also provide expertise in other technologies, such as SaltStack, Go (golang), Javascript (React.js and Backbone), MySQL, and Postgres.

    To schedule an Expert Session, or for more information, please see our Expert Sessions page.

    Webinar: Testing Client-Side Applications with Django

    By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jan 20, 2015.

    Technical Director Mark Lavin will be hosting a free O’Reilly webinar today at 4PM EST or 1PM PT on Testing Client-Side Applications with Django. Mark says testing is one of the most popular question topics he receives. It’s also a topic near and dear to Caktus’ quality-loving heart. Mark’s last webinar garnered more than 500 viewers, so sign up quick!

    Here’s a description from Mark:

    During the session we'll examine a simple REST API with Django connected to a single page application built with Backbone. We'll look at some of the tools available to test the application with both Javascript unit tests and integration tests written in Python. We'll also look at how to organize them in a sane way for your project workflow.

    To sign up, visit the webinar page on O’Reilly’s site.

    Self-Hosted Server Status Page with Uptime Robot, S3, and Upscuits

    By Ross Poulton from Django community aggregator: Community blog posts. Published on Jan 20, 2015.

    For quite a while I've had a public "Status" page online for WhisperGifts via Pingdom. It basically just shows uptime over the past few days, but given my site is relatively low-volume and not overly critical to my customers, the $10/month for Pingdom was actually one of my largest expenses after hosting.

    So, I started looking for an alternative.

    Today I re-deployed the WhisperGifts Status Page using a combination of Uptime Robot, Upscuits and Amazon S3.

    In short, I have Uptime Robot checking the uptime of my site (including its subsites, such as the admin and user pages). The statistics are gathered and presented by Upscuits, which is entirely client-side JavaScript hosted on S3.

    My basic todo list for next time:

    1. Sign up for Uptime Robot. I'd been using them for ages on their Free plan as a backup to Pingdom; this gives 5-minute checks. Their paid plan gives 1-minute resolution.
    2. Add your sites, make sure they're being monitored correctly.
    3. On the Uptime Robot dashboard, click My Settings. Open the section labelled Monitor-Specific API Keys and search for your Monitor. Copy the API key to a text file for later; repeat this step for subsequent monitors you want to include on your status page.
    4. Download the latest Upscuits release to your PC.
    5. In the public folder of the upscuits package, rename config.example.js to config.js. Paste your API key(s) inside it.
    6. Create an AWS bucket called eg and enable website mode. Set up your DNS etc. to point to this bucket.
    7. Upload the contents of public/ to your AWS bucket
    8. Visit your new status page and view your last 12 months of Uptime Robot statistics
    9. Close your Pingdom account, saving $10 a month. Profit!

    For a small site like mine this has a couple of obvious benefits. It's free (or $4.50/month if you want higher resolution - still half the price of the most basic Pingdom plan); it uses a tiny amount of S3 storage which is as good as free, and doesn't involve running any server-side code. The included index.html is also easily customisable if you like, since it's just plain HTML (using the Bootstrap framework, by default). This is a big win over hosted solutions, IMO.

    Building Django proxies and MUD libraries

    By Griatch's Evennia musings (MU* creation with Django+Twisted) from Django community aggregator: Community blog posts. Published on Jan 19, 2015.

    2015 is here and there is a lot of activity going on in Evennia's repository, mailing list and IRC channel right now, with plenty of people asking questions and starting to use the system to build online games.

    We get newcomers of all kinds, from experienced coders wanting to migrate from other code bases to newbies who are well versed in mudding but who aim to use Evennia for learning Python. At the moment the types of games planned or under development seem rather evenly distributed between RPI-style MUDs and MUSH games (maybe with a little dominance of MUSH) but there are also a couple of hack-and-slash concepts thrown into the mix. We also get some really wild concepts pitched to us now and then. What final games actually come of it, who can tell, but people are certainly getting their MU*-creative urges scratched in greater numbers, which is a good sign.

    Since Christmas our "devel" branch is visible online and is teeming with activity. So I thought I'd post a summary about it in this blog. The more detailed technical information for active developers can be found on Evennia's mailing list here (note that full docs are not yet written for devel-branch).

    Django proxies for Typeclasses

    I have written about Evennia's Typeclass system before on this blog. It is basically a way to "decorate" Django database models with a second set of classes to allow Evennia developers to create any type of game entity without having to modify the database schema. It does so by connecting one django model instance to one typeclass instance and overloading __setattr__ and __getattribute__ to transparently communicate between the two.
    For the devel branch I have refactored our typeclass system to make use of Django's proxy models instead. Proxy models have existed for quite a while in Django, but they simply slipped under my radar until a user pointed them out to me late last year. A proxy model is basically a way to "replace the Python representation of a database table with a proxy class". Sounds like a Typeclass, doesn't it?
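
    To make the older mechanism concrete, here is a minimal plain-Python sketch of the attribute forwarding described above. It is an illustration only and not Evennia's actual code: FakeModel stands in for a Django model instance, and OldStyleTypeclass and Sword are hypothetical names invented for this example.

```python
class FakeModel:
    """Hypothetical stand-in for a Django model instance (no database involved)."""

    def __init__(self, key):
        self.key = key


class OldStyleTypeclass:
    """Wrapper that transparently forwards attribute access to a wrapped model."""

    def __init__(self, dbobj):
        # Bypass our own __setattr__ while storing the wrapped object.
        object.__setattr__(self, "dbobj", dbobj)

    def __getattribute__(self, name):
        try:
            # Look on the typeclass (instance and class) first.
            return object.__getattribute__(self, name)
        except AttributeError:
            # Fall back to the wrapped model.
            dbobj = object.__getattribute__(self, "dbobj")
            return getattr(dbobj, name)

    def __setattr__(self, name, value):
        dbobj = object.__getattribute__(self, "dbobj")
        if hasattr(dbobj, name):
            # The model owns this attribute, so write it there.
            setattr(dbobj, name, value)
        else:
            # Otherwise keep it on the typeclass instance itself.
            object.__setattr__(self, name, value)


class Sword(OldStyleTypeclass):
    """A game entity defined without touching the (fake) model class."""

    def describe(self):
        return "A sword named %s" % self.key


model = FakeModel("Excalibur")
sword = Sword(model)
sword.key = "Sting"   # written through to the model
sword.damage = 5      # stored on the typeclass only
```

    With a proxy model the wrapper and the model become a single class, which is why this forwarding layer is no longer needed in the devel branch.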

    Now, proxy models don't work quite like typeclasses out of the box - for one thing if you query for them you will get back the original model and not the proxy one. They also do not allow multiple inheritance. Finally I don't want Evennia users to have to set up Django Meta info every time they use a proxy. So most work went into overloading the proxy multiclass inheritance check (there is a Django issue about how to fix this). Along the way I also redefined the default managers and __init__ methods to always load the proxy actually searched for and not the model. I finally created metaclasses to handle all the boilerplate. We chose to keep the name Typeclass also for this extended proxy. This is partly for legacy reasons, but typeclasses do have their own identity: they are not vanilla Django proxies nor completely normal Python classes (although they are very close to the latter from the perspective of the end user).

    Since typeclasses now are directly inheriting from the base class (due to metaclassing this looks like normal Python inheritance), it makes things a lot easier to visualize, explain and use. Performance-wise this system is on par with the old, or maybe a little faster, but it will also be a lot more straightforward to cache than the old. I have done preliminary testing with threading and it looks promising (but more on that in a future post).

    Evennia as a Python library package

    Evennia has until now been solely distributed as a version controlled source tree (first under SVN, then Mercurial and now via Git and GitHub). In its current incarnation you clone the tree and find inside it a game/ directory where you create your game. A problem we have when helping newbies is that we can't easily put pre-filled templates in there - if people used them there might be merge conflicts when we update the templates upstream. So the way people configure Evennia is to make copies of template modules and then change the settings to point to that copy rather than the default module. This works well but it means a higher threshold of setup for new users and a lot of describing text. Also, while learning Git is a useful skill, it's another hurdle to get past for those who just want to change something minor to see if Evennia is for them.

    In the devel branch, Evennia is now a library. The game/ folder is no longer distributed as part of the repository but is created dynamically by using the new binary evennia launcher program, which is also responsible for creating (or migrating) the database as well as operating the server:

    evennia --init mygame
    cd mygame
    evennia migrate
    evennia start

    Since this new folder is not under our source tree, we can set up and copy pre-made template modules to it that people can just immediately start filling in without worrying about merge conflicts. We can also dynamically create a setting file that fits the environment as well as set up a correct tree for overloading web functionality and so on. It also makes it a lot easier for people wanting to create multiple games and to put their work under separate version control.

    Rather than traversing the repository structure as before you henceforth will just do import evennia in your code to have access to the entirety of the API. And finally this means it will (eventually) be possible to install Evennia from PyPI with something like pip install evennia. This will greatly ease the first steps for those not keen on learning Git.

    For existing users

    Both the typeclasses-as-proxies and the evennia library changes are now live in the devel branch. Some brave users have already started taking it through its paces but it will take some time before it merges into master.

    The interesting thing is that despite all this sounding like a huge change to Evennia, the coding API doesn't change very much, and the database schema almost not at all. With the exception of some properties specific to the old connection between the typeclass and model, code translates over pretty much without change from the developer's standpoint.

    The main translation work for existing developers lies in copying over their code from the old game/ directory to the new dynamically created game folder. They need to do a search-and-replace so that they import from evennia rather than from src or ev. There may possibly be some other minor things. But so far testers have not found it too cumbersome or time consuming to do. And all agree that the new structure is worth it.
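
    The search-and-replace step could be sketched roughly as follows. This is a hedged illustration, not an official migration tool: rewrite_imports is a hypothetical helper, and it assumes a plain prefix swap of src or ev for evennia is enough, which may not hold for every module path.

```python
import re

# Match "from src", "from ev", "import src" or "import ev" at the start of a
# line, where src/ev is a whole top-level package name (so "events" or
# "src_tools" are left alone).
_IMPORT_RE = re.compile(r"^([ \t]*)(from|import)[ \t]+(?:src|ev)\b", re.MULTILINE)


def rewrite_imports(source):
    """Return source with old-style src/ev imports pointed at evennia."""
    return _IMPORT_RE.sub(r"\1\2 evennia", source)


old_code = "from src.utils import search\nfrom ev import Object\nfrom events import x\n"
new_code = rewrite_imports(old_code)
```

    Only the import lines are rewritten here, so code that refers to a bare ev or src name elsewhere in the module would still need to be updated by hand.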

    So, onward into 2015!

    Image: "Bibliothek St. Florian". Original uploader was Stephan Brunker at de.wikipedia; later versions were uploaded by Luestling at de.wikipedia. Originally from de.wikipedia. Licensed under CC BY-SA 3.0 via Wikimedia Commons.

    Mirroring my article on Chennai 36 – The Alumni Blog of IITM

    By Django for beginners from Django community aggregator: Community blog posts. Published on Jan 18, 2015.

    Recently, I wrote an article for Chennai36, which is a blog maintained by the Alumni Association of IITM. The article was about my opinions on how to apply to graduate school as an undergrad at IITM. In this post, I am mirroring that article on this blog.


    The original article can be found here.


    The Grad Guru : Karthik Abinav at University of Maryland, College Park

    Note: Whenever I mention grad school, I am referring to a PhD program. Though most of this advice also applies to MS programs, you should bear in mind that my focus is on PhD programs. Since I do not have much knowledge about MS programs, I will not comment much about them in this article.

    1.  Please tell us about yourself, the university you are studying at, the research field you are working on, and the scope it has to offer after an MS or PhD. Also tell us about a typical day in the life of a postgraduate student.

    I recently started my PhD in Computer Science at the University of Maryland at College Park. I am broadly interested in theoretical computer science. At a high level, this area deals with the mathematical formalism for many of the fundamental computer science problems. The direct application of the work in this area is primarily in computer science, but not limited to it. Due to its extremely mathematical nature, most problems are abstracted sufficiently to be applicable in a variety of other fields. Some of the common fields where work in this area is used are Economics, Operations Research, Computational Biology, etc. As a researcher in this field, you have multiple options after a PhD. The most obvious choice is academia, either as a research scientist at one of the research labs in industry or as a professor at one of the universities. But, due to the versatility of this subject, you can work at a number of other places which at first do not seem related. For example, a quant researcher at a financial services firm is one such role. The kind of ideas required in those places requires strong training in mathematics and computer science. Theoretical computer science is one way to give a researcher those particular skills. Without going into great detail, I would list some places one could join if one is not too keen on academia. Places such as Google, eBay, Amazon, etc., which do so-called “product-oriented research”, are yet another great place where a PhD in theoretical computer science can be of great help. In recent days, with computational biology relying heavily on techniques from computer science, a career as a computational biologist in industry is yet another lucrative option. With both industry and academia moving at such a fast pace these days, one should also account for the options which do not exist as of now, but will be real options when one graduates.

    A typical day in the life of a grad student varies widely depending on what stage of a PhD one is in. For example, most universities have some form of course requirement to be completed in the first few years. Hence, a significant part of the first couple of years is spent doing graduate coursework. In my particular case, I spend about 8 hours a week attending classes and doing related work. Since I serve as a teaching assistant, I spend another 7-8 hours teaching classes, holding office hours and grading. I spend about 15 hours doing research in the form of thinking about problems, reading relevant materials, attending talks etc. Since these activities are amortized over a week, one can scale them appropriately to get an idea about the workload per day. The most important lesson I have started to learn is that a high amount of self-discipline is required, as a grad student, to get all the work done. If you want to give your fullest towards your research and classes, and yet have a life outside the university, it is highly critical that you make a strict schedule and stick to it. This is yet another important skill you learn in pursuit of a PhD.
    2.  When did you decide to apply for further studies? What are the necessary skills, according to you, a person should develop in order to be cut out for research, and not just to get into a good grad school?
    My decision to apply to grad school came somewhere in third year, when I felt I might like the process of doing research. My first sparks came from thinking about problems from class work and the idea of figuring out solutions after toiling at them for a while. A typical “undergrad attitude” (which I admit, I too had at that point) is to believe that solving hard problems from coursework is in fact a great indicator of liking research. But in reality, doing research involves a lot more activities (sometimes mundane) and one should ensure they like the whole package before diving into it. I tried my hand at some simple research problems at the institute and during my summer internship. What struck me most about these experiences was the fact that I actually liked almost all of the associated activities along with solving problems. Critical thinking, evaluating multiple solutions, collaborating with other researchers, effective communication and a lot of writing are some of the many activities that accompany doing research. I realized that I would like to have these skills and would also enjoy the process of developing them. So for anyone considering research, I would suggest they try out working on some research problem. The only way to know if you like running a marathon is by running a marathon. No amount of short sprints will prepare you for a marathon.
    3.  How did you make the choice between placements and applying? Aren’t people who are working on projects and making their resume good enough to apply to Grad school less preferred by recruiters?
    Though I had pretty much made up my mind to go to grad school, I went through the placement process nonetheless. In Computer Science, a lot of companies in the industry work on exciting stuff. They tackle problems that are challenging, and the skill set required to be a competitive candidate there has a large overlap with the skill set required to be a researcher. Hence, the conflict of “preparing a resume for grad school” is not too evident. However, one downside of going through both is preparing for the interviews alongside preparing your grad school application. The interview process for most of these companies requires focused practice, without which you would not stand a chance. This to me is the most important factor when considering both applying to grad school and going through the placement process. You need to put in double the usual effort and time. However, if one is considering grad school against a so-called “non-core” company, then that’s a different ball game. I would suspect that in such cases one should make a clear decision upfront and prepare accordingly. But since I never had to make such a decision, I may not be the right person to comment about it.
    4.  Is a high CGPA required for applying? How do you derive the motivation to study and get high grades in subjects not at all related to your research interests? Is it all lost for people below the ‘astronomical’ 9 point CGPA? How can they make up for not crossing the barrier? Does pursuing Honors add weight to his/her Grad school application?
    A high CGPA is neither a necessary nor a sufficient condition when it comes to grad school. It is just one of the many factors that decide admission to grad school. But this should not be interpreted to mean that a low CGPA doesn’t hurt your chances. A good CGPA always helps, although the principle of "the higher the better" is not necessarily correct. Similarly, a low CGPA doesn’t mean all is lost. But you have to make up for it with sufficient quality research.
    Very few people have a research interest fixed in mind early in their undergraduate years. So, the concept of a course not aligning with one’s research interest does not make much sense. In fact, these courses are the ones that would give you the first sparks about which research area might suit you. In my opinion, every course has something to offer and one should look to gain these things out of every course. Also, as with any other aspect of life, the good and the bad always come together. If getting a good CGPA means higher chances of an admit to grad school, and getting an admit to grad school means that much to you, you will definitely find the motivation to study hard for it. Also, the ‘astronomical’ 9-point CGPA is more of an artificial thing in my opinion. As I said before, CGPA is not the only factor and hence one should also do some quality research to increase one’s chances. It also helps one fully realize whether a career in research is the right path.
    The Honors program (at least in Computer Science) gives you the flexibility of taking more graduate electives. This particularly helps if you are unsure about what area excites you the most. But if you would rather explore that by doing more research and fewer courses, then the Honors program doesn’t necessarily add more value. For the most part, for grad school you should show the potential of a good researcher. Hence, doing more research is the best way to convince the committee that you are cut out for research. Whether you seek these experiences from graduate level electives or otherwise is a choice that is personal to you.

    5.  How relevant are extra-curricular and Positions of Responsibility? If any, what position did you hold, and how did it help you?
    With respect to direct impact on the application, extra-curriculars and positions of responsibility have almost no effect. The committees are looking for promising researchers and want just that. It does not really matter if you were a Shaastra Core or a Saarang Pro-show Coordinator, since those things aren’t going to give evidence about whether you will be a good researcher. But this doesn’t mean one should not try one’s hand at them. One should do so only for personal pleasure and shouldn’t look at them as an investment which would eventually reap some direct returns in terms of grad school applications. Among many things, they teach you a lot of life skills which you will otherwise never learn in a classroom. In my particular case, I remember doing some webops/mobops activities during Shaastra, Saarang, and placements. I did them since, at that time, they seemed challenging and nice to me. The biggest thing that helped me from those experiences was that I got to make friends with many of the smart people in insti.
    6.  Can you tell us about the other schools you applied to? Did you have alternate options? How did you select between them? How do we gauge the authenticity of world rankings of a university and to what extent are they reliable?
    I applied to about 10 schools overall with a mix of Computer Science and Operations Research programs. I had to make a decision between University of Michigan and University of Maryland, and ended up choosing University of Maryland. Personally, my criteria included a mix of department, potential advisors with whom I would work, strength of the allied departments like Electrical Engineering, Math, etc., location of the university and proximity to the closest city. A lot of people underestimate the last factor. In my opinion, that is as important a factor as choosing whom you work with. Grad school is a long journey spanning at least half a decade. I personally didn’t want to work in an isolated college town located in the middle of nowhere, with the travel to the nearest city requiring an hour’s flight.
    Another method for selection that seems to work quite well is to identify some of the top conferences in your field of research. You can see who are the researchers that are publishing consistently and look at the universities they are associated with. This will also give you a good idea about the strength of these places. Finally, you should also mail current grad students working at those places and find out more specific details about the places. Each of this process involves a lot of work, but unfortunately, there is no shorter alternative. Ultimately, you are looking at where you would want to spend your next five or so years. You might as well put in the effort in making sure it’s the best fit for you.
    World rankings are an extremely tricky topic that makes a lot of people uncomfortable. One way to interpret them is to look at them not as an absolute number but as a range: does the university rank in the top 5, the top 15, etc.? Beyond that, it is hard to distinguish between universities, because for the most part they are all similar. Also, most of the rankings lag reality by almost 3-5 years. Moreover, since professors move around a lot within academia, it is hard to capture the strength of a university with a single number. Hence, the best way is to speak to your professors and find out what is happening in the universities, who is currently doing good research, etc. Academia is a closely-knit circle, and usually professors in your department are the experts when it comes to choosing universities. In my particular case, I am extremely thankful to Prof. Jayalal Sarma, who helped me and three of my batch mates in choosing the right places to apply to. He spent a considerable amount of time researching the work people are currently doing. Without that, it would have been extremely hard for us to come up with a good list of places.
    7.  How did you identify your recos? What matters in LORs, the proximity with the referee or his stature in the research field? What is the relevance of SOPs, and how does one write ‘the perfect SOP’? Does an exchange program help? How important are recos, SOPs, CGPA, GRE score, projects/internships, publications etc. in relative percentage of weightage?
    Usually, a good recommender is someone who knows you well on a professional basis, can vouch for your skills and commitment, and has particular instances he can cite in the letter. Stature in the research field is a criterion to consider, but it should come only after all of the above holds true. An average letter, or a letter without any details, from a top-notch researcher does more harm than good. A good way to identify suitable recommenders is to talk with professors with whom you think you have either worked closely enough or done some quality research, and see what they feel about you. Most professors are open about their opinions and will say that they would not recommend you, or not recommend you strongly enough, if they do not have a strong opinion about you. Your internships also come in handy here. You would most likely have worked closely with researchers during that time, and if they were impressed with your work, they would give you strong recommendations. Also, a recommendation from a manager in industry is not as strong as one from a researcher in academia. Some universities explicitly restrict applicants to at most one recommendation from outside academia.
    A Statement of Purpose is a one-and-a-half to two-page document explaining why you are a suitable and strong candidate for the program. It is the place where you can talk about your goals, your motivations, and some of the technical and non-technical skills you have accumulated during your undergraduate years. The best way to write a so-called ‘perfect SoP’ is to be honest about oneself and clearly write out one’s motivations, strengths and goals in general. You can look at the SoP as an advertisement for a product. A bad advertisement is surely going to drive customers away from the product. A good SoP will give your product the first look it deserves. But beyond that, it’s the quality of the product that ultimately decides its popularity in the market. Similarly, beyond that initial attention, it’s the other components of your application that will ultimately decide whether you are admitted.
    Though there is no magical formula for how admission works, for the most part, I feel, the following ordering holds true:
    Letter of Recommendation > top-tier publications / research experience > CGPA > SoP, GRE, others
    Since most of academia works by word of mouth, a good recommendation from one person is more than enough to sway the decision in your favor. Similarly, one bad recommendation is enough to ensure that your application is rejected.

    8.  Does work experience hold any importance, if yes, is it not advisable to work for a couple of years and then apply to Grad schools?
    Work experience is a tricky factor to weigh when considering grad school. Many people, especially in the US, come to grad school after working for a while. They seem to have a better idea about the importance of their research than many people straight out of undergrad. On the other hand, when it comes to reaching one’s full potential, most people believe that the earlier you start, the better your chances of reaching it. Hence, in conclusion, there is no correct answer, and it depends on the individual case. But if you think you do not have sufficient research experience, or do not yet know if you can commit yourself fully to a PhD, working as a research assistant is one nice option. This gives you a flavor of research and also strengthens your application to grad school. You can search for research assistant opportunities at a number of places, including IITM, IISc, etc. Here again, your professor will be your best guide and can direct you to the right people.
    9.  What are the research internship avenues a student can look at? Could you please share with us your list of internships/projects and also the ones you are aware of? How did it help you? When is an ideal time to apply, and how does one go about it? Are students expected to do projects in the same field of research as they are applying, as they might not have decided on their topic of interest before actually working on it? How important is a foreign research internship, and how does it weigh as compared to an industrial internship?
    There are a number of avenues to seek research experience as an undergrad. There is no place like home; hence working on a research problem with one of your professors is a great way to step into research. There would surely be a course or two that excited you and that you would want to explore further. Speaking with the professor and letting him know that you are interested is a great first step.
    Other avenues for internships include programs like DAAD, MITACS, the SN Bose program, etc. Flyers for most of these are circulated throughout the institute, and one should definitely check these options. These programs are usually highly structured to help undergrads get a concrete problem to work on. The program usually has a number of hosts participating, each of whom has a specific problem in mind. These problems are chosen such that they are approachable by an undergrad, can be completed within the summer, and can possibly lead to a good publication. In many cases, students are also free to suggest their own problems they would like to work on, and most researchers are open to working on those problems.
    Some other opportunities are research labs such as IBM, Microsoft Research, etc. For the most part, they follow a model similar to working at a university, where the host mentor has a specific problem in mind and you go ahead and work on it.
    Most of the application process happens between September and January before the summer. The best way to search for internships is to keep a lookout for interesting opportunities that are circulated by the department branch counselors. It is also a nice idea to talk to a professor you know and see if he has some ideas about where you could spend the summer. A “foreign” internship is not that important; what is important is that you work on some research problem. Since your ultimate goal is to see if you like research and to apply with the required credentials to a good grad school, it is really important that you work on research. Where exactly you seek these opportunities is not so important. An industrial internship is not so relevant, especially if you plan to apply to grad school. With respect to computer science, most industrial internships are software engineer positions in which you are expected to write code. They are equally challenging and fun, but they do not help prepare you for grad school.
    10.  Please tell us about the funding options for a Grad school? Did you apply for scholarships? Who is eligible for them? Is working part time over there a way to meet tuition fees/etc.? How much does one generally have to spend from his own pocket (savings/loans)? What is the cost of living for married research scholars, approximately?
    Throughout this article I have been referring to grad school as a PhD program, and almost all PhD students are completely funded by the department in the form of an RA, a TA, a department fellowship, etc. The funds given by various departments vary according to the location, the cost of living, the amount of funds the department receives, etc. Nobody ever becomes rich by attending grad school, but the funding is sufficient for a comfortable living. These funding sources usually also cover the tuition and give a monthly stipend for covering living expenses.
    Besides, there are a number of external fellowships one has access to. I will give some details with respect to universities in the United States. In computer science, there are fellowships such as Google PhD fellowships, Microsoft PhD fellowships, Facebook PhD fellowships, etc. that are open to international students, though most of these fellowships are open only to grad students in the later years of their studies. If you are a citizen of the US or a permanent resident, then you have access to a wider range of fellowships. One such extremely popular fellowship is the National Science Foundation Fellowship. Usually, one has to apply for it before joining grad school or no later than two years into grad school. It usually covers tuition and provides a stipend for four years of your study in the program. Hence, if you are eligible, it is a highly prestigious fellowship to apply for, irrespective of your other funding sources.
    All in all, funding is not a major concern for most PhD programs in Computer Science. You will not have to take student loans or borrow money from others. As far as married research scholars are concerned, it’s slightly more tricky, but still manageable. If both members have some form of income, then one can easily manage with the funding available in grad school. In case the family depends on a single source of income, it is still manageable, but one has to live a frugal life.
    11.  Students fear that, later, they might realize that they have no interest in the research field they have chosen, and hence hesitate to commit such a long duration of their lifetime. What do you suggest should one do/think in such a case? Also, how to choose among MS, PhD and an MS+PhD integrated program?
    Just to clarify things upfront, in the US most schools have an MS program and a PhD program, which by default is an MS+PhD program. What this means is that you will get an MS “on the way” to your PhD. Some schools explicitly force you to finish the MS requirements within a stipulated time and earn the degree, while others give you until the time you finish your PhD to complete the requirements and hence earn the MS degree. If I am not wrong, very few schools have a PhD program where they do not give you an MS if you choose to drop out at a later stage.
    It is indeed a valid concern, but it is not something to really worry about. You should be more concerned about whether or not you like doing research in general. A lot of people switch areas at various points in their career. In some sense, a PhD prepares you to work on a new area by overcoming the steep learning curve as quickly as possible. Hence, if at any stage of your PhD you lose the spark you had for your area, you can still switch to another interesting area (as long as it’s not too far away from your original area) and continue research there. In fact, most graduate schools recommend that first-year PhD students keep an open mind and explore areas before fixing upon one. Hence, in summary, the only question is whether or not you wish to commit such a long time to research. Do not fret too much about whether you would have sustained enthusiasm for the single topic you have in mind before joining grad school. You can always find interesting problems to work on as long as you like the process of doing research.

    To choose between an MS and a PhD program, there are multiple criteria. Firstly, most MS programs are based on courses. Your objective is to finish a set of courses and maybe a short project. On the other hand, the primary purpose of a PhD is to make you an independent researcher. You do take courses, but they are just there to supplement your research. Your focus would always be on doing good research. Hence, if one has an idea about what one wants to accomplish, then the choice to be made is almost obvious. Sometimes, people may wonder if doing an MS first would give them a better idea about whether they are cut out for a PhD or not. This is not true in most cases, since the primary purpose of an MS degree is different. The only way to know if you like research is to do research.
    12.  Did you consider the options offered in other countries, say Germany/ Australia/ Singapore/ France/ Canada? If yes, can you please discuss the pros and cons of choosing them over graduation in the ‘famous’ schools in the States, in terms of fees and cost of living, quality of research and education, scope of jobs after graduating from those schools and the quality of life in those countries, as you see it?
    I didn’t apply to any schools outside the United States, though there are some schools in Europe and Canada, for example, which are comparable to a top university in the US. It was just my personal choice not to apply to any schools there. Both in Canada and in Europe, I believe one has to first apply to an MS program, complete the requirements, and only then apply to a PhD program. Unlike the US, they may not offer a direct PhD program. If anything, this just increases the total time spent in the MS+PhD program. Since I didn’t go through the specifics of the universities and weigh the pros and cons of those places in greater detail, I may not be able to comment much about the cost of living, quality of research etc.
    13.  What work do you plan to do after you finish your doctorate, and where do you see yourself after 5-10 years?

    I am still undecided about a particular career path. Broadly, it would be in the domain of research, but whether it is in academia, in industry, or elsewhere is something I do not have a clue about right now. This is partly because multiple factors affect one’s decision. For example, one of the major factors is the limited number of positions in the job market for academia. Roughly one in every ten graduating PhD students can realistically expect to find a position in academia. Other major factors include a preference for teaching vs. doing only research, contributing to product research as against intellectual pleasure, etc.


    Is Open Source Consulting Dead?

    By chrism from plope. Published on Sep 10, 2013.

    Has Elvis left the building? Will we be able to sustain ourselves as open source consultants?

    Consulting and Patent Indemnification

    By chrism from plope. Published on Aug 09, 2013.

    Article about consulting and patent indemnification

    Python Advent Calendar 2012 Topic

    By chrism from plope. Published on Dec 24, 2012.

    An entry for the 2012 Japanese advent calendar at

    Why I Like ZODB

    By chrism from plope. Published on May 15, 2012.

    Why I like ZODB better than other persistence systems for writing real-world web applications.

    A str.__iter__ Gotcha in Cross-Compatible Py2/Py3 Code

    By chrism from plope. Published on Mar 03, 2012.

    A bug caused by a minor incompatibility can remain latent for long periods of time in a cross-compatible Python 2 / Python 3 codebase.

    In Praise of Complaining

    By chrism from plope. Published on Jan 01, 2012.

    In praise of complaining, even when the complaints are absurd.

    2012 Python Meme

    By chrism from plope. Published on Dec 24, 2011.

    My "Python meme" replies.

    In Defense of Zope Libraries

    By chrism from plope. Published on Dec 19, 2011.

    A much too long defense of Pyramid's use of Zope libraries.

    Plone Conference 2011 Pyramid Sprint

    By chrism from plope. Published on Nov 10, 2011.

    An update about the happenings at the recent 2011 Plone Conference Pyramid sprint.

    Jobs-Ification of Software Development

    By chrism from plope. Published on Oct 17, 2011.

    Try not to Jobs-ify the task of software development.

    WebOb Now on Python 3

    By chrism from plope. Published on Oct 15, 2011.

    Report about porting to Python 3.

    Open Source Project Maintainer Sarcastic Response Cheat Sheet

    By chrism from plope. Published on Jun 12, 2011.

    Need a sarcastic response to a support interaction as an open source project maintainer? Look no further!

    Pylons Miniconference #0 Wrapup

    By chrism from plope. Published on May 04, 2011.

    Last week, I visited the lovely Bay Area to attend the 0th Pylons Miniconference in San Francisco.

    Pylons Project Meetup / Minicon

    By chrism from plope. Published on Apr 14, 2011.

    In the SF Bay Area on the 28th, 29th, and 30th of this month (April), 3 separate Pylons Project events.

    PyCon 2011 Report

    By chrism from plope. Published on Mar 19, 2011.

    My personal PyCon 2011 Report