
Open Source Posts

David E. Wheeler: PGXN Manager Upgraded

From Planet PostgreSQL. Published on Aug 31, 2015.

I took a little time this summer to finally address some nagging issues in PGXN Manager, the site to which extensions are uploaded and added to the master repository. Most of the changes were internal, improving the interface through which I sometimes have to re-index a release. There are also a couple of minor changes to the sample Makefile in the How To. But the most important change for new releases going forward is that version ordering is now enforced. That means two things:

  • Distribution versions must be greater than the version of the previous release. You can no longer release v1.2.0 today and v1.1.0 tomorrow.
  • Extension versions must be greater than or equal to versions in previous releases. It’s pretty common to have a new release version but have embedded extensions be the same version as before. But they can’t be any less than before.

The changes have been applied to the reindexing code as well, which also prevents versions from being greater than those in subsequent releases. That won’t come up very often—most of the time, I end up reindexing something that has just been released. But in the future I expect to add an interface for release managers to reindex their own distributions, so it may be that someone reindexes a release that’s a few versions old. Or not. But just in case, it’ll be handled.
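
As a toy illustration of the ordering rule (this is not PGXN's actual implementation, which enforces full semantic versioning), the check boils down to comparing parsed version tuples:

def version_tuple(v):
    """'1.2.0' -> (1, 2, 0); ignores any pre-release suffix for brevity."""
    return tuple(int(part) for part in v.split('-')[0].split('.'))

# Distribution versions must strictly increase...
assert version_tuple('1.2.0') > version_tuple('1.1.0')
# ...while embedded extension versions only need to be >= the previous release.
assert version_tuple('1.4.2') >= version_tuple('1.4.2')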

So what are you going to release on PGXN today? Get to it!

How to make one view do the work of two

By The Django Blog from Django community aggregator: Community blog posts. Published on Aug 31, 2015.

Hey folks! Today's topic is an interesting one: making one view do the work of two. In fact, you can make it do more than two; you can make it do eight at a time. Think of it as view overloading, if you're into that kind of thing.
So let's dive into it.

Take logging into your account, for example. Initially I used to split this process into three views: one for displaying the login page, one for verifying credentials, and a third for displaying the profile. With the method I outline in this post you can do that using only two views rather than three. We'll combine displaying the login page and verifying credentials into one view. The way we'll do that is with HTTP request methods. Let's see what those are (going to take a little help from TutorialsPoint for the exact definitions):

  1. GET: This method is used to extract data that is embedded in the URL. The data is identified by a '?'. For example, in the URL "http://some.url/?data1=value1&data2=value2" the data after the '?' can be translated as {'data1':'value1', 'data2':'value2'}
  2. HEAD: The same as GET, but it transfers only the status line and header section
  3. POST: This method sends data directly to the server rather than embedding it in the URL. It is used for handling sensitive data like passwords and credit card numbers, so what a user enters in the browser cannot be seen by someone looking over their shoulder
  4. PUT: An upload-based method. The representations of the target resource are replaced with the uploaded content
  5. DELETE: As the name indicates, this method deletes all representations of the target resource
  6. CONNECT: Establishes a tunnel to the server
  7. OPTIONS: Describes the communication options for the target resource
  8. TRACE: Performs a loopback test along the path to the target resource
Now each webpage can be accessed using one of those methods. You can overload a view by using an if block to see which method was used and do a different thing based on that.

from django.shortcuts import render

def login(request):
    if request.method == "GET":
        return render(request, "login.html")
    elif request.method == "POST":
        pass  # code to authenticate user

As you can see, if the request method is GET, which it is by default, the view just renders the login page. Normally the login form's method will be POST, since we'll be asking for passwords. So the login page is rendered via the GET branch and the verification is done in the POST branch. That's how you can overload views.
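
For completeness, here is one way the POST branch might be filled in using Django's stock django.contrib.auth helpers; the template name, form field names and the "profile" redirect target are illustrative assumptions, not part of the original post:

from django.contrib.auth import authenticate, login as auth_login
from django.shortcuts import redirect, render

def login(request):
    if request.method == "GET":
        # First visit: just show the login form.
        return render(request, "login.html")
    elif request.method == "POST":
        # Form submission: verify credentials and log the user in.
        user = authenticate(
            username=request.POST.get("username"),
            password=request.POST.get("password"),
        )
        if user is not None:
            auth_login(request, user)
            return redirect("profile")  # assumed URL name
        # Bad credentials: show the form again with an error flag.
        return render(request, "login.html", {"error": True})
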
That's all for today folks!

Heikki Linnakangas: Helsinki PostgreSQL meetup on Tuesday September 1st

From Planet PostgreSQL. Published on Aug 31, 2015.

Helsinki PostgreSQL User Group will convene tomorrow, on Tuesday September 1st 2015 at 15:00.

I will be babbling about the various index types in PostgreSQL: GIN, GiST, and the humble B-tree. What does the structure of each one look like? What are the strengths and weaknesses of each? And of course, there will be a lot of free chatter on whatever other topics you want to bring up.

We have also created a Meetup group. Registration at Meetup is recommended to ensure you get a seat: http://www.meetup.com/Helsinki-PostgreSQL-Meetup/.

This is the fourth meeting of the Helsinki PostgreSQL User Group. Come and meet fellow PostgreSQL users and developers. Welcome!

Time:
September 1st 2015
15:00

Place:
Rikhardinkatu Library
Rikhardinkatu 3
Helsinki

Lessons learned from building a django site with a reactjs front-end

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on Aug 30, 2015.

My colleague Gijs Nijholt just posted his blog entry lessons learned from building a larger app with React.js, which is about the javascript/reactjs side of a django website we both (plus another colleague) recently worked on.

Simplified a bit, the origin is a big pile of measurement data, imported from csv and xml files. Just a huge list of measurements, each with a pointer to a location, parameter, unit, named area, and so on. A relatively simple data model.

The core purpose of the site is threefold:

  • Import, store and export the data. Csv/xml, basically.
  • Select a subset of the data.
  • Show the subset in a table, on a map or visualized as graphs.

The whole import, store, export is where Django shines. The model layer with its friendly and powerful ORM works fine for this kind of relational data. With a bit of work, the admin can be configured so that you can view and edit the data.

Mostly "view" as the data is generally imported automatically. Which means you discover possible errors like "why isn't this data shown" and "why is it shown in the wrong location". With the right search configuration and filters, you can drill down to the offending data and check what's wrong.

Import/export works well with custom django management commands, admin actions and celery tasks.

Now on to the front-end. With the basis being "select a subset" and then "view the subset", I advocated a simple interface with a sidebar. All selection would happen in the sidebar, the main content area would be for viewing. And perhaps some view-customization like sorting/filtering in a table column or adjusting graph colors. This is the mockup I made of the table screen:

http://reinout.vanrees.org/images/2015/efcis_tabel_mockup.png

In the sidebar you can select a period, locations/location groups and parameters. The main area is for one big table. (Or a map or a graph component).

To quickly get a first working demo, I initially threw together three django views, each with a template that extended one base template. Storing the state (=your selection) as a dict in the django session on the server side. A bit of bootstrap css and you've got a reasonable layout. Enough, as Gijs said in his blog entry, to sell the prototype to the customer and get the functional design nailed down.

Expanding the system. The table? That means javascript. And in the end, reactjs was handy to manage the table, the sorting, the data loading and so on. And suddenly state started spreading. Who manages the state? The front-end or the back-end? If it is half-half, how do you coordinate it?

Within a week, we switched the way the site worked. The state is now all on the client side. Reactjs handles the pages and the state and the table and the graph and the map. Everything on one side (whether client-side or server-side) is handiest.

Here's the current table page for comparison with the mockup shown above:

http://reinout.vanrees.org/images/2015/efcis_tabel_website.png

Cooperation is simple this way. The front-end is self-contained and simply talks to a (django-rest-framework) REST django backend. State is on the client (local storage) and the relevant parameters (=the selection) are passed to the server.

Django rest framework's class based views came in handy. Almost all requests, whether for the map, the table or the graph, are basically a filter on the same data, only rendered/serialized in a different way. So we made one base view that grabs all the GET/POST parameters and uses them for a big django query. All the methods of the subclassing views can then use those query results.

A big hurray for class based views that make it easy to put functionality like this in just one place. Fewer errors that way.
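
As a rough sketch of that pattern (the Measurement model, its serializer and the filter parameter names below are made-up stand-ins, not the project's actual code), the shared base view might look something like this:

from rest_framework import generics
from rest_framework.response import Response

from myapp.models import Measurement              # hypothetical model
from myapp.serializers import MeasurementSerializer  # hypothetical serializer


class BaseSelectionView(generics.GenericAPIView):
    """Translate the client's selection parameters into one shared queryset."""

    def get_queryset(self):
        params = self.request.query_params
        qs = Measurement.objects.all()
        if params.get("start_date"):
            qs = qs.filter(timestamp__gte=params["start_date"])
        if params.get("locations"):
            qs = qs.filter(location_id__in=params["locations"].split(","))
        if params.get("parameters"):
            qs = qs.filter(parameter_id__in=params["parameters"].split(","))
        return qs


class TableView(BaseSelectionView):
    """Same filtered data as the map/graph views, rendered as table rows."""

    def get(self, request, *args, **kwargs):
        serializer = MeasurementSerializer(self.get_queryset(), many=True)
        return Response(serializer.data)

A MapView or GraphView subclass would reuse the same get_queryset() and only change how the results are serialized.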

Some extra comments/tips:

  • Even with a javascript front-end, it is still handy to generate the homepage with a Django template. That way, you can generate the URLs to the various API calls as data attributes on a specific element. This prevents hard-coding in the javascript code:

    <body>
      <!-- React app renders itself into this div -->
      <div id="efcis-app"></div>
    
      <script>
        // An object filled with dynamic urls from Django (for XHR data retrieval in React app)
        var config = {};
        config.locationsUrl = '{% url 'efcis-locaties-list' %}';
        config.meetnetTreeUrl = '{% url 'efcis-meetnet-tree' %}';
        config.parameterGroupTreeUrl = '{% url 'efcis-parametergroep-tree' %}';
        config.mapUrl = '{% url 'efcis-map' %}';
        window.config = config;
      </script>
      ...
    </body>
    
  • Likewise, with django staticfiles and staticfiles' ManifestStaticFilesStorage, you get guaranteed unique filenames so that you can cache your static files forever for great performance.
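
    For reference, switching to that storage backend is essentially a one-line settings change (the paths here are placeholders):

    # settings.py -- hashed, cache-forever-safe filenames (Django 1.7+)
    STATIC_URL = '/static/'
    STATIC_ROOT = '/srv/example/static'  # placeholder path
    STATICFILES_STORAGE = 'django.contrib.staticfiles.storage.ManifestStaticFilesStorage'

    After running collectstatic, {% static %} references in templates resolve to the hashed filenames, so far-future cache headers are safe.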

Lessons learned?

  • Splitting up the work is easy and enjoyable when there's a REST back-end and a javascript front-end and if the state is firmly handled by the front-end. Responsibility is clearly divided that way and adding new functionality is often a matter of looking where to implement it. Something that's hard in javascript is sometimes just a few lines of code in python (where you have numpy to do the calculation for you).

    Similarly, the user interface can boil down complex issues to just a single extra parameter sent to the REST API, making life easier for the python side of things.

  • When you split state, things get hard. In practice, in my experience, that means the javascript front-end wins. It takes over the application and the django website is “reduced” to an ORM + admin + REST framework.

    This isn't intended as a positive/negative value statement, just as an observation. Though a javascript framework like reactjs can be used to just manage individual page elements, often after a while everything simply works better if the framework manages everything, including most/all of the state.

Oleg Bartunov: Toward a million tps

From Planet PostgreSQL. Published on Aug 30, 2015.

Our company Postgres Professional has conducted performance testing on a big P8 server provided by IBM. One exciting thing is that we were able to identify a scalability issue and make a patch (a kludge for now), which brought Postgres performance close to a million tps. We'll discuss the approach used in the patch on -hackers.


More details are in Alexander Korotkov's blog post:

http://www.postgrespro.ru/blog/pgsql/2015/08/30/p8scaling

In the next post we'll compare the performance of Postgres and MariaDB :)

gabrielle roth: My “Must-see” List for PgOpen

From Planet PostgreSQL. Published on Aug 30, 2015.

PostgresOpen is coming up in just a few weeks. (Disclosure: I am on the conference committee.) We are still working out the room assignments; your conference badge/printed program will have the right info. The committee is very excited to have snagged Lacey Williams Henschel to give our keynote on Open Source & Higher Ed. I’m looking […]

Satoshi Nagayasu: sql_firewall: a SQL Firewall Extension for PostgreSQL

From Planet PostgreSQL. Published on Aug 29, 2015.

A few days ago, I released a brand-new PostgreSQL extension called "sql_firewall". https://news.ycombinator.com/item?id=10109566 https://github.com/uptimejp/sql_firewall sql_firewall is intended to protect PostgreSQL databases from SQL injection attacks by limiting the SQL queries that can be executed on the database. In this entry, I would like to introduce how it works and how to use it. How […]

Eric Hanson: Aquameta Layer 0: meta - Writable System Catalog for PostgreSQL

From Planet PostgreSQL. Published on Aug 28, 2015.


In the introduction, we talked about Aquameta's first principle, datafication. We wield the broadsword of datafication as we charge the many-headed hydra that is unnecessary programming complexity. In this writeup, we describe our first, and in some ways most challenging, foe: datafication of the database itself.

Layer zero of Aquameta is called meta, a writable system catalog for PostgreSQL. It exposes PostgreSQL administration commands through data manipulation commands, enabling schema creation, table creation, column renaming, role dropping, and much more via INSERTs, UPDATEs and DELETEs. In other words, it makes the DDL accessible via the DML.

Meta unifies "normal" data with schema, database configuration, everything that is PostgreSQL, into a coherent and synthesized information model. Everything is accessible as data. This adds the needed paths to make PostgreSQL truly homoiconic, which breaks down the wall between schema and data, and opens the doors for all manner of meta-programming.

PostgreSQL already has two system catalogs, INFORMATION_SCHEMA and pg_catalog. Meta is different because:

  • It's writable, and changes to the data take effect in the database
  • It has friendly names for relations and columns
  • It's normalized: the views are sensibly laid out in parallel with PostgreSQL's architecture
  • It has a type system of meta-identifiers for primary keys

Here's a simple example of how to use it. Instead of doing:

aquameta=# create schema test_schema;  
CREATE SCHEMA  

You can do:

aquameta=# insert into meta.schema (name) values ('test_schema');  
INSERT 0 1  

These two commands perform identical operations under the hood; meta just makes them accessible through a different interface.

Here is an ER diagram of meta's schema:

[ER diagram of the meta schema]

aquameta=# set search_path=meta;
aquameta=# \d  
 Schema |         Name         | Type | Owner
--------+----------------------+------+-------
 meta   | cast                 | view | eric
 meta   | column               | view | eric
 meta   | connection           | view | eric
 meta   | constraint_check     | view | eric
 meta   | constraint_unique    | view | eric
 meta   | extension            | view | eric
 meta   | foreign_column       | view | eric
 meta   | foreign_data_wrapper | view | eric
 meta   | foreign_key          | view | eric
 meta   | foreign_server       | view | eric
 meta   | foreign_table        | view | eric
 meta   | function             | view | eric
 meta   | operator             | view | eric
 meta   | relation             | view | eric
 meta   | role                 | view | eric
 meta   | schema               | view | eric
 meta   | sequence             | view | eric
 meta   | table                | view | eric
 meta   | trigger              | view | eric
 meta   | type                 | view | eric
 meta   | view                 | view | eric
(21 rows)

Each relation in meta is a VIEW that queries INFORMATION_SCHEMA, pg_catalog, or wherever else we had to dig to get the data. These views support INSERT, UPDATE and DELETE statements via TRIGGERs that translate the operation into a native PostgreSQL command.

We have a good start on PostgreSQL feature coverage. You can do most common operations through meta instead. We don't have 100% feature coverage yet, but that is the goal.

On the surface, these views expose a clean, consistent, writable interface for PostgreSQL administration via data manipulation.

Examples

Here are a few examples of how you might use meta:

/* drop all the schemas in the database.  Highly destructive! */
delete from meta.schema;  
/* create a table with no columns called `foo` in the `public` schema */
insert into meta.table (name, schema_name) values ('foo', 'public');  
/* rename all columns named `id` to `foo` */
update meta.column set name='foo' where name='id';  
/* list all columns in the beehive schema */
select r.name, c.name, c.type_name  
    from meta.column c
    join meta.relation r on c.relation_id = r.id 
    join meta.schema s on r.schema_id = s.id
    where s.name = 'beehive'
    order by r.name, c.position;
    relation_name     |            name             |      type_name
----------------------+-----------------------------+----------------------
 brands               | name                        | pg_catalog.text
 brands_brand         | id                          | pg_catalog.int4
 brands_brand         | name                        | pg_catalog.text
 brands_brand         | show_on_website             | pg_catalog.bool
 brands_brandcategory | id                          | pg_catalog.int4
 brands_brandcategory | name                        | pg_catalog.text
 brands_brandcategory | brand_id                    | pg_catalog.int4
 brands_brandgroup    | id                          | pg_catalog.int4
 brands_brandgroup    | brand_id                    | pg_catalog.int4
 brands_brandgroup    | name                        | pg_catalog.text
 ...
 (462 rows)
/* list of all the relations in the `beehive` schema */
select name, type from meta.relation where schema_name='beehive';  
                  name                  |    type    
----------------------------------------+------------
 vendor_paymentterm                     | BASE TABLE
 product_margin                         | BASE TABLE
 favorites                              | BASE TABLE
 countries_usstate                      | BASE TABLE
 product_cost                           | VIEW
 warehouse_pieces_piecebreakdown        | BASE TABLE
 ...
(144 rows)

Anatomy of a meta View

Let's take a look at one of these views in detail, meta.column:

aquameta=# \d meta.column  
                       View "meta.column"
    Column     |                Type                | Modifiers
---------------+------------------------------------+-----------
 id            | meta.column_id                     |
 relation_id   | meta.relation_id                   |
 schema_name   | information_schema.sql_identifier  |
 relation_name | information_schema.sql_identifier  |
 name          | information_schema.sql_identifier  |
 position      | information_schema.cardinal_number |
 type_name     | text                               |
 type_id       | meta.type_id                       |
 nullable      | boolean                            |
 default       | information_schema.character_data  |
 primary_key   | boolean                            |
Triggers:  
    meta_column_delete_trigger INSTEAD OF DELETE ON meta."column" FOR EACH ROW EXECUTE PROCEDURE meta.column_delete()
    meta_column_insert_trigger INSTEAD OF INSERT ON meta."column" FOR EACH ROW EXECUTE PROCEDURE meta.column_insert()
    meta_column_update_trigger INSTEAD OF UPDATE ON meta."column" FOR EACH ROW EXECUTE PROCEDURE meta.column_update()
  • The id field is a kind of "soft" primary key. It's of a special type, meta.column_id, which is one of the "meta-identifiers" in the system that uniquely identifies a column with a single value. More about meta-identifiers later.
  • The relation_id field is another meta-identifier, a kind of "soft foreign key" to the meta.relation view, which contains a row for the table or view that this column is a member of.
  • Then come the human identifiers: schema_name, relation_name and name. These are what they sound like. When INSERTing into this view, you need to specify either the human identifiers or the meta-identifiers above.
  • The position field tells where this column is in relation to the other columns. It is not currently updatable, as PostgreSQL does not support column reordering.
  • The type_name and type_id fields reference the data type of this column. type_id is another meta-identifier, this one a "soft foreign key" to the meta.type relation. You can UPDATE the type either through type_name or type_id, and if PostgreSQL can cast from the original datatype to the new one, it will update the column's type. Otherwise the UPDATE will fail without changing anything.
  • The nullable field is a boolean that determines whether the column is nullable. It behaves as you would expect.
  • The default field contains the column's default value, represented as text. You can update this as well.
  • The primary_key boolean determines whether or not this column is a primary key. Aquameta assumes a single primary key on all tables.
  • Finally, the TRIGGERs listed handle INSERT, UPDATE and DELETE, passing off the operation to the functions meta.column_delete(), meta.column_insert() and meta.column_update().

All together, this view is a general purpose column administration interface. The rest of the meta views behave similarly.

The meta-identifier TYPE System

Besides just the views, meta also contains a system of meta-identifiers, a collection of PostgreSQL composite types that encapsulate the unique identifier for a PostgreSQL component as a single value. You can think of them as the primary keys of the meta views.

 PostgreSQL Entity | PostgreSQL identifier(s)                                          | meta-identifier
-------------------+-------------------------------------------------------------------+-----------------
 schema            | name                                                              | schema_id
 cast              | name                                                              | cast_id
 relation          | schema_name, name                                                 | relation_id
 function          | schema_name, name                                                 | function_id
 type              | schema_name, name                                                 | type_id
 operator          | schema_name, name                                                 | operator_id
 sequence          | schema_name, name                                                 | sequence_id
 trigger           | schema_name, relation_name, name                                  | trigger_id
 foreign_key       | schema_name, relation_name, name                                  | foreign_key_id
 column            | schema_name, relation_name, name                                  | column_id
 constraint        | schema_name, relation_name, name                                  | constraint_id
 row               | schema_name, relation_name, column_name, name, pk_name, pk_value | row_id
 field             | schema_name, relation_name, column_name, pk_name, pk_value       | field_id

When querying against the meta relations, instead of using the human names as identifiers, you can also use the meta-identifiers:

/* select the beehive.customers_customer.name column */
select * from meta.column where id=meta.column_id('beehive','customers_customer','name');  

Meta-identifiers can be cast to other, less-specific identifiers. For example, to get the schema_id that a column_id contains, you can do:

select meta.column_id('beehive','customers_customer','name')::meta.schema_id;  

You can also use the meta-identifiers to do meta-programming in your own tables. For example, if you want to make a table that references a PostgreSQL column, it would look like this:

create table column_widget (  
    id serial primary key,
    name text,
    column_id meta.column_id
);

Benefits

An all-data interface to PostgreSQL has a lot of benefits:

  • Consistency: Operations have a high degree of simplicity and consistency, namely that they all take the form of an INSERT, UPDATE or DELETE instead of PostgreSQL's extensive DDL grammar.
  • Programming Simplicity: From a developer's perspective, writing, say, a PostgreSQL administration GUI against meta instead of the DDL means that the app is "just another CRUD app".
  • Batch Operations: Since UPDATEs and DELETEs can affect multiple rows with a single query, you can easily do batch operations with a single statement, like deleting all roles that match a WHERE clause, or renaming columns en masse.
  • Meta-Programming: Developers can make "normal" tables that foreign-key to one of meta's views, for example a reporting app that has a foreign key to the VIEWs behind the reports.

But really, we don't think we've fully wrapped our head around everything you can do with meta. We're excited to see how people use it.

Conclusion

Meta is the foundational layer for the rest of Aquameta. It allows us to build a programming environment where under the hood, programming is always some form of data manipulation.

In the next chapter, we'll cover Aquameta Layer 1, bundle, a data version control system.

Eric Hanson: Introducing Aquameta

From Planet PostgreSQL. Published on Aug 28, 2015.

I'm happy to announce Aquameta, a web application development platform built entirely in PostgreSQL.

Why We Must Rethink Our Stack

1. Software Development Complexity Bottlenecks Human Progress

The complexity and inaccessibility of software development is inhibiting businesses, education and social progress.

In businesses, especially large ones, software has become the digital nervous system that allows them to operate at scale. But because only programmers can change the software, experimentation with and evolution of business practices have actually become slower and more expensive. Most employees are powerless to make even simple software changes, and yet developers often don't have the domain expertise required to write good software for someone else's job.

Education is feeling it too. The demand for programmers is quickly outpacing supply, and for students, learning computer science is becoming more and more of an essential skill. Educators are well aware of this problem, but the complexity of programming makes it very challenging to teach.

Then there is social progress generally. End users have so many great ideas for software systems that could help us all collaborate with each other in deeper, richer ways. But to create them, they need either the money or the skill to bring them into reality. Software complexity is a huge bottleneck for social innovation. Not every idea has a business model behind it, and there must be so many good ideas that would exist if programming were easier.

We believe that it is a social imperative, arguably one of the greatest demands of our time, to make software development more accessible to everyone.

2. Today's Web Is Better At Making Money Than Fun

Two decades of profit-driven evolution have led to the web we have today. But this outcome wasn't inevitable, and we can still change it. We have learned a lot since the inception of the web, and when we look at it knowing what we know now, we see some ways things could be a lot cooler. Here are a few.

Conflation of Data and Presentation

Today's web is made of "pages", which conflate the data layer with the presentation layer. When we want an article or video, rather than being able to download just that content, we have to download the entire "page" the content is on, often containing ads, cross-promotions, etc. A more advanced architecture would give users the power to selectively control what they download.

Trending Towards Centralization

The Internet was designed as a peer-to-peer network, where every computer could communicate directly with every other computer. Yet the way we use the web today is primarily through centralized silos of information. All our computers talk to a really big computer, say Facebook, which mediates the interaction. If I want to share a photo with my mom, there is no architectural reason why any central party needs to be the mediator.

But it's about more than just privacy or exploitation of the commons. Centralized systems are BORING. The early days of the web were honestly more exciting, more raw, more wild-wild-west. We need to get back to that vibe.

Complexity of Systems Invention

Once upon a time there were no end-user tools for contributing to the web. "Users" just wrote HTML by hand. We've come a long way since then with GUI editors, blogging platforms, social networks, etc., but still today there is a wall between the technical haves and have-nots: it is very difficult for a "mere mortal" to build a data-driven web application. In the web we envision, building data-driven web apps is accessible to everyone.

We need to rethink programming and the web, to fix some of the architectural shortcomings and open up a new space full of new possibilities, and new problems to solve.

Approach

At the foundation of Aquameta's reimagining of programming and the web are two central ideas, datafication and visualization.

1. Datafication

The complexity of our stack is pretty daunting when you list it all out. Some of that complexity is unavoidable; computer science is just fairly complex. But there is another swath of knowledge and skills that has little to do with computer science and a lot more to do with the sheer diversity of our programming tools. Consider /etc, for example.

This diversity and complexity isn't a "problem". In fact, one could argue that it is an essential ingredient of the primordial soup from which so much innovation has emerged. However, when you take a step back and look at it, it seems pretty unnecessary, and it does make it harder for beginners to learn.

Our first step towards making programming easier is to use the database to model the development stack itself.

Typical Web Stack

  • HTML, CSS, Javascript
  • Client-side framework
  • Webserver
  • Web Framework
  • Event Queue
  • Programming Language
  • Version Control System
  • Database
  • Filesystem
  • Operating System

In Aquameta, the entire stack is accessible as data, so diverse tools can share a common information model. It puts the database at the foundation of the developer experience. We use PostgreSQL as a kind of "infokernel".

It turns out PostgreSQL is really good at this. Between foreign data wrappers and the various procedural languages, there isn't much that PostgreSQL can't make look like data.

A datafied stack provides a common interface to all these different tools, which eliminates a ton of complexity.

2. Visualization

Visual interfaces are a lot easier for beginners to use, and writing visual interfaces against a datafied stack is ridiculously easy and fun. As web programmers, we have a ton of experience writing interfaces to databases, so when the whole dev stack is data, we can apply that skill towards experimenting with user interfaces that users can quickly understand.

If Aquameta is a success, we'll see an explosion of different visual programming tools, and a vast diversity of competing approaches to making programming easier.

The Architecture

Aquameta has eight core layers, each of which is a PostgreSQL schema:

  • meta - a writable system catalog for PostgreSQL
  • bundle - a snapshotted version control system for data
  • filesystem - models the filesystem as relational data, and in reverse, makes PostgreSQL visible as a filesystem
  • event - a pub/sub data change event system built atop PostgreSQL's NOTIFY system
  • www - an HTTP request handler that authenticates against the PostgreSQL user system and serves up arbitrary text/binary resources in the database, files from the file system, a REST interface to the database, and websocket events
  • p2p - mechanisms for transmitting data between one end-user and another across NATs
  • widget - modular web user interface components
  • semantics - a space to decorate the schema with "meaning", human identifiers, crud widgets, etc.

Project Status

Aquameta is the result of several years of work by Eric Hanson and the folks at Aquameta Labs. It's currently pre-alpha. We haven't even done what we would call a 0.1 release. We're releasing it primarily for architecture nerds to get some feedback and contributions.

Our plan is to publish a writeup of each module in the coming weeks, starting with meta and moving up. We'd love to hear what you think, and patches are welcome.

http://github.com/aquametalabs/aquameta

Josh Berkus: Stupid Hacks: Dictionary replace function

From Planet PostgreSQL. Published on Aug 28, 2015.

I write a lot more PL/pgSQL than I'd like to.  Not that I don't like SQL, but as a language PL/pgSQL really shows its thrown-together origin; it's like 80's primitive.  One thing that PL/pgSQL lacks is good string manipulation tools, which is particularly tragic given that the #1 thing to do in PL/pgSQL is to generate queries from parameters and run EXECUTE.

Postgres has two built-in ways to do string substitution: concatenation and format().  Both have drawbacks.  Let me give you an example:

EXECUTE 'SELECT ' || col1 || ', ' || col2 || ' FROM ' || userschema ||
  '.accounts WHERE ' || filterclause || ' ORDER BY ' || col1 || ',' || col2;

EXECUTE format('SELECT %s, %s FROM %s.accounts WHERE %s ORDER BY %s, %s', col1, col2, userschema, filterclause, col1, col2);

You can see the problem here.  Both formats are hard to read and hard to maintain.  Python and Perl have a good fix for this: dictionary/hash-based string substitution, where you can swap in the dictionary keys for the values.  So I wrote up a quick hack to do this in PL/pgSQL.

Here's the dict-replace function.

Using it, you'd replace the above with:

EXECUTE replace_vars('SELECT ${col1}, ${col2} FROM ${userschema}.accounts
  WHERE ${filterclause} ORDER BY ${col1}, ${col2}', vardict);

Of course, you need to first set up the vardict as a JSON value, in the form:

vardict := '{ "col1" : "username", 'col2' : "branchname", ...}'

Still, much more readable, eh?  No refcounting, no repeating variables, no string breaks.  I used Bash's variable substitution syntax of ${var} because it seemed like the thing least likely to conflict with user text, unlike anything involving %.

There are some caveats, though: it's not performant, and you could probably find strings which will break it, particularly if you're swapping in JSON values.  You still have to do your own quoting of strings.  And if you have access to PL/Perl or PL/Python you don't need this nonsense.
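
For comparison, this is the kind of dictionary-based substitution Python gives you out of the box; string.Template even uses the same ${var} syntax (the values below are just placeholders):

from string import Template

vardict = {'col1': 'username', 'col2': 'branchname',
           'userschema': 'myschema', 'filterclause': 'active'}

query = Template('SELECT ${col1}, ${col2} FROM ${userschema}.accounts '
                 'WHERE ${filterclause} ORDER BY ${col1}, ${col2}')
print(query.substitute(vardict))
# SELECT username, branchname FROM myschema.accounts WHERE active ORDER BY username, branchname

The replace_vars() hack simply brings that convenience into PL/pgSQL.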

But for the minimal case, it should help.

Shaun M. Thomas: PG Phriday: Fancy Partitioning

From Planet PostgreSQL. Published on Aug 28, 2015.

This week we’ll be covering another method of Postgres partitioning. This is a technique I personally prefer and try to use and advocate at every opportunity. It’s designed to straddle the line between traditional partitioning and standard monolithic table structure by using table inheritance as a convenience factor. The assumption here is that end-user applications either:

  1. Know that partitioning is in use.
  2. Only load “current” data and don’t care about partitions.

These two things may seem mutually exclusive, but they can actually be complementary. There are also notable performance benefits to this approach that we’ll explore later.

The SQL for this structure is similar to what we used last week, but there are a couple of notable omissions.

TRUNCATE TABLE sensor_log;
 
CREATE TABLE sensor_log_part_2012 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2012-01-01' AND
           reading_date < '2013-01-01')    
) INHERITS (sensor_log);
 
CREATE TABLE sensor_log_part_2013 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2013-01-01' AND
           reading_date < '2014-01-01')    
) INHERITS (sensor_log);
 
CREATE TABLE sensor_log_part_2014 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2014-01-01' AND
           reading_date < '2015-01-01')    
) INHERITS (sensor_log);

This time, we don’t need the function that distributes the rows to the correct partition, or the associated trigger. Further, note that the 2015 partition is missing. This is possible because we’ve made two assumptions about how partitions will be used:

  1. Current data will be stored in the base table.
  2. Old or “archived” data will be moved into partitions.

To fulfill the first requirement, we made a slight change to the python loader script introduced at the beginning of this series. We could have done this with the partitioning introduced last week as well, but while that is a performance tweak for that format, this method depends on it.

# Set these to modify how large the COMMIT blocks will be,
# and how many days the data will represent.
# Total row count = chunk_size * date_range.
 
chunk_size = 100000
date_range = 1000
 
# Main program body
 
import psycopg2
from StringIO import StringIO
from datetime import datetime, timedelta
 
raw_data = StringIO()
db_conn = psycopg2.connect(database = 'postgres', user = 'postgres', port = 5433)
cur = db_conn.cursor()
 
print 'rows,seconds,microseconds'
 
for j in xrange(0, date_range):
    raw_data.seek(0)
    raw_data.truncate(0)
 
    usedate = datetime.now() - timedelta(days = j)
    stamp = datetime.combine(usedate.date(), usedate.time().max)
 
    for i in xrange(1, chunk_size + 1):
        stamp -= timedelta(seconds = 0.5)
        raw_data.write(
            '%s\t%s\t%s\n' % (i % 1000, i % 100, stamp)
        )
 
    raw_data.seek(0)
    t1 = datetime.now()
 
    # This is the part where we become partition-aware.
    # If this script only handled current data, we wouldn't
    # even need the 'part' variable at all.
 
    part = ''
    if stamp.year != datetime.now().year:
        part = '_part_%s' % stamp.year
 
    cur.copy_from(
        raw_data, 'sensor_log%s' % part, '\t',
        columns = ('location', 'reading', 'reading_date')
    )
 
    t2 = datetime.now()
    diff_s = str(t2 - t1).split(':')[2]
 
    print '%s,%s' % (
        (j + 1) * chunk_size, diff_s.replace('.', ',')
    )
 
    db_conn.commit()
 
raw_data.close()
db_conn.close()

As the comments in the code note, if the loader script only handles a stream of new and current data, we could remove the partition handling block entirely. We should also note that the script loads all of the partitions about 10% faster than a single monolithic table. Likely this is due to the index calculations being reduced in cost from reduced tree depth.

The primary benefit from this structure is that we can get more use out of the ONLY keyword. Most of the partitioned tables I’ve encountered are primarily concerned with “hot” data. Since we know that Postgres will always check the base table regardless of partition contents, why not use that to our advantage? The ONLY keyword makes that the first and last step.

This is how that looks to the planner:

EXPLAIN ANALYZE
SELECT COUNT(*)
  FROM ONLY sensor_log
 WHERE reading_date >= CURRENT_DATE;
 
                            QUERY PLAN                             
--------------------------------------------------------------------
 Aggregate  (cost=5141.55..5141.56 ROWS=1 width=0)
            (actual TIME=140.079..140.080 ROWS=1 loops=1)
   ->  INDEX ONLY Scan USING idx_sensor_log_date ON sensor_log
              (cost=0.57..4853.80 ROWS=115099 width=0)
              (actual TIME=0.043..75.019 ROWS=100000 loops=1)
         INDEX Cond: (reading_date >= ('now'::cstring)::DATE)
         Heap Fetches: 0
 
 Planning TIME: 0.184 ms
 Execution TIME: 140.126 ms

We can see that we only touched the base sensor_log table, ignoring any partitions that might exist. This is the part where being partition-aware is beneficial. Now systems that focus on current data can directly restrict queries to the “current” partition without having to know anything about the naming scheme or distribution process. And as we’ve already seen, table loading can also ignore any partitions if processing recent entries.

Of course, now we have a new problem. Since the current partition is the base table, once the data no longer matches our desired constraints, it needs to be relocated. When 2016 arrives, we need to move the 2015 data, or reorganize the tables themselves. Depending on partition granularity, the latter actually isn’t so bad. In fact, here’s a process that does it all at once in a transaction:

BEGIN;
 
DELETE FROM pg_inherits
 WHERE inhparent = 'sensor_log'::REGCLASS;
 
ALTER TABLE sensor_log RENAME TO sensor_log_part_2015;
CREATE TABLE sensor_log (LIKE sensor_log_part_2015 INCLUDING ALL);
 
ALTER TABLE sensor_log_part_2015
  ADD CONSTRAINT sensor_log_part_2015_reading_date_check
CHECK (reading_date >= '2015-01-01' AND
       reading_date < '2016-01-01');
 
INSERT INTO pg_inherits
SELECT oid, 'sensor_log'::REGCLASS, 1
  FROM pg_class
 WHERE relname LIKE 'sensor_log_part_%'
   AND relkind = 'r';
 
COMMIT;

Alternatively, we can maintain a more elastic system. Another way to use the base table + partition style is to only keep extremely recent data in the base table, such as the most recent week. Then, move any older data to a prepared partition. In this case we would already have the 2015 partition, and then every night we would run this maintenance SQL:

BEGIN;
 
INSERT INTO sensor_log_part_2015
SELECT *
  FROM ONLY sensor_log
 WHERE reading_date < CURRENT_DATE - INTERVAL '1 week';
 
DELETE FROM ONLY sensor_log
 WHERE reading_date < CURRENT_DATE - INTERVAL '1 week';
 
COMMIT;

In either case, this maintenance is easily scriptable, though clearly a bit more invasive than simply distributing the data outright with a trigger. What do we gain from that caveat?

  1. No more performance-robbing redirection trigger. This is great for high transaction volumes.
  2. The root table contains only the most relevant data. This allows us to treat inherited tables like optional extended archives.
  3. Data in the root table does not need to match the granularity of our partitions. Given the second data-movement approach, the base partition could hold one week of data, while each archive table reflects month or year distributions.
  4. Data readers only need to remember one rule: fast, recent data should use the ONLY keyword.
  5. Both readers and writers can remain blissfully ignorant of the naming scheme, distribution rules, and any other partitioning overhead. Everything can just target the base table and maintenance will sort it out later.
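
Since that nightly data movement is easily scriptable, here is a minimal sketch of how it might be wrapped for cron, reusing the psycopg2 style of the loader script above (the connection settings, the current-year partition name and the one-week cutoff are assumptions to adjust):

# Nightly maintenance: archive rows older than a week from the base
# sensor_log table into the current year's partition.
import psycopg2
from datetime import datetime

part = 'sensor_log_part_%s' % datetime.now().year
cutoff = "CURRENT_DATE - INTERVAL '1 week'"

db_conn = psycopg2.connect(database='postgres', user='postgres', port=5433)
cur = db_conn.cursor()

# Both statements run in one transaction, just like the SQL version above.
cur.execute(
    "INSERT INTO %s SELECT * FROM ONLY sensor_log WHERE reading_date < %s"
    % (part, cutoff)
)
cur.execute(
    "DELETE FROM ONLY sensor_log WHERE reading_date < %s" % cutoff
)

db_conn.commit()
db_conn.close()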

In the end, maybe this structure doesn’t qualify as “fancy”, but it is most definitely a departure from the methodology discussed in the Postgres documentation. It has some benefits beyond the traditional approach, and of course brings its own drawbacks to the table. As always, choosing is up to the DBA and the development teams.

I wrote a Postgres extension for my current employer that abstracts much of this and automates the rest. I hope to open-source it soon so everyone can squeeze some benefits out of applicable use cases. But now that you know about it, there’s no reason to wait for me to wade through red tape; explore!

Chicago Djangonauts meetup Sep 3, 2015

By Imaginary Landscape's ChicagoDjango.com Tech Blog from Django community aggregator: Community blog posts. Published on Aug 27, 2015.


Dave Cramer: PostgreSQL JDBC Driver Version 9_4_1202 released

From Planet PostgreSQL. Published on Aug 27, 2015.



Lots of bug fixes and some awesome performance enhancements, including statement caching

Version 9.4-1202 (2015-08-27)

  • ResultSet positioning methods in some particular cases PR #296 (282536b)
Author: Craig Ringer
  • Disable binary xfer on batches returning generated keys PR #273 (763ae84)
  • Add a new test case demonstrating a bug in returning support PR #273 (4d2b046)
  • Always Describe a query in a batch that returns generated keys PR #273 (a6bd36f)
Author: Dave Cramer
  • chore: fix build.xml to allow releasing to maven PR #262 (34f9361)
  • fix: BlobInputStream ignores constructor parameters #263 PR #273 (c1c6edc)
  • don't reset forceBinary transfer if we setPreparedThreshold (937a11c)
  • Revert "perf: Remove expensive finalize methods from Statement and Connection" PR #293 (a0d3997)
  • updated copyright PR #312 (263375c)
  • Revert "Issue 250 -- Adding setURL/getURL to BaseDataSource.java" PR #312 (a1ac380)
  • fixed mailing list href PR #326 (c3e86a6)
  • increment driver version PR #346 (b8ee75d)
Author: David R. Bild:
  • feat: add equality support to PSQLState PR #277 (7698cd9)
  • Improve version checking PR #355 (f7a84db)
Author: Eugene Koontz
  • Add support within "private Object buildArray (PgArrayList input, int index, int count)" for array elements whose type is jsonb PR #349 (d313138)
  • Added setter method for logging level. The method exactly matches property name in documentation. PR #282 (d9595d1)
  • Added getter method. PR #282 (65759f0)
  • Adding XML catalog to help unit tests not remote entity resolution. PR #284 (cb87067)
  • Added support to locally resolve dtd or entity files. PR #284 (017970d)
  • Disable verbose logging of the catalog resolver. PR #284 (fcc34f5)
Author: Kris Jurka
  • Improve error message for failure to update multicolumn primary key RSs. PR #284 (05ff811)
  • Remove all JDBC3 code as JDK 1.4/1.5 are no longer supported. PR #284 (f9a956b)
  • Add preliminary support for JDBC4.2. PR #284 (bd05fd2)
  • Added setURL/getURL methods. (fcc8f75)
  • Added a unit test for setURL PR #309 (5fa405b)
Author: Markus KARG
  • perf: use shared PGBoolean instances PR #321 (159fed6)
  • docs: parameter "database" is optional PR #332 (9a9d03f)
  • refactor: binary transfer for setObject(int, Object, int) PR #351 (3ff2129)
Author: Michael Paquier:
  • Update entries in lib/.gitignore PR #262 (8cd15a9)
Author: Phillip Ross
  • Fix for issue https://github.com/pgjdbc/pgjdbc/issues/342 - Modifications to PGline to store coefficients and constant value for linear equation representations used by postgresql for native line datatype. PR #343 (0565416)
  • Fix for issue https://github.com/pgjdbc/pgjdbc/issues/342 - Removed extra copyright comments. PR #343 (5f21a18)
  • Fix for issue https://github.com/pgjdbc/pgjdbc/issues/342 - Handle vertical lines. PR #343 (3918b24)
  • Fix for issue https://github.com/pgjdbc/pgjdbc/issues/342 - Added test method for testing PGline PR #343 (1a50585)
  • Fix for issue https://github.com/pgjdbc/pgjdbc/issues/342 - Modifications to PGline test method to only attempt database access if the postgresql version supports it (v9.4+). PR #343 (15eedb5)
  • feat: Improved composite/array type support and type naming changes. PR #333 (cddcd18)
  • Deadlock after IO exception during copy cancel PR #363 (d535c13)
  • style: clean up newline whitespace PR #273 (1b77b4c)
  • style: clean up whitespace in .travis.yml PR #274 (3ee5bbf)
  • fix: correct incorrect PG database version in .travis.yml matrix PR #274 (74b88c6)
  • style: reorder jdk versions in .travis.yml PR #274 (21289e7)
  • feat: add PostgreSQL 9.4 to .travis.yml matrix PR #274 (9e94f35)
  • feat: add escapeLiteral(...) and escapeIdentifier(...) to PGConnection PR #275 (096241f)
  • Replace for loops with Java 5-style for loops. Replace String.indexOf with String.contains. Replace StringBuffer with StringBuilder. Remove boxing/unboxing of primitives. PR #245 (206a542)
  • feat: Customize default fetchSize for statements PR #287 (093a4bc)
  • feat: Customize default fetchSize for statements PR #287 (519bfe1)
  • perf: Read test only property "org.postgresql.forceBinary" spend many time when creating statements PR #291 (e185a48)
  • perf: Remove expensive finalize method from Statement Finalize method on Statement is moved to a separate class that is lazily created if user sets "autoCloseUnclosedConnections"="true". This dramatically improves performance of statement instantiation and reduces garbage collection overhead on several wildly used JMVs. PR #290 (eb83210)
  • docs: fix misleading statement on "can't use jdk5 features" in README.md PR #298 (5b91aed)
  • feat: implement micro-benchmark module for performance testing PR #297 (48b79a3)
  • feat: add benchmark for Parser.unmarkDoubleQuestion PR #297 (e5a7e4e)
  • feat: improve sql parsing performance PR #301 (fdd9249)
  • perf: Remove AbstractJdbc2Statement.finalize() PR #299 (b3a2f80)
  • test: add test for prepare-fetch-execute performance PR #303 (d23306c)
  • perf: improve performance of preparing statements PR #303 (7c0655b)
  • test: add test for utf8-encoder performance PR #307 (6345ab1)
  • perf: improve performance of UTF-8 encoding PR #307 (f2c175f)
  • perf: skip instantiation of testReturn and functionReturnType for non-callable statements PR #323 (8eacd06)
  • perf: parse SQL to a single string, not a array of fragments PR #319 (4797114)
  • perf: cache parsed statement across .prepareStatement calls PR #319 (5642abc)
  • refactor: cleanup constructors of JDBC4 and JDBC42 connections/statements PR #318 (a4789c0)
  • refactor: use Dequeue<...> instead of raw ArrayList in v3.QueryExecutorImpl PR #314 (787d775)
  • perf: SimpleParameterList.flags int[] -> byte[] PR #325 (f5bceda)
  • perf: cut new byte[1] from QueryExecutorImpl.receiveCommandStatus PR #326 (0ae1968)
  • perf: avoid useBinary(field) check for each sendBind PR #324 (45269b8)
  • refactor: cleanup Parser and NativeQuery after #311 PR #346 (a1029df)
  • refactor: cleanup Parser and CallableQueryKey after #319 PR #346 (5ec7dea)
  • perf: skip caching of very large queries to prevent statement cache pollution PR #346 (126b60c)
  • use current_schema() for Connection#getSchema PR #356 (ffda429)
  • chore: simple script to compose release notes PR #357 (341ff8e)
  • chore: teach release_notes.sh to identify PR ids out of merge commits PR #358 (f3214b1)

A wagon-load of post-summer updates

By Griatch's Evennia musings (MU* creation with Django+Twisted) from Django community aggregator: Community blog posts. Published on Aug 27, 2015.

Summer vacations are over and work resumes in Evennia land! Here's a wagon-load of small updates on what's going on.
 

Ainneve

The Ainneve project, the creation of an official, open-source Evennia demo game, has gotten going. The lead devs of the project are keen to make this a collaborative effort and there is a lot of good discussion and code being written. After some slowdown at the end of summer it's bound to pick up again. 

Ainneve's a rare chance to see a full MUD getting developed from scratch out in the open. The current issue list includes tags for difficulty and also allows newbie Python coders to jump in. Not to mention you have a chance to get valuable feedback on your work from seasoned coders!

So if you are at all interested in making a MUD, trying out Python/Evennia or just getting involved in a semi-big programming project, this is a great chance to do so.

Imaginary Realities

A few weeks ago, a new issue of Imaginary Realities (vol 7, issue 3) came out. As usual I have an article in it. This venerable e-zine was revitalized to include articles on MU*s as well as roguelikes, interactive fiction and others. Not only is this issue the most content-rich since the reboot; with this issue they have also spruced up their interface to make past issues easier to navigate.

  • "A text MUD with a working ecology system" - in this article Molly O'Hara  details the concepts behind simulating a functioning ecologic network in a game. Interesting stuff and some parts of this is certainly worth considering for any open-world game design. I wonder at which level of detail the system become more complex than the players can appreciate though.
  • "Dispelling the gloom" by Tomasz Gruca is an interesting essay on the author's history growing up in the former Soviet Union and eventually finding text adventure games and interactive fiction, a passion he has apparently lately re-kindled. He makes the observation that the current "retro" trend of games have not really reached back to the text-based game world when it comes to mainstream acceptance.
  • "How integral are letters and text to ASCII gaming?"by Mark R. Johnson goes into the practical use of ASCII in traditional rogue-like games (beyond nostalgia). This is a meaty article that goes into both text-as-graphics as well as the use of text for aiding imagination and suggest subtle puzzles in some classic rogue-likes. 
  • "Legend and the lore" (Hugo Zombiestalker) proclaims the death of the traditional point-and-click adventure game and but then moves on to try to distill just why those games nevertheless was so appealing to him and how it can be applied in modern game designs like zombie-survival MUD Epitath which he is a senior developer for. Plenty of good observations here!
  • "The bonds of mudding" by Clint Itan Kuranes Knapp writes about the community that can spring up on a long-running MUD, the interactions the friends and the relationships that could persist already long before "social media" became a buzz word. A warm text with plenty of anecdotes and examples and things to ponder for both designers and staff when wanting to cater for this type of player bonding. 
  • "The mercurial temperament at the end of the world" (Drakkos) discusses NPCs and how they rarely are as interactive as one would want (the term "vend a chat" is a good one I think). He then goes on to how they have implemented their "Mercurial" system for NPCs in Epitath. This seems to be a state-AI system where NPCs have moods that affects what they say based on their circumstance and relation to other actors in the world. Sounds really cool and since he goes into some details on the implementation there is a lot to ponder here. 
  • "Where do I begin?" by me, Griatch, discusses one of the more common questions we get in the Evennia chat - 'I want to make a MUD, but how do I begin?' This article starts before Evennia's Game planning wiki page - it discusses assessing your capabilities and resources in the form of programming skills, code bases and motivations to help you figure out what you can realistically accomplish. 

 

Evennia Web client

In the pipeline I have some updates to Evennia's websocket/JSON MUD-web client component. These are changes that are intended to make the webclient easier to customize and hook into Evennia output using only HTML/CSS. More details on this will be forthcoming when I have more solid stuff to show.

______
Image: The troll here a-cometh by Griatch

Andrew Dunstan: Quick buildfarm recipe for CentOS and friends

From Planet PostgreSQL. Published on Aug 27, 2015.

I've been writing a little provisioning script for a Vagrant CentOS 7 machine that will be a buildfarm client.

Stripped of the vagrant stuff and some stuff that's special for the machine, here is what I did to get it running:

# install the build dependencies and tools the buildfarm client needs
sudo yum install -y wget gcc make flex bison ccache git \
perl-libwww-perl perl-Digest-SHA perl-devel perl-ExtUtils-Embed \
zlib-devel openssl-devel readline-devel python-devel tcl-devel \
libxml2-devel libxslt-devel openldap-devel

# fetch and unpack the latest buildfarm client release
mkdir bf
cd bf
HERE=`pwd`
wget http://www.pgbuildfarm.org/downloads/latest-client.tgz
mv latest-client.tgz buildfarm-latest-client.tgz
tar -z --strip-components=1 -xf buildfarm-latest-client.tgz

# point the client's build_root at a local directory
mkdir root
sed -i "s!build_root =>.*!build_root => '$HERE/root',!" build-farm.conf

At this stage the buildfarm client will run quite successfully, as can be tested using:

perl run_build.pl --test

Alexander Korotkov: Psql Command to Attach Gdb to Backend

From Planet PostgreSQL. Published on Aug 26, 2015.

While hacking PostgreSQL it’s very useful to know the pid of the backend you are working with. You need that pid to attach a debugger, profiler, etc. Luckily, .psqlrc provides an elegant way to define shortcuts for psql. Using the config line below, one can find out the backend pid just by typing :pid.

\set pid 'SELECT pg_backend_pid();'
=# :pid
 pg_backend_pid
----------------
          99038
(1 row)

In 9.6 it even becomes possible to include the backend pid in the psql prompt.

However, it’s possible to automate more complex actions in psql. I’ve configured my psql to run gdb attached to current backend in new tab of iTerm2 just by typing :gdb.

The :gdb command selects the pid of the current backend and pipes it to the pg_debug script.

\set gdb 'SELECT pg_backend_pid() \\g |pg_debug'

pg_debug extracts the pid from its input and then runs an OSA script which launches gdb in a new tab of iTerm2.

#!/bin/bash

IFS=''

while read line
do
    # Extended display off
    if [[ $line =~ ^\ +([0-9]+) ]]; then
        PID=${BASH_REMATCH[1]}
        break
    fi
    # Extended display on
    if [[ $line =~ ^pg_backend_pid.*\ ([0-9]+) ]]; then
        PID=${BASH_REMATCH[1]}
        break
    fi
done

# Open gdb session
osascript -e "
tell application \"iTerm\"
    activate
    tell the current terminal
        set mysession to (the current session)
        launch session \"Default Session\"
        tell the last session
            write text \"gdb --pid=$PID -x <(echo continue)\"
        end tell
        select mysession
    end tell
end tell"

This script works for Mac OS X and iTerm2, but the same approach should work for other platforms and terminal emulators.

Pavan Deolasee: Working towards Postgres-XL 9.5

From Planet PostgreSQL. Published on Aug 26, 2015.

It's been a busy few months as we work towards merging Postgres-XL with the latest and greatest release of PostgreSQL. Postgres-XL is an open source fork of PostgreSQL that provides a scalable platform for OLTP and Business Intelligence. The current release of Postgres-XL is based on PostgreSQL 9.2, so it lacks all the improvements made to PostgreSQL over the last three years.

2ndQuadrant and other companies are working on bringing distributed scalability into PostgreSQL core as well as building tools and extensions outside the core. As part of that, Postgres-XL has a number of features that we'd like to bring back into core PostgreSQL, so 2ndQuadrant has picked up the task of updating the Postgres-XL code base to the latest PostgreSQL release as the first step. After more than 3 months of work, and with PostgreSQL 9.5 still in alpha, we wanted to give a progress report on how the work is proceeding. I also need to say the magic words: This ongoing work on Postgres-XL is part of the AXLE project, funded by the European Union under grant agreement 318633.

Preparation for the Merge

Since PostgreSQL and Postgres-XL both use Git as their source control system, the merge process is much simpler, as Git provides many tools to assist it. But as soon as we tried the merge, we faced the first hurdle.

We realised that the current Postgres-XL repository is based on an older minor 9.2 release of PostgreSQL. That means there were commits and changes in the Postgres-XL master branch which either never made it to PostgreSQL's master branch or had different commit ids. So merging with the PostgreSQL master branch threw a lot more conflicts than we would otherwise have expected. The first task we had to accomplish was therefore to rebase the Postgres-XL 9.2 repository on a later commit point. This obviously required careful treading, making sure that nothing broke during the process. Once we had the basic rebase done, we also merged all Postgres-XL bug fixes and enhancements, created a Postgres-XL 9.2 stable branch and merged that branch with the latest available PostgreSQL 9.2 minor release.

Challenges Faced During the Merge

The actual merge with the PostgreSQL master branch was not an easy task either. Note that we were jumping 3 major releases of PostgreSQL, which accounted for almost 3 years of development work. Thankfully, git-mergetool comes in very handy for such large scale merges. You can use your favourite editor (vimdiff in our case) to nicely see the merge conflicts and resolve them. While some of the conflicts are straightforward and require minor adjustments, many require careful reading and understanding of the code. While it's not trivial to support all new features, we tried to preserve as much as possible and we have been quite successful.
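
As a rough sketch of that workflow (the remote and branch names here are placeholders, not the ones we actually used):

git config merge.tool vimdiff   # resolve conflicts with vimdiff
git merge postgres/master       # merge the upstream PostgreSQL branch; conflicts appear
git mergetool                   # step through each conflicted file in vimdiff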

The other major challenge was merging documentation changes. Since the Postgres-XL project had created a copy of the existing SGML documentation, the Git merge with the master branch did not yield any updates to the copy. This required a manual merge. To ensure that this is not required again in future merges, we now make the documentation changes in-place.

What’s next?

There are many things that need to be completed before we can release Postgres-XL 9.5 to the general public:

  1. Enhance regression test coverage for Postgres-XL
  2. Fix bugs and add support for new features
  3. Do systematic performance tests and tuning
  4. Create Postgres-XL 9.5 branch and merge with latest PostgreSQL 9.5 stable branch

We aren’t yet ready for review of Postgres-XL, but we expect Postgres-XL 9.5 Beta to be ready round about the same time as PostgreSQL 9.5 Beta.

Look for my next blog post in about a month’s time for the next update.

Yann Larrivee: ConFoo Call for Papers is Open

From Planet PostgreSQL. Published on Aug 25, 2015.

ConFoo is once more seeking passionate speakers for the upcoming conference.

The event is happening in Montreal, Canada, between February 24th and 26th, 2016. It is an exciting conference for web developers with speakers from all over the world. It unites many web programming languages under one roof, as well as other topics related to web development. The call for papers closes on September 20th.

ConFoo renews 50% of its speakers each year. If you’re new to this conference, you should definitely submit.

If you would just like to attend, there is a discount until October 13th.

Terry Erisman: PGConf Silicon Valley Talks Posted

From Planet PostgreSQL. Published on Aug 25, 2015.

We are pleased to announce that the PGConf Silicon Valley breakout sessions and tutorials are now posted on the conference website. Our outstanding Conference Committee has worked through the large pool of submissions and selected the sessions which will be presented November 17-18, 2015 at the South San Francisco Conference Center. As the first dedicated PostgreSQL conference in the Bay Area, the conference is a great opportunity for PostgreSQL community members to exchange technical knowledge and make new connections. Early Bird registration prices are available through October 4, 2015.

Breakout Sessions

The breakout sessions are organized into five tracks:

  • DevOps
  • In the Trenches
  • Features
  • Hackery
  • PostgreSQL at Scale

The outstanding list of speakers includes representatives from companies such as TripAdvisor, Urban Airship, Square, Heap, Chartio, AWS, Rackspace, Joyent, and Heroku. The breakout sessions, speakers, and talk descriptions are available on the conference website.

Tutorials

The tutorials will be presented by leading PostgreSQL practitioners and range from beginner to advanced level. Presenters include representatives from 2ndQuadrant, EnterpriseDB, Citus Data, OmniTI, and PostgreSQL Experts. Check the conference website for talk descriptions and speaker information.

More Information Coming Soon

The schedule for the tutorials and breakout sessions will be posted to the PGConf Silicon Valley conference website in the near future. We will also post a series of interviews with conference speakers who will provide insights into their sessions and what they are looking forward to at the conference. 

Take advantage of Early Bird pricing before it ends on October 4th. I hope to see you at PGConf Silicon Valley!

 


Easy maintenance: script that prints out repair steps

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on Aug 24, 2015.

At my work we have quite a number of different sites/apps. Sometimes it is just a regular django website. Sometimes django + celery. Sometimes it also has extra django management commands, running from cronjobs. Sometimes Redis is used. Sometimes there are a couple of servers working together....

Anyway, life is interesting if you're the one that people go to when something is (inexplicably) broken :-) What are the moving parts? What do you need to check? Running top to see if there's a stuck process running at 100% CPU. Or if something eats up all the memory. df -h to check for a disk that's full. Or looking at performance graphs in Zabbix. Checking our "sentry" instance for error messages. And so on.

You can solve the common problems that way. Restart a stuck server, clean up some files. But what about a website that depends on background jobs, run periodically from celery? If there are 10 similar processes stuck? Can you kill them all? Will they restart?

I had just such a problem a while ago. So I sat down with the developer. Three things came out of it.

  • I was told I could just kill the smaller processes. They can be re-run later. This means it is a good, loosely-coupled design: fine :-)

  • The README now has a section called "troubleshooting" with a couple of command line examples. For instance the specific celery command to purge a specific queue that's often troublesome.

    This is essential! I'm not going to remember that. There are too many different sites/apps to keep all those troubleshooting commands in my head.

  • A handy script (bin/repair) that prints out the commands that need to be executed to get everything right again. Re-running previously-killed jobs, for instance.

The script grew out of the joint debugging session. My colleague was telling me about the various types of jobs and celery/redis queues. And showing me redis commands that told me which jobs still needed executing. "Ok, so how do I then run those jobs? What should I type in?"

And I could check several directories to see which files were missing. Plus commands to re-create them. "So how am I going to remember this?"

In the end, I asked him if he could write a small program that did all the work we just did manually. Looking at the directories, looking at the redis queue, printing out the relevant commands?

Yes, that was possible. So a week ago, when the site broke down and the colleague was away on holiday, I could kill a few stuck processes, restart celery and run bin/repair. And copy/paste the suggested commands and execute them. Hurray!

So... make your sysadmin/devops/whatever happy and...

  • Provide a good README with troubleshooting info. Stuff like "you can always run bin/supervisorctl restart all without everything breaking". Or warnings not to do that but to instead do xyz.
  • Provide a script that prints out what needs doing to get everything OK again (a minimal sketch of the idea follows below).
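
To make that last point concrete, here is a minimal, hypothetical sketch of such a bin/repair script. The queue name, directory and suggested commands are all made up; the point is only that the script inspects state and prints commands instead of running them:

#!/usr/bin/env python
"""Print the commands needed to get the site healthy again (hypothetical example)."""
import os

import redis  # assumes the redis-py package is installed


def main():
    client = redis.StrictRedis()
    # Hypothetical celery queue that tends to get stuck.
    backlog = client.llen('celery')
    if backlog:
        print("# %d jobs still queued; restart the workers first:" % backlog)
        print("bin/supervisorctl restart celery-workers")

    # Hypothetical output directory that should contain generated reports.
    for name in ('report_a.csv', 'report_b.csv'):
        if not os.path.exists(os.path.join('var/reports', name)):
            print("# %s is missing, regenerate it with:" % name)
            print("bin/django generate_report %s" % name)


if __name__ == '__main__':
    main()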

Testing SaltStack with Vagrant

By Imaginary Landscape's ChicagoDjango.com Tech Blog from Django community aggregator: Community blog posts. Published on Aug 24, 2015.


Michael Paquier: Postgres 9.6 feature highlight: Lock reductions for ALTER TABLE SET

From Planet PostgreSQL. Published on Aug 24, 2015.

ALTER TABLE has been known for many years in the Postgres ecosystem as a command that systematically takes an ACCESS EXCLUSIVE lock on the relation being modified, preventing all operations on the relation in parallel. Those locks are getting more and more relaxed; for example, in Postgres 9.4 the following commands got improvements:

  • VALIDATE CONSTRAINT
  • CLUSTER ON
  • SET WITHOUT CLUSTER
  • ALTER COLUMN SET STATISTICS
  • ALTER COLUMN SET and ALTER COLUMN RESET for attribute options

In 9.5, those commands have been improved as well:

  • ENABLE TRIGGER and DISABLE TRIGGER
  • ADD CONSTRAINT FOREIGN KEY

Now, Postgres 9.6, which is currently in development, brings in more lock reduction, with the following commit:

commit: 47167b7907a802ed39b179c8780b76359468f076
author: Simon Riggs <simon@2ndQuadrant.com>
date: Fri, 14 Aug 2015 14:19:28 +0100
Reduce lock levels for ALTER TABLE SET autovacuum storage options

Reduce lock levels down to ShareUpdateExclusiveLock for all
autovacuum-related relation options when setting them using ALTER TABLE.

Add infrastructure to allow varying lock levels for relation options in
later patches. Setting multiple options together uses the highest lock
level required for any option. Works for both main and toast tables.

Fabrízio Mello, reviewed by Michael Paquier, mild edit and additional
regression tests from myself

Code-wise, ALTER TABLE SET has been improved to be able to define different lock types depending on the parameter touched, and in the case of this commit all the parameters tuning autovacuum and auto-analyze at relation level have been updated to use SHARE UPDATE EXCLUSIVE LOCK. In short, this allows read as well as write operations to occur in parallel with the ALTER TABLE, something that will definitely help when tuning autovacuum to control bloat on busy relations.

Note as well that when multiple subcommands are used, the strongest lock of the whole set is taken for the duration of the ALTER TABLE command. So for example, should an update of the fillfactor parameter be mixed with a modification of autovacuum_enabled, an ACCESS EXCLUSIVE lock will be taken on the relation instead of a SHARE UPDATE EXCLUSIVE lock. ALTER TABLE uses the following set of locks, and as they have a monotonic relationship it is possible to establish a hierarchy among them, the strongest one being listed first:

  • ACCESS EXCLUSIVE LOCK
  • SHARE ROW EXCLUSIVE LOCK
  • SHARE UPDATE EXCLUSIVE LOCK

Hence be sure to read the online documentation when planning to combine multiple subcommands with ALTER TABLE; all the details are there.
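
To make the difference concrete, here is a small hypothetical example (the table name is made up; the storage parameters are the standard ones):

-- Only autovacuum-related options are touched, so with this commit a
-- SHARE UPDATE EXCLUSIVE lock is enough and concurrent reads and writes continue.
ALTER TABLE my_events SET (autovacuum_vacuum_scale_factor = 0.05,
                           autovacuum_analyze_scale_factor = 0.02);

-- Mixing in a non-autovacuum parameter such as fillfactor escalates the
-- whole command to an ACCESS EXCLUSIVE lock.
ALTER TABLE my_events SET (fillfactor = 90, autovacuum_enabled = false);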

Runs on python 3: checkoutmanager

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on Aug 23, 2015.

Checkoutmanager is a five-year-old tool that I still use daily. The idea? A simple ~/.checkoutmanager.cfg ini file that lists your checkouts/clones. Like this (only much longer):

[local]
vcs = git
basedir = ~/local/
checkouts =
    git@github.com:nens/syseggrecipe.git
    git@github.com:buildout/buildout.git
    git@github.com:reinout/reinout-media.git
    git@github.com:rvanlaar/djangorecipe.git

[svn]
vcs = svn
basedir = ~/svn/
checkouts =
    svn+ssh://reinout@svn.zope.org/repos/main/z3c.recipe.usercrontab/trunk

In the morning, I'll normally do a checkoutmanager up and it'll go through the list and do svn up, git pull, hg pull -u, depending on the version control system. Much better than going through a number of them by hand!

Regularly, I'll do checkoutmanager st to see if I've got something I still need to commit. If you just work on one project, no problem. But if you need to do quick fixes on several projects and perhaps also store your laptop's configuration in git... it is easy to forget something:

$ checkoutmanager st
/Users/reinout/vm/veertien/efcis-site
 M README.rst

And did you ever commit something but forgot to push it to the server? checkoutmanager out tells you if you did :-)

Porting to python 3. The repo was originally on bitbucket, but nowadays I keep having to look all over my screen, looking for buttons, to get anything done there. I'm just too used to github, it seems. So after merging a pull request I finally got down to moving it to github.

I also copied over the issues and added one that told me to make sure it runs on python 3, too. Why? Well, it is the good thing to do. And... we had a work meeting last week where we said that ideally we'd want to run everything on python 3.

Two years ago I started a django site with python 3. No real problems there. I had to fix two buildout recipes myself. And the python LDAP package didn't work, but I could work around it. And supervisord didn't run so I had to use the apt-get-installed global one. For the rest: fine.

Recently I got zest.releaser to work on python 3 (that is: someone else did most of the hard work, I helped getting the pull request properly merged :-) ). For that, several test dependencies needed to be fixed for python 3 (which, again, someone else did). Checkoutmanager had the same test dependencies, so getting the test machinery to run was just a matter of updating dependencies.

What had to be done?

  • print 'something' is now a function: print('something'). Boring work, but easy (see the small snippet after this list).

  • Some __future__ imports, mostly for the print function and unicode characters.

  • Oh, and setting up travis-ci.org testing. Very easy to get both python 2.7 and 3.4 testing your software that way. Otherwise you keep on switching back/forth between versions yourself.

    (There's also 'tox' you can use for local multi-python-version testing in case you really really need that all the time, I don't use it myself though.)

  • Some from six.moves import xyz to work around changed imports between 2 and 3. Easy peasy, just look at the list in the documentation.

  • It is now try... except SomeError as e instead of try... except SomeError, e. The new syntax already works in 2.7, so there's no problem there.

  • The one tricky part was that checkoutmanager uses doctests instead of "regular" tests. And getting string comparison/printing right on both python 2 and 3 is a pain. You need an ugly change like this one to get it working. Bah.

    But: most people don't use doctests, so they won't have this problem :-)

  • The full list of changes is in this pull request: https://github.com/reinout/checkoutmanager/pull/9 .

  • A handy resource is http://python3porting.com/problems.html . Many common problems are mentioned there, including solutions.

    Django's porting tips at https://docs.djangoproject.com/en/1.8/topics/python3/ are what I recommended to my colleagues as a useful initial guide on what to do. Sane, short advice.
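
To illustrate the first few points, a tiny snippet like this (the prompt text is just made up for the example) runs unchanged on both 2.7 and 3.x:

from __future__ import print_function, unicode_literals

from six.moves import input  # raw_input on python 2, input on python 3

try:
    number = int(input('How many checkouts? '))
except ValueError as e:  # the "as e" form works on both 2.7 and 3.x
    print('Not a number:', e)
    number = 0

print('Updating', number, 'checkouts')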

Anyway... Another python 3 package! (And if I've written something that's still used but that hasn't been ported yet: feel free to bug me or to send a pull request!)

Gulcin Yildirim: What’s New in PgBouncer 1.6

From Planet PostgreSQL. Published on Aug 23, 2015.

PgBouncer is a lightweight connection pooler for PostgreSQL.

The latest release of PgBouncer was announced on the 1st of August 2015. In this blog post we'll talk about the major new improvements in PgBouncer.

Main new features of PgBouncer

Load user password hash from postgres database

With this hot feature, PgBouncer can load a user's password from the database, using two config parameters: auth_user and auth_query (a configuration sketch follows the parameter descriptions below).

Note: These config parameters are defined in the configuration file pgbouncer.ini.

  • auth_user
    If auth_user is set, any user not specified in auth_file will be queried from pg_shadow in the database using auth_user. auth_user's password will be taken from auth_file. This parameter can be set per-database too.

    Note: auth_file is the name of the file to load user names and passwords from.

  • auth_query
    This parameter allows us to write a SQL query to load a user's password from the database. It runs under auth_user. See the default query below:

    SELECT usename, passwd FROM pg_shadow WHERE usename=$1
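
As a minimal sketch (the database name, the file path and the pgbouncer_auth role are assumptions made for this example), the relevant pgbouncer.ini pieces could look like this:

[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
auth_type = md5
; userlist.txt only needs the pgbouncer_auth credentials
auth_file = /etc/pgbouncer/userlist.txt
auth_user = pgbouncer_auth
auth_query = SELECT usename, passwd FROM pg_shadow WHERE usename=$1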

Pooling mode can be configured both per-database and per-user

With this feature, independent of the main pooling mode, clients can now connect to different databases with one of the 3 pooling modes described below. The same applies to users: for example, if the main pooling mode is session pooling, a specific user can be configured to use transaction pooling. This gives us database-level and user-level flexibility to apply more appropriate pooling options (see the configuration sketch after the list below).

PgBouncer provides 3 connection pooling modes:

  • Session pooling
    During the lifetime of a client connection, an existing server connection is assigned to the client; after the client disconnects, the server connection is put back into the connection pool.
  • Transaction pooling
    In this mode, a server connection is not assigned to a connected client immediately: it is assigned only for the duration of a transaction. As soon as the transaction is over, the connection is put back into the pool.
  • Statement pooling
    This is similar to transaction pooling, but more aggressive: a server connection is assigned only for a single-statement transaction, for example a standalone query. When the statement is over, the connection is put back into the pool.
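
A hypothetical pgbouncer.ini fragment using this flexibility could look like the following (the database and user names are made up; the two max_* limits shown are the ones described in the next section):

[databases]
; reporting queries use transaction pooling regardless of the main mode
reports = host=10.0.0.5 dbname=reports pool_mode=transaction

[users]
; this user gets statement pooling
batch_user = pool_mode=statement

[pgbouncer]
pool_mode = session
max_db_connections = 100
max_user_connections = 50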

Per-database and per-user connection limits: max_db_connections and max_user_connections

With this feature, we can now also control connection limits at the database/user level with two new parameters: max_db_connections and max_user_connections.

  • max_db_connections
    This parameter does not allow more than the specified number of connections per-database (regardless of pool – i.e. user).
    The default value of this parameter is unlimited.
  • max_user_connections
    This parameter does not allow more than the specified number of connections per-user (regardless of pool – i.e. user).

Add DISABLE/ENABLE commands to prevent new connections

With this feature PgBouncer has ENABLE/DISABLE db; commands to prevent new connections (an example invocation follows the list below).

  • DISABLE db;
    This command rejects all new client connections on the given database.
  • ENABLE db;
    This command allows new client connections after a previous DISABLE command.
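
These are admin console commands; a hypothetical invocation (assuming PgBouncer listens on port 6432 and admin_user is listed in admin_users) looks like this:

psql -p 6432 -U admin_user -c "DISABLE mydb;" pgbouncer
psql -p 6432 -U admin_user -c "ENABLE mydb;" pgbouncer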

Note: PgBouncer has several other process control commands. See the list below:

PAUSE [db]; PgBouncer tries to disconnect from all servers, first waiting for all queries to complete. The command will not return before all queries are finished. To be used at the time of database restart. If database name is given, only that database will be paused.

KILL db; Immediately drop all client and server connections on given database.

SUSPEND; All socket buffers are flushed and PgBouncer stops listening for data on them. The command will not return before all buffers are empty. To be used at the time of PgBouncer online reboot.

RESUME [db]; Resume work from previous PAUSE or SUSPEND command.

SHUTDOWN; The PgBouncer process will exit.

RELOAD; The PgBouncer process will reload its configuration file and update changeable settings.

New preferred DNS backend: c-ares

c-ares is the only DNS backend that supports all the interesting features: /etc/hosts with refresh, SOA lookup, large replies (via TCP/EDNS+UDP) and IPv6. It is the preferred backend now, and will probably be the only backend in the future.

Note: See the following table, which shows the currently supported backends.

backend                    parallel  EDNS0  /etc/hosts  SOA lookup  note
c-ares                     yes       yes    yes         yes         ipv6+CNAME buggy in <=1.10
udns                       yes       yes    no          yes         ipv4-only
evdns, libevent 2.x        yes       no     yes         no          does not check /etc/hosts updates
getaddrinfo_a, glibc 2.9+  yes       yes    yes         no          N/A on non-linux
getaddrinfo, libc          no        yes    yes         no          N/A on win32, requires pthreads
evdns, libevent 1.x        yes       no     no          no          buggy

Config files have ‘%include FILENAME’ directive to allow configuration to be split into several files

With this feature, PgBouncer supports including config files within other config files.

In other words, a PgBouncer config file can contain include directives, which specify another config file to read and process. This allows splitting a large configuration file into smaller, more manageable files. The include directives look like this:

%include filename

If the file name is not an absolute path, it is taken as relative to the current working directory.
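
For example (the file names are hypothetical), the main pgbouncer.ini could be reduced to the global settings plus two includes:

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432

%include /etc/pgbouncer/databases.ini
%include /etc/pgbouncer/users.ini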

There are more features in this release. You can visit the PgBouncer changelog page or check the list below for the other improvements:

  • Show remote_pid in SHOW CLIENTS/SERVERS. Available for clients that connect over unix sockets, and for servers over both tcp and unix sockets. In the case of a tcp server, the pid is taken from the cancel key.
  • Add separate config param (dns_nxdomain_ttl) for controlling negative dns caching.
  • Add the client host IP address and port to application_name. This is enabled by a config parameter application_name_add_host which defaults to ‘off’.

What is PgBouncer?

PgBouncer is a utility for managing client connections to a PostgreSQL database. In a nutshell, it maintains a connection pool to the PostgreSQL server and reuses those existing connections. While this can be useful for reducing the client connection overhead, it also enables limiting the maximum number of open connections to the database server. It can also be used for traffic shaping, like redirecting connections to one or more databases to different database servers. In addition to these, PgBouncer can be used for managing security at the user and even at the database level.

See the diagram below which depicts PgBouncer architecture in a more visual way.

Image: PgBouncer architecture diagram

In this particular example, client applications are connected to separate PgBouncer instances instead of being connected directly to the PostgreSQL database servers. Database servers "DB Server 1" and "DB Server 2" may be independent PostgreSQL instances or they may be part of a cluster with different roles (e.g., master/replica or write-master/backup-master, etc.).

Each PgBouncer instance maintains a connection pool with a number of open connections to PostgreSQL servers. As can be seen from the example, PgBouncer allows creating pools with connections to different databases and even connections to different database servers.

For more information

You can visit the main website of PgBouncer: https://pgbouncer.github.io/

They have a nice FAQ page: https://pgbouncer.github.io/faq.html

You can take a look at the GitHub repository of the project: https://github.com/pgbouncer/pgbouncer

Shaun M. Thomas: PG Phriday: Basic Partitioning

From Planet PostgreSQL. Published on Aug 21, 2015.

PG Phriday: Basic Partitioning

Most PGDB (PostgreSQL) users who are familiar with partitioning use the method described in the partitioning documentation. This architecture comes in a fairly standard stack:

  • One empty base table for structure.
  • At least one child table that inherits the base design.
  • A trigger to redirect inserts based on the partitioning scheme.
  • A constraint on each child table to enforce the partition scheme, and help the planner exclude child partitions from inapplicable queries.

It looks simple, but there's a lot of SQL involved. Let's take the sensor_log table we introduced a couple weeks ago and turn it into a partitioned structure.

TRUNCATE TABLE sensor_log;
 
CREATE TABLE sensor_log_part_2012 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2012-01-01' AND
           reading_date < '2013-01-01')    
) INHERITS (sensor_log);
 
CREATE TABLE sensor_log_part_2013 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2013-01-01' AND
           reading_date < '2014-01-01')    
) INHERITS (sensor_log);
 
CREATE TABLE sensor_log_part_2014 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2014-01-01' AND
           reading_date < '2015-01-01')    
) INHERITS (sensor_log);
 
CREATE TABLE sensor_log_part_2015 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2015-01-01' AND
           reading_date < '2016-01-01')    
) INHERITS (sensor_log);
 
CREATE OR REPLACE FUNCTION sensor_partition()
RETURNS TRIGGER AS $$
BEGIN
  CASE EXTRACT(YEAR FROM NEW.reading_date)
    WHEN 2015 THEN
      INSERT INTO sensor_log_part_2015 VALUES (NEW.*);
    WHEN 2014 THEN
      INSERT INTO sensor_log_part_2014 VALUES (NEW.*);
    WHEN 2013 THEN
      INSERT INTO sensor_log_part_2013 VALUES (NEW.*);
    WHEN 2012 THEN
      INSERT INTO sensor_log_part_2012 VALUES (NEW.*);
  END CASE;
 
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;
 
CREATE TRIGGER t_correct_partition
BEFORE INSERT ON sensor_log
   FOR EACH ROW EXECUTE PROCEDURE sensor_partition();

This probably looks a bit different from the examples in the documentation for a few reasons. First, when creating a table, the LIKE markup is available to copy various attributes from another table. By using this operation, we’re redundantly copying the columns from the parent table, because this is already handled by the INHERITS command. What isn’t redundant is the INCLUDING INDEXES modifier for LIKE, which saves us the work of manually creating each index from the parent table on each child table.

Since we already have a more complicated CREATE TABLE than what they show in the documentation, we might as well include the partition constraints when creating each child table, instead of using ALTER TABLE to add them afterwards. But these are just shortcuts; the overall structure is the same, as are the performance considerations and benefits.

Using our python script and inserting 100-million rows takes about three times longer than it did in the non-partitioned case. Basically, this includes the trigger overhead for redirecting the rows. It’s painful, but not debilitating on systems that aren’t dependent on transaction performance.

We could also reduce this overhead by simplifying the logic of the sensor_partition function. An easy way to accomplish that would be to have it only target the most recent partition. Then we would just need some maintenance job that would replace the function when the most recent partition changes. Since most ongoing loader jobs target recent incoming data, that would probably make more sense anyway.
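
As a minimal sketch of that simplification (assuming 2015 is the current partition; a maintenance job would regenerate the function when the year rolls over), the trigger function shrinks to a single INSERT:

CREATE OR REPLACE FUNCTION sensor_partition()
RETURNS TRIGGER AS $$
BEGIN
  -- Only the current partition is targeted; no per-row CASE evaluation.
  INSERT INTO sensor_log_part_2015 VALUES (NEW.*);
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;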

Barring that, what do we gain? One huge benefit is constraint exclusion. The PGDB query planner will consider the CHECK constraints we placed on the partitions when processing statements. This is what a query looks like that triggers this effect:

EXPLAIN ANALYZE
SELECT COUNT(*)
  FROM sensor_log
 WHERE reading_date >= '2015-08-14';
 
                             QUERY PLAN                             
--------------------------------------------------------------------
 Aggregate  (cost=16003.26..16003.27 ROWS=1 width=0)
            (actual TIME=1181.681..1181.682 ROWS=1 loops=1)
   ->  Append  (cost=0.00..15003.79 ROWS=399791 width=0)
               (actual TIME=0.578..929.660 ROWS=411314 loops=1)
         ->  Seq Scan ON sensor_log
                  (cost=0.00..0.00 ROWS=1 width=0)
                  (actual TIME=0.001..0.001 ROWS=0 loops=1)
               FILTER: (reading_date >= '2015-08-14 00:00:00')
         ->  INDEX ONLY Scan
             USING sensor_log_part_2015_reading_date_idx
                ON sensor_log_part_2015
                    (cost=0.44..15003.79 ROWS=399790 width=0)
                    (actual TIME=0.576..474.890 ROWS=411314 loops=1)
               INDEX Cond: (reading_date >= '2015-08-14 00:00:00')
               Heap Fetches: 411314
 
 Planning TIME: 0.282 ms
 Execution TIME: 1181.719 ms

With a PGDB partition, the base table is always included. Since it's empty, a sequence scan has no cost. The WHERE clause only matches the check constraint for the 2015 partition, so all results are from that table alone. The data is extremely dense, however, so we still fetch over 400k records, resulting in an execution time of around one second.

The primary caveat with constraint exclusion is that it doesn’t work with variable substitution. Were we to replace the static date with CURRENT_DATE instead, PGDB would check all of the child partitions. As it turns out, that’s only one empty index poll for each of the invalid partitions. Thus the cost isn’t much higher, though the query plan would be much more complex.

Is there anything else that goes in our favor? We’ve mentioned maintenance costs several times in the past, and this is one reason why. Our data is now distributed across four tables. Assuming we retain this partitioning scheme, all future data will continue to be distributed by year. Remember our query that checks for all tables with at least 10-million rows? Let’s try it again:

SELECT oid::regclass::text AS TABLE_NAME,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
       CEIL(reltuples / 1000000) AS million_rows
  FROM pg_class
 WHERE relkind = 'r'
   AND reltuples > 10000000
 ORDER BY relpages DESC;
 
      TABLE_NAME      | total_size | million_rows 
----------------------+------------+--------------
 sensor_log_part_2014 | 4163 MB    |           37
 sensor_log_part_2013 | 4508 MB    |           37
 sensor_log_part_2015 | 2832 MB    |           23

Aside from 2012, which is much smaller simply due to the date of this article, rows are distributed proportionally. From this, we can estimate that all maintenance operations on each child table will be roughly three times faster than before. While it’s true that we’ve also multiplied the amount of maintenance by a similar factor, such operations are generally automated.

Autovacuum and autoanalyze will now be much faster, and only focus on modified data. This means our 2015 table will likely experience several more of these maintenance cycles as it accumulates rows until 2016, after which it might remain totally inert. Before we partitioned, each of these operations would have processed all 100-million rows every time since they were all contained in a single table. Now, it can focus on just the portion that’s changing frequently.

We also win the option of easy archival. If we no longer need 2012 data, we can export it with a COPY command, and then drop the table. With a single table, we could only remove it with a DELETE statement. Not only is this orders of magnitude slower since it has to locate and then remove each row for 2012, but due to PGDB’s versioned storage, we now have several million dead rows to account for. If a VACUUM doesn’t immediately follow such a procedure, we’ve virtually guaranteed performance-robbing bloat.

Large tables take a very long time to VACUUM, which would cause a chain of delays along the data import stack. This kind of maintenance doesn’t block table writes, but if we don’t delay such jobs following significant changes, we again risk unnecessary table bloat. Or the disk activity needed for the maintenance will delay it for us. Really, using a single table in these cases brings a lot of downsides.

Used properly, this kind of partitioning is close to an equivalent exchange regarding query performance. Of course, there are known caveats associated with this kind of partitioning structure.

  1. Trigger overhead to redistribute data entry is very real. Even fast triggers add significant overhead.
  2. Primary keys can not distinguish rows in a partitioned table. Since each unique index exists independently on the partition, there’s no way to prevent duplicates across the set.
  3. A system must be in place to change trigger targets and build new partitions as they are needed. This is a lot more complicated than just using a single static table.

With that said, the first caveat is easy to circumvent. Any good partitioning scheme has an associated naming scheme. Loading jobs can target new data at the appropriate partition directly, thus reverting to single-table performance in that regard. Software that is PGDB partition-aware is amazingly beneficial, but also unfortunately rare due to all the custom tooling surrounding it.
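
For instance (using a hypothetical sensor_log_staging table as the source), a loader that already knows the year can insert straight into the child table and skip the trigger entirely:

INSERT INTO sensor_log_part_2015
SELECT *
  FROM sensor_log_staging
 WHERE reading_date >= '2015-01-01'
   AND reading_date < '2016-01-01';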

The second concern is a bit tricky since such checks are inherent to PGDB operation. We can, however, supplement the PRIMARY KEY syntax with a stored procedure that performs verification across all partitions. Then we simply attach a trigger to child partitions that would invoke it on INSERT or UPDATE. Due to the associated overhead, such an extreme measure should be reserved for incredibly critical cases. However, we should note that large tables are generally data sinks; their primary key columns are usually handled through sequences or other types of auto-numbering that makes collisions incredibly rare.

The third issue is simply a software engineering problem. In that regard, the pg_partman extension provides quite a bit of automation for provisioning and maintaining partitioned tables. If that doesn’t work, the entire partition stack is in our first block of code. A bit of variable substitution and scripting could transform it into a generalized process. In fact, I’ve seen such things in Groovy and Python data import applications.

It all comes down to what kind of tradeoffs you want. Is it worth taking a potential performance hit in some SELECT and INSERT operations to prevent long-term data juggling? Is maintenance frequent enough that large objects act as a noticeable bottleneck? Is there enough data accumulation and turnover that distributing it is desirable? There are a lot of questions to consider before jumping straight into this partition style.

This methodology is a first step to a basic distributed architecture. Next week, we’ll consider something a bit more exotic.

Andreas Scherbaum: PostgreSQL Day in Beijing

From Planet PostgreSQL. Published on Aug 20, 2015.

Author
Andreas 'ads' Scherbaum

What happens if you ask the Chinese PostgreSQL community for a Meetup-like event with one or two speakers? You end up getting a full-day conference ;-)

 

On September 12, Saturday, a full day PostgreSQL conference will take place at the Pivotal office in Beijing. If you want to attend, please sign up here.

The full address is: 17/F, South Block, Tower C, Raycom Info Tech Park, NO.2, Kexueyuan South Road, Beijing 100190, China.

 


Continue reading "PostgreSQL Day in Beijing"

Getting Started with Django Admin

By GoDjango - Django Screencasts from Django community aggregator: Community blog posts. Published on Aug 20, 2015.

The admin is a very useful tool for developers to use when working with Django. It is super easy to set up and get your models registered with it, so you can start CRUDing data.
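For instance, registering a model is a one-liner in an app's admin.py (the Article model here is just a hypothetical stand-in for one of your own models):

from django.contrib import admin

from .models import Article  # any model from your app


admin.site.register(Article)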
Watch Now...

Marco Slot: A futuristic pg_shard demo: a scalable, dynamically changing row and columnar store in PostgreSQL

From Planet PostgreSQL. Published on Aug 18, 2015.

At this year's PGCon, we gave a talk on pg_shard that included a futuristic pg_shard demo: a distributed table with JSONB fields, backed by a dynamically changing row and columnar store. The demo is based on the Github archive data, which comprises a log of all events across all public Github repositories with detailed metadata in JSON format.

The pg_shard extension can store very large event logs in a PostgreSQL table by transparently sharding the table across many servers. We envision a use-case where new events are added to the table in real-time and frequent look-ups occur on data less than 7 days old. Data older than 7 days is compressed inside the database using cstore_fdw. A video of the pg_shard talk and demo is available on our website and at the end of this post.

In our demo, the events table was range-partitioned by time, such that each shard contains events for a specific hour of the day. One of the advantages of range-partitioning a table by time is that older shards are read-only, which makes it easier to perform certain operations on the shards, such as replacing the shard with an equivalent cstore_fdw table. We consider this demo to be futuristic since pg_shard does not yet have functions to create a range-partitioned table. It can only be set up by manually changing the pg_shard metadata tables, for which we created several PL/pgSQL functions.

The architecture used in the demo resembles a shard assembly line as shown below. In the first stage of the assembly line the shard is filled with data, after that the shard is frequently read by SELECT queries, and after 7 days the shard is compressed for archival. When pg_shard receives an INSERT, it compares the timestamp against the time ranges of the shards in the metadata tables to find the shard for the current hour. Assuming INSERTs only happen for the current time, there is only ever one shard to which new data is being written. At the end of the hour, INSERTs will start going into a new shard. The shard for the next hour needs to be set up in advance, for example by an hourly cron job.

Inserting into a time-partitioned distributed table

Compression is done by a periodic cron job, which goes through all shard placements with a time range that lies more than 7 days in the past. The compress_table function changes every uncompressed table into a cstore_fdw table in parallel across the cluster. Queries can continue to run while the data is being compressed. After compression, the storage size of a shard is typically reduced by around 50%. 

The pg_shard demo presented here could be especially suitable for auditing and metering use-cases, with various queries on new data to measure usage or detect abuse, and archival of older data. When using CitusDB, you can also run parallel queries on the data for real-time analytics, for example to power an analytics dashboard. While some features for range-based partitioning are still missing, we will be adding better support for range-partitioning in upcoming releases, and there are similar set-ups of pg_shard and CitusDB already running in production.

 

Video of our pg_shard talk at PGCon 2015 - Demo starts at 30:00


Hubert 'depesz' Lubaczewski: Waiting for 9.6 – Add IF NOT EXISTS processing to ALTER TABLE ADD COLUMN

From Planet PostgreSQL. Published on Aug 18, 2015.

On 30th of July, Andrew Dunstan committed patch: Add IF NOT EXISTS processing to ALTER TABLE ADD COLUMN. Fabrízio de Royes Mello, reviewed by Payal Singh, Alvaro Herrera and Michael Paquier. Nice. I always like it when there is a new "IF EXISTS" or "IF NOT EXISTS", because it makes my life as a DBA simpler. This […]
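
For illustration (the table and column names are made up), the new form simply does nothing instead of raising an error when the column already exists:

ALTER TABLE my_table ADD COLUMN IF NOT EXISTS note text;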

Josh Berkus: SFPUG City: PipelineDB

From Planet PostgreSQL. Published on Aug 18, 2015.

For September, two of the founders of PipelineDB are going to be introducing this streaming PostgreSQL fork. Come learn what PipelineDB is about and how you'd use it alongside PostgreSQL. This meetup will be at a new venue, If(we), a social networking company which uses PostgreSQL together with internal databases. RSVP on meetup.

P.S. we are looking for a lightning talk for this meetup! Contact josh-at-postgresql.org.

Glyn Astill: Windows psql and utf8 client_encoding issues

From Planet PostgreSQL. Published on Aug 18, 2015.

Prior to pg 9.1, you could connect to any database with psql regardless of encoding and you’d get the server encoding as your client encoding unless you set it: That wasn’t quite right; the client_encoding is a lie. On a modern psql version that’s quite rightly prevented: This is not an issue if you want […]

Lossy compression for optimizing animated GIFs

By Cloudinary Blog - Django from Django community aggregator: Community blog posts. Published on Aug 18, 2015.

Lossy animated GIF

Animated GIFs keep getting more and more popular, but they are generally very big files with slow loading times and high bandwidth costs, while the format itself is quite old and not optimized for modern video clips. As developers, you need to allow users to upload their animated GIF files, but you also need to deliver them optimized, which can be a complex, time consuming process.

One option is to convert animated GIFs to videos, another great feature available with Cloudinary which saves on file size and bandwidth (see this blog post for more details), but videos are still not as easy to embed in sites and apps compared to regular image tags, and some browsers and mobile devices still do not support auto playing video files. Another option is to convert animated GIFs to animated WebP files, but this format, introduced by Google and supported on Chrome, is unfortunately not supported by most of the other mobile devices and browsers. Likewise, GIF conversion tools seem to either create files that are too large and of high quality, or smaller files with a bad visual quality.

Another great solution is to perform GIF optimization using a lossy compression technique rather than the lossless nature of the GIF format, allowing you to optimize uploaded animated GIFs to be delivered as smaller files while still looking good. Lossy compression is actually a misnomer for GIFs as the compression algorithms used in GIFs are lossless, and there is no loss of data when compressing this palette-based format (although converting to GIF from other image formats does result in a loss of data due to the 8bit GIF limitation of 256 colors).

The lossiness comes in when the GIF is first filtered or altered so that the image can then compress more efficiently. The loss of data occurs in this filtering phase by increasing redundant patterns along scan lines to subsequently improve the actual compression.

Applying lossy GIF compression

The lossy compression feature is available using Cloudinary's on-the-fly dynamic manipulation URLs, with no need to install any software or to use any computational power on your side because the image manipulation takes place in the cloud. To leverage this capability, and tell Cloudinary to automatically use lossy compression, all you need to do is set the flag parameter to lossy (fl_lossy in URLs).

For example, the following animated GIF named kitten_fighting uploaded to Cloudinary has a file size of 6.3 MB.

Ruby:
cl_image_tag("kitten_fighting.gif")
PHP:
cl_image_tag("kitten_fighting.gif")
Python:
CloudinaryImage("kitten_fighting.gif").image()
Node.js:
cloudinary.image("kitten_fighting.gif")
Java:
cloudinary.url().imageTag("kitten_fighting.gif")
jQuery:
$.cloudinary.image("kitten_fighting.gif")
.Net:
cloudinary.Api.UrlImgUp.BuildImageTag("kitten_fighting.gif")
original non-optimized animated GIF

Enabling the lossy flag, which means adding the fl_lossy parameter to the delivery URL, optimizes the animated GIF to a file size of 2.5 MB. The file still looks good and is now 40% of the original size.

Ruby:
cl_image_tag("kitten_fighting.gif", :flags=>:lossy)
PHP:
cl_image_tag("kitten_fighting.gif", array("flags"=>"lossy"))
Python:
CloudinaryImage("kitten_fighting.gif").image(flags="lossy")
Node.js:
cloudinary.image("kitten_fighting.gif", {flags: "lossy"})
Java:
cloudinary.url().transformation(new Transformation().flags("lossy")).imageTag("kitten_fighting.gif")
jQuery:
$.cloudinary.image("kitten_fighting.gif", {flags: "lossy"})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Flags("lossy")).BuildImageTag("kitten_fighting.gif")
Optimized animated GIF with lossy compression

You can further control the level of lossy compression in the resulting animated GIF by also adding the quality parameter (q in URLs), which has a default value of 80. For example, enabling lossy compression for the kitten_fighting GIF and also setting the quality parameter to 50 results in a file size of 2.1 MB, which means we saved almost 70% of the original file size.

Ruby:
cl_image_tag("kitten_fighting.gif", :quality=>50, :flags=>:lossy)
PHP:
cl_image_tag("kitten_fighting.gif", array("quality"=>50, "flags"=>"lossy"))
Python:
CloudinaryImage("kitten_fighting.gif").image(quality=50, flags="lossy")
Node.js:
cloudinary.image("kitten_fighting.gif", {quality: 50, flags: "lossy"})
Java:
cloudinary.url().transformation(new Transformation().quality(50).flags("lossy")).imageTag("kitten_fighting.gif")
jQuery:
$.cloudinary.image("kitten_fighting.gif", {quality: 50, flags: "lossy"})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Quality(50).Flags("lossy")).BuildImageTag("kitten_fighting.gif")
50% quality lossy animated GIF

Further animated GIF manipulations with lossy compression

The lossy compression feature can be mixed with any of Cloudinary's rich set of image manipulation capabilities to match any graphic design, any dimensions, different devices, different browsers, responsive layouts and more. Lossy compression can also optimize a generated image, so instead of optimizing the original large animated GIF, you can optimize each manipulated or cropped version you would like to display.

For example, the following code generates and delivers a version of the uploaded kitten_fighting animated GIF as follows:

  • Crops the animated GIF to a width of 50% and a height of 80%.
  • Adds another uploaded PNG image named cloudinary_icon as an overlay. The overlay is resized to a width of 40 pixels, positioned 5 pixels from the top right corner of the animated GIF and made 40% semi-transparent.
  • Applies lossy compression with a quality value of 50%.

The size of the delivered file is 765 KB compared to 1.9 MB without using lossy compression (a reduction of 60% in file size).

Ruby:
cl_image_tag("kitten_fighting.gif", :transformation=>[
  {:width=>0.5, :height=>0.8, :crop=>:crop},
  {:opacity=>40, :overlay=>"cloudinary_icon", :width=>40, :x=>5, :y=>5, :crop=>:scale, :gravity=>:north_east},
  {:quality=>50, :flags=>:lossy}
  ])
PHP:
cl_image_tag("kitten_fighting.gif", array("transformation"=>array(
  array("width"=>0.5, "height"=>0.8, "crop"=>"crop"),
  array("opacity"=>40, "overlay"=>"cloudinary_icon", "width"=>40, "x"=>5, "y"=>5, "crop"=>"scale", "gravity"=>"north_east"),
  array("quality"=>50, "flags"=>"lossy")
  )))
Python:
CloudinaryImage("kitten_fighting.gif").image(transformation=[
  {"width": 0.5, "height": 0.8, "crop": "crop"},
  {"opacity": 40, "overlay": "cloudinary_icon", "width": 40, "x": 5, "y": 5, "crop": "scale", "gravity": "north_east"},
  {"quality": 50, "flags": "lossy"}
  ])
Node.js:
cloudinary.image("kitten_fighting.gif", {transformation: [
  {width: 0.5, height: 0.8, crop: "crop"},
  {opacity: 40, overlay: "cloudinary_icon", width: 40, x: 5, y: 5, crop: "scale", gravity: "north_east"},
  {quality: 50, flags: "lossy"}
  ]})
Java:
cloudinary.url().transformation(new Transformation()
  .width(0.5).height(0.8).crop("crop").chain()
  .opacity(40).overlay("cloudinary_icon").width(40).x(5).y(5).crop("scale").gravity("north_east").chain()
  .quality(50).flags("lossy")).imageTag("kitten_fighting.gif")
jQuery:
$.cloudinary.image("kitten_fighting.gif", {transformation: [
  {width: 0.5, height: 0.8, crop: "crop"},
  {opacity: 40, overlay: "cloudinary_icon", width: 40, x: 5, y: 5, crop: "scale", gravity: "north_east"},
  {quality: 50, flags: "lossy"}
  ]})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .Width(0.5).Height(0.8).Crop("crop").Chain()
  .Opacity(40).Overlay("cloudinary_icon").Width(40).X(5).Y(5).Crop("scale").Gravity("north_east").Chain()
  .Quality(50).Flags("lossy")).BuildImageTag("kitten_fighting.gif")
Animated GIF resized, with overlay added and 50% lossy GIF compression

Summary

Lossy compression for animated GIFs allows you to benefit from both worlds: support animated GIFs, enjoy their simplicity, and still deliver smaller files that look good. Improve your user's experience, save on bandwidth, and all that with zero effort in developing your web sites and apps.

If you need much smaller files and you are ready to embed video files, Cloudinary can auto convert GIFs to videos. Likewise, if you allow your users to upload video files and you want to display animated GIFs instead, you can use Cloudinary to dynamically convert videos to GIFs.

The lossy compression feature for animated GIFs is available to all our free and paid plans. If you don't have a Cloudinary account, you are welcome to sign up to our free account and try it out.

Julien Rouhaud: Estimating Needed Memory for a Sort

From Planet PostgreSQL. Published on Aug 18, 2015.

work_mem?

The work memory, or work_mem, is one of the hardest things to configure. It can be used for various purposes: it's mainly used when sorting data or creating hash tables, but it can also be used by set-returning functions that use a tuplestore, like the generate_series() function. And each node of a query can use this amount of memory. Set this parameter too low and a lot of temporary files will be used; set it too high and you may encounter errors, or even an Out Of Memory (OOM) situation depending on your OS configuration.

I’ll focus here on the amount of memory needed when sorting data, to help you understand how much memory is required when PostgreSQL runs a sort operation.

Truth is out

I often hear people say there is a correlation between the size of the temporary files generated and the amount of memory needed. That's wrong: you can't make any assumption about the value of work_mem based on the size of a sort temporary file.

It’s because when the data to be sorted don’t fit in the allowed memory, PostgreSQL switches to an external sort. In addition to the currently used memory, a temporary file is used multiple times, to avoid wasting disk space. If you want more details on this, the relevant source code is present in tuplesort.c and logtapes.c. As a brief introduction, the header of tuplesort.c says:

[…] This module handles sorting of heap tuples, index tuples, or single Datums (and could easily support other kinds of sortable objects, if necessary). It works efficiently for both small and large amounts of data. Small amounts are sorted in-memory using qsort(). Large amounts are sorted using temporary files and a standard external sort algorithm.

See Knuth, volume 3, for more than you want to know about the external sorting algorithm. We divide the input into sorted runs using replacement selection, in the form of a priority tree implemented as a heap (essentially his Algorithm 5.2.3H), then merge the runs using polyphase merge, Knuth’s Algorithm 5.4.2D. The logical “tapes” used by Algorithm D are implemented by logtape.c, which avoids space wastage by recycling disk space as soon as each block is read from its “tape”. […]

It can be easily verified. First, let’s create a table and add some data:

rjuju=# CREATE TABLE sort(id integer, val text);
CREATE TABLE
rjuju=# INSERT INTO sort SELECT i, 'line ' || i
FROM generate_series(1,100000) i;

INSERT 0 100000

To sort all these rows, 7813kB is needed (more details later). Let's see the EXPLAIN ANALYZE with work_mem set to 7813kB and 7812kB:

rjuju=# SET work_mem to '7813kB';
SET
rjuju=# EXPLAIN ANALYZE SELECT * FROM sort ORDER BY id;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Sort  (cost=9845.82..10095.82 rows=100000 width=14) (actual time=50.957..59.163 rows=100000 loops=1)
   Sort Key: id
   Sort Method: quicksort  Memory: 7813kB
   ->  Seq Scan on sort  (cost=0.00..1541.00 rows=100000 width=14) (actual time=0.012..19.789 rows=100000 loops=1)

rjuju=# SET work_mem to '7812kB';
SET

rjuju=# EXPLAIN ANALYZE SELECT * FROM sort ORDER BY id;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Sort  (cost=9845.82..10095.82 rows=100000 width=14) (actual time=142.662..168.596 rows=100000 loops=1)
   Sort Key: id
   Sort Method: external sort  Disk: 2432kB
   ->  Seq Scan on sort  (cost=0.00..1541.00 rows=100000 width=14) (actual time=0.027..18.621 rows=100000 loops=1)

So, 7813kB are needed, and if we lack only 1kB, the temporary file size is 2432kB.

You can also activate the trace_sort parameter to have some more information:

rjuju=# SET trace_sort TO on;
SET
rjuju=# SET client_min_messages TO log;
SET

rjuju=# EXPLAIN ANALYZE SELECT * FROM sort ORDER BY id;
LOG:  begin tuple sort: nkeys = 1, workMem = 7812, randomAccess = f
LOG:  switching to external sort with 28 tapes: CPU 0.00s/0.05u sec elapsed 0.05 sec
LOG:  performsort starting: CPU 0.00s/0.07u sec elapsed 0.07 sec
LOG:  finished writing final run 1 to tape 0: CPU 0.00s/0.15u sec elapsed 0.15 sec
LOG:  performsort done: CPU 0.00s/0.15u sec elapsed 0.15 sec
LOG:  external sort ended, 304 disk blocks used: CPU 0.00s/0.18u sec elapsed 0.19 sec
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Sort  (cost=9845.82..10095.82 rows=100000 width=14) (actual time=154.751..181.724 rows=100000 loops=1)
   Sort Key: id
   Sort Method: external sort  Disk: 2432kB
   ->  Seq Scan on sort  (cost=0.00..1541.00 rows=100000 width=14) (actual time=0.039..23.712 rows=100000 loops=1)

With these data, 28 tapes are used.

So, how do I know how much work_mem is needed?

First, you need to know that all the data will be allocated through PostgreSQL's allocator AllocSet. If you want to know more about it, I recommend reading the excellent articles Tomas Vondra wrote on this topic: Introduction to memory contexts, Allocation set internals and palloc overhead examples.

The needed information here is that the allocator adds some overhead. Each allocated block has a fixed overhead of 16B, and the memory size requested (without the 16B overhead) will be rounded up to a 2^N size. So if you ask for 33B, 80B will be used: 16B of overhead and 64B, the closest 2^N multiple. The work_mem will be used to store every row, and some more information.

For each row to sort, a fixed amount of 24B memory will be used. This is the size of a SortTuple which is the structure sorted. This amount of memory will be allocated in a single block, so we have only 24B overhead (fixed 8B and the 16B to go to the closest 2^N multiple).

The first part of the formula is therefore:

24 * n + 24

(n being the number of tuple sorted)

Then, you have to know that PostgreSQL will preallocate this space for 1024 rows. So you’ll never see a memory consumption of 2 or 3kB.

Each SortTuple will then contain a MinimalTuple, which is basically a tuple without the system metadata (xmin, xmax…), or an IndexTuple if the tuples come from an index scan. This structure will be allocated separately for each tuple, so there can be a pretty big overhead. These structures' lengths are both 6B, but need to be aligned. This represents 16B per tuple.

These structures will also contain the entire row, the size depends on the table, and the content for variable length columns.

The second part of the formula, including the fixed 16B of allocator overhead that each of these per-tuple allocations pays, is therefore:

(16 + 8 + ( (16 + average row length) rounded up to 2^N) ) * n

We can now estimate how much memory is needed:

(24 + 16 + 8 + ( (16 + average row length) rounded up to 2^N) ) * n + 24
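
As a quick sanity check, here is that formula as a small Python helper (a rough sketch: it ignores NULLs, discussed below, and the exception for rows larger than 8kB):

def estimate_sort_memory(n_rows, avg_row_len):
    """Estimate the memory in bytes needed to sort n_rows rows in memory,
    following the formula above."""
    rounded = 1
    while rounded < 16 + avg_row_len:  # round the per-tuple data up to 2^N
        rounded *= 2
    return (24 + 16 + 8 + rounded) * n_rows + 24

print(estimate_sort_memory(100000, 14) / 1024.0)  # ~7812.5kB, matching the 7813kB seen earlier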

Testing the formula

Let’s see on our table. It contains two fields, id and val. id is an integer, so it uses 4B. The val column is variable length. First, figure out the estimated average row size:

rjuju=# SELECT stawidth
FROM pg_statistic WHERE starelid = 'sort'::regclass AND staattnum = 2;
 stawidth
----------
       10

Just to be sure, as I didn’t do any ANALYZE on the table:

rjuju=# SELECT avg(length(val)) FROM sort;
        avg
--------------------
 9.8889500000000000

So, the average row size is approximately 14B. PostgreSQL showed the same estimate in the previous EXPLAIN plan, where the reported width was 14:

Sort  (cost=9845.82..10095.82 rows=100000 width=14) [...]

NOTE: It’s better to rely on pg_statistic, because it’s faster and doesn’t consume resources. Also, if you have large fields, they’ll be toasted, and only a pointer will be stored in work_mem, not the entire field.

We add the 16B overhead for the MinimalTuple structure and get 30B. This will lead to an allocated space of 32B.

Finally, the table contains 100,000 tuples, so we can now compute the memory needed:

(24 + 16 + 8 + 32) * 100000 + 24 = 8000024B = 7812.52kB

We now find the 7813kB I announced earlier!

This is a very simple example. If you only sort some of the rows, the estimated size can be too high or too low if the rows you sort don’t match the average size.

Also, note that if the data length of a row exceeds 8kB (not counting the toasted data), the allocated size won't be rounded up to the next power of two.

Wait, what about NULLs?

Yes, this formula was way too simple…

The formula assumes you don't have any NULL fields, so it computes the maximum estimated memory needed.

A NULL field won’t consume space for data, obviously, but will add a bit in a bitmap stored in the MinimalTuple.

If at least one field of a tuple is NULL, the bitmap will be created. Its size is:

(number of attributes + 7) / 8 bytes (rounded down)

So, if a tuple has 3 integer fields, and two of them are NULL, the data size will not be 16B but:

4 + ( (3+7) / 8) = 5B
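
The same arithmetic as a tiny Python sketch:

def data_size_with_nulls(non_null_data_bytes, n_attributes):
    """Data size for a tuple with at least one NULL: the non-NULL data plus
    a NULL bitmap of (number of attributes + 7) / 8 bytes, rounded down."""
    return non_null_data_bytes + (n_attributes + 7) // 8

print(data_size_with_nulls(4, 3))  # 5, as in the example above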

You can then try to estimate a better size using the NULL fraction statistics of each attribute, available in pg_statistic.

For the lazy ones

Here’s a simple query that will do the maths for you. It assumes:

  • only fields from one table are sorted
  • there are no NULLs
  • all the rows will be sorted
  • statistics are accurate
WITH RECURSIVE overhead(n) AS (
    SELECT 1
    UNION ALL
    SELECT n*2 FROM overhead
    WHERE n <= 4096
),
width AS (
    SELECT starelid,sum(stawidth) AS sum
    FROM pg_statistic
    GROUP BY 1
),
num_of_lines AS (
    SELECT relid,n_live_tup as n
    FROM pg_stat_user_tables
)
SELECT pg_size_pretty(((24 + 16 + 8 + max(o.n)*2) * (min(nol.n))) + 24)
FROM overhead o
CROSS JOIN pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
JOIN width w ON w.starelid = c.oid
JOIN num_of_lines nol ON nol.relid = c.oid
WHERE
c.relname = 'sort'
AND n.nspname = 'public'
AND o.n < (w.sum + 16);
 pg_size_pretty
----------------
 7813 kB

Conclusion

Now, you know the basics to estimate the amount of memory you need to sort your data.

A minimal example was presented here for better understanding; things start to get really complicated when you sort not just all the rows of a single table, but the result of joins and filters.

I hope you’ll have fun tuning work_mem on your favorite cluster. But don’t forget, work_mem is used for more than just sorting tuples!

Estimating Needed Memory for a Sort was originally published by Julien Rouhaud at rjuju's home on August 18, 2015.

Ernst-Georg Schmid: So, what is TABLESAMPLE actually good for?

From Planet PostgreSQL. Published on Aug 18, 2015.

Q: So, what is TABLESAMPLE actually good for?

A: To speed things up that can be answered from a small sample instead of the whole population!

Like so:

CREATE TABLE public.ts_test
(
  id serial NOT NULL,
  val double precision,
  CONSTRAINT ts_test_pkey PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);

Insert 100 million random values in the range 0.0 <= x < 1.0:

INSERT INTO public.ts_test (val)
SELECT
    random()
FROM
    generate_series(1, 100000000);

And now, some queries to get the standard deviation.

For the whole population:

SELECT round(stddev_pop(val)::numeric,5) FROM ts_test;

0.28869 in 15005 ms
Error: 0%
Speedup: 1x

For a sample of size 20000 (good for 1% error and 99% confidence):

SELECT round(stddev_samp(val)::numeric,5) FROM ts_test TABLESAMPLE BERNOULLI(0.02);

0.28867 in 2387 ms
Error: < 1%
Speedup: 6.3x

SELECT round(stddev_samp(val)::numeric,5) FROM ts_test TABLESAMPLE SYSTEM(0.02);

0.28791 in 25 ms
Error: < 1%
Speedup: 600x

With a table that large, both methods BERNOULLI and SYSTEM are accurate enough to give results well below the 1% error target.
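
The argument to TABLESAMPLE is a percentage of the table, so picking it for a target sample size is simple arithmetic. A quick Python sketch using the numbers above:

table_rows = 100000000  # rows in ts_test
target_sample = 20000   # sample size aimed for (~1% error, 99% confidence)

print(100.0 * target_sample / table_rows)  # 0.02 -> BERNOULLI(0.02) or SYSTEM(0.02)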

(For potential problems with SYSTEM, see this post)

SYSTEM aces BERNOULLI with a speedup of 600x vs. 6.3x relative to a full population query.

And this is hard to beat...

Hubert 'depesz' Lubaczewski: Waiting for 9.6 – Add new function pg_notification_queue_usage.

From Planet PostgreSQL. Published on Aug 17, 2015.

On 17th of July, Robert Haas committed patch: Add new function pg_notification_queue_usage.   This tells you what fraction of NOTIFY's queue is currently filled.   Brendan Jurd, reviewed by Merlin Moncure and Gurjeet Singh. A few further tweaks by me. LISTEN / NOTIFY are very cool features, that don't get enough recognition. For completeness sake […]

Federico Campoli: The sea elephants

From Planet PostgreSQL. Published on Aug 17, 2015.

Friday 14th August we kicked off the Brighton PostgreSQL  Meetup.
We had a nice evening with cool people, all together discussing PostgreSQL and how we can run the meetup effectively.

We decided to have a regular monthly meetup hosted around Brighton, possibly by companies or any suitable venue.

The next meetup will be on the 25th of September, and this time there will be some PostgreSQL talks. General interest favours standby servers and streaming replication. Another topic could be query planning and execution.
I can cover both topics. We just need to decide how many talks will fit in a one hour session.

After the meetup we’ll head to the nearest pub for a nice beer.

I can only say a big thank you, to all the meetup members.

That’s all for now.  

Hubert 'depesz' Lubaczewski: Waiting for 9.6 – Add psql PROMPT variable showing the pid of the connected to backend.

From Planet PostgreSQL. Published on Aug 17, 2015.

On 7th of July, Andres Freund committed patch: Add psql PROMPT variable showing the pid of the connected to backend.   The substitution for the pid is %p.   Author: Julien Rouhaud Discussion: 116262CF971C844FB6E793F8809B51C6E99D48@BPXM02GP.gisp.nec.co.jp It was over a month ago, but in the mean time I was on vacation, and I wanted to test […]

Andrew Dunstan: Quick tips

From Planet PostgreSQL. Published on Aug 17, 2015.

Found in the last couple of days.

1. If you have more than one active database, and you're using text format logs, make sure you have %d (the database name) in your log line prefix. Not knowing which database the log line comes from can be confusing and makes life more difficult.

2. If you report on a possible database error, report from the Postgres logs, not from your client's logs. We need to see things from the database's point of view. In the case in point, the Postgres logs contained no relevant errors at all, i.e. this wasn't a Postgres error.

3. Found when trying to resize a data volume on Amazon: certain Linux disk partitioning utilities don't optimally align partitions by default, notably sfdisk and cfdisk. Stay away from those, or make sure you explicitly choose proper alignment for partitions. Programs that do the right thing include fdisk and parted. In general you are probably best to stick to using parted.

David Fetter: Approximate Counts

From Planet PostgreSQL. Published on Aug 17, 2015.

PostgreSQL has MVCC, which is awesome until you want an approximate count on what could be a long view...until now. Here's how. Continue reading "Approximate Counts"

Making Clean Code a Part of Your Build Process (And More!)

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Aug 15, 2015.

At Caktus, "clean" (in addition to "working"!) code is an important part of our delivery. For all new projects, we achieve that by using flake8. flake8 is a wrapper around several tools: pep8, pyflakes, and McCabe. pep8 checks to make sure your code matches the PEP 0008 style guidelines, pyflakes looks for a few additional things like unused imports or variables, and McCabe raises warnings about overly complex sections of code.

We usually check code formatting locally before committing, but we also have safety checks in place in case someone forgets (as I more than anyone have been known to do!). This prevents code formatting standards from slowly eroding over time. We accomplish this by making our continuous integration (CI) server "fail" the build if unclean code is committed. For example, using Travis CI, simply adding the following line to the script: section of your .travis.yml will fail the build if flake8 detects any formatting issues (the command returns a non-zero exit code if errors are found):

- flake8 .

You can adjust flake8 defaults by adding a file named setup.cfg to the top level of your repository. For example, we usually relax the 80 character limit a little and exclude certain automatically generated files:

[flake8]
max-line-length=100
exclude=migrations

As a result you not only have code that is more readable for everyone, but avoids actual errors as well. For example, flake8 will detect missing imports or undefined variables in code paths that might not be tested by your unit test suite.

Adding flake8 to an older project

This is all well and good for new projects, but bringing old projects up to speed with this approach can be a challenge. I recently embarked on such a task myself and thought I'd share what I learned.

I started by adding a setup.cfg to the project and running flake8 on my source tree:

$ flake8 --count
...
1798

The result: a whopping 1798 warnings. Many of these turned out to be pep8's "E128 continuation line under-indented for visual indent":

$ flake8 --select=E128 --count
...
1010

In other words, in a huge number of cases, we weren't indenting multi-line continuations the way pep8 wanted us to. Other errors included things like not having a space after commas (E231), or not having two spaces before inline comments (E261). While many editors do support automatically fixing errors like this, doing so manually would still be tedious: In this project we had nearly 250 Python source files.

Enter autopep8 and autoflake. These tools purport to automatically fix pep8- and pyflakes-related issues. There are two ways to use these tools; one wholesale for all errors across the project at once, and one more granular, addressing only a single group of similar errors at a time.

Addressing all errors at once

This approach is best for smaller projects or those with just a few errors (50-100) to address:

$ pip install autoflake
$ find . -name '*.py'|grep -v migrations|xargs autoflake --in-place --remove-all-unused-imports --remove-unused-variables

In my case, this resulted in changes across 39 files and reduced the number of flake8 errors from 1798 to 1726. Not a huge change, but a time saver nonetheless. autopep8 was even more impressive:

$ pip install autopep8
$ autopep8 --in-place --recursive --max-line-length=100 --exclude="*/migrations/*" .

This brought the number of changed files up to 160, and brought the number of warnings from 1726 down to 211. Note that autopep8 also supports an --aggressive option which allows non-whitespace changes. When I tried this, however, it only reduced the number of warnings from 211 to 198. I'll most likely fix those by hand.

Please note: If or when you're ready to try these commands on a project, you must first make sure you have no uncommitted changes locally. After each change, I also recommend (a) committing the results of each command as a separate commit so they're easier to unravel or review later, and (b) running your unit test suite to make sure nothing is inadvertently broken.

Addressing errors as groups (recommended)

While the above approach may work for smaller projects (not this one!), it can make code reviews difficult because all pyflakes or pep8 fixes are grouped together in a single commit. The more labor intensive but recommended approach is to address them in groups of similar errors. My colleague Rebecca Muraya recommended this approach and suggested the groups (thanks, Rebecca!):

  1. First, find and remove any unused imports:

    $ pip install autoflake
    $ find . -name '*.py'|grep -v migrations|xargs autoflake --in-place --remove-all-unused-imports
    

    After it's finished, review the changes, run your test suite, and commit the code.

  2. Now, do the same for unused variables:

    $ find . -name '*.py'|grep -v migrations|xargs autoflake --in-place --remove-unused-variables
    

    Again, once complete, review the changes, run your test suite, and commit the code.

  3. Before moving on to pep8 errors, the following command provides an invaluable summary of errors not yet addressed:

    $ flake8 --statistics --count -qq
    
  4. Finally, autopep8 can be told to fix only certain error codes, like so:

    $ pip install autopep8
    $ autopep8 --in-place --recursive --max-line-length=100 --exclude="*/migrations/*" --select="W291,W293" .
    

    This will remove trailing whitespace and trailing whitespace on blank lines. Once complete, review and commit your changes and move on to the next group of errors.

pep8's error codes are listed in detail on the autopep8 PyPI page and in the pep8 documentation. You can either group them yourself based on your preferences and the particular warnings in your project, or use the following as a guide:

  • Remove trailing whitespace, then configure your editor to keep it away:
    • W291 - Remove trailing whitespace.
    • W293 - Remove trailing whitespace on blank line.
  • Use your editor to find/replace all tabs, if any, with spaces, and then fix indentation with these error codes. This can have a semantic impact so the changes need to be reviewed carefully:
    • E101 - Reindent all lines.
    • E121 - Fix indentation to be a multiple of four.
  • Fix whitespace errors:
    • E20 - Remove extraneous whitespace.
    • E211 - Remove extraneous whitespace.
    • E22 - Fix extraneous whitespace around keywords.
    • E224 - Remove extraneous whitespace around operator.
    • E226 - Fix missing whitespace around arithmetic operator.
    • E227 - Fix missing whitespace around bitwise/shift operator.
    • E228 - Fix missing whitespace around modulo operator.
    • E231 - Add missing whitespace.
    • E241 - Fix extraneous whitespace around keywords.
    • E242 - Remove extraneous whitespace around operator.
    • E251 - Remove whitespace around parameter '=' sign.
    • E27 - Fix extraneous whitespace around keywords.
  • Adjust blank lines:
    • W391 - Remove trailing blank lines.
    • E301 - Add missing blank line.
    • E302 - Add missing 2 blank lines.
    • E303 - Remove extra blank lines.
    • E304 - Remove blank line following function decorator.
    • E309 - Add missing blank line (after class declaration).
  • Fix comment spacing:
    • E26 - Fix spacing after comment hash for inline comments.
    • E265 - Fix spacing after comment hash for block comments.
  • The following are aggressive fixes that can have semantic impact. It's best to do these one commit at a time and with careful review:
    • E711 - Fix comparison with None.
    • E712 - Fix comparison with boolean.
    • E721 - Use "isinstance()" instead of comparing types directly.
    • W601 - Use "in" rather than "has_key()".
    • W602 - Fix deprecated form of raising exception.
    • W603 - Use "!=" instead of "<>"

You can repeat steps 3 and 4 in the above with each group of error codes (or in the case of more aggressive fixes, single error codes) until they're all resolved. Once all the automatic fixes are done, you'll likely have some manual fixes left to do. Before those, you may want to see what remaining automatic fixes, if any, autopep8 suggests:

$ autopep8 --in-place --recursive --max-line-length=100 --exclude="*/migrations/*" .

Once all the errors have been resolved, add flake8 to your build process so you never have to go through this again.

Concluding Remarks

While I haven't finished all the manual edits yet as of the time of this post, I have reduced the number to about 153 warnings across the whole project. Most of the remaining warnings are long lines that pep8 wasn't comfortable splitting, for example, strings that needed to be broken up like so:

foo = ('a very '
       'long string')

Or other similar issues that couldn't be auto-corrected. To its credit, flake8 did detect two bugs, namely, a missing import (in some unused test code that should probably be deleted), and an instance of if not 'foo' in bar (instead of the correct version, if 'foo' not in bar).
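
To illustrate that last catch (this is not the actual project code, just the pattern), flake8 flags the negated membership test and prefers the not in form:

bar = ['foo', 'baz']

# Flagged by flake8: negated membership test written the long way
if not 'foo' in bar:
    print('missing')

# The preferred, equivalent form
if 'foo' not in bar:
    print('missing')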

My colleague Mark Lavin also remarked that flake8 does not raise warnings about variable naming, but the pep8-naming plugin is available to address this. The downside is that it doesn't like custom assertions which match the existing unittest style (i.e., assertOk, assertNotFound, etc.).

Good luck, and I hope this has been helpful!

django-registration

By James Bennett from Django community aggregator: Community blog posts. Published on Aug 13, 2015.

I mentioned a couple weeks ago that lately I’ve been in the process of providing updated releases for all the various open-source projects I maintain, and specifically mentioned a desire to resurrect django-registration which, once upon a time, was the most popular thing I’d ever written. Over the past week I’ve been focusing more heavily on that, and now it’s time to start talking about a release.

Ancient history

I’ve always felt pretty strongly that Django’s killer ...

Read full entry

AWS load balancers with Django

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Aug 10, 2015.

We recently had occasion to reconfigure some of our existing servers to use Amazon Web Services Elastic Load Balancers in front of them. Setting this up isn't hard, exactly, but there are a lot of moving parts that have to mesh correctly before things start to work, so I thought I'd write down what we did.

All of these tools have lots of options and ways to use them. I'm not trying to cover all the possibilities here. I'm just showing what we ended up doing.

Our requirements

We had some specific goals we wanted to achieve in this reconfiguration.

  • There should be no outside requests sneaking in -- the only requests that should reach the backend servers are those that come through the load balancer. We'll achieve this by setting the backend servers' security group(s) to only allow incoming traffic on those ports from the load balancer's security group.
  • The site should only handle requests that have the right Host header. We achieve this already by Nginx configuration (server_name) and won't have to change anything.
  • Redirect any non-SSL requests to SSL. The load balancer can't do this for us (as far as I could see), so we just forward incoming port 80 requests to the server’s port 80, and let our existing port 80 Nginx configuration continue to redirect all requests to our https: URL.
  • All SSL connections are terminated at the load balancer. Our site certificate and key are only needed on the load balancer. The backend servers don't need to process encryption, nor do we need to maintain SSL configuration on them. We'll have the load balancers forward the connections, unencrypted, to a new listening port in nginx, 8088, because we're redirecting everything from port 80 to https. (We could have configured the port 80 server to figure out from the headers whether the connection came into the load balancer over SSL, but we didn't, figuring that using a separate port would be fool-proof.) If we were concerned about security of the data between the load balancer and the backend, for example if financial or personal information was included, we could re-encrypt the forwarded connections, maybe using self-signed certificates on the backend servers to simplify managing their configurations.
  • Strict-Transport-Security header - we add this already in our Nginx configuration and will include it in our port 8088 configuration.
  • We need to access backend servers directly for deploys (via ssh). We achieve this by keeping our elastic IP addresses on our backend servers so they have stable IP addresses, even though the load balancers don't need them.
  • Some of our servers use basic auth (to keep unreleased sites private). This is in our Nginx configuration, but we'll need to open up the health check URL to bypass basic auth, since the load balancers can't provide basic auth on health checks.
  • Sites stay up through the change. We achieve this by making the changes incrementally, and making sure at all times there's a path for incoming requests to be handled.

All the pieces

Here are all the pieces that we had to get in place:

  • The site's hostname is a CNAME for the elastic load balancer’s hostname, so that requests for the site go to the load balancer instead of the backend servers. Don’t use the load balancer IP addresses directly, since they’ll change over time.
  • The backend servers' security group allows incoming requests on ports 80 and 8088, but only from the load balancer's security group. That allows the load balancer to forward requests, but requests cannot be sent directly to the backend servers even if someone knows their addresses.
  • There's a health check URL on the backend server that the load balancer can access, and that returns a 200 status (not 301 or 401 or anything else), so the load balancers can determine if the backend servers are up.
  • Apart from the health check, redirect port 80 requests to the https URL of the server (non-SSL to SSL), so that any incoming requests that aren't over SSL will be redirected to SSL.
  • Get the data about the request's origin from the headers where the load balancer puts it, and pass it along to Django in the headers that our Django configuration is expecting. This lets Django tell whether a request came in securely (see the settings sketch just after this list).
  • The load balancer must be in the same region as the servers (AWS requirement).
  • Keep the elastic IP on our backend server so we can use that to get to it for administration. Deploys and other administrative tasks can no longer use the site domain name to access the backend server, since it now points at the load balancer.
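
On the Django side, the usual way to let Django trust a forwarded-protocol header is the SECURE_PROXY_SSL_HEADER setting. A minimal sketch, assuming the proxy chain ends up setting X-Forwarded-Proto for the request:

# settings.py (sketch) -- only safe when requests can reach the app
# exclusively through the load balancer/proxy that sets this header
SECURE_PROXY_SSL_HEADER = ('HTTP_X_FORWARDED_PROTO', 'https')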

Where we started

Before adding the load balancer, our site was running on EC2 servers with Ubuntu. Nginx was accepting incoming requests on ports 80 and 443, redirecting all port 80 requests to https, adding basic auth on port 443 on some servers, proxying some port 443 requests to gunicorn with our Django application, and serving static files for the rest.

To summarize our backend server configuration before and after the change:

Before

  • Port 80 redirects all requests to https://server_URL
  • Port 443 terminates SSL and processes requests
  • Server firewall and AWS security group allow all incoming connections on port 80 and 443

After

  • Port 80 redirects all requests to https://server_URL
  • Port 8088 processes requests
  • Server firewall and AWS security group allow port 80 and 8088 connections from the load balancer only, and no port 443 connections at all.

Steps in order

  • DNS: shorten the DNS cache time for the site domain names to something like 5 minutes, so when we start changing them later, clients will pick up the change quickly. We'll lengthen these again when we're done.
  • Django: if needed, create a new view for health checks. We made one at /health/ that simply returned a response with status 200, bypassing all authentication (a sketch of such a view follows this list). We can enhance that view later to do more checking, such as making sure the database is accessible.
  • Nginx: We added a new port 8088 server, copying the configuration from our existing port 443 server, but removing the ssl directives. We did keep the line that added the Strict-Transport-Security header.
  • Nginx: Added configuration in our new port 8088 to bypass basic auth for the /health/ URL only.
  • Ufw: opened port 8088 in the Linux firewall.
  • AWS: opened port 8088 in the servers' security group - for now, from all source addresses so we can test easily as we go.
  • AWS: add the SSL certificate in IAM
  • AWS: create a new load balancer in the same region as the servers
  • AWS: configure the new load balancer:
  • configure to use the SSL certificate
  • set up a security group for the load balancer. It needs to accept incoming connections from the internet on ports 80 and 443.
  • instances: the backend servers this load balancer will forward to
  • health check: port 8088, URL /health/. Set the period and number of checks small for now, e.g. 30 seconds and 2 checks.
  • listeners: 80->80, 443 ssl -> 8088 non-ssl
  • Tests: Now stop to make sure things are working right so far:
  • The load balancer should show the instance in service (after the health check period has passed).
  • With the site domain set in your local /etc/hosts file to point at one of the load balancer's IP addresses, the site should work on ports 80 & 443
  • undo your local /etc/hosts changes since the load balancer IPs will change over time!

  • AWS: update the backend servers' security group to only accept 8088 traffic from the load balancer's security group

  • Test: the health check should still pass, since it's coming in on port 8088 from the load balancer.

  • DNS: update DNS to make the site domain a CNAME for the load balancer's A name

  • wait for DNS propagation
  • test: site should still work when accessed using its hostname.
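
For reference, the /health/ view mentioned in the steps above can be as small as this (a sketch; the names and URL wiring are illustrative, not our exact code):

# views.py
from django.http import HttpResponse

def health(request):
    # Return 200 unconditionally and bypass authentication; this can later
    # be extended to check the database or other dependencies.
    return HttpResponse('OK')

# urls.py
from django.conf.urls import url

urlpatterns = [
    url(r'^health/$', health),
]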

Cleanup

These steps should now be safe to do, but it doesn't hurt to test again after each step, just to be sure.

  • Nginx: remove port 443 server from nginx configuration.
  • AWS: remove port 443 from backend servers’ security group. Configure ports 80 and 8088 to only accept incoming connections from the load balancer's security group.
  • Ufw: block port 443 in server firewall
  • AWS: in the load balancer health check configuration, lengthen the time between health checks, and optionally require more passing checks before treating an instance as live.
  • docs: Update your deploy docs with the changes to how the servers are deployed!
  • DNS: lengthen the cache time for the server domain name(s) if you had shortened it before.

Check out PyCon PL 2015 agenda

By Piotr Maliński from Django community aggregator: Community blog posts. Published on Aug 08, 2015.

PyCon PL 2015 will be held at the Hotel Ossa Congress & Spa in Ossa, from October 15th through October 18th. The conference is held in Poland but consists of two blocks of talks - English and Polish - so non-Polish speaking attendees can benefit from it too.

The agenda and other details can be found on the conference website.

Working with Environment Variables in Python

By GoDjango - Django Screencasts from Django community aggregator: Community blog posts. Published on Aug 06, 2015.

If you look at modern code deployment practices, like the 12 factor app, environment variables are very important for keeping secret information secret, or keeping server-specific information set for that server, all without having long, crazy settings in a settings file for every single server and computer. It helps keep things slimmer in your code, along with the following, quoted from the 12factorapp site:

    * Resource handles to the database, Memcached, and other backing services
    * Credentials to external services such as Amazon S3 or Twitter
    * Per-deploy values such as the canonical hostname for the deploy

What Are Environment Variables?

Environment variables are key-value pairs that can affect how a program runs. They need to be set at some point before a process is run so the process can read them in and act accordingly. A lot of times in production environments your database name and password are set as environment variables so that information does not end up in a code repository somewhere.

By relying on an environment variable your code doesn't care what a setting is because you can set that at some other point. Actually working with environment variables is really simple. All you need to start is import os and you are off to the races.

Get Environment Variable with Python:

import os

name = os.environ.get('DATABASE_NAME')

The environ part of this call is like a dictionary of all the environment variables, so we just get the one we are after and go about our business. You might even see something like the following for your database settings in your Django settings file: you either get the environment variable, or you get back an empty string as the default.

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': os.environ.get('DATABASE_NAME', ''),
        'USER': os.environ.get('DATABASE_USER', ''),
        'PASSWORD': os.environ.get('DATABASE_PASSWORD', ''),
        'HOST': os.environ.get('DATABASE_HOST', ''),
        'PORT': os.environ.get('DATABASE_PORT', ''),
    }
}

Running django-admin.py/manage.py with environment variables

A common use of environment variables in django is the DJANGO_SETTINGS_MODULE. It tells django-admin.py what settings file to load for a project.

DJANGO_SETTINGS_MODULE='project.settings' django-admin.py

manage.py is basically a shortcut to this command: we are setting the location of our settings.py file and running the django-admin.py command. There has been a lot of debate about manage.py since the above is all it really does. You can read and learn more on a ticket about it: Deprecate and then remove "shell" management command
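
For reference, a stock manage.py is essentially just the following, with "project.settings" standing in for your own settings module:

#!/usr/bin/env python
import os
import sys

if __name__ == "__main__":
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "project.settings")

    from django.core.management import execute_from_command_line

    execute_from_command_line(sys.argv)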

Set environment variable in bashrc

Finally, depending on your environment you might want to setup environment variables in your bashrc or bash_profile so every time you start a command line environment they are set. To do this add something like the following to your bashrc or bash_profile

export PROGRAMMING_FOLDER='/home/user/programming/'

The export command is built into the shell and sets PROGRAMMING_FOLDER to /home/user/programming/ for the current environment. You can verify this by running env in your shell, or even env | grep PROGRAMMING to get just that environment variable.

Common Places to Set Environment Variables in Production

The first link covers how to set environment variables with supervisord, which is a recommended way to run your server-side code, generally running gunicorn or uWSGI. uWSGI is linked to because a lot of people run it outside of supervisord. However, if you use gunicorn it is recommended to use supervisord, so set environment variables inside that config.

Conclusion

Environment variables are extremely useful, but can be a huge pain to work with. I really suggest playing with them in different environments to get used to working with them. When I first started setting them in production environments it was tough to learn how to set them and still run my code. However, once I got environment variables working locally in my dev environment, and in production, it became super convenient. I am super glad I took the extra effort to learn a process for working with them, and I think you will be too.

Announcing the Caktus Open Source Fellowship

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Aug 06, 2015.

We are excited to announce the creation of a pilot program for open source contributions here at Caktus Group. This program is inspired by the Django Software Foundation’s fellowship as well as the Two Day Manifesto. For this program, Caktus seeks to hire a part-time developer for twelve weeks this fall for the sole purpose of contributing back to open source projects. Caktus builds web applications based on open source tools and the continued growth of these projects is important to us. Open source projects such as Python and Django have given so much to this company and this is one of many ways we are trying to give back.

We are looking for candidates who would love to collaborate with us in our offices in downtown Durham, NC to directly contribute to both open source projects created at Caktus as well as open source projects used by Caktus, such as Django. As previously mentioned, this will be a part-time position and will be taking the place of our normal fall internship program. If successful, we may expand this into a full-time position for future iterations.

I think this will be a great opportunity for a developer to get experience working with and contributing to open source. It could be a great resume builder for a relatively new developer where all of your work will be publicly visible. It could also be a fun break from the ordinary for a more experienced developer who would love to be paid to work on open source. Through our mentorship, we hope this program will empower people who otherwise would not contribute to open source.

Prior experience with Python is a general requirement, but no prior open source contributions are required. If you have specific projects you would like to contribute to, we would love to know about them during the application process.

We are looking forward to reviewing and selecting applicants over the next few weeks. You can find more details on the position as well as the application form here: https://www.caktusgroup.com/careers/#op-73393-caktus-open-source-fellow

Template fragment caching gotchas

By Agiliq's Django Blog from Django community aggregator: Community blog posts. Published on Aug 06, 2015.

Variables in cached template fragment

Assuming this is in the template:

{% cache 300 nums %}
{% for i in nums %}
    <p>{{ i }}</p>
{% endfor %}
{% endcache %}

And assuming we send {'nums': range(100)} from context, then 0 to 99 will be sent in the response.

Now suppose we change the context to {'nums': range(1000)}. Still, for the next 5 minutes, i.e. until the cache expires, 0 to 99 will be sent in the response; 0 to 999 will not.

To fix this, we should use the variable with the {% cache %} tag too. So the correct code would be:

{% cache 300 nums_cache nums %}
{% for i in nums %}
    <p>{{ i }}</p>
{% endfor %}
{% endcache %}

After this, whenever the context variable nums changes, the cached fragment will be reevaluated.
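
For reference, a minimal view that renders this template could look like the following (the names are illustrative). Because nums is now part of the cache key, changing range(100) to range(1000) here causes the fragment to be re-rendered instead of served stale:

from django.shortcuts import render

def nums_view(request):
    # nums is part of the {% cache %} key above, so changing it
    # invalidates the cached fragment
    return render(request, 'template.html', {'nums': range(1000)})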

Boolean variable in cached template fragment

Assuming the template contains:

{% cache 300 hello %}
{% if hello %}
  <p>Hello</p>
{% endif %}
{% endcache %}

and assuming {'hello': True} is sent in context. Then <p>Hello</p> will be sent in response.

Now even when we send {'hello': False} in context, "<p>Hello</p>" will still be sent in response because it's already cached.

To fix this:

{% cache 300 hello_cache hello %}
{% if hello %}
  <p>Hello</p>
{% endif %}
{% endcache %}

request.user in cached template fragment

{% cache 300 username_cache %}
<p>{{request.user.username}}</p>
{% endcache %}

When the current user logs out and a new user logs in, the username of the old user is still shown on the web page because the template fragment is already cached and will not be reevaluated.

Fix it like this:

{% cache 300 username_cache request.user.username %}
<p>{{request.user.username}}</p>
{% endcache %}

Gotcha while using base template

Assuming we are using template inheritance, and our base template i.e base.html looks like:

{% load cache %}
<html>
    <head>
    </head>
    <body>
    {% cache 300 base_body %}
    {% block body %}
    {% endblock %}
    {% endcache %}
    </body>
</html>

And the child template, say test.html looks like

{% extends "base.html" %}
{% load cache %}
{% block body %}
  {% cache 300 username request.user.username %}
    <p>{{request.user.username}}</p>
  {% endcache %}
{% endblock %}

Assuming our view uses test.html. Suppose cache is empty now, and you are logged in as user1 and then you make the request to a url which uses test.html. So response will contain "<p>user1</p>".

Now if you logout and login as user2 and make the request, still the response will be "<p>user1</p>" instead of "<p>user2</p>".

The reason for this, as per the Django docs, is:

The extends tag is the key here. It tells the template engine that this template extends another template. When the template system evaluates this template, first it locates the parent - in this case, base.html.

But in our case, the parent template itself has some cached content, which Django will use. Django will not even bother to look at the {% block body %} of child template if it finds cached content base_body for the parent template.

The fix for this is to differentiate the cache of different users even in the base template. So we could change the base template to look like.

{% load cache %}
<html>
    <head>
    </head>
    <body>
    {% cache 300 base_body request.user.username %}
    {% block body %}
    {% endblock %}
    {% endcache %}
    </body>
</html>

After this, if user1 is logged in the "<p>user1</p>" is sent in response. If user2 is logged in then "<p>user2</p>" is sent in response.

Gotcha with {% include %} tag.

It is similar to the template inheritance gotcha discussed in the last section.

Assuming the view uses test.html, which contains:

{% cache 60 body_cache %}
{% include "body.html" %}
{% endcache %}

And body.html contains

<p>Hello {{request.user.username}}</p>

So when the first user, say user1, logs in, the cache is evaluated and <p>Hello user1</p> is sent in the response. When user2 logs in, <p>Hello user1</p> will still be sent in the response.

To fix this, test.html should also use all the variables on which the included template depends.

So code in test.html should change to

{% cache 60 body_cache request.user %}
{% include "body.html" %}
{% endcache %}

If the included template, i.e. body.html, depended on any other variable, that variable too should be used with the {% cache %} tag.

Announcing Django Girls RDU: Free Coding Workshop for Women

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Aug 05, 2015.

Django Girls Logo

We’re incredibly excited to announce the launch of Django Girls RDU, a group in NC’s Triangle region that hosts free one-day Django coding workshops for women. Django Girls is part of an international movement that’s helped 1,600 (and counting!) women learn how to code.

We originally got involved with Django Girls at the impetus of our Technical Director, Mark Lavin. While discussing our efforts to support women in tech, Mark was emphatic: there was this group, Django Girls, that was doing extraordinary work engaging women in Django. Clearly, we needed something like that locally. Luckily for us, Django Girls was coming to PyCon and we'd get to see first hand just how wonderful they are.

Mark Lavin coaching at Django Girls PyCon 2015

Four of our team members volunteered as coaches for a Django Girls workshop during PyCon 2015. There was nothing quite like seeing the impact Django Girls had on each attendee. The environment was warm, friendly, and infectiously enthusiastic. The tutorials for students, coaches, and organizers, were very thoughtful and detailed. The passion of the Django Girls organizers was evident in every moment. Out of a desire to prolong this excitement and share it with everyone we knew, we put together a team and applied to have a Django Girls RDU chapter. We’re honored to partner with such a wonderful group!

The first workshop will be on October 3rd; applications are due September 4th. We have five great coaches from Caktus and PyLadies RDU and each coach will work one-on-one with three students to build their first Django website, a blog. We’re looking for volunteers to coach and sponsor the workshop. Each additional coach means three more students we can accept. If you’d like to get involved, please email us at durham@djangogirls.org. And, of course, we’re also looking for women who want to learn how to code.

Not able to make the October 3rd meetup? You'll also find members of our team coaching at DjangoCon's Django Girls Austin. To learn about more Django Girl activities, please follow us @djangogirlsRDU or visit the DjangoGirls website.

Automatic visual image enhancement for your web application

By Cloudinary Blog - Django from Django community aggregator: Community blog posts. Published on Jul 29, 2015.

VIESUS automatic image enhancement

Various factors can have an effect on the visual quality of photos captured by a wide variety of digital cameras. Technical limitations of cameras, coupled with changing conditions in which users take photos, result in a wide range of visual quality. Camera-related limitations arise from a combination of poor optics, noisy sensors, and the modest capabilities of mobile camera phones that are used to take photos in conditions that range from bright daylight to indoor scenes with incandescent light or even dark night scenes.

If you have lots of spare time, one option is to spend hours trying to enhance your images by adjusting brightness and color, restoring sharpness, removing noise, correcting for overexposure or underexposure, etc. Furthermore, the results achieved will not only depend on your training and experience with the photo editing software, but also on the quality, condition and calibration of the monitor used. Manual fine-tuning is also time consuming, and as the amount of image content is constantly growing, there is an obvious need for automatic image enhancement.

VIESUS logo

VIESUS™ is a software application developed by Imaging Solutions AG that takes everyday digital camera images and enhances them to look more visually attractive. VIESUS first analyses the image data then automatically applies any processing steps as needed: fixing dull colors and bad color balance, removing digital noise, adjusting poor sharpness / blurriness, correcting for overexposure or underexposure, and more.

Cloudinary provides an add-on for using VIESUS's image enhancement capabilities, fully integrated into Cloudinary's image management and manipulation pipeline. With VIESUS's image enhancement add-on, you can extend Cloudinary's powerful image manipulation and optimization capabilities by automatically enhancing images to their best visual quality.

Automatically enhancing images

Cloudinary already supports on-the-fly manipulation using URLs for resizing, cropping, applying effects, etc. Now you can also use VIESUS as an effect by setting the effect transformation parameter to viesus_correct (or e_viesus_correct for URLs) which tells Cloudinary to dynamically enhance the image to the best visual quality using the VIESUS add-on.

Take a look at the following photo of the Golden Gate Bridge in San Francisco that was uploaded to Cloudinary's demo account as golden_gate_side.jpg. The original photo on the left has darkened colors, low contrast and poor sharpness, and looks like it was taken on an overcast day. In the VIESUS enhanced photo on the right, the brightness and contrast is increased and the colors appear sharper and more vivid, while the photo now looks like it was taken on a bright sunny day.

Original image Auto corrected image

Ruby:
cl_image_tag("golden_gate_side.jpg", :width=>350, :crop=>:scale, :effect=>"viesus_correct", :sign_url=>true)
PHP:
cl_image_tag("golden_gate_side.jpg", array("width"=>350, "crop"=>"scale", "effect"=>"viesus_correct", "sign_url"=>true))
Python:
CloudinaryImage("golden_gate_side.jpg").image(width=350, crop="scale", effect="viesus_correct", sign_url=True)
Node.js:
cloudinary.image("golden_gate_side.jpg", {width: 350, crop: "scale", effect: "viesus_correct", sign_url: true})
Java:
cloudinary.url().transformation(new Transformation().width(350).crop("scale").effect("viesus_correct")).signed(true).imageTag("golden_gate_side.jpg")
jQuery:
$.cloudinary.image("golden_gate_side.jpg", {width: 350, crop: "scale", effect: "viesus_correct"})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Width(350).Crop("scale").Effect("viesus_correct")).Signed(true).BuildImageTag("golden_gate_side.jpg")

Further image manipulations

Visual enhancement using the VIESUS add-on can be mixed with any of Cloudinary's rich set of image manipulation capabilities. The VIESUS add-on can also enhance a generated image, so instead of improving the original large photo, you can separately enhance each thumbnail or cropped version you would like to display.

For example, the following code generates and delivers a version of the uploaded golden_gate_side photo as follows:

  • Crops the photo to a width of 80% and a height of 35% with east gravity, and applies the viesus_correct effect.
  • Adds another uploaded png image named viesus_icon as an overlay. The overlay is resized to a width of 400 pixels, positioned 10 pixels from the top right corner of the containing image and is made 40% semi transparent.
  • The entire image is scaled down to a width of 600 pixels with rounded corners.

Without visual enhancement:

golden_gate_side.jpg cropped to 600 pixels with rounded corners and a logo overlay

With VIESUS visual enhancement:

golden_gate_side.jpg cropped to 600 pixels with rounded corners, enhanced with viesus and a logo overlay

Ruby:
cl_image_tag("golden_gate_side.jpg", :sign_url=>true, :transformation=>[
  {:width=>0.8, :height=>0.35, :crop=>:crop, :gravity=>:east, :effect=>"viesus_correct"},
  {:opacity=>40, :overlay=>"viesus_icon", :width=>400, :x=>10, :y=>10, :crop=>:scale, :gravity=>:north_east},
  {:radius=>20, :width=>600, :crop=>:scale}
  ])
PHP:
cl_image_tag("golden_gate_side.jpg", array("sign_url"=>true, "transformation"=>array(
  array("width"=>0.8, "height"=>0.35, "crop"=>"crop", "gravity"=>"east", "effect"=>"viesus_correct"),
  array("opacity"=>40, "overlay"=>"viesus_icon", "width"=>400, "x"=>10, "y"=>10, "crop"=>"scale", "gravity"=>"north_east"),
  array("radius"=>20, "width"=>600, "crop"=>"scale")
  )))
Python:
CloudinaryImage("golden_gate_side.jpg").image(sign_url=True, transformation=[
  {"width": 0.8, "height": 0.35, "crop": "crop", "gravity": "east", "effect": "viesus_correct"},
  {"opacity": 40, "overlay": "viesus_icon", "width": 400, "x": 10, "y": 10, "crop": "scale", "gravity": "north_east"},
  {"radius": 20, "width": 600, "crop": "scale"}
  ])
Node.js:
cloudinary.image("golden_gate_side.jpg", {sign_url: true, transformation: [
  {width: 0.8, height: 0.35, crop: "crop", gravity: "east", effect: "viesus_correct"},
  {opacity: 40, overlay: "viesus_icon", width: 400, x: 10, y: 10, crop: "scale", gravity: "north_east"},
  {radius: 20, width: 600, crop: "scale"}
  ]})
Java:
cloudinary.url().transformation(new Transformation()
  .width(0.8).height(0.35).crop("crop").gravity("east").effect("viesus_correct").chain()
  .opacity(40).overlay("viesus_icon").width(400).x(10).y(10).crop("scale").gravity("north_east").chain()
  .radius(20).width(600).crop("scale")).signed(true).imageTag("golden_gate_side.jpg")
jQuery:
$.cloudinary.image("golden_gate_side.jpg", {transformation: [
  {width: 0.8, height: 0.35, crop: "crop", gravity: "east", effect: "viesus_correct"},
  {opacity: 40, overlay: "viesus_icon", width: 400, x: 10, y: 10, crop: "scale", gravity: "north_east"},
  {radius: 20, width: 600, crop: "scale"}
  ]})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .Width(0.8).Height(0.35).Crop("crop").Gravity("east").Effect("viesus_correct").Chain()
  .Opacity(40).Overlay("viesus_icon").Width(400).X(10).Y(10).Crop("scale").Gravity("north_east").Chain()
  .Radius(20).Width(600).Crop("scale")).Signed(true).BuildImageTag("golden_gate_side.jpg")

For more detailed information on implementing this automatic visual enhancement to your images, see the VIESUS™ add-on documentation, and for a full list of Cloudinary's image manipulation options, see the Image transformations documentation.

Summary

Enhancing your images and user uploaded photos makes your website look nicer and improves user engagement. The VIESUS add-on is utilized to extend Cloudinary's powerful image manipulation and optimization capabilities by automatically enhancing images to their best visual quality. Simply add a single parameter to your image URLs and everything is done seamlessly, dynamically and automatically for you.

VIESUS automatic visual enhancement  add-on

The free tier of the VIESUS add-on is available to all our free and paid plans. If you don't have a Cloudinary account, you are welcome to sign up to our free account and try it out.

Using Unsaved Related Models for Sample Data in Django 1.8

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jul 28, 2015.

Note: In between the time I originally wrote this post and it getting published, a ticket and pull request were opened in Django to remove allow_unsaved_instance_assignment and move validation to the model save() method, which makes much more sense anyways. There's a chance this will even be backported to Django 1.8. So, if you're using a version of Django that doesn't require this, hopefully you'll never stumble across this post in the first place! If this is still an issue for you, here's the original post:

In versions of Django prior to 1.8, it was easy to construct "sample" model data by putting together a collection of related model objects, even if none of those objects was saved to the database. Django 1.8 adds a restriction that prevents this behavior. Errors such as this are generally a sign that you're encountering this issue:

ValueError: Cannot assign "...": "MyRelatedModel" instance isn't saved in the database.

The justification for this is that, previously, unsaved foreign keys were silently lost if they were not saved to the database. Django 1.8 does provide a backwards compatibility flag to allow working around the issue. The workaround, per the Django documentation, is to create a new ForeignKey field that removes this restriction, like so:

class UnsavedForeignKey(models.ForeignKey):
    # A ForeignKey which can point to an unsaved object
    allow_unsaved_instance_assignment = True

class Book(models.Model):
    author = UnsavedForeignKey(Author)

This may be undesirable, however, because this approach means you lose all protection for all uses of this foreign key, even if you want Django to ensure foreign key values have been saved before being assigned in some cases.

There is a middle ground, not immediately obvious, that involves changing this attribute temporarily during the assignment of an unsaved value and then immediately changing it back. This can be accomplished by writing a context manager to change the attribute, for example:

import contextlib

@contextlib.contextmanager
def allow_unsaved(model, field):
    model_field = model._meta.get_field(field)
    saved = model_field.allow_unsaved_instance_assignment
    model_field.allow_unsaved_instance_assignment = True
    try:
        yield
    finally:
        # Restore the original setting even if the block raises
        model_field.allow_unsaved_instance_assignment = saved

To use this decorator, surround any assignment of an unsaved foreign key value with the context manager as follows:

with allow_unsaved(MyModel, 'my_fk_field'):
    my_obj.my_fk_field = unsaved_instance

The specifics of how you access the field to pass into the context manager are important; any other way will likely generate the following error:

RelatedObjectDoesNotExist: MyModel has no instance.

While strictly speaking this approach is not thread safe, it should work for any process-based worker model (such as the default "sync" worker in Gunicorn).
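
For example, to put together sample data entirely in memory, with no database writes at all, using the Book and Author models from above:

author = Author()  # unsaved; has no primary key yet
book = Book()      # also unsaved

with allow_unsaved(Book, 'author'):
    book.author = author  # no ValueError, even though author is unsaved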

This took a few iterations to figure out, so hopefully it will (still) prove useful to someone else!

PyCon 2015 Workshop Video: Building SMS Applications with Django

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jul 24, 2015.

As proud sponsors of PyCon, we hosted a one and a half hour free workshop. We see the workshops as a wonderful opportunity to share some practical, hands-on experience in our area of expertise: building applications in Django. In addition, it’s a way to give back to the open source community.

This year, Technical Director Mark Lavin and Developers Caleb Smith and David Ray presented “Building SMS Applications with Django.” In the workshop, they taught the basics of SMS application development using Django and Django-based RapidSMS. Aside from covering the basic anatomy of an SMS-based application, as well as building SMS workflows and testing SMS applications, Mark, David, and Caleb were able to bring their practical experience with Caktus client projects to the table.

We’ve used SMS on behalf of international aid organizations and agencies like UNICEF as a cost-effective and pervasive method for conveying urgent information. We’ve built tools to help Libyans register to vote via SMS, deliver critical infant HIV/AIDs results in Zambia and Malawi, and alert humanitarian workers of danger in and around Syria.

Interested in SMS applications and Django? Don’t worry. If you missed the original workshop, we have good news: we recorded it. You can participate by watching the video above!

Reviews of two recent Django Books

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jul 23, 2015.

Introduction

When I started building sites in Django, I learned the basics from the excellent Django tutorial. But I had to learn by trial and error which approaches to using Django's building blocks worked well and which approaches tended to cause problems later on. I looked for more intermediate-level documentation, but beyond James Bennett's Practical Django Projects and our Karen Tracey's Django 1.1 Testing and Debugging, there wasn't much to be found.

Over the years, ever more interesting introductory material has been showing up, including recently our lead technical manager Mark Lavin's Lightweight Django.

But more experienced developers are also in luck. Some of the community's most experienced Django developers have recently taken the time to put down their experience in books so that we can all benefit. I've been reading two of those books and I can highly recommend both.

Two Scoops of Django

I guess Two Scoops of Django, by Daniel Roy Greenfeld and Audrey Roy Greenfeld, isn't that recent; its third edition just came out. It's been updated twice to keep up with changes in Django (the latest edition covers Django 1.8), improve the existing material, and add more information.

The subtitle of the most recent edition is Best Practices for Django 1.8, and that's what the book is. The authors go through most facets of Django development and share what has worked best for them and what to watch out for, so you don't have to learn it all the hard way.

For example, reading the Django documentation, you can learn what each setting does. Then you can read in chapter 5 of Two Scoops a battle-tested scheme for managing different settings and files across multiple environments, from local development to testing servers and production, while protecting your secrets (passwords, keys, etc).

Similarly, chapter 19 of Two Scoops covers what cases you should and shouldn't use the Django admin for, warns about using list_editable in multi-user environments, and gives some tips for securing the admin and customizing it.

Those are just two examples. The most recent edition of the book has 35 chapters, each covering a useful topic. It's over 500 pages.

Another great thing about the book is that the chapters stand alone - you can pick it up and read whatever chapter you need right now.

I'll be keeping this book handy when I'm working on Django projects.

High Performance Django

High Performance Django by Peter Baumgartner and Yann Malet is aimed at the same audience as Two Scoops, but is tightly focused on performance. It moves on from building a robust Django app to how to deploy and scale it. It covers load balancers, proxies, caching, and monitoring.

One of its best features is that it gives war stories of deploys gone wrong and how the problems were attacked and solved.

Like Two Scoops, this book talks about general principles, along with specific approaches and tools that the authors are familiar with and have had success with. It does a good job of showing the overall architecture that most high-performance sites use, from the load balancer out front to the database at the back, and listing some popular choices of tools at each tier. Then they go into more detail about the specific tools they favor.

It also delves into how to spot performance bottlenecks in your Django site’s code, where they’re most likely to be, and good ways to deal with them.

This is the book I'll be coming back to when I have a question about performance.

Summary

There are two things I want to say about books like these.

First, they are immensely useful to those of us who work with Django every day. There's a huge amount of experience captured here for our benefit.

Second, I cannot imagine the amount of time and work it takes to create books like these. When I flip through Two Scoops, not only is it full of useful information, but almost every page also has examples or diagrams that had to be prepared.

Caktus has added both these books to the office library, and I've bought personal copies too (including all three editions of Two Scoops). I hope you'll try either or both, and if you find them useful, spread the word.

Django and Python 3 How to Setup pyenv for Multiple Pythons

By GoDjango - Django Screencasts from Django community aggregator: Community blog posts. Published on Jul 23, 2015.

We need to be doing Django development in Python 3. Unfortunately, we have a lot of projects still on Python 2.7, so switching between the two versions can be frustrating. Fortunately, pyenv takes the guesswork out of switching and makes it super simple.
Watch Now...

A couple quick tips

By James Bennett from Django community aggregator: Community blog posts. Published on Jul 22, 2015.

As noted yesterday, I’ve spent the last little while working to freshen up various bits of open-source code I maintain, in order to make sure all of them have at least one recent release. Along the way I’ve picked up a few little tips and tricks; some of them may be old hat to you if you’ve been lucky enough to be working with fairly modern Django and Python versions for a while, but I ...

Read full entry

How to automatically create images for Responsive design

By Cloudinary Blog - Django from Django community aggregator: Community blog posts. Published on Jul 21, 2015.

Responsive Images

Responsive web design is a method of designing websites to provide an optimal viewing experience to users, irrespective of the device, window size, orientation, or resolution used to view the website. A site designed responsively adapts its layout to the viewing environment, resizing and moving elements dynamically based on the properties of the browser or device the site is being displayed on.

Responsive design uses CSS, driven by media queries and the browser window size, for dynamic content changes and for controlling the text font size, the layout grid, and the various image dimensions. Most of the dynamic changes can be accomplished this way (or with frameworks like Bootstrap), but not so when it comes to images.

The simple solution is to always deliver the image in the highest resolution and then scale it down with CSS for lower-resolution devices or smaller browser windows. However, high-resolution images use more bytes and take more time to deliver, so this solution can needlessly waste bandwidth and loading time, increasing costs and harming the user experience.

To address the problem, a more complex framework could be designed to load a different image for each responsive mode, but considering the number of images needed and the time it takes to create all the different resolutions and dimensions, such a solution quickly becomes complex and hard to maintain.

Cloudinary can help reduce the complexity with dynamic image manipulation: you can simply build image URLs with any image width or height, based on the specific device resolution and window size. This means you don't have to pre-create the images; dynamic resizing takes place on the fly as needed.
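
For example, here is a minimal sketch of my own (not from the original post) using the Cloudinary Python SDK to request a 300-pixel-wide version of an already-uploaded image; the "demo" cloud name, the "smiling_man.jpg" public ID, and the 300px width are just illustrative values:

import cloudinary

# Hypothetical account configuration ("demo" is Cloudinary's public demo cloud).
cloudinary.config(cloud_name="demo")

# The generated URL asks Cloudinary to scale the uploaded image to 300 pixels
# wide on the fly; no pre-generated version of the image is needed.
print(cloudinary.CloudinaryImage("smiling_man.jpg").image(width=300, crop="scale"))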

Responsive images solution

The solution for simply and dynamically integrating images within responsive design layouts can be implemented with a method added to Cloudinary's Javascript library a few months ago. The Cloudinary Javascript library, which is based on jQuery, automatically builds image URLs to match the size available for each image in the responsive layout, and works as follows:

  • A Cloudinary dynamic manipulation URL is automatically built on the fly to deliver an uploaded image that is scaled to the exact available width.

  • If the browser window is subsequently enlarged, new higher-resolution images are automatically delivered, using stop-points (every 10px by default) to prevent loading too many images (see the sketch below).

  • If the browser window is scaled down, browser-side scaling is used instead of delivering a new image.

This feature allows you to provide one high resolution image, and have it automatically adapted to the resolution and size appropriate to each user’s device or browser on the fly. This ensures a great user experience by delivering the best possible resolution image, based on the device's resolution and the width available, without needlessly wasting bandwidth or loading time.
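
As a rough illustration of the stop-point idea, here is a sketch of my own (not the library's actual code), assuming the default 10px step and rounding the requested width up so the delivered image is never smaller than its slot:

# Snap a detected container width up to the nearest 10px stop-point.
def stop_point(available_width, step=10):
    return -(-available_width // step) * step  # ceiling division

print(stop_point(347))  # 350 -- widths 341 through 350 all reuse the same image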

Implementing responsive design with Cloudinary's Javascript library

Implementing responsive design in code using the Cloudinary jQuery plugin is a very simple process.

Step 1:

Include the jQuery plugin in your HTML pages (see the jQuery plugin getting started guide for more information).

Step 2:

For each image to display responsively:

  1. Set the data-src attribute of the img tag to the URL of an image that was uploaded to Cloudinary. The src attribute is not set and the actual image is updated dynamically (you can set the src attribute to a placeholder image that is displayed until the image is loaded).

  2. Set the width parameter to auto (w_auto in URLs). This allows the jQuery plugin to dynamically generate an image URL scaled to the correct width value, based on the detected width actually available for the image in the containing element.

  3. Add the cld-responsive class to the image tag. This is the default class name, but you can use custom class names and programmatically make HTML elements become responsive.

For example:

<img data-src="http://res.cloudinary.com/demo/image/upload/w_auto/smiling_man.jpg" class="cld-responsive">

Step 3:

Add Cloudinary's responsive Javascript method call at the end of the HTML page.

<script type="text/javascript">$.cloudinary.responsive()</script>

The responsive method looks for all images in the page that have the "cld-responsive" class name, detects the available width for the image on the page, and then updates the HTML image tags accordingly. The image is also updated whenever the window size or screen resolution changes.

Note that the three-step process presented above covers the simplest and most general solution. The behaviour can be further customized to control whether to update images on resize, when to update the image using stop-points, whether to preserve the CSS image height, and more. See the Cloudinary Javascript library documentation for more details.

That's it! Check out the following demo images created using Cloudinary (for the images) and Bootstrap (for the layout). The images also include a text overlay that is updated on the fly to display the actual width of the image and the Device Pixel Ratio setting (see further on in this blog post for more details on DPR).

[Interactive demo on the original post: resize the browser window to see how the layout and images dynamically respond. Demo layouts shown: 4 columns, 4-3-2 grid, 3-2-1 grid.]

As can be seen in the demo images above, the URL of an image can be further manipulated on the fly like any other image uploaded to Cloudinary.

Implementing responsive design with the Cloudinary SDKs

To make things even easier, responsive design can be implemented with the Cloudinary SDK's view helper methods (e.g. cl_image_tag in Ruby on Rails). Setting the width parameter to auto creates an HTML image tag with a blank src attribute while the data-src attribute points to a dynamic image manipulation URL. When you load Cloudinary's jQuery plugin and call the responsive method, the image tags are automatically updated and URLs are replaced with the correct width value. You can also set a placeholder image using the responsive_placeholder parameter, or set it to an inline blank image by setting the parameter to blank.

For example, creating an HTML image tag for the "smiling_man.jpg" image with the width automatically determined on the fly as needed, and using a blank image placeholder:

Ruby:
cl_image_tag("smiling_man.jpg", :width => :auto, 
  :responsive_placeholder => "blank")
PHP:
cl_image_tag("smiling_man.jpg",  array("width" => "auto", 
  "responsive_placeholder" => "blank"));
Python:
cloudinary.CloudinaryImage("smiling_man.jpg").image(width = "auto",
  responsive_placeholder = "blank")
Node.js:
cloudinary.image("smiling_man.jpg",  { width: "auto", 
  responsive_placeholder: "blank" })
Java:
cloudinary.url().transformation(new Transformation().width("auto").
  responsive_placeholder("blank")).imageTag("smiling_man.jpg");

The code above generates the following HTML image tag:

<img class="cld-responsive" 
data-src="http://res.cloudinary.com/demo/image/upload/w_auto/smiling_man.jpg"
src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"  />

Responsive design with support for retina and HiDPI devices

You can also simultaneously create the correct DPR image for devices that support higher resolutions by simply adding the dpr parameter set to auto to the URL or SDK method. The Javascript code will check the DPR of the device as well as the space available for the image. Delivery and manipulation URLs are then built automatically (and lazily) for all image tags to match the specific settings, with all image generation happening in the cloud. Users of devices with high pixel density will get a great visual result, while low-DPR users don't have to wait needlessly for larger images to load (see this blog post for more details).

For example, creating an HTML image tag for the "woman.jpg" image with the width and DPR automatically determined on the fly as needed, and using a blank image placeholder:

Ruby:
cl_image_tag("woman.jpg", :width => :auto,  :dpr => :auto,
  :responsive_placeholder => "blank")
PHP:
cl_image_tag("woman.jpg",  array("width" => "auto",  "dpr" => "auto",
  "responsive_placeholder" => "blank"));
Python:
cloudinary.CloudinaryImage("woman.jpg").image(width = "auto", dpr = "auto", 
  responsive_placeholder = "blank")
Node.js:
cloudinary.image("woman.jpg",  { width: "auto", dpr: "auto", 
  responsive_placeholder: "blank" })
Java:
cloudinary.url().transformation(new Transformation().width("auto").dpr("auto"). 
  responsive_placeholder("blank")).imageTag("woman.jpg");

The code above generates the following HTML image tag:

<img class="cld-responsive" 
data-src=
"http://res.cloudinary.com/demo/image/upload/w_auto,dpr_auto/woman.jpg"
src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"  
/>

Summary

In the modern world, applications have to look good both on the web and on a variety of mobile devices, and therefore need to become more responsive: they must support the large number of available devices and adjust to the varying amount of space available for displaying content. Responsive frameworks such as Bootstrap can help with the layout and text, but they have no support for images beyond client-side resizing.

Cloudinary allows you to manage your images in a very simple way: just upload your hi-res images, use any web framework to add your image tags with automatic width and automatic DPR, and add one line of Javascript for all your images to become responsive. You can improve your users' experience further with plenty more of Cloudinary's image optimization and manipulation capabilities, all done in the cloud, without needing to install image processing software or pre-generate all the image versions and resolutions, while reducing needless page loading times and saving bandwidth.

Responsive support is available in all Cloudinary plans, including the free plan. If you don't have a Cloudinary account yet, you are welcome to sign up for a free account and try it out.

Testing Django Views Without Using the Test Client

By Ian's Blog from Django community aggregator: Community blog posts. Published on Jul 21, 2015.

Testing Django Views Without Using the Test Client

Is Open Source Consulting Dead?

By chrism. Published on Sep 10, 2013.

Has Elvis left the building? Will we be able to sustain ourselves as open source consultants?

Consulting and Patent Indemnification

By chrism. Published on Aug 09, 2013.

Article about consulting and patent indemnification

Python Advent Calendar 2012 Topic

By chrism. Published on Dec 24, 2012.

An entry for the 2012 Japanese advent calendar at http://connpass.com/event/1439/

Why I Like ZODB

By chrism. Published on May 15, 2012.

Why I like ZODB better than other persistence systems for writing real-world web applications.

A str.__iter__ Gotcha in Cross-Compatible Py2/Py3 Code

By chrism. Published on Mar 03, 2012.

A bug caused by a minor incompatibility can remain latent for long periods of time in a cross-compatible Python 2 / Python 3 codebase.

In Praise of Complaining

By chrism. Published on Jan 01, 2012.

In praise of complaining, even when the complaints are absurd.

2012 Python Meme

By chrism. Published on Dec 24, 2011.

My "Python meme" replies.

In Defense of Zope Libraries

By chrism. Published on Dec 19, 2011.

A much too long defense of Pyramid's use of Zope libraries.

Plone Conference 2011 Pyramid Sprint

By chrism. Published on Nov 10, 2011.

An update about the happenings at the recent 2011 Plone Conference Pyramid sprint.

Jobs-Ification of Software Development

By chrism. Published on Oct 17, 2011.

Try not to Jobs-ify the task of software development.

WebOb Now on Python 3

By chrism. Published on Oct 15, 2011.

Report about porting to Python 3.

Open Source Project Maintainer Sarcastic Response Cheat Sheet

By chrism. Published on Jun 12, 2011.

Need a sarcastic response to a support interaction as an open source project maintainer? Look no further!

Pylons Miniconference #0 Wrapup

By chrism. Published on May 04, 2011.

Last week, I visited the lovely Bay Area to attend the 0th Pylons Miniconference in San Francisco.

Pylons Project Meetup / Minicon

By chrism. Published on Apr 14, 2011.

In the SF Bay Area on the 28th, 29th, and 30th of this month (April), 3 separate Pylons Project events.

PyCon 2011 Report

By chrism. Published on Mar 19, 2011.

My personal PyCon 2011 Report