Open Source Posts

Hans-Juergen Schoenig: Beating Uber with a PostgreSQL prototype

From Planet PostgreSQL. Published on May 06, 2016.

The other day I got a link to an interesting post published by Uber, which has caught our attention here at Cybertec: https://eng.uber.com/go-geofence. The idea behind geo-fencing is to provide information about an area to users. Somebody might want to find a taxi near a certain location or somebody might simply want to order a […]

The post Beating Uber with a PostgreSQL prototype appeared first on Cybertec - The PostgreSQL Database Company.

What We’re Clicking - April Link Roundup

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on May 05, 2016.

It's time for this month’s roundup of articles and posts shared by Cakti that drew the most attention on Twitter. The list highlights new work in civic tech and international development as well as reasons for the increasing popularity of Python and open source development.

Python is an Equal Opportunity Programming Language

An interview with David Stewart, manager in the Intel Data Center Software Technology group, about the unique accessibility of the Python programming language as well as the inclusivity of its community.

Why Every Developer is an Open Source Developer Now

A short article on why the future of IT lies in open source collaboration.

A Debate Where the Voters Pick the Questions

The Atlantic’s profile of the Florida Open Debate platform. Caktus Group helped build the tool on behalf of the Open Debate Coalition. The platform powered the first-ever crowd sourced open Senate debate.

Making it Easy to Bring Cellphone Apps to Africa

A wonderful Fastco profile on Africa’s Talking, a startup devoted to making it easier for developers to disseminate SMS-based apps to cell phone users in Africa.

US PostgreSQL Association: Building a better community part one

From Planet PostgreSQL. Published on May 05, 2016.

As one of the founders of United States PostgreSQL, I have always had a vision of where the corporation should go. Our corporation has continued to move forward and we have achieved some great things. Although it has taken longer than I expected and we are still behind from some of my goals, PostgreSQL has become one of the dominant Open Source databases in the market. By sticking to a tried, true and tested model our community has built strongholds in every sector of business from small mom-and-pops to Wall Street to the Military.

read more

Hubert 'depesz' Lubaczewski: Picking task from queue – revisit

From Planet PostgreSQL. Published on May 04, 2016.

Some time ago, I wrote a blogpost about how to pick a task from a queue without locking. It was written in 2013, and as such it couldn't reflect everything we have now in PostgreSQL – namely SKIP LOCKED – which was added to PostgreSQL over a year later. Two people mentioned SKIP LOCKED in comments, but it […]
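
For reference, a minimal sketch of the queue-picking pattern the post discusses, using psycopg2 and the SKIP LOCKED clause; the tasks table, its columns, and the connection string are purely illustrative, not depesz's actual schema:

import psycopg2

conn = psycopg2.connect("dbname=queue_demo")
with conn:  # the block commits on success, rolls back on error
    with conn.cursor() as cur:
        # Grab one unprocessed task; rows locked by other workers are skipped
        # instead of blocking (SKIP LOCKED requires PostgreSQL 9.5+).
        cur.execute("""
            SELECT id, payload
              FROM tasks
             WHERE processed_at IS NULL
             ORDER BY id
             LIMIT 1
               FOR UPDATE SKIP LOCKED
        """)
        row = cur.fetchone()
        if row:
            task_id, payload = row
            # ... process the task here ...
            cur.execute(
                "UPDATE tasks SET processed_at = now() WHERE id = %s",
                (task_id,)
            )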

How to track Google Analytics pageviews on non-web requests (with Python)

By Peter Bengtsson from Django community aggregator: Community blog posts. Published on May 03, 2016.

tl;dr; Use raven's ThreadedRequestsHTTPTransport transport class to send pageview trackings to Google Analytics asynchronously, so you can collect pageviews that aren't actually browser pages.

We have an API on our Django site that was not designed from the ground up. We had a bunch of internal endpoints that were used by the website. So we simply exposed those as API endpoints that anybody can query. All we did was wrap certain parts carefully as to not expose private stuff and we wrote a simple web page where you can see a list of all the endpoints and what parameters are needed. Later we added auth-by-token.

Now the problem we have is that we don't know which endpoints people use and, equally important, which ones people don't use. If we had more stats we'd be able to confidently deprecate some (for easier maintenance) and optimize some (to avoid resource overuse).

Our first attempt was to use statsd to collect metrics and display those with graphite. But it just didn't work out. There are just too many different "keys". Basically, each endpoint (aka URL, aka URI) is a key. And if you include the query string parameters, the number of keys just gets nuts. Statsd and graphite are better when you have about as many keys as you have fingers on one hand. For example, HTTP error codes: 200, 302, 400, 404 and 500.

Also, we already use Google Analytics to track pageviews on our website, which is basically a measure of how many people render web pages that have HTML and JavaScript. The Google Analytics UI is great and powerful. I'm sure other competing tools like Mixpanel, Piwik, Gauges, etc. are great too, but Google Analytics is reliable, likely to stick around and something many people are familiar with.

So how do you simulate pageviews when you don't have JavaScript rendering? The answer: plain HTTP POST (HTTPS, of course). And how do you prevent blocking on sending analytics without making your users wait? By doing it asynchronously, either with threading or a background message queue.

Threading or a message queue

If you have a message queue configured and are confident in its operation, you should probably use that. But it adds a certain element of complexity. It makes your stack more complex because now you need to maintain consumers and the central message queue itself. What if you don't have a message queue all set up? Use Python threading.

To do the threading, which is hard, it's always a good idea to try to stand on the shoulders of giants. Or, if you can't find a giant, find something that is mature and proven to work well over time. We found that in Raven.

Raven is the Python library, or "agent", used for Sentry, the open source error tracking software. As you can tell by the name, Raven tries to be quite agnostic of Sentry the server component. Inside it, it has a couple of good libraries for making threaded jobs whose task is to make web requests. In particular, the awesome ThreadedRequestsHTTPTransport. Using it basically looks like this:

import urlparse
from raven.transport.threaded_requests import ThreadedRequestsHTTPTransport

transporter = ThreadedRequestsHTTPTransport(
    urlparse.urlparse('https://ssl.google-analytics.com/collect'),
    timeout=5
)

params = {
    ...more about this later...
}

def success_cb():
    print "Yay!"

def failure_cb(exception):
    print "Boo :("

transporter.async_send(
    params,
    headers,
    success_cb,
    failure_cb
)

The call isn't very different from regular plain old requests.post.

About the parameters

This is probably the most exciting part and the place where you need some thought. It's non-trivial because you might need to put some careful thought into what you want to track.

Your friend is this documentation page.

There's also the Hit Builder tool where you can check that the values you are going to send make sense.

Some of the basic ones are easy:

"Protocol Version"

Just set to v=1

"Tracking ID"

That code thing you see in the regular chunk of JavaScript you put in the head, e.g. tid=UA-1234-Z

"Data Source"

Optional word you call this type of traffic. We went with ds=api because we use it to measure the web API.

The user ones are a bit more tricky. Basically because you don't want to accidentally leak potentially sensitive information. We decided to keep this highly anonymized.

"Client ID"

A random UUID (version 4) that identifies the user or the app. Not to be confused with "User ID", which is basically a string that identifies the user's session storage ID or something. Since in our case we don't have a user (unless they use an API token) we set this to a new random UUID each time, e.g. cid=uuid.uuid4().hex. This field is not optional.

"User ID"

Some string that identifies the user but doesn't reveal anything about the user. For example, we use the PostgreSQL primary key ID of the user as a string. It just means we can know if the same user makes several API requests, but we can never know who that user is. Google Analytics uses it to "lump" requests together. This field is optional.

Next we need to pass information about the hit and the "content". This is important. Especially the "Hit type" because this is where you make your manually server-side tracking act as if the user had clicked around on the website with a browser.

"Hit type"

Set this to t=pageview and it'll show up in Google Analytics as if the user had just navigated to the URL in her browser. It's kinda weird to do this because clearly the user hasn't. Most likely she's used curl or something from the command line. So it's not really a pageview but, on our end, we have "views" in the webserver that produce information to the user. Some of it is HTML and some of it is JSON, in terms of output format, but either way they're sending us a URL and we respond with data.

"Document location URL"

The full absolute URL that was used, e.g. https://www.example.com/page?foo=bar. So in our Django app we set this to dl=request.build_absolute_uri(). If you have a site where you might have multiple domains in use but want to collect them all under just one specific domain, you need to set dh=example.com.

"Document Host Name" and "Document Path"

I actually don't know what the point of this is if you've already set the "Document location URL".

"Document Title"

In Google Analytics you can view your Content Drilldown by title instead of by URL path. In our case we set this to a string we know from the internal Python class that is used to make the API endpoint: dt='API (%s)' % api_model.__class__.__name__.

There are many more things you can set, such as the client's IP, the user agent, timings, exceptions. We chose NOT to include the user's IP. If people using the JavaScript version of Google Analytics can set their browser to NOT include the IP, we should respect that. Also, it's rarely interesting to see where the requests for a web API come from, because it's often a server's curl or requests call that makes the query, not a human.

Sample implementation

Going back to the code example mentioned above, let's demonstrate a fuller example:

import logging
import urlparse
import uuid

from django.conf import settings
from django.contrib.sites.requests import RequestSite  # django.contrib.sites.models in older Django
from raven.transport.threaded_requests import ThreadedRequestsHTTPTransport

logger = logging.getLogger(__name__)

transporter = ThreadedRequestsHTTPTransport(
    urlparse.urlparse('https://ssl.google-analytics.com/collect'),
    timeout=5
)

# Remember, this is Django, but you get the idea

domain = settings.GOOGLE_ANALYTICS_DOMAIN
if not domain or domain == 'auto':
    domain = RequestSite(request).domain

params = {
    'v': 1,
    'tid': settings.GOOGLE_ANALYTICS_ID,
    'dh': domain,
    't': 'pageview',
    'ds': 'api',
    'cid': uuid.uuid4().hex,
    'dp': request.path,
    'dl': request.build_absolute_uri(),
    'dt': 'API ({})'.format(model_class.__class__.__name__),
    'ua': request.META.get('HTTP_USER_AGENT'),
}

headers = {}  # the Measurement Protocol needs no special headers

def success_cb():
    logger.info('Successfully informed Google Analytics (%s)', params)

def failure_cb(exception):
    logger.exception(exception)

transporter.async_send(
    params,
    headers,
    success_cb,
    failure_cb
)

How to unit test this

The class we're using, ThreadedRequestsHTTPTransport, has, as you might have seen, a method called async_send. There's also one, with the exact same signature, called sync_send, which does the same thing but in a blocking fashion. So you could make your code look a bit silly, something like this:

def send_tracking(page_title, request, async=True):
    # ...same as example above but wrapped in a function...
    function = async and transporter.async_send or transporter.sync_send
    function(
        params,
        headers,
        success_cb,
        failure_cb
    )

And then in your tests you pass in async=False instead.
But don't do that. The code shouldn't be subservient to the tests (unless it's for the sake of splitting up monster-long functions).
Instead, I recommend you mock the inner workings of that ThreadedRequestsHTTPTransport class so you can make the whole operation synchronous. For example...

import mock
from django.test import TestCase
from django.test.client import RequestFactory

from where.you.have import pageview_tracking


class TestTracking(TestCase):

    @mock.patch('raven.transport.threaded_requests.AsyncWorker')
    @mock.patch('requests.post')
    def test_pageview_tracking(self, rpost, aw):

        def mocked_queue(function, data, headers, success_cb, failure_cb):
            function(data, headers, success_cb, failure_cb)

        aw().queue.side_effect = mocked_queue

        request = RequestFactory().get('/some/page')
        with self.settings(GOOGLE_ANALYTICS_ID='XYZ-123'):
            pageview_tracking('Test page', request)

            # Now we can assert that 'requests.post' was called.
            # Left as an exercise to the reader :)
            print rpost.mock_calls       

This is synchronous now and works great. It's not finished. You might want to write a side effect for the requests.post so you can have better control of that post. That'll also give you a chance to potentially NOT return a 200 OK and make sure that your failure_cb callback function gets called.
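
For example, here is a hedged sketch of such a failure test, written as another method on the TestCase above. It assumes the transport surfaces an exception raised by requests.post through the failure_cb path, and it needs an extra import requests at the top of the test module:

    @mock.patch('raven.transport.threaded_requests.AsyncWorker')
    @mock.patch('requests.post')
    def test_pageview_tracking_failure(self, rpost, aw):

        def mocked_queue(function, data, headers, success_cb, failure_cb):
            function(data, headers, success_cb, failure_cb)

        aw().queue.side_effect = mocked_queue
        # Pretend the collect endpoint is unreachable so the failure path runs
        rpost.side_effect = requests.exceptions.ConnectionError('GA is down')

        request = RequestFactory().get('/some/page')
        with self.settings(GOOGLE_ANALYTICS_ID='XYZ-123'):
            pageview_tracking('Test page', request)
            # Assert on your logger / failure_cb handling here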

How to manually test this

One thing I was very curious about when I started was to see how it worked if you really ran this for reals but without polluting your real Google Analytics account. For that I built a second little web server on the side, whose address I used instead of https://ssl.google-analytics.com/collect. So, change your code so that https://ssl.google-analytics.com/collect is not hardcoded but a variable you can change locally. Change it to http://localhost:5000/ and start this little Flask server:

import time
import random
from flask import Flask, abort, request

app = Flask(__name__)
app.debug = True

@app.route("/", methods=['GET', 'POST'])
def hello():
    print "- " * 40
    print request.method, request.path
    print "ARGS:", request.args
    print "FORM:", request.form
    print "DATA:", repr(request.data)
    if request.args.get('sleep'):
        sec = int(request.args['sleep'])
        print "** Sleeping for", sec, "seconds"
        time.sleep(sec)
        print "** Done sleeping."
    if random.randint(1, 5) == 1:
        abort(500)
    elif random.randint(1, 5) == 1:
        # really get it stuck now
        time.sleep(20)
    return "OK"

if __name__ == "__main__":
    app.run()

Now you get an insight into what gets posted and you can pretend that it's slow to respond. Also, you can get an insight into how your app behaves when this collection destination throws a 5xx error.

How to really test it

Google Analytics is tricky to test in that it collects all the stuff it collects, then takes its time to process it, and it then shows up the next day as stats. But there's a hack! You can go into your Google Analytics account and click "Real-Time" -> "Overview" and you should see hits coming in as you're testing this. Obviously you don't want to do this on your real production account, but perhaps you have a stage/dev instance you can use. Or, just be patient :)

Florida Open Debate Platform Receives National Attention (The Atlantic, USA Today, Engadget)

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on May 03, 2016.

Several national publications have featured the Florida Open Debate platform, including USA Today, Engadget, and The Atlantic. Caktus helped develop the Django-based platform on behalf of the Open Debate Coalition (ODC) in advance of the nation’s first-ever open Senate debate held in Florida on April 25th. The site enabled citizens to submit debate questions as well as vote on which questions mattered most to them. Moderators then used the thirty most popular questions from the site to structure the debate between Florida Senate candidates David Jolly (R) and Alan Grayson (D). According to The Atlantic, more than 400,000 votes were submitted by users on the site, including more than 84,000 from Florida voters.

Florida Open Debate user-submitted questions

“Normally, the press frames important US election debates by choosing the questions and controlling the video broadcast,” wrote Steve Dent. “For the first time, however, the public... decide[d] the agenda.”

In his article for The Atlantic, Russell Berman also applauded the site’s effort “to make bottom-up, user-generated questions the centerpiece of a debate.” But possibly more significant were the results of this crowd-sourced content. “What transpired was, by all accounts, a decent debate,” Berman writes. “For 75 minutes, Grayson and Jolly addressed several weighty policy disputes—money in politics, Wall Street reform, the minimum wage, climate change, the solvency of Social Security—and often in detail.”

The Florida debate was streamed live on Monday to more than 80,000 viewers. The Open Debate platform is receiving attention and interest from various potential debate sponsors as well as the Commission on Presidential Debates for possible use in this fall’s presidential elections.

Craig Ringer: Visual PostgreSQL debugging

From Planet PostgreSQL. Published on May 02, 2016.

Having recently been doing some debugging work where many watchpoints, conditional breakpoints etc were necessary I’d like to shout out to a really useful tool: The standalone CDT debugger.

It’s part of the Eclipse project, but before you run screaming – it doesn’t require project setup or anything and it serves as a good GUI gdb wrapper. It’s good for working with PostgreSQL because you don’t need to set up a fake IDE project or any such time-wasting business. You just launch the debugger. You’ve still got your .gdbinit with postgresql debug helper macros and everything.

The CDT debugger still gives you access to the gdb console and all the features you’re used to, it just has a good UI for showing the values of locals, watched variables, stack browsing, setup of conditional breakpoints, etc.

Just:

dnf install eclipse-cdt

or

aptitude install eclipse-cdt

then get the backend pid of interest from ps or SELECT pg_backend_pid() and

cdtdebug -a $PID

It supports core files, debugging a program under direct control, and attaching.

The debugger does have a small startup time, unlike gdb, so it’s worth keeping open and detaching/re-attaching as needed when convenient. But it’s well worth it:

eclipse-cdt-standalone-debugger

I’m looking into how to incorporate Tomas’s gdbpg or even better, extension support for displaying a List*‘s contents in the variable browse tree. Even without that it’s pretty handy.

ES6 For Django Lovers

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on May 02, 2016.

ES6 for Django Lovers!

The Django community is not one to fall to bitrot. Django supports every new release of Python at an impressive pace. Active Django websites are commonly updated to new releases quickly and we take pride in providing stable, predictable upgrade paths.

We should be as adamant about keeping up that pace with our frontends as we are with all the support Django and Python put into the backend. I think I can make the case that ES6 is part of that natural forward pace for us, and I can help you get started upgrading the frontend half of your projects today.

The Case for ES6

As a Django developer and likely someone who prefers command lines, databases, and backends you might not be convinced that ES6 and other Javascript language changes matter much.

If you enjoy the concise expressiveness of Python, then ES6's improvements over Javascript should matter a lot to you. If you appreciate the organization and structure Django's common layouts for projects and applications provide, then ES6's module and import system is something you'll want to take advantage of. If you benefit from the wide variety of third-party packages the Python Package Index makes available to you just a pip install away, then you should be reaching out to the rich ecosystem of packages NPM has available for frontend code, as well.

For all the reasons you love Python and Django, you should love ES6, too!

Well Structured Code for Your Whole Project

In any Python project, you take advantage of modules and packages to break up a larger body of code into sensible pieces. It makes your project easier to understand and maintain, both for yourself and other developers trying to find their way around a new codebase.

If you're like many Python web developers, the lack of structure between your clean, organized Python code and your messy, spaghetti Javascript code is something that bothers you. ES6 introduces a native module and import system, with a lot of similarities to Python's own modules.

import React from 'react';

import Dispatcher from './dispatcher.jsx';
import NoteStore from './store.jsx';
import Actions from './actions.jsx';
import {Note, NoteEntry} from './components.jsx';
import AutoComponent from './utils.jsx'

We don't benefit only from organizing our own code, of course. We derive an untold value from a huge and growing collection of third-party libraries available in Python and often specifically for Django. Django itself is distributed in concise releases through PyPI and available to your project thanks to the well-organized structure and the distribution service provided by PyPI.

Now you can take advantage of the same thing on the frontend. If you prefer to trust a stable package distribution for Django and other dependencies of your project, then it is a safe bet to guess that you are frustrated when you have to "install" a Javascript library by just unzipping it and committing the whole thing into your repository. Our Javascript code can feel unmanaged and fragile by comparison to the rest of our projects.

NPM has grown into the de facto home of Javascript libraries and grows at an incredible pace. Consider it a PyPI for your frontend code. With tools like Browserify and Webpack, you can wrap all the NPM installed dependencies for your project, along with your own organized tree of modules, into a single bundle to ship with your pages. These work in combination with ES6 modules to give you the scaffolding of modules and package management to organize your code better.

A Higher Baseline

This new pipeline allows us to take advantage of the language changes in ES6. It exposes the wealth of packages available through NPM. We hope it will raise the standard of quality within our front-end code.

This raised bar puts us in a better position to continue pushing our setup forward.

How Caktus Integrates ES6 With Django

Combining a Gulp-based pipeline for frontend assets with Django's runserver development web server turned out to be straightforward when we inverted the usual setup. Instead of teaching Django to trigger the asset pipeline, we embedded Django into our default gulp task.

Now, we set up livereload, which reloads the page when CSS or JS has been changed. We build our styles and scripts, transforming our Less and ES6 into CSS and Javascript. The task will launch Django's own runserver for you, passing along --port and --host parameters. The rebuild() task delegated to below will continue to monitor all our frontend source files for changes to automatically rebuild them when necessary.

// Starts our development workflow
gulp.task('default', function (cb) {
  livereload.listen();

  rebuild({
    development: true,
  });

  console.log("Starting Django runserver http://"+argv.address+":"+argv.port+"/");
  var args = ["manage.py", "runserver", argv.address+":"+argv.port];
  var runserver = spawn("python", args, {
    stdio: "inherit",
  });
  runserver.on('close', function(code) {
    if (code !== 0) {
      console.error('Django runserver exited with error code: ' + code);
    } else {
      console.log('Django runserver exited normally.');
    }
  });
});

Integration with Django's collectstatic for Deployments

Options like Django Compressor make integration with common Django deployment pipelines a breeze, but you may need to consider how to combine ES6 pipelines more carefully. By running our Gulp build task before collectstatic and including the resulting bundled assets — both Less and ES6 — in the collected assets, we can make our existing Gulp builds and Django work together very seamlessly.
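
As a concrete (if simplified) illustration, a deploy step might just run the two commands back to back. The gulp task name here ("build") is an assumption, so adjust it to whatever your gulpfile defines:

import subprocess

# Build the frontend bundles first so collectstatic picks them up
subprocess.check_call(["gulp", "build"])
subprocess.check_call(["python", "manage.py", "collectstatic", "--noinput"])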

References

Eyðun Nielsen: PostgreSQL Backup Script IV (follow up)

From Planet PostgreSQL. Published on May 02, 2016.

Came back around to this old stuff and realized that it didn't really work :-( Bash loops and pipes can be tricky :-)

So an update was long overdue:

#!/bin/bash
# PostgreSQL backup script
# Run as postgres user

BACKUP_HOST=replace-with-your-backup-host
BACKUP_HOST_USER=replace-with-your-backup-host-user
BACKUP_HOST_PATH=replace-with-your-backup-host-/path/to/backup/of/postgresql

# For all running PostgreSQL clusters
while read cluster;
do
    # Create cluster-path on backuphost
    ssh -n -q $BACKUP_HOST_USER@$BACKUP_HOST "mkdir -p $BACKUP_HOST_PATH/$cluster";
    # Global stuff from cluster: roles etc.
    pg_dumpall --globals-only | ssh -n -q $BACKUP_HOST_USER@$BACKUP_HOST "dd of=$BACKUP_HOST_PATH/$cluster/globals.sql" > /dev/null 2>&1;
    # And then each database (except templates)
    while read databasename;
    do
        pg_dump --cluster $cluster --format=c -- $databasename | ssh -n -q $BACKUP_HOST_USER@$BACKUP_HOST "dd of=$BACKUP_HOST_PATH/$cluster/$databasename.sqlc" > /dev/null 2>&1;
    done < <(psql --cluster $cluster --no-align --tuples-only --command="SELECT datname from pg_database WHERE NOT datistemplate");
done < <( awk -- '{ print $1"/"$2 }' <(pg_lsclusters --no-header | grep online) )

Petr Jelínek: PGLogical 1.1 released with sequence support and more

From Planet PostgreSQL. Published on May 02, 2016.

The new feature version of pglogical is now available. The new release brings support for sequence replication, manually configured parallel subscriptions, replica triggers, numerous usability improvements and bug fixes. Let’s look into the changes in more detail.

Sequences support

As the title of this post says, pglogical now supports sequence replication. While I am sure this is good news for everybody, for me the main importance of this feature is delivering a better experience with zero-downtime upgrades. With the first version of pglogical you'd have to manage sequences manually after the data were transferred, which complicated the whole procedure; now they just get replicated.

PostgreSQL does not provide any infrastructure for capturing sequence changes (not for lack of trying, it just turns out that it's hard to add support for it). So we use a similar trick to one used in a previous replication project I worked on. We periodically capture the current state of the sequence and send an update to the replication queue with some additional buffer. This means that in a normal situation the sequence values on the subscriber will be ahead of those on the provider. We try to dynamically size the buffer so that the value is as close as possible but also does not fall behind on frequently updated sequences. It is however advisable to force a sequence update using the pglogical.synchronize_sequence() function after big data loads or an upgrade, as in the sketch below.
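
A minimal sketch of that forced update from Python, assuming the function takes the sequence to synchronize as its argument; the connection string and sequence name are illustrative:

import psycopg2

conn = psycopg2.connect("dbname=provider_db")
with conn, conn.cursor() as cur:
    # Push the current sequence state to the replication queue right away
    cur.execute("SELECT pglogical.synchronize_sequence('public.orders_id_seq')")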

Other than the above, sequence replication works similarly to the way tables work. This includes replication sets, so there are functions for adding and removing sequences to and from replication sets, and subscribers can receive updates for only some sequences.

Parallel subscriptions

Another major change is support for multiple subscriptions between one pair of nodes. In pglogical 1.0 there was a strict limit of a single subscription between any pair of nodes. You could replicate from one node to multiple nodes or from multiple nodes to one node, but no parallel subscriptions between the same two nodes were allowed. This limitation is now gone, provided that the replication sets of the two subscriptions don't overlap. What it means is that users can manually parallelize the replication for improved performance.

Foreign keys and triggers handling on subscriber

PGLogical now runs both the initial data copy process and the ongoing replication process with session_replication_role set to replica. This, plus some other internal changes, has a couple of interesting effects:

  • Foreign keys on the subscriber are no longer checked for validity. The philosophy behind this change is that if the FK check passed on the provider, the subscriber should accept the data as well. This should solve the foreign key issues some people had when replicating just part of a database.
  • The ENABLE REPLICA and ENABLE ALWAYS triggers are now called on the subscriber. This means that you can do data post-processing on the subscriber now, albeit in somewhat limited fashion.

These might break compatibility on some existing installations.

(In)Compatibility

Speaking of compatibility with existing installations: there are also a few other compatibility-breaking changes in the new version of pglogical. The biggest one is that the create_subscription() function now defaults to not synchronizing the schema. If you wish to do the schema synchronization, you can enable it by setting the synchronize_schema parameter to true when calling create_subscription(), as in the sketch below. Also, the synchronize parameters in the table synchronization functions were renamed to synchronize_data for better clarity and consistency with the create_subscription() naming.
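
A hedged sketch of opting back into schema synchronization when creating a subscription; the subscription name and DSN are illustrative, and the exact parameter list should be checked against the pglogical documentation:

import psycopg2

conn = psycopg2.connect("dbname=subscriber_db")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT pglogical.create_subscription(
            subscription_name := 'sub_from_provider',
            provider_dsn := 'host=provider-host dbname=provider_db',
            synchronize_schema := true  -- no longer the default in 1.1
        )
    """)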

Usability improvements

We also tried to make pglogical easier to use and manage. For this reason we added functions to add and remove node interfaces (connection strings) and for switching a subscription to a different interface. The main use for these is when a server IP has changed, and it's also a first step towards supporting physical failover. We also modified the replicate_ddl_command() function to optionally accept a list of replication sets for which a given DDL change will be replicated, to ease filtering of structure changes in partially replicated databases. There are also smaller behavior changes that should improve general usability, like:

  • Better behavior on worker crashes. PGLogical will no longer spam the log with failures; it will give the worker a grace period of 5 seconds before retrying.
  • Improved logging. For example it’s easier to spot which workers have been started and stopped and when it happened.
  • Workers now set application_name on start for easier identification of pglogical processes in pg_stat_activity.

And finally, like any software release, there are some small bug fixes all around.

The download and installation instructions as well as documentation are available at our pglogical project page.

Rajeev Rastogi: PostgreSQL Transaction Internal Code Flow

From Planet PostgreSQL. Published on Apr 30, 2016.

This post is a continuation of my previous post. Here I explain how a main transaction works from a code-flow perspective. Sub-transactions, MVCC and other related details will be covered in subsequent posts. Please see my previous post Basic of Transaction to get a better understanding.
The whole transaction flow is tracked through various states, as explained in the sections below. Internally, each transaction is represented as a unique number assigned in increasing order (with an exception in case of overflow), as the quick illustration below shows.
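
A hedged psycopg2 sketch that makes those increasing numbers visible from the outside, using the built-in txid_current() function (connection details are illustrative):

import psycopg2

conn = psycopg2.connect("dbname=postgres")
conn.autocommit = True  # each statement runs in its own transaction
with conn.cursor() as cur:
    cur.execute("SELECT txid_current()")
    first = cur.fetchone()[0]
    cur.execute("SELECT txid_current()")
    second = cur.fetchone()[0]
    # second > first: every new transaction is assigned the next ID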

Command Execution:

Usually each command execution has 3 steps:
  • StartTransactionCommand
  • Do the operation specific to the command
  • CommitTransactionCommand
As mentioned in my earlier post, if a transaction is not started explicitly then it will be started internally for any command execution. So there are two ways of starting a transaction. Below I explain both cases and the corresponding state transitions involved.

Meaning of the nomenclature below:
                    X ====action====>Y
The current state is X; in this state it performs the operation "action" and transitions to state "Y". In case no action is mentioned, the current state performs no operation and moves directly to state "Y".
There are mainly 3 actions mentioned below; the purpose of each action is as follows:
  • StartTransaction: Assigns resources in terms of memory, initializes the relevant GUC variables, initializes the transaction properties (like read-only or read-write transaction), creates a new resource owner, etc.
  • CommitTransaction: Undoes all initialization done by StartTransaction. Executes any pending triggers and handles all on-commit actions, if any. Generates a COMMIT WAL record and inserts it into the WAL buffer, and updates the commit TS data. Then, depending on whether synchronous or asynchronous commit is configured, it waits for a response from the standby node or directly flushes the transaction to CLOG, respectively.
  • AbortTransaction: Almost the same as CommitTransaction, except it writes an ABORT WAL record and marks the transaction as ABORTED.

The state transitions mentioned in the subsequent sections are states from the client query perspective. In addition to these states, there are a few states from the server perspective as well; these are:
  • TRANS_DEFAULT: Idle, the default state.
  • TRANS_START: The state in which transaction initialization happens (StartTransaction).
  • TRANS_INPROGRESS: Means the transaction has been started.
  • TRANS_COMMIT: This state shows a transaction commit is in progress (CommitTransaction).
  • TRANS_ABORT: This state shows a transaction abort is in progress (AbortTransaction).

Case-1: Explicit Start of Transaction:


Execution of START TRANSACTION command:

  • StartTransactionCommand (TBLOCK_DEFAULT)====StartTransaction====>TBLOCK_STARTED
  • Processing this command:  BeginTransactionBlock(TBLOCK_STARTED) =====> TBLOCK_BEGIN
  • CommitTransactionCommand(TBLOCK_BEGIN) ====> TBLOCK_INPROGRESS


Execution of a normal command:

  • StartTransactionCommand (TBLOCK_INPROGRESS): Nothing to do.
  • Processing this command: Until this point no transaction ID has been assigned, as it was not yet known whether any command requiring a transaction ID would be executed. So now the function AssignTransactionId is called to assign a new transaction ID. The current transaction number is maintained in a shared variable, ShmemVariableCache->nextXid; the current value of this variable is taken as the transaction ID for the current transaction, and then ShmemVariableCache->nextXid is incremented (taking care of the overflow case).
 Also, transaction information needs to be made durable (it is one of the transaction properties), for which PostgreSQL maintains:
  1. A commit log for each transaction (called CLOG, stored in CLOG pages)
  2. Each transaction's commit timestamp (called CommitTs, stored in a separate page)
        If the current XID is going to be stored in a new page (either because it is the first transaction in the system or the existing page is full), then the whole content of the new page needs to be reset to zero. This should be done for all pages used for storing this information.

        Finally it should:
  1. Assign this transaction to MyPgXact. Each session maintains MyPgXact, which holds transaction information in memory; it is used by all other sessions for taking various decisions.
  2. Store the new transaction ID in each tuple being created (more on this in a coming post related to MVCC).
  3. Store the current command ID in each tuple being created.
  4. Then continue with normal command operation.

  • CommitTransactionCommand(TBLOCK_INPROGRESS): Does a command counter increment (CommandCounterIncrement), i.e. increments the command ID so that if multiple commands run in the same transaction, the next command can see the operations done by the previous command. No state transition.


Executing again a command:

  • StartTransactionCommand (TBLOCK_INPROGRESS): Nothing to do.
  • Processing this command: Since a transaction has already been assigned, there is nothing to do. Just continue with command execution.
  • CommitTransactionCommand(TBLOCK_INPROGRESS): Does a command counter increment (CommandCounterIncrement), i.e. increments the command ID so that if multiple commands run in the same transaction, the next command can see the operations done by the previous command. No state transition.


Executing COMMIT/END command:

  • StartTransactionCommand (TBLOCK_INPROGRESS): Nothing to do.
  • Processing this command: EndTransactionBlock(TBLOCK_INPROGRESS) ====> TBLOCK_END
  • CommitTransactionCommand(TBLOCK_END)====CommitTransaction====> TBLOCK_DEFAULT


Executing ROLLBACK command:

  • StartTransactionCommand (TBLOCK_INPROGRESS): Nothing to do.
  • Processing this command: UserAbortTransactionBlock(TBLOCK_INPROGRESS) ====> TBLOCK_ABORT_PENDING
  • CommitTransactionCommand(TBLOCK_ABORT_PENDING)====AbortTransaction====> TBLOCK_DEFAULT


Case-2: Implicit Start of Transaction:

A transaction is started automatically in case a transaction block start command was not executed. This transaction is valid only for the current command, and it gets committed or aborted automatically once the command executes successfully or fails, respectively.


Command Execution- Success:

  • StartTransactionCommand (TBLOCK_DEFAULT)====StartTransaction====>TBLOCK_STARTED
  • Do the actual operation
  • CommitTransactionCommand(TBLOCK_STARTED) ====CommitTransaction====> TBLOCK_DEFAULT


Command Execution - Fail:

  • StartTransactionCommand (TBLOCK_DEFAULT)====StartTransaction====>TBLOCK_STARTED
  • Do the actual operation
  • AbortCurrentTransaction(TBLOCK_STARTED) ====AbortTransaction====> TBLOCK_DEFAULT

In my subsequent post, I would cover some more details about transaction and their implementation details.

Any comment/query/suggestion welcome.

Rajeev Rastogi: Basic of Transaction

From Planet PostgreSQL. Published on Apr 30, 2016.

Have you ever thought about how the classic example below works:
           You swipe your credit card in a shop and money gets debited from your account, but the overall billing fails. In that case your money gets credited back to your account instantly.

If you can already relate this to one of the key properties of a database called a transaction, then you already have a head start on this topic. If not, no need to worry; I will take you through transactions from the very basics.

Background

Transactions are one of the most fundamental concepts for any database, including PostgreSQL. They enable multi-step actions to be grouped together in such a way that either all steps are accepted or all are rejected. If all of the steps succeed, then the whole group of actions is considered a success. If any of the steps fails in between, the effects of all previous successful steps are nullified.
Consider the example of a bank fund transfer, where there is a table consisting of customer names, their balances, etc. A customer 'A' wants to transfer 1000Rs to customer 'B'. In terms of SQL it will look like:

            UPDATE accounts SET balance = balance - 1000 where cust_name='A';
            UPDATE accounts SET balance = balance + 1000 where cust_name='B';

There are two steps involved in the overall fund transfer activity, and as per the above definition of a transaction, either both of them should succeed, i.e. once 1000Rs is debited from A's account it must be credited to B's account, or there should not even be a debit from A's account. There should also not be a case where the debit from A's account fails but B's account gets credited with 1000Rs, in which case the bank will not be happy.
The property just explained is called "Atomicity"; there are 3 additional properties of each transaction, explained below:

Properties

There are four mandatory properties of each transaction:

Atomicity: 

This property makes sure that all changes to data are performed as if they were a single operation. That is, either all the changes are performed, or none of them are. This was already explained above with the example.

Consistency: 

This property ensures that the database is in a consistent state when the transaction starts and when it ends, i.e. it brings the database from one valid state to another. In the above example, the balance data will remain in a consistent state when the transaction ends.

Isolation: 

This property ensures that the intermediate state of a transaction is invisible to other concurrently running transactions. As a result, transactions that run concurrently appear to be serialized. In the above example, if there is any concurrent transaction executing the SQL command below:

               SELECT sum(balance) FROM account;

Then it should always show the same total amount, i.e. it should not be able to see the intermediate stage where 1000Rs has been debited from A's account but not yet credited to B's account.

Durability: 

This property relates to what happens after a transaction has completed. It ensures that once a transaction completes, changes to data persist and are not undone, even in the event of a system failure. In the above example, once 1000Rs is debited from A's account, even after a system restart the changed balance will be shown for A's account.

Major Commands

Below are the major commands related to transaction:

1. START TRANSACTION / BEGIN: This command starts a new transaction. 
2. COMMIT: This command ends already running transaction.
3. ROLLBACK: This command is used to undo all changes done by the statements running in the current transaction. Depending on your business logic, you might need to execute this command.

In order to execute multiple statements in a single transaction, command 1 needs to be executed before any statement, and after all statements have been executed, command 2 or command 3 should be executed.
e.g..
START TRANSACTION
SQL-STMT-1
SQL-STMT-2
..
..
SQL-STMT-3
COMMIT/ROLLBACK.

Note: If an explicit START TRANSACTION/BEGIN command is not given before an SQL statement, then a BEGIN command is executed implicitly for that SQL statement, and once the statement finishes, a COMMIT command is executed implicitly.
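
As a hedged illustration, here is the fund-transfer example from above driven from Python with psycopg2; the accounts table and column names come from the earlier SQL, and the connection string is illustrative. Both UPDATEs run in one transaction, so a failure rolls both back:

import psycopg2

conn = psycopg2.connect("dbname=bank")
cur = conn.cursor()
try:
    # psycopg2 issues BEGIN implicitly before the first statement
    cur.execute("UPDATE accounts SET balance = balance - 1000 WHERE cust_name = 'A'")
    cur.execute("UPDATE accounts SET balance = balance + 1000 WHERE cust_name = 'B'")
    conn.commit()    # COMMIT: both changes become visible together
except Exception:
    conn.rollback()  # ROLLBACK: neither change is applied
    raise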

I hope by now you understand how the credit card swiping example works. Here the debiting of money from your card and the generation of the bill all happen in the same transaction. So even if the first step of debiting money succeeds but the final billing fails, the whole transaction will be rolled back and hence the debited money will be credited back.

Do visit my next post to see how transaction works in PostgreSQL from code flow perspective.

Any comment/query/suggestion welcome.



Shaun M. Thomas: PG Phriday: Derivation Deluge

From Planet PostgreSQL. Published on Apr 29, 2016.

Having run into a bit of a snag with Postgres-XL, and not wanting to be dead in the water with our project, I went on a bit of a knowledge quest. Database scaling is hard, so I expected a bunch of either abandoned or proprietary approaches. In addition, as huge fans of Postgres, compatibility or outright use of the Postgres core was a strict prerequisite.

So, what options are out there? Is there even anything worth further investigation? Maybe more importantly, what do you do when you’re under a bit of a scheduling constraint? Projects need to move forward after all, and regardless of preferences, sometimes concessions are necessary. The first step was obviously the list of databases derived from Postgres.

At first glance, that’s a pretty big list. If we look carefully though, we can see that quite a few of those projects were abandoned years ago. Others are commercial, not based on scalability, or both. Being commercial isn’t automatically a disqualification, but most of the commercial options were forked from practically ancient versions of Postgres and never kept up compatibility, or don’t mention the version at all. Amazon Redshift fits that profile, being based on Postgres 8.0, which few would want to use these days. Fujitsu Enterprise is another, which doesn’t even list which version they’re based on, nor do they provide a download for testing purposes.

What’s left? It’s hard to tell from the Wiki page, so I just started with the projects that include some kind of data scaling not based on replication. These candidates present a longer list than I’d initially anticipated, which is always a good problem to have!

Let’s scrutinize the nominees.

CitusDB

It’s not really a secret that CitusDB and Postgres-XL are both tackling the same problem, and are currently the top two contenders. Unlike Postgres-XL and its approach of extending SQL syntax to directly embrace data distribution, CitusDB is actually just a Postgres extension.

As a result, it’s a bit more janky. There’s no CREATE TABLE ... DISTRIBUTE BY magic. Instead, and like other Postgres extensions, we call functions to populate metadata and control CitusDB operation. Fortunately, the CitusDB documentation is amazingly thorough. Not only are there downloads with cut-and-paste tutorials for evaluation, they also provide Docker containers with fully operational test clusters.

I also have to admit that I originally wrote this off entirely, based on my experience with pg_shard last year. I opened a ticket complaining about missing transaction support, and they never closed it. I didn’t realize that was because pg_shard was merged into the new Citus extension, along with a veritable raft of other modifications and fixes.

My easy tutorial install didn’t exhibit the problems I had with pg_shard, so this warrants deeper testing. I’ve got VMs galore, and itchy scripting fingers.

Greenplum

I’ve known about Greenplum for a long time. There was a lot of excitement when Pivotal announced that they were opening the source. A parallel scaling Postgres? No way!

Well, this comes with a giant caveat. If we look at the dates listed in the Wiki, Greenplum is listed as starting in 2005. They’re not kidding, and unfortunately it seems Pivotal executed a “fork it and forget it” maneuver. The documentation admits Greenplum is based on Postgres 8.2, with elements of functionality from 8.3.

Like Amazon’s Redshift, this immediately disqualifies Greenplum from consideration for anyone using a newer version. Our own databases are on 9.4 pending an upgrade plan; there’s no way we could justify such a massive downgrade, even for horizontal scaling improvements. EnterpriseDB had a similar problem when they started selling their version of 8.3; they were far behind for years before they managed to reduce their version lag by only a few months. Greenplum never even bothered.

This may be an amazing product, but we can’t use it to replace existing Postgres 9.4 databases that need scaling. Will Greenplum catch up now that it’s been open-sourced? I can’t say. It would definitely be cool, but I’m not holding my breath. Incidentally, this is one of the reasons all of those projects on the Wiki have definitive end dates. Keeping up with Postgres after forking is extremely difficult if you don’t merge your enhancements back into core. It’s all too easy to fall hopelessly behind and become nothing but an academic concern.

HadoopDB

Hadoop is the… uh, “new” kid on the block regarding big data. It’s designed to leverage multiple systems or VMs to spread storage mining, which would be a great way to address a VLDB system. So in 2009, a university project spawned HadoopDB to turn Postgres into a SQL interface and aggregator for Hadoop.

Sadly, that’s where the story ends. The Postgres Wiki says it’s still active, but for all intents and purposes, it has been abandoned. The quick start guide hasn’t been updated since 2009, and Thomas Koch did a presentation as late as 2011 denoting it as an effective proof of concept, but not much else.

In the end, it’s a really cool “glue” between Hadoop and Postgres. But without updates to enhance the interlinks, speed, efficiency, and bugs, it’s not suitable for a production environment. The project lived on in Hadapt before being acquired by Teradata and renamed to presto. That means there’s some potential to contact Teradata and make an existing Hadoop datastore more user friendly. The job of converting an existing Postgres cluster to a Hadoop equivalent is left as an (onerous) exercise for the user.

Postgres-X2

The Postgres-X2 project is a bit of a conundrum. Unlike Postgres-XL which is active and backed by 2ndQuadrant, Postgres-X2 seems to be a direct continuation of the abandoned Postgres-XC codebase. As a result, they’re still stuck on Postgres 9.3. Further, they likely have similar issues as we encountered with Postgres-XL, or worse due to the project’s stagnancy. After exploring the github repository, it turns out the last update to any code was two years ago.

Maybe there’s another repository elsewhere, but this project should be considered dead unless they pull a Holy Grail and claim otherwise.

Stado

I first encountered Stado back when it was GridSQL, one of many EnterpriseDB projects. It works by abstracting several database instances through a port proxy, distributing writes and reads arbitrarily based on its own internal magic. It uses Postgres as a filesystem of sorts, and connecting to Postgres directly reveals this in greater detail. Object names are the result of hash functions, and even databases are only tangentially linked to the desired given nomenclature.

Stado is all about metadata, and Postgres is its chosen host. Because I had experience with a previous incarnation, I made an exception and considered it undead for testing purposes, even though the Wiki says it died in 2011. It’s just a Java proxy after all, so what could it hurt to see if it still works with recent Postgres releases?

As it turns out, it can hurt a lot. It seems my memory of GridSQL was a little hazy, as what’s going on here isn’t merely a distribution proxy. It’s transformative and extremely restrictive, throwing up errors for “unknown” keywords such as SCHEMA. No schema support means there’s no way we can use it, which is something of a shame. The performance metrics were encouraging back in the day, and the concept it represents is sound.

Consider the PMPP extension, for example. When I looked at it late last year, I loved the simplicity. Take a query, broadcast it to every known Postgres node, and present the results. Wrap the function in another query, and it can be re-aggregated to merge everything together. I was hoping Stado did this automatically, and that was all. Nope. Oh well.

The Winner?

If someone could write something that worked like I thought Stado did, I’d probably kiss them. It would require manipulating the Postgres query planner or a proxy of some kind, but that’s all I really want. Send a query to multiple nodes, let them execute it independently in parallel, keep track of aggregate functions used, and apply them to the appropriate columns in the final result. It seems so simple, but the underlying complexity is clearly more involved.

The thing is, large warehouse databases usually contain data that’s already been in another source. Primary key collisions are unlikely, as some kind of engine (Postgres-XL, ETL, etc.) has already distributed data according to some kind of hash function. I just want a query that can invoke the cluster in parallel. That’s all. Extensions like PMPP do half of the job, but short of rewriting existing applications to leverage it properly, it’s only really usable for new projects.
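
To make the idea concrete, here is a hedged scatter-gather sketch in Python (not Stado and not the PMPP API, just the shape of what's described above): run the same query on every node in parallel, then re-aggregate the partial results locally. Node DSNs, table, and query are illustrative.

import psycopg2
from concurrent.futures import ThreadPoolExecutor

NODES = ["host=node1 dbname=warehouse", "host=node2 dbname=warehouse"]
QUERY = "SELECT count(*), sum(amount) FROM sales WHERE sale_date >= '2016-01-01'"

def run_on_node(dsn):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            return cur.fetchone()
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    partials = list(pool.map(run_on_node, NODES))

# Re-aggregate: counts and sums can simply be added across nodes
total_count = sum(row[0] for row in partials)
total_sum = sum(row[1] for row in partials if row[1] is not None)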

So I’ll probably be looking into CitusDB a bit more. It seems to work the way I want, and adds shard redundancy as an extra benefit. I’ll put it on some VMs and unceremoniously thrash it after dumping hundreds of GB into its lap and record the ensuing chaos. Hopefully these tests go better than when I subjected pg_shard to the same treatment.

Otherwise, the state of Postgres scaling is incomplete, and there are no projects I know of that will suit our requirements. As a Postgres DBA, I probably try too hard to use it as a hammer on every nail, but it’s just so close as to be immensely frustrating.

Wish me luck!

Paul Ramsey: OGR FDW Update

From Planet PostgreSQL. Published on Apr 29, 2016.

I’ve had a productive couple of weeks here, despite the intermittently lovely weather and the beginning of Little League baseball season (not coaching, just supporting my pitcher-in-training).

13 Days

The focus of my energies has been a long-awaited (by me) update to the OGR FDW extension for PostgreSQL. By binding the multi-format OGR library into PostgreSQL, we get access to the many formats supported by OGR, all with just one piece of extension code.

As usual, the hardest part of the coding was remembering how things worked in the first place! But after getting my head back in the game the new code flowed out and now I can reveal the new improved OGR FDW!

OGR FDW Update

The new features are:

  • Column name mapping between OGR layers and PgSQL tables is now completely configurable. The extension will attempt to guess mapping automagically, using names and type consistency, but you can over-ride mappings using the table-level column_name option.
  • Foreign tables are now updateable! That means, for OGR sources that support it, you can run INSERT, UPDATE and DELETE commands on your OGR FDW tables and the changes will be applied to the source.

    • You can control which tables and foreign servers are updateable by setting the UPDATEABLE option on the foreign server and foreign table definitions.
  • PostgreSQL 9.6 is supported. It’s not released yet, but we can now build against it.
  • Geometry type and spatial reference system are propagated from OGR. If your OGR source defines a geometry type and spatial reference identifier, the FDW tables in PostGIS will now reflect that, for easier integration with your local geometry data.
  • GDAL2 and GDAL1 are supported. Use of GDAL2 syntax has been made the default in the code-base, with mappings back to GDAL1 for compatibility, so the code is now future-ready.
  • Regression tests and continuous integration are in place, for improved code reliability. Thanks to help from Even Rouault, we are now using Travis-CI for integration testing, and I’ve enabled a growing number of integration tests.

As usual, I’m in debt to Regina Obe for her usual timely feedback and willingness to torture-test very fresh code.

For now, early adopters can get the code by cloning and building the project master branch, but I will be releasing a numbered version in a week or two when any obvious bugs have been shaken out.

Andrew Dunstan: Windows XP - PostgreSQL's long goodbye.

From Planet PostgreSQL. Published on Apr 29, 2016.

I just committed a patch to allow building with Visual Studio 2015. Due to a change in the way we need to detect locales with this compiler, it will not be possible to use binaries built with it on Windows XP. Despite the fact that Microsoft declared Windows XP to be completely out of support well over a year ago, it still lives on in a huge number of installations. My own instance still gets occasional updates from Microsoft. And you can still build and run the very latest PostgreSQL on Windows XP. But you can't use it on Windows XP if it's built with Visual Studio 2015 or later.

I will keep my instance (whose only job is to run several buildfarm members) running as long as it doesn't require any significant tweaks. But a day will eventually come when it does require such tweaks, or just won't build and run successfully any more,  and at that stage I'll shut it down.


Greg Sabino Mullane: Postgres concurrent indexes and the curse of IIT

From Planet PostgreSQL. Published on Apr 28, 2016.

Postgres has a wonderful feature called concurrent indexes. It allows you to create indexes on a table without blocking reads OR writes, which is quite a handy trick. There are a number of circumstances in which one might want to use concurrent indexes, the most common one being not blocking writes to production tables. There are a few other use cases as well, including:


Photograph by Nicholas A. Tonelli

  • Replacing a corrupted index
  • Replacing a bloated index
  • Replacing an existing index (e.g. better column list)
  • Changing index parameters
  • Restoring a production dump as quickly as possible

In this article, I will focus on that last use case, restoring a database as quickly as possible. We recently upgraded a client from a very old version of Postgres to the current version (9.5 as of this writing). The fact that use of pg_upgrade was not available should give you a clue as to just how old the "very old" version was!

Our strategy was to create a new 9.5 cluster, get it optimized for bulk loading, import the globals and schema, stop write connections to the old database, transfer the data from old to new, and bring the new one up for reading and writing.

The goal was to reduce the application downtime as much as reasonably possible. To that end, we did not want to wait until all the indexes were created before letting people back in, as testing showed that the index creations were the longest part of the process. We used the "--section" flags of pg_dump to create pre-data, data, and post-data sections. All of the index creation statements appeared in the post-data file.

Because the client determined that it was more important for the data to be available, and the tables writable, than it was for them to be fully indexed, we decided to try using CONCURRENT indexes. In this way, writes to the tables could happen at the same time that they were being indexed - and those writes could occur as soon as the table was populated. That was the theory anyway.

The migration went smoothly - the data was transferred over quickly, the database was restarted with a new postgresql.conf (e.g. turning fsync back on), and clients were able to connect, albeit with some queries running slower than normal. We parsed the post-data file and created a new file in which all the CREATE INDEX commands were changed to CREATE INDEX CONCURRENTLY. We kicked that off, but after a certain amount of time, it seemed to freeze up.


The frogurt is also cursed.

Looking closer showed that the CREATE INDEX CONCURRENTLY statement was waiting, and waiting, and never able to complete - because other transactions were not finishing. This is why concurrent indexing is both a blessing and a curse. The concurrent index creation is so polite that it never blocks writers, but this means processes can charge ahead and be none the wiser that the create index statement is waiting on them to finish their transaction. When you also have a misbehaving application that stays "idle in transaction", it's a recipe for confusion. (Idle in transaction is what happens when your application keeps a database connection open without doing a COMMIT or ROLLBACK). A concurrent index can only completely finish being created once any transaction that has referenced the table has completed. The problem was that because the create index did not block, the app kept chugging along, spawning new processes that all ended up in idle in transaction.

At that point, the only way to get the concurrent index creation to complete was to forcibly kill all the other idle in transaction processes, forcing them to roll back and causing a lot of distress for the application. In contrast, a regular index creation would have caused other processes to block on their first attempt to access the table; they would then have carried on once the creation was complete, and nothing would have had to roll back.
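
To make the diagnosis and the (unpleasant) fix concrete, here is a hedged sketch against pg_stat_activity (column names as of 9.2 and later); the pid in the second statement is a placeholder:

-- Find sessions sitting "idle in transaction" that can hold up
-- CREATE INDEX CONCURRENTLY indefinitely
SELECT pid, usename, xact_start, state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;

-- Last resort: forcibly end such a session (its open transaction rolls back)
SELECT pg_terminate_backend(12345);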

Another business decision was made - the concurrent indexes were nice, but we needed the indexes, even if some had to be created as regular indexes. Many of the indexes were able to be completed (concurrently) very quickly - and they were on not-very-busy tables - so we plowed through the index creation script, and simply canceled any concurrent index creations that were being blocked for too long. This only left a handful of uncreated indexes, so we simply dropped the "invalid" indexes (these appear when a concurrent index creation is interrupted), and reran with regular CREATE INDEX statements.
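
For reference, indexes left behind by interrupted concurrent builds can be spotted in the catalogs before dropping them; a minimal sketch:

-- List invalid indexes (e.g. from cancelled CREATE INDEX CONCURRENTLY runs)
SELECT indexrelid::regclass AS index_name,
       indrelid::regclass  AS table_name
FROM pg_index
WHERE NOT indisvalid;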

The lesson here is that nothing comes without a cost. The overly polite concurrent index creation is great at letting everyone else access the table, but it also means that large complex transactions can chug along without being blocked, and have to have all of their work rolled back. In this case, things worked out as we did 99% of the indexes as CONCURRENT, and the remaining ones as regular. All in all, the use of concurrent indexes was a big win, and they are still an amazing feature of Postgres.

Josh Berkus: Don't delete pg_xlog

From Planet PostgreSQL. Published on Apr 28, 2016.

This StackOverflow question reminded me of this old blog post, which is still relevant today:

pg_log, pg_xlog and pg_clog


A default $PGDATA directory, when you create it, contains three directories named "pg_*log".


pg_log


$PGDATA/pg_log is the default location for the database activity logs, which include error messages, query logging, and startup/shutdown messages.  This is where you should first look for information when PostgreSQL won't start.  Many Linux distributions and other packaging systems relocate this log directory to somewhere like /var/log/postgresql.

You can freely delete, rename, compress, and move files in pg_log without penalty, as long as the postgres user still has rights to write to the directory. If pg_log becomes bloated with many large files, you probably need to decrease the number of things you're logging by changing the settings in postgresql.conf.
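
One quick way to see what you are currently logging, before trimming things in postgresql.conf, is a small catalog query (a sketch, not part of the original post):

-- Show every logging-related setting and its current value
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'log%'
ORDER BY name;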

Do note that if you "delete" the current log file on a Linux or Unix system, it may remain open but not accessible, just sending any successive log messages to /dev/null until the file rotates.
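
If that has already happened, forcing a log rotation releases the space; this only applies when the built-in logging collector is doing the logging (a hedged aside, superuser required):

-- Ask the logging collector to switch to a new log file
SELECT pg_rotate_logfile();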

pg_xlog 


$PGDATA/pg_xlog is the PostgreSQL transaction log.  This set of binary log files, with names like '00000001000000000000008E', contains images of the data from recent transactions.  These logs are also used for binary replication.

If replication, archiving, or PITR is failing, this directory can become bloated with gigabytes of logs the database server is saving for when archiving resumes. This can cause you to run out of disk space.
Unlike pg_log, you may not freely delete, move, or compress files in this directory.  You may not even move the directory without symlinking it back to its original location.  Deleting pg_xlog files may result in unrecoverable database corruption.

If you find yourself in a situation where you've got 100GB of files in pg_xlog and the database won't start, and you've already disabled archiving/replication and tried clearing disk space every other way, then please take two steps:
  1. Move files from pg_xlog to a backup disk or shared network drive, don't delete them, and
  2. Move only a few of the oldest files, enough to allow PostgreSQL to start again.
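
When diagnosing how pg_xlog got that large in the first place, a couple of catalog views (available from 9.4 onward) usually tell the story; a hedged sketch:

-- Is archive_command failing?
SELECT archived_count, failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;

-- Is an inactive replication slot pinning old WAL?
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots;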

pg_clog


$PGDATA/pg_clog contains a log of transaction metadata.   This log tells PostgreSQL which transactions completed and which did not.  The clog is small and never has any reason to become bloated, so you should never have any reason to touch it.

Should you ever delete files from pg_clog, you might as well delete the entire database directory. There is no recovery from a missing clog.

Note that this means, if you back up the files in a $PGDATA directory, you should make sure to include the pg_clog and pg_xlog as well, or you may find that your backup is not usable.
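
If you do take such a file-level copy of a running cluster, one way to keep it consistent in these releases is to wrap the copy in the exclusive backup functions; a minimal sketch, not from the original post:

-- Before copying the files (forces a checkpoint)
SELECT pg_start_backup('file level backup');
-- ... copy the whole $PGDATA directory, including pg_clog and pg_xlog ...
-- After copying, so the required WAL gets archived
SELECT pg_stop_backup();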

Marco Slot: Scalable Real-time Product Search using Citus

From Planet PostgreSQL. Published on Apr 28, 2016.

Scalable Real-time Product Search using PostgreSQL with Citus

Product search is a common, yet sometimes challenging use-case for online retailers and marketplaces. It typically involves a combination of full-text search and filtering by attributes which differ for every product category. More complex use-cases may have many sellers that offer the same product, but with a different price and different properties.

PostgreSQL has the functionality required to build a product search application, but scaling it with a large product catalog can be difficult. With the Citus extension, PostgreSQL can distribute tables and parallelize queries across many servers, making it easy to scale out your memory and compute power. Couple Citus with PostgreSQL's full-text search and it becomes fast and easy to do interactive searches on a large product catalog. While the search functionality is not as comprehensive as in dedicated search solutions, a huge benefit of keeping the data in PostgreSQL is that it can be updated in real-time and tables can be joined. This post will go through the steps of setting up an experimental products database with parallel search functions using PostgreSQL and Citus, with the goal of showcasing several powerful features.

We start by setting up a multi-node Citus cluster on EC2 using 4 m3.2xlarge instances as workers. An even easier way to get started is to use the brand new Citus Cloud, which gives you a managed PostgreSQL cluster with Citus and full auto-fail-over. The main table in our database schema is the "product" table, which contains the name and description of a product, its price, and attributes in JSON format such that different types of products can use different attributes:

$ psql
CREATE TABLE product (
  product_id int primary key,
  name text not null,
  description text not null,
  price decimal(12,2),
  attributes jsonb
);

To distribute the table using Citus, we call the functions for hash-distributing the table into 16 shards (one per physical core). The shards are distributed and replicated across the 4 workers.

SELECT master_create_distributed_table('product', 'product_id', 'hash');
SELECT master_create_worker_shards('product', 16, 2);

We create a GIN index to allow fast filtering of attributes by the JSONB containment operator. For example, a search query for English books might have the following expression: attributes @> '{"category":"books", "language":"english"}' , which can use the GIN index.

CREATE INDEX attributes_idx ON product USING GIN (attributes jsonb_path_ops);

To filter products by their name and description, we use the full text search functions in PostgreSQL to find a match with a user-specified query. A text search operation is performed on a text search vector (tsvector) using a text search query (tsquery). It can be useful to define an intermediate function that generates the tsvector for a product. The product_text_search function below combines the name and description of a product into a tsvector, in which the name is assigned the highest weight (from 'A' to 'D'), such that matches with the name will show up higher when sorting by relevance.

CREATE FUNCTION product_text_search(name text, description text)
RETURNS tsvector LANGUAGE sql IMMUTABLE AS $function$
  SELECT setweight(to_tsvector(name),'A') || 
         setweight(to_tsvector(description),'B');
$function$;

To use the product_text_search function in queries and indexes, it also needs to be created on the workers. An easy way to run a SQL file on every worker is to use xargs.

$ psql -c "SELECT * FROM master_get_active_worker_nodes()" -tA -F" " \
 | xargs -n 2 sh -c "psql -h \$0 -p \$1 -f product_text_search.sql"

After setting up the function, we define a GIN index on it, which speeds up text searches on the product table.

$ psql
CREATE INDEX text_idx ON product USING GIN (product_text_search(name, description));

We don't have a large product dataset available, so instead we generate 10 million mock products (7GB) by appending random words to generate names, descriptions, and attributes, using a simple generator function. This is probably not the fastest way to generate mock data, but we're PostgreSQL geeks :). After adding some words to the words table, we can run:

\COPY (SELECT * FROM generate_products(10000000)) TO '/data/base/products.tsv'

The new COPY feature in Citus can be used to load the data into the product table. COPY for hash-partitioned tables is currently available in the latest version of Citus and in Citus Cloud. A benefit of using COPY on distributed tables is that workers can process multiple rows in parallel. Because each shard is indexed separately, the indexes are also kept small, which improves ingestion rate for GIN indexes. 

\COPY product FROM '/data/base/products.tsv'

The data load takes just under 7 minutes; roughly 25,000 rows/sec on average. We also loaded data into a regular PostgreSQL table in 45 minutes (3,700 rows/sec) by creating the index after copying in the data.

Now let's search products! Assume the user is searching for "copper oven". We can convert the phrase into a tsquery using the plainto_tsquery function and match it to the name and description using the @@ operator. As an additional filter, we require that the "food" attribute of the product is either "waste" or "air". We're using very random words :). To order the query by relevance, we can use the ts_rank function, which takes the tsvector and tsquery as input. 

SELECT p.product_id, p.name, p.price
FROM product p
WHERE product_text_search(name, description) @@ plainto_tsquery('copper oven') 
  AND (attributes @> '{"food":"waste"}' OR attributes @> '{"food":"air"}')
ORDER BY ts_rank(product_text_search(name, description),
                 plainto_tsquery('copper oven')) DESC
LIMIT 10;
 product_id |         name         | price 
------------+----------------------+-------
    2016884 | oven copper hot      | 32.33
    8264220 | rifle copper oven    | 92.11
    4021935 | argument chin rub    | 79.33
    5347636 | oven approval circle | 50.78
(4 rows)

Time: 68.832 ms (~78ms on non-distributed table)

The query above uses both GIN indexes to do a very fast look-up of a small number of rows. A much broader search can take longer because of the need to sort all the results by their rank. For example, the following query has 294,000 results that it needs to sort to get the first 10:

SELECT p.product_id, p.name, p.price
FROM product p
WHERE product_text_search(name, description) @@ plainto_tsquery('oven') 
  AND price < 50
ORDER BY ts_rank(product_text_search(name, description),
                 plainto_tsquery('oven')) DESC
LIMIT 10;
 product_id |         name         | price 
------------+----------------------+-------
    6295883 | end oven oven        |  7.80
    3304889 | oven punishment oven | 28.27
    2291463 | town oven oven       |  7.47
...
(10 rows)

Time: 2262.502 ms (37 seconds on non-distributed table)

This query gets the top 10 results from each of the 16 shards, which is where the majority of time is spent, and the master sorts the final 160 rows. By using more machines and more shards, the number of rows that needs to be sorted in each shard is lowered significantly, while the amount of sorting work done by the master remains trivially small. This means that we can get significantly lower query times by using a bigger cluster with more shards.

In addition to products, imagine the retailer also has a marketplace where third-party sellers can offer products at different prices. Those offers should also show up in searches if their price is under the maximum. A product can have many such offers. We create an additional distributed table, which we distribute by product_id and assign the same number of shards, such that we can perform co-located joins between the product and offer table on product_id. 

CREATE TABLE offer (
  product_id int not null,
  offer_id int not null,
  seller_id int,
  price decimal(12,2),
  new bool,
  primary key(product_id, offer_id)
);
SELECT master_create_distributed_table('offer', 'product_id','hash');
SELECT master_create_worker_shards('offer', 16, 2);

We load 5 million random offers generated using the generate_offers function and COPY. The following query searches for popcorn oven products priced under $70, including products with offers under $70. Offers are included in the results as an array of JSON objects.

SELECT p.product_id, p.name, p.price, to_json(array_agg(to_json(o)))
FROM   product p LEFT JOIN offer o USING (product_id)
WHERE  product_text_search(p.name, p.description) @@ plainto_tsquery('popcorn oven')
  AND (p.price < 70 OR o.price < 70)
GROUP BY p.product_id, p.name, p.description, p.price
ORDER BY ts_rank(product_text_search(p.name, p.description),
                 plainto_tsquery('popcorn oven')) DESC
LIMIT 10;
 product_id |          name          | price |                                        to_json                                        
------------+------------------------+-------+---------------------------------------------------------------------------------------
    9354998 | oven popcorn bridge    | 41.18 | [null]
    1172380 | gate oven popcorn      | 24.12 | [{"product_id":1172380,"offer_id":4853987,"seller_id":2088,"price":55.00,"new":true}]
     985098 | popcorn oven scent     | 73.32 | [{"product_id":985098,"offer_id":5890813,"seller_id":5727,"price":67.00,"new":true}]
...
(10 rows)

Time: 337.441 ms (4 seconds on non-distributed tables)

Given the wide array of features available in PostgreSQL, we can keep making further enhancements. For example, we could convert the entire row to JSON, or add a filter to only return reasonably close matches, and we could make sure only lowest priced offers are included in the results. We can also start doing real-time inserts and updates in the product and offer tables.
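
As a sketch of one of those enhancements (reusing the schema above; this variant is an illustration, not from the original post, and untested against the cluster described here), the offer array could be reduced to just the lowest offer price per product:

SELECT p.product_id, p.name, p.price, min(o.price) AS best_offer_price
FROM   product p LEFT JOIN offer o USING (product_id)
WHERE  product_text_search(p.name, p.description) @@ plainto_tsquery('popcorn oven')
  AND (p.price < 70 OR o.price < 70)
GROUP BY p.product_id, p.name, p.description, p.price
ORDER BY ts_rank(product_text_search(p.name, p.description),
                 plainto_tsquery('popcorn oven')) DESC
LIMIT 10;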

If you have a use-case with similar needs, or would like to take Citus into production, don't hesitate to contact Citus Data.

Special thanks to the people at Postgres Professional for contributing most of the full-text search, JSONB, and GIN index features in PostgreSQL, as well as the initial code for the Citus COPY feature.

Oskari Saarenmaa: PostgreSQL cloud backups with PGHoard

From Planet PostgreSQL. Published on Apr 28, 2016.

PGHoard is the cloud backup and restore solution we're using in Aiven. We started PGHoard development in early 2015 when the Aiven project was launched as a way to provide real-time streaming backups of PostgreSQL to a potentially untrusted cloud object storage.

PGHoard has an extensible object storage interface, which currently works with the following cloud object stores:
  • Amazon Web Services S3
  • Google Cloud Storage
  • OpenStack Swift
  • Ceph's RADOSGW utilizing either the S3 or Swift drivers 
  • Microsoft Azure Storage (currently experimental)

Data integrity

PostgreSQL backups consist of full database backups (basebackups) plus the write-ahead log and related metadata (WAL). Both basebackups and WAL are required to create and restore a consistent database.

PGHoard handles both the full, periodic backups (driving pg_basebackup) as well as streaming the write-ahead-log of the database.  Constantly streaming WAL as it's generated allows PGHoard to restore a database to any point in time since the oldest basebackup was taken.  This is used to implement Aiven's Database Forks and Point-in-time-Recovery as described in our PostgreSQL FAQ.

To save disk space and reduce the data that needs to be sent over the network (potentially incurring extra costs) backups are compressed by default using Google's Snappy, a fast compression algorithm with a reasonable compression ratio. LZMA (a slower algorithm with very high compression ratio) is also supported.

To protect backups from unauthorized access and to ensure their integrity PGHoard can also transparently encrypt and authenticate the data using RSA, AES and SHA256.  Each basebackup and WAL segment gets a unique random AES key which is encrypted with RSA.  HMAC-SHA256 is used for file integrity checking.

Restoring is the key

As noted in the opening paragraph, PGHoard is a backup and restore tool: backups are largely useless unless they can be restored.  Experience tells us that backups, even if set up at some point, are usually not restorable unless restore is routinely tested, but experience also shows that backup restoration is rarely practiced unless it's easy to do and automate.

This is why PGHoard also includes tooling to restore backups, allowing you to create new master or standby databases from the object store archives.  This makes it possible to set up a new database replica with a single command, which first restores the database basebackup from object storage and then sets up PostgreSQL's recovery.conf to fetch the remaining WAL files from the object storage archive and optionally connect to an existing master server after that.

Preparing PostgreSQL for PGHoard

First, we will need to create a replication user account. We'll just use the psql command-line client for this:

postgres=# CREATE USER backup WITH REPLICATION PASSWORD 'secret';
CREATE ROLE


We also need to allow this new user to make connections to the database. In PostgreSQL this is done by editing the pg_hba.conf configuration file and adding a line something like this:

host  replication  backup  127.0.0.1/32  md5

We'll also need to ensure our PostgreSQL instance is configured to allow WAL replication out from the server and it has the appropriate wal_level setting. We'll edit postgresql.conf and edit or add the following settings:

max_wal_senders = 2  # minimum two with pg_receivexlog mode!
wal_level = archive  # 'hot_standby' or 'logical' are also ok


Finally, since we have modified PostgreSQL configuration files, we'll need to restart PostgreSQL to take the new settings into use by running "pg_ctl restart", "systemctl restart postgresql" or "service postgresql restart", etc depending on the Linux distribution being used.  Note that it's not enough to "reload" PostgreSQL in case the WAL settings were changed.

Now we are ready on the PostgreSQL side and can move on to PGHoard.

Installing PGHoard

PGHoard's source distribution includes packaging scripts for Debian, Fedora and Ubuntu.  Instructions for building distribution specific packages can be found in the PGHoard README.  As PGHoard is a Python package it can also be installed on any system with Python 3 by running "pip3 install pghoard".

Taking backups with PGHoard

PGHoard provides a number of tools that can be launched from the command-line:
  • pghoard - The backup daemon itself, can be run under systemd or sysvinit
  • pghoard_restore - Backup restoration tool
  • pghoard_archive_sync - Command for verifying archive integrity
  • pghoard_create_keys - Backup encryption key utility
  • pghoard_postgres_command - Used as PostgreSQL's archive_command and restore_command
First, we will launch the pghoard daemon to start taking backups. pghoard requires a small JSON configuration file that contains the settings for the PostgreSQL connection and for the target backup storage. We'll name the file pghoard.json:

{
    "backup_location": "./metadata",
    "backup_sites": {
        "example-site": {
            "nodes": [
                {
                    "host": "127.0.0.1",
                    "password": "secret",
                    "port": 5432,
                    "user": "backup"
                }
            ],
            "object_storage": {
                "storage_type": "local",
                "directory": "./backups"
            }
        }
    }
}


In the above file we just list where pghoard keeps its local working directory (backup_location), our PostgreSQL connection settings (nodes) and where we want to store the backups (object_storage). In this example we'll just write the backup files to a local disk instead of a remote cloud object storage.

Then we just need to run the pghoard daemon and point it to our configuration file:

$ pghoard --short-log --config pghoard.json
DEBUG   Loading JSON config from: './pghoard.json', signal: None
INFO    pghoard initialized, own_hostname: 'ohmu1', cwd: '/home/mel/backup'
INFO    Creating a new basebackup for 'example-site' because there are currently none
INFO    Started: ['/usr/bin/pg_receivexlog', '--status-interval', '1', '--verbose', '--directory', './metadata/example-site/xlog_incoming', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'"], running as PID: 8809
INFO    Started: ['/usr/bin/pg_basebackup', '--format', 'tar', '--label', 'pghoard_base_backup', '--progress', '--verbose', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'", '--pgdata', './metadata/example-site/basebackup_incoming/2016-04-28_0'], running as PID: 8815, basebackup_location: './metadata/example-site/basebackup_incoming/2016-04-28_0/base.tar'
INFO    Compressed 16777216 byte file './metadata/example-site/xlog_incoming/000000010000000000000025' to 805706 bytes (4%), took: 0.056s
INFO    'UPLOAD' transfer of key: 'example-site/xlog/000000010000000000000025', size: 805706, took 0.003s
INFO    Ran: ['/usr/bin/pg_basebackup', '--format', 'tar', '--label', 'pghoard_base_backup', '--progress', '--verbose', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'", '--pgdata', './metadata/example-site/basebackup_incoming/2016-04-28_0'], took: 0.331s to run, returncode: 0
INFO    Compressed 16777216 byte file './metadata/example-site/xlog_incoming/000000010000000000000026' to 797357 bytes (4%), took: 0.057s
INFO    'UPLOAD' transfer of key: 'example-site/xlog/000000010000000000000026', size: 797357, took 0.011s
INFO    Compressed 80187904 byte file './metadata/example-site/basebackup_incoming/2016-04-28_0/base.tar' to 15981960 bytes (19%), took: 0.335s
INFO    'UPLOAD' transfer of key: 'example-site/basebackup/2016-04-28_0', size: 15981960, took 0.026s



PGHoard automatically connected to the PostgreSQL database server, noticed that we don't have any backups and immediately created a new basebackup and started the realtime streaming of WAL files (which act as incremental backups). Each file stored in the backups was first compressed for optimizing the transfer and storage costs.

As long as you keep PGHoard running, it will make full backups using the default schedule (once per 24 hours) and continuously stream WAL files.

Looking at the contents of the "backups" directory, we see that our backups now contain a full database backup plus a couple of WAL files, and some metadata for each of the files:

$ find backups/ -type f
backups/example-site/xlog/000000010000000000000025
backups/example-site/xlog/000000010000000000000025.metadata
backups/example-site/xlog/000000010000000000000026
backups/example-site/xlog/000000010000000000000026.metadata
backups/example-site/basebackup/2016-04-28_0
backups/example-site/basebackup/2016-04-28_0.metadata


Available backups can be listed with the pghoard_restore tool:

$ pghoard_restore list-basebackups --config pghoard.json
Available 'example-site' basebackups:

Basebackup                                Backup size    Orig size  Start time
----------------------------------------  -----------  -----------  --------------------
example-site/basebackup/2016-04-28_0            15 MB        76 MB  2016-04-28T06:40:46Z


Looks like we are all set. Now let's try restore!

Restoring a backup

Restoring a backup is a matter of running a single command:

$ pghoard_restore get-basebackup --config pghoard.json --target-dir restore-test
Found 1 applicable basebackup

Basebackup                                Backup size    Orig size  Start time
----------------------------------------  -----------  -----------  --------------------
example-site/basebackup/2016-04-28_0            15 MB        76 MB  2016-04-28T06:40:46Z
    metadata: {'compression-algorithm': 'snappy', 'start-wal-segment': '000000010000000000000026', 'pg-version': '90406'}

Selecting 'example-site/basebackup/2016-04-28_0' for restore
Basebackup complete.
You can start PostgreSQL by running pg_ctl -D restore-test start
On systemd based systems you can run systemctl start postgresql
On SYSV Init based systems you can run /etc/init.d/postgresql start


The pghoard_restore command automatically chooses the latest available backup, downloads, unpacks (and decompresses and decrypts, when those options are used) it to the specified target directory. The end result will be a complete PostgreSQL data directory (e.g. something like /var/lib/postgresql/9.5/main or /var/lib/pgsql/data, depending on the distro), ready to be used by a PostgreSQL instance.

There are more command-line options for more detailed control over the restoration process, for example restoring to a particular point in time or transaction (PITR) or choosing whether the restored database will be acting as a master or a standby.

Backup encryption

In order to encrypt our backups, we'll need to create an encryption key pair. PGHoard provides a handy command for automatically creating a key pair and storing it into our configuration file:

$ pghoard_create_keys --key-id example --config pghoard.json
Saved new key_id 'example' for site 'example-site' in 'pghoard.json'
NOTE: The pghoard daemon does not require the 'private' key in its configuration file, it can be stored elsewhere to improve security


Note that in most cases you will want to extract the private key away from the configuration file and store it safely elsewhere away from the machine that makes the backups. The pghoard daemon only needs the encryption public key. Only the restore tool needs the private key in order to restore backups.

Uploading backups to a cloud

Sending backups to an object storage in a cloud is simple: we just need the cloud's access credentials and we'll modify the object_storage section of pghoard.json:

            "object_storage": {
                "aws_access_key_id": "XXX",
                "aws_secret_access_key": "XXX",
                "bucket_name": "backups",
                "region": "eu-central-1",
                "storage_type": "s3"
            }


Now when we restart pghoard, the backups are sent to AWS S3 in Frankfurt:

$ pghoard --short-log --config pghoard.json
DEBUG   Loading JSON config from: './pghoard.json', signal: None
INFO    pghoard initialized, own_hostname: 'ohmu1', cwd: '/home/mel/backup'
INFO    Started: ['/usr/bin/pg_receivexlog', '--status-interval', '1', '--verbose', '--directory', './metadata/example-site/xlog_incoming', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'"], running as PID: 8001
INFO    Creating a new basebackup for 'example-site' because there are currently none
INFO    Started: ['/usr/bin/pg_basebackup', '--format', 'tar', '--label', 'pghoard_base_backup', '--progress', '--verbose', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'", '--pgdata', './metadata/example-site/basebackup_incoming/2016-04-28_1'], running as PID: 8014, basebackup_location: './metadata/example-site/basebackup_incoming/2016-04-28_1/base.tar'
INFO    Ran: ['/usr/bin/pg_basebackup', '--format', 'tar', '--label', 'pghoard_base_backup', '--progress', '--verbose', '--dbname', "dbname='replication' host='127.0.0.1' port='5432' replication='true' user='backup'", '--pgdata', './metadata/example-site/basebackup_incoming/2016-04-28_1'], took: 0.350s to run, returncode: 0
INFO    Compressed and encrypted 16777216 byte file './metadata/example-site/xlog_incoming/000000010000000000000027' to 799445 bytes (4%), took: 0.406s
INFO    Compressed and encrypted 16777216 byte file './metadata/example-site/xlog_incoming/000000010000000000000028' to 797784 bytes (4%), took: 0.137s
INFO    Compressed and encrypted 80187904 byte file './metadata/example-site/basebackup_incoming/2016-04-28_1/base.tar' to 15982372 bytes (19%), took: 0.417s
INFO    'UPLOAD' transfer of key: 'example-site/xlog/000000010000000000000028', size: 797784, took 0.885s
INFO    'UPLOAD' transfer of key: 'example-site/xlog/000000010000000000000027', size: 799445, took 1.104s
INFO    'UPLOAD' transfer of key: 'example-site/basebackup/2016-04-28_1', size: 15982372, took 4.911s



The restore tool works the same way regardless of where the backups are stored:

$ pghoard_restore list-basebackups --config pghoard.json
Available 'example-site' basebackups:

Basebackup                                Backup size    Orig size  Start time
----------------------------------------  -----------  -----------  --------------------
example-site/basebackup/2016-04-28_1            15 MB        76 MB  2016-04-28T09:39:37Z



PostgreSQL 9.2+ and Python 3.3+ required

Today we released PGHoard version 1.2.0 with support for Python 3.3 and PostgreSQL 9.2 plus enhanced support for handling network outages.  These features were driven by external users; in Aiven we always use the latest PostgreSQL versions (9.5.2 at the time of writing) and access object storages near the database machines.


PGHoard in Aiven.io

We're happy to talk more about PGHoard and help you set up your backups with it.  You can also sign up for a free trial of our Aiven.io PostgreSQL service where PGHoard will take care of your backups.


Cheers,
Team Aiven

Craig Ringer: Don’t set fsync=off if you want to keep your data

From Planet PostgreSQL. Published on Apr 27, 2016.

There are a lot of amazing features coming in PostgreSQL 9.6, but I’m personally very happy about a really small, simple one that helps close a long-standing user foot-gun.

commit a31212b429cd3397fb3147b1a584ae33224454a6
Author: Robert Haas 
Date:   Wed Apr 27 13:46:26 2016 -0400

    Change postgresql.conf.sample to say that fsync=off will corrupt data.

    Discussion: 24748.1461764666@sss.pgh.pa.us
    
    Per a suggestion from Craig Ringer.  This wording from Tom Lane,
    following discussion.

There’s a bit of terrible advice floating around that turning fsync=off will make PostgreSQL run faster. Which is true, as far as it goes, but neglects that little "risk of massive data corruption on crash" part. So users get bitten. I’ve tried to scrub as much of that advice from the Internet as I can or get it qualified with warnings, but I still see people advised to do it on random forums sometimes.

The docs do a good job of explaining that setting fsync=off is a bad idea, but the sample config file made it seem pretty innocuous:

#fsync = on                             # turns forced synchronization on or off

so users keep turning it off. Then teams like 2ndQuadrant land up doing expensive forensic data recovery on their database clusters once they crash weeks, months or years later and experience severe corruption… or notice the corruption they experienced a while ago. Or users just write off their data and start again because it’s all too hard and expensive.

To make turning fsync=off a little less attractive the sample now reads:

#fsync = on                            # flush data to disk for crash safety
                                       # (turning this off can cause
                                       # unrecoverable disk corruption)

It won’t stop someone using ALTER SYSTEM SET without realising, but there’s only so much you can do.

I’m really happy about this. It’s nice to knock off such minor improvements that have a disproportionate impact on usability and UX.

If you are ever tempted to set fsync=off, pretend it’s called eat_my_data_if_you_feel_like_it=on and see if you still want to set it. synchronous_commit=off is probably a better choice. Read the manual.
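
For completeness, a hedged sketch of that safer alternative: synchronous_commit can be relaxed for a single session or even a single transaction, which risks losing the most recent commits after a crash but never corrupts the database:

-- Per session: recent commits may be lost on a crash, data stays consistent
SET synchronous_commit = off;

-- Or only for one transaction
BEGIN;
SET LOCAL synchronous_commit = off;
-- ... bulk writes ...
COMMIT;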

Joel Jacobson: Pgcronjob: Cron daemon to run user defined functions

From Planet PostgreSQL. Published on Apr 27, 2016.

I find myself quite often having to split a large write operation such as updating millions of rows in a heavily used table. This usually boils down to writing a small Perl script that runs a small one-off function that performs the task by updating a limited number of rows per run, and then committing in between to avoid a long running transaction.

This time I needed to do this, I decided not to write yet another Perl script and instead to do something to improve the situation. I've not been able to find any cron-ish tools at all to run database functions, except for pgAgent, but that looks more like a client-side daemon; I wanted something bare-minimum that can run server-side and piggy-back on the OS cron.

A few hours and 382 lines of code later, the project was finished and pushed to Github:

https://github.com/trustly/pgcronjob

createuser pgcronjob
psql -f install.sql
crontab pgcronjob.crontab
psql

CREATE OR REPLACE FUNCTION public.CronJob_Function_Template_Skeleton()
RETURNS batchjobstate
LANGUAGE plpgsql
SET search_path TO public, pg_temp
AS $FUNC$
DECLARE
BEGIN
RAISE NOTICE 'Hello world!';
PERFORM pg_sleep(random());
RAISE NOTICE 'Slept for a while.';
IF random() < 0.5 THEN
 -- Tell CronJob() we have more work to do and we want it to run us again in due time
 RAISE NOTICE 'See you again!';
 RETURN 'AGAIN';
ELSIF random() < 0.5 THEN
 -- Throw error to CronJob() to test errors
 RAISE EXCEPTION 'Simulate error in CronJob function';
ELSE
 -- Tell CronJob() we're done and we don't want it to run us ever again
 RAISE NOTICE 'Bye world!';
 RETURN 'DONE';
END IF;
END;
$FUNC$;

GRANT EXECUTE ON FUNCTION public.CronJob_Function_Template_Skeleton() TO pgcronjob;

SELECT CronJob_Register('public','cronjob_function_template_skeleton');
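
As an illustration of the batch-update use case that motivated the tool, here is a hedged sketch of what a real job function might look like; the prices table, its price_id key and its columns are hypothetical, and the function follows the AGAIN/DONE convention of the skeleton above:

CREATE OR REPLACE FUNCTION public.CronJob_Backfill_AmountCents()
RETURNS batchjobstate
LANGUAGE plpgsql
SET search_path TO public, pg_temp
AS $FUNC$
DECLARE
_rowcount bigint;
BEGIN
-- Backfill at most 10000 rows per invocation; cron commits between runs
UPDATE prices
SET amount_cents = (amount * 100)::bigint
WHERE price_id IN (
  SELECT price_id FROM prices
  WHERE amount_cents IS NULL
  LIMIT 10000
);
GET DIAGNOSTICS _rowcount = ROW_COUNT;
IF _rowcount > 0 THEN
 -- More rows remain, ask CronJob() to run us again in due time
 RETURN 'AGAIN';
END IF;
-- Nothing left to update, never run again
RETURN 'DONE';
END;
$FUNC$;

-- Grant and register it exactly as with the skeleton:
-- GRANT EXECUTE ON FUNCTION public.CronJob_Backfill_AmountCents() TO pgcronjob;
-- SELECT CronJob_Register('public','cronjob_backfill_amountcents');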

 

Hopefully this will be useful for others as well. It would be fun to get some feedback.

The initial commit even comes with a nice README and install and uninstall scripts.

Extract from the README:

The settings are conditions that must all be TRUE for the cronjob to run, i.e. they are AND'd together.

Always NOT NULL:
- Enabled boolean NOT NULL DEFAULT TRUE: Controls whether the cronjob is enabled or not.
- RunEvenIfOthersAreWaiting boolean NOT NULL DEFAULT FALSE: Controls whether to run the cronjob or not if there are other waiting db txns (pg_stat_activity.waiting).
- RetryOnError boolean NOT NULL DEFAULT FALSE: Controls whether to run the cronjob ever again if the user-defined function would throw an error.

Can be NULL (which means setting is ignored):
- RunAfterTimestamp timestamptz: Run only after the specified timestamp.
- RunUntilTimestamp timestamptz: Run only until the specified timestamp.
- RunAfterTime time: Run only after the specified time of the day.
- RunBeforeTime time: Run only until the specified time of the day.
- RunInterval interval: Run only after having slept for this interval after the last run started.
- SleepInterval interval: Run only after having slept for this interval after the last run finished.

How to use conditions to dynamically manipulate images

By Cloudinary Blog - Django from Django community aggregator: Community blog posts. Published on Apr 27, 2016.

Conditional image transformations illustration

It's great to have the capability to manipulate images on the fly by using dynamic URLs to customize the images to fit the graphic design of your site or mobile application. However, what if you want to manipulate an image depending on a specific image characteristic (like its width or aspect ratio) or its contents (does it contain a face?). What you need is a way to apply a transformation to an image only if a specific condition is met. Take for example a situation where you have allocated space on your page for a user uploaded image with a width and height of 200 pixels. Furthermore, if the image contains a face you would like to zoom in and focus on the face itself, otherwise you would like to fit the entire image into the available space:

Ruby:
cl_image_tag("smiling_man.jpg", :transformation=>[
  {:if=>"fc_gte_1"},
  {:width=>200, :height=>200, :gravity=>"face", :crop=>"thumb"},
  {:if=>"else", :width=>200, :height=>200, :crop=>"fit"},
  {:if=>"end"},
  {:radius=>40, :border=>"4px_solid_black"}
  ])
PHP:
cl_image_tag("smiling_man.jpg", array("transformation"=>array(
  array("if"=>"fc_gte_1"),
  array("width"=>200, "height"=>200, "gravity"=>"face", "crop"=>"thumb"),
  array("if"=>"else", "width"=>200, "height"=>200, "crop"=>"fit"),
  array("if"=>"end"),
  array("radius"=>40, "border"=>"4px_solid_black")
  )))
Python:
CloudinaryImage("smiling_man.jpg").image(transformation=[
  {"if": "fc_gte_1"},
  {"width": 200, "height": 200, "gravity": "face", "crop": "thumb"},
  {"if": "else", "width": 200, "height": 200, "crop": "fit"},
  {"if": "end"},
  {"radius": 40, "border": "4px_solid_black"}
  ])
Node.js:
cloudinary.image("smiling_man.jpg", {transformation: [
  {if: "fc_gte_1"},
  {width: 200, height: 200, gravity: "face", crop: "thumb"},
  {if: "else", width: 200, height: 200, crop: "fit"},
  {if: "end"},
  {radius: 40, border: "4px_solid_black"}
  ]})
Java:
cloudinary.url().transformation(new Transformation()
  .if("fc_gte_1").chain()
  .width(200).height(200).gravity("face").crop("thumb").chain()
  .if("else").width(200).height(200).crop("fit").chain()
  .if("end").chain()
  .radius(40).border("4px_solid_black")).imageTag("smiling_man.jpg")
jQuery:
$.cloudinary.image("smiling_man.jpg", {transformation: [
  {if: "fc_gte_1"},
  {width: 200, height: 200, gravity: "face", crop: "thumb"},
  {if: "else", width: 200, height: 200, crop: "fit"},
  {if: "end"},
  {radius: 40, border: "4px_solid_black"}
  ]})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .If("fc_gte_1").Chain()
  .Width(200).Height(200).Gravity("face").Crop("thumb").Chain()
  .If("else").Width(200).Height(200).Crop("fit").Chain()
  .If("end").Chain()
  .Radius(40).Border("4px_solid_black")).BuildImageTag("smiling_man.jpg")
Original image (scaled down) → An image containing a face is delivered as a zoomed-in thumbnail

Ruby:
cl_image_tag("golden_gate.jpg", :transformation=>[
  {:if=>"fc_gte_1"},
  {:width=>200, :height=>200, :gravity=>"face", :crop=>"thumb"},
  {:if=>"else", :width=>200, :height=>200, :crop=>"fit"},
  {:if=>"end"},
  {:radius=>40, :border=>"4px_solid_black"}
  ])
PHP:
cl_image_tag("golden_gate.jpg", array("transformation"=>array(
  array("if"=>"fc_gte_1"),
  array("width"=>200, "height"=>200, "gravity"=>"face", "crop"=>"thumb"),
  array("if"=>"else", "width"=>200, "height"=>200, "crop"=>"fit"),
  array("if"=>"end"),
  array("radius"=>40, "border"=>"4px_solid_black")
  )))
Python:
CloudinaryImage("golden_gate.jpg").image(transformation=[
  {"if": "fc_gte_1"},
  {"width": 200, "height": 200, "gravity": "face", "crop": "thumb"},
  {"if": "else", "width": 200, "height": 200, "crop": "fit"},
  {"if": "end"},
  {"radius": 40, "border": "4px_solid_black"}
  ])
Node.js:
cloudinary.image("golden_gate.jpg", {transformation: [
  {if: "fc_gte_1"},
  {width: 200, height: 200, gravity: "face", crop: "thumb"},
  {if: "else", width: 200, height: 200, crop: "fit"},
  {if: "end"},
  {radius: 40, border: "4px_solid_black"}
  ]})
Java:
cloudinary.url().transformation(new Transformation()
  .if("fc_gte_1").chain()
  .width(200).height(200).gravity("face").crop("thumb").chain()
  .if("else").width(200).height(200).crop("fit").chain()
  .if("end").chain()
  .radius(40).border("4px_solid_black")).imageTag("golden_gate.jpg")
jQuery:
$.cloudinary.image("golden_gate.jpg", {transformation: [
  {if: "fc_gte_1"},
  {width: 200, height: 200, gravity: "face", crop: "thumb"},
  {if: "else", width: 200, height: 200, crop: "fit"},
  {if: "end"},
  {radius: 40, border: "4px_solid_black"}
  ]})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .If("fc_gte_1").Chain()
  .Width(200).Height(200).Gravity("face").Crop("thumb").Chain()
  .If("else").Width(200).Height(200).Crop("fit").Chain()
  .If("end").Chain()
  .Radius(40).Border("4px_solid_black")).BuildImageTag("golden_gate.jpg")
Original image (scaled down) → An image without a face detected is delivered as large as possible

As can be seen in the example images above, a different transformation is applied to the image depending on whether the original image contains a face.

Making sure your site looks good can be challenging if you don't know in advance what kind of images your users (or editors) upload. The images will likely vary in aspect ratios, dimensions, and other characteristics, and the graphic design of your site probably requires different image dimensions and effects for different kinds of image. For example:

  • If you need to place an image in a space that has more width available than height, some portrait images will probably look too small if you just resize all images to fit within the space available. You would want to apply different resizing and cropping depending on the image's aspect ratio.
  • If the original images are very small, you would want to handle them differently from images that are larger than the available display space.
  • If faces appear in the image, you may want to manipulate the image in a different way to images without faces present (e.g. use a different artistic effect, different cropping dimensions, add overlays, etc).

The solution: conditional transformations

With Cloudinary, images are manipulated on-the-fly using dynamic delivery URLs for any uploaded image. A condition and its associated transformations are added to the URLs using the if parameter which accepts a string value detailing the condition to evaluate. You can apply a transformation based on the image's width, height, aspect ratio, the number of faces in the image (if any) or the number of frames (for animated images) or pages (for PDFs) present. For example, to evaluate whether the image's width is greater than 500 pixels: if_w_gt_500. Multiple conditions can be evaluated by concatenating the conditions with an and or or operator, and a different transformation can be applied in the case that the condition is evaluated as negative by using the if_else parameter.

The following examples highlight some use cases for conditional transformations. For more comprehensive information and details on the available options, see the documentation on Conditional transformations.

Example 1: Conditional text overlay

You have allocated space on your page for a user uploaded image with a width and height of 300 pixels. If the original image needs to be cropped in order to fill in the required space (i.e. either the width or height is greater than 300 pixels) you also want to add a text overlay with the disclaimer "This image has been cropped to fit":

Ruby:
cl_image_tag("dog.jpg", :transformation=>[
  {:if=>"w_gt_300_or_h_gt_300"},
  {:overlay=>"text:Verdana_90_bold:This%20image%20has%20been%20cropped%20to%20fit", :gravity=>"south", :y=>40},
  {:if=>"end"},
  {:border=>"5px_solid_green", :radius=>30, :width=>300, :height=>300, :crop=>"fill"}
  ])
PHP:
cl_image_tag("dog.jpg", array("transformation"=>array(
  array("if"=>"w_gt_300_or_h_gt_300"),
  array("overlay"=>"text:Verdana_90_bold:This%20image%20has%20been%20cropped%20to%20fit", "gravity"=>"south", "y"=>40),
  array("if"=>"end"),
  array("border"=>"5px_solid_green", "radius"=>30, "width"=>300, "height"=>300, "crop"=>"fill")
  )))
Python:
CloudinaryImage("dog.jpg").image(transformation=[
  {"if": "w_gt_300_or_h_gt_300"},
  {"overlay": "text:Verdana_90_bold:This%20image%20has%20been%20cropped%20to%20fit", "gravity": "south", "y": 40},
  {"if": "end"},
  {"border": "5px_solid_green", "radius": 30, "width": 300, "height": 300, "crop": "fill"}
  ])
Node.js:
cloudinary.image("dog.jpg", {transformation: [
  {if: "w_gt_300_or_h_gt_300"},
  {overlay: "text:Verdana_90_bold:This%20image%20has%20been%20cropped%20to%20fit", gravity: "south", y: 40},
  {if: "end"},
  {border: "5px_solid_green", radius: 30, width: 300, height: 300, crop: "fill"}
  ]})
Java:
cloudinary.url().transformation(new Transformation()
  .if("w_gt_300_or_h_gt_300").chain()
  .overlay("text:Verdana_90_bold:This%20image%20has%20been%20cropped%20to%20fit").gravity("south").y(40).chain()
  .if("end").chain()
  .border("5px_solid_green").radius(30).width(300).height(300).crop("fill")).imageTag("dog.jpg")
jQuery:
$.cloudinary.image("dog.jpg", {transformation: [
  {if: "w_gt_300_or_h_gt_300"},
  {overlay: "text:Verdana_90_bold:This%20image%20has%20been%20cropped%20to%20fit", gravity: "south", y: 40},
  {if: "end"},
  {border: "5px_solid_green", radius: 30, width: 300, height: 300, crop: "fill"}
  ]})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .If("w_gt_300_or_h_gt_300").Chain()
  .Overlay("text:Verdana_90_bold:This%20image%20has%20been%20cropped%20to%20fit").Gravity("south").Y(40).Chain()
  .If("end").Chain()
  .Border("5px_solid_green").Radius(30).Width(300).Height(300).Crop("fill")).BuildImageTag("dog.jpg")
If original image is bigger than 300x300 then text overlay added

Example 2: Conditional blurred background

If you allocate space on your page for an image with a width of 500px, you can conditionally add a blurred background for images that have a width less than 500px:

Ruby:
cl_image_tag("small_koala.jpg", :transformation=>[
  {:if=>"w_lt_500"},
  {:overlay=>"text:Arial_20_bold:Image%20shown%20in%20full%20scale", :y=>5, :x=>5, :color=>"white", :gravity=>"south_east"},
  {:effect=>"blur:400", :underlay=>"small_koala", :width=>500, :crop=>"scale"},
  {:if=>"end"},
  {:border=>"5px_solid_black", :radius=>30}
  ])
PHP:
cl_image_tag("small_koala.jpg", array("transformation"=>array(
  array("if"=>"w_lt_500"),
  array("overlay"=>"text:Arial_20_bold:Image%20shown%20in%20full%20scale", "y"=>5, "x"=>5, "color"=>"white", "gravity"=>"south_east"),
  array("effect"=>"blur:400", "underlay"=>"small_koala", "width"=>500, "crop"=>"scale"),
  array("if"=>"end"),
  array("border"=>"5px_solid_black", "radius"=>30)
  )))
Python:
CloudinaryImage("small_koala.jpg").image(transformation=[
  {"if": "w_lt_500"},
  {"overlay": "text:Arial_20_bold:Image%20shown%20in%20full%20scale", "y": 5, "x": 5, "color": "white", "gravity": "south_east"},
  {"effect": "blur:400", "underlay": "small_koala", "width": 500, "crop": "scale"},
  {"if": "end"},
  {"border": "5px_solid_black", "radius": 30}
  ])
Node.js:
cloudinary.image("small_koala.jpg", {transformation: [
  {if: "w_lt_500"},
  {overlay: "text:Arial_20_bold:Image%20shown%20in%20full%20scale", y: 5, x: 5, color: "white", gravity: "south_east"},
  {effect: "blur:400", underlay: "small_koala", width: 500, crop: "scale"},
  {if: "end"},
  {border: "5px_solid_black", radius: 30}
  ]})
Java:
cloudinary.url().transformation(new Transformation()
  .if("w_lt_500").chain()
  .overlay("text:Arial_20_bold:Image%20shown%20in%20full%20scale").y(5).x(5).color("white").gravity("south_east").chain()
  .effect("blur:400").underlay("small_koala").width(500).crop("scale").chain()
  .if("end").chain()
  .border("5px_solid_black").radius(30)).imageTag("small_koala.jpg")
jQuery:
$.cloudinary.image("small_koala.jpg", {transformation: [
  {if: "w_lt_500"},
  {overlay: "text:Arial_20_bold:Image%20shown%20in%20full%20scale", y: 5, x: 5, color: "white", gravity: "south_east"},
  {effect: "blur:400", underlay: "small_koala", width: 500, crop: "scale"},
  {if: "end"},
  {border: "5px_solid_black", radius: 30}
  ]})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .If("w_lt_500").Chain()
  .Overlay("text:Arial_20_bold:Image%20shown%20in%20full%20scale").Y(5).X(5).Color("white").Gravity("south_east").Chain()
  .Effect("blur:400").Underlay("small_koala").Width(500).Crop("scale").Chain()
  .If("end").Chain()
  .Border("5px_solid_black").Radius(30)).BuildImageTag("small_koala.jpg")
if original is smaller than 500px then blurred underlay added

Example 3: Conditional overlay placement based on detected faces

Say you want to add an overlay of the golden_star image over all the detected faces in the image; if no faces are detected, then you want to place the overlay in the bottom right corner:

  • Faces detected - the overlay is placed over the detected faces:

Ruby:
cl_image_tag("bicycle.jpg", :transformation=>[
  {:width=>400},
  {:if=>"faces_gte_1"},
  {:overlay=>"golden_star", :width=>1.1, :flags=>"region_relative", :gravity=>"faces"},
  {:if=>"else"},
  {:overlay=>"golden_star", :width=>100, :gravity=>"south_east"},
  {:if=>"end"},
  {:border=>"7px_solid_grey", :radius=>30}
  ])
PHP:
cl_image_tag("bicycle.jpg", array("transformation"=>array(
  array("width"=>400),
  array("if"=>"faces_gte_1"),
  array("overlay"=>"golden_star", "width"=>1.1, "flags"=>"region_relative", "gravity"=>"faces"),
  array("if"=>"else"),
  array("overlay"=>"golden_star", "width"=>100, "gravity"=>"south_east"),
  array("if"=>"end"),
  array("border"=>"7px_solid_grey", "radius"=>30)
  )))
Python:
CloudinaryImage("bicycle.jpg").image(transformation=[
  {"width": 400},
  {"if": "faces_gte_1"},
  {"overlay": "golden_star", "width": 1.1, "flags": "region_relative", "gravity": "faces"},
  {"if": "else"},
  {"overlay": "golden_star", "width": 100, "gravity": "south_east"},
  {"if": "end"},
  {"border": "7px_solid_grey", "radius": 30}
  ])
Node.js:
cloudinary.image("bicycle.jpg", {transformation: [
  {width: 400},
  {if: "faces_gte_1"},
  {overlay: "golden_star", width: 1.1, flags: "region_relative", gravity: "faces"},
  {if: "else"},
  {overlay: "golden_star", width: 100, gravity: "south_east"},
  {if: "end"},
  {border: "7px_solid_grey", radius: 30}
  ]})
Java:
cloudinary.url().transformation(new Transformation()
  .width(400).chain()
  .if("faces_gte_1").chain()
  .overlay("golden_star").width(1.1).flags("region_relative").gravity("faces").chain()
  .if("else").chain()
  .overlay("golden_star").width(100).gravity("south_east").chain()
  .if("end").chain()
  .border("7px_solid_grey").radius(30)).imageTag("bicycle.jpg")
jQuery:
$.cloudinary.image("bicycle.jpg", {transformation: [
  {width: 400},
  {if: "faces_gte_1"},
  {overlay: "golden_star", width: 1.1, flags: "region_relative", gravity: "faces"},
  {if: "else"},
  {overlay: "golden_star", width: 100, gravity: "south_east"},
  {if: "end"},
  {border: "7px_solid_grey", radius: 30}
  ]})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .Width(400).Chain()
  .If("faces_gte_1").Chain()
  .Overlay("golden_star").Width(1.1).Flags("region_relative").Gravity("faces").Chain()
  .If("else").Chain()
  .Overlay("golden_star").Width(100).Gravity("south_east").Chain()
  .If("end").Chain()
  .Border("7px_solid_grey").Radius(30)).BuildImageTag("bicycle.jpg")
overlay placed over detected faces

  • No faces detected - the overlay is placed in the bottom right corner:

Ruby:
cl_image_tag("leather_bag.jpg", :transformation=>[
  {:width=>400},
  {:if=>"faces_gte_1"},
  {:overlay=>"golden_star", :width=>1.1, :flags=>"region_relative", :gravity=>"faces"},
  {:if=>"else"},
  {:overlay=>"golden_star", :width=>100, :gravity=>"south_east"},
  {:if=>"end"},
  {:border=>"7px_solid_grey", :radius=>30}
  ])
PHP:
cl_image_tag("leather_bag.jpg", array("transformation"=>array(
  array("width"=>400),
  array("if"=>"faces_gte_1"),
  array("overlay"=>"golden_star", "width"=>1.1, "flags"=>"region_relative", "gravity"=>"faces"),
  array("if"=>"else"),
  array("overlay"=>"golden_star", "width"=>100, "gravity"=>"south_east"),
  array("if"=>"end"),
  array("border"=>"7px_solid_grey", "radius"=>30)
  )))
Python:
CloudinaryImage("leather_bag.jpg").image(transformation=[
  {"width": 400},
  {"if": "faces_gte_1"},
  {"overlay": "golden_star", "width": 1.1, "flags": "region_relative", "gravity": "faces"},
  {"if": "else"},
  {"overlay": "golden_star", "width": 100, "gravity": "south_east"},
  {"if": "end"},
  {"border": "7px_solid_grey", "radius": 30}
  ])
Node.js:
cloudinary.image("leather_bag.jpg", {transformation: [
  {width: 400},
  {if: "faces_gte_1"},
  {overlay: "golden_star", width: 1.1, flags: "region_relative", gravity: "faces"},
  {if: "else"},
  {overlay: "golden_star", width: 100, gravity: "south_east"},
  {if: "end"},
  {border: "7px_solid_grey", radius: 30}
  ]})
Java:
cloudinary.url().transformation(new Transformation()
  .width(400).chain()
  .if("faces_gte_1").chain()
  .overlay("golden_star").width(1.1).flags("region_relative").gravity("faces").chain()
  .if("else").chain()
  .overlay("golden_star").width(100).gravity("south_east").chain()
  .if("end").chain()
  .border("7px_solid_grey").radius(30)).imageTag("leather_bag.jpg")
jQuery:
$.cloudinary.image("leather_bag.jpg", {transformation: [
  {width: 400},
  {if: "faces_gte_1"},
  {overlay: "golden_star", width: 1.1, flags: "region_relative", gravity: "faces"},
  {if: "else"},
  {overlay: "golden_star", width: 100, gravity: "south_east"},
  {if: "end"},
  {border: "7px_solid_grey", radius: 30}
  ]})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .Width(400).Chain()
  .If("faces_gte_1").Chain()
  .Overlay("golden_star").Width(1.1).Flags("region_relative").Gravity("faces").Chain()
  .If("else").Chain()
  .Overlay("golden_star").Width(100).Gravity("south_east").Chain()
  .If("end").Chain()
  .Border("7px_solid_grey").Radius(30)).BuildImageTag("leather_bag.jpg")
overlay placed in bottom right corner

Example 4: Conditional cropping based on aspect ratio

If you want to include as much of an image as possible, how you crop and deliver the image can depend on whether the original image is in landscape or portrait format.

  • Original (scaled down) images of bike and happy_dog:
bike.jpg happy_dog.jpg
  • Both images cropped to a width of 300 pixels and a height of 200 pixels results in a bad crop for the portrait image:
300x200 bike.jpg 300x200 happy_dog.jpg
  • Both images cropped to a width of 200 pixels and a height of 300 pixels results in a bad crop for the landscape image:
200x300 bike.jpg 200x300 happy_dog.jpg

By first checking the aspect ratio of the image and then applying a transformation accordingly, it becomes easier to fit the image into the graphic design of your site and make sure you include the relevant parts of an image. In this case, if the aspect ratio is greater than 1:1 (landscape) we want to deliver the image with a width of 300 pixels and a height of 200 pixels. On the other hand if the aspect ratio is less than 1:1 (portrait) we want to deliver the image with a width of 200 pixels and a height of 300 pixels:

  • Images cropped according to their aspect ratio: landscape images are cropped to a width of 300 pixels and a height of 200 pixels, and portrait images are cropped to a width of 200 pixels and a height of 300 pixels:
    Ruby:
    cl_image_tag("bike.jpg", :transformation=>[
      {:if=>"ar_gt_1:1"},
      {:width=>300, :height=>200, :crop=>"fill"},
      {:if=>"else", :width=>200, :height=>300, :crop=>"fill"},
      {:if=>"end"},
      {:border=>"7px_solid_grey", :radius=>30}
      ])
    PHP:
    cl_image_tag("bike.jpg", array("transformation"=>array(
      array("if"=>"ar_gt_1:1"),
      array("width"=>300, "height"=>200, "crop"=>"fill"),
      array("if"=>"else", "width"=>200, "height"=>300, "crop"=>"fill"),
      array("if"=>"end"),
      array("border"=>"7px_solid_grey", "radius"=>30)
      )))
    Python:
    CloudinaryImage("bike.jpg").image(transformation=[
      {"if": "ar_gt_1:1"},
      {"width": 300, "height": 200, "crop": "fill"},
      {"if": "else", "width": 200, "height": 300, "crop": "fill"},
      {"if": "end"},
      {"border": "7px_solid_grey", "radius": 30}
      ])
    Node.js:
    cloudinary.image("bike.jpg", {transformation: [
      {if: "ar_gt_1:1"},
      {width: 300, height: 200, crop: "fill"},
      {if: "else", width: 200, height: 300, crop: "fill"},
      {if: "end"},
      {border: "7px_solid_grey", radius: 30}
      ]})
    Java:
    cloudinary.url().transformation(new Transformation()
      .if("ar_gt_1:1").chain()
      .width(300).height(200).crop("fill").chain()
      .if("else").width(200).height(300).crop("fill").chain()
      .if("end").chain()
      .border("7px_solid_grey").radius(30)).imageTag("bike.jpg")
    jQuery:
    $.cloudinary.image("bike.jpg", {transformation: [
      {if: "ar_gt_1:1"},
      {width: 300, height: 200, crop: "fill"},
      {if: "else", width: 200, height: 300, crop: "fill"},
      {if: "end"},
      {border: "7px_solid_grey", radius: 30}
      ]})
    .Net:
    cloudinary.Api.UrlImgUp.Transform(new Transformation()
      .If("ar_gt_1:1").Chain()
      .Width(300).Height(200).Crop("fill").Chain()
      .If("else").Width(200).Height(300).Crop("fill").Chain()
      .If("end").Chain()
      .Border("7px_solid_grey").Radius(30)).BuildImageTag("bike.jpg")
    landscape image cropped to fill a 300x200 pixel space
    Ruby:
    cl_image_tag("happy_dog.jpg", :transformation=>[
      {:if=>"ar_gt_1:1"},
      {:width=>300, :height=>200, :crop=>"fill"},
      {:if=>"else", :width=>200, :height=>300, :crop=>"fill"},
      {:if=>"end"},
      {:border=>"7px_solid_grey", :radius=>30}
      ])
    PHP:
    cl_image_tag("happy_dog.jpg", array("transformation"=>array(
      array("if"=>"ar_gt_1:1"),
      array("width"=>300, "height"=>200, "crop"=>"fill"),
      array("if"=>"else", "width"=>200, "height"=>300, "crop"=>"fill"),
      array("if"=>"end"),
      array("border"=>"7px_solid_grey", "radius"=>30)
      )))
    Python:
    CloudinaryImage("happy_dog.jpg").image(transformation=[
      {"if": "ar_gt_1:1"},
      {"width": 300, "height": 200, "crop": "fill"},
      {"if": "else", "width": 200, "height": 300, "crop": "fill"},
      {"if": "end"},
      {"border": "7px_solid_grey", "radius": 30}
      ])
    Node.js:
    cloudinary.image("happy_dog.jpg", {transformation: [
      {if: "ar_gt_1:1"},
      {width: 300, height: 200, crop: "fill"},
      {if: "else", width: 200, height: 300, crop: "fill"},
      {if: "end"},
      {border: "7px_solid_grey", radius: 30}
      ]})
    Java:
    cloudinary.url().transformation(new Transformation()
      .if("ar_gt_1:1").chain()
      .width(300).height(200).crop("fill").chain()
      .if("else").width(200).height(300).crop("fill").chain()
      .if("end").chain()
      .border("7px_solid_grey").radius(30)).imageTag("happy_dog.jpg")
    jQuery:
    $.cloudinary.image("happy_dog.jpg", {transformation: [
      {if: "ar_gt_1:1"},
      {width: 300, height: 200, crop: "fill"},
      {if: "else", width: 200, height: 300, crop: "fill"},
      {if: "end"},
      {border: "7px_solid_grey", radius: 30}
      ]})
    .Net:
    cloudinary.Api.UrlImgUp.Transform(new Transformation()
      .If("ar_gt_1:1").Chain()
      .Width(300).Height(200).Crop("fill").Chain()
      .If("else").Width(200).Height(300).Crop("fill").Chain()
      .If("end").Chain()
      .Border("7px_solid_grey").Radius(30)).BuildImageTag("happy_dog.jpg")
    portrait image cropped to fill a 200x300 pixel space

Summary

Conditional transformations are a powerful feature that goes beyond simple dynamic image manipulation and can produce completely different results depending on the attributes of the original image as well as on the results of any applied manipulations. This capability has been requested by many of our customers and solves plenty of advanced use cases that they have encountered. For more details, see the documentation on Conditional transformations, and note that the feature is available for use with all Cloudinary accounts, including the free tier.

Hubert 'depesz' Lubaczewski: Waiting for 9.6 – Support \crosstabview in psql

From Planet PostgreSQL. Published on Apr 27, 2016.

On 8th of April, Alvaro Herrera committed patch: Support \crosstabview in psql   \crosstabview is a completely different way to display results from a query: instead of a vertical display of rows, the data values are placed in a grid where the column and row headers come from the data itself, similar to a spreadsheet. […]

Pavel Golub: POSTGRES.chat

From Planet PostgreSQL. Published on Apr 27, 2016.

If you ever need quick help with PostgreSQL from other people (sure, you should check manuals first) just remember one link POSTGRES.chat.

On this page you will find links to online chats, e.g. Slack, Gitter.im, Telegram, IRC etc.

Thanks to PostgreSQL in Russia for the hint.


Filed under: PostgreSQL Tagged: community, development, PostgreSQL, support

Josh Berkus: Join us for the 3rd pgCon User Unconference

From Planet PostgreSQL. Published on Apr 26, 2016.

This year, we're continuing to experiment with new formats for the pgCon unconference.  In 2013 and 2014 we had an Unconference on the Saturday of pgCon.  In 2015 we had a limited Developer Unconference on Wednesday.

This year, we will have a Developer Unconference on Wednesday, and a User Unconference on Saturday.  We're doing this because people were disappointed that we didn't do the User Unconference last year, and asked us to bring it back.  So, hopefully you planned to stay over Saturday!

The User Unconference has several purposes:

  • to give various teams and special interest groups an opportunity to schedule something
  • to let folks whose technology was released too late for the CfP another chance to present something
  • to continue discussions started around talks in the main program
So, please join us!  And if you have ideas for User Unconference sessions which you want to make sure get on the program, please list them on the wiki page using the template provided.  Note that final sessions will be chosen at 10am Saturday morning, though.


Hans-Juergen Schoenig: Migrating from Oracle: One word about pg_type

From Planet PostgreSQL. Published on Apr 26, 2016.

After spending pretty much the entire week on freeing a customer from the chains of Oracle’s license policy, I found an interesting issue, which made me smile: When the Oracle code was built 15 years ago or so, somebody created a table called “pg_type” on the Oracle side as part of the application. With only […]

The post Migrating from Oracle: One word about pg_type appeared first on Cybertec - The PostgreSQL Database Company.

Bruce Momjian: Non-Relational Postgres

From Planet PostgreSQL. Published on Apr 25, 2016.

Having worked on Postgres for twenty years, I have seen a lot of features added. However, I usually evaluate new features in isolation and rarely see the big picture of how they fit together. Last month I created a new talk, Non-Relational Postgres, which presents a comprehensive overview of non-relational Postgres storage.

The consistent use of @> (containment operator) listed on slide 70 is interesting. It also has good coverage of how character strings can be split apart and indexed.

Joshua Drake: You are my fellow community member

From Planet PostgreSQL. Published on Apr 25, 2016.

I attended the fantastically presented PgConf US 2016 last week. An amazing conference, my training was well attended, my talk was at capacity, the 20th Anniversary Party was phenomenal and the conference raised money for an excellent cause. There were over 435 attendees, giving our brothers and sisters at PgConf EU something to work for during their conference in November.

While attending the hallway track, I was talking to a gentleman whose name escapes me. He asked me how he could contribute to the community. I am always excited to have that conversation because we are able to discuss all kinds of different ways to contribute, whether it be social (user groups, pgday, speaking at alternative conferences), documentation, code, or infrastructure. Bringing people to the community truly inspires me to continue building our community into something special.

Then he said, "I talked to X and he said, 'Why would I tell you that, you are my competitor?'" I was stunned. It is a sad and unacceptable ideology to hold within any open source community. It is an ideal that should be left to bad business practices from past decades. It is a horrible statement to make to any potential community member and speaks to a less than stellar understanding of how our community works.

If any person asks you how they can help with .Org, you should answer them honestly and enthusiastically. If you don't know how they can help, find a contributor and help the new person get connected. You should do this for any person looking to contribute, no matter who they work for (even if it is a competitor). Any other way of handling it is taking away from the community and is a testament to a selfishness that has no place here.

We are an inclusive community and that means we recognize something greater than ourselves. We recognize the power of a sustainable, productive .Org that is a collaborative space without the belittling behavior described in this post. I strongly hope that we can eliminate negative behavior such as this and become the most unified, bad-ass, and truly open source project around.

The feature I suggested to this individual was allowing pg_basebackup to use multiple connections (jobs) to pull a base backup, in a similar vein to how pg_dump and pg_restore support parallel jobs.

Florida Open Debate Site Powers First-Ever Crowd-Sourced Open Senate Debate

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Apr 25, 2016.

Florida Open Debate launched ahead of the upcoming, bi-partisan debate between candidates for the Florida Senate. The site, which crowdsources debate questions from the general public, was met with national acclaim. Citizens can not only submit questions, but also vote on which ones matter most. Caktus helped develop the tool on behalf of the Open Debate Coalition (ODC), a non-partisan organization dedicated to supporting participatory democracy through the use of civic tech.

The coalition formed during the 2008 presidential election to combat a sharp decline in voter participation as well as lack of representation in the debate arena. According to The Los Angeles Times “the job of designing and choosing questions is left to the media host.” The ODC recognized the need to increase participation from and provide access to the debate system by reaching as many American citizens as possible.

"The tool truly is an open forum for US citizens to participate in the political process,” says Ben Riesling, Caktus project manager. “Anyone can submit a question and vote on which questions should or should not be discussed. We’re extremely honored to be asked to participate in making this site available during this election season.”

The debate between Florida Senate candidate David Jolly (R) and Alan Grayson (D) takes place Monday, April 25th at 7:00 pm EDT and will be live streamed on the Florida Open Debate site itself.

Michael Paquier: Postgres 9.6 feature highlight: pg_rewind and timeline switches

From Planet PostgreSQL. Published on Apr 25, 2016.

pg_rewind, a utility introduced in PostgreSQL 9.5 that allows a former master to be reconnected to one of its promoted standbys without taking a new base backup, hence saving a huge amount of time, has gained something new in Postgres 9.6 with the following commit:

commit: e50cda78404d6400b1326a996a4fabb144871151
author: Teodor Sigaev <teodor@sigaev.ru>
date: Tue, 1 Dec 2015 18:56:44 +0300
Use pg_rewind when target timeline was switched

Allow pg_rewind to work when target timeline was switched. Now
user can return promoted standby to old master.

Target timeline history becomes a global variable. Index
in target timeline history is used in function interfaces instead of
specifying TLI directly. Thus, SimpleXLogPageRead() can easily start
reading XLOGs from next timeline when current timeline ends.

Author: Alexander Korotkov
Review: Michael Paquier

The following document gives a short introduction to pg_rewind. You may want to look at it before moving on with this post, to understand why the feature described here is a damn cool one. First, before entering into the details of this feature, let’s take the example of the following cluster:

                            Node 1
                            /    \
                           /      \
                          /        \
                       Node 2   Node 3
                       /
                      /
                     /
                  Node 4

Node 1 is a master node, with two standbys directly connected to it, presented here as nodes 2 and 3. Finally node 4 is a cascading standby connected to node 2. Then say that all the nodes have been successively promoted. At promotion, a standby finishes recovery and jumps to a new timeline to identify the new series of WAL records generated by what is now a master node, avoiding any overlap with WAL records from another node. The promotion of each node in the cluster is done in the following order, and results in the following timelines being taken by each node:

  • Node 2 promotes, at the end of recovery it begins to use timeline 2. Node 4 jumps as well to timeline 2, and continues to follow node 2.
  • Node 3 promotes, it switches to timeline 3.
  • Node 4 promotes, it switches to timeline 4.

At this point all the nodes have become master nodes, and are now generating their own history and making their own way in life. In terms of WAL history, the cluster becomes as follows:

  ----------------------------------> Node 1 (timeline 1)
      \        \
       \        \
        \        \------------------> Node 3 (timeline 3)
         \
          \
           \------------------------> Node 2 (timeline 2)
                \
                 \
                  \-----------------> Node 4 (timeline 4)

With 9.5’s pg_rewind, when trying to rewind a node, or to put it in other words to synchronize a target node with a source node so that the target node can become a standby of the source node, it is not possible to perform complicated operations like in the case of the cluster described above. For example, one can use node 1 as target and node 2 as source so that node 1 is recycled as a standby of node 2. The same works for node 1 as target and node 3 as source, and similarly for node 4 as source and node 2 as target. Also, as a timeline history cannot be looked up backwards, it is not possible to rewind a promoted standby back to its former master. In short, when trying more complicated operations, the following error will quickly show up:

could not find common ancestor of the source and target cluster's timelines
Failure, exiting

Finally, what does the feature this blog post aims to describe actually do? Well, to put it simply, pg_rewind has been extended so that it can look at a timeline history graph like the one above. It is able to find the most recent shared point in the timeline history between a target and a source node, and then it performs a rewind operation from that point. In even shorter words, do you remember the four nodes of the same cluster promoted one after the other? In PostgreSQL 9.6, pg_rewind allows one to rebuild a complete cluster using all the existing nodes, without limitation, even if they have completely forked off. Any node can become the master, and have the other nodes as standbys. This absence of limitation is what makes this feature a really cool thing, and something to count on.

Álvaro Herrera: Column Store Plans

From Planet PostgreSQL. Published on Apr 25, 2016.

Over at pgsql-general, Bráulio Bhavamitra asks:

I wonder if there is any plans to move postgresql entirely to a columnar store (or at least make it an option), maybe for version 10?

This is a pretty interesting question. Completely replacing the current row-based store wouldn’t be a good idea: it has served us extremely well and I’m pretty sure that replacing it entirely with a columnar store would be disastrous performance-wise for OLTP use cases.

Some columns

Some columns. Picture courtesy of Yiming Sun on Flickr

That doesn’t mean columnar stores are a bad idea in general — because they aren’t. They just have a more limited use case than “the whole database”. For analytical queries on append-mostly data, a columnar store is a much more appropriate representation than the regular row-based store, but not all databases are analytical.

However, in order to attain interesting performance gains you need to do a lot more than just change the underlying storage: you need to ensure that the rest of the system can take advantage of the changed representation, so that it can execute queries optimally; for instance, you may want aggregates that operate in a SIMD mode rather than one-value-at-a-time as it is today. This, in itself, is a large undertaking, and there are other challenges too.

As it turns out, there’s a team at 2ndQuadrant working precisely on these matters. We posted a patch last year, but it wasn’t terribly interesting — it only made a single-digit percentage improvement in TPC-H scores; not enough to bother the development community with (it was a fairly invasive patch). We want more than that.

In our design, columnar or not is going to be an option: you’re going to be able to say Dear server, for this table kindly set up columnar storage for me, would you? Thank you very much. And then you’re going to get a table which may be slower for regular usage but which will rock for analytics. For most of your tables the current row-based store will still likely be the best option, because row-based storage is much better suited to the more general cases.

We don’t have a timescale yet. Stay tuned.

Django: Running management commands inside a Docker container

By Abu Ashraf Masnun from Django community aggregator: Community blog posts. Published on Apr 23, 2016.

Okay, so we have dockerized our django app and we need to run a manage.py command for some task. How do we do that? Simple, we have to locate the container that runs the django app, login and then run the command.

Locate The Container

It’s very likely that our app uses multiple containers to compose the entire system. For example, I have one container running MySQL, one container running Redis and another running the actual Django app. If we want to run manage.py commands, we have to login to the one that runs Django.

While our app is running, we can find the running docker containers using the docker ps command like this:

$ docker ps
CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS              PORTS                    NAMES
308f40bba888        crawler_testscript   "/sbin/my_init"          31 hours ago        Up 3 seconds        5000/tcp                 crawler_testscript_1
3a5ccc872215        crawler_web          "bash run_web.sh"        31 hours ago        Up 4 seconds        0.0.0.0:8000->8000/tcp   crawler_web_1
14f0e260fb2c        redis:latest         "/entrypoint.sh redis"   31 hours ago        Up 4 seconds        0.0.0.0:6379->6379/tcp   crawler_redis_1
252a7092870d        mysql:latest         "/entrypoint.sh mysql"   31 hours ago        Up 4 seconds        0.0.0.0:3306->3306/tcp   crawler_mysql_1

In my case, I am using Docker Compose and I know my Django app runs using the crawler_web image. So we note the name of the container. In the above example, that is – crawler_web_1.

Nice, now we know which container we have to login to.

Logging Into The Container

We use the name of the container to login to it, like this:

docker exec -it crawler_web_1 bash

The command above will connect us to the container and land us on a bash shell. Now we’re ready to run our command.

Running the command

We cd into the directory if necessary and then run the management command.

cd /project
python manage.py <command>

Summary

  • docker ps to list running containers and locate the one
  • docker exec -it [container_name] bash to login to the bash shell on that container
  • cd to the django project and run python manage.py [command]

gabrielle roth: PDXPUG April Meeting recap

From Planet PostgreSQL. Published on Apr 22, 2016.

What a great meeting last night! Made me feel a bit better about missing PgConf.US:)

We had a good turnout, too; several first-timers showed up, some hauling themselves all the way in from Nike & Intel.

Thanks to Eric Ferreira, AWS Database Engineer, and Tony Gibbs, AWS Solutions Architect, for coming all the way down from Seattle to give this excellent talk. Eric is the originator of Redshift (it’s based on Pg 8.0.4), and he shared with us some of the features & why he made some of the decisions he did.

Special thanks goes to Veronika Megler, RDS Professional Services team, for proposing the topic and arranging for Eric & Tony to come on down to PDX.

We’re already working on a return visit for late summer, so if you missed this meeting, you will get another chance.


Hubert 'depesz' Lubaczewski: Waiting for 9.6 – Phrase full text search.

From Planet PostgreSQL. Published on Apr 22, 2016.

On 7th of April, Teodor Sigaev committed patch: Phrase full text search.   Patch introduces new text search operator (<-> or <DISTANCE>) into tsquery. On-disk and binary in/out format of tsquery are backward compatible. It has two side effect: - change order for tsquery, so, users, who has a btree index over tsquery, should reindex […]

Shaun M. Thomas: PG Phriday: Growing Pains

From Planet PostgreSQL. Published on Apr 22, 2016.

Postgres is a great tool for most databases. Larger installations however, pretty much require horizontal scaling; addressing multi-TB tables relies on multiple parallel storage streams thanks to the laws of physics. It’s how all immense data stores work, and for a long time, Postgres really had no equivalent that wasn’t a home-grown shard management wrapper. To that end, we’ve been considering Postgres-XL as a way to fill that role. At first, everything was going well. Performance tests showed huge improvements, initial deployments uncovered no outright incompatibilities, and the conversion was underway.

Then we started to use it.

Much to our chagrin, dev and test environments aren’t always indicative of real scenarios and utilization patterns. Once Postgres-XL reached our semi-production environment, everything started to go wrong. Here’s a list of what happened, and why our 43TB (and growing) database can’t use Postgres-XL until it matures a bit more. As a note, all of the examples listed here were observed using the latest 9.5r1 release.

I also want to strongly emphasize that this is not an indictment of Postgres-XL. Postgres greatly needs the capabilities it adds to truly break into the VLDB market. Sometimes though, projects aren’t as capable as we originally hoped, and subtle issues aren’t always readily apparent until software is field-tested. We’re still not ruling it out.

That said, on to the adventure.

Trigger Support

In short, there isn’t any. This at least, was something we knew about. In our case, it wasn’t a problem because a VLDB installation is commonly a warehouse of some description. These tend to depend on ETL or other source scripts to supply data that’s already prepared and doesn’t rely on post-insert transformations.

For anyone else, this could be a roadblock. Trigger support isn’t always essential to running an active database, but triggers are common enough that a conversion process will have to contend with replacing them. This missing feature practically relegates Postgres-XL to warehouse use all by itself. Indeed, like us, many may consider triggers unnecessary in a horizontally distributed database, or NoSQL databases would never have garnered so much attention.

But it is a missing feature a Postgres user would expect. This is the first lesson that Postgres-XL is a variant of Postgres with similar functionality, but it isn’t Postgres.

SAVEPOINT Support

Again, there isn’t any. This is one of those things we only found after scripts started failing. Savepoints allow transactions to retain integrity through several phases if the state and progress are verified. If something unexpected happens, the whole transaction doesn’t have to be abandoned outright. Clever database developers use this as something of a state machine to make multiple attempts, or perform alternate actions in case of failures within a transaction.

Perhaps due to internal transaction implementation details, Postgres-XL can’t retain this functionality. Distributing transactions and involving a transaction manager while also maintaining a state machine within it, is definitely not a trivial problem. That makes this a completely understandable omission, and since the feature is rarely utilized, many won’t even notice it’s missing.

Yet it was something we had to route around. Another thing to keep in mind when converting to Postgres-XL.

Sub-queries and DISTINCT ON Woes

This one caught us totally off-guard. The documentation on SELECT doesn’t seem to mention this, but it is indeed an issue with Postgres-XL. Observe:

CREATE TABLE foo (id INT);
CREATE TABLE bar (id INT, nid INT);
 
INSERT INTO foo
SELECT a.id FROM generate_series(1,100) a(id);
 
INSERT INTO bar
SELECT DISTINCT ON (f1.id) f1.id, f2.id
  FROM foo f1
  JOIN foo f2 USING (id)
 ORDER BY f1.id, f2.id DESC;
 
ERROR:  Postgres-XL does NOT currently support ORDER BY IN subqueries

But is that error accurate? If we remove the INSERT preamble so the query executes on its own, we merely see the expected output of rows. So even if the DISTINCT ON is being internally transformed into a sub-query, it isn’t doing so consistently. The error is wrong anyway, because this works just fine:

INSERT INTO bar
SELECT a.id, b.id
  FROM foo a
  JOIN (SELECT id FROM foo ORDER BY id DESC) b USING (id);

Not only is that an ordered sub-query, it’s a JOIN on an ordered sub-query, an arguably more complex scenario. I can only assume this was missing functionality, but in implementing the necessary code changes to remove the warning, they missed a spot.

In any case, this really threw a wrench into our functions that acted as ETL components. We took the error at face value, and groaned at trying to rewrite DISTINCT ON without using sub-queries or ORDER BY. Ironically an even more convoluted query came to our rescue:

INSERT INTO bar
WITH sub AS (
  SELECT DISTINCT ON (f1.id) f1.id, f2.id
    FROM foo f1
    JOIN foo f2 USING (id)
   ORDER BY f1.id, f2.id DESC
)
SELECT * FROM sub;

This works because CTEs instantiate the results as a temporary table, which doesn’t trigger the error. Then the INSERT proceeds normally. Huh. Well, that’s a bit odd, but easily circumvented.

Then our luck ran out.

No TEMP TABLE For You!

Now we’re going from missing features and wonky implementations to outright bugs. The next issue that reared its ugly head was related to temporary tables. Well, maybe. We started seeing this in the logs:

ERROR:  could not open relation with OID 0

A quick Google led to this thread, which seemed to indicate a problem with temporary tables in stored procedures called two times consecutively within a session. Unfortunately the test cases provided in the thread do not replicate the problem on our installation. Regardless, there was a good chance it was related, so we removed temp tables anyway. Our solution leveraged UNLOGGED tables instead, because they have stable OIDs and were much less likely to invoke whatever edge case we were triggering. So our functions turned into this:

-- Functions like this caused the OID 0 errors.
 
CREATE OR REPLACE FUNCTION run_me()
RETURNS VOID AS
$$
BEGIN
  DROP TABLE IF EXISTS t_foo;
  CREATE TEMP TABLE t_foo (id INT);
 
  INSERT INTO t_foo
  SELECT a.id FROM generate_series(1,100) a(id);
END;
$$
LANGUAGE plpgsql;
 
-- Using UNLOGGED tables and functions like this worked fine.
-- Of course, now concurrent function calls won't work.
 
CREATE UNLOGGED TABLE t_foo (id INT);
 
CREATE OR REPLACE FUNCTION run_me()
RETURNS VOID AS
$$
BEGIN
  TRUNCATE TABLE t_foo;
 
  INSERT INTO t_foo
  SELECT a.id FROM generate_series(1,100) a(id);
END;
$$
LANGUAGE plpgsql;

And what would you know? Success! It wasn’t ideal, but we could limp along with the workaround until they fixed the bug. Or so we thought.

We Need ACID

Probably the most amusing part of this whole adventure is that the errors we kept encountering were almost immediately consecutive. Once we worked around one issue, we merely cleared the way for another completely unrelated problem. This time, we saw this error in the logs:

ERROR:  catalog is missing 25 attribute(s) for relid 36564

That’s not encouraging. Back to Google we went, giving us another thread detailing the problem. This time, the issue seems related to autovacuum somehow. It’s also probable that repeatedly truncating UNLOGGED tables wrought havoc on the Postgres system catalog, and Postgres-XL wasn’t properly propagating something. The theory at the end of the thread is unfortunately incorrect. Disabling autovacuum only reduces the bug’s prevalence, as the original reporter noted when I asked how to correct (or remove) the broken tables.

Regardless of the underlying cause, this was the final straw. Once those errors showed up, that meant the system catalog was corrupt. The UNLOGGED tables used by our functions were completely invalid, and worse, we couldn’t even drop them to start over. The only way out is a full dump/restore, a rather drastic measure because some ephemeral table became an actual Ghost in the Shell. And the real worry here—that the missing attributes would affect a real table—became the overriding concern.

Because the D in ACID stands for Durability, and this bug illustrates that the current incarnation of Postgres-XL has none. Database metadata is sacrosanct, because without it, all that remains is indecipherable binary blobs taking up disk space. There’s no way I’d dump 40TB of data on an installation when history suggests some of that data could mysteriously go missing.

The End of the Road

And that’s the real misfortune. No matter how much we desperately wanted to deploy Postgres-XL into our warehouse, we literally can’t. Doing so at this point would be irresponsible and dangerous. Some of the problems we encountered are understandable and workarounds exist for most. But the outright deluge of roaches in our sundae, with a rotten cherry on top, is impossible to ignore.

I have every confidence in 2ndQuadrant eventually resolving the worst roadblocks. Before Postgres-XC became Postgres-XL, it practically languished in Development Limbo, accumulating code smell and copious edge cases from managing node interactions. I’ll continue installing every subsequent release in our dev and QA environments until there’s a legitimate reliability breakthrough. I still maintain that Postgres-XL is the way forward for highly distributed horizontal VLDB installations.

Just… not yet.

Hubert 'depesz' Lubaczewski: Waiting for 9.6 – Add a \gexec command to psql for evaluation of computed queries.

From Planet PostgreSQL. Published on Apr 22, 2016.

On 4th of April, Tom Lane committed patch: Add a \gexec command to psql for evaluation of computed queries.   \gexec executes the just-entered query, like \g, but instead of printing the results it takes each field as a SQL command to send to the server. Computing a series of queries to be executed is […]

Leo Hsu and Regina Obe: PGConfUS 2016 PostGIS slides and tutorial material

From Planet PostgreSQL. Published on Apr 21, 2016.

We gave a PostGIS Intro Training and a PostGIS talk at PGConfUS 2016 in Brooklyn, New York and just got back. A number of people asked if we'd make the slides and material available. We have these posted on our presentation page: http://www.postgis.us/presentations and will be putting them on the PostgreSQL Wiki as well in due time. There will be a video coming along for the talk, but the training was not recorded.

We also have two more talks coming up in North Carolina in early May at FOSS4G NA 2016 - one on PostGIS Spatial Tricks, which has more of a GIS specialist focus than the top 10 talk we gave, but there will be some overlap. The other talk covers a topic a couple of people asked us about in training and after our talk: routing along constrained paths. If you are attending FOSS4G NA 2016, you won't want to miss our talk pgRouting: A Crash Course, which is also the topic of our upcoming book.

Just like FOSS4G NA 2015, there is a pgDay track which is PostgreSQL specific material, useful to a spatial crowd, but not PostGIS focused.


Continue reading "PGConfUS 2016 PostGIS slides and tutorial material"

How to create Custom User Model in Django?

By Micropyramid django from Django community aggregator: Community blog posts. Published on Apr 20, 2016.

Django provides built-in authentication which is good for most cases, but you may have needs that are not being served by the existing system. For example: you want 'email' for authentication purposes rather than Django's username field, and you want an extra field called 'display_name' as the full name of the logged-in user. To meet these requirements, we need to customize Django's built-in user model or substitute a completely customized model. In this blog post we'll learn how to customize Django's built-in User model. It is a good idea to extend the Django User model rather than writing a whole new user model; there are a lot of ways to do this, but one good way is to extend it from Django itself. We get all the features and properties of the Django User plus our own custom features on top of it.

Add the following to your app that holds custom user model.

In models.py 

from django.db import models
from django.contrib.auth.models import AbstractUser

class User(AbstractUser):
    name = models.CharField(max_length=100, blank=True, null=True)

and specify your custom user model in settings.py

AUTH_USER_MODEL = 'your_app.User'


If you want to use your custom user model in models.py, you can get that model with `settings.AUTH_USER_MODEL` or `get_user_model()` of Django's auth module.

   from django.conf import settings
   from django.contrib.auth import get_user_model

   class Book(models.Model):
        author = models.ForeignKey(settings.AUTH_USER_MODEL)

(OR)

   from django.contrib.auth import get_user_model

   User = get_user_model()

   class Book(models.Model):
        author = models.ForeignKey(User)

By extending Django's AbstractUser model we can specify our own username field by overriding 'USERNAME_FIELD'. By default the Django User model uses the 'username' field for authentication; we can set 'email' as the username field instead. Just keep the following lines in your User model to make users authenticate with an email address rather than Django's default username field (note that the email column must be unique for this to work):

class User(AbstractUser):
    name = models.CharField(max_length=100, blank=True, null=True)
    # The field used as USERNAME_FIELD must be unique at the database level.
    email = models.EmailField(unique=True)

    USERNAME_FIELD = 'email'
    # 'email' may not also appear in REQUIRED_FIELDS once it is the USERNAME_FIELD.
    REQUIRED_FIELDS = ['username']
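
With email as the USERNAME_FIELD, Django's default ModelBackend looks users up by their email address. The following is only a minimal sketch of what that looks like, assuming the model above and made-up credentials:

from django.contrib.auth import authenticate, get_user_model

User = get_user_model()
user = User.objects.create_user(username='mp', email='mp@example.com',
                                password='secret')

# ModelBackend resolves the "username" argument through USERNAME_FIELD,
# so the email address is what identifies the user here.
logged_in = authenticate(username='mp@example.com', password='secret')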

 

Autocomplete with Django-Haystack and Elasticsearch with single letter querying.

By Micropyramid django from Django community aggregator: Community blog posts. Published on Apr 20, 2016.

Django's Haystack provides autocomplete functionality. To do autocomplete effectively, the search backend (Elasticsearch in this case) uses n-grams (essentially a small window passed over the string), because this alters the way your data needs to be stored. We had two choices: NgramField and EdgeNgramField, used as n-grams in the search backend. The major drawback of the n-grams is that they need a minimum of 3 letters in the search query, and even then EdgeNgramField or NgramField won't produce consistent results. By customising Haystack's built-in Elasticsearch engine backend we can achieve single-letter querying with CharField itself.

In your search_indexes.py

from haystack import indexes
from myapp.models import Book

class BookIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    title = indexes.CharField(model_attr='title')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Book

Create backends.py in your app. And this file will contain the actual customized elasticsearch search engine backend.

from django.conf import settings
from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend
from haystack.backends.elasticsearch_backend import ElasticsearchSearchEngine
from haystack.backends.elasticsearch_backend import ElasticsearchSearchQuery
from haystack.constants import VALID_FILTERS, FILTER_SEPARATOR, DEFAULT_ALIAS
from haystack.inputs import Clean, Exact, PythonData, Raw
from django.utils import six

class CustomElasticsearchSearchQuery(ElasticsearchSearchQuery):

    def __init__(self, using=DEFAULT_ALIAS):
        super(CustomElasticsearchSearchQuery, self).__init__(using=using)

    def build_query_fragment(self, field, filter_type, value):
        from haystack import connections
        query_frag = ''

        if not hasattr(value, 'input_type_name'):
            # Handle when we've got a ``ValuesListQuerySet``...
            if hasattr(value, 'values_list'):
                value = list(value)

            if isinstance(value, six.string_types):
                # It's not an ``InputType``. Assume ``Clean``.
                value = Clean(value)
            else:
                value = PythonData(value)

        # Prepare the query using the InputType.
        prepared_value = value.prepare(self)

        if not isinstance(prepared_value, (set, list, tuple)):
            # Then convert whatever we get back to what pysolr wants if needed.
            prepared_value = self.backend._from_python(prepared_value)

        # 'content' is a special reserved word, much like 'pk' in
        # Django's ORM layer. It indicates 'no special field'.
        if field == 'content':
            index_fieldname = ''
        else:
            index_fieldname = u'%s:' % connections[self._using].get_unified_index().get_index_fieldname(field)

        filter_types = {
            'contains': u'*%s*',
            'startswith': u'%s*',
            'exact': u'%s',
            'gt': u'{%s TO *}',
            'gte': u'[%s TO *]',
            'lt': u'{* TO %s}',
            'lte': u'[* TO %s]',
        }

        if value.post_process is False:
            query_frag = prepared_value
        else:
            if filter_type in ['contains', 'startswith']:
                if value.input_type_name == 'exact':
                    query_frag = prepared_value
                else:
                    # Iterate over terms & incorporate the converted form of each into the query.
                    terms = []

                    if isinstance(prepared_value, six.string_types):
                        for possible_value in prepared_value.split(' '):
                            terms.append(filter_types[filter_type] % self.backend._from_python(possible_value))
                    else:
                        terms.append(filter_types[filter_type] % self.backend._from_python(prepared_value))

                    if len(terms) == 1:
                        query_frag = terms[0]
                    else:
                        query_frag = u"(%s)" % " AND ".join(terms)
            elif filter_type == 'in':
                in_options = []

                for possible_value in prepared_value:
                    in_options.append(u'"%s"' % self.backend._from_python(possible_value))

                query_frag = u"(%s)" % " OR ".join(in_options)
            elif filter_type == 'range':
                start = self.backend._from_python(prepared_value[0])
                end = self.backend._from_python(prepared_value[1])
                query_frag = u'["%s" TO "%s"]' % (start, end)
            elif filter_type == 'exact':
                if value.input_type_name == 'exact':
                    query_frag = prepared_value
                else:
                    prepared_value = Exact(prepared_value).prepare(self)
                    query_frag = filter_types[filter_type] % prepared_value
            else:
                if value.input_type_name != 'exact':
                    prepared_value = Exact(prepared_value).prepare(self)

                query_frag = filter_types[filter_type] % prepared_value

        if len(query_frag) and not isinstance(value, Raw):
            if not query_frag.startswith('(') and not query_frag.endswith(')'):
                query_frag = "(%s)" % query_frag

        return u"%s%s" % (index_fieldname, query_frag)


class ConfigurableElasticSearchEngine(ElasticsearchSearchEngine):
    query = CustomElasticsearchSearchQuery

 

In the above snippet, the 'contains': u'*%s*' entry is what allows single-letter querying and provides better results. Add the ConfigurableElasticSearchEngine to your settings.py:

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'myapp.backends.ConfigurableElasticSearchEngine',
        'URL': 'http://localhost:9200/',
        'INDEX_NAME': 'my_index',
    },
}

Now we are done with the customization. An example search with the customised search engine would look like:

SearchQuerySet().models(Book).filter_and(title=<query_string>)

Eg1: SearchQuerySet().models(Book).filter_and(title='m')
# Will give results of books with titles 'mcrop', 'micropyramid', 'mycropyramid' and all other titles which have the letter 'm' in them.

Eg2: SearchQuerySet().models(Book).filter_and(title='mi')
# Will give results of books with title 'micropyramid'. 

Note: Here __contains is optional, as Haystack 2.x added __contains as the default filter type.
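
To expose this as an autocomplete endpoint, a view only needs to pass the query string through to the SearchQuerySet. Below is a minimal sketch of such a view; the URL parameter, result limit and JSON shape are illustrative assumptions, not part of the original setup:

# views.py - hypothetical autocomplete endpoint built on the setup above
from django.http import JsonResponse
from haystack.query import SearchQuerySet

from myapp.models import Book

def book_autocomplete(request):
    query = request.GET.get('q', '')
    if not query:
        return JsonResponse({'results': []})
    # filter_and on the indexed title CharField works even with a single letter
    results = SearchQuerySet().models(Book).filter_and(title=query)[:10]
    titles = [result.title for result in results]
    return JsonResponse({'results': titles})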

Django REST Framework: Remember to disable Web Browsable API in Production

By Abu Ashraf Masnun from Django community aggregator: Community blog posts. Published on Apr 20, 2016.

So this is what happened – I built an url shortening service at work for internal use. It’s a very basic app – shortens urls and tracks clicks. Two models – URL and URLVisit. URL model contains the full url, slug for the short url, created time etc. URLVisit has information related to the click, like user IP, browser data, click time etc and a ForeignKey to URL as expected.

Two different apps were using this service, one from me, another from a different team. So I kept the Web Browsable API so the developers from other teams can try it out easily and they were very happy about it. The only job of this app was url shortening so I didn’t bother building a different home page. When people requested the / page on the domain, I would redirect them directly to /api/.

Things were going really great initially. There was not very heavy load on the service. Roughly 50-100 requests per second. I would call that minimal load. The server also had decent hardware and was running on an EC2 instance from AWS. nginx was on the front while the app was run with uwsgi. Everything was so smooth until it happened. After a month and half, we started noticing very poor performance of the server. Sometimes it was taking up to 40 seconds to respond. I started investigating.

It took me some time to find out what actually happened. By the time it happened, we had shortened more than a million urls. So when someone was visiting /api/url-visit/ – the web browsable api was trying to render the html form. The form allows the user to choose one of the entries from the URL model inside a select (dropdown). Rendering that page was causing 100% CPU usage and blocking / slowing down other requests. It’s not really DRF’s fault. If I tried to load a million entries into a select like that, it would crash the app too.

Even worse – remember I added a redirect from the home page, directly to the /api/ url? Search engine bots started crawling the urls. As a result the app became extremely slow and often unavailable to nginx. I initially thought I could stop the search engine crawls by adding a robots.txt or simply by adding authentication to the API. But developers from other teams would still visit the API from time to time to try out things and then make the app unresponsive. So I did what I had to – I disabled the web browsable API and added separate documentation demonstrating the use of the API with curl, PHP and Python.

I added the following snippet in my production settings file to only enable JSONRenderer for the API:

REST_FRAMEWORK = {
    'DEFAULT_RENDERER_CLASSES': (
        'rest_framework.renderers.JSONRenderer',
    )
}

Things have become pretty smooth afterwards. I can still enjoy the nice HTML interface locally where there are much fewer items. While on my production servers, there is no web browsable APIs to cause any bottlenecks.
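
If you still want the browsable API while developing locally, another option is to key the renderer list off DEBUG instead of removing the HTML renderer everywhere. A minimal sketch of that idea (the settings layout is assumed, not taken from the project described above):

# settings.py - enable the HTML renderer only when DEBUG is on
DEBUG = False  # True on a developer machine

REST_FRAMEWORK = {
    'DEFAULT_RENDERER_CLASSES': ('rest_framework.renderers.JSONRenderer',),
}

if DEBUG:
    REST_FRAMEWORK['DEFAULT_RENDERER_CLASSES'] += (
        'rest_framework.renderers.BrowsableAPIRenderer',
    )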

Django Generic many to many field implementation

By Micropyramid django from Django community aggregator: Community blog posts. Published on Apr 14, 2016.

The django-gm2m application provides a GM2MField that combines the features of the standard Django "ManyToManyField" and "GenericForeignKey".

Installation:

pip install django-gm2m

Django’s contenttypes framework (django.contrib.contenttypes) must be mentioned in INSTALLED_APPS.

INSTALLED_APPS = [
   ...
   'django.contrib.contenttypes',
   ...
   'gm2m',
]

Uses Of django-gm2m:
1. reverse relations
2. prefetching
3. Allows you to customize the deletion behaviour

Example Models:

from django.db import models

class Videos(models.Model):
      name = models.CharField(max_length=100)

class Articles(models.Model):
      name = models.CharField(max_length=100)

If you want to have a field for the preferred videos and articles of a User, you need to add GM2MField to the User model.

from gm2m import GM2MField

class User(models.Model):
      email = models.CharField(max_length=100)
      username = models.CharField(max_length=100)
      related = GM2MField()

Now you can add videos, articles to the related set.
Creating User object:

user = User.objects.create(email="mp@mp.com", username="mp@mp.com")

video = Videos.objects.create(name="video")

article = Articles.objects.create(name="sample document")

user.related.add(video)

user.related.add(article)

or you can add both videos, articles instances.

user.related = [video, article]

From a User instance, you can fetch the whole related set of videos and articles, or you can filter it by model using the "Model" or "Model__in" keywords.

list(user.related) # This provides list of model objects

list(user.related.filter(Model=Videos)) # This provides list of user related video instances

list(user.related.filter(Model__in=[Videos, Articles]))

1. Reverse Relations:

GM2MField provides automatic reverse relations when an instance is added.

We have seen how to access videos, articles of a given user.

Here we see how to access list of all the users with the given Video or Article instance.

list(video.user_set) # This provides list of user instances

With the user reverse relation, we can also use lookup chains in queries, and perform add, remove, clear and set operations:

[each.email for each in Video.objects.filter(user__email='mp@mp.com')]
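
The same reverse relation also accepts the usual modification calls mentioned above. A small sketch using the objects created earlier:

# detach and re-attach relations from the Videos side
video.user_set.remove(user)
video.user_set.add(user)
video.user_set.clear()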


We can also list the models related to GM2MField using get_related_models method

    User.related.get_related_models()

This will give the Articles and Videos models, as instances of them were already added to the user using the GM2MField.

2. Prefetching

GM2MField provides prefetching in the same way with ManyToManyField

    User.objects.all().prefetch_related('related')

3. Allows you to customize the deletion behaviour

    By default, when a source or target model instance is deleted, all relations linking this instance are deleted. It is possible to change this behavior with the on_delete, on_delete_src keyword arguments when creating the GM2MField.

    from gm2m.deletion import DO_NOTHING

    class User(models.Model):
          preferred_videos = GM2MField(Videos, 'Articles', on_delete=DO_NOTHING)

    But ensure that the database remains consistent after the deletion operation.

Factory Boy - An alternative for Fixtures

By Micropyramid django from Django community aggregator: Community blog posts. Published on Apr 14, 2016.

Factory Boy is a fixtures replacement tool. It allows you to use objects customized for the current test, while only declaring the test-specific fields.

Setting up your factories:

For testing the Django app, create a root directory named tests. Then create __init__.py, tests.py (to write your tests), factories.py (to write your model factories) in that directory.

So, your app layout looks like:

  myapp
    |- __init__.py
    |- admin.py
    |- forms.py
    |- models.py
    |- tests
        |- __init__.py
        |- factories.py
        |- tests.py
    |- urls.py
    |- views.py

Features of Factory Boy:

It allows you to automate much of the testing scenarios. Some examples are:

  1. Use a 'Sequence' object using a lambda function to dynamically create unique field values:
        EX: username = factory.Sequence(lambda n: 'user_%d' % n)

  2. Use a LazyAttribute object to pick from a range of choices for a defined field
        EX: choice = factory.LazyAttribute(lambda s: random.choice(dict(MyModel.CHOICES).keys()))

A basic factory: Factories declare a set of attributes used to instantiate an object. Here's a basic factory that generates auth.User instances:

# factories.py
import factory
from factory import DjangoModelFactory, lazy_attribute
from django.contrib.auth.models import User

class UserFactory(DjangoModelFactory):
    first_name = 'Shirisha'
    last_name = 'Gaddi'
    username = factory.Sequence(lambda n: 'user_%d' % n)
    email = lazy_attribute(lambda obj: obj.username + "@example.com")

    class Meta:
        model = User
        django_get_or_create = ('username',)

Calling factories.UserFactory() will perform the same task as a User.objects.create() call. However, it is possible to override the defined attributes by passing keyword arguments:

# Create a user instance by overriding first_name
user = factories.UserFactory(first_name="Karuna")

The factory's Meta has django_get_or_create set, which means the factory will call the Django built-in User.objects.get_or_create method.
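
In a test case the factory then replaces fixture loading entirely. A small sketch of how it might be used; the test names and assertions are illustrative only:

# tests.py
from django.test import TestCase

from . import factories

class UserFactoryTest(TestCase):
    def test_defaults(self):
        # DjangoModelFactory saves a real User row with the declared defaults.
        user = factories.UserFactory()
        self.assertTrue(user.username.startswith('user_'))
        self.assertTrue(user.email.endswith('@example.com'))

    def test_override(self):
        user = factories.UserFactory(first_name='Karuna')
        self.assertEqual(user.first_name, 'Karuna')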

Facial attribute detection with Microsoft's Face API

By Cloudinary Blog - Django from Django community aggregator: Community blog posts. Published on Apr 14, 2016.

Microsoft Face API face detection

Many of the photos displayed on the internet these days are of people. If your website or mobile application displays photos that include people, you will want to make sure that their faces are included in the delivered images when cropping and manipulating them to fit your graphic design and responsive layout. You may even want to further manipulate an image according to the faces present, for example, adding a harlequin mask overlay on all of their eyes, where each mask is adjusted to the correct size and orientation (although not a typical use case, it's a cool example of using advanced facial attribute detection):

Ruby:
cl_image_tag("cloudinary_team.jpg", :transformation=>[
  {:width=>700, :radius=>"max"},
  {:flags=>"region_relative", :gravity=>"adv_eyes", :overlay=>"harlequinmask", :width=>1.7}
  ])
PHP:
cl_image_tag("cloudinary_team.jpg", array("transformation"=>array(
  array("width"=>700, "radius"=>"max"),
  array("flags"=>"region_relative", "gravity"=>"adv_eyes", "overlay"=>"harlequinmask", "width"=>1.7)
  )))
Python:
CloudinaryImage("cloudinary_team.jpg").image(transformation=[
  {"width": 700, "radius": "max"},
  {"flags": "region_relative", "gravity": "adv_eyes", "overlay": "harlequinmask", "width": 1.7}
  ])
Node.js:
cloudinary.image("cloudinary_team.jpg", {transformation: [
  {width: 700, radius: "max"},
  {flags: "region_relative", gravity: "adv_eyes", overlay: "harlequinmask", width: 1.7}
  ]})
Java:
cloudinary.url().transformation(new Transformation()
  .width(700).radius("max").chain()
  .flags("region_relative").gravity("adv_eyes").overlay("harlequinmask").width(1.7)).imageTag("cloudinary_team.jpg")
jQuery:
$.cloudinary.image("cloudinary_team.jpg", {transformation: [
  {width: 700, radius: "max"},
  {flags: "region_relative", gravity: "adv_eyes", overlay: "harlequinmask", width: 1.7}
  ]})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .Width(700).Radius("max").Chain()
  .Flags("region_relative").Gravity("adv_eyes").Overlay("harlequinmask").Width(1.7)).BuildImageTag("cloudinary_team.jpg")
Cloudinary team with masks

Face Detection is a great feature that enables the automatic modification of images according to the detected faces within an image, making it simple to intelligently crop, position, resize and transform your images appropriately.

At Cloudinary we strive to create an enriched environment that can solve our customers' media asset related needs. By taking a holistic approach to image management, we create partnerships with leading companies developing image processing and media related technologies that can extend our internal features and capabilities and cater to customers with more “complex” needs. Our add-ons are services pre-integrated into Cloudinary that are tailored for developing, extending, and operating web and mobile apps.

We have recently partnered with Microsoft's Cognitive Services which provides a Face API for high precision face detection with state-of-the-art cloud-based algorithms. The Face API technology is fully integrated within our Advanced Facial Attributes Detection add-on that can do more than just detect the human faces in an image. The Advanced Facial Attribute Detection add-on can also extract meaningful advanced data about the face(s) in an image, including the exact location of facial features. This allows you even greater control over your image categorization, and to automatically use these details to smartly crop, position, rotate and overlay images based on the detected facial features.

Microsoft Cognitive Services logo

How to automatically detect facial attributes

Cloudinary supports uploading images using a cloud-based API. You can request further information while uploading the image by setting the detection parameter to adv_face when calling Cloudinary's upload API and Advanced Facial Attribute Detection is utilized to automatically extract detailed face attributes from the uploaded image. The detected faces are returned in the JSON response with rectangles (left, top, width and height) indicating the location of faces in the image in pixels, the exact position details of the eyes, mouth, eyebrows, nose and lips, as well as a series of face related attributes from each face such as pose, gender and age. See the Advanced Facial Attribute Detection documentation for more information. The code sample below uploads the lady image while requesting that the facial attributes are also returned in the JSON response:

Ruby:
Cloudinary::Uploader.upload("lady.jpg", 
              :detection => "adv_face")
PHP:
\Cloudinary\Uploader::upload("lady.jpg", 
              array(
               "detection": "adv_face"));
Python:
cloudinary.uploader.upload("lady.jpg", 
              detection = "adv_face")
Node.js:
cloudinary.uploader.upload("lady.jpg", 
              function(result) {console.log(result); }, { detection: "adv_face" });
Java:
cloudinary.uploader().upload("lady.jpg", 
              Cloudinary.asMap("detection", "adv_face"));

Ruby:
cl_image_tag("lady.jpg")
PHP:
cl_image_tag("lady.jpg")
Python:
CloudinaryImage("lady.jpg").image()
Node.js:
cloudinary.image("lady.jpg")
Java:
cloudinary.url().imageTag("lady.jpg")
jQuery:
$.cloudinary.image("lady.jpg")
.Net:
cloudinary.Api.UrlImgUp.BuildImageTag("lady.jpg")
Original uploaded image

The example JSON snippet below contains the results of the upload response when applying advanced facial attribute detection on the uploaded image. The response includes very detailed information regarding the face that was automatically detected in the image above:

{
...
 "info": 
  {"detection": 
    {"adv_face": 
      {"status": "complete",
       "data": 
        [{"bounding_box": 
           {"top": 234.0, "left": 216.0, "width": 244.0, "height": 244.0},
          "attributes": 
           {"smile": 0.649,
            "head_pose": {"pitch": 0.0, "roll": -6.7, "yaw": 0.6},
            "gender": "female",
            "age": 30.3,
            "facial_hair": {"moustache": 0.0, "beard": 0.0, 
                            "sideburns": 0.0}},
          "facial_landmarks": 
           {"mouth": 
             {"left": {"x": 277.1, "y": 410.6},
              "right": {"x": 410.2, "y": 395.1},
              "under_lip": 
               {"bottom": {"x": 352.8, "y": 445.0},
                "top": {"x": 349.1, "y": 431.8}},
              "upper_lip": 
               {"bottom": {"x": 346.3, "y": 415.1},
                "top": {"x": 345.4, "y": 407.5}}},
            "eyebrow": 
             {"left_outer": {"x": 224.1, "y": 298.0},
              "left_inner": {"x": 306.6, "y": 283.4},
              "right_inner": {"x": 361.4, "y": 279.8},
              "right_outer": {"x": 428.8, "y": 272.2}},
            "eye": 
             {"left_outer": {"x": 258.5, "y": 314.1},
              "left_top": {"x": 277.0, "y": 306.2},
              "left_bottom": {"x": 277.1, "y": 315.6},
              "left_inner": {"x": 296.4, "y": 312.9},
              "right_inner": {"x": 373.4, "y": 305.2},
              "right_top": {"x": 388.0, "y": 294.5},
              "right_bottom": {"x": 389.0, "y": 306.0},
              "right_outer": {"x": 406.5, "y": 300.2},
              "left_pupil": {"x": 278.3, "y": 309.4},
              "right_pupil": {"x": 386.0, "y": 298.7}},
            "nose": 
             {"tip": {"x": 341.6, "y": 381.6},
              "root_left": {"x": 321.9, "y": 314.6},
              "root_right": {"x": 343.6, "y": 311.8},
              "left_alar_top": {"x": 312.7, "y": 359.7},
              "right_alar_top": {"x": 359.2, "y": 351.2},
              "left_alar_out_tip": {"x": 305.4, "y": 380.4},
              "right_alar_out_tip": {"x": 374.3, "y": 367.5}}}}]}}},
}
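
If you want to drive your own logic from this data, the detection results can be read directly from the upload response returned by the Python SDK. A minimal sketch, assuming the cloudinary Python library is configured for your account (with the add-on enabled) and that the response has the structure shown above:

import cloudinary
import cloudinary.uploader

# Upload and request advanced facial attribute detection, as in the example above.
response = cloudinary.uploader.upload("lady.jpg", detection="adv_face")

# Read the first detected face out of the structure shown in the JSON snippet.
faces = response["info"]["detection"]["adv_face"]["data"]
if faces:
    box = faces[0]["bounding_box"]
    print("Face at (%s, %s), size %sx%s" % (box["left"], box["top"], box["width"], box["height"]))
    print("Estimated age: %s" % faces[0]["attributes"]["age"])

# Build a delivery URL that crops around the detected face(s) on the fly.
url = cloudinary.CloudinaryImage("lady.jpg").build_url(
    gravity="adv_face", width=300, height=300, crop="thumb")
print(url)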

Dynamic image manipulation with facial attribute detection

Cloudinary supports on-the-fly image manipulation using simple parameters in HTTP delivery URLs. Based on the position of facial attributes detected by the Advanced Facial Attribute Detection add-on, Cloudinary can crop your images to focus on the detected facial features, while providing a large set of image transformation and cropping options when using a Cloudinary delivery URL. To focus an automatic crop on the detected faces, simply set the crop parameter to thumb, fill or crop and the gravity parameter to adv_faces (set gravity to adv_face for focusing on the single largest detected face in the image). The resulting images are dynamically generated on-the-fly and the result is delivered via a fast CDN.

For example, to deliver a 300x300 thumbnail of the lady image shown above:

Ruby:
cl_image_tag("lady.jpg", :gravity=>"adv_face", :height=>300, :width=>300, :crop=>"thumb")
PHP:
cl_image_tag("lady.jpg", array("gravity"=>"adv_face", "height"=>300, "width"=>300, "crop"=>"thumb"))
Python:
CloudinaryImage("lady.jpg").image(gravity="adv_face", height=300, width=300, crop="thumb")
Node.js:
cloudinary.image("lady.jpg", {gravity: "adv_face", height: 300, width: 300, crop: "thumb"})
Java:
cloudinary.url().transformation(new Transformation().gravity("adv_face").height(300).width(300).crop("thumb")).imageTag("lady.jpg")
jQuery:
$.cloudinary.image("lady.jpg", {gravity: "adv_face", height: 300, width: 300, crop: "thumb"})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Gravity("adv_face").Height(300).Width(300).Crop("thumb")).BuildImageTag("lady.jpg")
300x300 thumbnail of lady.jpg

Cloudinary can also dynamically crop your images based on the position of detected eyes. Simply set the gravity parameter to adv_eyes (g_adv_eyes for URLs) to center the image on the detected eyes. The example below delivers a 200x60 thumbnail centered on the eyes:

Ruby:
cl_image_tag("lady.jpg", :gravity=>"adv_eyes", :width=>200, :height=>60, :crop=>"thumb")
PHP:
cl_image_tag("lady.jpg", array("gravity"=>"adv_eyes", "width"=>200, "height"=>60, "crop"=>"thumb"))
Python:
CloudinaryImage("lady.jpg").image(gravity="adv_eyes", width=200, height=60, crop="thumb")
Node.js:
cloudinary.image("lady.jpg", {gravity: "adv_eyes", width: 200, height: 60, crop: "thumb"})
Java:
cloudinary.url().transformation(new Transformation().gravity("adv_eyes").width(200).height(60).crop("thumb")).imageTag("lady.jpg")
jQuery:
$.cloudinary.image("lady.jpg", {gravity: "adv_eyes", width: 200, height: 60, crop: "thumb"})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Gravity("adv_eyes").Width(200).Height(60).Crop("thumb")).BuildImageTag("lady.jpg")
200x60 thumbnail centered on eyes

Thanks to the detailed information on the position of facial attributes detected by the Advanced Facial Attribute Detection add-on, Cloudinary can add overlays while taking into account the pose of the face, and automatically scale and rotate the overlay accordingly.

Ruby:
cl_image_tag("HarlequinMask.jpg", :width=>150)
PHP:
cl_image_tag("HarlequinMask.jpg", array("width"=>150))
Python:
CloudinaryImage("HarlequinMask.jpg").image(width=150)
Node.js:
cloudinary.image("HarlequinMask.jpg", {width: 150})
Java:
cloudinary.url().transformation(new Transformation().width(150)).imageTag("HarlequinMask.jpg")
jQuery:
$.cloudinary.image("HarlequinMask.jpg", {width: 150})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Width(150)).BuildImageTag("HarlequinMask.jpg")
Harlequin mask

For example, in order to automatically overlay the above image of a harlequin mask scaled to 170% relative to the detected eyes in the main image:

Ruby:
cl_image_tag("lady.jpg", :flags=>"region_relative", :gravity=>"adv_eyes", :overlay=>"HarlequinMask", :width=>1.7, :crop=>"scale")
PHP:
cl_image_tag("lady.jpg", array("flags"=>"region_relative", "gravity"=>"adv_eyes", "overlay"=>"HarlequinMask", "width"=>1.7, "crop"=>"scale"))
Python:
CloudinaryImage("lady.jpg").image(flags="region_relative", gravity="adv_eyes", overlay="HarlequinMask", width=1.7, crop="scale")
Node.js:
cloudinary.image("lady.jpg", {flags: "region_relative", gravity: "adv_eyes", overlay: "HarlequinMask", width: 1.7, crop: "scale"})
Java:
cloudinary.url().transformation(new Transformation().flags("region_relative").gravity("adv_eyes").overlay("HarlequinMask").width(1.7).crop("scale")).imageTag("lady.jpg")
jQuery:
$.cloudinary.image("lady.jpg", {flags: "region_relative", gravity: "adv_eyes", overlay: "HarlequinMask", width: 1.7, crop: "scale"})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Flags("region_relative").Gravity("adv_eyes").Overlay("HarlequinMask").Width(1.7).Crop("scale")).BuildImageTag("lady.jpg")
Harlequin masked face

See the Advanced Facial Attribute Detection documentation for more information on the features available with the add-on and how to use them.

Summary

The Advanced Facial Attribute Detection add-on, powered by Cloudinary and the Face API of Microsoft's Cognitive Services, provides a high-precision mechanism that can analyze images, create the best crop for most sites, and automatically add artistic effects through exact overlay placement. The integration within Cloudinary's pipeline is seamless and dynamic, using simple manipulation URLs.

We are excited to enter into this partnership with Microsoft's Cognitive Services, so give the add-on a try. The Advanced Facial Attribute Detection add-on is available on all our free and paid plans.

How to Create initial django migrations for existing DB schema.

By Micropyramid django from Django community aggregator: Community blog posts. Published on Apr 13, 2016.

Django has provided built-in database migrations since version 1.7, so we can avoid third-party packages like South. Adding migrations to new apps is straightforward: they come preconfigured to accept migrations, so you just run makemigrations once you've made some changes. But if your app already has models and database tables and doesn't have migrations yet (for example, because you created it against a previous Django version), or your migrations got messed up, you'll need to convert your app to use migrations. The following steps create initial migrations for your app:

Step 1: Empty the django_migrations table:

Go to your corresponding database terminal and delete all the records from the django_migrations table with:

delete from django_migrations;

Step 2: Remove all the files in the migrations folders of each app in your project:

Go to the terminal and remove all the migration files with:

rm -rf <app>/migrations/

Step 3: Reset the migrations for the "built-in" apps:

Reset all the migrations of Django's built-in apps, like admin, with the command:

python manage.py migrate --fake

Step 4: Create initial migrations for each and every app:

For each app run:

python manage.py makemigrations <app>

Note: Take care of dependencies (models with ForeignKeys should run after their parent models).

Step 5: The final step is to create fake initial migrations:

To create the initial fake migrations, just run:

python manage.py migrate --fake-initial

With the above five steps, initial migrations are created for the existing database schema and you can use Django's migration system normally from now on. To verify that the migrations succeeded, add a new field to any of the models and run python manage.py makemigrations; a migration file should be created in the corresponding app's migrations folder. Then run python manage.py migrate. If this succeeds, the steps above worked and you can enjoy the beauty of Django's migrations.

Check test coverage in code with Coveralls

By Micropyramid django from Django community aggregator: Community blog posts. Published on Apr 13, 2016.

Coverage: It is a tool used for measuring the effectiveness of tests, showing the percentage of your codebase covered by tests.

Test Coverage is an important indicator of software quality and an essential part of software maintenance. It helps in evaluating the effectiveness of testing by providing data on different coverage items. It is a useful tool for finding untested parts of a code base. Test coverage is also called code coverage in certain cases.

Test coverage can help in monitoring the quality of testing and assist in directing the test generators to create test cases that cover areas that have not been tested. It helps in determining a quantitative measure of Test coverage, which is an indirect measure of quality and identifies redundant test cases that do not increase coverage.

Benefits of Test Coverage:

Defect prevention in the project.
It helps identify where additional test cases are needed to increase coverage.
It helps in determining a quantitative measure of code coverage, which indirectly measures the quality of the application or product.

Installation: pip install coverage

Use "coverage run" to run your program and gather data:

coverage run manage.py test

coverage report -m  # provides the report for the tests

Tests for individual apps can be run using the command

coverage run --source=app1,app2 manage.py test
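
If you do not want to repeat the --source option on every run, coverage can also read its options from a .coveragerc file in the project root. A minimal sketch of such a config; the omit patterns below are just typical choices for a Django project, so adjust them to your layout:

[run]
source = .
omit =
    */migrations/*
    manage.py

[report]
show_missing = True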

Celery Flower to monitor task queue

By Micropyramid django from Django community aggregator: Community blog posts. Published on Apr 13, 2016.

Celery is a task queue built on an asynchronous message-passing system. It can be used as a bucket into which programming tasks can be dumped; the program that queued the task can continue to execute and remain responsive.

To monitor the status of these Celery tasks we use Celery Flower.

Celery Flower is a tool for monitoring your Celery tasks and workers. It's web based and allows you to see task progress, task details and worker status.

Install flower with pip

pip install flower

Install RabbitMQ

Celery requires a messaging agent in order to handle requests from an external source. This agent is referred to as a "broker".

RabbitMQ is the messaging server which powers this service. RabbitMQ does the following:

  • Listens for messages sent via the AMQP protocol.
  • Stores these messages in one or more queues.
  • Releases these messages to workers for consumption and processing.

RabbitMQ does not actually execute tasks. Rather, it is the mechanism by which tasks (“messages”) will be sent between instances and stored until executed.

Install rabbitmq using

sudo apt-get install rabbitmq-server

 

Add the broker_api URL to your settings:

# RabbitMQ management API
broker_api = 'http://guest:guest@localhost:15672/api/'

# Enable debug logging
logging = 'DEBUG'

We need to configure RabbitMQ for message broker services before running Celery. Once RabbitMQ has started successfully, you can check it using the web UI located at:

http://localhost:15672/

Usage

Launch the Flower server:

$ flower --port=5555

Or launch it through Celery:

$ celery flower -A proj --address=127.0.0.1 --port=5555

Once Celery Flower has started successfully, you can check it using the web UI located at:

http://localhost:5555
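
Flower can only show what your Celery application actually does, so you need at least one worker and one task to see anything interesting. A minimal sketch, assuming RabbitMQ is running locally; the module and task names here are purely illustrative:

# proj/celery_app.py (hypothetical module)
from celery import Celery

app = Celery("proj", broker="amqp://guest:guest@localhost//")

@app.task
def add(x, y):
    return x + y

Start a worker with celery -A proj.celery_app worker, call add.delay(2, 3) from a Python shell, and the task should then show up in Flower's task list.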

This was DjangoCon Europe 2016

By Horst Gutmann from Django community aggregator: Community blog posts. Published on Apr 13, 2016.

Every year the Django community organises an amazing event somewhere in Europe, aptly named DjangoCon Europe. Last year’s conference in Cardiff was amazing … and so was the 2016 edition in Budapest 💖

The whole conference had even more of a music flavour going on compared to previous events. The location was a music hall, your conference badge was a music tape and there were plastic saxophones everywhere!

Even the badge was a tape!

If you’ve been following my posts on my travelogue you already know the details. There were tons of amazing talks and it was (as always) hard to pick favourites. That being said, the biggest surprise for me was the one by Bashar Al-Abdulhadi, where he described the evolution of localised software for Arabic over the last decades and how the community worked on offering translations et al. for Django and its ecosystem. The most amazing aspect for me was a language called Arabish, in which the Arabic alphabet is transliterated into ASCII (including numbers).

On the technical side of things Andrew Godwin’s talk about Django Channels was - as expected - amazing 😊 I can’t wait to have them land in the core, which will hopefully happen with Django 1.10 in the next couple of months. In the meantime I’ve already added them to an internal project where I needed Websockets and they’ve worked great so far 💖

As was the case in Cardiff last year, this year’s event also had a big focus on the human side of software development. Most of us eventually run into situations where stress, among other things, starts to reach an unhealthy level. Right after the first day’s keynotes Mikey Ariel and Erik Romijn gave a great talk about this. Besides talks on the topic there was also a quiet room and a music library if you needed time for yourself. Heck, even at the official party there was enough space to find a quiet spot if you needed one. As always: much appreciated 😃

Amazing party at Extra

Perhaps the only thing that, in my opinion, didn’t work perfectly during the main event were the lunch breaks, due to the way the tables were positioned right in front of the buffet. But if something like that is the only thing I can come up with, then the conference was just amazing.

Same goes for the sprints. The location was a co-working space in a run-down building in the heart of the city with its own bar! There was so much space there that you could easily get lost. The venues that conference organisers find for sprints never cease to amaze me 😊 We once again had the gong which heralded the arrival of a merged pull request and … it got noisy quickly! Sprints are always something special and if you haven’t been to one yet, you should definitely go the next time you get a chance!

On the sprint days I also took some time to explore the city with friends, but I’ll most likely write more about that back on the travelogue when I find the time.

At first I thought this might have been the last DjangoCon Europe for me, as I haven’t had any serious Django projects for ages, but … I don’t need an excuse to come here! So see you next year! Wherever that might end up being…

From Intern to Professional Developer: Advice on a Mid-Career Pivot

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Apr 13, 2016.

A few weeks ago, Rebecca Conley attended DjangoCon Europe 2016 in Budapest, Hungary. The event is a five-day conference that brings together Django lovers from all over the world to learn about and share each other’s experiences with Django.

Rebecca gave an inspiring talk on her transition into web development from other fields, including the non-profit sector. She approached the topic from a unique diversity in tech perspective, arguing that developers making such transitions have a great deal to offer the developer community as a whole. You can watch her talk below or check out the many other great talks here.

Adopting Scrum in a Client-services, Multi-project Organization

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Apr 11, 2016.

Caktus began the process of adopting Scrum mid-November 2015 with two days of onsite Scrum training and fully transitioned to a Scrum environment in January 2016. From our original epiphany of “Yes! We want Scrum!” to the beginning of our first sprint, it took us six weeks to design and execute a process and transition plan. This is how we did it:

Step 1: Form a committee

Caktus is a fairly flat organization and we prefer to involve as many people as possible in decisions that affect the whole team. We formed a committee that included our founders, senior developers, and project managers to think through this change. In order for us to proceed with any of the following steps, all committee members had to be in agreement. When we encountered disagreement, we continued communicating in order to identify and resolve points of contention.

Step 2: Identify an approach

Originally we planned to adopt Scrum on a per-project basis. After all, most of the literature on Scrum is geared towards projects. Once we started planning this approach, however, we realized the overhead and duplication of effort required to adopt Scrum on even four concurrent projects (e.g. requiring team members to attend four discrete sets of sprint activities) was not feasible or realistic. Since Caktus works on more than four projects at a time, we needed another approach.

It was then that our CEO Tobias McNulty flipped the original concept, asking “What if instead of focusing our Scrum process around projects, we focused around teams?” After some initial head-scratching, some frantic searches in our Scrum books, and questions to our Scrum trainers, our committee agreed that the Scrum team approach was worth looking into.

Step 3: Identify cross-functional teams with feasible project assignments

Our approach to Scrum generated a lot of questions, including:

  • How many teams can we have?
  • Who is on which team?
  • What projects would be assigned to which teams?

We broke out into several small groups and brainstormed team ideas, then met back together and presented our options to each other. There was a lot of discussion and moving around of sticky notes. We ended up leaving all the options on one of our whiteboards for several days. During this time, you’d frequently find Caktus team members gazing at the whiteboard or pensively moving sticky notes into new configurations. Eventually, we settled on a team/project configuration that required the least amount of transitions for all stakeholders (developers, clients, project managers), retained the most institutional knowledge, and demonstrated cross-functional skillsets.

Step 4: Role-to-title breakdown

Scrum specifies three roles: Development team member, Scrum Master, and Product Owner. Most organizations, including Caktus, specify job titles instead: Backend developer, UI developer, Project Manager, etc. Once we had our teams, we had to map our team members to Scrum roles.

At first, this seemed fairly straightforward. Clearly Development team member = any developers, Scrum Master = Project Manager, and Product Owner = Product Manager. Yet the more we delved into Scrum, the more it became obvious that roles ≠ titles. We stopped focusing on titles and instead focused on responsibilities, skill sets, and attributes. Once we did so, it became obvious that our Project Managers were better suited to be Product Owners.

This realization allowed us to make smarter long-term decisions when assigning members to our teams.

Step 5: Create a transition plan

The change from a client-services, multi-project organization to a client-services, multi-project organization divided into Scrum teams was not insignificant. In order to transition to our Scrum teams, we needed to orient developers to new projects, switch out some client contacts, and physically rearrange our office so that we were seated roughly with our teams. We created a plan to make the necessary changes over time so that we were prepared to start our first sprints in January 2016.

We identified which developers would need to be onboarded onto which projects, and the key points of knowledge transfer that needed to happen in order for teams to successfully support projects. We started these transitions when it made sense to do so per project per team, e.g., after the call with the client in which the client was introduced to the new developer(s), and before the holder of the institutional knowledge went on holiday vacation.

Step 6: Obtain buy-in from the team

We wanted the whole of Caktus to be on board with the change prior to January. Once we had a plan, we hosted a Q&A lunch with the team in which we introduced the new Scrum teams, sprint activity schedules, and project assignments. We answered the questions we could and wrote down the ones we couldn’t for further consideration.

After this initial launch, we had several other team announcements as the process became more defined, as well as kick-off meetings with each team in which everyone had an opportunity to choose team names, provide feedback on schedules, and share any concerns with their new Scrum team. Team name direction was “A type of cactus”, and we landed on Team Robust Hedgehog, Team Discocactus, and Team Scarlet Crown. Concerns were addressed by the teams first, and if necessary, escalated to the Product Owners for further discussion and resolution.

On January 4, 2016, Caktus started its first Scrum sprints. After three months, our teams are reliably and successfully completing sprints, and working together to support our varied clients.

What we’ve learned by adopting Scrum is that Scrum is not a silver bullet. What Scrum doesn’t cover is a much larger list than what it does. The Caktus team has earnestly identified, confronted, and worked together to resolve issues and questions exposed by our adoption of Scrum, including (but not limited to):

  • How best to communicate our Scrum process to our clients, so they can understand how it affects their projects?
  • How does the Product Strategist title fit into Scrum?
  • How can we transition from scheduling projects in hours to relative sizing by sprint in story points, while still estimating incoming projects in hours?
  • How do sales efforts get appointed to teams, scheduled into sprints, and still get completed in a satisfactory manner?
  • What parts of Scrum are useful for other, non-development efforts at Caktus (retrospectives, daily check-ins, backlogs, etc)?
  • Is it possible for someone to perform the Scrum Master role on one team and the Product Owner role on a different team?

Scrum provides the framework that highlights these issues but intentionally does not offer solutions to all the problems. (In fact, in the Certified ScrumMaster exam, “This is outside the scope of Scrum” is the correct answer to some of the more difficult questions.) Adopting Scrum provides teams with the opportunity to solve these problems together and design a customized process that works for them.

Scrum isn’t for every organization or every situation, but it’s working for Caktus. We look forward to seeing how it continues to evolve to help us grow sharper web apps.

Basics of Django templates

By Micropyramid django from Django community aggregator: Community blog posts. Published on Apr 05, 2016.

The Django template engine comes as part of the Django framework and it is easy, extensible and handy for developing complex web applications with minimal effort. Let's look at the basics of the Django template engine.

Basics:

A template is a text document, or a normal Python string, that is marked up using the Django template language. The template system can be used in Python in the following way:

First, compile the raw template code into a Template object.
Then, call the render() method of the Template object with a given context.

Rendering a context: Once you have a compiled Template object, you can render a context or multiple contexts with it. We have a Context class at django.template.Context, and its constructor takes two (optional) arguments:
    1. A dictionary mapping variable names to variable values.
    2. The name of the current application.

Example:

from django.template import Template, Context

# Create a Template object by instantiating it; its constructor takes
# one argument: the raw template code.
template = Template("My name is {{my_name}}")
context = Context({"my_name":"XXX"})

# Calling the Template object's render() method with the context to fill the template.
template.render(context)
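
In a real project you rarely build Template objects from raw strings; templates usually live in files and are loaded through Django's template loaders. A minimal sketch using render_to_string, assuming a my_template.html file exists in one of your configured template directories:

from django.template.loader import render_to_string

# Loads my_template.html through the configured template engines and renders
# it with the given context dictionary, returning a string.
html = render_to_string("my_template.html", {"my_name": "XXX"})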

Templates:

A template is simply a text file. It can generate any text-based format (HTML, XML, CSV, etc.). A template contains variables, which get replaced with values when the template is evaluated, and tags, which control the logic of the template.

The following example illustrates a few basics.


{% extends "base.html" %}
{% block title %}{{ section.title }}{% endblock %}

{% block content %}
  <h1>{{ section.title }}</h1>
  {% for story in story_list %}
    <h2>{{ story.headline|upper }}</h2>
    <p>{{ story.description|truncatewords:"100" }}</p>
  {% endfor %}
{% endblock %}

The above template extends another template called "base.html" and provides content for the blocks defined in the "base.html" template. Each element is described below.

Variables:

Variables look like this: {{ variable }}. When the template engine encounters a variable, it evaluates that variable and replaces it with the result. Variable names consist of any combination of alphanumeric characters and the underscore ("_"), but may not contain spaces or punctuation characters. The dot (".") has a special meaning: it is used to access attributes of a variable.

In the above example, {{ section.title }} will be replaced with the title attribute of the section object.

Note: If you use a variable that doesn’t exist, the template system will insert an empty string by default.

Filters:

By using filters, you can modify the variables for display. Django provides about thirty built-in template filters. Most commonly used template filters are:

♦ lower - Example: {{ name|lower }} - This displays the value of the {{ name }} variable after being filtered through the lower filter, which converts text to lowercase. Use a pipe (|) to apply a filter.

♦ truncatewords - Some filters take arguments. A filter argument looks like this: {{ bio|truncatewords:30 }}. This will display the first 30 words of the bio variable.

♦ default - If a variable is false or empty, use given default. Otherwise, use the value of the variable.

For example: {{ value|default:"nothing" }}

If value isn’t provided or is empty, the above will display “nothing”.

Tags:

Tags look like this: {% tag %}. Tags are more complex than variables: Some create text in the output, some control flow by performing loops or logic, and some load external information into the template.

Some tags require beginning and ending tags (i.e. {% tag %} ... tag contents ... {% endtag %}).

Django provides about twenty built-in template tags. Here are some of the more commonly used tags:

♦ for - Loop over each item in an array. In the above example, to display a list of stories provided in story_list we have used a for loop.

♦ if, elif, and else - Evaluates a variable, and if that variable is “true” the contents of the block are displayed:

Example:

  {% if story_list|length > 2 %}
    Number of stories: {{ story_list|length }}
  {% elif story_list|length == 1 %}
    There is only one story!
  {% else %}
    No stories.
  {% endif %}

Comments:

To comment-out part of a line in a template, use the comment syntax: {# #}.

For example, this template would render as 'hello':  {# greeting #}hello

Template inheritance:

The most powerful and the most complex part of Django’s template engine is template inheritance. Template inheritance allows you to build a base “skeleton” template that contains all the common elements of your site and defines blocks that child templates can override. It’s easy to understand template inheritance by starting with an example:

    <html lang="en">
      <head>
        <link href="style.css" rel="stylesheet" />
        <title>{% block title %}Home Page{% endblock %}</title>
      </head>
      <body>
        <div id="content">
          {% block content %}{% endblock %}
        </div>
      </body>
    </html>

This template, which we’ll call base.html, defines a simple HTML skeleton document. It’s the job of “child” templates to fill the empty blocks with content; the example shown earlier under the Templates topic is such a child template.

The extends tag is the key here. It tells the template engine that this template “extends” another template. When the template system evaluates this template, first it locates the parent – in this case, “base.html”.

At that point, the template engine will notice the two block tags in base.html and replace those blocks with the contents of the child template. Depending on the value of story_list, the output might look like:

    <html lang="en">
      <head>
        <link href="style.css" rel="stylesheet" />
        <title>Story List</title>
      </head>
      <body>
        <div id="content">
          <h2>STORY ONE</h2>
          <p>This is first story.</p>
          <h2>STORY TWO</h2>
          <p>This is second story.</p>
        </div>
      </body>
    </html>

Django Forms basics explained

By Micropyramid django from Django community aggregator: Community blog posts. Published on Apr 05, 2016.

Django forms is a powerful module that helps Django application development by rendering HTML from models and validating input from an HTTP request against the model's specifications. And we can extend the module to suit our exact needs. Let's look at the basics of Django forms.

Forms:
A form is a collection of fields that knows how to validate itself. Form classes are created as subclasses of django.forms.Form.
Example:
Consider a form used to implement “contact me” functionality.

create models.py:

    from django.db import models

    class Contact(models.Model):
        subject = models.CharField(max_length=50)
        message = models.CharField(max_length=50)

create forms.py:

    from django import forms

    class ContactForm(forms.Form):
        subject = forms.CharField(max_length=100)
        message = forms.CharField(max_length=100)

A form is composed of Field objects.

Using a form in a view:

    from django.http import HttpResponse
    from app.forms import ContactForm
    from app.models import Contact

    def contact(request):
        if request.method == "POST":
            form = ContactForm(request.POST)
            if form.is_valid():
                subject = form.cleaned_data["subject"]
                message = form.cleaned_data["message"]
                Contact.objects.create(subject=subject, message=message)
                return HttpResponse("Contact details created")
            else:
                return HttpResponse("form is not validated")

ModelForm:
ModelForm is used to avoid duplicating the model description in the form class.

create forms.py:

    from django.forms import ModelForm
    from app.models import Contact

    class ContactForm(ModelForm):
        class Meta:
            model = Contact
            fields = ["subject", "message"]

Using the form in a view:

    from django.http import HttpResponse
    from app.forms import ContactForm

    def contact(request):
        if request.method == "POST":
            form = ContactForm(request.POST)
            if form.is_valid():
                contact = form.save()
                return HttpResponse("Contact details created")
            else:
                return HttpResponse("form is not validated")
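
You can also exercise a form without a view at all, which is handy for quick checks. A small sketch in the Django shell (python manage.py shell), using either version of ContactForm defined above:

    from app.forms import ContactForm

    form = ContactForm({"subject": "Hello", "message": "Just saying hi"})
    print(form.is_valid())    # True when both fields pass validation
    print(form.cleaned_data)  # {'subject': 'Hello', 'message': 'Just saying hi'}

    bad = ContactForm({"subject": ""})  # missing/empty required fields
    print(bad.is_valid())               # False
    print(bad.errors)                   # per-field error messages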

Add captcha to django web page using Python-reCaptcha

By Micropyramid django from Django community aggregator: Community blog posts. Published on Apr 05, 2016.

How do you integrate reCaptcha with Django? The following gives you a brief idea of how to integrate reCaptcha into a Django application.

  • First get the corresponding public key and private key from reCaptcha site and keep them in your settings.py
in settings.py
CAPTCHA_PUBLIC = 'XXXXXXXXXXXXXXXXXX'
CAPTCHA_PRIVATE = 'XXXXXXXXXXXXXXXXXX'
  • Render the reCaptcha on the corresponding html page
in views.py
recaptcha = captcha.displayhtml(CAPTCHA_PUBLIC)
    # displayhtml() gives the captcha markup to be displayed on the html page
return render_to_response('demo.html', {'recaptcha': recaptcha})
in demo.html
<form action="/" method="post">
{{recaptcha|safe}}
<button class="submit" type="submit">Submit</button>
</form>
  • Validate the challenge and response fields.
in views.py
capresponse = captcha.submit( request.POST.get('recaptcha_challenge_field'),
                              request.POST.get('recaptcha_response_field'),
                              CAPTCHA_PRIVATE,
                              request.META.get('REMOTE_ADDR'))
# checks whether the captcha is valid, i.e. the challenge and response fields match
if capresponse.is_valid:
     do_some_work()
else:
     return HttpResponse('wrong captcha text')
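
Putting the snippets above together, a complete view might look roughly like the sketch below. It assumes the recaptcha-client package (from recaptcha.client import captcha) and the settings names used above; demo.html is whatever template renders the form:

from django.conf import settings
from django.http import HttpResponse
from django.shortcuts import render_to_response
from recaptcha.client import captcha  # assumption: the recaptcha-client package

def demo(request):
    if request.method == "POST":
        capresponse = captcha.submit(
            request.POST.get('recaptcha_challenge_field'),
            request.POST.get('recaptcha_response_field'),
            settings.CAPTCHA_PRIVATE,
            request.META.get('REMOTE_ADDR'))
        if capresponse.is_valid:
            return HttpResponse('captcha passed')
        return HttpResponse('wrong captcha text')
    recaptcha = captcha.displayhtml(settings.CAPTCHA_PUBLIC)
    return render_to_response('demo.html', {'recaptcha': recaptcha})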

Django Raw Sql Queries

By Micropyramid django from Django community aggregator: Community blog posts. Published on Apr 05, 2016.

When the model query API doesn't fit your needs, or you want more performance, you can use raw SQL queries in Django.
The Django Object Relational Mapper (ORM) helps bridge the gap between the database and our code.
Performing raw queries:

app/model.py

class employee(models.Model):
    name = models.CharField(max_length=20)
    position = models.CharField(max_length=20)
    sal = models.IntegerField()


Select statement:

  >>> emp_list = employee.objects.raw('SELECT * FROM blogapp_employee')

The SQL statement passed to .raw() must return a set of rows from the database; if it does not, an error will result.

  >>> for i in emp_list:

  ...    i.name

  ... 

  u'ravi'

  u'john'

  u'pheebi'

The above works the same as model_name.objects.all().

MyModel.objects.all() is a very simplified ORM call. It returns the entire contents of the database table, the same as running: SELECT * FROM MyModel.

Only use the .raw() method when you can't accomplish the task with the Django QuerySet API.
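
When you do need .raw(), pass user-supplied values through its params argument rather than formatting them into the SQL string yourself, so that the database driver escapes them. A small sketch against the employee model above:

  >>> min_sal = 50000
  >>> emps = employee.objects.raw('SELECT * FROM blogapp_employee WHERE sal > %s', [min_sal])
  >>> for e in emps:
  ...     print(e.name, e.sal)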

Update, Delete with SQL (using a cursor):

  >>>from django.db import connection, transaction

  >>>cursor = connection.cursor()

  >>>cursor.execute("UPDATE tablename SET field=value, ... WHERE condition")

  >>>cursor.execute("DELETE FROM tablename WHERE condition")

To see what SQL queries Django is running:


 

  >>> from django.db import connection
  >>> connection.queries
  [{'sql': 'SELECT * FROM blog_employee',
  'time': '0.001'}]

  #sql -- The raw SQL statement
  #time -- How long the statement took to execute, in seconds.


You can also see the SQL query equivalent of a Django ORM query:

  >>> q = employee.objects.all()

  >>> q

  []

  >>> print q.query

  SELECT "blogapp_employee"."id", "blogapp_employee"."name", "blogapp_employee"."position", "blogapp_employee"."sal" FROM "blogapp_employee"

Basics of Django messages framework

By Micropyramid django from Django community aggregator: Community blog posts. Published on Apr 05, 2016.

In any web application we need to display notification messages to the end user after processing a form or some other kind of request. To make this messaging system simple, Django provides full support for cookie- and session-based messaging, for both anonymous and authenticated users.

The messages framework is capable of storing messages in one request and retrieving them in a subsequent request. Every message has a tag based on its priority (e.g. info, warning and error).

Enabling Django messages (a settings sketch follows the list below):

-> We need to put 'django.contrib.messages' in INSTALLED_APPS.

-> MIDDLEWARE_CLASSES should contain 'django.contrib.sessions.middleware.SessionMiddleware' and 'django.contrib.messages.middleware.MessageMiddleware'.

-> The 'context_processors' option of the DjangoTemplates backend defined in your TEMPLATES setting should contain 'django.contrib.messages.context_processors.messages'.
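
A minimal settings.py sketch covering the three requirements above:

INSTALLED_APPS = [
    # ...
    'django.contrib.sessions',
    'django.contrib.messages',
]

MIDDLEWARE_CLASSES = [
    # ...
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.contrib.messages.middleware.MessageMiddleware',
]

TEMPLATES = [
    {
        'BACKEND': 'django.template.backends.django.DjangoTemplates',
        'APP_DIRS': True,
        'OPTIONS': {
            'context_processors': [
                # ...
                'django.contrib.messages.context_processors.messages',
            ],
        },
    },
]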

storage.fallback.FallbackStorage is the default storage class. If you want, you can select another storage class by setting MESSAGE_STORAGE to its full import path, for example:

MESSAGE_STORAGE = 'django.contrib.messages.storage.cookie.CookieStorage'

To write your own storage class, subclass the BaseStorage class in django.contrib.messages.storage.base and implement the _get and _store methods.
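
As a rough illustration of that contract (the class name and storage strategy here are ours, not Django's, and this is not production-ready): _get should return the stored messages plus a flag saying whether everything was retrieved, and _store should return any messages that could not be stored.

from django.contrib.messages.storage.base import BaseStorage

class MemoryStorage(BaseStorage):
    # Illustrative only: a per-process dict does not survive restarts and is
    # not shared between worker processes. Requires SessionMiddleware so that
    # request.session.session_key is available.
    _messages_by_key = {}

    def _get(self, *args, **kwargs):
        messages = self._messages_by_key.get(self.request.session.session_key, [])
        return messages, True  # True: all messages were retrieved

    def _store(self, messages, response, *args, **kwargs):
        key = self.request.session.session_key
        if messages:
            self._messages_by_key[key] = list(messages)
        else:
            self._messages_by_key.pop(key, None)
        return []  # nothing was left unstored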

Add messages:

add_message(request, level, message, extra_tags='', fail_silently=False)

from django.contrib import messages
messages.add_message(request, messages.INFO, 'Hello world.')

 

Some shortcut methods to add messages with commonly used tags.

messages.debug(request, '%s SQL statements were executed.' % count)

messages.info(request, 'Three credits remain in your account.')

messages.success(request, 'Profile details updated.')

messages.warning(request, 'Your account expires in three days.')

messages.error(request, 'Document deleted.')

 

Display messages in templates:

{% if messages %}
<ul>
   {% for message in messages %}
	   <li>{{ message }}</li>
   {% endfor %}
</ul>
{% endif %}

The context processor also provides a DEFAULT_MESSAGE_LEVELS variable which is a mapping of the message level names to their numeric value:

{% if messages %}
<ul>
   {% for message in messages %}
   <li>
      {% if message.level == DEFAULT_MESSAGE_LEVELS.ERROR %}Important: {% endif %}
      {{ message }}
   </li>
   {% endfor %}
</ul>
{% endif %}

Get messages outside templates:

from django.contrib.messages import get_messages

storage = get_messages(request)

for message in storage:

    do_something_with_the_message(message)

 

Elaborating Django messages with an example:

models.py:

from django.db import models

class Student(models.Model):
    name = models.CharField(max_length=100)
    r_no = models.CharField(max_length=10)

forms.py:

from django.forms import ModelForm
from myapp.models import Student

class StudentForm(ModelForm):
    class Meta:
        model = Student
        fields = ['name', 'r_no']

 

views.py:

from myapp.forms import StudentForm
from myapp.models import Student
from django.contrib import messages
from django.shortcuts import render

def my_view(request):
    if request.method == "POST":
        student_form = StudentForm(request.POST)
        if student_form.is_valid():
            student_form.save()
            messages.success(request, 'Student created successfully.')
            return render(request, "template1.html")
        else:
            messages.error(request, student_form.errors)
    students = Student.objects.all()
    return render(request, "template2.html", {"students": students})

 

template1.html

{% if messages %}
<ul>
	{% for message in messages %}
	   <li>{{ message }}</li>
	{% endfor %}
</ul>
{% endif %}
<form action="." method="post">
    {% csrf_token %}
    <input name="name" type="text" />
    <input name="r_no" type="text" />
    <button type="submit">Submit</button>
</form>

template2.html

{% if messages %}
<ul>
	{% for message in messages %}
	   <li>{{ message }}</li>
	{% endfor %}
</ul>
{% endif %}
<h2>Student list</h2>

What We're Clicking - March Link Roundup

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Apr 05, 2016.

We’re starting a new, monthly series on the Caktus blog highlighting the articles and posts shared by Cakti that drew the most attention on Twitter. These roundups will include everything from Django how-tos to explorations of the tech industry, to innovations for social good.

This month we’re featuring articles on front-end development, data visualization, open source, and diversity in tech.

python for geospatial data processing

Python for Geospatial Data Processing

An excellent how-to from Carlos de la Torre on satellite image classification and geospatial data processing in Python.

Why Do Many Data Scientists Love Using Python Over Ruby?

We obviously love Python. But if you’re not convinced, read Harri Srivastav’s post singing the praises of Python for data management, processing, and visualization. Python’s speed, availability of libraries, options for graphics, and its large, active community make it a standout for working with data.

Rachel Andrews’ Talks on Modern CSS Layout

Our lead front-end developer Calvin Spealman recommends all of Rachel Andrews’ talks on modern CSS layout. She addresses everything from Flexbox to Grid and Box Alignment as well as a handful of other front-end tools.

Should Your NGO Go Open Source?

Catherine Cheney’s article on the pros and cons of open source for NGOs sparked much discussion, especially on social media and at the most recent Digital Principles conference in DC.

a building in downtown Durham

Is a Different Kind of Silicon Valley Possible?

As a Durham-based tech firm invested in increasing diversity in tech, we were thrilled to see The Atlantic shout out Durham’s thriving tech scene and its attempts to address diversity from the ground up.

Is Open Source Consulting Dead?

By chrism from . Published on Sep 10, 2013.

Has Elvis left the building? Will we be able to sustain ourselves as open source consultants?

Consulting and Patent Indemification

By chrism from . Published on Aug 09, 2013.

Article about consulting and patent indemnification

Python Advent Calendar 2012 Topic

By chrism from . Published on Dec 24, 2012.

An entry for the 2012 Japanese advent calendar at http://connpass.com/event/1439/

Why I Like ZODB

By chrism from . Published on May 15, 2012.

Why I like ZODB better than other persistence systems for writing real-world web applications.

A str. __iter__ Gotcha in Cross-Compatible Py2/Py3 Code

By chrism from . Published on Mar 03, 2012.

A bug caused by a minor incompatibility can remain latent for long periods of time in a cross-compatible Python 2 / Python 3 codebase.

In Praise of Complaining

By chrism from . Published on Jan 01, 2012.

In praise of complaining, even when the complaints are absurd.

2012 Python Meme

By chrism from . Published on Dec 24, 2011.

My "Python meme" replies.

In Defense of Zope Libraries

By chrism from . Published on Dec 19, 2011.

A much too long defense of Pyramid's use of Zope libraries.

Plone Conference 2011 Pyramid Sprint

By chrism from . Published on Nov 10, 2011.

An update about the happenings at the recent 2011 Plone Conference Pyramid sprint.

Jobs-Ification of Software Development

By chrism from . Published on Oct 17, 2011.

Try not to Jobs-ify the task of software development.

WebOb Now on Python 3

By chrism from . Published on Oct 15, 2011.

Report about porting to Python 3.

Open Source Project Maintainer Sarcastic Response Cheat Sheet

By chrism from . Published on Jun 12, 2011.

Need a sarcastic response to a support interaction as an open source project maintainer? Look no further!

Pylons Miniconference #0 Wrapup

By chrism from . Published on May 04, 2011.

Last week, I visited the lovely Bay Area to attend the 0th Pylons Miniconference in San Francisco.

Pylons Project Meetup / Minicon

By chrism from . Published on Apr 14, 2011.

In the SF Bay Area on the 28th, 29th, and 30th of this month (April), 3 separate Pylons Project events.

PyCon 2011 Report

By chrism from . Published on Mar 19, 2011.

My personal PyCon 2011 Report