Open Source Posts

Debugging Django Management Commands in PyCharm

By DjangoTricks from Django community aggregator: Community blog posts. Published on May 25, 2016.

Photo by Jill Heyer

My favorite editor for Python projects is PyCharm. Besides editing code, it allows you to inspect the database, work with Git repositories, run management commands, execute bash commands and Python scripts, and debug code, all in the same window. In this article, I will show you how to set breakpoints and debug Django management commands visually in PyCharm.

Django management commands are scripts that can be executed on your Django project to do something with the project database, media files, or code. Django itself comes with a bunch of commands like: migrate, runserver, collectstatic, makemessages, and clearsessions. Management commands can be executed like this:

(myproject_env)$ python manage.py clearsessions

If you want to create a custom management command in your project, you can learn how to do that in the official Django documentation. You can also find some practical examples in Chapter 9, Data Import and Export, of Web Development with Django Cookbook - Second Edition.
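
For reference, a custom management command is just a Python class named Command that lives in a management/commands/ package of one of your apps and subclasses BaseCommand. Here is a minimal sketch (the app and command names are hypothetical):

# myapp/management/commands/cleanup_profiles.py
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Deletes user profiles that were never activated"

    def add_arguments(self, parser):
        parser.add_argument("--days", type=int, default=30)

    def handle(self, *args, **options):
        days = options["days"]
        # ... query and delete the stale profiles here ...
        self.stdout.write("Cleaned up profiles older than %d days" % days)

With that file in place, the command would be run as python manage.py cleanup_profiles --days=60, and it can be debugged in PyCharm exactly like the built-in command described below.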

In this example, I won't create any new management command, but will debug the clearsessions command that ships with Django and is located at django/contrib/sessions/management/commands/clearsessions.py.

First of all, let's click on "Edit Configurations..." in the top toolbar just before the Run button (with the Play icon). In the opened dialog box "Run/Debug Configurations" click on the "Add New Configuration" button (with the Plus icon) and choose "Python".

Let's fill in the configuration with these values:

Name: Clear Outdated Sessions
Script: /Users/me/DjangoProjects/myproject_env/project/myproject/manage.py
Script parameters: clearsessions --traceback --verbosity=2
Python interpreter: Python 2.7.6 virtualenv at ~/DjangoProjects/myproject_env
Working directory: /Users/me/DjangoProjects/myproject_env/project/myproject/

Then open the file with the definition of the management command, django/contrib/sessions/management/commands/clearsessions.py. Click in the left gutter of the editor to add a breakpoint (marked with a red circle) where the script should stop executing for inspection.

Normally to run this script, you could click on the Run button (with the Play icon). But as we want to debug the script, we will click on the Debug button (with the Bug icon) in the toolbar.

The script will start executing and will pause at the breakpoint you set. You will be able to inspect all local variables in the debug panel that opens at the bottom of your window by default.

You can navigate through code execution with the arrow buttons "Step Over", "Step Into", etc. To evaluate local or global variables or values, click on the "Evaluate Expression" button (with the Calculator icon) and enter some Python code.

Click on the "Resume Program" button (with the Fast Play icon) to continue execution of the script, when you are ready.

Analogously, you can debug your models, views, forms, or middleware by running a development server ("Django server") in debug mode.

Dynamically generating ZIP files with one line of code

By Cloudinary Blog - Django from Django community aggregator: Community blog posts. Published on May 25, 2016.

Dynamic ZIP creation from code

As a developer, you want to allow your users to download multiple files in a single click. An easy way to download multiple files and share them is to generate a ZIP file. When images are involved, you may also want to normalize the original images before including them in the ZIP file, by scaling them down to the same maximum resolution or converting them to the same format.

In the following simple example, 3 images of cats have been uploaded to the cloud.

fat_cat.jpg, kitten.jpg, hungry_cat.jpg

With one line of code, you can generate a dynamic URL that, for example, automatically creates and then delivers a ZIP file containing the cat images, all scaled down to a width of 200 pixels:

https://api.cloudinary.com/v1_1/demo/image/generate_archive?api_key=373364719177799&expires_at=1557919777&mode=download&public_ids%5B%5D=fat_cat&public_ids%5B%5D=kitten&public_ids%5B%5D=hungry_cat&signature=a2f86b73d32d2a778493d6759d59059d0a30076d&timestamp=1464179836&transformations=c_scale%2Cw_200

The ability to dynamically generate and deliver ZIP files to your users by including one line of code can be useful for developers in a number of ways. For example:

  • Social and messaging apps that allow users to select multiple files to send to another recipient, who subsequently receives all the files in one ZIP file (e.g., the Gmail feature where you can download all attachments as a single ZIP file).
  • Applications that include image galleries and allow users to select multiple images and then download a ZIP file with all the selected images (e.g., as Google Photos has implemented).
  • Allowing your users to download multiple images simultaneously in a single ZIP file, where all the images have been normalized to a certain size, format or quality (or any other image transformation you want to apply to all the images).

Generating ZIP files of images in the cloud

Cloudinary supports generating ZIP files using the generate_archive method, which can include any type of file and offers various options for determining which files to include in the ZIP file (e.g., by file name, all files in the same folder, etc.). The method also allows you to apply transformations to all the images before including them in the file and set various options for generating the ZIP file (e.g., naming the file). For more information on all the options available for generating ZIP files, see the generate_archive documentation.

Cloudinary enables you to create and deliver ZIP files in one of the following two ways:

  • Precreate the ZIP file and upload it to the cloud.
  • Generate a dynamic URL for creating and downloading a ZIP file on demand.

Create a ZIP file of images

To precreate a ZIP file, use the create_archive method of the Upload API, which also automatically uploads the ZIP file to the cloud, and then give your users a link for downloading it. This option is best if multiple users will be downloading the resulting ZIP file.

For example, to create a ZIP file called small_cats.zip that contains small (50x50) thumbnails of all of the images in your account that have been tagged as "cats":

Ruby:
Cloudinary::Uploader.create_archive(:tags => 'cats', 
    :resource_type => 'image', :target_public_id => 'small_cats.zip',
    :transformations => {:width => 50, :height => 50, :crop => :fill})
PHP:
\Cloudinary\Uploader::create_archive(array(
    'tags' => 'cats', 'resource_type' => 'image', 
    'target_public_id' => 'small_cats.zip', 'transformations' => array(
        'width' => 50, 'height' => 50, 'crop' => 'fill')));
Python:
cloudinary.uploader.create_archive(
    tags='cats', resource_type='image',
    target_public_id='small_cats.zip',
    transformations={'width': 50, 'height': 50, 'crop': 'fill'})
Node.js:
cloudinary.v2.uploader.create_archive(
    { tags: 'cats', resource_type: 'image', 
    target_public_id: 'small_cats.zip', transformations: {
        width: 50, height: 50, crop: 'fill'}},
    function(error,result) {console.log(result) });
Java:
cloudinary.uploader().createArchive(
    ObjectUtils.asMap("tags", "cats", "resource_type", "image",
    "target_public_id", "small_cats.zip", "transformations",
    Arrays.asList(
        new Transformation().width(50).height(50).crop("fill"))));
cURL:
curl https://api.cloudinary.com/v1_1/demo/image/generate_archive -X POST --data 'tags=cats&resource_type=image&target_public_id=small_cats.zip&timestamp=173719931&api_key=436464676&signature=a788d68f86a6f868af&transformations=c_fill%2Cw_50%2Ch_50'

The response to the API call includes all pertinent information about the created zip file, including the URL needed to access it, in this case:

Ruby:
cl_image_tag("small_cats.zip")
PHP:
cl_image_tag("small_cats.zip")
Python:
CloudinaryImage("small_cats.zip").image()
Node.js:
cloudinary.image("small_cats.zip")
Java:
cloudinary.url().imageTag("small_cats.zip")
jQuery:
$.cloudinary.image("small_cats.zip")
.Net:
cloudinary.Api.UrlImgUp.BuildImageTag("small_cats.zip")

Generate a dynamic URL for downloading a ZIP file on demand

Instead of precreating the ZIP file, you can generate a signed URL for creating a ZIP file on the fly and on demand with the download_archive_url method of the Utils API. The ZIP file is only created and streamed to your user when the URL is accessed. The resulting ZIP file is not cached or stored in your Cloudinary account, so this option is best if only a single user will download the resulting ZIP file; it also avoids waste if the URL is never accessed.

For example, to generate a signed URL for creating and delivering a ZIP file that contains the 'fat_cat' and 'kitten' images:

Ruby:
Cloudinary::Utils.download_archive_url(
    :public_ids => ['fat_cat', 'kitten'], 
    :resource_type => 'image')
PHP:
\Cloudinary\Utils::download_archive_url(
    array(
        'public_ids' => array('fat_cat', 'kitten'), 
        'resource_type' => 'image'));
Python:
cloudinary.utils.download_archive_url(
    public_ids = ['fat_cat', 'kitten'], 
    resource_type = 'image')
Node.js:
cloudinary.v2.utils.download_archive_url(
    { public_ids: ['fat_cat', 'kitten'], resource_type: 'image'},
    function(error,result) {console.log(result) });
Java:
cloudinary.utils().downloadArchiveUrl(
    ObjectUtils.asMap('public_ids', Arrays.asList('fat_cat', 'kitten'), 
        'resource_type', 'image'));
cURL:
curl https://api.cloudinary.com/v1_1/demo/image/generate_archive -X POST --data 'public_ids[]=fat_cat&public_ids[]=kitten&resource_type=image&mode=download&timestamp=173719931&api_key=436464676&signature=a788d68f86a6f868af'

The API call returns the URL needed to dynamically create and deliver the ZIP file, in this case:

https://api.cloudinary.com/v1_1/demo/image/generate_archive?api_key=373364719177799&expires_at=1557919777&mode=download&public_ids%5B%5D=fat_cat&public_ids%5B%5D=kitten&signature=45411c9ad47e06a2a9468658d919b045d810ec1b&timestamp=1464180350
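
If you are generating such URLs from a Django application, a view can simply build the URL and redirect the user to it. The following is only a minimal sketch, assuming the cloudinary Python SDK is installed and configured with your credentials:

from django.shortcuts import redirect
import cloudinary.utils


def download_cat_pictures(request):
    # builds a signed, on-demand ZIP URL like the one shown above
    url = cloudinary.utils.download_archive_url(
        public_ids=['fat_cat', 'kitten'], resource_type='image')
    return redirect(url)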

Dynamic ZIP files with a single line of code

Generating ZIP files with a single line of code allows you to organize, streamline, normalize and optimize the delivery of multiple images to your users. Either create the ZIP file and upload it to the cloud, or generate a dynamic URL that creates and delivers the ZIP file on demand. For more information on all the options available for generating ZIP files, see the generate_archive documentation. The feature is available for use with all Cloudinary accounts, including the free tier.

Nikolay Shaplov: postgres: reloption ALTER INDEX bug

From Planet PostgreSQL. Published on May 25, 2016.

It seems to me that I have found a bug in the current implementation of reloptions: when you create a custom index with its own reloptions, you have no way to prevent them from being changed using ALTER INDEX ... SET (op=value);
For example, if you do this for a bloom index:
alter index bloomidx set ( length=15 );

postgres will successfully run this, change the value of the reloptions attribute in pg_class, and the bloom index will work incorrectly afterwards.
And there is no way to forbid this from inside an extension.

I think I would add a flag to the reloption descriptor that tells whether a given reloption may be changed using ALTER INDEX or not.

Umair Shahid: Using Hibernate Query Language (HQL) with PostgreSQL

From Planet PostgreSQL. Published on May 25, 2016.

In my previous blog, I talked about using Java arrays to talk to PostgreSQL arrays. This blog is going to go one step further in using Java against the database. Hibernate is an ORM implementation available to use with PostgreSQL. Here we discuss its query language, HQL.

The syntax for HQL is very close to that of SQL, so anyone who knows SQL should be able to ramp up very quickly. The major difference is that rather than addressing tables and columns, HQL deals in objects and their properties. Essentially, it is a complete object oriented language to query your database using Java objects and their properties. As opposed to SQL, HQL understands inheritance, polymorphism, & association. Code written in HQL is translated by Hibernate to SQL at runtime and executed against the PostgreSQL database.

An important point to note here is that references to objects and their properties in HQL are case-sensitive; all other constructs are case-insensitive.

Why Use HQL?

The main driver to using HQL would be database portability. Because its implementation is designed to be database agnostic, if your application uses HQL for querying the database, you can interchange the underlying database by making simple changes to the configuration XML file. As opposed to native SQL, the actual code will remain largely unchanged if your application starts talking to a different database.

Prominent Features

A complete list of features implemented by HQL can be found on their website. Here, we present examples of some basic and salient features that will help you get going with HQL. These examples use a table named 'largecities' that lists the 10 largest metropolitan areas of the world. The descriptor and data are:

postgres=# \d largecities
 Table "public.largecities"
 Column |          Type          | Modifiers 
--------+------------------------+-----------
 rank   | integer                | not null
 name   | character varying(255) | 
Indexes:
 "largecities_pkey" PRIMARY KEY, btree (rank)
postgres=# select * from largecities; 
 rank | name 
------+-------------
    1 | Tokyo
    2 | Seoul
    3 | Shanghai
    4 | Guangzhou
    5 | Karachi
    6 | Delhi
    7 | Mexico City
    8 | Beijing
    9 | Lagos
   10 | Sao Paulo
(10 rows)

HQL works with a class that this table is mapped to in order to create objects in memory with its data. The class is defined as:

@Entity
public class LargeCities {
    @Id
    private int rank;
    private String name;

    public int getRank() {
        return rank;
    }
    public String getName() {
        return name;
    }
    public void setRank(int rank) {
        this.rank = rank;
    }
    public void setName(String name) {
        this.name = name;
    }
}

Notice the @Entity and @Id annotations, which declare the class 'LargeCities' as an entity and the property 'rank' as the identifier.

The FROM Clause

The FROM clause is used if you want to load all rows of the table as objects in memory. The sample code given below retrieves all rows from the 'largecities' table and prints the data from the objects to stdout.

try {

 SessionFactory sessionFactory = new Configuration().configure().buildSessionFactory();
 Session session = sessionFactory.openSession();
 session.beginTransaction();

 Query query = session.createQuery("FROM LargeCities");
 List<LargeCities> cities = (List<LargeCities>)query.list();

 session.close();

 for (LargeCities c : cities)
     System.out.println(c.getRank() + " " + c.getName());

} catch (Exception e) {
     System.out.println(e.getMessage()); 
}

Note that ‘LargeCities’ referred to in the HQL query is not the ‘largecities’ table but rather the ‘LargeCities’ class. This is the object oriented nature of HQL.

Output from the above program is as follows:

1 Tokyo
2 Seoul
3 Shanghai
4 Guangzhou
5 Karachi
6 Delhi
7 Mexico City
8 Beijing
9 Lagos
10 Sao Paulo

The WHERE Clause

There can be instances where you would want to specify a filter on the objects you want to see. Taking the above example forward, you might want to see just the 5 largest metropolitan areas in the world. A WHERE clause can help you achieve that as follows:

try {

 SessionFactory sessionFactory = new Configuration().configure().buildSessionFactory();
 Session session = sessionFactory.openSession();
 session.beginTransaction();

 Query query = session.createQuery("FROM LargeCities WHERE rank < 6");
 List<LargeCities> cities = (List<LargeCities>)query.list(); 

 session.close(); 

 for (LargeCities c : cities)
     System.out.println(c.getRank() + " " + c.getName());

} catch (Exception e) {
     System.out.println(e.getMessage()); 
}

Output from the above code is:

1 Tokyo
2 Seoul
3 Shanghai
4 Guangzhou
5 Karachi

The SELECT Clause

The default FROM clause retrieves all columns from the table as properties of the object in Java. There are instances where you would want to retrieve only selected properties rather than all of them. In such a case, you can specify a SELECT clause that identifies the precise columns you want to retrieve.

The code below selects just the city name for retrieval. Note that, because only one column is now being retrieved, Hibernate loads the results as a list of Strings rather than a list of LargeCities objects.

try {

 SessionFactory sessionFactory = new Configuration().configure().buildSessionFactory();
 Session session = sessionFactory.openSession();
 session.beginTransaction();

 Query query = session.createQuery("SELECT name FROM LargeCities");
 List<String> cities = (List<String>)query.list(); 

 session.close(); 

 for (String c : cities)
     System.out.println(c);

} catch (Exception e) {
     System.out.println(e.getMessage()); 
}

The output of this code is:

Tokyo
Seoul
Shanghai
Guangzhou
Karachi
Delhi
Mexico City
Beijing
Lagos
Sao Paulo

Named Parameters

Much like prepared statements, you can have named parameters through which you can use variables to assign values to HQL queries at runtime. The following example uses a named parameter to find out the rank of ‘Beijing’.

try {

 SessionFactory sessionFactory = new Configuration().configure().buildSessionFactory();
 Session session = sessionFactory.openSession();
 session.beginTransaction();

 Query query = session.createQuery("SELECT rank FROM LargeCities WHERE name = :city_name");
 query.setParameter("city_name", "Beijing");
 List<Integer> rank = (List<Integer>)query.list(); 

 session.getTransaction().commit();
 session.close(); 

 for (Integer c : rank)
     System.out.println("Rank is: " + c.toString());

} catch (Exception e) {
     System.out.println(e.getMessage()); 
}

Output for this code:

Rank is: 8

Pagination

When programming, numerous scenarios present themselves where data is required to be processed in chunks or pages. The process is called pagination of data, and HQL provides a mechanism to handle that with a combination of setFirstResult and setMaxResults, methods of the Query interface. As the names suggest, setFirstResult allows you to specify which record should be the starting point for record retrieval, while setMaxResults allows you to specify the maximum number of records to retrieve. This combination is very helpful in Java or web apps where a large result set is shown split into pages and the user has the ability to specify the page size.

The following code breaks up our ‘largecities’ examples into 2 pages and retrieves data for them.

try {

 SessionFactory sessionFactory = new Configuration().configure().buildSessionFactory();
 Session session = sessionFactory.openSession();
 session.beginTransaction();

 Query query = session.createQuery("FROM LargeCities");

 query.setFirstResult(0); 
 query.setMaxResults(5);

 List<LargeCities> cities = (List<LargeCities>)query.list(); 

 System.out.println("*** Page 1 ***");
 for (LargeCities c : cities)
     System.out.println(c.getRank() + " " + c.getName());

 query.setFirstResult(5); 

 cities = (List<LargeCities>)query.list(); 

 System.out.println("*** Page 2 ***");
 for (LargeCities c : cities)
     System.out.println(c.getRank() + " " + c.getName());

 session.close(); 

} catch (Exception e) {
     System.out.println(e.getMessage()); 
}

An important point to keep in mind here is that Hibernate usually does pagination in memory rather than at the database query level. This means that for large data sets, it might be more efficient to use cursors, temp tables, or some other construct for pagination.

Other Features

A comprehensive list of features is available on the Hibernate website, but a few more worth mentioning here are:

  • UPDATE Clause
  • DELETE Clause
  • INSERT Clause
  • JOINs
  • Aggregate Methods
    • avg
    • count
    • max
    • min
    • sum

Drawbacks of Using HQL

HQL gives its users a lot of flexibility and a rich set of options to use while talking to a database. That flexibility does come at a price, however. Because HQL is designed to be generic and largely database-agnostic, you should watch out for the following when using HQL.

  • At times, you will need to use advanced features and functions that are specific to PostgreSQL. As an example, you might want to harness the power of the newly introduced JSONB data type, or use window functions to analyze your data. Because HQL tries to be as generic as possible, in order to use such advanced features you will need to fall back to native SQL.
  • Because of the way left joins are designed, if you are joining an object to another table / object in a one-to-many or many-to-many format, you can potentially get duplicate data. This problem is exacerbated in the case of cascading left joins, and HQL has to preserve references to these duplicates, essentially ending up transferring a lot of duplicate data. This has the potential to significantly impact performance.
  • Because HQL does the object-relational mapping itself, you don't get full control over how and what data gets fetched. One such infamous issue is the N+1 problem. Although you can find workarounds within HQL, identifying the problem can at times get very tricky.

Magnus Hagander: www.postgresql.org is now https only

From Planet PostgreSQL. Published on May 24, 2016.

We've just flipped the switch on www.postgresql.org to be served on https only. This has been done for a number of reasons:

  • In response to popular request
  • Google, and possibly other search engines, have started to give higher scores to sites on https, and we were previously redirecting accesses to cleartext
  • Simplification of the site code, which now doesn't have to keep track of which pages need to be secure and which do not
  • Prevention of evil things like WiFi hotspot providers injecting ads or javascript into the pages

We have not yet enabled HTTP Strict Transport Security, but will do so in a couple of days once we have verified all functionality. We have also not enabled HTTP/2 yet, this will probably come at a future date.

Please help us out with testing this, and let us know if you find something that's not working, by emailing the pgsql-www mailing list.

There are still some other postgresql.org websites that are not available over https, and we will be working on those as well over the coming weeks or months.

Raghavendra Rao: Ways to access Oracle Database in PostgreSQL

From Planet PostgreSQL. Published on May 24, 2016.

Today, organizations store information (data) in different database systems. Each database system has a set of applications that run against it. This data is just bits and bytes on a file system, and only a database can turn the bits and bytes of data into business information. Integration and consolidation of such information into one database system is often difficult, because many of the applications that run against one database may not have an equivalent application that runs against another. To consolidate the information into one database system, we need a heterogeneous database connection. In this post, I'll demonstrate how to connect PostgreSQL to Oracle, one such heterogeneous database, using different methods.

Below are a few methods to connect to an Oracle database from PostgreSQL:
  • Using ODBC Driver
  • Using Foreign Data Wrappers
  • Using Oracle Call Interface (OCI) Driver

Using ODBC Driver

Open DataBase Connectivity (ODBC) is a standard software API for accessing DBMSs. An ODBC driver/ODBC data source (API) is a library that allows applications to connect to any database for which an ODBC driver is available. It's a middle layer that translates the application's data queries into commands that the DBMS understands. To use this method, the open source unixODBC and the Oracle ODBC driver (Basic/ODBC/Devel) packages are required, along with a module in PostgreSQL that can communicate with a DSN created using unixODBC and the Oracle ODBC driver. A few years back CyberTec released a module called ODBC-Link; at present it is obsolete, however it provides a dblink-style implementation for PostgreSQL to connect to any other ODBC-compliant database. We can use this module for basic connectivity to Oracle. Let's see.

Install unixODBC
tar -xvf unixODBC-2.3.4.tar.gz
cd unixODBC-2.3.4/
./configure --sysconfdir=/etc
make
make install
Binary/Libraries/configuration files location: /usr/local/bin,/usr/local/lib,/etc(odbc.ini,odbcinst.ini)

Install Oracle ODBC Driver
rpm -ivh oracle-instantclient11.2-basic-11.2.0.4.0-1.x86_64.rpm
rpm -ivh oracle-instantclient11.2-odbc-11.2.0.4.0-1.x86_64.rpm
rpm -ivh oracle-instantclient11.2-devel-11.2.0.4.0-1.x86_64.rpm
Binary/Libraries location: /usr/lib/oracle/11.2/client64

Install ODBC-Link
tar -zxvf ODBC-Link-1.0.4.tar.gz
cd ODBC-Link-1.0.4
export PATH=/opt/PostgreSQL/9.5/bin:$PATH
which pg_config
make USE_PGXS=1
make USE_PGXS=1 install
Libraries and SQL files location: /opt/PostgreSQL/9.5/share/postgresql/contrib

Installation will create an ODBC-Link module SQL file in the $PGHOME/contrib directory. Load the SQL file, which will create a schema named "odbclink" with the necessary functions in it.
psql -p 5432 -d oratest -U postgres -f /opt/PostgreSQL/9.5/share/postgresql/contrib/odbclink.sql
At this point, we have installed the unixODBC driver, the Oracle ODBC driver and the ODBC-Link module for PostgreSQL. As a first step, we need to create a DSN using the Oracle ODBC driver.

Edit /etc/odbcinst.ini file and pass the drivers deifinition
## Driver for Oracle
[MyOracle]
Description =ODBC for oracle
Driver =/usr/lib/oracle/11.2/client64/lib/libsqora.so.11.1
UsageCount=1
FileUsage = 1
Driver Logging = 7
Edit the /etc/odbc.ini file and create the DSN with the driver mentioned in /etc/odbcinst.ini:
## Host: pg.raghav-node1.com, PORT: 1521
## Oracle Instance Name: ORA11G, Username: mmruser, Password: mmruser
## ODBC Data source: Ora

[Ora]
Description = myoracledb database
Driver = MyOracle
Trace = yes
TraceFile = /tmp/odbc_oracle.log
Database = //pg.raghav-node1.com:1521/ORA11G
UserID = mmruser
Password = mmruser
Port = 1521
After creating the DSN, load all Oracle and unixODBC driver libraries by setting environment variables, and test the connectivity using the OS command-line tools "dltest" and "isql":
[root@172.16.210.161 ~]# export ORACLE_HOME=/usr/lib/oracle/11.2/client64
[root@172.16.210.161 ~]# export LD_LIBRARY_PATH=/usr/local/unixODBC-2.3.4/lib:/usr/lib/oracle/11.2/client64/lib
[root@172.16.210.161 ~]# export ODBCINI=/etc/odbc.ini
[root@172.16.210.161 ~]# export ODBCSYSINI=/etc/
[root@172.16.210.161 ~]# export TWO_TASK=//pg.raghav-node1.com:1521/ORA11G
[root@172.16.210.161 ~]# dltest /usr/lib/oracle/11.2/client64/lib/libsqora.so.11.1
SUCCESS: Loaded /usr/lib/oracle/11.2/client64/lib/libsqora.so.11.1
[root@172.16.210.161 ~]# isql ora -v
+---------------------------------------+
| Connected! |
| |
| sql-statement |
| help [tablename] |
| quit |
| |
+---------------------------------------+
SQL>
Now set the same environment variables for the postgres user so that the libraries are loaded, and restart the PostgreSQL cluster for them to take effect. Then connect to PostgreSQL and call the odbclink functions to connect to the Oracle database.
[root@172.16.210.163 ~]#su - postgres
[postgres@172.16.210.163 ~]$ export ORACLE_HOME=/usr/lib/oracle/11.2/client64
[postgres@172.16.210.163 ~]$ export LD_LIBRARY_PATH=/usr/local/unixODBC-2.3.4/lib:/usr/lib/oracle/11.2/client64/lib
[postgres@172.16.210.163 ~]$ export ODBCINI=/etc/odbc.ini
[postgres@172.16.210.163 ~]$ export ODBCSYSINI=/etc/
[postgres@172.16.210.163 ~]$ export TWO_TASK=//pg.raghav-node1.com:1521/ORA11G
[postgres@172.16.210.163 ~]$ dltest /usr/lib/oracle/11.2/client64/lib/libsqora.so.11.1
SUCCESS: Loaded /usr/lib/oracle/11.2/client64/lib/libsqora.so.11.1
[postgres@172.16.210.163 ~]$ /opt/PostgreSQL/9.5/bin/pg_ctl -D /opt/PostgreSQL/9.5/data/ stop -mf
[postgres@172.16.210.163 ~]$ /opt/PostgreSQL/9.5/bin/pg_ctl -D /opt/PostgreSQL/9.5/data/ start
[postgres@172.16.210.163 ~]$ psql
psql.bin (9.5.2)
Type "help" for help.

postgres=# select odbclink.connect('DSN=Ora');
connect
---------
1
(1 row)
Cool, right? For retrieving and manipulating data, refer to the ODBC-Link README file.

Using Foreign Data Wrappers

SQL/MED (SQL Management of External Data) is an extension to the SQL standard that allows managing external data stored outside the database. SQL/MED provides two components: foreign data wrappers and datalinks. PostgreSQL introduced Foreign Data Wrappers (FDW) in version 9.1 with read-only support, and added write support for this SQL standard in version 9.3. Today, the latest version has a number of features around it, and many varieties of FDW are available to access different remote SQL databases.

oracle_fdw provides an easy and efficient way to access an Oracle database. IMO, it's one of the coolest methods to access a remote database. To compile oracle_fdw with PostgreSQL 9.5, we need the Oracle Instant Client libraries and pg_config set in the PATH. We can use the same Oracle Instant Client libraries used for ODBC-Link. Let's see how it works.

First, set the environment variables for the Oracle Instant Client libraries and pg_config:
export PATH=/opt/PostgreSQL/9.5/bin:$PATH
export ORACLE_HOME=/usr/lib/oracle/11.2/client64
export LD_LIBRARY_PATH=/usr/lib/oracle/11.2/client64/lib
Unzip the oracle_fdw module and compile it with PostgreSQL 9.5
unzip oracle_fdw-1.4.0.zip
cd oracle_fdw-1.4.0/
make
make install
Now switch to the 'postgres' user, restart the cluster with the Oracle Instant Client libraries required by the oracle_fdw extension loaded, and create the extension inside the database.
[postgres@172.16.210.161 9.5]$ export ORACLE_HOME=/usr/lib/oracle/11.2/client64/lib
[postgres@172.16.210.161 9.5]$ export LD_LIBRARY_PATH=/usr/lib/oracle/11.2/client64/lib:$LD_LIBRARY_PATH
[postgres@172.16.210.161 9.5]$ /opt/PostgreSQL/9.5/bin/pg_ctl -D /opt/PostgreSQL/9.5/data/ stop -mf
[postgres@172.16.210.161 9.5]$ /opt/PostgreSQL/9.5/bin/pg_ctl -D /opt/PostgreSQL/9.5/data/ start
[postgres@172.16.210.161 9.5]$ psql
Password:
psql.bin (9.5.2)
Type "help" for help.

postgres=# create extension oracle_fdw;
CREATE EXTENSION
Now you can access the Oracle database.
postgres=# CREATE SERVER oradb FOREIGN DATA WRAPPER oracle_fdw OPTIONS (dbserver '//pg.raghav-node1.com/ORA11G');
CREATE SERVER
postgres=# GRANT USAGE ON FOREIGN SERVER oradb TO postgres;
GRANT
postgres=# CREATE USER MAPPING FOR postgres SERVER oradb OPTIONS (user 'scott', password 'tiger');
CREATE USER MAPPING
postgres=#
postgres=# CREATE FOREIGN TABLE oratab (ecode integer,name char(30)) SERVER oradb OPTIONS(schema 'SCOTT',table 'EMP');
CREATE FOREIGN TABLE
postgres=# select * from oratab limit 3;
ecode | name
-------+--------------------------------
7369 | SMITH
7499 | ALLEN
7521 | WARD
(3 rows)
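
Since the foreign table now behaves like any other PostgreSQL table, application code can read the Oracle data transparently. For instance, a minimal psycopg2 sketch (the connection details here are hypothetical):

import psycopg2

conn = psycopg2.connect(dbname='postgres', user='postgres', host='localhost')
cur = conn.cursor()
# the query is sent to PostgreSQL; oracle_fdw fetches the rows from Oracle
cur.execute("SELECT ecode, name FROM oratab LIMIT 3")
for ecode, name in cur.fetchall():
    print("%s %s" % (ecode, name.strip()))  # name is char(30), so strip the padding
cur.close()
conn.close()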

Using Oracle Call Interface (OCI) Drivers

Oracle Call Interface (OCI) is a type-2 driver, freely available on the Oracle site, which allows a client to connect to an Oracle database. EDB Postgres Advanced Server (also called EPAS), a proprietary product, has a built-in OCI-based database link module called dblink_ora, which connects to an Oracle database using the Oracle OCI drivers. All you have to do to use the dblink_ora module is install EPAS (installation not covered here) and tell EPAS where it can find the Oracle OCI driver libraries. We can make use of the same Oracle Instant Client by specifying its library location in the LD_LIBRARY_PATH environment variable and restarting the EPAS cluster for it to take effect.

First, switch to the "enterprisedb" user, load the libraries and restart the cluster. That's all; we are ready to access the Oracle database.
[enterprisedb@172.16.210.129 ~]$ export LD_LIBRARY_PATH=/usr/lib/oracle/11.2/client64/lib
[enterprisedb@172.16.210.129 bin]$ /opt/PostgresPlus/9.5AS/bin/pg_ctl -D /opt/PostgresPlus/9.5AS/data/ restart
[enterprisedb@172.16.210.129 bin]$ psql
psql.bin (9.5.0.5)
Type "help" for help.

edb=# select dblink_ora_connect('oraconn','localhost','edbora','edbuser','edbuser',1521);
dblink_ora_connect
--------------------
OK
(1 row)
Note: EPAS connects to the Oracle database using the Oracle Instant Client library "libclntsh.so". If you don't find the library in the Oracle client library location, create a symbolic link libclntsh.so pointing to libclntsh.so.version.number. Refer to the documentation.

In the example, dblink_ora_connect establishes a connection to an Oracle database with the user-specified connection information. Later, using the link name ('oraconn' in my case), we can perform operations like SELECT, INSERT, DELETE, UPDATE and COPY using the dblink_ora* functions. All of the functions are covered in the EnterpriseDB documentation here.

All the above methods will be very handy if you are working on migration projects. Hope it's helpful. Thank you.

--Raghav

Pavel Stehule: plpgsql_check 1.0.5 released

From Planet PostgreSQL. Published on May 24, 2016.

A new version of plpgsql_check is available. The most important change is support for the upcoming PostgreSQL 9.6.

https://manager.pgxn.org/distributions/plpgsql_check/1.0.5
https://github.com/okbob/plpgsql_check/releases/tag/v1.0.5

Daniel Pocock: PostBooks, PostgreSQL and pgDay.ch talk

From Planet PostgreSQL. Published on May 23, 2016.

PostBooks 4.9.5 was recently released and the packages for Debian (including jessie-backports), Ubuntu and Fedora have been updated.

Postbooks at pgDay.ch in Rapperswil, Switzerland

pgDay.ch is coming on Friday, 24 June. It is at the HSR Hochschule für Technik Rapperswil, at the eastern end of Lake Zurich.

I'll be making a presentation about Postbooks in the business track at 11:00.

Getting started with accounting using free, open source software

If you are not currently using a double-entry accounting system or if you are looking to move to a system that is based on completely free, open source software, please see my comparison of free, open source accounting software.

Free and open source solutions offer significant advantages: flexibility, the freedom for businesses to choose any programmer to modify the code, and SQL back-ends, multi-user support and multi-currency support as standard. These are all things that proprietary vendors charge extra money for.

Accounting software is the lowest common denominator in the world of business software; people keen on the success of free and open source software may find that encouraging businesses to use one of these solutions is a great way to lay a foundation on which other free software solutions can thrive.

PostBooks new web and mobile front end

xTuple, the team behind Postbooks, has been busy developing a new Web and Mobile front-end for their ERP, CRM and accounting suite, powered by the same PostgreSQL backend as the Linux desktop client.

More help is needed to create official packages of the JavaScript dependencies before the Web and Mobile solution itself can be packaged.

Craig Ringer: PostgreSQL-based application performance: latency and hidden delays

From Planet PostgreSQL. Published on May 23, 2016.

Goldfields Pipeline, Western Australia

Goldfields Pipeline, by SeanMac (Wikimedia Commons)

If you’re trying to optimise the performance of your PostgreSQL-based application you’re probably focusing on the usual tools: EXPLAIN (BUFFERS, ANALYZE), pg_stat_statements, auto_explain, log_statement_min_duration, etc.

Maybe you’re looking into lock contention with log_lock_waits, monitoring your checkpoint performance, etc too.

But did you think about network latency? Gamers know about network latency, but did you think it mattered for your application server?

Latency matters

Typical client/server round-trip network latencies can range from 0.01ms (localhost) through the ~0.5ms of a switched network, 5ms of WiFi, 20ms of ADSL, 300ms of intercontinental routing, and even more for things like satellite and WWAN links.

A trivial SELECT can take in the order of 0.1ms to execute server-side. A trivial INSERT can take 0.5ms.

Every time your application runs a query it has to wait for the server to respond with success/failure and possibly a result set, query metadata, etc. This incurs at least one network round trip delay.

When you’re working with small, simple queries network latency can be significant relative to the execution time of your queries if your database isn’t on the same host as your application.

Many applications, particularly ORMs, are very prone to running lots of quite simple queries. For example, if your Hibernate app is fetching an entity with a lazily fetched @OneToMany relationship to 1000 child items it's probably going to do 1001 queries thanks to the n+1 select problem, if not more. That means it's probably spending 1000 times your network round trip latency just waiting. You can left join fetch to avoid that… but then you transfer the parent entity 1000 times in the join and have to deduplicate it.

Similarly, if you’re populating the database from an ORM, you’re probably doing hundreds of thousands of trivial INSERTs… and waiting after each and every one for the server to confirm it’s OK.

It’s easy to try to focus on query execution time and try to optimise that, but there’s only so much you can do with a trivial INSERT INTO ...VALUES .... Drop some indexes and constraints, make sure it’s batched into a transaction, and you’re pretty much done.

What about getting rid of all the network waits? Even on a LAN they start to add up over thousands of queries.

COPY

One way to avoid latency is to use COPY. To use PostgreSQL’s COPY support your application or driver has to produce a CSV-like set of rows and stream them to the server in a continuous sequence. Or the server can be asked to send your application a CSV-like stream.

Either way, the app can’t interleave a COPY with other queries, and copy-inserts must be loaded directly into a destination table. A common approach is to COPY into a temporary table, then from there do an INSERT INTO ... SELECT ..., UPDATE ... FROM ...., DELETE FROM ... USING..., etc to use the copied data to modify the main tables in a single operation.

That’s handy if you’re writing your own SQL directly, but many application frameworks and ORMs don’t support it, plus it can only directly replace simple INSERT. Your application, framework or client driver has to deal with conversion for the special representation needed by COPY, look up any required type metadata its self, etc.

(Notable drivers that do support COPY include libpq, PgJDBC, psycopg2, and the Pg gem… but not necessarily the frameworks and ORMs built on top of them.)
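
As an illustration, psycopg2 exposes COPY through copy_from() and copy_expert(). A minimal sketch that streams rows from an in-memory buffer into a hypothetical staging table might look like this:

import io
import psycopg2

conn = psycopg2.connect("dbname=test")   # hypothetical connection string
cur = conn.cursor()

# build a tab-separated stream of rows; in a real application this
# could just as well be a file object
buf = io.StringIO()
for i in range(100000):
    buf.write(u"%d\tlabel %d\n" % (i, i))
buf.seek(0)

# one continuous stream to the server instead of 100000 round trips
cur.copy_from(buf, "staging_items", columns=("id", "label"))
conn.commit()
conn.close()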

PgJDBC – batch mode

PostgreSQL’s JDBC driver has a solution for this problem. It relies on support present in PostgreSQL servers since 8.4 and on the JDBC API’s batching features to send a batch of queries to the server then wait only once for confirmation that the entire batch ran OK.

Well, in theory. In reality some implementation challenges limit this so that batches can only be done in chunks of a few hundred queries at best. The driver can also only run queries that return result rows in batched chunks if it can figure out how big the results will be ahead of time. Despite those limitations, use of Statement.executeBatch() can offer a huge performance boost to applications that are doing tasks like bulk data loading remote database instances.

Because it’s a standard API it can be used by applications that work across multiple database engines. Hibernate, for example, can use JDBC batching though it doesn’t do so by default.

libpq and batching

Most (all?) other PostgreSQL drivers have no support for batching. PgJDBC implements the PostgreSQL protocol completely independently, whereas most other drivers internally use the C library libpq that's supplied as part of PostgreSQL.

libpq does not support batching. It does have an asynchronous non-blocking API, but the client can still only have one query “in flight” at a time. It must wait until the results of that query are received before it can send another.

The PostgreSQL server supports batching just fine, and PgJDBC uses it already. So I’ve written batch support for libpq and submitted it as a candidate for the next PostgreSQL version. Since it only changes the client, if accepted it’ll still speed things up when connecting to older servers.

I’d be really interested in feedback from authors and advanced users of libpq-based client drivers and developers of libpq-based applications. The patch applies fine on top of PostgreSQL 9.6beta1 if you want to try it out. The documentation is detailed and there’s a comprehensive example program.

Performance

I thought a hosted database service like RDS or Heroku Postgres would be a good example of where this kind of functionality would be useful. In particular, accessing them from outside their own networks really shows how much latency can hurt.

At ~320ms network latency:

  • 500 inserts without batching: 167.0s
  • 500 inserts with batching: 1.2s

… which is over 120x faster.

You won’t usually be running your app over an intercontinental link between the app server and the database, but this serves to highlight the impact of latency. Even over a unix socket to localhost I saw over a 50% performance improvement for 10000 inserts.

Batching in existing apps

It is unfortunately not possible to automatically enable batching for existing applications. Apps have to use a slightly different interface where they send a series of queries and only then ask for the results.

It should be fairly simple to adapt apps that already use the asynchronous libpq interface, especially if they use non-blocking mode and a select()/poll()/epoll()/WaitForMultipleObjectsEx loop. Apps that use the synchronous libpq interfaces will require more changes.

Batching in other client drivers

Similarly, client drivers, frameworks and ORMs will generally need interface and internal changes to permit the use of batching. If they’re already using an event loop and non-blocking I/O they should be fairly simple to modify.

I’d love to see Python, Ruby, etc users able to access this functionality, so I’m curious to see who’s interested. Imagine being able to do this:

import psycopg2
conn = psycopg2.connect(...)
cur = conn.cursor()

# this is just an idea, this code does not work with psycopg2:
futures = [ cur.async_execute(sql) for sql in my_queries ]
for future in futures:
    result = future.result  # waits if result not ready yet
    ... process the result ...
conn.commit()

Asynchronous batched execution doesn’t have to be complicated at the client level.

COPY is fastest

Where practical, clients should still favour COPY. Here are some results from my laptop:

inserting 1000000 rows batched, unbatched and with COPY
batch insert elapsed:      23.715315s
sequential insert elapsed: 36.150162s
COPY elapsed:              1.743593s
Done.

Batching the work provides a surprisingly large performance boost even on a local unix socket connection…. but COPY leaves both individual insert approaches far behind it in the dust.

Use COPY.

The image

The image for this post is of the Goldfields Water Supply Scheme pipeline from Mundaring Weir near Perth in Western Australia to the inland (desert) goldfields. It’s relevant because it took so long to finish and was under such intense criticism that its designer and main proponent, C. Y. O’Connor, committed suicide 12 months before it was put into commission. Locally people often (incorrectly) say that he died after the pipeline was built when no water flowed – because it just took so long everyone assumed the pipeline project had failed. Then weeks later, out the water poured.

Django meetup Amsterdam 18 May 2016

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 22, 2016.

Summary of the Django meetup organized at crunchr in Amsterdam, the Netherlands.

(I gave a talk on the django admin, which I of course don't have a summary of yet, though my brother made a summary of an almost-identical talk I did the Friday before)

Reducing boilerplate with class-based views - Priy Werry

A view can be more than just a function. Views can also be class-based, and Django has quite a lot of those. For example, the TemplateView is very quick for rendering a template. Boilerplate reduction.

Django REST framework is a good example of class-based views usage. It really helps you reduce the amount of boring boilerplate and concentrate on your actual code.

Examples of possible boilerplate code:

  • Parameter validation.
  • Pagination.
  • Ordering.
  • Serialisation.

They wanted to handle this a bit like django's middleware mechanism, but then view-specific. So they wrote a base class that performed most of the boilerplate steps. So the actual views could be fairly simple.

It also helps with unit testing: normally you'd have to test all the corner cases in all your views; now you only have to test your base class for that.

Custom base classes also often means you have methods that you might re-define in subclasses to get extra functionality. In those cases make sure you call the parent class's original method (when needed).

Users of your views (especially when it is an API) get the advantage of having a more consistent API. It is automatically closer to the specification. The API should also be easier to read.

Meetups on steroids - Bob Aalsma

"Can I get a consultant for this specific subject?" Some managers find it difficult to allow this financially. A special course with a deep-dive is easier to allow.

He would like to be a kind of a broker between students and teachers to arrange it: "Meetups on steroids: pick the subject - any subject; pick the date - any date; pick the group - any group"

Security in Django - Joery van der Zwart

Joery comes out of the security world. He doesn't know anything from the inside of django, but a lot of the outside of django. He's tested a lot of them.

Security is as strong as its weakest link. People are often the weakest link. Django doesn't protect you if you explicitly circumvent its security mechanisms as a programmer.

Django actually protects you a lot!

A good thing is to look at the OWASP list of top 10 errors. (See also Florian Apolloner's talk at 'Django under the Hood' 2015, for instance).

  • SQL injection. Protection is integrated in django. But watch out when doing raw sql queries, because they are really raw and unprotected (a short illustration follows after this list). If you work through the model layer, you're safe.
  • Authentication and sessions. Django's SessionSecurityMiddleware is quite good. He has some comments on authentication, though, so he advises doing that one yourself. (Note: the local core committer looked quite suspicious as this was the first he heard about it. Apparently there are a number of CVEs that are unfixed in Django. Joery will share the numbers.)
  • XSS injection. User-fillable fields that aren't escaped. Django by default... yes, protects you against this. Unless you use {% autoescape off %}, so don't do that.
  • Direct object reference. He doesn't agree with this point. So ignore it.
  • Security misconfiguration. Basically common sense. Don't have DEBUG = True on in your production site. Django's output is very detailed and thus very useful for anyone breaking into your site.
  • Sensitive data. Enable https. Django doesn't enforce it. But use https. Extra incentive: google lowers the page ranking for non-https sites...
  • Access control. It is very very very hard to get into Django this way. He says django is one of the few systems to fix it this rigidly!
  • CSRF. Django protects you. Unless you explicitly use @csrf_exempt...
  • Known vulnerabilities. Update django! Because there have been fixes in django. Older versions are thus broken.
  • Insecure forwards/redirects. Once you've enabled django's default middleware, you're secure.
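
As a short illustration of the raw-query caveat in the first point above: Django's raw() accepts query parameters separately from the SQL string, so the database driver does the escaping. The model and field names below are hypothetical:

from myapp.models import Article  # hypothetical model


def search_articles(keyword):
    # safe: the keyword is passed as a parameter, never interpolated into the SQL
    return list(Article.objects.raw(
        "SELECT * FROM myapp_article WHERE title LIKE %s",
        ["%" + keyword + "%"]))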

So Django is quite secure, but you are not.

Look at django's security documentation. And look at https://www.ponycheckup.com. You can check your site with it. The good thing is that it is simple. It only checks django itself, though.

With some other tools (like nessus) you have to watch out for false positives, though. So if you don't know how to interpret the results, you'll be scared shitless.

A good one: Qualys SSLlabs https checker to get your ssl certificate completely right. (Note: see my blog post fixing ssl certificate chains for some extra background.)

"Owasp zap": open source tool that combines checker and reduces the number of false positives.

The summary:

  • Good: django with middleware.
  • Good: django provides a good starting point.
  • Bad: experimenting. Be very sure you're doing it right. Look at documentation.
  • Bad: do it yourself. Most of the times.

Django girls Amsterdam

On 25 June there'll be a Django Girls workshop in Amsterdam. Everything's set, but they do still need coaches.

Evennia 0.6 !

By Griatch's Evennia musings (MU* creation with Django+Twisted) from Django community aggregator: Community blog posts. Published on May 22, 2016.


As of today, I merged the development branch to make version 0.6 of the MU* development system and server Evennia.

Evennia 0.6 comes with a lot of updates, mainly in the way Evennia talks to the outside world. All communication is now standardized, so there is no particular treatment of things like text - text is just one of many standardized commands being passed between the server and the client (whether over telnet, ssh, websockets or ajax/comet).

For example the user can now easily plug in "inputfuncs" to handle any data coming from the client. If you want your client to offer some particular functionality, you just need to plop in a python function to handle it, server-side. We also now offer a lot of utility functions for things like monitoring change (tell the client whenever your health status changes so it can update a health bar or flash the screen).

The HTML5 webclient has itself updated considerably. Most is happening behind the scenes though. Notably the webclient's javascript component is split into two:

  • evennia.js acts as a library for handling all communication with the server part of Evennia. It offers events for a GUI library to plug into and send/receive. It will also gracefully degrade from a websocket connection to AJAX/COMET long-polling if the player uses an older browser.
  • evennia_gui.js is the default front end and implements a traditional and stable "telnet-like" interface. The HTML part uses Django templating to make it easy to customize and expand. Since this simply makes use of the resources offered by evennia.js, one could pretty easily slip in some other GUI library here, or set up calls to get all sorts of interesting information from the server (which talks back using inputfuncs).
There is a truckload of further updates and features, which are covered on the mailing list.
Griatch

Károly Nagy: Postgresql server fails to start in recovery with systemd

From Planet PostgreSQL. Published on May 22, 2016.

I ran into an issue while trying to set up a simple test system for myself to move from CentOS 6 to 7. When I was about to start a standby slave, systemd reported a PostgreSQL startup timeout and stopped it, although it was running perfectly fine doing recovery.

Issue

In /var/log/messages, systemd reports the failure:

systemd: postgresql-9.5.service start operation timed out. Terminating.
systemd: Failed to start PostgreSQL 9.5 database server.
systemd: Unit postgresql-9.5.service entered failed state.
systemd: postgresql-9.5.service failed.

Meanwhile, in the PostgreSQL log:

[1342]: [3-1] host=,user=,db=,tx=0,vtx= LOG:  received smart shutdown request
[1383]: [3-1] host=,user=,db=,tx=0,vtx= LOG:  shutting down
[1383]: [4-1] host=,user=,db=,tx=0,vtx= LOG:  database system is shut down

Cause

In the systemd unit file, postgresql is started with the -w flag, which means "-w: wait until operation completes". Hence systemd fails after the configured timeout.

ExecStart=/usr/pgsql-9.5/bin/pg_ctl start -D ${PGDATA} -s -w -t 300

Fix

Change the -w flag to -W in the systemd service file (/usr/lib/systemd/system/postgresql-9.5.service) and reload the daemon.
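
The edited ExecStart line (only the flag changes) would then read:

ExecStart=/usr/pgsql-9.5/bin/pg_ctl start -D ${PGDATA} -s -W -t 300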

systemctl daemon-reload
service postgresql-9.5 start

Hopefully this will save you a couple of minutes of debugging.

Deploying a Django Website on Heroku

By DjangoTricks from Django community aggregator: Community blog posts. Published on May 21, 2016.

Photo by Frances Gunn

Once you have a working project, you have to host it somewhere. One of the most popular deployment platforms nowadays is Heroku. Heroku belongs to the Platform as a Service (PaaS) category of cloud computing services. Every Django project you host on Heroku runs inside a smart container in a fully managed runtime environment. Your project can scale horizontally (adding more computing machines) and you pay for what you use, starting with a free tier. Moreover, you won't need many system administration skills to do the deployment - once you do the initial setup, further deployments are as simple as pushing a Git repository to a special heroku remote.

However, there are some gotchas to know before choosing Heroku for your Django project:

  • You have to use a PostgreSQL database with your project. MySQL is not an option.
  • You cannot store your static and media files on Heroku. You should use Amazon S3 or some other storage for that.
  • There is no mail server associated with Heroku. You can use the third-party SendGrid add-on at additional cost, a Gmail SMTP server with limits on the amount of sent email, or some other SMTP server.
  • The Django project must be version-controlled under Git.
  • Heroku works with Python 2.7. Python 3 is not yet supported.

Recently I deployed a small Django project on Heroku. To have a quick reference for the future, I have summed up the process here.

1. Install Heroku Toolbelt

Sign up for a Heroku account. Then install Heroku tools for doing all the deployment work in the shell.

To connect your shell with Heroku, type:

$ heroku login

When asked, enter your Heroku account's email and password.

2. Prepare Pip Requirements

Activate your project's virtual environment and install Python packages required for Heroku:

(myproject_env)$ pip install django-toolbelt

This will install django, psycopg2, gunicorn, dj-database-url, static3, and dj-static to your virtual environment.

Install boto and Django Storages to be able to store static and media files on an S3 bucket:

(myproject_env)$ pip install boto
(myproject_env)$ pip install django-storages

Go to your project's directory and create the pip requirements that Heroku will use in the cloud for your project:

(myproject_env)$ pip freeze -l > requirements.txt

3. Create Heroku-specific Files

You will need two files to tell Heroku what Python version to use and how to start a webserver.

In your project's root directory create a file named runtime.txt with the following content:

python-2.7.11

Then at the same location create a file named Procfile with the following content:

web: gunicorn myproject.wsgi --log-file -

4. Configure the Settings

As mentioned in the "Web Development with Django Cookbook - Second Edition", we keep the development and production settings in separate files, both importing the common settings from a base file.

Basically we have myproject/conf/base.py with the settings common for all environments.

Then myproject/conf/dev.py contains the local database and dummy email configuration as follows:

# -*- coding: UTF-8 -*-
from __future__ import unicode_literals
from .base import *

DATABASES = {
    "default": {
        "CONN_MAX_AGE": 0,
        "ENGINE": "django.db.backends.postgresql",
        "HOST": "localhost",
        "NAME": "myproject",
        "PASSWORD": "",
        "PORT": "",
        "USER": "postgres"
    }
}

EMAIL_BACKEND = "django.core.mail.backends.console.EmailBackend"

Lastly for the production settings we need myproject/conf/prod.py with special database configuration, non-debug mode, and unrestrictive allowed hosts as follows:

# -*- coding: UTF-8 -*-
from __future__ import unicode_literals
from .base import *
import dj_database_url

DATABASES = {
    "default": dj_database_url.config()
}

ALLOWED_HOSTS = ["*"]

DEBUG = False

Now let's open myproject/settings.py and add the following content:

# -*- coding: UTF-8 -*-
from __future__ import unicode_literals
from .conf.dev import *

Finally, open myproject/wsgi.py and change the location of the DJANGO_SETTINGS_MODULE there:

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.conf.prod")
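
For reference, the whole myproject/wsgi.py could then look roughly like this. Wrapping the application with Cling from the dj-static package (installed in step 2) is optional here, since static files will be served from S3, so treat this as an illustrative sketch rather than the exact file:

# myproject/wsgi.py (illustrative sketch)
import os

from django.core.wsgi import get_wsgi_application
from dj_static import Cling

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.conf.prod")

application = Cling(get_wsgi_application())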

5. Set Up Amazon S3 for Static and Media Files

Create an Amazon S3 bucket named myproject.media at the AWS Console (the web interface for Amazon Web Services). Go to the properties of the bucket, expand the "Permissions" section, click on the "Add bucket policy" button, and enter the following:

{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "AllowPublicRead",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::myproject.media/*"
        }
    ]
}

This ensures that files on the S3 bucket will be accessible publicly without any API keys.

Go back to your Django project and add storages to the INSTALLED_APPS in myproject/conf/base.py:

INSTALLED_APPS = [
    # ...
    "storages",
]

Media files and static files will be stored under different paths in the S3 bucket. To implement that, we need to create two Python classes in a new file myproject/s3utils.py as follows:

# -*- coding: UTF-8 -*-
from __future__ import unicode_literals
from storages.backends.s3boto import S3BotoStorage

class StaticS3BotoStorage(S3BotoStorage):
    """
    Storage for static files.
    """

    def __init__(self, *args, **kwargs):
        kwargs['location'] = 'static'
        super(StaticS3BotoStorage, self).__init__(*args, **kwargs)


class MediaS3BotoStorage(S3BotoStorage):
    """
    Storage for uploaded media files.
    """

    def __init__(self, *args, **kwargs):
        kwargs['location'] = 'media'
        super(MediaS3BotoStorage, self).__init__(*args, **kwargs)

Finally, let's edit the myproject/conf/base.py and add AWS settings:

AWS_S3_SECURE_URLS = False       # use http instead of https
AWS_QUERYSTRING_AUTH = False # don't add complex authentication-related query parameters for requests
AWS_S3_ACCESS_KEY_ID = "..." # Your S3 Access Key
AWS_S3_SECRET_ACCESS_KEY = "..." # Your S3 Secret
AWS_STORAGE_BUCKET_NAME = "myproject.media"
AWS_S3_HOST = "s3-eu-west-1.amazonaws.com" # Change to the media center you chose when creating the bucket

STATICFILES_STORAGE = "myproject.s3utils.StaticS3BotoStorage"
DEFAULT_FILE_STORAGE = "myproject.s3utils.MediaS3BotoStorage"

# the next monkey patch is necessary to allow dots in the bucket names
import ssl
if hasattr(ssl, '_create_unverified_context'):
    ssl._create_default_https_context = ssl._create_unverified_context

Collect static files to the S3 bucket:

(myproject_env)$ python manage.py collectstatic --noinput

6. Set Up Gmail to Send Emails

Open myproject/conf/prod.py and add the following settings:

EMAIL_USE_TLS = True
EMAIL_HOST = "smtp.gmail.com"
EMAIL_HOST_USER = "myproject@gmail.com"
EMAIL_HOST_PASSWORD = "mygmailpassword"
EMAIL_PORT = 587

7. Push to Heroku

Commit and push all the changes to your Git origin remote. Personally I prefer using SourceTree for that, but you can also do it from the command line, in PyCharm, or with other software.

In your project directory type the following:

(myproject_env)$ heroku create my-unique-project

This will create a Git remote called "heroku", and a new Heroku project "my-unique-project" which can be later accessed at http://my-unique-project.herokuapp.com.

Push the changes to heroku remote:

(myproject_env)$ git push heroku master

8. Transfer Your Local Postgres Database To Heroku

Create a local database dump:

(myproject_env)$ PGPASSWORD=mypassword pg_dump -Fc --no-acl --no-owner -h localhost -U myuser mydb > mydb.dump

Upload the database dump temporarily to some server, for example, S3 bucket: http://myproject.media.s3-eu-west-1.amazonaws.com/mydb.dump. Then import that dump into the Heroku database:

(myproject_env)$ heroku pg:backups restore 'http://myproject.media.s3-eu-west-1.amazonaws.com/mydb.dump' DATABASE_URL

Remove the database dump from the S3 bucket.

9. Set Environment Variables

If your Git repository is not private, put your secret values in environment variables rather than in the Git repository directly.

(myproject_env)$ heroku config:set AWS_S3_ACCESS_KEY_ID=ABCDEFG123
$ heroku config:set AWS_S3_SECRET_ACCESS_KEY=aBcDeFg123

To read out the environment variables you can type:

(myproject_env)$ heroku config

To read the environment variables in Python code, open myproject/conf/base.py and add:

import os
AWS_S3_ACCESS_KEY_ID = os.environ.get("AWS_S3_ACCESS_KEY_ID", "")
AWS_S3_SECRET_ACCESS_KEY = os.environ.get("AWS_S3_SECRET_ACCESS_KEY", "")
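
The same pattern works for any other secret. For example, if you prefer not to keep the Gmail password from step 6 in the repository, set another config variable with heroku config:set EMAIL_HOST_PASSWORD=... and read it in myproject/conf/prod.py (the variable name here is just an illustration):

import os
EMAIL_HOST_PASSWORD = os.environ.get("EMAIL_HOST_PASSWORD", "")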

10. Set DNS Settings

Open your domain's DNS settings and set a CNAME record pointing to "my-unique-project.herokuapp.com".

At last, you are done! Drop a note in the comments if I missed something. For further updates, see the next section.

*. Update Production

Push the changes to heroku remote:

(myproject_env)$ git push heroku master

If you have changed something in the static files, collect them again:

(myproject_env)$ python manage.py collectstatic --noinput

Collecting static files to the S3 bucket takes quite a long time, so I do not recommend doing that automatically every time you deploy to Heroku.

REGINA OBE: pgRouting 2.2.3 released with support for PostgreSQL 9.6beta1

From Planet PostgreSQL. Published on May 20, 2016.

pgRouting 2.2.3 was released last week. The main change is that this version now supports PostgreSQL 9.6. Many thanks to Vicky Vergara for working through the issues with PostgreSQL 9.6 and getting it to work. Vicky has also been doing a good chunk of the coding (a lot of Boost refactoring and integrating more Boost features), testing, and documentation in pgRouting, osm2pgrouting, and QGIS pgRoutingLayer in general for pgRouting 2.1, 2.2, and the upcoming 2.3. We are very indebted to her for her hard work.

If you are a Windows user testing the waters of PostgreSQL 9.6beta1, we have pgRouting 2.2.3 binaries and PostGIS 2.3.0dev binaries at http://postgis.net/windows_downloads.


Continue reading "pgRouting 2.2.3 released with support for PostgreSQL 9.6beta1"

Shaun M. Thomas: PG Phriday: Trusty Table Tiers

From Planet PostgreSQL. Published on May 20, 2016.

I always advocate breaking up large Postgres tables for a few reasons. Beyond query performance concerns, maintaining one monolithic structure is always more time consuming and consequentially more dangerous. The time required to create a dozen small indexes may be slightly longer than a single larger one, but we can treat the smaller indexes as incremental. If we want to rebuild, add more indexes, or fix any corruption, why advocate an all-or-nothing proposition? Deleting from one large table will be positively glacial compared to simply dropping an entire expired partition. The list just goes on and on.

On the other hand, partitioning in Postgres can be pretty intimidating. There are so many manual steps involved, that it’s easy to just kick the can down the road and tackle the problem later, or not at all. Extensions like the excellent pg_partman remove much of the pain involved in wrangling an army of partitions, and we strongly suggest using some kind of tool-kit instead of reinventing the wheel.

The main limitation with most existing partition management libraries is that they never deviate from the examples listed in the Postgres documentation. It’s always: create inherited tables, add redirection triggers, automate, rinse, repeat. In most cases, this is exactly the right approach. Unfortunately triggers are slow, and especially in an OLTP context, this can introduce sufficient overhead that partitions are avoided entirely.

Well, there is another way to do partitioning that’s almost never mentioned. The idea is to actually utilize the base table as a storage target, and in lieu of triggers, schedule data movement during low-volume time periods. The primary benefit to this is that there’s no more trigger overhead. It also means we can poll the base table itself for recent data with the ONLY clause. This is a massive win for extremely active tables, and the reason tab_tier was born.

Let’s create some data for testing this out:

CREATE TABLE sensor_log (
  id            INT PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);
 
INSERT INTO sensor_log (id, location, reading, reading_date)
SELECT s.id, s.id % 1000, s.id % 100,
       CURRENT_DATE - ((s.id * 10) || 's')::INTERVAL
  FROM generate_series(1, 5000000) s(id);
 
CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
 
ANALYZE sensor_log;

Now we have 5-million rows in a table with a defined date column that’s a perfect candidate for partitioning. The way this data is currently distributed, we have content going back to late 2014. Imagine in this scenario we don’t need this much live information at all times. So we decide to keep one week of logs for active use, and relegate everything else into some kind of monthly partition.

This is how all of that would look in tab_tier:

CREATE EXTENSION tab_tier;
 
SELECT tab_tier.register_tier_root('public', 'sensor_log', 'reading_date');
 
UPDATE tab_tier.tier_root
   SET root_retain = '1 week'::INTERVAL,
       part_period = '1 month'::INTERVAL
 WHERE root_schema = 'public'
   AND root_table = 'sensor_log';
 
SELECT tab_tier.bootstrap_tier_parts('public', 'sensor_log');
 
\dt
 
                 List OF relations
 Schema |          Name          | TYPE  |  Owner   
--------+------------------------+-------+----------
 public | sensor_log             | TABLE | postgres
 public | sensor_log_part_201410 | TABLE | postgres
 public | sensor_log_part_201411 | TABLE | postgres
 public | sensor_log_part_201412 | TABLE | postgres
 public | sensor_log_part_201501 | TABLE | postgres
 public | sensor_log_part_201502 | TABLE | postgres
 public | sensor_log_part_201503 | TABLE | postgres
 public | sensor_log_part_201504 | TABLE | postgres
 public | sensor_log_part_201505 | TABLE | postgres
 public | sensor_log_part_201506 | TABLE | postgres
 public | sensor_log_part_201507 | TABLE | postgres
 public | sensor_log_part_201508 | TABLE | postgres
 public | sensor_log_part_201509 | TABLE | postgres
 public | sensor_log_part_201510 | TABLE | postgres
 public | sensor_log_part_201511 | TABLE | postgres
 public | sensor_log_part_201512 | TABLE | postgres
 public | sensor_log_part_201601 | TABLE | postgres
 public | sensor_log_part_201602 | TABLE | postgres
 public | sensor_log_part_201603 | TABLE | postgres
 public | sensor_log_part_201604 | TABLE | postgres
 public | sensor_log_part_201605 | TABLE | postgres

Taking this piece by piece, the first thing we did after creating the extension itself, was to call the register_tier_root function. This officially tells tab_tier about the table, and creates a record with configuration elements we can tweak. And that’s exactly what we do by setting the primary retention window and the partition size. Creating all of the partitions manually is pointless, so we also invoke bootstrap_tier_parts. Its job is to check the range of dates currently represented in the table, and create all of the partitions necessary to store it.

What did not happen here, is any data movement. This goes back to our original concern regarding maintenance. Some tables may be several GB or even TB in size, and moving all of that data as one gargantuan operation would be a really bad idea. Instead, tab_tier provides the migrate_tier_data function to relocate data for a specific partition.

With a bit of clever SQL, we can even generate a script for it:

COPY (
  SELECT 'SELECT tab_tier.migrate_tier_data(''public'', ''sensor_log'', ''' || 
         REPLACE(part_table, 'sensor_log_part_', '') || ''');' AS part_name
    FROM tab_tier.tier_part
    JOIN tab_tier.tier_root USING (tier_root_id)
   WHERE root_schema = 'public'
     AND root_table = 'sensor_log'
   ORDER BY part_table
) TO '/tmp/move_parts.sql';
 
\i /tmp/move_parts.SQL
 
SELECT COUNT(*) FROM ONLY sensor_log;
 
 COUNT 
-------
 60480
 
SELECT COUNT(*) FROM sensor_log_part_201504;
 
 COUNT  
--------
 259200

Following some debugging notices, all of our data has moved to the appropriate partition. We verified that by checking the base table and a randomly chosen partition for record counts. At this point, the table is now ready for regular maintenance. In this case “maintenance” means regularly calling the cap_tier_partitions and migrate_all_tiers functions. The first ensures target partitions always exist, and the second moves any pending data to a waiting partition for all tables we’ve registered.
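
As an illustration (this is not part of the extension itself), the recurring maintenance could be a small script triggered from cron; the sketch below assumes psycopg2 and connection details provided through the usual libpq environment variables:

# tab_tier maintenance sketch: ensure partitions exist, then move aged rows
import psycopg2

conn = psycopg2.connect("")  # connection parameters taken from PG* environment variables
conn.autocommit = True

with conn.cursor() as cur:
    # create any target partitions that do not exist yet
    cur.execute("SELECT tab_tier.cap_tier_partitions();")
    # relocate rows older than root_retain for all registered tables
    cur.execute("SELECT tab_tier.migrate_all_tiers();")

conn.close()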

And that’s it. We’re completely done with this table. If we stopped here, we could be secure in the knowledge we no longer have to worry about some gigantic monolith ruining our day some time in the future. But that’s not how tab_tier got its name. One or two levels does not a tier make; the real “secret sauce” is its support for long term storage.

One thing we didn’t really cover, and most partition systems never even consider, is that partitioning is only half of the story. On an extremely active system, having months or years of data just sitting around is relatively frowned upon. The mere presence of older data might encourage using it, transforming our finely tuned OLTP engine into a mixed workload wreck. One or two queries against those archives, and suddenly our cache is tainted and everything is considerably slower.

We need to move that data off of the system, and there are quite a few ways to do that. Some might use ETL scripts or systems like Talend to accomplish that goal. Or we can just use tab_tier and a Postgres foreign table. Let’s now dictate that only six months of archives should ever exist on the primary server. Given that constraint, this is how we could proceed:

-- Do this on some kind of archive server
 
CREATE USER arc_user PASSWORD 'PasswordsAreLame';
 
CREATE TABLE sensor_log (
  id            INT PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL,
  snapshot_dt   TIMESTAMP WITHOUT TIME ZONE
);
 
GRANT ALL ON sensor_log TO arc_user;
 
-- Back on the data source...
 
UPDATE tab_tier.tier_root
   SET lts_threshold = '6 months'::INTERVAL,
       lts_target = 'public.sensor_log_archive'
 WHERE root_schema = 'public'
   AND root_table = 'sensor_log';
 
CREATE EXTENSION postgres_fdw;
 
CREATE USER arc_user PASSWORD 'PasswordsAreLame';
GRANT tab_tier_role TO arc_user;
GRANT ALL ON ALL TABLES IN SCHEMA PUBLIC TO tab_tier_role;
 
CREATE SERVER arc_srv 
  FOREIGN DATA WRAPPER postgres_fdw 
  OPTIONS (dbname 'postgres', host 'archive-host');
 
CREATE USER MAPPING FOR arc_user 
  SERVER arc_srv 
  OPTIONS (USER 'arc_user', password 'PasswordsAreLame');
 
CREATE FOREIGN TABLE sensor_log_archive (
  id            INT,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL,
  snapshot_dt   TIMESTAMP WITHOUT TIME ZONE
 
) SERVER arc_srv OPTIONS (TABLE_NAME 'sensor_log');
 
GRANT INSERT ON sensor_log_archive TO tab_tier_role;
 
-- Connect as arc_user, then run this:
 
SELECT tab_tier.archive_tier('public', 'sensor_log');
 
SELECT COUNT(*) FROM sensor_log_archive;
 
  COUNT  
---------
 3263360

Whew! That was a lot of work. Maybe a future version of tab_tier should provide a wrapper for that. In any case, all we did was set up a foreign table on a remote server, create a separate user to handle the data movement, and tell tab_tier about our six month threshold for long term storage, and the target table itself.

Using a foreign table isn’t required here, since the target can be any kind of table, but isn’t that the whole point of this exercise? The cool thing about Postgres foreign data wrappers is that we could have used any of them. In this case we’re just moving data to another remote Postgres instance, but we could have dumped everything into Cassandra or Hadoop instead. Take that, subspace!

For those who noticed all of the ridiculous GRANT statements, please remember this is only for demonstration purposes. A real system would probably use ALTER DEFAULT PRIVILEGES to give tab_tier_role more limited control over a specific schema and tables specifically designed for archival. The extension doesn’t add its own privileges—even to tables it creates—in case controls are tightly locked down. We don’t want to hijack any carefully laid down security. Instead tab_tier just propagates any ACLs it finds on root tables to new partitions.

This is the same reason we ran the archive_tier (or archive_all_tiers) routine as a different user. Since we’re using a foreign user mapping, we want to limit data leak potential by isolating the movement process from the table owner or a superuser. We recommend using this approach for any foreign table usage whenever possible.

With all of that out of the way, we still need to clean up. We archived all of the partition content, but the partitions themselves are still sitting around and gathering dust. Let’s fix that by running one final step as the owner of sensor_log or any superuser:

SELECT part_table
  FROM tab_tier.tier_part
 WHERE is_archived;
 
       part_table       
------------------------
 sensor_log_part_201410
 sensor_log_part_201411
 sensor_log_part_201412
 sensor_log_part_201501
 sensor_log_part_201502
 sensor_log_part_201503
 sensor_log_part_201504
 sensor_log_part_201505
 sensor_log_part_201506
 sensor_log_part_201507
 sensor_log_part_201508
 sensor_log_part_201509
 sensor_log_part_201510
 
SELECT tab_tier.drop_archived_tiers();
 
SELECT COUNT(*) FROM sensor_log_archive;
 
  COUNT  
---------
 1736640

During the archival process itself, tab_tier marks the related metadata so archived tables will no longer be used in any of the data movement functions. It also makes them an easy target for removal with a maintenance function. We can see that everything worked as a large portion of our data is no longer part of the sensor_log inheritance tree. Now the archived data is securely located on another system that’s probably geared more toward OLAP use, or some incomprehensible Hive we don’t have to worry about.

I for one, welcome our incomprehensible Hive overlords.

Tomas Vondra: Auditing Users and Roles in PostgreSQL

From Planet PostgreSQL. Published on May 20, 2016.

One of the services we offer are security reviews (or audits, if you want), covering a range of areas related to security. It may be a bit surprising, but a topic that often yields the most serious issues is roles and privileges. Perhaps the reason why roles and privileges are a frequent source of issues is that it seems to be quite simple and similar to things the engineers are familiar with (e.g. Unix system of users and groups), but it turns out there are a few key differences with major consequences.

The other parts are either very straightforward and understandable even for sysadmins without much PostgreSQL experience (e.g. authentication config in pg_hba.conf), or the engineers recognize the complexity and take their time to familiarize themselves with the details (a good example of this is Row Level Security).

That is not to say there are no other interesting topics (e.g. how to use RLS with application-level users), but I’ll leave those for another blog post, as this one is about roles and privileges.

So let’s look at roles and privileges a bit closer…

Owner is a small superuser

When it comes to roles, the initial checks are mostly expected. The role should not be a superuser (as superusers simply bypass various checks), and in general should not have any excessive privileges (e.g. CREATEDB, CREATEROLE and so on).

But it also should not own the database objects (tables, functions, …), since owners can simply grant themselves arbitrary privileges on the objects they own, which turns them into small superusers.

Consider the following example, where we attempt to protect the table from the owner by revoking all the privileges from that role:

db=# CREATE USER u;
CREATE ROLE

db=# SELECT rolsuper FROM pg_roles WHERE rolname = 'u';
 rolsuper
----------
 f
(1 row)

db=# \c 'user=u dbname=db'
You are now connected to database "db" as user "u".

So we have created a user who is not a superuser, and we have connected using that account (that’s the slightly cryptic psql command). Let’s create a table (so the user is an owner) and restrict our own access to it:

db=> CREATE TABLE t (id INT);
CREATE TABLE

db=> REVOKE ALL ON t FROM u;
REVOKE

db=> SELECT * FROM t;
ERROR:  permission denied for relation t

So that works, right? Well, the problem is a user with access to SQL (e.g. an “attacker” that discovered a SQL injection vulnerability) can do this:

db=> GRANT ALL ON t TO u;
GRANT

db=> select * from t;
 id
----
(0 rows)

The owner can simply grant all privileges back to themselves, defeating the whole privilege system. A single SQL injection vulnerability and it’s game over. Another issue with owners is that they are not subject to RLS by default, although that can be fixed with a simple ALTER TABLE ... FORCE ROW LEVEL SECURITY.

In any case, this should be a clear hint that the application should use a dedicated role (or multiple roles), not owning any of the objects.

BTW, users are often surprised when I mention that we can grant privileges on individual columns, e.g. allow SELECT on a subset of columns, UPDATE on a different subset of columns, and so on.

When combined with SECURITY DEFINER functions, this is a great way to restrict access to columns the application should not access directly, but allow special operations. For example it shouldn’t be possible to select all passwords (even if hashed) or e-mails, but it should be possible to verify a password or an e-mail. SECURITY DEFINER functions are great for that, but sadly it’s one of the powerful yet severely underused features :-(

Role inheritance

Let’s assume you have a role that owns the objects, and a separate role used by the application. In fact, if you have a sufficiently complex application, chances are you’ve split it into multiple parts, perhaps segregated into schemas, and each module uses a separate set of roles (owner + application, possibly more).

This gives you the ability to create application roles covering only part of the application e.g. the administration panel needs access to all modules, while a public web interface only needs read-only access to a small subset of modules.

CREATE ROLE module_users;    -- full access to user info
CREATE ROLE module_users_ro; -- limited access user info (register/verify)
CREATE ROLE module_posts;    -- full access to blog posts
CREATE ROLE module_posts_ro; -- read-only access to blog posts
... roles for additional modules ...

CREATE USER admin_user   -- full access
    IN ROLE module_users, module_posts;

CREATE USER web_user     -- limited access
    IN ROLE module_users_ro, module_posts_ro;

In other words, roles may be seen as groups and used for making the privileges easier to manage. There are two aspects that make this different from Unix-like groups: it’s possible to use a multi-level hierarchy of roles (while Unix groups are flat), and there is inheritance (we’ll get to that in a minute).

The above scheme works just fine, but only if you keep the connections for the two users (admin_user and web_user) separate. With a small number of users (modules, applications) that’s manageable, as you can maintain separate connection pools, but as the number of connection pools grows it ceases to serve the purpose. But can we use a single connection pool and keep the benefit of separate users?

Well, yes. We can create another user role for the connection pool and make it a member of all the existing user roles (admin_user and web_user).

CREATE USER pool_user IN ROLE admin_user, web_user

This seems a bit strange, because the new user becomes member of admin_user and web_user roles (users are just roles with LOGIN privilege), effectively inheriting all the privileges. Wasn’t the whole point using roles with limited privileges?

Let me introduce you to the SET ROLE command, which can be used to switch the session to an arbitrary role the user is a member of. So, as the pool_user user is a member of both the admin_user and web_user roles, the connection pool or application may use this:

SET ROLE admin_user

to switch it to “full” privileges for the admin interface, or

SET ROLE web_user

when the connection is intended for the website.

These commands are akin to dropping privileges in Unix. The init scripts are executed as root, but you really don’t want to run all the services as root, so the init script does something like sudo -u or chpst to switch to unprivileged user.
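
For illustration, here is a minimal sketch of how an application sharing a single pool_user connection might switch roles per request (this assumes psycopg2; the helper name and the example table are made up):

import psycopg2

def run_as(conn, role, query, params=None):
    """Run one query under the given role, then drop back to the pool user."""
    with conn.cursor() as cur:
        cur.execute("SET ROLE " + role)  # role must come from a trusted list, never from user input
        try:
            cur.execute(query, params)
            return cur.fetchall()
        finally:
            cur.execute("RESET ROLE")

# conn = psycopg2.connect("dbname=db user=pool_user")
# rows = run_as(conn, "web_user", "SELECT title FROM posts LIMIT 10")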

But wait, we can actually do the opposite. We can start with “no privileges” by default; all we need to do is create the role like this:

CREATE USER pool_user NOINHERIT IN ROLE admin_user, web_user

The user is still a member of the two roles (and so can switch to them using SET ROLE), but inherits no privileges from them. This has the benefit that if the pool or application fails to do the SET ROLE, it will fail due to lack of privileges on the database objects (instead of silently proceeding with full privileges). So instead of starting with full privileges and eventually dropping most of them, with NOINHERIT we start with no privileges and then acquire a limited subset of them.

But why am I wasting time by explaining all this SET ROLE and INHERIT or NOINHERIT stuff? Well, it has implications for testing.

Note: You have to trust the pool/application to actually execute the SET ROLE command with the right target role, and the user must not be able to execute custom SQL on the connection (because then it’s just a matter of RESET ROLE to gain the full privileges, or SET ROLE to switch to another role). If that’s not the case, the shared connection pool is not a path forward for you.

Testing roles

Pretty much no one tests privileges. Or, to be more accurate, everyone tests the positive case implicitly, because if you don’t get the necessary privileges the application breaks down. But only very few people verify that there are no unnecessary/unexpected privileges.

The most straightforward way to test absence of privileges (user has no access) might be to walk through all existing objects (tables, columns) and try all compatible privileges. But that’s obviously a lot of combinations and a lot of additional schema-specific work (data types, constraints, …).

Luckily, PostgreSQL provides a collection of useful functions for exactly this purpose (showing just table-related ones, there are additional functions for other object types):

  • has_any_column_privilege(...)
  • has_column_privilege(...)
  • has_table_privilege(...)

So for example it’s trivial to check which roles have INSERT privilege on a given table:

SELECT rolname FROM pg_roles WHERE has_table_privilege(rolname, 'table', 'INSERT')

or listing tables accessible by a given role:

SELECT oid, relname FROM pg_class WHERE has_table_privilege('user', oid, 'INSERT')

And similarly for other privileges and object types. The testing seems fairly trivial – simply run a bunch of queries for the application users, check that the results match expectations, and we’re done.

Note: It’s also possible to use the information_schema, e.g. table_privileges which essentially just runs a query with has_table_privilege and formats the output nicely.

Except there’s a small catch – the inheritance. It works just fine as long as the role inherits privileges through membership, but as soon as there’s a NOINHERIT somewhere, those privileges will not be considered when checking the access (both in the functions and information_schema). Which makes sense, because the current user does not currently have the privileges, but can gain them easily using SET ROLE.

But of course, PostgreSQL also includes the pg_has_role() function, so we can merge the privileges from all the roles, for example like this:

SELECT DISTINCT relname
  FROM pg_roles CROSS JOIN pg_class
 WHERE pg_has_role('user', rolname, 'MEMBER')
   AND has_table_privilege(rolname, pg_class.oid, 'SELECT')

Making this properly testable requires more work (to handle additional object types and applicable privileges), but you get the idea.
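
To make that concrete, here is a minimal sketch of such a check as it might appear in a test suite, assuming psycopg2 and the example roles above (the expected set of tables is of course application-specific):

import psycopg2

EXPECTED_INSERT_TABLES = {"posts"}  # example: the only table web_user may INSERT into

conn = psycopg2.connect("dbname=db")
with conn.cursor() as cur:
    # merge privileges from every role web_user can switch to via SET ROLE
    cur.execute("""
        SELECT DISTINCT relname
          FROM pg_roles CROSS JOIN pg_class
         WHERE relkind = 'r'
           AND pg_has_role('web_user', rolname, 'MEMBER')
           AND has_table_privilege(rolname, pg_class.oid, 'INSERT')
    """)
    actual = {name for (name,) in cur.fetchall()}
conn.close()

assert actual == EXPECTED_INSERT_TABLES, \
    "unexpected INSERT privileges: %s" % sorted(actual ^ EXPECTED_INSERT_TABLES)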

Summary

Let me briefly summarize this blog post:

  • separate the owner and application user – Don’t use a single role for both things.
  • consider using SET ROLE role – Either drop (INHERIT) or acquire (NOINHERIT).
  • test the expected privileges – Ideally run this as part of regular unit tests if possible.
  • keep it simple – It’s definitely better to have a simple hierarchy of roles you understand.

Scott Mead: Helpful PostgreSQL Logging and Defaults

From Planet PostgreSQL. Published on May 19, 2016.

I use PostgreSQL every single day.  I develop against multiple versions of the database (4 versions running in parallel as I write), I have apps that use it, and I do my daily DBA-ing.  The biggest question I get from newbies and veterans alike is: “What are your PostgreSQL defaults?”

If you follow postgres, you already know that the default configuration (postgresql.conf) is very conservative from a resource (CPU, IOPS, Disk Space) perspective.  Over the last 10 years, we [the community] have developed some straightforward and easy to understand formulas that will help you tune… shared_buffers for example.  The item that always gets left out though is logging.  As a developer, I’m always looking for ways to see “How the database is answering the questions I ask”.  When I get a new postgres instance set up, I have a personal (somewhere in the cobwebs) checklist that I run.  Some are based on the purpose of the deployment (shared_buffers, work_mem, etc…), some are things that I always set.  Aside from memory, the biggest set of “standard” items I set are all related to logging.  I’m big on monitoring (my pg_stat_activity patch was accepted back for 9.2) and having the right detail presented to me is important.

TL;DR

logging_collector = on
log_filename = 'postgresql-%a.log'
log_truncate_on_rotation = on
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
log_checkpoints = on
log_autovacuum_min_duration = 0
log_temp_files = 0
log_lock_waits = on
The goal here is to have postgres ‘bubble up’ details about what’s going on, so that we can see exactly what is happening and when.  In addition to setting up the log files, you can pick up a great log analyzer (like pgBadger or pgCluu) that will break all of this down into a gorgeous, easy-to-use report.

Scott’s logging defaults

logging_collector = on

By default, the logging_collector is disabled; this means that all of the info / error / warning, etc. logs that come out of postgres are sent to stderr.  The problem is, when postgres forks from the terminal and you then close the terminal, you lose all your logs.  Turning this setting on will write all the log messages to a sub-directory of $PGDATA (log_directory = ‘pg_log’ by default, which I like).  This setting also requires a restart, so I always end up setting it before I even start postgres the first time.  Postgres will automatically create the pg_log directory (or whatever log_directory is set to) in the data directory.  Now, all of those info / error / warning messages will get written to a log file in the pg_log directory.

log_filename = ‘postgresql-%a.log’

This sets the name of the actual file that log messages will be written to (in your shiny, new pg_log log_directory).  The %a means that you’ll see Mon, Tue, Wed, Thu, Fri, Sat, Sun.  The patterns are based on standard strftime escapes (man page).  The reason that I like using %a is that you get auto-rotation on the log files.  You will keep one week’s worth of logs; when you rotate 7 days later, you won’t be creating a huge number of log files.

Note: Depending on my requirements, I will adjust this for production.  If I have any special retention policy that I need to abide by, I’ll make the filename postgresql-YYYY-MM-DD-HH24mmSS.log (log_filename = ‘postgresql-%Y-%m-%d-%H%M%S.log’).  The trouble with this is that you’ll need to deal with log cleanup yourself (e.g. a cron job to archive logs more than 30 days old).

log_truncate_on_rotation=on

This essentially says “when I switch log files, if a log already exists with that name, truncate the existing one and write new logs to an empty file”.

For example, on Monday, May 15th, we wrote a log file:

postgresql-Mon.log

Now, at midnight on Monday, May 22nd, postgres is going to rotate its log back to:

postgresql-Mon.log

This is our data from last week.  If you leave log_truncate_on_rotation = off (the default), then postgres will append to that log file.  That means, in December, you’ll have data for every Monday throughout the year in the same file.  If you set this to on, it will nuke the old data and give you only data from the most recent Monday.  If you need to keep log files for longer than 7 days, I recommend using a more complete name for your log files (see log_filename above).

log_line_prefix = ‘%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ‘

The log_line_prefix controls what every single line in the log file looks like.  My personal favorite here is actually stolen directly from pgBadger’s suggested configuration.  pgBadger is an amazing tool for parsing and analyzing the postgres log files.  If you set the log_line_prefix this way, pgBadger can provide incredible detail about what happens where, who did it, and when they did it.  Just to show you the difference…

default log file error message:

LOG: database system was shut down at 2016-05-19 11:40:57 EDT
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
ERROR: column "create_error_for_fun" does not exist at character 8
STATEMENT: select create_error_for_fun;

With a rich log_line_prefix

2016-05-19 11:46:22 EDT [14913]: [1-1] user=,db=,app=,client= LOG: database system was shut down at 2016-05-19 11:46:21 EDT
2016-05-19 11:46:22 EDT [14913]: [2-1] user=,db=,app=,client= LOG: MultiXact member wraparound protections are now enabled
2016-05-19 11:46:22 EDT [14911]: [3-1] user=,db=,app=,client= LOG: database system is ready to accept connections
2016-05-19 11:46:22 EDT [14917]: [1-1] user=,db=,app=,client= LOG: autovacuum launcher started
2016-05-19 11:46:27 EDT [14921]: [1-1] user=postgres,db=postgres,app=psql,client=[local] ERROR: column "create_error_for_fun" does not exist at character 8
2016-05-19 11:46:27 EDT [14921]: [2-1] user=postgres,db=postgres,app=psql,client=[local] STATEMENT: select create_error_for_fun;

Now, I know who, what, where, why and when.

log_checkpoints = on

Like any database, postgres is going to use the disks.  In order to deal with I/O efficiently, postgres (like many other databases) uses something called a checkpoint to synchronize its memory to those disks.  Checkpoints occur periodically based on a number of things (load, configuration, etc.).  The thing to keep in mind is that a checkpoint will use disk I/O; the busier the database, the more it requires.  Setting this to on means that you know without a doubt when a checkpoint occurred.  It’s the same ol’ story: “Every once in a while, I get long-running queries, different ones each time!”… “I’m seeing a spike in IOPS and I don’t know why!” … “Sometimes my data load gets bogged down for some reason!” … etc.

This very well could be due to a large checkpoint occurring.  Since it’s based on load / configuration / time, it’s critical that the server write a log of when a checkpoint occurred so that you’re not left in the dark.  There’s also useful information in these logs about how postgres is behaving (it can even help you tune your memory settings).

2016-05-19 12:01:26 EDT [15274]: [1-1] user=,db=,app=,client= LOG: checkpoint starting: immediate force wait
2016-05-19 12:01:26 EDT [15274]: [2-1] user=,db=,app=,client= LOG: checkpoint complete: wrote 0 buffers (0.0%); 
     0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.001 s; 
     sync files=0, longest=0.000 s, average=0.000 s

(NB: There was some line-wrap here, I’ve manually entered carriage returns.  The second line prints as one, long line)

It’s important to note that this is going to up the volume of your logs.  But, it’s minimal and the benefits far outweigh the few extra bytes needed to write the message.  (pgBadger will parse these up and give you a nice, clear picture of your checkpoint behavior).

 log_autovacuum_min_duration = 0

(The default is log_autovacuum_min_duration = -1.)  This setting isn’t just a boolean on/off.  Essentially, you are telling postgres: “When an autovacuum runs for x milliseconds or longer, write a message to the log.”  Setting this to 0 (zero) means that you will log all autovacuum operations to the log file.

For the uninitiated, autovacuum is essentially a background process that does garbage collection in postgres.  If you’re just starting out, what you really need to know is the following:

  • It’s critical that autovacuum stay enabled
  • autovacuum is another background process that uses IOPS

Because autovacuum is necessary and uses IOPS, it’s critical that you know what it’s doing and when.  Just like log_checkpoints (above), autovacuum runs are based on load (thresholds on update / delete velocities on each table).  This means that vacuum can kick off at virtually any time.

2016-05-19 12:32:25 EDT [16040]: [4-1] user=,db=,app=,client= LOG: automatic vacuum of table "postgres.public.pgbench_branches": index scans: 1
 pages: 0 removed, 12 remain
 tuples: 423 removed, 107 remain, 3 are dead but not yet removable
 buffer usage: 52 hits, 1 misses, 1 dirtied
 avg read rate: 7.455 MB/s, avg write rate: 7.455 MB/s
 system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
2016-05-19 12:32:25 EDT [16040]: [5-1] user=,db=,app=,client= LOG: automatic analyze of table "postgres.public.pgbench_branches" system usage: CPU 0.00s/0.00u sec elapsed 0.00 sec
2016-05-19 12:32:27 EDT [16040]: [6-1] user=,db=,app=,client= LOG: automatic analyze of table "postgres.public.pgbench_history" system usage: CPU 0.01s/0.13u sec elapsed 1.70 sec

Just like with log_checkpoints, the information that we receive here is very rich and can be invaluable when trying to diagnose ‘fleeting’ behavior of a system.  As with log_checkpoints, you are increasing the volume of your log files.  In the past 13 years, I’ve found this type of data well worth the overhead.  The nice thing about this logging parameter is that you can throttle back the volume by saying “only log vacuums that take 10 seconds (10,000 milliseconds) or longer”.

log_temp_files = 0

To be as efficient as possible, postgres tries to do everything it can in memory.  Sometimes, you just run out of that fickle resource.  Postgres has a built-in ‘swap-like’ system that will employ temp files within the data directory to deal with the issue.  If you’ve spent any time around disks (especially spinning rust), you’ll know that swap can cause some serious performance issues.  Just like checkpoints and autovacuum, temp files are going to happen automatically.  Unlike those other two processes, they are going to occur if the queries you are running need the temp space.  From a developer’s perspective, I want to know if the process that I’m engineering is going to use temp.  From a DBA’s perspective, I want to know if the dang developers did something that needs temp space :) (more likely, my dang maintenance jobs are using it).  To help in tuning your queries and maintenance processes, log your temp files.  It’ll tell you what size temp file was needed and which query caused it:

2016-05-19 12:31:20 EDT [15967]: [1-1] user=postgres,db=postgres,app=pgbench,client=[local] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15967.0", size 200204288
2016-05-19 12:31:20 EDT [15967]: [2-1] user=postgres,db=postgres,app=pgbench,client=[local] STATEMENT: alter table pgbench_accounts add primary key (aid)

The log_temp_files parameter is set in units of kilobytes.  Essentially: “write a log if I created a temp file that was this many KB or more in size”.  I prefer 0 (zero) because I want to see all the temp files in use.  Again, in my experience, the “juice is worth the squeeze” here.

 log_lock_waits = on

Databases are servicing lots of clients all trying to do very similar work against the same set of data.  This can cause contention (it’s the nature of the beast).  log_lock_waits lets you see where your contention is.  It will give you detailed, specific information about what waits occurred and the context in which they occurred.

 2016-05-19 13:14:50 EDT [17094]: [1-1] user=postgres,db=postgres,app=psql,client=[local] LOG: process 17094 still waiting for RowExclusiveLock on relation 16847 of database 12403 after 1000.794 ms at character 13
 2016-05-19 13:14:50 EDT [17094]: [2-1] user=postgres,db=postgres,app=psql,client=[local] DETAIL: Process holding the lock: 17086. Wait queue: 17094.
 2016-05-19 13:14:50 EDT [17094]: [3-1] user=postgres,db=postgres,app=psql,client=[local] STATEMENT: delete from pgbench_tellers ;

 

These are what I call ‘reasonable defaults’ for logging in postgres.  Again, these are the settings that I configure every time I set up a new cluster, whether it’s for dev / test / toy / prod.

Happy querying!

Pallavi Sontakke: How Postgres-XL is tested

From Planet PostgreSQL. Published on May 19, 2016.

The purpose of this blog is to explain the process of Quality Analysis that Postgres-XL goes through internally at 2ndQuadrant. Here, I describe the bare minimum tests that each release goes through, not to mention the other tests carried out by many 2ndQuadrant and other community members.

  •   Regression tests are carried out to ensure that the defect fixes or enhancements to PostgreSQL have not affected Postgres-XL.

We want to keep up-to-date with PostgreSQL features and performance enhancements. Also, many tools may work only with newer PostgreSQL releases. So, we merge each PostgreSQL minor release in a timely fashion. When merging the regression tests therein, we need to continuously and consciously check whether the new features are supported in Postgres-XL. Sometimes there may be a gap between the expected outputs of PostgreSQL and Postgres-XL, which has to be tracked.

  • Functional tests are done to validate that the functionalities of Postgres-XL are working as per business requirements. We ensure that all features are functioning as they are expected to.

Initially we created the ‘Major differences and Limitations’ tests module. The Postgres-XL project (initially Postgres-XC) was based on PostgreSQL 9.2. It picked up speed again during the PostgreSQL 9.5 development timeframe. Due to this, there is a known gap of features that Postgres-XL would like to support, but does not currently. We keep track of these with our xl_* functional tests. These tests cover limitations like materialized views, event triggers, foreign data wrappers, etc. On the other hand, they also cover positive functional tests for features like BRIN, logical decoding of WAL data, jsonb, etc.

Additionally, each time a new feature or enhancement is added, we keep adding tests to validate the functionality.

  • Usability tests are performed to validate the ease with which the user interfaces can be used.

We have cluster-setup-utility tests. Creating a Postgres-XL cluster manually requires quite a few steps. We have automated these for ease of use with the pgxc_ctl utility. For simple prototype use, we have added the ‘prepare minimal’ way. Also, for seasoned users, we have added the ‘prepare empty’ way, where they can provision node by node for their specific use. We have automated TAP tests for this utility.

  • Recovery tests are done to test how well Postgres-XL is able to recover from crashes, hardware failures and other similar problems.

Postgres-XL being deployed as a cluster, we realize the importance of data consistency across node crashes. We have crash-recovery test scripts that crash/kill nodes and bring them up again. In parallel sessions, we keep on making database changes with SQL statements, transactions or prepared transactions. We verify that nodes (or their configured standbys) come up fine. We perform data sanity checks to verify proper recovery.

  • Bug tracking is done for all known/reported bugs in our internal tracking system.
  • Future steps

We are looking into using SQLsmith to generate random queries to detect further problems in Postgres-XL.

Also, we are in the process of setting up a continuous integration server for automated builds, deployments and tests of the Postgres-XL project.

Vasilis Ventirozos: Repairing clog corruptions

From Planet PostgreSQL. Published on May 19, 2016.

Yesterday, I got a page from a client about a possible corruption issue affecting one of his clients, so I decided to share how I dealt with it. Before starting with how I approached the problem, I want to say that corruptions don't just happen in Postgres. At OmniTI we manage A LOT of databases, all with different major versions and on different operating systems, and in my time I haven't seen (many) cases where corruption happened because of a Postgres bug (I've seen indexes getting corrupted, but I don't remember ever seeing data being corrupted). What I want to say is that corruptions don't just happen on their own; hardware failures, power outages, and disks getting full are common reasons for corruptions.
A replica and backups should always be in place, and the server should be properly monitored. Unfortunately this server was not managed by us, so none of the above was in place.

At first I saw entries in the logs like:

From the logs:
2016-05-18 15:50:06 EDT::@:[10191]: ERROR:  could not access status of transaction 128604706
2016-05-18 15:50:06 EDT::@:[10191]: DETAIL:  Could not open file "pg_clog/007A": No such file or directory.
2016-05-18 15:50:06 EDT::@:[10191]: CONTEXT:  automatic vacuum of table "pe.pg_toast.pg_toast_4409334"
...
much more to be expected...
...

At this point you know that something went wrong, ideally you want to copy your whole $PGDATA to somewhere else and work there while at the same time you start considering uncompressing your most recent backup. In my case,  there was no backup and not enough disk space to copy $PGDATA.
FUN TIMES :)

I started by creating all the missing clog files that I found in the logs:
dd if=/dev/zero of=/var/db/pgdata/pg_clog/0114 bs=256k count=1
dd if=/dev/zero of=/var/db/pgdata/pg_clog/00D1 bs=256k count=1
dd if=/dev/zero of=/var/db/pgdata/pg_clog/0106 bs=256k count=1
...
...
keep creating them until the logs are clean; there can be a lot of them, in my case more than 100....
...
...

From the logs I also found the tables:

pg_toast.pg_toast_18454
pg_toast.pg_toast_35597
pg_toast.pg_toast_35607
pg_toast.pg_toast_4409334
pg_toast.pg_toast_4409344
pg_toast.pg_toast_8817516


db=# select relname,relnamespace from pg_class where oid in (18454,35597,35607,4409334,4409344,8817516) order by relnamespace;
        relname       | relnamespace
------------------------+--------------
table_case1 |        16872
table_case1 |        16874
table_case2 |        16874
table_case2 |      4409063
table_case1 |      4409063
table_case2 |      8817221
(6 rows)

db=# select oid,nspname from pg_namespace where oid in (16872,16874,16874,4409063,8817221) order by oid;
   oid   |  nspname
---------+------------
   16872 | schema1
   16874 | schema2
 4409063 | schema3
 8817221 | schema4
(4 rows)

With a join I found the schema.table combinations:
schema1.table_case1
schema2.table_case1
schema2.table_case2
schema3.table_case2
schema3.table_case1
schema4.table_case2

Basically we have an application with multiple schemas and 2 kinds of tables were corrupted across 4 schemas.

For table_case1 (simple case, table not referenced by anyone):
For each schema:

set search_path to schema1;
create table badrows (badid int);

DO $f$
declare
    curid INT := 0;
    vcontent TEXT;
    badid INT;
begin
FOR badid IN SELECT id FROM table_case1 LOOP
    curid = curid + 1;
    if curid % 100000 = 0 then
        raise notice '% rows inspected', curid;
    end if;
    begin
        SELECT id
        INTO vcontent
        FROM table_case1 where id = badid;
        vcontent := substr(vcontent,1000,2000);
    exception
        when others then
            insert into badrows values (badid);
            raise notice 'data for id % is corrupt', badid;
            continue;
    end;
end loop;
end;
$f$;

(This script was taken from Josh Berkus' blog and modified to fit my needs.)

create table table_case1_clean as select * from table_case1
where id not in (select badid from badrows);

TRUNCATE table_case1;
vacuum full verbose table_case1;
insert into table_case1 select * from table_case1_clean;
vacuum full analyze verbose table_case1;
drop table badrows;


For table_case2 (this one is referenced by 2 other tables), for each org (schema):

set search_path to schema2;

create table badrows (badid int);

DO $f$
declare
    curid INT := 0;
    vcontent TEXT;
    badid INT;
begin
FOR badid IN SELECT id FROM table_case2 LOOP
    curid = curid + 1;
    if curid % 100000 = 0 then
        raise notice '% rows inspected', curid;
    end if;
    begin
        SELECT id
        INTO vcontent
        FROM table_case2 where id = badid;
        vcontent := substr(vcontent,1000,2000);
    exception
        when others then
            insert into badrows values (badid);
            raise notice 'data for id % is corrupt', badid;
            continue;
    end;
end loop;
end;
$f$;


create table table_case2_clean as select * from table_case2
where id not in (select badid from badrows);

alter table reference_table1 drop constraint reference_table1_fkey;
delete from reference_table1 where table_case2_id in (select badid from badrows) ;

alter table reference_table2 drop constraint reference_table2_fkey;
delete from reference_table2 where table_case2_id in (select badid from badrows);

TRUNCATE table_case2;
vacuum full verbose table_case2;
insert into table_case2 select * from table_case2_clean;
vacuum full analyze verbose table_case2;

ALTER TABLE ONLY reference_table1
    ADD CONSTRAINT reference_table1_fkey FOREIGN KEY (table_case2_id) REFERENCES table_case2(id) ON DELETE CASCADE;

ALTER TABLE ONLY reference_table2
    ADD CONSTRAINT reference_table2_fkey FOREIGN KEY (table_case2_id) REFERENCES table_case2(id);

drop table badrows;

(Please ignore or report any typos here; I replaced the real table names while I was writing this post, so I might have messed up some names.)


What we basically did here was recreate the tables without the corrupted rows.
After this, the tables should be corruption free with the minimum possible data loss.
To ensure that you are corruption free, you should either pg_dump and restore, or vacuum full everything; a normal vacuum will NOT show corruptions.

pe=# vacuum verbose schema1.table_case1;
INFO:  vacuuming " schema1.table_case1"
INFO:  index " schema1.table_case1_pkey" now contains 12175 row versions in 36 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.15 sec.
INFO:  " table_case1": found 0 removable, 12175 nonremovable row versions in 258 out of 258 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.17 sec.
INFO:  vacuuming "pg_toast.pg_toast_18454"
INFO:  index "pg_toast_18454_index" now contains 51370 row versions in 143 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_18454": found 0 removable, 51370 nonremovable row versions in 12331 out of 12331 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.05s/0.03u sec elapsed 0.09 sec.
VACUUM
pe=#
pe=#
pe=# vacuum full verbose  schema1.table_case1;
INFO:  vacuuming " schema1.table_case1"
ERROR:  missing chunk number 0 for toast value 9270408 in pg_toast_18454



Rows were lost; in my case that was acceptable. Your case may not be the same as mine, but I hope this will provide some guidance in case you get into a similar situation.


Thanks for reading
- Vasilis






Conditional Python Dependencies

By Blog of Hynek Schlawack from Django community aggregator: Community blog posts. Published on May 18, 2016.

Since the inception of Python wheels that install without executing arbitrary code, we needed a way to encode conditional dependencies for our packages. Thanks to PEP 426 and PEP 508 we do have a blessed way but sadly the prevalence of old setuptools versions makes them a minefield to use.
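
As a rough illustration of what those PEPs enable (my own sketch, not from the post, assuming a reasonably recent setuptools and a hypothetical package that only needs the enum34 backport on Pythons older than 3.4), a conditional dependency can be declared with a PEP 508 environment marker:

# setup.py -- minimal sketch of a conditional dependency via an environment marker
from setuptools import setup

setup(
    name="example-pkg",      # hypothetical project name
    version="0.1.0",
    install_requires=[
        "requests",
        'enum34; python_version < "3.4"',  # only installed on old Pythons
    ],
)

Whether the marker is honored at install time is exactly the setuptools-version minefield the post refers to.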

Django Channels for Background Tasks

By Albert O'Connor Web Developer from Django community aggregator: Community blog posts. Published on May 18, 2016.

Django Channels is the most exciting thing to happen to Django since, well, Django :).

This little tutorial is what you need to add a background task processor to Django using channels. Our task for this example will just be outputting "Hello, Channels!", but you could imagine running a subprocess on some data or sending an email.

NOTE: channels work on an at-most-once delivery model, so it is possible a message in a channel could be lost; delivery isn't guaranteed. That also means consumers don't have to worry about duplicates.

This example will be stripped down to the basic code without much error checking. There are more detailed examples here and here.

We will start with a simple Django 1.9 app without channels

# urls.py
from django.conf.urls import url
from . import views
urlpatterns = [
    url(r'^$', views.home, name='home'),
]

# views.py
from django.shortcuts import render
def home(request, template="home.html"):
    print("Hello, Channels!")  # long running task of printing.
    return render(
        request,
        template,
        dict(),
    )

You will need to define a home.html template where Django can find it, but besides that this simple site should work and synchronously render what is in your home.html and output "Hello, Channels!" on your terminal.

Now for channelification!

First, you will need Redis. You can use a built-in broker, but Redis is pretty easy to install on Linux or Mac (using Homebrew, at least).

# MacOS
brew install redis

# Debian
apt-get install redis-server

Once that is done you will need to install channels and asgi-redis with pip. This apparently also installs Twisted! and zope.interface. The past returns :)

pip install channels asgi-redis

Add 'channels' to your INSTALLED_APPS and add the following to your settings.py to tell channels how to run:

CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "asgi_redis.RedisChannelLayer",
        "CONFIG": {
            "hosts": [os.environ.get('REDIS_URL', 'redis://localhost:6379')],
        },
        "ROUTING": "myproject.routing.channel_routing",
    },
}

Where you replace "myproject" with the name of your project.

You need to create the routing.py file—beside your settings.py and root urls.py file. In it add the channel_routing list, but for now we will leave it empty.

# routing.py
channel_routing = []

This is actually enough to get what we already had to run.

python manage.py runserver

This runs both the workers and the front-end server daphne, and it will handle Django requests and responses as you are used to.

Workers

Stop the server for now with Control-C. Django has grown a new command: runworker

python manage.py runworker

This won't run on port 8000, it will just run a worker listening on the default channel layer so you can't connect to it directly. Kill it for now with Control-C.

Channels has a front-end server you can run in a separate process called daphne. To use it you need to create an asgi.py file—beside the wsgi.py file Django created for you, or your settings.py file.

# asgi.py
import os
import channels.asgi

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
channel_layer = channels.asgi.get_channel_layer()

Here is how you run daphne; it should be familiar if you have used gunicorn or similar.

daphne myproject.asgi:channel_layer --port 8000

Now you can connect to http://127.0.0.1:8000/ but it just times out because there aren't any workers. In a separate terminal window, use the runworker command to start a worker and try connecting again.

Boom! You are back to where you were, but now there are two processes running. Now for the interesting bit, doing something asynchronously.

Background Task

Websockets are the killer feature of channels and other tutorials cover how to use them. Here we are going to make our own channel for our background task of saying hello.

It starts in routing.py where we will connect our channel to our new consumer. A consumer is a function which takes a message much like how a view takes a request.

# routing.py again
from channels.routing import route
from .consumers import hello

channel_routing = [
    route('background-hello', hello),
]

This means messages put into the background-hello channel will be handled by the hello callable found in the local consumers.py file—place it beside your views.py file in your app folder. We need to write our first consumer:

# consumers.py
def hello(message):
    print("Hello, Channels!")  # long running task or printing

Unlike views, consumers don't have to return a response, but they can send any number of messages down any number of channels they like.
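
As a small illustration of that point (my own sketch, not from the post), here is a consumer that pulls data out of message.content and forwards a result down a second, hypothetical channel named 'background-results':

# consumers.py -- sketch of a consumer that reads the message and sends onward
from channels import Channel

def hello(message):
    # message.content is a plain dict; 'name' is an assumed key for this sketch
    name = message.content.get('name', 'Channels')
    print("Hello, %s!" % name)
    # forward a result onto another channel (assumed name 'background-results')
    Channel('background-results').send({'greeted': name})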

Finally, we need to use our new background channel in our view, so let's edit views.py

# views.py
from django.shortcuts import render
from channels import Channel
def home(request, template="home.html"):
    message = {}
    Channel('background-hello').send(message)
    return render(
        request,
        template,
        dict(),
    )

That should be enough to get "Hello, Channels!" printed asynchronously to your terminal.

What is interesting is you didn't have to do anything to get the HTTP request and response channels to work. Including and initializing the channels library is enough for it to create those channels and route based on urls.py. Everything Django works as it always did.

This just scratches the surface of what you can do with channels, so start using it today!

You can view the code from this blog post on GitHub

Oleg Bartunov: Refined search

From Planet PostgreSQL. Published on May 18, 2016.

Sometimes, a query that is too specific can be risky, since the search could return zero or very few results.
It is better to search with a wider query and use ranking to sort the results.

Example:
SELECT title,ts_rank_cd(fts, to_tsquery('english','x-ray & russian')) AS rank 
FROM apod WHERE fts  @@ to_tsquery('english','x-ray&russian') ORDER BY rank 
DESC LIMIT 5;
                 title                  |   rank
----------------------------------------+-----------
 The High Energy Heart Of The Milky Way | 0.0240938
(1 row)

SELECT title,ts_rank_cd(fts, to_tsquery('english','x-ray & russian')) AS rank 
FROM apod WHERE fts  @@ to_tsquery('english','x-ray') ORDER BY rank DESC LIMIT 5;
                 title                  |   rank
----------------------------------------+-----------
 The High Energy Heart Of The Milky Way | 0.0240938
 X-Ray Jet From Centaurus A             |         0
 Barnacle Bill And Sojourner            |         0
 The Crab Nebula in X-Rays              |         0
 M27: The Dumbbell Nebula               |         0
(5 rows)

US PostgreSQL Association: Unsung Heroes: Steve Atkins

From Planet PostgreSQL. Published on May 18, 2016.

Continuing my Building a Better community series, I contacted Steve Atkins. There was an interesting twist to this particular Unsung Hero. I initially contacted lluad on #postgresql (irc.freenode.net) due to his continuous and untiring efforts to help people on channel. What I didn't know is that Steve Atkins is actually lluad. Read on for some interesting bits.

How do you use PostgreSQL?

I use PostgreSQL almost any time I need to store or process structured data.

read more

Josh Berkus: Changing PostgreSQL Version Numbering

From Planet PostgreSQL. Published on May 18, 2016.

Per yesterday's developer meeting, the PostgreSQL Project is contemplating a change to how we do version numbers.  First, let me explain how we do version numbers now.  Our current version number composition is:

9 . 5 . 3 
Major1 . Major2 . Minor


That is, the second number is the "major version" number, reflecting our annual release. The third number is the update release number, reflecting cumulative patch releases. Therefore "9.5.3" is the third update to version 9.5.

The problem is the first number, in that we have no clear criteria when to advance it.  Historically, we've advanced it because of major milestones in feature development: crash-proofing for 7.0, Windows port for 8.0, and in-core replication for 9.0.  However, as PostgreSQL's feature set matures, it has become less and less clear on what milestones would be considered "first digit" releases.  The result is arguments about version numbering on the mailing lists every year which waste time and irritate developers.

As a result, the PostgreSQL Project is proposing a version numbering change, to the following:

10 . 2
Major . Minor

Thus "10.2" would be the second update release for major version 10.   The version we release in 2017 would be "10" (instead of 10.0), and the version we release in 2018 will be "11".

The idea is that this will both put an end to the annual arguments, as well as ending the need to explain to users that 9.5 to 9.6 is really a major version upgrade requiring downtime.

Obviously, there is potential for breakage of a lot of tools, scripts, automation, packaging and more in this.  That's one reason we're discussing this now, almost a year before 10 beta is due to come out.

The reason for this blog post is that I'm looking for feedback on what this version number change will break for you.  Particularly, I want to hear from driver authors, automation engineers, cloud owners, application stack owners, and other folks who are "downstream" of PostgreSQL.  Please let us know what technical problems this will cause for you, and how difficult it will be to resolve them in the next nine months.

We are not, at this point, interested in comments on how you feel about the version change or alternate version naming schemes.  That discussion has already happened, at length.  You can read it here, here, and here, as well as at the developer meeting.

Places to provide feedback:


Thanks for any feedback you can provide.

Note that the next release of PostgreSQL, due later this year, will be "9.6" regardless.  We're deciding what we do after that.




Joshua Drake: Thanks to EDB

From Planet PostgreSQL. Published on May 18, 2016.

This is just a short blog to thank EDB and Robert Haas for this blog. It is great to see all aspects of the community coming together. This is how we defeat the true competitors. Open and closed source products will, in the long run, fail to materialize a true return on investment that is anywhere close to what a coopetition-based, true open source project can.

Integration Of GitHub API with python django

By Micropyramid django from Django community aggregator: Community blog posts. Published on May 17, 2016.

Using GitHub integration, we can quickly get the user's verified email id, general information, GitHub URL, id, disk usage, public and private repos, gists, followers, and following.

The following steps are needed for GitHub integration:

    1. Creating a GitHub app
    2. Authenticating the user and getting an access token.
    3. Getting user information and work history using the access token.

1. Creating a GitHub App

         a. To create an app, click on "create an application" at the top of the page.

              Here you can give the application a name, and then the application will be created.
         b. Now you can get the client id and secret of the application, and you can set the redirect urls of your application.

2. Authenticating user and getting an access token.
         a. Here we have to make a GET request to ask for the user's permission.

POST  "https://github.com/login/oauth authorize?client_id=GIT_APP_ID&redirect_uri=REDIRECT_URL&scope=user,user:email&state=dia123456789ramya"

         GIT_APP_ID: your application client id
         SCOPE: list of permissions to request from the person using your app
         REDIRECT_URI: the url to which you want to redirect after user login; this url must be set in the app's
         registered redirect urls.

         b. If the user accepts the permissions, an authorization code is sent to the redirect url.
         c. Then we get the access token with a POST request and the following params:
                  params = {
                        'grant_type': 'authorization_code',
                        'client_id': {{ Your Application id }},
                        'redirect_uri': {{ Your Redirect }},
                        'client_secret': settings.GIT_APP_SECRET,
                        'code': request.GET.get('code')
                    }

        info = requests.post("https://github.com/login/oauth/access_token", data=params)

        from urlparse import parse_qsl

        ac = dict(parse_qsl(info.text))

        params = {'access_token': ac['access_token']}

3. Getting user information using the access token, with a JSON response

  GET  "https://api.github.com/user"

     From the JSON response we can get the general information, GitHub URL, id, disk usage, public and private repos, gists, followers, and following.

   We can get the user's emails with another GET request, also returning a JSON response:

  GET  "https://api.github.com/user/emails"

    From the JSON response, you can get the user's emails.
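
Putting step 3 together in Python (my own sketch, not from the post; it assumes requests is installed and that access_token holds the token obtained in step 2):

import requests

params = {'access_token': access_token}

user = requests.get("https://api.github.com/user", params=params).json()
print(user.get('login'), user.get('html_url'), user.get('public_repos'))

emails = requests.get("https://api.github.com/user/emails", params=params).json()
verified = [e['email'] for e in emails if e.get('verified')]
print(verified)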

Dynamically Adding Google Maps with Marker In Django

By Micropyramid django from Django community aggregator: Community blog posts. Published on May 17, 2016.

Google Maps allows you to display maps on your web site; you can also customize the maps and the information shown on them.

The Google Maps API is a JavaScript library. It can be added to a web page with a script tag:



We are creating a div to hold the Google map. Here we are also giving an option to search for a place on the map.


 
   
 

Then add a DOM listener that will execute the getGoogleMap() function on window load (when the page is initially loaded):

google.maps.event.addDomListener(window, "load", getGoogleMap)

In the above example, we are already loading 3 markers, in Bangalore, Chennai, and Hyderabad.

Then, if the user marks any location, it will display the marker with the longitude and latitude of the place (if you want to store them), deleting the user's previously selected markers.


We can also set the position description dynamically using the info window

US PostgreSQL Association: PostgreSQL at OSCON 2016!

From Planet PostgreSQL. Published on May 17, 2016.

Austin, Texas – May 16, 2016 – The PostgreSQL community announces their participation at OSCON 2016, the largest open source conference in North America. OSCON takes place in Austin, Texas, May 16 – 19, 2016 and is the best opportunity to meet the open source community.

read more

Umair Shahid: Using Java arrays to insert, retrieve, & update PostgreSQL arrays

From Planet PostgreSQL. Published on May 17, 2016.

Arrays are a powerful programming feature frequently used by developers, both in Java and in PL/pgSQL. Interfaces can potentially become tricky, however, when the two try and talk to each other. This section explores how you can write simple code that uses the java.sql.Array Interface to insert, retrieve, & update arrays in PostgreSQL.

To illustrate the functionality, let’s set up a simple table that stores country names in one column as TEXT and a list of some of the country’s cities in the second column as a TEXT ARRAY.

CREATE TABLE city_example (
country TEXT, 
cities TEXT[]
);

Now we will use the JDBC interface to insert data into this table, retrieve it, and update it.

Inserting ARRAYs

Anyone familiar with Java has used arrays in one form or the other. Before these arrays can be stored in PostgreSQL, however, they need to be mapped to the Interface provided in the java.sql package … Array.

The JDBC driver provides functions that map Java arrays to their corresponding PostgreSQL arrays. Mapping is database-specific and is defined in PostgreSQL JDBC’s org.postgresql.jdbc2.TypeInfoCache file. It is important to note that mapping is case sensitive. As an example, “INTEGER” is not the same as “integer”.

In the code that follows, function createArrayOf of the Connection Interface is used to convert Java String arrays to PostgreSQL TEXT Arrays before insertion.

try {
 
 String[] usa = {"New York", "Chicago", "San Francisco"};
 String[] canada = {"Montreal", "Toronto", "Vancouver"};
 String[] uk = {"London", "Birmingham", "Oxford"};

 /*
 Convert String[] to java.sql.Array using JDBC API
 */
 Array arrayUSA = conn.createArrayOf("text", usa);
 Array arrayCanada = conn.createArrayOf("text", canada);
 Array arrayUK = conn.createArrayOf("text", uk);
 String sql = "INSERT INTO city_example VALUES (?, ?)";
 PreparedStatement pstmt = conn.prepareStatement(sql);
 
 pstmt.setString(1, "USA");
 pstmt.setArray(2, arrayUSA);
 pstmt.executeUpdate();
 
 pstmt.setString(1, "Canada");
 pstmt.setArray(2, arrayCanada);
 pstmt.executeUpdate();
 pstmt.setString(1, "UK");
 pstmt.setArray(2, arrayUK);
 pstmt.executeUpdate();
 
 conn.commit();
} catch (Exception e) {
 
 System.out.println(e.getMessage());
 e.printStackTrace();
}

Please note that the data type specified in Connection.createArrayOf has to be a PostgreSQL type, not java.sql.Types. The JDBC driver looks up the data type at runtime to create the java.sql.Array object.

This code, when executed, results in the following data in the city_example table:

select * from city_example ; 
 country | cities 
---------+--------------------------------------
 USA     | {"New York",Chicago,"San Francisco"}
 Canada  | {Montreal,Toronto,Vancouver}
 UK      | {London,Birmingham,Oxford}
(3 rows)

Retrieving ARRAYs

The process of retrieving arrays is the exact reverse of inserting them. In the example below, the first step is to get a ResultSet with the required data and the second step is to convert the PostgreSQL TEXT arrays to Java String arrays.

try {

 String sql = "SELECT * FROM city_example";
 PreparedStatement ps = conn.prepareStatement(sql);
 ResultSet rs = ps.executeQuery();

 while (rs.next()) {

  System.out.println("Country: " + rs.getString(1));
  System.out.println("---------------");

  Array cities = rs.getArray(2);
  String[] str_cities = (String[]) cities.getArray();

  for (int i = 0; i < str_cities.length; i++) {
   System.out.println(str_cities[i]);
  }
  System.out.println("");
 }

} catch (Exception e) {

 System.out.println(e.getMessage());
 e.printStackTrace();
}

For the above code, output to stdout is:

Country: USA
---------------
New York
Chicago
San Francisco
Country: Canada
---------------
Montreal
Toronto
Vancouver
Country: UK
---------------
London
Birmingham
Oxford

Updating ARRAYs

The process of updating arrays in PostgreSQL is pretty close to that of inserting them. In the code below, a new set of USA cities is declared as a Java String array, which is then converted to a PostgreSQL TEXT array before updating the existing row.

try {
 
 String[] usa = {"New York", "Chicago", "San Francisco", "Miami", "Seattle"};
 Array arrayUSA = conn.createArrayOf("text", usa);
 
 String sql = "UPDATE city_example SET cities = ? WHERE country = 'USA'";
 PreparedStatement pstmt = conn.prepareStatement(sql);
 
 pstmt.setArray(1, arrayUSA);
 pstmt.executeUpdate();
 
 conn.commit();
 
} catch (Exception e) {
 
 System.out.println(e.getMessage());
 e.printStackTrace();
}

After execution of this code, the database table looks as follows:

select * from city_example ;
 country | cities 
---------+----------------------------------------------------
 Canada  | {Montreal,Toronto,Vancouver}
 UK      | {London,Birmingham,Oxford}
 USA     | {"New York",Chicago,"San Francisco",Miami,Seattle}
(3 rows)

Nikolay Shaplov: Postgres: how to add new keyword to SQL grammar

From Planet PostgreSQL. Published on May 17, 2016.

While trying to add something useful to the INSERT statement, I wrote a simpler example, just to find out how it should work.
I think this might be helpful to people who also want to hack in that area.

The example is very simple: it adds the keyword TEST to the attribute list definition of the CREATE INDEX statement, and in ComputeIndexAttrs it reports via elog(WARNING ...) whether it found the TEST keyword or not.

postgres=# create index  ON mytable (i TEST 'value');
WARNING:  Creating index for column i with TEST value
CREATE INDEX 
postgres=# create index  ON mytable  (i);
WARNING:  TEST is empty
CREATE INDEX


This example is not for production development; it just shows which parts of the code you should touch in order to modify the grammar.

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index c294c7c..138190e 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1041,6 +1041,14 @@ ComputeIndexAttrs(IndexInfo *indexInfo,
 			/* Simple index attribute */
 			HeapTuple	atttuple;
 			Form_pg_attribute attform;
+			
+			if (attribute->test)
+			{
+				elog(WARNING, "Creating index for column %s with TEST %s", attribute->name, attribute->test);
+			} else
+			{
+				elog(WARNING, "TEST is empty");
+			}
 
 			Assert(attribute->expr == NULL);
 			atttuple = SearchSysCacheAttName(relId, attribute->name);
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 20e38f0..155070c 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2591,6 +2591,7 @@ _copyIndexElem(const IndexElem *from)
 	COPY_NODE_FIELD(opclass);
 	COPY_SCALAR_FIELD(ordering);
 	COPY_SCALAR_FIELD(nulls_ordering);
+	COPY_STRING_FIELD(test);
 
 	return newnode;
 }
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 18ec5f0..597ad57 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -316,7 +316,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 		copy_file_name
 				database_name access_method_clause access_method attr_name
 				name cursor_name file_name
-				index_name opt_index_name cluster_index_specification
+				index_name opt_index_name cluster_index_specification opt_test
 
 %type 	func_name handler_name qual_Op qual_all_Op subquery_Op
 				opt_class opt_inline_handler opt_validator validator_clause
@@ -632,7 +632,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	STATEMENT STATISTICS STDIN STDOUT STORAGE STRICT_P STRIP_P SUBSTRING
 	SYMMETRIC SYSID SYSTEM_P
 
-	TABLE TABLES TABLESAMPLE TABLESPACE TEMP TEMPLATE TEMPORARY TEXT_P THEN
+	TABLE TABLES TABLESAMPLE TABLESPACE TEMP TEMPLATE TEMPORARY TEST TEXT_P THEN
 	TIME TIMESTAMP TO TRAILING TRANSACTION TRANSFORM TREAT TRIGGER TRIM TRUE_P
 	TRUNCATE TRUSTED TYPE_P TYPES_P
 
@@ -715,6 +715,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %left		'(' ')'
 %left		TYPECAST
 %left		'.'
+
 /*
  * These might seem to be low-precedence, but actually they are not part
  * of the arithmetic hierarchy at all in their use as JOIN operators.
@@ -6707,7 +6708,7 @@ index_params:	index_elem							{ $$ = list_make1($1); }
  * expressions in parens.  For backwards-compatibility reasons, we allow
  * an expression that's just a function call to be written without parens.
  */
-index_elem:	ColId opt_collate opt_class opt_asc_desc opt_nulls_order
+index_elem:	ColId opt_collate opt_class opt_asc_desc opt_nulls_order opt_test
 				{
 					$$ = makeNode(IndexElem);
 					$$->name = $1;
@@ -6717,8 +6718,9 @@ index_elem:	ColId opt_collate opt_class opt_asc_desc opt_nulls_order
 					$$->opclass = $3;
 					$$->ordering = $4;
 					$$->nulls_ordering = $5;
+					$$->test = $6;
 				}
-			| func_expr_windowless opt_collate opt_class opt_asc_desc opt_nulls_order
+			| func_expr_windowless opt_collate opt_class opt_asc_desc opt_nulls_order opt_test
 				{
 					$$ = makeNode(IndexElem);
 					$$->name = NULL;
@@ -6728,8 +6730,9 @@ index_elem:	ColId opt_collate opt_class opt_asc_desc opt_nulls_order
 					$$->opclass = $3;
 					$$->ordering = $4;
 					$$->nulls_ordering = $5;
+					$$->test = $6;
 				}
-			| '(' a_expr ')' opt_collate opt_class opt_asc_desc opt_nulls_order
+			| '(' a_expr ')' opt_collate opt_class opt_asc_desc opt_nulls_order opt_test
 				{
 					$$ = makeNode(IndexElem);
 					$$->name = NULL;
@@ -6739,6 +6742,7 @@ index_elem:	ColId opt_collate opt_class opt_asc_desc opt_nulls_order
 					$$->opclass = $5;
 					$$->ordering = $6;
 					$$->nulls_ordering = $7;
+					$$->test = $8;
 				}
 		;
 
@@ -6761,6 +6765,10 @@ opt_nulls_order: NULLS_LA FIRST_P			{ $$ = SORTBY_NULLS_FIRST; }
 			| /*EMPTY*/						{ $$ = SORTBY_NULLS_DEFAULT; }
 		;
 
+opt_test:	 TEST Sconst							{ $$ = $2; }
+			| /*EMPTY*/								{ $$ = NULL; }
+		;
+
 
 /*****************************************************************************
  *
@@ -14159,6 +14167,7 @@ reserved_keyword:
 			| SOME
 			| SYMMETRIC
 			| TABLE
+			| TEST
 			| THEN
 			| TO
 			| TRAILING
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 714cf15..54e3de8 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -638,6 +638,7 @@ typedef struct IndexElem
 	List	   *opclass;		/* name of desired opclass; NIL = default */
 	SortByDir	ordering;		/* ASC/DESC/default */
 	SortByNulls nulls_ordering; /* FIRST/LAST/default */
+	char	   *test;
 } IndexElem;
 
 /*
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 17ffef5..2b7f2fd 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -380,6 +380,7 @@ PG_KEYWORD("tablespace", TABLESPACE, UNRESERVED_KEYWORD)
 PG_KEYWORD("temp", TEMP, UNRESERVED_KEYWORD)
 PG_KEYWORD("template", TEMPLATE, UNRESERVED_KEYWORD)
 PG_KEYWORD("temporary", TEMPORARY, UNRESERVED_KEYWORD)
+PG_KEYWORD("test", TEST, RESERVED_KEYWORD)
 PG_KEYWORD("text", TEXT_P, UNRESERVED_KEYWORD)
 PG_KEYWORD("then", THEN, RESERVED_KEYWORD)
 PG_KEYWORD("time", TIME, COL_NAME_KEYWORD)

Extract text with OCR for all image types in python using pytesseract

By Micropyramid django from Django community aggregator: Community blog posts. Published on May 16, 2016.

What is OCR?

Optical Character Recognition (OCR) is the process of electronically extracting text from images or documents such as PDFs, and reusing it in a variety of ways, such as full text searches.

In this blog, we will see how to use 'Python-tesseract', an OCR tool for Python.

pytesseract:

It will recognize and read the text present in images. It can read all common image types - png, jpeg, gif, tiff, bmp, etc. It's widely used to process scanned documents.

Installation:

$ sudo pip install pytesseract

Requirements:

* Requires Python 2.5 or later.
* Requires the Python Imaging Library (PIL).

Usage:

From the shell:

$ ./pytesseract.py test.png 

The above command prints the recognized text from the image 'test.png'.

$ ./pytesseract.py -l eng test-english.jpg

The above command recognizes English text.

In Python Script:

from PIL import Image
from pytesseract import image_to_string

print(image_to_string(Image.open('test.png')))
print(image_to_string(Image.open('test-english.jpg'), lang='eng'))

Implement search with Django-haystack and Elasticsearch Part-I

By Micropyramid django from Django community aggregator: Community blog posts. Published on May 16, 2016.

Haystack works as a search plugin for Django. You can use different backends like Elasticsearch, Whoosh, Solr, and Xapian to search objects. All backends work with the same code. In this post I am using Elasticsearch as the backend.

Installation:

pip install django-haystack

Configuration:

add haystack to installed apps

    INSTALLED_APPS=[

               'django.contrib.admin',
               'django.contrib.auth',
               'django.contrib.contenttypes',
               'django.contrib.sessions',
               'django.contrib.sites',
               #add haystack here
               'haystack',
               'books'
    ]


Settings:

Add back-end settings for haystack.

     HAYSTACK_CONNECTIONS = {
              'default': {
                    'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
                    'URL': 'http://127.0.0.1:9200/',
                    'INDEX_NAME': 'haystack_books',
              },
    }

The above settings are for Elasticsearch.

Add a signal processor for haystack. This signal processor will keep objects in the index updated.

    HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'

For this tutorial I am using the below model.

books/models.py:

    class Book(models.Model):
        title = models.CharField(max_length=100, unique=True)
        slug = models.SlugField(unique=True)
        created_on = models.DateTimeField(auto_now_add=True)
        description = models.TextField()
        authors = models.ManyToManyField(settings.AUTH_USER_MODEL)
        
        def __unicode__(self):
            return self.title

SearchIndex:

The purpose of indexing objects is to optimize speed and performance in finding relevant documents for a given search query. With a SearchIndex, Haystack determines what data should be placed in the index. In Haystack you write a unique index for each model.

    Create a new file books/search_indexes.py; the below code goes there:

    from haystack import indexes
    from books.models import Book

    class BookIndex(indexes.SearchIndex, indexes.Indexable):
        text = indexes.CharField(document=True, use_template=True, template_name="search/book_text.txt")
        title = indexes.CharField(model_attr='title')
        authors = indexes.CharField()

        def get_model(self):
            return Book

        def prepare_authors(self, obj):
            return [author.name for author in obj.authors.all()]

        def index_queryset(self, using=None):
            return self.get_model().objects.all()

The search index is created by subclassing both haystack.indexes.SearchIndex and haystack.indexes.Indexable. Specify the index field types using the indexes field types (similar to models).

In the above index class, the get_model method is used to define the model for which you want to create the index. The prepare_<fieldname> methods are used to define the data for indexing.


Haystack will use a template to index data. These index templates use syntax similar to django template tags.

templates/search/book_text.txt

    {{ object.title }}
    {% for a in authors%}
        {{ a }}
    {% endfor %}

Setup:

urls.py

    urlpatterns = [
        # other urls
        url(r'^search/', include('haystack.urls')),
    ]

build index:

python manage.py rebuild_index

Search query form:

<form action="/search" method="get">
        <input name="q" type="text" />
        <input type="submit" />
    </form>

The input field q is mandatory for searching.

Display search results:

The below code goes in templates/search/search.html:

        {% for result in page.object_list %}
           {{ result.object.title }}
           {{ result.object.description }}
        {{ result.object.created_on }}
        {% empty %}           
             No results found.
        {% endfor %}
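
If you want the same results in Python code rather than in a template, here is a minimal sketch using Haystack's SearchQuerySet (model names taken from the example above; the query string is arbitrary):

    from haystack.query import SearchQuerySet
    from books.models import Book

    results = SearchQuerySet().models(Book).filter(content='django')
    for result in results:
        print(result.object.title)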

If you got an error or want to give suggestions, please comment below. Happy coding.

Setting Up Coveralls for Django Project

By Micropyramid django from Django community aggregator: Community blog posts. Published on May 16, 2016.

Why coveralls?
Coveralls will check the code coverage of your test cases. To use coveralls.io your code must be hosted on GitHub or BitBucket.

install coveralls

   pip install coveralls

Using Travis
If you are using Travis for your CI, add the below script to the .travis.yml file in your project root folder:

language: python

# python versions
python:
  - "3.4"
  - "2.7.4"

env:
  - DJANGO=1.8 DB=sqlite3

# install requirements
install:
  - pip install -r requirements.txt
  - pip install coveralls

# To run tests
script:
  - coverage run --source=my_app1,my_app2 manage.py test

# send coverage report to coveralls
after_success:
  coveralls

Sign up with GitHub at https://coveralls.io/ and activate coveralls for your repo.

That's it. Happy testing...

How to Create your own e-commerce shop using Django-Oscar.

By Micropyramid django from Django community aggregator: Community blog posts. Published on May 16, 2016.

Oscar is an open-source ecommerce framework for Django. Django Oscar provides a base platform to build an online shop.

Oscar is built as a highly customisable and extendable framework. It supports pluggable tax calculations, per-customer pricing, multi-currency, etc.

1. Install Oscar

$ pip install django-oscar

2. Then, create a Django project

$ django-admin.py startproject <project-name>

After creating the project, add all the settings(INSTALLED_APPS, MIDDLEWARE_CLASSES, DATABASES) in your settings file as specified here.

And you can find the reference on how to customize the Django Oscar app, urls, models and views here.

Customising/Overriding templates:

To override Oscar templates, you first need to update the template configuration settings as below in your settings file.

import os

location = lambda x: os.path.join(
    os.path.dirname(os.path.realpath(__file__)), x)

TEMPLATE_LOADERS = (
    'django.template.loaders.filesystem.Loader',
    'django.template.loaders.app_directories.Loader',
    'django.template.loaders.eggs.Loader',

)

from oscar import OSCAR_MAIN_TEMPLATE_DIR

TEMPLATE_DIRS = (
    location('templates'),
    OSCAR_MAIN_TEMPLATE_DIR,
)

Note: In the 'TEMPLATE_DIRS' setting, you have to include your project template directory path first, followed by Oscar's template folder, which you can import from oscar.

When customising templates, you can either replace all the content with your own content or change only specific blocks using "extends".

Ex: Overriding Home page

{% extends 'oscar/promotions/home.html' %}

{% block content %}
    Content goes here
    ...
    ...
{% endblock content %}

Yann Larrivee: ConFoo Vancouver: Call for Papers is open

From Planet PostgreSQL. Published on May 16, 2016.

ConFoo Vancouver: December 5th-7th 2016

We are happy to open the call for papers of ConFoo Vancouver 2016! If you are interested in speaking about web development and related topics, please submit by June 6th. We will cover travel and hotel for the speakers who require it.

ConFoo Vancouver will be held on December 5-7, 2016. For those who are familiar with ConFoo Montreal, that conference will still be running annually in addition to Vancouver. Visit our site to learn more about the event.

Talks are 35 minutes for the topic and 10 minutes for Q&A, for a total of 45 minutes. We are eagerly expecting your proposals!

Oleg Bartunov: Postgres roadmap - to be or not to be.

From Planet PostgreSQL. Published on May 16, 2016.

We published the Postgres Professional roadmap, which we want to discuss at the PgCon 2016 Developer Meeting. I think it's important to have an official roadmap for the Postgres community, both for developers (companies and individuals) and for customers. Also, I think that releases should be more predictable ("programmable"); at least the major features should be known in advance. I don't know how this should work, but we want at least to raise the problem.

Why do we bother about a roadmap at all?

Postgres has no roadmap:


PostgreSQL is a non-commercial, all volunteer, free software project, and as such there is no formal list of feature requirements required for development. We really do follow the mantra of letting developers scratch their own itches.


Instead, we have a Todo list, but it says nothing to customers.

At the same time, Postgres is of great interest to Russia, since it provides a real alternative to Oracle, which currently dominates the corporate market (have you read Oracle's letter to Russian IT companies?). Russian companies (government and commercial) which are considering migration from Oracle to Postgres are evaluating Postgres features and looking for a Postgres roadmap. They are demanding not only enterprise features like an enterprise manager and tools for easy migration of data and procedures to Postgres, but also the features they need to plan their future development. It's clear that the migration of a very big and complex system of a large corporation is a question of more than just current operational functionality. That's why Postgres Professional has developed its roadmap and, as a member of the Postgres community, is looking for help from other Postgres companies and individual developers to "synchronize the watches".

It's the era of Enterprise Postgres, at last!

Mark Lavin to Give Keynote at Python Nordeste

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on May 16, 2016.

Mark Lavin will be giving the keynote address at Python Nordeste this year. Python Nordeste is the largest gathering of the Northeast Python community, which takes place annually in cities of northeastern Brazil. This year’s conference will be held in Teresina, the capital of the Brazilian state of Piauí.

Mark will be speaking from his love of long-distance running. Using endurance sports training methodologies and applying them to development training, he will provide a roadmap for how to improve as a developer throughout your career.

Mark Lavin is a co-author of Lightweight Django from O'Reilly. The book was recently published in Portuguese and is available in Brazil under the title Django Essencial from publisher Novatec. He has also recorded a video series called "Intermediate Django", which focuses on integrating background jobs with Celery and best practices for growing Django developers. Mark is an active member of the Django community, and you can often find him contributing to the Django Project or answering questions on StackOverflow.

NGINX for static files for dev python server

By Andrey Zhukov's blog from Django community aggregator: Community blog posts. Published on May 14, 2016.

When you work on the backend part of django or flask project and there are many static files, sometimes the development server becomes slow. In this case it’s possible to use nginx as reverse proxy to serve static. I’m using nginx in docker and the configuration is quite simple.

Put a Dockerfile and a default.conf.tmpl in some directory.

Dockerfile

FROM nginx:1.9
VOLUME /static
COPY default.conf.tmpl /etc/nginx/conf.d/default.conf.tmpl
EXPOSE 9000 
CMD envsubst '$APP_IP $APP_PORT' < /etc/nginx/conf.d/default.conf.tmpl > /etc/nginx/conf.d/default.conf && nginx -g 'daemon off;'

default.conf.tmpl

server {
    listen 9000;
    charset utf-8;

    location /site_media {
        alias /static;
    }

    location / {
        proxy_pass http://${APP_IP}:${APP_PORT};
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

Build the image with the docker build -t dev-nginx . command.

To run it:

docker run --rm -it -v `pwd`/static:/static -p 9000:9000 -e APP_IP=<your ip from ifconfig> -e APP_PORT=8000 dev-nginx

Then you can access your development server through http://<localhost|docker-machine-ip>:9000.

Rajeev Rastogi: Client connected to itself...Isn't it Strange (effect of TCP Simultaneous Connect)

From Planet PostgreSQL. Published on May 14, 2016.

Yes, I observed this strange behavior during one of my work assignments, and I would like to share it. The symptom of the issue was that the client was getting some unexpected junk response which the server never sent.

Background

While working on a system where the client application has logic to keep reconnecting to the server till the connection becomes successful, at one point it was observed that the client connection had been established successfully even though the server had not yet come up. Later, when the server was brought up, it failed to start, saying "Address already in use". This gave us the hint that the client must have connected to some other application running on the same address as the server.

Then we looked at the output of the netstat command at that point of time, and below were the results:

    netstat -nA inet | fgrep :32246
    tcp        0      0 127.0.0.1:32246         127.0.0.1:32246         ESTABLISHED

This shows the client has connected to itself (the source and destination IP:port are the same). Also, there is no additional application listening on port 32246. Address 127.0.0.1:32246 was the address the client was trying to connect to.

Cause Analysis

In our case we were trying to reconnect to the server repeatedly, and every time it kept failing because the server was still down. As per TCP, every time the client tries to re-connect it is assigned a new source port, in increasing order, from the range of ephemeral ports (the range is defined in the file /proc/sys/net/ipv4/ip_local_port_range). So at some point it may use the same source port as the destination port (if the destination port is also in the range of ephemeral ports).
Below was the range of ephemeral ports on the machine where the issue was observed:

    cat /proc/sys/net/ipv4/ip_local_port_range
    9000    65500

And the server port used in my set-up was 32246, so the server port was also in the range of ephemeral ports. So it was possible that at some point the source port would be the same as the destination port.
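
As a quick way to check this on a given machine, here is a small Python sketch (my own, not from the post; Linux-only) that reads the same proc file and reports whether a server port falls inside the ephemeral range:

def port_in_ephemeral_range(port, path="/proc/sys/net/ipv4/ip_local_port_range"):
    # the file contains two integers: the lowest and highest ephemeral port
    with open(path) as f:
        low, high = map(int, f.read().split())
    return low <= port <= high

print(port_in_ephemeral_range(32246))  # True on the machine described above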

Reason

Now you may think that even though the client has chosen the same port as the server, it still should not be able to connect, since no server is listening on that port. This is possible because of one of the TCP features called simultaneous connect, documented in RFC 793. It allows two clients to connect to each other without either one entering the listening state. The connection establishment in this approach is different from the usual 3-way handshake; both clients perform an active OPEN, as shown in the table below:
Table-1: Simultaneous Connect
This gets triggered only if two clients are trying to reach each other. (In our case only one client is trying to connect to the server, but since the ports are the same it gets treated as if both are trying to connect to each other, and hence simultaneous connect gets triggered.)

Experiment

Execute below command snippet:
while true
do
   telnet 127.0.0.1 32775
done

Even though there is no telnet server running on port 32775, at some point the connect will succeed:
telnet: connect to address 127.0.0.1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
hello
hello

Solution

Once the connection is successful, we can get the port number dynamically assigned to the client and compare it with the destination port. If it matches, we can disconnect the current connection and continue to retry.


    struct sockaddr_in cAddr;
    socklen_t cAddrLen = sizeof(cAddr);

    // Get the local address/port dynamically assigned to this client socket.
    int e = getsockname(fd, (sockaddr *)&cAddr, &cAddrLen);
    if (e != 0)
    {
        close(fd);
    }
    else if (cAddr.sin_port == htons(SERVER_PORT)) // SERVER_PORT: placeholder for the destination port we dialed
    {
        // Self-connect detected: the local port equals the destination port,
        // so drop this connection and let the retry loop try again.
        close(fd);
    }

Conclusion

Though this is a very rare scenario, it is very much possible. So any application that has infinite reconnect logic, and where there is no authentication info exchanged between client and server in order to finalize the connection, should take care of this.
In case of PostgreSQL, the client expects an authentication request from the server once the connection is established, so the issue related to self-connect will not happen.
But if there is any third party tool working on top of PostgreSQL with its own connection mechanism, then it should evaluate whether this issue is applicable to it.

Please share your comments/feedback, or let me know if you have some other idea to address this issue.

Using Django's built in signals and writing custom signals.

By Micropyramid django from Django community aggregator: Community blog posts. Published on May 13, 2016.

Django has a beautiful feature called signals, which notify you when certain actions are performed on a particular model. In this blog post we'll learn how to use Django's built-in signals and how to create a custom signal.

Using Django’s built in Signals:
Django has a lot of built-in signals like pre_save, post_save, pre_delete and post_delete. For more information about Django's built-in signals visit https://docs.djangoproject.com/en/1.9/ref/signals/. Now we'll learn how to use Django's pre_delete signal with a simple example; the other signals can be used in the same way. We have two models called Author and Book, and they are defined in models.py as below.

# In models.py
from django.db import models

class Author(models.Model):
    full_name = models.CharField(max_length=100)
    short_name = models.CharField(max_length=50)

class Book(models.Model):
    title = models.CharField(max_length=100)
    slug = models.SlugField(max_length=100)
    content = models.TextField()
    status = models.CharField(max_length=10, default="Drafted")
    author_id = models.PositiveIntegerField(null=True)

In the above two models we do not have author as a ForeignKey on the Book model, so by default, when an Author gets deleted, it won't delete all the Books written by that author. This is where signals come into the picture; we can achieve this by using the pre_delete or post_delete signals. For this we'll write a receiver function which will be called on pre_delete of the Author object. Write the following code in your models.py:

from django.db.models.signals import pre_delete

def remove_books(sender, instance, **kwargs):
    author_id = instance.id
    Book.objects.filter(author_id=author_id).delete()

pre_delete.connect(remove_books, sender=Author)

In the above snippet, sender is the model on which the pre_delete signal is registered; in the current example it is the Author model. remove_books is the receiver function which will be called on delete of an Author object. It takes sender, instance (the Author instance which is being deleted) and any other keyword arguments.

Writing Custom Signals:
Now in this section we'll learn how to create custom signals, using the same example as above. Suppose the author has to get an email when a Book's status is changed to "Published". For this we have to create a file called signals.py to hold the custom signal. By default all signals are django.dispatch.Signal instances.

# In signals.py
import django.dispatch
book_published = django.dispatch.Signal(providing_args=["book", "author"])

Create receivers.py, which contains the receiver function that will be called when the signal is dispatched.

# In receivers.py
from django.dispatch import receiver
from .signals import *

@receiver(book_published)
def send_mail_on_publish(sender, **kwargs):
    # contains the logic to send the email to the author
    pass

In the above snippet, receiver is a decorator which tells the book_published signal that send_mail_on_publish is the receiver function that will be called when the book_published signal is dispatched.

We can dispatch signal anywhere as following.

book_published.send(sender=Book, book=<Book Instance>, author=<Author Instance>)
Note: The most important thing to remember is that just calling book_published.send(*****) won't hit the receiver function unless the receivers module has been imported. To make the signal hit the receiver function, we have to import receivers in your app's __init__.py.

# In __init__.py

import receivers

But from Django 1.9+, if we import receivers in __init__.py, it will cause a runtime error about apps not being ready yet. To avoid this issue we have to do the import inside the ready() method of your apps.py:

# In apps.py

from django.apps import AppConfig
class LoadReceivers(AppConfig):
    name = 'testapp'

    def ready(self):
        from . import receivers
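
For the ready() hook above to actually run, Django also needs to be told to use this AppConfig. A minimal sketch, assuming the app is named testapp as above (in Django 1.9 this is done with default_app_config, and the direct receivers import from earlier is then no longer needed):

# In __init__.py
default_app_config = 'testapp.apps.LoadReceivers'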

Leo Hsu and Regina Obe: PLV8 binaries for PostgreSQL 9.6beta1 windows both 32-bit and 64-bit

From Planet PostgreSQL. Published on May 13, 2016.

To celebrate the recent release of PostgreSQL 9.6beta1, we've started to experiment with our favorite extensions. For starters, the PLV8 (aka PL/JavaScript) binaries are listed below, and the upcoming PostGIS 2.3.0 and ogr_fdw are detailed here.


Continue reading "PLV8 binaries for PostgreSQL 9.6beta1 windows both 32-bit and 64-bit"

REGINA OBE: PostgreSQL 9.6beta1 out, help test PostGIS 2.3 windows binaries available

From Planet PostgreSQL. Published on May 13, 2016.

PostgreSQL 9.6beta1 came out yesterday. It is the first version of PostgreSQL that will have parallel query support, and PostGIS 2.3 will be the first PostGIS to support parallelism in queries. Although the work is not yet committed in the PostGIS repo, you can expect to see it soon (currently here - https://github.com/pramsey/postgis/tree/parallel , waiting for someone (you know who you are) to do something about it).

Because of the newness of the parallelization feature, there are some caveats. As with all big things, we expect there to be a lot of bugs, and the more eyeballs on those and real workloads we've got hammering on them, the sweeter the PostGIS 2.3.0 and PostgreSQL 9.6 release will be.

Binaries for Windows users

For windows users, winnie, the PostGIS windows buildbot, is now building PostGIS for 9.6. Get the PostgreSQL 9.6 binaries and installers from the PostgreSQL 9.6beta1 for windows page.

Once you have that, just copy the contents of the respective PostGIS 2.3 9.6 binaries listed here - http://postgis.net/windows_downloads/ into your install folder.

In the extras folder, you'll also find the latest development version of the ogr_fdw foreign data wrapper, which we covered extensively in the FOSS4GNA2016 PostGIS Spatial Tricks talk. The talk also covered some new PostGIS 2.3.0 features.

We don't have pgRouting binaries available yet. pgRouting team is working out some compatibility issues with PostgreSQL 9.6. Once those are resolved, we will publish pgRouting binaries as well.

Shaun M. Thomas: PG Phriday: Bountiful Beta Benefits

From Planet PostgreSQL. Published on May 13, 2016.

The Postgres developers recently announced the availability of the first public beta for Postgres 9.6. I would be highly remiss to ignore such an opportunity to dig into any interesting functionality listed in the 9.6 release notes. All in all, it’s a pretty exciting series of advancements, and assuming this is a glimpse of what we see when 9.6 drops, I’d say we’re on the right track.

Plentiful Parallelism

Probably the most high-profile addition for 9.6 is parallel operation on certain query plans. I already examined this in depth, but how did the feature stack up over several months of development? Let’s use the same test and see:

CREATE TABLE para_test AS
SELECT a.id, repeat(' ', 20) AS junk
  FROM generate_series(1, 20000000) a(id);
 
ANALYZE para_test;
 
SET max_parallel_degree TO 1; -- The planner worked up to 5.
 
EXPLAIN ANALYZE
SELECT *
  FROM para_test
 WHERE junk LIKE '%s%';

Give or take a bit of variance, it looks pretty similar based on the best case of several runs:

Workers   Avg Time (s)
0         3.7
1         1.9
2         1.4
3         1.1
4         0.9
5         0.8

This is really good to see. We still get the most benefit from the initial activation; even one parallel worker drastically improves performance. After that—at least for sequence scans—returns diminish quite a bit. Perhaps more interesting however, is the new support for other operations, like nested loops. Check this out:

EXPLAIN ANALYZE
SELECT p1.id
  FROM para_test p1
  JOIN para_test p2 USING (id)
 WHERE id BETWEEN 1 AND 100000;
 
 Gather  (cost=5500.25..217096.45 rows=93502 width=4)
         (actual time=67.443..1651.985 rows=100000 loops=1)
   Workers Planned: 5
   Workers Launched: 5
   ->  Hash Join  (cost=4500.25..207681.27 rows=93502 width=4)
                  (actual time=77.348..1621.807 rows=16667 loops=6)
         Hash Cond: (p2.id = p1.id)
         ->  Parallel Seq Scan on para_test p2
              (cost=0.00..187059.00 rows=4000000 width=4)
              (actual time=0.014..498.137 rows=3333333 loops=6)
         ->  Hash  (cost=3331.48..3331.48 rows=93502 width=4)
                   (actual time=76.616..76.616 rows=100000 loops=6)
               Buckets: 131072  Batches: 1  Memory Usage: 4540kB
               ->  Index Only Scan using idx_test_id on para_test p1
                     (cost=0.44..3331.48 rows=93502 width=4)
                     (actual time=0.069..43.965 rows=100000 loops=6)
                     Index Cond: ((id >= 1) AND (id <= 100000))
                     Heap Fetches: 100000
 Planning time: 0.371 ms
 Execution time: 1662.068 ms

Now, it turns out this is actually a major performance hit for this contrived example. Without parallelism enabled, the nested loop operates in about 350ms. There’s clearly still some work needed under the hood, but the fact that this works at all is an encouraging step forward.

Spinlock Showdown

Several spinlocks have been removed and replaced with atomic locks to avoid costs associated with tight lock loops. Results of these modifications look encouraging. Our own tests reflect those glowing datapoints, with a bit of a caveat.

We began the comparison with 9.4 using pgbench on an 8-CPU VM with 32GB of RAM under various conditions and weren’t encouraged by the output.
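
Concretely, the runs were along these lines (standard pgbench flags; the database name, scale and duration here are placeholders, not from the post):

$ pgbench -i -s 100 bench                    # initialize sample data at scale 100
$ pgbench -c 8 -j 8 -T 60 bench              # simple query protocol, 8 clients
$ pgbench -c 8 -j 8 -T 60 -M prepared bench  # same run with prepared statements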

Clients   9.4     9.6
1         6100    5500
2         12350   11000
4         23850   23500
8         38450   34500
16        47800   46000
32        43800   42750

Pgbench performance is worse and inconsistent in our VM environment across the board. Postgres 9.4 regularly outperforms 9.6 on both standard and prepared queries, and there's some very strange judder between tests when the number of clients matches the number of system CPUs. After that, everything seems to even out within variance, but something strange is going on. Is it because the tests are on a VM instead of bare hardware? Is it because the VM was on kernel 3.13? Something clearly smells fishy here.

Why can this kind of variance be a problem? Because these are the numbers on a piece of bare metal with 32 CPUs and 64GB of RAM:

Clients   9.4      9.6
1         10800    10500
2         21500    21000
4         41000    40600
8         69000    73000
16        97000    118000
32        100000   167000

Note that the scaling improvements are far more obvious in this scenario than on our VM. It seems 9.6 is currently more environment sensitive than 9.4. Even if it’s the fault of the VM software, the fact that it affects 9.6 disproportionately is distressing. Hopefully that’ll go away by the time it’s officially released.

Foreign Federation

In Postgres versions older than 9.6, foreign tables are treated as individual entities and the external server isn’t really taken into consideration. If two foreign tables exist on the same server, all results will be fetched locally and joined as a third step. This not only fetches far more rows than necessary, but forces the local system to work harder joining everything after building temporary structures. It’s incredibly inefficient.

An easy way to visualize this is to create a couple garbage tables:

-- On external server (trust auth for testing)
 
CREATE TABLE foo AS
SELECT a.id FROM generate_series(1, 10) a(id);
 
CREATE TABLE bar AS
SELECT a.id FROM generate_series(1, 10) a(id);
 
-- On another system
 
CREATE EXTENSION postgres_fdw;
 
CREATE SERVER ext_srv 
  FOREIGN DATA WRAPPER postgres_fdw 
  OPTIONS (dbname 'postgres', host 'localhost', port '5434');
 
CREATE USER MAPPING FOR postgres 
  SERVER ext_srv 
  OPTIONS (user 'postgres');
 
CREATE FOREIGN TABLE foo (
  id INT
) SERVER ext_srv OPTIONS (table_name 'foo');
 
CREATE FOREIGN TABLE bar (
  id INT
) SERVER ext_srv OPTIONS (table_name 'bar');

These tables are exceedingly simple. Yet here’s what happens if we try a basic JOIN with a where clause in 9.4 or 9.5:

-- In 9.4
 
EXPLAIN
SELECT foo.*
  FROM foo
  JOIN bar ON (foo.id = bar.id)
 WHERE foo.id = 5;
 
                               QUERY PLAN                               
------------------------------------------------------------------------
 Nested Loop  (cost=200.00..296.58 rows=225 width=4)
   ->  Foreign Scan on foo  (cost=100.00..146.86 rows=15 width=4)
   ->  Materialize  (cost=100.00..146.94 rows=15 width=4)
         ->  Foreign Scan on bar  (cost=100.00..146.86 rows=15 width=4)
 
-- In 9.5
 
                           QUERY PLAN                            
-----------------------------------------------------------------
 Nested Loop  (cost=200.00..202.30 rows=1 width=4)
   ->  Foreign Scan on foo  (cost=100.00..101.15 rows=1 width=4)
   ->  Foreign Scan on bar  (cost=100.00..101.15 rows=1 width=4)

The 9.4 plan means Postgres will obtain the results of foo, and then loop through each row and obtain a result from bar. That’s potentially really awful. The 9.5 plan is better, but still not ideal. In that case, it’s pushing down the WHERE clause and pulling the matches from both tables, then joining them locally.

But 9.6 uses a much different approach:

                     QUERY PLAN                     
----------------------------------------------------
 Foreign Scan  (cost=100.00..102.28 rows=1 width=4)
   Relations: (public.foo) INNER JOIN (public.bar)

That’s right, it’s just letting the foreign system perform the entire JOIN, since both tables are local in that context. That’s a massive game-changing improvement. Pushing work to the remote server where appropriate saves bandwidth, allocation, and work on both ends of the equation. The potential returns for heavy users of foreign tables are multiple orders of magnitude.

Updating and deleting data from foreign tables is equally affected by this kind of pushdown logic. Look at the subtle difference between 9.5 and 9.6:

EXPLAIN
DELETE FROM foo
 WHERE id BETWEEN 5 AND 7;
 
-- On 9.5
 
                           QUERY PLAN                            
-----------------------------------------------------------------
 Delete on foo  (cost=100.00..101.15 rows=2 width=6)
   ->  Foreign Scan on foo  (cost=100.00..101.15 rows=1 width=6)
 
-- On 9.6
 
                            QUERY PLAN                             
-------------------------------------------------------------------
 Delete on foo  (cost=100.00..101.15 rows=2 width=6)
   ->  Foreign Delete on foo  (cost=100.00..101.15 rows=2 width=6)

So what is the difference between a “Foreign Scan” and a “Foreign Delete”? Believe it or not, the scan means Postgres 9.5 and older fetch the rows, and then issue separate DELETE statements for each. They use the same process for UPDATE statements. That’s… not great, especially as the amount of matches increases.

Once again, 9.6 saves the day. It allows the foreign system to take care of the whole process, and send back the results. The foreign system acts as if the whole DELETE statement were issued locally without any of that back-and-forth nonsense.

Vacuuming the Warehouse

The VACUUM process is vastly improved. Because Postgres uses MVCC, it requires regular maintenance to ensure the transaction counter does not wrap around and cause data loss. This makes sense, but in the context of a warehouse or any large system with several TB of data, it becomes more and more difficult to perform this maintenance simply due to performance constraints of storage systems.

In Postgres 9.5 and below, a VACUUM process must occasionally visit every single page of every single table. If a data warehouse is just sitting around accumulating data, why repeatedly vacuum all of the old data that hasn’t changed? Well, that all changes with 9.6. Now Postgres keeps track of pages that contain only unchanged values, and can skip them outright. This should drastically reduce storage IO during maintenance, and complete much faster as a bonus. For large idle warehouses, the difference can be an order of magnitude or more.

Persnickety Planner

There’s a veritably endless list of improvements in the release notes. The last I found noteworthy beyond the exciting stuff everyone drools over was a somewhat obscure planner tweak.

Do you have tables that use composite foreign keys? The query planner would normally multiply the statistical probability of each column individually, even though this usually indicates a strong correlation. This in turn causes row underestimates and bad query plans based on those artificially low values. If the planner expects 10 rows, when there are really 8000, that’s the difference between a nested loop biting off more than it can chew, and a relatively faster merge or hash operation on the larger result set.

This isn’t a general case for other implicit or explicit correlations, but it’s a move in the right direction. There is a multivariate statistics patch, but it didn’t survive the latest code freeze. It’s looking like we won’t see this kind of improvement on a general basis in 9.6, but this foreign key improvement is an example of how things could operate if a version of the patch is accepted.

In the end, 9.6 is looking like another great advancement over previous versions. I can’t wait to use it for realsies.

Hans-Juergen Schoenig: PostgreSQL Performance advise: UNIX sockets vs. localhost

From Planet PostgreSQL. Published on May 13, 2016.

We have posted a lot of tuning advise on this blog already. Now we thought we would take it a step further and share some more advice and common “mistakes” that we see people making when configuring web applications. Improving web application performance is more critical than ever because most applications in use these days […]

The post PostgreSQL Performance advise: UNIX sockets vs. localhost appeared first on Cybertec - The PostgreSQL Database Company.

Pygrunn: django channels - Bram Noordzij/Bob Voorneveld

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 12, 2016.

(One of my summaries of the one-day 2016 PyGrunn conference).

Django channels is a project to make Django handle more than "only" plain http requests. So: websockets, http2, etc. Regular http is the normal request/response cycle. Websockets is a connection that stays open, for bi-directional communication. Websockets are technically an ordered first-in first-out queue with message expiry and at-most-once delivery to only one listener at a time.

"Django channels" is an easy-to-understand extension of the Django view mechanism. Easy to integrate and deploy.

Installing django channels is quick. Just add the application to your INSTALLED_APPS list. That's it. The complexity happens when deploying it, as it is not a regular WSGI deployment. It uses a new standard called ASGI (a = asynchronous). Currently there's a "worker service" called daphne (built in parallel to django channels) that implements ASGI.

You need to configure a "backing service". Simplified: a queue.

They showed a demo where everybody in the room could move markers over a map. Worked like a charm.

How it works behind the scenes is that you define "channels". Channels can receive messages and can send messages to other channels. So you can have a channel for reading incoming messages, do something with it and then send a reply back to some output channel. Everything is hooked up with "routes".

You can add channels to groups so that you can, for instance, add the "output" channel of a new connection to the group you use for sending out status messages.
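
As a rough sketch of what this looked like with the channels 1.x API of the time (the consumer, module name and echo behaviour are illustrative, not from the talk):

# routing.py -- channels 1.x style: hook a websocket channel up to a consumer
from channels.routing import route

def ws_message(message):
    # Echo each incoming websocket frame back over the reply channel.
    message.reply_channel.send({"text": message.content["text"]})

channel_routing = [
    route("websocket.receive", ws_message),
]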

Pygrunn: from code to config and back again - Jasper Spaans

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 12, 2016.

(One of my summaries of the one-day 2016 PyGrunn conference).

Jasper works at Fox IT, one of the programs he works on is DetACT, a fraud detection tool for online banking. The technical summary would be something like "spamassassin and wireshark for internet traffic".

  • Wireshark-like: DetACT intercepts online bank traffic and feeds it to a rule engine that ought to detect fraud. The rule engine is the one that needs to be configured.
  • Spamassassin-like: rules with weights. If a transaction gets too many "points", it is marked as suspect. Just like spam detection in emails.

In the beginning of the tool, the rules were in the code itself. But as more and more rules and exceptions got added, maintaining it became a lot of work. And deploying takes a while as you need code review, automatic acceptance systems, customer approval, etc.

From code to config: they rewrote the rule engine from scratch to work based on a configuration. (Even though Joel Spolsky says totally rewriting your code is the single worst mistake you can make.) They went 2x over budget. That's what you get when rewriting completely....

The initial test with hand-written json config files went OK, so they went to step two: make the configuration editable in a web interface. Including config syntax validation. Including mandatory runtime performance evaluation. The advantage: they could deploy new rules much faster than when the rules were inside the source code.

Then... they did a performance test at a customer.... It was 10x slower than the old code. They didn't have enough hardware to run it. (It needs to run on real hardware instead of in the cloud as it is very very sensitive data).

They fired up the profiler and discovered that only 30% of the time is spent on the actual rules, the other 70% is bookkeeping and overhead.

In the end they had the idea to generate python code from the configuration. They tried it. The generated code is ugly, but it works and it is fast. A 3x improvement. Fine, but not a factor of 10, yet.

They tried converting the config to AST (python's Abstract Syntax Tree) instead of to actual python code. Every block was turned into an AST and then combined based on the config. This is then optimized (which you can do with an AST) before generating python code again.

This was fast enough!
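
To make the idea concrete, here's a toy sketch (the rule text and names are invented, not from the talk) of going from config-derived rule text to a compiled code object via Python's ast module:

import ast

# Pretend this rule text was produced from the configuration.
rule_source = "score = 5 if amount > 10000 else 0"

tree = ast.parse(rule_source, mode="exec")
# ... block-level optimization of the tree would happen here ...
code = compile(ast.fix_missing_locations(tree), "<rules>", "exec")

namespace = {"amount": 25000}
exec(code, namespace)
print(namespace["score"])  # prints 5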

Some lessons learned:

  • Joel Spolsky is right. You should not rewrite your software completely. If you do it, do it in very small chunks.
  • Write readable and correct code first. Then benchmark and profile
  • Have someone on your team who knows about compiler construction if you want to solve these kinds of problems.

Pygrunn: Kliko, compute container specification - Gijs Molenaar

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 12, 2016.

(One of my summaries of the one-day 2016 PyGrunn conference).

Gijs Molenaar works on processing big data for large radio telescopes ("Meerkat" in the south of Africa and "Lofar" in the Netherlands).

The data volumes coming from such telescopes are huge. 4 terabits per second, for example. So they do a lot of processing and filtering to get that number down. Gijs works on the "imaging and calibration" part of the process.

So: scientific software. Which is hard to install and fragile. Especially for scientists. So they use ubuntu's "launchpad PPAs" to package it all up as debian packages.

The new hit nowadays is docker. Containerization. A self-contained light-weight "virtual machine". Someone called it centralized agony: only one person needs to go through the pain of creating the container and all the rest of the world can use it... :-)

His line of work is often centered around pipelines. Data flows from one step to the other and on to the next. This is often done with bash scripts.

Docker is nice and you can hook up multiple dockers. But... it is all network-centric: a web container plus a database container plus a redis container. It isn't centered on data flows.

So he built something new: kliko. He's got a spec for "kliko" containers. Like "read your input from /input" and "write your output to /output". There should be a kliko.yml that defines the parameters you can pass. There should be a /kliko script as an entry point.

Apart from the kliko container, you also have the "kliko runner". It is the actor that runs the container. It runs the containers with the right parameters. You can pass the parameters on the command line or via a web interface. Perfect for scientists! You get a form where you can fill in the various parameters (defined in the kliko.yml file) and "just" run the kliko container.

An idea: you could use it almost as functional programming: functional containers. Containers that don't change the data they're operating on. Every time you run it on the same input data, you get the same results. And you can run them in parallel per definition. And you can do fun things with caching.

There are some problems with kliko.

  • There is no streaming yet.
  • It is filesystem based at the moment, which is slow.

These are known problems which are fine with what they're currently using it for. They'll work on it, though. One thing they're also looking at is "kliko-compose", so something that looks like "docker-compose".

Some (fundamental) problems with docker:

  • Docker access means root access, basically.
  • GPU acceleration is crap.
  • Cached filesystem layers are just annoying. At first it seems fine that all the intermediary steps in your Dockerfile are cached, but it is really irritating once you install, for instance, debian packages. They're hard to update.
  • You can't combine containers.

Pygrunn: Understanding PyPy and using it in production - Peter Odding/Bart Kroon

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 12, 2016.

(One of my summaries of the one-day 2016 PyGrunn conference).

pypy is "the faster version of python".

There are actually quite a lot of python implementations. cpython is the main one. There are also JIT compilers. Pypy is one of them. It is by far the most mature. PyPy is a python implementation, compliant with 2.7.10 and 3.2.5. And it is fast!

Some advantages of pypy:

  • Speed. There are a lot of automatic optimizations. It didn't use to be fast, but for the last 5 years it has actually been faster than cpython! It has a "tracing JIT compiler".
  • Memory usage is often lower.
  • Multi core programming. Some stackless features. Some experimental work has been started ("software transactional memory") to get rid of the GIL, the infamous Global Interpreter Lock.

What does having a "tracing JIT compiler" mean? JIT means "Just In Time". It runs as an interpreter, but it automatically identifies the "hot path" and optimizes that a lot by compiling it on the fly.

It is written in RPython, which is a statically typed subset of python which translates to C and is compiled to produce an interpreter. It provides a framework for writing interpreters. "PyPy" really means "Python written in Python".

How to actually use it? Well, that's easy:

$ pypy your_python_file.py

Unless you're using C modules. Lots of python extension modules use C code that compiles against CPython... There is a compatibility layer, but that catches only 40-60% of the cases. Ideally, all extension modules would use "cffi", the C Foreign Function Interface, instead of "ctypes". CFFI provides an interface to C that allows lots of optimizations, especially by pypy.
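
As a tiny illustration of the cffi style (calling sqrt from the C math library; loading "m" works on Linux, other platforms may need a different library name):

from cffi import FFI

ffi = FFI()
ffi.cdef("double sqrt(double x);")   # declare the C function we want
libm = ffi.dlopen("m")               # load the C math library
print(libm.sqrt(2.0))                # prints 1.4142135623730951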

Peter and Bart work at paylogic. A company that sells tickets for big events. So you have half a million people trying to get a ticket to a big event. Opening multiple browsers to improve their chances. "You are getting DDOSed by your own customers".

Whatever you do: you still have to handle 500000 pageviews in just a few seconds. The solution: a CDN for the HTML and only small JSON requests to the servers. Even then you still need a lot of servers to handle the JSON requests. State synchronisation was a problem, as in the end you still had one single server for that single task.

Their results after using pypy for that task:

  • An 8-fold improvement. Initially 4x, but pypy has been optimized since, so they got an extra 2x for free. So: upgrade regularly.
  • Real savings on hosting costs
  • The queue has been tested to work for at least two million visitors now.

Guido van Rossum supposedly says "if you want your code to run faster, you should probably just use PyPy" :-)

Note: slides are online

Pygrunn: simple cloud with TripleO quickstart - K Rain Leander

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 12, 2016.

(One of my summaries of the one-day 2016 PyGrunn conference).

What is openstack? A "cloud operating system". Openstack is an umbrella with a huge number of actual open source projects under it. The goal is a public and/or private cloud.

Just like you use "the internet" without concerning yourself with the actual hardware everything runs on, you should be able to use a private/public cloud on any regular hardware.

What is RDO? Exactly the same as openstack, but using RPM packages. Really, it is exactly the same. So a way to get openstack running on a Red Hat enterprise basis.

There are lots of ways to get started. For RDO there are three oft-used ones:

  • TryStack for trying out a free instance. Not intended for production.

  • PackStack. Install openstack-packstack with "yum". Then you run it on your own hardware.

  • TripleO (https://wiki.openstack.org/wiki/TripleO). It is basically "openstack on openstack". You install an "undercloud" that you use to deploy/update/monitor/manage several "overclouds". An overcloud is then the production openstack cloud.

    TripleO has a separate user interface that's different from openstack's own one. This is mostly done to prevent confusion.

    It is kind of heavy, though. The latest openstack release (mitaka) is resource-hungry and needs ideally 32GB memory. That's just for the undercloud. If you strip it, you could get the requirement down to 16GB.

To help with setting up there's now a TripleO quickstart shell script.

Pygrunn: Micropython, internet of pythonic things - Lars de Ridder

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 12, 2016.

(One of my summaries of the one-day 2016 PyGrunn conference).

micropython is a project that wants to bring python to the world of microprocessors.

Micropython is a lean and fast implementation of python 3 for microprocessors. It was funded in 2013 on kickstarter. Originally it only ran on a special "pyboard", but it has now been ported to various other microprocessors.

Why use micropython? Easy to learn, with powerful features. Native bitwise operations. Ideal for rapid prototyping. (You cannot use cpython, mainly due to RAM usage.)

It is not a full python, of course; they had to strip things out. "functools" and "this" are out, for instance. Extra libraries for the specific boards are included. There are lots of memory optimizations. Nothing fancy, most of the tricks are directly from compiler textbooks, but it is nice to see it all implemented in a real project.
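
For flavour, a minimal pyboard-style sketch (pyb is the pyboard's hardware module; the LED number and delay here are arbitrary):

import pyb

led = pyb.LED(1)      # first on-board LED
while True:
    led.toggle()      # blink
    pyb.delay(500)    # wait 500 ms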

Some of the supported boards:

  • Pyboard
  • The "BBC micro:bit" which is supplied to 1 million school children!
  • Wipy. More of a professional-grade board.
  • LoPy. a board which supports LoRa, an open network to connect internet-of-things chips.

Development: there is one full time developer (funded by the ESA) and two core contributors. It is stable and feels like it is maturing.

Is it production ready? That depends on your board. It is amazing for prototyping or for embedding in games and tools.

Pygrunn keynote: the future of programming - Steven Pemberton

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 12, 2016.

(One of my summaries of the one-day 2016 PyGrunn conference).

Steven Pemberton (https://en.wikipedia.org/wiki/Steven_Pemberton) is one of the developers of ABC, a predecessor of python.

He's a researcher at CWI in Amsterdam. It was the first non-military internet site in Europe in 1988 when the whole of Europe was still connected to the USA with a 64kb link.

When designing ABC they were considered completely crazy because it was an interpreted language. Computers were slow at that time. But they knew about Moore's law. Computers would become much faster.

At that time computers were very, very expensive. Programmers were basically free. Now it is the other way. Computers are basically free and programmers are very expensive. So, at that time, in the 1950s, programming languages were designed around the needs of the computer, not the programmer.

Moore's law is still going strong. Despite many articles claiming its imminent demise. He heard the first one in 1977. Steven showed a graph of his own computers. It fits.

On modern laptops, the CPU is hardly doing anything most of the time. So why use programming languages optimized for giving the CPU a rest?

There's another cost. The more lines a program has, the more bugs there are in it. But it is not a linear relationship. More like lines ^ 1.5. So a program with 10x more lines probably has 30x more bugs.

Steven thinks the future of programming is in declarative programming instead of in procedural programming. Declarative code describes what you want to achieve and not how you want to achieve it. It is much shorter.

Procedural code would have specified everything in detail. He showed a code example of 1000 lines. And a declarative one of 15 lines. Wow.

He also showed an example with xforms, which is declarative. Projects that use it regularly report a factor of 10 in savings compared to more traditional methods. He mentioned a couple of examples.

Steven doesn't necessarily want us all to jump on Xforms. It might not fit with our usecases. But he does want us to understand that declarative languages are the way to go. The approach has been proven.

In response to a question he compared it to the difference between roman numerals and arabic numerals and the speed difference in using them.

(The sheets will be up on http://homepages.cwi.nl/~steven/Talks/2016/05-13-pygrunn/ later).

Pygrunn keynote: Morepath under the hood - Martijn Faassen

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on May 12, 2016.

(One of my summaries of the one-day 2016 PyGrunn conference).

Martijn Faassen is well-known from lxml, zope, grok. Europython, Zope foundation. And he's written Morepath, a python web framework.

Three subjects in this talk:

  • Morepath implementation details.
  • History of concepts in web frameworks
  • Creativity in software development.

Morepath implementation details. A framework with super powers ("it was the last to escape from the exploding planet Zope")

Traversal. In the 1990's you'd have filesystem traversal. example.com/addresses/faassen would map to a file /webroot/addresses/faassen.

In zope2 (1998) you had traversal through an object tree. So root['addresses']['faassen'] in python. The advantage is that it is all python. The drawback is that every object needs to know how to render itself for the web. It is an example of creativity: how do we map filesystem traversal to objects?

In zope3 (2001) the goal was the zope2 object traversal, but with objects that don't need to know how to handle the web. A way of working called "component architecture" was invented to add traversal capabilities to existing objects. It works, but as a developer you need to do quite some configuration and registration. Creativity: "separation of concerns" and "lookups in a registry".

Pyramid sits somewhere in between. And has some creativity on its own.

Another option is routing. You map a url explicitly to a function. A @route('/addresses/{name}') decorator on a function (or a django urls.py). The creativity is that it is simple.

Both traversal and routing have their advantages. So Morepath has both of them. Simple routing to get to the content object and then traversal from there to the view.
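
A minimal sketch of what that combination looks like in Morepath (the Address model and path are invented for illustration):

import morepath

class App(morepath.App):
    pass

class Address(object):
    def __init__(self, name):
        self.name = name

@App.path(path='addresses/{name}', model=Address)
def get_address(name):
    # Routing: /addresses/faassen resolves to an Address object.
    return Address(name)

@App.view(model=Address)
def view_address(self, request):
    # The view is then looked up for the model instance.
    return "Address of %s" % self.name

if __name__ == '__main__':
    morepath.run(App())  # serve with the simple development server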

The creativity here is "dialectic". You have a "thesis" and an "antithesis" and end up with a "synthesis". So a creative mix between two ideas that seem to be opposites.

Apart from traversal/routing, there's also the registry. Zope's registry (component architecture) is very complicated. He's now got a replacement called "Reg" (http://reg.readthedocs.io/).

He ended up with this after creatively experimenting with it. Just experimenting, nothing serious. Rewriting everything from scratch.

(It turned out there already was something that worked a bit the same in the python standard library: @functools.singledispatch.)
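
For reference, the standard-library counterpart dispatches on the type of the first argument only; a small sketch:

from functools import singledispatch

@singledispatch
def render(obj):
    # Fallback for types without a more specific implementation.
    return repr(obj)

@render.register(list)
def _(obj):
    return "[" + ", ".join(render(item) for item in obj) + "]"

print(render(42))          # prints 42
print(render([1, "two"]))  # prints [1, 'two']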

He later extended it from single dispatch to multiple dispatch. The creativity here was the freedom to completely change the implementation as he was the only user of the library at that moment. Don't be afraid to break stuff. Everything has been invented before (so research). Also creative: it is just a function.

A recent spin-off: "dectate". (http://dectate.readthedocs.io/). A decorator-based configuration system for frameworks :-) Including subclassing to override configuration.

Some creativity here: it is all just subclassing. And split something off into a library for focus, testing and documentation. Split something off to gain these advantages.

Caktus CTO Colin Copeland Invited to the White House Open Police Data Initiative

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on May 12, 2016.

We at Caktus were incredibly proud when the White House Police Data Initiative invited CTO Colin Copeland to celebrate their first year accomplishments. While at the White House, Colin also joined private breakout sessions to share ideas with law enforcement officials, city staff, and other civic technologists from across the country. Colin is the co-founder of Code for Durham and served as lead developer for OpenDataPolicingNC.com. OpenDataPolicingNC.com, a site built for the Southern Coalition for Social Justice, displays North Carolina police stop data.

When he returned, we couldn’t wait to learn more about his perspective on the talks given (video available) and the breakout discussions held. Here’s a condensed version of our conversation with him.

Can you tell us what the White House Police Data Initiative is?

It’s an effort by the Obama administration to use open data and partner with technologists to strengthen the relationship with citizens and police. The goal is to increase transparency and, as a result, build trust and accountability. It has grown a lot—53 law enforcement agencies. It’s an incredible initiative.

What was it like to be at the White House Police Data Initiative celebration?

It was super exciting to be at the White House and to see demonstrations of what has been accomplished. Fifty-three new law enforcement agencies had signed on to the initiative with close to 100 datasets. I felt lucky to be part of it since what we do with OpenDataPolicingNC.com is such a small piece of the whole effort.

Seeing other initiatives and what other police departments are doing was invigorating. It really made me feel motivated to keep doing what we’ve been doing, especially after seeing other Code for America projects. I also liked being able to hear the perspectives of people from vastly different backgrounds, whether it was someone in the police department or the city. The event was about learning from people all over the country.

Can you describe a couple perspectives you found interesting?

Ron Davis [Director of the federal Office of Community Oriented Policing Services] had a perspective grounded in making sure all police officers, from the rank and file to leadership, understood why open data benefits them. It’s an important aspect. If they don’t see the benefit and buy-in, it’s much harder to advocate for open data.

Also Chief Brown [Dallas] emphasized that releasing data led to community trust. The value you get out of that outweighs any internal pressure not to do it. He was straightforward about how doable it is to release data, how good it was for Dallas, and encouraged other police departments to do the same.

What do you think is the greatest challenge to open data policing efforts for interested agencies?

Knowing what to share is a hurdle for most new agencies. There was some discussion of building a guide or toolkit to share ways to implement this in your city. Small police agencies do not want to reinvent the wheel, so they need easier onboarding. We need to make it easier for everyone to get involved.

What was something new you learned about open data policing?

I learned a lot. It was a lot of interesting, new perspectives, innovative partnerships. But there was one aspect: there’s not a lot of data standards for how police track and report various metrics, including use-of-force. So you can’t compare one jurisdiction to another always. It can look bad for one department versus another because you used a different set of criteria. There needs to be greater standards in order to better share information.

What’s next for you with open data policing?

There’s going to be an expansion of OpenDataPolicingNC.com and that’s through Code for Durham. We’re going to be using geolocational data provided by Fayetteville and Police Chief Harold Medlock. He asked us to map the data to see what it highlights. We hope other agencies can use it, too, once the Fayetteville one is online. It’s an exciting project and we’re honored Chief Medlock asked us to help out.

Colin Copeland at the White House Police Data Initiative. Also pictured, representatives from other open police data initiatives.

Is Open Source Consulting Dead?

By chrism from . Published on Sep 10, 2013.

Has Elvis left the building? Will we be able to sustain ourselves as open source consultants?

Consulting and Patent Indemnification

By chrism from . Published on Aug 09, 2013.

Article about consulting and patent indemnification

Python Advent Calendar 2012 Topic

By chrism from . Published on Dec 24, 2012.

An entry for the 2012 Japanese advent calendar at http://connpass.com/event/1439/

Why I Like ZODB

By chrism from . Published on May 15, 2012.

Why I like ZODB better than other persistence systems for writing real-world web applications.

A str.__iter__ Gotcha in Cross-Compatible Py2/Py3 Code

By chrism from . Published on Mar 03, 2012.

A bug caused by a minor incompatibility can remain latent for long periods of time in a cross-compatible Python 2 / Python 3 codebase.

In Praise of Complaining

By chrism from . Published on Jan 01, 2012.

In praise of complaining, even when the complaints are absurd.

2012 Python Meme

By chrism from . Published on Dec 24, 2011.

My "Python meme" replies.

In Defense of Zope Libraries

By chrism from . Published on Dec 19, 2011.

A much too long defense of Pyramid's use of Zope libraries.

Plone Conference 2011 Pyramid Sprint

By chrism from . Published on Nov 10, 2011.

An update about the happenings at the recent 2011 Plone Conference Pyramid sprint.

Jobs-Ification of Software Development

By chrism from . Published on Oct 17, 2011.

Try not to Jobs-ify the task of software development.

WebOb Now on Python 3

By chrism from . Published on Oct 15, 2011.

Report about porting to Python 3.

Open Source Project Maintainer Sarcastic Response Cheat Sheet

By chrism from . Published on Jun 12, 2011.

Need a sarcastic response to a support interaction as an open source project maintainer? Look no further!

Pylons Miniconference #0 Wrapup

By chrism from . Published on May 04, 2011.

Last week, I visited the lovely Bay Area to attend the 0th Pylons Miniconference in San Francisco.

Pylons Project Meetup / Minicon

By chrism from . Published on Apr 14, 2011.

In the SF Bay Area on the 28th, 29th, and 30th of this month (April), 3 separate Pylons Project events.

PyCon 2011 Report

By chrism from . Published on Mar 19, 2011.

My personal PyCon 2011 Report