
Open Source Posts

Jobin Augustine: C++14 and Postgres

From Planet PostgreSQL. Published on Mar 28, 2017.

The C++14 specification came with huge improvements: auto type deduction, lambdas, movable rvalue expressions, and so on; the list is huge. It's a dream come true for lazy programmers like me: write less code and create fewer bugs 😉 And get huge performance gains for free.

OK, this time I thought I'd try writing some functions in C++ rather than PL/pgSQL; it could make development easier. Oh well… PostgreSQL is a world of C code. Other than the list of restrictions, I couldn't find anything in the documentation, and no working examples.

It looks like there is a lot of confusion around this. I see many questions on different forums that remain unanswered, so I thought I'd give it a try.

After the first couple of attempts, I figured out that wrapping my code in an extern "C" block does the magic.

Here is a simple sample function where I am using type deduction, the STL, and a range-based for loop. My file is named extension.cpp and contains the following lines:

extern "C" {
#include <postgres.h>    //all C headers and macros go inside extern "C"
#include <utils/rel.h>
PG_MODULE_MAGIC;
}

#include<vector>           //C++ headers go outside

extern "C" {
int sumofall(){        //wrap the C++ function inside the extern "C" block
auto sum_of_all = 0;
std::vector<int> arr {1,2,3,4,5,6,7};
for (auto& i : arr )
     sum_of_all += i;
return sum_of_all;
}
}

I know that this could be construed as dumb code. A C++ coder might say that this function should be a C++ constexpr (constant expression) so that the calculation is done at compile time and there is zero overhead at runtime. But my intention here is to show that we have no restrictions on using C++ features.

For this demonstration I used the Postgres sandbox created with the BigSQL package manager, as it comes with all the header files and libraries. At the same time it is also portable (relocatable).

This can be compiled with:

g++ -c -fPIC -Wall -Werror -g3 -O0 -I/home/vagrant/bigsql/pg96/include/postgresql/server extension.cpp 

Make it a shared object

g++ -shared -o my-extension.so extension.o

Copy it to the library location:

cp my-extension.so ~/bigsql/pg96/lib/postgresql/

Create the function definition

postgres=# CREATE OR REPLACE FUNCTION 
   sumofall() RETURNS integer AS 'my-extension' 
LANGUAGE C STRICT;

Yes it works!!!

postgres=# select sumofall();
sumofall
----------
       28
(1 row)

Now let's see how we can put some business logic into a function.

extern "C" {
int bonus(const int score, const int type) {   // needs #include <cmath> (outside extern "C") for std::rint
    if (type == 1 || score > 100) return std::rint(score + score * 0.05);
    else return std::rint(score + score * 0.06);
}
}
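
The post doesn't show the CREATE FUNCTION statement for bonus(); assuming it is compiled into the same my-extension library as before, a definition along these lines should work:

postgres=# CREATE OR REPLACE FUNCTION
   bonus(integer, integer) RETURNS integer AS 'my-extension', 'bonus'
LANGUAGE C STRICT;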

This function can be invoked from SQL with:

postgres=# select empnm,bonus(score,type) from emp;
 empnm | bonus
-------+-------
 emp1  |   103
 emp2  |   105
 emp3  |   105
 emp4  |   106
 emp5  |   106
 emp6  |   106

Note: I used g++ (GCC) 6.2.1 to compile the above code.

Django ile Web Programlama Video Serisi

By Python Turkiye from Django community aggregator: Community blog posts. Published on Mar 27, 2017.

The Web Programming with Django (Django ile Web Programlama) training series consists of 41 videos in total. It is aimed at people with basic Python knowledge and no prior web programming experience, starting from the basics and building a web site with Django step by step. Once you understand the core logic of Django, developing an average web site will not take much of your time. What is Django? Django is entirely...

The post Django ile Web Programlama Video Serisi first appeared on Python Türkiye.

Bruce Momjian: Use All Your Constraints

From Planet PostgreSQL. Published on Mar 27, 2017.

Many database administrators use databases as simple data stores. However, relational systems can do much more, with advanced querying, analysis, and transaction control capabilities. Another area that is often overlooked is constraints. Constraints allow new and updated data to be checked against defined constraints and prevent changes if the constraints would be violated.

Constraints are odd in that they don't do anything if the data is consistent; they are more like an insurance policy against invalid data being entered into the database. If constraints are missing, there often are no initial problems, but over time erroneous or unexpected data gets in, causing problems with applications and reporting.

Do yourself a favor the next time you create a table — take the insurance and create useful CHECK, NOT NULL, DEFAULT, UNIQUE, PRIMARY KEY, and FOREIGN KEY constraints. If your tables are already created, you can use ALTER TABLE to add constraints to existing tables.
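
For example, here is a small sketch with invented table and column names, showing constraints declared at creation time and one added later with ALTER TABLE:

CREATE TABLE orders (
    id          bigint  PRIMARY KEY,
    customer_id bigint  NOT NULL REFERENCES customers (id),
    status      text    NOT NULL DEFAULT 'new',
    quantity    integer NOT NULL CHECK (quantity > 0)
);

ALTER TABLE orders
    ADD CONSTRAINT orders_status_check
    CHECK (status IN ('new', 'paid', 'shipped'));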

Continue Reading »

Alexey Lesovsky: Deep dive into postgres stats: pg_stat_bgwriter reports

From Planet PostgreSQL. Published on Mar 27, 2017.

Everything you always wanted to know about Postgres stats

Today, I would like to make a little detour from the main series and dive into pg_stat_bgwriter. If you’ve been following my previous posts, you will remember that the pg_stat_bgwriter view has summary statistics about the bgwriter and checkpointer. Here I would like to show an interesting report query based on pg_stat_bgwriter. The sources of the query were found in the PostgreSQL mailing lists, shared by my colleague Victor Yegorov, and slightly modified by me. This report provides comprehensive information about bgwriter and checkpointer activity and helps to configure them better.
A tiny recommendation: run this query with expanded output in psql. The report produces only one row and looks like this:


-[ RECORD 1 ]--------------------------+---------------------------------------
Uptime                                 | 112 days 12:58:21.394156
Since stats reset                      | 132 days 14:37:34.149519
Forced checkpoint ratio (%)            | 7.6
Minutes between checkpoints            | 56.61
Average write time per checkpoint (s)  | 775.36
Average sync time per checkpoint (s)   | 0.21
Total MB written                       | 4194915.7
MB per checkpoint                      | 105.26
Checkpoint MBps                        | 0.03
Bgwriter MBps                          | 0.24
Backend MBps                           | 0.10
Total MBps                             | 0.37
New buffer allocation ratio            | 34.925
Clean by checkpoints (%)               | 8.5
Clean by bgwriter (%)                  | 64.3
Clean by backends (%)                  | 27.2
Bgwriter halt-only length (buffers)    | 0.00
Bgwriter halt ratio (%)                | 0.25
-------------------------------------- | --------------------------------------
checkpoints_timed                      | 3117
checkpoints_req                        | 256
checkpoint_write_time                  | 2615284047
checkpoint_sync_time                   | 716686
buffers_checkpoint                     | 45447168
buffers_clean                          | 345439953
maxwritten_clean                       | 861
buffers_backend                        | 146062083
buffers_backend_fsync                  | 0
buffers_alloc                          | 18752891779
stats_reset                            | 2016-10-12 23:41:35.737168+03
total_checkpoints                      | 3373
total_buffers                          | 536949204
startup                                | 2016-11-02 01:20:48.492531+03
checkpoint_timeout                     | 1h
checkpoint_segments                    | 256
checkpoint_completion_target           | 0.9

The report consists of two parts separated by a horizontal line: the first part is the report itself, and the second is the raw values from pg_stat_bgwriter plus auxiliary items used in the report.

The first part is more interesting and here is why.

The first values we see are the stats interval and the uptime. The general idea is that the longer the interval since the last stats reset, the less accurate the report will be. Thus, I'd recommend resetting the stats periodically, weekly or monthly. The uptime simply shows how long Postgres has been running. Note that in the example the uptime is shorter than the stats-collection interval. That isn't bad in itself, but for your own report try to use stats collected only after Postgres startup, since at shutdown Postgres runs a forced checkpoint that isn't related to the workload but may skew the stats.
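
Since pg_stat_bgwriter holds cluster-wide counters, the periodic reset mentioned above is done with pg_stat_reset_shared() rather than pg_stat_reset():

-- resets the counters behind pg_stat_bgwriter (superuser required)
SELECT pg_stat_reset_shared('bgwriter');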

Next is the information about checkpoints. "Forced checkpoint ratio" is the ratio of checkpoints triggered by xlog. The current value is 7.6 and that's good enough. High values, for example more than 40%, indicate that xlog checkpoints occur too frequently. As you might remember, xlog checkpoints are less desirable than timed checkpoints, so the general idea is to reduce the number of xlog checkpoints by increasing the amount of WAL required to trigger a checkpoint. "Minutes between checkpoints" is the time interval between checkpoints. When everything is ok, this value should be near checkpoint_timeout; values significantly lower than checkpoint_timeout also point to frequent xlog checkpoints. The general recommendation in both cases is to raise max_wal_size (or checkpoint_segments on 9.4 and older).
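
The original report query isn't reproduced in this post, but these two checkpoint values can be derived from pg_stat_bgwriter roughly as follows (my reconstruction of the formulas, which matches the example numbers and assumes at least one checkpoint has occurred):

SELECT round(100.0 * checkpoints_req
             / (checkpoints_timed + checkpoints_req), 1) AS forced_checkpoint_ratio,
       round((extract(epoch FROM now() - stats_reset) / 60
             / (checkpoints_timed + checkpoints_req))::numeric, 2) AS minutes_between_checkpoints
FROM pg_stat_bgwriter;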

Next are the average write and sync times per checkpoint, in seconds. The write time is near 13 minutes, which is a good value; it shows that the write stage of checkpoints was performed fast enough despite the long interval between checkpoints. Values close to the interval between checkpoints aren't good: they indicate that the storage spent too much time writing buffers. The average sync time should be near zero; values far from zero indicate poor storage performance.

The next items are informative; they describe the average throughput of the checkpointer, bgwriter and backends and give us additional information about the workload. "Total MB written" is obviously the amount of data written by all subprocesses, and "MB per checkpoint" is the average per checkpoint. The following values are measured in MBps and describe per-process throughput. In the example above the values are low, less than 1 MBps, which means the server doesn't have a huge amount of dirty data, or perhaps the report was built over a long stats interval and the throughput is spread out across it.

"New buffer allocation ratio" field is the ratio of new allocated buffers to all written buffers. When backends handle data, firstly they check are data already in shared buffers area? If there are no required data in shared buffers, backends allocate new buffers and load data from main storage to shared buffers and then process it (see details in BufferAlloc() and StrategyGetBuffer() functions). Thus, high number here tell us that backends allocated a lot of buffers since required data didn't exist among shared buffers.

There are two reasons for this: the first is that backends read rarely used "cold" data, old archived partitions or something similar; the second is that recently used data has been evicted from shared buffers because shared buffers are too small. Put another way, this number shows how many times more data was read into shared buffers than was written out from them. Together with the cache hit ratio it is a useful signal: a high allocation ratio combined with a low cache hit ratio can indicate insufficient shared buffers, though it's hard to know for sure.
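
As far as I can tell, this field is buffers_alloc divided by the total number of buffers written by the three sub-processes; a sketch of the assumed formula (18752891779 / 536949204 ≈ 34.9 in the example above):

SELECT round(buffers_alloc::numeric
             / (buffers_checkpoint + buffers_clean + buffers_backend), 3)
       AS new_buffer_allocation_ratio
FROM pg_stat_bgwriter;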

The next set of values shows what percentage of buffers was cleaned by each sub-process. A high "Clean by checkpoints" value is the sign of a write-intensive workload. A high "Clean by bgwriter" value tells us about a read workload. A high "Clean by backends" value is the sign that backends did a lot of the bgwriter's work, and that's not good: values above 50% point to an ineffective bgwriter configuration, and in that case I would suggest making it more aggressive.
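
These percentages appear to be each sub-process's share of all written buffers; a sketch of the assumed formulas, which matches the raw values shown earlier:

SELECT round(100.0 * buffers_checkpoint / total, 1) AS clean_by_checkpoints,
       round(100.0 * buffers_clean      / total, 1) AS clean_by_bgwriter,
       round(100.0 * buffers_backend    / total, 1) AS clean_by_backends
FROM (SELECT buffers_checkpoint, buffers_clean, buffers_backend,
             buffers_checkpoint + buffers_clean + buffers_backend AS total
        FROM pg_stat_bgwriter) s;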

Next values are "Bgwriter halt-only length" and "Bgwriter halt ratio". They are about frequency with which bgwriter was delayed due to exceeded bgwriter_lru_maxpages. The values in our example are perfect and high values conversely indicate that bgwriter went to sleep mode too frequently and didn't do its work fast enough. In this case, I'd recommend to configure bgwriter in a more aggressive way -- decrease delay and increase maxpages parameters.

The second part of the report contains the raw values from pg_stat_bgwriter and the configuration parameters related to the bgwriter and checkpointer. They are used in the report's query, so you don't need to look them up with separate queries.

Here I prepared a few reports with my comments from different pgbench workloads:
  • here is eight hours of read-only workload with 4GB shared buffers (default bgwriter)
  • here is eight hours of read-only workload with 8GB shared buffers (default bgwriter)
  • here is eight hours of read/write workload with 4GB shared buffers (default bgwriter)
  • here is eight hours of read/write workload with 8GB shared buffers (default bgwriter)
  • here is eight hours of read/write workload with 8GB shared buffers (aggressive bgwriter)
Tests were performed on a server with 16 CPUs, 32GB RAM, RAID1 on 2x SSD (datadir) and RAID1 on 2x SAS (WAL), running PostgreSQL 9.6.2, with a test database of 96GB.

That is all for this time and I hope you enjoyed this post and found it helpful.

Django REST Framework: Using the request object

By Abu Ashraf Masnun from Django community aggregator: Community blog posts. Published on Mar 27, 2017.

While working with Django REST Framework, aka DRF, we often wonder how to customize our response based on request parameters. Maybe we want to check something against the logged-in user (request.user)? Or maybe we want to modify part of our response based on a certain request parameter? How do we do that? We will discuss a few use cases below.

ModelViewSet - Filtering based on request

This is very often required while using ModelViewSets. We have many Items in our database. But when listing them, we only want to display the items belonging to the current logged in user.

from rest_framework.permissions import IsAuthenticated
from rest_framework.viewsets import ModelViewSet

from .models import Item                    # assumed app-local imports
from .serializers import ItemSerializer

class ItemViewSet(ModelViewSet):
    permission_classes = (IsAuthenticated,)
    serializer_class = ItemSerializer

    def get_queryset(self):
        # the request is available on the viewset as self.request
        queryset = Item.objects.all().filter(user=self.request.user)

        another_param = self.request.GET.get('another_param')
        if another_param:
            queryset = queryset.filter(another_field=another_param)

        return queryset

If you are using the awesome ModelViewSet, you can override the get_queryset method. Inside it, you can access the request object as self.request. In the above example, we are only listing the items which have our current user set as their user field. At the same time, we are also filtering the queryset based on another parameter. Basically you have the queryset and self.request available to you, so feel free to use your imagination to craft all the queries you need!

Serializers - Modifying Response based on request

What if we don’t want to display item_count for the users by default? What if we only want to display that field when a request parameter, show_count is set? We can override the serializer to do that.

from django.contrib.auth.models import User
from rest_framework.serializers import IntegerField, ModelSerializer

class UserSerializer(ModelSerializer):
    class Meta:                 # assumed minimal Meta; list the fields your project needs
        model = User
        fields = ('id', 'username')

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        context = kwargs.get('context', None)
        if context:
            request = kwargs['context']['request']

            show_count = request.GET.get('show_count')
            if show_count:
                # source defaults to the field name, so it must not be repeated here
                self.fields['item_count'] = IntegerField()

When DRF constructs the serializer, it passes the request in the context. So we should always check whether the context exists and use it as needed. We can override the serializer fields by accessing self.fields.

Please note: The request object will be passed only if DRF constructs the serializer for you, for example when you just pass the serializer_class to a ModelViewSet. But if you are using the Serializer in your custom views, please do remember to pass the request manually, otherwise it won’t work.

item_serializer = ItemSerializer(item, context={"request": request})

In our case we have just used IntegerField. You can of course use another serializer to embed the full data of a related field.

Using request in Serializer Fields

Serializer fields have context too!

from rest_framework.serializers import ReadOnlyField

class ShortURLField(ReadOnlyField):
    def to_representation(self, value):
        return self.context['request'].build_absolute_uri(value)

and here’s the serializer:

class URLSerializer(ModelSerializer):
    short_url = ShortURLField()

    class Meta:
        model = URL
        fields = "__all__"

In the URL model, there is a method named short_url that returns a slug for that URL. In our custom ShortURLField, we override the to_representation method to call build_absolute_uri(value) on the current request, creating the full URL from the slug.

Michael Goldberg: PostgreSQL 9.6.2 Release Notes in Hebrew

From Planet PostgreSQL. Published on Mar 26, 2017.

As part of the project to translate the PostgreSQL documentation into Hebrew, I'm proud to announce that the translation of the PostgreSQL 9.6.2 Release Notes is complete.

The Hebrew Notes are available on link.

Comments are welcome.

I will continue posting the translation project progress.

REGINA OBE: PGConfUS 2017 Getting Stuff done in PostGIS

From Planet PostgreSQL. Published on Mar 25, 2017.

A reminder: the PGConfUS 2017 conference is just days away, and we'll be giving a training session on March 28th 2017 at 1 PM in Jersey City, NJ. If you are coming, keep an eye on this page: PGConf 2017 US Getting Stuff done with PostGIS materials.

If you haven't signed up already, there are still spots, make sure to buy your tickets at http://pgconf.us/conferences/2017#registration.


Continue reading "PGConfUS 2017 Getting Stuff done in PostGIS"

Celery + Redis + Django

By Coding for Entrepreneurs | A Django Programming Class for the Non-Technical Founder from Django community aggregator: Community blog posts. Published on Mar 24, 2017.

*Celery* is a task queue with ...

Shaun M. Thomas: PG Phriday: CONFLICT of Interests

From Planet PostgreSQL. Published on Mar 24, 2017.

MySQL has had a REPLACE INTO syntax to perform “UPSERT” logic since practically the very beginning. For the longest time, users who wanted to switch to Postgres, but for whatever reason relied on this functionality, were essentially trapped. Postgres 9.5 changed all that, but why did it take so long? As with much of Postgres history, it’s a long story.

To really understand where Postgres started, we need to look at the “old” way of handling a row merge. Many in the database world have probably encountered this once or twice:

CREATE TABLE my_tab (a INT PRIMARY KEY, b TEXT);

CREATE FUNCTION merge_tab(k INT, v TEXT) RETURNS VOID AS
$$
BEGIN
  LOOP
    -- First, try to update the key.

    UPDATE my_tab SET b = v
     WHERE a = k;

    IF FOUND THEN
      EXIT;
    END IF;

    -- The key doesn't exist, so try to insert it.

    BEGIN
      INSERT INTO my_tab (a, b) VALUES (k, v);
      EXIT;

    EXCEPTION WHEN unique_violation THEN
      -- Nothing here, allow the loop to continue.
    END;

  END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT merge_tab(1, 'James');
SELECT merge_tab(1, 'Jimmy');

What on Earth is all of that? Oddly enough, the somewhat convoluted logic is not only sound, it’s actually required to avoid a race condition. In the microseconds between attempting our UPDATE and following up with the INSERT, some other transaction may have inserted the “missing” key. In that case, we’d encounter a unique constraint violation.

By catching the exception, we’re not immediately kicked out of the function and are presented with a choice. Do we assume our value is “right” and repeat the loop to apply the update, or just exit silently under the assumption that the successful transaction that beat us is probably fine? This particular function chose the former, because that’s what a merge or upsert tries to guarantee: that the requested action is applied. Were we to omit the loop, the exception block would ensure there was no conflict or fatal error, but we could no longer rely on the function operating as advertised.

So why not invert the logic and remove the loop entirely? After all, we could just attempt the insert and if it fails, perform the update within the exception block, right? Actually no. Consider what happens if the target key is deleted by a concurrent transaction. Say we try our insert, and in the space of time between the key violation and our update, it gets deleted. Suddenly our update also produces an error. That’s probably an extremely unlikely edge case, but in OLTP databases, the unlikely becomes frighteningly common. So to be safe, we’re stuck with the loop.

Don’t think about it too much

That is a lot of overhead for what many consider basic functionality. Since that’s no longer a concern, let’s take a look at the actual syntax the Postgres team selected. To do that, let’s start with a very basic table with a handful of rows:

CREATE TABLE symbol_mapper (
  vendor_id    BIGINT   NOT NULL,
  ext_mapping  VARCHAR  NOT NULL,
  symbol       VARCHAR  NOT NULL,
  PRIMARY KEY (vendor_id, ext_mapping)
);

INSERT INTO symbol_mapper VALUES (1, 'Google', 'GOOGL');
INSERT INTO symbol_mapper VALUES (1, 'Apple', 'AAPL');
INSERT INTO symbol_mapper VALUES (2, 'goo', 'GOOGL');
INSERT INTO symbol_mapper VALUES (2, 'app', 'AAPL');

ANALYZE symbol_mapper;

The purpose of a mapping table is to fill the role of decoding external names or lookup values to match internal ones. Since each vendor may have its own designation structure, we require a mapping for each. That also protects us in case two vendors use the same identifiers.

So far we have a fairly standard application of tables. Now let’s do something interesting:

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (2, 'app', 'AAPL')
    ON CONFLICT DO NOTHING;

INSERT 0 0

In this particular case, the “app” mapping already exists for vendor 2, so no insert takes place. This is the equivalent of not wrapping our insert/update with a loop. We don’t care what the value is, just as long as something is there. In reality, this is more of a way to suppress error output from violations than anything immediately useful.

The real fun doesn’t start until we integrate the DO UPDATE functionality. Consider the case where we want to add Samsung as a mapping for a vendor. In this particular case, someone sneaked an existing row into the system, and it contains a typo.

INSERT INTO symbol_mapper VALUES (1, 'Samsung', 'SSLNF');

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Samsung', 'SSNLF')
    ON CONFLICT (vendor_id, ext_mapping)
    DO UPDATE SET symbol = 'SSNLF';

INSERT 0 1

SELECT * FROM symbol_mapper;

 vendor_id | ext_mapping | symbol
-----------+-------------+--------
         1 | Google      | GOOGL
         1 | Apple       | AAPL
         2 | goo         | GOOGL
         2 | app         | AAPL
         1 | Samsung     | SSNLF

What we’ve done here is ensure the newest incoming mapping is the “correct” one; that’s our merge. An observant reader might ask how this is any different from our futile desire to attempt an INSERT with an UPDATE in an exception block. Unlike those two separate statements and the time-consuming exception handling, this is a single atomic action.

Did another session delete the row before us? We’ll just insert it again. Did another transaction delete the row we just inserted or updated? Oh well. The important part is that it is impossible to delete the row while our statement is running. So while the logic is similar to using an exception, the difference is that DO UPDATE is built into the database itself, so it can’t be broken into multiple actions that can be interrupted.

Another interesting bit of syntax is that we can actually incorporate a WHERE clause into the update beyond the implicit assumption that our update affects the same key we tried to insert.

INSERT INTO symbol_mapper VALUES (1, 'Sony', 'SONY');

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Sony', 'SNY')
    ON CONFLICT (vendor_id, ext_mapping)
    DO UPDATE SET symbol = 'SNY'
        WHERE symbol_mapper.symbol = 'SONY';

INSERT 0 1

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Sony', 'SNY')
    ON CONFLICT (vendor_id, ext_mapping)
    DO UPDATE SET symbol = 'SNY'
        WHERE symbol_mapper.symbol = 'SONY';

INSERT 0 0

In this case, we had a mapping for Sony that needed a correction. The first query affected the row we targeted, and the second did nothing. This is important because if we had not specified that predicate, both updates would have successfully modified the row. And so would all subsequent attempts. Remember Postgres keeps a row version for every update, even if the new and old values are identical. That’s just how MVCC works.

In a loosely built application environment, it isn’t uncommon for several vectors to operate simultaneously. If a dozen of these each upsert the same value, they’ll all be satisfied that their work is complete, and Postgres would be stuck with a dozen duplicate old rows. VACUUM (and autovacuum) will ensure old row versions are recycled, but again, that’s more overhead we don’t need to invoke.

And of course, the WHERE clause isn’t restricted to deflecting repeated update attempts. There may be circumstances where we simply don’t want to apply changes. By specifying the table name, we can introspect into any of the existing table values. What about the values we attempted to insert? Since these were part of an inherent violation, they’re assigned to a record named “excluded”.

Here it is in action:

INSERT INTO symbol_mapper VALUES (1, 'Microsoft', 'msft');

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Microsoft', 'MSFT')
    ON CONFLICT ON CONSTRAINT symbol_mapper_pkey
    DO UPDATE SET symbol = 'MSFT'
        WHERE symbol_mapper.symbol != excluded.symbol;

This is a very similar situation as we had with Sony. The mapping for Microsoft needs an update if the existing value doesn’t match the one we’re attempting to insert. Well, we can perform that check explicitly without hard-coding those values into the query multiple times. It’s possible to refer to anything in the VALUES tuple by specifying “excluded”. Handy, eh?

Also notice that we’ve changed our conflict condition. Previously we had simply listed the columns in the primary key, and Postgres inferred the proper constraint from that definition. In this case, we directly stated the constraint that Postgres should use in resolving the conflict. It’s somewhat uncommon (and probably not entirely safe) to directly invoke constraint names, but the option is there in case we want it.

This feature was a long time coming; Postgres 9.5 was released in early 2016. As impossible as it sounds, we’ve only really had a little over a year to leverage ON CONFLICT. As a consequence, it’s still slowly seeping into existing Postgres application stacks. Users are still incorporating it into their workflows. It’ll be a while before it’s taken for granted with the rest of the kitchen sink Postgres offers.

Until then, it’s that special toy we’ve always wanted but couldn’t afford until now. There’s nothing else to do but make up for lost time!

“Toy”

Bruce Momjian: Referencing SELECT Column Aliases in WHERE Clauses

From Planet PostgreSQL. Published on Mar 24, 2017.

Ever wonder why using SELECT column aliases in WHERE clauses fails?

SELECT random() AS confusion
FROM generate_series(1,10)
WHERE confusion > 0.5;
 
ERROR:  column "confusion" does not exist
LINE 3: WHERE confusion > 0.5;
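
The short answer is that the WHERE clause is logically evaluated before the SELECT list, so the alias doesn't exist yet at that point. A common workaround (not shown in this excerpt) is to wrap the query in a subquery so the outer WHERE can reference the alias:

SELECT confusion
FROM (SELECT random() AS confusion
        FROM generate_series(1, 10)) AS sub
WHERE confusion > 0.5;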

Continue Reading »

Pavan Deolasee: PGConf India 2017 – An Event to Remember

From Planet PostgreSQL. Published on Mar 24, 2017.

Yet another edition of PGConf India came to a conclusion in early March. You may have noticed the change from PGDay to PGConf, which signals a much larger gathering of PostgreSQL enthusiasts, now and in the future. What started as a small meet-up of like-minded people four years ago has now grown into a two-day conference with a dedicated training day and a multi-track main event.

The keynote this year was delivered by Simon Riggs, a major developer and committer at PostgreSQL. He presented his thoughts on why Persistence is key to PostgreSQL’s success. Persistence is why users trust PostgreSQL to manage their critical data and persistence is why PostgreSQL community is able to deliver a solid product, release after release.

This year’s conference was attended by more than 250 delegates, coming from 20 different cities around the world and representing over 80 different companies. This is at least 40% growth over last year’s conference, and it shows the growing popularity of the conference, but even more so of PostgreSQL as a preferred choice of database.

The other key achievement this year was the success of a dedicated training day. The program saw much higher interest than we anticipated. Over 75 participants from 30 different companies attended this day-long session and got a chance to learn from some of the best trainers in the world. The training topics covered performance tuning, high availability, physical/logical replication, and backup and recovery.

This is the first time we ran 3 parallel tracks at the main conference. Until 2 years ago we used to struggle to get even a handful of topics submitted for talks. This year we had to reject 2 submissions for every accepted talk, even with a 3-track event.

We hope this journey continues its positive growth and that we see more participation from our community. If you have any comments, please write to the organisers at contact@pgconf.in.

Leaving you with a photo moment.

Hosting Django Sites on Amazon Elastic Beanstalk

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Mar 23, 2017.

Introduction

Amazon Web Services (AWS)' Elastic Beanstalk is a service that bundles up a number of their lower-level services to manage many details for you when deploying a site. We particularly like it for deploys and autoscaling.

We were first introduced to Elastic Beanstalk when taking over an existing project that used it. It's not without its shortcomings, but we've generally been happy enough with it to stick with it for the project and consider it for others.

Basics

Elastic Beanstalk can handle a number of technologies that people use to build web sites, including Python (2.7 and 3.4), Java, PHP, .Net, Node.js, and Ruby. More are probably in the works.

You can also deploy containers, with whatever you want in them.

To add a site to Elastic Beanstalk, you create a new Elastic Beanstalk application, set some configuration, and upload the source for your application. Then Elastic Beanstalk will provision the necessary underlying resources, such as EC2 (virtual machine) instances, load balancers, autoscaling groups, DNS, etc, and install your application appropriately.

Elastic Beanstalk can monitor the load and automatically scale underlying resources as needed.

When your application needs updating, you upload the updated source, and Elastic Beanstalk updates the underlying resources. You can choose from multiple update strategies.

For example, on our staging server, we have Elastic Beanstalk update the application on our existing web servers, one at a time. In production, we have Elastic Beanstalk start setting up a new set of servers, check their health, and only start directing traffic to them as they're up and healthy. Then it shuts down the previous servers.

Elastic Beanstalk and Python

As you'd expect, when deploying a Python application, Elastic Beanstalk will create a virtual environment and install whatever's in your requirements.txt file.

A lot of other configuration can be done. Here are some example configuration file snippets from Amazon's documentation on Elastic Beanstalk for Python:

option_settings:
  aws:elasticbeanstalk:application:environment:
    DJANGO_SETTINGS_MODULE: production.settings
  aws:elasticbeanstalk:container:python:staticfiles:
    "/images/": "staticimages/"
  aws:elasticbeanstalk:container:python:
    WSGIPath: ebdjango/wsgi.py
    NumProcesses: 3
    NumThreads: 20

packages:
  yum:
    libmemcached-devel: '0.31'

container_commands:
  00collectstatic:
    command: "django-admin.py collectstatic --noinput"
  01syncdb:
    command: "django-admin.py syncdb --noinput"
    leader_only: true
  02migrate:
    command: "django-admin.py migrate"
    leader_only: true

Here are some things we can note:

  • We can define environment variables, like DJANGO_SETTINGS_MODULE.
  • We can tell Elastic Beanstalk to serve static files for us.
  • We ask Elastic Beanstalk to run our Django application using WSGI.
  • We can install additional packages, like memcached.
  • We can run commands during deploys.
  • Using leader_only, we can tell Elastic Beanstalk that some commands only need to be run on one instance during the deploy.

This is also where we could set parameters for autoscaling, configure the deployment strategy, set up notifications, and many other things.

"leader_only"

The leader_only feature is great for doing something on only one of your servers during a deploy. But don't make the mistake we made of trying to use that to configure one server differently from the others, for example to run some background task periodically. Once the deploy is done, there's nothing special about the server that ran the leader_only commands during the deploy, and that server is as likely as any other to be terminated during autoscaling.

Right now, Elastic Beanstalk doesn't provide any way to readily differentiate servers so you can, for example, run things on only one server at a time. You'll have to do that yourselves.

Our solution for the situation where we ran into this was to use select_for_update to "lock" the records we were updating.
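
Under the hood, Django's select_for_update() adds FOR UPDATE to the SELECT, so only one transaction at a time can hold the affected rows. A rough SQL equivalent, with a hypothetical table name:

BEGIN;
SELECT * FROM app_task WHERE id = 42 FOR UPDATE;  -- blocks concurrent FOR UPDATE/UPDATE on this row
-- do the periodic work exactly once
UPDATE app_task SET processed_at = now() WHERE id = 42;
COMMIT;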

Database

You can have Elastic Beanstalk manage RDS (Amazon's hosted database service) for you, but so far we've preferred to set up RDS outside of Elastic Beanstalk. If Elastic Beanstalk was managing it, then Elastic Beanstalk would provide pointers to the database server in environment variables, which would be more convenient than keeping track of it ourselves. On the other hand, with our database outside of Elastic Beanstalk, we know our data is safe, even if we make some terrible mistake in our Elastic Beanstalk configuration.

Migrations

As always, you have to think carefully when deploying an update that includes migrations. Both of the deploy strategies we mentioned earlier will result in the migrations running on the database while some servers are still running the previous code, so you need to be sure that'll work. But that's a problem anytime you have multiple web servers and deserves a blog post of its own.

Time to deploy

One thing we're not happy with is the time it takes for an update deploy - over 20 minutes for our staging environment, and over 40 minutes for our production environment with its more conservative deploy strategy.

Some of that time is under our control. For example, it still takes longer than we'd like to set up the environment on each server, including building new environments for both Python and Node.js. We've already made some speedups there, and will continue working on that.

Other parts of the time are not under our control, especially some parts unique to production deploys. It takes AWS several minutes to provision and start a new EC2 instance, even before starting to set up our application's specific environment. It does that for one new server, waits until it's completely ready and tests healthy (several more minutes), and then starts the process all over again for the rest of the new servers it needs. (Those are done in parallel.)

When all the traffic is going to the new servers, it starts terminating the previous servers at 3 minute intervals, and waiting for all those to finish before declaring the deploy complete.

There are clear advantages to doing things this way: Elastic Beanstalk won't publish a completely broken version of our application, and the site never has any downtime during a production deploy. There are CLI tools that help you manage your deploys, plus a nice web interface to monitor what’s going on. We just wish the individual steps didn't take so long. Hopefully Elastic Beanstalk will find ways over time to improve things.

Conclusion

Despite some of its current shortcomings, we're quite happy to have Elastic Beanstalk in our deployment and hosting toolbox. We previously built our own tool for deploying and managing Django projects in an AWS autoscaling environment (before many of the more recent additions to the AWS suite, such as RDS for Postgres), and we know how much work it is to design, build, and maintain such a platform.

Ernst-Georg Schmid: Windows, keep your dirty fingers off my files!

From Planet PostgreSQL. Published on Mar 23, 2017.

I spent the better part of the morning figuring out why a colleague could not import a PostgreSQL dump in plain format made on Linux on his Windows machine.

According to documentation, this works like so (OS agnostic):

psql dbname < infile

However, this gave the following error:

ERROR:  missing data for ...

The documentation for psql, though, gives an alternative way to read commands from a file: the -f switch.

 "
Read commands from the file filename, rather than standard input. This option can be repeated and combined in any order with the -c option. When either -c or -f is specified, psql does not read commands from standard input; instead it terminates after processing all the -c and -f options in sequence. Except for that, this option is largely equivalent to the meta-command \i.
If filename is - (hyphen), then standard input is read until an EOF indication or \q meta-command. This can be used to intersperse interactive input with input from files. Note however that Readline is not used in this case (much as if -n had been specified).
Using this option is subtly different from writing psql < filename. In general, both will do what you expect, but using -f enables some nice features such as error messages with line numbers. There is also a slight chance that using this option will reduce the start-up overhead. On the other hand, the variant using the shell's input redirection is (in theory) guaranteed to yield exactly the same output you would have received had you entered everything by hand.
 "


What this doesn't tell you is that on Windows, CMD.exe apparently tries to interpret the file it reads, and in doing so it destroyed data in the dump so that COPY could no longer understand it. So the last sentence of the quote above is just theory on Windows.

Long story short: with psql -f everything went fine. Don't use the shell's I/O redirection with psql on Windows!

Craig Ringer: Re-import repository keys for BDR and pglogical apt repositories

From Planet PostgreSQL. Published on Mar 23, 2017.

The BDR and pglogical apt repository GnuPG signing keys have been renewed.

Users should re-import the existing keys. You can verify that it’s still the same key as referenced in the documentation, just with a later expiry date.

Simply run:

wget --quiet -O - http://packages.2ndquadrant.com/bdr/apt/AA7A6805.asc | sudo apt-key add -
sudo apt-get update

If you get an error like:

GPG error: http://packages.2ndquadrant.com jessie-2ndquadrant
InRelease: The following signatures were invalid: KEYEXPIRED 1490229886 KEYEXPIRED 1490229886 KEYEXPIRED 1490229886
WARNING: The following packages cannot be authenticated!
postgresql-bdr-client-9.4 postgresql-bdr-9.4 postgresql-bdr-contrib-9.4 postgresql-bdr-9.4-bdr-plugin

… then re-import your keys per the above instructions.

Fossgis: sewer cadastre with qgis - Jörg Höttges

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on Mar 22, 2017.

(One of my summaries of a talk at the 2017 fossgis conference).

Together with engineering firms from the Aachen region they created Qkan. Qkan is:

  • A data structure.
  • Plugins for Qgis.
  • Direct access. Not a specific application with restricted access, but unrestricted access from within Qgis. (He noticed lots of interest among the engineers to learn qgis during the project!)

It has been designed for the needs of the engineers that have to work with the data. You first import the data from the local sewer database. Qkan converts the data to what it needs. Then you can do simulations in a separate package. The results of the simulation will be visualized by Qkan in qgis. Afterwards you probably have to make some corrections to the data and give corrections back to the original database. Often you have to go look at the actual sewers to make sure the database is correct. Output is often a map with the sewer system.

Some functionality: import sewer data (in various formats), simulate water levels, draw graphs of the water levels in a sewer, and support database-level checks ("an end node cannot occur halfway along a sewer").

They took care to make the database schema simple. The source sewer database is always very complex because it has to hold lots of metadata. The engineer that has to work with it needs a much simpler schema in order to be productive. Qkan does this.

They used qgis, spatialite, postgis, python and qt (for forms). An important note: they used as much postgis functionality as possible instead of the geographical functions from qgis, the reason being that postgis (and even spatialite) is often much quicker.

With qgis, python and the "qt designer", you can make lots of handy forms. But you can always go back to the database that's underneath it.

The code is at https://github.com/hoettges

Fossgis: open source for emergencies - Marco Lechner

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on Mar 22, 2017.

(One of my summaries of a talk at the 2017 fossgis conference).

He works for the Bundesamt für Strahlenschutz, basically the German government agency that was started after Chernobyl to protect against and to measure radioactivity. The software system they use and build is called IMIS.

IMIS consists of three parts:

  • Measurements (automatic + mobile measurements + laboratory results).
  • Prediction system. Including documentation (managed in Plone, a python CMS system).
  • Decision support. Help support the government layers that have to make the decisions.

They have a simple map at odlinfo.bfs.de.

The current core of the system is proprietary. They are dependent on one single firm. The system is heavily customized for their usage.

They need a new system because geographical analysis keeps getting more important and because there are new requirements coming out of the government. The current program cannot handle that.

What they want is a new system that is as simple as possible; that uses standards for geographical exchange; they don't want to be dependent on a single firm anymore. So:

  • Use open standards, so OGC. But also a specific world-wide nuclear info protocol.
  • Use existing open source software. OSGEO.
  • If we need something special, can we change/extend existing open source software?
  • If not, then it is OK to create their own software, under an open source license.

They use open source companies to help them, including training their employees and helping these employees get used to modern software development (Jenkins, Docker, etc.).

If you use an open source strategy, what do you need to do to make it fair?

  • Your own developments should also be open source!
  • You need your own test and build infrastructure. (For instance Jenkins)
  • You need to make it easy to start working with what you made: documentation, docker, buildout (!), etc.

(Personal note: I didn't expect to hear 'buildout' at this open source GIS conference. I've helped quite a bit with that particular piece of python software :-) )

Bruce Momjian: Characters Sets, Encodings, and Collations

From Planet PostgreSQL. Published on Mar 22, 2017.

Understanding how character sets, encodings, and collations work together can be confusing. I would like to explain them in this blog post:

Character Sets

Postgres databases can be initialized to support a variety of character sets. A character set is the full set of characters or glyphs that can be represented. Popular character sets are ASCII (127 characters), Latin1 (ISO8859-1, 255 characters), and Unicode (128k+ characters).
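
To see which settings your own databases use, a couple of quick queries (not part of the original post) help:

SHOW server_encoding;

SELECT datname, pg_encoding_to_char(encoding) AS encoding, datcollate, datctype
  FROM pg_database;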

Continue Reading »

Pierre Ducroquet: Using Row Level Security with a Django application

From Planet PostgreSQL. Published on Mar 22, 2017.

Hello

Row Level Security is one of the lesser-known great additions in PostgreSQL 9.5. The documentation about it is of PostgreSQL documentation quality of course, but there is a lack of online examples and usages that don't rely on distinct PostgreSQL users for multi-tenant websites. I recently built a proof of concept using RLS to secure access to a specific table in a multi-site Django application, and I will explain here how to do it. It is a very simple trick, but it can improve your security a lot, especially in bigger applications where auditing the whole source code becomes tedious.

What is Row Level Security?

Row Level Security, aka RLS, is a great PostgreSQL feature that allows you, for each (user, table, action) combination, to specify additional rules that restrict access to the rows. The PostgreSQL documentation, and most online documentation, demonstrates the feature with SQL users. A table with an “owner” column would be automatically filtered on the criterion owner = current_user. It's very efficient (the criterion is not applied after the query but pushed into the query and thus can use indexes if needed), and for a very low maintenance cost you get security pushed down to the lowest level of your stack.

With RLS, suddenly customer A can no longer steal data from customer B, even if there is a SQL injection in your server. It would also require a security issue or misconfiguration on the PostgreSQL side. Of course, you should keep checks in your code, better safe than sorry, but it's the extra security belt that could save you from the next “data leaks” headlines 🙂

Limitation of the common RLS demonstrations and workaround

Usually, most RLS demos apply restrictions per SQL user. It's, by far, the easiest and safest way to do it, because it is very hard to alter the data used to apply the rules. But it's not going to work very well for most, if not all, web applications. Web applications seldom use different SQL users per authenticated web user. Sometimes one SQL user is used per “site” (we will see that a bit later), but even that is not the most common deployment strategy. Web applications, with their very short connection lives, rely on poolers like pgbouncer to keep connections to the SQL server alive and save a lot of time on each HTTP request. Doing that for tens or hundreds of different SQL users would be complicated and likely to be expensive.

But there is absolutely nothing in PostgreSQL preventing you from using any function. We could even have a policy applying restrictions based on the time or the day of the week (a “no work can be done outside work hours” policy). The real “complexity” is forwarding the needed information from your application to the policy. And PostgreSQL has a solution for that too (new motto of the day?): SET SESSION…

So let’s build a small django application and combine all that together to have our security belt against code mistakes.

Our first policy : business_hours

First, let’s see how to create a simple RLS policy, based on my previous business hours idea.

-- First step, create a table...
test=# CREATE TABLE work_queue(id serial primary key, task text);
CREATE TABLE 
test=# GRANT SELECT ON work_queue TO employee;
GRANT
test=# INSERT INTO work_queue(task) VALUES ('long and complicated task');
INSERT 0 1

We are superuser here (notice the # in the prompt) and we put a new task in our work_queue, which our employee can read. But he could do that even outside business hours, and we don't want that. So let's write a policy and enable it:

test=# CREATE POLICY business_hours ON work_queue FOR ALL TO employee USING (extract('hour' from now()) between 8 and 20);
CREATE POLICY
test=# ALTER TABLE work_queue ENABLE ROW LEVEL SECURITY ;
ALTER TABLE

Policies are quite readable: on table work_queue, for all operations (SELECT, INSERT, UPDATE, DELETE) by user employee, filter rows using the time of day. Note that policies, of course, don't apply to superusers…

And now, if our employee selects from the table:

test=> SELECT NOW(); SELECT * FROM work_queue;
 now 
-------------------------------
 2017-03-21 23:15:28.560327+01
(1 row)

id | task 
----+------
(0 rows)

And if he comes back tomorrow:

test=> SELECT NOW(); SELECT * FROM work_queue;
 now 
-------------------------------
 2017-03-22 10:20:53.382093+01
(1 row)

id | task 
----+---------------------------
 1 | long and complicated task
(1 row)

We wrote our first security policy, congrats! Now, let’s start working with Django.

Django and RLS

Understanding the need

In the Django world, a single server, a single process, can handle several “sites”. Let's say you, tinycorp, are selling SaaS software to your customers megacorp1 and megacorp2. They both want their logo to appear on the login page. So you won't be able to have a simple app.tinycorp.com domain, otherwise you would have no way to customize the look. You will instead use one vhost per customer, megacorp1.tinycorp.com and megacorp2.tinycorp.com. That's what Django calls a site.

So here we will work on preventing a leaky page, exploited by an attacker against megacorp2, from leaking data from megacorp1 too.

What to inject, and how ?

Django stores in memory, for each request, the current site. This is done in the first steps after a browser connects to the server. It is a simple integer referencing the django_site table. So we have to issue the following SQL call for each request:

SET SESSION django.site TO $id$;

How can we do that without altering every view in the application? It’s easy, most Python/web developers will know the answer: we must write a middleware.

In the Python world, a middleware is a function (or a class that defines what is needed to behave like a function) that is called for each request. The request and the “next” middleware are given as parameters, the response must be returned. For instance, you could have an AuthMiddleware, a CSRFProtectionMiddleware, a WafMiddleware and then your application. This is a very nice mechanism, efficient and simple.

Since this post is getting quite long, I will just post a working Django middleware here. The code should be obvious to most Python developers.

from django.contrib.sites.shortcuts import get_current_site
from django.db import connection

class RlsMiddleware(object):
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        current_site = get_current_site(request)
        with connection.cursor() as cursor:
            cursor.execute('SET SESSION "django.site" = %s;' % current_site.id)
        response = self.get_response(request)
        with connection.cursor() as cursor:
            cursor.execute('SET SESSION "django.site" = -1;')
        return response

And now, we only have to enable RLS on the tables and restrict the Django SQL user.

CREATE POLICY "check_django_site" ON my_table
                FOR ALL
                TO "django-user"
                USING ((site_id = (current_setting('django.site'))::integer));
ALTER TABLE my_table ENABLE ROW LEVEL SECURITY;

And… that's all, folks! Nothing else is needed; you only have to apply such a policy to each table.

Drawbacks and further improvements…

The main drawback of this solution, compared with restrictions based on the SQL user, is that you trust your SQL user to honestly request the proper site. It's far weaker security than distinct SQL users, but like any security measure, you must do a threat analysis. The threat here is poorly coded views forgetting a check, and some partial SQL injections. Full SQL injections, allowing any SQL query, could still hurt you badly, since they could change the django.site variable. But considering the low cost of that extra security… why not?

Another drawback is that you must have a separate SQL user for your console Django commands, Celery tasks, or whatever you run in your Django application without user interaction. But it's almost always better to do that anyway, so it's more like pushing a good practice down your throat 🙂
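
A minimal sketch of such a separate role, with hypothetical names: the web application keeps using its regular role, which is subject to the policies, while management commands and Celery use a role with BYPASSRLS (available since 9.5) so they can see every site's rows:

CREATE ROLE "django-admin" LOGIN PASSWORD 'change-me' BYPASSRLS;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO "django-admin";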

If you want, you can go deeper. Much, much deeper. You could set a django.user variable that contains the current user id and push most access rules down into the database. Again, you are protecting yourself against some programming or logic errors; a full SQL injection would work around that easily… Thank you all for reading this long explanation, and I hope it will be of use to someone… Especially if I have an account on your website, I don't want to be in a pwned list, please!

The Top 5 Most Used Django Commands and what they mean.

By Coding for Entrepreneurs | A Django Programming Class for the Non-Technical Founder from Django community aggregator: Community blog posts. Published on Mar 22, 2017.

Django command line commands a...

Ask Vitor #2: How to dynamically filter ModelChoice's queryset in a ModelForm?

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Mar 22, 2017.

Michał Strumecki asks:

I just want to filter a select field in a form based on the currently logged-in user. Every user has their own categories and budgets. I want to display only the models related to the currently logged-in user. I've tried filtering before the is_valid call, but with no result.


Answer

This is a very common use case when dealing with ModelForms. The problem is that the form fields ModelChoiceField and ModelMultipleChoiceField, which are used respectively for the model fields ForeignKey and ManyToManyField, default their queryset to Model.objects.all().

If the filtering was static, you could simply pass a filtered queryset in the form definition, like Model.objects.filter(status='pending').

When the filtering parameter is dynamic, we need to do a few tweaks in the form to get the right queryset.

Let’s simplify the scenario a little bit. We have the Django User model, a Category model and Product model. Now let’s say it’s a multi-user application. And each user can only see the products they create, and naturally only use the categories they own.

models.py

from django.contrib.auth.models import User
from django.db import models

class Category(models.Model):
    name = models.CharField(max_length=30)
    user = models.ForeignKey(User, on_delete=models.CASCADE)

class Product(models.Model):
    name = models.CharField(max_length=30)
    price = models.DecimalField(decimal_places=2, max_digits=10)
    category = models.ForeignKey(Category)
    user = models.ForeignKey(User, on_delete=models.CASCADE)

Here is how we can create a ModelForm for the Product model, using only the currently logged-in user:

forms.py

from django import forms
from .models import Category, Product

class ProductForm(forms.ModelForm):
    class Meta:
        model = Product
        fields = ('name', 'price', 'category', )

    def __init__(self, user, *args, **kwargs):
        super(ProductForm, self).__init__(*args, **kwargs)
        self.fields['category'].queryset = Category.objects.filter(user=user)

That means now the ProductForm has a mandatory parameter in its constructor. So, instead of initializing the form as form = ProductForm(), you need to pass a user instance: form = ProductForm(user).

Here is a working example of view handling this form:

views.py

from django.contrib.auth.decorators import login_required
from django.shortcuts import render, redirect
from .forms import ProductForm

@login_required
def new_product(request):
    if request.method == 'POST':
        form = ProductForm(request.user, request.POST)
        if form.is_valid():
            product = form.save(commit=False)
            product.user = request.user
            product.save()
            return redirect('products_list')
    else:
        form = ProductForm(request.user)
    return render(request, 'products/product_form.html', {'form': form})

Using ModelFormSet

The machinery behind the modelformset_factory is not very flexible, so we can’t add extra parameters in the form constructor. But we certainly can play with the available resources.

The difference here is that we will need to change the queryset on the fly.

Here is what we can do:

views.py

from django.contrib.auth.decorators import login_required
from django.forms import modelformset_factory
from django.shortcuts import redirect, render

from .models import Category, Product

@login_required
def edit_all_products(request):
    ProductFormSet = modelformset_factory(Product, fields=('name', 'price', 'category'), extra=0)
    data = request.POST or None
    formset = ProductFormSet(data=data, queryset=Product.objects.filter(user=request.user))
    for form in formset:
        form.fields['category'].queryset = Category.objects.filter(user=request.user)

    if request.method == 'POST' and formset.is_valid():
        formset.save()
        return redirect('products_list')

    return render(request, 'products/products_formset.html', {'formset': formset})

The idea here is to provide a screen where the user can edit all his products at once. The product form involves handling a list of categories. So for each form in the formset, we need to override the queryset with the proper list of values.

ProductFormSet

The big difference here is that each category list is filtered down to the categories owned by the logged-in user.


Get the Code

I prepared a very detailed example you can explore to get more insights.

The code is available on GitHub: github.com/sibtc/askvitor.

David Rader: Get AWS RDS PostgreSQL log files from command line

From Planet PostgreSQL. Published on Mar 22, 2017.

AWS RDS has a pretty good web UI in their console. It’s even got a log file watcher built in. But sometimes you just want to download your PostgreSQL log files from your command line, maybe because you’re looking for slow queries and want to run pgBadger. Or you want to script grabbing the last log file after you run a batch data load. Or you just like the command line better.

You should already know that AWS has a command-line interface that uses the same APIs their web console is built on, with extensive documentation. Make sure you have it installed and configured to access your AWS account. Since most of the CLI output is JSON, we’re going to use the jq tool to pick apart the responses and get the bits we want.

Step 1: Find your instance identifier

Hopefully you gave your RDS instance a nice, easy-to-remember name when you created it. To find out, list all of your instances, filter for postgres engines (in case you also have some “other” databases for some reason) and show their status. You use the describe-db-instances API for this:

aws rds describe-db-instances 

And this will give you a big blob of JSON that makes your eyes bleed trying to find the information you want.

To make this usable, fire up JQ, filter down to postgres engines (let’s pretend we don’t need MySQL for WordPress or anything), and check the availability status:

aws rds describe-db-instances | jq ' .DBInstances[] | select( .Engine | contains("postgres")) | .DBInstanceIdentifier + ": " + .DBInstanceStatus' 
"bigsql-on-rds: available" 
"openwatch: available"

Much better!

Step 2: Find the most recent log file

Now, you want to find the most recent log file for your instance. Use the describe-db-log-files command and the instance identifier for your db to find the one that was most recently written:

aws rds describe-db-log-files --db-instance-identifier bigsql-on-rds | jq ' .DescribeDBLogFiles | max_by( .LastWritten ) | .LogFileName'
"error/postgresql.log.2017-03-22-14"

Step 3: Retrieve the log file

Download it! Use the text output if you want it to look like the PostgreSQL log file you know and love. If you had previously set log_min_duration_statement to a value, you’ll see the SQL statements that took longer than your value to execute.

aws rds download-db-log-file-portion --db-instance-identifier bigsql-on-rds --log-file-name "error/postgresql.log.2017-03-22-13" --output text
2017-03-22 13:13:43 UTC:172.30.1.60(37924):xxxxxx@xxxxxxx:[22791]:LOG:  duration: 3.244 ms  parse <unnamed>: SET extra_float_digits = 3
2017-03-22 13:13:43 UTC:172.30.1.60(37924):xxxxxx@xxxxxxx:[22791]:LOG:  duration: 1.902 ms  execute <unnamed>: SET extra_float_digits = 3
2017-03-22 13:13:43 UTC:172.30.1.60(37924):xxxxxx@xxxxxxx::[22791]:LOG:  duration: 14.961 ms  parse <unnamed>: INSERT INTO some_table (column_one, column_two, column_three, column_four) values ($1,$2,$3,'now()')
2017-03-22 13:13:43 UTC:172.30.1.60(37924):xxxxxx@xxxxxxx:[22791]:LOG:  duration: 1.678 ms  bind <unnamed>: INSERT INTO some_table (column_one, column_two, column_three, column_four) values ($1,$2,$3,'now()')
2017-03-22 13:13:43 UTC:172.30.1.60(37924):xxxxxx@xxxxxxx:[22791]:DETAIL:  parameters: $1 = 'a big long string', $2 = '', $3 = 'short string, or so'
2017-03-22 13:13:43 UTC:172.30.1.60(37924):xxxxxx@xxxxxxx:[22791]:LOG:  duration: 8.089 ms  execute <unnamed>: INSERT INTO some_table (column_one, column_two, column_three, column_four) values ($1,$2,$3,'now()')
2017-03-22 13:13:43 UTC:172.30.1.60(37924):xxxxxx@xxxxxxx:[22791]:DETAIL:  parameters: $1 = 'an even longer and bigger string than the first time', $2 = '', $3 = 'short string, or so'
2017-03-22 13:15:04 UTC::@:[22496]:LOG:  received SIGHUP, reloading configuration files
2017-03-22 13:15:04 UTC::@:[22496]:LOG:  parameter "log_min_duration_statement" changed to "-1"
2017-03-22 13:15:23 UTC::@:[22496]:LOG:  received fast shutdown request
2017-03-22 13:15:23 UTC::@:[22496]:LOG:  aborting any active transactions

Step 4: Script it (Left as exercise for reader)
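If you would rather do the scripting in Python than in shell, a rough sketch of the same steps with boto3 might look like this (the instance identifier and output filename are placeholders):

import boto3

rds = boto3.client('rds')
instance_id = 'bigsql-on-rds'  # placeholder: your instance identifier

# Step 2: find the most recently written log file
logs = rds.describe_db_log_files(DBInstanceIdentifier=instance_id)
latest = max(logs['DescribeDBLogFiles'], key=lambda f: f['LastWritten'])['LogFileName']

# Step 3: download it in portions; Marker and AdditionalDataPending drive the pagination
marker = '0'
with open('postgresql.log', 'w') as out:
    while True:
        portion = rds.download_db_log_file_portion(
            DBInstanceIdentifier=instance_id,
            LogFileName=latest,
            Marker=marker)
        out.write(portion.get('LogFileData') or '')
        if not portion['AdditionalDataPending']:
            break
        marker = portion['Marker']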

Michael Paquier: Postgres 10 highlight - Tracking of current logfiles

From Planet PostgreSQL. Published on Mar 21, 2017.

The following feature has landed in Postgres 10 to help system administrators:

commit: 19dc233c32f2900e57b8da4f41c0f662ab42e080
author: Robert Haas <rhaas@postgresql.org>
date: Fri, 3 Mar 2017 11:43:11 +0530
Add pg_current_logfile() function.

The syslogger will write out the current stderr and csvlog names, if
it's running and there are any, to a new file in the data directory
called "current_logfiles".  We take care to remove this file when it
might no longer be valid (but not at shutdown).  The function
pg_current_logfile() can be used to read the entries in the file.

Gilles Darold, reviewed and modified by Karl O.  Pinc, Michael
Paquier, and me.  Further review by Álvaro Herrera and Christoph Berg.

When “stderr” or “csvlog” is defined as log_destination, most users have no easy way to know which file the PostgreSQL backends are currently writing to. There are configurations where this can be guessed automatically, for example by tweaking log_filename to use only a day or a month number and then having some client application layer guess which file is currently being written to based on the current date, but this adds extra complexity by creating a dependency between an upper application layer and a setting value in PostgreSQL.

The above patch, as mentioned in the commit message, exposes the current files that logs are being written to, depending on the log destinations defined. Once run, the new function shows the file currently in use:

=# SELECT pg_current_logfile();
           pg_current_logfile
-----------------------------------------
 pg_log/postgresql-2017-03-22_151520.log
(1 row)

This function actually parses a file at $PGDATA/current_logfiles that gets updated each time a log file is rotated, or when parameters are reloaded and the log destinations change. If no argument is given, the first entry in the file is returned. Note as well that the entry for “stderr” is generated first, followed by the one for “csvlog”, so the order of entries written to current_logfiles is pre-defined and does not depend on the order of destinations defined in log_destination.

An extra argument can be used as well, “csvlog” or “stderr” to get what is the file in use for those log destinations:

=# SELECT pg_current_logfile('stderr');
           pg_current_logfile
-----------------------------------------
 pg_log/postgresql-2017-03-22_151520.log
(1 row)
=# SELECT pg_current_logfile('csvlog');
           pg_current_logfile
-----------------------------------------
 pg_log/postgresql-2017-03-22_151520.csv
(1 row)

Note that access to this function is forbidden to non-superusers by default, but it can be granted. Note also that the value of log_directory, which is a superuser-only parameter, is used as the prefix of the result returned, so granting access to this function leaks a bit of superuser-only information.

Fossgis: creating maps with open street map in QGis - Axel Heinemann

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on Mar 21, 2017.

(One of my summaries of a talk at the 2017 fossgis conference).

He wanted to make a map for a local run: a nice map with the route and the infrastructure (start, end, parking, etc.), instead of the usual not-quite-readable city plan with a simple line on top. With QGis and OpenStreetMap he should be able to make something better!

A quick try with QGis, combined with the standard openstreetmap base map, already looked quite nice, but he wanted to do more customizations on the map colors. So he needed to download the openstreetmap data. That turned into quite a challenge. He tried two plugins:

  • OSMDownloader: easy selection, quick download. Drawback: too many objects as you cannot filter. The attribute table is hard to read.
  • QuickOSM: key/value selection, quick. Drawback: you need a bit of experience with the tool, as it is easy to forget key/values.

He then landed on https://overpass-turbo.eu . The user interface is very friendly. There is a wizard to get common cases done. And you can browse the available tags.

With the data downloaded with overpass-turbo, he could easily adjust colors and get a much nicer map out of it.

You can get it to work, but it takes a lot of custom work.

Some useful links:

https://taginfo.openstreetmap.org http://tagfinder.herokuapp.com https://gis.stackexchange.com

https://abload.de/img/screenshot2017-03-21a6asqe.png

Photo explanation: just a nice unrelated picture from the recent beautiful 'on traxs' model railway exhibition (see video)

Fossgis: introduction on some open source software packages

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on Mar 21, 2017.

(One of my summaries of a talk at the 2017 fossgis conference).

The conference started with a quick introduction on several open source programs.

Openlayers 3 - Marc Jansen

Marc works on both openlayers and GeoExt. Openlayers is a javascript library with lots and lots of features.

To see what it can do, look at the 161 examples on the website :-) It works with both vector layers and raster layers.

Openlayers is a quite mature project, the first version is from 2006. It changed a lot to keep up with the state of the art. But they did take care to keep everything backwards compatible. Upgrading from 2.0 to 2.2 should have been relatively easy. The 4.0.0 version came out last month.

Openlayers...

  • Allows many different data sources and layer types.
  • Has built-in interaction and controls.
  • Is very actively developed.
  • Is well documented and has lots of examples.

The aim is to be easy to start with, but also to allow full control of your map and all sorts of customization.

Geoserver - Marc Jansen

(Again Marc: someone was sick...)

Geoserver is a java-based server for geographical data. It supports lots of OGC standards (WMS, WFS, WPS, etc). Flexible, extensible, well documented. "Geoserver is a glorious example that you can write very performant software in java".

Geoserver can connect to many different data sources and make those sources available as map data.

If you're a government agency, you're required to make INSPIRE metadata available for your maps: geoserver can help you with that.

A big advantage of geoserver: it has a browser-based interface for configuring it. You can do 99% of your configuration work in the browser. For maintaining: there is monitoring to keep an eye on it.

Something to look at: the importer plugin. With it you get a REST API to upload shapes, for instance.

The latest version also supports LDAP groups. LDAP was already supported, but group membership not yet.

Mapproxy - Dominik Helle

Dominik is one of the MapProxy developers. Mapproxy is a WMS cache and tile cache. The original goal was to make maps quicker by caching maps.

Some possible sources: WMS, WMTS, tiles (google/bing/etc), MapServer. The output can be WMS, WMS-C, WMTS, TMS, KML. So the input could be google maps and the output WMS. One of their customers combines the output of five different regional organisations into one WMS layer...

The maps that MapProxy returns can be stored on a local disk in order to improve performance. The way they are stored allows MapProxy to support intermediate zoom levels instead of only fixed ones.

The cache can be in various formats: MBTiles, sqlite, couchdb, riak, arcgis compact cache, redis, s3. The cache is efficient by combining layers and by omitting unneeded data (empty tiles).

You can pre-fill the cache ("seeding").

Some other possibilities, apart from caching:

  • A nice feature: clipping. You can limit a source map to a specific area.
  • Reprojecting from one coordinate system to another. Very handy if someone else doesn't want to support the coordinate system that you need.
  • WMS feature info: you can just pass it on to the backend system, but you can also intercept and change it.
  • Protection of your layers. Password protection. Protect specific layers. Only allow specific areas. Etcetera.

QGis - Thomas Schüttenberg

QGis is an open source GIS platform: desktop, server, browser, mobile. And it is a library. It runs on OS X, Linux, Windows and Android. The base is the Qt UI library, hence the name.

Qgis contains almost everything you'd expect from a GIS package. You can extend it with plugins.

Qgis is a very, very active project: almost 1 million lines of code, 30,000+ GitHub commits, and 332 developers have worked on it, 104 of them in the last 12 months.

Support is available via documentation, mailing lists and http://gis.stackexchange.com/ . In case you're wondering about the names of the releases: they come from the towns where the twice-a-year project meeting takes place :-)

Since December 2016, there's an official (legal) association.

QGis 3 will have lots of changes: QT 5 and python 3.

Mapbender 3 - Astrid Emde

Mapbender is a library to build web GIS applications. Ideally you don't need to write any code yourself; instead you configure it in your browser. It also supports mobile usage.

You can try it at http://demo.mapbender3.org/ . Examples are at http://mapbender3.org/?q=en/gallery .

You can choose a layout and fill in and configure the various parts. Layers you want to show: add sources. You can configure security/access with roles.

An example component: a search form for addresses that looks up addresses with sql or a web service. Such a search form can be a popup or you can put it in the sidebar, for instance. CSS can be customized.

PostNAS - Astrid Emde, Jelto Buurman

The PostNAS project is a solution for importing ALKIS data, a data exchange format for the German cadastre (German: Kataster).

PostNAS is an extension of the GDAL library for the "NAS" vector data format (NAS = normalisierte Austausch Schnittstelle, "normalized exchange format"). This way, you can use all of the GDAL functionality with the cadastre data. But that's not the only thing: there's also a QGis plugin. There are also configuration and conversion scripts for PostGIS, Mapbender, MapServer, etc.

They needed postprocessing/conversion scripts to get useful database tables out of the original data, tables that are usable for showing in QGis, for instance.

So... basically a complete open source environment for working with the cadastre data!

https://abload.de/img/screenshot2017-03-21a1islt.png

Photo explanation: just a nice unrelated picture from the recent beautiful 'on traxs' model railway exhibition (see video)

Class-Based Views vs. Function-Based Views

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Mar 21, 2017.

If you follow my content here in the blog, you probably have already noticed that I’m a big fan of function-based views. Quite often I use them in my examples. I get asked a lot why I don’t use class-based views more frequently. So I thought about sharing my thoughts about that subject matter.

Before reading, keep this in mind: class-based views do not replace function-based views.


Introduction

I’m not an old-school Django developer who used it when there were only function-based views and watched class-based views being released. But there was a time when only function-based views existed.

I guess it didn’t take long until all sorts of hacks and solutions were created to extend and reuse views and make them more generic.

There was a time when function-based generic views were a thing. They were created to address the common use cases. But the problem was that they were quite simple, and it was very hard to extend or customize them (other than through configuration parameters).

To address those issues, class-based views were created.

Now, views are always functions. Even class-based views.

When we add them to the URL conf using the View.as_view() class method, it returns a function.

Here is what the as_view method looks like:

class View:
    @classonlymethod
    def as_view(cls, **initkwargs):
        """Main entry point for a request-response process."""
        for key in initkwargs:
            # Code omitted for clarity
            # ...

        def view(request, *args, **kwargs):
            self = cls(**initkwargs)
            if hasattr(self, 'get') and not hasattr(self, 'head'):
                self.head = self.get
            self.request = request
            self.args = args
            self.kwargs = kwargs
            return self.dispatch(request, *args, **kwargs)

        # Code omitted for clarity
        # ...

        return view

Parts of the code were omitted for clarity; you can see the full method on GitHub.

So if you want to explicitly call a class-based view, here is what you need to do:

return MyView.as_view()(request)

To make it feel more natural, you can assign it to a variable:

view_function = MyView.as_view()
return view_function(request)

The view function returned by the as_view() method is the outer part of every class-based view. Once called, the view passes the request to the dispatch() method, which will execute the appropriate method according to the request type (GET, POST, PUT, etc.).

Class-Based View Example

For example, if you created a view extending the django.views.View base class, the dispatch() method will handle the HTTP method logic. If the request is a POST, it will execute the post() method inside the view, if the request is a GET, it will execute the get() method inside the view.

views.py

from django.views import View

class ContactView(View):
    def get(self, request):
        # Code block for GET request

    def post(self, request):
        # Code block for POST request

urls.py

urlpatterns = [
    url(r'contact/$', views.ContactView.as_view(), name='contact'),
]

Function-Based View Example

In function-based views, this logic is handled with if statements:

views.py

def contact(request):
    if request.method == 'POST':
        # Code block for POST request
    else:
        # Code block for GET request (will also match PUT, HEAD, DELETE, etc)

urls.py

urlpatterns = [
    url(r'contact/$', views.contact, name='contact'),
]

Those are the main differences between function-based views and class-based views. Now, Django’s generic class-based views are a different story.


Generic Class-Based Views

The generic class-based views were introduced to address the common use cases in a web application, such as creating new objects, form handling, list views, pagination, archive views and so on.

They ship with the Django core, and you can import them from the django.views.generic module.

They are great and can speed up the development process.

Here is an overview of the available views:

Simple Generic Views
  • View
  • TemplateView
  • RedirectView
Detail Views
  • DetailView
List Views
  • ListView
Editing Views
  • FormView
  • CreateView
  • UpdateView
  • DeleteView
Date-Based Views
  • ArchiveIndexView
  • YearArchiveView
  • MonthArchiveView
  • WeekArchiveView
  • DayArchiveView
  • TodayArchiveView
  • DateDetailView

You can find more details about each implementation in the official docs: Built-in class-based views API.

I find them a little bit confusing, because the generic implementations use a lot of mixins, so, at least for me, the code flow is sometimes not very obvious.

Now here is a great resource, also from the Django Documentation, a flattened index with all the attributes and methods from each view: Class-based generic views - flattened index. I keep this one in my bookmarks.


The Different Django Views Schools

Last year I got myself a copy of the book Two Scoops of Django: Best Practices for Django 1.8. It’s a great book. Each chapter is self-contained, so you don’t need to read the whole book in order.

In chapter 10, Daniel and Audrey talk about best practices for class-based views. They bring up this tip that was very interesting to read, so I thought about sharing it with you:

School of “Use all the generic views”!
This school of thought is based on the idea that since Django provides functionality to reduce your workload, why not use that functionality? We tend to belong to this school of thought, and have used it to great success, rapidly building and then maintaining a number of projects.

School of “Just use django.views.generic.View”
This school of thought is based on the idea that the base Django CBV does just enough and is ‘the True CBV, everything else is a Generic CBV’. In the past year, we’ve found this can be a really useful approach for tricky tasks for which the resource-based approach of “Use all the views” breaks down. We’ll cover some use cases for it in this chapter.

School of “Avoid them unless you’re actually subclassing views”
Jacob Kaplan-Moss says, “My general advice is to start with function views since they’re easier to read and understand, and only use CBVs where you need them. Where do you need them? Any place where you need a fair chunk of code to be reused among multiple views.”

Excerpt from “Two Scoops of Django: Best Practices for Django 1.8” - 10.4: General Tips for Django CBV, page 121.

The authors said in the book they are in the first school. Personally, I’m in the third school. But as they said, there is no consensus on best practices.


Pros and Cons

For reference, some pros and cons about function-based views and class-based views.

Function-Based Views

Pros:
  • Simple to implement
  • Easy to read
  • Explicit code flow
  • Straightforward usage of decorators

Cons:
  • Hard to extend and reuse the code
  • Handling of HTTP methods via conditional branching

Class-Based Views

Pros:
  • Can be easily extended, reuse code
  • Can use O.O. techniques such as mixins (multiple inheritance)
  • Handling of HTTP methods by separate class methods
  • Built-in generic class-based views

Cons:
  • Harder to read
  • Implicit code flow
  • Hidden code in parent classes, mixins
  • Use of view decorators requires an extra import or a method override (see the sketch below)
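To make that last point concrete, decorating a class-based view usually means reaching for method_decorator. A minimal sketch, reusing the ContactView from earlier (recent Django versions let you decorate the class itself and name the method to wrap):

from django.contrib.auth.decorators import login_required
from django.utils.decorators import method_decorator
from django.views import View

@method_decorator(login_required, name='dispatch')
class ContactView(View):
    def get(self, request):
        # Code block for GET request
        ...

With a function-based view, the same protection is just @login_required placed directly above the function.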

There is no right or wrong. It all depends on the context and the needs. As I mentioned in the beginning of this post, class-based views do not replace function-based views. There are cases where function-based views are better. In other cases class-based views are better.

For example, if you are implementing a list view and can get it working just by subclassing ListView and overriding a few attributes: great, go for it.

Now, if you are performing a more complex operation, handling multiple forms at once, a function-based view will serve you better.


Conclusions

The reason I often use function-based views in my post examples is that they are way easier to read. Many readers who stumble upon my blog are beginners, just getting started. Function-based views communicate better, as the code flow is explicit.

I usually start my views as function-based views. If I can use a generic class-based view just by overriding the attributes, I go for it. If I have some very specific needs that will be replicated across several views, I create my own custom generic view by subclassing django.views.generic.View.

Now, a piece of general advice from the Django documentation: If you find you’re struggling to implement your view as a subclass of a generic view, then you may find it more effective to write just the code you need, using your own class-based or functional views.

Andrew Dunstan: Support for enums in btree_gin and btree_gist

From Planet PostgreSQL. Published on Mar 21, 2017.

I have just committed a series of patches that provide support for enum types in the btree_gin and btree_gist standard extensions. This is something I first started work on about a year ago. It turned out to be more involved than I had first thought it would be, as it requires some additional core code due to the way that enum comparisons work, which is a bit more complex than for most data types and involves use of PostgreSQL’s internal caching mechanism.

The practical upshot of this, however, is that starting with PostgreSQL 10 you will be able to use enum columns in exclusion constraints. That’s something that could be very useful – I started this work when I found, somewhat to my surprise, that it wasn’t possible.

 

 

Django Admin: Expensive COUNT(*) Queries

By Abu Ashraf Masnun from Django community aggregator: Community blog posts. Published on Mar 20, 2017.

If you are a Django developer, it is very likely that you use the Django Admin regularly. And if you have maintained a website with a huge amount of data, you probably already know that the Django Admin can become very slow when the database tables get large. If you log the SQL queries (either using Django logging or the Django Debug Toolbar), you will notice a very expensive SQL query, something like this:

SELECT COUNT(*) AS "__count" FROM "table_name"

In the default settings, you will actually notice this query twice. If you use Django Debug Toolbar, it will tell you that the query was duplicated 2 times.

Issue - 1

By default ModelAdmin has show_full_result_count = True which shows the full result count in the admin interface. This is the source of one of the count(*) queries.

To fix that, we just need to set this on our ModelAdmin:

show_full_result_count = False

Issue - 2

Even after switching show_full_result_count off, we are still noticing a count(*) query in the log. It’s because the Django Paginator does a count itself.

The solution is to somehow bypass the expensive query while still returning a number so the pagination works as expected. We can cache the count value, or even run a raw SQL query to find an approximate value through a rather inexpensive lookup somewhere else.

Here’s a quick example of a paginator that runs the expensive query once and then caches the results:

from django.core.paginator import Paginator
from django.core.cache import cache

# Modified version of a GIST I found in a SO thread
class CachingPaginator(Paginator):
    def _get_count(self):

        if not hasattr(self, "_count"):
            self._count = None

        if self._count is None:
            try:
                key = "adm:{0}:count".format(hash(self.object_list.query.__str__()))
                self._count = cache.get(key, -1)
                if self._count == -1:
                    self._count = super().count
                    cache.set(key, self._count, 3600)

            except:
                self._count = len(self.object_list)
        return self._count

    count = property(_get_count)

    

Now on our ModelAdmin we just need to use this paginator.

paginator = CachingPaginator

Once we have done that, the page will be slow the first time we load it and faster afterwards. We can also fetch and cache this value from time to time. This solution might not give us the exact count and may thus mess up pagination sometimes, but in most cases that would not be much of a problem.
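The "approximate value" idea mentioned above can also be sketched out. On PostgreSQL, the planner's row estimate in pg_class is a cheap stand-in for COUNT(*). The paginator below is my own illustrative class: it only makes sense for an unfiltered changelist, is only as fresh as the last ANALYZE, and falls back to the exact count when no estimate is available.

from django.core.paginator import Paginator
from django.db import connection

class ApproxCountPaginator(Paginator):
    @property
    def count(self):
        with connection.cursor() as cursor:
            # reltuples holds the planner's row estimate for the table
            cursor.execute(
                "SELECT reltuples::bigint FROM pg_class WHERE relname = %s",
                [self.object_list.model._meta.db_table])
            row = cursor.fetchone()
        if row and row[0] > 0:
            return row[0]
        # No usable estimate (e.g. the table was never analyzed): do the real count
        return super().count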

Bruce Momjian: Outer Joins and WHERE Clauses

From Planet PostgreSQL. Published on Mar 20, 2017.

Postgres supports both traditional join syntax, which uses the WHERE clause to specify joined columns, and ANSI join syntax, which uses the word JOIN in the FROM clause. While both syntaxes can be used for inner joins, only the ANSI join syntax supports outer joins in Postgres.

Because column restrictions like col = 4 are not related to joins, you would think that restrictions have the same effect whether they appear in the WHERE clause or as part of a join clause, e.g. a OUTER JOIN b ON a.x = b.x AND col = 4. However, this is not always true. Restrictions in the JOIN clause are processed during joins, while WHERE clause restrictions are processed after joins.

This is only significant in outer joins (and CROSS joins) because columns from unjoined rows are manufactured by outer joins. Here is an example:
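Bruce's own example is behind the link below; as a rough stand-in, here is a small sketch of the effect using Python's sqlite3 module (chosen only because it is self-contained; the semantics are the same in Postgres):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER, col INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (1, 4), (2, 5);
""")

# Restriction inside the JOIN clause: applied while joining, so the
# unmatched row from "a" is kept, with NULLs for b's columns.
print(conn.execute("""
    SELECT a.x, b.col FROM a
    LEFT OUTER JOIN b ON a.x = b.x AND b.col = 4
    ORDER BY a.x
""").fetchall())   # [(1, 4), (2, None)]

# Same restriction in the WHERE clause: applied after the join, so the
# NULL-extended row is filtered out and the outer join acts like an inner join.
print(conn.execute("""
    SELECT a.x, b.col FROM a
    LEFT OUTER JOIN b ON a.x = b.x
    WHERE b.col = 4
    ORDER BY a.x
""").fetchall())   # [(1, 4)]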

Continue Reading »

Jim Mlodgenski: Proliferation of NUMERIC

From Planet PostgreSQL. Published on Mar 20, 2017.

More and more I’m seeing people use a NUMERIC for their primary keys. This is a direct result of people leveraging automated tools for their Oracle to PostgreSQL migration. Oracle’s NUMBER data type is frequently used as the primary key for a table, so the tools just map it to a PostgreSQL NUMERIC and call it a day. The PostgreSQL NUMERIC is a purpose-built data type for arbitrary precision numbers where exactness is required, think monetary values. It can hold up to 131072 digits before the decimal point and up to 16383 after the decimal point. To me, that does not sound like the right choice for the surrogate key for your country code table. It really is more like using a cement truck for your daily commute. It gets the job done, but it is not even close to the most efficient way to do it.

Let’s take a look.

First, let’s create 2 tables with the only difference being the data type of the primary key.

test=# \d bar_int
                   Table "public.bar_int"
 Column |       Type        | Collation | Nullable | Default 
--------+-------------------+-----------+----------+---------
 a      | integer           |           | not null | 
 b      | character varying |           |          | 
Indexes:
    "bar_int_pkey" PRIMARY KEY, btree (a)

test=# \d bar_numeric
                 Table "public.bar_numeric"
 Column |       Type        | Collation | Nullable | Default 
--------+-------------------+-----------+----------+---------
 a      | numeric           |           | not null | 
 b      | character varying |           |          | 
Indexes:
    "bar_numeric_pkey" PRIMARY KEY, btree (a)

Then, let’s fill those tables with 100 million rows of data.

test=# INSERT INTO bar_int 
test-# SELECT a, repeat('x', 100) 
test-# FROM generate_series(1, 100000000) a;

Looking at the sizes on disk of those 2 tables, they both take up exactly the same amount of space. When people are trying to decide if a NUMERIC is a safe choice for their primary keys, many times this is where they stop the analysis. The storage difference is irrelevant or non-existent, so it must be a good choice. We know it’s not, so let’s dig deeper.

test=# SELECT relname, pg_total_relation_size(relid) 
test-# FROM pg_stat_user_tables WHERE relname LIKE 'bar%' 
test-# ORDER BY 1;
   relname   | pg_total_relation_size 
-------------+------------------------
 bar_int     |            16373833728
 bar_numeric |            16373833728
(2 rows)

In a transactional system, the primary key is used to return a single row so let’s look at the difference NUMERIC makes for this use case. Creating a simple script to look up a random row by the primary key and running it via pgbench shows the performance loss of NUMERIC.

\set id random(1, 10000000)
SELECT b FROM bar_int WHERE a = :id;

pgbench -n -f number_test -c 4 -t 100000 test

Integer: 27105 tps
Numeric: 25823 tps

That is a 4.7% performance loss from using NUMERIC instead of an integer. For some people who want to squeeze every last bit of performance out of the system, this is enough to not use NUMERIC, but for many it is not. The convenience of using a NUMERIC is well worth the loss of 5% on queries that run in milliseconds. But let’s keep digging.

A very common first migration project people undertake is a reporting database. In many ways, it is a perfect choice as people become comfortable with PostgreSQL since frequently the uptime requirements are not as stringent as a transactional system. Let’s look at a very basic reporting type query joining 2 tables by their primary keys.

SELECT count(*) FROM foo_int a, bar_int b WHERE a.a = b.a;

This is where the performance loss of NUMERIC will change everyone’s mind about using them as primary keys.

Average Query Time:
Integer: 5637 ms
Numeric: 9216 ms

The query just using NUMERIC instead of integers as the primary key performs nearly 39% slower. And that is just a simple case. If you start summing or averaging the values which is frequently done in reports, the performance loss easily tops 40%.

While it is easy to just map an Oracle NUMBER to a PostgreSQL NUMERIC, take the time to map things to the correct data type and save that 40% of performance.

Federico Campoli: pg_chameleon beta 1 out

From Planet PostgreSQL. Published on Mar 19, 2017.

After fixing a few bugs, I decided to release the first pg_chameleon beta.

The package details with the changelog are available here http://www.pgdba.co.uk/p/mysql-to-postgresql-replica.html

These are the notable changes.

  • Switch to psycopg2 2.7.x - thanks to Daniele's work, the requirement now installs without needing the Python or PostgreSQL headers, making the install in a virtualenv much simpler.
  • Installing system wide or in a virtualenv follows the same rules. The configuration files and the package files are installed in Python's site-packages/pg_chameleon.
  • Configuration directory created at first run. If not present, the $HOME/.pg_chameleon/ directory is created the first time chameleon.py is executed. The script creates the directory and the required config dir with the config-example.yaml file.
  • The write_batch function now uses copy_expert in order to speed up the batch load. The fallback to inserts is still present.

The release is already available on PyPI: https://pypi.python.org/pypi/pg_chameleon/

The documentation is available here http://pythonhosted.org/pg_chameleon/

Please report any issue on the project's github page https://github.com/the4thdoctor/pg_chameleon

However, if you'd like to get in touch you can ping me on Twitter at @4thdoctor_scarf or, if you prefer to chat, there is a dedicated IRC channel, irc.freenode.net #pgchameleon (you can find me logged in at night).

Michael Goldberg: The first pgDay in Israel

From Planet PostgreSQL. Published on Mar 19, 2017.

pgDay Israel 2017, the first event of its kind in Israel fully dedicated to PostgreSQL, was held at Google Campus Tel Aviv on March 2. The event was run by the PostgreSQL Israel Community.

The event raised great interest in the high-tech and security industries and was attended by more than 70 participants from a variety of domains, companies and age groups.

Speakers from Israeli companies and from the global PostgreSQL Community, Oleg Bartunov and Ivan Panchenko, made the event atmosphere interesting and valuable for all the participants.

Running an engineering reorg

By Will Larson from Django community aggregator: Community blog posts. Published on Mar 17, 2017.

At quickly growing companies, I believe there are two managerial skills that have a disproportionate impact on your organization’s success: making technical migrations cheap, and running clean reorganizations. Do both well, and you can skip that lovely running to stand still sensation, and invest your attention more fruitfully.

Of the two, managing organizational change is more general so let’s work through a lightly structured framework for (re)designing an engineering organization.

Caveat: this is more of a thinking tool than a recipe!

My approach for planning organization change:

  1. Validate organizational change is the right tool.
  2. Project headcount a year out.
  3. Set target ratio of management to individual contributors.
  4. Identify logical teams and groups of teams.
  5. Plan staffing for the teams and groups.
  6. Commit to moving forward.
  7. Roll out the change.

Now, let’s drill into each of those a bit.

Is a reorg the right tool?

There are two best kinds of reorganizations:

  • the one that solves a structural problem, and
  • the one that you don’t do.

There is only one worst: the one you do because you’re avoiding a people management issue.

My checklist for ensuring that a reorganization is appropriate is:

  1. Is the problem structural? The opportunities of organization change are to increase communication, reduce decision friction, and focus attention; if you’re looking for a different change, consider a bit if there’s a more direct approach.
  2. Are you reorganizing to work around a broken relationship? Management is a profession where karma always comes due, and you’ll be better addressing the underlying issue than continuing to work around it.
  3. Does the problem already exist? It’s better to wait until a problem actively exists before solving it, because it’s remarkably hard to predict future problems. Even if you’re right that the problem will occur, you may end up hitting a different problem first.
  4. Are the conditions temporary? Are you in a major crunch period or otherwise doing something you don’t anticipate doing again? If so, then it may be easier to patch through and rethink on the other side, and avoid optimizing for a transient failure mode.

Alright, so you’re still thinking you want a reorg.

Project headcount a year out.

The first step of designing the organization is determining its approximate total size. I recommend reasoning through this number from three or four different directions:

  1. an optimistic number that’s barely possible,
  2. a number based on the “natural size” of your organization, if every team and role was fully staffed,
  3. a realistic number based on historical hiring rates.

Then merge those into a single number.

Unless you’ve changed something meaningful in your process, it’s likely the historical trend will hold accurate, and you should weight it the most heavily (and my sense is that the list of easy changes that significantly change hiring outcomes is short).

One of the goals of using the year out headcount number is to avoid optimizing too heavily for your exact current situation and the current individuals you’re working with. Organizational change is so disruptive to so many people that I’ve increasingly come to believe you should drive organizational design from the boxes and not from the key individuals.

Manager to engineer ratio.

Once you have your headcount projection, next you need to identify how many individuals you want each manager to support. This number particularly depends on your company’s working definition of an engineering manager’s role.

If engineering managers are expected to be doing hands on technical work, then their teams should likely be three to five engineers (unless the team has been working together well for a long time, in which case things get very specific and hard to generalize about).

Otherwise, targeting five to eight, depending on experience level, is pretty typical. If you’re targeting higher than eight engineers per manager, then it’s worth reflecting on why you believe your managers can support a significantly higher load than industry average: are they exceptionally experienced? are your expectations lower than typical?

In any case, pick your target, probably in the six to eight range.

Defining teams and groups.

Now that you have your target organization size and target ratio of managers to engineers, it’s time to figure out the general shape of your organization!

For 35 engineers and seven engineers per manager, you’re looking at five managers (35 divided by seven) and 1.8 manager of managers (Math.log(35, 7)).

For 74 engineers and six engineers per manager, that’s twelve managers and 2.4 second-degree managers (Math.log(74, 6)).
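Read Math.log(x, y) as a base-y logarithm; in Python terms, my restatement of that arithmetic is simply:

import math

def org_shape(engineers, per_manager):
    first_line_managers = engineers / per_manager               # e.g. 35 / 7 = 5
    second_degree_managers = math.log(engineers, per_manager)   # e.g. log base 7 of 35 is about 1.8
    return first_line_managers, second_degree_managers

print(org_shape(35, 7))  # (5.0, ~1.83)
print(org_shape(74, 6))  # (~12.33, ~2.40)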

In a growing company, you should generally round up the number of managers, as this is a calculation “at rest”, and your organization will be a living, evolving thing.

Once you have the numbers, then these are useful to ground you in the general number of teams and groups of teams you should have.

In the first case with 35 engineers, you’re going to want between one and three groups, containing a total of five or six teams. In the latter case with 74, you’ll want two to four groups comprising twelve to fifteen teams.

Once you’ve grounded yourself, here are some additional considerations:

  1. Can you write a crisp mission statement for each team?
  2. Would you personally be excited to be a member of each of the teams, as well as to be the manager for each of those teams?
  3. Put teams which work together (especially poorly) as close together as possible. This minimizes the distance for escalations during disagreements, allowing arbiters to have sufficient context, but also most poor working relationships are the byproduct of information gaps, and nothing fills them faster than proximity.
  4. Can you define clear interfaces for each team?
  5. Can you list the areas of ownership for each team?
  6. Have you created a gapless map of ownership, such that everything is clearly owned by one team? Try to avoid implicitly creating holes of ownership. If you need to create explicit holes of ownership, that’s a better solution (essentially, defining unstaffed teams).
  7. Are there compelling candidate pitches for each of those teams?
  8. As always, are you over-optimizing on individuals versus a sensible structure?

This is the least formulaic aspect of organizational design, and if possible it’s a good time to lean on your network and similar organizations for ideas.

Staffing the teams and groups.

With your organization design and headcount planning, you can roughly determine when you’ll need to fill each of the technical and management leadership positions.

From there, you have four sources of candidates to staff them:

  1. Team members who are ready to fill the roles now.
  2. Team members who can grow into the roles in the timeframe.
  3. Internal transfers from within your company.
  4. External hires who already have the skills.

That is probably an ordered list of how you should try to fill the key roles. This is true both because you want people who already know your culture, and also because reorganizations that depend on yet-to-be-hired individuals are much harder to pull off successfully.

Specifically, I’d recommend having a spreadsheet with every single person’s name on it, their current team and their future team. Accidentally missing someone is the cardinal sin of reorganization.

Commit to moving forward.

Now it’s time to make a go decision. A few questions to ask yourself before you decide to fully commit:

  1. Are the changes meaningful net positive?
  2. Will the new structure last at least six months?
  3. What problems did you discover during design?
  4. What will trigger the reorg after this one?
  5. Who is going to be impacted most?

After you’ve answered those, make sure to get buy-in not only from yourself, but also from your peers and leadership. Organizational change is rather resistant to rollback, so you have to be collectively committed to moving forward with it, even if it runs into challenges along the way (which, if history holds, it almost certainly will).

Roll out the change.

The final and often times most awkward phase of a reorganization is its rollout. There are three key elements to a good rollout:

  1. Explanation of reasoning driving the reorganization.
  2. Documentation of how each person and team will be impacted.
  3. Availability and empathy to help bleed off frustration from impacted individuals.

In general, the actual tactics of doing this are:

  1. Discuss with heavily impacted individuals in private first.
  2. Ensure managers and other key individuals are prepared to explain the reasoning behind the changes.
  3. Send an email out documenting the changes.
  4. Be available for discussion.
  5. If necessary, hold an organization all-hands, but probably try not to. People don’t process well in large groups, and the best discussions take place in small rooms.
  6. Double down on doing skip-level 1:1s.

And with that, you’re done! You’ve worked through an engineering reorganization. Hopefully you won’t need to do that again for a while.

As a closing thought, organizations are both (1) collections of people, and (2) a manifestation of an idea separate from the individuals comprising it. You can’t reason about them purely from either direction. There are many, exceedingly valid, different ways to think about any given reorganization, and you should use these ideas as one model for thinking through changes, not a definitive roadmap.

Bruce Momjian: 'You need to set a tone where everyone in the community feels valued'

From Planet PostgreSQL. Published on Mar 17, 2017.

As open source gains popularity in the enterprise, there is increased study of open source communities and how they function. Those studying such things often ask about Postgres because of its unusually healthy community and recent successes.

While I was in India in February, I was interviewed by an open source magazine; an excerpt from that interview is now online. It covers open source leadership, encouraging new developers, and healthy software ecosystems.

Shaun M. Thomas: PG Phriday: RESTing in the Corn

From Planet PostgreSQL. Published on Mar 17, 2017.

Last week we explored using Postgres as a central communication nexus between several data sources. At the time, I made a fairly hand-wavy allusion to REST interfaces. Since I hadn’t really explored further, I had assumed PLV8 could use core node.js or other similar libraries to invoke HTTP APIs. Of course, as a trusted language, PLV8 isn’t allowed to do that. It’s more of a language for easily manipulating JSON and JSONB objects within Postgres.

Only slightly less violent than JSON

So we need some other way of calling out to an external HTTP resource, wrapping it in column trimmings, and sending the data to Postgres. As demonstrated last week, we could do this directly with a Python (or some other language) function. Alternatively we can use an extension that has access to Python. Then all we would need to do is write a python library, and suddenly we can interact with the REST interface as if it were an actual Postgres table. This is where Multicorn comes in.

One of the easiest ways to install Multicorn is through PGXN. It goes something like this:

sudo pgxn install multicorn

Otherwise, users of Red Hat variants have a yum repo, and Debian/Ubuntu users can rely on the apt repo. Multicorn is popular enough that it’s included in both of these resources, which is very handy for us.

In any case, we need a basic REST service to poll. In the interest of keeping things simple, let’s use Flask to whip up an API for fetching all the words in a UNIX dictionary file. Since the jsonify method is extremely slow, it’s probably a good idea to cache the output too. The end result might look something like this:

from flask import Flask, jsonify

app = Flask(__name__)

counter = 1
dictionary = []
cached = None

words = open('/usr/share/dict/words', 'r')

for line in words:
    dictionary.append({'id': counter, 'word': line.strip()})
    counter += 1

words.close()

@app.route('/')
def index():
    global cached

    if not cached:
        cached = jsonify({'records': dictionary})

    return cached

if __name__ == '__main__':
    app.run(debug=True, port=9999)

Our REST interface replies to only one URL and accepts no parameters. It’s not exciting, but it does the job of transmitting data to HTTP requests. With this “service” running in the background somewhere, we can continue with our task of turning that output into a Postgres table.

The next thing we need to do is make use of the Multicorn Python libraries. In this arena, Python setuptools are our friend. With them, we can write a Python library that imports Multicorn and it will automatically install everything in a compatible location. Here’s a bare-bones setup.py file that might work:

import subprocess 
from setuptools import setup, find_packages, Extension

setup(
    name='dictionaryfdw',
    version='0.0.1',
    author='Shaun Thomas',
    license='Postgresql',
    packages=['fdw']
)

That takes care of actually installing the project library files. Next we need to actually use the Multicorn API. As a pure foreign data wrapper API, it has a lot more capability than we actually need. By default it can represent direct access to CSV files, filesystem objects, RSS feeds, and a bevy of other cute implementations.

Unfortunately none of these use cases apply to wrapping up /usr/share/dict/words. Multicorn can interact with authentication, pass predicates to help pre-optimize matches, and has several other capabilities we simply won’t leverage here. We just need one table from one file, with very little in the way of parsing.

To that end, we only need to define a single execute method in a class derived from ForeignDataWrapper. With that in place, Python has a very capable protocol request handler we can leverage to actually interact with our REST API. It provides everything we need to get the data and convert the HTTP response from JSON into Python dicts that represent each row of the dictionary contents.

There isn’t much exciting here:

import requests
from multicorn import ForeignDataWrapper

class DictionaryFDW(ForeignDataWrapper):

    def execute(self, quals, columns):
        response = requests.get('http://localhost:9999/')
        data = response.json()

        for row in data['records']:
            yield row

To install it, we merely invoke setup.py:

sudo python setup.py install

Everything is now in place. As a Python library, Multicorn has access to other Python tools installed on the system. This means the primary option to the wrapper will be a named Python class. So long as that class is compatible, we will have a functional Postgres extension.

It just so happens that Multicorn implements this at the SERVER level. This makes sense as each server generally has its own defined interaction model. The CSV server would take filenames as parameters for foreign tables, RSS feeds would need the URL for each external resource, etc. Our wrapper is somewhat less… sophisticated. It takes no parameters and offers no advanced functionality. As such, once we define the server that invokes the library, we’re basically done.

Here’s the code that would create our dictionary table and a sample invocation:

CREATE EXTENSION multicorn;
CREATE SERVER multicorn_srv
       FOREIGN DATA WRAPPER multicorn
       OPTIONS (wrapper 'fdw.DictionaryFDW');

CREATE FOREIGN TABLE dictionary (
       id    INT,
       word  VARCHAR
) SERVER multicorn_srv;

\timing on

SELECT * FROM dictionary LIMIT 5 OFFSET 9000;

  id  |    word
------+------------
 9001 | Lubbock
 9002 | Lubumbashi
 9003 | Lucas
 9004 | Luce
 9005 | Luce's
(5 rows)

Time: 167.939 ms

It works! Beyond that, notice how slow the results are. Though there are only about 100k rows, even fetching a mere handful takes an exorbitant amount of time. This happens for a few reasons.

First consider that this is an external API call. Each request must trigger an HTTP request, parse the results, and then encode and return each row. While we know what our REST process does, others may not be so forthcoming and are essentially black boxes. At least some of the delay is due to whatever work the REST service performs on its end. Then we have the unfortunate fact that, bless its heart, Python is generally a dog’s breakfast when it comes to performance.

Yum!

We could fix some of these problems. Multicorn does have the ability to pass along predicates after all. A more advanced library could implement basic regular expression, equality, or other logic to pass along WHERE clauses like these:

EXPLAIN ANALYZE
SELECT * FROM dictionary
 WHERE word LIKE '%house%';

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Foreign Scan on dictionary  (cost=20.00..20000000000.00 rows=100000000 width=36) (actual time=164.616..326.678 rows=189 loops=1)
   Filter: ((word)::text ~~ '%house%'::text)
   Rows Removed by Filter: 98982
 Planning time: 0.248 ms
 Execution time: 326.896 ms

While in our case Postgres manually removed all inapplicable matches, a more advanced wrapper could transform these into parameters to the API call itself. In such a case, the REST call might return a much smaller (and faster) row subset. Barring that, there’s another way to cheat the system: materialized views!
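For the record, a quals-aware execute() could look roughly like the sketch below. It assumes the REST service accepts a word query parameter, which the toy Flask app above does not actually implement, so treat it purely as an illustration of the Multicorn side:

import requests
from multicorn import ForeignDataWrapper

class SmarterDictionaryFDW(ForeignDataWrapper):

    def execute(self, quals, columns):
        params = {}
        for qual in quals:
            # Multicorn quals expose field_name, operator and value
            if qual.field_name == 'word' and qual.operator == '=':
                params['word'] = qual.value

        response = requests.get('http://localhost:9999/', params=params)
        for row in response.json()['records']:
            yield row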

We have, after all, already used materialized views to represent regularly updated external snapshots. If we treat the API like a rolling data window, we can just refresh a materialized view with its contents and enjoy direct manipulation of the instantiated contents. That means indexes and all the trimmings! A major benefit of this approach over spending all of our time optimizing our wrapper is that we don’t have to duplicate existing Postgres functionality.

We would naturally want to eventually fix the more obvious inefficiencies in our wrapper interface. Yet until that happens—and even afterwards, since performance enhancements are cumulative—we can do something like this:

CREATE EXTENSION pg_trgm;

CREATE MATERIALIZED VIEW mv_dictionary AS SELECT * FROM dictionary;

ANALYZE mv_dictionary;

CREATE INDEX dictionary_word_trigram ON mv_dictionary USING GIST (word gist_trgm_ops);

EXPLAIN ANALYZE SELECT * FROM mv_dictionary WHERE word LIKE '%house%';

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Bitmap Heap Scan on mv_dictionary  (cost=3.36..30.78 rows=10 width=13) (actual time=6.454..6.628 rows=189 loops=1)
   Recheck Cond: ((word)::text ~~ '%house%'::text)
   Rows Removed by Index Recheck: 1
   Heap Blocks: exact=41
   ->  Bitmap Index Scan on dictionary_word_trigram  (cost=0.00..3.35 rows=10 width=0) (actual time=6.435..6.435 rows=190 loops=1)
         Index Cond: ((word)::text ~~ '%house%'::text)
 Planning time: 0.152 ms
 Execution time: 6.783 ms

One cool trick Postgres has up its sleeve is that it can leverage quirky index types. One of these is based on word trigrams. It’s a lossy format that will inevitably return more matches than we actually want on the first iteration, but remember that Postgres will apply a post-filter to remove excess matches. We can see that in the Recheck Cond line from the execution plan above, which yanked one row that didn’t match our LIKE clause. That’s a small price to pay for reducing a 326ms runtime to 6ms!

Implementing trigram logic in our Python library or REST service would be a huge headache, as would duplicating LIKE handling. Since words are most likely to have inexact and fuzzy searches, we may want to avoid the whole mess. With a materialized view, we can let Postgres do all of the complicated data manipulation while the REST API focuses on bridging the gap to our external resource.

Once again, we’re left with a database that contains none of our own data. Tables we create by wrapping and caching external contents give us a transitive scratch zone and staging area for actual analysis.

In a real-world scenario, we could only hope the REST interface and its corresponding foreign data wrapper component wouldn’t be as lazy as the one presented here. Perhaps a better alternative would be to write a full REST-specific foreign data wrapper in Go. In such a world, feed sources and parameters, column decoding, caching layers, and any number of other optimizations would be efficient and native to web-driven content. One wrapper and server combination could represent any number of foreign tables instead of ours which only works with a single service.

The options, like Postgres, are endless and varied. That’s the world foreign data wrappers give us.

Amit Kapila: Hash indexes are faster than Btree indexes?

From Planet PostgreSQL. Published on Mar 17, 2017.


PostgreSQL has supported hash indexes for a long time, but they are not much used in production, mainly because they are not durable. Now, with the next version of PostgreSQL, they will be durable. The immediate question is how they perform compared to btree indexes. A lot of work has been done in the coming version to make them faster. There are multiple ways in which we can compare the performance of hash and btree indexes, like the time taken for creation of the index, or for search or insertion in the index. This blog will mainly focus on the search operation. By definition, hash indexes are O(1) and btree indexes are O(log n); however, with duplicates that is not exactly true.

To start with, let us see the impact of the work done to improve the performance of hash indexes. Below is performance data for the pgbench read-only workload, comparing the performance of hash indexes between 9.6 and HEAD on an IBM POWER-8 machine with 24 cores, 192 hardware threads, and 492GB of RAM.



The workload is such that all the data fits in shared buffers (scale factor is 300 (~4.5GB) and shared_buffers is 8GB).  As we can see from the above graph, performance has increased at all client counts, in the range of 7% to 81%, and the impact is more pronounced at higher client counts. The main work which has led to this improvement is 6d46f478 (Improve hash index bucket split behavior.) and 293e24e5 (Cache hash index's metapage in rel->rd_amcache.).

The first commit 6d46f478 changed the heavyweight locks (locks that are used for logical database objects to ensure the database ACID properties) to lightweight locks (locks that protect shared data structures) for scanning the bucket pages.  In general, acquiring a heavyweight lock is costlier than acquiring a lightweight lock.  In addition to reducing the locking cost, this also avoids locking out scans and inserts for the lifetime of the split.

The second commit 293e24e5 avoids a significant amount of contention when accessing the metapage. Each search operation needs to access the metapage to find the bucket that contains the tuple being searched, which leads to high contention around the metapage.  Each access to the metapage also has to go through the buffer manager. This work avoids that contention by caching the metapage information in a backend-local cache, which bypasses all the buffer-manager-related work and hence the major contention in accessing the metapage.


The next graph shows how the hash index performs compared to the btree index.  For this run we changed the hash index to a btree index in the pgbench read-only tests.



We can see here that the hash index performs better than the btree index, with a performance difference in the range of 10 to 22%.  In some other workloads we have seen even better results, for example with hash indexes on varchar columns, and in the community it has been reported that the improvement is in the range of 40-60% when hash indexes are used on unique index columns.
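For anyone who wants to reproduce a comparison like this, the index swap itself is a one-liner. A rough sketch against the standard pgbench schema (the index name is purely illustrative, and an actual test would also drop or ignore the default btree primary key so the planner has only one index to choose from):

CREATE INDEX pgbench_accounts_aid_hash ON pgbench_accounts USING hash (aid);

Keep in mind that hash indexes only support equality comparisons, which happens to be exactly what the pgbench read-only workload issues.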


The important thing to note about the above data is that it covers only some specific workloads, and it mainly covers Selects, as that is the main area where performance improvement work has been done for PostgreSQL 10.  The other interesting parameters to compare are the size of the index and updates on the index, which need more study and experiments.
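As a starting point for the size comparison mentioned above, the footprints of the two index types can be checked directly from SQL. A quick sketch, assuming the pgbench-style index names used earlier (substitute whatever index names exist in your test schema):

SELECT pg_size_pretty(pg_relation_size('pgbench_accounts_pkey'));
SELECT pg_size_pretty(pg_relation_size('pgbench_accounts_aid_hash'));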

In the end, I would like to thank my colleagues who were directly involved in this work and my employer EnterpriseDB, which has supported this work.  Firstly, I would like to thank Robert Haas, who envisioned all this work and is the committer of this work, and Mithun C Y, who was the author of commit 293e24e5.  Also, I would like to extend sincere thanks to all the community members who are involved in this work, especially Jeff Janes and Jesper Pedersen, who have reviewed and tested this work.

Ask Vitor #1: Getting form data to appear in URL and for use in the next view

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Mar 16, 2017.

Devon Moore asks:

I want to have the user specify a date in a custom form. This date will append the current URL with the date value path/YYYY-MM-DD/ I then need to capture the date and use it to filter data from the database to display that date’s data.
I’m using class based views and for the report I’m using generic view since this view is a custom report I’m building using multiple db models.


Answer

Here is what Devon wants to achieve:

Result

There are a couple of ways to achieve the desired result. Using class-based views, we could perhaps do something like this:

forms.py

from django import forms

class ReportForm(forms.Form):
    date = forms.DateField()

urls.py

from django.conf.urls import url
from core import views  # import your views using the correct app name

urlpatterns = [
    url(r'report/$', views.Report.as_view(), name='report'),
    url(r'report/(?P<year>[0-9]{4})-(?P<month>[0-9]{2})-(?P<day>[0-9]{2})/$',
        views.ReportDetails.as_view(), name='report_details'),
]

Example of valid URL matching the report_details pattern: /report/2017-03-17/. If you want to change the URL to use slashes instead of dashes (/report/2017/03/17/), change the URL pattern to this:

    url(r'report/(?P<year>[0-9]{4})/(?P<month>[0-9]{2})/(?P<day>[0-9]{2})/$',
        views.ReportDetails.as_view(), name='report_details'),

views.py

from django.contrib.auth.models import User
from django.shortcuts import redirect, render
from django.views import View
from django.views.generic.edit import FormView
from django.utils.dateformat import format

from .forms import ReportForm  # ReportForm lives in this app's forms.py (shown above)

class Report(FormView):
    template_name = 'report.html'
    form_class = ReportForm

    def form_valid(self, form):
        date = form.cleaned_data.get('date')
        year = date.year
        month = format(date, 'm')
        day = format(date, 'd')
        return redirect('report_details', year, month, day)

class ReportDetails(View):
    def get(self, request, year, month, day):
        users = User.objects.filter(date_joined__year__gte=int(year))
        # do your thing here, filter the data etc
        return render(request, 'report_details.html', {'users': users})

The view Report is responsible just for validating the user input and redirecting the user to the view that will actually process the report, which is ReportDetails.

The form is posted to Report. Report validates the input and, if it is valid, it fires a redirect towards ReportDetails. ReportDetails grabs the date information from the URL, processes the querysets, and finally renders the template back to the user.

report.html

{% extends 'base.html' %}

{% block content %}
  <form method="post">
    {% csrf_token %}
    {{ form.as_p }}
    <button type="submit">Generate report</button>
  </form>
{% endblock %}

report_details.html

{% extends 'base.html' %}

{% block content %}
  <ul>
  {% for user in users %}
    <li>{{ user.username }}</li>
  {% endfor %}
  </ul>
  <a href="{% url 'report' %}">New report</a>
{% endblock %}

The final result would be something like this:

Report View

Filtering with the proper URL:

Report Details View


Caveats

This implementation will only work if the date is all you need to filter the report.

If you need to pass extra information to do the filtering of the querysets, I would recommend sending the form data directly to the Report view. And perhaps even using a GET request, because you are not modifying the data.

Something like this:

urls.py

urlpatterns = [
    url(r'report/$', views.Report.as_view(), name='report'),
]

views.py

class Report(View):
    def get(self, request):
        if 'date' in request.GET:  # only try to filter if `date` is in the querystring
            form = ReportForm(request.GET)
            if form.is_valid():
                # process the report
                date = form.cleaned_data.get('date')
                invoices = Invoices.objects.filter(date__year=date.year)
                return render(request, 'report_details.html', {'invoices': invoices})
        else:
            form = ReportForm()
        return render(request, 'report.html', {'form': form})

Then you would end up having a URL like this: /report/?date=2017-03-17. And if you had more information in the form, like the status, it would append in the URL: /report/?date=2017-03-17&status=pending.

The URL would still be friendly, and the implementation would be simpler.

Dan Robinson: Redshift Pitfalls And How To Avoid Them

From Planet PostgreSQL. Published on Mar 16, 2017.

Amazon Redshift is a data warehouse that’s orders of magnitudes cheaper than traditional alternatives. Many companies use it, because it’s made data warehousing viable for smaller companies with a limited budget. Since so many Heap customers use Redshift, we built Heap SQL to allow them to sync their Heap datasets to their own Redshift clusters. […]

The post Redshift Pitfalls And How To Avoid Them appeared first on Heap Blog.

Kaarel Moppel: Number one thing to watch out for when doing Postgres Streaming Replication

From Planet PostgreSQL. Published on Mar 16, 2017.

On the operational issues side, one thing that quite commonly floats to the top when dealing with customers using Postgres, especially with smaller setups, is Streaming Replication and its failures. Failure here not as a bug or design failure, but more as a misunderstood “feature”, as encountered problems are mostly actually things that work as intended and with […]

The post Number one thing to watch out for when doing Postgres Streaming Replication appeared first on Cybertec - The PostgreSQL Database Company.

What is a Mixin?

By Chris Bartos from Django community aggregator: Community blog posts. Published on Mar 15, 2017.

A Mixin is something that isn’t really talked about a lot in other programming languages. So, what is a Mixin, exactly? A Mixin is a special kind of inheritance in Python and it’s starting to get a big rise in Django / Web Application Development. You can use a Mixin to allow classes in Python […]

Payal Singh: First Impressions - Google Cloud SQL for PostgreSQL

From Planet PostgreSQL. Published on Mar 15, 2017.

Google Cloud SQL for PostgreSQL is finally here, although in Beta. While not recommended for production use yet, the beta version is perfect for trying out and playing around.

It goes without saying that this service is a direct competitor to Amazon's RDS, and understandably there will be comparisons going forward, both in this post as well as anywhere else Cloud SQL gets reviewed. With that said, let's dive in!

Account for Getting Started: Currently, Google Cloud Platform is providing $300.00 in credit for a maximum of one year. That is great, especially if all you need the platform account for is to try out Cloud SQL for Postgres. Setting up the account is super fast since, like most things Google, for better or worse, the account you're signed in as is used to activate the Cloud Platform service for you, and you're in your console in a matter of seconds. Amazon too has a generous plan for beginners to get into their AWS world.

Documentation: Cloud SQL's documentation is very clear and concise. I also appreciate the direct links to sections of the console within the tutorial steps. Very handy, and it makes it immensely easier to find your way if, like me, you have never worked on Google's Cloud Platform before.

Interface: I found Google's instance creation interface to be much cleaner than Amazon's. While, to create an RDS instance on Amazon, you're guided through multiple pages starting with the basic settings and progressing to the advanced ones, instance creation in Google is all done on a single page, making the process seem faster and simpler and allowing you to review as you choose new settings. This may change, though, as Cloud SQL adds more features and options.




Security: While Google's instance creation page is simpler, I did not see any security group/VPC/subnet/VPN related information I was required to sign off on before my instance was created, unlike RDS, where reviewing these settings is the minimum barrier to creating an instance, so that barrier is much higher than in Cloud SQL. My guess is Google prioritized speed and simplicity of instance creation over security, which might be good for getting customers in, but not as good in the long run.

Monitoring: Basic monitoring in both RDS and Cloud SQL is enabled by default. RDS instance creation provides the option to choose enhanced monitoring at the time of creation, whereas Cloud SQL does not.



Logging: Logging in Cloud SQL seems to be better than in RDS. There are options for extracting log messages based on logging levels. As far as I know, there are no such options in RDS.




Memory: Google's memory options only span from 3.75 to 6.5GB, while RDS's span from 1 to 244GB. That is a huge difference between the two offerings, but I suppose that, being in beta, Cloud SQL's actual offerings may change over time.

Storage: This ranges from 10 to 10,230GB on Cloud SQL and from 5 to 6,144GB on RDS. Unlike memory, Cloud SQL is the one with more options when it comes to storage.

Customizable Configuration: Not too impressed with Google's Cloud SQL Flags. It is very limited at the moment. RDS itself only makes a portion of the configuration parameters customizable, but compared to Google's flags it is a lot better. Currently, only the autovacuum settings and default_statistics_target appear to be customizable.

Pricing: Prices for Google Cloud SQL for Postgres and Amazon Postgres RDS look quite comparable at first glance, but on a closer look I found that Google's pricing is a bit higher than Amazon's. Whether the cost difference is worth it will be clearer in the next few months when the final service is released and more details about performance come out.

At first glance, I am impressed with Google Cloud SQL, mainly because of the ease and simplicity of setting up a working cluster, and the monitoring and logging interface. The one thing I did not like, especially when compared to RDS, is the lack of advanced network settings such as setting up a VPC and security groups in general. These may be accessible elsewhere, but most beginners creating an instance may not go through the trouble of searching for them. I think Amazon does the right thing here by forcing the user to review these settings, even if it comes at the cost of simplicity and speed, and the occasional light headache. As for the extremely limited set of customizable flags for server configuration parameters, I am pretty sure it will change and grow as more people start using this service.

Joshua Drake: Community Days: BigApplePy and DockerNYC

From Planet PostgreSQL. Published on Mar 15, 2017.

In 2016 we started working with horizontal communities. We wanted to make sure that PgConf US was an inclusive community that advocated not only for the best database in the world but also for all the external Open Source technologies that make the best database in the world the best platform in the world.

This year we wanted to continue building a stronger community through relationships and our efforts are proving successful. We have joined forces with several horizontal communities to bring original and related content to PgConf US. The first is Big Apple Py. They are working with PgConf US to run a full track of Python content as well as a development workshop. 


The workshop is particularly interesting as it will focus on a code sprint to develop mature and user-friendly Python-based conference software. I wonder if they know we already use a Ruby on Rails solution...

We also have a great workshop being presented by DockerNYC and Jesse White: Automating production ready databases in Docker
PgConf US 2017 is less than two weeks away and we continue to work with communities to bring you even more new content! Stay tuned as we expect to announce more workshops in the next few days. If you haven't purchased your tickets, today would be a good day.

Keith Fiske: Removing A Lot of Old Data (But Keeping Some Recent)

From Planet PostgreSQL. Published on Mar 15, 2017.

I’ve had this situation crop up a few times with clients and, after a discussion on #postgresql on Freenode recently, decided a blog post may be in order. The pitfalls that led me to this solution are useful to cover, and it seems a useful set of steps to have documented and be able to share again later.

There comes a time for most people when you have a table that builds up quite a lot of rows and you then realize you didn’t actually need to keep all of it. But you can’t just run a TRUNCATE because you do want to keep some of the more recent data. If it’s just a few million small rows, it’s not a huge deal to just run a simple DELETE. But when that starts getting into the billions of rows, or your rows are very large (long text, bytea, etc), a simple DELETE may not be realistic.

The first, and arguably easiest, way to deal with this would be to run the DELETE in batches instead of one large transaction. This allows you to add a pause in between the batches to help control I/O. Another side effect of such large delete operations can be an excessively high amount of WAL generation. This not only contributes to I/O, it can also dramatically increase disk space usage. When you’re trying to delete data due to disk space constraints, this can end up making things even worse before they get better. And if you’re deleting quite a lot of data spanning a long period of time from a table that didn’t receive many deletes before, you likely won’t get much disk space back at the end without doing a VACUUM FULL or pg_repack (see my discussion on bloat here). And the biggest issue of all when disk space is critical is that either of those options requires at least as much free disk space as a full copy of the unbloated table would take up. So if disk space is the crucial problem most people are trying to solve by removing old data, how can we easily do this?
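To make the batching idea concrete, here is one hedged sketch of a single batch, borrowing the table name and cutoff timestamp from the example later in this post (rows older than the cutoff are the ones being discarded); it would be repeated, with whatever pause fits your I/O budget, until it deletes zero rows:

DELETE FROM public.notifications
 WHERE id IN (SELECT id
                FROM public.notifications
               WHERE updated_at < '2016-10-18 00:00:00'::timestamp
               LIMIT 20000);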

The first option that’s most easily done on almost any RDBMS is to make a secondary table and have new data copied/redirected there. The I/O and WAL generation of moving the smaller, required amount of data is much less than removing all the old data. Then you can just drop the old table. And most people do this via a trigger-based method: have every write to the old table also write to the new one as well. Then once you’re sure things are working, you can take a brief lock on both the old and new tables and swap their names. PostgreSQL makes this last step much easier, having transactional DDL. I’ll cover the commands to do this later since my final solution is similar to this. The main problems surrounding this come when the table you’re trying to clean up is a very high-traffic table. Doing a trigger like this basically doubles all writes, which could possibly cause some I/O issues. There’s also the matter of making sure you get that trigger function code right, otherwise all writes break. That can be a little nerve-wracking on critical production systems. But there is a way to avoid both the additional I/O of double writes and the headache of writing triggers.
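For reference, the trigger-based double-write described above looks roughly like the following. This is only a sketch: it borrows the table names from the example later in the post, handles only INSERTs for brevity, and a real version would also need UPDATE and DELETE handling:

CREATE OR REPLACE FUNCTION public.mirror_notifications() RETURNS trigger AS $$
BEGIN
    -- write the new row to the replacement table as well
    INSERT INTO public.notifications_new VALUES (NEW.*);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mirror_notifications_trg
    AFTER INSERT ON public.notifications
    FOR EACH ROW EXECUTE PROCEDURE public.mirror_notifications();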

The actual solution we came up with for this involves using the often overlooked feature of table inheritance in PostgreSQL. The steps basically go like this:

  1. Create a new table exactly like the first one
  2. Set ownership/privileges on the new table to match the old table
  3. Have the old table INHERIT the new table
  4. Swap old and new table names
  5. Move the data you want to keep from the old table to the new one
  6. Drop the old table

As soon as you do step 4, all new data is immediately going to the new table. Also, since the old table is a child of the new table, all your old data is still visible from the original table name. Moving the data from the old table to the new one is nearly transparent to any users of the table, the only issue being there may be some slightly slower queries during that transition period since the planner has to account for 2 tables. You can help mitigate this slightly by placing a constraint on the new table (before the swap when it’s empty) that only allows data in the new table’s window. This allows constraint exclusion to possibly ignore the old table while you get data moved. Placing a constraint on the old table probably wouldn’t help much since it would have to lock it for validation. Yes, there will be additional I/O and disk usage from WAL while you move data from the old table to the new, but this can be controlled to a much greater degree since all new data isn’t also being written twice by a trigger. You can move the data in batches with any necessary pauses to control those I/O & WAL spikes.
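As a concrete illustration of that constraint trick, using the table and column names from the example below (the cutoff timestamp is the same one used in the batch-move query at the end of the post), the new, still-empty table could get something like this before the swap:

ALTER TABLE public.notifications_new
    ADD CONSTRAINT notifications_new_window
    CHECK (updated_at >= '2016-10-18 00:00:00'::timestamp);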

The one thing this method does not account for is if there are updates to data in the old table that would technically place it in the new one while you’re in the process of moving the recent data you want to keep. That update will not be moved from the old table to the new one. But in most cases where we’ve done this, those updates weren’t that critical since, if you’d done the original method of just deleting the old data, you would’ve been deleting that data that was getting updated anyway. But it’s something to be aware of if you go querying the old table and still see “new” data after you think you’re done migrating. It’s also something to be aware of if your application thinks it updated something when the old table was there and now it’s suddenly gone. If that’s the case, you may need to stick with the trigger method above.

An example of commands to do this is below. The LIKE clause to the CREATE TABLE statement is quite useful since this can automatically include all indexes, constraints, defaults, comments, and storage options. It does NOT include ownership or privileges however, so the one critical step in this piece is definitely #2 above. You can easily see a table’s privileges with the \dp option in psql. Also, explicitly obtaining the exclusive lock on both tables before doing the name switch ensures nothing weird happens during whatever brief moment could exist between the switch.

CREATE TABLE public.notifications_new (LIKE public.notifications INCLUDING ALL);

ALTER TABLE public.notifications_new OWNER TO sysadmin;

GRANT select ON public.notifications_new TO read_only;
GRANT select, insert, update, delete ON public.notifications_new TO app_user;
GRANT all ON public.notifications_new TO admin;

ALTER TABLE public.notifications INHERIT public.notifications_new;

BEGIN;
LOCK TABLE public.notifications IN ACCESS EXCLUSIVE MODE;
LOCK TABLE public.notifications_new IN ACCESS EXCLUSIVE MODE;
ALTER TABLE public.notifications RENAME TO notifications_old;
ALTER TABLE public.notifications_new RENAME TO notifications;

COMMIT;  (or ROLLBACK; if there's a problem)

Once all these steps are done, you can then begin the process of moving your more recent data out of the old table and into the new via whichever method works best for you. One easy method to batch this is a CTE query that does the DELETE/INSERT with a SELECT in a single query to limit the rows moved.

WITH row_batch AS (
    SELECT id FROM public.notifications_old WHERE updated_at >= '2016-10-18 00:00:00'::timestamp LIMIT 20000 ),
delete_rows AS (
    DELETE FROM public.notifications_old o USING row_batch b WHERE b.id = o.id RETURNING o.id, account_id, created_at, updated_at, resource_id, notifier_id, notifier_type)
INSERT INTO public.notifications SELECT * FROM delete_rows;

And once that’s done, you can then DROP the old table, instantly recovering all that disk space with minimal WAL traffic and zero bloat aftermath!

Bruce Momjian: Column Storage Internals

From Planet PostgreSQL. Published on Mar 15, 2017.

Postgres uses native CPU alignment to store values in a row. This allows blocks to be copied unchanged from disk into shared buffers and accessed as local variables, as outlined in this presentation.

This can be illustrated by pg_column_size(). First, an empty row size:

SELECT pg_column_size(row());
 pg_column_size
----------------
             24
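To see the alignment part in action, here is a small follow-up experiment; the only difference between the two rows is column order, yet on a typical 64-bit build the first form comes out larger, because the int4 must start on a 4-byte boundary and therefore gets padding inserted after the int2:

SELECT pg_column_size(row(1::int2, 2::int4));  -- int2, alignment padding, then int4
SELECT pg_column_size(row(2::int4, 1::int2));  -- int4 then int2, no padding needed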

Continue Reading »

Hubert 'depesz' Lubaczewski: Waiting for PostgreSQL 10 – hash indexing vs. WAL

From Planet PostgreSQL. Published on Mar 15, 2017.

For a long time hash indexes were not crash safe, and couldn't be used on a replication slave, because they skipped WAL. Now, thanks to these two commits, it has changed: On 14th of March 2017, Robert Haas committed patch: hash: Add write-ahead logging support. The warning about hash indexes not being write-ahead logged and their […]

How to Create Django Admin List Actions

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Mar 14, 2017.

Django Admin list actions are meant to be used to perform operations in bulk. All Django Admin list views already come with a default action “Delete selected <ModelName>s”. In this short tutorial I will guide you through the steps to create your own list actions.


Creating the Action Function

Each action in the list is a regular Python function that takes three parameters: the current ModelAdmin, an HttpRequest object (just like a view function), and a QuerySet, which is the list of selected model instances.

Those Action Functions can live inside the admin.py module of your app. But if they start to get really big, you can define them outside the admin.py.

Following is the skeleton for an Action Function:

def my_admin_action(modeladmin, request, queryset):
    pass  # do something with the queryset

my_admin_action.short_description = 'My admin action'

Simple Example

Consider the following model and model admin:

models.py

from django.db import models

class Book(models.Model):
    HARDCOVER = 1
    PAPERBACK = 2
    EBOOK = 3
    BOOK_TYPES = (
        (HARDCOVER, 'Hardcover'),
        (PAPERBACK, 'Paperback'),
        (EBOOK, 'E-book'),
    )
    title = models.CharField(max_length=50)
    publication_date = models.DateField(null=True)
    author = models.CharField(max_length=30, blank=True)
    price = models.DecimalField(max_digits=5, decimal_places=2)
    pages = models.IntegerField(blank=True, null=True)
    book_type = models.PositiveSmallIntegerField(choices=BOOK_TYPES)

    class Meta:
        verbose_name = 'book'
        verbose_name_plural = 'books'

admin.py

from django.contrib import admin
from .models import Book

class BookAdmin(admin.ModelAdmin):
    list_display = ['title', 'publication_date', 'author', 'price', 'book_type']

admin.site.register(Book, BookAdmin)

Let’s say we want to create a list action to apply a 10% discount to the selected books. It would be as simple as:

admin.py

import decimal
from django.contrib import admin
from .models import Book

def apply_discount(modeladmin, request, queryset):
    for book in queryset:
        book.price = book.price * decimal.Decimal('0.9')
        book.save()
apply_discount.short_description = 'Apply 10%% discount'

class BookAdmin(admin.ModelAdmin):
    list_display = ['title', 'publication_date', 'author', 'price', 'book_type']
    actions = [apply_discount, ]  # <-- Add the list action function here

admin.site.register(Book, BookAdmin)

Don’t forget to add the name of the function to the actions list, and the result will be something like this:

Custom List Action

Tip!

You can optimize the apply_discount function using an F() expression:

from django.db.models import F

def apply_discount(modeladmin, request, queryset):
    queryset.update(price=F('price') * decimal.Decimal('0.9'))

If you want to learn more about F() expressions, I have a post dedicated to that subject: Django Tips #13 Using F() Expressions


Export to CSV Example

You can also use a list action to return an HttpResponse. A simple export to CSV example:

admin.py

import decimal, csv
from django.contrib import admin
from django.http import HttpResponse
from django.db.models import F
from .models import Book

def apply_discount(modeladmin, request, queryset):
    queryset.update(price=F('price') * decimal.Decimal('0.9'))
apply_discount.short_description = 'Apply 10%% discount'

def export_books(modeladmin, request, queryset):
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename="books.csv"'
    writer = csv.writer(response)
    writer.writerow(['Title', 'Publication Date', 'Author', 'Price', 'Pages', 'Book Type'])
    books = queryset.values_list('title', 'publication_date', 'author', 'price', 'pages', 'book_type')
    for book in books:
        writer.writerow(book)
    return response
export_books.short_description = 'Export to csv'

class BookAdmin(admin.ModelAdmin):
    list_display = ['title', 'publication_date', 'author', 'price', 'book_type']
    actions = [apply_discount, export_books, ]

admin.site.register(Book, BookAdmin)

You can read more about data export here in the blog:


List Action as Model Admin Method

An alternative way to implement it is by creating the list action function as a method of the admin class:

class BookAdmin(admin.ModelAdmin):
    list_display = ['title', 'publication_date', 'author', 'price', 'book_type']
    actions = ['apply_discount', export_books]

    def apply_discount(self, request, queryset):
        queryset.update(price=F('price') * decimal.Decimal('0.9'))
    apply_discount.short_description = 'Apply 10%% discount'

Pass the method name as a string to the actions list, and rename the modeladmin argument to self.


Conclusions

There is much more you can do with the admin list actions. Read more in the official documentation.

The example used in this tutorial is available on GitHub: sibtc/django-admin-list-actions

How Do I Start Learning Django

By GoDjango - Django Screencasts from Django community aggregator: Community blog posts. Published on Mar 14, 2017.

If you have done Django for any amount of time you have probably gotten this question, and it can sometimes be hard to answer beyond telling people to start at the main Django site.

In this video I attempt to help by giving people a starting point, and some guidance on which resources to use and how to use them.

Give it a watch and offer any suggestions or tweaks you think should be made.

How Do I Start Learning Django?

Robert Haas: Parallel Query v2

From Planet PostgreSQL. Published on Mar 14, 2017.

A recent Twitter poll asked What is your favorite upcoming feature of PostgreSQL V10?  In this admittedly unscientific survey, "better parallelism" (37%) beat out "logical replication" (32%) and "native partitioning" (31%).  I think it's fruitless to argue about which of those features is actually most important; the real point is that all of those are amazing features, and PostgreSQL 10 is on track to be an amazing release.  There are a number of already-committed or likely-to-be-committed features which in any other release would qualify as headline features, but in this release they'll have to fight it out with the ones mentioned above.

Read more »

A Production-ready Dockerfile for Your Python/Django App

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Mar 14, 2017.

Docker has matured a lot since it was released nearly 4 years ago. We’ve been watching it closely at Caktus, and have been thrilled by the adoption -- both by the community and by service providers. As a team of Python and Django developers, we’re always searching for best of breed deployment tools. Docker is a clear fit for packaging the underlying code for many projects, including the Python and Django apps we build at Caktus.

Technical overview

There are many ways to containerize a Python/Django app, no one of which could be considered “the best.” That being said, I think the following approach provides a good balance of simplicity, configurability, and container size. The specific tools I’ll be using are: Docker (of course), Alpine Linux, and uWSGI.

Alpine Linux is a simple, lightweight Linux distribution based on musl libc and Busybox. Its main claim to fame on the container landscape is that it can create a very small (5MB) Docker image. Typically one’s application will be much larger than that after the code and all dependencies have been included, but the container will still be much smaller than if based on a general-purpose Linux distribution.

There are many WSGI servers available for Python, and we use both Gunicorn and uWSGI at Caktus. A couple of the benefits of uWSGI are that (1) it’s almost entirely configurable through environment variables (which fits well with containers), and (2) it includes native HTTP support, which can circumvent the need for a separate HTTP server like Apache or Nginx, provided static files are hosted on a 3rd-party CDN such as Amazon S3.

The Dockerfile

Without further ado, here’s a production-ready Dockerfile you can use as a starting point for your project (it should be added in your top level project directory, or whichever directory contains the Python package(s) provided by your application):

FROM python:3.5-alpine

# Copy in your requirements file
ADD requirements.txt /requirements.txt

# OR, if you’re using a directory for your requirements, copy everything (comment out the above and uncomment this if so):
# ADD requirements /requirements

# Install build deps, then run `pip install`, then remove unneeded build deps all in a single step. Correct the path to your production requirements file, if needed.
RUN set -ex \
    && apk add --no-cache --virtual .build-deps \
            gcc \
            make \
            libc-dev \
            musl-dev \
            linux-headers \
            pcre-dev \
            postgresql-dev \
    && pyvenv /venv \
    && /venv/bin/pip install -U pip \
    && LIBRARY_PATH=/lib:/usr/lib /bin/sh -c "/venv/bin/pip install -r /requirements.txt" \
    && runDeps="$( \
            scanelf --needed --nobanner --recursive /venv \
                    | awk '{ gsub(/,/, "\nso:", $2); print "so:" $2 }' \
                    | sort -u \
                    | xargs -r apk info --installed \
                    | sort -u \
    )" \
    && apk add --virtual .python-rundeps $runDeps \
    && apk del .build-deps

# Copy your application code to the container (make sure you create a .dockerignore file if any large files or directories should be excluded)
RUN mkdir /code/
WORKDIR /code/
ADD . /code/

# uWSGI will listen on this port
EXPOSE 8000

# Add any custom, static environment variables needed by Django or your settings file here:
ENV DJANGO_SETTINGS_MODULE=my_project.settings.deploy

# uWSGI configuration (customize as needed):
ENV UWSGI_VIRTUALENV=/venv UWSGI_WSGI_FILE=my_project/wsgi.py UWSGI_HTTP=:8000 UWSGI_MASTER=1 UWSGI_WORKERS=2 UWSGI_THREADS=8 UWSGI_UID=1000 UWSGI_GID=2000

# Call collectstatic (customize the following line with the minimal environment variables needed for manage.py to run):
RUN DATABASE_URL=none /venv/bin/python manage.py collectstatic --noinput

# Start uWSGI
CMD ["/venv/bin/uwsgi", "--http-auto-chunked", "--http-keepalive"]

We extend from the Alpine flavor of the official Docker image for Python 3.5, copy the folder containing our requirements files to the container, and then, in a single line, (a) install the OS dependencies needed, (b) pip install the requirements themselves (edit this line to match the location of your requirements file, if needed), (c) scan our virtual environment for any shared libraries linked to by the requirements we installed, and (d) remove the C compiler and any other OS packages no longer needed, except those identified in step (c) (this approach, using scanelf, is borrowed from the underlying 3.5-alpine Dockerfile). It’s important to keep this all on one line so that Docker will cache the entire operation as a single layer.

You’ll notice I’ve only included a minimal set of OS dependencies here. If this is an established production app, you’ll most likely need to visit https://pkgs.alpinelinux.org/packages, search for the Alpine Linux package names of the OS dependencies you need, including the -dev supplemental packages as needed, and add them to the list above.

Next, we copy our application code to the image, set some default environment variables, and run collectstatic. Be sure to change the values for DJANGO_SETTINGS_MODULE and UWSGI_WSGI_FILE to the correct paths for your application (note that the former requires a Python package path, while the latter requires a file system path). In the event you’re not serving static media directly from the container (e.g., with Whitenoise), the collectstatic command can also be removed.

Finally, the --http-auto-chunked and --http-keepalive options to uWSGI are needed in the event the container will be hosted behind an Amazon Elastic Load Balancer (ELB), because Django doesn’t set a valid Content-Length header by default, unless the ConditionalGetMiddleware is enabled. See the note at the end of the uWSGI documentation on HTTP support for further detail.

Building and testing the container

Now that you have the essentials in place, you can build your Docker image locally as follows:

docker build -t my-app .

This will go through all the commands in your Dockerfile, and if successful, store an image with your local Docker server that you could then run:

docker run -e DATABASE_URL=none -t my-app

This command is merely a smoke test to make sure uWSGI runs, and won’t connect to a database or any other external services.

Running commands during container start-up

As an optional final step, I recommend creating an ENTRYPOINT script to run commands as needed during container start-up. This will let us accomplish any number of things, such as making sure Postgres is available or running migrate or collectstatic during container start-up. Save the following to a file named docker-entrypoint.sh in the same directory as your Dockerfile:

#!/bin/sh
set -e

until psql $DATABASE_URL -c '\l'; do
  >&2 echo "Postgres is unavailable - sleeping"
  sleep 1
done

>&2 echo "Postgres is up - continuing"

if [ "x$DJANGO_MANAGEPY_MIGRATE" = 'xon' ]; then
    /venv/bin/python manage.py migrate --noinput
fi

if [ "x$DJANGO_MANAGEPY_COLLECTSTATIC" = 'xon' ]; then
    /venv/bin/python manage.py collectstatic --noinput
fi

exec "$@"

Next, add the following line to your Dockerfile, just above the CMD statement:

ENTRYPOINT ["/code/docker-entrypoint.sh"]

This will (a) make sure a database is available (usually only needed when used with Docker Compose), (b) run outstanding migrations, if any, if the DJANGO_MANAGEPY_MIGRATE is set to on in your environment, and (c) run collectstatic if DJANGO_MANAGEPY_COLLECTSTATIC is set to on in your environment. Even if you add this entrypoint script as-is, you could still choose to run migrate or collectstatic in separate steps in your deployment before releasing the new container. The only reason you might not want to do this is if your application is highly sensitive to container start-up time, or if you want to avoid any database calls as the container starts up (e.g., for local testing). If you do rely on these commands being run during container start-up, be sure to set the relevant variables in your container’s environment.

Creating a production-like environment locally with Docker Compose

To run a complete copy of production services locally, you can use Docker Compose. The following docker-compose.yml will create a barebones, ephemeral, AWS-like container environment with Postgres and Redis for testing your production environment locally.

This is intended for local testing of your production environment only, and will not save data from stateful services like Postgres upon container shutdown.

version: '2'

services:
  db:
    environment:
      POSTGRES_DB: app_db
      POSTGRES_USER: app_user
      POSTGRES_PASSWORD: changeme
    restart: always
    image: postgres:9.6
    expose:
      - "5432"
  redis:
    restart: always
    image: redis:3.0
    expose:
      - "6379"
  app:
    environment:
      DATABASE_URL: postgres://app_user:changeme@db/app_db
      REDIS_URL: redis://redis
      DJANGO_MANAGEPY_MIGRATE: "on"  # quoted so YAML treats it as a string rather than a boolean
    build:
      context: .
      dockerfile: ./Dockerfile
    links:
      - db:db
      - redis:redis
    ports:
      - "8000:8000"
    depends_on:
      - db
      - redis

Copy this into a file named docker-compose.yml in the same directory as your Dockerfile, and then run:

docker-compose up --build -d

This downloads (or builds) and starts the three containers listed above. You can view output from the containers by running:

docker-compose logs

If all services launched successfully, you should now be able to access your application at http://localhost:8000/ in a web browser.

Extra: Blocking Invalid HTTP_HOST header errors with uWSGI

To avoid Django’s Invalid HTTP_HOST header errors (and prevent any such spurious requests from taking up any more CPU cycles than absolutely necessary), you can also configure uWSGI to return an HTTP 400 response immediately without ever invoking your application code. This can be accomplished by adding a command line option to the uWSGI command in your Dockerfile, e.g., --route-host='^(?!www.myapp.com$) break:400' (note, the single quotes are required here to prevent the shell from attempting to interpret the regular expression). If preferred (for example, in the event you use a different domain for staging and production), you can accomplish the same end by setting an environment variable via your hosting platform: UWSGI_ROUTE_HOST='^(?!www.myapp.com$) break:400'.

That concludes this high-level introduction to containerizing your Python/Django app for hosting on AWS Elastic Beanstalk (EB), Elastic Container Service (ECS), or elsewhere. Each application and Dockerfile will be slightly different, but I hope this provides a good starting point for your containers. Shameless plug: If you’re looking for a simple (and at least temporarily free) way to test your Docker containers on AWS using an Elastic Beanstalk Multicontainer Docker environment or the Elastic Container Service, check out AWS Container Basics (more on this soon). Good luck!

Partnering with your manager

By Will Larson from Django community aggregator: Community blog posts. Published on Mar 13, 2017.

At my first software job, I chatted 1:1 with my manager twice in two years, including my first year where I was remote and three time zones away. In that situation you become self-managing, or you get let go for inactivity, and somehow I found things to do. (I’d like to prepend useful to things, but as best I can tell, my team’s software was unilaterally thrown away, so that’s hard to justify.)

Something that experience didn’t equip me well to do is partner with my manager. I came away without a mental model for what management does, let alone how you would work with them. It’s been a rocky path for me to figure out a healthier approach, and if you’ve faced a similar struggle, hopefully these ideas will help.

To partner successfully with your manager:

  1. You need her to know a few things about you,
  2. You need to know a few things about her,
  3. Occasionally update the things you know about each other.

Things your manager should know about you:

  • What problems you’re trying to solve.
  • How you’re trying to solve them.
  • That you’re making progress. (Specifically, that you’re not stuck.)
  • What you prefer to work on. (So they can staff you properly.)
  • How busy you are. (So they know if you can take on an opportunity that comes up.)
  • What your professional goals and growth areas are.
  • Where you are between bored and challenged.
  • How you believe you’re being measured. (A rubric, company values, some KPIs, etc.)

Some managers are easier to keep informed than others, and success hinges on finding the communication mechanism that works for them. The approach that I’ve found work well is:

  1. Maintain a document with this information, that you keep updated and shared with your manager. For some managers, this will be enough! Mission accomplished.
  2. Sprinkle this information into your 1:1s, focusing on information gaps (you’re not seeing support around a growth area, you’re too busy or not busy enough). Success is filling in information gaps, not reciting a mantra.
  3. At some cadence, maybe quarterly, write up a self-reflection which covers each of those aspects. (I’ve been experimenting with a “career narrative” format that is essentially a stack of quarterly self-reflections.) Share that with your manager, and maybe your peers too!

A few managers seemingly just don’t care, but I’ve always found that those managers do care and are simply too stressed to participate in successful communication. This leads to the other key aspect of managing up: knowing some things about your manager and her needs.

Here are some good things to know:

  • What are her current priorities? Particularly, problems and key initiatives. Often when I get asked this question I can’t answer it directly, because what I’m focused on is people-related, but it’s a warning sign if your manager never answers it (either because they don’t know, or they are always working on people issues).
  • How stressed is she? How busy is she? Does she feel like she has time to grow in her role, or is she grinding?
  • Is there anything you can do to help? This is particularly valuable for managers who don’t have strong delegation instincts.
  • What is her management’s priority for her?
  • What is she trying to improve about herself, and what are her goals? This is particularly valuable to know if she appears stuck, because you may be able to help unstick her. (Especially in terms of redefining impact in terms of work your team can accomplish versus growing team size, which is a frequent source of stickiness!)

It’s relatively uncommon for managers to be unwilling to answer these kinds of questions (either they’re open and glad to share, or are willing to speak about themselves), but it is fairly common for them to not know the answers. In those cases, each of these can be a pretty expansive topic for a 1:1.

How do you partner with your manager?

How to Create Infinite Scroll With Django

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Mar 13, 2017.

In this tutorial I will show you how to implement a very simple infinite scrolling with Django. Basically we will take advantage of Django’s pagination API and a jQuery plug-in. You will find examples using both function-based views and class-based views.


Dependencies

We don’t need anything other than Django installed in the back-end. For this example you will need jQuery and Waypoints.

After downloading the dependencies, include the following scripts in your template:

base.html

<script src="{% static 'js/jquery-3.1.1.min.js' %}"></script>
<script src="{% static 'js/jquery.waypoints.min.js' %}"></script>
<script src="{% static 'js/infinite.min.js' %}"></script>

Basic Example

This example uses a function-based view, and I’m simply paginating a list of numbers, which I generate on the fly.

urls.py

from django.conf.urls import url
from mysite.blog import views

urlpatterns = [
    url(r'^$', views.home, name='home'),
]

views.py

from django.core.paginator import Paginator, EmptyPage, PageNotAnInteger
from django.shortcuts import render

def home(request):
    numbers_list = range(1, 1000)
    page = request.GET.get('page', 1)
    paginator = Paginator(numbers_list, 20)
    try:
        numbers = paginator.page(page)
    except PageNotAnInteger:
        numbers = paginator.page(1)
    except EmptyPage:
        numbers = paginator.page(paginator.num_pages)
    return render(request, 'blog/home.html', {'numbers': numbers})

Here in this view, numbers_list represents a list of 999 numbers (1 through 999), which I’m breaking down into blocks of 20 numbers per page. After paginating it, I return the numbers object to the template, which is a block of 20 numbers.

I won’t get into details about Django pagination because I have an article talking exclusively about it, so if you want to learn more, check this post: How to Paginate with Django.

Now here is where the magic happens:

blog/home.html

{% extends 'base.html' %}

{% block content %}
  <div class="infinite-container">
    {% for number in numbers %}
      <div class="infinite-item">{{ number }}</div>
    {% endfor %}
  </div>

  {% if numbers.has_next %}
    <a class="infinite-more-link" href="?page={{ numbers.next_page_number }}">More</a>
  {% endif %}

  <script>
    var infinite = new Waypoint.Infinite({
      element: $('.infinite-container')[0]
    });
  </script>
{% endblock %}

The element identified by the class .infinite-container is the container where the plug-in will load more items. This action will occur every time the element .infinite-more-link appears on the screen. When that happens, it will trigger an asynchronous request (AJAX) loading the content from the URL specified in the href of the .infinite-more-link.

The page will keep loading new items until the .infinite-more-link is no longer shown. That will happen when there are no more items to be loaded, that is, when the paginator reaches the last page.

That’s why we have the conditional if numbers.has_next.


Adding Loading State

We can improve the basic example by adding a loading state, while the page is grabbing more data.

blog/home.html

{% extends 'base.html' %}

{% block content %}
  <div class="infinite-container">
    {% for number in numbers %}
      <div class="infinite-item">{{ number }}</div>
    {% endfor %}
  </div>

  {% if numbers.has_next %}
    <a class="infinite-more-link" href="?page={{ numbers.next_page_number }}">More</a>
  {% endif %}

  <div class="loading" style="display: none;">
    Loading...
  </div>

  <script>
    var infinite = new Waypoint.Infinite({
      element: $('.infinite-container')[0],
      onBeforePageLoad: function () {
        $('.loading').show();
      },
      onAfterPageLoad: function ($items) {
        $('.loading').hide();
      }
    });
  </script>
{% endblock %}

Our loading block will be shown while the AJAX is running in the background:

Infinite Scroll Loading


Class-Based View Example With Models

The idea remains the same. Actually, the example will become even cleaner. Consider the following model:

models.py

from django.db import models

class Article(models.Model):
    title = models.CharField(max_length=30)
    body = models.TextField(max_length=2000)
    date = models.DateTimeField()
    author = models.CharField(max_length=30)

views.py

from django.views.generic.list import ListView
from .models import Article

class ArticlesView(ListView):
    model = Article
    paginate_by = 5
    context_object_name = 'articles'
    template_name = 'blog/articles.html'

blog/articles.html

{% extends 'base.html' %}

{% block content %}
  <div class="infinite-container">
    {% for article in articles %}
      <div class="infinite-item">
        <h3>{{ article.title }}</h3>
        <p>
          <small>{{ article.author }} / {{ article.date }}</small>
        </p>
        <p>{{ article.body|truncatechars:100 }}</p>
      </div>
    {% endfor %}
  </div>

  <div class="loading" style="display: none;">
    Loading...
  </div>

  {% if page_obj.has_next %}
    <a class="infinite-more-link" href="?page={{ page_obj.next_page_number }}">More</a>
  {% endif %}

  <script>
    var infinite = new Waypoint.Infinite({
      element: $('.infinite-container')[0],
      onBeforePageLoad: function () {
        $('.loading').show();
      },
      onAfterPageLoad: function ($items) {
        $('.loading').hide();
      }
    });
  </script>
{% endblock %}

Just a small difference here: since I’m using a ListView, the page object is available as page_obj in the template.


Conclusions

That’s pretty much it! If you want to explore the example, the code is available on GitHub: github.com/sibtc/simple-infinite-scroll

Django 101 – App Oluşturma ve Model’a Giriş #3

By Python Turkiye from Django community aggregator: Community blog posts. Published on Mar 11, 2017.

Video contents: – Standard settings adjustments for Django. – Creating a Django app. – The migrate and makemigrations concepts. – The default concept for classes, and resolving the problems you will run into when refactoring. – The settings needed for Django images. – Accessing the Django admin.

The post Django 101 – App Oluşturma ve Model’a Giriş #3 first appeared on Python Türkiye.

Django Likes

By Coding for Entrepreneurs | A Django Programming Class for the Non-Technical Founder from Django community aggregator: Community blog posts. Published on Mar 09, 2017.

# Django Likes Learn how to c...

Why care about deployment?

By Django deployment from Django community aggregator: Community blog posts. Published on Mar 09, 2017.

Do you know that in Heroku you can enter a single command and have your Django project deployed for you? If there are services that have commoditized deployment, why should you care about nginx, Apache, PostgreSQL, Gunicorn, systemd, and all that, and don’t you spend your time and grey matter on something more useful?

I asked the question both to myself and to my readers, and here are my conclusions, in no particular order:

  • Some customers have a policy to use their own infrastructure.
  • Sometimes there is a preference or policy for deploying on infrastructure hosted in a specific country; for example, I have deployed a Greek national project which had to be hosted in Greece.
  • Sometimes it’s cheaper to deploy yourself than to deploy on Heroku, even if you take into account the time required to learn/do it.
  • If you use a deployment service like Heroku you have vendor lock-in.
  • If you deploy yourself you can be more flexible if your needs change.
  • Students learn about compilers and build primitive processors not because they will work on compilers or on hardware, but because the increased understanding of their tools helps them be better engineers. Likewise, you need to understand Django deployment in order to be a better developer. In fact, that’s exactly how the term “DevOps” came to be.
  • You can get away with not knowing how a compiler works or how your hardware works because these things are very mature and you can treat them as black boxes. In deployment, this is not true. At some point the abstraction will likely leak and you’ll need to know what’s going on behind the scenes.

Many thanks to my readers.


Note: The phrase “at some point the abstraction will likely leak and you’ll need to know” is from an article by Martin Fowler where he describes what function-as-a-service means (which has also become known by the unfortunate name “serverless”—of course FaaS also isn’t very accurate, but at least it’s not misleading). I couldn’t have phrased it better.

 

The post Why care about deployment? appeared first on Django deployment.

Release 0.11.

By Lightning Fast Shop from Django community aggregator: Community blog posts. Published on Mar 09, 2017.

We just released LFS 0.11. 

Changes

  • Adds Django 1.10 support
  • Excludes variants from sitemap
  • Add appending slash to all urls
  • Use F() expressions to update stock amount (#203)
  • Use F() expression to increase use_amount for vouchers (#202)
  • Removes django-pagination (use Django's default one instead)
  • remove STATIC_URL (use static_url tag instead)
  • fix saving properties and variant prices [pigletto]

Information

You can find more information and help at the following locations:

A Quick and Easy Way to Create a Django Application: BeDjango Starter

By BeDjango from Django community aggregator: Community blog posts. Published on Mar 08, 2017.

BeDjango starter

In the following post we will talk about the starter we have released at BeDjango: the decisions we have made, the features we have decided to add, and some aspects that we would like to emphasize.

Why did we decide to make a starter?

Although starting a project in Django is a fairly quick and easy process, after developing several applications we realized that we used to invest a certain amount of time in each project to implement similar features (registration, basic views, validations, configuration of libraries ...). Therefore, we decided to invest that time in making a starter that we could easily reuse ... but things got out of hand and we created a fairly complete Django base application.

Architecture

We decided to modify the initial structure of folders and files that Django gives us; many of these files are already being used by the community:

 

We would like to point out that we have added a 'logic.py' file where we place all of the business logic; in this way we gain modularity and improve the way we perform our tests, running them directly against code and not only through the views. We have also been forced to create managers for all the objects of our models.

Main Features

  • Code coverage over 95% (98% at the time of release)

  • Custom theme based on bootstrap, with custom and responsive components.

  • User control system (1-step or 2-step registration, login, password recovery ..)

  • Modularity of the applications (users, base)

  • Application ready for internationalization

  • Python2/3 compatibility

Packages included

After reviewing several packages that we often use, we decided to include some already preconfigured ones:

  1. Django debug toolbar
  2. Django cachalot
  3. Material design for admin

Utils

We have also decided to put a series of utils that we needed or that we usually use in most projects:

  • Breadcrumbs

  • Password validation via ajax

  • Decorators: group required, ajax required (to protect views), anonymous required (to prevent logged-in users from visiting some views)

  • Regex for password validation

Requirements

Here are the system requirements needed for the starter to work on an Ubuntu 16.04 distribution:

  • Git:

sudo apt install git
  • Pip:

sudo apt install python-pip
  • Virtualenv:

sudo pip install virtualenv
  • Python3-dev:

sudo apt install python3-dev

How to use it

# Create virtualenv
virtualenv -p python3 venv

# Activate virtualenv and install Django
source venv/bin/activate
pip install django==1.10

# Use django-admin to create the app using the starter
django-admin.py startproject --template=https://github.com/BeDjango/bedjango-starter/archive/master.zip --extension=py,rst,yml {{nameofproject}}

# Install requirements/dev-requirements
cd nameofproject/nameofproject
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Migrate database 
python3 manage.py migrate

# Compile translations
python3 manage.py compilemessages

# To run our project:
python3 manage.py runserver

 

Contribution

With this release we intend to give back some of the knowledge and tools the community has given us, with some processing from our side. We don’t expect to create a project in itself; rather, we are opening a GitHub repository with code that can be used and personalized by everyone, and moreover, we will be grateful to receive feedback as comments (both pros and cons), proposals, or contributions. For that reason, this project is licensed under the MIT License.

I hope you try it and like it :)!