Open Source Posts

Oleg Bartunov: The sizes of FTS indices

From Planet PostgreSQL. Published on Aug 27, 2016.

This is a technical post to record my findings about the RUM index, which we introduced at PGCon 2016 in Ottawa. The original idea of the RUM index appeared in 2012, when Alexander Korotkov and I presented "Our milliseconds FTS" in Prague. This year, inspired by Alexander's work on index access method extensibility, committed to 9.6, I asked my colleagues at Postgres Professional, Teodor Sigaev and Artur Zakirov, to implement it as a full-fledged access method packaged as an extension. The name RUM was chosen to continue the GIN...VODKA range, though some people suggested it stands for "Really Useful Method" or "Russian Useful Method" :)

The public RUM repository is available at https://github.com/postgrespro/rum. We know of several bugs in RUM and hope to fix them soon.
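Installation follows the usual PGXS pattern for out-of-tree extensions (an assumption on my part; check the repository's README for the authoritative steps):

-- shell:
--   git clone https://github.com/postgrespro/rum
--   cd rum && make USE_PGXS=1 && make USE_PGXS=1 install
-- then, in the target database:
CREATE EXTENSION rum;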



The table pglist contains about a million posts from the postgres mailing lists; fts is the full-text representation of a post - a tsvector built from subject, body_plain and author.
I created several fts indices:

1. GIN index
2. RUM index (GIN + positions stored as addinfo)
3. RUM index with sent time as addinfo
4. as 3, but with postings sorted by sent time

A RUM index is bigger than GIN, since it contains more information, which allows it to speed up certain types of queries, namely:

2. ranked fts - the index returns results ordered by position-based relevance.
3. fts results ordered by sent time (latest matches first) - the index returns results in sent-time order, taken from the postings
4. the same as 3, but even faster, since the postings are already stored in sent-time order

The price for the speedup is the size of the index!
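For reference, the queries corresponding to cases 2-4 look roughly like this (a sketch assuming the <=> ordering operator shipped with the rum extension and the english text search configuration; check the extension documentation for the exact syntax):

-- case 2: matches ordered by relevance
SELECT id, subject
  FROM pglist
 WHERE fts @@ to_tsquery('english', 'tuple & header')
 ORDER BY fts <=> to_tsquery('english', 'tuple & header')
 LIMIT 10;

-- cases 3 and 4: latest matches first, using the sent timestamp stored as addinfo
SELECT id, subject, sent
  FROM pglist
 WHERE fts @@ to_tsquery('english', 'tuple & header')
 ORDER BY sent <=> '2016-08-27 00:00:00'::timestamp
 LIMIT 10;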


\d pglist
                Table "public.pglist"
   Column   |            Type             | Modifiers
------------+-----------------------------+-----------
 id         | integer                     |
 sent       | timestamp without time zone |
 subject    | text                        |
 author     | text                        |
 body_plain | text                        |
 fts        | tsvector                    |
Indexes:
    "pglist_fts_gin_idx4" gin (fts)
    "pglist_fts_rum_idx4" rum (fts rum_tsvector_ops)
    "pglist_fts_ts_order_rum_idx4" rum (fts rum_tsvector_timestamp_ops, sent) WITH (attach=sent, "to"=fts, order_by_attach=t)
    "pglist_fts_ts_rum_idx4" rum (fts rum_tsvector_timestamp_ops, sent) WITH (attach=sent, "to"=fts)
    "pglist_sent_idx" btree (sent)

select pg_size_pretty(sum(pg_column_size(fts))) as fts,
       pg_size_pretty(pg_table_size('pglist_fts_gin_idx4')) as gin,
       pg_size_pretty(pg_table_size('pglist_fts_rum_idx4')) as rum,
       pg_size_pretty(pg_table_size('pglist_fts_ts_rum_idx4')) as rum_ts,
       pg_size_pretty(pg_table_size('pglist_fts_ts_order_rum_idx4')) as rum_ts_order
  from pglist;
   fts   |  gin   |  rum   | rum_ts  | rum_ts_order
---------+--------+--------+---------+--------------
 1302 MB | 535 MB | 980 MB | 1531 MB | 1921 MB
(1 row)


The RUM access method is implemented as an extension using the extensible access method infrastructure in 9.6. Generic WAL, in its current state, cannot recognize data that has merely shifted within a page, so RUM can generate a lot of WAL traffic. For example, for the RUM index with sent time, ordered by sent time, CREATE INDEX generates about 186 GB of WAL (compare that to the 1921 MB index size), since postings are no longer arranged by their location on the page and almost every posting generates a new WAL record.
I calculated WAL traffic like this:

SELECT pg_current_xlog_location();
11/B54E6098
CREATE INDEX ....
SELECT pg_current_xlog_location();
15/6F2F8F18

SELECT pg_size_pretty(pg_xlog_location_diff('15/6F2F8F18','11/B54E6098'));
pg_size_pretty
----------------
15 GB


The number of postings could be obtained by this query:

select sum(length(fts)) from pglist;
   sum
----------
 84472729
(1 row)


So, if we assume an average page fullness of 75% and that each posting generates about half a page (4096 bytes) of traffic, we can estimate the WAL traffic as
84472729 * 4096 / 1024 / 1024 / 1024 * 0.75 ≈ 242 GB, which is close to the measured 186 GB :)

With some optimization we were able to reduce the traffic to 15 GB, which is still high compared to the index size (1.9 GB). Finally, we had the idea of writing the WAL records only after the index is built, which brought the WAL traffic down to 1.5 GB - even smaller than the index itself.

Note that this applies only to CREATE INDEX; insertions into a RUM index will still generate heavy WAL traffic unless generic WAL becomes smarter.

I think the same optimization could be applied to GiST, GIN and SP-GiST. Saving WAL traffic had little impact on index creation time on my SSD-based system, but I suspect the difference would be noticeable on real hard drives.

Pavel Stehule: Orafce package for PostgreSQL 9.3, 9.4, 9.5, 9.6 for WIN32, WIN64 is available

From Planet PostgreSQL. Published on Aug 26, 2016.

Please, download from link.

Pavel Stehule: plpgsql_check for PostgreSQL 9.4, 9.5, 9.6 compiled for WIN32, WIN64 is available

From Planet PostgreSQL. Published on Aug 25, 2016.

Please download from link.

Ernst-Georg Schmid: Timeseries: How long can the elephant remember?

From Planet PostgreSQL. Published on Aug 25, 2016.

Frankly, I don't know from experience where the practical limit for the number of rows in a single PostgreSQL table lies, but the interwebs seem to agree on 10^9 for narrow tables.

After a lively discussion with a NoSQL aficionado yesterday about the (in)ability to effectively store timeseries data in an RDBMS, I made a quick calculation.

Timeseries data is usually a triple of the form (key, timestamp, value), so it can be stored in a pretty narrow table; hence I stick to the 10^9 row limit.

If we get a data point every second, we can store 10^9 seconds worth of data. 10^9 seconds is 16666666.6667 minutes, which is 277777.777778 hours, which is 11574.0740741 days, which is good for about 31 years of recording.
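Postgres can do that arithmetic for you; a quick sanity check of the numbers above (the output should look roughly like the comment, depending on your interval output settings):

SELECT justify_hours(10^9 * interval '1 second') AS recording_span,
       round((10^9 / 86400 / 365.25)::numeric, 1) AS years;
--    recording_span    | years
-- ---------------------+-------
--  11574 days 01:46:40 |  31.7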

Every second of 31 years. Per table.

Simon Riggs: When to use Postgres-BDR or Postgres-XL?

From Planet PostgreSQL. Published on Aug 25, 2016.

The right answer is of course “Use PostgreSQL”. It’s the main distro and we want you to use that as often as possible.

The Postgres-BDR and Postgres-XL projects are also fully open source projects, using the same copyright and licence as the main PostgreSQL project. So if you’re using PostgreSQL, they are also options to consider if you want extended functionality.

What does Postgres-BDR do? BDR allows you to have a widely distributed cluster of nodes that give you multiple full copies of a database. So you can have copies of the database in London, New York, San Francisco, Rome, Dubai, New Delhi, Hong Kong, Tokyo, Sydney, São Paulo, Buenos Aires and Johannesburg. And all your customers in those places get fast access for read AND write to the database. As long as your application doesn’t update the same data in two different places at once, you’re fine. If it does, then you probably need to rethink what you’re trying to do, but we do provide a conflict resolution scheme and options for logging.

What does Postgres-XL do? XL allows you to scale a database up from one node to many nodes. It's a shared-nothing cluster, with all the nodes in one place, though with HA and DR options. So yes, it's scalable, just like MongoDB, Cassandra etc., though XL runs both schemaless JSON and structured relational data. And most importantly, Postgres-XL does both Big Data analytics (SQL, joins etc.) and fine-grained OLTP with MVCC in the same platform. So you can load your data and then immediately run queries against it, with no need to export it to another kind of database.

So both Postgres-BDR and Postgres-XL are very significant enhancements to PostgreSQL core.

Why do we have two? Why Postgres-BDR and Postgres-XL? Each variant has its own use case, so the separation isn’t really an issue. Eventually we’ll be able to take advantage of all that functionality in one system, but that will take a few years while we get that to work. And we are working as actively as we can on returning that all into the main project. pglogical and the push to get Logical Replication into 10.x are just part of that.

The Postgres-BDR and Postgres-XL projects use the same copyright and licence as the main PostgreSQL project because the intention is to feed back from those projects into the main distro. If it doesn’t have the right licence then it’s not ever going to be fed back into PostgreSQL core – it is not compatible and won’t ever be accepted.

Paul Ramsey: PgSQL Indexes and "LIKE"

From Planet PostgreSQL. Published on Aug 25, 2016.

Do you write queries like this:

SELECT * FROM users 
WHERE name LIKE 'G%'

Are your queries unexpectedly slow in PostgreSQL? Is the index not doing what you expect? Surprise! You’ve just discovered a PostgreSQL quirk.

TL;DR: If you are running a locale other than “C” (show LC_COLLATE to check) you need to create a special index to support pattern searching with the LIKE operator: CREATE INDEX myindex ON mytable (mytextcolumn text_pattern_ops). Note the specification of the text_pattern_ops operator class after the column name.

As a beginner SQL student, you might have asked “will the index make my ‘like’ query fast” and been answered “as long as the wildcard character is at the end of the string, it will.”


That statement is only true in general if your database is initialized using the “C” locale (the North America/English-friendly UNIX default). Running with “C” used to be extremely common, but is less and less so, as modern operating systems automagically choose appropriate regional locales to provide appropriate time and formatting for end users.

For example, I run Mac OSX and I live in British Columbia, an English-speaking chunk of North America. I could use “C” just fine, but when I check my database locale (via my collation), I see this:

pramsey=# show LC_COLLATE;
 lc_collate  
-------------
 en_CA.UTF-8
(1 row)

It’s a good choice, it’s where I live, it supports lots of characters via UTF-8. However, it’s not “C”, so there are some quirks.

I have a big table of data linked to postal codes, this is what the table looks like:

              Table "gis.postal_segments"
      Column       |     Type     | Modifiers 
-------------------+--------------+-----------
 postal_code       | text         | not null
 segment           | character(4) | 
Indexes:
    "postal_segments_pkey" PRIMARY KEY, btree (postal_code)

Note the index on the postal code, a standard btree.

I want to search rows based on a postal code prefix string, so I run:

EXPLAIN ANALYZE 
SELECT * FROM postal_segments 
WHERE postal_code LIKE 'V8V1X%';
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Seq Scan on postal_segments  (cost=0.00..2496.85 rows=10 width=68) (actual time=30.320..34.219 rows=4 loops=1)
   Filter: (postal_code ~~ 'V8V1X%'::text)
   Rows Removed by Filter: 100144
 Planning time: 0.250 ms
 Execution time: 34.263 ms
(5 rows)

Ruh roh!

I have an index on the postal code, so why am I getting a sequence scan?!?! Because my index is no good for doing pattern matching in any collation other than “C”. I need a special index for that, which I create like this.

CREATE INDEX postal_segments_text_x 
  ON postal_segments (postal_code text_pattern_ops);

The magic part is at the end, invoking text_pattern_ops as the opclass for this index. Now my query works as expected:

EXPLAIN ANALYZE 
SELECT * FROM postal_segments 
WHERE postal_code LIKE 'V8V1X%';
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using postal_segments_text_x on postal_segments  (cost=0.29..8.31 rows=10 width=68) (actual time=0.067..0.073 rows=4 loops=1)
   Index Cond: ((postal_code ~>=~ 'V8V1X'::text) AND (postal_code ~<~ 'V8V1Y'::text))
   Filter: (postal_code ~~ 'V8V1X%'::text)
 Planning time: 0.532 ms
 Execution time: 0.117 ms
(5 rows)

I have gotten so used to PostgreSQL doing exactly the right thing automatically that it took quite a long time to track down this quirk when I ran into it. I hope this page helps others save some time!

How to Create a One Time Link

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 24, 2016.

Django uses a very interesting approach to generate the password reset tokens. I'm not really a security expert, nor am I very familiar with cryptography algorithms, but it is very safe and reliable.

Before I elaborate a little bit more on the one-time-link generation, I want to discuss Django's PasswordResetTokenGenerator implementation, because what we will be doing is actually extending this particular class to fit our needs.

Generally speaking, Django generates a token without persisting it in the database. Yet it still has the ability to determine whether a given token is valid or not. The token is also only valid for a defined number of days.

The default validity period for the password reset token is 7 days, and it can be changed in settings.py via PASSWORD_RESET_TIMEOUT_DAYS.

The class has two public methods:

  • make_token(user)
  • check_token(user, token)

The make_token method generates a hash value from user-related data that will change after the password reset. Meaning: after the user clicks the link containing the hash and proceeds with the password reset, the link (or rather the hash) will no longer be valid:

def _make_hash_value(self, user, timestamp):
    # Ensure results are consistent across DB backends
    login_timestamp = '' if user.last_login is None else user.last_login.replace(microsecond=0, tzinfo=None)
    return (
        six.text_type(user.pk) + user.password +
        six.text_type(login_timestamp) + six.text_type(timestamp)
    )

And then this value is used to create a hashed token that will be mailed to the user:

    def _make_token_with_timestamp(self, user, timestamp):
        # timestamp is number of days since 2001-1-1.  Converted to
        # base 36, this gives us a 3 digit string until about 2121
        ts_b36 = int_to_base36(timestamp)

        hash = salted_hmac(
            self.key_salt,
            self._make_hash_value(user, timestamp),
        ).hexdigest()[::2]
        return "%s-%s" % (ts_b36, hash)

So, two things: it uses the user.password value (which embeds the salt) and the user.last_login timestamp. Both will change, and the link will no longer be valid. The SECRET_KEY is also used in the salted_hmac function, so unless your SECRET_KEY was compromised, it would be impossible to reproduce the hash value.


Creating your own token

So basically you need a piece of information that will change after the link is used. The simplest approach would be:

from django.contrib.auth.tokens import PasswordResetTokenGenerator
from django.utils import six

class AccountActivationTokenGenerator(PasswordResetTokenGenerator):
    def _make_hash_value(self, user, timestamp):
        return (
            six.text_type(user.pk) + six.text_type(timestamp) +
            six.text_type(user.profile.email_confirmed)
        )

account_activation_token = AccountActivationTokenGenerator()

I'm pretending we have a User model with a Profile model attached through a one-to-one relationship, and that this profile model has a boolean flag named email_confirmed.

In order to use it, we could use the same approach as the password reset:

urls.py

url(r'^activate_account/(?P<uidb64>[0-9A-Za-z_\-]+)/(?P<token>[0-9A-Za-z]{1,13}-[0-9A-Za-z]{1,20})/$',
                views.ActivateAccountView.as_view(), name='activate_account'),

views.py

from django.contrib.auth import login
from django.utils.encoding import force_text
from django.utils.http import urlsafe_base64_decode

class ActivateAccountView(View):
    def get(self, request, uidb64, token):
        try:
            uid = force_text(urlsafe_base64_decode(uidb64))
            user = User.objects.get(pk=uid)
        except (TypeError, ValueError, OverflowError, User.DoesNotExist):
            user = None

        if user is not None and account_activation_token.check_token(user, token):
            user.profile.email_confirmed = True
            user.save()
            login(request, user)
            return redirect('profile')
        else:
            # invalid link
            return render(request, 'registration/invalid.html')

Of course, every case is different. Sometimes it will be far easier to just generate a random token, save it in the database, check it and invalidate it after it is used. But if that's not your case, you can take inspiration from how Django implements the password reset token.

Jim Nasby: One Data Solution to Rule Them All

From Planet PostgreSQL. Published on Aug 24, 2016.

This post is the first in our series on PostgreSQL, a highly customizable and standards compliant open source object-relational database system.

The modern world isn't just a sea of data, it's also awash in data storage options. It used to be that a company just had to choose between one of the “Big 3” (DB2, MS-SQL, or Oracle), or MySQL or PostgreSQL. Today there are a few more relational databases, a bunch of NoSQL databases, as well as specialty databases such as graph databases or time-series databases. Many companies now run multiple different database technologies.

What if there was one solution that satisfied all these needs?

The hidden costs of data

There’s one thing that’s important to understand: data is not easy. If something claims to make it easy, either it’s glossing over details or just ignoring them. A choice that makes one data problem easy makes other problems harder:

  • Lack of schema makes ingestion easy and analysis difficult
  • Strong OLTP performance reduces OLAP performance
  • Consistency, Availability, Partition tolerance: pick 2 (the CAP theorem)

Understanding how your data technologies work under the hood is critical for making informed decisions and designing an appropriate data architecture.

Here’s your magic bullet!

Just kidding. Sort of.

Postgres has a lot of features that most users aren’t aware of. Unfortunately, there’s no make_data_easy = true setting, but there are a lot of features that make various problems easier. In fact, it’s best not to think of Postgres as a database, because it goes far beyond that. Think of it as a data platform.

Features you’ve never heard of

This is certainly not a complete list, but I think it’s some of the more overlooked features.

Arrays and Composite types

Tables are not the only way to store sets of information in Postgres, nor are they the only way to pass information around your systems. Arrays and composite types work together the same way that lists and dictionaries/hashes do in other languages: they allow you to create a single object that represents an arbitrarily complex set of data. That complex object can be passed to and from functions, easily converted to/from JSON (and XML), and even stored as a field in a table.

Can you imagine writing an application in a language that didn’t support lists and dictionaries? Of course not. So why wouldn’t you extend that notion to your database? A quick (still work in progress!) example of this idea is cat_snap (https://github.com/decibel/cat_snap), a tool that defines a composite type that will store the complete contents of the catalog and statistics for an entire Postgres database, as a single object. A single SQL command will output all that information in a format that can be read directly into the composite type. Imagine how easy that makes it to gather statistics on a bunch of databases.

There’s another powerful use for arrays and composites: acting like a column store. Most column stores are very picky about how and when you can write to them. By representing a series of data points as an array instead of row-by-row, you can gain the benefits of a column store with more flexibility in writing data.
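As a small illustration of the idea (all names here are invented for the example):

CREATE TYPE reading AS (taken_at timestamptz, value numeric);

CREATE TABLE sensor_data (
    sensor_id int PRIMARY KEY,
    readings  reading[]   -- a whole series stored in one column
);

INSERT INTO sensor_data
VALUES (1, ARRAY[('2016-08-24 10:00','21.5')::reading,
                 ('2016-08-24 10:01','21.7')::reading]);

-- expand back into rows, or ship the whole object as JSON
SELECT sensor_id, (r).taken_at, (r).value
  FROM sensor_data, unnest(readings) AS r;

SELECT to_json(s) FROM sensor_data s;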

New base types

I’ve never seen another database that makes it as easy to create a brand new data type as Postgres. This makes it easy to represent, store and index information inside Postgres that meshes seamlessly with how that data is handled elsewhere.

Extensions

Similar to packages in other languages, Postgres extensions allow you to create a package of Postgres objects for easy use by others. The Postgres Extension Network is the main repository for extensions, and it currently contains over 200 extensions.

Multiple procedural languages

Support for running many different programming languages inside the database is a killer feature for advanced data science and analytics. It allows you to perform complex pre-processing and filtering of data on the same machine that houses the data. Use Python, R, Lua or any number of other languages.
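A tiny sketch of the idea using PL/Python (this assumes the plpython3u language - or plpythonu on older installations - is available on your server):

CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE FUNCTION py_mean(vals float8[]) RETURNS float8 AS $$
    # ordinary Python, running right next to the data
    return sum(vals) / len(vals) if vals else None
$$ LANGUAGE plpython3u;

SELECT py_mean(ARRAY[1.0, 2.0, 4.0]);  -- 2.333...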

Connect anywhere

Postgres has over 100 foreign data wrappers that allow you to connect to other data sources and interact with them as if they were Postgres tables. Thanks to the Multicorn extension it is very easy to create a new foreign data wrapper.

What many people don't realize is that, thanks to the robust procedural language support, you don't have to use a foreign data wrapper to speak to a remote data source. For example, it would be very difficult to interface with Amazon AWS as a whole via a foreign data wrapper (though S3 is a good fit, and there is a foreign data wrapper for it). But since there is a Python AWS API and you can run Python code inside Postgres, you can interface directly with AWS via Postgres.
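Coming back to the foreign data wrapper route, the built-in postgres_fdw gives a feel for the mechanics (a sketch; the server name, credentials and remote table are placeholders):

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER other_pg
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'other-host.example.com', dbname 'analytics');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER other_pg
    OPTIONS (user 'reporting', password 'secret');

CREATE FOREIGN TABLE remote_events (
    id     bigint,
    ts     timestamptz,
    detail text
) SERVER other_pg OPTIONS (schema_name 'public', table_name 'events');

-- queried like any local table
SELECT count(*) FROM remote_events WHERE ts > now() - interval '1 day';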

Do even more complex things

Postgres has a number of internal “hooks” that allow extensions to directly alter how certain behaviors work in Postgres, without modifying any Postgres source code. One example of that is the citus extension, created by Citus Data. This extension brings horizontal scale to Postgres. 100+ TB databases are no problem with this extension.

Why Postgres?

As you can see, there are many features in Postgres that make complex data challenges easier to handle. More features are added constantly (some in Postgres itself, some as extensions).

It even comes with a world class relational database engine. 

The post One Data Solution to Rule Them All appeared first on Blue Treble.

Django Tips #13 Using F() Expressions

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 23, 2016.

In the Django QuerySet API, F() expressions are used to refer to model field values directly in the database. Let's say you have a Product class with a price field, and you want to increase the price of all products by 20%.

A possible solution would be:

products = Product.objects.all()
for product in products:
    product.price *= 1.2
    product.save()

Instead you could use an F() expression to update it in a single query:

from django.db.models import F

Product.objects.update(price=F('price') * 1.2)

You can also do it for a single object:

product = Product.objects.get(pk=5009)
product.price = F('price') * 1.2
product.save()

But take care with this kind of assignment. The F() object persists after saving the model.

product.price                   # price = Decimal('10.00')
product.price = F('price') + 1
product.save()                  # price = Decimal('11.00')
product.name = 'What the F()'
product.save()                  # price = Decimal('12.00')

So, basically after updating a field like that, product.price will hold an instance of django.db.models.expressions.CombinedExpression, instead of the actual result. If you want to access the result immediately:

product.price = F('price') + 1
product.save()
print(product.price)            # <CombinedExpression: F(price) + Value(1)>
product.refresh_from_db()
print(product.price)            # Decimal('13.00')

You can also use it to annotate data:

from django.db.models import ExpressionWrapper, DecimalField

Product.objects.all().annotate(
    value_in_stock=ExpressionWrapper(
        F('price') * F('stock'), output_field=DecimalField()
    )
)

Since price is a DecimalField and stock is an IntegerField, we need to wrap the expression inside an ExpressionWrapper object.

It can be used to filter data as well:

Product.objects.filter(stock__gte=F('ordered'))

Yann Larrivee: ConFoo Montreal 2017 Calling for Papers

From Planet PostgreSQL. Published on Aug 23, 2016.

ConFoo Montreal: March 8th-10th 2017

Want to get your web development ideas in front of a live audience? The call for papers for the ConFoo Montreal 2017 web developer conference is open! If you have a burning desire to hold forth about PHP, Java, Ruby, Python, or any other web development topics, we want to see your proposals. The window is open only from August 21 to September 20, 2016, so hurry. An added benefit: If your proposal is selected and you live outside of the Montreal area, we will cover your travel and hotel.

You’ll have 45 minutes to wow the crowd, with 35 minutes for your topic and 10 minutes for Q&A. We can’t wait to see your proposals. Knock us out!

ConFoo Montreal will be held on March 8-10, 2017. For those of you who already know about our conference, be aware that this annual tradition will still be running in addition to ConFoo Vancouver. Visit our site to learn more about both events.

gabrielle roth: Easy clock dimension table

From Planet PostgreSQL. Published on Aug 22, 2016.

We’re currently redesigning our data warehouse, and we’re experimenting with a clock (or time, or time-of-day) dimension to represent the time of day, separate from the date dimension we already use. This table should contain a record for each minute of a day, allowing us to easily determine business hours in various timezones, etc etc. I’m […]
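A minimal sketch of such a minute-level clock dimension (column names and the business-hours rule are invented for illustration):

CREATE TABLE clock_dim AS
SELECT to_char(t, 'HH24:MI')                   AS minute_label,
       extract(hour   from t)::int             AS hour_24,
       extract(minute from t)::int             AS minute_of_hour,
       (extract(hour from t) BETWEEN 9 AND 16) AS business_hours_local
  FROM generate_series(timestamp '2000-01-01 00:00',
                       timestamp '2000-01-01 23:59',
                       interval '1 minute') AS t;   -- 1440 rows, one per minute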

Dealing With QueryString Parameters

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 22, 2016.

It is kind of tough to describe what the problem really is. But do you know when you are creating an interface that provides pagination, filters and ordering, and you are controlling it all via URL GET parameters?

For instance, if you have different options for ordering, you might think of something like this:

<div class="dropdown">
  <button class="btn btn-default dropdown-toggle" type="button" data-toggle="dropdown">
    Order by
  </button>
  <ul class="dropdown-menu">
    <li><a href="?order=name">Name (a-z)</a></li>
    <li><a href="?order=-name">Name (z-a)</a></li>
    <li><a href="?order=price">Price</a></li>
    <li><a href="?order=date">Date</a></li>
  </ul>
</div>

Basically you would be sending the user to the very same page, but passing a GET parameter named order, which you could handle like this:

def products_list(request):
    products = Product.objects.all()
    order = request.GET.get('order', 'name')  # Set 'name' as a default value
    products = products.order_by(order)
    return render(request, 'products_list.html', {
        'products': products
    })

PS: This is a minimalist example; if you pass an invalid parameter directly in the querystring, the queryset will break. I will avoid adding extra validation so we can focus on the objective of this article.

So far so good. But the problem starts to appear when you add a new control, also via a GET parameter. Let's say pagination:

<ul class="pagination">
  {% for i in page_obj.paginator.page_range %}
    <li>
      <a href="?page={{ i }}">{{ i }}</a>
    </li>
  {% endfor %}
</ul>

What would happen here: if you are ordering the results by date and then move to the next page, you will lose the ordering preference.

The easiest solution would be something like this:

def products_list(request):
    ...
    return render(request, 'products_list.html', {
        'products': products,
        'order': order,
        'page': page
    })

And then:

<div class="dropdown">
  <button class="btn btn-default dropdown-toggle" type="button" data-toggle="dropdown">
    Order by
  </button>
  <ul class="dropdown-menu">
    <li><a href="?order=name&page={{ page }}">Name (a-z)</a></li>
    <li><a href="?order=-name&page={{ page }}">Name (z-a)</a></li>
    <li><a href="?order=price&page={{ page }}">Price</a></li>
    <li><a href="?order=date&page={{ page }}">Date</a></li>
  </ul>
</div>

...

<ul class="pagination">
  {% for i in page_obj.paginator.page_range %}
    <li>
      <a href="?page={{ i }}&order={{ order }}">{{ i }}</a>
    </li>
  {% endfor %}
</ul>

The bigger the number of parameters, the bigger the mess in the template.


The Solution

Last week while working on a project I faced this problem again, and I put some time to think of a better/reusable solution.

So I came up with this template tag, and I thought about sharing it with you. Maybe it can be useful for you as well. Basically, you will need django.template.context_processors.request in your project's context_processors.

templatetags/templatehelpers.py

from django import template

register = template.Library()

@register.simple_tag
def relative_url(value, field_name, urlencode=None):
    url = '?{}={}'.format(field_name, value)
    if urlencode:
        querystring = urlencode.split('&')
        filtered_querystring = filter(lambda p: p.split('=')[0] != field_name, querystring)
        encoded_querystring = '&'.join(filtered_querystring)
        url = '{}&{}'.format(url, encoded_querystring)
    return url

And then you use it this way:

{% load templatehelpers %}

<div class="dropdown">
  <button class="btn btn-default dropdown-toggle" type="button" data-toggle="dropdown">
    Order by
  </button>
  <ul class="dropdown-menu">
    {% with params=request.GET.urlencode %}
      <li><a href="{% relative_url 'name' 'order' params %}">Name (a-z)</a></li>
      <li><a href="{% relative_url '-name' 'order' params %}">Name (z-a)</a></li>
      <li><a href="{% relative_url 'price' 'order' params %}">Price</a></li>
      <li><a href="{% relative_url 'date' 'order' params %}">Date</a></li>
    {% endwith %}
  </ul>
</div>

...

<ul class="pagination">
  {% for i in page_obj.paginator.page_range %}
    <li>
      <a href="{% relative_url i 'page' request.GET.urlencode %}">{{ i }}</a>
    </li>
  {% endfor %}
</ul>

Now the template tag will keep the current state of your filters/ordering/pages.

Better Python Object Serialization

By Blog of Hynek Schlawack from Django community aggregator: Community blog posts. Published on Aug 22, 2016.

Serialization is everywhere. Notably for me: JSON web APIs and structured logs. And yet, general approaches to serialize arbitrary objects to JSON are rather clunky. Fortunately there’s a gem in the Python standard library that makes it easy and elegant.

Kaarel Moppel: Walbouncer refreshed – a proxy for selective PostgreSQL physical replication

From Planet PostgreSQL. Published on Aug 22, 2016.

Walbouncer was covered on the blog already when it was first announced, but that was almost 2 years ago – so it is a good time to echo it out again, especially in light of the compatibility update to support PostgreSQL 9.5, the addition of a helper script for bootstrapping replicas called Walbouncer-companion, and making the […]

The post Walbouncer refreshed – a proxy for selective PostgreSQL physical replication appeared first on Cybertec - The PostgreSQL Database Company.

Pavel Stehule: XMLTABLE implementation is done

From Planet PostgreSQL. Published on Aug 21, 2016.

I finished the patch for XMLTABLE in PostgreSQL. It is working well. The examples I found on the net work:

postgres=#   SELECT t.id, x.*
FROM employees t,
XMLTABLE ('/Employees/Employee[age>40]'
PASSING t.data
COLUMNS firstname VARCHAR(30) PATH 'firstname',
lastname VARCHAR(30) PATH 'lastname',
age VARCHAR(30) PATH 'age') x
WHERE t.id = 1;
┌────┬───────────┬──────────┬─────┐
│ id │ firstname │ lastname │ age │
╞════╪═══════════╪══════════╪═════╡
│  1 │ Jim       │ Moriarty │  52 │
│  1 │ Mycroft   │ Holmes   │  41 │
└────┴───────────┴──────────┴─────┘
(2 rows)

Time: 1.619 ms
postgres=# SELECT t.id, x.*
FROM employees t,
XMLTABLE ('/Employees/Employee[@emplid=2222]'
PASSING t.data
COLUMNS firstname VARCHAR(30) PATH 'firstname',
lastname VARCHAR(30) PATH 'lastname') x
WHERE t.id = 1;
┌────┬───────────┬──────────┐
│ id │ firstname │ lastname │
╞════╪═══════════╪══════════╡
│  1 │ Sherlock  │ Homes    │
└────┴───────────┴──────────┘
(1 row)

Time: 1.606 ms
postgres=# SELECT emp.id, x.*
FROM employees emp,
XMLTABLE ('/Employees/Employee'
PASSING emp.data
COLUMNS firstname VARCHAR(30) PATH 'firstname',
type VARCHAR(30) PATH '@type') x;
┌────┬───────────┬───────┐
│ id │ firstname │ type  │
╞════╪═══════════╪═══════╡
│  1 │ John      │ admin │
│  1 │ Sherlock  │ admin │
│  1 │ Jim       │ user  │
│  1 │ Mycroft   │ user  │
└────┴───────────┴───────┘
(4 rows)

Time: 1.556 ms

Please, test it, check it.

Django Tips #12 Disabling Migrations to Speed Up Unit Tests

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 19, 2016.

The model migrations are certainly a great feature of the Django framework. But, when it comes down to running tests, it really slows down the process. Especially if your migration history is big. This is a simple tip to speed up your tests.

I like to create a separate settings file for these tweaks.

tests_settings.py

from settings import *

# Custom settings goes here

And then to run the tests:

python manage.py test --settings=myproject.tests_settings --verbosity=1

Django >= 1.9

One option is using the MIGRATION_MODULES setting, which is intended to define a custom name for an app’s migration module. If you set None instead, Django will ignore the migration module.

from settings import *

MIGRATION_MODULES = {
    'auth': None,
    'contenttypes': None,
    'default': None,
    'sessions': None,

    'core': None,
    'profiles': None,
    'snippets': None,
    'scaffold_templates': None,
}

Django < 1.9

This is a possible solution if you are using a version prior to 1.9. Actually, I still prefer to use it nowadays, because I don't need to list each app.

from settings import *

class DisableMigrations(object):
    def __contains__(self, item):
        return True

    def __getitem__(self, item):
        return 'notmigrations'

MIGRATION_MODULES = DisableMigrations()

Older Django Versions (using South)

Hold tight:

SOUTH_TESTS_MIGRATE = False

Damn! It could even live inside the production settings.py.

Shaun M. Thomas: PG Phriday: Forensic Fundamentals

From Planet PostgreSQL. Published on Aug 19, 2016.

All database engines, even Postgres, occasionally present a seemingly intractable problem that will drive everyone insane while attempting to isolate it. All it takes is the perfect storm of situational circumstances, and even a perfectly running stack of software will grind to a screeching halt. It’s situations like this that we must turn to various Postgres forensic tools to track down the issue before management starts firing people in frustration.

We've already discussed how connections can be blocked by improper operation ordering. But what if things aren't so straightforward? When there are more than two actors involved, the potential for weird interactions increases to a point where replicating the issue in a clean environment is often difficult or impossible. So let's make sure a few things are in place.

First things first: logging. We should tweak log settings from the defaults, as they provide a wealth of after-the-fact diagnostic information. All of these should be in postgresql.conf:

log_checkpoints = on
log_statement = ddl
log_min_duration_statement = 1000
log_line_prefix = '%p|%u|%d|%r|%t|'

There are a lot of other log settings, but these are the ones I personally consider essential modifications. Why? Let’s examine them one by one.

  • log_checkpoints: A checkpoint happens when Postgres writes pending modifications to table files. If this setting is enabled, we learn when that process started, how long it took, how much data was written, how much time was spent syncing to disk and thus waiting for the underlying OS to flush the data, and so on. Enabling this can single-handedly track down insufficient write performance of a storage device, or reveal times of heavy write activity that may suggest better transaction log coverage, and so on. It’s a crime this isn’t enabled by default.
  • log_statement: The ddl setting will log any modification to a database object. Create a table? Logged. Modify a view? Logged. Drop an index? Logged. This is the minimal level of activity auditing a DBA should expect. Modifying database structures, especially in production, should be strictly controlled, and we need to know when such modifications occur, who made them, and so on.
  • log_min_duration_statement: In most databases, queries execute on the order of tens of milliseconds. Those that exceed one full second are almost always an aberration that require further investigation. Reporting-oriented servers might err toward higher values as their queries are generally slower, but this should still be set to something. This doesn’t just reveal slow queries, but queries that ran long for any reason. This latter case makes query tracking an essential tool.
  • log_line_prefix: There are numerous environmental details in place for any session, or a particular operation. We know the account in question, where it connected from, which database it’s using, the PID of the Postgres backend it was assigned, and so on. It’s ideal to salt this to fit organization needs, and this is much better than the default. If a problem occurs, we might not be available or capable of viewing it right away. It’s up to the log to fill any gaps.

There is a huge exception here that will be relevant soon: the TRUNCATE command. Postgres classifies this as a modification, not DDL. As such, our chosen log_statement value will not log table truncations! This matters because TRUNCATE doesn’t just empty a table—it destroys it. What we see when truncating a table is that it becomes empty. In reality, Postgres created an exact copy of the original table and performed an atomic swap, discarding the original.

We could use that to argue TRUNCATE is DDL, and should be logged as such. We’re fundamentally modifying the table itself, and the documentation agrees:

TRUNCATE acquires an ACCESS EXCLUSIVE lock on each table it operates on, which blocks all other concurrent operations on the table.

This exclusive access lock is the key to understanding situations where TRUNCATE can be problematic. We have a process that isn’t logged as DDL, yet acts like DDL, and could originate from any application or script that touches the database. Consider the implications.

Now let’s construct the perfect storm. Imagine two access vectors: one reads data to a table, the other writes intermittently. They look like this:

CREATE TABLE foo (bar INT);
 
-- Connection 1, reads from a table:
 
BEGIN;
SELECT * FROM foo;
ROLLBACK;
 
-- Connection 2, writes from some script source:
 
BEGIN;
INSERT INTO foo (bar) VALUES (42);
COMMIT;

These look relatively harmless, but there are a couple clues if we examine more closely. Notice that the read-only connection is wrapped in a transaction? This looks very odd since it’s completely unnecessary, but it’s actually very common. Many database connectors actually consider this beneficial, since it allows developers to perform endless operations without affecting the database until they’re ready to commit.

Unfortunately for applications that are read-only, this can be a lot of unnecessary overhead. Further, the author of a read-only script may not be cognizant they’re operating within a transaction. It’s understandably common for such a process to request a large amount of data from Postgres and then process it, unaware that Postgres is calmly waiting for them to commit or roll back their work. We can see this in Postgres if we examine the pg_stat_activity and pg_locks views.

Imagine a connection retrieved some data and is spending hours processing it. This is what Postgres would see:

\x ON
 
SELECT * FROM pg_stat_activity
 WHERE state = 'idle in transaction';
 
-[ RECORD 1 ]----+------------------------------
datid            | 12411
datname          | postgres
pid              | 19690
usesysid         | 10
usename          | postgres
application_name | psql
client_addr      | 
client_hostname  | 
client_port      | -1
backend_start    | 2016-08-19 11:14:11.194043-05
xact_start       | 2016-08-19 11:14:28.376349-05
query_start      | 2016-08-19 11:14:28.376627-05
state_change     | 2016-08-19 11:14:28.376977-05
waiting          | f
state            | idle IN TRANSACTION
backend_xid      | 
backend_xmin     | 
query            | SELECT * FROM foo;
 
\x off
 
SELECT l.pid, a.waiting, l.relation::regclass::text,
       l.locktype, l.mode
  FROM pg_locks l
  JOIN pg_stat_activity a USING (pid)
 WHERE a.state = 'idle in transaction';
 
  pid  | waiting | relation |  locktype  |      mode       
-------+---------+----------+------------+-----------------
 19690 | f       | foo      | relation   | AccessShareLock
 19690 | f       |          | virtualxid | ExclusiveLock

This information paints the session as ultimately innocuous. It’s idle in transaction, but it only has a share lock on one table, which shouldn’t block anything trying to write to it. Imagine our surprise when applications and users start reporting they keep getting errors about “too many connections”. When we look at the sessions, we see something like this:

SELECT COUNT(*)
  FROM pg_stat_activity
 WHERE waiting;
 
 COUNT 
-------
    97

How is that possible? It’s easy if we have a large cluster of web-driven servers all trying to write to a single table. Or even just an aggressively threaded bulk loading application that pushes data as hard as it can without any kind of throttle checks. Both of these situations are fine when the database is behaving, so nobody ever thought about coding around scenarios where it becomes unresponsive.

Sensing an issue, developers, operators, or system administrators might simply start bouncing applications or even Postgres itself in a frantic effort to clear the logjam. In that case, we can’t check pg_stat_activity for the cause, and all we have are the logs. Remember those? Let’s see what they have to say about things, given we know roughly when the problem occurred:

grep 11:42 /var/log/postgresql/postgresql-9.5-main.log
 
25956|postgres|postgres|::1(49763)|2016-08-19 11:42:21 CDT|LOG:  duration: 660488.381 ms  statement: TRUNCATE TABLE foo;

Introducing our third vector, a periodic cleanup job that empties a shared table before an extremely aggressive bulk operation fills it again. Before this process began, our read-only script started running and was calmly processing query results, unaware it had inadvertently shut down the entire company.

How? Remember that access share lock it’s holding while idle in a transaction? Well the TRUNCATE command requires an access exclusive lock, which it can’t get. It needs the exclusive lock because it needs to annihilate the table and replace it with a shady clone. While it waits for a lock it will never receive, the aggressive bulk loading application consumes all available connections with inserts that will never complete.

The inserts will never complete because they need an exclusive lock for the specific rows they’re inserting. But they can’t lock individual rows because the entire table is locked by the pending TRUNCATE. Meanwhile, the loading engine dutifully eats up all available connections with subsequent attempts that will simply exacerbate the problem. Once those connections are gone, standard applications can no longer connect at all for unrelated work, and suddenly the world stops.

All because of a single unnecessary transaction. If this sounds ridiculous and contrived, consider the actors. All of them operate independently of one another and function properly in a vacuum. It’s hilariously easy for that to go awry. I’ve seen this happen dozens of times in as many separate organizations across developers at all experience levels. I’ve never encountered a database-driven app that didn’t eventually trigger a DDoS on itself through some unlikely interaction between separate components.

So how can we solve this conundrum? The easiest way is to simply stop using implicit transactions wherever they may lurk. Most—if not all—database connectors provide an attribute or method that completely eliminates this behavior. In Python for example, we could tell psycopg2 we want to handle our own transactions:

import psycopg2
 
conn = psycopg2.connect("my connection string")
conn.autocommit = True
 
cur = conn.cursor()
 
cur.execute('BEGIN')
# Some other db-related write code here
cur.execute('COMMIT')

This reverts Postgres to standard operation: all queries resolve immediately unless commands need to be grouped together. In those cases, we can control that explicitly. This solves our first problem of dangling transactions.

Then applications and scripts need to clean up after themselves. Even in cases of garbage collection, if an insert procedure gets stuck, the whole stack needs to be canceled and cleaned up before invoking subsequent attempts. This prevents out-of-control resource consumption that starve out other access vectors. The Postgres DBA can assist here by setting role-level connection limits:

ALTER USER loader_thing WITH CONNECTION LIMIT 20;

This won’t prevent misbehaving scripts from choking themselves to death, but the rest of the application stack should continue unaffected. Of course, if different applications are sharing accounts, that needs to be stopped immediately. We can’t use high granularity controls on coarse access vectors, otherwise we’ve just moved the problem instead of solving it.

Above all, remember that this is only one possible scenario that requires forensics to track down and eliminate. All of the tools demonstrated here can almost always identify every one of them. We were even able to derive recommended practices to prevent future issues based on our observations. It’s the kind of balanced approach that emerges naturally from Postgres itself. It’s a stage where everyone has a role to play.

Except implicit transactions; those things are cancer.

Chris Travers: PostgreSQL vs Hadoop

From Planet PostgreSQL. Published on Aug 18, 2016.

So one of the folks I do work with is moving a large database from PostgreSQL to Hadoop.  The reasons are sound -- volume and velocity are major issues for them, and PostgreSQL is not going away in their data center and in their industry there is a lot more Hadoop usage and tooling than there is PostgreSQL tooling for life science analytics (Hadoop is likely to replace both PostgreSQL and, hopefully, a massive amount of data on NFS).  However this has provided an opportunity to think about big data problems and solutions and their implications.  At the same time I have seen as many people moving from Hadoop to PostgreSQL as the other way around.  No, LedgerSMB will never likely use Hadoop as a backend.  It is definitely not the right solution to any of our problems.

Big data problems tend to fall into three categories, namely managing ever increasing volume of data, managing increasing velocity of data, and dealing with greater variety of data structure.  It's worth noting that these are categories of problems, not specific problems themselves, and the problems within the categories are sufficiently varied that there is no solution for everyone.  Moreover these solutions are hardly without their own significant costs.  All too often I have seen programs like Hadoop pushed as a general solution without attention to these costs and the result is usually something that is overly complex and hard to maintain, may be slow, and doesn't work very well.

So the first point worth noting is that big data solutions are specialist solutions, while relational database solutions for OLTP and analytics are generalist solutions.  Usually those who are smart start with the generalist solutions and move to the specialist solutions unless they know out of the box that the specialist solutions address a specific problem they know they have.  No, Hadoop does not make a great general ETL platform.....

One of the key things to note is that Hadoop is built to solve all three problems simultaneously.  This means that you effectively buy into a lot of other costs if you are trying to solve only one of the V problems with it.

The single largest cost comes from the solutions to the variety of data issues.  PostgreSQL and other relational data solutions provide very good guarantees on the data because they enforce a lack of variety.  You force a schema on write and if that is violated, you throw an error.  Hadoop enforces a schema on read, and so you can store data and then try to read it, and get a lot of null answers back because the data didn't fit your expectations.  Ouch.  But that's very helpful when trying to make sense of a lot of non-structured data.

Now, solutions to check out first if you are faced with volume and velocity problems include Postgres-XL and similar shard/clustering solutions, but these really require good data partitioning criteria. If your data set is highly interrelated, it may not be a good solution because cross-node joins are expensive. You also wouldn't use these for smallish datasets, certainly not if they are under a TB, since the complexity cost of these solutions is not lightly undertaken.

Premature optimization is the root of all evil and big data solutions have their place.  However don't use them just because they are cool or new, or resume-building.  They are specialist tools and overuse creates more problems than underuse.

Package of the Week: Pendulum

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 18, 2016.

Pendulum is a Python library that makes your life easier when it comes to working with dates and times.


Installation

pip install pendulum

A few dependencies will be installed: pytz, tzlocal and python-dateutil. Just the python-dateutil itself is already great – but pendulum offers a few more tweaks.


Usage

import pendulum

now = pendulum.now()  # '2016-08-18 20:24:52'

Let’s keep the now variable for the next examples.

add
now.add(days=1)  # '2016-08-19 20:24:52'

Accepts days, months, years, hours, minutes, seconds, microseconds, weeks.

now.add(weeks=1, hours=2)  # '2016-08-25 22:24:52'
subtract
now.subtract(weeks=1)  # '2016-08-11 20:24:52'
age
birthday = pendulum.create(1988, 10, 22)
birthday.age  # 27
diff
now = pendulum.now('Europe/Helsinki')
future = now.add(hours=12)
past = now.subtract(years=50, days=12)

future.diff()             # '<Period [11 hours 59 minutes 41 seconds]>'
future.diff_for_humans()  # '11 hours from now'
past.diff_for_humans()    # '50 years ago'
delta
last_week = now.subtract(weeks=1)
delta = now - last_week

delta.start  # <Pendulum [2016-08-05T00:28:26.207225+03:00]>
delta.end    # <Pendulum [2016-08-12T00:28:26.207225+03:00]>

delta.in_days()      # 7
delta.in_hours()     # 168
delta.in_weekdays()  # 6
delta.in_words()     # '1 week'
interval
it = pendulum.interval(days=15)

it.weeks       # 2
it.days        # 15
it.in_hours()  # 360
it.in_words()  # '2 weeks 1 day'
is_
now.is_birthday()
now.is_future()
now.is_past()

now.is_monday()
now.is_tuesday()
now.is_wednesday()
now.is_thursday()
now.is_friday()
now.is_saturday()
now.is_sunday()

now.is_today()
now.is_yesterday()
now.is_tomorrow()

now.is_leap_year()
now.is_long_year()

now.is_same_day()
now.is_weekday()
now.is_weekend()
now = pendulum.now()          # 2016-08-18 20:24:52
next_year = now.add(years=1)  # 2017-08-18 20:24:52

now.is_birthday(next_year)    # True
now.is_same_day(next_year)    # False
class methods
date = datetime.datetime(2016, 1, 1) # datetime.datetime(2016, 1, 1, 0, 0)
pendulum.instance(date)              # <Pendulum [2016-01-01T00:00:00+00:00]>

pendulum.now()                       # <Pendulum [2016-08-18T21:12:27.684083+03:00]>
pendulum.utcnow()                    # <Pendulum [2016-08-18T21:12:56.711154+00:00]>

pendulum.today()                     # <Pendulum [2016-08-18T00:00:00+03:00]>
pendulum.tomorrow()                  # <Pendulum [2016-08-19T00:00:00+03:00]>
pendulum.yesterday()                 # <Pendulum [2016-08-17T00:00:00+03:00]>

pendulum.create(2017, 6, 1)          # <Pendulum [2017-06-01T21:17:11.868599+00:00]>

pendulum.parse('2016-08-18')         # <Pendulum [2016-08-18T00:00:00+00:00]>

There is much more to explore. See the official documentation at https://pendulum.eustace.io/.

Simon Riggs: Postgres-BDR: 2 Years in Production

From Planet PostgreSQL. Published on Aug 18, 2016.

Postgres-BDR has now reached 1.0 production status.

Over the last 2 years, Postgres-BDR has been used daily for mission critical production systems.

As you might imagine, it has been improved by both bug fixes and feature enhancements that allow it to be used smoothly, so it's mature, robust and feature-rich.

The BDR Project introduced logical replication for PostgreSQL, now available as pglogical. In addition, it introduced replication slots, background workers and many other features. But is it still relevant?

Postgres-BDR delivers all of the following features, which aren't yet in PostgreSQL 9.6 and likely won't all be in PostgreSQL 10.0 either:

  • Automatic replication of DDL, reducing maintenance costs for DBAs
  • Automatic replication of sequences
  • Conflict resolution and logging

Ernst-Georg Schmid: Hexastores are easy

From Planet PostgreSQL. Published on Aug 18, 2016.

Did you know that you can make a Hexastore from an RDF triple in just one line of SQL? (This needs PostgreSQL 9.4 or better, because of the multi-array unnest.)

CREATE OR REPLACE FUNCTION hexify(
    IN sub text,
    IN pred text,
    IN obj text)
  RETURNS TABLE(ord text, a text, b text, c text) AS
$$select A.t || B.t || C.t as ord, A.v, B.v, C.v from (select * from unnest(ARRAY[sub, pred, obj],ARRAY['s', 'p', 'o'])) as A(v, t) cross join (select * from unnest(ARRAY[sub, pred, obj],ARRAY['s', 'p', 'o'])) as B(v, t) cross join (select * from unnest(ARRAY[sub, pred, obj],ARRAY['s', 'p', 'o'])) as C(v, t) where a.v != b.v and a.v != c.v and b.v != c.v order by ord desc$$
  LANGUAGE sql IMMUTABLE STRICT
  COST 100
  ROWS 6;

SELECT * FROM hexify('subject','predicate','object');


Sometimes, PostgreSQL SQL is just awesome...

More on Hexastores here and here.

Colin Copeland: Postgres Present and Future (PyCon 2016 Must-See Talk: 6/6)

From Planet PostgreSQL. Published on Aug 18, 2016.

Part six of six in our annual PyCon Must-See Series, a weekly highlight of talks our staff especially loved at PyCon. With so many fantastic talks, it’s hard to know where to start, so here’s our short list.

Coming from a heavy database admin background, I found Craig Kerstiens's “Postgres Present and Future” to be incredibly well organized and engaging. Of particular interest to me, because I am somewhat new to Postgres (while having more background history with MS SQL), was the deep dive into indexes in Postgres.

Check out 5:44-8:39 to find out when to use different types of indexes beyond the standard B-Tree. For instance, GIN indexes are helpful when searching multiple values within a single column, i.e. an array field or a JSONB field.
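To make that concrete (table and column names are invented for the example):

CREATE TABLE events (
    id      bigserial PRIMARY KEY,
    payload jsonb
);

CREATE INDEX events_payload_gin ON events USING gin (payload);

-- containment searches over the JSONB document can use the GIN index
SELECT * FROM events WHERE payload @> '{"type": "login"}';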

Click over to 17:22-19:33 to learn about the new Bloom Filter in Postgres 9.6, which is coming out in a few months. This extension seems like it will be incredibly useful to speed up queries on wide tables with a bunch of different options.


More in the annual PyCon Must-See Talks Series.

Exploring Django Utils #2

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 17, 2016.

Last week I started a post series exploring the django.utils module. In this second part I will be focusing more on the html module.


HTML

Module: django.utils.html

escape

Returns the given text with ampersands, quotes and angle brackets encoded for use in HTML.

from django.utils.html import escape

escape("<strong style='font-size: 12px'>escaped html</strong>")
'&lt;strong style=&#39;font-size: 12px&#39;&gt;escaped html&lt;/strong&gt;'

It will cause already escaped strings to be escaped again:

escaped_html = escape("<strong>escaped html</strong>")
# '&lt;strong&gt;escaped html&lt;/strong&gt;'

escape(escaped_html)
# '&amp;lt;strong&amp;gt;escaped html&amp;lt;/strong&amp;gt;'

If this is a concern, use conditional_escape() instead.

conditional_escape
escaped_html = conditional_escape("<strong>escaped html</strong>")
# '&lt;strong&gt;escaped html&lt;/strong&gt;'

conditional_escape(escaped_html)
# '&lt;strong&gt;escaped html&lt;/strong&gt;'
format_html

This function is similar to str.format, but it will conditionally escape all the arguments. Prefer it over str.format or string interpolation for building small HTML fragments, as it is safer.

from django.utils.html import format_html

format_html('<div class="alert {}">{}</>', 'warning', 'Watch out!')
'<div class="alert warning">Watch out!</>'

Safely format HTML fragments:

format_html('<div class="alert {}">{}</>', '<script>alert(1);</script>', 'Watch out!')
'<div class="alert &lt;script&gt;alert(1);&lt;/script&gt;">Watch out!</>'
format_html_join

A wrapper of format_html, for the common case of a group of arguments that need to be formatted using the same format string.

format_html_join('\n', '<p>{}</p>', ['a', 'b', 'c'])
<p>a</p>\n<p>b</p>\n<p>c</p>

Another example:

data = [
    ['success', 'Success message'],
    ['warning', 'Watch out!'],
    ['danger', 'Danger!!'],
]

format_html_join('\n', '<div class="alert {0}">{1}</div>', data)
<div class="alert success">Success message</div>\n
<div class="alert warning">Watch out!</div>\n
<div class="alert danger">Danger!!</div>

Yet another example:

format_html_join('\n', '<tr><td>{0}</td><td>{1}</td></tr>', ((u.first_name, u.last_name)
                                                            for u in users))
<tr><td>Vitor</td><td>Freitas</td></tr>\n
<tr><td>John</td><td>Duo</td></tr>\n
<tr><td>Peter</td><td>Croke</td></tr>\n
<tr><td>Elektra</td><td>Moore</td></tr>
linebreaks
from django.utils.html import linebreaks

linebreaks('convert\ninto html paragraphs\ntest')
<p>convert<br />into html paragraphs<br />test</p>

Gulcin Yildirim: Evolution of Fault Tolerance in PostgreSQL: Synchronous Commit

From Planet PostgreSQL. Published on Aug 17, 2016.

PostgreSQL is an awesome project and it evolves at an amazing rate. We’ll focus on evolution of fault tolerance capabilities in PostgreSQL throughout its versions with a series of blog posts. This is the fourth post of the series and we’ll talk about synchronous commit and its effects on fault tolerance and dependability of PostgreSQL.

If you would like to witness the evolution progress from the beginning, please check the first three blog posts of the series below. Each post is independent, so you don’t actually need to read one to understand another.

  1. Evolution of Fault Tolerance in PostgreSQL 
  2. Evolution of Fault Tolerance in PostgreSQL: Replication Phase 
  3. Evolution of Fault Tolerance in PostgreSQL: Time Travel

Synchronous Commit

By default, PostgreSQL implements asynchronous replication, where data is streamed out whenever convenient for the server. This can mean data loss in case of failover. It's possible to ask Postgres to require one (or more) standbys to acknowledge replication of the data prior to commit; this is called synchronous replication (synchronous commit).

With synchronous replication, the replication delay directly affects the elapsed time of transactions on the master. With asynchronous replication, the master may continue at full speed.

Synchronous replication guarantees that data is written to at least two nodes before the user or application is told that a transaction has committed.

The user can select the commit mode of each transaction, so that it is possible to have both synchronous and asynchronous commit transactions running concurrently.

This allows flexible trade-offs between performance and certainty of transaction durability.

Configuring Synchronous Commit

For setting up synchronous replication in Postgres we need to configure the synchronous_commit parameter in postgresql.conf.

The parameter specifies whether transaction commit will wait for WAL records to be written to disk before the command returns a success indication to the client. Valid values are on, remote_apply, remote_write, local, and off. We'll discuss how things work in terms of synchronous replication when we set the synchronous_commit parameter to each of the defined values.

Let’s start with Postgres documentation (9.6):

The default, and safe, setting is on. When off, there can be a delay between when success is reported to the client and when the transaction is really guaranteed to be safe against a server crash. (The maximum delay is three times wal_writer_delay.) Unlike fsync, setting this parameter to off does not create any risk of database inconsistency: an operating system or database crash might result in some recent allegedly-committed transactions being lost, but the database state will be just the same as if those transactions had been aborted cleanly. So, turning synchronous_commit off can be a useful alternative when performance is more important than exact certainty about the durability of a transaction.

Here we see the concept of synchronous commit as described in the introduction of the post: you're free to set up synchronous replication, but if you don't, there is always a risk of losing data, although without the risk of creating database inconsistency, unlike turning fsync off (a topic for another post). In short, if we don't want to lose any data during replication delays and want to be sure the data is written to at least two nodes before the user or application is told the transaction has committed, we need to accept losing some performance.

Let's see how the different settings work for different levels of synchronisation. Before we start, let's look at how a commit is processed by PostgreSQL replication. The client executes queries on the master node, the changes are written to the transaction log (WAL) and copied over the network to the WAL on the standby node. The recovery process on the standby node then reads the changes from the WAL and applies them to the data files just like during crash recovery. If the standby is in hot standby mode, clients may issue read-only queries on the node while this is happening. For more details about how replication works, check out the replication blog post in this series.

Fig.1 How replication works

synchronous_commit = off

When we set synchronous_commit = off, the COMMIT does not wait for the transaction record to be flushed to the disk. This is highlighted in Fig.2 below.

Fig.2 synchronous_commit = off

synchronous_commit = local

When we set synchronous_commit = local,  the COMMIT waits until the transaction record is flushed to the local disk. This is highlighted in Fig.3 below.

Fig.3 synchronous_commit = local

synchronous_commit = on (default)

When we set synchronous_commit = on, the COMMIT will wait until the server(s) specified by synchronous_standby_names confirm receipt of the transaction record (meaning the standby has the data in memory). This is highlighted in Fig.4 below.

Note: When synchronous_standby_names is empty, this setting behaves the same as synchronous_commit = local.

Fig.4 synchronous_commit = on

synchronous_commit = remote_write

When we set synchronous_commit = remote_write, the COMMIT will wait until the server(s) specified by synchronous_standby_names confirm write of the transaction record to the disk. This is highlighted in Fig.5 below.

Fig.5 synchronous_commit = remote_write

synchronous_commit = remote_apply

When we set synchronous_commit = remote_apply, the COMMIT will wait until the server(s) specified by synchronous_standby_names confirm that the transaction record was applied to the database. This is highlighted in Fig.6 below.

Fig.6 synchronous_commit = remote_apply

Now, let's look at the synchronous_standby_names parameter in detail, which is referred to above when setting synchronous_commit to on, remote_apply or remote_write.

synchronous_standby_names = ‘standby_name [, …]’

The synchronous commit will wait for a reply from one of the standbys listed, in the order of priority. This means that if the first standby is connected and streaming, the synchronous commit will always wait for a reply from it, even if the second standby has already replied. The special value * can be used as standby_name and will match any connected standby.

synchronous_standby_names = ‘num (standby_name [, …])’

The synchronous commit will wait for a reply from at least num of the standbys listed, in the order of priority. The same rules as above apply. So, for example, setting synchronous_standby_names = '2 (*)' will make synchronous commit wait for a reply from any 2 standby servers.
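
Putting that together, a minimal postgresql.conf sketch might look like this (the standby names are made up for illustration):

# wait until at least two of the listed standbys confirm each commit
synchronous_commit = on
synchronous_standby_names = '2 (standby1, standby2, standby3)'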

synchronous_standby_names is empty

If this parameter is empty as shown, setting synchronous_commit to on, remote_write or remote_apply behaves the same as local (i.e. the COMMIT will only wait for flushing to the local disk).

Note: synchronous_commit parameter can be changed at any time; the behaviour for any one transaction is determined by the setting in effect when it commits. It is therefore possible, and useful, to have some transactions commit synchronously and others asynchronously. For example, to make a single multi-statement transaction commit asynchronously when the default is the opposite, issue SET LOCAL synchronous_commit TO OFF within the transaction.
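
For example, a sketch of turning synchronous commit off for just one transaction (assuming the server default stays on; the table here is hypothetical):

BEGIN;
SET LOCAL synchronous_commit TO OFF;
-- this commit will not wait for the standby to confirm anything
INSERT INTO audit_log (message) VALUES ('fire and forget');
COMMIT;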

Conclusion

In this blog post, we discussed synchronous replication and described different levels of protection which are available in Postgres. We’ll continue with logical replication in the next blog post.

References

Special thanks to my colleague Petr Jelinek for giving me the idea for illustrations.

PostgreSQL Documentation
PostgreSQL 9 Administration Cookbook – Second Edition

Django Tips #11 Custom Manager With Chainable QuerySets

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 16, 2016.

In a Django model, the Manager is the interface that interacts with the database. By default the manager is available through the Model.objects property. The default manager every Django model gets out of the box is the django.db.models.Manager. It is very straightforward to extend it and change the default manager.

from django.db import models

class DocumentManager(models.Manager):
    def pdfs(self):
        return self.filter(file_type='pdf')

    def smaller_than(self, size):
        return self.filter(size__lt=size)

class Document(models.Model):
    name = models.CharField(max_length=30)
    size = models.PositiveIntegerField(default=0)
    file_type = models.CharField(max_length=10, blank=True)

    objects = DocumentManager()

With that you will be able to retrieve all pdf files like this:

Document.objects.pdfs()

The thing is, this method is not chainable. You can still use order_by or filter on the result:

Document.objects.pdfs().order_by('name')

But if you try to chain the methods it will break:

Document.objects.pdfs().smaller_than(1000)
AttributeError: 'QuerySet' object has no attribute 'smaller_than'

To make it work you must create custom QuerySet methods:

class DocumentQuerySet(models.QuerySet):
    def pdfs(self):
        return self.filter(file_type='pdf')

    def smaller_than(self, size):
        return self.filter(size__lt=size)

class DocumentManager(models.Manager):
    def get_queryset(self):
        return DocumentQuerySet(self.model, using=self._db)  # Important!

    def pdfs(self):
        return self.get_queryset().pdfs()

    def smaller_than(self, size):
        return self.get_queryset().smaller_than(size)

class Document(models.Model):
    name = models.CharField(max_length=30)
    size = models.PositiveIntegerField(default=0)
    file_type = models.CharField(max_length=10, blank=True)

    objects = DocumentManager()

Now you can use it just like any other QuerySet method:

Document.objects.pdfs().smaller_than(1000).exclude(name='Article').order_by('name')

You can keep the code inside models.py. But as the code base grows, I prefer to keep the Managers and QuerySets in a different module, named managers.py.
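
A minimal sketch of that split (assuming managers.py sits next to models.py in the same app): move DocumentQuerySet and DocumentManager into managers.py exactly as defined above, then import the manager in models.py.

# models.py
from django.db import models

from .managers import DocumentManager

class Document(models.Model):
    name = models.CharField(max_length=30)
    size = models.PositiveIntegerField(default=0)
    file_type = models.CharField(max_length=10, blank=True)

    objects = DocumentManager()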

Joshua Drake: Rich in the Jungle: An AWS to Softlayer comparison for PostgreSQL

From Planet PostgreSQL. Published on Aug 16, 2016.

I have updated my Rich in the Jungle presentation with new pricing for AWS vs. Softlayer. Things haven't changed much: in terms of raw performance per dollar (which is not the only qualifier), Softlayer is clearly the winner.

Kaarel Moppel: Insert-only data modeling with PostgreSQL?

From Planet PostgreSQL. Published on Aug 16, 2016.

During recent years there has been quite a lot of fuzz about “insert-only” approaches and some database-like products (Datomic, Apache Samza) have emerged, being inspired by the idea of having an immutable datastore. In light of cheap storage and powerful hardware, I see the idea definitely having potential for certain use cases. So why not […]

The post Insert-only data modeling with PostgreSQL? appeared first on Cybertec - The PostgreSQL Database Company.

Tatsuo Ishii: Importing PostgreSQL 9.6's SQL parser

From Planet PostgreSQL. Published on Aug 16, 2016.

In almost every release of Pgpool-II, we import the latest version of PostgreSQL's SQL parser (more precisely, the raw parser). This time, our new Pgpool-II developer faced an interesting problem while importing PostgreSQL 9.6's parser.

In PostgreSQL, the SQL parser is written in bison, a general-purpose parser generator. The particular file containing the SQL grammar rules is named "gram.y". gram.y used to include "scan.l", which is a lexical scanner written in flex.

In reality, gram.y is translated into a C source file by bison, then compiled. The same can be said of scan.l, except that it is translated by flex.

So the main part of the SQL parser source was a single big file consisting of gram.y and scan.l.

From PostgreSQL 9.6, however, the PostgreSQL developers decided to keep gram.y and scan.l separate.

Build backend/parser/scan.l and interfaces/ecpg/preproc/pgc.l standalone.

This caused enough confusion for the developer in charge of the work, and it took some time before realizing the change. I would say it's a fun part of the task when we work on an OSS project :-) However, I cannot help saying that it would be nice if the SQL parser were exported as a separate library so that we do not need this kind of work in every release of Pgpool-II.

How to Create a Password Confirmation View

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 15, 2016.

It is a very common practice nowadays to keep signed-in user sessions alive for a very long time. But some pages deal with sensitive information (such as billing information or change-email forms), and for those it is a good idea to ask the user to confirm his/her credentials again. To achieve this, we will be using the built-in check_password function and the user's last_login field.

To illustrate a little bit more what we want to achieve, take the picture below as an example:

Github Example

No redirects, we want to keep the current URL. Make it last for a few hours.


Building the Form

I'm using a User ModelForm to hold an instance of the logged-in user. Basically I'm overriding the clean() method to validate the submitted password using the built-in function check_password, which lets you test a plain-text password against its hash.

In the save() method, I'm also updating last_login with the current time. This field will be used to control when we should ask for the user's password again.

from django import forms
from django.contrib.auth.models import User
from django.contrib.auth.hashers import check_password
from django.utils import timezone

class ConfirmPasswordForm(forms.ModelForm):
    confirm_password = forms.CharField(widget=forms.PasswordInput())

    class Meta:
        model = User
        fields = ('confirm_password', )

    def clean(self):
        cleaned_data = super(ConfirmPasswordForm, self).clean()
        confirm_password = cleaned_data.get('confirm_password')
        if not check_password(confirm_password, self.instance.password):
            self.add_error('confirm_password', 'Password does not match.')

    def save(self, commit=True):
        user = super(ConfirmPasswordForm, self).save(commit)
        user.last_login = timezone.now()
        if commit:
            user.save()
        return user

The View Function

We will only need a view function (or a class-based view); there is no need for a URL route.

views.py

from django.views.generic.edit import UpdateView
from .forms import ConfirmPasswordForm

class ConfirmPasswordView(UpdateView):
    form_class = ConfirmPasswordForm
    template_name = 'core/confirm_password.html'

    def get_object(self):
        return self.request.user

    def get_success_url(self):
        return self.request.get_full_path()

confirm_password.html

{% extends 'base.html' %}

{% block content %}
  <form method="post">
    {% csrf_token %}
    {{ form.as_p }}
    <button type="submit">Confirm credentials</button>
  </form>
{% endblock %}

The Decorator

The decorator will be responsible for checking if the application needs to ask the user to confirm his/her password. It also makes the check easy to reuse across multiple views.

decorators.py

import datetime
from functools import wraps
from django.utils import timezone

def confirm_password(view_func):
    @wraps(view_func)
    def _wrapped_view(request, *args, **kwargs):
        last_login = request.user.last_login
        timespan = last_login + datetime.timedelta(hours=6)
        if timezone.now() > timespan:
            from uploads.core.views import ConfirmPasswordView
            return ConfirmPasswordView.as_view()(request, *args, **kwargs)
        return view_func(request, *args, **kwargs)
    return _wrapped_view

A few things here: I'm importing the ConfirmPasswordView inside the wrapper function to avoid a circular import. With datetime.timedelta(hours=6) I'm defining that the application will ask for the user's password if it's been more than 6 hours since his/her last login.

We can extract the hardcoded 6 hours later on.
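
One way to extract it, as a rough sketch (CONFIRM_PASSWORD_MAX_AGE_HOURS is a made-up setting name, not something Django provides), is to read the value from the project settings with a sensible default:

import datetime
from functools import wraps

from django.conf import settings
from django.utils import timezone

def confirm_password(view_func):
    @wraps(view_func)
    def _wrapped_view(request, *args, **kwargs):
        # hypothetical setting; falls back to 6 hours when not configured
        max_age = getattr(settings, 'CONFIRM_PASSWORD_MAX_AGE_HOURS', 6)
        timespan = request.user.last_login + datetime.timedelta(hours=max_age)
        if timezone.now() > timespan:
            from uploads.core.views import ConfirmPasswordView
            return ConfirmPasswordView.as_view()(request, *args, **kwargs)
        return view_func(request, *args, **kwargs)
    return _wrapped_view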


Usage

With the form, the view and the decorator we should be good to go.

views.py

from django.contrib.auth.decorators import login_required
from django.shortcuts import render

from myapp.core.decorators import confirm_password

@login_required
@confirm_password
def settings(request):
    ...
    return render(request, 'core/settings.html')

If more than six hours have passed since authentication, when the user tries to access the settings page, the ConfirmPasswordView will be rendered instead, using the settings URL.

Example

Processing the form updates user.last_login:

Example


That’s it. Now it is just a matter of decorating the views with @confirm_password.

Chris Travers: Forthcoming new scalable job queue extension

From Planet PostgreSQL. Published on Aug 14, 2016.

So for those of you who don't already know, I now spend most of my time doing more general PostgreSQL consulting and a fair bit of time still on LedgerSMB. One of my major projects lately has been a large scientific computing platform currently run on PostgreSQL but, due to the volume and velocity of the data, being moved to Hadoop (the client maintains other fairly large PostgreSQL instances with no intention of moving them, btw).

With this client's permission I have decided to take a lot of the work I have done optimizing their job queue system and create a PostgreSQL extension from it. The job queue currently runs tens of millions of jobs per day (meaning twice that number of write queries, and a fair number of read queries too) and is one of the most heavily optimized parts of the system, so this will be based on a large number of lessons learned on what is a surprisingly hard problem.

It is worth contrasting this with pg_message_queue, of which I am also the author. pg_message_queue is intended as a light-weight, easy to use message queue extension that one can plug into other programs to solve common problems where notification and message transfer are the main concerns. This project will be an industrial-scale job queuing system aimed at massive concurrency. As a result, simplicity and ease of use take second place to raw power and performance under load. In other words, here I am not afraid to assume the DBA and programming teams know what they are doing and have the expertise to read the manual and implement appropriately.

The first version (1.x) will support all supported versions of PostgreSQL and make the following guarantees:


  1. Massively multiparallel, non-blocking performance (we currently use it with 600+ connections to PostgreSQL from worker processes)
  2. Partitioning, coalescing, and cancelling of jobs similar in some ways to TheSchwartz
  3. Exponential pushback based on number of times a job has failed
  4. Jobs may be issued again after deletion, but this can always be detected and bad jobs pruned
  5. Optional job table partitioning.

The first client written will rely on hand-coded SQL along with DBIx::Class's schema objects. This client will guarantee that:

  1. Work done by modules always succeeds or fails in a transaction
  2. A job notifier class will be shown
  3. Pruning of completed jobs will be provided via the Perl module and a second query.

The history of this is that it came from a major client's use of The Schwartz, which they outgrew for scalability reasons. While the basic approach is thus compatible, the following changes have been made:

  1. Job arguments are in json format rather than in Storable format in bytea columns
  2. Highly optimized performance on PostgreSQL
  3. Coalesce is replaced by a single integer cancellation column
  4. Jobs may be requested by batches of various sizes

2.x will support 9.5+ and dispense with the need for both advisory locks and rechecking. I would like to support some sort of graph management as well (i.e. a graph link that goes from one job type to another which specifies "for each x create a job for y" semantics). That is all still in design.

Django Tips #10 AuthenticationForm Custom Login Policy

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 12, 2016.

Since I started working with Django, I never had to spend time implementing authentication-related stuff. The built-in authentication system is great and it's very easy to plug in and get started. And even if you need to customize it, Django makes that easy. That's what this tip is about.

For the built-in login view, Django makes use of django.contrib.auth.forms.AuthenticationForm form to handle the authentication process. Basically it checks username, password and the is_active flag.

Django makes it easy to add custom verifications, as the AuthenticationForm has a method named confirm_login_allowed(user).

For example, if you are handling double opt-in email confirmation and don't want to let users log in until their email address is confirmed, you can do something like this:

forms.py:

from django import forms
from django.contrib.auth.forms import AuthenticationForm

class CustomAuthenticationForm(AuthenticationForm):
    def confirm_login_allowed(self, user):
        if not user.is_active or not user.is_validated:
            raise forms.ValidationError('There was a problem with your login.', code='invalid_login')

urls.py

from django.conf.urls import url
from django.contrib.auth import views as auth_views

from .forms import CustomAuthenticationForm

urlpatterns = [
    url(r'^login/$', auth_views.login, {'template_name': 'core/login.html',
        'authentication_form': CustomAuthenticationForm}, name='login'),
    url(r'^logout/$', auth_views.logout, name='logout'),
    ...
]

Basically it is just a matter of overriding the confirm_login_allowed method and substituting the authentication_form parameter with the new form in the urlconf. You can add any login policy, and to invalidate the authentication simply raise a ValidationError.

django-html-validator - now locally, fast!

By Peter Bengtsson from Django community aggregator: Community blog posts. Published on Aug 12, 2016.

A couple of years ago I released a project called django-html-validator (GitHub link) and it's basically a Django library that takes the HTML generated inside Django and sends it in for HTML validation.

The first option is to send the HTML payload, over HTTPS, to https://validator.nu/. Not only is this slow but it also means sending potentially revealing HTML. Ideally you don't have any passwords in your HTML and if you're doing HTML validation you're probably testing against some test data. But... it sucked.

The other alternative was to download a vnu.jar file from the github.com/validator/validator project and execute it in a subprocess with java -jar vnu.jar /tmp/file.html. The problem with this is that it's really slow, because Java programs take such a long time to boot up.

But then, at the beginning of the year some contributors breathed fresh life into the project. Python 3 support and, best of all, the ability to start vnu.jar as a local server on http://localhost:8888 and HTTP POST HTML over to it. Now you don't have to pay the high cost of booting up a Java program and you don't have to rely on a remote HTTP call.
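
If I read the validator project's docs right, starting the bundled web service locally and posting a file to it looks roughly like this; treat the exact invocation as an assumption, since it may differ between vnu.jar releases:

# start the validator as a local HTTP service on port 8888
java -cp vnu.jar nu.validator.servlet.Main 8888

# post an HTML file to it and get the messages back as JSON
curl -H "Content-Type: text/html; charset=utf-8" \
     --data-binary @page.html \
     "http://localhost:8888/?out=json"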

Now it becomes possible to have HTML validation checked on every rendered HTML response in the Django unit tests.

To try it, check out the new instructions on "Setting the vnu.jar path".

The contributor who's made this possible is Ville "scop" Skyttä, as well as others. Thanks!!

Shaun M. Thomas: PG Phriday: Inevitable Interdiction

From Planet PostgreSQL. Published on Aug 12, 2016.

“Hey! That row shouldn’t be in that table! How the heck did that get there!? Alright, who wrote the application client filters, because you’re fired!”

Good application developers know never to trust client input, but not all realize that a single app is rarely the only vector into a database. Databases don’t just preserve data with various levels of paranoia, they’re also the central nexus of a constellation of apps, scripts, APIs, GUIs, BMIs, HMOs, and STDs. As such, unless every single one of those share a common ancestor that sanitizes, boils, renders, and formats content before storage, there’s bound to be inconsistencies. That’s just how things work.

One of the major benefits of using an RDBMS is that engines like Postgres provide several mechanisms for ensuring data integrity beyond crash durability and transaction control. In our discussion of database objects last week, we vaguely referred to other types of constraint. So, what other mechanisms are available aside from primary keys and unique constraints?

Let’s start with foreign keys, since they illustrate how tables are related and enforce that relation to reject invalid content. Here are two simply related tables and a couple of rows:

CREATE TABLE pet_type
(
  type_id  SERIAL   PRIMARY KEY,
  animal   VARCHAR  UNIQUE NOT NULL
);
 
CREATE TABLE pet
(
  pet_id      SERIAL   PRIMARY KEY,
  type_id     INT      NOT NULL,
  pet_name    VARCHAR  NOT NULL,
  owner_name  VARCHAR  NOT NULL
);
 
ALTER TABLE pet
  ADD CONSTRAINT fk_pet_pet_type FOREIGN KEY (type_id)
      REFERENCES pet_type (type_id);
 
INSERT INTO pet_type (animal) VALUES ('cat'), ('dog');
 
INSERT INTO pet
  (type_id, pet_name, owner_name)
VALUES
  (1, 'Meow Meow Fuzzyface', 'Cedric'),
  (2, 'Mr. Peanutbutter', 'Paul');

Foreign keys provide a buffer between pending modifications by enforcing the relationship. In this case, we can’t remove “dog” as a pet type because at least one pet references it. We also can’t insert a pet fish, because there’s no corresponding type.

The other job the foreign key fills is to normalize the type names. If we had used a VARCHAR column in the pet table instead, we could have types of “Dog”, “dog”, “GDo”, or any number of typos preserved for eternity and forever complicating searches. This helps illustrate and sanitize the relationship and all affected data. Let’s see it in action:

DELETE FROM pet_type WHERE animal = 'dog';
 
ERROR:  update or delete on table "pet_type" violates foreign key
        constraint "fk_pet_pet_type" on table "pet"
 
INSERT INTO pet
  (type_id, pet_name, owner_name)
VALUES
  (3, 'Wanda', 'Lisa');
 
ERROR:  insert or update on table "pet" violates foreign key
        constraint "fk_pet_pet_type"

Foreign keys do have other modes of operation. We could for instance, declare the constraint as ON DELETE CASCADE. That would enable us to delete from the parent table, but a delete cascade would mean every row in any table that referenced the deleted value would also be removed. That’s a fairly dangerous operation, and a pretty good reason it’s not the default in Postgres.
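
For illustration only (not something you would normally want for this schema), the same constraint declared with cascading deletes would look like this:

ALTER TABLE pet DROP CONSTRAINT fk_pet_pet_type;

ALTER TABLE pet
  ADD CONSTRAINT fk_pet_pet_type FOREIGN KEY (type_id)
      REFERENCES pet_type (type_id)
      ON DELETE CASCADE;

-- now removing a pet type silently removes every pet of that type
DELETE FROM pet_type WHERE animal = 'dog';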

Foreign keys are fairly limited in application, however. Beyond describing a relationship and enforcing its integrity, there isn’t much left. If we wanted to impose actual rules on the data itself, we need to go a bit further into the toolbox.

For example, suppose we want to ensure pet birth dates are firmly established in the past. None of the previously mentioned constraints will let us apply arbitrary rules on column values, right? That’s where CHECK constraints come in!

ALTER TABLE pet ADD birth_date DATE;
 
ALTER TABLE pet
  ADD CONSTRAINT ck_pet_no_future
CHECK (CURRENT_DATE - birth_date > 0);
 
INSERT INTO pet
  (type_id, pet_name, owner_name, birth_date)
VALUES
  (1, 'Princess Carolyn', 'Amy', '2017-08-12');
 
ERROR:  new row for relation "pet" violates check
        constraint "ck_pet_no_future"

Not bad, eh? Check constraints are exceptionally useful when there are very simple rules we want to enforce. Maybe prices should always be positive. Perhaps invoice line items should be positive unless they're a credit. There is a lot of potential here, but there's also room for abuse. There's technically no limit on the number of conditionals a check constraint enforces, or the number of checks we prescribe, so we must be judicious or risk performance degradation.

Still, preventing critical data flaws prior to insert is a stupendous capability. Can we go further, though? Of course we can! The final constraint type is for data exclusion. Imagine in our examples that pets can change owners, but can’t be owned by two people simultaneously. Well, we can’t use check constraints for that since they only operate on the current row, and a unique constraint won’t work either.

Let’s watch EXCLUDE handle the situation with ease:

CREATE EXTENSION btree_gist;
 
ALTER TABLE pet ADD owner_range TSRANGE;
 
ALTER TABLE pet
  ADD CONSTRAINT ex_owner_overlap
      EXCLUDE USING GIST (
        pet_name WITH =, owner_range WITH &&
      );
 
INSERT INTO pet_type (animal) VALUES ('horse');
 
INSERT INTO pet
  (type_id, pet_name, owner_name, birth_date, owner_range)
VALUES 
  (3, 'Bojack', 'Will', '1964-06-12', '[1964-06-12,2014-09-07)');
 
INSERT INTO pet
  (type_id, pet_name, owner_name, birth_date, owner_range)
VALUES
  (3, 'Bojack', 'Arnett', '1964-06-12', '[2013-09-07,)');
 
ERROR:  conflicting key value violates exclusion
        constraint "ex_owner_overlap"

There are a couple prerequisites for using exclusion this way, of course. Since the gist index type wasn’t designed to handle types like INT or VARCHAR natively, we need to give it B-Tree capability first with an extension.

Beyond that oddity, we merely added the new ownership date range and then added the constraint itself. The exclusion syntax is to list the column and then the type of operator that should be applied. For ranges, that operator is && to indicate overlap. For our particular example, no pet with the same name can have overlapping ownership ranges. This is a tiny universe indeed!

Exclusion constraints work better for things like scheduling, preventing archive overlaps, and other operations that would be awkward or impossible otherwise. With other database engines, an application might have to search for a date range and other parameters and self-verify that a record is safe to insert. Under this paradigm, any tertiary data vector that isn’t so diligent would be free to ignore scheduling conflicts.

But no application, script, or API can ignore rules the database itself enforces. That’s what constraints are for: those times when being a shared resource is a liability. Judicious application of various constraints can protect as well as describe the data, and make things easier (and safer) for everyone.

And if that isn’t the Postgres motto, maybe it should be.

Craig Ringer: BDR 1.0

From Planet PostgreSQL. Published on Aug 11, 2016.

I’m pleased to say that we’ve just released Postgres-BDR 1.0, based on PostgreSQL 9.4.9.

This release contains significant improvements to DDL replication locking, global sequences, documentation, performance, and more. It also removes the deprecated UDR component in favour of pglogical.

It’s taken a lot of work to get to this point. This release sets the foundation to port BDR to PostgreSQL 9.6 and to enhance its high-availability capabilities, and I’m excited to be pushing BDR forward.

Package of the Week: django-import-export

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 11, 2016.

As the name suggests, this is a library to handle importing and exporting data. The django-import-export library supports multiple formats, including xls, csv, json, yaml, and all other formats supported by tablib. It also has Django admin integration, which is really convenient to use.


Installation

Pip is the way to go:

pip install django-import-export

Update your settings.py:

INSTALLED_APPS = (
    ...
    'import_export',
)

There is also an optional configuration that I usually add:

IMPORT_EXPORT_USE_TRANSACTIONS = True

The default value is False. It determines if the library will use database transactions on data import, just to be on the safe side.


Resources

The django-import-export library works with the concept of a Resource, which is a class definition very similar to how Django handles model forms and admin classes.

In the documentation the authors suggest putting the code related to the resources inside the admin.py file. But if the implementation is not related to the Django admin, I usually prefer to create a new module named resources.py inside the app folder.

models.py

from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=30)
    email = models.EmailField(blank=True)
    birth_date = models.DateField()
    location = models.CharField(max_length=100, blank=True)

resources.py

from import_export import resources
from .models import Person

class PersonResource(resources.ModelResource):
    class Meta:
        model = Person

This is the simplest definition. You can pass several configurations to the Meta class like fields, exclude, etc. See the documentation for more details.
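
For instance, a small sketch of a resource that only handles a subset of the columns (fields and exclude are the Meta options mentioned above; the column names come from the Person model):

from import_export import resources
from .models import Person

class PersonResource(resources.ModelResource):
    class Meta:
        model = Person
        fields = ('id', 'name', 'email')   # only these columns are imported/exported
        # exclude = ('birth_date',)        # ...or blacklist columns instead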


Exporting Data

Exporting data as CSV

from .resources import PersonResource

person_resource = PersonResource()
dataset = person_resource.export()
dataset.csv
id,name,email,birth_date,location
1,John,john@doe.com,2016-08-11,Helsinki
2,Peter,peter@example.com,2016-08-11,Helsinki
3,Maria,maria@gmail.com,2016-08-11,Barcelona
4,Vitor,vitor@freitas.com,2016-08-11,Oulu
5,Erica,erica@gmail.com,2016-08-11,Oulu

Exporting data as JSON

dataset.json
[
  {"id": 1, "name": "John", "email": "john@doe.com", "birth_date": "2016-08-11", "location": "Helsinki"},
  {"id": 2, "name": "Peter", "email": "peter@example.com", "birth_date": "2016-08-11", "location": "Helsinki"},
  {"id": 3, "name": "Maria", "email": "maria@gmail.com", "birth_date": "2016-08-11", "location": "Barcelona"},
  {"id": 4, "name": "Vitor", "email": "vitor@freitas.com", "birth_date": "2016-08-11", "location": "Oulu"},
  {"id": 5, "name": "Erica", "email": "erica@gmail.com", "birth_date": "2016-08-11", "location": "Oulu"}
]

Exporting data as YAML

dataset.yaml
- {birth_date: '2016-08-11', email: john@doe.com, id: 1, location: Helsinki, name: John}
- {birth_date: '2016-08-11', email: peter@example.com, id: 2, location: Helsinki, name: Peter}
- {birth_date: '2016-08-11', email: maria@gmail.com, id: 3, location: Barcelona, name: Maria}
- {birth_date: '2016-08-11', email: vitor@freitas.com, id: 4, location: Oulu, name: Vitor}
- {birth_date: '2016-08-11', email: erica@gmail.com, id: 5, location: Oulu, name: Erica}

Filtering the data

from .resources import PersonResource
from .models import Person

person_resource = PersonResource()
queryset = Person.objects.filter(location='Helsinki')
dataset = person_resource.export(queryset)
dataset.yaml
- {birth_date: '2016-08-11', email: john@doe.com, id: 1, location: Helsinki, name: John}
- {birth_date: '2016-08-11', email: peter@example.com, id: 2, location: Helsinki, name: Peter}
Views Example

Exporting to CSV view:

from django.http import HttpResponse
from .resources import PersonResource

def export(request):
    person_resource = PersonResource()
    dataset = person_resource.export()
    response = HttpResponse(dataset.csv, content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename="persons.csv"'
    return response

Exporting to JSON view:

from django.http import HttpResponse
from .resources import PersonResource

def export(request):
    person_resource = PersonResource()
    dataset = person_resource.export()
    response = HttpResponse(dataset.json, content_type='application/json')
    response['Content-Disposition'] = 'attachment; filename="persons.json"'
    return response

Exporting to Excel view:

from django.http import HttpResponse
from .resources import PersonResource

def export(request):
    person_resource = PersonResource()
    dataset = person_resource.export()
    response = HttpResponse(dataset.xls, content_type='application/vnd.ms-excel')
    response['Content-Disposition'] = 'attachment; filename="persons.xls"'
    return response

Importing Data

Consider the file new_persons.csv:

name,email,birth_date,location,id
Jessica,jessica@jones.com,2016-08-11,New York,
Mikko,mikko@suomi.com,2016-08-11,Jyväskyla,

The id column must be present because it is the primary key, but its values will be generated, so we don't need to specify them.

import.html

{% extends 'base.html' %}

{% block content %}
  <form method="post" enctype="multipart/form-data">
    {% csrf_token %}
    <input type="file" name="myfile">
    <button type="submit">Upload</button>
  </form>
{% endblock %}

views.py

from django.shortcuts import render
from tablib import Dataset

from .resources import PersonResource

def simple_upload(request):
    if request.method == 'POST':
        person_resource = PersonResource()
        dataset = Dataset()
        new_persons = request.FILES['myfile']

        imported_data = dataset.load(new_persons.read())
        result = person_resource.import_data(dataset, dry_run=True)  # Test the data import

        if not result.has_errors():
            person_resource.import_data(dataset, dry_run=False)  # Actually import now

    return render(request, 'core/simple_upload.html')

Django Admin

Simply use ImportExportModelAdmin instead of ModelAdmin.

admin.py

from import_export.admin import ImportExportModelAdmin
from django.contrib import admin
from .models import Person

@admin.register(Person)
class PersonAdmin(ImportExportModelAdmin):
    pass

And you will already notice the Import and Export buttons.

Django Admin Import Export

The import functionality comes with a nice diff when importing existing items:

Django Admin Import Export


This is a great Django library. There is much more you can do with it. Totally worth having a look at the API reference.

gabrielle roth: PDXPUG: August meeting in two days

From Planet PostgreSQL. Published on Aug 11, 2016.

When: 6-8pm Thursday August 18, 2016
Where: iovation
Who: Brian Panulla, Marty Zajac, Gabrielle Roth
What: ETL Throwdown (or up)

For August, we’re having a panel discussion on various ETL tools we’ve tried.
Brian: moving from bulk ETL to near real time ETL with Tungsten Replicator to replicate from MySQL to PostgreSQL and Solr
Marty: Pentaho
Gabrielle: CloverETL, home-grown solution using postgres_fdw, and a brief rant about Informatica if we have time.


If you have a job posting or event you would like me to announce at the meeting, please send it along. The deadline for inclusion is 5pm the day before the meeting.

Our meeting will be held at iovation, on the 32nd floor of the US Bancorp Tower at 111 SW 5th (5th & Oak). It’s right on the Green & Yellow Max lines. Underground bike parking is available in the parking garage; outdoors all around the block in the usual spots. No bikes in the office, sorry!

iovation provides us a light dinner (usually sandwiches or pizza).

Elevators open at 5:45 and building security closes access to the floor at 6:30.

See you there!


Kaarel Moppel: Logging of data modifications and the “log_statement” configuration parameter

From Planet PostgreSQL. Published on Aug 10, 2016.

PostgreSQL has a bagful of server configuration parameters (249 according to my count for version 9.5) at your disposal, which is mostly a good thing, as it enables you to get the maximum out of your hardware if you're willing to put in the necessary time. But some of the parameters might leave the door open […]

The post Logging of data modifications and the “log_statement” configuration parameter appeared first on Cybertec - The PostgreSQL Database Company.

Exploring Django Utils #1

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 10, 2016.

Exploring the Django source code I ended up discovering some really nice utility functions that I wasn't aware of. I thought about sharing them with you guys in the form of a reference-like article. There is great stuff there, so I decided to break this post into a few parts.


Crypto

Module: django.utils.crypto

get_random_string

Calling it without any parameters defaults the length to 12.

from django.utils.crypto import get_random_string

get_random_string()
'5QxAxqyhsyJM'

You may pass the number of characters you want:

get_random_string(50)
'lrWYnyxhnXpwmjHDzmdgTFaIi1j73cKD5fPDOPwuVBmmKxITYF'

And also the collection of characters the random string will have:

get_random_string(12, '0123456789')
'805379737758'

Date

Module: django.utils.dates

Basically it is a collection of commonly-used date structures.

WEEKDAYS
from django.utils.dates import WEEKDAYS
WEEKDAYS = {
    0: _('Monday'), 1: _('Tuesday'), 2: _('Wednesday'), 3: _('Thursday'), 4: _('Friday'),
    5: _('Saturday'), 6: _('Sunday')
}
WEEKDAYS_ABBR
from django.utils.dates import WEEKDAYS_ABBR
WEEKDAYS_ABBR = {
    0: _('Mon'), 1: _('Tue'), 2: _('Wed'), 3: _('Thu'), 4: _('Fri'),
    5: _('Sat'), 6: _('Sun')
}
MONTHS
from django.utils.dates import MONTHS
MONTHS = {
    1: _('January'), 2: _('February'), 3: _('March'), 4: _('April'), 5: _('May'), 6: _('June'),
    7: _('July'), 8: _('August'), 9: _('September'), 10: _('October'), 11: _('November'),
    12: _('December')
}

DateFormat

Module: django.utils.dateformat

The implementation of PHP date() style date formatting, which is used in the Django template filter date formatter. This is a great utility module.

Refer to the Date Template Filter reference guide for a list of format codes semantically ordered.

format

For the following examples, consider the now = timezone.now().

from django.utils.dateformat import format
from django.utils import timezone

now = timezone.now()  # datetime.datetime(2016, 8, 10, 20, 32, 36, 461069, tzinfo=<UTC>)
format(now, 'd M Y')
'10 Aug 2016'

Date and time:

format(now, 'd/m/Y H:i')
'10/08/2016 20:32'

DateParse

Module: django.utils.dateparse

Convert a formatted date string into a date/time/datetime object. If the string is well formatted but represents an invalid date or time, the function will return None.

parse_date
from django.utils.dateparse import parse_date

parse_date('2016-08-10')
datetime.date(2016, 8, 10)
parse_time
from django.utils.dateparse import parse_time

parse_time('20:43:02')
datetime.time(20, 43, 2)
parse_datetime
from django.utils.dateparse import parse_datetime

parse_datetime('2016-08-10 20:32:36')
datetime.datetime(2016, 8, 10, 20, 32, 36)

HTML

Module: django.utils.html

urlize

Utility to turn URLs into <a> tags.

from django.utils.html import urlize

urlize('You guys should visit this website www.google.com')
'You guys should visit this website <a href="http://www.google.com">www.google.com</a>'

It also works with emails:

urlize('Send me a message to vitor@freitas.com')
'Send me a message to <a href="mailto:vitor@freitas.com">vitor@freitas.com</a>'

You can also trim the size of the link:

urlize('Visit the new Snippets section https://simpleisbetterthancomplex.com/snippets/', 30)
'Visit the new Snippets section <a href="https://simpleisbetterthancomplex.com/snippets/">https://simpleisbetterthanc...</a>'

That’s it for now. I hope you may find some of those useful. I will cover more modules in a future article.

Dritte Hamburger Python Unconference

By Arne Brodowski from Django community aggregator: Community blog posts. Published on Aug 10, 2016.

Python Unconference Hamburg

The 3rd Python Unconference Hamburg takes place from September 9 to 11, 2016 at the Technische Universität Hamburg-Harburg (TUHH).

More than 100 Python users and developers from Hamburg and the rest of the world are expected, from greenhorns to proven experts in fields ranging from mathematics, data science, system administration and DevOps to web development and Python core development; every conceivable facet of the Python world will be represented.

As in previous years, the unconference runs barcamp-style on Saturday and Sunday, with talks and discussions from all areas of the Python world. Friday again offers room for sprints, tutorials and workshops. The current Friday programme can be found here.

Tickets are available here.

More information is available at www.pyunconf.de.

Paul Ramsey: Your Broken PostGIS Upgrade

From Planet PostgreSQL. Published on Aug 10, 2016.

Since the Dawn of Time, people have found PostGIS upgrades difficult and confusing, and this is entirely to be expected, because a PostGIS upgrade consists of a number of interlocking parts. Sometimes, they “upgrade” their version of PostGIS and find out they’ve bricked their system. What gives?

Your Broken PostGIS Upgrade

What Makes PostGIS Work?

Before talking about upgrades, it’s important to understand how PostGIS works at all, because that understanding is key to seeing how upgrade scenarios go bad.

PostGIS is a “run-time loadable library” for PostgreSQL. That means we have a block of C code that is added to a running PostgreSQL database. That C code sits in a “library file” which is named (for the current 2.2 version): postgis-2.2.so.

Just to add to the confusion: for Windows, the name of the library file is postgis-2.2.dll. For every rule, there must be an exception. For users of Apple OSX, yes, there’s a further exception for you: even though most dynamic libraries on OSX are suffixed .dylib, the PostgreSQL modules on OSX are suffixed .so, just like their Linux counterparts.

The location of the postgis-2.2.so file will vary from system to system.

The presence of the postgis-2.2.so alone is not sufficient to “PostGIS enable” a database. PostGIS consists of a large collection of SQL functions in the database.

The SQL functions are created when you run the CREATE EXTENSION postgis command. Until that time your database knows nothing about the existence or definition of the PostGIS functions.

Once the extension is installed, you can see the definitions of the PostGIS functions in the system tables.

The use of dynamic function and type management catalogs is one of the things which makes PostgreSQL so incredibly flexible for extensions like PostGIS

SELECT * 
  FROM pg_proc 
  WHERE proname = 'st_pointonsurface';
-[ RECORD 1 ]---+--------------------
proname         | st_pointonsurface
pronamespace    | 2200
proowner        | 10
prolang         | 13
procost         | 100
prorows         | 0
provariadic     | 0
protransform    | -
proisagg        | f
proiswindow     | f
prosecdef       | f
proleakproof    | f
proisstrict     | t
proretset       | f
provolatile     | i
pronargs        | 1
pronargdefaults | 0
prorettype      | 667466
proargtypes     | 667466
proallargtypes  | 
proargmodes     | 
proargnames     | 
proargdefaults  | 
prosrc          | pointonsurface
probin          | $libdir/postgis-2.2
proconfig       | 
proacl          | 

Lots to see here, but most important bit is the entry for the probin column: $libdir/postgis-2.2. This function (like all the other PostGIS functions) is bound to a particular version of the PostGIS C library.

Those of you thinking forward can now begin to see where upgrades could potentially go wrong.

How Things Go Wrong

Package Managers

The most common way for things to go wrong is to upgrade the library on the system without upgrading the database.

So, in Red Hat Linux terms, perhaps running:

yum upgrade postgresql94-postgis

This seems straight-forward, but think about what a package manager does during an upgrade:

  • Downloads a new version of the software
  • Removes the old version
  • Copies in the new version

So, if we had PostGIS 2.1.3 installed, and the latest version is 2.2.2, what has happened?

  • The postgis-2.1.so file has been removed
  • The postgis-2.2.so file has been added
  • So, the pg_proc entries in every PostGIS-enabled database now point to a library file that does not exist

Fortunately this mismatch between the pg_proc entries and the system state is usually solved during the very next step of the upgrade. But it’s a manual step, and if the DBA and system administrator are different people with different schedules, it might not happen.

Your next step should be to go and update the SQL function definitions by running an extension update on all your databases:

ALTER EXTENSION postgis UPDATE TO '2.2.2';

If you don’t, you’ll find that none of the PostGIS functions work. That, in fact, you cannot even dump your database. The very act of outputting a representation of the geometry data is something that requires the PostGIS C library file, and until you run ALTER EXTENSION the database doesn’t know where the new library file is.

Migrations

Since the use of CREATE EXTENSION postgis (available since PostgreSQL 9.1+ and PostGIS 2.0+) became commonplace, migrations now almost always “just work”, which is excellent news.

  • When you dump a modern PostGIS-enabled database, that was created using the CREATE EXTENSION postgis command, the dump file just includes a CREATE EXTENSION postgis command of its own at the top.
  • When you load the dump file into a new version of PostgreSQL even with a new version of PostGIS, the extension is created and the data magically loads.

However, there are still some old databases around that were created before the PostgreSQL extension system was invented, and when you dump them you get not only the data, but all the "custom" function and type definitions, including the definitions for PostGIS. A function definition looks like this:

CREATE OR REPLACE FUNCTION ST_PointOnSurface(geometry)
    RETURNS geometry
    AS '$libdir/postgis-2.2', 'pointonsurface'
    LANGUAGE 'c' IMMUTABLE STRICT; 

And look what is hiding inside of it: a reference to a particular version of the PostGIS library! So you cannot simply dump your old PostGIS 1.5 database on PostgreSQL 8.4 and load it into a fresh PostGIS 2.2 database on PostgreSQL 9.5: the function definitions won’t reference the right library file.

The best bet for a really old database that was created without the extension mechanism is to use the “hard upgrade” process. The hard upgrade works by:

  • Taking a special “custom-format” back-up that includes an object catalog;
  • Filtering the back-up to clean out all the PostGIS-specific function and object definitions; and then
  • Loading the “cleaned” back-up into a new database with the desired version of PostGIS already installed (using CREATE EXTENSION postgis this time, so you never have to hard upgrade again).
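
In practice that usually looks something like the sketch below, using the postgis_restore.pl script that ships with PostGIS (script location, connection options and file paths will vary by version and packaging):

# 1. take a custom-format backup of the old database
pg_dump -Fc -b -v -f olddb.backup olddb

# 2. create a fresh database with the new PostGIS installed as an extension
createdb newdb
psql -d newdb -c 'CREATE EXTENSION postgis;'

# 3. filter out the old PostGIS definitions while restoring
perl postgis_restore.pl olddb.backup | psql newdb 2> restore_errors.log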

Hacks

In the case of upgrades that change out the underlying library and other situations that result in a mismatch between the SQL definitions in the database and the state of the system, there are a couple hacks that provide short-term fixes for emergencies:

  • Symlink the library name the database is looking for to the library name you have. So if your database wants postgis-2.1.so and all you have is postgis-2.2.so, you can ln -s postgis-2.2.so postgis-2.1.so and your database will “work” again.
  • Update the PostgreSQL catalog definitions for the functions. As a super-user, you can do all kinds of dangerous things, and one of them is to just UPDATE pg_proc SET probin = '$libdir/postgis-2.2' WHERE probin ~ 'postgis-2.1'

Both hacks “work” because the PostGIS project doesn’t change underlying function names often, and inter-version changes mostly involve adding functions to the C library, not removing old ones.

However, there's no guarantee that an underlying function name hasn't changed between versions, it's just unlikely. In the worst case, the function name hasn't changed, but the parameters have, so it's now possible that calling the function will crash your database.

All this to say: linking and SQL catalogue hacks should be used temporarily only until you can properly upgrade your database using a hard upgrade.

How to Deploy Django Applications on Heroku

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 09, 2016.

Heroku is a cloud application platform, basically a Platform-as-a-Service (PaaS) platform. They support several programming languages, including Python. It is very easy to deploy Django applications on Heroku. They also offer a free plan, which is quite limited, but it is great to get started and to host demos of Django applications.


Install Heroku Toolbelt

Actually, first things first: sign up for Heroku. Then install the Heroku Toolbelt. It is a command line tool to manage your Heroku apps.

After installing the Heroku Toolbelt, open a terminal and login to your account:

$ heroku login
Enter your Heroku credentials.
Email: vitor@simpleisbetterthancomplex.com
Password (typing will be hidden):
Authentication successful.

Preparing the Application

In this tutorial I will deploy an existing project, Bootcamp. It's an open-source Django project I developed a couple of years ago, and it's also available on GitHub, so you can actually clone the repository and try it by yourself.

Basically things will work better if you are already using Git. The Heroku deployment process is done through Git. Your application will be stored in a remote Git repository in the Heroku Cloud.

Anyway, here is the list of things you will probably need to add to your project:

  • Add a Procfile in the project root;
  • Add requirements.txt file with all the requirements in the project root;
  • Add Gunicorn to requirements.txt;
  • A runtime.txt to specify the correct Python version in the project root;
  • Configure whitenoise to serve static files.
The Procfile

Create a file named Procfile in the project root with the following content:

web: gunicorn bootcamp.wsgi --log-file -

Note: replace bootcamp with the name of your Django project.

The requirements.txt

If you are using a virtualenv and pip you can simply run:

pip freeze > requirements.txt

Otherwise, list the dependencies like the example below. This is what a requirements.txt looks like:

Django==1.9.8
dj-database-url==0.3.0
dj-static==0.0.6
gunicorn==19.6.0
Unipath==1.0
python-decouple==3
Pillow==3.3.0
Markdown==2.6.6
bleach==1.4.3
psycopg2==2.6.1
whitenoise==3.2
The runtime.txt

Create a file named runtime.txt in the project root, and put the specific Python version your project uses:

python-2.7.12

See the complete list of Heroku Python Runtimes.

Set Up The Static Assets

Configure the STATIC-related parameters on settings.py:

PROJECT_ROOT = os.path.dirname(os.path.abspath(__file__))

# Static files (CSS, JavaScript, Images)
# https://docs.djangoproject.com/en/1.9/howto/static-files/
STATIC_ROOT = os.path.join(PROJECT_ROOT, 'staticfiles')
STATIC_URL = '/static/'

# Extra places for collectstatic to find static files.
STATICFILES_DIRS = (
    os.path.join(PROJECT_ROOT, 'static'),
)

Install the Whitenoise, and don’t forget to update the requirements.txt:

pip install whitenoise

Add the Whitenoise to your Django application in the wsgi.py file:

import os
from django.core.wsgi import get_wsgi_application
from whitenoise.django import DjangoWhiteNoise

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "bootcamp.settings")

application = get_wsgi_application()
application = DjangoWhiteNoise(application)

Update the settings.py

STATICFILES_STORAGE = 'whitenoise.django.GzipManifestStaticFilesStorage'
Optional

Here are two libraries I always use when deploying to Heroku, they are not required but they help a lot:

  • python-decouple: Strictly separates the settings parameters from your source code. I've written a separate article about it.
  • dj-database-url: This simple Django utility allows you to utilize the 12factor inspired DATABASE_URL environment variable to configure your Django application.

You will see how I use them later on.


Deployment

Alright, enough configuration. Let’s get the deployment started.

First, clone the repository you want to deploy:

git clone https://github.com/vitorfs/bootcamp.git && cd bootcamp

Login to Heroku using the toolbelt:

heroku login

Inside the project root, create a Heroku App:

heroku create demo-bootcamp

Output:

Creating ⬢ demo-bootcamp... done
https://demo-bootcamp.herokuapp.com/ | https://git.heroku.com/demo-bootcamp.git

You can omit the app name parameter (in my case, demo-bootcamp), then Heroku will pick a name for you.

Add a PostgreSQL database to your app:

heroku addons:create heroku-postgresql:hobby-dev

Output:

Creating postgresql-metric-21979... done, (free)
Adding postgresql-metric-21979 to demo-bootcamp... done
Setting DATABASE_URL and restarting demo-bootcamp... done, v4
Database has been created and is available

Now login to Heroku Dashboard and access your recently created app:

Heroku demo-bootcamp Application

Click on the Settings menu and then on the button Reveal Config Vars:

Heroku Reveal Config Vars

Now we need to add all the environment variables. This is where python-decouple and dj-database-url come in handy. Here is the list of variables I use on the Bootcamp project:

settings.py

import dj_database_url
from decouple import config

SECRET_KEY = config('SECRET_KEY')
DEBUG = config('DEBUG', default=False, cast=bool)
DATABASES = {
    'default': dj_database_url.config(
        default=config('DATABASE_URL')
    )
}

The DEBUG var defaults to False, and DATABASE_URL is automatically added by Heroku upon PostgreSQL database installation. So in this case I will only need to add the SECRET_KEY:

Heroku Reveal Config Vars

Push to deploy:

git push heroku master

Output:

Counting objects: 1999, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (616/616), done.
Writing objects: 100% (1999/1999), 461.28 KiB | 0 bytes/s, done.
Total 1999 (delta 1139), reused 1983 (delta 1131)
remote: Compressing source files... done.
remote: Building source:
remote:
remote: -----> Python app detected
remote: -----> Installing python-2.7.12
remote:      $ pip install -r requirements.txt

...

remote: -----> Launching...
remote:        Released v5
remote:        https://demo-bootcamp.herokuapp.com/ deployed to Heroku
remote:
remote: Verifying deploy.... done.
To https://git.heroku.com/demo-bootcamp.git
 * [new branch]      master -> master

Migrate the database:

heroku run python manage.py migrate

Output:

Running python manage.py migrate on ⬢ demo-bootcamp... up, run.9352
Operations to perform:
  Apply all migrations: questions, sessions, authentication, articles, feeds, contenttypes, messenger, activities, auth
Running migrations:
  Rendering model states... DONE
  Applying contenttypes.0001_initial... OK
  Applying auth.0001_initial... OK
  Applying questions.0001_initial... OK
  Applying feeds.0001_initial... OK
  Applying articles.0001_initial... OK
  Applying activities.0001_initial... OK
  Applying contenttypes.0002_remove_content_type_name... OK
  Applying auth.0002_alter_permission_name_max_length... OK
  Applying auth.0003_alter_user_email_max_length... OK
  Applying auth.0004_alter_user_username_opts... OK
  Applying auth.0005_alter_user_last_login_null... OK
  Applying auth.0006_require_contenttypes_0002... OK
  Applying auth.0007_alter_validators_add_error_messages... OK
  Applying authentication.0001_initial... OK
  Applying messenger.0001_initial... OK
  Applying sessions.0001_initial... OK

And there you go! Try the URL in a web browser: http://demo-bootcamp.herokuapp.com/.

Bootcamp on Heroku


Wrap up

This is pretty much it. Heroku is a great service, but it is kinda expensive. For some hobby/non-production applications you can find some cheaper alternatives.

Be aware of some limitations of the free tier:

  • The application “sleeps” after 30 minutes of inactivity. The first access after that might be slow.
  • The free PostgreSQL database has a limitation of 10K rows.

Also, if you need to store user-uploaded files (media), you will need an external service to store them. The most common option is Amazon S3. This applies to every tier.
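If you go the S3 route, a common approach is the django-storages package together with boto3 (pip install django-storages boto3). A minimal settings sketch is shown below; the bucket name is a placeholder, the config() calls assume python-decouple as used elsewhere in this post, and you should check the django-storages documentation for the full set of options:

# settings.py -- add 'storages' to INSTALLED_APPS, then:
AWS_ACCESS_KEY_ID = config('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = config('AWS_SECRET_ACCESS_KEY')
AWS_STORAGE_BUCKET_NAME = 'demo-bootcamp-media'  # placeholder bucket name

# route all FileField/ImageField uploads to S3
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'
MEDIA_URL = 'https://%s.s3.amazonaws.com/' % AWS_STORAGE_BUCKET_NAME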

Raghavendra Rao: How to rotate PgBouncer logs in Linux/Windows ?

From Planet PostgreSQL. Published on Aug 08, 2016.

Before doing a deep dive into the subject, a short outline about PgBouncer: it's a lightweight connection pooler for PostgreSQL that dramatically reduces the processing time and resources needed to maintain a large number of client connections to one or more databases. It is typically used to increase the number of user connections that can be handled in a high-performance environment. For more details on installing/configuring PgBouncer, refer to the documentation here.
Like other tools, PgBouncer has a stderr/syslog logging architecture to record connections, disconnections, and pooler errors with different verbosity levels. As of now, most logging goes to a single file, "pgbouncer.log", which grows endlessly. This can become a real risk of making a system unresponsive due to lack of disk space at the log file location. At present, PgBouncer has no built-in configuration to rotate logs based on age or size, which forces users to choose alternative methods. IMO, there are two approaches to handle it:
  1. Configure PgBouncer to log via "syslog" and rely on OS log rotation, or
  2. Configure log rotation of the "pgbouncer.log" file using OS utilities.

Method 1:

It's pretty straightforward to configure syslog in PgBouncer: set "syslog" to 1 (default 0), give a name to prefix the log lines in the OS logs via "syslog_ident" (default 'pgbouncer'), and specify the facility in "syslog_facility" (default 'daemon'). A sample output from my OS logs (/var/log/messages):
Aug 5 16:54:27 raghavt pgbouncer[62549]: C-0x1cdfe60: postgres/postgres@unix(62621):6432 login attempt: db=postgres user=postgres tls=no
Aug 5 16:54:27 raghavt pgbouncer[62549]: S-0x1ce4b10: postgres/postgres@127.0.0.1:5432 new connection to server (from 127.0.0.1:38947)
Aug 5 16:54:27 raghavt pgbouncer[62549]: C-0x1cdfe60: postgres/postgres@unix(62621):6432 closing because: client close request (age=0)
Note: if "syslog" is enabled, comment out or blank the "logfile" parameter, otherwise you will get duplicate logging.
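For reference, here is a minimal sketch of the relevant pgbouncer.ini section for this method, using only the parameters mentioned above (adjust the values to your installation):

[pgbouncer]
; leave "logfile" unset (or commented out) so logging goes only to syslog
;logfile =
syslog = 1
syslog_ident = pgbouncer
syslog_facility = daemon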

Method 2:

Logrotate is an OS utility that can rotate and archive logs systematically to reduce an operating system's disk space requirements. Each log file may be handled daily, weekly, monthly, or when it grows too large. A default configuration file, "/etc/logrotate.conf", defines the rotation age/size/interval. Using this tool, logs can be kept longer with less disk space. Plenty has been written about this utility elsewhere, so I'm jumping directly into the implementation.
First, create a configuration file for the pgbouncer logs in the /etc/logrotate.d/ directory. I have named it "/etc/logrotate.d/pgbouncer", with the following contents:
/var/log/pgbouncer/pgbouncer.log {
      rotate 10
      missingok
      sharedscripts
      notifempty
      nocompress
      size 10m
      daily
      create 0640 postgres postgres
      postrotate
           /bin/kill -HUP `cat /var/pgbouncer-postgres/pgbouncer.pid 2> /dev/null` 2>/dev/null ||true
      endscript
}
About the configuration file: the first line indicates the pgbouncer log file location (the "logfile" parameter value in pgbouncer.ini), and the following lines are the parameters controlling rotation: how many rotated log files to keep (rotate); issue no error and move on to the next log if the file is missing (missingok); which scripts to execute before/after rotation (prerotate/postrotate); whether the pre/post scripts run once or once per log (sharedscripts); do not rotate the log if it is empty (notifempty); whether an old log file should be compressed with gzip after rotation (compress/nocompress); at what size rotation should happen (size); how often to rotate a particular log (daily); and the owner/permissions of the newly created log file (create).

Now we can see new log files rotated at 10M in size. (We can even force a rotation with the command "logrotate -f /etc/logrotate.conf".)

[root@172.16.210.202 pgbouncer]# ls -lrth
total 16K
-rw-r-----. 1 postgres postgres 10M Jul 27 15:30 pgbouncer.log-20160727
-rw-r-----. 1 postgres postgres 11K Jul 27 18:32 pgbouncer.log
That was simple, right? (smile) Now let's check the same on a Windows environment.

On Windows:

I know very little about Windows utilities, so I did some googling and found a Windows utility called "LogRotateWin" which works just like the Linux version of logrotate. For more details, refer to the documentation on installation/configuration/usage here.
Let's see how it works. First, download the ".msi" version of LogRotateWin, available on the site as a "logrotateSetup*.zip" file. Extract and execute the ".msi" file; it installs the utility to "c:\Program Files (x86)\LogRotate". You can find the default configuration file (logrotate.conf) under "c:\Program Files (x86)\LogRotate\Content".
Next, edit the "c:\Program Files (x86)\LogRotate\Content\logrotate.conf" file and specify the full path of the "pgbouncer.log" file with the same rotation parameters. Below is a sample copy of my configuration file, tested on Windows 10. (Note: the parameter values below were chosen just to test the utility.)
c:\Program Files (x86)\LogRotate\Content>more logrotate.conf
"c:\Program Files (x86)\PgBouncer\log\pgbouncer.log" {
rotate 10
copytruncate
create
missingok
sharedscripts
nocompress
size 200k
daily
}
To verify, I forced the log rotation with the "-f" option:
c:\Program Files (x86)\LogRotate>logrotate.exe -f Content\logrotate.conf
logrotate: Force option set to true
Here's the result:
C:\Program Files (x86)\PgBouncer\log>dir
Volume in drive C has no label.
Volume Serial Number is F226-9FFB

Directory of C:\Program Files (x86)\PgBouncer\log

08/08/2016 01:31 PM <DIR> .
08/08/2016 01:31 PM <DIR> ..
08/08/2016 01:31 PM 0 pgbouncer.log
08/08/2016 01:31 PM 6,626 pgbouncer.log.1
08/08/2016 01:31 PM 13,252 pgbouncer.log.2
3 File(s) 19,878 bytes
2 Dir(s) 26,905,051,136 bytes free
Nice, right?!
On most Linux distributions, logrotate runs daily from a cron job using "logrotate.conf"; similarly, on Windows we can schedule a task in Windows Task Scheduler to rotate the logs daily. FYI, I have only explored the "LogRotateWin" utility at a basic level; if you encounter any issue, please post it on the logrotate General Discussion forum.
Thank you for reading. 
--Raghav

Tomas Vondra: PostgreSQL vs. Linux kernel versions

From Planet PostgreSQL. Published on Aug 08, 2016.

I’ve published multiple benchmarks comparing different PostgreSQL versions, for example the performance archaeology talk (evaluating PostgreSQL 7.4 up to 9.4), and all those benchmarks assumed a fixed environment (hardware, kernel, …). Which is fine in many cases (e.g. when evaluating the performance impact of a patch), but in production those things do change over time – you get hardware upgrades, and from time to time you get an update with a new kernel version.

For hardware upgrades (better storage, more RAM, faster CPUs, …), the impact is usually fairly easy to predict, and moreover people generally realize they need to assess the impact by analyzing the bottlenecks on production and perhaps even testing the new hardware first.

But what about kernel updates? Sadly we usually don’t do much benchmarking in this area. The assumption is mostly that new kernels are better than older ones (faster, more efficient, scale to more CPU cores). But is it really true? And how big is the difference? For example, what if you upgrade a kernel from 3.0 to 4.7 – will that affect the performance, and if yes, will the performance improve or not?

From time to time we get reports about serious regressions with a particular kernel version, or sudden improvements between kernel versions. So clearly, kernel versions may affect performance.

I’m aware of a single PostgreSQL benchmark comparing different kernel versions, made in 2014 by Sergey Konoplev in response to recommendations to avoid 3.0 – 3.8 kernels. But that benchmark is fairly old (the last kernel version available ~18 months ago was 3.13, while nowadays we have 3.19 and 4.6), so I’ve decided to run some benchmarks with current kernels (and PostgreSQL 9.6beta1).

PostgreSQL vs. kernel versions

But first, let me discuss some significant differences between policies governing commits in the two projects. In PostgreSQL we have the concept of major and minor versions – major versions (e.g. 9.5) are released roughly once a year, and include various new features. Minor versions (e.g. 9.5.2) only include bugfixes, and are released about every three months (or more frequently, when a serious bug is discovered). So there should be no major performance or behavior changes between minor versions, which makes it fairly safe to deploy minor versions without extensive testing.

With kernel versions, the situation is much less clear. The kernel also has branches (e.g. 2.6, 3.0 or 4.7), but those are by no means equal to “major versions” in the PostgreSQL sense, as they continue to receive new features and not just bugfixes. I’m not claiming that the PostgreSQL versioning policy is somehow automatically superior, but the consequence is that updating between minor kernel versions may easily affect performance significantly or even introduce bugs (e.g. 3.18.37 suffers from OOM issues due to such a non-bugfix commit).

Of course, distributions realize these risks, and often lock the kernel version and do further testing to weed out new bugs. This post however uses vanilla longterm kernels, as available on www.kernel.org.

Benchmark

There are many benchmarks we might use – this post presents a suite of pgbench tests, i.e. a fairly simple OLTP (TPC-B-like) benchmark. I plan to do additional tests with other benchmark types (particularly DWH/DSS-oriented), and I’ll present them on this blog in the future.

Now, back to the pgbench – when I say “collection of tests” I mean combinations of

  • read-only vs. read-write
  • data set size – active set does (not) fit into shared buffers / RAM
  • client count – single client vs. many clients (locking/scheduling)

The values obviously depend on the hardware used, so let’s see what hardware this round of benchmarks was running on:

  • CPU: Intel i5-2500k @ 3.3 GHz (3.7 GHz turbo)
  • RAM: 8GB (DDR3 @ 1333 MHz)
  • storage: 6x Intel SSD DC S3700 in RAID-10 (Linux sw raid)
  • filesystem: ext4 with default I/O scheduler (cfq)

So it’s the same machine I’ve used for a number of previous benchmarks – a fairly small machine, not exactly the newest CPU etc. but I believe it’s still a reasonable “small” system.

The benchmark parameters are:

  • data set scales: 30, 300 and 1500 (so roughly 450MB, 4.5GB and 22.5GB)
  • client counts: 1, 4, 16 (the machine has 4 cores)

For each combination there were 3 read-only runs (15-minute each) and 3 read-write runs (30-minute each). The actual script driving the benchmark is available here (along with results and other useful data).
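For illustration, each run essentially boils down to pgbench invocations along these lines (a rough sketch only; the database name "bench" is a placeholder and the exact options live in the linked script):

# initialize a data set, e.g. the medium scale (300 ~ 4.5GB)
pgbench -i -s 300 bench

# read-only run: 16 clients, 4 worker threads, 15 minutes
pgbench -S -c 16 -j 4 -T 900 bench

# read-write run: 16 clients, 4 worker threads, 30 minutes
pgbench -c 16 -j 4 -T 1800 bench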

Note: If you have significantly different hardware (e.g. rotational drives), you may see very different results. If you have a system that you’d like to test, let me know and I’ll help you with that (assuming I’ll be allowed to publish the results).

Kernel versions

Regarding kernel versions, I’ve tested the latest versions in all longterm branches since 2.6.x (2.6.39, 3.0.101, 3.2.81, 3.4.112, 3.10.102, 3.12.61, 3.14.73, 3.16.36, 3.18.38, 4.1.29, 4.4.16, 4.6.5 and 4.7). There are still a lot of systems running 2.6.x kernels, so it’s useful to know how much performance you might gain (or lose) by upgrading to a newer kernel. I’ve been compiling all the kernels on my own (i.e. using vanilla kernels, no distribution-specific patches), and the config files are in the git repository.

Results

As usual, all the data is available on bitbucket, including

  • kernel .config file
  • benchmark script (run-pgbench.sh)
  • PostgreSQL config (with some basic tuning for the hardware)
  • PostgreSQL logs
  • various system logs (dmesg, sysctl, mount, …)

The following charts show the average tps for each benchmarked case – the results for the three runs are fairly consistent, with ~2% difference between min and max in most cases.

read-only

For the smallest data set, there’s a clear performance drop between 3.4 and 3.10 for all client counts. The results for 16 clients (4x the number of cores), however, more than recover in 3.12.

pgbench-read-only-small

For the medium data set (fits into RAM but not into shared buffers), we can see the same drop between 3.4 and 3.10 but not the recovery in 3.12.

pgbench-read-only-medium

For large data sets (exceeding RAM, so heavily I/O-bound), the results are very different – I’m not sure what happened between 3.10 and 3.12, but the performance improvement (particularly for higher client counts) is quite astonishing.

pgbench-read-only-large

read-write

For the read-write workload, the results are fairly similar. For the small and medium data sets we can observe the same ~10% drop between 3.4 and 3.10, but sadly no recovery in 3.12.

pgbench-read-write-small

pgbench-read-write-medium

For the large data set (again, significantly I/O bound) we can see similar improvement in 3.12 (not as significant as for the read-only workload, but still significant):

pgbench-read-write-large

Summary

I don’t dare to draw conclusions from a single benchmark on a single machine, but I think it’s safe to say:

  • The overall performance is fairly stable, but we can see some significant performance changes (in both directions).
  • With data sets that fit into memory (either into shared_buffers or at least into RAM) we see a measurable performance drop between 3.4 and 3.10. On read-only test this partially recovers in 3.12 (but only for many clients).
  • With data sets exceeding memory, and thus primarily I/O-bound, we don’t see any such performance drops but instead a significant improvement in 3.12.

As for the reasons why those sudden changes happen, I’m not quite sure. There are many possibly-relevant commits between the versions, but I’m not sure how to identify the correct one without extensive (and time consuming) testing. If you have other ideas (e.g. are aware of such commits), let me know.

How to Export to PDF

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 08, 2016.

There are a few ways to export data to a PDF file using Django. All of them require a third-party library to generate the file itself. First I will show how to return a PDF response, which can also be used if you are just serving an existing PDF file. Then I will show how to use ReportLab and WeasyPrint.


Writing a PDF to Response

In the example below I’m using Django’s FileSystemStorage class. It will also work if you simply use open() instead. By default, FileSystemStorage looks for files under the project’s MEDIA_ROOT and uses MEDIA_URL as its base_url.

from django.core.files.storage import FileSystemStorage
from django.http import HttpResponse, HttpResponseNotFound

def pdf_view(request):
    fs = FileSystemStorage()
    filename = 'mypdf.pdf'
    if fs.exists(filename):
        with fs.open(filename) as pdf:
            response = HttpResponse(pdf, content_type='application/pdf')
            response['Content-Disposition'] = 'attachment; filename="mypdf.pdf"'
            return response
    else:
        return HttpResponseNotFound('The requested pdf was not found in our server.')

This way the user will be prompted with the browser’s open/save file dialog. If you want to display the PDF in the browser instead, you can change the Content-Disposition to:

response['Content-Disposition'] = 'inline; filename="mypdf.pdf"'

Using ReportLab

Installation
pip install reportlab

It relies on Pillow, a third-party Python imaging library. Sometimes it is a pain to get it installed. If you are running into problems, please refer to Pillow’s installation guide.

Usage

A “Hello, World” from Django docs:

from io import BytesIO
from reportlab.pdfgen import canvas
from django.http import HttpResponse

def write_pdf_view(request):
    response = HttpResponse(content_type='application/pdf')
    response['Content-Disposition'] = 'inline; filename="mypdf.pdf"'

    buffer = BytesIO()
    p = canvas.Canvas(buffer)

    # Start writing the PDF here
    p.drawString(100, 100, 'Hello world.')
    # End writing

    p.showPage()
    p.save()

    pdf = buffer.getvalue()
    buffer.close()
    response.write(pdf)

    return response

Below is an example of writing a PDF using SimpleDocTemplate and Paragraph:

from django.core.files.storage import FileSystemStorage
from django.http import HttpResponse

from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch

def write_pdf_view(request):
    doc = SimpleDocTemplate("/tmp/somefilename.pdf")
    styles = getSampleStyleSheet()
    Story = [Spacer(1,2*inch)]
    style = styles["Normal"]
    for i in range(100):
       bogustext = ("This is Paragraph number %s.  " % i) * 20
       p = Paragraph(bogustext, style)
       Story.append(p)
       Story.append(Spacer(1,0.2*inch))
    doc.build(Story)

    fs = FileSystemStorage("/tmp")
    with fs.open("somefilename.pdf") as pdf:
        response = HttpResponse(pdf, content_type='application/pdf')
        response['Content-Disposition'] = 'attachment; filename="somefilename.pdf"'
        return response

ReportLab is a great library. It is very versatile. But there are a lot of configurations and settings to get the printing right. So make sure you really want to go down that road.

Refer to the ReportLab Userguide for more references about the functions, classes and all the available resources.


Using WeasyPrint

Installation
pip install WeasyPrint

Downside: the huge list of dependencies. If you were lucky enough to get it installed smoothly, move on. If that’s not the case, please refer to the official WeasyPrint installation guide.

Usage

A good thing about WeasyPrint is that you can convert an HTML document to a PDF. So you can create a regular Django template, render and format all the contents, and then pass it to the WeasyPrint library to do the job of creating the PDF.

pdf_template.html

<html>
<head>
  <title>Test</title>
  <style>
    body {
      background-color: #f7f7f7;
    }
  </style>
</head>
<body>
  <h1>Test Template</h1>
  {% for paragraph in paragraphs %}
    <p>{{ paragraph }}</p>
  {% endfor %}
</body>
</html>

views.py

from django.core.files.storage import FileSystemStorage
from django.http import HttpResponse
from django.template.loader import render_to_string

from weasyprint import HTML

def html_to_pdf_view(request):
    paragraphs = ['first paragraph', 'second paragraph', 'third paragraph']
    html_string = render_to_string('core/pdf_template.html', {'paragraphs': paragraphs})

    html = HTML(string=html_string)
    html.write_pdf(target='/tmp/mypdf.pdf')

    fs = FileSystemStorage('/tmp')
    with fs.open('mypdf.pdf') as pdf:
        response = HttpResponse(pdf, content_type='application/pdf')
        response['Content-Disposition'] = 'attachment; filename="mypdf.pdf"'
        return response

Output PDF File

PDF Output

You can also have a look at the official API reference.

Andrew Dunstan: Using EXPLAIN json format output in plpgsql

From Planet PostgreSQL. Published on Aug 08, 2016.

It's possible to use EXPLAIN output in plpgsql. In the default text format, the result comes back as a set of text values, so you can process them in a loop like this:

declare
   exp text;
begin
   for exp in explain myquery
   loop
      raise notice '%', exp;
   end loop;
end;


If you use json format output, however, the result comes back as a single json document rather than a set of lines. You still might need a loop - I haven't found another way yet of getting the output from EXPLAIN into a variable - but the loop will only have one iteration. Here is an example taken from a function I wrote the other day that lets you get the estimated number of rows from the plan:


declare
   exp json;
begin
   for exp in explain (format json) myquery
   loop
      raise notice 'rows: %', exp#>>'{0,Plan,Plan Rows}';
   end loop;
end;
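Putting that together, a sketch of such a helper function might look like the following (the function name and the dynamic EXECUTE are my own additions, not necessarily what the original function looked like):

create or replace function estimated_rows(query text) returns bigint
language plpgsql as $$
declare
   exp json;
begin
   -- EXPLAIN returns a single json document, so this loop runs exactly once
   for exp in execute 'explain (format json) ' || query
   loop
      return (exp #>> '{0,Plan,Plan Rows}')::bigint;
   end loop;
end;
$$;

-- usage: select estimated_rows('select * from pg_class');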

Setting up a Honey Pot with django-admin-honeypot

By GoDjango - Django Screencasts from Django community aggregator: Community blog posts. Published on Aug 05, 2016.

Security is something we often ignore until it is too late. However, there are some things you can do right now that are easy to increase your security. Using django-admin-honeypot is one of those things you can do. It is super easy and provides you with the means of tracking who is trying to access your site.
Watch Now...

Package of the Week: Flake8

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 05, 2016.

Flake8 is a Python library that wraps PyFlakes, pycodestyle and Ned Batchelder’s McCabe script. It is a great toolkit for checking your code base against coding style (PEP8), programming errors (like “library imported but unused” and “Undefined name”) and to check cyclomatic complexity.

If you are not familiar with the term cyclomatic complexity, it is a software metric created by Thomas J. McCabe to measure the number of independent paths through the source code. Generally speaking, the more ifs inside a function, the more paths it will have, and thus the higher its cyclomatic complexity. Of course there are other control flow operations that affect the calculation of cyclomatic complexity. It is also referred to as McCabe complexity.
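As a concrete (made-up) illustration, the function below has several independent paths; each if/elif adds one:

# example.py
def shipping_cost(country, weight):
    # four decision points -> cyclomatic complexity of 5
    if country == 'BR':
        if weight > 10:
            return 50
        return 20
    elif country == 'US':
        return 30
    elif weight == 0:
        return 0
    else:
        return 40

Running flake8 --max-complexity 3 example.py would then flag shipping_cost with a C9** (mccabe) error, since its complexity exceeds the threshold.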

Speaking about coding style, I also like to follow the Django Coding Style on the Django projects I develop. This style standard is only enforced when writing code for inclusion in Django.


Installation

The easiest way to get started is to install it using pip:

pip install flake8

Usage

Inside the Django project dir, run the command:

flake8

Or you can pass the path to a file/dir:

flake8 bootcamp/feeds/

The output will be something like this:

./bootcamp/settings.py:96:80: E501 line too long (91 > 79 characters)
./bootcamp/settings.py:99:80: E501 line too long (81 > 79 characters)
./bootcamp/settings.py:102:80: E501 line too long (82 > 79 characters)
./bootcamp/settings.py:105:80: E501 line too long (83 > 79 characters)
./bootcamp/settings.py:156:80: E501 line too long (94 > 79 characters)
./bootcamp/settings.py:165:5: F403 'from local_settings import *' used; unable to detect undefined names
./bootcamp/urls.py:5:1: F401 'django.contrib.auth.views as auth_views' imported but unused
./bootcamp/urls.py:6:1: F401 'django.core.urlresolvers.reverse' imported but unused
./bootcamp/urls.py:16:80: E501 line too long (88 > 79 characters)
./bootcamp/urls.py:17:80: E501 line too long (84 > 79 characters)
./bootcamp/urls.py:18:80: E501 line too long (87 > 79 characters)
./bootcamp/urls.py:22:80: E501 line too long (80 > 79 characters)
./bootcamp/members/forms.py:5:1: F401 'django.forms.extras.SelectDateWidget' imported but unused
...

The output is formatted as:

file path : line number : column number : error code : short description

Error code prefix:

  • E***/W***: pep8 errors and warnings
  • F***: PyFlakes codes (see below)
  • C9**: McCabe complexity plugin mccabe
  • N8**: Naming Conventions plugin pep8-naming

You can see the full list of error codes in the links below:


Configuring

You can pass some project-based configuration parameters using a setup.cfg file. If you already have one, just edit it. If that is not the case, create a file named setup.cfg in the project root.

[flake8]
exclude = .git,*migrations*
max-line-length = 119

Those are the basic configurations I use in every project. The exclude parameter is used to ignore files/dirs. The default line length is 79; I find it too restrictive, and it sometimes makes the code look worse. So, following the Django coding style guidelines, I stick with 119.

See the full list of options.


In-Line Ignoring Errors

Just add # noqa at the end of the line.

class ProfilesConfig(AppConfig):
    name = 'cmdbox.profiles'
    verbose_name = _('profiles')

    def ready(self):
        import cmdbox.profiles.signals.handlers  # noqa

Or you can pass the specific error code you want to ignore:

import cmdbox.profiles.signals.handlers  # noqa: F401

Running flake8 on Travis CI

The example below is a .travis.yml file from a project of mine. You can run flake8 checks very easily just by adding the commands inside the script list.

.travis.yml

language: python
python:
  - "2.7"
install: "pip install -r requirements.txt"
before_script:
  - cp .env.example .env
  - python manage.py migrate
script:
  - flake8 cmdbox
  - coverage run manage.py test --settings=cmdbox.tests_settings
after_success:
  - coveralls

That’s it! As you can see, it is really easy to start making your Django project PEP8 compliant with flake8.

Django Tips #9 How to Create a Change Password View

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 04, 2016.

This is a very quick tip about how to create a change password view using the built-in PasswordChangeForm.

For this, using a function-based view is easier, because the PasswordChangeForm is slightly different: it does not inherit from ModelForm and it takes a user argument in its constructor.

The example below is functional code to change the password of the authenticated user.

views.py

from django.contrib import messages
from django.contrib.auth import update_session_auth_hash
from django.contrib.auth.forms import PasswordChangeForm
from django.shortcuts import render, redirect

def change_password(request):
    if request.method == 'POST':
        form = PasswordChangeForm(request.user, request.POST)
        if form.is_valid():
            user = form.save()
            update_session_auth_hash(request, user)  # Important!
            messages.success(request, 'Your password was successfully updated!')
            return redirect('accounts:change_password')
        else:
            messages.error(request, 'Please correct the error below.')
    else:
        form = PasswordChangeForm(request.user)
    return render(request, 'accounts/change_password.html', {
        'form': form
    })

The messages.success() and messages.error() calls are optional, but it is a good idea to keep your user aware of what is going on in the application.

Now an important bit is to call update_session_auth_hash() after you save the form. Otherwise the user’s auth session will be invalidated and she/he will have to log in again.

urls.py

from django.conf.urls import url
from myproject.accounts import views

urlpatterns = [
    url(r'^password/$', views.change_password, name='change_password'),
]

change_password.html

<form method="post">
  {% csrf_token %}
  {{ form }}
  <button type="submit">Save changes</button>
</form>

Serving a Customized Site with Moya

By Will McGugan from Django community aggregator: Community blog posts. Published on Aug 03, 2016.

This SO post prompted me to think about how you would go about customizing an entire Django site (not just an app), particularly if it is a third-party project still in development.

Forking the site repository is a solution, but the more customizations you make, the more awkward it becomes to merge upstream fixes and features. The problem is compounded if you have multiple deploys, with custom templates, assets, and settings.

I wanted to do a quick write-up of how this would be done in Moya, where customization is more of a first-class concept.

I'll briefly go over how you would serve a Moya site with a custom template, without modifying any of the original files. The project I'll be customizing is Moya Techblog, which powers this blog.

Install Moya

If you want to follow along, then first install Moya + dependencies and Techblog with the following:

pip install moya requests_oauthlib -U
git clone https://github.com/moyaproject/moya-techblog.git

Make a custom project

Next we want to create a new directory alongside the Techblog checkout, which will contain two files: moya and settings.ini. The former is an empty file that tells Moya this is a project directory, and the latter contains the following settings:

[customize]
location = ../moya-techblog/site

This settings file tells Moya to load the project from an alternative location. The [customize] section acts like a symbolic link of sorts, but there is more going on than that, as we will see below.

The directory structure should now resemble the following:

├── moya-techblog
│   ╰── LOTS OF FILES
╰── mysite
    ├── moya
    ╰── settings.ini

We can initialize and serve the custom site in the usual way:

cd mysite
moya init
moya runserver

This will serve a basic Techblog site. If we want to customize a template from the original project, we can add it to mysite, and as long as it has the same path, Moya will load that file in preference to the original.

Replace a Template

The file we want to customize is templates/willmcgugan.techblog/base.html, which is the base template for the blog pages. We can copy that from the original to the same location within mysite, so our directory structure becomes:

├── moya-techblog
│   ╰── LOTS OF FILES
╰── mysite
    ├── moya
    ├── settings.ini
    ╰── templates
        ╰── willmcgugan.techblog
            ╰── base.html

Now we can edit base.html and the changes will be visible when we serve the project.

The same technique will work with any file from the project, including CSS, JS and even code. You can also add new files that don't exist in the original. For instance, if you have new assets, you can add them to the static folder and they will be served alongside the original files.

Any settings in settings.ini are merged with the equivalent file from the original project, allowing you to configure custom database details / file locations / smtp servers etc. See Building a Project for details.

If you are interested, the magic behind this is achieved with a virtual filesystem.

How to Paginate with Django

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 03, 2016.

As part of its common Web application tools, Django offers a few classes to manage paginated data. You can pass either a list/tuple of objects or a QuerySet. In this tutorial I will show how to paginate data using function-based views and how to paginate using class-based views (ListView).


The Paginator

The paginator classes live in django.core.paginator. We will be working mostly with the Paginator and Page classes.

Consider that the auth.User table has 53 user instances.

from django.contrib.auth.models import User
from django.core.paginator import Paginator

user_list = User.objects.all()
paginator = Paginator(user_list, 10)

In the example above I’m telling Paginator to paginate the user_list QuerySet in pages of 10. This will produce a result with 6 pages: the first 5 pages with 10 users each and the last page with 3 users.

Debugging the Paginator Object
Input                   Output          Type
paginator.count         53              <type 'int'>
paginator.num_pages     6               <type 'int'>
paginator.page_range    xrange(1, 7)    <type 'xrange'>
paginator.page(2)       <Page 2 of 6>   <class 'django.core.paginator.Page'>

The Paginator.page() method will return a given page of the paginated results, which is an instance of Page. This is what we will return to the template.

users = paginator.page(2)
Debugging the Page Object
Input                           Output          Type
users                           <Page 2 of 6>   <class 'django.core.paginator.Page'>
users.has_next()                True            <type 'bool'>
users.has_previous()            True            <type 'bool'>
users.has_other_pages()         True            <type 'bool'>
users.next_page_number()        3               <type 'int'>
users.previous_page_number()    1               <type 'int'>
users.start_index()             11              <type 'int'>
users.end_index()               20              <type 'int'>

The Page.next_page_number() and Page.previous_page_number() methods raise InvalidPage if the next/previous page doesn’t exist.

The Page.start_index() and Page.end_index() methods return the 1-based indexes of the first and last objects on the page, relative to the whole list of paginated objects.

>>> users = paginator.page(6)  # last page
<Page 6 of 6>
>>> users.start_index()
51
>>> users.end_index()
53

The process basically consists of querying the database, passing the QuerySet to the Paginator, grabbing a Page and returning it to the template. The rest is done in the template.

Let’s see now some practical examples.


Pagination with Function-Based Views

views.py

from django.contrib.auth.models import User
from django.core.paginator import Paginator, EmptyPage, PageNotAnInteger
from django.shortcuts import render

def index(request):
    user_list = User.objects.all()
    page = request.GET.get('page', 1)

    paginator = Paginator(user_list, 10)
    try:
        users = paginator.page(page)
    except PageNotAnInteger:
        users = paginator.page(1)
    except EmptyPage:
        users = paginator.page(paginator.num_pages)

    return render(request, 'core/user_list.html', { 'users': users })

user_list.html

<table class="table table-bordered">
  <thead>
    <tr>
      <th>Username</th>
      <th>First name</th>
      <th>Email</th>
    </tr>
  </thead>
  <tbody>
    {% for user in users %}
      <tr>
        <td>{{ user.username }}</td>
        <td>{{ user.first_name }}</td>
        <td>{{ user.email }}</td>
      </tr>
    {% endfor %}
  </tbody>
</table>

{% if users.has_other_pages %}
  <ul class="pagination">
    {% if users.has_previous %}
      <li><a href="?page={{ users.previous_page_number }}">&laquo;</a></li>
    {% else %}
      <li class="disabled"><span>&laquo;</span></li>
    {% endif %}
    {% for i in users.paginator.page_range %}
      {% if users.number == i %}
        <li class="active"><span>{{ i }} <span class="sr-only">(current)</span></span></li>
      {% else %}
        <li><a href="?page={{ i }}">{{ i }}</a></li>
      {% endif %}
    {% endfor %}
    {% if users.has_next %}
      <li><a href="?page={{ users.next_page_number }}">&raquo;</a></li>
    {% else %}
      <li class="disabled"><span>&raquo;</span></li>
    {% endif %}
  </ul>
{% endif %}

The result is something like this:

Pagination

The example above is using Bootstrap 3.


Pagination with Class-Based Views

views.py

from django.contrib.auth.models import User
from django.views.generic import ListView

class UserListView(ListView):
    model = User
    template_name = 'core/user_list.html'  # Default: <app_label>/<model_name>_list.html
    context_object_name = 'users'  # Default: object_list
    paginate_by = 10
    queryset = User.objects.all()  # Default: Model.objects.all()

user_list.html

<table class="table table-bordered">
  <thead>
    <tr>
      <th>Username</th>
      <th>First name</th>
      <th>Email</th>
    </tr>
  </thead>
  <tbody>
    {% for user in users %}
      <tr>
        <td>{{ user.username }}</td>
        <td>{{ user.first_name }}</td>
        <td>{{ user.email }}</td>
      </tr>
    {% endfor %}
  </tbody>
</table>

{% if is_paginated %}
  <ul class="pagination">
    {% if page_obj.has_previous %}
      <li><a href="?page={{ page_obj.previous_page_number }}">&laquo;</a></li>
    {% else %}
      <li class="disabled"><span>&laquo;</span></li>
    {% endif %}
    {% for i in paginator.page_range %}
      {% if page_obj.number == i %}
        <li class="active"><span>{{ i }} <span class="sr-only">(current)</span></span></li>
      {% else %}
        <li><a href="?page={{ i }}">{{ i }}</a></li>
      {% endif %}
    {% endfor %}
    {% if page_obj.has_next %}
      <li><a href="?page={{ page_obj.next_page_number }}">&raquo;</a></li>
    {% else %}
      <li class="disabled"><span>&raquo;</span></li>
    {% endif %}
  </ul>
{% endif %}

How to Split Views Into Multiple Files

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 02, 2016.

It is a good idea to keep your app's views module small. A common mistake is doing too much work inside the view functions. For the most part you can create separate modules to do the processing work outside the views, and just import and use those functions inside the views. Another common mistake is a single app implementing many different requirements or serving many purposes; sometimes you can split the work into different apps, when it makes sense of course. But sometimes the views will get big anyway. In this case, you might want to split the views into different files.


Sample Scenario

Before I show you how, please consider the scenario below:

App dir

|∙∙core/
  |∙∙__init__.py
  |∙∙admin.py
  |∙∙migrations/
  |∙∙models.py
  |∙∙tests.py
  |∙∙urls.py
  |∙∙views.py

views.py

from django.shortcuts import render

def view_a(request):
    return render(request, 'view_a.html')

def view_b(request):
    return render(request, 'view_b.html')

def view_c(request):
    return render(request, 'view_c.html')

def view_d(request):
    return render(request, 'view_d.html')

urls.py

from django.conf.urls import url
from . import views

urlpatterns = [
    url(r'^aaa$', views.view_a, name='view_a'),
    url(r'^bbb$', views.view_b, name='view_b'),
    url(r'^ccc$', views.view_c, name='view_c'),
    url(r'^ddd$', views.view_d, name='view_d'),
]

Splitting the Views

This strategy is good if you are refactoring your code base. This way you won’t need to change the way you handle your urls.py. Even though the view functions are in different files, they are still accessible via views.view_a.

App dir

Remove the views.py file and create a directory named views. Add an __init__.py file inside it and create the separate view files.

|∙∙core/
  |∙∙__init__.py
  |∙∙admin.py
  |∙∙migrations/
  |∙∙models.py
  |∙∙tests.py
  |∙∙urls.py
  |∙∙views/
    |∙∙__init__.py
    |∙∙alpha.py
    |∙∙beta.py

views/__init__.py

This is an important step: inside the __init__.py, import everything from each of the view modules.

from .alpha import *
from .beta import *

views/alpha.py

from django.shortcuts import render

def view_a(request):
    return render(request, 'view_a.html')

def view_b(request):
    return render(request, 'view_b.html')

views/beta.py

from django.shortcuts import render

def view_c(request):
    return render(request, 'view_c.html')

def view_d(request):
    return render(request, 'view_d.html')

urls.py

You don’t need to change anything here.

from django.conf.urls import url
from . import views

urlpatterns = [
    url(r'^aaa$', views.view_a, name='view_a'),
    url(r'^bbb$', views.view_b, name='view_b'),
    url(r'^ccc$', views.view_c, name='view_c'),
    url(r'^ddd$', views.view_d, name='view_d'),
]

Or you can simply import the views directly from the different files. The truth is Django doesn’t really care where the view function lives. And the name views is not mandatory, but it is recommended (always think about a new developer starting to work with your code – try to make things easier).

In the example below you may keep views/__init__.py empty, since we are importing the views directly.

from django.conf.urls import url
from .views import alpha
from .views import beta

urlpatterns = [
    url(r'^aaa$', alpha.view_a, name='view_a'),
    url(r'^bbb$', alpha.view_b, name='view_b'),
    url(r'^ccc$', beta.view_c, name='view_c'),
    url(r'^ddd$', beta.view_d, name='view_d'),
]

Please Fix Your Decorators

By Blog of Hynek Schlawack from Django community aggregator: Community blog posts. Published on Aug 01, 2016.

If your Python decorator unintentionally changes the signatures of my callables or doesn’t work with class methods, it’s broken and should be fixed. Sadly most decorators are broken because the web is full of bad advice.

How to Upload Files With Django

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Aug 01, 2016.

In this tutorial you will learn the concepts behind Django file upload and how to handle file uploads using model forms. At the end of this post you will find the source code of the examples I used, so you can try them out and explore.


The Basics of File Upload With Django

When files are submitted to the server, the file data ends up placed in request.FILES.

It is mandatory for the HTML form to have the attribute enctype="multipart/form-data" set correctly. Otherwise the request.FILES will be empty.

The form must be submitted using the POST method.

Django has proper model fields to handle uploaded files: FileField and ImageField.

The files uploaded to FileField or ImageField are not stored in the database but in the filesystem.

FileField and ImageField are created as a string field in the database (usually VARCHAR), containing the reference to the actual file.

If you delete a model instance containing FileField or ImageField, Django will not delete the physical file, but only the reference to the file.

The request.FILES is a dictionary-like object. Each key in request.FILES is the name from the <input type="file" name="" />.

Each value in request.FILES is an UploadedFile instance. I will talk more about it later on.
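For example, here is a minimal view sketch inspecting a few of the UploadedFile attributes (the field name myfile and the view name are just placeholders):

from django.http import HttpResponse

def inspect_upload(request):
    if request.method == 'POST' and request.FILES.get('myfile'):
        myfile = request.FILES['myfile']
        # name, size and content_type are attributes of the UploadedFile instance
        return HttpResponse('name=%s size=%d type=%s' % (
            myfile.name, myfile.size, myfile.content_type))
    return HttpResponse('no file submitted')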

You will need to set MEDIA_URL and MEDIA_ROOT in your project’s settings.py.

MEDIA_URL = '/media/'
MEDIA_ROOT = os.path.join(BASE_DIR, 'media')

In the development server you may serve the user uploaded files (media) using django.contrib.staticfiles.views.serve() view.

from django.conf import settings
from django.conf.urls.static import static

urlpatterns = [
    # Project url patterns...
]

if settings.DEBUG:
    urlpatterns += static(settings.MEDIA_URL, document_root=settings.MEDIA_ROOT)

To access the MEDIA_URL in templates you must add django.template.context_processors.media to your context_processors inside the TEMPLATES config.


Simple File Upload

Following is a minimal file upload example. Use it just to learn about the flow of the process.

simple_upload.html

{% extends 'base.html' %}

{% load static %}

{% block content %}
  <form method="post" enctype="multipart/form-data">
    {% csrf_token %}
    <input type="file" name="myfile">
    <button type="submit">Upload</button>
  </form>

  {% if uploaded_file_url %}
    <p>File uploaded at: <a href="{{ uploaded_file_url }}">{{ uploaded_file_url }}</a></p>
  {% endif %}

  <p><a href="{% url 'home' %}">Return to home</a></p>
{% endblock %}

views.py

from django.shortcuts import render
from django.conf import settings

def simple_upload(request):
    if request.method == 'POST':
        # use .get() to avoid a KeyError when no file was submitted
        if request.FILES.get('myfile'):
            myfile = request.FILES['myfile']

            path = u'{0}/{1}'.format(settings.MEDIA_ROOT, myfile.name)
            with open(path, 'wb+') as destination:
                for chunk in myfile.chunks():
                    destination.write(chunk)

            uploaded_file_url = u'{0}{1}'.format(settings.MEDIA_URL, myfile.name)
            return render(request, 'core/simple_upload.html', {
                'uploaded_file_url': uploaded_file_url
            })

    return render(request, 'core/simple_upload.html')

Note I’m using the MEDIA_ROOT and MEDIA_URL to build the path and uploaded_file_url respectively.

  • myfile.name: example.txt
  • path: /Users/vitorfs/Development/simple-file-upload/media/example.txt
  • uploaded_file_url: /media/example.txt

So, path is where I’m actually saving the file in the filesystem, and uploaded_file_url is the URL I will use to serve this file in my Django application.


File Upload With Model Forms

Now, this is a much more convenient way. Model forms perform validation, automatically build the absolute path for the upload, and handle filename conflicts and other common tasks.

models.py

from django.db import models

class Document(models.Model):
    description = models.CharField(max_length=255, blank=True)
    document = models.FileField(upload_to='documents/')
    uploaded_at = models.DateTimeField(auto_now_add=True)

forms.py

from django import forms
from uploads.core.models import Document

class DocumentForm(forms.ModelForm):
    class Meta:
        model = Document
        fields = ('description', 'document', )

views.py

from django.shortcuts import render, redirect

from uploads.core.forms import DocumentForm  # assuming forms.py lives in the same app as models.py

def model_form_upload(request):
    if request.method == 'POST':
        form = DocumentForm(request.POST, request.FILES)
        if form.is_valid():
            form.save()
            return redirect('home')
    else:
        form = DocumentForm()
    return render(request, 'core/model_form_upload.html', {
        'form': form
    })

model_form_upload.html

{% extends 'base.html' %}

{% block content %}
  <form method="post" enctype="multipart/form-data">
    {% csrf_token %}
    {{ form.as_p }}
    <button type="submit">Upload</button>
  </form>

  <p><a href="{% url 'home' %}">Return to home</a></p>
{% endblock %}

About the FileField upload_to Parameter

See the example below:

document = models.FileField(upload_to='documents/')

Note the upload_to parameter. The files will be automatically uploaded to MEDIA_ROOT/documents/.

It is also possible to do something like:

document = models.FileField(upload_to='documents/%Y/%m/%d/')

A file uploaded today would be uploaded to MEDIA_ROOT/documents/2016/08/01/.

The upload_to can also be a callable that returns a string. This callable accepts two parameters, instance and filename.

def user_directory_path(instance, filename):
    # file will be uploaded to MEDIA_ROOT/user_<id>/<filename>
    return 'user_{0}/{1}'.format(instance.user.id, filename)

class MyModel(models.Model):
    upload = models.FileField(upload_to=user_directory_path)

Download the Examples

The code used in this post is available on Github.

git clone https://github.com/sibtc/simple-file-upload.git
pip install django
python manage.py migrate
python manage.py runserver

Django + Elastic Search + haystack = Powering your website with search functionality – part 2

By Book of Stranger from Django community aggregator: Community blog posts. Published on Jul 31, 2016.

Search is one of the most important and powerful functionalities in web applications these days. It helps users easily find content, products etc. on your website. Without a search option, users have to find their own way to the desired content, which no one likes to do. Just imagine Amazon without a search bar: you would have to navigate through various categories to find a product, and it might take you forever.

Having said that, implementing search in your web app is not difficult, thanks to the many libraries built for this purpose. I have broken this tutorial down into two parts. The first article explains how to set up Elasticsearch and Haystack with Django, create search indexes and query data. In this article, we will take one step forward and discuss using autocomplete, spelling suggestions and creating custom backends. If you haven’t read part 1, I suggest you do that first and then jump to this article.

Remember the SearchIndex we created in part 1 of this tutorial; we are going to continue modifying it in this article.

import datetime
from haystack import indexes
from myapp.models import Blog

class BlogIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Blog

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects.filter(pub_date__lte=datetime.datetime.now())

Spelling Suggestions

Many times your users will type a wrong spelling of a word and press search. At this point, it’s wise to show the corrected spelling and display search results according to it. You can also give users the option to search for the entered words only. Google search also works like that, as shown below.

Screen Shot 2016-08-01 at 12.57.49 AM

Here the input from the user is “book of strangrer”, as shown in red, but Google suggests and shows results for “book of stranger”, which is highlighted in green. It also gives the user an option to still search for the entered value (as shown in blue). Let’s talk about how to achieve this in your Django app.

To enable the spelling suggestion functionality in Haystack, first you need to create a special field in your SearchIndex which mirrors the content of the “text” field. The modified SearchIndex is shown below

class BlogIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')
    suggestions = indexes.FacetCharField()

    def prepare(self, obj):
        prepared_data = super(BlogIndex, self).prepare(obj)
        prepared_data['suggestions'] = prepared_data['text']
        return prepared_data
   
    def get_model(self):
        return Blog

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects.filter(pub_date__lte=datetime.datetime.now())

You also need to enable spelling support in the Haystack settings. Just modify your Django settings file to set INCLUDE_SPELLING as shown below:

HAYSTACK_CONNECTIONS = {
  'default': {
    'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
    'URL': 'http://127.0.0.1:9200/',
    'INDEX_NAME': 'haystack',
    'INCLUDE_SPELLING':True,
  },
}

You are all set to show spelling suggestions now. You can query for the suggested search term like this:

suggestedSearchTerm = SearchQuerySet().spelling_suggestion(searchTerm)

Short and simple right 🙂
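In a search view, this typically looks something like the sketch below (the view, template name and query parameter are placeholders, not part of the original article):

from django.shortcuts import render
from haystack.query import SearchQuerySet

def search(request):
    searchTerm = request.GET.get('q', '')
    results = SearchQuerySet().auto_query(searchTerm)
    suggestedSearchTerm = SearchQuerySet().spelling_suggestion(searchTerm)
    return render(request, 'search/results.html', {
        'query': searchTerm,
        'results': results,
        'suggestion': suggestedSearchTerm,
    })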

 

Autocomplete Functionality

Autocomplete is becoming an important feature whenever search functionality is present. Implementing this using django-haystack is not difficult at all. First you need to prepare data, and then you need to implement the search to fetch these suggestions.

You have to create an additional field in your SearchIndex that contains the text you want to autocomplete on. This can be an NgramField or an EdgeNgramField; the two have different uses.

  • EdgeNgramField tokenizes on whitespace, which prevents incorrect matching across two words. Most of the time this is the one you should use.
  • However, if you need to autocomplete across word boundaries, then you should use NgramField.

Your SearchIndex needs to be modified as shown below:

class BlogIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')
    suggestions = indexes.FacetCharField()
    text_auto = indexes.EdgeNgramField(model_attr='getAutocompleteText')

Here, getAutocompleteText is a function in the Blog model which supplies the data for autocompletion (a minimal sketch of it is shown at the end of this section). You can query these results using the SearchQuerySet.autocomplete method:

SearchQuerySet().autocomplete(text_auto='old')
# Result match things like 'goldfish', 'cuckold' and 'older'.

The above query can also be written as

SearchQuerySet().filter(text_auto='old')  # Result match things like 'goldfish', 'cuckold' and 'older'.

That’s it, your site has the autocomplete functionality now.
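For completeness, getAutocompleteText could be as simple as a model method returning the text you want indexed for autocomplete (the title field below is an assumption about the Blog model, not taken from the original article):

from django.db import models

class Blog(models.Model):
    title = models.CharField(max_length=255)
    # ... other fields ...

    def getAutocompleteText(self):
        # the text that the EdgeNgramField will tokenize for autocomplete
        return self.title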

 

Creating Custom Backend

Often you may need to extend the existing backend to incorporate certain functionality. For example, the default min_gram value for the edgeNGram tokenizer is 2, but what if you want to change it to 3? This can be achieved by extending the Elasticsearch backend and overriding these values. I am just changing the min_gram value for demonstration purposes; you are free to modify other functionality as well.

First, let's define the custom Elasticsearch settings in the Django settings file. You can see the default settings of the Elasticsearch backend in the official Haystack source code.

ELASTICSEARCH_INDEX_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_edgengram"]
                }
            },
            "tokenizer": {
                "haystack_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 3,
                    "max_gram": 15,
                    "side": "front"
                }
            },
            "filter": {
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 3,
                    "max_gram": 15
                }
            }
        }
    }
}
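To make the effect of min_gram concrete, here is a tiny, purely illustrative snippet that mimics how an edge n-gram filter slices a token from the front. This is not how Elasticsearch implements it; it just shows why min_gram=3 drops the two-character grams.

# Illustrative only -- approximates what the edgeNGram filter produces
def edge_ngrams(token, min_gram, max_gram):
    # Generate front-anchored n-grams from min_gram up to max_gram characters.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(edge_ngrams('gold', 2, 15))  # ['go', 'gol', 'gold']
print(edge_ngrams('gold', 3, 15))  # ['gol', 'gold']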

 

Next, I am going to create a new custom backend file; let's call it my_elasticsearch_backend.py.

from django.conf import settings

from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend, ElasticsearchSearchEngine


class ConfigurableSearchBackend(ElasticsearchSearchBackend):

    def __init__(self, connection_alias, **connection_options):
        super(ConfigurableSearchBackend, self).__init__(connection_alias, **connection_options)
        # Replace the backend's built-in index settings with the ones
        # defined in the Django settings file above.
        setattr(self, 'DEFAULT_SETTINGS', settings.ELASTICSEARCH_INDEX_SETTINGS)


class ConfigurableSearchEngine(ElasticsearchSearchEngine):
    backend = ConfigurableSearchBackend

This extended backend just replaces the ‘DEFAULT_SETTINGS’ with your own custom settings.

Now you just need to reference this new backend in the Haystack connection settings.

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'home.my_elasticsearch_backend.ConfigurableSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
        'INCLUDE_SPELLING': True,
    },
}

 

Make sure to rebuild/update your index after making these changes.

While updating the index, make sure to add the --remove argument; otherwise Haystack will keep entries in the index even after the corresponding model instances have been deleted from the database.

./manage.py update_index --remove
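If you changed the index schema itself (for example, by adding the suggestions or text_auto fields above), a full rebuild is usually the safer option:

./manage.py rebuild_index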

 

I hope you find this article helpful. Let me know if you have any suggestions/ feedback in the comments section below.

Fun fact: Game of Thrones season 6 is back, and its episode 4 is also titled "Book of the Stranger" 🙂

I'm Gonna Regret This

By chrism from . Published on Jun 14, 2016.

A plea for liberals to fight for individual rights.

Is Open Source Consulting Dead?

By chrism from . Published on Sep 10, 2013.

Has Elvis left the building? Will we be able to sustain ourselves as open source consultants?

Consulting and Patent Indemnification

By chrism from . Published on Aug 09, 2013.

Article about consulting and patent indemnification

Python Advent Calendar 2012 Topic

By chrism from . Published on Dec 24, 2012.

An entry for the 2012 Japanese advent calendar at http://connpass.com/event/1439/

Why I Like ZODB

By chrism from . Published on May 15, 2012.

Why I like ZODB better than other persistence systems for writing real-world web applications.

A str.__iter__ Gotcha in Cross-Compatible Py2/Py3 Code

By chrism from . Published on Mar 03, 2012.

A bug caused by a minor incompatibility can remain latent for long periods of time in a cross-compatible Python 2 / Python 3 codebase.

In Praise of Complaining

By chrism from . Published on Jan 01, 2012.

In praise of complaining, even when the complaints are absurd.

2012 Python Meme

By chrism from . Published on Dec 24, 2011.

My "Python meme" replies.

In Defense of Zope Libraries

By chrism from . Published on Dec 19, 2011.

A much too long defense of Pyramid's use of Zope libraries.

Plone Conference 2011 Pyramid Sprint

By chrism from . Published on Nov 10, 2011.

An update about the happenings at the recent 2011 Plone Conference Pyramid sprint.

Jobs-Ification of Software Development

By chrism from . Published on Oct 17, 2011.

Try not to Jobs-ify the task of software development.

WebOb Now on Python 3

By chrism from . Published on Oct 15, 2011.

Report about porting to Python 3.

Open Source Project Maintainer Sarcastic Response Cheat Sheet

By chrism from . Published on Jun 12, 2011.

Need a sarcastic response to a support interaction as an open source project maintainer? Look no further!

Pylons Miniconference #0 Wrapup

By chrism from . Published on May 04, 2011.

Last week, I visited the lovely Bay Area to attend the 0th Pylons Miniconference in San Francisco.

Pylons Project Meetup / Minicon

By chrism from . Published on Apr 14, 2011.

In the SF Bay Area on the 28th, 29th, and 30th of this month (April), 3 separate Pylons Project events.