
Open Source Posts

Ship It Day Q1 2017

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jan 18, 2017.

Last Friday, Caktus set aside client projects for our regular quarterly ShipIt Day. From gerrymandered districts to RPython and meetup planning, the team started off 2017 with another great ShipIt.

Books for the Caktus Library

Liza uses Delicious Library to track books in the Caktus Library. However, that catalog wasn't visible to the team, so Scott used the FTP export feature of Delicious Library to serve the content on our local network. He dockerized Caddy, deployed it to our local Dokku PaaS platform, and now serves the catalog over HTTPS, allowing the team to see the status of the Caktus Library.

Property-based testing with Hypothesis

Vinod researched property-based testing in Python. Traditionally it's more commonly used with functional programming languages, but Hypothesis brings the concept to Python. He also learned about new Django features, including the testing optimizations introduced with setUpTestData.
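
To give a flavor of the approach (a generic illustration, not Vinod's actual experiment), Hypothesis generates many inputs that try to falsify a property you state about your code:

from hypothesis import given
from hypothesis import strategies as st


@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    # Property: sorting an already-sorted list changes nothing.
    once = sorted(xs)
    assert sorted(once) == once

Run under a test runner such as pytest, Hypothesis generates dozens of example lists and shrinks any failing case down to a minimal counterexample.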

Caktus Wagtail Demo with Docker and AWS

David looked into migrating a Heroku-based Wagtail deployment to a container-driven deployment using Amazon Web Services (AWS) and Docker. Using Tobias' AWS Container Basics isolated Elastic Container Service stack, David created a Dockerfile for Wagtail and deployed it to AWS. Down the road, he'd like to make performance issues easier to debug and to integrate the setup with GitLab CI.

Local Docker Development

During Code for Durham Hack Nights, Victor noticed that local development setup was a barrier to entry for new team members. To help mitigate this issue, he researched using Docker for local development with the Durham School Navigator project. In the end, he used Docker Compose to run a multi-container Docker application with PostgreSQL, NGINX, and Django.

Caktus Costa Rica

Daryl, Nicole, and Sarah really like the idea of opening a branch Caktus office in Costa Rica and drafted a business plan to do so! Including everything from an executive summary, to operational and financial plans, the team researched what it would take to run a team from Playa Hermosa in Central America. Primary criteria included short distances to an airport, hospital, and of course, a beach. They even found an office with our name, the Cactus House. Relocation would be voluntary!

Improving the GUI test runner: Cricket

Charlotte M. likes to use Cricket to see test results in real time and have the ability to easily re-run specific tests, which is useful for quickly verifying fixes. However, she encountered a problem causing the application to crash sometimes when tests failed. So she investigated the problem and submitted a fix via a pull request back to the project. She also looked into adding coverage support.

Color your own NC Congressional District

Erin, Mark, Basia, Neil, and Dmitriy worked on an app that visualizes and teaches you about gerrymandered districts. The team ran a mini workshop to define goals and personas and to help prioritize the day's tasks using agile user story mapping. The app provides background information on gerrymandering and uses data from the NC State Board of Elections to illustrate how slight changes to districts can vastly impact the election of state representatives. The site uses D3, an excellent tool for rendering GeoJSON geospatial data, for its visualizations. In the future they hope to add features to compare districts and overlay demographic data.

Releasing django_tinypng

Dmitriy worked on testing and documenting django_tinypng, a simple Django library that allows optimization of images using TinyPNG. He published the app to PyPI so it's easily installable via pip.

Learning Django: The Django Girls Tutorial

Gerald and Graham wanted to sharpen their Django skills by following the Django Girls Tutorial. Gerald learned a lot from the tutorial and enjoyed the format, including how it steps through blocks of code while describing the syntax. He also learned about how the Django Admin is configured. Graham knew that following tutorials can sometimes be a rocky process, so he worked together with Gerald so they could talk through problems, and Graham was able to learn by reviewing and helping.

Planning a new meetup for Digital Project Management

When Elizabeth first entered the Digital Project Management field several years ago, there were not a lot of resources available specifically for digital project managers. Most information was related to more traditional project management, or the PMP. She attended the 2nd Digital PM Summit with her friend Jillian, and loved the general tone of openness and knowledge sharing (they also met Daryl and Ben there!). The Summit was a wonderful resource. Elizabeth wanted to bring the spirit of the Summit back to the Triangle, so during Ship It Day, she started planning for a new meetup, including potential topics and meeting locations. One goal is to allow remote attendance through Google Hangouts, to encourage openness and sharing without having to commute across the Triangle. Elizabeth and Jillian hope to hold their first meetup in February.

Kanban: Research + Talk

Charlotte F. researched Kanban to prepare for a longer talk to illustrate how Kanban works in development and how it differs from Scrum. Originally designed by Toyota to improve manufacturing plants, Kanban focuses on visualizing workflows to help reveal and address bottlenecks. Picking the right tool for the job is important, and one is not necessarily better than the other, so Charlotte focused on outlining when to use one over the other.

Identifying Code for Cleanup

Calvin created redundant, a tool for identifying technical debt. Last ShipIt he was able to locate completely identical files, but he wanted to improve on that. Now the tool can identify functions that are almost the same and/or might be generalizable. It searches for patterns and generates a report of your codebase. He's looking for codebases to test it on!

RPython Lisp Implementation, Revisited

Jeff B. continued exploring how to create a Lisp implementation in RPython, the framework behind the PyPy project. RPython is a restricted subset of the Python language. In addition to learning about RPython, he wanted to better understand how PyPy is capable of performance enhancements over CPython. Jeff also converted his parser to use Alex Gaynor's RPLY project.

Streamlined Time Tracking

At Caktus, time tracking is important, and we've used a variety of tools over the years. Currently we use Harvest, but it can be tedious to use when switching between projects a lot. Dan would like a tool to make this process more efficient. He looked into Project Hamster, but settled on building a new tool. His implementation makes it easy to switch between projects with a single click. It also allows users to sync daily entries to Harvest.

How I Deploy Django Day-to-Day

By GoDjango - Django Screencasts from Django community aggregator: Community blog posts. Published on Jan 18, 2017.

There are a lot of ways to deploy Django, so I think it is one of those topics people are really curious about: how do other people do it? Generally, in all deploys you need to get the latest code, run migrations, collect your static files, and restart web server processes. How you do those steps is the interesting part.
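
As a purely illustrative sketch of those generic steps (not the workflow from the video), here is what they might look like with Fabric 1.x, assuming a made-up project path and a systemd-managed Gunicorn service:

# fabfile.py -- the host, paths, and service name below are hypothetical
from fabric.api import cd, env, run, sudo

env.hosts = ['example.com']


def deploy():
    with cd('/srv/myproject'):
        run('git pull origin master')                          # latest code
        run('venv/bin/pip install -r requirements.txt')
        run('venv/bin/python manage.py migrate --noinput')     # migrations
        run('venv/bin/python manage.py collectstatic --noinput')
    sudo('systemctl restart gunicorn')                         # restart web processes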

In today's video I go over how I deploy Django day to day, followed by some other ways I have done it. This is definitely a topic you can make as easy or as complicated as you want.

Here is the link again: https://www.youtube.com/watch?v=43lIXCPMw_8?vq=hd720

Why does Django not email me the 500 internal server error?

By Django deployment from Django community aggregator: Community blog posts. Published on Jan 18, 2017.

You’ve set your EMAIL_* settings correctly, and when you try to send emails with django.core.mail.send_mail() it works. However, Django still does not send you internal server errors. Why?

The sender’s email address matters

SERVER_EMAIL is the email address from which emails with error messages appear to come; it is set in the “From:” field of the email. The default is “root@localhost”, and while “root” is OK, “localhost” is not, and some mail servers may refuse the email. The domain name where your Django application runs is usually OK, but if this doesn’t work you can use any other valid domain; the domain of your own email address should work properly.

→ Set SERVER_EMAIL

→ When testing with django.core.mail.send_mail(), use the same sender email address as the one you’ve specified in SERVER_EMAIL.
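
A minimal sketch of both recommendations, using example.com as a stand-in for your real domain:

# settings.py
SERVER_EMAIL = 'noreply@example.com'   # example.com stands in for your domain

# Then, from "python manage.py shell", test with the same sender address:
from django.core.mail import send_mail
send_mail('Error email test', 'hello, world',
          'noreply@example.com',        # same sender as SERVER_EMAIL
          ['you@example.com'])          # ideally one of your ADMINS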

The recipient’s email address matters

Because of spam, mail servers are often very picky about which emails they will accept. It’s possible that even if your smarthost accepts the email, the next mail server may refuse it. For example, I made some experiments using EMAIL_HOST = 'mail.runbox.com' and this command:

send_mail('Hello', 'hello, world', 'noreply@example.com', ['anthony@itia.ntua.gr'])

In that case, Runbox accepted the email and subsequently attempted to deliver it to the mail server of ntua.gr, which rejected it because it didn’t like the sender (noreply@example.com; I literally used “example.com”, and ntua.gr didn’t like that domain). When something like this happens, send_mail() will appear to work, because send_mail() manages to deliver the email to the smarthost, and the error occurs after that; not only will we never receive the email, but it is also likely that we will not receive the failure notification (the returned email), so it’s often hard to know what went wrong and we need to guess.

One thing you can do to lessen the probability of error is to make sure that the recipient has an email address served by the provider who provides the smarthost. In my case, the smarthost is mail.runbox.com, and the recipient is antonis@djangodeployment.com, and the email for domain djangodeployment.com is served by Runbox. It is unlikely that mail.runbox.com would accept an email addressed to antonis@djangodeployment.com if another Runbox server were to subsequently refuse it. If something like this happened, I believe it would be a configuration error on behalf of Runbox. But it’s very normal that mail.runbox.com will accept an email which will subsequently be refused by ntua.gr or Gmail or another provider downstream.

→ At least one of the ADMINS should have an email address served by the provider who runs the smarthost.

→ When testing with django.core.mail.send_mail(), use the same recipient email address as one of the ADMINS.

Commas can matter

This will work:

ADMINS = [('John', 'john@example.com')]

This is a common error; it won’t work:

ADMINS = (('John', 'john@example.com'))

You can make it work by adding a comma, like this:

ADMINS = (('John', 'john@example.com'),)

Despite the fact that I have a very sharp eye, I once forgot the comma and the site worked for months without notifying me of errors. Therefore, use the foolproof way:

→ Specify ADMINS as a list, not as a tuple.

Test whether it sends errors

A favourite way of mine is to temporarily rename a template file and make a related request, which will raise a TemplateDoesNotExist exception. Your browser should show the “server error” page. Don’t forget to rename the template file back to what it was. By the time you finish doing that, you should have received the email with the full trace.

→ Temporarily rename a template file and make a related request in order to test whether errors are emailed OK.

Django can’t notify you that it can’t notify you

If something goes wrong while Django attempts to send an error email, it fails silently. This behaviour is appropriate (the system is in error, it attempts to email its administrator with the exception, but sending the email also results in an error; it can't do much more). Suppose, however, that when you try to verify that error emails work, you find out they don't. What has gone wrong? Nothing is written in any log. Intercepting the communication with ngrep won't work either, because it's usually encrypted. In my book, I recommend using a locally installed mail server. If you do so, you will at least be able to look at the local mail server's logs.

I don’t recommend using exim or postfix, as they are quite complicated. Instead, I recommend dma. To make it work, you’ll also need django-sendmail-backend.

→ Use a local mail server


The post Why does Django not email me the 500 internal server error? appeared first on Django deployment.

Bruce Momjian: Using SSL Certificates

From Planet PostgreSQL. Published on Jan 17, 2017.

Having covered SSL certificate creation and the use of certificate authorities (CA), I would like to put it all together and show how certificates and certificate authorities work to ensure trusted Postgres communication.

I have created a diagram showing server, client, and certificate authority certificates. None of these certificates is secret, e.g. the server sends its SSL certificate to the client, and vice versa. In the diagram, the server and client use the same certificate authority certificate. (Intermediate certificate authorities could also be used.)

When the client connects, the server sends its certificate to the client. The client uses the public key in its certificate authority certificate to verify that the server certificate was signed by its trusted certificate authority (the red line). It then uses the public key in the server certificate to encrypt a secret key that is sent to the server. Only a server with the matching private key can reply to generate a session key. It is not the possession of the server certificate that proves the server's identity but the possession of the private key that matches the public key stored in the server's certificate. The same is true for client certificates used for client host and user authentication (the blue line).

Continue Reading »

Marco Slot: Parallel indexing in Citus

From Planet PostgreSQL. Published on Jan 17, 2017.

Indexes are an essential tool for optimizing database performance and are becoming ever more important with big data. However, as the volume of data increases, index maintenance often becomes a write bottleneck, especially for advanced index types which use a lot of CPU time for every row that gets written. Index creation may also become prohibitively expensive as it may take hours or even days to build a new index on terabytes of data in postgres. As of Citus 6.0, we’ve made creating and maintaining indexes that much faster through parallelization.

Citus can be used to distribute PostgreSQL tables across many machines. One of the many advantages of Citus is that you can keep adding more machines with more CPUs such that you can keep increasing your write capacity even if indexes are becoming the bottleneck. As of Citus 6.0 CREATE INDEX can also be performed in a massively parallel fashion, allowing fast index creation on large tables. Moreover, the COPY command can write multiple rows in parallel when used on a distributed table, which greatly improves performance for use-cases which can use bulk ingestion (e.g. sensor data, click streams, telemetry).

To show the benefits of parallel indexing, we’ll walk through a small example of indexing ~200k rows containing large JSON objects from the GitHub archive. To run the examples, we set up a formation using Citus Cloud consisting of 4 worker nodes with 4 cores each, running PostgreSQL 9.6 with Citus 6.

You can download the sample data by running the following commands:

wget http://examples.citusdata.com/github_archive/github_events-2015-01-01-{0..24}.csv.gz
gzip -d github_events-*.gz

Next, let's create the table for the GitHub events once as a regular PostgreSQL table and then distribute it across the 4 nodes:

CREATE TABLE github_events (
    event_id bigint,
    event_type text,
    event_public boolean,
    repo_id bigint,
    payload jsonb,
    repo jsonb,
    actor jsonb,
    org jsonb,
    created_at timestamp
);

-- (distributed table only) Shard the table by repo_id 
SELECT create_distributed_table('github_events', 'repo_id');

-- Initial data load: 218934 events from 2015-01-01
\COPY github_events FROM PROGRAM 'cat github_events-*.csv' WITH (FORMAT CSV)

Each event in the GitHub data set has a detailed payload object in JSON format. Building a GIN index on the payload gives us the ability to quickly perform fine-grained searches on events, such as finding commits from a specific author. However, building such an index can be very expensive. Fortunately, parallel indexing makes this a lot faster by using all cores at the same time and building many smaller indexes:

CREATE INDEX github_events_payload_idx ON github_events USING GIN (payload);
|                           | Regular table | Distributed table | Speedup |
|---------------------------|---------------|-------------------|---------|
| CREATE INDEX on 219k rows |         33.2s |              2.6s |     13x |

To test how well this scales we took the opportunity to run our test multiple times. Interestingly, parallel CREATE INDEX exhibits superlinear speedups giving >16x speedup despite having only 16 cores. This is likely due to the fact that inserting into one big index is less efficient than inserting into a small, per-shard index (following O(log N) for N rows), which gives an additional performance benefit to sharding.

|                           | Regular table | Distributed table | Speedup |
|---------------------------|---------------|-------------------|---------|
| CREATE INDEX on 438k rows |         55.9s |              3.2s |     17x |
| CREATE INDEX on 876k rows |        110.9s |              5.0s |     22x |
| CREATE INDEX on 1.8M rows |        218.2s |              8.9s |     25x |

Once the index is created, the COPY command also takes advantage of parallel indexing. Internally, COPY sends a large number of rows over multiple connections to different workers asynchronously, which then store and index the rows in parallel. This allows for much faster load times than a single PostgreSQL process could achieve. How much speedup depends on the data distribution. If all data goes to a single shard, performance will be very similar to PostgreSQL.

\COPY github_events FROM PROGRAM 'cat github_events-*.csv' WITH (FORMAT CSV)
|                         | Regular table | Distributed table | Speedup |
|-------------------------|---------------|-------------------|---------|
| COPY 219k rows no index |         18.9s |             12.4s |    1.5x |
| COPY 219k rows with GIN |         49.3s |             12.9s |    3.8x |

Finally, it’s worth measuring the effect that the index has on query time. We try two different queries, one across all repos and one with a specific repo_id filter. This distinction is relevant to Citus because the github_events table is sharded by repo_id. A query with a specific repo_id filter goes to a single shard, whereas the other query is parallelised across all shards.

-- Get all commits by test@gmail.com from all repos
SELECT repo_id, jsonb_array_elements(payload->'commits')
  FROM github_events
 WHERE event_type = 'PushEvent' AND 
       payload @> '{"commits":[{"author":{"email":"test@gmail.com"}}]}';

-- Get all commits by test@gmail.com from a single repo
SELECT repo_id, jsonb_array_elements(payload->'commits')
  FROM github_events
 WHERE event_type = 'PushEvent' AND
       payload @> '{"commits":[{"author":{"email":"test@gmail.com"}}]}' AND
       repo_id = 17330407;

On 219k rows, this gives us the query times below. Times marked with * are of queries that are executed in parallel by Citus. Parallelisation creates some fixed overhead, but also allows for more heavy lifting, which is why it can either be much faster or a bit slower than queries on a regular table.

|                                       | Regular table | Distributed table |
|---------------------------------------|---------------|-------------------|
| SELECT no indexes, all repos          |         900ms |             68ms* |
| SELECT with GIN on payload, all repos |           2ms |             11ms* |
| SELECT no indexes, single repo        |         900ms |              28ms |
| SELECT with indexes, single repo      |           2ms |               2ms |

Indexes in PostgreSQL can dramatically reduce query times, but at the same time dramatically slow down writes. Citus gives you the possibility of scaling out your cluster to get good performance on both sides of the pipeline. A particular sweet spot for Citus is parallel ingestion and single-shard queries, which gives querying performance that is better than regular PostgreSQL, but with much higher and more scalable write throughput.

If you would like to learn more about parallel indexes or other ways in which Citus helps you scale, consult our documentation or give us a ping on slack. You can also get started with Citus in minutes by setting up a managed cluster in Citus Cloud or spinning up a cluster on your desktop.

New Django Admin with DRF and EmberJS... What's new?

By LevIT's blog from Django community aggregator: Community blog posts. Published on Jan 17, 2017.

A couple months ago, I published a post titled "Yes I want a new admin". Now that more than a few weeks have gone by, I want to bring you up to speed on the progress.

In the original post, it was mentioned that, in order to achieve this goal, several libraries would be needed and that some of these libraries were already published.

Among them were DRF-auto-endpoint and djember-model.
During Django Under The Hood's sprints, we got together with several people interested in the concepts behind those two libraries, merged them into a single library (DRF-schema-adapter), and added some functionality. Beyond being a merge of the two "old" libraries, DRF-schema-adapter also brings the concept of adapters, which makes it possible to use the same library with different frontends.
DRF-schema-adapter pursues 3 main goals:

  • Making it as easy, fast, and straightforward to define an API endpoint as it is to define a ModelAdmin class.
  • Letting developers be DRYer by generating frontend code (models, resources, stores, ...) based on your DRF endpoints.
  • Providing enough information to the client to be entirely dynamic.

Although there are still some improvements planned for the library, all 3 of these goals have been achieved.
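
To give a concrete sense of the first goal, here is roughly the boilerplate that defining a single endpoint by hand in plain DRF involves (the Employee model and its fields are invented for illustration); DRF-schema-adapter aims to make this as terse as registering a ModelAdmin:

# Plain Django REST Framework, written out by hand for a hypothetical Employee model
from rest_framework import routers, serializers, viewsets

from .models import Employee


class EmployeeSerializer(serializers.ModelSerializer):
    class Meta:
        model = Employee
        fields = ('id', 'first_name', 'last_name', 'company')


class EmployeeViewSet(viewsets.ModelViewSet):
    queryset = Employee.objects.all()
    serializer_class = EmployeeSerializer


router = routers.DefaultRouter()
router.register(r'employees', EmployeeViewSet)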

As far as the DRF backend is concerned, another library has been introduced: DRF-Base64 which allows for base64-encoded files to be uploaded to DRF.

Regarding the frontend, last time I mentioned ember-cli-dynamic-model (to dynamically load frontend models and metadata generated by the backend). Even though I still consider this library to be in alpha, I have been using it successfully for a few months now; the next steps will be to clean up the code and add some more tests before it can be considered stable.

Still on the frontend, another library has also been introduced: ember-cli-crudities.
Ember-cli-crudities is mostly a set of widgets, as well as a change-list and a change-form that behave much like the original Django admin change-list and change-form.

So... where does it stand compared to the original goals?

"Themability" of the admin

The new admin is based on Bootstrap 3 and is therefore pretty easy to skin. For the future, there are also plans to support Material Design. But a few pictures are better than words, so let me share with you a screenshot of the same form rendered with 3 different themes:

  • LevIT theme

  • Bootstrap theme

  • Django admin theme

On those screenshots, you might also have noticed a few perks like the form-disposition (which doesn't require any templating) or the translatable fields (compatibility with django-modeltranslation).

Multi-level nesting

This goal is also fully achieved out-of-the-box; here is another sample screenshot demonstrating the capability (Company has a OneToMany relationship towards Employee which, in turn, has a OneToMany relationship towards ContactMechanism):

Once again, in this screenshot, you might have noticed a couple extra perks like tabular content and sortable inlines, both available out-of-the-box.

Dynamic update

This goal has also been achieved.

As you can see on the first screenshot, on a brand new record, the employee I just added to the company is already available as "Default contact".

On the second and third screenshot, you can see an extra field dynamically added to the form depending on the selected "Sale Type".

And... what about all the functionalities of the existing admin?

Django's admin is packed with functionality, and there is probably some of it I've never heard of, but as far as I can tell this DRF-EmberJS admin comes pretty close. Off the top of my head, here are a few things that I know are still missing or should be improved:

  • i18n support for the interface
  • better filter widgets
  • not provided by django admin - frontend validation (right now form validation happens on the backend)
  • "Save and add another"
  • "Save as new"
  • More field widgets
  • Python package installable through pip for the whole admin.

Where can I try it?

All of the screenshots on this post have been taken from a sample application available at https://djembersample.pythonanywhere.com (Thanks PythonAnywhere), you are more than welcome to try it out and play with it (as well as report bugs if you happen to find some).

You can also clone the git repository for the sample app and run it locally where you'll be able to make changes to the exposed models, create new ones or change the active theme.

If you are interested in knowing more about DRF-schema-adapter or ember-cli-crudities, stay tuned for more posts about those libraries in the near future.

Have a great day!

Generate PDF from HTML in django using weasyprint

By Micropyramid django from Django community aggregator: Community blog posts. Published on Jan 16, 2017.

In many web development projects you may want to automate file generation, for example order confirmation receipts or payment receipts, based on a template you are using.

The library we will be using is WeasyPrint. WeasyPrint lets you combine multiple pieces of information into an HTML template and then convert it to a PDF document.

The supported versions are Python 2.7 and 3.3+.

WeasyPrint has a lot of dependencies, so it is best installed with pip:

    pip install Weasyprint

Once you have installed WeasyPrint, you should have a weasyprint executable. Checking it can be as simple as:

    weasyprint --version

This will print the version of WeasyPrint you have installed.

    weasyprint <Your_Website_URL> <Your_path_to_save_this_PDF>
    Eg: weasyprint http://samplewebsite.com ./test.pdf

    Here I have converted the site http://samplewebsite.com to test.pdf.

Let's write a sample PDF generation:
    from weasyprint import HTML, CSS

    HTML('http://samplewebsite.com/').write_pdf(
        '/localdirectory/test.pdf',
        stylesheets=[CSS(string='body { font-size: 10px }')])

    This also converts the page to PDF. Here the change is that we are writing a custom stylesheet (CSS) for the body to change the font size, using the "string" argument.

    You can also pass a CSS file. This can be done using:

    from django.conf import settings

    CSS(settings.STATIC_ROOT +  'css/main.css')

    Ex: HTML('http://samplewebsite.com/').write_pdf('/localdirectory/test.pdf',
        stylesheets=[CSS(settings.STATIC_ROOT +  'css/main.css')])

    You can also pass multiple CSS files in this stylesheets array.

Generating PDF Using Template:

    Let's create a basic HTML file that we will use as a template to generate the PDF:

    templates/home_page.html

        <html>
         <head>
             <title>Home Page</title>
         </head>
         <body>
          <h1>Hello !!!</h1>
          <p>First Pdf Generation using Weasyprint.</p>
         </body>
        </html>

    Let's write a Django function to render this template in a PDF:

        from weasyprint import HTML, CSS
        from django.template.loader import get_template
        from django.http import HttpResponse

        def pdf_generation(request):
            html_template = get_template('home_page.html')
            rendered_html = html_template.render()  # pass a context dict here if needed
            pdf_file = HTML(string=rendered_html).write_pdf()
            response = HttpResponse(pdf_file, content_type='application/pdf')
            response['Content-Disposition'] = 'filename="home_page.pdf"'
            return response

        Here, we have used the get_template() function to load the HTML template from the template directories, and render() to turn it into a string that WeasyPrint can convert.

        Finally, you can download your home_page.pdf.

Magnus Hagander: Another couple of steps on my backup crusade

From Planet PostgreSQL. Published on Jan 16, 2017.

For a while now, I've been annoyed with how difficult it is to set up good backups in PostgreSQL. The difficulty of doing this "right" has pushed people to use things like pg_dump for backups, which is not really a great option once your database reaches any non-toy size. And when visiting customers over the years I've seen a large number of home-written scripts to do PITR backups, most of them broken, and most of that breakage because the APIs provided were too difficult to use.

Over some time, I've worked on a number of ways to improve this situation, alone or with others. The bigger steps are:

  • 9.1 introduced pg_basebackup, making it easier to take base backups using the replication protocol
  • 9.2 introduced transaction log streaming to pg_basebackup
  • 9.6 introduced a new version of the pg_start_backup/pg_stop_backup APIs that are needed to do more advanced base backups, in particular using third party backup tools.

For 10.0, there are a couple of new things that have been done in the past couple of weeks:

Kaarel Moppel: Two simple Postgres tips to kick-start year 2017

From Planet PostgreSQL. Published on Jan 16, 2017.

While reviewing my notes on some handy Postgres tricks and nasty gotchas to conclude an on-site training course, my “current me” again learned some tricks which an older version of “me” had luckily written down. So here are two simple tricks that hopefully even a lot of Postgres power-users will find surprising. Disabling JOIN re-ordering by […]

The post Two simple Postgres tips to kick-start year 2017 appeared first on Cybertec - The PostgreSQL Database Company.

Michael Paquier: Postgres 10 highlight - Reload of SSL parameters

From Planet PostgreSQL. Published on Jan 15, 2017.

Here is some news from the front of Postgres 10 development, with the highlight being the following commit:

commit: de41869b64d57160f58852eab20a27f248188135
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Mon, 2 Jan 2017 21:37:12 -0500
Allow SSL configuration to be updated at SIGHUP.

It is no longer necessary to restart the server to enable, disable,
or reconfigure SSL.  Instead, we just create a new SSL_CTX struct
(by re-reading all relevant files) whenever we get SIGHUP.  Testing
shows that this is fast enough that it shouldn't be a problem.

In conjunction with that, downgrade the logic that complains about
pg_hba.conf "hostssl" lines when SSL isn't active: now that's just
a warning condition not an error.

An issue that still needs to be addressed is what shall we do with
passphrase-protected server keys?  As this stands, the server would
demand the passphrase again on every SIGHUP, which is certainly
impractical.  But the case was only barely supported before, so that
does not seem a sufficient reason to hold up committing this patch.

Andreas Karlsson, reviewed by Michael Banck and Michael Paquier

Discussion: https://postgr.es/m/556A6E8A.9030400@proxel.se

This has been wanted for a long time. In some environments where Postgres is deployed, there may be CA and/or CRL files installed by default, and the user may want to replace them with custom entries. Still, in most cases the problem to deal with is the replacement of expired keys. In each case, after replacing something that requires a reload of the SSL context, a restart of the instance used to be necessary to rebuild it properly. While it may be fine for some users to pay the cost of an instance restart, users who care about availability do not want to have to take down a server, so this new feature will be helpful to many people.

All the SSL parameters are impacted by this upgrade, and they are the following ones:

  • ssl
  • ssl_ciphers
  • ssl_prefer_server_ciphers
  • ssl_ecdh_curve
  • ssl_cert_file
  • ssl_key_file
  • ssl_ca_file
  • ssl_crl_file

Note however that there are a couple of things to be aware of:

  • On Windows (or builds with EXEC_BACKEND), the new parameters are read at each backend startup. Existing sessions do not have their context updated, and an error in loading the new parameters will cause the connection to fail.
  • Entries of pg_hba.conf with hostssl are ignored if SSL is disabled, and a warning is logged to mention that. In previous versions you would get an error if a hostssl entry was found at server start.
  • The passphrase key prompt is enabled only at server startup, and deactivated at parameter reload so that every backend reloading the SSL context does not get stuck. There could be improvements in this area by using a new GUC parameter that defines a command allowing processes to get the passphrase instead of asking for it in a tty. Patches are welcome if there is a use case for it. This behavior is described in the following commit. As passphrase support has been rather limited for a long time, being able to reload SSL contexts even without it has great value.

This is really a nice feature, and I am happy to see it land, as I have struggled more than once with the downtime that an SSL update induces.

Transcoding with AWS- part four

By Krzysztof Żuraw Personal Blog from Django community aggregator: Community blog posts. Published on Jan 15, 2017.

Now that I have my transcoder up and running, it's time to let users know that their uploaded files were transcoded. For this I will use the AWS SNS service, which allows me to send a notification about the completion of a transcode job.

Setting up AWS SNS to work with AWS Transcoder

After logging in to the AWS console and selecting SNS, I have to create a topic:

SNS topic

A topic is an endpoint for other applications in AWS to send their notifications to. In my case I have to set it in the AWS Transcoder pipeline settings:

Transcoder SNS subscription

The last thing I had to do was create a subscription for the topic created above. There are a lot of subscription types that you can find in the SNS settings, but I will be using HTTP requests.

Receiving notifications from SNS service in Django

The flow of the application will look like this:

  1. The user uploads a file
  2. The file is sent to S3
  3. A transcode job is fired from the upload form view
  4. After the transcode completes, AWS Transcoder sends an SNS notification
  5. This notification is picked up by the SNS subscription and sent to my endpoint
  6. After validating the notification, the endpoint informs the user that his or her files have been transcoded

To receive HTTP notifications I have to create an endpoint in my Django application. First I add a URL in audio_transcoder/urls.py:

url(
      regex=r'^transcode-complete/$',
      view=views.transcode_complete,
      name='transcode-complete'
  )

Code for this endpoint looks as follows (audio_transcoder/views.py):

import json

from django.conf import settings
from django.http import (
    HttpResponse,
    HttpResponseNotAllowed,
    HttpResponseForbidden
)
from django.views.decorators.csrf import csrf_exempt

from .utils import convert_sns_str_to_json


@csrf_exempt
def transcode_complete(request):
    if request.method != 'POST':
        return HttpResponseNotAllowed(['POST'])
    if request.META['HTTP_X_AMZ_SNS_TOPIC_ARN'] != settings.SNS_TOPIC_ARN:
        return HttpResponseForbidden('Not a valid SNS topic ARN')
    json_body = json.loads(request.body.decode('utf-8'),
                           object_hook=convert_sns_str_to_json)
    if json_body['Message']['state'] == 'COMPLETED':
        # do something
        pass
    return HttpResponse('OK')

What is happening there? The first two ifs in transcode_complete check that the user sent a POST request and, as the SNS documentation says, that the message received is valid, since anyone can send a request to this endpoint.

In the line with json_body I have to use a helper that I pass to object_hook:

import json


def convert_sns_str_to_json(obj):
    value = obj.get('Message')
    if value and isinstance(value, str):
        obj['Message'] = json.loads(value)
    return obj

This small function converts nested strings received from SNS into Python dicts. I know that every notification will have a Message key, so based on that I can load the string and convert it to a Python dictionary.
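
To illustrate with a made-up, trimmed-down notification body (real SNS payloads carry more fields), the Message value arrives as a JSON-encoded string and comes back as a nested dict:

import json

# Hypothetical, simplified SNS body: Message is itself a JSON-encoded string.
raw_body = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({"state": "COMPLETED", "jobId": "123"}),
})

parsed = json.loads(raw_body, object_hook=convert_sns_str_to_json)
print(parsed['Message']['state'])   # -> COMPLETED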

The last if will be completed in next blog post.

Right now I have my endpoint up and running. But there is a problem: Amazon SNS needs to have access to that endpoint, and I'm developing this application on my localhost. How to overcome such an issue? I used ngrok, which allows me to tunnel to my localhost from the internet. How to use it? After downloading and unpacking it, you first run:

$ python transcoder/manage.py runserver 0.0.0.0:9000

And in other window:

$ ./ngrok http 9000

Ngrok will start, and you can use the URL shown in the console; for me it was http://fba8f218.ngrok.io/.

With this URL I go to the AWS SNS subscriptions tab and add a new subscription:

Creating a SNS subscription

After setting this up, you will receive an SNS message with a link that you need to paste into a browser to confirm the subscription.

That's all for today! In the next blog post I will cover how to inform the user that the transcode job has completed. Feel free to comment; your feedback is always welcome.

Other blog posts in this series

The code that I have written so far is available on GitHub. Stay tuned for the next blog post in this series.

Special thanks to Kasia for being an editor for this post. Thank you.

Cover image by Harald Hoyer under CC BY-SA 2.0, via Wikimedia Commons

Shaun M. Thomas: PG Phriday: Why Postgres

From Planet PostgreSQL. Published on Jan 13, 2017.

There are a smorgasbord of database engines out there. From an outside perspective, Postgres is just another on a steadily growing pile of structured data storage mechanisms. Similarly to programming languages like Rust and Go, it’s the new and shiny database systems like MongoDB that tend to garner the most attention. On the other hand, more established engines like Oracle or MySQL have a vastly larger lead that seems insurmountable. In either case, enthusiasm and support is likely to be better represented in exciting or established installations.

So why? Why out of the myriad choices available, use Postgres? I tend to get asked this question by other DBAs or systems engineers that learn I strongly advocate Postgres. It’s actually a pretty fair inquiry, so why not make it the subject of the first PG Phriday for 2017? What distinguishes it from its brethren so strongly that I staked my entire career on it?

Boring!

Postgres isn’t new. It didn’t enjoy the popularity that practically made MySQL a household name as part of the LAMP stack. It didn’t infiltrate corporations several decades ago as the de facto standard for performance and security like Oracle. It isn’t part of a much larger supported data environment like SQL Server. It isn’t small and easy like SQLite. It’s not a distributed hulk like Hadoop, or enticingly sharded like MongoDB or Cassandra. It’s not in-memory hotness like VoltDB.

It’s just a regular, plain old ACID RDBMS.

You can't explain that!

It does after all, have all of the basics many expect in an RDBMS:

  • Tables, Views, Sequences, etc.
  • Subqueries
  • Functions in various languages
  • Triggers
  • Point In Time Recovery

Certain… other database platforms weren’t so complete. As a consequence, Postgres was preferred by those who knew the difference and needed that extra functionality without breaking the bank. It’s not much, but it’s a great way to develop a niche. From there, things get more interesting.

Durababble

For the most part, being boring but reliable was a fact of life until the release of 9.0 when Postgres introduced streaming replication and hot standby. Postgres was still a very capable platform before that juncture, but built-in high-availability made it more viable in a business context. Now the secondary copy could be online and supply query results. Now the replica would lag behind the primary by a small handful of transactions instead of entire 16MB segments of transaction log files.

Postgres had finally joined the rest of the world in that regard. MySQL used a different mechanism, but that was one of its selling-points for years before Postgres. The primary distinction is that Postgres streams the changes at a binary level, meaning very little calculation is necessary to apply them. As a result, Postgres replicas are much less likely to fall behind the upstream primary.

The second I tested this feature in 2010, any lingering doubts about the future of Postgres vanished.

Cult of Extensibility

A future version of Postgres. Probably.

The next huge—and arguably most important—advancement in Postgres accompanied the release of 9.1: extensions. The true implications here are hard to overstate. Not all of the internal API is exposed, but extensions make it possible for practically any inclined individual to just bolt functionality onto Postgres. When Postgres 9.2 added foreign data wrappers, even arbitrary alternative backends became a possibility.

Hijack the query planner to route data through video card CPUs? Got it. Add basic sharding and distributed query support? No problem. Interact with Cassandra, SQL Server, or even Facebook? Easy peasy. Store data in analytic-friendly column structure? Child’s play.

Perl has the dubious honor of being labeled the Swiss Army Chainsaw of languages because it enables a practitioner to do anything. Extensions convey almost that same capability to Postgres. And while a badly written extension can crash your database, good ones can elevate it beyond the imaginations and time constraints of the core developers.

Extensions that provide enough added value have even inspired fully supported internal adaptations, as in the case of materialized views in Postgres 9.3. What other database does that?

Consider what happens when these features are combined.

  1. Create a materialized view that refers to a remote table.
  2. Refresh the above view before using it in a report.
  3. Alternatively, siphon updated rows from the view into a more permanent aggregate summary table.
  4. Get local data processing performance in ad-hoc analytics over heterogeneous platforms.

Now Postgres can be the central nexus for a constellation of various data environments and provide snapshot analysis for the live data. Without a convoluted ETL infrastructure. Technically the materialized views or intermediate aggregate tables aren’t strictly necessary, so Postgres wouldn’t even need to store actual data. Such a configuration would be hilariously slow, but now the ironic scenario exists where Postgres can power a database empty of actual contents.

The 9.2 release transformed Postgres into a platform, and it's one of the reasons I don't use the SQL part of PostgreSQL anymore.

Developers, Developers, Developers!

Ballmer likes developers

The folks hacking on the Postgres code are both insanely skilled and notoriously available. It’s almost a running joke to guess which of the core devs will answer a basic SQL question first. There’s practically a race to answer questions in the mailing lists regardless of sophistication or perceived merit, and anyone subscribed to the list can participate.

Their dedication to fostering community interaction is unrelenting. While not quite as organized as the Linux kernel developers thanks to Linus’ role as benevolent dictator, they’ve pushed Postgres forward every year. Due to their strict commit-fests and automated testing and code review, they’ve delivered a stable update roughly every year since 2008. Is there another database engine that can boast the same?

And every release has at least one headliner feature that makes upgrading worth the effort. Every. Last. Version.

  • 8.4: Window functions + CTEs
  • 9.0: Streaming replication
  • 9.1: Foreign tables, extensions
  • 9.2: Cascading replication, JSON support
  • 9.3: Materialized views, event triggers, data checksums
  • 9.4: JSONB, background workers, logical WAL decoding
  • 9.5: Upsert
  • 9.6: Parallel execution
  • 10.0?: Standby quorum, native table partitioning

While it would be wrong to demand that kind of dedication and quality, appreciating it is quite a different story. The community pushes Postgres forward because the devs give it a voice. That’s rare in any project.

In the end, I consider it a privilege to even participate from the sidelines. Is it perfect? Of course not; I’ve pointed out serious flaws in Postgres performance that have yet to be successfully addressed. Yet given the alternatives, and what Postgres really delivers when it’s fully leveraged, I can’t even think of a better commercial RDBMS.

Why Postgres? Maybe the better question is: why not?

Bruce Momjian: Creating an SSL Certificate

From Planet PostgreSQL. Published on Jan 12, 2017.

Having covered the choice of certificate authorities, I want to explain the internals of creating server certificates in Postgres. The instructions are already in the Postgres documentation.

When using these instructions for creating a certificate signing request (CSR), two files are created:

  • certificate signing request file with extension req
  • key file, containing public and private server keys, with extension pem

(It is also possible to use an existing key file.) You can view the contents of the CSR using openssl, e.g.:

Continue Reading »

New year, new Python: Python 3.6

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jan 11, 2017.

Python 3.6 was released at the tail end of 2016. Read on for a few highlights from this release.

New module: secrets

Python 3.6 introduces a new module in the standard library called secrets. While the random module has long existed to provide us with pseudo-random numbers suitable for applications like modeling and simulation, these were not "cryptographically random" and not suitable for use in cryptography. secrets fills this gap, providing a cryptographically strong method to, for instance, create a new, random password or a secure token.
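
For instance (a minimal illustration, not from the original post), generating a secure token and a random password with the new module:

import secrets
import string

# A URL-safe token, e.g. for a password-reset link
token = secrets.token_urlsafe(16)

# A 12-character random password drawn with a cryptographically strong RNG
alphabet = string.ascii_letters + string.digits
password = ''.join(secrets.choice(alphabet) for _ in range(12))

print(token, password)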

New method for string interpolation

Python previously had several methods for string interpolation, but the most commonly used was str.format(). Let’s look at how this used to be done. Assuming 2 existing variables, name and cookies_eaten, str.format() could look like this:

"{0} ate {1} cookies".format(name, cookies_eaten)

Or this:

"{name} ate {cookies_eaten} cookies".format(name=name, cookies_eaten=cookies_eaten)

Now, with the new f-strings, the variable names can be placed right into the string without the extra length of the format parameters:

f"{name} ate {cookies_eaten} cookies"

This provides a much more pythonic way of formatting strings, making the resulting code both simpler and more readable.

Underscores in numerals

While it doesn’t come up often, it has long been a pain point that long numbers could be difficult to read in the code, allowing bugs to creep in. For instance, suppose I need to multiply an input by 1 billion before I process the value. I might say:

bill_val = input_val * 1000000000

Can you tell at a glance if that number has the right number of zeroes? I can’t. Python 3.6 allows us to make this clearer:

bill_val = input_val * 1_000_000_000

It’s a small thing, but anything that reduces the chance I’ll introduce a new bug is great in my book!

Variable type annotations

One key characteristic of Python has always been its flexible variable typing, but that isn’t always a good thing. Sometimes, it can help you catch mistakes earlier if you know what type you are expecting to be passed as parameters, or returned as the results of a function. There have previously been ways to annotate types within comments, but the 3.6 release of Python is the first to bring these annotations into official Python syntax. This is a completely optional aspect of the language, since the annotations have no effect at runtime, but this feature makes it easier to inspect your code for variable type inconsistencies before finalizing it.
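
A short illustration of the new variable annotation syntax (PEP 526); the names echo the earlier cookie example, and the annotations have no effect at runtime but are visible to tools like mypy:

from typing import List

cookies_eaten: int = 3
name: str = "Sasha"          # made-up values for illustration
flavors: List[str] = ["ginger", "oatmeal"]


def cookie_report(name: str, cookies_eaten: int) -> str:
    return f"{name} ate {cookies_eaten} cookies"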

And much more…

In addition to the changes mentioned above, there have been improvements made to several modules in the standard library, as well as to the CPython implementation. To read about all of the updates this new release includes, take a look at the official notes.

Josh Berkus: Retiring from the Core Team

From Planet PostgreSQL. Published on Jan 11, 2017.

Those of you in the PostgreSQL community will have noticed that I haven't been very active for the past year.  My new work on Linux containers and Kubernetes has been even more absorbing than I anticipated, and I just haven't had a lot of time for PostgreSQL work.

For that reason, as of today, I am stepping down from the PostgreSQL Core Team.

I joined the PostgreSQL Core Team in 2003.  I decided to take on project advocacy, with the goal of making PostgreSQL one of the top three databases in the world.  Thanks to the many contributions by both advocacy volunteers and developers -- as well as the efforts by companies like EnterpriseDB and Heroku -- we've achieved that goal.  Along the way, we proved that community ownership of an OSS project can compete with, and ultimately outlast, venture-funded startups.

Now we need new leadership who can take PostgreSQL to the next phase of world domination.  So I am joining Vadim, Jan, Thomas, and Marc in clearing the way for others.

I'll still be around and still contributing to PostgreSQL in various ways, mostly around running the database in container clouds.  It'll take a while for me to hand off all of my PR responsibilities for the project (assuming that I ever hand all of them off).

It's been a long, fun ride, and I'm proud of the PostgreSQL we have today: both the database, and the community.  Thank you for sharing it with me.

Which components should I use in production?

By Django deployment from Django community aggregator: Community blog posts. Published on Jan 11, 2017.

This is the last in a series of posts about whether you should choose Ubuntu or Windows, Gunicorn or uWSGI, MySQL or PostgreSQL, and Apache or nginx. Here I bring it all together. The image illustrates the components I propose for a start.

[…]

The post Which components should I use in production? appeared first on Django deployment.

David Rader: How to: Pick a PostgreSQL Python driver

From Planet PostgreSQL. Published on Jan 10, 2017.

Using Postgres with python is easy and provides first class database capabilities for applications and data processing. When starting a new project, you need to choose which PostgreSQL python driver to use. The best choice depends on your deployment, python version, and background. Here’s a quick guide to choosing between three popular python PostgreSQL drivers:

Psycopg2

This is the most common and widely supported Python PostgreSQL driver. It provides a DB-API compatible wrapper on top of the native libpq PostgreSQL library. It supports automatic mapping of query results to Python dictionaries as well as named tuples, and is commonly used by ORMs and frameworks, including SQLAlchemy. Because psycopg2 uses libpq, it supports the same environment variables as libpq for connection properties (PGDATABASE, PGUSER, PGHOST, etc.) as well as a .pgpass file. And it supports COPY directly for bulk loading. If you use Postgres from multiple languages, this driver will feel the most “native PG.”

But the libpq dependency requires the libpq-dev and python-dev packages on Linux, or installer packages on Windows and macOS, for the shared objects/DLLs. That's not an issue for a single developer machine or server, but it is hard to package for a cross-platform Python app (like our BigSQL DevOps management tool) or to use in a PyPy runtime.
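
A minimal usage sketch (the connection parameters are placeholders; the libpq environment variables mentioned above would also be honored):

import psycopg2
import psycopg2.extras

# Placeholder connection settings; PG* environment variables or .pgpass also work.
conn = psycopg2.connect(dbname='mydb', user='postgres', host='localhost')
cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
cur.execute("SELECT %s AS greeting", ("hello",))
print(cur.fetchone())   # {'greeting': 'hello'}
cur.close()
conn.close()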

pg8000

An actively maintained, pure Python DB-API compatible driver that works across operating systems and in many different Python environments, such as PyPy and Jython. With no native dependencies, it is easy to package pg8000 for distribution with an application on Python 2 or Python 3. pg8000 tries to stay focused on just PostgreSQL database access, ignoring some of the convenience functions in libpq. For example, pg8000 does not directly support a .pgpass file or environment variables, but you can use the pgpasslib project in conjunction with it. pg8000 does not include the many extras included with psycopg2 to retrieve dictionaries or named tuples, but it does support most data types.

pg8000 is a good choice for distributing with an application bundle or for environments where you can’t install native dependencies.
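
Basic DB-API usage looks much like psycopg2; the credentials below are placeholders, and the exact keyword arguments are an assumption about the pg8000 release of that era:

import pg8000

# Placeholder credentials; note there is no .pgpass or PG* env var support here.
conn = pg8000.connect(user='postgres', password='secret',
                      host='localhost', database='mydb')
cur = conn.cursor()
cur.execute("SELECT 1 + 1")
print(cur.fetchone())   # a row containing 2
conn.close()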

asyncpg

An exciting new (in 2016) driver supporting Python/asyncio. Rather than using libpq and the text format protocol, asyncpg implements the Postgres binary protocol, which provides better type support and faster performance. asyncpg is designed (and tested) to be fast, able to achieve much higher retrieval speeds than either driver above. See the performance benchmarking provided by the authors. With the emphasis on speed, asyncpg does not support the DB-API, so code feels different than other Python database access.

asyncpg is a C extension, so it only works with CPython and requires Python 3.5+; but it has no dependencies, so a simple pip install (or app bundling) works, and it supports PostgreSQL 9.1+. asyncpg is definitely worth looking at if you're using Python 3.5.
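
A small sketch of the asyncio-flavored API (connection settings are placeholders), which shows why asyncpg code doesn't read like DB-API code:

import asyncio

import asyncpg


async def main():
    # Placeholder connection settings
    conn = await asyncpg.connect(user='postgres', database='mydb',
                                 host='localhost')
    row = await conn.fetchrow('SELECT $1::int + $2::int AS total', 2, 3)
    print(row['total'])   # 5 -- note the $1/$2 placeholders, not %s
    await conn.close()

asyncio.get_event_loop().run_until_complete(main())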

For more information, see the project pages for each driver discussed above.

Hans-Juergen Schoenig: Checkpoint distance and amount of WAL

From Planet PostgreSQL. Published on Jan 10, 2017.

Most people using PostgreSQL database systems are aware of the fact that the database engine has to send changes to the so called “Write Ahead Log” (= WAL) to ensure that in case of a crash the database will be able to recover to a consistent state safely and reliably. However, not everybody is aware […]

The post Checkpoint distance and amount of WAL appeared first on Cybertec - The PostgreSQL Database Company.

How to Implement OAuth2 using Django REST Framework

By Chris Bartos from Django community aggregator: Community blog posts. Published on Jan 09, 2017.

Let’s paint a picture for you. You want to create a web and mobile application that allows your users to log in more securely than [Token Authentication]. You know it’s possible, but you’re not sure how to implement something like this.

We can also use Facebook, Google, Twitter, Github, etc. to authenticate users. However, I’m just going to describe how to use OAuth to authenticate with JUST our application.

Maybe there is a package you can use? You’d be correct: there is. Implementing OAuth authentication from scratch is complicated, so I recommend NOT re-inventing the wheel and instead using a package that already implements it. (I’ll write a post about how OAuth can be implemented yourself, if you’re interested in learning how this works.)

I’m just going to pick a random OAuth package because it really doesn’t matter. The implementation of OAuth is the same no matter which package you choose. In this case, I’m going to use django-rest-framework-social-oauth2.

Implementing OAuth2 using Django-REST-Framework-Social-OAuth2

Let’s start by installing the package.

$ pip install django-rest-framework-social-oauth2

Now, let’s add the package apps to our INSTALLED_APPS.

settings.py

INSTALLED_APPS = (
    #...
    'oauth2_provider',
    'social_django',
    'rest_framework_social_oauth2',
)

Next, let’s include the URLs to our urls.py.

urls.py

from django.conf.urls import include, url

urlpatterns = [
    # ...
    url(r'^auth/', include('rest_framework_social_oauth2.urls')),
]

Now, we need to set up the packages CONTEXT_PROCESSORS. If you are using Django 1.8+ you’ll add the CONTEXT_PROCESSORS in the TEMPLATES setting.

settings.py

TEMPLATES = [
    {
        # ...
        'OPTIONS': {
            'context_processors': [
                # ...
                'social_django.context_processors.backends',
                'social_django.context_processors.login_redirect',
            ],
        },
    }
]

However, if you are using anything BEFORE Django 1.8, you’ll need to add the context processors like so…

settings.py

TEMPLATE_CONTEXT_PROCESSORS = (
    # ...
    'social_django.context_processors.backends',
    'social_django.context_processors.login_redirect',
)

Now, we need some authentication backends so that Django and Django REST Framework know how to authenticate users. Let’s add the backends to both settings.

settings.py

# DJANGO REST FRAMEWORK SETTINGS
REST_FRAMEWORK = {
    # ...
    'DEFAULT_AUTHENTICATION_CLASSES': (
        # ...
        'oauth2_provider.ext.rest_framework.OAuth2Authentication',
        'rest_framework_social_oauth2.authentication.SocialAuthentication',
    ),
}

# DJANGO SETTINGS 
AUTHENTICATION_BACKENDS = (
    # ...
    'rest_framework_social_oauth2.backends.DjangoOAuth2',
    'django.contrib.auth.backends.ModelBackend',
)

Set up Web Application

We have successfully set up our web application to use OAuth2. As with every new app added to INSTALLED_APPS, you need to run:

$ python manage.py migrate

This command will build our database backend with the new models from the OAuth2 package.

Let’s run the application!

$ python manage.py runserver

If the application starts correctly, we’ll go to http://localhost:8000/admin/ and log in with:

username: adminadmin 
password: adminadmin

When you log in successfully, you’ll see a list of different options: Django OAuth Toolkit and Social_Django.

Underneath Django OAuth Toolkit, near Applications, click Add.

Fill out the application as follows:

Client Id: **Do not change**
User: **click the hourglass and select the superuser**
Redirect URIs: **leave blank**
Client Type: “Confidential”
Authorization grant type: “Resource owner password-based”
Client secret: **Do not change**
Name: **Anything you want — maybe “Test Example”**

Next, click save to save that application.

Test the Application

Use either CURL (command line) or [Postman] to create a POST request to your web application (http://localhost:8000/auth/token) using the following information:

username=adminadmin
password=adminadmin
client_id=QKaPafSal2lYYfIYIDSKCC3hoj0TRPLFnZYJNE0h
client_secret=bvQDk9bIwVS28VSNZFP5ehgsrnQroWP5xHccdradOvqxonWSqC1soy7HzaiIiRzCQi73o0pPKyWp7dEoS8DgrZWLoiwJf7iZ8kymv1rb1s3Hx3XSTGQgDmVBqveOQT5H
grant_type=password

If you are using CURL, you can run the following command:

$ curl -X POST -d "client_id=QKaPafSal2lYYfIYIDSKCC3hoj0TRPLFnZYJNE0h&client_secret=bvQDk9bIwVS28VSNZFP5ehgsrnQroWP5xHccdradOvqxonWSqC1soy7HzaiIiRzCQi73o0pPKyWp7dEoS8DgrZWLoiwJf7iZ8kymv1rb1s3Hx3XSTGQgDmVBqveOQT5H&grant_type=password&username=adminadmin&password=adminadmin" http://localhost:8000/auth/token

If you are using Postman, your interface should look like this:

This will retrieve an access token that our user can use for authentication.

Now, we need to use the access token that we received and create an Authorization header in the form:

Authorization: Bearer [access_token]

The access token that I received happens to look like this: 71AVaiCidnYcD2ct3Mqgat9j4jo6Xl.

So, if I want to access http://localhost:8000/polls/api/questions/1 I have to create a header using my access token in order to access that question.

In CURL, I would do something like this:

$ curl -H "Authorization: Bearer 71AVaiCidnYcD2ct3Mqgat9j4jo6Xl" -X GET http://localhost:8000/polls/api/questions/1

In Postman I do something like this:

Just to test it out to make sure that I’m doing this right, I will remove the Header from the request and see if Django will let me access the data. Not surprisingly, it doesn’t let me in!
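
If you would rather script the whole flow in Python instead of using CURL or Postman, a minimal sketch with the requests library (not part of the original walkthrough; it assumes the same local server, credentials and client values shown above) could look like this:

import requests

BASE = 'http://localhost:8000'

# Use the Client id and Client secret shown for your application in the admin.
CLIENT_ID = 'QKaPafSal2lYYfIYIDSKCC3hoj0TRPLFnZYJNE0h'
CLIENT_SECRET = 'bvQDk9bIwVS28VSNZFP5ehgsrnQroWP5xHccdradOvqxonWSqC1soy7HzaiIiRzCQi73o0pPKyWp7dEoS8DgrZWLoiwJf7iZ8kymv1rb1s3Hx3XSTGQgDmVBqveOQT5H'

# Step 1: exchange the user's credentials for an access token.
token_response = requests.post(BASE + '/auth/token', data={
    'grant_type': 'password',
    'username': 'adminadmin',
    'password': 'adminadmin',
    'client_id': CLIENT_ID,
    'client_secret': CLIENT_SECRET,
})
access_token = token_response.json()['access_token']

# Step 2: call a protected endpoint with the Bearer token.
question = requests.get(
    BASE + '/polls/api/questions/1',
    headers={'Authorization': 'Bearer %s' % access_token},
)
print(question.status_code, question.json())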

Click here for my Sample Code — remember to install the package using pip

Homework (because this helps you apply what you learned)

  1. Setup a new Django REST Framework web application
  2. Add an endpoint, a serializer and make sure only authenticated users can access the data.
  3. Follow the steps above and try OAuth2 out for yourself. Get it to work.
  4. If you have questions email me

Now you know how to implement OAuth2 in your own application. You can use similar steps to set this up for Facebook, Google, Twitter, Github, etc., so you can allow users to sign in to your application using accounts they already have and use!

Try it out and see if you can get it to work for yourself!

5 Reasons I don't think you should use the django admin

By GoDjango - Django Screencasts from Django community aggregator: Community blog posts. Published on Jan 09, 2017.

Have you ever needed to quickly modify or look up data, only to find that while the task is simple, the process of changing one value frustrates you?

Well, this is very common for me; bad UX is annoying. I quite often have to look up or edit two disparate yet related pieces of data, and sometimes it is an exercise in frustration.

An all too common occurrence: I just needed to check the edit history of a "Company" in our database. This is a simple process: go to the admin, find the Company model, do a search, and you have your answer.

Except, it isn't that easy in reality. Let's run through what really happened.

I went to "http://superawesomesite.com/admin/" and logged in.

I then saw that my model options extended well below the fold of my browser, so I had to scroll. No problem, I'll just hit "ctrl+f" and search for it. WTF!!! Where is my Company model?

I then proceed to scroll down and finally find it, only to remember it was pluralized. If I had searched for "Companies" I would have been good to go, grrrrr.

I click into it and see the list of companies available to me. I see that there is a search feature at the top, so I do a search for the relevant information. Unfortunately, I don't know the company name; that is what I am trying to find out.

Except, we aren't filtering our search in the admin based on that field. So no results show up.

No problem, I can do a list filter on the side. So I set the correct filters and still no luck, because someone didn't set all the metadata that was supposed to be set for the company. Again, grrrrr.

Finally, I abandoned the admin and opened up the django/python shell and did a query with the model, and in about 30 seconds had the record I needed. Took the id and plugged that into the django admin and I was good to go.

This was an exercise in frustration because the 30 seconds taken during development to decide which fields should be filtered didn't let that person imagine the field I needed. Also, the fact that whoever added the company didn't add all the correct information made it almost impossible to find the data. In the end I had to go outside this system, which is hailed as one of the greatest tools for Django, to get the information I needed.

I propose we as a community reduce our reliance on the django admin, especially in production.

I also created a Video on the 5 reasons we shouldn't use the django admin.

Bruce Momjian: SSL Certificates and Certificate Authorities

From Planet PostgreSQL. Published on Jan 09, 2017.

When setting up SSL in Postgres, you can't just enable SSL. You must also install a signed certificate on the server.

The first step is to create a certificate signing request (CSR) file that contains the host name of the database server. Once created, there are three ways to sign a CSR to create a certificate:

If the certificate is to be self-signed, use the key created by the certificate signing request to create a certificate. If using a local certificate authority, sign the CSR file with the local certificate authority's key.

Continue Reading »

Jobin Augustine: pg_repack in Postgres by BigSQL

From Planet PostgreSQL. Published on Jan 09, 2017.

Many DBAs agree that one of the most useful extensions in their arsenal is pg_repack. pg_repack addresses a serious shortcoming in Postgres: doing VACUUM FULL online. Due to the way Postgres handles MVCC, tables and indexes become bloated. Dead tuples are addressed by AUTOVACUUM and their space is marked as free. In many situations a VACUUM FULL becomes unavoidable, because AUTOVACUUM just leaves scattered, fragmented free space in the table as it is. A DBA may have to do a VACUUM FULL to release such free space back to disk.

Unfortunately VACUUM FULL requires an exclusive lock on the table for the duration of the operation, so it can't be performed while the table is being used. This is where pg_repack comes to the rescue of the DBA: it performs a VACUUM FULL almost fully online (there could be a momentary lock).

The popularity of this extension among DBAs led the Postgres by BigSQL project to add pg_repack as a ready to install package.

Installation

Installing pg_repack is quite easy with the pgcli command line.

$ ./pgc list

The list command displays all installable versions of extensions. Let’s install repack13-pg96 (the package version depends on the Postgres version we have installed). Installing it is pretty straightforward:

$ ./pgc install repack13-pg96

Installation of pg_repack doesn’t require either a restart or reload. As is commonly known, the pg_repack installation has 2 components: the actual extension and a client tool to invoke the pg_repack functionality. We can create the extension in the desired database as follows from the psql command interface:

postgres=# \c db1
postgres=# CREATE EXTENSION pg_repack;
CREATE EXTENSION

Test Environment

To create a test environment, run pgbench from the command line:

pgbench -U postgres -i -s 10 db1

This produced a pgbench_accounts table of 128 MB. To create bloat, I ran the following update twice:

db1=# update pgbench_accounts set abalance=abalance;

This caused the table to grow to 384 MB. After a couple of minutes, AUTOVACUUM kicked in and cleaned up all the dead tuples as expected.

db1=# select n_live_tup,n_dead_tup from pg_stat_user_tables where relname='pgbench_accounts';
n_live_tup | n_dead_tup
------------+------------
997705 | 0

However, the table size remained at 384 MB:

db1=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------------------+-------+----------+---------+-------------
public | pgbench_accounts | table | postgres | 384 MB |

VACUUM FULL using pg_repack

pg_repack is invoked using the included command line utility. We can repack every table in a database like so:

$ pg_repack db1
INFO: repacking table "pgbench_tellers"
INFO: repacking table "pgbench_accounts"
INFO: repacking table "pgbench_branches"

Where db1 is the name of the database.

Or we can repack individual tables:

$ pg_repack --no-order --table pgbench_accounts --table pgbench_branches db1
INFO: repacking table "pgbench_accounts"
INFO: repacking table "pgbench_branches"

After running pg_repack, the space consumption is back to 128 MB:

db1=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------------------+-------+----------+------------+-------------
public | pgbench_accounts | table | postgres | 128 MB |

Giulio Calacoci: Barman 2.1 and the new –archive option

From Planet PostgreSQL. Published on Jan 09, 2017.

Barman 2.1

Version 2.1 of Barman, backup and recovery manager for PostgreSQL, was released Thursday, Jan. 5.

The new release, along with several bugfixes, introduces preliminary support for the upcoming PostgreSQL 10 and adds the --archive option to the switch-xlog command.

switch-xlog --archive

The new --archive option is especially useful when setting up a new server.

Until now, the switch-xlog command simply forced the PostgreSQL server to switch to a different transaction log file. Now, Barman also offers the --archive option, which triggers WAL archiving after the xlog switch and forces Barman to wait for the archival of the closed WAL file.

By default Barman expects to receive the WAL within 30 seconds; the number of seconds to wait can be changed with the --archive-timeout option. If the switch-xlog command returns an error, it means no WAL file has been archived and the Barman server is not able to receive WALs from PostgreSQL.

This option allows users to test the entire WAL archiving process and identify configuration issues.

Conclusions

The Barman dev team is very happy about this small release. Containing primarily bug fixes, it increases the robustness of Barman thanks to the feedback received through the Barman mailing list and the GitHub issue tracker.

If you are interested in helping us by sponsoring the development, even partially, drop us a line (info@pgbarman.org).

Links

Website
Download
Online Documentation
Man page, section 1
Man page, section 5
Support

Craig Kerstiens: Simple but handy Postgres features

From Planet PostgreSQL. Published on Jan 08, 2017.

It seems each week when I’m reviewing data with someone a feature comes up that they had no idea existed within Postgres. In an effort to continue documenting many of the features and functionality that are useful, here’s a list of just a few that you may find handy the next time you’re working with your data.

Psql, and \e

This one I’ve covered before, but it’s worth restating. psql is a great tool that already comes with Postgres. If you’re comfortable on the CLI you should consider giving it a try. You can even set up your own .psqlrc so that it’s well customized to your liking; in particular, turning \timing on is especially useful. But even with all sorts of customization, if you’re not aware that you can use your preferred editor via \e then you’re missing out. This will open the last run query, let you edit it, save, and then run it for you. Vim, Emacs, even Sublime Text works; just take your pick by setting your $EDITOR variable.

Watch

Ever sit at a terminal running a query over and over to see if something on your system changed? If you’re debugging something, whether locally or live in production, watching data change can be key to figuring out what’s going on. Instead of re-running your query by hand, you can simply use the \watch command in Postgres, which will re-run your query automatically every few seconds.

```sql
SELECT now() - query_start,
       state,
       query
FROM pg_stat_activity \watch
```

JSONB pretty print

I love JSONB as a datatype. Yes, in some cases it won’t be optimal for performance (though at times it can be perfectly fine). If I’m hitting some API that returns a ton of data, I’m usually not using all of it right away, but you never know when you’ll want the rest of it. I use Clearbit this way today, and for safety’s sake I save the whole JSON result instead of de-normalizing it. Unfortunately, when you query this in Postgres you get one giant compressed block of JSON text. Yes, you could pipe it out to something like jq, or you could simply use Postgres’ built-in function to make it legible:

```sql
SELECT jsonb_pretty(clearbit_response) FROM lookup_data;

                            jsonb_pretty

{

 "person": { 
     "id": "063f6192-935b-4f31-af6b-b24f63287a60", 
     "bio": null, 
     "geo": { 
         "lat": 37.7749295, 
         "lng": -122.4194155,                                              
         "city": "San Francisco", 
         "state": "California", 
         "country": "United States", 
         "stateCode": "CA", 
         "countryCode": "US" 
     }, 
     "name": { 
     ...

```

Importing my data into Google

This one isn’t Postgres specific, but I use it on a weekly basis and it’s key for us at Citus. If you use something like Heroku Postgres, dataclips is an extremely handy feature that lets you have a real-time view of a query and its results, including an anonymous URL for it. At Citus, much like we did at Heroku Postgres, we have a dashboard in Google Sheets which pulls in this data in real time. To do this, simply select a cell and enter: =importdata("pathtoyourdataclip.csv"). Google will import any data this way as long as it’s in CSV form. It’s a great lightweight way to build out a dashboard for your business without rolling your own complicated dashboarding or building out a complex ETL pipeline.

I’m sure I’m missing a ton of the smaller features that you use on a daily basis. Let me know @craigkerstiens the ones I forgot that you feel should be listed.

gabrielle roth: PDXPUG – January meeting

From Planet PostgreSQL. Published on Jan 07, 2017.

When: 6-8pm Thursday Jan 19, 2017
Where: iovation
Who: Mark Wong
What: pglogical

Mark Wong will give an overview of pglogical, the latest replication option for Postgres. It’s a lower-impact option than trigger-based replication, and features include the ability to replicate only the databases and tables you choose from a cluster. Part of the talk will cover use cases and future development plans.

Find out more here: https://2ndquadrant.com/en/resources/pglogical/

Mark leads the 2ndQuadrant performance practice as a Performance Consultant for English Speaking Territories, based out of Oregon in the USA.


If you have a job posting or event you would like me to announce at the meeting, please send it along. The deadline for inclusion is 5pm the day before the meeting.

Our meeting will be held at iovation, on the 32nd floor of the US Bancorp Tower at 111 SW 5th (5th & Oak). It’s right on the Green & Yellow Max lines. Underground bike parking is available in the parking garage; outdoors all around the block in the usual spots. No bikes in the office, sorry!

iovation provides us a light dinner (usually sandwiches or pizza).

Elevators open at 5:45 and building security closes access to the floor at 6:30.

See you there!


How to Implement Custom Authentication with Django REST Framework

By Chris Bartos from Django community aggregator: Community blog posts. Published on Jan 06, 2017.

Introduction to Custom Authentication

Custom authentication in Django REST Framework is the way you create any type of authentication scheme you want. In fact, inside the internals of DRF, you will find that every other authentication scheme I’ve talked about is built the same way as a custom authentication class. So, let’s look at an example of how you would implement something like this.

How to Implement Custom Authentication

WARNING: The example I’m about to show you is VERY VERY bad for security so DON’T use it in production. 🙂

First, you will need to override the BaseAuthentication class. It looks like this:

my_proj/accounts/auth.py

from django.contrib.auth.models import User
from rest_framework.authentication import BaseAuthentication
from rest_framework import exceptions

class MyCustomAuthentication(BaseAuthentication):
    def authenticate(self, request):
        username = request.GET.get("username")

        if not username: # no username passed in request headers
            return None # authentication did not succeed

        try:
            user = User.objects.get(username=username) # get the user
        except User.DoesNotExist:
            raise exceptions.AuthenticationFailed('No such user') # raise exception if user does not exist

        return (user, None) # authentication successful

I called the new class MyCustomAuthentication. If you look at what it does, it retrieves a username from a GET parameter and tries to find a user with that username. (You should now understand why this is a stupid example.)

Next, in settings.py you’ll want to update the DEFAULT_AUTHENTICATION_CLASSES setting.

settings.py

REST_FRAMEWORK = {
    'DEFAULT_AUTHENTICATION_CLASSES': (
        'accounts.auth.MyCustomAuthentication',
    ),
    'DEFAULT_PERMISSION_CLASSES': (
        'rest_framework.permissions.IsAuthenticated',
    )
}

And that is LITERALLY all you need to do to create a new authentication scheme. Download the custom code below and try going to the following URL:

http://localhost:8000/polls/api/questions/1/?username=chris

You should be able to see the data. Also, if you go to:

http://localhost:8000/polls/api/questions/1/

The authentication scheme should deny you from getting any data at all.

Click Here to Download the Sample Code

Homework

  1. Run the sample code and go to the two URLs above.
  2. Try to implement your own Session Authentication scheme WITHOUT enforcing CSRF tokens using Custom Authentication; a rough starting point is sketched below. You can see how Session Authentication is implemented here
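
For item 2, a sketch like the following (my own illustration, not the post's sample code) mirrors what DRF's SessionAuthentication does while deliberately skipping the CSRF check; the module name is hypothetical:

# accounts/session_auth.py (hypothetical module name)
from rest_framework.authentication import BaseAuthentication


class CsrfExemptSessionAuthentication(BaseAuthentication):
    def authenticate(self, request):
        # Django's session and auth middleware attach a user to the underlying
        # request when a valid session cookie is sent.
        user = getattr(request._request, 'user', None)

        if not user or not user.is_active:
            return None  # let the other authentication classes run

        # Unlike DRF's built-in SessionAuthentication, no CSRF enforcement here.
        return (user, None)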

Bruce Momjian: Use Kill -9 Only in an Emergency

From Planet PostgreSQL. Published on Jan 06, 2017.

During normal server shutdown, sessions are disconnected, dirty shared buffers and pending write-ahead log (WAL) records are flushed to durable storage, and a clean shutdown record is written to pg_control. During the next server start, pg_control is checked, and if the previous shutdown was clean, startup can ignore the WAL and start immediately.

Unfortunately, a clean shutdown can take some time, and impatient database administrators might get into the habit of using kill -9 or pg_ctl -m immediate to quicken the shutdown. While this does have the intended effect, and you will not lose any committed transactions, it greatly slows down the next database startup because all WAL generated since the last completed checkpoint must be replayed. You can identify an unclean shutdown by looking at the server logs for these two ominous lines:

LOG:  database system was interrupted; last known up at 2016-10-25 12:17:28 EDT
LOG:  database system was not properly shut down; automatic recovery in progress

Continue Reading »

Andrew Dunstan: Managed Database Services – pros and cons

From Planet PostgreSQL. Published on Jan 06, 2017.

Using a managed service is a very attractive proposition. You are offloading a heck of a lot of worry, especially when it comes to something as complicated and, let’s face it, specialized as a database. Someone else will set it up for you, and back it up, and keep it running, without you having to worry overmuch about it. However, there are downsides. You can only get what the manager is offering. Often that’s good enough. But needs change, and I have often seen people start with managed services, only to find that they want more than they can get that way.

Just yesterday I received a complaint that the Redis Foreign Data Wrapper, which I have done a lot of work on, is not available on Amazon RDS. And that’s completely understandable. Amazon only provide a limited number of extensions, and this isn’t one of them. At least one other managed service, Heroku, does offer this extension, but there are others it doesn’t offer.

So the lesson is: choose your managed service, or even whether to use a managed service at all, very carefully, taking into account both your current needs and your likely future needs.

Michael Paquier: Postgres 10 highlight - Quorum set of synchronous standbys

From Planet PostgreSQL. Published on Jan 06, 2017.

Today’s post, the first one of 2017, is about the following feature of the upcoming Postgres 10:

commit: 3901fd70cc7ccacef1b0549a6835bb7d8dcaae43
author: Fujii Masao <fujii@postgresql.org>
date: Mon, 19 Dec 2016 21:15:30 +0900
Support quorum-based synchronous replication.

This feature is also known as "quorum commit" especially in discussion
on pgsql-hackers.

This commit adds the following new syntaxes into synchronous_standby_names
GUC. By using FIRST and ANY keywords, users can specify the method to
choose synchronous standbys from the listed servers.

FIRST num_sync (standby_name [, ...])
ANY num_sync (standby_name [, ...])

The keyword FIRST specifies a priority-based synchronous replication
which was available also in 9.6 or before. This method makes transaction
commits wait until their WAL records are replicated to num_sync
synchronous standbys chosen based on their priorities.

The keyword ANY specifies a quorum-based synchronous replication
and makes transaction commits wait until their WAL records are
replicated to *at least* num_sync listed standbys. In this method,
the values of sync_state.pg_stat_replication for the listed standbys
are reported as "quorum". The priority is still assigned to each standby,
but not used in this method.

The existing syntaxes having neither FIRST nor ANY keyword are still
supported. They are the same as new syntax with FIRST keyword, i.e.,
a priority-based synchronous replication.

Author: Masahiko Sawada
Reviewed-By: Michael Paquier, Amit Kapila and me
Discussion: <CAD21AoAACi9NeC_ecm+Vahm+MMA6nYh=Kqs3KB3np+MBOS_gZg@mail.gmail.com>

Many thanks to the various individuals who were involved in
discussing and developing this feature.

9.6 introduced the possibility to specify multiple synchronous standbys by extending the syntax of synchronous_standby_names. For example, a value like ‘N (standby_1,standby_2, … ,standby_M)’ allows a primary server to wait for commit confirmations from N standbys among the set of M nodes defined in the list given by the user, depending on the availability of the standbys at the moment of the transaction commit and their reported WAL positions for write, apply or flush. In this case, though, the standbys whose confirmation is waited for are chosen according to their order in the parameter’s list.

Being able to define quorum sets of synchronous standbys provides more flexibility in some availability scenarios. In short, it is possible to validate a commit after receiving a confirmation from N standbys, where those standbys can be any of the M nodes listed in synchronous_standby_names. This facility is useful, for example, in deployments where a primary has two or more standbys, as it brings more flexibility to the way synchronous standbys are chosen. Be careful, though: it is better to have low latency between the nodes, but there is nothing new here…

In order to support this new feature, and as mentioned in the commit message, the grammar of synchronous_standby_names has been extended with a set of keywords.

  • ANY maps to the quorum behavior, meaning that any node in the set can be used to confirm a commit.
  • FIRST maps to the 9.6 behavior, giving priority to the nodes listed first (those earlier in the list are assigned a higher priority).

Those can be used as follows:

# Quorum set of two nodes
any 2(node_1,node_2)
# Priority set of two nodes, with three standbys
first 1(node_1,node_2,node_3)

Note as well that not using any keyword means ‘first’, for backward compatibility, and that those keywords are case-insensitive.

One last thing to know is that pg_stat_replication marks the standbys in a quorum set with… ‘quorum’. For example let’s take a primary with two standbys node_1 and node_2.

=# ALTER SYSTEM SET synchronous_standby_names = 'ANY 2(node_1,node_2)';
ALTER SYSTEM
=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)

And here is how they show up to the user:

=# SELECT application_name, sync_priority, sync_state FROM pg_stat_replication;
 application_name | sync_priority | sync_state
------------------+---------------+------------
 node_1           |             1 | quorum
 node_2           |             2 | quorum
(2 rows)

Note that the priority number does not have much meaning for a quorum set, though it is useful to see it if a user is willing to switch from ‘ANY’ to ‘FIRST’, to understand which standbys would be considered synchronous after the switch (this is still subject to discussion on the community side, and may change by the release of Postgres 10).

Event sourcing in Django

By Yoong Kang Lim from Django community aggregator: Community blog posts. Published on Jan 05, 2017.

Django comes with "batteries included" to make CRUD (create, read, update, delete) operations easy. It's nice that the CR part (create and read) of CRUD is so easy, but have you ever paused to think about the UD part (update and delete)?

Let's look at delete. All you need to do is this:

ReallyImportantModel.objects.get(id=32).delete()  # gone from the database forever

Just one line, and your data is gone forever. It can be done accidentally. Or you can do it deliberately, only to realise later that your old data was valuable too.

Now what about updating?

Updating is deleting in disguise.

When you update, you're deleting the old data and replacing it with something new. It's still deletion.

important = ReallyImportantModel.object.get(id=32)
important.update(data={'new_data': 'This is new data'})  # OLD DATA GONE FOREVER

Okay, but why do we care?

Let's say we want to know the state of ReallyImportantModel 6 months ago. Oh that's right, you've deleted it, so you can't get it back.

Well, that's not exactly true -- you can recreate your data from backups (if you don't back up your database, stop reading right now and fix that immediately). But that's clumsy.

So by only storing the current state of the object, you lose all the contextual information on how the object arrived at this current state. Not only that, you make it difficult to make projections about the future.

Event sourcing 1 can help with that.

Event sourcing

The basic concept of event sourcing is this:

  • Instead of just storing the current state, we also store the events that lead up to the current state
  • Events are replayable. We can travel back in time to any point by replaying every event up to that point in time
  • That also means we can recover the current state just by replaying every event, even if the current state was accidentally deleted
  • Events are append-only.

To gain an intuition, let's look at an event sourcing system you're familiar with: your bank account.

Your "state" is your account balance, while your "events" are your transactions (deposit, withdrawal, etc.).

Can you imagine a bank account that only shows you the current balance?

That is clearly unacceptable ("Why do I only have $50? Where did my money go? If only I could see the history."). So we always store the history of transfers as the source of truth.

Implementing event sourcing in Django

Let's look at a few ways to do this in Django.

Ad-hoc models

If you have one or two important models, you probably don't need a generalizable event sourcing solution that applies to all models.

You could do it on an ad-hoc basis like this, if you can have a relationship that makes sense:

# in an app called 'account'
from django.db import models
from django.conf import settings


class Account(models.Model):
    """Bank account"""
    balance = models.DecimalField(max_digits=19, decimal_places=6)
    owner = models.ForeignKey(settings.AUTH_USER_MODEL, related_name='account')


class Transfer(models.Model):
    """
    Represents a transfer in or out of an account. A positive amount indicates
    that it is a transfer into the account, whereas a negative amount indicates
    that it is a transfer out of the account.
    """
    account = models.ForeignKey('account.Account', on_delete=models.PROTECT, 
                                related_name='transfers')
    amount = models.DecimalField(max_digits=19, decimal_places=6)
    date = models.DateTimeField()

In this case your "state" is in your Account model, whereas your Transfer model contains the "events".

Having Transfer objects makes it trivial to recreate any account.
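
For instance, a small helper along these lines (my own illustrative sketch, assuming the models above, not code from the post) could recompute an account's balance from its transfer history:

from django.db.models import Sum


def balance_from_transfers(account):
    """Recompute the balance of an Account purely from its Transfer events."""
    total = account.transfers.aggregate(total=Sum('amount'))['total']
    return total or 0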

Using an Event Store

You could also use a single Event model to store every possible event in any model. A nice way to do this is to encode the changes in a JSON field.

This example uses Postgres:

from django.contrib.contenttypes.fields import GenericForeignKey
from django.contrib.contenttypes.models import ContentType
from django.contrib.postgres.fields import JSONField
from django.db import models


class Event(models.Model):
    """Event table that stores all model changes"""
    content_type = models.ForeignKey(ContentType, on_delete=models.PROTECT)
    object_id = models.PositiveIntegerField()
    time_created = models.DateTimeField()
    content_object = GenericForeignKey('content_type', 'object_id')
    body = JSONField()

You can then add methods that mutate the state to any model:

import json

from django.conf import settings
from django.db import models
from django.utils import timezone


class Account(models.Model):
    balance = models.DecimalField(max_digits=19, decimal_places=6, default=0)
    owner = models.ForeignKey(settings.AUTH_USER_MODEL, related_name='account')

    def make_deposit(self, amount):
        """Deposit money into account"""
        Event.objects.create(
            content_object=self,
            time_created=timezone.now(),
            body=json.dumps({
                'type': 'made_deposit',
                'amount': float(amount),  # float() because Decimal isn't JSON-serializable
            })
        )
        self.balance += amount
        self.save()

    def make_withdrawal(self, amount):
        """Withdraw money from account"""
        Event.objects.create(
            content_object=self,
            time_created=timezone.now(),
            body=json.dumps({
                'type': 'made_withdrawal',
                'amount': float(-amount),  # withdraw = negative amount (float for JSON)
            })
        )
        self.balance -= amount
        self.save()

    @classmethod
    def create_account(cls, owner):
        """Create an account"""
        account = cls.objects.create(owner=owner, balance=0)
        Event.objects.create(
            content_object=account,
            time_created=timezone.now(),
            body=json.dumps({
                'type': 'created_account',
                'id': account.id,
                'owner_id': owner.id
            })
        )
        return account

So now you can do this:

account = Account.create_account(owner=User.objects.first())
account.make_deposit(decimal.Decimal(50.0))
account.make_deposit(decimal.Decimal(125.0))
account.make_withdrawal(decimal.Decimal(75.0))

events = Event.objects.filter(
    content_type=ContentType.objects.get_for_model(account), 
    object_id=account.id
)

for event in events:
    print(event.body)

Which should give you this:

{"type": "created_account", "id": 2, "owner_id": 1}
{"type": "made_deposit", "amount": 50.0}
{"type": "made_deposit", "amount": 50}
{"type": "made_deposit", "amount": 150}
{"type": "made_deposit", "amount": 200}
{"type": "made_withdrawal", "amount": -75}

Again, this makes it trivial to write any utility methods to recreate any instance of Account, even if you accidentally dropped the whole accounts table.
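
As an illustrative sketch (not from the post), a replay helper over the Event store above might look like this; the function name, import path and the isinstance guard are my own assumptions:

import json

from django.contrib.contenttypes.models import ContentType

from myapp.models import Event  # wherever the Event model above lives (hypothetical path)


def balance_from_events(account):
    """Recompute an account balance by replaying its stored events."""
    events = Event.objects.filter(
        content_type=ContentType.objects.get_for_model(account),
        object_id=account.id,
    ).order_by('time_created')

    balance = 0
    for event in events:
        # The post stores JSON-encoded strings in the JSONField, so decode if needed.
        body = event.body if isinstance(event.body, dict) else json.loads(event.body)
        if body['type'] in ('made_deposit', 'made_withdrawal'):
            balance += body['amount']  # withdrawals are stored as negative amounts
    return balance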

Snapshotting

There will come a time when you have too many events to efficiently replay the entire history. In this case, a good optimisation step is to take snapshots at various points in history. For example, in our accounting example one could periodically save the state of each account in an AccountBalance model, which records the account's balance at a point in time.

You could do this via a scheduled task. Celery 2 is a good option.
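
A minimal sketch of what that could look like, assuming the Account model above lives in an app called 'account' (the AccountBalance model and snapshot_all_accounts task are illustrative names, not from the post):

from django.db import models
from django.utils import timezone

from account.models import Account  # assumed app/module from the example above


class AccountBalance(models.Model):
    """Snapshot of an account's balance at a point in time."""
    account = models.ForeignKey('account.Account', on_delete=models.PROTECT,
                                related_name='snapshots')
    balance = models.DecimalField(max_digits=19, decimal_places=6)
    taken_at = models.DateTimeField()


def snapshot_all_accounts():
    """Run periodically, e.g. from a Celery beat schedule."""
    now = timezone.now()
    for account in Account.objects.all():
        AccountBalance.objects.create(account=account,
                                      balance=account.balance,
                                      taken_at=now)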

Summary

Use event sourcing to maintain an append-only list of events for your critical data. This effectively allows you to travel in time to any point in history to see the state of your data at that time.


  1. Martin Fowler wrote a detailed description of event sourcing in his website here: http://martinfowler.com/eaaDev/EventSourcing.html 

  2. Celery project. http://www.celeryproject.org/ 

Christophe Pettus: Estimates “stuck” at 200 rows?

From Planet PostgreSQL. Published on Jan 04, 2017.

So, what’s weird about this plan, from a query on a partitioned table? (PostgreSQL 9.3, in this case.)

test=> explain select distinct id from orders where order_timestamp > '2016-05-01';
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=347341.56..347343.56 rows=200 width=10)
   Group Key: orders.id
   ->  Append  (cost=0.00..337096.10 rows=4098183 width=10)
         ->  Seq Scan on orders  (cost=0.00..0.00 rows=1 width=178)
               Filter: (order_timestamp > '2016-05-01 00:00:00'::timestamp without time zone)
         ->  Index Scan using orders_20160425_order_timestamp_idx on orders_20160425  (cost=0.43..10612.30 rows=120838 width=10)
               Index Cond: (order_timestamp > '2016-05-01 00:00:00'::timestamp without time zone)
         ->  Seq Scan on orders_20160502  (cost=0.00..80539.89 rows=979431 width=10)
               Filter: (order_timestamp > '2016-05-01 00:00:00'::timestamp without time zone)
         ->  Seq Scan on orders_20160509  (cost=0.00..74780.41 rows=909873 width=10)
               Filter: (order_timestamp > '2016-05-01 00:00:00'::timestamp without time zone)
         ->  Seq Scan on orders_20160516  (cost=0.00..68982.25 rows=845620 width=10)
               Filter: (order_timestamp > '2016-05-01 00:00:00'::timestamp without time zone)
         ->  Seq Scan on orders_20160523  (cost=0.00..65777.68 rows=796054 width=10)
               Filter: (order_timestamp > '2016-05-01 00:00:00'::timestamp without time zone)
         ->  Seq Scan on orders_20160530  (cost=0.00..36403.57 rows=446366 width=10)
               Filter: (order_timestamp > '2016-05-01 00:00:00'::timestamp without time zone)
(17 rows)

That estimate on the HashAggregate certainly looks wonky, doesn’t it? Just 200 rows even with a huge number of rows below it?

What if we cut down the number of partitions being hit?

test=> explain select distinct id from orders where order_timestamp > '2016-05-01' and order_timestamp < '2016-05-15';
                                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=169701.02..169703.02 rows=200 width=10)
   Group Key: orders.id
   ->  Append  (cost=0.00..165026.92 rows=1869642 width=10)
         ->  Seq Scan on orders  (cost=0.00..0.00 rows=1 width=178)
               Filter: ((order_timestamp > '2016-05-01 00:00:00'::timestamp without time zone) AND (order_timestamp < '2016-05-15 00:00:00'::timestamp without time zone))
         ->  Index Scan using orders_20160425_order_timestamp_idx on orders_20160425  (cost=0.43..10914.39 rows=120838 width=10)
               Index Cond: ((order_timestamp > '2016-05-01 00:00:00'::timestamp without time zone) AND (order_timestamp < '2016-05-15 00:00:00'::timestamp without time zone))
         ->  Seq Scan on orders_20160502  (cost=0.00..82988.46 rows=979431 width=10)
               Filter: ((order_timestamp > '2016-05-01 00:00:00'::timestamp without time zone) AND (order_timestamp < '2016-05-15 00:00:00'::timestamp without time zone))
         ->  Index Scan using orders_20160509_order_timestamp_idx on orders_20160509  (cost=0.42..71124.06 rows=769372 width=10)
               Index Cond: ((order_timestamp > '2016-05-01 00:00:00'::timestamp without time zone) AND (order_timestamp < '2016-05-15 00:00:00'::timestamp without time zone))
(11 rows)

Still 200 exactly. OK, that’s bizarre. Let’s select exactly one partition:

test=> explain select distinct id from orders where order_timestamp > '2016-05-14' and order_timestamp < '2016-05-15';
                                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=14669.26..14671.26 rows=200 width=10)
   Group Key: orders.id
   ->  Append  (cost=0.00..14283.05 rows=154481 width=10)
         ->  Seq Scan on orders  (cost=0.00..0.00 rows=1 width=178)
               Filter: ((order_timestamp > '2016-05-14 00:00:00'::timestamp without time zone) AND (order_timestamp < '2016-05-15 00:00:00'::timestamp without time zone))
         ->  Index Scan using orders_20160509_order_timestamp_idx on orders_20160509  (cost=0.42..14283.05 rows=154480 width=10)
               Index Cond: ((order_timestamp > '2016-05-14 00:00:00'::timestamp without time zone) AND (order_timestamp < '2016-05-15 00:00:00'::timestamp without time zone))
(7 rows)

Still 200 exactly. What happens if we select from the child directly?

test=> explain select distinct id from orders_20160509;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 HashAggregate  (cost=74780.41..75059.51 rows=27910 width=10)
   Group Key: id
   ->  Seq Scan on orders_20160509  (cost=0.00..72505.73 rows=909873 width=10)
(3 rows)

A much more reasonable estimate. So, what’s going on?

That 200 should be something of a flag, as that’s a compiled-in constant that PostgreSQL uses when it doesn’t have ndistinct information for a particular table, usually because there are no statistics collected on a table.

In this case, the issue was that an ANALYZE had never been done on the parent table. This isn’t surprising: autovacuum would never hit that table, since (like most parent tables in a partition set) it has no rows and is never updated or inserted into. That lack of information gets passed up through the Append node, and the HashAggregate just uses the default of 200.

Sure enough, when the parent table was ANALYZE’d, the estimates became much more reasonable.

So: It can pay to do an initial ANALYZE on a newly created partitioned table so that the planner gets statistics for the parent table, even if those statistics are “no rows here.”

David Rader: Thank you PostgreSQL for Secure Defaults

From Planet PostgreSQL. Published on Jan 04, 2017.

One of the most common questions from new PostgreSQL users is “how do I connect to the database server?” The nuances of pg_hba.conf and how to correctly enable your Python web app to connect to your db server without opening up connections to all users and all servers is not simple for someone new to Postgres. And architecting a secure multi-tenant SaaS system requires knowledge of how roles, schemas, databases, and search paths interact. That’s one reason we wrote a Security Whitepaper a while back.

But, after seeing thousands of MongoDB instances taken hostage by ransomware the “no authorization required” default for MongoDB is looking like a very dumb idea. Just imagine what executives whose developers picked MongoDB are saying today:

“You mean we store our client list in a database without security?”

“Anyone can just delete our NoSQL database from the internet?”

“Were we hacked last year when you said we lost data in MongoDB?”

So, a quick “Thank You” to PostgreSQL for making sure that your data is Secure By Default.

Simon Riggs: PostgreSQL’s Popularity Goes Up Again

From Planet PostgreSQL. Published on Jan 04, 2017.

Mirror mirror on the wall,
Who is the fairest Database of all?

A frequently asked question, certainly.

DB-Engines recently announced its DBMS of the Year. Maybe the cool thing is that PostgreSQL is in 3rd place. Yee-ha, an open source project is up there!

Let’s look closely at what this means.

PostgreSQL.org’s agreed response was this…

“It’s great to see the continued success of PostgreSQL being reflected in DB-Engines rankings. It’s clear that the strength of the following for the World’s Most Advanced Open Source Database is enough to outweigh the largest software companies as people continue to choose to move away from commercial databases.”

though because of commercial sensitivity this was changed down to this

“It’s great to see the continued success of PostgreSQL being reflected in DB-Engines rankings. It’s clear that the strength of the following for the World’s Most Advanced Open Source Database is enough to draw people away from commercial databases.”

What were the commercial sensitivities? (What about “open source sensitivities”? Well, blame me, cos I agreed the change.)

Well, the title of the post is that a Microsoft product is actually DBMS of the Year, even though it’s not ranked #1 on the main list; that’s still Oracle. And Postgres is #3 for DBMS of the Year, even though we moved through to #5 again, competing with MongoDB for position 4 (although mostly level).

My guess is that Microsoft would like to highlight how it gets more press than PostgreSQL, a point I would concede in an instant. Whether that means it is more popular or has better features is a different thing entirely. People are simply leaving commercial databases in droves to come to PostgreSQL, which is clearly reflected in the very public decline of Oracle licensing revenues over the last 10 quarters, and I’m sure it’s just the same for Microsoft revenue.

The purpose of the announcement from PostgreSQL.org was to highlight that “the strength of the following for the World’s Most Advanced Open Source Database is enough to outweigh the largest software companies”, though my conclusion is that we are not YET in a position to do that. A larger marketing budget does still buy a larger audience. Real-world usage does still show PostgreSQL adoption increasing at an amazing rate. And our technology continues to set the pace of feature development that other databases would like to achieve.

Number 3 means we’re on the list. We can discuss exactly what place we’re at, but it’s enough to put us on the short list for every major technology decision, worldwide. And when people see the feature list, price and responsive support, the effect is compelling.

Anyway, ain’t no such thing as bad publicity, so we’re all happy.

Thanks very much to DBEngines for mentioning PostgreSQL in this post…
http://db-engines.com/en/blog_post/67

Anyway, I do thank Microsoft for continuing to support PostgreSQL in its framework and driver.


Bruce Momjian: Controlling Autovacuum

From Planet PostgreSQL. Published on Jan 03, 2017.

Unlike other database systems, Postgres makes the cleanup process visible and tunable to users. Autovacuum performs recycling of old rows and updates optimizer statistics. It appears in ps command output, the pg_stat_activity system view, and optionally in the server logs via log_autovacuum_min_duration.

Postgres also allows fine-grained control over the autovacuum cleanup process. Occasionally users find that cleanup is slowing the system down, and rather than modifying the behavior of autovacuum, they decide to turn it off via the autovacuum setting.

However, turning off autovacuum can cause problems. Initially the system will run faster since there is no cleanup overhead, but after a while old rows will clog up user tables and indexes, leading to increasing slowness. Once that happens, you can turn on autovacuum again, and it will recycle the old rows and free up space, but there will be much unused space that can't be reused quickly, or perhaps ever.

Continue Reading »


Magnus Hagander: Financial updates in PostgreSQL Europe

From Planet PostgreSQL. Published on Jan 02, 2017.

As we say welcome to a new year, we have a couple of updates to the finances and payment handling in PostgreSQL Europe, that will affect our members and attendees of our events.

First of all, PostgreSQL Europe has unfortunately been forced to VAT register. This means that most of our invoices (details below) will now include VAT.

Second, we have enabled a new payment provider for those of you that can't or prefer not to use credit cards but that still allows for fast payments.

Magnus Hagander: Mail agents in the PostgreSQL community

From Planet PostgreSQL. Published on Jan 01, 2017.

A few weeks back, I noticed the following tweet from Michael Paquier:

[embedded tweet]

And my first thought was "that can't be right" (spoiler: Turns out it wasn't. But almost.)

The second thought was "hmm, I wonder how that has actually changed over time". And of course, with today being a day off and generally "slow pace" (ahem), what better way to spend it than to analyze the data that we have. The PostgreSQL mailing list archives are all stored in a PostgreSQL database of course, so running the analytics is a quick job.

Django - Database access optimization

By Micropyramid django from Django community aggregator: Community blog posts. Published on Dec 30, 2016.

Django QuerySets are generally lazy in nature: they will not hit the database until the query results are evaluated.

Example:

queryset = User.objects.all() # It won't hit the database
print (queryset)  # Now, ORM turns the query into raw sql  and fetches results from the database

1.Caching and QuerySets

Generally, Django stores the query results in the QuerySet's cache when it fetches them for the first time.

Example: get lists of all users' first names and emails from the database.

first_names_list = [ user.first_name for user in User.objects.all()]
emails_list = [ user.email for user in User.objects.all()]

The above code hits the database twice. To avoid the extra request to the database, we can use Django's queryset cache in the following way:

users_list = User.objects.all()    # No database activity
first_names_list =  [user.first_name for user in users_list]    # Hits the database and stores the results in the cache
emails_list = [user.email for user in users_list]    # uses the results from the cache.

Another way is:

users_list = User.objects.all()
db = dict(users_list.values_list("first_name", "email"))
first_names_list, emails_list = db.keys(), db.values()

Note: querysets are not cached if the query is not evaluated.
Example: if you want to take a subset/part of the query results:

queryset = Users.objects.all()
first_five = queryset[:5]    # Hits the database
first_ten = queryset[:10]    # Hits the database

If you want to use the cache in the above situation, you can do it in the following way:

queryset = User.objects.all()
bool(queryset)  # queryset is evaluated  and results are cached
first_five = queryset[:5]    #  uses the cache
first_ten = queryset[:10]    # uses the cache

2. Complex Database Queries with "Q" objects

Django combines multiple conditions in filter() with an "AND" clause.
Example:

User.objects.filter(email__contains="adam", first_name="Adam")
# equivalent SQL query
SELECT * FROM user_table WHERE email LIKE '%adam%' AND first_name = 'Adam';

If you want to filter users whose email starts with "an" or "sa":
Simple Example:

users_list = User.objects.all()
filtered_users = []
for user in users_list:
    if user.email.startswith("an")  or  user.email.startswith("sa"):
        filtered_users.append(user)

The above example fetches all the records from the table. Instead, we can fetch only the records we need by using "Q" objects.
Example:

from django.db.models import Q
users_list = User.objects.filter(Q(email__startswith="an") | Q(email__startswith="sa"))
# equivalent SQL query
SELECT * FROM user_table WHERE email LIKE 'an%' OR email LIKE 'sa%';

Q object usage:

~Q(email__startswith="an")  # email does not start with "an"
# SQL equivalents of the Q operators
|  = OR
& = AND
~ = NOT

We can use parentheses with "Q" objects.
Example:

Model.objects.filter((Q(key1="value1") & ~Q(key2="value2")) | (~Q(key3="value")))

3. Create Multiple Objects at once with Bulk Create

Simple Example: To create 4 users

users_details = [
        ("Dinesh", "dinesh@micropyramid.com"),
        ("Ravi", "ravi@micropyramid.com"),
        ("Santharao", "santharao@micropyramid.com"),
        ("Shera", "shera@micropyramid.com")
]
for first_name, email in users_details:
    User.objects.create(first_name=first_name, email=email)

The above example hits the database 4 times to create 4 users, but we can create all 4 users with a single database hit.
Example:

instance_list = [User(first_name=first_name, email=email) for first_name, email in users_details]
User.objects.bulk_create(instance_list)

4. Update Multiple Objects or a filtered Queryset at once with update

Let us consider the following Employee model:

from django.db import models

DESIGNATIONS = (("Designer", "Designer"),
                ("Developer", "Developer"),
                ("HR", "HR"))

class Employee(models.Model):
    name = models.CharField(max_length=30)
    email = models.EmailField(max_length=30)
    designation = models.CharField(max_length=30, choices=DESIGNATIONS)
    salary = models.DecimalField(max_digits=10, decimal_places=2, default=20000)
 

After one year, the manager wants to increase the salary of all Developers by 5000.

Simple Example:

developers = Employee.objects.filter(designation="Developer")
for developer in developers:
    developer.salary = developer.salary + 5000
    developer.save()

The above example hits the database several times (once per developer). We can do it with a single database hit.

Example:

from django.db.models import F

amount = 5000
developers = Employee.objects.filter(designation="Developer")
developers.update(salary=F("salary")+amount)

5. Select only the required fields from the database to decrease query time

We use this when we need only the data (i.e. field values) and don't need access to model methods or related objects. We can do this by using QuerySet.values() and QuerySet.values_list().

QuerySet.values() returns a list of dictionaries; each dictionary represents an object instance.
QuerySet.values_list() returns a list of tuples; each tuple represents an object instance, with values ordered as the fields are defined in the model (id first).

from .models import Employee
# create 2 objects
Employee.objects.create(name="Ravi", email="ravi@micropramid.com", designation="Developer", salary=30000)
Employee.objects.create(name="Santharao", email="santharao@micropramid.com", designation="Designer", salary=40000)

queryset = Employee.objects.all()    # you can also use filter 
# Example for  QuerySet.values()
users_dict_list = queryset.values()
# above line is equivalent to
users_dict_list = [
    {
        "id": 1,
        "name": "Ravi",
        "email": "ravi@micropramid.com",
        "designation": "Developer",
        "salary": 30000
     },
    {
        "id": 2,
        "name": "Santharao",
        "email": "santharao@micropramid.com",
        "designation": "Designer",
        "salary": 40000
     },
]
# To get only the required fields, i.e. "name" and "salary"
users_dict_list = queryset.values("name", "salary")
# above line is equivalent to
users_dict_list = [
    {
        "name": "Ravi",
        "salary": 30000
     },
    {
        "name": "Santharao",
        "salary": 40000
     },
]
# Example for  QuerySet.values_list()
users_tuple_list = queryset.values_list()
# above line is equivalent to
users_tuple_list = [
     (1, "Ravi", "ravi@micropramid.com", "Developer", 30000),
     (2, "Santharao", "santharao@micropramid.com", "Designer", 40000),
]
# To get only the required fields, i.e. "name" and "salary"
users_tuple_list = queryset.values_list("name", "salary")
# above line is equivalent to
users_tuple_list = [
     ("Ravi", 30000),
     ("Santharao", 40000),
]
# We can also get a list of values of a single field by passing flat=True
users_names_list = queryset.values_list("name", flat=True)     # flat=True works only with a single field
# above line is equivalent to
users_names_list = ["Ravi", "Santharao"]
 

6. Don't hit the database again for each related object. Fetch related objects up front using select_related and prefetch_related.

Let us consider the following models:
from django.db import models

class Address(models.Model):
    city = models.CharField(max_length=100)
    state = models.CharField(max_length=100)
    pin = models.CharField(max_length=100)

class Person(models.Model):
    name = models.CharField(max_length=100)
    email = models.EmailField(max_length=100)
    # related_name avoids reverse-accessor clashes between the two foreign keys
    present_address = models.ForeignKey(Address, related_name="present_residents")
    previous_address = models.ForeignKey(Address, related_name="previous_residents")

class Book(models.Model):
    name = models.CharField(max_length=100)
    author = models.ForeignKey(Person, related_name="books_written")
    publishers = models.ManyToManyField(Person, related_name="books_published")

Usage of select_related

# without select related
person = Person.objects.get(id=1)
present_address = person.present_address  # Hits the database.
previous_address = person.previous_address  # Hits the database.
# total database hits = 3
# with select related
person = Person.objects.select_related().get(id=1)
present_address = person.present_address # Doesn't hit the database.
previous_address = person.previous_address # Doesn't hit the database.
# total database hits = 1
# you can also select the specific related objects
person = Person.objects.select_related("present_address").get(id=1)
present_address = person.present_address # Doesn't hit the database.
previous_address = person.previous_address # Hits the database.

Limitations of select_related

select_related works by creating an SQL join and including the fields of the related object in the SELECT statement. For this reason, select_related gets the related objects in the same database query. However, to avoid the much larger result set that would result from joining across a ‘many’ relationship, select_related is limited to single-valued relationships - foreign key and one-to-one.

Usage of prefetch_related

# without prefetch_related
book = Book.objects.get(id=1)
author = book.author                       # Hits the database.
publishers = list(book.publishers.all())   # Hits the database.
# total database hits = 3
# with prefetch_related
book = Book.objects.prefetch_related("author", "publishers").get(id=1)
author = book.author                       # Doesn't hit the database again.
publishers = list(book.publishers.all())   # Doesn't hit the database again.
# the related objects are fetched up front, in one extra query per relationship
# you can also prefetch specific relationships only
book = Book.objects.prefetch_related("publishers").get(id=1)
author = book.author                       # Hits the database.
publishers = list(book.publishers.all())   # Doesn't hit the database again.

Advantage over select_related

prefetch_related does a separate lookup for each relationship, and does the ‘joining’ in Python. prefetch_related allows it to prefetch many-to-many and many-to-one objects, which cannot be done using select_related, in addition to the foreign key and one-to-one relationships that are supported by select_related. It also supports prefetching of GenericRelation and GenericForeignKey
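
The payoff is clearest when you iterate over many objects rather than fetching a single one. A small illustrative sketch using the Book model above (not from the original post):

# one query for the books, plus one query per book for its publishers (N+1)
for book in Book.objects.all():
    print(book.name, [p.name for p in book.publishers.all()])

# two queries in total: one for the books, one batched query for all their publishers
for book in Book.objects.prefetch_related("publishers"):
    print(book.name, [p.name for p in book.publishers.all()])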

Limitations of prefetch_related

Prefetching of related objects referenced by a GenericForeignKey is only supported if the query is restricted to one ContentType.

7. Use queryset.count() if you only need the count, not the queryset objects themselves

count = Book.objects.filter(author_id=5).count()   # It returns only the count (number of records), so the operation is very fast.

8. Use queryset.exists() if you only need to know whether matching objects exist

is_exists = Book.objects.filter(author_id=5).exists()   # It returns only a boolean (True/False), so the operation is very fast.

9. Add database indexes to frequently queried fields, and provide a default ordering

If a model is accessed very frequently, add indexes to the corresponding database columns.

class Book(models.Model):
    name = models.CharField(max_length=100, db_index=True)
    author = models.ForeignKey(Person, related_name="books_written")
    published_on = models.DateField()
    publishers = models.ManyToManyField(Person, related_name="books_published")

    class Meta:
        index_together = ["author", "published_on"]
        ordering = ['-published_on']

Reference:  https://docs.djangoproject.com/en/1.9/topics/db/optimization/

Christophe Pettus: The Multi-Column Index of the Mysteries

From Planet PostgreSQL. Published on Dec 30, 2016.

The one thing that everyone knows about composite indexes is: If you have an index on (A, B, C), it can’t be used for queries on (B) or (B, C) or (C), just (A), (A, B) or (A, B, C), right? I’ve said that multiple times in talks. It’s clearly true, right?

Well, no, it’s not. It’s one of those things that is not technically true, but it is still good advice.

The documentation on multi-column indexes is pretty clear:

A multicolumn B-tree index can be used with query conditions that involve any subset of the index’s columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned.

Let’s try this out!

First, create a table and index:

xof=# CREATE TABLE x ( 
xof(#     i integer,
xof(#     f float,
xof(#     g float
xof(# );
CREATE TABLE
xof=# CREATE INDEX ON x(i, f, g);
CREATE INDEX

And fill it with some test data:

xof=# INSERT INTO x SELECT 1, random(), random() FROM generate_series(1, 10000000);
INSERT 0 10000000
xof=# INSERT INTO x SELECT 2, random(), random() FROM generate_series(1, 10000000);
INSERT 0 10000000
xof=# INSERT INTO x SELECT 3, random(), random() FROM generate_series(1, 10000000);
INSERT 0 10000000
xof=# ANALYZE x;
ANALYZE

And away we go!

xof=# EXPLAIN ANALYZE SELECT SUM(g) FROM x WHERE f BETWEEN 0.11 AND 0.12;
                                                                   QUERY PLAN                                                                   
------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=599859.50..599859.51 rows=1 width=8) (actual time=91876.057..91876.057 rows=1 loops=1)
   ->  Index Only Scan using x_i_f_g_idx on x  (cost=0.56..599097.71 rows=304716 width=8) (actual time=1820.699..91652.409 rows=300183 loops=1)
         Index Cond: ((f >= '0.11'::double precision) AND (f <= '0.12'::double precision))
         Heap Fetches: 300183
 Planning time: 3.384 ms
 Execution time: 91876.165 ms
(6 rows)

And sure enough, it uses the index, even though we didn’t include column i in the query. In this case, the planner thinks that this will be more efficient than just doing a sequential scan on the whole table, even though it has to walk the whole index.

Is it right? Let’s turn off index scans and find out.

xof=# SET enable_indexonlyscan = 'off';
SET
xof=# EXPLAIN ANALYZE SELECT SUM(g) FROM x WHERE f BETWEEN 0.11 AND 0.12;
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=599859.50..599859.51 rows=1 width=8) (actual time=39691.081..39691.081 rows=1 loops=1)
   ->  Index Scan using x_i_f_g_idx on x  (cost=0.56..599097.71 rows=304716 width=8) (actual time=1820.676..39624.144 rows=300183 loops=1)
         Index Cond: ((f >= '0.11'::double precision) AND (f <= '0.12'::double precision))
 Planning time: 0.181 ms
 Execution time: 39691.128 ms
(5 rows)

PostgreSQL, you’re not helping!

xof=# SET enable_indexscan = 'off';
SET
xof=# EXPLAIN ANALYZE SELECT SUM(g) FROM x WHERE f BETWEEN 0.11 AND 0.12;
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=689299.60..689299.61 rows=1 width=8) (actual time=40593.427..40593.428 rows=1 loops=1)
   ->  Bitmap Heap Scan on x  (cost=513444.70..688537.81 rows=304716 width=8) (actual time=37901.773..40542.900 rows=300183 loops=1)
         Recheck Cond: ((f >= '0.11'::double precision) AND (f <= '0.12'::double precision))
         Rows Removed by Index Recheck: 8269763
         Heap Blocks: exact=98341 lossy=53355
         ->  Bitmap Index Scan on x_i_f_g_idx  (cost=0.00..513368.52 rows=304716 width=0) (actual time=37860.366..37860.366 rows=300183 loops=1)
               Index Cond: ((f >= '0.11'::double precision) AND (f <= '0.12'::double precision))
 Planning time: 0.160 ms
 Execution time: 40593.764 ms
(9 rows)

Ugh, fine!

xof=# SET enable_bitmapscan='off';
xof=# EXPLAIN ANALYZE SELECT SUM(g) FROM x WHERE f BETWEEN 0.11 AND 0.12;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=641836.33..641836.34 rows=1 width=8) (actual time=27270.666..27270.666 rows=1 loops=1)
   ->  Seq Scan on x  (cost=0.00..641074.54 rows=304716 width=8) (actual time=0.081..27195.552 rows=300183 loops=1)
         Filter: ((f >= '0.11'::double precision) AND (f <= '0.12'::double precision))
         Rows Removed by Filter: 29699817
 Planning time: 0.157 ms
 Execution time: 27270.726 ms
(6 rows)

It turns out the seq scan is faster, which isn’t that much of a surprise. Of course, what’s really fast is using the index properly:

xof=# -- reset all query planner settings
xof=# EXPLAIN ANALYZE SELECT SUM(g) FROM x WHERE i IN (1, 2, 3) AND f BETWEEN 0.11 AND 0.12;
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=92459.82..92459.83 rows=1 width=8) (actual time=6283.162..6283.162 rows=1 loops=1)
   ->  Index Only Scan using x_i_f_g_idx on x  (cost=0.56..91698.03 rows=304716 width=8) (actual time=1.295..6198.409 rows=300183 loops=1)
         Index Cond: ((i = ANY ('{1,2,3}'::integer[])) AND (f >= '0.11'::double precision) AND (f <= '0.12'::double precision))
         Heap Fetches: 300183
 Planning time: 1.264 ms
 Execution time: 6283.567 ms
(6 rows)

And, of course, a dedicated index for that particular operation is the fastest of all:

xof=# CREATE INDEX ON x(f);
CREATE INDEX
xof=# EXPLAIN ANALYZE SELECT SUM(g) FROM x WHERE f BETWEEN 0.11 AND 0.12;
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=188492.00..188492.01 rows=1 width=8) (actual time=5536.940..5536.940 rows=1 loops=1)
   ->  Bitmap Heap Scan on x  (cost=4404.99..187662.16 rows=331934 width=8) (actual time=209.854..5466.633 rows=300183 loops=1)
         Recheck Cond: ((f >= '0.11'::double precision) AND (f <= '0.12'::double precision))
         Rows Removed by Index Recheck: 8258716
         Heap Blocks: exact=98337 lossy=53359
         ->  Bitmap Index Scan on x_f_idx  (cost=0.00..4322.00 rows=331934 width=0) (actual time=163.402..163.402 rows=300183 loops=1)
               Index Cond: ((f >= '0.11'::double precision) AND (f <= '0.12'::double precision))
 Planning time: 5.586 ms
 Execution time: 5537.235 ms
(9 rows)

Although, interestingly enough, PostgreSQL doesn’t quite get it right here:

xof=# SET enable_bitmapscan='off';
SET
xof=# EXPLAIN ANALYZE SELECT SUM(g) FROM x WHERE f BETWEEN 0.11 AND 0.12;
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=203875.29..203875.30 rows=1 width=8) (actual time=2178.215..2178.216 rows=1 loops=1)
   ->  Index Scan using x_f_idx on x  (cost=0.56..203045.45 rows=331934 width=8) (actual time=0.161..2110.903 rows=300183 loops=1)
         Index Cond: ((f >= '0.11'::double precision) AND (f <= '0.12'::double precision))
 Planning time: 0.170 ms
 Execution time: 2178.279 ms
(5 rows)

So, we conclude:

  • Yes, PostgreSQL will sometimes use the second and further columns of a multi-column index, even if the first column isn’t used in the query.
  • This is rarely optimal, so it should not be relied on as an optimization path.
  • So, while the advice was not correct in the absolute statement, it was still valid as advice.

And there we are.

xof=# DROP TABLE x;
DROP TABLE

Ernst-Georg Schmid: One-time passwords with Google Authenticator PAM (and friends)

From Planet PostgreSQL. Published on Dec 30, 2016.

PostgreSQL allows for more than plain password authentication in pg_hba.conf. One of the most flexible is authenticating against a PAM.

Let's see how this works with one-time passwords from  Google Authenticator.

1.) Install Google Authenticator on your Android or iOS device.

2.) Install the Google Authenticator PAM on the machine where your PostgreSQL server lives, like in Step 1 - 4 of this guide.

3.) Connect your device with the account on that machine.

4.) Configure a PAM service for PostgreSQL. E.g. create a file named postgresql where your PAM configs live, on Ubuntu this is /etc/pam.d/. The file should look like this:

auth         sufficient     pam_google_authenticator.so

5.) Configure PostgreSQL to use the PAM. E.g. a line in pg_hba.conf could look like this:

hostssl    all    all    127.0.0.1/32   pam    pamservice=postgresql

And that's basically it. Now, next time you login, PostgreSQL will ask you for a password that is generated individually on your device.

Of course you can use all kinds of PAM with PostgreSQL like this.

Unfortunately, I also found a few caveats along the way. :-(

First, PostgreSQL clients will ask for only one password, regardless of whether you chain n PAMs for n-factor authentication.

So if you e.g. chain a PAM against LDAP with Google Authenticator as the second factor, this won't work. This seems to be a shortcoming of the PAM implementation in PostgreSQL, not expecting multiple password prompts. It is still possible to enable n-factor authentication though, but only one PAM can prompt for a password. If the other factors are hardware devices like a fingerprint scanner that does not prompt for a password, you are fine.

Alternatively, you can provide your own PAM that takes all passwords in one prompt and handles them internally.

Second, PAM requires PostgreSQL clients to send the password in plaintext. So now is the time to switch on TLS and make it mandatory (Noticed the hostssl switch above?).

Third, some clients like pgAdmin3 break with one-time passwords, because they apparently open new connections without prompting for a password again, but re-use the initial one instead until you disconnect. This obviously does not work with passwords which are valid only for one login attempt.

Fourth, if your PAM requires a physical account on the machine, but you want to map it to a different PostgreSQL user, pg_ident.conf is your friend.

Christophe Pettus: A Cheap and Cheerful Replication Check

From Planet PostgreSQL. Published on Dec 28, 2016.

On a PostgreSQL primary / secondary pair, it’s very important to monitor replication lag. Increasing replication lag is often the first sign of trouble, such as a network issue, the secondary disconnecting for some reason (or for no reason at all, which does happen rarely), disk space issues, etc.

You can find all kinds of complex scripts that do math on the various WAL positions that are available from the secondary and from pg_stat_replication on the primary.

Or you can do this. It’s very cheap and cheerful, and for many installations, it gets the job done.

First, on the primary (and thus on the secondary), we create a one-column table:

CREATE TABLE replication_monitor (
   last_update TIMESTAMPTZ
);

Then, we insert a single row into the table (you can probably already see where this is going):

INSERT INTO replication_monitor VALUES(now());

Having that, we can start a cron job that runs every minute, updating that value:

* * * * * /usr/bin/psql -U postgres -c "update replication_monitor set last_update=now()" postgres > /dev/null

On the secondary (which is kept in sync with the primary via NTP, so make sure ntpd is running on both!), we have a script, also run from cron, that complains if the value has fallen more than a certain amount behind now(). Here’s a (pretty basic) Python 2 version:

#!/usr/bin/python

import sys

import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres")

cur = conn.cursor()

cur.execute("select (now()-last_update)>'5 minutes'::interval from replication_monitor")

problem = cur.fetchone()[0]

if problem:
    print >>sys.stderr, "replication lag over 5 minutes."

We make sure we get the output from stderr for cron jobs on the secondary, set it up to run every so often, and we’re done!
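
If the secondary runs Python 3, an equivalent sketch of the same check (same table, threshold, and connection assumptions as above) would be:

#!/usr/bin/python3

import sys

import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres")
cur = conn.cursor()
cur.execute("select (now()-last_update)>'5 minutes'::interval from replication_monitor")

if cur.fetchone()[0]:
    print("replication lag over 5 minutes.", file=sys.stderr)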

Of course, this has its limitations:

  • It only has ±2 minutes of resolution, based on how often cron runs. For a basic “is replication working?” check, this is probably fine.

  • It creates traffic in the replication stream and WAL, but if you are really worried about a TIMESTAMPTZ’s worth of update once a minute, you probably have other things to worry about.

It has the advantage that it works even if the server is otherwise not taking traffic, since it creates traffic all by itself.

As a final note, check_postgres has this check integrated as one of the (many, many) checks it can do, as the replicate_row check. If you are using check_postgres, by all means use that one!

How to Implement Token Authentication with Django REST Framework

By Chris Bartos from Django community aggregator: Community blog posts. Published on Dec 27, 2016.

Token Authentication seems to be an Authentication Scheme that gives people the most trouble. The reason appears to be a misunderstanding not so much of how to implement it, but of how to actually use it.

For example, the Django REST Framework documentation says that for every request, you have to add an Authorization Header to your requests. But, how do you create that Authorization header?

Also…

How do I use Token Authentication to authorize external clients? These are serious questions that I’ve seen from people just like you. But, the difference is that after reading this post, you’ll understand how that works. I’ll show you how to implement Token Authentication using Django REST Framework. Then, I’ll give you a sample application that uses Token Authentication to authenticate users within a Django Application. Best of all, you can take what you learn from my sample code to use in external clients (Javascript clients) to authenticate / authorize users with your RESTful API.

WARNING: NEVER, EVER use Token Authentication without using Secure HTTP (HTTPS). If you want to run my sample code or play around with your own code locally it is OKAY to use Unsecure HTTP (HTTP). However, NEVER do it in production. You’ve been warned.

Introduction to Token Authentication

Token Authentication is a way to authorize users by using an API Key or Auth Token. The way Django REST Framework implements Token Authentication requires you to add a header for each request. This header will be in the following format:

Authorization: Token 93138ba960dfb4ef2eef6b907718ae04400f606a

Authorization is the header key and Token 93138ba960dfb4ef2eef6b907718ae04400f606a is the header value. Note, there is a space between Token and the token itself.

The server where your API lives will read off the user’s token and determine if there is a user assigned to that particular token.

This is Token Authentication in a nutshell. It really doesn’t get any more complicated than that. The difficulty is implementing it for each of the clients that will use the API (Javascript apps, desktop apps, command-line tools, etc.).
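
To make that concrete before diving into the implementation, here is a sketch of how a client might attach that header, using the Python requests library and the sample app's poll endpoint from later in this post (the token value is just the example above):

import requests

token = "93138ba960dfb4ef2eef6b907718ae04400f606a"  # the user's token
response = requests.get(
    "http://localhost:8000/polls/api/questions/1",
    headers={"Authorization": "Token " + token},
)
print(response.status_code)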

How to Implement Token Authentication

The implementation is a bit more complicated than other Authentication Schemes so take this slow, digest each piece of the puzzle and make sure to download my sample code so you can see what I’m doing each step of the way.

Preliminary Configuration

As with all Django REST Framework Authentication schemes, you must configure the authentication scheme that you will use in your settings.py file.

myproject/settings.py

REST_FRAMEWORK = { 
    'DEFAULT_AUTHENTICATION_CLASSES': (
        'rest_framework.authentication.TokenAuthentication',
    ),
    'DEFAULT_PERMISSION_CLASSES': (
        'rest_framework.permissions.IsAuthenticated',
    )
}

This is where it’s a little different from other Authentication schemes. You will also need to add an app to your INSTALLED_APPS setting. The reason is that Token Authentication requires a special model called Token, which is used to store your users' authentication tokens.

In my sample application, my INSTALLED_APPS tuple looks like this:

myproject/settings.py

INSTALLED_APPS = ( 
    'django.contrib.admin', 
    'django.contrib.auth', 
    'django.contrib.contenttypes', 
    'django.contrib.sessions', 
    'django.contrib.messages', 
    'django.contrib.staticfiles',
    'polls',
    'accounts',
    'rest_framework',
    'rest_framework.authtoken',
)

Now, in order to install the app and update the database with the new Token model, it is imperative that we run python manage.py migrate.

Now you should be ready to create tokens for your users. You'll also add a post_save receiver so that whenever a new user is added to your database, a token is automatically created for them.

Create Tokens for your Users

Go to your project shell by typing this command at a terminal: python manage.py shell. You should be presented with a prompt that looks like this: >>>.

Now, you’re going to type in the following code:

from django.contrib.auth.models import User
from rest_framework.authtoken.models import Token

users = User.objects.all()
for user in users:
    token, created = Token.objects.get_or_create(user=user)
    print user.username, token.key

This code retrieves all the users that currently exist in your database, loops through each of them, generates a unique token for each one, and prints out their username and token. It’s nice to print the username and token so you can see that it in fact worked correctly.

Now, you don’t want to manually create tokens for your users whenever a new user registers for your web app so you’ll have to create a special function that will automatically create a token for each new user:

myproject/accounts/models.py

from django.conf import settings
from django.db.models.signals import post_save
from django.dispatch import receiver
from rest_framework.authtoken.models import Token 

@receiver(post_save, sender=settings.AUTH_USER_MODEL)
def create_auth_token(sender, instance=None, created=False, **kwargs):
    if created:
        Token.objects.create(user=instance)

Now, every time a new user is saved in your database, this function will run and a new Token will be created for that user.

Create a Route for Retrieving a Token for Successfully logged in users

Django REST Framework provides a view that simply returns the user’s token when they provide a correct username / password combo.

Let’s add this now:

myproject/accounts/urls.py

from django.conf.urls import url
from django.conf import settings
from django.conf.urls.static import static

from . import views as local_views
from rest_framework.authtoken import views as rest_framework_views

urlpatterns = [ 
    # Session Login
    url(r'^login/$', local_views.get_auth_token, name='login'),
    url(r'^logout/$', local_views.logout_user, name='logout'),
    url(r'^auth/$', local_views.login_form, name='login_form'),
    url(r'^get_auth_token/$', rest_framework_views.obtain_auth_token, name='get_auth_token'),
] + static(settings.STATIC_URL, document_root=settings.STATIC_ROOT)

The bottom route is what you should notice:

url(r'^get_auth_token/$', rest_framework_views.obtain_auth_token, name='get_auth_token'),

Now, when POST’ing to http://localhost:8000/accounts/get_auth_token/ with the following data (assuming this username / password exists): { 'username': 'admin', 'password': 'admin' } you’ll receive something that looks like the following: { 'token': '93138ba960dfb4ef2eef6b907718ae04400f606a' }
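
Outside the browser, that same request can be scripted; for instance, with the Python requests library the call against the sample app might look like this (an illustration, not part of the original post):

import requests

# exchange a username / password for that user's token
response = requests.post(
    "http://localhost:8000/accounts/get_auth_token/",
    data={"username": "admin", "password": "admin"},
)
print(response.json())  # e.g. {'token': '93138ba960dfb4ef2eef6b907718ae04400f606a'}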

You’re done implementing Token Authentication. So, what can you do with this Authentication Scheme? Whatever you want. Let’s look at some sample code to see how this works.

Click Here to Download the Sample Code

How to use the Sample Code

  1. Unzip the code repository
  2. Change directory to the unzipped code repository
  3. Run the command python manage.py runserver
  4. Go to http://localhost:8000/polls/ to run the code.
  5. You’ll be redirected to a very crude login form
  6. Type admin for the username and admin for the password
  7. You’ll be redirected to the polls app and the application should work.
  8. Go to http://localhost:8000/accounts/logout to invalidate the cookie that holds the Authentication Token.
  9. Download Postman from https://www.getpostman.com/
  10. Open Postman. Make sure the dropdown says, “GET”.
  11. Type in: http://localhost:8000/polls/api/questions/1 into the address bar.
  12. Make sure under Authorization “No Auth” is selected.
  13. Under “Headers” add the Header key: Content-Type. Make the Header value: application/json
  14. Click “Send” and you should retrieve "Authentication credentials were not provided."
  15. Now, under “Headers” underneath “Content-Type” add the Header key: Authorization and add the Header value: “Token 93138ba960dfb4ef2eef6b907718ae04400f606a” (without the quotes).
  16. Click “Send” again and you should be getting the correct data.

Homework (If you’d like…)

  1. Look at ALL of my sample code.
    • Underneath the admin form in login.html there is some Javascript that creates a call to http://localhost:8000/accounts/get_auth_token/ with my username and password. When I retrieve the token, I put it into a cookie called “token” and I redirect to the ‘/polls/‘ app.
    • In the app.js file (where my AngularJS code lives), I created an “Interceptor”. The purpose of it is to intercept requests and automatically add a Header with your token so that the Polls Application works correctly. If a request fails, it assumes the user isn’t authorized and redirects to the login form.
  2. Play around with Postman. As long as you keep the Authorization header you can run through the API endpoints and see what you get.
  3. Play with Python Requests. Keep the Django Application running and see if you can write a program using Python Requests that will allow you to log in and retrieve some information.

The homework is to help you understand why you would want to use Token Authentication. You can create cool applications external to the Django Application that interface with your API. Give it a go and see what you come up with!

Until next time, Chris

3 Things You Need to Authenticate Users in Django

By Chris Bartos from Django community aggregator: Community blog posts. Published on Dec 27, 2016.

You want to authenticate users but you’re unsure how. The documentation isn’t the most helpful thing in the world. You think, “wow… this documentation assumes I know all this other stuff…”

What are the things you need to authenticate users? There are 3 things you need and I’m going to show you what each looks like.

First: You need some routes

You need authentication routes. I think it makes the most sense to create a separate app for this purpose. (Separate all your login logic from all your other logic)

Let’s look at some login routes:

loginapp/urls.py

from django.conf.urls import url
from django.conf import settings
from django.conf.urls.static import static

from . import views

urlpatterns = [
    # Session Login
    url(r'^login/$', views.login_user, name='login'),
    url(r'^logout/$', views.logout_user, name='logout'),
    url(r'^auth/$',  views.login_form, name='login_form'),
]

Second: You’ll need some templates

Templates are important. Templates are the HTML representation of your application. For example, at the bare minimum, you’ll need a way to let your users login. How do you do it? It doesn’t have to be pretty because this is JUST HTML.

loginapp/templates/loginapp/login.html

<form method='post' action="{%  url 'loginapp:login' %}">
    <label for="username">Username:</label>
    <input type="text" name="username" />
    <br>
    <label for="password">Password:</label>
    <input type="password" name="password" />
    <br>
    <input type="submit" value="Login" />
</form>

Third: You’ll need some views

The views you’ll need for login will be:
1. The login form view (shows the login form)
2. The POST view that will authenticate a user that is active / exists
3. A view that will log the user out

Let’s start with the login form view (loginapp/auth):

from django.shortcuts import render

def login_form(request):
    return render(request, 'loginapp/login.html', {})

This view simply renders the login.html template that we created above. It’s also possible to have only 2 routes (one that detects a POST request and one that detects a GET request); however, I personally really like having separate views for each request method.

Here is an example of a view that takes a username and password from the request and uses those credentials to authenticate a user and log them in, thus creating a session specifically for that user.

from django.conf import settings
from django.contrib.auth import authenticate, login
from django.shortcuts import redirect

def login_user(request):
    username = request.POST.get('username')
    password = request.POST.get('password')
    user = authenticate(username=username, password=password)
    if user is not None:
        # the password verified for the user
        if user.is_active:
            login(request, user)
            return redirect('/polls/')
    return redirect(settings.LOGIN_URL)

This method will get the username and password from the POST request data. Then, we will use the username and password to try to authenticate a user that exists in our database.

If a user exists, we will try to login our user and redirect to our polls application. If the user does not exist we will redirect back to the login form.

How do you logout an authenticated user?

from django.contrib.auth import logout

def logout_user(request):
    logout(request)
    return redirect('/polls/')

This method takes the request object and uses it to log out the logged-in user. Once the user logs out, the application will redirect the user to our polls application.

These are the 3 things that you need to authenticate users in your Django application. If you want to use Session Authentication with Django REST Framework, this is how you would accomplish it.

I hope that helps you when you need to authenticate users in your future web applications.

Michael Paquier: Postgres 10 highlight - Checkpoint skip logic

From Planet PostgreSQL. Published on Dec 26, 2016.

It is too late for stories about Christmas, so here is a Postgres story about the following commit of Postgres 10:

commit: 6ef2eba3f57f17960b7cd4958e18aa79e357de2f
author: Andres Freund <andres@anarazel.de>
date: Thu, 22 Dec 2016 11:31:50 -0800
Skip checkpoints, archiving on idle systems.

Some background activity (like checkpoints, archive timeout, standby
snapshots) is not supposed to happen on an idle system. Unfortunately
so far it was not easy to determine when a system is idle, which
defeated some of the attempts to avoid redundant activity on an idle
system.

To make that easier, allow to make individual WAL insertions as not
being "important". By checking whether any important activity happened
since the last time an activity was performed, it now is easy to check
whether some action needs to be repeated.

Use the new facility for checkpoints, archive timeout and standby
snapshots.

The lack of a facility causes some issues in older releases, but in my
opinion the consequences (superflous checkpoints / archived segments)
aren't grave enough to warrant backpatching.

Author: Michael Paquier, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI
Bug: #13685
Discussion:
https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com
Backpatch: 

Postgres 9.0 was the first release to introduce the parameter wal_level, with three different values (a fourth came later):

  • “minimal”, to get enough information WAL-logged to recover from a crash.
  • “archive”, to be able to recover from a base backup and archives.
  • “hot_standby”, to be able to have a standby node work, resulting in information about exclusive locks and currently running transactions to be WAL-logged, called standby snapshots.
  • “logical”, introduced in 9.4, to work with logical decoding.

In 9.6, “archive” and “hot_standby” have been merged into “replica” as both levels have no difference in terms of performance. Another thing to know is that standby snapshots are generated more often since 9.3 via the bgwriter process, every 15 seconds to be exact.

So what is the commit above about? The fact that since “hot_standby” was introduced, the logic in xlog.c that decides whether checkpoints should be skipped was simply broken. As “hot_standby” has become the standard configuration, many installations have been producing useless checkpoints, or even useless WAL segments if archive_timeout was set. This is actually no big deal for most installations, as there is usually some activity on the system, meaning activity that produces new WAL records, so a checkpoint or a WAL segment switch after a timeout (if archive_timeout is set) would still correspond to real work.

The main issue here is embedded systems, where Postgres runs for ages without intervention. For example, an instance managing some internal facility of a company very likely faces a downspike of activity on weekends, because nobody is a robot and bodies need rest. Useless checkpoints can actually result in more WAL segments being created. And while storing those segments is not really a problem if they are compressed, as the remaining empty part is filled with zeros, installations that do not compress them need some extra time to recover those segments in case of a crash, and that’s even more painful for deployments with spiky WAL activity.

This has resulted in a couple of bug reports and misunderstandings over the last couple of years on the community mailing lists like this thread.

So, in order to fix this problem, a system has been designed that allows marking a WAL record as “important” or not depending on the activity it represents. There has been much debate about the wording of this concept, at some point also named “progress”, a debate spanning perhaps more than a hundred emails across many threads. There is also a set of routines that can be used to fetch the last important WAL position, which can be used for more checks and fancier decision-making.

With this facility in place, records related to archive_timeout (meaning here WAL segment switches) and standby snapshots (WAL-logging of exclusive locks and running transactions for hot standby nodes) are considered unimportant WAL activity when deciding if a checkpoint should be executed. Once those records are marked as such, deciding if a checkpoint should be skipped is just a matter of comparing the WAL position of the last checkpoint record with the last important WAL position. If they match, no checkpoint needs to happen. And then embedded systems are happy.

How do I Implement Session Authentication in Django REST Framework?

By Chris Bartos from Django community aggregator: Community blog posts. Published on Dec 23, 2016.

Introduction to Session Authentication

Session Authentication, when used with Django REST Framework, allows you to authenticate users very similarly to the way Django authenticates users without Django REST Framework.

This will make it extremely easy to introduce a REST API to your web app without having to completely overhaul your authentication system.

The best part of this Authentication Scheme is you literally only have to change ONE line of your Django Application.

Implementation Details (all the little bits…)

In your settings.py file, just add 'rest_framework.authentication.SessionAuthentication', to your DEFAULT_AUTHENTICATION_CLASSES setting.

myproject/settings.py

REST_FRAMEWORK = {
    'DEFAULT_AUTHENTICATION_CLASSES': (
        'rest_framework.authentication.SessionAuthentication',
    ),
    'DEFAULT_PERMISSION_CLASSES': (
        'rest_framework.permissions.IsAuthenticated',
    )
}

Now, you will be able to login using the login form that you learned how to create when you took the Polls Django Tutorial.

If you’d like an example of how this is accomplished, I’ve updated the Django Application I used to show you how to implement Basic Authentication using Django REST Framework. Now, it works using Session Authentication.

Click Here to Download the Sample Code

How to use this Sample Code

  1. Unzip the code repository
  2. Change directory to the unzipped code repository
  3. Run the command python manage.py runserver
  4. Go to http://localhost:8000/polls/ to run the code.
  5. You’ll be redirected to a very crude login form
  6. Type admin for the username and admin for the password
  7. You’ll be redirected to the polls app and the application should work.
  8. Go to http://localhost:8000/accounts/logout to logout of your session.
  9. Go to http://localhost:8000/polls/api/questions/1 to checkout the API (it should tell you that you’re not authenticated to look at the data).
  10. Go to http://localhost:8000/accounts/login and login with admin for the username and admin for the password.
  11. Go back to http://localhost:8000/polls/api/questions/1 and you should be able to see the data now that you are signed in.

That is the Session Authentication Scheme in a nutshell. I hope you realize how simple it is to implement. Now… for some homework!

Homework (If you’d like…)

  1. Use the sample application and change the static login form into an AJAX style form. (So, when you put in admin for the username and password and click login there should be an AJAX POST request with the username, password and CSRF Token that will attempt to login the user and either send a success message or a failure message. Then, redirect the user back to the /polls/. If you’re unsure how to add a CSRF Token to all AJAX requests, sign up for my FREE Django REST Framework email course below.)

Pavel Stehule: OpenERP configuration

From Planet PostgreSQL. Published on Dec 23, 2016.

I had a customer with strange OpenERP issues - the main problem was the long life of OpenERP/PostgreSQL sessions. The fix was not hard: use pgBouncer in transaction mode with a short server lifetime (ten minutes).
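
For reference, the relevant pgbouncer.ini settings would look roughly like this (a sketch, not the customer's actual configuration):

[pgbouncer]
pool_mode = transaction
server_lifetime = 600    ; close server connections after ten minutes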

List of Authentication Schemes available for Django REST Framework

By Chris Bartos from Django community aggregator: Community blog posts. Published on Dec 22, 2016.

Do you know what your options are?

You learned a little bit of how authentication works with Django REST Framework from my short, FREE mini email course. What you might NOT know is that Basic Authentication isn’t the only type of authentication available to you.

You can use the following authentication schemes in Django REST framework and keep checking your inbox for implementation details for each of these authentication schemes.

The list of Authentication types in Django REST Framework

  1. Basic Authentication — you learned how to implement this while following my free email course. If you haven’t signed up for the email course yet, put your email address in the form at the end of this post.
  2. Session Authentication — this is similar to the authentication that Django uses to authenticate users.
  3. Token Authentication — Token Auth is how most APIs authenticate users. You might have played around with some APIs that require an “API Key”. This is how you would implement that for your users.
  4. Custom Authentication — If for some reason, the other authentication schemes don’t do what you want, you can override the BaseAuthentication class and create your own authentication scheme in Django REST Framework.
  5. OAuth2 Authentication — OAuth authentication is a way to allow your users to authenticate using things like Social Media. I’m sure you’ve logged into certain web apps that allowed you to authenticate using Google or Facebook. This allows you to do that!

Which Authentication scheme is best for me?

That will depend on your specific use case. If you know how to implement each of them, you’ll be better off in the long run. Come back to this post periodically. Each of the authentication schemes above will eventually link to a post that will show you EXACTLY how to implement each one.

If you’d rather not come back for each post, simply put your email in the box below and you’ll get every new post sent directly to your inbox!

Scott Mead: PostgreSQL Schema Visualization

From Planet PostgreSQL. Published on Dec 22, 2016.

I spend a lot of time trying to learn what’s already been implemented; as DBAs, we tend to live in that world. It’s important that you have tools that allow you to quickly get the visual so you can get on with your job. One of the biggest ‘reverse engineering’ tasks that DBAs have to do is in the area of data modeling. There are a lot of tools that do this, but my favorite of late is SchemaSpy (schemaspy.sf.net). I’ve found that the output is complete and the ERDs are easy to read.

java -jar schemaSpy_5.0.0.jar -t pgsql -host localhost:5432 -db postgres \
-dp ~/software/pg/drivers/jdbc/postgresql-9.4.1208.jre6.jar \
-o ERD -u postgres -s myschema

SchemaSpy is nice because it can talk to many different types of databases… a side effect of this is a fairly complex set of commandline switches.

-t specifies the type, -t pgsql specifies postgres
-o specifies a directory to create the output in

All you need in order to run this is the SchemaSpy jar and a PostgreSQL JDBC driver. The output lands in the directory ‘ERD’; pulling up ERD/index.html gives you a nice page with a table list and some basics.

[Screenshot: SchemaSpy dashboard]

As a DBA, I really love the ‘Insertion Order’ and ‘Deletion Order’ links here. SchemaSpy reverse engineers the referential integrity chain for me! Clicking either one gives me a simple page, top to bottom with the right order!

[Screenshot: insertion order view]

Now for the real reason that I super-love SchemaSpy: the ‘Relationships’ tab. I’ve loaded up a pgbench schema. pgbench doesn’t actually create any real keys, but column names are consistent. SchemaSpy notices this and ‘infers’ relationships for me! This is huge: even without explicit keys, I can begin to infer what the developer intended (the estimated row count is helpful too 🙂).

[Screenshot: relationships diagram]

I won’t force you to follow me through all of the tabs. If you’re looking for a schema visualization tool, give SchemaSpy a try.

Happy PostgreSQL-ing!

Braindump on Load Generation

By Will Larson from Django community aggregator: Community blog posts. Published on Dec 18, 2016.

Stripe is starting to build out a load generation team in Seattle (that’s a posting for San Francisco, but also works for Seattle), and consequently I’ve been thinking more about load generation lately. In particular, I’ve been thinking that I know a lot less about the topic than I’d like to, so here is a collection of sources and reading notes.

Hopefully, I’ll synthesize these into something readable soon!

The Interesting Questions

Perhaps because many companies never develop a mature solution for load generation, and because none of the open source solutions command broad awareness (except maybe JMeter?), it tends to be a place with far more opinions than average, and consequently there are quite a few interesting questions to think through.

Let’s start by exploring a few of those.

  1. Should you be load testing? Surprisingly few companies invest much into load testing, so it’s never entirely clear if you should be investing at a given point in time. My anecdotal impression is that companies which “believe in QA” tend to invest into load testing early, because they have dedicated people who can build the tooling and integration, and that most other companies tend to ignore it until they’re doing a significant amount of unplanned scalability investment. Said differently, for most companies load testing is a mechanism to convert unplanned scalability work into planned scalability work.

  2. Should you be load testing, redux? Beyond whether you should invest into building load testing tooling, my colleague Davin suggested an interesting perspective that most of the metrics generated by load testing can also be obtained through thoughtful instrumentation and analysis of your existing traffic.

  3. What layer of your infrastructure should you load test against? Depending on the application you’re running, it may be easy to generate load against your external interfaces (website, API, etc) but as you go deeper into your infrastructure you may want to run load against a specific service or your stateful systems (Kafka, databases, etc).

  4. What environment should you run your tests against? Perhaps the most common argument when rolling out load testing is whether you should run it against an existing QA environment, against a dedicated performance environment, or against your production environment. This depends a great deal on the layer you’re testing at, and if you’re doing load (how does the system react to this traffic?) or stress (at what load does the system fail?) testing.

  5. How should you model your traffic? Starting with the dead simple Siege, there are quite a few different ways to think about generating your load. Should you send a few request patterns at a high concurrency? Should you model your traffic using a state machine (codified in a simple script, or perhaps in a DSL), or should you just replay sanitized production traffic?

  6. Is the distinction between integration testing and load testing just a matter of scale? As you start thinking about it, integration testing is running requests against a server and ensuring the results are reasonable, and that’s pretty much the exact same thing as load testing. It’s pretty tempting to try to solve both problems with one tool.

  7. How should you configure load tests? Where should that configuration live? Merging into the whole “is configuration code?” debate, where should the configuration for your load tests live? Ideally it would be in your service’s repository, right? Or perhaps it should be stored in a dynamic configuration database somewhere to allow more organic exploration and testing? Hmm.

Alright then, with those questions in mind, time to read some papers.

Stress and Load Testing Research

As part of my research, I ran into an excellent list of stress and load testing papers from the Software Engineering Research Group at University of Texas Arlington. Many of these papers are pulled from there, and it’s a great starting point for your reading as well!

A Methodology to Support Load Test Analytics

A Methodology to Support Load Test Analytics (2010) starts with an excellent thought on why load testing is becoming increasingly important and complex:

For [ultra large scale systems], it is impossible for an analyst to skim through the huge volume of performance counters to find the required information. Instead, analyst employ few key performance counters known to [her] from past practice, performance gurus and domain trends as ‘rules of thumb’. In a ULSS, there is no single person with complete knowledge of end to end geographically distributed system activities.

That rings true to my experience.

With increasingly complex systems, it is remarkably hard to actually find the performance choke points, and load testing aspires to offer the distributed systems equivalent of a profiler. (This is also why QA environments are fraught with failures: as systems become increasingly complex, creating a useful facsimile becomes rather hard.)

A few other interesting ideas:

  1. Fairly naive software can batch together highly correlated metrics, to greatly reduce the search space for humans to understand where things are degrading.
  2. Most test designs mean they can only run occasionally (as opposed to continuously), and interfering workloads (e.g. a weekly batch job occurring during your load test) can easily invalidate the data of those infrequent runs.
  3. Many tests end up with artificial constraints when run against non-production environments like QA, for example running against underpowered or mis-configured databases.
  4. They use “Principal Component Analysis” to find the minimal principal components which are not correlated with each other, so you have less redundant data to explore.
  5. After they’ve found the key Principal Components, they then convert those back into the underlying counters, which are human comprehensible, for further analysis.
  6. Ideally, you should be able to use historical data to build load signatures, such that you could quickly determine if the system’s fundamental constraint has shifted (e.g. you’ve finally fixed your core bottleneck and got a new one, or something else has degraded such that it is not your core bottleneck).

In particular, my take away is that probably the right load generation tool will start with a fairly simple approach with manually identified key metrics, and then move increasingly to using machine learning to avoid our implicit biases around where our systems ought to be slow.

A Unified Load Generator For Geographically…

A Unified Load Generator for Geographically Distributed Generation of Network Traffic (2006) is a master’s thesis that happens to be a pretty excellent survey of academic ideas and topics around load generation.

One of the interesting ideas here is:

to design an accurate artificial load generator which is responsible to act in a flexible manner under different situations we need not only a load model but also a formal method to specify the realistic load.

I’m not sure I entirely agree that we need a formal model to get value from our load testing, we are after all trying to convert unplanned scalability work into planned scalability work, but I have such an industry focus that it’s a fascinating idea to me that you would even try to create a formal model here.

It also introduces a more specific vocabulary for discussing load generation:

The workload or load L=L (E, S, IF, T) denotes the total sequence of requests which is offered by an environment E to a service system S via a well-defined interface IF during the timeinterval T.

Perhaps more interestingly, is the emphasize on how quality of service forces us to redefine the goal of load testing:

The need for telecommunication networks capable of providing communication services such as data, voice and video motivated to deliver an acceptable QoS level to the users, and their success depends on the development of effective congestion control schemes.

I find this a fascinating point. As your systems start to include more systematic load shedding mechanisms, it becomes increasingly challenging to push them out of equilibrium because they (typically) preserve harvest by degrading yield. It’s consequently not enough to say that your systems should or should not fail at a given level of load, you also have to start to measure if it degrades appropriately based on load levels.

In section 3.5, it explains (the seemingly well known, albeit not to me) UniLog, also known as the Unified Load Generator. (Which is perhaps based on this paper, which is sadly hidden behind a paywall.) UniLog has an interesting architecture with intimidatingly exciting component names like PLM, ELM, GAR, LT, ADAPT and EEM. As best I can tell it is an extremely generic architecture for running and evaluating load experiments. It feels slightly overdesigned from my first reading, but perhaps as one spends more time in the caverns of load generation it starts to make more sense.

In section 4.4, it discusses centralized versus distributed load generation, which feels like one of the core design decisions you need to make for building such a system. My sense is that you likely want a distributed approach at some point, if only to avoid getting completely throttled by QoS operating on a per-IP ratelimit.

The rest of the paper focuses on some case studies and such. Overall, it was a surprisingly thorough introduction to the related research.

Automated Analysis of Load Testing Results

Automated Analysis of Load Testing Results takes a look at using automation to understand load test results (using both the execution logs of the load test and overall system metrics during the load test).

It summarizes the challenges of analyzing load test results as: outdated documentation, process-level profiling is cost prohibitive, load testing typically occurs late in the development cycle with short time lines, and the output from load tests can be overwhelmingly large.

I think the most interesting take away for me is the idea of very explicitly decoupling the gathering of performance data from its analysis. For example, you could start logging performance data early on (and likely your metrics tool, e.g. Graphite, already is capturing that data), and invest into more sophisticated analysis much later on. There is particular focus on comparing results across multiple load test runs, which can at a minimum narrow in on where performance “lives” within your metrics.

Even more papers…

Some additional papers with short summaries:

Converting Users to Testers - This paper discusses recording user traffic as an input to your load testing, with the goal of reducing time spent writing load generation scripts.

Automatic Feedback, Control-Based, Stress and Load Testing - This paper explores the idea of systems that try to drive and maintain load on a system to targeted threshold. This is an interesting idea because this would allow you to consistently run load against your production environment. The only caveat is that you have to first identify the inputs you want to use to influence that load, so you still need to model the incoming traffic in order to use it as an input (or record and sanitize real traffic), but at least once you have modeled it you could be more abstract in how you use that model (if your target is to create load, you don’t necessarily need to simulate realistic traffic, and you could use something like an n-armed bandit approach to “optimize” your load to generate the correct amount of load against the system). (Similarly, this paper tries to do that using genetic algorithms.)

Existing Tools

There are surprisingly few load testing tools, although wikipedia has a short list. Of that list, I’ve actually used JMeter some years ago, and I enjoyed this short rant about HP’s Loadrunner tooling.

I took a brief look at Gatling, which is a lightweight DSL written in Scala, which can be easily run by Jenkins or such. This seems like an interesting potential starting point for building a load generation tool. In particular the concept of treating your load tests as something you would check into your repository feels right, allowing you to iterate on your load tests like you would anything else. Reading through a few other blog posts on Gatling gave me a stronger sense that this might be a useful component of an overall load testing system (that allowed, e.g. many instances to be run against different endpoints or such).

Are there others that I’m missing out on?

Web Workload Generation According to…

As the name suggests, Web Workload Generation According to UniLoG Approach looks at adapting the UniLoG approach to the web. It nicely summarizes UniLoG’s approach as well:

The basic principle underlying the design and elaboration of UniLoG has been to start with a formal description of an abstract load model and thereafter to use an interface-dependent adapter to map the abstract requests to the concrete requests as they are “understood” by the service providing component at the real interface in question.

It also does a nice job of exploring ways to generate requests, although again coming back to either using logs of existing traffic or generating a model which defines your workload. There is an interesting hybrid here which would be using the distribution of actual usage as an input for the generated load (as opposed to using load on a one to one basis).

That said, unfortunately, I didn’t really get much out of it.

Braindump on Load Generation

By Will Larson from Django community aggregator: Community blog posts. Published on Dec 18, 2016.

Stripe is starting to build out a load generation team in Seattle (that’s a posting for San Francisco, but also works for Seattle), and consequently I’ve been thinking more about load generation lately. In particular, I’ve been thinking that I know a lot less about the topic than I’d like to, so here is a collection of sources and reading notes.

Hopefully, I’ll synthesize these into something readable soon!

The Interesting Questions

Perhaps because many companies never develop a mature solution for load generation, and because none of the open source solutions command broad awareness (except maybe JMeter?), it tends to be a place with far more opinions than average, and consequently there are quite a few interesting questions to think through.

Let’s start by exploring a few of those.

  1. Should you be load testing? Surprisingly few companies invest much into load testing, so it’s never entirely clear if you should be investing at a given point in time. My anecdotal impression is that companies which “believe in QA” tend to invest into load testing early, because they have dedicated people who can build the tooling and integration, and that most other companies tend to ignore it until they’re doing a significant amount of unplanned scalability investment. Said differently, for most companies load testing is a mechanism to convert unplanned scalability work into planned scalability work.

  2. Should you be load testing, redux? Beyond whether you should invest into building load testing tooling, my colleague Davin suggested an interesting perspective that most of the metrics generated by load testing can also be obtained through thoughtful instrumentation and analysis of your existing traffic.

  3. What layer of your infrastructure should you load test against? Depending on the application you’re running, it may be easy to generate load against your external interfaces (website, API, etc) but as you go deeper into your infrastructure you may want to run load against a specific service or your stateful systems (Kafka, databases, etc).

  4. What environment should you run your tests against? Perhaps the most common argument when rolling out load testing is whether you should run it against an existing QA environment, against a dedicated performance environment, or against your production environment. This depends a great deal on the layer you’re testing at, and on whether you’re doing load testing (how does the system react to this traffic?) or stress testing (at what load does the system fail?).

  5. How should you model your traffic? Starting with the dead simple Siege, there are quite a few different ways to think about generating your load. Should you replay a few requests at higher concurrency? Should you model your traffic using a state machine (codified in a simple script, or perhaps in a DSL), or should you just replay sanitized production traffic? (There is a sketch of the state-machine approach right after this list.)

  6. Is the distinction between integration testing and load testing just a matter of scale? As you start thinking about it, integration testing is running requests against a server and ensuring the results are reasonable, and that’s pretty much the exact same thing as load testing. It’s pretty tempting to try to solve both problems with one tool.

  7. How should you configure load tests? Where should that configuration live? Merging into the whole “is configuration code?” debate, where should the configuration for your load tests live? Ideally it would be in your service’s repository, right? Or perhaps it should be stored in a dynamic configuration database somewhere to allow more organic exploration and testing? Hmm.
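
Here is a minimal sketch of the state-machine idea from question 5. None of it comes from a real tool: the states, paths, and transition probabilities are invented, and it leans on the requests library purely for illustration.

import random
import time

import requests  # assumed HTTP client; any client would do

# Hypothetical user journey: each state names a request to make and a
# distribution over possible next states.
STATES = {
    "browse": {"path": "/products", "next": [("browse", 0.6), ("view", 0.3), ("leave", 0.1)]},
    "view": {"path": "/products/123", "next": [("browse", 0.4), ("buy", 0.2), ("leave", 0.4)]},
    "buy": {"path": "/checkout", "next": [("leave", 1.0)]},
}

def pick_next(transitions):
    """Sample the next state from a list of (state, probability) pairs."""
    roll, cumulative = random.random(), 0.0
    for state, probability in transitions:
        cumulative += probability
        if roll <= cumulative:
            return state
    return "leave"

def simulate_session(base_url, think_time=0.5):
    """Walk one synthetic user through the state machine, issuing requests."""
    state = "browse"
    while state != "leave":
        requests.get(base_url + STATES[state]["path"], timeout=10)
        time.sleep(think_time)  # crude stand-in for user think time
        state = pick_next(STATES[state]["next"])

Running many such sessions concurrently, with varied think times and transition tables, is one way to approximate a realistic traffic mix without replaying production requests.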

Alright then, with those questions in mind, time to read some papers.

Stress and Load Testing Research

As part of my research, I ran into an excellent list of stress and load testing papers from the Software Engineering Research Group at University of Texas Arlington. Many of these papers are pulled from there, and it’s a great starting point for your reading as well!

A Methodology to Support Load Test Analytics

A Methodology to Support Load Test Analytics (2010) starts with an excellent thought on why load testing is becoming increasingly important and complex:

For [ultra large scale systems], it is impossible for an analyst to skim through the huge volume of performance counters to find the required information. Instead, analyst employ few key performance counters known to [her] from past practice, performance gurus and domain trends as ‘rules of thumb’. In a ULSS, there is no single person with complete knowledge of end to end geographically distributed system activities.

That rings true to my experience.

With increasingly complex systems, it is remarkably hard to actually find the performance choke points, and load testing aspires to offer the distributed systems equivalent of a profiler. (This is also why QA environments are fraught with failures: as systems become increasingly complex, creating a useful facsimile becomes rather hard.)

A few other interesting ideas:

  1. Fairly naive software can batch together highly correlated metrics, to greatly reduce the search space for humans to understand where things are degrading.
  2. Most test designs mean they can only run occasionally (as opposed to continuously), and interfering workloads (e.g. a weekly batch job occurring during your load test) can easily invalidate the data of those infrequent runs.
  3. Many tests end up with artificial constraints when run against non-production environments like QA, for example running against underpowered or misconfigured databases.
  4. They use “Principal Component Analysis” to find the minimal set of principal components which are not correlated with each other, so you have less redundant data to explore (there is a small sketch of this right after the list).
  5. After they’ve found the key Principal Components, they then convert those back into the underlying counters, which are human comprehensible, for further analysis.
  6. Ideally, you should be able to use historical data to build load signatures, such that you could quickly determine if the system’s fundamental constraint has shifted (e.g. you’ve finally fixed your core bottleneck and got a new one, or something else has degraded such that it is no longer your core bottleneck).
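
As a concrete illustration of items 4 and 5 above, here is a small sketch of that workflow using nothing but numpy; the counter matrix below is random noise standing in for real performance counters, so treat it as the shape of the idea rather than anything from the paper.

import numpy as np

# Rows are samples collected during the load test, columns are performance
# counters (CPU, queue depth, p99 latency, ...). Real counter data goes here.
counters = np.random.rand(500, 40)

# Center the data, then use SVD to get the principal components.
centered = counters - counters.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Keep just enough components to explain roughly 90% of the variance.
explained = (singular_values ** 2) / (singular_values ** 2).sum()
keep = int(np.searchsorted(np.cumsum(explained), 0.90)) + 1

# Map each retained component back to the raw counters that load on it most
# heavily, since those are what a human can actually reason about.
for i in range(keep):
    top_counters = np.argsort(np.abs(components[i]))[::-1][:5]
    print("component", i, "is dominated by counters", top_counters.tolist())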

In particular, my takeaway is that the right load generation tool will probably start with a fairly simple approach built on manually identified key metrics, and then move increasingly toward machine learning to avoid our implicit biases about where our systems ought to be slow.

A Unified Load Generator For Geographically…

A Unified Load Generator for Geographically Distributed Generation of Network Traffic (2006) is a master’s thesis that happens to be a pretty excellent survey of academic ideas and topics around load generation.

One of the interesting ideas here is:

to design an accurate artificial load generator which is responsible to act in a flexible manner under different situations we need not only a load model but also a formal method to specify the realistic load.

I’m not sure I entirely agree that we need a formal model to get value from our load testing; we are, after all, trying to convert unplanned scalability work into planned scalability work. But I have such an industry focus that it’s a fascinating idea to me that you would even try to create a formal model here.

It also introduces a more specific vocabulary for discussing load generation:

The workload or load L=L (E, S, IF, T) denotes the total sequence of requests which is offered by an environment E to a service system S via a well-defined interface IF during the timeinterval T.

Perhaps more interesting is the emphasis on how quality of service forces us to redefine the goal of load testing:

The need for telecommunication networks capable of providing communication services such as data, voice and video motivated to deliver an acceptable QoS level to the users, and their success depends on the development of effective congestion control schemes.

I find this a fascinating point. As your systems start to include more systematic load shedding mechanisms, it becomes increasingly challenging to push them out of equilibrium because they (typically) preserve harvest by degrading yield. It’s consequently not enough to say that your systems should or should not fail at a given level of load; you also have to start measuring whether they degrade appropriately as load increases.
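
To make that concrete, one crude way to measure it is to ramp the offered request rate in steps and record the goodput at each step, rather than checking a single pass/fail threshold. The sketch below assumes an invented URL and step sizes, and uses the requests library only as an example client.

import concurrent.futures
import time

import requests  # assumed HTTP client

def measure_goodput(url, offered_rps, duration=30):
    """Offer roughly offered_rps requests per second for duration seconds and
    return the rate of successful (2xx) responses actually achieved."""
    deadline = time.time() + duration
    successes = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        futures = []
        while time.time() < deadline:
            futures.append(pool.submit(requests.get, url, timeout=5))
            time.sleep(1.0 / offered_rps)
        for future in futures:
            try:
                if future.result().status_code < 300:
                    successes += 1
            except Exception:
                pass  # timeouts and connection errors count against goodput
    return successes / float(duration)

# Ramp the offered load and watch whether goodput degrades gradually (load
# shedding is doing its job) or collapses past some point (it is not).
for rps in (10, 50, 100, 200, 400):
    print(rps, "offered ->", measure_goodput("http://qa.example.com/", rps), "good rps")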

In section 3.5, it explains (the seemingly well known, albeit not to me) UniLoG, also known as the Unified Load Generator. (Which is perhaps based on this paper, which is sadly hidden behind a paywall.) UniLoG has an interesting architecture with intimidatingly exciting component names like PLM, ELM, GAR, LT, ADAPT and EEM. As best I can tell, it is an extremely generic architecture for running and evaluating load experiments. It feels slightly overdesigned from my first reading, but perhaps as one spends more time in the caverns of load generation it starts to make more sense.

In section 4.4, it discusses centralized versus distributed load generation, which feels like one of the core design decisions you need to make when building such a system. My sense is that you likely want a distributed approach at some point, if only to avoid getting completely throttled by QoS operating on a per-IP rate limit.

The rest of the paper focuses on some case studies and such. Overall, it was a surprisingly thorough introduction to the related research.

Automated Analysis of Load Testing Results

Automated Analysis of Load Testing Results takes a look at using automation to understand load test results (using both the execution logs of the load test and overall system metrics during the load test).

It summarizes the challenges of analyzing load test results as: outdated documentation, cost-prohibitive process-level profiling, load testing that typically occurs late in the development cycle under short timelines, and load test output that can be overwhelmingly large.

I think the most interesting takeaway for me is the idea of very explicitly decoupling the gathering of performance data from its analysis. For example, you could start logging performance data early on (likely your metrics tool, e.g. Graphite, is already capturing that data) and invest in more sophisticated analysis much later. There is a particular focus on comparing results across multiple load test runs, which can at a minimum narrow in on where performance “lives” within your metrics.
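
A tiny sketch of that decoupling: the load test itself only appends raw samples to a file, and a separate script summarizes and compares runs afterwards. The one-JSON-object-per-line format and the field names are assumptions for illustration, not something the paper prescribes.

import json
import statistics

def summarize(path):
    """Reduce a file of raw per-request samples (one JSON object per line,
    e.g. {"endpoint": "/checkout", "latency_ms": 41}) to per-endpoint ~p95s."""
    samples = {}
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            samples.setdefault(record["endpoint"], []).append(record["latency_ms"])
    return {
        endpoint: statistics.quantiles(values, n=20)[18]  # roughly the 95th percentile
        for endpoint, values in samples.items()
    }

def regressions(baseline_path, candidate_path, tolerance=1.2):
    """Flag endpoints whose p95 latency got more than tolerance times worse."""
    baseline, candidate = summarize(baseline_path), summarize(candidate_path)
    return {
        endpoint: (baseline[endpoint], p95)
        for endpoint, p95 in candidate.items()
        if endpoint in baseline and p95 > tolerance * baseline[endpoint]
    }

The point is less the statistics than the shape: the expensive part (running load, collecting counters) happens early and cheaply, while the analysis can keep getting smarter without re-running anything.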

Even more papers…

Some additional papers with short summaries:

Converting Users to Testers - This paper discusses recording user traffic as an input to your load testing, with the goal of reducing time spent writing load generation scripts.

Automatic Feedback, Control-Based, Stress and Load Testing - This paper explores the idea of systems that try to drive and maintain load on a system at a targeted threshold. This is interesting because it would allow you to consistently run load against your production environment. The caveat is that you first have to identify the inputs you want to use to influence that load, so you still need to model the incoming traffic in order to use it as an input (or record and sanitize real traffic). But once you have modeled it, you can be more abstract in how you use that model: if your target is to create load, you don’t necessarily need to simulate realistic traffic, and you could use something like an n-armed bandit approach to “optimize” your load until it generates the correct amount of load against the system. (Similarly, this paper tries to do that using genetic algorithms.)
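
Here is a deliberately crude sketch of that control-loop idea (not the paper’s algorithm): a proportional controller nudges the load generator’s concurrency until an observed signal, in this case CPU utilization read from a hypothetical metrics endpoint, sits near a target.

import time

import requests  # assumed HTTP client

TARGET_CPU = 0.70  # drive the system to roughly 70% CPU
METRICS_URL = "http://metrics.example.com/cpu"  # hypothetical endpoint returning 0.0-1.0

def observed_cpu():
    """Read the current CPU utilization from the metrics endpoint."""
    return float(requests.get(METRICS_URL, timeout=5).text)

def set_worker_concurrency(n):
    """Hypothetical hook telling the load workers how many requests to keep in flight."""
    print("setting concurrency to", n)

concurrency = 10
while True:
    error = TARGET_CPU - observed_cpu()
    # Proportional control: push harder when under target, back off when over.
    concurrency = max(1, int(concurrency * (1 + 0.5 * error)))
    set_worker_concurrency(concurrency)
    time.sleep(10)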

Existing Tools

There are surprisingly few load testing tools, although Wikipedia has a short list. Of that list, I’ve actually used JMeter some years ago, and I enjoyed this short rant about HP’s LoadRunner tooling.

I took a brief look at Gatling, a lightweight Scala DSL that can easily be run by Jenkins or the like. This seems like an interesting potential starting point for building a load generation tool. In particular, the concept of treating your load tests as something you check into your repository feels right, allowing you to iterate on your load tests like you would anything else. Reading through a few other blog posts on Gatling gave me a stronger sense that this might be a useful component of an overall load testing system (one that allowed, e.g., many instances to be run against different endpoints).

Are there others that I’m missing out on?

Web Workload Generation According to…

As the name suggests, Web Workload Generation According to UniLoG Approach looks at adapting the UniLoG approach to the web. It nicely summarizes UniLoG’s approach as well:

The basic principle underlying the design and elaboration of UniLoG has been to start with a formal description of an abstract load model and thereafter to use an interface-dependent adapter to map the abstract requests to the concrete requests as they are “understood” by the service providing component at the real interface in question.

It also does a nice job of exploring ways to generate requests, although it again comes back to either using logs of existing traffic or building a model which defines your workload. There is an interesting hybrid here, which would be using the distribution of actual usage as an input for the generated load (as opposed to replaying recorded traffic on a one-to-one basis).

That said, unfortunately, I didn’t really get much out of it.

Django Administration: Inlines for Inlines

By DjangoTricks from Django community aggregator: Community blog posts. Published on Dec 16, 2016.

The default Django model administration comes with a concept of inlines. If you have a one-to-many relationship, you can edit the parent and its children in the same form. However, you are limited in a way that you cannot have inlines under inlines at nested one-to-many relations. For example, you can't show models Painter, Picture, and Review in the same form if one painter may have drawn multiple pictures and each picture may have several reviews.

In this article I would like to share a workaround allowing you to quickly access the inlines of an inline model. The idea is that for every inline you can provide an HTML link leading to a separate form where you can edit the related model and its own relations. It's as simple as that.

For example, in the form of the Painter model, you have the instances of Picture listed with specific "Edit this Picture separately" links:

When such a link is clicked, the administrator goes to the form of the Picture model, which shows the instances of the Review model listed underneath:

Let's have a look at how to implement this.

First of all, I will create a gallery app and define the three models there. Nothing fancy here. The important part is just that the Picture model has a foreign key to the Painter model and the Review model has a foreign key to the Picture model.

# gallery/models.py
# -*- coding: UTF-8 -*-
from __future__ import unicode_literals

import os

from django.db import models
from django.utils.encoding import python_2_unicode_compatible
from django.utils.translation import ugettext_lazy as _
from django.utils.text import slugify


@python_2_unicode_compatible
class Painter(models.Model):
    name = models.CharField(_("Name"), max_length=255)

    class Meta:
        verbose_name = _("Painter")
        verbose_name_plural = _("Painters")

    def __str__(self):
        return self.name


def upload_to(instance, filename):
    filename_base, filename_ext = os.path.splitext(filename)
    return "painters/{painter}/{filename}{extension}".format(
        painter=slugify(instance.painter.name),
        filename=slugify(filename_base),
        extension=filename_ext.lower(),
    )


@python_2_unicode_compatible
class Picture(models.Model):
    painter = models.ForeignKey(Painter, verbose_name=_("Painter"), on_delete=models.CASCADE)
    title = models.CharField(_("Title"), max_length=255)
    picture = models.ImageField(_("Picture"), upload_to=upload_to)

    class Meta:
        verbose_name = _("Picture")
        verbose_name_plural = _("Pictures")

    def __str__(self):
        return self.title


@python_2_unicode_compatible
class Review(models.Model):
    picture = models.ForeignKey(Picture, verbose_name=_("Picture"), on_delete=models.CASCADE)
    reviewer = models.CharField(_("Reviewer name"), max_length=255)
    comment = models.TextField(_("Comment"))

    class Meta:
        verbose_name = _("Review")
        verbose_name_plural = _("Reviews")

    def __str__(self):
        return self.reviewer

Then I will create the administration definition for the models of the gallery app. Here I will set two types of administration for the Picture model:

  • By extending admin.StackedInline, I will create the administration stacked as an inline.
  • By extending admin.ModelAdmin, I will create the administration in a separate form.

In the Django model administration, besides the usual form fields, you can also include computed values. This can be done by having your fields (or fieldsets) and readonly_fields attributes refer to a callable or a method name.

You can set a translatable label for those computed values by defining a short_description attribute on the callable or method. If you want to render some HTML, you can also set the allow_tags attribute to True (otherwise your HTML string will be escaped).

# gallery/admin.py
# -*- coding: UTF-8 -*-
from django.contrib import admin
from django.core.urlresolvers import reverse
from django.utils.translation import ugettext_lazy as _
from django.utils.encoding import force_text

from .models import Painter, Picture, Review


def get_picture_preview(obj):
    if obj.pk:  # if object has already been saved and has a primary key, show picture preview
        return """<a href="{src}" target="_blank"><img src="{src}" alt="{title}" style="max-width: 200px; max-height: 200px;" /></a>""".format(
            src=obj.picture.url,
            title=obj.title,
        )
    return _("(choose a picture and save and continue editing to see the preview)")
get_picture_preview.allow_tags = True
get_picture_preview.short_description = _("Picture Preview")


class PictureInline(admin.StackedInline):
    model = Picture
    extra = 0
    fields = ["get_edit_link", "title", "picture", get_picture_preview]
    readonly_fields = ["get_edit_link", get_picture_preview]

    def get_edit_link(self, obj=None):
        if obj.pk:  # if object has already been saved and has a primary key, show link to it
            url = reverse('admin:%s_%s_change' % (obj._meta.app_label, obj._meta.model_name), args=[force_text(obj.pk)])
            return """<a href="{url}">{text}</a>""".format(
                url=url,
                text=_("Edit this %s separately") % obj._meta.verbose_name,
            )
        return _("(save and continue editing to create a link)")
    get_edit_link.short_description = _("Edit link")
    get_edit_link.allow_tags = True


@admin.register(Painter)
class PainterAdmin(admin.ModelAdmin):
    save_on_top = True
    fields = ["name"]
    inlines = [PictureInline]


class ReviewInline(admin.StackedInline):
    model = Review
    extra = 0
    fields = ["reviewer", "comment"]


@admin.register(Picture)
class PictureAdmin(admin.ModelAdmin):
    save_on_top = True
    fields = ["painter", "title", "picture", get_picture_preview]
    readonly_fields = [get_picture_preview]
    inlines = [ReviewInline]
In this administration setup, the get_edit_link() method creates an HTML link between the inline and the separate administration form for the Picture model. As you can see, I also added the get_picture_preview() function as a bonus. It is included in both administration definitions for the Picture model, and its purpose is to show a preview of the uploaded picture after saving.

To recap, nested inlines are not supported by Django out of the box. However, you can have your inlines edited in a separate page with the forms linked to each other. For the linking you would use some magic of the readonly_fields attribute.

What if you really need to have inlines under inlines in your project? In that case you might check out django-nested-admin, and don't hesitate to share your experience with it in the comments.

Django is Boring, or Why Tech Startups (Should) Use Django

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Dec 14, 2016.

I recently attended Django Under The Hood in Amsterdam, an annual gathering of Django core team members and developers from around the world. A common theme discussed at the conference this year is that “Django is boring.” While it’s not the first time this has been discussed, it still struck me as odd. Upon further reflection, however, I see Django’s “boringness” as a huge asset to the community and potential adopters of the framework.

Caktus first began using Django in late 2007. This was well before the release of Django 1.0, in the days when startups and established companies alike ran production web applications using Subversion “trunk” (akin to the Git “master” branch) rather than using a released version of the software. Using Django was definitely not boring, because it required reading each merged commit to see if it added a new feature you could use and to make sure it wasn’t going to break your project. Although Django kept us on our toes in the early days, it was clear that Django was growing into a robust and stable framework with hope for the future.

With the help of thousands of volunteers from around the world, Django’s progressed a lot since the early days of “tracking trunk.” What does it mean that the people developing Django itself consider it “boring,” and how does that change our outlook for the future of the framework? If you’re a tech startup looking for a web framework, why would you choose the “boring” option? Following are several reasons that Caktus still uses Django for all new custom web/SMS projects, reasons I think apply equally well in the startup environment.

1. Django has long taken pride in its “batteries included” philosophy.

Django strives to be a framework that solves common problems in web development in the best way possible. In my original post on the topic nearly 8 years ago, some of the key features included with Django were the built-in admin interface and a strong focus on data integrity, two features missing from Ruby on Rails, the other major web framework at the time.

Significant features that have arrived in Django since that time include support for aggregates and query expressions in the ORM, a built-in application for geographic applications (django.contrib.gis), a user messages framework, CSRF protection, Python 3 support, a configurable User model, improved database transaction management, support for database migrations, support for full-text search in Postgres, and countless other features, bug fixes, and security updates. The entire time, Django’s emphasis on backwards compatibility and its generous deprecation policy have made it perfectly reasonable to plan to support and grow applications over 10 years or more.

2. The community around Django continues to grow.

In the tradition of open source software, users of the framework new and old support each other via the mailing list, IRC channel, blog posts, StackOverflow, and cost-effective conferences around the globe. The ecosystem of reusable apps continues to grow, with 3317 packages available on https://djangopackages.org/ as of the time of this post.

A common historical pattern has been for apps or features to live external to Django until they’re “proven” in production by a large number of users, after which they might be merged into Django proper. Django also recently adopted the concept of “official” packages, where a third-party app might not make sense to merge into Django proper, but it’s sufficiently important to the wider Django community that the core team agrees to take ownership of its ongoing maintenance.

The batteries included in Django itself and the wealth of reusable apps not only help new projects get off the ground quickly, they also provide solutions that have undergone rigorous code review by experts in the relevant fields. This is particularly important in startup environments when the focus must be on building business-critical features quickly. The last thing a startup wants to do, for example, is focus on business-critical features at the expense of security or reliability; with Django, one doesn’t have to make this compromise.

3. Django is written in Python.

Python is one of the most popular, most taught programming languages in the world. Availability of skilled staff is a key concern for startups hoping to grow their team in the near future, so the prevalence of Python should reassure those teams looking to grow.

Similarly, Python as a programming language prides itself on readability; one should be able to understand the code one wrote 6-12 months ago. Although this is by no means new nor unique to Django, Python’s straightforward approach to development is another reason some developers might consider it “boring.” Both by necessity and convention, Python espouses the idea of clarity over cleverness in code, as articulated by Brian Kernighan in The Elements of Programming Style. Python’s philosophy about coding style is described in more detail in PEP 20 -- The Zen of Python. Leveraging this philosophy helps increase readability of the code and the bus factor of the project.

4. The documentation included with Django is stellar.

Not only does the documentation detail the usage of each and every feature in Django, it also includes detailed release notes, including any backwards-incompatible changes, along with each release. Again, while Django’s rigorous documentation practices aren’t anything new, writing and reading documentation might be considered “boring” by some developers.

Django’s documentation is important for two key reasons. First, it helps both new and existing users of the framework quickly determine how to use a given feature. Second, it serves as a “contract” for backwards-compatibility in Django; that is, if a feature is documented in Django, the project pledges that it will be supported for at least two additional releases (unless it’s already been deprecated in the release notes). Django’s documentation is helpful both to one-off projects that need to be built quickly, and to projects that need to grow and improve through numerous Django releases.

5. Last but not least, Django is immensely scalable.

The framework is used at companies like EventBrite, Disqus, and Instagram to handle web traffic and mobile app API usage on behalf of 500M+ users. Even after being acquired by Facebook, Instagram swapped out their database server but did not abandon Django. Although early startups don’t often have the luxury of worrying about this much traffic, it’s always good to know that one’s web framework can scale to handle dramatic and continuing spikes in demand.

At Caktus, we’ve engineered solutions for several projects using AWS Auto Scaling that create servers only when they’re needed, thereby maximizing scalability and minimizing hosting costs.

Django into the future

Caktus has long been a proponent of the Django framework, and I’m happy to say that remains true today. We established ourselves early on as one of the premiere web development companies specializing in Django, we’ve written previously about why we use Django in particular, and Caktus staff are regular contributors not only to Django itself but also to the wider community of open source apps and discussion surrounding the framework.

Django can be considered a best of breed collection of solutions to nearly all the problems common to web development and restful, mobile app API development that can be solved in generic ways. This is “boring” because most of the common problems have been solved already; there’s not a lot of low-hanging fruit for new developers to contribute. This is a good thing for startups, because it means there’s less need to build features manually that aren’t specific to the business.

The risk of adopting any “bleeding edge” technology is that the community behind it will lose interest and move on to something else, leaving the job of maintaining the framework up to the few companies without the budget to switch frameworks. There’s a secondary risk specific to more “fragmented” frameworks as well. Because of Django’s “batteries included” philosophy and focus on backwards compatibility, one can be assured that the features one selects today will continue to work well together in the future, which won’t always be the case with frameworks that rely on third-party packages to perform business-critical functions such as user management.

These risks couldn’t be any stronger in the world of web development, where the framework chosen must be considered a tried and true partner. A web framework is not a service, like a web server or a database, that can be swapped out for another similar solution with some effort. Switching web frameworks, especially if the programming language changes, may require rewriting the entire application from scratch, so it’s important to make the right choice up front. Django has matured substantially over the last 10 years, and I’m happy to celebrate that it’s now the “boring” option for web development. This means startups choosing Django today can focus more on what makes their projects special, and less on implementing common patterns in web development or struggling to perform a framework upgrade with significant, backwards-incompatible changes. It’s clear we made the right choice, and I can’t wait to see what startups adopt and grow on Django in the future.

Using Fanout.io in Django

By Peter Bengtsson from Django community aggregator: Community blog posts. Published on Dec 13, 2016.

Earlier this year we started using Fanout.io in Air Mozilla to enhance the experience for users awaiting content updates. Here I hope to flesh out its details a bit to inspire others to deploy a similar solution.

What It Is

First of all, Fanout.io is basically a service that handles your WebSockets. You put some of Fanout's JavaScript into your site that handles a persistent WebSocket connection between your site and Fanout.io. To push messages to your users, you send them to Fanout.io from the server and it "forwards" them over the WebSocket.

The HTML page looks like this:

<html>
<body>

  <h1>Web Page</h1>

<!-- replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page -->
<script 
  src="https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux/static/faye-browser-1.1.2-fanout1-min.js"
></script>
<script src="fanout.js"></script>
</body>
</html>

And the fanout.js script looks like this:

window.onload = function() {
  // replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page
  var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux')
  client.subscribe('/mycomments', function(data) {  
     console.log('Incoming updated data from the server:', data);
  })
};

And on the server it looks something like this:

from django.conf import settings
import fanout

fanout.realm = settings.FANOUT_REALM_ID
fanout.key = settings.FANOUT_REALM_KEY


def post_comment(request):
    """A django view function that saves the posted comment"""
    text = request.POST['comment']
    saved_comment = Comment.objects.create(text=text, user=request.user)
    fanout.publish('mycomments', {'new_comment': saved_comment.id})
    return http.JsonResponse({'comment_posted': True})

Note that, in the client-side code, there's no security since there's no authentication. Any client can connect to any channel, so it's important that you don't send anything sensitive. In fact, you should think of this pattern simply as a hint that something has changed. Here's a slightly more fleshed out example of how you'd use the subscription:

window.onload = function() {
  // replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page
  var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux')
  client.subscribe('/mycomments', function(data) {
    if (data.new_comment) {
      // server says a new comment has been posted on the server
      $.getJSON('/comments', function(response) {
        $('#comments .comment').remove();
        $.each(response.comments, function(i, comment) {
          $('<div class="comment">')
          .append($('<p>').text(comment.text))
          .append($('<span>').text('By: ' + comment.user.name))
          .appendTo('#comments');
        });
      });
    }
  })
};

Yes, I know jQuery isn't hip, but it demonstrates the pattern well. Also, in the real world you might not want to ask the server for all comments (and re-render them all), but instead do an AJAX query for only the comments added since some timestamp or other parameter.

Why It's Awesome

It's awesome because you can have a simple page that updates near instantly when the server's database is updated. The alternative would be a setInterval loop that frequently makes an AJAX query to see if there's new content to update. This is cumbersome because it requires much heavier AJAX queries. You might want to make it secure, so you engage sessions that need to be looked up each time. Or, since you're going to request it often, you have to write a very optimized server-side endpoint that is cheap to query frequently.

And last but not least, if you rely on an AJAX polling interval, you have to pick a frequency that your server can cope with, and it's likely to be in the range of several seconds or else it might overload the server. That means that updates are quite delayed.

But maybe most importantly, you don't need to worry about running a WebSocket server yourself. It's not terribly hard to run one on your laptop with a bit of Node Express or Tornado, but now you have yet another server to maintain and it, internally, needs to be connected to a "pub-sub framework" like Redis or a full-blown message queue.

Alternatives

Fanout.io is not the only service that offers this. The decision to use Fanout.io was taken about a year ago, and one of the attractive things it offers is a freemium option, which is ideal for doing local testing. The honest truth is that I can't remember the other justifications used to choose Fanout.io over its competitors, but here are some alternatives that popped up on a quick search:

It seems they all (including Fanout.io) have freemium plans and support authentication and REST APIs (for sending messages and for querying connected clients' stats).

There are also some more advanced, feature-packed solutions like Meteor, Firebase and GunDB that act more like databases connected via WebSockets or the like. For example, you can have a database act as a "conduit" for pushing data to a client: instead of sending the data from the server directly, you save it in a database which syncs to the connected clients.

Lastly, I've heard that Heroku has a really neat solution that does something similar, which it sets up as an extension.

Let's Get Realistic

The solution sketched out above is very simplistic. There are a lot more fine-grained details that you'd probably want to zoom in to if you're going to do this properly.

Throttling

In Air Mozilla, we call fanout.publish(channel, message) from a post_save ORM signal. If you have a lot of saves for some reason, you might be sending too many messages to the client. A throttling solution, per channel, simply makes sure your "callback" gets called only once per channel per small time frame. Here's the solution we employed:

window.Fanout = (function() {
  // _client is the Faye.Client instance created when the page loads.
  var _locks = {};
  return {
    subscribe: function subscribe(channel, callback) {
      _client.subscribe(channel, function(data) {
          if (_locks[channel]) {
              // throttled
              return;
          }
          _locks[channel] = true;
          callback(data);
          setTimeout(function() {
              _locks[channel] = false;
          }, 500);
      });
    }
  };
})();

Subresource Integrity

Subresource integrity is an important web security technique where you know in advance the hash of the remote JavaScript you include. That means that if someone tampers with the result of loading https://cdn.example.com/somelib.js, the browser compares its hash with the hash mentioned in the <script> tag and refuses to load the file if the hashes don't match.

In the example of Fanout.io it actually looks like this:

<script 
  src="https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux/static/faye-browser-1.1.2-fanout1-min.js"
  crossOrigin="anonymous"
  integrity="sha384-/9uLm3UDnP3tBHstjgZiqLa7fopVRjYmFinSBjz+FPS/ibb2C4aowhIttvYIGGt9"
></script>

You get the SHA from the Fanout.io documentation. It requires, and implies, that you use an exact version of the library. You can't use it like this: <script src="https://cdn.example/somelib.latest.min.js" ....

WebSockets vs. Long-polling

Fanout.io's JavaScript client follows a pattern that makes it compatible with clients that don't support WebSockets. The first technique it uses is called long-polling. With this, the server basically relies on standard HTTP techniques, but the responses are long-lasting: the request simply takes a very long time to respond, and when it does, that's when the data gets passed.

This is not a problem for modern browsers; they almost all support WebSockets. But you might have an application that isn't a modern browser.

Anyway, what Fanout.io does internally is first create a long-polling connection and then, shortly after, try to "upgrade" to WebSockets if they're supported. However, the projects I work on only need to support modern browsers, and there's a trick to tell Fanout to go straight to WebSockets:

var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux', {
    // What this means is that we're opting to have
    // Fanout *start* with fancy-pants WebSocket and
    // if that doesn't work it **falls back** on other
    // options, such as long-polling.
    // The default behaviour is that it starts with
    // long-polling and tries to "upgrade" itself
    // to WebSocket.
    transportMode: 'fallback'
});

Fallbacks

In the case of Air Mozilla, it already had a traditional solution whereby a setInterval loop does an AJAX query frequently.

Because the networks can be flaky or because something might go wrong in the client, the way we use it is like this:

var RELOAD_INTERVAL = 5;  // seconds

if (typeof window.Fanout !== 'undefined') {
    Fanout.subscribe('/' + container.data('subscription-channel-comments'), function(data) {
        // Supposedly the comments have changed.
        // For security, let's not trust the data but just take it
        // as a hint that it's worth doing an AJAX query
        // now.
        Comments.load(container, data);
    });
    // If Fanout doesn't work for some reason even though it
    // was made available, still use the regular old
    // interval. Just not as frequently.
    RELOAD_INTERVAL = 60 * 5;
}
setInterval(function() {
    Comments.reload_loop(container);
}, RELOAD_INTERVAL * 1000);

Use Fanout Selectively/Progressively

In the case of Air Mozilla, there are lots of pages. Some don't ever need a WebSocket connection. For example, it might be a simple CRUD (Create, Read, Update, Delete) page. So, for those, I made the whole Fanout functionality "lazy": it only gets set up if the page has some JavaScript that knows it needs it.

This also has the benefit that the Fanout resource loading etc. is slightly delayed until more pressing things have loaded and the DOM is ready.

You can see the whole solution here. And the way you use it here.

Have Many Channels

You can have as many channels as you like. Don't create a channel called comments when you can have a channel called comments-123 where 123 is the ID of the page you're on for example.

In the case of Air Mozilla, there's a channel for every single page. If you're sitting on a page with a commenting widget, it doesn't get WebSocket messages about newly posted comments on other pages.
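
On the server side, that per-page scoping is just string formatting of the channel name. Here's a minimal sketch reusing the fanout library from the earlier snippet; the page_id argument and the Comment model are hypothetical stand-ins:

from django import http

import fanout

from myapp.models import Comment  # hypothetical model from the earlier example

def post_comment(request, page_id):
    """Save the comment, then notify only the clients watching this page."""
    text = request.POST['comment']
    saved_comment = Comment.objects.create(text=text, user=request.user)
    # The channel is scoped to the page, so subscribers on other pages hear nothing.
    fanout.publish('comments-{}'.format(page_id), {'new_comment': saved_comment.id})
    return http.JsonResponse({'comment_posted': True})

The client then subscribes to the matching channel, e.g. client.subscribe('/comments-' + pageId, ...), instead of a global one.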

Conclusion

We've now used Fanout for almost a year in our little Django + jQuery app and it's been great. The management pages in Air Mozilla use AngularJS and the integration looks like this in the event manager page:

window.Fanout.subscribe('/events', function(data) {
    $scope.$apply(lookForModifiedEvents);
});

Fanout.io's been great to us. Really responsive support and very reliable. But if I were to start a fresh new project that needs a solution like this, I'd spend a little time investigating the competitors to see if there are some neat features I'd enjoy.

UPDATE

Fanout reached out to explain a bit more about what's great about Fanout.io:

"One of Fanout's biggest differentiators is that we use and promote open technologies/standards. For example, our service supports the open Bayeux protocol, and you can connect to it with any compatible client library, such as Faye. Nearly all competing services have proprietary protocols. This "open" aspect of Fanout aligns pretty well with Mozilla's values, and in fact you'd have a hard time finding any alternative that works the same way."

Transcoding with AWS- part two

By Krzysztof Żuraw Personal Blog from Django community aggregator: Community blog posts. Published on Dec 11, 2016.

Now that I have static and media files integrated with AWS, it's time to transcode them. In this post, I will give a short example of how to integrate AWS Elastic Transcoder with a Django application.

Basic terms

Elastic Transcoder allows you to transcode files from your S3 bucket to various formats. To set this service up, you first have to create a pipeline. What is a pipeline? Basically, it's a workflow describing how your transcoder should work. You can create one pipeline for long content and a different one for short content. In my application I created the following pipeline:

Pipeline configuration

With my pipeline configured, the next step is to create jobs. Jobs are tasks for the transcoder that say which file I want to transcode and to what format or codec:

Job details

The PresetId is a user-created or pre-existing configuration that defines the format of the transcoder output: is it mp4 or maybe flac? What resolution should video files have? All of this is set up in the preset.
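
If you are not sure which preset IDs exist (AWS ships a set of system presets in addition to any you create), you can list them through boto3. A small sketch, assuming the same eu-west-1 region used in the code below:

import boto3

transcoder = boto3.client('elastictranscoder', region_name='eu-west-1')

# Print every available preset's ID, name, and container so you can pick
# the right PresetId values for your jobs.
paginator = transcoder.get_paginator('list_presets')
for page in paginator.paginate():
    for preset in page['Presets']:
        print(preset['Id'], preset['Name'], preset['Container'])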

Now that we know the basic terms used in AWS Elastic Transcoder, let's jump into the code.

Code

AWS has a very good Python API called boto3. Using that API and a few examples from the internet, I was able to create a simple class for creating transcode jobs:

import os

from django.conf import settings

import boto3


class AudioTranscoder(object):

    def __init__(self, region_name='eu-west-1', pipeline_name='Audio Files'):
        self.region_name = region_name
        self.pipeline_name = pipeline_name
        self.transcoder = boto3.client('elastictranscoder', self.region_name)
        self.pipeline_id = self.get_pipeline()

    def get_pipeline(self):
        paginator = self.transcoder.get_paginator('list_pipelines')
        for page in paginator.paginate():
            for pipeline in page['Pipelines']:
                if pipeline['Name'] == self.pipeline_name:
                    return pipeline['Id']

    def start_transcode(self, filename):
        base_filename, _ = self.create_aws_filename(filename, '')
        wav_aws_filename, wav_filename = self.create_aws_filename(
            filename, extension='.wav'
        )
        flac_aws_filename, flac_filename = self.create_aws_filename(
            filename, extension='.flac'
        )
        mp4_aws_filename, mp4_filename = self.create_aws_filename(
            filename, extension='.mp4'
        )
        self.transcoder.create_job(
            PipelineId=self.pipeline_id,
            Input={
                'Key': base_filename,
                'FrameRate': 'auto',
                'Resolution': 'auto',
                'AspectRatio': 'auto',
                'Interlaced': 'auto',
                'Container': 'auto'
            },
            Outputs=[{
                'Key': wav_aws_filename,
                'PresetId': '1351620000001-300300'
            }, {
                'Key': flac_aws_filename,
                'PresetId': '1351620000001-300110'
            }, {
                'Key': mp4_aws_filename,
                'PresetId': '1351620000001-100110'
            }]
        )
        return (wav_filename, flac_filename, mp4_filename)

    @staticmethod
    def create_aws_filename(filename, extension):
        aws_filename = os.path.join(
            settings.MEDIAFILES_LOCATION, filename + extension
        )
        return aws_filename, os.path.basename(aws_filename)


transcoder = AudioTranscoder()

Going from the top: I specified my region_name as well as the pipeline_name so that boto3 knows which region to connect to. In the get_pipeline method I iterate through all available pipelines and return the one that has the same name as pipeline_name. The paginator there is an object which holds one portion of the data at a time, so the user doesn't have to wait until all available pipelines are fetched.

The main logic is in the start_transcode method. At the beginning, I use the helper function create_aws_filename, which creates a proper AWS file name like media/my_mp3.mp3 and returns both that whole path and the bare filename. After I've created filenames for all of my outputs, I call create_job, which creates a job based on pipeline_id and base_filename. A job can have multiple outputs, so I specified one each for the wav, flac and mp4 files. How is this used in the code? Let's go to the view:

class UploadAudioFileView(FormView):
    # some code

    def form_valid(self, form):
        audio_file = AudioFile(
            name=self.get_form_kwargs().get('data')['name'],
            mp3_file=self.get_form_kwargs().get('files')['mp3_file']
        )
        audio_file.save()
        wav_filename, flac_filename, mp4_filename = transcoder.start_transcode(
            filename=audio_file.mp3_file.name
        )
        audio_file.mp4_file = mp4_filename
        audio_file.flac_file = flac_filename
        audio_file.wav_file = wav_filename
        audio_file.save()
        return HttpResponseRedirect(
            reverse('audio:detail', kwargs={'pk': audio_file.pk})
        )

In form_valid, I first call save() on the AudioFile object, which uploads the file to the S3 bucket. Then I use transcoder.start_transcode and, based on the output from this function, I match the filenames to their respective fields. I know this solution is not the best one, as I have to call save twice, so if you have a better way to do this I'm glad to hear it from you.

That's all for today! Transcoding works fine, but what happens when files are big? Transcoding them will take a lot of time, and the user doesn't want to wait that long for a response. The solution will be revealed in the next post.

Other blog posts in this series

The code that I have written so far is available on GitHub. Stay tuned for the next blog post in this series.

Special thanks to Kasia for being an editor for this post. Thank you.

While creating this blog post I used code from the official boto GitHub account.

How to Create Group By Queries With Django ORM

By Simple is Better Than Complex from Django community aggregator: Community blog posts. Published on Dec 06, 2016.

This tutorial is about how to implement SQL-like group by queries using the Django ORM. It’s a fairly common operation, especially for those who are familiar with SQL. The Django ORM is actually an abstraction layer that lets us play with the database as if it were object-oriented, but in the end it’s just a relational database and all the operations are translated into SQL statements.

Most of the work can be done by retrieving the raw data from the database and playing with it on the Python side: grouping the data in dictionaries, iterating through it, computing sums, averages and what not. But the database is a very powerful tool and can do much more than simply store the data, and often you can do the work much faster directly in the database.

Generally speaking, when you start doing group by queries, you are no longer interested in the details of each model instance (or table row); instead, you want to extract new information from your dataset, based on some common aspects shared between the model instances.

Let’s have a look at an example:

class Country(models.Model):
    name = models.CharField(max_length=30)

class City(models.Model):
    name = models.CharField(max_length=30)
    country = models.ForeignKey(Country)
    population = models.PositiveIntegerField()

And the raw data stored in the database:

cities

id | name              | country_id | population
1  | Tokyo             | 28 | 36,923,000
2  | Shanghai          | 13 | 34,000,000
3  | Jakarta           | 19 | 30,000,000
4  | Seoul             | 21 | 25,514,000
5  | Guangzhou         | 13 | 25,000,000
6  | Beijing           | 13 | 24,900,000
7  | Karachi           | 22 | 24,300,000
8  | Shenzhen          | 13 | 23,300,000
9  | Delhi             | 25 | 21,753,486
10 | Mexico City       | 24 | 21,339,781
11 | Lagos             | 9  | 21,000,000
12 | São Paulo         | 1  | 20,935,204
13 | Mumbai            | 25 | 20,748,395
14 | New York City     | 20 | 20,092,883
15 | Osaka             | 28 | 19,342,000
16 | Wuhan             | 13 | 19,000,000
17 | Chengdu           | 13 | 18,100,000
18 | Dhaka             | 4  | 17,151,925
19 | Chongqing         | 13 | 17,000,000
20 | Tianjin           | 13 | 15,400,000
21 | Kolkata           | 25 | 14,617,882
22 | Tehran            | 11 | 14,595,904
23 | Istanbul          | 2  | 14,377,018
24 | London            | 26 | 14,031,830
25 | Hangzhou          | 13 | 13,400,000
26 | Los Angeles       | 20 | 13,262,220
27 | Buenos Aires      | 8  | 13,074,000
28 | Xi'an             | 13 | 12,900,000
29 | Paris             | 6  | 12,405,426
30 | Changzhou         | 13 | 12,400,000
31 | Shantou           | 13 | 12,000,000
32 | Rio de Janeiro    | 1  | 11,973,505
33 | Manila            | 18 | 11,855,975
34 | Nanjing           | 13 | 11,700,000
35 | Rhine-Ruhr        | 16 | 11,470,000
36 | Jinan             | 13 | 11,000,000
37 | Bangalore         | 25 | 10,576,167
38 | Harbin            | 13 | 10,500,000
39 | Lima              | 7  | 9,886,647
40 | Zhengzhou         | 13 | 9,700,000
41 | Qingdao           | 13 | 9,600,000
42 | Chicago           | 20 | 9,554,598
43 | Nagoya            | 28 | 9,107,000
44 | Chennai           | 25 | 8,917,749
45 | Bangkok           | 15 | 8,305,218
46 | Bogotá            | 27 | 7,878,783
47 | Hyderabad         | 25 | 7,749,334
48 | Shenyang          | 13 | 7,700,000
49 | Wenzhou           | 13 | 7,600,000
50 | Nanchang          | 13 | 7,400,000
51 | Hong Kong         | 13 | 7,298,600
52 | Taipei            | 29 | 7,045,488
53 | Dallas–Fort Worth | 20 | 6,954,330
54 | Santiago          | 14 | 6,683,852
55 | Luanda            | 23 | 6,542,944
56 | Houston           | 20 | 6,490,180
57 | Madrid            | 17 | 6,378,297
58 | Ahmedabad         | 25 | 6,352,254
59 | Toronto           | 5  | 6,055,724
60 | Philadelphia      | 20 | 6,051,170
61 | Washington, D.C.  | 20 | 6,033,737
62 | Miami             | 20 | 5,929,819
63 | Belo Horizonte    | 1  | 5,767,414
64 | Atlanta           | 20 | 5,614,323
65 | Singapore         | 12 | 5,535,000
66 | Barcelona         | 17 | 5,445,616
67 | Munich            | 16 | 5,203,738
68 | Stuttgart         | 16 | 5,200,000
69 | Ankara            | 2  | 5,150,072
70 | Hamburg           | 16 | 5,100,000
71 | Pune              | 25 | 5,049,968
72 | Berlin            | 16 | 5,005,216
73 | Guadalajara       | 24 | 4,796,050
74 | Boston            | 20 | 4,732,161
75 | Sydney            | 10 | 5,000,500
76 | San Francisco     | 20 | 4,594,060
77 | Surat             | 25 | 4,585,367
78 | Phoenix           | 20 | 4,489,109
79 | Monterrey         | 24 | 4,477,614
80 | Inland Empire     | 20 | 4,441,890
81 | Rome              | 3  | 4,321,244
82 | Detroit           | 20 | 4,296,611
83 | Milan             | 3  | 4,267,946
84 | Melbourne         | 10 | 4,650,000

countries

id | name
1  | Brazil
2  | Turkey
3  | Italy
4  | Bangladesh
5  | Canada
6  | France
7  | Peru
8  | Argentina
9  | Nigeria
10 | Australia
11 | Iran
12 | Singapore
13 | China
14 | Chile
15 | Thailand
16 | Germany
17 | Spain
18 | Philippines
19 | Indonesia
20 | United States
21 | South Korea
22 | Pakistan
23 | Angola
24 | Mexico
25 | India
26 | United Kingdom
27 | Colombia
28 | Japan
29 | Taiwan

This data is from Wikipedia, and I don’t know to what extent it is correct, but for our example it doesn’t really matter.

Considering the whole dataset, if we wanted to know the total number of inhabitants across all 84 cities, we could use an aggregate query:

from django.db.models import Sum

City.objects.all().aggregate(Sum('population'))
{'population__sum': 970880224}  # 970,880,224

Or the average population in the top 84 cities:

from django.db.models import Avg

City.objects.all().aggregate(Avg('population'))
{'population__avg': 11558097.904761905}  # 11,558,097.90

What if we now wanted to see the total population aggregated per country, instead of for the whole dataset? In that case we can no longer use aggregate; instead, we will use annotate.

The aggregate clause is terminal: it returns a Python dictionary rather than a queryset, meaning you can’t chain any further queryset methods onto it, and it always returns a single result. So if you wanted to get the population sum per country using aggregate, you would need to do something like this:

Don't
from django.db.models import Sum

for country in Country.objects.all():
    result = City.objects.filter(country=country).aggregate(Sum('population'))
    print('{}: {}'.format(country.name, result['population__sum']))

# Output:
# -------
# Brazil: 38676123
# Turkey: 19527090
# Italy: 8589190
# Bangladesh: 17151925
# Canada: 6055724
# France: 12405426
# Peru: 9886647
# Argentina: 13074000
# Nigeria: 21000000
# Australia: 9650500
# Iran: 14595904
# ...

While the result is correct, this approach executes 30 different queries against the database (one to fetch the countries, plus one aggregate per country). We’ve also lost some of the capabilities of the ORM, such as ordering this result set. The data would be more interesting if we could order it by the countries with the largest population, for example.

A better way to do this is with annotate, which is translated into a GROUP BY query in the database:

Do
City.objects.all().values('country__name').annotate(Sum('population'))

[
  {'country__name': u'Angola', 'population__sum': 6542944},
  {'country__name': u'Argentina', 'population__sum': 13074000},
  {'country__name': u'Australia', 'population__sum': 9650500},
  {'country__name': u'Bangladesh', 'population__sum': 17151925},
  {'country__name': u'Brazil', 'population__sum': 38676123},
  '...(remaining elements truncated)...'
]

Much better, right?

Now, if we want to order by the country population, we can give the annotation an alias, which makes the output look cleaner and lets us use it in the order_by() clause:

City.objects.all() \
  .values('country__name') \
  .annotate(country_population=Sum('population')) \
  .order_by('-country_population')

[
  {'country__name': u'China', 'country_population': 309898600},
  {'country__name': u'United States', 'country_population': 102537091},
  {'country__name': u'India', 'country_population': 100350602},
  {'country__name': u'Japan', 'country_population': 65372000},
  {'country__name': u'Brazil', 'country_population': 38676123},
  '...(remaining elements truncated)...'
]

Here is what the last SQL query looks like:

  SELECT "core_country"."name", SUM("core_city"."population") AS "country_population"
    FROM "core_city" INNER JOIN "core_country" ON ("core_city"."country_id" = "core_country"."id")
GROUP BY "core_country"."name"
ORDER BY "country_population" DESC
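
If you want to inspect the SQL that Django generates for any queryset yourself, you can print its query attribute; the exact output depends on your database backend:

queryset = City.objects.all() \
  .values('country__name') \
  .annotate(country_population=Sum('population')) \
  .order_by('-country_population')

print(queryset.query)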

An important thing to note here: it only makes sense to put fields in the values() clause that can actually be grouped. Every field you add to the values() clause will be used to build the GROUP BY clause.

Look at this queryset:

City.objects.all().values('name', 'country__name').annotate(Sum('population'))

The resulting SQL query would be:

  SELECT "core_city"."name", "core_country"."name", SUM("core_city"."population") AS "population__sum"
    FROM "core_city" INNER JOIN "core_country" ON ("core_city"."country_id" = "core_country"."id")
GROUP BY "core_city"."name", "core_country"."name"

This grouping has no practical effect, because all the city names are unique, so there is nothing to group (the database will still group, but each “group” will contain only one row). We can see this simply by counting each queryset:

City.objects.all().values('name', 'country__name').annotate(Sum('population')).count()
84

City.objects.all().values('country__name').annotate(Sum('population')).count()
29

That’s what I meant at the beginning of the post when I said you are no longer interested in the details of each row. When we group by country to get the sum of the population, we lose the details of the cities (at least in the query result).
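
If you still want some per-group information about the cities, you can compute more than one aggregate in the same annotate call. For example, this small sketch (using Django’s Count aggregate) adds the number of cities that contributed to each total:

from django.db.models import Count, Sum

City.objects.all() \
  .values('country__name') \
  .annotate(country_population=Sum('population'), city_count=Count('id')) \
  .order_by('-country_population')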

Sometimes it makes sense to have more than one field in the values() clause. For example, suppose our database were composed of City / State / Country models. Then we could group using .values('state__name', 'country__name'). This way you would get the population by state, while avoiding states from different countries that happen to share the same name being grouped together (see the sketch below).
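
Here is a minimal sketch of that scenario. The State model and the extra foreign keys are hypothetical and not part of the example dataset above; they only illustrate the grouping:

class State(models.Model):
    name = models.CharField(max_length=30)
    country = models.ForeignKey(Country, on_delete=models.CASCADE)

class City(models.Model):
    name = models.CharField(max_length=30)
    state = models.ForeignKey(State, on_delete=models.CASCADE)
    country = models.ForeignKey(Country, on_delete=models.CASCADE)
    population = models.PositiveIntegerField()

# Population per state; grouping by both names keeps apart states from
# different countries that share the same name.
City.objects.values('state__name', 'country__name').annotate(Sum('population'))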

The values you generate in the database using the annotate clause can also be used to filter data. In SQL this is done with the HAVING clause, which reads very idiomatically: you can read the query almost like plain English. On the Django side, it’s a simple filter() applied after the annotate().

For example, let’s say we want to see the total population by country, but only those countries where the total population is greater than 50,000,000:

City.objects.all() \
  .values('country__name') \
  .annotate(country_population=Sum('population')) \
  .filter(country_population__gt=50000000) \
  .order_by('-country_population')

[
  {'country__name': u'China', 'country_population': 309898600},
  {'country__name': u'United States', 'country_population': 102537091},
  {'country__name': u'India', 'country_population': 100350602},
  {'country__name': u'Japan', 'country_population': 65372000}
]

And finally the SQL query:

  SELECT "core_country"."name", SUM("core_city"."population") AS "country_population"
    FROM "core_city" INNER JOIN "core_country" ON ("core_city"."country_id" = "core_country"."id")
GROUP BY "core_country"."name" HAVING SUM("core_city"."population") > 50000000
ORDER BY "country_population" DESC

I hope you found this small tutorial helpful! If you have any questions, please leave a comment below!

I'm Gonna Regret This

By chrism from . Published on Jun 14, 2016.

A plea for liberals to fight for individual rights.

Is Open Source Consulting Dead?

By chrism from . Published on Sep 10, 2013.

Has Elvis left the building? Will we be able to sustain ourselves as open source consultants?

Consulting and Patent Indemnification

By chrism from . Published on Aug 09, 2013.

Article about consulting and patent indemnification

Python Advent Calendar 2012 Topic

By chrism from . Published on Dec 24, 2012.

An entry for the 2012 Japanese advent calendar at http://connpass.com/event/1439/

Why I Like ZODB

By chrism from . Published on May 15, 2012.

Why I like ZODB better than other persistence systems for writing real-world web applications.

A str.__iter__ Gotcha in Cross-Compatible Py2/Py3 Code

By chrism from . Published on Mar 03, 2012.

A bug caused by a minor incompatibility can remain latent for long periods of time in a cross-compatible Python 2 / Python 3 codebase.

In Praise of Complaining

By chrism from . Published on Jan 01, 2012.

In praise of complaining, even when the complaints are absurd.

2012 Python Meme

By chrism from . Published on Dec 24, 2011.

My "Python meme" replies.

In Defense of Zope Libraries

By chrism from . Published on Dec 19, 2011.

A much too long defense of Pyramid's use of Zope libraries.

Plone Conference 2011 Pyramid Sprint

By chrism from . Published on Nov 10, 2011.

An update about the happenings at the recent 2011 Plone Conference Pyramid sprint.

Jobs-Ification of Software Development

By chrism from . Published on Oct 17, 2011.

Try not to Jobs-ify the task of software development.

WebOb Now on Python 3

By chrism from . Published on Oct 15, 2011.

Report about porting to Python 3.

Open Source Project Maintainer Sarcastic Response Cheat Sheet

By chrism from . Published on Jun 12, 2011.

Need a sarcastic response to a support interaction as an open source project maintainer? Look no further!

Pylons Miniconference #0 Wrapup

By chrism from . Published on May 04, 2011.

Last week, I visited the lovely Bay Area to attend the 0th Pylons Miniconference in San Francisco.

Pylons Project Meetup / Minicon

By chrism from . Published on Apr 14, 2011.

In the SF Bay Area on the 28th, 29th, and 30th of this month (April), 3 separate Pylons Project events.