Open Source Posts

Code, Code, Code

By pydanny's blog from Django community aggregator: Community blog posts. Published on May 28, 2017.

I'm often asked by new programmers how they can forge a path into using their skills professionally. Or how they can get better at writing software.

How to Improve Your Coding Skills

This was my path. It may not be your path.

  1. I coded. A lot. From silly little scripts to automating tasks to attempting full-blown projects. At work or for fun. I failed a lot, but learned along the way.
  2. I didn't jump from language to language. Instead I stayed in a few places for years and focused my learning on those tools. My 19+ year career can be summed up as FoxPro then Java then Python. In the middle of things I picked up JavaScript. Sure, I've dallied with a few things (Lisp, Haskell, Lua, Perl, ColdFusion, Go), but by staying focused on a small set of tools I'm better than mediocre.
  3. I coded lots. Yes, this is a repeat of #1.
  4. Once I got the basics of a language, I looked up its best practices. Then I religiously adhered to them, even becoming dogmatic about it. In general this means my code is more easily read. More easily debugged. And most importantly, more easily shared.
  5. Did I mention that I coded a lot? You can never get good at anything unless you practice. Another repeat of #1.
  6. I got over my fear/pride of asking questions. Well, mostly, I still am afraid/prideful from time to time. Honestly, by asking questions you aren't showing what you don't know, you are showing you are willing to learn. Also, the simple act of figuring out how to ask a question can put you in the right mindset to determine the answer yourself.
  7. As soon as I asked a question, whether or not I got an answer, I coded some more. Code, code, code! Yet another repeat of #1.
  8. Once I got the hang of a language, I looked for cookbooks and/or pocket references on it. I prefer paper copies of tech books (everything else I read is electronic). The recipes in cookbooks become the foundation of my toolkit. The terse, easy-to-find reminders in the pocket reference mean less cognitive overload.
  9. I took those recipes and references and coded with them. Again and again I coded. In work hours or play time. Practice makes perfect! Why do I keep repeating #1?
  10. Over the years I've stayed with the easiest-to-learn stable IDEs/text editors. Yes, I know there are really powerful tools with arcane commands (Vim, EMACS, etc), but I don't want to have to stop what I'm doing to learn new tools. I want to code, not tinker with desktop tools or arcane text editors.
  11. And again, reference back to #1, I use the text editor to write code. Code, code, code! Until my fingers and hands hurt, until I've had to learn how to deal with carpal tunnel syndrome. Code, code, code! It's like learning martial arts, guitar, or anything else: repetition of simple actions gives you the confidence to either combine those actions into something greater or learn something more complex.

What I Wish I Had Done

  • Studied computer science. If I could do it all over again, that would have been the focus of my academic studies. It wouldn't replace anything on my list; the path I've defined remains the same. Code, code, code!
  • It goes without saying that I should have taken more tutorials. Nothing gives a kickstart like having an instructor, online or in-person, who guides you down the right path. Then you can code, code, code!

Practice makes perfect, right?

Code, Code, Code

Shaun M. Thomas: PG Phriday: DIY in the CLI (Part 1)

From Planet PostgreSQL. Published on Jun 24, 2016.

On a higher level, Postgres has a bevy of libraries, interfaces, and clients for accessing a database instance. From language APIs to GUIs like pgAdmin, or SaaS entries like JackDB, every flavor of interaction is covered. And yet, that’s only a small part of the story. For those who dare to tread into the watery depths, there’s also the world of dark incantations that is the command-line.

While most are aware of psql, the Postgres command-line client for accessing databases, there are far more creatures lurking in the black oblivion which deserve more visibility. In the spirit of getting the most from our tools, let’s take a look at these and consider how they might be useful in scenarios that aren’t restricted to boring old DBA considerations. To that end, let’s cover some of the standard client applications that anyone might use.

To begin, there are several commands that cover standard SQL operations. Things like creating or dropping databases, users, or procedural languages are all possible without opening an interactive SQL session. Normally, we'd have to use the equivalent SQL commands:

CREATE DATABASE foo;
DROP DATABASE foo;
 
CREATE USER kitty_cat WITH PASSWORD 'M30w9|_|rr';
DROP USER kitty_cat;
 
CREATE LANGUAGE plpythonu;
DROP LANGUAGE plpythonu;

While it's possible to run all of these by forcing them through psql -c, that makes automation, testing, and other tasks somewhat awkward, or forces sysadmins to learn SQL for no reason. So Postgres provides functional wrappers we can use instead. They're named about as we might expect, too:

createdb foo
dropdb foo
 
createuser kitty_cat --pwprompt
dropuser kitty_cat
 
createlang plpythonu
droplang plpythonu

As would be expected, all of these commands include the optional parameters from their SQL equivalents. This means a couple of handy things: flags can be listed with -? or --help, and longer descriptions are available via comprehensive man pages. The Postgres online documentation is great, and having all of this information at our fingertips is even better.
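For instance, here's a hedged sketch of how the optional flags mirror their SQL counterparts (the owner and encoding here are just illustrative values):

# Roughly equivalent to: CREATE DATABASE foo OWNER kitty_cat ENCODING 'UTF8';
createdb --owner=kitty_cat --encoding=UTF8 foo

# Full flag listings and descriptions:
createdb --help
man createdb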

Of course, we’re not limited to databases, users, and languages. There are a few commands to streamline basic maintenance as well, and as might be expected, these also have SQL equivalents. Well, mostly. Consider these three scenarios:

  1. We want to VACUUM every database every night following a busy day of transactions. We've disabled autovacuum because we have a very active database, and have instead opted for nightly maintenance.
  2. Over the years, our indexes on the user_session and login_state tables have gotten a bit fragmented, and we want to rebuild them.
  3. We have clustered multiple tables, sorting their disk pages along beneficial indexes. Due to MVCC storage, tables that experience updates need periodic re-clustering to maintain this ordering. To do this, we need to run CLUSTER regularly.

Assuming we have two databases, named ‘trading’ and ‘operations’ for example, we could do these things using SQL commands:

\c trading
VACUUM ANALYZE;
CLUSTER;
REINDEX TABLE user_session;
REINDEX TABLE login_state;
 
\c operations
VACUUM ANALYZE;
CLUSTER;

This is all perfectly serviceable. Yet for administrators or end users that don’t really care about SQL syntax, or for operations we want to automate, there is an easier way. We could get the same result using three command-line tools:

vacuumdb --all --analyze
reindexdb --table=user_session --table=login_state trading
clusterdb --all
 
# Or with short options:
 
vacuumdb -az
reindexdb -t user_session -t login_state trading
clusterdb -a

Unlike their SQL equivalents, the command-line tools can combine operations or target multiple objects. We leveraged that to reindex both tables with a single command, and vacuum or cluster all databases in our instance. This is about the time our tools become more than functional wrappers of the SQL commands. Parameters like --all illustrate client-level looping at the very least, and provide a great segue into heavier utilities.

Backing up databases and restoring them, for instance, leverages multiple high and low level Postgres interfaces. Due to this complexity, there is no SQL analog. The Postgres backup and restore utilities have also matured a lot over the years, gaining parallel dumping functionality as recently as 9.3.

Taking this into account, we could clone one of the previously mentioned databases using provided tools in two different ways:

# Just copy a target database by itself:
 
pg_dump --jobs=2 --format=directory --file=backup_dir trading
createdb trading_clone
pg_restore --jobs=2 --dbname=trading_clone backup_dir
 
# Copy the whole freaking instance at the binary level.
 
pg_basebackup -D clone_dir

Just so we’re clear, both approaches have pluses and minuses as well as applicable scenarios beyond the scope of this article. We’re merely illustrating very basic usage. Beyond that, the first method is a variant of the common dump/restore pattern used since time immemorial to perform Postgres upgrades until pg_upgrade hit the scene in 9.0. One of the primary reasons it fell out of favor was due to the growing trend of immense databases.

Even using parallel functionality, dumping the contents of every table in a database 200GB or larger will be extremely slow. Restoring that information is even worse, as we not only have to import all of the same data, but all indexes and constraints must be rebuilt from scratch. Before pg_restore was capable of parallel restores in 8.4, restores were even more frustrating. This, along with the new online replication capabilities, is the origin of pg_basebackup.

By default, pg_basebackup merely utilizes the Postgres data replication stream and reconstructs the host instance in a specified directory. This means we can start Postgres from that directory as a replica, or use pg_upgrade to test newer versions while the old one keeps running. This encourages online failbacks in case of failed upgrades, multiple upgrade test runs, or running applications on new versions to test for edge cases and compatibility concerns.
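As a rough sketch of that workflow on the 9.x series (the host name and ports are placeholders, and the recovery.conf approach shown applies to pre-10 versions only):

pg_basebackup -D clone_dir -X stream

# Hypothetical: turn the copy into a streaming replica on another port
cat > clone_dir/recovery.conf <<EOF
standby_mode = 'on'
primary_conninfo = 'host=primary_host port=5432'
EOF

pg_ctl -D clone_dir -o "--port=5433" start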

Principally, it allows us to separate instance management from content management. We still need pg_dump and pg_restore, but they're no longer primarily backup tools. This is especially true for pg_dump. It has multiple export formats which can theoretically be used to transfer table data into other storage systems. Or maybe we just want to export and inspect raw schema creation commands.

What else is left on the client end of things? For those that want to build Postgres extensions, or get information about the Postgres binaries installed on a server, we have pg_config. Its output is basically a bunch of environment settings that were used to build Postgres itself. This is especially useful with distributed builds that have long lists of arbitrary configure flags. After all, here’s how Ubuntu compiles it:

pg_config --configure
 
'--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-openssl' '--with-libxml' '--with-libxslt' '--with-tclconfig=/usr/lib/x86_64-linux-gnu/tcl8.6' '--with-includes=/usr/include/tcl8.6' 'PYTHON=/usr/bin/python' '--mandir=/usr/share/postgresql/9.4/man' '--docdir=/usr/share/doc/postgresql-doc-9.4' '--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/' '--datadir=/usr/share/postgresql/9.4' '--bindir=/usr/lib/postgresql/9.4/bin' '--libdir=/usr/lib/x86_64-linux-gnu/' '--libexecdir=/usr/lib/postgresql/' '--includedir=/usr/include/postgresql/' '--enable-nls' '--enable-integer-datetimes' '--enable-thread-safety' '--enable-tap-tests' '--enable-debug' '--disable-rpath' '--with-uuid=e2fs' '--with-gnu-ld' '--with-pgport=5432' '--with-system-tzdata=/usr/share/zoneinfo' 'CFLAGS=-g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -I/usr/include/mit-krb5 -fPIC -pie -DLINUX_OOM_SCORE_ADJ=0 -fno-omit-frame-pointer' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5' '--with-krb5' '--with-gssapi' '--with-ldap' '--with-selinux' 'CPPFLAGS=-D_FORTIFY_SOURCE=2

Some of that might be nice to know on occasion.

We also have pg_isready, a “new” addition with 9.3. It’s just a quick connection check with an exit status that follows standard exit codes. This is good for basic monitoring, but not much else. On the other hand, we no longer have to run a bogus command through psql and deal with the much more verbose client-oriented output. On an automation front, that’s a major step forward.
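A minimal monitoring sketch (the host name is a placeholder); pg_isready documents its exit codes as 0 for accepting connections, 1 for rejecting them (e.g. during startup), 2 for no response, and 3 for invalid parameters:

pg_isready --host=db_host --timeout=3

# In a cron job or monitoring script:
if ! pg_isready --quiet --host=db_host; then
    echo "Postgres is not accepting connections!" >&2
fi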

And finally, there’s benchmarks. Perhaps we’re testing new hardware, or a new Postgres version, or maybe we’re just curious. Having recently escaped existence as a “mere” contrib tool in 9.5, now anyone and their dog can beat up Postgres for fun and profit. It’s great to obtain data for articles like this, too.

Here’s a quick example:

createdb pgbench
pgbench --initialize --scale=100 pgbench
pgbench --client=4 --jobs=2 --select-only --time=10 pgbench
 
starting vacuum...end.
transaction type: SELECT only
scaling factor: 100
query mode: simple
number of clients: 4
number of threads: 2
duration: 10 s
number of transactions actually processed: 168447
latency average: 0.237 ms
tps = 16844.169409 (including connections establishing)
tps = 16853.718425 (excluding connections establishing)

Generally read/write tests should be larger than memory to really stress a system, but we’d need a much larger scale for that on most modern systems. At scale 100 we get about 1.5GB over 10M rows of data, so on a small 8GB test VM, we’d need a scale of 600 or higher to even start hitting the disks.

Beyond the basic test framework, we can also execute arbitrary SQL scripts. This makes it possible to stress-test transactional databases with multiple parallel clients, running anything from simple SELECT statements on our own tables, to intricate stored procedures that update fact tables. As such, it's one of the most valuable tools in the arsenal of both users and DBAs, provided they even know it exists.
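A minimal sketch of that usage, reusing the trading database and the user_session table from the earlier examples (the script contents are purely illustrative):

# Contents of stress.sql (any SQL we want hammered in parallel):
#   SELECT count(*) FROM user_session;

pgbench --no-vacuum --client=8 --jobs=4 --time=30 --file=stress.sql trading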

There are a couple more tools I’ve left out, but they’re related to logical and stream replication and are better covered in a more in-depth manner. Beyond that, take a look at your Postgres binary directory; you might be surprised what’s hiding there.

Because distributions like to hide all the cool stuff, we may have to go looking for some of these utilities. Debian and Ubuntu users can find binaries in /usr/lib/postgresql/9.x/bin, while RHEL variants like CentOS should look in /usr/pgsql-9.x/bin.

Explore often, but do so cautiously!

Joshua Drake: PgConf.US: 2016 Kicking the donkey of PostgreSQL Replication

From Planet PostgreSQL. Published on Jun 23, 2016.

My slides from my presentation at PgConf.US 2016:

 

Django with Bash for Windows

By GoDjango - Django Screencasts from Django community aggregator: Community blog posts. Published on Jun 23, 2016.

Doing django development using "Bash for Windows" seems to work pretty well. See what it takes to set it up and get started.
Watch Now...

Hans-Juergen Schoenig: PostgreSQL instance-level encryption

From Planet PostgreSQL. Published on Jun 23, 2016.

We have recently published a patch to provide full database encryption for PostgreSQL. Many business critical applications require full encryption to match legal or business requirements. As more and more customers asked for this, we decided to tackle this important issue. The idea behind our patch is to store data on disk in encrypted format […]

The post PostgreSQL instance-level encryption appeared first on Cybertec - The PostgreSQL Database Company.

Ernst-Georg Schmid: When 'good enough' is good enough - approximate answers with PostgreSQL 9.4999

From Planet PostgreSQL. Published on Jun 22, 2016.

Approximation in databases seems to be an alien concept at first. But if your application can deal with a known and controllable degree of error, it can help even in cases where conventional tuning is not an option for whatever reason.

Approximation is not evil

One of the top requirements for database systems is reliability. Whether you run a bank or a large retail business, you don't want to lose a cent here or there or charge your customer twice for the Pink Fluffy Unicorn he just bought, just because the DBMS gave a wrong answer. Classic OLTP operations have to be always 100% correct.

However, for the case of analytics, things become different. In some cases, it can be desirable to trade a bit of accuracy for a lot of speed. This is called approximation and to many database people (and users), the concept of accepting results with less than 100% accuracy seems strange at first.

But if you know - and can control - the error introduced by approximation, it is not. It can even be very useful, if a 95% accurate answer now is worth more than a 100% accurate answer tomorrow.

Welcome to approximation in PostgreSQL 9.5.

Approximating queries

Approximate queries work on subsets, called samples, of the whole data set, called the population.
If the sampling is done in a statistically sound way, a sample much smaller than the whole population gives answers close to the real answer, within a known error range.

A possible application for the hypothetical retail business would be to find which product is currently trending.
Instead of knowing that exactly 500, 1000 and 2000 Pink Fluffy Unicorns were sold in the last three weeks, knowing that 498, 1001 and 1999 Pink Fluffy Unicorns were sold in the last three weeks with, let's say, a 5% error tells the procurement people that Pink Fluffy Unicorns are a trending product just as well as the exact numbers do. Only, they might have to wait a few seconds for the answer instead of a few hours...

PostgreSQL 9.5 has built-in support for approximate queries. Because I'm lazy and have already written about this, I'll just point to the corresponding post.
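For a self-contained taste of 9.5's built-in sampling, a rough sketch against the hmr test table created further below could look like this (SYSTEM sampling reads roughly 1% of the table's pages, so the count is scaled back up by 100; the error depends on the sampling method and percentage):

select count(*) * 100 as approx_rows
from hmr tablesample system (1);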
 
Still, all the population data has to be there for approximate queries to work. How about running queries without storing the underlying data at all?

Approximating data structures

If PostgreSQL has a weakness, it's the comparatively poor performance of count() and distinct.
Due to the lock-free multiversion concurrency design of PostgreSQL, count() has to touch each row in a table to check whether it is visible in the current transaction or not. Unlike locking DBMSs like Oracle, it can only use an index to count in a few cases. The result: a full table scan.

Distinct always has to sort the table. It can use an index, but only a covering one, and the larger the index is compared to the table, the less likely PostgreSQL is to use it. Sorting can be tuned by raising work_mem, but since this is a per-session parameter, it is limited by available RAM.

So count(distinct) is the worst of both worlds. (In the following example, distinct alone is slower because it has to return ten million rows to the client, while count(distinct) returns only one value.)
Like here (times are without index / with index):
 
create table hmr(id serial, value real);

insert into hmr (value) select random()*10000000 from generate_series(1,10000000);

select count (value) from hmr; --826 msec. / 817 msec.

select distinct value from hmr; --33917 msec. / 32420 msec.

select count (distinct value) from hmr; -- 9665 msec. / 9439 msec.

Enter the HyperLogLog cardinality estimator. Some clever people at Google observed that the cardinality of a multiset of evenly distributed random numbers can be predicted by finding the maximum number of leading zeroes in the binary representation of those numbers: for a maximum of k leading zeroes, the cardinality is roughly 2^k (so if the longest run of leading zeroes observed is 10, about 2^10 = 1024 distinct values have passed by).

HyperLogLog uses a hash function to transform arbitrary input values into such random numbers, and thus makes it possible to estimate the cardinality of an input multiset for cardinalities > 10^9 with a 2-3% error, using only 1280 bytes of storage.

PostgreSQL has a HyperLogLog extension, hll.
 
create extension hll;

CREATE TABLE cardinality (
            id      integer,
            set     hll
    );

INSERT INTO cardinality(id, set)
    SELECT 1, (select hll_add_agg(hll_hash_any(value))
    FROM hmr); -- 2267 msec.

SELECT hll_cardinality(set)::int FROM cardinality WHERE id = 1; -- 11 msec.
 
Since count(distinct value) = 8187749 and hll_cardinality = 8470057, the error is ~3%.

Another, not so PostgreSQL-specific example would be a database that has a stream table, e.g. holding only one hour's worth of events at any given point in time. I showed how to do this with stock PostgreSQL and a bit of Java here and here.

If you also want to know how many distinct events that stream has seen in total, it's impossible, unless you store all distinct values and update their counts every time a new event arrives. But then, you might end up storing all events - which is not what you wanted in the first place if you chose to use a stream table.

With HyperLogLog it's easy. Update your HyperLogLog estimator on every new event and you get a good approximation of how many distinct values the stream has seen in total.
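A minimal sketch, reusing the cardinality table from above (the event value is a stand-in for whatever the stream actually delivers):

-- fold each new event into the existing estimator...
update cardinality
   set set = hll_add(set, hll_hash_any('new_event_value'::text))
 where id = 1;

-- ...and ask for the running distinct count at any time
select hll_cardinality(set)::int from cardinality where id = 1;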

Approximating indexes

9.5 introduced BRIN indexes for very large tables. Unlike e.g. a btree, BRIN stores only ranges of values and points to the physical pages where a value that falls into that range could possibly be found.

A BRIN index thus only gives a precise answer to the question of where a certain value could not be found on disk.
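Creating one is a one-liner (the table and column names here are hypothetical); the typical candidate is a huge table with a naturally ordered column such as a timestamp:

create index big_events_created_brin on big_events using brin (created_at);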

9.6 will have Bloom filter indexes as an extension. Bloom filters can tell you with perfect accuracy that a value does not exist in a set. But the question of whether a value exists in the set can only be answered with a probability that increases with the collision resilience of the underlying hash.
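In 9.6 the extension-based syntax looks roughly like this (the table and columns are again hypothetical); one compact signature then covers equality searches on any combination of the indexed columns:

create extension bloom;

create index items_bloom_idx on items using bloom (category_id, vendor_id, color_id);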

So, since BRIN and Bloom indexes are both approximating indexes, every index hit has to be rechecked by the DBMS against the underlying data. But if you know their limitations and use them wisely, they too can speed up your queries significantly.

Michael Paquier: Postgres 9.6 feature highlight, pg_visibility

From Planet PostgreSQL. Published on Jun 22, 2016.

PostgreSQL 9.6 is shipping with a new contrib module for manipulating and getting insight into the visibility map of a relation:

Add pg_visibility contrib module.

This lets you examine the visibility map as well as page-level
visibility information.  I initially wrote it as a debugging aid,
but was encouraged to polish it for commit.

Patch by me, reviewed by Masahiko Sawada.

Discussion: 56D77803.6080503@BlueTreble.com

The visibility map, associated with a relation in its own file named with the suffix _vm, tracks information related to the visibility of tuples on relation pages. Up to 9.5, 1 bit was used per heap page, meaning that if this bit is set, all the tuples stored on this page are visible to all transactions. In 9.6, 2 bits are used: the extra bit tracks whether all tuples on a given page have been frozen, which critically improves VACUUM performance by allowing anti-wraparound vacuums to skip such pages instead of scanning the full table.

pg_visibility contains a couple of functions allowing one to look at the status of each page's bits. The first one, aimed at general purposes, gives insight into the all-frozen and all-visible status for each page of a relation, plus the status of PD_ALL_VISIBLE, which is the same information as the all-visible flag except that it is stored in the heap page itself and not in the VM file:

=# CREATE TABLE tab_visible AS SELECT generate_series(1,1000) AS id;
SELECT 1000
=# SELECT * FROM pg_visibility('tab_visible'::regclass);
 blkno | all_visible | all_frozen | pd_all_visible
-------+-------------+------------+----------------
     0 | f           | f          | f
     1 | f           | f          | f
     2 | f           | f          | f
     3 | f           | f          | f
     4 | f           | f          | f
(5 rows)

This function can take an optional argument in the form of a block number. pg_visibility_map is similar to the previous function, except that it does not scan the all-visible flag value on the page itself; it just fetches what is available in the visibility map.
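A quick sketch of the block-level form against the same freshly created table as above (the single-block variant returns just the two visibility map bits for that block):

=# SELECT * FROM pg_visibility_map('tab_visible'::regclass, 0);
 all_visible | all_frozen
-------------+------------
 f           | f
(1 row)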

Then come the sanity checkers: pg_check_visible and pg_check_frozen, which return lists of TIDs referring to tuples that are respectively not all-visible or not all-frozen even though the page they are on is marked as such. If those functions return an empty set, the database is not corrupted. If there are entries... oops.

=# SELECT pg_check_visible('tab_visible'::regclass);
 pg_check_visible
------------------
(0 rows)
=# SELECT pg_check_frozen('tab_visible'::regclass);
 pg_check_frozen
-----------------
(0 rows)

And finally there is a function that may become useful for maintenance purposes: pg_truncate_visibility_map, which removes the visibility map of a relation. The next VACUUM that runs on this relation will forcibly rebuild it. Note that this action is WAL-logged.

=# SELECT pg_truncate_visibility_map('tab_visible'::regclass);
 pg_truncate_visibility_map
----------------------------
 
(1 row)

Introducing smart cropping, intelligent quality selection and automated responsive images

By Cloudinary Blog - Django from Django community aggregator: Community blog posts. Published on Jun 22, 2016.

Auto everything - Images solved

Every image is unique and every one of your website visitors is different. In a perfect world, we'd like to adapt each image individually to be "just right" for every user, with "just right" being perfectly cropped, using responsive dimensions, correct encoding settings and optimal quality with the best image format.

Here’s a quick example using a photo of a (...drum roll...) cat:

Original cat photo

Now, we need to adapt the image to fit the graphic design on every device and browser. The image should look great in any resolution and website layout and it should be delivered optimized to improve website performance.

Ruby:
cl_image_tag("white_cat.jpg", :client_hints=>true, :transformation=>[
  {:aspect_ratio=>"4:6", :gravity=>"auto", :crop=>"fill"},
  {:width=>"auto", :quality=>"auto", :fetch_format=>:auto}
  ])
PHP:
cl_image_tag("white_cat.jpg", array("client_hints"=>true, "transformation"=>array(
  array("aspect_ratio"=>"4:6", "gravity"=>"auto", "crop"=>"fill"),
  array("width"=>"auto", "quality"=>"auto", "fetch_format"=>"auto")
  )))
Python:
CloudinaryImage("white_cat.jpg").image(client_hints=True, transformation=[
  {"aspect_ratio": "4:6", "gravity": "auto", "crop": "fill"},
  {"width": "auto", "quality": "auto", "fetch_format": "auto"}
  ])
Node.js:
cloudinary.image("white_cat.jpg", {client_hints: true, transformation: [
  {aspect_ratio: "4:6", gravity: "auto", crop: "fill"},
  {width: "auto", quality: "auto", fetch_format: "auto"}
  ]})
Java:
cloudinary.url().transformation(new Transformation()
  .aspectRatio("4:6").gravity("auto").crop("fill").chain()
  .width("auto").quality("auto").fetchFormat("auto")).clientHints(true).imageTag("white_cat.jpg")
jQuery:
$.cloudinary.image("white_cat.jpg", {client_hints: true, transformation: [
  {aspect_ratio: "4:6", gravity: "auto", crop: "fill"},
  {width: "auto", quality: "auto", fetch_format: "auto"}
  ]})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .AspectRatio("4:6").Gravity("auto").Crop("fill").Chain()
  .Width("auto").Quality("auto").FetchFormat("auto")).ClientHints(true).BuildImageTag("white_cat.jpg")

Standard image crop and encoding Automatic image crop and encoding

The image on the left is a standard crop and is delivered as a standard JPEG of the original 2000px image height, resulting in a large 537KB file. However, the image on the right is perfectly cropped, scaled down to 300px wide and delivered as an optimized 17KB WebP image on Chrome (saving 97% of bandwidth).

Optimal delivery on an image-by-image and user-by-user basis is quite challenging. It requires considerable know-how about the intricacies of file formats, vision algorithms and browser support. This challenge gets even more complicated when you want to scale up to hundreds of thousands, if not hundreds of millions, of images that may be uploaded to your websites and mobile apps.

So how is this achieved, and how can you automatically apply it to your own website and mobile apps?

Images solved

When we launched the Cloudinary service more than 4 years ago, we wanted to eliminate the R&D time involved in managing the images of web and mobile applications. The service was written by developers for developers, and was designed to be a complete solution for all their image needs. Today we are excited to introduce a major milestone in making our vision a reality by reaching the state of Images solved.

Developers want to create the perfect image crop focusing on the most important region. Developers need to determine the optimal file format and encoder settings for every image, to perfectly represent the content with the minimum number of bytes, for a fast and fluid user experience. Developers need to create multiple variants of every image for responsive delivery on various devices and in different resolutions.

Every image is unique in its content, and every visitor to your website has a different device, resolution, browser and network speed. It is challenging to deliver an optimal image to every user, especially when you need to scale that up to hundreds, thousands or millions of images.

The "Images Solved" solution introduced here, takes Cloudinary's cloud-based image management solution to the next level using automatic content-aware and context-aware image adaptation.

The concept is to let users upload one high resolution copy of any image and then automatically adapt the image, in real-time, to focus on the most important region of the image, select the optimal quality and encoding settings and responsively deliver the image to any device in any resolution or pixel density. We wanted to simplify the life of a developer further, by automating the following critical image manipulation and delivery components:

  • Automatic content-aware cropping - g_auto
  • Intelligent content-aware encoding - q_auto
  • Dynamic format selection - f_auto
  • Automatic responsive images - w_auto and dpr_auto

Automatic content-aware cropping - 'g_auto'

For any website design, and specifically any responsive design, simply scaling an image up or down isn't always enough -  images often require cropping to fit diverse device layouts, resolutions and aspect ratios while maximizing the visibility of important areas within the image.

Manually cropping individual images to fit them on various resolutions is resource intensive and quite impossible if your web application includes massive user generated content and a dynamic responsive layout.

Cropping all images to focus on the center of the image, by default, can result in a loss of important content.

Cloudinary already supports face detection to perfectly crop a photo, but what if you want to focus on more than just the face or if the image subject is a product, a food item or a pet?

The Cloudinary content-aware cropping algorithm uses a combination of heuristics to automatically detect the region of interest in every image and then crop them on the fly (using dynamic URLs) to fit the graphic design and layout of your site when viewed on any device, and without losing focus on important content.

For example, the following photo was uploaded to the cloud:

Original upload photo

Now we need to adapt this image to fit in three different pages on a website: in landscape (e.g., 800x200), in portrait (e.g., 300x600) and as a square (e.g., 300x300). As you can see below, standard center cropping brings problematic results, as most of the cat is left out.

Regular portrait image cropping
Standard image cropping
Regular square image crop

Setting the gravity image cropping parameter to auto or using g_auto in the following on-the-fly image manipulation URL, generates the optimal cropping result, ensuring that the cat is always included.

Ruby:
cl_image_tag("sofa_cat.jpg", :width=>300, :height=>600, :gravity=>"auto", :crop=>"fill")
PHP:
cl_image_tag("sofa_cat.jpg", array("width"=>300, "height"=>600, "gravity"=>"auto", "crop"=>"fill"))
Python:
CloudinaryImage("sofa_cat.jpg").image(width=300, height=600, gravity="auto", crop="fill")
Node.js:
cloudinary.image("sofa_cat.jpg", {width: 300, height: 600, gravity: "auto", crop: "fill"})
Java:
cloudinary.url().transformation(new Transformation().width(300).height(600).gravity("auto").crop("fill")).imageTag("sofa_cat.jpg")
jQuery:
$.cloudinary.image("sofa_cat.jpg", {width: 300, height: 600, gravity: "auto", crop: "fill"})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Width(300).Height(600).Gravity("auto").Crop("fill")).BuildImageTag("sofa_cat.jpg")

Automatic portrait image cropping
Intelligent image cropping
Automatic square image crop

The images above used the popular fill crop mode, which tries to keep as much of the original image as possible while cropping. However, sometimes websites prefer to display a more zoomed-in thumbnail emphasizing the person or object in the original image. For example, take a look at the following uploaded images:

Cat photo Car photo Red bike photo

For example, let's create a 300x270 thumbnail of each of the original images. The following images are dynamically created in the cloud using automatic face detection by setting gravity to face (g_face).

Ruby:
cl_image_tag("sofa_cat.jpg", :width=>300, :height=>270, :gravity=>"face", :crop=>"thumb")
PHP:
cl_image_tag("sofa_cat.jpg", array("width"=>300, "height"=>270, "gravity"=>"face", "crop"=>"thumb"))
Python:
CloudinaryImage("sofa_cat.jpg").image(width=300, height=270, gravity="face", crop="thumb")
Node.js:
cloudinary.image("sofa_cat.jpg", {width: 300, height: 270, gravity: "face", crop: "thumb"})
Java:
cloudinary.url().transformation(new Transformation().width(300).height(270).gravity("face").crop("thumb")).imageTag("sofa_cat.jpg")
jQuery:
$.cloudinary.image("sofa_cat.jpg", {width: 300, height: 270, gravity: "face", crop: "thumb"})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Width(300).Height(270).Gravity("face").Crop("thumb")).BuildImageTag("sofa_cat.jpg")

Face detection based cropped thumbnail Face detection based cropped thumbnail Face detection based cropped thumbnail

The results above are nice, but they are not optimal thumbnails for the original photos. We see more of the blurry sofa instead of the nice cat, it's hard to see that there's a dog next to the lady, and no one will know that a red bike was in the original image.

Cloudinary's automatic cropping algorithm in the thumbnail generation mode focuses on the area of interest - not only on facial close ups - intelligently analyzing the image content. The thumbnails below are created by setting the crop mode to thumb and the gravity to auto (c_thumb,g_auto). These thumbnails effectively reflect the story behind each photo.

Ruby:
cl_image_tag("sofa_cat.jpg", :width=>300, :height=>270, :gravity=>"auto", :crop=>"thumb")
PHP:
cl_image_tag("sofa_cat.jpg", array("width"=>300, "height"=>270, "gravity"=>"auto", "crop"=>"thumb"))
Python:
CloudinaryImage("sofa_cat.jpg").image(width=300, height=270, gravity="auto", crop="thumb")
Node.js:
cloudinary.image("sofa_cat.jpg", {width: 300, height: 270, gravity: "auto", crop: "thumb"})
Java:
cloudinary.url().transformation(new Transformation().width(300).height(270).gravity("auto").crop("thumb")).imageTag("sofa_cat.jpg")
jQuery:
$.cloudinary.image("sofa_cat.jpg", {width: 300, height: 270, gravity: "auto", crop: "thumb"})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Width(300).Height(270).Gravity("auto").Crop("thumb")).BuildImageTag("sofa_cat.jpg")

Automatically cropped thumbnail Smartly cropped thumbnail Intelligently cropped thumbnail

Try out the automatic cropping interactive demo page and see the 'Automatic cropping' documentation for more details.

Intelligent content-aware encoding - 'q_auto'

Developers and designers tend to refrain from experimenting with image quality adjustments for fear of degrading the visual quality. However, adjusting the quality compression level doesn’t always lead to a loss of visual quality. In fact, precise adjustment of the compression level complemented by fine tuning of the encoding settings can significantly reduce the file size without any degradation noticeable to the human eye. So, what’s the standard quality setting that works for all images? Unfortunately, there is no single setting that is optimal for all images because it depends on a combination of the compression algorithm, the image format and the actual image content.

Cloudinary automates the file size versus quality trade-off decision, on the fly, by using perceptual metrics and heuristics that tune the encoding settings and select the appropriate image format based on the specific image content and format.

The intelligent encoding algorithm analyzes every image to find the best quality compression level and the optimal encoding settings based on the image content and the viewing browser, in order to produce a perceptually fine image while minimizing the file size.

For example, the following 500px wide image was encoded as a JPEG with a quality of 90 (without chroma subsampling). The image looks quite good and weighs 58.3KB.

Ruby:
cl_image_tag("cld_popsicles.jpg", :width=>500, :quality=>90)
PHP:
cl_image_tag("cld_popsicles.jpg", array("width"=>500, "quality"=>90))
Python:
CloudinaryImage("cld_popsicles.jpg").image(width=500, quality=90)
Node.js:
cloudinary.image("cld_popsicles.jpg", {width: 500, quality: 90})
Java:
cloudinary.url().transformation(new Transformation().width(500).quality(90)).imageTag("cld_popsicles.jpg")
jQuery:
$.cloudinary.image("cld_popsicles.jpg", {width: 500, quality: 90})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Width(500).Quality(90)).BuildImageTag("cld_popsicles.jpg")
Encoded JPEG of quality 90

Developers who want to optimize performance and save file size might select a more aggressive compression quality level. However, as you can see below, the resulting image might have disturbing artifacts, even though the file is smaller, weighing only 24KB.

Ruby:
cl_image_tag("cld_popsicles.jpg", :width=>500, :quality=>50)
PHP:
cl_image_tag("cld_popsicles.jpg", array("width"=>500, "quality"=>50))
Python:
CloudinaryImage("cld_popsicles.jpg").image(width=500, quality=50)
Node.js:
cloudinary.image("cld_popsicles.jpg", {width: 500, quality: 50})
Java:
cloudinary.url().transformation(new Transformation().width(500).quality(50)).imageTag("cld_popsicles.jpg")
jQuery:
$.cloudinary.image("cld_popsicles.jpg", {width: 500, quality: 50})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Width(500).Quality(50)).BuildImageTag("cld_popsicles.jpg")
JPEG of low 50 quality

Now, setting the quality manipulation parameter to auto (q_auto) tells Cloudinary to analyze the image and find the optimal image encoding settings and compression quality level. The resulting image looks quite good and weighs only 39.6KB, which means a saving of 32% over a standard 90 quality encoded image.

Ruby:
cl_image_tag("cld_popsicles.jpg", :width=>500, :quality=>"auto")
PHP:
cl_image_tag("cld_popsicles.jpg", array("width"=>500, "quality"=>"auto"))
Python:
CloudinaryImage("cld_popsicles.jpg").image(width=500, quality="auto")
Node.js:
cloudinary.image("cld_popsicles.jpg", {width: 500, quality: "auto"})
Java:
cloudinary.url().transformation(new Transformation().width(500).quality("auto")).imageTag("cld_popsicles.jpg")
jQuery:
$.cloudinary.image("cld_popsicles.jpg", {width: 500, quality: "auto"})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Width(500).Quality("auto")).BuildImageTag("cld_popsicles.jpg")
Automatic JPEG quality

The significance of the performance optimization depends on the content of every individual image. For example, the following image weighs 63.7KB using regular JPEG encoding with a quality of 90, and only 25.1KB (a 61% saving) with quality set to auto, while still looking great.

Automatically cropped thumbnail Intelligently cropped thumbnail

Furthermore, when a user enables data saving mode, the browser sends the Save-Data: on Client Hint request header, and Cloudinary automatically switches to a more aggressive automatic quality level and encoding settings selection.

Try out the automatic quality interactive demo page and see the Automatic quality and encoding settings documentation for more details.

Dynamic format selection - 'f_auto'

Developers are expected to select optimal image formats for different scenarios based on the image content and the viewing device/browser. For example, JPEG should be used for a captured photograph or for faster loading while PNG should be used for illustrations or drawings or when using a transparent background. Additional logic should also be considered for modern formats such as WebP and JPEG-XR, if the viewing browser is Chrome or Internet-Explorer/Edge.

Image formats can have a significant impact on the page load time and bandwidth - for example using WebP over JPEG can result in a 30% file size reduction, which can lead to faster page loads resulting in improved site engagement and conversion rates.

The browser and format compatibility requirements seem simple, but manually adopting the format logic for many images can be complex and inefficient.

To optimize image delivery, Cloudinary can dynamically select the most efficient format, based on the content and viewing browser, and convert the image to that format on the fly. For example, it automatically delivers images as WebP to Chrome or JPEG-XR to Internet Explorer or Edge. No more manual conversion of images to the most efficient formats.

Transparent WebP delivery and JPEG-XR delivery were introduced a couple of years ago. Now, you can combine automatic format selection with automatic image encoding and quality selection. For example, the same image as above with q_auto,f_auto will be encoded as WebP for Chrome users, consuming only 16.2KB and still look great. Users of Internet-Explorer/Edge will get a JPEG-XR and users of other browsers will get a JPEG.

Ruby:
cl_image_tag("sofa_cat.jpg", :width=>750, :quality=>"auto", :fetch_format=>:auto)
PHP:
cl_image_tag("sofa_cat.jpg", array("width"=>750, "quality"=>"auto", "fetch_format"=>"auto"))
Python:
CloudinaryImage("sofa_cat.jpg").image(width=750, quality="auto", fetch_format="auto")
Node.js:
cloudinary.image("sofa_cat.jpg", {width: 750, quality: "auto", fetch_format: "auto"})
Java:
cloudinary.url().transformation(new Transformation().width(750).quality("auto").fetchFormat("auto")).imageTag("sofa_cat.jpg")
jQuery:
$.cloudinary.image("sofa_cat.jpg", {width: 750, quality: "auto", fetch_format: "auto"})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Width(750).Quality("auto").FetchFormat("auto")).BuildImageTag("sofa_cat.jpg")
Automatic format and quality selection

The combination of q_auto and f_auto is even more powerful together than either of them separately. The algorithm might detect that the PNG format (or the lossless WebP format) is a better fit for specific images that contain content such as drawings. For some images, even the PNG8 format can be automatically selected for providing great looking results with a very efficient file size.

For example, the following URL dynamically generates a 500px wide version of a drawing using automatic image encoding and quality selection (q_auto without f_auto).

Ruby:
cl_image_tag("flowers_and_birds.jpg", :width=>500, :quality=>"auto")
PHP:
cl_image_tag("flowers_and_birds.jpg", array("width"=>500, "quality"=>"auto"))
Python:
CloudinaryImage("flowers_and_birds.jpg").image(width=500, quality="auto")
Node.js:
cloudinary.image("flowers_and_birds.jpg", {width: 500, quality: "auto"})
Java:
cloudinary.url().transformation(new Transformation().width(500).quality("auto")).imageTag("flowers_and_birds.jpg")
jQuery:
$.cloudinary.image("flowers_and_birds.jpg", {width: 500, quality: "auto"})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Width(500).Quality("auto")).BuildImageTag("flowers_and_birds.jpg")
Automatic quality selection of a cartoon

The result is a JPEG image that weighs 41KB. If you look carefully, you will see that the lossy nature of JPEG resulted in some unpleasant artifacts. Now, let's tell Cloudinary to combine both q_auto and f_auto. The result is shown below. The algorithm decided to encode the image using the PNG8 format. The image looks better, has no artifacts (the only artifacts are from the original high-quality JPEG image), and weighs even less - just 34.8KB.

Ruby:
cl_image_tag("flowers_and_birds.jpg", :width=>500, :quality=>"auto", :fetch_format=>:auto)
PHP:
cl_image_tag("flowers_and_birds.jpg", array("width"=>500, "quality"=>"auto", "fetch_format"=>"auto"))
Python:
CloudinaryImage("flowers_and_birds.jpg").image(width=500, quality="auto", fetch_format="auto")
Node.js:
cloudinary.image("flowers_and_birds.jpg", {width: 500, quality: "auto", fetch_format: "auto"})
Java:
cloudinary.url().transformation(new Transformation().width(500).quality("auto").fetchFormat("auto")).imageTag("flowers_and_birds.jpg")
jQuery:
$.cloudinary.image("flowers_and_birds.jpg", {width: 500, quality: "auto", fetch_format: "auto"})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation().Width(500).Quality("auto").FetchFormat("auto")).BuildImageTag("flowers_and_birds.jpg")
Automatic and format selection of a cartoon

See Automatic format selection documentation page for more details.

Automatic responsive images - 'w_auto' and 'dpr_auto'

To implement responsive design, developers need to create multiple versions of every image to make it look perfect on any device, in any resolution, pixel density or orientation.

Generating, managing, marking-up, and delivering these numerous image versions can be a daunting task. Cloudinary simplifies dynamic image delivery for responsive websites on Retina and regular displays by automating the image width and DPR value decision based on the viewing device, display size and layout.

Using Google's Client Hints, Cloudinary determines the required width of an image based on the browser's viewport-width or the layout width, and then calculates the optimal resolution for displaying the image on that device. Furthermore, Cloudinary can dynamically select image-specific breakpoints by determining the required number of versions of every image in order to balance the optimal dimensions vs. bandwidth reduction trade-off.

This means that developers simply need to have a single high resolution version of every image. Cloudinary automatically adapts the image to fit the viewport, layout and resolution on any device using a single dynamic URL - ensuring a visually seamless user experience while improving performance.

Generating images of different width and DPR values via the same dynamic URL is done by setting the width and dpr manipulation parameters to auto (dpr_auto,w_auto). The Client Hints request headers are automatically processed to retrieve the best fitting image dimensions.

For example, the following URL first crops an image to a 16:9 aspect ratio and requests automatic DPR and width delivery. The device's DPR is rounded up to an integer value (1.0, 2.0, 3.0, etc.) and the actual required width value is rounded up to the closest multiple of 100 pixels (by default). The example below also limits the delivered image to 2000px wide in case the Client Hints are not available.

Ruby:
cl_image_tag("car_lady_dog.jpg", :client_hints=>true, :transformation=>[
  {:gravity=>"auto", :aspect_ratio=>"16:9", :crop=>"fill"},
  {:dpr=>"auto", :width=>"auto"},
  {:width=>2000, :crop=>"limit"}
  ])
PHP:
cl_image_tag("car_lady_dog.jpg", array("client_hints"=>true, "transformation"=>array(
  array("gravity"=>"auto", "aspect_ratio"=>"16:9", "crop"=>"fill"),
  array("dpr"=>"auto", "width"=>"auto"),
  array("width"=>2000, "crop"=>"limit")
  )))
Python:
CloudinaryImage("car_lady_dog.jpg").image(client_hints=True, transformation=[
  {"gravity": "auto", "aspect_ratio": "16:9", "crop": "fill"},
  {"dpr": "auto", "width": "auto"},
  {"width": 2000, "crop": "limit"}
  ])
Node.js:
cloudinary.image("car_lady_dog.jpg", {client_hints: true, transformation: [
  {gravity: "auto", aspect_ratio: "16:9", crop: "fill"},
  {dpr: "auto", width: "auto"},
  {width: 2000, crop: "limit"}
  ]})
Java:
cloudinary.url().transformation(new Transformation()
  .gravity("auto").aspectRatio("16:9").crop("fill").chain()
  .dpr("auto").width("auto").chain()
  .width(2000).crop("limit")).clientHints(true).imageTag("car_lady_dog.jpg")
jQuery:
$.cloudinary.image("car_lady_dog.jpg", {client_hints: true, transformation: [
  {gravity: "auto", aspect_ratio: "16:9", crop: "fill"},
  {dpr: "auto", width: "auto"},
  {width: 2000, crop: "limit"}
  ]})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .Gravity("auto").AspectRatio("16:9").Crop("fill").Chain()
  .Dpr("auto").Width("auto").Chain()
  .Width(2000).Crop("limit")).ClientHints(true).BuildImageTag("car_lady_dog.jpg")

The following URL simulates the Client Hints header, specifying a fallback width of 387 pixels, which is used in the case that the Client Hint header is not available. In this case, the generated image is rounded up to 400 pixels wide, which weighs only 29.9KB compared to the 414KB of the maximum 2000px wide image.

Ruby:
cl_image_tag("car_lady_dog.jpg", :client_hints=>true, :transformation=>[
  {:gravity=>"auto", :aspect_ratio=>"16:9", :crop=>"fill"},
  {:width=>"auto:100:387"}
  ])
PHP:
cl_image_tag("car_lady_dog.jpg", array("client_hints"=>true, "transformation"=>array(
  array("gravity"=>"auto", "aspect_ratio"=>"16:9", "crop"=>"fill"),
  array("width"=>"auto:100:387")
  )))
Python:
CloudinaryImage("car_lady_dog.jpg").image(client_hints=True, transformation=[
  {"gravity": "auto", "aspect_ratio": "16:9", "crop": "fill"},
  {"width": "auto:100:387"}
  ])
Node.js:
cloudinary.image("car_lady_dog.jpg", {client_hints: true, transformation: [
  {gravity: "auto", aspect_ratio: "16:9", crop: "fill"},
  {width: "auto:100:387"}
  ]})
Java:
cloudinary.url().transformation(new Transformation()
  .gravity("auto").aspectRatio("16:9").crop("fill").chain()
  .width("auto:100:387")).clientHints(true).imageTag("car_lady_dog.jpg")
jQuery:
$.cloudinary.image("car_lady_dog.jpg", {client_hints: true, transformation: [
  {gravity: "auto", aspect_ratio: "16:9", crop: "fill"},
  {width: "auto:100:387"}
  ]})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .Gravity("auto").AspectRatio("16:9").Crop("fill").Chain()
  .Width("auto:100:387")).ClientHints(true).BuildImageTag("car_lady_dog.jpg")
Automatic responsive image

One image for all screen resolutions and different devices is not enough. An image for every pixel change in width is too much, while an image for every 100 pixels difference might be an arbitrary threshold - so how can someone automatically choose the optimal responsive image sizes?  

Fortunately, Cloudinary's new automatic responsive images solution now supports dynamic generation of optimal breakpoints for each individual image on-the-fly. Here's an example of a URL that calculates and returns the optimal breakpoints as shown below.

Ruby:
cl_image_tag("car_lady_dog.jpg", :client_hints=>true, :transformation=>[
  {:gravity=>"auto", :aspect_ratio=>"16:9", :crop=>"fill"},
  {:width=>"auto:breakpoints:json"}
  ])
PHP:
cl_image_tag("car_lady_dog.jpg", array("client_hints"=>true, "transformation"=>array(
  array("gravity"=>"auto", "aspect_ratio"=>"16:9", "crop"=>"fill"),
  array("width"=>"auto:breakpoints:json")
  )))
Python:
CloudinaryImage("car_lady_dog.jpg").image(client_hints=True, transformation=[
  {"gravity": "auto", "aspect_ratio": "16:9", "crop": "fill"},
  {"width": "auto:breakpoints:json"}
  ])
Node.js:
cloudinary.image("car_lady_dog.jpg", {client_hints: true, transformation: [
  {gravity: "auto", aspect_ratio: "16:9", crop: "fill"},
  {width: "auto:breakpoints:json"}
  ]})
Java:
cloudinary.url().transformation(new Transformation()
  .gravity("auto").aspectRatio("16:9").crop("fill").chain()
  .width("auto:breakpoints:json")).clientHints(true).imageTag("car_lady_dog.jpg")
jQuery:
$.cloudinary.image("car_lady_dog.jpg", {client_hints: true, transformation: [
  {gravity: "auto", aspect_ratio: "16:9", crop: "fill"},
  {width: "auto:breakpoints:json"}
  ]})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .Gravity("auto").AspectRatio("16:9").Crop("fill").Chain()
  .Width("auto:breakpoints:json")).ClientHints(true).BuildImageTag("car_lady_dog.jpg")

{"breakpoints":[50,335,507,648,793,919,1000]}

The breakpoints are seamlessly used behind the scenes, so a single URL like the following one will return an image with a width equal to the breakpoint closest to the required display dimensions. For example, if the required width is 387px, an image that is 507 pixels wide will be dynamically generated in the cloud and delivered to users.

Ruby:
cl_image_tag("car_lady_dog.jpg", :client_hints=>true, :transformation=>[
  {:gravity=>"auto", :aspect_ratio=>"16:9", :crop=>"fill"},
  {:dpr=>"auto"},
  {:width=>"auto:breakpoints:387"}
  ])
PHP:
cl_image_tag("car_lady_dog.jpg", array("client_hints"=>true, "transformation"=>array(
  array("gravity"=>"auto", "aspect_ratio"=>"16:9", "crop"=>"fill"),
  array("dpr"=>"auto"),
  array("width"=>"auto:breakpoints:387")
  )))
Python:
CloudinaryImage("car_lady_dog.jpg").image(client_hints=True, transformation=[
  {"gravity": "auto", "aspect_ratio": "16:9", "crop": "fill"},
  {"dpr": "auto"},
  {"width": "auto:breakpoints:387"}
  ])
Node.js:
cloudinary.image("car_lady_dog.jpg", {client_hints: true, transformation: [
  {gravity: "auto", aspect_ratio: "16:9", crop: "fill"},
  {dpr: "auto"},
  {width: "auto:breakpoints:387"}
  ]})
Java:
cloudinary.url().transformation(new Transformation()
  .gravity("auto").aspectRatio("16:9").crop("fill").chain()
  .dpr("auto").chain()
  .width("auto:breakpoints:387")).clientHints(true).imageTag("car_lady_dog.jpg")
jQuery:
$.cloudinary.image("car_lady_dog.jpg", {client_hints: true, transformation: [
  {gravity: "auto", aspect_ratio: "16:9", crop: "fill"},
  {dpr: "auto"},
  {width: "auto:breakpoints:387"}
  ]})
.Net:
cloudinary.Api.UrlImgUp.Transform(new Transformation()
  .Gravity("auto").AspectRatio("16:9").Crop("fill").Chain()
  .Dpr("auto").Chain()
  .Width("auto:breakpoints:387")).ClientHints(true).BuildImageTag("car_lady_dog.jpg")
Automatic width with responsive breakpoints

Client Hints are currently only supported by the Chrome, Opera and Android browsers, and they require adding a meta header to your HTML page for enabling their usage. For other browsers, you may want to specify a fallback width value to resize the image.
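The opt-in is typically a single meta tag in the page's <head>; a minimal sketch (check the Client Hints documentation for the exact set of hints your layout needs):

<meta http-equiv="Accept-CH" content="DPR, Viewport-Width, Width">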

See a live sample page of Automatic Responsive Images With Client Hints and the Automatic responsive images documentation page for more details.

Going forward

We believe that the four image automation capabilities that were introduced here are extremely useful for any web developer. From my personal experience, once you start using the new features, you will probably find it hard to recall what you did beforehand.

More than 120,000 developers have signed up to Cloudinary and over 3,000 of them are active paying customers. Our vision is to answer all their image (and video) related needs, and we believe that the new automatic algorithms are a big step towards realizing this vision.

Automatic algorithms, as intelligent as they can be, are not always perfect and the quality of the result might be subjective. Therefore, Cloudinary supports a manual override via the API or the UI: override automatic cropping by specifying custom coordinates for the image, and override the automatically selected quality level by specifying a custom level. The manual overriding actions are also used as feedback for our algorithms, allowing us to further enhance and fine tune the algorithms until as many images as possible are solved.

All image manipulation and delivery features introduced here are now available with no extra charge for all Cloudinary's plans, including the free plan.

In our upcoming posts, we’ll share more in-depth details about each of our new automation capabilities. Stay tuned, and as always, we would appreciate your feedback. Try it out?

Note: For customers with a custom domain name, the features that involve Client-Hints processing (w_auto and q_auto with Save-Data support) need to be set up. In addition, responsive breakpoints support (w_auto:breakpoints) with certain CDN providers might involve certain additional costs. Contact us for more details.

Ian Barwick: Letter from Japan

From Planet PostgreSQL. Published on Jun 22, 2016.

Greetings, honourable reader.

My name is Barwick of the 2ndQuadrant Company. With the onset of the Plum Rain, in my humble estimation it has become hot and humid recently. I trust all is well in your exalted undertakings?

No, I haven’t flipped a bit – the above is actually pretty much how a Japanese business letter – or email – starts off, and there’s a plethora of sites such as this one providing examples of various kinds of formal email. But fear not, this won’t be a lesson in written Japanese business etiquette, but the first in an occasional series of posts from Japan, a country which embraced PostgreSQL very early on and has contributed much to its development.

A potted PostgreSQL history, Japanese style

Back in 2006, on a visit to Japan before I started living here, I happened to wander into a bookstore and was looking for something on handling Japanese full-text search (a tricky subject in a language which has three or four different writing systems and multiple ways of writing the same word but no spaces), when something odd struck me – there were clearly more PostgreSQL books than MySQL ones. This was at a time when PostgreSQL was globally still in a bit of a niche and an equivalent bookshop in Europe would feature a couple of often dated volumes cowering next to masses of dolphin-orientated literature.

Occasionally it’s postulated that PostgreSQL’s popularity in Japan is because it’s easier to pronounce than MySQL – this is a country where “Los Angeles” is abbreviated to “Rosu“, so it’s no surprise that the even more cumbersome rendering of PostgreSQL (“posutogure-esu-kyuu-eru“) is shortened to “posugure” (look it up on Google if you don’t believe me) – whereas the equivalent “mai” is short and ambiguous to use in normal conversation.

However the reality is more prosaic – during the mid/late 1990s, as the internet was taking off and open source databases were becoming viable, PostgreSQL got a head start in the accessibility stakes. A Japanese-language mailing list was set up as far back as Postgres95 and the Japan PostgreSQL User Group (JPUG) was founded in 1999, from which a vibrant community has emerged, providing seminars, conferences and perhaps crucially, very up-to-date Japanese language translations of the PostgreSQL documentation. And the rest, as they say, is history – PostgreSQL is now supported by most of the big players in the IT industry here, some of whom are also household names globally, and of the 22 PostgreSQL committers, three are from Japan (listed in order of first commit: Tatsuo Ishii, Itagaki Takahiro and Fujii Masao).

In future posts I’ll be writing about PostgreSQL-related subjects in Japan and hope to present some new perspectives on aspects which aren’t widely known in the outside world.

Oh, and that “Plum Rain” I mentioned? It’s not an actual precipitation of fruit, but the literal translation of “tsuyu“, the hot sticky rainy period between spring and summer, often described as the fifth of Japan’s four seasons.

A high availability Django setup on the cheap - Roland van Laar

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on Jun 21, 2016.

(One of the talks at the 22 June 2016 Amsterdam Python meetup)

Roland built an educational website that needed to be highly available on a tight budget. He demoed the website: a site for the teacher on his own laptop and a separate page on the digiboard for the class. The teacher steers the digiboard from his laptop (or an iPad or phone).

As it is used in classrooms, it needs to be really, really available. As a teacher, you don't want to have to change your lesson plan at 8:30. The customer had three goals:

  • Not expensive.
  • Always up.
  • Reliable.

He had some technical goals of his own:

  • Buildable.
  • Functional.
  • Maintainable.

Always up? Django? You have the following challenges, apart from having a bunch of webservers.

  • Media files. Files uploaded on one server need to be visible on others.
  • Websockets.
  • Database.
  • Sessions.

The setup he chose:

  • Front end (HTML, JavaScript, images): Cloudflare as CDN (content delivery network). The front end is a single-page jQuery app. It chooses a random API host for Ajax requests.

    It changes API hosts when the API is not responding. But.... when is an API not responding? Some schools have really bad internet, so 10 seconds for a request might be "normal".

    Don't make a "ping pong" application that retries all the time. Try every server and then fail.

  • Some Django API servers. The actual django project was easy. Simple models, a bit of djangorestframework. As an extra he used some new postgres features.

  • Two SQL servers in BDR (bi-directional replication) mode. "Postgres async multi master". It is awesome! It just works! Even sessions are replicated faultlessly.

    Things to watch out for: create a separate replication user on both ends. Also watch out with sequences (auto-increment fields). For Django it was easy to get working by configuring the database with "USING BDR" when using such IDs; creating such objects takes a little bit longer. Alternatively you can use UUIDs (see the sketch after this list).

    Backups: oopsie. When Postgres goes down, you normally restart it and it rebuilds itself. But in a BDR setup, the sequences don't work right afterwards. The standard tools don't work, so he had to write a custom script.

    Another drawback: for updating your tables, you need a lock on all database nodes. This means you have downtime. No problem, he'll just do it early in the morning on a weekend.

  • He uses csync2 for syncing uploaded files between the hosts. Simply a cronjob on all servers. This is good enough as the updates only really happen in the summer; during the school year nothing changes.

  • Websockets. He uses Tornado plus javascript code for reconnecting websockets. Initial connection for the teacher to connect his laptop with the digiboard is via a short 6-digit number. Internally, a UUID is generated. The UUID is stored in local storage, so reloading the page or restarting a laptop Just Works.
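
Roland didn't show code for this, but a minimal Django sketch of the UUID alternative he mentioned might look like the following; the Lesson model and its fields are made up for illustration:

import uuid

from django.db import models


class Lesson(models.Model):
    # A client-generated UUID avoids relying on database sequences, which need
    # special care in a BDR multi-master setup.
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    title = models.CharField(max_length=200)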

The One Time We Were Down: they switched email providers because their original one became much more expensive. But the new provider wasn't as good, and suddenly calls took more than 10 seconds and clients started to fail. It wasn't that critical, as it happened after school time when only one teacher wanted to reset his password. So it was easy to fix.

Python for OPS and platform teams - Pavel Chunyayev

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on Jun 21, 2016.

(One of the talks at the 22 June 2016 Amsterdam Python meetup)

The sub-title of his talk is supporting development teams in their journey to Continuous Delivery. Pavel's job description is continuous delivery architect. He loves biking so he just had to move to Amsterdam :-)

What is continuous delivery? Often you'll hear something like "safely, rapidly and predictably deliver new features to production". But that's a little bit restricted. He likes to start already in the inception and planning stage and only end in the actual operation stage. So the whole process. You don't want to continuously deliver bad features that won't be used. You want to include the inception/planning stage to make sure the right features are implemented.

A core idea is to keep the product releasable: build quality in. That is the core that we ought to remember. Quality is an integral part of the product, it is not something you can tack on later.

So... quality in the entire build/test/release cycle.

  • Build. All the checks you can do. Linting, static code analysis, unit tests.
  • Test. Contract tests, end-to-end testsuites, browser tests, scaling tests.
  • Release. Verify the deploy.

Especially in the "test" phase, you need specific clean test environments. Provision infrastructure, run the tests, dispose of the infrastructure. Repeatability is keys.

"Platform as a service": the real name of every operations team nowadays.

The number of jobs that include configuring servers manually is quickly declining. You need to be a programmer now! You need to enable self-service for the teams you work with.

So: python for system engineers. Python is the best language for this.

  • It is the most popular programming language for devops (apart from bash).
  • It is easy to learn (but hard to master).
  • It is more powerful than bash.
  • You can create and distribute real helper applications.
  • Devops means you're a developer now!

What you need to do: create a provisioning service. A self-service. The developer can just click a button and they have a server! Amazon's AWS was born this way. But these are generic servers. Your company needs custom setups. This is where you come in.

  • Services to manage environment lifecycle.
  • The same way for everyone.
  • Manipulation using API.
  • You can mix and match infrastructure providers.

They built something with Python + Flask + Ansible. Flask receives API calls, analyzes them, builds an Ansible invocation and fires off Ansible. Ready!
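
The actual code wasn't shown in the talk; a minimal sketch of such a self-service endpoint, assuming a hypothetical provision.yml playbook and an env_name variable, might look roughly like this:

import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/environments", methods=["POST"])
def create_environment():
    # Turn the API call into an ansible-playbook run; the playbook name and
    # the extra variable are purely illustrative.
    payload = request.get_json()
    result = subprocess.run(
        ["ansible-playbook", "provision.yml",
         "-e", "env_name={}".format(payload["name"])],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    status = "ok" if result.returncode == 0 else "failed"
    return jsonify({"status": status, "log": result.stdout.decode()})


if __name__ == "__main__":
    app.run()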

Also something to look at: Jenkins pipelines. All the Jenkins tasks for one job live inside one versionable file. You can apparently even write those in Python nowadays (instead of the default Groovy DSL).

Some further thoughts:

  • Open all the helper tools to the whole organization.
  • Distribute them as Docker containers. Both the services and the command line tools!
  • As sysadmin, act as a developer. Use the same tools!

What are distributed systems? - Quazi Nafiul Islam

By Reinout van Rees' weblog from Django community aggregator: Community blog posts. Published on Jun 21, 2016.

(One of the talks at the 22 June 2016 Amsterdam Python meetup)

Distributed systems?

  • Distributed computation (like hadoop, mapreduce).
  • Distributed storage (like cassandra, riak).
  • Distributed messaging (like kafka).

You need distributed systems if you want to do more than is normally possible on a single machine.

With many systems you need things like synchronisation (for instance NTP, network time protocol).

A distributed system is a bunch of computers working together. Such systems have challenges and limitations. Take for instance the CAP theorem for distributed storage systems. You can't have all three of the following at the same time:

  • Consistency.
  • Availability.
  • Partition tolerance.

You can for instance value availability over consistency. You give the answer as soon as possible, even if you're not completely sure you have the full answer (as other nodes might have extra/newer information).

Hans-Juergen Schoenig: BRIN indexes: Correlation, correlation, correlation

From Planet PostgreSQL. Published on Jun 21, 2016.

Since BRIN indexes have been introduced in PostgreSQL 9.5, many people have gladly adopted this new index type. A lot has been written about this new feature and a lot of positive feedback has been reported. While BRIN indexes are clearly a success and definitely a win, some people tend to exaggerate and use them […]

The post BRIN indexes: Correlation, correlation, correlation appeared first on Cybertec - The PostgreSQL Database Company.

gabrielle roth: RDS: log vacuum messages

From Planet PostgreSQL. Published on Jun 20, 2016.

Last year I commented that “autovacuum log messages are still missing in [RDS] 9.4“. Amazon fixed that for versions 9.4.5 and up with the custom GUC rds.force_autovacuum_logging_level. I discovered this week that it’s also available on earlier versions of RDS – I have it enabled on a 9.3.10 instance. The default value in the config […]

Joe Abbate: The Future of Pyrseas: Part 2

From Planet PostgreSQL. Published on Jun 20, 2016.

When I started working on Pyrseas, I reviewed several other products. Robert Brewer’s Post Facto was probably the one with the most distinctive design. Although it compared database schemas in order to generate SQL to sync them up, it did not store database object definitions in a standard VCS repository. Rather, it used a Postgres database as the repository analog.

While Post Facto’s design certainly influenced Pyrseas, there is one aspect of  the former that, unfortunately, I did not emulate.

The Dependables

As any developer knows, database objects have dependencies on each other: table A has a primary key PK1, table B is declared with a foreign key dependent on PK1, function C is dependent on type X, view D is based on table A and includes a call to function C.

Pyrseas currently deals with these dependencies in an object-specific manner. For example, it does at least two passes through pg_class objects (tables, views, sequences, etc.) in order to create, alter or drop these objects in the correct order. However, this ad hoc approach can result in incorrect sequencing of generated SQL statements in some cases, particularly those like view D above.

The missing feature from Post Facto that avoids this conundrum? If you answered topological sort you were obviously paying attention in your Algorithms class. If you didn’t, may I suggest chapter 15, “Devising and engineering an algorithm: Topological Sort” of Bertrand Meyer’s Touch of Class.
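
As a rough illustration (this is not Pyrseas or Post Facto code), a dependency-aware ordering of database objects can be produced with a plain topological sort, e.g. Kahn's algorithm:

from collections import deque


def topo_sort(objects, depends_on):
    """Order objects so that every object comes after everything it depends on.

    objects: iterable of object names.
    depends_on: dict mapping an object to the set of objects it depends on.
    """
    remaining = {obj: set(depends_on.get(obj, ())) for obj in objects}
    ready = deque(obj for obj, deps in remaining.items() if not deps)
    ordered = []
    while ready:
        obj = ready.popleft()
        ordered.append(obj)
        for other, deps in remaining.items():
            if obj in deps:
                deps.remove(obj)
                if not deps:
                    ready.append(other)
    if len(ordered) != len(remaining):
        raise ValueError("dependency cycle detected")
    return ordered


# Example: view D depends on table A and function C; C depends on type X.
print(topo_sort(["A", "C", "D", "X"], {"D": {"A", "C"}, "C": {"X"}}))
# One valid ordering: ['A', 'X', 'C', 'D']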

Daniele’s Quest

Over two years ago, someone opened an issue about the need to create primary keys before creating views. Later, Daniele Varrazzo reported another issue with dependencies.

Many of you Postgres users will recognize Daniele as the maintainer of Psycopg, the popular Python PG adapter, which of course is used by Pyrseas.  Daniele and I chatted online, I mentioned Post Facto’s solution and he, fortuitously and generously, started implementing a topological sort on a deptrack branch of Pyrseas.

We then collaborated for about eight months. He did most of the initial coding and I ran tests and fixed some issues. Unfortunately, Daniele is very busy, with a full-time job, Psycopg and other interests, so the work came to a near standstill.

Where We Stand

The last time changes were submitted to the deptrack branch, about six months ago, only four tests failed (out of over 600) running on both Python 2.7 and 3.4 against Postgres 9.3. Regrettably, three of those tests are integration and functional tests, so correcting those is critical to adding this feature.

In addition, although most tests complete successfully, the run times have been impacted severely. This will require an effort at re-optimizing performance before releasing the changes. Last but not least, the implementation needs internal documentation so that it can be properly maintained.

Sadly, I have not had much time or incentives to address these shortcomings. Are there any Pyrseas, Postgres or Python enthusiasts looking for a challenge?



Query Expressions are Amazing

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jun 20, 2016.

The Django 1.8 release added support for complex query expressions. The documentation has some nice examples but they don't do justice to how crazy awesome these are. In this post, we will go through some additional examples of how to leverage these expressions.

Django has had one form of a query expression for several years now: the F expression. F() can be used to reference an existing column in a query. This is often used for atomic update operations, such as incrementing a counter. However, F can also be used to compare two columns on a model when filtering. For instance, we may be interested in users who haven't logged in since their first two weeks on the site. That requires comparing the value of the last_login column and the date_joined on the standard User model from contrib.auth:

from datetime import timedelta

from django.contrib.auth.models import User
from django.db.models import F
from django.utils.timezone import now

# Create some fake data: 10 active users and 20 inactive ones
today = now()
active_count = 10
inactive_count = 20
for i in range(1, active_count + inactive_count + 1):
    active = i <= active_count
    prefix = '' if active else 'in'  # inactive users get an "in" prefix
    domain = 'example.com' if i % 3 == 0 else 'caktusgroup.com'
    attributes = {
        'username': '{}active-{}'.format(prefix, i),
        'email': '{}active-{}@{}'.format(prefix, i, domain),
        'date_joined': today - timedelta(days=30),
        'last_login': today - timedelta(days=0 if active else 21),
    }
    User.objects.create(**attributes)
# Query inactive users
inactive = User.objects.filter(last_login__lte=F('date_joined') + timedelta(days=14))

The F expression supports basic arithmetic operations including some date math, as seen in the example above. However, it is still very limiting in comparison to what is available in SQL.

Relational databases such as PostgreSQL support a number of built-in functions which you can leverage in the ORM using the Func expression added in 1.8. For example, you may want to examine the email domains of your user base. For that, you might use the split_part function in PostgreSQL to extract the domain of the email address column. To normalize the domain values you can compose this with the built-in Lower expression:

# Continued from above
from django.contrib.auth.models import User
from django.db.models import F, Func, Value
from django.db.models.functions import Lower

qs = User.objects.annotate(domain=Lower(
    Func(F('email'), Value('@'), Value(2), function='split_part')))

This translates into the SQL call split_part("auth_user"."email", '@', 2) and annotates every user with a new domain attribute which is the domain of their email address. The value 2 passed to split_part says to take the second value after splitting the string. Unlike Python, this is a 1-based index rather than a 0-based index. With this we can find out what the most popular domains are for the users:

# Continued from above
from django.db.models import Count

popular = qs.values('domain').annotate(count=Count('id')).order_by('-count')
print(popular)
# Result
# [{'count': 20, 'domain': 'caktusgroup.com'},
# {'count': 10, 'domain': 'example.com'}]

As noted in the example, this returns a list of dictionaries of the form {'domain': <domain name>, 'count': #} ordered by the highest counts first. We can take this even further using the conditional expressions.

Two more new expressions Case and When can be used to build conditional aggregates. For instance, we may want to only count users who have logged in recently:

# Continued from above
from django.db.models import Case, When
from django.utils.timezone import now

active = When(
    last_login__isnull=False,
    last_login__gte=now() - timedelta(days=14),
    then=Value(1))

active defines the conditional expression when the last_login is not null and is a date later than 14 days ago. If there is a match then this row will add the value of 1 to the aggregate. This conditional expression can be passed into an aggregate expression such as Count, Sum, or Avg. To get the popular domains, we’ll count the number of active users for a given email domain.

# Continued from above
popular = qs.values('domain').annotate(
    count=Count('id'), active_count=Count(Case(active))).order_by('-active_count')
print(popular)
# Result
# [{'active_count': 7, 'count': 20, 'domain': 'caktusgroup.com'},
#  {'active_count': 3, 'count': 10, 'domain': 'example.com'}]

This adds a new key/value to the resulting dictionaries which includes the number of active users for the domain. Here caktusgroup.com has the most active registered users but it also has the most registered users overall. For one last usage, we can look at the percent of users for each domain who are active, again using the F expression:

# Continued from above
popular = popular.annotate(
    percent_active=Value(1.0) * F('active_count') / F('count') * Value(100)
).order_by('-percent_active')
print(popular)
# Result
# [{'active_count': 7, 'count': 20, 'domain': 'caktusgroup.com', 'percent_active': 35},
#  {'active_count': 3, 'count': 10, 'domain': 'example.com', 'percent_active': 30}]

Again this adds another data point to the returned list of dictionaries which is the percent of active users. Now we know which email domains are associated with the most users, the most recently logged in users, and the percent of users with that domain who have been recently active.

Query expressions like Func allow you to make more complex queries, leveraging more of your chosen database’s power without having to drop to raw SQL. Combined with the aggregation and conditional expressions you can roll up additional statistics about your data set using the expressive power of the ORM. I hope these examples give a good overview of some of the queries that are now easy to handle in the ORM and which previously required raw SQL.

Tatsuo Ishii: Avoiding connection failure while health checking

From Planet PostgreSQL. Published on Jun 20, 2016.



Hi pgpool-II users,

I've managed to come back to this blog after a three-year break.

Last week the pgpool-II developer team released minor versions from 3.1 to 3.5. This release includes a special gift for users: an enhancement to health checking.

You might notice an annoying behavior of pgpool-II.

Example: suppose we have three PostgreSQL backends managed by pgpool-II. pgpool-II occasionally checks the health of each backend if "health check" is enabled. If backend #2 goes down, a failover is triggered and after that, users can use the DB server cluster without backend #2. This is great. However, if you set up retrying of the health check, clients cannot connect to pgpool-II while it is retrying the health check. For instance,

health_check_max_retries = 10
health_check_retry_delay = 6

will continue the health check retry for 10*6 = 60 seconds at least. This is very annoying for users because they cannot initiate new connections to pgpool-II for 1 minute. Making these parameters shorter might mitigate the situation a little bit but this may not be useful if the network is not stable and longer retries are desirable.

These new releases significantly enhance the situation. By setting:

fail_over_on_backend_error = off

when a user connects to pgpool-II while it is doing health checking, it starts to connect to all backends including #2. Before this release, pgpool-II gave up initiating the session if one of the backends was not available (in this case #2). With this release, pgpool-II skips the broken backend and continues to connect to the rest of the backends. Please note, however, that this feature is only available when all of the conditions below are met:

  • streaming replication mode
  • the broken backend is not the primary server
  • fail_over_on_backend_error is off

This enhancement is available in all of the new releases: 3.5.3, 3.4.7, 3.3.11, 3.2.16, 3.1.19.

Gabriele Bartolini: Concurrent backups with Barman and PostgreSQL 9.6

From Planet PostgreSQL. Published on Jun 20, 2016.


PostgreSQL 9.6 has extended the traditional framework available for physical backups by allowing users to take backups concurrently. Barman will transparently support this new set of functions without requiring the pgespresso extension.

The pgespresso extension, conceived by our Simon Riggs, allowed marking the start and the stop of a backup process even on a read-only standby server. Through pgespresso, Barman users could take physical backups via rsync/SSH from a standby server, off-loading the actual copy of files from the master server.

This feature is called concurrent backup, and is already available in PostgreSQL through the streaming replication protocol, via pg_basebackup.

The last version that pgespresso will support in terms of concurrent backup is PostgreSQL 9.6. Why is that?

The available PostgreSQL API for taking low-level physical backups has been extended in order to natively support concurrent backups (rather, I should use the term “non-exclusive”). I am not sure if Magnus (the author of this patch) was inspired by pgespresso or not, but what is important is that his contribution is definitely more generic and robust (after all, pgespresso was designed to interact only with Barman).

Therefore, PostgreSQL 9.6 and future versions will natively have functions that allow Barman to request a concurrent backup, making pgespresso unnecessary.

For more details about this new API, please refer to the section that I have written in the PostgreSQL Wiki page “What’s new in PostgreSQL 9.6″. For now, it is important to know that:

  • pg_start_backup() has been extended with a new parameter that specifies whether the backup is exclusive (the default) or not;
  • a new version of pg_stop_backup() has been provided for concurrent backups: from a technical point of view, this function will now return the content of the backup label and the map of available tablespaces.
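
For illustration only (this is not how Barman drives the API internally), a non-exclusive backup taken from Python with psycopg2 might look roughly like this; the label, connection string and file paths are made up:

import psycopg2

# The session that starts a non-exclusive backup must stay open until the backup is stopped.
conn = psycopg2.connect("dbname=postgres")
conn.autocommit = True
cur = conn.cursor()

# The third argument, False, requests a non-exclusive (concurrent) backup (PostgreSQL 9.6+).
cur.execute("SELECT pg_start_backup(%s, false, false)", ("demo-backup",))

# ... copy the data directory here, e.g. with rsync ...

# pg_stop_backup(false) returns the stop LSN, the backup_label content and the
# tablespace map, which must be saved alongside the copied files.
cur.execute("SELECT lsn, labelfile, spcmapfile FROM pg_stop_backup(false)")
lsn, labelfile, spcmapfile = cur.fetchone()
with open("backup/backup_label", "w") as f:
    f.write(labelfile)

conn.close()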

What is relevant to our Barman users is that Barman will transparently handle this new API, with no impact on the user experience whatsoever.

By default, Barman will request an exclusive backup (using the traditional set of functions available since PostgreSQL 8). However, you can trigger the new behaviour by setting concurrent_backup in the backup_options global/server configuration option, as follows:

backup_options = concurrent_backup

In the future, PostgreSQL will get rid of the traditional functions for exclusive backups in favour of concurrent ones, due to some known dangerous corner cases. In particular, the sudden death of a PostgreSQL server after pg_start_backup() has been called but before pg_stop_backup(), which leaves a backup_label file in the PGDATA directory and prevents the server from restarting (by the way, the barman check command can identify this corner case).

When the new backup API becomes the preferred one, Barman will switch its default accordingly. In the meantime, we invite our users to test this new feature which is currently available only on Github and will be included in Barman 1.6.2/1.7.0.

Rajeev Rastogi: Advanced Transaction: Savepoint And Prepared Transaction

From Planet PostgreSQL. Published on Jun 18, 2016.

Hope you found my two previous posts on transactions interesting and useful. Based on some readers' requests, I will cover a few advanced transaction topics in my subsequent posts. In this post, I am going to cover the following advanced variants of the traditional transaction:
     1. Savepoint
     2. Prepared Transaction

Savepoint

In order to understand its usage, imagine one of your customers' workloads as below:
            START TRANSACTION;
            Perform operation-1
            Perform operation-2
            Perform operation-3
            Perform operation-4
            COMMIT;

As per the business logic, it is likely that operation-3 and operation-4 will fail, and the failure of these operations does not impact operation-1 and operation-2. Now, as per transaction properties (recall the Basics of Transactions), the whole transaction will be rolled back in case any of the four operations fails. So in the given business scenario, even though operation-1 and operation-2 should not be rolled back, they still will be.

In order to avoid this behavior and to control the main transaction's life span, savepoints were introduced.

A savepoint can be created only inside a transaction block. It creates a saving point for the overall transaction, i.e. if any command results in an error inside a transaction block, then instead of rolling back the whole transaction, it allows rolling back to the point where the savepoint was created. Consider the following example:
        CREATE TABLE TBL(ID INT, ID2 INT);
        CREATE UNIQUE INDEX IDX ON TBL(ID);

        START TRANSACTION;
        INSERT INTO TBL VALUES(100, 100);
        INSERT INTO TBL VALUES(200, 200);
        SAVEPOINT first_save;
        INSERT INTO TBL VALUES(200, 300); <=== This will fail because of the unique index.
        ROLLBACK TO SAVEPOINT first_save; <=== Undoes only the failed insert; the transaction can continue.

Related Commands

1. SAVEPOINT savepoint_name: This command creates a new savepoint inside a transaction block, as shown in the above example. Savepoints can also be nested, i.e. a savepoint can be created within an already created savepoint.

2. ROLLBACK TO [SAVEPOINT] savepoint_name: This command rolls back all operations done since the point where the given savepoint was created, as shown in the above example. Rolling back to a savepoint does not destroy the given savepoint, but it destroys any savepoints nested within it. E.g.

        START TRANSACTION;
        INSERT INTO TBL VALUES(100, 100);
        INSERT INTO TBL VALUES(200, 200);
        SAVEPOINT first_save;
        INSERT INTO TBL VALUES(300, 300);
        SAVEPOINT second_save;
        INSERT INTO TBL VALUES(400, 400);
        ROLLBACK TO SAVEPOINT first_save;   <=== Will destroy second_save but first_save will remain active.

3. RELEASE [SAVEPOINT] savepoint_name: As the name suggests, it removes/destroys the given savepoint. It does not undo any of the commands executed after the savepoint was created. Similar to ROLLBACK TO SAVEPOINT, this command also destroys all savepoints nested within the savepoint being released.

        START TRANSACTION;
        INSERT INTO TBL VALUES(100, 100);
        INSERT INTO TBL VALUES(200, 200);
        SAVEPOINT first_save;
        INSERT INTO TBL VALUES(300, 300);
        SAVEPOINT second_save;
        INSERT INTO TBL VALUES(400, 400);
        RELEASE SAVEPOINT first_save; <=== Destroys both savepoints but has no impact on the records inserted.
        INSERT INTO TBL VALUES(400, 500); <=== Fails because of the unique index on ID; with no savepoint active, the whole transaction is aborted.

NOTE: There is no command like COMMIT TO SAVEPOINT, as committing only part of a transaction has no meaning (it would violate the basic transaction properties).
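
As an aside (not from the original post), here is a rough psycopg2 sketch of how an application might use a savepoint to recover from a failed statement without losing the whole transaction; the connection string is illustrative and TBL is the table created above:

import psycopg2

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

cur.execute("INSERT INTO TBL VALUES (100, 100)")
cur.execute("SAVEPOINT first_save")
try:
    # Violates the unique index on ID.
    cur.execute("INSERT INTO TBL VALUES (100, 300)")
except psycopg2.IntegrityError:
    # Undo only the failed insert; the first insert survives.
    cur.execute("ROLLBACK TO SAVEPOINT first_save")

conn.commit()
conn.close()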

Prepare Transaction

Before jumping in to define this variation, it is important to understand two-phase commit. As the name suggests, it commits a transaction in two steps.

1. The first step is to verify all conditions which need to be satisfied in order to commit the transaction. This is called the prepare stage.
2. The second step is the actual commit, which is almost the same as a traditional commit.

This is most useful for a cluster-based database or similar setups, where various nodes (databases) are responsible for running part of a bigger operation. It is important that each node does either COMMIT or ROLLBACK at the same time in order to maintain the overall transaction properties. Many cluster solutions achieve this using a dedicated transaction manager or coordinator, whose tasks are:
  1. Step-1: Probe all nodes to check whether the commit can be done; in turn, all nodes respond with either a positive or a negative acknowledgment.
  2. Step-2: Once the coordinator or transaction manager gets responses from all nodes, it issues COMMIT or ROLLBACK to all nodes.

I will cover more detailed two-phase commit in my future blogs.

So, coming back to PREPARE TRANSACTION: it prepares the current, normally running transaction for two-phase commit. Once a transaction is prepared, it dissociates from the current session and its state gets stored on disk. Since its state is stored on disk, it is almost certain that the transaction can be committed successfully even if the database crashes before the commit is explicitly issued.
A prepared transaction is given a unique identifier, which is used later to commit or roll back the transaction. Once the transaction is prepared, it is dissociated from the current session, and it can subsequently be committed or rolled back from any session using the name given at prepare time.
As with a traditional transaction, until the prepared transaction is committed or rolled back, the effects of its operations are not visible to any session, including the session which performed the transaction.
This is not recommended for use by applications or interactive sessions. Rather, it should be used where an external transaction manager is needed, one which performs some validation or has some other dependency before the actual commit.

Related Commands

  1. PREPARE TRANSACTION name: Prepares the current transaction and immediately dissociates it from the current session. A unique name is given to this prepared transaction, so that later it can be committed/rolled back using the same name.
  2. COMMIT PREPARED name: Commits the already prepared transaction. It is like performing the second phase of the two-phase commit protocol.
  3. ROLLBACK PREPARED name: Rolls back the already prepared transaction.
       Example:
        START TRANSACTION;
        INSERT INTO TBL VALUES(100, 100);
        INSERT INTO TBL VALUES(200, 200);
        PREPARE TRANSACTION 'first_prepare';
        SELECT * FROM TBL;  <=== This will return zero records, as the transaction is not yet committed.

        COMMIT PREPARED 'first_prepare';
        SELECT * FROM TBL;  <=== This will return both records.

        START TRANSACTION;
        INSERT INTO TBL VALUES(300, 300);
        INSERT INTO TBL VALUES(400, 400);
        PREPARE TRANSACTION 'second_prepare';
        ROLLBACK PREPARED 'second_prepare';
        SELECT * FROM TBL;  <=== The two new records are not returned, as the prepared transaction was rolled back.
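
As a further aside (again not from the original post), the same flow can be driven from Python using psycopg2's two-phase commit helpers; the connection string and values are illustrative, and the server must have max_prepared_transactions set to a value greater than zero:

import psycopg2

conn = psycopg2.connect("dbname=test")

# A transaction id for the prepared transaction; the parts are arbitrary identifiers.
xid = conn.xid(1, 'py_first_prepare', 'demo')

conn.tpc_begin(xid)
cur = conn.cursor()
cur.execute("INSERT INTO TBL VALUES (500, 500)")
cur.execute("INSERT INTO TBL VALUES (600, 600)")

# Equivalent to PREPARE TRANSACTION: the work is flushed to disk but not yet visible.
conn.tpc_prepare()

# Equivalent to COMMIT PREPARED (tpc_rollback() would be ROLLBACK PREPARED).
conn.tpc_commit()
conn.close()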


It is highly recommended to grab a PostgreSQL package from https://www.postgresql.org/download/ and try all of the commands described.
Subscribe to my blog to stay informed about upcoming posts. My next post will be a continuation of this one and will cover the internal code perspective of what was discussed here.

Please feel free to drop your query/suggestion/comments.

Shaun M. Thomas: PG Phriday: Let There Be Jank

From Planet PostgreSQL. Published on Jun 17, 2016.

One way the Postgres project is subtly misleading, is that it becomes easy to forget that not all other projects are nearly as well managed. This becomes more relevant when delving into niches that lack sufficient visibility to expose the more obvious deficiencies. As much as we like Postgres, it’s not quite as popular as it could be. This makes some of the side projects infrequently used, and as a direct consequence, they can often resemble jerky automatons cobbled together out of spit and bailing wire.

A good example of this is the hdfs_fdw extension for accessing Hadoop. To be fair, a large portion of its complexity is due to dependency on Hive and hastily assembled APIs like Thrift and fb303. Unfortunately it also suffers from “Works for Me” syndrome, lacking an autoconf to autodiscover these requirements, and it also doesn’t use automake to properly bootstrap the make environment. As a result, most builds will fail outright or require hand-modifying the Makefile—a task many will simply balk at before abandoning the extension outright.

So, let’s install the hdfs_fdw extension, going through all the necessary steps, and not just some incomplete shorthand that makes assumptions regarding the build system. To facilitate this, no default install locations will be used at all, because hardcoded defaults are how we got into this mess in the first place.

The hdfs_fdw extension depends on Thrift, so let’s start there. As of this writing, the latest version is 0.9.3. Here’s the full build process we used:

wget http://apache.osuosl.org/thrift/0.9.3/thrift-0.9.3.tar.gz
tar -xzf thrift-0.9.3.tar.gz
cd thrift-0.9.3
./configure --prefix=/opt/thrift
make -j2
sudo make install

So far, this is pretty normal. What isn’t normal, is that the thrift source includes a contrib module named fb303 we also need. Unlike the Postgres build environment, configuration settings and subsequent Makefile components do not cascade to the contrib modules. This is where the trouble begins.

The first issue is that, unlike Thrift, fb303 defaults to only static linking, and won’t generate a dynamic object file unless explicitly told to do so. Considering these components are packaged together, this restriction is unnecessarily jarring. Why choose between dynamic or static linking for a system library that may be used as either? In our case, we want dynamic shared objects, so we need to configure with --disable-static.

The second roadblock comes from the Thrift team itself, which closed a bug as “Cannot Reproduce” despite the fact one of their include path references is simply wrong. The user that reported the issue even provided a patch. So we need to fix that, too.

The whole process looks like this:

cd contrib/fb303
./bootstrap.sh --prefix=/opt/thrift --with-thriftpath=/opt/thrift \
    --disable-static
sed -i 's%/include/thrift%/include%' cpp/Makefile.am
make -j2
sudo make install

With Thrift out of the way, it’s time to move on to hdfs_fdw itself. Unfortunately we can’t quite do that yet. The hdfs_fdw extension is distributed with a library it depends on, but that is not accounted for in the Makefile. We need to build and install it separately for some reason. Further, this dependency has hardcoded paths in its own Makefile, so we must modify it or end up with the library in an arbitrary location, or with it unable to find required headers.

So let’s build the libhive library:

git clone https://github.com/EnterpriseDB/hdfs_fdw.git
cd hdfs_fdw/libhive
sed -i 's%THRIFT_HOME=.*%THRIFT_HOME=/opt/thrift/include%' Makefile
sed -i 's%INSTALL_DIR=.*%INSTALL_DIR=/opt/thrift/lib%' Makefile
make -j2
sudo make install

We elected to install libhive in the same location as Thrift because they’ll be used together in our case. This prevents cluttering up our /usr/local/lib directory, as well as allowing us to easily redistribute these prerequisites to our other Postgres systems since we don’t have a nice package.

Finally, we can build the hdfs_fdw extension itself. Or can we?

Unfortunately, we’re still not done setting up. Because we elected to install Thrift as an optional piece of software, and because we use dynamic linking, we need to tell the operating system where to find libraries. Usually this means modifying /etc/ld.so.conf and running ldconfig to re-read library paths. On Ubuntu and other Debian variants, we can actually put these in a subdirectory in a less intrusive fashion.

Here’s how that might look:

echo /opt/thrift/lib | sudo tee /etc/ld.so.conf.d/hive.conf
sudo /sbin/ldconfig

Our particular build environment is an Ubuntu system, which is a Debian variant. As such, build tools like pg_config are actually wrappers around the real utilities, which are hidden away in deep paths to prevent accidental use. This is to help facilitate having multiple Postgres versions on the same server, but it also complicates installing extensions, since the wrappers always assume the most recent version. Suppose Postgres 9.6 beta is on our system, but we want to install an extension for 9.5?

That means we need to alter our path before building hdfs_fdw itself. Debian variants put everything we need in /usr/lib/postgresql/[version]/bin, but other UNIX systems may use a different location. We’ll need that information to proceed. In addition, since hdfs_fdw doesn’t use a configure script, we can’t tell it where to find the Thrift and fb303 libraries. This isn’t strictly necessary because we modified ldconfig, but it’s always better to be safe.

So, assuming we’re still in the hdfs_fdw/libhive directory, we’d finish by installing hdfs_fdw itself like this:

export PATH=/usr/lib/postgresql/9.5/bin:$PATH
 
cd ..
sed -i 's%/usr/local/thrift%/opt/thrift%' Makefile
make -j2 USE_PGXS=1
sudo -E make install

We needed the -E flag to preserve our $PATH variable. Otherwise the root user’s path would be used, and then the extension would be installed into the most recent Postgres version, regardless of our wishes.

Since we modified ldconfig with the new library references, we also need to restart Postgres. Otherwise, it won’t have /opt/thrift/lib in its library cache, and as a result, would throw an error when trying to activate hdfs_fdw. Ubuntu systems use pg_ctlcluster for this, while others will use the Postgres pg_ctl tool directly. Let’s make that our final step to “activate” the library before using it.

sudo pg_ctlcluster 9.5 main restart

And finally… finally we’re done. But did it work? Let’s check:

CREATE EXTENSION hdfs_fdw;
 
CREATE SERVER hdfs_server
         FOREIGN DATA WRAPPER hdfs_fdw
         OPTIONS (host 'hive_host');
 
CREATE USER MAPPING FOR postgres
    SERVER hdfs_server;
 
CREATE FOREIGN TABLE hive_sensor_log
(
  id            BIGINT,
  location      VARCHAR(255),
  reading       BIGINT,
  reading_date  TIMESTAMP
 
) SERVER hdfs_server
OPTIONS (dbname 'default', TABLE_NAME 'sensor_log');
 
SELECT * FROM hive_sensor_log;
 
 id | location | reading |    reading_date     
----+----------+---------+---------------------
  1 | place    |      82 | 2016-06-17 08:15:31
  2 | stuff    |      22 | 2016-06-17 08:18:31

Well then, that was quite an adventure. In the end, we got something that worked, though the amount of hoops we had to jump through was a little disconcerting. It shouldn’t have to be this way.

On a personal note, this was actually the easy part. Hadoop is a truculent beast, and other elements in the stack—of which there are many—just make it more ridiculous. Hive itself is probably one of the most janky things I’ve ever encountered. Postgres has libpq, so why isn’t there an equivalent for Hive? Is it a protocol or not? Why do I need to install a freaking one-off Facebook library to access my Hadoop install with a SQL interface?

Worse, I needed to follow multiple incomplete tutorials online to get Hive working at all. Beyond simply installing it, it must be started with SASL disabled for hdfs_fdw. But doing that means it defaults to Kerberos authentication. If that isn’t set up, commands need to run as the user that launched Hive, since users are mapped from the operating system. To get that to work, I had to modify several more XML files. I feel like I’d need to read at least three books on this subject before even trying to approach this with any amount of confidence.

Then Hive crashed with an OOM error after the query output above. It turns out I could select data to my heart’s content, but following an insert through Hive (which took almost 20 seconds per row), there were ceaseless problems. Pulling data after an insert always caused it to go into an infinite uncaught OOM exception crash loop that required kill -9 to stop. When all of my work for this article was complete, I shut it all down and backed away slowly, lest it rend me to dripping gobbets because I pressed the wrong key in its presence.

Postgres, despite its versatility, just works. It can be used in more advanced architectures and be leveraged for even more power, but it still functions in its base configuration. After today, I sincerely wish more projects followed that philosophy.

Alexander Korotkov: Faceted Search in the Single PostgreSQL Query

From Planet PostgreSQL. Published on Jun 17, 2016.

Faceted search is a very popular buzzword nowadays. In short, the specialty of faceted search is that its results are organized per category. Popular search engines are adding special support for faceted search.

Let’s see what PostgreSQL can do in this field. First, let’s formalize our task. For each category which has matching documents we want to obtain:

  • Total number of matching documents;
  • TOP N matching documents.

For sure, it’s possible to query such data using multiple per-category SQL queries. But we’ll do it in a single SQL query, which would also be faster in the majority of cases. The query below implements faceted search over the PostgreSQL mailing list archives using window functions and a CTE. Usage of window functions is essential, while the CTE is used for better query readability.

Faceted search SQL query:

/*
 * Select all matching messages, calculate rank within list and total count
 * within list using window functions.
 */
WITH msg AS (
    SELECT
        message_id,
        subject,
        list,
        RANK() OVER (
            PARTITION BY list
            ORDER BY ts_rank_cd(body_tsvector, plainto_tsquery('index bloat')), id
        ) rank,
        COUNT(*) OVER (PARTITION BY list) cnt
    FROM messages
    WHERE body_tsvector @@ plainto_tsquery('index bloat')
),
/* Aggregate messages and count per list into json. */
lst AS (
    SELECT
        list,
        jsonb_build_object(
            'count', cnt,
            'results', jsonb_agg(
                jsonb_build_object(
                    'message_id', message_id,
                    'subject', subject))) AS data
    FROM msg
    WHERE rank <= 5
    GROUP BY list, cnt
)
/* Aggregate per list data into single json */
SELECT jsonb_object_agg(list, data) FROM lst;

The resulting JSON document contains total count of matching mailing list messages and TOP 5 relevant messages for each list.

Faceted search JSON result:

{
  "pgsql-admin": {
    "count": 263,
    "results": [
      {"message_id": "CACjxUsMUWkY1Z2K2A6yVdF88GT3xcFw5ofWTR6r1zqLUYu0WzA@mail.gmail.com", "subject": "Re: Slow planning time"},
      {"message_id": "dcc563d11001041749w561874f7y6574fb42ab49f850@mail.gmail.com", "subject": "Re: Finetuning Autovacuum"},
      {"message_id": "AANLkTikWabMzCRCSWuNLuPizSSQX3YILgJrNZuzgp3yM@mail.gmail.com", "subject": "Re: blocking automatic vacuum"},
      {"message_id": "dcc563d10904011631l4058aabew12f3fe4895a072f3@mail.gmail.com", "subject": "Re: Vacuum Full"},
      {"message_id": "FE44E0D7EAD2ED4BB2165071DB8E328C03062D9D@egcrc-ex01.egcrc.org", "subject": "Re: postgres bogged down beyond tolerance"}
    ]
  },
  /*................................................................................*/
  "pgsql-advocacy": {
    "count": 8,
    "results": [
      {"message_id": "Pine.LNX.4.33.0310291602220.22178-100000@css120.ihs.com", "subject": "Re: Press Release"},
      {"message_id": "20050502203626.GA29791@dcc.uchile.cl", "subject": "Re: [HACKERS] Increased company involvement"},
      {"message_id": "5d94f7afb26f56652e06ba0657573ef2@biglumber.com", "subject": "Search and archives still out of sync"},
      {"message_id": "Pine.GSO.4.64.0708010705100.13114@westnet.com", "subject": "Re: postgresql publication"},
      {"message_id": "20070801151739.GF6165@alvh.no-ip.org", "subject": "Re: postgresql publication"}
    ]
  }
}

In the plan of this query we can see that the message_body_idx GIN index is scanned only once, which is great.

Plan of faceted search SQL query:

                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2369.50..2369.51 rows=1 width=114) (actual time=34.232..34.232 rows=1 loops=1)
   CTE msg
     ->  WindowAgg  (cost=2087.93..2354.30 rows=491 width=336) (actual time=30.925..33.087 rows=2486 loops=1)
           ->  WindowAgg  (cost=2087.93..2222.96 rows=491 width=336) (actual time=30.716..32.020 rows=2486 loops=1)
                 ->  Sort  (cost=2087.93..2089.16 rows=491 width=336) (actual time=30.711..30.838 rows=2486 loops=1)
                       Sort Key: messages.list, (ts_rank_cd(messages.body_tsvector, plainto_tsquery('index bloat'::text))), messages.id
                       Sort Method: quicksort  Memory: 582kB
                       ->  Bitmap Heap Scan on messages  (cost=48.05..2065.98 rows=491 width=336) (actual time=3.037..24.345 rows=2486 loops=1)
                             Recheck Cond: (body_tsvector @@ plainto_tsquery('index bloat'::text))
                             Heap Blocks: exact=2044
                             ->  Bitmap Index Scan on message_body_idx  (cost=0.00..47.93 rows=491 width=0) (actual time=2.723..2.723 rows=2486 loops=1)
                                   Index Cond: (body_tsvector @@ plainto_tsquery('index bloat'::text))
   CTE lst
     ->  HashAggregate  (cost=12.69..13.69 rows=67 width=540) (actual time=34.090..34.133 rows=14 loops=1)
           Group Key: msg.list, msg.cnt
           ->  CTE Scan on msg  (cost=0.00..11.05 rows=164 width=540) (actual time=30.928..33.879 rows=68 loops=1)
                 Filter: (rank <= 5)
                 Rows Removed by Filter: 2418
   ->  CTE Scan on lst  (cost=0.00..1.34 rows=67 width=114) (actual time=34.092..34.140 rows=14 loops=1)
 Planning time: 0.380 ms
 Execution time: 34.357 ms

Thus, it appears that nothing prevents you from implementing trendy kinds of search using good old SQL and the powerful features of PostgreSQL, including full-text search, JSON support, window functions etc.

Kaarel Moppel: PostgreSQL underused features – WAL compression

From Planet PostgreSQL. Published on Jun 17, 2016.

The WAL compression feature: with the launch of Postgres 9.5 a new setting called „wal_compression“ was introduced, which should decrease the IO load at the cost of some CPU load. And that is a desirable compromise considering typical constraints of modern hardware. But somehow the feature didn’t get the publicity it deserved – it wasn’t even mentioned in the What’s new roundup and by default […]

The post PostgreSQL underused features – WAL compression appeared first on Cybertec - The PostgreSQL Database Company.

The Journal of Medical Internet Research Features Epic Allies Phase 1 Study Results

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jun 16, 2016.

The Journal of Medical Internet Research recently published “Epic Allies: Development of a Gaming App to Improve Antiretroviral Therapy Adherence Among Young HIV-Positive Men Who Have Sex With Men”. Epic Allies, initially funded by a federal Small Business Innovation Research (SBIR) grant, represents a partnership between Caktus, UNC’s Institute of Global Health and Infectious Diseases, and Duke Global Health Institute.

The article highlights the challenges of medication adherence, emphasizing the concerns of study participants directly:

“Yeah, cause honestly, it was a good few months before I ever took medication. And in that timeframe of diagnosis to taking medication, it was very easy for me to detach. It was very easy for me to say, this is not real, nahhh, whatever. It didn’t become real until I had to take a pill. When you take a pill, it’s real.” - Study participant.

The team used continuous participant feedback to iteratively develop the application. Ultimately, the study found that this iterative approach to application development was what made it “highly acceptable, relevant, and useful by YMSM (young men who have sex with men)”.

The study authors are Sara LeGrand, PhD; Kathryn Elizabeth Muessig, PhD; Tobias McNulty, BA (Caktus); Karina Soni, BA; Kelly Knudtson, MPH; Alex Lemann, MS (Caktus); Nkechinyere Nwoko, BA (Caktus); and Lisa B Hightow-Weidman, MPH, MD.

To read the study in full, visit http://games.jmir.org/2016/1/e6/.

The Philosophy of Channels

By Andrew Godwin from Django community aggregator: Community blog posts. Published on Jun 16, 2016.

Why is Channels designed like it is, and what does it really mean for Django, Python, and WebSocket handling?

It's been a while since my last blog post about Channels, and a lot has happened in the meantime - the API has developed out and stabilised, features like backpressure have come along, and the situation for backends is looking a lot better, especially once the local-and-remote combination layer matures some more.

The other thing that has happened, however, is confusion and worry over the direction Channels is headed, and the direction it spells for Django and Python overall. A lot of the development of Channels was addressing and tackling my own personal worries with its direction, and picking the right set of tradeoffs, sometimes from two equally valid options.

I've not been as proactive as I could have been at communicating my reasoning and long-term vision for what Channels could do; I'm hoping this blog post will rectify some of that. Let me take you through the specific set of problems I'm looking to tackle, why I chose to design things the way I did, and what I see as the path forwards.

It's not just about WebSockets

A lot of people's base reaction to Channels is two-fold; first, to see it as only being a way to get WebSocket support (that is the thing that spurred on development, but not the only reason for it; more on that later), and second, to then say that trying to solve WebSocket protocol handling via a message-passing distributed system is overkill.

They're right there; Python's async capabilities are getting ever better, and it's easy enough to use one of the existing libraries (such as Autobahn) to write a WebSocket handling server in a few hours. You probably need to standardise an interface so you can talk between this server and the rest of your project, but that's not particularly difficult.

This is, indeed, the route I first took, and how the very early versions of Channels (then called django-onair) worked. However, as I developed it out and started pondering how to run it at a decent scale, the real problem became clear.

You see, WebSocket protocol handling isn't the hard problem, in my opinion; it's actually using those sockets in a larger project. Most uses of sockets are event-driven; you send data down the socket when something happens externally - be it a model being saved, an external system changing, or just another message on another WebSocket.

All these different sources of events can happen at different places in your deployed project. If you go down the traditional path of running a bunch of servers, each with a webserver and Python code, you quickly realise that you need some way to communicate between them; WebSocket handling is one thing, but being able to broadcast to groups of sockets when things happen is actually how you write applications.

Imagine a large-scale chat server where different people are logged into different physical machines; how will your processes broadcast out incoming chat messages on a socket to everyone else in that chatroom, on all the other servers? Where do you track who's in the room? What do you do about dead sessions?

What about a system where you send notifications to users when their profile is viewed - those views are likely happening on a different server, so how do you get the view event from that server over to the one where your user's WebSocket is terminated?

This is the hard problem that Channels is really aimed to solve; it's not just WebSocket protocol handling, but the problem of building complex applications around WebSockets. Distributed systems and messaging is a really tough problem, and I believe it's the sort of thing that benefits a lot from a few, shared, polished solutions, rather than a rough guide on how to tie async code together.
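To make that concrete, here is a minimal sketch of a group-based broadcast consumer against the Channels 1.x API; the group name and the assumption that frames are plain text are mine, not from the post:

# consumers.py -- minimal sketch of group-based broadcast (Channels 1.x style)
from channels import Group

def ws_connect(message):
    # Register this socket's reply channel with a (hypothetical) room group,
    # regardless of which server terminated the WebSocket.
    Group("chat-lobby").add(message.reply_channel)

def ws_receive(message):
    # Broadcast the incoming frame to every member of the group; the channel
    # layer delivers it to sockets held by other interface servers too.
    Group("chat-lobby").send({"text": message.content["text"]})

def ws_disconnect(message):
    Group("chat-lobby").discard(message.reply_channel)

The point of the sketch is that none of this code knows or cares which machine any particular socket lives on; that is exactly the distributed problem the channel layer is absorbing.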

Encouraging async

One of the things Channels does is run Django in a synchronous fashion, and it encourages you to write all the event handlers for messages the same way; it just runs this code in a tight worker loop, and discourages you from doing any blocking operations that would stall that loop.

The problem is, people seem to think that's the only way intended for you to write code against Channels. It's not; Channels is meant to make messaging easier between sync and async programs in general, letting you choose the best tool for the job (and I would argue that in a lot of simple business logic cases you probably want synchronous code, as it's a lot easier to write and maintain).

In fact, Channels makes it easier than ever to write fully-asynchronous applications and have them communicate with the rest of your Python project; that's all the WebSocket interface server (Daphne) is, after all. Want async URL fetching? IoT communication? Outgoing sockets? You can write the async code as you normally would, and then thanks to Channels, keep the rest of your code in a familiar synchronous framework and communicate back and forth with your specialist code.
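As a small, hypothetical illustration (the channel name, view, and consumer are made up; only the Channel and route calls are Channels 1.x API), handing work from a synchronous view to a separate worker process looks roughly like this:

# sketch: hand work from synchronous Django code to a separate worker process
from channels import Channel
from channels.routing import route
from django.http import HttpResponse

def request_thumbnail(request, image_id):
    # A normal, synchronous Django view: push a message onto a custom channel
    # and return immediately; a worker elsewhere picks it up.
    Channel("thumbnailer").send({"image_id": image_id})
    return HttpResponse("queued", status=202)

def generate_thumbnail(message):
    # Consumer run by `runworker` (or any process attached to the channel
    # layer); the real image work would happen here.
    print("generating thumbnail for image", message.content["image_id"])

channel_routing = [
    route("thumbnailer", generate_thumbnail),
]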

Once you have a project running using Channels, it makes it easier than ever to add in more async code as its own process to go and do some new task you need, and then have a clearly-defined, existing solution to communicate with other async and sync processes. The community experience, documentation, write-ups, and case studies of others who have gone before you all contribute, because everything is built on a single, shared design and platform.

More Protocols

Of course, this all ties back into the idea of Channels as not being about WebSockets; it's a general cross-process eventing system for Django (and, hopefully, Python at large). WebSockets are one of the protocols specified to run over it, but work is already underway on interface servers for Slack (letting you tie in chat integration to a server cluster) and email (allowing you to write consumers easily against incoming email alongside your HTTP and WebSocket code).

Message format specifications also help alternative implementations; much like there are many WSGI servers, the message formats allow any number of ASGI-compatible HTTP or WebSocket servers to exist, even running alongside each other in the same system.

Some protocols don't need the broadcast functionality that WebSockets do, especially if they don't have stateful connections, but good channel layer design will keep them all routed to the same server; while channel layers are meant to be cross-process and network-transparent, that doesn't mean they have to route every message through a central place. The Channels layout was designed to allow messages that can be done locally to be distinguished from those that must be sent elsewhere.

In fact, with the recent addition of the RedisLocalChannelLayer in the asgi_redis package, you can run servers in a standard socket-terminator and worker pair, and the channel layer will keep as much as it can local to the machine, only traversing over the network when it needs to find a specific terminated socket to send things down to another user, or for group broadcast.
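Configuration-wise, switching to the local-and-remote layer should be just a matter of pointing the channel layer backend at it. A hedged sketch of the Django settings follows; the exact dotted path, host list, and routing module name here are assumptions rather than anything quoted from the post:

# settings.py -- channel layer configuration (sketch, Channels 1.x style)
CHANNEL_LAYERS = {
    "default": {
        # RedisLocalChannelLayer keeps ordinary channels machine-local and
        # only crosses the network for reply channels and group broadcast.
        "BACKEND": "asgi_redis.RedisLocalChannelLayer",
        "CONFIG": {
            "hosts": [("redis-primary.example.com", 6379)],
        },
        "ROUTING": "myproject.routing.channel_routing",
    },
}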

Distributed systems are hard

At its core, then, Channels is solving a distributed systems problem; that of communication and broadcast. Distributed systems is an area where there's no perfect solution; you always have to pick tradeoffs. At-least-once versus at-most-once delivery is one key example; the CAP "theorem" about distributed databases is the result of others.

Channels picks a certain set of these, aimed to be the best fit for the uses and protocols that are around the Web today, especially WebSockets. Dropping frames and closing the socket is preferred to duplicating frames, for example; clients can reconnect, but having a distributed deduplication system for actions is hard unless you make everything idempotent. I'll hopefully get another post up detailing exactly what the tradeoffs I've picked are and what the alternatives would imply, but each of them is chosen for a reason.

It's never going to work for everyone; that's an unachievable goal. Its purpose, instead, is to be the solution for the 90%; something that isn't always perfect, but generally does what you want, and where the tradeoffs are offset by the massive advantage of a single common platform and the shared code and community that enables. It's much like Django, which cannot be a perfect web framework - it can't solve every problem that every developer has - but we can solve 90% of all the problems developers have in a consistent manner, and have a standard and design pattern that encourages re-use and familiarity.

The ASGI API that Channels is built on is deliberately very slim; it specifies the minimum it needs to so you can get a consistent experience across different channel layer backends, while leaving a lot up to the backend, and thus a decent amount of flexibility in how you transport messages. As you get bigger, your needs will change and specialise; the channel layer abstraction is there to allow you to try and grow inside it as long as possible, being flexible while still presenting the same basic API you were developing on when you started; channels, groups, send, and receive.

I don't expect any "top 100" site to run an unmodified ASGI channel layer, just like they wouldn't run a standard Django installation; as you get bigger and your needs specialise, what you want is a solution that leaves space for you to slowly and reliably replace it, and my goal with the design of ASGI is that, even once you remove all of the channels code, you're left with an abstraction and design that will work with many more specialised examples of distributed systems and events. Just like core Django itself, it lets you heavily modify it and replace parts while you grow, and gets out of your way once you no longer need it.

This, then, is the philosophy of Channels - a solution that is not intended to be a panacea, but instead to be a common base to help with developing applications that span multiple servers and deal with stateful protocols like WebSockets. Smaller teams, agencies, and medium size sites can use it without many changes; larger projects will likely need to specialise a channel layer backend and maybe some of the handling, but can still benefit from the developer familiarity that following the abstraction and patterns provides.

Looking Ahead

With that in mind, what is the path ahead for Channels and ASGI? WebSocket projects themselves are in their relative infancy - very few have been deployed at any appreciable scale yet, let alone using Channels - and so we have a way to go with maturity no matter what. Sites are already using Channels in production, and the feedback I've had about it has been pretty much all positive, so we're on a good path to maturity on that front.

Daphne itself is heavily based on Twisted code for HTTP handling, and Autobahn for WebSocket handling - two libraries with a strong history of stability - while ASGI is based on our experience and research into scaling eventing systems inside Eventbrite, my previous experiences with distributed messaging, industry research and case studies, and talking to others handling similar problems. It's as solid a baseline as you can reach in a situation where there's no successful open-source example to easily follow.

The feedback I got during the proposal process for putting Channels into Django 1.10 (it did not get in before the deadline, but will still be developed as an external app with 1.10) was valuable; some of the more recent changes and work, such as backpressure and the local-and-remote Redis backend, are based on feedback from that process, and I imagine more tweaks will emerge as more things get deployed on Channels.

That said, I think the fundamental design abstraction - distributed message-passing - is a solid one, and a sensible API to build Django applications against in future as the needs and complexity of Web applications grow beyond simple request-response handling. This is a space where Django and Python have the opportunity to help lead the way in structuring and running these sorts of applications.

I'm also interested in taking the ASGI message formats and common interface standard into the PEP process, but before that I'm going to reach out to other web frameworks to make sure that it's something that truly works across framework boundaries, as always intended, and to try and work out potential issues in real-world scenarios.

I'm unsure quite what the future holds for Channels - the ideal would be for it to open up Django and Python as the solution to a much greater class of problems than people currently use them for, bringing the positive points of the language, framework and communities to a growing audience of developers faced with writing these large, stateful-protocol systems. It might also be that it ends up just being the WebSocket management layer for Django, but even for that purpose, it's important to get it designed well.

I hope this has illuminated some of the missing context and plans behind Channels; community feedback is incredibly important to this whole process, and so if this helped, or if you still have more questions, please get in touch and let me know. It's important that everyone understands both the implementation and the context of the problem it solves - one is nothing without the other - and I hope that, going forwards, we can have a clear idea of what they both mean together.

Marco Slot: COPY into distributed PostgreSQL tables, up to 7M rows/sec

From Planet PostgreSQL. Published on Jun 15, 2016.

COPY into distributed PostgreSQL tables, up to ~7M rows/sec

In the recent 5.1 release of the Citus extension for PostgreSQL, we added the ability to use the COPY command to load data into distributed tables. PostgreSQL's COPY command is one of the most powerful bulk loading features of any database and it is even more powerful on distributed tables. 

To get a sense of the achievable ingestion rate for COPY on distributed tables, I set up a Citus cluster on EC2 with a c4.8xlarge master node and 4 i2.4xlarge workers. If you want to get started more quickly, you can use Citus Cloud to immediately provision a fully managed Citus cluster on top of AWS. I used the US names dataset containing ~5.7 million rows and created a distributed table as follows:

CREATE TABLE names (
  state text,
  gender text,
  year int,
  name text,
  number int
);
SELECT master_create_distributed_table('names','name','hash');
SELECT master_create_worker_shards('names',16,1);

To get the best ingestion rate, I split the input into 16 chunks and loaded them in parallel using the following commands:

split -n l/16 names.csv chunks/
find chunks/ -type f | time xargs -n1 -P16 sh -c "psql -c \"\\COPY names FROM '\$0' WITH (FORMAT CSV)\""
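If you would rather drive the parallel load from Python than from xargs, a rough equivalent using psycopg2's COPY support might look like the sketch below; the connection string and chunk directory are placeholders, and 16 workers simply mirrors the shell example:

# parallel_copy.py -- load pre-split CSV chunks in parallel (sketch)
import glob
from multiprocessing import Pool

import psycopg2

DSN = "host=master-node dbname=postgres"   # placeholder connection string

def load_chunk(path):
    conn = psycopg2.connect(DSN)
    with conn, conn.cursor() as cur, open(path) as f:
        # Same effect as \COPY in psql: stream the file through the client.
        cur.copy_expert("COPY names FROM STDIN WITH (FORMAT CSV)", f)
    conn.close()

if __name__ == "__main__":
    with Pool(16) as pool:                 # one worker per chunk
        pool.map(load_chunk, sorted(glob.glob("chunks/*")))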

I repeated the data loading command multiple times for different set-ups (local/distributed, with/without replication, and with/without an index on the year column) and put the results in the table below.

Table type  | Index | Replication | Ingestion rate
------------+-------+-------------+---------------
Local       | No    | N/A         | 1.5M rows/sec
Local       | Yes   | N/A         | 0.5M rows/sec
Distributed | No    | No          | 7M rows/sec
Distributed | No    | Yes         | 5M rows/sec
Distributed | Yes   | No          | 2.5M rows/sec
Distributed | Yes   | Yes         | 1.2M rows/sec

In the fastest set-up, Citus loaded up to 7 million rows/sec (finished in under a second). In the most realistic set-up, with replication and an index on the distributed table, COPY still achieves over 1M rows/sec for the names dataset. We recently saw a sustained 1.5M rows/sec in a production cluster, loading 10 billion rows in a few hours.

Ingestion rate depends on many factors such as the number of columns, data types, hardware, and indexes, and benchmarks are unlikely to be representative of your use-case. Nonetheless, these numbers show that COPY, even with sharding, replication and indexes, can achieve a write throughput that rivals that of very large NoSQL clusters. At the same time, Citus can do both low-latency lookups and parallel analytical queries, for example:

postgres=# SELECT name, count(*) FROM names WHERE year >= 2000 GROUP BY name ORDER BY 2 DESC LIMIT 5;
  name  | count 
--------+-------
 Riley  |  1612
 Jordan |  1573
 Avery  |  1557
 Taylor |  1517
 Peyton |  1497
(5 rows)

Time: 118.176 ms (560 ms on regular Postgres, 2 seconds on BigQuery)

The combination of fast ingestion and fast queries makes Citus very suitable for applications such as real-time, interactive analytics.

In an upcoming post, we'll look at a way of scaling out COPY on Citus horizontally, allowing arbitrarily high ingestion rates.

Simon Riggs: Finding your PostgreSQL version

From Planet PostgreSQL. Published on Jun 15, 2016.

Questions

How do you tell what version of PostgreSQL you are using?
What is the version number of your PostgreSQL server?
What’s the release of your Postgres server?
What release is your Postgres database?
How do I write a program that acts differently for different versions of the PostgreSQL server?

Answer

For a long time it’s been possible to write a query to find out what version your server is.

SELECT version();

or you could look at that information via the preset parameter

SHOW server_version;

Both of those mechanisms refer to an internal string, accessible to C programs as PG_VERSION

Since PostgreSQL 8.2 we’ve been able to access the version number in a numeric form, allowing us to write programs that depend upon version number for any server version from 8.2 upwards to current versions and later.

SHOW server_version_num;

You can use that more easily within an SQL query like this

SELECT current_setting('server_version_num');

postgres=# SELECT current_setting('server_version_num');
current_setting
-----------------
90600
(1 row)

Or you can access this value more easily using the PostgreSQL C API, PQserverVersion()

https://www.postgresql.org/docs/current/static/libpq-status.html

int PQserverVersion(const PGconn *conn);

These mechanisms access the internal macro PG_VERSION_NUM

These numeric values are much easier to use programmatically, allowing you to write programs that vary their behaviour across releases.

e.g. if (version < 90500)

This is mainly useful for accessing in-core functions or in-core Catalog Views.
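The same number is also exposed by most client drivers. For example, psycopg2 surfaces libpq's PQserverVersion() as connection.server_version, so a Python program can branch on it much like the C snippet above; the connection string below is a placeholder:

# version_check.py -- branch on the server version from Python (sketch)
import psycopg2

conn = psycopg2.connect("dbname=postgres")   # placeholder connection string

# psycopg2 exposes libpq's PQserverVersion() as an integer, e.g. 90600.
if conn.server_version >= 90600:
    print("Parallel query is available on this server")
else:
    print("Running something older than 9.6")

conn.close()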

These versions will be maintained when we go to release 10.0, so we expect this…

postgres=# SELECT current_setting('server_version_num');
current_setting
-----------------
100000
(1 row)

so that 10.0 > 9.6 numerically.

A good example of such usage is marking your SQL functions safe to use with the new Parallel Query feature in 9.6 and above.

DO $$
 DECLARE ver integer;
 BEGIN
  SELECT current_setting('server_version_num') INTO ver;
  IF (ver >= 90600) THEN
   EXECUTE 'alter function f7() parallel safe;';
  END IF;
 END
$$;

Bruce Momjian: Be Prepared for PREPARE

From Planet PostgreSQL. Published on Jun 15, 2016.

Prepared statements are queries where the constants used in the query are separated from the query text. While this improves security by avoiding SQL injection attacks, it also allows repeatedly-executed queries to avoid parsing and planning overhead by executing saved generic plans that represent queries with typical constants. While generic plans don't have the advantage of being planned based on the statistics of specific constants, the avoidance of parsing and planning overhead is often more beneficial.

Before Postgres 9.2, generic plans were used for all prepared queries. In Postgres 9.2, logic was added to use a generic plan only if it has a cheaper cost after five or more executions. Unfortunately, this behavior was undocumented, causing confusion for users surprised to find the EXPLAIN plans of prepared queries changing after five executions, sometimes for the worse.
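You can watch this behavior for yourself with a loop like the following sketch; the table, prepared statement name, and constant are made up, the connection string is a placeholder, and the plan flip after the fifth execution is the documented 9.2+ behavior described above:

# prepare_demo.py -- watch a prepared query switch to a generic plan (sketch)
import psycopg2

conn = psycopg2.connect("dbname=postgres")   # placeholder connection string
cur = conn.cursor()

cur.execute("PREPARE by_id(int) AS SELECT * FROM pg_class WHERE oid = $1")

for i in range(7):
    # From the sixth execution onward, the planner may substitute a cached
    # generic plan if it is not more expensive than the custom ones.
    cur.execute("EXPLAIN EXECUTE by_id(%s)", (1259,))
    print("run", i + 1, cur.fetchall()[0][0])

cur.close()
conn.close()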

After much discussion, this has been remedied by improving the Notes section of the 9.6 PREPARE manual page. It now explains how generic plans are created, and when they are used. I have also created an SQL script that, when run through psql (output), illustrates the documented behavior.

Continue Reading »

Alexander Korotkov: RuntimeAppend in Pg_pathman: Achievements and New Challenges

From Planet PostgreSQL. Published on Jun 15, 2016.

When dealing with partitioned tables, we can't always select the relevant partitions during query planning. Naturally, at planning time you can't know values that come from a subquery or from the outer side of a nested loop join. Nevertheless, it would be ridiculous to scan all the partitions in such cases.

This is why my Postgres Professional colleague Dmitry Ivanov developed a new custom executor node for pg_pathman: RuntimeAppend. This node behaves like a regular Append node: it contains a set of child nodes to be appended. However, RuntimeAppend has one distinction: on each run it selects only the relevant children to append, based on the parameter values.

Let’s consider an example: a join of the journal table, which contains one row for every 30 seconds of the year and is partitioned by day, with the q table, which references 1000 random rows of the journal table. Without RuntimeAppend the optimizer selects a Hash Join plan.

Regular Append: Hash Join

# EXPLAIN ANALYZE SELECT * FROM q JOIN journal j ON q.dt = j.dt;
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=27.50..25442.51 rows=1000 width=56) (actual time=0.479..252.506 rows=1000 loops=1)
   Hash Cond: (j.dt = q.dt)
   ->  Append  (cost=0.00..21463.01 rows=1051201 width=49) (actual time=0.005..152.258 rows=1051201 loops=1)
         ->  Seq Scan on journal_1 j  (cost=0.00..58.80 rows=2880 width=49) (actual time=0.004..0.247 rows=2880 loops=1)
         ->  Seq Scan on journal_2 j_1  (cost=0.00..58.80 rows=2880 width=49) (actual time=0.001..0.208 rows=2880 loops=1)
         ->  Seq Scan on journal_3 j_2  (cost=0.00..58.80 rows=2880 width=49) (actual time=0.001..0.197 rows=2880 loops=1)
         ...............................................................................................................
         ->  Seq Scan on journal_366 j_365  (cost=0.00..1.01 rows=1 width=49) (actual time=0.001..0.001 rows=1 loops=1)
   ->  Hash  (cost=15.00..15.00 rows=1000 width=8) (actual time=0.185..0.185 rows=1000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 48kB
         ->  Seq Scan on q  (cost=0.00..15.00 rows=1000 width=8) (actual time=0.003..0.074 rows=1000 loops=1)
 Planning time: 29.262 ms
 Execution time: 256.337 ms
(374 rows)

The Hash Join takes 256 milliseconds to execute and 29 milliseconds to plan. The relatively high planning time is expected because all the partitions are present in the plan. It's surprising that the optimizer didn't select a Nested Loop join. Let's force it to do so with enable_hashjoin = off and enable_mergejoin = off.

Regular Append: Nested Loop

# EXPLAIN ANALYZE SELECT * FROM q JOIN journal j ON q.dt = j.dt;
                                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.28..170817.00 rows=1000 width=56) (actual time=1.091..452.658 rows=1000 loops=1)
   ->  Seq Scan on q  (cost=0.00..15.00 rows=1000 width=8) (actual time=0.006..0.158 rows=1000 loops=1)
   ->  Append  (cost=0.28..167.14 rows=366 width=49) (actual time=0.218..0.438 rows=1 loops=1000)
         ->  Index Scan using journal_1_dt_idx on journal_1 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.001..0.001 rows=0 loops=1000)
               Index Cond: (dt = q.dt)
         ->  Index Scan using journal_2_dt_idx on journal_2 j_1  (cost=0.28..0.46 rows=1 width=49) (actual time=0.001..0.001 rows=0 loops=1000)
               Index Cond: (dt = q.dt)
         ->  Index Scan using journal_3_dt_idx on journal_3 j_2  (cost=0.28..0.46 rows=1 width=49) (actual time=0.001..0.001 rows=0 loops=1000)
               Index Cond: (dt = q.dt)
         ......................................................................................................................................
         ->  Index Scan using journal_366_dt_idx on journal_366 j_365  (cost=0.12..0.15 rows=1 width=49) (actual time=0.001..0.001 rows=0 loops=1000)
               Index Cond: (dt = q.dt)
 Planning time: 29.922 ms
 Execution time: 456.140 ms
(737 rows)

The Nested Loop join takes 456 milliseconds to execute. This is even worse. But this is understandable because we have to scan each partition of journal for each row of q.

Finally, let’s enable RuntimeAppend.

RuntimeAppend

# EXPLAIN ANALYZE SELECT * FROM q JOIN journal j ON q.dt = j.dt;
                                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.28..481.67 rows=1000 width=56) (actual time=0.041..9.911 rows=1000 loops=1)
   ->  Seq Scan on q  (cost=0.00..15.00 rows=1000 width=8) (actual time=0.005..0.079 rows=1000 loops=1)
   ->  Custom Scan (RuntimeAppend)  (cost=0.28..0.46 rows=1 width=49) (actual time=0.003..0.003 rows=1 loops=1000)
         ->  Index Scan using journal_330_dt_idx on journal_330 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.003..0.003 rows=1 loops=5)
               Index Cond: (dt = q.dt)
         ->  Index Scan using journal_121_dt_idx on journal_121 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.004..0.004 rows=1 loops=1)
               Index Cond: (dt = q.dt)
         ->  Index Scan using journal_37_dt_idx on journal_37 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.003..0.003 rows=1 loops=4)
               Index Cond: (dt = q.dt)
         ................................................................................................................................
         ->  Index Scan using journal_355_dt_idx on journal_355 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.003..0.003 rows=1 loops=1)
               Index Cond: (dt = q.dt)
 Planning time: 30.775 ms
 Execution time: 8.615 ms
(687 rows)

The Nested Loop join with RuntimeAppend takes only about 9 milliseconds to execute! Such fast execution is possible because RuntimeAppend scans only the one relevant partition of journal for each row of q.

Nevertheless, all the partitions are still present in the plan, so planning time remains quite high. That may not matter much for prepared statements or long OLAP queries.

However, long planning time turns out not to be the only problem. We ran a benchmark in which the RuntimeAppend node returns just a few rows from a prepared statement. Although planning time doesn't affect prepared statements, TPS was a few times lower than it was without partitioning. After running perf, we got this flamegraph, which shows that we spend a very significant amount of time locking and unlocking every partition. Naturally, locking 365 partitions can't use fast-path locking and turns out to be significant overhead.

Thus, we see how huge a benefit runtime partition selection can bring. However, in the current design, having all the partitions in the plan causes high overhead. A solution could be found in redesigning partition locking, and we are researching this problem now. It's likely that it can't be solved within the boundaries of an extension, and a proper solution will require hacking the PostgreSQL core.

I'm Gonna Regret This

By chrism from . Published on Jun 14, 2016.

A plea for liberals to fight for individual rights.

PyCon 2016 Recap

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jun 14, 2016.

PyCon, beyond being the best community event for Python developers, is also an event that we happily began thinking about eleven months ago. Almost as soon as PyCon 2015 ended, we had the good fortune of planning the look and feel of PyCon 2016 with organizer extraordinaires Ewa Jodlowska, Diana Clark, and new this year, Brandon Rhodes. Our team has loved working with the organizers on the PyCon websites for the past three years now. They’re great people who always prioritize the needs of PyCon attendees, whether that’s babysitting services or a smooth PyCon web experience.

Seeing the PyCon 2016 Artwork

The Caktus team arrived in Portland and were almost immediately greeted with large-scale versions of the artwork our team made for PyCon. Seeing it on arrival, throughout the event, and especially during the keynotes was surreal.

PyCon 2016 sponsor banner

Getting ready for the tradeshow

Our team got ready for the booth first, ahead of the PyCon Education Summit and Sponsor Workshops where we had team members speaking. Here’s the booth before everyone came to grab t-shirts and PyCon tattoos and to learn more about us.

The Caktus booth at PyCon before the festivities begin.

Here’s a closeup of our live RapidPro dashboard too.

The RapidPro live dashboard Caktus built for PyCon.

Supporting our team members

This year, at the PyCon Education Summit, Rebecca Conley spoke about expanding diversity in tech by increasing early access to coding education. Erin Mullaney and Rebecca Muraya spoke at a Sponsor Workshop on RapidPro, UNICEF’s SMS application platform. Sadly, we didn’t get a picture of Rebecca C, but Erin shared this picture of herself and Rebecca M. on Twitter.

Erin and Rebecca M. after giving their RapidPro talk at PyCon.

Tradeshow time!

PyCon, for our booth team, is always intense. Here’s a taste of the crowds across three days.

A busy crowd around the Caktus booth.

The excitement, of course, included a giveaway. Here’s the winner of our BB8 Sphero Ball raffle prize, Adam Porad of MetaBrite, with our Sales Director, Julie White:

PyCon attendee wins the Caktus BB8 Sphero giveaway.

So many talks

With our office almost empty and most of our team at PyCon, there were a lot of talks we went to, too many to list here (don’t worry, we’re going to start highlighting the talks in our annual PyCon Must See Series). We do want to highlight one of the best things about the talks— the representation of women, as described by the PyCon Diversity chair:

Across three packed days, here’s some of the topics we got to learn more about: real time train detection, inclusivity in the tech community, and better testing with less code. With the videos now available, we can still catch all the great talks even if we couldn’t be there.

PyLadies auction

One of the highlights of PyCon is definitely the PyLadies auction. Every year, it’s a raucous event that’s just plain fun. This year, we contributed original concept art for the PyCon 2016 logo. It went for $650 to Jacob Kaplan-Moss, the co-creator of Django. Since we’re a Django shop, there definitely was quite a bit of excited fandom for us.

Jacob Kaplan-Moss holds won auction item: Caktus' early concept art for PyCon 2016 logo

And we can’t leave without a cookie selfie

Whoever came up with the cookie selfie idea is brilliant. Here’s Technical Director Mark Lavin with his cookie selfie.

Hope to see you next year!

In the meantime, make sure to return to our blog for our annual PyCon Must See Series.

Hubert 'depesz' Lubaczewski: Incrementing counters in database

From Planet PostgreSQL. Published on Jun 14, 2016.

Some time ago someone on irc asked about creating fast counters for something (banners I think). I talked with her (him?) about it, but figured, I can as well write a blogpost, so others can use it too. First the general idea: we want to store incrementable counters, with optional trade off that “we can […]

Greg Sabino Mullane: Postgres migration speedup with table change analysis

From Planet PostgreSQL. Published on Jun 13, 2016.


(A Unicode rabbit face 🐰 will never be as cute
as this real bunny. Photo by Wade Simmons)

One of our clients recently reached out to us for help in upgrading their Postgres database. The use of the pg_upgrade program was not an option, primarily because the client was also taking the opportunity to change from their SQL_ASCII encoding to UTF-8. (If any of your databases, gentle reader, are still SQL_ASCII, please do the same!). Naturally, we also took advantage of the lack of pg_upgrade to enable the use of data checksums, another action we highly recommend. Although there were plenty of wrinkles, and stories to be told about this migration/upgrade, I wanted to focus on one particular problem we had: how to detect if a table has changed.

We needed to know if any applications were modifying certain tables because the speed of the migration was very important. If we could assert that no changes were made, there were some shortcuts available that would greatly speed things up. Initial testing showed that the migration was taking over eight hours, a time unacceptable to the client (no worries, we eventually reduced the time to under an hour!).

Looking closer, we found that over half that time was spent converting a single small (50MB) table from SQL_ASCII to UTF-8. How this conversion was performed is a story for another day, but suffice to say the table had some really, really messy bytes inside of it; the conversion program had to struggle mightily. When you are converting a database to a new encoding, it is imperative to examine every byte and make sure it gets changed to a format that Postgres will accept as valid UTF-8, or the entire table import will fail with an error similar to this:

ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5

Looking closer at the data in the table showed that it might - just might! - be a historical table. In other words, it no longer receives updates, just selects. We really wanted this to be true, for it meant we could dump the whole table, convert it, and simply load the converted table into the new database (which took only a few seconds!). First, however, we had to confirm that the table was not changing.

Detecting changes may be done in several ways. For all of them, you can never prove that the table shall not change at some point in the future, but you can prove that it has not changed over a certain period of time. How you go about doing that depends on what kind of access you have. If you do not have super-user access, you could add a simple trigger to the table that updates another table when an update, insert, or delete is performed. Then, checking in on the second table will indicate if any changes have been made.

A better solution is to simply look at the underlying file that makes up the table. To do this, you need to be a Postgres superuser or have access to the underlying operating system. Basically, we will trust the operating system's information on when the table was last changed to determine if the table itself has changed. Although not foolproof, it is an excellent solution. Let's illustrate it here. First: create a test table and add some rows:

$ psql
greg=# CREATE TABLE catbox AS SELECT 8675309::INT AS id FROM generate_series(1,1000);
SELECT 1000

Now we can use the pg_stat_file() function, which returns some basic information about a file on disk. We need to construct a path to the file that holds the data for our test table. Once we do that, we can see when it was last modified:

greg=# select * from pg_stat_file(format('base/%s/%s', (select oid from pg_database where datname=current_database()),
greg-#  (select relfilenode from pg_class where relname='catbox'))) \x\g
Expanded display is on.
-[ RECORD 1 ]+-----------------------
size         | 40960
access       | 2015-11-08 22:36:00-04
modification | 2015-11-08 22:36:00-04
change       | 2015-11-08 22:36:00-04
creation     | 
isdir        | f

Next we will revisit the table after some time (e.g. 24 hours) and see if the "modification" timestamp is the same. If it is, then the table data has not been modified. Unfortunately, a false positive is possible due to VACUUM, which may change things on disk but does NOT change the data itself. (A regular VACUUM *may* modify the file, and a VACUUM FULL *always* modifies it).

greg=# select * from pg_stat_file(format('base/%s/%s', (select oid from pg_database where datname=current_database()),
greg-#  (select relfilenode from pg_class where relname='catbox'))) \x\g

-[ RECORD 1 ]+-----------------------
size         | 40960
access       | 2015-11-08 22:36:00-04
modification | 2015-11-08 22:36:00-04
change       | 2015-11-08 22:36:00-04
creation     | 
isdir        | f


greg=# vacuum catbox;
VACUUM

greg=# select * from pg_stat_file(format('base/%s/%s', (select oid from pg_database where datname=current_database()),
greg-#  (select relfilenode from pg_class where relname='catbox')));

-[ RECORD 1 ]+-----------------------
size         | 40960
access       | 2015-11-08 22:36:00-04
modification | 2015-11-08 22:40:14-04
change       | 2015-11-08 22:40:14-04
creation     | 
isdir        | f

A second (and more foolproof) method is to simply generate a checksum of the entire table. This is a fairly straightforward approach; just pump the output of pg_dump to a checksum program:

$ pg_dump -t catbox --data-only | sha1sum
6f724565656f455072736e44646c207472537361  -

The advantage here is that even a VACUUM FULL will not change the checksum. However, because pg_dump does no ORDER BY when dumping out the table, it is possible for the rows to be returned in a different order. To work around that, issue a VACUUM FULL yourself before taking the checksum. As before, come back later (e.g. 24 hours) and re-run the command. If the checksums match, then the table has not changed (and is probably no longer updated by the application). By using this method, we were able to verify that the large, SQL_ASCII byte-soup table was indeed not being updated, and thus we took it out of the direct migration.
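If you need to repeat that compare-later step on a schedule, a tiny wrapper along these lines makes it mechanical. This is a hypothetical helper, not anything from the migration itself: the table name is the example table from above, and it assumes pg_dump can connect non-interactively and that you keep the earlier digest somewhere to compare against.

# table_digest.py -- checksum a table's contents via pg_dump (sketch)
import hashlib
import subprocess

def table_digest(table, dbname="postgres"):
    # Same idea as `pg_dump -t <table> --data-only | sha1sum`, done in-process.
    dump = subprocess.run(
        ["pg_dump", "-t", table, "--data-only", dbname],
        check=True, capture_output=True,
    )
    return hashlib.sha1(dump.stdout).hexdigest()

if __name__ == "__main__":
    print(table_digest("catbox"))   # compare against the digest taken earlier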

Of course, that table needed to be part of the new database, but we simply dumped the table, ran the conversion program on it, and (four hours later), had a complete dump of the table that loads extremely fast into the new database.

That solved only one of the problems, however; another table was also slowing down the migration. Although it did not have the SQL_ASCII conversion issue, it was a large table, and took a large percentage of the remaining migration time. A quick look at this table showed it had a "creation_time" column as well as a SERIAL primary key, and was obviously being updated quite often. Close examination showed that it was possible this was an append-only table, such that older rows were never updated. This called for a similar approach: could we prove that a large chunk of the table was not changing? If we could, we could pre-populate the new database and copy over only the most recent rows during the migration, saving a good bit of time.

The previous tricks would not work for this situation, because the underlying file would change constantly as seen by pg_stat_file(), and a pg_dump checksum would change on every insert. We needed to analyze a slice of the table - in this particular case, we wanted to see about checksumming all rows except those created in the last week. As a primary key lookup is very fast, we used the "creation_time" column to determine an approximate primary key to start with. Then it was simply a matter of feeding all those rows into the sha1sum program:

greg=# CREATE TABLE catbox2 (id SERIAL PRIMARY KEY, creation_time TIMESTAMPTZ);
CREATE TABLE
greg=# INSERT INTO catbox2(creation_time) select now() - '1 year'::interval + (x* '1 hour'::interval) from generate_series(1,24*365) x;
INSERT 0 8760

greg=# select * from catbox2 where creation_time > now()-'1 week'::interval order by 1 limit 1;
  id  |         creation_time         
------+-------------------------------
 8617 | 2016-06-11 10:51:00.101971-08

$ psql -Atc "select * from catbox2 where id < 8617 order by 1" | sha1sum
456272656d65486e6f203139353120506173733f  -

## Add some rows to emulate the append-only nature of this table:
greg=# insert into catbox2(creation_time) select now() from generate_series(1,1000);
INSERT 0 1000

## Checksums should still be identical:
$ psql -Atc "select * from catbox2 where id < 8617 order by 1" | sha1sum
456272656d65486e6f203139353120506173733f  -

Despite the large size of this table (around 10 GB), this command did not take that long to run. A week later, we ran the same commands, and got the same checksum! Thus, we were able to prove that the table was mostly append-only - or at least enough for our use case. We copied over the "old" rows, then copied over the rest of the rows during the critical production migration window.

In the future, this client will be able to take advantage of pg_upgrade, but getting to UTF-8 and data checksums was absolutely worth the high one-time cost. There were several other tricks used to speed up the final migration, but being able to remove the UTF-8 conversion of the first table, and being able to pre-copy 99% of the second table accounted for the lion's share of the final speed improvements.

My First Conference Talk: Reflecting on Support and Inclusivity at DjangoCon Europe 2016

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jun 13, 2016.

The environment at Caktus is, above all, one of encouragement. I experienced that encouragement as an intern and continue to experience it as a full-time developer. In addition to providing workplace mentorship, Caktus encourages all of its employees to submit talks to conferences. My manager Mark Lavin and mentor Karen Tracy encouraged me to get over my concerns about being new to the field and to start submitting talks.

Along with the support from Caktus was the impetus from Djangocon Europe for first-time speakers to submit. Djangocon Europe’s Call For Papers (CFP) includes suggested topics and offers of support beginning with brainstorming a topic and including mentors if your talk is chosen. I took them up on this offer and floated a couple of ideas over email. I got a very quick response with the suggestion that I expand a previous blog post I had written on my mid-career transition into a talk. Baptiste Mispelon and Xavier Dutreilh continued to be helpful and responsive throughout the application process and made me feel like my contribution was valued and that I was being taken seriously, whether my talk would be ultimately selected or not.

A week later, I received the notification that my talk was selected. The support continued from Caktus, the broader local development community, and the Djangocon Europe organizers. Mark helped me refine my talk content, and Caktus coworkers and Pyladies helped me organize public previews of the talk. Djangocon Europe opened a slack mentor channel in which I was able to ask a lot of questions about talks in general and about how to communicate effectively with an international audience. The refinement and confidence gained from these experiences helped send me to Europe excited about giving my first talk.

The organizers made travel easy, opening a Slack channel for attendees to ask general questions about the conference. I arrived in Budapest, got a shuttle to the hotel, and checked into my room. Then I got on the Slack #speakers channel and asked if anyone wanted to join me for dinner. I ended up with two fantastic dinner companions, Andrew Godwin and Anna Schneider. Over dinner I learned about developing for non-profits, London’s economic development, and many other fascinating things.

Budapest is beautiful and extremely friendly and walkable. In general, google maps and google translate worked to help me get around after initially leading me astray on my first walk to the venue. Once I arrived, I was greeted with signs telling me I was welcome, I looked awesome, what the code of conduct was, and what phone numbers and email to use if I had any concerns. The conference was well staffed with friendly folks to direct attendees and to answer questions. The food and snacks were good. There were dedicated quiet spaces and a separate prayer room. Attention to all of these details showed that the conference organizers carefully considered the needs and comfort of all the attendees in their planning and made us feel valued.

Throughout the conference and afterwards, Djangocon Europe showed particular dedication to the Code of Conduct and the principles behind it, which generally amount to “be kind to others.” All conference slides were reviewed by the organizers to make sure they adhered to the Code of Conduct. Light-hearted but direct signs in the bathrooms made it clear that gender non-conforming attendees were welcome and that their safety and comfort were important. During the conference, an announcement was made regarding a slide that had been added after the screening and brought to the attention of the organizers as a violation of the Code of Conduct. This announcement demonstrated that complaints were taken seriously and handled quickly. It served to make us all feel that our safety and comfort was a priority. I even saw an interaction on Slack that enforced these values of inclusion and kindness. The conversation started with someone giving a lightning talk asking an organizer to screen his slides. The organizer pointed out a slide with a photo that could be seen as objectifying the woman in the photo. The speaker agreed and removed the slide. It was a simple interaction from which everyone learned. As a female speaker, I felt that the organizer was absolutely looking after both my interests and the interests of the presenter. Soon after the conference, Djangocon Europe published a Transparency Report detailing protections put in place, issues that arose, and the way those issues were handled. No conference can completely control attendee behavior, but attentiveness and transparency like this should set the standard for how conferences create safe and inclusive environments.

DjangoCon Europe 2016 venue

The venue was very attractive and comfortable, with a small theater for the talks as well as a library and balcony where talks were streamed for those who wanted a smaller, quieter setting. Having those options definitely helped me enjoy the conference as I had speech preparation in mind, along with getting the most I could from the talks.

The first talks emphasized the tone of welcoming and mutual respect. In their talk, “Healthy Minds in a Healthy Community” Erik Romijn & Mikey Ariel spoke frankly and personally about the struggles many of us face in the open source community to maintain physical and mental health while facing the demands of our jobs as well as the added desire to contribute to open source projects. As a new developer, it was really important for me to hear that all the people I perceive as “rockstars” and “ninjas” are just as human as I am and that we all need to take care of each other. It also inspired me to reflect on my gratitude that I work at Caktus, where we are all valued as people and our health and happiness is a priority of the company.

The talks were all fantastic, a nice blend of topics from debugging to migrations to the challenge and necessity of including languages that don’t read left-to-right, given by women and men from all over the world. I felt honored to be among them and pleased that the organizers felt a mid-career transition into programming merited a speaking slot. The whole experience continued to be enjoyable, especially the speaker dinner consisting of traditional Hungarian food. At the dinner I had the chance to learn about the developing tech scene in Eastern Europe and the assumptions we had about each other on either side of the Iron Curtain in the mid 1980’s. Software itself is impressive. However, it is only when we get to understand the people who are making it and the people for whom we are making it that software’s real meaning and value become evident. Djangocon Europe definitely facilitated that kind of understanding. Another highlight of the evening was receiving my speaker gift, some souvenirs from Budapest and a handwritten note thanking me for participating, which made me feel very appreciated.

Before a talk.

My talk was on the last morning, and while I expected everyone to be tired from the party the night before with live music in one of Budapest’s “ruin pubs,” there was a good crowd. The emcee Tomasz Paczkowski did a great job preparing speakers, including me, before we spoke and enforcing the “no comments, only questions” policy after we finished speaking. Speakers were also given the option to have no questions from the stage. I didn’t choose that, but I see how that option would be valuable to some speakers.

What I didn’t know when I first submitted my talk was that it was a single-track conference. I learned that when I saw the schedule. My audience was the whole conference, as it was for all the speakers. It was daunting at first to know that all eyes would be on me (at least the eyes of everyone who chose to attend a talk). I went into the room the night before and stood on stage, looking at several hundred empty chairs and absorbing the idea that they would be full of people watching me the next day. Fortunately I now knew who at least some of these people were, and I had seen in general how they responded positively to each other and to other speakers. I have performed as a dancer in front of large crowds plenty of times, but had never given a professional talk to an audience of that size. The beauty of the space and the familiarity of being on stage certainly helped ease my apprehension.

By the last day of the conference, I felt so comfortable and appreciated that I enjoyed giving my talk, From Intern to Professional Developer: Advice on a Mid-Career Pivot, immensely. I was a little bit nervous, but just enough for it to be motivating. A number of people made a point of encouraging me throughout the week and being present as smiling faces close to the front during my talk. It went by quickly. I tried to remember to breathe and look up (both things I forget to do when I’m nervous). The crowd was polite and responsive. I got some good questions and follow-up from people who had made a similar transition or were thinking about it, as well as questions from some hiring managers. I felt like I was able to make a valuable contribution to the conference and to the community through my talk, and I am grateful to Djangocon Europe and Caktus for making it all possible.

Kaarel Moppel: Estimating table bloat in PostgreSQL

From Planet PostgreSQL. Published on Jun 13, 2016.

Recently there have been a couple of very exhaustive and rather technical posts on Planet Postgresql on the topic of bloat. Bloat in short, for those new to Postgres,  is an unescapable byproduct of Postgres‘ excellent concurrency-enablement model called MVCC (multi-version concurrency control), that creates new copies of rows when you update data in a concurrent […]

The post Estimating table bloat in PostgreSQL appeared first on Cybertec - The PostgreSQL Database Company.

Nikolay Shaplov: Postgres: Reloptions string -> enum

From Planet PostgreSQL. Published on Jun 12, 2016.

While rewriting the reloptions.c code, I came to the idea that all string reloptions used in the code are actually enum reloptions: just a fixed list of string constants, nothing more.


One string option is gist buffering, with the following validate function:
static void
gistValidateBufferingOption(char *value)
{
     if (value == NULL ||
         (strcmp(value, "on") != 0 &&
          strcmp(value, "off") != 0 &&
          strcmp(value, "auto") != 0))
     {
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("invalid value for \"buffering\" option"),
               errdetail("Valid values are \"on\", \"off\", and \"auto\".")));
     }
}


Another one is check_option for views. That is also a list of just two options:

void
validateWithCheckOption(char *value)
{
     if (value == NULL ||
         (pg_strcasecmp(value, "local") != 0 &&
          pg_strcasecmp(value, "cascaded") != 0))
     {
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("invalid value for \"check_option\" option"),
                  errdetail("Valid values are \"local\" and \"cascaded\".")));
     }
}


So the first idea, that came to me was to add a enum type, and switch the code to use it instead of string.

The second idea was to get rid of the string type altogether. When parsing of reloptions is done, all the data is stored as a bytea binary chunk that is copied from one memory context to another. If there is no string data there, you can just work with it as a C structure. But if you have string data, you have to somehow pack it into that chunk, because otherwise it would not be copied from one memory context to another. This all requires hacks that are not beautiful at all and, as I think, should be avoided if possible. For example, this function takes the base size of the reloptions structure and adds some more space for each string option. I do not like this code:

void *
allocateReloptStruct(Size base, relopt_value *options, int numoptions)
{
	Size		size = base;
	int			i;

	for (i = 0; i < numoptions; i++)
		if (options[i].gen->type == RELOPT_TYPE_STRING)
			size += GET_STRING_RELOPTION_LEN(options[i]) + 1;

	return palloc0(size);
}


So I would ask the community to consider removing the string option type if nobody is really using it right now, 'cause it is kind of evil to me.


PS. If you would like to comment on this post, please log in with any of the social networks available here, or at least write your name, so I will be able to answer you...

Custom python-social-auth Pipeline

By GoDjango - Django Screencasts from Django community aggregator: Community blog posts. Published on Jun 10, 2016.

Sometimes when you authenticate with a social service you need to do custom "stuff" in your code base beyond creating a user object. Learn how to get started by simply adding a custom function to the python-social-auth pipeline.
Watch Now...
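As a taste of what the screencast covers, a custom step is just a function spliced into the SOCIAL_AUTH_PIPELINE setting. This is a minimal sketch: the step name, module path, and what it does with the provider response are illustrative, and the default step list is taken from python-social-auth's documented defaults of the time.

# myapp/pipeline.py -- a custom python-social-auth pipeline step (sketch)
def save_profile(backend, user, response, *args, **kwargs):
    # Runs alongside the built-in steps; what `response` contains depends on
    # the backend (e.g. backend.name might be "twitter" or "google-oauth2").
    if user and not user.first_name:
        user.first_name = response.get('first_name', '')
        user.save()

# settings.py -- splice the custom step into the default pipeline
SOCIAL_AUTH_PIPELINE = (
    'social.pipeline.social_auth.social_details',
    'social.pipeline.social_auth.social_uid',
    'social.pipeline.social_auth.auth_allowed',
    'social.pipeline.social_auth.social_user',
    'social.pipeline.user.get_username',
    'social.pipeline.user.create_user',
    'social.pipeline.social_auth.associate_user',
    'social.pipeline.social_auth.load_extra_data',
    'social.pipeline.user.user_details',
    'myapp.pipeline.save_profile',   # our custom step
)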

Shaun M. Thomas: PG Phriday: Moving to 9.5

From Planet PostgreSQL. Published on Jun 10, 2016.

There comes a day in every young database’s life that it’s time to move on. I’m sorry 9.4, but the day has come that we must say goodbye. It’s not like we haven’t had our good times. While I truly appreciate everything you’ve done for me, we must part ways. I’m far too needy, and I can’t demand so much of you in good conscience. May your future patches make you and your other suitors happy!

In all seriousness, Postgres 9.5 has been out since January 7th. Despite that, I’ve never really been an advocate of deploying dot-0 versions in production environments. Since the Postgres dev team is notoriously fanatically responsible, this is probably excessively paranoid. Still, I persist that some day, my certifiable anxiety regarding fresh releases will pay off. Until then, it’s time to examine a few reasons to start upgrading the 120+ Postgres instances littering our server fabric. Especially now that 9.5.3 is out, making it practically rock solid in terms of field testing.

JSONBetter

A common complaint about JSON and JSONB up to 9.4, was that there was no easy way to modify existing JSON objects. Adding fields was an exercise at text conversion, unpacking, concatenation, and re-casting. It was a mess. As a consequence, I recommended using it as a mere container column, or relying on PL/V8 or PL/Python to actually manipulate JSON data.

That’s no longer the case with 9.5. Not only are several standard operators overloaded to support JSONB, but they introduced the jsonb_set function for more direct manipulation. I covered this more extensively in the past, but here are a couple quick examples for reference:

-- Concatenation
 
SELECT '{"Hairy": true}'::JSONB || '{"Wobbly": false}'::JSONB;
 
             ?COLUMN?             
----------------------------------
 {"Hairy": TRUE, "Wobbly": FALSE}
 
-- Subtraction 
 
SELECT '{"Hairy": true, "Excited": false}'::JSONB - 'Excited';
 
    ?COLUMN?     
-----------------
 {"Hairy": TRUE}
 
-- Explicit jsonb_set
 
SELECT jsonb_set(
  '{"Hairy": true, "Status": {"Burning": true}}'::JSONB,
  '{Status,Burning}',
  'false'
);
 
                   jsonb_set                   
-----------------------------------------------
 {"Hairy": TRUE, "Status": {"Burning": FALSE}}

This is no less an exciting advancement than it was back in December, and could be a singular reason to upgrade depending on JSON usage saturation.

INSERT += UPDATE

One feature the MySQL camp has lorded over Postgres is the REPLACE INTO syntax. For those who don’t already know, it operates like an INSERT unless it finds an existing row for that primary key, upon which it acts more like an UPDATE. This has been a… point of contention for more than a decade.

I could find similar links until the end of time, but stopped at 2002 because this article needs to end eventually. Nary a month has gone by without someone asking about it in the Postgres mailing lists, being dissatisfied with the answer, and going back to MySQL. The only equivalent we had was a convoluted function that made use of exceptions and a loop in order to avoid a race condition. No more. Now, this is trivial:

CREATE TABLE upsert_me (
  id     INT PRIMARY KEY,
  stuff  TEXT
);
 
INSERT INTO upsert_me VALUES (1, 'Woo!');
INSERT INTO upsert_me VALUES (1, 'Woohoo!');
 
ERROR:  duplicate KEY VALUE violates UNIQUE CONSTRAINT "upsert_me_pkey"
 
INSERT INTO upsert_me VALUES (1, 'Woohoo!')
    ON CONFLICT (id) DO UPDATE
   SET stuff = EXCLUDED.stuff;
 
SELECT * FROM upsert_me;
 
 id |  stuff  
----+---------
  1 | Woohoo!

The Postgres implementation is actually safer, given the explicit control over when the replacement actually happens, under what criteria, and which columns are included. This kind of solution is exactly what the community has come to expect; the Postgres dev team never simply adds a new feature without careful and meticulous consideration. We get the same functionality, with a far more flexible and less fragile implementation.

Block Range Indexes

Otherwise known as a BRIN index, this type of indexing is geared toward VLDB installations with gargantuan tables that are generally INSERT only. They work by storing a page offset followed by a range of values represented there. This is extremely lossy, but for large sequentially organized tables, can help Postgres limit page scans to certain table regions immediately. Unlike a regular BTREE index which might require a random read per match, a BRIN index will match several potentially unnecessary pages and read them all. The primary difference here is between a random or a sequential read operation.

In the end, this requires more filtering to remove unwanted matches for values pulled in from the whole range. Yet this can be far cheaper in aggregate depending on the granularity of the ranges. Let’s try an example with 10M rows:

CREATE TABLE sensor_log (
  id            INT PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);
 
INSERT INTO sensor_log (id, location, reading, reading_date)
SELECT s.id, s.id % 1000, s.id % 100,
       '2016-12-31'::DATE - (s.id || 's')::INTERVAL
  FROM generate_series(1, 10000000) s(id);
 
CREATE INDEX idx_sensor_log_date_brin
    ON sensor_log USING BRIN (reading_date);
 
ANALYZE sensor_log;
 
\timing ON
 
SELECT COUNT(*)
  FROM sensor_log
 WHERE reading_date >= '2016-12-01'
   AND reading_date < '2016-12-02';
 
Time: 20.805 ms

How does the sample query compare to using a regular BTREE index? In my tests, the BTREE was only about 10% faster when all results were in memory; otherwise the BRIN index was 3-5x faster. But there’s another benefit that isn’t quite as obvious. Due to the lossy nature of the BRIN structure, these indexes are also orders of magnitude smaller. Let’s compare just how much:

CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
 
SELECT relname, pg_size_pretty(pg_relation_size(oid))
  FROM pg_class
 WHERE relname LIKE '%sensor_log%';
 
         relname          | pg_size_pretty 
--------------------------+----------------
 sensor_log               | 498 MB
 sensor_log_pkey          | 214 MB
 idx_sensor_log_date_brin | 32 kB
 idx_sensor_log_date      | 214 MB

This means an index that’s nearly 7000x larger is only 10% faster for this particular case under ideal circumstances. On a moderately busy warehouse system that mostly sees reporting queries spanning segments of multiple TB-size tables that would never fit in memory, the BRIN index would actually be a better use of resources. In cases where the query planner would take random page cost into account for a large result set and revert to a sequential scan, this is even more relevant. We may end up reading a small fraction of a vast table instead of the entire thing.

Thus BRIN indexes definitely have a niche role, but given the number of larger enterprises making use of Postgres for VLDB installations, that niche might be larger than we think.

Moving On

In the end, there are quite a few more features I will probably cover in more depth later. The Wiki is more than complete in addressing them, and includes cases I glossed over or skipped entirely. These are the ones I wanted to highlight due to their potential to directly impact instances I manage, or empower end-users under my purview. Any one of them would be an excellent justification to upgrade to 9.5, and taken together, they practically demand it.

Regardless, this is old news to anyone drooling over 9.6, including myself. However, following my own standards for not immediately deploying new versions, it’ll be a while before 9.6 reaches our production systems. To that end, I can rest easy knowing I’ll be writing a similar article about 9.6 when it reaches maturity. I’m glad that’s the one constant I’ve observed regarding Postgres over the years: every single upgrade is worth the trouble, often to a hilarious degree.

That’s not something you can always say in the software world, and I appreciate it.

On Python 3, again

By James Bennett from Django community aggregator: Community blog posts. Published on Jun 10, 2016.

A while back, Brett Cannon went into some detail on why Python 3 happened (that is, why it was backwards-incompatible with Python 2 and why it was backwards-incompatible in the particular ways it was). Python 3 has been pretty controversial, with people I respect chiming in on both sides of the good idea/bad idea debate. And the transition (which Brett has also taken a recent look at) has indeed been slow, but that at ...

Read full entry

Running a Humane Interview Process

By Will Larson from Django community aggregator: Community blog posts. Published on Jun 10, 2016.

I moved companies recently, which came with the requisite job search and interviewing. Having conducted four or five hundred interviews over the past two years, I felt more prepared than for any previous search, but being back on the other side of the table again really solidified some ideas for me.

The first thing which struck me is that the state of interviewing is improving: many processes now involve presenting a prepared presentation on a technical topic instead of an impromptu presentation (closer to replicating a real work task), and many have replaced the whiteboard algorithm problem with collaborative pair programming on a laptop with your editor of choice.

Looking back on my early interviewing experiences where I was once asked to do calculus on the whiteboard, it’s amazing how far things have improved.

That said, it’s certainly not the case that interviewing has improved uniformly. There is still a lot of whiteboard programming out there, and a disproportionate number of the most desirable companies continue with the practice due to the combined power of inertia (it was the state of play when many engineers and managers–including myself–entered the profession) and coarse-grained analytics (if you’re hitting your hiring goals–and with enough dedicated sourcers, any process will hit your hiring goals–then it can be hard to prioritize improving your process).

Reflecting on the interviews I’ve run over the past few years and those I got to experience recently, I believe that while interviewing well is far from easy, it is fairly simple:

  1. be kind to the candidate,
  2. ensure all interviewers agree on the role’s requirements,
  3. understand the signal your interview is checking for (and how to search that signal out),
  4. come to your interview prepared to interview,
  5. deliberately express interest in candidates,
  6. create feedback loops for interviewers and the loop’s designer,
  7. instrument and optimize like you would any conversion funnel.

You don’t have to do all of these to be effective! Start from being nice and slowly work your way through to the analytics.

Be Kind

A good interview experience starts with being kind to your candidate.

Being kind comes through in the interview process in a hundred different ways. When an interview runs over time before getting to the candidate’s questions, the kind thing to do is to allow the candidate a few minutes to ask questions instead of running on to the next interview to catch up. Likewise, in that scenario the kind thing is to then negotiate new staggered start times versus kicking off a cascade of poor interviewer time management as each person tries to fractionally catch up to the original schedule.

My experience is that you can’t conduct a kind, candidate-centric interview process if your interviewers are tightly time constrained. Conversely, if an interviewer is unkind to a candidate (and these unkindnesses are typically of the “with a whisper not a bang” variety), I believe it is very often a structural problem with your interviewing process, and not something you can reliably dismiss as an issue with that specific interviewer.

Almost every unkind interviewer I’ve worked with has been either suffering from interview burnout after doing many interviews per week for many months, or has been busy with other work to the extent that they started to view interviews as a burden rather than a contribution. To fix it, give them an interview sabbatical for a month or two and make sure their overall workload is sustainable before moving them back into the interview rotation.

(Identifying interview burnout is also one of the areas where having a strong relationship with open communication between engineering managers and recruiters is important. Having two sets of eyes looking for these signals helps.)

What Role Is This, Anyway?

The second critical step towards an effective interview loop is ensuring that everyone agrees on the role they are interviewing for, and on which skills, and to what extent, that role will require.

For some roles–especially roles which vary significantly between companies like engineering managers, product managers or architects–this is the primary failure mode for interviews, and preventing it requires reinforcing expectations during every candidate debrief to ensure interviewers are “calibrated.”

I’ve found that agreeing on the expected skills for a given role can be far harder than anticipated, and can require spending significant time with your interviewers to agree on what the role requires. (Often in the context of what extent and kind of programming experience is needed in engineering management, devops, and data science roles.)

Finding Signal

After you’ve broken the role down into a certain set of skills and requirements, the next step is to break your interview loop into a series of interview slots which together cover all of those signals. (Typically each skill is covered by two different interviewers to create some redundancy in signal detection if one of the interviews doesn’t go cleanly.)

Just identifying the signals you want is only half of the battle, though; you also need to make sure the interviewer and the interview format actually expose that signal. It really depends on the signal you’re looking for, but a few of the interview formats which I’ve found very effective are:

  1. prepared presentations on a topic: instead of asking the candidate to explain some architecture on the spur of the moment, give them a warning before the interview that you’ll ask them to talk about a given topic for 30 minutes, which is a closer approximation of what they’d be doing on the job,
  2. debugging or extending an existing codebase on a laptop (ideally on their laptop) is much more akin to the day to day work of development than writing an algorithm on the board. A great problem can involve algorithmic components without coming across as a pointless algorithmic question (one company I spoke with had me implement a full-stack auto-suggest feature for a search input box, which required implementing a prefix tree, but without framing it as yet-another-algos-question).
  3. giving demos of an existing product or feature (ideally the one they’d be working on) helps them learn more about your product, get a sense of if they have interest around what you’re doing, and helps you get a sense of how they deliver feedback and criticism.
  4. roleplays (operating off a script which describes the situation) can be pretty effective if you can get the interviewers to buy into it, allowing you to get the candidate to create more realistic behavior (architecting a system together, giving feedback on poor performance, running a client meeting, etc).

More than saying you should specifically try these four approaches (but you should!), the key point is to keep trying new and different approaches which improve your chance of finding signal from different candidates.

Be Prepared

If you know the role you’re interviewing for and know the signal which your interview slot is listening for, then the next step is showing up prepared to find that signal. Being unprepared is, in my opinion, the cardinal interview sin because it shows a disinterest in the candidate’s time, your team’s time, and in your own time. In the fortunately rare situations when I’ve been interviewed by someone who was both rude and unprepared, I still remember the unprepared part first and the rude part second.

I’ve also come to believe that interview preparedness is much more company dependent than it is individual dependent. Companies which train interviewers (more below), prioritize interviewing, and maintain a survivable interview-per-week load tend to do very well, and otherwise they just don’t.

Following from this, if you find your interviewers are typically unprepared, it’s probably a structural problem for you to iterate on improving and not a personal failing of your interviewers.

Deliberately Express Interest

Make sure your candidates know that you’re excited about them.

I first encountered this idea reading Rands’ “Wanted” article and he does an excellent job of covering it there. The remarkable thing is how few companies and teams do this intentionally: in my last interviewing process, three of the companies I spoke with expressed interest exceptionally well, and those three companies ended up being the ones I engaged with seriously.

Whenever you extend an offer to a candidate, have every interviewer send a note to them saying they enjoyed the interview (compliment rules apply: more detailed explanations are much more meaningful). At that point as an interviewer it can be easy to want to get back to “your real job”, but resist the temptation to quit closing just before you close: it’s a very powerful human experience to receive a dozen positive emails when you’re pondering if you should accept a job offer.

Feedback Loops

Interviewing is not a natural experience for anyone involved. With intentional practice you’ll slowly get better, but it’s also easy to pick up poor interviewing habits (asking brainteaser questions) or keep using older techniques (focus on whiteboard coding). As mentioned earlier, even great interviewers can become poor when experiencing interview burnout or when they are overloaded with other work.

The fix for all these issues is to ensure you build feedback loops into your process, both for the interviewers and for the designer of the interview process. Analytics (discussed in the next section) are powerful for identifying broad issues, but for actively improving your process, I’ve found that pair interviews, practice interviews and weekly sync ups between everyone strategically involved in recruiting (depending on your company’s structure, this might be recruiters and engineering managers or something else) work best.

For pair interviews, have a new interviewer (even if they are experienced somewhere else!) start by observing a more experienced interviewer for a few sessions and gradually take on more of the interview until eventually the more experienced interviewer is doing the observing. Since your goal is to create a consistent experience for your candidates, this is equally important for new hires who are experienced interviewers elsewhere as it is for a new college grad.

To get the full benefit of calibration and feedback, after the interview have each interviewer write up their candidate feedback independently before the two discuss the interview and candidate together (generally I’m against kibitzing about a candidate before the group debrief to reduce biasing later interviews based on an earlier one, but I think this is a reasonable exception given you’ve experienced the same interview together and in a certain sense calibrating on interviewing at your company is about having a consistent bias in how you view candidates, independently of who on your team interviews them).

Beyond the interviewers getting feedback, it’s also critical that the person who owns or designs the interview loop get feedback. The best place to get that is from the candidate and from the interview debrief session.

For direct feedback from candidates, in my “manager interview” sessions, I’ve started to ask every candidate how the process has been and what we could do to improve. The feedback is typically surprisingly candid, although many candidates aren’t really prepared to answer the question after five hours of interviews (it’s easy to get into the mode of surviving the interviews rather than thinking critically about the process which is being used to evaluate you). The other–more common–mechanism is to have the recruiters do a casual debrief with each candidate at the end of the day.

Both of these mechanisms are tricky because candidates are often exhausted and the power dynamics of interviewing work against honest feedback. Maybe we should start proactively asking every candidate to fill out an anonymous Glassdoor review on their interview experience. That said, this is definitely a place where starting to collect some feedback is more important than trying to get it perfect in the first pass: start collecting something and go from there.

Optimize The Funnel

Once you have the basics down, the final step of building a process which remains healthy for the long haul is instrumenting the process at each phase (sourcing, phone screens, take home tests, onsites, offers, and so on), and monitoring those metrics over time. If your ratio of referrals:sourced+direct goes down, then you probably have a problem (specifically, probably a morale problem in your existing team), and if your acceptance rate goes down then perhaps your offers are not high enough, but also it might be that your best interviewer has burned out on interviewing and is pushing people away.

Keep watching these numbers and listening to candidate post-process feedback, and you can go to sleep at night knowing that the process is still on the rails.

As a side note: I’ve put optimizing your funnel (and by this I include the entire process of building explicit analytics around your process) as the last priority in building a great interviewing process. From a typical optimization perspective, you should always measure first and optimize second, and here I’m giving the opposite advice.

Doing this first instead of last is definitely reasonable; in fact, I considered making this the first priority, and when I was setting up my last hiring process it was the first thing I did. In the end I think you’ll find that your process cannot thrive without handling the first six priorities, and that your analytics will direct you towards fixing those issues. Plus, the underlying data is very often poor, and it can be easy to get lost spending your cycles on instrumenting your process instead of improving it.


Altogether, the most important aspects of interviewing well are to budget enough time to interview well and to maintain a healthy skepticism about the efficiency of your current process. Keep iterating forward and your process will end up being a good one.

I’m sure I missed quite a few components of interviewing well, I’d love to hear more ideas at @lethain on Twitter or over email (included in the right-rail).

Joe Abbate: The Future of Pyrseas – Part 1

From Planet PostgreSQL. Published on Jun 09, 2016.

In the early days of the Pyrseas project, I read about some open source projects being abandoned due to their developers losing interest or perhaps becoming involved elsewhere and thought to myself “That could never happen to me.”

Sadly, this blog has remained silent for over two years, and I haven’t done substantive development work since last September.

Still, some remain interested in Pyrseas, if gauged by the latest two issues: inquiring about consolidating/updating our documentation and porting dbtoyaml/yamltodb to other databases. So it’s appropriate that we discuss in what directions the project may turn.

Some background first.

Pyrseas: Origins

Pyrseas was born from my experience with making structural changes to SQL databases, version control systems, the typical incompatibility between the two and the solution suggested by Andromeda.

Andromeda had a fairly simple concept: describe your database using YAML and use that to drive the process of generating SQL to make structural modifications (also, use the YAML spec to produce a basic CRUD app to make content changes—see Part 3).

The Pyrseas innovation was: rather than manually edit a YAML spec, why not create it from the Postgres catalogs? In addition, instead of using the information_schema catalog views so that the process could be nominally portable to a few other SQL databases, we opted to query the internal catalogs directly.
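To make that concrete, here is a rough sketch of the idea (this is not Pyrseas’ actual code, and the layout is heavily simplified), assuming psycopg2 and PyYAML are installed and a local database named mydb:

import psycopg2
import yaml

conn = psycopg2.connect(dbname='mydb')
cur = conn.cursor()
# read table and column definitions straight from the internal catalogs
cur.execute("""
    SELECT c.relname, a.attname,
           pg_catalog.format_type(a.atttypid, a.atttypmod)
      FROM pg_catalog.pg_class c
      JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
      JOIN pg_catalog.pg_attribute a ON a.attrelid = c.oid
     WHERE c.relkind = 'r'
       AND n.nspname = 'public'
       AND a.attnum > 0
       AND NOT a.attisdropped
     ORDER BY c.relname, a.attnum""")

spec = {}
for table, column, datatype in cur.fetchall():
    cols = spec.setdefault('table ' + table, {'columns': []})['columns']
    cols.append({column: {'type': datatype}})

print(yaml.dump(spec, default_flow_style=False))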

The Imaginarium of Peter Eisentraut

In response to a post regarding another tool that diff’d database versions, Core Team member Peter Eisentraut commented that something that “doesn’t support all PostgreSQL features” is not helpful “to help you manage your database upgrades, because it might miss important details.”

That made us reconsider the scope of Pyrseas which initially was to be limited to tables, views and perhaps functions. We decided to address the vast array of Postgres features and some releases later we managed to achieve that goal, for the most part.

A post about the proper place of business logic then led to a collaboration with Roger Hunwicks to augment the Pyrseas tools. Another discussion with Josep Martínez resulted in a preliminary capability to copy predefined data into the database.

Lilo & Sqitch (or Is diffing database versions sufficient?)

Although my Pyrseas development activity has been limited recently, I’ve continued to use Pyrseas to assist in my DBA and database development tasks. I’ve thus come to the conclusion that: no, in the most general case, diffing database versions is not sufficient to manage structural changes.

A simple example based on my original proof-of-concept schema can serve to validate this assertion. Suppose we wanted to rename the release_year column to release_date and use a DATE datatype. The RENAME may be made to work, but the datatype change will require some specialized handling (e.g., an external script) that cannot be codified in a generalized “difference” approach.
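For illustration, that specialized handling might look like the following hypothetical script (the film table name, the use of psycopg2, and the year-to-date rule are all assumptions made up for this example):

import psycopg2

conn = psycopg2.connect(dbname='mydb')
with conn, conn.cursor() as cur:
    # the rename itself could, in principle, be produced by a diff
    cur.execute("ALTER TABLE film RENAME COLUMN release_year TO release_date")
    # the data conversion is the part that needs human knowledge; here we
    # arbitrarily map each year to January 1st of that year
    cur.execute("""
        ALTER TABLE film
        ALTER COLUMN release_date TYPE date
        USING make_date(release_date::int, 1, 1)""")
conn.close()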

When I first reviewed the other SQL schema versioning tools, Liquibase was the main exponent of the non-differencing camp (and it still seems to be going strong). About a year after that, David Wheeler came out with the first version of Sqitch and thanks to David I was able to experiment with it.

My main objection to Sqitch and Liquibase is that for the most common use cases, e.g., add a table, add a column, rewrite a view or function, etc., it seems unproductive for a DBA or developer to do the work more than once, i.e., in a development database, either via psql, using a design tool such as PgAdmin or, in some simple cases, even by editing the YAML spec directly. Recreating the SQL DDL and applying it, in the correct order, to another development, test or production database should be automated. The generated SQL should take into consideration the state of the target and, if it becomes necessary, should include generating SQL to back out changes, e.g., drop table, alter table drop column, etc., without any extra DBA or developer intervention.

Conclusion

My proposal to address the insufficiency of diffing database versions is to incorporate some of the ideas of the non-differencing approaches into yamltodb. The solution may be similar to what was done for dbaugment, i.e., add specialized scripts or configuration files that can control the additional processing. Admittedly, this is still very vague and will probably rank third in terms of priorities, although I chose to discuss it first.


Filed under: Database tools, Postgres, Version control

Alexander Korotkov: Drawing Graphs Directly in Psql

From Planet PostgreSQL. Published on Jun 09, 2016.

For people who are actively working with psql, it frequently happens that you want to draw a graph of the table you’re currently looking at. Typically, that means a cycle of actions: exporting the data, importing it into a graph-drawing tool and drawing the graph itself. It turns out this process can be automated: the graph can be drawn by typing a single command directly in psql. See an example on the screenshot below.

It might seem like magic, but there is none. iTerm2 has supported image inlining since version 3, which is currently in beta. Thus, if we output an image surrounded by the corresponding escape sequences, it will appear in the terminal. On the psql side, we just need to redirect output to a script which does this. We can define a macro to simplify this, as in one of my previous posts.

\set graph '\g |pg_graph'

And finally we need the pg_graph script itself, which parses psql output, draws the graph and writes it to stdout. I wrote one using Python and matplotlib. It treats the first column as the series of X values and the rest of the columns as series of Y values. If the first column contains only decimal values it draws a line plot, otherwise it draws a bar chart.
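My script isn’t reproduced here, but a stripped-down sketch of the same idea might look like this (assuming matplotlib and iTerm2 3.x; the parsing is deliberately naive, only handles psql’s default aligned output, and always draws a line plot):

#!/usr/bin/env python
import sys, io, base64

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

# parse psql's aligned output from stdin
lines = [l.rstrip('\n') for l in sys.stdin if l.strip()]
header = [c.strip() for c in lines[0].split('|')]
rows = [[c.strip() for c in l.split('|')]
        for l in lines[2:] if '|' in l]   # skips the separator and "(N rows)"

# first column = X labels, remaining columns = Y series
fig, ax = plt.subplots()
for i, name in enumerate(header[1:], start=1):
    ax.plot(range(len(rows)), [float(r[i]) for r in rows], label=name)
ax.set_xticks(range(len(rows)))
ax.set_xticklabels([r[0] for r in rows], rotation=45, fontsize=6)
ax.legend()

# wrap the PNG in iTerm2's inline-image escape sequence and print it
buf = io.BytesIO()
fig.savefig(buf, format='png')
payload = base64.b64encode(buf.getvalue()).decode('ascii')
sys.stdout.write('\033]1337;File=inline=1:%s\a\n' % payload)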

Thus, it’s not hard to teach psql to do more things. We could also consider some improvements to psql itself, including:

  • Add output format option for \g which would make it easier to parse psql output from scripts;
  • Provide an elegant way to pass parameters into psql macros.

Andrew Dunstan: Removing a Buildfarm client/server perl dependency

From Planet PostgreSQL. Published on Jun 09, 2016.

With one very small exception, there is nothing in the buildfarm server that requires the client to be running perl, even though both the server and the standard client are written in perl.

Andreas Scherbaum has written a new specialized client in python, and he asked me about removing that dependency. Essentially what this requires is that an alternative mechanism be allowed for transmitting the serialized configuration in the client's web transaction. Currently the client uses perl's Storable module to serialize the data, and the server uses the same module to de-serialize it. The obvious alternative candidate is to serialize it as JSON.

The first problem was to determine if we can sanely distinguish between data serialized by the two mechanisms. And we can. JSON is not allowed to contain any control characters, and a structure serialized using Storable's nfreeze() method is pretty much guaranteed to contain such characters. So I've added a test to the receiving program that looks for such characters and if it doesn't find them assumes that the data is JSON and decodes it accordingly. This has been tested using the client nightjar.
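For illustration only, the same heuristic expressed in Python (the buildfarm server itself is perl) might look roughly like this, treating tab, newline and carriage return as the whitespace JSON is allowed to contain between tokens:

import json

def looks_like_json(payload):
    """payload is the raw bytes from the web transaction. Storable's
    nfreeze() output almost always contains raw control characters; JSON
    text should not, apart from ordinary whitespace."""
    return not any(b < 0x20 and b not in (0x09, 0x0a, 0x0d) for b in payload)

def deserialize(payload):
    if looks_like_json(payload):
        return json.loads(payload.decode('utf-8'))
    raise ValueError('payload looks like a Storable blob, not JSON')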

There are a few wrinkles, though. The main one is that we can't assume that every client has a module available to encode the data as JSON. In fact, nightjar didn't until I installed one. So in order to build this into the client I'll need to find a way to determine at runtime if it's available and fall back to using Storable otherwise.

We should also look at actually storing the data as JSON rather than as a serialized blob. When the buildfarm was created we didn't have any database support for JSON, but now the sensible thing to do would be to store the data as jsonb, and make it searchable.

But that's a project for another day.

Keith Fiske: Cleaning Up PostgreSQL Bloat

From Planet PostgreSQL. Published on Jun 08, 2016.

As a followup to my previous post on checking for bloat, I figured I’d share some methods for actually cleaning up bloat once you find it. I’ll also be providing some updates on the script I wrote due to issues I encountered and thanks to user feedback from people that have used it already.

First, as these examples will show, the most important thing you need to clean up bloat is extra disk space. This means it is critically important to monitor your disk space usage if bloat turns out to be an issue for you. And if your database is of any reasonably large size, and you regularly do updates & deletes, bloat will be an issue at some point. I’d say a goal is to always try and stay below 75% disk usage, either by archiving and/or pruning old data that’s no longer needed, or simply by adding more disk space or migrating to new hardware altogether. Having less than 25% free can put you in a precarious situation where you may have a whole lot of disk space you can free up, but not enough room to actually do any cleanup at all, or not without impacting performance in big ways (e.g. you have to drop & recreate a bloated index instead of rebuilding it concurrently, making previously fast queries extremely slow).

[Image: gbloat]

The above graph (y-axis terabytes) shows my recent adventures in bloat cleanup after using this new scan, and validates that what it reported is actually bloat. In both this graph and the one below, there were no data purges going on and each of the significant line changes coincided exactly with bloat cleanup session. You can see back on May 26th a huge drop in size. You can see an initial tiny drop followed by a fairly big increase then the huge drop. This is me first fixing one small, but very bloated index followed by running a pg_repack to take care of both table and a lot of index bloat. This is actually the group_members table I used as the example in my previous post. Over the next week or so I worked through roughly 80 bloated objects to recover about 270GB of disk space. Now, it may turn out that some of these objects will have their bloat return to their previous values quickly again and those could be candidates for exclusion from the regular report. But I figured I’d go through everything wasting more than a few hundred MB just so I can better assess what the actual normal bloat level of this database is. Here’s another example from another client that hadn’t really had any bloat monitoring in place at all before (that I was aware of anyway). It’s showing disk space available instead of total usage, hence the line going the opposite direction, and db12 is a slave of db11.

[Image: pp_bloat]

The easiest, but most intrusive, bloat removal method is to just run a VACUUM FULL on the given table. This will take an exclusive lock on the table (blocks all reads and writes) and completely rebuild the table to new underlying files on disk. This clears out 100% of the bloat in both the table and all indexes it contains at the expense of blocking all access for the duration. If you can afford the outage, it’s the easiest, most reliable method available. For very small tables this is likely your best option.

The next option is to use the REINDEX command. This can be run on several levels: INDEX, TABLE, DATABASE. 9.5 introduced the SCHEMA level as well. Running it on the TABLE level has the same consequence of likely locking the entire table for the duration, so if you’re going that route, you might as well just run a VACUUM FULL. Same for running at the DATABASE level, although if you’re running 9.5+, it did introduce parallel vacuuming to the vacuumdb console command, which would be much more efficient. When running on the INDEX level, things are a little more flexible. All writes are blocked to the table, but if a read-only query does not hit the index that you’re rebuilding, that is not blocked. If you can afford several shorter outages on a given table, or the index is rather small, this is the best route to take for bloat cleanup.

If you’ve got tables that can’t really afford long outages, then things start getting tricky. Before getting into pg_repack, I’d like to share some methods that can be used without third-party tools. Index bloat is the most common occurrence, so I’ll start with that.

If you’ve just got a plain old index (b-tree, gin or gist), there’s a combination of 3 commands that can clear up bloat with minimal downtime (depending on database activity). The CONCURRENTLY flag to the CREATE INDEX command allows an index to be built without blocking any reads or writes to the table. So say we had this bloated index

 objectname            | objecttype | dead_tuples | free_space | free_percent
-----------------------+------------+-------------+------------+--------------
 index_guests_on_state | index      | 0 bytes     | 9460 MB    |        60.04

No dead tuples (so autovacuum is running efficiently) and 60% of the total index is free space that can be reclaimed. A handy command to get the definition of an index is pg_get_indexdef(regclass). In this case it’s a very easy index definition, but when you start getting into some really complicated functional or partial indexes, having a definition you can copy-n-paste is a lot safer.

# SELECT pg_get_indexdef('index_guests_on_state'::regclass);
                         pg_get_indexdef
------------------------------------------------------------------
 CREATE INDEX index_guests_on_state ON guests USING btree (state)

Now we can write our set of commands to rebuild the index

CREATE INDEX CONCURRENTLY index_guests_on_state_new ON guests USING btree (state);
ANALYZE guests;
DROP INDEX CONCURRENTLY index_guests_on_state;
ANALYZE guests;
ALTER INDEX index_guests_on_state_new RENAME TO index_guests_on_state;

I threw the ANALYZE calls in there just to ensure that the catalogs are up to date for any queries coming in during this rebuild. May not really be necessary, but I was doing this on a very busy table, so I’d rather be paranoid about it. Neither the CREATE nor the DROP command will block any other sessions that happen to come in while this is running. However, that final ALTER INDEX call can block other sessions coming in that try to use the given table. But the rename is optional and can be done at any time later. After the DROP command, your bloat has been cleaned up. While concurrent index creation does not block, there are some caveats with it, the major one being it can take much longer to rebuild the index. One of these for the second client above took 4.5 hours to complete. The documentation on building indexes concurrently goes into more detail on this, and how to deal with it possibly failing.

If you’re running this on a UNIQUE index, you may run into an issue if it was created as a UNIQUE CONSTRAINT vs a UNIQUE INDEX. Functionally, both are the same as far as PostgreSQL is concerned. And under the hood, creating a unique constraint will just create a unique index anyway. The big difference is you will not be able to drop a unique constraint concurrently. You will have to do an ALTER TABLE ... DROP CONSTRAINT ... call, which will require an exclusive lock, just like the RENAME above. Also, the index is more flexible since you can make a partial unique index as well. So it’s better to just make a unique index vs a constraint if possible.

PRIMARY KEYs are another special case. Functionally, they’re no different than a unique index with a NOT NULL constraint on the column. But they are marked specially in the catalog and some applications specifically look for them. You can do something very similar to the above, taking advantage of the USING clause to the ADD PRIMARY KEY command.

postgres@production=# CREATE UNIQUE INDEX CONCURRENTLY email_records_pkey_new ON public.email_records USING btree (id);
CREATE INDEX
Time: 2794263.993 ms
postgres@production=# ANALYZE public.email_records;
ANALYZE
Time: 1658.590 ms
postgres@production=# ALTER TABLE public.email_records DROP CONSTRAINT email_records_pkey;
ALTER TABLE
Time: 43.774 ms
postgres@production=# ANALYZE public.email_records;
ANALYZE
Time: 1518.819 ms
postgres@production=# ALTER INDEX public.email_records_pkey_new RENAME TO email_records_pkey;
ALTER INDEX
Time: 0.586 ms
postgres@production=# ALTER TABLE public.email_records ADD PRIMARY KEY USING INDEX email_records_pkey;
ALTER TABLE
Time: 13.219 ms
postgres@production=# ANALYZE public.email_records;
ANALYZE
Time: 1611.376 ms

I gave full command examples here so you can see the runtimes involved. The concurrent index creation took quite a while (about 46 minutes), but everything besides the analyze commands was sub-second. Giving the command to create a primary key an already existing unique index to use allows it to skip the creation and validation usually done with that command.

As always, there are caveats to this. If the primary key, or any unique index for that matter, has any FOREIGN KEY references to it, you will not be able to drop that index without first dropping the foreign key(s). If there’s only 1 or 2 of those, you can likely do this in a transaction surrounding the drop & recreation of the primary key with commands that also drop and recreate the foreign keys. But if you start getting more in there, that’s just taking a longer and longer outage for the foreign key validation which will lock all tables involved. And it also increases the likelihood of an error in the DDL you’re writing to manage recreating everything. In that case, it may just be better to take the outage to rebuild the primary key with the REINDEX command.

In all cases where I can use the above methods, I always try to use those first. They’re the native methods built into the database and, as long as you don’t typo the DDL commands, not likely to be prone to any issues cropping up later down the road. And since index bloat is primarily where I see the worst problems, it solves most cases (the second graph above was all index bloat). If you’re unable to use any of them, though, the pg_repack tool is very handy for removing table bloat or handling situations with very busy or complicated tables that cannot take extended outages. It’s gotten pretty stable over the last year or so, but just seeing some of the bugs that were encountered with it previously, I use it as a last resort for bloat removal. Also, if you’re running low on disk space, you may not have enough room for pg_repack since it requires rebuilding the entire table and all indexes in secondary tables before it can remove the original bloated table. As I said above, I did use it where you see that initial huge drop in disk space on the first graph, but before that there was a rather large spike to get there. In that case, the table had many, many foreign keys & triggers and was a very busy table, so it was easier to let pg_repack handle it.

For table bloat, Depesz wrote some blog posts a while ago that are still relevant with some interesting methods of moving data around on disk. This can also be handy when you are very low on disk space.

Bloat Removal Without Table Swapping

Bloat Removal By Tuples Moving

Since I initially wrote my blog post, I’ve had some great feedback from people using pg_bloat_check.py already. I’ve gotten several bugs fixed as well as adding some new features, with version 2.1.0 being the latest available as of this blog post. json is now the preferred, structured output method if you need to see more details outside of querying the stats table in the database. I also added some additional options, such as --exclude_object_file, which allows for more fine-grained filtering when you want to ignore certain objects in the regular report, but not forever in case they get out of hand. I updated the README with some examples of that since it’s a little more complex.

I also made note of the fact that this script isn’t something that’s made for real-time monitoring of bloat status. Since it’s doing full scans on both tables and indexes, this has the potential to force data out of shared buffers. So if you keep running it often, you may affect query performance of things that rely on data being readily available there. It’s best to run it maybe once a month or once a week at most during off-peak hours. If you have particularly troublesome tables you want to keep an eye on more regularly, the --tablename option allows you to scan just that specific table and nothing else. Once you’ve gotten the majority of your bloat issues cleaned up after your first few times running the script and see how bad things may be, bloat shouldn’t get out of hand so quickly that you need to run it that often. If it is, you may want to re-evaluate how you’re using PostgreSQL (e.g. MVCC makes it not great as a queuing system).

If anyone else has some handy tips for bloat cleanup, I’d definitely be interested in hearing them.

What We’re Clicking - May Link Roundup

By Caktus Consulting Group from Django community aggregator: Community blog posts. Published on Jun 08, 2016.

Below you can find this month’s roundup of articles and posts shared by Cakti that drew the most attention on Twitter. The list covers coding for matrix factorization algorithms in Python, designing apps that optimize for sequential dual screen usage, preventing technical debt, and understanding the complexities and limitations involved in building apps for low-income American families.

Finding Similar Music Using Matrix Factorization

A step-by-step guide to calculating related music artists with matrix factorization algorithms. This tutorial is written in Python using Pandas and SciPy for calculations and D3.js for interactive data visualization.

Windows on the Web

Completing a task across multiple devices is a common habit. And yet, according to the author of this piece Karen McGrane, this practice is rarely considered in user design scenarios. In this article, McGrane contemplates how to design the best user experience for sequential dual screen usage.

Technical Debt 101

This article is a detailed explanation of technical debt and the negative consequences of sacrificing code quality.

What I Learned from Building an App for Low-Income Americans

Ciara Byrne’s thoughtful article on lessons learned from her experience building an app for low-income Americans. Byrne reflects not only on the challenges involved in designing for this particular community of users but also the complex definitions of low-income that must be taken into account when approaching similar projects.

Daniel Pocock: Working to pass GSoC

From Planet PostgreSQL. Published on Jun 08, 2016.

GSoC students have officially been coding since 23 May (about 2.5 weeks) and are almost half-way to the mid-summer evaluation (20 - 27 June). Students who haven't completed some meaningful work before that deadline don't receive payment and in such a large program, there is no possibility to give students extensions or let them try and catch up later.

Every project and every student are different, some are still getting to know their environment while others have already done enough to pass the mid-summer evaluation.

I'd like to share a few tips to help students ensure they don't inadvertently fail the mid-summer evaluation

Kill electronic distractions

As a developer of real-time communications projects, many people will find it ironic or hypocritical that this is at the top of my list.

Switch off the mobile phone or put it in silent mode so it doesn't even vibrate. Research has suggested that physically turning it off and putting it out of sight has significant benefits. Disabling the voicemail service can be an effective way of making sure no time is lost listening to a bunch of messages later. Some people may grumble at first but if they respect you, they'll get into the habit of emailing you and waiting for you to respond when you are not working.

Get out a piece of paper and make a list of all the desktop notifications on your computer, whether they are from incoming emails, social media, automatic updates, security alerts or whatever else. Then figure out how to disable them all one-by-one.

Use email to schedule fixed times for meetings with mentors. Some teams/projects also have fixed daily or weekly times for IRC chat. For a development project like GSoC, it is not necessary or productive to be constantly on call for 3 straight months.

Commit every day

Habits are a powerful thing. Successful students have a habit of making at least one commit every day. The "C" in GSoC is for Code and commits are a good way to prove that coding is taking place.

GSoC is not a job, it is like a freelance project. There is no safety-net for students who get sick or have an accident and mentors are not bosses, each student is expected to be their own boss. Although Google has started recommending students work full time, 40 hours per week, it is unlikely any mentors have any way to validate these hours. Mentors can look for a commit log, however, and simply won't be able to pass a student if there isn't code.

There may be one day per week where a student writes a blog or investigates a particularly difficult bug and puts a detailed report in the bug tracker but by the time we reach the second or third week of GSoC, most students are making at least one commit in 3 days out of every 5.

Consider working away from home/family/friends

Can you work without anybody interrupting you for at least five or six hours every day?

Do you feel pressure to help with housework, cooking, siblings or other relatives? Even if there is no pressure to do these things, do you find yourself wandering away from the computer to deal with them anyway?

Do family, friends or housemates engage in social activities, games or other things in close proximity to where you work?

All these things can make a difference between passing and failing.

Maybe these things were tolerable during high school or university. GSoC, however, is a stepping stone into professional life and that means making a conscious decision to shut those things out and focus. Some students have the ability to manage these distractions well, but it is not for everybody. Think about how leading sports stars or musicians find a time and space to be "in the zone" when training or rehearsing, this is where great developers need to be too.

Some students find the right space in a public library or campus computer lab. Some students have been working in hacker spaces or at empty desks in local IT companies. These environments can also provide great networking opportunities.

Managing another summer job concurrently with GSoC

It is no secret that some GSoC students have another job as well. Sometimes the mentor is aware of it, sometimes it has not been disclosed.

The fact is, some students have passed GSoC while doing a summer job or internship concurrently but some have also failed badly in both GSoC and their summer job. Choosing one or the other is the best way to succeed, get the best results and maximize the quality of learning and community interaction. For students in this situation, now it is not too late to make the decision to withdraw from GSoC or the other job.

If doing a summer job concurrently with GSoC is unavoidable, the chance of success can be greatly increased by doing the GSoC work in the mornings, before starting the other job. Some students have found that they actually finish more quickly and produce better work when GSoC is constrained to a period of 4 or 5 hours each morning and their other job is only in the afternoon. On the other hand, if a student doesn't have the motivation or energy to get up and work on GSoC before the other job then this is a strong sign that it is better to withdraw from GSoC now.

Google Analytics Graphs to your Dashboard in Python Web Frameworks

By Micropyramid django from Django community aggregator: Community blog posts. Published on Jun 06, 2016.

Ecommerce solutions like OpenCart and Magento provide extensions to see Google Analytics data as graphs in their own dashboards, whereas no such plugins are available in Python. In this tutorial, we will learn how to display Google Analytics graphs on your website.

Open the Google Developers Console, create a new project or open an existing one, and enable the Google Analytics API. Then create a service account and generate a JSON key file. Use the function below to generate an access token for Analytics in read-only mode.

Python:

from oauth2client.client import SignedJwtAssertionCredentials
import json

def get_access_token():
    ''' Get Access Token From GOOGLE'''
    SCOPE = 'https://www.googleapis.com/auth/analytics.readonly'
    KEY_FILEPATH = '<location to key file>'  # path to the service account JSON key file
    with open(KEY_FILEPATH) as key_file:
        key_data = json.load(key_file)
    credentials = SignedJwtAssertionCredentials(
        key_data['client_email'], key_data['private_key'], SCOPE)
    return credentials.get_access_token().access_token

You need to pass the Google user ID (your Analytics view ID) and the Google access token to your template.
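For example, a hypothetical Django view that hands both values to the template could look like this (get_access_token is the function above; the GOOGLE_ANALYTICS_VIEW_ID setting name is made up for this sketch):

from django.conf import settings
from django.shortcuts import render

def analytics_dashboard(request):
    context = {
        'google_access_token': get_access_token(),
        'google_userid': settings.GOOGLE_ANALYTICS_VIEW_ID,  # e.g. 'ga:12345678'
    }
    return render(request, 'dashboard.html', context)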

Javascript:

//script to load analytics
<script>
(function(w,d,s,g,js,fs){
  g=w.gapi||(w.gapi={});g.analytics={q:[],ready:function(f){this.q.push(f);}};
  js=d.createElement(s);fs=d.getElementsByTagName(s)[0];
  js.src='https://apis.google.com/js/platform.js';
  fs.parentNode.insertBefore(js,fs);js.onload=function(){g.load('analytics');};
}(window,document,'script'));
</script>

//script to show graphs
<script>
gapi.analytics.ready(function() {

  gapi.analytics.auth.authorize({
    'serverAuth': {
      'access_token': "{{google_access_token}}"
    }
  });

  var dataChart1 = new gapi.analytics.googleCharts.DataChart({
    query: {
      'ids': '{{google_userid}}',
      'start-date': '30daysAgo',
      'end-date': 'today',
      'metrics': 'ga:sessions',
      'dimensions': 'ga:date',
      'sort': '-ga:date'
    },
    chart: {
      'container': 'chart-1-container',
      'type': 'LINE',
      'options': {
        'width': '100%'
      }
    }
  });
  dataChart1.execute();

//To change when selectbox value is changed

$("#chart-1").change(function(){
  dataChart1.set({query: {"start-date": $("#chart-1").val()}, chart:{"type":$("#chart-type-1").val()}})
  dataChart1.execute();
})
$("#chart-type-1").change(function(){
  dataChart1.set({query: {"start-date": $("#chart-1").val()}, chart:{"type":$("#chart-type-1").val()}})
  dataChart1.execute();
})
});
</script>

HTML

<div class='col-md-12 graph_wrap'>
<span>Users Sessions</span>
<!-- select box to change type of graph -->
<span class="col-md-4 select_box">
  <select id="chart-type-1">
    <option value="LINE">Line</option>
    <option value="COLUMN">Column</option>
    <option value="TABLE">Table</option>
  </select>
</span>
<!-- select box to switch between 1 week and 1 month -->
<span class="col-md-4 select_box"> 
  <select id="chart-1">
    <option value="30daysAgo">Last 30 Days</option>
    <option value="7daysAgo">Last 7 Days</option>
  </select>
</span>
<div class="clearfix"></div>
<div id="chart-1-container"></div>
</div>

With the above code you will get a user sessions graph that can switch between week and month views, and you can also change the type of graph.

Other configuration options like dimensions, metrics and filters are available at the Google Metrics Explorer.

Evennia in pictures

By Griatch's Evennia musings (MU* creation with Django+Twisted) from Django community aggregator: Community blog posts. Published on May 31, 2016.

Evennia in pictures
by Griatch, September 2015

This article was originally written for Optional Realities
Since it is no longer available to read on OR, I'm reposting it in full here.

The MU* development system Evennia works quite differently from many traditional MU* code bases/servers and we get questions about this from time to time. This article will try to give an overview for outsiders - using pictures!  

Figure 1: The parts of the Evennia library

[Image: evennia_infrastructureA.png]

Evennia is a game development library. What you see in Figure 1 is the part you download from us. This will not run on its own, we will soon initialize the missing “jigsaw puzzle” piece on the left. But first let’s look at what we’ve got.

Looking at Figure 1 you will notice that Evennia internally has two components, the Portal and the Server. These will run as separate processes.

The Portal tracks all connections to the outside world and understands Telnet protocols, websockets, SSH and so on. It knows nothing about the database or the game state. Data sent between the Portal and the Server is protocol-agnostic, meaning the Server sends/receives the same data regardless of how the user is connected. Hiding behind the Portal also means that the Server can be completely rebooted without anyone getting disconnected.

The Server is the main “mud driver” and handles everything related to the game world and its database. It's asynchronous and uses Twisted. In the same process of the Server is also the Evennia web server component that serves the game’s website. That the Server and webserver are accessing the database in the same process allows for a consistent game state without any concerns for caching or race condition issues.

Now, let’s get a game going. We’ll call it mygame. Original, isn’t it?

Figure 2: The full setup for mygame
[Image: evennia_infrastructure.jpg.png]

After installing evennia you will have the evennia command available. Using this you create a game directory - the darker grey piece in Figure 2 that was missing previously. This is where you will create your dream game!

During initialization, Evennia will create Python module templates in mygame/ and link up all configurations to make mygame a fully functioning, if empty, game, ready to start extending. Two more commands will create your database and then start the server. From this point on, mygame is up and running and you can connect to your new game with telnet on localhost:4000 or by pointing your browser to http://localhost:8000.

Now, our new mygame world needs Characters, locations, items and more! These we commonly refer to as game entities. Let’s see how Evennia handles those.

Figure 3: The Database Abstraction of Evennia entities

[Image: evennia_typeclassesA.png]
Evennia is fully persistent and abstracts its database in Python using Django. The database tables are few and generic, each represented by a single Python class. As seen in Figure 3, the example ObjectDB Python class represents one database table. The properties on the class are the columns (fields) of the table. Each row is an instance of the class (one entity in the game).

Among the example columns shown is the key (name) of the ObjectDB entity as well as a Foreign key-relationship for its current “location”. From the above we can see that Trigger is in the Dungeon, carrying his trusty crossbow Old Betsy!

The db_typeclass_path is an important field. This is a python-style path and tells Evennia which subclass of ObjectDB is actually representing this entity.

Figure 4: Inheriting classes to customize entities

[Image: evennia_typeclassesB.png]

In Figure 4 we see the (somewhat simplified) Python class inheritance tree that you as an Evennia developer will see, along with the three instanced entities.

ObjectDB represents stuff you will actually see in-game and its child classes implement all the handlers, helper code and the hook methods that Evennia makes use of. In your mygame/ folder you just import these and overload the things you want to modify. In this way, the Crossbow is modified to do the stuff only crossbows can do and CastleRoom adds whatever it is that is special about rooms in the castle.
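A simplified sketch of what that overloading might look like in mygame/typeclasses/ (DefaultObject, DefaultRoom and the at_object_creation/return_appearance hooks are real Evennia API; the class contents are made up for this example):

from evennia import DefaultObject, DefaultRoom

class Crossbow(DefaultObject):
    """Stuff only crossbows can do."""
    def at_object_creation(self):
        # called once, when the entity is first saved to the database
        self.db.damage = 10      # example value; see Attributes below

class CastleRoom(DefaultRoom):
    """Whatever is special about rooms in the castle."""
    def return_appearance(self, looker):
        desc = super(CastleRoom, self).return_appearance(looker)
        return desc + "\nTorchlight flickers on the stone walls."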

When creating a new entity in-game, a new row will automatically be created in the database table and then “Trigger” will appear in-game! If we, in code, search the database for Trigger, we will get an instance of the Character class back - a Python object we can work with normally.

Looking at this you may think that you will be making a lot of classes for every different object in the game. Your exact layout is up to you but Evennia also offers other ways to customize each individual object, as exemplified by Figure 5.

Figure 5: Adding persistent Attributes to a game entity.

[Image: evennia_typeclassesC.png]

The Attribute is another class directly tied to the database behind the scenes. Each Attribute basically has a key, a value and a ForeignKey relation to another ObjectDB. An Attribute serializes Python constructs into the database, meaning you can store basically any valid Python, like the dictionary of skills seen in Figure 5. The “strength” and “skills” Attributes will henceforth be reachable directly from the Trigger object. This (and a few other resources) allow you to create individualized entities while only needing to create classes for those that really behave fundamentally different.
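For example (trigger here is assumed to be the Character instance we got back from a database search; the values are invented for illustration):

# store arbitrary, persistent data on the Trigger object
trigger.db.strength = 10
trigger.db.skills = {"Hunting": 10, "Surviving": 20, "Drooling": 30}

# later (even after a full server reboot), the data is still reachable
if trigger.db.skills.get("Hunting", 0) > 5:
    trigger.msg("You pick up the trail.")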


Figure 6: Sessions, Players and Objects

[Image: evennia_session_player_character.png]

Trigger is most likely played by a human. This human connects to the game via one or more Sessions, one for each client they connect with. Their account on mygame is represented by a PlayerDB entity. The PlayerDB holds the password and other account info but has no existence in the game world. Through the PlayerDB entity, Sessions can control (“puppet”) one or more ObjectDB entities in-game.

In Figure 6, a user is connected to the game with three Sessions simultaneously. They are logged in to their Player account Richard. Through these Sessions they are simultaneously puppeting the in-game entities Trigger and Sir Hiss. Evennia can be configured to allow or disallow a range of different gaming styles like this.
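Which styles are allowed is controlled from the game dir’s settings file. A sketch (MULTISESSION_MODE is the actual Evennia setting; check the documentation for the exact meaning of each mode, since the comment below is only an approximation):

# mygame/server/conf/settings.py (excerpt)
from evennia.settings_default import *

# Roughly: allow one account to connect with several clients at once and
# puppet more than one character, as in Figure 6.
MULTISESSION_MODE = 2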

Now, for users to be able to control their game entities and actually play the game, they need to be able to send Commands.

Figure 7: Commands are Python classes too

[Image: evennia_commandsA.png]

Commands represent anything a user can input actively to the game, such as the look command, get, quit, emote and so on.

Each Command handles both argument parsing and execution. Since each Command is described with a normal Python class, it means that you can implement parsing once and then just have the rest of your commands inherit the effect. In Figure 7, the DIKUCommand parent class implements parsing of all the syntax common for all DIKU-style commands so CmdLook and others won’t have to.
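A sketch of the pattern (Command, self.args, self.caller and the search/at_look helpers are real Evennia API; DIKUCommand is just the example parent class from the figure):

from evennia import Command

class DIKUCommand(Command):
    """Implements parsing shared by all DIKU-style commands."""
    def parse(self):
        # every child command gets self.target for free
        self.target = self.args.strip()

class CmdLook(DIKUCommand):
    key = "look"
    def func(self):
        if not self.target:
            self.caller.msg(self.caller.at_look(self.caller.location))
        else:
            obj = self.caller.search(self.target)  # handles "not found" messaging
            if obj:
                self.caller.msg(self.caller.at_look(obj))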

Figure 8: Commands are grouped together in sets and always associated with game entities.

evennia_commandsB.png

Commands in Evennia are always joined together in Command Sets. These are containers that can hold many Command instances. A given Command class can contribute Command instances to any number of Command Sets. These sets are always associated with game entities. In Figure 8, Trigger has received a Command Set with a bunch of useful commands that he (and by extension his controlling Player) can now use.
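A Command Set is just another class; a minimal sketch (the class and key names are made up, reusing the CmdLook sketch from above):

from evennia import CmdSet


class TriggerCmdSet(CmdSet):
    """The everyday commands Trigger carries around."""
    key = "trigger_cmdset"

    def at_cmdset_creation(self):
        # populate the set; a Command class can be added to any number of sets
        self.add(CmdLook())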

Figure 9: Command Sets can affect those around them

evennia_commandsC.png

Trigger’s Command Set is only available to himself. In Figure 9 we put a Command Set with three commands on the Dungeon room. The room itself has no use for commands, but we configure this set to affect those inside it instead. Note that these are different versions of those commands (hence the different color)! We’ll explain why below.

Figure 10: The name Command “Set” is not just a name

evennia_commandsD.png

Command Sets can be dynamically (and temporarily) merged together in a fashion similar to set theory, except that the merge priority can be customized. In Figure 10 we see a Union-type merger where the Dungeon’s Commands of the same name temporarily override the commands from Trigger. While in the Dungeon, Trigger will be using these versions of those commands. When Trigger leaves, his own Command Set will be restored unharmed.

Why would we want to do this? Consider for example that the dungeon is in darkness. We can then let the Dungeon’s version of the look command only show the contents of the room if Trigger is carrying a light source. You might also not be able to easily get things in the room without light - you might even be fumbling randomly in your inventory!

Any number of Command Sets can be merged on the fly. This allows you to implement multiple overlapping states (like combat in a darkened room while intoxicated) without needing huge if statements for every possible combination. The merger is non-destructive, so you can remove cmdsets to get back previous states as needed.
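Sketched in code, the darkened-dungeon example could look roughly like this (CmdDarkLook and the dungeon variable are hypothetical; the priority/mergetype attributes and the cmdset.add/remove calls follow Evennia's Command Set machinery, but treat the details as illustrative):

class DarkDungeonCmdSet(CmdSet):
    key = "dark_dungeon"
    priority = 1          # same-named commands here win over Trigger's own
    mergetype = "Union"   # the merge style shown in Figure 10

    def at_cmdset_creation(self):
        self.add(CmdDarkLook())  # hypothetical "dark" variant of look

# attach the set to the room (configured to affect objects inside it) ...
dungeon.cmdset.add(DarkDungeonCmdSet)
# ... and remove it again later to restore the previous state non-destructively
dungeon.cmdset.remove(DarkDungeonCmdSet)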



… And that’s how many illustrations I have the stamina to draw at this time. Hopefully this quick illustrated dive into Evennia helps to clarify some of the basic features of the system!

Sending emails using sendgrid on heroku for a Django App

By Micropyramid django from Django community aggregator: Community blog posts. Published on May 30, 2016.

You can send automated email campaigns from your Django app in a simple way by using SendGrid on Heroku.

Installation:
First, deploy your Django app on Heroku; then you can add SendGrid to your application directly from Heroku ADD-ONS.

Alternatively, you can add the SendGrid add-on from your terminal using the Heroku CLI (command line interface). For that, first install the Heroku Toolbelt.

Heroku Toolbelt Installation:

wget -O- https://toolbelt.heroku.com/install-ubuntu.sh | sh

Run the line above in your terminal to install the Heroku Toolbelt.

Heroku Version:
Check your Heroku version to verify whether the Toolbelt is installed:

$ heroku --version
heroku-toolbelt/3.43.3 (x86_64-linux) ruby/1.9.3

To Know Heroku File Location:
Use the 'which' command to see where the heroku executable is located:

$ which heroku
/usr/local/heroku/bin/heroku

Heroku Login:
Log in to the Heroku API with your credentials (email and password); these details will be saved for future use.

$ heroku login
Enter your Heroku credentials.
Email: user@heroku.com
Password:

After logging in to the Heroku CLI, you can add the SendGrid add-on with the command below:

$ heroku addons:create sendgrid:starter

How to get Sendgrid Username & Password: 
Once SendGrid has been successfully added to your Heroku app, you can retrieve SENDGRID_USERNAME and SENDGRID_PASSWORD with the heroku config:get command:

$ heroku config:get SENDGRID_USERNAME

user@heroku.com
$ heroku config:get SENDGRID_PASSWORD

password

Example:
For sending emails, you first need your SendGrid API key and a recipient ('to') address; all notification emails will be sent to that address. The following code is an example of how to send an email using the sendgrid library:

import sendgrid

# Authenticate with your SendGrid API key (or the username/password from above)
sg = sendgrid.SendGridClient('YOUR_SENDGRID_API_KEY')

# Build the message
message = sendgrid.Mail()
message.add_to('To_Email')          # recipient address
message.set_from('From_Email')      # sender address
message.set_subject('Email_Subject')
message.set_html('Body')

# Send it
sg.send(message)
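If you would rather send mail through Django's own email framework (django.core.mail), a common approach is to point Django's SMTP backend at SendGrid using the Heroku config vars retrieved above. This is a hedged sketch of such settings, not part of the original example:

# settings.py -- illustrative SMTP configuration for SendGrid on Heroku
import os

EMAIL_BACKEND = 'django.core.mail.backends.smtp.EmailBackend'
EMAIL_HOST = 'smtp.sendgrid.net'
EMAIL_PORT = 587
EMAIL_USE_TLS = True
EMAIL_HOST_USER = os.environ.get('SENDGRID_USERNAME')
EMAIL_HOST_PASSWORD = os.environ.get('SENDGRID_PASSWORD')

# Then anywhere in your app:
#   from django.core.mail import send_mail
#   send_mail('Subject', 'Body', 'from@example.com', ['to@example.com'])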

Mastering Python

By wol.ph from Django community aggregator: Community blog posts. Published on May 30, 2016.


Mastering Python is finally out and available in your local book store.

If you are an experienced Python developer, this might be the book for you. It will teach you the useful new features in Python 3 (currently up to 3.5) and show you a lot of Python tricks that you may not have seen before. As a Python developer with many years of experience I can guarantee you that you will learn something new. Quite a few sections result from original research which currently cannot (believe me, I’ve tried) be found online.

Some of the more interesting subjects that are covered are asyncio and coroutines. Additionally, have you ever tried to apply the lambda calculus to Python?

Give it a try, here’s the Quicksort algorithm using the Y combinator:

>>> Y = lambda f: lambda *args: f(Y(f))(*args)

>>> quicksort = Y(lambda f:
...     lambda x: (
...         f([item for item in x if item < x[0]])
...         + [y for y in x if x[0] == y]
...         + f([item for item in x if item > x[0]])
...     ) if x else [])
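And to check that the one-liner actually sorts (a quick test of my own, not from the book):

>>> quicksort([5, 2, 8, 1, 2])
[1, 2, 2, 5, 8]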

Healing a Burned Out Team

By Will Larson from Django community aggregator: Community blog posts. Published on May 29, 2016.

Googlers sometimes tell a joke that at Google there are always two solutions to a problem: a prototype which isn’t ready and a working solution which is already deprecated. It’s funny, except it isn’t funny. It’s a joke, but it’s not actually a joke, because it’s true, and it’s not Google, it’s every company with more than a couple hundred engineers and that has been around for over four years.

You hear stories of small startups replacing Cassandra with PostgreSQL over a weekend or rewriting their service in Go in a development sprint, but as companies grow to 10,000s of lines of code and hundreds of engineers, technology migrations become extraordinarily complex projects which take years to complete, but which are typically viewed as low-status work.

When I started my first job at Yahoo! in 2008, they were in the process of migrating from cvs to svn. I can guarantee two things about that effort: first, that it took years of grueling effort, and second, that the team working on that migration did not feel appreciated for those years of labor.

These migrations are fascinating on a number of dimensions: how to avoid them, how they can be conducted effectively, how they reduce your organization’s ability to plan or forecast, and how they can provoke an entire team to burnout.

This post will focus on the last.

Team Burnout

Teams burn out when they are doing too much work, which isn’t interesting enough, and without enough recognition. While an individual on your team burning out is generally a structural problem, when an entire team is burned out, it is always a structural problem.

The best way to fight their burnout is by attacking the definition:

  1. They’re doing too much work, so find ways to reduce their workload. This might require reopening timeline negotiations for the current project–which is painful for you as a manager–but is less painful than your team dying inside a little bit each day.
  2. Step back from a recurring task and find a way to make it more interesting. My theory is that any project can become interesting when you reframe it as an automation and tooling challenge. Perhaps there is a sufficiently advanced tool (indistinguishable from magic?) you can build which will eliminate the manual components of the migration, or an architectural approach which can make the transition seamless (maybe adding a proxy with the old interface on top of a new backend). Look for it!
  3. Never stop telling the story of why their work is important and how it’s enabling some kind of major improvement (if it isn’t, then figure out why you’re doing it and consider not doing it). In particular, keep telling this story up your management chain so that your team can hear the message from more than just you.

Over time, healing a wound comes from addressing its origin, but if you’ve just started that process, health takes time to recover. Having encountered this scenario a handful of times, I’ve found some practices which make it easier to start moving in the right direction.

Catalysts for Recovery

Team lunches or dinners are a key piece of team health, and work well for allowing a team to gel together, but for recovery from burnout, I’ve found they don’t work well because frustrated groups in a social situation tend to vent about their frustration, reinforcing the negative rather than melting it away.

Instead, you want to find activities for your team which will provide them a mental and emotional break from the grind. Activities where they come away refreshed with the feeling they’ve learned and experienced something new:

  1. Technical paper reading groups are a great opportunity to learn something new together. Pick an interesting paper, have everyone read it before hand and then break out into groups of three to four people to discuss and come back as a group 20-30 minutes later for a larger discussion (this year we read Mesos and SWIM papers, among others, which I can happily recommend as good starting places). These work even better if you invite people from other teams to join in, providing a safe environment to build relationships with new people.
  2. Hackathons during working hours are a great opportunity for the team to try out something new, ideally with people on the team they don't work with as often, and get in some positive experiences together. (I've historically been very anti-hackathon because they are so often scheduled in a way that prevents individuals with families and external demands from participating, but I've since found that they're pretty effective and purely positive if you can schedule them exclusively during normal working hours over the course of a day or two.)
  3. Having recurring sessions once or twice a month where you learn a new programming language together can be a very positive experience. These can be especially fun if the projects themselves are interesting; some of the ideas I tried this past year were writing Redis and Statsd server implementations.
  4. At a quickly growing company, your team might be spending 20% of their time interviewing, but we rarely practice interviewing. This is unfortunate because we rarely get feedback on our interviewing, and also because practicing interviewing can actually be a fairly fun group activity. With groups of three, have one person ask a question, another work on answering the question, and a third keep notes on how the others could adjust their approaches.

These are just the beginning: any activity with an explicit focus that is inherently interesting, and that is not what your team is already focused on, will be effective.

Will these be enough to fully heal a burned out team? No, almost certainly not! You still have to attack the definition and find ways to get the team a reasonable workload with a mix of interesting work, but these can be useful for jumpstarting the process, getting you and your team out of mental deadlock and able to think again.

Code, Code, Code

By pydanny's blog from Django community aggregator: Community blog posts. Published on May 28, 2016.

I'm often asked by new programmers how they can forge a path into using their skills professionally. Or how they can get better at writing software.

How to Improve Your Coding Skills

This was my path. It may not be your path.

  1. I coded. A lot. From silly little scripts to automating tasks to attempting full-blown projects. At work or for fun. I failed a lot, but learned along the way.
  2. I didn't jump from language to language. Instead I stayed in a few places for years and focused my learning on those tools. My 19+ year career can be summed up as FoxPro then Java then Python. In the middle of things I picked up JavaScript. Sure, I've dallied with a few things (Lisp, Haskell, Lua, Perl, ColdFusion, Go), but by staying focused on a small set of tools I'm better than mediocre.
  3. I coded lots. Yes, this is a repeat of #1.
  4. Once I got the basics of a language, I looked up best practices for each of them. Then I religiously adhered to them, even becoming dogmatic about it. In general this means my code is more easily read. More easily debugged. And most importantly, more easily shared.
  5. Did I mention that I coded a lot? You can never get good at anything unless you practice. Another repeat of #1.
  6. I got over my fear/pride of asking questions. Well, mostly, I still am afraid/prideful from time to time. Honestly, by asking questions you aren't showing what you don't know, you are showing you are willing to learn. Also, the simple act of figuring out how to ask a question can put you in the right mindset to determine the answer yourself.
  7. As soon as I asked a question, whether or not I got an answer, I coded some more. Code, code, code! Yet another repeat of #1.
  8. Once I've gotten the hang of a language, I looked for cookbooks and/or pocket references on it. I prefer paper copies of tech books (everything else I read is electronic). The recipes in cookbooks become the foundation of my toolkit. The terse, easy-to-find reminders in the pocket reference mean less cognitive overload.
  9. I took those recipes and references and coded with them. Again and again I coded. In work hours or play time. Practice makes perfect! Why do I keep repeating #1?
  10. Over the years I've stayed with the easiest-to-learn stable IDEs/text editors. Yes, I know there are really powerful tools with arcane commands (Vim, EMACS, etc), but I don't want to have to stop what I'm doing to learn new tools. I want to code, not tinker with desktop tools or arcane text editors.
  11. And again, reference back to #1, I use the text editor to write code. Code, code, code! Until my fingers and hands hurt, until I've had to learn how to deal with carpal tunnel syndrome. Code, code, code! It's like learning martial arts, guitar, or anything, repetition of simple actions provides the confidence for you to either combine those actions into something greater or learn something more complex.

What I Wish I Had Done

Studied computer science. If I could do it all over again, that would have been the focus of my academic studies. It wouldn't replace anything on my list; the path I've defined remains the same. Practice makes perfect, right?

Code, Code, Code

Querying with Django Q objects

By Micropyramid django from Django community aggregator: Community blog posts. Published on May 27, 2016.

Django Q objects:

     A Q object encapsulates a SQL expression in a Python object that can be used in database-related operations. Using Q objects, we can build complex queries with less and simpler code.

For example, this Q object matches questions that start with 'What':

from django.db.models import Q
Q(question__startswith='What')

Q objects are helpful for complex queries because they can be combined using the logical operators AND (&), OR (|) and negation (~).

For example, this expression matches questions that start with 'Who' or with 'What':

Q(question__startswith='Who') | Q(question__startswith='What')

Note:
    If no operator is specified, the 'AND' operator is used by default.
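The negation operator can be combined the same way. For instance (a sketch assuming a hypothetical Question model with a question field, matching the lookups above):

from django.db.models import Q
from polls.models import Question  # hypothetical app and model

# questions that start with 'What' but do not mention 'Django'
Question.objects.filter(
    Q(question__startswith='What') & ~Q(question__icontains='Django')
)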

The following code is the source of the Q class:

class Q(tree.Node):
    AND = 'AND'
    OR = 'OR'
    default = AND

    def __init__(self, *args, **kwargs):
        super(Q, self).__init__(children=list(args) + list(six.iteritems(kwargs)))

    def _combine(self, other, conn):
        if not isinstance(other, Q):
            raise TypeError(other)
        obj = type(self)()
        obj.connector = conn
        obj.add(self, conn)
        obj.add(other, conn)
        return obj

    def __or__(self, other):
        return self._combine(other, self.OR)

    def __and__(self, other):
        return self._combine(other, self.AND)

    def __invert__(self):
        obj = type(self)()
        obj.add(self, self.AND)
        obj.negate()
        return obj

As you can see from the code above, Q supports the 'or', 'and' and invert (negation) operators, and the default connector is AND.

Dynamic querying with Q objects:

    This is an interesting feature: we can use the operator module to build queries dynamically.

import operator
from functools import reduce  # reduce is no longer a builtin on Python 3

from django.db.models import Q
from your_app.models import your_model_object


q_list = [Q(question__startswith='Who'), Q(question__startswith='What')]
your_model_object.objects.filter(reduce(operator.or_, q_list))

Here we perform the OR operation using operator.or_.

To use the AND operation, simply execute:

your_model_object.objects.filter(reduce(operator.and_, q_list))

Q objects not only simplify complex queries; they are also very handy for dynamic filtering.

Retrieve average, minimum, maximum values from a model field using Django Aggregation

By Micropyramid django from Django community aggregator: Community blog posts. Published on May 27, 2016.

Django queries help you create, retrieve, update and delete objects. But sometimes we need aggregated values computed across objects. The simple solution is to use Django's aggregation feature. Here are some simple examples of how to use aggregation.

app/models.py

from django.db import models

class company(models.Model):
    name = models.CharField(max_length=20)
    est = models.IntegerField()  # max_length is not a valid option on IntegerField

class feature(models.Model):
    name = models.CharField(max_length=20)

class device(models.Model):
    name = models.CharField(max_length=20)
    Code = models.IntegerField()
    price = models.IntegerField()  # used by the aggregation examples below
    Company = models.ForeignKey(company)
    features = models.ManyToManyField(feature)

To perform Min, Max and Avg on a specific column:

>>> from django.db.models import Avg, Max, Min, Sum
>>> device.objects.all().aggregate(Avg('price'))
{'price__avg': 12234.0}
>>> device.objects.all().aggregate(Max('price'))
{'price__max': 587961}
>>> device.objects.all().aggregate(Min('price'))
{'price__min': 1245}

To count the children of a ForeignKey relation, annotate with Count:

>>> from django.db.models import Count
>>> Comp_Devices = company.objects.annotate(no_of_devices=Count('device'))
>>> Comp_Devices[0].no_of_devices
36
>>> Comp_Devices[0].name
u'Micro'
>>> Comp_Devices[1].no_of_devices
1

The same procedure can be followed for ManyToMany field columns:

>>> Dev_features_count = device.objects.annotate(num=Count('features'))
>>> Dev_features_count[0].num
3
>>> Dev_features_count[1].num
2

You can also perform more complex queries. For example, to find the devices which have more than two features:

>>> Dev_features = device.objects.annotate(num=Count('features')).filter(num__gt=2)
>>> Dev_features[0].num
3

Pretty Formatting JSON in the Django Admin

By pydanny's blog from Django community aggregator: Community blog posts. Published on May 26, 2016.

Recently I was writing code to interact with a third-party API. The API changes frequently, especially the data contained in responses. However, that data has to be saved and periodically needs to be audited. I wanted a data model flexible enough to handle these periodic changes without a lot of anguish, yet queryable. Since the API serves out queryable JSON, this made it a no-brainer for using django.contrib.postgres's JSONField.
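For context, the model behind the admin code below could be as simple as this (a hypothetical sketch; the post only shows the admin side):

# models.py -- hypothetical model matching the admin example below
from django.contrib.postgres.fields import JSONField
from django.db import models


class APIData(models.Model):
    data = JSONField(default=dict)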

After a little bit of work, I had data samples to play with. Quickly my admin filled with chunks of JSON that looked something like this:

{"field_12": 8, "field_16": 4, "field_6": 14, "field_7": 13, "field_18": 2, "field_2": 18, "field_4": 16, "field_15": 5, "field_9": 11, "field_3": 17, "field_8": 12, "field_11": 9, "field_17": 3, "field_10": 10, "field_0": 20, "field_1": 19, "field_13": 7, "field_5": 15, "field_14": 6}

Kind of illegible, right? And that's a simple, flat example with just 20 keys. Imagine if this were a nested dictionary with 100 or 200 fields. For reference, that's the kind of data that I had that makes this kind of display nigh useless.

So I cooked up this quick fix:

import json
from pygments import highlight
from pygments.lexers import JsonLexer
from pygments.formatters import HtmlFormatter

from django.contrib import admin
from django.utils.safestring import mark_safe

from .models import APIData


class APIDataAdmin(admin.ModelAdmin):
    readonly_fields = ('data_prettified',)

    def data_prettified(self, instance):
        """Function to display pretty version of our data"""

        # Convert the data to sorted, indented JSON
        response = json.dumps(instance.data, sort_keys=True, indent=2)

        # Truncate the data. Alter as needed
        response = response[:5000]

        # Get the Pygments formatter
        formatter = HtmlFormatter(style='colorful')

        # Highlight the data
        response = highlight(response, JsonLexer(), formatter)

        # Get the stylesheet
        style = "<style>" + formatter.get_style_defs() + "</style><br>"

        # Mark the output as safe so Django won't escape the generated HTML
        return mark_safe(style + response)

    data_prettified.short_description = 'data prettified'

admin.site.register(APIData, APIDataAdmin)

The field remains the same, but we also get a display of nicely formatted data:

Admin Pretty JSON

Much better!

There may be a package out there that does this already, perhaps even using a JavaScript library like highlight.js instead of Pygments. If not, it shouldn't be hard to create one using Cookiecutter Django Package. Let me know if you package this and I'll add it to this blog post.


See you at PyCon!

I'll be at PyCon with Audrey Roy Greenfeld. You can easily find us at the Cookiecutter booth during the main conference days or at the Cookiecutter sprint. Stop by and say hi!

Running Django with PyPy to boost performance

By Micropyramid django from Django community aggregator: Community blog posts. Published on May 26, 2016.

PyPy:

   PyPy is an alternative Python interpreter which focuses on speed and memory use. PyPy uses a JIT compiler and is a replacement for CPython. It is built using the RPython language, which was co-developed with it. The main reason to use it instead of CPython is speed.

PyPy Installation:

   Ubuntu 12.04 - 14.04: Download https://bitbucket.org/pypy/pypy/downloads/pypy3-2.4.0-linux64.tar.bz2

   Now you can uncompress the archive either somewhere in your home directory or, say, in /opt, and if you want, put a symlink from somewhere like /usr/local/bin/pypy to /path/to/pypy-5.1.1/bin/pypy. Do not move or copy the pypy executable outside the tree – put a symlink to it, otherwise it will not find its libraries.

   ArchLinux:

              pacman -Sy pypy # for python2
              pacman -Sy pypy3 # for python3

 

Let's create and run a pypy_django app:

1. Create pypy virtualenv:

     Run the following command to create a PyPy-based virtualenv:

virtualenv -p /usr/local/bin/pypy env  # if you used different path for pypy installation change the path as required

2. Install django:

     First activate the env, then run the following command to install Django:

pip install django

3. Create a django project and run using pypy interpreter.

    To create a Django project, run:

django-admin startproject pypy_django

    To run the Django project with PyPy, simply do:

./manage.py runserver

Since our virtualenv links to the PyPy interpreter, the project now runs under PyPy.
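To double-check which interpreter the virtualenv is actually using, you can ask Python itself (a quick sanity check, not from the original post):

$ python -c "import platform; print(platform.python_implementation())"
PyPy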

Now you can see a welcome page of django at http://localhost:8000/

Is Open Source Consulting Dead?

By chrism from . Published on Sep 10, 2013.

Has Elvis left the building? Will we be able to sustain ourselves as open source consultants?

Consulting and Patent Indemnification

By chrism from . Published on Aug 09, 2013.

Article about consulting and patent indemnification

Python Advent Calendar 2012 Topic

By chrism from . Published on Dec 24, 2012.

An entry for the 2012 Japanese advent calendar at http://connpass.com/event/1439/

Why I Like ZODB

By chrism from . Published on May 15, 2012.

Why I like ZODB better than other persistence systems for writing real-world web applications.

A str.__iter__ Gotcha in Cross-Compatible Py2/Py3 Code

By chrism from . Published on Mar 03, 2012.

A bug caused by a minor incompatibility can remain latent for long periods of time in a cross-compatible Python 2 / Python 3 codebase.

In Praise of Complaining

By chrism from . Published on Jan 01, 2012.

In praise of complaining, even when the complaints are absurd.

2012 Python Meme

By chrism from . Published on Dec 24, 2011.

My "Python meme" replies.

In Defense of Zope Libraries

By chrism from . Published on Dec 19, 2011.

A much too long defense of Pyramid's use of Zope libraries.

Plone Conference 2011 Pyramid Sprint

By chrism from . Published on Nov 10, 2011.

An update about the happenings at the recent 2011 Plone Conference Pyramid sprint.

Jobs-Ification of Software Development

By chrism from . Published on Oct 17, 2011.

Try not to Jobs-ify the task of software development.

WebOb Now on Python 3

By chrism from . Published on Oct 15, 2011.

Report about porting to Python 3.

Open Source Project Maintainer Sarcastic Response Cheat Sheet

By chrism from . Published on Jun 12, 2011.

Need a sarcastic response to a support interaction as an open source project maintainer? Look no further!

Pylons Miniconference #0 Wrapup

By chrism from . Published on May 04, 2011.

Last week, I visited the lovely Bay Area to attend the 0th Pylons Miniconference in San Francisco.

Pylons Project Meetup / Minicon

By chrism from . Published on Apr 14, 2011.

In the SF Bay Area on the 28th, 29th, and 30th of this month (April), 3 separate Pylons Project events.