Skip to content. | Skip to navigation

Personal tools
Log in
Sections
You are here: Home

Open Source Posts

Josh Berkus: Introducing Flexible Freeze

From Planet PostgreSQL. Published on Oct 21, 2014.

One of the things I mentioned in my series on VACUUM FREEZE was that we really needed a Postgres utility which would opportunistically freeze tables during low traffic periods. Today I'm announcing the Flexible Freeze project, our first attempt at designing such a utility.

All that's there right now is a simple Python script.  However, that script is already a useful tool, installed at multiple production sites. Here's how the script works:
  1. identify your active databases and daily/weekly low traffic periods.
  2. create a cron job which calls flexible_freeze.py with a time limit to keep it inside your low traffic window.
  3. flexible_freeze.py will loop through your tables with the oldest XIDs, freezing them until it runs out of time or out of tables
There is also a second mode, using the --vacuum switch, which does VACUUM ANALYZE on the tables with the most dead rows (according to pg_stat_user_tables).  This is to help users who have a strong high/low traffic cycle and want to make sure that regular vacuuming takes place during low traffic.  If you're running both modes, we advise doing the freeze first.

Of course, I have a tanker-truck full of desired improvements/expansions to this.  So, pull requests welcome. 

If you're more into Ruby, Wanelo has rewritten flexible freeze for Rails and incorporated it into their Postmodern tool.

Craig Ringer: Ware Yosemite? Possible PostgreSQL upgrade issues in OS X 10.10

From Planet PostgreSQL. Published on Oct 21, 2014.

I’m seeing reports of a number of issues with PostgreSQL after upgrades of OS X machines to Yosemite (OS X 10.10) that I’m concerned about, so I’m seeking more information about the experiences of PostgreSQL users who’ve done OS X 10.10 upgrades.

I can’t confirm anything yet, but back up all your databases before any upgrade to OS X 10.10. Just in case. (Of course, you do that before any upgrade, but just in case it slipped your mind this time…).

I don’t have access to a Mac because Apple’s policy prevents developers from running OS X for testing and development (or anything else) without buying physical Apple hardware and finding somewhere to run it. So I can’t test most of this myself, and I really need reports from users, or if possible, results of proactive testing by OS X users.

OS X built-in PostgreSQL deleted on update

Some OS X users appear to use the PostgreSQL version built-in to OS X for their own data, rather than installing a new PostgreSQL. Some of them, in addition to using the binaries, also use a PostgreSQL cluster (database instance) that’s created by OS X for the use of Server.app, instead of initdbing their own.

On releases prior to Yosemite the PostgreSQL provided by Apple was on the default PATH, though not necessarily running by default. It seems that on Yosemite it’s been removed; there’s no longer any /usr/bin/psql, etc. As far as I can tell Server.app now bundles PostgreSQL within the application bundle instead.

Some user reports suggest that on upgrade, the Apple-controlled databases in the new release are migrated into the new cluster managed by Server.app then the old cluster is stopped or possibly deleted – a colleage checked the upgrade script and found rm -rf /var/pgsql in it.

The PostgreSQL data directory in prior releases was /private/var/pgsql (and /var is a symlink to /private/var) or /Library/Server/PostgreSQL/Data.

The main symptom you’ll see is:

Connection refused
Is the server running locally and accepting
connections on Unix domain socket "/var/pgsql_socket/.s.PGSQL.5432"?

… but this issue is only one of many, many possible causes of that message.

OS X updater may be removing empty directories

I’m seeing a number of reports that suggest that the OS X updater may be removing empty directories. This causes problems with PostgreSQL, which expects to have an empty pg_twophase, pg_tblspc, and often pg_stat_tmp as a part of normal operation.

It looks like working around this is a simple matter of mkdiring the directories and setting appropriate permissions, but it’s a concern that it’s happening at all.

Server.app 3.2.1 upgrade issues

I’m also seeing reports that the patch release 3.2.1 for Server.app upgrades its private bundled PostgreSQL from 9.2.8 to 9.3.4, which seems to be causing some users issues.

If you’ve used the PostgreSQL bundled in Server.app to initdb a new cluster, this will render it inaccessible until you find and install a compatible PostgreSQL 9.2.

If you’ve used Server.app‘s own install then it may not preserve your databases when it upgrades. I can’t confirm the upgrade process it uses yet, and really need user tests reports for this.

Test reports and more information needed

Given these concerns, I would value reports and test results from OS X users who’re still on pre-Yosemite versions and planning to update soon.

If you’re an Apple customer, please also contact Apple support and ask them to investigate this.

So far I don’t know for sure if there’s data loss involved and I don’t have the access to investigate properly but I’m quite concerned about the preliminary indications I’m able to find.

robert berry: Monitoring Postgresql with a Background Worker

From Planet PostgreSQL. Published on Oct 20, 2014.

Monitoring Postgresql with a Background Worker

Oct 21, 2014 – Portland

pgantenna and pgsampler comprise an experimental Postgreqsql monitoring framework. This post explores how they work, and what problems they aim to solve.

Framework Overview

pgsampler is a background worker which collects data in a Postgresql cluster. It can log this data to CSV files or ship the metrics to a pgantenna instance over a tcp connection.

pgantenna is an application shipped as a Docker image which receives pgsampler data. It provides a web interface for live monitoring, checks for alerting conditions, and allows for psql access to a historical database of cluster metrics.

Motivation

There are a number of high quality monitoring and performance analysis tools for Postgresql. Many of these involve a remote service which connects to Postgresql as a regular client, or an application that parses log files.

The presented framework uses a background worker to ship statistics to a remote service. It aims to solve a grab bag of real or imagined problems discussed below. Of course, this approach presents it’s own problems and is thus best characterized as an experiment.

Live Monitoring

Data is sent from the cluster in a polling loop at second intervals. Different metrics can be tuned to desired sampling rates.

Using Postgres to Monitor Postgres

Dashboard plots and alert conditions are all written directly in SQL. For example, alert conditions are triggered whenever a cron-executed query returns a NULL in the first field in the first record. Plots are rendered with plotpg.

Historical Analysis with SQL

pgantenna provides a containerized remote cluster which stores historical data separate from transactional systems. The history is just a Postgresql database that can be queried with familiar tools.

Easy to Configure

The background worker uses direct access to identify and connect to databases automatically. Security concerns notwithstanding, this allows for very little configuration minutae to get started with comprehensive monitoring.

Close to metal

A background worker lives and dies with postmaster. One of the foundational alerting conditions is the receipt of a heartbeat from the background worker.

Extensible

Because the metrics collector is a background worker, it may prove to be able to collect data that other monitoring approaches could not reach. For example, while developing I considered several approaches to collect a notion of statements/second. I was thwarted by a reluctance to implement executor hooks or divine this information from shared memory data structures due to limited experience with Postgres internals. But is it possible, or a good idea? Maybe.

Trying it Out

This is an experimental prototype so it’s not appropriate for critical clusters.

To get started first launch a pgantenna instance which can be as simple as a single command on a system with Docker installed.

docker run -p 24831:24831 -p 80:80 no0p/pgantenna

Next install pgsampler and update postgresql.conf with an entry pointing pgsampler to the pgantenna instace.

pgsampler.output_network_host='localhost'

Hubert 'depesz' Lubaczewski: PostgreSQL + Perl + Unicode == confusion. Why?

From Planet PostgreSQL. Published on Oct 19, 2014.

Yesterday I had an interesting discussion on irc. A guy wanted to know why Perl script is causing problems when dealing with Pg and unicode characters. The discussion went sideways, I got (a bit) upset, and had to leave anyway, so I didn't finish it. But it did bother me, as for me the reasons […]

Pavel Stehule: styles for unicode borders are merged (PostgreSQL 9.5)

From Planet PostgreSQL. Published on Oct 18, 2014.

Following feature is less important for performance, but for somebody can be important for aesthetic reasons - now you can use a styles for unicode table borders. Possible styles are only two, but you can set a border, header and column style. It is a 6 combinations. Next you have a 3 styles for borders generally - so it together 18 possible combinations of psql table output:
postgres=# \pset unicode_header_linestyle double 
Unicode border linestyle is "double".
postgres=# \pset linestyle unicode
Line style is unicode.
postgres=# \l
List of databases
Name │ Owner │ Encoding │ Collate │ Ctype │ Access privileges
═══════════╪══════════╪══════════╪═════════════╪═════════════╪═══════════════════════
postgres │ postgres │ UTF8 │ en_US.UTF-8 │ en_US.UTF-8 │
template0 │ postgres │ UTF8 │ en_US.UTF-8 │ en_US.UTF-8 │ =c/postgres ↵
│ │ │ │ │ postgres=CTc/postgres
template1 │ postgres │ UTF8 │ en_US.UTF-8 │ en_US.UTF-8 │ =c/postgres ↵
│ │ │ │ │ postgres=CTc/postgres
(3 rows)

postgres=# \l
List of databases
Name Owner Encoding Collate Ctype Access privileges
═════════ ════════ ════════ ═══════════ ═══════════ ═════════════════════
postgres postgres UTF8 en_US.UTF-8 en_US.UTF-8
template0 postgres UTF8 en_US.UTF-8 en_US.UTF-8 =c/postgres ↵
postgres=CTc/postgres
template1 postgres UTF8 en_US.UTF-8 en_US.UTF-8 =c/postgres ↵
postgres=CTc/postgres
(3 rows)


postgres=# \pset border 2
Border style is 2.
postgres=# \l
List of databases
┌───────────┬──────────┬──────────┬─────────────┬─────────────┬───────────────────────┐
│ Name │ Owner │ Encoding │ Collate │ Ctype │ Access privileges │
╞═══════════╪══════════╪══════════╪═════════════╪═════════════╪═══════════════════════╡
│ postgres │ postgres │ UTF8 │ en_US.UTF-8 │ en_US.UTF-8 │ │
│ template0 │ postgres │ UTF8 │ en_US.UTF-8 │ en_US.UTF-8 │ =c/postgres ↵│
│ │ │ │ │ │ postgres=CTc/postgres │
│ template1 │ postgres │ UTF8 │ en_US.UTF-8 │ en_US.UTF-8 │ =c/postgres ↵│
│ │ │ │ │ │ postgres=CTc/postgres │
└───────────┴──────────┴──────────┴─────────────┴─────────────┴───────────────────────┘
(3 rows)

postgres=# \pset unicode_border_linestyle double
Unicode border linestyle is "double".
postgres=# \l
List of databases
╔═══════════╤══════════╤══════════╤═════════════╤═════════════╤═══════════════════════╗
║ Name │ Owner │ Encoding │ Collate │ Ctype │ Access privileges ║
╠═══════════╪══════════╪══════════╪═════════════╪═════════════╪═══════════════════════╣
║ postgres │ postgres │ UTF8 │ en_US.UTF-8 │ en_US.UTF-8 │ ║
║ template0 │ postgres │ UTF8 │ en_US.UTF-8 │ en_US.UTF-8 │ =c/postgres ↵║
║ │ │ │ │ │ postgres=CTc/postgres ║
║ template1 │ postgres │ UTF8 │ en_US.UTF-8 │ en_US.UTF-8 │ =c/postgres ↵║
║ │ │ │ │ │ postgres=CTc/postgres ║
╚═══════════╧══════════╧══════════╧═════════════╧═════════════╧═══════════════════════╝
(3 rows)

postgres=# \pset border 1
Border style is 1.
postgres=# \pset unicode_column_linestyle double
Unicode column linestyle is "double".
postgres=# \l
List of databases
Name ║ Owner ║ Encoding ║ Collate ║ Ctype ║ Access privileges
═══════════╬══════════╬══════════╬═════════════╬═════════════╬═══════════════════════
postgres ║ postgres ║ UTF8 ║ en_US.UTF-8 ║ en_US.UTF-8 ║
template0 ║ postgres ║ UTF8 ║ en_US.UTF-8 ║ en_US.UTF-8 ║ =c/postgres ↵
║ ║ ║ ║ ║ postgres=CTc/postgres
template1 ║ postgres ║ UTF8 ║ en_US.UTF-8 ║ en_US.UTF-8 ║ =c/postgres ↵
║ ║ ║ ║ ║ postgres=CTc/postgres
(3 rows)

Magnus Hagander: A few short notes about PostgreSQL and POODLE

From Planet PostgreSQL. Published on Oct 17, 2014.

The POODLE attack on https (the attack is about https, the vulnerability in SSL, an important distinction) has received a lot of media attention lately, so I figured a (very) short writeup was necessary.

The TL;DR; version is, you don't have to worry about POODLE for your PostgreSQL connections when using SSL.

The slightly longer version can be summarized by:

  • The PostgreSQL libpq client in all supported versions will only connect with TLSv1 and newer, which is not vulnerable.
  • The PostgreSQL server prior to the upcoming 9.4 version will however respond in SSLv3 (which is the vulnerable version) if the client insists on it (which a third party client can do).
  • To exploit POODLE, you need a client that explicitly does out-of-protocol downgrading. Something that web browsers do all the time, but very few other clients do. No known PostgreSQL client library does.
  • To exploit POODLE, the attacker needs to be able to modify the contents of the encrypted stream - it cannot be passively broken into. This can of course happen if the attacker can control parameters to a SQL query for example, but the control over the data tends to be low, and the attacker needs to already control the client. In the https attack, this is typically done through injecting javascript.
  • To exploit POODLE, there needs to be some persistent secret data at a fixed offset in each connection. This is extremely unlikely in PostgreSQL, as the protocol itself has no such data. There is a "cancel key" at the same location in each stream, but it is not reused and a new one is created for each connection. This is where the https attack typically uses the session cookie which is both secret and fixed location in the request header.

For a really good writeup on the problem, see this post from PolarSSL, or this one from GnuTLS.

Michael Paquier: Postgres 9.5 feature highlight: Replication slot control with pg_receivexlog

From Planet PostgreSQL. Published on Oct 15, 2014.

Introduced in PostgreSQL 9.4, pg_recvlogical has the ability to control the creation of logical replication slots from which logical changes can be streamed. Note that in the case this is a mandatory condition when using logical decoding. pg_receivexlog does not have in 9.4 any control on the physical replication slots it may stream from (to ensure that the WAL segment files this utility is looking for are still retained on the server side). This feature has been added for 9.5 with the following commit:

commit: d9f38c7a555dd5a6b81100c6d1e4aa68342d8771
author: Andres Freund <andres@anarazel.de>
date: Mon, 6 Oct 2014 12:51:37 +0200
Add support for managing physical replication slots to pg_receivexlog.

pg_receivexlog already has the capability to use a replication slot to
reserve WAL on the upstream node. But the used slot currently has to
be created via SQL.

To allow using slots directly, without involving SQL, add
--create-slot and --drop-slot actions, analogous to the logical slot
manipulation support in pg_recvlogical.

Author: Michael Paquier

This simply introduces two new options allowing to create or drop a physical replication slot, respectively --create-slot and --drop-slot. The main difference with pg_recvlogical is that those additional actions are optional (not --start option introduced as well for backward-compatibility). Be careful of a couple of things when using this feature though. First, when a slot is created, stream of the segment files begins immediately.

$ pg_receivexlog --create-slot --slot physical_slot -v -D ~/xlog_data/
pg_receivexlog: creating replication slot "physical_slot"
pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)

The slot created can then be found in the system view pg_replication_slots.

=# select slot_name, plugin, restart_lsn from pg_replication_slots ;
   slot_name   | plugin | restart_lsn
---------------+--------+-------------
 physical_slot | null   | 0/1000000
(1 row)

Then, when dropping a slot, as process can stream nothing it exits immediately, and slot is of course not more:

$ pg_receivexlog --drop-slot --slot physical_slot -v
pg_receivexlog: dropping replication slot "physical_slot"
$ psql -c 'SELECT slot_name FROM pg_replication_slots'
 slot_name
-----------
(0 rows)

Deletion and creation of the replication slot is made uses the same replication connection as the one for stream and uses the commands CREATE_REPLICATION_SLOT and DROP_REPLICATION_SLOT from the replication protocol, resulting in a light-weight implementation. So do not hesitate to refer to this code when implementing your own client applications, src/bin/pg_basebackup/streamutil.c being particularly helpful.

Hans-Juergen Schoenig: Killing proper indexing: A neat idea

From Planet PostgreSQL. Published on Oct 15, 2014.

After being on the road to do PostgreSQL consulting for Cybertec for over a decade I noticed that there are a couple of ways to kill indexing entirely. One of the most favored ways is to apply functions or expressions on the column people want to filter on. It is a sure way to kill […]

Hubert 'depesz' Lubaczewski: What logging has least overhead?

From Planet PostgreSQL. Published on Oct 14, 2014.

When working with PostgreSQL you generally want to get information about slow queries. The usual approach is to set log_min_duration_statement to some low(ish) value, run your app, and then analyze logs. But you can log to many places – flat file, flat file on another disk, local syslog, remote syslog. And – perhaps, instead of […]

Josh Berkus: New Table Bloat Query

From Planet PostgreSQL. Published on Oct 10, 2014.

To accompany the New Index Bloat Query, I've written a New Table Bloat Query.  This also involves the launch of the pgx_scripts project on GitHub, which will include most of the "useful scripts" I talk about here, as well as some scripts from my co-workers.

The new table bloat query is different from the check_postgres.pl version in several ways:
  • Rewritten to use WITH statements for better maintainability and clarity
  • Conditional logic for old Postgres versions and 32-bit platforms taken out
  • Index bloat removed, since we have a separate query for that
  • Columns formatted to be more immediately comprehensible
In the course of building this, I found two fundamentally hard issues:
  1. Some attributes (such as JSON and polygon fields) have no stats, so those tables can't be estimated.
  2. There's no good way to estimate bloat for compressed (TOAST) attributes and rows.
Also, while I rewrote the query almost entirely, I am still relying on Greg's core math for estimating table size.  Comparing this with the results of pgstattuple, I'm seeing an error of +/- 20%, which is pretty substantial.  I'm not clear on where that error is coming from, so help improving the math is very welcome!

Results look like this:

  databasename | schemaname |   tablename   | pct_bloat | mb_bloat | table_mb   
--------------+------------+-------------------+-----------+----------+----------
members_2014 | public | current_member | 92 | 16.98 | 18.547
members_2014 | public | member_response | 87 | 17.46 | 20.000
members_2014 | public | archive_member | 84 | 35.16 | 41.734
members_2014 | public | survey | 57 | 28.59 | 50.188

pct_bloat is how much of the table (0 to 100) is estimated to be dead space.  MB_bloat is how many megabytes of bloat are estimated to exist.  Table_mb is the actual size of the table in megabytes.

The suggested criteria is to list tables which are either more than 50% bloat and bigger than 10MB, or more than 25% bloat and bigger than 1GB.  However, you should calibrate this according to your own database.

Hubert 'depesz' Lubaczewski: Waiting for 9.5 – Implement SKIP LOCKED for row-level locks

From Planet PostgreSQL. Published on Oct 10, 2014.

On 7th of October, Alvaro Herrera committed patch: Implement SKIP LOCKED for row-level locks   This clause changes the behavior of SELECT locking clauses in the presence of locked rows: instead of causing a process to block waiting for the locks held by other processes (or raise an error, with NOWAIT), SKIP LOCKED makes the […]

Michael Paquier: Postgres 9.5 feature highlight: SKIP LOCKED for row-level locking

From Planet PostgreSQL. Published on Oct 10, 2014.

SKIP LOCKED is a new feature associated with row-level locking that has been newly-introduced in PostgreSQL 9.5 by this commit:

commit: df630b0dd5ea2de52972d456f5978a012436115e
author: Alvaro Herrera <alvherre@alvh.no-ip.org>
date: Tue, 7 Oct 2014 17:23:34 -0300
Implement SKIP LOCKED for row-level locks

This clause changes the behavior of SELECT locking clauses in the
presence of locked rows: instead of causing a process to block waiting
for the locks held by other processes (or raise an error, with NOWAIT),
SKIP LOCKED makes the new reader skip over such rows.  While this is not
appropriate behavior for general purposes, there are some cases in which
it is useful, such as queue-like tables.

Catalog version bumped because this patch changes the representation of
stored rules.

Reviewed by Craig Ringer (based on a previous attempt at an
implementation by Simon Riggs, who also provided input on the syntax
used in the current patch), David Rowley, and Álvaro Herrera.

Author: Thomas Munro

Let's take for example the simple case of the following table that will be locked:

=# CREATE TABLE locked_table AS SELECT generate_series(1, 4) as id;
SELECT 1

Now a session is taking a shared lock on the row created of locked_table, taking the lock within a transaction block ensures that it will still be taken for the duration of the tests.

=# BEGIN;
BEGIN
=# SELECT id FROM locked_table WHERE id = 1 FOR SHARE;
 id
----
  1
(1 row)

Now, the shared lock prevents any update, delete or even exclusive lock from being taken in parallel. Hence the following query will wait until the transaction of previous session finishes. In this case this query is cancel by the user (note that error message tells for which row this query was waiting for):

=# SELECT * FROM locked_table WHERE id = 1 FOR UPDATE;
^CCancel request sent
ERROR:  57014: canceling statement due to user request
CONTEXT:  while locking tuple (0,1) in relation "locked_table"
LOCATION:  ProcessInterrupts, postgres.c:2966

There is already one way to bypass this wait phase, by using NOWAIT with the lock taken to return an error instead of waiting if there is a conflict:

=# SELECT * FROM locked_table WHERE id = 1 FOR UPDATE NOWAIT;
ERROR:  55P03: could not obtain lock on row in relation "locked_table"
LOCATION:  heap_lock_tuple, heapam.c:4542

And now shows up SKIP LOCKED, that can be used to bypass the rows locked when querying them:

=# SELECT * FROM locked_table ORDER BY id FOR UPDATE SKIP LOCKED;
 id
----
  2
  3
  4
(3 rows)

Note that this makes the data taken actually inconsistent, but this new clause finds its utility to reduce lock contention for example on queue tables where the same rows are being access from multiple clients simultaneously.

Mark Wong: Index Overhead on a Growing Table

From Planet PostgreSQL. Published on Oct 09, 2014.

This another simple test in continuation from last time. We will start with the same lineitem table as in the previous example. We will measure the time it takes to load the same 7.2GB text file repeatedly until the table size grows to about 1TB. We create a baseline with a table that has no indexes built on it. Then repeat with a B-tree index on the l_shipdate DATE column, and again after replacing the B-tree index with a BRIN index.

axle-growth-test

Our baseline shows that as the table grows the time it takes to insert data also increases. The difference in the time that it takes to insert data when the table is near 1TB compared to when it is empty is about 12 seconds. With the B-tree index in place the difference increases to 84 seconds. Finally the change is only about 15 seconds with the BRIN index in place.

So over a 1TB growth, the overheard on inserting data into the lineitem table due to just the size of the table increases about 4.3%. B-trees increase that difference to 12.2%. While the BRIN index continues to look encouraging by only increasing the overhead to 4.2%.

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n°318633 – the AXLE project –http://www.axleproject.eu

Greg Sabino Mullane: Postgres copy schema with pg_dump

From Planet PostgreSQL. Published on Oct 09, 2014.


Manny Calavera (animated by Lua!)
Image by Kitt Walker

Someone on the #postgresql IRC channel was asking how to make a copy of a schema; presented here are a few solutions and some wrinkles I found along the way. The goal is to create a new schema based on an existing one, in which everything is an exact copy. For all of the examples, 'alpha' is the existing, data-filled schema, and 'beta' is the newly created one. It should be noted that creating a copy of an entire database (with all of its schemas) is very easy: CREATE DATABASE betadb TEMPLATE alphadb;

The first approach for copying a schema is the "clone_schema" plpgsql function written by Emanuel Calvo. Go check it out, it's short. Basically, it gets a list of tables from the information_schema and then runs CREATE TABLE statements of the format CREATE TABLE beta.foo (LIKE alpha.foo INCLUDING CONSTRAINTS INCLUDING INDEXES INCLUDING DEFAULTS). This is a pretty good approach, but it does leave out many types of objects, such as functions, domains, FDWs, etc. as well as having a minor sequence problem. It's also slow to copy the data, as it creates all of the indexes before populating the table via INSERT.

My preferred approach for things like this is to use the venerable pg_dump program, as it is in the PostgreSQL 'core' and its purpose in life is to smartly interrogate the system catalogs to produce DDL commands. Yes, parsing the output of pg_dump can get a little hairy, but that's always preferred to trying to create DDL yourself by parsing system catalogs. My quick solution follows.

pg_dump -n alpha | sed '1,/with_oids/ {s/ alpha/ beta/}' | psql

Sure, it's a bit of a hack in that it expects a specific string ("with_oids") to exist at the top of the dump file, but it is quick to write and fast to run; pg_dump creates the tables, copies the data over, and then adds in indexes, triggers, and constraints. (For an explanation of the sed portion, visit this post). So this solution works very well. Or does it? When playing with this, I found that there is one place in which this breaks down: assignment of ownership to certain database objects, especially functions. It turns out pg_dump will *always* schema-qualify the ownership commands for functions, even though the function definition right above it has no schema, but sensibly relies on the search_path. So you see this weirdness in pg_dump output:

--
-- Name: myfunc(); Type: FUNCTION; Schema: alpha; Owner: greg
--
CREATE FUNCTION myfunc() RETURNS text
    LANGUAGE plpgsql
    AS $$ begin return 'quick test'; end$$;

ALTER FUNCTION alpha.myfunc() OWNER TO greg;

Note the fully qualified "alpha.myfunc". This is a problem, and the sed trick above will not replace this "alpha" with "beta", nor is there a simple way to do so, without descending into a dangerous web of regular expressions and slippery assumptions about the file contents. Compare this with the ownership assignments for almost every other object, such as tables:

--
-- Name: mytab; Type: TABLE; Schema: alpha; Owner: greg
--
CREATE TABLE mytab (
    id integer
);

ALTER TABLE mytab OWNER TO greg;

No mention of the "alpha" schema at all, except inside the comment! Before going into why pg_dump is acting like that, I'll present my current favorite solution for making a copy of a schema: using pg_dump and some creative renaming:

$ pg_dump -n alpha -f alpha.schema
$ psql -c 'ALTER SCHEMA alpha RENAME TO alpha_old'
$ psql -f alpha.schema
$ psql -c 'ALTER SCHEMA alpha RENAME TO beta'
$ psql -c 'ALTER SCHEMA alpha_old TO alpha'

This works very well, with the obvious caveat that for a period of time, you don't have your schema available to your applications. Still, a small price to pay for what is most likely a relatively rare event. The sed trick above is also an excellent solution if you don't have to worry about setting ownerships.

Getting back to pg_dump, why is it schema-qualifying some ownerships, despite a search_path being used? The answer seems to lie in src/bin/pg_dump/pg_backup_archiver.c:

  /*                                                                                                                                                      
     * These object types require additional decoration.  Fortunately, the                                                                                  
     * information needed is exactly what's in the DROP command.                                                                                            
     */
    if (strcmp(type, "AGGREGATE") == 0 ||
        strcmp(type, "FUNCTION") == 0 ||
        strcmp(type, "OPERATOR") == 0 ||
        strcmp(type, "OPERATOR CLASS") == 0 ||
        strcmp(type, "OPERATOR FAMILY") == 0)
    {
        /* Chop "DROP " off the front and make a modifiable copy */
        char       *first = pg_strdup(te->dropStmt + 5);

Well, that's an ugly elegant hack and explains why the schema name keeps popping up for functions, aggregates, and operators: because their names can be tricky, pg_dump hacks apart the already existing DROP statement built for the object, which unfortunately is schema-qualified. Thus, we get the redundant (and sed-busting) schema qualification!

Even with all of that, it is still always recommended to use pg_dump when trying to create DDL. Someday Postgres will have a DDL API to allow such things, and/or commands like MySQL's SHOW CREATE TABLE, but until then, use pg_dump, even if it means a few other contortions.

Keith Fiske: A Small Database Does Not Mean Small shared_buffers

From Planet PostgreSQL. Published on Oct 08, 2014.

As a followup to my previous blog post, A Large Database Does Not Mean Large shared_buffers, I had some more interesting findings applying the queries in that blog post to another client recently. I assume you have read that one already and don’t repeat any of what I explained previously, so if you haven’t read that one and aren’t familiar with the pg_buffercache extension, I highly recommend you go read that one first.

Another mantra often heard in PostgreSQL circles that you usually don’t want to set shared_buffers higher than 8GB. I will admit, that for a majority of clients, that is great advice and a good starting point (and a whole lot more useful than the default 32MB). There are also issues around double-buffering and allowing the kernel to do what it can probably do better than PostgreSQL as far as managing page reads/writes (a topic way out of the scope of this blog post). But if you investigate further into how PostgreSQL is using its shared memory and what your high demand data blocks actually are, you can possibly find benefit in setting it higher. Especially when you can clearly see what PostgreSQL thinks it needs most often. Or if you can just fit the whole thing into memory, as I stated before.

The client in these examples has shared_buffers set to 24Gb and the total database size is 145GB (111GB in the primary followed by 28GB, 5GB,  270MB & 150MB). I say small in the title of this post, but both large and small are relative terms and for my typical work this is a small database. And a setting that is 17% of the total size is larger than normal, so along with being a catchy followup name, the results do fit the title.

So I ran the basic query at the end of my previous post to see what the “ideal” minimal is. I ran this several times over about a half-hour period and, unlike the databases in my previous post, it did not deviate much.

database=# SELECT pg_size_pretty(count(*) * 8192) as ideal_shared_buffers
FROM pg_buffercache b
WHERE usagecount >= 3;
 ideal_shared_buffers 
----------------------
 18 GB

Much higher than I previously encountered and with a much smaller database too. The value did deviate slightly, but it never changed from the rounded, pretty value of 18GB. So I investigated further. First the primary, 111GB database:

database=# SELECT c.relname
   , pg_size_pretty(count(*) * 8192) as buffered
   , round(100.0 * count(*) / ( SELECT setting FROM pg_settings WHERE name='shared_buffers')::integer,1) AS buffers_percent
   , round(100.0 * count(*) * 8192 / pg_relation_size(c.oid),1) AS percent_of_relation
 FROM pg_class c
 INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
 INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
 WHERE pg_relation_size(c.oid) > 0
 GROUP BY c.oid, c.relname
 ORDER BY 3 DESC
 LIMIT 10;
          relname          | buffered | buffers_percent | percent_of_relation 
---------------------------+----------+-----------------+---------------------
 group_members             | 8697 MB  |            35.4 |                73.9
 order_items               | 1391 MB  |             5.7 |               100.0
 orders                    | 1258 MB  |             5.1 |               100.0
 users                     | 812 MB   |             3.3 |               100.0
 units                     | 801 MB   |             3.3 |               100.0
 images                    | 599 MB   |             2.4 |                71.5
 group_members_user_id_idx | 481 MB   |             2.0 |                10.9
 user_list_map             | 264 MB   |             1.1 |               100.0
 products                  | 202 MB   |             0.8 |               100.0

A good amount of the large tables had a significant amount of themselves in shared buffers. I looked at the top table here to see if it may be having problems keeping its high demand usage blocks in memory

database=# SELECT pg_size_pretty(count(*) * 8192)
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE c.oid::regclass = 'group_members'::regclass
AND usagecount >= 3;
 pg_size_pretty 
----------------
 6606 MB

Actually looks ok. It’s got about 2GB of space to be able to swap out lower priority blocks for higher ones if needed. How about those next two 100% tables?

database=# SELECT pg_size_pretty(count(*) * 8192)
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE c.oid::regclass = 'order_items'::regclass
AND usagecount >= 3;
 pg_size_pretty 
----------------
 1391 MB
(1 row)

database=# SELECT pg_size_pretty(count(*) * 8192)
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE c.oid::regclass = 'orders'::regclass
AND usagecount >= 3;
 pg_size_pretty 
----------------
 1258 MB
(1 row)

I actually increased the usagecount parameter for both these tables all the way up to 5 and that only lowered the amount by a 2-3MB. So these are some pretty heavily used tables. For a client that does online order processing, this would seem to make sense for the context of this table. But it could also indicate a problem as well. This could mean there are queries doing a whole lot of sequential scans on this table and they might not need to be doing so. If that’s not something that’s readily apparent in the code accessing the database, I would then suggest turning to something like pgbadger for more in-depth query analysis to see where problems may be.

You may have noticed this doesn’t account for all the memory usage seen in the first query. Time to dive into the other databases (the 28GB one).

database=# \c mailer 
mailer=# SELECT c.relname
  , pg_size_pretty(count(*) * 8192) as buffered
  , round(100.0 * count(*) / ( SELECT setting FROM pg_settings WHERE name='shared_buffers')::integer,1) AS buffers_percent
  , round(100.0 * count(*) * 8192 / pg_relation_size(c.oid),1) AS percent_of_relation
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE pg_relation_size(c.oid) > 0
GROUP BY c.oid, c.relname
ORDER BY 3 DESC
LIMIT 10;
            relname             | buffered | buffers_percent | percent_of_relation 
--------------------------------+----------+-----------------+---------------------
 messages_pkey                  | 1769 MB  |             7.2 |                88.7
 messages                       | 1200 MB  |             4.9 |               100.0
 subject_text                   | 261 MB   |             1.1 |                41.9
 messages_mailing_id_idx        | 259 MB   |             1.1 |                15.4
 subject_text_pkey              | 104 MB   |             0.4 |               100.0
 messages_created_at_idx        | 26 MB    |             0.1 |                 1.2
 messages_recipient_id_idx      | 30 MB    |             0.1 |                 1.7
 pg_attrdef_adrelid_adnum_index | 16 kB    |             0.0 |               100.0
 pg_index_indrelid_index        | 40 kB    |             0.0 |                35.7
 pg_namespace_oid_index         | 16 kB    |             0.0 |               100.0
(10 rows)

That primary key is taking up a lot of space and almost all of it seems to be in memory. But again, how much of it is really high usage?

mailer=# SELECT pg_size_pretty(count(*) * 8192)
 FROM pg_class c
 INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
 INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
 WHERE c.oid::regclass = 'messages_pkey'::regclass
 AND usagecount >= 3;
 pg_size_pretty 
----------------
 722 MB

Not nearly as much as is in shared_buffers. So no justification for an increase here. How about the messages table?

mailer=# SELECT pg_size_pretty(count(*) * 8192)
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE c.oid::regclass = 'messages'::regclass
AND usagecount >= 5;
 pg_size_pretty 
----------------
 1200 MB

The whole thing is in very high demand! And there’s plenty of space for it it be there. The remainder of the majority of the space was a table similar to this in yet another one of the databases in the cluster.

So this PostgreSQL cluster seems to have some pretty good justification for having a shared_buffers 3x higher than what is typically suggested. It’s not actually using all of what’s available (only 18 of 24GB) and there’s still a significant amount in shared_buffers that’s got a usagecount below 3. My guidance to the client was to leave shared_buffers where it was, but to keep an eye on the tables like orders, order_items & messages. If the high usage of those tables is justified and they start increasing in size significantly, then this evaluation should be done again to see if shared_buffers should possibly be increased to keep that high demand data readily available in memory.

The pg_buffercache extension has been a great help with fine tuning one of the more important settings in PostgreSQL. Hopefully this helps clarify more how to evaluate shared_buffers usage and figuring out an ideal setting. And to be honest, I’m hoping that someone that reads this is in a position to better experiment with actually changing the shared_buffers value in situations like this to see if it really can make a difference in performance. As someone commented on my previous post, shared_buffers is a pretty invasive setting to change, not only because it requires a restart, but because you don’t want to screw up your performance on an active production machine. But you need the kind of activity that will be on an active production machine to accurately evaluate such settings. Reproducing such activity outside of production is really challenging.

So, looking for feedback and for anyone else to either validate or find fault with my experimentations so far.

 

Jim Mlodgenski: PostgreSQL Dollar Quoting

From Planet PostgreSQL. Published on Oct 07, 2014.

I recently attended a excellent meetup about Redshift and one of the comments by the presenter was the trouble of the running of the UNLOAD command. The trouble they were having was that the UNLOAD command takes an SQL statement as a parameter, but if that SQL statement has strings, you need to escape everything which makes it fairly unreadable.

We can see an example of this in PostgreSQL using the dblink extension:

SELECT *
  FROM dblink('dbname=postgres', 'SELECT * FROM test WHERE b = ''2014-02-02''')
    AS t(a int, b date);

Since Redshift is a derivative of PostgreSQL, the dollar quoting syntax also works. Dollar quoting is a non-standard way of denoting string constants, but it makes things much simpler to read.

SELECT *
  FROM dblink('dbname=postgres', $$ SELECT * FROM test WHERE b = '2014-02-02' $$)
    AS t(a int, b date);

Joshua Drake: Don't kill yourself

From Planet PostgreSQL. Published on Oct 07, 2014.

As a PostgreSQL consultant you end up working with a lot of different types of clients and these clients tend to all have different requirements. One client may need high-availability, while another needs a DBA, while yet another is in desperate need of being hit with a clue stick and while it is true that there can be difficult clients, there is no bad client.

What!!! Surely you can't be serious?

Don't call me shirley.

I am absolutely serious.

A bad client is only a reflection of a consultants inability to manage that client. It is true that there are difficult clients. They set unrealistic expectations, try to low ball you by with things like: "We can get your expertise for 30.00/hr from India" or my favorite: calling you directly when it is after hours to "chat".

How are these not bad clients? They are not bad clients because it is you that controls the relationship with the client. You as the consultant have to set proper boundaries with the client to insure that the relationship as a whole is positive and profitable. If you can't manage that relationship you have two choices:

  1. Hire someone who can
  2. Fire the client

Woah! Fire the client? Yes. Terminate the relationship with the client.

It is always amazing to me how many people can't fathom the idea of firing a client. It is always some sacred vow that a client can fire you but you are left holding the bag, somehow that bag is filled with the feces of some dog and you are expected to light it on fire and leave it on the porch of some unsuspecting high-school football coach.[1]

The counter argument to this is usually "I need the money". This is a valid argument but do you need the money so badly that you are willing to sacrifice your health or your relationships? It is astonishing how many consultants are willing to do exactly that. In the words of the legendary band Big Fun, "Suicide, don't do it"[2].

The better you manage a client, the better the relationship. Good luck!

  1. http://en.wikipedia.org/wiki/All_the_Right_Moves_(film)
  2. https://www.youtube.com/watch?v=i-w1GeH8KPU

gabrielle roth: RDS: Three weeks in

From Planet PostgreSQL. Published on Oct 06, 2014.

I’ve spent the past few weeks learning my way around Amazon’s RDS offering (specifically Postgres, and a bit of elasticache). It’s a mixed bag so far; for every feature I think “Hey, this is neat!” I find at least one or two others that are not so thrilling. One of the things that may annoy […]

Andrew Dunstan: pg_repack pitfalls

From Planet PostgreSQL. Published on Oct 06, 2014.

pg_repack is a terrific tool for allowing you to reorganize a table without needing to hold long running strong locks on the table. That means that that your normal inserts, updates and deletes can continue to run against the table while the reorganization is proceeding.

I have had clients who have run into problems with it, however. In particular, it is possible to get it wedged so that the table is inaccessible and nothing can proceed, unless you either kill the repack operation or kill what is blocking it. Here is a simple example of how to cause problems.

In session 1, do:
pg_reorg -e -t foo dbnameset

and in session 2 in psql do:
select pg_sleep(10); lock table foo; rollback;
The sleep gets us past the time when pg_reorg is setting up, and happens while it is is doing its CREATE TABLE ... AS SELECT .... When that CREATE TABLE statement finishes, both sessions will be wedged.  Session 2 will be hung because it is unable to lock the table, since pg_reorg's other session will hold a weak lock on the table. And nothing, including pg_reorg, will be able to do anything with the table.

The solution is to make sure that nothing holds or even tries to obtain any strong long running locks on the table.

One useful thing is to use the check_postgres.pl monitor script to look for things like long running transactions and processes waiting for locks.

Or you can create a more customized test to look for this exact situation.

Most importantly, you need to be aware that problems can occur, and to protect against them happening in the first place.





Kirk Roybal: DFW PUG Meetup November 5, 2014

From Planet PostgreSQL. Published on Oct 06, 2014.

Our topic for November, “If you know PostgreSQL, then you know Big Data”. The HUGEdata Tech Team and Principal Data Scientist will share an overview of our Scale out SQL database that leverages PG admin client access. We will provide a demonstration, that includes a marketing example for customer segmentation. We’ll also talk about machine data, the internet of things, and other use case for Big Data. And, we’d love interaction with the group on the challenges they’ve faced scaling Postgres and their ideas on how to position our Analytics Platform to the community.

What: HugeData and PostgreSQL

Who: Beth Lahaie and the HugeData team

When: Wednesday, November 5, 2014

Where:
Improving Enterprises
16633 Dallas Parkway Suite 110 Addison, TX 75001

DFW PUG on Meetup

Mark Wong: Loading Tables and Creating B-tree and Block Range Indexes

From Planet PostgreSQL. Published on Oct 03, 2014.

I have been looking at the new Block Range Indexes (BRIN) being developed for PostgreSQL 9.5. BRIN indexes are designed to provide similar benefits to partitioning, especially for large tables, just without the need to declare partitions. That sounds pretty good but let’s look in greater detail to see if it lives up to the hype.

How large? Here’s one data point. Using the TPC Benchmark(TM) H provided dbgen we created data for the lineitem table at the 10GB scale factor, which results in a 7.2GB text file.

We’re going to compare a couple of basic tasks. The first look will be at the impact of inserting data into a table using the COPY command. We will do a simple experiment of creating a table without any indexes or constraints on it and time how long it takes to load the lineitem data. Then repeat with a B-tree index on one column. And finally repeat again with a BRIN index instead of a B-tree index on the same column.

axle-load-testThe above bar plot shows the average times over five measurements. Our baseline of loading the lineitem table without any indexes averaged 5.1 minutes. Once a B-tree index was added to the i_shipdate DATE column, the average load time increased to 9.4 minutes, or by 85%. When the B-three index was replaced by a BRIN index, the load time only increased to 5.6 minutes, or by 11%.

The next experiment is to average how long it takes to create a B-tree index on a table that is already populated with data. Then repeat that with a BRIN index. This will be done on the same i_shipdate DATE column and repeated for a total of five measurements each.

axle-index-create-test

 

 

The B-tree index took 95 seconds to build, where the BRIN index 18 seconds to build, an 80% improvement.

That’s very encouraging. The overhead to loading data into a table from a single BRIN index is only 11%, and reduced the total load time by 40% when compared to having a B-tree index. And creating a new BRIN index takes only 20% of the time that a new B-tree index would take. We will have more experiments lined up to see where else BRIN indexes may or may not benefit us.

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n°318633 – the AXLE project – http://www.axleproject.eu

Andrew Dunstan: Towards a PostgreSQL Benchfarm

From Planet PostgreSQL. Published on Oct 03, 2014.

For years I have been wanting to set up a farm of machines, modelled after the buildfarm, that will run some benchmarks and let us see performance regressions. Today I'm publishing some progress on that front, namely a recipe for vagrant to set up an instance on AWS of the client I have been testing with. All this can be seen on the PostgreSQL Buildfarm Github Repository on a repo called aws-vagrant-benchfarm-client. The README explains how to set it up. The only requirement is that you have vagrant installed and the vagrant-aws provider set up (and, of course, an Amazon AWS account to use).

Of course, we don't want to run members of the benchfarm on smallish AWS instances. But this gives me (and you, if you want to play along) something to work on, and the provisioning script documents all the setup steps rather than relying on complex instructions.

The provisioner installs a bleeding edge version of the buildfarm client's experimental Pgbench module, which currently only exists on the "benchfarm" topic branch. This module essentially runs Greg Smith's pgbench-tools suite, gets the results from the results database's "tests" table, and bundles it as a CSV for upload to the server.

Currently the server does nothing with it. This will just look like another buildfarm step. So the next thing to do is to get the server to start producing some pretty and useful graphs. Also, we need to decide what else we might want to capture.

Michael Paquier: Postgres 9.5 feature highlight: Row-Level Security and Policies

From Planet PostgreSQL. Published on Oct 03, 2014.

Row-level security is a new feature of PostgreSQL 9.5 that has been introduced by this commit:

commit: 491c029dbc4206779cf659aa0ff986af7831d2ff
author: Stephen Frost <sfrost@snowman.net>
date: Fri, 19 Sep 2014 11:18:35 -0400
Row-Level Security Policies (RLS)

Building on the updatable security-barrier views work, add the
ability to define policies on tables to limit the set of rows
which are returned from a query and which are allowed to be added
to a table.  Expressions defined by the policy for filtering are
added to the security barrier quals of the query, while expressions
defined to check records being added to a table are added to the
with-check options of the query.

Behind this jargon is a feature that could be defined in short words as a complementary permission manager of GRANT and REVOKE that allows controlling at row level which tuples can be retrieved for a read query or manipulated using INSERT, UPDATE or DELETE. This row control mechanism is controlled using a new query called CREATE POLICY (of course its flavor ALTER POLICY to update an existing policy and DROP POLICY to remove a policy exist as well). By default, tables have no restrictions in terms of how rows can be added and manipulated. However they can be made able to accept level restriction policies using ALTER TABLE and ENABLE ROW LEVEL SECURITY. Now, let's imagine the following table where a list of employees and their respective salaries can be read (salary is an integer as this is entirely fictive situation and refers to no real situation, quoique...):

=# CREATE TABLE employee_data (id int,
       employee text,
       salary int,
       phone_number text);
CREATE TABLE
=# CREATE ROLE ceo;
CREATE ROLE
=# CREATE ROLE jeanne;
CREATE ROLE
=# CREATE ROLE bob;
CREATE ROLE
=# INSERT INTO employee_data VALUES (1, 'ceo', 300000, '080-7777-8888');
INSERT 0 1
=# INSERT INTO employee_data VALUES (2, 'jeanne', 1000, '090-1111-2222');
INSERT 0 1
=# INSERT INTO employee_data VALUES (3, 'bob', 30000, '090-2222-3333');
INSERT 0 1

Now let's set some global permissions on this relation using GRANT. Logically, the CEO has a complete control (?!) on the grid of salary of his employees.

=# GRANT SELECT, INSERT, UPDATE, DELETE ON employee_data TO ceo;
GRANT

A normal employee can have information access to all the information, and can update as well his/her phone number or even his/her name:

=# GRANT SELECT (id, employee, phone_number, salary)
   ON employee_data TO public;
GRANT
=# GRANT UPDATE (employee, phone_number) ON employee_data TO public;
GRANT

As things stand now though, everybody is able to manipulate other's private data and not only his own. For example Jeanne can update her CEO's name:

=# SET ROLE jeanne;
SET
=> UPDATE employee_data
   SET employee = 'Raise our salaries -- Signed: Jeanne'
   WHERE employee = 'CEO';
UPDATE 1

Row-level security can be used to control with more granularity what are the rows that can be manipulated for a set of circumstances that are defined with a policy. First RLS must be enabled on the given table:

=# ALTER TABLE employee_data ENABLE ROW LEVEL SECURITY;
ALTER TABLE

Note that if there are no policies defined and that RLS is enabled normal users that even have GRANT access to a certain set of operations can do nothing:

=> set role ceo;
SET
=> UPDATE employee_data SET employee = 'I am God' WHERE id = 1;
UPDATE 0

So it is absolutely mandatory to set policies to bring the level of control wanted for a relation if RLS is in the game. First a policy needs to be defined to let the CEO have a complete access on the table (the default being FOR ALL all the operations are authorized this way to the CEO), and luckily Jeanne just got a promotion:

=# CREATE POLICY ceo_policy ON employee_data TO ceo
   USING (true) WITH CHECK (true);
CREATE POLICY
=# SET ROLE ceo;
SET
=> UPDATE employee_data SET salary = 5000 WHERE employee = 'jeanne' ;
UPDATE 1
=> SELECT * FROM employee_data ORDER BY id;
 id | employee | salary | phone_number
----+----------+--------+---------------
  1 | ceo      | 300000 | 080-7777-8888
  2 | jeanne   |   5000 | 090-1111-2222
  3 | bob      |  30000 | 090-2222-3333
(3 rows)

Even with SELECT access allowed through GRANT, Bob and Jeanne cannot view any row so they cannot view even their own information. This can be solved with a new policy (note in this case the clause USING that can be be used to define a boolean expression on which the rows are filtered):

=# CREATE POLICY read_own_data ON employee_data
   FOR SELECT USING (current_user = employee);
CREATE POLICY
=# SET ROLE jeanne;
SET
=> SELECT * FROM employee_data;
 id | employee | salary | phone_number
----+----------+--------+---------------
  2 | jeanne   |   5000 | 090-1111-2222
(1 row)

A user should be able to modify his own information as well, and note now the WITH CHECK clause that can be used to check the validity of a row once it has been manipulated. In this case, the employee name cannot be updated to a value other than the role name (well it was better not to give UPDATE access with GRANT to this column but this would have made an example above invalid...), and the new phone number cannot be NULL (have you though that the ceo can actually set his phone number to NULL, something less flexible with CHECK at relation level):

=# CREATE POLICY modify_own_data ON employee_data
   FOR UPDATE USING (current_user = employee)
   WITH CHECK (employee = current_user AND phone_number IS NOT NULL);
CREATE POLICY
=# SET ROLE jeanne;
SET
=> UDATE employee_data SET id = 10; -- blocked by GRANT
ERROR:  42501: permission denied for relation employee_data
LOCATION:  aclcheck_error, aclchk.c:3371
=> UPDATE employee_data SET phone_number = NULL; -- blocked by policy 
ERROR:  44000: new row violates WITH CHECK OPTION for "employee_data"
DETAIL:  Failing row contains (2, jeanne, 5000, null).
LOCATION:  ExecWithCheckOptions, execMain.c:1684
=> UPDATE employee_data SET phone_number = '1-1000-2000'; -- OK
UPDATE 1

Using this new policy, Jeanne has updated her phone number, and the CEO can check that freely:

=> SET ROLE ceo;
SET
=> SELECT * FROM employee_data ORDER BY id;
 id | employee | salary | phone_number
----+----------+--------+---------------
  1 | ceo      | 300000 | 080-7777-8888
  2 | jeanne   |   5000 | 1-1000-2000
  3 | bob      |  30000 | 090-2222-3333
(3 rows)

So, while GRANT and REVOKE offer control of the actions that can be done on a relation for a set of users vertically (control of columns), RLS offers the possibility to control things horizontally for each record so when using this feature be sure to use both together and wisely.

gabrielle roth: PDXPUG: October meeting

From Planet PostgreSQL. Published on Oct 02, 2014.

When: 6-8pm Thu Oct 16, 2014
Where: Iovation
What: PgOpen Recap (gabrielle, Mark, John M);  New Relic Instrumentation of Pg Queries (Andrew)

Please note the new earlier meeting time! We’ll try this over the winter.

Two topics this month:  PgOpen attendees will discuss highlights of that conference, and Andrew will talk about some New Relic-y stuff.

Our meeting will be held at Iovation, on the 32nd floor of the US Bancorp Tower at 111 SW 5th (5th & Oak).  It’s right on the Green & Yellow Max lines.  Underground bike parking is available in the parking garage;  outdoors all around the block in the usual spots.  No bikes in the office, sorry!

Elevators open at 5:45 and building security closes access to the floor at 6:30.


Josh Berkus: JSONB and 9.4: Move Slow and Break Things

From Planet PostgreSQL. Published on Oct 02, 2014.

If you've been paying any attention at all, you're probably wondering why 9.4 isn't out yet.  The answer is that we had to change JSONB at the last minute, in a way that breaks compatibility with earlier betas.

In August, a beta-testing user reported that we had an issue with JSONB not compressing well.  This was because of the binary structure of key offsets at the beginning of the JSONB value, and the affects were dramatic; in worst cases, JSONB values were 150% larger than comparable JSON values.   We spent August through September revising the data structure and Heikki and Tom eventually developed one which gives better compressibility without sacrificing extraction speed.

I did a few benchmarks on the various JSONB types.  We're getting a JSONB which is both faster and smaller than competing databases, so it'll be worth the wait.

However, this means that we'll be releasing an 9.4beta3 next week, whose JSONB type will be incompatible with prior betas; you'll have to dump and reload if you were using Beta 1 or Beta 2 and have JSONB data.  It also means a delay in final release of 9.4.

Hubert 'depesz' Lubaczewski: Waiting for 9.5 – Row-Level Security Policies (RLS)

From Planet PostgreSQL. Published on Oct 02, 2014.

On 19th of September, Stephen Frost committed patch: Row-Level Security Policies (RLS)   Building on the updatable security-barrier views work, add the ability to define policies on tables to limit the set of rows which are returned from a query and which are allowed to be added to a table. Expressions defined by the policy […]

Craig Kerstiens: A simple guide for DB migrations

From Planet PostgreSQL. Published on Oct 01, 2014.

Most web applications will add/remove columns over time. This is extremely common early on and even mature applications will continue modifying their schemas with new columns. An all too common pitfall when adding new columns is setting a not null constraint in Postgres.

Not null constraints

What happens when you have a not null constraint on a table is it will re-write the entire table. Under the cover Postgres is really just an append only log. So when you update or delete data it’s really just writing new data. This means when you add a column with a new value it has to write a new record. If you do this requiring columns to not be null then you’re re-writing your entire table.

Where this becomes problematic for larger applications is it will hold a lock preventing you from writing new data during this time.

A better way

Of course you may want to not allow nulls and you may want to set a default value, the problem simply comes when you try to do this all at once. The safest approach at least in terms of uptime for your table –> data –> appliction is to break apart these steps.

  1. Start by simply adding the column with allowing nulls but setting a default value
  2. Run a background job that will go and retroactively update the new column to your default value
  3. Add your not null constraint.

Yes it’s a few extra steps, but I can say from having walked through this with a number of developers and their apps it makes for a much smoother process for making changes to your apps.

Payal Singh: Changing Owner of Multiple Database Objects

From Planet PostgreSQL. Published on Sep 30, 2014.

A while ago a got a task to change the owner of a group of functions. While the number of functions wasn't too high, it was still enough that I began looking at ways to change the owner in a batch, instead of having to manually change it for each function.
In case of other database objects, changing owners is fairly simple. It can be accomplished in two steps:

1. Get list of all tables/sequences/views:

payal@testvagrant:~$ psql -qAt -c "SELECT 'ALTER TABLE '||schemaname||'.'||tablename||' OWNER TO new_owner;' FROM pg_tables WHERE schemaname = 'payal'" > test.txt

This will give us the following file:

payal@testvagrant:~$ cat test.txt
ALTER TABLE payal.new_audit_users OWNER TO new_owner;
ALTER TABLE payal.v_count_states OWNER TO new_owner;
ALTER TABLE payal.test OWNER TO new_owner;
ALTER TABLE payal.old_audit_users OWNER TO new_owner;
ALTER TABLE payal.old_audit OWNER TO new_owner;
ALTER TABLE payal.adwords_dump OWNER TO new_owner;
ALTER TABLE payal.affiliate OWNER TO new_owner;
ALTER TABLE payal.new_affiliate OWNER TO new_owner;
ALTER TABLE payal.partest OWNER TO new_owner;
ALTER TABLE payal.audit_test OWNER TO new_owner;
ALTER TABLE payal.batatawada OWNER TO new_owner;
ALTER TABLE payal.dup_key_err OWNER TO new_owner;
ALTER TABLE payal.new_audit OWNER TO new_owner;


2. Now all that is needed is to run this file with psql:

payal@testvagrant:~$ psql < test.txt
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE



That simple! An alternate solution can be found here.

However, things get a little tricky with functions due to argument specifications. Basically, one needs to specify a function's arguments along with the function name to alter it. For example:

ALTER FUNCTION hstore.tconvert(text, text) OWNER TO hstore;

Using the method described above to changed owner for tables, you cannot get the function arguments from pg_proc. Instead, postgres has a function pg_get_function_identity_arguments(func_oid) that RhodiumToad told me about in #postgresql IRC channel. This function returns all arguments of a function. So, we can run a query like:

payal@testvagrant:~$ psql -qAXt -c "select 'ALTER FUNCTION ' || n.nspname || '.' || p.proname || '(' || pg_catalog.pg_get_function_identity_arguments(p.oid) || ') OWNER TO hstore;' from pg_proc p, pg_namespace n where p.pronamespace = n.oid and n.nspname = 'hstore'" -o alterfunctions.sql postgres

Which gets a list of all functions in the hstore schema with arguments:

payal@testvagrant:~$ tail -10 alterfunctions.sql
ALTER FUNCTION hstore.ghstore_compress(internal) OWNER TO hstore;
ALTER FUNCTION hstore.ghstore_decompress(internal) OWNER TO hstore;
ALTER FUNCTION hstore.ghstore_penalty(internal, internal, internal) OWNER TO hstore;
ALTER FUNCTION hstore.ghstore_picksplit(internal, internal) OWNER TO hstore;
ALTER FUNCTION hstore.ghstore_union(internal, internal) OWNER TO hstore;
ALTER FUNCTION hstore.ghstore_same(internal, internal, internal) OWNER TO hstore;
ALTER FUNCTION hstore.ghstore_consistent(internal, internal, integer, oid, internal) OWNER TO hstore;
ALTER FUNCTION hstore.gin_extract_hstore(internal, internal) OWNER TO hstore;
ALTER FUNCTION hstore.gin_extract_hstore_query(internal, internal, smallint, internal, internal) OWNER TO hstore;
ALTER FUNCTION hstore.gin_consistent_hstore(internal, smallint, internal, integer, internal, internal) OWNER TO hstore;


Now we can just run this file with psql:

payal@testvagrant:~$ psql < alterfunctions.sql
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION
ALTER FUNCTION




Tomas Vondra: Examples of palloc overhead

From Planet PostgreSQL. Published on Sep 30, 2014.

This is the post I promised last week, explaining a few common issues with memory contexts. The issues mostly lead to excessive overhead, i.e. excessive usage of resources - usually memory. And that's exactly what this post is about, so whenever I say "overhead" you can read "excessive memory usage." I will also try to give advices on how to avoid those issues or minimize the impact.

As I briefly mentioned when explaining allocation set internals, there are three main sources of overhead:

  • chunk headers - for large chunks, this gets negligible
  • unused space (because of 2^N chunks) - expected ~25% for randomly sized chunks, but can get much worse
  • reuse not working efficiently - we'll see some examples how this can happen

If you haven't read that post, it's probably the right time to do that. Also, if you want to learn more about memory management and allocator implementations, there's a great post at IBM developerWorks explaining it quite well and also listing many interesting additional resources (e.g. various malloc implementations).

So let's see some usual (but somehow unexpected) examples of palloc overhead.

Allocating many tiny pieces of memory

Say we need to store a lot of small elements, a few bytes each - e.g. 64-bit integers, words in an ispell dictionary or something like that. You could do a separate palloc call for each element (and keep the pointer), but in that case you'll pay the 'header' price for each element. For example by doing this

int64 *items[1024];
for (i = 0; i < 1024; i++)
    items[i] = (int*)palloc(sizeof(int64));

you might think you allocated ~8kB of memory (1024 x 8B), but in fact this prepends each value with 16B header. So you end up with about 24kB (not counting the items array, which needs additional 8kB), which is ~3x the requested amount. If we asked for smaller values (e.g. 4B integers), the overhead would be even larger because 8B is the smallest chunk allocated by palloc (as mentioned in the previous post).

Let's assume all the elements have about the same life span / scope (i.e. can be freed at the same time), and don't need to be passed somewhere else (i.e. are used only locally). In such cases the best approach is often preallocating a large chunk of memory, and then slicing it into small pieces - essentially doing what the allocation set allocator does when slicing the block into chunks, but without the chunk headers.

Of course, that means you won't be able to call pfree on the elements, but that's OK because we assumed only local usage (if we pass one of the elements somewhere else, that code would be unaware of this). But if the elements have the same life span (as for example words in an ispell dictionary), we don't really need to call the pfree on the individual words as the dictionary gets freed at once.

This is especially trivial to do when the elements are of fixed size, and you know how many of them to expect - in that case you can just do

char * tmp = palloc(elementSize * elementCount)

tmp -> element #0
(tmp + elementSize) -> element #1
...
(tmp + k*elementSize) -> element #k

and you're done. But what to do when the elements are variable-sized, or when you don't know how many of them to expect? Well, who says you have to allocate all the elements at once? You can allocate a large chunk of memory, use it for elements until you run our of space, then allocate another one and start over.

A very simple example of this approach is compact_palloc0 method in spell.c, which is used for efficient allocation of words and other dictionary-related data (for the purpose of fulltext-search) (the following code is somewhat simplified, to make it easier to understand):

# define COMPACT_ALLOC_CHUNK 8192

/* current chunk and available space */
char * current_ptr = NULL;
Size   available_space = 0;

static void *
compact_palloc0(IspellDict *Conf, size_t size)
{
    void    *result;

    /* Need more space? */
    if (size > available_space)
    {
        current_ptr = palloc0(COMPACT_ALLOC_CHUNK);
        available_space = COMPACT_ALLOC_CHUNK;
    }

    result = (void *) current_ptr;

    current_ptr += size;
    available_space -= size;

    return result;
}

This effectively adds another (very simple) allocator on top of the AllocSet. The "large" chunks (allocated using palloc) are registered in the parent memory context and will be freed when that memory allocator gets destroyed.

Another example such "dense" allocation was recently committed into our hashjoin implementation - it's however slightly more complicated (to support some hashjoin-specific requirements). See commit 45f6240a for more details.

Allocating pieces that are not 2^N bytes

The other trap you may fall into is the 2^N sizing of chunks. There are plenty of ways how to "achieve" that - I'll show you a real-world example from my quantile extension. I already fixed it, so you have to look at commit c1e7bba9 to see it.

Let's say you're writing an aggregate function MEDIAN(float) - to achieve that, you need to collect the float values into the aggregate state (so that you can later sort them and choose the "middle" value, which is the median). The aggregate state might be represented by this structure

typedef struct median_state {
    int nelements;
    int next;
    float * elements;
} median_state;

The float elements are accumulated into 'elements' array, which is resized on the fly as needed. The size of the array (including unusued part) is tracked by 'nelements' and 'next' is index of the next available element. So when next == nelements happends, we need to enlarge the array, which is done like this:

#define SLICE_SIZE 5
...

if (state->nelement == 0) {
    state->nelements = SLICE_SIZE;
    state->elements = (float*)palloc(sizeof(float)*data->nelements);
} else if (state->next > state->nelements-1) {
    state->nelements = state->nelements + SLICE_SIZE;
    state->elements = (float*)repalloc(state->elements,
                                       sizeof(float)*data->nelements);
}

Can you spot the problem? It's pretty obvious :-/

Well, every time we call repalloc, the 'elements' array grows by 20 bytes (SLICE_SIZE * 4B). That's rather pointless, because it either fits into the current chunk (and then it's almost free), or the array has to be moved to a larger chunk. But it does not save any memory at all.

Moreover there's a small inefficiency, because 20 is not a divisor of chunk sizes (following the 2^N rule). This wastes a bit of memory, but for larger chunks this is negligible.

It however clearly shows that constant growth does not work, because it does not follow the 2^N chunk size pattern. A saner version of the resize would be about this:

#define SLICE_SIZE 8
...

if (state->nelement == 0) {
    state->nelements = SLICE_SIZE;
    state->elements = (float*)palloc(sizeof(float)*data->nelements);    
} else if (state->next > state->nelements-1) {
    state->nelements = state->nelements * 2;
    state->elements = (float*)repalloc(state->elements,
                                       sizeof(float)*data->nelements);
}

So, this is better - it starts with 8 elements (because 8 * 4B = 32B, which is 2^5 and thus follows the 2^N rule). I could have used smaller value (up to 2, because the smallest chunk is 8B). This would have about the same memory consumption as before, but it's more correct and the repalloc will be called much less frequently (exponentially less).

Now, let's see a more serious example, that I almost commited into count_distinct but luckily realized the error before doing that:

Let's say the state for the MEDIAN() aggregate was defined like this:

typedef struct median_state {
    int nelements;
    int next;
    float elements[1];
} median_state;

Placing a single-element array at the end of struct is a well known trick to define variable length structures. The resize then might look like this:

#define SLICE_SIZE 8
...

if (state->nelement == 0) {
    state->nelements = SLICE_SIZE;
    state->elements
        = (float*)palloc(offsetof(median_state,element) + sizeof(float)*data->nelements);
} else if (state->next > state->nelements-1) {
    state->nelements = state->nelements * 2;
    state->elements = (float*)repalloc(state->elements,
                         offsetof(median_state,element) + sizeof(float)*data->nelements);
}

We're still keeping the array nicely sized (still 2^N bytes), so perfect, right? Well, no. What gets allocated is the whole structure, and sadly that's always (2^N + 8B), because the two integers are part of the chunk. So we'll always get the perfect size + 2B. Which pretty much says we'll get 2x the necessary chunk size (because of the overflowing 8B). But this time we're guaranteed to waste the upper half of the chunk (except the first 8B). That kinda sucks.

There are two ways to fix this - either by allocating the array separately (which is what I did in count_distinct) or keeping the total size 2^N (and tweaking the nelements appropriately).

Creating many small contexts

Sometimes "less is more" and it certainly holds for memory contexts. A nice example is array_agg - an aggregate function that accumulates all the values into an array. The heavylifting is done by accumArrayResult in arrayfuncs.c. This piece of code performs initialization when the first value is passed to the function:

if (astate == NULL)
{
    /* First time through --- initialize */
    /* Make a temporary context to hold all the junk */
    arr_context = AllocSetContextCreate(rcontext,
                            "accumArrayResult",
                            ALLOCSET_DEFAULT_MINSIZE,
                            ALLOCSET_DEFAULT_INITSIZE,
                            ALLOCSET_DEFAULT_MAXSIZE);
    oldcontext = MemoryContextSwitchTo(arr_context);
    astate = (ArrayBuildState *) palloc(sizeof(ArrayBuildState));
    astate->mcontext = arr_context;
    astate->alen = 64;  /* arbitrary starting array size */
    astate->dvalues = (Datum *) palloc(astate->alen * sizeof(Datum));
    astate->dnulls = (bool *) palloc(astate->alen * sizeof(bool));
    astate->nelems = 0;
    astate->element_type = element_type;
    get_typlenbyvalalign(element_type,
                            &astate->typlen,
                            &astate->typbyval,
                            &astate->typalign);
}

It's slightly complicated because it sets a lot of values in the aggregate state, but apparently it preallocates space for 64 elements (astate->alen), which is 512B on 64-bit architectures. It then properly doubles the size (not shown here bor brevity). Perfect, right? What could go wrong?

Well, the first thing the code actually does is creating a dedicated memory context (for this group), and it uses ALLOCSET_DEFAULT_INITSIZE as the initial block size. And ALLOCSET_DEFAULT_INITSIZE is 8kB, so on the first palloc, this memory context allocates 8kB block. The fact that we wanted to preallocate space for only 64 elements is irrelevant - we'll get 16x that.

Now, imagine aggregation with many distinct groups. Each group will get 8kB, even though there may be a single item in the array. And we actually get bug reports related to this.

What makes this even worse is that keeping per-group memory contexts makes it impossible to reuse chunks across groups.

Gradually growing request sizes

The last issue is quite different from the previous ones, because it's about (in)efficient chunk reuse.

As I mentioned, chunks are not really freed - instead they're moved to a freelist for later reuse. So when you do palloc(50) you'll get 64B chunk, and when you do pfree(chunk) it'll be moved to a freelist and eventually used for other requests needing 64B chunks.

But what happens if the requested sizes only grow? Consider for example this example:

int i = 0, j = 0;
size_t current_size = 8;
int nchunks = 100;
int nloops = 10;
char * tmp[nchunks];

for (i = 0; i < nchunks; i++)
    tmp[i] = palloc(current_size);

for (i = 0; i < nloops; i++)
{
    current_size *= 2;
    for (j = 0; j < nchunks; j++)
        tmp[j] = repalloc(tmp[j], current_size);
}

So, how much memory is allocated at the end? There are 100 chunks, and the final chunk size is 8kB. So it has to be 800kB, right?

Actually, it's about double that, because none of the smaller chunks will ever be reused. The fact that the request sizes only grow prevents chunk reuse - the chunks will get stuck in the freelists until the memory context is destroyed.

It's however true that this overhead is bounded (it can't be higher than 100%, because it's bounded by a sum of infinite series 1/(2^k)), and most workloads actually mix requests of various sizes (making the reuse possible).

Summary

  • Don't try to be overly smart - follow the 2^N rule by allocating properly sized pieces and doubling the size when needed.
  • Where applicable, consider using dense allocation (as for example compact_palloc in spell.c).

gabrielle roth: My PgConf.EU Schedule

From Planet PostgreSQL. Published on Sep 29, 2014.

Yep, I’m headed to Madrid! I’ll be reprising my Autovacuum talk from SCALE, and am really looking forward to meeting some new folks. I’ll be helping out at the conference in some capacity, so come say hello. For reference, the conference schedule is here: http://www.postgresql.eu/events/schedule/pgconfeu2014/ Other talks I plan to attend: Wednesday: Performance Archaeology sounds […]

PlanetDjango.org shutting down

By Adomas Paltanavičius from Planet Django. Published on Mar 08, 2014.

Is Open Source Consulting Dead?

By chrism from plope. Published on Sep 10, 2013.

Has Elvis left the building? Will we be able to sustain ourselves as open source consultants?

Consulting and Patent Indemification

By chrism from plope. Published on Aug 09, 2013.

Article about consulting and patent indemnification

Python Advent Calendar 2012 Topic

By chrism from plope. Published on Dec 24, 2012.

An entry for the 2012 Japanese advent calendar at http://connpass.com/event/1439/

Why I Like ZODB

By chrism from plope. Published on May 15, 2012.

Why I like ZODB better than other persistence systems for writing real-world web applications.

A str. __iter__ Gotcha in Cross-Compatible Py2/Py3 Code

By chrism from plope. Published on Mar 03, 2012.

A bug caused by a minor incompatibility can remain latent for long periods of time in a cross-compatible Python 2 / Python 3 codebase.

In Praise of Complaining

By chrism from plope. Published on Jan 01, 2012.

In praise of complaining, even when the complaints are absurd.

2012 Python Meme

By chrism from plope. Published on Dec 24, 2011.

My "Python meme" replies.

In Defense of Zope Libraries

By chrism from plope. Published on Dec 19, 2011.

A much too long defense of Pyramid's use of Zope libraries.

Plone Conference 2011 Pyramid Sprint

By chrism from plope. Published on Nov 10, 2011.

An update about the happenings at the recent 2011 Plone Conference Pyramid sprint.

Jobs-Ification of Software Development

By chrism from plope. Published on Oct 17, 2011.

Try not to Jobs-ify the task of software development.

WebOb Now on Python 3

By chrism from plope. Published on Oct 15, 2011.

Report about porting to Python 3.

Open Source Project Maintainer Sarcastic Response Cheat Sheet

By chrism from plope. Published on Jun 12, 2011.

Need a sarcastic response to a support interaction as an open source project maintainer? Look no further!

Pylons Miniconference #0 Wrapup

By chrism from plope. Published on May 04, 2011.

Last week, I visited the lovely Bay Area to attend the 0th Pylons Miniconference in San Francisco.

Pylons Project Meetup / Minicon

By chrism from plope. Published on Apr 14, 2011.

In the SF Bay Area on the 28th, 29th, and 30th of this month (April), 3 separate Pylons Project events.

PyCon 2011 Report

By chrism from plope. Published on Mar 19, 2011.

My personal PyCon 2011 Report