Postgres Failures At Scale

A discussion of failures with postgres' default settings.

I love working at startups. My career up until this point has been exclusively in startups. Startups move fast to claim victory for the day because they know there might not be a tomorrow. As a software engineer in startups, I have to be quick with my solutions, nimble when the company pivots, and extremely humble to learn from my mistakes. And with this guide, I present postgres knowledge earned through failures.

Over the past four years, I experienced downtimes, late-night wake-up calls, "good-enough" designs, data growth, and organizational growth while managing postgres databases for my medium-size engineering department in a startup. I learned how postgres handles replication, how to quickly back up and restore databases, and how to lock up the database with schema changes. When I needed help, a Google search would turn up solutions in the detailed documentation, the community wiki, the public mailing lists, and on StackOverflow. With that abundance of readily available information, postgres makes an excellent database for a beginner. I believe postgres is a delightful database, and I would recommend it to anyone working in startups building web services.

And yet, the default settings for postgres are terrible for growing web services. Postgres is older than many of the millennials that now use the database in startups. Postgres use cases range from bank software and data warehouses to web services like Instagram and Twitch, and of course, my super-awesome-badass-perfect pet projects. My guess is that the defaults we see today are a side effect of managing sophisticated software over decades (!). Nevertheless, developers in startups move fast to ship product, and it's easy to overlook the default settings when you're trying to get shit done.

So for this article, I'd like to focus on two settings that affect web services and postgres in the context of startup growth. The two settings are:

statement_timeout

Abort any statement that takes more than the specified number of milliseconds, starting from the time the command arrives at the server from the client. If log_min_error_statement is set to ERROR or lower, the statement that timed out will also be logged. A value of zero (the default) turns this off.

lock_timeout (postgres 9.3+)

Abort any statement that waits longer than the specified number of milliseconds while attempting to acquire a lock on a table, index, row, or other database object. The time limit applies separately to each lock acquisition attempt. The limit applies both to explicit locking requests (such as LOCK TABLE, or SELECT FOR UPDATE without NOWAIT) and to implicitly acquired locks. A value of zero (the default) turns this off.
Unlike statement_timeout, this timeout can only occur while waiting for locks. Note that if statement_timeout is nonzero, it is rather pointless to set lock_timeout to the same or larger value, since the statement timeout would always trigger first. If log_min_error_statement is set to ERROR or lower, the statement that timed out will be logged.

These settings, when enabled, limit how long a query executes or waits for locks. Both default to off, allowing SQL queries and transactions to execute without interruption from postgres. As we'll see, these defaults cause significant problems as a service grows.

In the early stages of a web service, allowing queries to execute for as long as necessary isn't a problem. There isn't much traffic to the service, and there's little data. Every query running on the database is screaming fast -- even the problematic queries like pagination and counts. If you're building a mobile application to interface with the web service, the phone screen is likely to refresh without the user noticing. If you're on the web, user actions are snappy and responsive. If the webpage didn't flicker, you might not have seen a request. And with a bit of luck, your startup is attracting users. But this is like a new relationship — you're in a honeymoon phase.

Once the web service grows in users and data, query latencies climb. Snappy queries slow as the database filters ever-larger numbers of rows. The total number of queries sent to the database climbs, increasing overall resource contention. Your new developer Luke may forget to use the index, causing sequential scans that hog disk IO. New features may increase the complexity of queries, claiming more CPU and memory. Whatever the reason, the honeymoon phase fades as database usage increases. In addition, software engineers wear so many hats as the startup grows that database tuning may be left behind or forgotten. The situation may lead to the database's tipping point.

A Tipped Database

The tipping point of a database is when the database server's resources are oversubscribed due to the number and complexity of queries running concurrently. When the database reaches the tipping point, it falls over and is extremely difficult to upright. Every query's latency climbs, and the entire system slows to a crawl. Normally fast queries queue behind slower queries. The web service, in turn, responds more slowly, and slow web requests may lead to a loss of users. The events that follow a tipped database spiral out of control, and the solutions to upright a tipped database are, at best, short-term.

Pushing a database to its tipping point starts with long-running queries.

Long-Running Queries

Long-running queries (LRQs) hog database resources like disk IO, reducing the capacity for other queries on the system. The more concurrent LRQs are executing, the more time the database spends scheduling access to resources like CPU, memory, and disk IO. If enough LRQs saturate the available resources, all other queries, whether they are LRQs or not, wait on those resources. This, in turn, makes every query a long-running query. And since statement_timeout is disabled, long-running queries beget more long-running queries.

There are two strategies to bring a tipped database back to normal. The first is to stop sending new LRQs to the database and allow the existing LRQs to complete. This slowly brings the database back to expected latencies as the current LRQs finish. The second strategy is to cancel the LRQs with pg_cancel_backend() or pg_terminate_backend(). Canceled queries release shared resources, allowing other queries to execute.
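
If you need to hunt down LRQs by hand, pg_stat_activity is the place to look. Here's a minimal sketch; the five-minute threshold and the pid are placeholders -- adjust both to your workload:

-- Find active queries running longer than five minutes
SELECT pid, now() - query_start AS runtime, state, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes'
ORDER BY runtime DESC;

-- Cancel one offender by pid; pg_terminate_backend(pid) drops the connection entirely
SELECT pg_cancel_backend(12345);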

Both strategies are at best band-aids. Attempting the first strategy of disabling LRQs may mean disabling a feature and running at a degraded service level. Worse, if the feature is critical to the web service (like a payment transaction), then a downtime event might be more appropriate.

The second strategy of canceling the LRQs is a never-ending battle with a tipped database. If the web service is online while you terminate queries, then the subset of users with canceled LRQs will see error messages. But, like the first strategy, if the canceled LRQs are critical to the web service, a downtime event might be more appropriate.

Both strategies are short-term bandaids that stop the bleeding. The problem is systemic. If you patch the problem with the bandaid solutions, the service returns to normal (and hopefully your users do too). When traffic increases again, the database will again reach its tipping point. Once tipped, you're bleeding again. A long-term fix is required to bring the service back to normal levels of operation.

But, what about your team during the incident?

Your team is likely in a meeting room, now dubbed "TEH War Room," debugging the tipped database. They scramble to find the off switch that disables the beautiful, yet fear-inducing, automated deploy system. At the same time, your manager yells at your teammate to stop committing code -- a teammate with a propensity to ship code as soon as the happy path is complete. Another teammate is dashing across an office of densely packed desks to call over the infrastructure/DBA/SRE/DevOps folks to help. Yet another teammate is downloading slow query logs to hunt for the one slow query that ruined them all. All the while, the customer support, business development, marketing, product management, non-infrastructure operations, and sales teams are asking, "Are we up yet?" Sigh.

Hopefully, a solution is found. Because if LRQs delay other queries, the database eventually reaches the next error.

FATAL: sorry, too many clients already.

I hate this error. To beginners, the apparent solution leads to problems at scale: you cannot simply increase max_connections on a postgres database and move on, as you might with other databases. Instead, the postgres wiki suggests using a connection pooler to solve this error. A pooler lets you properly scale your web servers horizontally without exhausting database connections.

So now you've scaled up in traffic, and the connection pooler hums along smoothly, handling a larger number of queries. But then the database tips. Your team is frantically debugging the issue while your service is in a degraded state. Suddenly, FATAL: sorry, too many clients already. messages appear in your logs. And now your service is throwing 500s. What happened?

LRQs force the connection pooler to hold an active connection to the database. And in a tipped database, the more LRQs are running, the fewer connections remain available. When every possible connection is held by an LRQ, new queries sent to the database fail with the message FATAL: sorry, too many clients already. You've maxed out the available connections to the database.
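
A quick sketch to see how close you are to that ceiling, comparing connections in use against the configured limit:

SELECT count(*) AS connections_in_use,
       current_setting('max_connections') AS max_connections
FROM pg_stat_activity;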

This causes two problems for the web service. First, web requests holding an active LRQ connection respond slowly -- well, because it's an LRQ. Lucky (unlucky) you if you sent one of those requests. Second, all other web requests executing new queries hit database connection errors because the connection pooler fails to connect to the database. With the postgres default of 100 connections, 100 web requests will be slow, while every other web request gets an immediate 500 response, assuming your developers didn't account for a failed database in their web service.

Unlike LRQs, where the web service is responding, maxing out connections is an immediate downtime event. Luckily, your development team has the opportunity to manage this, just like the previous situation! Joy!

Zombie Long-Running Queries

In the previous section, we discussed how the default statement_timeout affects tipped databases with LRQs by maxing available connections. Unfortunately, I conveniently failed to mention the best practice of timing out web requests, so let's discuss this now.

A slow web request hogs application server resources, reducing the capacity for other requests. Load balancers queue new requests behind slow requests, like you do when you line up behind a person with 500 grocery items at your local supermarket. This leads to slow responses for all users of your web service. Therefore, it's best practice to terminate long web requests. For instance, the popular deployment service Heroku terminates requests after 30 seconds. If you're not on Heroku, there are similar settings for application servers, like unicorn or Rack::Timeout for ruby, gunicorn for python, and net/http server timeouts for go.

Unfortunately, terminated web requests cause zombie queries. A zombie query is a query still running on the postgres database after the client has stopped waiting for the results. This happens because postgres only cancels a query when it fails to deliver results back to the client. In other words, only after the query has finished executing (loading data into memory, filtering results, sorting, etc.) will postgres notice the client has disconnected. Up until that point, postgres blindly consumes whatever resources the query needs. Knowing this, zombie LRQs artificially reduce available connections and starve existing web handlers from executing queries.

If the tipped database stays tipped long enough, the web service will face another downtime event. Luckily, your development team should now have the experience of handling a downtime, right?

Locks

statement_timeout applies to all queries in a database, and it overlaps with lock_timeout, the next setting to discuss that impacts web services. lock_timeout applies when a query attempts to acquire a lock. Let's dig into what locks are and the effects of too many locks.

Locks enable postgres to serialize access to database objects, just like the lock on the door serializes entrance to the bathroom in the bar across the street from my work. When a postgres client acquires a lock on an object, all other clients that want to acquire a lock on that object wait. The time a client waits for a lock is controlled by lock_timeout -- the default allows indefinite waiting. And waiting is problematic at scale.

Lock Mechanics 101

Let's zoom into the mechanics of locks. Let's say we have transaction A that holds a lock on the users table. Now, let's say transaction B attempts to acquire a lock on the users table. In this scenario, transaction B is forced to wait until transaction A releases the lock on the users table. Pretty basic mechanics, right?
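
If you want to watch this happen, here's a minimal sketch using two psql sessions; the explicit LOCK TABLE stands in for whatever statement transaction A happens to be running, and the lock mode is chosen so the block is guaranteed:

-- Session 1 (transaction A): grab a lock and hold it by leaving the transaction open
BEGIN;
LOCK TABLE users IN ACCESS EXCLUSIVE MODE;

-- Session 2 (transaction B): blocks until session 1 commits or rolls back
BEGIN;
LOCK TABLE users IN ACCESS EXCLUSIVE MODE;

-- Back in session 1: release the lock and unblock session 2
COMMIT;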

Zooming out, if a web service typically creates B-like transactions at high concurrency (not uncommon if you have code that uses SELECT FOR UPDATE), then when a slow A-like transaction comes along that acquires a lock, this causes all the B-like transactions to wait until the A-like transaction completes. The longer B-like transactions are waiting, the more active connections the connection pooler has to maintain to the database. The more active connections used, the faster the web service sees FATAL: sorry, too many clients already.

A real-life example of this is running this statement on postgres (< 11):

ALTER TABLE users
ADD COLUMN status TEXT DEFAULT 'pending';

For this statement, we're adding a new column status with a default value of 'pending'. The ALTER TABLE statement acquires a lock on the users table and then rewrites the whole table to ensure the status column is populated with the default. If you have a large users table, this statement holds the lock on users for a long time. This is an A-like transaction. If B-like transactions from your web service execute after the ALTER statement, then your web service rapidly marches (in seconds) towards connection exhaustion. Prep TEH War Room!

Lock Mechanics 102

Let's zoom back in at a different angle to look at the mechanics of locks. The previous example used a simplistic definition of locks -- when an object is locked, no other access to that object is allowed (like the single bathroom analogy). But, postgres has different modes of locks for various purposes. If I'm keeping with the bathroom analogy, and hell yes, I'm keeping to it, then different lock modes are like a larger bathroom that has a common area to wash your hands and a stall to do your business. These bathrooms handle more concurrency -- one person can use the stall while another person uses the sink. Postgres locks enable a similar concurrency with access to data.

Postgres locks range from read-only access to data (e.g., a simple SELECT * FROM table LIMIT 1) to exclusive access to data. The difference between the lock modes, other than semantics, is whether one mode conflicts with another. For instance, the query SELECT * FROM users LIMIT 1 acquires an ACCESS SHARE lock. ACCESS SHARE locks do not conflict with one another, which allows multiple SELECT statements to acquire this lock on the same table. But an ALTER TABLE query acquires an ACCESS EXCLUSIVE lock, which conflicts with every other lock mode available to postgres. With an ACCESS EXCLUSIVE lock held, a SELECT query that needs the ACCESS SHARE lock is forced to wait until the ALTER TABLE query completes. Conflicting locks block. You can learn about the conflicts in the postgres documentation on explicit locking.

Now we can discuss the mechanics of locks with this new information. Going back to our example, transaction A holds the SHARE UPDATE EXCLUSIVE lock on the users table. Transaction B attempts to acquire an ACCESS EXCLUSIVE lock on users. Since ACCESS EXCLUSIVE conflicts with all other locks, transaction B is now blocked and waiting.

Now let's add a transaction C that attempts to acquire the ACCESS SHARE lock on the users table. Transaction C does not conflict with transaction A's SHARE UPDATE EXCLUSIVE lock, but it does conflict with transaction B's requested ACCESS EXCLUSIVE lock mode.

Does transaction C gain access?

Nope. Transaction C is blocked by transaction B, which is blocked by transaction A.

Typically, the ACCESS SHARE lock mode and the SHARE UPDATE EXCLUSIVE lock mode do not conflict with one another, and postgres allows concurrent access to the same object (the users table). But postgres implements its locking with a queue in front of the conflict check (at least, that's how I think it's implemented based on my testing -- I haven't read the source code to confirm). The queue ensures access is serialized (i.e., transaction A, then B, then C). So if the front of the queue is a transaction waiting for a lock (i.e., transaction B), then every transaction behind it must also wait (i.e., transaction C).
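
You can reproduce the queue with three psql sessions; the lock modes below mirror the A/B/C example, and the open transactions are what keep the locks held:

-- Session A: autovacuum-style lock, held open
BEGIN;
LOCK TABLE users IN SHARE UPDATE EXCLUSIVE MODE;

-- Session B: schema-change-style lock; queues behind A
BEGIN;
LOCK TABLE users IN ACCESS EXCLUSIVE MODE;

-- Session C: a plain read; ACCESS SHARE doesn't conflict with A,
-- but it queues behind B and blocks anyway
SELECT * FROM users LIMIT 1;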

In Lock Mechanics 101, we discussed the mechanics of locks at the front of the queue (A holds the lock, B waits). But as we see in this deep dive, locks gain access to database objects serially. The reason I bring this up is that it might seem easy to skirt around locks if they were as simple as described in Lock Mechanics 101. You could implement everything without requiring high-conflict lock modes. And maybe your web service works without high-conflict lock modes. But there is a real-life scenario where the example described in this lesson shows up: postgres autovacuums and schema changes.

The postgres autovacuum daemon uses the SHARE UPDATE EXCLUSIVE lock mode. One crucial task the autovacuum daemon handles is cleaning up rows to prevent transaction ID wraparound failures, which means you should not cancel this job if it is running. If you execute a schema change, say the same ALTER TABLE statement from Lock Mechanics 101, the database will rapidly exhaust its connections because of the waiting mechanics described in this lesson.
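
When you suspect this is happening in production, pg_stat_activity can show who is waiting on whom. A sketch for postgres 9.6+, where pg_blocking_pids() does the legwork:

-- Which sessions are blocked, and by which pids?
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       wait_event_type,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;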

Ask me if I have lock_timeout set for all my schema changes. Go ahead. Ask me!

Fuck yea I have lock_timeout set! You think I love getting screamed at by every department at my company?


Early Warning System

I end this discussion of postgres failures at scale with a strategy that has worked for me and the startups I've worked in. We set up an early warning system. This strategy surfaces issues before the database reaches the tipping point, but is not a guarantee to prevent tipping entirely. If I knew of a quick and easy solution to avoid tipping, I'd start a company …

We start by setting statement_timeout and lock_timeout for the application so that the web service has query boundaries. I like setting these to double or triple the average response time for the web service (your mileage will vary depending on your workloads). Then, we use monitoring to alert when the web service raises errors due to statement timeouts or lock timeouts. Most startups in the past 15 years have grown to use monitoring religiously, so this is just a matter of adding one more alert. When the query timeout errors increase, we debug the issue like any other bug -- make a ticket and assign it to a developer. This way, we know there's an issue before the database has tipped, and we can focus our attention early.

This strategy allows a startup to concentrate on delivering product while keeping tabs on the web service and database as the company grows. Aside from the rocket-growth startups that the media loves to pay attention to, startup growth tends to look tedious and slow at the beginning, and the early warning system strategy banks on this. As the startup gains users and data, the query timeout errors increase and warn us of problems with the system. When the startup scales out the organization, a culture of debugging and fixing these errors has (hopefully) permeated the team, and old developers will (hopefully) train new developers to behave similarly.

New Defaults and Usage Guide

Time to put my money where my mouth is. What follows is how I would configure the statement_timeout and lock_timeout settings for a web service.

(Re)setting statement_timeout

The postgres setting statement_timeout can be configured in the postgresql.conf configuration file, on the user, or by the client (web service) after connecting to the database. Setting it in postgresql.conf makes all statements subject to the value — this is too aggressive and will hurt your poor operations/devops/SRE folks when they need to run maintenance or monitoring tasks. On the other hand, setting the value in application code (i.e., client-side) doesn't ensure that all connections by the postgres user have a statement_timeout -- developers could forget it or override it by accident. We want all connections made by the web service to default to a statement_timeout. So instead, we configure the setting on the postgres user the web service connects with, ensuring blanket coverage at the database.

For example, to set statement_timeout to 5 seconds for the user webservice, execute the following statement on the database as a privileged user:

ALTER ROLE webservice SET statement_timeout = 5000;

With this configuration, any query made by the webservice user will be terminated if it executes for longer than 5 seconds.
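
To confirm the role setting took effect, a couple of sanity checks (the webservice role name comes from the example above; role-level settings only apply to new connections, so existing connections keep their old value):

-- In a new session connected as webservice:
SHOW statement_timeout;

-- Or from any session, list the role's per-role defaults:
SELECT rolname, rolconfig FROM pg_roles WHERE rolname = 'webservice';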

Client Overrides for statement_timeout

It may not be possible to have every query run faster than the postgres user's statement_timeout. For instance, if there's a web request designed to return a large amount of data a few times a day, the statement_timeout might terminate its query prematurely. For those cases, we can set statement_timeout before the query is sent to the server, then reset the setting after the query completes.

BEGIN;
SET LOCAL statement_timeout = 20000; 
SET LOCAL idle_in_transaction_session_timeout = 20000; -- postgres 9.6+
SELECT * FROM mylargetable INNER JOIN mylargertable USING (id); -- join key is illustrative
COMMIT; -- ROLLBACK is fine too

The override process described uses a transaction (BEGIN) and overrides the setting with SET LOCAL. SET LOCAL ensures the statement_timeout is reset after the transaction completes (whether COMMIT or ROLLBACK). The reason for the transaction is to guard against leaking the increased statement_timeout to other queries.

If you’re using postgres 9.6+, then you can also guard against transactions that are idle and open by configuring the idle_in_transaction_session_timeout as described above. If you’re not on 9.6+, then you need to resort to a cron or cron-like tool that cancels idle in transaction queries every few minutes. Or you can upgrade to postgres 9.6+.
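
For the pre-9.6 cron route, the job can be a single statement that terminates connections stuck idle in a transaction past some threshold; the ten-minute cutoff here is a placeholder:

-- Run periodically: terminate connections idle in a transaction for too long
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > interval '10 minutes';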

There is an alternative method that I don't recommend. This method uses SET and RESET without a transaction. For instance, instead of sending five queries above, you could send three queries:

SET statement_timeout = 20000;
SELECT * FROM mylargetable INNER JOIN mylargertable USING (id); -- join key is illustrative
RESET statement_timeout;

This is likely to cause leaked configuration if you're connecting through a connection pooler, and the web service terminates a request before the web handler could issue the RESET statement. In this situation, the connection pooler now has an active connection to the database with a statement_timeout set to 20 seconds.

(Re)setting lock_timeout

If you're on postgres 9.3 or greater, lock_timeout protects against queries waiting indefinitely for locks. lock_timeout should be shorter than statement_timeout, because statement_timeout applies to all queries and will cancel queries that wait for locks longer than statement_timeout. Also, as discussed, locks are notorious for exhausting database connections rapidly, so a smaller value for lock_timeout is better.

Set lock_timeout, like statement_timeout, on the user the web service connects with.

ALTER ROLE webservice SET lock_timeout = 1500;

This setting really depends on your application, and you should revisit it often. The lower you push the value, the better.

Client Overrides for lock_timeout

Overriding lock_timeout on the client is similar to statement_timeout: use a transaction and SET LOCAL.

BEGIN;
SET LOCAL lock_timeout = 500; 
SET LOCAL idle_in_transaction_session_timeout = 20000; -- postgres 9.6+
LOCK mytable;
-- ... more queries ... 
COMMIT; -- ROLLBACK is fine too