Postgres deadlocks debugging guidelines, tips and tricks

This is the second post in a small series dedicated to analysing and dealing with deadlocks in PostgreSQL. It provides a general workflow to handle deadlock bugs, along with details on how to get the relevant information out of PostgreSQL.

If you need a gentler introduction, take a look at the first article in this series.

Debugging a deadlock – a workflow

If you have to deal with a deadlock in a complex application, you will usually face a defect that appears in a time-dependent, load-dependent and user-dependent way. It’s all too easy to get lost in hypotheses! To avoid that, work methodically and follow a strategy.

One such strategy is described below:

  1. look at Postgres logs first and try to get a rough idea of what is going on;
  2. determine what was going on at the exact time the deadlock happened. If you have a Web application, which page(s) were being loaded or submitted? If there are background jobs, were they running? If you have an API, were there clients using it? etc. Check all available logs as thoroughly as possible around the time the deadlock happened;
  3. if at all possible, try to trigger the problem in a reproducible way using the information collected above. In some cases this might not be possible, yet every effort should be made to get the problem to happen again predictably, or you will not be able to validate any solution!
  4. based on the information collected above, determine what transactions were in progress, what queries were completed before the deadlock and what queries were running at the time of the deadlock. In a big codebase this can be daunting, and ORMs such as Hibernate complicate things a lot by issuing automatically generated queries behind the scenes, yet at some point you must determine the active set of queries to have any chance of solving the problem;
    • prototyping might help. You can, for example, attempt to write one or more test cases that exercise relevant code paths in your application and run them concurrently in order to better isolate affected components – see if you can get them deadlocking and check that the stack trace corresponds to the original. This has the advantage of automatically issuing the relevant queries to the RDBMS, which you can then figure out from the logs (more on that in the next article);
  5. at this point, you should now ideally be able to reproduce the issue by manually issuing queries into different pgAdmin III windows or psql terminals;
  6. from that, you can identify problematic tables and queries, and by this time it should not be hard to come up with a solution. Most probably you will want to change the ordering of queries or, if you can’t change the ordering, use SELECT ... FOR UPDATE to acquire locks up front on whatever rows you need to change later (see the sketch below).
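
As a sketch of what that last remedy looks like in practice (the accounts table and its columns are made up for illustration), the idea is to acquire all the needed row locks up front and in a consistent order, so that two concurrent transactions can no longer end up waiting on each other:

BEGIN;
-- take the row locks explicitly, always in the same order (here: by ascending id)
SELECT id FROM accounts WHERE id IN (1, 2) ORDER BY id FOR UPDATE;
-- now the actual changes can happen in any order without risking a deadlock
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;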

How to do all that? Let’s see some practical tips.

Debugging a deadlock – interpreting logs

Typical deadlock error output, which can be found in /var/lib/pgsql/data/pg_log/*.log, looks like this:

ERROR:  deadlock detected
DETAIL:  Process 21535 waits for AccessExclusiveLock on relation 342640 of database 41454; blocked by process 21506.
Process 21506 waits for AccessExclusiveLock on relation 342637 of database 41454; blocked by process 21535.
HINT:  See server log for query details.
CONTEXT:  SQL statement "UPDATE ..."

Notable points:

  • AccessExclusiveLock is the kind of lock involved, in this case a rather strict one. More on lock types later;
  • “relation 342640” and “relation 342637” refer to internal Postgres Object Identifiers. You can retrieve legible table names with the following query:
SELECT 342640::regclass, 342637::regclass;
 regclass | regclass
----------+----------
 b        | a
(1 row)
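  • the database OID in the same message (41454 above) can be looked up in a similar way, through the pg_database catalog:
SELECT datname FROM pg_database WHERE oid = 41454;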
  • the reported SQL statement after CONTEXT is the one that has been interrupted and killed by Postgres. If there is more than one, they represent the full “stack trace” of functions that were running at the time of the kill;

In this case the situation is pretty clear: the transaction in process 21535 wants to acquire an exclusive lock on table b, which is locked by a second transaction in process 21506, which in turn wants to lock a.

That’s not the most common form of deadlock output, though. In most cases logs are more cryptic:

ERROR:  deadlock detected
DETAIL:  Process 5455 waits for ShareLock on transaction 44101; blocked by process 27458.
  Process 27458 waits for ShareLock on transaction 43490; blocked by process 5455.
  Process 5455: select * from rhn_channel.subscribe_server($1, $2, 1, $3) as result
  Process 27458: select * from rhn_channel.update_needed_cache($1) as result
HINT:  See server log for query details.
CONTEXT:  SQL statement "UPDATE ..."

Notable points:

  • in DETAIL you can find the top-level queries that originated this deadlock, while CONTEXT shows the stack trace (omitted here for brevity);
  • the locks are of a different type than in the case above: ShareLock. Again, more on that later;
  • the lock is not on a table, nor even on a row, but on a transaction.

What’s the meaning of a transaction lock?

Transaction locks are really row locks

On a logical level, Postgres has two kinds of locks, table locks and row locks. Row locks obviously only apply to specific rows, while table locks cover entire tables.
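
Both kinds can also be taken explicitly; here is a minimal sketch, reusing the tables a and b from the example above (the id column is hypothetical):

BEGIN;
-- table lock: the strictest mode, blocks any other access to a
LOCK TABLE a IN ACCESS EXCLUSIVE MODE;
-- row lock: only blocks other transactions trying to write this particular row of b
SELECT * FROM b WHERE id = 42 FOR UPDATE;
COMMIT;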

Implementation-wise, it turns out that row locks are built on top of transaction locks. This is actually an optimization, as transaction locks are less expensive than real row locks.

Here’s how it works:

  • the very first thing every transaction does when it is created is to take an exclusive lock on itself. The lock is released upon completion;
  • every table has some extra system columns used to implement MVCC; among those there is xmax;
  • xmax is an integer, which can be either 0 or a transaction ID;
  • if it is 0, then it means no transaction has UPDATEd or DELETEd that row so far, in other words it has been freshly INSERTed;
  • if it is not 0, then it is the ID of the transaction which last UPDATEd that row;
  • whenever a transaction wants to UPDATE or DELETE a row it will check its xmax first:
    • if it is 0, it can proceed by first updating the xmax value with its own ID and then the rest of the row;
    • if it is the ID of a transaction that has finished, it can proceed by overwriting xmax;
    • if it is the ID of a transaction that has not finished yet, then the row is locked and it has to wait. The wait is implemented by requesting a lock on the transaction whose ID is xmax.

Since every transaction holds an exclusive lock on itself, when the original transaction finishes it will release that lock and the row will become editable again.

Note that this “trick” only works for the first transaction that waits for a row. If two or more are waiting, those beyond the first will take a regular row lock.
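
You can see the mechanism at work with a couple of psql sessions; this is a minimal sketch (table and column names are hypothetical):

-- session 1
BEGIN;
UPDATE a SET value = value + 1 WHERE id = 1;  -- sets the row's xmax to this transaction's ID

-- session 2
BEGIN;
UPDATE a SET value = value + 1 WHERE id = 1;  -- blocks, waiting on session 1's transaction

-- session 3: the wait shows up as an ungranted ShareLock of locktype 'transactionid'
SELECT locktype, transactionid, mode, granted FROM pg_locks WHERE NOT granted;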

You can find a fuller explanation here.

Debugging a deadlock – interpreting logs (cont’d)

Now we can explain the previous example:

DETAIL:  Process 5455 waits for ShareLock on transaction 44101; blocked by process 27458.
  Process 27458 waits for ShareLock on transaction 43490; blocked by process 5455.

This really means that transaction 43490 (process 5455) wants to edit a row that transaction 44101 (process 27458) has modified and not yet committed, while transaction 44101 wants to edit a row modified by transaction 43490. Since only two transactions are involved, no “real” row locks are requested, only the less expensive transaction locks.
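
If you want to reproduce this kind of message yourself, the classic two-session experiment below will do it (again, table and column names are hypothetical):

-- session 1
BEGIN;
UPDATE a SET value = 0 WHERE id = 1;

-- session 2
BEGIN;
UPDATE a SET value = 0 WHERE id = 2;

-- session 1: blocks, waiting on session 2's transaction
UPDATE a SET value = 0 WHERE id = 2;

-- session 2: after about a second, one of the two transactions is killed with
-- "deadlock detected" and a "waits for ShareLock on transaction ..." DETAIL
UPDATE a SET value = 0 WHERE id = 1;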

How can I determine what rows are causing this problem?

In general it is difficult, as Postgres does not track that information (see above). If you know what tables and queries are involved, you can try reproducing the situation manually and take a look at the xmax values, for example:

SELECT id, name, xmax FROM rhnServer;
     id     |              name               |  xmax   
------------+---------------------------------+---------
 1000010000 | sles11swtest.lab.dus.novell.com | 2962002
(1 row)

You can retrieve the current transaction ID by running this query:

susemanager=#     SELECT txid_current();
 txid_current 
--------------
      2962002
(1 row)

In this case, you can see that rhnServer’s only row, 1000010000, was updated by the current transaction.
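
Going the other way around also works: once you have a suspect transaction ID from the logs, you can look for the rows it was the last to write. A minimal sketch (it relies on the xid-to-integer comparison that stock Postgres provides, and only finds rows whose latest writer was that very transaction):

SELECT id, name FROM rhnServer WHERE xmax = 2962002;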

Debugging a deadlock – live view on locks

If you have to figure out what locks are being requested at a given moment, your best friend is pgAdmin III’s Tools -> Server Status window (full explanation with screenshot here).

Important things to notice:

  • the Activity tab lists active processes/transactions;
  • the Locks tab lists all currently active locks.

Note that this view does not show table names, just OIDs – you can use the following query to get roughly the same information with proper table names:

SELECT virtualtransaction, relation::regclass, locktype, page, tuple, mode, granted, transactionid
  FROM pg_locks
  ORDER BY granted, virtualtransaction;
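
If you also need to know which session is holding or requesting each lock and what it is executing, you can join pg_locks with pg_stat_activity; this is a sketch assuming Postgres 9.2 or later (older releases name the columns procpid and current_query instead of pid and query):

SELECT a.pid, a.usename, l.relation::regclass, l.locktype, l.mode, l.granted, a.query
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.pid = l.pid
  ORDER BY l.granted, a.pid;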

You will probably be surprised when you check the lock types. Lock names are very confusing – for example, RowExclusive is actually the name of a table lock type! You are thus strongly encouraged to read the lock types page of the Postgres manual thoroughly before concluding that it is doing something stupid. Usually, it doesn’t.

Debugging a deadlock – each-and-every-query logging

If you can reproduce a deadlock issue but still cannot understand how it happens, you might be interested in having Postgres log all queries and all of their parameters. Needless to say, this log level is a bit extreme and looking at the results might be overwhelming; still, it can be useful.

To activate it edit /var/lib/pgsql/data/postgresql.conf and add the following line:

log_statement = 'all'

You will probably also want the Process ID that originated each query, in order to make sense of the individual flows. To get it, also change the following line:

log_line_prefix = '%t %d %u %p '
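
Here %t is the time stamp, %d the database name, %u the user name and %p the backend process ID. Both parameters can be picked up with a plain configuration reload, no restart needed; assuming you can connect as a superuser, this is enough:

SELECT pg_reload_conf();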

This is an example output:

2015-02-25 16:44:51 CET susemanager susemanager 31444 LOG:  execute S_1: BEGIN
2015-02-25 16:44:51 CET susemanager susemanager 31444 LOG:  execute <unnamed>: select * from logging.clear_log_id() as result

[...]

2015-02-25 16:44:52 CET susemanager susemanager 31445 LOG:  execute S_1: BEGIN
2015-02-25 16:44:52 CET susemanager susemanager 31445 LOG:  execute <unnamed>: select * from logging.clear_log_id() as result

Note that under normal conditions the output could well be totally unreadable. You might want to stop background tasks to begin with or, ideally, have some unit test code that can trigger the condition while everything else is stopped, for best clarity.

The next post in this series will present a case study based on SUSE Manager and its upstream project Spacewalk.

3 Comments

  • Hi,

    Let us consider the following situation: I saw a deadlock detected in yesterday’s log file.
    Now I want to know how many users were executing that particular query: was it one user with multiple sessions, or different users?
    I would also like to know how many sessions correspond to a particular user.

    BTW, I can get this from pg_stat_activity, but that only gives current statistics, while I need to know what happened at a particular point of time in the past. Is there a way to crack this?

  • Deadlock-causing conditions and users are orthogonal concepts – a deadlock might equally be caused by several users or just one.

    At least two separate sessions have to be involved; there might be more if more than two queries cause the deadlock (you should be able to see from the logs how many transactions are waiting on one another).

    The log_lock_waits parameter will make Postgres log extra information about locks that take a long time to become available, which of course includes deadlocks. That should give you more insight.

    https://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-LOCK-WAITS

    Hope this helps!

  • Thanks Silvio
