A Postgres deadlock bug case study

This is the third post in a small series dedicated to analysing and dealing with deadlocks in PostgreSQL, and provides a real bugfixing case study. Check the first article for an introduction and the second one for guidelines and techniques.

This post describes how I attempted to fix a real deadlock bug in SUSE Manager, which also affects the Spacewalk project it is based upon. Hopefully it will shed some light for anyone fixing a similar issue!

In doing this, I am roughly following the workflow described in the previous article.

Step 1, get a rough idea of what happened

First of all, let's check the Postgres logs:

2015-02-18 08:25:43 CET susemanager susemanager ERROR:  deadlock detected
2015-02-18 08:25:43 CET susemanager susemanager DETAIL:  Process 13516 waits for ShareLock on transaction 2749181; blocked by process 13916.
  Process 13916 waits for ShareLock on transaction 2749185; blocked by process 13516.
  Process 13516: select * from rhn_server.update_needed_cache($1) as result
  Process 13916: select * from rhn_channel.subscribe_server($1, $2, 1, $3) as result
2015-02-18 08:25:43 CET susemanager susemanager HINT:  See server log for query details.
2015-02-18 08:25:43 CET susemanager susemanager CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."rhnserver" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x"
  SQL statement "insert into rhnServerNeededCache
               (server_id, errata_id, package_id, channel_id)
          (select distinct sp.server_id, x.errata_id, p.id, x.channel_id
             FROM (SELECT sp_sp.server_id, sp_sp.name_id,
              sp_sp.package_arch_id, max(sp_pe.evr) AS max_evr
                     FROM rhnServerPackage sp_sp
                     join rhnPackageEvr sp_pe ON sp_pe.id = sp_sp.evr_id
                    GROUP BY sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id) sp
             join rhnPackage p ON p.name_id = sp.name_id
             join rhnPackageEvr pe ON pe.id = p.evr_id AND sp.max_evr < pe.evr
             join rhnPackageUpgradeArchCompat puac
                ON puac.package_arch_id = sp.package_arch_id
          AND puac.package_upgrade_arch_id = p.package_arch_id
             join rhnServerChannel sc ON sc.server_id = sp.server_id
             join rhnChannelPackage cp ON cp.package_id = p.id
                AND cp.channel_id = sc.channel_id
             left join (SELECT ep.errata_id, ce.channel_id, ep.package_id
                          FROM rhnChannelErrata ce
                          join rhnErrataPackage ep
                 ON ep.errata_id = ce.errata_id
        join rhnServerChannel sc_sc
                 ON sc_sc.channel_id = ce.channel_id
             WHERE sc_sc.server_id = server_id_in) x
               ON x.channel_id = sc.channel_id AND x.package_id = cp.package_id
            where sp.server_id = server_id_in)"
  PL/pgSQL function "update_needed_cache" line 5 at SQL statement

Notes:

  • we have a deadlock between two transactions;
  • one transaction was running rhn_server.update_needed_cache
    • specifically, the insert into rhnServerNeededCache statement at line 5 (counting lines from $BODY$, inclusive);
    • specifically, the Postgres-internal query SELECT 1 FROM ONLY "public"."rhnserver" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x, more on that later;
    • and it was killed
  • the other was running rhn_channel.subscribe_server and it completed;
  • later log lines like the following only tell us that subsequent commands failed because the transaction was killed (and rolled back), and can thus be ignored:

2015-02-18 08:25:43 CET susemanager susemanager STATEMENT: select * from rhn_server.update_needed_cache($1) as result
2015-02-18 08:25:43 CET susemanager susemanager ERROR: current transaction is aborted, commands ignored until end of transaction block
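Besides reading the error report after the fact, it can help to watch lock waits live while reproducing the problem. A minimal sketch using standard catalog views and settings, nothing SUSE Manager specific (log_lock_waits is superuser-only when set per session; it can also go in postgresql.conf):

-- any backend currently waiting for a lock shows up here with granted = false;
-- for transaction-level waits, the transactionid column tells which
-- transaction it is waiting for
SELECT pid, locktype, transactionid, mode, granted
FROM pg_locks
WHERE NOT granted;

-- also log every lock wait that lasts longer than deadlock_timeout (1s by default)
SET log_lock_waits = on;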


The relevant sources can be read here, in the Spacewalk project:

What was the user trying to do? Reporter says:

<moio> <detective voice="on"> so what were you doing at 2015-02-18 08:25:43, when the dead lock happened?</detective>
<Sascha> I was playing with a new client I set up to connect it to the SUSE channels and to check if I can alter the channels via the web ui
<moio> you seem not to be lying, good :-)

Indeed, with a little research in spacewalk-debug and the event history page, the following can be concluded:

  • at 08:24:35, the registration of the only system, with id 1000010000, started (source: System History page and Apache’s SSL log)
  • at 08:24:54, registration ended
  • at 08:25:04, the user hit the Systems Overview page (source: Apache’s SSL log)
  • at 08:25:11, the user hit the System Detail Overview page (source: Apache’s SSL log)
  • at 08:25:29, the user submitted via POST the System Channels form to alter channel subscriptions, in particular adding the channels with ids 117 and 118 (source: Apache’s SSL log and System History page). While this was running…
  • at 08:25:00, Taskomatic had started the per-server ErrataCacheTask, which is scheduled every minute. Since there is only one server, we can assume it was targeting 1000010000 (source: Taskomatic log)
  • at 08:25:43, Taskomatic’s ErrataCacheTask failed with a deadlock error (source: Taskomatic log)
  • at 08:25:50, the same page was requested via GET, so we can assume the POST above was successful and redirected to the page via GET (source: Apache’s SSL log)

It can safely be assumed that one of the transactions was Taskomatic’s, and that it was the one which got killed (no errors showed up in catalina.out, by the way). We can also suspect that Tomcat ran the other one, since it was executing rhn_channel.subscribe_server. A few greps later I found out that rhn_channel.subscribe_server and rhn_server.update_needed_cache are indeed basically only used in those two code paths.

Step 2, what about internal Postgres queries?

SELECT 1 FROM ONLY "public"."rhnserver" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x

This query is disturbing: search the codebase and you will not find it anywhere, so no Java, Python or Perl code can have originated it.

How about Hibernate? Maybe it automatically generated that code. To find out, you can run the Taskomatic task and the page’s code with hibernate.show_sql = true in rhn.conf, which dumps anything generated by Hibernate into the logs. Guess what? It’s not Hibernate either.

Turns out that some important parts of Postgres are implemented in plain SQL, or on top of other RDBMS features. For example, foreign key constraints are implemented with system-level triggers!
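You can see these internal triggers for yourself. A quick sketch (the table name is taken from the deadlock log above; adjust as needed):

-- list the internal referential integrity triggers attached to the child table
SELECT tgname, tgenabled
FROM pg_trigger
WHERE tgrelid = 'rhnserverneededcache'::regclass
  AND tgisinternal;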

In this particular case, rhnServerNeededCache references rhnServer and has RI_... triggers to make sure referential integrity is preserved. It turns out that one of those triggers is responsible for the SELECT above: its purpose is to lock all the rhnServer rows referenced by the rhnServerNeededCache rows we are INSERTing, so that their keys cannot change before we COMMIT.
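To see that mechanism in isolation, here is a minimal sketch with hypothetical tables (not the SUSE Manager schema) showing how an INSERT into a referencing table row-locks the referenced row:

-- hypothetical parent/child pair, for illustration only
CREATE TABLE parent (id int PRIMARY KEY, note text);
CREATE TABLE child  (parent_id int REFERENCES parent(id));
INSERT INTO parent VALUES (1, 'old');

-- session A: the RI trigger behind this INSERT locks parent row 1
-- (FOR SHARE on PostgreSQL 9.1, FOR KEY SHARE on 9.3+)
BEGIN;
INSERT INTO child VALUES (1);

-- session B: on 9.1 this blocks until session A commits, because FOR SHARE
-- conflicts with any UPDATE of the row; on 9.3+ it goes through, because a
-- non-key UPDATE only takes FOR NO KEY UPDATE, which is compatible with
-- FOR KEY SHARE
BEGIN;
UPDATE parent SET note = 'new' WHERE id = 1;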

Step 3, reconstructing transactions

OK, we now know what deadlocked, but what happened before? Any statement that ran in either transaction before the ones which actually deadlocked might have played a part in building up the set of locks which ultimately caused this issue.

This is the hardest part, as there’s no real solution other than manually following the code and writing down what tables are involved. It’s tedious, error-prone, long and hard.

What I did was to write a small Java test to be run multiple times concurrently from the IDE. Running about 10 instances in parallel helped me reproduce the problem reliably.

The test tries to replicate what the two transactions did. After that I carefully started removing code, checking at each step whether it had an influence on the deadlock or not. Removing all unrelated code led to a quite compact reproducer:

public void testDeadlock() throws Exception {
    Random r = new Random();
    // each concurrently running test instance flips a coin once, then
    // repeatedly plays either the Taskomatic role or the Tomcat role
    boolean coin = r.nextBoolean();
    while (true) {
        HibernateFactory.closeSession();
        long sid = 1000010000L;
        long c1id = 117L;
        long c2id = 118L;
        System.out.println(coin);
        if (coin) {
            // Taskomatic's ErrataCacheTask path
            updateNeededCache(sid);
            HibernateFactory.commitTransaction();
        }
        else {
            // Tomcat's System Channels page path
            LoggingFactory.clearLogId();
            subscribeServerToChannel(sid, c1id);
            subscribeServerToChannel(sid, c2id);
            HibernateFactory.rollbackTransaction();
        }
    }
}

Stopping all services and enabling full-query logging in Postgres resulted in this trace (it’s still pretty hard to read, but it can be followed).
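For reference, the “full-query logging mode” mentioned above is presumably just log_statement = 'all'; a sketch of how one might enable it (superuser required, and remember to turn it off afterwards, as it is very verbose):

-- option 1: set log_statement = 'all' in postgresql.conf, then reload it
SELECT pg_reload_conf();
-- option 2: enable it for the current session only (superuser required)
SET log_statement = 'all';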

Removing all the cruft led to this nice, pure SQL reproducer you can run in two pgAdmin III sessions:

-- first terminal, update the cache but don't commit yet. Locks rows in rhnServerNeededCache
begin;
select * from rhn_server.update_needed_cache(1000010000) as result; -- locks all rows

-- switch to the second terminal, subscribe server to first channel
begin;
select logging.clear_log_id();
select * from rhn_channel.subscribe_server(1000010000, 117) as result; -- blocks on INSERT INTO rhnServerNeededCache because rows are locked (see above)

-- back to the first terminal
commit; -- releases all locks


-- back to the second terminal, blocked statement was unblocked and finishes

-- some new rows in rhnServerNeededCache are locked because of the INSERT mentioned above
-- rhnServer row is now locked as well, because of an UPDATE in subscribe_server

-- now subscribe the server to the second channel
select * from rhn_channel.subscribe_server(1000010000, 118) as result; -- blocks on DELETE rhnServerNeededCache

-- back to the first terminal
begin;
select * from rhn_server.update_needed_cache(1000010000) as result; -- DEADLOCK!

Step 4, root causes

Multiple factors contributed to the deadlock, among them:

  • update_needed_cache implicitly locks rhnServer rows because of the FOREIGN KEY constraint trigger implementation. Yes, deadlocks on FOREIGN KEYs can happen;
  • PostgreSQL 9.1 locks the entire rhnServer row, while locking only its key columns would have been sufficient. It turns out 9.3 introduces weaker row-level lock modes (FOR NO KEY UPDATE and FOR KEY SHARE); the referential integrity system triggers now use FOR KEY SHARE, which prevents this deadlock!
  • subscribe_server calls update_needed_cache, and update_needed_cache also runs concurrently on its own. This is a typical setup for deadlocks, as the probability of interfering on the same set of tables is high, and yes, a function can happily deadlock with itself!
  • update_needed_cache uses DELETE followed by INSERT – a pattern known as upsert, which is notoriously hard to implement and use properly. In particular, the rows locked by the DELETE immediately block any other transaction that touches them, irrespective of the isolation level, by design. There are proposals to fix this in Postgres but it is distant future for now;

Crucially, update_needed_cache locks rhnServerNeededCache first (via its DELETE) and then rhnServer (via the INSERT into a child table). The Tomcat transaction, by running subscribe_server multiple times, could lock rhnServer first and only then rhnServerNeededCache. The opposite acquisition order results in the deadlock.
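In the abstract this is the classic lock-order inversion; a stripped-down sketch with hypothetical tables a and b (think of b as rhnServerNeededCache and a as rhnServer, not the real schema) deadlocks in exactly the same way:

-- hypothetical tables, for illustration only
CREATE TABLE a (id int PRIMARY KEY);
CREATE TABLE b (id int PRIMARY KEY);
INSERT INTO a VALUES (1);
INSERT INTO b VALUES (1);

-- session 1 (think update_needed_cache): locks b first
BEGIN;
SELECT * FROM b WHERE id = 1 FOR UPDATE;

-- session 2 (think subscribe_server): locks a first, then waits for b
BEGIN;
SELECT * FROM a WHERE id = 1 FOR UPDATE;
SELECT * FROM b WHERE id = 1 FOR UPDATE; -- blocks, session 1 holds b

-- session 1 again: waits for a, which session 2 holds -> deadlock
SELECT * FROM a WHERE id = 1 FOR UPDATE;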

Step 5, possible solutions

  • ostrich algorithm. This deadlock will only happen when a user changes subscribed channels while Taskomatic is rebuilding the cache, which is quite unlucky and unlikely. Even if it happens, retrying will just work;
  • explicitly lock the rhnServer row before anything else in update_needed_cache, by adding the following at the top of the function (see the sketch after this list):
select * from rhnServer where id = server_id_in for update;
  • upgrade Postgres to 9.3. Fortunately, this is coming really soon to SUSE Linux Enterprise 11, on which SUSE Manager runs!
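A sketch of the second option, assuming the explicit lock is simply added at the top of the existing function (signature and body are simplified here; this is not the actual Spacewalk code):

CREATE OR REPLACE FUNCTION rhn_server.update_needed_cache(server_id_in numeric)
RETURNS void AS $BODY$
BEGIN
    -- take the rhnServer row lock first, so that both code paths acquire
    -- their locks in the same order and cannot deadlock on them
    PERFORM 1 FROM rhnServer WHERE id = server_id_in FOR UPDATE;

    -- ... existing DELETE + INSERT cache rebuild, unchanged ...
END;
$BODY$ LANGUAGE plpgsql;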

Related bug

Bug 1063821 is very similar and affects almost the same code path.
