This is the third post in a small series dedicated to analysing and dealing with deadlocks in PostgreSQL, and provides a real bugfixing case study. Check the first article for an introduction and the second one for guidelines and techniques.
This post describes how I attempted to fix a real deadlock bug in SUSE Manager, which also affects the Spacewalk project it is based upon. Hopefully it will shed some light for anyone fixing a similar issue!
In doing this, I am roughly following the workflow described in the previous article.
Step 1, get a rough idea of what happened
First of all, let's check the Postgres logs:
2015-02-18 08:25:43 CET susemanager susemanager ERROR: deadlock detected
2015-02-18 08:25:43 CET susemanager susemanager DETAIL: Process 13516 waits for ShareLock on transaction 2749181; blocked by process 13916.
Process 13916 waits for ShareLock on transaction 2749185; blocked by process 13516.
Process 13516: select * from rhn_server.update_needed_cache($1) as result
Process 13916: select * from rhn_channel.subscribe_server($1, $2, 1, $3) as result
2015-02-18 08:25:43 CET susemanager susemanager HINT: See server log for query details.
2015-02-18 08:25:43 CET susemanager susemanager CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."rhnserver" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x"
SQL statement "insert into rhnServerNeededCache
(server_id, errata_id, package_id, channel_id)
(select distinct sp.server_id, x.errata_id, p.id, x.channel_id
FROM (SELECT sp_sp.server_id, sp_sp.name_id,
sp_sp.package_arch_id, max(sp_pe.evr) AS max_evr
FROM rhnServerPackage sp_sp
join rhnPackageEvr sp_pe ON sp_pe.id = sp_sp.evr_id
GROUP BY sp_sp.server_id, sp_sp.name_id, sp_sp.package_arch_id) sp
join rhnPackage p ON p.name_id = sp.name_id
join rhnPackageEvr pe ON pe.id = p.evr_id AND sp.max_evr < pe.evr
join rhnPackageUpgradeArchCompat puac
ON puac.package_arch_id = sp.package_arch_id
AND puac.package_upgrade_arch_id = p.package_arch_id
join rhnServerChannel sc ON sc.server_id = sp.server_id
join rhnChannelPackage cp ON cp.package_id = p.id
AND cp.channel_id = sc.channel_id
left join (SELECT ep.errata_id, ce.channel_id, ep.package_id
FROM rhnChannelErrata ce
join rhnErrataPackage ep
ON ep.errata_id = ce.errata_id
join rhnServerChannel sc_sc
ON sc_sc.channel_id = ce.channel_id
WHERE sc_sc.server_id = server_id_in) x
ON x.channel_id = sc.channel_id AND x.package_id = cp.package_id
where sp.server_id = server_id_in)"
PL/pgSQL function "update_needed_cache" line 5 at SQL statement
Notes:
- we have a deadlock between two transactions;
- one transaction was running rhn_server.update_needed_cache:
  - specifically, the insert into rhnServerNeededCache statement at line 5 (counting from $BODY$ included);
  - more precisely, the Postgres-internal query SELECT 1 FROM ONLY "public"."rhnserver" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x, more on that later;
  - and it was killed;
- the other was running rhn_channel.subscribe_server and it completed;
- later log lines like the following only tell that subsequent commands failed because the transaction was killed (and rolled back), thus they can be ignored:
2015-02-18 08:25:43 CET susemanager susemanager STATEMENT: select * from rhn_server.update_needed_cache($1) as result
2015-02-18 08:25:43 CET susemanager susemanager ERROR: current transaction is aborted, commands ignored until end of transaction block
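As a side note (not needed for this particular bug, since the deadlock report above was already in the log): when chasing locking problems it can help to also log ordinary lock waits while reproducing. This is just a minimal sketch using standard Postgres settings; the session-level SET needs superuser, and for all sessions the same parameters go into postgresql.conf instead:
-- deadlocks are always reported as ERRORs; with log_lock_waits any wait longer
-- than deadlock_timeout is logged too, which helps spotting contention early
SHOW deadlock_timeout;   -- how long Postgres waits before running the deadlock check
SET log_lock_waits = on; -- session-level; use postgresql.conf and a reload for every session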
Sources can be read here from the Spacewalk project:
What was the user trying to do? Reporter says:
<moio> <detective voice="on"> so what were you doing at 2015-02-18 08:25:43, when the dead lock happened?</detective>
<Sascha> I was playing with a new client I set up to connect it to the SUSE channels and to check if I can alter the channels via the web ui
<moio> you seem not to be lying, good :-)
Indeed with a little bit of research in spacewalk-debug and the event history page the following can be concluded:
- at 08:24:35, only one system was registered, with id 1000010000 (source: System History page and Apache’s SSL log)
- at 08:24:54, registration ended
- at 08:25:04 user hit the Systems Overview page (source: Apache’s SSL log)
- at 08:25:11 user hit the System Detail Overview page (source: Apache’s SSL log)
- at 08:25:29 user submitted via POST the System Channels form to alter channel subscriptions, in particular adding channels with ids 117 and 118 (source: Apache’s SSL log and System History page). While this is running…
- at 08:25:00 Taskomatic starts the per-server ErrataCacheTask, which is scheduled every minute. Since there is only one server, we can assume it’s targeting 1000010000 (source: Taskomatic log)
- at 08:25:43 Taskomatic’s ErrataCacheTask fails with the deadlock error (source: Taskomatic log)
- at 08:25:50 the same page is requested via GET, so we can assume the POST above was successful and redirected to the page via GET (source: Apache’s SSL log)
It can be safely assumed that one of the transactions was Taskomatic’s, and that was the one which got killed (no errors showed up in catalina.out, by the way). We can also suspect that Tomcat did the other, since the deadlocked transaction was running rhn_channel.subscribe_server. Some greps later I found out that rhn_channel.subscribe_server and rhn_server.update_needed_cache are indeed basically only used there.
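Database-side usages can be double-checked too. Here is a minimal sketch, not part of the original investigation, that searches stored procedure sources through the standard pg_proc catalog:
-- find stored procedures whose source mentions update_needed_cache,
-- complementing the greps over the Java/Python/Perl code
select n.nspname, p.proname
from pg_proc p
join pg_namespace n on n.oid = p.pronamespace
where p.prosrc ilike '%update_needed_cache%';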
Step 2, what about internal Postgres queries?
SELECT 1 FROM ONLY "public"."rhnserver" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR SHARE OF x
This query is disturbing: search the codebase and you will not find it anywhere, be it in Java, Python or Perl code.
How about Hibernate? Maybe it automatically generated that code. To find out, you can run the Taskomatic task and the page’s code with hibernate.show_sql = true in rhn.conf, which dumps any SQL generated by Hibernate to the logs. Guess what? It’s not Hibernate either.
Turns out that some important parts of Postgres are implemented in plain SQL, or on top of other RDBMS features. For example, foreign key constraints are implemented with system-level triggers!
In this particular case, rhnServerNeededCache refers to rhnServer and has RI_... triggers to make sure referential integrity is preserved. It turns out that one of those triggers is responsible for the SELECT above: its purpose is to lock all rhnServer rows corresponding to the rhnServerNeededCache rows we are INSERTing, so that their keys don’t change before we COMMIT.
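Those internal triggers can be seen directly in the catalogs. A small sketch, assuming the same database as above (the query itself works on any table):
-- list the internal referential integrity triggers attached to rhnServerNeededCache
-- and the functions they call; RI_FKey_check_ins is the one issuing the
-- SELECT ... FOR SHARE seen in the deadlock report
select t.tgname, t.tgisinternal, p.proname
from pg_trigger t
join pg_proc p on p.oid = t.tgfoid
where t.tgrelid = 'rhnserverneededcache'::regclass;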
Step 3, reconstructing transactions
OK, we now know what deadlocked, but what happened before? Any statements that ran in the respective transactions before the ones that actually deadlocked might have played a part in building up the set of locks which ultimately caused this issue.
This is the hardest part, as there’s no real solution other than manually following the code and writing down what tables are involved. It’s tedious, error-prone, long and hard.
What I did was to write a small Java test to be called multiple times concurrently from the IDE. Running about 10 in parallel helped me to reproduce this problem reliably.
This test tries to replicate what the two transactions did. After that I carefully started removing code to test whether it had an influence on the deadlock or not. Removing all unrelated code led to a quite compact reproducer:
public void testDeadlock() throws Exception {
    // each concurrently-running instance of this test flips a coin once, then
    // loops forever playing either the Taskomatic role (update_needed_cache)
    // or the Tomcat role (subscribe_server on two channels)
    Random r = new Random();
    boolean coin = r.nextBoolean();
    while (true) {
        HibernateFactory.closeSession();
        // ids from the real case: the only registered server and the two channels
        long sid = 1000010000L;
        long c1id = 117L;
        long c2id = 118L;
        System.out.println(coin);
        if (coin) {
            // Taskomatic role: rebuild the errata cache, then commit
            updateNeededCache(sid);
            HibernateFactory.commitTransaction();
        }
        else {
            // Tomcat role: subscribe the server to both channels, then roll back
            // so the data is unchanged for the next iteration
            LoggingFactory.clearLogId();
            subscribeServerToChannel(sid, c1id);
            subscribeServerToChannel(sid, c2id);
            HibernateFactory.rollbackTransaction();
        }
    }
}
Stopping all services and running Postgres with full query logging resulted in this trace (it’s still pretty hard to read, but it can be followed).
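For reference, "full query logging" boils down to a couple of standard settings; this is a sketch of one way to do it (values illustrative, edited in postgresql.conf):
-- enable full statement logging: edit postgresql.conf, then reload
--   log_statement = 'all'
--   log_line_prefix = '%m %p %u %d '  -- timestamp, pid, user, database make the trace easier to follow
select pg_reload_conf(); -- run as superuser after editing postgresql.conf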
Removing all the cruft leads to this nice pure SQL reproducer you can run in two pgAdmin III terminals:
-- first terminal, update the cache but don't commit yet. Locks rows in rhnServerNeededCache
begin;
select * from rhn_server.update_needed_cache(1000010000) as result; -- locks all rows
-- switch to the second terminal, subscribe server to first channel
begin;
select logging.clear_log_id();
select * from rhn_channel.subscribe_server(1000010000, 117) as result; -- blocks on INSERT INTO rhnServerNeededCache because rows are locked (see above)
-- back to the first terminal
commit; -- releases all locks
-- back to the second terminal, blocked statement was unblocked and finishes
-- some new rows in rhnServerNeededCache are locked because of the INSERT mentioned above
-- rhnServer row is now locked as well, because of an UPDATE in subscribe_server
-- now subscribe server in second channel
select * from rhn_channel.subscribe_server(1000010000, 118) as result; -- blocks on DELETE rhnServerNeededCache
-- back to the first terminal
begin;
select * from rhn_server.update_needed_cache(1000010000) as result; -- DEADLOCK!
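While one of the two terminals is blocked, a third session can show who is waiting on whom. This is a minimal sketch over the standard pg_locks view, restricted to the transaction-id locks that appear in the deadlock report (on 9.6 and later, pg_blocking_pids() is a simpler alternative):
-- blocked backends waiting for another transaction to finish,
-- paired with the backend that owns that transaction id
select waiting.pid           as waiting_pid,
       waiting.transactionid as waited_xid,
       holder.pid            as holding_pid
from pg_locks waiting
join pg_locks holder
  on holder.transactionid = waiting.transactionid
 and holder.granted
where waiting.locktype = 'transactionid'
  and not waiting.granted;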
Step 4, root causes
Multiple factors determined the deadlock, among them:
- there is an implicit locking of rhnServer by update_needed_cache, because of the FOREIGN KEY constraint trigger implementation. Yes, deadlocks on FOREIGN KEYs can happen;
- PostgreSQL 9.1 locked the entire rhnServer row, while locking only the id column would have been sufficient. It turns out that 9.3 introduces new, weaker lock types (FOR NO KEY UPDATE and FOR KEY SHARE); the foreign key triggers now take FOR KEY SHARE, which no longer conflicts with ordinary UPDATEs of non-key columns, thus preventing this deadlock (see the sketch after this list);
- subscribe_server calls update_needed_cache, and update_needed_cache also runs concurrently on its own. This is a typical setup for deadlocks, as the probability of interfering on the same set of tables is high, and yes, a function can happily deadlock with itself!
- update_needed_cache uses DELETE followed by INSERT: this is a pattern known as upsert and it is notoriously hard to implement and use properly. In particular, the DELETE statement is immediately “visible” from any other transaction irrespective of the isolation level, by design. There are proposals to fix this in Postgres, but it is distant future for now.
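Here is the sketch mentioned in the list above: a rough illustration of the 9.1 versus 9.3 behaviour, reusing the trigger’s own query from the log. It is not part of the original investigation and assumes the same database as the reproducer (plus a 9.3+ server for the commented-out variant):
-- what the FK trigger effectively runs on 9.1: a full FOR SHARE row lock on rhnServer,
-- which conflicts with the row lock taken by subscribe_server's UPDATE of the same row
begin;
select 1 from only "public"."rhnserver" x where "id" = 1000010000 for share of x;
-- on 9.3 the trigger asks for the much weaker FOR KEY SHARE instead:
-- select 1 from only "public"."rhnserver" x where "id" = 1000010000 for key share of x;
-- FOR KEY SHARE does not conflict with an UPDATE that leaves key columns alone
-- (such an UPDATE only takes FOR NO KEY UPDATE), so this lock conflict disappears
rollback;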
Crucially, update_needed_cache locks rhnServerNeededCache first (via DELETE) and then locks rhnServer (via the INSERT into a child table). By running subscribe_server multiple times, the other transaction could lock rhnServer first (via its UPDATE) and then rhnServerNeededCache. The different order results in the deadlock.
Step 5, possible solutions
- ostrich algorithm: this deadlock will only happen when a user changes subscribed channels while Taskomatic is rebuilding the cache, which is quite unlucky and unlikely. Even if it happens, retrying will just work;
- explicitly lock rhnServer before anything else in update_needed_cache (a toy illustration of why this removes the deadlock follows this list), by adding:
select * from rhnServer where id = server_id_in for update;
- upgrade Postgres to 9.3. Fortunately this is coming really soon to SUSE Linux Enterprise 11, which SUSE Manager runs on!
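To see why the second option works, here is a toy illustration on two made-up tables (hypothetical, not from SUSE Manager): once both code paths take the parent-row lock first, their lock acquisitions can no longer interleave.
-- hypothetical setup, for illustration only
create table parent (id int primary key);
create table child (parent_id int references parent(id));
insert into parent values (1);

-- both concurrent sessions now run the same sequence:
begin;
select * from parent where id = 1 for update; -- take the parent-row lock first
delete from child where parent_id = 1;        -- only then touch the child rows
insert into child values (1);
commit;
-- whichever session reaches the FOR UPDATE second simply blocks there and resumes
-- after the other commits: consistent lock ordering, so no deadlock is possible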
Related bug
Bug 1063821 is very similar and affects almost the same code path.