Improvement: Jailing inactive pigeons

Analysis

There is something not quite right with Pigeon jailing due to missing keep alive calls.

For one thing, this is the noise caused by jailed validators during the time frame of 10 minutes:

Sep 07 05:50:12 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1jr353upa2jcg2af20lkq5ch60npk4a3w5gthnn","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:50:12Z","message":"jailing a validator"}
Sep 07 05:50:43 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1my3gpyx7sdx7wn4rd0hmng60q9jhykxhx3fwkh","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:50:43Z","message":"jailing a validator"}
Sep 07 05:51:14 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1jr353upa2jcg2af20lkq5ch60npk4a3w5gthnn","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:51:14Z","message":"jailing a validator"}
Sep 07 05:51:44 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1my3gpyx7sdx7wn4rd0hmng60q9jhykxhx3fwkh","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:51:44Z","message":"jailing a validator"}
Sep 07 05:51:59 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1wm4y8yhppxud6j5wvwr7fyynhh09tmv5fy845g","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:51:59Z","message":"jailing a validator"}
Sep 07 05:52:15 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1jr353upa2jcg2af20lkq5ch60npk4a3w5gthnn","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:52:15Z","message":"jailing a validator"}
Sep 07 05:52:45 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1my3gpyx7sdx7wn4rd0hmng60q9jhykxhx3fwkh","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:52:45Z","message":"jailing a validator"}
Sep 07 05:53:16 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1jr353upa2jcg2af20lkq5ch60npk4a3w5gthnn","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:53:16Z","message":"jailing a validator"}
Sep 07 05:53:47 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1my3gpyx7sdx7wn4rd0hmng60q9jhykxhx3fwkh","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:53:47Z","message":"jailing a validator"}
Sep 07 05:54:18 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1jr353upa2jcg2af20lkq5ch60npk4a3w5gthnn","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:54:18Z","message":"jailing a validator"}
Sep 07 05:54:33 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper16jhsvx4zrukkjqd9akfawx2tzduyx4ux8m9wre","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:54:33Z","message":"jailing a validator"}
Sep 07 05:54:48 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1my3gpyx7sdx7wn4rd0hmng60q9jhykxhx3fwkh","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:54:48Z","message":"jailing a validator"}
Sep 07 05:55:20 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1jr353upa2jcg2af20lkq5ch60npk4a3w5gthnn","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:55:20Z","message":"jailing a validator"}
Sep 07 05:55:51 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1my3gpyx7sdx7wn4rd0hmng60q9jhykxhx3fwkh","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:55:51Z","message":"jailing a validator"}
Sep 07 05:56:22 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1jr353upa2jcg2af20lkq5ch60npk4a3w5gthnn","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:56:22Z","message":"jailing a validator"}
Sep 07 05:56:52 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1my3gpyx7sdx7wn4rd0hmng60q9jhykxhx3fwkh","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:56:52Z","message":"jailing a validator"}
Sep 07 05:57:23 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1jr353upa2jcg2af20lkq5ch60npk4a3w5gthnn","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:57:23Z","message":"jailing a validator"}
Sep 07 05:57:54 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1my3gpyx7sdx7wn4rd0hmng60q9jhykxhx3fwkh","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:57:54Z","message":"jailing a validator"}
Sep 07 05:58:25 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1jr353upa2jcg2af20lkq5ch60npk4a3w5gthnn","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:58:25Z","message":"jailing a validator"}
Sep 07 05:58:25 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1kludne80z0tcq9t7j9fqa630fechsjhxc324tt","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:58:25Z","message":"jailing a validator"}
Sep 07 05:58:40 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper16jhsvx4zrukkjqd9akfawx2tzduyx4ux8m9wre","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:58:40Z","message":"jailing a validator"}
Sep 07 05:58:56 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1my3gpyx7sdx7wn4rd0hmng60q9jhykxhx3fwkh","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:58:56Z","message":"jailing a validator"}
Sep 07 05:59:42 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1jr353upa2jcg2af20lkq5ch60npk4a3w5gthnn","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:59:42Z","message":"jailing a validator"}
Sep 07 05:59:42 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper16jhsvx4zrukkjqd9akfawx2tzduyx4ux8m9wre","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:59:42Z","message":"jailing a validator"}

You can see the same validator in this multiple times, meaning they’re likely unjailing using a script. We don’t know whether their Pigeon really was offline during that time, but it seems unlikely.

With our own validator, we got jailed at 2023-09-07T05:51:59Z:

Sep 07 05:51:59 mainnet-validator palomad[61553]: {"level":"info","module":"server","module":"x/valset","val-addr":"palomavaloper1wm4y8yhppxud6j5wvwr7fyynhh09tmv5fy845g","reason":"validator's pigeon has been inactive","time":"2023-09-07T05:51:59Z","message":"jailing a validator"}

Looking at the pigeon logs during this time however, shows that Pigeon did indeed send a keep alive call at 2023-09-07T05:51:18Z, about 30 seconds prior to the TTL expiration deadline2023-09-07T05:51:49:

Sep 07 05:50:18 mainnet-validator pigeon[62135]: {"level":"debug","msg":"querying get alive time","time":"2023-09-07T05:50:18Z"}
Sep 07 05:50:18 mainnet-validator pigeon[62135]: {"alive-until":"2023-09-07T05:51:49.344891939Z","level":"debug","msg":"checking keep alive","should-send-keep-alive":false,"time":"2023-09-07T05:50:18Z","time-now":"2023-09-07T05:50:18.601121533Z","ttl":90743770406}
Sep 07 05:50:48 mainnet-validator pigeon[62135]: {"level":"debug","msg":"querying get alive time","time":"2023-09-07T05:50:48Z"}
Sep 07 05:50:48 mainnet-validator pigeon[62135]: {"alive-until":"2023-09-07T05:51:49.344891939Z","level":"debug","msg":"checking keep alive","should-send-keep-alive":false,"time":"2023-09-07T05:50:48Z","time-now":"2023-09-07T05:50:48.595878263Z","ttl":60749013676}
Sep 07 05:51:18 mainnet-validator pigeon[62135]: {"level":"debug","msg":"querying get alive time","time":"2023-09-07T05:51:18Z"}
Sep 07 05:51:18 mainnet-validator pigeon[62135]: {"alive-until":"2023-09-07T05:51:49.344891939Z","level":"debug","msg":"checking keep alive","should-send-keep-alive":true,"time":"2023-09-07T05:51:18Z","time-now":"2023-09-07T05:51:18.594214388Z","ttl":30750677551}
Sep 07 05:52:16 mainnet-validator pigeon[62135]: {"level":"debug","msg":"querying get alive time","time":"2023-09-07T05:52:16Z"}
Sep 07 05:52:16 mainnet-validator pigeon[62135]: {"alive-until":"2023-09-07T05:57:14.812926533Z","level":"debug","msg":"checking keep alive","should-send-keep-alive":false,"time":"2023-09-07T05:52:16Z","time-now":"2023-09-07T05:52:16.520998415Z","ttl":298291928118}
Sep 07 05:52:18 mainnet-validator pigeon[62135]: {"level":"debug","msg":"querying get alive time","time":"2023-09-07T05:52:18Z"}
Sep 07 05:52:18 mainnet-validator pigeon[62135]: {"alive-until":"2023-09-07T05:57:14.812926533Z","level":"debug","msg":"checking keep alive","should-send-keep-alive":false,"time":"2023-09-07T05:52:18Z","time-now":"2023-09-07T05:52:18.594424036Z","ttl":296218502497}
Sep 07 05:52:48 mainnet-validator pigeon[62135]: {"level":"debug","msg":"querying get alive time","time":"2023-09-07T05:52:48Z"}
Sep 07 05:52:48 mainnet-validator pigeon[62135]: {"alive-until":"2023-09-07T05:57:14.812926533Z","level":"debug","msg":"checking keep alive","should-send-keep-alive":false,"time":"2023-09-07T05:52:48Z","time-now":"2023-09-07T05:52:48.594414929Z","ttl":266218511604}

You can also see that during the next calls, the alive-until field did change to 2023-09-07T05:57:14, meaning the call did in fact go through and Paloma did update the Pigeon TTL record accordingly.

Yet somehow, during the next round of jailing at 2023-09-07T05:51:59Z, the validator got jailed. I can think of two explanations for this:

  • The update of the pigeon TTL took too long on Paloma side and did not finish in time before the jailing round. Given that the call went out over 30 seconds before, it sounds unlikely
  • Paloma does not compare against the correct data when choosing whether to jail a validator or not.

TL;DR

I need to do some further digging to find the true culprit, but here’s what I suggest to improve the overly aggressive jailing in the future:

  • further investigation to identify the root cause
  • add regression testing to paloma jailing
  • update Pigeon keep-alive call logic to happen more frequently, i.e. every 5 seconds instead of once per minute. Keep Alive data can be buffered on Pigeon side, it shouldn’t change on Paloma, we can still refresh once per minute, but we need to send the heart beat more aggressively in case there’s an error or timings are too tight
  • update Pigeon keep-alive logic to trigger a heart beat call slightly earlier than currently, although this might not be needed once the switch to block based TTL is done