Relayer Performance Metrics Collection

Business Case

Overview

In an effort to enhance Paloma’s Quality of Service (QoS) and dynamic routing feature, “Pigeon Feed”, there is a need to collect performance metrics of the network of relayer pigeons.

Rationale

The performance metrics of network relayers will provide valuable information that can be used to optimize the efficiency of the Paloma messaging service. These metrics will help us understand the performance and reliability of each relayer in the network, which in turn will enable us to make informed decisions about cross-chain message routing and cross-chain gas demand.

Impact

By integrating these metrics into the Paloma protocol’s “Pigeon Feed” feature, we can dynamically adjust Paloma’s message routing strategies based on real-time performance data of the Paloma validator flock. This will ensure that Paloma’s message relay service always offers the best possible performance and reliability to the users of the protocol. A more reliable message routing and reward system will encourage more quality validators to join the flock and dynamically lower the cost of message relaying for users, without the need for ongoing governance changes to per-chain fees and per-chain gas minimums.

Specification

Requirements as per specification

There are currently 5 different metrics defined on [PF 01] Track metrics · Issue #421 · VolumeFi/paloma · GitHub

  • The pigeon’s message uptake rate
  • The pigeon’s message delivery success rate
  • The pigeon’s message delivery speed
  • The pigeon’s active MEV endpoint (MEV endpoint setting)

These metrics are chain-specific, meaning Paloma will need to collect each of them on a per-target-chain basis.
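
To make the per-target-chain requirement concrete, here is a minimal sketch of what such a record could look like. All type and field names are hypothetical and not taken from the Paloma codebase:

```go
package metrics

import "time"

// RelayerMetrics holds the raw counters for one pigeon on one target chain.
type RelayerMetrics struct {
	ValidatorAddress string        // the pigeon's validator operator address
	ChainReferenceID string        // one record per target chain, e.g. "eth-main"
	MessagesAssigned uint64        // messages Paloma assigned to this pigeon
	MessagesRelayed  uint64        // messages the pigeon delivered successfully
	TotalRelayTime   time.Duration // summed delivery time, used to derive average speed
	HasMEVEndpoint   bool          // whether the pigeon advertises an active MEV endpoint
}

// SuccessRate derives the delivery success rate from the raw counters.
func (m RelayerMetrics) SuccessRate() float64 {
	if m.MessagesAssigned == 0 {
		return 0
	}
	return float64(m.MessagesRelayed) / float64(m.MessagesAssigned)
}
```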

Current state

Additionally, there are currently 5 different relay weights defined that governance can assign:

  • Fee: This weight represents how important pigeon fee price should be during validator selection
  • Uptime: This weight represents how important pigeon uptime (time out of jail) should be during validator selection
  • SuccessRate: This weight represents how important pigeon relaying success rate should be during validator selection
  • ExecutionTime: Its intended purpose is unclear; my best guess is that it represents pigeon relaying time (i.e. faster → better)
  • FeatureSet: This weight represents how important pigeon feature sets (MEV, TX type, etc.) should be during validator selection

These weights are currently not used during relayer selection, but they are an active part of Paloma’s mainnet governance.
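
For illustration, here is a hedged sketch of how those five weights could be folded into a single relayer score. The struct mirrors the weights listed above, but the linear scoring formula and the normalization assumptions are mine, not current protocol behaviour:

```go
package metrics

// RelayWeights mirrors the five governance-assigned weights listed above.
type RelayWeights struct {
	Fee           float64
	Uptime        float64
	SuccessRate   float64
	ExecutionTime float64
	FeatureSet    float64
}

// Score combines normalized per-metric values (each in [0, 1], higher is better)
// into a single weighted sum used to rank pigeons during relayer selection.
func (w RelayWeights) Score(fee, uptime, successRate, executionSpeed, featureSet float64) float64 {
	return w.Fee*fee +
		w.Uptime*uptime +
		w.SuccessRate*successRate +
		w.ExecutionTime*executionSpeed +
		w.FeatureSet*featureSet
}
```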

Message uptake rate

:question: This metric is hard to define. At the moment, there is no easy way of saying “Pigeon A asked for messages”. Pigeons call in at different intervals and inform Paloma about which messages they have queued for, but this information is only useful when Paloma actually has messages assigned to that pigeon.

Suggested implementation

Preconditions & assumptions

The current module structure in Paloma is a bit all over the place:

  • The EVM module is responsible for picking a relayer during message sending
  • The paloma module is only responsible for validator status (jailing validators with missing chain infos).
  • The treasury module is responsible for governing community and relayer fees.
  • The consensus module is responsible for the internal messaging queue, in which all EVM module messages are present as well.
  • The scheduler module is responsible for storing jobs (in a KV store) and executing those jobs (putting a new message on the consensus queue).

Some of this could probably be either cleaned up and consolidated, or restructured into more meaningful but smaller units.

Suggested approach

The Paloma protocol team believes we should create a new module, metrics, which does nothing but collect performance metrics.

Each module may emit (fire & forget) an event when something happens (message delivered successfully, validator changed feature set, etc.).

It’s the job of the metrics module to consume those events and arrange them in a meaningful way on a per-target-chain basis. Using a cachekv.Store, we can ensure that query times remain viable.
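
As a rough illustration of the fire-and-forget flow using standard Cosmos SDK events (the event type and attribute keys below are made up for this sketch and are not the actual Paloma event names):

```go
package metrics

import (
	sdk "github.com/cosmos/cosmos-sdk/types"
)

// Hypothetical event type and attribute keys for this sketch.
const (
	EventTypeMessageDelivered = "message_delivered"
	AttrKeyValidator          = "validator"
	AttrKeyChainReferenceID   = "chain_reference_id"
	AttrKeyMessageID          = "message_id"
)

// EmitMessageDelivered would be called by the sending module (e.g. the EVM
// module) once a pigeon attests that a message has been relayed. The metrics
// module can then pick the event up and fold it into its per-chain records.
func EmitMessageDelivered(ctx sdk.Context, validator, chainReferenceID, messageID string) {
	ctx.EventManager().EmitEvent(
		sdk.NewEvent(
			EventTypeMessageDelivered,
			sdk.NewAttribute(AttrKeyValidator, validator),
			sdk.NewAttribute(AttrKeyChainReferenceID, chainReferenceID),
			sdk.NewAttribute(AttrKeyMessageID, messageID),
		),
	)
}
```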

Shifting data window

The Paloma network will need to flush out older performance metrics every so often, both to avoid clogging the system and to let performers rated at either end of the spectrum relax back towards the center, in order to:

  • Incentivise top performers to keep up the good work
  • Give low-rated performers a chance to rise back to the top more quickly if they fulfil the performance requirements that reward them with GRAINS for successful relay activity

We could keep the metrics of the last 24 hours, or the last n blocks. But both approaches have a problem: sometimes the relayer network is very active, and sometimes it is almost dormant.

So, for a more evenly distributed window, I suggest we correlate each metric with a message ID. That way, we can keep the metrics for the last n messages, no matter when they were sent and at what speed.
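
A rough sketch of that message-ID window, assuming message IDs increase monotonically; the in-memory layout is illustrative only, and the real module would back it with its KV store:

```go
package metrics

// window keeps per-message metric samples and discards everything outside the
// last n message IDs.
type window struct {
	n       uint64            // how many recent messages to retain
	records map[uint64]string // messageID -> metric sample (placeholder type)
}

// newWindow creates a window retaining metrics for the last n message IDs.
func newWindow(n uint64) *window {
	return &window{n: n, records: map[uint64]string{}}
}

// record stores a sample for a message and prunes entries older than the window.
// Message IDs are assumed to be monotonically increasing.
func (w *window) record(messageID uint64, sample string) {
	w.records[messageID] = sample
	if messageID <= w.n {
		return
	}
	cutoff := messageID - w.n
	for id := range w.records {
		if id < cutoff {
			delete(w.records, id)
		}
	}
}
```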

Applying the weights

The EVM module simply gets the metrics keeper injected and should be able to query performance metrics per validator during relayer selection. Since the data lives in memory, queries should be fast enough. We can also batch calls.
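
For instance, selection could look roughly like the sketch below. MetricsKeeper and its Score method are hypothetical stand-ins for whatever interface the metrics keeper ends up exposing:

```go
package evm

import (
	"sort"

	sdk "github.com/cosmos/cosmos-sdk/types"
)

// MetricsKeeper is a hypothetical interface the EVM module would depend on;
// the metrics module's keeper would implement it on top of its cachekv.Store.
type MetricsKeeper interface {
	// Score returns a weighted score for a validator on a given target chain.
	Score(ctx sdk.Context, chainReferenceID, validatorAddr string) float64
}

// pickRelayer ranks the candidates by score and returns the best one. Scores
// could also be fetched in a single batched call instead of one per candidate.
func pickRelayer(ctx sdk.Context, k MetricsKeeper, chainReferenceID string, candidates []string) string {
	if len(candidates) == 0 {
		return ""
	}
	sort.Slice(candidates, func(i, j int) bool {
		return k.Score(ctx, chainReferenceID, candidates[i]) > k.Score(ctx, chainReferenceID, candidates[j])
	})
	return candidates[0]
}
```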

System design


[System design diagram — source link omitted]

Questions / Open items

  • The only requirement that won’t really fit the current weight setup is message uptake rate. Should we just scrap this metric for now, given that it’s not that straightforward to collect and not that meaningful when counterbalanced with delivery speed and success rate?
  • We might not need to capture relayer fees, as they’re fairly static and SHOULD be fine to query at relayer selection time as needed.
  • TODO: Enforce correct TX type per chain ID, as we’re looking to make this a fixed setting as opposed to configurable.