RyanDoyle.net


Writing a PMDA for Performance Co-Pilot

09 Mar 2019

In this post, we will go over writing a Performance Metrics Domain Agent (PMDA) for Performance Co-Pilot (PCP). PCP has a pluggable architecture, and all metrics that exist within a PCP namespace are implemented by PMDAs.

We will implement a PMDA that reads statistics from an etcd proxy server.

Let’s get started!

Prerequisites


PCP development tools

You’ll need PCP installed and depending on your operating system, you may need to install additional packages for PMDA development support. I’m using Debian so these are the packages I need.

sudo apt install pcp libpcp-pmda3-dev python3-pcp

We will be using the PMDA debugging program, dbpmda, to test our PMDA as we develop it. Make sure that it is available too. It should be installed with PCP.

which dbpmda

Hello PMDA


Let’s start with a hello world example. Create a directory called etcd and, inside it, open a file called pmdaetcd.python. Python PMDAs need the .python extension, not .py. Paste in the following:

#!/usr/bin/env pmpython
from cpmapi import PM_TYPE_STRING, PM_INDOM_NULL, PM_SEM_INSTANT

from pcp.pmapi import pmUnits, pmContext
from pcp.pmda import PMDA, pmdaMetric


class EtcdPMDA(PMDA):

    def __init__(self, name, domain):
        super().__init__(name, domain)

        self.add_metric(name + '.demo', pmdaMetric(
            PMDA.pmid(0, 0),
            PM_TYPE_STRING,
            PM_INDOM_NULL,
            PM_SEM_INSTANT,
            pmUnits()
        ))

        self.set_fetch_callback(self.fetch_callback)
        self.set_user(pmContext.pmGetConfig('PCP_USER'))

    def fetch_callback(self, cluster, item, inst):
        return ['hello PMDA', 1]


if __name__ == '__main__':
    EtcdPMDA('etcd', 400).run()

Breaking it down

Let’s start with the shebang.

#!/usr/bin/env pmpython

PCP supports both Python 2 and Python 3 because it is still shipped on Linux distributions that use Python 2 as their default. pmpython is a mechanism for finding the version of Python that PCP is using. If you plan to ship your PMDA into the mainline PCP codebase, make sure your code runs on both versions.
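One thing to watch for here: the zero-argument super() used in the listing above only works on Python 3. As a minimal sketch of a constructor that runs under both interpreters, the classes below call the base initializer explicitly. FakePMDA and EtcdLike are hypothetical stand-ins so the example runs without PCP installed; in a real PMDA the base class would be pcp.pmda.PMDA.

```python
class FakePMDA(object):
    """Hypothetical stand-in for pcp.pmda.PMDA (not part of PCP)."""

    def __init__(self, name, domain):
        self.name = name
        self.domain = domain


class EtcdLike(FakePMDA):
    def __init__(self, name, domain):
        # Explicit base-class call: valid on Python 2 and Python 3 alike,
        # unlike the zero-argument super(), which is Python 3 only.
        FakePMDA.__init__(self, name, domain)


agent = EtcdLike('etcd', 400)
print('%s %d' % (agent.name, agent.domain))
```

If you only ever target Python 3, the shorter super().__init__(name, domain) form is fine.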

Next, let’s look at the constructor.

class EtcdPMDA(PMDA):
    def __init__(self, name, domain):
...
if __name__ == '__main__':
    EtcdPMDA('etcd', 400).run()

A PMDA is responsible for a namespace, e.g. etcd.*. We need to tell the PMDA what our namespace is. The other argument is the domain. This is an internal identifier used in PCP. Check stdpmid.pcp to find the next free ID. If you are shipping the PMDA into the PCP codebase, you’ll need to add your ID there too.

We add metrics with the self.add_metric() method.

        self.add_metric(name + '.demo', pmdaMetric(
            PMDA.pmid(0, 0),
            PM_TYPE_STRING,
            PM_INDOM_NULL,
            PM_SEM_INSTANT,
            pmUnits()
        ))

We register a name, etcd.demo, and then add metadata about the metric.

  • pmid: This is part of the internal metric identifier and must be unique among the metrics in the PMDA. The first number is the cluster, basically a grouping of related metrics. The second is the item, which identifies the individual metric within that cluster.
  • type: It can be a PM_TYPE_STRING, PM_TYPE_U32, PM_TYPE_64 etc…
  • indom: The instance domain that this metric belongs to. We aren’t using any instance domains at the moment but we will later on.
  • sem (semantics): This describes how the metric represents the data. Although not useful with string metrics, we can use this to mark the metric as a counter or gauge. The PCP client tools then know how to interpret the data and can then do automatic rate conversion.
  • pmunits: We can add unit information like time, bytes or bytes per second. We use the default value here which means “no units”.
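To make the pmid bullet more concrete, here is a small sketch of how a domain, cluster and item can combine into a single identifier. It assumes the field widths described in the PCP documentation (9 bits of domain, 12 of cluster, 10 of item). pack_pmid and format_pmid are illustrative names, not PCP functions; in a real PMDA, PMDA.pmid() and the PCP libraries do this work for you.

```python
DOMAIN_BITS, CLUSTER_BITS, ITEM_BITS = 9, 12, 10  # assumed field widths


def pack_pmid(domain, cluster, item):
    """Pack domain, cluster and item into one integer."""
    for value, bits in ((domain, DOMAIN_BITS),
                        (cluster, CLUSTER_BITS),
                        (item, ITEM_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError('%d does not fit in %d bits' % (value, bits))
    return (domain << (CLUSTER_BITS + ITEM_BITS)) | (cluster << ITEM_BITS) | item


def format_pmid(pmid):
    """Render a packed identifier in dotted domain.cluster.item form."""
    item = pmid & ((1 << ITEM_BITS) - 1)
    cluster = (pmid >> ITEM_BITS) & ((1 << CLUSTER_BITS) - 1)
    domain = (pmid >> (CLUSTER_BITS + ITEM_BITS)) & ((1 << DOMAIN_BITS) - 1)
    return '%d.%d.%d' % (domain, cluster, item)


print(format_pmid(pack_pmid(400, 0, 0)))  # 400.0.0
```

The dotted form is the same notation dbpmda uses when it reports the PMIDs it fetched.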

As for fetching metrics, we define a callback. This is called when the PMDA is asked to provide metric values.

    def fetch_callback(self, cluster, item, inst):
        return ['hello PMDA', 1]

Later on we will use the cluster, item and inst arguments to return the correct metric value. The 1 in the return value denotes success. If the value cannot be fetched, we return an error code in place of the value and 0 in place of the 1.

Sanity test

As a quick sanity test to make sure we don’t have any syntax errors, we can run the PMDA like any normal Python program. We have to run it via sudo because of the self.set_user() call.

chmod +x pmdaetcd.python
sudo ./pmdaetcd.python
# Ctrl+C

It should print some random garbage on the screen. PMDAs communicate over STDIN and STDOUT using a binary protocol. Although we can’t do anything useful this way, it is still a good test to make sure the PMDA can at least start. To test actually fetching the metric, we will use the tool dbpmda.

Testing with dbpmda

dbpmda is a command line test harness for PMDAs. It allows us to quickly iterate when developing the PMDA.

Create a file called pmns-for-testing with the following content:

root {
    etcd    400:*:*
}

This will allow us to test the PMDA locally without having to install and register it with the global PCP namespace. The number needs to match the domain we chose in the PMDA, 400 in our case.

Now run the following.

cat<<EOF | sudo dbpmda -n pmns-for-testing
open pipe ./pmdaetcd.python
fetch etcd.demo
EOF

We have to run it as root because of the self.set_user() call. If you don’t want to run this command as root, you can remove that line while developing. Just make sure to add it back before you ship the PMDA.

If all went well, you should see the following output:

Start pmdaetcd.python PMDA: ./pmdaetcd.python
PMID(s): 400.0.0
pmResult dump from 0x5597b44f6910 timestamp: 0.000000 10:00:00.000 numpmid: 1
  400.0.0 (<noname>): numval: 1 valfmt: 1 vlist[]:
   value "hello PMDA"

Testing by loading into PMCD

The PMDA can be loaded into PMCD to test via tools like pminfo and pmchart. The PMCD is the server that orchestrates routing requests to the correct PMDA and handles communication with the clients.

First, we need to create an Install script and a Remove script. These scripts are shipped individually with each PMDA and are the standard way that PCP installs and removes them.

Install

#!/bin/sh
. $PCP_DIR/etc/pcp.env
. $PCP_SHARE_DIR/lib/pmdaproc.sh

iam=etcd
domain=400
python_opt=true
daemon_opt=false

pmdaSetup
pmdaInstall
exit

Remove

#! /bin/sh
. $PCP_DIR/etc/pcp.env
. $PCP_SHARE_DIR/lib/pmdaproc.sh

iam=etcd

pmdaSetup
pmdaRemove
exit

Make the scripts executable:

chmod +x Install Remove

PCP expects PMDAs to be located in a certain path (typically /var/lib/pcp/pmdas/$pmda_name/) so we will have to either move the etcd/ directory into there and continue working or copy over the contents each time we want to update the PMDA. I’m going to copy the contents over.

sudo mkdir /var/lib/pcp/pmdas/etcd
sudo chown $USER /var/lib/pcp/pmdas/etcd
cp  pmdaetcd.python Install Remove /var/lib/pcp/pmdas/etcd/
cd /var/lib/pcp/pmdas/etcd/
sudo ./Install

You should get notified that metrics have appeared. Check it out:

pminfo -f etcd

Next

Now that we have the basics of a PMDA working and a way to test it, let’s implement more of the metrics exposed by etcd.

Extending the PMDA


Etcd

Etcd has several API endpoints to get interesting statistics from. Our PMDA will expose these metrics through to PCP.

An example payload when calling the stats API looks like the following:

curl http://127.0.0.1:2379/v2/stats/self

{
  "name": "etcd0",
  "id": "7931e79c0d8b47c5",
  "state": "StateFollower",
  "startTime": "2019-03-10T05:28:37.430724906Z",
  "leaderInfo": {
    "leader": "",
    "uptime": "8m33.5174986s",
    "startTime": "2019-03-10T05:28:37.430724906Z"
  },
  "recvAppendRequestCnt": 0,
  "sendAppendRequestCnt": 0
}

We will implement name, id, state, recvAppendRequestCnt and sendAppendRequestCnt.

Defining the metrics

We will add 5 self.add_metric() calls as well as implement fetching via the Python requests library. You may need this library installed. On my platform, I installed it with apt install python3-requests.

#!/usr/bin/env pmpython
from cpmapi import PM_TYPE_STRING, PM_INDOM_NULL, PM_SEM_INSTANT, PM_TYPE_U64, PM_SEM_COUNTER, PM_ERR_PMID, PM_ERR_AGAIN

import requests
from pcp.pmapi import pmUnits, pmContext
from pcp.pmda import PMDA, pmdaMetric


class EtcdPMDA(PMDA):

    def __init__(self, name, domain):
        super().__init__(name, domain)

        self.add_metric(name + '.name', pmdaMetric(
            PMDA.pmid(0, 0),
            PM_TYPE_STRING,
            PM_INDOM_NULL,
            PM_SEM_INSTANT,
            pmUnits()
        ))
        self.add_metric(name + '.id', pmdaMetric(
            PMDA.pmid(0, 1),
            PM_TYPE_STRING,
            PM_INDOM_NULL,
            PM_SEM_INSTANT,
            pmUnits()
        ))
        self.add_metric(name + '.state', pmdaMetric(
            PMDA.pmid(0, 2),
            PM_TYPE_STRING,
            PM_INDOM_NULL,
            PM_SEM_INSTANT,
            pmUnits()
        ))
        self.add_metric(name + '.recv_append_request', pmdaMetric(
            PMDA.pmid(0, 3),
            PM_TYPE_U64,
            PM_INDOM_NULL,
            PM_SEM_COUNTER,
            pmUnits()
        ))
        self.add_metric(name + '.send_append_request', pmdaMetric(
            PMDA.pmid(0, 4),
            PM_TYPE_U64,
            PM_INDOM_NULL,
            PM_SEM_COUNTER,
            pmUnits()
        ))

        self.set_fetch_callback(self.fetch_callback)
        self.set_user(pmContext.pmGetConfig('PCP_USER'))

    def fetch_callback(self, cluster, item, inst):
        if cluster != 0:
            return [PM_ERR_PMID, 0]
        try:
            stats = requests.get('http://127.0.0.1:2379/v2/stats/self').json()
        except Exception:
            return [PM_ERR_AGAIN, 0]

        if item == 0:
            return [stats['name'], 1]
        if item == 1:
            return [stats['id'], 1]
        if item == 2:
            return [stats['state'], 1]
        if item == 3:
            return [stats['recvAppendRequestCnt'], 1]
        if item == 4:
            return [stats['sendAppendRequestCnt'], 1]
        return [PM_ERR_PMID, 0]


if __name__ == '__main__':
    EtcdPMDA('etcd', 400).run()

Let’s test it with dbpmda:

cat<<EOF | sudo dbpmda -n pmns-for-testing
open pipe ./pmdaetcd.python
fetch etcd.name
fetch etcd.id
fetch etcd.state
fetch etcd.recv_append_request
fetch etcd.send_append_request
EOF

Typed metrics

We now use PM_TYPE_U64 to define the data type of the counters. How did I know it is a U64? I had to check out the source code of etcd! It is important to get the types right, either from the documentation or, even better, from the source code where the metric is defined.

We also define this as a PM_SEM_COUNTER. PCP knows about the representation of numeric types, so it is important to get this information correct too. I’ve even omitted ...Cnt from the metric names, as we don’t need to encode this information in the metric name. PCP knows if this is a counter or a gauge.
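The semantics matter because, for a counter, a client is usually interested in the change between two samples rather than the raw value. The following is a rough sketch of that kind of rate conversion, not PCP’s actual implementation; counter_rate and the sample numbers are made up for illustration.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time):
    """Return the per-second rate between two samples of a counter."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError('samples must be in increasing time order')
    return (curr_value - prev_value) / float(elapsed)


# Two hypothetical samples of etcd.recv_append_request, 10 seconds apart.
print(counter_rate(1200, 0.0, 1500, 10.0))  # 30.0
```

A metric marked PM_SEM_INSTANT, such as a gauge, would simply be reported as-is, which is why mislabelling the semantics leads to misleading output.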

Handling errors

In fetch_callback(), we handle two types of errors.

If we get a request for a cluster or item we don’t know about, we return a PM_ERR_PMID. If there are any runtime errors fetching the statistics, we return a PM_ERR_AGAIN. This indicates that the PMDA is unavailable and the client should try again.

It is important to handle errors correctly in the PMDA. If an exception is thrown and not handled, the PMDA will exit.
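As a sketch of the same error handling in a more table-driven style, the helper below maps item numbers to JSON keys and turns any failure into an error return instead of an exception. fetch_self_metric, SELF_ITEMS and the load_stats parameter are names invented for this example. The fallback error values are placeholders so the snippet runs without PCP installed; they are not the real PCP error codes.

```python
try:
    from cpmapi import PM_ERR_AGAIN, PM_ERR_PMID
except ImportError:
    # Placeholder values so the sketch runs without PCP installed.
    # These are NOT the real PCP error codes.
    PM_ERR_AGAIN, PM_ERR_PMID = -2, -1

# Item number -> key in the /v2/stats/self payload.
SELF_ITEMS = {
    0: 'name',
    1: 'id',
    2: 'state',
    3: 'recvAppendRequestCnt',
    4: 'sendAppendRequestCnt',
}


def fetch_self_metric(item, load_stats):
    """Return [value, 1] on success or [error_code, 0] on failure."""
    key = SELF_ITEMS.get(item)
    if key is None:
        return [PM_ERR_PMID, 0]
    try:
        return [load_stats()[key], 1]
    except Exception:
        # Network errors, bad JSON and missing keys all end up here.
        return [PM_ERR_AGAIN, 0]


def broken_loader():
    raise IOError('etcd is not reachable')


print(fetch_self_metric(2, lambda: {'state': 'StateFollower'}))
print(fetch_self_metric(2, broken_loader) == [PM_ERR_AGAIN, 0])
```

Passing the loader in as an argument keeps the HTTP call out of the helper, so the error paths can be exercised without a running etcd.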

Instance domains


Instance domains add another dimension to the metric namespace. A metric can have zero or more instances associated with it. If I look up the disk.dev.read metric, each disk is listed as an instance.

$ pminfo -f disk.dev.read

disk.dev.read
    inst [0 or "sdb"] value 289
    inst [1 or "sda"] value 85578
    inst [2 or "sdc"] value 226
    inst [3 or "sdd"] value 89513
    inst [4 or "sde"] value 151

Instance domains in etcd

If we look at the /v2/stats/store endpoint, there are metrics associated with successful and failed operations.

curl http://127.0.0.1:2379/v2/stats/store

{
  "getsSuccess": 2,
  "getsFail": 57,
  "setsSuccess": 0,
  "setsFail": 0,
  "deleteSuccess": 0,
  "deleteFail": 0,
  "updateSuccess": 0,
  "updateFail": 0,
  "createSuccess": 3,
  "createFail": 0,
  "compareAndSwapSuccess": 0,
  "compareAndSwapFail": 0,
  "compareAndDeleteSuccess": 0,
  "compareAndDeleteFail": 0,
  "expireCount": 0,
  "watchers": 0
}

We will define an instance domain for operations (gets, sets, delete and so on) and then have two metrics that use it, etcd.store.success and etcd.store.fail. expireCount and watchers will be normal metrics without an instance domain.
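To illustrate the mapping, here is a small sketch that regroups a flat payload like the one above into one dictionary per metric, keyed by operation. group_by_operation and SUFFIXES are hypothetical names used only for this example, and the sample payload is a cut-down copy of the JSON shown earlier.

```python
# JSON key suffix -> metric name under etcd.store
SUFFIXES = {'Success': 'success', 'Fail': 'fail'}


def group_by_operation(store_stats):
    """Split keys such as 'getsFail' into metric 'fail', instance 'gets'."""
    grouped = {'success': {}, 'fail': {}}
    for key, value in store_stats.items():
        for suffix, metric in SUFFIXES.items():
            if key.endswith(suffix):
                operation = key[:-len(suffix)]
                grouped[metric][operation] = value
    return grouped


sample = {
    'getsSuccess': 2, 'getsFail': 57,
    'createSuccess': 3, 'createFail': 0,
    'expireCount': 0, 'watchers': 0,
}
grouped = group_by_operation(sample)
print(grouped['success']['gets'], grouped['fail']['gets'])
```

Keys without a Success or Fail suffix, such as expireCount and watchers, are left out, which matches the plan to expose them as plain metrics.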

Adding instances to the PMDA

We will define the instance domain via self.add_indom(). Then, in self.fetch_callback(), we will use the instance as part of the metric lookup.

#!/usr/bin/env pmpython
from cpmapi import PM_TYPE_STRING, PM_INDOM_NULL, PM_SEM_INSTANT, PM_TYPE_U64, PM_SEM_COUNTER, PM_ERR_PMID, PM_ERR_AGAIN

import requests
from pcp.pmapi import pmUnits, pmContext
from pcp.pmda import PMDA, pmdaMetric, pmdaInstid, pmdaIndom


class EtcdPMDA(PMDA):

    def __init__(self, name, domain):
        super().__init__(name, domain)

        self.add_metric(name + '.name', pmdaMetric(
            PMDA.pmid(0, 0),
            PM_TYPE_STRING,
            PM_INDOM_NULL,
            PM_SEM_INSTANT,
            pmUnits()
        ))
        self.add_metric(name + '.id', pmdaMetric(
            PMDA.pmid(0, 1),
            PM_TYPE_STRING,
            PM_INDOM_NULL,
            PM_SEM_INSTANT,
            pmUnits()
        ))
        self.add_metric(name + '.state', pmdaMetric(
            PMDA.pmid(0, 2),
            PM_TYPE_STRING,
            PM_INDOM_NULL,
            PM_SEM_INSTANT,
            pmUnits()
        ))
        self.add_metric(name + '.recv_append_request', pmdaMetric(
            PMDA.pmid(0, 3),
            PM_TYPE_U64,
            PM_INDOM_NULL,
            PM_SEM_COUNTER,
            pmUnits()
        ))
        self.add_metric(name + '.send_append_request', pmdaMetric(
            PMDA.pmid(0, 4),
            PM_TYPE_U64,
            PM_INDOM_NULL,
            PM_SEM_COUNTER,
            pmUnits()
        ))

        self.stats_operations_instances = [
            pmdaInstid(0, 'gets'),
            pmdaInstid(1, 'sets'),
            pmdaInstid(2, 'delete'),
            pmdaInstid(3, 'update'),
            pmdaInstid(4, 'create'),
            pmdaInstid(5, 'compareAndSwap'),
            pmdaInstid(6, 'compareAndDelete'),
        ]
        self.stats_operations_indom = self.indom(0)
        self.add_indom(pmdaIndom(self.stats_operations_indom, self.stats_operations_instances))

        self.add_metric(name + '.store.success', pmdaMetric(
            self.pmid(1, 0),
            PM_TYPE_U64,
            self.stats_operations_indom,
            PM_SEM_COUNTER,
            pmUnits()
        ))
        self.add_metric(name + '.store.fail', pmdaMetric(
            self.pmid(1, 1),
            PM_TYPE_U64,
            self.stats_operations_indom,
            PM_SEM_COUNTER,
            pmUnits()
        ))
        self.add_metric(name + '.store.expire', pmdaMetric(
            self.pmid(1, 2),
            PM_TYPE_U64,
            PM_INDOM_NULL,
            PM_SEM_COUNTER,
            pmUnits()
        ))
        self.add_metric(name + '.store.watchers', pmdaMetric(
            self.pmid(1, 3),
            PM_TYPE_U64,
            PM_INDOM_NULL,
            PM_SEM_INSTANT,
            pmUnits()
        ))

        self.set_fetch_callback(self.fetch_callback)
        self.set_user(pmContext.pmGetConfig('PCP_USER'))

    def fetch_callback(self, cluster, item, inst):
        if cluster == 0:
            try:
                stats = requests.get('http://127.0.0.1:2379/v2/stats/self').json()
            except Exception:
                return [PM_ERR_AGAIN, 0]

            if item == 0:
                return [stats['name'], 1]
            if item == 1:
                return [stats['id'], 1]
            if item == 2:
                return [stats['state'], 1]
            if item == 3:
                return [stats['recvAppendRequestCnt'], 1]
            if item == 4:
                return [stats['sendAppendRequestCnt'], 1]
        if cluster == 1:
            try:
                stats = requests.get('http://127.0.0.1:2379/v2/stats/store').json()
            except Exception:
                return [PM_ERR_AGAIN, 0]

            if item == 0:
                metric_name_in_json = self.inst_name_lookup(self.stats_operations_indom, inst) + 'Success'
                return [stats[metric_name_in_json], 1]
            if item == 1:
                metric_name_in_json = self.inst_name_lookup(self.stats_operations_indom, inst) + 'Fail'
                return [stats[metric_name_in_json], 1]
            if item == 2:
                return [stats['expireCount'], 1]
            if item == 3:
                return [stats['watchers'], 1]

        return [PM_ERR_PMID, 0]


if __name__ == '__main__':
    EtcdPMDA('etcd', 400).run()

Now use dbpmda to show the instances and new metrics we have defined.

cat<<EOF | sudo dbpmda -n pmns-for-testing
open pipe ./pmdaetcd.python
instance 400.0
fetch etcd.store.success
fetch etcd.store.fail
fetch etcd.store.watchers
fetch etcd.store.expire
EOF

We create an instance via pmdaInstid(). This takes the internal instance identifier (a number) and the external human-readable name. We then register these instances as part of an instance domain via self.add_indom().

You can see that in fetch_callback(), we now use the inst argument. There are a number of ways you could fetch metrics using the instance identifier. For simplicity’s sake, we look up the external name via self.inst_name_lookup() and build the name of the key used in the JSON structure.
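One caveat with building the key by string concatenation is that an unknown instance, or a key missing from the payload, would raise an exception inside the callback. The sketch below shows one possible guard. store_value and OPERATIONS are hypothetical names standing in for the PMDA’s own lookup, and the fallback value for PM_ERR_INST is a placeholder used only when PCP is not installed.

```python
try:
    from cpmapi import PM_ERR_INST
except ImportError:
    # Placeholder so the sketch runs without PCP; NOT the real error code.
    PM_ERR_INST = -3

# Internal instance id -> external name, mirroring the pmdaInstid() list.
OPERATIONS = {
    0: 'gets', 1: 'sets', 2: 'delete', 3: 'update',
    4: 'create', 5: 'compareAndSwap', 6: 'compareAndDelete',
}


def store_value(stats, inst, suffix):
    """Return [value, 1] for a known instance, else [PM_ERR_INST, 0]."""
    name = OPERATIONS.get(inst)
    if name is None:
        return [PM_ERR_INST, 0]
    key = name + suffix  # e.g. 'gets' + 'Fail' -> 'getsFail'
    if key not in stats:
        return [PM_ERR_INST, 0]
    return [stats[key], 1]


stats = {'getsSuccess': 2, 'getsFail': 57}
print(store_value(stats, 0, 'Fail'))                    # [57, 1]
print(store_value(stats, 42, 'Fail') == [PM_ERR_INST, 0])
```

In the PMDA itself the same checks could wrap the result of self.inst_name_lookup() before the key is built.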

Further reading


There’s some great documentation on writing PMDAs in the PCP Programmer’s Guide, and the simple PMDA shows off additional PMDA features.

The sources for all packaged PMDAs can be seen here and show off other features such as dynamic instance registration, caching, external configuration files, logging and label support.




2023 - Ryan Doyle - Powered by Jekyll