Health checks

Separate from the service manager, Pebble implements custom “health checks” that can be configured to restart services when they fail.

Usage

Checks are configured in the layer configuration using the top-level field checks:

# Optional: A list of health checks managed by this configuration layer.
checks:
    <check name>:
        # Required
        override: merge | replace
        # Optional
        level: alive | ready
        # Optional
        period: <duration>
        # Optional
        timeout: <duration>
        # Optional
        threshold: <failure threshold>

        # HTTP check
        # Only one of "http", "tcp", or "exec" may be specified.
        http:
            # Required
            url: <full URL>
            # Optional
            headers:
                <name>: <value>

        # TCP port
        # Only one of "http", "tcp", or "exec" may be specified.
        tcp:
            # Required
            port: <port number>
            # Optional
            host: <host name>

        # Command execution check
        # Only one of "http", "tcp", or "exec" may be specified.
        exec:
            # Required
            command: <command>
            # Optional
            service-context: <service-name>
            # Optional
            environment:
                <name>: <value>
            # Optional
            user: <username>
            # Optional
            user-id: <uid>
            # Optional
            group: <group name>
            # Optional
            group-id: <gid>
            # Optional
            working-dir: <directory>

Full details are given in the layer specification.

Options

Each check can be one of three types. The types and their success criteria are:

  • http: an HTTP GET request to the URL specified must return an HTTP 2xx status code

  • tcp: opening the given TCP port must be successful

  • exec: executing the specified command must yield a zero exit code

Each check is performed once per period (every 10 seconds by default), and is considered an error if the timeout elapses before the check responds – for example, before the HTTP request is complete or before the command finishes executing.

A check is considered healthy until it’s had threshold errors in a row (the default is 3). At that point, the check is considered “down”, and any associated on-check-failure actions will be triggered. When the check succeeds again, the failure count is reset to 0.
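
As a rough illustration of how period, timeout, and threshold interact, the sketch below sets all three explicitly on one check. The check name "api" and the URL are hypothetical, and the timing described afterwards follows only from the rules above, ignoring any scheduling jitter.

```yaml
checks:
    api:  # hypothetical check name
        override: replace
        period: 5s      # perform the check every 5 seconds
        timeout: 2s     # a run that has not responded within 2 seconds is an error
        threshold: 4    # the check is "down" after 4 errors in a row
        http:
            url: http://localhost:8080/health  # hypothetical URL
```

With these values, a server that stops responding would be marked down on the fourth consecutive failed run, roughly 15 seconds after the first failed run (three further periods of 5 seconds), plus up to 2 seconds if the final run times out. A single successful run in between resets the failure count to 0.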

To enable Pebble auto-restart behavior based on a check, use the on-check-failure map in the service configuration (this is what ties together services and checks). For example, to restart the “server” service when the “test” check fails, use the following:

services:
    server:
        override: merge
        on-check-failure:
            # can also be "shutdown", "success-shutdown", or "ignore" (the default)
            test: restart
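
Because on-check-failure refers to a check by name, it can help to see both halves in one layer. The following is a minimal sketch under stated assumptions: the service command is a hypothetical placeholder, and the check reuses the "test" name and URL from this page.

```yaml
services:
    server:
        override: replace
        command: /usr/bin/my-server --port 8080  # hypothetical command
        on-check-failure:
            test: restart  # restart "server" when the "test" check goes down

checks:
    test:
        override: replace
        http:
            url: http://localhost:8080/test
```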

Examples

Below is an example layer showing the three different types of checks:

checks:
    up:
        override: replace
        level: alive
        period: 30s
        threshold: 1  # an aggressive threshold
        exec:
            command: service nginx status

    online:
        override: replace
        level: ready
        tcp:
            port: 8080

    test:
        override: replace
        http:
            url: http://localhost:8080/test
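
The optional fields listed under Usage can be combined with any of these check types. As a hedged sketch, the layer below adds a request header to an HTTP check and gives an exec check a service context, an extra environment variable, and a working directory. The check names, header, script path, directory, and the "server" service are all hypothetical.

```yaml
checks:
    api:  # hypothetical check name
        override: replace
        level: ready
        timeout: 1s
        http:
            url: http://localhost:8080/health  # hypothetical URL
            headers:
                X-Health-Check: pebble  # hypothetical header

    db:  # hypothetical check name
        override: replace
        level: alive
        exec:
            command: /usr/local/bin/check-db.sh  # hypothetical script
            service-context: server  # assumes a service named "server" exists
            environment:
                DB_HOST: localhost
            working-dir: /srv/app  # hypothetical directory
```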

Checks command

You can view check status using the pebble checks command. This reports the checks along with their status (up or down) and number of failures. For example:

user@host:~$ pebble checks
Check   Level  Status  Failures  Change
up      alive  up      0/1       10
online  ready  down    1/3       13 (dial tcp 127.0.0.1:8000: connect: connection refused)
test    -      down    42/3      14 (Get "http://localhost:8080/": dial t... run "pebble tasks 14" for more)

The “Failures” column shows the current number of failures since the check started failing, a slash, and the configured threshold.

The “Change” column shows the change ID of the change driving the check, along with a (possibly-truncated) error message from the last error. Running pebble tasks <change-id> will show the change’s task, including the last 10 error messages in the task log.

Health checks are implemented using two change kinds:

  • perform-check: drives the check while it’s “up”. The change finishes when the number of failures hits the threshold, at which point the change switches to Error status and a recover-check change is spawned. Each check failure records a task log.

  • recover-check: drives the check while it’s “down”. The change finishes when the check starts succeeding again, at which point the change switches to Done status and a new perform-check change is spawned. Again, each check failure records a task log.

Health endpoint

If the --http option was given when starting pebble run, Pebble exposes a /v1/health HTTP endpoint that allows a user to query the health of configured checks, optionally filtered by check level with the query string ?level=<level>. This endpoint returns an HTTP 200 status if the checks are healthy, and HTTP 502 otherwise.

Each check can specify a level of “alive” or “ready”. These have semantic meaning: “alive” means the check or the service it’s connected to is up and running; “ready” means it’s properly accepting network traffic. These correspond to Kubernetes “liveness” and “readiness” probes.

The tool running the Pebble server can make use of this. For example, under Kubernetes you could configure the liveness and readiness probes to hit Pebble’s /v1/health endpoint with the ?level=alive and ?level=ready filters, respectively.
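
As an illustration, a Kubernetes container spec fragment along the following lines would point both probes at Pebble. This is a sketch rather than a tested manifest: it assumes Pebble was started with --http :4000 so that the endpoint is reachable on port 4000, and the container name and image are hypothetical.

```yaml
containers:
    - name: app                # hypothetical container name
      image: example/app:1.0   # hypothetical image
      livenessProbe:
          httpGet:
              path: /v1/health?level=alive
              port: 4000       # assumed: pebble run --http :4000
      readinessProbe:
          httpGet:
              path: /v1/health?level=ready
              port: 4000       # same assumed port as above
```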

If only a “ready” check or only an “alive” check is configured, ready implies alive, and not-alive implies not-ready. If you’ve configured an “alive” check but no “ready” check, and the “alive” check is unhealthy, /v1/health?level=ready will report unhealthy as well, and the Kubernetes readiness probe will act on that.

On the other hand, not-ready does not imply not-alive: if you’ve configured a “ready” check but no “alive” check, and the “ready” check is unhealthy, /v1/health?level=alive will still report healthy.

If there are no checks configured, the /v1/health endpoint returns HTTP 200 so the liveness and readiness probes are successful by default. To use this feature, you must explicitly create checks with level: alive or level: ready in the layer configuration.