Health checks
Separate from the service manager, Pebble implements custom “health checks” that can be configured to restart services when they fail.
Usage
Checks are configured in the layer configuration using the top-level field checks:
```yaml
# Optional: A list of health checks managed by this configuration layer.
checks:
    <check name>:
        # Required
        override: merge | replace
        # Optional
        level: alive | ready
        # Optional
        period: <duration>
        # Optional
        timeout: <duration>
        # Optional
        threshold: <failure threshold>

        # HTTP check
        # Only one of "http", "tcp", or "exec" may be specified.
        http:
            # Required
            url: <full URL>
            # Optional
            headers:
                <name>: <value>

        # TCP port check
        # Only one of "http", "tcp", or "exec" may be specified.
        tcp:
            # Required
            port: <port number>
            # Optional
            host: <host name>

        # Command execution check
        # Only one of "http", "tcp", or "exec" may be specified.
        exec:
            # Required
            command: <command>
            # Optional
            service-context: <service-name>
            # Optional
            environment:
                <name>: <value>
            # Optional
            user: <username>
            # Optional
            user-id: <uid>
            # Optional
            group: <group name>
            # Optional
            group-id: <gid>
            # Optional
            working-dir: <directory>
```
Full details are given in the layer specification.
Options
Each check can be one of three types. The types and their success criteria are:
- http: an HTTP GET request to the URL specified must return an HTTP 2xx status code
- tcp: opening the given TCP port must be successful
- exec: executing the specified command must yield a zero exit code
Each check is performed with the specified period (the default is 10 seconds apart), and is considered an error if a timeout happens before the check responds – for example, before the HTTP request is complete or before the command finishes executing.
A check is considered healthy until it’s had threshold errors in a row (the default is 3). At that point, the check is considered “down”, and any associated on-check-failure actions will be triggered. When the check succeeds again, the failure count is reset to 0.
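As an illustrative sketch (the check name and port below are placeholders, not taken from the Pebble documentation), a check that reacts quickly to failures could combine a short period and timeout with a threshold of 1:

```yaml
checks:
    fast-fail:            # placeholder check name
        override: replace
        period: 5s        # run the check every 5 seconds
        timeout: 1s       # count it as an error if there is no response within 1 second
        threshold: 1      # consider the check "down" after a single error
        tcp:
            port: 8080    # placeholder port
```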
To enable Pebble auto-restart behavior based on a check, use the on-check-failure map in the service configuration (this is what ties together services and checks). For example, to restart the “server” service when the “test” check fails, use the following:
```yaml
services:
    server:
        override: merge
        on-check-failure:
            # can also be "shutdown", "success-shutdown", or "ignore" (the default)
            test: restart
```
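For completeness, a sketch of a single layer that defines both the service and the check it reacts to might look like the following; the command and URL are placeholder values:

```yaml
services:
    server:
        override: replace
        command: /usr/bin/myserver    # placeholder command
        startup: enabled
        on-check-failure:
            test: restart             # restart "server" when the "test" check goes down

checks:
    test:
        override: replace
        period: 10s
        http:
            url: http://localhost:8080/test    # placeholder URL
```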
Examples
Below is an example layer showing the three different types of checks:
```yaml
checks:
    up:
        override: replace
        level: alive
        period: 30s
        threshold: 1  # an aggressive threshold
        exec:
            command: service nginx status

    online:
        override: replace
        level: ready
        tcp:
            port: 8080

    test:
        override: replace
        http:
            url: http://localhost:8080/test
```
Checks command
You can view check status using the pebble checks command. This reports the checks along with their status (“up” or “down”) and number of failures. For example:
```
user@host:~$ pebble checks
Check   Level  Status  Failures  Change
up      alive  up      0/1       10
online  ready  down    1/3       13 (dial tcp 127.0.0.1:8000: connect: connection refused)
test    -      down    42/3      14 (Get "http://localhost:8080/": dial t... run "pebble tasks 14" for more)
```
The “Failures” column shows the current number of failures since the check started failing, a slash, and the configured threshold. The “Change” column shows the change ID of the change driving the check, along with a (possibly truncated) error message from the most recent failure. Running pebble tasks <change-id> will show the change’s task, including the last 10 error messages in the task log.
Health checks are implemented using two change kinds:
- perform-check: drives the check while it’s “up”. The change finishes when the number of failures hits the threshold, at which point the change switches to Error status and a recover-check change is spawned. Each check failure records a task log.
- recover-check: drives the check while it’s “down”. The change finishes when the check starts succeeding again, at which point the change switches to Done status and a new perform-check change is spawned. Again, each check failure records a task log.
Health endpoint
If the --http option was given when starting pebble run, Pebble exposes a /v1/health HTTP endpoint that allows a user to query the health of configured checks, optionally filtered by check level with the query string ?level=<level>. This endpoint returns an HTTP 200 status if the checks are healthy, and HTTP 502 otherwise.
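As a sketch of how the endpoint might be queried (the listen address below is an arbitrary example, not a Pebble default):

```
# Start the daemon with its HTTP API listening on port 4000.
pebble run --http=":4000"

# In another shell, query overall health, then only the "ready"-level checks.
curl -i http://localhost:4000/v1/health
curl -i "http://localhost:4000/v1/health?level=ready"
```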
Each check can specify a level of “alive” or “ready”. These have semantic meaning: “alive” means the check or the service it’s connected to is up and running; “ready” means it’s properly accepting network traffic. These correspond to Kubernetes “liveness” and “readiness” probes.
The tool running the Pebble server can make use of this. For example, under Kubernetes you could configure its liveness and readiness probes to hit Pebble’s /v1/health endpoint with the ?level=alive and ?level=ready filters, respectively.
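A minimal sketch of what that might look like in a Kubernetes container spec, assuming Pebble’s HTTP API listens on port 4000 (the container name, image, and port are placeholders):

```yaml
containers:
  - name: my-app             # placeholder container name
    image: my-app:latest     # placeholder image
    livenessProbe:
      httpGet:
        path: /v1/health?level=alive
        port: 4000
    readinessProbe:
      httpGet:
        path: /v1/health?level=ready
        port: 4000
```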
If only a “ready” check or only an “alive” check is configured, ready implies alive, and not-alive implies not-ready. If you’ve configured an “alive” check but no “ready” check, and the “alive” check is unhealthy, /v1/health?level=ready will report unhealthy as well, and the Kubernetes readiness probe will act on that. On the other hand, not-ready does not imply not-alive: if you’ve configured a “ready” check but no “alive” check, and the “ready” check is unhealthy, /v1/health?level=alive will still report healthy.
If there are no checks configured, the /v1/health endpoint returns HTTP 200, so the liveness and readiness probes are successful by default. To use this feature, you must explicitly create checks with level: alive or level: ready in the layer configuration.