Service Monitoring

A device being reachable doesn’t mean the service running on it is actually working. A web server can answer pings while its login page returns errors; a mail relay can be up while it refuses connections. Service Monitoring closes that gap. It runs active checks — sometimes called synthetic or uptime checks — that reach out to a service on a schedule, confirm it responds correctly, and record how long it took.
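As a rough illustration of what one active check involves, the sketch below times a single probe and records whether it passed and why it failed. It is not GridNMS code: `CheckResult`, `run_check`, and the `probe` callable are hypothetical names chosen for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    ok: bool               # did the service respond correctly?
    latency_ms: float      # how long the probe took, in milliseconds
    error: Optional[str]   # reason for the failure, if any


def run_check(probe: Callable[[], bool]) -> CheckResult:
    """Run one probe, time it, and capture any failure reason."""
    start = time.perf_counter()
    try:
        ok = bool(probe())
        error = None if ok else "probe returned an unexpected result"
    except Exception as exc:  # connection refused, timeout, DNS failure, ...
        ok, error = False, f"{type(exc).__name__}: {exc}"
    latency_ms = (time.perf_counter() - start) * 1000.0
    return CheckResult(ok, latency_ms, error)
```

A scheduler would call something like `run_check` once per interval and store each result; the probe itself depends on the check type.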

The Service Monitoring page lists each check with its type, target, current status, and latest response time.

How this differs from device reachability

GridNMS already tells you whether a device is up (see Devices & Inventory). That check answers “is the box on the network?” Service Monitoring answers a different question: “is the service on that box answering correctly?” The two compare as follows:

	Device reachability	Service Monitoring
What it checks	The device responds on the network	A specific service answers correctly
Example	The web server replies to a ping	`https://app/health` returns OK in under 500 ms
Catches	Hardware down, link cut, device offline	App crashed, port closed, cert/DNS broken, service slow

Use both together: reachability for the device, a service check for each service that device exposes that you actually care about.
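To make the service-check column concrete, here is a minimal sketch of an HTTP check with an optional keyword, written against Python's standard library. It only illustrates the idea and is not how GridNMS implements it. The function names are hypothetical, and treating any 2xx or 3xx status as a pass is an assumption.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def evaluate_http(status: int, body: str, keyword: Optional[str] = None) -> Tuple[bool, str]:
    """Decide whether an HTTP response counts as a pass.

    Assumption: 2xx/3xx is acceptable; the product's real rule may differ.
    """
    if not 200 <= status < 400:
        return False, f"unexpected HTTP status {status}"
    if keyword and keyword not in body:
        return False, f"keyword {keyword!r} not found in response body"
    return True, ""


def http_check(url: str, keyword: Optional[str] = None, timeout: float = 5.0) -> Tuple[bool, str]:
    """Fetch the URL and evaluate the response; network errors count as failures."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate_http(resp.status, body, keyword)
    except urllib.error.HTTPError as exc:             # 4xx/5xx responses
        return False, f"unexpected HTTP status {exc.code}"
    except (urllib.error.URLError, OSError) as exc:   # refused, timeout, DNS failure
        return False, str(exc)
```

Calling `http_check("https://app/health", keyword="OK")` would mirror the example in the table. The 500 ms part is a latency threshold, which would be measured around the call and evaluated separately.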

Where to find it

Open Configure → Service Monitoring. The page shows every check you’ve created, one per row, with its type, target, current status, latency, the collector running it, and when it was last checked. The list refreshes on its own, so you can leave it open as a status board.

Creating a check

Click Add monitor and fill in the dialog.

Name — a label you’ll recognize, e.g. Web app — login page.
Target — choose Manual host to type a hostname or IP, or Managed device to pick one of your existing devices from the list.

Check type — what kind of service you’re testing:

Type	What it verifies
HTTP	A web endpoint responds. Optionally give a full URL (otherwise it’s built from the host) and a keyword that must appear in the response body.
DNS	A name resolves. Pick the record type (A, AAAA, CNAME, MX, TXT, NS).
SMTP	A mail server accepts a connection.
SMB	A Windows/file-sharing service answers.

Collector — Service Monitoring runs each check from a collector, not from the GridNMS server, so the check is performed from inside the relevant network. GridNMS offers the collector(s) eligible to reach the target. If the target sits inside a named network, the matching collector is chosen automatically; otherwise pick which one runs it.
Port — leave as auto to use the default port for the check type, or set a specific one.
Interval (s) — how often the check runs (for example, 60 for once a minute). Shorter intervals notice an outage sooner but put more load on the collector and the target.
Warn (ms) and Crit latency (ms) — optional response-time thresholds. A response slower than Warn reports the check as degraded; one slower than Crit contributes to a down/critical state, even though the service technically answered.
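The dialog fields above can be pictured as a small record. The sketch below is an assumption-laden model, not the GridNMS data format: `MonitorSpec` and its field names are hypothetical, the port table uses the well-known defaults for each protocol (what auto actually resolves to in GridNMS is not documented here), and the rule that Warn should sit below Crit is a sanity check added for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Well-known protocol defaults. Assumption: "auto" picks something like these;
# an https:// URL would normally imply 443 rather than 80.
DEFAULT_PORTS = {"HTTP": 80, "DNS": 53, "SMTP": 25, "SMB": 445}


@dataclass
class MonitorSpec:
    name: str
    host: str
    check_type: str
    port: Optional[int] = None       # None stands in for "auto"
    interval_s: int = 60
    warn_ms: Optional[int] = None
    crit_ms: Optional[int] = None

    def resolved_port(self) -> int:
        """Return the explicit port, or the default for the check type."""
        return self.port if self.port is not None else DEFAULT_PORTS[self.check_type]

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the spec looks sane."""
        problems = []
        if self.check_type not in DEFAULT_PORTS:
            problems.append(f"unknown check type {self.check_type!r}")
        if self.interval_s <= 0:
            problems.append("interval must be a positive number of seconds")
        if (self.warn_ms is not None and self.crit_ms is not None
                and self.warn_ms >= self.crit_ms):
            problems.append("warn threshold should be below crit threshold")
        return problems
```

For instance, `MonitorSpec("Web app - login page", "app.example.internal", "HTTP", warn_ms=500, crit_ms=2000)` validates cleanly and resolves to port 80 under these assumptions; the host name is a placeholder.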

Click Create monitor. The first result appears after the next interval elapses; until then the check shows as unknown.

Reading the results

Each row shows a status and the latest latency:

Status	Meaning
up	The service responded correctly within thresholds.
degraded	It responded, but slower than your Warn threshold.
down	It failed to respond, returned the wrong result, or the keyword was missing.
unknown	The check hasn’t run yet, or its collector is offline.
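One way to read the status table is as a small decision function. The sketch below is an interpretation, not the product's actual logic: `classify` and its parameters are hypothetical, and mapping a Crit-threshold breach straight to down is an assumption, since the thresholds are only described as contributing to a down/critical state.

```python
from typing import Optional


def classify(has_run: bool,
             collector_online: bool,
             responded_ok: bool,
             latency_ms: float,
             warn_ms: Optional[float] = None,
             crit_ms: Optional[float] = None) -> str:
    """Map one check result to up / degraded / down / unknown."""
    if not has_run or not collector_online:
        return "unknown"
    if not responded_ok:            # no response, wrong result, or missing keyword
        return "down"
    if crit_ms is not None and latency_ms > crit_ms:
        return "down"               # assumption: a Crit breach is treated as down
    if warn_ms is not None and latency_ms > warn_ms:
        return "degraded"
    return "up"
```

With Warn at 500 ms and Crit at 2000 ms, a correct response in 120 ms classifies as up, one in 800 ms as degraded, and a failed response as down regardless of latency.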

Latency is the round-trip time of the last check, in milliseconds. A steady number suggests a healthy service; a rising trend is an early warning that the service is slowing down before it breaks.
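If you export latency samples and want to quantify a rising trend, a least-squares slope is one simple option. This is a generic sketch, not something GridNMS is documented to compute; the function names and the 1 ms-per-check threshold are arbitrary choices for the example.

```python
from typing import Sequence


def latency_slope(samples_ms: Sequence[float]) -> float:
    """Least-squares slope of latency, in ms per check (positive means slower)."""
    n = len(samples_ms)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples_ms) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_ms))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_rising(samples_ms: Sequence[float], min_slope: float = 1.0) -> bool:
    """Flag a trend when latency grows by at least min_slope ms per check."""
    return latency_slope(samples_ms) >= min_slope
```

A slope near zero matches the steady case described above, while a consistently positive slope over many checks is the early-warning case.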

The detail view

Click any row to open its detail panel. It shows the check’s full configuration and a latency chart over a time range you choose, which helps you spot a service that has been slowly degrading or confirm an outage window. The Last error field shows why the most recent check failed, if it did.

Editing and disabling a check

Use the pencil icon on a row to edit it. You can change the name, interval, thresholds, severity, HTTP/DNS options, and which collector runs it. The target and check type are fixed — to change those, delete the check and create a new one.
Use the Enabled toggle to pause a check without deleting it — handy during planned maintenance so you don’t get alerted for an outage you already know about.
Use the trash icon to remove a check entirely.

Getting alerted when a check fails

A failing check raises an event in GridNMS, exactly like any other alert. When you create or edit a check you set its down-event severity (Critical, Major, Minor, Warning, or Info), which controls how prominently the failure shows up.

From there it flows through your alerting like everything else: the event appears on the Events & Alerts page and is delivered to the right people through your Notifications setup. So you can, for example, have a critical web-app check page on-call while a minor internal check only sends email.
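The paging-versus-email example can be sketched as a severity-to-channel lookup. Everything here is hypothetical: the real routing lives in your Notifications setup, the channel names are invented, and ranking the five severities from Info (lowest) to Critical (highest) is an assumption based on the order in which they are listed.

```python
from typing import List

# Assumed ranking, least to most severe.
SEVERITIES = ["Info", "Warning", "Minor", "Major", "Critical"]

# Hypothetical rules: (minimum severity, channel name).
ROUTES = [("Critical", "page-on-call"), ("Info", "email")]


def channels_for(severity: str) -> List[str]:
    """Return every channel whose minimum severity the event meets."""
    rank = SEVERITIES.index(severity)
    return [channel for minimum, channel in ROUTES
            if rank >= SEVERITIES.index(minimum)]
```

Under these rules a Critical down event reaches both the pager and email, while a Minor one only sends email.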

Where to go next

Decide who gets told, and how, in Notifications.
Triage the failures these checks raise on Events & Alerts.
Understand the difference between device and service health in Devices & Inventory.