I recently discovered that I had a faulted zpool on a server in my house.
I believe this is partly my own fault: a disc in a raidz failed, I didn’t notice, and a power-cycling event then caused some corruption on the other two discs.
This post is not about that; this is about the systems I’m putting in place in the future to prevent a similar thing from occurring.
My main mistake was failing to understand that a zpool can be degraded but still online. I wrongly assumed it would enter some kind of read-only mode once a disc had failed (of course, I didn’t truly expect a disc to fail either), and that at that point I’d take some action.
I realised I needed some kind of alerting to stop me missing an event like this again, but I was also conscious that, having missed this event, I might equally miss a fault in the alerting system itself.
I therefore decided to use a third-party system, on the basis that a hosted service might be more resilient to my own neglect of a non-mission-critical server at home. At the same time, I was keen not to be totally locked in to that third party, or to feel like I’d given up too much privacy or autonomy.
The solution I came up with looks like this:
- Run a dockerised version of Prometheus locally on the server
- Run a dockerised version of Prometheus Node Exporter on the server. The node exporter handily creates metrics on zpool health (a sample is shown just after this list)
- Use the remote write functionality of Prometheus to write a select set of metrics to the free tier of Grafana Cloud
- Fire alerts from Grafana Cloud to an email address I check regularly
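For reference, the zpool health metric mentioned in the list looks roughly like this (a sample from a recent node exporter; the pool name is illustrative and the exact label set can vary between versions, so check your own /metrics output):

```
node_zfs_zpool_state{state="online",zpool="tank"} 1
node_zfs_zpool_state{state="degraded",zpool="tank"} 0
node_zfs_zpool_state{state="faulted",zpool="tank"} 0
node_zfs_zpool_state{state="offline",zpool="tank"} 0
node_zfs_zpool_state{state="unavail",zpool="tank"} 0
```

There is one series per possible state, with a value of 1 for the state the pool is currently in; the alert expressions later in this post key off exactly that.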
This left me with a Docker Compose config like this:
# docker-compose.yml
networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data: {}

services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    expose:
      - 9100
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    expose:
      - 9090
    networks:
      - monitoring
And a Prometheus config like this:
# prometheus.yml
global:
  scrape_interval: 1m

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 1m
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

remote_write:
  - url: https://<redacted>.grafana.net/api/prom/push
    basic_auth:
      username: <username>
      password: <password>
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "node_zfs_zpool_state"
        action: keep
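On the Grafana Cloud side, the alerts only need to watch that one forwarded metric. As a sketch, the logic looks like this when written as Prometheus-style alerting rules; the rule names, durations and labels here are illustrative rather than my exact setup, and the same conditions can equally be built in Grafana Cloud’s alerting UI:

```yaml
# Sketch of the alert logic in Prometheus-style rule form (illustrative).
groups:
  - name: zpool
    rules:
      # Fire if any pool reports a state other than online
      # (degraded, faulted, unavail, and so on).
      - alert: ZpoolNotOnline
        expr: node_zfs_zpool_state{state!="online"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "zpool {{ $labels.zpool }} is {{ $labels.state }}"

      # Fire if the metric stops arriving altogether, e.g. because
      # Prometheus, the node exporter or the whole server is down.
      - alert: ZpoolMetricsMissing
        expr: absent(node_zfs_zpool_state)
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "No zpool metrics have been received recently"
```

Because node_zfs_zpool_state is the only series being forwarded, the absent() rule doubles as a check that the whole monitoring stack on the server is still alive.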
I then simulated a couple of outage conditions: docker compose down on the Prometheus stack, and zpool offline zpool <disc-name> to create a degraded zpool. In both cases I got alerts in my email.
I’m happy with this outcome because:
- It removes the risk of me screwing up the monitoring and then not getting an alert (yes, there’s still the risk that Grafana Cloud itself stops working)
- I’m using off-the-shelf Prometheus and Node Exporter, so switching to an alternative cloud platform should be easy if needed
- I can also collect metrics locally
- I don’t need to ship lots of information about my server to Grafana like I would if I used Alloy in its default config