I recently discovered that I had a faulted zpool on a server in my house.
I believe this is partly my own fault: a disc in a raidz failed, I didn’t notice, and a power-cycling event then caused some corruption on the other two discs.
This post is not about that; this is about the systems I’m putting in place in the future to prevent a similar thing from occurring.
My main mistake was failing to understand that a zpool can be degraded but still online. I wrongly assumed it would enter some kind of read-only mode once a disc had failed (of course, I didn’t truly expect a disc to fail either), and that at that point I’d take some action.
I realised I needed some kind of alerting to stop me missing an event like this again, but I was also conscious that, having missed this event, I might equally miss a fault in the alerting system itself.
I therefore decided to use a third-party system, on the basis that a hosted service might be more resilient to my own neglect of a non-mission-critical server at home. At the same time, I was keen not to be totally locked in to that third party, or to feel like I’d given up too much privacy or autonomy.
The solution I came up with looks like this:
- Run a dockerised version of Prometheus locally on the server
- Run a dockerised version of Prometheus Node Exporter on the server. The node exporter handily creates metrics on zpool health (a sample is shown just after this list)
- Use the remote write functionality of Prometheus to write a select set of metrics to the free tier of Grafana Cloud
- Fire alerts from Grafana Cloud to an email address I check regularly
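For reference, the zpool health metric mentioned in the list looks roughly like this (a sample from a recent node exporter; the pool name is illustrative and the exact label set can vary between versions, so check your own /metrics output):

```
node_zfs_zpool_state{state="online",zpool="tank"} 1
node_zfs_zpool_state{state="degraded",zpool="tank"} 0
node_zfs_zpool_state{state="faulted",zpool="tank"} 0
node_zfs_zpool_state{state="offline",zpool="tank"} 0
node_zfs_zpool_state{state="unavail",zpool="tank"} 0
```

There is one series per possible state, with a value of 1 for the state the pool is currently in; the alert expressions later in this post key off exactly that.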
This left me with a Docker Compose config like this:
# docker-compose.yml
networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data: {}

services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    expose:
      - 9100
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    expose:
      - 9090
    networks:
      - monitoring
And a Prometheus config like this:
# prometheus.yml
global:
  scrape_interval: 1m

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 1m
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

remote_write:
  - url: https://<redacted>.grafana.net/api/prom/push
    basic_auth:
      username: <username>
      password: <password>
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "node_zfs_zpool_state"
        action: keep
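On the Grafana Cloud side, the alerts only need to watch that one forwarded metric. As a sketch, the logic looks like this when written as Prometheus-style alerting rules; the rule names, durations and labels here are illustrative rather than my exact setup, and the same conditions can equally be built in Grafana Cloud’s alerting UI:

```yaml
# Sketch of the alert logic in Prometheus-style rule form (illustrative).
groups:
  - name: zpool
    rules:
      # Fire if any pool reports a state other than online
      # (degraded, faulted, unavail, and so on).
      - alert: ZpoolNotOnline
        expr: node_zfs_zpool_state{state!="online"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "zpool {{ $labels.zpool }} is {{ $labels.state }}"

      # Fire if the metric stops arriving altogether, e.g. because
      # Prometheus, the node exporter or the whole server is down.
      - alert: ZpoolMetricsMissing
        expr: absent(node_zfs_zpool_state)
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "No zpool metrics have been received recently"
```

Because node_zfs_zpool_state is the only series being forwarded, the absent() rule doubles as a check that the whole monitoring stack on the server is still alive.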
I then simulated a couple of outage conditions: docker compose down on the Prometheus stack, and zpool offline zpool <disc-name> to create a degraded zpool. In both cases I got alerts in my email.
I’m happy with this outcome because:
- It removes the risk of me screwing up the monitoring and then not getting an alert (yes, there’s still the risk that Grafana Cloud itself stops working)
- I’m using off-the-shelf Prometheus and Node Exporter, so switching to an alternative cloud platform should be easy if needed
- I can also collect metrics locally
- I don’t need to ship lots of information about my server to Grafana like I would if I used Alloy in its default config