Watchdog
Right - it actually monitors a heartbeat between the device and the PC, and if the heartbeat fails, it will power-cycle the device. I did cover that in the article, but it might have been confusing.

This is how most clustering/failover solutions work. If you're not familiar with high-availability, the thing about heartbeat monitoring is that you *can* get false positives. Frequently the heartbeat is an RPC monitoring service - and if the RPC service gets overwhelmed or becomes unresponsive, it can trigger a false fail-over. That can cause lots of problems, like "split-brain", where two machines are both active and accessible and each thinks it's in control. In this case, a false positive could result in a power-cycle while the services or applications hosted by the server are still available, disrupting users. I'm not sure how the heartbeat for the Watchdog is implemented at the API level, but regardless of what method they're using, it is still vulnerable to a false positive where the strip loses communication with the PC and initiates a restart.
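
To make the false-positive point concrete, here's a rough Python sketch of the usual mitigation: don't act on a single missed heartbeat, only on several in a row. This is not the Watchdog's API (I don't know how that works internally); `HeartbeatMonitor`, `max_misses`, and `simulate` are names I made up for illustration, and the list of poll results stands in for whatever the real heartbeat check returns.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Hypothetical debounced heartbeat check (not the Watchdog's real logic)."""
    max_misses: int = 3  # assumed threshold: consecutive misses before acting
    misses: int = 0      # current run of consecutive missed heartbeats

    def record(self, heartbeat_ok: bool) -> bool:
        """Feed one poll result; return True when a power-cycle should fire."""
        if heartbeat_ok:
            self.misses = 0  # any good heartbeat clears the run
            return False
        self.misses += 1
        if self.misses >= self.max_misses:
            self.misses = 0  # start counting fresh after the restart
            return True
        return False


def simulate(polls, max_misses=3):
    """Return the poll indexes at which a power-cycle would be triggered."""
    monitor = HeartbeatMonitor(max_misses=max_misses)
    return [i for i, ok in enumerate(polls) if monitor.record(ok)]


if __name__ == "__main__":
    # One dropped heartbeat (index 1), then a real outage (indexes 3-5).
    polls = [True, False, True, False, False, False, True]
    print(simulate(polls, max_misses=3))
    print(simulate(polls, max_misses=1))
```

With a threshold of 1, the single blip at index 1 already triggers a restart, which is exactly the false positive described above. A higher threshold rides out the blip, at the cost of waiting longer before recovering from a genuine hang.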

That shouldn't stop you from implementing a solution like this if you have a critical server you need to automate hang recovery on. In fact, this is probably one of the most accessible ways I've ever seen to achieve rudimentary high-availability for a server. But you're going to want to do some testing and tweaking of the configuration before you implement it in production, or be prepared for angry users if they lose data when the system spontaneously reboots on them while they're working.
Contributor
Posted by dcolbert@...
6th Jan