Yamon 0.91

I've uploaded a new Yamon release. Improvements:

  • Different tests can have different alert targets
  • Alerts can be sent to syslog or other arbitrary programs
  • Direct SMTP can be used instead of invoking 'mail' to send e-mail alerts

So the alerting stuff is more or less feature complete now. Yay!

If you're one of the two or three people who actually downloaded the program, and if you happen to be using it, note that the syntax for specifying multiple alert recipients has changed. Instead of using ',' as a delimiter, I've switched to ';'.

As far as I know I'm still the only user, so I don't expect this to matter much. I'll try to refrain from changing syntax in non-backwards-compatible ways in the future.

2 comments:

Anonymous said...

Looks interesting! I have been considering writing something similar, able to monitor my web server from my home wireless router (which runs OpenWRT). Unfortunately this doesn't have enough flash for Perl, so shell-script is about the only option. So I may write "yet another yet another monitoring script"....

I see you have a "sanity check" of pinging google.com before doing anything else. Good plan. The other thing that I was considering was doing some sort of exponential back-off: imagine if the server goes down/up/down/up every 5 minutes while you're on holiday, and you're paying $$$ for each SMS. So ideally you'd get an SMS on the first "down", and then another message after n mins, and another after 2n, 4n, 8n etc. etc. Have you considered something like this?
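To make the back-off idea concrete, here is a rough Python sketch. It is not part of Yamon; the class name `BackoffAlerter` and its methods are made up for illustration. It assumes that n, 2n, 4n, ... are measured in minutes since the first alert, and that the schedule is only restarted by an explicit `reset()`, so a brief "up" in the middle of a flapping episode does not start it over.

```python
class BackoffAlerter:
    """Hypothetical helper: alert on the first failure, then only after
    n, 2n, 4n, ... minutes have passed since that first alert."""

    def __init__(self, n_minutes):
        self.n_minutes = n_minutes
        self.first_alert_at = None    # time of the first alert in this episode
        self.next_offset = n_minutes  # minutes after the first alert before the next one

    def should_alert(self, is_down, now):
        """Return True if a failed check at time `now` (in minutes) should alert."""
        if not is_down:
            return False
        if self.first_alert_at is None:
            self.first_alert_at = now
            return True
        if now - self.first_alert_at >= self.next_offset:
            self.next_offset *= 2     # double the wait before the following alert
            return True
        return False

    def reset(self):
        """Call once the problem is considered over, to restart the schedule."""
        self.first_alert_at = None
        self.next_offset = self.n_minutes


# A server that goes down/up/down/up every 5 minutes, with n = 10:
alerter = BackoffAlerter(10)
alert_times = []
for minute in range(0, 45, 5):
    is_down = (minute // 5) % 2 == 0
    if alerter.should_alert(is_down, minute):
        alert_times.append(minute)
print(alert_times)
```

In this run nine checks produce four messages instead of five, and the gaps keep growing the longer the flapping goes on.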

Bjarni Rúnar said...

You can tell the script to not alert unless the state has been consistent for the past N tests. So "flapping" won't cause an alert at all in many cases.

Generally my alert criteria are "alert if state changes and stays consistent for 3 tests in a row. If it's broken, remind me once every hour until it's fixed".
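Roughly, that rule could be sketched like this in Python. This is an illustration only, not Yamon's actual code: the `StateTracker` name, its parameters, and the returned strings are all assumptions, and time is simply counted in minutes.

```python
class StateTracker:
    """Hypothetical sketch: report a state change only after it has held for
    `required` tests in a row, and send a reminder every `remind_every`
    minutes while the confirmed state is 'down'."""

    def __init__(self, required=3, remind_every=60):
        self.required = required
        self.remind_every = remind_every
        self.confirmed = None      # last confirmed state (True = up, False = down)
        self.candidate = None      # state seen in the most recent test
        self.streak = 0            # how many tests in a row returned `candidate`
        self.last_alert_at = None  # time of the last alert or reminder

    def observe(self, is_up, now):
        """Feed one test result; return 'down', 'recovered', 'reminder' or None."""
        if is_up == self.candidate:
            self.streak += 1
        else:
            self.candidate = is_up
            self.streak = 1

        if self.streak >= self.required and is_up != self.confirmed:
            previous = self.confirmed
            self.confirmed = is_up
            if previous is None and is_up:
                return None        # healthy at startup: nothing to report
            self.last_alert_at = now
            return "recovered" if is_up else "down"

        if self.confirmed is False and now - self.last_alert_at >= self.remind_every:
            self.last_alert_at = now
            return "reminder"
        return None
```

With a test every 5 minutes, a single failed test followed by a good one produces nothing, while three failures in a row produce one "down", then a "reminder" an hour later if nothing has changed.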

I want to know when the system recovers as well as knowing when it breaks, so I can stop panicking. :-)

There is one potential problem with this strategy for suppressing "flaps" - if a system flaps once or twice, you probably don't care. If it's flapping for hours, you want to know.