
Key: MBH-259
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Normal
Assignee: Dave Evans
Reporter: Oliver Charles
Votes: 0
Watchers: 0

MusicBrainz Hosting

Add a Nagios check to make sure statistics are being collected

Created: 17/Jul/12 07:40 PM   Updated: 20/Aug/12 02:49 PM   Resolved: 18/Aug/12 10:19 AM
Component/s: Nagios
Affects Version/s: None
Fix Version/s: None

File Attachments: 1. File statistics-collected (3 kB) 20/Jul/12 07:49 AM - Oliver Charles
2. Text File writing-nagios-plugins.txt (3 kB) 18/Jul/12 04:40 PM - Dave Evans



Description

Some complete cretin commented out the statistics collection, and we didn't collect stats for 5 days because of it. It'd be great if said cretin would be alerted of his stupid sooner.

djce: could you let me know what I need to write to get a Nagios check added? Ping me on IRC if convenient. Mainly I need to know what user it runs as, and how the script should communicate success or failure.



Dave Evans added a comment - 18/Jul/12 04:40 PM

Nagios plugin guide attached.

Feel free to ping/grab me etc for more!
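
For readers without the attachment: the standard Nagios plugin convention is that a check prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Below is a minimal sketch of that convention in Python. It is not the attached guide or the real check (which is a Perl script), and the names `format_result`, `report` and the `STATISTICS` label are made up for illustration.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def format_result(status, message, service="STATISTICS"):
    """Build the single status line that Nagios displays for the check.

    The service label is a hypothetical choice, not taken from the real script.
    """
    return f"{service} {STATUS_NAMES[status]} - {message}"


def report(status, message):
    """Print the status line and exit with the matching plugin exit code."""
    print(format_result(status, message))
    sys.exit(status)
```

A check would end with a call such as `report(CRITICAL, "no statistics collected in the last 25 hours")`; Nagios reads the exit code to decide the state and shows the printed line alongside it.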


Ian McEwen added a comment - 19/Jul/12 12:52 AM

Note: statistics are actually collected slightly after 00:00 UTC. If you go with the "more than a day has passed since the listed update date" method I suggested in IRC, I'd suggest a threshold of something like 24.5 to 25 hours. If you instead go with "statistics should have been collected during today-in-UTC", don't check until at least 20 minutes past midnight UTC. For reference, today's stats were available at about 15 minutes after the hour.

Just figured I should mention to prevent some false positives.
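
The two approaches above can be sketched roughly as follows. This is only an illustration of the timing logic, not the script that was eventually attached; the function names and the exact constants (25 hours, 20 minutes) are assumptions drawn from the numbers in this comment, and both functions assume timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=25)   # one day plus slack for the post-midnight run
GRACE = timedelta(minutes=20)   # how long after 00:00 UTC to wait before judging


def stale_by_age(last_collected, now, max_age=MAX_AGE):
    """Method 1: the newest statistics are older than max_age."""
    return now - last_collected > max_age


def missing_today(last_collected, now, grace=GRACE):
    """Method 2: nothing has been collected yet during today (UTC).

    Returns False during the grace period, since collection may still be running.
    """
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if now < midnight + grace:
        return False
    return last_collected < midnight
```

With stats landing at about 00:15 UTC, a plain 24-hour threshold would flag a false positive whenever one day's run finishes slightly later than the previous day's, which is what the extra hour of slack avoids.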


Oliver Charles added a comment - 20/Jul/12 07:49 AM

Attached is the statistics-collected script. It requires Perl DBI and DBD::Pg modules, but otherwise just needs Perl.

I don't know where this check will actually be run, so I have not depended on musicbrainz-server or anything else. If it can actually be run by astro or something, then I'd prefer to get this checked in to the musicbrainz-server project and simplify it to use DBDefs.

Thoughts?


Dave Evans added a comment - 21/Jul/12 05:25 AM

Running it on astro seems like a good idea (assuming that, for now, you're only talking about monitoring production). I suggest you add it to musicbrainz-server; presumably it can then make use of the server's settings so that it knows how to connect to totoro. Ideally you can just run something like

/path/to/codebase/bin/nagios/check-statistics-collected

(with no arguments, i.e. it Just Works)

then check that the same thing works when run as nagios, i.e.

sudo -u nagios /path/to/codebase/bin/nagios/check-statistics-collected

then once we've got that far we'll look into integrating it into Nagios.

Other questions to ponder:

  • how often should it be checked (once every 5 mins? 4 hours? 24 hours?)
  • if it goes warning/critical, how often should it be checked? (if different)
  • whom, if anyone, should Nagios notify when there's a problem? (currently notifications take the form of emails)
  • if there's a problem, should Nagios notify immediately, or (the default) wait for it to report a problem several times in a row?

Oliver Charles added a comment - 23/Jul/12 11:09 AM

Oliver Charles added a comment - 06/Aug/12 01:30 PM

Re-opening: the Nagios script is now in master, but I need to talk to Dave about what to do next (i.e., actually getting it run).


Oliver Charles added a comment - 06/Aug/12 02:10 PM

Ok, this works fine if I do:

sudo -u nagios /home/musicbrainz/musicbrainz-server/admin/nagios/statistics-collected

I would like this to be checked:

  • At 5am, in whatever timezone astro is in.
  • In warning/critical please check every 2 hours
  • I would like to be notified of any change in status
  • I am happy with Nagios reporting either immediately or after a few checks. Go with whatever is easiest.

Dave Evans added a comment - 18/Aug/12 10:05 AM

djce@dudley gateway$ git commit -a
[master 08cbd43] MBH-259 Add a Nagios check to make sure statistics are being collected
1 files changed, 1 insertions, 0 deletions

Also
djce@astro:~$ cat /etc/nagios/nrpe.d/musicbrainz-server.cfg
command[check_mbserver_stats_collected]=/home/musicbrainz/musicbrainz-server/admin/nagios/statistics-collected
djce@astro:~$
which maybe should be part of the musicbrainz-server codebase?


Dave Evans added a comment - 18/Aug/12 10:14 AM

djce@dudley nagios$ git commit -a
[master 83ce853] MBH-259 Add a Nagios check to make sure statistics are being collected
2 files changed, 28 insertions, 0 deletions



Oliver Charles added a comment - 20/Aug/12 02:49 PM

I don't have access to that URL, so I'm going to take your word. I think having nagios configuration outside the server is not a problem, so this all looks good now.