We had a site outage because memcached on wiley crashed. This caused all of the web front ends to crash with the previously seen error message:
2011-11-28 17:31:50.495995500
[error] Caught exception in engine "Couldn't save expires:c1c611ace1febf8bc148e4a72d67ccbce510332b / 1322508709 in memcached storage"
(This message was taken from MBS-3590.)
Nagios did not send any messages about memcached on wiley being down. Can you please verify that we are monitoring memcached and, if not, add monitoring? Also, I see the MediaWiki instance of memcached is managed by daemontools; can we move the second instance of memcached on wiley to daemontools as well? The second instance runs on port 11215 and is started by /etc/init.d
What sort of monitoring would you like? If we make it daemontools-controlled then "is it up" is not a terribly useful thing to monitor, since (unless someone disables the service) it'll always be up.
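One kind of check that stays meaningful under daemontools is a functional one: rather than asking whether the process exists, store a value and read it back. The sketch below is only an illustration of that idea, not an existing plugin. It speaks the memcached text protocol over a plain socket and returns the usual Nagios plugin exit codes (0 for OK, 2 for CRITICAL). The key name `nagios_check`, the 60-second expiry, the 3-second timeout, and the assumption that writing a scratch key into this instance is acceptable are all hypothetical; the host and port defaults are the ones mentioned in this ticket.

```python
import socket
import time

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_memcached(host="wiley", port=11215, timeout=3.0):
    """Set a scratch key and read it back; return (exit_code, message)."""
    key = b"nagios_check"  # hypothetical key name, not used by the site
    value = str(int(time.time())).encode()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock, \
                sock.makefile("rwb") as stream:
            # set <key> <flags> <exptime> <bytes>\r\n<data>\r\n
            stream.write(b"set %s 0 60 %d\r\n%s\r\n" % (key, len(value), value))
            stream.flush()
            reply = stream.readline().strip()
            if reply != b"STORED":
                return CRITICAL, "set failed: %r" % reply

            stream.write(b"get %s\r\n" % key)
            stream.flush()
            header = stream.readline().strip()  # "VALUE <key> <flags> <bytes>" on a hit
            if not header.startswith(b"VALUE"):
                return CRITICAL, "get missed: %r" % header
            data = stream.readline().strip()
            if data != value:
                return CRITICAL, "get returned unexpected data: %r" % data
    except OSError as exc:  # refused connections and timeouts both land here
        return CRITICAL, "cannot talk to memcached: %s" % exc
    return OK, "set/get round-trip succeeded"


def main(argv):
    """Entry point for use as a Nagios plugin: script.py [host] [port]."""
    host = argv[1] if len(argv) > 1 else "wiley"
    port = int(argv[2]) if len(argv) > 2 else 11215
    code, message = check_memcached(host, port)
    print("MEMCACHED %s - %s" % ("OK" if code == OK else "CRITICAL", message))
    return code
```

To run it as a plugin, one would add `import sys` and `sys.exit(main(sys.argv))` at the bottom of the script. A check like this would also catch the case where daemontools keeps restarting a memcached that comes up but cannot store anything, which is the failure the web front ends actually reported.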
See email re. move to daemontools.