out of hours notifications?
Hi: I was wondering what procedures you all use for out-of-hours critical monitoring alerts. For example, say a server you're responsible for goes down at 2:30 Sunday morning: how do you get notified? My company currently has a rotating company phone that goes around the department and gets text messages for every down system. The person on call is then responsible for finding the responsible person and waking them up to inform them of the problem. The system we're moving to will send pages directly to our phones for the systems we're responsible for. I personally think that will be better, because right now being on call is far from a fun experience, and it's rare you get a good night's sleep. In other words, our environment is just getting too big for one person to manage after-hours alerts, and with pages going to each of us, the feeling is that we can handle them better since we'll know our own systems. Thoughts? Does anyone use a different approach? Thanks. Ryan
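To make the "pages go to whoever owns the system" idea concrete, here is a minimal sketch in Python. The system names, the contact names, and the fallback to the shared on-call phone are all made up for illustration; a real monitoring product would have its own way of configuring this.

```python
from collections import defaultdict

# Hypothetical ownership table: system name -> contact paged for it.
OWNERS = {
    "esx-host-01": "ryan",
    "core-router-01": "network-oncall",
}

# Hypothetical fallback for systems nobody has claimed.
DEFAULT_CONTACT = "rotating-oncall-phone"


def route_alerts(down_systems):
    """Group down systems by the contact who should be paged for them."""
    pages = defaultdict(list)
    for system in down_systems:
        pages[OWNERS.get(system, DEFAULT_CONTACT)].append(system)
    return dict(pages)


if __name__ == "__main__":
    for contact, systems in route_alerts(["esx-host-01", "mail-01"]).items():
        print(f"page {contact}: {', '.join(systems)}")
```

Keeping a fallback contact means an unowned system still wakes somebody up rather than failing silently.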
I use the text message alert system; however, I would love a better way.
Let me know what you come up with.
Sorry I can't be of more help.
What system are you using to send the alerts?
Are you looking for suggestions on alternative delivery methods or on the
authenticity of alerts?
For example, I have a mailbox set up just to receive all WhatsUp Gold
alerts. If anything is down even for a second I get an alert. However,
it gives a lot of false positives. Most alerts are more genuine if the
threshold is set to 5 minutes or something like that. WhatsUp Gold also
has dependencies, so for example, if a switch goes down and at the same
time all the nodes associated with that switch are unresponsive, then WUG
sends the alert for the switch only.
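The two filters described above (a hold-down threshold and dependency suppression) can be sketched roughly as follows. This is not how WhatsUp Gold implements them; the topology table, node names, and function name are assumptions made up for the example.

```python
HOLD_DOWN_SECONDS = 5 * 60  # only alert once a node has been down this long

# Hypothetical topology: node -> the device it depends on (None = no parent).
PARENT = {"switch-a": None, "server-1": "switch-a", "server-2": "switch-a"}


def alerts_to_send(down_since, now):
    """Return the sorted list of nodes worth alerting on.

    down_since maps each currently-down node to the time (in seconds) at
    which it was first seen down. A node is reported only if it has been
    down for at least HOLD_DOWN_SECONDS and its parent is not down as well.
    """
    alerts = []
    for node, since in down_since.items():
        if now - since < HOLD_DOWN_SECONDS:
            continue  # brief blip, likely a false positive
        if PARENT.get(node) in down_since:
            continue  # parent is down too, so only the parent is reported
        alerts.append(node)
    return sorted(alerts)
```

With a switch and both of its servers down for more than five minutes, only the switch is returned; a server that has been down for one minute produces nothing yet.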
Feel free to ask if you have any questions.
Regards
Darragh Ó Héiligh
Fujitsu
Offices of the Houses of the Oireachtas,
Fredrick Building,
South Fredrick Street,
Dublin2
Telephone: +353 (1) 618 3559
Email: darragh.oheiligh@oireachtas.ie
Internet: http://www.oireachtas.ie
From: Ryan Shugart
Hi Darragh:
A little of both actually. We're using a product called EG Monitoring, which I'd never heard of before I started working here. I can't configure any of my own alerts due to accessibility issues, but that's another story. I guess I'm worried on two fronts. The first is the quality of life issue: does it impact you at all that you could have a pager go off at any time of the night, do you sleep less, etc.? I know that varies a lot from person to person, but that's what's worrying me. The false positive thing also bothers me. We have dependencies set up in EG, but they don't work. We have several sites around the world, and most of those sites have at least one ESX host. I'm responsible for the ESX hosts, but often when we get an alert for an ESX host being down, it's something else like a router, and EG just hasn't figured out the dependency yet. Also, we don't monitor our SANs at all, which I've complained about but am told there's no way to do. Often the only way we find out that a volume is down is when I get notified that the ESX servers can't access a datastore, and then I have to go troubleshooting. So I'm a little concerned about a lot of false positives coming in at 2:30 in the morning, but also about the more general question: do you do anything to make sure you're actually woken up when the 2:30 alert comes in?
Thanks a lot.
Ryan
-----Original Message-----
From: Blind-sysadmins [mailto:blind-sysadmins-bounces@lists.hodgsonfamily.org] On Behalf Of Darragh OHeiligh
Sent: Friday, September 28, 2012 8:00 AM
To: Blind sysadmins list
Cc: Blind-sysadmins
Subject: Re: [Blind-sysadmins] out of hours notifications?
We use Nagios and an Asterisk system to do critical notifications.
Nagios is a Linux package but it can monitor Windows systems. You can write
a plug-in to do anything you can dream up. The action to take for each
notification is also scriptable. So our Nagios system sits in my office and
I have it play different tunes via the PC speaker depending on what
happened.
Nagios also has escalation capabilities, so I have it set up to call my cell
phone if some critical system is down for a long time. It uses Festival
to create a WAV file with a short description of the problem and then calls
my cell phone with the message. So at 3:00 AM, I might get a call with an
electronic voice saying, "DNS is critical".
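For anyone wanting to try something similar, here is a rough Python sketch of the Festival-plus-Asterisk step. It is not the actual script described above. It assumes Festival's text2wave tool is on the PATH, that Asterisk uses its usual call-file spool directory, and that custom sounds live under the path shown; the channel string, sound name, retry values, and directories are all placeholders to adjust for a real install.

```python
import os
import shutil
import subprocess
import tempfile

# Assumed locations; check them against the local Asterisk configuration.
SPOOL_DIR = "/var/spool/asterisk/outgoing"
SOUNDS_DIR = "/var/lib/asterisk/sounds/custom"


def build_call_file(channel, sound_name):
    """Return the text of an Asterisk call file that plays one sound."""
    lines = [
        f"Channel: {channel}",
        "MaxRetries: 2",   # illustrative retry settings
        "RetryTime: 60",
        "WaitTime: 30",
        "Application: Playback",
        f"Data: {sound_name}",  # sound name without the file extension
    ]
    return "\n".join(lines) + "\n"


def synthesize(message, wav_path):
    """Have Festival's text2wave read the message from stdin into a WAV."""
    subprocess.run(["text2wave", "-o", wav_path],
                   input=message.encode(), check=True)


def place_call(message, channel, sound_name="nagios-alert"):
    """Synthesize the message, then queue a call that plays it."""
    os.makedirs(SOUNDS_DIR, exist_ok=True)
    synthesize(message, os.path.join(SOUNDS_DIR, sound_name + ".wav"))
    # Write the call file elsewhere first and move it into the spool,
    # so Asterisk never picks up a half-written file.
    fd, tmp_path = tempfile.mkstemp(suffix=".call")
    with os.fdopen(fd, "w") as handle:
        handle.write(build_call_file(channel, "custom/" + sound_name))
    shutil.move(tmp_path, os.path.join(SPOOL_DIR, os.path.basename(tmp_path)))
```

A Nagios notification command could then run something like place_call("DNS is critical", "SIP/example-trunk/5550100"), where the channel is a made-up example. The WAV may also need converting to a sample rate Asterisk will play.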
Someday, I am going to change the beeps played through the PC speaker to
start with a couple of notes above the threshold of hearing for humans. And
then I'm going to train my guide dog to do something when he hears those
beeps. So if an alarm happens to go off when someone is in my office, the
dog will do his thing and it will look like he knew something was wrong.
That ought to make a few people scratch their heads.
Here we are using SolarWinds' Orion Network Performance Monitor (NPM). We get email notifications when a server goes down, when CPU usage goes beyond a predetermined threshold, when available disk space drops below a certain amount, etc. The "server down" alert can be misleading, because if there is a WAN connectivity issue it reports the server as down. The monitoring screens are web based. Using IE works very nicely with text-to-speech software. I don't know how they look visually though. Here is a link to their site.
http://www.solarwinds.com/network-performance-monitor.aspx?gclid=CMrlro282LICFYpFMgodZiMAig&CMP=KNC-TAD-GGL-BRAND-O-DL-US&ef_id=UGWyKAAARVQkojaB:20120928142024:s
Vic Pereira
Infrastructure Operations | Operations des TI
Shared Services Canada | Industry Canada | Services partagés Canada | Industrie Canada
400 St Mary Avenue, Winnipeg MB R3C 4K5 | 400, avenue St Mary, Winnipeg MB R3C 4K5
Vic.Pereira@ic.gc.ca
Telephone | Téléphone 204-983-0653
Facsimile | Télécopieur 204-984-4205
Government of Canada | Gouvernement du Canada
participants (4)
- Darragh OHeiligh
- John G. Heim
- Ryan Shugart
- Vic.Pereira@ic.gc.ca