out of hours notifications?
Hi:

I was wondering what procedures you all use for out-of-hours critical monitoring alerts. For example, say a server you're responsible for goes down at 2:30 Sunday morning: how do you get notified?

My company currently has a rotating company phone that goes around the department and gets text messages for every down system. The person on call is then responsible for finding the responsible person and waking them up to inform them of the problem. The system we're moving to will send pages directly to our own phones for the systems we're responsible for. I personally think that will be better, because right now being on call is far from a fun experience, and it's rare that you get a good night's sleep. In other words, our environment is getting too big for one person to manage after-hours alerts, and with pages going to each of us, the feeling is that we can handle them better since we'll know our own systems.

Thoughts? Does anyone use a different approach?

Thanks.
Ryan
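The direct-paging scheme described above amounts to a lookup from each system to its owner, with the shared on-call phone as a fallback. A minimal sketch, assuming a hand-maintained ownership table; the system names, owner names, and the `route_alerts` helper are all hypothetical and not taken from any monitoring product:

```python
from collections import defaultdict

# Hypothetical ownership table: system name -> person responsible for it.
OWNERS = {
    "esx-host-01": "ryan",
    "core-router-01": "network-team",
}

# Hypothetical fallback recipient when a system has no listed owner.
ON_CALL_FALLBACK = "on-call-phone"


def route_alerts(down_systems, owners=OWNERS, fallback=ON_CALL_FALLBACK):
    """Group down systems by the person who should be paged for them."""
    pages = defaultdict(list)
    for system in down_systems:
        pages[owners.get(system, fallback)].append(system)
    return dict(pages)
```

With this layout, only systems that nobody has claimed still land on the shared phone, so the rotating on-call person stops being the relay for every alert.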
I use the text message alert system; however, I would love a better way. Let me know what you come up with. Sorry I can't be of more help.

What system are you using to send the alerts? Are you looking for suggestions on alternative delivery methods, or on how to tell whether alerts are genuine?

For example, I have a mailbox set up just to receive all WhatsUp Gold (WUG) alerts. If anything is down even for a second, I get an alert. However, that gives a lot of false positives. Most alerts are more likely to be genuine if the threshold is set to 5 minutes or something like that. WhatsUp Gold also has dependencies: for example, if a switch goes down and at the same time all the nodes associated with that switch are unresponsive, WUG sends the alert for the switch only.

Feel free to ask if you have any questions.

Regards
Darragh Ó Héiligh
Fujitsu
Offices of the Houses of the Oireachtas, Fredrick Building, South Fredrick Street, Dublin 2
Telephone: +353 (1) 618 3559
Email: darragh.oheiligh@oireachtas.ie
Internet: http://www.oireachtas.ie

From: Ryan Shugart <rshugart@pcisys.net>
To: Blind sysadmins list <blind-sysadmins@lists.hodgsonfamily.org>
Date: 28/09/2012 14:54
Subject: [Blind-sysadmins] out of hours notifications?
Sent by: "Blind-sysadmins" <blind-sysadmins-bounces@lists.hodgsonfamily.org>

_______________________________________________
Blind-sysadmins mailing list
Blind-sysadmins@lists.hodgsonfamily.org
http://lists.hodgsonfamily.org/listinfo/blind-sysadmins

Oireachtas email policy and disclaimer.
http://www.oireachtas.ie/parliament/about/oireachtasemailpolicyanddisclaimer...
Oireachtas email policy and disclaimer (Irish-language version).
http://www.oireachtas.ie/parliament/ga/eolas/beartasriomhphoistanoireachtais...
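The two false-positive filters described above (a hold-down threshold, and alerting on the switch rather than on every node behind it) can be sketched roughly as follows. This is not how WhatsUp Gold implements them; the node names, the `PARENTS` map, and the `alerts_to_send` function are assumptions made for illustration:

```python
HOLD_DOWN_SECONDS = 5 * 60  # the "5 minutes or something like that" threshold

# Hypothetical dependency map: node -> the device it is reached through.
PARENTS = {
    "server-a": "switch-1",
    "server-b": "switch-1",
}


def alerts_to_send(down_since, now, parents=PARENTS, hold_down=HOLD_DOWN_SECONDS):
    """Return the nodes worth alerting on, in sorted order.

    down_since maps each currently-down node to the time (in seconds)
    at which it was first seen down; now is the current time in seconds.
    """
    alerts = []
    for node, since in down_since.items():
        if now - since < hold_down:
            continue  # not down long enough yet; may be a momentary blip
        if parents.get(node) in down_since:
            continue  # its parent is down too, so alert on the parent only
        alerts.append(node)
    return sorted(alerts)
```

For example, with `switch-1`, `server-a`, and `server-b` all down for more than five minutes, only `switch-1` is returned. The sketch checks the immediate parent only; a real dependency tree would need to walk further up.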
Hi Darragh:

A little of both, actually. We're using a product called EG Monitoring, which I'd never heard of before I started working here. I can't configure any of my own alerts due to accessibility issues, but that's another story.

I guess I'm worried on two angles. The first is the quality-of-life issue: does it affect you at all that a pager could go off at any time of the night, do you sleep less, etc.? I know that varies a lot from person to person, but that's what's worrying me.

The false positive thing also bothers me. We have dependencies set up in EG, but they don't work. We have several sites around the world, and most of those sites have at least one ESX host. I'm responsible when an ESX host is down, but often when we get an alert for an ESX host, it's actually something else, like a router, and EG just hasn't figured out the dependency yet. Also, we don't monitor our SANs at all, which I've complained about, but I'm told there's no way to do it. As a result, often the only way we find out that a volume is down is when I get notified that the ESX servers can't access a datastore, and then I have to go troubleshooting.

So I'm a little concerned about a lot of false positives coming in at 2:30 in the morning, but also about the more general question: do you do anything to make sure you're actually woken up when the 2:30 alert comes in?

Thanks a lot.
Ryan

-----Original Message-----
From: Blind-sysadmins [mailto:blind-sysadmins-bounces@lists.hodgsonfamily.org] On Behalf Of Darragh OHeiligh
Sent: Friday, September 28, 2012 8:00 AM
To: Blind sysadmins list
Cc: Blind-sysadmins
Subject: Re: [Blind-sysadmins] out of hours notifications?
We use Nagios and an Asterisk system to do critical notifications. Nagios is a Linux package, but it can monitor Windows systems. You can write a plug-in to do anything you can dream up. The action to take for each notification is also scriptable, so our Nagios system sits in my office and I have it play different tunes via the PC speaker depending on what happened.

Nagios also has escalation capabilities, so I have it set up to call my cell phone if some critical system is down for a long time. It uses Festival to create a WAV file with a short description of the problem and then calls my cell phone with the message. So at 3:00 AM, I might get a call with an electronic voice saying, "DNS is critical".

Someday, I am going to change the beeps played through the PC speaker to start with a couple of notes above the threshold of human hearing. And then I'm going to train my guide dog to do something when he hears those beeps. So if an alarm happens to go off when someone is in my office, the dog will do his thing and it will look like he knew something was wrong. That ought to make a few people scratch their heads.

-----Original Message-----
From: Blind-sysadmins [mailto:blind-sysadmins-bounces@lists.hodgsonfamily.org] On Behalf Of Darragh OHeiligh
Sent: Friday, September 28, 2012 9:00 AM
To: Blind sysadmins list
Cc: Blind-sysadmins
Subject: Re: [Blind-sysadmins] out of hours notifications?
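The Festival-plus-Asterisk phone call described above could be wired together along these lines. This is only a sketch of one plausible arrangement, not the actual setup: it assumes Festival's `text2wave` utility is installed and that Asterisk picks up call files dropped into its outgoing spool directory (commonly `/var/spool/asterisk/outgoing`). The function names, the channel string, and the retry values are hypothetical.

```python
import os


def spoken_message(service, state):
    """Short sentence for the synthetic voice, e.g. 'DNS is critical'."""
    return f"{service} is {state.lower()}"


def text2wave_command(text_file, wav_file):
    """Command line for Festival's text2wave utility (built, not executed)."""
    return ["text2wave", str(text_file), "-o", str(wav_file)]


def call_file(channel, wav_file):
    """Contents of an Asterisk call file that plays wav_file to channel."""
    # Playback expects the sound file name without its extension.
    sound = os.path.splitext(str(wav_file))[0]
    return (
        f"Channel: {channel}\n"
        "MaxRetries: 2\n"   # hypothetical retry policy
        "RetryTime: 60\n"
        "WaitTime: 30\n"
        "Application: Playback\n"
        f"Data: {sound}\n"
    )
```

A notification script would write `spoken_message(...)` to a text file, run the `text2wave` command, write the call file to a temporary location, and then move it into the spool directory so Asterisk never sees a half-written file. The audio may also need converting to a format and sample rate that Asterisk can play.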
Here we are using SolarWinds' Orion Network Performance Monitor (NPM). We get email notifications when a server goes down, when CPU usage goes beyond a predetermined threshold, when available disk space drops below a certain amount, etc. The "server down" alert can be misleading, because if there is a WAN connectivity issue, NPM reports the server as down.

The monitoring screens are web based. Using IE works very nicely with text-to-speech software. I don't know how they look visually, though.

Here is a link to their site.
http://www.solarwinds.com/network-performance-monitor.aspx?gclid=CMrlro282LICFYpFMgodZiMAig&CMP=KNC-TAD-GGL-BRAND-O-DL-US&ef_id=UGWyKAAARVQkojaB:20120928142024:s

Vic Pereira
Infrastructure Operations | Operations des TI
Shared Services Canada | Industry Canada | Services partagés Canada | Industrie Canada
400 St Mary Avenue, Winnipeg MB R3C 4K5 | 400, avenue St Mary, Winnipeg MB R3C 4K5
Vic.Pereira@ic.gc.ca
Telephone | Téléphone 204-983-0653
Facsimile | Télécopieur 204-984-4205
Government of Canada | Gouvernement du Canada
Participants (4):
- Darragh OHeiligh
- John G. Heim
- Ryan Shugart
- Vic.Pereira@ic.gc.ca