April 2016 ~ Creating a Computer Incident Response Team
What happens in your company when a critical computer system goes down? What happens when your customer-facing website goes down unexpectedly? What happens when your network crashes?
With technology woven through so much of our everyday business activity, any of these situations can cripple a business for minutes, hours, days or weeks if not handled properly.
If you don’t have a good plan in place for handling outages, this article may spark some discussion about how to handle them better.
One of the organizations I had recently joined found itself in deep trouble when a software patch was applied and the updated software would not work at all. The problem was discovered right at quitting time on a Monday. The person who first learned about it went home without reporting it to management or drawing anyone's attention to it.
The next day, internal customers were livid when none of their office programs worked. It took several days before the problem was identified and eventually fixed. It was later learned that the patch caused the software to contact a server that was no longer in service.
That incident and the lack of any organized plan for handling computer system outages became a rallying cry for a better business process to avoid such severe customer pain. Management agreed that we needed to do a better job and supported figuring out how to do that.
I will use the process we developed as an example. Readers can adapt or adjust as appropriate for their company or organization.
A team was created to come up with a better business model for handling computer outages. In our organization, we gathered a representative from each area of the information services department:
Over the course of a couple months, we put together a plan that included:
While the organizing team was doing its work, we had a few unexpected outages that allowed us to practice what we were working on and fine-tune our process. Over a few years, it became a well-run process that our internal and external customers came to appreciate.
We developed a standing interdisciplinary Incident Response Team (IRT) to lead problem identification, communication, documentation and resolution activities once an "Incident" was determined to have occurred. The team had pre-defined levels of authority appropriate for incident resolution. We assigned primary members and backup members for each role. The IRT designation was not the person's only job; their function on the IRT was directly related to the job they already had.
Whenever a potential issue was identified, the IRT Team met quickly to determine whether it was urgent enough to activate the IRT plan. If it was, we assigned an impact level (1 – low, 2 – medium, 3 – high) and quickly assigned the core roles: Incident Command Leader, Communication Leader, Documentation Leader and Help Desk Coordinator. We also assigned other staff and duties as appropriate for the incident.
If the incident lasted more than 2 hours or was designated Level 3 (High Impact), the manager responsible for the affected area was assigned as the Management Representative. For example, if it was a network issue, the Network Manager became the Management Representative for the IRT Team; if it was a virus, the Security Manager; if it was a software application issue, the Software Applications Manager; and so on.
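For teams that want a paging or ticketing tool to apply the same rules, the impact levels, core roles and escalation trigger described above fit in a few lines of code. The Python below is a minimal sketch under stated assumptions, an illustration rather than part of the process described in this article: the names `Level`, `Incident`, `management_representative` and the `MANAGER_BY_AREA` table are hypothetical, and the table lists only the three examples given above.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    """Impact levels assigned when the IRT activates its plan."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Core roles assigned for every declared incident.
CORE_ROLES = (
    "Incident Command Leader",
    "Communication Leader",
    "Documentation Leader",
    "Help Desk Coordinator",
)

# Hypothetical mapping: outage area -> manager who becomes Management Representative.
MANAGER_BY_AREA = {
    "network": "Network Manager",
    "virus": "Security Manager",
    "software": "Software Applications Manager",
}

# Escalate once an incident has lasted longer than this.
ESCALATION_AFTER = timedelta(hours=2)


@dataclass
class Incident:
    area: str           # key into MANAGER_BY_AREA (hypothetical)
    level: Level
    elapsed: timedelta  # time since the incident was declared


def management_representative(incident: Incident) -> Optional[str]:
    """Return the manager to assign, or None if escalation is not yet required.

    Escalation applies to Level 3 incidents or incidents lasting more than
    2 hours. An area missing from MANAGER_BY_AREA raises KeyError.
    """
    if incident.level == Level.HIGH or incident.elapsed > ESCALATION_AFTER:
        return MANAGER_BY_AREA[incident.area]
    return None


if __name__ == "__main__":
    # Level 2 network outage running for 3 hours -> escalated.
    print(management_representative(Incident("network", Level.MEDIUM, timedelta(hours=3))))
    # Level 1 software issue at 30 minutes -> no representative yet (None).
    print(management_representative(Incident("software", Level.LOW, timedelta(minutes=30))))
```

Letting an unmapped area raise `KeyError` is a deliberate choice in this sketch: it forces someone to decide who owns a new kind of outage rather than silently assigning a default.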
We developed a 1-page Checklist covering Who, What, Where, Why, When and How, which helped us quickly gather the appropriate resources to deal with an issue. (See example of Checklist.)
The Communication Leader would then start the notification process to management and our internal customers, giving them details about the incident: what happened, what we were doing about it, and when they could expect a resolution and/or the next update. The Communication Leader also issued the "Incident Resolved" notice when the problem was fixed.
We learned very early that when a technical problem happens, the technical people who can fix it should focus all their time and energy on fixing the problem, not dealing with customer communications and a barrage of questions. People who can communicate with customers should be assigned the communication role so the technical people can concentrate on what they do best.
We developed communications templates that could be updated quickly with the current incident’s information. We developed a process for notifying internal customers if the network or phone system went down (for example, walking the floors of offices, using the phone system instead of email, or using walkie-talkies or cell phones). Our process also specified when to post notices about outages on our internal intranet, as well as on external customer-facing websites and social media.
We developed a step-by-step plan for each role at each severity level: who did what, when and how. When an "Incident" was called, it became the IRT Team’s highest priority, and everyone went to work to get the issue resolved as quickly as possible.
Each business group that had a representative on the IRT Team also developed a written process for what they would do when there was an Incident in their focus area. Each IRT Team member and backup person had a 3-ring notebook with the overall IRT plan and their focus-area plan. That gave them something they could grab quickly to check off what needed to be done. If, for any reason, someone else needed to step in for a team member, everyone knew where to look to take over easily. We even developed telephone scripts that could be used to broadcast a voicemail message to the appropriate people in addition to email notifications.
After an Incident was designated, the IRT Team members met as needed for status updates (usually about every 2 hours) so that we could keep impacted stakeholders regularly updated on progress. Some incidents were resolved quickly; others went on for several hours or days before they could be resolved.
After an Incident was resolved and customers notified of "all clear," we held a Post Incident Review meeting within a week to review what happened, what went well, what didn’t go well and what improvements we could make, either in the IRT process or in any other area affected. Over time, those review meetings provided a large number of process improvements that prevented other failures.
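The meeting rhythm described above (status updates roughly every 2 hours, a Post Incident Review within a week of resolution) is easy to turn into calendar reminders. The Python below is a minimal sketch that assumes a fixed 2-hour interval and a 7-day review window; the function names are hypothetical, and since the team actually met "as needed," the schedule is best treated as a default rather than a rule.

```python
from datetime import datetime, timedelta
from typing import List

# Assumed defaults: status meetings every 2 hours, review within 7 days.
UPDATE_INTERVAL = timedelta(hours=2)
REVIEW_WINDOW = timedelta(days=7)


def status_update_times(declared: datetime, resolved: datetime,
                        interval: timedelta = UPDATE_INTERVAL) -> List[datetime]:
    """Return planned status-meeting times strictly between declaration and resolution."""
    times = []
    t = declared + interval
    while t < resolved:
        times.append(t)
        t += interval
    return times


def review_deadline(resolved: datetime, window: timedelta = REVIEW_WINDOW) -> datetime:
    """Latest date for the Post Incident Review meeting."""
    return resolved + window


if __name__ == "__main__":
    start = datetime(2016, 4, 5, 9, 0)
    end = datetime(2016, 4, 5, 15, 30)
    print([t.strftime("%H:%M") for t in status_update_times(start, end)])  # ['11:00', '13:00', '15:00']
    print(review_deadline(end))  # 2016-04-12 15:30:00
```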
The IRT team members were committed to the process, given the authority to take action quickly and became very much respected by management and customers. Everyone in the organization learned that if a computer problem was noticed by anyone at all, there was a clear path to let the right people know about it so it could get resolved.
One of the side benefits of our Incident Response Team process was that it raised awareness among the IT staff of the customer impact of a network change that went wrong, a new application that failed, a technology change that caused unexpected customer disruption, or any other type of technology outage. It also provided a clearly defined business process, with multiple people aware of and responsible for handling any possible service disruption.
Computer failures happen all the time, for a wide variety of reasons. While we often can’t predict sudden outages, as IT professionals we can do a much better job of helping customers know what to expect, and we can do our very best to fix a problem as quickly as possible. We certainly owe our customers the simple courtesy of letting them know when the technology they use every day isn’t working correctly.
Customers, internal and external, expect to be treated with respect. Having an effective computer Incident Response Plan in place is simply good business. Not having one can lead to serious customer dissatisfaction and wasted time trying to identify and fix problems, and it can cost a business a great deal of money and cause unnecessary public relations headaches.
Page updated: April 02, 2016
| Barbara Taylor | Books |