Major incidents require the focus and effort of many individuals within your IT operation. Detailed here are the roles involved and an overview of their remit when a major incident occurs. Every operation is different, so treat this as a framework rather than something to follow verbatim.
The Service Desk
The Service Desk is the main point of contact for affected end users during service outages or degradation. Contact with the Service Desk is in the form of requests and reporting of incidents.
The Service Desk is usually the first team to be made aware of a potential or actual IT major incident. During major incidents it should provide updates to end users by way of announcements (for example, recorded messages on the phone systems). It should also update the end user portals/ intranets with the latest information, log any related incidents, and advise users of expected resolution times and any workarounds that have been implemented.
Technical Resolution Groups
Technical Resolution Groups provide the essential technical skills, knowledge and resources to implement a workaround or resolve the major incident. They are responsible for diagnosis, technical fixes and workaround implementations. Major incidents are often complex, so it is common to have several Technical Resolution Groups working together to identify the cause and the resolution.
Technical Lead Manager
Technical Lead Managers (TLM) are senior technical staff who are appointed by the Major Incident Manager to help centralise and control the technical diagnosis, fixes and workaround efforts when multiple Technical Resolution Groups are involved.
Some technical issues may be extremely complex, and the Major Incident Manager will rely on the Technical Lead Manager to help lead the technical staff and to translate technically complex information into plain language for communications.
Service Continuity Manager
The Service Continuity Manager owns the service continuity process, which is invoked in disaster recovery scenarios when the service cannot be recovered through Major Incident Management. Example scenarios include a data centre flooding, so that an alternative location has to be activated, or a physical threat to a key building, such as a bomb threat to an operation centre.
Service Manager/ Director
In IT Managed Service Provider (MSP) organisations the Service Manager and/ or Director will often hold the key relationship with the customer organisation or a specific location/ region. In the end user/ customer organisation the Service Manager and/ or Director is usually accountable for managing the outsourced service providers. Their role includes:
- Responsibility for the overall development, continual improvement and operational delivery of IT services
- Representation of the IT MSP or in-house IT department to the end-users
- Supporting the Major Incident Manager with critical information that may be local to their account and providing access to sites, contacts, and information
Third Parties
Third parties are any external suppliers that provide services or products that contribute to your organisation’s overall IT service provision.
Third parties may have specialist knowledge that is required to resolve a major incident and may be required to support their own services or hardware.
Director/ Head of IT/ Head of Service
The Director/ Head of IT/ Head of Service is ultimately responsible for the entire Major Incident Management service provision; its components, people and resources.
They own all the contractual relationships. In larger organisations, or where a Major Incident Management service provision is mature, their role is simply that of a stakeholder – to receive updates and be kept aware of the situation. When the impact of a major incident is extreme, they may, once the incident has been resolved, be required to communicate the outage and its consequences to affected parties and discuss how the organisation intends to prevent a recurrence.
In smaller and less evolved organisations they may be heavily involved in supporting the Major Incident Manager during an incident. They may even act as the Major Incident Manager.
Change Manager
Change Managers and the change management process exist to ensure that standardised methods and procedures are used for all changes to the IT infrastructure. This reduces potential and realised impacts on IT services and provides control and detailed records when changes are made.
When major incidents are in progress it is not unusual for the IT infrastructure to undergo a change when a fix or workaround is implemented. The Change Manager should be involved in this, as should any authorising senior management. Often a retrospective change request can be raised in order to avoid the delays that the normal change request process might cause. Alternatively, an emergency change may be proposed and approved by the emergency change board.
Problem Manager
The role of the Problem Manager is to identify problems (i.e. the cause of multiple incidents), determine appropriate actions and facilitate permanent fixes in order to prevent incidents from recurring.
Because of the serious impact that major incidents can have, Problem Managers are often involved after a major incident has been resolved. Whilst a Problem Manager’s role is extensive, in this overview article we are only concerned with their role and responsibilities in relation to major incidents.
Major Incident Manager
The Major Incident Manager is responsible for the end-to-end management of all IT major incidents. Their role and responsibilities are extremely varied and include (amongst others):
- Leveraging technology to issue all communications and providing key stakeholder management
- Leading, driving, facilitating and chairing all investigation activities, meetings, and conference calls
- Forming collaborative action plans with specific actions, roles and deadlines, and ensuring these are completed
- Matrix management of people, processes and resources, including third parties, and resolving conflict in order to move towards resolution
- Being accountable for resolving the outage via workaround or permanent fix
- Ensuring all administration and reports are maintained and up-to-date, including contact information, technical diagrams and post-major-incident reviews
- Supporting and nurturing process improvements and knowledge base improvements
- Continually maintaining and developing tools and resources to manage major incidents effectively
- Providing periodic major incident metrics reports