ITIL Problem Management
Definition of Problem Management
Problem Management is an IT service management process tasked with managing the life cycle of underlying "Problems." Success is achieved by quickly detecting and providing solutions or workarounds to Problems in order to minimize impact on the organization and prevent recurrence. Problem Management also attempts to find the error in the IT infrastructure that is causing the Problem and contributing to the Incidents that users may have. The IT Infrastructure Library (ITIL) provides the following definitions for usage within this process:
- Problem: “The cause of one or more Incidents. The cause is not usually known at the time a Problem record is created.”
- Error: “A design flaw or malfunction that causes a failure of one or more IT services or other configuration items”
- Known Error: “A Problem that has a documented root cause and workaround”
- Root Cause: “The underlying or original cause of an incident or problem.”
Phases of Problem Management
Problem management involves three distinct phases:
- Problem Identification: Problem identification activities identify and log problems by:
- Performing trend analysis of incident records.
- Detecting duplicate and recurring issues.
- Identifying, during major incident management, a risk that an incident could recur.
- Analyzing information received from suppliers and partners.
- Analyzing information received from internal software developers, test teams, and project teams.
- Problem Control: Problem control activities include problem analysis and documenting workarounds and known errors. Just like incidents, problems are prioritized based on the risk they pose in terms of probability and impact to services. Focus should be given to the problems that pose the highest risk to services and service management. When analysing incidents, it is important to remember that they may have interrelated causes with complex relationships. Problem analysis should therefore take a holistic approach, considering all contributory causes: those that caused the incident to happen, those that made it worse, and those that prolonged it. When a problem cannot be resolved quickly, it is often useful to find and document a workaround for future incidents, based on an understanding of the problem. A workaround is a solution that reduces or eliminates the impact or probability of an incident or problem for which a full resolution is not yet available. Examples include restarting services in an application or failing over to secondary equipment. Workarounds are documented in problem records, and this can be done at any stage without waiting for analysis to be complete. However, a workaround documented early in problem control should be reviewed and improved after problem analysis has been completed. An effective incident workaround can become a permanent way of dealing with some problems where resolution is not viable or cost-effective. In that case, the problem remains in the known error status, and the documented workaround is applied when related incidents occur. Every documented workaround should include a clear definition of the symptoms and context to which it applies. Workarounds may be automated for greater efficiency and faster application.
- Error Control: Error control activities manage known errors and may enable the identification of potential permanent solutions. Where a permanent solution requires change control, it has to be analysed from the perspective of cost, risk, and benefits. Error control also regularly re-assesses the status of known errors that have not been resolved, taking into account the overall impact on customers and/or service availability, the cost of permanent resolutions, and the effectiveness of workarounds. The effectiveness of a workaround should be evaluated each time it is used, as the workaround may be improved based on the assessment.
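The risk-based prioritization described under Problem Control can be sketched in a few lines of Python. This is only an illustration: the 1 to 5 scales, the rule that risk equals probability multiplied by impact, and the `Problem` and `prioritize` names are assumptions made for the example, not part of ITIL.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    title: str
    probability: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int       # assumed scale: 1 (minor) to 5 (severe)

    @property
    def risk(self) -> int:
        # Assumed scoring rule: risk = probability x impact.
        return self.probability * self.impact


def prioritize(problems: list[Problem]) -> list[Problem]:
    """Return problems ordered from highest to lowest risk."""
    return sorted(problems, key=lambda p: p.risk, reverse=True)


# Hypothetical backlog used only for illustration.
backlog = [
    Problem("Intermittent VPN drops", probability=4, impact=2),
    Problem("Database failover fails", probability=2, impact=5),
    Problem("Slow report generation", probability=3, impact=1),
]
ranked = prioritize(backlog)
for problem in ranked:
    print(problem.risk, problem.title)
```

An organization would normally replace the multiplication with its own risk matrix; the point is only that problems are ordered by risk rather than by the order in which they were logged.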
Scope and Value of Problem Management
Problem Management includes the activities required to diagnose the root cause of incidents identified through the Incident Management process, and to determine the resolution to those problems. It is also responsible for ensuring that the resolution is implemented through the appropriate control procedures, especially Change Management and Release Management.
Problem Management will also maintain information about problems and the appropriate workarounds and resolutions, so that the organization is able to reduce the number and impact of incidents over time. In this respect, Problem Management has a strong interface with Knowledge Management, and tools such as the Known Error Database will be used for both. Although Incident Management and Problem Management are separate processes, they are closely related and will typically use the same tools, and may use similar categorization, impact and priority coding systems. This will ensure effective communication when dealing with related incidents and problems.
Value to Business
Problem Management works together with Incident Management and Change Management to ensure that IT service availability and quality are increased. When incidents are resolved, information about the resolution is recorded. Over time, this information is used to speed up the resolution time and identify permanent solutions, reducing the number and resolution time of incidents. This results in less downtime and less disruption to business critical systems.
The Importance of Problem Management
Successful problem management results in less downtime and fewer disruptions to the business. It also improves service availability and quality. Problem management helps companies reduce both the time they spend resolving problems and the number of problems that occur. This leads to increased productivity and reduced costs, and ultimately to improved customer satisfaction. Technology changes faster with each passing quarter, and problem management is one way to mitigate the chaos often associated with these changes. Problem management keeps services running and increases quality.
Importance of Reactive and Proactive Problem Management
Problem management is a longer-term approach that aims to speed up the resolution of incidents or, preferably, to eliminate them altogether. It is the follow-up activity that identifies root causes and long-term solutions, as opposed to incident management’s “fire-fighting”.
An effective problem management process maximizes system availability, improves service levels, reduces costs, and improves customer convenience and satisfaction.
While problem management is easy to understand, implementing it within your own organization is extremely challenging. It often happens that problem management doesn’t produce any of the desired outputs upon implementation. To prevent that, you must recognize the importance of both its reactive and proactive parts.
Reactive problem management reacts to incidents that have already occurred and focuses effort on eliminating their root cause and preventing them from showing up again. The focus here is to increase long-term service stability and, consequently, customer satisfaction.
Proactive problem management is a continuous process that doesn’t wait for an incident to happen before reacting. This type of problem management is always active and always on guard. It can be extremely challenging, especially in an environment with many services, different technologies, and a lot going on at once. With proactive problem management, the focus is on continuous data analysis, and to do that, you need a large volume of quality data.
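As a rough illustration of the data analysis behind proactive problem management, the Python sketch below counts incidents per category and flags categories that recur often enough to justify opening a problem record. The incident records, the category names, and the threshold of three are hypothetical values chosen for the example.

```python
from collections import Counter

# Hypothetical incident records: (incident id, category) pairs.
incidents = [
    ("INC-1", "email"),
    ("INC-2", "vpn"),
    ("INC-3", "email"),
    ("INC-4", "email"),
    ("INC-5", "printing"),
    ("INC-6", "vpn"),
]


def recurring_categories(records, threshold=3):
    """Return (category, count) pairs seen at least `threshold` times,
    most frequent first."""
    counts = Counter(category for _, category in records)
    return [(cat, n) for cat, n in counts.most_common() if n >= threshold]


candidates = recurring_categories(incidents)
print(candidates)
```

In practice the records would come from the incident management tool, and the grouping key might combine category, affected service, and configuration item.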
Techniques for Effective Problem Management
Several techniques are available for performing problem management effectively. Four popular techniques that are relatively easy to implement are discussed below.
- Brainstorming: Bring all key stakeholders involved in a problem together in one place and discuss possible causes. This method is ideal for highly creative teams and helps break down silos.
- Involves round robin discussion among participants
- A high volume of ideas in a shorter time
- Faster and enables diverse idea generation
- Encourages full participation as every person contributes to problem analysis
- Discuss and decide the brainstorming question
- Let every person share his/her idea
- Review the list of ideas to clarify and remove any duplicates
- Prepare an action plan to communicate to stakeholders
- Ishikawa / Fishbone / Cause and Effect Analysis: Cause-and-effect analysis describes the relationships between a problem and its possible causes. For example, network downtime is a problem that might have possible causes such as router malfunction, network error, or disaster. The method analyses the various causes and defines the relationships between them. It is also known as the Ishikawa or fishbone diagram, and it covers both primary and secondary causes of a problem. Causes, in turn, fall into categories such as people, product, process, and partners. This method is used for reactive problem management, so it is essential to define the problem statement precisely.
- Get a thorough picture of all possible causes for an effect/situation
- Ideal for complex problems
- Suited to problems that have many possible causes and contributing factors
- After the analysis, discuss action items to improve the process
- Define problem statement
- Add cause categories as fish bones
- Use traditional brainstorming techniques to fill in possible reasons for the “ribs.”
- Classify and prioritize primary and secondary causes as trunks
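The fishbone structure can be captured as a simple mapping from cause categories to candidate causes. The Python sketch below uses the four categories named above and the network downtime example; the individual causes and their assignment to categories are illustrative assumptions, and `render` is a hypothetical helper.

```python
problem = "Network downtime"

# Categories come from the text; the causes listed under each are illustrative.
fishbone = {
    "People": ["Misconfigured firewall rule"],
    "Product": ["Router malfunction"],
    "Process": ["No tested failover procedure"],
    "Partners": ["ISP outage"],
}


def render(problem: str, bones: dict[str, list[str]]) -> str:
    """Render the problem statement and its causes as an indented outline."""
    lines = [f"Problem: {problem}"]
    for category, causes in bones.items():
        lines.append(f"  {category}")
        lines.extend(f"    - {cause}" for cause in causes)
    return "\n".join(lines)


outline = render(problem, fishbone)
print(outline)
```

Keeping the diagram as data makes it easy to add secondary causes during the brainstorming step and to reorder them once they have been prioritized.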
- Kepner Tregoe Problem Analysis: A logical approach to problem-solving that starts with defining and then describing the problem. Possible causes are established and tested, and finally the exact cause is verified. It is a systematic, four-phase Root Cause Analysis (RCA) method for complex problems. Kepner Tregoe (KT) focuses on finding the root cause before getting into solutions. It is a group problem-solving technique that identifies the actual root cause with the help of evidence, and it aims for speed and precision. The KT framework includes “is” and “is not” analysis. KT is applicable to both proactive and reactive problem management, as it involves problem analysis as well as potential problem analysis.
- What’s going on – Situation Appraisal
- Why did this occur – Problem analysis
- Actual cause for the problem and alternatives – Decision analysis
- What is the plan of action and risk associated – Potential problem analysis
- 5 Whys: The 5 Whys strategy is a simple and effective way to understand the root cause of a problem by asking successive “why” questions. It is one of the Six Sigma techniques for identifying the actual root cause of a problem and taking appropriate countermeasures to prevent it from recurring. It defines the relationships between different root causes. However, it is important to frame the questions properly in order to find the actual root cause. Asking “why” five times is just a rule of thumb; the number varies with the complexity of the problem.
- Gather a group of people who are familiar with the problem
- Ask “why” questions – ‘n’ times depending on the complexity and type of answers
- Define action items to address the issue and prevent it in future
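The 5 Whys chain can be sketched as a small Python helper that pairs each “why” question with its answer, where every answer becomes the subject of the next question. The function name `five_whys` and the sample problem and answers are hypothetical; the number of answers is deliberately not fixed at five.

```python
def five_whys(problem: str, answers: list[str]) -> list[tuple[str, str]]:
    """Pair each "why" question with its answer.

    Each answer becomes the subject of the next question, so the last
    answer in the chain is the candidate root cause.
    """
    chain = []
    current = problem
    for answer in answers:
        chain.append((f"Why: {current}?", answer))
        current = answer
    return chain


# Hypothetical example with three rounds of "why".
answers = [
    "The web server ran out of disk space",
    "Log files were never rotated",
    "Log rotation was not part of the server build checklist",
]
chain = five_whys("The website went down", answers)
root_cause = chain[-1][1]

for question, answer in chain:
    print(question, "->", answer)
```

The last answer is only a candidate root cause; the group still has to agree on it and define the action items that prevent recurrence.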
The Problem Management Process
The activities involved in Problem Management are described below:
- Problem detection: Problems can be detected in the following ways:
- Analysis of incidents by a technical support group
- Automated detection of an infrastructure or application fault, using alert tools that automatically raise an incident, which may reveal the need for problem management
- A notification from a supplier that a problem exists that has to be resolved
- Problem logging: Problems should be fully logged, and the record should contain the following details:
- User details
- Service details
- Equipment details
- Priority and categorization details
- Date/time initially logged
- Problem categorization: To trace the true nature of a Problem, Problems must be categorized in the same way as Incidents.
- Problem prioritization: Problems must be prioritized in the same way as Incidents to identify how serious each Problem is from an infrastructure perspective.
- Workarounds: A workaround is a temporary way to overcome the difficulties. Details of the workaround should always be documented within the Problem record.
- Raising a Known Error Record: A Known Error must be raised and placed in the Known Error Database for future reference.
- Problem resolution: Once a resolution is found, it must be applied and documented with the problem details.
- Problem closure: At the time of closure, a check should be performed to ensure that the record contains a full historical description of all events.
- Major Problem Review: The following should be reviewed:
- Those things that were done correctly
- Those things that were done wrong
- What could be done better in future
- How to prevent recurrence
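The logging fields and the sequence of activities above could be modelled as a problem record with a simple stage history. The sketch below is a minimal illustration in Python: the `ProblemRecord` class, the stage names, and the strictly linear workflow are assumptions made for the example, and real tools usually allow stages to be skipped or revisited.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Assumed linear workflow derived from the activities listed above.
STAGES = ["logged", "categorized", "prioritized",
          "known_error", "resolved", "closed"]


@dataclass
class ProblemRecord:
    user: str
    service: str
    equipment: str
    category: str
    priority: int
    logged_at: datetime
    workaround: str = ""
    stage: str = "logged"
    history: list[str] = field(default_factory=lambda: ["logged"])

    def advance(self) -> str:
        """Move to the next stage and record it for the closure check."""
        index = STAGES.index(self.stage)
        if index == len(STAGES) - 1:
            raise ValueError("record is already closed")
        self.stage = STAGES[index + 1]
        self.history.append(self.stage)
        return self.stage


# Hypothetical record used only for illustration.
record = ProblemRecord(
    user="service desk analyst",
    service="email",
    equipment="mail gateway",
    category="messaging",
    priority=2,
    logged_at=datetime(2024, 1, 15, 9, 30),
    workaround="Restart the mail relay service",
)
while record.stage != "closed":
    record.advance()
print(record.history)
```

The `history` list stands in for the full historical description that the closure check expects to find on the record.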
These are the ITIL Problem Management sub-processes and their process objectives:
- Proactive Problem Identification
- Process Objective: To improve overall availability of services by proactively identifying Problems. Proactive Problem Management aims to identify and solve Problems and/or provide suitable Workarounds before (further) Incidents recur.
- Problem Categorization and Prioritization
- Process Objective: To record and prioritize the Problem with appropriate diligence, in order to facilitate a swift and effective resolution.
- Problem Diagnosis and Resolution
- Process Objective: To identify the underlying root cause of a Problem and initiate the most appropriate and economical Problem solution. If possible, a temporary Workaround is supplied.
- Problem and Error Control
- Process Objective: To constantly monitor outstanding Problems with regards to their processing status, so that where necessary corrective measures may be introduced.
- Problem Closure and Evaluation
- Process Objective: To ensure that - after a successful Problem solution - the Problem Record contains a full historical description, and that related Known Error Records are updated.
- Major Problem Review
- Process Objective: To review the resolution of a Problem in order to prevent recurrence and learn any lessons for the future. Furthermore it is to be verified whether the Problems marked as closed have actually been eliminated.
- Problem Management Reporting
- Process Objective: ITIL Problem Management Reporting aims to ensure that the other Service Management processes as well as IT Management are informed of outstanding Problems, their processing-status and existing Workarounds (see "Problem Management Report").
Problem Management Vs. Key ITIL Processes
Problem management works alongside incident management and other ITIL practices to form an overall ITSM strategy.
Problem Management vs. Incident Management
ITIL defines a problem as a cause, or potential cause, of one or more incidents. The behaviors behind effective incident management and effective problem management are often similar and overlapping, but there are still key differences. For example, rolling back a recent deployment may get the service operating again and end the incident, but the underlying problem remains. That said, we believe that problem management and incident management practices are becoming increasingly intertwined. During the times between incidents, IT teams can focus their efforts on problem investigations that lead to improvements and better service quality. This is how problem management becomes the most valuable to the organization.
Problem Management and Change Management
Change management is the process of planning, tracking, and releasing changes without service disruption or downtime. When a change does cause disruption or downtime, that change is analyzed during incident and problem management processes.
Problem Management and Knowledge Management
Knowledge management creates a repository of solutions and documentation for common procedures and even incident workarounds. Used together with problem management, a healthy knowledge management practice can enable faster incident resolution and fewer incidents altogether.
Problem Management and Service Request Management
Service request management is the practice of processing a request from a user for something to be provided, such as access to applications, software enhancements, and information. It can sometimes be difficult to distinguish a service request from an incident. In fact, the two were not distinguished and were both lumped into the category “incidents” until the release of ITIL V3 in 2007. ITIL now defines an incident as “an unplanned interruption to an IT service or reduction in the quality of an IT service.” It defines a service request as “a formal request from a user for something to be provided – for example, a request for information or advice; to reset a password; or to install a workstation for a new user.”
Problem Management Metrics
CSF: Improving service quality
KPI: An increase in the percentage of proactive changes submitted by problem management
KPI: A reduction in the number of incidents over time
CSF: Minimizing the impact of problems
KPI: An increase in first call resolution through the use of workarounds
KPI: A reduction in the average time to implement fixes
CSF: Resolving problems effectively and efficiently
KPI: A reduction in the backlog of open problems
KPI: An increase in the number of problems that met or exceeded their target resolution times
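Two of the KPIs above, the backlog of open problems and the share of problems resolved within their target time, can be computed directly from problem records. The Python sketch below uses hypothetical data in which each record is a pair of actual resolution days (or `None` while the problem is still open) and target days; the function names are also assumptions.

```python
# Hypothetical records: (resolution_days, target_days).
# resolution_days is None while the problem is still open.
problems = [
    (4, 5),
    (7, 5),
    (None, 10),
    (3, 3),
    (2, 4),
]


def open_backlog(records) -> int:
    """Count problems that have not been resolved yet."""
    return sum(1 for resolved, _ in records if resolved is None)


def pct_within_target(records) -> float:
    """Percentage of resolved problems that met or beat their target time."""
    closed = [(r, t) for r, t in records if r is not None]
    if not closed:
        return 0.0
    met = sum(1 for r, t in closed if r <= t)
    return 100.0 * met / len(closed)


print(open_backlog(problems))
print(pct_within_target(problems))
```

Tracking these values per reporting period, rather than as single snapshots, is what shows the reductions and increases the KPIs call for.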
Benefits of Problem Management
The following are the benefits of Problem Management:
- Eliminates the faults in an organization's services through suitable documentation.
- Refines the service design by identifying and solving weak points, ensuring the most effective and efficient path for service delivery.
- Increases the first-time fix rate on service failures by providing permanent solutions to incidents rather than stopping at workarounds.
- Diminishes the impact of incidents affecting multiple users, or a single user at a crucial time.
- Prevents most of the incidents and problems plaguing an organization over time, boosting user productivity.
- Strengthens the confidence users have in the organization's IT services.
- Decreases the time it takes to recover from failures through systematic maintenance of a KEDB.
- Prevents recurring incidents through one-time fixes, sparing valuable service desk efforts in resolving them.
- Encourages IT services to mature as the organization develops by learning from resolved problems.
- Develops IT talent within the organization through technical awareness and valuable insights.
- Definition - What Does Problem Management Mean in ITIL? Cherwell
- The 3 Phases of Problem Management BMC
- Scope and Value of Problem Management Wikipedia
- Why Is Problem Management Important? Project Manager
- Importance of Reactive and Proactive Problem Management Paldesk
- Techniques for Effective Problem Management Freshservice
- Problem Management Subprocesses IT Processmaps
- The Relationship Between Problem Management and Other Key ITIL Processes Atlassian
- Problem Management Metrics HDI
- The Benefits of Problem Management ManageEngine