What Is “Mean Time To Recovery (MTTR)”? — A Complete Overview
Previously Published to Propelo.AI
DORA metrics are a cornerstone of the software engineering culture that is DevOps. DORA is an acronym for “DevOps Research and Assessment,” a research program whose metrics have been widely adopted by engineering organizations all over the world.
It’s an excellent tool for improving the performance, productivity and efficiency of the software development lifecycle (SDLC) — but an even better tool for making sure products are safe, secure, reliable and always available to the end-user.
DORA originally consisted of four key metrics:
- Lead time for changes
- Change failure rate (CFR)
- Deployment frequency (DF)
- Mean Time To Recovery (MTTR)
A fifth metric, reliability, has since been added, as most engineers are moving to the cloud and pushing out smaller, more frequent deployments to their clients over time.
The DevOps lifecycle strongly advocates for automation and monitoring at all steps of software production, covering all facets of development from integration, testing and releasing to deployment and infrastructure management. More specifically, it pushes for more reliable programs, greater stability and the always-on availability expected of today’s software-as-a-service (SaaS) platforms. It also aims at shorter development cycles, increased deployment frequency and more dependable releases, all of which closely align with ongoing business objectives.
To understand the potential impact of risky build iterations and repair procedures, engineers must also pay close attention to operational needs within the production environment.
While their primary focus should always be on the reliability and functionality of a quality product, they must also keep in mind that eyes are on them at all times — and that business clients and executives don’t take kindly to delays, losses or even downtime should unexpected troubles arise.
Engineers should always plan ahead — to mitigate risks and fool-proof the system.
By leveraging DevOps-related strategies, they will see a tremendous improvement in deployment frequency, lead time, the detection of cybersecurity vulnerabilities and flaws, the mean time to repair and mean time to recovery.
Mean Time to Recovery (MTTR) Explained
“Mean Time to Recovery” is the average time it takes for a system to rebound, or recover, from a disruption of service or a total failure. As an incident management metric, it’s essential for quantifying and tracking downtime, whether the interruption is due to a recent deployment or an isolated system failure. The abbreviation is shared with several other “mean time” calculations that cover repair and recovery during the software development lifecycle (SDLC) or related services. Each of these terms has a similar meaning with a slightly different definition. Most of them are covered throughout this blog and include:
- Mean Time to Fix
- Mean Time to Repair
- Mean Time to Respond
- Mean Time to Resolve
- Mean Time to Resolution
- Mean Time to Remediate
There are other incident metrics used alongside MTTR to assess the performance of DevOps and ITOps processes. They’re used to evaluate the effectiveness of determining fixes, to detect the reliability of applicable solutions and to measure how easy it will be to maintain systems both during and after remediation:
- Mean Time to Failure (MTTF)
- Mean Time to Acknowledge (MTTA)
- Mean Time to Detect (MTTD)
- Mean Time to Identify (MTTI)
- Mean Time Between Failures (MTBF)
So, what is Mean Time to Recovery?
MTTR is often explained as the time it takes for engineers to find a problem, understand it, decide on the best plan for remediating it, apply the fix and get the platform back up and running at full functionality.
In the incident management process, MTTR serves as a key performance indicator (KPI) that allows engineering teams to improve their response to issues both controlled and undetectable.
Atlassian, the creator of Jira and Trello, provides some clarity on this matter, suggesting that it’s a good idea to define which MTTR a team has adopted before tracking successes and failures. The company states that “a team needs to be on the same page about exactly what you’re tracking and be sure everyone knows they’re talking about the same thing.”
It’s just one step in the incident response process and one way engineers can gauge how quickly problems can be remediated and how soon new changes can be shipped out after that. It also provides deeper insight into how stable a product actually is, primarily when delays have caused teams to work under stricter timelines and scrutiny through to delivery. The faster a system can be recovered, the less likely issues are to worsen and the higher the probability of customer satisfaction.
Sometimes that means slowing down a bit.
When faced with a major issue, teams should only focus on this one problem — even if it means putting the entire project on hold to correct these errors. Cutting corners can lead to much bigger problems, and a domino effect can occur if the SDLC continues without resolution.
Controlled testing should immediately take place to decide whether there’s a problem. Here are a few questions to ask the engineering team:
- Which data was monitored, and did anything stand out?
- Did the system fail in a way that would otherwise be expected?
- Was the issue an easy fix, and could it be handled quickly?
- How long did it take for the service to be restored once an issue was detected?
Both recovery speeds and response times mirror the team’s ability to not only diagnose problems but to correct them.
Engineers should focus on building a more robust system for recovery. This often includes an expanded view of real-time data and increased control over monitoring and analysis. In this same sense, teams will become more careful when preserving quality and even more concerned about improving the product throughout the development process.
What is Mean Time To Restore (MTTR)? Why is it important?
MTTR can also represent the time it takes to restore business-critical services from a failure in production, whether it’s an unplanned outage or simply a loss of service. Production issues can significantly impact the customer experience and should be prioritized to avoid displeasure with the system.
For this, engineers will want to know how much time it will take to recover from specific failures during production and how quickly those issues can be resolved on average. Some will go as far as building continuous integration and continuous delivery (CI/CD) systems that report failures and alert the teams as soon as an issue presents itself.
If you’ve read our whitepaper on DORA Metrics, you probably already know that measuring MTTR can be trickier than it appears because service failures and outages differ in type and severity. As a result, a single blended MTTR score can be misleading and of little value.
In some organizations, especially those with smaller teams, minor and low-severity issues are almost never prioritized or fixed. And so it’s said that they should not be counted towards the final MTTR calculation.
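One way to act on this is to report MTTR per severity level instead of as a single blended number. The following is a minimal Python sketch, assuming hypothetical incident records that carry a severity label, a detection timestamp and a restoration timestamp; the record layout and the sample values are illustrative only and do not come from any particular service desk.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical incident records: (severity, detected_at, restored_at).
incidents = [
    ("critical", datetime(2023, 1, 3, 9, 0), datetime(2023, 1, 3, 9, 30)),
    ("critical", datetime(2023, 1, 9, 14, 0), datetime(2023, 1, 9, 15, 30)),
    ("low", datetime(2023, 1, 4, 10, 0), datetime(2023, 1, 11, 10, 0)),
]


def mttr_by_severity(records):
    """Return the mean restore time for each severity level."""
    durations = defaultdict(list)
    for severity, detected_at, restored_at in records:
        durations[severity].append(restored_at - detected_at)
    return {
        severity: sum(values, timedelta()) / len(values)
        for severity, values in durations.items()
    }


# A single blended score across every severity, for comparison.
blended = sum(
    (restored - detected for _, detected, restored in incidents), timedelta()
) / len(incidents)

per_severity = mttr_by_severity(incidents)
print("blended:", blended)
print("critical:", per_severity["critical"])
print("low:", per_severity["low"])
```

In this made-up sample, the one low-severity ticket that sat open for a week pulls the blended figure to more than two days, even though critical incidents were restored in an hour on average.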
Service desks and production monitoring systems are necessary in order to accurately track MTTR data, prioritize system recovery and improve deployment times. Flagging particular changes or features within a system will allow engineers to quickly reference timestamps or to toggle features on and off. This will allow them to better understand what went wrong and to identify which problems are occurring at present, while potentially bringing the MTTR down to a matter of seconds.
What MTTR Is and Isn’t
MTTR is one of the most widely used metrics — and one of the most misunderstood metrics — in what’s often called the “systems reliability toolbox.”
Because so many engineers have a different understanding of what MTTR actually means, it can be hard for everyone to get on the same page. Since each variant of MTTR has a slightly different meaning, teams that lack clarity on which definition they use, or a clear vision for what needs to be done, may see more harm than good, as even minute differences can throw the entire repair and production process off course.
There needs to be a unified front among teams on how to use this metric, how to improve it and how to execute it in a sustainable yet consistent manner. A single set of rules should determine how performance will be tracked, measured and adjusted across any given number of changes.
MTTR allows teams to set standards for reliability, accelerate velocities between sprints and increase the overall quality of the product before it’s delivered to the end user.
With continuous improvement at the helm of what DevOps actually represents, metrics should also factor in the lead time for changes, ongoing change failure rate and the deployment frequency thereafter.
MTTR can be a powerful predictor of the impact incidents can and will have on the organization’s bottom line. The higher the MTTR, the greater the risk of significant downtime. The lower an organization’s MTTR, the lower the risk for downtime — if at all — during the recovery process.
Delaying repair or incident response procedures for issues that have already occurred could lead the business to further disruptions, customer dissatisfaction and loss of revenue.
Understanding the mean time to repair helps teams define and optimize DevOps incident management procedures — which is especially critical to the greater picture. If managed properly, minimizing MTTR can fortify the resiliency and stability of a product, add an extra layer of protection against system failure and lead to longer “uptimes” for the end-user. MTTR is also used in DevSecOps to establish cybersecurity measures when measuring a team’s ability to neutralize system attacks.
To further address what MTTR is or isn’t, we must first understand the differences between the various definitions:
- Mean Time to Recovery (Mean Time to Fix)
Mean time to recovery exists as a metric to assess the entire incident management process. It’s the time that lapses from the first sign of failure during production or deployment until the complete restoration of production or service-related operations. In other words, the mean time to recovery is how long it takes on average for applications in the production stage of deployment to fully recover from system failure and can be directly applied to:
- System and application deployments
- Backup and data lifecycle management
- Change and patch management
- Platform governance
- Mean Time to Respond
This is the actual time spent during system recovery and fixing the problem when tackling outages. This does not include any delays from the discovery of each incident to the beginning of the recovery process.
- Mean Time To Repair
This is the average time it takes to fix the issue, test the fix and successfully apply the fix.
- Mean Time To Resolve
This is the average total time spent diagnosing and fixing the problem, restoring the product or service and applying a fix. It also includes the time spent on safeguarding products against future failures.
What is the MTTR Formula?
Although MTTR is essentially the same metric across the board, the incident management process requires decision-makers to decide which instance of MTTR will be used based on situational analysis. Because each iteration is slightly different, choosing the wrong formula could lead to much more significant issues further down the line. So, it’s up to leadership to decide which one is most adequate.
Slight changes to the meaning of MTTR mean that slightly different calculations are used to estimate a mean time. As an example, here are four formulas for different readings of MTTR, preceded by the stability relationship that MTTR feeds into:
- Stability of systems = change failure rate + mean time to recovery (MTTR); the two metrics are read together as indicators of stability rather than added arithmetically
- Mean time to repair (MTTR) = total time spent repairing in any given time period / # of repairs
- Mean time to recovery (MTTR) = total time spent in discovery and repairing systems / # of repairs
- Mean time to resolve = total resolution time during any given time period / # of incidents
- Mean time to respond = full response time from alert to fix / # of incidents
Accuracy can be increased by basing the calculation on system timestamps. Developers need to know when an incident occurred and when it was effectively resolved. Understanding the root cause of a system failure will also allow teams to pinpoint when the system went down and how much time elapsed before it was discovered.
By knowing which deployment resolved an incident and the steps taken to mitigate each incident, further calculations will allow us to decide whether the service has been restored effectively and whether the user experience has suffered from such an event.
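To make the difference between these readings concrete, here is a small Python sketch that computes three of them from incident timestamps. The `Incident` class and its four timestamp fields are assumptions made for this illustration; a real service desk or monitoring tool will expose its own field names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # All four timestamps are hypothetical fields for this sketch.
    failed_at: datetime          # when the failure actually began
    alerted_at: datetime         # when the team was first alerted
    repair_started_at: datetime  # when hands-on repair work began
    restored_at: datetime        # when service was fully restored


def mean(deltas):
    """Average a collection of timedelta values."""
    deltas = list(deltas)
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_recovery(incidents):
    # Discovery plus repair: failure start until full restoration.
    return mean(i.restored_at - i.failed_at for i in incidents)


def mean_time_to_respond(incidents):
    # From the first alert until the fix is in place.
    return mean(i.restored_at - i.alerted_at for i in incidents)


def mean_time_to_repair(incidents):
    # Hands-on repair time only.
    return mean(i.restored_at - i.repair_started_at for i in incidents)


incidents = [
    Incident(datetime(2023, 3, 1, 10, 0), datetime(2023, 3, 1, 10, 5),
             datetime(2023, 3, 1, 10, 20), datetime(2023, 3, 1, 11, 0)),
    Incident(datetime(2023, 3, 8, 14, 0), datetime(2023, 3, 8, 14, 15),
             datetime(2023, 3, 8, 14, 30), datetime(2023, 3, 8, 14, 40)),
]

print(mean_time_to_recovery(incidents))  # 0:50:00
print(mean_time_to_respond(incidents))   # 0:40:00
print(mean_time_to_repair(incidents))    # 0:25:00
```

The same two incidents produce three different numbers, which is why a team should agree on one definition before comparing results over time.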
What is MTTR vs. MTBF?
MTBF is another key performance indicator (KPI) that addresses the average “mean time between failures,” as used to track both the availability and reliability of a software-based product or service.
Unlike MTTR, which looks at a lower mean time as an indicator of a healthy system, the higher the mean time is between failures, the more reliable the system is considered. And because of this, most companies aim to keep the MTBF as high as possible — eliminating any and all opportunities for failure, as long as it’s in their control.
The MTBF is calculated by finding the mean, or average, between failures over a designated period of time — whether a year or six months. Each stretch of time between failures is added up and then divided by the number of total failures within the selected period of time.
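The calculation described above can be sketched in a few lines of Python. This version assumes a hypothetical list of unplanned outages, each given as a start and end timestamp, and treats everything else inside the observation window as time between failures; scheduled maintenance would need to be excluded from the list beforehand.

```python
from datetime import datetime, timedelta


def mtbf(window_start, window_end, outages):
    """Mean time between failures over an observation window.

    `outages` is a list of (start, end) pairs for unplanned outages only.
    """
    if not outages:
        raise ValueError("MTBF is undefined without at least one failure")
    downtime = sum((end - start for start, end in outages), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime / len(outages)


# Hypothetical outages during a 30-day window.
outages = [
    (datetime(2023, 6, 5, 2, 0), datetime(2023, 6, 5, 3, 0)),      # 1 hour
    (datetime(2023, 6, 14, 13, 0), datetime(2023, 6, 14, 15, 0)),  # 2 hours
    (datetime(2023, 6, 27, 8, 0), datetime(2023, 6, 27, 11, 0)),   # 3 hours
]

result = mtbf(datetime(2023, 6, 1), datetime(2023, 7, 1), outages)
print(result)  # 9 days, 22:00:00
```

Here 714 hours of uptime spread across three failures gives an MTBF of 238 hours.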
Because the metric is used to track reliability, MTBF does not factor in expected downtime during scheduled maintenance or testing procedures but instead focuses on the unexpected outages and issues faced due to outside sources or influences that dictate service capabilities.
In an enterprise, engineers should aim to reduce the MTTR while increasing the MTBF to minimize, or completely avoid, unplanned downtime and failure. While MTBF is generally used when failure occurs within a repairable system, mean time to failure (MTTF) typically applies to systems that must be replaced rather than repaired.
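A common way to see how MTTR and MTBF pull in opposite directions is the standard steady-state availability ratio, MTBF / (MTBF + MTTR). The sketch below is an illustration with made-up figures, not a measurement from any real system, and the function name is an assumption of this example.

```python
from datetime import timedelta


def availability(mtbf, mttr):
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf / (mtbf + mttr)


# Hypothetical figures: a failure roughly every ten days.
baseline = availability(timedelta(hours=238), timedelta(hours=2))
faster_recovery = availability(timedelta(hours=238), timedelta(hours=1))

print(f"{baseline:.4%}")
print(f"{faster_recovery:.4%}")
```

Halving the recovery time raises availability without changing how often failures occur, and raising MTBF has the same effect from the other side.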
With that said, this metric also serves as a tool for leadership when making more informed recommendations to their customers on replacing or maintaining particular parts, bug fixes or system upgrades.
If Used Together, MTBF, MTTR, MTTA, and MTTF Can Improve Incident Management While Reducing Downtime and Failure.
Failure metrics allow enterprise-level organizations to track the reliability and security of technical environments. This includes the hardware, software and infrastructure necessary to conduct day-to-day operations. This includes the tools and networks that provide added value to the technical environment — and the ecosystem that connects businesses across continents and territories.
They’re used for troubleshooting service requests and pinpointing issues with connectivity. Certain metrics relate to server failures — and others, the replacement of degraded or malfunctioning parts.
Unfortunately, when systems aren’t functioning properly, businesses will experience unnecessary downtime. Failure metrics allow us to restore systems before a failure negatively impacts the business or the business of our customers.
We need to proactively detect issues before advancing to production. Large data sets need to be collected, processed and then analyzed when gathering both qualitative and quantitative data. This task can become quite tedious, but automating workflow activities, full-scale incident reports and internal management systems can dramatically control response times and lead us to resolve incidents in real time.
Proper visibility can make or break MTTR-related action plans while holding future failures and incidents at bay. Of course, critical incident response procedures should first be established to limit how much time lapses between recovery and the greater number of offenses that normally take place. But knowing how and when to use mean time will improve the organization’s response to any given failure, saving the company and its clientele from unnecessary expense, backlash or uncertain delay.
MTTR Performance Benchmarks
A more complete and accurate picture is revealed when MTTR is paired with other metrics during the recovery process. Performance, cost and the impact of downtime can be predicted based on MTTR and on how often incidents arise during production. As the MTTR trends downward, teams will begin seeing a clear indicator of healthier and improved systems for deployment.
The root cause of failure becomes clear when MTTR has been reduced, making it possible to develop strategies toward remediating related instances further down the line. It also signifies a more reliable and stable environment, where bugs become smaller, easier to fix and identifiable during earlier stages of production.
Key Performance Indicators (KPIs) also include:
- Instances of dependencies when system fixes become untethered
- How often incidents are repeated over a specified period of time
- How often new production is halted until past issues have been resolved
- The nature and expediency of ongoing feedback loops
- The velocity, quality and performance of activities between sprints
A sign of continuous improvement is when issues are either non-existent or are more easily managed over time. Implementing faster feedback loops and mechanisms can improve software builds earlier in the SDLC pipeline. Rapid iterations can mean warding off negative consequences by providing immediate attention to areas where “things just don’t seem right.”
Smaller packages and quicker fixes allow us to pinpoint the first sign of failure before it becomes a real problem. Planning for failure allows us to reduce MTTR by mitigating failure, rolling back unsuccessful deployments and improving the overall experience — one package at a time.
The Problems and Causes of Poor MTTR
The challenges surrounding MTTR begin with how we define the term.
Businesses — enterprise businesses, in particular — are more heavily and increasingly relying on cloud and software services to operate at their highest efficiency. As a result, any downtime to systems could translate to a disruption in revenue, productivity or dependability.
Disruptions can directly impact the reputation of a business and its ability to effectively conduct operations while the system is down. They could also prevent the business from providing the premium, always-online customer experience that users expect.
MTTR is typically used when talking about unplanned incidents, not service requests, and is commonly calculated using business days or hours only. Depending on how the team handles the situation, that could be a good thing or a bad thing.
If engineers decide to work through a weekend (or overnight) in the case of extreme failure, they can have a system back up and running by Monday morning, and businesses may not even notice that any failure ever took place. However, engineers could face burnout without time off later in the week. And because weekend days and hours are not factored into the calculation, that effort adds complexity when calculating MTTR.
This is where teams need to be transparent, and where defining the exact implementation of MTTR is of the utmost importance. The wrong calculation can lead to unrealistic turnarounds or expectations across the incident management process.
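The gap between wall-clock time and business-hours time can be illustrated with a short Python sketch. It assumes a Monday to Friday, 09:00 to 17:00 working week and timestamps without time zones; both are simplifications chosen for this example, and holidays are ignored.

```python
from datetime import datetime, time, timedelta

WORKDAY_START = time(9, 0)   # assumed start of the business day
WORKDAY_END = time(17, 0)    # assumed end of the business day


def business_duration(start, end):
    """Time elapsed between start and end, counting business hours only."""
    total = timedelta()
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # Monday is 0, Friday is 4
            window_start = datetime.combine(day, WORKDAY_START)
            window_end = datetime.combine(day, WORKDAY_END)
            overlap_start = max(start, window_start)
            overlap_end = min(end, window_end)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total


# Hypothetical incident: fails Friday afternoon, restored Monday morning.
failed_at = datetime(2023, 1, 6, 16, 0)    # a Friday
restored_at = datetime(2023, 1, 9, 10, 0)  # the following Monday

print(restored_at - failed_at)                    # 2 days, 18:00:00
print(business_duration(failed_at, restored_at))  # 2:00:00
```

The same incident counts as 66 hours on the wall clock but only two hours in business time, so the team needs to state which clock its MTTR uses.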
Depending on the business model, this could also add a significant cost and risk to the business or its business customer.
Businesses that have implemented a solid MTTR action plan could avoid disruptions by acting quickly, implementing an immediate fix and then testing the system.
In the case of delay or downtime, optimizing MTTR action plans could be the difference between an hour and a few days of disruption. Anything that takes more than a day, however, could indicate poor alerting or poor monitoring and can result in a larger number of affected systems.
Because the mean time to repair covers only the time spent repairing the system, it doesn’t necessarily equal the actual downtime of the outage itself, and there could be some confusion as to how long it actually took to get the system back up and running. This would be especially true if other projects demanded priority, pushing the repair back even further behind schedule. In such circumstances, it’s quite common to see a certain lag between when the issue was first detected and when the repairs initially began.
That said, there are limitations when it comes to certain instances of MTTR. When deciding which iteration best serves the entire recovery process, one must answer the following questions:
- Is the issue being handled as quickly as it needs to be?
- Does the recovery team have the adequate tools, skills and training to carry out the recovery process?
- Is the team taking too long to find and implement fixes?
- How costly is downtime to the business or its customers?
- How does the company’s average downtime compare to its competitors?
- Are maintenance teams acting as effectively as they could be?
- Is it taking too long for someone to answer or respond to a fix request?
- How many issues are actually causing the delay?
- Are these delays happening due to an inadequacy among the team — or is there something more sinister at play?
- Are teams as effective as they could be? What could make them more effective?
Without the adequate data, it will be hard to arrive at the correct answers. But perhaps another issue could be plaguing the scene:
- Are there unnecessary delays taking place between a failure and an alert?
- Are alerts taking longer than they should to get to the right person?
- Are there other processes that could be improved?
- Is there a lack of direction?
- Is there too much data to sift through, or a need for more leadership or visibility?
- How much time is the team spending on repairs vs. diagnostics?
While MTTR is most useful when tracking how quickly teams can repair an issue, there’s little to factor in when it comes to prepping systems for repair or when faulty system alerts have been causing delays. In this case, the issue is somewhat unrelated to the system itself but necessary to tackle when further trying to close the gap on excessive delays.
The problem could reside within the alert system, or it could be related to automation, incompatibility and diagnostics.
Under these circumstances, any instance of MTTR may just not be enough. While the mean time to recovery is a great starting point for diagnosing unknown problems, other factors will need to be considered when carrying out the recovery process.
Alert fatigue is the byproduct of MTTR-related processes and procedures, unaffected and overlooked by MTTR itself.
In the case where no particular issue stands out, alert fatigue could be a primary factor. We use alerts when automating our systems for issue discovery. We even become somewhat dependent on them. The phenomenon behind alert fatigue isn’t necessarily detectable, as this is a human-related factor as opposed to technical.
Alert fatigue happens when we’ve become overwhelmed by or desensitized to alerts. Think of it like sleeping through an alarm. When engineers are tasked with responding to incidents around the clock, they become so used to hearing these alarms that they may not realize an alarm has gone off. When engineers become fatigued by these alarms, they’ll often ignore or even miss a few alerts, ultimately delaying the response time for fixes and allowing the problem to worsen before they even realize it.
A few things that could be done to combat alert fatigue include:
- Reducing noise so that engineers don’t overlook these alerts
- Prioritizing alerts to refocus on issues that matter the most
- Alerting entire engineering teams of incidents as they occur, escalating failures and routing those incidents to engineers who are both available and equipped to immediately respond in the moment
- Establishing a solid incident retrospective, or a thorough plan to follow up, on all incidents for reassurance that repeat incidences will not occur
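The first two items, reducing noise and prioritizing alerts, can be sketched as a simple triage step. The `Alert` fields and the severity ranking below are hypothetical choices made for this example rather than the schema of any specific alerting tool.

```python
from dataclasses import dataclass

# Assumed ranking: lower numbers are handled first.
SEVERITY_RANK = {"critical": 0, "high": 1, "low": 2}


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    severity: str


def triage(alerts):
    """Drop repeated alerts for the same check, then sort by severity."""
    seen = set()
    unique = []
    for alert in alerts:
        key = (alert.service, alert.check)
        if key not in seen:  # keep only the first alert per service and check
            seen.add(key)
            unique.append(alert)
    return sorted(unique, key=lambda alert: SEVERITY_RANK[alert.severity])


incoming = [
    Alert("api", "latency", "low"),
    Alert("db", "disk", "critical"),
    Alert("api", "latency", "low"),  # duplicate noise
    Alert("web", "5xx", "high"),
]

for alert in triage(incoming):
    print(alert.severity, alert.service, alert.check)
```

Four incoming alerts become three, with the critical database alert at the top of the queue.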
A troubleshooting process is essential for uncovering issues that affect MTTR but that the metric itself does not reveal. It’s equally important to diagnose the incident in its entirety and determine the correct fix without cutting corners on the road to recovery. This isn’t a situation for taking on technical debt, as much larger issues could accumulate in the future.
How to Improve MTTR through Action, Observation and Engagement Strategies
Create a fool-proof incident response process, ensuring that teams act with a sense of urgency to eliminate issues as soon as they are detected. Assess each dependency and acknowledge ongoing bottlenecks, further documenting everything that sets a precedent for potential failure going forward. Provide context and suggest action items in regard to such resolutions across the entire ecosystem.
Develop a robust plan of action to promote the deployment of more frequent, bite-sized changes across the SDLC, ensuring that any future issues will be easier to identify and fix in the event anything should go wrong. An efficient yet effective incident-resolution process should also be developed, with the primary focus on reducing MTTR.
Improve both incident resolution times and workflow strategies to accelerate the process of remediation and establish a clear plan for escalation, including who to call and who will lead should a problem ever arise. Budgets should be allocated and easily accessible for times of incident in order to avoid unnecessary delays in approval when, instead, special tools and resources should be secured and put to use right away.
Commit to long-term service reliability — develop the mindset that once a problem is fixed, it stays fixed no matter how long.
Train interdisciplinary teams for cross-functional collaboration, ensuring that while specialists may be available, they will not become overworked or burn out, especially at times when they’re needed most. By developing more efficient teams at the most foundational level, absences won’t disrupt the team’s overarching ability to tackle each problem.
Among many other reasons, it would be wise to name leads for both technical and communicative aspects of the department or team. These expert voices and opinions would represent the needs of both internal teams and business leaders within the organization — providing very clear information as it relates to the operational and technical nature of the business, especially in the case of an immediate incident, delay or failure.
Hold teams fully accountable for closing the loop on the incident resolution process.
Each lead should serve as a buffer between various departments and make informed decisions on behalf of the engineering department. They should assess how many systems are impacted and present possible solutions in each matter.
Test the system’s resiliency by implementing chaos engineering into the recovery process. Uphold all service level agreements (SLAs), and remind everyone that quality is not optional.
On the side of automation, leadership should consider alerting engineering teams of certain and identifiable failures. Tools should be thoughtfully calibrated for situations when issues are almost inevitable.
Stay Alerted to Failure with Around-the-Clock Incident Management
Real-time monitoring systems will provide accurate reads and average response times for a variety of complex metrics. These will provide teams with more solid, concrete facts that are devoid of all guesswork associated with incident management.
Many of these dashboards now include systems that will alert on-call team members during times of maintenance, testing and system disruption. Take control of these alerts to minimize alert fatigue by setting thresholds for notification.
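As an illustration of a notification threshold, the sketch below pages only when a metric stays above a limit for several consecutive samples, so a single brief spike does not wake anyone up. The function name, the 5% error-rate limit and the three-sample window are assumptions for this example.

```python
def should_page(samples, threshold, sustained):
    """Return True when the last `sustained` samples all exceed `threshold`."""
    if sustained < 1:
        raise ValueError("sustained must be at least 1")
    if len(samples) < sustained:
        return False
    return all(value > threshold for value in samples[-sustained:])


# Hypothetical error-rate samples, one per minute.
brief_spike = [0.01, 0.09, 0.02]
real_problem = [0.01, 0.06, 0.07, 0.08]

print(should_page(brief_spike, threshold=0.05, sustained=3))   # False
print(should_page(real_problem, threshold=0.05, sustained=3))  # True
```

Tuning the threshold and the window is a trade-off: a longer window cuts noise but delays detection, which in turn adds to the recovery time being measured.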
Propelo can integrate with many systems, such as Jira, Salesforce service desk, or PagerDuty, to measure the MTTR. Propelo lets you take a scalpel approach to precisely measure certain types of failures across all your teams and gives you a much more standardized view of MTTR across all your teams.