A Beginner’s Guide to Downtime and What to Do about It

This blog provides an overview of this topic written for non-experts. It

  • explains why you might want to read this blog.
  • lists the various types of “machine maintenance.”
  • explains what “probabilistic modeling” is.
  • describes models for predicting downtime.
  • explains what these models can do for you.

Importance of Downtime

If you manufacture things for sale, you need machines to make those things. If your machines are up and running, you have a fighting chance to make money. If your machines are down, you lose opportunities to make money. Since downtime is so fundamental, it is worth some investment of money and thought to minimize downtime. By thought I mean probability math, since machine downtime is inherently a random phenomenon. Probability models can guide maintenance policies.

Machine Maintenance Policies

Maintenance is your defense against downtime. There are multiple types of maintenance policies, ranging from “Do nothing and wait for failure” to sophisticated analytic approaches involving sensors and probability models of failure.

A useful list of maintenance policies is:

  • Sitting back and waiting for trouble, then sitting around some more wondering what to do when trouble inevitably happens. This is as foolish as it sounds.
  • Same as above except you prepare for the failure to minimize downtime, e.g., stockpiling spare parts.
  • Periodically checking for impending trouble coupled with interventions such as lubricating moving parts or replacing worn parts.
  • Basing the timing of maintenance on data about machine condition rather than relying on a fixed schedule; requires ongoing data collection and analysis. This is called condition-based maintenance.
  • Using data on machine condition more aggressively by converting it into predictions of failure time and suggestions for steps to take to delay failure. This is called predictive maintenance.

The last three types of maintenance rely on probability math to establish a maintenance schedule, or determine when data on machine condition call for intervention, or calculate when failure might occur and how best to postpone it.

 

Probability Models of Machine Failure

How long a machine will run before it fails is a random variable. So is the time it will spend down. Probability theory is the part of math that deals with random variables. Random variables are described by their probability distributions, e.g., what is the chance that the machine will run for 100 hours before it goes down? 200 hours? Or, equivalently, what is the chance that the machine is still working after 100 hours or 200 hours?

A sub-field called “reliability theory” answers this type of question and addresses related concepts like Mean Time Before Failure (MTBF), which is a shorthand summary of the information encoded in the probability distribution of time before failure.

Figure 1 shows data on the time before failure of air conditioning units. This type of plot depicts the cumulative probability distribution and shows the chance that a unit will have failed after some amount of time has elapsed. Figure 2 shows a reliability function, plotting the same type of information in an inverse format, i.e., depicting the chance that a unit is still functioning after some amount of time has elapsed.

In Figure 1, the blue tick marks next to the x-axis show the times at which individual air conditioners were observed to fail; this is the basic data. The black curve shows the cumulative proportion of units failed over time. The red curve is a mathematical approximation to the black curve – in this case an exponential distribution. The plots show that about 80 percent of the units will fail before 100 hours of operation.

Figure 1 Cumulative distribution function of uptime for air conditioners
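
To make this concrete, here is a minimal sketch in Python of the same idea: compare an empirical cumulative distribution (the black curve) with a fitted exponential model (the red curve). The failure times in the sketch are hypothetical stand-ins, not the actual air conditioner data.

```python
import numpy as np

failure_hours = np.array([12, 25, 31, 47, 55, 68, 74, 90, 110, 160])  # hypothetical failure times

rate = 1.0 / failure_hours.mean()          # maximum-likelihood estimate of the exponential rate

def empirical_cdf(t, data):
    """Fraction of observed units that had failed by time t (the black curve)."""
    return float(np.mean(data <= t))

def fitted_cdf(t, rate):
    """Fitted exponential probability of failure by time t (the red curve)."""
    return 1.0 - np.exp(-rate * t)

for t in (50, 100, 200):
    print(f"t = {t:>3} h   empirical = {empirical_cdf(t, failure_hours):.2f}   "
          f"fitted = {fitted_cdf(t, rate):.2f}")
```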

 

Probability models can be applied to an individual part or component or subsystem, to a collection of related parts (e.g., “the hydraulic system”), or to an entire machine. Any of these can be described by the probability distribution of the time before they fail.

Figure 2 shows the reliability function of six subsystems in a machine for digging tunnels. The plot shows that the most reliable subsystem is the cutting arms and the least reliable is the water subsystem. The reliability of the entire system could be approximated by multiplying all six curves (because for the system as a whole to work, every subsystem must be functioning), which would result in a very short interval before something goes wrong.

Figure 2 Examples of probability distributions of subsystems in a tunneling machine
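
Here is a small sketch of that multiplication, assuming for illustration that each subsystem has an exponential reliability function and that subsystems fail independently. The subsystem names echo the figure, but the MTBF values are hypothetical.

```python
import numpy as np

# Hypothetical subsystem MTBFs in hours (not the actual values behind Figure 2).
subsystem_mtbf = {
    "cutting arms": 400.0,   # most reliable in this illustration
    "hydraulics":   250.0,
    "conveyor":     220.0,
    "electrical":   180.0,
    "controls":     150.0,
    "water":         80.0,   # least reliable in this illustration
}

def reliability(t, mtbf):
    """Chance a subsystem is still working at time t under an exponential model."""
    return np.exp(-t / mtbf)

t = 50.0  # hours
system_reliability = np.prod([reliability(t, m) for m in subsystem_mtbf.values()])
print(f"Chance the whole machine is still running after {t:.0f} hours: {system_reliability:.1%}")
```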

 

Various factors influence the distribution of the time before failure. Investing in better parts will prolong system life. So will investing in redundancy. So will replacing used parts with new ones.

Once a probability distribution is available, it can be used to answer any number of what-if questions, as illustrated below in the section on Benefits of Models.

 

Approaches to Modeling Machine Reliability

Probability models can describe either the most basic units, such as individual system components (Figure 2), or collections of basic units, such as entire machines (Figure 1). In fact, an entire machine can be modeled either as a single unit or as a collection of components. When an entire machine is treated as a single unit, its lifetime distribution summarizes the combined effect of the lifetime distributions of its components.

If we have a model of an entire machine, we can jump to models of collections of machines. If instead we start with models of the lifetimes of individual components, then we must somehow combine those individual models into an overall model of the entire machine.

This is where the math can get hairy. Modeling always requires a wise balance between simplification, so that some results are possible, and complication, so that whatever results emerge are realistic. The usual trick is to assume that failures of the individual pieces of the system occur independently.

If we can assume failures occur independently, it is usually possible to model collections of machines. For instance, suppose a production line has four machines churning out the same product. Having a reliability model for a single machine (as in Figure 1) lets us predict, for instance, the chance that only three of the machines will still be working one week from now. Even here there can be a complication: the chance that a machine working today will still be working tomorrow often depends on how long it has been since its last failure. If the time between failures has an exponential distribution like the one in Figure 1, then it turns out that the time of the next failure doesn’t depend on how long it has been since the last failure. Unfortunately, many or even most systems do not have exponential distributions of uptime, so the complication remains.
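
As a sketch of the simple case, the snippet below assumes each machine's uptime is exponential with a hypothetical MTBF (so the memoryless property applies) and uses the binomial formula to estimate the chance that exactly three of four machines are still running one week from now.

```python
import math

mtbf_hours = 300.0                       # hypothetical mean time before failure
week = 7 * 24                            # one week, in hours
p_up = math.exp(-week / mtbf_hours)      # chance a single machine survives the week

n, k = 4, 3
p_exactly_three = math.comb(n, k) * p_up**k * (1 - p_up)**(n - k)
print(f"P(one machine survives the week)  = {p_up:.3f}")
print(f"P(exactly 3 of 4 still running)   = {p_exactly_three:.3f}")
```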

Even worse, if we start with models of many individual component reliabilities, working our way up to predicting failure times for the entire complex machine may be nearly impossible if we try to work with all the relevant equations directly. In such cases, the only practical way to get results is to use another style of modeling: Monte Carlo simulation.

Monte Carlo simulation is a way to substitute computation for analysis when it is possible to create random scenarios of system operation. Using simulation to extrapolate machine reliability from component reliabilities works as follows.

  1. Start with the cumulative distribution functions (Figure 1) or reliability functions (Figure 2) of each machine component.
  2. Create a random sample from each component lifetime to get a set of sample failure times consistent with its reliability function.
  3. Using the logic of how components are related to one another, compute the failure time of the entire machine.
  4. Repeat steps 2 and 3 many times to see the full range of possible machine lifetimes.
  5. Optionally, average the results of step 4 to summarize the machine lifetime with metrics such as the MTBF or the chance that the machine will run more than 500 hours before failing.

Step 1 would be a bit complicated if we do not have a nice probability model for a component lifetime, e.g., something like the red line in Figure 1.

Step 2 can require some careful bookkeeping. As time moves forward in the simulation, some components will fail and be replaced while others will keep grinding on. Unless a component’s lifetime has an exponential distribution, its remaining lifetime will depend on how long the component has been in continual use. So this step must account for the phenomena of burn-in and wear-out.

Step 3 is different from the others in that it does require some background math, though of a simple type. If Machine A only works when both components 1 and 2 are working, then (assuming failure of one component does not influence failure of the other)

Probability [A works] = Probability [1 works] x Probability [2 works].

If instead Machine A works if either component 1 works or component 2 works or both work, then

Probability [A fails] = Probability [1 fails] x Probability [2 fails]

so Probability [A works] = 1 – Probability [A fails].
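
A few lines of code capture this Step 3 logic; the component probabilities are purely illustrative.

```python
p1_works, p2_works = 0.95, 0.90   # illustrative component probabilities

# Series: Machine A works only if both components work.
p_series = p1_works * p2_works

# Parallel (redundancy): Machine A fails only if both components fail.
p_parallel = 1 - (1 - p1_works) * (1 - p2_works)

print(f"Series (both needed):       P(A works) = {p_series:.3f}")    # 0.855
print(f"Parallel (either suffices): P(A works) = {p_parallel:.3f}")  # 0.995
```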

Step 4 can involve creation of thousands of scenarios to show the full range of random outcomes. Computation is fast and cheap.

Step 5 can vary depending on the user’s goals. Computing the MTBF is standard; choose other metrics to suit the problem. Besides these summary statistics, individual simulation runs can be plotted to build intuition about the random dynamics of machine uptime and downtime. Figure 3 shows an example for a single machine, with alternating cycles of uptime and downtime resulting in 85% uptime.

Figure 3 A sample scenario for a single machine
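
Putting the five steps together, here is a minimal simulation sketch. It assumes a series system, meaning the machine goes down as soon as its first component fails, and independent Weibull component lifetimes with hypothetical parameters.

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: hypothetical lifetime models for three components (Weibull shape, scale in hours).
components = [(1.0, 400.0), (1.5, 600.0), (3.0, 900.0)]

n_runs = 100_000
# Step 2: sample a failure time for every component in every scenario.
draws = np.column_stack([scale * rng.weibull(shape, n_runs) for shape, scale in components])
# Step 3: series logic -- the machine goes down when its first component fails.
machine_life = draws.min(axis=1)
# Steps 4 and 5: the repeated scenarios are summarized below.
print(f"Estimated machine MTBF:      {machine_life.mean():.0f} hours")
print(f"P(runs more than 500 hours): {(machine_life > 500).mean():.1%}")
```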

 

Benefits of Machine Reliability Models

In Figure 3, the machine is up and running 85% of the time. That may not be good enough. You may have some ideas about how to improve the machine’s reliability, e.g., maybe you can improve the reliability of component 3 by buying a newer, better version from a different supplier. How much would that help? That is hard to guess: component 3 may be only one of several components, perhaps not even the weakest link, and how much the change pays off depends on how much better the new one would be. Maybe you should develop a specification for component 3 that you can then shop to potential suppliers, but how long does component 3 have to last to have a material impact on the machine’s MTBF?

This is where having a model pays off. Without a model, you’re relying on guesswork. With a model, you can turn speculation about what-if situations into accurate estimates. For instance, you could analyze how a 10% increase in MTBF for component 3 would translate into an improvement in MTBF for the entire machine.
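
Here is a sketch of that kind of what-if analysis. It assumes a series system of independent components with exponential lifetimes, so component failure rates simply add; the MTBF values are hypothetical.

```python
# Hypothetical component MTBFs in hours.
component_mtbf = {"component 1": 500.0, "component 2": 800.0, "component 3": 300.0}

def machine_mtbf(mtbfs):
    """Series system of exponential components: rates add, so machine MTBF = 1 / sum(1/MTBF_i)."""
    return 1.0 / sum(1.0 / m for m in mtbfs.values())

baseline = machine_mtbf(component_mtbf)
improved = machine_mtbf({**component_mtbf, "component 3": 1.10 * component_mtbf["component 3"]})
print(f"Baseline machine MTBF:            {baseline:.0f} hours")
print(f"With 10% better component 3 MTBF: {improved:.0f} hours  ({improved / baseline - 1:.1%} gain)")
```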

As another example, suppose you have seven machines producing an important product. You calculate that you must dedicate six of the seven to fill a major order from your one big customer, leaving one machine to handle demand from a number of miscellaneous small customers and to serve as a spare. A reliability model for each machine could be used to estimate the probabilities of various contingencies: all seven machines work and life is good; six machines work so you can at least keep your key customer happy; only five machines work so you have to negotiate something with your key customer, etc.
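
A sketch of those contingency calculations, assuming every machine has the same (hypothetical) daily availability and machines fail independently:

```python
import math

n, p_up = 7, 0.92   # hypothetical per-machine daily availability

def binom_pmf(k, n, p):
    """Probability that exactly k of n independent machines are running."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

p_all_seven = binom_pmf(7, n, p_up)
p_exactly_six = binom_pmf(6, n, p_up)
print(f"All 7 running (life is good):       {p_all_seven:.1%}")
print(f"Exactly 6 running (key order safe): {p_exactly_six:.1%}")
print(f"5 or fewer (time to negotiate):     {1 - p_all_seven - p_exactly_six:.1%}")
```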

In sum, probability models of machine or component failure can provide the basis for converting failure time data into smart business decisions.

 

Read more about Maximize Machine Uptime with Probabilistic Modeling

Read more about Probabilistic forecasting for intermittent demand


Maximize Machine Uptime with Probabilistic Modeling


Two Inventory Problems

If you both make and sell things, you own two inventory problems. Companies that sell things must focus relentlessly on having enough product inventory to meet customer demand. Manufacturers and asset-intensive industries such as power generation, public transportation, mining, and refining have an additional inventory concern: having enough spare parts to keep their machines running. This technical brief reviews the basics of two probabilistic models of machine breakdown. It also relates machine uptime to the adequacy of spare parts inventory.

 

Modeling the failure of a machine treated as a “black box”

Just as product demand is inherently random, so is the timing of machine breakdowns. Likewise, just as probabilistic modeling is the right way to deal with random demand, it is also the right way to deal with random breakdowns.

Models of machine breakdown have two components. The first deals with the random duration of uptime. The second deals with the random duration of downtime.

The field of reliability theory offers several standard probability models describing the random time until failure of a machine without regard for the reason for the failure. The simplest model of uptime is the exponential distribution. This model says that the hazard rate, i.e., the chance of failing in the next instant of time, is constant no matter how long the system has been operating. The exponential model does a good job at modeling certain types of systems, especially electronics, but it is not universally applicable.

 


The next step up in model complexity is the Weibull model (pronounced “WHY-bull”). The Weibull distribution allows the risk of failure to change over time, either decreasing after a burn-in period or, more often, increasing as wear and tear accumulate. The exponential distribution is a special case of the Weibull distribution in which the hazard rate is neither increasing nor decreasing.

Figure 1: Three different Weibull survival curves

Figure 1 illustrates the Weibull model’s probability that a machine is still running as a function of how long it has been running. There are three curves corresponding to constant, decreasing, and increasing hazard rates. For obvious reasons, these are called survival curves because they plot the probability of surviving for various amounts of time (but they are also called reliability curves). The black curve that starts high and sinks fast (β=3) depicts a machine that wears out with age. The lightest curve in the middle (β=1) shows the exponential distribution. The medium-dark curve (β=0.5) is one that has a high early hazard rate but gets better with age.
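
For readers who want to experiment, here is a small sketch of the Weibull survival function for the three shape parameters in Figure 1. The scale parameter (characteristic life) is hypothetical.

```python
import math

scale = 100.0   # hypothetical characteristic life, in days
betas = {0.5: "improves with age", 1.0: "constant hazard (exponential)", 3.0: "wears out with age"}

def weibull_survival(t, beta, scale):
    """Probability the machine is still running at time t."""
    return math.exp(-((t / scale) ** beta))

for t in (25, 50, 100, 200):
    row = "   ".join(f"beta={b}: {weibull_survival(t, b, scale):.2f}" for b in betas)
    print(f"t = {t:>3} days   {row}")
```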

Of course, there is another phenomenon that needs to be included in the analysis: downtime. Modeling downtime is where inventory theory enters the picture. Downtime is modeled by a mixture of two different distributions. If a spare part is available to replace the failed part, then the downtime can be very brief, say one day. But if there is no spare in stock, then the downtime can be quite long. Even if the spare can be obtained on an expedited basis, it may be several days or a week before the machine can be repaired. If the spare must be fabricated by a far-away supplier and shipped by sea then by rail then trucked to your plant, the downtime could be weeks or months. This all means that keeping a proper inventory of spares is very important to keeping production humming along.

In this aggregated type of analysis, the machine is treated as a black box that is either working or not. Though ignoring the details of which part failed and when, such a model is useful for sizing the pool of machines needed to maintain some minimum level of production capacity with high probability.

The binomial distribution is the probability model relevant to this problem. The binomial is the same model that describes, for example, the distribution of the number of “heads” resulting from twenty tosses of a coin. In the machine reliability problem, the machines correspond to coins, and an outcome of heads corresponds to having a working machine.

As an example, if

  • the chance that any given machine is running on any particular day is 90%
  • machine failures are independent (e.g., no flood or tornado to wipe them all out at once)
  • you require at least a 95% chance that at least 5 machines are running on any given day

then the binomial model prescribes seven machines to achieve your goal.
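
The snippet below sketches that sizing calculation, searching for the smallest fleet that meets the 95% requirement under the stated assumptions.

```python
import math

def prob_at_least(k_min, n, p):
    """P(at least k_min of n independent machines are running)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

p_up, machines_needed, target = 0.90, 5, 0.95
n = machines_needed
while prob_at_least(machines_needed, n, p_up) < target:
    n += 1
print(f"Smallest fleet size meeting the goal: {n}")                                    # 7
print(f"P(at least 5 of {n} running) = {prob_at_least(machines_needed, n, p_up):.3f}")  # ~0.974
```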

 

Modeling machine failures based on component failures

The Weibull model can also be used to describe the failure of a single part. However, any realistically complex production machine will have multiple parts and therefore have multiple failure modes. This means that calculating the time until the machine fails requires analysis of a “race to failure”, with each part vying for the “honor” of being the first to fail.

If we make the reasonable assumption that parts fail independently, standard probability theory points the way to combining the models of individual part failure into an overall model of machine failure. The time until the first of many parts fails has a poly-Weibull distribution. At this point, though, the analysis can get quite complicated, and the best move may be to switch from analysis-by-equation to analysis-by-simulation.
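
Here is a brief simulation sketch of that race to failure, using hypothetical Weibull parameters for three parts. It also tallies how often each part is the first to fail.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical (shape, scale-in-days) Weibull parameters for three parts.
parts = [(0.8, 200.0), (1.0, 350.0), (2.5, 500.0)]

n_runs = 200_000
part_lives = np.column_stack([scale * rng.weibull(shape, n_runs) for shape, scale in parts])
machine_life = part_lives.min(axis=1)         # the first part to fail takes the machine down
first_to_fail = part_lives.argmin(axis=1)     # which part "won" the race in each scenario

print(f"Simulated machine MTBF: {machine_life.mean():.0f} days")
for i, (shape, scale) in enumerate(parts):
    print(f"Part {i + 1} (shape={shape}) caused {np.mean(first_to_fail == i):.0%} of first failures")
```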

 

Simulating machine failure from the details of part failures

Simulation analysis got its modern start as a spinoff of the Manhattan Project to build the first atomic bomb. The method is also commonly called Monte Carlo simulation after the biggest gambling center on earth back in the day (today it would be “Macau simulation”).

A simulation model converts the logic of the sequence of random events into corresponding computer code. Then it uses computer-generated (pseudo-)random numbers as fuel to drive the simulation model. For example, each component’s failure time is created by drawing from its particular Weibull failure time distribution. Then the soonest of those failure times begins the next episode of machine downtime.

Figure 2: A simulation of machine uptime over one year of operation

Figure 2 shows the results of a simulation of the uptime of a single machine. Machines cycle through alternating periods of uptime and downtime. In this simulation, uptime is assumed to have an exponential distribution with an average duration (MTBF = Mean Time Before Failure) of 30 days. Downtime has a 50:50 split between 1 day if a spare is available and 30 days if not. In the simulation shown in Figure 2, the machine is working during 85% of the days in one year of operation.
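
A minimal sketch of this kind of simulation, using the same assumptions as Figure 2 (exponential uptime with a 30-day MTBF, and downtime of either 1 day or 30 days with equal probability):

```python
import numpy as np

rng = np.random.default_rng(1)
mtbf, horizon = 30.0, 365.0                    # days
t, up_days = 0.0, 0.0
while t < horizon:
    run = rng.exponential(mtbf)                # length of the next uptime episode
    up_days += min(run, horizon - t)
    t += run
    if t >= horizon:
        break
    t += 1.0 if rng.random() < 0.5 else 30.0   # downtime: spare on hand, or wait for one
print(f"Uptime over one simulated year: {up_days / horizon:.0%}")
```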

 

An approximate formula for machine uptime

Although Monte Carlo simulation can provide more exact results, a simpler algebraic model does well as an approximation and makes it easier to see how the key variables relate.

Define the following key variables:

  • MTBF = Mean Time Before Failure (days)
  • Pa = Probability that there is a spare part available when needed
  • MDTshort = Mean Down Time if there is a spare available when needed
  • MDTlong = Mean Down Time if there is no spare available when needed
  • Uptime = Percentage of days in which the machine is up and running.

Then there is a simple approximation for the Uptime:

Uptime ≈ 100 x MTBF/(MTBF + MDTshort x Pa + MDTlong x (1-Pa)).    (Equation 1)

Equation 1 tells us that the uptime depends on the availability of a spare. If there is always a spare (Pa=1), then uptime achieves its peak value of about 100 x MTBF/(MTBF + MDTshort). If there is never a spare available (Pa=0), then uptime achieves its lowest value of about 100 x MTBF/(MTBF + MDTlong). When the repair time is about as long as the typical time between failures, uptime sinks to an unacceptable level near 50%. If a spare is always available, uptime can approach 100%.
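
Evaluating Equation 1 for a few values of Pa makes the effect of spare availability concrete. The sketch below reuses the illustrative values from the simulation example (MTBF of 30 days, downtimes of 1 or 30 days).

```python
def uptime_pct(mtbf, mdt_short, mdt_long, p_spare):
    """Approximate uptime percentage from Equation 1."""
    return 100.0 * mtbf / (mtbf + mdt_short * p_spare + mdt_long * (1.0 - p_spare))

for p_spare in (0.0, 0.5, 0.9, 1.0):
    print(f"Pa = {p_spare:.1f}  ->  uptime ≈ {uptime_pct(30.0, 1.0, 30.0, p_spare):.0f}%")
```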

Relating machine downtime to spare parts inventory

Minimizing downtime requires a multi-pronged initiative involving intensive operator training, use of quality raw materials, effective preventive maintenance – and adequate spare parts. The first three set the conditions for good results. The last deals with contingencies.

Once a machine is down, money is flying out the door and there is a premium on getting it back up pronto. This scene could play out in two ways. The good one has a spare part ready to go, so the downtime can be kept to a minimum. The bad one has no available spare, so there is a scramble to expedite delivery of the needed part. In this case, the manufacturer must bear both the cost of lost production and the cost of expedited shipping, if that is even an option.

If the inventory system is properly designed, spare parts availability will not be a major impediment to machine uptime. By the design of an inventory system, I mean the results of several choices: whether the shortage policy is a backorder policy or a loss policy, whether the inventory review cycle is periodic or continuous, and what reorder points and order quantities are established.

When inventory policies for products are designed, they are evaluated using several criteria. Service Level is the percentage of replenishment periods that pass without a stockout. Fill Rate is the percentage of units ordered that is supplied immediately from stock. Average Inventory Level is the typical number of units on hand.

None of these is exactly the metric needed for spare parts stocking, though they all are related. The needed metric is Item Availability, which is the percentage of days in which there is at least one spare ready for use. Higher Service Levels, Fill Rates, and Inventory Levels all imply higher Item Availability, and there are ways to convert from one to the other. (When dealing with multiple machines sharing the same stock of spares, Item Availability gets replaced by the probability distribution of the number of spares on any given day. We leave that more complex problem for another day.)

Clearly, keeping a good supply of spares reduces the costs of machine downtime. Of course, keeping a good supply of spares creates its own inventory holding and ordering costs. This is the manufacturer’s second inventory problem. As with any decision involving inventory, the key is to strike the right balance between these two competing cost centers. See this article on probabilistic forecasting for intermittent demand for guidance on striking that balance.

 
