About the blackout of 2003

The blackout of 2003 has driven home our society’s dependence on its unbelievably complicated, interconnected infrastructure systems, such as the electric power transmission, transportation, and communication systems. This system failure, in the form of a massive blackout, will clearly lead to months if not years of analysis of the details of what went wrong and what can be done to prevent a recurrence. While this type of post-mortem analysis is both appropriate and valuable, care must be taken not to miss some of the broader questions involved in the operation of large, complex, interconnected systems such as the power transmission system. We would like to point out three particular characteristics of these systems that should be kept in mind as we analyze this blackout and formulate new policy to prevent (or rather, reduce the likelihood of) another such event.
1) Finding the detailed event or events that triggered the blackout is necessary and useful and will surely occur. However, focusing only on this triggering event overlooks the fact that it is the overall state of the power transmission system that allowed the cascading failure to spread. This “system state” is a power transmission grid stressed by years of increasing demand and supply with little improvement in the transmission infrastructure. Couple this with some hot summer days and we have a system sitting on the edge, waiting for something to push it over. While it might be possible to prevent this particular trigger from occurring again, it is certain that other unexpected triggers will occur in the future. If the system is ripe for another failure, it will be triggered by this new event and a cascade will follow. To mitigate this type of cascading failure, the entire system must be moved back from the edge, for example by better operations and control or by building more transmission lines and generators.
2) The repair, upgrade, and maintenance of the transmission system, leading to more transmission capability, will hopefully be undertaken over the next few years. This, combined with new technologies and operating procedures, will reduce (though not eliminate) the likelihood of a big blackout. However, in these large interconnected systems, people must be included as a part of the system. After 5 or 10 years with no large blackouts, we will start becoming complacent again and will be less diligent about upgrading lines before they fail and about investing in new, more reliable technologies. Consequently, our infrastructure will creep back toward the edge, and on a hot summer’s day a squirrel will chew through the insulation in a substation somewhere, or some unforeseen rare combination of circumstances will arise, and another large cascading failure will follow. This human component is an important part of the way the system works and cannot be ignored when attempting to understand the system’s long-term behavior.
3) Finally, a part of the human component is our remarkable ability to invent new, more reliable technology and ingenious new operating procedures that allow us to get more and more out of a system. However, the guaranteed reliability standards under discussion, which will drive such innovation, can actually have a counterintuitive effect on these large complex systems. As individual components become more reliable and operating procedures become even better, the systems will actually operate closer to the edge. This increased reliability of individual components will decrease the likelihood of small failures but can actually increase the likelihood of large cascading failures. This is analogous to a finely tuned, precision-machined racing car versus a reliable old pickup truck. The racing car will win a race with the pickup because it is built and tuned to get the most out of its fuel; however, without constant tuning and rebuilding it will not run at all, while the old pickup is happy with a periodic oil change and just keeps on running.
As the investigation into the blackout continues, there is a strong natural desire to assign blame to someone or something, and this something will likely be the trigger event. But this may not really be fair, since if it had not been that particular event on August 14, 2003, it would have been some other event on some other day. The reality is that the operators have been doing a remarkable job of squeezing more and more out of their individual parts of a system that is getting closer and closer to its operational limits. Unfortunately, this very ingenuity has driven the entire system even closer to the point where major failure is almost unavoidable.
These complex interconnected infrastructure systems have three distinct time scales that must be understood in order to understand the whole system: the very short time scale of the trigger and cascade (seconds to hours), the time scale for repair and upgrade of the system (weeks to years), and finally the time scale of human memory (years to decades), all playing out while the entire system is being driven by the inexorable increase in demand. In order to improve the security of our infrastructure, we must understand the entire interconnected system and not just pieces of it.



This was last changed on 22 August, 2003

This page is maintained by David Newman