IT Downtime: Assessing the Human Factor
A number of recent studies cite human error as the largest cause of centralized IT failures. However, methods for mitigating this threat don’t get much air time. IT managers seeking to make their systems more reliable would do well to take some lessons from other industries.
After the incident at Pearson Airport that left an Air France Airbus smoldering in a gully, Real Levasseur, lead investigator with Canada’s Transportation Safety Board, was asked if human error was a factor. “We don’t look at human error,” he replied. “Humans are humans, they are not machines.”
Levasseur was not dodging the question. He was emphasizing that human error is not something that can be eradicated. “The issue isn’t ‘can you engineer the perfect human being?’ There is no such thing,” explains Pierre Tremblay, VP of performance improvement and nuclear oversight for Ontario Power Generation. “You need to expect people to make mistakes.”
IT organizations would do well to adopt this viewpoint. When trouble occurs, the de facto strategy is often to blame individuals rather than address systemic problems. Firing an employee who botched a security setting might give a temporary feeling of closure, but it does nothing to improve the long-term security of the corporate database.
Fortunately, the methods used by Tremblay and others are highly adaptable to IT.
“I would say that the IT people are excellent candidates for pursuing human performance,” says Tremblay. “They’re always changing platforms, introducing new ways of doing business, putting upgrades into our work management systems, and so those areas are fraught with the possibility of human error.”
Reliable testing
Although the theoretical side of human performance is complex, many of Tremblay’s methods are quite straightforward, such as a technique called three-way communication. “For example, if I’m giving you an instruction,” Tremblay explains, “I’d go through that instruction and tell you exactly what it is that I intend for you to do. You as the receiver would repeat, restate the communication that I’ve just had — that’s the two way — and the three way would be I would confirm that I’ve heard your feedback and it is correct.”
Paul Eisen, senior director of the consulting services group at CIBC, helps various groups at the bank, including IT, adopt human performance methods. Eisen believes human error reduction begins when you first purchase an application. “Step number one — include usability issues, which include effectiveness and minimization of errors as key criteria in your purchase decision.” This goes beyond reliability statistics: Eisen insists that vendors show how a piece of software was tested, where independent labs were used, and what the test results showed.
A key problem here is that designers and operators think differently. Graham Creedy, senior manager of Responsible Care for the Canadian Chemical Producers’ Association, explains, “Typically, when something new is being designed, the designers are operating at the knowledge-based level. They’ve got the knowledge to know what’s going on, but they’re sort of exploring themselves to work out how best to operate it.” Designers include not only software vendors, but also integrators, in-house programmers, applications specialists, and anybody who sets up systems for others to follow.
Rehearsing failures
Another key area Eisen focuses on at CIBC is training operators to bring a system back to normal after it has failed. Because emergency procedures aren’t used on a regular basis, you need to make sure operators rehearse various recovery scenarios in order to avoid errors when an emergency actually occurs. Eisen notes that recovery situations often place the operator under severe stress, where they “are going to be performing sub-optimally.”
When live systems are involved, these scenarios can also be highly complex. “I think the bigger issue is when real data has to be managed at the same time. You’ve got a production environment. Somehow you’ve got to take it down to upload a patch and then you need to be able to synchronize data that’s kind of in this waiting mode. And that’s where they run into problems — if they haven’t had the opportunity to rehearse that.”
The challenge of following multi-step IT tasks can be alleviated by use of job aids — another of Eisen’s key recommendations. “There are a lot of things that happen,” explains Eisen, “that don’t have anything to confirm that they’ve been completed, that’s where the job aid comes in. Make sure you’ve done ‘X’, check off when you’ve done ‘X’.”
Ensuring that job aids get used may require some push from management, as Tremblay points out. “If we want to make these error prevention tools commonplace and ingrained in the culture, then we need to positively reinforce those behaviours out of our employees, and we have a major role both as a leader and as a supervisor in making those things happen.”
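For IT teams that want to build a job aid into their tooling, the check-off idea Eisen describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the rule that steps must be checked off in order, and the sample patch steps are all hypothetical and do not come from CIBC or any other organization quoted here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    description: str
    done: bool = False


@dataclass
class JobAid:
    """A sequential checklist: each step must be confirmed before the next."""
    title: str
    steps: List[Step] = field(default_factory=list)

    def next_step(self) -> Optional[int]:
        # Index of the first step not yet checked off, or None if finished.
        for index, step in enumerate(self.steps):
            if not step.done:
                return index
        return None

    def check_off(self, index: int) -> None:
        # Refuse to confirm a step out of sequence, so that skipped or
        # reordered steps are caught instead of passing silently.
        expected = self.next_step()
        if expected is None:
            raise ValueError("all steps are already checked off")
        if index != expected:
            raise ValueError(
                f"step {index} is out of sequence; expected step {expected}"
            )
        self.steps[index].done = True

    def is_complete(self) -> bool:
        return self.next_step() is None


def build_patch_job_aid() -> JobAid:
    # Hypothetical steps for patching a production system.
    descriptions = [
        "Notify users of the maintenance window",
        "Back up the production database",
        "Take the system offline",
        "Apply the patch",
        "Synchronize the data queued during the outage",
        "Bring the system back online",
    ]
    return JobAid("Production patch", [Step(d) for d in descriptions])


if __name__ == "__main__":
    aid = build_patch_job_aid()
    aid.check_off(0)
    print("Next:", aid.steps[aid.next_step()].description)
```

Forcing the operator to confirm each step in order is one design choice among several. A paper checklist taped to a console serves the same purpose if it actually gets used.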
The right technique
Managers also need to understand the kind of human error that is being targeted. “People who have looked into human error,” Creedy explains, “find that it’s often important to distinguish between whether things are slips or lapses, mistakes or violations, because to correct those kind of behaviours, you often need different techniques or different processes applied.”
In human performance terminology, a lapse is an occurrence where a step is performed out of sequence, and a slip is where a step is either skipped or done incorrectly. Sequential checklist-type tools with clearly defined steps can alleviate many of these kinds of errors.
With mistakes and violations, things get more complicated. A mistake is defined as an instance in which an individual misunderstands a situation, and consequently undertakes the wrong action. Error-reduction interventions here could include better training, clearer documentation, or more frequent communication between different groups that interact with a given system.
A violation is an intentional departure from the correct procedure. This may be justified in an emergency, but violations are frequently committed by people who think they know more than everybody else. This is what happened at Chernobyl.
Knowing what’s right
Judgment is probably the ultimate key to preventing mistakes and ill-conceived violations, and this can only be improved through a holistic view of the environment. Michael V. Brown, president of The New Standard Institute, which provides consulting and training for industrial facilities, explains. “In my opinion, there has to be some sort of fundamental understanding of what’s right or wrong as far as repairs are concerned or operations are concerned — some fundamental understanding of what the process is all about on the part of the people who do the operations and maintenance.” In an IT context, this means that anybody who touches a production system should know how the system works, how it interacts with other systems, and where the vulnerabilities are.
Brown also cautions those who would try to document their way out of this problem. Designers frequently put procedures together to cover themselves, but fail to address the usability of these instructions. Documents that are too lengthy simply will not get read. “People will use what they already know about things and just move ahead,” says Brown.
Eisen reflects that although humans are not perfect, systems are highly dependent on the unique capabilities that humans have. “They need to be able to respond to situations, they need to use human capabilities such as matching patterns, and they need to use higher level cognitive functions. That’s why we have them inserted there rather than just having the machinery respond itself. Then we have to accept the fact that sometimes they’re wrong.”