IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. SMC-10, NO. 7, JULY 1980
A Rule-Based Model of Human Problem
Solving Performance in Fault
Diagnosis Tasks
WILLIAM B. ROUSE, SENIOR MEMBER, IEEE, SANDRA H. ROUSE, AND SUSAN J. PELLEGRINO
Abstract-The sequences of tests chosen by humans in two fault diagnosis tasks are described in terms of a model composed of a rank-ordered set of heuristics or rules-of-thumb. The identification and evaluation of such
models are discussed. The approach is illustrated by modeling the choices
of test sequences of 118 subjects in one task and 36 in the other task. The
model and subjects are found to agree somewhat over 90 percent of the
time.
INTRODUCTION
THIS PAPER is concerned with the problem of describing how humans perform fault diagnosis tasks.
The overall goal of the research upon which this paper is
based focuses on the development of an understanding of
human fault diagnosis abilities and the design of methods
of training that will enhance the human's abilities. In
pursuit of this goal a series of experiments have been
performed utilizing both context-free [1]-[4] and context-
specific [5] fault diagnosis tasks. The context-specific tasks
have involved diagnosis of faults in computer-simulated
automobile and aircraft power plants. An upcoming study
[6] will focus on the human's ability to transfer problem
solving skills learned in such simulations to situations
involving diagnosis of real equipment.
These empirical studies have thus far resulted in a data
base that includes data for over 150 subjects, most of whom were maintenance trainees, and approximately
12000 fault diagnosis problems. In an effort to succinctly
summarize such a large and varied quantity of data,
several mathematical modeling notions have emerged. A
model based on the theory of fuzzy sets, as well as several
pattern-evoked heuristics or rules-of-thumb, was found to
be quite adequate for predicting the average number of
tests for a subject to successfully solve a fault diagnosis
problem [2], [7]. Considering the time it takes to solve a
fault diagnosis problem, various measures of task com-
plexity were investigated, and an information theoretic
measure produced a 0.84 overall correlation with time until problem solution [8].

Manuscript received August 17, 1979; revised March 6, 1980. This research was supported by the U.S. Army Research Institute for the Behavioral and Social Sciences under Grant DAHC 19-78-G-0011 and Contract MDA 903-79-C-0421.
W. B. Rouse and S. H. Rouse are with Delft University of Technology, The Netherlands, on leave from the University of Illinois, Urbana, IL 61801.
S. J. Pellegrino is with McDonnell Douglas Automation Company, St. Louis, MO 63166.
As this research has progressed, it has become apparent
that global performance measures such as number of tests
and time until problem solution do not provide enough
information to understand fully human problem solving
performance in fault diagnosis tasks. To overcome this
difficulty, it was decided that a model of how subjects
made each test was needed. While the previously men-
tioned fuzzy set model could have provided the basis for
this effort, a more direct approach was chosen.
The research described in this paper is based on a
fundamental hypothesis that human performance in fault
diagnosis tasks can be described by a rank-ordered set of
rules-of-thumb or heuristics. Before explaining the details
of this hypothesis, literature relating to the human's use of
heuristics in fault diagnosis tasks will be reviewed. Also,
two fault diagnosis tasks will be discussed. These tasks
will provide a framework within which the proposed rule-
based model can be explained.
BACKGROUND
Several investigators have studied the human's abilities
to employ the half-split heuristic whereby one attempts to
choose tests that will result in the maximum reduction of
uncertainty. Goldbeck and his colleagues [9] found that
subjects could only successfully implement this strategy
for relatively simple problems unless a rather intensive
training program was employed. Mills [10] had subjects
locate faults in series circuits where the probabilities of
failure were not uniformly distributed and found that the
half-split strategy was 14 percent better than subjects in
terms of number of tests until solution.
Bond and Rigney [11] compared the performance of
electronics technicians to a Bayesian model that optimally
updated probabilities of component failures based on the
results of tests. They found that the model agreed with
subjects' component replacement choices approximately
50 percent of the time. Further, they found that the match
of model and subjects was enhanced if subjects started
with good a priori estimates of component failure proba-
bilities. Stolurow and his colleagues [12] also considered
the human's use of failure probabilities as well as repair
times. They show that the replacement policy that mini-
mizes overall expected repair time is to replace compo-
nents in order of increasing value of repair time divided
by failure probability. To investigate the real-world appli-
cability of this rule-of-thumb, they evaluated the abilities
of maintenance instructors to estimate failure probabilities
and repair times. They found significant disagreement
among individuals.
Several investigators have represented human perfor-
mance in fault diagnosis tasks in terms of various routines
that are evoked under particular conditions. Rasmussen
and Jensen [13] analyzed extensive verbal protocols of
electronics technicians and identified three basic search
routines: topographic, functional, and search based on
specific fault characteristics. Wescourt and Hemphill [14]
used a procedural network model to describe debugging
of computer programs while Brown and Burton [15] em-
ployed a procedural network model to depict problem
solving in simple algebra tasks. Procedural network mod-
els are basically a set of routines and a structure which
describes the flow of control among routines. The model
to be presented in this paper is somewhat related to
procedural network models except that its control struc-
ture is only implicit and further, its rules are too elemental
to be classified as routines.
A common problem faced by those who study the rules,
heuristics, routines, procedures, etc. employed by humans
in problem solving tasks involves methodology. Identify-
ing rules and relationships among rules can be quite
difficult. Rasmussen and Jensen [13] as well as Wescourt
and Hemphill [14] refer to this problem. Rigney and
Towne [16] have formulated the basis of a methodology
for serial action tasks. However, this methodology does
not appear to be applicable to the types of pattern-evoked
problem solving behavior that is of interest in this paper.
This topic will be discussed in greater detail later. At this
point, in order to focus this discussion, two particular
fault diagnosis tasks will be considered.
TWO FAULT DIAGNOSIS TASKS
The following two tasks both involve troubleshooting of
graphically displayed networks. Since the motivation for
developing these two tasks is amply documented
elsewhere, e.g., [1], [2], they will only be briefly reviewed
here.
Task One
An example of Task One is shown in Fig. 1. This
display was generated on a Tektronix 4010 by a DEC
System 10. These networks operate as follows. Each node
or component has a random number of inputs. Similarly,
a random number of outputs emanate from each compo-
nent. Components are devices that produce either a one or a zero. Outputs emanating from a component carry the value produced by that component. A component will produce a one if

1) all inputs to the component carry values of one, and
2) the component has not failed.
Fig. 1. Example of Task One.
If either of these two conditions is not satisfied, the
component will produce a zero. Thus, components are like
AND gates. If a component fails, it will produce values of
zero on all the outputs emanating from it. Any compo-
nents that are reached by these outputs will in turn
produce values of zero. This process continues and the
effects of a failure are thereby propagated throughout the
network.
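The propagation of a failure through a Task One network can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation: the network representation (a map from each component to the components feeding it) and the helper names are hypothetical.

```python
# Sketch of the Task One network logic described above. Each component
# is an AND gate: it outputs 1 only if it has not failed and every
# input it receives carries a 1. A failed component outputs 0, and the
# zeros propagate to everything downstream of it.

def propagate(inputs_of, failed):
    """Compute the output value of every component.

    inputs_of maps each component to the list of components feeding it
    (an empty list for components with no inputs); `failed` is the one
    failed component. Components are processed in topological order,
    which Task One's feed-forward networks guarantee exists.
    """
    value = {}
    for c in topological_order(inputs_of):
        if c == failed:
            value[c] = 0                   # a failed component outputs zero
        else:
            value[c] = int(all(value[i] == 1 for i in inputs_of[c]))
    return value

def topological_order(inputs_of):
    # Kahn's algorithm over the feed-forward network.
    remaining = {c: set(ins) for c, ins in inputs_of.items()}
    order = []
    while remaining:
        ready = [c for c, ins in remaining.items() if not ins]
        for c in ready:
            order.append(c)
            del remaining[c]
        for ins in remaining.values():
            ins.difference_update(ready)
    return order

# A three-component chain 1 -> 2 -> 3: failing component 2 forces
# zeros on everything downstream of it.
net = {1: [], 2: [1], 3: [2]}
print(propagate(net, failed=2))   # {1: 1, 2: 0, 3: 0}
```

The displayed symptoms of a problem are then simply the components whose computed outputs are zero.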
A problem began with the display of a network with the
outputs indicated, as shown on the right side of Fig. 1.
Based on this evidence the subject's task was to "test"
connections until the failed component was found. All
components were equally likely to fail, but only one could
fail within any particular problem. Subjects were in-
structed to find the failure in the least amount of time
possible, while avoiding all mistakes and not making an
excessive number of tests.
The upper left side of Fig. 1 illustrates the manner in
which connections were tested. An asterisk was displayed
to indicate that subjects could choose a connection to test.
They entered commands of the form k1,k2 and were then
shown the value carried by the connection. If they re-
sponded to the asterisk with a simple "return," they were
asked to designate the failed component. Then, they were
given feedback about the correctness of their choice, and
then, the next problem was displayed.
Task Two
Task One is fairly limited in that only one type of node
or component is considered. Further, all connections are
feed-forward and thus, there are no feedback loops. To
overcome these limitations, a second troubleshooting task
was devised so as to include two types of components as
well as feedback loops.
Fig. 2 illustrates an example of Task Two. As with Task
One, inputs and outputs of components can only have
values of one or zero. A value of one represents an
acceptable output while a value of zero represents an
unacceptable output.
Fig. 2. Example of Task Two.
A square component will produce a one if
1) all inputs to the component carry values of one, and
2) the component has not failed.
If either of these two conditions is not satisfied, the
component will produce a zero. Thus, square components
are like AND gates.
A hexagonal component will produce a one if
1) any input to the component carries a value of one, and
2) the component has not failed.
As before, if either of these two conditions is not satisfied,
the component will produce a zero. Thus, hexagonal com-
ponents are like OR gates.
The square and hexagonal components will henceforth
be referred to as AND and OR components, respectively.
However, it is important to emphasize that the ideas
discussed here have import for other than just logic
circuits [1], [2]. As a final comment on these components,
the simple square and hexagonal shapes were chosen in
order to allow rapid generation of the problems on a
graphics display.
As with Task One, all components were equally likely
to fail, but only one component could fail within any
particular problem. Subjects obtained information by test-
ing connections between components (see upper left of
Fig. 2). Tests were of the form k1,k2 where the connection
of interest was an output of component k1 and an input of
component k2. The instructions to the subjects were the
same as used for Task One. Namely, they were to find the
failure as quickly as possible, avoid all mistakes, and
avoid making an excessive number of tests.
Notation
Each of the networks used for Tasks One and Two can
be described by its reachability matrix R. Element rij of R equals one if a path exists from component i to component j. Otherwise, rij equals zero. R can be computed from the connectivity matrix C of the network. Element cij of C equals one if component i is (directly) connected to component j. Otherwise, cij equals zero.
The human's knowledge of the state of component i will be denoted by si. Values of si=0 or si=1 indicate that the
human knows the output or state of component i, either
because it is one of the displayed outputs or because it is
the result of a test. When a problem begins, the set of
components for which si=0 constitutes the symptoms of
the failure.
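The computation of R from C mentioned above can be sketched with Warshall's transitive-closure algorithm. This is one standard way to do it, not necessarily the authors' method; component indices here are 0-based for convenience.

```python
# Minimal sketch of the notation above: the reachability matrix R
# computed from the connectivity matrix C by Warshall's algorithm.

def reachability(C):
    """R[i][j] = 1 iff a path exists from component i to component j."""
    n = len(C)
    R = [row[:] for row in C]          # start from the direct connections
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if R[i][k] and R[k][j]:
                    R[i][j] = 1        # i reaches j via intermediate k
    return R

# Chain 0 -> 1 -> 2: component 0 reaches 2 although not directly connected.
C = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
print(reachability(C)[0][2])   # 1
```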
A RULE-BASED MODEL
As noted earlier, the hypothesis upon which the re-
search reported in this paper was based involved viewing
the sequences of tests chosen by subjects as being gener-
ated by a rank-ordered set of heuristics or rules-of-thumb.
The idea of such a rule-based model closely resembles
Newell's production system models [17]. Basically, a pro-
duction is a situation-action pair where the situation side
is a list of things to watch for and the action side is a list
of things to do. A production system is a rank-ordered set
of productions where the actions resulting from one pro-
duction can result in situations that cause other produc-
tions to execute. In other words, a production system is a
rank-ordered set of pattern-evoked rules of action such
that actions modify the pattern and thereby evoke other
actions.
Newell has used production system models to describe
human information processing. He views long-term mem-
ory (LTM) as composed entirely of an ordered set of
productions while short-term memory (STM) holds an
ordered set of symbolic expressions. The model processes
information by observing the contents of the STM on a
last-come first-served basis. A match occurs when a sym-
bol or symbols in the STM match the situation side of a
production in LTM. Then, an action is evoked which
results in new symbols being deposited in the STM. This
process of pattern-evoked actions goes on continually
and, as a result, people play chess, solve arithmetic prob-
lems, etc. [17].
While production system models were originally devel-
oped to describe basic information processing such as
exhibited, for example, in reaction time tasks [18], they are
somewhat cumbersome if one attempts to view realisti-
cally complex tasks in terms of symbol manipulations in
the human's STM. This has resulted in a somewhat more
macroscopic application of production system models to
tasks such as air traffic control [19] and aircraft piloting
[20]. In these models the notion of a rank-ordered set of
pattern-evoked rules is retained, but the level at which the
task is viewed is more task-oriented with the specific
contents of the STM and LTM not explicitly considered.
As mentioned earlier, the rule-based model to be pre-
sented here follows the spirit of the production system
model approach, at least at a task-oriented level. The
model is depicted in Fig. 3. It is assumed that the human
Fig. 3. Structure of the model.
scans the network looking for patterns that satisfy any of
his rank-ordered set of rules. For example, the first rule is
probably a "stopping rule" that checks to see whether or
not sufficient information is available to designate the
failed component. If sufficient information is not avail-
able, then the human must collect more information by
making tests. Rules 2 through N look for patterns that
satisfy the prerequisites for particular types of tests. After
a test is made, the human's state of knowledge of the
network is updated on the basis of the results of the test.
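The scanning process just described can be sketched as a simple production-system loop: rules are tried in rank order, and the first rule whose situation side matches determines the action. The state representation, rule format, and helper names below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the model structure in Fig. 3. Each rule is a
# (matches, act) pair: matches(state) tests the situation side, and
# act(state) returns either ('stop', diagnosis) or ('test', choice).

def run_model(rules, state, update):
    """Apply the rank-ordered rules until the stopping rule fires.

    `update` applies a test result to the state of knowledge."""
    while True:
        for matches, act in rules:          # highest-ranked rule first
            if matches(state):
                kind, choice = act(state)
                if kind == 'stop':
                    return choice           # failure designated
                update(state, choice)       # make the test, record the result
                break
        else:
            raise ValueError("no rule matches the current situation")

# Toy usage: suspects are components; testing eliminates one suspect,
# and the stopping rule fires once a single suspect remains.
state = {'suspects': [3, 7, 9]}
rules = [
    (lambda s: len(s['suspects']) == 1,
     lambda s: ('stop', s['suspects'][0])),    # rule 1: stopping rule
    (lambda s: len(s['suspects']) > 1,
     lambda s: ('test', s['suspects'][0])),    # rules 2..N: gather information
]
update = lambda s, tested: s['suspects'].remove(tested)
print(run_model(rules, state, update))   # 9
```

Note that the control structure is implicit, as stated above: no rule ever calls another; the actions simply change the pattern that the next scan sees.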
With the structure of the rule-based model defined, the
next issue is the identification of rules and rank orderings.
To a certain extent, identification can be considered as a
general problem. The next section of this paper will dis-
cuss these general considerations. However, since ap-
propriate rules and rank orderings are particular to
specific tasks, this general discussion will be somewhat
brief.
IDENTIFICATION OF RULE-BASED MODELS
Three aspects of identification are of concern: identifi-
cation of rules, identification of rank orderings, and
evaluation of identified models. While it seems reasonable
to hope that identification of rank orderings and evalua-
tion could be performed with a computer program, it
appears that identification of rules is best left to the
judgment of humans who thoroughly understand the task
of interest [13], [14]. Thus, for the research reported in this
paper, candidate sets of rules were developed by having
experts view replays of sessions of subjects solving fault
diagnosis problems. While this procedure may seem open
to arbitrary decisions, it actually can work quite well since
the value of the experts' choices becomes readily apparent
when one attempts to algorithmically identify rank order-
ings and evaluate the resulting models. In other words, if
the judges employed are not really experts, the resulting
rule-based models will not provide good descriptions of
problem solving behavior.
Given a set of candidate rules, the process of identify-
ing rank orderings begins by forming a preference matrix
P with elements pij. The value of pij denotes the number of times rule i was chosen when rule j was available (i.e., the
number of times rule i is preferred to rule j). The prefer-
ence matrix is formed by considering the problem before
each test is made and classifying each possible test in
terms of the rule most likely associated with that test.
While one can easily envision the possibility of multiple
rules being associated with each test, allowing such am-
biguity into the analysis can present difficulties unless the
interaction of human experts is allowed. This issue will be
considered further during the discussion of the analysis
for Task Two.
The preference matrix P is formed in the following
manner. For each test choice by the subject, the alterna-
tive choices available immediately prior to that choice are
determined. If rules i and j were available and the subject preferred i (as evidenced by his test choice), then pij is incremented. Regardless of the number of instances of rule j that are available, pij is only incremented by one. The process of incrementing pij is carried out for every element of the ith row of P for which the corresponding rule was available when the subject chose rule i.
The rank ordering can be directly identified from the
preference matrix. The procedure is quite straightforward
and only requires a simple computer program. Basically
one tries to choose each entry into the rank ordering so as
to minimize conflicts. Conflicts occur when rule i precedes
rule j in the rank ordering but pji > 0. Summing pji over all
j that are assumed to be less preferred than i yields the
overall number of conflicts.
Identifying a rank ordering is an iterative process. On
each iteration, the rule chosen to enter the rank ordering
is assumed to be preferred to all those rules not yet in the
rank ordering. To minimize conflicts, the rule chosen to
enter is the one whose overall number of conflicts is
smallest. In that way, one obtains the overall rank order-
ing with minimum number of conflicts.
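The iterative procedure above can be sketched as a greedy loop over the preference matrix. The sketch below follows the description in the text; the exact tie-breaking behavior is an assumption.

```python
# Sketch of rank-ordering identification: on each iteration, enter the
# rule whose conflict count -- the sum of p[j][i] over rules j not yet
# entered -- is smallest, since the entered rule is assumed preferred
# to all rules still outside the ordering.

def rank_order(P):
    """P[i][j] = number of times rule i was preferred to rule j.
    Returns (ordering, total conflicts), most preferred rule first."""
    remaining = set(range(len(P)))
    order, total = [], 0
    while remaining:
        # conflicts for candidate i: times a not-yet-entered j was preferred to i
        best = min(remaining,
                   key=lambda i: sum(P[j][i] for j in remaining if j != i))
        total += sum(P[j][best] for j in remaining if j != best)
        order.append(best)
        remaining.remove(best)
    return order, total

# Rule 1 was preferred to rule 0 five times, rule 0 to rule 1 once,
# and rule 2 was rarely preferred; rule 1 therefore enters first.
P = [[0, 1, 4],
     [5, 0, 6],
     [0, 2, 0]]
print(rank_order(P))   # ([1, 0, 2], 3)
```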
The identified model can be evaluated by having it
perform the same task that the human performed and
determining whether or not the model makes the same or
similar tests as made by the human. One particular diffi-
culty with this method of evaluation is that once the
model and human disagree at all, they will henceforth
each be making decisions on the basis of different infor-
mation sets. In other words, if the model chooses a test
different than the human, then it will have knowledge of a
test result that the human does not have and, similarly,
the human will have knowledge of a test result that is
unavailable to the model. From that point on, it is possi-
ble that the choices of model and human will diverge.
To avoid this difficulty, the following procedure was
employed. After the model chose a test, its choice was
compared to the human's and then, the test chosen by the
human was actually employed. In other words, for the
purpose of evaluating the model, its own test choices were
appraised. However, for the purpose of updating the state
of knowledge of the network, the human's test choices
were implemented. In this way the model always made
decisions on the basis of the same information as availa-
ble to the human.
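The evaluation procedure just described can be sketched as follows. The helpers model_choice, rule_of, and update are assumed to exist and are only named here; the scoring loop itself follows the text (the model's choice is scored, but the human's test is the one implemented, so both always decide from the same information).

```python
# Hedged sketch of the evaluation procedure described above.
# model_choice(state) -> the model's next test;
# rule_of(state, test) -> the rule index a given test falls under.

def evaluate(model_choice, rule_of, human_tests, state, update):
    """Return (fraction of same tests, fraction of similar tests)."""
    same = similar = 0
    for human_test in human_tests:
        m = model_choice(state)
        if m == human_test:
            same += 1                       # exactly the same test
        if rule_of(state, m) == rule_of(state, human_test):
            similar += 1                    # same rule, hence a similar test
        update(state, human_test)           # the human's test is implemented
    n = len(human_tests)
    return same / n, similar / n

# Toy check: a "model" that always tests the lowest-numbered untested
# connection, with every test classified under a single rule.
state = {'untested': [2, 5, 8]}
result = evaluate(lambda s: min(s['untested']),        # model_choice
                  lambda s, t: 0,                      # rule_of (one rule)
                  [5, 2, 8],                           # the human's tests
                  state,
                  lambda s, t: s['untested'].remove(t))
print(result)   # 2 of 3 same tests, 3 of 3 similar tests
```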
To determine whether or not the model and human
agree, one might ask if they made exactly the same test.
While this criterion is useful, and will be employed in later
analyses, it can be somewhat too strict. For example,
consider a typical situation where one knows that a com-
ponent's output is unacceptable and has to test the com-
ponent's inputs to determine if the component has failed
or if one of its inputs is unacceptable. If there are multiple
inputs, which one should be tested first? It is quite likely
that almost any criterion would indicate that all alterna-
tives are equally desirable. In such a situation, it is dif-
ficult for a model to match the specific test chosen by a
human. For this reason the model proposed in this paper
has also been evaluated in terms of how often it chose
tests that were similar to the human's choices in the sense
that both tests were the result of using the same rule.
Thus, two evaluation criteria were employed. The first
criterion considered the percentage of tests where model
and human made the same test, while the second consid-
ered the percentage of tests where similar tests (i.e., same
rules) were chosen. This method of evaluation was em-
ployed by Bond and Rigney [11] in their assessment of the
degree of correspondence between humans and perfect
Bayesian troubleshooters.
This section has outlined the procedures whereby rule-
based models were identified and evaluated. These proce-
dures were followed for the analyses of Tasks One and
Two that will now be discussed. However, as the reader
will see, some modifications were necessary for the analy-
sis of Task Two.
ANALYSIS FOR TASK ONE
The following discussion is based on Pellegrino's thesis
[21]. She analyzed Task One data collected during three
transfer of training studies, two of which have been previ-
ously reported [3], [4] while the third was performed to
test a new training idea which will be discussed in a later
section of this paper. A total of 118 maintenance trainees
served as subjects in these three experiments. The data to
be considered here (i.e., Trial 4, the transfer trial) is based
on ten Task One problems where all subjects performed
the exact same problems with only the training on previ-
ous problems differing among subjects. In the first two
experiments, one-half of the subjects were trained with
computer aiding (see [1] for a description) while the other
one-half of the subjects did not receive aided training. In
the third experiment, one-third of the subjects received
computer aiding, one-third received no aiding, and one-
third received rule-based training which will later be de-
scribed.
Through an iterative process, Pellegrino arrived at the
twelve rules described in Table I. Before explaining the
motivation for these rules, the phrase "active compo-
nents" requires definition. As noted earlier, at the start of
a problem, the set of components for which si=0 are
called the symptoms. At first, all of these components are
of interest. However, after one finds a component within
the network that is the source of any of the original si = 0
components, then one can focus on this source or ancestor
component while its descendants no longer need to be actively considered. More formally, if si=0 and sj=0 while rij=1, then component j can be considered inactive. This concept is explained in greater detail elsewhere [1], [8].
Now, we will consider the origin of the twelve rules in
Table I in more detail. Rules 1, 2, and 3 reflect a situation
where a subject is focusing on a single si=0 component
and testing its inputs. These are weak rules in the sense
that it would be better if the subject considered tests of
components that affect all the active si=0 components
and none of the si=1 components. An exception to this
generalization occurs if there is only one test that satisfies
this stronger condition and there is more than one active
si=0 component. In such a situation, the subject could infer the test result (i.e., si=0) and thereby avoid the test.
Thus, rule 3 is not a good choice.
Rules 4, 5, and 6 are stronger than rules 1, 2, and 3
because they deal with situations where either there is
only one choice (rule 5) or where the existence of multiple
alternatives prevents direct inference of the test result
(rules 4 and 6). Rules 7, 8, and 9 are even stronger
because they reach the symptoms rather than merely
connect to them.'
Rules 10, 11, and 12 represent situations that would
also satisfy rules 7, 8, and 9, respectively. However, the
satisfaction of rules 7, 8, or 9 is serendipitous rather than
intentional. Instead, rules 10, 11, and 12 represent situa-
tions where the subject is testing the inputs of a compo-
nent, the output of which he recently found to be si=0.
These rules are called "tracing back" rules because they
reflect a strategy of testing inputs to si=0 components
until another si=0 component is found and then, testing
its inputs, etc.
Using the twelve rules in Table I and the identification
algorithm discussed earlier, rank orderings were obtained
'While any component that connects to another component also
reaches that component, we are using "reach" to denote situations where
the path from one component to another contains at least one interven-
ing component. Thus, our use of the word "reach" should be read
"reaches but does not connect."
TABLE I
RULES FOR TASK ONE
RULE DESCRIPTION
1  Test the output of a component that connects to at least one, but not all, active components for which si=0.
2  Test the output of a component that connects to at least one active component for which si=0 and at least one active component for which si=1.
3  Test the output of the only component that connects to all (>1) active components for which si=0.
4  Test the output of any one of the components (>1) that connects to all (>1) active components for which si=0.
5  Test the output of the only component that connects to the only active component for which si=0.
6  Test the output of any one of the components (>1) that connects to the only active component for which si=0.
7  Test the output of any component that reaches at least one, but not all, active components for which si=0.
8  Test the output of any component that reaches at least one active component for which si=0 and at least one active component for which si=1.
9  Test the output of any component that reaches all active components for which si=0.
10 Same as rule no. 7 and also, component must connect to a component for which a previous test result was si=0.
11 Same as rule no. 8 and also, component must connect to a component for which a previous test result was si=0.
12 Same as rule no. 9 and also, component must connect to a component for which a previous test result was si=0.
for each subject. Evaluating these models, the results in
Tables II-V were produced. Considering the overall re-
sults for all three experiments (Table V), use of the rank
ordering identified for each individual subject resulted in
the model making the same test 52 percent of the time and
a similar test 89 percent of the time. If the rank ordering is
based on the whole training group rather than each indi-
vidual, the rank orderings in Table VI result and the
percentages decrease to 45 percent and 78 percent for
same test and similar test, respectively. Thus, individual
differences account for about 10 percent of the test
choices.
If one employs a rank ordering averaged across training
groups, the percentages only decrease slightly, in terms of
the overall results for all three experiments. However, the
results for the first experiment (Table II) show a much
greater effect of training with the percentages for unaided
training changing from 47 percent and 83 percent to 43
percent and 74 percent for same test and similar test,
respectively. This is quite consistent with the overall trans-
fer of training results which indicated that computer
aiding only resulted in a sizable transfer for the first
experiment [3].
TABLE II
RESULTS FOR FIRST EXPERIMENT WITH TASK ONE
% SIMILAR TESTS % SAME TESTS
MODEL UNAIDED AIDED UNAIDED AIDED
INDIVIDUAL 90 87 54 49
AVERAGE WITHIN TRAINING 83 76 47 43
AVERAGE ACROSS TRAINING 74 75 43 42
AGGREGATE 95 92 54 49
TABLE III
RESULTS FOR SECOND EXPERIMENT WITH TASK ONE
% SIMILAR TESTS % SAME TESTS
MODEL UNAIDED AIDED UNAIDED AIDED
INDIVIDUAL 88 90 50 52
AVERAGE WITHIN TRAINING 76 77 44 46
AVERAGE ACROSS TRAINING 76 76 44 45
AGGREGATE 93 95 50 52
This conclusion is supported by comparing the rank
orderings in Table VI for unaided and aided subjects in
the first experiment. The most important difference is the
fact that subjects who received aided training valued rule
9 (a powerful rule) to a much greater extent than subjects
who received unaided training. This difference does not
appear in the rank orderings for the second and third
experiment. Thus, one can conclude that the rule-based
model proposed here is appropriately sensitive to training.
One difficulty with the twelve rules in Table I is the fact
that it is difficult to argue that subjects consciously used
some of these rules. For example, rule 2 requires that the
test choice connect to a component for which si=1. While
there is considerable evidence that subjects do not use the
si=1 information to their benefit, there is no evidence that
they consciously use it to their detriment. In fact, many
subjects seem to ignore this information [2], [7]. From that
perspective, rules 1, 2 and perhaps 3 might seem identical
to subjects. One can make similar arguments for aggregat-
ing rules 4, 5, and 6; rules 7, 8, and 9; and rules 10, 11,
and 12. In this way, one obtains four aggregate rules.
1) Test an input of any active component for which
si=0.
2) Test the output of any component that connects to all active components with si=0.
3) Test the output of any active component that reaches any or all active components with si=0.
4) Test an input of the component for which si=0 was determined with the last test (termed tracing back).
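The four aggregate rules lend themselves to a compact sketch as predicates over the notation introduced earlier. The classification below is a loose illustration under stated assumptions, not the paper's exact definition: C and R are the connectivity and reachability matrices, zeros is the set of active components known to have si=0, and last_zero is the component whose si=0 output the most recent test revealed (or None).

```python
# Loose sketch of the four aggregate rules as predicates. A test is
# identified with the component i whose output it observes.

def matching_aggregate_rules(i, C, R, zeros, last_zero):
    """Return the set of aggregate rules (1-4) that a test of
    component i's output satisfies."""
    rules = set()
    if any(C[i][j] for j in zeros):
        rules.add(1)          # input of some active si=0 component
    if zeros and all(C[i][j] for j in zeros):
        rules.add(2)          # connects to all active si=0 components
    if any(R[i][j] and not C[i][j] for j in zeros):
        rules.add(3)          # reaches (but does not connect to) an si=0 component
    if last_zero is not None and C[i][last_zero]:
        rules.add(4)          # tracing back from the last si=0 result
    return rules

C = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]      # chain 0 -> 1 -> 2
R = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
print(matching_aggregate_rules(1, C, R, {2}, last_zero=2))   # {1, 2, 4}
print(matching_aggregate_rules(0, C, R, {2}, last_zero=2))   # {3}
```

As the example shows, a single test can satisfy several aggregate rules at once, which is why the rank ordering is needed to decide which rule is credited.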
From Table V, one can see that this aggregate model
results in 52 percent and 94 percent for same test and
similar test, respectively. Thus, the basic result of aggre-
gating twelve rules into four was to increase the per-
centage of similar tests from 89 percent to 94 percent.
TABLE IV
RESULTS FOR THIRD EXPERIMENT WITH TASK ONE
% SIMILAR TESTS % SAME TESTS
MODEL UNAIDED AIDED RULE-BASED UNAIDED AIDED RULE-BASED
INDIVIDUAL 89 89 87 53 57 50
AVERAGE WITHIN TRAINING 80 80 79 47 51 43
AVERAGE ACROSS TRAINING 80 80 79 47 51 43
AGGREGATE 95 92 92 53 57 50
TABLE V
OVERALL RESULTS FOR TASK ONE
MODEL % SIMILAR TESTS % SAME TESTS
INDIVIDUAL 89 52
AVERAGE WITHIN TRAINING 78 45
AVERAGE ACROSS TRAINING 77 45
AGGREGATE 94 52
TABLE VI
RANK ORDERINGS FOR TASK ONE
TRAINING RANK-ORDERING
EXPERIMENT NO. 1
UNAIDED 5 6 4 11 3 12 9 10 2 7 1 8
AIDED 5 6 4 11 9 12 3 8 7 2 10 1
ACROSS TRAINING 5 6 4 11 9 12 3 7 8 2 10 1
EXPERIMENT NO. 2
UNAIDED 5 6 11 4 9 12 3 7 8 10 2 1
AIDED 5 6 4 11 3 9 12 7 8 2 10 1
ACROSS TRAINING 5 6 4 11 9 3 12 7 8 10 2 1
EXPERIMENT NO. 3
UNAIDED 5 6 4 9 11 3 12 7 8 10 2
AIDED 6 4 5 9 11 12 3 10 7 8 1 2
RULE-BASED 5 6 4 9 12 11 3 7 10 8 2 1
ACROSS TRAINING 5 6 4 9 11 12 3 7 10 8 2 1
This rather small improvement might lead one to believe
that the original twelve-rule model was perhaps too fine-
grained.
ANALYSIS FOR TASK TWO
The Task Two data to be discussed here was collected
in two transfer of training studies, one of which was
previously reported [4] while the other was performed to
investigate the effects of rule-based training which, as
noted earlier, will be discussed later in the paper. The data
to be considered was generated by 36 maintenance
trainees who served as subjects. From the first experiment,
data for the 15 subjects who made no more than one
incorrect diagnosis for the ten problems of Trial 7 (i.e., the
transfer trial) were selected for analysis. The data for the
33 subjects in the first experiment who made more than
one incorrect diagnosis have not as yet been analyzed.
However, as stressed by Brown and Burton [15], modeling
of human behavior when incorrectly performing a task is
a very interesting endeavor and thus will be pursued in
the future. From the second experiment, data from the
last two problems for 21 of the 24 subjects was analyzed.
Due to technical difficulties, the data for the other three
subjects could not be considered.
Since the second experiment with Task Two did not
involve OR components, it was somewhat simpler to
analyze and therefore was considered first before analyz-
ing the data from the first experiment. Without OR com-
ponents, the main difference between Tasks One and Two
was the presence of feedback loops in Task Two. Loops
caused two new rules in particular to emerge. One of these
involved testing the outputs of components which had no
inputs. This rule is useful because it eliminates the particu-
larly troublesome problem of getting stuck in a loop. The
second new rule involved starting at components with no
inputs and, because these components were typically
toward the left side of the network (see Fig. 2), tracing
forward to the right while carefully avoiding loops. This
type of rule can be contrasted with the tracing back that
occurred when subjects started at the zero output compo-
nents on the right of the network and traced to the left in
search of the source of the zero outputs. As noted earlier,
tracing back was also evident in Task One.
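As a hedged illustration (our own sketch, with an assumed connection-matrix representation in which c[i][j] = 1 means component i feeds component j), the no-input rule and the loop-avoiding forward trace might look like:

```python
# Hypothetical sketch of tracing forward from no-input components while
# avoiding feedback loops; c[i][j] = 1 means component i feeds component j.

def no_input_components(c):
    """Components that receive no inputs (natural starting points)."""
    n = len(c)
    return [j for j in range(n) if not any(c[i][j] for i in range(n))]

def trace_forward(c, start):
    """Visit components downstream of start, skipping components already
    visited so that feedback loops cannot trap the trace."""
    order, seen, stack = [], {start}, [start]
    while stack:
        i = stack.pop()
        order.append(i)
        for j, conn in enumerate(c[i]):
            if conn and j not in seen:
                seen.add(j)
                stack.append(j)
    return order
```

With c = [[0, 1, 0], [0, 0, 1], [0, 1, 0]] (a feedback loop between components 1 and 2), no_input_components returns [0] and trace_forward(c, 0) visits each component exactly once despite the loop.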
One additional rule was of use in describing behavior
during the second Task Two experiment. It was termed
splitting: a few subjects (5 of 21) appeared to use
fairly skillful inferences to choose a test such that the
results of the test would split the set of feasible sources of
the symptoms into approximately two halves. Considering
the complexity of Task Two, this rule can be viewed as a
somewhat sophisticated approximation to the half-split
heuristic.
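The half-split heuristic itself has a simple form: choose the test whose outcome divides the feasible set as evenly as possible. A minimal sketch (our own formulation; the set-based representation is an assumption):

```python
# Minimal sketch of the half-split heuristic (illustrative, not the
# authors' implementation).  feasible is the current set of candidate
# failure sources; splits[t] is the subset of feasible components that an
# acceptable reading on test t would exonerate.

def half_split_test(feasible, splits):
    """Choose the test whose outcome partitions `feasible` closest to halves."""
    def imbalance(t):
        exonerated = len(splits[t] & feasible)
        return abs(len(feasible) - 2 * exonerated)
    return min(splits, key=imbalance)
```

With feasible = {1, 2, 3, 4} and splits = {'a': {1}, 'b': {1, 2}, 'c': {1, 2, 3}}, test 'b' is chosen because it exonerates exactly half the candidates.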
Considering the data for the first Task Two experiment,
only one additional rule appeared necessary. Since this
experiment involved OR components, subjects needed a
method of dealing with them. Some subjects (8 of 15)
focused on OR components, especially multiple-input OR
components for which si = 0, since identification of only a
single acceptable input (i.e., sj = 1 for cji = 1) was sufficient
to designate the OR component as failed. The remaining 7
of 15 subjects appeared to ignore the OR components if
possible.
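The logic of the OR-component strategy can be made concrete. In the sketch below (our representation; the paper's cji = 1, meaning component j is an input of component i, is written c[j][i] = 1), an OR component with a zero output is designated failed as soon as a single input proves acceptable:

```python
# Hypothetical sketch of the multiple-input OR rule (O in Table VII).
# c[j][i] = 1 means component j is an input of component i;
# s[k] is 0 (failed), 1 (acceptable), or None (unknown).

def or_component_failed(c, s, i):
    """An OR component i with si = 0 is failed once any single input j
    (cji = 1) shows an acceptable output sj = 1."""
    inputs = [j for j in range(len(s)) if c[j][i]]
    return s[i] == 0 and any(s[j] == 1 for j in inputs)
```

With components 0 and 1 feeding OR component 2, observing s = [1, None, 0] already designates component 2 as failed, while s = [None, None, 0] does not.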
Thus, analysis of the data for Task Two led to identifi-
cation of five rules. These rules are summarized in Table
VII. Notice that the notion of "active" components is not
included in this set of rules. This is due to the fact that the
presence of feedback loops prohibits the elimination of
more than a few components from further consideration.
This hypothesis that feedback loops affect human prob-
lem solving in this way is also supported by our studies of
measures of complexity [8].
Using the rules in Table VII, computerized identifica-
tion of rank orderings for Task Two was attempted.
Unfortunately, the results were mediocre with only a 50
percent match in terms of similar tests. However, in the
process of investigating why the identification scheme was
inadequate, it was found that a human analyst could scan
a set of problem solutions and produce an estimate of a
rank ordering that matched subject performance fairly
well. Pursuing this approach further, five independent
judges viewed the problems solved by each subject in the
second experiment with Task Two and estimated the
extent to which each subject matched particular rank
orderings.
The judges were blind in the sense that they did not
know the conditions under which each subject was
trained. This control was important since the analysis of
variance of performance for the second experiment with
Task Two indicated substantial training effects. (This will
later be discussed in more detail.) The five blind judges
were quite consistent in estimating that subjects with one
type of training employed significantly different (via t-test
p <0.01) strategies than subjects trained with the alterna-
tive method. However, this rather global conclusion did
not provide specific rank orderings.
To produce the desired rank orderings, a very fine-
grained and time-consuming analysis was necessary. Be-
cause this process was so labor-intensive, only two blind
judges were employed. Studying one subject at a time,
rules were assigned to each test made by the subject.
Often, multiple rules appeared to apply and thus, the
matching of rules was somewhat ambiguous. There was
no attempt to resolve the ambiguity at this point. Instead,
after all initial matches were complete, each judge viewed
the complete set of often ambiguous matches of tests and
rules and then, simply chose the rank ordering that
seemed to provide the best fit in terms of percentage of
similar tests. Interestingly, the two blind judges produced
almost identical rank orderings for all subjects. The results
appear in Table VIII.
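The judges' best-fit criterion can be phrased computationally. In the sketch below (our formulation; the data layout is an assumption), each test is annotated with the rules applicable in that situation and the rules it could plausibly reflect, and a rank ordering is scored by how often its highest-ranked applicable rule is among the plausible ones:

```python
# Illustrative scoring of a candidate rank ordering against a subject's
# test sequence (hypothetical data layout, not the judges' procedure).
# Each step is a pair (applicable, plausible): the rules that could fire
# in that situation, and the rules the subject's test could reflect.

def percent_similar(steps, ordering):
    """Percentage of tests where the ordering's top applicable rule is
    among the rules the subject plausibly used."""
    hits = 0
    for applicable, plausible in steps:
        model_rule = next(r for r in ordering if r in applicable)
        hits += model_rule in plausible
    return 100.0 * hits / len(steps)

def best_ordering(steps, candidates):
    """Choose the candidate rank ordering with the best fit."""
    return max(candidates, key=lambda o: percent_similar(steps, o))
```

For instance, steps = [({'B', 'N'}, {'N'}), ({'B', 'F'}, {'B', 'F'})] scores 100 under the ordering ('N', 'F', 'B') but only 50 under ('B', 'N', 'F').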
The comparison of models and subjects in terms of
percentage of similar tests is quite favorable. Because of
the time-consuming nature of the analyses for Task Two,
no attempt was made to develop average models for
within and across training groups. Thus, the effects of
individual differences and training cannot be determined
from the results in Table VIII. However, training did have
a clear effect on rank ordering as the following discussion
of training will illustrate.
TABLE VII
RULES FOR TASK TWO

RULE  DESCRIPTION
B     Choose any component for which si = 0
      and test its inputs (termed tracing back).
N     Choose any component with no inputs and
      test its outputs.
F     Choose any component for which si = 1
      and test the output of component j where
      cij = 1 (termed tracing forward).
S     Choose a test that approximately splits
      the set of feasible sources of the
      symptoms into two halves.
O     Choose a multiple-input OR component
      for which si = 0 and test its inputs.
      If si is unknown, test the output first.
TABLE VIII
RANK ORDERINGS AND RESULTS FOR TASK TWO

                 FIRST EXPERIMENT         SECOND EXPERIMENT
RANK-ORDERING    % SIMILAR   NUMBER       % SIMILAR   NUMBER
                 TESTS       OF SUBJECTS  TESTS       OF SUBJECTS
B                80          2            91          6
NB               87          2            -           -
OB               93          2            -           -
NFB/NBF          81          3            90          10
OBN/ONB/NOB      84          3            -           -
ONFB             89          3            -           -
SFB              -           -            87          5
ALL              85          15           90          21
RULE-BASED TRAINING
In studying the rules used by subjects for solving Tasks
One and Two, it became apparent that some rules were
particularly effective while others tended to result in rather
tedious solutions. For example, as noted earlier, use of
rule 9 for Task One (see Table I) greatly expedited the
diagnosis process while use of rule 2 was fairly unproduc-
tive. Similarly, for Task Two, the multiple input OR com-
ponent rule (see Table VII) was quite useful while the
tracing back rule (B) often led to difficulties, particularly
when there were quite a few feedback loops. These ob-
servations led to the idea of providing subjects with feed-
back in terms of a rating of the rules that the computer
inferred they were using.
The rule-based training scheme that evolved from this
idea worked as follows. After each test, the computer
identified the rule that was likely to have generated the
test. The subject was then given feedback in terms of a
rating, displayed immediately to the right of the test
result. The rating schemes shown in Tables IX and X were
employed. These schemes were based on the following
principles.
1) Tests of components that reach active components
displaying symptoms are more effective than tests of
components that only connect to active components
displaying symptoms.
2) Tests of components that reach or connect to com-
ponents displaying acceptable outputs (i.e., si= 1) are
particularly ineffective choices.
3) Tests of components that reach or connect to all
active components displaying symptoms are more
effective than tests of components that only reach or
connect to less than all active components displaying
symptoms.
4) For Task Two, tests of components with no inputs
can be effective because they assure that one is not
testing in a feedback loop.
5) Ratings of particular tests should not be absolute but
instead depend on what other tests are available.
TABLE IX
RATINGS OF RULES FOR TASK ONE

RATING         RULE 9 AVAILABLE      RULE 9 NOT AVAILABLE
E (Excellent)  Rule 9                Rules 4, 5, 6, 12
G (Good)       Rules 4, 5, 6, 7, 12  Rules 7, 10
F (Fair)       Rules 1, 3, 8, 10     Rules 1, 3, 8, 11
P (Poor)       Rules 2, 11           Rule 2
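The Task One rating scheme amounts to a simple lookup keyed on whether rule 9 is currently available. A sketch of that logic (our own encoding of Table IX; the dictionary layout and names are assumptions):

```python
# The Table IX rating scheme for Task One encoded as a lookup
# (illustrative encoding; dictionary layout is our own choice).
RATINGS = {
    True:  {'E': {9}, 'G': {4, 5, 6, 7, 12}, 'F': {1, 3, 8, 10}, 'P': {2, 11}},
    False: {'E': {4, 5, 6, 12}, 'G': {7, 10}, 'F': {1, 3, 8, 11}, 'P': {2}},
}

def rate(rule, rule9_available):
    """Return the E/G/F/P rating of a rule given rule 9's availability."""
    for rating, rules in RATINGS[rule9_available].items():
        if rule in rules:
            return rating
    return None
```

Note that rate(12, True) gives 'G' while rate(12, False) gives 'E', reflecting principle 5 that ratings are not absolute but depend on what other tests are available.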
Beyond the ratings shown in Tables IX and X, ratings of
U and N were also provided when the test was unneces-
sary (i.e., the output value was already known) and when
no further testing was necessary in order to designate the
failure (for Task Two only), respectively. It should be
noted that the rating schemes in Tables IX and X were
developed before conducting the formal identification
process that resulted in the rules noted in Tables I and
VII. Thus, some rules (i.e., S and O in Table VII) were
not included in the rule-based training scheme because it
was not anticipated that many subjects would employ
these rules.
Using the same experimental design as employed in the
previous experimental studies of computer aiding [3], [4],
an experiment was performed using three training condi-
tions: unaided, aided, and rule-based. From an initial
group of 39 fourth semester maintenance trainees, 30 were
evaluated using the three training schemes for Task One
while 24 were studied using unaided and rule-based aiding
for Task Two. For Task One the only interesting effect
was a negative transfer of rule-based training in terms of
percent correct for small problems (i.e., 95 percent versus
70 percent, F4,54 = 4.07, p < 0.01). This negative effect is
difficult to interpret without considering the results for
Task Two.
During Task Two training, subjects using the rule-based
method made 36 percent more tests per problem than
those using the unaided scheme (2.77 versus 2.16, F1,18 =
5.27, p < 0.05). For the last two problems of the transfer
trial, subjects who had received rule-based training made
67 percent more tests per problem than those who had
received unaided training: 4.83 versus 3.40, F1,18 = 5.83,
p < 0.05 for one problem and 3.67 versus 1.70, F1,18 =
15.00, p < 0.01 for the other problem. Thus, the negative
transfer of training for Task Two was substantial.
Combining the overall results for Tasks One and Two,
it seems safe to conclude that rule-based training was not
TABLE X
RATINGS OF RULES FOR TASK TWO*

TYPE OF TRACING BACK (B)          ALSO SATISFIES N   DOES NOT SATISFY N
Test choice connects to the       E                  E
original si = 0 symptom or to
a component for which si = 0
was subsequently discovered.
Test choice connects to all       E                  G
components for which si = 0.
Test choice connects to some,     F                  P
but not all, components for
which si = 0.
Test choice reaches all           E                  G
components for which si = 0.
Test choice reaches some,         G                  F
but not all, components for
which si = 0.

*E means excellent, G means good, F means fair, and P means poor.
a particularly good idea. Several explanations are possible.
First of all, the rating schemes shown in Tables IX and X
may have been inappropriate. However, a more likely
explanation is that subjects misinterpreted the intent of
the ratings. Despite carefully written instructions, some
subjects appeared to feel that E meant they were close to
the failure while P indicated they were far away, much
like the children's game of "hot and cold." Other subjects
seemed to put more emphasis on collecting E ratings than on
solving the problem. (Of course, it is perhaps not surprising that
subjects, in their roles as students, adopted such a
strategy.)
The rule-based model for Task Two was quite success-
ful in capturing the negative transfer with rule-based
training. Considering the second experiment, of the five
subjects identified as having SFB rank orderings, four of
them received unaided training. On the other hand, eight
of the ten subjects whose rank orderings were identified as
NFB received rule-based training. Since S is a very power-
ful rule, SFB is definitely a better rank-ordering than
NFB. The analysis of variance of number of tests as well
as the opinions of the blind judges support this conclu-
sion. Interestingly, the rule-based training did not try to
instill the use of the S rule. It was thought that subjects
would have difficulty understanding its usefulness. Appar-
ently, the experimenters underestimated the ability of
some subjects. Nevertheless, this result points out the
usefulness of the rule-based model.
While the particular E, G, F, and P rating scheme used
was counterproductive, the U and N ratings seemed more
useful. Although the data were not in a form that would
support this conjecture, the following aiding scheme
emerged from this idea. When appropriate, subjects will
be provided with a U, R, or N to designate unnecessary
test, repeated test, or no further tests necessary, respec-
tively. This type of feedback should help subjects to
overcome misinterpretations of how the tasks can be
performed effectively. An experimental study of this form
of feedback is planned.
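The proposed U/R/N feedback is straightforward to state precisely. Below is a sketch under our own assumptions about the available state (which outputs are already known, which tests were already made, and whether the failure is already determined); the function and parameter names are hypothetical.

```python
# Hypothetical sketch of the proposed U/R/N feedback scheme.
# known: set of components whose outputs are already known;
# history: components tested so far; failure_determined: True once no
# further testing is needed to designate the failure.

def feedback_code(test, known, history, failure_determined):
    """Return 'N', 'R', 'U', or None for a legitimate new test."""
    if failure_determined:
        return 'N'   # no further tests were necessary
    if test in history:
        return 'R'   # repeated test
    if test in known:
        return 'U'   # unnecessary: the value was already known
    return None
```

The checks are ordered so that a repeated test earns 'R' even though its value is, by definition, also already known.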
CONCLUSION
This paper has considered the problem of modeling
human fault diagnosis behavior in terms of sequences of
tests chosen. A rule-based model has been proposed and
evaluated in the context of two fault diagnosis tasks.
Using data from three experiments that included data for
118 subjects for Task One and 36 subjects for Task Two,
it was shown that the model chose tests similar to those of
the human 94 percent and 88 percent of the time for the
two tasks, respectively. For Task One it was shown how
this percentage decreased if individual differences or
training effects were averaged out.
Considering the model's ability to choose the same tests
as subjects, the comparison between model and subjects
was not favorable, resulting in only 52 percent agreement
for Task One. However, as discussed earlier, such a result
is inevitable when subjects are placed in a situation where
they must choose between two or more equally attractive
alternatives. From this perspective, it seems much more
reasonable to ask if the model and subjects use the same
rules at the same time. If they do, we can say that they are
making similar tests. Thus, the fairly favorable results
presented here in terms of similar tests should be interpre-
ted as meaning that the model and subjects used the same
rule in the same situation somewhat over 90 percent of the
time.
A method of rule-based training was proposed and
found to produce substantial negative transfer of training.
Alternative explanations were suggested. However, it was
concluded that a training scheme that enabled subjects to
avoid unnecessary testing might be of value.
Future efforts in rule-based modeling by the authors
include evaluating the model's ability to describe context-
specific performance in tasks such as those devised by Hunt [5].
Also, there are plans to extend the modeling methodology
to enable algorithmic identification of ambiguous models
such as discussed earlier. Further, various other ap-
proaches to the general problem of developing pattern-
directed inference [22] are being investigated. These in-
vestigations will hopefully allow the type of interesting
fine-grained analyses discussed in this paper while also
avoiding the labor-intensive nature of many of these
analyses.
REFERENCES
[1] W. B. Rouse, "Human problem solving performance in a fault
diagnosis task," IEEE Trans. Syst., Man, Cybern., vol. SMC-8, no.
4, pp. 258-271, 1978.
[2] W. B. Rouse, "A model of human decisionmaking in fault diagno-
sis tasks that include feedback and redundancy," IEEE Trans.
Syst., Man, Cybern., vol. SMC-9, no. 4, pp. 237-241, 1979.
[3] W. B. Rouse, "Problem solving performance of maintenance
trainees in a fault diagnosis task," Human Factors, vol. 21, no. 2,
pp. 195-203, 1979.
[4] W. B. Rouse, "Problem solving performance of first semester
maintenance trainees in two fault diagnosis tasks," Human Factors,
vol. 21, no. 5, pp. 611-618, 1979.
[5] R. M. Hunt, "A study of transfer of training from context-free to
context-specific fault diagnosis tasks," MSIE thesis, Univ. Illinois
at Urbana-Champaign, 1979.
[6] W. B. Johnson, "Computer simulations in fault diagnosis training:
an empirical study of learning transfer from simulation to live
system performance," Ph.D. dissertation, Univ. Illinois at
Urbana-Champaign, in progress.
[7] W. B. Rouse, "A model of human decisionmaking in a fault
diagnosis task," IEEE Trans. Syst., Man, Cybern., vol. SMC-8, no.
5, pp. 357-361, 1978.
[8] W. B. Rouse and S. H. Rouse, "Measures of complexity of fault
diagnosis tasks," IEEE Trans. Syst., Man, Cybern., vol. SMC-9, no.
11, pp. 720-727, 1979.
[9] R. A. Goldbeck, B. B. Bernstein, W. A. Hillix, and M. A. Marx,
"Application of the half-split technique to problem-solving tasks,"
J. Experimental Psychology, vol. 53, no. 5, pp. 330-338, 1957.
[10] R. G. Mills, "Probability processing and diagnostic search: 20
alternatives, 500 trials," Psychonomic Sci., vol. 24, no. 6, pp.
289-292, 1971.
[11] N. A. Bond, Jr. and J. W. Rigney, "Bayesian aspects of trou-
bleshooting behavior," Human Factors, vol. 8, pp. 377-383, 1966.
[12] L. M. Stolurow, B. Bergum, T. Hodgson, and J. Silva, "The
efficient course of action in troubleshooting as a joint function of
probability and cost," Educational and Psychological Measurement,
vol. 15, no. 4, pp. 462-477, 1955.
[13] J. Rasmussen and A. Jensen, "Mental procedures in real-life tasks:
A case study of electronic troubleshooting," Ergonomics, vol. 17,
no. 3, pp. 293-307, 1974.
[14] K. T. Wescourt and L. Hemphill, "Representing and teaching
knowledge for troubleshooting/debugging," Institute for Mathe-
matical Studies in the Social Sciences, Rep. No. 292, Stanford
Univ., CA, 1978.
[15] J. S. Brown and R. R. Burton, "Diagnostic models for procedural
bugs in basic mathematical skills," Cognitive Sci., vol. 2, no. 2, pp.
155-192, 1978.
[16] J. W. Rigney and D. M. Towne, "Computer techniques for analyz-
ing the microstructure of serial-action work in industry," Human
Factors, vol. 11, no. 2, pp. 113-122, 1969.
[17] A. Newell and H. A. Simon, Human Problem Solving. Englewood
Cliffs, NJ: Prentice-Hall, 1972.
[18] A. Newell, "Production systems: models of control structures," in
Visual Information Processing, W. G. Chase, Ed. New York:
Academic, 1973, Ch. 10.
[19] R. B. Wesson, "Planning in the world of the air traffic controller,"
Proc. Fifth Int. Joint Conf. Artificial Intell., Massachusetts Institute
of Technology, Aug. 1977, pp. 473-479.
[20] I. P. Goldstein and E. Grimson, "Annotated production systems: a
model for skill acquisition," Proc. Fifth Int. Joint Conf. Artificial
Intell., Massachusetts Institute of Technology, Aug. 1977, pp.
311-317.
[21] S. J. Pellegrino, "Modeling test sequences chosen by humans in
fault diagnosis tasks," MSIE thesis, Univ. Illinois at Urbana-
Champaign, 1979.
[22] F. Hayes-Roth, D. A. Waterman, and D. B. Lenat, "Principles of
pattern-directed inference systems," in Pattern-Directed Inference
Systems, D. A. Waterman and F. Hayes-Roth, Eds. New York:
Academic, 1978, pp. 577-601.