Journal of Applied Psychology
1989, Vol. 74, No. 4, 619-624
Copyright 1989 by the American Psychological Association, Inc.
0021-9010/89/$00.75

Rater Errors and Rating Accuracy

Kevin R. Murphy, Colorado State University
William K. Balzer, Bowling Green State University
Meta-analysis was used to determine the relationship between rater error measures and measures of
rating accuracy. Data from 10 studies (N = 1,096) were used to estimate correlations between measures of halo, leniency, and range restriction and Cronbach’s (1955) four measures of accuracy.
The average correlation between error and accuracy was .05. No moderators of the error-accuracy
relationship were found. Furthermore, the data are not consistent with the hypothesis that error
measures are sometimes valid indicators of accuracy. The average value of the 90th percentile of the
distribution of correlations (corrected for attenuation and range restriction) was .11. The use of rater
error measures as indirect indicators of accuracy is not recommended.
A variety of techniques is available for assessing the quality
of rating data, including (a) applications of analysis of variance
(ANOVA) in assessing convergent and discriminant validity (Kavanaugh, MacKinney, & Wolins, 1971), (b) the use of multivariate analysis of variance (MANOVA) in assessing ratings on multiple performance dimensions (Saal, Downey, & Lahey, 1980),
and (c) applications of factor-analytic techniques (Landy,
Vance, Barnes-Farrell, & Steele, 1980). By far the most common method of evaluating ratings involves the assessment of
so-called rater errors (Landy, 1986; Landy & Farr, 1983). The
presence of halo, leniency, or range restriction is generally taken
to indicate inadequacies in the performance appraisal system;
the absence of rater errors is assumed to indicate accuracy in
measuring performance (Jacobs, Kafry, & Zedeck, 1980). In
situations in which direct measures of the accuracy of rating are
difficult to obtain, rater errors are thought to provide indirect
measures of accuracy.
When one considers the widespread use of rater error measures in evaluating rater training programs, scale formats, and
various rating techniques, it is surprising to note the scarcity of
empirical or theoretical support for the position that ratings
that are free of rater errors or that show “desirable” levels of
discriminant validity or ratee dispersion are more accurate
than ratings that show “undesirable” psychometric characteristics (Saal et al., 1980). Rater error measures, and to a lesser
extent ANOVA and MANOVA measures, are prescriptive in nature; ratings are assumed to be inaccurate if they fail to conform
with sometimes arbitrary assumptions about the true distributions and intercorrelations among various measures of performance. In part, the scarcity of data on the validity of indirect
measures of the quality of ratings can be explained by the lack
of any widely accepted standard against which these measures
can be compared. The purpose of our study is to compare these indirect measures with direct measures of the accuracy of ratings.

(Author note: We thank the colleagues who responded generously to our requests for raw data. Correspondence concerning this article should be addressed to Kevin R. Murphy, Department of Psychology, Colorado State University, Fort Collins, Colorado 80523.)
Methods of directly measuring rating accuracy have been developed and applied by Borman (1977, 1979) and others (Murphy, Garcia, Kerkar, Martin, & Balzer, 1982; Murphy, Martin,
& Garcia, 1982; see Sulsky & Balzer, 1988, for a general review). When raters evaluate a number of ratees on multiple performance dimensions, it is possible to develop multivariate
measures of rating accuracy that reflect accuracy in (a) the overall level of rating (elevation), (b) discriminating among ratees
(differential elevation), (c) discriminating among performance
dimensions (stereotype accuracy), and (d) discriminating
among ratees within dimensions (differential accuracy; Cronbach, 1955; Murphy, Garcia, et al., 1982). The relationship between direct and indirect measures of rating accuracy (i.e., rater
error measures) is relevant in evaluating both error and accuracy measures. Indirect measures have served as criteria in a
large number of studies, but the implications of rater errors are
by no means clear. Because the use of accuracy measures is
effectively limited to laboratory settings (Sulsky & Balzer,
1988), the question of whether measures of halo, leniency, and
so forth, can be used to make valid inferences about the accuracy of ratings is an important one.
Results from a number of studies suggest that rater error
measures are not good indicators of the accuracy of ratings.
Borman (1977) presented data showing that the correlations between the differential accuracy of performance ratings and levels of halo and leniency are at best weak. Cooper (1981) suggested that halo shows little relationship with this specific measure of accuracy; the few significant halo-accuracy correlations
suggest that halo is positively related to differential accuracy.
Murphy and Balzer (1981) reviewed three studies (included in
our analysis) in which rater error measures were correlated with
accuracy measures; these correlations were generally small.
More recently, Becker and Cardy (1986) computed correlations
between four error and eight accuracy scores. Although 25 of
the 32 error-accuracy correlations were significant, both negative and positive correlations were found, and most of the correlations were small (absolute value of median r = .19).
Taken together, these data are somewhat troubling, but they
do not necessarily mean that rater errors are unrelated to accu-
racy. First, as noted earlier, accuracy is a multivariate construct
rather than a univariate construct, and the different components of rating accuracy appear to be somewhat independent
(Becker & Cardy, 1986; Murphy, Garcia, et al., 1982; Sulsky &
Balzer, 1988). Therefore, data showing that halo is unrelated to
differential accuracy do not necessarily mean that halo is unrelated to other accuracy measures (see Becker & Cardy, 1986,
Appendix A). Second, there are a number of different operational definitions of each of the major rater errors (Murphy &
Balzer, 1981; Saal et al., 1980). It is possible that some measures
of each rater error are more strongly related to accuracy than
are others. Thus, the limited empirical data presently available
do not fully answer the question of whether rater error measures
may be used to make valid inferences about rating accuracy.
To systematically explore the relationship between rater errors and rating accuracy, we computed the relationships between multiple measures of halo, leniency, and range restriction
and multivariate measures of rating accuracy by using data
from 10 previous studies. A meta-analysis of the results of these
studies was used to estimate the relationship between error and
accuracy measures.
Table 1
Studies Included in Meta-Analysis

Study                                              N     Stimuli used
Balzer, Sulsky, Pollack, & Hammer (1987)          122    Murphy tapes (a)
Banks (1986)                                       56    Borman tapes
Becker & Cardy (1986)                             169    Vignettes (b)
Murphy, Balzer, Kellam, & Armstrong (1984)         69    Murphy tapes
Murphy, Garcia, & Kerkar (1980)                    50    Murphy tapes
Murphy, Garcia, Kerkar, Martin, & Balzer (1982)    44    Murphy tapes
Pulakos (1986)                                     73    Borman tapes
Ruddy & Kavanagh (1986)                            85    Borman tapes
Sulsky & Balzer (1986)                             90    Murphy tapes
Tallarigo (1986)                                  338    Murphy tapes

(a) The development of these videotapes is described in Borman (1977) and Murphy et al. (1982). (b) We were unable to obtain raw data from this study and thus could not compute all 24 error-accuracy correlations; we obtained raw data for all other studies reviewed.
Method
Selection of Studies
Sulsky and Balzer’s (1988) review, together with a review of studies
published since Sulsky and Balzer, showed that 28 studies (published
and unpublished) had used accuracy scores as dependent measures. The
studies varied widely in their design and in the methods used to compute accuracy. Four criteria were considered in deciding whether or not
to include each study in our review. First, the rating scales should require evaluative judgment; studies that examined accuracy in behavior
recognition or in reporting the frequency of critical behaviors were
eliminated. Second, the study should report new data rather than reanalyses of data reported elsewhere. Our third criterion was based on
the number of ratees and dimensions. The appropriate unit of analysis
for both error and accuracy measures is the individual rater (Murphy,
1982; Murphy, Garcia, et al., 1982). The number of ratees and dimensions should therefore be sufficiently large to permit meaningful computations of accuracy for each rater. We chose to eliminate studies that
contained fewer than four ratees or dimensions. Fourth, the true scores
should be carefully developed and should show evidence of convergent
and discriminant validity. On the basis of these criteria, 15 of the 28
studies were eliminated.
We contacted the authors of the remaining 13 studies and were able
to obtain raw data for 9 of these 13 studies; these studies are listed in
Table 1. Six of the studies used videotapes developed by Murphy and
colleagues (see Murphy, Garcia, et al., 1982, for a general discussion of
the development of videotaped stimuli and ratee true score estimates).
Three studies used videotapes developed by Borman (see Borman,
1977, for a general discussion of the development of videotaped stimuli
and ratee true score estimates). Sample sizes in the studies ranged from
44 to 338. We were unable to obtain raw data from one additional study
(Becker & Cardy, 1986) that used the same four accuracy measures and
also used one of the same measures of halo error (MEDCORR), one of the
same leniency measures (MEAN), and one of the same range restriction
measures (SD) as used in this study. Becker and Cardy (1986) reported
correlations between these three error measures and the four accuracy
measures used here. These correlations were included in the meta-analysis.
Computation of Error and Accuracy Measures
Error measures. Six of the rater error measures reviewed in Saal et
al. (1980), two of which indicate halo, two of which indicate leniency,
and two of which indicate central tendency or range restriction, were
computed for individual raters. These measures are (a) MEDCORR: the median correlation between performance dimensions, over ratees (halo); (b) VARRAT: the variance of the ratings assigned to each ratee, averaged across ratees (halo); (c) MEAN: the absolute value of the difference between the mean rating, over ratees and dimensions, and the scale midpoint (leniency); (d) SKEW: the skew of the distribution of ratings over ratees and dimensions (leniency); (e) SD: the standard deviation of the rating distribution, over ratees and dimensions (range restriction); and (f) KURT: the kurtosis of the rating distribution over ratees and dimensions (range restriction). These six error measures were computed
for each rater in the nine studies for which raw data were available.
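To make these operational definitions concrete, the following is a minimal sketch (our illustration, not the authors' code) of how the six indices could be computed from a single rater's ratee-by-dimension matrix of ratings; the function name and the scale_midpoint argument are ours.

```python
# A minimal sketch of the six per-rater error indices defined above, computed
# from one rater's (n_ratees x k_dimensions) matrix of ratings.
import numpy as np
from scipy.stats import skew, kurtosis

def rater_error_measures(ratings, scale_midpoint):
    n, k = ratings.shape
    # MEDCORR: median correlation between performance dimensions, over ratees (halo)
    dim_corrs = np.corrcoef(ratings, rowvar=False)             # k x k matrix
    medcorr = np.median(dim_corrs[np.triu_indices(k, 1)])
    # VARRAT: variance of the ratings assigned to each ratee, averaged across ratees (halo)
    varrat = np.mean(np.var(ratings, axis=1, ddof=1))
    # MEAN: |mean rating - scale midpoint|, over ratees and dimensions (leniency)
    mean_err = abs(ratings.mean() - scale_midpoint)
    # SKEW: skew of the rating distribution, over ratees and dimensions (leniency)
    skew_err = skew(ratings.ravel())
    # SD and KURT: spread and kurtosis of the rating distribution (range restriction)
    sd_err = ratings.std(ddof=1)
    kurt_err = kurtosis(ratings.ravel())
    return {"MEDCORR": medcorr, "VARRAT": varrat, "MEAN": mean_err,
            "SKEW": skew_err, "SD": sd_err, "KURT": kurt_err}
```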
Accuracy measures. Measures of rating accuracy were computed for
each rater by comparing ratings with true score estimates of performance. Here, true scores refer to mean ratings collected from multiple
expert raters under optimal rating conditions. The development and validation of true score estimates is described in detail in Borman (1977)
and Murphy, Garcia, et al. (1982).
For a rater who evaluates n ratees on k items or dimensions, scores
on elevation (EL), differential elevation (DEL), stereotype accuracy
(SA), and differential accuracy (DA) are given by the square roots of the
following terms:
EL^2 = (\bar{x}_{..} - \bar{t}_{..})^2

DEL^2 = \frac{1}{n} \sum_{i} \left[ (\bar{x}_{i.} - \bar{x}_{..}) - (\bar{t}_{i.} - \bar{t}_{..}) \right]^2

SA^2 = \frac{1}{k} \sum_{j} \left[ (\bar{x}_{.j} - \bar{x}_{..}) - (\bar{t}_{.j} - \bar{t}_{..}) \right]^2

DA^2 = \frac{1}{nk} \sum_{i} \sum_{j} \left[ (x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x}_{..}) - (t_{ij} - \bar{t}_{i.} - \bar{t}_{.j} + \bar{t}_{..}) \right]^2

where x_{ij} and t_{ij} = rating and true score for ratee i on item j; \bar{x}_{i.} and \bar{t}_{i.} = mean rating and true score for ratee i; \bar{x}_{.j} and \bar{t}_{.j} = mean rating and true score for item j; and \bar{x}_{..} and \bar{t}_{..} = mean rating and true score over all ratees and items.
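A minimal sketch of these four components, assuming one rater's ratings x and true scores t are stored as n-ratee by k-item arrays (our illustration; the function name is ours):

```python
# Cronbach's four accuracy components as defined by the equations above.
import numpy as np

def cronbach_accuracy(x, t):
    x_i, t_i = x.mean(axis=1, keepdims=True), t.mean(axis=1, keepdims=True)  # ratee means
    x_j, t_j = x.mean(axis=0, keepdims=True), t.mean(axis=0, keepdims=True)  # item means
    x_g, t_g = x.mean(), t.mean()                                            # grand means
    el = abs(x_g - t_g)                                                      # elevation
    del_ = np.sqrt(np.mean(((x_i - x_g) - (t_i - t_g)) ** 2))                # differential elevation
    sa = np.sqrt(np.mean(((x_j - x_g) - (t_j - t_g)) ** 2))                  # stereotype accuracy
    da = np.sqrt(np.mean(((x - x_i - x_j + x_g) - (t - t_i - t_j + t_g)) ** 2))  # differential accuracy
    return {"EL": el, "DEL": del_, "SA": sa, "DA": da}
```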
Table 2
Average Correlations Among Rater Error Measures

Measure         1      2      3      4      5      6
1. MEDCORR      —
2. VARRAT      .25     —
3. MEAN       -.03   -.09     —
4. SKEW        .15   -.01   -.47     —
5. SD         -.37    .35   -.09   -.18     —
6. KURT       -.24   -.02    .13   -.09    .26     —

Note. MEDCORR = the median correlation between performance dimensions, over ratees; VARRAT = the variance of ratings assigned to each ratee, averaged across ratees; MEAN = the absolute value of the difference between the mean rating, over ratees and dimensions, and the scale midpoint; SKEW = the skew of the distribution of ratings, over ratees and dimensions; SD = the standard deviation of the rating distribution, over ratees and dimensions; KURT = the kurtosis of the rating distribution, over ratees and dimensions.

Rescaling accuracy and error scores. Error and accuracy scores are
not scaled consistently. A low value for MEDCORR indicates the absence
of halo, whereas a low value for VARRAT indicates the presence of halo.
A large negative SKEW indicates leniency, but leniency will generally
result in a large positive value for MEAN. Accuracy scores are scaled so
that low values indicate high levels of accuracy.
To simplify the interpretation of our results, all measures were scaled
so that a large score indicated the absence of specific rater errors or, in
the case of accuracy scores, the presence of accuracy. This entailed reverse-scoring all four accuracy measures as well as MEDCORR, MEAN,
SKEW, and KURT. Thus, the hypothesis that rater error measures provide
valid indirect indicators of rating accuracy will be supported if the error-accuracy correlations are positive.
Results
The average intercorrelations among the six rater error measures are presented in Table 2. These correlations suggest that
different operational definitions of the same rater error are not
empirically equivalent. Although both measures of leniency
(i.e., MEAN and SKEW) are scaled in the same direction, the correlation between these two measures is negative (r = -.47). The
two measures of halo and the two measures of range restriction
are positively correlated, but the correlations are not large (rs =
.25 and .26, respectively).
The average correlations between rater error and rating accuracy measures are presented in Table 3. These correlations suggest that error measures are not strongly related to accuracy
measures; the mean error-accuracy correlation is small and
negative (r = -.05, rc = -.06). Only 6 of the 24 correlations shown in Table 3 are positive, and none of these is greater than .15. The correlations reported in Tables 2 and 3 represent
weighted averages of the correlations obtained in each study,
giving each study a weight proportional to its sample size.
Correction for Attenuation and Sampling Error
Except in longitudinal designs, it is not possible to empirically estimate the reliability of error and accuracy scores. However, we can use the results in Table 3 to estimate lower bounds
for reliability. The theoretical maximum value for r_xy is given by the product of the square roots of the reliabilities of x and y. It follows that the minimum value of √(r_xx) √(r_yy) is determined by the size of r_xy. The correlation between VARRAT and DA is -.50. The reliability of differential accuracy scores must therefore be larger than .70; because VARRAT scores are not likely to be perfectly reliable, it is likely that the reliability of DA is greater than .70.
The reliability of elevation, differential elevation, and stereotype accuracy scores is likely to be higher than that of differential accuracy scores. The equations for accuracy scores show
that DA refers to the accuracy of individual ratings, whereas EL refers to the accuracy of the overall mean rating, DEL refers to the accuracy of the mean rating assigned to each tape, and SA refers to the accuracy of the mean rating assigned to each rating dimension. One would ordinarily expect that the
mean of several observations would be more reliable than the
individual observations, which suggests that the reliability of
EL, DEL, and SA should be at least as large as the reliability
of DA.
Table 3 presents corrected correlation coefficients, assuming
a reliability of .70 for each accuracy score. Because we were
interested in the relationship between rater error scores as they are typically used in the literature and reliable measures of accuracy, we did not correct for unreliability of the rater error scores.
The average corrected correlation between error and accuracy
measures was -.06.
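Stated explicitly (our restatement of the standard attenuation correction, applied to the accuracy score only with the assumed reliability of .70):

r_c \;=\; \frac{r}{\sqrt{r_{yy}}} \;=\; \frac{-.05}{\sqrt{.70}} \;\approx\; -.06,

which reproduces the average corrected correlation reported above.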
Note that the low correlations shown in Table 3 cannot be
reasonably attributed to unreliability. If the reliability of both
error and accuracy scores is as low as .40, the average correlation between error and accuracy scores would be -.09, and
none of the correlations would exceed .35.
By using formulas presented in Hunter, Schmidt, and Jackson (1982), we calculated the observed variance in the corrected
rs and subtracted from this the variance attributable to sampling error for each of our 24 corrected correlations. We then
computed the 90th percentile value for each distribution of corrected rs (shown in Table 3); one can be 90% confident that the
true value of r is equal to or lower than this value. These values
suggest that some rater error measures will occasionally reflect
the assumed relationship between error measures and accuracy
measures–a nontrivial, positive r. However, this is typically not
the case. The best estimates of the relationships between error
and accuracy are given by re, which is negative for 18 of 24 error
accuracy correlations, has a mean of-.06, and which never
exceeds. 15. The average value of the 90th percentile of the corrected distribution of rs is also small (r =. 11), indicating that
error scores are, on the whole, rarely good indicators of accuracy.
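The following sketch shows, in rough form, the bare-bones steps implied by Hunter, Schmidt, and Jackson (1982) as used here: a sample-size-weighted mean correlation, removal of expected sampling-error variance, correction for unreliability of the accuracy score, and the value below which 90% of corrected correlations fall. It is our reconstruction under those assumptions, not the authors' code.

```python
# Rough reconstruction of the meta-analytic steps for one error-accuracy pair.
import numpy as np

def credibility_90(rs, ns, ryy=0.70):
    """rs, ns: per-study correlations and sample sizes for one error-accuracy pair."""
    rs, ns = np.asarray(rs, float), np.asarray(ns, float)
    r_bar = np.average(rs, weights=ns)                     # sample-size-weighted mean r
    var_obs = np.average((rs - r_bar) ** 2, weights=ns)    # weighted observed variance
    var_err = (1 - r_bar ** 2) ** 2 / (ns.mean() - 1)      # expected sampling-error variance
    var_res = max(var_obs - var_err, 0.0)                  # residual (non-artifactual) variance
    r_c = r_bar / np.sqrt(ryy)                             # correct mean r for criterion unreliability
    sd_c = np.sqrt(var_res) / np.sqrt(ryy)                 # correct residual SD the same way
    return r_c + 1.28 * sd_c                               # 90% of corrected rs fall at or below this
```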
Testing for moderators. Although most of the correlations
shown in Table 3 are small, there is sufficient variability in the
corrected distributions of many rs to allow for the possibility
that the correlations between various error and accuracy measures are sometimes large. We therefore attempted to identify
variables that might moderate the relationships between rater
errors and rating accuracy.
A large number of potential moderators exists. Some of these
(e.g., Murphy vs. Borman tapes) would be potentially interesting, but of little practical importance, because identifying study
characteristics that moderate error-accuracy correlations does
not help the researcher who is trying to decide whether or not
to use rater error measures in his or her own research. A more
Table 3
Correlations Between Error and Accuracy Measures

                        Elevation               Differential elevation      Stereotype accuracy        Differential accuracy
Measure              r     rc   90% below       r     rc   90% below        r     rc   90% below        r     rc   90% below

Halo
  MEDCORR         -.05   -.06     -.07        -.06   -.08      .58        -.12   -.15     -.15        -.30   -.35     -.31
  VARRAT          -.02   -.02      .14         .01    .01      .16        -.28   -.33     -.03        -.50   -.59     -.31
Leniency
  MEAN            -.10   -.12      .58        -.00   -.00      .06        -.00   -.01      .12         .05    .06      .24
  SKEW             .13    .15      .40         .14    .16      .16        -.01   -.01      .11        -.00   -.00      .18
Range restriction
  SD               .02    .03      .23        -.12   -.14     -.03        -.07   -.08      .06        -.10   -.11      .05
  KURT             .10    .11      .31        -.14   -.16      .04        -.08   -.09      .01        -.10   -.11      .03

Note. r = average observed r; rc = average corrected r (assuming r_yy = .70); 90% below = value at the 90th percentile of the distribution of corrected rs, removing variance due to sampling error. MEDCORR = the median correlation between performance dimensions, over ratees; VARRAT = the variance of ratings assigned to each ratee, averaged across ratees; MEAN = the absolute value of the difference between the mean rating, over ratees and dimensions, and the scale midpoint; SKEW = the skew of the distribution of ratings, over ratees and dimensions; SD = the standard deviation of the rating distribution, over ratees and dimensions; KURT = the kurtosis of the rating distribution, over ratees and dimensions.
promising avenue for explanation is to examine the moderating
effects of statistical characteristics of the ratings collected in a
study. For example, if we knew that halo-accuracy correlations
were high when observed intercorrelations were high, and low
when observed intercorrelations were low, this would help researchers decide whether or not to use particular error measures
in specific contexts.
The six rater error measures examined here reflect aspects
of rating data that are thought to affect accuracy. To test the
hypothesis that the observed levels of halo, leniency, and range
restriction moderated correlations between error and accuracy
measures, we calculated the mean value of each error measure
in each study and correlated these six means with each of the 24
error-accuracy correlations. The resulting correlation matrix
contains 124 correlations that indicate whether levels of a specific error measure moderate error-accuracy correlations. Because the number of correlations computed was large, we first
tested the omnibus null hypothesis that all rs are equal to zero
(Snedecor & Cochran, 1967). We were not able to reject this
omnibus null hypothesis, χ²(124) = 149.8, p > .05, and concluded that the levels of different error indices do not moderate
error-accuracy correlations.
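For illustration, one common way to carry out such an omnibus test of the hypothesis that a set of correlations is zero uses Fisher's z transformation and a chi-square statistic with degrees of freedom equal to the number of correlations; whether this matches the exact Snedecor and Cochran (1967) procedure the authors applied is an assumption on our part.

```python
# Hedged illustration of an omnibus test that all correlations are zero.
import numpy as np
from scipy.stats import chi2

def omnibus_zero_test(rs, n_per_r):
    z = np.arctanh(np.asarray(rs, float))                             # Fisher z-transform
    stat = float(np.sum((np.asarray(n_per_r, float) - 3) * z ** 2))   # each (n - 3) z^2 ~ chi-square(1) under H0
    df = len(z)
    return stat, df, chi2.sf(stat, df)                                # statistic, df, p-value
```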
Regression Analysis
The six rater error measures were used to predict each of the four accuracy measures in a multiple regression equation. Results of this analysis are presented in Table 4.
These results suggest that rater error measures can be used to
predict accuracy levels, but that this use of error scores requires
a reversal of our thinking about the implications of rater errors
for rating accuracy. Eleven of the 16 significant regression
weights are negative, indicating that high scores on error measures are usually associated with low scores on accuracy measures. As noted earlier, error and accuracy scores were scaled in
such a way that high scores indicated accuracy (and the absence
of rater errors), and low scores indicated inaccuracy (and the
presence of rater errors). Therefore, low scores of error measures (indicating the presence of errors) tend to indicate accuracy rather than inaccuracy in rating.
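As a sketch of the kind of analysis summarized in Table 4 (our illustration, not the original code), each rescaled accuracy score can be regressed on the six rescaled error measures, with standardized weights obtained by ordinary least squares:

```python
# Standardized regression of one accuracy score on the six error measures.
import numpy as np

def standardized_regression(errors, accuracy):
    """errors: (n_raters, 6) matrix of error scores; accuracy: (n_raters,) vector."""
    X = (errors - errors.mean(axis=0)) / errors.std(axis=0, ddof=1)   # z-score predictors
    y = (accuracy - accuracy.mean()) / accuracy.std(ddof=1)           # z-score criterion
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)                      # standardized betas
    r_squared = 1 - np.sum((y - X @ beta) ** 2) / np.sum(y ** 2)
    return beta, r_squared
```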
Discussion
The results of the present meta-analysis of rater error-rating
accuracy correlations computed on data from 10 separate studies show that error and accuracy measures are not strongly related; the average error-accuracy correlation is very near zero
(r = -.05, rc = -.06). Only 3 of the 24 corrected correlations
shown in Table 3 are greater (in an absolute sense) than .20,
and all of these are in the wrong direction. Because error and
accuracy indices were scaled in a consistent fashion (high scores
indicate few errors and high accuracy), the correlations between
error and accuracy scores should be positive. That is, the absence of errors should indicate high levels of accuracy (high scores on the rescaled accuracy measures). The data are more consistent with the hypothesis that rater errors contribute to accuracy than with the hypothesis that they detract from accuracy. The regression
analysis presented in Table 4 suggests a similar interpretation.
Eleven of the 16 significant regression coefficients are negative,
and the few positive weights that are shown in Table 4 are relatively small. Our results suggest that the traditional interpretation of rater error measures as indirect indices of accuracy is
unjustified.
Note that although the univariate relationships between error
and accuracy are generally low, the multivariate relationships
may be substantial. If the six error scores are used to predict
EL, DEL, SA, and DA, squared multiple correlations of .04,
.12, .09, and .28, respectively, are obtained. Thus, in a sense,
rater error scores can be used as surrogate measures of accuracy, especially differential accuracy. However, this will require
researchers to reverse their thinking about error scores. That is,
it may be possible to use error scores as criteria, but only if the
presence of errors is now used to indicate accuracy.
We do not recommend the use of error scores as indicators
Table 4
Multiple Regression Results

                                     Standardized regression coefficients
Criterion                 R²     MEDCORR   VARRAT    MEAN     SKEW      SD      KURT
Elevation                .04      -.09*      .03     -.11*     .08*    -.09*     .12*
Differential elevation   .12      -.35*      .18*     .03      .15*    -.24*    -.14*
Stereotype accuracy      .09      -.06      -.27     -.01     -.01      .02     -.10*
Differential accuracy    .28      -.07*     -.51*    -.06*    -.02      .08*    -.14*

Note. MEDCORR = the median correlation between performance dimensions, over ratees; VARRAT = the variance of ratings assigned to each ratee, averaged across ratees; MEAN = the absolute value of the difference between the mean rating, over ratees and dimensions, and the scale midpoint; SKEW = the skew of the distribution of ratings, over ratees and dimensions; SD = the standard deviation of the rating distribution, over ratees and dimensions; KURT = the kurtosis of the rating distribution, over ratees and dimensions.
* p < .05.
of accuracy. We think that reversing our thinking about error
scores will add to the existing confusion about criteria for evaluating ratings (Saal et al., 1980). With the exception of DA, none
of the multiple correlations is large enough to justify the inevitable confusion. We have argued elsewhere (Murphy, Garcia, et
al., 1982) that EL and DEL are the most important components
of accuracy, because they affect the accuracy of personnel decisions. DA affects the accuracy of placement decisions, but not
of selection-type decisions; because pure placement decisions
are rare outside the military, DA is typically not as important
as other aspects of accuracy. Rater error measures are not
sufficiently good indicators of EL and DEL to justify their use.
One possible explanation for the low correlation between
rater error and rating accuracy is suggested by Sulsky and
Balzer (1988), who noted a number of methodological and theoretical problems associated with the use of accuracy scores.
There are several different methods of measuring rating accuracy, and it is possible that rater error indices would be correlated with some other measures of rating accuracy. However, we
know of few studies suggesting a consistent link between the
error measures used here and any measures of rating accuracy;
the exception is the paradoxical link between halo and accuracy
noted by Cooper (1981). The accuracy measures used here have
been used for over 30 years (Cronbach, 1955) and appear to be
the most widely used measures of rating accuracy. The lack of
correlation between the different measures used here calls into
question the assumption that rater error indices provide an indirect measure of rating accuracy.
A more likely explanation for the weak links between error
and accuracy measures is the questionable validity of error indices. For example, a rater commits halo errors only if the correlations among his or her ratings exceed the true correlations.
Fisicaro (1988), Kozlowski and Kirsch (1987), and Murphy and
Reynolds (1988) note that observed correlations are often
smaller than the true intercorrelations among ratings. MEDCORR, which represents the most common index of halo, does
not take into account the true intercorrelations. It is therefore
possible that raters with large observed correlations are not
committing halo error (if the true correlations are also large)
and that raters with very small observed correlations are committing halo error (if the true correlations for these raters are
very small). The same criticism applies to all of the other rater
error measures. There is no way to tell whether observed VARRAT, MEAN, SKEW, SD, or KURT values are too large, too small,
or exactly correct, because none of these measures compares
the observed features of the data with the true means, intercorrelations, and so on.
Fisicaro (1988) presents evidence that halo measures based
on the difference between observed and true intercorrelations
are related to accuracy scores and that large discrepancies between observed and true intercorrelations are associated with
inaccuracy in rating. Although theoretically interesting, Fisicaro’s results do not provide a solution to the practical problem
of evaluating ratings in the field. Except for laboratory studies,
the true means, variances, intercorrelations, and so on, are unknown. Indeed, the primary justification for using rater error
measures appears to be the impossibility of obtaining the required true scores. If true scores were available, there would be
no good reason to compute indirect measures of accuracy, such
as rater error indices. In these situations, it would surely be better to compute direct measures of accuracy, such as those reviewed here.
In summary, the data are not consistent with the hypothesis
that the rater error measures used here are valid indirect indices
of accuracy. Where substantial error-accuracy correlations are
found, they tend to be in the direction opposite to what would be expected. That is, raters who commit rater errors are more likely to provide accurate ratings than are raters who show no
evidence of rater errors. We recommend that the use of rater
error indices as indirect indicators of rating accuracy be discontinued.
References
Balzer, W. K., Sulsky, L. M., Pollack, D., & Hammer, L. B. (1987).
[Individual differences in attention, categorization, memory, and integration and performance rating accuracy]. Unpublished raw data.
Banks, C. G. (1986). [Training and appraisal accuracy]. Unpublished
raw data.
Becker, B. E., & Cardy, R. L. (1986). Influence of halo error on appraisal
effectiveness: A conceptual and empirical reconsideration. Journal of
Applied Psychology, 71, 662-671.
Borman, W. C. (1977). Consistency of rating accuracy and rating errors
in the judgment of human performance. Organizational Behavior
and Human Performance, 20, 238-252.
624 KEVIN R. MURPHY AND WILLIAM K. BALZER
Borman, W. C. (1979). Format and training effects on rating accuracy
and rater errors. Journal of Applied Psychology, 64, 410-421.
Cooper, W. H. (1981). Ubiquitous halo: Sources, solutions, and a paradox. Psychological Bulletin, 90, 218-244.
Cronbach, L. J. (1955). Processes affecting scores on “understanding of others” and “assumed similarity.” Psychological Bulletin, 52, 177-193.
Fisicaro, S. A. (1988). A reexamination of the relationship between halo
error and accuracy. Journal of Applied Psychology, 73, 239-244.
Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis:
Cumulating research findings across studies. Beverly Hills, CA: Sage.
Jacobs, R., Kafry, D., & Zedeck, S. (1980). Expectations of behavioral
expectation scales. Personnel Psychology, 33, 595-640.
Kavanaugh, M., MacKinney, A., & Wolins, L. (1971). Issues in managerial performance: Multitrait-multimethod analysis of ratings. Psychological Bulletin, 75, 34-49.
Kozlowski, S. W., & Kirsch, M. P. (1987). The systematic distortion
hypothesis, halo, and accuracy: An individual-level analysis. Journal
of Applied Psychology, 72, 252-261.
Landy, F. J. (1986). Psychology of work behavior (3rd ed.). Homewood, IL: Dorsey Press.
Landy, F. J., & Farr, J. L. (1983). The measurement of work performance. New York: Academic Press.
Landy, F. J., Vance, R. J., Barnes-Farrell, J. L., & Steele, J. W. (1980).
Statistical control of halo error in performance ratings. Journal of
Applied Psychology, 65, 501-506.
Murphy, K. R. (1982). Difficulties in the statistical control of halo. Journal of Applied Psychology, 67, 161-164.
Murphy, K. R., & Balzer, W. K. (1981, August). Rater errors and rating
accuracy. Presented at the 89th Annual Convention of the American
Psychological Association, Los Angeles, CA.
Murphy, K. R., Balzer, W. K., Kellam, K. L., & Armstrong, J. G. (1984).
Effects of the purpose of rating in observing teacher behavior and
evaluating teaching performance. Journal of Educational Psychology,
76, 45-54.
Murphy, K. R., Garcia, M., & Kerkar, S. (1980). Accuracy in observing
and rating teacher behavior. Unpublished manuscript, Rice University.
Murphy, K. R., Garcia, M., Kerkar, S., Martin, C., & Balzer, W. K.
(1982). Relationship between observational accuracy and accuracy
in evaluating performance. Journal of Applied Psychology, 67, 320-
325.
Murphy, K. R., Martin, C., & Garcia, M. (1982). Do behavioral observation scales measure observation? Journal of Applied Psychology, 67,
562-567.
Murphy, K. R., & Reynolds, D. H. (1988). Does true halo affect observed halo? Journal of Applied Psychology, 73, 235-238.
Pulakos, E. D. (1986). The development of training programs to increase accuracy with different rating tasks. Organizational Behavior
and Human Decision Processes, 38, 76-91.
Ruddy, T. M., & Kavanaugh, M. J. (1986). Performance appraisal: A
review of four training methods. Presented at the annual meeting of
the Southeastern Psychological Association, Orlando, FL.
Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings:
Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413–428.
Snedecor, G. W., & Cochran, W. G. (1967). Statistical methods. Ames,
IA: Iowa State University Press.
Sulsky, L. M., & Balzer, W. K. (1986). The behavioral diary format:
Increasing rating accuracy through consideration of rater cognitive
processes. Presented at the Midwestern Psychological Association
Annual Convention, Chicago, IL.
Sulsky, L. M., & Balzer, W. K. (1988). The meaning and measurement
of performance rating accuracy: Some methodological and theoretical concerns. Journal of Applied Psychology, 73, 497-506.
Tallarigo, R. S. (1986). [Conceptual similarity and rating accuracy].
Unpublished raw data.
Received February 29, 1988
Revision received January 18, 1989
Accepted January 19, 1989
