Journal of Applied Psychology Copyright 1989 by the American Psychological Association, Inc.

1989, Vol. 74, No. 4, 619-624 0021-9010/89/$00.75

Rater Errors and Rating Accuracy

Kevin R. Murphy William K. Balzer

Colorado State University Bowling Green State University

Meta-analysis was used to determine the relationship between rater error measures and measures of

rating accuracy. Data from 10 studies (N = 1,096) were used to estimate correlations between measures of halo, leniency, and range restriction and Cronbach’s (1955) four measures of accuracy.

The average correlation between error and accuracy was .05. No moderators of the error-accuracy

relationship were found. Furthermore, the data are not consistent with the hypothesis that error

measures are sometimes valid indicators of accuracy. The average value of the 90th percentile of the

distribution of correlations (corrected for attenuation and range restriction) was .11. The use of rater

error measures as indirect indicators of accuracy is not recommended.

A variety of techniques is available for assessing the quality

of rating data, including (a) applications of analysis of variance

(ANOVA) in assessing convergent and discriminant validity (Kavanaugh, MacKinney, & Wolins, 1971), (b) the use of multivariate analysis of variance (MANOVA) in assessing ratings on multiple performance dimensions (Saal, Downey, & Lahey, 1980),

and (c) applications of factor-analytic techniques (Landy,

Vance, Barnes-Farrell, & Steele, 1980). By far the most common method of evaluating ratings involves the assessment of

so-called rater errors (Landy, 1986; Landy & Farr, 1983). The

presence of halo, leniency, or range restriction is generally taken

to indicate inadequacies in the performance appraisal system;

the absence of rater errors is assumed to indicate accuracy in

measuring performance (Jacobs, Kafry, & Zedeck, 1980). In

situations in which direct measures of the accuracy of rating are

difficult to obtain, rater errors are thought to provide indirect

measures of accuracy.

When one considers the widespread use of rater error measures in evaluating rater training programs, scale formats, and

various rating techniques, it is surprising to note the scarcity of

empirical or theoretical support for the position that ratings

that are free of rater errors or which show “desirable” levels of

discriminant validity or ratee dispersion are more accurate

than ratings that show “undesirable” psychometric characteristics (Saal et al., 1980). Rater error measures, and to a lesser

extent ANOVA and MANOVA measures, are prescriptive in nature; ratings are assumed to be inaccurate if they fail to conform

with sometimes arbitrary assumptions about the true distributions and intercorrelations among various measures of performance. In part, the scarcity of data on the validity of indirect

measures of the quality of ratings can be explained by the lack

of any widely accepted standard against which these measures

can be compared. The purpose of our study is to compare these indirect measures with direct measures of the accuracy of ratings.

We thank the colleagues who responded generously to our requests for raw data.

Correspondence concerning this article should be addressed to Kevin R. Murphy, Department of Psychology, Colorado State University, Fort Collins, Colorado 80523.

Methods of directly measuring rating accuracy have been developed and applied by Borman (1977, 1979) and others (Murphy, Garcia, Kerkar, Martin, & Balzer, 1982; Murphy, Martin,

& Garcia, 1982; see Sulsky and Balzer, 1988, for a general review). When raters evaluate a number of ratees on multiple performance dimensions, it is possible to develop multivariate

measures of rating accuracy that reflect accuracy in (a) the overall level of rating (elevation), (b) discriminating among ratees

(differential elevation), (c) discriminating among performance

dimensions (stereotype accuracy), and (d) discriminating

among ratees within dimensions (differential accuracy; Cronbach, 1955; Murphy, Garcia, et al., 1982). The relationship between direct and indirect measures of rating accuracy (i.e., rater

error measures) is relevant in evaluating both error and accuracy measures. Indirect measures have served as criteria in a

large number of studies, but the implications of rater errors are

by no means clear. Because the use of accuracy measures is

effectively limited to laboratory settings (Sulsky & Balzer,

1988), the question of whether measures of halo, leniency, and

so forth, can be used to make valid inferences about the accuracy of ratings is an important one.

Results from a number of studies suggest that rater error

measures are not good indicators of the accuracy of ratings.

Borman (1977) presented data showing that the correlations between the differential accuracy of performance ratings and levels of halo and leniency are at best weak. Cooper (1981) suggested that halo shows little relationship with this specific measure of accuracy; the few significant halo-accuracy correlations

suggest that halo is positively related to differential accuracy.

Murphy and Balzer (1981) reviewed three studies (included in

our analysis) in which rater error measures were correlated with

accuracy measures; these correlations were generally small.

More recently, Becker and Cardy (1986) computed correlations

between four error and eight accuracy scores. Although 25 of

the 32 error-accuracy correlations were significant, both negative and positive correlations were found and most of the correlations were small (absolute value of median r = .19).

Taken together, these data are somewhat troubling, but they

do not necessarily mean that rater errors are unrelated to accuracy. First, as noted earlier, accuracy is a multivariate construct

rather than a univariate construct, and the different components of rating accuracy appear to be somewhat independent

(Becker & Cardy, 1986; Murphy, Garcia, et al., 1982; Sulsky &

Balzer, 1988). Therefore, data showing that halo is unrelated to

differential accuracy do not necessarily mean that halo is unrelated to other accuracy measures (see Becker & Cardy, 1986,

Appendix A). Second, there are a number of different operational definitions of each of the major rater errors (Murphy &

Balzer, 1981; Saal et al., 1980). It is possible that some measures

of each rater error are more strongly related to accuracy than

are others. Thus, the limited empirical data presently available

do not fully answer the question of whether rater error measures

may be used to make valid inferences about rating accuracy.

To systematically explore the relationship between rater errors and rating accuracy, we computed the relationships between multiple measures of halo, leniency, and range restriction

and multivariate measures of rating accuracy by using data

from 10 previous studies. A meta-analysis of the results of these

studies was used to estimate the relationship between error and

accuracy measures.

Table 1

Studies Included in Meta-Analysis

Study                                                N     Stimuli used
Balzer, Sulsky, Pollack, & Hammer (1987)             122   Murphy tapes^a
Banks (1986)                                          56   Borman tapes
Becker & Cardy (1986)                                169   Vignettes^b
Murphy, Balzer, Kellam, & Armstrong (1984)            69   Murphy tapes
Murphy, Garcia, & Kerkar (1980)                       50   Murphy tapes
Murphy, Garcia, Kerkar, Martin, & Balzer (1982)       44   Murphy tapes
Pulakos (1986)                                        73   Borman tapes
Ruddy & Kavanagh (1986)                               85   Borman tapes
Sulsky & Balzer (1986)                                90   Murphy tapes
Tallarigo (1986)                                     338   Murphy tapes

^a The development of these videotapes is described in Borman (1977) and Murphy et al. (1982).
^b We were unable to obtain raw data from this study and thus could not compute all 24 error-accuracy correlations; we obtained raw data for all other studies reviewed.

Method

Selection of Studies

Sulsky and Balzer’s (1988) review, together with a review of studies

published since Sulsky and Balzer, showed that 28 studies (published

and unpublished) had used accuracy scores as dependent measures. The

studies varied widely in their design and in the methods used to compute accuracy. Four criteria were considered in deciding whether or not

to include each study in our review. First, the rating scales should require evaluative judgment; studies that examined accuracy in behavior

recognition or in reporting the frequency of critical behaviors were

eliminated. Second, the study should report new data rather than reanalyses of data reported elsewhere. Our third criterion was based on

the number of ratees and dimensions. The appropriate unit of analysis

for both error and accuracy measures is the individual rater (Murphy,

1982; Murphy, Garcia, et al., 1982). The number of ratees and dimensions should therefore be sufficiently large to permit meaningful computations of accuracy for each rater. We chose to eliminate studies that

contained fewer than four ratees or dimensions. Fourth, the true scores

should be carefully developed and should show evidence of convergent

and discriminant validity. On the basis of these criteria, 15 of the 28

studies were eliminated.

We contacted the authors of the remaining 13 studies and were able

to obtain raw data for 9 of these 13 studies; these studies are listed in

Table 1. Six of the studies used videotapes developed by Murphy and

colleagues (see Murphy, Garcia, et al., 1982, for a general discussion of

the development of videotaped stimuli and ratee true score estimates).

Three studies used videotapes developed by Borman (see Borman,

1977, for a general discussion of the development of videotaped stimuli

and ratee true score estimates). Sample sizes in the studies ranged from

44 to 338. We were unable to obtain raw data from one additional study

(Becker & Cardy, 1986) that used the same four accuracy measures and

also used one of the same measures of halo error (MEDCORR), one of the

same leniency measures (MEAN), and one of the same range restriction

measures (SD) as used in this study. Becker and Cardy (1986) reported

correlations between these three error measures and the four accuracy

measures used here. These correlations were included in the meta-analysis.

Computation of Error and Accuracy Measures

Error measures. Six of the rater error measures reviewed in Saal et

al. (1980), two of which indicate halo, two of which indicate leniency,

and two of which indicate central tendency or range restriction, were

computed for individual raters. These measures are (a) MEDCORR: the median correlation between performance dimensions, over ratees (halo); (b) VARRAT: the variance of the ratings assigned to each ratee, averaged across ratees (halo); (c) MEAN: the absolute value of the difference between the mean rating, over ratees and dimensions, and the scale midpoint (leniency); (d) SKEW: the skew of the distribution of ratings over ratees and dimensions (leniency); (e) SD: the standard deviation of the rating distribution, over ratees and dimensions (range restriction); and (f) KURT: the kurtosis of the rating distribution over ratees and

dimensions (range restriction). These six error measures were computed

for each rater in the nine studies for which raw data were available.
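The six definitions above can be sketched in code for a single rater. This is a minimal illustration, not the procedure used to produce the reported results: the function name `rater_error_measures`, the use of sample (n - 1) variances, and the moment-based skew and excess-kurtosis formulas are all assumptions, because the text does not specify estimators. The example ratings are hypothetical.

```python
import numpy as np

def rater_error_measures(ratings, scale_midpoint):
    """Six rater error indices for one rater.

    ratings: array-like of shape (n_ratees, k_dimensions).
    scale_midpoint: midpoint of the rating scale (e.g., 4 on a 1-7 scale).
    Estimator choices (ddof, moment-based skew/kurtosis) are assumptions.
    """
    x = np.asarray(ratings, dtype=float)

    # MEDCORR: median correlation between dimensions, computed over ratees.
    # (A dimension with zero variance would yield NaN here.)
    corr = np.corrcoef(x, rowvar=False)
    medcorr = float(np.median(corr[np.triu_indices_from(corr, k=1)]))

    # VARRAT: variance across dimensions within each ratee, averaged over ratees.
    varrat = float(np.mean(np.var(x, axis=1, ddof=1)))

    # The remaining indices use the distribution of all ratings.
    flat = x.ravel()
    mean_dev = float(abs(flat.mean() - scale_midpoint))   # MEAN
    sd = float(flat.std(ddof=1))                          # SD
    z = (flat - flat.mean()) / flat.std(ddof=0)
    skew = float(np.mean(z ** 3))                         # SKEW
    kurt = float(np.mean(z ** 4) - 3.0)                   # KURT (excess kurtosis)

    return {"MEDCORR": medcorr, "VARRAT": varrat, "MEAN": mean_dev,
            "SKEW": skew, "SD": sd, "KURT": kurt}

# Hypothetical ratings: 5 ratees on 4 dimensions, 7-point scale.
example = [[5, 4, 6, 5], [3, 3, 4, 2], [6, 5, 7, 6], [4, 4, 5, 3], [2, 3, 3, 2]]
measures = rater_error_measures(example, scale_midpoint=4)
```

Note that none of these indices uses true scores, which is the property examined later in the Discussion.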

Accuracy measures. Measures of rating accuracy were computed for

each rater by comparing ratings with true score estimates of performance. Here, true scores refer to mean ratings collected from multiple

expert raters under optimal rating conditions. The development and validation of true score estimates is described in detail in Borman (1977)

and Murphy, Garcia, et al. (1982).

For a rater who evaluates n ratees on k items or dimensions, scores

on elevation (EL), differential elevation (DEL), stereotype accuracy

(SA), and differential accuracy (DA) are given by the square roots of the

following terms:

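\[
\begin{aligned}
\mathrm{EL}^2 &= (x_{\cdot\cdot} - t_{\cdot\cdot})^2 \\
\mathrm{DEL}^2 &= \frac{1}{n} \sum_{i} \left[ (x_{i\cdot} - x_{\cdot\cdot}) - (t_{i\cdot} - t_{\cdot\cdot}) \right]^2 \\
\mathrm{SA}^2 &= \frac{1}{k} \sum_{j} \left[ (x_{\cdot j} - x_{\cdot\cdot}) - (t_{\cdot j} - t_{\cdot\cdot}) \right]^2 \\
\mathrm{DA}^2 &= \frac{1}{kn} \sum_{i} \sum_{j} \left[ (x_{ij} - x_{i\cdot} - x_{\cdot j} + x_{\cdot\cdot}) - (t_{ij} - t_{i\cdot} - t_{\cdot j} + t_{\cdot\cdot}) \right]^2
\end{aligned}
\]

where

$x_{ij}$ and $t_{ij}$ = rating and true score for ratee i on item j,

$x_{i\cdot}$ and $t_{i\cdot}$ = mean rating and true score for ratee i,

$x_{\cdot j}$ and $t_{\cdot j}$ = mean rating and true score for item j, and

$x_{\cdot\cdot}$ and $t_{\cdot\cdot}$ = mean rating and true score over all ratees and items.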
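The four components can be computed directly from a rater's n x k rating matrix and the matching matrix of true scores. The sketch below follows the equations above term by term; the function name `cronbach_accuracy` and the example values are hypothetical.

```python
import numpy as np

def cronbach_accuracy(ratings, true_scores):
    """Return EL, DEL, SA, and DA (square roots of the squared terms).

    ratings, true_scores: arrays of shape (n_ratees, k_items).
    Lower values indicate greater accuracy (before any rescaling).
    """
    x = np.asarray(ratings, dtype=float)
    t = np.asarray(true_scores, dtype=float)

    x_all, t_all = x.mean(), t.mean()              # grand means
    x_i, t_i = x.mean(axis=1), t.mean(axis=1)      # ratee means
    x_j, t_j = x.mean(axis=0), t.mean(axis=0)      # item means

    el2 = (x_all - t_all) ** 2
    del2 = np.mean(((x_i - x_all) - (t_i - t_all)) ** 2)
    sa2 = np.mean(((x_j - x_all) - (t_j - t_all)) ** 2)

    # Ratee-by-item interaction residuals for ratings and true scores.
    x_res = x - x_i[:, None] - x_j[None, :] + x_all
    t_res = t - t_i[:, None] - t_j[None, :] + t_all
    da2 = np.mean((x_res - t_res) ** 2)

    squared = {"EL": el2, "DEL": del2, "SA": sa2, "DA": da2}
    return {name: float(np.sqrt(value)) for name, value in squared.items()}

# Hypothetical true scores for 4 ratees on 3 items; the rater is uniformly
# one scale point too lenient, so only elevation is affected.
true_scores = np.array([[3.0, 4.0, 5.0],
                        [2.0, 6.0, 4.0],
                        [5.0, 5.0, 1.0],
                        [4.0, 2.0, 3.0]])
accuracy = cronbach_accuracy(true_scores + 1.0, true_scores)
```

A useful check on any implementation is that the four squared components sum to the mean squared difference between ratings and true scores, because they partition that difference into grand-mean, ratee, item, and interaction parts.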

Table 2

Average Correlations Among Rater Error Measures

Measure          1       2       3       4       5       6
1. MEDCORR       --
2. VARRAT        .25     --
3. MEAN         -.03    -.09     --
4. SKEW          .15    -.01    -.47     --
5. SD           -.37     .35    -.09    -.18     --
6. KURT         -.24    -.02     .13    -.09     .26     --

Note. MEDCORR = the median correlation between performance dimensions, over ratees; VARRAT = the variance of ratings assigned to each ratee, averaged across ratees; MEAN = the absolute value of the difference between the mean rating, over ratees and dimensions, and the scale midpoint; SKEW = the skew of the distribution of ratings, over ratees and dimensions; SD = the standard deviation of the rating distribution, over ratees and dimensions; KURT = the kurtosis of the rating distribution, over ratees and dimensions.

Rescaling accuracy and error scores. Error and accuracy scores are not scaled consistently. A low value for MEDCORR indicates the absence

of halo, whereas a low value for VARRAT indicates the presence of halo.

A large negative SKEW indicates leniency, but leniency will generally

result in a large positive value for MEAN. Accuracy scores are scaled so

that low values indicate high levels of accuracy.

To simplify the interpretation of our results, all measures were scaled

so that a large score indicated the absence of specific rater errors or, in

the case of accuracy scores, the presence of accuracy. This entailed reverse-scoring all four accuracy measures as well as MEDCORR, MEAN,

SKEW, and KURT. Thus, the hypothesis that rater error measures provide

valid indirect indicators of rating accuracy will be supported if the error-accuracy correlations are positive.

Results

The average intercorrelations among the six rater error measures are presented in Table 2. These correlations suggest that

different operational definitions of the same rater error are not

empirically equivalent. Although both measures of leniency

(i.e., MEAN and SKEW) are scaled in the same direction, the correlation between these two measures is negative (r = -.47). The

two measures of halo and the two measures of range restriction

are positively correlated, but the correlations are not large (rs =

.25 and .26, respectively).

The average correlations between rater error and rating accuracy measures are presented in Table 3. These correlations suggest that error measures are not strongly related to accuracy

measures; the mean error-accuracy correlation is small and

negative (r = -.05, rc = -.06). Only 6 of the 24 correlations shown in Table 3 are positive, and none of these is greater than .15. The correlations reported in Tables 2 and 3 represent

weighted averages of the correlations obtained in each study,

giving each study a weight proportional to its sample size.
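The sample-size weighting described here amounts to a simple weighted mean. In the sketch below, the sample sizes are those listed in Table 1 (they sum to 1,096); the per-study correlations are hypothetical placeholders, and the function name `weighted_mean_r` is an assumption made for illustration.

```python
def weighted_mean_r(rs, ns):
    """Average correlations across studies, weighting each by its sample size."""
    if len(rs) != len(ns):
        raise ValueError("rs and ns must have the same length")
    total_n = sum(ns)
    return sum(r * n for r, n in zip(rs, ns)) / total_n

# Sample sizes from Table 1, in table order.
study_ns = [122, 56, 169, 69, 50, 44, 73, 85, 90, 338]

# Hypothetical per-study correlations for one error-accuracy pair
# (illustrative values only, not data from the studies).
study_rs = [-0.10, 0.05, -0.02, 0.00, -0.15, 0.08, -0.04, -0.09, 0.02, -0.07]

mean_r = weighted_mean_r(study_rs, study_ns)
```

With this weighting, the largest study (N = 338) contributes nearly a third of the total weight, so the average is less sensitive to sampling fluctuation in the small studies.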

Correction for Attenuation and Sampling Error

Except in longitudinal designs, it is not possible to empirically estimate the reliability of error and accuracy scores. However, we can use the results in Table 3 to estimate lower bounds

for reliability. The theoretical maximum value for rxy is given by the product of the square roots of the reliabilities of x and y. It follows that the minimum value of √(rxx ryy) is determined by

the size of rxy. The correlation between VARRAT and DA is

-.50. The reliability of differential accuracy scores must therefore be larger than .70; because VARRAT scores are not likely to

be perfectly reliable, it is likely that the reliability of DA is

greater than .70.

The reliability of elevation, differential elevation, and stereotype accuracy scores is likely to be higher than that of differential accuracy scores. The equations for accuracy scores show

that DA refers to the accuracy of individual ratings, whereas DEL

refers to the accuracy of the mean rating assigned to each tape,

and SA refers to the accuracy of the mean rating assigned to

each rating dimension. One would ordinarily expect that the

mean of several observations would be more reliable than the

individual observations, which suggests that the reliability of

EL, DEL, and SA should be at least as large as the reliability

of DA.

Table 3 presents corrected correlation coefficients, assuming

a reliability of .70 for each accuracy score. Because we were

interested in the relationship between rater error scores that are

often used in the literature and are reliable measures of accuracy, we did not correct for unreliability of rater error scores.

The average corrected correlation between error and accuracy

measures was -.06.
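The correction applied here is the standard correction for attenuation, restricted to unreliability in the accuracy score. A minimal sketch follows; the function name `correct_for_attenuation` and its default of perfect reliability for the error score are assumptions that mirror the decision not to correct for unreliability of rater error scores.

```python
import math

def correct_for_attenuation(r_xy, r_yy, r_xx=1.0):
    """Disattenuate an observed correlation.

    r_xy: observed correlation between error score x and accuracy score y.
    r_yy: assumed reliability of the accuracy score.
    r_xx: reliability of the error score; 1.0 means no correction for x.
    """
    if not (0.0 < r_xx <= 1.0 and 0.0 < r_yy <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)

# Mean observed error-accuracy correlation, with ryy assumed to be .70.
corrected_mean = correct_for_attenuation(-0.05, 0.70)
```

With an observed mean r of -.05 and an assumed ryy of .70, the corrected value rounds to -.06, in line with the figure reported above. Because the divisor is about .84, the correction changes any observed correlation by less than 20%.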

Note that the low correlations shown in Table 3 cannot be

reasonably attributed to unreliability. If the reliability of both

error and accuracy scores is as low as .40, the average correlation between error and accuracy scores would be -.09, and

none of the correlations would exceed .35.

By using formulas presented in Hunter, Schmidt, and Jackson (1982), we calculated the observed variance in the corrected

rs and subtracted from this the variance attributable to sampling error for each of our 24 corrected correlations. We then

computed the 90th percentile value for each distribution of corrected rs (shown in Table 3); one can be 90% confident that the

true value of r is equal to or lower than this value. These values

suggest that some rater error measures will occasionally reflect

the assumed relationship between error measures and accuracy

measures–a nontrivial, positive r. However, this is typically not

the case. The best estimates of the relationships between error

and accuracy are given by rc, which is negative for 18 of 24 error-accuracy correlations, has a mean of -.06, and never exceeds .15. The average value of the 90th percentile of the corrected distribution of rs is also small (r = .11), indicating that

error scores are, on the whole, rarely good indicators of accuracy.
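The logic of removing sampling-error variance and reading off a 90th percentile can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of the analysis: it uses the widely cited approximation (1 - r̄²)² / (N̄ - 1) for sampling-error variance, treats the residual distribution as normal (so the 90th percentile lies 1.2816 standard deviations above the mean), and does not adjust the sampling-error variance for the attenuation correction. The function name `percentile_90_of_r` is hypothetical.

```python
import math

Z_90 = 1.2816  # standard normal deviate for the 90th percentile

def percentile_90_of_r(rs, ns):
    """Estimate the 90th percentile of correlations after removing sampling error.

    rs: per-study correlations for one error-accuracy pair.
    ns: per-study sample sizes.
    """
    total_n = sum(ns)
    mean_r = sum(n * r for r, n in zip(rs, ns)) / total_n

    # Sample-size-weighted observed variance of the correlations.
    var_observed = sum(n * (r - mean_r) ** 2 for r, n in zip(rs, ns)) / total_n

    # Approximate variance expected from sampling error alone (an assumption).
    mean_n = total_n / len(ns)
    var_sampling = (1.0 - mean_r ** 2) ** 2 / (mean_n - 1.0)

    # Residual variance cannot be negative; truncate at zero.
    var_residual = max(var_observed - var_sampling, 0.0)
    return mean_r + Z_90 * math.sqrt(var_residual)
```

When the observed variance is no larger than the variance expected from sampling error, the residual variance is zero and the 90th percentile equals the mean correlation.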

Testing for moderators. Although most of the correlations

shown in Table 3 are small, there is sufficient variability in the

corrected distributions of many rs to allow for the possibility

that the correlations between various error and accuracy measures are sometimes large. We therefore attempted to identify

variables that might moderate the relationships between rater

errors and rating accuracy.

Table 3

Correlations Between Error and Accuracy Measures

                  Elevation               Differential elevation  Stereotype accuracy     Differential accuracy
Error measure     r       rc      90%     r       rc      90%     r       rc      90%     r       rc      90%
Halo
  MEDCORR         -.05    -.06    -.07    -.06    -.08     .58    -.12    -.15    -.15    -.30    -.35    -.31
  VARRAT          -.02    -.02     .14     .01     .01     .16    -.28    -.33    -.03    -.50    -.59    -.31
Leniency
  MEAN            -.10    -.12     .58    -.00    -.00     .06    -.00    -.01     .12     .05     .06     .24
  SKEW             .13     .15     .40     .14     .16     .16    -.01    -.01     .11    -.00    -.00     .18
Range restriction
  SD               .02     .03     .23    -.12    -.14    -.03    -.07    -.08     .06    -.10    -.11     .05
  KURT             .10     .11     .31    -.14    -.16     .04    -.08    -.09     .01    -.10    -.11     .03

Note. r = average observed r; rc = average corrected r (assume ryy = .70); 90% = value at the 90th percentile of the distribution of corrected rs, removing variance due to sampling error ("90% below" in the column labels). MEDCORR = the median correlation between performance dimensions, over ratees; VARRAT = the variance of ratings assigned to each ratee, averaged across ratees; MEAN = the absolute value of the difference between the mean rating, over ratees and dimensions, and the scale midpoint; SKEW = the skew of the distribution of ratings, over ratees and dimensions; SD = the standard deviation of the rating distribution, over ratees and dimensions; KURT = the kurtosis of the rating distribution, over ratees and dimensions.

A large number of potential moderators exists. Some of these (e.g., Murphy vs. Borman tapes) would be potentially interesting, but of little practical importance, because identifying study characteristics that moderate error-accuracy correlations does not help the researcher who is trying to decide whether or not to use rater error measures in his or her own research. A more promising avenue for explanation is to examine the moderating

effects of statistical characteristics of the ratings collected in a

study. For example, if we knew that halo-accuracy correlations

were high when observed intercorrelations were high, and low

when observed intercorrelations were low, this would help researchers decide whether or not to use particular error measures

in specific contexts.

The six rater error measures examined here reflect aspects

of rating data that are thought to affect accuracy. To test the

hypothesis that the observed levels of halo, leniency, and range

restriction moderated correlations between error and accuracy

measures, we calculated the mean value of each error measure

in each study and correlated these six means with each of the 24

error-accuracy correlations. The resulting correlation matrix

contains 124 correlations that indicate whether levels of a specific error measure moderate error-accuracy correlations. Because the number of correlations computed was large, we first

tested the omnibus null hypothesis that all rs are equal to zero

(Snedecor & Cochran, 1967). We were not able to reject this

omnibus null hypothesis, χ²(124) = 149.8, p > .05, and concluded that the levels of different error indices do not moderate

error-accuracy correlations.

Regression Analysis

The six rater error measures were used to predict each of the four accuracy measures in a multiple regression equation. Results of this analysis are presented in Table 4.
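A regression of this form, with standardized coefficients and a squared multiple correlation, can be sketched with ordinary least squares on z-scored variables. The function name `standardized_regression` and the synthetic data are assumptions for illustration; the sketch does not include the significance tests reported in Table 4.

```python
import numpy as np

def standardized_regression(predictors, criterion):
    """Standardized regression weights and R^2 from least squares on z-scores.

    predictors: array of shape (n_raters, n_error_measures).
    criterion:  array of shape (n_raters,), one accuracy measure.
    """
    X = np.asarray(predictors, dtype=float)
    y = np.asarray(criterion, dtype=float)

    # z-score every variable; no intercept is needed after centering.
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)

    beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    residual = yz - Xz @ beta
    r_squared = 1.0 - (residual @ residual) / (yz @ yz)
    return beta, float(r_squared)

# Synthetic illustration: 200 hypothetical raters, six error measures,
# and a criterion that depends negatively on the second measure plus noise.
rng = np.random.default_rng(0)
errors = rng.normal(size=(200, 6))
accuracy = -0.5 * errors[:, 1] + rng.normal(size=200)
weights, r2 = standardized_regression(errors, accuracy)
```

Because all variables are standardized, each weight is directly comparable across predictors, which is what allows the signs and relative sizes of the coefficients in Table 4 to be interpreted.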

These results suggest that rater error measures can be used to

predict accuracy levels, but that this use of error scores requires

a reversal of our thinking about the implications of rater errors

for rating accuracy. Eleven of the 16 significant regression

weights are negative, indicating that high scores on error measures are usually associated with low scores on accuracy measures. As noted earlier, error and accuracy scores were scaled in

such a way that high scores indicated accuracy (and the absence

of rater errors), and low scores indicated inaccuracy (and the

presence of rater errors). Therefore, low scores of error measures (indicating the presence of errors) tend to indicate accuracy rather than inaccuracy in rating.

Discussion

The results of the present meta-analysis of rater error-rating

accuracy correlations computed on data from 10 separate studies show that error and accuracy measures are not strongly related; the average error-accuracy correlation is very near zero

(r = -.05, rc = -.06). Only 3 of the 24 corrected correlations

shown in Table 3 are greater (in an absolute sense) than .20,

and all of these are in the wrong direction. Because error and

accuracy indices were scaled in a consistent fashion (high scores

indicate few errors and high accuracy), the correlations between

error and accuracy scores should be positive. That is, the absence of errors should indicate high levels of accuracy (low accuracy scores). The data are more consistent with the hypothesis that rater errors contribute to accuracy than with the hypothesis that they detract from accuracy. The regression

analysis presented in Table 4 suggests a similar interpretation.

Eleven of the 16 significant regression coefficients are negative,

and the few positive weights that are shown in Table 4 are relatively small. Our results suggest that the traditional interpretation of rater error measures as indirect indices of accuracy is

unjustified.

Note that although the univariate relationships between error

and accuracy are generally low, the multivariate relationships

may be substantial. If the six error scores are used to predict

EL, DEL, SA, and DA, squared multiple correlations of .04,

.12, .09, and .28, respectively, are obtained. Thus, in a sense,

rater error scores can be used as surrogate measures of accuracy, especially differential accuracy. However, this will require

researchers to reverse their thinking about error scores. That is,

it may be possible to use error scores as criteria, but only if the

presence of errors is now used to indicate accuracy.

Table 4

Multiple Regression Results

                                  Standardized regression coefficients
Criterion                 R2      MEDCORR   VARRAT    MEAN      SKEW      SD        KURT
Elevation                 .04     -.09*      .03      -.11*      .08*     -.09*      .12*
Differential elevation    .12     -.35*      .18*      .03       .15*     -.24*     -.14*
Stereotype accuracy       .09     -.06      -.27      -.01      -.01       .02      -.10*
Differential accuracy     .28     -.07*     -.51*     -.06*     -.02       .08*     -.14*

Note. MEDCORR = the median correlation between performance dimensions, over ratees; VARRAT = the variance of ratings assigned to each ratee, averaged across ratees; MEAN = the absolute value of the difference between the mean rating, over ratees and dimensions, and the scale midpoint; SKEW = the skew of the distribution of ratings, over ratees and dimensions; SD = the standard deviation of the rating distribution, over ratees and dimensions; KURT = the kurtosis of the rating distribution, over ratees and dimensions.
* p < .05.

We do not recommend the use of error scores as indicators of accuracy. We think that reversing our thinking about error

scores will add to the existing confusion about criteria for evaluating ratings (Saal et al., 1980). With the exception of DA, none

of the multiple correlations is large enough to justify the inevitable confusion. We have argued elsewhere (Murphy, Garcia, et

al., 1982) that EL and DEL are the most important components

of accuracy, because they affect the accuracy of personnel decisions. DA affects the accuracy of placement decisions, but not

of selection-type decisions; because pure placement decisions

are rare outside the military, DA is typically not as important

as other aspects of accuracy. Rater error measures are not

sufficiently good indicators of EL and DEL to justify their use.

One possible explanation for the low correlation between

rater error and rating accuracy is suggested by Sulsky and

Balzer (1988), who noted a number of methodological and theoretical problems associated with the use of accuracy scores.

There are several different methods of measuring rating accuracy, and it is possible that rater error indices would be correlated with some other measures of rating accuracy. However, we

know of few studies suggesting a consistent link between the

error measures used here and any measures of rating accuracy;

the exception is the paradoxical link between halo and accuracy

noted by Cooper (1981). The accuracy measures used here have

been used for over 30 years (Cronbach, 1955) and appear to be

the most widely used measures of rating accuracy. The lack of

correlation between the different measures used here calls into

question the assumption that rater error indices provide an indirect measure of rating accuracy.

A more likely explanation for the weak links between error

and accuracy measures is the questionable validity of error indices. For example, a rater commits halo errors only if the correlations among his or her ratings exceed the true correlations.

Fisicaro (1988), Kozlowski and Kirsch (1987), and Murphy and

Reynolds (1988) note that observed correlations are often

smaller than the true intercorrelations among ratings. MEDCORR, which represents the most common index of halo, does

not take into account the true intercorrelations. It is therefore

possible that raters with large observed correlations are not

committing halo error (if the true correlations are also large)

and that raters with very small observed correlations are committing halo error (if the true correlations for these raters are

very small). The same criticism applies to all of the other rater

error measures. There is no way to tell whether observed VARRAT, MEAN, SKEW, SD, or KURT values are too large, too small,

or exactly correct, because none of these measures compares

the observed features of the data with the true means, intercorrelations, and so on.

Fisicaro (1988) presents evidence that halo measures based

on the difference between observed and true intercorrelations

are related to accuracy scores and that large discrepancies between observed and true intercorrelations are associated with

inaccuracy in rating. Although theoretically interesting, Fisicaro’s results do not provide a solution to the practical problem

of evaluating ratings in the field. Except for laboratory studies,

the true means, variances, intercorrelations, and so on, are unknown. Indeed, the primary justification for using rater error

measures appears to be the impossibility of obtaining the required true scores. If true scores were available, there would be

no good reason to compute indirect measures of accuracy, such

as rater error indices. In these situations, it would surely be better to compute direct measures of accuracy, such as those reviewed here.

In summary, the data are not consistent with the hypothesis

that the rater error measures used here are valid indirect indices

of accuracy. Where substantial error-accuracy correlations are

found, they tend to be in the opposite direction from what would be expected. That is, raters who commit rater errors are more likely to provide accurate ratings than are raters who show no

evidence of rater errors. We recommend that the use of rater

error indices as indirect indicators of rating accuracy be discontinued.

References

Balzer, W. K., Sulsky, L. M., Pollack, D., & Hammer, L. B. (1987).

[Individual differences in attention, categorization, memory, and integration and performance rating accuracy]. Unpublished raw data.

Banks, C. G. (1986). [Training and appraisal accuracy]. Unpublished

raw data.

Becker, B. E., & Cardy, R. L. (1986). Influence of halo error on appraisal

effectiveness: A conceptual and empirical reconsideration. Journal of

Applied Psychology, 71, 662-671.

Borman, W. C. (1977). Consistency of rating accuracy and rating errors

in the judgment of human performance. Organizational Behavior

and Human Performance, 20, 238-252.


Borman, W. C. (1979). Format and training effects on rating accuracy

and rater errors. Journal of Applied Psychology, 64, 410-421.

Cooper, W. J. (1981). Ubiquitous halo: Sources, solutions, and a paradox. Psychological Bulletin, 90, 218-244.

Cronbach, L. J. (1955). Processes affecting scores on “understanding of others” and “assumed similarity.” Psychological Bulletin, 52, 177-193.

Fisicaro, S. A. (1988). A reexamination of the relationship between halo

error and accuracy. Journal of Applied Psychology, 73, 239-244.

Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis:

Cumulating research findings across studies. Beverly Hills, CA: Sage.

Jacobs, R., Kafry, D., & Zedeck, S. (1980). Expectations of behavioral

expectation scales. Personnel Psychology, 33, 595-640.

Kavanaugh, M., MacKinney, A., & Wolins, L. (1971). Issues in managerial performance: Multitrait-multimethod analysis of ratings. Psychological Bulletin, 75, 34-49.

Kozlowski, S. W., & Kirsch, M. P. (1987). The systematic distortion

hypothesis, halo, and accuracy: An individual-level analysis. Journal

of Applied Psychology, 72, 252-261.

Landy, F. J. (1986). Psychology of work behavior (3rd ed.). Homewood,

IL: Dorsey Press.

Landy, F. J., & Farr, J. L. (1983). The measurement of work performance. New York: Academic Press.

Landy, F. J., Vance, R. J., Barnes-Farrell, J. L., & Steele, J. W. (1980).

Statistical control of halo error in performance ratings. Journal of

Applied Psychology, 65, 501-506.

Murphy, K. R. (1982). Difficulties in the statistical control of halo. Journal of Applied Psychology, 67, 161-164.

Murphy, K. R., & Balzer, W. K. (1981, August). Rater errors and rating

accuracy. Presented at the 89th Annual Convention of the American

Psychological Association, Los Angeles, CA.

Murphy, K. R., Balzer, W. K., Kellam, K. L., & Armstrong, J. G. (1984).

Effects of the purpose of rating in observing teacher behavior and

evaluating teaching performance. Journal of Educational Psychology,

76, 45-54.

Murphy, K. R., Garcia, M., & Kerkar, S. (1980). Accuracy in observing

and rating teacher behavior. Unpublished manuscript, Rice University.

Murphy, K. R., Garcia, M., Kerkar, S., Martin, C., & Balzer, W. K.

(1982). Relationship between observational accuracy and accuracy

in evaluating performance. Journal of Applied Psychology, 67, 320-325.

Murphy, K. R., Martin, C., & Garcia, M. (1982). Do behavioral observation scales measure observation? Journal of Applied Psychology, 67,

562-567.

Murphy, K. R., & Reynolds, D. H. (1988). Does true halo affect observed halo? Journal of Applied Psychology, 73, 235-238.

Pulakos, E. D. (1986). The development of training programs to increase accuracy with different rating tasks. Organizational Behavior

and Human Decision Processes, 38, 76-91.

Ruddy, T. M., & Kavanaugh, M. J. (1986). Performance appraisal: A

review of four training methods. Presented at the annual meeting of

the Southeastern Psychological Association, Orlando, FL.

Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings:

Assessing the psychometric quality of rating data. Psychological Bulletin, 88, 413–428.

Snedecor, G. W., & Cochran, W. G. (1967). Statistical methods. Ames,

IA: Iowa State University Press.

Sulsky, L. M., & Balzer, W. K. (1986). The behavioral diary format:

Increasing rating accuracy through consideration of rater cognitive

processes. Presented at the Midwestern Psychological Association

Annual Convention, Chicago, IL.

Sulsky, L. M., & Balzer, W. K. (1988). The meaning and measurement

of performance rating accuracy: Some methodological and theoretical concerns. Journal of Applied Psychology, 73, 497-506.

Tallarigo, R. S. (1986). [Conceptual similarity and rating accuracy].

Unpublished raw data.

Received February 29, 1988

Revision received January 18, 1989

Accepted January 19, 1989
