

Police Judgment

ROGER MULLIN

  A version of this article first appeared in Policing vol 11 no. 4, 1995.

"Issues of judgment and decision making await thorough scrutiny, as a basis for future action aimed at improving performance"

Copyright  ©1999

 Unauthorised copying of this article may improve your judgment.

  Preamble

"Friends tell me that they are good judges - particularly about people- when they make holistic, ineffable, intuitive and unsystematic judgments. I agree only that they think they are good judges." (Dawes, RM.. You can't systematize human judgment: Dyslexia. New Directions for Methodology of Social and Behavioural Science, 1980, 4, 67-78.)

Like many writers on judgment before and since, Robyn Dawes in the above quote was challenging a common-sense (but inaccurate) assumption concerning the quality of human judgment. Another common-sense belief about human judgment is that, at an individual level, its quality is positively correlated with confidence. This belief led, and indeed still leads, to much education and training in the field of judgment and decision making being based on the assumption that enhancement of confidence can be equated with improved capability. This fallacy was exposed as far back as 1954, when it was demonstrated that the most confident diagnosticians tend to be the least accurate. (Holsopple, J.G., & Phelan, J.G. The skills of clinicians in analysis of projective tests. Journal of Clinical Psychology, 1954, 10, 307-320.)

 In these circumstances it is perhaps unsurprising that most professionals have been unenthusiastic regarding research into the judgment and decision making capability of practitioners. The medical profession, which to its credit has been most open to a critical appraisal of its judgment and decision making ability, is therefore a leader among a very reluctant community. However, even in the professional setting of medicine it has been claimed that,

"Clinicians have......been suspicious of attempts to explore (judgment and decision making) systematically with a view to making explicit their precise character." (Dowie, J.A., & Elstein A.S. Professional Judgment: A reader in clinical decision making. Cambridge: CUP, 1988.)

The situation is changing, however. There is a growing awareness among most professions that old standards are no longer sufficient. Errors of judgment are increasingly likely to be exposed as such. To claim "in my judgment, based on years of training and experience..." is no longer a convincing or sufficient defence. Growing public interest in and concern about judgment issues - whether relating to medical cases where incompetence is claimed, or to miscarriages of justice where police investigation is held at fault - has seen to it that there are fewer and fewer hiding places. Issues of judgment and decision making await thorough scrutiny, as a basis for future action aimed at improving performance.

Judgment and decision making

In this essay I distinguish between judgment and decision making. Decision making implies making a choice among alternatives. Thus the police may have to make a decision as to whether to arrest a suspect or not. This choice, in a rational world, will be based on prior judgments, such as whether or not there is sufficient evidence to warrant arrest. (Note that in a particular case it is quite possible that, although it is judged that there is sufficient evidence, a decision may nonetheless be taken not to arrest and another choice exercised. This makes the point that making judgments does not remove the need to make a decision as to what to do.)

Decision making must be the focus of subsequent work. For the present I wish to address issues of judgment only. Further, I wish to do so only after issuing a warning to you the reader. The evidence presented in this essay does not allow for definitive conclusions. It is partial and in many respects lacking in research rigour. Its principal intention is therefore to establish the case for more substantial research and investigation. However, the conclusions should not be dismissed lightly. They match research findings from other professional arenas, as the references to more substantial studies indicate.

Police Judgment

To the great credit of the Scottish Police College, a serious interest in matters of judgment has developed, following research into the training needs of a number of ranks. One of these ranks was that of Detective Sergeants, who often have to make judgments at scenes of crime that have very important consequences for the conduct of investigations. It was recognised by the college that the matter required addressing, and they agreed that two types of exercise would be developed for, and tested on, Detective Sergeants during training. This essay reports on the findings from the participation of 84 Detective Sergeants in these exercises.

 The exercises were devised to ascertain whether or not Detective Sergeants were prone to some of the biases and errors in judgment that have been discovered in other professional arenas. The exercises provide information which can be analysed and quantified to give specific measures of performance. These are explained below. The results are significant enough to warrant this journey into print, though I again stress that the prime aim is to convince the reader that more research is needed.

Measuring Judgment

The first type of exercise was aimed at measuring two qualities of good judgment. The first of these qualities is calibration. Calibration measures the predictive value of judgments. That is, if a Detective Sergeant is certain (100% sure) about a set of judgments, he or she will demonstrate perfect calibration only by being correct in all cases in the set. If there is a set of judgments in which a Detective Sergeant is, say, 50% sure, perfect calibration means getting 50% of such judgments correct (no more, no less). Calibration can give us insight into the confidence level of respondents. In terms of confidence, three broad positions are possible: well calibrated, under-confident and over-confident.

Given the research literature, it was hypothesised that Detective Sergeants would be over-confident. (See Arkes, H.R. Impediments to accurate clinical judgment and possible ways to minimize their impact. Journal of Consulting and Clinical Psychology, 1981, Vol. 49, No. 3, 323-330, for a summary of key literature on overconfidence.) The exercise was set up as follows.

The Detective Division of  The Scottish Police College was asked to produce 10 questions for which there were known and uncontroversial answers. All questions were to be police related. Questions were sought which would range from "easy" ones that a Detective Sergeant could be expected to know very readily, through to relevant but more obscure questions where Detective Division expected few to be able to respond correctly.

Note that only 10 questions were developed. Such a small number is insufficient for the analysis of individual performance (somewhere in the order of 100 or more questions would be required before individual analysis becomes possible). But as class sizes ranged from 14 to 24, this was quite sufficient to give a group measure. (For example, twenty-four individuals answering 10 questions each provide 240 separate responses.) The exercise was used for demonstration purposes and to provide an overview of the performance of Detective Sergeants "on average", rather than to provide any meaningful data about specific individuals.

The questions required two responses. First, an answer to each question had to be entered (leaving it blank was not allowed), and second, a probability of this answer being correct had to be given. For the purpose of the exercise, five probability or confidence levels were used: 0% (certain the answer is wrong), 25%, 50%, 75% and 100% (certain the answer is correct). This style of exercise has become known as "Probers" after the pioneering work of Dr. Jack Dowie of The Open University. (The course Professional Judgment and Decision Making (D300) from The Open University has pioneered the use of "Probers" among students of judgment and decision making. Introductory Texts 1-7 of the course are teaching texts which cover both calibration and discrimination measures.)

 The theoretical basis of Probers is Bayesian. It has been argued that,

"From the Bayesian perspective, knowledge is represented in terms of statements or hypotheses, Hi, each of which is characterized by a subjective probability, P(Hi), representing one's confidence in its truth." (See Fischoff, B., & Beyth- Marom, R. Hypothesis evaluation from a Bayesian perspective. Psychological Review, 1983, 90, 239-260. and  DeFinetti, B. Probability: Beware of falsifications! Scientia, 1976, 3, 283-303.)

Analysis

The Prober answers were collated by probability level and by correct or incorrect response.

Table 1 below records the cumulative totals from the 84 respondents. These were then plotted onto a calibration graph (Table 2).

Table 1. Prober answers by confidence level by correct/incorrect response.
 
Judgment % assigned | Times assigned to correct answer | Times assigned to incorrect answer | Total times assigned | Proportion assigned to correct answer
100 | 140 | 167 | 307 | 46%
75 | 78 | 127 | 205 | 38%
50 | 54 | 110 | 164 | 33%
25 | 31 | 73 | 104 | 30%
0 | 12 | 44 | 56 | 21%
Total | [315] | [521] | [836] |
 

Technical Note: Of the 840 Prober responses analysed, 4 could not be categorised, due to error in completing the Prober answer sheets, leaving a sample size of 836.

Table 2. Calibration graph for Detective Sergeants.

In Table 2, the 45 degree diagonal line represents perfect calibration. A line predominantly below this would represent under-confidence in knowledge judgments; a line predominantly above, as in this case, represents over-confidence. Indeed, the calibration line for Detective Sergeants suggests significant over-confidence. Detective Sergeants, as a group, think they know a lot more than they actually do!

On closer inspection, we see that the confidence levels (assigned as probabilities) do produce a meaningful ranking. Thus for the set of answers at the 100% confidence level, there is a higher proportion correct than at the 75% level, and in turn the set at the 75% level has a higher proportion correct than at the 50% level, and so on. Put another way, there is a positive correlation between the probability assigned and the proportion of correct answers.

However, despite a positive correlation, the calibration is poor. Interestingly, at the 0% and 25% levels, Detective Sergeants are under-confident as a group, particularly so at the 0% level.

At the 50%, 75% and 100% levels, significant over-confidence is displayed. Furthermore, the more confidence Detective Sergeants have in their answers, the poorer their calibration becomes, until at the 100% level they display more than twice the level of confidence that can be justified on the basis of their performance.

Further, it should be noted that Detective Sergeants used higher confidence levels much more frequently than lower ones. Indeed once more there is a clear rank order evident from Table 1.

On only 56 occasions was 0% assigned, on 104 occasions 25% was assigned, on 164 occasions 50% was assigned, on 205 occasions 75% was assigned, and at the top of the rank order, on no fewer than 307 occasions Detective Sergeants were 100% confident their answers were correct.

Thus on the small number of occasions when Detective Sergeants display under-confidence (160 of the 836 responses), it is when they themselves have little or no confidence in their answers (the 0% and 25% confidence levels). But on the substantial majority of occasions, where Detective Sergeants are confident (at the 50%, 75% and 100% levels), significant over-confidence is displayed.
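The calibration figures discussed above can be reproduced directly from the counts in Table 1. The following is a minimal sketch in Python, offered for illustration only; the function name `calibration_rows` and the idea of reporting a "gap" (confidence assigned minus percentage correct) are assumptions of this sketch, not part of the original exercise.

```python
# Counts transcribed from Table 1: confidence level (%) -> (correct, incorrect).
TABLE_1 = {
    100: (140, 167),
    75: (78, 127),
    50: (54, 110),
    25: (31, 73),
    0: (12, 44),
}


def calibration_rows(table):
    """Return (confidence %, total, % correct, gap) for each confidence level.

    gap = confidence assigned minus % correct (an illustrative measure):
    positive means over-confidence, negative means under-confidence.
    """
    rows = []
    for confidence, (correct, incorrect) in sorted(table.items(), reverse=True):
        total = correct + incorrect
        pct_correct = 100.0 * correct / total
        rows.append((confidence, total, pct_correct, confidence - pct_correct))
    return rows


rows = calibration_rows(TABLE_1)
for confidence, total, pct_correct, gap in rows:
    if gap > 0:
        label = "over-confident"
    elif gap < 0:
        label = "under-confident"
    else:
        label = "well calibrated"
    print(f"{confidence:>3}% sure: {pct_correct:4.1f}% correct "
          f"of {total} responses ({label}, gap {gap:+.1f} points)")
```

A perfectly calibrated group would show a gap of zero at every level. Here the gap widens as the assigned confidence rises, which is the pattern described in the text.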

Does this matter? Is effective judgment of this type important? All 84 Detective Sergeants agreed it would be helpful if their calibration was good. The reasoning for agreeing is compelling.

Over-confidence in judgments might lead, for example, to erroneous conclusions being drawn from evidence. Over-confidence would be likely to encourage pursuing an initial hypothesis rather than maintaining a truly open mind. Over-confidence in an initial hypothesis could lead to a disregard of other possibilities. Where only one hypothesis is held (as presumably would be the case where officers are 100% confident in their initial judgment), it has been shown that evidence will be selectively sought to confirm the hypothesis under consideration. (See Snyder, M. Seek and ye shall find. In Higgins, E.T., Herman, C.P., & Zanna, M.P. (eds) Social cognition: The Ontario symposium on personality and social psychology. Hillsdale, N.J.: Erlbaum, 1981.) Similar arguments have been made by a wide range of other writers. (Einhorn, H.J., & Hogarth, R.M. Confidence in judgment: Persistence in the illusion of validity. Psychological Review, 1978, 85, 395-416. Ross, L., Lepper, M.R., Strack, F., & Steinmetz, J. Social explanation and social expectation: Effects of real and hypothetical explanations on subjective likelihood. Journal of Personality and Social Psychology, 1977, 35, 817-829. Oskamp, S. Overconfidence in case study judgments. Journal of Consulting Psychology, 1965, 29, 261-265. Mullin, R. Decisions and judgements in NVQ based assessment. London: NCVQ, 1992.)

The consequences of this can be extremely serious. In summarising the findings of a range of studies, Elstein and Bordage (Elstein A.S., & Bordage G. Psychology of clinical reasoning. In Dowie J.A., & Elstein A.S. Professional judgment: a reader in clinical decision making. Cambridge: CUP, 1988, 109-129) have shown that there is a noted tendency for judges to overemphasize positive findings: that is, a tendency to overemphasize the importance of findings which confirm a hypothesis, and to give too little weight to disconfirming evidence. This is a particular threat if only one hypothesis has been raised in the first instance because of the over-confidence the individual may have in his or her own judgment. This could be an important contributor to serious error in terms of case outcome. Here surely is an issue worthy of further consideration.

Poor calibration, then, is likely to be a poor servant of the police. But it is not only calibration that is of concern from this initial piece of investigation. Receiver Operator Characteristic Curve analysis (ROC curve hereafter) reveals that Detective Sergeants also displayed poor discrimination in relation to their knowledge. That is, they had only a modest ability to distinguish between what they knew and what they didn't know.

Ideally, an ROC curve measures the discrimination power of individuals. Once more, however, the analysis has been applied only to the group. It is based on plotting the True Positive Rate of judgments against the False Positive Rate of judgments, using each probability level in turn as the cut-off for saying "positive". Table 3 below displays the data, and Table 4 displays the ROC curve.

Table 3: True and False Positive Rates by Probability Judgment

 
Cut off scores | Probabilities as cut off scores | True Positives | False Positives | True Positive Rate | False Positive Rate
100 | 100 | 140 | 167 | (140/315) 44% | (167/521) 32%
75 | 75 | (140+78) 218 | (167+127) 294 | (218/315) 69% | (294/521) 56%
50 | 50 | (218+54) 272 | (294+110) 404 | (272/315) 86% | (404/521) 78%
25 | 25 | (272+31) 303 | (404+73) 477 | (303/315) 96% | (477/521) 92%
0 | 0 | (303+12) 315 | (477+44) 521 | (315/315) 100% | (521/521) 100%
 

Notes:

  1. The method of calculating the true positives and false positives, and their corresponding rates, from the data transferred from Table 1 is shown in parentheses.
  2. For an appreciation of the methodology, refer to Professional Judgment and Decision Making (D300), The Open University, Introductory Texts 1-7.  
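The cumulative method behind Table 3 can be sketched in a few lines of Python. This is an illustrative reconstruction from the Table 1 counts, not the procedure used in the D300 course materials; the function name `roc_points` is an assumption of this sketch.

```python
# Counts from Table 1, ordered from the highest confidence level to the lowest.
LEVELS = [100, 75, 50, 25, 0]
CORRECT = [140, 78, 54, 31, 12]
INCORRECT = [167, 127, 110, 73, 44]


def roc_points(correct, incorrect):
    """Return one (true positive rate, false positive rate) pair per cut-off.

    At each cut-off, every answer given at that confidence level or higher
    counts as "positive": correct answers are true positives and incorrect
    answers are false positives.
    """
    total_correct = sum(correct)
    total_incorrect = sum(incorrect)
    points = []
    true_pos = 0
    false_pos = 0
    for n_correct, n_incorrect in zip(correct, incorrect):
        true_pos += n_correct      # running total of true positives
        false_pos += n_incorrect   # running total of false positives
        points.append((true_pos / total_correct, false_pos / total_incorrect))
    return points


points = roc_points(CORRECT, INCORRECT)
for level, (tpr, fpr) in zip(LEVELS, points):
    print(f"cut-off {level:>3}%: TPR {tpr:.0%}, FPR {fpr:.0%}")
```

Each printed pair corresponds to one row of Table 3 and one point on the ROC curve.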

  Table 4: Receiver Operator Characteristic Curve for the Detective Sergeant Group

If the Detective Sergeants had no discriminatory power - were useless by this measure - their curve would follow the 45 degree diagonal and have an area under it of 50%. Perfect discrimination - the capacity to distinguish perfectly between the correct and wrong answers - would lead to an area under the curve of 100%. In this case, as can be seen, the Detective Sergeants' ROC curve produces an area under the curve of approximately 58%. This means that, if presented with a correct and a wrong answer, they would be almost as well advised to toss a coin to identify which was which. (They are better than a toss of the coin at discriminating between their own correct and incorrect answers, but only modestly better.)
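The area figure can be checked approximately from the Table 1 counts. The sketch below assumes the ROC points are joined by straight lines and uses the trapezoidal rule; the essay does not state how the area was originally calculated, so this is one plausible method rather than the author's own. The function name `area_under_roc` is hypothetical.

```python
# Counts from Table 1, ordered from the highest confidence level to the lowest.
CORRECT = [140, 78, 54, 31, 12]
INCORRECT = [167, 127, 110, 73, 44]


def area_under_roc(correct, incorrect):
    """Approximate the area under the ROC curve with the trapezoidal rule.

    Assumption of this sketch: the curve starts at (0, 0) and successive
    points are joined by straight line segments.
    """
    total_correct = sum(correct)
    total_incorrect = sum(incorrect)
    area = 0.0
    prev_tpr = 0.0
    prev_fpr = 0.0
    true_pos = 0
    false_pos = 0
    for n_correct, n_incorrect in zip(correct, incorrect):
        true_pos += n_correct
        false_pos += n_incorrect
        tpr = true_pos / total_correct
        fpr = false_pos / total_incorrect
        # Width along the false-positive axis times the mean height.
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
    return area


auc = area_under_roc(CORRECT, INCORRECT)
print(f"Area under the ROC curve: {auc:.1%}")
```

Under these assumptions the area comes out at a little over 58%, in line with the figure quoted above: better than the 50% of a coin toss, but only modestly so.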

Poor discrimination has important implications. It is clearly advantageous to "know when one knows" and "know when one doesn't know" when making judgments. Knowing that one doesn't know may lead to further investigation, reference to other sources, or at least maintaining an open mind before a judgment is made. Knowing when one does know may lead to effective deployment of knowledge in the interest of accurate judgments and decisions. To be confused about what is known and what is not known is unlikely to produce any benefits in terms of judgment and decision making.

We have discussed both calibration and discrimination. It is possible to have good calibration but poor discrimination; similarly, it is possible to have good discrimination but poor calibration. The ideal, of course, is to have both good calibration and good discrimination. This investigation suggests that our respondents as a group have both poor calibration and poor discrimination. They are both over-confident about their police related knowledge and poor at discriminating between what they know and what they don't know in police related matters.

Given the importance of judgment for the police, the above should be matters for further inquiries.

An important issue not addressed here is why poor calibration and discrimination would become features of police judgment. Any future work should not only conduct a more rigorous examination of calibration and discrimination, but also study what it is about policing, police culture and police practice which may encourage the development of such behaviour.

Judging evidence

Besides being effective judges of their own knowledge and being able to apply that knowledge effectively, the police of course must also be skilled in judging other matters, such as evidence at scenes of crime. This is a complex matter, but the Scottish Police College sought some examination of this context, albeit via class based activity.

Thus a second type of exercise was designed to assess whether the style by which initial judgments are made has any implications. Work on so-called de-biasing techniques, such as presenting people with the need to address alternatives, has suggested that improvements may be possible in the approach people take. (See Koriat, A., Lichtenstein, S., & Fischoff, B. Reasons for confidence. Journal of Experimental Psychology: Human Learning and Memory, 1980, 6, 107-118.) It was therefore decided to assess whether the development of a simple judgment aid, based on Bayesian principles, had potential for the police when making judgments about events at scenes of crime.

Once more Detective Division at the Scottish Police College was involved in constructing the exercise. Three case studies were constructed, each based on an actual case of recent vintage. The cases chosen had to have subsequently been satisfactorily "solved" from a policing standpoint, thus providing a gold-standard verdict against which Detective Sergeant judgments could be compared. The case studies were written up describing the scene of the crime in terms of  those matters which were deemed relevant and available to an investigating officer.

Three groups were formed in each class, with each group being given a different case study. The Detective Sergeants were asked to read the information carefully and then write down their judgment, as individuals, of what had happened in terms of crime or event. Only two of the eighty-four respondents initially objected to being asked to provide a single judgment. They agreed on request to complete the exercise on trust. (At the later debriefing stage, most participants agreed it was not untypical for them to generate a single hypothesis. However, some suggested, after receiving the results, that they had other approaches which prevented them from considering only one hypothesis.)

Table 5 indicates for each case the percentage of judgments which were correct using the gold standard verdict mentioned above as the benchmark.

Table 5: Correct Judgments Using a Single Hypothesis

 
Case Study | Number of Judgments | Number of Correct Judgments Given | Percentage of Correct Judgments
1 | 28 | 17 | 61%
2 | 28 | 12 | 43%
3 | 28 | 13 | 46%
 

As is evident, there was an understandable variation across the different cases. However, it is worth noting that even in the case eliciting the best performance (case study 1), a substantial number of initial judgments were inaccurate.

Without any discussion of the results, the same case studies were re-distributed, with each group receiving a different case study from the one it had first considered, to address possible biasing effects from the experience of having already made a single judgment. This time each group member was given a card and asked to record up to three possible explanations/hypotheses and to give a percentage chance for each subsequently being shown to be the case. (The percentages attributed had to add up to 100%.) The results are recorded in Table 6 below.

Table 6: Multiple Hypotheses Judgment

 
Case Study | Number of Judgments | Times Correct Judgment Included in 3 Hypotheses | Mean Percentage Chance for Correct Judgment | Range of Percentage Chance for Correct Judgment
1 | 28 | 28 | 57% | 40-90%
2 | 28 | 28 | 33% | 5-50%
3 | 28 | 26 | 33% | 0-70%
 

These tables are interesting, but not surprising, given research into de-biasing techniques. When officers provided a series of prior percentage chances for their three most favoured explanations, in only 2 out of  84 cases (both relating to case study 3) was the gold standard verdict not present as one of  three possible explanations.

(The two cases have been treated in Table 6 as implying a 0% probability being attributed to the gold standard verdict explanation.)
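The contrast between the two exercises can be made concrete by pooling the three case studies. The sketch below does this in Python using the counts in Tables 5 and 6. The pooling step is an addition of this sketch and is not reported in the original analysis, and the two rates are not strictly like for like: one counts a single best guess, the other counts inclusion among up to three hypotheses. The function name `pooled_rates` is hypothetical.

```python
# Per case study: (number of judgments,
#                  correct single judgments, from Table 5,
#                  times the correct judgment was among the hypotheses, Table 6).
CASES = {
    1: (28, 17, 28),
    2: (28, 12, 28),
    3: (28, 13, 26),
}


def pooled_rates(cases):
    """Return (single-hypothesis hit rate, multiple-hypothesis inclusion rate)
    pooled across all case studies."""
    judgments = sum(n for n, _, _ in cases.values())
    single_hits = sum(single for _, single, _ in cases.values())
    included = sum(multi for _, _, multi in cases.values())
    return single_hits / judgments, included / judgments


single_rate, included_rate = pooled_rates(CASES)
print(f"Correct when one hypothesis was required: {single_rate:.0%}")
print(f"Correct judgment among up to three hypotheses: {included_rate:.0%}")
```

On these figures, half of the single judgments matched the gold standard verdict, whereas the gold standard verdict appeared among the listed hypotheses in 82 of the 84 responses.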

On reflection, perhaps one aspect is surprising in the above. Although the vast majority of respondents were content at an earlier stage to put forward a single hypothesis, none of the eighty four felt strongly enough to award 100% to a single hypothesis when asked to provide up to three (implying they could have recorded only one, or two or three). Indeed of the eighty four responses, in only one case did a single hypothesis attract a rating as high as 90%. Such initial uncertainty will not be usefully captured if officers are expected to produce only one hypothesis at an early stage in investigations.

This exercise, based on paper case studies, appears to confirm that when a non-Bayesian approach is adopted, and, with over-confidence, Detective Sergeants set up a single initial explanation or hypothesis, significant error can be expected. This can have serious consequences for an investigation. It can bias the way in which an investigation is conducted, at least until such time as incoming evidence leads to a reappraisal. And time lost could be a costly side effect of such a strategy.

However, the very good news would appear to be that when as few as three initial hypotheses are required for consideration, these will include the "correct" judgment in the overwhelming majority of cases. If a thorough gathering of evidence follows, and if each piece of evidence is considered against all hypotheses, it would be expected that the Bayesian process of revising chances in the light of incoming evidence will optimise the chance of investigating officers eventually settling on the correct judgment, and having such judgment supported by effective consideration of evidence.
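The revision process described here can be illustrated with a short Python sketch of Bayes' rule applied to three competing hypotheses. Everything in the example is hypothetical: the hypothesis labels A, B and C, the prior percentages, and the likelihoods attached to each piece of evidence are invented for illustration and do not come from the case studies. The function name `bayes_update` is likewise an assumption of this sketch.

```python
def bayes_update(priors, likelihoods):
    """Revise the chances of competing hypotheses after one piece of evidence.

    priors:      hypothesis -> chance held before the evidence (sums to 1)
    likelihoods: hypothesis -> chance of seeing this evidence if the
                 hypothesis were true
    """
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("evidence is impossible under every hypothesis")
    return {h: value / total for h, value in unnormalised.items()}


# Hypothetical initial judgment: three explanations with chances summing to 100%.
priors = {"A": 0.5, "B": 0.3, "C": 0.2}

# Hypothetical evidence, each item judged against ALL three hypotheses.
evidence = [
    {"A": 0.2, "B": 0.6, "C": 0.6},  # first finding fits B and C better than A
    {"A": 0.3, "B": 0.9, "C": 0.3},  # second finding fits B best
]

beliefs = dict(priors)
for likelihoods in evidence:
    beliefs = bayes_update(beliefs, likelihoods)

for hypothesis, chance in sorted(beliefs.items()):
    print(f"Hypothesis {hypothesis}: {chance:.0%}")
```

In this invented example the initially favoured hypothesis A is overtaken by B once the evidence is weighed against every hypothesis. That cannot happen if A is the only explanation under consideration.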

Clearly, this latter, optimistic, evaluation of adopting a Bayesian strategy requires thorough testing. However, if some of the earlier presented findings suggest Detective Sergeants are over-confident in their judgments and poor at discriminating between what they know and what they don't know, the later exercise suggests that all is not lost. When stimulated or required to operate in a more Bayesian manner, there would appear to be grounds for some optimism.

From a training point of view, it suggests that it may well be worthwhile introducing training in judgment - particularly in Bayesian approaches to judgment - into detective training.

These, however, are all matters for further study, serious consideration and reflection. The only crime in this case would be to ignore the evidence of this initial investigation and fail to address the important issues raised.




© Copyright Inter-ed Ltd 1999-2002