Technical Note: Of the 840 Prober responses analysed, 4 could not be categorised, due to error in completing the Prober answer sheets, leaving a sample size of 836.
Table 2: Calibration graph for Detective Sergeants
In Table 2, the 45 degree diagonal line represents perfect calibration. A line predominantly below this would represent under-confidence in knowledge judgments; a line predominantly above, as in this case, represents over-confidence. Indeed, the calibration line for Detective Sergeants suggests significant over-confidence. Detective Sergeants, as a group, think they know a lot more than they actually do!
On closer inspection, we see that the confidence levels (assigned as probabilities) do produce a meaningful ranking. Thus for the set of answers at the 100% confidence level, there is a higher proportion correct than at the 75% level, and in turn the set at the 75% level has a higher proportion correct than at the 50% level, and so on. Put another way, there is a positive correlation between the probability assigned and the proportion of correct answers. However, despite this positive correlation, the calibration is poor. Interestingly, at the 0% and 25% levels, Detective Sergeants are under-confident as a group, particularly so at the 0% level. At the 50%, 75% and 100% levels, significant over-confidence is displayed. Furthermore, the more confidence Detective Sergeants have in their answers, the poorer their calibration becomes, until at the 100% level they display more than twice the level of confidence that can be justified on the basis of their performance.
Further, it should be noted that Detective Sergeants used higher confidence levels much more frequently than lower ones. Indeed, once more there is a clear rank order evident from Table 1. On only 56 occasions was 0% assigned, on 104 occasions 25% was assigned, on 164 occasions 50% was assigned, on 205 occasions 75% was assigned, and at the top of the rank order, on no fewer than 307 occasions Detective Sergeants were 100% confident their answers were correct.
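The calibration comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `calibration_table` and the toy responses are hypothetical and are not the study's data. Each row pairs a confidence level with the proportion of answers at that level that were actually correct; a positive gap indicates over-confidence and a negative gap under-confidence.

```python
from collections import defaultdict


def calibration_table(responses):
    """Summarise (confidence, is_correct) pairs by confidence level.

    `confidence` is the probability the respondent assigned (0.0 to 1.0).
    Returns rows of (level, n, proportion_correct, gap), where
    gap = level - proportion_correct.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for confidence, is_correct in responses:
        totals[confidence] += 1
        hits[confidence] += int(is_correct)
    rows = []
    for level in sorted(totals):
        proportion = hits[level] / totals[level]
        rows.append((level, totals[level], proportion, level - proportion))
    return rows


# Hypothetical toy responses, invented for illustration only.
toy_responses = (
    [(0.0, True)] + [(0.0, False)] * 3      # 1 of 4 correct at 0%
    + [(0.5, True)] * 2 + [(0.5, False)] * 2  # 2 of 4 correct at 50%
    + [(1.0, True)] * 2 + [(1.0, False)] * 2  # 2 of 4 correct at 100%
)

table = calibration_table(toy_responses)
for level, n, proportion, gap in table:
    print(f"{level:.0%} confident: n={n}, correct={proportion:.0%}, gap={gap:+.2f}")
```

In the toy data the 100% level is correct only half the time, which is the same kind of pattern as the "more than twice the level of confidence" finding, while the 0% level shows a negative gap (under-confidence).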
Thus in the small number of occasions on which Detective Sergeants display under-confidence, it is when they themselves have little or no confidence in their answers (the 0% and 25% confidence levels). But in the substantial majority of occasions, where Detective Sergeants are confident (at the 50%, 75% and 100% levels), significant over-confidence is displayed.
Does this matter? Is effective judgment of this type important? All 84 Detective Sergeants agreed it would be helpful if their calibration was good. The reasoning for agreeing is compelling. Over-confidence in judgments might lead, for example, to erroneous conclusions being drawn from evidence. Over-confidence would be likely to encourage pursuing an initial hypothesis rather than maintaining a truly open mind. Over-confidence in an initial hypothesis could lead to a disregard of other possibilities. Where only one hypothesis is held (as presumably would be the case where officers are 100% confident in their initial judgment), it has been shown that evidence will be selectively sought to confirm the hypothesis under consideration. (See Snyder, M. Seek and ye shall find. In Higgins, E.T., Herman, C.P., & Zanna, M.P. (eds) Social cognition: The Ontario symposium on personality and social psychology. Hillsdale, N.J.: Erlbaum, 1981.) Similar arguments have been made by a wide range of other writers (Einhorn, H.J., & Hogarth, R.M. Confidence in judgment: Persistence in the illusion of validity. Psychological Review, 1978, 85, 395-416. Ross, L., Lepper, M.R., Strack, F., & Steinmetz, J. Social explanation and social expectation: Effects of real and hypothetical explanations on subjective likelihood. Journal of Personality and Social Psychology, 1977, 35, 817-829. Oskamp, S. Overconfidence in case study judgments. Journal of Consulting Psychology, 1965, 29, 261-265. Mullin, R. Decisions and judgements in NVQ based assessment. London: NCVQ, 1992.) The consequences of this can be extremely serious.
In summarising the findings of a range of studies, Elstein and Bordage (Elstein A.S., & Bordage G. Psychology of clinical reasoning. In Dowie J.A., & Elstein A.S. Professional judgment: a reader in clinical decision making. Cambridge: CUP, 1988, 109-129) have shown that there is a marked tendency for judges to overemphasize positive findings. That is, a tendency to overemphasize the importance of findings which confirm a hypothesis, and to give too little weight to disconfirming evidence. This will be a particular threat if only one hypothesis has been raised in the first instance because of the over-confidence the individual may have in his or her own judgment. This could be an important contribution to serious error in terms of case outcome. Here surely is an issue worthy of further consideration. Poor calibration, then, is likely to be a poor servant of the police.
But it is not only calibration that is of concern from this initial piece of investigation. Receiver Operator Characteristic Curve Analysis (ROC curve hereafter) reveals that Detective Sergeants also displayed poor discrimination in relation to their knowledge. That is, they had only a modest ability to distinguish between what they knew and what they didn't know. Ideally, an ROC curve measures the discrimination power of individuals. Once more, however, it has been applied here only to the group. It is based on plotting the True Positive Rate of judgments against the False Positive Rate of judgments, using each probability level in turn as the cut-off for saying "positive". Table 3 below displays the data, and Table 4 displays the ROC curve.
Table 3: True and False Positive Rates by Probability Judgment
Table 4: Receiver Operator Characteristic Curve for the Detective Sergeant Group
If the Detective Sergeants had no discriminatory power - were useless by this measure - their curve would follow the 45 degree diagonal and have an area under it of 50%. Perfect discrimination - the capacity to distinguish perfectly between correct and wrong answers - would lead to an area under the curve of 100%. In this case, as can be seen, the Detective Sergeants' ROC curve produces an area under the curve of approximately 58%. This means that, if presented with a correct and a wrong answer, they would be almost as well advised to toss a coin to identify which was which. (They are better than a toss of the coin at discriminating between their own correct and incorrect answers, but only modestly better.)
Poor discrimination has important implications. It is clearly advantageous to "know when one knows" and "know when one doesn't know" when making judgments. Knowing that one doesn't know may lead to further investigation, reference to other sources, or at least maintaining an open mind before a judgment is made. Knowing when one does know may lead to effective deployment of knowledge in the interest of accurate judgments and decisions. To be confused about what is known and what is not known is unlikely to produce any benefits in terms of judgment and decision making.
We have discussed both calibration and discrimination. It is possible to have good calibration but poor discrimination; similarly, it is possible to have good discrimination but poor calibration. The ideal, of course, is to have both good calibration and good discrimination. This investigation suggests that our respondents as a group have both poor calibration and poor discrimination. They are both over-confident about their police related knowledge and poor at discriminating between what they know and what they don't know in police related matters. Given the importance of judgment for the police, the above should be matters for further inquiries.
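The ROC construction described for Tables 3 and 4 can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the function names and the toy responses are hypothetical. Each confidence level is taken in turn as the cut-off for saying "positive" (that is, "this answer is correct"), and the area under the resulting curve is found with the trapezoid rule. The last function checks the coin-toss reading of the area: it is the chance that a randomly chosen correct answer was given higher confidence than a randomly chosen wrong one, with ties counted as half.

```python
def roc_points(responses):
    """Return (false_positive_rate, true_positive_rate) points.

    `responses` is a list of (confidence, is_correct) pairs. An answer is
    called "positive" when its confidence is at or above the cut-off.
    """
    n_correct = sum(1 for _, ok in responses if ok)
    n_wrong = len(responses) - n_correct
    levels = sorted({c for c, _ in responses}, reverse=True)
    points = [(0.0, 0.0)]  # a cut-off above every level calls nothing positive
    for cutoff in levels:
        tp = sum(1 for c, ok in responses if ok and c >= cutoff)
        fp = sum(1 for c, ok in responses if not ok and c >= cutoff)
        points.append((fp / n_wrong, tp / n_correct))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def pairwise_discrimination(responses):
    """Chance that a correct answer outranks a wrong one (ties count half)."""
    right = [c for c, ok in responses if ok]
    wrong = [c for c, ok in responses if not ok]
    score = sum(1.0 if r > w else 0.5 if r == w else 0.0
                for r in right for w in wrong)
    return score / (len(right) * len(wrong))


# Hypothetical toy responses, invented for illustration only.
toy_responses = [
    (1.0, True), (1.0, True), (0.5, True), (0.0, True),
    (1.0, False), (0.5, False), (0.0, False), (0.0, False),
]

auc = area_under_curve(roc_points(toy_responses))
print(f"area under ROC curve: {auc:.3f}")
print(f"pairwise discrimination: {pairwise_discrimination(toy_responses):.3f}")
```

The two printed figures agree, which is why an area close to 50% can be read as "little better than tossing a coin" and an area of 100% as perfect discrimination.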
An important issue not addressed here is why poor calibration and discrimination should become features of police judgment. Any future work should not only conduct a more rigorous examination of calibration and discrimination, but also study what it is about policing, police culture and police practice that may encourage the development of such behaviour.
As is evident, there was an understandable variation across the different cases. However, it is worth noting that even in the case eliciting the best performance (case study 1), a substantial number of initial judgments were inaccurate.
Without any discussion of the results, the same case studies were re-distributed, although each group received a different case study to address possible biasing effects from the experience of having already made a single judgment. This time each group member was given a card and asked to record up to three possible explanations/hypotheses and to give a percentage chance for each subsequently being shown to be the case. (The percentages attributed had to add up to 100%.) The results are recorded in Table 6 below.
Table 6: Multiple Hypotheses Judgment
These tables are interesting, but not surprising, given research into de-biasing techniques. When officers provided a series of prior percentage chances for their three most favoured explanations, in only 2 out of 84 cases (both relating to case study 3) was the gold standard verdict not present as one of three possible explanations.
(The two cases have been treated in Table 6 as implying a 0% probability being attributed to the gold standard verdict explanation.)
On reflection, perhaps one aspect of the above is surprising. Although the vast majority of respondents were content at an earlier stage to put forward a single hypothesis, none of the eighty-four felt strongly enough to award 100% to a single hypothesis when asked to provide up to three (implying they could have recorded only one, two or three). Indeed, of the eighty-four responses, in only one case did a single hypothesis attract a rating as high as 90%. Such initial uncertainty will not be usefully captured if officers are expected to produce only one hypothesis at an early stage in investigations.
This exercise, based on paper case studies, appears to confirm that when a non-Bayesian approach is adopted, and, with over-confidence, Detective Sergeants set up a single initial explanation or hypothesis, significant error can be expected. This can have serious consequences for an investigation. It can bias the way in which an investigation is conducted, at least until such time as incoming evidence leads to a reappraisal. And time lost could be a costly side effect of such a strategy.
However, the very good news would appear to be that when as few as three initial hypotheses are required for consideration, these will include the "correct" judgment in the overwhelming majority of cases. If a thorough gathering of evidence follows, and if each piece of evidence is considered against all hypotheses, it would be expected that the Bayesian process of revising chances in the light of incoming evidence will optimise the chance of investigating officers eventually settling on the correct judgment, and having such judgment supported by effective consideration of evidence.
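The revision of chances described above can be illustrated with a short sketch of Bayes' rule applied to three competing hypotheses. Everything specific here is an assumption made for illustration: the function name `revise`, the hypothesis labels, the prior chances and the likelihoods are all hypothetical and do not come from the case studies. Each prior is multiplied by how likely the new piece of evidence would be if that hypothesis were true, and the results are rescaled so that they again add up to 100%.

```python
def revise(priors, likelihoods):
    """Revise chances for competing hypotheses after one piece of evidence.

    `priors` maps each hypothesis to its prior chance (summing to 1).
    `likelihoods` maps each hypothesis to the chance of seeing the
    evidence if that hypothesis were true.
    """
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("evidence is impossible under every hypothesis held")
    return {h: value / total for h, value in unnormalised.items()}


# Hypothetical figures, invented for illustration only.
priors = {"A": 0.5, "B": 0.3, "C": 0.2}       # initial percentage chances
likelihoods = {"A": 0.1, "B": 0.6, "C": 0.3}  # chance of the evidence under each

posterior = revise(priors, likelihoods)
for hypothesis, chance in posterior.items():
    print(f"{hypothesis}: {chance:.1%}")

# A single hypothesis held at 100% cannot be revised by this evidence.
fixed = revise({"A": 1.0, "B": 0.0, "C": 0.0}, likelihoods)
print(fixed)
```

In this toy example the initially favoured explanation A is overtaken by B once the evidence is weighed against all three hypotheses, whereas an officer who began with 100% on A remains at 100% on A whatever this evidence shows. That contrast is the practical argument for recording several initial hypotheses rather than one.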
Clearly, this latter, optimistic, evaluation of adopting a Bayesian strategy requires thorough testing. However, if some of the earlier presented findings suggest Detective Sergeants are over-confident in their judgments and poor at discriminating between what they know and what they don't know, the later exercise suggests that all is not lost. When stimulated or required to operate in a more Bayesian manner, there would appear to be grounds for some optimism.
From a training point of view, it suggests that it may well be worthwhile introducing training in judgment - particularly in Bayesian approaches to judgment - into detective training.
These, however, are all matters for further study, serious consideration and reflection. The only crime in this case would be to ignore the evidence of this initial investigation and fail to address the important issues raised.