Statistical methods for assessing agreement between two methods of clinical measurement

SUMMARY

In clinical measurement, comparison of a new measurement technique with an established one is often needed to see whether they agree sufficiently for the new to replace the old. Such investigations are often analysed inappropriately, notably by using correlation coefficients. The use of correlation is misleading. An alternative approach, based on graphical techniques and simple calculations, is described, together with the relation between this analysis and the assessment of repeatability.

INTRODUCTION

Clinicians often wish to have data on, for example, cardiac stroke volume or blood pressure where direct measurement without adverse effects is difficult or impossible. The true values remain unknown. Instead indirect methods are used, and a new method has to be evaluated by comparison with an established technique rather than with the true quantity. If the new method agrees sufficiently well with the old, the old may be replaced. This is very different from calibration, where known quantities are measured by a new method and the result compared with the true value or with measurements made by a highly accurate method. When two methods are compared neither provides an unequivocally correct measurement, so we try to assess the degree of agreement. But how?

The correct statistical approach is not obvious. Many studies give the product-moment correlation coefficient (r) between the results of the two measurement methods as an indicator of agreement. It is no such thing. In a statistical journal we have proposed an alternative analysis, [1] and clinical colleagues have suggested that we describe it for a medical readership.

Most of the analysis will be illustrated by a set of data (Table 1) collected to compare two methods of measuring peak expiratory flow rate (PEFR).

 

INAPPROPRIATE USE OF CORRELATION COEFFICIENT

The second step is usually to calculate the correlation coefficient (r) between the two methods. For the data in fig 1, r = 0.94 (p < 0.001). The null hypothesis here is that the measurements by the two methods are not linearly related. The probability is very small and we can safely conclude that PEFR measurements by the mini and large meters are related. However, this high correlation does not mean that the two methods agree:

(1) r measures the strength of a relation between two variables, not the agreement between them. We have perfect agreement only if the points in fig 1 lie along the line of equality, but we will have perfect correlation if the points lie along any straight line.

(2) A change in scale of measurement does not affect the correlation, but it certainly affects the agreement. For example, we can measure subcutaneous fat by skinfold calipers. The calipers will measure two thicknesses of fat. If we were to plot calipers measurement against half-calipers measurement, in the style of fig 1, we should get a perfect straight line with slope 2.0. The correlation would be 1.0, but the two measurements would not agree — we could not mix fat thicknesses obtained by the two methods, since one is twice the other.

(3) Correlation depends on the range of the true quantity in the sample. If this is wide, the correlation will be greater than if it is narrow. For those subjects whose PEFR (by peak flow meter) is less than 500 l/min, r is 0.88 while for those with greater PEFRs r is 0.90. Both are less than the overall correlation of 0.94, but it would be absurd to argue that agreement is worse below 500 l/min and worse above 500 l/min than it is for everybody. Since investigators usually try to compare two methods over the whole range of values typically encountered, a high correlation is almost guaranteed.

(4) The test of significance may show that the two methods are related, but it would be amazing if two methods designed to measure the same quantity were not related. The test of significance is irrelevant to the question of agreement.

(5) Data which seem to be in poor agreement can produce quite high correlations. For example, Serfontein and Jaroszewicz [2] compared two methods of measuring gestational age. Babies with a gestational age of 35 weeks by one method had gestations between 34 and 39.5 weeks by the other, but r was high (0.85). On the other hand, Oldham et al. [3] compared the mini and large Wright peak flow meters and found a correlation of 0.992. They then connected the meters in series, so that both measured the same flow, and obtained a “material improvement” (0.996). If a correlation coefficient of 0.99 can be materially improved upon, we need to rethink our ideas of what a high correlation is in this context. As we show below, the high correlation of 0.94 for our own data conceals considerable lack of agreement between the two instruments.
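
Points (1) to (3) above can be checked numerically. What follows is a minimal Python sketch, not part of the quoted paper: the skinfold readings are made-up numbers, the simulated error and spread values are assumptions, and `pearson_r` and `simulate_r` are hypothetical helper names. It shows a perfect correlation between readings that never agree, and a correlation that drops when the same two methods are compared over a narrower range of true values.

```python
import math
import random

def pearson_r(x, y):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical half-calipers readings (mm); the calipers reading is exactly double.
half = [4.0, 6.5, 9.0, 12.5, 15.0, 21.0]
full = [2.0 * h for h in half]

r_scaled = pearson_r(half, full)                        # correlation is perfect
max_gap = max(abs(f - h) for f, h in zip(full, half))   # yet the readings differ

# Range effect: identical measurement error, wide versus narrow spread of true values.
rng = random.Random(1)

def simulate_r(spread, n=2000, error_sd=20.0):
    """Correlation between two simulated methods that share the same true values."""
    true = [rng.gauss(450.0, spread) for _ in range(n)]
    a = [t + rng.gauss(0.0, error_sd) for t in true]
    b = [t + rng.gauss(0.0, error_sd) for t in true]
    return pearson_r(a, b)

r_wide = simulate_r(spread=100.0)
r_narrow = simulate_r(spread=30.0)

print(f"scaled readings: r = {r_scaled:.3f}, largest gap = {max_gap} mm")
print(f"wide range r = {r_wide:.2f}, narrow range r = {r_narrow:.2f}")
```

The two simulated methods have the same error in both runs, so the agreement between them is unchanged; only the spread of the subjects differs.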

MEASURING AGREEMENT

It is most unlikely that different methods will agree exactly, by giving the identical result for all individuals. We want to know by how much the new method is likely to differ from the old: if this is not enough to cause problems in clinical interpretation we can replace the old method by the new or use the two interchangeably. If the two PEFR meters were unlikely to give readings which differed by more than, say, 10 l/min, we could replace the large meter by the mini meter because so small a difference would not affect decisions on patient management. On the other hand, if the meters could differ by 100 l/min, the mini meter would be unlikely to be satisfactory. How far apart measurements can be without causing difficulties will be a question of judgment. Ideally, it should be defined in advance to help in the interpretation of the method comparison and to choose the sample size.

The first step is to examine the data. A simple plot of the results of one method against those of the other (fig 1), though without a regression line, is a useful start, but usually the data points will be clustered near the line and it will be difficult to assess between-method differences. A plot of the difference between the methods against their mean may be more informative. Fig 2 displays considerable lack of agreement between the large and mini meters, with discrepancies of up to 80 l/min; these differences are not obvious from fig 1. The plot of difference against mean also allows us to investigate any possible relationship between the measurement error and the true value. We do not know the true value, and the mean of the two measurements is the best estimate we have. It would be a mistake to plot the difference against either value separately because the difference will be related to each, a well-known statistical artefact. [4]
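
The coordinates for a difference-against-mean plot are simple to compute. Here is a minimal Python sketch, assuming paired readings held in two lists; the PEFR values are invented for illustration (they are not the paper's Table 1), and `difference_against_mean` is a hypothetical helper name.

```python
# Hypothetical paired PEFR readings (l/min), one pair per subject.
large = [490, 400, 510, 415, 430, 640, 180, 420]
mini = [520, 410, 480, 425, 415, 650, 175, 445]

def difference_against_mean(a, b):
    """Return one (mean, difference) pair per subject, ready for plotting."""
    return [((x + y) / 2.0, x - y) for x, y in zip(a, b)]

points = difference_against_mean(large, mini)
for mean, diff in points:
    print(f"mean = {mean:6.1f}   difference = {diff:+d}")

# The largest absolute discrepancy is easy to miss on a plain scatter plot.
largest = max(abs(d) for _, d in points)
print("largest discrepancy:", largest, "l/min")
```

Plotting the first element of each pair on the horizontal axis and the second on the vertical axis gives the kind of display the excerpt calls fig 2.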

via Statistical methods for assessing agreement between two methods of clinical measurement.

Comparing methods of measurement: why plotting difference against standard method is misleading

My reason for jumping into stats was to directly compare two measurement methods… with multiple trials, on multiple ILDs (inter-landmark distances). I don’t really go for “funny name, lol” things, but seeing Bland and Borg cited in the same paper on stats (a subject I long thought of [cluelessly/ignorantly] as boring) is hard to pass up. Eponysterical.

Getting real, though, the issues raised by Bland and Altman are pretty interesting: they argue that many studies of this sort may be relying on misleading analyses. I have tried to duplicate their methods in my own little H.T.-UGR/Inquiry Study.

 

 

Summary

When comparing a new method of measurement with a standard method, one of the things we want to know is whether the difference between the measurements by the two methods is related to the magnitude of the measurement. A plot of the difference against the standard measurement is sometimes suggested, but this will always appear to show a relationship between difference and magnitude when there is none. A plot of the difference against the average of the standard and new measurements is unlikely to mislead in this way. This is shown theoretically and illustrated by a practical example using measurements of systolic blood pressure.
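
The artefact described in this summary can be reproduced with simulated data. The sketch below is not from the paper: it assumes both methods measure the same true systolic pressure with independent errors of equal size, and the means, SDs, and names (`pearson_r`, `standard`, `new`) are illustrative choices. Because the difference contains the standard method's own error with a negative sign, it correlates with the standard reading even though neither method's error depends on magnitude.

```python
import math
import random

def pearson_r(x, y):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)
n = 5000

# Assumed true systolic pressures (mmHg) and two methods with equal, independent error.
true = [rng.gauss(130.0, 20.0) for _ in range(n)]
standard = [t + rng.gauss(0.0, 10.0) for t in true]
new = [t + rng.gauss(0.0, 10.0) for t in true]

diff = [b - a for a, b in zip(standard, new)]            # new minus standard
average = [(a + b) / 2.0 for a, b in zip(standard, new)]

r_vs_standard = pearson_r(diff, standard)   # spurious negative relationship
r_vs_average = pearson_r(diff, average)     # close to zero

print(f"difference vs standard: r = {r_vs_standard:+.2f}")
print(f"difference vs average:  r = {r_vs_average:+.2f}")
```

If the two methods had very different error SDs, the correlation with the average would no longer be exactly zero in expectation, which fits the summary's wording that the average is "unlikely to mislead" rather than never misleading.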

Introduction

In earlier papers [1,2] we discussed the analysis of studies of agreement between methods of clinical measurement. We had two issues in mind: to demonstrate that the methods of analysis then in general use were incorrect and misleading, and to recommend a more appropriate method. We saw the aim of such a study as to determine whether two methods agreed sufficiently well for them to be used interchangeably. This led us to suggest that the analysis should be based on the differences between measurements on the same subject by the two methods. The mean difference would be the estimated bias, the systematic difference between methods, and the standard deviation of the differences would measure random fluctuations around this mean. We recommended 95% limits of agreement, mean difference plus or minus 2 standard deviations (or, more precisely, 1.96 standard deviations), which would tell us how far apart measurements by the two methods were likely to be for most individuals.
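
The bias and 95% limits of agreement described above reduce to a few lines of arithmetic. This is a minimal Python sketch under stated assumptions: the four paired readings are invented, `limits_of_agreement` is a hypothetical helper name, and the sample standard deviation (n - 1 denominator) is used.

```python
import statistics

def limits_of_agreement(standard, new):
    """Return (bias, lower limit, upper limit) for paired measurements.

    bias is the mean of the differences (new minus standard); the limits are
    bias -/+ 1.96 sample standard deviations of the differences.
    """
    diffs = [b - a for a, b in zip(standard, new)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical systolic pressures (mmHg) on four subjects.
standard = [120, 132, 144, 116]
new = [121, 131, 146, 115]

bias, lower, upper = limits_of_agreement(standard, new)
print(f"bias = {bias:.2f}, limits of agreement = {lower:.2f} to {upper:.2f}")
```

Here the differences are 1, -1, 2 and -1, so the bias is 0.25 and the standard deviation is 1.5, giving limits of 0.25 ± 2.94 mmHg.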

 

via Comparing methods of measurement: why plotting difference against standard method is misleading.

Applying the Right Statistics: Analyses of Measurement Studies

I have been learning a great deal about statistical analysis, and how to apply the abundant tools to particular problems. I will be sharing some articles and ideas that I have come across on this topic (there are a number of considerations for every question. Bazinga). BTW, I have been using SOFA Statistics (link later; it is free to use, with “enhancers” you can pay for but do not need in order to use it to its full potential) for my own bit of work. It is a really nice, sometimes frustrating tool, though I am fairly sure the frustration has more to do with my “not knowing what I can do” than with limitations in the software.

 

Introduction

Many research papers in radiology concern measurement. This is a topic which in the past has been much neglected in the medical research methods literature. When I was first approached with a question on measurement error, I turned in vain to my books. I had to work it out myself.

I am going to deal in this talk with two types of study: the estimation of the agreement between two methods of measurement, and the estimation of the agreement between two measurements by the same method, also called repeatability. In both cases I shall be concerned with the question of interpreting the individual clinical measurement. For agreement between two different methods of measurement, I shall be asking whether we can use measurements by these two methods interchangeably, i.e. can we ignore the method by which the measurement was made. For two measurements by the same method, I shall be asking how variable can measurements on a patient be if the true value of the quantity does not change and what this measurement tells us about the patient’s true or average value.

I shall avoid all mathematics, which even an audience as intelligent as this one finds difficult to follow during a presentation, except for one formula near the end, for which I shall apologise when the time comes. Instead I shall show what happens when we apply some simple statistical methods to a set of randomly generated data, and then show how this informs the interpretation of these methods when they are used to tackle measurement problems in the radiology literature.

For an example of the sort of study with which I shall be concerned, Borg et al. (1995) compared single X-ray absorptiometry (SXA) with single photon absorptiometry (SPA). They produced the following scatter plot for arm bone mineral density:

via Applying the Right Statistics: Analyses of Measurement Studies.

amor mundi: Hannah Arendt on Technology and Nature

We have seen that the animal laborans could be redeemed from its predicament of imprisonment in the ever-recurring cycle of the life process, of being subject to the necessity of labor and consumption, only through the mobilization of another human capacity, the capacity for making, fabricating, and producing of homo faber, who as a toolmaker not only eases the pain and trouble of laboring but also erects a world of durability. The redemption of life, which is sustained by labor, is worldliness, which is sustained by fabrication. We saw furthermore that homo faber could be redeemed from his predicament of meaninglessness, the “devaluation of all values,” and the impossibility of finding valid standards in a world determined by the category of means and ends, only through the interrelated faculties of action and speech, which produce meaningful stories as naturally as fabrication produces use objects. If it were not outside the scope of these considerations, one could add the predicament of thought to these instances; for thought, too, is unable to “think itself” out of predicaments which the very activity of thinking engenders. What in each of these instances saves man (man qua animal laborans, qua homo faber, qua thinker) is something altogether different; it comes from the outside: not, to be sure, outside of man, but outside each of the respective activities. From the viewpoint of the animal laborans, it is like a miracle that it is also a being which knows of and inhabits a world; from the viewpoint of homo faber it is like a miracle, like the revelation of divinity, that meaning should have a place in that world. The case of action and action’s predicament is altogether different. Here, the remedy against the irreversibility and unpredictability of the process started by acting does not arise out of another and possibly higher faculty, but is one of the potentialities of action itself.
The possible redemption from the predicament of irreversibility (of being unable to undo what one has done though one did not, and could not, have known what he was doing) is the faculty of forgiving. The remedy for unpredictability, for the chaotic uncertainty of the future, is contained in the faculty to make and keep promises. The two faculties belong together in so far as one of them, forgiving, serves to undo the deeds of the past, whose “sins” hang like Damocles’ sword over every new generation; and the other, binding oneself through promises, serves to set up in the ocean of uncertainty, which the future is by definition, islands of security without which not even continuity, let alone durability of any kind, would be possible in the relationships between men. Without being forgiven, released from the consequences of what we have done, our capacity to act would, as it were, be confined to one single deed from which we could never recover; we would remain the victims of its consequences forever, not unlike the sorcerer’s apprentice who lacked the magic formula to break the spell. Without being bound to the fulfillment of promises, we would never be able to keep our identities; we would be condemned to wander helplessly and without direction in the darkness of each man’s lonely heart, caught in its contradictions and equivocations, a darkness which only the light shed over the public realm through the presence of others, who confirm the identity between the one who promises and the one who fulfills, can dispel. Both faculties, therefore, depend on plurality, on the presence and acting of others, for no one can forgive himself and no one can feel bound to a promise made only to himself; forgiving and promising enacted in solitude or isolation remain without reality and can signify no more than a role played before one’s self.

via amor mundi: Hannah Arendt on Technology and Nature.

Screenhero | Collaborative Screen Sharing

Screenhero lets you screen share any application with anyone, no matter where they are. It’s super simple and blazing fast. You each get your own mouse pointer, and you’re both always in control. It’s designed for collaboration, not just broadcasting your screen. It’s like Google Docs for any application on your computer.

Screenhero is designed to feel like you’re sitting next to the person you’re working with — even when you’re miles away. It’s available for both Mac and Windows.

via Screenhero | Collaborative Screen Sharing.

Curiosity Shoots First Nighttime Photos on the Surface of Mars

To ensure that everything is working properly, the rover first shot a couple of photographs of the calibration target found on its body. The photograph above is of the target illuminated by the white LEDs. Here’s the same target illuminated by the UV ones:

via Curiosity Shoots First Nighttime Photos on the Surface of Mars.

Barry Lyndon – Wikipedia, the free encyclopedia

Barry Lyndon – Wikipedia, the free encyclopedia.

Cinematography

The film, as with “almost every Kubrick film”, is a “showcase for [a] major innovation in technique.”[6] While 2001: A Space Odyssey had featured “revolutionary effects,” and The Shining would later feature heavy use of the Steadicam, Barry Lyndon saw a considerable number of sequences shot “without recourse to electric light.”[6] Cinematography was overseen by director of photography John Alcott (who won an Oscar for his work), and is particularly noted for the technical innovations that made some of its most spectacular images possible. To achieve photography without electric lighting “[f]or the many densely furnished interior scenes… meant shooting by candlelight,” which is known to be difficult in still photography, “let alone with moving images.”[6]