Statistical methods for assessing agreement between two methods of clinical measurement

SUMMARY

In clinical measurement comparison of a new measurement technique with an established one is often needed to see whether they agree sufficiently for the new to replace the old. Such investigations are often analysed inappropriately, notably by using correlation coefficients. The use of correlation is misleading. An alternative approach, based on graphical techniques and simple calculations, is described, together with the relation between this analysis and the assessment of repeatability.

INTRODUCTION

Clinicians often wish to have data on, for example, cardiac stroke volume or blood pressure where direct measurement without adverse effects is difficult or impossible. The true values remain unknown. Instead indirect methods are used, and a new method has to be evaluated by comparison with an established technique rather than with the true quantity. If the new method agrees sufficiently well with the old, the old may be replaced. This is very different from calibration, where known quantities are measured by a new method and the result compared with the true value or with measurements made by a highly accurate method. When two methods are compared neither provides an unequivocally correct measurement, so we try to assess the degree of agreement. But how?

The correct statistical approach is not obvious. Many studies give the product-moment correlation coefficient (r) between the results of the two measurement methods as an indicator of agreement. It is no such thing. In a statistical journal we have proposed an alternative analysis, [1] and clinical colleagues have suggested that we describe it for a medical readership.

Most of the analysis will be illustrated by a set of data (Table 1) collected to compare two methods of measuring peak expiratory flow rate (PEFR).

 

INAPPROPRIATE USE OF CORRELATION COEFFICIENT

The second step is usually to calculate the correlation coefficient (r) between the two methods. For the data in fig 1, r = 0.94 (p < 0.001). The null hypothesis here is that the measurements by the two methods are not linearly related. The probability is very small and we can safely conclude that PEFR measurements by the mini and large meters are related. However, this high correlation does not mean that the two methods agree:

(1) r measures the strength of a relation between two variables, not the agreement between them. We have perfect agreement only if the points in fig 1 lie along the line of equality, but we will have perfect correlation if the points lie along any straight line.

(2) A change in scale of measurement does not affect the correlation, but it certainly affects the agreement. For example, we can measure subcutaneous fat by skinfold calipers. The calipers will measure two thicknesses of fat. If we were to plot calipers measurement against half-calipers measurement, in the style of fig 1, we should get a perfect straight line with slope 2.0. The correlation would be 1.0, but the two measurements would not agree — we could not mix fat thicknesses obtained by the two methods, since one is twice the other.

(3) Correlation depends on the range of the true quantity in the sample. If this is wide, the correlation will be greater than if it is narrow. For those subjects whose PEFR (by peak flow meter) is less than 500 l/min, r is 0.88 while for those with greater PEFRs r is 0.90. Both are less than the overall correlation of 0.94, but it would be absurd to argue that agreement is worse below 500 l/min and worse above 500 l/min than it is for everybody. Since investigators usually try to compare two methods over the whole range of values typically encountered, a high correlation is almost guaranteed.

(4) The test of significance may show that the two methods are related, but it would be amazing if two methods designed to measure the same quantity were not related. The test of significance is irrelevant to the question of agreement.

(5) Data which seem to be in poor agreement can produce quite high correlations. For example, Serfontein and Jaroszewicz [2] compared two methods of measuring gestational age. Babies with a gestational age of 35 weeks by one method had gestations between 34 and 39.5 weeks by the other, but r was high (0.85). On the other hand, Oldham et al. [3] compared the mini and large Wright peak flow meters and found a correlation of 0.992. They then connected the meters in series, so that both measured the same flow, and obtained a “material improvement” (0.996). If a correlation coefficient of 0.99 can be materially improved upon, we need to rethink our ideas of what a high correlation is in this context. As we show below, the high correlation of 0.94 for our own data conceals considerable lack of agreement between the two instruments.
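Point (2) above can be checked numerically. The sketch below is an illustration only: the thickness values are made up, and `pearson_r` is a hypothetical helper written for this example, not something from the paper. It shows a correlation of 1 between two sets of readings that never agree.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up single fat thicknesses in mm (assumption, for illustration only).
half_caliper = [4.0, 6.5, 9.0, 12.5, 15.0]
# The calipers pinch a fold, so each reading is two thicknesses of fat.
caliper = [2.0 * t for t in half_caliper]

r = pearson_r(half_caliper, caliper)
differences = [c - h for c, h in zip(caliper, half_caliper)]

print("r =", round(r, 6))
print("differences (mm):", differences)
```

The correlation is perfect because the points lie on a straight line, yet every difference is as large as the smaller reading itself, so the two sets of numbers could not be mixed.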

MEASURING AGREEMENT

It is most unlikely that different methods will agree exactly, by giving the identical result for all individuals. We want to know by how much the new method is likely to differ from the old: if this is not enough to cause problems in clinical interpretation we can replace the old method by the new or use the two interchangeably. If the two PEFR meters were unlikely to give readings which differed by more than, say, 10 l/min, we could replace the large meter by the mini meter because so small a difference would not affect decisions on patient management. On the other hand, if the meters could differ by 100 l/min, the mini meter would be unlikely to be satisfactory. How far apart measurements can be without causing difficulties will be a question of judgment. Ideally, it should be defined in advance to help in the interpretation of the method comparison and to choose the sample size.

The first step is to examine the data. A simple plot of the results of one method against those of the other (fig 1), though without a regression line, is a useful start, but usually the data points will be clustered near the line and it will be difficult to assess between-method differences. A plot of the difference between the methods against their mean may be more informative. Fig 2 displays considerable lack of agreement between the large and mini meters, with discrepancies of up to 80 l/min; these differences are not obvious from fig 1. The plot of difference against mean also allows us to investigate any possible relationship between the measurement error and the true value. We do not know the true value, and the mean of the two measurements is the best estimate we have. It would be a mistake to plot the difference against either value separately because the difference will be related to each, a well-known statistical artefact. [4]
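The quantities behind a difference-against-mean plot are simple to compute. Here is a minimal sketch, assuming paired readings held in two lists; the numbers are invented for illustration (Table 1 is not reproduced here), and `difference_and_mean` is a hypothetical name.

```python
def difference_and_mean(method_a, method_b):
    """Return one (mean, difference) pair per subject."""
    if len(method_a) != len(method_b):
        raise ValueError("both methods must measure the same subjects")
    return [((a + b) / 2, a - b) for a, b in zip(method_a, method_b)]


# Hypothetical PEFR readings in l/min, invented for illustration only.
large_meter = [500.0, 400.0, 520.0, 430.0]
mini_meter = [510.0, 420.0, 500.0, 425.0]

points = difference_and_mean(large_meter, mini_meter)
for mean, diff in points:
    print(f"mean={mean:.1f}  difference={diff:+.1f}")
```

Each pair gives one point: the mean goes on the horizontal axis and the difference on the vertical axis of whatever plotting tool is to hand.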

via Statistical methods for assessing agreement between two methods of clinical measurement.

Comparing methods of measurement: why plotting difference against standard method is misleading

My reason for jumping into stats was to directly compare two measurement methods… with multiple trials, on multiple ILDs (inter-landmark distances).  I don’t really go for “funny name, lol” things, but Bland and Borg cited in the same paper on stats (a subject I long thought of [cluelessly/ignorantly] as boring)?  Eponysterical.

But getting real, the issues raised by Bland and Altman sound pretty interesting: they argue that many tests of this sort may be relying on misleading information… I have tried to duplicate their methods in my own little H.T.-UGR/Inquiry Study.

 

 

Summary

When comparing a new method of measurement with a standard method, one of the things we want to know is whether the difference between the measurements by the two methods is related to the magnitude of the measurement. A plot of the difference against the standard measurement is sometimes suggested, but this will always appear to show a relationship between difference and magnitude when there is none. A plot of the difference against the average of the standard and new measurements is unlikely to mislead in this way. This is shown theoretically and illustrated by a practical example using measurements of systolic blood pressure.
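A small simulation makes the artefact visible. This is a sketch under stated assumptions, not the paper's own demonstration: both methods measure the same true value with independent errors of equal spread, so by construction the difference has no real relation to magnitude. The distribution parameters, the seed, and the helper `pearson_r` are all invented for the example.

```python
import math
import random


def pearson_r(x, y):
    """Product-moment correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(1)  # fixed seed so the run is repeatable
n = 5000

# Assumed model: true value plus independent error for each method.
true_value = [random.gauss(120.0, 20.0) for _ in range(n)]
standard = [t + random.gauss(0.0, 10.0) for t in true_value]
new = [t + random.gauss(0.0, 10.0) for t in true_value]

difference = [s - w for s, w in zip(standard, new)]
average = [(s + w) / 2 for s, w in zip(standard, new)]

r_vs_standard = pearson_r(difference, standard)
r_vs_average = pearson_r(difference, average)

print("difference vs standard:", round(r_vs_standard, 3))
print("difference vs average: ", round(r_vs_average, 3))
```

The standard measurement contains its own error, and that same error appears in the difference, which is what produces the apparent relationship. In the average the two errors enter with equal weight and, when their spreads are equal, cancel out of the covariance.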

Introduction

In earlier papers [1,2] we discussed the analysis of studies of agreement between methods of clinical measurement. We had two issues in mind: to demonstrate that the methods of analysis then in general use were incorrect and misleading, and to recommend a more appropriate method. We saw the aim of such a study as to determine whether two methods agreed sufficiently well for them to be used interchangeably. This led us to suggest that the analysis should be based on the differences between measurements on the same subject by the two methods. The mean difference would be the estimated bias, the systematic difference between methods, and the standard deviation of the differences would measure random fluctuations around this mean. We recommended 95% limits of agreement, mean difference plus or minus 2 standard deviations (or, more precisely, 1.96 standard deviations), which would tell us how far apart measurements by the two methods were likely to be for most individuals.
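The limits described above reduce to a few lines of arithmetic. A minimal sketch follows, assuming paired readings and the sample standard deviation (n - 1 denominator); the function name `limits_of_agreement` and the readings are hypothetical.

```python
import statistics


def limits_of_agreement(method_a, method_b, multiplier=1.96):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    differences = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(differences)  # estimated systematic difference
    sd = statistics.stdev(differences)   # sample SD, n - 1 denominator (assumption)
    return bias, bias - multiplier * sd, bias + multiplier * sd


# Hypothetical paired readings, invented for illustration only.
method_a = [10.0, 12.0, 14.0, 16.0]
method_b = [9.0, 13.0, 12.0, 17.0]

bias, lower, upper = limits_of_agreement(method_a, method_b)
print(f"bias={bias:.2f}  limits of agreement: {lower:.2f} to {upper:.2f}")
```

Here the differences are 1, -1, 2 and -1, giving a bias of 0.25 and a standard deviation of 1.5, so the limits are 0.25 ± 1.96 × 1.5. Whether an interval that wide is acceptable remains a clinical judgment, not a statistical one.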

 

via Comparing methods of measurement: why plotting difference against standard method is misleading.

amor mundi: Hannah Arendt on Technology and Nature

We have seen that the animal laborans could be redeemed from its predicament of imprisonment in the ever-recurring cycle of the life process, of being subject to the necessity of labor and consumption, only through the mobilization of another human capacity, the capacity for making, fabricating, and producing of homo faber, who as a toolmaker not only eases the pain and trouble of laboring but also erects a world of durability. The redemption of life, which is sustained by labor, is worldliness, which is sustained by fabrication. We saw furthermore that homo faber could be redeemed from his predicament of meaninglessness, the “devaluation of all values,” and the impossibility of finding valid standards in a world determined by the category of means and ends, only through the interrelated faculties of action and speech, which produce meaningful stories as naturally as fabrication produces use objects. If it were not outside the scope of these considerations, one could add the predicament of thought to these instances; for thought, too, is unable to “think itself” out of predicaments which the very activity of thinking engenders. What in each of these instances saves man — man qua animal laborans, qua homo faber, qua thinker — is something altogether different; it comes from the outside — not, to be sure, outside of man, but outside each of the respective activities. From the viewpoint of the animal laborans, it is like a miracle that it is also a being which knows of and inhabits a world; from the viewpoint of homo faber it is like a miracle, like the revelation of divinity, that meaning should have a place in that world.

The case of action and action’s predicament is altogether different. Here, the remedy against the irreversibility and unpredictability of the process started by acting does not arise out of another and possibly higher faculty, but is one of the potentialities of action itself. The possible redemption from the predicament of irreversibility — of being unable to undo what one has done though one did not, and could not, have known what he was doing — is the faculty of forgiving. The remedy for unpredictability, for the chaotic uncertainty of the future, is contained in the faculty to make and keep promises. The two faculties belong together in so far as one of them, forgiving, serves to undo the deeds of the past, whose “sins” hang like Damocles’ sword over every new generation; and the other, binding oneself through promises, serves to set up in the ocean of uncertainty, which the future is by definition, islands of security without which not even continuity, let alone durability of any kind, would be possible in the relationships between men.

Without being forgiven, released from the consequences of what we have done, our capacity to act would, as it were, be confined to one single deed from which we could never recover; we would remain the victims of its consequences forever, not unlike the sorcerer’s apprentice who lacked the magic formula to break the spell. Without being bound to the fulfillment of promises, we would never be able to keep our identities; we would be condemned to wander helplessly and without direction in the darkness of each man’s lonely heart, caught in its contradictions and equivocations — a darkness which only the light shed over the public realm through the presence of others, who confirm the identity between the one who promises and the one who fulfills, can dispel. Both faculties, therefore, depend on plurality, on the presence and acting of others, for no one can forgive himself and no one can feel bound to a promise made only to himself; forgiving and promising enacted in solitude or isolation remain without reality and can signify no more than a role played before one’s self.

via amor mundi: Hannah Arendt on Technology and Nature.

Screenhero | Collaborative Screen Sharing

Screenhero lets you screen share any application with anyone, no matter where they are. It’s super simple and blazing fast. You each get your own mouse pointer, and you’re both always in control. It’s designed for collaboration, not just broadcasting your screen. It’s like Google Docs for any application on your computer.

Screenhero is designed to feel like you’re sitting next to the person you’re working with — even when you’re miles away. It’s available for both Mac and Windows.

via Screenhero | Collaborative Screen Sharing.

oreillymedia/open_government · GitHub

Wow, O’Reilly has made Open Government available to the public free of charge.  Really not much I can say beyond: good guy does good thing.  Worth a read.

Open Government was published in 2010 by O’Reilly Media. The United States had just elected a president in 2008, who, on his first day in office, issued an executive order committing his administration to “an unprecedented level of openness in government.” The contributors of Open Government had long fought for transparency and openness in government, as well as access to public information. Aaron Swartz was one of these contributors (Chapter 25: When is Transparency Useful?). Aaron was a hacker, an activist, a builder, and a respected member of the technology community. O’Reilly Media is making Open Government free to all to access in honor of Aaron. #PDFtribute

— Tim O’Reilly, January 15, 2013

via oreillymedia/open_government · GitHub.

Bertrand Russell and F.C. Copleston Debate the Existence of God, 1948 | Open Culture

On January 28, 1948 the British philosophers F.C. Copleston and Bertrand Russell squared off on BBC radio for a debate on the existence of God. Copleston was a Jesuit priest who believed in God. Russell maintained that while he was technically agnostic on the existence of the Judeo-Christian God–just as he was technically agnostic on the existence of the Greek gods Zeus and Poseidon–he was for all intents and purposes an atheist.

via Bertrand Russell and F.C. Copleston Debate the Existence of God, 1948 | Open Culture.

The Horse, the Wheel, and Language – David W. Anthony – Book Review – New York Times

Prepare for a massive series on PIE.  Many folks love PIE.  Renfrew, Anatolia, Kurgan culture, Gimbutas, Mallory… Will try to hit it all (just because this article is first has no bearing on the “ranking” of positions; I just have to start somewhere).  Yeah, this is a poor place to start, but I didn’t want to bookmark it or lose the link, so hey!  Actually, I will return and resequence/recontextualize once I decide on more articles to use.

Where Proto-Indo-European came from and who originally spoke it has been a mystery ever since Sir William Jones, a British judge and scholar in India, posited its existence in the late 18th century. As a result, Anthony writes, the question of its origins was “politicized almost from the beginning.” Numerous groups, ranging from the Nazis to adherents of the “goddess movement” (who saw the Indo-Europeans as bellicose invaders who upended a feminine utopia), have made self-interested claims about the Indo-European past.

via The Horse, the Wheel, and Language – David W. Anthony – Book Review – New York Times.