Great paper on the linguistic evolution and uses of gender…
Penn tablets by period:
Late Uruk (ca. 3400-3000 BC)
Proto-Elamite (ca. 3100-2900 BC)
Early Dynastic I-II (ca. 2900-2700 BC)
Early Dynastic IIIa (ca. 2600 BC)
Early Dynastic IIIb (ca. 2500-2350 BC)
Old Akkadian (ca. 2350-2200 BC)
Lagash II (ca. 2200-2100 BC)
Ur III period (ca. 2100-2000 BC)
Old Assyrian (ca. 2000-1900 BC)
Early Old Babylonian (ca. 2000-1800 BC)
Old Babylonian (ca. 1800-1600 BC)
Middle Babylonian (ca. 1500-1000 BC)
Middle Assyrian (ca. 1500-1000 BC)
Neo-Assyrian (ca. 1000-600 BC)
Neo-Babylonian (ca. 1000-540 BC)
Achaemenid (ca. 540-330 BC)
Hellenistic (ca. 330-140 BC)
Penn tablets by provenience (only major sites):
Penn tablets by text genre:
The tablet to the right (CBS 16106) contains on the top surface the impression of a diorite brick stamp said, in the Neo-Assyrian inscription on the reverse surface, to have been found by a scribe in Naram-Sin’s palace in Agade, the capital of the Old Akkadian empire (ca. 2300 and 700 BC, respectively). The lower image offers a mirrored representation of the original stamp, in the orientation in which it would have been read, in lines from top to bottom and from right to left.
via cdli – penn museum.
In clinical measurement, comparison of a new measurement technique with an established one is often needed to see whether they agree sufficiently for the new to replace the old. Such investigations are often analysed inappropriately, notably by using correlation coefficients. The use of correlation is misleading. An alternative approach, based on graphical techniques and simple calculations, is described, together with the relation between this analysis and the assessment of repeatability.
Clinicians often wish to have data on, for example, cardiac stroke volume or blood pressure where direct measurement without adverse effects is difficult or impossible. The true values remain unknown. Instead indirect methods are used, and a new method has to be evaluated by comparison with an established technique rather than with the true quantity. If the new method agrees sufficiently well with the old, the old may be replaced. This is very different from calibration, where known quantities are measured by a new method and the result compared with the true value or with measurements made by a highly accurate method. When two methods are compared neither provides an unequivocally correct measurement, so we try to assess the degree of agreement. But how?
The correct statistical approach is not obvious. Many studies give the product-moment correlation coefficient (r) between the results of the two measurement methods as an indicator of agreement. It is no such thing. In a statistical journal we have proposed an alternative analysis, and clinical colleagues have suggested that we describe it for a medical readership.
Most of the analysis will be illustrated by a set of data (Table 1) collected to compare two methods of measuring peak expiratory flow rate (PEFR).
INAPPROPRIATE USE OF CORRELATION COEFFICIENT
The second step is usually to calculate the correlation coefficient (r) between the two methods. For the data in fig 1, r = 0.94 (p < 0.001). The null hypothesis here is that the measurements by the two methods are not linearly related. The probability is very small and we can safely conclude that PEFR measurements by the mini and large meters are related. However, this high correlation does not mean that the two methods agree:
(1) r measures the strength of a relation between two variables, not the agreement between them. We have perfect agreement only if the points in fig 1 lie along the line of equality, but we will have perfect correlation if the points lie along any straight line.
(2) A change in scale of measurement does not affect the correlation, but it certainly affects the agreement. For example, we can measure subcutaneous fat by skinfold calipers. The calipers will measure two thicknesses of fat. If we were to plot calipers measurement against half-calipers measurement, in the style of fig 1, we should get a perfect straight line with slope 2.0. The correlation would be 1.0, but the two measurements would not agree — we could not mix fat thicknesses obtained by the two methods, since one is twice the other.
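The calipers example is easy to reproduce numerically. Here is a minimal sketch (made-up numbers, assuming NumPy is available) showing that a change of scale leaves r untouched while destroying agreement:

```python
import numpy as np

# Hypothetical half-caliper readings (mm); the full caliper reads two
# thicknesses of fat, i.e. exactly double each reading.
half = np.array([4.0, 6.0, 8.0, 10.0, 12.0])
full = 2.0 * half

r = np.corrcoef(half, full)[0, 1]
mean_difference = (full - half).mean()

print(round(r, 6))      # 1.0 -- perfect correlation
print(mean_difference)  # 8.0 -- yet a large systematic disagreement
```

The correlation is perfect because the points lie on a straight line, but we clearly could not mix measurements from the two "methods".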
(3) Correlation depends on the range of the true quantity in the sample. If this is wide, the correlation will be greater than if it is narrow. For those subjects whose PEFR (by peak flow meter) is less than 500 l/min, r is 0.88 while for those with greater PEFRs r is 0.90. Both are less than the overall correlation of 0.94, but it would be absurd to argue that agreement is worse below 500 l/min and also worse above 500 l/min than it is for everybody. Since investigators usually try to compare two methods over the whole range of values typically encountered, a high correlation is almost guaranteed.
(4) The test of significance may show that the two methods are related, but it would be amazing if two methods designed to measure the same quantity were not related. The test of significance is irrelevant to the question of agreement.
(5) Data which seem to be in poor agreement can produce quite high correlations. For example, Serfontein and Jaroszewicz compared two methods of measuring gestational age. Babies with a gestational age of 35 weeks by one method had gestations between 34 and 39.5 weeks by the other, but r was high (0.85). On the other hand, Oldham et al. compared the mini and large Wright peak flow meters and found a correlation of 0.992. They then connected the meters in series, so that both measured the same flow, and obtained a “material improvement” (0.996). If a correlation coefficient of 0.99 can be materially improved upon, we need to rethink our ideas of what a high correlation is in this context. As we show below, the high correlation of 0.94 for our own data conceals considerable lack of agreement between the two instruments.
It is most unlikely that different methods will agree exactly, by giving the identical result for all individuals. We want to know by how much the new method is likely to differ from the old: if this is not enough to cause problems in clinical interpretation we can replace the old method by the new or use the two interchangeably. If the two PEFR meters were unlikely to give readings which differed by more than, say, 10 l/min, we could replace the large meter by the mini meter because so small a difference would not affect decisions on patient management. On the other hand, if the meters could differ by 100 l/min, the mini meter would be unlikely to be satisfactory. How far apart measurements can be without causing difficulties will be a question of judgment. Ideally, it should be defined in advance to help in the interpretation of the method comparison and to choose the sample size.
The first step is to examine the data. A simple plot of the results of one method against those of the other (fig 1), though without a regression line, is a useful start, but usually the data points will be clustered near the line and it will be difficult to assess between-method differences. A plot of the difference between the methods against their mean may be more informative. Fig 2 displays considerable lack of agreement between the large and mini meters, with discrepancies of up to 80 l/min; these differences are not obvious from fig 1. The plot of difference against mean also allows us to investigate any possible relationship between the measurement error and the true value. We do not know the true value, and the mean of the two measurements is the best estimate we have. It would be a mistake to plot the difference against either value separately because the difference will be related to each, a well-known statistical artefact.
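Setting up the difference-against-mean data takes only a couple of lines in most tools. A sketch in Python/NumPy with made-up PEFR readings (not the paper's Table 1):

```python
import numpy as np

# Hypothetical paired PEFR readings in l/min -- NOT the paper's Table 1 data.
large = np.array([494.0, 395.0, 516.0, 434.0, 476.0, 557.0, 413.0, 442.0])
mini = np.array([512.0, 430.0, 520.0, 428.0, 500.0, 600.0, 364.0, 380.0])

diff = mini - large          # between-method differences (y-axis)
mean = (mini + large) / 2.0  # best available estimate of the true value (x-axis)

bias = diff.mean()           # systematic difference between the meters
worst = np.abs(diff).max()   # largest single discrepancy
```

Plot `diff` against `mean` in whatever package you use; plotting the difference against either meter's reading alone would introduce the artefact the authors warn about.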
My reason for jumping into stats was to directly compare two measurement methods… with multiple trials, on multiple ILDs (inter-landmark distances). I don’t really go for “funny name, lol” things, but when Bland and Borg get cited in the same paper on stats (a field I long thought of [cluelessly/ignorantly] as boring), the joke writes itself. Eponysterical.
But getting real, the issues raised by Bland and Altman sound pretty interesting, and they raise the issue that many tests of this sort may be using misleading information… I have tried to duplicate their methods in my own little H.T.-UGR/Inquiry Study.
When comparing a new method of measurement with a standard method, one of the things we want to know is whether the difference between the measurements by the two methods is related to the magnitude of the measurement. A plot of the difference against the standard measurement is sometimes suggested, but this will always appear to show a relationship between difference and magnitude when there is none. A plot of the difference against the average of the standard and new measurements is unlikely to mislead in this way. This is shown theoretically and illustrated by a practical example using measurements of systolic blood pressure.
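The artefact is easy to demonstrate by simulation. In this sketch (an assumed setup, not the paper's blood-pressure data) the two methods have identical error and no real bias, yet the difference correlates with the standard measurement alone while staying essentially uncorrelated with the average:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
true = rng.uniform(100.0, 200.0, n)          # true values (unknown in practice)
standard = true + rng.normal(0.0, 10.0, n)   # standard method plus random error
new = true + rng.normal(0.0, 10.0, n)        # new method, same error, no real bias

diff = new - standard
r_vs_standard = np.corrcoef(standard, diff)[0, 1]               # spurious negative correlation
r_vs_average = np.corrcoef((standard + new) / 2.0, diff)[0, 1]  # near zero
```

The negative correlation with the standard method appears purely because the standard method's own error enters both axes with opposite signs; the average does not share this problem.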
In earlier papers [1,2] we discussed the analysis of studies of agreement between methods of clinical measurement. We had two issues in mind: to demonstrate that the methods of analysis then in general use were incorrect and misleading, and to recommend a more appropriate method. We saw the aim of such a study as to determine whether two methods agreed sufficiently well for them to be used interchangeably. This led us to suggest that the analysis should be based on the differences between measurements on the same subject by the two methods. The mean difference would be the estimated bias, the systematic difference between methods, and the standard deviation of the differences would measure random fluctuations around this mean. We recommended 95% limits of agreement, mean difference plus or minus 2 standard deviations (or, more precisely, 1.96 standard deviations), which would tell us how far apart measurements by the two methods were likely to be for most individuals.
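The limits-of-agreement recipe described above is only a few lines of code. A sketch with hypothetical readings (assuming NumPy; not data from the papers):

```python
import numpy as np

def limits_of_agreement(method_a, method_b):
    """Bias and 95% limits of agreement: mean difference +/- 1.96 SD."""
    diff = np.asarray(method_b) - np.asarray(method_a)
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample SD of the between-method differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired readings by two methods:
a = [100.0, 102.0, 98.0, 105.0, 110.0]
b = [103.0, 100.0, 101.0, 109.0, 108.0]
bias, lower, upper = limits_of_agreement(a, b)
# For most individuals, method B is expected to read between
# `lower` and `upper` units away from method A.
```

Whether the resulting interval is acceptable is the clinical judgment the authors insist should be made in advance, not a statistical question.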
I have been learning a great deal about statistical analysis, and how to apply the abundant tools to particular problems… I guess I should say that I will be sharing some articles and ideas that I have come across on this topic (there are a number of considerations for every question. Bazinga). BTW, I have been using SOFA Statistics (link later; free to use, with “enhancers” you can pay for, but you don’t need them to use it to its full potential) for my own bit of work. It is a really nice, sometimes frustrating tool, though I am fairly sure that has more to do with my “not knowing what I can do” than with limitations in the software.
Many research papers in radiology concern measurement. This is a topic which in the past has been much neglected in the medical research methods literature. When I was first approached with a question on measurement error, I turned in vain to my books. I had to work it out myself.
I am going to deal in this talk with two types of study: the estimation of the agreement between two methods of measurement, and the estimation of the agreement between two measurements by the same method, also called repeatability. In both cases I shall be concerned with the question of interpreting the individual clinical measurement. For agreement between two different methods of measurement, I shall be asking whether we can use measurements by these two methods interchangeably, i.e. whether we can ignore the method by which the measurement was made. For two measurements by the same method, I shall be asking how variable measurements on a patient can be if the true value of the quantity does not change, and what such a measurement tells us about the patient’s true or average value.
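For the repeatability half of the question, a common summary (the one Bland and Altman use elsewhere, sketched here with invented duplicate readings) is the repeatability coefficient, 1.96 × √2 × s_w, where s_w is the within-subject standard deviation estimated from duplicate measurements:

```python
import numpy as np

def repeatability_coefficient(pairs):
    """95% repeatability coefficient from duplicate measurements per subject.

    s_w^2 is estimated as sum(d_i^2) / (2n) for the paired differences d_i;
    the difference between two measurements on the same subject is expected
    to fall below 1.96 * sqrt(2) * s_w for about 95% of pairs.
    """
    d = np.array([first - second for first, second in pairs])
    s_w = np.sqrt((d ** 2).sum() / (2 * len(d)))
    return 1.96 * np.sqrt(2.0) * s_w

# Hypothetical duplicate readings on five patients (same method twice):
pairs = [(100.0, 104.0), (120.0, 118.0), (95.0, 99.0),
         (110.0, 107.0), (130.0, 131.0)]
coeff = repeatability_coefficient(pairs)
```

A measurement further than `coeff` from a previous one on the same patient then suggests a real change rather than measurement noise.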
I shall avoid all mathematics, which even an audience as intelligent as this one finds difficult to follow during a presentation, except for one formula near the end, for which I shall apologise when the time comes. Instead I shall show what happens when we apply some simple statistical methods to a set of randomly generated data, and then show how this informs the interpretation of these methods when they are used to tackle measurement problems in the radiology literature.
For an example of the sort of study with which I shall be concerned, Borg et al. (1995) compared single X-ray absorptiometry (SXA) with single photon absorptiometry (SPA). They produced the following scatter plot for arm bone mineral density:
We have seen that the animal laborans could be redeemed from its predicament of imprisonment in the ever-recurring cycle of the life process, of being subject to the necessity of labor and consumption, only through the mobilization of another human capacity, the capacity for making, fabricating, and producing of homo faber, who as a toolmaker not only eases the pain and trouble of laboring but also erects a world of durability. The redemption of life, which is sustained by labor, is worldliness, which is sustained by fabrication. We saw furthermore that homo faber could be redeemed from his predicament of meaninglessness, the “devaluation of all values,” and the impossibility of finding valid standards in a world determined by the category of means and ends, only through the interrelated faculties of action and speech, which produce meaningful stories as naturally as fabrication produces use objects. If it were not outside the scope of these considerations, one could add the predicament of thought to these instances; for thought, too, is unable to “think itself” out of predicaments which the very activity of thinking engenders. What in each of these instances saves man — man qua animal laborans, qua homo faber, qua thinker — is something altogether different; it comes from the outside — not, to be sure, outside of man, but outside each of the respective activities. From the viewpoint of the animal laborans, it is like a miracle that it is also a being which knows of and inhabits a world; from the viewpoint of homo faber it is like a miracle, like the revelation of divinity, that meaning should have a place in that world.

The case of action and action’s predicament is altogether different. Here, the remedy against the irreversibility and unpredictability of the process started by acting does not arise out of another and possibly higher faculty, but is one of the potentialities of action itself.
The possible redemption from the predicament of irreversibility — of being unable to undo what one has done though one did not, and could not, have known what he was doing — is the faculty of forgiving. The remedy for unpredictability, for the chaotic uncertainty of the future, is contained in the faculty to make and keep promises. The two faculties belong together in so far as one of them, forgiving, serves to undo the deeds of the past, whose “sins” hang like Damocles’ sword over every new generation; and the other, binding oneself through promises, serves to set up in the ocean of uncertainty, which the future is by definition, islands of security without which not even continuity, let alone durability of any kind, would be possible in the relationships between men.

Without being forgiven, released from the consequences of what we have done, our capacity to act would, as it were, be confined to one single deed from which we could never recover; we would remain the victims of its consequences forever, not unlike the sorcerer’s apprentice who lacked the magic formula to break the spell. Without being bound to the fulfillment of promises, we would never be able to keep our identities; we would be condemned to wander helplessly and without direction in the darkness of each man’s lonely heart, caught in its contradictions and equivocations — a darkness which only the light shed over the public realm through the presence of others, who confirm the identity between the one who promises and the one who fulfills, can dispel. Both faculties, therefore, depend on plurality, on the presence and acting of others, for no one can forgive himself and no one can feel bound to a promise made only to himself; forgiving and promising enacted in solitude or isolation remain without reality and can signify no more than a role played before one’s self.