The following contribution, from Simone Alin, describes

Hoaglin et al.'s "Letter Values" method of data summary.

LETTER VALUES

The "letter values" referred to by Hoaglin et al. in the assigned reading

are an exploratory way of summarizing the spread and location

(median) of values in a data set. The authors of this book prefer this

system of summary statistics for exploratory data analysis because the

letter values are not as prone to the effects of outlying values as are the

mean and variance, although the letter values do focus more on the tails

than the central part of a distribution. In many ways, letter value

summary statistics are similar to those we discussed for box plots. I will

give a brief synopsis of the letter value system, using a simple data set.

The math behind it is quite simple, and in the interest of being succinct, I

will just give a brief description of the math necessary for figuring out

the Hoaglin et al. reading (those who are interested in the details are

referred to chapter 2 of the same book).

Assigning letter values to a simple batch of data entails finding the

median, the fourths, and the extremes. First, the data must be ordered from

smallest to largest. The median is simply the middle data point (in terms

of rank), and the extremes are just the highest and lowest values. So far,

this is pretty self-explanatory. To determine the fourths, one essentially

finds two more medians, these ones between the median of the full data set

and either extreme. For a simple data set such as this one:

i= 1 2 3 4 5 6 7 8 9 10 11

x(i)= 5 17 23 30 42 51 56 61 69 71 74

(note: that should be x sub(i), but I can't do that on my email account)

-the median would be the 6th value (x(6)=51)

-the extremes would be those values in the 1st and 11th ranks (5 and 74)

-the fourths would fall between the 3rd and 4th and the 8th and 9th (for

this we use linear interpolation to get (23+30)/2=26.5 and (61+69)/2=65)
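For anyone who wants to check the arithmetic, here is a minimal Python sketch of the "two more medians" idea. It assumes that when n is odd the overall median is counted in both halves, which gives the same answer as counting 3.5 places in from either end of this batch. The function name five_number_summary is just an illustrative choice, not something from the reading.

```python
from statistics import median

def five_number_summary(values):
    """Return (low extreme, lower fourth, median, upper fourth, high extreme)."""
    x = sorted(values)
    n = len(x)
    # Size of each half; when n is odd the middle point is shared by both halves
    # (an assumption of this sketch).
    half = (n + 1) // 2
    lower_half = x[:half]
    upper_half = x[n - half:]
    return x[0], median(lower_half), median(x), median(upper_half), x[-1]

batch = [5, 17, 23, 30, 42, 51, 56, 61, 69, 71, 74]
print(five_number_summary(batch))
```

For this batch the upper half is 51, 56, 61, 69, 71, 74, so the upper fourth lands midway between 61 and 69.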

When one has a larger data set, it can be desirable to have further

descriptive statistics. We already know that 50% of our measurements lie

between the fourths (or alternately, that 1/4 of the measurements lie in

either tail). Additional "letter values" may be assigned, and each one

halves the tail area again. The next ones would be the eighths (then the

sixteenths, etc.), and here too, one simply divides both tails into two

equal-sized pools of data points. The eighths tell the experimenter the values

outside of which the largest and smallest eighths of the measured values

occur.

Okay, so you must be thinking "what does this have to do with 'letter

values' though?" Right. The creator of this system decided that M would

represent median, F would represent fourths, and from there on down the

letters would work backwards from F, wrapping backwards through Z, Y, X,

etc., if necessary. Perhaps the following table will lend some clarity:

Tag   Tail Area

M     1/2  = 0.5

F     1/4  = 0.25

E     1/8  = 0.125

D     1/16 = 0.0625

C     1/32 = 0.03125

and so on... (X -> 1/1024 = 0.0009765625)
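The pattern in the table (each tag halves the previous tail area) can be sketched in a few lines of Python. The tag string below just spells out the order described above, M, F, then backwards from E through A and on to Z, Y, X; the names TAGS and letter_tags are made up for this illustration, and the sketch stops at X because that is as far as the table goes.

```python
from fractions import Fraction

TAGS = "MFEDCBAZYX"  # tag order as described in the text, stopping at X

def letter_tags(count):
    """Return the first `count` (tag, tail area) pairs, with exact fractions."""
    if count > len(TAGS):
        raise ValueError("this sketch only covers tags M through X")
    # The k-th tag (counting from 0) has tail area 1 / 2**(k + 1).
    return [(TAGS[k], Fraction(1, 2 ** (k + 1))) for k in range(count)]

for tag, area in letter_tags(10):
    print(tag, area, float(area))
```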

So in terms of decoding the Hoaglin et al. reading, you need to know how

letter values are computed and how the information is displayed in tables.

For computing the letter values, the ordered measurements are assigned

numerical values (1,2,3,...n) BOTH from smallest to largest and from largest

to smallest. These are their upward and downward ranks, respectively. The

depth of a data point in a batch is the smaller of these two ranks. The

depth of the median is (n+1)/2, the depth of the fourths is ([depth of

median] +1)/2, or more generally each depth is ([previous depth] +1)/2,

where the square brackets mean "drop any fraction" (so 497.5 becomes 497). Now

to understand how this information is displayed in tables, please see Table

4-1 in Hoaglin et al.

Some of the first lines are:

#  994

M  497.5          3480

F  249     2412   3678   4944

etc.

The chief elements here are:

# 994 - this indicates the total number of measurements from which the

letter summary values are drawn (i.e., n=994)

M 497.5 - this means that the median is at depth 497.5, i.e., halfway

between measurements #497 and #498 in the ordered list

3480 - this is the value of the median

F 249 - the fourths are located at measurement #249 from the top _and_ from

the bottom of the list

2412 and 4944 - these are the values of the fourths

3678 - this is the value of the "mid-summary" which is the topic of

discussion in part of the assigned reading (i.e., "the average of the two

corresponding letter values" or the average of 2412 and 4944 in this case)
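To tie the pieces together, here is a rough Python sketch that builds the same kind of rows (tag, depth, lower value, mid-summary, upper value) for the small 11-point batch used earlier, and then checks the depths and the mid-summary quoted from Table 4-1. It assumes the depth rule in which any .5 is dropped from the previous depth before adding 1 and halving, and that a depth ending in .5 means averaging the two neighbouring ordered values. The names depths and letter_value_rows are illustrative only.

```python
import math

TAGS = "MFEDCBAZYX"  # tag order as described in the text

def depths(n, count):
    """Depths of the first `count` letter values for a batch of size n."""
    out = []
    d = (n + 1) / 2  # depth of the median
    for _ in range(count):
        out.append(d)
        d = (math.floor(d) + 1) / 2  # drop any .5, add 1, halve
    return out

def letter_value_rows(values, count):
    """Return rows of (tag, depth, lower value, mid-summary, upper value)."""
    x = sorted(values)
    n = len(x)
    rows = []
    for tag, d in zip(TAGS, depths(n, count)):
        lo, hi = math.floor(d), math.ceil(d)
        # Count in from the bottom for the lower value, from the top for the upper.
        lower = (x[lo - 1] + x[hi - 1]) / 2
        upper = (x[n - lo] + x[n - hi]) / 2
        rows.append((tag, d, lower, (lower + upper) / 2, upper))
    return rows

batch = [5, 17, 23, 30, 42, 51, 56, 61, 69, 71, 74]
for row in letter_value_rows(batch, 3):
    print(row)

print(depths(994, 2))       # depths for M and F when n = 994
print((2412 + 4944) / 2)    # mid-summary of the fourths quoted above
```

With n=994 the first two depths come out as 497.5 and 249, and the average of 2412 and 4944 is 3678, matching the M and F rows quoted above.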
