The following contribution, from Simone Alin, describes
Hoaglin et al.'s "Letter Values" method of data summary.
LETTER VALUES
The "letter values" referred to by Hoaglin et al. in the assigned reading
are an exploratory way of summarizing the location (median) and spread of
the values in a data set. The authors prefer this system of summary
statistics for exploratory data analysis because the letter values are
less prone to the effects of outlying values than the mean and variance
are, although the letter values do focus more on the tails than on the
central part of a distribution. In many ways, letter value summary
statistics are similar to those we discussed for box plots. I will give a
brief synopsis of the letter value system, using a simple data set. The
math behind it is quite simple, and in the interest of being succinct, I
will describe only what is needed to follow the Hoaglin et al. reading
(those who are interested in the details are referred to chapter 2 of the
same book).
Assigning letter values to a simple batch of data entails finding the
median, the fourths, and the extremes. First, the data must be ordered
from smallest to largest. The median is simply the middle data point (in
terms of rank), and the extremes are just the highest and lowest values.
So far, this is pretty self-explanatory. To determine the fourths, one
essentially finds two more medians, one between the median of the full
data set and each extreme.
For a simple data set such as this one:
i    =  1   2   3   4   5   6   7   8   9  10  11
x(i) =  5  17  23  30  42  51  56  61  69  71  74
(note: that should be x sub(i), but I can't do that on my
email account)
-the median would be the 6th value (x(6)=51)
-the extremes would be the values in the 1st and 11th ranks (5 and 74)
-the fourths would fall between the 3rd and 4th values counting in from
each end, i.e., between ranks 3 and 4 and between ranks 8 and 9 (for this
we average the two neighboring values to get (23+30)/2=26.5 and
(61+69)/2=65)
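For anyone who would rather check this on a computer, here is a minimal
Python sketch of the same calculation. The function names are my own
invention, not anything from Hoaglin et al., and the sketch assumes the
usual convention that any fractional part of the median's depth is
dropped before the depth of the fourths is computed.

```python
import math

def value_at_depth(sorted_x, depth):
    """Return the (lower, upper) values found at `depth` ranks in from each end.

    A depth ending in .5 falls between two data points, so the two
    neighbouring values are averaged.
    """
    n = len(sorted_x)
    lo = math.floor(depth)
    hi = math.ceil(depth)
    lower = (sorted_x[lo - 1] + sorted_x[hi - 1]) / 2   # counting up from the smallest
    upper = (sorted_x[n - lo] + sorted_x[n - hi]) / 2   # counting down from the largest
    return lower, upper

def median_fourths_extremes(batch):
    """Return {tag: (depth, lower value, upper value)} for M, F and the extremes."""
    x = sorted(batch)
    n = len(x)
    d_median = (n + 1) / 2
    # assumption: the fraction of the median's depth is dropped before adding 1
    d_fourth = (math.floor(d_median) + 1) / 2
    return {
        "M": (d_median, *value_at_depth(x, d_median)),
        "F": (d_fourth, *value_at_depth(x, d_fourth)),
        "1": (1, x[0], x[-1]),   # the extremes sit at depth 1
    }

batch = [5, 17, 23, 30, 42, 51, 56, 61, 69, 71, 74]
summary = median_fourths_extremes(batch)
for tag, (depth, lower, upper) in summary.items():
    print(tag, depth, lower, upper)
```

Each printed line gives a tag, its depth, and the lower and upper values
at that depth (for the median the two values coincide).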
When one has a larger data set, it can be desirable to have further
descriptive statistics. We already know that 50% of our measurements lie
between the fourths (or alternatively, that 1/4 of the measurements lie in
each tail). Additional "letter values" may be assigned, and each one
further subdivides each tail by two. The next ones would be the eighths
(then the sixteenths, etc.), and here too, one simply divides both tails
into two equal-sized pools of data points. The eighths tell the
experimenter the values beyond which the largest and smallest eighths of
the measured values occur.
Okay, so you must be thinking "what does this have to do with 'letter
values' though?" Right. The creator of this system decided that M would
represent the median, F would represent the fourths, and from there on
down the letters would work backwards from F (E, D, C, B, A), wrapping
around to Z, Y, X, etc., if necessary.
Perhaps the following table will lend some clarity:
Tag   Tail Area
M     1/2  = 0.5
F     1/4  = 0.25
E     1/8  = 0.125
D     1/16 = 0.0625
C     1/32 = 0.03125
and so on... (X -> 1/1024=0.0009765625)
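If it helps to see the whole sequence of tags at once, the short Python
sketch below generates them along with their tail areas. The function
names are hypothetical ones I made up for illustration; the only rule
used is the one above (M, then F, then backwards through the alphabet,
wrapping from A to Z, with each tail area half the previous one).

```python
from fractions import Fraction

def letter_tags(count):
    """Return the first `count` tags: M, F, E, D, C, B, A, Z, Y, X, ..."""
    tags = ["M"]
    code = ord("F")
    while len(tags) < count:
        tags.append(chr(code))
        # step backwards through the alphabet, wrapping from A round to Z
        code = ord("Z") if code == ord("A") else code - 1
    return tags[:count]

def tail_areas(count):
    """Pair each tag with its tail area: 1/2 for M, 1/4 for F, 1/8 for E, ..."""
    return [(tag, Fraction(1, 2 ** (k + 1)))
            for k, tag in enumerate(letter_tags(count))]

for tag, area in tail_areas(10):
    print(f"{tag}  {area} = {float(area)}")
```

Ten tags are enough to reach X; a real batch runs out of data points
long before the letters do.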
So in terms of decoding the Hoaglin et al. reading, you need to know how
letter values are computed and how the information is displayed in tables.
For computing the letter values, the ordered measurements are assigned
numerical values (1,2,3,...,n) BOTH from smallest to largest and from
largest to smallest. These are their upward and downward ranks,
respectively. The depth of a data point in a batch is the smaller of these
two ranks. The depth of the median is (n+1)/2, the depth of the fourths is
([depth of median]+1)/2, or more generally each depth is ([previous
depth]+1)/2, where the square brackets mean that any fraction is dropped
from the depth inside them before 1 is added. Now to understand how this
information is displayed in tables, please see Table 4-1 in Hoaglin et al.
Some of the first lines are:
# 994
M 497.5 3480
F 249 2412 3678 4944
etc.
The chief elements here are:
# 994 - this indicates the total number of measurements from which the
letter summary values are drawn (i.e., n=994)
M 497.5 - this means that the median is at depth 497.5, i.e., halfway
between measurements #497 and #498
3480 - this is the value of the median
F 249 - the fourths are located at measurement #249 from the top _and_
from the bottom of the list
2412 and 4944 - these are the values of the fourths
3678 - this is the value of the "mid-summary", which is the topic of
discussion in part of the assigned reading (i.e., "the average of the two
corresponding letter values", or the average of 2412 and 4944 in this
case)
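As a quick check on the numbers quoted from the table, the Python sketch
below recomputes the depths for n=994 and the mid-summary of the
fourths. The function names are my own, and the sketch assumes that any
fraction in the previous depth is dropped before 1 is added (which is
what makes the depth of the fourths come out to 249 rather than 249.25).

```python
import math

def letter_value_depths(n, count):
    """Depths of the median, fourths, eighths, ... for a batch of n values."""
    depths = [(n + 1) / 2]                      # depth of the median
    while len(depths) < count and depths[-1] > 1:
        # assumption: drop the fraction of the previous depth, add 1, halve
        depths.append((math.floor(depths[-1]) + 1) / 2)
    return depths

def mid_summary(lower, upper):
    """Average of the two letter values that share a tag."""
    return (lower + upper) / 2

print(letter_value_depths(994, 2))   # depths for M and F
print(mid_summary(2412, 4944))       # mid-summary of the fourths
```

The first line printed should match the depths shown next to M and F in
the table, and the second should match the middle entry on the F line.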