Understanding average

Famously, or not, the professor in my statistical mechanics course said infinity seems to be about 100. What he meant was it’s usually safe to treat any set with more than 100 members statistically. What he didn’t add was that it becomes necessary for any set with more than 1,000 members.

The US has 330,000,000 people in it. The world is at 8 billion, give or take. Even Missoula County has more than 100,000. We became “statistical” many years ago. Understanding statistics has been critical to solving societal problems for several millennia, even if that understanding has only come recently.

A good starting point is understanding “average.” From Webster’s online dictionary the second definition is, “a level (as of intelligence) typical of a group, class, or series.” For example, is the Milky Way an average galaxy, or is it unusual? This is the logic behind the math.

The first definition in Webster’s is, “A single value (such as a mean, mode, or median) that summarizes or represents the general significance of a set of unequal values.” There are actually many numerical definitions of average; however, these three are the most familiar. Of the three, the mean is used most often, with the median and mode used less. A goal of this article is to advocate using the median more often. I believe the mean and median should be used with about the same frequency, with the mode as the runner-up. That said, all three are clearly useful.

Consider a group of four people. Let’s assume that they have incomes of 35K, 40K, 45K, and 200K per year. The mean is the sum divided by the number. In this case that’s 320K/4 or 80K per year. But wait, three of the four have a lot less income than 80K. Only one has more, and that’s a lot more. Means are very useful for many things, but they become misleading when there are large asymmetries. This case is an example of that. Three of the incomes are close to 40K, but one is much larger.

Half the people have an income of 40K or less. Half have an income of 45K or more. The median is defined that way: half are below and half are above. In this case, it could be anything between 40K and 45K, and by convention the midpoint, 42.5K, is reported. The answer is that any value in that range represents the incomes better than the 80K mean does. That’s because the median describes the same thing, the half-way point, regardless of the data.

Means are very useful when the data are symmetric. In these cases, median and mean will be close. The incomes I chose were specifically a case where these were different.
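
For readers who want to check the arithmetic, here is a minimal Python sketch using the standard library's statistics module. The four incomes are the ones from the example above; the symmetric set is hypothetical, added here only for comparison.

```python
import statistics

# Incomes from the example above, in thousands of dollars per year.
skewed_incomes = [35, 40, 45, 200]

# Hypothetical symmetric set, made up for comparison (not from the article).
symmetric_incomes = [35, 40, 45, 50]

for label, incomes in [("skewed", skewed_incomes), ("symmetric", symmetric_incomes)]:
    mean = statistics.mean(incomes)
    # With an even count, statistics.median returns the midpoint
    # of the two middle values.
    median = statistics.median(incomes)
    print(f"{label}: mean = {mean}K, median = {median}K")
```

For the symmetric set the two numbers agree. For the skewed set the single large income pulls the mean well above the median.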

So, the question is why bother using the mean at all? We use a relatively simple formula to compute the mean. By comparison, computing the median has several steps. While there are clever algorithms to do these extra steps, they still take more computational time than computing the mean.
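
To make the difference in effort concrete, here is a sketch of both calculations written out by hand. The function names are illustrative, and sorting is simply the most straightforward way to find the middle, not the fastest one.

```python
def mean_of(values):
    """Mean: one pass to add everything up, then one division."""
    return sum(values) / len(values)


def median_by_sorting(values):
    """Median: put the values in order, then find the half-way point."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)           # step 1: sort
    n = len(ordered)
    middle = n // 2                    # step 2: locate the middle
    if n % 2 == 1:
        return ordered[middle]         # odd count: the middle value itself
    # Even count: by convention, the midpoint of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


incomes = [45, 200, 35, 40]  # same incomes as above, deliberately unsorted
print(mean_of(incomes), median_by_sorting(incomes))
```

The sort is where the extra work goes. The mean needs no ordering at all.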

There are formulae for various properties of the mean. There really isn’t anything comparable for the median. If data are from a known type of distribution, the mean and the width (the standard deviation) are frequently adequate to characterize it.
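
As one illustration of such a formula, chosen here as an example and not necessarily the one the column has in mind, the uncertainty of a mean (the "standard error") is the standard deviation divided by the square root of the sample size. The sketch below uses made-up measurements.

```python
import math
import statistics

# Hypothetical measurements, made up purely for illustration.
sample = [9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7, 10.2]

mean = statistics.mean(sample)
# Sample standard deviation (divides by n - 1): the "width" of the data.
std_dev = statistics.stdev(sample)
# Standard error of the mean: how well the mean itself is known.
std_error = std_dev / math.sqrt(len(sample))

print(f"mean = {mean:.2f}, width = {std_dev:.2f}, error on mean = {std_error:.2f}")
```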

It should also be noted that four is much less than 100. If four were the whole population, treating it statistically should be done cautiously at best. Even if the four were a sample, say from a survey of a much larger group, four is small. A good rule of thumb is that one needs a sample of at least six values before one can say anything about a new and unknown data set.

An example of this is determining whether a weather event, say a once-per-century flood, really is once per century. To determine this one needs six occurrences, and on average that takes 600 years of records. And of course, six is a minimum. One hundred would be better. One thousand, better still. The most common error in statistical analysis is giving a value, say how frequently a flood occurs, much more credibility than it deserves.
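
One common way to put rough numbers on this is counting statistics. It assumes the floods are independent events (a Poisson process), which the column does not state. Under that assumption, the relative uncertainty on a count of N events is about one over the square root of N. The function name below is illustrative.

```python
import math


def relative_uncertainty(count):
    """Approximate fractional uncertainty on a count of independent events.

    Assumes Poisson (counting) statistics; this is an assumption of the
    sketch, not something stated in the column.
    """
    return 1 / math.sqrt(count)


for occurrences in (6, 100, 1000):
    percent = 100 * relative_uncertainty(occurrences)
    print(f"{occurrences} occurrences: rate known to about {percent:.0f} percent")
```

With six floods the estimated rate is uncertain by roughly 40 percent. With 100 it is about 10 percent, and with 1,000 about 3 percent.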

Finally, what is a mode? A mode is a value with a maximum number of occurrences. For measured quantities, where exact repeats are rare, that means a peak in the distribution, and data can have several. In most years, California gets very little rain. In El Niño years, however, rain and snow are plentiful. A typical 10 years might be: 10, 12, 30, 15, 10, 12, 9, 25, 28 and 13 inches. Seven values are between nine and 15 inches while three are between 25 and 30 inches. There are two modes. The data also have no values between 15 and 25 inches. Given this, what makes sense is to determine the relative frequency of the two modes and treat them separately. In this case, seven of the 10 years were dry and three were wet. The mean for the seven is 11.6 inches. The mean for the three is 27.7 inches.
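
The split described above can be sketched in a few lines of Python. The 20-inch cutoff is an assumption of this sketch. Any value in the empty gap between 15 and 25 inches would give the same two groups.

```python
import statistics

# The 10 hypothetical rainfall totals from the text, in inches.
rainfall = [10, 12, 30, 15, 10, 12, 9, 25, 28, 13]

# Assumed cutoff: anywhere in the 15-25 inch gap separates the groups.
threshold = 20

dry_years = [r for r in rainfall if r < threshold]
wet_years = [r for r in rainfall if r >= threshold]

dry_fraction = len(dry_years) / len(rainfall)
wet_fraction = len(wet_years) / len(rainfall)

dry_mean = statistics.mean(dry_years)
wet_mean = statistics.mean(wet_years)

print(f"dry: {dry_fraction:.0%} of years, mean {dry_mean:.1f} inches")
print(f"wet: {wet_fraction:.0%} of years, mean {wet_mean:.1f} inches")
```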

The 10 values were chosen as a simple, hypothetical example of a bimodal distribution, using rainfall in California. The real world tends to be much nastier. For example, there are years where the El Niño condition goes away mid-winter. These fill in the gap between El Niño and normal years. The hypothetical values for El Niño years are a neat trio. The fact is that rainfall in El Niño years varies much more and can be much larger than the mean for those years. That distribution is not symmetric.

In the end it’s important to recognize two things. The first is that societies became statistical many years ago. That is, distributions (what fraction of people are doing what) are what matter. The second is that it is critical to pay attention to the details. Ignoring details of distributions will generally bite us in the ass.

Rob Loveman received his PhD in experimental nuclear physics from the University of Washington in 1984. He moved to Seeley Lake in 2003 to learn how to run sled dogs.

 
