Biases in sampling

Consider this simple example. A fair coin comes up heads on about 50% of tosses and tails on the other 50%. This doesn't mean that heads and tails alternate, but rather that in a long sequence of flips the numbers of heads and tails are about equal. In fact, the way to decide if a coin is fair is to flip it many times and see if the numbers of heads and tails are close enough to equal. I flipped a coin 500 times, and found 246 heads and 254 tails. These are very nearly equal, and without going into long-winded details about whether they are equal enough, I can call this coin fair. What happens, however, if I focus my attention on the actual sequence of flips? Let me examine the sequence starting at the 352nd flip...

...h t h t h h t h t t t t t t t t h h t t h h...

Oh! There is a run of 8 tails here. This is really improbable, isn't it? We can calculate the odds of getting such a sequence. The probability of the first tail is 1/2, and each additional tail contributes another factor of 1/2, so the probability of the run of 8 tails is (1/2)^8 = 1/256, or about 0.4%. So by focussing on this run alone we could convince ourselves that something very unusual had occurred, something with odds against it of about 996 to 4, in fact, and that the coin isn't fair after all. But what are the odds of getting the sequence of 8 tosses leading up to the run? Well, the probability of the first head is 1/2, of the next tail 1/2, and so on. The result is that the probability of getting the sequence hththhth that leads up to the run is also (1/2)^8, exactly the same as that of the long run of tails!
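The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name sequence_probability is mine, and it assumes independent tosses of a fair coin.

```python
from fractions import Fraction

def sequence_probability(sequence, p_heads=Fraction(1, 2)):
    """Probability of one exact sequence of 'h'/'t' tosses, assuming independence."""
    prob = Fraction(1)
    for toss in sequence:
        # Each toss multiplies in its own probability: p_heads for 'h', the rest for 't'.
        prob *= p_heads if toss == "h" else 1 - p_heads
    return prob

p_run = sequence_probability("tttttttt")      # the run of 8 tails
p_lead_up = sequence_probability("hththhth")  # the 8 tosses before the run

print(p_run, float(p_run))   # 1/256 0.00390625
print(p_lead_up == p_run)    # True
```

Any other specific sequence of 8 tosses gives the same 1/256, which is the point of the paragraph above.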

Surely, you say, there is something wrong here. The run of eight tails has to be the more unusual sequence. However, the truth is that it is not more unusual if we view it as a sequence. It is true that if we take samples of eight tosses at a time, a sample that is all tails is much less likely than a sample with 3 tails and 5 heads, because 56 different sequences contain 3 tails and 5 heads while only one is all tails. That is, as aggregate samples the long runs are unusual; as sequences of individual events they are not.
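To make the difference between a sequence and an aggregate sample concrete, here is a small sketch that counts how many sequences of 8 fair tosses contain a given number of tails. The helper name count_probability is my own, and a fair coin is assumed.

```python
from fractions import Fraction
from math import comb

def count_probability(n_tosses, n_tails):
    """Probability that n_tosses fair tosses contain exactly n_tails tails, in any order."""
    # comb() counts the distinct orderings; each ordering has probability (1/2)^n_tosses.
    return Fraction(comb(n_tosses, n_tails), 2 ** n_tosses)

p_all_tails = count_probability(8, 8)     # only one ordering: tttttttt
p_three_tails = count_probability(8, 3)   # 56 orderings with 3 tails and 5 heads

print(p_all_tails)                   # 1/256
print(p_three_tails)                 # 7/32
print(p_three_tails / p_all_tails)   # 56
```

Each individual ordering is equally likely; the sample with 3 tails is more common only because there are 56 ways to produce it.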

The run of eight tails occurred only once in 500 tosses. If we were to pick samples of 8 tosses each without first being able to look through the whole sequence, then picking the run of 8 tails, or even some portion of it, would be unlikely.
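How surprising is it that a run of 8 tails turns up somewhere in 500 tosses? The sketch below works this out by tracking the length of the current streak of tails from toss to toss. The function name and the approach are my own illustration, assuming independent tosses; they are not part of the original coin experiment.

```python
def prob_run_of_tails(n_tosses, run_length, p_tails=0.5):
    """Probability that n_tosses contain at least one run of run_length tails."""
    # state[j] = probability that no full run has appeared yet
    # and the sequence currently ends in exactly j tails.
    state = [0.0] * run_length
    state[0] = 1.0
    hit = 0.0  # probability that a full run has already appeared
    for _ in range(n_tosses):
        new_state = [0.0] * run_length
        for j, prob in enumerate(state):
            new_state[0] += prob * (1 - p_tails)  # heads resets the streak
            if j + 1 == run_length:
                hit += prob * p_tails             # this tail completes the run
            else:
                new_state[j + 1] += prob * p_tails
        state = new_state
    return hit

print(prob_run_of_tails(8, 8))     # a single pre-chosen window: 1/256
print(prob_run_of_tails(500, 8))   # anywhere in 500 tosses
```

Under these assumptions the second number comes out well above one half, so a run of 8 tails somewhere in 500 tosses is closer to expected than to remarkable. It is only rare if the 8 tosses are chosen before looking.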

When we are talking about diseases, there aren't many dreaded diseases where the odds of infection are 1/2. Let's take as an example some dread thing that only one household in every 16 gets. Some sort of cancer, say. We will simulate the incidence of this disease by flipping an unfair coin that comes up heads only once in 16 tosses. We will work our way through some large city, house by house, flipping this coin at each. Here and there the toss results in heads, and someone in that house supposedly gets the disease. The pattern we obtain in this simulation is pretty sparse, but out near house 151,321 we go down a street and find a pattern like...

...n n y n y y y y n n n n n...

Oooo, four households all in a row have the dread thing. This has a probability of (1/16)^4, or 1/65,536. Immediately we suspect some influence, and begin to look around the neighborhood for power lines, an abandoned warehouse, an old gasoline station, a radio tower, a factory upwind, and so forth. Something unusual has occurred and we begin to look for a cause, never mind that the unusual thing has occurred as an (expected) rare case. Here we can see the folly of looking for an underlying cause in a series of tosses of a coin. In the case of a truly dread disease people have difficulty allowing themselves to think this way, and they instinctively believe in a cause. However, if the disease is just a random occurrence, even a rare occurrence, there will always be a cluster in space or time -- even among the women on one floor of an embassy in Moscow.
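The house-by-house simulation is easy to sketch in Python. The city size of 200,000 households is my own assumption (the text only says "some large city"), as are the function names; the 1-in-16 chance per household and the cluster of four come from the example above.

```python
import random
from fractions import Fraction

P_DISEASE = Fraction(1, 16)   # chance that any one household is affected
CLUSTER = 4                   # affected households in a row that we call a cluster
N_HOUSES = 200_000            # assumed city size, for illustration only

p_cluster = P_DISEASE ** CLUSTER   # chance that 4 pre-chosen neighbours are all affected

def expected_clusters(n_houses):
    """Expected number of windows of CLUSTER consecutive affected households."""
    return (n_houses - CLUSTER + 1) * p_cluster

def count_clusters(n_houses, rng):
    """Walk the city once, flipping the unfair coin at each house, and count clusters."""
    streak = 0
    clusters = 0
    for _ in range(n_houses):
        if rng.random() < float(P_DISEASE):
            streak += 1
            if streak >= CLUSTER:
                clusters += 1   # a window of CLUSTER affected houses ends here
        else:
            streak = 0
    return clusters

print(p_cluster)                            # 1/65536
print(float(expected_clusters(N_HOUSES)))   # about 3 clusters expected by chance alone
print(count_clusters(N_HOUSES, random.Random()))  # one random walk through the city
```

With these assumed numbers, pure chance is expected to produce about three such clusters somewhere in the city, even though any one pre-chosen group of four neighbours has only a 1 in 65,536 chance of being one.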

Having the freedom to decide where to begin and end a sample is, in effect, the freedom to choose that rare cluster of occurrences and, in turn, to choose the outcome of the analysis.