WSU STAT 360
Class Session 4 Summary and Notes Autumn 2000
Today we practiced using Excel to work problems in statistics and probability. The main points covered were:
Some of these activities were complex and perhaps incomprehensible today. Yet, we will cover all of these subjects again as we perform example calculations in class over the next few weeks.
The notes from last year, which you will find below, have some interesting material regarding how new probability distributions are developed from ones that we already know. This was a subject that I touched on only briefly in last week's notes. There is also an extended discussion of Stirling's approximation and of using it to calculate the probability of polio cases in a city.
Main points regarding joint probability density
I managed to describe what a joint probability density is, but I didn't manage to explain why it is important.
Suppose we toss a fair coin twice, and let X1 and X2 denote the outcomes of the first and second tosses. Then

P(X1=H, X2=T) = probability of a tail following a head.

This two-dimensional distribution is trivial to calculate. However, note that the joint probability is the product of the probabilities of the separate tosses. That is...

P(X1=H, X2=T) = P(H)*P(T) = (1/2)*(1/2) = 1/4

This is a characteristic of events that are independent of one another. If the two events were not independent of one another, then the joint probability would have to account for the degree to which the events could occur together.
Similarly, for two rolls of a fair die,

P(X1=j, X2=k) = probability of rolling "j" and then "k". For example, P(X1=1, X2=6) = P(1)*P(6) = (1/6)*(1/6) = 1/36.
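If you want to check this arithmetic by brute force, here is a small Python sketch (my own illustration; our class work was in Excel) that enumerates all 36 equally likely outcomes of the two rolls:

    from fractions import Fraction

    # Two independent rolls of a fair die: the joint probability of any
    # particular pair (j, k) is the product of the marginals, (1/6)*(1/6).
    p = Fraction(1, 6)
    joint = {(j, k): p * p for j in range(1, 7) for k in range(1, 7)}

    print(joint[(1, 6)])        # 1/36
    print(sum(joint.values()))  # 1, as any probability distribution must total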
Here is an interesting application of a joint distribution. Suppose we sample k individuals at a time from a population. The population itself has a Cdf that is known to us. In each such sample there is a largest element that we refer to as the maximum; call it m. Is there a way to calculate the probability density function of this maximum value?
This seems like a difficult question at first, but it is not difficult at all. The key observation is that the maximum is at most m exactly when every item in the sample is at most m. We can write down that probability as follows.

P(X1<=m, X2<=m, ..., Xk<=m) = probability that all k items are <= m.

Because the k draws are independent,

P(X1<=m, X2<=m, ..., Xk<=m) = Cdf(m)*Cdf(m)*...*Cdf(m)

or, equivalently,

P(X1<=m, X2<=m, ..., Xk<=m) = Cdf(m)^k

Is this cool, or what? This is the Cdf of the maximum; differentiate it and you get the density, k*Cdf(m)^(k-1)*pdf(m). The distribution obtained here is the distribution of the extreme value, and it is of great importance in engineering work. Often we design things for the maximum credible badness that can happen, and the distribution of the extreme value is useful for calculating how big the badness could be.
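Here is a quick simulation (a Python sketch of my own, assuming a Uniform(0,1) population so that Cdf(m) = m) that you can use to convince yourself the formula is right:

    import random

    # P(max of k draws <= m) should equal Cdf(m)^k.  For a Uniform(0,1)
    # population, Cdf(m) = m, so the theoretical answer is m**k.
    k, trials, m = 5, 100_000, 0.9
    hits = sum(max(random.random() for _ in range(k)) <= m
               for _ in range(trials))
    print(hits / trials)  # empirical estimate, about 0.59
    print(m ** k)         # theory: 0.9**5 = 0.59049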
Notes: The following points are minor, but I needed to fill space somehow.
Answers to selected homework problems only
Hey, if I put all the problems on this page, then where would you find any challenge?
Problem 3.6
=====================================
  y     p(y)     y*p(y)    y^2*p(y)
=====================================
  0     0.1022   0         0
  1     0.3633   0.3633    0.3633
  2     0.3814   0.7628    1.5256
  3     0.1387   0.4161    1.2483
  4     0.0144   0.0576    0.2304
-------------------------------------
 Sums   1.0000   1.5998    3.3676
=====================================

3.6.a  p(3)+p(4) = 0.1531
3.6.b  p(0)+p(1) = 0.4655
3.6.c  E(y) = 1.5998
       Var(y) = E(y^2) - E(y)^2 = 3.3676 - 1.5998^2 = 0.8082
       StdDev = 0.899
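If you would rather let the machine do the tallying, the following Python fragment (my own check; the class work was in Excel) reproduces the summary numbers from the tabulated p(y):

    # Recompute the Problem 3.6 summary quantities from p(y).
    p = {0: 0.1022, 1: 0.3633, 2: 0.3814, 3: 0.1387, 4: 0.0144}
    mean = sum(y * py for y, py in p.items())        # E(y)   = 1.5998
    e_y2 = sum(y * y * py for y, py in p.items())    # E(y^2) = 3.3676
    var = e_y2 - mean ** 2                           # Var(y) ~ 0.8082
    print(mean, var, var ** 0.5)                     # StdDev ~ 0.899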
Problem: How to calculate binomial coefficients when they involve truly enormous numbers.
Tim's calculator has demonstrated that it can calculate just about anything up to 10^1000, which is something like 10^900 times the number of protons in the universe. By the way, 10^100 is a number called a googol, and the googolplex, 10 raised to a googol, is larger still; those are about the largest numbers I know of with names of their own. Whoever named them hadn't reckoned with Tim's (and Sherrie's, as it turns out) calculator(s). Not all calculating machines are capable of handling huge numbers, so I thought I'd show you how to use Stirling's approximation to calculate truly awful binomial factors. The essence of Stirling's approximation is that the ratio...
2.50662827 * N^(N+1/2) * e^(-N) / N! -> 1 as N -> infinity

(the constant 2.50662827 is sqrt(2*pi)). Thus the numerator in the above limit makes a good approximation for the factorial in the denominator. The table below shows how this approximation fares for the numbers from 1! to 15!. The agreement is quite good, and the approximation gets even better, in terms of relative error, at larger values. I suspect Tim's calculator uses this approximation for large factorials.
Stirling's Approximation and Factorial Comparison
==========================
 N    Factorial    Stirling
==========================
 1    1            0.922137
 2    2            1.919004
 3    6            5.83621
 4    24           23.50618
 5    120          118.0192
 6    720          710.0782
 7    5040         4980.396
 8    40320        39902.4
 9    362880       359536.9
10    3628800      3598696
11    39916800     39615625
12    4.79E+08     4.76E+08
13    6.23E+09     6.19E+09
14    8.72E+10     8.67E+10
15    1.31E+12     1.3E+12
==========================
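Anyone with a computer handy can extend the table; here is a Python sketch (my own, for those without Tim's calculator) that generates it:

    import math

    # Compare N! with Stirling's approximation sqrt(2*pi)*N^(N+1/2)*e^(-N).
    for N in range(1, 16):
        stirling = math.sqrt(2 * math.pi) * N ** (N + 0.5) * math.exp(-N)
        print(N, math.factorial(N), stirling)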
An impressive example: calculating the odds on the number of polio cases in a medium-sized town.
Suppose the town has 40,000 people and that the probability of contracting the dread polio in a season was normally 0.0003 before 1954 (the actual rate was 0.000023 in 1998). The probability that, say, 10 people contract polio in the season thus involves calculating C(40000, 10) and 0.9997^39990. These are pretty awful to contemplate. However, using Stirling's approximation, the binomial coefficient C(N, N-n) becomes...

0.39894228 * [N/((N-n)*n)]^(1/2) * N^N / [(N-n)^(N-n) * n^n]

(the constant 0.39894228 is 1/sqrt(2*pi)), and the binomial factor...

C(N, N-n) * p^n * q^(N-n)

becomes...

0.39894228 * [N/((N-n)*n)]^(1/2) * N^N * [q/(N-n)]^(N-n) * [p/n]^n
The binomial factor is still beyond normal calculation, but because we have forged it into a product of powers, we are now in a position to take the logarithm, which reduces the product to a sum of terms. We add the terms together and then take the antilogarithm. Does everyone remember how to do such a calculation using logarithms and antilogarithms? This is a potentially interesting project: to wit, show the derivation of these results in detail.
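For those who would rather let a program keep track of the logarithms, here is a Python sketch of the calculation (my own translation of the formula above into log space, with a function name of my own invention; in class we would do this in Excel):

    import math

    # P(x = n) for a binomial(N, p), computed from the Stirling form above:
    # 0.39894228 * sqrt(N/((N-n)*n)) * N^N * (q/(N-n))^(N-n) * (p/n)^n.
    # Taking logs turns the product of huge and tiny powers into a modest sum.
    def binom_prob_stirling(N, n, p):
        q = 1.0 - p
        m = N - n
        log_prob = (-0.5 * math.log(2 * math.pi)
                    + 0.5 * (math.log(N) - math.log(m) - math.log(n))
                    + N * math.log(N)
                    + m * (math.log(q) - math.log(m))
                    + n * (math.log(p) - math.log(n)))
        return math.exp(log_prob)   # the antilogarithm

    print(binom_prob_stirling(40000, 10, 0.0003))  # about 0.1057, as in the table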
The table that follows shows the binomial factors for this exact problem. I have also calculated the cumulative probability. Note that we have reached a cumulative probability of 1.0 by about the 21st success, whereas by definition we reach 1.0 exactly at the 40,000th success. There is a little round-off error in working with such enormous factors, but this is not a serious problem because the probability of additional successes beyond 20 or so is minuscule. By the way, rather than calculate numbers like these, people would instead use an approximation to the binomial distribution that makes use of the standard normal distribution (see the paragraph below). This calculation was meant only to show off.
================================================================
     p        q       N      n    (N-n)     P(x=n)   Cumulative
================================================================
 0.0003   0.9997   40000     1   39999    7.98E-05    7.98E-05
 0.0003   0.9997   40000     2   39998    0.00046     0.00054
 0.0003   0.9997   40000     3   39997    0.001817    0.002357
 0.0003   0.9997   40000     4   39996    0.005415    0.007773
 0.0003   0.9997   40000     5   39995    0.012946    0.020719
 0.0003   0.9997   40000     6   39994    0.025825    0.046544
 0.0003   0.9997   40000     7   39993    0.04419     0.090734
 0.0003   0.9997   40000     8   39992    0.066195    0.15693
 0.0003   0.9997   40000     9   39991    0.088167    0.245097
 0.0003   0.9997   40000    10   39990    0.105711    0.350808
 0.0003   0.9997   40000    11   39989    0.11524     0.466048
 0.0003   0.9997   40000    12   39988    0.11517     0.581217
 0.0003   0.9997   40000    13   39987    0.106254    0.687471
 0.0003   0.9997   40000    14   39986    0.091031    0.778502
 0.0003   0.9997   40000    15   39985    0.072792    0.851294
 0.0003   0.9997   40000    16   39984    0.054571    0.905865
 0.0003   0.9997   40000    17   39983    0.038505    0.94437
 0.0003   0.9997   40000    18   39982    0.02566     0.97003
 0.0003   0.9997   40000    19   39981    0.0162      0.98623
 0.0003   0.9997   40000    20   39980    0.009716    0.995946
 0.0003   0.9997   40000    21   39979    0.00555     1.001496
 0.0003   0.9997   40000    22   39978    0.003026    1.004522
 0.0003   0.9997   40000    23   39977    0.001578    1.0061
 0.0003   0.9997   40000    24   39976    0.000789    1.006889
 0.0003   0.9997   40000    25   39975    0.000378    1.007267
 0.0003   0.9997   40000    26   39974    0.000175    1.007442
 0.0003   0.9997   40000    27   39973    7.76E-05    1.007519
 0.0003   0.9997   40000    28   39972    3.32E-05    1.007552
 0.0003   0.9997   40000    29   39971    1.37E-05    1.007566
 0.0003   0.9997   40000    30   39970    5.49E-06    1.007572
 0.0003   0.9997   40000    31   39969    2.13E-06    1.007574
 0.0003   0.9997   40000    32   39968    7.96E-07    1.007574
================================================================
The easier way to handle the binomial distribution in this situation is to approximate it as a normal distribution with mean value Np and standard deviation (Npq)^(1/2). As another potential project assignment, show that the distribution in the table above behaves very nearly this way.
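Here is a sketch of what that project might look like in Python (my own outline; expect approximate agreement rather than exact, since the normal curve is symmetric while the binomial here is slightly skewed):

    import math

    # Normal approximation: mean Np = 12, standard deviation sqrt(Npq) ~ 3.464.
    N, p = 40000, 0.0003
    q = 1 - p
    mu, sigma = N * p, math.sqrt(N * p * q)
    for n in (6, 10, 12, 16, 20):
        dens = (math.exp(-0.5 * ((n - mu) / sigma) ** 2)
                / (sigma * math.sqrt(2 * math.pi)))
        print(n, round(dens, 6))  # compare with P(x=n) in the table above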
Link forward to the next set of class notes for Friday, September 29, 2000