Given a bag of N balls numbered 1 to N. I'd like to calculate the expected (average?) number of draws needed to have probability P of drawing each of the N balls with replacement. For example, how many balls would I need to draw (with replacement) to have a 90% chance of drawing every ball? I believe the odds involve one or more infinite series, but my stat skills are way too rusty to be able to work it out. Thanks for any help.

Had to think about this one for a while! A quick note about terminology: "expected", "average", and "mean" typically mean the same thing in statistics, but that's not what you're looking for here. In operational terms, the "expected number of draws needed" is what you would get by drawing until you got all N balls, writing down the number of draws it took you, and then repeating the experiment many times to get an average. What you want is for a number of draws D, what is the probability P to draw each ball in D draws or less. As far as I can tell, this is a significantly tougher problem. My solution isn't super-elegant, but I think it works; please point out any errors you see, or any cleaner solutions. When faced with a problem of the form "what is the probability of x happening at least once?" it often helps to re-frame it as "what is the probability of x not happening at all?" Since x must either happen or not happen, you can subtract the answer to either problem from 1 to get the answer to the other. In this case, the re-framed question becomes "what is the probability that one or more balls will never be drawn across D attempts?" For starters, we can note that for a given draw, any particular ball will not be drawn with probability \(\frac{N-1}{N}\). The probability of a given ball not being drawn during D consecutive attempts is then \(\left(\frac{N-1}{N}\right)^D\). Since any of the N balls not being drawn is sufficient for failure, we might naively just multiply this probability by N: \(1-P\approx N\left(\frac{N-1}{N}\right)^D\) Unfortunately, this expression is too high, because it over counts the cases in which more than one ball is never drawn. For example, the above expression includes probabilities for both "ball 1 was never drawn" and "ball 2 was never drawn", both of which will include cases where neither ball 1 nor ball 2 was drawn. To correct such double-counted events, we can subtract off the cases where two given balls are never drawn. By similar logic to above, the probability of each one of these cases is \(\left(\frac{N-2}{N}\right)^D\). How many such cases are there? We want all the ways to choose 2 elements out of a list of N options, which is the choose function \({N \choose 2}\). This gives us the less-naive estimate \(1-P\approx N\left(\frac{N-1}{N}\right)^D-{N \choose 2}\left(\frac{N-2}{N}\right)^D\) While this is better, we now face the problem of under-counting cases in which three balls are never drawn. Just counting by hand, we can find that we need one more copy of each triple-failure case, so we need to add an extra \({N \choose 3}\left(\frac{N-3}{N}\right)^D\). By now, we see that every change we make will force a higher-order correction, but fortunately, a pattern is emerging. For even-n terms, we subtract \({N \choose n}\left(\frac{N-n}{N}\right)^D\), while for odd-n terms, we add the same. Formally, we know this pattern will hold for all n from the identity \(\displaystyle\sum_{n=1}^N (-1)^{n-1}{N \choose n}=1\) I leave the proof of this as an exercise to the reader. Please Register or Log in to view the hidden image! Carrying this pattern through all the way to the end, we finally get the exact expression we want: \(1-P = \displaystyle\sum_{n=1}^N (-1)^{n-1}{N \choose n}\left(\frac{N-n}{N}\right)^D\) Not too shabby! There are a number of ways one can shuffle the terms in this sum around and pull out common factors, but I couldn't find anything that really simplified it. Please let me know if you see any improvements!

Wow! That's impressive. It will take me a bit of time to fully understand it. Your opening statement did bring back one old rusty memory from by stat classes -- that part about turning the problem into the odds of it not happening at all. But even if I had remembered that, I still doubt that I could have come up with what you did. You have provided P=f(N,D). I want D=f(N,P), so I guess I'll have to add one more level of complexity by taking logs. In the meantime, I gave up and wrote a little simulation. Here are the results from 1 million trials with 32 balls: Code: Number of trials = 1,000,000 Number of balls = 32 Trial Summary Draws Tally Cum Prob 43 1 0.00 44 1 0.00 45 3 0.00 46 2 0.00 47 7 0.00 48 7 0.00 49 6 0.00 50 15 0.00 51 30 0.01 . . . 68 1538 0.92 69 1786 1.10 70 2042 1.30 71 2279 1.53 . . . 87 7903 9.83 88 8385 10.67 89 8553 11.53 . . . 101 11651 24.14 102 11702 25.31 103 11998 26.51 . . . 122 11671 49.66 123 11596 50.82 124 11361 51.96 . . . 149 7269 74.93 150 6915 75.62 151 6627 76.29 . . . 180 3053 89.93 181 2952 90.22 182 2934 90.51 . . . 202 1663 94.89 203 1556 95.04 204 1562 95.20 . . . 253 349 98.99 254 338 99.02 255 315 99.05 . . . 423 1 99.99 424 1 100.00 425 2 100.00 . . . 556 0 100.00 557 0 100.00 558 1 100.00