Attempted answer to - How many races should I test?


podonne
08-18-2008, 02:40 PM
I've asked this question myself many times, so I thought I would attempt to answer it. The question goes like this: if there are 50,000 races during any given time period, how many must I test to determine whether my system is "valid", meaning that the results can truly be attributed to the system and not just to tricks of randomness? (Note: this is not the same as asking whether it will work in the future.)

So I ran this test.

1. Create an SQL table and populate it with 50,000 random numbers between 0 and 1.
2. Choose a random percentage (P) between 0 and 100%.
3. Create two more tables and copy (50,000 * P) randomly selected numbers from the first table into each one.
4. Compute the average of each of the two extra tables, then measure how much those averages differ from each other and from the mean of the whole 50,000-number population.
5. Repeat from #2 about 10,000 times.
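
For anyone who wants to try this without a database, here is a rough Python sketch of the same five steps. It is not the SQL I actually ran: the function names (run_trial, simulate), the in-memory list standing in for the tables, and the use of an absolute difference as the "how much do they differ" measure are all assumptions made for illustration.

```python
import random
import statistics


def run_trial(population, pop_mean, fraction, rng):
    """Steps 3-4: draw two samples of the same size and compare their averages."""
    # Sample size for this trial; at least 1 so the average is always defined.
    n = max(1, round(len(population) * fraction))
    # Each sample is drawn without replacement, independently of the other.
    mean_a = statistics.fmean(rng.sample(population, n))
    mean_b = statistics.fmean(rng.sample(population, n))
    return {
        "fraction": fraction,
        "n": n,
        # Assumption: "differ" is measured as an absolute difference of averages.
        "a_vs_b": abs(mean_a - mean_b),
        "a_vs_population": abs(mean_a - pop_mean),
        "b_vs_population": abs(mean_b - pop_mean),
    }


def simulate(pop_size=50_000, trials=10_000, seed=0):
    """Steps 1, 2 and 5: build the population, then repeat the trial many times."""
    rng = random.Random(seed)
    # Step 1: the "table" of random numbers between 0 and 1.
    population = [rng.random() for _ in range(pop_size)]
    pop_mean = statistics.fmean(population)
    # Step 2 happens inside the loop: a fresh random fraction P for every trial.
    return [
        run_trial(population, pop_mean, rng.random(), rng)
        for _ in range(trials)
    ]
```

In pure Python the full 50,000 x 10,000 run is slow, so something like simulate(pop_size=5_000, trials=500) is a more comfortable first try. Plotting "fraction" against "a_vs_population" gives the same kind of picture as the test describes.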

You should see the value from step 4 get smaller as the percentage you sample gets larger (that makes sense: the more of the population you measure, the more precise your average will be).

The question is, how fast? And is there a natural point you can point to and say, "I want to measure that percent"?
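
One way to get a feel for "how fast" without running anything 10,000 times is the textbook standard error of a sample mean drawn without replacement, which shrinks roughly like 1 over the square root of the sample size. The sketch below is only a cross-check under stated assumptions: the function name is made up for illustration, and the default spread (the square root of 1/12, about 0.289) is the standard deviation of uniform random numbers between 0 and 1, which is what the test uses. Real race results would have a different spread.

```python
import math


def standard_error_of_mean(pop_size, sample_size, sigma=math.sqrt(1 / 12)):
    """Typical deviation of a sample average from the population average.

    sigma defaults to the standard deviation of uniform numbers on [0, 1]
    (an assumption matching the random-number test, not real race data).
    """
    # Finite population correction: sampling without replacement from a
    # limited population is more precise than the plain sigma / sqrt(n).
    fpc = math.sqrt((pop_size - sample_size) / (pop_size - 1))
    return sigma / math.sqrt(sample_size) * fpc


# Print the expected deviation for a few sampling percentages of 50,000.
for pct in (1, 5, 10, 20):
    n = 50_000 * pct // 100
    print(f"{pct:>3}% (n={n}): {standard_error_of_mean(50_000, n):.4f}")
```

Because of the square root, quadrupling the sample size only halves the expected deviation, so the curve flattens out quickly.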

The results are attached (note that the graph is focused on sample sizes of < 20% and deviations of less than 2%).

My decision: I picked 0.9 (or where I sampled 10% of the population). It seemed to be a good tradeoff between small sample sizes and high precision (the 0.01 line marks a deviation of 1%, so you can expect your value to be about 99% correct). If you only want to be 95% correct, you might go as low as 1% of the population, or maybe lower.