podonne
05-28-2010, 01:52 PM
Greetings,
Up until now I have always used a simple sampling technique: pick a certain number of entries from my database randomly. I think most people do this. However, a friend recently pointed out that this isn't the best way to do sampling, and suggested stratified sampling.
This involves identifying important subpopulations and making sure each is properly represented in the sample, which raises the question: what are the important subpopulations in a horse racing database? I have to keep the number of them rather small for the sample to be of use.
For instance, if track is an important subpopulation, I might ensure that the distribution of entries in my sample from each track is consistent with the distribution of tracks in my database, to avoid oversampling from a certain track and undersampling from another.
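To make the track example concrete, here is a rough Python sketch of proportional stratified sampling. It is only an illustration under assumptions: the entries are plain dicts, and the `track` field, the `stratified_sample` helper, and the track codes are all hypothetical names, not anything from my actual database.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(entries, key, sample_size, seed=None):
    """Draw roughly sample_size entries, allocating to each stratum
    in proportion to its share of the full data set (hypothetical helper)."""
    rng = random.Random(seed)
    # Group entries by their stratum label (e.g. the track).
    strata = defaultdict(list)
    for entry in entries:
        strata[key(entry)].append(entry)
    total = len(entries)
    sample = []
    for members in strata.values():
        # Proportional allocation, rounded, with at least one entry per
        # stratum and never more than the stratum actually contains.
        n = max(1, round(sample_size * len(members) / total))
        n = min(n, len(members))
        # Simple random sampling without replacement inside the stratum.
        sample.extend(rng.sample(members, n))
    return sample

# Made-up database: 600 entries at track "AAA", 300 at "BBB", 100 at "CCC".
entries = (
    [{"track": "AAA", "id": i} for i in range(600)]
    + [{"track": "BBB", "id": i} for i in range(300)]
    + [{"track": "CCC", "id": i} for i in range(100)]
)
sample = stratified_sample(entries, key=lambda e: e["track"], sample_size=100, seed=1)
counts = Counter(e["track"] for e in sample)
```

Because each stratum's allocation is rounded separately, the total can come out slightly above or below `sample_size` when the proportions do not divide evenly. To stratify on several attributes at once, the `key` function could return a tuple such as (surface, track level), but the number of strata multiplies quickly, which is why the list of attributes has to stay short.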
My guesses:
Surface (Dirt\Turf\Synth) (obvious)
Track level (major, minor)
Age
Claiming Price (if applicable)
Layoff\No Layoff?
First time starter\not first time starter
But I am open to suggestions. To put this question another way, if you had only 4-5 attributes on which to separate entries, which are MOST important to predicting who will win?
Thanks,
podonne