Important subpopulations for sampling


podonne
05-28-2010, 01:52 PM
Greetings,

Up until now I have always used a simple sampling technique: pick a certain number of entries from my database randomly. I think most people do this. However, a friend recently pointed out that this isn't the best way to do sampling, and suggested stratified sampling.

This involves identifying important subpopulations and making sure each is properly represented in the sample, which raises the question: what are the important subpopulations in a horse racing database? I have to keep the number of them rather small for the approach to be of practical use.

For instance, if track is an important subpopulation, I might ensure that the distribution of entries in my sample from each track is consistent with the distribution of tracks in my database, to avoid oversampling from a certain track and undersampling from another.
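A minimal sketch of that idea in Python, assuming each entry is a dict with a "track" field (the field names, track codes, and counts below are made up for illustration). Each track gets a share of the sample proportional to its share of the database:

```python
import random
from collections import defaultdict

def stratified_sample(entries, key, sample_size, seed=0):
    """Draw roughly sample_size entries, keeping each stratum's share
    of the sample close to its share of the full data set."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for entry in entries:
        strata[key(entry)].append(entry)

    total = len(entries)
    sample = []
    for members in strata.values():
        # Proportional allocation, rounded; every stratum gets at least one entry.
        # Because of rounding, the final size can differ slightly from sample_size.
        n = max(1, round(sample_size * len(members) / total))
        n = min(n, len(members))
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical database: 700 entries from track "AAA", 300 from track "BBB".
entries = [{"track": "AAA", "id": i} for i in range(700)]
entries += [{"track": "BBB", "id": i} for i in range(700, 1000)]

sample = stratified_sample(entries, key=lambda e: e["track"], sample_size=100)
from_aaa = sum(1 for e in sample if e["track"] == "AAA")
from_bbb = sum(1 for e in sample if e["track"] == "BBB")
```

With a plain random draw the 70/30 split would only hold on average; here it holds in every sample, up to rounding.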

My guesses:
Surface (Dirt\Turf\Synth) (obvious)
Track level (major, minor)
Age
Claiming Price (if applicable)
Layoff\No Layoff?
First time starter\not first time starter

But I am open to suggestions. To put this question another way, if you had only 4-5 attributes on which to separate entries, which are MOST important to predicting who will win?

Thanks,
podonne

Robert Goren
05-28-2010, 03:09 PM
Something that tends to get overlooked in today's world of speed and pace ratings is finish in the last race.

kenwoodall2
05-28-2010, 04:33 PM
Suspect the top M/L to be overworked; it has to be proven to have enough energy left to win.

Overlay
05-28-2010, 04:44 PM
I would go with surface, distance categories, and age (mostly with respect to differentiating two-year-olds from three-year-olds and up, including the treatment of first-time starters in each of those age groups).

I would then let the internal subsets for those combinations address other factors such as condition, class (of horse or track), speed, pace, and connection-related data in order to separate out or rank the horses within each group.

podonne
05-28-2010, 05:31 PM
Thanks Overlay. So if I can turn that into categories, your suggestion would be:

Name       Groups
---------  ----------------------------
Surface    Dirt, Turf, Synthetic
Distance   <5F, 5-6F, 6-7F, 7-8F, 8+F
Age        2yo, 3+yo
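
For anyone who wants to try these groupings, here is a rough Python sketch that turns an entry into a stratum label. The field names ("surface", "furlongs", "age") are hypothetical, and since the distance groups share their endpoints, I am assuming each one includes its lower bound and excludes its upper bound (so a 6F race lands in 6-7F):

```python
def distance_group(furlongs):
    # Assumption: each bin includes its lower bound and excludes its upper bound.
    if furlongs < 5:
        return "<5F"
    if furlongs < 6:
        return "5-6F"
    if furlongs < 7:
        return "6-7F"
    if furlongs < 8:
        return "7-8F"
    return "8+F"

def age_group(age):
    # Two-year-olds are kept apart from three-year-olds and up.
    return "2yo" if age == 2 else "3+yo"

def stratum(entry):
    """Combine the three attributes into one stratum label."""
    return (entry["surface"], distance_group(entry["furlongs"]), age_group(entry["age"]))

# Hypothetical entry: a two-year-old in a 6-furlong dirt race.
label = stratum({"surface": "Dirt", "furlongs": 6.0, "age": 2})
```

Three surfaces, five distance groups, and two age groups give 3 x 5 x 2 = 30 possible strata, so some combinations may end up with very few entries.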

Overlay
05-28-2010, 05:41 PM
I've gotten by using the same groupings that Mike Nunamaker employed in his multiple statistical studies, which were the same as those you listed, except that he used only two distance groupings (sprints (less than one mile) and routes (one mile or more)). It would be a question of how large a sample you would have to collect with respect to distance to have enough data points in each distance sub-group from which to draw valid conclusions, as well as whether the differences in breaking it down that fine would be significant enough to warrant the time and effort. Plus, some commonly used metrics may already incorporate significant distance distinctions in how they are calculated (I'm thinking of things like Quirin speed points, for example), so that you would not necessarily have to be so exacting with the distance factor itself.
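
One way to get a feel for the sample-size question is to count the entries in each cell and list the cells that fall below some minimum. A rough Python sketch, using the two-way sprint/route split described above; the field names, the toy data, and the threshold of 30 are placeholders, not a statistical rule:

```python
from collections import Counter

def sprint_or_route(furlongs):
    # One mile is 8 furlongs: sprints are under a mile, routes are a mile or more.
    return "sprint" if furlongs < 8 else "route"

def sparse_cells(entries, key, min_count):
    """Return the cells (and their counts) that have fewer than min_count entries."""
    counts = Counter(key(e) for e in entries)
    return {cell: n for cell, n in counts.items() if n < min_count}

# Hypothetical data: 50 dirt sprints and only 3 turf routes.
entries = [{"surface": "Dirt", "furlongs": 6.0}] * 50
entries += [{"surface": "Turf", "furlongs": 8.5}] * 3

thin = sparse_cells(
    entries,
    key=lambda e: (e["surface"], sprint_or_route(e["furlongs"])),
    min_count=30,  # arbitrary threshold chosen for illustration
)
```

Running the same check with a finer distance breakdown as the key shows how many more cells drop below the threshold, which is the trade-off in question.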

arno
05-28-2010, 06:19 PM
Male Female

Overlay
05-28-2010, 06:29 PM
Thank you, that had completely slipped my mind. :blush: