Databasing IV Factors


mwilding1981
09-19-2009, 11:31 AM
Just a quick question for those of you who run or keep databases: what do you find is the best way to store IVs for factors? Do you just create them dynamically in the front-end, or do you store them and update them every x months?

Also, what would you say is the best way of splitting a rating into groups for the IVs? For example, a rating from 0 to 10 (two decimal places) where most horses fall between 5 and 7.

Thanks

CBedo
09-19-2009, 11:50 PM
It really depends on how often you are recalculating them. In an analytic database (versus a transactional one), you definitely have to take performance into consideration and not necessarily keep the database normalized. Redundant data and calculated fields will often speed things up quite a bit.
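Something along these lines, for example (a minimal sketch using SQLite; the "results" table and all the column names are just placeholders, not anyone's actual schema):

# A sketch of the "store and refresh" approach: keep the IVs in their own
# table and rebuild them on a schedule, instead of recalculating in the
# front-end on every query. Assumes a hypothetical "results" table with one
# row per start, the factor's bucket, and a 0/1 "won" flag.
import sqlite3

conn = sqlite3.connect("handicapping.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS factor_iv (
        factor   TEXT,
        bucket   TEXT,
        starters INTEGER,
        winners  INTEGER,
        iv       REAL,
        PRIMARY KEY (factor, bucket)
    )
""")

def refresh_ivs(factor):
    # Overall win rate across the whole sample for this factor.
    starters, winners = conn.execute(
        "SELECT COUNT(*), SUM(won) FROM results WHERE factor = ?",
        (factor,)).fetchone()
    overall = winners / starters

    # IV for each bucket = bucket win rate / overall win rate.
    rows = conn.execute(
        "SELECT bucket, COUNT(*), SUM(won) FROM results "
        "WHERE factor = ? GROUP BY bucket", (factor,)).fetchall()
    for bucket, n, w in rows:
        conn.execute(
            "INSERT OR REPLACE INTO factor_iv VALUES (?, ?, ?, ?, ?)",
            (factor, bucket, n, w, (w / n) / overall))
    conn.commit()

The point is that the expensive GROUP BY work happens once on the refresh schedule, and the front-end just reads the little factor_iv table.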

I don't really have anything to add about your last question. I think that's going to be a personal methodology decision, but someone else might have a different opinion.

mwilding1981
09-21-2009, 08:53 AM
Thanks for the reply, CBedo. OK, I think I will store them as tables in the database and update them periodically. With regard to the other question, would you say that 30 winners is the minimum that should be in the sample?

CBedo
09-21-2009, 10:36 AM
That's a tough one, and there's probably no right number to use. I haven't been working with IVs much lately, but when I do, what I do is just "normalize" the IVs, reducing the range from top to bottom in the smaller samples (the smaller the sample, the more the smoothing). The basic concept is that I'm trying to get a more realistic win & return metric. I'm pretty sure that none of my metrics are going to produce a 2000% return (I wish), for example, so in a small sample, an impact value which projects a monstrous edge gets reduced.
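Something like this, maybe -- just a sketch of the shrink-toward-1.0 idea, where the constant k is an arbitrary illustrative choice rather than a tested value:

def smoothed_iv(raw_iv, sample_size, k=200):
    # Shrink a raw impact value toward the neutral value 1.0. The smaller
    # the sample, the more the smoothing: at sample_size == k the raw IV
    # keeps half its distance from 1.0; as the sample grows, the raw IV
    # dominates. k = 200 is purely illustrative.
    weight = sample_size / (sample_size + k)
    return 1.0 + (raw_iv - 1.0) * weight

# A monstrous edge in a tiny sample gets pulled in hard...
print(smoothed_iv(3.0, 25))    # -> about 1.22
# ...while the same IV in a big sample survives nearly intact.
print(smoothed_iv(3.0, 5000))  # -> about 2.92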

Dave Schwartz is probably the guy to ask about this. His HorseStreet software has some pretty sophisticated uses of IVs.

GameTheory
09-21-2009, 12:04 PM
You could start with a Bayesian prior of average performance (which would be IV = 1.0) and then go from there -- that would help keep small samples in line. A simple version would just be a modified Laplace smoothing where your initial samples all start with 1 hit and 10 misses. Something like that -- between 8 and 10 for average field size if you're doing wins. (Laplace assumes 50% is average, so it uses 1/2.)

So all samples are initialized this way, and then if you've got a small sample where you hit 6 out of 9 on some factor, you'd figure it as (1 + 6) / (10 + 9) = 7/19 -- still about 37%, but not the outrageous 66% of the raw sample. Then as the sample size grows, the prior becomes less and less important...
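In code it's only a couple of lines. A sketch, assuming the prior is 1 hit out of 10 total attempts (i.e. 1 hit and 9 misses), which is what the (1 + 6) / (10 + 9) arithmetic uses:

def smoothed_win_rate(hits, tries, prior_hits=1, prior_tries=10):
    # Laplace-style smoothing: every factor starts life with a prior of
    # prior_hits wins in prior_tries attempts (roughly 1-in-10, i.e. an
    # average horse in an average field), and real results are added on top.
    return (prior_hits + hits) / (prior_tries + tries)

# The small-sample example from above: 6 wins from 9 runners.
print(smoothed_win_rate(6, 9))      # -> 7/19, about 0.37, not the raw 0.67
# As the sample grows, the prior washes out.
print(smoothed_win_rate(600, 900))  # -> about 0.66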

CBedo
09-21-2009, 12:35 PM
Leave it to GT to come up with something grounded in logic and math as opposed to the "pull it out of my ass" approach that I normally use...:bang:

mwilding1981
09-21-2009, 01:00 PM
Brilliant, thank you guys.

GameTheory
09-21-2009, 03:02 PM
Technical correction to what I said above: 1 hit and 9 misses, meaning 1 out of 10.

mwilding1981
10-18-2009, 11:07 AM
Another question to do with the structure of IVs: how do you decide how many sections to break your rating into? For example, with a rating of 0 to 100, do I break it into 10 equal groups of 10, or 20 groups? Should the groups be equal in width, or should they be based on the amount of data in each group? For example, if the range between 20 and 40 holds 75% of the horses, then groups of 10 would be too big there -- so should I break it down based on the number of winners in a group?
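For what it's worth, here's a sketch of the equal-count (quantile) alternative, assuming the ratings are just a list of numbers -- the cut points come from the data itself, so a crowded range like 20-40 automatically gets more, narrower groups:

import numpy as np

def quantile_bins(ratings, n_groups=10):
    # Equal-count binning: pick cut points so each group holds roughly the
    # same number of horses, instead of slicing 0-100 into equal widths.
    edges = np.quantile(ratings, np.linspace(0, 1, n_groups + 1))
    # Digitize against the interior edges -> group index 0..n_groups-1.
    return np.digitize(ratings, edges[1:-1]), edges

# Example: 75% of ratings piled between 20 and 40, the rest spread out.
rng = np.random.default_rng(0)
ratings = np.concatenate([rng.uniform(20, 40, 750),
                          rng.uniform(0, 100, 250)])
groups, edges = quantile_bins(ratings)
print(np.round(edges, 1))  # narrow groups in the 20-40 pile-up, wide elsewhere

Each group then ends up with a comparable number of starters behind its IV, so the thin tails don't reintroduce the small-sample problem.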