Statistical Analysis of Racing Data


LaughAndBeMerry
12-25-2005, 08:42 AM
Greetings from a new member and relative newbie (three years) to racing. I have begun messing around with statistical models in trying to generate accurate probability lines. I’ve got a lot of data gathered over the past two years, but my work so far has generated as many questions as answers. I was hoping some of the veterans on the board could help me out with a few questions I have.



1) I understand the pitfalls of multiple regression analysis as it applies to racing. I can see the non uniformity of the residual terms (literally on a scatter plot). From what I’ve read, that’s going to be an issue with any 0/1 outcomes. Does it make sense to help smooth out the data by giving partial credit for finishing second or third? After all, would it be correct to say that your odds line is great if some 50-1 shot you made 8-1 wins by a nose, but that the line was crappy if the same horse lost a 3 horse photo?



2) Can anyone point out the pros and cons of using probit analysis vs. logistic regression? I’ve tried them both and don’t have a feel for which would work better.



3) Finally, what would be a reasonable r2 or percentage of deviance explained by the model given that I’m looking at 35,000 observations? While some of the models I’ve run seem to do a pretty good job of fitting (i.e. actual win %’s against predicted are close across odds groups) they all have r2 of 10-25%. Some of the quant jocks where I purchased the stats package have said that this is not unreasonable given the large number of observations.
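On question 2, one way to get a feel for the difference is to plot or tabulate the two link functions side by side. The sketch below assumes nothing about the poster's data. It only compares the standard normal CDF (probit) with the logistic CDF after rescaling the logistic input by 1.702, a conventional constant that is not taken from this thread. The two curves stay very close, which suggests why fitted probabilities from the two models often look alike except in the tails.

```python
import math

def logistic_cdf(x):
    """Logistic link: P = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(x):
    """Probit link: standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Conventional rescaling constant (an assumption for illustration only).
SCALE = 1.702

# Evaluate both curves on a grid from -6 to 6 in steps of 0.01.
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(normal_cdf(x) - logistic_cdf(SCALE * x)) for x in grid)

print(f"largest gap between the two curves: {max_gap:.4f}")
```

The logistic curve has somewhat heavier tails, so if the two models disagree it will mostly be on extreme longshots and heavy favorites.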



Wishing a happy holidays to all, and many thanks for the insights I’ve gained on the board as a lurker this past year.

arkansasman
12-25-2005, 09:32 AM
I am just curious, but what kind of logit software do you use?

LaughAndBeMerry
12-25-2005, 09:53 AM
Using Statgraphics (www.statgraphics.com), Centurion version. It's got a lot of powerful features at a semi-reasonable price.

Turfday
12-25-2005, 03:34 PM
I have done exactly that, taking odds, finish position and FIELD SIZE into consideration and compiling a rating based on various time periods for trainers, jockeys, trainer-jockey combinations, sire and post positions.

It makes it easier that I'm a strategic partner of Equibase. At the risk of self promotion (a big no no on this site), I won't say more.

mcikey01
12-25-2005, 07:03 PM
1)I understand the pitfalls of multiple regression analysis as it applies to racing. I can see the non uniformity of the residual terms (literally on a scatter plot). From what I’ve read, that’s going to be an issue with any 0/1 outcomes. Does it make sense to help smooth out the data by giving partial credit for finishing second or third? After all, would it be correct to say that your odds line is great if some 50-1 shot you made 8-1 wins by a nose, but that the line was crappy if the same horse lost a 3 horse photo?

This is so much a game of relativity that it is difficult to say what the "real" value of a runner-up position is (validity/causality issues are paramount in statistical analysis of racing factors). In any case, you might try factoring in finish relative to field size to see if it smooths out the data and enhances "predictability".
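One possible reading of "finish relative to field size" is a score that runs from 1.0 for the winner down to 0.0 for the last horse. The function name and the linear scaling below are assumptions for illustration; the post does not say which formula, if any, was intended.

```python
def finish_score(finish_pos, field_size):
    """Scale finish position to [0, 1]: 1.0 for the winner, 0.0 for last place.

    Hypothetical helper: linear scaling is just one possible choice.
    """
    if field_size < 2 or not 1 <= finish_pos <= field_size:
        raise ValueError("need field_size >= 2 and 1 <= finish_pos <= field_size")
    return (field_size - finish_pos) / (field_size - 1)

# Second place is worth more in a 12-horse field than in a 6-horse field.
print(finish_score(2, 6))
print(finish_score(2, 12))
```

A score like this could replace the 0/1 win indicator as the dependent variable, at the cost of no longer modeling win probability directly.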

osophy_junkie
01-02-2006, 01:04 PM
1) I understand the pitfalls of multiple regression analysis as it applies to racing. I can see the non uniformity of the residual terms (literally on a scatter plot). From what I’ve read, that’s going to be an issue with any 0/1 outcomes. Does it make sense to help smooth out the data by giving partial credit for finishing second or third?
All of the determining factors are contained in my model. The algorithm gives them the proper weight and I do not alter this probability. I give horses coming in second, third, etc. more "credit" by removing the horse in front of them and recalculating values with the remaining horses.

After all, would it be correct to say that your odds line is great if some 50-1 shot you made 8-1 wins by a nose, but that the line was crappy if the same horse lost a 3 horse photo?
I would say it doesn't matter. The only thing that matters is how consistent your model is.
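The "remove the horse in front and recalculate" idea can be sketched as follows. This is a guess at the general approach, not the poster's actual algorithm: it assumes that once a horse is removed, the remaining win probabilities are simply rescaled to sum to 1. The horse names and probabilities are made up, and the sketch assumes no horse has a win probability of exactly 1.

```python
def renormalize_without(win_probs, removed):
    """Drop one horse and rescale the remaining probabilities to sum to 1."""
    remaining = {h: p for h, p in win_probs.items() if h != removed}
    total = sum(remaining.values())
    return {h: p / total for h, p in remaining.items()}

def second_place_probs(win_probs):
    """P(horse runs second), summing over every possible winner.

    For each candidate winner w: P(w wins) * P(horse wins among the rest).
    """
    second = {h: 0.0 for h in win_probs}
    for winner, p_win in win_probs.items():
        rest = renormalize_without(win_probs, winner)
        for horse, p in rest.items():
            second[horse] += p_win * p
    return second

win_probs = {"A": 0.5, "B": 0.3, "C": 0.2}  # made-up odds line
second = second_place_probs(win_probs)
for horse, p in second.items():
    print(horse, round(p, 4))
```

The same step can be repeated for third place by removing two horses at a time. Whether proportional rescaling is realistic for the minor placings is a separate question that the data would have to answer.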

LaughAndBeMerry
01-02-2006, 05:01 PM
Thanks for the reply. I'd like to talk more about this if you wouldn't mind. Off the board works best for me if that's OK for you. My e-mail address is velocityphd@yahoo.com. Thanks.

LBM

obeguy
01-19-2006, 07:44 PM
You need conditional logistic regression with race ID as the grouping variable. All of the questions you have about variables and outliers will be answered by your model runs; the process is entirely empirical. Benter made all his money with this type of model but around 2001 switched to a better probit model in conjunction with some kind of Markov chain Monte Carlo algorithms, which I don't understand.
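For readers who have not seen it, a conditional logit treats each race as one choice among its runners: every horse gets a linear score, and the scores are turned into win probabilities that sum to 1 within the race. The sketch below is a bare-bones illustration in plain Python with a single made-up feature and toy data. It is not Benter's model or any package's implementation, and the fitting routine is simple gradient ascent chosen for readability.

```python
import math
from collections import defaultdict

def race_probabilities(rows, beta):
    """rows: list of (race_id, features, won). Softmax of scores within each race."""
    scores = [sum(b * x for b, x in zip(beta, feats)) for _, feats, _ in rows]
    by_race = defaultdict(list)
    for idx, (race_id, _, _) in enumerate(rows):
        by_race[race_id].append(idx)
    probs = [0.0] * len(rows)
    for idxs in by_race.values():
        top = max(scores[i] for i in idxs)  # subtract the max for numerical stability
        exps = {i: math.exp(scores[i] - top) for i in idxs}
        total = sum(exps.values())
        for i in idxs:
            probs[i] = exps[i] / total
    return probs

def log_likelihood(rows, beta):
    """Sum of log win probabilities of the actual winners."""
    probs = race_probabilities(rows, beta)
    return sum(math.log(p) for p, (_, _, won) in zip(probs, rows) if won)

def fit(rows, n_features, steps=500, lr=0.1):
    """Gradient ascent; the gradient is the sum over rows of (won - p) * x."""
    beta = [0.0] * n_features
    for _ in range(steps):
        probs = race_probabilities(rows, beta)
        grad = [0.0] * n_features
        for p, (_, feats, won) in zip(probs, rows):
            for k, x in enumerate(feats):
                grad[k] += (won - p) * x
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta

# Toy data: (race_id, [standardized rating], won). Entirely made up.
rows = [
    (1, [1.0], 1), (1, [0.0], 0), (1, [-1.0], 0),
    (2, [0.5], 0), (2, [-0.5], 1), (2, [0.0], 0),
    (3, [1.0], 1), (3, [0.2], 0),
    (4, [0.8], 1), (4, [-0.3], 0), (4, [0.1], 0),
]
beta_hat = fit(rows, n_features=1)
print("fitted coefficient:", round(beta_hat[0], 3))
print("log-likelihood:", round(log_likelihood(rows, beta_hat), 3))
```

In practice a statistics package would do the fitting. The point here is only that grouping by race makes the probabilities compete with one another, which an ordinary logistic regression on individual horses does not do.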

toetoe
01-19-2006, 08:22 PM
Great thread already, only eight posts old. :ThmbUp:

kenwoodallpromos
01-19-2006, 11:12 PM
Since I have no idea what you are talking about, let ME ask you this:
"Would it be correct to say that your odds line is great if some 50-1 shot you made 8-1 wins by a nose, but that the line was crappy if the same horse lost a 3 horse photo?"
If you pegged a horse at 8-1 and it came close, doesn't that mean you should have made it lower odds anyway?

Overlay
01-20-2006, 02:01 AM
I'm not sure you could draw that conclusion from just one race. But if horses that are assigned higher fair odds are consistently outperforming lower-odds horses (either in terms of winning or finishing in the money), or if horses in a particular odds range are regularly winning significantly more or less often than their fair odds indicate they should, I'd say that the odds-assignment model would need some refinement.
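The check described above can be sketched as a simple calibration table: group horses by the probability the line assigned them, then compare the average assigned probability with the actual win rate in each group. The bucket edges and the sample data are made up for illustration.

```python
def calibration_table(predictions, edges=(0.0, 0.05, 0.10, 0.20, 0.35, 1.0)):
    """predictions: iterable of (fair_probability, won) pairs.

    Returns rows of (low, high, count, mean predicted, actual win rate),
    skipping empty buckets. A probability of exactly 1.0 goes in the last bucket.
    """
    n_buckets = len(edges) - 1
    buckets = [[] for _ in range(n_buckets)]
    for prob, won in predictions:
        for b in range(n_buckets):
            is_last = b == n_buckets - 1
            if edges[b] <= prob < edges[b + 1] or (is_last and prob == edges[-1]):
                buckets[b].append((prob, won))
                break
    table = []
    for b, items in enumerate(buckets):
        if not items:
            continue
        n = len(items)
        mean_pred = sum(p for p, _ in items) / n
        win_rate = sum(w for _, w in items) / n
        table.append((edges[b], edges[b + 1], n, mean_pred, win_rate))
    return table

# Made-up sample: (probability implied by the fair odds line, 1 if the horse won).
sample = [(0.04, 0), (0.04, 0), (0.08, 0), (0.08, 1),
          (0.15, 0), (0.30, 1), (0.30, 0), (0.50, 1)]
table = calibration_table(sample)
for low, high, n, mean_pred, win_rate in table:
    print(f"{low:.2f}-{high:.2f}  n={n}  predicted={mean_pred:.3f}  actual={win_rate:.3f}")
```

With only a handful of horses per bucket the actual rates are mostly noise, so a table like this only becomes informative over thousands of starters.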

arkansasman
01-20-2006, 05:01 AM
3) Finally, what would be a reasonable r2 or percentage of deviance explained by the model given that I’m looking at 35,000 observations? While some of the models I’ve run seem to do a pretty good job of fitting (i.e. actual win %’s against predicted are close across odds groups) they all have r2 of 10-25%. Some of the quant jocks where I purchased the stats package have said that this is not unreasonable given the large number of observations.

I am getting an R2 of .1189 in a combined model (model and Public) of 3100 races out of sample.
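The thread does not say how this R2 is computed. One common definition for race models is a likelihood-based pseudo-R2: compare the log-likelihood of the winners under the model with the log-likelihood under a "know-nothing" line that gives every horse 1/field size. The sketch below assumes that definition and uses made-up races. It may not match what any particular stats package reports.

```python
import math

def pseudo_r2(races):
    """races: list of (win_probs, winner_index), with win_probs summing to 1 per race.

    Returns 1 - LL_model / LL_null, where the null model assigns each horse
    a probability of 1 / field_size.
    """
    ll_model = sum(math.log(probs[w]) for probs, w in races)
    ll_null = sum(math.log(1.0 / len(probs)) for probs, _ in races)
    return 1.0 - ll_model / ll_null

# Made-up races: the model's win probabilities and the index of the actual winner.
races = [
    ([0.5, 0.3, 0.2], 0),
    ([0.4, 0.4, 0.2], 1),
    ([0.25, 0.25, 0.25, 0.25], 2),
]
r2 = pseudo_r2(races)
print(round(r2, 4))
```

Under this definition a value of 0 means the model is no better than assigning equal chances, and values far below 1 are normal because race outcomes are inherently noisy.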