
View Full Version : Logit Model Question


mwilding1981
09-07-2008, 02:17 PM
Have been playing with logit models recently and have been getting a lot of horses with infinitesimal chances and a couple with high chances. Not sure what I have done, as I wasn't getting this before. Does anybody have any suggestions?

gm10
09-08-2008, 05:32 AM
Have been playing with logit models recently and have been getting a lot of horses with infinitesimal chances and a couple with high chances. Not sure what I have done, as I wasn't getting this before. Does anybody have any suggestions?

hi, me again

sounds like you have a problem with multi-collinearity (highly correlated input variables)

mwilding1981
09-08-2008, 07:24 AM
GM thank you. I had added some variables that I didn't think were correlated but on second thoughts they may well be. I shall check into it. Thank you for your help :)

garyoz
09-08-2008, 10:15 AM
GM thank you. I had added some variables that I didn't think were correlated but on second thoughts they may well be. I shall check into it. Thank you for your help :)


You only need to run a simple correlation matrix--if you are up around .7 (or -.7), then you have problems. This problem has been attacked by many on this board, run a search for "logit regression" "logistical regression" "probit analysis" or even "multiple regression." One solution is to consider combining the correlated variables into some type of compound variable, such as using Factor Analysis. There are some other econometric type approaches that may also be appropriate.
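
If it helps, here is a minimal sketch of that correlation check in Python (the file name and column names below are placeholders, not anyone's actual factors):

import pandas as pd

df = pd.read_csv("starters.csv")          # hypothetical table: one row per starter
factors = ["last_beyer", "days_since_last", "early_pace", "late_pace"]

corr = df[factors].corr()                 # Pearson correlation matrix
print(corr.round(2))

# flag every pair at or above the .7 rule of thumb mentioned above
for i, a in enumerate(factors):
    for b in factors[i + 1:]:
        r = corr.loc[a, b]
        if abs(r) >= 0.7:
            print(f"possible collinearity: {a} vs {b} (r = {r:.2f})")

Anything flagged there is a candidate for dropping, or for combining into one compound variable (e.g. via Factor Analysis) before the regression is run.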

LottaKash
09-08-2008, 12:42 PM
Hey All......I would like to see in print some of what you gentlemen are babbling about, as it is all Greek to me.......I am open-minded and convinced that there is validity in undertaking all of this analysis, but as it relates to finding contenders and winners, I sure would like to see some bread n' butter on the table.....I have been to Stalford and it boggles my mind what they are talking about.......haha.....I am curious and would like to be enlightened.....


best,

gm10
09-08-2008, 05:41 PM
try finding this paper
i found it on google today but can't find a download now
it gives a plausible way to deal with the problem

Random Multiclass Classification: Generalizing Random Forests to Random MNL

by Van den Poel and Prinzie, from the University of Ghent in Belgium

LottaKash
09-08-2008, 08:44 PM
try finding this paper
i found it on google today but can't find a download now
it gives a plausible way to deal with the problem

Random Multiclass Classification: Generalizing Random Forests to Random MNL

by Van den Poel and Prinzie, from the University of Ghent in Belgium


Hey thx for the heads up on this matter......but it is still Greek to me....haha...I get the gist of it, but not for me..... Today I caught a very gettable $37.00 goodie.. and all without trees and complex variables; it was just the quickest and stealthiest horse in the race combined with good trainer intentions on race day......

Still, all kidding aside, I would love to see a diagram or printout of this madness, and what immediate effect it has on predicting the outcome of a horse race.......Are you trying to find isolated impact values, or are you attempting to handicap a race with an algorithm............???

dazed and confused,

best,

garyoz
09-08-2008, 09:24 PM
.......Are you trying to find isolated impact values, or are you attempting to handicap a race with an algorithm............???

Yes to both, a lot of ways to use the data. Personally I've given up on the "madness"--it is fun to think about but a black hole of time--better off getting those "gettable goodies" the old-fashioned way.

You can get insight from modeling, but personally I don't think straightforward logistic regression is an end point, and it could be a significant detour.

LottaKash
09-08-2008, 09:41 PM
Yes to both, a lot of ways to use the data. Personally I've given up on the "madness"--it is fun to think about but a black hole of time--better off getting those "gettable goodies" the old-fashioned way.

You can get insight from modeling, but personally I don't think straightforward logistic regression is an end point, and it could be a significant detour.


Cool. If I were a younger man, with the technology of today I would be greatly fascinated by the thought of finding the "Black Box".....

Some time ago Racecom.com was into using Neural Networks to handicap the races (still are).....They were somewhat successful, but horses being the flesh and blood that they are, and trainers as greedy and sneaky as they are, their software hit a limitation in their prognostications.... I suspect a machine, as grand and powerful as it may be, can only take you so far......Three cheers for our "Grey Box".........

best,

gm10
09-09-2008, 03:50 AM
Hey thx for the heads up on this matter......but it is still Greek to me....haha...I get the gist of it, but not for me..... Today I caught a very gettable $37.00 goodie.. and all without trees and complex variables; it was just the quickest and stealthiest horse in the race combined with good trainer intentions on race day......

Still, all kidding aside, I would love to see a diagram or printout of this madness, and what immediate effect it has on predicting the outcome of a horse race.......Are you trying to find isolated impact values, or are you attempting to handicap a race with an algorithm............???

dazed and confused,

best,

You try to find the probability P = 1/(1+odds) of a horse winning a race (where the odds are the odds against: at 3-1, P = 1/(1+3) = 0.25).

basically you model P as a function of all sorts of input variables that according to the handicapper in you will make a difference (last Beyer rating, days since last race, etc)

so you have P = f(X1, X2, X3, ....)
just for simplicity, assume f() takes a linear form
so you have P = a*X1 + b*X2 + c*X3

estimating a, b, c is usually not too hard (depending on your choice of f)

problem is, if X1, X2, X3 are too highly correlated, your model becomes unstable
the estimation of a, b, c is no longer reliable and as a result neither are your probabilities (and odds)
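
To make that concrete, here is a minimal sketch of fitting that kind of model with statsmodels; the data file and column names are placeholders, and the logistic link 1/(1+exp(-...)) stands in for the simple linear f() above:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("starters.csv")                      # hypothetical data: one row per starter
X = sm.add_constant(df[["last_beyer", "days_since_last", "early_pace"]])
y = df["won"]                                         # 1 if the horse won, 0 otherwise

model = sm.Logit(y, X).fit()                          # maximum likelihood estimation of a, b, c
print(model.summary())
probs = model.predict(X)                              # estimated win probabilities

With highly correlated inputs the usual symptoms are huge standard errors, coefficients that swing wildly when you add or drop a variable, or an optimizer that won't converge -- which is what ends up producing those near-zero and near-one probabilities.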

robert99
09-09-2008, 06:43 AM
(snipped)
problem is, if X1, X2, X3 are too highly correlated, your model becomes unstable
the estimation of a, b, c is no longer reliable and as a result neither are your probabilities (and odds)

It is always a hard concept to grasp that, say, 5 carefully chosen, good, and largely independent variables are far more powerful for prediction than models which use over 100, and which the author claims, after years of data collection and input, are "obviously" far superior.

The other philosophical problem with regression, and similarly NN, is that it tends to the mean, the average of the past, while racing analysis is about accurately predicting quite small differences for today. Odds-line predictions, which are relevant over a longer term, do not suffer quite so much as individual race prediction, but the long term is still made up of individual errors in individual races. That is, the overlay still has to win to collect, and the apparent overlay the model indicates may sometimes actually be an underlay.

Any money is made by getting into the market early, before the others catch up, and hammering any edge. Most people are too cautious to do that and hang back to perfect the "model" for another year, and then the edge is gone.

garyoz
09-09-2008, 07:41 AM
You try to find the probability P = 1/(1+odds) of a horse winning a race.

basically you model P as a function of all sorts of input variables that according to the handicapper in you will make a difference (last Beyer rating, days since last race, etc)

so you have P = f(X1, X2, X3, ....)
just for simplicity, assume f() takes a linear form
so you have P = a*X1 + b*X2 + c*X3

estimating a, b, c is usually not too hard (depending on your choice of f)

problem is, if X1, X2, X3 are too highly correlated, your model becomes unstable
the estimation of a, b, c is no longer reliable and as a result neither are your probabilities (and odds)

The theoretical issue is the correlation of the error terms, and regression assumes them to be independently distributed. Obviously with a high correlation they are not. Robert99 has an excellent point that less is often more on the predictor variable side. Then there's always backfitting, regression to the mean, blah blah blah etc. .....been written about too many times on this Board.
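
One way to put a number on that collinearity is the variance inflation factor; a rough sketch (column names are placeholders, and a VIF above roughly 5-10 is the usual warning sign):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("starters.csv")          # hypothetical table of past starters
X = sm.add_constant(df[["last_beyer", "days_since_last", "early_pace", "late_pace"]])

for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 1))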

gm10
09-09-2008, 08:11 AM
Yes, I'm aware of that; I was just trying to keep it simple.
You can no longer assume the estimators follow the theoretical distribution, which in the case of MLE of McFadden's model would be a multivariate normal distribution. But even before you get there, I have found that the estimation algorithm finds false optimal values or sometimes doesn't even converge.

But that is a problem with having many input variables in general. You will often be faced with the "curse of dimensionality". You need a super-huge data set; otherwise you are basically dealing with a huge chunk of multi-dimensional space with your data only occupying small bits of it. Finding the maximum of your n-dimensional shape is almost impossible unless you have huge amounts of data. I remember reading about creating artificial races to increase the sample size. They would take out the winner of the race, consider the rest of the field to be in a race of their own, and the second finisher would be modeled as the winner of this race, etc.
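
A rough sketch of how that "explosion" might be coded, assuming a table with a race id, finish position and the factor columns (all names here are placeholders):

import pandas as pd

df = pd.read_csv("starters.csv")          # columns: race_id, finish_pos, plus the factors

exploded = []
for race_id, race in df.groupby("race_id"):
    race = race.sort_values("finish_pos")
    # depth 0 is the real race; depth 1 drops the winner and treats the runner-up
    # as the winner of a new race, and so on down the field
    for depth in range(len(race) - 1):
        sub = race.iloc[depth:].copy()
        sub["pseudo_race_id"] = f"{race_id}_{depth}"
        sub["won"] = (sub["finish_pos"] == sub["finish_pos"].min()).astype(int)
        exploded.append(sub)

exploded = pd.concat(exploded, ignore_index=True)     # much bigger sample, same horses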

I wouldn't totally agree with the statement that 5 variables are better than 100, but there does need to be some sort of "feature selection", which is dealt with in the article I posted above.

Nice discussion btw guys,

TrifectaMike
09-09-2008, 10:10 AM
But that is a problem with having many input variables in general. You will often be faced with the "curse of dimensionality". You need a super-huge data set; otherwise you are basically dealing with a huge chunk of multi-dimensional space with your data only occupying small bits of it. Finding the maximum of your n-dimensional shape is almost impossible unless you have huge amounts of data. I remember reading about creating artificial races to increase the sample size. They would take out the winner of the race, consider the rest of the field to be in a race of their own, and the second finisher would be modeled as the winner of this race, etc.

I wouldn't totally agree with the statement that 5 variables are better than 100, but there does need to be some sort of "feature selection", which is dealt with in the article I posted above.

Nice discussion btw guys,

I agree with the "curse of dimensionality" question. However, I disagree that artificially creating race results is a viable solution. Removing any runner, especially the winner, to increase sample size is not meaningful in horse racing. Doing so completely negates the influence the winner had on the finish order of the race. Each horse has an influence function which contributes to the race outcome. Obviously some runners have a greater influence than others.

That is not to say that artificially increasing sample size as described doesn't have its place, just not in horse racing. It can be used in areas where results or outcomes are not directly influenced by the dynamics created by the participants.

For example, if one were to use height or weight samples to determine distributions, the approach would be viable, since one sample height or weight has no influence on any of the other sample heights or weights.

Furthermore, Monte Carlo simulations can be employed to demonstrate that increasing sample size as described is not a viable approach.

The answer to the dimensionality problem is to reduce dimensionality and still maintain integrity of the model.
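
One common way to do that is to project correlated factors onto a handful of principal components; a rough sketch below, purely as an illustration of the idea (column names are placeholders, and this is only one of several options):

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("starters.csv")
factors = ["last_beyer", "days_since_last", "early_pace", "late_pace"]

scaled = StandardScaler().fit_transform(df[factors])  # PCA is scale-sensitive, so standardize first
pca = PCA(n_components=0.95)                          # keep enough components for 95% of the variance
components = pca.fit_transform(scaled)
print(components.shape, pca.explained_variance_ratio_)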

Mike

gm10
09-09-2008, 12:30 PM
I agree with the "curse of dimensionality" question. However, I disagree that artificially creating race results is a viable solution. Removing any runner, especially the winner, to increase sample size is not meaningful in horse racing. Doing so completely negates the influence the winner had on the finish order of the race. Each horse has an influence function which contributes to the race outcome. Obviously some runners have a greater influence than others.

That is not to say that artificially increasing sample size as described doesn't have it's place, just not in horse racing. It can be used in areas where results or outcomes are not directly influenced by the dynamics created by the participants.

For example, if one were to use height or weight samples to determine disributions, the approach would be viable, since one sample height or weight would has no inflence on any of the other sample heights or weights.

Futhermore, Monte Carlo simulations can be employed to demonstrate that increasing sample size as described is not a viable approach.

The answer to the dimensionality problem is to reduce dimensionality and still maintain integrity of the model.

Mike

I agree that reducing dimensionality is a better approach in most cases (>20 variables), but increasing the sample size was a reasonable approach. I saw results on it in Efficiency of Betting Markets (that blue book), and while the handicapper in me did not approve of it at first for the same reasons that you gave, it turned out to be useful. I think it is less relevant in the current age, where it is fairly straightforward to compile big databases.

gm10
09-09-2008, 05:38 PM
I agree that reducing dimensionality is a better approach in most cases (>20 variables), but increasing the sample size was a reasonable approach. I saw results on it in Efficiency of Betting Markets (that blue book), and while the handicapper in me did not approve of it at first for the same reasons that you gave, it turned out to be useful. I think it is less relevant in the current age, where it is fairly straightforward to compile big databases.

It's from "Searching for positive returns at the track: a multinomial logit model for handicapping horse races" by Bolton and Chapman.
The authors refer to it as "Rank Order Explosion Process".
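
For those following along, the general shape of that multinomial logit (a sketch of the standard form, not necessarily the authors' exact specification): each horse gets a score from the fitted coefficients times its factors, and within a race the win probabilities are the softmax of those scores.

import numpy as np

def win_probs(scores):
    """scores: one fitted score per horse in a single race."""
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

# hypothetical race of four horses
print(win_probs(np.array([1.2, 0.4, 0.9, -0.3])))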

TrifectaMike
09-09-2008, 07:28 PM
It's from "Searching for positive returns at the track: a multinomial logit model for handicapping horse races" by Bolton and Chapman.
The authors refer to it as "Rank Order Explosion Process".

Thanks. I'm familiar with the process. I don't accept the premise that the rank-ordered data decompose into statistically independent choice sets. Therefore, the assumption that they can be viewed as equivalent independent races is, in a practical sense, flawed. That is not to say that there isn't a subset of races that can be exploded without introducing noise. However, without carefully reviewing how a race was run, it is not a simple matter to determine independence.

The process might add additional data for backfitting, but I'm not as confident in the derived model going forward.

Mike

gm10
09-10-2008, 05:28 AM
Thanks. I'm familiar with the process. I don't accept the premise that the rank-ordered data decompose into statistically independent choice sets. Therefore, the assumption that they can be viewed as equivalent independent races is, in a practical sense, flawed. That is not to say that there isn't a subset of races that can be exploded without introducing noise. However, without carefully reviewing how a race was run, it is not a simple matter to determine independence.

The process might add additional data for backfitting, but I'm not as confident in the derived model going forward.

Mike

Regardless of whether you accept the race for second as an independent race or not, there is a trade-off between increasing the sample size and introducing more noise into your sample. One increases the predictive power of a model, the other has a negative effect. I can think of situations where it is useful, but I don't use it. I tested it a few years ago and there were no significant effects.