Question about multinomial logistic regression


vnams
05-08-2012, 08:03 AM
It is obvious to me how to use logistic regression in horse racing - for each horse, the dependent factor is whether it wins or not. But it is not clear how to apply multinomial logistic regression. I understand that one race is one sample, and rather than predicting whether one horse wins, you are trying to predict which of 10 horses wins.

A classical example for multinomial logistic regression is choice of foods. Suppose you have four kinds of fruits (apples, oranges, bananas, pineapples), and 100 people. Each person tries all of the fruits, and chooses the one they like the best. You want to find out why people choose different kinds of fruit. You use multinomial logistic regression to use fruit characteristics to predict which kind of fruit a person will choose.

apples=1
oranges=2
bananas=3
pineapples=4


Person#  Fruit  Chosen?  Color  Sweetness  Cost
1        1      1        3.1    12         4.5
1        2      0        4.2    10         5.9
1        3      0        1.8     9         3.2
1        4      0        3.8     8         1.4
2        1      0        3.0    11         4.5
2        2      1        4.4    11         6.1
2        3      0        1.9     8         3.0
2        4      0        3.0     7         1.1



Applying this to horse racing almost, but not quite, fits. Here, the horses might be like the different fruits, and the races are like the different people testing the fruits. A horse winning a race is like a person choosing that fruit. But there is a problem - in the fruit example there are different types of fruit. Fruit #1 for person #1 is the same kind as fruit #1 for person #2. In horse racing, there are 10 horses, but there are not 10 uniquely different types of horses - there is no obvious way to link horse #1 in race 1 to horse #1 in race 2.

If any of you have used multinomial logistic regression, how have you handled this situation? Or is there something about multinomial logistic regression that I am not understanding?

Oaklawn John
05-08-2012, 11:19 AM
Is this a case of comparing apples to oranges?

AITrader
05-08-2012, 11:31 AM
Mlogit as used in horse racing takes a set of factors that capture horse, jockey, and trainer ability as well as things like track bias, weather, etc. Using logit is not as important as the fundamental factors that build your model. Neural nets, random forests, SVMs, and even linear regression can all be tuned into working systems if your underlying factors are significant.

A few primers -

"Searching for positive returns at the track: a multinomial logit model for handicapping horse races" - Bolton & Chapman

"Computer Based Horse Race Handicapping and Wagering Systems: A Report" - Bill Benter

"Still searching for positive returns at the track" - Chapman

"Alternative methods of predicting competitive events: an application in horserace betting markets." - Lessmann, Sung, Johnson

Most of the above are included in the 'bible' - The Efficiency of Racetrack Betting Markets, which is a compendium of the above papers and many more. It is required reading for anyone looking at building a statistical system IMO.

---------------------------

Most statistical systems are built around a core factor with the remaining factors scaled to add or subtract from this core factor. Many of the old Hong Kong systems used normalized finishing position. Many of the newer systems used scaled BRIS or Beyer speed numbers or similar pace/speed numbers. You can also scale the speed/pace numbers into a normalized finish factor.

Once you have discovered a core factor that is highly significant, you need to build and add factors one at a time and check that they improve predictability. Many of the factors you spend time on will add little or no predictability or will make your model worse. This is a trial and error process.

Building a working statistical system that provides a positive return is likely a 2-5 year process (full time) depending on your expertise, the data you have available, and the skill of your competition. Most folks do this in teams where each member has a specific responsibility and area of expertise.

Once you have a working statistical system, you need to build a betting system as well.

Most of the folks that have built working systems will tell you that you will need around one hundred significant factors before you will begin to see a positive return. This isn't a hard and fast number, more a reflection of the number of factors that most working systems seem to have.

vnams
05-08-2012, 03:12 PM
I have done most of those - I have read all of those references, I have a huge dataset with many significant factors. I have so far tried logistic regression, boosted logistic regression, linear regression, random forests, and I have a background in statistics.

"Using logit is not as important as the fundamental factors that build your model."

But how to use it IS important. So far as I could tell, none of those references address this issue I mentioned of how to exactly apply Mlogit to horse racing.

Perhaps this means that I am not understanding something about Mlogit?

gm10
05-08-2012, 03:32 PM
The mlogit doesn't require fruit 1/2/3/4 to be repeated in every sample. Sample 1 can consist of apple/pear/orange/banana, sample 2 can be mango/pineapple/kiwi/peach. It's the properties of the fruit that influence what fruit will be chosen, not the name of the fruit.

For horse racing, the concept is that "nature" chooses the best horse on the basis of certain variables (speed, trainer %, etc etc). Each race is a new sample.
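
To make that concrete, here is a minimal sketch in plain Python of how win probabilities come out of this kind of model. One set of coefficients is shared by every horse in every race, and the probabilities are normalized within each race, so field size and horse identity do not matter. The two factors, the weights, and all the numbers are made up for illustration; nothing here comes from a fitted model.

```python
import math

def win_probabilities(race, weights):
    """Win probabilities for the runners in one race.

    race: list of feature lists, one per horse (any field size).
    weights: one coefficient per feature, shared by every horse in every race.
    """
    scores = [sum(w * x for w, x in zip(weights, horse)) for horse in race]
    top = max(scores)                          # subtract the max for numerical stability
    expd = [math.exp(s - top) for s in scores]
    total = sum(expd)
    return [e / total for e in expd]           # normalized within this race only

# Hypothetical factors per horse: [scaled speed figure, trainer win rate]
weights = [1.2, 0.8]                           # assumed values, not estimated from data
race_1 = [[0.9, 0.15], [0.4, 0.22], [0.1, 0.10]]                # 3-horse field
race_2 = [[0.5, 0.12], [0.7, 0.18], [0.2, 0.05], [0.6, 0.20]]   # 4-horse field

for race in (race_1, race_2):
    probs = win_probabilities(race, weights)
    print([round(p, 3) for p in probs], "sum =", round(sum(probs), 3))
```

Note that "horse #1" in race_1 and "horse #1" in race_2 share nothing except the coefficients applied to their factors.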

Native Texan III
05-08-2012, 04:14 PM
You do understand.
Regression statistics are just not very good for racing, where what matters is not the average of a statistic but a tiny difference on the day, one that changes with the season and as the horse strengthens or weakens. A loser in one set may turn out to be a winner in another. Nothing stands still or is ever repeated.


Indeed Lessmann advises in his paper, and horse racing is possibly the worst example for such analysis:

"Accurately estimating the winning probabilities of participants in competitive events, such as elections and sports events, represents a challenge to standard forecasting frameworks such as regression or classification. They are not designed for modelling the competitive element, whereby a specific participant's chance of success depends not only on his/her individual capabilities but also on those of his/her competitors"

AITrader
05-08-2012, 05:16 PM
I have done most of those - I have read all of those references, I have a huge dataset with many significant factors. I have so far tried logistic regression, boosted logistic regression, linear regression, random forests, and I have a background in statistics.

"Using logit is not as important as the fundamental factors that build your model."

But how to use it IS important. So far as I could tell, none of those references address this issue I mentioned of how to exactly apply Mlogit to horse racing.

Perhaps this means that I am not understanding something about Mlogit?

If you read Bolton & Chapman's paper or Chapman's 2000 update you should understand the gist of how some of these systems are built. I suspect what is happening is that you are lost in how to go from a few significant factors to a sophisticated model with a positive expectation. That is the crux of the problem.

But hey, if it were easy everyone would be doing it...and there would be no edge :-)

GameTheory
05-08-2012, 06:16 PM
Where can I find the Lessmann paper?

turninforhome10
05-09-2012, 12:53 PM
I am enjoying this thread very much. Here are my questions regarding factors.
I use a program that allows me 8 factors to be considered.
I have one method that uses all "fixed" factors such as weight, odds, all things which are knowns.
Trying to ask this without sounding stupid, but if using 8 factors, would it be better to start with fixed factors and slowly add variables until I find a synergy? What I find is that when I add the method as a factor into more predictive methods it has a huge multiplier. Is there any credence to what I am thinking? Just thinking about the idea of synergy.

Red Knave
05-09-2012, 01:43 PM
Where can I find the Lessmann paper?
I found a link here (http://www.sciencedirect.com/science/article/pii/S0169207009002143)

You can purchase it for $31.50.

AITrader
05-09-2012, 02:03 PM
I am enjoying this thread very much. Here are my questions regarding factors.
I use a program that allows me 8 factors to be considered.
I have one method that uses all "fixed" factors such as weight, odds, all things which are knowns.
Trying to ask this without sounding stupid, but if using 8 factors, would it be better to start with fixed factors and slowly add variables until I find a synergy? What I find is that when I add the method as a factor into more predictive methods it has a huge multiplier. Is there any credence to what I am thinking? Just thinking about the idea of synergy.

Just my 2c worth...

I would start with a single core factor and see what R2 you get. (Or pseudo R2 if you run logit). A normalized finish position is what many systems use. Getting this single factor done correctly is trickier than it sounds.

Once you have a core factor that is highly significant pick one more and add it. You want to tune this factor so that your core factor does not lose significance and the second one only adds significance. Watch the R2/pseudo-R2 value with the core alone and with the second factor added. You can also watch the coefficient values. Tune the second factor so that the first has about the same coefficient value with and without the second factor.

Do this a factor at a time tuning for higher predictability. I wouldn't throw all eight into the mix at once and hope for synergies. Most likely you will get covariance effects that will be difficult to trace and things will get messy very quickly.
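
One cheap check before adding a candidate factor is to see how strongly it correlates with the core factor, since a near-copy of the core factor is the usual source of those covariance effects. The sketch below is just one simple screening step under that assumption, not a complete selection procedure; the factor names and values are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-horse values; none of these numbers come from real data.
core_factor = [0.10, 0.35, 0.50, 0.62, 0.80, 0.95]   # e.g. a normalized finish factor
candidate_a = [0.12, 0.30, 0.55, 0.60, 0.78, 0.97]   # nearly a copy of the core factor
candidate_b = [0.40, 0.10, 0.55, 0.20, 0.60, 0.30]   # carries different information

for name, values in (("candidate_a", candidate_a), ("candidate_b", candidate_b)):
    print(name, round(pearson(core_factor, values), 3))
```

A candidate that correlates very highly with the core factor will tend to steal its coefficient rather than add information, which is the instability described above.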

turninforhome10
05-09-2012, 02:25 PM
And that Sir is exactly what happened. Thanks much.

TrifectaMike
05-09-2012, 06:31 PM
More emphasis on curvefitting and less reliance on regression might be helpful.

vnams
05-09-2012, 07:04 PM
The mlogit doesn't require fruit 1/2/3/4 to be repeated in every sample. Sample 1 can consist of apple/pear/orange/banana, sample 2 can be mango/pineapple/kiwi/peach. It's the properties of the fruit that influence what fruit will be chosen, not the name of the fruit.

For horse racing, the concept is that "nature" chooses the best horse on the basis of certain variables (speed, trainer %, etc etc). Each race is a new sample.

So, to clarify - are you saying that mlogit does not group all apples together, but just treats them as generic fruit?

Dave Schwartz
05-10-2012, 11:25 AM
Regression statistics are just not very good for racing...

and yet, the biggest players in the world use it effectively.

AITrader
05-10-2012, 02:18 PM
It is obvious to me how to use logistic regression in horse racing - for each horse, the dependent factor is whether it wins or not. But it is not clear how to apply multinomial logistic regression.

I re-read your question after having missed the real issue.

I haven't heard of anyone using multinomial regression (i.e. probabilities of multiple categories). Most folks call it multinomial logit, though technically binomial logit would probably be a more correct description.

Native Texan III
05-10-2012, 03:51 PM
and yet, the biggest players in the world use it effectively.

How many big players reveal their methods to you Dave?

Dave Schwartz
05-10-2012, 04:08 PM
There are, in fact, a handful of them which I have direct knowledge of.

GameTheory
05-10-2012, 04:25 PM
There are, in fact, a handful of them which I have direct knowledge of.
I've never heard of any of the big teams using anything else (other than very similar variants, like Probit). Anybody aware of any major teams doing anything more radical?

bettheoverlay
05-10-2012, 04:36 PM
Do "whales" ever crash and burn, or are they all wildly successful?

Dave Schwartz
05-10-2012, 05:01 PM
Do "whales" ever crash and burn, or are they all wildly successful?

The majority of them fail. I know of four groups who have been trying to be profitable for 5+ years.

Dark Target
05-10-2012, 05:40 PM
and yet, the biggest players in the world use it effectively.

Exactly.

The easiest thing to do when you don't understand something is dismiss it.

AITrader
05-10-2012, 06:39 PM
The majority of them fail. I know of four groups who have been trying to be profitable for 5+ years.

How many successful teams or individuals do you know of?

Dave Schwartz
05-10-2012, 07:27 PM
Not something I can discuss here.

Sorry.

vnams
05-10-2012, 09:10 PM
I re-read your question after having missed the real issue.

I haven't heard of anyone using multinomial regression (i.e. probabilities of multiple categories). Most folks call it multinomial logit, though technically binomial logit would probably be a more correct description.

For basic logistic regression, you need to in some way compare the horse to the others in that race - by scaling the variables in some way. A multinomial regression should do those comparisons automatically.
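
As an illustration of what "scaling the variables" could look like, here is a small sketch that turns a raw factor into a within-race z-score, so each horse is measured against its own field rather than against all horses in the database. This is only one possible scaling; the column names ("race", "horse", "speed") and the numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize_within_race(rows, factor):
    """Return a copy of rows with `factor` replaced by its within-race z-score."""
    by_race = defaultdict(list)
    for row in rows:
        by_race[row["race"]].append(row[factor])

    out = []
    for row in rows:
        values = by_race[row["race"]]
        spread = pstdev(values)
        # If every horse in the race has the same value, the factor separates nobody.
        z = 0.0 if spread == 0 else (row[factor] - mean(values)) / spread
        out.append({**row, factor: z})
    return out

# Hypothetical speed figures: race 1 is a much stronger field than race 2.
rows = [
    {"race": 1, "horse": "A", "speed": 92.0},
    {"race": 1, "horse": "B", "speed": 88.0},
    {"race": 1, "horse": "C", "speed": 84.0},
    {"race": 2, "horse": "D", "speed": 75.0},
    {"race": 2, "horse": "E", "speed": 70.0},
    {"race": 2, "horse": "F", "speed": 65.0},
]

for row in standardize_within_race(rows, "speed"):
    print(row["race"], row["horse"], round(row["speed"], 3))
```

Horse D has a lower raw figure than horse C, but after scaling D is the standout of its race and C is the weakest of its own, which is the comparison a per-horse logistic regression needs to see.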

Benter (Computer Based Horse Race Handicapping and Wagering Systems: A Report) said that "A multinomial logit model using fundamental factors ..." - I presume that he figured out how to do it. Alas, he doesn't go into details.

I'll spend some time looking into this, and report to the forum when I can resolve it.

Dave Schwartz
05-10-2012, 09:19 PM
do a search for McFadden R-squared or McFadden pseudo-R-squared.
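
For reference, a minimal sketch of the calculation for win predictions: one minus the ratio of the model's log-likelihood to the log-likelihood of a null model. The null model here is assumed to give every runner an equal 1/N chance, and the probabilities and field sizes are made-up numbers.

```python
import math

def mcfadden_pseudo_r2(win_probs, field_sizes):
    """1 - LL(model) / LL(null) for win predictions.

    win_probs: the model's probability for the horse that actually won, one per race.
    field_sizes: number of runners in each of those races.
    The null model (an assumption of this sketch) gives every runner 1/N.
    """
    ll_model = sum(math.log(p) for p in win_probs)
    ll_null = sum(math.log(1.0 / n) for n in field_sizes)
    return 1.0 - ll_model / ll_null

# Hypothetical numbers for five races
winner_probs = [0.30, 0.12, 0.45, 0.20, 0.08]
field_sizes = [8, 10, 6, 9, 12]
print(round(mcfadden_pseudo_r2(winner_probs, field_sizes), 3))
```

A value of 0 means the model does no better than treating every horse as equally likely; higher values mean the actual winners were given more probability than that.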

Inglewood Flamingo
05-10-2012, 11:45 PM
and yet, the biggest players in the world use it effectively.

Once again 100% correct Dave. Wow, is this becoming a habit? :)

Inglewood Flamingo
05-10-2012, 11:48 PM
How many big players reveal their methods to you Dave?

I would have to say a resounding NO from my first hand experience. Dave may have a consulting agreement with some of the teams but I am almost positive that will not be discussed here.

AITrader
05-11-2012, 02:25 AM
For basic logistic regression, you need to in some way compare the horse to the others in that race - by scaling the variables in some way. A multinomial regression should do those comparisons automatically.

Benter (Computer Based Horse Race Handicapping and Wagering Systems: A Report) said that "A multinomial logit model using fundamental factors ..." - I presume that he figured out how to do it. Alas, he doesn't go into details.

No, multinomial logit simply predicts the probability(ies) of one or more dependent variables based on the independent variables.

Benter used multinomial logit with a single dependent variable (win - lose), which might be more properly called binomial logit.


Again, using logit, probit, linear regression, spline regression, nnets (logit is just a backprop net with a single node btw), SVMs, genetic algorithms/programming, random forests, or any other paradigm is not going to magically divine anything for you. Some of these techniques can help or hurt a tiny amount, but if your underlying factors are not significant you are wasting your time. (This is from someone who has wasted more time on techniques than I want to admit!).

gm10
05-11-2012, 03:29 AM
How do you determine whether a variable is useful for your model? Do you just add it and measure its effectiveness through how the model performs with and without the variable, or do you base inclusion on some preliminary analysis (chi squared test for example)?

AITrader
05-11-2012, 04:08 AM
How do you determine whether a variable is useful for your model? Do you just add it and measure its effectiveness through how the model performs with and without the variable, or do you base inclusion on some preliminary analysis (chi squared test for example)?

Benter's favorite method, as described in his paper, was to compare the McFadden pseudo-R2 difference between public odds alone and public odds plus the fundamental model's probability estimates. He claims that this is an analogue for betting profitability.
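
A rough sketch of what "public odds plus the fundamental model" can look like: the combined probability for each horse is taken proportional to (model probability)^alpha times (public probability)^beta, normalized within the race. In practice alpha and beta would be estimated by a second logit fit on past races; the values below, the decimal-odds format, and the proportional way of removing the track take are all assumptions made for illustration.

```python
import math

def implied_probabilities(decimal_odds):
    """Turn decimal odds into probabilities that sum to 1 (simple proportional normalization)."""
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    return [r / total for r in raw]

def combine(model_probs, public_probs, alpha, beta):
    """Probabilities proportional to model^alpha * public^beta, normalized within the race."""
    scores = [alpha * math.log(f) + beta * math.log(p)
              for f, p in zip(model_probs, public_probs)]
    top = max(scores)                          # subtract the max for numerical stability
    expd = [math.exp(s - top) for s in scores]
    total = sum(expd)
    return [e / total for e in expd]

# Hypothetical 4-horse race
model_probs = [0.35, 0.30, 0.20, 0.15]         # from a fundamental model (made up)
public_probs = implied_probabilities([2.5, 4.0, 6.0, 12.0])
alpha, beta = 0.6, 0.9                         # assumed, not estimated

combined = combine(model_probs, public_probs, alpha, beta)
print([round(p, 3) for p in public_probs])
print([round(p, 3) for p in combined])
```

Setting alpha to 0 and beta to 1 reproduces the public odds alone, so comparing the pseudo-R2 of that case with the pseudo-R2 of the fitted combination gives the difference described above.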

You can also run a betting simulation using Monte Carlo methods, though this requires a big database with odds changes right up until race time. I find this is superior to delta McFadden R2. Benter would probably disagree.

I also run correlation tests, Pearson chi-squared tests, t-tests, compare the z-scores, watch the standard deviation, calculate the RMS error, watch the changes in the coefficients, to name a few. It's a bit of alchemy.

HorseDataMiner
07-19-2012, 02:47 AM
Where can I find the Lessmann paper?

You can find it by googling it. I have it but don't know how to post it here.

HorseDataMiner
07-19-2012, 02:50 AM
Both the Benter and Chapman papers from the 1990s refer to modeling it with McFadden's conditional logit.
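
For anyone who wants to see the mechanics, below is a bare-bones sketch of fitting a conditional logit by maximum likelihood in plain Python. Each race is one observation, the likelihood is the probability given to the horse that actually won, and one coefficient vector is shared across all horses and races. The toy races, the two factor names, and the choice of plain gradient ascent with a fixed step are assumptions of this sketch; real work would use far more races and a proper optimizer or a statistics package.

```python
import math

def race_probs(race, beta):
    """Softmax over the horses in one race; beta is shared across all horses."""
    scores = [sum(b * x for b, x in zip(beta, horse)) for horse in race]
    top = max(scores)
    expd = [math.exp(s - top) for s in scores]
    total = sum(expd)
    return [e / total for e in expd]

def log_likelihood(races, winners, beta):
    """Sum over races of log P(the actual winner)."""
    return sum(math.log(race_probs(race, beta)[w]) for race, w in zip(races, winners))

def fit(races, winners, n_features, steps=5000, lr=1.0):
    """Plain gradient ascent on the (concave) conditional-logit log-likelihood."""
    beta = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for race, w in zip(races, winners):
            probs = race_probs(race, beta)
            for k in range(n_features):
                expected = sum(p * horse[k] for p, horse in zip(probs, race))
                grad[k] += race[w][k] - expected   # winner's value minus the field's expected value
        beta = [b + lr * g / len(races) for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data: each horse is [speed, jockey], both scaled to 0..1.
races = [
    [[0.9, 0.2], [0.5, 0.5], [0.2, 0.3]],
    [[0.3, 0.8], [0.6, 0.1], [0.4, 0.4]],
    [[0.8, 0.6], [0.7, 0.7], [0.1, 0.2]],
    [[0.6, 0.9], [0.2, 0.1], [0.5, 0.3]],
    [[0.7, 0.5], [0.3, 0.4], [0.4, 0.2]],
    [[0.2, 0.3], [0.8, 0.6], [0.5, 0.5]],
]
winners = [0, 0, 2, 0, 0, 1]   # index of the winning horse in each race

beta = fit(races, winners, n_features=2)
print("coefficients:", [round(b, 3) for b in beta])
print("log-likelihood at zero:", round(log_likelihood(races, winners, [0.0, 0.0]), 3))
print("log-likelihood fitted: ", round(log_likelihood(races, winners, beta), 3))
```

The data layout is the same long format as the fruit table: one row per alternative, grouped by race, with a marker for which alternative was chosen. Nothing in the fit refers to a horse's number or name.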
