Combining probabilities


podonne
05-27-2010, 11:39 PM
Greetings,

When estimating odds, I am creating a large set of "Characteristics" that indicate a horse has a certain chance of winning. Like "if the horse is this and this, he has an x% chance of winning". I apply that to a race to try to get a line.

Say I have a characteristic A that when a horse has it, the horse has a 15% chance of winning. I also have characteristic B that says a horse will have a 10% chance of winning. Now a horse in an upcoming race has both A and B, so, how do I calculate his odds?

Probability says I should calculate a third characteristic (horses with both A and B), but often the sample size is too small to be of any use. Is there any shorthand way of doing this? Take the smaller probability? The larger? Average them? Multiply them?

Any thoughts are appreciated...

podonne

Robert Goren
05-27-2010, 11:51 PM
If they are truly independent, it would be A^B=B(1-A)+A or in this case 0.1(1-0.15) +0.15=0.235=23.5%
I know the ^ is not the right sign, but I could not remember how to make my keyboard produce the right one.
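For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same formula. The function name is made up for illustration, and it assumes the events are truly independent. Note that it gives the chance that at least one of the events happens, which is not necessarily the win chance of a horse that has both characteristics.

```python
from math import prod

def prob_at_least_one(probs):
    """P(at least one event occurs), ASSUMING the events are independent.

    This is 1 minus the chance that none of them occurs. For two events
    it is the same as B*(1 - A) + A.
    """
    return 1.0 - prod(1.0 - p for p in probs)

two = prob_at_least_one([0.15, 0.10])
three = prob_at_least_one([0.15, 0.10, 0.05])

print(round(two, 5))    # 0.235
print(round(three, 5))  # 0.27325
```

The same function takes any number of factors, so the list can simply grow.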

podonne
05-28-2010, 12:00 AM
If they are truly independent, it would be A^B=B(1-A)+A or in this case 0.1(1-0.15) +0.15=0.235=23.5%

Hi Robert.

Sorry to jump on you, but it's difficult for me to accept that a combination of two factors can result in a higher probability than the originals.

Also, maybe to probe deeper without dismissing your idea, what if there are three factors A=0.15, B=0.10, and C=0.05?

Thanks,
podonne

markgoldie
05-28-2010, 12:09 AM
RG's math may indeed be correct. However, the chances of these characteristics being truly independent are virtually nil, since this would mean that the twin characteristics are the only ones that have relevance to a horse's chance of winning. And naturally, as you might add in more characteristics, the probability of winning would be impossibly high.

It would seem that the best way to approach the question would be to average the numbers. That way, adding additional factors, as you may develop them, will not lead to unrealistically high percentages.

Robert Goren
05-28-2010, 12:59 AM
In horse racing I really doubt that anything is truly independent of anything else. As a non-horse example: if you have 2 sets of dice and roll both of them, what is the chance that at least one of them shows a seven? It is the same problem. There are several ways to get to it, but the answer is 11/36.

Robert Goren
05-28-2010, 01:08 AM
Hi Robert.

Sorry to jump on you, but it's difficult for me to accept that a combination of two factors can result in a higher probability than the originals.

Also, maybe to probe deeper without dismissing your idea, what if there are three factors A=0.15, B=0.10, and C=0.05?

Thanks,
podonne

It is not a combination of the two factors, but a one-or-the-other-or-both problem. The answer for the 3 factors is 0.27325. At least that is the way I read it, but it is late and I could be wrong.

Overlay
05-28-2010, 02:03 AM
Greetings,

When estimating odds, I am creating a large set of "Characteristics" that indicate a horse has a certain chance of winning. Like "if the horse is this and this, he has an x% chance of winning". I apply that to a race to try to get a line.

Say I have a characteristic A that when a horse has it, the horse has a 15% chance of winning. I also have characteristic B that says a horse will have a 10% chance of winning. Now a horse in an upcoming race has both A and B, so, how do I calculate his odds?

Probability says I should calculate a third characteristic (horses with both A and B), but often the sample size is too small to be of any use. Is there any shorthand way of doing this? Take the smaller probability? The larger? Average them? Multiply them?

Any thoughts are appreciated...

podonne

I would suggest working with impact values (rather than percentages) for the characteristics that you're using, if you have the data in that form, or can reconstruct it.

kenwoodall2
05-28-2010, 04:32 AM
Average the numbers.

formula_2002
05-28-2010, 05:42 AM
to quote from a Cole Porter tune...EXPERIMENT!

stu
05-28-2010, 07:37 AM
If they are truly independent, do what Bob says. For example, blue-eyed hemophiliacs.

But if they are even remotely dependent, then you need to collect stats for the combined category. For example, given turf sprints, there is no way to mathematically combine turf win percentage and sprint win percentage and model them correctly. For every Linda Rice, who probably has a higher TS percentage than her T or S percentage, there is a trainer who has a percentage that is less than either component.

Two variable statistics are where the astute handicapper can get an edge. Another obvious example is to find the percentage of a trainer for debut runners routing. Using numbers out of the air most trainers are about 8% first time starter and about 8% routes but combine for a whopping approx. 1% for debut routers.

Robert Goren
05-28-2010, 10:33 AM
The problem is getting a sample size on trainers large enough to mean anything. The introduction of designer drugs in the horse training business has skewed a lot of numbers. It is now a matter of who has gotten the latest brand of good dope before everyone else does. JMO

SchagFactorToWin
05-28-2010, 11:41 AM
If they are truly independent, it would be A^B=B(1-A)+A or in this case 0.1(1-0.15) +0.15=0.235=23.5%
I know the ^ is not the right sign, but I could not remember how to make my keyboard produce the right one.

Nice. Does this formula have a name in statistics? Also, what would be the formula for 3 factors?

TrifectaMike
05-28-2010, 01:06 PM
In horse racing I really doubt that anything is truly independent of anything else. As a non-horse example: if you have 2 sets of dice and roll both of them, what is the chance that at least one of them shows a seven? It is the same problem. There are several ways to get to it, but the answer is 11/36.

Duh, I is confused, because you are confusing

Before I comment further can you explain your non horse example.

Mike

podonne
05-28-2010, 01:27 PM
If they are truly independent, do what Bob says. For example, blue-eyed hemophiliacs.

But if they are even remotely dependent, then you need to collect stats for the combined category. For example, given turf sprints, there is no way to mathematically combine turf win percentage and sprint win percentage and model them correctly. For every Linda Rice, who probably has a higher TS percentage than her T or S percentage, there is a trainer who has a percentage that is less than either component.

Two variable statistics are where the astute handicapper can get an edge. Another obvious example is to find the percentage of a trainer for debut runners routing. Using numbers out of the air most trainers are about 8% first time starter and about 8% routes but combine for a whopping approx. 1% for debut routers.

Based on Stu's example, I can see where maybe there is no shorthand answer. If A and B are not independent you truly cannot combine them and make any sense if the number of entries with A&B is too small.

Perhaps an alternative suggestion? What if we ranked the factors based on sample size, and only used the one that has the highest sample size (i.e. the one we are most confident in)?

formula_2002
05-28-2010, 01:39 PM
Based on Stu's example, I can see where maybe there is no shorthand answer. If A and B are not independent you truly cannot combine them and make any sense if the number of entries with A&B is too small.

Perhaps an alternative suggestion? What if we ranked the factors based on sample size, and only used the one that has the highest sample size (i.e. the one we are most confident in)?

THE OPERATIVE WORD IS CONFIDENT.

You get that by analyzing the data. There is no short cut.
Get the data, and determine the confidence levels at incremental odds for different combinations of factors and then test it on out of sample data...
Then do it all over again.
Once you think you have a good model, don't bet it, just post about 500 picks in the PA selections forum..I guarantee you will be ahead of the game :) , either mentally or financially.. EXPERIMENT!!
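One way to put a number on "confident" is a confidence interval around an observed win rate. The sketch below uses the Wilson score interval, which is my choice for illustration and not something named in the thread; the win and start counts are invented.

```python
from math import sqrt

def wilson_interval(wins, starts, z=1.96):
    """Wilson score interval for a win rate (z=1.96 is roughly 95% coverage)."""
    if starts <= 0:
        raise ValueError("need at least one start")
    p = wins / starts
    denom = 1.0 + z * z / starts
    centre = (p + z * z / (2.0 * starts)) / denom
    half = z * sqrt(p * (1.0 - p) / starts + z * z / (4.0 * starts * starts)) / denom
    return centre - half, centre + half

# Hypothetical samples: both show a 15% win rate, but the sizes differ a lot.
small = wilson_interval(3, 20)
large = wilson_interval(150, 1000)

print("20 starts:   %.3f to %.3f" % small)
print("1000 starts: %.3f to %.3f" % large)
```

The small sample gives a much wider range, which is the practical reason a two-factor category with few entries is hard to trust.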

gm10
05-28-2010, 01:59 PM
I use a multinomial logit model for this. It takes any X-variables as input and produces a winning probability for each horse. And a winning probability is of course the reciprocal of (odds + 1).

It's fairly easy in your case, have a look at wikipedia for the formula.
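As a rough sketch of what the multinomial logit formula does: each horse gets a score (a weighted sum of its factors), and the scores are turned into probabilities that sum to 1 across the race. The factor values and coefficients below are invented for illustration; real coefficients would have to be estimated from past races.

```python
from math import exp

def score(features, coefficients):
    """Weighted sum of one horse's factor values."""
    return sum(c * x for c, x in zip(coefficients, features))

def win_probabilities(scores):
    """Multinomial logit: exp(score) for each horse divided by the race total."""
    top = max(scores)                        # subtract the max for numerical stability
    weights = [exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def fair_odds(p):
    """Odds-to-1 implied by probability p, i.e. p = 1 / (odds + 1)."""
    return 1.0 / p - 1.0

# Hypothetical race: three horses, two factors each, made-up coefficients.
coefficients = [0.8, -0.3]
horses = [[1.2, 0.5], [0.4, 0.1], [0.9, 1.5]]

probs = win_probabilities([score(h, coefficients) for h in horses])
for p in probs:
    print("p = %.3f, fair odds = %.2f-1" % (p, fair_odds(p)))
```

Because every horse's probability depends on the scores of the others, the field as a whole always adds up to 100%.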

Cratos
05-28-2010, 02:15 PM
Greetings,

When estimating odds, I am creating a large set of "Characteristics" that indicate a horse has a certain chance of winning. Like "if the horse is this and this, he has an x% chance of winning". I apply that to a race to try to get a line.

Say I have a characteristic A that when a horse has it, the horse has a 15% chance of winning. I also have characteristic B that says a horse will have a 10% chance of winning. Now a horse in an upcoming race has both A and B, so, how do I calculate his odds?

Probability says I should calculate a third characteristic (horses with both A and B), but often the sample size is too small to be of any use. Is there any shorthand way of doing this? Take the smaller probability? The larger? Average them? Multiply them?

Any thoughts are appreciated...

podonne


I am not sure what you are attempting to do because the probability of a horse winning a race based on handicapping or influences of prediction is different from the odds on the toteboard.

The odds on the toteboard is the payoff form of game theory and is driven by the collective risk profiles of the bettors and those profiles are not necessarily driven by handicapping.

Robert Goren
05-28-2010, 02:23 PM
I use a multinomial logit model for this. It takes any X-variables as input and produces a winning probability for each horse. And a winning probability is of course the reciprocal of (odds + 1).

It's fairly easy in your case, have a look at wikipedia for the formula.

That ought to do a really good job of confusing him. I think I lost him last night and anything I post will just make things worse. These things are not easily explained to people who are not used to working with them. Or at least I have never had any luck in trying to do so.

TrifectaMike
05-28-2010, 02:44 PM
That ought to do a really good job of confusing him. I think I lost him last night and anything I post will just make things worse. These things are not easily explained to people who are not used to working with them. Or at least I have never had any luck in trying to do so.

Maybe if you try to explain to a novice like myself, I can explain it to him.

But before you get too technical, can you explain your non horse, dice thing to me.

Please help me understand. Share the knowledge.

Mike

gm10
05-28-2010, 03:10 PM
That ought to do a really good job of confusing him. I think I lost him last night and anything I post will just make things worse. These things are not easily explained to people who are not used to working with them. Or at least I have never had any luck in trying to do so.

it's really not difficult
the name of the model is more complicated than the formula!

podonne
05-28-2010, 04:02 PM
it's really not difficult
the name of the model is more complicated than the formula!

Haha, mathematicians don't make the best marketers, that's for sure.

Without getting into what I do or do not know or can or cannot understand, I was looking for a simple way to combine them, and it doesn't look like there is one. It would be complicated (writing it from scratch) to calculate collinearity for all the factors, which you'd want to do for any regression.

gm10, I thought a multinomial regression was designed to produce nominal results, as opposed to a likelihood. What do you do to get to a probability as the dependent variable? Thanks for the suggestion.

Robert Fischer
05-28-2010, 04:30 PM
Probability says I should calculate a third characteristic (horses with both A and B), but often the sample size is too small to be of any use. Is there any shorthand way of doing this? Take the smaller probability? The larger? Average them? Multiply them?

Assuming you have the power of observation, and are correct that both factors are better than 1 alone, you can assume a slightly greater probability with both characteristics combined than the greater of the two individual characteristic probabilities.
Of course that is sometimes difficult, and as you say if sample sizes are small, you may see the probabilities "play out" differently than expected.

Robert Goren
05-28-2010, 04:51 PM
Maybe if you try to explain to a novice like myself, I can explain it to him.

But before you get too technical, can you explain your non horse, dice thing to me.

Please help me understand. Share the knowledge.

Mike

Well, here goes. You have 2 pairs of dice, a red pair and a green pair. You roll the red pair first. One time in six you get a 7; five times in six you don't. Since you already have the 7, you don't worry about the times you rolled a 7 for the time being. Now, in the five times in six that you don't get a 7, you roll the green pair. Again you get a 7 one time in six. So when you roll both the red and green pairs, you get a 7 five times in thirty-six (5/6 times 1/6). Now you go back to the times you only rolled the red pair (1/6 = 6/36) and add them to the times you rolled both sets: 6/36 plus 5/36 equals 11/36.
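The 11/36 figure can be checked by brute force. This short Python sketch (variable names are just for illustration) counts every possible outcome of the red pair and the green pair and compares the count with the 6/36 + 5/36 reasoning above.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
hits = 0
total = 0

# Enumerate all 6^4 outcomes: two red dice and two green dice.
for r1, r2, g1, g2 in product(faces, repeat=4):
    total += 1
    if r1 + r2 == 7 or g1 + g2 == 7:   # at least one pair totals 7
        hits += 1

by_count = Fraction(hits, total)
# Red pair shows 7, plus (red pair misses) times (green pair shows 7).
by_formula = Fraction(1, 6) + Fraction(5, 6) * Fraction(1, 6)

print(by_count, by_formula)  # 11/36 11/36
```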

HUSKER55
05-28-2010, 05:22 PM
if either can happen you add and if both have to happen you multiply

favorable outcomes / total chances of "A" (plus / times) favorable outcomes / total chances of "B"


It is my belief that until you have a significant number of races in your base, the outcome will be volatile.

if 'a' and 'b' =6/12 if either can happen the probability is 1 (12/12)

if both have to happen the probability is 1/4 (36/144)

I think that is right

good luck
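A small sketch of the add/multiply rules, for comparison. Plain adding for "either" is exact only when the two events can never happen together; if they are independent instead, the overlap has to be subtracted. The function names are made up for illustration.

```python
from fractions import Fraction

def either_mutually_exclusive(a, b):
    """P(A or B) when A and B can never both happen: just add."""
    return a + b

def either_independent(a, b):
    """P(A or B) when A and B are independent: add, then remove the overlap."""
    return a + b - a * b

def both_independent(a, b):
    """P(A and B) when A and B are independent: multiply."""
    return a * b

half = Fraction(6, 12)
print(either_mutually_exclusive(half, half))  # 1
print(either_independent(half, half))         # 3/4
print(both_independent(half, half))           # 1/4
```

So with two 6/12 chances, "either" comes to 1 only in the mutually exclusive case; for independent events it is 3/4.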

Njalle
05-28-2010, 05:35 PM
Haha, mathematicians don't make the best marketers, that's for sure.

Basically, most people use statistics on the past performance of the horse and jockey to estimate win probabilities. They do it ad hoc. Evidently, you would need to watch hundreds of races in order to come up with ad hoc estimates of the win probability for a horse. Remembering this many races is hard for most people.
Another issue when estimating win probabilities ad hoc is that the number of data in each race is enormous. When you should estimate the win probability of a horse, you do not only have to use the data from that horse, but from all horses in the race. This, too, is hard for most people.

For these reasons I prefer a statistical/mathematical model to the ad hoc method. Like several other handicappers, some of whom are very successful, I use the multinomial logit model because of its appropriateness in horse racing. Basically, it weights each of the past performance factors by analyzing a number of races. The weights are used to calculate each horse's value and the functional form of the model then compares it to the other horses in the race and ultimately offers a probability estimate.

Let us apply this to the racetrack betting procedure. Most people have a pretty good idea of who the favourite is, but not how big a favourite he is. Similarly with the second favourite and so forth. Take a horse with a 10 pct. (true) winning chance - wagering when the final odds are above 10-1 yields a positive expected payoff. Now take a horse with a 15 pct. winning chance - this horse needs odds of just 7-1 or above in order to yield a positive expected payoff. I dare say that ad hoc estimation makes it almost impossible to tell a 10 pct. winning chance from a 15 pct. winning chance. This is not the case with a statistical model.
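The odds thresholds can be worked out directly. As a minimal sketch, assuming a simple win bet quoted as odds-to-1 and ignoring breakage and any other real-world adjustments, the exact break-even point is (1 - p) / p, which is 9-1 for a 10 pct. horse and about 5.67-1 for a 15 pct. horse. The function names are illustrative.

```python
def break_even_odds(p):
    """Odds-to-1 at which a win bet with true win chance p has zero expected profit."""
    return (1.0 - p) / p

def expected_profit(p, odds, stake=1.0):
    """Expected profit of a win bet: win odds*stake with chance p, else lose the stake."""
    return p * odds * stake - (1.0 - p) * stake

for p, offered in [(0.10, 10.0), (0.15, 7.0)]:
    print("p=%.2f  break-even %.2f-1  profit per unit at %.0f-1: %+.3f"
          % (p, break_even_odds(p), offered, expected_profit(p, offered)))
```

Both 10-1 on the first horse and 7-1 on the second sit above break-even, so the expected profit comes out positive in each case.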

So, while I agree with you that mathematicians may not be the best marketers per se, mathematical skills and knowledge about horse racing combined produce a very powerful weapon.

Overlay
05-28-2010, 05:55 PM
I would suggest working with impact values (rather than percentages) for the characteristics that you're using, if you have the data in that form, or can reconstruct it.

To elaborate, here's a previous post concerning a manner in which impact values could be used:

http://www.paceadvantage.com/forum/showthread.php?p=239228&highlight=formatting#post239228

Jeff P
05-28-2010, 07:13 PM
Valid statistical methods exist for combining multiple distinct probability estimates.

One such method that I've been working with this year is called a beta transformed linear opinion pool or BLP:
http://www.stat.washington.edu/research/reports/2008/tr543.pdf

I actually visited a nearby university and convinced (I had to spring for lunch and drinks) a math professor to play tutor long enough to get me through the math... so that I could code out some algorithms.
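For readers who want a feel for the idea, here is a rough sketch of a beta-transformed linear opinion pool as I understand it from the paper: take a weighted average of the individual probability estimates, then pass that average through a beta CDF. The weights and the alpha/beta values below are invented (in practice they are fitted to results), and the beta CDF is done by simple numerical integration, which is only reasonable for alpha and beta of 1 or more. This is not the code described in the post.

```python
from math import gamma

def beta_cdf(x, alpha, beta, steps=2000):
    """Beta CDF by midpoint-rule integration (rough; meant for alpha, beta >= 1)."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be between 0 and 1")
    norm = gamma(alpha) * gamma(beta) / gamma(alpha + beta)
    h = x / steps
    area = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        area += t ** (alpha - 1.0) * (1.0 - t) ** (beta - 1.0)
    return area * h / norm

def blp(probs, weights, alpha, beta):
    """Weighted average of the estimates, then a beta CDF transform."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    linear_pool = sum(w * p for w, p in zip(weights, probs))
    return beta_cdf(linear_pool, alpha, beta)

# With alpha = beta = 1 the transform does nothing: plain weighted average.
plain = blp([0.15, 0.10], [0.5, 0.5], 1.0, 1.0)
# Hypothetical fitted parameters, chosen only to show the transform at work.
shaped = blp([0.15, 0.10], [0.5, 0.5], 2.0, 2.0)
print(round(plain, 4), round(shaped, 4))
```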


posted by Formula 2002: THE OPERATIVE WORD IS CONFIDENT.

You get that by analyzing the data. There is no short cut.
Get the data, and determine the confidence levels at incremental odds for different combinations of factors and then test it on out of sample data...
Then do it all over again.

IMHO, the above quote is VERY relevant to this thread.

I found that taking my own version of a beta transformed linear opinion pool (BLP) and combining it with observations made about the data using factors not readily available to the public within incremental odds ranges enabled me to create a very clean (and useful) odds line.


-jp

.

TrifectaMike
05-28-2010, 09:24 PM
Well, here goes. You have 2 pairs of dice, a red pair and a green pair. You roll the red pair first. One time in six you get a 7; five times in six you don't. Since you already have the 7, you don't worry about the times you rolled a 7 for the time being. Now, in the five times in six that you don't get a 7, you roll the green pair. Again you get a 7 one time in six. So when you roll both the red and green pairs, you get a 7 five times in thirty-six (5/6 times 1/6). Now you go back to the times you only rolled the red pair (1/6 = 6/36) and add them to the times you rolled both sets: 6/36 plus 5/36 equals 11/36.

Are you a math teacher? I think you are. You have to be, because that is some confusing stuff.

Let me take a crack at it in layman's terms.

Instead of 7, which doesn't exist on a die, let's use 6.

And let's roll the dice. Four dice according to you.

The chance of 6 not appearing on the first die is

5/6

The chance of 6 not appearing on second die is

5/6

The chance of 6 not appearing on the third die is

5/6

The chance of 6 not appearing on the fourth die is

5/6

The chance of 6 not appearing with four dice is

5/6 x 5/6 x 5/6 x 5/6 which equals 625/1,296 or 48.23%

The chance of a 6 appearing with four dice is

671/1,296 or 51.77%

Obviously, this isn't as elegant as your explanation, but I think I get it.

Mike
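The four-dice numbers are easy to confirm with exact fractions. A minimal Python check:

```python
from fractions import Fraction

# Chance that none of four dice shows a 6, then the complement.
no_six = Fraction(5, 6) ** 4
at_least_one_six = 1 - no_six

print(no_six, at_least_one_six)  # 625/1296 671/1296
print("%.2f%% / %.2f%%" % (100 * float(no_six), 100 * float(at_least_one_six)))
```

The same "one minus the chance of no hits" step works for any number of independent tries.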

Robert Fischer
05-28-2010, 10:05 PM
when u have 2(or many) separate factors
they don't necessarily combine
to the same degree that they do individually.

sometimes math can't do it any better justice than good judgment


i guess it would be different if these types of things were paramount to your game. To me personally, this is secondary stuff, although i want to have a decent ballpark estimate.

Robert Goren
05-28-2010, 10:31 PM
Are you a math teacher? I think you are. You have to be, because that is some confusing stuff.

Let me take a crack at it in layman's terms.

Instead of 7, which doesn't exist on a die, let's use 6.

And let's roll the dice. Four dice according to you.

The chance of 6 not appearing on the first die is

5/6

The chance of 6 not appearing on second die is

5/6

The chance of 6 not appearing on the third die is

5/6

The chance of 6 not appearing on the fourth die is

5/6

The chance of 6 not appearing with four dice is

5/6 x 5/6 x 5/6 x 5/6 which equals 625/1,296 or 48.23%

The chance of a 6 appearing with four dice is

671/1,296 or 51.77%

Obviously, this isn't as elegant as your explanation, but I think I get it.

Mike

You got it. Not a math teacher, although for a time I was a math major in college, but then I majored in about everything at one time or another. ;)

gm10
05-29-2010, 08:54 AM
Haha, mathematicians don't make the best marketers, that's for sure.

Without getting into what I do or do not know or can or cannot understand, I was looking for a simple way to combine them, and it doesn't look like there is one. It would be complicated (writing it from scratch) to calculate collinearity for all the factors, which you'd want to do for any regression.

gm10, I thought a multinomial regression was designed to produce nominal results, as opposed to a likelihood. What do you do to get to a probability as the dependent variable? Thanks for the suggestion.

Multi-collinearity is not that hard to verify, just start with the correlation matrix of your X variables. There's plenty of ways to calculate that.

Multinomial logit is not a classical regression. Just have a look at wikipedia to see how the output becomes a probability. They explain it much better than I could.

It will take you 5 minutes in excel to program ... although the coefficient estimation is a lot harder ... but you can set your own coefficients initially (equal weights for example).
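As one way to build the correlation matrix outside of Excel, here is a small pure-Python sketch. The factor names and values are invented for illustration; the third column is deliberately an exact multiple of the first, so it shows what a perfectly collinear pair looks like.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 values")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant column has no defined correlation")
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Correlation of every factor with every other factor, as nested dicts."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical X variables, one value per horse.
factors = {
    "speed_fig": [82, 75, 90, 68, 77, 85],
    "days_off": [14, 30, 21, 60, 45, 10],
    "speed_fig_x2": [164, 150, 180, 136, 154, 170],
}

matrix = correlation_matrix(factors)
for name, row in matrix.items():
    print(name, ["%.2f" % row[other] for other in factors])
```

Pairs with a correlation close to +1 or -1 are the ones to look at before putting both into the same model.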

formula_2002
05-29-2010, 09:56 AM
Valid statistical methods exist for combining multiple distinct probability estimates.

One such method that I've been working with this year is called a beta transformed linear opinion pool or BLP:
http://www.stat.washington.edu/research/reports/2008/tr543.pdf

I actually visited a nearby university and convinced (I had to spring for lunch and drinks) a math professor to play tutor long enough to get me through the math... so that I could code out some algorithms.


posted by Formula 2002: IMHO, the above quote is VERY relevant to this thread.

I found that taking my own version of a beta transformed linear opinion pool (BLP) and combining it with observations made about the data using factors not readily available to the public within incremental odds ranges enabled me to create a very clean (and useful) odds line.


-jp

.
I read the TEXT only :) (I need to review my algebra and integration).
Having said that I come away with the feeling that the Beta in the BLP is a
statistical measurement of the statistics in the linear opinion pool, where the pool is the combined multiple distinct probability estimates, wrt actual results. Close?

teddy
05-29-2010, 10:39 AM
When I run my spot plays of about 80, I will get the same horse on about 10 of them. That is a very strong angle. Meaning the horse had 10 great qualities that should be profitable. There is some overlap but the spot plays are all different in some way. My experience is that it is profitable but not hugely profitable as you would think. If it is confirmed as the post time favorite then it seems to be much stronger.

Robert Fischer
05-29-2010, 11:31 AM
other than verifying effect and correlation after the fact, judging the effects of different overlapping factors before essentially having enough significant data reminds me a little of cooking. :p

gm10
05-29-2010, 01:03 PM
other than verifying effect and correlation after the fact, judging the effects of different overlapping factors before essentially having enough significant data reminds me a little of cooking. :p

it is ... it's not a well-researched field at all
statisticians have only recently started providing methods for selecting these 'random forests'