PDA

View Full Version : How do I tell if an angle is "real"


podonne
06-28-2011, 09:54 PM
Greetings,

Let's imagine for a moment that I found an angle of the form "if the horse X has characteristic Y and is going off at odds better than Z, bet." I run this over my database for four years and the return on the angle is 3% per bet, and it happens fairly often, say a couple of hundred times a year. Part of me says, awesome, let's go make some money.

Then I pull up a time series of the bets per day over that time frame. In two of those years, years 1 and 3, the angle was positive, say 4%, but in the other two, years 2 and 4, it was negative. And it runs randomly through the year, sometimes up 300% in a month and down 150% the next. So now I'm worried... if I start betting now, maybe it goes down.

It's clear that the 3% number is hugely dependent on when I started and stopped my time series. Understanding probability, I might say that I should use a random sample instead of a time series, but you know, real life doesn't happen based on random samples.

So my question, esteemed handicappers, is, what statistical method can I use to determine if my angle is "real", or if it's just an artifact of the fact that I started and stopped my time series when I did?

Thanks,
Philip

cj
06-28-2011, 10:16 PM
Greetings,

Let's imagine for a moment that I found an angle of the form "if the horse X has characteristic Y and is going off at odds better than Z, bet." I run this over my database for four years and the return on the angle is 3% per bet, and it happens fairly often, say a couple of hundred times a year. Part of me says, awesome, let's go make some money.

Then I pull up a time series of the bets per day over that time frame. In two of those years, years 1 and 3, the angle was positive, say 4%, but in the other two, years 2 and 4, it was negative. And it runs randomly through the year, sometimes up 300% in a month and down 150% the next. So now I'm worried... if I start betting now, maybe it goes down.

It's clear that the 3% number is hugely dependent on when I started and stopped my time series. Understanding probability, I might say that I should use a random sample instead of a time series, but you know, real life doesn't happen based on random samples.

So my question, esteemed handicappers, is, what statistical method can I use to determine if my angle is "real", or if it's just an artifact of the fact that I started and stopped my time series when I did?

Thanks,
Philip

You really need split samples. Find angles on one database, but then test them on a new, untested sample of similar size. Even that is no guarantee, but it is better than what you did above by a lot. I know from experience.
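One way to picture that split-sample check is the rough Python sketch below. It is purely illustrative: the record layout and the qualifies() rule are invented, and in practice the angle would be discovered on the training half before the holdout half is ever looked at.

import random

def roi(bets):
    """Per-dollar profit on flat $1 win bets; each bet is a dict with 'won' and 'odds'."""
    if not bets:
        return 0.0
    profit = sum(b["odds"] if b["won"] else -1.0 for b in bets)
    return profit / len(bets)

def split_sample_roi(all_bets, qualifies, train_frac=0.5, seed=1):
    # Shuffle once, split, and report the angle's flat-bet ROI on each half.
    rng = random.Random(seed)
    shuffled = all_bets[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, holdout = shuffled[:cut], shuffled[cut:]
    return (roi([b for b in train if qualifies(b)]),
            roi([b for b in holdout if qualifies(b)]))

# Example (hypothetical fields): "characteristic Y present and odds better than 4-1"
# train_roi, holdout_roi = split_sample_roi(bets, lambda b: b["factor_y"] and b["odds"] > 4.0)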

Dave Schwartz
06-28-2011, 10:47 PM
Another thing to consider is the overall sample size and the odds on the horses.

First, a few hundred races pretty much means nothing in the scheme of things. The higher the odds, the more important the sample size becomes. In other words, after 500 bets it is much more difficult to be profitable with 5/2 horses than it is with 20/1 horses.

I had someone with a significant math background tell me once that the absolute minimum sample size of anything is 30. He went on to say that instead of insisting upon having 30 starters you should have a minimum of 30 winners! When one is building a system based upon 20/1 and above you have to handicap a lot of races to find those 30 winners!


Regards,
Dave Schwartz
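A rough sketch of why the odds matter so much for sample size: for flat $1 win bets at odds b that roughly break even (win probability near 1/(b+1)), the per-bet profit variance works out to about b, so the standard error of the ROI estimate after N bets is roughly sqrt(b/N). Higher odds mean a bigger standard error and therefore many more bets before the ROI figure means anything. The numbers below are illustrative only.

import math

def roi_standard_error(odds, n_bets):
    p = 1.0 / (odds + 1.0)             # break-even win probability
    variance = p * odds**2 + (1 - p)   # per-bet profit variance (equals odds at break-even)
    return math.sqrt(variance / n_bets)

for odds in (2.5, 5.0, 20.0):          # 5/2, 5/1, 20/1
    print(odds, round(roi_standard_error(odds, 500), 3))
# After 500 bets the ROI estimate is good to about +/- 7 cents per dollar at 5/2,
# but only about +/- 20 cents per dollar at 20/1 -- a small edge is invisible either way,
# and far more so at the higher price.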

podonne
06-28-2011, 11:14 PM
You really need split samples. Find angles on one database, but then test them on a new, untested sample of similar size. Even that is no guarantee, but it is better than what you did above by a lot. I know from experience.

I found this angle using a random sample of 66% as a training set and the remaining 33% as a testing set. The results were both positive, but different, slightly higher in the testing set than the training set.

I also tried a 3-year training set and a 1-year testing set, but that is the same as saying positive for 3 years and positive for one year, so it's not so different from just looking at the whole time series.

Here's an example time series, with real data, $100 to start and $10 a bet. Average odds of around 2.186/1 on 2,951 bets and a win rate of ~55%.

http://i51.tinypic.com/15q7s79.png

What worries me is that if I did this exact same analysis in Jan 2009, I would have noticed $100 go to $239 over the previous two years and ~1,500 bets and been excited. If I had started betting I would have lost about $10 by the end of 2009, despite how great the previous years had been.

Do the math again in Jan 2010 and I still see $100 go to $238 over three years, still pretty good. So I keep betting and my balance goes to $425.

So what does 2011 have in store? Or am I just worrying too much about this?

podonne
06-28-2011, 11:22 PM
Another thing to consider is the overall sample size and the odds on the horses.

First, a few hundred races pretty much means nothing in the scheme of things. The higher the odds, the more important the sample size becomes. In other words, after 500 bets it is much more difficult to be profitable with 5/2 horses than it is with 20/1 horses.

I had someone with a significant math background tell me once that the absolute minimum sample size of anything is 30. He went on to say that instead of insisting upon having 30 starters you should have a minimum of 30 winners! When one is building a system based upon 20/1 and above you have to handicap a lot of races to find those 30 winners!


Regards,
Dave Schwartz

Dave,

I certainly understand what you're saying about small sample sizes, but I think I have that covered. I just put up a more concrete example with 2,951 bets and a 55% win rate, so the sample sizes in any given year should be sufficient. I'm more concerned with the variance from one year to the next, and within one year, than specifically with the overall sample size being too small.

Time series make sample sizes even more confusing for me, because if I try to look at weeks, say by measuring the % of weeks that are profitable, a week is probably too small. Months may be sufficient, but the returns are all over the place. Years are definitely enough, but in this case 50% of my years are losers.

So there's my confusion. How do I look at something like this and say, to use a statistician's language, that there is a 95% probability that this is a winning angle...

Cheers,
Philip

podonne
06-28-2011, 11:26 PM
So there's my confusion. How do I look at something like this and say, to use a statistician's language, that there is a 95% probability that this is a winning angle...

Or even harder, how do I figure out exactly how much of an advantage the angle represents? If I try to express that as a % of expected value per bet, the number is all over the place depending on how I measure it, over what period of time, and over which particular set of random samples.

CBedo
06-29-2011, 04:10 AM
I don't have much to add that others haven't already contributed with regard to model verification, but I would add a couple of things about the series you presented.

1) Either the series you showed us isn't representative of what you have spoken about overall, or your math or records are off somewhere. A 55% win rate at odds of over 2/1 should give you a per-bet margin that is absolutely monstrous, significantly higher than 3%.

2) Aside from the math and whether or not the series is a true representation, just looking at it visually would give me some cause for hesitation (could just be my risk profile). In the timeframe you show, you have at least 6 drawdowns of 20+ bets (looks like 3 over 30+). It's no wonder that you might be worried about "when" you start the series. Qualitatively, the risk/reward curve seems unfavorable to me.

windoor
06-29-2011, 06:11 AM
Ask yourself this:

What happens in June, July, and August of the last three years?
Warmer weather? Different tracks? More turf races? Etc.

What is the difference between 2007 and the rest of the years?

Divide by Seven.

Regards,

Windoor

Tom
06-29-2011, 07:36 AM
The HTR Robot gives you a breakdown by week of your spot plays. I have a dynamite play for NYRA that lost money every week for the last 52.
If I only play weekends, it about breaks even, but if I only play Sundays, the ROI is 1.43 for the last year. No idea why, but I play Sundays without fail.

Robert Goren
06-29-2011, 07:44 AM
I don't have much to add that others haven't already contributed with regard to model verification, but I would add a couple of things about the series you presented.

1) Either the series you showed us isn't representative of what you have spoken about overall, or your math or records are off somewhere. A 55% win rate at odds of over 2/1 should give you a per-bet margin that is absolutely monstrous, significantly higher than 3%.

2) Aside from the math and whether or not the series is a true representation, just looking at it visually would give me some cause for hesitation (could just be my risk profile). In the timeframe you show, you have at least 6 drawdowns of 20+ bets (looks like 3 over 30+). It's no wonder that you might be worried about "when" you start the series. Qualitatively, the risk/reward curve seems unfavorable to me. If you had a win rate of 55% at odds of exactly 2/1, you would show an ROI of 65% on flat bets. In order to get an ROI of 3% you would need a really bad money management system. Something is rotten in Denmark.

Valuist
06-29-2011, 08:06 AM
The best professional gamblers are right about 55% of the time. And that's based on outcomes involving only 2 teams. I would be shocked if anyone could hit 55% winners in racing unless they were betting odds-on horses.

Robert Goren
06-29-2011, 08:16 AM
A 3% return with a 55% win rate would mean average odds of about 0.87 to 1. I believe it is highly possible for you to have found something that produces that kind of ROI at those odds. Study after study has shown that the lower the odds of the horses you wager on, the closer you come to breaking even. The thing you need to do is break it down into groups of 50 bets, then calculate the mean and SD of the profits of those groups. A simple Student's t-test of those numbers should give the answer you are looking for.
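A short sketch of that procedure, assuming profits holds the per-bet profits (+odds for a win, -1 for a loss) in time order; scipy's one-sample t-test does the last step.

import numpy as np
from scipy import stats

def blocked_t_test(profits, block_size=50):
    # Group per-bet profits into blocks of 50, then test the block means against zero.
    profits = np.asarray(profits, dtype=float)
    n_blocks = len(profits) // block_size
    blocks = profits[: n_blocks * block_size].reshape(n_blocks, block_size)
    block_means = blocks.mean(axis=1)
    t_stat, p_value = stats.ttest_1samp(block_means, popmean=0.0)
    return block_means.mean(), block_means.std(ddof=1), t_stat, p_value

# A positive mean with a small p-value (say below 0.05) is evidence the edge is real
# rather than an artifact of where the time series happens to start and stop.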

TrifectaMike
06-29-2011, 08:31 AM
Greetings,

Let's imagine for a moment that I found an angle of the form "if the horse X has characteristic Y and is going off at odds better than Z, bet." I run this over my database for four years and the return on the angle is 3% per bet, and it happens fairly often, say a couple of hundred times a year. Part of me says, awesome, let's go make some money.

Then I pull up a time series of the bets per day over that time frame. In two of those years, years 1 and 3, the angle was positive, say 4%, but in the other two, years 2 and 4, it was negative. And it runs randomly through the year, sometimes up 300% in a month and down 150% the next. So now I'm worried... if I start betting now, maybe it goes down.

It's clear that the 3% number is hugely dependent on when I started and stopped my time series. Understanding probability, I might say that I should use a random sample instead of a time series, but you know, real life doesn't happen based on random samples.

So my question, esteemed handicappers, is, what statistical method can I use to determine if my angle is "real", or if it's just an artifact of the fact that I started and stopped my time series when I did?

Thanks,
Philip

A very interesting question. I'm sure many of us can benefit from the answer.

Mike (Dr Beav)

Robert Goren
06-29-2011, 09:33 AM
The next question is about how you got your data. Did you search a bunch of factors in a database until you found something that worked? If that is the case, then you should search a different, unrelated database to make sure it produces the same results. If you search enough factors in a database, you are bound to find a statistical outlier. Getting it replicated in a different database is essential, especially with such a small ROI, before you start laying down your hard-earned cash.

Laplace
06-29-2011, 11:46 AM
Try stochastic interpolation procedures

Regards

Mike

Robert Fischer
06-29-2011, 01:37 PM
Greetings,

Let's imagine for a moment that I found an angle of the form "if the horse X has characteristic Y and is going off at odds better than Z, bet." I run this over my database for four years and the return on the angle is 3% per bet, and it happens fairly often, say a couple of hundred times a year. Part of me says, awesome, let's go make some money.

Then I pull up a time series of the bets per day over that time frame. In two of those years, years 1 and 3, the angle was positive, say 4%, but in the other two, years 2 and 4, it was negative. And it runs randomly through the year, sometimes up 300% in a month and down 150% the next. So now I'm worried... if I start betting now, maybe it goes down.

It's clear that the 3% number is hugely dependent on when I started and stopped my time series. Understanding probability, I might say that I should use a random sample instead of a time series, but you know, real life doesn't happen based on random samples.

So my question, esteemed handicappers, is, what statistical method can I use to determine if my angle is "real", or if it's just an artifact of the fact that I started and stopped my time series when I did?

Thanks,
Philip
I'll go ahead and throw my 2¢ down. This advice will primarily be based on getting money out of the pools and not on advanced statistics.
Let's imagine for a moment that I found an angle of the form "if the horse X has characteristic Y and is going off at odds better than Z, bet." I run this over my database for four years and the return on the angle is 3% per bet, and it happens fairly often, say a couple of hundred times a year. Part of me says, awesome, let's go make some money.
I'm imagining...
OK, the first thing you need to do is define your angle further, to the point where your definitions describe the real game.
The part of your angle which stands out as being most immediately in need of clarification is "and is going off at odds better than Z, bet."


At what post time are you measuring the odds (for example, "at odds better than Z with less than 1 minute to post")?
How does pool size affect your specific angle (for example, smaller tracks or pools could see dramatic late odds drops on favorites)?
Greetings,

Let's imagine for a moment that I found an angle of the form "if the horse X has characteristic Y and is going off at odds better than Z, bet." I run this over my database for four years and the return on the angle is 3% per bet, and it happens fairly often, say a couple of hundred times a year. Part of me says, awesome, let's go make some money.

Then I pull up a time series of the bets per day over that time frame. In two of those years, years 1 and 3, the angle was positive, say 4%, but in the other two, years 2 and 4, it was negative. And it runs randomly through the year, sometimes up 300% in a month and down 150% the next. So now I'm worried... if I start betting now, maybe it goes down.

It's clear that the 3% number is hugely dependent on when I started and stopped my time series. Understanding probability, I might say that I should use a random sample instead of a time series, but you know, real life doesn't happen based on random samples.

So my question, esteemed handicappers, is, what statistical method can I use to determine if my angle is "real", or if it's just an artifact of the fact that I started and stopped my time series when I did?

Thanks,
Philip

You've got 800 plays over 4 years and they've been giving you a 3% return.

The first, most important thing is a thorough insight. An angle is discovered by insight or by statistical discovery. An angle that will consistently bring you pool money is going to allow you to take out pool money because of certain truths. Your level of accurate insight will go a long way in estimating the probability of success with a given angle.
If you ask me for advice, a 3% angle that isn't rock-solid consistent month after month will require me to gain a profound insight. It is not enough for me to simply see whether "characteristic Y" is present; I need to know why "characteristic Y" indicates better performance than the public will estimate. And I'm talking about true insight, not superficial theory. This insight allows me to purify the angle: I can understand not only the macro level of the whole angle over the last 4 years, but I can also anticipate specific environments where the angle will yield a higher hit rate and/or a higher ROI. All angles should fluctuate up and down within their expected probable range. If the range is substantial (for example, if over a year and 200 plays you may arbitrarily lose), that is completely fine provided the hit% is rather low and the fluctuation you observe is consistent with what you would expect. However, if your hit% is rather high, 200 plays a year should generally be enough to show a consistent positive return.


As far as looking at graphs, I care far less about the intermittent wave fluctuations than about the recent trends. If the last 2 years have been losing, that worries me a lot more than an up-and-down pattern. Same for the last year, the last 6 months, etc...



Either way it is imperative that you understand the expected fluctuations and can verify that your estimates for the angle fall in line. When they do not, your ANGLE is fluctuating. When this is the case, you have to know how much, you have to know why, and so on.

You can gamble on any angle you want, and the more consistent an angle is and the higher its return, the less insight you might demand into it.
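One rough way to put numbers on that "expected fluctuation" idea is to treat a year of plays as a binomial draw and ask how often a genuinely positive angle would still post a losing year. The sketch below assumes flat $1 bets all at the same price, which is a simplification, and the example figures are made up.

import math
from scipy.stats import binom

def prob_losing_year(hit_rate, avg_odds, plays_per_year):
    # The year loses money when wins * avg_odds < losses, i.e. wins < plays / (avg_odds + 1).
    breakeven_wins = plays_per_year / (avg_odds + 1.0)
    max_losing_wins = math.ceil(breakeven_wins) - 1
    return binom.cdf(max_losing_wins, plays_per_year, hit_rate)

# e.g. a 12% hitter at 8/1 (roughly an 8% edge) over 200 plays:
print(prob_losing_year(0.12, 8.0, 200))   # a losing year is still quite likely
# a 55% hitter at 0.9/1 (roughly a 4.5% edge) over the same 200 plays:
print(prob_losing_year(0.55, 0.9, 200))   # noticeably less likely, but not negligible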

pondman
06-29-2011, 01:56 PM
Greetings,

Let's imagine for a moment that I found an angle of the form "if the horse X has characteristic Y and is going off at odds better than Z, bet." I run this over my database for four years and the return on the angle is 3% per bet, and it happens fairly often, say a couple of hundred times a year. Part of me says, awesome, let's go make some money.

You have to be careful when you talk statistics and sample size. If you were truly choosing a distinct variable such as the color of a marble, you'd have a structured problem, which could be scalable.

Are you sure there isn't a subset you've overlooked?

What happens when you raise your requirement upwards to 4-1?

Have you broken this down by conditions? You may have something under one condition and not another, which adds more complexity.

I personally couldn't bet 1,000 times to make $400.

pondman
06-29-2011, 02:02 PM
.
And I'm talking about true insight, not superficial theory.



I agree with your response. Maybe you could explain true insight as opposed to superficial theory.

Robert Goren
06-29-2011, 02:43 PM
You have to be careful when you talk statistics and sample size. If you were truly choosing a distinct variable such as the color of a marble, you'd have a structured problem, which could be scalable.

Are you sure there isn't a subset you've overlooked?

What happens when you raise your requirement upwards to 4-1?

Have you broken this down by conditions? You may have something under one condition and not another, which adds more complexity.

I personally couldn't bet 1,000 times to make $400. Most handicappers would be glad to be $400 ahead over their last 1,000 bets. With the last couple of years I have had, I know I would. No need for me to worry about the IRS lately.

JustRalph
06-29-2011, 02:50 PM
Great thread :ThmbUp:

Robert Fischer
06-29-2011, 03:38 PM
...explain true insight opposed to superficial theory.

With real insight you know why an angle works. You understand why the angle works on the track, and you understand why the angle works in the pools.
With superficial theory you just know that a variable is "good" and/or you just know that a variable "works" or is profitable.

With enough consistent profit, or high enough return, in a given set of past performance, you still may go ahead and gamble that the trend will continue forward, even if you really don't have a great insight.

If you do have an insight into the angle, you should be able to think of subsets that reflect those truths in an even more concentrated form, and you should see increases in hit rate or return when looking at those subsets. Regardless of whether you want to specialize in those subsets, they can offer a way to test your understanding.




A fictional example = ANGLE AQUEDUCTINNER1X = "if the horse X has characteristic Y (Y = posts 1-3 & top E1) and is going off at odds better than 6-1, bet."
Over the last 800 plays / 4 yrs, $2 return = $2.06

example of SUPERFICIAL THEORY = "InsideSpeed is good on the AQUEDUCT INNER :ThmbUp: :cool:".

example of TRUE INSIGHT = "Looking at the results, 'all' the return is coming from the 7f sprints. Looking at a bunch of 7f sprints, all the 7f sprints on the Aqueduct Inner start on the turn, so the inside speed has a huge advantage in them because of that. Looking at just the 7f sprints on the Aqueduct Inner, there are 400 plays over the last 4 years with a $2 ROI of $3.87..."

Cratos
06-29-2011, 04:48 PM
Greetings,

Let's imagine for a moment that I found an angle of the form "if the horse X has characteristic Y and is going off at odds better than Z, bet." I run this over my database for four years and the return on the angle is 3% per bet, and it happens fairly often, say a couple of hundred times a year. Part of me says, awesome, let's go make some money.

Then I pull up a time series of the bets per day over that time frame. In two of those years, years 1 and 3, the angle was positive, say 4%, but in the other two, years 2 and 4, it was negative. And it runs randomly through the year, sometimes up 300% in a month and down 150% the next. So now I'm worried... if I start betting now, maybe it goes down.

It's clear that the 3% number is hugely dependent on when I started and stopped my time series. Understanding probability, I might say that I should use a random sample instead of a time series, but you know, real life doesn't happen based on random samples.

So my question, esteemed handicappers, is, what statistical method can I use to determine if my angle is "real", or if it's just an artifact of the fact that I started and stopped my time series when I did?

Thanks,
Philip

While I truly enjoy posts on math and statistics, I see this thread as a continuation of a thread that was started by another poster, and some of those who lambasted him on that thread are now contributors to this one; the oddity and the contradictions in posting are amazing.

pondman
06-29-2011, 04:54 PM
With real insight you know why an angle works.

Not arguing against your point.

But what if I bet only shippers, often off a layoff? Can I really quantify these, when I have no idea why a lousy-looking horse at one track wins at another? I don't think anyone besides the connections knows what the motivation is behind the change in condition. Is this a real insight, other than that I hit 10-1+ shots every few weeks and at the end of the year I've got a pile of money?

I'd say the same about a horse off a maiden win. Or a cheap maiden claimer who gets beaten by a nose. These horses win at some tracks at better than 10-1, enough to do better than break even, especially if you have them on top of a tri-wheel.

I'm not, however, able to quantify these, because they are a concept rather than a concrete, measurable object. So I'm not able to do much except count them on my fingers and go crazy when some 50-1 shipper comes along.

PaceAdvantage
06-29-2011, 08:04 PM
While I truly enjoy posts on math and statistics, I see this thread as a continuation of a thread that was started by another poster, and some of those who lambasted him on that thread are now contributors to this one; the oddity and the contradictions in posting are amazing.Not sure of the point of posting something like this...it will only serve to derail the thread. Agree?

podonne
06-29-2011, 08:11 PM
The next question is about how you got your data. Did you search a bunch of factors in a database until you found something that worked? If that is the case, then you should search a different, unrelated database to make sure it produces the same results. If you search enough factors in a database, you are bound to find a statistical outlier. Getting it replicated in a different database is essential, especially with such a small ROI, before you start laying down your hard-earned cash.

To answer your question, I found it by searching a bunch of factors until something worked, which is probably a pretty common answer, and I'm well aware of the danger from unsupervised learning, which is why I'm being very skeptical and asking this question. I did split the database into 66% and 33% and found the factor on the 66% and verified on the 33%.

podonne
06-29-2011, 08:16 PM
You have to be careful when you talk statistics and sample size. If you were truly choosing a distinct variable such as the color of a marble, you'd have a structured problem, which could be scalable.

Are you sure there isn't a subset you've overlooked?

What happens when you raise your requirement upwards to 4-1?

Have you broken this down by conditions? You may have something under one condition and not another, which adds more complexity.

I personally couldn't bet 1,000 times to make $400.

Thanks for this suggestion, and since I've seen it a few times so far, I'll answer them all at once to save typing. :-) I can cut this data further and see a higher return, but naturally the sample size falls each time, so it gets less reliable.

By my reasoning, adding another factor to get a higher return doesn't really help me answer the question, since I can ask the same question about the new factor as about the previous one. It works from point-in-time A to point-in-time B, but how do I know it will work going forward, and how do I know how good it is?

podonne
06-29-2011, 08:21 PM
With real insight you know why an angle works. You understand why the angle works on the track, and you understand why the angle works in the pools.
With superficial theory you just know that a variable is "good" and/or you just know that a variable "works" or is profitable.

With enough consistent profit, or high enough return, in a given set of past performance, you still may go ahead and gamble that the trend will continue forward, even if you really don't have a great insight.

If you do have an insight into the angle, you should be able to think of subsets that reflect those truths in an even more concentrated form, and you should see increases in hit rate or return when looking at those subsets. Regardless of whether you want to specialize in those subsets, they can offer a way to test your understanding.

Robert,

With respect, I'm probably in the realm of the superficial and trying to get to the concrete, but without looking at the factors manually. I know that I can add factors and get subsets with a higher hit rate and return, but I'll just end up asking the original question about a different factor with a smaller sample size...

I suppose one way of asking what I mean is to ask how you would define "consistent profit" and "high enough return", and how you would precisely calculate profit and return in that context.

podonne
06-29-2011, 08:25 PM
Aside from the math and whether or not the series is a true representation, just looking at it visually would give me some cause for hesitation (could just be my risk profile). In the timeframe you show, you have at least 6 drawdowns of 20+ bets (looks like 3 over 30+). It's no wonder that you might be worried about "when" you start the series. Qualitatively, the risk reward curve seems unfavorable to me.

CBedo,

There are definitely a few runs that make me nervous, up and down. But I know enough about probability to know that runs, even long runs, are to be expected. Do you have a rule of thumb or a method for determining whether a run of a certain length or frequency is abnormal and might indicate that the angle isn't actually a good one? Or is it just a matter of intangibles like risk tolerance and "bet size vs. bankroll"?

Thanks

Cratos
06-29-2011, 08:29 PM
Not sure the point of posting something like this...it will only serve to derail the thread. Agree?

Hopefully that will not happen like the thread that “TrifectaMike” started. There isn’t any intent to derail this thread or any thread on this forum, because I enjoy open discussion, especially when it involves math, statistics, and physics as those disciplines relate to horse racing.

podonne
06-29-2011, 08:33 PM
A 3% return with a 55% win rate would mean average odds of about 0.87 to 1. I believe it is highly possible for you to have found something that produces that kind of ROI at those odds. Study after study has shown that the lower the odds of the horses you wager on, the closer you come to breaking even. The thing you need to do is break it down into groups of 50 bets, then calculate the mean and SD of the profits of those groups. A simple Student's t-test of those numbers should give the answer you are looking for.

I won't argue with your math; it's entirely possible I chicken-scratched the wrong figure. :-)

Your suggestion is excellent, but in the past I've struggled to come up with defensible parameters for a test based on small random subsets of the data (as in, parameters I didn't just pick out of the air because they seemed large enough).

- How many times would you recommend drawing those 50-bet samples in order to come up with a reasonably accurate estimate of the return?
- Is that 50 winners, or 50 bets?
- If I have a training and a testing set, should I be drawing 50-bet subsets from each set separately, or just from the whole group?

I ask because I use a SELECT statement with RAND() < x.x to generate a random subset of the population, and that takes a long time on large tables. It's much faster to do one SELECT with RAND() < 0.66 and then a SELECT ... NOT IN for the remaining 33%.

podonne
06-29-2011, 08:44 PM
While I truly enjoy posts on math and statistics, I see this thread as a continuation of a thread that was started by another poster, and some of those who lambasted him on that thread are now contributors to this one; the oddity and the contradictions in posting are amazing.

I've seen similar threads (I've been on here for a couple of years), but they never seem to resolve on an answer; maybe the conversation degrades, maybe emotion comes in, I'm not sure.

It seems to me that we should be able to arrive at an answer, maybe a checklist? An angle is good if it satisfies the following conditions:

- Includes at least 30 winners (Dave)
- Shows a positive ROI for all available data (maybe, in surveying you're never supposed to survey 100% of the population)
- Shows a mean positive ROI for 95% of 1000 50-sample subsets of all available data (Robert)
- Shows a positive ROI for both a training set of 66% of data and a testing set of 33% (how I found the angle)
- Doesn't show a run of more than 20 losing or winning bets over the entire period (CBedo)
- And a rule for determining which figure you should use when determining the advantage (I would lean toward Robert's)

(I just made those up from reading the thread, not saying that's the answer). I don't think that average odds really matter. They say something about the nature of the angle (longshots vs. favorites) but not whether it's 'good' or not. Same for winning pct (except maybe for determining sample sizes, but I think we'd all agree that 2,951 is enough samples for whatever odds or win pct you'd like).

jdhanover
06-30-2011, 12:23 AM
GREAT thread!


- Includes at least 30 winners (Dave)
- Shows a positive ROI for all available data (maybe, in surveying you're never supposed to survey 100% of the population)
- Shows a mean positive ROI for 95% of 1000 50-sample subsets of all available data (Robert)
- Shows a positive ROI for both a training set of 66% of data and a testing set of 33% (how I found the angle)
- Doesn't show a run of more than 20 losing or winning bets over the entire period (CBedo)
- And a rule for determining which figure you should use when determining the advantage (I would lean toward Robert's)


This set cannot be correct for all angles. Issue really is what kind of angle is it?

Let's say Angle A is one that hits really long longshots (>20-1) 6% of the time. I have no doubt you would hit some 20-race losing streaks, and some bad 50-race sets. Note you would need 500 races or so to get the 30 winners.
Contrast with Angle B where you hit 50% at average odds of 6/5. Here you might satisfy all the above criteria.

But Angle A is still very viable. It's just that most of us would struggle with such a low hit rate.

Dave covers this concept, and an approach to mitigate the issue with an angle like Angle A, in his Basics Of Winning class btw (highly recommended)

Robert Goren
06-30-2011, 01:03 AM
To answer your question, I found it by searching a bunch of factors until something worked, which is probably a pretty common answer, and I'm well aware of the danger from unsupervised learning, which is why I'm being very skeptical and asking this question. I did split the database into 66% and 33% and found the factor on the 66% and verified on the 33%. Did you split the data before you began the first search? If you did that, then you are probably safe from it being an outlier. If I were you, I'd be looking for a good ADW with good rebates and start wagering for real. The trick now will be to develop a money management system to both grow and protect your bankroll. If you choose Kelly, be very careful to use it right. There is a lot of wrong information out there on how to use the formula, including what is in Poundstone's otherwise good book.

podonne
06-30-2011, 12:46 PM
This set cannot be correct for all angles. Issue really is what kind of angle is it?

Let's say Angle A is one that hits really long longshots (>20-1) 6% of the time. I have no doubt you would hit some 20-race losing streaks, and some bad 50-race sets. Note you would need 500 races or so to get the 30 winners.
Contrast with Angle B where you hit 50% at average odds of 6/5. Here you might satisfy all the above criteria.


Good point. 20 is definitely not a static number given the winning pct. Doing some Excel math, I came up with an equation to give the expected number of times that a losing run of a given length will occur:

Let W = % of winners, N = the length of the losing run you are interested in, T = the time frame (or number of trials) you have, and E = the expected number of times that a losing run will occur in that time frame;

E = T * (1-W)^N

So for the example above, with a 50% win rate, looking at runs of 20 or more over 1000 entries.

1000*(1-0.50)^20 = 0.000953674

or roughly never. So you could use that equation, and if you found runs of length N or longer more than E times, you would reject the angle.
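The same rule of thumb as a couple of lines of Python, with the worked example above. Note that it counts every possible starting position of an N-bet losing window, so it slightly overstates the count of distinct runs; fine as a rough screen.

def expected_losing_runs(win_rate, run_length, n_bets):
    # E = T * (1 - W)^N from the post above
    return n_bets * (1.0 - win_rate) ** run_length

print(expected_losing_runs(0.50, 20, 1000))   # ~0.00095, i.e. "roughly never"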

podonne
06-30-2011, 01:02 PM
Did you split the data before you began the first search? If you did that, then you are probably safe from it being an outlier. If I were you, I'd be looking for a good ADW with good rebates and start wagering for real. The trick now will be to develop a money management system to both grow and protect your bankroll. If you choose Kelly, be very careful to use it right. There is a lot of wrong information out there on how to use the formula, including what is in Poundstone's otherwise good book.

I did split the data before the search. But that leads to the second part of the question, how do you determine the edge. In straight betting it is enough to know that an angle is positive, but Kelly requires you to know how positive in order to calculate the optimum bet size.

I was thinking this through last night on the plane, if I took 1000 50-bet samples and calculated the mean and standard deviation of the results, I might say that the edge is (mean - 2 standard deviations) to be super safe.

But I keep struggling with how to get at the 50 number or the 1000 number. The bigger the number 50 is (let's call it "trials"), the lower the standard deviation, so increasing it is preferable. But as "trials" approaches the total number of available bets (as in, if the subset is the whole data set), the standard deviation drops to zero and we're back where we started, applying the angle to the entire data set. So we need a number of trials that is large enough to yield a reasonably small standard deviation, but small enough that we're operating on a subset of the data and not the whole thing.

And I have no idea how to arrive at that number.

TrifectaMike
06-30-2011, 03:53 PM
I did split the data before the search. But that leads to the second part of the question, how do you determine the edge. In straight betting it is enough to know that an angle is positive, but Kelly requires you to know how positive in order to calculate the optimum bet size.

I was thinking this through last night on the plane, if I took 1000 50-bet samples and calculated the mean and standard deviation of the results, I might say that the edge is (mean - 2 standard deviations) to be super safe.

But I keep struggling with how to get at the 50 number or the 1000 number. The bigger the number 50 is (let's call it "trials"), the lower the standard deviation, so increasing it is preferable. But as "trials" approaches the total number of available bets (as in, if the subset is the whole data set), the standard deviation drops to zero and we're back where we started, applying the angle to the entire data set. So we need a number of trials that is large enough to yield a reasonably small standard deviation, but small enough that we're operating on a subset of the data and not the whole thing.

And I have no idea how to arrive at that number.

1. Why not just apply the Binomial Theorem (Binomial Distribution) for what you are seeking?

2. You can calculate an estimate of consecutive losses over N samples from the following:

Consecutive Losses = Ln(N)/-Ln(1-P)

where
N = number of races (trials) in the sequence
P = probability of a win
Ln = the natural log (base e)

or you can also try the following

How often will one experience a sequence of n losses given a win%, p?

Number Races = 1/(1-p)^n

Here's another way...

The probability of n losses is (1-p)^n

Consecutive losses = log((1-p)^n), where the log is to base (1-p)

It is as precise as the win % given.

Mike (Dr Beav)
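A small sketch of those two estimates in Python; the win probability and streak length plugged in below are only illustrative (roughly podonne's figures from earlier in the thread).

import math

def expected_longest_losing_streak(n_bets, win_prob):
    """Ln(N) / -Ln(1-P): rough length of the longest losing run expected in N bets."""
    return math.log(n_bets) / -math.log(1.0 - win_prob)

def races_per_losing_streak(win_prob, streak_length):
    """1 / (1-p)^n: on average, one losing run of this length per this many bets."""
    return 1.0 / (1.0 - win_prob) ** streak_length

print(expected_longest_losing_streak(2951, 0.47))   # ~12-13 straight losses somewhere in the sample
print(races_per_losing_streak(0.47, 10))            # roughly one 10-bet skid every ~570 bets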

Tom
06-30-2011, 07:28 PM
- Shows a mean positive ROI for 95% of 1000 50-sample subsets of all available data (Robert)

You don't plan on betting for a long time, huh?

TrifectaMike
06-30-2011, 07:55 PM
1. Why not just apply the Binomial Theorem (Binomial Distribution) for what you are seeking.

2. You can calculate an estimate of consecutive losses over N samples from the following:

Consecutive Losses = Ln(N)/-Ln(1-P)

where
N = number runs in a sequence
P = probability of a win
Ln = the natural log (base e)

or you can also try the following

How often will one experience a sequence of n losses given a win%, p?

Number Races = 1/(1-p)^n

Here's another way...

The probability of n losses is (1-p) ^n

Consecutive losses = log((1-p)^n) where log is to base (1-p)

It is as precise as the the win % given.

Mike (Dr Beav)

Disregard the part below. In order to use it you would have to fix the odds of a losing streak; then you could compute the maximum losing streak as the log of the losing-streak odds to the base of the loss probability. I'm not sure if you require that information.

Here's another way...

The probability of n losses is (1-p) ^n

Consecutive losses = log((1-p)^n) where log is to base (1-p)

It is as precise as the win % given.

Cratos
06-30-2011, 07:56 PM
1. Why not just apply the Binomial Theorem (Binomial Distribution) for what you are seeking.

2. You can calculate an estimate of consecutive losses over N samples from the following:

Consecutive Losses = Ln(N)/-Ln(1-P)

where
N = number runs in a sequence
P = probability of a win
Ln = the natural log (base e)

or you can also try the following

How often will one experience a sequence of n losses given a win%, p?

Number Races = 1/(1-p)^n

Here's another way...

The probability of n losses is (1-p) ^n

Consecutive losses = log((1-p)^n) where log is to base (1-p)

It is as precise as the the win % given.

Mike (Dr Beav)

Mike, I like your suggestion of the Bernoulli distribution, but do you think analysis of variance (ANOVA), to test whether or not the means of the groups are all equal, would be a good statistical application for this problem?

TrifectaMike
06-30-2011, 08:09 PM
Mike, I like your suggestion of the Bernoulli distribution, but do you think analysis of variance (ANOVA), to test whether or not the means of the groups are all equal, would be a good statistical application for this problem?

ANOVA can work. I was about to suggest a Bayesian approach, viewing it as a multiple comparisons problem. Then my eyes began to bleed.

Mike (Dr Beav)

dansan
07-01-2011, 10:32 AM
when you have more winning tickets than losing ones :lol:

podonne
07-01-2011, 01:18 PM
I was thinking this through last night on the plane, if I took 1000 50-bet samples and calculated the mean and standard deviation of the results, I might say that the edge is (mean - 2 standard deviations) to be super safe.

Robert,

This notion of pulling a large number of subsets doesn't seem to work.

My angle shows a 46.78% win rate over 2,950 entries at average odds of 2.186. To get 50 winners I needed to pull sets of 107 each time (50/0.4678)

Over 1000 iterations I pulled 107 randomly chosen entries and calculated the total return as ((odds*wonflag)-count(*)). Note that this is not the ROI, but since all the sample sizes were equal it doesn't matter. That gave me 1000 data points to show the dollar return from betting on each of those 107 sample subsets.

The problem is that the standard deviation was ENORMOUS. Here are the stats from the Excel data analysis add-in:


Mean 1.272565315
Standard Error 0.352258674
Median 1.024084687
Standard Deviation 11.13939734
Sample Variance 124.0861731
Kurtosis 0.215769659
Skewness 0.020975815
Range 84.64238703
Minimum -44.32224566
Maximum 40.32014138

The mean being $1.27 is reasonable, but the standard deviation of $11.1? By that notion the angle couldn't be 'verified' unless the mean return was > $22.2 over 107 entries. The maximum was only $40.3. It just seems like too high a bar to hit.

Just for kicks I increased the subset size upward from 107 until the standard deviation became more reasonable, and it didn't get there until I was pulling 2,360 each time, 80% of the total sample. That's a big jump from the 3.6% of the sample needed to pull 107.

So either I'm totally misunderstanding this, the test we are talking about doesn't work, or the angle isn't good. Any ideas?

Thanks,
Philip

TrifectaMike
07-01-2011, 03:17 PM
Robert,

This notion of pulling a large number of subsets doesn't seem to work.

My angle shows a 46.78% win rate over 2,950 entries at average odds of 2.186. To get 50 winners I needed to pull sets of 107 each time (50/0.4678)

Over 1000 iterations I pulled 107 randomly chosen entries and calculated the total return as ((odds*wonflag)-count(*)). Note that this is not the ROI, but since all the sample sizes were equal it doesn't matter. That gave me 1000 data points to show the dollar return from betting on each of those 107 sample subsets.

The problem is that the standard deviation was ENORMOUS. Here are the stats from the Excel data analysis add-in:


Mean 1.272565315
Standard Error 0.352258674
Median 1.024084687
Standard Deviation 11.13939734
Sample Variance 124.0861731
Kurtosis 0.215769659
Skewness 0.020975815
Range 84.64238703
Minimum -44.32224566
Maximum 40.32014138

The mean being $1.27 is reasonable, but the standard deviation of $11.1? By that notion the angle couldn't be 'verified' unless the mean return was > $22.2 over 107 entries. The maximum was only $40.3. It just seems like too high a bar to hit.

Just for kicks I increased the subset size upward from 107 until the standard deviation became more reasonable, and it didn't get there until I was pulling 2,360 each time, 80% of the total sample. That's a big jump from the 3.6% of the sample needed to pull 107.

So either I'm totally misunderstanding this, the test we are talking about doesn't work, or the angle isn't good. Any ideas?

Thanks,
Philip

Philip,

The first question that has to be answered: Is this streaky pattern "real" or is it just a byproduct of Bernoulli chance variation?

You can answer this question by means of a Bayes factor. Suppose we partition the total races into groups of 20 (this number is not fixed) races. We observe counts w1, ..., wn, where wi is the number of wins in the ith group. Suppose wi is binomial(20, pi) where pi is the probability of a win in the ith period.

You define two hypotheses:

H1 (not streaky) the probabilities across periods are equal, p1 = ... = pn = p

H2 (streaky) the probabilities across periods vary according to a beta distribution (The domain of the Beta distribution can be viewed as a probability, since it is defined on the interval (0,1) ) with mean zeta and precision K. This model is indexed by the parameter K.

You can determine support for streaky or not streaky -- use of the log Bayes factor determines which is more likely from the data.

First, I would suggest some reading on the Beta distribution, so that you can get an insight on why the Beta distribution.

Second, some reading on Bayes factor.

Third, use a statistical package for your computations. I believe a free package like R can do the computations.

Good luck (I do believe you need to answer this question first)

Mike (Dr Beav)

You appear to be ready to invest in this angle. I wouldn't proceed without an answer to the "streakiness" question.

podonne
07-01-2011, 05:19 PM
Philip,

The first question that has to be answered: Is this streaky pattern "real" or is it just a byproduct of Bernoulli chance variation?

You can answer this question by means of a Bayes factor. Suppose we partition the total races into groups of 20 (this number is not fixed) races. We observe counts w1, ..., wn, where wi is the number of wins in the ith group. Suppose wi is binomial(20, pi) where pi is the probability of a win in the ith period.

You define two hypotheses:

H1 (not streaky) the probabilities across periods are equal, p1 = ... = pn = p

H2 (streaky) the probabilities across periods vary according to a beta distribution (The domain of the Beta distribution can be viewed as a probability, since it is defined on the interval (0,1) ) with mean zeta and precision K. This model is indexed by the parameter K.

You can determine support for streaky or not streaky -- use of the log Bayes factor determines which is more likely from the data.

First, I would suggest some reading on the Beta distribution, so that you can get an insight on why the Beta distribution.

Second, some reading on Bayes factor.

Third, use a statistical package for your computations. I believe a free package like R can do the computations.

Good luck (I do believe you need to answer this question first)

Mike (Dr Beav)

You appear to be ready to invest in this angle. I wouldn't proceed without an answer to the "streakiness" question.

Mike,

I've spent a few hours refreshing myself on Bayesian statistics, but unfortunately, other than a general recognition of the process you describe and the terms being used, I am at a loss as to how to implement this in code.

I've attempted to learn R before, but it's a hell of a learning curve. :-) Plus, the end goal of my explorations is to implement this "checklist" as an unsupervised learning angle miner, so relying on a third-party program to perform the calculations won't help.

So instead I'll ask why you believe this question of "streakiness" is so important to evaluating the validity of an angle. If the goal is just to identify whether an angle is likely to be profitable going forward, streakiness would only be a temporary phenomenon and would introduce more variability. That perhaps has repercussions when calculating bet size proportional to bankroll, but does "streakiness" change the objective evaluation of the angle as valid and profitable?

On a related note, the process you describe characterizes a partition by its winning pct, but doesn't that ignore the odds? By that logic, an increase of $10 and a decrease of $1 are treated the same...

Thanks for your patience,
Philip

Robert Goren
07-01-2011, 05:40 PM
I did split the data before the search. But that leads to the second part of the question, how do you determine the edge. In straight betting it is enough to know that an angle is positive, but Kelly requires you to know how positive in order to calculate the optimum bet size.

Kelly's edge: take one divided by the odds plus one for each horse, i.e. 1/(odds+1). Square that number, then divide the total of the squares by the total of the 1/(odds+1) values. Now subtract that result from your win %; that gives you the edge. If you don't have too many high-odds horses, either among your winners or overall, you can get a reasonable approximation using W% - [1/(average odds + 1)]. You should get a positive number with either formula.
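A couple of lines to make that concrete: the first function is the shortcut approximation Robert gives, and the second is the standard single-bet Kelly fraction. The exact per-horse version above needs the full list of odds, which this sketch skips. (Plugging in podonne's earlier figures gives an edge far bigger than 3%, which echoes CBedo's point that those numbers don't quite square.)

def approx_edge(win_pct, avg_odds):
    # W% - 1/(average odds + 1), the shortcut approximation
    return win_pct - 1.0 / (avg_odds + 1.0)

def kelly_fraction(win_prob, odds):
    # Standard Kelly for a win bet at fractional odds b: f* = p - (1-p)/b
    return win_prob - (1.0 - win_prob) / odds

print(approx_edge(0.4678, 2.186))       # podonne's figures from earlier in the thread
print(kelly_fraction(0.4678, 2.186))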

classhandicapper
07-01-2011, 07:13 PM
There are formulas available that can be used to calculate the sample size you need to be confident that something is statistically significant with "X" degree of confidence (i.e., you can be 90% sure, 95% sure, etc., that what you are seeing is a real angle).

It has been a long time since I took probability and statistics courses so I can't remember how to do it, but it should be easy to find on the internet.
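For the curious, here is a back-of-the-envelope version of that sample-size formula, assuming flat $1 win bets at a single price and a normal approximation; z = 1.645 corresponds to one-sided 95% confidence.

import math

def bets_needed(edge, win_prob, odds, z=1.645):
    # Per-bet profit is (odds+1)*win - 1, so its std dev is (odds+1)*sqrt(p(1-p)).
    sigma = (odds + 1.0) * math.sqrt(win_prob * (1.0 - win_prob))
    return math.ceil((z * sigma / edge) ** 2)

# A 3% edge at roughly even money (p ~ 0.5, odds ~ 1.06):
print(bets_needed(0.03, 0.50, 1.06))   # on the order of 3,000+ bets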

TrifectaMike
07-01-2011, 08:51 PM
Philip,

The first question that has to be answered: Is this streaky pattern "real" or is it just a byproduct of Bernoulli chance variation?

You can answer this question by means of a Bayes factor. Suppose we partition the total races into groups of 20 (this number is not fixed) races. We observe counts w1, ..., wn, where wi is the number of wins in the ith group. Suppose wi is binomial(20, pi) where pi is the probability of a win in the ith period.

You define two hypotheses:

H1 (not streaky) the probabilities across periods are equal, p1 = ... = pn = p

H2 (streaky) the probabilities across periods vary according to a beta distribution (The domain of the Beta distribution can be viewed as a probability, since it is defined on the interval (0,1) ) with mean zeta and precision K. This model is indexed by the parameter K.

You can determine support for streaky or not streaky -- use of the log Bayes factor determines which is more likely from the data.

First, I would suggest some reading on the Beta distribution, so that you can get an insight on why the Beta distribution.

Second, some reading on Bayes factor.

Third, use a statistical package for your computations. I believe a free package like R can do the computations.

Good luck (I do believe you need to answer this question first)

Mike (Dr Beav)

You appear to be ready to invest in this angle. I wouldn't proceed without an answer to the "streakiness" question.

Philip,

I believe there is a misunderstanding, and it is most likely my fault. I don't want to hijack your thread. Since I don't have much time at the moment, I'd like to ask this question of everyone. It might be helpful to you and me...possibly to others too.

Do you understand the question I posed? Why I would ask such a question and more importantly the answer to such a question? And possibly what caused me to ask such a question?

Don't be shy. This is not a test.

Mike (Dr. Beav)

TrifectaMike
07-02-2011, 09:19 AM
Philip,

I believe there is a misunderstanding. And it is most likely my fault. I don't want to hijack your thread. Since, I don't have much time at the moment, I'd like to ask this question to all. This might be helpful to you and me...possibly to others too.

Do you understand the question I posed? Why I would ask such a question and more importantly the answer to such a question? And possibly what caused me to ask such a question?

Don't be shy. This is not a test.

Mike (Dr. Beav)

As I assumed. I am speaking to myself. You guys continue, I'll just read.

Mike (Dr Beav)

podonne
07-02-2011, 01:38 PM
Philip,

I believe there is a misunderstanding. And it is most likely my fault. I don't want to hijack your thread. Since, I don't have much time at the moment, I'd like to ask this question to all. This might be helpful to you and me...possibly to others too.

Do you understand the question I posed? Why I would ask such a question and more importantly the answer to such a question? And possibly what caused me to ask such a question?

Don't be shy. This is not a test.

Mike (Dr. Beav)

Mike,

Not a hijacking at all. If streakiness is a concern, then it's an item on the "checklist" to tell whether an angle is valid. So well worth the discussion. I just couldn't figure out how to calculate it according to your instructions.

I'll hazard a guess, though. As an extreme, if there were 1,000 races in a time series and the first 501 were winners and the next 499 were losers, all at even odds, that would show a slightly positive return over the 1,000 races. But obviously that pattern would be cause for great concern.

I think I understand a bit more now. You're saying that when I pull 1000 sample sets of 50 samples each, the Bayes factor comparing those two hypotheses will tell us whether it's more likely that the variation in sample means is equal (meaning steady) or unequal (meaning streaky).

I understand how to calculate the bayes factor for the first hypothesis, which (from wikipedia) for a sample size of 200 with 115 successes and 85 failures would be
http://upload.wikimedia.org/math/6/f/0/6f09ea127880cd7d0426c66a84ace80d.png

I just don't know how to calculate the second equation in code:
http://upload.wikimedia.org/math/c/1/9/c1948629d7a39d88920acca964f96713.png
to calculate the bayes factor for a hypothesis with a distribution in the equation.

I found some software called Infer.NET which seems to let you use distributions as variables, so I'm going to try that and see what happens.

Also, you refer to a beta distribution defined by zeta and K, what values do I use for zeta and K? I assume zeta is 115/200, but K?

Thanks,
Philip

TrifectaMike
07-03-2011, 11:09 AM
Mike,

Not a hijacking at all. If streakiness is a concern, then it's an item on the "checklist" to tell whether an angle is valid. So well worth the discussion. I just couldn't figure out how to calculate it according to your instructions.

I'll hazard a guess, though. As an extreme, if there were 1000 races in a time series and the first 501 were winners and the next 499 were losers, all at even odds, that would show a slightly positive return for the hundred races at n=1000. But obviously that pattern would be cause for great concern.

I think I understand a bit more now. You're saying that when I pull 1000 sample sets of 50 samples each, the ratio of the bayes factor of those two hypothesis will tell us whether its more likley that the variation in sample means is equal (meaning steady) or unequal (meaning streaky).

I understand how to calculate the bayes factor for the first hypothesis, which (from wikipedia) for a sample size of 200 with 115 successes and 85 failures would be
http://upload.wikimedia.org/math/6/f/0/6f09ea127880cd7d0426c66a84ace80d.png

I just don't know how to calculate the second equation in code:
http://upload.wikimedia.org/math/c/1/9/c1948629d7a39d88920acca964f96713.png
to calculate the bayes factor for a hypothesis with a distribution in the equation.

I found some software called Infer.Net which seems let you use distributions as variables, so I'm going to try that one and see what happens.

Also, you refer to a beta distribution defined by zeta and K, what values do I use for zeta and K? I assume zeta is 115/200, but K?

Thanks,
Philip

Good. I believe I have your attention and you are partially getting it. Let's put aside the actual computations and look at what we are attempting to determine. We'll take a step back. (This may be confusing to some, but it is well worth trying to understand. It is a powerful technique, and it has many applications in horse racing.)

We defined two hypotheses:

H1 (not streaky): the probabilities across periods are equal, p1 = ... = pn = p
(The consistent model)

H2 (streaky): the probabilities across periods vary according to a beta distribution (the domain of the beta distribution can be viewed as a probability, since it is defined on the interval (0,1)) with mean zeta and precision K (written m and k below). This model is indexed by the parameter K.
(The streaky model)

We would like to determine which model is better supported by the data. (In Bayes terms these models are our givens)

Let's look at the data (I'm going to take some liberties here; this is not a research paper). Suppose we group the win/loss data into bins of a given size; here we used 20 races. We then have grouped win/loss data s1, ..., sn, where si is the number of successes in the ith period. Now, suppose si is distributed binomial with probability of success pi. The consistent model, denoted by H1, states that the winning probability is a constant value p over the entire sample.

H1 : p1 = p2 = p3 = ... = pn = p (Recall the binned data: one p per bin)

To complete the model, we assume that constant value p has a uniform prior.

Now, let's move on to the streaky model, denoted by H2. The streaky model says that the winning probabilities over the sample (the pi) vary according to a beta density of the form (don't get lost in the complexity of the function f(p); instead, concentrate on what we are attempting to determine)

f(p) = [1/B(km, k(1-m))] * p^(km-1) * (1-p)^(k(1-m)-1), 0 < p < 1

B(km, k(1-m)) is the beta function (the normalizing constant)

In this beta density, m is the mean and k is a precision parameter.
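(In the standard Beta(a, b) parametrization this is a = km and b = k(1-m), so the mean works out to m and the variance to m(1-m)/(k+1); a small k spreads the pi widely, which is the streaky case, while a large k pins them near m.)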

We'll get back to the parameter k. Let's go on to the Bayes Factor.

Bayes Factor

BF = p(s|H1)/p(s|H2) ....(1)

where p(s|H1) is the predictive density given model H1 and
p(s|H2) is the predictive density given model H2. This Bayes Factor in (1) is defined in support of the consistent model over the streaky model.

I'll stop here.

Mike (Dr Beav)

Overlay
07-03-2011, 03:18 PM
So my question, esteemed handicappers, is, what statistical method can I use to determine if my angle is "real", or if its just an artifact of the fact that I started and stopped my time series when I did?
I haven't read through the entire thread, but didn't Quirin cover this from a statistical standpoint in Appendix A of Winning at the Races, where he talked about the tests that he used to distinguish positive and negative attributes (the kind that he designated with a single or double asterisk, respectively) that would continue into the future (barring some fundamental change in racing conditions, such as the introduction of a new type of running surface) from results that, although they might appear positive or negative, did not meet the statistical criteria for that degree of confidence in future performance? I believe it had to do with the extent to which the horses possessing each particular characteristic over- or under-performed the winning percentage associated with the odds at which they went off.

classhandicapper
07-03-2011, 03:52 PM
I find the advanced math being discussed here both very interesting (because I think I understand what you guys are getting at) and incomprehensible (because it's outside my range). :D

The thing is, I'm not sure you are answering the original question (assuming I even understand it).

I think the issue is one of sample size and not volatility.

The volatility of the results will be related to the probability of success for any individual play. If you are picking 50% winners your results aren't going to be as volatile as if you pick 10%. I'm not sure you need to know much beyond that.

There are formulas for determining the sample size you need to get to "X%" level of confidence depending on the win%. I think that's what he's getting at and what he needs.

Robert Goren
07-03-2011, 08:26 PM
Robert,

This notion of pulling a large number of subsets doesn't seem to work.

My angle shows a 46.78% win rate over 2,950 entries at average odds of 2.186. To get 50 winners I needed to pull sets of 107 each time (50/0.4678)

Over 1000 iterations I pulled 107 randomly chosen entries and calculated the total return as ((odds*wonflag)-count(*)). Note that this is not the ROI, but since all the sample sizes were equal it doesn't matter. That gave me 1000 data points to show the dollar return from betting on each of those 107 sample subsets.

The problem is the standard deviation was ENORMOUS. Here are the stats from the Excel data add-in:


Mean 1.272565315
Standard Error 0.352258674
Median 1.024084687
Standard Deviation 11.13939734
Sample Variance 124.0861731
Kurtosis 0.215769659
Skewness 0.020975815
Range 84.64238703
Minimum -44.32224566
Maximum 40.32014138

The mean being $1.27 is reasonable, but the standard deviation of $11.1? By that notion the angle couldn't be 'verified' unless the mean return was > $22.2 over 107 entries. The maximum was only $40.3. It just seems like too high a bar to hit.

Just for kicks I increased the subset size upwards from 107 until the standard deviation became more reasonable, and it didn't until I was pulling 2,360 each time, 80% of the total sample. That's a big jump over 3.6% needed to hit 107.

So either I'm totally misunderstanding this, the test we are talking about doesn't work, or the angle isn't good. Any ideas?

Thanks,
Philip

I am not sure what you did. What were the mean and the SD of the number of winners in each sample? Do you have a couple of big winners and a whole bunch of below-average winners in your data? What percentage of the sample means were below the average sample mean? I don't know if it is really worth digging into. I really think you are safe in betting this, if your test sample gave you similar results as the data you used to come up with the idea in the first place. Good luck with your wagering.

trackrat59
07-03-2011, 08:57 PM
:eek: Math vs. Handicapping?

I don't understand any of this but it's interesting. I hope whatever you're all talking about pays off for someone. :)

TrifectaMike
07-04-2011, 02:59 AM
We'll get back to the parameter k.

k is fixed to a particular value. k can be viewed as a measure of streakiness: the smaller the value of k, the higher the level of streakiness. So, in effect, we fix the value of the precision parameter at, let's say, 100 to represent "true" streakiness or variation in the winning probabilities, and use the corresponding Bayes Factor to measure the streakiness in the data.

Let's go back to the integral equation you posted.

That is the correct expression for the marginal density of the consistent model. However, it would have to be computed for each "bin". The marginal density equals the

integral of the product

(ni choose xi) p^xi (1-p)^(ni-xi) ....(1)   I have used (ni choose xi) for the binomial coefficient, with ni and xi in place of your 200 and 115.

where ni in our example is 20 and xi is the number of wins.

For the streaky model we perform a similar operation except that we replace (1) with f(p) from the earlier post.

As you can see, there is some calculus involved. In the next post, I'll explain how to interpret the results.
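As a rough illustration only (the bin counts below are invented, and a simple midpoint-rule loop stands in for the calculus), the consistent-model marginal, i.e. the integral of the product above, could be evaluated numerically like this in VB.NET:

Module ConsistentMarginal
    ' log of the binomial coefficient C(n, k), built term by term to avoid overflow
    Function LogChoose(n As Integer, k As Integer) As Double
        Dim s As Double = 0
        For i = 1 To k
            s += Math.Log((n - k + i) / CDbl(i))
        Next
        Return s
    End Function

    Sub Main()
        Dim n As Integer = 20                        ' races per bin
        Dim wins() As Integer = {9, 11, 7, 12, 10}   ' hypothetical wins in each of five bins
        Dim steps As Integer = 10000
        Dim total As Double = 0
        For j = 0 To steps - 1
            Dim p As Double = (j + 0.5) / steps      ' midpoint rule over the uniform prior on p
            Dim logProd As Double = 0
            For Each x In wins                       ' product of the per-bin binomial terms at this p
                logProd += LogChoose(n, x) + x * Math.Log(p) + (n - x) * Math.Log(1 - p)
            Next
            total += Math.Exp(logProd)
        Next
        total /= steps
        Console.WriteLine("p(s|H1) is roughly " & total.ToString())
    End Sub
End Module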

Mike (Dr Beav)

podonne
07-04-2011, 04:19 AM
I find the advanced math being discussed here both very interesting (because I think I understand what you guys are getting at) and incomprehensible (because it's outside my range). :D

The thing is, I'm not sure you are answering the original question (assuming I even understand it).

I think the issue is one of sample size and not volatility.

The volatility of the results will be related to the probability of success for any individual play. If you are picking 50% winners your results aren't going to be as volatile as if you pick 10%. I'm not sure you need to know much beyond that.

There are formulas for determining the sample size you need to get to "X%" level of confidence depending on the win%. I think that's what he's getting at and what he needs.

The issue (I think) as to why more complex math is involved than the more common "average the payouts, get the standard deviation, use the z-score to get a confidence interval" approach is that the payouts from a series of system bets won't be a normally distributed variable. They can't be.

Picture a histogram in your mind of the payouts from a series of bets. A bunch will show up right on -$2.00 where your losing bets were. Then there will be zero results until you get to +$2.10, then an increasing number until you get to your average winning payout, then falling counts and trailing off as the odds get higher. The "average payout" is usually something like -$0.05.

Since the distribution is so weird, you can't just get the standard deviation, so no z-score and no confidence interval. I'm looking for another way.

It was suggested to repeatedly pick a subset of plays (like 20) and record the averages, then those averages will form a normal distribution and I can get right on truckin. They will, I tried it.

Problem is I can't find a relevant theory as to why this would work from a mathematical perspective. I find it hard to believe that you can take any distribution, no matter how wonky or skewed, repeatedly average a subset of 20 samples, then plot the averages, and magically have meaningful statistics. Without a theory I don't know that 20 is the right number, or how many times to "repeatedly" pull the samples, or whether to do it with or without replacement.

I hope someone tells me I'm making a mountain out of a molehill, but the more I look at this the more I think something simple like evaluating whether a system works is actually maddeningly complex.

podonne
07-04-2011, 04:21 AM
I am not sure what you did. What were the mean and the SD of the number of winners in each sample? Do you have a couple of big winners and a whole bunch of below-average winners in your data? What percentage of the sample means were below the average sample mean? I don't know if it is really worth digging into. I really think you are safe in betting this, if your test sample gave you similar results as the data you used to come up with the idea in the first place. Good luck with your wagering.

Actually I'll be perfectly honest and say that I screwed up there. I forgot to convert the standard deviation into a 95% confidence level.

But then this whole "payouts aren't normally distributed" thing came up and tossed me around, so I'm flailing again.

TrifectaMike
07-04-2011, 04:23 AM
Bayes Factor

BF = p(s|H1)/p(s|H2)

recall

H1 ; The Consistent Model
H2 ; The Streaky Model

Bayes factor is the summary of the evidence provided by the data in favor of one of the models as opposed to another.

Log(BF) (log is to the base 10)
0-.5 Not much difference in the models
.5-1 Substantial
1-2 Strong
>2 Decisive
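For instance (numbers invented purely for illustration): if p(s|H1) = 1 x 10^-4 and p(s|H2) = 5 x 10^-7, then BF = 200, Log(BF) is about 2.3, and on the scale above that is decisive support for the consistent model over the streaky one.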

As I have said, this stuff (Bayesian probability) may be a bit complex, but it is powerful. At a minimum you should be aware that a few, not many, are using these sophisticated techniques in horse racing.

Mike (Dr Beav)

P.S. I've come across a problem (solved by Bayes factor) that was given in a grad class that is very clever and rather simple to understand. I'll post the question and solution.

podonne
07-04-2011, 04:31 AM
I haven't read through the entire thread, but didn't Quirin cover this from a statistical standpoint in Appendix A of Winning at the Races, where he talked about the tests that he used to distinguish positive and negative attributes (the kind that he designated with a single or double asterisk, respectively) that would continue into the future (barring some fundamental change in racing conditions, such as the introduction of a new type of running surface) from results that, although they might appear positive or negative, did not meet the statistical criteria for that degree of confidence in future performance? I believe it had to do with the extent to which the horses possessing each particular characteristic over- or under-performed the winning percentage associated with the odds at which they went off.

Interesting. I don't have a copy of Winning at the Races, can you give us a brief note on the method he used, or maybe a link online to someone who looked into it in more detail?

From your post I would read it as (% of system bets that are winners)/(average implied probability of system bets that are winners)

Curious as to how he got to a quantifiable confidence level or got down to a $ or % edge.

TrifectaMike
07-04-2011, 05:28 AM
Problem

Here is a sequence of numbers:

-1,3,7,11

Predict what the next two numbers are likely to be, and infer what the underlying process was, that gave rise to the sequence.

If you said the prediction for the next two numbers is 15 and 19, with the explanation add 4 to the previous number. (Seems plausible to me).

But hold on. Someone in the back of the room is jumping up and down (Probably T'Mike jr). He says, I have another answer:

The next two numbers in the sequence are -19.9 and 1043.8. Mr. Teacher says, how did you arrive at those numbers?

Get the next number from the previous number, x, by evaluating
-x^3/11 + 9/11 x^2 + 23/11

Mr. Teacher (more out of amusement) says OK, let's see if that works. To the teacher's amazement it works. The alternate rule also fits the data.

Mr. Teacher is not happy. He is confronted with two answers and both are apparently correct.

So, we have two rules:

1. Add 4 to the previous number
2. Evaluate -x^3/11 + 9/11 x^2 + 23/11 and get the next number

Question: Is one rule more plausible than the other?

Think about this in the context of this thread. I'll show you how to answer this question in the next post.

Mike (Dr Beav)

TrifectaMike
07-04-2011, 07:43 AM
Problem

Here is a sequence of numbers:

-1,3,7,11

Predict what the next two numbers are likely to be, and infer what the underlying process was, that gave rise to the sequence.

If you said the prediction for the next two numbers is 15 and 19, with the explanation add 4 to the previous number. (Seems plausible to me).

But hold on. Someone in the back of the room is jumping up and down (Probably T'Mike jr). He says, I have another answer:

The next two numbers in the sequence are -19.9 and 1043.8. Mr. Teacher says, how did you arrive at those numbers?

Get the next number from the previous number, x, by evaluating
-x^3/11 + 9/11 x^2 + 23/11

Mr. Teacher (more out of amusement) says OK, let's see if that works. To the teacher's amazement it works. The alternate rule also fits the data.

Mr. Teacher is not happy. He is confronted with two answers and both are apparently correct.

So, we have two rules:

1. Add 4 to the previous number
2. Evaluate -x^3/11 + 9/11 x^2 + 23/11 and get the next number

Question: Is one rule more plausible than the other?

Think about this in the context of this thread. I'll show you how to answer this question in the next post.

Mike (Dr Beav)

Let's start by assigning labels to the two rules.

H1 : the sequence is an arithmetic progression, add n to the previous number

H2 : the sequence is generated by the function of the form ax^3 + bx^2 + c, where a, b and c are fractions.

Let's give H1 and H2 equal prior probabilities and concentrate on what the data have to say.

H1 is dependent on an added integer n and the first number in the sequence. Let's say that these numbers could have been anywhere between -50 and 50. Then since only the pair of values (n = 4, first number -1) gives rise to the data (-1,3,7,11), the probability of the data, given H1, is:

P(data|H1) = 1/101 * 1/101 = .00010

To evaluate P(data|H2), we must determine what values the fractions a, b and c might take on. A reasonable prior might be that for each fraction the numerator could be any number between -50 and 50, and the denominator any number between 1 and 50. As for the initial value in the sequence, we use the same probability distribution as in H1 (the 1/101).

Under this prior,

there are four ways of expressing the fraction a

a = -1/11 = -2/22 = -3/33 = -4/44

there are four ways of expressing the fraction b

b = 9/11 = 18/22 = 27/33 = 36/44

there are two ways of expressing the fraction c

c =23/11 = 46/22

So,

P(data|H2) = (1/101)*(4/101 * 1/50)*(4/101 * 1/50)*(2/101*1/50)

P(data|H2) = 0.0000000000025

Comparing P(data|H2) with P(data|H1), the odds in favor of P(data|H1) over P(data|H2), given the data sequence, are about 40 million to one!

Gotta love this Bayesian stuff. It is a window to a new world and much superior to the frequentist stuff.

Mike (Dr Beav)

Overlay
07-04-2011, 09:21 AM
Interesting. I don't have a copy of Winning at the Races, can you give us a brief note on the method he used, or maybe a link online to someone who looked into it in more detail?

From your post I would read it as (% of system bets that are winners)/(average implied probability of system bets that are winners)

Curious as to how he got to a quantifiable confidence level or got down to a $ or % edge.
Quirin used the "standard normal" test for this determination. He first calculated the rate at which horses at each toteboard odds level should win (given a 17% take, and dime breakage), according to the following formula:

Expected Winning Percentage = (1.00 - .17) / (Odds-to-$1.00 + 1.05)

Each horse possessing the handicapping characteristic being tested in a given sample of races was then assigned a value corresponding to the expected winning percentage (EW) associated with its odds. For example, the expected winning percentage (according to the above formula) of horses going off at toteboard odds of 4-1 works out to .83 / 5.05, or .164. These values were totaled for all the horses in the sample to derive the expected number of winners. That number was then compared to the actual number of winners (NW), and to the total number of horses (NH) in the sample exhibiting the characteristic being tested, using the following formula:


(NW - EW) / (EW * (1 - EW/NH))^(1/2)


(i.e., for the denominator, take the square root of the expression EW * (1 - EW/NH).) (I didn't have a square root symbol on my keyboard.)

Factors for which this equation works out to greater than +2.5 (for positive attributes) or less than -2.5 (for negative attributes) meet the criteria for being trends that figure to keep exhibiting the same positive or negative performance into the future. Factors that do not meet those criteria would not offer that same degree of confidence, despite results that might appear to be markedly positive or negative.

(I am basically restating Quirin here, as opposed to injecting any personal opinions or procedures.)
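For anyone who wants to run this test on a sample, here is a small sketch of the calculation as restated above (VB.NET; the odds array and winner count are made-up placeholders):

Module QuirinTest
    Sub Main()
        Dim odds() As Double = {4.0, 2.5, 6.0, 3.0, 9.0}   ' toteboard odds-to-$1 of the horses with the characteristic
        Dim actualWinners As Integer = 2                    ' NW: how many of them actually won
        Dim nh As Integer = odds.Length                     ' NH: number of horses in the sample

        Dim expectedWinners As Double = 0                   ' EW: sum of the per-horse expected winning percentages
        For Each o In odds
            expectedWinners += (1.0 - 0.17) / (o + 1.05)    ' 17% take, dime breakage
        Next

        ' Quirin's statistic: values above +2.5 (or below -2.5) meet the criterion described above
        Dim z As Double = (actualWinners - expectedWinners) / Math.Sqrt(expectedWinners * (1 - expectedWinners / nh))
        Console.WriteLine("EW = " & expectedWinners.ToString("F2") & ", Z = " & z.ToString("F2"))
    End Sub
End Module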

classhandicapper
07-04-2011, 11:41 AM
The issue (I think) as to why more complex math is involved than the more common "average the payouts, get the standard deviation, use the z-score to get a confidence interval" approach is that the payouts from a series of system bets won't be a normally distributed variable. They can't be.

Picture a histogram in your mind of the payouts from a series of bets. A bunch will show up right on -$2.00 where your losing bets were. Then there will be zero results until you get to +$2.10, then an increasing number until you get to your average winning payout, then falling counts and trailing off as the odds get higher. The "average payout" is usually something like -$0.05.

Since the distribution is so weird, you can't just get the standard deviation, so no z-score and no confidence interval. I'm looking for another way.

It was suggested to repeatedly pick a subset of plays (like 20) and record the averages, then those averages will form a normal distribution and I can get right on truckin. They will, I tried it.

Problem is I can't find a relevant theory as to why this would work from a mathematical perspective. I find it hard to believe that you can take any distribution, no matter how wonky or skewed, repeatedly average a subset of 20 samples, then plot the averages, and magically have meaningful statistics. Without a theory I don't know that 20 is the right number, or how many times to "repeatedly" pull the samples, or whether to do it with or without replacement.

I hope someone tells me I'm making a mountain out of a molehill, but the more I look at this the more I think something simple like evaluating whether a system works is actually maddeningly complex.

If you are talking about things like 1 or 2 big longshots distorting the results, I understand what you are talking about. Other than that, if the winners are generally clustered around the mean, I think more basic probability and statistics will be fine (unless you are writing a peer reviewed paper lol).

All you really want to know is whether the angle was profitable for a long series and at a certain sample size how confident you can be that it wasn't some random event as opposed to a legit value oriented angle. If you are concerned about the lack of randomness in the payouts, then simply insist on a higher degree of confidence before wagering to compensate.

One of the problems with stats is that the game is constantly changing.

Imagine a guy that uses just speed figures to play horses. He bets 10K per race (I've known guys like that). If he stops playing, it might change the values of every stat you calculated precisely because the distribution of money changed.

Imagine he's doing well and three of his friends start playing the same way, opposite impact.

I know it's difficult for mathematically oriented people to work in an imprecise changing world, but sometimes you have to worry about the dollars and forget the cents. ;)

I'm all for improving the math, but I think no matter what you are going to be faced with issues you can't address because the game is always changing because other players are also looking for value.

Overlay
07-04-2011, 12:25 PM
Quirin used the "standard normal" test for this determination. He first calculated the rate at which horses at each toteboard odds level should win (given a 17% take, and dime breakage), according to the following formula:

Expected Winning Percentage = (1.00 - .17) / (Odds-to-$1.00 + 1.05)

Each horse possessing the handicapping characteristic being tested in a given sample of races was then assigned a value corresponding to the expected winning percentage (EW) associated with its odds. For example, the expected winning percentage (according to the above formula) of horses going off at toteboard odds of 4-1 works out to .83 / 5.05, or .164. These values were totaled for all the horses in the sample to derive the expected number of winners. That number was then compared to the actual number of winners (NW), and to the total number of horses (NH) in the sample exhibiting the characteristic being tested, using the following formula:


(NW - EW) / (EW * (1 - EW/NH))^(1/2)


(i.e., for the denominator, take the square root of the expression EW * (1 - EW/NH).) (I didn't have a square root symbol on my keyboard.)

Factors for which this equation works out to greater than +2.5 (for positive attributes) or less than -2.5 (for negative attributes) meet the criteria for being trends that figure to keep exhibiting the same positive or negative performance into the future. Factors that do not meet those criteria would not offer that same degree of confidence, despite results that might appear to be markedly positive or negative.

(I am basically restating Quirin here, as opposed to injecting any personal opinions or procedures.)

I might add that, after Mike Nunamaker came out with Modern Impact Values (his more extensive update of Winning at the Races), I asked him whether he had employed similar testing on his data. He replied that the much larger (than Quirin's) sample size that he used (12,815 races involving a total of 111,700 horses) served to provide the same degree of confidence in the validity and durability of the results as the formal statistical tests that Quirin had performed on his smaller samples, without the necessity of actually subjecting the data in Modern Impact Values to the same tests that Quirin employed. (However, I don't know (nor did Nunamaker comment on) when a sample would be considered large enough to make that statement with confidence -- which may have been the point of your original question.)

(However, Nunamaker did indicate that he required a specific subcategory of races (broken down by running surface (dirt; turf [no artificial surfaces back then]); distance (sprint; route), maidens or non-maidens; and age (two-year-olds; three-year-olds and up)) to have at least 100 races before he felt that there was sufficient data on which to base published results. The only categories that did not meet this minimum sample size requirement were turf sprints (for horses of any age) and turf routes for two-year-olds. He therefore omitted those categories from his study.)

podonne
07-04-2011, 02:37 PM
If you are talking about things like 1 or 2 big longshots distorting the results, I understand what you are talking about. Other than that, if the winners are generally clustered around the mean, I think more basic probability and statistics will be fine (unless you are writing a peer reviewed paper lol).

Well, not writing one for sure, but I still want the reasoning to be correct. :-)

But to be clear, I'm not talking about longshots distorting the results, I'm talking about the losses distorting the results. The winners are clustered around the mean, but the losses stack up on -2, and a loss is just as valid as a win. What you're saying above is "when I win a bet, I expect the payout to be average +- CI", but that's not helpful for evaluating a system, because not all the results are winners. You'd just be describing the characteristics of the winning payouts, not evaluating the system. (CI means confidence interval.)

Though maybe we're on to something. What if you said that the result of the system is

p% of bets will win a value of
average_payout +- average_payout_confidence_interval,
and (1-p)% of bets win a value of -$2

Then could you say that the system as a whole will have a value of

(average_payout* p%) +- (average_payout_confidence_interval* p%)

Does that seem reasonable to anyone?

podonne
07-04-2011, 02:43 PM
One of the problems with stats is that the game is constantly changing.

Imagine a guy that uses just speed figures to play horses. He bets 10K per race (I've known guys like that). If he stops playing, it might change the values of every stat you calculated precisely because the distribution of money changed.

Imagine he's doing well and three of his friends start playing the same way, opposite impact.



That's less of an issue for non-parimutuel betting venues, like Betfair and Matchbook (RIP). There it's just a matter of who gets their bets in when.

You can also control for it by including a minimum odds requirement in the system description: bet on horses displaying a certain set of characteristics at odds better than 2/1. With that kind of system, as the angle becomes more widely known it should just decrease the number of plays you make, not the profitability of any given play.

podonne
07-04-2011, 07:22 PM
I may have hit upon something suggested in a statistics/probability forum called "bootstrapping". It's specifically designed to solve estimation problems where the distribution of the underlying variable is unknown, or not calculable. It's similar to the process that Robert was describing, with a few changes.

Basically you take a list of the results of a system's bets. Call the number of bets that the system makes over whatever time period you're looking at "n".

Once you have the list in an array, you choose n bets from that array (as in, the same number as in the array) such that each bet has an equal chance of being selected. But every time you "choose" a bet you put it back, so that it can be selected again. Then average those bets. Note that you can select an infinite number of items from the original n bets this way.

This is different from what I was talking about before because I was trying to do it via a select statement, "select where rand<0.xx", but that will select bets "without replacement" in the parlance.

So you select n bets and record the average a bunch of times. How many times, I don't know; the literature seems to suggest "as many times as computational time allows". At least 1,000; 10,000 is probably enough. Then you compute the mean of the averages and the standard deviation, and calculate a confidence interval.

Pseudocode looks like this (VB):
' needs Imports System.Linq (for Average) and System.Collections.Generic (for List)
Dim rng As New Random()
Dim means As New List(Of Double)
Dim payouts() As Double = {-2, 2.4, -2, 3.1} 'the list of payouts: -2 if the bet loses, +2.10 or higher when the bet wins
For x = 1 To 10000
    Dim samples As New List(Of Double)
    For y = 1 To payouts.Length
        samples.Add(payouts(rng.Next(payouts.Length))) ' choose payouts.Length items at random, with replacement
    Next y
    means.Add(samples.Average()) ' record the average of the resample you chose
Next x

'now you can use the 10,000 values in the means list to calculate the mean, standard deviation, and confidence interval.
'IMPORTANT: the standard deviation of these resample means already estimates the standard error of the average payout,
'so build the confidence interval directly from it (e.g. mean +/- 1.96 * SD); don't divide it again by a sample count.
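Following on from that sketch, the simplest way to read off an interval is the percentile method: sort the bootstrap means and take the 2.5% and 97.5% cut points (again just a sketch, and it assumes the means list built above):

means.Sort()
Dim avgPayout As Double = means.Average()
Dim lower As Double = means(CInt(0.025 * (means.Count - 1)))   ' 2.5th percentile of the bootstrap means
Dim upper As Double = means(CInt(0.975 * (means.Count - 1)))   ' 97.5th percentile
Console.WriteLine("average payout " & avgPayout.ToString() & ", 95% CI roughly " & lower.ToString() & " to " & upper.ToString())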



The difference from what we were talking about before is that we were choosing 30 samples without replacement; instead we're choosing the same number as the entire list, with replacement.

Thoughts?

Cratos
07-05-2011, 06:14 PM
One of the problems with stats is that the game is constantly changing.

I'm all for improving the math, but I think no matter what you are going to be faced with issues you can't address because the game is always changing because other players are also looking for value.

The strength of math and statistics is quantification and therefore with the correct postulations you should always be ahead of the "change curve."

Cratos
07-05-2011, 06:25 PM
We'll get back to the parameter k.

k is fixed to a particular value. k can be viewed as a measure of streakiness: the smaller the value of k, the higher the level of streakiness. So, in effect, we fix the value of the precision parameter at, let's say, 100 to represent "true" streakiness or variation in the winning probabilities, and use the corresponding Bayes Factor to measure the streakiness in the data.

Let's go back to the integral equation you posted.

That is the correct expression for the marginal density of the consistent model. However, it would have to be computed for each "bin". The marginal density equals the

integral of the product


(ni choose xi) p^xi (1-p)^(ni-xi) ....(1)   I have used (ni choose xi) for the binomial coefficient, with ni and xi in place of your 200 and 115.

where ni in our example is 20 and xi is the number of wins.

For the streaky model we perform a similar operation except that we replace (1) with f(p) from the earlier post.

As you can see, there is some calculus involved. In the next post, I'll explain how to interpret the results.

Mike (Dr Beav)

Mike, did I miss something about how "k" is derived? Its value is not revealed, although it has been defined.

To me "k" is the glue that holds everything together. Therefore, is it a calculated parameter or a given parameter?

TrifectaMike
07-05-2011, 07:17 PM
Mike, did I miss something about how "k" is derived? Its value is not revealed, although it has been defined.

To me "k" is the glue that holds everything together. Therefore, is it a calculated parameter or a given parameter?

The value of K is selected. Note that as K approaches infinity, the Streaky Model approaches the Consistent Model.

Generally we'd like to compute the log10 Bayes factor for a sequence of values of log 10 K.

Mike (Dr Beav)

Cratos
07-05-2011, 07:19 PM
The value of K is selected. Note that as K approaches infinity, the Streaky Model approaches the Consistent Model.

Generally we'd like to compute the log10 Bayes factor for a sequence of values of log 10 K.

Mike (Dr Beav)

Excellent, that makes sense

098poi
07-05-2011, 08:26 PM
Excellent, that makes sense

I mean no disrespect at all but this last exchange has to be the post of the year. I started laughing when I read it. Not laughing at it but laughing as a man on the verge of insanity who has just crossed the line into insanity.

CBedo
07-06-2011, 01:53 AM
Let's start by assigning labels to the two rules.

H1 : the sequence is an arithmetic progression, add n to the previous number

H2 : the sequence is generated by the function of the form ax^3 + bx^2 + c, where a, b and c are fractions.

Let's give H1 and H2 equal prior probabilities and concentrate on what the data have to say.

H1 is dependent on an added integer n and the first number in the sequence. Let's say that these numbers could have been anywhere between -50 and 50. Then since only the pair of values (n = 4, first number -1) gives rise to the data (-1,3,7,11), the probability of the data, given H1, is:

P(data|H1) = 1/101 * 1/101 = .00010

To evaluate P(data|H2), we must determine what values the fractions a, b and c might take on. A reasonable prior might be that for each fraction the numerator could be any number between -50 and 50, and the denominator any number between 1 and 50. As for the initial value in the sequence, we use the same probability distribution as in H1 (the 1/101).

Under this prior,

there are four ways of expressing the fraction a

a = -1/11 = -2/22 = -3/33 = -4/44

there are four ways of expressing the fraction b

b = 9/11 = 18/22 = 27/33 = 36/44

there are two ways of expressing the fraction c

c =23/11 = 46/22

So,

P(data|H2) = (1/101)*(4/101 * 1/50)*(4/101 * 1/50)*(2/101*1/50)

P(data|H2) = 0.0000000000025

Comparing P(data|H2) with P(data|H1), the odds in favor of P(data|H1) over P(data|H2), given the data sequence, are about 40 million to one!

Gotta love this Bayesian stuff. It is a window to a new world and much superior to the frequentist stuff.

Mike (Dr Beav)

I've been using Facebook too much I think, because I found myself looking for the "like" button.

Robert Goren
07-06-2011, 10:16 AM
I think Dr Beav has developed a man crush on a man that has been dead for 250 years.:rolleyes: Actually Bayes is pretty powerful stuff and it is a new line of thought on this board. And the good Doctor is getting better at explaining something that is very hard for the layman to grasp. He should be applauded for trying.:ThmbUp:

Robert Goren
07-06-2011, 10:21 AM
podonne, was wondering if you had done the edge calculation for the Kelly formula and how it came out?

TrifectaMike
07-06-2011, 01:45 PM
I think Dr Beav has developed a man crush on a man that has been dead for over 300 years.:rolleyes: Actually Bayes is pretty powerful stuff and it is a new line of thought on this board. And the good Doctor is getting better at explaining something that is very hard for the layman to grasp. He should be applauded for trying.:ThmbUp:

You are correct. This is not a simple subject to teach nor is it simple to grasp. In one of your posts you mentioned the Monty Hall problem. I commented that it was a trivial problem for a Bayesian and it is. However, after some thought I also realized that many people attempt to solve the problem in a less than formal manner.

Recalling that many mathematicians failed to give the correct answer, I should show the Bayesian approach to the problem. At minimum applying Bayes Theorem to the Monty Hall problem can show, by example, probabilistic reasoning.

Bayes Theorem

P(A|B) = P(B|A)P(A)/P(B)

Recall that B is the information that is given.

Let me write a variant of Bayes Formula that corresponds to not simply one piece of information, but two pieces of information.

Bayes Theorem Variant

P(A|B,C) = P(B|A,C)P(A|C)/P(B|C)

Monty Hall Problem

On a quiz show you are faced with three doors. A car is placed behind one of the three doors. The other two doors have no prize behind them. You are asked to choose a door.

Once you have chosen a door, the host selects a door which contains no prize (car). He opens the door and shows you that there is no car behind it. At this point he asks you if you would like to change your selection to the unopened door.

Unless you are a Bayesian thinker, you would reason that switching doesn't make any difference...two doors remaining (one you selected and one unselected). The odds are even that you have selected the car.

If you are a Bayesian thinker, you will immediately look toward any conditioning of the problem.

So, let's determine the probability that the unopened and unselected door contains the car given that we selected a door and the host opened a door.

We can choose Door A, B or C.

Let's choose Door A.

The host can open B or C.

Host opens Door B.

The car can be behind Door C or Door A.

The question is, should you switch Door A (the door you selected) for the unopened Door C?

Let's rewrite Bayes Theorem Variant in words to make it easier to follow.

P(Prize C|Choice A, Open B) = P(Open B|Choice A, Prize C)P(Prize C|Choice A)/P(Open B|Choice A )

A = Door A (No car)
B = Door B (No Car)
C = Door C (Car)

Now for the calculations:

P(Open B|Choice A, Prize C) = 1 The host never opens the door with the car.

P(Prize C|Choice A) = 1/3 The car is behind one of the doors and it doesn't matter which door one chooses.


P(Open B|Choice A) = 1/2 (averaging over where the car might be: 1/3 * 1/2 if it is behind A, 1/3 * 0 if behind B, 1/3 * 1 if behind C)

So,

P(Prize C|Choice A, Open B) = 1(1/3)/(1/2) = 2/3

The probability that the car is behind the door that hasn't been opened or chosen is 2/3. Therefore you should change your choice.
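If the algebra doesn't convince you, a brute-force check does the same job (a simulation sketch in VB.NET; the 100,000 trials and the door numbering are arbitrary):

Module MontyHallCheck
    Sub Main()
        Dim rng As New Random()
        Dim trials As Integer = 100000
        Dim switchWins As Integer = 0
        For t = 1 To trials
            Dim car As Integer = rng.Next(3)      ' door hiding the car: 0, 1 or 2
            Dim choice As Integer = rng.Next(3)   ' contestant's first pick
            ' the host then opens a no-prize door you didn't pick, so switching wins exactly when the first pick was wrong
            If choice <> car Then switchWins += 1
        Next
        Console.WriteLine("P(win by switching) is roughly " & (switchWins / CDbl(trials)).ToString("F3"))   ' about 0.667
    End Sub
End Module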

Mike (Dr Beav)

P.S. Dave Schwartz has developed a horse racing method based on the Monty Hall problem. You might want to inquire about it.

TrifectaMike
07-06-2011, 04:00 PM
I mean no disrespect at all but this last exchange has to be the post of the year. I started laughing when I read it. Not laughing at it but laughing as a man on the verge of insanity who has just crossed the line into insanity.

I'm laughing at you laughing. At least you're not seeing dead people (not yet) Come on back to the sane side.

Mike (Dr. Beav)

Robert Goren
07-06-2011, 04:05 PM
If you can get your mind working so that you understand the reasoning behind the Monty Hall problem, then the rest of Bayes falls into step. In my class on Bayes many moons ago, some students were as much in the dark after a semester as they were the first day.

classhandicapper
07-06-2011, 04:39 PM
Well, not writing one for sure, but I still want the reasoning to be correct. :-)

But to be clear, I'm not talking about longshots distorting the results, I'm talking about the losses distorting the results. The winners are clustered around the mean, but the losses stack up on -2, and a loss is just as valid as a win. What you're saying above is "when I win a bet, I expect the payout to be average +- CI", but that's not helpful for evaluating a system, because not all the results are winners. You'd just be describing the characteristics of the winning payouts, not evaluating the system. (CI means confidence interval.)

Though maybe we're on to something. What if you said that the result of the system is

p% of bets will win a value of
average_payout +- average_payout_confidence_interval,
and (1-p)% of bets win a value of -$2

Then could you say that the system as a whole will have a value of

(average_payout* p%) +- (average_payout_confidence_interval* p%)

Does that seem reasonable to anyone?

One of us is still missing the others point (probably me). :lol:

I don't think you need to worry about anything other than the win%, average payout, and sample size. That's enough to calculate the probability that the method is profitable with "X" degree of confidence.

You should try to accumulate as large a sample as possible all the time anyway.

The most important point is that the game is changing. Trainers are changing methods, racing secretaries are creating new classes, we recently added multiple new synthetic surfaces, sharp horse players are always writing books and educating the masses, more good information is available for sale, the breed is changing, drug laws change, the big money players change etc...

So you are always at risk that you are looking at stats that could be irrelevant or outdated the moment you find them, no matter how large the sample or how hard to study them.

Sometimes I make a bet off a sample of 1 or off no sample at all (seriously). For example, if a new trainer wins with his first starter and moves the horse way up, I'm immediately researching if he was an assistant for another move up trainer. I try to make a subjective estimate of the chances that this is going to be fairly typical. If you wait until you have a large sample, it's too late.

TrifectaMike
07-06-2011, 05:29 PM
One of us is still missing the others point (probably me). :lol:

I don't think you need to worry about anything other than the win%, average payout, and sample size. That's enough to calculate the probability that the method is profitable with "X" degree of confidence.

Can you explain what this means and how you would determine it?


Mike (Dr Beav)

P.S. I'm not picking on you. I really want to know.

podonne
07-06-2011, 10:26 PM
So you are always at risk that you are looking at stats that could be irrelevant or outdated the moment you find them, no matter how large the sample or how hard to study them.

Sometimes I make a bet off a sample of 1 or off no sample at all (seriously). For example, if a new trainer wins with his first starter and moves the horse way up, I'm immediately researching if he was an assistant for another move up trainer. I try to make a subjective estimate of the chances that this is going to be fairly typical. If you wait until you have a large sample, it's too late.

Sorry, but that reasoning is self-defeating. If we assume that any stat or system is invalid regardless of its results and that no prior information has value, then the EV of any bet is equal to the EV of one randomly placed bet. Not sure what the EV of one random bet is, but it's not positive, at least. :-)

If we say that prior information has at least some value, then you've got to generalize the situation until it applies broadly enough to at least give you an idea of what's going to happen. I could write a system to research what you've described easily enough: "Bet on any horse that is moving up after a win if the horse's trainer was an assistant to a 'move-up' trainer, 'move up' meaning a trainer who moved at least 80% of his horses up in class following a win and those horses won at above their expected value." What if I ran that system and it did terribly? The world is full of assistants that mimic their former bosses and fail (Bill Parcells is kinda famous for that, I think). Would you still bet it?

I'm all for the notion that past results do not predict future performance, but that is not the same as saying that prior information has no value.

Robert Goren
07-06-2011, 10:31 PM
Sorry, but that reasoning is self-defeating. If we assume that any stat or system is invalid regardless of its results and that no prior information has value, then the EV of any bet is equal to the EV of one randomly placed bet. Not sure what the EV of one random bet is, but it's not positive, at least. :-)

If we say that prior information has at least some value, then you've got to generalize the situation until it applies broadly enough to at least give you an idea of what's going to happen. I could write a system to research what you've described easily enough: "Bet on any horse that is moving up after a win if the horse's trainer was an assistant to a 'move-up' trainer, 'move up' meaning a trainer who moved at least 80% of his horses up in class following a win and those horses won at above their expected value." What if I ran that system and it did terribly? The world is full of assistants that mimic their former bosses and fail (Bill Parcells is kinda famous for that, I think). Would you still bet it?

I'm all for the notion that past results do not predict future performance, but that is not the same as saying that prior information has no value.

Except for the one that has done really well.

podonne
07-06-2011, 10:54 PM
Except for the one that has done really well.

The exception that proves the rule? :D

thaskalos
07-06-2011, 10:57 PM
Sorry, but that reasoning is self-defeating. If we assume that any stat or system is invalid regardless of its results and that no prior information has value, then the EV of any bet is equal to the EV of one randomly placed bet. Not sure what the EV of one random bet is, but it's not positive, at least. :-)

If we say that prior information has at least some value, then you've got to generalize the situation until it applies broadly enough to at least give you an idea of what's going to happen. I could write a system to research what you've described easily enough: "Bet on any horse that is moving up after a win if the horse's trainer was an assistant to a 'move-up' trainer, 'move up' meaning a trainer who moved at least 80% of his horses up in class following a win and those horses won at above their expected value." What if I ran that system and it did terribly? The world is full of assistants that mimic their former bosses and fail (Bill Parcells is kinda famous for that, I think). Would you still bet it?

I'm all for the notion that past results do not predict future performance, but that is not the same as saying that prior information has no value.
I used to think that there is no system/angle in existence with a positive long-term ROI, but then I took a good look at some of the trainer stats and jockey/trainer combinations out there...and now I'm not so sure.

IMO...if a long-term profitable stat/system/method DOES exist, it is much more likely for it to be connected to trainer and trainer/jockey stats than to hardcore handicapping principles.

Cratos
07-06-2011, 11:02 PM
One of us is still missing the others point (probably me). :lol:

I don't think you need to worry about anything other than the win%, average payout, and sample size. That's enough to calculate the probability that the method is profitable with "X" degree of confidence.

You should try to accumulate as large a sample as possible all the time anyway.

The most important point is that the game is changing. Trainers are changing methods, racing secretaries are creating new classes, we recently added multiple new synthetic surfaces, sharp horse players are always writing books and educating the masses, more good information is available for sale, the breed is changing, drug laws change, the big money players change etc...

So you are always at risk that you are looking at stats that could be irrelevant or outdated the moment you find them, no matter how large the sample or how hard to study them.

Sometimes I make a bet off a sample of 1 or off no sample at all (seriously). For example, if a new trainer wins with his first starter and moves the horse way up, I'm immediately researching if he was an assistant for another move up trainer. I try to make a subjective estimate of the chances that this is going to be fairly typical. If you wait until you have a large sample, it's too late.

Class, correct me if I am making the wrong assumption about your post, but it seems to me that you believe that sample size is the pivot point here.

However, one of the big issues in determining the sample size when estimating unknown parameters in statistics is cost. It is generally true that larger sample sizes lead to increased precision of the estimation, but again, at what cost?

Two fundamental facts in statistics that would be helpful in understanding the sample size accuracy phenomenon are the “law of large numbers” and the “central limit theorem.”

podonne
07-07-2011, 12:24 AM
I'm starting to put some of this together. It's frustrating. I figure myself a smart guy, and I think I understand the theory as you've described and illustrated it. I'm guessing the part where I begin to get lost in the calculations is when the calculus rears its ugly head. :-) And before I go any farther, let me thank you for the time you've put in helping me (us) understand how to apply the Bayesian approach to this problem.

I don't think I'll get it completely until I can see, step by step, how to address the problem of our original question, developing a confidence factor for the "streakiness" of a series of bets.

This is the beginning (pseudocode, VB):
'These two things are the givens
dim bet_record() = {0,1,1,0,1,1,1,0} 'the results of bets made by the system, 1=win, 0=lost
dim k = 100 'a given

We can get at the set of probabilities in your example noted by p1...pn by resampling, monte carlo-style:

dim sample as new list(of single) 'this will hold the samples that we'll be investigating
'resample bet_record in sizes of 20
'the function resample(a,b) create an array of size b by randomly selecting "b" bets from array "a" with replacement
for x = 1 to 1000
sample.add(resample(bet_record,20).Average())
next x

So now we have p1, p2 ... p1000 (they are the values in sample(0), sample(1) ... sample(999)).

Now let's get to your f(p) function, which I'm trying to get into code. The spaces there confuse me a bit, are they multiplication?

f(p) = [1/B(km, k(1-m))] * p^(km-1) * (1-p)^(k(1-m)-1), 0 < p < 1

I think it comes out as this (where p would be any of the values from p1, p2 ... p1000)

f_p = (1/beta_density) * Math.Pow(p, k * m - 1) * Math.Pow(1 - p, k * (1 - m) - 1)

Assuming that's correct, I'm confused a bit about the beta_density number. You defined it as B(km, k(1-m)), which as I understand it is how to describe the shape of the beta curve, but doesn't actually resolve to a "value" that we can use in an equation. Can you clarify that a bit?

Oh, and I'm keeping a running draft of the code to do this on pastebin, anyone is free to take a look as it evolves: http://pastebin.com/BLadDGM6

TrifectaMike
07-07-2011, 11:32 AM
I'm starting to put some of this together. It's frustrating. I figure myself a smart guy, and I think I understand the theory as you've described and illustrated it. I'm guessing the part where I begin to get lost in the calculations is when the calculus rears its ugly head. :-) And before I go any farther, let me thank you for the time you've put in helping me (us) understand how to apply the Bayesian approach to this problem.

I don't think I'll get it completely until I can see, step by step, how to address the problem of our original question, developing a confidence factor for the "streakiness" of a series of bets.

This is the beginning (pseudocode, VB):
'These two things are the givens
dim bet_record() = {0,1,1,0,1,1,1,0} 'the results of bets made by the system, 1=win, 0=lost
dim k = 100 'a given

We can get at the set of probabilities in your example noted by p1...pn by resampling, monte carlo-style:

dim sample as new list(of single) 'this will hold the samples that we'll be investigating
'resample bet_record in sizes of 20
'the function resample(a,b) create an array of size b by randomly selecting "b" bets from array "a" with replacement
for x = 1 to 1000
sample.add(resample(bet_record,20).Average())
next x

So now we have p1, p2 ... p1000 (they are the values in sample(0), sample(1) ... sample(999)).

Now let's get to your f(p) function, which I'm trying to get into code. The spaces there confuse me a bit, are they multiplication?

f(p) = [1/B(km, k(1-m))] * p^(km-1) * (1-p)^(k(1-m)-1), 0 < p < 1

I think it comes out as this (where p would be any of the values from p1, p2 ... p1000)

f_p = (1/beta_density) * Math.Pow(p, k * m - 1) * Math.Pow(1 - p, k * (1 - m) - 1)

Assuming that's correct, I'm confused a bit about the beta_density number. You defined it as B(km, k(1-m)), which as I understand it is how to describe the shape of the beta curve, but doesn't actually resolve to a "value" that we can use in an equation. Can you clarify that a bit?

Oh, and I'm keeping a running draft of the code to do this on pastebin, anyone is free to take a look as it evolves: http://pastebin.com/BLadDGM6

Philip,

I applaud your attempt at this. You have some things correct and others not so correct. You can continue and I'll help you understand it.

I'm prepared to do you a favor. If you send me your win/loss history as a vector of 0 and 1's (you can break it down yearly), I'll do it for you. This way you'll have an answer and you can continue at your leisure.

Mike (Dr Beav)

TrifectaMike
07-07-2011, 12:43 PM
Sorry, but that reasoning is self-defeating. If we assume that any stat or system is invalid regardless of its results and that no prior information has value, then the EV of any bet is equal to the EV of one randomly placed bet. Not sure what the EV of one random bet is, but it's not positive, at least. :-)

If we say that prior information has at least some value, then you've got to generalize the situation until it applies broadly enough to at least give you an idea of what's going to happen. I could write a system to research what you've described easily enough: "Bet on any horse that is moving up after a win if the horse's trainer was an assistant to a 'move-up' trainer, 'move up' meaning a trainer who moved at least 80% of his horses up in class following a win and those horses won at above their expected value." What if I ran that system and it did terribly? The world is full of assistants that mimic their former bosses and fail (Bill Parcells is kinda famous for that, I think). Would you still bet it?

I'm all for the notion that past results do not predict future performance, but that is not the same as saying that prior information has no value.

Isn't prior knowledge related to what some refer to as intuition?

Just talking out loud.

Mike (Dr Beav)

podonne
07-07-2011, 01:34 PM
Philip,

I applaud your attempt at this. You have some things correct and others not so correct. You can continue and I'll help you understand it.

I'm prepared to do you a favor. If you send me your win/loss history as a vector of 0 and 1's (you can break it down yearly), I'll do it for you. This way you'll have an answer and you can continue at your leisure.

Mike (Dr Beav)

Mike,

I certainly appreciate your offer, but having the answer won't help when I have a different angle that I want to test. Plus, I really want to learn what's going on here; it's just that I learn best when I can crosswalk between a theory and an implementation for a problem. (Computer programmers, I know...) Though I might take you up on it at the end, just to make sure whatever we come up with works. :-)

You seem to have a solid idea about how to calculate this in practice (and an excellent idea about how to do it in theory). Any chance you could just run through this in a "here's how you do it" order rather than a "here's how to understand it" way? I'll turn your sequence into code and we all walk away with a nifty new way of calculating the EV of an angle. :-)

Regards,
Philip

TrifectaMike
07-07-2011, 03:42 PM
Mike,

I certainly appreciate your offer, but having the answer won't help when I have a different angle that I want to test. Plus, I really want to learn what's going on here; it's just that I learn best when I can crosswalk between a theory and an implementation for a problem. (Computer programmers, I know...) Though I might take you up on it at the end, just to make sure whatever we come up with works. :-)

You seem to have a solid idea about how to calculate this in practice (and an excellent idea about how to do it in theory). Any chance you could just run through this in a "here's how you do it" order rather than a "here's how to understand it" way? I'll turn your sequence into code and we all walk away with a nifty new way of calculating the EV of an angle. :-)

Regards,
Philip

Here are the full equations.

The Consistent Model

p(s|H1) = integral from p = 0 to 1 of [ product over bins i of C(ni, yi) * p^yi * (1-p)^(ni-yi) ] dp

The Streaky Model

p(s|H2) = product over bins i of [ integral from p = 0 to 1 of C(ni, yi) * p^yi * (1-p)^(ni-yi) * f(p) dp ]

My offer still stands.

Mike (Dr Beav)

podonne
07-07-2011, 05:18 PM
The Consistent Model

p(s|H1) = integral from p = 0 to 1 of [ product over bins i of C(ni, yi) * p^yi * (1-p)^(ni-yi) ] dp



Just to be clear:
ni = number of successes in subset i
yi = number of failures in subset i
p = ni/(ni+yi) in subset i

What's d?

TrifectaMike
07-07-2011, 05:57 PM
Just to be clear:
ni = number of successes in subset i
yi = number of failures in subset i
p = ni/(ni+yi) in subset i

What's d?

ni = number of samples in the subset i
yi = number of successes in the subset i
ni - yi = number of failures in the subset i
p = p1 = p2 = p3 = p4 = ... = pn (p has a uniform distribution)
dp = the differential; we are computing the integral from p = 0 to 1

Mike (Dr Beav)

TrifectaMike
07-07-2011, 08:34 PM
I've been using Facebook too much I think, because I found myself looking for the "like" button.

I'm glad you liked the example. From reading some of your past posts, I'm not surprised. I believe you get it.

Mike (Dr Beav)

classhandicapper
07-08-2011, 12:54 PM
Mike (Dr Beav)

P.S. I'm not picking on you. I really want to know.

It has been a long time since I took probability and statistics in college, so I am out of my element here even though I'm anxious to learn something new. There are some basic calculations for this kind of thing.

classhandicapper
07-08-2011, 01:06 PM
Sorry, but that reasoning is self-defeating. If we assume that any stat or system is invalid regardless of its results and that no prior information has value, then the EV of any bet is equal to the EV of one randomly placed bet. Not sure what the EV of one random bet is, but it's not positive, at least. :-)

If we say that prior information has at least some value, then you've got to generalize the situation until it applies broadly enough to at least give you an idea of what's going to happen. I could write a system to research what you've described easily enough: "Bet on any horse that is moving up after a win if the horse's trainer was an assistant to a 'move-up' trainer, 'move up' meaning a trainer who moved at least 80% of his horses up in class following a win and those horses won at above their expected value." What if I ran that system and it did terribly? The world is full of assistants that mimic their former bosses and fail (Bill Parcells is kinda famous for that, I think). Would you still bet it?

I'm all for the notion that past results do not predict future performance, but that is not the same as saying that prior information has no value.

In the world of statistics, you are always looking for large enough samples to be sure something is not random.

In the world of gambling, everyone is looking for value.

The end result is that as statistics that suggest good value become larger and larger (and less likely to be random), the public bets them further and further down and removes the value.

That usually doesn't change the fact that "stat X" is a negative or positive, but it changes its usefulness from a gambling perspective.

So the trick is make subjective judgments BEFORE there are large samples.

In my example with "trainers that mysteriously improve horses", I rarely need more than 1 or 2 horses to be pretty sure I KNOW the scoop and sometimes I bet before I see even 1. Occasionally I will be wrong, but when I am right I am so far ahead of the curve I can get in 1 or 2 bets at great value before the public starts crushing the odds.

IMO a better use of statistics is to learn the game.

Suppose you wanted to know how second-time starters that won sprinting first time out and are stretching out to a route do, compared to horses that won routing and are coming back at the same distance. Statistics would be a great way to study that. You could keep refining and refining the details, and at the end of it you'd learn a bunch of stuff that is unlikely to change much. If it does change, it will be over a long period of time. The "values" on such things are more likely to change and could even change abruptly without your knowledge.

classhandicapper
07-08-2011, 01:09 PM
However, one of the big issues in determining the sample size when estimating unknown parameters in statistics is cost. It is generally true that larger sample sizes lead to increased precision of the estimation, but again, at what cost?



Totally agree. In a few of my other posts I said that I often make bets long before I have a good sample size (based on a subjective judgment) because that's when the value is still there.

Tom
07-08-2011, 03:29 PM
Suppose I have an angle that has hit 19% of the time with an ROI of 1.12 over the last 500 plays.

What sample size do I need to use? Ballpark.

Cratos
07-09-2011, 04:50 PM
Suppose I have an angle that has hit 19% of the time with an ROI of 1.12 over the last 500 plays.

What sample size do I need to use? Ballpark.

374 with a 2% margin of error and a 95% confidence interval
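(For reference, one way to arrive at a figure like that is the usual sample-size formula plus a finite-population correction against the 500 past plays: n0 = 1.96^2 * 0.19 * 0.81 / 0.02^2, which is about 1478, then n = n0 / (1 + (n0 - 1)/500), which is about 374.)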

dansan
07-09-2011, 11:40 PM
If you've got a real angle, keep it to yourself. Once you give it up, it's not an angle anymore.

Robert Fischer
07-10-2011, 12:51 AM
Estimating the margin of error and confidence interval seems to be key.