Algorithms Regression etc


mbugg1976
06-18-2015, 10:35 AM
Hi All

First post from me from a rare sunny England

My question is: I have a load of past data that I would like to analyse to predict future races.
I'm not too bothered whether it gives me an odds line at the end or just a score.

I have used logistic regression, but I think it is weighting some variables too high.
I have also used a simple correlation to pick the top 20 variables, then removed any variables that are highly correlated with each other, then weighted the leftover variables.

So what's out there to pick the best variables to use?
Are there any other ways out there that someone with no data analysis experience can use?

Cheers
Michael

Now time for a cup of tea and watch Royal Ascot

Robert Goren
06-18-2015, 10:58 AM
I start by doing one variable at a time. I take the best one and try it with the rest, again one at a time. I use the p-values reported alongside the Z-scores (the Pr(>|z|) column in the glm summary) to determine the best one. Then I add in a third variable, and so forth. The lower the p-value, the more useful the variable is. I use .05 as the upper limit. I use Rcmdr in "R" and use glm for getting the values. I think some people might use the "R squared" stat. For a while it was the gold standard, but I have my doubts about how useful it is in logit models. In multi-variable, non-binomial regression, such as comparing SRs + whatever to finishing position, it does generally tell you when to stop adding in variables.
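
For readers without R, here is a rough Python sketch of the same one-variable-at-a-time idea. It is not the glm workflow above: it fits the logistic regression with a hand-rolled gradient ascent and judges each candidate by its gain in log-likelihood rather than by the glm p-value. The 1.92 cutoff is half of 3.84, the chi-square value for one degree of freedom at the .05 level. The data is synthetic and the function names are invented for the example.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_log_likelihood(rows, y, cols, steps=300, lr=0.5):
    """Fit a logistic regression on the chosen columns by gradient ascent
    and return the final log-likelihood."""
    w = [0.0] * (len(cols) + 1)  # w[0] is the intercept
    n = len(y)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, target in zip(rows, y):
            x = [1.0] + [row[c] for c in cols]
            err = target - sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    ll = 0.0
    for row, target in zip(rows, y):
        p = sigmoid(w[0] + sum(wj * row[c] for wj, c in zip(w[1:], cols)))
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        ll += math.log(p) if target else math.log(1 - p)
    return ll

def forward_select(rows, y, min_gain=1.92):
    """Add one variable at a time, keeping the one that raises the
    log-likelihood most; stop when the best gain falls below min_gain."""
    selected, remaining = [], list(range(len(rows[0])))
    best_ll = fit_log_likelihood(rows, y, selected)
    while remaining:
        ll, col = max((fit_log_likelihood(rows, y, selected + [c]), c) for c in remaining)
        if ll - best_ll < min_gain:
            break
        selected.append(col)
        remaining.remove(col)
        best_ll = ll
    return selected

# Synthetic, already-standardized data: only column 0 carries any signal.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [1 if r[0] + 0.5 * rng.gauss(0, 1) > 0 else 0 for r in rows]
chosen = forward_select(rows, y)
print(chosen)
```

With 95 candidate variables the same loop applies, but each round refits one model per remaining variable, so a proper statistics package will be much faster than this pure-Python fit.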

mbugg1976
06-18-2015, 11:19 AM
Cheers Robert

May take a while; I have 95 different variables.

traynor
06-18-2015, 07:02 PM
You might find it worthwhile to invest a bit of time in learning to write basic code. Horse racing is not that complicated, and data mining the limited base of horse data is even less complicated. There are many options that are available (WEKA, Anaconda3, RapidMiner, others) that should not take more than a couple of hours to learn to use well enough to do what you indicate you want to do.

The problem is exponential searches. You might want to narrow the field a bit first, extracting the factors with the most impact, and combining limited sets. Divide the number of hits by the number of matches. More hits, fewer matches, good. Pick the factors with the highest ratios of hits to matches for starters.

One thing that many data crunchers don't want to admit--combining factors does not always improve performance. And the more factors you combine--regardless of how well they seem to work--the greater the tendency of the end result to only fit the specific dataset you started with. As in "backfitting" or "overfitting."

There is a set of (very short, not at all complex) videos on WEKA (from Waikato University) on YouTube. You might find it worthwhile to watch them, just for suggestions on how to avoid the most common rookie errors in data mining.

If you think you have something worthwhile, you would probably be well advised to do the data mining yourself. It is not that tough, and not that complicated. It would be dismaying to develop the world's greatest handicapping method and wake up to discover it posted on a half dozen open-source forums. Or offered for sale by someone you have never heard of.

https://www.youtube.com/watch?v=m7kpIBGEdkI

mbugg1976
06-18-2015, 08:33 PM
Thanks Traynor

Not really understanding the below

The problem is exponential searches. You might want to narrow the field a bit first, extracting the factors with the most impact, and combining limited sets. Divide the number of hits by the number of matches. More hits, fewer matches, good. Pick the factors with the highest ratios of hits to matches for starters.

I can use WEKA, but there are so many classifiers that it left me slightly confused.

traynor
06-18-2015, 10:10 PM
You look for X (whatever it may be). You look in all entries, all races to see how many times you find a match for X. Example, jockey with green cap. Those are matches. Then find how many of those matches were wins--however you define it--usually "won the race" or close, or something similar. Those are the hits.

In 100 races, 300 entries match X. Of those 300, 30 won. The value of X as a factor is 30/300 = 0.10.

If you use that basic qualifier first, it will quickly locate the most significant of the variables.

Initially, I use a coded "pattern" to pre-qualify. Faster, easier, and simpler than most other approaches. The data is in a long string (for example) separated by commas. Split it at the commas. Identify the element of interest (46). Identify the element you want as "win" (79 or whatever). Loop through the whole data clump, finding how many times element 46 matches X, and how many times element 79 is "won the race" or whatever.

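' data() holds one entry's fields after splitting the record at the commas;
' "something" stands for the value X being tested
If data(46) = something Then
    ' entry matches the factor: increment matches
    matches += 1
    If data(79) = "won" Then
        ' matching entry also won: increment wins
        wins += 1
    End If
End If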
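
For comparison, a small Python sketch of the whole loop, including the split at the commas and the final hits-to-matches ratio. The three-field records and the function name `hit_ratio` are invented for illustration; real data would use whatever field positions apply (46 and 79 in the example above).

```python
def hit_ratio(lines, factor_index, factor_value, result_index, win_value="won"):
    """Return (matches, hits, hits / matches) for one factor over all entries."""
    matches = hits = 0
    for line in lines:
        fields = line.strip().split(",")  # split the record at the commas
        if fields[factor_index] == factor_value:
            matches += 1                  # entry matches the factor
            if fields[result_index] == win_value:
                hits += 1                 # matching entry also won
    ratio = hits / matches if matches else 0.0
    return matches, hits, ratio

# Hypothetical records: horse, cap colour, result.
entries = [
    "A,green,won",
    "B,green,lost",
    "C,red,lost",
    "D,green,lost",
    "E,red,won",
]
matches, hits, ratio = hit_ratio(entries, factor_index=1, factor_value="green", result_index=2)
print(matches, hits, round(ratio, 3))
```

Running the same function once per candidate factor and sorting by the ratio gives the "highest ratios of hits to matches" shortlist.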


I did a research project awhile back that analyzed some insane number of horses and races (every race run in Australia in five years). It was MUCH easier to use code than to try to fix the (horribly broken) database that it came in.

classhandicapper
08-20-2017, 05:26 PM
I have a question on regression analysis.

Let's assume I have a set of data on each horse for over 1000 races. I want to know how to weight each of these factors to produce the most likely winner.

Let's say I want to use 5 factors to help predict the outcome

1. Speed Figure Last Race
2. Class Figure Last Race
3. Finish Position Last Race
4. Lengths Behind at Finish Last Race
5. Winning Margin Last Race (if the horse won)

I want to use those 5 factors to predict the finish position in today's race (which I also have in my data along with Date, Track, and Race).


I've seen some very basic regression analysis done in Excel, but one thing I don't understand is how to make the analysis recognize that each race is its own unit/group.

It has to understand that a 70 speed figure might have been the top figure in one race and led to a win, but in another race that 70 speed figure may have been the worst speed figure and led to a 9th.

Whosonfirst
08-21-2017, 07:43 AM
It sounds like you may want to try using multiple regression to handle the speed figure issue in relative terms. For example, what is the relative ranking of a 70 speed figure in relation to all the others in a particular race? Or how does a 70 relate to the average winning figure for that class of race? A simpler way, although maybe not as accurate, is to sort all the last-race speed figures in a race, then have Excel assign a ranking: 1, 2, 3, etc. The same technique can be used for class, whether it's average $ per start or some other class number.
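
The per-race ranking can also be done outside Excel. A minimal Python sketch, assuming each entry is a dict with made-up `race` and `speed` keys:

```python
from collections import defaultdict

def rank_within_race(entries, key="speed"):
    """Add a per-race rank (1 = highest figure) to each entry; ties share a rank."""
    by_race = defaultdict(list)
    for entry in entries:
        by_race[entry["race"]].append(entry)
    for field in by_race.values():
        figures = sorted((e[key] for e in field), reverse=True)
        for e in field:
            # Position of the figure in the sorted list, counted from 1.
            e[key + "_rank"] = figures.index(e[key]) + 1
    return entries

# Hypothetical data: a 70 is the top figure in race 1 and the worst in race 2.
entries = [
    {"race": 1, "horse": "A", "speed": 70},
    {"race": 1, "horse": "B", "speed": 65},
    {"race": 1, "horse": "C", "speed": 60},
    {"race": 2, "horse": "D", "speed": 85},
    {"race": 2, "horse": "E", "speed": 80},
    {"race": 2, "horse": "F", "speed": 70},
]
ranked = rank_within_race(entries)
print([(e["horse"], e["speed_rank"]) for e in ranked])
```

The same 70 comes out ranked 1 in the first race and 3 in the second, which is the distinction the regression needs to see.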

Nitro
08-21-2017, 08:16 PM
I posted this on another thread related to HK racing. However, it actually has more significance here as related to the value of statistical analysis.

However, I will point out that the chief cheerleader for Hong Kong here, Nitro, claims he only bets based on tote-board action. So, for all of the wonderful, amazingly clean Hong Kong racing he always trumpets, he's STILL wagering based on insider moves and supposed conspiracies. Sounds wonderful.

Correct.
It is “wonderful” when you can often glean an inside track of potential contenders in a race!

For those like me, who acknowledge the significance of money flow as a “Given”, we really could care less about where it’s coming from. What’s more important is its flow: when and where it’s going into the various betting pools. Insinuating “Conspiracy theories” is a feeble explanation for underestimating the objectives and intent of those on the inside: the connections, who BTW don’t often publicly reveal their intentions, but would rather “Let their money do their talking” as objective evidence of their confidence.

Yet, you have to enjoy comments like these (At least I do); because it clearly demonstrates (once again) a complete lack of appreciation of the realities of the horse racing game from the perspective of the typical handicapper. I’m not going to pursue this very deeply, because it’s getting old.

However, I would enjoy reading a valid argument (from any credible source) as to how typical handicapping methodology in general (Computerized or otherwise) can rationalize 2 very basic but critical aspects of the game from an Outsider’s perspective:
1) Can it predict whether or not each and every entry in a race is actually going to make an attempt to Win it?
2) How all of the past performance data or statistical analysis can determine the current physical and mental well-being of each horse entered in a race?

Although I won’t take this any further, you might want to consider what Bill Benter has personally stated. If you don’t recognize the name here’s a link:
http://www.worlds-greatest-gamblers.com/gamblers/horse-racing/william-benter/
If you don’t respect his credibility, I would assume that you’re beyond his accomplishments and capabilities. So don't bother trying to comprehend the significance of his comments (below):

Excerpts from:
“Computer Based Horse Race Handicapping and Wagering Systems:”
A Report by William Benter

INTRODUCTION
The question of whether a fully mechanical system can ever "beat the races" has been widely discussed in both the academic and popular literature. Certain authors have convincingly demonstrated that profitable wagering systems do exist for the races. The most well documented of these have generally been of the technical variety, that is, they are concerned mainly with the public odds, and do not attempt to predict horse performance from fundamental factors. Technical systems for place and show betting (Ziemba and Hausch, 1987) and exotic pool betting (Ziemba and Hausch, 1986), as well as the 'odds movement' system developed by Asch and Quandt (1986), fall into this category. A benefit of these systems is that they require relatively little preparatory effort, and can be effectively employed by the occasional race goer.

The complexity of predicting horse performance makes the specification of an elegant handicapping model quite difficult. Ideally, each independent variable would capture a unique aspect of the influences affecting horse performance. In the author's experience, the trial and error method of adding independent variables to increase the model's goodness-of-fit results in the model tending to become a hodgepodge of highly correlated variables whose individual significances are difficult to determine and often counter-intuitive.

Additionally, there will always be a significant amount of 'inside information' in horse racing that cannot be readily included in a statistical model. Trainer's and jockey's intentions, secret workouts, whether the horse ate its breakfast, and the like, will be available to certain parties who will no doubt take advantage of it. Their betting will be reflected in the odds. This presents an obstacle to the model developer with access to published information only. For a statistical model to compete in this environment, it must make full use of the advantages of computer modeling, namely, the ability to make complex calculations on large data sets.

The odds set by the public betting yield a sophisticated estimate of the horses' win probabilities.

It can be presumed that valid fundamental information exists which can not be systematically or practically incorporated into a statistical model. Therefore, any statistical model, however well developed, will always be incomplete. An extremely important step in model development, and one that the author believes has been generally overlooked in the literature, is the estimation of the relation of the model's probability estimates to the public's estimates, and the adjustment of the model's estimates to incorporate whatever information can be gleaned from the public's estimates. The public's implied probability estimates generally correspond well with the actual frequencies of winning.

BTW the last 2 sentences express exactly how Mr. Benter was able to achieve his success.
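
A rough Python sketch of that adjustment step: turn the public odds into implied probabilities, then blend them with a model's estimates and renormalize within the race. The log-linear form and the `alpha` and `beta` weights here are illustrative assumptions, not values taken from the excerpt; in practice the weights would have to be estimated from past races.

```python
import math

def implied_probabilities(odds):
    """Convert odds-to-1 into win probabilities that sum to 1 (removes the takeout)."""
    raw = [1.0 / (o + 1.0) for o in odds]
    total = sum(raw)
    return [r / total for r in raw]

def blend(model_probs, public_probs, alpha=0.5, beta=0.5):
    """Log-linear blend of model and public estimates, renormalized within the race.
    All probabilities must be greater than zero."""
    scores = [
        math.exp(alpha * math.log(f) + beta * math.log(p))
        for f, p in zip(model_probs, public_probs)
    ]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical three-horse race: odds of 1-1, 3-1 and 9-1.
public = implied_probabilities([1.0, 3.0, 9.0])
model = [0.40, 0.40, 0.20]  # made-up model estimates
combined = blend(model, public)
print([round(p, 3) for p in combined])
```

With `alpha=0` and `beta=1` the blend simply returns the public's probabilities, so the weights express how much the model is trusted relative to the odds.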

traveler
08-21-2017, 08:44 PM
Don't discount just using the rank rather than the actual SR or Pace figure etc.
If you want to go with an actual figure, try the difference from top idea - 70 SR gets a 0, 65 SR gets a 5 or -5 etc. all numbers should be viewed in the relative context of today's race.
I think at one point Benter figured the public odds were 65% or so of his line.
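
The difference-from-top idea in a few lines of Python, as a sketch. The figures are made up, and whether the gap is stored as positive or negative is a matter of taste, as noted above.

```python
def difference_from_top(figures):
    """Return each horse's gap to the best figure in today's race (top horse gets 0)."""
    top = max(figures)
    return [top - f for f in figures]

# Hypothetical race: speed figures of 70, 65 and 58.
gaps = difference_from_top([70, 65, 58])
print(gaps)
```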

classhandicapper
08-22-2017, 09:58 AM
I tried something approximating difference from the top and got pretty good results, but I think I see a flaw in the approach.

The regression is looking at multiple field values for each horse as input and then the finish position of the horses. So it is more or less trying to maximize the ability to rank all the horses in a race correctly. I'm really only interested in the ability to pick winners. It's not doing that nearly as well as I can do it with weights for each factor that I came up with via trial and error.

So I need to somehow stress that the winners are the key.

Perhaps instead of ranking all the horses in each race I could look at just the top 3 finishers????

DeltaLover
08-22-2017, 10:58 AM
Logistic regression can be seen as a special case of neural networks; many problems cannot be solved with the former and require the latter. One such problem is the prediction of the finish positions that you are describing here.

What I see as the major challenge, though, is not the algorithm to be used but the way to present the data to it. Some of the data preprocessing questions that need to be addressed are the following:

What metrics to use? (ex: speed or pace figures, closing figures, etc.)

How to generate the necessary metrics? (ex: track variant estimation, cross-distance and cross-track adjustment, etc.)

How many past performances to use? (ex: Do we need individual models based on the number of available past performances? How to handle shippers? etc.)

Should metrics be normalized using a per-race window or passed in as absolute values?

How to pass race-level data? (ex: wire-to-wire winning stats, average speed figures for all starters, etc.)

What kind of primitive (predicate) handicapping factors to pass, and how? (ex: layoffs, dirt to turf, first Lasix, etc.)

How to handle connections? (ex: jockey/trainer changes, etc.)


Even after answering all these questions, we still need to decide how to formulate the “target” of the model. By this I mean that simply targeting raw finish order might well not be useful, as at best it will match the crowd’s ranking.

classhandicapper
08-22-2017, 11:27 AM
The goal right now is very simple.

Input a handful of factors, run a regression analysis that will supply me the weight for each factor that will produce the most winners. That's all.

As I described previously, when I did that, it produced a set of weights that picked a lot of winners for the highest ranked horse. However it underperformed a set of weights for each factor I had defined myself using handicapping experience and tweaking further with trial and error. I was expecting improvement over what I could do intuitively and through trial and error.

So that got me thinking.

The regression isn't trying to solve for just picking winners. It's trying to solve for getting the 2nd ranked horse to come in 2nd, the 3rd ranked horse to come in 3rd ... the 7th ranked horse to come in 7th, etc. So the regression may be doing a better job than me at ranking all the horses, but it's not doing a better job than my weights at picking winners.

So the question becomes: how do I use all the same input but set it up so it provides the weights that maximize the top ranked horse winning?

Right now it is comparing the input factors to the finish position.
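
One way to set that up, sketched in Python under stated assumptions: treat each race as its own group, turn the weighted factor scores into win probabilities that sum to 1 within the race, and fit the weights so the actual winners get as much probability as possible. This is a swapped-in technique (a per-race softmax, often called a conditional logit), not the Excel regression described above, and the toy races and factor values are invented. Field size can vary from race to race.

```python
import math

def win_probabilities(field, weights):
    """Softmax over one race: each horse's score is a weighted sum of its factors."""
    scores = [sum(w * x for w, x in zip(weights, horse)) for horse in field]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the top score for stability
    total = sum(exps)
    return [e / total for e in exps]

def winner_log_likelihood(races, weights):
    """Sum of log-probabilities assigned to the actual winners."""
    return sum(math.log(win_probabilities(field, weights)[winner])
               for field, winner in races)

def fit_winner_weights(races, n_factors, steps=500, lr=0.05):
    """Gradient ascent on the winners' log-likelihood."""
    weights = [0.0] * n_factors
    for _ in range(steps):
        grad = [0.0] * n_factors
        for field, winner in races:
            probs = win_probabilities(field, weights)
            for j in range(n_factors):
                expected = sum(p * horse[j] for p, horse in zip(probs, field))
                grad[j] += field[winner][j] - expected
        weights = [w + lr * g / len(races) for w, g in zip(weights, grad)]
    return weights

# Hypothetical races: each horse is [figure minus top figure in race, some 0/1 flag];
# the second item of each tuple is the index of the winner.
races = [
    ([[0, 1], [-5, 0], [-10, 1]], 0),
    ([[0, 0], [-3, 1]], 0),
    ([[-2, 1], [0, 0], [-8, 0], [-4, 1]], 1),
    ([[0, 1], [-6, 1], [-1, 0]], 2),
]
weights = fit_winner_weights(races, n_factors=2)
print([round(w, 3) for w in weights])
```

Because only the winner's probability enters the objective, the fit is not rewarded for getting the 2nd through 7th places in order. Four toy races are far too few to mean anything; the point is only the shape of the calculation.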

DeltaLover
08-22-2017, 11:34 AM
Does this mean that you are developing separate models based on the number of the starters?

sjk
08-22-2017, 01:52 PM
If you are doing a linear model, you want to do something to limit the effect of a horse being beaten by double-digit lengths, such as limiting the beaten-lengths parameter to a predetermined value, or you might use 10 minus beaten lengths and bottom out at 0.

On the theory that a horse that runs last in a 6 horse field hasn't really done anything better than a horse that runs last in a 12 horse field, you might limit the finish position (or use 6 minus finish position, as above).

The speed and class numbers should probably be in relation to what would be expected at that level.

I would be leery of the information that is only available for winners. That is probably not going to fit in a nice linear manner with the others.
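
The two cutoffs suggested above, as a short Python sketch. The caps of 10 lengths and 6 places come from the post; the function names are made up.

```python
def capped_beaten_lengths(beaten_lengths, cap=10.0):
    """'10 minus beaten lengths', bottoming out at 0."""
    return max(0.0, cap - beaten_lengths)

def capped_finish(finish_position, cap=6):
    """'6 minus finish position', bottoming out at 0, so that last of 6
    and last of 12 get the same value."""
    return max(0, cap - finish_position)

print(capped_beaten_lengths(2.5), capped_beaten_lengths(25))
print(capped_finish(1), capped_finish(6), capped_finish(12))
```

Both transforms also flip the direction, so a bigger number is always better, which keeps the signs of the regression weights easier to read.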

classhandicapper
08-22-2017, 03:46 PM
Does this mean that you are developing separate models based on the number of the starters?

I thought about that, but the samples might start getting too small.

classhandicapper
08-22-2017, 03:59 PM
1. Finding some cutoffs sounds like a good idea.

2. Field size of last race is one of the inputs I am using. So I would hope it would know for example that a horse that beat a 12 horse field did more than a horse that beat a 5 horse field. But perhaps I can change that into a somewhat higher quality/expressive input. In my own "intuitive" model I use a fixed value adjustment per number of starters, but I was never satisfied I had a great way of doing it. It just works better than not doing it. That's the kind of thing I was hoping to learn here.

3. The way I did the figures was to find the maximum value in each race and set that to 100.

So for example:

If the top figure in a race was 115, I set it to 100 and then lowered the figure for each of the other horses in the race by 15 to keep the relationships the same.

If the top figure in the race was 75, I set it to 100 and then raised the figure for each of the other horses by 25 to keep the relationships the same.

classhandicapper
08-22-2017, 04:16 PM
4. I think it might also help to add a figure rank as someone suggested earlier.

DeltaLover
08-22-2017, 04:25 PM
I thought about that, but the samples might start getting too small.

So in this case how do you structure your patterns? I am asking this because the input to the logistic regression must be of the same size, so if you vary it you will need to somehow fill the missing values with some normalized data, something that can become very challenging.

classhandicapper
08-23-2017, 04:38 PM
If every race has to be the same field size or it will break, that might be one of the issues I have. It can't be a major issue though because the results are good. They just aren't as good as my intuitive weights.

I'm still thinking about it. At least I'm at the stage where I am getting output and learning.

traveler
08-23-2017, 09:58 PM
If you have 1000 races, study only the winners and the horses that finished, say, less than 1 length behind, so you've got, say, 1100 horses.

Take the factor that produced the most winners and remove those winners from your dataset. Now, what factor grabs the most winners from the remaining data?

You've got 2 factors: weight them the same, then raise and lower the weights to get your best answer. Add a 3rd factor, rinse, and repeat. You need to use a "sample" of your total database so that once you have it built you can test against some different data.

You'd be better off studying what makes the favorites lose but few ever want to hear that.

Good luck.
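
The take-the-best-factor-then-remove-those-winners step can be sketched in Python like this. The factor names and the six toy winners are invented; each factor is just a yes/no test applied to a winner.

```python
def greedy_factor_order(winners, factors):
    """Repeatedly pick the factor that matches the most remaining winners,
    then remove those winners before scoring the next factor."""
    remaining = list(winners)
    available = dict(factors)
    order = []
    while remaining and available:
        name = max(available, key=lambda n: sum(available[n](w) for w in remaining))
        covered = [w for w in remaining if available[name](w)]
        if not covered:
            break  # no remaining factor explains any remaining winner
        order.append((name, len(covered)))
        remaining = [w for w in remaining if not available[name](w)]
        del available[name]
    return order

# Hypothetical winners, each described by three yes/no factors.
names = ("top_speed", "top_class", "won_last")
winners = [
    {"top_speed": True,  "top_class": True,  "won_last": False},
    {"top_speed": True,  "top_class": False, "won_last": False},
    {"top_speed": True,  "top_class": True,  "won_last": True},
    {"top_speed": False, "top_class": True,  "won_last": True},
    {"top_speed": False, "top_class": False, "won_last": True},
    {"top_speed": True,  "top_class": False, "won_last": False},
]
factors = {name: (lambda w, key=name: w[key]) for name in names}
order = greedy_factor_order(winners, factors)
print(order)
```

Here the second pick is the factor that explains the winners the first one missed, not the second-best factor overall. To follow the advice about testing on different data, run this on one part of the database and check the result on the rest.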

classhandicapper
08-25-2017, 11:52 AM
1. I like the idea of studying losing favorites. I already do that on some level via queries against my database. I hadn't thought about using regression. :ThmbUp:

2. On your first point, I've tried things like that but you run into issues. That's why I am trying to use a more formal regression.

It's very easy to find the appropriate weights when you have 2 factors, but when you add a 3rd, 4th, 5th etc... it gets trickier.

To make it simple, let's say my research says that factor 1 and factor 2 should each be weighted at 50% to get the optimal result.

Next I use the combination of 1 and 2 with factor 3.

Let's say it says Factor 3 should be 20%.

That means factor 1 and 2 are 40% each and factor 3 is 20%.

That may be a good answer, but not necessarily the correct answer because a lot of stats overlap to some degree. A formal regression might come up with a better result by using different weights.
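
The arithmetic in that example, as a tiny Python sketch (the function name is made up): giving a new factor a share scales the existing weights down proportionally so everything still adds up to 100%.

```python
def add_factor(weights, new_weight):
    """Shrink the existing weights proportionally to make room for a new factor."""
    scaled = [w * (1.0 - new_weight) for w in weights]
    return scaled + [new_weight]

step_two = [0.5, 0.5]                  # factors 1 and 2 at 50% each
step_three = add_factor(step_two, 0.2)  # factor 3 comes in at 20%
print([round(w, 2) for w in step_three])
```

This keeps factors 1 and 2 locked in a 50/50 ratio forever, which is exactly the limitation described above: a regression fitted on all three at once is free to change that ratio.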

Red Knave
08-25-2017, 12:07 PM
If the top figure in a race was 115, I set it to 100 and then lowered the figure for each of the other horses in the race by 15 to keep the relationships the same.

If the top figure in the race was 75, I set it to 100 and then raised the figure for each of the other horses by 25 to keep the relationships the same.

This would work better if you use the rankings rather than the values.
Simply adding or subtracting in order to normalize the values will change their proportional relationships to one another. In order to keep those relationships the same, you should calculate the multiplier required to change your max value to 100 and then use that to modify the other values.
i.e., 100 / 115 ≈ 0.87 and 100 / 75 ≈ 1.33, so multiply the values by these quotients to get them in the same range.
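
To see the two normalizations side by side, here is a small Python sketch with made-up figures. Shifting keeps the point gaps between horses; scaling keeps their ratios.

```python
def shift_to_100(figures):
    """Add or subtract a constant so the top figure becomes 100 (keeps point gaps)."""
    offset = 100 - max(figures)
    return [f + offset for f in figures]

def scale_to_100(figures):
    """Multiply by 100 / top so the top figure becomes 100 (keeps ratios)."""
    top = max(figures)
    return [f * 100 / top for f in figures]

# Two hypothetical races with the same 15- and 23-point gaps behind the top horse.
fast_race = [115, 100, 92]
slow_race = [75, 60, 52]

print(shift_to_100(fast_race), shift_to_100(slow_race))
print([round(f, 1) for f in scale_to_100(fast_race)],
      [round(f, 1) for f in scale_to_100(slow_race)])
```

After shifting, both races look identical. After scaling, the trailing horses in the slow race end up further below 100 than those in the fast race. Which behaviour is wanted depends on whether a point is meant to be a fixed number of lengths.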

mikesal57
08-27-2017, 09:18 AM
Take the factor that produced the most winners, remove those winners from your dataset, now what factor grabs the most winners from the remaining data.



Good luck.

Would it be better to run another query with that top factor part of it???

Doesn't a winning horse have some sort of relationship with each factor?

Say your top factor is a class-based one... if you take it out, it's like starting from scratch again... But if you leave it in, then other factors will take that one into the mix...

just a thought..

mike

classhandicapper
08-27-2017, 06:13 PM
This would work better if you use the rankings rather than the values.
Simply adding or subtracting in order to normalize the values will change their relationships to one another. In order to keep relationships the same you should calculate what is required to change your max value to equal 100 and then use that to modify the other values.
i.e. - 100 / 115 = 0.87 and 100 / 75 = 1.33 so multiply the values by these quotients to get them in the same range.

Are you essentially saying that 30 to 15 is different than 100 to 85 even though both are a 15 point difference?

I thought about that and agree that's clearly the case with math, but I'm not so sure that's the case when we are talking about the difference between horses, because it represents a fixed number of lengths.

If some horse at Finger Lakes is 5 lengths faster than his opposition is that different than if Arrogate is 5 lengths faster than Gun Runner?

I can try it.

Red Knave
08-29-2017, 07:46 AM
Are you essentially saying that 30 to 15 is different than 100 to 85 even though both are a 15 point difference?

Yes. And my thought was more that the 2nd, 3rd or 4th rank will be unduly rewarded or penalized by simple adding/subtracting. Especially if these ratings flow to impact other ratings.

JJMartin
08-29-2017, 02:25 PM
1. I like the idea of studying losing favorites.

Study the winners in the losing favorites races. Especially when they are 3rd or higher ranking in post time odds.

ReplayRandall
08-29-2017, 02:32 PM
Study the winners in the losing favorites races. Especially when they are 3rd or higher ranking in post time odds.

Now we finally have something to delve into...:ThmbUp: