Horse Racing Forum - PaceAdvantage.Com - Horse Racing Message Board

Old 06-18-2015, 10:35 AM   #1
mbugg1976
Registered User
 
Join Date: Jun 2015
Posts: 3
Algorithms Regression etc

Hi All

First post from me from a rare sunny England

My question: I have a load of past data that I would like to analyse to predict future races.
I'm not too bothered whether it gives me an odds line at the end or just a score.

I have used logistic regression, but I think it is weighting some variables too high.
I have also used a simple correlation to pick the top 20 variables, then removed any variables that are highly correlated with each other, and then weighted the leftover variables.
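For anyone wanting to try the correlation filter described above, here is a minimal sketch in plain Python. The column names, the toy numbers, and the 0.8 cut-off for "highly correlated" are all assumptions made up for illustration, not values from the original post.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return sxy / sqrt(sxx * syy)

def select_variables(columns, target, top_n=20, max_pair_corr=0.8):
    """columns: dict of name -> list of values; target: list of 0/1 (won or not)."""
    # Rank variables by absolute correlation with the outcome.
    ranked = sorted(columns,
                    key=lambda name: abs(pearson(columns[name], target)),
                    reverse=True)
    kept = []
    for name in ranked[:top_n]:
        # Skip a variable that is nearly a copy of one already kept.
        if all(abs(pearson(columns[name], columns[k])) < max_pair_corr for k in kept):
            kept.append(name)
    return kept

# Toy data (hypothetical): 'speed' and 'speed_copy' are almost identical,
# 'days_off' is only loosely related to either.
columns = {
    "speed": [70, 65, 80, 60, 75, 85],
    "speed_copy": [71, 64, 81, 61, 74, 86],
    "days_off": [30, 14, 21, 45, 7, 28],
}
won = [0, 0, 1, 0, 0, 1]
kept = select_variables(columns, won)
print(kept)
```

Only one of the two near-duplicate speed columns survives. Which one depends on which happens to correlate slightly better with the result, which is a reminder that this filter is fairly arbitrary among near-duplicates.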

So what's out there to pick the best variables to use?
Are there any other ways that someone with no data analysis experience can use?

Cheers
Michael

Now time for a cup of tea and watch Royal Ascot
Old 06-18-2015, 10:58 AM   #2
Robert Goren
Racing Form Detective
 
Join Date: Jul 2007
Location: Lincoln, Ne but my heart is at Santa Anita
Posts: 16,316
I start by testing one variable at a time. I take the best one and try it with each of the rest, again one at a time. I use the p-values that go with the z-scores (the Pr(>|z|) column in the glm output) to determine the best one. Then I add in a third variable, and so forth. The lower the p-value, the more useful the variable is. I use .05 as the upper limit. I use Rcmdr (R Commander) in R, with glm to get the values. I think some people might use the R-squared stat. For a while it was the gold standard, but I have my doubts about how useful it is in logit models. In multi-variable, non-binomial regression, such as comparing SRs plus whatever else to finishing position, it does generally tell you when to stop adding variables.
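The add-one-variable-at-a-time procedure above can be sketched as a greedy loop. This is only an outline in Python, not the R workflow itself: `fit_score` is a hypothetical stand-in for "fit the glm with these variables and read off a number where higher is better", and the toy scoring function and its values are invented for the example.

```python
def forward_select(candidates, fit_score, min_gain=0.0):
    """Greedy forward selection.

    candidates: list of variable names.
    fit_score:  callable taking a list of names and returning a number where
                higher is better (hypothetical stand-in for a real model fit).
    """
    chosen = []
    best = fit_score(chosen)
    remaining = list(candidates)
    while remaining:
        # Try each remaining variable alongside the ones already chosen.
        scored = [(fit_score(chosen + [name]), name) for name in remaining]
        top_score, top_name = max(scored)
        if top_score - best <= min_gain:
            break  # nothing left adds enough, so stop
        chosen.append(top_name)
        remaining.remove(top_name)
        best = top_score
    return chosen

# Toy scoring function for illustration only: each variable has a made-up
# value, and every extra variable costs a fixed penalty of 1.0.
toy_value = {"speed": 5.0, "class": 3.0, "jockey_cap_colour": 0.1, "days_off": 1.5}

def toy_score(names):
    return sum(toy_value[n] for n in names) - 1.0 * len(names)

selected = forward_select(list(toy_value), toy_score)
print(selected)
```

With N candidate variables this needs at most N + (N-1) + ... + 1 fits, rather than one fit for each of the 2^N possible subsets.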
__________________
Some day in the not too distant future, horse players will be betting on computer generated races over the net. Race tracks will become casinos and shopping centers. And some crooner will be belting out "there used to be a race track here".
Old 06-18-2015, 11:19 AM   #3
mbugg1976
Registered User
 
Join Date: Jun 2015
Posts: 3
Cheers Robert

May take a while, as I have 95 different variables
Old 06-18-2015, 07:02 PM   #4
traynor
Registered User
 
Join Date: Jan 2005
Posts: 6,626
You might find it worthwhile to invest a bit of time in learning to write basic code. Horse racing is not that complicated, and data mining the limited base of horse data is even less complicated. There are many options that are available (WEKA, Anaconda3, RapidMiner, others) that should not take more than a couple of hours to learn to use well enough to do what you indicate you want to do.

The problem is exponential searches. You might want to narrow the field a bit first, extracting the factors with the most impact, and combining limited sets. Divide the number of hits by the number of matches. More hits, fewer matches, good. Pick the factors with the highest ratios of hits to matches for starters.

One thing that many data crunchers don't want to admit--combining factors does not always improve performance. And the more factors you combine--regardless of how well they seem to work--the greater the tendency of the end result to only fit the specific dataset you started with. As in "backfitting" or "overfitting."

There is a set of (very short, not at all complex) videos on WEKA (from Waikato University) on YouTube. You might find it worthwhile to watch them, just for suggestions on how to avoid the most common rookie errors in data mining.

If you think you have something worthwhile, you would probably be well advised to do the data mining yourself. It is not that tough, and not that complicated. It would be dismaying to develop the world's greatest handicapping method and wake up to discover it posted on a half dozen open-source forums. Or offered for sale by someone you have never heard of.

https://www.youtube.com/watch?v=m7kpIBGEdkI

Last edited by traynor; 06-18-2015 at 07:05 PM.
Old 06-18-2015, 08:33 PM   #5
mbugg1976
Registered User
 
Join Date: Jun 2015
Posts: 3
Thanks Traynor

Not really understanding the below

Quote:
The problem is exponential searches. You might want to narrow the field a bit first, extracting the factors with the most impact, and combining limited sets. Divide the number of hits by the number of matches. More hits, fewer matches, good. Pick the factors with the highest ratios of hits to matches for starters.
I can use WEKA, but there are so many classifiers that it left me slightly confused.
Old 06-18-2015, 10:10 PM   #6
traynor
Registered User
 
Join Date: Jan 2005
Posts: 6,626
You look for X (whatever it may be). You look in all entries, all races to see how many times you find a match for X. Example, jockey with green cap. Those are matches. Then find how many of those matches were wins--however you define it--usually "won the race" or close, or something similar. Those are the hits.

In 100 races, 300 entries match X. Of those 300, 30 won. The value of X as a factor is 30/300 = 0.10.

If you use that basic qualifier first, it will quickly locate the most significant of the variables.

Initially, I use a coded "pattern" to pre-qualify. Faster, easier, and simpler than most other approaches. The data is in a long string (for example) separated by commas. Split it at the commas. Identify the element of interest (46). Identify the element you want as "win" (79 or whatever). Loop through the whole data clump, finding how many times element 46 matches X, and how many times element 79 is "won the race" or whatever.

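' lines holds the raw comma-separated records; target is the value X being tested
For Each line As String In lines
    Dim data() As String = line.Split(","c)
    If data(46) = target Then
        ' increment matches
        matches += 1
        If data(79) = "won" Then
            ' increment wins
            wins += 1
        End If
    End If
Next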
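The same hits-over-matches count can be sketched in Python for every value of a factor at once. The column positions, the "won" label, and the cap-colour sample are placeholders made up for the example; real data would use whatever positions (46, 79, and so on) the file actually has.

```python
import csv
import io
from collections import defaultdict

def impact_ratios(rows, factor_col, result_col, win_value="won"):
    """Return {factor value: (matches, hits, hits / matches)}."""
    matches = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        value = row[factor_col]
        matches[value] += 1            # entry matches this factor value
        if row[result_col] == win_value:
            hits[value] += 1           # ...and it also won
    return {v: (m, hits[v], hits[v] / m) for v, m in matches.items()}

# Made-up sample: column 0 = cap colour, column 1 = result.
sample = "green,won\ngreen,lost\nred,lost\ngreen,lost\nred,won\nblue,lost\n"
rows = list(csv.reader(io.StringIO(sample)))
ratios = impact_ratios(rows, factor_col=0, result_col=1)

# List factor values from the highest hit ratio to the lowest.
for value, (m, h, r) in sorted(ratios.items(), key=lambda kv: kv[1][2], reverse=True):
    print(f"{value}: {h}/{m} = {r:.2f}")
```

A ratio built on only a handful of matches is not worth much, so in practice it probably makes sense to ignore values with very few matches before ranking.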


I did a research project awhile back that analyzed some insane number of horses and races (every race run in Australia in five years). It was MUCH easier to use code than to try to fix the (horribly broken) database that it came in.

Last edited by traynor; 06-18-2015 at 10:12 PM.
Old 08-20-2017, 05:26 PM   #7
classhandicapper
Registered User
 
Join Date: Mar 2005
Location: Queens, NY
Posts: 20,527
I have a question on regression analysis.

Let's assume I have a set of data on each horse for over 1000 races. I want to know how to weight each of these factors to produce the most likely winner.

Let's say I want to use 5 factors to help predict the outcome

1. Speed Figure Last Race
2. Class Figure Last Race
3. Finish Position Last Race
4. Lengths Behind at Finish Last Race
5. Winning Margin Last Race (if the horse won)

I want to use those 5 factors to predict the finish position in today's race (which I also have in my data along with Date, Track, and Race).


I've seen some very basic regression analysis done in Excel, but one thing I don't understand is how to make the analysis recognize that each race is its own unit/group.

It has to understand that a 70 speed figure might have been the top figure in one race and led to a win, but in another race that 70 speed figure may have been the worst speed figure and led to a 9th.
__________________
"Unlearning is the highest form of learning"
Old 08-21-2017, 07:43 AM   #8
FakeNameChanged
Registered User
 
Join Date: Jan 2010
Posts: 2,176
It sounds like you may want to try multiple regression on the speed figure relative to the rest of the field, rather than on the raw number. For example, what is the ranking of a 70 speed figure relative to all the others in a particular race? Or how does a 70 relate to the average winning figure for that class of race? A simpler way, although maybe not as accurate, is to sort all the last-race (l.r.) speed figures in a race, then have Excel assign a ranking: 1, 2, 3, etc. The same technique can be used for class, whether it's average $ per start or some other class number.
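The per-race ranking described above does not have to be done in Excel. Here is a small Python sketch; the field names `race_id` and `speed_fig` and the sample figures are assumptions for illustration.

```python
from collections import defaultdict

def add_race_ranks(entries, key="speed_fig"):
    """Add '<key>_rank' to each entry, where 1 = best figure in that race.

    Ties keep their input order, so tied horses get consecutive ranks.
    """
    by_race = defaultdict(list)
    for e in entries:
        by_race[e["race_id"]].append(e)
    for runners in by_race.values():
        ordered = sorted(runners, key=lambda e: e[key], reverse=True)
        for position, e in enumerate(ordered, start=1):
            e[key + "_rank"] = position
    return entries

# Made-up entries: the same 70 figure is top in race A and bottom in race B.
entries = [
    {"race_id": "A", "horse": "a1", "speed_fig": 70},
    {"race_id": "A", "horse": "a2", "speed_fig": 65},
    {"race_id": "A", "horse": "a3", "speed_fig": 60},
    {"race_id": "B", "horse": "b1", "speed_fig": 90},
    {"race_id": "B", "horse": "b2", "speed_fig": 85},
    {"race_id": "B", "horse": "b3", "speed_fig": 70},
]
add_race_ranks(entries)
for e in entries:
    print(e["race_id"], e["horse"], e["speed_fig"], e["speed_fig_rank"])
```

The rank column, rather than the raw figure, is then what goes into the regression.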
__________________
One of the downsides of the Internet is that it allows like-minded people to form communities, and sometimes those communities are stupid.
Old 08-21-2017, 08:16 PM   #9
Nitro
Registered User
 
Join Date: Feb 2009
Location: NY
Posts: 18,945
I posted this on another thread related to HK racing. However, it actually has more significance here as related to the value of statistical analysis.

Quote:
Originally Posted by castaway01 View Post
However, I will point out that the chief cheerleader for Hong Kong here, Nitro, claims he only bets based on tote-board action. So, for all of the wonderful, amazingly clean Hong Kong racing he always trumpets, he's STILL wagering based on insider moves and supposed conspiracies. Sounds wonderful.
Correct.
It is “wonderful” when you can often glean an inside track of potential contenders in a race!

For those like me, who acknowledge the significance of money flow as a “Given”, we really could care less about where it’s coming from. What’s more important is its flow: when and where it’s going into the various betting pools. Insinuating “Conspiracy theories” is a feeble explanation for underestimating the objectives and intent of those on the inside: the connections. Who BTW don’t often publicly reveal their intentions, but would rather “Let their money do their talking” as objective evidence of their confidence.

Yet, you have to enjoy comments like these (At least I do); because it clearly demonstrates (once again) a complete lack of appreciation of the realities of the horse racing game from the perspective of the typical handicapper. I’m not going to pursue this very deeply, because it’s getting old.

However, I would enjoy reading a valid argument (from any credible source) as to how typical handicapping methodology in general (Computerized or otherwise) can rationalize 2 very basic but critical aspects of the game from an Outsider’s perspective:
1) Can it predict whether or not each and every entry in a race is actually going to make an attempt to Win it?
2) How all of the past performance data or statistical analysis can determine the current physical and mental well-being of each horse entered in a race?

Although I won’t take this any further, you might want to consider what Bill Benter has personally stated. If you don’t recognize the name here’s a link:
http://www.worlds-greatest-gamblers....illiam-benter/
If you don’t respect his credibility, I would assume that you’re beyond his accomplishments and capabilities. So don't bother trying to comprehend the significance of his comments (below):

Excerpts from:
“Computer Based Horse Race Handicapping and Wagering Systems: A Report”
by William Benter


INTRODUCTION
The question of whether a fully mechanical system can ever "beat the races" has been widely discussed in both the academic and popular literature. Certain authors have convincingly demonstrated that profitable wagering systems do exist for the races. The most well documented of these have generally been of the technical variety, that is, they are concerned mainly with the public odds, and do not attempt to predict horse performance from fundamental factors. Technical systems for place and show betting, (Ziemba and Hausch, 1987) and exotic pool betting, (Ziemba and Hausch,1986) as well as the 'odds movement' system developed by Asch and Quandt (1986), fall into this category. A benefit of these systems is that they require relatively little preparatory effort, and can be effectively employed by the occasional race goer.

The complexity of predicting horse performance makes the specification of an elegant handicapping model quite difficult. Ideally, each independent variable would capture a unique aspect of the influences effecting horse performance. In the author's experience, the trial and error method of adding independent variables to increase the model's goodness-of-fit, results in the model tending to become a hodgepodge of highly correlated variables whose individual significance's are difficult to determine and often counter-intuitive.

Additionally, there will always be a significant amount of 'inside information' in horse racing that cannot be readily included in a statistical model. Trainer's and jockey's intentions, secret workouts, whether the horse ate its breakfast, and the like, will be available to certain parties who will no doubt take advantage of it. Their betting will be reflected in the odds. This presents an obstacle to the model developer with access to published information only. For a statistical model to compete in this environment, it must make full use of the advantages of computer modeling, namely, the ability to make complex calculations on large data sets.

The odds set by the public betting yield a sophisticated estimate of the horses' win probabilities.

It can be presumed that valid fundamental information exists which can not be systematically or practically incorporated into a statistical model. Therefore, any statistical model, however well developed, will always be incomplete. An extremely important step in model development, and one that the author believes has been generally overlooked in the literature, is the estimation of the relation of the model's probability estimates to the public's estimates, and the adjustment of the model's estimates to incorporate whatever information can be gleaned from the public's estimates. The public's implied probability estimates generally correspond well with the actual frequencies of winning.


BTW the last 2 sentences express exactly how Mr. Benter was able to achieve his success.
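To make the idea in those last two sentences concrete, here is a rough Python sketch of one way to fold the public's estimate into a model's estimate: a log-linear blend that is then renormalised so the probabilities sum to 1. The weights `alpha` and `beta`, the decimal odds, and the model probabilities are all invented for the example. In practice the weights would have to be estimated from past results, and nothing here claims to reproduce Mr. Benter's actual numbers.

```python
from math import exp, log

def implied_probs(decimal_odds):
    """Public win probabilities from decimal odds, normalised so they sum to 1."""
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    return [r / total for r in raw]

def blend(model_probs, public_probs, alpha, beta):
    """Log-linear blend: weight alpha on the model, beta on the public."""
    scores = [exp(alpha * log(f) + beta * log(p))
              for f, p in zip(model_probs, public_probs)]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical three-horse race.
model = [0.50, 0.30, 0.20]                 # the model's own estimates
public = implied_probs([1.8, 3.0, 4.5])    # from the tote board
combined = blend(model, public, alpha=0.5, beta=0.5)
print([round(p, 3) for p in public])
print([round(p, 3) for p in combined])
```

Setting `beta` to 0 gives back the model's own line, and setting `alpha` to 0 gives back the public's line, so the two weights express how much each source is trusted.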
Old 08-21-2017, 08:44 PM   #10
traveler
Registered User
 
Join Date: Feb 2003
Location: NY
Posts: 245
Don't discount just using the rank rather than the actual SR or Pace figure etc.
If you want to go with an actual figure, try the difference-from-top idea: if 70 is the top SR in the race, the 70 gets a 0 and a 65 gets a 5 (or -5), etc. All numbers should be viewed in the relative context of today's race.
I think at one point Benter figured the public odds were 65% or so of his line.
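The difference-from-top idea above is a one-liner per race. A tiny Python sketch, with made-up figures:

```python
def diff_from_top(figs):
    """Points behind the best figure in the race (top horse = 0)."""
    top = max(figs)
    return [top - f for f in figs]

# The same 70 is the top figure in one race and 20 points off in another.
race_a = diff_from_top([70, 65, 60])
race_b = diff_from_top([90, 85, 70])
print(race_a)
print(race_b)
```

Whether the gap is stored as positive or negative does not matter to a regression as long as it is done the same way for every race.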
Old 08-22-2017, 09:58 AM   #11
classhandicapper
Registered User
 
Join Date: Mar 2005
Location: Queens, NY
Posts: 20,527
I tried something approximating difference from the top and got pretty good results, but I think I see a flaw in the approach.

The regression is looking at multiple field values for each horse as input and then the finish position of the horses. So it is more or less trying to maximize the ability to rank all the horses in a race correctly. I'm really only interested in the ability to pick winners. It's not doing that nearly as well as I can with weights for each factor that I came up with via trial and error.

So I need to somehow stress that the winners are the key.

Perhaps instead of ranking all the horses in each race I could look at just the top 3 finishers?
__________________
"Unlearning is the highest form of learning"
Old 08-22-2017, 10:58 AM   #12
DeltaLover
Registered user
 
Join Date: Oct 2008
Location: FALIRIKON DELTA
Posts: 4,439
Logistic regression can be seen as a special case of a neural network; many problems cannot be solved with the former and require the latter. Predicting finish positions, as you describe here, is one such problem.

What I see as the major challenge though, is not the algorithm to be used but the way to present the data to it; some of the data preprocessing tasks that need to be addressed are the following:

  • What metrics to use? (ex: speed or pace figures, closing figures etc)
  • How to generate the necessary metrics? (ex: track variant estimation, cross distance - track adjustment etc)
  • How many past performances to use? (ex: Do we need individual models based on the number of available past performances? How to handle shippers? etc)
  • Should metrics be normalized using a per race window or passed in absolute values?
  • How to pass race level data? (ex: wire to wire winning stats, average speed figures for all starters etc)
  • What kinds of primitive (predicate) handicapping factors to pass, and how? (ex: layoffs, dirt to turf, first lasix etc)
  • How to handle connections? (ex: jockey / trainer changes etc)

Even after answering all these questions, we still need to decide how to formulate the “target” of the model. By this I mean that simply targeting raw finish order may well not be useful, since at best it will match the crowd’s ranking.
__________________
whereof one cannot speak thereof one must be silent
Ludwig Wittgenstein
Old 08-22-2017, 11:27 AM   #13
classhandicapper
Registered User
 
Join Date: Mar 2005
Location: Queens, NY
Posts: 20,527
The goal right now is very simple.

Input a handful of factors, run a regression analysis that will supply me the weight for each factor that will produce the most winners. That's all.

As I described previously, when I did that, it produced a set of weights that picked a lot of winners for the highest ranked horse. However it underperformed a set of weights for each factor I had defined myself using handicapping experience and tweaking further with trial and error. I was expecting improvement over what I could do intuitively and through trial and error.

So that got me thinking.

The regression isn't trying to solve for just picking winners. It's trying to solve for getting the 2nd ranked horse to come in 2nd, the 3rd ranked horse to come in 3rd ... the 7th ranked horse to come in 7th, etc. So the regression may be doing a better job than me at ranking all the horses, but it's not doing a better job than my weights at picking winners.

So the question becomes: how do I use all the same input but set it up so it provides the weights that maximize the top ranked horse winning?

Right now it is comparing the input factors to the finish position.
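One possible way to target winners only, while still treating each race as its own group, is a conditional (multinomial) logit: each horse gets a score from the weighted factors, the scores in a race are turned into win probabilities that sum to 1, and the weights are chosen to make the actual winners as probable as possible. This is a swapped-in technique, not something anyone above said they used. The sketch below is plain Python with simple gradient ascent, and the two features (speed and class relative to the top of the race), the four races, and the learning rate are all made up.

```python
from math import exp, log

def race_probs(weights, race):
    """Win probabilities for one race; race is a list of per-horse feature lists."""
    scores = [sum(w * x for w, x in zip(weights, runner)) for runner in race]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(weights, races, winners):
    """Sum over races of log P(actual winner wins)."""
    return sum(log(race_probs(weights, r)[k]) for r, k in zip(races, winners))

def fit_conditional_logit(races, winners, n_features, lr=0.1, epochs=500):
    """winners[i] is the index of the winning horse in races[i]."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for race, win in zip(races, winners):
            probs = race_probs(weights, race)
            for j in range(n_features):
                expected = sum(p * runner[j] for p, runner in zip(probs, race))
                grad[j] += race[win][j] - expected   # winner's value minus expected value
        weights = [w + lr * g / len(races) for w, g in zip(weights, grad)]
    return weights

# Hypothetical data: [speed relative to top, class relative to top] per horse.
races = [
    [[0.0, -0.2], [-0.5, 0.0], [-1.0, -0.5]],
    [[-0.3, 0.0], [0.0, -0.4], [-0.8, -0.1]],
    [[0.0, 0.0], [-0.2, -0.3], [-0.6, -0.2]],
    [[-0.4, 0.0], [0.0, -0.1], [-0.1, -0.6]],
]
winners = [0, 1, 0, 0]

weights = fit_conditional_logit(races, winners, n_features=2)
print("weights:", [round(w, 3) for w in weights])
print("log-likelihood at zero weights:", round(log_likelihood([0.0, 0.0], races, winners), 3))
print("log-likelihood after fitting:  ", round(log_likelihood(weights, races, winners), 3))
```

Only the winner of each race enters the target, so 2nd through last are not being ranked at all, and field size is handled automatically because the probabilities are normalised within each race.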
__________________
"Unlearning is the highest form of learning"
Old 08-22-2017, 11:34 AM   #14
DeltaLover
Registered user
 
Join Date: Oct 2008
Location: FALIRIKON DELTA
Posts: 4,439
Does this mean that you are developing separate models based on the number of starters?
__________________
whereof one cannot speak thereof one must be silent
Ludwig Wittgenstein
Old 08-22-2017, 01:52 PM   #15
sjk
Registered User
 
Join Date: Feb 2003
Posts: 2,105
If you are doing a linear model you want to do something to limit the effect of a horse being beaten by double-digit lengths, such as capping the beaten-lengths parameter at a predetermined value, or you might use 10 minus beaten lengths and bottom out at 0.

On the theory that a horse that runs last in a 6 horse field hasn't really done anything better than a horse that runs last in a 12 horse field, you might limit the finish position (or use 6 minus finish position as above).
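The two caps just described are easy to write down. A short Python sketch, using the 10 and 6 from the suggestions above as defaults (the function names are made up):

```python
def capped_beaten_lengths(beaten_lengths, cap=10.0):
    """10 minus beaten lengths, bottoming out at 0 (winner = 10, beaten 10+ = 0)."""
    return max(0.0, cap - beaten_lengths)

def capped_finish(finish_position, cap=6):
    """6 minus finish position, bottoming out at 0 (1st = 5, 6th or worse = 0)."""
    return max(0, cap - finish_position)

for lengths in (0.0, 2.5, 9.0, 25.0):
    print(lengths, "->", capped_beaten_lengths(lengths))
for position in (1, 3, 6, 12):
    print(position, "->", capped_finish(position))
```

After this transform a horse beaten 25 lengths and one beaten 12 lengths look the same to the model, which is the point: a blowout no longer dominates the fit.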

The speed and class numbers should probably be in relation to what would be expected at that level.

I would be leery of the information that is only available for winners. That is probably not going to fit in a nice linear manner with the others.
All times are GMT -4.