
View Full Version : Factor Models


highnote
03-17-2002, 01:25 AM
Does anyone have experience devising computerized Factor Models for handicapping?

I've done one and am willing to share my experience.

Here's my wishlist:

1. Create a giant database of racing data from which various handicapping factors can be tested.
2. Create a handicapping model.
3. Build a computerized online betting system.
4. Work with a team because I certainly don't have time to do it alone.

Please realize that there are almost certainly others doing this in the U.S. right now.

andicap
03-17-2002, 11:44 AM
Here's my problem with factor models that use velocity ratings: they all rely on computerized paceline selection, which is ridiculous. No matter how you program it, the computer will pick silly pacelines often enough to screw up your model.
At the very least, it will take a race with a really slow or fast pace that you should throw out. They also ignore form cycles, so a horse that throws in a lifetime best last race and then bounces to the moon gets rated off that race, because the selected paceline will be the atypical one.
Also, a horse might be a "p" horse for the most part but went to the lead last time..... an atypical paceline that shouldn't be used, but most computers will use it.

For models to work, you need to double-check the computer-selected lines, which is way too time-consuming for most people to do every day unless you concentrate on only one track.

highnote
03-17-2002, 01:16 PM
Andicap,
I'm not sure I made clear what my factor model was like.

The model I developed had 30 factors. Some of the factors were actually compound factors - that is, some of them were new factors created by combining two or more factors into one new factor.

In theory, I wanted my model to use the lifetime past performances of all the horses in a race so that the form cycle could be captured. I had a pretty good database. In many cases I had lifetime pps, but when I encountered a shipper I was usually stuck with using the horse's last 10 races.

One of my factors was average beaten lengths in a sprint or "blsavg". I calculated the average beaten lengths for a horse in all its races and then assigned a higher value for recent races and a lower value for older races - in other words the beaten lengths were recency weighted.

A compound factor was then devised by multiplying "blsavg" by "blspct", where "blspct" is the horse's share of the field's total beaten lengths: sum the beaten lengths by which each horse in the race has been beaten, then divide each horse's total by the sum for all the horses in the race to get its beaten-length percentage.

I found that "blspct" by itself wasn't useful, but when combined with "blsavg" it was very useful. Even though "blsavg" was already included in the model, the model was even better with the compound factor.

Some of my other factors were: days away, beaten lengths in a route, speed figures, class rating, early speed in a sprint, a jockey factor based on an auxiliary regression, weight, and finish position.

I would have taken it much further, but couldn't justify funding the project based on the results I was getting and the fact that my wife and I wanted a house and a baby. Life gets in the way of handicapping.

I would like to continue its development, but I need to find some like-minded people.

GameTheory
03-17-2002, 03:46 PM
I've done some similar stuff, mostly trying to get the computer to "discover" factors or correlations on its own through artificial intelligence techniques like genetic algorithms and neural networks. I've had mixed but encouraging results. I've never had enough (or good enough) data to do it properly though. That is one reason I made my charts parser, so I could build up a database without spending a fortune.

I believe a human-coded program by a knowledgeable handicapper would be the best, but at least half of my interest was in the computer algorithms themselves. I would love to get my hands on a big dataset of pp's & results to try some stuff out on...

highnote
03-17-2002, 05:13 PM
GameTheory,
Maybe I can send you my BRIS/DRF pps and the associated BRIS comma-delimited charts as payment for your parser software you were so kind to send me?

John

GameTheory
03-17-2002, 05:21 PM
As we were discussing in another thread, that would be illegal. So e-mail me...;)

highnote
03-17-2002, 06:49 PM
GameTheory,
You're probably correct that it would be illegal to send them as payment. The IRS would consider that bartering. If I gave them to you as a gift then you might even have to declare them on your taxes. If, hypothetically speaking, you were to send me $1 as payment then there would probably be no problem - as long as I declared the $1 on my taxes.

Hey, wasn't this the way Enron's accountants did business?

GameTheory
03-17-2002, 08:06 PM
Oh man,

I didn't even think about that. I was referring to the fact that you're not allowed to re-distribute the BRIS data...

highnote
03-17-2002, 08:35 PM
Not allowed to redistribute BRIS data.

Unless someone was doing a Napster-like thing and posting their downloaded BRIS data on a website for others to download for free, I doubt that me sending an individual a file or two is something BRIS would sue over. I wonder if that would really be illegal.

That's like saying it's illegal to videotape the last Sopranos episode and give it to someone else to watch. If anything, that type of "passing along" increases awareness of the show and ultimately its fan base.

So if I send someone a bunch of free BRIS data files and that person develops a killer app isn't it likely that they'd continue to use BRIS data - since that's the data they've already written code for?

Just a thought.

PaceAdvantage
03-17-2002, 10:08 PM
Not likely BRIS would see things that way.....they are fairly vigilant about protecting their turf...big or small fish alike.

Let's just chalk the last few posts up to "thinking out loud" and restrict any further talk about this subject to e-mail.....ok??

Nobody needs BRIS breathing down my neck accusing this site of being the NAPSTER of horse racing data files..... ;)


==PA

highnote
03-17-2002, 10:43 PM
PA,
I'm sure there's nobody here that would ever give away anything for free -- especially to help someone else get an edge in a pari-mutuel game! HAHAHA
John

GameTheory
03-18-2002, 12:17 AM
PA --

Feel free to delete the last several posts on this thread and we'll forget all about it...

highnote
03-18-2002, 05:53 AM
PA,
I agree with GameTheory. You can delete them.
John

Derek2U
03-18-2002, 11:01 AM
It really sickens me to think how little representation we horseplayers have. Imagine: YOU buy data and then you can't even openly share it with others for non-commercial uses!

One day, when these data hoarders start moving younger, computer-minded people into their executive ranks, maybe then we will all be able to swap data in the open.

I honestly don't understand it all anyway: WE should all be able to call DRF/Equibase and say, "How much for a CD of the last 3 seasons at Churchill Downs & Monmouth Park?"

Pay by credit card... and presto, your data awaits.

Am I insane?

Dave Schwartz
03-18-2002, 12:01 PM
Derek,

You may be insane <G> but certainly not because of your belief on these issues. I have long held two beliefs myself:

1. Yesterday's data (or last month's) should be less expensive than tomorrow's data.

2. A vendor selling a large database would expand the entire sport.


Nevertheless, you cannot force anyone to sell what they do not want to sell.

Regards,
Dave Schwartz

highnote
03-18-2002, 12:04 PM
Does the word Monopoly come to mind?

Rick
03-18-2002, 04:46 PM
Swetyejohn,

I think 30 factors is probably too many for your model. Even though it will look really good on past data, it probably won't be too impressive on future data, especially in the ROI area. You can really hurt yourself by trying to optimize the win %. Try the 3 or 4 factors that are most independent of one another and you'll probably improve your bottom line.

highnote
03-18-2002, 05:04 PM
Thanks for the suggestion. A friend had the same advice. I tried a smaller model using the most predictive variables, but the fit of that model was never as good as with the 30 variable model.

If I don't maximize for the winning percentage then I'm not sure what I'd use. A friend recommended trying to predict a speed figure rather than final odds. I'm not sure it would outperform the public.

My model created an oddsline, and then a second model was created combining my oddsline with the public's oddsline to create a hybrid oddsline, the theory being that the public's odds contain a lot of information mine won't. Also, the public's odds are the most accurate predictor of the outcome over many races. No single model can outperform the public in each and every race every day of the week, so the best we can hope for is to add information to the near-final odds and then bet the overlays.
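
For illustration, one simple way to form that kind of hybrid line is a log-linear blend of the two probability estimates, renormalized within the race. In practice the weights would come from fitting a second model; the 50/50 split and the numbers below are made up:

import numpy as np

def hybrid_line(model_probs, public_probs, alpha=0.5, beta=0.5):
    """Log-linear blend of two probability estimates, renormalized per race."""
    combined = (np.asarray(model_probs) ** alpha) * (np.asarray(public_probs) ** beta)
    return combined / combined.sum()

model_probs  = [0.30, 0.25, 0.20, 0.15, 0.10]   # from the factor model
public_probs = [0.40, 0.20, 0.20, 0.10, 0.10]   # implied by the tote board
hybrid = hybrid_line(model_probs, public_probs)

# Overlays: horses whose hybrid probability beats the public's estimate.
overlays = [i + 1 for i, (h, p) in enumerate(zip(hybrid, public_probs)) if h > p]
print(hybrid.round(3), overlays)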

Alternatively, one could just pick and choose his spots. That's what most successful players probably do. However, I want to bet nearly every race and still make a profit. Maybe it's impossible. It's certainly difficult!

Rick
03-18-2002, 06:24 PM
Here's my advice (for what it's worth).

1. Get the book "Efficiency of Racetrack Markets", edited by Hausch, Lo, & Ziemba (reprint available at GBC). Read the paper "Computer Based Horse Race Handicapping and Wagering Systems: A Report" by William Benter and several other papers about multinomial logit modeling in horse race handicapping.

2. Get some software to do logit modeling, such as EasyReg (which is free).

3. Input at least 500 races including your favorite variables and post time odds.

4. Choose a small number of variables and include the post-time probability, 1 / (odds + 1), as a variable (the log of that probability would be slightly better). Also include 1 / Number_of_Horses as a variable. If there are first-time starters, you'll need a flag for that too. The dependent variable would be wins.

5. Check to see if the other variables you have chosen are significant at the 5% level in the model that includes actual odds.

6. Include only the (usually very few) variables that contribute significantly to improving the prediction over actual odds.

7. If your data set is too small to get any significant predictions, you can try "explosion", as described in the book, down to the second-place finisher (but probably not the third for US data).

8. Don't try to use finish position or payoff as a dependent variable because they won't produce as good a model as the one mentioned previously.

9. Good luck. You'll need it!
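
If you'd rather do it in Python than EasyReg, the setup for steps 3 through 6 would look something like this. The file name and column names are just placeholders for your own data and factors, and statsmodels is only one package that will fit the logit:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# One row per starter; 'win' is 1 or 0. Column names are placeholders.
df = pd.read_csv("races.csv")   # assumed columns: odds, field_size, first_timer, speed_fig, win

df["log_public_prob"] = np.log(1.0 / (df["odds"] + 1.0))   # log of the post-time probability
df["inv_field_size"]  = 1.0 / df["field_size"]

X = sm.add_constant(df[["log_public_prob", "inv_field_size", "first_timer", "speed_fig"]])
fit = sm.Logit(df["win"], X).fit()

# Keep only the variables that are significant at the 5% level and that
# add something beyond the odds themselves.
print(fit.summary())
print(fit.pvalues[fit.pvalues < 0.05])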

highnote
03-18-2002, 09:37 PM
Rick,
Done that. Like I said earlier my model wasn't easy to implement because we didn't have internet wagering available. So I gave up on it. It wasn't a bad model, but it didn't add enough to the public's odds to be profitable. It might be the case that too many other handicappers have access to too much of the same data and a profitable model based on the available data isn't possible. However, if trip and paddock handicapping is added then it might be possible to come up with something useful. It's a big job.

GameTheory
03-19-2002, 01:07 AM
I think it helps to use a filtered selection of races for your model instead of all races. (Not counting any filtering you might do for different race types, surface, etc.) For instance, only include races where the winner was between 5-1 to 15-1. This will help you zoom in on factors that the public underbets...

Rick
03-19-2002, 02:33 AM
swetyejohn,

Sorry it didn't work out for you. It's tough to find significant independent variables these days but it can be done. You seem to have avoided the usual pitfalls though, so don't stop looking.


GameTheory,

Interesting idea that I've never tried. Horses in that odds range are definitely the ones you need in order to get overlays. I always wanted to try to optimize ROI but could never find a good way to weight payoffs to eliminate the noise. You need to have something that will pick legitimate longshots but not give them more credit than they're due. I tried predicting show payoffs thinking they would be more stable but it didn't work well.

GameTheory
03-19-2002, 03:25 AM
Another idea is to have the model predict something that you can measure BEFORE the race is run in addition to whatever else -- like post-time odds (or nearly post-time, anyway), or ranking the horses by their predicted odds or something.

The idea is to use this as a sort of "confidence score" -- if it is getting THAT right, maybe it has a good handle on this particular race. I think some regression models might have some sort of prediction error, no?

I'm not sure this one has merit -- I need to do some more checking.


The previous idea about filtering your model is very helpful, though. Then you can go ahead and max for win% without worrying so much that you're just going to be focusing on chalk all the time. (And throwing out huge longshots makes for a less noisy model.)

rrbauer
03-19-2002, 06:01 AM
Oh man are you guys making me tired....there's about three lifetimes worth of work mentioned already in this thread!

Here's an angle that I've felt has been under exploited:

If you accept the idea that performance improvement (PI) is a precursor to winning (and here I'm thinking in terms of young horses, developing 3YOs, etc.), then what are the factors associated with PI? How do those factors show up? Are some of the factors inter-related such that when they show up, either in total or in some sequence, it's BINGO time?

One of my favorite angles is the "3YO that got good". This happens more or less overnight - at least that's the way it looks in the PP's. The nice thing about the angle is that every year we have a new crop, and every year we have a new set of 3YOs that get good. The "got good" part manifests itself in a horse winning at a big price, and it looks like a form reversal or some sort of anomaly, like it was in the mud. Hence, the betting public dismisses it as a one-time thing, and next time out the horse is a big price again. Maybe not as big as the initial time, but big enough that a $25 bet will produce a car payment. Sometimes, if the horse is being raised in class or moving through the allowance conditions, the second score gets dismissed too.

One or two scores with these babies can make a meet. A handful of them can make the year! So, between now and the end of the summer, keep your eyes peeled for the "3YO that got good"!

Rick
03-19-2002, 10:59 AM
GT,

I tried using actual odds (actually log odds) as the dependent variable and it didn't work too well. Morning line odds alone are pretty good at that and, of course, that leads nowhere. You can improve that somewhat by adding your own estimate and some jockey data. DRF sweep odds and consensus used to add a little bit too. But even including all of that, you'll only explain about 85% of the variance at most. That leaves 15% for what I would call smart money, or at least non-public and subjective information. The best odds line would combine the actual odds along with your estimate. In fact, it appears that separating out late money would be a little better. But doing all of this in the final minutes would probably require automation and direct access to the pools.


Richard,

If you figure that one out be sure to let me know. I'd like to cash a Kentucky Derby ticket at least once in my lifetime!

GameTheory
03-19-2002, 11:45 AM
Well,

I think it is going to be a good long while before a computer can out-do a human "master" handicapper on a one race at a time basis. In other words, over the small number of races that a single skillful person can do in a reasonable amount of time, I would expect the human (or comp-assisted human) to have a greater edge over that small set than any computer model. Computers are good for gaining a somewhat smaller edge over a larger number of races.


So for the Kentucky Derby, put that computer aside, put on your thinking cap, and don't forget to listen to your gut...

highnote
03-19-2002, 03:25 PM
I used the Multinomial Logistic Regression model outlined by Benter, with a binary dependent variable where a win equals 1 and a loss equals 0. So for any group of horses in a race there is one 1 and several zeros. My output was a number between 0 and 1 which could then easily be converted to an oddsline, although for betting I converted the odds to percentages.

I focused on 6 furlong races on Aqueduct's inner dirt course. In retrospect I should have chosen a different distance and maybe even surface. I included claiming, allowance and stakes races for females and males. I did not use maidens or statebred races in the model.

Next time I may try to develop a model based on turf marathons. I may have to include multiple tracks, but I might still be able to get an edge.

Rick
03-19-2002, 06:05 PM
swetyejohn,

You're on the right track. Horse racing is hard to predict using any standard statistical techniques (as you know). You'll come up with spurious correlations more often than not. If you focus on using fewer independent variables (2-4) and make sure they're as independent of one another as possible, you will come up with something significant. I've been working on this problem for about 20 years and that's the best advice I can give. It's just way harder than you (or I) would ever expect.

I was fortunate to actually get some good advice from some legitimate pros in the late 70's and spent a lot of time checking spot play systems for them for a few years. I learned a lot about contrarian approaches to handicapping from those guys, and most of it still works today. Later, when I tried to develop a more generalized handicapping approach, I was able to incorporate some of these ideas.

I learned a lot about statistical methods in college and spent several years trying to apply them in a simplistic kind of way. Well, as you know, there is a lot more to be learned about predicting anything than what's in a college textbook. There are probably thousands of young people these days thinking that this is an easy game (as I did) and trying to apply these statistical techniques. That's one of the things I love about this game though. Experience is rewarded handsomely, and being an old guy, I'm really happy about that.

highnote
03-19-2002, 06:46 PM
Rick,

Benter says it doesn't matter if the model becomes a hodgepodge of highly correlated variables as long as it is useful. My experience has been that the model improves by including these interactive variables.

On the other hand, I might find that some of the variables are weakly associated with the outcome but when taken together can become important predictors.

My approach is to throw everything into the pot. As long as it can predict the outcome I don't mind if it's not aesthetically pleasing.

As an aside... I talked to a neural net programmer. He said that I might be able to squeeze a little bit more out of my data with a neural net, but that logistic regression is pretty good and is probably the way to go.

I found neural nets to be very slow and not as predictive. Of course, I have very little knowledge on how to set them up. It's like learning a different language to use them.

Rick
03-19-2002, 07:12 PM
swetyejohn,

Yeah, I'd generally trust what Benter says. I've not found that to be true in US racing, but he's the guru of statistical handicapping. My problem is that what I've found with many variables has not held up in out of sample testing.

highnote
03-19-2002, 07:49 PM
I've had a similar experience. The variables look good in sample, but out of sample they don't hold up.

GameTheory
03-20-2002, 12:25 AM
I come from somewhat of an opposite approach from you guys, but with similar results, probably.

I know about neural nets, genetic algorithms, etc., and virtually nothing of regression and statistics. For instance, I have no idea what "least squares" means. I did find a nice tutorial on the net the other day however. Maybe I'll get around to taking a look at it.

Anyway, standard neural nets aren't much good -- they tend to end up being "averagers". With the right framework, the right model, and with everything just perfect they might work OK, but they just take too damn long to train. I use a modified neural net sometimes that trains itself on the fly every time it predicts something. So while most NN's train slow & predict fast, this one doesn't train at all and predicts slow (you have to wait a second to get your answer instead of getting it instantly). But you can change the model & predict again in a jiffy.

My newest adventure is a new kind of genetic programming -- a program that writes itself. Genetic programming has been severely limited in the past, but a recent advance looks like it will take it a big leap forward. I'm coding that system now.


Anyway, back to making models. The big problem I always have is trying to figure out how to segment the data. For instance, John mentioned that he used a model that has binary outputs for each horse. (i.e. 0010 = bet on number 3, 1000 = bet on number 1) Which means each horse must be represented in the data, like this:

Input variables for one race:
variable 1: horse1
variable 2: horse2
variable 3: horse3

(Assume each horse is boiled down to one number, which they wouldn't be.)

But think about that. The model now correlates all "horse1"s as being somehow equivalent, when in fact they just randomly happen to share slot number one in the data, unless you really want to give that much weight to program numbers or whatever order they come in. So you really ought to do this to represent a single race in the model:

h1,h2,h3 1 0 0 (horse #1 won)
h1,h3,h2 1 0 0
h2,h1,h3 0 1 0
h2,h3,h1 0 0 1
h3,h1,h2 0 1 0
h3,h2,h1 0 0 1

Each line represents exactly the same thing. And that's with only three horses. A 10-horse field would take over 3.6 million lines. Which no doubt helps to smooth your data quite a bit, but it also makes it HUGE.

And what about those field sizes? Should I have a different model for each field size, or should I just have 12 slots and leave some of them empty for smaller fields?

Do these issues exist in statistical modelling, or is it just with computer AI stuff? What's the best way to handle it?

Other methods I know of include:

-- making your basic unit one horse, which is by far the easiest to deal with, with some variables thrown in that somehow represent the rest of the field as a whole.

-- "matching" up each horse with one other horse, and predicting which one will come out in front. Repeat for each possible combination. This also produces a huge number of combinations, but with fewer variables involved.


Any opinions on the best way to do things?
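
To be clear about the "matching" idea, I'm picturing something along these lines (differencing the features is just one obvious way to combine a pair, not the only one):

from itertools import permutations

def pairwise_rows(race):
    """Turn one race into head-to-head training rows.

    `race` is a list of (features, finish_position) pairs, one per horse.
    Each row is the feature difference (A minus B), labeled 1 if horse A
    finished ahead of horse B."""
    rows = []
    for (fa, pa), (fb, pb) in permutations(race, 2):
        rows.append(([a - b for a, b in zip(fa, fb)], 1 if pa < pb else 0))
    return rows

# Hypothetical 3-horse race: ([speed fig, post], finish position)
race = [([89, 2], 1), ([77, 1], 3), ([85, 3], 2)]
for features, label in pairwise_rows(race):
    print(features, label)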

highnote
03-20-2002, 01:25 AM
GT,
I'll give your message some thought. In the meantime, here's how I coded my model.

Hypothetically, let's say there is one race in my model with 3 horses (there might be hundreds of races in an actual model). Each horse has 4 factors (or independent variables). Call them last race speed figure, best lifetime speed figure, finish position last race, and weight carried last race. Horse 1 won this race so his dependent variable is a 1. The other two horses lost so their dependent variables are 0.

The data would look like this:

LRS BRS FPLR WGT W/L
Horse 1: 89 92 2 117 1
Horse 2: 77 103 1 110 0
Horse 3: 85 100 3 115 0

Next we gather data for, say, 500 races. Then perform logistic regression to determine the coefficient for each factor - that is, how much each factor should be weighted.

So if we have a race being run tomorrow, three of the horses' data might look like this:

LRS BRS FPLR WGT Prob
H1: 87*1.23 92*.89 5*.053 116*.23 .33
H2: 86*1.23 89*.89 4*.053 113*.23 .23
H3: 55*1.23 117*.89 8*.053 110*.23 .19

The rest of the horses in the race would account for the remaining probability, so the total for the field sums to 1.0.

The beauty of logistic regression is that the probabilities that are output will sum to 1.0. Standard regression that is coded with a binary dependent variable can possibly have an output that will be greater than 1 or less than 0 - which is quite impossible in real life.
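
For anyone who wants to tinker, here's a rough sketch of that workflow using scikit-learn (not what I actually used). The numbers are the toy values from above; in practice hundreds of races would be stacked into the training data, and the per-race probabilities are normalized to sum to 1 before being turned into an odds line:

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per horse: [LRS, BRS, FPLR, WGT]; label 1 = won.
X = np.array([[89,  92, 2, 117],
              [77, 103, 1, 110],
              [85, 100, 3, 115]])
y = np.array([1, 0, 0])

model = LogisticRegression().fit(X, y)

# Tomorrow's race: score each starter, normalize within the race so the
# probabilities sum to 1, then convert to a fair-odds line.
tomorrow = np.array([[87,  92, 5, 116],
                     [86,  89, 4, 113],
                     [55, 117, 8, 110]])
p = model.predict_proba(tomorrow)[:, 1]
p = p / p.sum()
fair_odds = (1 - p) / p
print(p.round(3), fair_odds.round(2))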

bigray76
03-20-2002, 11:27 AM
Interesting thread gentlemen,

I am currently working with my own 'model' of sorts, which I developed after about two years of noticing small factors. I do not use anything inter-related, but base everything off of each horse starting with a value of 1.0. Each horse then gets additional factors based on my researched probabilities for certain factors, which in some cases may apply to them positively or negatively. Every track has basically the same factors, however, they may be slightly higher or lower. I have been recently trying to add factors for my angles which tend to outweigh the findings of my 'point system'. I have a rough model in place for harness racing right now, that actually computes an odds line for each horse in the race. I am still testing that out to see if it is a viable way to find overlays and underlays, and more importantly show a profit.

I have tried to keep my models (t-bred and harness) fairly simple, so that I can do a race by hand in a matter of minutes. The problem is that as a human, I have many exceptions to my rules, which I am trying to figure out for when I start programming it into the computer. (I'm also learning to program in Java, which is rather time-consuming.)

Ray

Rick
03-20-2002, 12:23 PM
GT,

My data is coded similarly to swetyejohn's, except that I use each variable minus the average of that variable for the horses in that race. I also include 1 / number of horses in the race as a variable. Both of these give a better fit.
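
In pandas terms the centering looks roughly like this (the column names are just placeholders for whatever factors you actually use):

import pandas as pd

# Hypothetical per-horse data with a race identifier.
df = pd.DataFrame({
    "race_id":   [1, 1, 1, 2, 2],
    "speed_fig": [89, 77, 85, 92, 88],
    "class_rtg": [70, 74, 71, 80, 79],
})

# Center each factor on its within-race average.
for col in ["speed_fig", "class_rtg"]:
    df[col + "_ctr"] = df[col] - df.groupby("race_id")[col].transform("mean")

# Add 1 / field size as its own variable.
df["inv_field_size"] = 1.0 / df.groupby("race_id")["race_id"].transform("size")
print(df)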

highnote
03-20-2002, 10:09 PM
GT,
I made one error in my last post. The probabilities that are output for the horses in a particular race do not sum to one. I have to normalize them to sum to one.
John

Jaguar
04-09-2002, 09:35 AM
I highly recommend Nunamaker's work on impact values. He really solved the age-old question of "What information is relevant in horse handicapping, and how does a handicapper assign values to those factors?"

Curiously- at least to me, as a traditional, old-fashioned handicapper- RaceCom found the solution to this predicament via a totally different and heretofore unorthodox approach.

Without really understanding the nuts and bolts of their algorithms - but as a former customer who is familiar with their basic approach to financial analysis and to handicapping - it seems to me that they have taken pace analysis, consistency rating, and "performance within different class levels" to a new level of accuracy and predictive power.

I have been benchmarking their results for Analog 5.0 for a year or so, and their huge database has enabled them to model America's racetracks with astonishing results. Of course, their service is extremely expensive, and they can be arrogant and abrupt.

Jaguar

jackad
04-09-2002, 12:13 PM
Jaguar,
You say you have been tracking Racecom's Analog 5 for a year or so. Does it show a profit over the long term? If so, by betting it how? Any specific data you can provide will be appreciated.
Jack

GameTheory
04-10-2002, 05:00 AM
I've been experimenting with "bagging" and "stacked generalization" techniques lately, to help improve the generalization accuracy of algorithms/models trained from a particular data set.

Does anyone around here know about this kind of stuff? I'd be interested in your experiences...
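
For anyone who hasn't run into these terms, here's a toy sketch of both techniques on made-up data, using scikit-learn's canned versions rather than my own code:

import numpy as np
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: one row per horse (factor values), y = 1 if it won.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + rng.normal(scale=2.0, size=500) > 1.0).astype(int)

# Bagging: train many copies of a model on bootstrap resamples and average them.
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X, y)

# Stacked generalization: a second-level model learns how to combine the
# first-level models' out-of-fold predictions.
stacked = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("logit", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(),
    cv=5,
).fit(X, y)

print(bagged.predict_proba(X[:3])[:, 1])
print(stacked.predict_proba(X[:3])[:, 1])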

Arkle
04-10-2002, 10:53 AM
Originally posted by Rick:
Here's my advice (for what it's worth).

1. Get the book "Efficiency of Racetrack Markets", edited by Hausch, Lo, & Ziemba (reprint available at GBC). Read the paper "Computer Based Horse Race Handicapping and Wagering Systems: A Report" by William Benter and several other papers about multinomial logit modeling in horse race handicapping.


Rick:
Do you know where one could get William Benter's paper?
Is it possible for a non-specialist to understand it?

Regards

Arkle

GameTheory
04-10-2002, 11:55 AM
Here's a version of it:

http://www.in-the-money.com/artandpap/Efficiency%20of%20the%20Market%20for%20Racetrack%20Betting.doc

(Strange URL -- cut & paste it)

GameTheory
04-10-2002, 01:17 PM
I'm sorry -- I meant to point out that the link is for Ziemba's paper, not Benter's.

Rick
04-10-2002, 01:49 PM
Arkle,

It should be available from Gambler's Book Shop, www.gamblersbook.com. It's expensive though, I think $80 or $90, so you might want to try to find it at a university library and copy whatever papers you're interested in. There is a lot of math involved in the theory, but in practice it's easy to apply using readily available statistical packages. I use EasyReg, which is free and pretty easy to use.

Arkle
04-10-2002, 04:51 PM
Thank you, Rick and GameTheory.

Regards,

Arkle