Horse Racing Forum - PaceAdvantage.Com - Horse Racing Message Board

Old 01-19-2019, 10:27 AM   #16
Jeff P
Registered User
 
Join Date: Dec 2001
Location: JCapper Platinum: Kind of like Deep Blue... but for horses.
Posts: 5,258
My first real foray into developing a logistic regression model for live play was in 2016.

In May of that year I looked ahead at the calendar and decided on Saratoga.

I created a Saratoga-only database (in JCapper) spanning 2011 through 2015 (five years in total).

I exported the JCapper data for those five years to a .csv file.

In predictive modeling and AI one of the most important steps is preparing your data for analysis.

I made a number of transformations to the exported data and cleaned it for outliers.

One of the transformations I made was to assign every starter a random number between 1 and 100. My thinking was that random numbers can be used to create a holdout sample. For example, use only 1 through 70 during model development. Then, after model development, test and tune using 71 through 100.
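A minimal sketch of that random-number holdout in Python, for anyone who wants to try it outside JCapper. The field names (`race_id`, `horse`, `rand`) and the seed are assumptions made up for illustration, not anything from the export described above.

```python
import random

def assign_holdout_numbers(starters, seed=2016):
    """Tag every starter with a random integer from 1 to 100 (inclusive)."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    return [dict(s, rand=rng.randint(1, 100)) for s in starters]

def split_development_holdout(starters, cutoff=70):
    """Numbers 1..cutoff go to development; cutoff+1..100 are held out."""
    development = [s for s in starters if s["rand"] <= cutoff]
    holdout = [s for s in starters if s["rand"] > cutoff]
    return development, holdout

# Hypothetical starters: 200 races with 8 horses each.
starters = [{"race_id": r, "horse": h} for r in range(200) for h in range(8)]
tagged = assign_holdout_numbers(starters)
development, holdout = split_development_holdout(tagged)
print(len(development), len(holdout))
```

One possible variation, since an mlogit-style model is fit on whole races: draw the random number once per race instead of once per starter, so no race is split across the two samples.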

Imo, with horse racing data, your most obvious outliers are first time starters. Every FTS has 0's for speed figs, pace figs, starts, and earnings, etc.

In the field of predictive modeling, one generally accepted way to handle this type of outlier is to coerce the numeric value of whatever independent variable you are looking at to the mean. If F1 Early Pace Fig is your independent variable: Sum the F1 Early Pace Figs for every starter in the current race, calculate the mean, and coerce the F1 Early Pace Fig of every FTS to the mean.
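A rough Python sketch of coercing to the race mean. It assumes a 0 (or an empty field) means "no data" and that the mean is taken over the starters in the same race that do have a value. The field names are hypothetical.

```python
from collections import defaultdict

def impute_race_mean(starters, factor):
    """Replace 0 or missing values of `factor` with the mean of the
    starters in the same race that do have a value."""
    by_race = defaultdict(list)
    for s in starters:
        by_race[s["race_id"]].append(s)

    filled = []
    for race in by_race.values():
        known = [s[factor] for s in race if s.get(factor)]  # skips 0 and None
        # If nobody in the race has a value, there is no mean to use.
        race_mean = sum(known) / len(known) if known else None
        for s in race:
            value = s[factor] if s.get(factor) else race_mean
            filled.append({**s, factor: value})
    return filled

race = [
    {"race_id": 1, "horse": "A", "f1_early_pace": 80},
    {"race_id": 1, "horse": "B", "f1_early_pace": 90},
    {"race_id": 1, "horse": "C", "f1_early_pace": 70},
    {"race_id": 1, "horse": "FTS", "f1_early_pace": 0},  # first time starter
]
filled = impute_race_mean(race, "f1_early_pace")
print(filled[-1]["f1_early_pace"])  # 80.0
```

Horses with real numbers pass through unchanged; only the 0 is replaced.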

That's the way I did it at first.

Several months later I took the time to test a handful of other ways. Through trial and error I discovered a technique that (vs. using the mean) generated incrementally better results during forward testing of my end models. But that's a minor point.

The important point is that horse racing data is noisy. You'll get far better results from your end models if you take the time to prepare your data for analysis vs. if you don't.

After no small effort preparing the exported data for analysis: I had a development sample.

I began running data through the mlogit package in R.

I tend to think in terms of areas of the game rather than factors.

To my way of thinking a factor is just a way to express how horses stack up vs. each other numerically for a given area of the game.

I started with two areas of the game: Early pace and late pace.

I chose one early pace factor and one late pace factor.

At that point I had a model based on factors representing those two areas of the game. I ran it through the mlogit package in R and noted the output.

Then I swapped out the first early pace factor for a different early pace factor and repeated the process - noting the output from the mlogit package in R once again.

I kept trying different early pace factors and noting the output.

After cycling through seven or eight early pace factors, it was clear from the output of the mlogit package in R which early pace factor had the most significance. So I chose that one.

Fyi, according to the authors of "Statistics for Biology and Health" (a 500 page textbook on logistic regression in PDF form that I later downloaded and read), the process I just described is called Forward Elimination. They recommend a process called Backwards Elimination where you do it the other way: start with a list of independent variables and work backwards one variable at a time, making eliminations based on significance as you go.

From there I did the same thing with late pace factors. I used Forward Elimination to land on the late pace factor that, when combined with the early pace factor I selected, had the most significance.
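The one-area-at-a-time process described above (more commonly called forward selection) can be sketched as a simple loop. This is only an outline: `score` is a stand-in for whatever measure of fit or significance you read off the model output, and the factor names and toy scores below are invented for illustration.

```python
def forward_select(areas, score):
    """areas: dict mapping an area of the game to its candidate factors.
    score: callable taking a list of factor names and returning a number
    where higher means a better model.
    Picks one factor per area, in order, keeping earlier picks fixed."""
    chosen = []
    for area, candidates in areas.items():
        best = max(candidates, key=lambda f: score(chosen + [f]))
        chosen.append(best)
    return chosen

# Hypothetical candidates and made-up scores standing in for real model fits.
areas = {
    "early_pace": ["early_a", "early_b"],
    "late_pace": ["late_a", "late_b"],
}
toy_fit = {
    ("early_a",): 1.0,
    ("early_b",): 2.5,
    ("early_b", "late_a"): 3.0,
    ("early_b", "late_b"): 2.8,
}

def toy_score(factors):
    return toy_fit.get(tuple(factors), 0.0)

selected = forward_select(areas, toy_score)
print(selected)  # ['early_b', 'late_a']
```

In practice `score` would refit the model for each candidate list and return something like the log-likelihood, which is the slow part.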

Since this was my first time undergoing this process I really didn't know if that's the correct way to do it or not.

But that's what I did.

Next I did Forward Elimination and landed on a factor for form.

At this point I had a model that was handling early pace, late pace, and form.

The really interesting thing to me was when I tested the model 'as is' on a five year holdout sample based on random numbers: Horses that were rank=1 according to the model had a win rate of just 0.22 (22%).

But the win pool roi of the rank=1's according to the model was right at break even.

And the win pool roi of the model rank=1's going forward over the 2016 meet? Right at break even too.
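To see how a 22% win rate and a break-even ROI fit together: with flat $2 win bets, break even needs an average winning mutuel of about 2 / 0.22, or roughly $9.09. A small sketch with invented results (22 winners at $9.10 out of 100 bets, not actual data from the sample) shows the bookkeeping. ROI here is dollars returned per dollar wagered, so 1.00 is break even.

```python
def rank1_summary(results, stake=2.0):
    """results: one dict per bet on the model's top-ranked horse, with
    'won' (bool) and 'payoff' (the win mutuel for the stake, 0 if lost)."""
    bets = len(results)
    wins = sum(1 for r in results if r["won"])
    returned = sum(r["payoff"] for r in results if r["won"])
    return {"win_pct": wins / bets, "roi": returned / (bets * stake)}

# Hypothetical sample: 22 winners paying $9.10 each, 78 losers.
results = [{"won": True, "payoff": 9.10}] * 22 + [{"won": False, "payoff": 0.0}] * 78
summary = rank1_summary(results)
print(summary)
```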

This, from a global model that did not yet involve the tote. A global model that made no attempt to handle maiden claimers going 5.5f on the dirt differently than say a stakes route on the turf.

That's when I became kind of sold on the whole idea.

That guy Benter? I think he was on to something.



-jp

.
__________________
Team JCapper: 2011 PAIHL Regular Season ROI Leader after 15 weeks
www.JCapper.com

Last edited by Jeff P; 01-19-2019 at 10:38 AM.
Old 01-19-2019, 11:51 AM   #17
chrisl
Registered User
 
Join Date: Jan 2007
Location: Ketchikan,AK
Posts: 2,086
What an incredible thread. Nice job gentlemen. Thanks
Old 01-19-2019, 12:53 PM   #18
ultracapper
Registered User
 
Join Date: Jul 2013
Location: Seattle
Posts: 3,943
Quote:
Originally Posted by chrisl View Post
What an incredible thread. Nice job gentlemen. Thanks
I wasn't going to comment on this thread because it is WAY over my head, but I can at least 2nd this. Very interesting.

A comment about Jeff's remarks concerning FTS. Back when the computer was first getting into the game, and only a few people I knew were incorporating it into their daily regimen, the maiden races always had the FTS listed at the bottom of the field. They'd grade the horses that had already run, but they couldn't do anything with the firsters, so they were just listed below the others. Obviously the computer was of almost zero value for 2 yos in the spring and early summer. It will always be a challenge.
Old 01-19-2019, 01:53 PM   #19
headhawg
crusty old guy
 
Join Date: Aug 2003
Location: Snarkytown USA
Posts: 3,909
I also agree that this is a great thread. It's the kind we need more of, imo. Thanks for the info guys; it's nice to read things from the practical experts instead of the theoretical ones.
__________________
"Don't believe everything that you read on the Internet." -- Abraham Lincoln
Old 01-19-2019, 06:58 PM   #20
Dave Schwartz
 
Join Date: Mar 2001
Location: Reno, NV
Posts: 16,877
Jeff,

Regarding FTS... Are you saying that you treat all FTS as "average FTS?" (I am thinking not.)

How do you decide which FTS will likely go early?
Old 01-19-2019, 08:19 PM   #21
Buckeye
Smarty Pants
 
Join Date: Sep 2001
Location: Every Vote Counts
Posts: 3,160
There's no way to do that Dave.

Ok, maybe I'm wrong . . .

Look at the workouts and look at the bloodlines and Trainer.
Old 01-19-2019, 08:48 PM   #22
Dave Schwartz
 
Join Date: Mar 2001
Location: Reno, NV
Posts: 16,877
Quote:
Originally Posted by Buckeye View Post
There's no way to do that Dave.

Ok, maybe I'm wrong . . .

Look at the workouts and look at the bloodlines and Trainer.
I agree. But Jeff's post has me confused.

Some years ago I did a similar FTS study to determine how many Quirin ES points should be allowed for. The number I came up with was around 2.6.

A better study told me what pct of the time a FTS would actually run as an "E" or "EP" horse. (i.e. 7-8, 5-6)

Of course that became a crapshoot of sorts. I felt that it was safe to say that in a field which contained 4 FTS, one of them would at least press the pace.

Actually reminded me of my BJ playing days when I'd be playing a hand from the last seat and there were players in the 1st 2 seats who stood against a 10 with a rich deck. It was logical to me that between those two hands there were at least three 10-count cards.

This allowed me to adjust my hit/stand strategy towards the direction of accurate.
Old 01-19-2019, 09:57 PM   #23
traveler
Registered User
 
Join Date: Feb 2003
Location: NY
Posts: 245
Thank you Jeff for your insights
Old 01-19-2019, 10:01 PM   #24
Jeff P
Registered User
 
Join Date: Dec 2001
Location: JCapper Platinum: Kind of like Deep Blue... but for horses.
Posts: 5,258
You're welcome.


-jp

.
__________________
Team JCapper: 2011 PAIHL Regular Season ROI Leader after 15 weeks
www.JCapper.com
Old 01-19-2019, 11:04 PM   #25
Jeff P
Registered User
 
Join Date: Dec 2001
Location: JCapper Platinum: Kind of like Deep Blue... but for horses.
Posts: 5,258
Quote:
Originally Posted by Dave Schwartz View Post
Regarding FTS... Are you saying that you treat all FTS as "average FTS?" (I am thinking not.)
No. Definitely not.

If I step back and give this a little thought --

If I were to just use the 0's that come from the data vendor, I'd end up generating prob estimates for first time starters that are, if not the same, too high for most of them, too low for a select few of them, and too close together on the whole.

As players, we absolutely need a technique for handling them.

But it needn't be rocket science.


Quote:
Originally Posted by Dave Schwartz View Post
How do you decide which FTS will likely go early?
I don't try to decide whether or not a FTS will go early.

Believe it or not I have the ability to do that - without writing any new code.

Just never did that specifically for first time starters.

All it would take is to model the likelihood of a first time starter winning the start call... the first call... the second call...

Or some combination of the above.

Crap.

The more I think about it the more I realize I should probably do that.

That said, I started writing a post based on what Ultracapper wrote about first time starters back in post #18.

That post is kind of long. It goes into some depth about how I handle first time starters. I'm editing it right now and once it's a little further along I'll come back to this thread and post it.



-jp

.
__________________
Team JCapper: 2011 PAIHL Regular Season ROI Leader after 15 weeks
www.JCapper.com

Last edited by Jeff P; 01-19-2019 at 11:05 PM.
Old 01-20-2019, 01:22 AM   #26
Jeff P
Registered User
 
Join Date: Dec 2001
Location: JCapper Platinum: Kind of like Deep Blue... but for horses.
Posts: 5,258
Quote:
Originally Posted by Dave Schwartz View Post
I agree. But Jeff's post has me confused...
The short answer - how I handle first time starters:

For factors in my models where the first time starter has 0's or the data is otherwise missing and the other horses in the race have data for that same factor:
  • The first time starter gets the mean, or in my case something close to the mean, instead of a 0.

  • For all of the factors in my models where the first time starter does have a number:

    The first time starter gets the actual number - just like any other horse.

  • From there the factor values -- be it a mean or be it something I created myself like scored query results or be it a number that came from a data vendor:

    I feed the number or factor value into the algorithm that generates the output for my models.
That's how I do it.

Keep in mind that I am using Prob Expressions to generate scored query results for rider, trainer, post position, and track bias. I do this for every horse in the race - first time starters included.

So while first time starters have 0's for a great many factors, in my case, I'm replacing the 0's with something close to the mean based on the factor values of the other horses in the race. I'm also including Prob Expression scores for rider, trainer, post position, and track bias.

By the time I'm done I have a reasonably good prob estimate for my first time starters.




What follows below is a longer answer as to how I handle horses with missing data. This post is longer than I intended and there's plenty of background info.

Fyi, what I'm about to write below doesn't just apply to first time starters. It applies to all horses in the data.


Factors where a horse, first time starter or not, has a 0 or the data is otherwise missing - and the other horses in the race have actual numbers for that factor:

I treat all horses with missing data the same. First time starters are the most frequent example of this. That's why I've so far focused specifically on them.

Imo, missing data for a factor when other horses in the race have data for that same factor creates an outlier situation.

Imo, better to handle outliers than ignore them.

Again, I treat all horses in the data the same. If the data is missing and other horses in the race have values for that factor, the horses with the missing data get the mean (or in my case something close to the mean) instead of a 0 or an empty or null data field.

First time starters typically have 0's for a great many factors like final time speed figs, early pace figs, class or pars from previous races, starts, wins, money earned etc.

For these types of numerical factors, where all of the other runners with at least one lifetime start have a number for the factor involved, what are we supposed to do?

How are we supposed to handle the first time starter?

If you go with the 0's for these factors that are coming from your data vendor: You risk rating the first time starter or foreign shipper too low relative to its true ability.

On the other hand, rate the first time starter too high and you risk generating probability estimates that will have you betting first time starters you think are overlays when they're not.

The truth is we don't really know the true ability of a first time starter until it has a few races under its belt. It could rank towards the bottom of the field. It could rank somewhere in the middle. It could also be a stickout. But looking at the raw Equibase data there's no real way to know.

Enter predictive modeling.

The authors of most predictive modeling papers, if they are telling the truth, struggle with outliers too. If you're looking at data for a drug trial and a small percentage of the records in the dataset are missing the patient's age: What do you do?

If I were doing transformations while creating a development sample for a drug trial I'd probably ask someone higher up on the ladder exactly how they wanted me to handle those records where the patient's age is missing. Delete them? Start fresh with a brand new sample? What are you really supposed to do?

If say 5.278% of the records in the dataset (by the way that's the percentage of North American thoroughbreds who were first time starters in calendar year 2018) were missing the patient's age and other various pieces of data:

I've never been hired as a data consultant to work on a drug trial. But it wouldn't surprise me if someone higher up on the ladder decided to have me delete all records where the patient's age was missing, move on, and add a footnote to the final published version of the project at the end.

Something tells me the body of work might be compromised if I did that. (But I could be wrong.)

Obviously, with horse racing data you don't want to just delete first time starters (or 5.278% of the records) from the database.

Well you can if you want to. But if you do that you are going to struggle with first time starters until you figure out what to do with them.

In horse racing we need a technique for handling first time starters. Actually we need a technique for handling all horses with missing data.

In predictive modeling one generally accepted technique for dealing with the equivalent of a first time starter -- where you have missing data and you are required to make probability estimates for records with missing data -- is to replace the missing data with the mean.

There are other techniques that work slightly better but I've found using the mean works pretty well. No, not the mean from the entire dataset - but the mean from the other records for the current factor in the current race --

That True/False event you are modeling the likelihood of -- such as winning the race, leading at the pace call, or ending up as the post time favorite.

Fyi, each of these are perfectly worthy events you might want to model. And no, you don't have to stop there.


Factors where the first time starter has an actual number:

On the other hand there are many factors where the first time starter does have a number.

For example:

Numbers you create yourself like the scored Prob Expression query results of the type I wrote about back in post #12.

Imo, these can be really useful for scoring riders, trainers, post positions, breeding, and track bias, etc.

Here, sometimes it pays to get creative.

For example, what's the trainer's record or sire's record for the n most recent first time starters for today's track-surface-dist-classdescriptor (SPLWT or a Maiden Claimer)?

Sometimes you can create custom data points that are really useful.

Maybe n is a smallish number that few would even consider. But catch a trainer or sire whose runners are winning first out early enough in the cycle before everyone else catches on and... Well, you get the idea.
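As a rough illustration of that kind of custom data point, here is a Python sketch of "trainer's record with the n most recent first time starters under today's conditions." It is not a JCapper Prob Expression or sql query; the record layout and field names (`is_fts`, `track`, `surface`, and so on) are hypothetical.

```python
def recent_fts_record(past_starts, trainer, n=10, **conditions):
    """Win record for a trainer's n most recent first time starters that
    match the given conditions (e.g. track="SAR", surface="D")."""
    matches = [
        s for s in past_starts
        if s["trainer"] == trainer
        and s["is_fts"]
        and all(s.get(k) == v for k, v in conditions.items())
    ]
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    recent = sorted(matches, key=lambda s: s["date"])[-n:]
    if not recent:
        return None
    wins = sum(1 for s in recent if s["finish"] == 1)
    return {"starts": len(recent), "wins": wins, "win_pct": wins / len(recent)}

# Hypothetical past starts.
past_starts = [
    {"date": "2015-07-25", "trainer": "T1", "is_fts": True, "track": "SAR", "surface": "D", "finish": 1},
    {"date": "2015-08-01", "trainer": "T1", "is_fts": True, "track": "SAR", "surface": "D", "finish": 4},
    {"date": "2015-08-08", "trainer": "T1", "is_fts": False, "track": "SAR", "surface": "D", "finish": 1},
    {"date": "2015-08-15", "trainer": "T1", "is_fts": True, "track": "SAR", "surface": "T", "finish": 2},
    {"date": "2015-08-22", "trainer": "T2", "is_fts": True, "track": "SAR", "surface": "D", "finish": 1},
]
record = recent_fts_record(past_starts, "T1", n=5, track="SAR", surface="D")
print(record)  # {'starts': 2, 'wins': 1, 'win_pct': 0.5}
```

When building this for a model, `past_starts` should only contain races run before the race being rated, otherwise the number leaks the result.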

Even if you never manage to catch lightning in a bottle --

If you can get your Interface to query this kind of stuff from a database, score it, craft your sql queries in a creative way --

And if you follow up and test what you have going forward so that you understand what you actually have:

You can create custom datapoints with statistical significance that only you get to see.

On the other hand, do it in a generic way like the Equibase stats that everybody else sees and you're not likely to get anywhere.

There are also data points you can observe with your own two eyes. Imo, horse physicality matters. Learn to judge it, score it in a consistent way, test it going forward, get better at it -- and at some point your [insert your last name here] Phys Ratings can become a custom data point. Of course this is easier to do if you attend the races live and you are following a single circuit.

I can tell you from experience that first time starters and foreign shippers who look physically fit, are carrying solid weight, and who are acting well win a higher pct of their races than first time starters and foreign shippers whose physicality is suspect.

There are lots of useful data points you can pick up on your own that are never going to be in the Equibase data.

Trip notes and track bias notes fall into this category.

So do workouts if you have a bent for getting up early, watching them with your own two eyes, and making your own notes.

There are also the pieces of equipment that horses wear. Front wraps. Tongue ties. Shoes. Hot patches. Type of bit. Type of bridle. Horses whose gait is all wrong warming up on a wet surface. Riders who are inept on a wet surface. Hoof size relative to the ideal for today's surface.

It pays to think outside the box a little and be observant in a creative way.

No matter what it is you are looking at:

Learn to score it in a consistent way. Test what you have going forward. Work at making incrementally better observations. Believe me. You'll know once you're generating quality ratings for whatever it is you're looking at in a consistent way.

At that point work it into your models.

You just might surprise yourself.


Ratings created by your Data Vendor - such as:

Rider rating and/or win pct, trainer rating and/or win pct, distance pedigree rating, turf pedigree rating, off track pedigree rating, workout rating, etc.

All of these apply to first time starters and foreign shippers. All of them are certainly fair game as independent variables in your models.



To Recap:

For factors in my models where horses have 0's or the data is otherwise missing and the other horses in the race have numbers for that same factor:
  • The horse gets the mean, or in my case something close to the mean, instead of a 0, empty, or null.

  • For all of the factors in my models where a horse, first time starter or not, does have a number:

    The horse, first time starter or not, gets the actual number - just like any other horse.

  • From there the factor values -- be it a mean, be it scored query results, be it something like a Phys score I've created myself, or be it a number that came from my data vendor:

    I feed the number or factor value into the algorithm that generates the output for my models.
That's how I do it.

All of that said, there's no reason prob estimates generated by a model for first time starters have to differ greatly from reality.

First time starters have workouts. First time starters with terrible workouts win at a reduced rate vs. first time starters with better workouts.

First time starters have riders. First time starters with terrible riders (don't get me started) win at a reduced rate vs. first time starters with better riders.

The same thing applies for trainers, post positions, breeding, scored query results, and any custom data points you may be creating yourself.

Make good numbers for all of the attributes you can think of. Test them going forward. You need to understand what you actually have. Tune and fit as needed.

Work it into your models. Then do an ongoing tune and fit of the model itself.



-jp


.
__________________
Team JCapper: 2011 PAIHL Regular Season ROI Leader after 15 weeks
www.JCapper.com

Last edited by Jeff P; 01-20-2019 at 01:35 AM.
Old 01-20-2019, 07:39 AM   #27
sjk
Registered User
 
Join Date: Feb 2003
Posts: 2,105
The way I handle firsters (and horses who have not run in 120 days) is simple for those looking for such an approach.

Firsters as a group return -25%, which is the same as the return on a randomly chosen horse. I give in to the idea that I do not have enough information to hope to overcome that edge, so I never bet them.

In order to assign odds to all of the other competitors I give the firsters the same chance to win as the public. To do this you need to be willing to adjust your odds on the fly based on real time tote information. So I take the tote odds and normalize all of the chances to 100% to see what remains for the others.
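A sketch of that adjustment in Python, under a couple of assumptions: tote odds are quoted as odds-to-1, the public's chance for each horse is 1 / (odds + 1) normalized so the field sums to 100%, and the other horses' own estimates are rescaled to fill whatever the firsters leave. The function and variable names are made up for illustration.

```python
def blend_firsters(tote_odds, model_probs, firsters):
    """tote_odds: horse -> current odds-to-1 for the whole field.
    model_probs: horse -> own win estimate for the non-firsters.
    firsters: set of horses that get the public's probability."""
    raw = {h: 1.0 / (odds + 1.0) for h, odds in tote_odds.items()}
    total = sum(raw.values())
    public = {h: p / total for h, p in raw.items()}  # sums to 1.0

    remaining = 1.0 - sum(public[h] for h in firsters)
    model_total = sum(model_probs[h] for h in tote_odds if h not in firsters)

    final = {}
    for h in tote_odds:
        if h in firsters:
            final[h] = public[h]
        else:
            final[h] = model_probs[h] / model_total * remaining
    return final

def should_pass(field_size, unknowns):
    """Skip races where firsters and long layoff horses exceed 1/3 of the field."""
    return unknowns * 3 > field_size

# Hypothetical four horse field where C is a firster.
tote_odds = {"A": 1.0, "B": 3.0, "C": 4.0, "D": 9.0}
model_probs = {"A": 0.5, "B": 0.3, "D": 0.2}
final = blend_firsters(tote_odds, model_probs, {"C"})
print(final)
```

Because the tote moves until the gates open, this would have to be rerun each time the odds update.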

I don't play races where firsters and long layoff horses are more than 1/3 of the field. That cuts out a lot of maiden races.

Of course they beat me, and it is annoying when that happens, but it is just one of many ways to get beat, so I accept it.
Old 01-20-2019, 08:10 AM   #28
JerryBoyle
Veteran
 
Join Date: Feb 2018
Posts: 845
Sounds like I do something similar to both of you. Jeff, I agree that this is simply a missing data problem, not just a FTS problem, so I basically have imputation strategies based on the feature itself. This is because in some cases, imputing with a 0 makes sense, while in other cases, imputing with mean/max/min etc may make more sense. It's important to remember that 0 isn't some special signal in any of the regression type models that means something like "ignore". Meaning, if your feature is something like Average Lifetime Speed, where the mean of all non zero runners is 80, imputing with a 0 will have a DRASTIC impact on the prediction. In this case, I'd use the mean of the non zero runners. However, if I have a feature such as avg lifetime finish, and the finish is normalized to be centered at 0, then it makes sense to impute the missing value with 0.
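A minimal sketch of per-feature imputation along those lines. It assumes missing values have already been marked as `None` (so a genuine 0 on a centered feature is not mistaken for missing), and the feature names are hypothetical.

```python
from statistics import mean

def race_mean(known):
    """Fill with the mean of the runners in the race that have a value."""
    return mean(known) if known else None

def zero(known):
    """Fill with 0, for features already centered at 0."""
    return 0.0

# One imputation strategy per feature.
IMPUTERS = {
    "avg_lifetime_speed": race_mean,
    "avg_lifetime_finish_centered": zero,
}

def impute(race, feature):
    """Return the feature values for one race with None filled in."""
    known = [r[feature] for r in race if r[feature] is not None]
    fill = IMPUTERS[feature](known)
    return [r[feature] if r[feature] is not None else fill for r in race]

# Hypothetical race: the last runner is a first time starter.
race = [
    {"avg_lifetime_speed": 80, "avg_lifetime_finish_centered": -0.5},
    {"avg_lifetime_speed": 78, "avg_lifetime_finish_centered": 0.2},
    {"avg_lifetime_speed": 82, "avg_lifetime_finish_centered": 0.3},
    {"avg_lifetime_speed": None, "avg_lifetime_finish_centered": None},
]
print(impute(race, "avg_lifetime_speed"))            # last value becomes 80
print(impute(race, "avg_lifetime_finish_centered"))  # last value becomes 0.0
```

Adding a max or min strategy is just another small function and another entry in the table.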

I do this for all runners with missing data. As Jeff said, it just happens that FTS have A LOT of missing data. Because I have features that are not dependent on a runner's past performances (post bias, trainer specific, jockey specific, etc), FTS still get a rating. For a race with all FTS, it ends up being a competition between those features - that is, a competition between trainer/jockey/etc. This makes sense to me, as the model doesn't have any information about the runner's specific potential to take an opinion.

Of course, we can then test whether these strategies are valid by analyzing the results of our holdout data. Try to quantify the average number of starts per race for all runners or maybe total number of FTS and bucket your profitability or hit rate or whatever metric you use to determine success and see how you do in those ranges. Should be fairly easy to determine if the model is getting crushed in races where the runners haven't run before or haven't run much.
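The bucketing check can be sketched like this, assuming each holdout bet has been tagged with the number of FTS in its race. The field names and sample bets are invented for illustration. ROI is dollars returned per dollar staked.

```python
from collections import defaultdict

def roi_by_fts_count(bets, cap=3):
    """Group holdout bets by number of FTS in the race (cap means
    'cap or more') and report bets, staked, returned and ROI per bucket."""
    buckets = defaultdict(lambda: {"bets": 0, "staked": 0.0, "returned": 0.0})
    for b in bets:
        key = min(b["fts_in_race"], cap)
        buckets[key]["bets"] += 1
        buckets[key]["staked"] += b["stake"]
        buckets[key]["returned"] += b["returned"]
    return {
        k: {**v, "roi": v["returned"] / v["staked"]}
        for k, v in sorted(buckets.items())
    }

# Hypothetical holdout bets.
bets = [
    {"fts_in_race": 0, "stake": 2.0, "returned": 5.0},
    {"fts_in_race": 0, "stake": 2.0, "returned": 0.0},
    {"fts_in_race": 2, "stake": 2.0, "returned": 0.0},
    {"fts_in_race": 5, "stake": 2.0, "returned": 3.0},
]
report = roi_by_fts_count(bets)
for fts_count, row in report.items():
    print(fts_count, row)
```

The same shape works for hit rate or any other success metric, and for bucketing on average starts per runner instead of FTS count.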

Finally, if you also use a 2 step model, which merges your fundamental prediction with the tote prediction, you'll end up adjusting the public's opinion slightly for features on which your model DOES have an opinion for FTS (like post bias).
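One common form of that second step, along the lines of what Benter published, weights the log of the fundamental probability and the log of the public probability, then renormalizes within the race. The sketch below assumes that form. The weights `alpha` and `beta` would normally be estimated from past races; the defaults here are placeholders, not fitted values.

```python
import math

def combine(fundamental, public, alpha=0.5, beta=0.5):
    """Blend model and tote probabilities for one race:
    weight_i = exp(alpha * ln(f_i) + beta * ln(pi_i)), renormalized to sum to 1."""
    weights = {
        h: math.exp(alpha * math.log(fundamental[h]) + beta * math.log(public[h]))
        for h in fundamental
    }
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

# Hypothetical three horse race.
fundamental = {"A": 0.50, "B": 0.30, "C": 0.20}
public = {"A": 0.40, "B": 0.40, "C": 0.20}
combined = combine(fundamental, public)
print(combined)
```

Where the fundamental model has little to say about a horse (a FTS with mostly imputed values), its estimate sits near the field average and the public term does most of the work for that horse.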
Old 01-20-2019, 08:44 AM   #29
castaway01
Registered User
 
Join Date: Jul 2009
Location: NJ
Posts: 3,816
Excellent posts in this thread, thanks guys.
Old 01-20-2019, 11:55 AM   #30
MJC922
Registered User
 
Join Date: Nov 2012
Posts: 1,506
If you're betting after the rank and file ADW 0MTP conditional crowd I suspect you're doing just fine even with pen and paper handicapping. If you're not then I'd be hard pressed to believe any long term profits will be sustainable with or without computer models.

Now if we get exchanges someday that's a whole different ballgame where I suspect you will see some folks once again start to do very well for themselves.
__________________
North American Class Rankings

Last edited by MJC922; 01-20-2019 at 11:58 AM.