Quote:
Originally Posted by Dave Schwartz
I agree. But Jeff's post has me confused...
The short answer - how I handle first time starters:
For factors in my models where the first time starter has 0's or the data is otherwise missing and the other horses in the race have data for that same factor:
- The first time starter gets the mean, or in my case something close to the mean, instead of a 0.
- For all of the factors in my models where the first time starter does have a number:
The first time starter gets the actual number - just like any other horse.
- From there the factor values -- be it a mean or be it something I created myself like scored query results or be it a number that came from a data vendor:
I feed the number or factor value into the algorithm that generates the output for my models.
That's how I do it.
Keep in mind that I am using Prob Expressions to generate scored query results for rider, trainer, post position, and track bias. I do this for every horse in the race - first time starters included.
So while first time starters have 0's for a great many factors, in my case, I'm replacing the 0's with something close to the mean based on the factor values of the other horses in the race. I'm also including Prob Expression scores for rider, trainer, post position, and track bias.
By the time I'm done I have a reasonably good prob estimate for my first time starters.
What follows below is a longer answer as to how I handle horses with missing data. This post is longer than I intended and there's plenty of background info.
Fyi, what I'm about to write below doesn't just apply to first time starters. It applies to all horses in the data.
Factors where a horse, first time starter or not, has a 0 or the data is otherwise missing - and the other horses in the race have actual numbers for that factor:
I treat all horses with missing data the same. First time starters are the most frequent example of this. That's why I've so far focused specifically on them.
Imo, missing data for a factor when other horses in the race have data for that same factor creates an outlier situation.
Imo, better to handle outliers than ignore them.
Again, I treat all horses in the data the same. If the data is missing and other horses in the race have values for that factor, the horses with the missing data get the mean (or in my case something close to the mean) instead of a 0 or an empty or null data field.
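A minimal sketch of that fill-in rule, in Python. This is not the code behind my models. The field names (name, speed_fig) and the function name are made up for illustration, it treats both 0 and None as missing, and it uses the plain mean of the other horses in the race rather than the "something close to the mean" adjustment mentioned above.

```python
from statistics import mean

def fill_missing_with_race_mean(race, factor):
    """Return a copy of `race` in which missing values for `factor`
    are replaced by the mean of the other horses in the same race.

    `race` is a list of dicts, one per horse (hypothetical layout).
    A value of 0 or None, or an absent key, counts as missing.
    """
    # Values from horses that actually have data for this factor.
    known = [h[factor] for h in race if h.get(factor)]
    if not known:
        # Nobody in the race has data for this factor: leave it alone.
        return [dict(h) for h in race]
    fill = mean(known)
    return [
        {**h, factor: h[factor] if h.get(factor) else fill}
        for h in race
    ]

race = [
    {"name": "A", "speed_fig": 80},
    {"name": "B", "speed_fig": 90},
    {"name": "FTS", "speed_fig": 0},  # first time starter, no figure yet
]
filled = fill_missing_with_race_mean(race, "speed_fig")
```

Note the mean comes from the current race only, not from the whole database, and that a race where no horse has the factor is left untouched.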
First time starters typically have 0's for a great many factors like final time speed figs, early pace figs, class or pars from previous races, starts, wins, money earned etc.
For these types of numerical factors, where all of the other runners with at least one lifetime start have a number for the factor involved, what are we supposed to do?
How are we supposed to handle the first time starter?
If you go with the 0's for these factors that are coming from your data vendor: You risk rating the first time starter or foreign shipper too low relative to its true ability.
On the other hand, rate the first time starter too high and you risk generating probability estimates that will have you betting first time starters you think are overlays when they're not.
The truth is we don't really know the true ability of a first time starter until it has a few races under its belt. It could rank towards the bottom of the field. It could rank somewhere in the middle. It could also be a stickout. But looking at the raw Equibase data there's no real way to know.
Enter predictive modeling.
The authors of most predictive modeling papers, if they are telling the truth, struggle with outliers too. If you're looking at data for a drug trial and a small percentage of the records in the dataset are missing the patient's age: What do you do?
If I were doing transformations while creating a development sample for a drug trial I'd probably ask someone higher up on the ladder exactly how they wanted me to handle those records where the patient's age is missing. Delete them? Start fresh with a brand new sample? What are you really supposed to do?
If say 5.278% of the records in the dataset (by the way that's the percentage of North American thoroughbreds who were first time starters in calendar year 2018) were missing the patient's age and various other pieces of data:
I've never been hired as a data consultant to work on a drug trial. But it wouldn't surprise me if someone higher up on the ladder decided to have me delete all records where the patient's age was missing, move on, and add a footnote to the final published version of the project at the end.
Something tells me the body of work might be compromised if I did that. (But I could be wrong.)
Obviously, with horse racing data you don't want to just delete first time starters (or 5.278% of the records) from the database.
Well you can if you want to. But if you do that you are going to struggle with first time starters until you figure out what to do with them.
In horse racing we need a technique for handling first time starters. Actually we need a technique for handling all horses with missing data.
In predictive modeling one generally accepted technique for dealing with the equivalent of a first time starter -- where you have missing data and you are required to make probability estimates for records with missing data -- is to replace the missing data with the mean.
There are other techniques that work slightly better but I've found using the mean works pretty well. No, not the mean from the entire dataset - but the mean from the other records for the current factor in the current race --
That True/False event you are modeling the likelihood of -- such as winning the race, leading at the pace call, or ending up as the post time favorite.
Fyi, each of these is a perfectly worthy event you might want to model. And no, you don't have to stop there.
Factors where the first time starter has an actual number:
On the other hand there are many factors where the first time starter does have a number.
For example:
Numbers you create yourself like the scored Prob Expression query results of the type I wrote about back in post #12.
Imo, these can be really useful for scoring riders, trainers, post positions, breeding, and track bias, etc.
Here, sometimes it pays to get creative.
For example, what's the trainer's record or sire's record for the n most recent first time starters for today's track-surface-dist-classdescriptor (SPLWT or a Maiden Claimer)?
Sometimes you can create custom data points that are really useful.
Maybe n is a smallish number that few would even consider. But catch a trainer or sire whose runners are winning first out early enough in the cycle before everyone else catches on and...
Well, you get the idea.
Even if you never manage to catch lightning in a bottle --
If you can get your Interface to query this kind of stuff from a database, score it, craft your sql queries in a creative way --
And if you follow up and test what you have going forward so that you understand what you actually have:
You can create custom datapoints with statistical significance that only you get to see.
On the other hand, do it in a generic way like the Equibase stats that everybody else sees and you're not likely to get anywhere.
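To make the "n most recent first time starters" idea concrete, here is a rough sketch in plain Python over an in-memory list. It is not a Prob Expression and not my actual sql. The record layout (trainer, date, first_time_starter, won, plus whatever condition fields you pass in) and the function name are assumptions for illustration only. Dates are assumed to be ISO strings so they sort correctly as text.

```python
def recent_fts_win_rate(starts, trainer, conditions, n):
    """Win rate over a trainer's n most recent first time starters
    that match `conditions` (e.g. track, surface, class descriptor).

    `starts` is a list of dicts, one per past start (hypothetical layout).
    Returns None when there are no matching starts.
    """
    matches = [
        s for s in starts
        if s["trainer"] == trainer
        and s["first_time_starter"]
        and all(s.get(k) == v for k, v in conditions.items())
    ]
    # Most recent first, then keep only the n newest.
    matches.sort(key=lambda s: s["date"], reverse=True)
    sample = matches[:n]
    if not sample:
        return None
    return sum(1 for s in sample if s["won"]) / len(sample)

starts = [
    {"trainer": "T1", "date": "2018-01-05", "track": "AQU", "surface": "D",
     "first_time_starter": True, "won": False},
    {"trainer": "T1", "date": "2018-02-10", "track": "AQU", "surface": "D",
     "first_time_starter": True, "won": True},
    {"trainer": "T1", "date": "2018-03-15", "track": "AQU", "surface": "D",
     "first_time_starter": True, "won": True},
    {"trainer": "T1", "date": "2018-03-20", "track": "AQU", "surface": "T",
     "first_time_starter": True, "won": False},
    {"trainer": "T2", "date": "2018-03-21", "track": "AQU", "surface": "D",
     "first_time_starter": True, "won": True},
]
rate = recent_fts_win_rate(starts, "T1", {"track": "AQU", "surface": "D"}, n=2)
```

With a smallish n the sample is tiny, so a number like this is only a starting point. It still has to be scored and tested going forward before it earns a place in a model.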
There are also data points you can observe with your own two eyes. Imo, horse physicality matters. Learn to judge it, score it in a consistent way, test it going forward, get better at it -- and at some point your [insert your last name here] Phys Ratings can become a custom data point. Of course this is easier to do if you attend the races live and you are following a single circuit.
I can tell you from experience that first time starters and foreign shippers who look physically fit, are carrying solid weight, and who are acting well win a higher pct of their races than first time starters and foreign shippers whose physicality is suspect.
There are lots of useful data points you can pick up on your own that are never going to be in the Equibase data.
Trip notes and track bias notes fall into this category.
So do workouts if you have a bent for getting up early, watching them with your own two eyes, and making your own notes.
There are also the pieces of equipment that horses wear. Front wraps. Tongue ties. Shoes. Hot patches. Type of bit. Type of bridle. Then there are horses whose gait is all wrong warming up on a wet surface. Riders who are inept on a wet surface. Hoof size relative to the ideal for today's surface.
It pays to think outside the box a little and be observant in a creative way.
No matter what it is you are looking at:
Learn to score it in a consistent way. Test what you have going forward. Work at making incrementally better observations. Believe me. You'll know once you're generating quality ratings for whatever it is you're looking at in a consistent way.
At that point work it into your models.
You just might surprise yourself.
Ratings created by your Data Vendor - such as:
Rider rating and/or win pct, trainer rating and/or win pct, distance pedigree rating, turf pedigree rating, off track pedigree rating, workout rating, etc.
All of these apply to first time starters and foreign shippers. All of them are certainly fair game as independent variables in your models.
To Recap:
For factors in my models where horses have 0's or the data is otherwise missing and the other horses in the race have numbers for that same factor:
- The horse gets the mean, or in my case something close to the mean, instead of a 0, empty, or null.
- For all of the factors in my models where a horse, first time starter or not, does have a number:
The horse, first time starter or not, gets the actual number - just like any other horse.
- From there the factor values -- be it a mean, be it scored query results, be it something like a Phys score I've created myself, or be it a number that came from my data vendor:
I feed the number or factor value into the algorithm that generates the output for my models.
That's how I do it.
All of that said, there's no reason prob estimates generated by a model for first time starters have to differ greatly from reality.
First time starters have workouts. First time starters with terrible workouts win at a reduced rate vs. first time starters with better workouts.
First time starters have riders. First time starters with terrible riders (don't get me started) win at a reduced rate vs. first time starters with better riders.
The same thing applies for trainers, post positions, breeding, scored query results, and any custom data points you may be creating yourself.
Make good numbers for all of the attributes you can think of. Test them going forward. You need to understand what you actually have. Tune and fit as needed.
Work it into your models. Then do an ongoing tune and fit of the model itself.
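One simple way to "test going forward" is a calibration check: group your prob estimates into buckets and compare the average estimate in each bucket to the actual win rate. The sketch below is only one possible approach, not a description of my own testing. The function name, the input layout, and the bucket count are all assumptions.

```python
def calibration_table(predictions, n_buckets=5):
    """Compare estimated win probabilities to actual results.

    `predictions` is an iterable of (estimated_prob, won) pairs,
    with estimated_prob between 0 and 1 and won a bool.
    Returns one row per non-empty bucket:
    (bucket_low, bucket_high, count, avg_estimate, actual_win_rate).
    """
    buckets = [[] for _ in range(n_buckets)]
    for prob, won in predictions:
        # Equal-width buckets; a prob of exactly 1.0 goes in the top one.
        idx = min(int(prob * n_buckets), n_buckets - 1)
        buckets[idx].append((prob, won))
    rows = []
    for i, bucket in enumerate(buckets):
        if not bucket:
            continue
        avg_estimate = sum(p for p, _ in bucket) / len(bucket)
        win_rate = sum(1 for _, w in bucket if w) / len(bucket)
        rows.append((i / n_buckets, (i + 1) / n_buckets,
                     len(bucket), avg_estimate, win_rate))
    return rows

results = [
    (0.10, False), (0.10, False), (0.15, True), (0.15, False),
    (0.50, True), (0.50, False),
]
table = calibration_table(results)
```

If the average estimate and the actual win rate stay close in each bucket over a decent number of races, the estimates are lining up with reality. You can run the same check on first time starters alone to see whether they are being rated too high or too low.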
-jp