
Question on data mining procedures


markgoldie
10-06-2009, 11:35 AM
I've been contemplating the difficulties in targeted data mining. For example, suppose you want to know what effect a jockey change might have on a horse. We would assume that there could exist a cause-and-effect sliding scale that more or less tracks the difference in win percentage of the jockeys involved. But merely taking an roi on horses who show a 10% jump in jockey win percentage, let's say, is not very illuminating, since there are so many other factors that might come into play. Also (and this particularly affects gimmick players), we would have no idea how the effect would show up on the vast majority of horses who are unable to win. A horse who finishes last in a given class and distance undergoes a significant change in performance if he returns under similar conditions to run third (let's say), beaten a length or two. But this won't show up in roi tracking.

Therefore, my intuitive idea would be to measure the effect on a performance barometer that would be traceable on all horses in all situations, and for want of a better solution, I keep coming back to change in speed fig (Beyers, etc.). This too is fraught with complications, since a speed fig is subject to many circumstantial factors (no need to list them; they are many and we all know them). However, it would seem that a large enough data base would in great measure equalize these different factors. For example, over 5,000-10,000 races, we would hope that the other variables might play out each way, so that the target variable would be somewhat isolated; were this the case, we might get an idea of how the isolated variable plays out on average.
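A rough sketch of that averaging idea (hypothetical column names; assumes a pandas DataFrame of starts with current and last-race speed figs plus the two jockey win percentages):

```python
import pandas as pd

# Hypothetical columns: 'fig', 'fig_last', 'jock_win_pct', 'jock_win_pct_last'.
def fig_change_summary(starts: pd.DataFrame, jump: float = 0.10) -> pd.DataFrame:
    """Mean speed-fig change for horses whose new jockey's win percentage
    is at least `jump` higher than the previous jockey's, versus the rest."""
    d = starts.copy()
    d["fig_delta"] = d["fig"] - d["fig_last"]
    d["jock_upgrade"] = (d["jock_win_pct"] - d["jock_win_pct_last"]) >= jump
    # The standard error ('sem') shrinks as the sample grows -- the point of
    # using 5,000-10,000 races is to let the other noise average out.
    return d.groupby("jock_upgrade")["fig_delta"].agg(["count", "mean", "sem"])
```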

This would seem to be the only long-term logical way to evaluate particular variables in a steady and coherent manner. For example, consider the Brisnet Prime Power number (flawed as it may be). How would such an algorithm be derived if not through many-race data mining? We know (and I assume that we might take them at their word) that this algorithm takes into account a substantial number of handicapping factors- currency, speed, pace speed, pace matchups, distance, trainer/jockey abilities, class, running style vs. track biases, etc., etc. But since they are handicapping every race on the basis of their PP number, where do they get the proper "weighing" (Dave Schwartz' favorite terminology) of the individual factors that comprise the algorithm if not through large-scale data mining? Certainly, trial and error is a caveman approach to a now-automated universe and is not an attractive option.

I know that a DRF Beyer associate (Fotias maybe?) wrote a book on things that affect the "Number," but I haven't read it, so I don't know how large a data base he used. I also know that the Beyer guys are not much for pace analysis, so I'd imagine that their approach would be somewhat limited in scope. At any rate, I was interested in the approach that any data miners out there might use. Is using a performance barometer such as a speed fig logical and doable? Does it make sense? Is that what data miners are doing?

Jeff P
10-06-2009, 12:32 PM
...At any rate, I was interested in the approach that any data miners out there might use. Is using a performance barometer such as a speed fig logical and doable? Does it make sense? Is that what data miners are doing?

Mark, Great Question!

I use databases to look at many different things... early speed, late speed, class, ability from speed figs, form, weight, weather, track condition, human connections, breeding and extended families, the performance of the betting public, the performance of the ML line maker, tendencies/requirements tied to surface and distance... you name it - if it's related to racing - I have probably spent at least a few hours at some point using databases to study it.

No matter what I happen to be studying, I take a similar approach to my R&D.

Start with a large sample... and take a benchmark.

Then drill down. Take a tighter sample - providing a more specific look into whatever area the R&D is based on... and compare win rate and roi from the tighter/more specific sample to that of the benchmark.

Differences (or lack of differences) between the benchmark and the tighter/more specific sample will almost always tell you something.
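A minimal sketch of that benchmark-then-drill-down comparison (hypothetical column names; assumes a pandas DataFrame named starts with a 1/0 'win' flag, a $2 win 'payoff', and a boolean column for whatever factor is being tested):

```python
import pandas as pd

def win_rate_and_roi(df: pd.DataFrame) -> tuple:
    """Win rate and flat-bet roi for a $2 win bet on every row."""
    staked = 2.0 * len(df)
    return df["win"].mean(), (df["payoff"].sum() - staked) / staked

# Benchmark: the large sample.
bench_wr, bench_roi = win_rate_and_roi(starts)

# Drill down: the tighter, more specific sample for whatever is being tested.
subset = starts[starts["factor_being_tested"]]
sub_wr, sub_roi = win_rate_and_roi(subset)

print(f"benchmark: win {bench_wr:.3f}, roi {bench_roi:+.3f}")
print(f"subset:    win {sub_wr:.3f}, roi {sub_roi:+.3f}  (n={len(subset)})")
```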

This can have many applications. About two years ago I wrote an article for my user base titled Understanding Track Weight:
http://www.jcapper.com/HelpDocs/UnderstandingTrackWeight.htm

While the tendencies of track surfaces are always subject to change, the article does provide an example of how benchmarks can be used to understand differences between track surfaces.


-jp

.

GameTheory
10-06-2009, 12:33 PM
For example, consider the Brisnet Prime Power number (flawed as it may be). How would such an algorithm be derived if not through many-race data mining? We know (and I assume that we might take them at their word) that this algorithm takes into account a substantial number of handicapping factors- currency, speed, pace speed, pace matchups, distance, trainer/jockey abilities, class, running style vs. track biases, etc., etc. But since they are handicapping every race on the basis of their PP number, where do they get the proper "weighing" (Dave Schwartz' favorite terminology) of the individual factors that comprise the algorithm if not through large-scale data mining?

I would suggest that they (when coming up with the algorithm) did a lot of backfitting on previous data. They obviously would have had access to as many years as they cared to go back. If they were interested in getting the best result going forward, rather than just getting the best result on previous data so they could put nice-sounding claims in the promo material (we will assume they didn't do any outright lying in that category), then they would have left aside some unused data or done some sort of cross-validation, etc., to come up with something that generalized well for future data. But since the Prime Power number has been around a long time now, assuming they haven't changed the formula, all of the data actually using the number published since it was invented is all you need to know. (This is off the track of your question, but it is worth pointing out that lots of figure and software developers just create something by backfitting to old data and then make great claims about it that won't hold up going forward.) The short answer is that they probably did do extensive data mining on data that was past at the time they created the number. BRIS never went back and assigned Prime Power numbers to OLD races prior to the introduction of that number, so there is no conflict there. And once the algorithm was created, it was probably fixed, and that's that.
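A hedged illustration of the holdout/cross-validation idea described above -- not BRIS's actual method, just the general technique, with hypothetical factor columns and scikit-learn assumed:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical factor columns feeding a composite "power"-style number.
FEATURES = ["speed", "pace", "class_rating", "trainer_pct", "jockey_pct", "days_away"]

X, y = races[FEATURES], races["win"]          # 'races' is an assumed DataFrame

# Hold some races aside so the fitted weights are judged on data the model
# never saw -- the opposite of pure backfitting. shuffle=False keeps the
# split chronological if the rows are in date order.
X_fit, X_hold, y_fit, y_hold = train_test_split(X, y, test_size=0.25, shuffle=False)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
print("holdout accuracy:", model.score(X_hold, y_hold))

# Or cross-validate within the fitting portion before trusting the weights.
print("cv:", cross_val_score(LogisticRegression(max_iter=1000), X_fit, y_fit, cv=5))
```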


Is using a performance barometer such as a speed fig logical and doable? Does it make sense? Is that what data miners are doing?

I'm not sure there is one "thing" that data miners are doing. They have probably tried this and anything else you might think of at least once. You could use speed figures or anything else as a gauge. Maybe a "better/worse than expected" number (taking odds into account), as you yourself previously offered, or impact values, etc.

This idea that by using tons of races you can isolate that one variable often doesn't work out, though, because all else is NOT always equal. You might be looking at some change in isolation and discover to your surprise that what you thought was going to be a positive change (in speed figures or whatever) turned out to show no improvement or even a slight decline in performance. But then you dig a little deeper and discover (for instance) that 60% of the time when this change occurs from one race to another, the horse also jumps up in class. The trainer also probably thought the horse was going to improve, so he jumped him up in level and also went out to find a "better" jockey, which you picked up as your indicator. Maybe he was right, maybe not. But in the overall stats, the higher class level would take its toll and the average we'd see wouldn't look like any improvement was gained. But if we broke those stats down by class change (up/down/same) to go along with our jockey-change indicator, maybe we'd be on to something. Or maybe we would be if we used an odds-based performance indicator rather than speed figures (since the odds probably would have gone up with the class rise).
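A small sketch of that breakdown (hypothetical column names; assumes a pandas DataFrame of starts where the jockey-upgrade indicator and the class move have already been flagged):

```python
import pandas as pd

# Hypothetical columns: 'fig_delta' (speed-fig change from last race),
# 'jock_upgrade' (True when the new rider's win % jumped), and
# 'class_move' in {'up', 'same', 'down'}.
def upgrade_effect_by_class_move(starts: pd.DataFrame) -> pd.DataFrame:
    """Split the jockey-upgrade effect by class move, since the two travel
    together: a trainer expecting improvement often raises the level AND
    hires the 'better' rider at the same time."""
    return (starts[starts["jock_upgrade"]]
            .groupby("class_move")["fig_delta"]
            .agg(["count", "mean"]))
```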

All of these things boggle and befuddle the data miner, and you still feel like you're doing the "caveman approach," only one step removed. Now all your trial-and-error is on what the parameters are going to be for the data mining. Has your problem really gotten any easier?

You seem to want to know what the "best practices" are. Well, there are some for data mining in general, but horse racing really throws one for a loop because we are always asking a different question. In making a BRIS Prime Power number, I assume they didn't give too much weight to the value question -- they wanted to put winners in those top slots as much as possible. And that's the type of question data miners are usually looking at -- I've got this data, and I want to predict something. But we've got a whole bunch of bettors already doing that for every race, with the results showing up as the odds. And now we've got to come up with a different but still good answer to make any money, and if we are using more or less the same data as the bettors are (and we are, most of us), then we've actually got to come up with a sub-optimal answer, but one that gives us more value. But data mining is not designed to find sub-optimal answers, so we've got to figure out how to ask it a question such that the answer to that question, when answered optimally, will give us what we want -- a number with VALUE rather than just predictiveness. It is pretty easy to come up with a highly predictive number that will point out a lot of winners. Much tougher to come up with a number that will make us money. For the exotics, there is probably good work still to be done on predicting horses that won't win but will fill out the other slots.
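One hedged way to frame the value-versus-predictiveness point in code -- a toy overlay test, with a simplified takeout adjustment that is an assumption, not anyone's actual formula:

```python
def implied_prob(decimal_odds: float, takeout: float = 0.17) -> float:
    """Rough win probability the crowd is assigning, net of an assumed takeout."""
    return (1.0 - takeout) / decimal_odds

def is_overlay(model_prob: float, decimal_odds: float, edge: float = 0.05) -> bool:
    """A value flag: our probability beats the crowd's by some margin.
    A merely predictive number agrees with the crowd; a valuable one
    disagrees with it and is right often enough to profit."""
    return model_prob - implied_prob(decimal_odds) > edge

# Example: we make a horse 25%, the board makes it roughly 18%.
print(is_overlay(0.25, decimal_odds=4.5))   # True -> worth a bet under this rule
```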

sjk
10-06-2009, 12:34 PM
You could make your best projection of the speed figure that a horse will run (power ratings from his past races might be an important ingredient in this) without using information about the jockey, and then see how the difference between your projection and the actual figure varies with jockey win percent, or with the change in jockey win percent from those in previous races.

If you plan to do the same with trainers be aware that there is a strong correlation between high percentage trainers and high percentage jockeys and take care not to double count.
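A rough sketch of that residual check (hypothetical column names; assumes a DataFrame named starts whose 'projected_fig' was built without any jockey information):

```python
import pandas as pd

# Hypothetical columns: 'projected_fig' (built WITHOUT jockey information),
# 'actual_fig', 'jock_win_pct', 'trainer_win_pct'.
df = starts.copy()
df["residual"] = df["actual_fig"] - df["projected_fig"]

# Does the projection miss vary with jockey win percent?
print("residual vs jockey %:", df["residual"].corr(df["jock_win_pct"]))

# sjk's caution: high-percentage trainers and jockeys travel together,
# so check their correlation before crediting both factors separately.
print("jockey % vs trainer %:", df["jock_win_pct"].corr(df["trainer_win_pct"]))

# A simple binned look at the average miss by jockey win percent.
bins = pd.cut(df["jock_win_pct"], [0.0, 0.10, 0.15, 0.20, 1.0])
print(df.groupby(bins, observed=True)["residual"].agg(["count", "mean"]))
```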

Jeff P
10-06-2009, 12:51 PM
Suppose you wanted to do R&D into factors that resulted in an improved speed fig next out...

Take a large sample... note the typical difference between next out speed fig and speed fig last race. Use that as a benchmark.

Take a more specific/tighter sample based on a factor that you'd like to test... maybe claimed by a trainer who happens to be on a list of trainers you think have the ability to move horses up.

Note the typical difference between next out speed fig and speed fig last race for the more specific/tighter sample and compare it to the benchmark.

Then take another factor you think might help horses improve and do the same thing.

Caveman approach?

Yes.

But I think you'll find - generally - this approach works.
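A minimal sketch of that next-out fig comparison (hypothetical column and trainer names; assumes a pandas DataFrame named starts):

```python
import pandas as pd

# Hypothetical columns: 'fig_next' (next-out speed fig), 'fig_last',
# 'claimed_last_out' (True/False), 'trainer'.
MOVE_UP_TRAINERS = {"Trainer A", "Trainer B"}   # placeholder watch list

starts["improve"] = starts["fig_next"] - starts["fig_last"]

benchmark = starts["improve"].mean()
tighter = starts[starts["claimed_last_out"] & starts["trainer"].isin(MOVE_UP_TRAINERS)]

print(f"benchmark fig change:        {benchmark:+.2f}")
print(f"claimed by listed trainers:  {tighter['improve'].mean():+.2f}  (n={len(tighter)})")
```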


Posted by GT:

This idea that by using tons of races you can isolate that one variable often doesn't work out, though, because all else is NOT always equal....

...You seem to want to know what the "best practices" are. Well, there are some for data mining in general, but horse racing really throws one for a loop because we are always asking a different question.

Absolutely agree with this. Many of the factors in racing that we think of at first glance as being independent are often correlated to each other.


-jp

.

markgoldie
10-06-2009, 03:52 PM
Excellent stimulating replies from GT and JP.

The issues: "Caveman" approach. Months ago, I apparently demonstrated my ignorance in a thread devoted to genetic algorithms. My general understanding of them at the time had led me to believe that a GA could fully automate the data-mining process; that is, if you introduced all the known variables to the GA and provided it with a huge data base of past PPs and results, it would parse the factors, in the right dosages, to solve the overall problem (which in this case would be a positive roi). Dave Schwartz and others told me this was wrong. Not sure why, but that's beside the point. I am willing to take the word of experts.

If we are using a do-it-yourself method of data mining, we may still be taking the caveman approach, but certainly the result will be subject to automation.

Narrowing the search after choosing a benchmark: The concept makes sense, but it seems that too much narrowing carries substantial risks. Let's take a broad example. Suppose we are investigating the effects of a better-percentage jockey on final-time figs. We might be immediately tempted to filter out some of the "noise" that surrounds a performance; that is, let's say we decide to restrict the investigation to same class, same distance, same surface, same track, and currency under 45 days (or any combination of these or similar restrictions). Three immediate problems: (1) the data base has been diminished, (2) the universality of the investigation has been badly compromised, and (3) how do we know that jockey choice, intent, or riding strategy has not been affected by the other narrowing circumstances? This is what GT and JP refer to as the inter-relationships between factors. For these reasons, it would seem that any narrowing other than the introduction of the raw proposition (the variable to be investigated) is self-defeating. The narrowing process would seem to be the final step, undertaken after we have the generalized effects well quantified.

Value as the final Holy Grail: GT mentions this and he is right. But every time we look at roi, we are doing a value analysis and the larger question is where do we find the avenues that lead to a higher roi? Now, I understand that this may seem like building a better mousetrap. But who has invented the perfect mousetrap? One thing I am relatively sure of: The perfect mousetrap hasn't come close to being invented.

Additionally, this approach may seem to some like an impossible quest to develop a generalized handicapping strategy, with its eventual aim just to be better handicappers than the average public, such that we can do a better job in any given race. This proposition is understandably a turn-off to "niche" handicappers, who feel that finding a profitable restrictive angle is the way to success. But this approach of broad-based data mining actually aims at finding profitable niches, since it seeks to unearth the values of poorly understood individual factors. The result is not necessarily a greater generalized understanding of broad-based handicapping, but instead the ability to find specific situations in which a favorable factor or combination of factors exists. Thus, we are finding the niches that lead to a profitable angle, no matter how restricted that might be.

Using projected speed figs: This was mentioned by sjk. The problem here is that our "projection" method involves taking into consideration many factors other than the one we are seeking to investigate. This leads to the same problem as with too much narrowing. We are mixing in factors which may have some interdependent relationship to the investigated factor.

Using roi as a barometer for gauging an effect: In general, I don't believe that using roi as a gauge when investigating the effect of individual variables has much relevance at all. Why? Because roi contains within it all the other handicapping factors that cause the confusing conglomeration we want to parse. And that's because it takes into account the public's opinion of a horse's prospects, which went into creating the odds. Such opinion, therefore, is rife with myriad factors. Yes, we will eventually need to get to the question of roi, but using it as an R&D barometer when we are investigating an individual factor clouds the result beyond recognition. This is why I suggest a performance barometer like a final-speed fig, which (flawed as it is) at least does not in and of itself contain broad-based handicapping considerations. It is also why (as GT proposed) finish-vs.-odds position won't work, because it is similarly polluted by the odds assigned by the handicapping public.

In sum, you will never understand the impact of a single factor if your judgement of it is exclusively based on a comparison to a number created by all handicapping factors. You are actually prematurely backfitting such that you will never understand the raw relationships.

Is a final-speed fig, then, the best barometer for data mining? Or is there something better which I'm not envisioning?

DJofSD
10-06-2009, 03:59 PM
Many years ago, there was a project I got wind of at the place I worked. It involved data mining. Not knowing too terribly much about it, I decided to see what I could learn on my own.

I learned two things. The first is that data mining is, in part, looking for relationships captured in existing data bases that either are not apparent or are not easily discovered.

The second tidbit is that quite often the data sits in data bases that have been designed to use normal forms. Data kept in tables in normal form leads to very inefficient queries when it comes to data mining. So, often, what happens is that data existing in tables in 3NF is extracted into files in 2NF, and then the data mining processes are pointed at those files.
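A small sketch of that flattening step (hypothetical file and column names; pandas assumed rather than SQL):

```python
import pandas as pd

# Hypothetical normalized tables, roughly the 3NF layout described above:
starts   = pd.read_csv("starts.csv")     # race_id, horse_id, jockey_id, trainer_id, fig, finish
horses   = pd.read_csv("horses.csv")     # horse_id, sire, dam, foal_year
jockeys  = pd.read_csv("jockeys.csv")    # jockey_id, jock_win_pct
trainers = pd.read_csv("trainers.csv")   # trainer_id, trainer_win_pct

# Denormalize: join everything into one wide, redundant table so the mining
# passes don't pay for the joins on every query.
flat = (starts
        .merge(horses, on="horse_id")
        .merge(jockeys, on="jockey_id")
        .merge(trainers, on="trainer_id"))

flat.to_csv("flat_for_mining.csv", index=False)
```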

Dave Schwartz
10-06-2009, 04:22 PM
...Dave Schwartz and others told me this was wrong. Not sure why, but that's beside the point. I am willing to take the word of experts.

Well, not exactly "wrong." It is my belief that it can be done. However, you must structure the data very carefully. It is not just "let's dump everything in and see what comes out," any more than it is with a statistical approach.


I absolutely believe that it WILL work.
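A toy sketch of what such a carefully structured GA might look like -- not Dave's actual approach, just the mechanics, run here on synthetic random data, so any "edge" it finds is pure backfit and a real use would need a holdout:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: rows are starters, columns are carefully structured
# factors; 'won' and 'payoff' ($2 win mutuel, 0 on a loss) are random here.
n, k = 5000, 6
X = rng.normal(size=(n, k))
won = rng.random(n) < 0.12
payoff = np.where(won, rng.uniform(4.0, 20.0, n), 0.0)

def fitness(w):
    """Flat-bet roi betting the rows whose weighted score lands in the top decile."""
    score = X @ w
    picks = score >= np.quantile(score, 0.90)
    staked = 2.0 * picks.sum()
    return (payoff[picks].sum() - staked) / staked

def evolve(pop_size=60, gens=40, sigma=0.10):
    pop = rng.normal(size=(pop_size, k))
    for _ in range(gens):
        fit = np.array([fitness(w) for w in pop])
        keep = pop[np.argsort(fit)[-pop_size // 2:]]               # selection
        idx1 = rng.integers(len(keep), size=pop_size - len(keep))
        idx2 = rng.integers(len(keep), size=pop_size - len(keep))
        mask = rng.random((pop_size - len(keep), k)) < 0.5
        children = np.where(mask, keep[idx1], keep[idx2])          # crossover
        children = children + rng.normal(scale=sigma, size=children.shape)  # mutation
        pop = np.vstack([keep, children])
    fit = np.array([fitness(w) for w in pop])
    return pop[fit.argmax()], fit.max()

best_weights, best_roi = evolve()
print("backfit roi on the training sample:", round(best_roi, 3))
```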


Dave

sjk
10-06-2009, 04:30 PM
I view the entire issue as projecting speed ratings. If you can tell me with assurance who will run the best speed figure in today's race, then I know who to bet. If you tell me the probability that a certain horse will run the best speed rating, then I can look for value on the board.

If you use information from past performances to make your projection, then I think it is reasonable to view who is riding the horse today as an independent piece of information that can be used to fine-tune the projection. The jockey and trainer in today's race are not independent pieces of information, so once you use one you need to be careful about how you use the other.

I see no "problem" with any of this and have had good luck with it over many years.
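A hedged sketch of turning projected figures into a probability of running the field's best figure (the projection error sigma is an assumed value, not a fitted one):

```python
import numpy as np

rng = np.random.default_rng(1)

def prob_best_fig(projections, sigma=6.0, trials=20000):
    """Monte Carlo estimate of each horse's chance of running the field's best
    speed figure, treating each projection as the mean of a normal draw with a
    common error (sigma here is an assumed value, not a fitted one)."""
    proj = np.asarray(projections, dtype=float)
    draws = rng.normal(proj, sigma, size=(trials, len(proj)))
    winners = draws.argmax(axis=1)
    return np.bincount(winners, minlength=len(proj)) / trials

probs = prob_best_fig([92, 88, 87, 85, 80, 78])
for i, p in enumerate(probs):
    print(f"horse {i + 1}: {p:.1%}  (fair odds about {(1 - p) / p:.1f}-1)")
```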

TrifectaMike
10-06-2009, 06:55 PM
I view the entire issue as projecting speed ratings. If you can tell me with assurance who will run the best speed figure in today's race, then I know who to bet. If you tell me the probability that a certain horse will run the best speed rating, then I can look for value on the board.

If you use information from past performances to make your projection, then I think it is reasonable to view who is riding the horse today as an independent piece of information that can be used to fine-tune the projection. The jockey and trainer in today's race are not independent pieces of information, so once you use one you need to be careful about how you use the other.

I see no "problem" with any of this and have had good luck with it over many years.

I completely agree with your first paragraph. If the probabilities associated with your projected speed ratings result in a subjective probability that is closer to the horse's objective probability than the subjective probability associated with the tote odds, then that is all you'll ever need!

It can be done. And questions of correlation and independence do not have to be addressed. (A big clue here.)

Mike