Is this the Ultimate in Data Mining or the Ultimate in Backfitting [Archive] - Horse Racing Forum - PaceAdvantage.Com

Bob Allen

04-05-2004, 11:55 PM

Hi folks,

Need some good and wise cappers to weigh in on this, please.

There is a software program which has been promoted here on PA's board. Now let me describe, as much as I can, what it does and you advise what is happening:

This is a database program. In it you can define up to eight criteria, say for example - first call, second call, speed or any of about 80 or so factors. You only get to choose eight unless you combine some of the 80.

To use the program you can direct it to "tune" by Surface, Distance and Grade (CLM, MDC, MSW, etc.). Or you can have it do it by all in one or all of those three categories.

You set the dates you want to tune, click giddyup and off it goes. Here's what it does then - it looks at every race that meets the Surface, Distance and Grade criteria you set and "tunes" itself to find the optimum weights to assign to those eight criteria. The result from this is very impressive.

When I tested this part, I could never find any way to tune anything that showed a loss. I used straight trifectas, exactas, wins, places, and show wagers. They were all wonderfully profitable when looking at races that had already been run.

Then I did one other little test. Using my database (about 37,000 races) for the tracks currently running, I tuned from the beginning of the database up until today -30 days, and then looked at the next day's results. Then I reran it from the beginning of the database up until today -29 days, and then looked at the next day's results. I did this for 30 days to have a month's worth of what I would term real-world results. Basically it returned a -18% ROI, where it had been showing positive ROIs when tuning the whole DB.
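
The rolling test described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `tune` and `pick_bets` are hypothetical stand-ins for whatever the program does internally, and each race is assumed to be a dict with a `"date"` key.

```python
from datetime import date, timedelta

def roi(bets):
    """ROI over (stake, payout) pairs; 0.0 if nothing was staked."""
    staked = sum(stake for stake, _ in bets)
    returned = sum(payout for _, payout in bets)
    return (returned - staked) / staked if staked else 0.0

def walk_forward(races, tune, pick_bets, today, n_days=30):
    """Tune on everything up to a cutoff, bet only the following day, repeat.

    tune(history) -> model                 (hypothetical tuning step)
    pick_bets(model, race) -> list of (stake, payout) pairs for that race
    """
    all_bets = []
    for k in range(n_days, 0, -1):
        cutoff = today - timedelta(days=k)       # "today -30", "today -29", ...
        test_day = cutoff + timedelta(days=1)    # the next day's races
        history = [r for r in races if r["date"] <= cutoff]
        card = [r for r in races if r["date"] == test_day]
        if not history or not card:
            continue
        model = tune(history)                    # the model never sees test_day
        for race in card:
            all_bets.extend(pick_bets(model, race))
    return roi(all_bets)
```

The point of the structure is that every bet is placed on a race the tuning step could not have seen, which is what separates this figure from the ROI reported when tuning the whole database.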

If I have not given enough information, just let me know and I'll endeavor to add what I can. I just simply need someone with more experience than I have to tell me what's going on here.

The one thing I know is that the program as currently written does not show a profit using it as it is designed to be used. That is not to say that it cannot be profitable - just not profitable the way it is promoted or designed.

What say you?

Thanks,

Bob

GameTheory

04-06-2004, 12:13 AM

It's backfitting. And the more variables you use, the worse it will probably get. Try tuning with only 2 or 3 variables and see what happens....
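
A toy illustration of why more variables make the backfit look better: the sketch below tunes integer weights on races where the factors are pure noise. Everything in it (the weight grid, the race layout, the function names) is an assumption made for the example, not how the program in question works.

```python
import itertools
import random

def hit_rate(races, weights):
    """Fraction of races where the horse with the top weighted score won."""
    hits = 0
    for horses, winner in races:
        # zip stops at len(weights), so only the first len(weights) factors count
        scores = [sum(w * f for w, f in zip(weights, factors)) for factors in horses]
        if scores.index(max(scores)) == winner:
            hits += 1
    return hits / len(races)

def tune(races, n_factors, grid=(0, 1, 2)):
    """Brute-force every weight combination; return (best in-sample rate, weights)."""
    best = (-1.0, None)
    for weights in itertools.product(grid, repeat=n_factors):
        rate = hit_rate(races, weights)
        if rate > best[0]:
            best = (rate, weights)
    return best

def random_races(n_races, n_horses, n_factors, rng):
    """Noise only: factor values and winners are unrelated by construction."""
    return [
        ([[rng.random() for _ in range(n_factors)] for _ in range(n_horses)],
         rng.randrange(n_horses))
        for _ in range(n_races)
    ]

rng = random.Random(2004)
train = random_races(60, 8, 6, rng)
fresh = random_races(60, 8, 6, rng)
for k in (1, 2, 4, 6):
    rate, weights = tune(train, k)
    # columns: number of factors, tuned hit rate, hit rate on unseen races
    print(k, round(rate, 3), round(hit_rate(fresh, weights), 3))
```

Because a weight of 0 is on the grid, every extra factor can only raise or hold the best tuned hit rate, even though nothing here is predictive. Only the last column, from races the tuner never saw, says anything about the future.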

schweitz

04-06-2004, 12:17 AM

I think I'm going to be in the minority here---but I did the database thing 10 yrs ago---I could always backfit to profitability but could never make it stand up to show a profit in the present. It's like the disclaimer on mutual funds "past performance may not be indicative of future results". It does sound like some on this board are having better results with their databases than I did and I wish them well---just giving you my experience.

Jaguar

04-06-2004, 05:07 AM

Bob, with just a few factors being used to analyze a complex activity- horse racing- these crude criteria are regressing to the mean, and just plain giving inaccurate results- which is typical of several older programs with limited scope.

If you see an analysis done which employs 2 dozen, or more, algorithms- and I'm talking about subtle questions relating to an animal's pp's- while the impact value of many factors will register as "statistically insignificant"- some algorithms will discover important patterns relating to the horse, and important patterns relating to the trainer and to the jockey as well.

Example, in a 1 Mile Allowance contest at Gulfstream the other day, the big favorite- Hart's Gap- was rated as a stickout by the software I use. The horse romped and everybody and their grandmother cashed a ticket on this obvious winner.

Interestingly, of the 9 algorithms, or "strategies", which rated Hart's Gap as a standout, only one criterion was truly statistically important: Strategy 17 (recent speed ratings and track variants and the jockey's rating).

This factor alone- when it indicates a horse is the best pick- accounts for 26% winners in races of this type.

So, since Strategy 17 picked Hart's Gap as a monster- and since 8 other Strategies also rated him the best choice, and since he had a superior jockey rating, Hart's Gap was a legitimate statistical stickout.

Curiously, looking at the impact value of the 25 algorithms the software uses in rating horses in Allowance Routes at Gulfstream, only 3 Strategies show a real edge:

Strategy 8- 24% Winners(recent vs. today's distance, speed, closing, and weight carried)

Strategy 14- 25% Winners(days since last race, speed, and closing)

Strategy 17- 26% Winners(weight carried, speed and track variant)
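
For readers who want to reproduce figures like these from their own results, here is a minimal sketch. It assumes the commonly used definition of impact value (share of winners with a trait divided by share of starters with that trait); the software quoted above may well compute it differently, and the function name and input layout are made up for the example.

```python
def win_rate_and_impact(results):
    """results: one (field_size, top_pick_won) tuple per race for a single strategy."""
    races = len(results)
    starters = sum(size for size, _ in results)
    wins = sum(1 for _, won in results if won)
    win_rate = wins / races
    share_of_starters = races / starters   # one top pick per race
    impact_value = win_rate / share_of_starters
    return win_rate, impact_value

# Hypothetical sample: 100 eight-horse races, top pick wins 26 of them
sample = [(8, True)] * 26 + [(8, False)] * 74
print(win_rate_and_impact(sample))
```

An impact value above 1.0 means the strategy's top pick wins more often than its share of the starters alone would suggest. It says nothing about whether the prices paid are high enough to profit.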

Summing up, horse software needs to be wide in scope, needs to look at a lot of information, particularly in races where there are no obvious stickouts.

All The Best,

Jaguar

sjk

04-06-2004, 07:56 AM

I am a database user but I would have the following concerns about the approach your program takes:

1). Although 37,000 races is probably a reasonable number of races to study, it sounds like it is just trying to predict the 37,000 winners, not what happened to the other horses. There may be as much to learn from who ran up the track as from who won. Predicting the performance of the 37,000 horses who won doesn't seem like a lot of data.

2). If you divide the races into categories you are creating some groups (1 3/8 miles on the turf, MCL) where the data is very sparse. The results against these groups will be strongly self-fulfilling.

3). If you only select 8 variables you may be knowingly leaving out something you consider important. That is dangerous since the other bettors are using this information against you.

4). This is a very simplistic way to use a database (as a data repository with a simple linear factor application). When people say that they have doubts about the efficacy of a database approach, I hope they are aware that there are much more intricate ways to analyze and combine the data.

5). There is no odds line or way to identify value. I would consider this to be the main reason to use data analysis.
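
To make point 5 concrete, here is a small sketch of an odds line used to flag value. The win probabilities are assumed to come from your own analysis; the function names and the `margin` parameter are hypothetical, and odds are expressed as odds-to-1.

```python
def fair_odds(prob):
    """Fair odds-to-1 for a win probability, e.g. 0.25 -> 3.0 (3-1)."""
    return (1.0 - prob) / prob

def overlays(probs, tote_odds, margin=1.0):
    """Horses whose tote odds exceed their fair odds times a safety margin."""
    return [name for name, p in probs.items()
            if p > 0 and tote_odds[name] > fair_odds(p) * margin]

# Hypothetical three-horse race
probs = {"A": 0.50, "B": 0.25, "C": 0.25}
tote = {"A": 0.8, "B": 5.0, "C": 2.5}
print(overlays(probs, tote))
```

Here horse A is the most likely winner but is not a bet at 4-5, while B at 5-1 against a fair line of 3-1 is. A margin above 1.0 asks for a bigger cushion before betting.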

You have put the program to the correct test. There are only two ways that I know of to test a program: 1). Use past data to predict a future day's results; 2). The real money test. A program should be able to pass both of these tests.

raybo

04-06-2004, 11:58 AM

I agree with what has been stated already. 8 factors isn't nearly enough, unless they are combinations of many other factors. Also, you need to group the races fairly tightly so the model you intend to use for each group is more accurate. There are "classes within classes", for example. Databases are great for evaluating possible grading scenarios to weed out the ones that won't work, but as far as predicting future ROIs, I would be very cautious. Each race has its own set of variables, which need to be looked at individually. These "datamining tools" are just that, tools. Use them, but you still have to be the operator.
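
Tighter groups come at a cost in sample size, so it may help to count what each group actually contains before tuning on it. A minimal sketch, assuming each race is a dict with `surface`, `distance` and `grade` keys; the `min_races` threshold is an arbitrary placeholder, not a recommendation.

```python
from collections import Counter

def group_sizes(races, keys=("surface", "distance", "grade")):
    """Count races in each (surface, distance, grade) group."""
    return Counter(tuple(race[k] for k in keys) for race in races)

def sparse_groups(races, min_races=200, keys=("surface", "distance", "grade")):
    """Groups with fewer than min_races races, where tuned results deserve little trust."""
    return {group: n for group, n in group_sizes(races, keys).items() if n < min_races}

# Hypothetical database: a common sprint group and a rare turf marathon group
db = ([{"surface": "dirt", "distance": 6.0, "grade": "CLM"}] * 300 +
      [{"surface": "turf", "distance": 11.0, "grade": "MCL"}] * 4)
print(sparse_groups(db))
```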

GameTheory

04-06-2004, 12:56 PM

Confused enough already?

The advice so far is that you need more factors to model things effectively. The problem is, the more factors you use, the easier it is to overfit the data, and the more races you need to prove anything. There are competing goals here:

-- increase predictiveness (which requires more factors) [I don't necessarily agree with that, btw, but it is one way to go. You can make money using just a few factors to pick out horses the public is likely to overlook, but to accurately gauge the chances of all horses you need to get fairly complex.]

-- avoid overfitting to the training data (which requires fewer factors, or a coarser/more general view of the data, if you will)

How do you balance the two? The real secret is the way you sample the data. When you create a method by fitting to old data, it is like making a wax mold of that data -- the result is going to "look" like THAT data, not some other (future) data. So you want to break up your data somehow and fit to each piece, then find the commonalities between the methods. I'm being somewhat vague because it all depends on your exact setup, but the bottom line is if you take a single lump of data (no matter how big) and tune your method to it and then stop at that point, it is going to overfit -- period. If you use less and/or more general factors and a large dataset, it will overfit somewhat less, but it will still overfit. You need to use several sets of data so you can see what features occur in all of them...
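
One possible reading of "fit to each piece, then find the commonalities" is sketched below. It is an interpretation, not the poster's actual procedure: `fit` is a hypothetical function returning a dict of factor weights, and "commonality" is taken to mean that a factor's weight has the same sign in every piece.

```python
def split_chunks(races, n_chunks):
    """Split a chronologically ordered list into consecutive, roughly equal pieces."""
    size, extra = divmod(len(races), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(races[start:end])
        start = end
    return chunks

def stable_factors(races, factor_names, fit, n_chunks=4):
    """Keep factors whose fitted weight has the same sign in every chunk.

    fit(chunk) -> {factor_name: weight}   (hypothetical fitting routine)
    """
    fits = [fit(chunk) for chunk in split_chunks(races, n_chunks)]
    stable = []
    for name in factor_names:
        weights = [f[name] for f in fits]
        if all(w > 0 for w in weights) or all(w < 0 for w in weights):
            stable.append(name)
    return stable
```

A factor that helps in one piece and hurts in another is more likely part of the "wax mold" of that particular data than a lasting feature. Other agreement tests (similar magnitude, similar rank) would work along the same lines.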

cjmilkowski

04-06-2004, 02:39 PM

Originally posted by GameTheory
...You can make money using just a few factors to pick out horses the public is likely to overlook...

Once you accomplish this, why bother with the rest? I spend about 5 minutes a race looking for precisely what you are talking about, then send it home. Once I have an overlay, I don't care how accurate my assessment of the others is. It's not worth the hassle in my opinion. There are so many races every day to choose from, I want volume. I'll take a 5-10% profit betting a thousand races a month over a 30% profit betting a hundred or fewer.
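
The arithmetic behind that preference, as a quick sketch. It assumes one flat bet of the same size per race, which the post does not actually state.

```python
def monthly_profit(bets, roi, stake=1.0):
    """Expected profit in betting units: number of bets x stake x ROI."""
    return bets * stake * roi

high_volume = (monthly_profit(1000, 0.05), monthly_profit(1000, 0.10))
low_volume = monthly_profit(100, 0.30)
print(high_volume, low_volume)
```

At flat stakes, a thousand bets at 5-10% come to roughly 50 to 100 units a month, against about 30 units for a hundred bets at 30%.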

Bob Allen

04-06-2004, 02:41 PM

Everyone above,

Wow! Ask for help and you get a whole basketful here. I guess that's why I've been coming to PA's Place for 5 years. Not any other place on the web has people with the experience AND the knowledge to immediately assess just darn near any handicapping situation.

You gentlemen have confirmed my impression. I did want several people with your stature in the handicapping community to verify the way I felt. You guys sure did that and very, very well - thank you.

Can't tell you how much your quick and precise responses mean. You guys deserve to cash a $100+ ticket in the next couple of days.

You are appreciated,

Bob

JimG

04-06-2004, 02:44 PM

I'm not a database guy, but it seems to me that you would want to slice the races at a track and sample your method on each slice. For instance, say you have a 5000 race database for Calder going back to 2001. Test the method in 6 month increments: run it on only the races from Jan 1 - Jun 30, 2001, and if you get a profit, test the same method on the next 6 months (July 1 - Dec 31, 2001), etc., and see if the results meet your expectations... If it bombs during a 6 month period, then you would lose confidence as well as money. If the method continues to get decent results, go to the next 6 month period, etc.; if not, keep tweaking the factors until you get results you are satisfied with. Some periods may show small losses, but long term it needs to be very profitable and at a hit rate your bankroll can stand.
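
A rough sketch of the six-month slicing, assuming each bet is recorded as a (date, stake, payout) tuple. The names are illustrative only.

```python
from collections import defaultdict
from datetime import date

def half_year(d):
    """Label a date by half-year, e.g. date(2001, 3, 9) -> (2001, 1)."""
    return (d.year, 1 if d.month <= 6 else 2)

def roi_by_period(bets):
    """bets: iterable of (date, stake, payout). Returns {(year, half): ROI} in date order."""
    totals = defaultdict(lambda: [0.0, 0.0])   # period -> [staked, returned]
    for d, stake, payout in bets:
        bucket = totals[half_year(d)]
        bucket[0] += stake
        bucket[1] += payout
    return {period: (returned - staked) / staked
            for period, (staked, returned) in sorted(totals.items())
            if staked}

# Hypothetical bets at one track
bets = [
    (date(2001, 1, 5), 2.0, 5.0),
    (date(2001, 3, 1), 2.0, 0.0),
    (date(2001, 8, 1), 2.0, 0.0),
    (date(2001, 12, 31), 2.0, 3.0),
]
print(roi_by_period(bets))
```

One caution: if the factors are re-tweaked every time a period disappoints, those periods have effectively become tuning data too, so at least the most recent slice should stay untouched until the method is final.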

Seems to me you would always want to test a method forward (new races) using methods created from earlier races in the database.

Jim

JimG

04-06-2004, 02:47 PM

Originally posted by Bob Allen
Everyone above,

Wow! Ask for help and you get a whole basketful here. I guess that's why I've been coming to PA's Place for 5 years.

Bob,

Only 10 posts from you in 5 years. You should post more.

Jim

PS...On the other hand, I should probably post less

:D

04-06-2004, 02:49 PM

Originally posted by JimG
...Only 10 posts from you in 5 years. You should post more...

Hey, he's efficient!

raybo

04-06-2004, 02:54 PM

RE:<Seems to me you would always want to test a method forward (new races) using methods created from earlier races in the database.>

Exactly, well said. Test from the past and learn from the future.

raybo

04-06-2004, 02:56 PM

Or vice versa, I learn from what I am doing in the present and test it against what happened in the past. Either way works. The key is you have to do both.

GameTheory

04-06-2004, 03:16 PM

Originally posted by cjmilkowski
Once you accomplish this, why bother with the rest? I spend about 5 minutes a race looking for precisely what you are talking about, then send it home. Once I have an overlay, I don't care how accurate my assessment of the others is. It's not worth the hassle in my opinion. There are so many races every day to choose from, I want volume. I'll take a 5-10% profit betting a thousand races a month over a 30% profit betting a hundred or fewer.

The ultimate goal in evaluating every horse as precisely as possible is to uncover more betting opportunities -- more volume. It's just a different approach, and you can use both.

Look at it as a lock and key analogy. The races are locks that you want to open. Maybe you have 4 or 5 master keys that will allow you to open 25% of the locks you come across (i.e. you look for certain pre-defined scenarios, and if you don't find them, you move on). Whereas if you have locksmith tools you can open any lock (i.e. you take what the race gives you, and handicap accordingly).

Bob Allen

04-06-2004, 06:45 PM

Originally posted by JimG
Bob,

Only 10 posts from you in 5 years. You should post more.

Jim

PS...On the other hand, I should probably post less

:D

A few months ago I changed my handle from my address to my name. Makes me look like a newbie, but maybe I am.

Nah, Jim, anyone who has used as much software as you have should probably be posting more, not less.

Ok, you get a $100 ticket too,

Bob

Jeevan

04-06-2004, 07:39 PM

Bob, have you come to any conclusions as to why you were getting such good stats when in reality they were not as good? Were you backfitting without realizing it, or is there some kind of flaw in the software?
Also, from reading your 1st post it sounds like you are somewhat disillusioned with the software that you are using. I'm curious to know if you are disappointed and why. (Naming the software is not necessary.)

bucktron

04-06-2004, 09:24 PM

Computer Handicappers/Developers

1. Have any of you tested your system's output on a viable sample, let's say 10,000 races, that the system had not seen before?

2. Did you find that by increasing your Training Set size, your backfitted equations/models produced results in your Test Set close enough to your Training Set results to show a reasonable profit?

3. What were your results?
Example: Win% of your Top Two Selections in your Training Set vs. Win% of your Top Two Selections in your Test Set.

4. What was the size of your Test Set?
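
The comparison asked for in question 3 could be computed along these lines. This is a sketch with an assumed data layout: each race is a pair of the system's ranking (best first) and the actual winner.

```python
def top_two_win_pct(races):
    """Percent of races won by one of the system's top two selections."""
    if not races:
        return 0.0
    hits = sum(1 for ranked, winner in races if winner in ranked[:2])
    return 100.0 * hits / len(races)

def shrinkage(train, test):
    """Training-set Win% minus test-set Win%, in percentage points."""
    return top_two_win_pct(train) - top_two_win_pct(test)

# Hypothetical results: rankings are best-first, second item is the winner
train = [(["a", "b", "c"], "a"), (["a", "b", "c"], "b"),
         (["a", "b", "c"], "c"), (["a", "b", "c"], "a")]
test = [(["a", "b", "c"], "c"), (["a", "b", "c"], "a")]
print(top_two_win_pct(train), top_two_win_pct(test), shrinkage(train, test))
```

A large positive shrinkage is the signature of backfitting: the training-set figure flatters the method, and the test-set figure is the one to plan a bankroll around.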

Bob Allen

04-06-2004, 09:48 PM

Originally posted by Jeevan
Bob, have you come to any conclusions as to why you were getting such good stats when in reality they were not as good? Were you backfitting without realizing it, or is there some kind of flaw in the software?
Also, from reading your 1st post it sounds like you are somewhat disillusioned with the software that you are using. I'm curious to know if you are disappointed and why. (Naming the software is not necessary.)

bucktron,

The conclusions I finally came to were that it was a backfitting program. It was not the program I use for wagers as I am strictly an EquiSim guy and have been for quite a while.

The software in question is no longer on any of my computers or network.

The disillusionment came when I recommended to the programmer that the feature in question should be scrapped and replaced by something new. Let me put this as mildly as possible - uh, he went ballistic that I would even challenge one of his pets and stated that the software would NEVER be changed.

Anyone taking an absolutist position such as that is not the type of programmer you want to design something to point to wagers.

Just take a look around at the super professional programmers on this board, not one of them stands still and rejects the notion of change. I would name them but I'd leave someone out and that would be bad. Let me just take two, for example. Nathan and Dave have both walked a path of improvement of their software. Both, not coincidentally, have excellent software programs and if they thought of or found a way to improve them even more, they'd do it in a heartbeat. They have no intention of standing still. That's why they are two of the best at what they want to do.

I am somewhat disillusioned not to find the Modeler software I want. EquiSim gets pretty darned close but I wanted to test the other package just to see if it would do what was purported right here in this forum on several occasions. It didn't do it.

And the beat, er, search goes on,

Bob

Bob Allen

04-06-2004, 09:50 PM

Jeevan and bucktron,

My apologies for getting your names wrong.

Next time you're in town the Eskimo Pies are on me,

Bob

B. Comin'

04-08-2004, 01:28 AM

Originally posted by JimG
Bob,

PS...On the other hand, I should probably post less

:D

Based on your current picture Jim, it looks like you need oxygen!:eek: