10-31-2012, 12:47 PM   #1
InControlX
Computer-Based Head-to-Head Handicapping

Several PA users have posted some interesting theoretical approaches to computer-based handicapping. I think the willingness to share concepts and ideas is positive, and that we can, to a certain extent, help each other out without giving up secrets. Sometimes, though, the shared concepts are too vague to be of practical use, and a more defined suggestion would be better. To that end I will attempt to illustrate a simplified, short version of a specific computer analysis technique I call "Head to Head," with which I've had good success. This technique uses no new-age thought processes and no artificial intelligence algorithms; it is more of a cookbook approach within a "test, verify, and validate" framework than a mysterious black box for selection.

If there is interest I will go into further detail, and yes, there is a lot more detail.

Disclaimers: Nothing is for sale here, nor will be. No guarantees. You might pitch your laptop against the wall in frustration if you attempt this. This is not claimed to be the best method ever nor the total solution to handicapping. You don't have to join a cult or go to grad school to try it. This is a method of handicapping, not a wagering strategy.

First, several things are needed to start:

1. Two years of delimited Chart Files.

2. The same two years of homemade or purchased delimited Past Performance Files.

3. Visual Basic (VB) and good familiarity with its use. VB is not hard to learn.

4. A relatively modern laptop or desktop computer with a 900+ GB hard drive.

5. Eight to Twelve key independent handicapping Boolean Parameters of YOUR OWN DESIGN.

6. A LOT of free time.

The method described has only been possible on commercially available laptop and desktop personal computers for the past eight or so years. Prior to that, the memory and speed requirements would have bogged down the machines. Perhaps future equipment advances will permit expansion of this process to larger arrays.

The basic Head to Head method is built around the concept that, unlike Blackjack, Roulette, and other games of pure chance and statistics, horse racing is a contest between competing entrant horses. I'm surprised how often this fact is ignored by approaches which import analysis methods from other fields. Playing cards don't compete to get to the top of the deck, don't have class levels, and aren't (legally) manipulated by their owners.

Let's take a quick jump past items 1 through 4 above, assume they are in hand or at least accessible, and delve into number 5.

If you've been at this game for a while, I'm sure you have some favorite things to see in an entry's past performances which indicate a good performance is pending under the right conditions. The key in setting up a Head to Head run is to define eight to twelve of them in Boolean (true/false) format. I am not going to divulge the parameters I use, and I don't recommend you divulge yours either. What I suggest is that you plug YOUR favorites into the Head to Head and see what you discover. What has usually happened for me is finding that I have to refine my initial parameter list and start over. Although tedious, this eventually generates a proven handicapping approach after three to five trials.

There is no rule against using pre-filters. If your best filters fit one particular class, say dirt routes, just run those. Note, however, that pre-filters will cut into your sample counts, and if too restrictive they will cause problems later. It's all a trade-off. You will note that this method does not pick spot plays, where a specific pattern is found which yields a high success rate, but rather finds races where the head-to-head competition stacks one entrant as a standout. In Head-to-Head analysis we automatically consider the strengths and weaknesses of all the entrants, not just the spot play pick. We also find the key combinations of Boolean Parameters that work best, without having to guess ahead of time.

The Boolean Parameters must reduce down to true/false determinations about the horse's past performances prior to the race. Of course, the more predictive the parameters are, the better your eventual results will be. However, if you focus too much on speed figures or other obvious indicators for the majority of your parameters, you will likely end up with a "morning line favorite" picker with a fine winning percentage (40%+) but a poor ROI (80% or so). Try to use your more obscure handicapping edges which are not so obvious or well known. An example set of Boolean Parameters (definitely not my best, but not bad) is:

(1) 1 Has made early position gain at this class before at Sprint Distances.

(2) 2 Has made late position gain at this class before at Sprint Distances.

(3) 4 Has made early position gain at this class before at Route Distances.

(4) 8 Has made late position gain at this class before at Route Distances.

(5) 16 Has made early position gain at greater than this class before at Sprint Distances.

(6) 32 Has made late position gain at greater than this class before at Sprint Distances.

(7) 64 Has made early position gain at greater than this class before at Route Distances.

(8) 128 Has made late position gain at greater than this class before at Route Distances.

(9) 256 Has made early position gain at less than this class before at Sprint Distances.

(10) 512 Has made late position gain at less than this class before at Sprint Distances.

(11) 1024 Has made early position gain at less than this class before at Route Distances.

(12) 2048 Has made late position gain at less than this class before at Route Distances.

Each Boolean Parameter is assigned a value from the consecutive powers of 2. These are shown after the index numbers above as 1, 2, 4, 8, ... 2048.

The purpose of defining the Boolean Parameters is to differentiate the entrants into a fixed number of possible Preparation Groups consisting of all combinations. An entrant's individual Preparation Group is identified by the sum of the values of its TRUE Boolean Parameters. In this example we have twelve Boolean Parameters, which yield 2^12 = 4096 possible Preparation Groups. Note that each Boolean Parameter MUST be answerable with a clear true/false response. There can be no gray area in the question.
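
As an illustration of that encoding, here is a minimal Python sketch (the real implementation described here is VB; the example answers below are made up) that folds twelve true/false answers into a single Preparation Group number:

Code:
# Fold the twelve Boolean Parameter answers into one Preparation Group number.
# Parameter (i+1) carries the value 2**i, so twelve parameters give groups 0..4095.

def preparation_group(answers):
    """answers: list of 12 True/False values, one per Boolean Parameter."""
    group = 0
    for i, is_true in enumerate(answers):
        if is_true:
            group += 2 ** i          # equivalent to setting bit i
    return group

# Parameters (1) and (11) true -> 1 + 1024 = 1025, the group shown for
# Coast of Sangria in the example race further below.
print(preparation_group([True] + [False] * 9 + [True, False]))   # 1025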

Now a word or two about Type Discriminators is needed. We have good evidence that not all races favor the same preparation. Therefore, we need to sub-categorize the race types by some logical division. For now, let's choose four types: dirt sprint, dirt route, turf sprint, and turf route. The artificial surfaces are included with "dirt," or you could add two more types for them.

The next step is to turn each Boolean Parameter into a Visual Basic program filter and, one by one, go through the first database year's charts with the VB program keeping track of each Preparation Pattern Group's record against each other group. A "competitive victory" is defined as the PGrp(X) entrant finishing in at least the top three positions and the PGrp(Y) entrant finishing behind the PGrp(X) entrant (PGrp(X) and PGrp(Y) are two Preparation Pattern Groups). The past performance files are opened to yield the data for testing the Boolean Parameters and determining each entrant's Preparation Pattern Group. In English, the result is a scorecard of each Preparation Pattern AGAINST each other Preparation Pattern over the year's database. Because we are counting head-to-head outcomes, we derive many more data samples than single-pattern win/loss analysis, which yields only one sample per race. A single six-horse field in this method provides fifteen samples; a field of nine yields twenty-four. A second run tabulates the results into a head-to-head comparison array over the first database year which holds the winning ratios of PGrp(X) vs. PGrp(Y):

HTH(PGrp(X), PGrp(Y), Type), where X = 0 to 4095, Y = 0 to 4095, and Type = 0 for dirt sprints, 1 for dirt routes, 2 for turf sprints, and 3 for turf routes.
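
To make the bookkeeping concrete, here is a minimal Python sketch of the tally and ratio steps for one race type (the input layout is hypothetical, and the exact sample accounting is only an approximation of the two VB passes described above):

Code:
# Tally "competitive victories" for one race type, then convert to winning ratios.
# finish_order: list of (preparation_group, finish_position) tuples for one race,
# sorted by finish_position. Hypothetical input layout.

from collections import defaultdict

wins = defaultdict(int)    # wins[(gx, gy)]  = competitive victories of gx over gy
meets = defaultdict(int)   # meets[(gx, gy)] = samples collected between gx and gy

def tally_race(finish_order):
    for i, (gx, pos_x) in enumerate(finish_order):
        if pos_x > 3:
            break                        # only top-three finishers score victories
        for gy, _pos_y in finish_order[i + 1:]:
            wins[(gx, gy)] += 1          # gx finished top three, ahead of gy
            meets[(gx, gy)] += 1
            meets[(gy, gx)] += 1

def hth_ratio(gx, gy):
    """Winning ratio of group gx over group gy; 0.5 when the pair never met."""
    n = meets[(gx, gy)]
    return wins[(gx, gy)] / n if n else 0.5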

A third program run uses the head-to-head comparison array derived above in an attempt to predict race outcomes in the second year's database of charts. This essentially repeats the second program run but adds a ranking determination, sorting each race by HTH matrix values to yield a Race Prediction Matrix on new data. We use the second year's charts to confirm or disprove predictability: if the first year's patterns result in a poor winning percentage and ROI when applied to the second year's data, we can hardly have confidence of success in future application to real-time entries. The Race Prediction Matrix is easier to understand by example. The attached figure is the Race Prediction Matrix for Belmont Race 1 on October 26, 2012. In the matrix, the head-to-head winning ratios are entered for each entrant vs. each other entrant for the given race type, in column-vs.-row alignment. These are the HTH array values previously determined. Each matrix column is summed, and that sum divided by the number of entrants minus one yields the Average Competitive Ratio for each entrant, printed beneath the matrix. It is the Average Competitive Ratio that is used to rank the entries.
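
A minimal Python sketch of that ranking step, continuing the tally sketch above (entrants is a hypothetical list of (program_number, preparation_group) pairs; hth_ratio comes from the previous sketch):

Code:
# Build one race's prediction-matrix column sums and rank entrants by
# Average Competitive Ratio. Uses hth_ratio() from the tally sketch.

def rank_race(entrants):
    """entrants: list of (program_number, preparation_group) tuples."""
    ratings = []
    for num_x, grp_x in entrants:
        # Matrix column for entrant X: its ratio against every other entrant.
        column = [hth_ratio(grp_x, grp_y)
                  for num_y, grp_y in entrants if num_y != num_x]
        acr = sum(column) / (len(entrants) - 1)    # Average Competitive Ratio
        ratings.append((acr, num_x, grp_x))
    return sorted(ratings, reverse=True)           # highest-rated entrant first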


The Average Competitive Ratios are then used to rank the entrants' predicted finishing positions from first to last:

Race 1 BEL 20121026 Predictive Rank

Dirt 1 M Purse 30,000

Fillies and Mares 3 Year Olds And Up CLAIMING ( $15,000 )

1 #6 pp 6 [0.739] PGrp= 1025 Coast of Sangria M/L= 2-1

2 #2 pp 2 [0.535] PGrp= 2060 Glynisthemenace M/L= 3-1

3 #1 pp 1 [0.498] PGrp= 3 Miss Brass Bonanza M/L= 20-1

4 #7 pp 7 [0.48] PGrp= 2176 Miss Libby M/L= 12-1

5 #8 pp 8 [0.474] PGrp= 2056 Destination Moon M/L= 15-1

6 #4 pp 4 [0.457] PGrp= 1 File Gumbo M/L= 8-5

7 #3 pp 3 [0.408] PGrp= 0 Katy's Office Girl M/L= 12-1

8 #5 pp 5 [0.408] PGrp= 0 So Much Heart M/L= 30-1

This race was chosen as an example because the first-place pick's Average Competitive Ratio (0.739 for #6) is much greater than the second-place pick's (0.535 for #2), indicating a considerable advantage (GAP = 0.739 - 0.535 = 0.204). Although in initial test runs I include all rankings, I later subdivide the results according to the GAP to establish a practical limit. An increasing winning percentage with GAP is a good sign that you're on to something.
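
In code terms, continuing the sketch above, the GAP test is just the difference between the top two Average Competitive Ratios (the 0.15 cutoff below is a placeholder for illustration, not a recommended value):

Code:
# GAP between the top two picks, using the output of rank_race() above.
ranked = rank_race(entrants)
gap = ranked[0][0] - ranked[1][0]
playable = gap >= 0.15     # tune the cutoff by tracking win % against GAP buckets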

A few other observations:

- In the matrix, the opposing row and column entries (i.e., 3 vs. 4 and 4 vs. 3) should add up to 1.000, because if PGrp(X) beats PGrp(Y) 0.600 or 60% of the time, PGrp(Y) must beat PGrp(X) 0.400 or 40% of the time.

- The last two rated entries in the example race have no "true" Boolean Parameters and thus are Preparation Group Zero. I use a race filter which skips any race that has a Group Zero entrant with a morning line of less than 5:1, and allows only one Group Zero entrant below 10:1.

- The Normalized Predictive Odds include a few other factors besides the Average Competitive Ratio; too messy to cover now.

- A large GAP between 2nd and 3rd picks, and 3rd and 4th picks, and so on can be tested and used for exotics in more elaborate wagering.

After test application over tens of thousands of races, a good evaluation of the predictive quality of the original Boolean Parameters is obtained. If the results are not good, adjust and try again. I usually have a laptop or two running continuously over parameter or filter iterations.

It's a good idea to continuously monitor the success rate of selection gap picks. I perform this on a monthly basis. It's also prudent to refine and optimize pre-filters and post-filters around a selected set of Boolean Parameters. In other words, it never ends.

ICX


Attached image: BEL_1.jpg (the Race Prediction Matrix for Belmont Race 1, October 26, 2012)

10-31-2012, 02:20 PM   #2
DeltaLover
Very interesting....

You are making a lot of progress in just a single post, covering way too many topics...

I think we need to go a bit slower taking them one by one.....

Let's start:

Quote:
Of course, the more predictive the parameters are, the better your eventual results will be.
Please clarify...

How do you measure how predictive a parameter is?

What you call a Boolean parameter I would define as a predicate function (with no side effects) returning True or False, accepting a single argument which must be a starter. This function implements a decision tree whose leaves consist of Boolean values.

Each starter has access to its own primitive data covering ratings, rankings, and figures, and also has access to all its competitors and their summarized rankings.

Based on this definition, the universe of all possible factors is infinite.

Your statement about how predictive a parameter is is fundamental to this approach, so I suggest we spend some time and effort clarifying exactly that...

What are your thoughts?

10-31-2012, 02:29 PM   #3
PICSIX
Are the "predictive ranks" basically a power rating assigned to each entrant?

10-31-2012, 02:54 PM   #4
DeltaLover
My understanding is that here we are talking about the predictiveness of a specific Boolean factor. Given a factor f, its predictiveness will be a number expressing its quality, so we will be able to compare it against another one.

10-31-2012, 03:12 PM   #5
InControlX
Quote:
Originally Posted by DeltaLover
How do you measure how predictive a parameter is?

What you call a Boolean parameter I would define as a predicate function (with no side effects) returning True or False, accepting a single argument which must be a starter.
Good questions and comments, DeltaLover.

I agree it's a lot for one post, but if I left too many gaps I thought it wouldn't tie together and the purpose would be lost.

I call the initial determinations Boolean simply because they are dimensioned as Visual Basic Boolean values, i.e., only true/false, so that they can correspond to a final unique binary sum of 2's power values.

The validation of a candidate predictive parameter is only realized by following the whole process through and seeing winning percentage/ROI improvement, although some obvious conclusions can be drawn as shortcuts. I've struggled with selecting these over the years, especially with class determinations and back time limits on qualifying races. I need improvements on several of them, but the neat thing about this approach is that it "automatically" identifies the key interrelationships between the Boolean Parameters. In essence, the results tell you how they combine for advantage (or not).

ICX

10-31-2012, 03:14 PM   #6
InControlX
Quote:
Originally Posted by PICSIX
Are the "predictive ranks" basically a power rating assigned to each entrant?
Correct, with the power rating being the average of the database head-to-head advantage ratios of each entry vs. each other entry.

ICX

10-31-2012, 04:13 PM   #7
DeltaLover
Quote:
I call the initial determinations Boolean simply because they are dimensioned as Visual Basic Boolean values, i.e., only true/false, so that they can correspond to a final unique binary sum of 2's power values
I am following a pretty similar approach. Representing all matching factors as a binary number makes it very easy to search for matches and patterns, either in a traditional relational database or just with a program written in a more imperative environment.

VB or any other language of the .NET environment, MySQL, SQL Server, and most similar technologies present some restrictions on the actual number of bits that can represent the individual factors. Since the largest natively supported integer type in these is the 64-bit long integer, handling more than 64 factors cannot easily be accomplished, and you have to add some complexity to allow your application to handle it... This is just one of the reasons I find PYTHON to be a great solution for any research and development project... It handles any integer, no matter what its size, in exactly the same way, without any need to cast or reallocate...
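
For example, a quick sketch of the point (the factor indices are arbitrary; Python integers simply grow as needed):

Code:
# Python integers have arbitrary precision, so a factor bit mask can exceed
# 64 bits without any special handling.
factor_mask = 0
for i in (0, 70, 99):                    # set factors 0, 70 and 99 (arbitrary)
    factor_mask |= 1 << i
print(factor_mask.bit_length())          # 100: well past a 64-bit long
print(bool(factor_mask & (1 << 70)))     # True: factor 70 is set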

Quote:
The validation of a candidate predictive parameter is only realized by following the whole process through and seeing winning percentage/ROI improvement, although some obvious conclusions can be drawn for shortcuts. I've struggled on selections of these over the years, especially with class determinations and back time limits on qualifying races
You have various alternatives you can use to quantify the effectiveness of a factor: its winning percentage, impact value, weighted impact value, ROI, and final PNL are some of them. I think, though, that a better approach would take into consideration its frequency and its final ROI. The reason I am adding its frequency, in other words how often it occurs, has to do with avoiding overfitting, which could be a serious issue as the factor granularity increases...
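
As one concrete way to quantify a factor along those lines, here is a minimal Python sketch computing frequency, win percentage, impact value, and flat-bet ROI for a single Boolean factor (the record layout is made up for illustration, and the sample is assumed non-empty):

Code:
# Quantify one Boolean factor: frequency, win %, impact value and $2 flat-bet ROI.
# starters: list of dicts with 'has_factor', 'won' and 'win_payoff' keys
# (hypothetical layout; win_payoff is the $2 win mutuel).

def factor_stats(starters):
    total = len(starters)
    winners = sum(1 for s in starters if s["won"])
    flagged = [s for s in starters if s["has_factor"]]
    flagged_wins = sum(1 for s in flagged if s["won"])

    frequency = len(flagged) / total            # how often the factor fires
    win_pct = flagged_wins / len(flagged)
    # Impact value: share of winners having the factor vs. share of starters having it.
    impact = (flagged_wins / winners) / (len(flagged) / total)
    roi = sum(s["win_payoff"] for s in flagged if s["won"]) / (2.0 * len(flagged))
    return frequency, win_pct, impact, roi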

10-31-2012, 04:30 PM   #8
Capper Al
Interesting approach.

10-31-2012, 04:31 PM   #9
InControlX
Quote:
Originally Posted by DeltaLover
This is just one of the reasons I find PYTHON to be a great solution for any research and development project... It handles any integer, no matter what its size, in exactly the same way, without any need to cast or reallocate...

The reason I am adding its frequency, in other words how often it occurs, has to do with avoiding overfitting, which could be a serious issue as the factor granularity increases...
Python looks very adaptable and more tolerant of integer size, but my problem with changing from VB is that I've got about fifty custom applications in VB I need for work, and I need to tweak them often. I've migrated platforms in the past and it's always been painful keeping the formats straight. That said, this method is certainly not restricted to VB.

I generally sort a new set of initial parameters (or modified previous ones) into a big results array including descriptor code elements for surface, purse, distance, race condition, etc., which I can use as post-filters. My rule of thumb has been a minimum filtered quantity of 1,000 samples per year. If I filter the sampling much below 1K, the error margin creeps up and, as you mention, the granularity produces over-optimistic results.
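
A rough Python sketch of that post-filter rule (the descriptor fields and the results list are made up for illustration):

Code:
# Group head-to-head results by post-filter descriptors and keep only groups
# with at least 1,000 samples per database year. Field names are illustrative.
from collections import defaultdict

MIN_SAMPLES_PER_YEAR = 1000

def usable_groups(results, years=1):
    """results: iterable of dicts with 'surface', 'distance' and 'won' keys."""
    counts = defaultdict(lambda: [0, 0])          # key -> [samples, wins]
    for r in results:
        key = (r["surface"], r["distance"])
        counts[key][0] += 1
        counts[key][1] += 1 if r["won"] else 0
    return {key: wins / n                         # win rate per usable group
            for key, (n, wins) in counts.items()
            if n >= MIN_SAMPLES_PER_YEAR * years}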

ICX

11-01-2012, 07:30 AM   #10
eurocapper
I'm afraid that, to me, this has some aspect of looking for the ultimate truth in horse racing instead of what is profitable for the time being. I believe research-oriented people are prone to this; personally, I would focus on value or longshot analysis.

11-01-2012, 09:12 AM   #11
vegasone
This looks to me like genetic algorithms.

11-01-2012, 09:36 AM   #12
InControlX
Quote:
Originally Posted by eurocapper
I'm afraid that, to me, this has some aspect of looking for the ultimate truth in horse racing instead of what is profitable for the time being. I believe research-oriented people are prone to this; personally, I would focus on value or longshot analysis.
You get out a ranking based upon the quality of your initial parameters. If you use common inputs you get common outputs. The trick is to start with some parameters that define an advantage not widely applied by others.

ICX

11-01-2012, 09:37 AM   #13
InControlX
Quote:
Originally Posted by vegasone
This looks to me like genetic algorithms.
Assuming you mean generic, you are correct. There is no unusual math involved in the method.

ICX

11-01-2012, 11:24 AM   #14
DeltaLover
Quote:
Originally Posted by InControlX
You get out a ranking based upon the quality of your initial parameters. If you use common inputs you get common outputs. The trick is to start with some parameters that define an advantage not widely applied by others.

ICX

It is not very clear what you mean when you say to start with some parameters defining an advantage not widely applied by others.

The most primitive level of your data will be common to everyone and published in the public domain. Going a level higher, we have a normalized form of this common data that can be presented using many different methodologies: Ragozin, Thoro-Graph, Beyer, Equibase, or CJ figures are all measuring the same dimension using very common data (there can be some additional info in some of them, as Ragozin or Brown claim), but these 'numbers' will all be correlated enough to be considered at least similar. These levels can be composed into even higher-level 'ratings', like BRIS Prime Power for example, serving as an index that can describe a starter with a single-dimensional number as opposed to an array of speed figures for each start in the past... Then we can add another layer expressing a quality of the event itself, like for example an index describing how 'much' speed or stamina is present.

Based on this, I can see the advantage not coming from the parameters themselves but from another synthetic level combining these layers of data with the public's perception of them (as expressed in the pools), always searching for inefficiencies that leave some room for price correction.

11-01-2012, 11:25 AM   #15
DeltaLover
Quote:
Originally Posted by vegasone
This looks to me like genetic algorithms.
Can you please explain better what looks like a genetic algorithm?