PDA

View Full Version : DAYS SINCE LAST START


mountainman
11-18-2012, 01:52 PM
Have query for someone with a d-base. What are the avg, median, and most frequent number of days between starts for american t-breds? Please exclude laid up horses, if possible. Suggest cutoff point at 45 days. Need info for day job. Tx in advance.

DeltaLover
11-18-2012, 02:42 PM
average days off: 43.0567870884
stdev : 63.9649372945

Sample size: 143,999

I will post a document with the analytical breakdown later today... Right now I am at the track and cannot access my ftp server.
Please let me know if you need more info

JustRalph
11-18-2012, 02:57 PM
My stuff is almost exactly the same as Delta Lover.

I also find that horses that start between 55 and 70 days are less likely to win, even if they are Favs

After 70 days it's basically "laid up" to me and the numbers go all over the place

mountainman
11-18-2012, 04:52 PM
Tx guys, but i'm more looking for the avg interval between starts for horses not rested for a significant period or potentially removed from training. So, could you possibly supply the avg number of days since last start for all horses running back in less than 45 days (an arbitrary time-frame selected in part to coincide with drf's "layup line")?

This is not requested for handicapping purposes, but instead to be referenced in negotiations with horsemen concerning the rotating of races in our condition book.

davew
11-18-2012, 06:04 PM
I would think you are asking a chicken-or-egg, which-came-first type of question.

The condition book rotation at a specific track would depend on length of track season, number of horses at track, type of horses at track, and whether big groups of horses enter/leave as other tracks open/close.

As long as you are negotiating with horsemen, why not ask them what number of days they would like before returning their horses for their next race. I suspect, as a bettor, that many trainers would like the 10-14 day range, but surely it would be individual-horse dependent.

mountainman
11-18-2012, 06:23 PM
The condition book rotation at a specific track would depend on length of track season, number of horses at track, type of horses at track, and whether big groups of horses enter/leave as other tracks open/close.



Indeed. Having served as a racing official for some 23 years, i can attest that you're, in large part, correct. But the chicken/ egg analogy doesn't really apply to the huge sample size i need for representative data.

And, believe me, if you poll 20 horsemen on how to write a condition book, you will get 20 different - and EXTREMELY self-serving - answers. Each will want the book tailored to his stock and particular training methods.

In addition, the info i've requested won't be used in tooling the condition book, but in supporting our argument for the way it's currently written.

Jeff P
11-18-2012, 08:33 PM
Tx guys, but i'm more looking for the avg interval between starts for horses not rested for a significant period or potentially removed from training. So, could you possibly supply the avg number of days since last start for all horses running back in less than 45 days (an arbitrary time-frame selected in part to coincide with drf's "layup line")?

This is not requested for handicapping purposes, but instead to be referenced in negotiations with horsemen concerning the rotating of races in our condition book.

Here's what I have in the database for all starters (thoroughbred only) from Jan 01, 2012 through yesterday Sat Nov 16, 2012 that a.) made a start within that date range and b.) whose most recent race back was within 45 days of the start recorded in the database. (I think that's what you are asking for.)

First, all starters in the database fitting that description:
query start: 11/18/2012 5:05:11 PM
query end: 11/18/2012 5:08:52 PM
elapsed time: 221 seconds

Data Window Settings:
Connected to: C:\JCapper\exe\JCapper2.mdb
999 Divisor Odds Cap: None

SQL: SELECT * FROM STARTERHISTORY
WHERE RACEDAYS <= 45
AND [DATE] >= #01-01-2012#


Data Summary Win Place Show
----------------------------------------------
Mutuel Totals 411880.00 415724.30 417524.60
Bet -546000.00 -546000.00 -546000.00
Gain -134120.00 -130275.70 -128475.40

Wins 36472 72951 106499
Plays 273000 273000 273000
PCT .1336 .2672 .3901

ROI 0.7544 0.7614 0.7647
Avg Mut 11.29 5.70 3.92

Next, data for the above starters broken out by days last start:
By: Recent Activity- Days Last Start

>=Min <= Max Gain Bet Roi Wins Plays Pct Impact
-----------------------------------------------------------------------------
1 4 -139.70 626.00 0.7768 51 313 .1629 1.2196
5 9 -9982.40 38682.00 0.7419 2484 19341 .1284 0.9613
10 14 -28912.60 112950.00 0.7440 7274 56475 .1288 0.9641
15 19 -26839.10 104500.00 0.7432 6877 52250 .1316 0.9852
20 24 -28604.70 112780.00 0.7464 7444 56390 .1320 0.9881
25 29 -16940.80 75702.00 0.7762 5168 37851 .1365 1.0220
30 34 -9005.20 39422.00 0.7716 2767 19711 .1404 1.0508
35 39 -7921.60 32838.00 0.7588 2377 16419 .1448 1.0836
40 44 -5223.80 26106.00 0.7999 1870 13053 .1433 1.0723
45 45 -550.10 2394.00 0.7702 160 1197 .1337 1.0005

How to read...

There were exactly 273,000 starters in the sample who came back to race within 45 days of their most recent start.

Each row in the second section shows a min and max range for days last start... For example, the 3rd row in that section is for horses whose most recent start was between 10 and 14 days inclusive. There were 56,475 starters in that row (representing 20.69% of the total sample.)

Hope that helps.


-jp

.

Jeff P
11-18-2012, 08:42 PM
Same query... MNR Only:

query start: 11/18/2012 5:36:15 PM
query end: 11/18/2012 5:36:30 PM
elapsed time: 15 seconds

Data Window Settings:
Connected to: C:\JCapper\exe\JCapper2.mdb
999 Divisor Odds Cap: None

SQL: SELECT * FROM STARTERHISTORY
WHERE TRACK='MNR'
AND RACEDAYS <= 45
AND [DATE] >= #01-01-2012#


Data Summary Win Place Show
--------------------------------------------
Mutuel Totals 17418.00 17617.00 18229.40
Bet -22812.00 -22812.00 -22812.00
Gain -5394.00 -5195.00 -4582.60

Wins 1549 3040 4345
Plays 11406 11406 11406
PCT .1358 .2665 .3809

ROI 0.7635 0.7723 0.7991
Avg Mut 11.24 5.80 4.20

By: Recent Activity- Days Last Start

>=Min <= Max Gain Bet Roi Wins Plays Pct Impact
-----------------------------------------------------------------------------
1 4 17.60 28.00 1.6286 6 14 .4286 3.1558
5 9 -406.60 2120.00 0.8082 138 1060 .1302 0.9586
10 14 -1090.00 4944.00 0.7795 354 2472 .1432 1.0545
15 19 -1085.00 4780.00 0.7730 345 2390 .1444 1.0629
20 24 -1089.60 5404.00 0.7984 355 2702 .1314 0.9674
25 29 -704.00 2180.00 0.6771 133 1090 .1220 0.8985
30 34 -491.80 1484.00 0.6686 103 742 .1388 1.0222
35 39 -572.80 1034.00 0.4460 61 517 .1180 0.8688
40 44 -1.80 756.00 0.9976 48 378 .1270 0.9350
45 45 30.00 82.00 1.3659 6 41 .1463 1.0776




-jp

.

Overlay
11-18-2012, 08:49 PM
Jeff's numbers indicate why I was having trouble understanding the statements by DeltaLover and JustRalph to the effect of their data showing that the average days off between starts for all horses was slightly over 43 days. According to Jeff's figures, the great bulk of starters in his sample (which I agree was limited to horses coming back within 45 days) came back within 30 days or less. I find it difficult to believe that there would be enough horses taking a layoff of over 45 days between starts to make the composite average for all horses work out to 43 days, especially with the way the number of horses starts dropping off dramatically in Jeff's sample once you get past 30 days. Or am I misinterpreting something about DeltaLover's and JustRalph's data/statements?

Jeff P
11-18-2012, 09:03 PM
For the all starters coming back to race within 45 days sample above:

Sum of days last start all starters: 5,822,538
---Number of starters in the sample: 273,000
---Avg number of days since last race: 21.33




For the MNR only sample above:

Sum of days last start all starters: 226,089
--Number of starters in the sample: 11,406
-Avg number of days since last race: 19.82





-jp

.

DeltaLover
11-18-2012, 09:05 PM
complete breakdown in csv format:

http://kasosoft.com/days_off_per_starter.csv

Jeff P
11-18-2012, 09:40 PM
Jeff's numbers indicate why I was having trouble understanding the statements by DeltaLover and JustRalph to the effect of their data showing that the average days off between starts for all horses was slightly over 43 days. According to Jeff's figures, the great bulk of starters in his sample (which I agree was limited to horses coming back within 45 days) came back within 30 days or less. I find it difficult to believe that there would be enough horses taking a layoff of over 45 days between starts to make the composite average for all horses work out to 43 days, especially with the way the number of horses starts dropping off dramatically in Jeff's sample once you get past 30 days. Or am I misinterpreting something about DeltaLover's and JustRalph's data/statements?

Running all starters in the database that a.) raced between Jan 01, 2012 and Sat Nov 16, 2012 and b.) had at least 1 lifetime start... I get roughly the same as what Ralph and DeltaLover came up with:

Sum of days last start all starters: 14,031,112
---Number of starters in the sample: 331,274
---Avg number of days since last race: 42.36

Apparently limiting the query to just those coming back within 45 days does make a difference.


-jp

.

Jeff P
11-18-2012, 10:06 PM
Tim, Here's what that bigger sample looks like:

Data Window Settings:
Connected to: C:\JCapper\exe\JCapper2.mdb
999 Divisor Odds Cap: None

SQL: SELECT * FROM STARTERHISTORY
WHERE STARTSLIFETIME > 0
AND [DATE] >= #01-01-2012#


Data Summary Win Place Show
----------------------------------------------
Mutuel Totals 498420.80 502488.50 501650.20
Bet -662548.00 -662548.00 -662548.00
Gain -164127.20 -160059.50 -160897.80

Wins 43595 86765 126536
Plays 331274 331274 331274
PCT .1316 .2619 .3820

ROI 0.7523 0.7584 0.7572
Avg Mut 11.43 5.79 3.96

By: Recent Activity- Days Last Start

>=Min <= Max Gain Bet Roi Wins Plays Pct Impact
-----------------------------------------------------------------------------
1 14 -39034.70 152258.00 0.7436 9809 76129 .1288 0.9791
15 29 -72384.60 292982.00 0.7529 19489 146491 .1330 1.0109
30 44 -22150.60 98366.00 0.7748 7014 49183 .1426 1.0837
45 59 -7791.40 35672.00 0.7816 2451 17836 .1374 1.0442
60 74 -2930.30 16294.00 0.8202 1143 8147 .1403 1.0661
75 89 -2356.90 9172.00 0.7430 573 4586 .1249 0.9494
90 104 -1534.30 6106.00 0.7487 369 3053 .1209 0.9184
105 119 -1456.60 4644.00 0.6863 256 2322 .1102 0.8378
120 134 -856.30 3894.00 0.7801 206 1947 .1058 0.8040
135 149 -753.40 3672.00 0.7948 217 1836 .1182 0.8981
150 164 -1175.60 3422.00 0.6565 166 1711 .0970 0.7372
165 179 -1174.20 3470.00 0.6616 173 1735 .0997 0.7577
180 194 -1343.70 3666.00 0.6335 201 1833 .1097 0.8333
195 209 -543.90 3306.00 0.8355 205 1653 .1240 0.9424
210 224 -1114.30 3650.00 0.6947 193 1825 .1058 0.8036
225 239 -1280.30 3324.00 0.6148 171 1662 .1029 0.7818
240 254 -729.90 2708.00 0.7305 146 1354 .1078 0.8194
255 269 -578.50 2490.00 0.7677 140 1245 .1124 0.8545
270 999999 -4937.70 13452.00 0.6329 673 6726 .1001 0.7603


It's not that the larger sample differs materially from the smaller samples... it doesn't.

It's that none of the starters in the smaller sample have a days last race number greater than 45... while enough of the starters in the larger sample do. Once you start adding up the total number of days and dividing by the number of starters to get an avg... the avg days last race per starter does differ significantly in the two samples.

Hope I managed to get most of that out in a way that makes sense.


-jp

.
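For anyone who wants to see the arithmetic, here is a minimal Python sketch with invented days-off values (not the JCapper data) showing how a handful of long layoffs pulls the mean up while the median barely moves, and how a 45-day cutoff changes the picture:

import statistics

# Hypothetical days-since-last-start values: mostly active runners,
# plus a few horses coming back from long layoffs.
days_off = [12, 14, 17, 21, 21, 24, 28, 30, 33, 38, 44, 120, 210, 365]

active_only = [d for d in days_off if d <= 45]  # mimic the 45-day cutoff

print("all starters : mean %.1f  median %.1f" %
      (statistics.mean(days_off), statistics.median(days_off)))
print("<= 45 days   : mean %.1f  median %.1f" %
      (statistics.mean(active_only), statistics.median(active_only)))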

Overlay
11-18-2012, 10:17 PM
Thank you, Jeff. The absolute number of long-layoff horses is not that great, but I hadn't been giving sufficient consideration to the fact that it only takes a relatively small number that have been off for hundreds of days to drive the mean layoff up for the entire population, even if the values for the median and the mode would be much lower, and possibly more representative of the population as a whole.

traynor
11-18-2012, 10:43 PM
Thank you, Jeff. The absolute number of long-layoff horses is not that great, but I hadn't been giving sufficient consideration to the fact that it only takes a relatively small number that have been off for hundreds of days to drive the mean layoff up for the entire population, even if the values for the median and the mode would be much lower, and possibly more representative of the population as a whole.

The same thing happens with ROI figures with a few aberrant mutuels tossed into the mix.

AITrader
11-19-2012, 04:33 AM
Quick 'n dirty formula to optimize days-last-race is:

( days-last-race_horse - days-last-race-avg-for-this-race ) / days-last-race-std-dev-for-this-race

This assumes that trainers/owners will, on average, optimize layoffs for this particular class, gender, type, etc of horse.

Horses with unknown or no history should be set to the average value.

I also recommend scaling values from 0.5 to -0.5, or whatever normalization is appropriate for your system.
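A rough Python sketch of the per-race standardization described above; the function and field names are made up, and the final scaling into roughly -0.5..0.5 is just one possible normalization:

from statistics import mean, pstdev

def recency_scores(days_last_race, clip=0.5):
    """Standardize days-last-race within a single race's field.

    days_last_race has one entry per starter; None marks a first-time
    starter or a horse with unknown history.
    """
    known = [d for d in days_last_race if d is not None]
    avg = mean(known)
    sd = pstdev(known) or 1.0          # avoid division by zero
    # Unknown-history horses get the field average (a z-score of 0).
    raw = [((d - avg) / sd) if d is not None else 0.0 for d in days_last_race]
    # Crude normalization into roughly [-clip, clip].
    biggest = max(abs(r) for r in raw) or 1.0
    return [clip * r / biggest for r in raw]

print(recency_scores([14, 21, 35, None, 60, 9]))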

raybo
11-19-2012, 08:00 AM
The same thing happens with ROI figures with a few aberrant mutuels tossed into the mix.

Agree, that's why one must scan the individual numbers and/or look at median as well, etc..

Overlay
11-19-2012, 08:50 AM
The same thing happens with ROI figures with a few aberrant mutuels tossed into the mix.
That's why I'm surprised with the number of posts that I see on the board that emphasize the ROI of various methods (especially when comparing one method with another), since that statistic is subject to outlier bias (as you note), and also to shrinkage as more people begin to play the method (assuming that everyone who plays it ends up betting the same selections).

DeltaLover
11-19-2012, 09:33 AM
That's why I'm surprised with the number of posts that I see on the board that emphasize the ROI of various methods (especially when comparing one method with another), since that statistic is subject to outlier bias (as you note), and also to shrinkage as more people begin to play the method (assuming that everyone who plays it ends up betting the same selections).


I could not agree more...

The following is the outcome of a genetic algorithm that is optimizing for bankroll growth. The fitness is the PNL of a hypothetical 10,000 starting bankroll.

Starters with less than 2-1 odds are not considered, while winners at more than 10-1 are truncated to 10-1.

The training universe consists of 5,000 randomly selected races:

(I've just realized that what appears as win% is actually win probability and has to be multiplied by 100 to become a percentage)



generation#: 149

All chromosomes for generation

BF: 0.40 total bets: 2008 win: 0.21% ROI: 1.22 mean odds: 5.42 max odds: 10.00 min odds: 2.05 fitness = 43625.0
BF: 0.40 total bets: 2008 win: 0.21% ROI: 1.22 mean odds: 5.42 max odds: 10.00 min odds: 2.05 fitness = 43625.0
BF: 0.40 total bets: 2008 win: 0.21% ROI: 1.22 mean odds: 5.42 max odds: 10.00 min odds: 2.05 fitness = 43625.0
BF: 0.67 total bets: 3358 win: 0.16% ROI: 0.88 mean odds: 6.14 max odds: 10.00 min odds: 2.05 fitness = -40695.0
BF: 0.06 total bets: 286 win: 0.25% ROI: 1.10 mean odds: 3.98 max odds: 10.00 min odds: 2.05 fitness = 2995.0
BF: 0.62 total bets: 3112 win: 0.09% ROI: 0.68 mean odds: 8.22 max odds: 10.00 min odds: 2.05 fitness = -100370.0
BF: 0.64 total bets: 3199 win: 0.10% ROI: 0.68 mean odds: 8.17 max odds: 10.00 min odds: 2.05 fitness = -101875.0
BF: 0.01 total bets: 52 win: 0.37% ROI: 1.32 mean odds: 3.04 max odds: 8.60 min odds: 2.05 fitness = 1640.0
BF: 0.64 total bets: 3186 win: 0.13% ROI: 0.81 mean odds: 6.46 max odds: 10.00 min odds: 2.05 fitness = -59215.0
BF: 0.10 total bets: 516 win: 0.23% ROI: 1.11 mean odds: 4.43 max odds: 10.00 min odds: 2.05 fitness = 5755.0
BF: 0.55 total bets: 2739 win: 0.14% ROI: 0.85 mean odds: 6.47 max odds: 10.00 min odds: 2.05 fitness = -40770.0
BF: 0.02 total bets: 86 win: 0.24% ROI: 1.29 mean odds: 4.40 max odds: 10.00 min odds: 2.05 fitness = 2480.0
BF: 0.67 total bets: 3325 win: 0.16% ROI: 0.96 mean odds: 6.24 max odds: 10.00 min odds: 2.05 fitness = -14535.0
BF: 0.72 total bets: 3579 win: 0.11% ROI: 0.76 mean odds: 7.62 max odds: 10.00 min odds: 2.05 fitness = -85750.0
BF: 0.00 total bets: 10 win: 0.40% ROI: 2.23 mean odds: 6.31 max odds: 10.00 min odds: 2.10 fitness = 1225.0
BF: 0.00 total bets: 18 win: 0.39% ROI: 1.56 mean odds: 3.84 max odds: 10.00 min odds: 2.10 fitness = 1015.0
BF: 0.54 total bets: 2684 win: 0.17% ROI: 1.06 mean odds: 6.42 max odds: 10.00 min odds: 2.05 fitness = 15470.0
BF: 0.64 total bets: 3222 win: 0.13% ROI: 0.81 mean odds: 7.61 max odds: 10.00 min odds: 2.05 fitness = -61590.0
BF: 0.04 total bets: 197 win: 0.23% ROI: 0.91 mean odds: 3.57 max odds: 10.00 min odds: 2.05 fitness = -1785.0
BF: 0.58 total bets: 2898 win: 0.14% ROI: 0.91 mean odds: 6.87 max odds: 10.00 min odds: 2.05 fitness = -27395.0
BF: 0.22 total bets: 1081 win: 0.09% ROI: 0.64 mean odds: 8.26 max odds: 10.00 min odds: 2.05 fitness = -39220.0
BF: 0.12 total bets: 625 win: 0.06% ROI: 0.51 mean odds: 9.20 max odds: 10.00 min odds: 2.15 fitness = -30795.0
BF: 0.15 total bets: 747 win: 0.08% ROI: 0.50 mean odds: 8.25 max odds: 10.00 min odds: 2.10 fitness = -37655.0
BF: 0.50 total bets: 2520 win: 0.16% ROI: 0.80 mean odds: 5.42 max odds: 10.00 min odds: 2.05 fitness = -50910.0
BF: 0.17 total bets: 858 win: 0.21% ROI: 1.19 mean odds: 5.43 max odds: 10.00 min odds: 2.05 fitness = 16340.0
BF: 0.51 total bets: 2549 win: 0.05% ROI: 0.40 mean odds: 9.41 max odds: 10.00 min odds: 2.15 fitness = -154170.0
BF: 0.24 total bets: 1222 win: 0.07% ROI: 0.46 mean odds: 8.11 max odds: 10.00 min odds: 2.05 fitness = -66505.0
BF: 0.25 total bets: 1254 win: 0.10% ROI: 0.66 mean odds: 7.94 max odds: 10.00 min odds: 2.05 fitness = -42165.0
BF: 0.15 total bets: 770 win: 0.23% ROI: 1.17 mean odds: 4.53 max odds: 10.00 min odds: 2.05 fitness = 12895.0
BF: 0.19 total bets: 939 win: 0.17% ROI: 0.88 mean odds: 5.22 max odds: 10.00 min odds: 2.05 fitness = -10865.0
BF: 0.41 total bets: 2048 win: 0.05% ROI: 0.44 mean odds: 9.16 max odds: 10.00 min odds: 2.10 fitness = -114350.0
BF: 0.67 total bets: 3368 win: 0.14% ROI: 0.87 mean odds: 6.69 max odds: 10.00 min odds: 2.05 fitness = -45255.0
BF: 0.80 total bets: 3983 win: 0.07% ROI: 0.47 mean odds: 8.75 max odds: 10.00 min odds: 2.05 fitness = -209940.0
BF: 0.20 total bets: 1015 win: 0.18% ROI: 0.98 mean odds: 5.23 max odds: 10.00 min odds: 2.05 fitness = -2520.0
BF: 0.63 total bets: 3147 win: 0.08% ROI: 0.47 mean odds: 7.73 max odds: 10.00 min odds: 2.05 fitness = -165895.0
BF: 0.71 total bets: 3542 win: 0.15% ROI: 0.84 mean odds: 5.81 max odds: 10.00 min odds: 2.05 fitness = -55760.0
BF: 0.59 total bets: 2940 win: 0.10% ROI: 0.75 mean odds: 8.36 max odds: 10.00 min odds: 2.05 fitness = -72890.0
BF: 0.15 total bets: 738 win: 0.20% ROI: 1.12 mean odds: 5.25 max odds: 10.00 min odds: 2.05 fitness = 8935.0
BF: 0.71 total bets: 3553 win: 0.11% ROI: 0.61 mean odds: 6.75 max odds: 10.00 min odds: 2.05 fitness = -138195.0
BF: 0.00 total bets: 15 win: 0.33% ROI: 1.76 mean odds: 3.25 max odds: 8.00 min odds: 2.05 fitness = 1135.0

number of chromosomes: 40
Elitism Factor: 0.05
Mutation Rate: 0.18

winner chromosome:
0.287 -0.145 -0.100 0.354 -0.055 -0.243 0.428 0.152 0.202 0.464 0.335 -0.020 Fitness: 43625.0

************************************************** **************************************************

The winner behavior is the following:

BF: 0.40 total bets: 2008 win: 0.21% ROI: 1.22 mean odds: 5.42 max odds: 10.00 min odds: 2.05 fitness = 43625.0


Now back-testing this chromosome against another randomly selected universe of 6,100 races, none of them included in the original training set, we get the following behavior:


>ga_example.py
final pnl: 8775.0
all races: 6100
all best: 2447
all winners: 495
win: 20.23%
roi: 1.03586023702

Again, I skip odds of less than 2-1, and if a winner is more than 10-1 I reset it to 10-1.


This is the same run without odds restrictions:


>ga_example.py
final pnl: 23150.0
all races: 6100
all best: 2447
all winners: 495
win: 20.23%
roi: 1.09460563956

I think this is a good demonstration of how misleading outliers can be in the calculation of ROI.
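To make the effect concrete, here is a minimal Python sketch (invented payoffs, flat $2 win bets, not DeltaLover's code) showing how a single aberrant mutuel swings the ROI, and how truncating winners above 10-1, as described above, changes it:

def roi(payoffs, bet=2.0, cap_odds=None):
    """ROI on flat win bets; payoffs are $2 mutuels, 0.0 for losers.

    If cap_odds is given, winning mutuels above that odds level are
    truncated, e.g. cap_odds=10 caps a $68.00 mutuel at $22.00.
    """
    cap = None if cap_odds is None else bet * (cap_odds + 1)
    returned = sum(min(p, cap) if (cap and p > 0) else p for p in payoffs)
    return returned / (bet * len(payoffs))

# One aberrant $68 mutuel in an otherwise losing series of 50 bets.
results = [0.0] * 45 + [6.40, 5.20, 8.80, 7.00, 68.00]

print("ROI, raw      : %.2f" % roi(results))
print("ROI, 10-1 cap : %.2f" % roi(results, cap_odds=10))

The single capped payoff moves the series from near break-even to a deep loss, which is the point being made here.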

traynor
11-19-2012, 09:35 AM
That's why I'm surprised with the number of posts that I see on the board that emphasize the ROI of various methods (especially when comparing one method with another), since that statistic is subject to outlier bias (as you note), and also to shrinkage as more people begin to play the method (assuming that everyone who plays it ends up betting the same selections).

It is even more surprising to discover how adamantly opposed otherwise rational people become when it is suggested their ROI figures may be skewed by a few unusual (and unlikely to repeat) mutuel payoffs. That data is skewed by outliers is a given, and most researchers routinely correct for such. Accepting an unusually fast or unusually slow time on a given day and throwing it into an average figure for a DTV, for example, is an error that only a rank amateur would make. The same caution should be exercised in other areas of research as well--including calculations of average days off.

raybo
11-19-2012, 09:44 AM
That's why I'm surprised with the number of posts that I see on the board that emphasize the ROI of various methods (especially when comparing one method with another), since that statistic is subject to outlier bias (as you note), and also to shrinkage as more people begin to play the method (assuming that everyone who plays it ends up betting the same selections).

Agree. The method developer must include user options/customizations that allow emphasis to be placed on goals other than positive ROI alone, such as number of plays, hit rate %, profit, etc.

traynor
11-19-2012, 09:46 AM
I could not agree more...

...

Again, I skip odds of less than 2-1, and if a winner is more than 10-1 I reset it to 10-1.

I think this is a good demonstration of how misleading outliers can be in the calculation of ROI.

Of all things related to horse race analysis, I think the one factor that will contribute more to becoming a consistent winner than any other is correcting for mutuel outliers. There are few things more frustrating than chasing rainbows believing that profit exists where it does not. And there are few things more beneficial than realizing that the apparent profit does not really exist, and that continued study and research is necessary to find it.

The upside is that when models are controlled for outliers, when and if those outliers crop up in future results, the actual ROI is always better than anticipated.

traynor
11-19-2012, 09:55 AM
Agree. The method developer must include user options/customizations that allow emphasis to be placed on goals other than positive ROI alone, such as number of plays, hit rate %, profit, etc.

Most useful is the ability to search results with toggles for odds ranges. When a model is developed that shows a positive ROI, rather than betting on it with both hands, the researcher is well advised to re-run the same data with the odds range set to a more reasonable range to see if the positive ROI still exists. That is, with filters set in a given pattern, an ROI that shows up when "any odds" are searched may change substantially when "only odds <= 10-1" are considered. Especially with small samples.

I understand that the topic of this thread is days off. My comments are not meant as a digression, but rather to point out the necessity of controlling for outliers in ANY type of serious research.
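A minimal sketch of the re-run traynor describes, assuming you have (odds, $2 mutuel) pairs for each qualifier; it simply recomputes ROI for "any odds" versus "odds <= 10-1" (the data are invented):

def roi_for_range(bets, max_odds=None, wager=2.0):
    """bets: list of (odds, $2 mutuel) tuples, mutuel 0.0 for losers."""
    subset = [m for o, m in bets if max_odds is None or o <= max_odds]
    return sum(subset) / (wager * len(subset)), len(subset)

# Invented qualifiers: one 22-1 winner carries the "any odds" ROI.
winners = [(22.0, 46.00), (3.2, 8.40), (4.1, 6.20)]
losers = [(o, 0.0) for o in (1.8, 2.5, 5.0, 6.3, 7.0, 9.5, 12.0, 3.0, 4.5,
                             8.0, 2.2, 5.5, 6.8, 3.7, 15.0, 4.0, 9.0)]
sample = winners + losers

for label, cap in (("any odds", None), ("odds <= 10-1", 10)):
    r, n = roi_for_range(sample, cap)
    print("%-12s ROI %.2f on %d plays" % (label, r, n))

With the longshot included the filter looks profitable; restricted to 10-1 and below it is a clear loser, which is exactly the kind of check being recommended.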

Dave Schwartz
11-19-2012, 10:18 AM
Most useful is the ability to search results with toggles for odds ranges. When a model is developed that shows a positive ROI, rather than betting on it with both hands, the researcher is well advised to re-run the same data with the odds range set to a more reasonable range to see if the positive ROI still exists. That is, with filters set in a given pattern, an ROI that shows up when "any odds" are searched may change substantially when "only odds <= 10-1" are considered. Especially with small samples.

Traynor,

Another technique that I like is adding some randomness. If a horse wins by a nostril hair and pays $68, that should have a different value than if the horse wins by 12 lengths.


And sometimes thread drift is a good thing.


Dave

mountainman
11-19-2012, 11:28 AM
For the all starters coming back to race within 45 days sample above:

Sum of days last start all starters: 5,822,538
---Number of starters in the sample: 273,000
---Avg number of days since last race: 21.33




For the MNR only sample above:

Sum of days last start all starters: 226,089
--Number of starters in the sample: 11,406
-Avg number of days since last race: 19.82





-jp

.

More than encompassing, perfectly distilled, and supportive of my argument. Many thanks, sir. As always, your thoroughness astounds me.

DeltaLover
11-19-2012, 11:30 AM
Most useful is the ability to search results with toggles for odds ranges. When a model is developed that shows a positive ROI, rather than betting on it with both hands, the researcher is well advised to re-run the same data with the odds range set to a more reasonable range to see if the positive ROI still exists. That is, with filters set in a given pattern, an ROI that shows up when "any odds" are searched may change substantially when "only odds <= 10-1" are considered. Especially with small samples.

I understand that the topic of this thread is days off. My comments are not meant as a digression, but rather to point out the necessity of controlling for outliers in ANY type of serious research.

I agree.

Instead of one generic model, using several more specialized models based on odds rank or odds range makes the whole process more reliable.

For example a complete strategy might consist of the following models:

Favorite
Less than 5-1
Less than 10-1
More than 10-1

We can specialize each model to produce more than one opinion, meaning that the more-than-10-1 model might give us none, one, two or more candidates, or we can create negative models focusing on very low returns.

Having more than one final selection can be a sign to pass, at first glance at least, although things get more complicated if we consider the possibilities of dutching or exotics, and most likely we need another model to make this decision for us.

Although I am reluctant to bet long exotic propositions, the low-takeout pick 5 running in CA looks interesting, and multi-selection models might be the way to go. I still find it hard, though, to bet more than one starter for the first spot, albeit such an approach seems like a necessity to attack this pool.... I would certainly like it better if the minimum were not set at fifty cents but was higher, preferably two dollars...
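For illustration only, a tiny Python sketch of routing each starter to one of the specialized models listed above; the band boundaries and the scoring callables are placeholders, not anyone's actual system:

# Hypothetical routing of a starter to one of the specialized models above.
def band(odds, is_favorite):
    if is_favorite:
        return "favorite"
    if odds < 5:
        return "under_5_1"
    if odds < 10:
        return "under_10_1"
    return "over_10_1"

# Placeholder models: each would be trained separately on its own odds band.
models = {name: (lambda starter: 0.0) for name in
          ("favorite", "under_5_1", "under_10_1", "over_10_1")}

def score_race(starters):
    """starters: list of dicts with 'odds' and 'is_favorite' keys."""
    return [models[band(s["odds"], s["is_favorite"])](s) for s in starters]

print(score_race([{"odds": 2.4, "is_favorite": True},
                  {"odds": 12.0, "is_favorite": False}]))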

DeltaLover
11-19-2012, 11:32 AM
Traynor,
If a horse wins by a nostril hair and pays $68, that should have a different value than if the horse wins by 12 lengths.
Dave

Why so?

I really cannot understand it. This is a binary event after all.

Dave Schwartz
11-19-2012, 11:45 AM
It is a "binary event" if you choose for it to be.

In other words, if a horse wins by a nose and the race were run 100 times, that horse would likely not win 100 times. Perhaps it would be 50-50 with the 2nd horse. Or 52-48... you choose. But not likely 100-0.

On the other hand, if the horse won by 12 lengths then that is pretty much 100-0.
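One hypothetical way to encode that idea when grading a model's past winners is to map the winning margin to a fractional win credit; the margin-to-credit curve below is invented purely for illustration:

def win_credit(margin_lengths):
    """Estimated 'share' of the win based on winning margin.

    A nose (~0.05 lengths) is treated as roughly a coin flip with the
    runner-up; anything beyond ~3 lengths gets full credit.
    """
    if margin_lengths >= 3.0:
        return 1.0
    return 0.5 + 0.5 * (margin_lengths / 3.0)

for m in (0.05, 0.5, 1.5, 12.0):
    print("won by %5.2f lengths -> credit %.2f" % (m, win_credit(m)))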

DeltaLover
11-19-2012, 12:04 PM
It is a "binary event" if you choose for it to be.


I have to disagree..

it is not what you 'choose', it is what happens in the real world.

It is a binary event because this is how the game operates.

If there was a spread like for football for example, then it would have been a different case.

Horse racing does not operate like this. A nose has the same value as Secretariat's Belmont.

mountainman
11-19-2012, 12:50 PM
Most useful is the ability to search results with toggles for odds ranges. When a model is developed that shows a positive ROI, rather than betting on it with both hands, the researcher is well advised to re-run the same data with the odds range set to a more reasonable range to see if the positive ROI still exists. That is, with filters set in a given pattern, an ROI that shows up when "any odds" are searched may change substantially when "only odds <= 10-1" are considered. Especially with small samples.

I understand that the topic of this thread is days off. My comments are not meant as a digression, but rather to point out the necessity of controlling for outliers in ANY type of serious research.

Sharp post.

raybo
11-19-2012, 01:03 PM
I have to disagree..

it is not what you 'choose', it is what happens in the real world.

It is a binary event because this is how the game operates.

If there was a spread like for football for example, then it would have been a different case.

Horse racing does not operate like this. A nose has the same value as Secretariat's Belmont.

Can't wait for Dave's response to that one! By the way, I completely agree with Dave on this.

Choosing a horse that pays $68 and wins by many lengths means much more than if it wins by a "nostril hair". In the many length win, unless something completely off the wall happened during the race, that winner is a much better horse than the one he beat. In the other case, the race could have run normally, thus the winner is NOT a "much better" horse than the one he beat.

If your method produces $68 selections that are much better horses than those at much lower prices, then your method is not just good, it is excellent, and kicks the public's butt, which is what we must do in order to be successful as players.

"Binary events" become much more than binary, when examined in context with the entire sample.

raybo
11-19-2012, 01:06 PM
Most useful is the ability to search results with toggles for odds ranges. When a model is developed that shows a positive ROI, rather than betting on it with both hands, the researcher is well advised to re-run the same data with the odds range set to a more reasonable range to see if the positive ROI still exists. That is, with filters set in a given pattern, an ROI that shows up when "any odds" are searched may change substantially when "only odds <= 10-1" are considered. Especially with small samples.

I understand that the topic of this thread is days off. My comments are not meant as a digression, but rather to point out the necessity of controlling for outliers in ANY type of serious research.

That ability is built into my program and allows the user to see, at a glance, how a chosen method does at different odds ranges, except you don't have to "toggle" those ranges, they are all viewable at the same time. It's also a big part of the "auto-record" track testing process in the program.

Dave Schwartz
11-19-2012, 01:43 PM
I have to disagree..

it is not what you 'choose', it is what happens in the real world.

It is a binary event because this is how the game operates.

If there was a spread like for football for example, then it would have been a different case.

Horse racing does not operate like this. A nose have the same value as Secretariat's Belmont.

Statistically you are right. However, in terms of reality, I would disagree.

But, disagreement is what makes for a horse race, right?


Regards,
Dave Schwartz

AITrader
11-20-2012, 03:34 AM
A suggestion as to bracketing odds -

log( public_odds / no_horses_in_race )

Normalizing and recency weighting might be good additions.
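A minimal sketch of that bracketing (the normalizing and recency-weighting parts are left out; names are illustrative):

import math

def odds_bracket(public_odds, field_size):
    """Field-size-adjusted log odds, per the formula suggested above."""
    return math.log(public_odds / field_size)

# A 4-1 shot in a 6-horse field vs. a 4-1 shot in a 12-horse field.
print(odds_bracket(4.0, 6), odds_bracket(4.0, 12))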

raybo
11-20-2012, 08:22 AM
Have query for someone with a d-base. What are the avg, median, and most frequent number of days between starts for american t-breds? Please exclude laid up horses, if possible. Suggest cutoff point at 45 days. Need info for day job. Tx in advance.

MM,

I tried to get the data from Jeff's posting you were looking for, tough to do. Here's a spreadsheet I came up with, grouped in 9 day groups except the 40-45 day group. Could not get frequency, but it appears the average falls between 25 and 35 days, and the median falls between 30 and 39 days. The smaller grouping between 40-45 days skews it a bit. Probably not accurate, but here you go.
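For reference, a grouped-data estimate can also be computed directly from Jeff's bucketed counts by weighting each bucket's midpoint by its plays; the Python sketch below lands close to Jeff's exact 21.33-day average and puts the estimated median around 22 days (the midpoints are an approximation, so both numbers are only estimates):

# (min_days, max_days, plays) taken from Jeff's all-tracks <= 45 day table.
buckets = [(1, 4, 313), (5, 9, 19341), (10, 14, 56475), (15, 19, 52250),
           (20, 24, 56390), (25, 29, 37851), (30, 34, 19711), (35, 39, 16419),
           (40, 44, 13053), (45, 45, 1197)]

total = sum(n for _, _, n in buckets)
est_mean = sum(((lo + hi) / 2) * n for lo, hi, n in buckets) / total

# Estimated median: walk the buckets until half the starters are covered.
half, running = total / 2, 0
for lo, hi, n in buckets:
    running += n
    if running >= half:
        est_median = (lo + hi) / 2
        break

print("starters %d  est. mean %.1f days  est. median ~%d days" %
      (total, est_mean, est_median))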

traynor
11-20-2012, 09:22 AM
That ability is built into my program and allows the user to see, at a glance, how a chosen method does at different odds ranges, except you don't have to "toggle" those ranges, they are all viewable at the same time. It's also a big part of the "auto-record" track testing process in the program.

I don't understand how your application makes selections. It seems you are saying that variations in your selection process are limited, and only differ based on the attributes of the group of races used by end user. And for that collection of races (different for each user), your application calculates how many horses in that group of races qualified as wagers, and how many of those qualifiers actually won, separated into categories by odds range?

mountainman
11-20-2012, 10:16 AM
MM,

I tried to get the data from Jeff's posting you were looking for, tough to do. Here's a spreadsheet I came up with, grouped in 9 day groups except the 40-45 day group. Could not get frequency, but it appears the average falls between 25 and 35 days, and the median falls between 30 and 39 days. The smaller grouping between 40-45 days skews it a bit. Probably not accurate, but here you go.

Tx very much. The 45 day ceiling is intended to exclude horses taken out of action. Any avg gleaned from data that includes laid up horses would hardly reflect the frequency of starts for active runners amidst campaigns.

raybo
11-20-2012, 12:27 PM
I don't understand how your application makes selections. It seems you are saying that variations in your selection process are limited, and only differ based on the attributes of the group of races used by end user. And for that collection of races (different for each user), your application calculates how many horses in that group of races qualified as wagers, and how many of those qualifiers actually won, separated into categories by odds range?

That is basically correct. There are presently 6 options (soon to be 8) by "ranking method" that the user can test against his "current" database. There are also several odds ranges, 1/9 or higher through 11/1 or higher, that are tested automatically, for each of those methods, all being listed in a table by minimum odds range.

Each user's database is updated every time a card of races and results is added (after the oldest card and its results are first removed), in either manual recording mode or "auto-record" mode. From that database, ranges of running styles and early speed points (for the specific pace pressure calculation done on the race's field), as well as potential early pace matchups, including adjusted early fractional velocities, are returned and used as eliminations, regarding win contenders, in that race.

So, the user has many combinations of options, from which to choose what is shown to be best, according to his/her own personal preference: number of plays, hit rate, ROI, or net profit, or combinations of those. Once the user makes that decision, he/she enters their preferred method and minimum odds in the program and the program uses those 2 things to produce the black box selections, and "pass/play/alternate method play" notifications for each new race, in each card that is imported and processed by the program, automatically.

So, depending on the exact cards that the user has included in his/her own database, the selections are made by the program, and because their database may either contain, or not contain, the exact same cards and results as another user, and because they may be using different preferred ranking methods and/or minimum odds, the selections could be different between those 2 users.

The only time they will bet exactly the same horses, in every new race, is if their databases and preferred ranking method and minimum odds range are identical, in every way, and continue to be identical in the future. Needless to say, that is highly unlikely.

raybo
11-20-2012, 12:30 PM
Tx very much. The 45 day ceiling is intended to exclude horses taken out of action. Any avg gleaned from data that includes laid up horses would hardly reflect the frequency of starts for active runners amidst campaigns.

Understand. As I stated, the screenshot I posted may not be accurate, but it is probably close, and offers a general, visual representation of the data.

traynor
11-20-2012, 06:21 PM
That is basically correct. There are presently 6 options (soon to be 8) by "ranking method" that the user can test against his "current" database. There are also several odds ranges, 1/9 or higher through 11/1 or higher, that are tested automatically, for each of those methods, all being listed in a table by minimum odds range.

Each user's database is updated every time a card of races and results is added (after the oldest card and its results are first removed), in either manual recording mode or "auto-record" mode. From that database, ranges of running styles and early speed points (for the specific pace pressure calculation done on the race's field), as well as potential early pace matchups, including adjusted early fractional velocities, are returned and used as eliminations, regarding win contenders, in that race.

So, the user has many combinations of options, from which to choose what is shown to be best, according to his/her own personal preference: number of plays, hit rate, ROI, or net profit, or combinations of those. Once the user makes that decision, he/she enters their preferred method and minimum odds in the program and the program uses those 2 things to produce the black box selections, and "pass/play/alternate method play" notifications for each new race, in each card that is imported and processed by the program, automatically.

So, depending on the exact cards that the user has included in his/her own database, the selections are made by the program, and because their database may either contain, or not contain, the exact same cards and results as another user, and because they may be using different preferred ranking methods and/or minimum odds, the selections could be different between those 2 users.

The only time they will bet exactly the same horses, in every new race, is if their databases and preferred ranking method and minimum odds range are identical, in every way, and continue to be identical in the future. Needless to say, that is highly unlikely.

I understand. The process you are using is what most would call "setting filters." When the selections are made by the program, those selections are made based on a process that "would have worked best" on the sample of races contained in each user's database. That is fairly simple regression and back-fitting.

The part I am unclear about is whether or not there is a facility that includes all entries that fit the particular filter combination in the calculations. That is, there are two very different questions "answered" by filtering and regression. One is "how many races were won--and at what mutuel--by entries with this specific set of attributes"? The second (and perhaps more meaningful) question is "how many entries had this specific set of attributes and how many of that group won the races in which they were entered"?

Specifically, does your application deal with situations in which more than one entry in a race fits the specific set of attributes defined?

Cratos
11-20-2012, 07:48 PM
It is a "binary event" if you choose for it to be.

In other words, if a horse wins by a nose and the race were run 100 times, that horse would likely not win 100 times. Perhaps it would be 50-50 with the 2nd horse. Or 52-48... you choose. But not likely 100-0.

On the other hand, if the horse won by 12 lengths then that is pretty much 100-0.

I disagree because each race is an independent event and the margin of victory in one event has nothing to do with the margin in another event. Additionally, the field of horses will probably be different in each event.

raybo
11-20-2012, 08:08 PM
Specifically, does your application deal with situations in which more than one entry in a race fits the specific set of attributes defined?

Yes, the user bets up to 3 contenders per race, any of which, if it wins, will produce a positive return. All final contenders are ranked by total velocity; the top 3, if there are more than 3 contenders, are the wagers, provided they meet the minimum odds requirements of the user.

The program also breaks down each of those top 3 possible selections by hit rate, ROI, and net profit, so the user could bet on a single contender, rather than multiples. Of course, the hit rate won't be as high as in multiples, but the average payouts would be higher.

So, whatever works best and whatever suits the user best, that is what he/she uses.

Any method that uses past performances, as selection criteria, could be considered "back-fitting" to some extent, when you get right down to it, even what you are doing I would imagine.

"True" back-fitting is when the results for a race that you are handicapping has already been seen by the database you're using to produce the selections. That never happens in this program. The database sees "similar" races, but not the same races.

raybo
11-20-2012, 08:14 PM
I disagree because each race is an independent event and the margin of victory in one event has nothing to do with the margin in another event. Additionally, the field of horses will probably be different in each event.

But in that individual event, you could determine that the winner is a much better horse than the ones he beat, or vice versa. So, it really could have something to do with one's overall success or failure, going forward, depending on what kind of method he is using. Just the fact that a horse paid $68 and your method showed him as a win contender, says a lot about what your success/failure ratio will be.

traynor
11-20-2012, 09:08 PM
Yes, the user bets up to 3 contenders per race, any of which, if it wins, will produce a positive return. All final contenders are ranked by total velocity; the top 3, if there are more than 3 contenders, are the wagers, provided they meet the minimum odds requirements of the user.

The program also breaks down each of those top 3 possible selections by hit rate, ROI, and net profit, so the user could bet on a single contender, rather than multiples. Of course, the hit rate won't be as high as in multiples, but the average payouts would be higher.

So, whatever works best and whatever suits the user best, that is what he/she uses.

Any method that uses past performances, as selection criteria, could be considered "back-fitting" to some extent, when you get right down to it, even what you are doing I would imagine.

"True" back-fitting is when the results for a race that you are handicapping has already been seen by the database you're using to produce the selections. That never happens in this program. The database sees "similar" races, but not the same races.

Pretty much any type of regression based on past events is generically referred to as "backfitting." No insult was intended.

I think I may be forming the basic question I am asking too obliquely. My understanding is that your application uses a base of races to determine a set of predictive attributes for that specific base of races, then determines selections in future races according to that set of predictive attributes. Is that correct? Or does your application use some predetermined criteria for selection? If the former, how many races (at one specific track, for example) would be needed to establish the baseline for wagering? I am assuming that an equivalent number would be needed for each track. Is that correct?

You have mentioned several times that the application deletes older races, so there must be some guidelines as to the number of races needed (or used).

I apologize if these seem stupid questions. I am trying to understand what an average new user would have to do to build up a sufficient file of races to enable the application to function properly.

raybo
11-20-2012, 11:49 PM
Pretty much any type of regression based on past events is generically referred to as "backfitting." No insult was intended.

I think I may be forming the basic question I am asking too obliquely. My understanding is that your application uses a base of races to determine a set of predictive attributes for that specific base of races, then determines selections in future races according to that set of predictive attributes. Is that correct? Or does your application use some predetermined criteria for selection? If the former, how many races (at one specific track, for example) would be needed to establish the baseline for wagering? I am assuming that an equivalent number would be needed for each track. Is that correct?

You have mentioned several times that the application deletes older races, so there must be some guidelines as to the number of races needed (or used).

I apologize if these seem stupid questions. I am trying to understand what an average new user would have to do to build up a sufficient file of races to enable the application to function properly.

Your assumption is correct: we get a set of predictive attributes from the database, the program then uses those attributes to determine eliminations of win contenders, and then a further elimination process is undertaken regarding adjusted early fractional velocities.

The number of cards needed for a particular track varies according to the average number of races per card, but, approximately 240-250 recent races (24-25 cards approximately) are kept in the database, constantly updated, of course. That number can vary for some tracks, depending on how the spread of pace pressure groupings works out, and how many of each of those type races actually occur, some pace pressure readings happen an insignificant number of times, as you can imagine, while others occur in significant numbers. There are 20 different groupings, so not all tracks have the same spread of pace pressure ratings, thus needing a few more, or fewer, races in the database. Once we find the number where predictiveness is at an acceptable level, the program keeps the database there automatically.

traynor
11-21-2012, 12:41 AM
Your assumption is correct: we get a set of predictive attributes from the database, the program then uses those attributes to determine eliminations of win contenders, and then a further elimination process is undertaken regarding adjusted early fractional velocities.

The number of cards needed for a particular track varies according to the average number of races per card, but, approximately 240-250 recent races (24-25 cards approximately) are kept in the database, constantly updated, of course. That number can vary for some tracks, depending on how the spread of pace pressure groupings works out, and how many of each of those type races actually occur, some pace pressure readings happen an insignificant number of times, as you can imagine, while others occur in significant numbers. There are 20 different groupings, so not all tracks have the same spread of pace pressure ratings, thus needing a few more, or fewer, races in the database. Once we find the number where predictiveness is at an acceptable level, the program keeps the database there automatically.

I am impressed at the innovative way you view the data. Thank you for explaining.

raybo
11-21-2012, 07:25 AM
You're welcome. I think the fact that the program takes Randy Giles' work a step or 2 further, makes it more valuable. While Randy's work tells you a particular PPG gives advantage to an early, or a late, horse, etc., this program tells you which of the specific running styles have won those races and at what percentage. Also, it tells you the same thing for those styles' early speed point ranges. Both of those things being track specific and in a recent time frame.

Capper Al
11-21-2012, 07:41 AM
I disagree because each race is an independent event and the margin of victory in one event has nothing to do with the margin in another event. Additionally, the field of horses will probably be different in each event.

It is never the attribute (fact), like DSLR, it is how it is used. Days has to be applied in conjunction with other factors or it is meaningless. The more dominant attributes, like speed and pace, are considered primary factors and may stand as a single source (variable) on their own for selecting horses, while Days Since Last Raced is a secondary attribute and has to be interpreted in a comprehensive manner together with other factors.

bob60566
11-21-2012, 10:25 AM
It is never the attribute (fact), like DSLR, it is how it is used. Days has to be applied in conjunction with other factors or it is meaningless. The more dominant attributes, like speed and pace, are considered primary factors and may stand as a single source (variable) on their own for selecting horses, while Days Since Last Raced is a secondary attribute and has to be interpreted in a comprehensive manner together with other factors.
In today's world there are only so many times you can administer apple juice in a cycle and keep the horse effective. :(

Capper Al
11-21-2012, 10:53 AM
In today's world there are only so many times you can administer apple juice in a cycle and keep the horse effective. :(

If you are referring to cheating then there isn't much one can do except hope there's a pattern that can be flagged.

DeltaLover
11-21-2012, 11:33 AM
In my opinion, recency is one of the most naively treated concepts in handicapping literature and related software.

Recency analysis is a classical example of unsupervised learning where the input vector consists of the intervals between consecutive races and the output is a classification universe. It is a cluster analysis that can be optimized by various fitness functions: PNL, ROI or winning frequency maximizers or minimizers.

The objective of this classification is to derive group monikers in such a way that each starter will be assigned one, creating a recency shape for each race. The value of a classification schema can be evaluated by a selection method (ga, nn, linear regression or other) that will be able to show profitability utilizing it in some way...

bob60566
11-21-2012, 01:11 PM
If you are referring to cheating then there isn't much one can do except hope there's a pattern that can be flagged.

Wrong thread should have posted under Pattern Recognition :)

TrifectaMike
11-21-2012, 07:18 PM
In my opinion, recency is one of the most naively treated concepts in handicapping literature and related software.

Recency analysis is a classical example of unsupervised learning where the input vector consists of the intervals between consecutive races and the output is a classification universe. It is a cluster analysis that can be optimized by various fitness functions: PNL, ROI or winning frequency maximizers or minimizers.

The objective of this classification is to derive group monikers in such a way that each starter will be assigned one, creating a recency shape for each race. The value of a classification schema can be evaluated by a selection method (ga, nn, linear regression or other) that will be able to show profitability utilizing it in some way...

You might discover a "bounce" classification.

Delta, you may want to give an example of what you have stated in paragraph 2. It is a good idea.

Mike (Dr Beav)

DeltaLover
11-21-2012, 11:07 PM
Sure Doc..

I will try to present my approach:

Each starter has an array of intervals, in days, between his starts:


s1: 25 15 53 212 ....

s1: d1, d2, ..., dn

For simplicity, let's consider only today's days off and the previous race's days off.

In this case we have

classification = f(d1, d2)

Each interval adds one dimension, so for our example we are talking about two dimensions.

Each starter can be represented as a point on a two-dimensional surface (x, y).

To make the algorithm easier we might need some transformation logic for the days off:

for example:
T(d) = log(d) (or whatever)

Using the Euclidean distance between all starters, we will be looking for clusters having some similar behavior:

for example, winning frequency.

The objective of the algorithm will be to find two-dimensional clusters that behave similarly.

For example, we might find a cluster c1 having the highest winning frequency, or c2 having the lowest.

Each cluster will be assigned an arbitrary label: C1, C2, C3, etc.

Then the whole race can be described as a composite of clusters, based on where each starter belongs:

C1
C1
C2
C3
C7


Now the race can be matched against similar races, from which we might be able to conclude (for example) that this type of race is most frequently won by a C1 type.
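A compact sketch of the clustering step walked through above, using scikit-learn's KMeans on log-transformed (d1, d2) pairs and then tagging each cluster with its winning frequency; the starter data and the cluster count are invented:

import math
import numpy as np
from sklearn.cluster import KMeans

# Each row: (days off before today's race, days off before the previous race),
# plus whether the starter won today. All values are invented for illustration.
starters = [(14, 21, 1), (30, 45, 0), (7, 10, 1), (180, 25, 0),
            (21, 28, 0), (60, 90, 0), (12, 15, 1), (35, 40, 0),
            (9, 14, 0), (200, 300, 0), (16, 22, 1), (28, 35, 0)]

# T(d) = log(d), as suggested above, so huge layoffs don't dominate the distance.
X = np.array([[math.log(d1), math.log(d2)] for d1, d2, _ in starters])
wins = np.array([w for _, _, w in starters])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Tag each cluster (C0, C1, ...) with its winning frequency.
for c in range(km.n_clusters):
    mask = km.labels_ == c
    print("C%d: %d starters, win rate %.2f" % (c, mask.sum(), wins[mask].mean()))

# A race can then be described by its starters' cluster labels, e.g.
# ['C1', 'C1', 'C2', 'C0', ...], and matched against similar races.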

SmarterSig
11-22-2012, 04:08 AM
Hi Delta

Do you use any other inputs to your cluster analysis? Trainer springs to mind.

DeltaLover
11-22-2012, 07:16 AM
Trainer will add a categorical dimension to the space, which will subsequently narrow the clusterization to this dimension only.

Although I am not currently doing it in any of my systems, we can analyze trainers after we have completed the recency clusterization. Based on it, we will now be able to assign to each trainer a distribution of classifications and rate him based on them.

For example:

Trainer: T1

C1 : T1-C1-Stats win% roi
C2 : T1-C2-Stats win% roi
C3 : T1-C3-Stats win% roi
C4 : T1-C4-Stats win% roi


Trainer: T2

C1 : T2-C1-Stats win% roi
C2 : T2-C2-Stats win% roi
C3 : T2-C3-Stats win% roi
C4 : T2-C4-Stats win% roi

etc


Now we can use each trainer's vector :

[ Ti-Ci .... ]

To perform another classification...