|
|
12-31-2010, 12:17 AM
|
#76
|
AllAboutTheROE
Join Date: Aug 2006
Location: Denver
Posts: 2,411
|
Quote:
Originally Posted by TrifectaMike
Z-score = (p^ - p )/ sqrt(p*(1 - p) / n)
p^ = Observed frequency
p = Expected frequency
n = Number of cases in the sample
Z-score for - Days since last race 1 to 14
= (.1220 - .1054)/sqrt(.1054*(1-.1054)/900)
= 1.622
Therefore any horse that raced within 14 days would receive a weight of 1.622.
You can compute the Z-score for the remaining two categories.
Mike
|
Since (p^ - p) can be negative (and hopefully the square root can't, lol), the Z-scores can be negative as well, which means we could have negative weights? I think both the second and third categories would be negative (-1.38 and -0.25). I'll wait patiently to see how this all comes together, but this leads to the question of data preprocessing and "how to bin." In this instance, a horse that came back in 14 days gets a positive weight, but a horse that comes back one day later receives a nearly equal negative factor weighting?
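For anyone who wants to check these numbers, here is a minimal Python sketch of the quoted Z-score formula. The function name `z_score` and the `bins` dictionary are my own labels, not anything from Mike's post; the counts are the fictitious ones from his recency table.

```python
import math

def z_score(p_hat, p, n):
    """One-sample proportion Z-score: (p^ - p) / sqrt(p * (1 - p) / n)."""
    return (p_hat - p) / math.sqrt(p * (1.0 - p) / n)

# Mike's worked example, using his rounded inputs.
z_rounded = z_score(0.1220, 0.1054, 900)

# Fictitious counts from the recency table: bin -> (wins, starters).
bins = {
    "1 to 14": (110, 900),
    "15 to 30": (90, 980),
    "over 30": (65, 635),
}
total_wins = sum(w for w, _ in bins.values())
total_starters = sum(n for _, n in bins.values())
p = total_wins / total_starters  # overall win rate, 265 / 2515

# Z-score of each bin's observed win rate against the overall win rate.
z_by_bin = {name: z_score(w / n, p, n) for name, (w, n) in bins.items()}

print(f"rounded example: {z_rounded:.3f}")
for name, z in z_by_bin.items():
    print(f"{name:>8}: {z:+.2f}")
```

With unrounded inputs the first bin comes out slightly higher than 1.622 (about 1.65); the other two come out near -1.38 and -0.25. Moving the bin edges changes all three values, which is exactly the binning sensitivity in question.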
__________________
"No problem can withstand the assault of sustained thinking" -- Voltaire
|
|
|
12-31-2010, 05:30 AM
|
#77
|
Registered User
Join Date: Feb 2008
Posts: 1,591
|
Quote:
Originally Posted by CBedo
Since (p^ - p) can be negative (and hopefully the square root can't, lol), the Z-scores can be negative as well, which means we could have negative weights? I think both the second and third categories would be negative (-1.38 and -0.25). I'll wait patiently to see how this all comes together, but this leads to the question of data preprocessing and "how to bin." In this instance, a horse that came back in 14 days gets a positive weight, but a horse that comes back one day later receives a nearly equal negative factor weighting?
|
You are correct. Data definitions are important. I personally like to take a "shotgun" approach: shoot as many pellets as possible and observe the patterns, then consider refinements and shoot again, and again if necessary, always letting the data speak for itself, even when it appears illogical.
I attempt to construct metrics using medians. This allows one to measure strength within a race and to process data across races much more easily.
For example, we all, to some degree or another, scan a horse's speed ratings (Beyers, etc.) and naturally assume the last rating is more significant than the previous one. In fact, it is standard practice for logistic regression people to use time-related weights on their factors. It seems reasonable, but is it always correct?
Let me give you a specific example (arkansasman, try this out with your model). When looking at a horse's last four Beyers or Bris numbers, etc., you would think that the last is more significant than the previous one, and so on. The data does not support the premise.
I have found, using numerous tests, that the second-race-back Beyer (or any other similar rating) is insignificant and adds no usable information. The third is significant, as is the fourth, but NOT the second race back. Interesting, I would say.
Mike
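One possible reading of the median idea, sketched in Python. This is an assumption about what is meant, not Mike's actual method: each horse's figure is expressed relative to the median figure of its own race, so a value of +5 means the same thing in a cheap race as in a stakes race. The function name and the sample figures are made up for illustration.

```python
from statistics import median

def relative_to_field_median(figures):
    """Return each horse's figure minus the median figure of its race."""
    mid = median(figures)
    return [f - mid for f in figures]

# Hypothetical last-race speed figures for two races of different quality.
cheap_race = [62, 70, 65, 58, 66]
stakes_race = [98, 104, 101, 95, 110]

cheap_rel = relative_to_field_median(cheap_race)
stakes_rel = relative_to_field_median(stakes_race)
print(cheap_rel)
print(stakes_rel)
```

The raw figures in the two races are far apart, but the median-relative values sit on one scale, which is what makes pooling across races easier.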
|
|
|
12-31-2010, 05:45 AM
|
#78
|
Registered User
Join Date: Nov 2003
Location: Ohio
Posts: 1,307
|
Quote:
Originally Posted by TrifectaMike
". How does one test for goodness of fit of a logistic regression?"
Mike
|
R-Squared statistic.
|
|
|
12-31-2010, 07:11 AM
|
#79
|
Registered User
Join Date: Dec 2005
Location: MI
Posts: 6,330
|
Quote:
Originally Posted by TrifectaMike
I have found, using numerous tests, that the second-race-back Beyer (or any other similar rating) is insignificant and adds no usable information. The third is significant, as is the fourth, but NOT the second race back. Interesting, I would say.
Mike
|
I'm not following this, because if the second speed figure back isn't significant in today's race, it will become the third speed figure back in the next race. Then it has significance? It's the same figure. How can this be?
__________________
"The Law, in its majestic equality, forbids the rich, as well as the poor, to sleep under bridges, to beg in the streets, and to steal bread."
Anatole France
|
|
|
12-31-2010, 07:50 AM
|
#80
|
Librocubicularist
Join Date: Jun 2010
Location: Ohio
Posts: 10,466
|
Quote:
Originally Posted by TrifectaMike
Example of Chi-Square Statistic and Z-Score
NOTE: This data is factious and used for demonstration purposes.
Let us test a recency factor of days since the horse's last race.
Observed
Code:
                                  Win   Lose   Total   Win%
Days since last race 1 to 14      110    790     900   12.2
Days since last race 15 to 30      90    890     980    9.2
Days since last race over 30       65    570     635   10.2
Total                             265
The sample size is 2,515 horses.
The Win% of sample is 265/2515 = 10.54%
Expected
Code:
                                  Win   Lose   Total   Win%
Days since last race 1 to 14       95    805     900   10.54
Days since last race 15 to 30     103    877     980   10.54
Days since last race over 30       67    568     635   10.54
Chi-Square Value
Code:
= (110-95)**2/95 + (790-805)**2/805 + (90-103)**2/103 +
(890-877)**2/877 + (65-67)**2/67 + (570-568)**2/568
= 2.367 + .280 + 1.64 + .193 + .060 + .007
Chi-Square Value = 4.547
|
I've been going over this with my college statistics textbook by my side and, as best as I can tell, you do not seem to be calculating the Chi-Square Value correctly. From the text:
Code:
THEOREM: If n1, n2, ..., nk and e1, e2, ..., ek are the observed and
expected frequencies, respectively, for the k possible outcomes of an
experiment that is performed n times, then as n becomes infinite the
distribution of the quantity
    \sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i}
will approach that of a Chi-Squared variable with k-1 degrees of freedom.
I.e., the results are to be summed over all possible outcomes. There are three possible outcomes here: a horse from one of the three categories wins. There is no other possibility. There should be three terms in the summation, viz., the first, third, and fifth terms. The second, fourth, and sixth terms are double counts. The correct value is 4.18, which is less than the critical value of 4.605.
These double counts are not present in your die casting example.
You should look up the definition of "factious." I think you mean "fictitious" but I could be wrong.
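To make the two sums concrete, here is a minimal Python sketch that computes the statistic both ways from the observed counts in the quoted table, using unrounded expected counts. The variable names are mine. Note that the usual Pearson test of independence on a 3-by-2 table (three recency bins by win/lose) sums over all six cells and has (3-1)(2-1) = 2 degrees of freedom.

```python
# Observed counts from the fictitious recency table: (wins, losses) per bin.
observed = [(110, 790), (90, 890), (65, 570)]

total_wins = sum(w for w, _ in observed)
total_losses = sum(l for _, l in observed)
grand_total = total_wins + total_losses
win_rate = total_wins / grand_total  # 265 / 2515

chi_sq_all_cells = 0.0   # all six cells of the 3x2 table
chi_sq_win_cells = 0.0   # win column only (three terms)
for wins, losses in observed:
    n = wins + losses
    exp_wins = n * win_rate              # expected wins, not rounded
    exp_losses = n * (1.0 - win_rate)    # expected losses, not rounded
    win_term = (wins - exp_wins) ** 2 / exp_wins
    loss_term = (losses - exp_losses) ** 2 / exp_losses
    chi_sq_win_cells += win_term
    chi_sq_all_cells += win_term + loss_term

print(f"win cells only: {chi_sq_win_cells:.2f}")
print(f"all six cells:  {chi_sq_all_cells:.2f}")
```

With unrounded expected counts the three-term sum is about 4.18 and the six-term sum about 4.68; the 4.547 in the quoted post comes from rounding the expected counts to whole horses. The six-term value sits very close to the 4.605 cutoff (10% level, 2 degrees of freedom), so the rounding matters here.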
__________________
Sapere aude
Last edited by Actor; 12-31-2010 at 07:52 AM.
|
|
|
12-31-2010, 08:09 AM
|
#81
|
Librocubicularist
Join Date: Jun 2010
Location: Ohio
Posts: 10,466
|
Quote:
Originally Posted by TrifectaMike
Z-score = (p^ - p )/ sqrt(p*(1 - p) / n)
|
This is William L. Quirin's "standard normal" test.
For the first category I compute the standard normal value to be 1.59. Quirin says that the value needs to lie outside the range -2.5 to +2.5 to be significant, so this category (which has the biggest impact value) is not significant.
|
|
|
12-31-2010, 08:26 AM
|
#82
|
Registered User
Join Date: Feb 2003
Posts: 2,105
|
Quote:
Originally Posted by TrifectaMike
Let me give you a specific example (arkansasman, try this out with your model). When looking at a horse's last four Beyers or Bris numbers, etc., you would think that the last is more significant than the previous one, and so on. The data does not support the premise.
I have found, using numerous tests, that the second-race-back Beyer (or any other similar rating) is insignificant and adds no usable information. The third is significant, as is the fourth, but NOT the second race back. Interesting, I would say.
Mike
|
My reading of the data disagrees with your assertion here. In my experience the more recent races should be weighted most heavily with the second last race having something like a 25% weighting.
|
|
|
12-31-2010, 10:24 AM
|
#83
|
Registered User
Join Date: Feb 2008
Posts: 1,591
|
Quote:
Originally Posted by Actor
I've been going over this with my college statistics textbook by my side and, as best as I can tell, you do not seem to be calculating the Chi-Square Value correctly. From the text:
Code:
THEOREM: If n1, n2, ..., nk and e1, e2, ..., ek are the observed and
expected frequencies, respectively, for the k possible outcomes of an
experiment that is performed n times, then as n becomes infinite the
distribution of the quantity
    \sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i}
will approach that of a Chi-Squared variable with k-1 degrees of freedom.
I.e., the results are to be summed over all possible outcomes. There are three possible outcomes here: a horse from one of the three categories wins. There is no other possibility. There should be three terms in the summation, viz., the first, third, and fifth terms. The second, fourth, and sixth terms are double counts. The correct value is 4.18, which is less than the critical value of 4.605.
These double counts are not present in your die casting example.
You should look up the definition of "factious." I think you mean "fictitious" but I could be wrong.
|
Read another book or read that one more carefully.
Mike
|
|
|
12-31-2010, 10:26 AM
|
#84
|
Registered User
Join Date: Feb 2008
Posts: 1,591
|
Quote:
Originally Posted by Actor
This is William L. Quirin's "standard normal" test.
For the first category I compute the standard normal value to be 1.59. Quirin says that the value needs to lie outside the range -2.5 to +2.5 to be significant, so this category (which has the biggest impact value) is not significant.
|
Where do you guys come up with this stuff? Quirin's standard normal?
Mike
|
|
|
12-31-2010, 10:30 AM
|
#85
|
Registered User
Join Date: Feb 2008
Posts: 1,591
|
Quote:
Originally Posted by sjk
My reading of the data disagrees with your assertion here. In my experience the more recent races should be weighted most heavily with the second last race having something like a 25% weighting.
|
Sjk, I know this is off topic, but let's have someone, possibly Arkansasman run a test.
Run a logistic regression using the last four beyers as regressors....and share the results.
Mike
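For anyone who wants to try the test, here is a rough Python sketch of the kind of run being asked for: a logistic regression with the last four figures as regressors, plus a likelihood-ratio comparison with and without the second figure back. Everything here is an assumption for illustration: the data is randomly generated, the "true" coefficients are arbitrary, and the fit is plain gradient ascent rather than any particular package. It shows the mechanics only and says nothing about real races.

```python
import math
import random

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_likelihood(X, y, w):
    """Bernoulli log-likelihood of weights w on rows X and outcomes y."""
    ll = 0.0
    for row, won in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, row)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        ll += math.log(p) if won else math.log(1.0 - p)
    return ll

def fit_logistic(X, y, steps=300, lr=0.5):
    """Fit weights by batch gradient ascent on the mean log-likelihood."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, won in zip(X, y):
            err = won - sigmoid(sum(wj * xj for wj, xj in zip(w, row)))
            for j, xj in enumerate(row):
                grad[j] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

random.seed(0)
# Synthetic, standardized "figures".
# Columns: [intercept, last, 2nd back, 3rd back, 4th back].
true_w = [-2.0, 0.5, 0.3, 0.2, 0.1]  # arbitrary, for illustration only
X, y = [], []
for _ in range(500):
    row = [1.0] + [random.gauss(0.0, 1.0) for _ in range(4)]
    p_win = sigmoid(sum(wj * xj for wj, xj in zip(true_w, row)))
    X.append(row)
    y.append(1 if random.random() < p_win else 0)

w_full = fit_logistic(X, y)
ll_full = log_likelihood(X, y, w_full)
ll_start = log_likelihood(X, y, [0.0] * 5)  # all-zero weights, for reference

# Reduced model: drop the second-figure-back column (index 2) and refit.
X_reduced = [row[:2] + row[3:] for row in X]
w_reduced = fit_logistic(X_reduced, y)
ll_reduced = log_likelihood(X_reduced, y, w_reduced)

# Likelihood-ratio statistic; compare with 3.84 (chi-square, 1 df, 5% level).
lr_stat = 2.0 * (ll_full - ll_reduced)
print("fitted weights:", [round(v, 2) for v in w_full])
print("LR statistic for 2nd figure back:", round(lr_stat, 2))
```

On real data the same comparison can be repeated for each of the four figures in turn. A statistics package would also report standard errors per coefficient, which this sketch leaves out.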
|
|
|
12-31-2010, 10:34 AM
|
#86
|
Registered User
Join Date: Dec 2005
Location: MI
Posts: 6,330
|
I'm going to be forced to re-read The Mathematics of Horse Racing by David B. Fogel; not a bad idea, I just need the time. Fogel covers the "how to" topic in six pages, with references to charts in the appendices. Then he walks one through examples over twenty-one pages. His examples cover beaten lengths, days between races, drop in class, shippers, and change in distance. The rest of the book covers making your own system and the possibility of making a living at the track. In other words, Fogel gives one the "how to" with many examples in 27 pages. Let's move on.
|
|
|
12-31-2010, 10:38 AM
|
#87
|
Registered User
Join Date: Feb 2003
Posts: 2,105
|
Quote:
Originally Posted by TrifectaMike
Sjk, I know this is off topic, but let's have someone, possibly Arkansasman run a test.
Run a logistic regression using the last four beyers as regressors....and share the results.
Mike
|
I would be interested in the findings and how they were arrived at.
SK
|
|
|
12-31-2010, 10:44 AM
|
#88
|
Registered User
Join Date: Feb 2008
Posts: 1,591
|
Quote:
Originally Posted by garyoz
R-Squared statistic.
|
And which R-Squared statistic would that be? Mine, yours, Dave's, GT's, Benter's, Loie's, Peter's or Saint Paul's? You tell me, because if you compute one, I can compute another.
Mike
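The point about there being more than one R-squared for a logistic model can be made concrete. Here is a minimal Python sketch of three common pseudo-R-squared measures (McFadden, Cox and Snell, Nagelkerke), all computed from the same two log-likelihoods. The function name and the input numbers are hypothetical, chosen only to show that the three values differ for one and the same fit.

```python
import math

def pseudo_r_squared(ll_null, ll_model, n):
    """Return McFadden, Cox-Snell and Nagelkerke pseudo-R-squared values.

    ll_null  -- log-likelihood of the intercept-only model
    ll_model -- log-likelihood of the fitted model
    n        -- number of observations
    """
    mcfadden = 1.0 - ll_model / ll_null
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)  # upper bound of Cox-Snell
    nagelkerke = cox_snell / max_cox_snell
    return {"mcfadden": mcfadden, "cox_snell": cox_snell, "nagelkerke": nagelkerke}

# Hypothetical log-likelihoods for one fit on 200 observations.
r2 = pseudo_r_squared(ll_null=-100.0, ll_model=-80.0, n=200)
for name, value in r2.items():
    print(f"{name}: {value:.3f}")
```

Same model, three different numbers (0.200, 0.181 and 0.287 here), so "R-squared" has to be named before two people can compare fits.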
|
|
|
12-31-2010, 11:01 AM
|
#89
|
Registered User
Join Date: Jan 2009
Posts: 1,516
|
In the end, are we trying to find out which factors, grouped together, will produce a profitable return? The statistics are interesting, but what are we going to be left with? Impact values are everywhere...
|
|
|
12-31-2010, 11:04 AM
|
#90
|
Registered User
Join Date: Jan 2007
Posts: 18,962
|
Quote:
Originally Posted by TrifectaMike
For example, we all, to some degree or another, scan a horse's speed ratings (Beyers, etc.) and naturally assume the last rating is more significant than the previous one. In fact, it is standard practice for logistic regression people to use time-related weights on their factors. It seems reasonable, but is it always correct?
Mike
|
I like the path that you are taking, TrifectaMike. The concept behind the approach is more important at this point than whether or not your particular figures are being computed accurately. (Obviously, down the road, figure accuracy would be important.)
I also like the questions that you are asking.
For instance, "time-related weights" do not necessarily work that well in turf races. One has to look at the total picture on grass.
|
|