Using Chi Square Statistic To Produce A Handicapping Method


TrifectaMike
12-28-2010, 12:34 PM
If there is sufficient interest, I will, along with any interested PA member(s), go through the process of generating a handicapping method based on the Chi Square statistic.

I (We) will do it in a manner that is non-technical, easy to understand, uses only basic arithmetic, and allows anyone to participate.

The end result I believe will be a profitable system.

We will need someone with a large database to provide data for the factors that we will test, and include in our model.

If there is interest I will proceed.

Let me know.

Mike

Dave Schwartz
12-28-2010, 12:40 PM
I can tell you that I am certainly interested to see this.

Hochay
12-28-2010, 12:45 PM
Please allow me to partake and supply any information that I may help obtain. :)

HandyKapper
12-28-2010, 12:48 PM
Absolutely count me in.

Tom
12-28-2010, 12:48 PM
I will help any way I can.

TrifectaMike
12-28-2010, 01:33 PM
The Chi Square Statistic

Types of Data:

There are basically two types of random variables and they yield two types of data: numerical and categorical.

Categorical variables yield data in particular categories.
Responses to such questions as "What is your hair color?" or "Do you own a
home?" are categorical because they yield data such as "brown" or "yes/no."

Numerical variables yield data in numerical form.
Responses to such questions as "How tall are you?" or "What is your
age?" are numerical.

Numerical data can be either discrete or continuous.
The table below may help you see the differences between these two variables.

Data Type               Question Type                   Possible Responses
Categorical             What is your sex?               male or female
Numerical (Discrete)    How many jackets do you own?    four or five
Numerical (Continuous)  How old are you?                42 years

Discrete data arise from a counting process, while continuous data
arise from a measuring process.

The Chi Square statistic compares the counts of categorical responses between two (or more) independent groups.

Chi square tests can only be used on actual numbers and not on percentages, proportions, means, etc.

Since I want all interested to understand the process, I will answer questions. No question is too simple or foolish... feel free to ask and be open. However, I will not answer any theoretical questions, nor will I attempt to be rigorous. I will make up terms in the process, which may not be mathematically correct, but I will do so when necessary to clarify a point.

Mike

Next, I will use the "standard" dice example to explain the process.

ceejay
12-28-2010, 02:41 PM
I expect that this will be an interesting thread/topic.

Mike,
Regarding discrete vs. continuous numerical data: are you saying that uncertainty in discrete data arises from measurement uncertainty, whereas uncertainty in continuous data is caused by measurement imprecision? Somebody answering the question "How old are you?" with 42 as the answer really means "42.50 plus or minus 0.50."

Robert Goren
12-28-2010, 02:51 PM
I will be interested in this too. I looked into this back in the 70s and gave it up because, at the time, compiling the data was not feasible. I did a regression instead. Good luck.

raybo
12-28-2010, 03:04 PM
This may sound whimsical to those who have worked in, or have studied, statistical functions (which I sadly have not), but what is the syntax for a "Chi Square" test? And what is the reasoning behind using this function? What does one hope to achieve as a result of the test?

TrifectaMike
12-28-2010, 03:13 PM
I expect that this will be an interesting thread/topic.

Mike,
Regarding discrete vs. continuous numerical data: are you saying that uncertainty in discrete data arises from measurement uncertainty, whereas uncertainty in continuous data is caused by measurement imprecision? Somebody answering the question "How old are you?" with 42 as the answer really means "42.50 plus or minus 0.50."

There is no need to go into measurement error(s) or uncertainty of measurements. The point is that we will use statistical data that can be counted or tallied.

For example:

How many horses won or lost their next race after finishing second in their previous race. We will use a Chi Square statistic to determine significance (or lack of significance) for that factor, and then compute a weight for that factor. We are jumping ahead, but I do want you to understand.

Mike

GameTheory
12-28-2010, 03:22 PM
I've done this sort of thing many times, never came up with a profitable system though. Will be interested to see what you come up with...

TrifectaMike
12-28-2010, 03:23 PM
This may sound whimsical to those who have worked in, or have studied, statistical functions (which I sadly have not), but what is the syntax for a "Chi Square" test? And what is the reasoning behind using this function? What does one hope to achieve as a result of the test?

I believe the "reason behind using the function" and what we can achieve will become clear after the dice example and after we test a factor. As I said, I will not go into the theoretical aspects of the distribution. The symbol is unimportant.

The information we will use will be in table form.

Mike

TrifectaMike
12-28-2010, 03:30 PM
I've done this sort of thing many times, never came up with a profitable system though. Will be interested to see what you come up with...

It will depend on the factors we select and test. And we will weight them. The same can be said about any approach. As you probably know we will not be using speed/pace ratings.

I believe we can make this approach work during the testing phase by eliminating certain categories of races.

Mike

P.S. Join in, we'll all have a hand in choosing the factors.

Robert Goren
12-28-2010, 03:42 PM
The one thing I will say about this project. Chi-Square is known for producing some very strange relationships. Be very careful that you don't get a bunch of Gobbly Goop.

TrifectaMike
12-28-2010, 03:48 PM
The one thing I will say about this project. Chi-Square is known for producing some very strange relationships. Be very careful that you don't get a bunch of Gobbly Goop.

Maybe I don't understand your post, but Chi-Square doesn't produce any relationships.

Mike

Greyfox
12-28-2010, 03:50 PM
Chi-Square is known for producing some very strange relationships. Be very careful that you don't get a bunch of Gobbly Goop.

"Strange relationships" that work don't have to be explained.
There is no reason to link the outcome with the cause if you are looking for weighted factors.
For instance, it may well be that horses with 5 letter names win more often on days with full moons. The Chi square shows significance but doesn't need to explain why. Whatever works consistently, use it, I say.

GameTheory
12-28-2010, 03:58 PM
It will depend on the factors we select and test. And we will weight them. The same can be said about any approach. As you probably know we will not be using speed/pace ratings.

I believe we can make this approach work during the testing phase by eliminating certain categories of races.

P.S. Join in, we'll all have a hand in choosing the factors.

Yeah, never could get anything to work that uses simple well-known statistical tests (as the engine to find a method). And there are a whole lot of trained statistician-types that thought they'd be able to analyze some racing data and be on their way, only to find it doesn't work. (Or insisted it did when it didn't and wrote a book about it anyway.)

Have you gotten it to work before (i.e. created a profitable method) with the procedure you're going to demonstrate now? Sometimes it looks like it will because it passes all the statistical tests, but just doesn't hold up on new races.

Not trying to be discouraging -- you are certainly better versed in these matters than I am! So I will be very interested indeed.

ArlJim78
12-28-2010, 04:10 PM
I'm interested and will help if I can. What is the next step?

Capper Al
12-28-2010, 04:15 PM
I'm interested. When I took my stats classes it was before calculators. I don't remember how the Chi-Square method differs from regular multivariate analysis. Would you clarify in lay terms? Thanks.

TrifectaMike
12-28-2010, 04:31 PM
Can someone explain to me how I can present data in a nice table form?

Mike

gm10
12-28-2010, 04:55 PM
If there is sufficient interest, I will, along with any interested PA member(s), go through the process of generating a handicapping method based on the Chi Square statistic.

I (We) will do it in a manner that is non-technical, easy to understand, uses only basic arithmetic, and allows anyone to participate.

The end result I believe will be a profitable system.

We will need someone with a large database to provide data for the factors that we will test, and include in our model.

If there is interest I will proceed.

Let me know.

Mike

I have 6 years of full data, have a degree in stats and know how to program in S-Plus so maybe I can be useful. I'm intrigued.

arno
12-28-2010, 05:09 PM
Count me in also.
Have a BBA (73) in Statistics.
Not many colleges have a degree in Stat.

punteray
12-28-2010, 05:13 PM
Sounds interesting

Punteray

GameTheory
12-28-2010, 05:33 PM
Can someone explain to me how I can present data in a nice table form?

Mike
Get it nicely formatted in a text editor, tab-delimited with tabs set to 8.

Then put opening and closing "code" tags in your post, "code" and "/code" with square brackets around them. Cut-and-paste your pre-formatted source in between the tags. Don't edit your table from inside the post editor -- only cut-and-paste. It will not look pretty in the post editor, but it will in the preview and actual post....


1.343 2.343 3.343
2.232 3.223 4.355

Robert Goren
12-28-2010, 05:48 PM
Maybe I don't understand your post, but Chi-Square doesn't produce any relationships.

Mike:blush: You are right. It was a very poor choice of words on my part.

Robert Fischer
12-28-2010, 06:02 PM
im gonna have an energy drink and come back on with some google-smart comments and a bad joke...

TrifectaMike
12-28-2010, 06:15 PM
Let's try to determine if a die is fair. If a die is really fair, what can we say about it in terms of probability? We can say that on any roll the chance of getting a 3 is the same as getting a 1 or any other number from 1 to 6. So, the chance of getting any one specific number is 1/6. This is important, because we can estimate the number of 1's, 2's, 3's, 4's, 5's, and 6's we would expect to observe if the die is rolled a number of times.

To keep the numbers simple, let's say we roll the die 60 times; we would expect to observe each number 10 times. We don't expect to see exactly 10 of each number. In fact, the chance of observing exactly 10 of each number is rather small. However, the chance of getting approximately 10 of each number is high.

Let's roll the die 60 times and record what we observe:

           1's  2's  3's  4's  5's  6's   Total
Observed     6   12    5    8   13   16      60
Expected    10   10   10   10   10   10      60

Let's compute the difference between the expected and observed
(this will tell us how far apart they are).

Difference 4 -2 5 2 -3 -6

We will ignore the signs in the difference and square each number
(to square a number is to multiply it by itself)

Difference 16 4 25 4 9 36
Squared

Now, we'll divide Difference Squared by Expected Values.

Difference Squared/
Expected Value 16/10 4/10 25/10 4/10 9/10 36/10

and add these all up, which equals 99/10 = 9.9

The value of 9.9 is what we are interested in.

Mike

Next we will learn what the value of 9.9 signifies.
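For anyone who wants to check the tally in software, here is a minimal R sketch of the same calculation (R is the free statistics package recommended a bit later in this thread); the variable names are just illustrative:

observed <- c(6, 12, 5, 8, 13, 16)        # counts of 1's through 6's from the 60 rolls
expected <- rep(60 / 6, 6)                # 10 of each face if the die is fair
sum((observed - expected)^2 / expected)   # the statistic; R gives 9.4 (see the correction a few posts below)
chisq.test(observed)                      # or let R run the whole goodness-of-fit test in one call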

GameTheory
12-28-2010, 06:16 PM
I knew there was an old article along these lines. Found it:

http://www.hoof.demon.co.uk/archie.html

TrifectaMike
12-28-2010, 06:28 PM
I knew there was an old article along these lines. Found it:

http://www.hoof.demon.co.uk/archie.html

Thanks. Read the article. That is not where we heading. We are going to develop a handicapping method.

Mike

CBedo
12-28-2010, 06:59 PM
I'm definitely interested, and if there is anything I can do to help, let me know.

Capper Al
12-28-2010, 07:26 PM
This is sounding like The Mathematics of Horse Racing by David B. Fogel. I have the book. I should like this. I liked the book, but never settled on a system to measure. My systems seem never to stand still too long. I'm always changing them.

garyoz
12-28-2010, 08:45 PM
Chi-Square is typically used with categorical (nominal-level) data and is stats 101. It is typically the only statistical method taught in intro research methods at the undergrad level.

Data can be categorized as nominal, ordinal, interval, or ratio. Categorical level data is the least robust. You can try to prove statistically significant associative relationships with a Chi-Square test--and argue for causality, but really don't think it will get you anyplace in handicapping (IMHO). You won't get any weights (e.g., impact value). You will also have to make judgements in placing variables into a category for analysis--unlike ratio level data where you can use the number. No need to do that when many of your underlying variables have a zero point and are ratio level--allowing you to use more sophisticated approaches.

Plenty of people on this board have applied statistics to databases, primarily regression analysis (ratio level-and far more robust) and there are a ton of posts if you do a search. So if you want to bounce variables against each other and look for relationships you might want to consider a statistical approach that provides greater power of interpretation.

On the other hand, maybe you have sophisticated categorical level predictor variables in mind. Handicapping certainly needs some new ideas.

TrifectaMike
12-28-2010, 09:22 PM
Chi-Square Distribution Table

Df\Prob .250 .100 .050 .025 .010 .005
1 1.323 2.705 3.841 5.023 6.634 7.879
2 2.772 4.605 5.991 7.377 9.210 10.596
3 4.108 6.251 7.814 9.348 11.344 12.838
4 5.385 7.779 9.487 11.143 13.276 14.860
5 6.625 9.236 11.070 12.832 15.086 16.749

6 7.840 10.644 12.591 14.449 16.811 18.547
7 9.037 12.017 14.067 16.012 18.475 20.277
8 10.218 13.361 15.507 17.534 20.090 21.954
9 11.388 14.683 16.918 19.022 21.665 23.589
10 12.548 15.987 18.307 20.483 23.209 25.188


Recall the 9.9 value from the dice example; we will use the table above to test for significance (i.e., whether the die is fair). As I said, I will take liberties with definitions to make the process understandable. That said, the left side of the table can be viewed as the number of items minus one (in the case of the die, that would be 6 faces minus one, or five). The values across the top of the table are probabilities. However, let's view them as a measure of confidence in our premise.

So, let us see where the 9.9 value is greater than a value in the body of the table. Going across the row for 5, we note that 9.9 is greater than 9.236, which corresponds to a probability of .10. Viewing it as a measure of confidence, we can state with a confidence level of 90% that the die is loaded.

Make an attempt to understand the process. It is not too complicated, and it'll become easier when we do handicapping factors.

Mike
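If you would rather let software do the table lookup, a short R sketch of the same comparison (df = 5 for the six faces):

qchisq(0.90, df = 5)                       # the 90% critical value, about 9.236 (the table entry used above)
pchisq(9.9, df = 5, lower.tail = FALSE)    # chance of a value this large from a fair die, just under .10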

TrifectaMike
12-28-2010, 09:37 PM
Chi-Square is typically used with categorical (nominal-level) data and is stats 101. It is typically the only statistical method taught in intro research methods at the undergrad level.

Data can be categorized as nominal, ordinal, interval, or ratio. Categorical level data is the least robust. You can try to prove statistically significant associative relationships with a Chi-Square test--and argue for causality, but really don't think it will get you anyplace in handicapping (IMHO). You won't get any weights (e.g., impact value). You will also have to make judgements in placing variables into a category for analysis--unlike ratio level data where you can use the number. No need to do that when many of your underlying variables have a zero point and are ratio level--allowing you to use more sophisticated approaches.

Plenty of people on this board have applied statistics to databases, primarily regression analysis (ratio level-and far more robust) and there are a ton of posts if you do a search. So if you want to bounce variables against each other and look for relationships you might want to consider a statistical approach that provides greater power of interpretation.

On the other hand, maybe you have sophisticated categorical level predictor variables in mind. Handicapping certainly needs some new ideas.

Thank you for the suggestions. I don't think I'll be doing any searches here for information on regression analysis. Can you prove that regression is far more robust? I'd be interested in such a proof.

Mike

CBedo
12-28-2010, 10:00 PM
and add these all up, which equals 99/10 = 9.9

The value of 9.9 is what we are interested in.

Mike

Next we will learn what the value of 9.9 signifies.

FYI, I think it's 9.4, not 9.9.

Thanks for doing this. I think all of us are (or should be) eager and open minded enough to listen when someone else is willing to take the time to share and teach.

P.S. For anyone who wants to play around with some of these statistics, especially graphically, I'd recommend checking out R, a free & robust statistical package (http://www.r-project.org/).

HUSKER55
12-29-2010, 01:20 AM
I will help if I can. I read the article and it seems to be pretty advanced for me.
Interested in the results, however.

Don't know how I can help but ask any way.

garyoz
12-29-2010, 03:54 AM
Thank you for the suggestions. I don't think I'll be doing any searches here for information on regression analysis. Can you prove that regression is far more robust? I'd be interested in such a proof.

Mike

Ratio level data provide for mathematical manipulation and measurement of central tendency in a manner that goes beyond mode (which is merely counting frequency of occurrence), which is very useful in a practical sense for confidence intervals, beta weights, etc. With Chi-Square you are just going to be looking at the observed frequency of a category compared to its expected frequency.

I don't want to discourage your effort--just trying to fit the methodology to the research question (not sure what that is) and the data. I've taught research methods at both the undergrad and grad levels.

See generally, http://en.wikipedia.org/wiki/Level_of_measurement#Ratio_measurement


The central tendency of a variable measured at the interval level can be represented by its mode, its median, or its arithmetic mean. Statistical dispersion can be measured in most of the usual ways, which just involve differences or averaging, such as range, interquartile range, and standard deviation.

Also see :

http://www.quickmba.com/stats/centralten/

http://en.wikipedia.org/wiki/Mode_(statistics)

Capper Al
12-29-2010, 05:47 AM
Wow! The outcome looked believable. I would have read that backwards and guessed that with a 90% confidence level the die looked fair. It has been a long time since I did this. Thanks.


Chi-Square Distribution Table

Df\Prob .250 .100 .050 .025 .010 .005
1 1.323 2.705 3.841 5.023 6.634 7.879
2 2.772 4.605 5.991 7.377 9.210 10.596
3 4.108 6.251 7.814 9.348 11.344 12.838
4 5.385 7.779 9.487 11.143 13.276 14.860
5 6.625 9.236 11.070 12.832 15.086 16.749

6 7.840 10.644 12.591 14.449 16.811 18.547
7 9.037 12.017 14.067 16.012 18.475 20.277
8 10.218 13.361 15.507 17.534 20.090 21.954
9 11.388 14.683 16.918 19.022 21.665 23.589
10 12.548 15.987 18.307 20.483 23.209 25.188

Recall the 9.9 value from the dice example; we will use the table above to test for significance (i.e., whether the die is fair). As I said, I will take liberties with definitions to make the process understandable. That said, the left side of the table can be viewed as the number of items minus one (in the case of the die, that would be 6 faces minus one, or five). The values across the top of the table are probabilities. However, let's view them as a measure of confidence in our premise.

So, let us see where the 9.9 value is greater than a value in the body of the table. Going across the row for 5, we note that 9.9 is greater than 9.236, which corresponds to a probability of .10. Viewing it as a measure of confidence, we can state with a confidence level of 90% that the die is loaded.

Make an attempt to understand the process. It is not too complicated, and it'll become easier when we do handicapping factors.

Mike

TrifectaMike
12-29-2010, 08:14 AM
Ratio level data provide for mathematical manipulation and measurement of central tendency in a manner that goes beyond mode (which is merely counting frequency of occurrence), which is very useful in a practical sense for confidence intervals, beta weights, etc. With Chi-Square you are just going to be looking at the observed frequency of a category compared to its expected frequency.

I don't want to discourage your effort--just trying to fit the methodology to the research question (not sure what that is) and the data. I've taught research methods at both the undergrad and grad levels.

See generally, http://en.wikipedia.org/wiki/Level_of_measurement#Ratio_measurement


The central tendency of a variable measured at the interval level can be represented by its mode, its median, or its arithmetic mean. Statistical dispersion can be measured in most of the usual ways, which just involve differences or averaging, such as range, interquartile range, and standard deviation.

Also see :

http://www.quickmba.com/stats/centralten/

http://en.wikipedia.org/wiki/Mode_(statistics)

Interesting. You make some valid and important points. After some further reading, maybe logistic regression is "more robust".

Question???
Aside from how one would determine the independent variables, I have another question. How does one test for goodness of fit of a logistic regression?

Mike

HandyKapper
12-29-2010, 09:12 AM
CBedo is correct. The value would be 9.4

Hochay
12-29-2010, 09:24 AM
Well, what is the workup being used to obtain 9.4? All I seem to get is 9.9, unless I'm not using the proper formula.

Robert Goren
12-29-2010, 09:58 AM
Well, what is the workup being used to obtain 9.4? All I seem to get is 9.9, unless I'm not using the proper formula.

Check your addition.

Hochay
12-29-2010, 10:01 AM
Check your addition.

Checked it and DUH! I am a turd, it is 94 <9.4>. Thanx

TrifectaMike
12-29-2010, 10:11 AM
Let's continue with an actual handicapping factor as an example.

Benter and others report that one of the more significant factors
in their model is "number of past races".

As in the die example, we want to know if this factor is "loaded".

We'll define some categories:

Win Lose
Number of past races <= 5
Number of past races > 5 <= 10
Number of past races > 11 <= 15
Number of past races > 16 <= 20
Number of past races > 20 <= 25
Number of past races > 26 <= 30
Number of past races > 30 <= 40
Number of past races > 40 <= 50
Number of past races > 50


We will not use Maiden races. We'll use two distinct samples.
One for three year olds and another for four year olds and up.

We'll also need to know the number of horses in the samples.

After we get this data, we'll crunch some numbers.

Mike

P.S. My bad addition. It is 9.4.
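A hypothetical sketch of how the binning might be done in R, assuming a data frame with one row per starter, a past-race count, and a win flag (the column names are mine, not from any particular database, and the breaks read the categories above as 6-10, 11-15, and so on):

breaks <- c(-Inf, 5, 10, 15, 20, 25, 30, 40, 50, Inf)
bin <- cut(horses$past_races, breaks)
table(bin, horses$win)   # the Win/Lose counts needed for the Chi Square test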

Capper Al
12-29-2010, 10:40 AM
Okay, the number of races is a definite number. Speed from a key race would be subjective to the handicapper selecting the past performance line. This would not be a definite number since it will vary from capper to capper.

Dave Schwartz
12-29-2010, 12:22 PM
"Speed" could be if it came from a definitive set of rules such as "best of last 2 speed ratings" or "average of the last 3."

sjk
12-29-2010, 12:45 PM
Here is what I get for a 5 year sample:

past race count group wins losses
1 60499 479627
2 55378 345525
3 39598 259501
4 27842 196923
5 20187 148844
6 15269 112095
7 19397 148410
8 21209 179788

The 1st group with few starts has a lower win percent, but that is likely related to field size. Win percentage peaks in the 2nd group, then gradually decreases throughout.

Capper Al
12-29-2010, 01:49 PM
"Speed" could be if it came from a definitive set of rules such as "best of last 2 speed ratings" or "average of the last 3."

You're right. The question that will need to be answered is how certain we are that we have captured a true reflection of a horse's speed quality by blindly going with the best of the last 2 or 3. It should be good, but how good?

GameTheory
12-29-2010, 01:58 PM
You're right. The question that will need to be answered is how certain we are that we have captured a true reflection of a horse's speed quality by blindly going with the best of the last 2 or 3. It should be good, but how good?

Isn't that what statistics are for? To let you know how good of a reflection (or stand-in) a standardized measurement is compared to the real, ultimately unmeasurable factor?

TrifectaMike
12-29-2010, 02:21 PM
Here is what I get for a 5 year sample:

past race count group wins losses
1 60499 479627
2 55378 345525
3 39598 259501
4 27842 196923
5 20187 148844
6 15269 112095
7 19397 148410
8 21209 179788

The 1st group with few starts has a lower win percent, but that is likely related to field size. Win percentage peaks in the 2nd group, then gradually decreases throughout.

Let me make sure I'm reading this correctly.

Win Lose
Number of past races <= 5 --- ---
Number of past races > 5 <= 10 60499 479627
Number of past races > 11 <= 15 55378 345525
Number of past races > 16 <= 20 39598 259501
Number of past races > 20 <= 25 27842 196923
Number of past races > 26 <= 30 20187 148844
Number of past races > 30 <= 40 15269 112095
Number of past races > 40 <= 50 19397 148410
Number of past races > 50 21209 179788


The sample size is 2,129,332 horses

Mike

sjk
12-29-2010, 03:01 PM
Thanks for correcting the columns.

You are reading correctly. It is 2.1 million starts and is 5 years worth of racing. (Actual dates 1/19/04 thru 1/19/09)

lansdale
12-29-2010, 03:16 PM
Here is what I get for a 5 year sample:

past race count group wins losses
1 60499 479627
2 55378 345525
3 39598 259501
4 27842 196923
5 20187 148844
6 15269 112095
7 19397 148410
8 21209 179788

The 1st group with few starts has a lower win percent, but that is likely related to field size. Win percentage peaks in the 2nd group, then gradually decreases throughout.

I believe this is similar to Benter's finding - somewhat predictable - that horses decline in ability as they age. Although there might also be a factor - as a friend pointed out about his own body - 'It's not the years but the mileage'. It's for this reason that I believe Benter awarded a higher weight to younger horses.

Cheers,

lansdale

Capper Al
12-29-2010, 03:19 PM
Isn't that what statistics are for? To let you know how good of a reflection (or stand-in) a standardized measurement is compared to the real, ultimately unmeasurable factor?

My comment was on category. Your comment is on measuring. We can measure numbers. We can standardize the measurement of the last 2 or 3 races' best speed since this is objective and clearly defined, but not the horse's speed, which is subjective depending on the capper's selected PP line. No two cappers will pick the same PP lines.

Dave Schwartz
12-29-2010, 03:26 PM
Al,

I think it is safe to assume that "speed" (to most people) has nothing to do with picking pacelines.

Most people who refer to "speed" (I think) are referring to some other form of determination such as I referenced earlier.

Obviously, if it is subjective it is outside the scope of this discussion.


Dave

sjk
12-29-2010, 04:03 PM
I was light one row; had the last two together.

past race count group wins losses
1 60499 479627
2 55378 345525
3 39598 259501
4 27842 196923
5 20187 148844
6 15269 112095
7 19397 148410
8 10487 84004
9 10722 95784

TrifectaMike
12-29-2010, 04:09 PM
Observed
Win Lose Total Win%
Number of past races <= 5 --- --- --- ---
Number of past races > 5 <= 10 60499 479627 540126 11.2
Number of past races > 11 <= 15 55378 345525 400903 13.8
Number of past races > 16 <= 20 39598 259501 299099 13.2
Number of past races > 20 <= 25 27842 196923 223965 12.3
Number of past races > 26 <= 30 20187 148844 169071 11.9
Number of past races > 30 <= 40 15269 112095 127364 12.0
Number of past races > 40 <= 50 19397 148410 167807 11.6
Number of past races > 50 21209 179788 200997 10.5
Total 259,379


The sample size is 2,129,332 horses.
The Win% of sample is 259,379/2,129,332 = 12.18%


Expected
Win Lose Total Win%
Number of past races <= 5 --- --- --- ---
Number of past races > 5 <= 10 65787 474339 540126 12.18
Number of past races > 11 <= 15 48129 352774 400903 12.18
Number of past races > 16 <= 20 36430 262669 299099 12.18
Number of past races > 20 <= 25 27278 196687 223965 12.18
Number of past races > 26 <= 30 20592 148479 169071 12.18
Number of past races > 30 <= 40 15512 111852 127364 12.18
Number of past races > 40 <= 50 20438 147369 167807 12.18
Number of past races > 50 24481 176516 200997 12.18


To arrive at the Chi Square statistic we do the following for each category:

(Actual Wins - Expected Wins)^2 divided by Expected Wins +
(Actual Losses - Expected Losses)^2 divided by Expected Losses

and sum them up.

Does someone want to take a crack at it? As you can already tell, I'm prone to computational errors.

Mike
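Taking a crack at it in R, a minimal sketch using the eight groups tabled above:

wins   <- c(60499, 55378, 39598, 27842, 20187, 15269, 19397, 21209)
losses <- c(479627, 345525, 259501, 196923, 148844, 112095, 148410, 179788)
starts <- wins + losses
p_win  <- sum(wins) / sum(starts)            # overall win rate, about 0.1218
exp_w  <- starts * p_win                     # expected wins per category
exp_l  <- starts * (1 - p_win)               # expected losses per category
sum((wins - exp_w)^2 / exp_w + (losses - exp_l)^2 / exp_l)   # the Chi Square statistic
chisq.test(cbind(wins, losses))              # or let R build the expected counts itself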

GameTheory
12-29-2010, 04:32 PM
My comment was on category. Your comment is on measuring. We can measure numbers. We can standardize the measurement of the last 2 or 3 races' best speed since this is objective and clearly defined, but not the horse's speed, which is subjective depending on the capper's selected PP line. No two cappers will pick the same PP lines.

I was using measurement in a more general sense: any bit of data about a horse -- however arrived at, objective, subjective, categorical, continuous, ordinal, interval, psychic conjecture, whatever -- that we would like to evaluate statistically.

Capper Al
12-29-2010, 04:46 PM
Al,

I think it is safe to assume that "speed" (to most people) has nothing to do with picking pacelines.

Most people who refer to "speed" (I think) are referring to some other form of determination such as I referenced earlier.

Obviously, if it is subjective it is outside the scope of this discussion.


Dave

Dave,

I believe many cappers will use a key paceline's speed rating to reflect a horse's possible speed in today's race. I know I do, and many authors do in their books and systems. This popular method is subjective.

Dave Schwartz
12-29-2010, 05:44 PM
Al,

Difference of opinion always makes life interesting.

Personally, I have never met a single player that selected pacelines who was not a pace handicapper. But let's not clog this thread any more with our trivia.


Dave

Capper Al
12-29-2010, 08:18 PM
Al,

Difference of opinion always makes life interesting.

Personally, I have never met a single player that selected pacelines who was not a pace handicapper. But let's not clog this thread any more with our trivia.


Dave

Fair enough. We'll just have to wait and see how TrifectaMike handles speed. That's what is important here.

TrifectaMike
12-29-2010, 11:16 PM
Observed
Win Lose Total Win%
Number of past races <= 5 --- --- --- ---
Number of past races > 5 <= 10 60499 479627 540126 11.2
Number of past races > 11 <= 15 55378 345525 400903 13.8
Number of past races > 16 <= 20 39598 259501 299099 13.2
Number of past races > 20 <= 25 27842 196923 223965 12.3
Number of past races > 26 <= 30 20187 148844 169071 11.9
Number of past races > 30 <= 40 15269 112095 127364 12.0
Number of past races > 40 <= 50 19397 148410 167807 11.6
Number of past races > 50 21209 179788 200997 10.5
Total 259,379


The sample size is 2,129,332 horses.
The Win% of sample is 259,379/2,129,332 = 12.18%


Expected
Win Lose Total Win%
Number of past races <= 5 --- --- --- ---
Number of past races > 5 <= 10 65787 474339 540126 12.18
Number of past races > 11 <= 15 48129 352774 400903 12.18
Number of past races > 16 <= 20 36430 262669 299099 12.18
Number of past races > 20 <= 25 27278 196687 223965 12.18
Number of past races > 26 <= 30 20592 148479 169071 12.18
Number of past races > 30 <= 40 15512 111852 127364 12.18
Number of past races > 40 <= 50 20438 147369 167807 12.18
Number of past races > 50 24481 176516 200997 12.18


To arrive at the Chi Square statistic we do the following for each category:

(Actual Wins - Expected Wins)^2 divided by Expected Wins +
(Actual Losses - Expected Losses)^2 divided by Expected Losses

and sum them up.

Does someone want to take a crack at it? As you can already tell, I'm prone to computational errors.

Mike

Without actually doing all the number crunching, it's obvious that we have an extremely high chi-square value.

There are two ways of getting a very high chi-square value; an unusual result from the correct theory , or a result from the wrong theory. These are usually indistinguishable. However, in this case I believe the theory is correct.

Mr. Benter and others are correct. Anyone NOT making use of this critical factor in their handicapping is negligent.

Mike

CBedo
12-30-2010, 12:31 AM
Without actually doing all the number crunching, it's obvious that we have an extremely high chi-square value.

There are two ways of getting a very high chi-square value; an unusual result from the correct theory , or a result from the wrong theory. These are usually indistinguishable. However, in this case I believe the theory is correct.

Mr. Benter and others are correct. Anyone NOT making use of this critical factor in their handicapping is negligent.

Mike

It's pretty clear with the difference in expected values in the first couple categories that we're going to get a pretty big chi-square.

If there are two ways of getting a very high value, and they can be indistinguishable, how do you suggest moving forward once you get a significant result?

Capper Al
12-30-2010, 07:03 AM
Without actually doing all the number crunching, it's obvious that we have an extremely high chi-square value.

There are two ways of getting a very high chi-square value; an unusual result from the correct theory , or a result from the wrong theory. These are usually indistinguishable. However, in this case I believe the theory is correct.

Mr. Benter and others are correct. Anyone NOT making use of this critical factor in their handicapping is negligent.

Mike

Double checking the numbers. It is amazing. All but one come out with a 12.18 win percent, and this is 12.01 for > 11 <= 10. I'm not getting the significance of this stat. To me, it looks like it doesn't really make a difference how many races they run. It's all the same.

TrifectaMike
12-30-2010, 09:09 AM
It's pretty clear with the difference in expected values in the first couple categories that we're going to get a pretty big chi-square.

If there are two ways of getting a very high value, and they can be indistinguishable, how do you suggest moving forward once you get a significant result?

I anticipated the problem that we would get an extremely high chi-square value when SJK was kind enough to provide the data. Chi-Square values are sensitive to very large samples. I proceeded anyway. My interest at this point is to make us comfortable with the concept and calculations.

As we go forward we will limit our sample size to approximately 5,000 horses.

In a later post I'll lay out the plan of attack. We will define, test, and weight various factors. However, before we do that, I will introduce one more concept, the Z-score. The Z-score is how we will weight the factors that we determine are significant.

Be patient, we will have fun, and at minimum it will be educational.

Mike

TrifectaMike
12-30-2010, 09:36 AM
Double checking the numbers. It is amazing. All but one come out with a 12.18 win percent, and this is 12.01 for > 11 <= 10. I'm not getting the significance of this stat. To me, it looks like it doesn't really make a difference how many races they run. It's all the same.

Al,

Rotate your thinking 180 degrees. The 12.18 win percent is what we expected, not what was actually observed.

Mike

Capper Al
12-30-2010, 10:06 AM
Al,

Rotate your thinking 180 degrees. The 12.18 win percent is what we expected, not what was actually observed.

Mike

Yeah, I figured it out later. It was too clean. Restarting my slide-ruler.

Dave Schwartz
12-30-2010, 10:14 AM
Career Starts is an indirect way to say "age."

TrifectaMike
12-30-2010, 10:18 AM
Z-score

What does a Z-score do?
A Z-score compares some occurrence of a condition to the expected value of
that condition, normalized to the rate and sample size.

Z-score = (p^ - p )/ sqrt(p*(1 - p) / n)

p^ = Observed frequency
p = Expected frequency
n = Number of cases in the sample

The numerator, p^ - p, compares the value of the observed frequency of
the condition to the expected frequency of the condition. When p^ - p is
equal to zero, the observed frequency is exactly the same as the expected
frequency. A Z-score around zero indicates that the sample is randomly
drawn from the overall population.

As the Observed frequency gets farther and farther from the expected frequency, the absolute value of the Z-score gets bigger.

We will be using the Z-score to weight the factors that we find to be significant by the Chi-Square value.

Mike

Next, we'll do an example. I'm open to any questions...just ask.
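For anyone following along in R, the formula is a one-liner (argument names mirror the definitions above):

z_score <- function(p_hat, p, n) (p_hat - p) / sqrt(p * (1 - p) / n)
# p_hat = observed frequency, p = expected frequency, n = cases in the sample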

raybo
12-30-2010, 02:04 PM
Z-score

What does a Z-score do?
A Z-score compares some occurrence of a condition to the expected value of
that condition, normalized to the rate and sample size.

Z-score = (p^ - p )/ sqrt(p*(1 - p) / n)

p^ = Observed frequency
p = Expected frequency
n = Number of cases in the sample

The numerator, p^ - p, compares the value of the observed frequency of
the condition to the expected frequency of the condition. When p^ - p is
equal to zero, the observed frequency is exactly the same as the expected
frequency. A Z-score around zero indicates that the sample is randomly
drawn from the overall population.

As the Observed frequency gets farther and farther from the expected frequency, the absolute value of the Z-score gets bigger.

We will be using the Z-score to weight the factors that we find to be significant by the Chi-Square value.

Mike

Next, we'll do an example. I'm open to any questions...just ask.

Mike,

Try total distance of workouts, after last race. I've found in my initial operation of AllDataBase that this factor comes up time after time, to some positive extent.

It varies by track, race distance, class, etc., but it almost always smooths out the trend, in a positive way (decreases the frequency and duration of losing streaks, makes sense as it points toward current form).

Now, granted, we are using "work distance ranking", among the competitors in each race, and not by the actual distance value, however, it might be of value to categorize by total distance value, by distance, class, etc.. Then again, total distance value might be too variable to give a result we could use for contender/win selection. I know rankings do work very good.

TrifectaMike
12-30-2010, 03:37 PM
Example of Chi-Square Statistic and Z-Score

NOTE: This data is factious and used for demonstration purposes.

Let us test a recency factor, days since the horse's last race.

Observed
Win Lose Total Win%
Days since last race 1 to 14 110 790 900 12.2
Days since last race 15 to 30 90 890 980 9.2
Days since last race over 30 65 570 635 10.2
Total 265


The sample size is 2,515 horses.
The Win% of sample is 265/2515 = 10.54%


Expected
Win Lose Total Win%
Days since last race 1 to 14 95 805 900 10.54
Days since last race 15 to 30 103 877 980 10.54
Days since last race over 30 67 568 635 10.54

Chi-Square Value

= (110-95)**2/95 + (790-805)**2/805 + (90-103)**2/103 +
(890-877)**2/877 + (65-67)**2/67 + (570-568)**2/568

= 2.367 + .280 + 1.64 + .193 + .060 + .007


Chi-Square Value = 4.547

Using row Df=2 (3 items -1 =2), we locate the value 4.605 which is
very close to our value of 4.547. We can conclude this factor
is significant with 90% confidence.

Chi-Square Distribution Table

Df\Prob .250 .100 .050 .025 .010 .005
1 1.323 2.705 3.841 5.023 6.634 7.879
2 2.772 4.605 5.991 7.377 9.210 10.596
3 4.108 6.251 7.814 9.348 11.344 12.838
4 5.385 7.779 9.487 11.143 13.276 14.860
5 6.625 9.236 11.070 12.832 15.086 16.749

6 7.840 10.644 12.591 14.449 16.811 18.547
7 9.037 12.017 14.067 16.012 18.475 20.277
8 10.218 13.361 15.507 17.534 20.090 21.954
9 11.388 14.683 16.918 19.022 21.665 23.589
10 12.548 15.987 18.307 20.483 23.209 25.188




Z-score = (p^ - p )/ sqrt(p*(1 - p) / n)

p^ = Observed frequency
p = Expected frequency
n = Number of cases in the sample

Z-score for - Days since last race 1 to 14

= (.1220 - .1054)/sqrt(.1054*(1-.1054)/900)
= 1.622

Therefore any horse that raced within 14 days would receive a weight of 1.622.

You can compute the Z-score for the remaining two categories.

Mike
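A quick R check of all three categories, using the recency counts above and the z_score function sketched a few posts back:

z_score <- function(p_hat, p, n) (p_hat - p) / sqrt(p * (1 - p) / n)
wins <- c(110, 90, 65)             # 1 to 14 days, 15 to 30 days, over 30 days
n    <- c(900, 980, 635)
p    <- sum(wins) / sum(n)         # overall win rate, about 0.1054
round(z_score(wins / n, p, n), 2)  # about 1.65, -1.38, -0.25 (1.65 vs. the 1.622 above is just rounding of p)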

TrifectaMike
12-30-2010, 04:15 PM
Arkansasman

If you are reading this thread, let me suggest that you abandon p values for statistical significance for your logistic equation coefficients, because statistical significance is so dependent on sample size, and p values provide little information on the strength, importance, or intuitive meaning of the relationship.

I suggest you look into using the Bayesian information criterion for the statistical testing of your coefficients. It is a much stronger and more reliable measure than p values, and it takes sample size into account.

Mike

Capper Al
12-30-2010, 05:36 PM
Arkansasman

If you are reading this thread, let me suggest that you abandon p values for statistical significance for your logistic equation coefficients, because statistical significance is so dependent on sample size, and p values provide little information on the strength, importance, or intuitive meaning of the relationship.

I suggest you look into using the Bayesian information criterion for the statistical testing of your coefficients. It is a much stronger and more reliable measure than p values, and it takes sample size into account.

Mike

That's what I thought too. My slide-ruler is really frozen up now. I need to re-boot.

arkansasman
12-30-2010, 07:03 PM
I have been reading this thread.


John

Native Texan III
12-30-2010, 07:18 PM
Interesting. You make some valid and important points. After some further reading, maybe logistic regression is "more robust".

Question???
Aside from how one would determine the independent variables, I have another question. How does one test for goodness of fit of a logistic regression?

Mike

The logit input dataset is processed to give the minimum error for the resultant factor weightings for each data input and overall. As the normal logit modeling is against 1 (= win) or 0 (= lost), the errors (which can be obtained by using the calculated weightings and back-applying them) do not matter too much (as they would in a linear regression, say), since close to 1 is 1 and close to 0 is 0. There are more complex multi-competitor logit methods which can only be solved by making trial-and-error assumptions.

TrifectaMike
12-30-2010, 07:47 PM
The logit input dataset is processed to give the minimum error for the resultant factor weightings for each data input and overall. As the normal logit modeling is against 1 (= win) or 0 (= lost), the errors (which can be obtained by using the calculated weightings and back-applying them) do not matter too much (as they would in a linear regression, say), since close to 1 is 1 and close to 0 is 0. There are more complex multi-competitor logit methods which can only be solved by making trial-and-error assumptions.

It's my understanding that estimation is based on Maximum likelihood...finding those coefficients that have the greatest likelihood of producing the observed data. In practice, I would assume that means maximizing the log likelihood function (the objective function).

Hey, that is just my understanding. I could be wrong.

But I ask once again,

"Aside from how one would determine the independent variables, I have another question. How does one test for goodness of fit of a logistic regression?"

Mike

CBedo
12-31-2010, 12:17 AM
Z-score = (p^ - p )/ sqrt(p*(1 - p) / n)

p^ = Observed frequency
p = Expected frequency
n = Number of cases in the sample

Z-score for - Days since last race 1 to 14

= (.1220 - .1054)/sqrt(.1054*(1-.1054)/900)
= 1.622

Therefore any horse that raced within 14 days would receive a weight of 1.622.

You can compute the Z-score for the remaining two categories.

Mike

Since (p^ - p) can be negative (and hopefully the square root can't, lol), the Z-scores can be negative as well, which means we could have negative weights? I think both the second and third categories would be negative (-1.38 & -0.25). I'll wait patiently to see how this all comes together, but this leads to the question of data preprocessing and "how to bin." In this instance, a horse that came back in 14 days gets a positive weight, but a horse who comes back one day later receives an equally negative factor weighting?

TrifectaMike
12-31-2010, 05:30 AM
Since (p^ - p) can be negative (and hopefully the square root can't, lol), the Z-scores can be negative as well, which means we could have negative weights? I think both the second and third categories would be negative (-1.38 & -0.25). I'll wait patiently to see how this all comes together, but this leads to the question of data preprocessing and "how to bin." In this instance, a horse that came back in 14 days gets a positive weight, but a horse who comes back one day later receives an equally negative factor weighting?

You are correct. Data definitions are important. I personally like to take a "shotgun" approach. Shoot as many pellets as possible and observe the patterns. And then consider refinements, and then shoot again...and again if necessary...always letting the data speak for itself, even when it appears illogical.

I attempt to construct metrics using medians. This allows one to measure strength within a race and to process data across races much more easily.

For example, we all to some degree or another scan a horse's speed ratings (Beyers, etc.) and naturally assume the last rating is more significant than the previous one. In fact, it is standard practice for logistic regression people to use time-related weights on their factors. It seems reasonable, but is it always correct?

Let me give you a specific example (arkansasman, try this out with your model). When looking at a horse's last four Beyers or Bris numbers, etc., you would think that the last is more significant than the previous, and so on. The data does not support the premise.

I have found using numerous tests that the second race beyer (or any other similar rating) is insignificant and adds no usable information. The third is significant as well as the fourth, but NOT the second race back. Interesting I would say.

Mike

garyoz
12-31-2010, 05:45 AM
". How does one test for goodness of fit of a logistic regression?"

Mike

R-Squared statistic.

Capper Al
12-31-2010, 07:11 AM
I have found using numerous tests that the second race beyer (or any other similar rating) is insignificant and adds no usable information. The third is significant as well as the fourth, but NOT the second race back. Interesting I would say.

Mike

I'm not following this, because if the second speed figure back isn't significant in today's race, it will become the third speed figure back in the next race. Then it has significance? It's the same figure. How can this be?

Actor
12-31-2010, 07:50 AM
Example of Chi-Square Statistic and Z-Score

NOTE: This data is factious and used for demonstration purposes.

Let us test a recency factor, days since the horse's last race.

Observed
Win Lose Total Win%
Days since last race 1 to 14 110 790 900 12.2
Days since last race 15 to 30 90 890 980 9.2
Days since last race over 30 65 570 635 10.2
Total 265


The sample size is 2,515 horses.
The Win% of sample is 265/2515 = 10.54%


Expected
Win Lose Total Win%
Days since last race 1 to 14 95 805 900 10.54
Days since last race 15 to 30 103 877 980 10.54
Days since last race over 30 67 568 635 10.54

Chi-Square Value

= (110-95)**2/95 + (790-805)**2/805 + (90-103)**2/103 +
(890-877)**2/877 + (65-67)**2/67 + (570-568)**2/568

= 2.367 + .280 + 1.64 + .193 + .060 + .007


Chi-Square Value = 4.547

I've been going over this with my college statistics textbook by my side and, as best as I can tell, you do not seem to be calculating the Chi-Square Value correctly. From the text:

THEOREM: If n1, n2, ..., nk and e1, e2, ..., ek are the observed and
expected frequencies, respectively, for the k possible outcomes of an
experiment that is performed n times, then as n becomes infinite the
distribution of the quantity

k
Σ (ni - ei)²/ei
i=1

will approach that of a Chi-Squared variable with k-1 degrees of freedom.

I.e, the results are to be summed over all possible outcomes. There are three possible outcomes here: a horse from one of the three categories wins. There is no other possibility. There should be three terms in the summation, viz., the first, third, and fifth terms. The second, fourth, and sixth terms are double counts. The correct value is 4.18, which is less than the critical value of 4.605.

These double counts are not present in your die casting example.

You should look up the definition of "factious." I think you mean "fictitious" but I could be wrong.

Actor
12-31-2010, 08:09 AM
Z-score = (p^ - p )/ sqrt(p*(1 - p) / n)


This is William L. Quirin's "standard normal" test.

For the first category I compute the standard normal value to be 1.59. Quirin says that the value needs to lie outside -2.5 to +2.5, indicating that this category (which has the biggest impact value) is not significant.

sjk
12-31-2010, 08:26 AM
Let me give you a specific example (arkansasman, try this out with your model). When looking at a horse's last four Beyers or Bris numbers, etc., you would think that the last is more significant than the previous, and so on. The data does not support the premise.

I have found using numerous tests that the second race beyer (or any other similar rating) is insignificant and adds no usable information. The third is significant as well as the fourth, but NOT the second race back. Interesting I would say.

Mike

My reading of the data disagrees with your assertion here. In my experience the more recent races should be weighted most heavily with the second last race having something like a 25% weighting.

TrifectaMike
12-31-2010, 10:24 AM
I've been going over this with my college statistics textbook by my side and, as best as I can tell, you do not seem to be calculating the Chi-Square Value correctly. From the text:

THEOREM: If n1, n2, ..., nk and e1, e2, ..., ek are the observed and
expected frequencies, respectively, for the k possible outcomes of an
experiment that is performed n times, then as n becomes infinite the
distribution of the quantity

k
Σ (ni - ei)²/ei
i=1

will approach that of a Chi-Squared variable with k-1 degrees of freedom.

I.e, the results are to be summed over all possible outcomes. There are three possible outcomes here: a horse from one of the three categories wins. There is no other possibility. There should be three terms in the summation, viz., the first, third, and fifth terms. The second, fourth, and sixth terms are double counts. The correct value is 4.18, which is less than the critical value of 4.605.

These double counts are not present in your die casting example.

You should look up the definition of "factious." I think you mean "fictitious" but I could be wrong.

Read another book or read that one more carefully.

Mike

TrifectaMike
12-31-2010, 10:26 AM
This is William L. Quirin's "standard normal" test.

For the first category I compute the standard normal value to be 1.59. Quirin says that the value needs to lie outside -2.5 to +2.5, indicating that this category (which has the biggest impact value) is not significant.

Where do you guys come up with this stuff? Quirin's standard normal?

Mike

TrifectaMike
12-31-2010, 10:30 AM
My reading of the data disagrees with your assertion here. In my experience the more recent races should be weighted most heavily with the second last race having something like a 25% weighting.

Sjk, I know this is off topic, but let's have someone, possibly Arkansasman run a test.

Run a logistic regression using the last four beyers as regressors....and share the results.

Mike

Capper Al
12-31-2010, 10:34 AM
I'm going to be forced to re-read The Mathematics of Horse Racing by David B. Fogel, not a bad idea just need the time. Fogel covers the 'How to' topic in six pages with references to charts in the appendices. Then he walks one through examples in twenty-one pages. His examples are for beaten lengths, days between races, drop in class, shippers, and change in distance. The rest of the book covers making your own system and the possibility of making a living at the track. In other words, Fogel gives one the 'How to' with many examples in 27 pages. Let's move on.

sjk
12-31-2010, 10:38 AM
Sjk, I know this is off topic, but let's have someone, possibly Arkansasman run a test.

Run a logistic regression using the last four beyers as regressors....and share the results.

Mike


I would be interested in the findings and how they were arrived at.

SK

TrifectaMike
12-31-2010, 10:44 AM
R-Squared statistic.

And which R-Squared statistic would that be? Mine, yours, Dave's, GT's, Benter's, Loie's, Peter's or Saint Paul's? You tell me, because if you compute one, I can compute another.

Mike

teddy
12-31-2010, 11:01 AM
In the end are we trying to find out what factors grouped together will produce a profitable return? The statistics are interesting but what are we going to be left with. Impact values are everywhere...

Greyfox
12-31-2010, 11:04 AM
For example, we all to some degree or another scan a horse's speed ratings (Beyers, etc.) and naturally assume the last rating is more significant than the previous one. In fact, it is standard practice for logistic regression people to use time-related weights on their factors. It seems reasonable, but is it always correct?

Mike

:ThmbUp:
I like the path that you are taking Trifecta Mike. The concept behind the approach is more important at this point than whether or not your particular figures are being done accurately. (Obviously down the road figure accuracy would be important.)

I also like the questions that you are asking.
For instance, with respect to "time related weights" they do not necessarily work that well in turf races. One has to look at the total picture on grass.

Actor
12-31-2010, 11:04 AM
Where do you guys come up with this stuff? Quirin's standard normal?

Mike

Winning at the Races: Computer Discoveries in Thoroughbred Handicapping By William L. Quirin, Ph.D. 1979, page 297.

TrifectaMike
12-31-2010, 11:41 AM
I'm not continuing with this thread. If I've insulted anyone, I apologize.

I tend to say too much.

Happy New Year to all.

Mike

Dave Schwartz
12-31-2010, 11:54 AM
Personally, I am getting a lot out of this thread. I'd hate to see it discontinued.

May I make a suggestion?

Mike is a guy that offered to lead a class. Please, let him do just that. If you wish to take issue with his approach, please do it AFTER he has finished.

At that point you can tell him why YOUR way is better.


Happy New Year to all.

Dave Schwartz

PaceAdvantage
12-31-2010, 12:19 PM
I'm not continuing with this thread. If I've insulted anyone, I apologize.

I tend to say too much.

Happy New Year to all.

Mike

What did I miss? I've been watching this thread like a hawk making sure it doesn't get sidetracked....tell me how to fix it so that it may continue.

CBedo
12-31-2010, 12:21 PM
Personally, I am getting a lot out of this thread. I'd hate to see it discontinued.

May I make a suggestion?

Mike is a guy that offered to lead a class. Please, let him do just that. If you wish to take issue with his approach, please do it AFTER he has finished.

At that point you can tell him why YOUR way is better.


Happy New Year to all.

Dave Schwartz

I agree with Dave. I hope Mike will continue. Why can't everyone set aside the factors (for now) and focus on the development of the methodology?

raybo
12-31-2010, 12:30 PM
I, too, would like to see Mike continue with his teachings. Although I'm not a "statistics" guy, I'm sure there will be things to learn here.

Maybe if we just went along with Mike's theories, withholding personal beliefs for the time being, there will come a point when the value/viability of the Chi Square statistic will become visible.

CBedo
12-31-2010, 12:32 PM
I've been going over this with my college statistics textbook by my side and, as best as I can tell, you do not seem to be calculating the Chi-Square Value correctly.

I think you're misreading your stats book (or the book is bad, lol). If you don't think Mike is calculating it correctly, try putting the data into a statistical package or even Excel and doing the calculation.

From R:

> recency
Wins Losses
1-14 110 790
15-30 90 890
31+ 65 570
> chisq.test(recency)

Pearson's Chi-squared test

data: recency
X-squared = 4.6765, df = 2, p-value = 0.0965

Looks dead on in his analysis to me (the numbers are slightly different, probably due to rounding).

TrifectaMike
12-31-2010, 12:36 PM
I made a bold statement and I should not have tried to impose on
Arkansasman. So, I've run a Logistic regression using the last four
Speed Ratings.

Here's an explanation of the independent variables (regressors)

Variable 1 Last Speed Rating
Variable 2 Second Speed Rating Back
Variable 3 Third Speed Rating Back
Variable 4 Fourth Speed Rating Back

How I set up the data:
In each case (last speed rating, etc.) I determine the median rating
for each race. Then I determine how each horse's rating differs from the
median. This allows for determining strength within the race and
also allows the data to be used across all races.

Running iteration number 1
Running iteration number 2
Running iteration number 3
Running iteration number 4
Running iteration number 5
Running iteration number 6
Running iteration number 7

The process converged after 7 iterations

The software I use is home grown.


Descriptives.......
154 cases have Y = 1 1236 cases have Y = 0

-2 log likelihood = 967.9034 (Null Model)
-2 log likelihood = 887.8558 (Full Model)

Overall Model Fit...
Chi Square = 80.0476 df = 4 p = 0.0000
R Square = 0.0827

Akaike's Information Criterion = 897.8558
Bayesian Information Criterion = 895.9030

Coefficients and Standard Errors...
Variable Coefficient Standard Error prob
1 0.0788 0.0142 0.0000
2 0.0050 0.0102 0.6216
3 0.0428 0.0123 0.0005
4 0.0243 0.0112 0.0298
Intercept -2.1564

Odds Ratios and 95% Confidence Intervals...
Variable Odds Ratio Low High
1 1.0819 1.0522 1.1125
2 1.0050 0.9852 1.0253
3 1.0438 1.0189 1.0692
4 1.0246 1.0024 1.0474
input data record?

Let me show you the important numbers.

Variable 2 (Second race back rating)
prob = .6216
That is not good!!!!!

Coefficient 0.0050
That is not good!!!

Odds Ratio 1.0050
That is not good!!!!

Mike
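
For anyone who wants to follow along in code, here is a minimal R sketch of the same idea: center each rating on its race median, then fit a win/loss logistic regression. The data frame, column names and numbers are invented for illustration; this is not the poster's home-grown software.

set.seed(1)
n_races <- 100; field <- 8
pp <- data.frame(
  race_id = rep(1:n_races, each = field),
  sr1 = rnorm(n_races * field, 80, 8),    # last speed rating
  sr2 = rnorm(n_races * field, 80, 8))    # second rating back

# Difference from the race median, so ratings are comparable across races
pp$dev1 <- ave(pp$sr1, pp$race_id, FUN = function(x) x - median(x))
pp$dev2 <- ave(pp$sr2, pp$race_id, FUN = function(x) x - median(x))

# Fabricate one winner per race (1 = win, 0 = loss) just so the model will run
noisy <- pp$dev1 + rnorm(nrow(pp), 0, 5)
pp$win <- ave(noisy, pp$race_id, FUN = function(x) as.integer(x == max(x)))

fit <- glm(win ~ dev1 + dev2, family = binomial, data = pp)
summary(fit)    # coefficients, standard errors and p-values, as in the output above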

garyoz
12-31-2010, 01:49 PM
I don't want to sound negative or discouraging and I hope this is helpful.

You can't run such highly correlated variables as a multiple regression model. Check out a correlation matrix for the variables--if you have values higher than .7 or so you run into multicollinearity, which leads to an unstable model. Instability is tied to the correlation between error terms (regression assumes a normal distribution for error terms--which the correlation violates). Also, if you are running a stepwise regression, the first variable "sucks up" all the variance for explanation and doesn't leave enough for the following variables to be associated with. (Not a very technical explanation.)

You can run them individually as bivariate regressions (one regressor and the logit probability density function--just an S-shaped curve--as the dependent variable) and compare each of those models. So you can run, a model for speed figure one race back, then one for two races back etc. and compare them.

See which one gives you the best fit. That's assuming that is what you want to do. I think what you are trying to do is use the probability of winning as the dependent variable and speed figure as the independent variable.

TrifectaMike
12-31-2010, 02:14 PM
I don't want to sound negative or discouraging and I hope this is helpful.

You can't run such highly correlated variables as a multiple regression model. Check out a correlation matrix for the variables--if you have values higher than .7 or so you run into multicollinearity, which leads to an unstable model. Instability is tied to the correlation between error terms (regression assumes a normal distribution for error terms--which the correlation violates). Also, if you are running a stepwise regression, the first variable "sucks up" all the variance for explanation and doesn't leave enough for the following variables to be associated with. (Not a very technical explanation.)

You can run them individually as bivariate regressions (one regressor and the logit probability density function--just an S-shaped curve--as the dependent variable) and compare each of those models. So you can run, a model for speed figure one race back, then one for two races back etc. and compare them.

See which one gives you the best fit. That's assuming that is what you want to do. I think what you are trying to do is use the probability of winning as the dependent variable and speed figure as the independent variable.

You're not telling me anything I don't already know. Without getting too technical, it DIDN'T reduce the significance of the 3rd and 4th variable. I stand by these results.

Mike

Capper Al
12-31-2010, 04:24 PM
Go for it TrifectaMike. I'm interested in your take on handicapping using the stats. Let's see how your system works. Thanks in advance for doing this.

Native Texan III
12-31-2010, 08:20 PM
It's my understanding that estimation is based on Maximum likelihood...finding those coefficients that have the greatest likelihood of producing the observed data. In practice, I would assume that means maximizing the log likelihood function (the objective function).

Hey, that is just my understanding. I could be wrong.

But I ask once again,

"Aside from how one would determine the independent variables, I have another question. How does one test for goodness of fit of a logistic regression?"

Mike

That is what I posted.

gm10
12-31-2010, 09:42 PM
Let me give you a specific example ( arkansasman try this out with your model). When looking at horses last four Beyers or Bris numbers, etc, you would think that the last is more significant than the previous and so on. The data does not support the premise.

I have found using numerous tests that the second race beyer (or any other similar rating) is insignificant and adds no usable information. The third is significant as well as the fourth, but NOT the second race back. Interesting I would say.

Mike

How did you test that? Y ~ f(X)
What was Y, and what was the shape of f()? Linear, for example?

Multi-collinearity was my first thought as well - if you were using linear regression, it sounds likely that this would be a problem. Have you tested the significance without including the most recent Beyer in your X vector?

gm10
12-31-2010, 10:06 PM
I made a bold statement and I should not have tried to impose on
Arkansasman. So, I've run a Logistic regression using the last four
Speed Ratings.

Here's an explanation of the independent variables (regressors)

Variable 1 Last Speed Rating
Variable 2 Second Speed Rating Back
Variable 3 Third Speed Rating Back
Variable 4 Fourth Speed Rating Back

How I setup the data:
In each case(last speed rating, etc) I determine the median rating
for each race. Then I determine how each horse's rating differs from the
median. This allows for determining the strength within the race and
also allows to use the data across all races.

Running iteration number 1
Running iteration number 2
Running iteration number 3
Running iteration number 4
Running iteration number 5
Running iteration number 6
Running iteration number 7

The process converged after 7 iterations

The software I use is home grown.


Descriptives.......
154 cases have Y = 1 1236 cases have Y = 0

-2 log likelihood = 967.9034 (Null Model)
-2 log likelihood = 887.8558 (Full Model)

Overall Model Fit...
Chi Square = 80.0476 df = 4 p = 0.0000
R Square = 0.0827

Akaike's Information Criterion = 897.8558
Bayesian Information Criterion = 895.9030

Coefficients and Standard Errors...
Variable Coefficient Standard Error prob
1 0.0788 0.0142 0.0000
2 0.0050 0.0102 0.6216
3 0.0428 0.0123 0.0005
4 0.0243 0.0112 0.0298
Intercept -2.1564

Odds Ratios and 95% Confidence Intervals...
Variable Odds Ratio Low High
1 1.0819 1.0522 1.1125
2 1.0050 0.9852 1.0253
3 1.0438 1.0189 1.0692
4 1.0246 1.0024 1.0474
input data record?

Let me show you the important numbers.

Variable 2 (Second race back rating)
prob = .6216
That is not good!!!!!

Coefficient 0.0050
That is not good!!!

Odds Ratio 1.0050
That is not good!!!!

Mike

Interesting ... can you test without X1 (most recent)?

TrifectaMike
12-31-2010, 11:51 PM
Interesting ... can you test without X1 (most recent)?


Descriptives.......
154 cases have Y = 1 1236 cases have Y = 0

-2 log likelihood = 967.9034 (Null Model)
-2 log likelihood = 926.0421 (Full Model)

Overall Model Fit...
Chi Square = 41.8613 df = 3 p = 0.0000
R Square = 0.0432

Akaike's Information Criterion = 934.0421
Bayesian Information Criterion = 931.5873

Coefficients and Standard Errors...
Variable Coefficient Standard Error prob
1 0.0181 0.0104 0.0813
2 0.0493 0.0122 0.0001
3 0.0282 0.0110 0.0103
Intercept -2.0949

Odds Ratios and 95% Confidence Intervals...
Variable Odds Ratio Low High
1 1.0183 0.9977 1.0393
2 1.0506 1.0257 1.0760
3 1.0286 1.0067 1.0510
input data record?

gm10
01-01-2011, 09:56 AM
Descriptives.......
154 cases have Y = 1 1236 cases have Y = 0

-2 log likelihood = 967.9034 (Null Model)
-2 log likelihood = 926.0421 (Full Model)

Overall Model Fit...
Chi Square = 41.8613 df = 3 p = 0.0000
R Square = 0.0432

Akaike's Information Criterion = 934.0421
Bayesian Information Criterion = 931.5873

Coefficients and Standard Errors...
Variable Coefficient Standard Error prob
1 0.0181 0.0104 0.0813
2 0.0493 0.0122 0.0001
3 0.0282 0.0110 0.0103
Intercept -2.0949

Odds Ratios and 95% Confidence Intervals...
Variable Odds Ratio Low High
1 1.0183 0.9977 1.0393
2 1.0506 1.0257 1.0760
3 1.0286 1.0067 1.0510
input data record?


Is X1 the second most recent? It is significant now?

The goodness of fit is quite a bit lower now.

Interesting, thx.

TrifectaMike
01-01-2011, 10:22 AM
Is X1 the second most recent? It is significant now?

The goodness of fit is quite a bit lower now.

Interesting, thx.

No, it is not significant now. Unless you intend to falsify the data, by changing the definition, or topology of the data. The result is not unexpected.

As I see it, there appears to be a periodicity associated with the data. Modulating the data with a decaying function isn't valid. Modulating with a periodic function of some sort, maybe. There is definitely an attractor.

Mike

douglasw32
01-01-2011, 10:34 AM
Mike- Please keep teaching

TrifectaMike
01-01-2011, 10:36 AM
R-Squared statistic.

My previous response was meant to provoke a response from you. Since, it didn't, let me be a bit more serious.

No consensus has emerged among the community on the single best measure, and each measure may give different results. At best an R-squared statistic is a rough guide, and one should not attribute great importance to it.

In fact, no serious researcher, in the area of logistic regression, presents the R-Squared statistic as a measure of pseudo variance.

Mike
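
One footnote for anyone wondering where a number like that comes from: McFadden's pseudo R-squared, computed from the two -2 log likelihoods in the earlier output, reproduces the reported 0.0827. Whether that is the exact measure the home-grown software reports is an assumption; the arithmetic in R:

neg2LL_null <- 967.9034    # -2 log likelihood, null model (from the earlier post)
neg2LL_full <- 887.8558    # -2 log likelihood, full model
1 - neg2LL_full / neg2LL_null    # McFadden pseudo R-squared, roughly 0.0827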

Tom
01-01-2011, 10:53 AM
This is the most interesting thread in months!
Let's see it to the end. :ThmbUp:

I'm taking the position that everything is worth investigating and leaving all previous ideas at home for a while.

raybo
01-01-2011, 11:30 AM
I agree, let's all try to keep an open mind, for the time being.

gm10
01-01-2011, 06:04 PM
No, it is not significant now. Unless you intend to falsify the data, by changing the definition, or topology of the data. The result is not unexpected.

As I see it, there appears to be a periodicity associated with the data. Modulating the data with a decaying function isn't valid. Modulating with a periodic function of some sort, maybe. There is definitely an attractor.

Mike

Sorry I was looking at the second column. It is significant at the 10% level - not the most convincing p-level, admittedly. Gray zone.

I'm very surprised but also intrigued by this result. Can you tell me which races you are using - I want to test the same on my own ratings.

Pell Mell
01-02-2011, 07:31 AM
You are correct. Data definitions are important. I personally like to take a "shotgun" approach. Shoot as many pellets as possible and observe the patterns. And then consider refinements, and then shoot again...and again if necessary...always letting the data speak for itself, even when it appears illogical.

I attempt to construct metrics using medians. This allows one to measure strength within a race and process data across races much easier.

For example, we all to some degree or another scan a horses speed ratings (beyers, etc) and naturally assume the last rating is more significant than the previous. In fact, it is standard practice for logistic regression people to use time related weights on their factors. It seems reasonable, but is it always correct?

Let me give you a specific example ( arkansasman try this out with your model). When looking at horses last four Beyers or Bris numbers, etc, you would think that the last is more significant than the previous and so on. The data does not support the premise.

I have found using numerous tests that the second race beyer (or any other similar rating) is insignificant and adds no usable information. The third is significant as well as the fourth, but NOT the second race back. Interesting I would say.

Mike

Interesting to me also because I consider the second back probably the most important.
I remember that 10 yrs ago my partner and I made over 25 Gs at Charles Town just betting off the second race back.
There are conditions that have to be met and that is that, no matter which race one uses, the race being used has to be taken in the context of the form cycle. IMHO, this is why all this work trying to reduce a bunch of data to a number is an exercise in futility.
Number crunching without taking progression or regression into consideration is akin to trying to capture a fart in a windstorm.
As an example: if the second race back was the first off a layoff, it can be totally disregarded UNLESS certain criteria can be met in the last race. Which, in effect, means that any numbers must be evaluated in relation to other numbers. Isolated numbers mean nothing!
One other item regarding the second race back is that, value wise, it's worth a helluva lot more than the last race.
I don't run a database but this is what I have garnered by handicapping virtually every day for over 60 yrs. JMO

TrifectaMike
01-02-2011, 08:26 AM
Sorry I was looking at the second column. It is significant at the 10% level - not the most convincing p-level, admittedly. Gray zone.

I'm very surprised but also intrigued with this result. Can you tell me which races you are using - I want to test the same on my own ratings.

Aqueduct 11-05 thru 12-19-2010

Mike

TrifectaMike
01-02-2011, 08:36 AM
Interesting to me also because I consider the second back probably the most important.
I remember that 10 yrs ago my partner and I made over 25 Gs at Charles Town just betting off the second race back.
There are conditions that have to be met and that is that, no matter which race one uses, the race being used has to be taken in the context of the form cycle. IMHO, this is why all this work trying to reduce a bunch of data to a number is an exercise in futility.
Number crunching without taking progression or regression into consideration is akin to trying to capture a fart in a windstorm.
As an example; if the second race back was the first off a layoff it can be totally disregarded UNLESS certain critera can be met in the last race. Which, in effect, means that any numbers must be evaluated in relation to other numbers. Isolated numbers mean nothing!
One other item regarding the second race back is that, value wise, it's worth a helluva lot more than the last race.
I don't run a database but this is what I have garnered by handicapping virtually every day for over 60 yrs. JMO

Thank you for the lesson. You have made it obviously clear to me that you are correct and I am wrong. I will reevaluate my entire approach.

Maybe you can help me further, because I have been trying to "capture a fart in a windstorm" my entire adult life. I'm intrigued by your " progression or regression" concept. Can you elaborate on this concept?

Once again, thanks.

Mike

Pell Mell
01-02-2011, 08:55 AM
Thank you for the lesson. You have made it obviously clear to me that you are correct and I am wrong. I will reevaluate my entire approach.

Maybe you can help me further, because I have been trying to "capture a fart in a windstorm" my entire adult life. I'm intrigued by your " progression or regression" concept. Can you elaborate on this concept?

Once again, thanks.

Mike

Do I detect a note of sarcasm in your answer? If not, please accept my apology for thinking so.

TrifectaMike
01-02-2011, 09:11 AM
Do I detect a note of sarcasm in your answer? If not, please accept my apology for thinking so.

Sarcasm? Not at all. Just a simple "Thank You" note and a request for further information.

Once again, thank you.

Mike

GameTheory
01-02-2011, 10:20 AM
Sarcasm? Not at all. Just a simple "Thank You" note and a request for further information.

Once again, thank you.

Mike

This thread is strange. Lots of good discussion with occasional unprovoked outbursts of passive-aggressive hostility from the thread starter. Weird.

Anyway, let's say that today's race is the second back after a year's layoff. Which makes the 1st race back recent (say 2 weeks ago), and the 2nd race back a year and 2 weeks (and just for fun, let's say the horse was injured in that race, thus the year lay-off). It seems common sense that this 2nd race back is not quite the same as if the horse is not coming off a layoff at all and that race had been just 30 days ago. Yet in the statistical stew, they both get thrown in as "2nd race back" in the model.

That's an extreme example, but the same applies with the form cycle. Like he was saying above -- let's say that 2nd race back was our first race after a year and the animal was just running around the track trying not to break a leg. Now we've got some recency, but still hardly a true measure of anything.

Or let's say in the 2nd race he ran a great race on dirt (a sprint), but his last race was seemingly crappy (but a turf route). But back to a dirt sprint today -- now that 2nd race back could be a great indicator -- certainly better than the last race anyway. Or one horse has obviously been improving and another obviously declining.

But all "2nd race back". Statistics cuts-out bits of data out of reality and give them the same label no matter how dissimilar as long as it fits the "2nd race back" criteria. (Which is fine, statistics-wise, but still you can see why someone would raise an eyebrow.)

Now, you can slice and dice any factor like this, saying there should be exceptions and distinctions. (Horse likes to race on the rail, and last 2 races was stuck outside, or had a bad pace scenario, or the weather was different, or it was a full moon, or any damn distinction you want to make, no matter how small.) But some of them are so "macro" and obvious and seemingly relevant it just seems stupid to ignore them (e.g. last race two years ago treated the same as last race run two weeks ago). And thus every time someone brings up using statistical methods, these same objections are always heard from the non-statheads. And I can't blame them, because the statheads usually never give them a good answer as to why things are done the way they are done, and cop a condescending attitude, essentially saying "shut up, the grown-ups are talking"...

TrifectaMike
01-02-2011, 10:26 AM
If PP's, video replays, the ability to watch races, etc didn't exist, which ONE piece of historical data would you like to have associated with each horse?

Give this serious thought, because this should be your starting point.

Mike

sjk
01-02-2011, 10:27 AM
GT,

If I were to weight the adjusted speed rating for the second last race in a forecast for today's speed (which I do) the weighting would decay based on days since that datapoint.

Tom
01-02-2011, 10:43 AM
OK, maybe chi square can answer the question about 2nd race back....

2nd back less than 30 days
2nd back 30- 60 days
2nd back 60- 90 days
2nd back 90- 120 days
2nd back over 120 days........or something like that

gl45
01-02-2011, 10:45 AM
Interesting to me also because I consider the second back probably the most important.
I remember that 10 yrs ago my partner and I made over 25 Gs at Charles Town just betting off the second race back.
There are conditions that have to be met and that is that, no matter which race one uses, the race being used has to be taken in the context of the form cycle. IMHO, this is why all this work trying to reduce a bunch of data to a number is an exercise in futility.
Number crunching without taking progression or regression into consideration is akin to trying to capture a fart in a windstorm.
As an example; if the second race back was the first off a layoff it can be totally disregarded UNLESS certain critera can be met in the last race. Which, in effect, means that any numbers must be evaluated in relation to other numbers. Isolated numbers mean nothing!
One other item regarding the second race back is that, value wise, it's worth a helluva lot more than the last race.
I don't run a database but this is what I have garnered by handicapping virtually every day for over 60 yrs. JMO

From a boxcar quote in the thread Longshots are the Ticket to Success.
"Of course, handicapping or the application of any handicapping angle or methods work best when they're incorporated into a a more comprehensive approach and some good ol' fashioned common sense is also applied. The core handicapping philosophy that I subscribed to and applied dilgently was: There is no such thing as an isolated handicapping factor -- that all factors are related to one another, especially to the Form Factor.

How important is the 2nd race back? If you had followed boxcar's teachings and read Ray Taulbot's lessons, you would have discovered that the second race back, handicapped singularly, would amount to nothing, but in conjunction with
three or more of a horse's attributes the 2nd back would be a deadly handicapping factor.

TrifectaMike
01-02-2011, 10:47 AM
This thread is strange. Lots of good discussion with occasional unprovoked outbursts of passive-aggressive hostility from the thread starter. Weird.

Anyway, let's say that today's race is the second back after a year's layoff. Which makes the 1st race back recent (say 2 weeks ago), and the 2nd race back a year and 2 weeks (and just for fun, let's say the horse was injured in that race, thus the year lay-off). It seems common sense that this 2nd race back is not quite the same as if the horse is not coming off a layoff at all and that race had been just 30 days ago. Yet in the statistical stew, they both get thrown in as "2nd race back" in the model.

That's an extreme example, but the same applies with the form cycle. Like he was saying above -- let's say that 2nd race back was our first race after a year and the animal was just running around the track trying not to break a leg. Now we've got some recency, but still hardly a true measure of anything.

Or let's say in the 2nd race he ran a great race on dirt (a sprint), but his last race was seemingly crappy (but a turf route). But back to a dirt sprint today -- now that 2nd race back could be a great indicator -- certainly better than the last race anyway. Or one horse has obviously been improving and another obviously declining.

But all "2nd race back". Statistics cuts-out bits of data out of reality and give them the same label no matter how dissimilar as long as it fits the "2nd race back" criteria. (Which is fine, statistics-wise, but still you can see why someone would raise an eyebrow.)

Now, you can slice and dice any factor like this, saying there should be exceptions and distinctions. (Horse likes to race on the rail, and last 2 races was stuck outside, or had bad pace scenario, or the weather was different, or it was a full moon, or any damn distinction you want to make, no matter how small) But some of them are so "macro" and obvious and seemingly relevant it just seems stupid to ignore them (e.g. last race two years ago treated the same as last race run two weeks ago). And thus everytime someone brings up using statistical methods, these same objections are always heard from the non-statheads. And I can't blame them, because the statheads usually never give them a good answer as to why the way things are done are done that way and cop a condescending attitude, essentially saying "shut up, the grown-ups are tallking"...

Can you share your "good answer" with us?

Mike

GameTheory
01-02-2011, 11:03 AM
Right, right and right. But when designing a model or experiment, you've still got to draw the lines somewhere to make it objective. And what a model tries to "capture" (so it can predict it later) is the "nonrandom" portion of a phenomenon contained in the data. So when we take a bit of data, defined by rules macro or micro ("2nd race back, period" or "2nd race back, but only within 30 days"), you are always still leaving out an essentially infinite amount of context, which you consign to be "random" (even if the process is actually completely deterministic and predictable in reality [given some other factors] -- it is now "random" relative to the factor we are looking at simply because we aren't going to account for those determining forces).

So when you make a model and the result is "2nd race back is not significant" it is very easy to fall into the trap of thinking that it REALLY is not significant in REALITY because it is not significant in the model, or it is easy anyway to misconstrue the true meaning of such statements. It is "not significant" only GIVEN THE EXACT RULES AND CONDITIONS AND UNDERLYING ASSUMPTIONS of the model. To make a model where a factor like 2nd race back turns out to be "not useful" in that model, and then to make the leap that 2nd race back should be thrown away now and forever from any consideration in handicapping in general (i.e. for everyone else not using this particular model with its particular rules and assumptions) is a huge mistake and gross abuse of statistics.

In other words, everyone is right from their own perspective. But as soon as you take the conclusions you've made in YOUR context and try to impose them into another one, you're making a mistake. I feel like I'm not being very clear here, but anyway, maybe someone understands what I'm getting at...

GameTheory
01-02-2011, 11:09 AM
Can you share your "good answer" with us ?
I'm not a stathead, but it might be something like I just posted before this, but more comprehensible. Maybe after my coffee.

In general, I don't see people who reach conclusions via statistics give us all the warnings that "ought to be on the label" so to speak and tend to take those conclusions and extrapolate them where they are not justified. See the collapse of Long-Term Capital Management for a famous example. Of course, everyone tends to do the same thing no matter how they arrive at their opinion, but those backed by "sound statistical practice" often have a particularly arrogant attitude when challenged, or not even challenged but just asked a simple question, e.g. "Yeah sure, but how do you account for layoffs?"

raybo
01-02-2011, 11:21 AM
If PP's, video replays, the ability to watch races, etc didn't exist, which ONE piece of historical data would you like to have associated with each horse?

Give this serious thought, because this should be your starting point.

Mike

That's a good question. Of course many things play a part in this, but, if all I had for each horse was 1 piece of historical information, with which to compare horses against one another, I suppose each horse's recent ROI (one might say recent earnings per start but that would have many more variables involved, class of competitors, purse sizes, etc.).

I would be interested in knowing what single factor others here believe would point to a horse's ability, when compared against his/her competitors.

TrifectaMike
01-02-2011, 11:22 AM
Right right and right. But when designing a model or experiment, you've still got the draw the lines somewhere to make it objective. And what a model tries to "capture" (so it can predict it later) is the "nonrandom" portion of a phenomenon contained in the data. So when we take a bit of data, defined by rules macro or micro ("2nd race back period" or "2nd race back, but only within 30 days") you always still leaving out an essentially infinite amount of context which you consign to be "random" (even if the process is actually completely deterministic and predictable in reality [given some other factors] -- it is now "random" relative to the factor we are looking at simply because we aren't going to account for those determining forces).

So when you make a model and the result is "2nd race back is not significant" it is very easy to fall into the trap of thinking that it REALLY is not significant in REALITY because it is not significant in the model, or it is easy anyway to misconstrue the true meaning of such statements. It is "not significant" only GIVEN THE EXACT RULES AND CONDITIONS AND UNDERLYING ASSUMPTIONS of the model. To make a model where a factor like 2nd race back turns out to be "not useful" in that model, and then to make the leap that 2nd race back should be thrown away now and forever from any consideration in handicapping in general (i.e. for everyone else not using this particular model with its particular rules and assumptions) is a huge mistake and gross abuse of statistics.

In other words, everyone is right from their own perspective. But as soon as you take the conclusions you've made in YOUR context and try to impose them into another one, you're making a mistake. I feel like I'm not being very clear here, but anyway, maybe someone understands what I'm getting at...

Let me help....

That my 'good' pie requires no sugar doesn't mean that sugar is not required in someone else's pie to make a good pie.

Mike

Native Texan III
01-02-2011, 12:36 PM
TM,

In your logit model it looks like you gave the winner (1) and the rest of the runners that lost a (0). Forgive me if this is the wrong assumption.

This would imply that the horse coming second (and its given speed attributes) is no better than the one coming last as far as modeling the constants is concerned. The winner of a weak race might have far worse attributes in reality than the horse coming last in a strong race. Is this correct?


"SNIPPED: I've run a Logistic regression using the last four
Speed Ratings.

Here's an explanation of the independent variables (regressors)

Variable 1 Last Speed Rating
Variable 2 Second Speed Rating Back
Variable 3 Third Speed Rating Back
Variable 4 Fourth Speed Rating Back

How I setup the data:
In each case(last speed rating, etc) I determine the median rating
for each race. Then I determine how each horse's rating differs from the
median. This allows for determining the strength within the race and
also allows to use the data across all races.

The process converged after 7 iterations

The software I use is home grown.


Descriptives.......
154 cases have Y = 1 1236 cases have Y = 0"

TrifectaMike
01-02-2011, 01:04 PM
TM,

In your logit model it looks like you gave the winner (1) and the rest of the runners that lost a (0). Forgive me if this is the wrong assumption.

This would imply that the horse coming second (and its given speed attributes) is no better than the one coming last as far as modeling the constants is concerned. The winner of a weak race might have far worse attributes in reality than the horse coming last in a strong race. Is this correct?


"SNIPPED: I've run a Logistic regression using the last four
Speed Ratings.

Here's an explanation of the independent variables (regressors)

Variable 1 Last Speed Rating
Variable 2 Second Speed Rating Back
Variable 3 Third Speed Rating Back
Variable 4 Fourth Speed Rating Back

How I setup the data:
In each case(last speed rating, etc) I determine the median rating
for each race. Then I determine how each horse's rating differs from the
median. This allows for determining the strength within the race and
also allows to use the data across all races.

The process converged after 7 iterations

The software I use is home grown.


Descriptives.......
154 cases have Y = 1 1236 cases have Y = 0"

From a previous post:

How I setup the data:
In each case(last speed rating, etc) I determine the median rating
for each race. Then I determine how each horse's rating differs from the
median. This allows for determining the strength within the race and
also allows to use the data across all races.

Mike

GameTheory
01-02-2011, 01:34 PM
In your logit model it looks like you gave the winner (1) and the rest of the runners that lost a (0). Forgive me if this is the wrong assumption.
A standard logit model requires this -- it models a probability of something either happening (won) or not happening (lost). It is always between two choices. (Or, in a multinomial model, between unordered categories anyway.) To use a more fine-grained ordered dependent variable, another kind of model is appropriate (like probit), but over a large number of races results are similar either way because, although your point is valid, on average over many races the winners have better attributes than the losers.

Dahoss2002
01-02-2011, 01:44 PM
If PP's, video replays, the ability to watch races, etc didn't exist, which ONE piece of historical data would you like to have associated with each horse?

Give this serious thought, because this should be your starting point.

Mike
I would want a speed rating. BRIS or Beyer either one. If only allowed one race, I would take the last race. If allowed more, then the last 4 races. Excellent thread!

TrifectaMike
01-02-2011, 02:10 PM
Firstly, you lecture me on Logistic Regression

Native Texan III: The logit input dataset is processed to give the minimum error for the resultant factor weightings for each data input and overall. As the normal logit modeling is against 1 (= win) or 0 (= lost), the errors (which can be obtained by using the calculated weightings and back-applying) do not matter too much (as they would in a linear regression, say), as close to 1 is 1 and close to 0 is 0. There are more complex multi-competitor logit methods which can only be solved by making trial and error assumptions.

Then you ask the following:

Native Texan III: In your logit model it looks like you gave the winner (1) and the rest of the runners that lost a (0). Forgive me if this is the wrong assumption.

This would imply that the horse coming second (and its given speed attributes) is no better than the one coming last as far as modeling the constants is concerned. The winner of a weak race might have far worse attributes in reality than the horse coming last in a strong race. Is this correct?

Am I to take you seriously?

Mike

TrifectaMike
01-02-2011, 02:36 PM
I would want a speed rating. BRIS or Beyer either one. If only allowed one race, I would take the last race. If allowed more, then the last 4 races. Excellent thread!

Thank you, Dahoss. I'll keep track.

1. Last race speed rating.

Mike

Robert Goren
01-02-2011, 04:08 PM
Finish in last race.

TrifectaMike
01-02-2011, 04:39 PM
Finish in last race.

Thank you, Robert.

1. Last race speed rating.
2. Finish in last race

Mike

TexasDolly
01-02-2011, 05:40 PM
Mike,
Top single factor for me, Last Race Odds
TD

TrifectaMike
01-02-2011, 05:43 PM
Mike,
Top single factor for me, Last Race Odds
TD

Dolly,

Male or female I love you.

Mike

TrifectaMike
01-02-2011, 05:45 PM
Thank you, Texas Dolly.

1. Last race speed rating.
2. Finish in last race
3. Last Race Odds

Mike

valueguy
01-02-2011, 08:05 PM
Speed fig at best level in a good race (good race defined as the horse finishing 1st or 2nd, or within 2 lengths of the winner); best level refers to a comparison of class. This, coupled with the best speed of one of the last 2 races, gives you a pretty high strike rate.

CBedo
01-02-2011, 08:22 PM
If everyone is going to ride Mike about his factors or system, I vote we just make up some data or take data from an unrelated field, so Mike can continue to explain the methodology to create and validate the system.

I'll take all the help I can get to learn new ways of getting at this stuff!

gm10
01-02-2011, 10:21 PM
GT,

If I were to weight the adjusted speed rating for the second last race in a forecast for today's speed (which I do) the weighting would decay based on days since that datapoint.

Good point. Getting that weighting correct is more important than your choice of speed figures in my opinion.

bigbrown
01-03-2011, 12:20 AM
Thank you, Texas Dolly.

1. Last race speed rating.
2. Finish in last race
3. Last Race Odds

Mike

Mike,

If you allow me to make a suggestion: keeping it the way you just re-started (i.e., compiling a list like the one above) will ensure the focus stays on your original questions and minimize deviations from the core content.

Good thread! Keep it going.

TrifectaMike
01-03-2011, 10:34 AM
""Data definitions are important. I personally like to take a "shotgun" approach. Shoot as many pellets as possible and observe the patterns. And then consider refinements, and then shoot again...and again if necessary...always letting the data speak for itself, even when it appears illogical.

I attempt to construct metrics using medians. This allows one to measure strength within a race and process data across races much easier.

For example, we all to some degree or another scan a horses speed ratings (beyers, etc) and naturally assume the last rating is more significant than the previous. In fact, it is standard practice for logistic regression people to use time related weights on their factors. It seems reasonable, but is it always correct?

Let me give you a specific example When looking at horses last four Beyers or Bris numbers, etc, you would think that the last is more significant than the previous and so on. The data does not support the premise.

I have found using numerous tests that the second race beyer (or any other similar rating) is insignificant and adds no usable information. The third is significant as well as the fourth, but NOT the second race back. Interesting I would say.""

Mike

Robert Goren
01-03-2011, 10:43 AM
If you are using a previous Beyer (or whatever) with no qualifiers to predict today's Beyer, you won't find much correlation no matter which one you use. The R of the last race's figure to today's is about 0.5, which is not much help.

TrifectaMike
01-03-2011, 10:58 AM
You are using a previous beyer's(or whatever)with no qualifiers to predict todays beyer's, you won't find much correlation no matter which one you use. The R of the last race to today's is about 0.5, which is not much help.

As I suspected, the post is misunderstood.

But you do raise an interesting question. What is the R based on using the last four Beyers without further qualifiers? I believe using the last four Beyers as regressors and a little Monte Carlo simulation can easily lead one to extremely good probabilities, and in many cases much better than the most complex logistic models.

Try it, you might like it.

Mike
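
Purely as a sketch of what a Monte Carlo pass over the last four figures might look like, here are a few lines of R: draw a figure for each horse from a distribution built on its last four ratings and count how often each horse comes out on top. The figures and the choice of a normal distribution are assumptions for illustration, not anyone's actual simulator.

set.seed(1)
figs <- list(A = c(88, 84, 90, 86),    # last four figures for three hypothetical horses
             B = c(82, 85, 80, 83),
             C = c(91, 76, 88, 79))
n_sims <- 10000
draws <- sapply(figs, function(x) rnorm(n_sims, mean = mean(x), sd = sd(x)))
winners <- colnames(draws)[max.col(draws)]    # highest simulated figure wins each trial
round(table(winners) / n_sims, 3)             # estimated win probabilities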

Dave Schwartz
01-03-2011, 01:43 PM
But you do raise an interesting question. What is the the R of the based on using the last four beyers without further qualifiers? I believe using last four beyers as regressors and a llittle Monte Carlo Simulation can easily lead one to extremely good Probabilities, and in many cases much better than the most complex logistic models.

Try it, you might like it.

I don't know about Beyers, but I have written a Monte Carlo simulator using speed ratings that performs quite well.


Regards,
Dave Schwartz

raybo
01-03-2011, 02:59 PM
Thank you, Texas Dolly.

1. Last race speed rating.
2. Finish in last race
3. Last Race Odds

Mike

Horse's recent ROI.

gm10
01-04-2011, 06:25 PM
As I suspected, the post is misunderstood.

But you do raise an interesting question. What is the the R of the based on using the last four beyers without further qualifiers? I believe using last four beyers as regressors and a llittle Monte Carlo Simulation can easily lead one to extremely good Probabilities, and in many cases much better than the most complex logistic models.

Try it, you might like it.

Mike

I disagree here. Multinomial logit models yield better results in my experience. The probit models are alleged to be even better.

TrifectaMike
01-04-2011, 07:01 PM
I disagree here. Multinomial logit models yield better results in my experience. The probit models are alleged to be even better.

You highly doubt it based on.....?
Have you tried it, tested it, and scored it against public odds?
I, for one, would be very interested in the results.

Probit? Just another way to linearize the nonlinear...not much difference. One transforms probabilities of an event into scores from the standard normal distribution rather than into logged odds from the logistic distribution. They both essentially give the same results.

Getting the proper independent variables is more important than which transformation one uses. So, what is the Probit alleged to do better than logistic? I'd be interested to know.

Mike
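
To make the "same transformation, different scale" point concrete, here is a two-line R comparison of the two links on the same probabilities (nothing model-specific is assumed):

p <- c(0.1, 0.3, 0.5, 0.7, 0.9)
cbind(p, logit = qlogis(p), probit = qnorm(p))    # logged odds vs. standard-normal scores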

gm10
01-04-2011, 07:53 PM
You highly doubt it based on.....?
Have you tried it, tested it, and scored it against public odds?
I, for one, would be very interested in the results.

Yes I have, but I did not keep track of all the (parametric and nonparametric) simulation results. I compared them with my main multinomial logit model at the time in terms of predicting winners, and ROI, and was disappointed. My logit model is better now than it was then, hence my doubts.

Probit? Just another way to linearize the nonlinear...not much difference. One transforms probabilities of an event into scores from the standard normal distribution rather than into logged odds from the logisticdistribution. They both essentially give the same results.


Probit is more advanced and computationally much more complex than previous models. Benter uses it. I read somewhere that he improved his ROI by 5 to 10% after replacing his multinomial logit model with a probit model. There is a presentation about him mentioning the probit model here

http://wmedia.hkedcity.net/archive/05/04_ICCM/ICCMpt04Ex.wmv

(very interesting).



Getting the proper independent variables is more important than which transformation one uses. So, what is the Probit alleged to do better than logistic? I'd be interested to know.

I have not looked into the mathematics of the probit model for about 6 years so won't comment on them. In general however, how you use the information you get is just as important as the information you have. As far as I know, probit is the most advanced of the well-researched models.

Anyway - enough about probit. I use multinomial logit and as I said earlier, my simulation results on the basis of speed figures, underperformed. That is why I made the earlier post.

Having said all that, I've always had vague plans to build a grand simulation model that merges pace analysis with speed, form and class analysis. The logit model is not ideal for that.

TrifectaMike
01-05-2011, 06:39 AM
Yes I have but I did not keep track of all the (parametric and nonparametric) simulation results. I compared them with my main multinomial logit model at the time in terms of predicting winners, and ROI, and was disappointed. My logit model is better than now than it was then, hence my doubts.



Probit is more advanced and computationally much more complex than previous models. Benter uses it. I read somewhere that he improved his ROI with 5 to 10% higher after replacing his multinomial logit model with a probit model. There is a presentation about him mentioning the probit model here

http://wmedia.hkedcity.net/archive/05/04_ICCM/ICCMpt04Ex.wmv

(very interesting).

I have not looked into the mathematics of the probit model for about 6 years so won't comment on them. In general however, how you use the information you get is just as important as the information you have. As far as I know, probit is the most advanced of the well-researched models.

Anyway - enough about probit. I use multinomial logit and as I said earlier, my simulation results on the basis of speed figures, underperformed. That is why I made the earlier post.

Having said all that, I've always had vague plans to build a grand simulation model that merges pace analysis with speed, form and class analysis. The logit model is not ideal for that.


Thanks for the info. I haven't had the chance to listen to Benter's presentation, but I will soon.

I REALLY like the idea of a simulation model.

Mike

TrifectaMike
01-05-2011, 06:54 AM
Let me list our tools that we will be using:

1. Chi-Square statistic for significance testing of factors.

2. Z-score for weighting the significant factors.

However, before we run a single test, we want to do the following:

Segment our selection of races for testing by odds.

For instance, one data sample may include races where the winner was between 2-1 to 4-1, another sample may include races where the winner was between 5-1 to 10-1.

This will help us find significant factors that the public underbets and overbets.

Mike
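
As a concrete sketch of the segmentation step, in the same style as the recency table earlier in the thread: build one wins/losses table per odds band and test each band separately. The counts below are invented purely to show the mechanics in R.

# Races whose winner went off between 2-1 and 4-1 (invented counts)
recency_2to4 <- matrix(c(60, 340,     # 1-14 days:  wins, losses
                         45, 395,     # 15-30 days: wins, losses
                         30, 280),    # 31+ days:   wins, losses
                       ncol = 2, byrow = TRUE,
                       dimnames = list(c("1-14", "15-30", "31+"),
                                       c("Wins", "Losses")))
chisq.test(recency_2to4)    # repeat with a separate table for the 5-1 to 10-1 winners, etc.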

TrifectaMike
01-05-2011, 08:05 AM
And some advice for anyone looking into modeling using regression techniques.

Before one throws himself into the complex realm of logistic or probit regression, he should first learn about ordinary linear regression. And learn it in a manner that advances the cause.

Using ordinary regression, one can determine expected horse times and expected race winning times with an adjusted R2 of .95, which explains nearly all the variation in horse times.

And then use this information in your logistic model ( A very, very good start).

Here I will get you started with a list of factors to use (from actual experience):

Expected Horse Time Parameters
1. Number of Previous Races
2. Number of Previous Races Squared
3. Days Since Last Race
4. Days Since Last Race Squared
5. Age
6. Age Squared
7. Number of Top Three Finishes
8. Number of Top Three Finishes Squared
9. Distance
10. Distance Squared
11. Earnings per Start
12. Categorical Variable (Maiden, Allowance, Claiming, Stake, Handicap)

Expected Race Time Parameters
1. Distance
2. Distance Squared
3. Purse Size
4. Categorical Variable (Maiden, Allowance, Claiming, Stake, Handicap)

The squared terms are an absolute must.

They allow the effect of the factor to diminish past some level.

Gotta stop this now...I don't want to piss too many people off.

Mike
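
A bare-bones R illustration of that kind of fit, with linear and squared terms side by side. The data are synthetic and the column names hypothetical, so the .95 figure is not reproduced here; the point is only the mechanics of adding the squared terms.

set.seed(2)
n <- 200
horses <- data.frame(
  distance = runif(n, 6, 9),                    # furlongs
  age      = sample(3:7, n, replace = TRUE),
  days_off = sample(7:90, n, replace = TRUE))
# Fabricate a final time that is roughly quadratic in age, just to have something to fit
horses$final_time <- 60 + 12 * horses$distance +
  0.5 * (horses$age - 5)^2 - 0.02 * horses$days_off + rnorm(n, 0, 1.5)

fit <- lm(final_time ~ distance + I(distance^2) +
            age + I(age^2) + days_off + I(days_off^2), data = horses)
summary(fit)$adj.r.squared
coef(fit)[c("age", "I(age^2)")]    # linear term negative, squared term positive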

m001001
01-05-2011, 08:33 AM
Logit model predicts only winning probabilities. You have to use conditional probability (e.g. Harville formula) or other methods to find prob for 2nd, 3rd, etc...

Probit model predicts probability for each and every permutation of finishing order. Therefore more accurate probabilities for exotics. But probit model is far far far more difficult to build and run.
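
For anyone who has not met it, the Harville formula referenced above chains straight win probabilities: the chance that horse i wins and horse j runs second is p_i * p_j / (1 - p_i). A tiny R sketch with made-up win probabilities:

p <- c(A = 0.40, B = 0.35, C = 0.25)         # hypothetical win probabilities
exacta <- outer(p, p, function(pi, pj) pi * pj / (1 - pi))
diag(exacta) <- NA                            # a horse cannot finish 1st and 2nd
round(exacta, 3)                              # rows = winner, columns = runner-up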

TrifectaMike
01-05-2011, 01:21 PM
Logit model predicts only winning probabilities. You have to use conditional probability (e.g. Harville formula) or other methods to find prob for 2nd, 3rd, etc...

Probit model predicts probability for each and every permutation of finishing order. Therefore more accurate probabilities for exotics. But probit model is far far far more difficult to build and run.

Wow, and I always thought they were similar....I guess viewing them from a purely mathematical sense is insufficient. Thanks, I'll explore this further.

Mike

SchagFactorToWin
01-05-2011, 01:51 PM
Using ordinary regression one can determine expected horses winning times and expected race winning times with an adjusted R2 of .95 which explains nearly all the variation in horse times.

Here I will get you started with a list of factors to use (from actual experience):


Maybe I'm misunderstanding you. Are you saying you achieved a .95 by using those (or similar, or those plus more) parameters?

TrifectaMike
01-05-2011, 02:07 PM
Maybe I'm misunderstanding you. Are you saying you achieved a .95 by using those (or similar, or those plus more) parameters?

Don't fall off your chair just yet, because the expected performance depends on horse k's expected time in race j as well as the expected winning time in race j.

Mike

SchagFactorToWin
01-05-2011, 02:21 PM
Don't fall off your chair just yet, because the expected performance depends on horse k 's expected time in race j as well as the expected winning time in race j.

Mike
You didn't answer the question.

TrifectaMike
01-05-2011, 02:30 PM
You didn't answer the question.

Short answer, yes.
Mike

gm10
01-05-2011, 09:04 PM
And some advice for anyone looking into modeling using regression techniques.

Before one throws himself into the complex realm of logistic or probit regression, that he first learn about ordinary linear regression. And learn in a manner that advances the cause.

Using ordinary regression one can determine expected horses winning times and expected race winning times with an adjusted R2 of .95 which explains nearly all the variation in horse times.

And then use this information in your logistic model ( A very, very good start).

Here I will get you started with a list of factors to use (from actual experience):

Expected Horse Time Parameters
1, Number of Previous Races
2. Number of Previous Races Squared
3. Days Since Last Race
4. Days Since Last Race Squared
5. Age
6. Age squared
7. Number of Top Three Finishes
8. Number of Top Three Finishes Squared
9 Distance
10. Distance Squared
11. Earnings per Start
12 Categorical Variable (Maiden, Allowance, Claiming, Stake, Handicap)

Expected Race Time Parameters
1. Distance
2. Distance Squared
3. Purse Size
4. Categorical Variable (Maiden, Allowance, Claiming, Stake, Handicap)

The squared terms are an absolute must.

It allows the effect of the factor to be diminished at some level.

Gotta stop this now...I don't want to piss too many people off.

Mike

Very interesting comment on the squared terms.

Cratos
01-05-2011, 11:43 PM
If there is sufficient interest, I will along with any PA member(s) will go through the process of generating a handicapping method based on the Chi Square statistic.

I (We) will do it in a such a manner which will be non-technical, easy to understand, using only basic arithmetic, and will allow for anyone to participate.

The end result I believe will be a profitable system.

We will need someone with a large database to provide data for the factors that we will test, and include in our model.

If there is interest I will proceed.

Let me know.

Mike

Mike,

I have read the posts on this thread from its inception and although I have found your thread premise to be enjoyable, I don’t regard it as “Using Chi Square Statistic to Produce a Handicapping Method.”

I see it as “Using Chi Square Statistic to Produce a Handicapping Analysis.” You might say that I am splitting hairs, but I am not. A handicapping method, in my opinion, should be a predictive method based on inputs that determine the winning horse.

As you very well know, the Chi Square statistic is about the “goodness of fit,” comparing observed data with data we would expect to obtain according to a specific hypothesis.

Therefore if we develop a predictive method (and I have) we can use the Chi Square statistic to test how “good” a predictor that method is.

In my experiment to develop a predictive method I started with the question: “What makes a horse win?” From that question I isolated two distinct variable groups that influence a horse winning. They are Factor-variables and Angle-variables. Factor-variables typically influence the winning or losing of the race regardless. Angle-variables might or might not influence the winning or losing of the race and might only occur once in a horse’s racing career.

Additionally, my effort led me to develop a logarithmic predictive curve which I wrote an equation for to allow me to integrate the parametric factor variables and use the angle-variables when applicable as additives.

For example, pace is a factor-variable. To win a race, all else being equal, a horse must be able to negotiate the pace. On the other hand, an equipment change can be an angle-variable; a trainer might add blinkers to a horse and it runs off and leaves the field in its wake, all else being equal.
The list of factor-variables is very short when compared to the list of angle-variables.

I hope you don’t take this as a criticism, but just another point of view and I am anxiously awaiting the conclusion to your method.

TrifectaMike
01-06-2011, 10:23 AM
Very interesting comment on the squared terms.

Interesting how? Is it because it seems on the surface to be a contradiction?

Mike

gm10
01-06-2011, 10:56 AM
Interesting how? Is it because it seems on the surface to be a contradiction?

Mike

I meant more your comment that they are a must, and that they diminish the effect of the factor 'at some level'. How should I interpret this?

To your list, I would also add the position of the temporary rail in turf races.

TrifectaMike
01-06-2011, 12:26 PM
I meant more your comment that they are a must, and that they diminish the effect of the factor 'at some level'. How should I interpret this?

To your list, I would also add the position of the temporary rail in turf races.

I'll explain with an example.

Let's take one independent variable, age. If the regression is based only on age, we know horses get faster as they get older. But when does this age factor become less and less important? Does a horse continue to get faster indefinitely? Adding a quadratic term attenuates this factor.

How?

The linear term coefficient would have a negative value and the quadratic term would have a positive value. Now depending on the magnitude of the respective coefficients, there is an age where the effect on the dependent variable (expected time) no longer decreases with age, but instead increases with age.

Mike
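
Put as arithmetic: with expected time = b0 + b1*age + b2*age^2, the turning point sits at age = -b1 / (2*b2). A quick check in R with hypothetical coefficients of the signs described above:

b1 <- -5.0       # linear age coefficient (negative)
b2 <-  0.5       # squared age coefficient (positive)
-b1 / (2 * b2)   # age at which expected time stops falling and starts rising: 5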

gm10
01-10-2011, 08:19 PM
I'll explain with an example.

Let's take one independent variable, age. If the regression is based only on age, we know horse's get faster as they get older But when does this age factor become less and less important? Does a horse continue to be faster indefinitely?By adding a quadratic term it attenuates this factor.

How?

The linear term coefficient would have a negative value and the quadratic term would have a positive value. Now depending on the magnitude of the respective coefficients, there is an age where the effect on the dependent variable (expected time) no longer decreases with age, but instead increases with age.

Mike


Would you do the same with 'total number of races so far' (as mentioned a few pages ago)?

teddy
01-13-2011, 12:24 PM
No one seems to be posting... I was waiting to see how you used it in a real race scenario. Are you looking to find horses that are predicted to run a faster time than the race is supposed to produce -- a horse faster than par?

Robert Goren
01-13-2011, 12:30 PM
He has found something too good to share.;)

Capper Al
01-13-2011, 08:06 PM
Here's the problem with the Chi Square test, or any statistical method, in handicapping: they are not for coming up with a selection method. Stats depend on a premise of prior knowledge that can be accurately measured, like a die with six sides. Horse racing rarely gives such a clean and obvious scenario. However, once a capper creates a system, they can use statistics to rate its success because it can be measured; it doesn't work the other way around, with the stats determining the handicapping system.

Cratos
01-14-2011, 06:35 PM
Here's the problem with the Chi test or any statistical method in handicapping. They are not for coming up with a selection method. Stats depend on a premise of prior knowledge that can be accurately measured like a die with six sides. Horse racing rarely gets such a clean and obvious scenario. Although, once a capper creates a system then they can use the statistics to rate its success because it can be measured, not the other way around where the stats determine the handicapping system.

You are correct; see post #161 in this thread

TrifectaMike
01-14-2011, 07:01 PM
Here's the problem with the Chi test or any statistical method in handicapping. They are not for coming up with a selection method. Stats depend on a premise of prior knowledge that can be accurately measured like a die with six sides. Horse racing rarely gets such a clean and obvious scenario. Although, once a capper creates a system then they can use the statistics to rate its success because it can be measured, not the other way around where the stats determine the handicapping system.

You are absolutely correct. I was perpetuating a fraud. I could not develop a method using statistics. Chi-Square distribution can not be used to determine statistical significance. It can only be used for "goodness of fit". I'm such a fraud. Even worse, I need to learn the real meaning of a distribution.

I must admit. I was not well prepared for the level of expertise in the subject of statistical analysis as exhibited by some posters.

I find myself unable with my limited knowledge in probability, number theory and logic to refute the claims made by the statisticians in this forum.

Chi-Square I hate you. You have exposed me. You will not be invited to Christmas dinner at my house.

Mike

Tom
01-14-2011, 07:17 PM
Which one of the geniuses will now step up and continue? :rolleyes:

Dave Schwartz
01-14-2011, 07:23 PM
Mike,

You can always come to dinner at MY house.


Dave

Cratos
01-14-2011, 07:40 PM
You are absolutely correct. I was perpetuating a fraud. I could not develop a method using statistics. Chi-Square distribution can not be used to determine statistical significance. It can only be used for "goodness of fit". I'm such a fraud. Even worse, I need to learn the real meaning of a distribution.

I must admit. I was not well prepared for the level of expertise in the subject of statistical analysis as exhibited by some posters.

I find myself unable with my limited knowledge in probability, number theory and logic to refute the claims made by the statisticians in this forum.

Chi-Square I hate you. You have exposed me. You will not be invited to Christmas dinner at my house.


Mike


Sorry Mike, I don't believe your last post because from reading some of your prior posts on this forum, it is clear to me that you are well learned and experienced in the theory and application of high-level math and statistics.

raybo
01-14-2011, 08:06 PM
Sorry Mike, I don't believe your last post because from reading some of your prior posts on this forum, it is clear to me that you are well learned and experienced in the theory and application of high-level math and statistics.

I think he was being sarcastic.

PaceAdvantage
01-14-2011, 10:33 PM
Mike may have the tools and the smarts, but apparently, he doesn't have the personality required to conduct a thread like this, which is a shame. He takes any comment that even hints at disagreement as a personal affront, and we get into that "take my ball and go home" mode...

Actor
01-15-2011, 02:20 AM
Mike may have the tools and the smarts, but apparently, he doesn't have the personality required to conduct a thread like this, which is a shame. He takes any comment that even hints at disagreement as a personal affront, and we get into that "take my ball and go home" mode...
Does this mean the thread is dead?

Seems to me that a lot of posts are off topic, at least as far as chi squared is concerned. Posts about R squared and logistic regression are over my head.

That said, I still have questions. Is anyone willing/able to respond?

CBedo
01-15-2011, 02:33 AM
Does this mean the thread is dead?

Seems to me that a lot of posts are off topic, at least as far as chi squared is concerned. Posts about R squared and logistic regression are over my head.

That said, I still have questions. Is anyone willing/able to respond? I'm sure someone will respond. The question is a) will it actually answer your question, and b) will it be understandable. :rolleyes:

Capper Al
01-15-2011, 06:51 AM
You are absolutely correct. I was perpetuating a fraud. I could not develop a method using statistics. Chi-Square distribution can not be used to determine statistical significance. It can only be used for "goodness of fit". I'm such a fraud. Even worse, I need to learn the real meaning of a distribution.

I must admit. I was not well prepared for the level of expertise in the subject of statistical analysis as exhibited by some posters.

I find myself unable with my limited knowledge in probability, number theory and logic to refute the claims made by the statisticians in this forum.

Chi-Square I hate you. You have exposed me. You will not be invited to Christmas dinner at my house.

Mike

Mike,

You read me all wrong. I'm glad you started this thread. It has good info. When discussing handicapping, we need to focus on the strengths and weaknesses of an approach. You are not a fraud. Believe me, I know forums can be a tough place for even the best-intentioned poster. Once I finally settle on a system, I will do a goodness-of-fit test using your methods. You can have dinner at my house anytime too.

raybo
01-15-2011, 08:22 AM
I'm sure someone will respond. The question is a) will it actually answer your question, and b) will it be understandable. :rolleyes:

Yeah, is there a "Chi Squared For Dummies" book?

GameTheory
01-15-2011, 09:54 AM
Yeah, is there a "Chi Squared For Dummies" book? Yeah, cept it will be a chapter in the "Statistics for Dummies" book.

Dave Schwartz
01-15-2011, 11:53 AM
GT,

I actually have that book as well as the second one in the series. They're very good for the likes of the statistically uneducated (which would be me).

http://www.amazon.com/Statistics-Dummies-Deborah-Rumsey/dp/0764554239

Also, Intermediate Statistics for Dummies
http://www.amazon.com/Intermediate-Statistics-Dummies-Deborah-Rumsey/dp/0470045205/ref=pd_sim_b_13




Dave

Actor
01-15-2011, 09:07 PM
I've been going over a half dozen statistics books from my local public library. One thing they have in common is that they don't devote many pages to the chi-square statistic. One doesn't even discuss the chi-square test at all, although it does discuss the chi distribution.

Please bear with me as I work through an example from one book as I understand it.

495 people are each given one of four cold-preventing medicines, with these results. The book calls this a contingency table.

Medicine 1 Medicine 2 Medicine 3 Medicine 4 Total

How many
got colds 15 26 9 14 64

How many
did not 111 107 96 117 431


Total 126 133 105 131 495


Another textbook adds a column for what it calls the "magic number." The magic number for each row is the total for that row divided by the total of all rows.


Medicine 1 Medicine 2 Medicine 3 Medicine 4 Total Magic #

How many
got colds 15 26 9 14 64 0.129

How many
did not 111 107 96 117 431 0.871


Total 126 133 105 131 495 1.000


The expected values are then added to the contingency table. Expected values are obtained by multiplying the magic number for that row by the total for that column.


Medicine 1 Medicine 2 Medicine 3 Medicine 4

How many
got colds
observed 15 26 9 14
expected 16.29 17.20 13.58 16.94

How many
did not
observed 111 107 96 117
expected 109.71 115.80 91.42 114.06


Now we compute the value of (Obs - Exp)^2/Exp for each cell.


Medicine 1 Medicine 2 Medicine 3 Medicine 4

How many
got colds 0.1023 4.5075 1.5423 0.5094

How many
did not 0.0152 0.6663 0.2290 0.0756


Adding up all the cells, the chi-squared statistic = 7.6507. Note: the author of the book got 7.666, a difference of about 0.015. I think this is due to the author doing the calculations by hand, whereas I used Excel.

If you're still with me let's apply the same technique to the data from Post #70.

Here's the contingency table complete with magic numbers:


1-14 15-30 Over 30 Total

Win 95 103 67 265 10.54%

Lose 805 877 568 2250 89.46%

Total 900 980 635 2515 100.00%


Expected values

1-14 15-30 Over 30

Win
Observed 95 103 67
Expected 94.831 103.260 66.909

Lose
Observed 805 877 568
Expected 805.169 876.740 568.091


Chi values for each cell

1-14 15-30 Over 30

Win 0.0003 0.0007 0.0001

Lose 0.0000 0.0001 0.0000

Chi-squared statistic 0.0012


This is considerably different from the value of 4.547 arrived at in post #70. I'll get to that shortly.

The extremely low value of 0.0012 makes sense. Consider the group 1-14. There are 900 such horses, 95 of them (10.56%) winners. From the entire population of 2515 horses we have 265 (10.54%) winners. Results for the other two groups are similar, implying that days since last race is not a significant factor, i.e., the low chi-squared statistic makes sense.


If you still believe that the correct value of the chi-square statistic is 4.547 I'd appreciate you pointing out the error in my calculations.
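(For anyone who wants to check the arithmetic by machine, here is a minimal sketch of the same expected-value and chi-square calculation in plain Python. The counts are the cold-medicine numbers from the book example above; the code is only an illustration, not part of anyone's handicapping method.)

# Expected count for a cell = row total * column total / grand total,
# which is the same as the row's "magic number" times the column total.
observed = [
    [15, 26, 9, 14],       # got colds
    [111, 107, 96, 117],   # did not
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = float(sum(row_totals))

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (obs - expected) ** 2 / expected

print(round(chi_square, 4))   # about 7.6507, matching the hand calculation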

ezpace
01-15-2011, 10:48 PM
Post 1 thru 182 thx for the read.... I think lol

PHEWWWWWWwwwwwwwheavy!!!!!!!!!!!!!!!!!!

thanks TrifectaMike

glad to see the monkey also whether he cares or not ; )

and still can't believe OVERLAY never posted once on the subject hmmmmm

snoadog
01-16-2011, 12:35 AM
Just to make sure this horse is dead:

I was monkeying around in excel and found the CHITEST function.
It is a measure of how certain you can be of the win (loss) rate's independence from days since the last race, as I understand it. In Actor's example above, this value is 0.9994.

Stated differently: if we assume the win/loss rate is independent of days since the last race, we can be 99.94% sure that a chi-square statistic at least as high as 0.0012 would have occurred by random chance.

......All other things being equal. :)
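(To reproduce that number outside of Excel, here is a small sketch. It assumes scipy is available; chi2.sf is the upper-tail probability of the chi-square distribution, which is the same quantity CHITEST reports.)

# p-value for snoadog's example: the probability, under independence, of a
# chi-square statistic at least as large as the one observed.
from scipy.stats import chi2

chi_stat = 0.0012   # statistic from the (mistaken) table above
df = 2              # (rows - 1) * (columns - 1) = 1 * 2

print(round(chi2.sf(chi_stat, df), 4))   # about 0.9994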

CBedo
01-16-2011, 03:22 AM
If you're still with me let's apply the same technique to the data from Post #70.

Here's the contingency table complete with magic numbers:


1-14 15-30 Over 30 Total

Win 95 103 67 265 10.54%

Lose 805 877 568 2250 89.46%

Total 900 980 635 2515 100.00%


Expected values

1-14 15-30 Over 30

Win
Observed 95 103 67
Expected 94.831 103.260 66.909

Lose
Observed 805 877 568
Expected 805.169 876.740 568.091


Chi values for each cell

1-14 15-30 Over 30

Win 0.0003 0.0007 0.0001

Lose 0.0000 0.0001 0.0000

Chi-squared statistic 0.0012


This is considerably different from the value of 4.547 arrived at in post #70. I'll get to that shortly.

The extremely low value of 0.0012 makes sense. Consider the group 1-14. There are 900 such horses, 95 of them (10.56%) winners. From the entire population of 2515 horses we have 265 (10.54%) winners. Results for the other two groups are similar, implying that days since last race is not a significant factor, i.e., the low chi-squared statistic makes sense.


If you still believe that the correct value of the chi-square statistic is 4.547 I'd appreciate you pointing out the error in my calculations. You used the expected values instead of the actual values and then compared them to expected values. Your methodology looks fine; it's your data that's bad--garbage in => garbage out! :p

The actual data was 1-14: 110 wins, 15-30: 90 wins & 30+: 65 wins. Try that and see what you get.

Actor
01-16-2011, 01:24 PM
You used the expected values instead of the actual values and then compared them to expected values. Your methodology looks fine; it's your data that's bad--garbage in => garbage out! :p

The actual data was 1-14: 110 wins, 15-30: 90 wins & 30+: 65 wins. Try that and see what you get.
Thank you, CBedo. I looked at this thing for hours trying to reconcile the two results. It's not the first time in my life that the final answer turned out to be something dumb.

With the input corrected I get 4.6765, close enough to 4.547. Again, I think the difference is due to roundoff error.

Thanks again. Now on to other things.
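(As a cross-check only, the corrected table can also be run through scipy, which does the same expected-value arithmetic internally; the library call is an assumption for illustration, not part of the thread's method.)

# Cross-check of the corrected post-#70 table.
from scipy.stats import chi2_contingency

observed = [
    [110, 90, 65],    # wins:  1-14, 15-30, over 30 days
    [790, 890, 570],  # losses
]

chi_stat, p_value, dof, expected = chi2_contingency(observed)
print(round(chi_stat, 3), round(p_value, 3), dof)   # roughly 4.68, p about 0.10, dof 2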

Actor
01-17-2011, 06:45 PM
Please bear with me as I summarize some of what has been presented so far. This is my understanding of the material and is leading up to a question.

First, from post #33, here is the Chi-Square Distribution Table

Df\Prob .250 .100 .050 .025 .010 .005
1 1.323 2.705 3.841 5.023 6.634 7.879
2 2.772 4.605 5.991 7.377 9.210 10.596
3 4.108 6.251 7.814 9.348 11.344 12.838
4 5.385 7.779 9.487 11.143 13.276 14.860
5 6.625 9.236 11.070 12.832 15.086 16.749

6 7.840 10.644 12.591 14.449 16.811 18.547
7 9.037 12.017 14.067 16.012 18.475 20.277
8 10.218 13.361 15.507 17.534 20.090 21.954
9 11.388 14.683 16.918 19.022 21.665 23.589
10 12.548 15.987 18.307 20.483 23.209 25.188

The column down the left side of the Chi-Square Distribution Table is the "degrees of freedom," or DF. I don't think the term "degrees of freedom" has been used in this thread thus far.


Second, here is the (corrected) contingency table from post #182:

1-14 15-30 Over 30

Win 110 90 65

Lose 790 890 570


What are degrees of freedom? Thus far all posters have treated it as the number of columns in the contingency table minus one, and so far that's been correct. However, all textbooks I have read define degrees of freedom thus:

DF = (#col - 1)(#row - 1)

where #col and #row are the number of columns and rows respectively in the contingency table. With one exception, all examples have had only two rows, WIN and LOSE. Whenever the contingency table has two rows, the degrees of freedom is indeed the number of columns minus one. DF for the above contingency table is 2.

However, consider this contingency table.

1-14 15-30 Over 30

Win 110 90 65

Place 90 80 50

Show 70 60 45

Lose 630 750 475


There are 3 columns and 4 rows. For this problem DF = (3 - 1)(4 - 1) = 6.

Whether we'll need to worry about contingency tables with more than two rows remains to be seen. If we get into exotics I suspect we will, but that's beyond what I'm trying to address here.
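(A quick check of the (#col - 1)(#row - 1) rule on the win/place/show/lose table above, assuming scipy is available; scipy reports the degrees of freedom along with the statistic. This is only an illustrative sketch.)

# DF check for the 4-row, 3-column table above.
from scipy.stats import chi2_contingency

observed = [
    [110, 90, 65],    # win
    [90, 80, 50],     # place
    [70, 60, 45],     # show
    [630, 750, 475],  # lose
]

chi_stat, p_value, dof, expected = chi2_contingency(observed)
print(dof)   # 6, i.e. (4 - 1) * (3 - 1)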


Now consider the "rolling a die" problem of post #27. The contingency table is:

The number of 1's 2's 3's 4's 5's 6's

Observed 6 12 5 8 13 16 Total 60


There's only one row of data so DF should be zero. However, the poster uses DF = 5. Most textbooks treat this same problem and they also use DF = 5.


So here's my question. Why is the DF for a one dimensional contingency table = #col - 1 instead of zero?

I don't think this is trivial. Dr. Quirin presents a rather complex one dimensional problem which I'd like to understand.

Only one of the texts I've been reading hints at an explanation to this paradox, and I'm uncertain that I understand it correctly. I'll post that after giving someone a chance to respond. I sincerely hope someone does.

snoadog
01-18-2011, 05:01 PM
Here's my layman's explanation.

DF is equal to the least number of cells in the table you must know before you can deduce what the remaining cell values must be. In the win/place/show table you already know the totals of the rows and columns, so if you know 6 of the interior cells you can deduce the remaining cells - 6 DF. In the die table you know the row total, so you must know 5 of the cells to figure out the remaining cell's value - 5 DF.
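(snoadog's counting argument is exactly what the one-sample goodness-of-fit routine does. Here is a minimal sketch with the die counts from post #27, assuming scipy is available; chisquare defaults to equal expected counts and uses df = k - 1.)

# Goodness-of-fit for the die example: only the grand total (60 rolls) is
# fixed, so knowing five cell counts pins down the sixth -- hence df = 5.
from scipy.stats import chisquare

observed = [6, 12, 5, 8, 13, 16]   # counts of 1's through 6's in 60 rolls
result = chisquare(observed)        # expected defaults to 10 per face
print(round(result.statistic, 1), round(result.pvalue, 3))   # 9.4 and about 0.094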

Cratos
01-21-2011, 08:39 PM
This should have been an excellent thread about the development of a handicapping methodology and statistically testing it for "goodness of fit" using Chi-Square, but the thread drifted off into a statistics primer.