Using Chi Square Statistic To Produce A Handicapping Method - Page 7 - Horse Racing Forum - PaceAdvantage.Com

Actor · 12-31-2010, 11:04 AM

Quote:

Originally Posted by TrifectaMike

Where do you guys come up with this stuff? Quirin's standard normal?

Mike

Winning at the Races: Computer Discoveries in Thoroughbred Handicapping By William L. Quirin, Ph.D. 1979, page 297.

TrifectaMike · 12-31-2010, 11:41 AM

I'm not continuing with this thread. If I've insulted anyone, I apologize.

I tend to say too much.

Happy New Year to all.

Mike

Dave Schwartz · 12-31-2010, 11:54 AM

Personally, I am getting a lot out of this thread. I'd hate to see it discontinued.

May I make a suggestion?

Mike is a guy that offered to lead a class. Please, let him do just that. If you wish to take issue with his approach, please do it AFTER he has finished.

At that point you can tell him why YOUR way is better.

Happy New Year to all.

Dave Schwartz

PaceAdvantage · 12-31-2010, 12:19 PM

Quote:

Originally Posted by TrifectaMike

I'm not continuing with this thread. If I've insulted anyone, I apologize.

I tend to say too much.

Happy New Year to all.

Mike

What did I miss? I've been watching this thread like a hawk, making sure it doesn't get sidetracked. Tell me how to fix it so that it may continue.

CBedo · 12-31-2010, 12:21 PM

Quote:

Originally Posted by Dave Schwartz

Personally, I am getting a lot out of this thread. I'd hate to see it discontinued.

May I make a suggestion?

Mike is a guy that offered to lead a class. Please, let him do just that. If you wish to take issue with his approach, please do it AFTER he has finished.

At that point you can tell him why YOUR way is better.

Happy New Year to all.

Dave Schwartz

I agree with Dave. I hope Mike will continue. Why can't everyone set the factors aside (for now) and focus on the development of the methodology?

raybo · 12-31-2010, 12:30 PM

I, too, would like to see Mike continue with his teachings. Although I'm not a "statistics" guy, I'm sure there will be things to learn here.

Maybe if we just went along with Mike's theories, withholding personal beliefs for the time being, there will come a point when the value/viability of the Chi Square statistic will become visible.

CBedo · 12-31-2010, 12:32 PM

Quote:

Originally Posted by Actor

I've been going over this with my college statistics textbook by my side and, as best as I can tell, you do not seem to be calculating the Chi-Square Value correctly.

I think you're misreading your stats book (or the book is bad, lol). If you don't think Mike is calculating it correctly, try putting the data into a statistical package, or even Excel, and doing the calculation.

From R:

Code:

> recency
      Wins Losses
1-14   110    790
15-30   90    890
31+     65    570
> chisq.test(recency)

	Pearson's Chi-squared test

data:  recency 
X-squared = 4.6765, df = 2, p-value = 0.0965

Looks dead on in his analysis to me (numbers slightly different due to rounding probably).
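For anyone without R handy, the same Pearson chi-square can be checked by hand. What follows is only a minimal sketch in plain Python using the counts from the table above; the function name `pearson_chi_square` is made up for illustration, and the p-value line relies on the fact that with 2 degrees of freedom the chi-square upper tail is exactly exp(-x/2).

```python
import math

# Wins and losses by days since last race, copied from the R table above.
recency = {
    "1-14": (110, 790),
    "15-30": (90, 890),
    "31+": (65, 570),
}

def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            # Expected count if win rate did not depend on recency.
            expected = row_totals[r] * col_totals[c] / grand_total
            stat += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df

chi2, df = pearson_chi_square(recency)
# Closed form for the upper tail, valid only because df = 2 here.
p_value = math.exp(-chi2 / 2)
print(f"X-squared = {chi2:.4f}, df = {df}, p-value = {p_value:.4f}")
```

The result should agree with the `chisq.test` output to about three decimal places.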

TrifectaMike · 12-31-2010, 12:36 PM

I made a bold statement and I should not have tried to impose on
Arkansasman. So, I've run a Logistic regression using the last four
Speed Ratings.

Here's an explanation of the independent variables (regressors)

Variable 1 Last Speed Rating
Variable 2 Second Speed Rating Back
Variable 3 Third Speed Rating Back
Variable 4 Fourth Speed Rating Back

How I set up the data:
In each case (last speed rating, etc.) I determine the median rating
for each race. Then I determine how each horse's rating differs from the
median. This measures each horse's strength within its race and
also allows the data to be used across all races.
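The median-centering step described above might look something like the sketch below. The race names, horse names, and ratings are invented purely for illustration, and `center_on_race_median` is a hypothetical helper, not part of the home-grown software.

```python
from statistics import median

# Hypothetical last-race speed ratings for two races (made-up numbers).
races = {
    "race_1": {"A": 82, "B": 75, "C": 90, "D": 78},
    "race_2": {"E": 60, "F": 71, "G": 66},
}

def center_on_race_median(ratings):
    """Express each horse's rating as its difference from the race median."""
    mid = median(ratings.values())
    return {horse: rating - mid for horse, rating in ratings.items()}

# After centering, a value of +10 means "10 points above the typical horse
# in this race", so rows from different races can be pooled in one data set.
centered = {race: center_on_race_median(r) for race, r in races.items()}
print(centered)
```

The same transformation would be applied separately to each of the four rating columns.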

Running iteration number 1
Running iteration number 2
Running iteration number 3
Running iteration number 4
Running iteration number 5
Running iteration number 6
Running iteration number 7

The process converged after 7 iterations

The software I use is home grown.

Descriptives.......
154 cases have Y = 1 1236 cases have Y = 0

-2 log likelihood = 967.9034 (Null Model)
-2 log likelihood = 887.8558 (Full Model)

Overall Model Fit...
Chi Square = 80.0476 df = 4 p = 0.0000
R Square = 0.0827

Akaike's Information Criterion = 897.8558
Bayesian Information Criterion = 895.9030

Coefficients and Standard Errors...
 Variable  Coefficient  Standard Error    prob
        1       0.0788          0.0142  0.0000
        2       0.0050          0.0102  0.6216
        3       0.0428          0.0123  0.0005
        4       0.0243          0.0112  0.0298
Intercept      -2.1564

Odds Ratios and 95% Confidence Intervals...
 Variable  Odds Ratio     Low    High
        1      1.0819  1.0522  1.1125
        2      1.0050  0.9852  1.0253
        3      1.0438  1.0189  1.0692
        4      1.0246  1.0024  1.0474
input data record?
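The overall fit numbers in the output above can be reproduced from the two -2 log likelihood values alone. The sketch below assumes that "Chi Square" is the likelihood-ratio statistic, that "R Square" is a McFadden-style pseudo R-squared, and that AIC counts five parameters (four ratings plus the intercept); those readings match the printed figures but are not stated in the post. The BIC line is left out because the post does not say which definition the software uses.

```python
import math

neg2ll_null = 967.9034   # -2 log likelihood, intercept only
neg2ll_full = 887.8558   # -2 log likelihood, four ratings plus intercept
n_predictors = 4

# Likelihood-ratio chi-square: the drop in -2 log likelihood.
lr_chi_square = neg2ll_null - neg2ll_full

# Upper-tail probability; for df = 4 it has the closed form exp(-x/2)(1 + x/2).
half = lr_chi_square / 2
p_value = math.exp(-half) * (1 + half)

# McFadden-style pseudo R-squared (assumed definition).
pseudo_r2 = 1 - neg2ll_full / neg2ll_null

# AIC = -2 log likelihood + 2 * (number of estimated parameters).
aic = neg2ll_full + 2 * (n_predictors + 1)

print(f"Chi Square = {lr_chi_square:.4f}  df = {n_predictors}  p = {p_value:.4f}")
print(f"R Square = {pseudo_r2:.4f}")
print(f"AIC = {aic:.4f}")
```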

Let me show you the important numbers.

Variable 2 (Second race back rating)
prob = .6216
That is not good!!!!!

Coefficient 0.0050
That is not good!!!

Odds Ratio 1.0050
That is not good!!!!

Mike
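The three numbers Mike singles out for Variable 2 are tied together, which a short calculation makes visible. This is a sketch only: it assumes "prob" is a two-sided Wald test (coefficient divided by standard error, compared with a standard normal) and that the intervals use 1.96 standard errors. The helper name `wald_summary` is made up. Small differences from the printed output are expected because the coefficients are shown to only four decimals.

```python
import math

# (coefficient, standard error) as printed in the output above.
coefficients = {
    1: (0.0788, 0.0142),
    2: (0.0050, 0.0102),
    3: (0.0428, 0.0123),
    4: (0.0243, 0.0112),
}

def wald_summary(beta, se, z_crit=1.96):
    """Wald z, two-sided p, odds ratio, and 95% interval for one coefficient."""
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard normal tail
    return {
        "z": z,
        "p": p,
        "odds_ratio": math.exp(beta),
        "low": math.exp(beta - z_crit * se),
        "high": math.exp(beta + z_crit * se),
    }

summary = {var: wald_summary(b, se) for var, (b, se) in coefficients.items()}
for var, s in summary.items():
    print(var, {k: round(v, 4) for k, v in s.items()})
```

A coefficient near zero, an odds ratio near 1, and an interval that straddles 1 are three views of the same result, so Variable 2 fails on all of them together.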

garyoz · 12-31-2010, 01:49 PM

I don't want to sound negative or discouraging and I hope this is helpful.

You can't run such highly correlated variables as a multiple regression model. Check out a correlation matrix for the variables--if you have values higher than .7 or so you run into multicollinearity, which leads to an unstable model. Instability is tied to the correlation between error terms (regression assumes a normal distribution for error terms--which the correlation violates). Also, if you are running a stepwise regression, the first variable "sucks up" all the variance for explanation and doesn't leave enough for the following variables to be associated with. (Not a very technical explanation.)

You can run them individually as bivariate regressions (one regressor and the logit probability density function--just an S-shaped curve--as the dependent variable) and compare each of those models. So you can run a model for the speed figure one race back, then one for two races back, etc., and compare them.

See which one gives you the best fit. That's assuming that is what you want to do. I think what you are trying to do is use the probability of winning as the dependent variable and speed figure as the independent variable.
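The correlation-matrix check suggested above could be sketched as follows. The ratings are made up for illustration (they are not data from this thread), the column names are hypothetical, and the 0.7 cutoff is simply the rule of thumb mentioned in the post.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up median-centered ratings for six horses.
ratings = {
    "last": [2, -5, 10, -2, 4, -7],
    "second_back": [1, -4, 8, 0, 5, -6],
    "third_back": [-3, 2, 6, -1, 0, -4],
}

names = list(ratings)
matrix = {(a, b): pearson(ratings[a], ratings[b]) for a in names for b in names}

# Flag each pair of regressors whose correlation exceeds the rule of thumb.
threshold = 0.7
flagged = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(matrix[(a, b)]) > threshold
]
print(flagged)
```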

TrifectaMike · 12-31-2010, 02:14 PM

Quote:

Originally Posted by garyoz

I don't want to sound negative or discouraging and I hope this is helpful.

You can't run such highly correlated variables as a multiple regression model. Check out a correlation matrix for the variables--if you have values higher than .7 or so you run into multicollinearity, which leads to an unstable model. Instability is tied to the correlation between error terms (regression assumes a normal distribution for error terms--which the correlation violates). Also, if you are running a stepwise regression, the first variable "sucks up" all the variance for explanation and doesn't leave enough for the following variables to be associated with. (Not a very technical explanation.)

You can run them individually as bivariate regressions (one regressor and the logit probability density function--just an S-shaped curve--as the dependent variable) and compare each of those models. So you can run a model for the speed figure one race back, then one for two races back, etc., and compare them.

See which one gives you the best fit. That's assuming that is what you want to do. I think what you are trying to do is use the probability of winning as the dependent variable and speed figure as the independent variable.

You're not telling me anything I don't already know. Without getting too technical, it DIDN'T reduce the significance of the 3rd and 4th variables. I stand by these results.

Mike

Capper Al · 12-31-2010, 04:24 PM

Go for it TrifectaMike. I'm interested in your take on handicapping using the stats. Let's see how your system works. Thanks in advance for doing this.

Native Texan III · 12-31-2010, 08:20 PM

Quote:

Originally Posted by TrifectaMike

It's my understanding that estimation is based on Maximum likelihood...finding those coefficients that have the greatest likelihood of producing the observed data. In practice, I would assume that means maximizing the log likelihood function (the objective function).

Hey, that is just my understanding. I could be wrong.

But I ask once again,

"Aside from how one would determine the independent variables, I have another question. How does one test for goodness of fit of a logistic regression?"

Mike

That is what I posted.
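The maximum-likelihood idea in the quoted post, along with one common answer to the goodness-of-fit question, can be sketched in a few lines. This is an illustration under stated assumptions and not the software used in this thread: it fits a single predictor by Newton-Raphson on made-up data, then compares the fitted log likelihood with that of an intercept-only model. All names and data here are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Log likelihood of the observed 0/1 outcomes under the logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit_logistic(xs, ys, max_iter=25, tol=1e-8):
    """Maximize the log likelihood by Newton-Raphson (intercept plus one slope)."""
    b0 = b1 = 0.0
    for iteration in range(1, max_iter + 1):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient with respect to the intercept
            g1 += (y - p) * x      # gradient with respect to the slope
            h00 += w               # entries of the information matrix X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) < tol and abs(step1) < tol:
            return b0, b1, iteration
    raise RuntimeError("Newton-Raphson did not converge")

# Made-up data: x is a median-centered rating, y is 1 for a win, 0 otherwise.
xs = [-6, -4, -2, -1, 0, 1, 2, 3, 5, 7]
ys = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

b0, b1, iterations = fit_logistic(xs, ys)
ll_full = log_likelihood(b0, b1, xs, ys)

# Intercept-only model: every horse gets the overall win rate.
base_rate = sum(ys) / len(ys)
ll_null = log_likelihood(math.log(base_rate / (1 - base_rate)), 0.0, xs, ys)

# Likelihood-ratio statistic, compared against chi-square with 1 df here.
lr_chi_square = 2 * (ll_full - ll_null)
print(f"converged after {iterations} iterations, LR chi-square = {lr_chi_square:.4f}")
```

The likelihood-ratio statistic is only one way to judge fit; it tests whether the predictors improve on the intercept-only model, not whether the predicted probabilities are well calibrated.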

gm10 · 12-31-2010, 09:42 PM

Quote:

Originally Posted by TrifectaMike

Let me give you a specific example (arkansasman, try this out with your model). When looking at a horse's last four Beyers or BRIS numbers, etc., you would think that the last is more significant than the previous and so on. The data does not support the premise.

I have found using numerous tests that the Beyer (or any other similar rating) from the second race back is insignificant and adds no usable information. The third is significant as well as the fourth, but NOT the second race back. Interesting, I would say.

Mike

How did you test that? Y ~ f(X)
What was Y, and what was the shape of f() (linear, for example)?

Multi-collinearity was my first thought as well - if you were using linear regression, it sounds likely that this would be a problem. Have you tested the significance without including the most recent Beyer in your X vector?

gm10 · 12-31-2010, 10:06 PM

Quote:

Originally Posted by TrifectaMike

I made a bold statement and I should not have tried to impose on
Arkansasman. So, I've run a Logistic regression using the last four
Speed Ratings.

Here's an explanation of the independent variables (regressors)

Variable 1 Last Speed Rating
Variable 2 Second Speed Rating Back
Variable 3 Third Speed Rating Back
Variable 4 Fourth Speed Rating Back

How I set up the data:
In each case (last speed rating, etc.) I determine the median rating
for each race. Then I determine how each horse's rating differs from the
median. This measures each horse's strength within its race and
also allows the data to be used across all races.

Running iteration number 1
Running iteration number 2
Running iteration number 3
Running iteration number 4
Running iteration number 5
Running iteration number 6
Running iteration number 7

The process converged after 7 iterations

The software I use is home grown.

Descriptives.......
154 cases have Y = 1 1236 cases have Y = 0

-2 log likelihood = 967.9034 (Null Model)
-2 log likelihood = 887.8558 (Full Model)

Overall Model Fit...
Chi Square = 80.0476 df = 4 p = 0.0000
R Square = 0.0827

Akaike's Information Criterion = 897.8558
Bayesian Information Criterion = 895.9030

Coefficients and Standard Errors...
 Variable  Coefficient  Standard Error    prob
        1       0.0788          0.0142  0.0000
        2       0.0050          0.0102  0.6216
        3       0.0428          0.0123  0.0005
        4       0.0243          0.0112  0.0298
Intercept      -2.1564

Odds Ratios and 95% Confidence Intervals...
 Variable  Odds Ratio     Low    High
        1      1.0819  1.0522  1.1125
        2      1.0050  0.9852  1.0253
        3      1.0438  1.0189  1.0692
        4      1.0246  1.0024  1.0474
input data record?

Let me show you the important numbers.

Variable 2 (Second race back rating)
prob = .6216
That is not good!!!!!

Coefficient 0.0050
That is not good!!!

Odds Ratio 1.0050
That is not good!!!!

Mike

Interesting ... can you test without X1 (most recent)?

TrifectaMike · 12-31-2010, 11:51 PM

Quote:

Originally Posted by gm10

Interesting ... can you test without X1 (most recent)?

Code:

  Descriptives.......
  154 cases have Y = 1  1236 cases have Y = 0
  
  -2 log likelihood = 967.9034	(Null Model)
  -2 log likelihood = 926.0421	(Full Model)
  
  Overall Model Fit...
  Chi Square = 41.8613	df = 3  p = 0.0000
  R Square = 0.0432
  
  Akaike's Information Criterion = 934.0421
  Bayesian Information Criterion = 931.5873
  
  Coefficients and Standard Errors...
    Variable	 Coefficient  Standard Error	   prob
  		 1		  0.0181		  0.0104	 0.0813
  		 2		  0.0493		  0.0122	 0.0001
  		 3		  0.0282		  0.0110	 0.0103
  Intercept	   -2.0949
  
  Odds Ratios and 95% Confidence Intervals...
    Variable	  Odds Ratio			 Low	   High
  		 1		  1.0183		  0.9977	 1.0393
  		 2		  1.0506		  1.0257	 1.0760
  		 3		  1.0286		  1.0067	 1.0510
  input data record?
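Since both runs report the same 154 wins, 1236 losses, and null -2 log likelihood, the two models can be compared directly as nested models. The sketch below is not something posted in the thread; it assumes the second run simply drops the last speed rating, so that its Variables 1 to 3 are the second, third, and fourth ratings back. It also re-derives the second run's fit statistics under the same assumed definitions (likelihood-ratio chi-square, McFadden-style pseudo R-squared, AIC with the intercept counted).

```python
import math

neg2ll_null = 967.9034          # intercept only (same in both runs)
neg2ll_all_four = 887.8558      # four ratings, from the earlier post
neg2ll_without_last = 926.0421  # last rating removed, from the output above

# Likelihood-ratio test for dropping the last rating (1 degree of freedom).
lr_drop_last = neg2ll_without_last - neg2ll_all_four
# Chi-square upper tail for df = 1: P(X > x) = erfc(sqrt(x / 2)).
p_drop_last = math.erfc(math.sqrt(lr_drop_last / 2))

# Fit statistics for the reduced model, to compare with the printed output.
lr_reduced = neg2ll_null - neg2ll_without_last
pseudo_r2_reduced = 1 - neg2ll_without_last / neg2ll_null
aic_reduced = neg2ll_without_last + 2 * (3 + 1)

print(f"LR for dropping last rating = {lr_drop_last:.4f}, p = {p_drop_last:.2e}")
print(f"Reduced model: Chi Square = {lr_reduced:.4f}, "
      f"R Square = {pseudo_r2_reduced:.4f}, AIC = {aic_reduced:.4f}")
```

On these numbers the last rating carries a large share of the model's explanatory power, while the second rating back stays short of the usual 0.05 level (prob = 0.0813) even with the last rating removed.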