Assessing variable importance


mwilding1981
04-12-2009, 10:52 AM
I have been doing some searching but cannot find any discussions on the preferred methods of determining which variables are most important in explaining the underlying factors of a model. I have been playing with ways to do this and still have not found anything better than going through all possible combinations by hand. That is fine until you begin to have a large number of variables. Having said that, I would like to open the discussion on your preferred method for achieving this.

osophy_junkie
04-12-2009, 07:05 PM
A Genetic Algorithm picking the variables and just testing them in the model can work.

mwilding1981
04-12-2009, 08:23 PM
Do you mean using a GA to create a model, calculate the importance levels, and then use those?

I will try that. I have estimated the optimal number of variables that influence my underlying factors; it is now a question of finding the best ones for the job. Using a GA is not an approach I have tried, so I will give it a go.
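(For anyone who wants to try the GA route, here is a minimal sketch in Python of evolving binary include/exclude masks over the candidate variables. It assumes a NumPy feature matrix X and binary win/lose labels y are already loaded; the logistic-regression fitness score, population size, and mutation rate are illustrative assumptions, not anything prescribed in this thread.)

import random
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y):
    """Cross-validated accuracy of a model built on the masked-in columns."""
    cols = [i for i, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    return cross_val_score(LogisticRegression(max_iter=1000),
                           X[:, cols], y, cv=5).mean()

def evolve(X, y, pop_size=30, generations=40, mutation_rate=0.05):
    """Evolve binary include/exclude masks over the candidate variables."""
    n = X.shape[1]
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, X, y), reverse=True)
        parents = pop[: pop_size // 2]            # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n)          # single-point crossover
            children.append([bit ^ (random.random() < mutation_rate)
                             for bit in a[:cut] + b[cut:]])
        pop = parents + children
    return max(pop, key=lambda m: fitness(m, X, y))

(The 1 bits in the returned mask mark the variables the GA kept.)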

garyoz
04-13-2009, 09:11 AM
I have been doing some searching but cannot find any discussions on the preferred methods of determining which variables are most important in explaining the underlying factors of a model. I have been playing with ways to do this and still have not found anything better than going through all possible combinations by hand. That is fine until you begin to have a large number of variables. Having said that, I would like to open the discussion on your preferred method for achieving this.

Search under "regression" or "multiple regression analysis." Also try "logistic regression" or "logit." The subject has been much discussed on this board. The essential problem is multicollinearity in the predictor variables: what exactly are you measuring with any one variable? Personally, I'm burnt out on the subject.

mwilding1981
04-13-2009, 03:48 PM
I know about multiple regression and logistic regression, but I am not happy with them and am therefore looking for another way. I am not happy because they seem to me to be quite biased. I shall probably be steadily proved wrong here, but a regression is very specific to the data it is fitted on: if you run it on two different data sets you will get very different answers. I want something that is more stable. Even a neural network proves more stable; when you run it on completely different data sets you still get pretty much the same results for variable importance. I was just wondering if anyone else uses other methods.

sjk
04-13-2009, 04:42 PM
We used to discuss such things endlessly 3-4 years ago. I am sure that you could search through threads about computer handicapping and not run out of reading material for quite some time.

As to which variable is most important, I would say they all are (anything handicappers normally talk about and use in their process), and if you leave anything out of the picture you will seriously impair your results.

Jeff P
04-13-2009, 05:37 PM
I'll take a shot at (hopefully) sparking further discussion about factor selection and weighting.

In my world, a model represents an attempt to solve the overall handicapping puzzle. Factor weight within the model attempts to assign importance to the individual factor within the overall handicapping puzzle as you've defined it.

Many factors are closely related to each other. For example, it's been said that speed is class and class is speed. Consider... Fast horses run higher speed figures than slow horses do. Horses with higher class tend to run higher speed figures than lower class horses do. So while speed figures and class ratings might be two separate factors, crossover exists, and both measure the same thing to a degree.

Then there's the subject of causal factors. I've seen large data samples where flat betting to win on any starter whose name began with the letter "E" was at break even or better. Would I base a model on that? No. In my world that kind of sample result represents noise. I want the factors in my models to represent true cause and effect about what's happening out there on the track in front of me.

It's relatively easy to run a report for a given track-surface-distance-class level-field size and look at a list of factors sorted by win-bet ROI. And it's relatively easy to just "grab" the best performing factors and make them part of your model for that race type. But will that work going forward? Believe it or not, sometimes that approach actually DOES work. But many times no - often you get a complete reversal in the next sample of races. The trick, IMHO, is actually understanding WHY those factors are working (while others are not) in the first place. And that takes a fair degree of experience.

I've had the most success breaking out potential candidate factors for my models into general categories such as:

Early Pace Ability
Late Pace Ability
Pace Matchup
Ability from Speed Figures
Class
Form
Human Connections
Track Profile

Then, when "grabbing" factors for my model and assigning weight (importance) to them, I try to pick and weight them using a (hopefully) somewhat accurate assessment/understanding of how each of those categories is influencing the outcomes of races at whatever track-surface-distance-class level-field size race type I'm attempting to model.

For me, that type of approach works better than trying to grab and weight the current hot factors and hoping for the best.



-jp

.

mwilding1981
04-13-2009, 05:48 PM
Jeff, great post, thank you. I am definitely not wanting to use hot factors; I am looking at what makes the difference. The way I have defined it to myself is this: find the best way to weight the variables that influence the underlying factors the most. However many variables you have, there will only be a limited number of underlying factors that they are explaining. For example, the quality of the horse is likely to be one underlying factor that a number of variables may be explaining. That makes my goal to find the variables that define this best and weight them accordingly. I am currently using decision trees to do this, as they are very robust against incomplete datasets and noise.
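(A minimal sketch of reading variable importance out of a fitted tree with scikit-learn; the synthetic data and column names below are stand-ins for real past-performance variables.)

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
feature_names = ["speed_fig", "class_rating", "days_since_last", "jockey_win_pct"]
X = rng.normal(size=(5000, len(feature_names)))   # stand-in for real PP data
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=5000) > 1).astype(int)

tree = DecisionTreeClassifier(max_depth=6, min_samples_leaf=100, random_state=0)
tree.fit(X, y)

# feature_importances_ is the impurity reduction credited to each variable,
# normalized to sum to 1 -- a quick first cut at which variables matter
for i in np.argsort(tree.feature_importances_)[::-1]:
    print(f"{feature_names[i]:<18s} {tree.feature_importances_[i]:.3f}")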

garyoz
04-13-2009, 09:48 PM
Sounds like you guys are describing linear models with beta weights attached to the variables. Sounds like regression to me. Or am I missing something? You still have the same problem of collinearity - e.g. speed figures and class are highly correlated, and you are in effect double counting them if you include both in the same model. You may also end up with unstable models.
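(The double counting garyoz describes can be quantified. One standard diagnostic is the variance inflation factor; here is a rough NumPy-only sketch, where a value well above about 5 for a column suggests it is largely explained by the other variables.)

import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2) from
    regressing that column on all of the others."""
    out = []
    for i in range(X.shape[1]):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # intercept + the rest
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - ((y - A @ coef) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / max(1e-9, 1.0 - r2))
    return out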

Jeff P
04-14-2009, 02:03 AM
Not necessarily a bad thing... at least IMHO.

The key for me after creating models (any model, no matter what type) is testing the factor mix and weights across some pretty large data samples. I like to confront my models with fresh races not present in the development database used during model construction. Before accepting a model for live play I need to see win rate and ROI going forward stay (within limits) realistically close to the win rate and ROI obtained in the development database. In general I use data sets broken out by calendar quarter. I'll generally develop a model using 2-3 consecutive calendar quarters as the development database - and then test going forward one quarter at a time. When you see acceptable results quarter after quarter going forward... at some point you realize you have something dependable (stable) and it's time to promote the model from R&D status to live play, provided you have room for it.
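(Jeff's quarter-by-quarter forward test is essentially walk-forward validation. A sketch of the bookkeeping follows; fit_model and score_roi are placeholders for whatever model builder and ROI backtest you actually use.)

# Walk-forward testing by calendar quarter: develop on a trailing window,
# score the next unseen quarter, then roll forward one quarter.
def walk_forward(quarters, train_window=3):
    """quarters: list of (label, races) tuples in chronological order."""
    results = []
    for i in range(train_window, len(quarters)):
        dev = [r for _, races in quarters[i - train_window:i] for r in races]
        label, holdout = quarters[i]
        model = fit_model(dev)                              # placeholder model builder
        results.append((label, score_roi(model, holdout)))  # placeholder scorer
    return results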


-jp

.

Jeff P
04-14-2009, 02:48 AM
...You still have the same problem of collinearity - e.g. speed figures and class are highly correlated, and you are in effect double counting them if you include both in the same model...
I wanted to address this one sentence specifically because I think it makes an important point. Crossover (collinearity) CAN introduce instability into a model.

But I'll make the argument that proper testing can overcome that.

Let's say you have a factor in your model based on final-time speed figures from selected running lines in the past performance record of the horse being evaluated. Within the context of your model you think of this factor as representing a horse's ability from speed figures.

Let's also say that you have a second variable in your model. It's based on speed figures too. But instead of being based on the figures of the individual horse being evaluated, this factor looks at the speed figures of the horses that the horse being evaluated has faced in (each of, or selected) running lines from its past performance record. Within the context of your model you think of this factor as representing class.

Two different things?

Or is there some crossover going on?

I tend to think there's some crossover (collinearity) going on.

I can only speak for myself, but the way I handle it is to make my best educated guess as to the importance of each of the two factors. Then plug factor weights representing that importance into the model, TEST what you have, and record the results. Then re-weight one of the factors (either higher or lower) and re-test. Compare the results of your re-test to the results from the original. If you moved the win rate and ROI needle in the desired direction (results improved), that tells you something. If you moved the needle in the other (wrong) direction, that tells you something too. Re-weight a third time and re-test. After several iterations you should have a fairly accurate idea of how best to weight those two factors - or even whether you are better off eliminating one of them from the model.
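(The guess/test/re-weight loop Jeff describes can be written down directly. A toy sketch follows, where roi_of is a placeholder for whatever backtest scores a weighted blend of the two factors against a holdout sample.)

import itertools

# Try a grid of weights on the two correlated factors and keep whatever
# combination moves the win rate / ROI needle the right way.
weights = [0.0, 0.25, 0.5, 0.75, 1.0]
best = None
for w_speed, w_class in itertools.product(weights, weights):
    roi = roi_of(w_speed, w_class)        # placeholder: your holdout backtest
    if best is None or roi > best[0]:
        best = (roi, w_speed, w_class)
print("best ROI %.3f at speed weight %.2f, class weight %.2f" % best)

(A weight of 0.0 surviving the search is the data telling you the factor is better off eliminated - exactly the last decision Jeff describes.)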

If you stop to think about it before getting started you might realize that you are actually following a decision tree. And if you really think about it you might realize that such a decision tree can be mapped out ahead of time... before you even begin R&D on your next model. And then one day after years of manually setting parameters for firing off individual data runs - you realize that if it can be mapped out ahead of time - then it can be automated.

It might take some time (sometimes it takes a very long time) but eventually you can arrive at a working model. And even if your model contains variables with some degree of crossover (collinearity), your testing should have suggested how best to weight those factors and what to expect going forward.


-jp

.

garyoz
04-14-2009, 07:20 AM
So you're backfitting and using your "gut" to determine factor weights instead of any established statistical methodology. You'd never publish it in an academic journal and you'd get an "F" in statistics--but if it works, great.

Also, decision trees are rudimentary because they are really just a ranking (barely ordinal) method: either A or B (a dichotomous decision), then on to either A or C, and so on. I suppose the human handicapping process does that in its own way.

I have no solutions--I've tried rigorous statistical analysis--and if I had time I'd be pursuing more dynamic models, based upon econometrics, that use real-time pool data. I have no idea whether that would work.

sjk
04-14-2009, 07:52 AM
I thought that what Jeff posted made good sense.

If you analyze past performance metrics in the light of what you know ought to be there (past speed ratings provide a clue to future speed ratings, etc.), it is possible to make good probabilities for the outcome of a race.

The fact that two potential inputs might be highly correlated (for example trainer win percent and jockey win percent) does not mean you have to throw up your hands and ignore both.

Surely if you develop a method using data from 100,000 races and show success on a non-overlapping sample of another 100,000 races there must be some statistical significance.

I am not suggesting that you find a boatload of methods and throw away anything that fails. I am saying that if you take a considerable amount of time to develop a single method, and that method forward-tests successfully, then you are likely to make some money using it.

garyoz
04-14-2009, 08:53 AM
I thought that what Jeff posted made good sense.

I don't disagree. It is trial and error, but testing against a database that was not used in development is not backfitting, and that provides a degree of validity. SJK, you have had some of the best posts over the years on this topic--and you undoubtedly really understand your data.

I have an academic background, hence my focus on reproducibility and tests of statistical significance. I believe "gut" weighting is appropriate--I've given up trying to use formal multivariate statistics (I don't have the time to do any research anyway).

facorsig
04-14-2009, 10:51 AM
Ten years ago, when I was a quality engineer, I used to calculate the number of tensile tests required to establish statistical confidence that a lot of pipes was meeting the quality specification. HSH borrows from the world of quality engineering to measure individual factors with confidence intervals. The confidence interval measures how many observations of the factor you have, relative to your edge, to assure that the factor is not "randomly profitable". The highest-confidence factors show as 99 on the IV Table screen and go down from there one point at a time to 95, and five points at a time below that. I have applied this concept to horse racing since I originally saw it in All In One.
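(A sketch of that quality-engineering idea applied to a factor's sample: a normal-approximation confidence interval on the win rate. The exact formula HSH uses is not documented here, so this is just the textbook version, with made-up sample numbers.)

import math

def win_rate_ci(wins, starters, z=2.576):
    """Normal-approximation confidence interval on a factor's win rate.
    z = 2.576 corresponds to roughly 99% confidence."""
    p = wins / starters
    half = z * math.sqrt(p * (1 - p) / starters)
    return p - half, p + half

lo, hi = win_rate_ci(180, 1200)   # e.g. 180 wins from 1,200 qualifying starters
print(f"win rate 15.0%, 99% CI ({lo:.1%}, {hi:.1%})")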

Fred

Jeff P
04-14-2009, 12:44 PM
So you're backfitting and using your "gut" to determine factor weights instead of any established statistical methodology. You'd never publish it in an academic journal and you'd get an "F" in statistics--but if it works, great.
You talk about gut as if it's a bad thing.

Yes, I'm looking at data (backfitting if you will) and then gut. But understand something: "gut" after decades of experience working with handicapping data sets can be a very different thing from "gut" for somebody without that level of experience.

Somehow after looking at enough data samples and watching enough races (since about 1981) I started to understand WHY data from samples produced the results that it did. IOW I could interpret/extrapolate WHY from the data and then see it play out on the track right in front of me.

Within that context WHY can go a long way towards making "gut" a beautiful thing.

Let's see... The ability to grow a bankroll using models developed with methods that college stat professors wouldn't approve of? Or an F in a college level stat class? :cool:

We're talking about parimutuel wagering here not academia. I'm certainly not looking for any type of validation from the statistics community.


-jp

.

mwilding1981
04-14-2009, 03:16 PM
Bankroll growth is the end goal, of course. I am not sure I fully explained how I am using a decision tree. I am using it to get an interpretation of the level of importance of the different variables, and then weighting them using a combination of gut and regression. I am also now trying the same process in parallel using a GA. I am finding that, generally speaking, each variable adds an amount of explained variance to the model, and that amount gets smaller the further down the tree's importance list I go. There are some variables that cause a small decline or make no difference at all, and those I am removing. When I have finished building the model I shall start removing the factors that contribute the least, to see whether I can gain a further improvement.
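(The pruning mwilding describes at the end is backward elimination, which scikit-learn packages as recursive feature elimination. A minimal sketch, with toy data standing in for real variables:)

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(4000, 12))        # 12 candidate variables (toy data)
y = (X[:, 0] + 0.4 * X[:, 3] + rng.normal(size=4000) > 0.8).astype(int)

# Repeatedly refit and drop the weakest variable until 5 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)
print("kept columns:", np.flatnonzero(selector.support_))
print("elimination ranking (1 = kept):", selector.ranking_)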

sjk
04-14-2009, 05:49 PM
It is not clear to me what your variables are trying to predict. It seems to me you would want to be predicting the speed that a particular horse will run, because if you know such things as the probability that Horse A will run an 80 while all of the other horses run a 79, you can go on to create winning probabilities for each of the horses.

As to how to predict the likelihood of Horse A running an 80, I would think the most important information would relate to his last start - his speed figure, sectional performances, information about the horses he ran against, etc. - and the next most important information would relate to his next-to-last start, and so forth.
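(sjk's point about turning projected figures into win probabilities can be made concrete with a quick Monte Carlo, assuming each horse's figure scatters roughly normally around its projection; the projections and spread below are made-up numbers.)

import numpy as np

rng = np.random.default_rng(2)
projected = np.array([80.0, 79.0, 79.0, 77.5])   # hypothetical projections
spread = 3.0                                     # assumed race-to-race scatter

# Simulate the race many times; the winner is the highest sampled figure
sims = rng.normal(projected, spread, size=(100_000, len(projected)))
win_prob = np.bincount(sims.argmax(axis=1), minlength=len(projected)) / len(sims)
print(win_prob)    # the 80 horse should win somewhat more often than the 79s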

mwilding1981
04-14-2009, 06:29 PM
What I am trying to do is weight the variables in the model and also calculate the optimum number of variables for the model. I am currently using one method to determine the most important variables in defining the output of the model, and then another method to weight them and see whether they improve the model. As I have a lot of variables to go through (many of which I shall most likely not use), I need a more efficient way than trying every combination of them with different weights, hence the attempt at assessing their importance first, before weighting them and adding them to the model.