Boosted Regression Trees


AITrader
03-18-2012, 10:42 AM
This is a question for the more technical folks. I've developed a system using a two-step multinomial logit for European harness racing. The logits are good at generalizing when cross-correlation among factors has been minimized. But this is difficult to do without losing factor accuracy in my experience.
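Minimizing cross-correlation among factors can be screened for mechanically before fitting. As a minimal sketch (the 0.8 threshold and the factor names are purely illustrative assumptions, not anything from the post):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two factor columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def correlated_pairs(factors, threshold=0.8):
    """Flag factor pairs whose |r| exceeds a chosen threshold.
    factors: dict mapping factor name -> list of values."""
    names = sorted(factors)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson(factors[a], factors[b])) > threshold]
```

Pairs flagged this way are the candidates where dropping or merging a factor costs the least in accuracy.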

I am looking at moving to boosted logistic regression trees instead and wonder if anyone of the more technical folks here have used them and what their experiences have been?

DeltaLover
03-18-2012, 12:12 PM
But this is difficult to do without losing factor accuracy in my experience.


I agree that this difficulty represents the major challenge in the whole exercise. I have experimented with techniques like logarithmic and Z-score transformations and Laplacian normalization, but I still have to rely on a trial-and-error approach when selecting the method to use.
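The two simplest transformations mentioned here can be sketched in a few lines of Python (the +1 shift in the log transform is an assumed placeholder to keep zero-valued factors in the domain of log):

```python
import math

def z_score(xs):
    """Standardize a factor to zero mean, unit variance."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / sd for x in xs]

def log_transform(xs, shift=1.0):
    """Compress a right-skewed factor; the shift keeps zeros
    inside the domain of log (the shift value is an assumption)."""
    return [math.log(x + shift) for x in xs]
```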

I am thinking of using a genetic program as a provider of data transformations, although its implementation details are still not clear to me and I definitely need to do some research in that direction.

As for boosted logistic regression trees, I have never used them so far, although I did some experiments with mixed logit and fully recursive logit models. My understanding of them is still shallow.

PaceAdvantage
03-18-2012, 12:15 PM
I have moved this thread to the General Handicapping Section because it isn't really about a specific piece of software...plus it will have more visibility here.

AITrader
03-18-2012, 12:34 PM
If you are interested in GP the new crop of ATI/AMD graphics cards is probably the way to go. Here is a link to an OpenCL library for GP/GPU computing that might be helpful -

http://gpocl.sourceforge.net/

At some point I'd like to revisit GP to fine tune a working general solution based on BRT or logits.

vnams
04-02-2012, 08:44 PM
I am also analysing harness racing, but for a track in US and one in Canada.
I tried boosted regression. Although the fit (i.e., R-squared) was better than for a linear regression, the profit was quite similar. This likely depends on what variable you are trying to predict. I am trying to predict Log(amount won + 30).
What variable are you predicting?

AITrader
04-03-2012, 05:22 AM
I am looking at boosted logistic regression, not linear. So far, linear regression seems to produce a lower fit than logistic regression. I did try linear regression trees with bootstrap aggregation (i.e. bagging) but found it caused a lot of overfitting.

I currently use ~75 variables and train on 20000 races. They are somewhat similar to the ones listed in Wikipedia -

http://en.wikipedia.org/wiki/Horse_race_handicapping_factors

Boosting looks like a neat technique for combining multiple logits, but I wonder if it suffers from the same kind of overfitting I saw with bagged linear regression trees.

Edit: my dependent variable is win probability.
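For readers unfamiliar with boosting: the core idea can be shown with the simplest possible variant, AdaBoost over one-split "stumps" on +1/-1 labels (won / did not win). This is a generic illustration of the reweighting loop, not the boosted logistic regression trees being discussed, and all the data in it is made up:

```python
import math

def stump(x, feat, thresh, sign):
    """A one-split 'weak learner': returns +1 or -1."""
    return sign if x[feat] > thresh else -sign

def best_stump(X, y, w):
    """Pick the stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for s in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump(xi, f, t, s) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, s)
    return best

def adaboost(X, y, rounds=5):
    """y must be +1/-1 labels; each round up-weights the
    examples the previous stumps got wrong."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, f, t, s = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, t, s))
        w = [wi * math.exp(-alpha * yi * stump(xi, f, t, s))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump(x, f, t, s) for a, f, t, s in ensemble)
    return 1 if score >= 0 else -1
```

The overfitting question comes down to how many rounds you run and how deep the weak learners are; stumps are about as resistant as it gets.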

gm10
04-03-2012, 06:34 AM
I am looking at boosted logistic regression, not linear. So far, linear regression seems to produce a lower fit than logistic regression. I did try linear regression trees with bootstrap aggregation (i.e. bagging) but found it caused a lot of overfitting.

I currently use ~75 variables and train on 20000 races. They are somewhat similar to the ones listed in Wikipedia -

http://en.wikipedia.org/wiki/Horse_race_handicapping_factors

Boosting looks like a neat technique for combining multiple logits, but I wonder if it suffers from the same kind of overfitting I saw with bagged linear regression trees.

Edit: my dependent variable is win probability.

I've often wanted to include a "correlated form" factor such as COMPETITORS1 but I haven't reached that point yet.

vnams
04-03-2012, 06:50 AM
I also have used boosted logistic regression, and it did much better than straight logistic regression, with little overfitting. I randomly divided my dataset into a test set and a training set.
However, I had problems with the software (I was using a package in R) and have been waiting for a reply from the authors of the package. Also, the analyses took a long time.

AITrader
04-03-2012, 06:59 AM
I also have used boosted logistic regression, and it did much better than straight logistic regression, with little overfitting. I randomly divided my dataset into a test set and a training set.
However, I had problems with the software (I was using a package in R) and have been waiting for a reply from the authors of the package. Also, the analyses took a long time.

What R package do you use and what problems are you having? If it's written in C or C++ I might be able to work on it and possibly fix the issue.

I usually run straight C code for the main system. I use R as a double check and tuning tool. I find R can be very slow and memory hungry for many of the more complex algorithms.

AITrader
04-03-2012, 07:04 AM
I've often wanted to include a "correlated form" factor such as COMPETITORS1 but I haven't reached that point yet.

What algorithm do you envision using for this factor?

gm10
04-03-2012, 07:09 AM
What R package do you use and what problems are you having? If it's written in C or C++ I might be able to work on it and possibly fix the issue.

I usually run straight C code for the main system. I use R as a double check and tuning tool. I find R can be very slow and memory hungry for many of the more complex algorithms.

The main reason for slow performance in R and S-Plus is the way the language handles loops. It is very inefficient in this respect, and indeed very memory hungry.

However, you can usually avoid such problems by expressing your loop as an operation on vectors/matrices.

(For more information on loop dos and don'ts, see http://stackoverflow.com/questions/7142767/why-are-loops-slow-in-r).
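The same principle can be illustrated in Python (the racing-flavored example, centering a factor within a race, is made up for illustration):

```python
def center_loop(values):
    """The slow R pattern: build the result element by element."""
    mean = sum(values) / len(values)
    out = []
    for v in values:
        out.append(v - mean)
    return out

def center_vectorized(values):
    """One whole-vector operation; in R this whole function
    collapses to `values - mean(values)`."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]
```

In R the payoff is much larger than in Python, because the vectorized form dispatches to compiled C code instead of the interpreter's loop machinery.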

gm10
04-03-2012, 07:20 AM
What algorithm do you envision using for this factor?

Not sure about any details yet but I envision lots of circular references and months of banging my head on the keyboard.

I've thought about a TrueSkill-type ranking as well. That would be a nice thing to have.

AITrader
04-03-2012, 07:21 AM
gm10 - I see that you are in Denmark. Are you Danish, and do you work on harness racing analysis?

Edit: the question was originally posted in Swedish.

gm10
04-03-2012, 07:33 AM
gm10 - I see that you are in Denmark. Are you Danish, and do you work on harness racing analysis?

Edit: the question was originally posted in Swedish.

No and No! (My other half hails from Ringkoebing)

AITrader
04-03-2012, 07:41 AM
Not sure about any details yet but I envision lots of circular references and months of banging my head on the keyboard.

I've thought about a TrueSkill-type ranking as well. That would be a nice thing to have.

I'm not exactly sure what algorithm Benter used for COMPETITORS1. I suspect it may be a separate linear regression on weighted normalized finishing position for races without those competitors, subtracted from the value of a regression with them.

gm10
04-03-2012, 09:30 AM
I'm not exactly sure what algorithm Benter used for COMPETITORS1. I suspect it may be a separate linear regression on weighted normalized finishing position for races without those competitors, subtracted from the value of a regression with them.

That sounds interesting. I knew he used this type of auxiliary regression, but I had never thought of applying it to collateral form.

I had a look at boosted regression trees during my lunch break. Looks worth a try, especially since there seems to be an out-of-the-box library available for it.

Can I ask you what you mean by a two-step multilogit model?

AITrader
04-03-2012, 09:40 AM
Can I ask you what you mean by a two-step multilogit model?

It's the standard Bolton & Chapman method described in their 1986 paper that Bill Benter used in Hong Kong. Both Benter & Bolton/Chapman's papers are in The Efficiency of Racetrack Betting Markets. The horse factors listed in Wikipedia were given to Chapman by Benter for an updated paper called "Still Searching for Positive Returns at the Track", which is also in TERBM.

http://books.google.co.uk/books?id=HX2xapyqENsC&pg=PA173&lpg=PA176&ots=mYTAB_h3zI

Robert Goren
04-03-2012, 10:30 AM
In the papers at the link, I could not find a couple of things. Does anyone know how they normalized the finish position? And does anyone know how they weighted for recency? These are the first of what I am sure will be many questions.

AITrader
04-03-2012, 10:39 AM
Benter actually published how he normalized the finish. As I recall he scaled it from 0.5 to -0.5. The weighting is just a normal weighted average. The secret is in the weighting formula :-)

Experiment a bit and you'll find it.

NFIN = 0.5 - (place - 1) / (total_horses - 1)
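The formula above, plus a recency weighting, can be sketched in Python. The exponential decay rate here is purely an assumed placeholder, since the actual weighting formula is the part left as an exercise:

```python
def normalized_finish(place, total_horses):
    """Maps 1st place to +0.5 and last place to -0.5, per the
    formula quoted above."""
    return 0.5 - (place - 1) / (total_horses - 1)

def weighted_recent_form(nfins, decay=0.8):
    """nfins: normalized finishes, most recent first.
    The exponential-decay weights are an assumption, not
    Benter's actual weighting scheme."""
    weights = [decay ** i for i in range(len(nfins))]
    return sum(w * f for w, f in zip(weights, nfins)) / sum(weights)
```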

Robert Goren
04-03-2012, 01:33 PM
Thanks

gm10
04-03-2012, 03:41 PM
It's the standard Bolton & Chapman method described in their 1986 paper that Bill Benter used in Hong Kong. Both Benter & Bolton/Chapman's papers are in The Efficiency of Racetrack Betting Markets. The horse factors listed in Wikipedia were given to Chapman by Benter for an updated paper called "Still Searching for Positive Returns at the Track", which is also in TERBM.

http://books.google.co.uk/books?id=HX2xapyqENsC&pg=PA173&lpg=PA176&ots=mYTAB_h3zI

Ah, I've read that paper, I've actually read most of that book! Are you referring to the "data explosion" method?

AITrader
04-03-2012, 04:02 PM
Ah, I've read that paper, I've actually read most of that book! Are you referring to the "data explosion" method?

Data explosion is a technique to expand limited race data: each race is used once with all finishers, then again with the winner removed (so the 2nd-place horse becomes the winner), then with the top two removed, and so on.
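As a quick sketch of that expansion (horse names invented for illustration):

```python
def explode_race(finish_order):
    """finish_order: horses listed from 1st to last.
    Returns the full race, then the race with the winner removed,
    and so on, down to the last two finishers. Each sub-race is
    an extra training observation for the logit."""
    return [finish_order[k:] for k in range(len(finish_order) - 1)]
```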

The two-step logit uses a set of fundamental factors for the first step, and then combines the public odds with the output of the first step to create the final win probability estimate.
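The second step amounts to a conditional logit on the logs of the two probability estimates for each horse in a race. A minimal sketch, where the two coefficients are illustrative placeholders (in practice they are estimated by maximum likelihood):

```python
import math

def second_step(p_model, p_public, b_model=0.7, b_public=0.8):
    """Combine first-step model probabilities with the public's
    odds-implied probabilities for one race. The coefficients
    are assumed values, not estimates from any real data."""
    scores = [b_model * math.log(pm) + b_public * math.log(pq)
              for pm, pq in zip(p_model, p_public)]
    # softmax over the race so the outputs sum to 1
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```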

gm10
04-03-2012, 04:15 PM
Data explosion is a technique to expand limited race data: each race is used once with all finishers, then again with the winner removed (so the 2nd-place horse becomes the winner), then with the top two removed, and so on.

The two-step logit uses a set of fundamental factors for the first step, and then combines the public odds with the output of the first step to create the final win probability estimate.

OK thanks ... I remember now. I used to add the public's probabilities, but after some research I found that, in my case, the second step increases the R-squared but lowers the final ROI.

vnams
04-03-2012, 08:03 PM
I'm using ada.
But today the author of ada emailed me, and hopefully I can get this resolved with him.
Thanks for your offer.

gm10
04-04-2012, 06:32 AM
Has anyone here tried both the multinomial approach, and the boosted regression tree approach (using identical or at least equivalent X/Y variables)? Just wondering how they compare.