|
|
06-12-2017, 08:03 PM
|
#1
|
Registered User
Join Date: Jan 2005
Posts: 6,626
|
Studying Past Results
A good explanation of why handicappers "studying" their clumps of races so often go astray, and wind up chasing rainbows that don't exist in the real world. "Overfitting to a specific clump of races" should not be dismissed lightly.
"To train a machine learning system, you start with a lot of training data: millions of photos, for example. You divide that data into a training set and a test set. You use the training set to "train" the system so it can identify those images correctly. Then you use the test set to see how well the training works: how good is it at labeling a different set of images? The process is essentially the same whether you're dealing with images, voices, medical records, or something else. It's essentially the same whether you're using the coolest and trendiest deep learning algorithms, or whether you're using simple linear regression.
But there's a fundamental limit to this process, pointed out in Understanding Deep Learning Requires Rethinking Generalization. If you train your system so it's 100% accurate on the training set, it will always do poorly on the test set and on any real-world data. It doesn't matter how big (or small) the training set is, or how careful you are. 100% accuracy means that you've built a system that has memorized the training set, and such a system is unlikely to identify anything that it hasn't memorized."
https://www.oreilly.com/ideas/the-ma...wsltr_20170607
Last edited by traynor; 06-12-2017 at 08:05 PM.
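The memorization failure mode in that quote is easy to demonstrate: a "model" that simply stores every training example scores 100% on the training set and is helpless on anything it hasn't seen. A toy sketch with made-up data (everything here is hypothetical, just to show the mechanism):

```python
import random

random.seed(1)

# Toy "races": a unique id plus two random features, with a random label.
# The id in each feature tuple guarantees no train/test overlap.
data = [((i, random.randrange(10), random.randrange(10)), random.choice([0, 1]))
        for i in range(200)]
train, test = data[:150], data[150:]

# A pure-memorization "model": a lookup table of training examples.
memorized = {features: label for features, label in train}

def predict(features):
    # Return the memorized label if seen, else a constant fallback guess.
    return memorized.get(features, 0)

train_acc = sum(predict(f) == y for f, y in train) / len(train)
test_acc = sum(predict(f) == y for f, y in test) / len(test)
print(train_acc)  # 1.0 -- perfect on the training set, by construction
print(test_acc)   # roughly chance on examples it never saw
```

The gap between the two numbers is the whole point of the article: training-set accuracy says nothing about how the "system" behaves on races it hasn't memorized.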
|
|
|
06-12-2017, 09:32 PM
|
#2
|
Registered User
Join Date: Jan 2005
Posts: 6,626
|
Well, so what? If you are using a piece of software (yours or someone else's) that "builds models" using all the races in a specific clump, or all the races that fit a specific set of filters, it might be wise to view the output with a healthy bit of skepticism. Especially those dazzling ROIs that never quite seem to work out when you bet on the recommended patterns.
It is relatively trivial to split data into training sets (to find the patterns) and control sets (to test the patterns). A jillion races is not necessary. Even if you are building models from a few hundred races, it might be much to your advantage to split it into training sets and control sets.
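It really is trivial. A minimal version of that split, with hypothetical race records (all field names and numbers below are illustrative, not anyone's actual data): shuffle once, carve off a control set, derive the pattern only from the training portion, and score it only on the control portion.

```python
import random

random.seed(42)

# Hypothetical race records: (speed_figure, post_position, won).
races = [(random.randint(60, 110), random.randint(1, 12), random.random() < 0.12)
         for _ in range(400)]

# Shuffle once, then split roughly 70/30 into training and control sets.
random.shuffle(races)
cut = int(len(races) * 0.7)
training, control = races[:cut], races[cut:]

def win_rate(subset):
    # Fraction of winners in a subset; 0.0 if the subset is empty.
    return sum(won for _, _, won in subset) / max(len(subset), 1)

# "Find the pattern" on the training set only -- say, a speed-figure cutoff.
cutoff = 100
train_hits = [r for r in training if r[0] >= cutoff]

# Then test the same pattern, untouched, on the control set.
control_hits = [r for r in control if r[0] >= cutoff]
print(win_rate(train_hits), win_rate(control_hits))
```

If the control-set number collapses relative to the training-set number, the "pattern" was fitted noise, and no amount of dazzling training ROI rescues it.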
|
|
|
06-12-2017, 10:04 PM
|
#3
|
Registered user
Join Date: Oct 2008
Location: FALIRIKON DELTA
Posts: 4,439
|
Quote:
Originally Posted by traynor
Well, so what? If you are using a piece of software (yours or someone else's) that "builds models" using all the races in a specific clump, or all the races that fit a specific set of filters, it might be wise to view the output with a healthy bit of skepticism. Especially those dazzling ROIs that never quite seem to work out when you bet on the recommended patterns.
It is relatively trivial to split data into training sets (to find the patterns) and control sets (to test the patterns). A jillion races is not necessary. Even if you are building models from a few hundred races, it might be much to your advantage to split it into training sets and control sets.
|
The biggest challenge lies in the way your training data are presented to your learning algorithm. Of course, the data transformation can itself require ML, so we can say that the process is recursive to some extent. The size of your training universe should also be proportional to the number of features you pass in, as well as to the depth of your networks.
__________________
whereof one cannot speak thereof one must be silent
Ludwig Wittgenstein
|
|
|
06-12-2017, 10:33 PM
|
#4
|
Veteran
Join Date: Aug 2005
Posts: 3,428
|
Quote:
Originally Posted by traynor
A good explanation of why handicappers "studying" their clumps of races so often go astray, and wind up chasing rainbows that don't exist in the real world. "Overfitting to a specific clump of races" should not be dismissed lightly.
"To train a machine learning system, you start with a lot of training data: millions of photos, for example. You divide that data into a training set and a test set. You use the training set to "train" the system so it can identify those images correctly. Then you use the test set to see how well the training works: how good is it at labeling a different set of images? The process is essentially the same whether you're dealing with images, voices, medical records, or something else. It's essentially the same whether you're using the coolest and trendiest deep learning algorithms, or whether you're using simple linear regression.
But there's a fundamental limit to this process, pointed out in Understanding Deep Learning Requires Rethinking Generalization. If you train your system so it's 100% accurate on the training set, it will always do poorly on the test set and on any real-world data. It doesn't matter how big (or small) the training set is, or how careful you are. 100% accuracy means that you've built a system that has memorized the training set, and such a system is unlikely to identify anything that it hasn't memorized."
https://www.oreilly.com/ideas/the-ma...wsltr_20170607
|
I have to ask you these questions.
What kind of computer system (PC??) do you think most people on here own?
and
What kind of computer system do you own? Is it also a PC?
Then, maybe your recent posts would make some sense to me.
|
|
|
06-12-2017, 11:37 PM
|
#5
|
Registered User
Join Date: Jan 2005
Posts: 6,626
|
Quote:
Originally Posted by DeltaLover
The biggest challenge lies in the way your training data are presented to your learning algorithm. Of course, the data transformation can itself require ML, so we can say that the process is recursive to some extent. The size of your training universe should also be proportional to the number of features you pass in, as well as to the depth of your networks.
|
Absolutely. If one is using standard PP data, finding stuff that everyone else misses or overlooks is almost impossible. Whatever one discovers is guaranteed to be found (or to have been found) by others.
One of the "depth" problems is that the more factors/attributes included, the more likely it is that others will be using the same factors/attributes (more or less in combination with other factors/attributes that one may or may not be using). It often seems that trying to include too many factors is a bigger problem than including too few. Fewer factors, better prices.
|
|
|
06-12-2017, 11:44 PM
|
#6
|
Registered User
Join Date: Jan 2005
Posts: 6,626
|
Quote:
Originally Posted by whodoyoulike
I have to ask you these questions.
What kind of computer system (PC??) do you think most people on here own?
and
What kind of computer system do you own? Is it also a PC?
Then, maybe your recent posts would make some sense to me.
|
Plain vanilla, standard laptop and desktop. Nothing spectacular. Some of the most useful data mining apps (and processes) are well-suited to pretty basic computer hardware.
It is the approach to data analysis that matters as much as, or more than, any gee-whiz hardware or Big Data software.
|
|
|
06-12-2017, 11:52 PM
|
#7
|
tmrpots
Join Date: Jun 2008
Posts: 2,285
|
Quote:
Originally Posted by traynor
Well, so what? If you are using a piece of software (yours or someone else's) that "builds models" using all the races in a specific clump, or all the races that fit a specific set of filters, it might be wise to view the output with a healthy bit of skepticism. Especially those dazzling ROIs that never quite seem to work out when you bet on the recommended patterns.
It is relatively trivial to split data into training sets (to find the patterns) and control sets (to test the patterns). A jillion races is not necessary. Even if you are building models from a few hundred races, it might be much to your advantage to split them into training sets and control sets.
|
Quote:
Originally Posted by DeltaLover
The biggest challenge lies in the way your training data are presented to your learning algorithm. Of course, the data transformation can itself require ML, so we can say that the process is recursive to some extent. The size of your training universe should also be proportional to the number of features you pass in, as well as to the depth of your networks.
|
I still think you two guys are the same person.
|
|
|
06-13-2017, 10:13 AM
|
#8
|
Registered User
Join Date: Oct 2012
Location: Big Apple
Posts: 52
|
ML is indeed difficult to apply to handicapping, especially since flow and trips need to be taken into account -- and these factors are so subjective. In other fields where ML systems are applied, experts all say it requires SMEs to interpret the data.
In the end, imho, an ensemble of algorithms works OK, but more importantly a good visualization tool works best. After all, aren't the BRIS, Timeform, and DRF PPs nothing more than data dashboards?
|
|
|
06-13-2017, 10:45 AM
|
#9
|
Registered user
Join Date: Oct 2008
Location: FALIRIKON DELTA
Posts: 4,439
|
Quote:
Originally Posted by lamboy
ML is indeed difficult to apply to handicapping, especially since flow and trips need to be taken into account -- and these factors are so subjective. In other fields where ML systems are applied, experts all say it requires SMEs to interpret the data.
In the end, imho, an ensemble of algorithms works OK, but more importantly a good visualization tool works best. After all, aren't the BRIS, Timeform, and DRF PPs nothing more than data dashboards?
|
The difficulty lies in the problem definition more than anything else. One of the core challenges has to do with the representation of the primitive handicapping factors, along with the derived metrics and their behavior through time and across circuits.
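One hypothetical way to make that representation concrete: keep each primitive factor tagged with its date and circuit, so derived metrics can be computed per circuit and per time window rather than pooled together. All names and figures below are illustrative assumptions, not anyone's actual schema.

```python
from dataclasses import dataclass
from datetime import date
from collections import defaultdict

@dataclass
class FactorObservation:
    horse: str
    circuit: str         # e.g. a track code
    when: date
    speed_figure: float  # a primitive handicapping factor

observations = [
    FactorObservation("A", "BEL", date(2017, 5, 1), 95.0),
    FactorObservation("A", "BEL", date(2017, 6, 1), 101.0),
    FactorObservation("A", "GP", date(2017, 3, 1), 88.0),
]

# Group the primitive factor per circuit, preserving time order.
by_circuit = defaultdict(list)
for obs in sorted(observations, key=lambda o: o.when):
    by_circuit[obs.circuit].append(obs.speed_figure)

# A derived metric -- last-versus-first figure change -- per circuit.
trend = {c: figs[-1] - figs[0] for c, figs in by_circuit.items()}
print(trend)  # {'GP': 0.0, 'BEL': 6.0}
```

The point is only that "through time and across circuits" becomes tractable once the raw factors carry their own time and circuit keys.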
__________________
whereof one cannot speak thereof one must be silent
Ludwig Wittgenstein
|
|
|
06-13-2017, 11:16 AM
|
#10
|
Registered User
Join Date: Oct 2012
Location: Big Apple
Posts: 52
|
Quote:
Originally Posted by DeltaLover
The difficulty lies in the problem definition more than anything else. One of the core challenges has to do with the representation of the primitive handicapping factors, along with the derived metrics and their behavior through time and across circuits.
|
i use graph theory to represent the core handicapping factors which allows me to see the relationships between different circuits and classes of horses.
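lamboy doesn't spell out his construction, but a minimal hypothetical version might use circuit/class combinations as vertices, with weighted edges recording how horses that moved between them fared. Everything here (the vertex choice, the weight rule, the sample moves) is an assumption for illustration only.

```python
from collections import defaultdict

# Adjacency map: edge (from_circuit_class, to_circuit_class) -> list of
# results for horses that made that move (1 = won next start, 0 = lost).
edges = defaultdict(list)

moves = [
    ("GP-Alw", "BEL-Alw", 1),
    ("GP-Alw", "BEL-Alw", 0),
    ("AQU-Clm", "BEL-Alw", 0),
]
for src, dst, won in moves:
    edges[(src, dst)].append(won)

# Edge weight: win rate for that circuit/class transition.
weights = {e: sum(r) / len(r) for e, r in edges.items()}
print(weights)  # {('GP-Alw', 'BEL-Alw'): 0.5, ('AQU-Clm', 'BEL-Alw'): 0.0}
```

With enough moves, the edge weights start to sketch the relationships between circuits and classes that he mentions.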
|
|
|
06-13-2017, 11:24 AM
|
#11
|
Registered user
Join Date: Oct 2008
Location: FALIRIKON DELTA
Posts: 4,439
|
Quote:
Originally Posted by lamboy
i use graph theory to represent the core handicapping factors which allows me to see the relationships between different circuits and classes of horses.
|
What you say here is not very descriptive, though. Questions like what you use as vertices and edges in your graph, how you calculate edge weights, and how you search the graph would clarify your statement.
__________________
whereof one cannot speak thereof one must be silent
Ludwig Wittgenstein
|
|
|
06-13-2017, 11:31 AM
|
#12
|
Registered User
Join Date: Oct 2012
Location: Big Apple
Posts: 52
|
Quote:
Originally Posted by DeltaLover
What you say here is not very descriptive, though. Questions like what you use as vertices and edges in your graph, how you calculate edge weights, and how you search the graph would clarify your statement.
|
LOL, that's why i stress building a great visualization tool!!
|
|
|
06-13-2017, 11:32 AM
|
#13
|
Registered user
Join Date: Oct 2008
Location: FALIRIKON DELTA
Posts: 4,439
|
Quote:
Originally Posted by lamboy
LOL, that's why i stress building a great visualization tool!!
|
??
__________________
whereof one cannot speak thereof one must be silent
Ludwig Wittgenstein
|
|
|
06-13-2017, 11:50 AM
|
#14
|
Registered User
Join Date: Oct 2012
Location: Big Apple
Posts: 52
|
Quote:
Originally Posted by DeltaLover
??
|
i think the disconnect is you're thinking along the lines of building a blackbox?
i parse the necessary data and run it through my algos which spit it out in a gui. it's up to me (SME/handicapper) to sculpt the data. imho, handicapping is sometimes an art.
|
|
|
06-13-2017, 01:16 PM
|
#15
|
Buckle Up
Join Date: Apr 2014
Posts: 10,614
|
Quote:
Originally Posted by lamboy
i think the disconnect is you're thinking along the lines of building a blackbox?
i parse the necessary data and run it through my algos which spit it out in a gui. it's up to me (SME/handicapper) to sculpt the data. imho, handicapping is sometimes an art.
|
That's how I see it as well, good point Phil. BTW, congrats on your 4th place finish at the Belmont Stakes Challenge, $45K+prize money+NHC seat....
|
|
|