Question on chart / past performance parsing


Robert Fischer
04-02-2007, 08:57 PM
I am converting PDF to Excel (it also works for .html, Word, etc.). When I convert the past performances (BRIS or DRF), the numbers in the beaten lengths come out as "crazy" SYMBOLS.

Similar problem with charts - I get a number like 2212 to mean 2nd by 2.5 lengths. I also cannot separate the position from the beaten lengths automatically.

Are there any PDF chart parsers?

Kelso
04-02-2007, 10:36 PM
Robert,
What application are you using for the conversion? Do you find it difficult to get it to distinguish tightly spaced columns or to understand that some single columns (e.g., horse name) contain spaces but are still only one column? (These problems have stopped me from purchasing several I've used on trial.)

Thank you.

osophy_junkie
04-02-2007, 11:32 PM
Whatever you are using to read the PDFs is doing the character set conversion incorrectly. One charset's crazy symbol is another's 1/2, etc.
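
To make the charset point concrete, here is a minimal Python sketch. The byte value and the two encodings are assumptions picked for illustration, not taken from any actual BRIS or DRF file, and PDF fonts often use their own custom encodings. It only shows that the same byte reads as 1/2 under one charset and as a "crazy" symbol under another.

```python
# Illustration only: this byte and these two encodings are assumptions,
# not what any particular chart vendor or converter actually uses.
raw = bytes([0xBD])

as_latin1 = raw.decode("latin-1")  # read as ISO-8859-1: the one-half sign
as_cp437 = raw.decode("cp437")     # read as the old IBM PC code page

print(repr(as_latin1))
print(repr(as_cp437))
print(as_latin1 == as_cp437)  # False: same byte, two different characters
```

If the converter lets you choose the encoding it assumes, that is the setting to look at first.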

Robert Fischer
04-03-2007, 10:51 AM
Kelso,
One of the applications I use is called "Able2Extract". It lets you custom specify your columns. Tightly spaced columns can be a problem - I cannot automatically separate position from lengths behind in the charts. It does handle the spaces-in-a-single-column issue via the custom method.

Robert Fischer
04-03-2007, 10:53 AM
Sounds like I need to edit or replace the charset for my extraction program? Not sure how, but it is a start:confused: .

the_fat_man
04-03-2007, 03:35 PM
I venture into this with trepidation cause no matter what I write someone will KNOW BETTER and contest it.

Anyway:

The vendor 'font' is the least of your worries. You can always figure out what the symbols 'represent' by looking at the original chart. They will be consistent throughout the document. You can then do a find and replace (regex) and quickly convert.
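
A rough sketch of that find and replace in Python. The symbols in the table are placeholders (assumptions), since the real ones depend on your converter. You would fill the table in by comparing the converted text against the original chart. The function name is made up for the example.

```python
import re

# Hypothetical mapping: these symbols are placeholders, not the real ones.
# Build the real table by eyeballing the original chart next to the output.
SYMBOL_TO_FRACTION = {
    "\u00a9": "1/4",
    "\u00ae": "1/2",
    "\u00b6": "3/4",
}

# One pattern that matches any symbol in the table.
PATTERN = re.compile("|".join(re.escape(s) for s in SYMBOL_TO_FRACTION))


def replace_symbols(text):
    """Swap each known symbol for its fraction in a single pass."""
    def _swap(match):
        fraction = SYMBOL_TO_FRACTION[match.group(0)]
        # Put a space after a preceding digit so "2" + "1/2" reads "2 1/2".
        before = match.string[match.start() - 1 : match.start()]
        return (" " if before.isdigit() else "") + fraction

    return PATTERN.sub(_swap, text)


print(replace_symbols("2\u00ae"))  # 2 1/2
```

Because the symbols are consistent throughout a document, the table only has to be worked out once per vendor font.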

Your real issues are with separating the running position from the lengths behind. Easy to do in HTML, with the <sup> tag. NOT TRIVIAL with PDF or text.
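
Here is a small Python sketch of why one is easy and the other is not. The cell layout ("2<sup>2 1/2</sup>"), the function names, and the 99-length cap are all assumptions for illustration. Real chart markup will differ. The second function just lists every reading of a fused digit run like 2212 that fits the field size, which shows the guessing you are stuck with once the markup is gone.

```python
import re

# Assumed HTML cell shape: position, then lengths inside <sup>...</sup>.
CELL = re.compile(r"^\s*(\d+)\s*(?:<sup>(.*?)</sup>)?\s*$", re.IGNORECASE)


def split_html_cell(cell):
    """Return (position, lengths_text) from a cell like '2<sup>2 1/2</sup>'."""
    m = CELL.match(cell)
    if m is None:
        return None
    return int(m.group(1)), (m.group(2) or "").strip()


# Fractions as they look once flattened to bare digits.
FRACTIONS = {"14": 0.25, "12": 0.5, "34": 0.75}


def candidate_splits(digits, field_size, max_lengths=99):
    """List every plausible (position, lengths) reading of an all-digit run."""
    readings = []
    for i in range(1, len(digits)):
        pos, rest = digits[:i], digits[i:]
        if pos[0] == "0" or int(pos) > field_size:
            continue
        # Reading A: the rest is a whole number of lengths.
        if rest[0] != "0" and int(rest) <= max_lengths:
            readings.append((int(pos), float(rest)))
        # Reading B: the rest ends in a flattened fraction (14, 12 or 34).
        whole, frac = rest[:-2], rest[-2:]
        if frac in FRACTIONS and not whole.startswith("0"):
            total = (int(whole) if whole else 0) + FRACTIONS[frac]
            if total <= max_lengths:
                readings.append((int(pos), total))
    return readings


print(split_html_cell("2<sup>2 1/2</sup>"))    # (2, '2 1/2')
print(candidate_splits("2212", field_size=12))  # [(2, 2.5)]
print(candidate_splits("1112", field_size=12))  # three possible readings
```

The 2212 case happens to have one sensible reading in a 12-horse field, but 1112 has three, so the digits alone are not enough and you need other clues (field size, the previous call, column position).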

Then, assuming you can't just convert these to Excel files, if you want to store these in Excel, you'll have to write a regex that will parse a given line, pull out your data, put it into a temporary data structure and do whatever calculations you might want to it before importing to Excel.
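
As a sketch of that pipeline in Python: one regex per line, a small record as the temporary data structure, then a CSV that Excel can open. The line layout (program number, horse name, position, lengths behind), the Runner record, and the function names are all assumptions made up for the example. It also assumes the symbols have already been replaced and the position is already separated from the lengths.

```python
import csv
import io
import re
from dataclasses import dataclass, astuple
from fractions import Fraction

# Hypothetical line layout: "<pgm> <horse name> <position> <lengths behind>"
LINE = re.compile(
    r"^(?P<pgm>\d+)\s+(?P<horse>.+?)\s+(?P<pos>\d+)\s+"
    r"(?P<lengths>\d+(?: \d/\d)?|\d/\d)$"
)


@dataclass
class Runner:
    pgm: int
    horse: str
    position: int
    lengths_behind: float


def parse_line(line):
    """Parse one chart line into a Runner, or None if it does not match."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    # "2 1/2" -> 2 + 1/2 -> 2.5
    lengths = float(sum(Fraction(part) for part in m["lengths"].split()))
    return Runner(int(m["pgm"]), m["horse"], int(m["pos"]), lengths)


def write_csv(runners, stream):
    """Write the parsed runners as CSV rows with a header line."""
    writer = csv.writer(stream)
    writer.writerow(["pgm", "horse", "position", "lengths_behind"])
    for runner in runners:
        writer.writerow(astuple(runner))


runner = parse_line("5 Example Horse 2 2 1/2")
buffer = io.StringIO()
write_csv([runner], buffer)
print(buffer.getvalue())
```

To get a real file, pass an open file instead of the StringIO, for example open("chart.csv", "w", newline=""). Any calculations would go between parse_line and write_csv.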

The parsing part will be the most difficult, as there is just enough variation between tracks and race distances to keep you revising your regex for a long time. And regexes are a beast all their own.

Then again, you can always buy the data.

cj
04-03-2007, 03:43 PM
The comma delimited charts are cheap enough that, in my opinion, they are well worth the money. The only reason I parsed HTML charts was that there was no better option in the past for a reasonable fee.

You will spend countless hours trying to parse PDFs and probably never get it perfected.

Tom
04-03-2007, 03:50 PM
Agree - just using the comma delim charts is a breeze. When you consider the time you spend doing the parsing and organizing and all that jazz, you could buy them for less. Unless you work cheap:p

Now, having said that, does anyone out there have a parser for harness charts from USTA? They are HTML and I download and hand-transfer those, just because there are no reasonably priced charts available. I would like to do more tracks, but as I near the twilight of my years:rolleyes: I get lazier and lazier.

the_fat_man
04-03-2007, 04:04 PM
I agree. Time is certainly money, especially for those playing multiple tracks.


I've basically got the PDF parse down (the next hurdles are 2F baby races in Cali and 4F, which are much easier, in Kee) and I did it cause I wanted to learn Perl so I could do more complex things with it. Moreover, I've yet to settle on a firm overall method, and customizing allows me to play around with a lot of different formats.

I certainly wouldn't recommend this method to others.

Robert Fischer
04-03-2007, 07:52 PM
Thanks All for the advice on purchasing the comma delim.

TFM - the find and replace works pretty well, but again it is time-consuming.

Also, I found a decent program that lets me edit and type (figures, notes) right in the PDF files, no conversion.

the_fat_man
04-03-2007, 08:01 PM
A word of advice:

If you're doing ANY manual work in parsing these files (other than opening them to convert and then passing them to some program as text files or whatever) you're doing TOO MUCH work. You should buy data if that's the case.

You can automate the find and replace with a macro in WORD. But you really should consider using a language with stronger (more comprehensive) regular expressions.
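
For a rough idea of what that looks like outside WORD, here is a Python sketch that runs the same list of regex replacements over every text file in a folder. The folder layout, the .txt extension, and the function name are assumptions for the example. The replacement list is whatever you have worked out for your own files.

```python
import re
from pathlib import Path


def convert_folder(src_dir, dst_dir, replacements):
    """Apply each (pattern, replacement) pair to every .txt file in src_dir.

    Converted copies go to dst_dir; the originals are left untouched.
    Returns the list of files written.
    """
    compiled = [(re.compile(pattern), repl) for pattern, repl in replacements]
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(src_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for pattern, repl in compiled:
            text = pattern.sub(repl, text)
        target = out / path.name
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Usage would be something like convert_folder("converted", "cleaned", [("\u00ae", " 1/2")]), where both folder names and the symbol are placeholders. Once this runs unattended, the only manual step left is the PDF to text conversion itself.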

Take it a step at a time. When I first started, I was tweaking, and replacing using WORD.