PDF Parser - Horse Racing Forum - PaceAdvantage.Com

guckers · 01-22-2012, 12:36 AM

I've been doing some homework on programming a parser and was wondering if anyone else has successfully parsed a PP in PDF (Equibase or DRF)?

openhorse · 01-25-2012, 12:53 PM

By breaking the data out of the document model, you can get reliable results.

Regex and text parsing won't be as reliable as something that knows the lengths of the varying data fields from reading the header.

C#, php

http://itextpdf.com/
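One way to read the header-driven idea above, as a rough Python sketch rather than anything iText-specific: take the column positions from a header line and slice each data row at those offsets instead of splitting on whitespace. The column names and the sample row are made up for illustration, and it assumes the text was extracted with its column alignment preserved.

```python
import re

def column_spans(header):
    """Return (name, start, end) per column, using where each header word begins.

    The last column gets end=None, meaning "to the end of the line".
    """
    words = [(m.group(), m.start()) for m in re.finditer(r"\S+", header)]
    spans = []
    for i, (name, start) in enumerate(words):
        end = words[i + 1][1] if i + 1 < len(words) else None
        spans.append((name, start, end))
    return spans

def parse_row(row, spans):
    """Slice one data row at the header-derived offsets."""
    return {name: row[start:end].strip() for name, start, end in spans}

# Hypothetical layout: these column names and values are illustrative only.
header = "DATE     TRACK DIST    FIN"
row    = "22Jan12  AQU   1 1/16  3"

spans = column_spans(header)
print(parse_row(row, spans))
```

Note that a value such as "1 1/16" contains a space, so a plain whitespace split would break it into two fields, while slicing by offset keeps it together.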

Greybase · 02-04-2012, 02:44 AM

Just caught this... I've done some pretty extensive PDF parsing, though for Greyhound programs. It was fully automated, using various command-line text extraction tools. When I looked at doing this with DRF and Equibase PDFs there were a number of complications. You have lots of embedded tiny fractions and, even worse, symbols embedded in horse PPs... which make text conversion difficult. I still say it COULD be done!!
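To illustrate the fraction problem, here is a minimal Python sketch that cleans up text after extraction. It assumes the extraction tool emits the fractions as Unicode fraction characters such as U+00BD; real DRF or Equibase files may instead draw them as small positioned digits or custom font glyphs, in which case this would not help. The function names are hypothetical.

```python
import unicodedata

FRACTION_SLASH = "\u2044"  # what NFKC puts between numerator and denominator

def expand_fractions(text):
    """Rewrite single-character fractions as ASCII, e.g. 6 + one-half -> '6 1/2'."""
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFKC", ch)
        if FRACTION_SLASH in decomposed:
            # Keep a whole number and its fraction apart: "6" + "1/2" -> "6 1/2".
            if out and out[-1].isdigit():
                out.append(" ")
            out.append(decomposed.replace(FRACTION_SLASH, "/"))
        else:
            # Leave every other character untouched.
            out.append(ch)
    return "".join(out)

def leftover_symbols(text):
    """List the non-ASCII characters still present, so they can be reviewed by hand."""
    return sorted({ch for ch in text if ord(ch) > 127})

sample = "6\u00bdf"  # "6", the one-half character, "f"
print(expand_fractions(sample))
print(leftover_symbols(expand_fractions("1\u00bc \u25b2")))
```

Reporting the leftover symbols rather than silently dropping them makes it easier to spot which embedded marks still need a mapping.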

guckers · 02-04-2012, 03:15 PM

Greybase, would you mind telling me at a high level how you achieved this? It seems that you would have to understand the chart framework and ordering of things, while still accounting for unique anomalies that happen. Then go through line by line and parse it by identifying locations and keywords. Does my explanation match something similar to what you were doing?
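The line-by-line approach described here could look something like the following Python sketch. It is not Greybase's actual method, which the thread does not show. The "RACE n" keyword, the entry layout, and the horse names are all invented for illustration; the point is only the shape: keywords switch context, recognized lines are parsed, and anything unexpected is set aside instead of being dropped.

```python
import re

# Hypothetical markers: a real program would need its own keywords and layouts.
RACE_RE = re.compile(r"^RACE\s+(\d+)\b")
ENTRY_RE = re.compile(r"^(\d+)\s+(.+)$")

def parse_program(lines):
    """Walk extracted text line by line, returning (races, unparsed_lines)."""
    races, unparsed = [], []
    current = None
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        m = RACE_RE.match(line)
        if m:
            # A keyword line starts a new race section.
            current = {"race": int(m.group(1)), "entries": []}
            races.append(current)
            continue
        m = ENTRY_RE.match(line)
        if m and current is not None:
            current["entries"].append({"post": int(m.group(1)), "name": m.group(2)})
            continue
        # Anomalies are kept with their line number for later inspection.
        unparsed.append((lineno, line))
    return races, unparsed

sample = [
    "RACE 1",
    "1 Fast Eddie",
    "2 Slow Joe",
    "(late scratch noted)",
    "RACE 2",
    "1 Lucky Star",
]
races, unparsed = parse_program(sample)
print(races)
print(unparsed)
```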

GameTheory · 02-04-2012, 03:46 PM

I used to parse the Equibase PDF charts for a while after they switched over to PDF. So it used to be possible, anyway. They've added some stuff since then (PP preview box of the running lines, for instance).

togatrigger · 02-14-2012, 03:48 PM

I tried for a while, but the compressed sections I was trying to convert were often encrypted as well. Maybe that isn't the case anymore, but it's doubtful. The format is notoriously difficult to parse and convert. Good luck.
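A quick way to check whether a given file has this problem, as a rough Python sketch using only the standard library. It relies on two things from the PDF format: an encrypted document carries an /Encrypt entry in its trailer, and most compressed streams use zlib (the FlateDecode filter). The byte search is a heuristic, not a real PDF parser, and the function names are hypothetical.

```python
import re
import zlib

def looks_encrypted(pdf_bytes):
    """Heuristic: an encrypted PDF references an /Encrypt dictionary in its trailer."""
    return re.search(rb"/Encrypt\b", pdf_bytes) is not None

def inflate_streams(pdf_bytes):
    """Yield the bodies of streams that decompress cleanly with zlib.

    Streams that are encrypted, or that use another filter, raise zlib.error
    and are skipped.
    """
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        try:
            # decompressobj tolerates the newline left before "endstream".
            yield zlib.decompressobj().decompress(m.group(1))
        except zlib.error:
            continue

# Tiny hand-built fragment for demonstration; not a complete, valid PDF.
body = zlib.compress(b"BT (Hello) Tj ET")
fragment = (
    b"%PDF-1.4\n1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + body
    + b"\nendstream\nendobj\ntrailer\n<< /Root 1 0 R >>\n%%EOF"
)
print(looks_encrypted(fragment))
print(list(inflate_streams(fragment)))
```

If `looks_encrypted` returns True, plain decompression of the streams will generally not give readable text, which matches the experience described above.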

guckers · 02-15-2012, 12:55 AM

PDF manipulation is no easy task. Thanks for the input, everyone. I will update my progress as it may (or may not) come along.