View Full Version : Question about pdf charts?


csperberg
08-16-2006, 03:17 AM
Hopefully I posted this in the right area.

Ok, first off, I really don't know diddly squat about how people took data from the html charts and converted or exported it for use in a database or spreadsheet. I assume that is what people mean when they say parse.

Now that there are pdf files instead of html, and they are password protected, has this caused a stalemate in what people can do with the free charts?

What if the password could be removed from the pdf files? What could you do with the files then? Say I could remove the password and convert the pdf to a text file or a document, only because people seem to say it is hard to do anything with a pdf file, and if it could be converted it would be easier. How could I convert that for use in a database or spreadsheet?

Hope it doesn't sound like an offbeat question. I am just not really familiar with how people work with the data and the importing, exporting, or converting of it.

K9Pup
08-16-2006, 10:18 AM
There is software available that takes PDF type files and converts the contents to text. I experimented with one but I found the results of the conversion to not be easily usable for my purposes. Do a search on "PDF to text" to find products that might do what you want.

In most cases a computer program can read "simple" HTML files and strip away the HTML tags to get usable data. It isn't so simple for PDFs.
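
To illustrate the "strip away the HTML tags" idea, here is a minimal Python sketch using only the standard library. The sample row is made up for illustration and is not taken from any real chart; real chart pages would need more handling than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty pieces of text.
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_tags(html):
    """Return the text fragments found in an HTML string, in order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up sample row (an assumption, not real chart markup).
sample = "<tr><td>3</td><td>Example Horse</td><td>118</td></tr>"
print(strip_tags(sample))
```

With HTML the tags mark where each field starts and ends, which is the part a PDF does not give you.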

If someone knows of a good product for PDF to text, I would like to know also.

Tom
08-16-2006, 03:24 PM
I think you will find not everything in the PDF chart will convert to text. I have tried this conversion on other PDF files and gotten unusable results.

This would not come out like you need:

4hd
4

sjk
08-16-2006, 03:36 PM
I was always concerned whether you could tell 1-11 from 11-1 after it converted the superscripts (lengths) to text.
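
A small Python sketch of that ambiguity, assuming the converter simply flattens the superscript onto the same line as the position. The function name and the field-size limit of 14 are assumptions made for illustration only.

```python
def possible_readings(flat, max_position=14):
    """List every (position, lengths) split of a flattened chart token.

    max_position is an assumed upper bound on field size.
    """
    readings = []
    for cut in range(1, len(flat)):
        position, lengths = flat[:cut], flat[cut:]
        # The position part must be a plausible running position.
        if position.isdigit() and 1 <= int(position) <= max_position:
            readings.append((int(position), lengths))
    return readings


# "111" could be 1st by 11 lengths or 11th by 1 length.
print(possible_readings("111"))
# A token like "4hd" only has one sensible reading.
print(possible_readings("4hd"))
```

When more than one reading comes back, the text alone cannot settle it; you would need something else from the chart, such as the other horses' positions at the same call.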

Actually life is easier just buying the data and not worrying about parsing.

csperberg
08-16-2006, 05:42 PM
What if the main problem in converting a pdf is the type of fonts it uses, and of course whether it is password protected?

I noticed this on the free charts from any of the CD tracks websites. Their pdf charts are not password protected. I can use Adobe Acrobat Pro to edit them, delete things, type in new text, whatever I want. It's when I tried to convert them to text that they became all garbled. I then did some investigating and found they were using fonts I didn't have installed on my system. So when it went to convert, it didn't have anything to use to make the text readable on my system.

Now I understand there are copyrights involved with the stuff I am asking about. I just want to make clear that I am not trying to exploit the data providers in any way, shape, or form. This is merely educational, evaluational, and for personal use only. In no way, shape, or form will I attempt to retransmit or sell the data under a new name or form. Everything done is maintained only on my own computer and only for my use. The data being used in this test is merely the same free data any person is free to download. I just happen to be looking for an easier, automated approach to storing the free information on my system for personal use.

I just wanted to say that in case some legal head is out there and catches wind of this. I know there has been plenty of debate about whether the information is actually copyrightable or not. I don't want to take any chances, since even copyrighted material you are free to use as you wish for personal use. It would be like buying a book: the way you bought the book is the way it is copyrighted, the print, the order of the pages, the content in the book, etc. Well, there is nothing saying I can't rearrange the pages in that book to the way I like them. That is basically what I am doing, just rearranging the data into a format I would like. I also want to point out that I am a paying customer when it comes to getting my paid data; this is only using what is readily available for anyone to download for free.

So anyway, I kind of got off the topic, but figured it might be important to note. I have figured out a way to strip the passwords off the free pdf charts. I tried it on Equibase, DRF, and TSN free pdf charts. I was able to remove the password from all of them, no problem.

The problem I did run into was that the DRF and TSN charts use some fonts I didn't have installed in some areas of their charts. Most of the conversion came out good, with only a few areas being gibberish.

The Equibase files converted exactly. The only problem with theirs was that in some races the text was a little out of order. One race would look just like how you would read it, then another might have each item grouped together, like pg#, horse name, wgt, etc. It is all readable and there is no gibberish. I can remove the Equibase image from it, and even remove the copyright info from the bottom of each race page, so they are not present when doing a conversion.

Of course the Equibase files and the others all have slightly different layouts and information. For example, the Equibase ones don't have the horse's age, only the age restrictions for the race. The DRF and TSN ones are virtually the same, but unless you have all the fonts they use they will not appear correctly.

So now this is the part where I get stuck. I am not familiar with what people do to extract the data for use in a spreadsheet or database. Are there easy tools for specifying what you want extracted out of these txt files? Is there a way to easily convert them to csv files? Would it be easier if they were converted to html, xml, rich text, or a document? Is there a difference between converting to plain text and accessible text? Or do I just have to go through each race, copy and paste the info by hand, and build my database up like that?
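
For what the text-to-csv step might look like, here is a rough Python sketch. It assumes each runner sits on its own line as "program number, horse name, three-digit weight". That layout, the column names, and the sample text are all made up for illustration; real converted charts will differ and need their own patterns.

```python
import csv
import io
import re

# Hypothetical line layout: program number, horse name, weight.
LINE = re.compile(r"^\s*(\d+[A-Z]?)\s+(.+?)\s+(\d{3})\s*$")


def text_to_rows(text):
    """Pick out the lines that match the assumed layout; skip the rest."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            rows.append(list(match.groups()))
    return rows


def rows_to_csv(rows):
    """Write the rows as CSV text with an (assumed) header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["pgm", "horse", "weight"])
    writer.writerows(rows)
    return out.getvalue()


# Made-up sample text, not from a real chart.
sample = """FIRST RACE (made-up sample text)
1  Example Horse   118
2  Another Runner  120
Footer line that should be skipped
"""
rows = text_to_rows(sample)
print(rows_to_csv(rows))
```

The resulting text opens directly in a spreadsheet or imports into a database table. The hard part is writing patterns that survive the out-of-order and garbled sections.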

If one could get the data into a usable format in a spreadsheet or database, then couldn't one also do the same with the entries? Then couldn't they be linked somehow, so that when you enter the data for the entries it gives you back a very simple PP to look at? Nothing special really, but if you are not an unlimited subscriber to any data, it could allow you to scan the day's races across the nation to see if there is anything you might want to take a deeper look at. Then you could just buy PPs or data files for the races you think fit your criteria. Almost like a browse, try it on in the dressing room, and if you like it then buy it sort of thing.
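
As a sketch of how entries and past results could be linked, here is a small Python example using the built-in sqlite3 module. The table names, columns, and sample rows are all hypothetical, and matching on horse name alone is a simplification (two horses can share a name).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; dates are stored as ISO text so they sort correctly.
conn.executescript("""
    CREATE TABLE results (race_date TEXT, track TEXT, horse TEXT, finish INTEGER);
    CREATE TABLE entries (race_date TEXT, track TEXT, race INTEGER, horse TEXT);
""")
# Made-up sample data.
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", [
    ("2006-07-01", "AAA", "Example Horse", 3),
    ("2006-07-20", "BBB", "Example Horse", 1),
    ("2006-07-15", "AAA", "Another Runner", 5),
])
conn.executemany("INSERT INTO entries VALUES (?, ?, ?, ?)", [
    ("2006-08-17", "AAA", 1, "Example Horse"),
    ("2006-08-17", "AAA", 1, "Another Runner"),
])


def simple_pps(conn, race_date):
    """For each horse entered on race_date, list its earlier results, newest first."""
    query = """
        SELECT e.track, e.race, e.horse, r.race_date, r.track, r.finish
        FROM entries AS e
        JOIN results AS r
          ON r.horse = e.horse AND r.race_date < e.race_date
        WHERE e.race_date = ?
        ORDER BY e.track, e.race, e.horse, r.race_date DESC
    """
    return conn.execute(query, (race_date,)).fetchall()


for line in simple_pps(conn, "2006-08-17"):
    print(line)
```

Each output row pairs an entry with one past running line, which is about the simplest form of PP you could scan before deciding what to buy.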

I guess that's where I am going with this whole thing: a way to put the free and public information to work for me. So I know ahead of time that, hey, maybe I don't want to buy any PPs for this track because there wasn't anything shaping up in my handicapping favor at first glance. You might even find a track with plenty of favorable situations for you that you normally would not even play.

This is probably also where I should have paid a lot more attention and kept up with all the Access, VB, and SQL classes I took in school. I skimmed by in all of them and have not used any of them in a very long time, like 4 years. I never used them enough to remember anything about them anyway.

Is this a dead end, or is there hope that something useful can be made out of this? Hopefully everything I said was clear enough, or at least clear enough to know what I am trying to say.

sjk
08-16-2006, 06:28 PM
I think most or all of the chart parsers have either decided that the cost of buying data is not a huge factor in their profit picture or have given up because of the barriers to entry that Equibase has created.

Actually this has worked out pretty well for those in the first group.

Parsing had all of the benefits you describe for 1993-2005 (although I don't know why you would ever pay for PPs if you can parse charts and entries) but it seems to me that those days are gone.

You can see by the increases in handle (not!) that Equibase has improved the game by their actions.

osophy_junkie
08-16-2006, 06:28 PM
Removing the password from the PDF is trivial. Parsing what is extracted is not. I have a parser freely available at http://lamedomain.net/horses/chartparser/ if you're interested in looking at the code. Equibase is at liberty to change the format at any time, and the actual algorithm which guesses the positions isn't always right.

If you want something easy, I would suggest purchasing CSV files for race and result information from a vendor. There is not much you can do about getting historic data.

- Ed

cj
08-16-2006, 09:55 PM
I parsed charts for one reason. There was no better alternative available, even for a fee. Now that they are available at a reasonable price, why bother? Time is money.

TitanSooner
08-17-2006, 01:07 PM
http://www.verypdf.com/pdf2txt/pdf2txt.htm

There is a command line option that is great. One of the options is a -html flag that will convert it to html, which is much easier to parse the data out of than the standard text file it will create.

The html tags make it simple. You can also import the html into Excel, which makes it a jiffy using VB and controlling Excel through the automation objects.
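
For anyone not using Excel, a rough Python sketch of pulling rows out of converted html follows. It assumes the converter's output contains ordinary table rows (tr/td tags), which may not hold for a given tool; the sample markup is made up.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Groups the text of td/th cells into one list per tr row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample markup (an assumption about the converter's output).
sample = (
    "<table><tr><th>Pgm</th><th>Horse</th></tr>"
    "<tr><td>1</td><td>Example <b>Horse</b></td></tr></table>"
)
print(table_rows(sample))
```

Once the cells are grouped by row, writing them to a csv file or a database table is straightforward.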

just some thoughts.

the_fat_man
08-17-2006, 04:05 PM
We've pretty much beaten this to death.

Those that need results from multiple tracks just (grudgingly :D) buy the CSV files.

Those that feel it's a matter of principle (and focus on single tracks) figure out ways to parse the pdf files.

For example, ghostscript does an almost perfect job of converting the DRF pdf results chart. It only misses when it comes to the superscripts --the lengths ahead. Can't handle the BRIS files, however ---- limited testing on my part.

There's another option if you're running Unix/Linux and are able to comment out a line of code in a C file and recompile.

I look at it this way: it's all code ---what's done can be undone or converted; just a matter of one's motivation and, more importantly, the needed format for the data.