PDA

need your opinion


HUSKER55
06-24-2012, 09:09 AM
I was wondering if a PDF converter program would allow me to download a PDF, put it into Excel, and then pick and choose the data.

I got some info on Nuance Pro8, which includes the Dragon speaking program.


Does anyone have any experience or thoughts you would share?


thanks!

Capper Al
06-24-2012, 09:13 AM
Doesn't sound practical to me. Most data streams can either be copied from the screen or uploaded from a comma delimited file. Or I might not be understanding your question.

wilderness
06-24-2012, 09:31 AM
There are different types of PDFs:
1) A PDF created as an image
2) A PDF created as text

In addition, PDF creators are offered security options (restrictions) that prohibit or limit general file operations (copy and paste is one).

You cannot copy data from an image PDF, because no text data exists in it, at least not without the use of an OCR tool.

hcap
06-24-2012, 01:49 PM
I doubt if any will do it well.
Having said that, here is a google search that turned up quite a few claiming otherwise.

http://www.google.com/search?hl=en&source=hp&biw=&bih=&q=pdf+to+excel&oq=pdf+to+excel&aq=f&aqi=g10&aql=&gs_l=firefox-hp.3..0l10.3313.18091.0.19233.14.12.0.2.2.0.198.1111.10j2.12.0...0.0.Gp0vF-ih5UQ

HUSKER55
06-24-2012, 05:11 PM
THANK YOU GUYS FOR TAKING THE TIME TO HELP.:)

Robert Goren
06-25-2012, 12:54 AM
I had a program that I bought a long time ago that allowed me to copy and paste the Bris PDF past performances that Twin Spires gives away into Excel. It wasn't as useful as I had hoped.

SchagFactorToWin
06-25-2012, 10:01 AM
I've never been able to get the sub/super script in the running lines to convert properly.

wilderness
06-25-2012, 10:57 AM
I had a program that I bought a long time ago that allowed me to copy and paste the Bris PDF past performances that Twin Spires gives away into Excel. It wasn't as useful as I had hoped.

Robert,
FWIW, with today's additional fields in the Tracmaster data, the software would be even less useful.


I've never been able to get the sub/super script in the running lines to convert properly.

Schag,
I have the same problem OCRing small fonts of any kind. Fractions (race times) are difficult in OCR as well.

There is PDF OCR software that is quite effective; however, it 'tain't cheap.
Abbyy FineReader has a diverse set of fonts for recognition built into the software, and its recognition is far more accurate than the OCR software that I use with my scanner. Unfortunately, the bloated font set slows the OCR process down quite a lot.
I've never used Abbyy for tables (I try to avoid stats at every turn).

Last year I worked with some HS students using the scanner OCR software (same model scanners and software I use daily) and they thought it was slow as molasses.

The key to OCR is an accumulated supplemental dictionary (topic based). Unfortunately, there is NOT much room for adding numerals to the dictionary.

chickenhead
06-25-2012, 12:05 PM
I experimented with building an automated setup for this for race charts. Since Equibase has historical charts, one could build an extensive PP collection.

OCR: Google Docs uses OCR to do PDF conversion, and it handles charts fine. However, the formatting is atrocious and would be difficult to deal with.

Cracking: breaking the password to allow text access proved much more fruitful. After unlocking, one can use a PDF parsing package, which many languages have (I used pdfminer for Python), and one can pretty reasonably get what they need.
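
For anyone who wants to try the pdfminer route on a PDF that already allows text access, here is a rough sketch. It assumes the maintained pdfminer.six fork (not necessarily the version used above), and the names char_record and iter_chars are made up for illustration. It only lists each character with its position, font, and size; it does not touch passwords or parse a chart.

```python
from types import SimpleNamespace


def char_record(ch):
    """Reduce an LTChar-like object to the fields a later parser needs."""
    return {
        "text": ch.get_text(),
        "x": round(ch.x0, 2),
        "y": round(ch.y0, 2),
        "font": ch.fontname,
        "size": round(ch.size, 2),
    }


def iter_chars(pdf_path):
    """Yield one record per character in the PDF (requires pdfminer.six)."""
    # Imported lazily so char_record can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        for box in page:
            if not isinstance(box, LTTextContainer):
                continue  # skip images, lines, rectangles
            for line in box:
                if not isinstance(line, LTTextLine):
                    continue
                for ch in line:
                    if isinstance(ch, LTChar):
                        yield char_record(ch)


# Stand-in object with the same attributes as an LTChar, for illustration only.
demo = SimpleNamespace(get_text=lambda: "2", x0=258.4, y0=636.5,
                       fontname="Univers-Condensed-Medium", size=7.0)
record = char_record(demo)
```

With a real file, something like `for rec in iter_chars("chart.pdf"): print(rec)` would dump every character; "chart.pdf" is a placeholder name.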

wilderness
06-25-2012, 12:51 PM
Cracking: breaking the password to allow text access proved much more fruitful. After unlocking, one can use a PDF parsing package, which many languages have (I used pdfminer for Python), and one can pretty reasonably get what they need.

My understanding is that how vulnerable the password is to hacking depends on the number of pages the PDF contains.

A 1-4 page PDF is a much easier hack than a 10-20 page PDF, with the latter likely not possible at all.
The hacking also depends on which bit-range is set by the PDF creator.

A 40-bit (early Adobe) is easily hacked.
A 128-bit is quite difficult.

I've seen some PDFs in the past where the creator added security settings but never added two levels of passwords, so the security settings were easily removed. Of course, you're not able to change these settings with the FREE Acrobat viewer or many of the other FREE PDF tools.

tupper
06-26-2012, 02:44 AM
I just tested the PDF Import plugin in LibreOffice, and it opens PDF PPs through the draw program in the same layout as the original PDF.

It is probably possible to then make a LibreOffice macro that would convert all of the fields to an ordered spreadsheet.

dietant
06-24-2013, 04:16 AM
For a PDF created as text, the Foxit text converter can help a little bit.
Do the rest with PDFs downloaded from the Amwest programs link (gone today?) and an Excel VBA macro.
The Foxit output produces 3 rows for each line containing superscripts and subscripts in the racetrack fonts.
example:
----------------------- Page 3-----------------------
Race 2 6 Furlongs BELMONT PARK - Wednesday, June 12, 2013 3+ FCLM, $20000
Purse: $ 28000. For Maidens, Fillies And Mares Three Years Old And Upward Foaled In NeYork State And Approved By The NeYork State-bred
2 Registry. Three Year Olds, 119 Lbs.; Older, 124 Lbs. CLM Price $20,000.
etc...13 Record: 5 0 0 1 $ 7220
1 Royal Blue, YelloTriangular Panel, YelloDiamonds On Sleeves, Blue And YelloCap 2/1 [19, 0, 2, 4, 0%] 12 Record: 1 0 0 0 $ 267
MAXANA 119 $ 20000 JUNIOR ALVARADO 12-13 Off: 1 0 0 0 $ 1650
-etc..
-etc..
1st Row: 52 50 08 38 5 6 6 7 6 nk nk
2nd Row :21/04/13-2AQU ft 3+F MSW50000 6f 23 47 1:11 68 4 2 5 2 6 5 2 5 4 Tomas,P 112 Lb 30.75 ChinaGold113 AhGaga118 Vaid118
to rail 1/2, 4upper[6]
3rd Row: 113 118 118 1 2 4
-etc...
super strings: 52 50 08 38 5 6 6 7 6 nk nk
sub strings: 113 118 118 1 2 4
Concatenating:
the supers 52 50 08 38 with 6f, 23, 47, 1:11 = 6f (52), ft1=23.50, ft2=47.08, ft3=1:11.38
the supers 5 6 6 7 with the subs 1 2 4 = 5 (5 ½), 6 6, 5 (6 ½), 5 (7 ¼)
the subs 113 118 118 with the supers 6 nk nk = ChinaGold113 6, AhGaga118 nk, Vaid118 nk
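
The first of those concatenations (fractional times) could be sketched in Python roughly like this. It is only an illustration of the pairing idea, not the VBA macro itself. The function name attach_hundredths is made up, it assumes the converter always emits one superscript per base value in the same left-to-right order, and it writes the result with a decimal point.

```python
def attach_hundredths(base_times, supers):
    """Pair each base time with the superscript that belongs to it."""
    if len(base_times) != len(supers):
        # The rows no longer line up, so do not guess.
        raise ValueError("base and superscript rows have different lengths")
    return [f"{base}.{sup}" for base, sup in zip(base_times, supers)]


# Values taken from the example rows above.
base_row = "6f 23 47 1:11".split()
super_row = "52 50 08 38".split()

# The first token of each row belongs to the distance (6f and its 52),
# so it is split off before the times are paired.
distance, times = base_row[0], base_row[1:]
distance_super, hundredths = super_row[0], super_row[1:]

fractions = attach_hundredths(times, hundredths)
```

The length check matters because a single missed superscript would otherwise shift every later value onto the wrong time.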

Longshot6977
06-24-2013, 12:51 PM
I have tried some OCR programs and have the same problems as others: they do not read the superscripts and subscripts correctly. They also put too much data in one row in Excel, and that is hard to deal with too, since it varies.
Has anyone got any program they had good success with to allow proper importing/reading of the PDF charts or PPs to Excel? Dietant, can you please elaborate a little more on your procedure? Thanks.

PS- I found ABLE2EXTRACT Pro v8 to be the best so far, but it requires too much finagling with the columns and won't always read sub/superscripts.

vegasone
06-24-2013, 04:10 PM
The HTML output of ABLE2EXTRACT Pro v8 looks like it would be the easiest to parse if you were able to do that.

dietant
06-24-2013, 05:23 PM
No big deal.
The superscripts, the numbers, and the subscripts have different sizes, occupy different PDF (X,Y) positions, and use different fonts.
The "PDF to text" translators write them in different rows depending on the value of Y, and use the X value as the offset from the beginning of the row.
Case study: Fin 2nd, behind by 1 and 1/2

.............. 1
.............. -
.............. 2
.........1
...2
The "2":
Position (X,Y); (258.40 ,636.50)
Font Name: Univers-Condensed-Medium
Font Size: 7
Text 2
------
The "1":
Position (X,Y); (261.82 ,638.40)
Font Name: Univers-Condensed-Medium
Font Size: 5.25
Text 1
-----
The "1/2":
Position (X,Y); (263.73 ,636.70)
Font Name: SansFractionsVerticalPlain
Font Size: 5.25
Text 2
----
The translators are unable to merge the different rows into one. :sleeping:
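
One way to do the merge yourself, if you can get at the positions, is to treat fragments whose Y values are close together as one row and then order them by X. A minimal Python sketch using the three fragments above follows. The 3-point tolerance, the Fragment class, and the function names are assumptions for illustration. The glyph map has only the one entry shown in the example (the character "2" in the fraction font is drawn as 1/2).

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float
    y: float
    font: str
    size: float


# Assumed mapping for the fraction font; only "2" -> "1/2" comes from the example.
FRACTION_GLYPHS = {"2": "1/2"}


def merge_rows(fragments, y_tolerance=3.0):
    """Group fragments whose Y is within y_tolerance of the row's first fragment."""
    rows = []
    # Highest Y first, assuming PDF coordinates with the origin at the bottom-left.
    for frag in sorted(fragments, key=lambda f: f.y, reverse=True):
        if rows and abs(frag.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Within each row, read left to right.
    return [sorted(row, key=lambda f: f.x) for row in rows]


def render(row):
    """Join a merged row into text, translating fraction-font characters."""
    parts = []
    for frag in row:
        if frag.font == "SansFractionsVerticalPlain":
            parts.append(FRACTION_GLYPHS.get(frag.text, frag.text))
        else:
            parts.append(frag.text)
    return " ".join(parts)


fragments = [
    Fragment("2", 258.40, 636.50, "Univers-Condensed-Medium", 7),
    Fragment("1", 261.82, 638.40, "Univers-Condensed-Medium", 5.25),
    Fragment("2", 263.73, 636.70, "SansFractionsVerticalPlain", 5.25),
]
rows = merge_rows(fragments)
line = render(rows[0])
```

The tolerance has to be smaller than the real line spacing of the chart, otherwise two printed lines would collapse into one.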

Longshot6977
06-24-2013, 09:11 PM
Dietant, you seem to know this very well, but some of it looks Greek to me. Is there a program you could recommend that allows the Equibase PDF result charts to be converted into Excel where it could read these super/sub scripts?

Can we play around with their x,y positions ourselves in those programs to make them look 'correct' so we can sort and review the data in Excel? I think a lot of us could really benefit from properly converting the PDFs to Excel. Thanks.

dietant
06-25-2013, 01:40 AM
"The HTML output of ABLE2EXTRACT Pro v8 looks like it would be the easiest to parse if you were able to do that." (vegasone)

With Foxit PDF Editor:
Delete any type of object from the pages.
Change the font, font size, color, and other text attributes of text objects.
......
1. Convert to HTML with ABLE2EXTRACT PROFESSIONAL version 6.0
or ABLE2EXTRACT PROFESSIONAL version 8.0 from INVESTINTECH.com.
2. Import the .html into Excel.
3. Analyze the character strings in order to associate groups of characters with their superscripts and subscripts.
or
1. Convert to text with Foxit PDF2TXT.
2. Import the .txt into Excel.
3. Analyze the character strings in order to associate groups of characters with their superscripts and subscripts.
......
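
If you would rather parse the HTML outside Excel, step 3 of the HTML route could look roughly like this in Python, using only the standard library. This is a sketch under assumptions: I have not checked what markup ABLE2EXTRACT actually writes, so the sample string and the idea that superscripts and subscripts arrive as sup and sub tags inside table cells are hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect table cells row by row, marking sup text with ^ and sub text with _."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "sup" and self._cell is not None:
            self._cell.append("^")
        elif tag == "sub" and self._cell is not None:
            self._cell.append("_")

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample; real converter output will differ.
sample = ("<table><tr><td>ChinaGold</td><td><sub>113</sub></td>"
          "<td><sup>6</sup></td></tr></table>")
parser = TableRows()
parser.feed(sample)
table = parser.rows
```

Keeping the ^ and _ markers in the cell text makes it possible to tell later which numbers were superscripts or subscripts when they are attached to their neighbors.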

dietant
06-25-2013, 02:20 PM
For Equibase charts, use Foxit PDF Editor.
For each race, delete any type of object from the pages, such as the logo, images, all data below "winner;", footnotes, etc.
Change the font, font size, color, and other text attributes of text objects.
Save the charts (using your own special name).
Convert to Excel with ABLE2EXTRACT PROFESSIONAL version 6/8.
Parse (no problems with the Helvetica and Helvetica-Bold fonts).
For Bris charts:
You need to HACK the PDF!
Remove the PDF protection with PDF Password Remover from verypdf.com Inc.
Edit the PDF with Foxit PDF Editor.
Convert to HTML with ABLE2EXTRACT PROFESSIONAL version 6/8.
Import the HTML page tables into Excel.
Parse.
I do not like hacking Bris docs, so I would rather use Equibase.
:confused: