TTS vs PDF


Post by D.Leikin » Wed Oct 10, 2007 2:13 pm

Just ran across a powerful app that can be used to extract text from almost everything that can be viewed on the screen.

The app can auto-scroll PDF documents, which makes it possible to render a multi-page document into an image file in just one click. An OCR program can then be used to extract editable text from the image.

The app's name is SnagIt. A 30-day trial version is available from the developer's site, TechSmith.

Tip: A higher screen resolution gives better image rendering. Reasonable values start at about 1152x864, depending on the document's font size.

Post by kdwhite » Wed Oct 10, 2007 5:51 pm

So it is automatically doing OCR on entire documents?
Ken White

Post by D.Leikin » Thu Oct 11, 2007 3:31 pm

Not sure. I think it isn't using OCR; instead it seems to read from the video subsystem while it auto-scrolls the text in the active app's window.

If an app does not protect the text it displays, SnagIt seems able to intercept the character codes sent to the video subsystem and so render the document contents back into editable text.

When SnagIt can't render things into text (presumably because some protection stops it from intercepting the character codes), it can still render video memory into an image file that can be converted later with external OCR.

Here's an experiment whose results amazed me and that I can't fully explain.

1. Initial PDF: uses non-system embedded fonts, 128-bit RC4 encryption, and a strong password, with every option disallowed except opening and viewing.

2. SnagIt auto-scrolls the initial PDF and then saves the captured image as an "image PDF" (a purely image file).

3. The "image PDF" is then opened in Acrobat Pro and run through Acrobat's built-in OCR.

4. The resulting PDF appears to be identical to the initial PDF. It has editable text, the same formatting, and the same non-system embedded fonts, BUT it is no longer password protected. All options (including copy, print, TTS, etc.) appear to be allowed.

Post by DaveH » Fri Oct 12, 2007 4:59 am

Very interesting!

If a large ebook is scrolled this way, though, a huge image file (many megabytes) will be generated. I would think it may be too big to load into ordinary OCR software, so splitting would be required.

Does SnagIt split the file without breaking images or text?
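On the splitting question, here is a minimal sketch of one way to pick cut points so that a tall scrolled capture is never cut through a line of text. This is not how SnagIt works (nothing in this thread says how it does it). It assumes the image has already been reduced to one flag per pixel row saying whether that row is blank; `find_split_rows`, `row_is_blank` and `max_height` are names made up for the illustration.

```python
from typing import List, Sequence


def find_split_rows(row_is_blank: Sequence[bool], max_height: int) -> List[int]:
    """Pick rows at which to cut a tall image into chunks of at most max_height rows.

    Each returned index is the first row of the next chunk. A blank row is
    preferred as the cut point; if a chunk contains no blank row, the cut
    falls back to a hard cut at max_height.
    """
    if max_height < 1:
        raise ValueError("max_height must be at least 1")

    cuts: List[int] = []
    start = 0
    total = len(row_is_blank)
    while total - start > max_height:
        limit = start + max_height
        # Walk upwards from the tallest allowed cut to find a blank row.
        cut = next(
            (row for row in range(limit, start, -1) if row_is_blank[row]),
            limit,  # hypothetical fallback: no blank row found, cut anyway
        )
        cuts.append(cut)
        start = cut
    return cuts


if __name__ == "__main__":
    # Three text rows, one blank row, repeated; chunks may be up to 5 rows tall.
    rows = [False] * 3 + [True] + [False] * 3 + [True] + [False] * 3
    print(find_split_rows(rows, max_height=5))
```

With real captures, the blank-row flags would have to come from the image itself (for example, rows whose pixels all match the page background), which this sketch leaves out.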

For your experiment, was the input PDF generated with Adobe Acrobat, and would it work for PDFs generated by other software?
Dave UK

Post by D.Leikin » Fri Oct 12, 2007 7:45 pm

I'm not sure splitting is necessary at all. I believe huge images can be saved as multi-page PDFs, which should be fine for professional OCR software such as Adobe Acrobat or ABBYY.

I experimented with a PDF available for free download from a scientific library. The PDF had been created in LaTeX with the hyperref package and had no security restrictions, so I used the Password Security method (Acrobat Professional 7.0) to set all the restrictions to "Not Allowed".

SnagIt should probably work fine with large files of any type (doc, html, xls, etc.), as it doesn't seem to care much about the extension. It just rips anything it can scroll, I think.

Post by DaveH » Sat Oct 13, 2007 12:18 pm

[quote="D.Leikin"] I’m not sure if splitting is necessary at all. I believe that huge images can be saved as multi-page PDFs, which should be ok for professional ocr like Adobe or Abbyy.[/quote]

Doesn't the OCR process each page of the PDF document independently? If so, no images or lines of text should be split between adjacent pages.
Dave UK

Post by D.Leikin » Sat Oct 13, 2007 1:49 pm

As far as the

PDF->(SnagIt)->image PDF->OCR

conversion is concerned, there should be no problem with page-to-page correspondence, I think.

I haven't tried the app on long web pages, but it has tons of adjustable parameters and should probably handle HTML too.

SnagIt

Post by DaveH » Thu Nov 01, 2007 5:01 pm

I have now used this program to convert long secure PDF files to HTML for use with TA. The scrolling-window-to-PDF capture works beautifully, and by fitting the PDF to the screen width one can maximise the resolution ready for OCR (OmniPage Pro in my case). Saving the SnagIt image as a PDF allows each page of the PDF to be processed separately by the OCR. Memory limitations on my PC meant restricting each run to about 40 pages. I was so impressed that I bought it!
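For anyone batching a long document the same way, here is a small sketch that works out which page ranges to capture in each run. The 40-page figure is just what suited the PC mentioned above; `page_batches` and its parameters are made-up names for the illustration, not part of SnagIt or any OCR package.

```python
from typing import List, Tuple


def page_batches(total_pages: int, batch_size: int = 40) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (first, last) ranges.

    batch_size is an assumed per-run limit; tune it to the memory available.
    """
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    return [
        (first, min(first + batch_size - 1, total_pages))
        for first in range(1, total_pages + 1, batch_size)
    ]


if __name__ == "__main__":
    # A 100-page document processed about 40 pages at a time.
    for first, last in page_batches(100):
        print(f"capture and OCR pages {first}-{last}")
```

The last range is simply shorter when the page count is not a multiple of the batch size.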

Thanks D.Leikin, for suggesting it.
Dave UK

Re: TTS vs PDF

Post by SFCurley » Sat Feb 06, 2010 11:11 am

Concur on this -- it is a really great program, which I too use for conversion of protected PDF files.

There is a free, fully working version floating around the internet -- NOT a bootleg version, but a free-license giveaway version that the company published several years ago after version 8.0 came out. (They're now on version 9.x.) I have 9.x and there are some nice additional features, but if you want to poke around, do a Google search for "SnagIt free version 7.2.5" and you should be able to find it and the free license info.

Re: TTS vs PDF

Post by Gofer01 » Thu Jul 12, 2012 6:34 pm

I use HyperSnap. In my opinion HyperSnap is better than SnagIt. Over the past 8 years or so I have evaluated SnagIt several times, but I still keep going back to HyperSnap.

