Moderator: eeuunikkeiexpat

oh wise Lobby.. PDF to text?

Postby carica » Tue Jan 31, 2012 2:17 pm

My Kindle can read PDFs, but the highlighting feature works terribly or not at all. I've been wasting hours on the internet trying to find freeware that converts image-based (scanned text) PDFs into text documents (.doc, .rtf, .txt). I used Automator on my Mac to create a workflow that extracts PDF text into .rtf format, but it doesn't work when the scanned image is the least bit distorted. I also used Calibre to convert to .mobi format, but unless I'm missing something, the highlighting tool doesn't work in that format either.

Any tips? Thanks.
carica
Rank: Chile Forum Citizen
 
Posts: 351
Joined: Wed Mar 18, 2009 9:38 pm

Re: oh wise Lobby.. PDF to text?

Postby zer0nz » Tue Jan 31, 2012 2:19 pm

zer0nz
Rank: Chile Forum Citizen
 
Posts: 5682
Joined: Sun Jun 14, 2009 4:46 am
Location: Lost!

Re: oh wise Lobby.. PDF to text?

Postby carica » Tue Jan 31, 2012 3:56 pm



Sorry, I should have added that in these "wasted hours" on the internet I have indeed found this and many other free PDF-to-text sites, which DO NOT convert larger PDF files. They simply stop and say that the server is too busy. Same with Google Docs: after 2 MB had uploaded, it automatically canceled, saying it doesn't convert larger files. Since I'm looking at entire books, those websites require buying the full version, which I was trying to get around.

Thanks, though.

Re: oh wise Lobby.. PDF to text?

Postby regioncentralX » Tue Jan 31, 2012 4:02 pm

carica wrote:those websites require buying the full version, which I was trying to get around.

Thanks, though.

Ahhhemm... you mean Serial Box and stuff you can find on Vuze? This will not be elaborated on.
¡ This is Sshiile Weon !
regioncentralX
Rank: Chile Forum Citizen
 
Posts: 308
Joined: Fri Oct 29, 2010 12:33 am

Re: oh wise Lobby.. PDF to text?

Postby zer0nz » Tue Jan 31, 2012 4:04 pm

You say you're on a Mac... I found lots of Windows-based OCR programs, but this site kind of gives the idea for Mac:

http://www.alifesoft.com/blog/2011/4/Ho ... ac-OS.html

Sorry, I can't be much more help than googling... If it were important to me, I would buy the software if the trial did the job!

Re: oh wise Lobby.. PDF to text?

Postby carica » Tue Jan 31, 2012 4:21 pm

Good advice! Thank you.

Re: oh wise Lobby.. PDF to text?

Postby martagill » Tue Jan 31, 2012 6:35 pm

Hi, I'm a new member ... look for a free software program called Calibre. It works on Macs and PCs and will convert most formats to a Kindle-compatible (MOBI) format. Very user friendly.
martagill
Rank: Chile Forum Full Member
 
Posts: 45
Joined: Fri Sep 30, 2011 2:23 pm

Re: oh wise Lobby.. PDF to text?

Postby carica » Tue Jan 31, 2012 8:03 pm

martagill wrote: Hi, I'm a new member ... look for a free software program called Calibre. It works on Macs and PCs and will convert most formats to a Kindle-compatible (MOBI) format. Very user friendly.

carica wrote:Also used Calibre to convert to .mobi format but unless I'm missing something the highlighting tool doesn't work in this format.

Yep, Calibre's .mobi conversion doesn't turn scanned-image PDFs into text. After converting to that format, the option to change the text size exists, but nothing happens when you select it. Highlighting doesn't work either, which is really what I want.

Looks like I might have to spring for a paid version.

Re: oh wise Lobby.. PDF to text?

Postby jehturner » Tue Jan 31, 2012 11:37 pm

There's "pstotext", which works with the PostScript file renderer "Ghostscript", but I'm not sure whether it's easily available on a Mac (it should be reasonably straightforward for a developer to get it working there, but I'm not sure whether someone has actually done that). On Windows there's a program called GSview that makes it a bit easier to use. Anyway, it may be worth searching for.

Cheers,

James.

PS. You probably have to type the command in a terminal but it should be very simple.
jehturner
Rank: Chile Forum Citizen
 
Posts: 1286
Joined: Thu Nov 20, 2008 12:24 am
Location: La Serena

Re: oh wise Lobby.. PDF to text?

Postby jehturner » Tue Jan 31, 2012 11:47 pm

It looks like pstotext is available from MacPorts, if you know how to use that:

http://www.macports.org/

Re: oh wise Lobby.. PDF to text?

Postby rust » Wed Feb 01, 2012 4:06 am

PDF files usually contain two flavours of data other than font information and positioning data. In the case of "normal" PDF files, i.e. those produced from text using a word processor or similar, the flavour is plain text. In the case of "scanned" pages (worst effing way to do it), you have raster image data.
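
A rough way to tell which flavour a given file is, sketched in Python under some loud assumptions: it only searches the raw bytes for font and image dictionary markers, so PDFs that pack their objects into compressed streams can fool it (those may come back as "unknown"). The function name classify_pdf is made up for this illustration.

```python
import re


def classify_pdf(data: bytes) -> str:
    """Guess whether raw PDF bytes hold text, scanned images, or both.

    Heuristic only: looks for uncompressed font and image dictionaries.
    """
    # Font dictionaries suggest a real text layer.
    has_fonts = re.search(rb"/Type\s*/Font\b", data) is not None
    # Image XObjects suggest raster (scanned) page content.
    has_images = re.search(rb"/Subtype\s*/Image\b", data) is not None

    if has_fonts and has_images:
        return "mixed"
    if has_fonts:
        return "text"
    if has_images:
        return "scanned"
    return "unknown"
```

To try it on a real file, read the bytes first, e.g. classify_pdf(open("book.pdf", "rb").read()), where book.pdf is a placeholder name. A "scanned" or "unknown" result is a hint that plain text extraction will come back empty.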

Loading a PDF into Adobe products like Photoshop will give you the various layers. There is also a non-Adobe program called PDF Editor that lets you get at the raster data; I used it to deconstruct the Obama "birth certificate". Last time I looked at cadkas.com it was still available. Not freeware, but there is a functional demo. In any case, you will want a decent OCR program to convert the raster into text. Major Pain in the Poto.

IIRC, pstotext isn't magical. It only extracts the text data from a PostScript file. It won't magically convert raster image data into machine-readable text.
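
For the scanned case, one free route is to render each page to an image and run it through an OCR engine on your own machine, which also sidesteps upload size limits. The sketch below assumes the Poppler tool pdftoppm and the open-source Tesseract OCR engine are installed and on the PATH; neither is mentioned elsewhere in this thread. The helper names render_cmd, ocr_cmd and pdf_to_text are hypothetical, and the quality of the result will depend heavily on the scan.

```python
import shutil
import subprocess
from pathlib import Path


def render_cmd(pdf, prefix, dpi=300):
    # pdftoppm writes one PNG per page, named <prefix>-<page number>.png
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, prefix]


def ocr_cmd(image, lang="eng"):
    # "stdout" makes tesseract print the recognised text instead of writing a file
    return ["tesseract", image, "stdout", "-l", lang]


def pdf_to_text(pdf, workdir):
    """Render every page of a scanned PDF and OCR the pages in order."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(tool + " not found on PATH")

    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_cmd(pdf, str(out / "page")), check=True)

    # Sort by the numeric page suffix so page 10 does not land before page 2.
    pages = sorted(out.glob("page-*.png"),
                   key=lambda p: int(p.stem.split("-")[-1]))
    texts = []
    for page in pages:
        result = subprocess.run(ocr_cmd(str(page)), check=True,
                                capture_output=True, text=True)
        texts.append(result.stdout)
    return "\n".join(texts)
```

Something like pdf_to_text("book.pdf", "pages") would then return the whole book as one string, ready to save as .txt. An entire book at 300 dpi takes a while and a fair amount of disk space.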

Your mileage may vary.
Estamos EN Chile! It's a way of life...
rust
Rank: Chile Forum Citizen
 
Posts: 100
Joined: Tue May 04, 2010 2:19 am

Re: oh wise Lobby.. PDF to text?

Postby jehturner » Wed Feb 01, 2012 11:47 pm

rust wrote:IIRC, pstotext isn't magical. It extracts the text data only from a postscript file.

Yes, if you need OCR that's another thing altogether.

Oh, I see, it does sound like that's what carica needs. Sorry... (It has been 20 years since I used OCR; hopefully it has improved since then, as it was most excruciating.)

