PDF Text Extraction

I would like to extract text from a part of a PDF (selected by coordinates) using Ghostscript. Can anyone help me out?

You will have a lot of trouble doing that with coordinates. It would require locating every text cell in the file, calculating string width and wrapping, then estimating clipping windows and deciding what to include or exclude.

Yes, with Ghostscript you can extract text from PDFs. What you can do: extract the text of only a particular range of pages.
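
A minimal sketch of such a page-range extraction, assuming Ghostscript's txtwrite output device is available in your build. The wrapper name pdf_pages_to_text and its argument order are made up for illustration; only the gs switches are Ghostscript's own.

```shell
pdf_pages_to_text() {
  # Hypothetical helper: $1 = PDF file, $2 = first page, $3 = last page,
  # $4 = output file (optional; "${4:--}" falls back to "-", meaning stdout)
  gs -q -dBATCH -dNOPAUSE -sDEVICE=txtwrite \
     -dFirstPage="$2" -dLastPage="$3" \
     -sOutputFile="${4:--}" "$1"
}

# Example call: pdf_pages_to_text /path/to/your.pdf 3 5
```

The -q switch is there only to keep Ghostscript's startup banner out of the text when writing to stdout.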

Using the txtwrite device with -dFirstPage=3, -dLastPage=5 and -sOutputFile=- will output all text contained on pages 3-5 to stdout. If you want the output in a text file, use -sOutputFile=textfilename.txt.

A second approach, the ps2ascii.ps PostScript utility, requires you to download the current version of the file ps2ascii.ps from the Ghostscript Git source code repository. You'd need to convert your PDF to PostScript, then run Ghostscript with ps2ascii.ps on the PS file.
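
The two steps might look roughly like this. Treat it as a sketch: the function names are invented, ps2ascii.ps is assumed to sit in the current directory, and the switch list follows the usage comment found in older copies of ps2ascii.ps, so it may need adjusting for your Ghostscript version (check the comment at the top of your copy).

```shell
pdf_to_ps() {
  # Hypothetical helper: $1 = input PDF, $2 = output PostScript file
  gs -q -dBATCH -dNOPAUSE -sDEVICE=ps2write -sOutputFile="$2" "$1"
}

ps_to_ascii() {
  # Hypothetical helper: $1 = PostScript file,
  # $2 = detail switch, -dSIMPLE (the default here) or -dCOMPLEX
  gs -q -dNODISPLAY -P- -dSAFER -dDELAYBIND -dWRITESYSTEMDICT \
     "${2:--dSIMPLE}" -c save -f ps2ascii.ps "$1" -c quit
}

# Example: pdf_to_ps input.pdf input.ps && ps_to_ascii input.ps
```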

If the -dSIMPLE parameter is not defined, each output line contains some additional information beyond the pure text content, namely about the fonts and font sizes used.

If you replace that parameter with -dCOMPLEX, you'll get additional details about the images and colors used.

A more comfortable way to do text extraction: use pdftotext (available for Windows as well as Linux/Unix or Mac OS X). This utility is based either on Poppler or on XPDF.
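
As a sketch of a typical pdftotext call, the wrapper below combines a page range, layout preservation, both passwords, Unix line endings and suppressed page breaks. The function name pdf_range_to_text and its argument order are assumptions for illustration; the switches are pdftotext's own.

```shell
pdf_range_to_text() {
  # Hypothetical helper: $1 = PDF file, $2 = first page, $3 = last page,
  # $4 = user password, $5 = owner password
  # The trailing "-" sends the text to stdout instead of a file.
  pdftotext -f "$2" -l "$3" -layout -upw "$4" -opw "$5" \
            -eol unix -nopgbrk "$1" -
}

# Example: pdf_range_to_text /path/to/your.pdf 13 17 secret supersecret | less
```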

With the options -f 13 -l 17 -layout -upw secret -opw supersecret -eol unix -nopgbrk, pdftotext will display the page range 13 (first page) to 17 (last page) and preserve the layout of a PDF file protected by two passwords (user password secret and owner password supersecret), using the Unix EOL convention and without inserting page breaks between PDF pages. The output can be piped through less.

pdftotext -h shows all available command-line options.
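
Among those options are -x, -y, -W and -H, which restrict extraction to a rectangular crop area, so they come close to the coordinate-based extraction asked about. The sketch below assumes the default resolution, where, as far as I understand it, the values behave like points measured from the top-left corner of the page. The helper name pdf_region_to_text is hypothetical.

```shell
pdf_region_to_text() {
  # Hypothetical helper: $1 = PDF file, $2 = page number,
  # $3 = x and $4 = y of the crop area's top-left corner,
  # $5 = width and $6 = height of the crop area
  pdftotext -f "$2" -l "$2" -x "$3" -y "$4" -W "$5" -H "$6" \
            -layout "$1" -
}

# Example: pdf_region_to_text /path/to/your.pdf 1 100 200 300 150
```

It is worth checking the result against a known region first, since text that straddles the rectangle's edge may or may not be included.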

Obviously, both tools only work for the text parts of PDFs.

The cross-platform, open source MuPDF application (made by the same company that also develops Ghostscript) bundles a command-line tool, mutool. To extract text from a PDF with this tool, use mutool draw with the -F txt option.
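
A small sketch of that call, wrapped in a helper whose name (pdf_to_text_mutool) and optional second argument are made up for illustration:

```shell
pdf_to_text_mutool() {
  # Hypothetical helper: $1 = PDF file, $2 = output text file (optional)
  if [ -n "$2" ]; then
    mutool draw -F txt -o "$2" "$1"   # write the text to a file
  else
    mutool draw -F txt "$1"           # write the text to stdout
  fi
}

# Example: pdf_to_text_mutool the-file.pdf
```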

With -F txt, mutool draw writes the extracted text to stdout. Use -o filename.txt to write it to a file instead.

TET, the Text Extraction Toolkit from the PDFlib family of products, can find the x-y coordinates of text content in a PDF file (and much more). TET has a command-line interface, and it's the most powerful of all the text extraction tools I know.

I'm not sure Ghostscript can accept coordinates, but you can convert the PDF to an image and send it to an OCR engine, either as a subimage cropped at the given coordinates or as the whole image together with the coordinates. Some OCR APIs accept a rectangle parameter to narrow down the area for OCR.

Look at VietOCR for a working example; it uses Tesseract as its OCR engine and Ghostscript as its PDF-to-image converter.
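
On the command line, the same Ghostscript-plus-Tesseract idea could be sketched as below. This is not how VietOCR does it internally; the helper name ocr_pdf_page, the 300 dpi resolution and the PNG format are assumptions. It OCRs the whole page, so cropping to the given coordinates would be an extra step with an image tool of your choice before calling tesseract.

```shell
ocr_pdf_page() {
  # Hypothetical helper: $1 = PDF file, $2 = page number,
  # $3 = path for the rendered PNG image of that page
  gs -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 \
     -dFirstPage="$2" -dLastPage="$2" -sOutputFile="$3" "$1" &&
  tesseract "$3" stdout   # "stdout" as output base prints the recognized text
}

# Example: ocr_pdf_page /path/to/your.pdf 2 page2.png
```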

Debenu Quick PDF Library can extract text from a defined area on a page. The SetTextExtractionArea function lets you specify the x and y coordinates as well as the width and height of the area.

The GetPageText function can then be called immediately afterwards to extract the text from the defined area of the PDF.

The library is multi-platform and can be used from C# as well as various other programming languages.

Using GetPageText it is also possible to return either just the text located in that area, or the text located in that area together with information about the text's font such as name, size and color.
