Discussion:
Using Find Command in XSane PDF File
Raymond Hanslits
2017-01-31 22:22:42 UTC
I have been scanning and creating PDF files in XSane. However, I cannot get the Find Command to search for words and phrases in PDF files. Can you help me?
Thank you for any assistance you may provide.
Ray ***@yahoo.com
Torfinn Ingolfsen
2017-02-01 00:21:37 UTC
Post by Raymond Hanslits
I have been scanning and creating PDF files in XSane. However, I cannot get
the Find Command to search for words and phrases in PDF files.
Can you help me?
Scanned documents consist of images from the start (yes, even PDF documents).
The find command (or a search command) works with text (words, phrases).
So, after scanning your document, you need to perform text recognition
on the images in your document. This process is usually called OCR
(optical character recognition). It scans the images for characters and
creates text from the characters it finds. Recognition is not 100% accurate,
more like 97-98%. This text is usually saved as an invisible layer
"on top" of each image in a PDF file, so when you search you will find
the word on the correct page.
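
A quick way to see whether a given PDF already has such a text layer is to try extracting text from it. This is only a minimal sketch: it assumes pdftotext (from poppler-utils) is installed, and the function name has_text_layer is made up for illustration.

```shell
# Hypothetical helper: succeeds if pdftotext can extract at least one word
# from the PDF given as $1, i.e. the file most likely has a text layer.
has_text_layer() {
    # Count the words pdftotext writes to stdout; strip padding from wc.
    words=$(pdftotext "$1" - 2>/dev/null | wc -w | tr -d ' ')
    [ "${words:-0}" -gt 0 ]
}
```

Something like "has_text_layer scan.pdf || echo 'needs OCR'" would then flag a PDF that contains only images. A PDF with a few stray words of text would also pass, so treat the result as a hint.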

I see that XSane has an "OCR" button, but I have never used it, so I
don't know if it works or what it takes to make it work.
I usually use gscan2pdf http://gscan2pdf.sourceforge.net/ for scanning
text documents. It uses SANE, has several OCR tools you can choose
from, and usually works very well.

HTH
--
Regards,
Torfinn Ingolfsen
--
sane-devel mailing list: sane-***@lists.alioth.debian.org
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/sane-devel
Unsubscribe: Send mail with subject "unsubscribe your_password"
to sane-devel-***@lists.alioth.debian.org
Johannes Meixner
2017-02-01 09:00:30 UTC
Hello,
Post by Raymond Hanslits
I have been scanning and creating PDF files in XSane.
However, I cannot get the Find Command to search for
words and phrases in PDF files.
Because the scanner device produces only pixels, not characters
(no scanner "understands" what it scans),
the scanned data does not contain characters
(or even words or phrases). It contains only pixels,
regardless of which data (container) format is used.

See "XSane - Saving scan to text" at
http://lists.alioth.debian.org/pipermail/sane-devel/2017-January/035005.html

The crucial part is the OCR software, cf.
https://en.wikipedia.org/wiki/Optical_character_recognition

Personally I do not use OCR software, but as far as I have noticed,
the quality of the OCR result depends on using
scanning parameters chosen specifically for OCR.
For example, black and white scanning at a relatively
low resolution could give better OCR results than
high resolution photo scanning modes.
The data format of the scan could perhaps also make
a difference for OCR (e.g. PNG versus JPEG or even PDF).


Kind Regards
Johannes Meixner
--
SUSE LINUX GmbH - GF: Felix Imendoerffer, Jane Smithard,
Graham Norton - HRB 21284 (AG Nuernberg)
Roger
2017-02-01 20:22:10 UTC
Here's a quick list of commands: first scan a receipt at the
recommended 300 DPI (customize per your scanner), then optionally trim the
image with ImageMagick's convert (for some odd reason I failed to document
these commands, doh! You can likely skip them), and finally run Tesseract OCR
to create a text file of the recognized text.

# First scan in the image. 300 DPI is recommended, and I think 450 DPI is the
# optimal DPI when attempting OCR. scanimage writes PNM unless told otherwise,
# so request TIFF explicitly.

$ scanimage --format=tiff > ./receipt.tif

# Or a more extravagant method:

$ scanimage --format=tiff --progress --custom-gamma=no --source Flatbed --resolution=300 --icc-profile=${HOME}/ICC/CanoScan9000F/CNSR0D.ICC > receipt.tif

# The commands below attempt to auto-crop the background from the receipt, but
# because of the scanner's white background they fail to detect the background
# behind white paper. They should work with a black background, after
# some adjustment (e.g. use black paper from a hobby shop to provide a
# black background during scanning).
$ convert /tmp/receipt.tif -fuzz 55% -trim /tmp/receipt-trim.tif
$ convert -verbose receipt.tif -border 10x10 -fuzz 75% -trim +repage receipt-trim.tif

# This prints the recognized text to the terminal. To get a text file instead,
# replace "stdout" with an output base name (e.g. "receipt") and receipt.txt
# is created within the immediate folder.
$ tesseract receipt.tif stdout

# As described earlier in this thread, this creates receipt.pdf with the OCR
# text included. The text within the PDF file is stored in an encoded form and
# cannot be simply grepped!
$ tesseract receipt.tif receipt pdf


There are two possible end results:
1) A scanned image (e.g. receipt.tif) and a text file (e.g. receipt.txt)
containing the recognized text. If you archive data, this is probably
your best method for preserving image detail and avoiding FUD and extravagant
proprietary formats. Searching simple text files is extremely easy.
The downside is that maintaining two separate files can be troublesome.

2) A scanned image (e.g. receipt.tif) imported into a PDF file containing the
OCR text. Recent versions of Tesseract can write a PDF file that includes both
the image and the text when the pdf option is given; without it, and in older
versions, the output is a text file. Choose the PDF file method if you like
simplicity and care less about details. The downside is that the image is
further significantly compressed.

I prefer the first solution, as it leaves me with a high resolution
TIF/JPEG image, whereas the image inside the PDF file is compressed
drastically. On the flip side, the single PDF file includes both the image
and the text, rather than having to deal with two separate files (e.g.
receipt.tif and receipt.txt).
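
If you want to keep both options open, one scan can be run through Tesseract twice. The following is only a sketch: the function name ocr_scan is made up, and it assumes a Tesseract build that ships the pdf config.

```shell
# Hypothetical helper: OCR one image into a text file and a searchable PDF
# that sit next to the image and share its base name.
ocr_scan() {
    img=$1
    # Strip the extension: receipt.tif -> receipt
    base=${img%.*}
    # First pass writes receipt.txt, second pass writes receipt.pdf.
    tesseract "$img" "$base" &&
    tesseract "$img" "$base" pdf
}
```

A loop such as "for f in *.tif; do ocr_scan "$f"; done" would then process a whole folder of scans, leaving the original TIF files untouched.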


The final incantation of find will search a PDF file containing OCR text or general text.

# Search multiple PDF files for TEXT
find /tmp -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "TEXT"' sh {} \;
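
For the image-plus-text-file approach, the equivalent search could look something like the sketch below. It assumes each scan NAME.tif has its OCR output stored next to it as NAME.txt; both that naming scheme and the function name find_scans are assumptions for illustration.

```shell
# Hypothetical helper: print the scans whose OCR text file matches a pattern.
# $1 is the directory to search, $2 is the (case-insensitive) pattern.
find_scans() {
    dir=$1
    pattern=$2
    # grep -l lists the matching .txt files; sed maps each to its .tif image.
    find "$dir" -name '*.txt' -exec grep -il -- "$pattern" {} + |
        sed 's/\.txt$/.tif/'
}
```

For example, "find_scans ~/receipts total" would list every scan whose text file contains "total" in any letter case.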


Last but not least, somebody actively maintains gscan2pdf
(http://gscan2pdf.sourceforge.net/), a GUI front-end written in Perl that
makes scanning to PDF simple and easy. I've installed and tried it, but I am
extremely biased toward command line utilities over troublesome clicky
front-ends.
--
Roger
http://rogerx.freeshell.org/