Discussion:
Scan quality enhancements/processing (vs Windows with Fujitsu ScanSnap S1500)
Matt Garman
2017-07-03 20:02:59 UTC
TL;DR: for those of you who have migrated your document scanning
workflow from Windows to a Sane platform (e.g. Linux), what
settings/tools have you found to maintain or improve quality of
scanned documents (relative to Windows)?

Long version:

I have a Fujitsu ScanSnap S1500 document scanner, and sane-1.0.27
running on Arch Linux. So far it seems to just work. I've had this
scanner for nearly a decade, and used it exclusively under Windows
until now (trying to move to a pure Linux desktop).

So while I can scan documents just fine, the results to me aren't as
good as what I get under Windows using the proprietary ScanSnap
software. Specifically, they are too light/too dark, text not crisp
enough, straight lines not straight, colors a bit off, etc.

One example: de-skewing. All the years I've had this scanner, I
didn't even realize this was a thing until now. I can use scanimage's
software de-skewing (--swdeskew=yes), and it seems to *mostly* work,
but pages are often still somewhat skewed. Except for
wacky/unusual documents, I don't recall ever seeing any skew under
Windows.

Despeckling (--swdespeck=n) does seem to be a major step in the right direction.

I'm also playing with all the enhancement options. E.g.,
--brightness, --contrast, --emphasis, etc. Brightness and contrast
are fairly intuitive, but I don't really understand what the other
options actually mean, or what I should expect from them. I've been
taking the trial-and-error approach, but e.g. --variance doesn't seem
to do anything. And I'm not sure how the options interact with each
other, so trial-and-error could take forever.

Having said all that, my one test document is maybe 90% as "good" as
the same scanned on Windows. Probably good enough to live with, but:

(1) This seems to be a very popular scanner - has anyone been able to
back out the settings the proprietary ScanSnap software uses?
- and -
(2) I wonder if the Windows ScanSnap software settings are static or
dynamic? E.g., is there some kind of pre-processing algorithm applied
to guess at the best enhancement settings?

As for 3rd party tools, I've read about unpaper, and played a bit with
scantailor, but (outside of OCR), they don't seem to offer more
enhancement features than sane itself (or I'm overlooking something).

Any other rules of thumb or general pointers for getting the best out
of my scanner with Sane?

Thanks!
--
sane-devel mailing list: sane-***@lists.alioth.debian.org
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/sane-devel
Unsubscribe: Send mail with subject "unsubscribe your_password"
to sane-devel-***@lists.alioth.debian.org
Roger
2017-07-04 01:19:58 UTC
Post by Matt Garman
TL;DR: for those of you who have migrated your document scanning
workflow from Windows to a Sane platform (e.g. Linux), what
settings/tools have you found to maintain or improve quality of
scanned documents (relative to Windows)?
I have a Fujitsu ScanSnap S1500 document scanner, and sane-1.0.27
running on Arch Linux. So far it seems to just work. I've had this
scanner for nearly a decade, and used it exclusively under Windows
until now (trying to move to a pure Linux desktop).
So while I can scan documents just fine, the results to me aren't as
good as what I get under Windows using the proprietary ScanSnap
software. Specifically, they are too light/too dark, text not crisp
enough, straight lines not straight, colors a bit off, etc.
In my experience, the Windows or proprietary solutions usually utilize
significant software processing after the original data is acquired from the
scan.

Most experienced users (e.g. photographers) tend to want vanilla scanned
results or data, either for realistic/exact results or for legal reasons. I
personally cannot stand to see a scan of negative media processed
with overly saturated colors, rather than the original colors of the
negative! I prefer an original vanilla scanned image for archiving
purposes. I then, if needed, augment the image later.

Most people in the Linux/Unix world prefer one tool for one specific task.
Sane/XSane gets the data into the computer. Other, separate tools are used
later for improving an image.
Post by Matt Garman
One example: de-skewing. All the years I've had this scanner, I
didn't even realize this was a thing until now. I can use scanimage's
software de-skewing (--swdeskew=yes), and it seems to *mostly* work,
but pages are often still somewhat skewed. Excepting for
wacky/unusual documents, I don't recall ever seeing any skew under
Windows.
Despeckling (--swdespeck=n) does seem to be a major step in the right direction.
I'm also playing with all the enhancement options. E.g.,
--brightness, --contrast, --emphasis, etc. Brightness and contrast
are fairly intuitive, but I don't really understand what the other
options actually mean, or what I should expect from them. I've been
taking the trial-and-error approach, but e.g. --variance doesn't seem
to do anything. And I'm not sure how the options interact with each
other, so trial-and-error could take forever.
Having said all that, my one test document is maybe 90% as "good" as
The command-line sane tool is pretty basic in my brief experience. The XSane
interface seems to do a better job than some of those command-line
switches, unless somebody else wants to chime in here. It may be that XSane uses the
sane program libraries (e.g. routines/functions) a little better, and so gets
better results than the sane command-line tool.
Post by Matt Garman
(1) This seems to be a very popular scanner - has anyone been able to
back out the settings the proprietary ScanSnap software uses?
- and -
(2) I wonder if the Windows ScanSnap software settings are static or
dynamic? E.g., is there some kind of pre-processing algorithm applied
to guess at the best enhancement settings?
When scanning negative media, the proprietary Windows software for my Canon 9000f
applies post-processing that saturates the image with colors the Canon
software thinks are good-looking, versus VueScan's more realistic colors. I
believe the software straightens the images as well. Amazingly, none of this
post-processing can be turned off within the proprietary
software! Quite commonly, processing other than
scanning the target can be performed better using other tools such
as ImageMagick, GIMP, etc., because those tools are more focused,
better maintained, and updated more frequently.
Post by Matt Garman
As for 3rd party tools, I've read about unpaper, and played a bit with
scantailor, but (outside of OCR), they don't seem to offer more
enhancement features than sane itself (or I'm overlooking something).
I just heard about unpaper as well; it is likely something you will
want to use.
Post by Matt Garman
Any other rules of thumb or general pointers for getting the best out
of my scanner with Sane?
VueScan likely does everything you want as well, but I think you're doing
fine using open source. The main reason I use VueScan is for old
ancestry/genealogy negatives, and VueScan is well proven for photography uses.
I don't get much time here, and the media is very time sensitive.

Personally, I prefer command-line tools, as it's far easier to pipe tasks to
other utilities. For document processing, you'll probably have a far easier
time creating a script.
--
Roger
http://rogerx.freeshell.org/
Matt Garman
2017-07-05 17:50:44 UTC
Hi Roger, thank you for your thoughts! More comments below...
Post by Roger
In my experience, the Windows or proprietary solutions usually utilize
significant software processing after the original data is acquired from the
scan.
I had a hunch that was/is the case.
Post by Roger
Most experienced users (e.g. photographers) tend to want vanilla scanned
results or data, either for realistic/exact results or for legal reasons. I
personally cannot stand to see a scan of negative media processed
with overly saturated colors, rather than the original colors of the
negative! I prefer an original vanilla scanned image for archiving
purposes. I then, if needed, augment the image later.
Most people in the Linux/Unix world prefer one tool for one specific task.
Sane/XSane gets the data into the computer. Other, separate tools are used
later for improving an image.
This now seems obvious, but wasn't until your comment. I scan all
these documents almost entirely for archival purposes. 99% of the
time, the documents are scanned and never re-visited. Sometimes I
might take a quick look at the occasional document (e.g., when was
that invoice dated?), and as long as it's readable, good enough. And
I almost never need something that is perfectly cleaned or
print-quality, though it does come up.

So eventually, I will need to figure out how to get a document looking
its best, but for now I can simply focus on getting good-quality
scans and saving them in a losslessly compressed format.
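
A minimal sketch of the "save them losslessly" step, assuming the scans were written as PNM files and that ImageMagick is installed. The function name archive_pnm_as_png and the CONVERT variable are hypothetical names introduced for illustration. PNG is lossless, so the pixel data should survive the conversion unchanged; the original PNM files are left in place.

```shell
#!/bin/sh
# Sketch: convert every PNM file in a directory to PNG (lossless).
# CONVERT can be overridden, e.g. CONVERT=magick on ImageMagick 7,
# or CONVERT=echo for a dry run that only prints the arguments.
CONVERT=${CONVERT:-convert}

archive_pnm_as_png() {
    # $1: directory holding the scans (defaults to the current directory)
    dir=${1:-.}
    for src in "$dir"/*.pnm; do
        # Skip the unexpanded pattern when no PNM files exist.
        [ -e "$src" ] || continue
        dst="${src%.pnm}.png"
        "$CONVERT" "$src" "$dst" || return 1
    done
}
```

Usage would be something like `archive_pnm_as_png ~/scans/2017-07`, where the directory is only an example path.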
Post by Roger
The command-line sane tool is pretty basic in my brief experience. The XSane
interface seems to do a better job than some of those command-line
switches, unless somebody else wants to chime in here. It may be that XSane uses the
sane program libraries (e.g. routines/functions) a little better, and so gets
better results than the sane command-line tool.
Anyone have any comment on this? I assumed the CLI and GUI versions
did exactly the same thing, but perhaps that's a bad assumption.
Post by Roger
VueScan likely does all you're probably wanting as well, but think you're doing
fine using open source. The main reason I use VueScan, is for old
ancestry/genealogy negatives and VueScan is well proven for photography uses.
I don't get much time here, and the media is very time sensitive.
I might give that a try, just to see. Looks like they have a free trial.
Post by Roger
Personally, I prefer command line tools; as it's far easier to pipe tasks to
other utilities. For document processing, you'll probably have a far easier
time creating a script.
Agreed.

At any rate, now that I'm changing my focus to just get a good scan
and worry about post-processing later, I'm finding getting a good
initial scan isn't so easy.

In particular, now I see that most of my scans have the top few mm cut
off. I set the --page-width and --page-height arguments to be padded
with an extra inch. So in my scans I have about an inch of "pure
white", but the top is still cut off a few mm.
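
The padding arithmetic can be sketched as below. This assumes the backend expects --page-width and --page-height in millimetres, which is worth confirming against the units and allowed ranges printed by "scanimage -h" for the device; a padded value may fall outside that range. The helper name in_to_mm is hypothetical.

```shell
#!/bin/sh
# Convert a paper dimension in inches, plus optional padding in inches,
# to millimetres (1 in = 25.4 mm).
in_to_mm() {
    # $1: size in inches, $2: padding in inches (defaults to 0)
    awk -v size="$1" -v pad="${2:-0}" \
        'BEGIN { printf "%.1f\n", (size + pad) * 25.4 }'
}

# US Letter (8.5 x 11 in) with one extra inch on each dimension.
width_mm=$(in_to_mm 8.5 1)
height_mm=$(in_to_mm 11 1)

# Print the resulting options instead of starting a scan.
echo "--page-width=$width_mm --page-height=$height_mm"
```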

I haven't yet messed with the scan area params (t, l, x, y), because
that leads me to a more general question: is it necessary to always
specify the paper size and scan area for every single document? While
I agree most processing can be a separate task, I feel like the
hardware and/or software ought to be able to auto-detect the document
size and scan area.

Thanks again!
Roger
2017-07-07 00:21:15 UTC
Post by Matt Garman
is it necessary to always specify the paper size and scan area for every single document?
I'm not sure about this question, but for semi-automatic cropping you'll likely need to
do something similar to the following, which I (obviously) haven't yet mastered:

# Acquire scanned image
$ scanimage --format=tiff --resolution=300 > /tmp/receipt.tif

# Two attempts at cropping with ImageMagick convert; the fuzz values are guesses.
# The input file comes first, and -fuzz must precede -trim to affect it.
$ convert /tmp/receipt.tif -fuzz 55% -trim +repage /tmp/receipt-cropped.tif

$ convert -verbose /tmp/receipt.tif -border 10x10 -fuzz 75% -trim +repage /tmp/receipt-cropped.tif

# OCR the cropped image; tesseract takes an output base name plus the "pdf" config
$ tesseract /tmp/receipt-cropped.tif /tmp/receipt pdf

# Search multiple PDF files for TEXT (the file name is passed as $1, not spliced into the script)
$ find /tmp -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "SOME_TEXT_HERE"' sh {} \;


As you can see, I was doing something similar to what you're doing, but
then had to move on to other tasks due to a lack of time.

Had I completed the above, I think the steps would run
far more quickly than even the best of today's
GUI tools, simply because of the load time of the GUI.
--
Roger
http://rogerx.freeshell.org/
m. allan noah
2017-07-07 17:06:05 UTC
Matt-

I wrote both the sane-fujitsu backend, and the sanei_magic library
that it uses to provide the deskew/crop, etc functions. I'll try to
answer your questions as best I can. BTW, sorry I did not reply
sooner- I have been traveling, and wanted to write something a bit
longer than I could stand on my cell phone :)
Post by Matt Garman
TL;DR: for those of you who have migrated your document scanning
workflow from Windows to a Sane platform (e.g. Linux), what
settings/tools have you found to maintain or improve quality of
scanned documents (relative to Windows)?
I have a Fujitsu ScanSnap S1500 document scanner, and sane-1.0.27
running on Arch Linux. So far it seems to just work. I've had this
scanner for nearly a decade, and used it exclusively under Windows
until now (trying to move to a pure Linux desktop).
So while I can scan documents just fine, the results to me aren't as
good as what I get under Windows using the proprietary ScanSnap
software. Specifically, they are too light/too dark, text not crisp
enough, straight lines not straight, colors a bit off, etc.
One example: de-skewing. All the years I've had this scanner, I
didn't even realize this was a thing until now. I can use scanimage's
software de-skewing (--swdeskew=yes), and it seems to *mostly* work,
but pages are often still somewhat skewed. Excepting for
wacky/unusual documents, I don't recall ever seeing any skew under
Windows.
That is not scanimage providing the deskew, but the fujitsu backend
(driver) itself, using the sanei_magic library, and your host system's
CPU. The deskewing algorithm uses a simplified Hough transform, which
attempts to detect the edges of the paper instead of the print on it.
This code works better if you expand the dimensions of the scan to
grab some of the background. If you attempt to crop the image too
small using the --page-width and --page-height or -x/-y params, it will
likely fail to deskew, or pick the wrong feature to align, which will
make things worse.

In general, if you are using swdeskew, it is probably better to scan
at full width, and use the swcrop option too. Also, some fujitsu
machines support the overscan option, which will cause the scanner to
output some extra background rows before the paper is ingested. This
can significantly improve the swdeskew performance. The S1500 does not
have a black background option, but the larger scanners do, and this
will also help.
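
The advice above could be sketched roughly as follows. The function name scan_full_width and the SCANIMAGE variable are hypothetical. The sketch assumes that leaving --page-width, --page-height and the -l/-t/-x/-y area options at their defaults gives a full-width scan, and that the source and mode names match what "scanimage -h" lists for the device. The overscan option is left out because not every model supports it.

```shell
#!/bin/sh
# Sketch: batch-scan with the default (assumed full-width) scan area and
# let the backend's software deskew and crop run on the host CPU.
# SCANIMAGE can be overridden, e.g. SCANIMAGE=echo for a dry run.
SCANIMAGE=${SCANIMAGE:-scanimage}

scan_full_width() {
    # $1: output file name prefix
    # $2: swdeskew setting, yes or no (defaults to yes)
    prefix=$1
    deskew=${2:-yes}
    "$SCANIMAGE" \
        --source "ADF Duplex" \
        --mode Color \
        --resolution 300 \
        --swdeskew="$deskew" \
        --swcrop=yes \
        --format=pnm \
        --batch="${prefix}-%d.pnm"
}
```

Running `scan_full_width deskewed` and then, after re-feeding the same page, `scan_full_width plain no` would give a pair of scans with and without swdeskew for comparison.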

If you have a document that consistently reproduces poor deskewing,
even with those additional options, I'd like to see a .pnm file of the
scan with and without swdeskew enabled.
Post by Matt Garman
Despeckling (--swdespeck=n) does seem to be a major step in the right direction.
Yes- particularly if you are scanning in binary (line art) mode, small
amounts of noise can be distracting.
Post by Matt Garman
I'm also playing with all the enhancement options. E.g.,
--brightness, --contrast, --emphasis, etc. Brightness and contrast
are fairly intuitive, but I don't really understand what the other
options actually mean, or what I should expect from them. I've been
taking the trial-and-error approach, but e.g. --variance doesn't seem
to do anything. And I'm not sure how the options interact with each
other, so trial-and-error could take forever.
All the options you list here are values which we send to the
hardware. Frankly, I have little documentation about what they do, but
it is certainly possible that some of these only have effect in binary
mode, and they may not even work on the S1500. I'll see if I can track
that down, and disable them in cases where they cannot be used.
Post by Matt Garman
Having said all that, my one test document is maybe 90% as "good" as
(1) This seems to be a very popular scanner - has anyone been able to
back out the settings the proprietary ScanSnap software uses?
- and -
(2) I wonder if the Windows ScanSnap software settings are static or
dynamic? E.g., is there some kind of pre-processing algorithm applied
to guess at the best enhancement settings?
I'm willing to guess that much of what you are seeing is the windows
software making a larger, potentially higher resolution or color mode
scan, and then cleaning it up in software.

I think we will need side-by-side example images from windows and
sane, along with the list of options for each to diagnose further.

allan
--
"well, I stand up next to a mountain- and I chop it down with the edge
of my hand"
Matt Garman
2017-07-09 18:54:32 UTC
Hi Allan,

Thank you for taking the time to provide some feedback and look into this!
Post by m. allan noah
In general, if you are using swdeskew, it is probably better to scan
at full width, and use the swcrop option too. Also, some fujitsu
machines support the overscan option, which will cause the scanner to
output some extra background rows before the paper is ingested. This
can significantly improve the swdeskew performance. The S1500 does not
have a black background option, but the larger scanners do, and this
will also help.
If you have a document that consistently reproduces poor deskewing,
even with those additional options, I'd like to see a .pnm file of the
scan with and without swdeskew enabled.
Most of my documents have far too much personal info on them to
publish publicly... but we'll try to make do with some generic
documents.

Here's a link to the output of "scanimage -h", just for reference for
what default settings are. I included the PDF as scanned under
Windows, and the PNM files from a Linux scan, both with and without
the --swdeskew option enabled. (There are two files for each of the
Linux/PNM, because it was a duplex scan, i.e. front and back. Windows
scan was also duplex, but the software concatenates all scans into one
file.)

http://raw-sewage.net/images/linux_scansnap_s1500/

At any rate - this one doesn't demonstrate skew very well. But it
does sort of answer my previous question: you can see that when I used
the swdeskew option, it actually cut off part of the document. I
previously assumed the scanner was mis-judging the actual scan
area/document size, but from this simple experiment, it appears the
swdeskew can get confused and cut off parts of the document.

Under Linux, you can see I scanned at 300dpi. I have it set to "auto"
under Windows, but I'm quite sure it scanned at 300dpi or less. If I
specify anything over 300dpi (under either platform) the scanning is
noticeably slower. Not a precision measurement I know, but it's
something.
Post by m. allan noah
All the options you list here are values which we send to the
hardware. Frankly, I have little documentation about what they do, but
it is certainly possible that some of these only have effect in binary
mode, and they may not even work on the S1500. I'll see if I can track
that down, and disable them in cases where they cannot be used.
( ... )
I'm willing to guess that much of what you are seeing is the windows
software making a larger, potentially higher resolution or color mode
scan, and then cleaning it up in software.
I'm assuming similar open source tools exist for Linux; i.e.
scanimage/sane doesn't have to do all the work... I just want to get a
good, comprehensive scan out of sane. Then I can script a pipeline of
post-processing tools.
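
One possible shape for such a post-processing step, as a sketch only: it uses ImageMagick's own -deskew and -trim operators in place of the backend's swdeskew and swcrop, so results will differ from what the driver or the ScanSnap software produces. The function name clean_page and the CONVERT variable are hypothetical. The 40% deskew threshold is the value the ImageMagick documentation suggests for most images, and the 10% fuzz is a guess to tune per document.

```shell
#!/bin/sh
# Sketch: straighten a scanned page and trim the surrounding background.
# CONVERT can be overridden, e.g. CONVERT=magick on ImageMagick 7,
# or CONVERT=echo for a dry run that only prints the arguments.
CONVERT=${CONVERT:-convert}

clean_page() {
    # $1: input image (e.g. a PNM from scanimage), $2: output image
    # -fuzz is set before -trim so that near-background pixels are trimmed too;
    # +repage resets the canvas offset left behind by -trim.
    "$CONVERT" "$1" \
        -deskew 40% \
        -fuzz 10% -trim +repage \
        "$2"
}
```

For example, `clean_page page-1.pnm page-1.png` would write a deskewed, trimmed PNG next to the raw scan.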

I'll continue to look for non-personal documents to present more
examples. I created a Windows virtual machine so now I can quickly
test the two without having to turn on my old PC and switch the
keyboard/video/mouse cables over!

Thanks again, let me know if I can provide any more info.

-Matt
m. allan noah
2017-07-09 20:55:21 UTC
Post by Matt Garman
Hi Allan,
Thank you for taking the time to provide some feedback and look into this!
Post by m. allan noah
In general, if you are using swdeskew, it is probably better to scan
at full width, and use the swcrop option too. Also, some fujitsu
machines support the overscan option, which will cause the scanner to
output some extra background rows before the paper is ingested. This
can significantly improve the swdeskew performance. The S1500 does not
have a black background option, but the larger scanners do, and this
will also help.
If you have a document that consistently reproduces poor deskewing,
even with those additional options, I'd like to see a .pnm file of the
scan with and without swdeskew enabled.
Most of my documents have far too much personal info on them to
publish publicly... but we'll try to make do with some generic
documents.
Here's a link to the output of "scanimage -h", just for reference for
what default settings are. I included the PDF as scanned under
Windows, and the PNM files from a Linux scan, both with and without
the --swdeskew option enabled. (There are two files for each of the
Linux/PNM, because it was a duplex scan, i.e. front and back. Windows
scan was also duplex, but the software concatenates all scans into one
file.)
http://raw-sewage.net/images/linux_scansnap_s1500/
At any rate - this one doesn't demonstrate skew very well. But it
does sort of answer my previous question: you can see that when I used
the swdeskew option, it actually cut off part of the document. I
previously assumed the scanner was mis-judging the actual scan
area/document size, but from this simple experiment, it appears the
swdeskew can get confused and cut off parts of the document.
Well, I got over my fear of clicking on 'raw sewage' and looked at a
couple of your images. I agree that this is a bug, the swdeskew should
not crop the top of the image in this case. I wonder- is that purple
stripe near the left margin actually on the page, or does your scanner
have a problem?
Post by Matt Garman
Under Linux, you can see I scanned at 300dpi. I have it set to "auto"
under Windows, but I'm quite sure it scanned at 300dpi or less. If I
specify anything over 300dpi (under either platform) the scanning is
noticeably slower. Not a precision measurement I know, but it's
something.
I expect the sane driver to be equal or slightly faster than the
windows driver. If you find otherwise, let me know.
Post by Matt Garman
Post by m. allan noah
All the options you list here are values which we send to the
hardware. Frankly, I have little documentation about what they do, but
it is certainly possible that some of these only have effect in binary
mode, and they may not even work on the S1500. I'll see if I can track
that down, and disable them in cases where they cannot be used.
( ... )
I'm willing to guess that much of what you are seeing is the windows
software making a larger, potentially higher resolution or color mode
scan, and then cleaning it up in software.
I'm assuming similar open source tools exist for Linux; i.e.
scanimage/sane doesn't have to do all the work... I just want to get a
good, comprehensive scan out of sane. Then I can script a pipeline of
post-processing tools.
There are some tools like unpaper and leptonica which can do some of
these things. Perhaps they will be useful to you. Even still, I would
like to fix bugs in sane if I can. I'll take a crack at it using your
image.
Post by Matt Garman
I'll continue to look for non-personal documents to present more
examples. I created a Windows virtual machine so now I can quickly
test the two without having to turn on my old PC and switch the
keyboard/video/mouse cables over!
Thanks again, let me know if I can provide any more info.
-Matt
allan
--
"well, I stand up next to a mountain- and I chop it down with the edge
of my hand"
Matt Garman
2017-07-10 16:39:05 UTC
Post by m. allan noah
Well, I got over my fear of clicking on 'raw sewage' and looked at a
Ha! That's a tribute to my BBS days, many many years ago. Dating myself here!
Post by m. allan noah
couple of your images. I agree that this is a bug, the swdeskew should
not crop the top of the image in this case. I wonder- is that purple
stripe near the left margin actually on the page, or does your scanner
have a problem?
I'm pretty sure that purple stripe is something with the scanner
itself. I'm definitely overdue to clean my scanner.
Post by m. allan noah
There are some tools like unpaper and leptonica which can do some of
these things. Perhaps they will be useful to you. Even still, I would
like to fix bugs in sane if I can. I'll take a crack at it using your
image.
Thanks again! Please let me know if I can help out!