OCR: tesseract with simplescan or xsane

johnraff · 2024-10-01 11:45:39

I've needed to scan in some snippets of text and add them to documents, so I've been playing with OCR options.

My specific needs might not match yours exactly but I hope this stuff at least helps.

I wanted to scan in a snippet from a magazine page and paste it as text into a LibreOffice document.

Crop it down to the bit I wanted:

Feed that to tesseract, strip out the linebreaks so it will fill the new document width, and put it on the clipboard to paste into LibreOffice Writer:

SPACE New hope of finding life on Mars after indication of water Vast amounts of water could be trapped deep in the crust of Mars, raising fresh questions about the possibility of life on the red planet. Scientists say that more than 3bn years ago, Mars had lakes, rivers and oceans - but they disappeared as it lost its atmosphere. Now researchers say vast quantities of liquid water could be trapped within rocks about 11.5-20km below the planet’s surface. “The presence of water does not signify that there is life, but water is thought to be an important ingredient for life,” said Dr Vashan Wright, a co-author of the study from Scripps Institution of Oceanography at the University of California San Diego. The research is published in the Proceedings of the National Academy of Sciences.

I didn't edit the above text at all. There's no full-stop after "...indication of water" because that's the headline. So bold text or new paragraphs have to be added to suit the new document, but I don't think there's any way to automate that part.

Tesseract is doing a great job here IMO and it can easily be linked up with either simplescan or xsane with a couple of small bash scripts.
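
If you want to try the idea before wiring it into a scanner program, the manual steps above can be sketched roughly like this. It assumes the cropped scan has been saved as an image file (the /tmp/scan.png in the usage comment is only a placeholder) and that tesseract and xsel are installed. flatten_text and ocr_to_clipboard are names made up for this sketch.

```shell
#!/bin/bash
# Sketch only: OCR an image and put the unwrapped text on the clipboard.

# Join all lines into one paragraph and squeeze runs of spaces.
flatten_text() {
  tr '\n' ' ' | tr -s ' '
}

# $1 is the image to read; '-' makes tesseract write the text to stdout.
ocr_to_clipboard() {
  tesseract "$1" - --dpi 300 -l eng | flatten_text | xsel -ib
}

# Only runs when an image is given, e.g.:  ./ocr-clip.sh /tmp/scan.png
if [ -n "${1:-}" ]; then
  ocr_to_clipboard "$1"
fi
```

The two scripts further down do essentially this, plus logging and a notification.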

simplescan

If you're mainly going to be using this for OCR there are a few things you can do to make life easier:

1) The default file to save the scanned image to is '~/Documents/Scanned Document.pdf'. Tesseract can't read PDF files, so every time you want to read an image you have to change the filename to something.png or something.jpeg.
If you get tired of that and want to change the default file type, you first have to install dconf-editor. Open it and navigate to org > gnome > simple-scan > save-format, set "Use default value" to 0 and "Custom value" to 'image/png' or 'image/jpeg'.

2) Since I don't need to keep the images after they've been read I find it easiest to save them to /tmp so the system will delete them at shutdown. Also, there's no need to change the name once you've read an image - just use the same default name for the next one and it will be overwritten.

3) In simplescan Preferences set the "Text Resolution" to '300 dpi'. I think you can have "Image Resolution" at anything you want.

4) At the bottom of Preferences, Postprocessing:
"Enable Postprocessing" to '1'
"Script" to '/path/to/simplescan-tesseract.sh' (see below for the script)
"Script Arguments" to '--dpi 300 -l eng'
(English is the default language anyway, but you can set something else here - see 'man tesseract')
"Keep original file" to '1'
(This is unexpected as you don't need the file, but if you don't keep it the default filename will be blank next time and you'll have to write it in every time. I find it easier to leave the file in /tmp and let the system delete it later.)
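
The dconf-editor change in step 1 can probably also be made from a terminal. This is only a sketch: it assumes the dconf command-line tool is installed (on some distros it is packaged separately from dconf-editor) and that the key lives at the path given in step 1. set_save_format is just a name for this example.

```shell
#!/bin/bash
# Sketch: set simple-scan's default save format from the command line.
# The key path is the one from step 1; the function name is made up here.

set_save_format() {
  case $1 in
    image/png|image/jpeg) ;;          # the two formats suggested in step 1
    *)
      echo "not one of the image formats from step 1: $1" >&2
      return 1
      ;;
  esac
  # dconf expects a GVariant value, so the string needs its own quotes
  dconf write /org/gnome/simple-scan/save-format "'$1'"
}

# Usage (not run automatically):
#   set_save_format image/png
#   dconf read /org/gnome/simple-scan/save-format
```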

Now put the script below somewhere convenient (the same path you entered in Preferences) and make sure it's executable.

Set the scan option next to the Scan button to "Text". That switches to greyscale and uses the text dpi you set above. Scan the document, use the crop tool to select the part you want and hit the "save document to a file" button. You can probably leave the filename as it is. At that point the OCR script is called and a notification popup will appear when tesseract has finished. You can then use Ctrl+V to paste the text into some document.

Once you've got it set up it's really quick and easy.

Here's the script which links simplescan to tesseract:

#!/bin/bash
# simplescan-tesseract.sh
#
# Reference:
# https://gist.github.com/marcosrogers/fc0250a52490e92ab8293bd781231a7e
#
# Usage:
#  Set this file as the post-processing script in the simple-scan preferences. No extra arguments needed.
#  Any postprocessing script arguments entered in the preferences will be passed along to tesseract, for
#  example, add '-l eng+spa' to recognize English and Spanish text.
#
# Requirements:
# - simple-scan
# - tesseract-ocr
#
# For reference, as of version 42.5-2 the arguments from simple-scan are:
# $1    - the mime type, eg image/png
# $2    - whether or not to keep a copy of the original file
# $3    - the filename
# $4    - postprocessing script arguments entered in preferences (all in one string)

logfile=/tmp/simplescan-tesseract.log
echo "Running OCR at $(date)" >$logfile
echo "mimetype: ${1}
keep_original: ${2}
filename: ${3}
tesseract extra args: ${4}" >>$logfile

filename=$3
keep_original=$2

[[ -r $filename ]] || {
  notify-send -i scanner "OCR Error" "Cannot read $filename"
  exit 1
}

text=$( tesseract "${filename}" - ${4} 2>>$logfile ) || {
  notify-send -i scanner "OCR Failed" "See $logfile"
  exit 1
}

echo "$text" | tr '\n' ' ' | tr -s ' ' | xsel -ib    # for Ctrl-V/Paste

[[ $keep_original = 'false' ]] && rm "${filename}"

notify-send -i scanner "OCR Complete" "Text is on clipboard."
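
If nothing lands on the clipboard, it may help to run the script by hand with the same four arguments simple-scan passes (listed in the script header). A rough sketch, assuming a scan already saved as /tmp/scan.png (a made-up name) and the script at the placeholder path used in Preferences. try_ocr_script is a name invented for this example.

```shell
#!/bin/bash
# Sketch: call the post-processing script the way simple-scan would.
# $1 = path to simplescan-tesseract.sh, $2 = image to read

try_ocr_script() {
  [ -x "$1" ] || { echo "not executable: $1" >&2; return 1; }
  [ -r "$2" ] || { echo "cannot read image: $2" >&2; return 1; }
  # mime type, keep-original flag, filename, extra tesseract args (one string)
  "$1" image/png true "$2" '--dpi 300 -l eng' || return 1
  cat /tmp/simplescan-tesseract.log   # what the script logged
  xsel -ob                            # what ended up on the clipboard
}

# Usage (not run automatically):
#   try_ocr_script /path/to/simplescan-tesseract.sh /tmp/scan.png
```

Passing 'true' as the second argument keeps the image, so you can rerun the test on the same file.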

xsane

Xsane's a bit slower, but no more difficult to use, and it gives you more control over the scanning for tricky cases.

Preferences > Setup > OCR:
OCR Command: /path/to/xsane-tesseract.sh
Other options you can leave as-is.
If you want to set a different language or dpi (usually the default 300 is the best) then add the options to the "OCR command", eg:

/path/to/xsane-tesseract.sh -l jpn -d 600

(See 'xsane-tesseract.sh --help' for the available options.)
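
Before changing -l it's worth checking that the language data is actually installed. 'tesseract --list-langs' prints the installed languages; the little helper below (lang_in_list, a name made up for this sketch) just looks for one code in that output. Older tesseract versions may print the list to stderr, in which case add 2>&1 before the pipe.

```shell
#!/bin/bash
# Sketch: check whether a tesseract language is installed.
# Reads the output of 'tesseract --list-langs' on stdin.

lang_in_list() {
  # -x: the whole line must match, -F: treat the code as a plain string
  grep -qxF -- "$1"
}

# Usage (not run automatically):
#   tesseract --list-langs | lang_in_list jpn && echo "jpn is installed"
```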

Before scanning, set options in the main xsane window to:
mode: Grey
resolution: 300
and set a suitable path and filename (xsane, like simplescan, will remember what you last used).

In the preview window click "Acquire preview". It will be a bit rough looking, but you'll do a proper scan at the next step.
Drag the window edges to crop, then you can use the "zoom into selected area" button at the top.
Maybe "pick white point" will sharpen the contrast.

Now hit "Scan" on the main window (yes, all these windows are a pain) and when it's done you'll see a nicer scan of the selected zone in the "Viewer" window. Possibly run the "despeckle" filter?

Now File > "OCR - save as text" and your script should run.
You have to enter a filename to save the text to, but the script ignores it and puts the text on the clipboard as with simplescan. Ctrl-V to paste the text in your document.

Close the Viewer and you'll be warned the image is not saved. You can save it or not as you wish.

Here's the xsane script:

#!/bin/bash
#
# xsane-tesseract.sh
#
# Usage:
#  Set this file as the post-processing script in the xsane OCR preferences.
#  Options -i and -o can be left as-is.
#  Add -d <dpi> to script command line to change tesseract dpi from default 300.
#  Add -l <language> to change tesseract language from default eng.
#
# Requirements:
# - xsane
# - tesseract-ocr
#

USAGE='Usage:
Set this file as the post-processing script in the xsane OCR preferences.
Options -i and -o can be left as-is.
Add -d <dpi> to script command line to change tesseract dpi from default 300.
Add -l <language> to change tesseract language from default eng.
'

logfile=/tmp/xsane-tesseract.log
echo "Running $0 at $(date)" >$logfile

# default values
language=eng
resolution=300

exec 2>>"$logfile"

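while [[ -n $1 ]]
do
  # -i, -o, -l and -d each need a value; without this check a trailing
  # bare option makes 'shift 2' fail and the loop never ends
  if [[ $1 == -[iold] && $# -lt 2 ]]
  then
    echo "$1 needs a value" >&2
    exit 1
  fi
  case $1 in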
    -h|--help)
      echo "$USAGE"
      exit
      ;;
    -i)
      imgfile=$2
      shift 2
      ;;
    -o)
      outfile=$2
      shift 2
      ;;
    -l)
      language=$2
      shift 2
      ;;
    -d)
      resolution=$2
      shift 2
      ;;
    *)
      echo "$1 unknown option" >&2
      exit 1
      ;;
  esac
done

echo "imgfile: $imgfile
outfile: $outfile
language: $language
resolution: $resolution" >>$logfile

[[ -r $imgfile ]] || {
  notify-send -i scanner "OCR Error" "Cannot read $imgfile"
  exit 1
}

text=$( tesseract "${imgfile}" - --dpi "$resolution" -l "$language" 2>>$logfile ) || {
  notify-send -i scanner "OCR Failed" "See $logfile"
  exit 1
}

echo "$text" | tr '\n' ' ' | tr -s ' ' | xsel -ib    # for ctrl-V/Paste

rm -f "${outfile}"

notify-send -i scanner "OCR Complete" "Text is on clipboard."
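
The script above throws away the text file xsane asks for. If you would rather keep that file as well, one possible variant of the last few lines is sketched below. This is not what the script above does; deliver_text is a made-up name, and the clipboard step is skipped when xsel isn't available.

```shell
#!/bin/bash
# Sketch: flatten OCR text, save it to the file xsane named, and copy it.
# $1 = OCR text, $2 = output file (the script's $outfile)

deliver_text() {
  local flat
  flat=$(printf '%s\n' "$1" | tr '\n' ' ' | tr -s ' ')
  printf '%s\n' "$flat" > "$2" || return 1
  # copying needs xsel and an X display; failures here are ignored
  if command -v xsel >/dev/null 2>&1; then
    printf '%s' "$flat" | xsel -ib 2>/dev/null || true
  fi
}

# In xsane-tesseract.sh this would stand in for the 'echo | tr | xsel'
# line and the 'rm -f' line:
#   deliver_text "$text" "$outfile"
```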

Both simplescan and xsane worked pretty well for me. With some images xsane took longer because it's saving a bigger image file but the end results were pretty much identical.

I might keep simplescan for OCR and xsane for other scanning, but make your own choice.

Last edited by johnraff (2024-10-02 04:16:04)

Martin · 2024-10-01 18:35:28

Thanks for this information.

I have not used simplescan + tesseract but in the past I have had good luck with scan2pdf and tesseract. I found there was a sweet-spot in scan resolution when doing OCR. Too high resolution and tesseract started to struggle!

/Martin
